MySQL Bugs: #13200: data node crashes on large transaction performed by stored procedure

Bug #13200	data node crashes on large transaction performed by stored procedure
Submitted:	14 Sep 2005 19:53	Modified:	19 Dec 2005 21:57
Reporter:	Hartmut Holzgraefe
Status:	Won't fix
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0	OS:	Linux (linux, solaris)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
.

How to repeat:
.

This seems to be a "Signal lost, send buffer full" error.
Please increase SendBufferMemory; see the
reference manual:
16.4.4.7. MySQL Cluster TCP/IP Connections

[TCP]SendBufferMemory

TCP transporters use a buffer to store all messages before performing the send call to the operating system. When this buffer reaches 64KB its contents are sent; they are also sent when a round of messages has been executed. To handle temporary overload situations it is also possible to define a bigger send buffer. The default size of the send buffer is 256KB.
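
As a rough illustration of the setting described above, a config.ini fragment that raises the send buffer for all TCP transporters might look like this. It is only a sketch: the 1M value is an assumed example, not a recommendation, and the right size depends on the workload.

```ini
# config.ini on the management node (illustrative fragment)
[tcp default]
# Assumed example value; the default described above is 256KB
SendBufferMemory=1M
```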

Does this fix the problem?

Increasing the send buffer to 1MB helps with the medium-sized 'SILVER' test case, but
the big 'BRONZE' test still fails. I have been able to produce a core file this time.
Please find the backtrace below; the cluster log files, core file, and ndbd binary
(Linux x86, SuSE 9.3) are attached.

#0  0xffffe410 in ?? ()
(gdb) bt full
#0  0xffffe410 in ?? ()
No symbol table info available.
#1  0xbfffef08 in ?? ()
No symbol table info available.
#2  0x00000006 in ?? ()
No symbol table info available.
#3  0x400ddb75 in abort () from /lib/tls/libc.so.6
No symbol table info available.
#4  0x0826e273 in NdbShutdown (type=NST_ErrorHandlerStartup, restartType=NRT_Default) at Emulator.cpp:245
        restart = true
        shutting = 0x82f9ea2 "restarting"
        exitAbort = 0x82f9f6b "aborting"
        outputStream = (FILE *) 0x8557d90
#5  0x08274e1a in ErrorReporter::handleError (type=ecError, messageID=2308, problemData=0x82efb70 "Unhandled node failure during restart", 
    objRef=0xbffff0b0 "NDBCNTR (Line: 1404) 0x00000008", nst=NST_ErrorHandlerStartup) at ErrorReporter.cpp:214
No locals.
#6  0x0826634b in SimulatedBlock::progError (this=0x851d0a0, line=1404, err_code=2308, extra=0x82efb70 "Unhandled node failure during restart")
    at SimulatedBlock.cpp:736
        aBlockName = 0x82ff3f4 "NDBCNTR"
        magicStatus = 8
        buf = "NDBCNTR (Line: 1404) 0x00000008\000\001\000\000\001�\000\000X��+T&\b\000\000\000\000�A\b\001\000\000\000\000!A\b\001\000\000\0004\n"
#7  0x08203aac in Ndbcntr::execNODE_FAILREP (this=0x851d0a0, signal=0x84120c0) at NdbcntrMain.cpp:1404
        phase = 5
        tStartConf = true
        nodeId = 135015143
        nodeFail = (const NodeFailRep *) 0x8412100
        allFailed = {<BitmaskPOD<2u>> = {rep = {data = {4, 0}}, static Size = 2, static NotFound = 4294967295, 
    static TextLength = 16}, <No data fields>}
        failedStarted = {<BitmaskPOD<2u>> = {rep = {data = {4, 0}}, static Size = 2, static NotFound = 4294967295, 
    static TextLength = 16}, <No data fields>}
        failedStarting = {<BitmaskPOD<2u>> = {rep = {data = {0, 0}}, static Size = 2, static NotFound = 4294967295, 
    static TextLength = 16}, <No data fields>}
        failedWaiting = {<BitmaskPOD<2u>> = {rep = {data = {0, 0}}, static Size = 2, static NotFound = 4294967295, 
    static TextLength = 16}, <No data fields>}
        tMasterFailed = true
        tStarted = true
        tStarting = false
        tWaiting = false
        st = (const NodeState &) @0x851e700: {static DataLength = 10, startLevel = 2, nodeGroup = 4092851187, {dynamicId = 2, masterNodeId = 2}, {
    starting = {startPhase = 5, restartType = 2}, stopping = {systemShutdown = 5, timeout = 2, alarmTime = 0}}, singleUserMode = 0, 
  singleUserApi = 4294967295, m_connected_nodes = {rep = {data = {0, 0}}, static Size = 2, static NotFound = 4294967295, static TextLength = 16}}
        rep = (NodeFailRep *) 0x5
        nodeId = 135015143
#8  0x080f0e16 in SimulatedBlock::executeFunction (this=0x851d0a0, gsn=26, signal=0x84120c0) at SimulatedBlock.hpp:557
        f = {__pfn = 0x8203808 <Ndbcntr::execNODE_FAILREP(Signal*)>, __delta = 0}
        errorMsg = '\0' <repeats 11 times>, "\001", '\0' <repeats 60 times>, "�000\000\000�\017\000\000\000\000\000\001\a", '\0' <repeats 15 times>, "\005\000\000\000\024!A\b�Q#@\002\000\000\000\005\000\000\000]b\004"
#9  0x0826a635 in FastScheduler::doJob (this=0x841b9a0) at FastScheduler.cpp:138
        tJobCounter = 3801
        tJobLap = 265945
        b = (class SimulatedBlock *) 0x851d0a0
        gsnbnr = 0
        reg_bnr = 251
        reg_gsn = 26
        loopCount = 8
        TminLoops = 32
        TloopMax = 64
        signal = (Signal *) 0x84120c0
        tHighPrio = 1
#10 0x0826c1b1 in ThreadConfig::ipControlLoop (this=0x84267b8) at ThreadConfig.cpp:175
        timeOutMillis = 10
        i = 12
#11 0x080c09a6 in main (argc=1, argv=0xbffff464) at main.cpp:244
        theConfig = (Configuration *) 0x8426738
        status = 0
        buf = 0x8557ce8 "/usr/local/mysql-5.0/cluster/ndb_3_signal.log"
        tmp_aptr = {m_obj = 0x8557ce8 "/usr/local/mysql-5.0/cluster/ndb_3_signal.log"}
        signalLog = (FILE *) 0x8557d90

There were actually 3 core files, not just one, from this test run ("StopOnError=False").
I've uploaded the cores and logs to our FTP server as bug13200-logs.tar.bz2.

backtrace from 2nd core file

Attachment: bt.14355 (application/octet-stream, text), 7.90 KiB.

backtrace from 3rd core file

Attachment: bt.14386 (application/octet-stream, text), 3.87 KiB.

By running small transactions inside the SP, the app can be made to work
at a negligible increase in execution cost:

CREATE PROCEDURE AddIpAddrs(IN i_start INT UNSIGNED,
                            IN i_end INT UNSIGNED,
                            IN i_ippoolidx SMALLINT UNSIGNED)
    #WET? DETERMINISTIC
    MODIFIES SQL DATA
    IF ISNULL(i_start) OR ISNULL(i_end) OR i_start>i_end OR ISNULL(i_ippoolidx) THEN
        CALL AbortTransaction();
    ELSE
        BEGIN
            DECLARE a_addr INT UNSIGNED DEFAULT i_start;

            # Check to see if new range overlaps/collides with an existing range.
            SELECT COUNT(*) INTO @count
              FROM IpRanges
              WHERE (i_start BETWEEN Start   AND End  )
                 OR (i_end   BETWEEN Start   AND End  )
                 OR (Start   BETWEEN i_start AND i_end)
                 OR (End     BETWEEN i_start AND i_end);
            IF @count > 0 THEN
                CALL AbortTransaction();
            END IF;

            # Add the addrs, one by one.
            WHILE a_addr <= i_end DO
                # "Assume" a_addr is already present (if so, it must be a zombie).
                START TRANSACTION;
                UPDATE IpAddrs
                  SET IpPoolIdx = i_ippoolidx,
                      LastChange = NOW()
                  WHERE IpAddr = a_addr;  # AND Idx = 0  -- unnecessary
                IF ROW_COUNT() > 0 THEN  # if a_addr was present, update zombie info
                    UPDATE IpPools
                      SET AddrsConfigured = AddrsConfigured - 1
                      WHERE Idx = 0;
                ELSE  # if a_addr not present, add it now
                    INSERT INTO IpAddrs
                      SET IpAddr = a_addr,
                          IpPoolIdx = i_ippoolidx;
                END IF;
                # In either case, update i_ippoolidx info.
                UPDATE IpPools
                  SET AddrsConfigured = AddrsConfigured + 1
                  WHERE Idx = i_ippoolidx;
                COMMIT;
                # Continue while-loop.
                SET a_addr = a_addr + 1;
            END WHILE;
        END;
    END IF;

Then remove the
START TRANSACTION; ... COMMIT;
around the call in AddRange.sh.

You could even batch a little by adding a modulo check on a
counter around START TRANSACTION; and COMMIT;, so that roughly
100 addresses are updated per transaction.

Like this:
CREATE PROCEDURE AddIpAddrs(IN i_start INT UNSIGNED,
                            IN i_end INT UNSIGNED,
                            IN i_ippoolidx SMALLINT UNSIGNED)
    #WET? DETERMINISTIC
    MODIFIES SQL DATA
    IF ISNULL(i_start) OR ISNULL(i_end) OR i_start>i_end OR ISNULL(i_ippoolidx) THEN
        CALL AbortTransaction();
    ELSE
        BEGIN
            DECLARE a_addr INT UNSIGNED DEFAULT i_start;
            DECLARE loop_count INT UNSIGNED DEFAULT 0;
            DECLARE loop_mod INT UNSIGNED;
            # Check to see if new range overlaps/collides with an existing range.
            SELECT COUNT(*) INTO @count
              FROM IpRanges
              WHERE (i_start BETWEEN Start   AND End  )
                 OR (i_end   BETWEEN Start   AND End  )
                 OR (Start   BETWEEN i_start AND i_end)
                 OR (End     BETWEEN i_start AND i_end);
            IF @count > 0 THEN
                CALL AbortTransaction();
            END IF;

            # Add the addrs, one by one.
            WHILE a_addr <= i_end DO
                # "Assume" a_addr is already present (if so, it must be a zombie).		  
                SET loop_count = loop_count + 1;
                SET loop_mod = loop_count % 100;
                # Batch in groups of 100: open a transaction on the first row of each batch
                IF loop_mod = 1 THEN
                    START TRANSACTION;
                END IF;
                UPDATE IpAddrs
                  SET IpPoolIdx = i_ippoolidx,
                      LastChange = NOW()
                  WHERE IpAddr = a_addr;  # AND Idx = 0  -- unnecessary
                IF ROW_COUNT() > 0 THEN  # if a_addr was present, update zombie info
                    UPDATE IpPools
                      SET AddrsConfigured = AddrsConfigured - 1
                      WHERE Idx = 0;
                ELSE  # if a_addr not present, add it now
                    INSERT INTO IpAddrs
                      SET IpAddr = a_addr,
                          IpPoolIdx = i_ippoolidx;
                END IF;
                # In either case, update i_ippoolidx info.
                UPDATE IpPools
                  SET AddrsConfigured = AddrsConfigured + 1
                  WHERE Idx = i_ippoolidx;
                IF loop_mod = 0 THEN
                    COMMIT;
                END IF;
                # Continue while-loop.
                SET a_addr = a_addr + 1;
            END WHILE;
            # Commit any remaining batch
            COMMIT;
        END;
    END IF;

The only crashes discovered were the WatchDog terminating one node and
the Arbitrator shutting down the other.
Handling of very large transactions in Cluster will be improved later, but
currently no real bug was discovered. The workaround is, of course, not to run
very large transactions.
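
One way to keep transactions small from the client side is to split a large address range into sub-ranges and submit each one as its own short transaction. The Python sketch below shows only the splitting step. The helper name split_range and the batch size of 100 are assumptions made for illustration, and how each sub-range is then sent to the server is deliberately left out.

```python
from typing import Iterator, Tuple


def split_range(start: int, end: int, batch_size: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (lo, hi) sub-ranges that cover start..end.

    Each sub-range holds at most batch_size addresses. The function name
    and the default batch size are hypothetical, chosen for this sketch.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    lo = start
    while lo <= end:
        # Clamp the upper bound so the last batch stops exactly at end.
        hi = min(lo + batch_size - 1, end)
        yield lo, hi
        lo = hi + 1


if __name__ == "__main__":
    # Each printed pair would become one small transaction.
    for lo, hi in split_range(1, 250, 100):
        print(lo, hi)
```

Note the trade-off: once the work is split this way, the range is no longer added atomically, so a failure partway through leaves the earlier batches committed.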

Changed to "to be fixed later".

This bug is not scheduled to be fixed at this time.