Bug #13200 | data node crashes on large transaction performed by stored procedure | ||
---|---|---|---|
Submitted: | 14 Sep 2005 19:53 | Modified: | 19 Dec 2005 21:57 |
Reporter: | Hartmut Holzgraefe | Email Updates: | |
Status: | Won't fix | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 5.0 | OS: | Linux (linux, solaris) |
Assigned to: | Assigned Account | CPU Architecture: | Any |
[14 Sep 2005 19:53]
Hartmut Holzgraefe
[15 Sep 2005 8:47]
Martin Skold
Seems to be "Signal lost, send buffer full". Please increase SendBufferMemory, see Reference manual: 16.4.4.7. MySQL Cluster TCP/IP Connections
[15 Sep 2005 8:50]
Martin Skold
[TCP]SendBufferMemory: TCP transporters use a buffer to store all messages before performing the send call to the operating system. When this buffer reaches 64KB its contents are sent; they are also sent when a round of messages has been executed. To handle temporary overload situations it is also possible to define a bigger send buffer. The default size of the send buffer is 256KB.
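As a rough illustration of the suggested change, the sketch below renders a config.ini fragment that raises SendBufferMemory for all TCP transporters. It assumes the setting goes in a `[tcp default]` section and that a size suffix such as `M` is accepted; the helper name `tcp_send_buffer_fragment` and the `1M` value are made up for this example and are not a recommendation.

```python
import configparser
import io


def tcp_send_buffer_fragment(size: str = "1M") -> str:
    """Render a config.ini fragment that sets SendBufferMemory (hypothetical helper)."""
    cfg = configparser.ConfigParser()
    # Keep the parameter name's capitalisation; configparser lowercases keys by default.
    cfg.optionxform = str
    # Assumption: defaults for all TCP transporters live in a "[tcp default]" section.
    cfg["tcp default"] = {"SendBufferMemory": size}
    out = io.StringIO()
    cfg.write(out, space_around_delimiters=False)
    return out.getvalue()


print(tcp_send_buffer_fragment("1M"))
```

The printed fragment would be merged into the cluster's existing config.ini by hand; the data nodes need a restart before a new buffer size takes effect.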
[15 Sep 2005 8:55]
Martin Skold
Does this fix the problem?
[15 Sep 2005 13:07]
Hartmut Holzgraefe
Increasing the send buffer to 1MB helps with the medium sized 'SILVER' test case, the big BRONZE test still fails. I have been able to produce a core file this time. Please find the backtrace below; cluster log files, core file and ndbd binary (linux x86, suse 9.3) are attached.

#0 0xffffe410 in ?? ()
(gdb) bt full
#0 0xffffe410 in ?? ()
    No symbol table info available.
#1 0xbfffef08 in ?? ()
    No symbol table info available.
#2 0x00000006 in ?? ()
    No symbol table info available.
#3 0x400ddb75 in abort () from /lib/tls/libc.so.6
    No symbol table info available.
#4 0x0826e273 in NdbShutdown (type=NST_ErrorHandlerStartup, restartType=NRT_Default) at Emulator.cpp:245
    restart = true
    shutting = 0x82f9ea2 "restarting"
    exitAbort = 0x82f9f6b "aborting"
    outputStream = (FILE *) 0x8557d90
#5 0x08274e1a in ErrorReporter::handleError (type=ecError, messageID=2308, problemData=0x82efb70 "Unhandled node failure during restart", objRef=0xbffff0b0 "NDBCNTR (Line: 1404) 0x00000008", nst=NST_ErrorHandlerStartup) at ErrorReporter.cpp:214
    No locals.
#6 0x0826634b in SimulatedBlock::progError (this=0x851d0a0, line=1404, err_code=2308, extra=0x82efb70 "Unhandled node failure during restart") at SimulatedBlock.cpp:736
    aBlockName = 0x82ff3f4 "NDBCNTR"
    magicStatus = 8
    buf = "NDBCNTR (Line: 1404) 0x00000008\000\001\000\000\001�\000\000X��+T&\b\000\000\000\000�A\b\001\000\000\000\000!A\b\001\000\000\0004\n"
#7 0x08203aac in Ndbcntr::execNODE_FAILREP (this=0x851d0a0, signal=0x84120c0) at NdbcntrMain.cpp:1404
    phase = 5
    tStartConf = true
    nodeId = 135015143
    nodeFail = (const NodeFailRep *) 0x8412100
    allFailed = {<BitmaskPOD<2u>> = {rep = {data = {4, 0}}, static Size = 2, static NotFound = 4294967295, static TextLength = 16}, <No data fields>}
    failedStarted = {<BitmaskPOD<2u>> = {rep = {data = {4, 0}}, static Size = 2, static NotFound = 4294967295, static TextLength = 16}, <No data fields>}
    failedStarting = {<BitmaskPOD<2u>> = {rep = {data = {0, 0}}, static Size = 2, static NotFound = 4294967295, static TextLength = 16}, <No data fields>}
    failedWaiting = {<BitmaskPOD<2u>> = {rep = {data = {0, 0}}, static Size = 2, static NotFound = 4294967295, static TextLength = 16}, <No data fields>}
    tMasterFailed = true
    tStarted = true
    tStarting = false
    tWaiting = false
    st = (const NodeState &) @0x851e700: {static DataLength = 10, startLevel = 2, nodeGroup = 4092851187, {dynamicId = 2, masterNodeId = 2}, { starting = {startPhase = 5, restartType = 2}, stopping = {systemShutdown = 5, timeout = 2, alarmTime = 0}}, singleUserMode = 0, singleUserApi = 4294967295, m_connected_nodes = {rep = {data = {0, 0}}, static Size = 2, static NotFound = 4294967295, static TextLength = 16}}
    rep = (NodeFailRep *) 0x5
    nodeId = 135015143
#8 0x080f0e16 in SimulatedBlock::executeFunction (this=0x851d0a0, gsn=26, signal=0x84120c0) at SimulatedBlock.hpp:557
    f = {__pfn = 0x8203808 <Ndbcntr::execNODE_FAILREP(Signal*)>, __delta = 0}
    errorMsg = '\0' <repeats 11 times>, "\001", '\0' <repeats 60 times>, "�000\000\000�\017\000\000\000\000\000\001\a", '\0' <repeats 15 times>, "\005\000\000\000\024!A\b�Q#@\002\000\000\000\005\000\000\000]b\004"
#9 0x0826a635 in FastScheduler::doJob (this=0x841b9a0) at FastScheduler.cpp:138
    tJobCounter = 3801
    tJobLap = 265945
    b = (class SimulatedBlock *) 0x851d0a0
    gsnbnr = 0
    reg_bnr = 251
    reg_gsn = 26
    loopCount = 8
    TminLoops = 32
    TloopMax = 64
    signal = (Signal *) 0x84120c0
    tHighPrio = 1
#10 0x0826c1b1 in ThreadConfig::ipControlLoop (this=0x84267b8) at ThreadConfig.cpp:175
    timeOutMillis = 10
    i = 12
#11 0x080c09a6 in main (argc=1, argv=0xbffff464) at main.cpp:244
    theConfig = (Configuration *) 0x8426738
    status = 0
    buf = 0x8557ce8 "/usr/local/mysql-5.0/cluster/ndb_3_signal.log"
    tmp_aptr = {m_obj = 0x8557ce8 "/usr/local/mysql-5.0/cluster/ndb_3_signal.log"}
    signalLog = (FILE *) 0x8557d90
[15 Sep 2005 13:20]
Hartmut Holzgraefe
There were actually 3 core files, not just one, from this test run ("StopOnError=False"). I've uploaded the cores and logs to our FTP server now as bug13200-logs.tar.bz2
[15 Sep 2005 13:25]
Hartmut Holzgraefe
backtrace from 2nd core file
Attachment: bt.14355 (application/octet-stream, text), 7.90 KiB.
[15 Sep 2005 13:26]
Hartmut Holzgraefe
backtrace from 3rd core file
Attachment: bt.14386 (application/octet-stream, text), 3.87 KiB.
[17 Oct 2005 13:41]
Martin Skold
By running small transactions inside the SP the app can be made to work at a negligible increase in execution cost:

CREATE PROCEDURE AddIpAddrs(IN i_start INT UNSIGNED, IN i_end INT UNSIGNED, IN i_ippoolidx SMALLINT UNSIGNED)
#WET? DETERMINISTIC
MODIFIES SQL DATA
IF ISNULL(i_start) OR ISNULL(i_end) OR i_start>i_end OR ISNULL(i_ippoolidx) THEN
    CALL AbortTransaction();
ELSE
    BEGIN
        DECLARE a_addr INT UNSIGNED DEFAULT i_start;
        # Check to see if new range overlaps/collides with an existing range.
        SELECT COUNT(*) INTO @count FROM IpRanges
        WHERE (i_start BETWEEN Start AND End)
           OR (i_end BETWEEN Start AND End)
           OR (Start BETWEEN i_start AND i_end)
           OR (End BETWEEN i_start AND i_end);
        IF @count > 0 THEN
            CALL AbortTransaction();
        END IF;
        # Add the addrs, one by one.
        WHILE a_addr <= i_end DO
            # "Assume" a_addr is already present (if so, it must be a zombie).
            START TRANSACTION;
            UPDATE IpAddrs SET IpPoolIdx = i_ippoolidx, LastChange = NOW()
            WHERE IpAddr = a_addr; # AND Idx = 0 -- unnecessary
            IF ROW_COUNT() > 0 THEN
                # if a_addr was present, update zombie info
                UPDATE IpPools SET AddrsConfigured = AddrsConfigured - 1 WHERE Idx = 0;
            ELSE
                # if a_addr not present, add it now
                INSERT INTO IpAddrs SET IpAddr = a_addr, IpPoolIdx = i_ippoolidx;
            END IF;
            # In either case, update i_ippoolidx info.
            UPDATE IpPools SET AddrsConfigured = AddrsConfigured + 1 WHERE Idx = i_ippoolidx;
            COMMIT;
            # Continue while-loop.
            SET a_addr = a_addr + 1;
        END WHILE;
    END;
END IF;

and then removing the START TRANSACTION; ... COMMIT; around the call in AddRange.sh
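As a side note on the overlap check in the procedure above: for well-formed ranges (start <= end on both sides), the four BETWEEN tests appear to be equivalent to a single two-comparison test. The sketch below checks that by brute force over small integer ranges. The function names are made up for this illustration and nothing here is taken from the test case itself.

```python
def overlaps_four_way(i_start, i_end, start, end):
    """Mirror of the four BETWEEN tests used in the procedure's range check."""
    return (start <= i_start <= end
            or start <= i_end <= end
            or i_start <= start <= i_end
            or i_start <= end <= i_end)


def overlaps_two_way(i_start, i_end, start, end):
    """Shorter test: two closed ranges overlap iff each one starts before the other ends."""
    return start <= i_end and end >= i_start


def checks_agree(limit=8):
    """Compare both tests on every pair of well-formed ranges within [0, limit)."""
    ranges = [(a, b) for a in range(limit) for b in range(a, limit)]
    return all(
        overlaps_four_way(s, e, a, b) == overlaps_two_way(s, e, a, b)
        for (s, e) in ranges
        for (a, b) in ranges
    )


print(checks_agree())
```

In SQL terms the shorter form would read `Start <= i_end AND End >= i_start`; whether it is any faster on NDB has not been measured here.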
[17 Oct 2005 14:05]
Martin Skold
Could even batch a little bit by adding a modulo check around a counter before START TRANSACTION;/COMMIT; to maybe update 100 at a time.
[18 Oct 2005 12:17]
Martin Skold
Like this:

CREATE PROCEDURE AddIpAddrs(IN i_start INT UNSIGNED, IN i_end INT UNSIGNED, IN i_ippoolidx SMALLINT UNSIGNED)
#WET? DETERMINISTIC
MODIFIES SQL DATA
IF ISNULL(i_start) OR ISNULL(i_end) OR i_start>i_end OR ISNULL(i_ippoolidx) THEN
    CALL AbortTransaction();
ELSE
    BEGIN
        DECLARE a_addr INT UNSIGNED DEFAULT i_start;
        DECLARE loop_count INT UNSIGNED DEFAULT 0;
        DECLARE loop_mod INT UNSIGNED;
        # Check to see if new range overlaps/collides with an existing range.
        SELECT COUNT(*) INTO @count FROM IpRanges
        WHERE (i_start BETWEEN Start AND End)
           OR (i_end BETWEEN Start AND End)
           OR (Start BETWEEN i_start AND i_end)
           OR (End BETWEEN i_start AND i_end);
        IF @count > 0 THEN
            CALL AbortTransaction();
        END IF;
        # Add the addrs, one by one.
        WHILE a_addr <= i_end DO
            # "Assume" a_addr is already present (if so, it must be a zombie).
            SET loop_count = loop_count + 1;
            SET loop_mod = loop_count % 100;
            # Batch in groups of 100: open the transaction on the first row of each group
            IF loop_mod = 1 THEN
                START TRANSACTION;
            END IF;
            UPDATE IpAddrs SET IpPoolIdx = i_ippoolidx, LastChange = NOW()
            WHERE IpAddr = a_addr; # AND Idx = 0 -- unnecessary
            IF ROW_COUNT() > 0 THEN
                # if a_addr was present, update zombie info
                UPDATE IpPools SET AddrsConfigured = AddrsConfigured - 1 WHERE Idx = 0;
            ELSE
                # if a_addr not present, add it now
                INSERT INTO IpAddrs SET IpAddr = a_addr, IpPoolIdx = i_ippoolidx;
            END IF;
            # In either case, update i_ippoolidx info.
            UPDATE IpPools SET AddrsConfigured = AddrsConfigured + 1 WHERE Idx = i_ippoolidx;
            # Close the transaction on the last row of each group
            IF loop_mod = 0 THEN
                COMMIT;
            END IF;
            # Continue while-loop.
            SET a_addr = a_addr + 1;
        END WHILE;
        # Commit any remaining batch
        COMMIT;
    END;
END IF;
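The same batching pattern (commit once per group of rows rather than once per row or once for the whole range) can be tried out client-side without a cluster. The sketch below uses an in-memory SQLite database purely so that it is self-contained. The simplified table, the `add_addrs` helper and its `batch_size` parameter are assumptions for this illustration, and SQLite's transaction handling differs from NDB's.

```python
import sqlite3


def add_addrs(conn, first, last, pool_idx, batch_size=100):
    """Insert addresses first..last, committing every batch_size rows.

    Returns the number of commits issued (hypothetical helper, not from the test case).
    """
    commits = 0
    pending = 0
    for addr in range(first, last + 1):
        # sqlite3 opens a transaction implicitly before the first INSERT of a batch.
        conn.execute(
            "INSERT INTO IpAddrs (IpAddr, IpPoolIdx) VALUES (?, ?)",
            (addr, pool_idx),
        )
        pending += 1
        if pending == batch_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:
        # Commit any remaining partial batch.
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IpAddrs (IpAddr INTEGER PRIMARY KEY, IpPoolIdx INTEGER)")
commits = add_addrs(conn, 1, 250, pool_idx=7)
rows = conn.execute("SELECT COUNT(*) FROM IpAddrs").fetchone()[0]
print(commits, rows)
```

Keeping the batch size as a parameter makes it easy to experiment with how large a transaction the data nodes tolerate before falling back to smaller groups.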
[18 Oct 2005 12:49]
Martin Skold
The only crashes discovered were the WatchDog terminating one node and the Arbitrator shutting down the other. Handling of very large transactions in cluster will be improved later, but currently no real bug was discovered. The work-around is of course not to do very large transactions.
[19 Dec 2005 21:57]
Hartmut Holzgraefe
changed to "to be fixed later"
[13 Mar 2014 13:34]
Omer Barnir
This bug is not scheduled to be fixed at this time.