Bug #14028 [Data Loss] NDB node crashes during backup if a BLOB is inserted and deleted
Submitted: 14 Oct 2005 13:51    Modified: 14 Mar 2006 8:55
Reporter: eric von              Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 4.1.15, 5.0.18         OS: Linux (SuSE 9.2)
Assigned to: Jonas Oreland      CPU Architecture: Any

[14 Oct 2005 13:51] eric von
Description:
I'm running MySQL 4.1.12-max on SuSE Linux 9.2. I have a four-node cluster: three data nodes and one management node. The MGM node is node 1, the NDB nodes are 2, 3, and 4, and there are 9 API node slots (5-13). Today the following sequence of events occurred, based on the log file on the MGM node. I will post the cluster log for the relevant time span in a followup comment.

MGM node detects NDB node 2 disconnection
MGM node arbitrates and selects NDB node 3 as new master
MGM node detects NDB node 3 disconnection
MGM node arbitrates and selects NDB node 4 as new master
NDB node 4 starts taking over (table fragments scroll by with LcpStatus 3)
NDB node 2 reconnects and starts up, with CM_president = 4, own Node = 2, our dynamic id = 4
NDB node 3 reconnects and begins the startup process
NDB node 2 loads all indexes and completes startup process.
NDB node 3 disconnects again
MGM node arbitrates and NDB node 4 wins the election again
NDB node 3 reconnects, loads indexes, and completes startup.

Now, the problems:

NDB nodes 2 and 4 complain every minute that "Failure handling of node 3 has not completed in [n] min."
API nodes are allowed to write data via insert/update to the cluster, but some writes are handled on node 3 and others on node 4.

The result is that an API node querying NDB node 4 sees different data than a node querying node 3.

Backups at this point fail to complete, although the MGM node's logs indicate they were successful. The strange part is that node 4 gets two fragments, node 3 gets one, and node 2 doesn't even run the backup (no files or directories are generated in /clusterdata/backup, at least).

This bug was originally posted on the forums at http://forums.mysql.com/read.php?25,49305,49305#msg-49305

How to repeat:
Unknown.

Suggested fix:
Cascaded failover could force a shutdown of 'failed' nodes.
[14 Oct 2005 13:56] eric von
Additional information:  The shutdown happened while the scheduled daily backup was being run from cron.
[14 Oct 2005 14:50] eric von
(changing category to cluster)
[14 Oct 2005 15:36] Jonathan Miller
Not that this would fix your issue for sure, but I notice that you are two versions behind on the software release: your version is 4.1.12, the current version is 4.1.14, and I believe that 4.1.15 is about to be released.

Upgrading to the latest version may correct some, if not all, of the issues you are seeing.

Thanks
JBM
[14 Oct 2005 18:49] eric von
As this is a production system, we only incur the downtime required for an upgrade when we feel it is necessary.  The changelogs for 4.1.13 and 4.1.14 didn't contain anything we felt was critical enough to merit this downtime, or anything that looked extremely likely to have caused this problem.

We will hopefully be upgrading to 4.1.15 when it comes out since there are quite a few cluster bugfixes there.  However, I think we still need to find out what caused yesterday's outage.  Since it's intermittent (we had a few weeks of uptime before the breakage), an upgrade followed by a week of normal behavior doesn't necessarily mean that the problem is really solved.

http://bugs.mysql.com/bug.php?id=12950 seems almost relevant, but we don't have any text columns, and our results were far more devastating than this bug indicates.
[14 Oct 2005 18:52] eric von
As an addendum to my last comment: we do have quite a few BLOB columns.  Not sure how relevant that may be.
[15 Oct 2005 1:53] Stewart Smith
This could be the same bug. Are you updating a column with a charset during the backup?

TEXT and BLOB are implemented the same way in NDB.
[15 Oct 2005 4:20] eric von
I can't say for certain, but it does seem likely that rows with BLOBs were inserted in at least one table during the backup.  None of the BLOB columns in any table should ever be updated (we use them to log communications on one of our uplinks), but it is possible that other columns in the same table were modified (received date, processed date, etc).  

Bug 12950 seems to indicate that only updates to the TEXT/BLOB column would produce the bug. "ndbd could crash if there was updates of a column with charset"

Could the problem described in bug #12950 result in the behavior we're seeing on our system?
[27 Oct 2005 15:32] Jonas Oreland
Hi,

Bug #12950 is most likely not the problem.
Can you upload your config.ini and the error logs/trace files from the crashes?

---

You have managed to get a split brain cluster, impressive :-(

You should stop node 3.
You should then restart node 4 and node 2, one at a time.
This should clear the "Failure handling of node 3 has not completed in [n] min." messages.

Note: the backup is probably not working due to the pending failure handling, so with some bad luck you might have to restore from a backup taken before the problem started, if worst comes to worst.

Then you should restart node 3 (maybe with --initial).
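
Roughly, on the management client, the whole sequence would look something like this (a sketch only: wait for each node to report "Started" before continuing, and do the --initial restart from node 3's own shell; exact commands may differ for your setup):

ndb_mgm> 3 STOP
ndb_mgm> 4 RESTART
ndb_mgm> 2 RESTART

and then, on node 3's host:

shell> ndbd --initial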

---
But please upload the files.

Best of luck, and hope it works out.

/Jonas
[3 Nov 2005 14:44] eric von
I have attached logs from all three NDB nodes at crash time, and logs from the MGM node since it started.  

I have noticed many network partition events during our nightly backups since the problem first occurred.  Fortunately, this doesn't appear to have resulted in a split-brain cluster, but I worry that our luck will run out.
[5 Dec 2005 16:13] eric
Hello,

Has there been any progress on determining the source of this problem? 

We just had another similar event this weekend.  Saturday's backup caused a crash, followed by a split-brain status for all of Sunday.  Then Sunday's backup crashed the cluster again and it recovered without the split-brain - but also without all of Sunday's data.

I will post logs as soon as I can.

Thanks.
[5 Dec 2005 16:22] eric von
Cluster log from this weekend and prior.  Note the frequent crashes at backup time (4:15 AM UTC).

Attachment: ndb_1_cluster.maskedips.zip (application/zip, text), 28.70 KiB.

[5 Dec 2005 16:27] eric von
I've uploaded the latest logs.  To potentially increase exposure, I masked out the IPs and made the log public. 

As you can see, the scheduled backup run every day at 4:15 has a high rate of crashing the cluster and aborting.  It so happens that this Saturday's crash resulted in the split-brain behavior.  

One item I noticed in the cluster limitations documentation is the statement about multiple network interfaces. Each node does have multiple interfaces, but they are on separate subnets, which do not route to one another.  So traffic to an address on subnet A cannot be delivered across the adapter connected to subnet B if adapter A is offline.  However, is there some other underlying problem I need to be aware of with multiple NICs?

Thanks in advance for your help.  I will post the NDB logs upon request, although they are substantially the same as the last ones I posted.
[5 Dec 2005 17:49] eric von
We have also found similar behavior on our internal test cluster, which has the same hardware and software setup but a different networking environment. Only one NIC is present on the internal machines, and they are on a different type of switch than in the live deployment.

We are going to upgrade the test cluster to 4.1.15, but nothing in the changelogs gives us much hope that the upgrade will fix the problem.  I'll follow up on this bug when I have results, though.
[5 Dec 2005 21:54] eric von
We have upgraded our internal cluster to 4.1.15 and are able to generate similar behavior.  We haven't reproduced the split-brain startup after a crash yet, but we are definitely able to crash the cluster reliably and repeatably during backup.  I am modifying the Version field of this bug as a result.

The crash always occurs when we run a message-handling process while a backup is running.  This message-handling process is fundamental to our application and uses HAJMS and other technology I won't bore you with.  But the basic process is:

1) Node 2 inserts a record in a table with a BLOB
2) Node 2 inserts records in a few other tables with no BLOB
3) Node 3 inserts about 10 records and updates about 25, across many tables, some of which were inserted/updated by step 2, but none have BLOBs.
4) Node 3 inserts a row in a table with a LONGBLOB field
5) Node 3 updates the value of the LONGBLOB field in that new row
6) Repeat of step 3
7) Node 3 deletes the record created in step 1
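
In SQL terms, the BLOB-related part of that sequence boils down to roughly the following (the table and column names here are only placeholders, not our real schema; the exact statements we later narrowed it down to are in a following comment):

-- via the API node on NDB node 2's host (step 1):
INSERT INTO msg_log (id, payload) VALUES (42, '...blob bytes...');
-- via the API node on NDB node 3's host (steps 4 and 5):
INSERT INTO msg_archive (id, body) VALUES (42, '...blob bytes...');
UPDATE msg_archive SET body = '...more blob bytes...' WHERE id = 42;
-- via the API node on NDB node 3's host (step 7):
DELETE FROM msg_log WHERE id = 42;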

I can provide tcpdump packet captures for all API nodes as this process runs during a backup (and leading to a crash), if that will help.
[5 Dec 2005 22:20] eric von
"Successful" split-brain cluster crash on 4.1.15, including ndb_mgm show output post-crash.

Attachment: 4.1.15-crashlogs.tgz (application/x-compressed, text), 16.51 KiB.

[5 Dec 2005 22:28] eric von
We have just duplicated the split-brain problem using 4.1.15.  Logs are attached.  We have also narrowed the query pattern that must run during the backup to the following:

1) On the API node on the same machine as NDB node 2, insert a record containing a LONGBLOB
2) On the API node on the same machine as NDB node 3, delete that record by primary key.  

Exact queries used in our case:
1) 
insert into JMS_MESSAGES (MessageID, MESSAGEBLOB) values (1,'0xACED0005737200246F[and so on, 1726 hex digits, encoding 863 bytes]')

2)
delete from JMS_MESSAGES where messageID = 1

Create statement for the table JMS_MESSAGES (obtained via show create table):

CREATE TABLE `jms_messages` (
  `MESSAGEID` int(11) NOT NULL default '0',
  `DESTINATION` varchar(150) character set latin1 NOT NULL default '',
  `TXID` int(11) default NULL,
  `TXOP` char(1) character set latin1 default NULL,
  `MESSAGEBLOB` longblob,
  PRIMARY KEY  (`MESSAGEID`,`DESTINATION`),
  KEY `JMS_MESSAGES_TXOP_TXID` (`TXOP`,`TXID`),
  KEY `JMS_MESSAGES_DESTINATION` (`DESTINATION`)
) ENGINE=ndbcluster DEFAULT CHARSET=utf8

I hope that this will provide you with enough information to find and fix this bug for good.  Exact byte values for that insert available upon request.
[7 Dec 2005 7:34] Stewart Smith
NOTE: 3 REPLICAS

Date/Time: Thursday 13 October 2005 - 21:09:19
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: Suma.hpp
Object of reference: SUMA (Line: 596) 0x00000008
ProgramName: ./bin/ndbd
ProcessID: 8794
TraceFile: /usr/local/mysql/clusterdata/ndb_4_trace.log.2
Version 4.1.12
***EOM***

--------------- Signal ----------------
r.bn: 257 "SUMA", r.proc: 4, r.sigId: 29005 gsn: 691 "SUMA_START_ME" prio: 1
s.bn: 257 "SUMA", s.proc: 2, s.sigId: 2111872 length: 1 trace: 2 #sec: 0 fragInf: 0
 H'01010002
[7 Dec 2005 7:37] Stewart Smith
problems with BLOB. Default buffer size.

Date/Time: Thursday 13 October 2005 - 17:00:22
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: Backup.cpp
Object of reference: BACKUP (Line: 3889) 0x00000008
ProgramName: ./bin/ndbd
ProcessID: 31611
TraceFile: /usr/local/mysql/clusterdata/ndb_2_trace.log.1
Version 4.1.12
***EOM***
[7 Dec 2005 7:49] Stewart Smith
Hi,

We have now looked at all your uploaded trace files/logs.
We can find no new "split-brain".
We can find a backup bug that looks like http://bugs.mysql.com/bug.php?id=12950.
That has been solved in 4.1.15.
We found a new problem in SUMA that we will fix; this is related to 3 replicas.

BTW: all your trace files contain 4.1.12...

Questions:
1) When you refer to "split-brain", how do you come to that conclusion?
2) Can you post trace files/error logs for the backup crash in 4.1.15?
[7 Dec 2005 15:07] eric von
Hi Stewart,

Please see the attachment from 5 Dec 2005 for logs from a 4.1.15 cluster crashing during backup and then coming back up with two masters.  Obviously, whatever is causing our problems is not fixed in 4.1.15.

We come to the split brain conclusion via:

1) Data inserted via an API node on the same host as NDB node 4 only shows up on NDB node 4.  Data inserted via an API node on the host of NDB node 3 or 2 only shows up on that node.  This is the definition of a split-brain cluster, as far as I can tell (a minimal check is sketched after this list).

2) Two different nodes report local checkpoint numbers interleaved with one another (2 and 4 in my logs).  I thought only the master node should report checkpoints.  For example:
2005-12-07 02:52:42 [MgmSrvr] INFO     -- Node 3: Local checkpoint 280 started. Keep GCI = 393134 oldest restorable GCI = 393046
2005-12-07 03:20:59 [MgmSrvr] INFO     -- Node 4: Local checkpoint 280 started. Keep GCI = 394593 oldest restorable GCI = 348080

3) "Failure handling of node 4 has not completed in 5 min." messages in the log - this appeared last time we had a split brain.

I will attach an updated log file from this morning.  We left the server running overnight with the odd behavior.  Last night's backup crashed it again and it came back up with only one master - but for all of the 6th there were two masters. Again, this is all on 4.1.15.
[7 Dec 2005 15:31] eric von
Traces generated on NDB2 during backup crashes in 4.1.15

Attachment: NDB2.zip (application/zip, text), 168.78 KiB.

[7 Dec 2005 15:31] eric von
Traces generated on NDB3 during backup crashes in 4.1.15

Attachment: NDB3.zip (application/zip, text), 110.22 KiB.

[7 Dec 2005 15:31] eric von
Traces generated on NDB4 during backup crashes in 4.1.15

Attachment: NDB4.zip (application/zip, text), 167.95 KiB.

[7 Dec 2005 15:32] eric von
Cluster log during and after backup crashes in 4.1.15

Attachment: ndb_1_cluster.zip (application/zip, text), 36.27 KiB.

[7 Dec 2005 15:45] Stewart Smith
Your old attachments didn't include the error log and trace files - which are the most useful in finding the problem.

Your new ones have the files we need. Having a look now.

Just today I've developed a script that will automatically gather all the data needed for a good bug report (all the log files, error files, and trace files) to make it easier for users to collect the right files.
[20 Dec 2005 18:52] eric von
Hi Stewart (or Jonas),

Has anyone been able to duplicate or make any progress on this item?  I can generate and post more error/trace dumps if you'd like.
[11 Jan 2006 18:55] eric von
Hi Folks, 

Any news?  Have you been at least able to duplicate the problem?  Shall I post more logs?

Thanks.
[23 Feb 2006 22:35] eric von
We've updated our test cluster to 5.0.18 in the hopes that this bug would have been eradicated (perhaps unintentionally) in the new version, but no such luck.  The same actions cause the same crash.  Full logs are on their way.

Since the reproducible part of this bug is the crash, I'm changing the summary to reflect that fact.  Sometimes the crash recovery winds up split-brain, but if there were no crash, there'd be no split either.

Stewart, where can I find the script to gather the logs?  Also, since this bug is still present in the latest releases, might that bump its priority a bit?
[23 Feb 2006 22:42] eric von
Activity from ndb_mgm console:
=========>
ndb_mgm> start backup wait completed
Connected to Management Server at: a.b.c.71:1186
Waiting for completed, this may take several minutes
Node 2: Backup 1 started from node 1
Node 4: Forced node shutdown completed, restarting. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Node 2: Backup 1 started from 1 has been aborted. Error: 1324
Start of backup failed
*  3001: Could not start backup
*        Backup log buffer full: Permanent error: Internal error
ndb_mgm> Node 4: Started (version 5.0.18)
<==========

Activity performed on MySQL API node 5 (same host as NDB node 2), as soon as the "Backup 1 started" message appeared on the ndb_mgm console:
=========>
insert into JMS_MESSAGES (MessageID, Destination, MESSAGEBLOB) values (1,2,'0xACED0005737200246F72672E6A626F73732E696E766F636174696F6E2E4D61727368616C6C656456616C7565EACCE0D1F44AD0990C000078707A0000032000000318ACED0005737200116A6176612E7574696C2E486173684D61700507DAC1C31660D103000246000A6C6F6164466163746F724900097468726573686F6C6478703F400000000000307708000000400000001974000D70756D7032466C6F7752617465737200116A6176612E6C616E672E496E746567657212E2A0A4F781873802000149000576616C7565787200106A6176612E6C616E672E4E756D62657286AC951D0B94E08B02000078700000000074000D746F74616C466C6F77526174657371007E00030000000074000D72747541726D656453746174657371007E0003000000017400056368616E337371007E0003000000007400056368616E387371007E00030000000074000B4D65737361676554797065740013322050554D50204E4F20544F54414C495A455274000C70756D703152756E54696D657371007E0003000000007400066368616E31307371007E0003000000007400034D494E74000A313735343231393438387400056368616E397371007E00030000000174000B70756D70315374617274737371007E00030000000074001373696D756C74616E656F757352756E54696D657371007E0003000000007400056368616E377371007E00030000000074000954696D657374616D707372000E6A6176612E6C616E672E4C6F6E673B8BE490CC8F23DF0200014A000576616C75657871007E0004000000004394B06C7400034D534E7371007E001F00000000D60D1DA174000B70756D70325374617274737371007E0003000000007400056368616E347371007E00030000000074000B4D73635377697463684E6F7372000F6A6176612E6C616E672E53686F7274684D37133460DA5202000153000576616C75657871007E0004000174000B4D73634D61726B657449647371007E002800147400056368616E367371007E0003000000007400067361724269747371007E0003000000017400056368616E357371007E00030000000074000C70756D703252756E54696D657371007E00030000000074000C706F6C6C526573706F6E73657371007E00030000000074000D70756D7031466C6F77526174657371007E0003000000007884BE8A3F78')
<==========

Activity performed on MySQL API node 6 (same host as NDB node 3), as soon as node 5's insert completed:
=========>
delete from JMS_MESSAGES where messageID = 1
<==========
[23 Feb 2006 22:50] eric von
Crash trace from node 4, and relevant portion of cluster log.  Nodes 2 and 3 did not generate traces.

Attachment: node4crash.zip (application/zip, text), 40.05 KiB.

[23 Feb 2006 22:59] eric von
Updating synopsis a bit to reflect the fact that the cluster doesn't always crash, but at least one node always does.
[13 Mar 2006 13:40] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/3774
[13 Mar 2006 13:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/3775
[13 Mar 2006 15:13] Jonas Oreland
pushed into 5.1.8
[14 Mar 2006 7:40] Jonas Oreland
pushed into 5.0.20
[14 Mar 2006 8:41] Jonas Oreland
Pushed into 4.1.19
[14 Mar 2006 8:55] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bug fix in 4.1.19/5.0.20/5.1.8 changelogs. Closed.
[20 Mar 2006 10:08] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/3963