Bug #44499 GCP Stop / failed handling of GCP_SAVEREQ / Dbdih::execBLOCK_COMMIT_ORD
Submitted: 27 Apr 2009 19:55 Modified: 26 Feb 2010 12:22
Reporter: Henrik Ingo Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:mysql-5.1-telco-7.0 OS:Solaris
Assigned to: Jonas Oreland CPU Architecture:Any
Tags: mysql-5.1-telco-7.0.4

[27 Apr 2009 19:55] Henrik Ingo
Description:
The following was observed on a MySQL Cluster of 24 datanodes and 32 mysqld nodes, running on 2 Sparc CMT machines with Solaris, under high load. 

The problems start with a GCP stop detected in node 3. Node 4 and 16 miss heartbeats and are also disconnected. After a while it seems there is a "pulse" of Dbdih::execBLOCK_COMMIT_ORD errors followed by closing of connections going through each node, for instance for node 8:

2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 8: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo =
2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 8: Communication to Node 3 closed
2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 8: Communication to Node 4 closed
2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 8: Communication to Node 15 closed
2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 8: Communication to Node 16 closed
20

In the end nodes start shutting down due to not having an unpartitioned cluster.

Full logs are attached in zip file. Below are some highlights:

clusterlog:
2009-04-17 07:49:20 [MgmSrvr] INFO     -- Node 3: Local checkpoint 31 started. Keep GCI = 24534 oldest restorable GCI = 2445
2009-04-17 07:55:52 [MgmSrvr] INFO     -- Node 3: Detected GCP stop(1)...sending kill to [SignalCounter: m_count=3 0000000000010018]
2009-04-17 07:56:13 [MgmSrvr] WARNING  -- Node 12: Node 4 missed heartbeat 2
2009-04-17 07:56:13 [MgmSrvr] WARNING  -- Node 15: Node 16 missed heartbeat 2
2009-04-17 07:56:23 [MgmSrvr] WARNING  -- Node 12: Node 4 missed heartbeat 3
2009-04-17 07:56:23 [MgmSrvr] WARNING  -- Node 15: Node 16 missed heartbeat 3
2009-04-17 07:56:24 [MgmSrvr] ALERT    -- Node 5: Node 4 Disconnected
2009-04-17 07:56:24 [MgmSrvr] ALERT    -- Node 10: Node 4 Disconnected
2009-04-17 07:56:24 [MgmSrvr] INFO     -- Node 10: Communication to Node 4 closed
2009-04-17 07:56:24 [MgmSrvr] ALERT    -- Node 15: Node 4 Disconnected
2009-04-17 07:56:24 [MgmSrvr] INFO     -- Node 15: Communication to Node 4 closed
...
2009-04-17 07:56:28 [MgmSrvr] ALERT    -- Node 1: Node 16 Disconnected
2009-04-17 07:56:28 [MgmSrvr] ALERT    -- Node 18: Node 3 Disconnected
2009-04-17 07:56:28 [MgmSrvr] INFO     -- Node 18: Communication to Node 3 closed
2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 5: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 2 sig->failNo =
2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 5: Communication to Node 3 closed
2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 5: Communication to Node 4 closed
2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 5: Communication to Node 15 closed
2009-04-17 07:56:30 [MgmSrvr] INFO     -- Node 5: Communication to Node 16 closed
...
2009-04-17 07:56:40 [MgmSrvr] ALERT    -- Node 15: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2009-04-17 07:56:40 [MgmSrvr] INFO     -- Node 24: Communication to Node 15 opened
2009-04-17 07:56:40 [MgmSrvr] INFO     -- Node 26: Communication to Node 15 opened
2009-04-17 07:56:40 [MgmSrvr] INFO     -- Node 26: Communication to Node 17 opened
...
2009-04-17 08:01:33 [MgmSrvr] ALERT    -- Node 22: Forced node shutdown completed. Initiated by signal 11. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary e
2009-04-17 08:01:47 [MgmSrvr] ALERT    -- Node 5: Forced node shutdown completed. Initiated by signal 11. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary er

Attached is the complete clusterlog file.

From one of the mysqld logs:
090416 13:55:33 [Note] NDB Binlog: Node: 26, subscribe from node 134, Subscriber bitmask ffdfff07f8
2009-04-17 07:56:34 [NdbApi] INFO     -- Management server closed connection early. It is probably being shut down (or has problems). We will retry t
he connection. 145 Time out talking to management server Error line: 464
090417  7:58:53 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00
090417  7:58:53 [Note] NDB Binlog: Node: 4, down, Subscriber bitmask 00
090417  7:58:53 [ERROR] Got error 4119 when reading table './[DBname]/[TABLEname]'
090417  7:58:53 [Note] NDB Binlog: Node: 5, down, Subscriber bitmask 00
090417  7:58:53 [Note] NDB Binlog: Node: 6, down, Subscriber bitmask 00
090417  7:58:53 [ERROR] Got error 4119 when reading table './[DBname]/[TABLEname]'
...
090417  7:58:53 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema at epoch 26573/0.
090417  7:58:53 [Note] NDB Binlog: cluster failure for ./mysql/ndb_apply_status at epoch 26573/0.
090417  7:58:53 [Note] Restarting Cluster Binlog
090417  7:58:53 [ERROR] Got error 4010 when reading table './[DBname]/[TABLEname]'
090417  7:58:53 [Warning] NDB: Could not acquire global schema lock (4006)Connect failure - out of connection objects (increase MaxNoOfConcurrentTran
sactions)
090417  7:58:53 [ERROR] Got error 4010 when reading table './[DBname]/[TABLEname]'
...
090417  7:58:53 [ERROR] Got error 4010 when reading table './[DBname]/[TABLEname]'
090417  8:00:12 [Note] Restarting Cluster Binlog
090417  8:00:12 [ERROR] /usr/local/mysql-5.1.32-ndb-7.0.4-innodb-solaris10-sparc/libexec/mysqld: Incorrect information in file: './[DBname]/[TABLEname].frm'
090417  8:00:12 [ERROR] /usr/local/mysql-5.1.32-ndb-7.0.4-innodb-solaris10-sparc/libexec/mysqld: Incorrect information in file: './[DBname]/[TABLEname].frm'

Complete file attached.

Interestingly both ndb_3_error.log and ndb_4_error.log have:
Time: Friday 17 April 2009 - 07:55:53
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to failed handling of GCP_SAVEREQ
Error object: DBLQH (Line: 20241) 0x0000000e
Program: /usr/local/mysql/libexec/

The ndb_3_out.log is:
2009-04-17 07:55:55 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=1429
2009-04-17 07:55:55 [ndbd] INFO     -- Watchdog: User time: 945281  System time: 445469
2009-04-17 07:55:55 [ndbd] INFO     -- Watchdog: User time: 945284  System time: 445469
2009-04-17 07:55:55 [ndbd] WARNING  -- Watchdog: Warning overslept 1379 ms, expected 100 ms.
2009-04-17 07:55:56 [ndbd] WARNING  -- Ndb kernel thread 4 is stuck in: Print Job Buffers at crash elapsed=110
2009-04-17 07:55:56 [ndbd] INFO     -- Watchdog: User time: 945304  System time: 445470
2009-04-17 07:55:56 [ndbd] WARNING  -- Ndb kernel thread 5 is stuck in: Print Job Buffers at crash elapsed=110
...
2009-04-17 07:56:27 [ndbd] INFO     -- Watchdog: User time: 945744  System time: 445541
2009-04-17 07:56:27 [ndbd] INFO     -- Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to failed handling of GCP_SAVEREQ
2009-04-17 07:56:27 [ndbd] INFO     -- DBLQH (Line: 20241) 0x0000000e
...
2009-04-17 07:56:38 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

And then I also have two core dumps of ndbmtd's in my file system. Those are huge, but I could upload them to a ftp server, if you like to have them. In this mail I have just attached the pstack's on those two cores. 

How to repeat:
Could be hard to reproduce.
[7 May 2009 13:10] Jonathan Miller
http://bugs.mysql.com/bug.php?id=43882
http://bugs.mysql.com/bug.php?id=39498
[7 May 2009 14:11] Henrik Ingo
Note: 

The GCP stop very likely simply was due to all ndbd's on one host using *1 disk*. With a RAID stripe of 6 disks the GCP stop does not happen and thus none of the other things happen either.

The point of this bug report is more to analyze everything that happens after the GCP stop, which looks a bit uncontrolled.
[26 Feb 2010 12:22] Jonas Oreland
this looks very much like a duplicate of bug#51512
closing