MySQL Bugs: #45600: Cluster various node restarts

Bug #45600	Cluster various node restarts
Submitted:	18 Jun 2009 23:15	Modified:	5 Mar 2010 12:27
Reporter:	Robert Klikics	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-7.0	OS:	Linux (Debian 5.0)
Assigned to:	Jonas Oreland	CPU Architecture:	Any
Tags:	7.0.9b, cluster, crash, node, restart

Description:
Hello,

suddenly 2 of my SQL-nodes have restartet and the mgm-node throws the following error's. There was no heavy load or any network problem at the moment this happend. After some minutes everything went fine again:

2009-06-19 00:20:04 [MgmSrvr] WARNING  -- Node 5: Node 9 missed heartbeat 2
2009-06-19 00:20:04 [MgmSrvr] WARNING  -- Node 3: Node 9 missed heartbeat 2
2009-06-19 00:20:05 [MgmSrvr] WARNING  -- Node 2: Node 9 missed heartbeat 2
2009-06-19 00:20:05 [MgmSrvr] WARNING  -- Node 4: Node 9 missed heartbeat 2
2009-06-19 00:20:06 [MgmSrvr] WARNING  -- Node 5: Node 9 missed heartbeat 3
2009-06-19 00:20:06 [MgmSrvr] WARNING  -- Node 3: Node 9 missed heartbeat 3
2009-06-19 00:20:06 [MgmSrvr] WARNING  -- Node 2: Node 9 missed heartbeat 3
2009-06-19 00:20:07 [MgmSrvr] WARNING  -- Node 4: Node 9 missed heartbeat 3
2009-06-19 00:20:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 9 closed
2009-06-19 00:20:07 [MgmSrvr] INFO     -- Node 3: Communication to Node 9 closed
2009-06-19 00:20:07 [MgmSrvr] WARNING  -- Node 5: Node 9 missed heartbeat 4
2009-06-19 00:20:07 [MgmSrvr] ALERT    -- Node 5: Node 9 declared dead due to missed heartbeat
2009-06-19 00:20:07 [MgmSrvr] INFO     -- Node 5: Communication to Node 9 closed
2009-06-19 00:20:07 [MgmSrvr] INFO     -- Node 4: Communication to Node 9 closed
2009-06-19 00:20:07 [MgmSrvr] ALERT    -- Node 2: Node 9 Disconnected
2009-06-19 00:20:07 [MgmSrvr] ALERT    -- Node 4: Node 9 Disconnected
2009-06-19 00:20:07 [MgmSrvr] ALERT    -- Node 3: Node 9 Disconnected
2009-06-19 00:20:07 [MgmSrvr] ALERT    -- Node 5: Node 9 Disconnected
2009-06-19 00:20:11 [MgmSrvr] INFO     -- Node 5: Communication to Node 9 opened
2009-06-19 00:20:11 [MgmSrvr] INFO     -- Node 4: Communication to Node 9 opened
2009-06-19 00:20:11 [MgmSrvr] INFO     -- Node 2: Communication to Node 9 opened
2009-06-19 00:20:12 [MgmSrvr] WARNING  -- Node 3: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-06-19 00:20:14 [MgmSrvr] WARNING  -- Node 3: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved - Repeated 74 times
2009-06-19 00:20:18 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-06-19 00:20:18 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed
2009-06-19 00:20:18 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed
2009-06-19 00:20:18 [MgmSrvr] ALERT    -- Node 5: Node 3 Disconnected
2009-06-19 00:20:18 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2009-06-19 00:20:18 [MgmSrvr] ALERT    -- Node 2: Arbitration check won - node group majority
2009-06-19 00:20:18 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=6]
2009-06-19 00:20:18 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2009-06-19 00:20:18 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed
2009-06-19 00:20:18 [MgmSrvr] ALERT    -- Node 4: Node 3 Disconnected
2009-06-19 00:20:18 [MgmSrvr] INFO     -- Node 5: Communication to Node 3 closed
2009-06-19 00:20:18 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2009-06-19 00:20:19 [MgmSrvr] INFO     -- Mgmt server state: nodeid 9 timed out.
2009-06-19 00:20:25 [MgmSrvr] WARNING  -- Node 5: Transporter to node 2 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-06-19 00:20:28 [MgmSrvr] INFO     -- Node 4: Communication to Node 2 closed
2009-06-19 00:20:28 [MgmSrvr] ALERT    -- Node 5: Node 2 Disconnected
2009-06-19 00:20:28 [MgmSrvr] INFO     -- Node 5: Communication to Node 2 closed
2009-06-19 00:20:28 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2009-06-19 00:20:28 [MgmSrvr] ALERT    -- Node 4: Node 2 Disconnected
2009-06-19 00:20:30 [MgmSrvr] INFO     -- Node 5: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 3 sig->failNo =
2009-06-19 00:20:30 [MgmSrvr] ALERT    -- Node 5: Node 4 Disconnected
2009-06-19 00:20:30 [MgmSrvr] INFO     -- Node 5: Communication to Node 2 closed
2009-06-19 00:20:30 [MgmSrvr] INFO     -- Node 5: Communication to Node 4 closed
2009-06-19 00:20:30 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2009-06-19 00:20:31 [MgmSrvr] ALERT    -- Node 1: Node 5 Disconnected
2009-06-19 00:20:31 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 192.168.10.102, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000010000fca.
2009-06-19 00:20:31 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-06-19 00:20:31 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2009-06-19 00:20:31 [MgmSrvr] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2009-06-19 00:20:31 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 reserved for ip 192.168.10.101, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000010000fce.
2009-06-19 00:20:31 [MgmSrvr] INFO     -- Mgmt server state: nodeid 4 reserved for ip 192.168.10.103, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000010000fde.
2009-06-19 00:20:31 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 reserved for ip 192.168.10.104, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000010000ffe.
2009-06-19 00:20:32 [MgmSrvr] INFO     -- Node 1: Node 5 Connected
2009-06-19 00:20:32 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2009-06-19 00:20:32 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
2009-06-19 00:20:32 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2009-06-19 00:20:33 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000010000fde.
2009-06-19 00:20:33 [MgmSrvr] INFO     -- Mgmt server state: nodeid 4 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000010000fce.
2009-06-19 00:20:33 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000010000fca.
2009-06-19 00:20:33 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000010000fc2.
2009-06-19 00:20:34 [MgmSrvr] INFO     -- Node 5: Start phase 0 completed 
2009-06-19 00:20:34 [MgmSrvr] INFO     -- Node 3: Start phase 0 completed 
2009-06-19 00:20:34 [MgmSrvr] INFO     -- Node 5: Waiting 30 sec for nodes 000000000000001c to connect, nodes [ all: 000000000000003c connected: 0000000000000020 no-wait: 0000000000000000 ]

[ ...  ]

How to repeat:
No idea at the moment

Additional output from a mysqld-node:

090619 00:21:34 mysqld_safe Number of processes running now: 0
090619 00:21:35 mysqld_safe mysqld restarted
090619  0:21:36  InnoDB: Started; log sequence number 0 46409
090619  0:21:36 [Note] NDB: NodeID is 9, management server '192.168.10.100:1186'
090619  0:21:37 [Note] NDB[0]: NodeID: 9, no storage nodes connected (timed out)
090619  0:21:37 [Warning] NDB: server id set to zero will cause any other mysqld with bin log to log with wrong server id
090619  0:21:37 [Note] Starting Cluster Binlog Thread
090619  0:21:39 [Note] Event Scheduler: Loaded 0 events
090619  0:21:39 [Note] /usr/local/mysql/bin/mysqld: bereit für Verbindungen.
Version: '5.1.34-ndb-7.0.6-cluster-gpl'  Socket: '/var/run/mysqld/mysqld.sock'  Port: 3306  MySQL Cluster Server (GPL)

Additional output from a ndb-node:

2009-06-19 00:20:27 [ndbd] INFO     -- Node 2 killed this node because GCP stop was detected
2009-06-19 00:20:27 [ndbd] INFO     -- NDBCNTR (Line: 260) 0x0000000c
2009-06-19 00:20:27 [ndbd] INFO     -- Error handler restarting system
2009-06-19 00:20:27 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-06-19 00:20:28 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-06-19 00:20:31 [ndbd] INFO     -- Ndb has terminated (pid 30226) restarting
2009-06-19 00:20:31 [ndbd] INFO     -- Configuration fetched from '192.168.10.100:1186', generation: 1
2009-06-19 00:20:31 [ndbd] INFO     -- Angel pid: 30225 ndb pid: 26983
NDBMT: MaxNoOfExecutionThreads=2
NDBMT: workers=1 threads=1
2009-06-19 00:20:31 [ndbd] INFO     -- NDB Cluster -- DB node 2
2009-06-19 00:20:31 [ndbd] INFO     -- mysql-5.1.34 ndb-7.0.6 --
2009-06-19 00:20:31 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 59610Mb initial: 59630Mb
Adding 8192Mb to ZONE_LO (1,262142)
Adding 8192Mb to ZONE_LO (262145,262142)
Adding 8192Mb to ZONE_HI (524289,262142)
Adding 8192Mb to ZONE_HI (786433,262142)
Adding 8192Mb to ZONE_HI (1048577,262142)
Adding 8192Mb to ZONE_HI (1310721,262142)
Adding 8192Mb to ZONE_HI (1572865,262142)
Adding 2286Mb to ZONE_HI (1835009,73151)
NDBMT: num_threads=4
thr: 0 cpu: 1 err: 38 DBTC(0) DBDIH(0) DBDICT(0) NDBCNTR(0) QMGR(0) NDBFS(0) TRIX(0) DBUTIL(0)
thr: 3 cpu: 1 err: 38 CMVMI(0)
thr: 2 cpu: 1 err: 38 PGMAN(1) DBACC(1) DBLQH(1) DBTUP(1) BACKUP(1) DBTUX(1) RESTORE(1)
saving 0x7f1e80480000 at 0xa16618 (0)
2009-06-19 00:20:32 [ndbd] INFO     -- Start initiated (mysql-5.1.34 ndb-7.0.6)
saving 0x7f1e80660000 at 0xa21ce8 (0)
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer

Workaround: increase send buffer on data nodes?

Please provide trace files and configuration files. No heavy load: could you please describe the load that was on the system?

tracelog ndb node1

Attachment: ndb2.tar.gz (application/x-gzip, text), 162.62 KiB.

tracelog ndb node2

Attachment: ndb3.tar.gz (application/x-gzip, text), 115.48 KiB.

tracelog ndb node3

Attachment: ndb4.tar.gz (application/x-gzip, text), 155.54 KiB.

tracelog ndb node4

Attachment: ndb5.tar.gz (application/x-gzip, text), 197.81 KiB.

Thanks for your reply. I've just added the tracelogs created at the problem appeared from all 4 ndb-nodes to this report.

I've also increased TotalSendBufferMemory to 8M for each data and sql-node. Before there was a global setting of SendBufferMemory=8M.

But I think a cluster oder node restart should not be caused if the SendBuffer is litte too low ...

The Linux-Loadavg was around ~1.0, here some "atsar" output:

00:00:01  pswch/s runq nrproc lavg1 lavg5 avg15                  
00:10:01    10687    1    119  0.75  0.81  0.81
00:20:01     9587    2    119  0.78  0.85  0.82
00:30:01     2024    1    119  0.22  0.72  0.79
00:40:01     1254    2    119  1.00  0.66  0.63
00:50:01     3072    1    119  0.10  0.57  0.64
01:00:01     6391    1    119  0.37  0.66  0.65

Just got this error minutes ago.  We have also have seen them in the past.

2009-09-29 14:37:07 [ndbd] INFO     -- Watchdog: User time: 23852  System time: 9374
2009-09-29 14:37:07 [ndbd] WARNING  -- Watchdog: Warning overslept 3629 ms, expected 100 ms.
2009-09-29 14:37:07 [ndbd] INFO     -- Node 21 killed this node because GCP stop was detected
2009-09-29 14:37:07 [ndbd] INFO     -- NDBCNTR (Line: 260) 0x0000000a
2009-09-29 14:37:07 [ndbd] INFO     -- Error handler shutting down system
2009-09-29 14:37:07 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-09-29 14:37:07 [ndbd] ALERT    -- Node 21: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=21 (not connected, accepting connect from 192.168.0.84)
id=22   @192.168.0.85  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 0, Master)
id=23   @192.168.0.86  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1)
id=24   @192.168.0.87  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1)

[ndb_mgmd(MGM)] 2 node(s)
id=1    @192.168.0.81  (mysql-5.1.34 ndb-7.0.6)
id=2    @192.168.0.82  (mysql-5.1.34 ndb-7.0.6)

[mysqld(API)]   8 node(s)
id=11   @192.168.0.81  (mysql-5.1.34 ndb-7.0.6)
id=12   @192.168.0.81  (mysql-5.1.34 ndb-7.0.6)
id=13   @192.168.0.81  (mysql-5.1.34 ndb-7.0.6)
id=14   @192.168.0.82  (mysql-5.1.34 ndb-7.0.6)
id=15   @192.168.0.82  (mysql-5.1.34 ndb-7.0.6)
id=16   @192.168.0.82  (mysql-5.1.34 ndb-7.0.6)
id=17 (not connected, accepting connect from 192.168.0.80)
id=18 (not connected, accepting connect from 192.168.0.81)

Could you provide the full collection of config and log files, e.g. using the ndb_error_reporter utility?

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndb-error-reporter.html

Since my cluser crashed again today (now running 7.0.8a, I also mailed to the list..) the bug still seems to be present.

Log-Output from today:

2009-11-10 10:02:54 [MgmtSrvr] WARNING  -- Node 3: Node 6 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING  -- Node 3: Node 7 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING  -- Node 3: Node 8 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING  -- Node 3: Node 9 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING  -- Node 3: Node 11 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING  -- Node 3: Node 28 missed heartbeat 2
2009-11-10 10:02:57 [MgmtSrvr] WARNING  -- Node 2: Detected GCP stop(1)...sending kill to [SignalCounter: m_count=2 0000000000000028]
2009-11-10 10:02:58 [MgmtSrvr] WARNING  -- Node 3: Node 6 missed heartbeat 2
2009-11-10 10:02:58 [MgmtSrvr] WARNING  -- Node 3: Node 7 missed heartbeat 2
2009-11-10 10:02:58 [MgmtSrvr] WARNING  -- Node 3: Node 10 missed heartbeat 2
2009-11-10 10:02:58 [MgmtSrvr] WARNING  -- Node 3: Node 11 missed heartbeat 2
2009-11-10 10:02:58 [MgmtSrvr] WARNING  -- Node 3: Node 28 missed heartbeat 2
2009-11-10 10:03:03 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for
a while solved
2009-11-10 10:03:03 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for
a while solved
2009-11-10 10:03:05 [MgmtSrvr] WARNING  -- Node 3: Node 6 missed heartbeat 2
2009-11-10 10:03:05 [MgmtSrvr] WARNING  -- Node 3: Node 7 missed heartbeat 2
2009-11-10 10:03:05 [MgmtSrvr] WARNING  -- Node 3: Node 11 missed heartbeat 2
2009-11-10 10:03:05 [MgmtSrvr] WARNING  -- Node 3: Node 28 missed heartbeat 2
2009-11-10 10:03:07 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed
ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Here is the download-link for the bz2-archive from ndb_error_reporter (23M):
http://85.25.144.101/files/ndb_error_report_20091110111356.tar.bz2

The file-limit here is 500k, so I needed to upload this to one of my own servers ;)

Hope this helps!

Regards,
Robert

Hello,

I just ran into the same problem , even with 7.0.9b :(

2009-11-11 13:55:40 [MgmtSrvr] WARNING  -- Node 3: Node 11 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 6 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 7 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 9 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 10 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 28 missed heartbeat 2
2009-11-11 13:55:43 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:44 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:44 [MgmtSrvr] WARNING  -- Node 3: Node 8 missed heartbeat 2
2009-11-11 13:55:44 [MgmtSrvr] WARNING  -- Node 3: Node 11 missed heartbeat 2
2009-11-11 13:55:44 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 6 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 7 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 9 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 10 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 28 missed heartbeat 2
2009-11-11 13:55:47 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

After that all 4 NDB-nodes made a restart .. really bad!

I uploaded the data from ndb_error_reporter here:
http://85.25.144.101/files/ndb_error_report_20091111135937.tar.bz2

The problem did appear again. Here some other ndb_error_reporter output, hope it's useful:

http://85.25.144.101/files/ndb_error_report_20091120133852.tar.bz2

Hi,

From reading you're cluster log, 
I think you check how your network is doing.

If the switches are ok, if full-duplex is used,
that interrupt throttling is not limiting throughput
(and increasing latency)

/Jonas

Hi Jonas,

the switches are ok and all intefaces talking in fd mode. We're using the default values for interupt throttling, which should be fine.

Since we've switched to ndbd, prior it was ndbmtd, the cluster run's almost stable (we've got one failure in 2 months, which is stable for us).

Sincerelly
Martin

i don't think we'll get any further on this.
setting "can't repeat" until further evidence surfaces