Bug #45600 | Cluster various node restarts | ||
---|---|---|---|
Submitted: | 18 Jun 2009 23:15 | Modified: | 5 Mar 2010 12:27 |
Reporter: | Robert Klikics | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | mysql-5.1-telco-7.0 | OS: | Linux (Debian 5.0) |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
Tags: | 7.0.9b, cluster, crash, node, restart |
[18 Jun 2009 23:15]
Robert Klikics
[18 Jun 2009 23:25]
Robert Klikics
Additional output from a mysqld node:

090619 00:21:34 mysqld_safe Number of processes running now: 0
090619 00:21:35 mysqld_safe mysqld restarted
090619  0:21:36 InnoDB: Started; log sequence number 0 46409
090619  0:21:36 [Note] NDB: NodeID is 9, management server '192.168.10.100:1186'
090619  0:21:37 [Note] NDB[0]: NodeID: 9, no storage nodes connected (timed out)
090619  0:21:37 [Warning] NDB: server id set to zero will cause any other mysqld with bin log to log with wrong server id
090619  0:21:37 [Note] Starting Cluster Binlog Thread
090619  0:21:39 [Note] Event Scheduler: Loaded 0 events
090619  0:21:39 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.1.34-ndb-7.0.6-cluster-gpl'  Socket: '/var/run/mysqld/mysqld.sock'  Port: 3306  MySQL Cluster Server (GPL)
[18 Jun 2009 23:44]
Robert Klikics
Additional output from an ndb node:

2009-06-19 00:20:27 [ndbd] INFO -- Node 2 killed this node because GCP stop was detected
2009-06-19 00:20:27 [ndbd] INFO -- NDBCNTR (Line: 260) 0x0000000c
2009-06-19 00:20:27 [ndbd] INFO -- Error handler restarting system
2009-06-19 00:20:27 [ndbd] INFO -- Error handler shutdown completed - exiting
2009-06-19 00:20:28 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-06-19 00:20:31 [ndbd] INFO -- Ndb has terminated (pid 30226) restarting
2009-06-19 00:20:31 [ndbd] INFO -- Configuration fetched from '192.168.10.100:1186', generation: 1
2009-06-19 00:20:31 [ndbd] INFO -- Angel pid: 30225 ndb pid: 26983
NDBMT: MaxNoOfExecutionThreads=2
NDBMT: workers=1 threads=1
2009-06-19 00:20:31 [ndbd] INFO -- NDB Cluster -- DB node 2
2009-06-19 00:20:31 [ndbd] INFO -- mysql-5.1.34 ndb-7.0.6 --
2009-06-19 00:20:31 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 59610Mb initial: 59630Mb
Adding 8192Mb to ZONE_LO (1,262142)
Adding 8192Mb to ZONE_LO (262145,262142)
Adding 8192Mb to ZONE_HI (524289,262142)
Adding 8192Mb to ZONE_HI (786433,262142)
Adding 8192Mb to ZONE_HI (1048577,262142)
Adding 8192Mb to ZONE_HI (1310721,262142)
Adding 8192Mb to ZONE_HI (1572865,262142)
Adding 2286Mb to ZONE_HI (1835009,73151)
NDBMT: num_threads=4
thr: 0 cpu: 1 err: 38 DBTC(0) DBDIH(0) DBDICT(0) NDBCNTR(0) QMGR(0) NDBFS(0) TRIX(0) DBUTIL(0)
thr: 3 cpu: 1 err: 38 CMVMI(0)
thr: 2 cpu: 1 err: 38 PGMAN(1) DBACC(1) DBLQH(1) DBTUP(1) BACKUP(1) DBTUX(1) RESTORE(1)
saving 0x7f1e80480000 at 0xa16618 (0)
2009-06-19 00:20:32 [ndbd] INFO -- Start initiated (mysql-5.1.34 ndb-7.0.6)
saving 0x7f1e80660000 at 0xa21ce8 (0)
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
[22 Jun 2009 13:47]
Jørgen Austvik
Workaround: increase the send buffer on the data nodes?

Please provide trace files and configuration files.

"No heavy load": could you please describe the load that was on the system?
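For reference, a minimal config.ini sketch of the two send buffer settings discussed in this report (section placement as documented for NDB 7.0; the 8M values are only illustrative, not a recommendation):

# config.ini sketch
[tcp default]
# per-transporter send buffer (the "global" SendBufferMemory setting)
SendBufferMemory=8M

[ndbd default]
# per-node total send buffer pool (NDB 7.0)
TotalSendBufferMemory=8M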
[22 Jun 2009 14:00]
Robert Klikics
tracelog ndb node1
Attachment: ndb2.tar.gz (application/x-gzip, text), 162.62 KiB.
[22 Jun 2009 14:00]
Robert Klikics
tracelog ndb node2
Attachment: ndb3.tar.gz (application/x-gzip, text), 115.48 KiB.
[22 Jun 2009 14:00]
Robert Klikics
tracelog ndb node3
Attachment: ndb4.tar.gz (application/x-gzip, text), 155.54 KiB.
[22 Jun 2009 14:01]
Robert Klikics
tracelog ndb node4
Attachment: ndb5.tar.gz (application/x-gzip, text), 197.81 KiB.
[22 Jun 2009 14:04]
Robert Klikics
Thanks for your reply. I've just attached the trace logs from all four ndb nodes, created when the problem appeared. I've also increased TotalSendBufferMemory to 8M for each data node and SQL node; before, there was only a global setting of SendBufferMemory=8M. But I don't think a cluster or node restart should be caused by the send buffer being a little too low ...
[22 Jun 2009 14:05]
Robert Klikics
The Linux load average was around ~1.0; here is some "atsar" output:

00:00:01  pswch/s  runq  nrproc  lavg1  lavg5  lavg15
00:10:01    10687     1     119   0.75   0.81    0.81
00:20:01     9587     2     119   0.78   0.85    0.82
00:30:01     2024     1     119   0.22   0.72    0.79
00:40:01     1254     2     119   1.00   0.66    0.63
00:50:01     3072     1     119   0.10   0.57    0.64
01:00:01     6391     1     119   0.37   0.66    0.65
[29 Sep 2009 21:20]
Matthew Bilek
Just got this error minutes ago. We have also seen them in the past.

2009-09-29 14:37:07 [ndbd] INFO -- Watchdog: User time: 23852 System time: 9374
2009-09-29 14:37:07 [ndbd] WARNING -- Watchdog: Warning overslept 3629 ms, expected 100 ms.
2009-09-29 14:37:07 [ndbd] INFO -- Node 21 killed this node because GCP stop was detected
2009-09-29 14:37:07 [ndbd] INFO -- NDBCNTR (Line: 260) 0x0000000a
2009-09-29 14:37:07 [ndbd] INFO -- Error handler shutting down system
2009-09-29 14:37:07 [ndbd] INFO -- Error handler shutdown completed - exiting
2009-09-29 14:37:07 [ndbd] ALERT -- Node 21: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=21 (not connected, accepting connect from 192.168.0.84)
id=22   @192.168.0.85  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 0, Master)
id=23   @192.168.0.86  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1)
id=24   @192.168.0.87  (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1)

[ndb_mgmd(MGM)] 2 node(s)
id=1    @192.168.0.81  (mysql-5.1.34 ndb-7.0.6)
id=2    @192.168.0.82  (mysql-5.1.34 ndb-7.0.6)

[mysqld(API)]   8 node(s)
id=11   @192.168.0.81  (mysql-5.1.34 ndb-7.0.6)
id=12   @192.168.0.81  (mysql-5.1.34 ndb-7.0.6)
id=13   @192.168.0.81  (mysql-5.1.34 ndb-7.0.6)
id=14   @192.168.0.82  (mysql-5.1.34 ndb-7.0.6)
id=15   @192.168.0.82  (mysql-5.1.34 ndb-7.0.6)
id=16   @192.168.0.82  (mysql-5.1.34 ndb-7.0.6)
id=17 (not connected, accepting connect from 192.168.0.80)
id=18 (not connected, accepting connect from 192.168.0.81)
[10 Nov 2009 10:04]
Hartmut Holzgraefe
Could you provide the full collection of config and log files, e.g. using the ndb_error_reporter utility? http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndb-error-reporter.html
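For reference, a minimal invocation sketch of ndb_error_reporter, run on the management host; the config path and user name below are only examples, and --fs (which also copies the data nodes' file systems) makes the archive considerably larger:

shell> ndb_error_reporter /var/lib/mysql-cluster/config.ini
shell> ndb_error_reporter /var/lib/mysql-cluster/config.ini ndbuser --fs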
[10 Nov 2009 10:22]
Robert Klikics
Since my cluster crashed again today (now running 7.0.8a; I also mailed the list), the bug still seems to be present. Log output from today:

2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 8 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 9 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2
2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2
2009-11-10 10:02:57 [MgmtSrvr] WARNING -- Node 2: Detected GCP stop(1)...sending kill to [SignalCounter: m_count=2 0000000000000028]
2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2
2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2
2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 10 missed heartbeat 2
2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2
2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2
2009-11-10 10:03:03 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-10 10:03:03 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-10 10:03:05 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2
2009-11-10 10:03:05 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2
2009-11-10 10:03:05 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2
2009-11-10 10:03:05 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2
2009-11-10 10:03:07 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Here is the download link for the bz2 archive from ndb_error_reporter (23M): http://85.25.144.101/files/ndb_error_report_20091110111356.tar.bz2

The file limit here is 500k, so I had to upload it to one of my own servers ;) Hope this helps!

Regards,
Robert
[11 Nov 2009 13:12]
Robert Klikics
Hello, I just ran into the same problem, even with 7.0.9b :(

2009-11-11 13:55:40 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 9 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 10 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2
2009-11-11 13:55:43 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 3: Node 8 missed heartbeat 2
2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2
2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 9 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 10 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2
2009-11-11 13:55:47 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

After that, all four NDB nodes restarted .. really bad!

I uploaded the data from ndb_error_reporter here: http://85.25.144.101/files/ndb_error_report_20091111135937.tar.bz2
[20 Nov 2009 13:02]
Robert Klikics
The problem appeared again. Here is some more ndb_error_reporter output; I hope it's useful: http://85.25.144.101/files/ndb_error_report_20091120133852.tar.bz2
[22 Feb 2010 19:40]
Jonas Oreland
Hi,

From reading your cluster log, I think you should check how your network is doing: whether the switches are OK, whether full duplex is used, and whether interrupt throttling is limiting throughput (and increasing latency).

/Jonas
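For reference, on Linux the negotiated duplex setting and the NIC's interrupt coalescing ("interrupt throttling") can usually be inspected with ethtool; eth0 below is only an example interface name:

# negotiated speed and duplex of the cluster interface
shell> ethtool eth0 | grep -E 'Speed|Duplex'
# interrupt coalescing settings of the NIC
shell> ethtool -c eth0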
[24 Feb 2010 9:37]
Robert Klikics
Hi Jonas,

the switches are OK and all interfaces are running in full-duplex mode. We're using the default values for interrupt throttling, which should be fine.

Since we switched to ndbd (it was ndbmtd before), the cluster runs almost stably (we've had one failure in two months, which is stable for us).

Sincerely,
Martin
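For context, ndbd and ndbmtd are separate data node binaries started the same way; a minimal sketch, where the connect string is taken from the logs earlier in this report and everything else is illustrative:

# single-threaded data node
shell> ndbd --ndb-connectstring=192.168.10.100:1186
# multi-threaded data node; thread count governed by MaxNoOfExecutionThreads in config.ini
shell> ndbmtd --ndb-connectstring=192.168.10.100:1186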
[5 Mar 2010 12:27]
Jonas Oreland
I don't think we'll get any further on this. Setting "Can't repeat" until further evidence surfaces.