| Bug #45600 | Cluster various node restarts | ||
|---|---|---|---|
| Submitted: | 19 Jun 1:15 | Modified: | 20 Nov 14:02 |
| Reporter: | Robert Klikics | ||
| Status: | Open | ||
| Category: | Server: Cluster | Severity: | S1 (Critical) |
| Version: | mysql-5.1-telco-7.0 | OS: | Linux (Debian 5.0) |
| Assigned to: | Target Version: | ||
| Tags: | crash, restart, node, cluster, 7.0.9b | ||
| Triage: | Triaged: D2 (Serious) / R6 (Needs Assessment) / E6 (Needs Assessment) | ||
[19 Jun 1:15]
Robert Klikics
[19 Jun 1:25]
Robert Klikics
Additional output from a mysqld-node: 090619 00:21:34 mysqld_safe Number of processes running now: 0 090619 00:21:35 mysqld_safe mysqld restarted 090619 0:21:36 InnoDB: Started; log sequence number 0 46409 090619 0:21:36 [Note] NDB: NodeID is 9, management server '192.168.10.100:1186' 090619 0:21:37 [Note] NDB[0]: NodeID: 9, no storage nodes connected (timed out) 090619 0:21:37 [Warning] NDB: server id set to zero will cause any other mysqld with bin log to log with wrong server id 090619 0:21:37 [Note] Starting Cluster Binlog Thread 090619 0:21:39 [Note] Event Scheduler: Loaded 0 events 090619 0:21:39 [Note] /usr/local/mysql/bin/mysqld: bereit für Verbindungen. Version: '5.1.34-ndb-7.0.6-cluster-gpl' Socket: '/var/run/mysqld/mysqld.sock' Port: 3306 MySQL Cluster Server (GPL)
[19 Jun 1:44]
Robert Klikics
Additional output from a ndb-node: 2009-06-19 00:20:27 [ndbd] INFO -- Node 2 killed this node because GCP stop was detected 2009-06-19 00:20:27 [ndbd] INFO -- NDBCNTR (Line: 260) 0x0000000c 2009-06-19 00:20:27 [ndbd] INFO -- Error handler restarting system 2009-06-19 00:20:27 [ndbd] INFO -- Error handler shutdown completed - exiting 2009-06-19 00:20:28 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. 2009-06-19 00:20:31 [ndbd] INFO -- Ndb has terminated (pid 30226) restarting 2009-06-19 00:20:31 [ndbd] INFO -- Configuration fetched from '192.168.10.100:1186', generation: 1 2009-06-19 00:20:31 [ndbd] INFO -- Angel pid: 30225 ndb pid: 26983 NDBMT: MaxNoOfExecutionThreads=2 NDBMT: workers=1 threads=1 2009-06-19 00:20:31 [ndbd] INFO -- NDB Cluster -- DB node 2 2009-06-19 00:20:31 [ndbd] INFO -- mysql-5.1.34 ndb-7.0.6 -- 2009-06-19 00:20:31 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 59610Mb initial: 59630Mb Adding 8192Mb to ZONE_LO (1,262142) Adding 8192Mb to ZONE_LO (262145,262142) Adding 8192Mb to ZONE_HI (524289,262142) Adding 8192Mb to ZONE_HI (786433,262142) Adding 8192Mb to ZONE_HI (1048577,262142) Adding 8192Mb to ZONE_HI (1310721,262142) Adding 8192Mb to ZONE_HI (1572865,262142) Adding 2286Mb to ZONE_HI (1835009,73151) NDBMT: num_threads=4 thr: 0 cpu: 1 err: 38 DBTC(0) DBDIH(0) DBDICT(0) NDBCNTR(0) QMGR(0) NDBFS(0) TRIX(0) DBUTIL(0) thr: 3 cpu: 1 err: 38 CMVMI(0) thr: 2 cpu: 1 err: 38 PGMAN(1) DBACC(1) DBLQH(1) DBTUP(1) BACKUP(1) DBTUX(1) RESTORE(1) saving 0x7f1e80480000 at 0xa16618 (0) 2009-06-19 00:20:32 [ndbd] INFO -- Start initiated (mysql-5.1.34 ndb-7.0.6) saving 0x7f1e80660000 at 0xa21ce8 (0) NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
[22 Jun 15:47]
Jørgen Austvik
Workaround: increase send buffer on data nodes? Please provide trace files and configuration files. No heavy load: could you please describe the load that was on the system?
[22 Jun 16:00]
Robert Klikics
tracelog ndb node1
Attachment: ndb2.tar.gz (application/x-gzip, text), 162.62 KiB.
[22 Jun 16:00]
Robert Klikics
tracelog ndb node2
Attachment: ndb3.tar.gz (application/x-gzip, text), 115.48 KiB.
[22 Jun 16:00]
Robert Klikics
tracelog ndb node3
Attachment: ndb4.tar.gz (application/x-gzip, text), 155.54 KiB.
[22 Jun 16:01]
Robert Klikics
tracelog ndb node4
Attachment: ndb5.tar.gz (application/x-gzip, text), 197.81 KiB.
[22 Jun 16:04]
Robert Klikics
Thanks for your reply. I've just added the tracelogs created at the problem appeared from all 4 ndb-nodes to this report. I've also increased TotalSendBufferMemory to 8M for each data and sql-node. Before there was a global setting of SendBufferMemory=8M. But I think a cluster oder node restart should not be caused if the SendBuffer is litte too low ...
[22 Jun 16:05]
Robert Klikics
The Linux-Loadavg was around ~1.0, here some "atsar" output: 00:00:01 pswch/s runq nrproc lavg1 lavg5 avg15 00:10:01 10687 1 119 0.75 0.81 0.81 00:20:01 9587 2 119 0.78 0.85 0.82 00:30:01 2024 1 119 0.22 0.72 0.79 00:40:01 1254 2 119 1.00 0.66 0.63 00:50:01 3072 1 119 0.10 0.57 0.64 01:00:01 6391 1 119 0.37 0.66 0.65
[29 Sep 23:20]
Matthew Bilek
Just got this error minutes ago. We have also have seen them in the past. 2009-09-29 14:37:07 [ndbd] INFO -- Watchdog: User time: 23852 System time: 9374 2009-09-29 14:37:07 [ndbd] WARNING -- Watchdog: Warning overslept 3629 ms, expected 100 ms. 2009-09-29 14:37:07 [ndbd] INFO -- Node 21 killed this node because GCP stop was detected 2009-09-29 14:37:07 [ndbd] INFO -- NDBCNTR (Line: 260) 0x0000000a 2009-09-29 14:37:07 [ndbd] INFO -- Error handler shutting down system 2009-09-29 14:37:07 [ndbd] INFO -- Error handler shutdown completed - exiting 2009-09-29 14:37:07 [ndbd] ALERT -- Node 21: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. ndb_mgm> show Cluster Configuration --------------------- [ndbd(NDB)] 4 node(s) id=21 (not connected, accepting connect from 192.168.0.84) id=22 @192.168.0.85 (mysql-5.1.34 ndb-7.0.6, Nodegroup: 0, Master) id=23 @192.168.0.86 (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1) id=24 @192.168.0.87 (mysql-5.1.34 ndb-7.0.6, Nodegroup: 1) [ndb_mgmd(MGM)] 2 node(s) id=1 @192.168.0.81 (mysql-5.1.34 ndb-7.0.6) id=2 @192.168.0.82 (mysql-5.1.34 ndb-7.0.6) [mysqld(API)] 8 node(s) id=11 @192.168.0.81 (mysql-5.1.34 ndb-7.0.6) id=12 @192.168.0.81 (mysql-5.1.34 ndb-7.0.6) id=13 @192.168.0.81 (mysql-5.1.34 ndb-7.0.6) id=14 @192.168.0.82 (mysql-5.1.34 ndb-7.0.6) id=15 @192.168.0.82 (mysql-5.1.34 ndb-7.0.6) id=16 @192.168.0.82 (mysql-5.1.34 ndb-7.0.6) id=17 (not connected, accepting connect from 192.168.0.80) id=18 (not connected, accepting connect from 192.168.0.81)
[10 Nov 11:04]
Hartmut Holzgraefe
Could you provide the full collection of config and log files, e.g. using the ndb_error_reporter utility? http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndb-error-reporter.html
[10 Nov 11:22]
Robert Klikics
Since my cluser crashed again today (now running 7.0.8a, I also mailed to the list..) the bug still seems to be present. Log-Output from today: 2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2 2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2 2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 8 missed heartbeat 2 2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 9 missed heartbeat 2 2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2 2009-11-10 10:02:54 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2 2009-11-10 10:02:57 [MgmtSrvr] WARNING -- Node 2: Detected GCP stop(1)...sending kill to [SignalCounter: m_count=2 0000000000000028] 2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2 2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2 2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 10 missed heartbeat 2 2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2 2009-11-10 10:02:58 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2 2009-11-10 10:03:03 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved 2009-11-10 10:03:03 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved 2009-11-10 10:03:05 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2 2009-11-10 10:03:05 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2 2009-11-10 10:03:05 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2 2009-11-10 10:03:05 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2 2009-11-10 10:03:07 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. Here is the download-link for the bz2-archive from ndb_error_reporter (23M): http://85.25.144.101/files/ndb_error_report_20091110111356.tar.bz2 The file-limit here is 500k, so I needed to upload this to one of my own servers ;) Hope this helps! Regards, Robert
[11 Nov 14:12]
Robert Klikics
Hello, I just ran into the same problem , even with 7.0.9b :( 2009-11-11 13:55:40 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2 2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2 2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2 2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2 2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 9 missed heartbeat 2 2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 10 missed heartbeat 2 2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2 2009-11-11 13:55:43 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved 2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved 2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 3: Node 8 missed heartbeat 2 2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2 2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved 2009-11-11 13:55:45 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved 2009-11-11 13:55:45 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved 2009-11-11 13:55:45 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved 2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2 2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2 2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2 2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 9 missed heartbeat 2 2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 10 missed heartbeat 2 2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2 2009-11-11 13:55:47 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. After that all 4 NDB-nodes made a restart .. really bad! I uploaded the data from ndb_error_reporter here: http://85.25.144.101/files/ndb_error_report_20091111135937.tar.bz2
[20 Nov 14:02]
Robert Klikics
The problem did appear again. Here some other ndb_error_reporter output, hope it's useful: http://85.25.144.101/files/ndb_error_report_20091120133852.tar.bz2
