| Bug #21545 | no arbitration after node loss | ||
|---|---|---|---|
| Submitted: | 9 Aug 2006 16:30 | Modified: | 27 Oct 2010 20:05 |
| Reporter: | Hartmut Holzgraefe | Email Updates: | |
| Status: | Can't repeat | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
| Version: | mysql-5.0 | OS: | Linux (linux) |
| Assigned to: | Hartmut Holzgraefe | CPU Architecture: | Any |
| Tags: | 5.0bk | ||
[10 Aug 2006 11:38]
Jonas Oreland
Just an idea... maybe TCP load got so high that the node lost its connection to ndb_mgmd (or ndb_mgmd did not reply within ArbitrationTimeout).
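For reference, both knobs mentioned in this report live in the cluster's `config.ini`. A minimal sketch of where they go; the values below are illustrative assumptions, not tuned recommendations:

```ini
# config.ini fragment -- illustrative values only
[ndbd default]
NoOfReplicas=2
# Give data nodes longer to hear back from the arbitrator (milliseconds)
ArbitrationTimeout=5000

[tcp default]
# Larger per-transporter send buffer, to avoid error 6052
# ("Signal lost, out of send buffer memory")
SendBufferMemory=4M
```

`SendBufferMemory` is a TCP transporter parameter, so it belongs in a `[tcp]` or `[tcp default]` section, while `ArbitrationTimeout` is a data-node parameter.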
[18 Sep 2006 12:01]
li zhou
Can't reproduce in the 5.0-ndb-bj tree; everything works well.
[19 Sep 2006 6:07]
li zhou
Can't reproduce in the 5.0-main tree either.
[12 Oct 2006 16:08]
Daniel Rey
I have a 2-node cluster and got the same 2305 error on a node after rebooting the other node. It happened on both nodes (on the first one, I restarted all processes manually, rebooted the other, and the same thing happened). This left me with both ndbd processes down, so no database access at all. Maybe you can reproduce the bug this way; I hope it helps. The reboots were for a kernel upgrade (both nodes run RHEL 4.4 and MySQL 5.0.24a).
[10 Nov 2006 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[10 Nov 2006 12:23]
Daniel Rey
I sent the feedback by email, sorry. Copying it here... Attached you can find the network diagram and the config files you asked for. IP addresses have been masked except for the last octet; the DB servers are 68 and 69. You can see that I have now configured management servers on 66 and 67. Since this change the "ndbd dies on reboot" issue has gone away; it happened when 68 and 69 were also the management servers.
[10 Nov 2006 12:29]
Daniel Rey
I can't upload the file here, I emailed it to Li Zhou on October 27.

Description:

Found this while trying to create a copy of a large table using:

```
CREATE TABLE t_new LIKE t_old;
INSERT INTO t_new SELECT * FROM t_old;
```

The first node died with:

```
Time: Wednesday 9 August 2006 - 16:51:18
Status: Permanent error, external action needed
Message: Signal lost, out of send buffer memory, please increase SendBufferMemory or lower the load (Resource configuration error)
Error: 6052
Error data: Remote node id 2.
Error object: TransporterCallback.cpp
Program: ndbd
Pid: 13219
Trace: ./ndb_1_trace.log.2
Version: Version 5.0.23
***EOM***
```

So far, so good; maybe I should have copied that table in smaller chunks. But then the other node went down, too:

```
626Time: Wednesday 9 August 2006 - 16:51:32
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 4556) 0x0000000e
Program: ndbd
Pid: 25427
Trace: ./ndb_2_trace.log.2
Version: Version 5.0.23
***EOM***
```

Two strange things here:

- This is a 2-node, 2-replica cluster, and a mysqld and the ndb_mgmd were up and running, so why did arbitration fail?
- Why is the second error log file truncated? It really looks like this, starting with "626Time: Wednesday 9 August 2006 - 16:51:32" on line 1, not the usual "Current byte-offset of file-pointer is: 1067" line.

The cluster log looks like this; note that here the end is truncated (all systems had several GB of free disk space):

```
2006-08-09 16:49:58 [MgmSrvr] INFO -- Node 1: Local checkpoint 91 started. Keep GCI = 38266 oldest restorable GCI = 38250
2006-08-09 16:51:16 [MgmSrvr] WARNING -- Node 1: Transporter to node 2 reported error 0x16
2006-08-09 16:51:17 [MgmSrvr] WARNING -- Node 1: Transporter to node 2 reported error 0x16
2006-08-09 16:51:19 [MgmSrvr] INFO -- Node 3: Node 1 Connected
2006-08-09 16:51:19 [MgmSrvr] ALERT -- Node 1: Forced node shutdown completed. Initiated by signal 6. Caused by error 6052: 'Signal lost, out of send buffer memory, please increase SendBufferMemory or lower the load(Resource configuration error). Permanent error, external action needed'.
2006-08-09 16:51:21 [MgmSrvr] INFO -- Node 3: Node 2 Connected
2006-08-09 16:51:33 [MgmSrvr] ALERT -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary err
```

How to repeat:

Not sure yet whether it is reproducible; trying to repeat just now.
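The reporter's aside about copying the table in smaller chunks can be made concrete. A sketch of the idea, assuming the table has a dense integer primary key: generate a series of range-bounded `INSERT ... SELECT` statements so each transaction stays small enough for NDB's send buffers. The helper name, the `id` column, and the batch size are illustrative assumptions, not anything from the report:

```python
def chunked_copy_statements(src, dst, pk, min_id, max_id, batch=10000):
    """Yield INSERT ... SELECT statements that copy rows from `src` to
    `dst` in key ranges of at most `batch` keys, instead of one huge
    statement that must fit entirely in the transporter send buffers."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch - 1, max_id)
        yield (f"INSERT INTO {dst} SELECT * FROM {src} "
               f"WHERE {pk} BETWEEN {lo} AND {hi};")
        lo = hi + 1

# Example: copy t_old into t_new 10000 keys at a time
for stmt in chunked_copy_statements("t_old", "t_new", "id", 1, 25000):
    print(stmt)
```

Each generated statement commits independently, so a failure partway through leaves a resumable key range rather than a rolled-back full-table transaction.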