MySQL Bugs: #21545: no arbitration after node loss

Bug #21545: no arbitration after node loss
Submitted: 9 Aug 2006 16:30
Modified: 27 Oct 2010 20:05
Reporter: Hartmut Holzgraefe
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.0
OS: Linux (linux)
Assigned to: Hartmut Holzgraefe
CPU Architecture: Any
Tags: 5.0bk

Description:
Found this while trying to create a copy of a large table using:

  CREATE TABLE t_new LIKE t_old;
  INSERT INTO t_new SELECT * FROM t_old;

The first node died with:

  Time: Wednesday 9 August 2006 - 16:51:18
  Status: Permanent error, external action needed
  Message: Signal lost, out of send buffer memory, please increase     
    SendBufferMemory or lower the load (Resource configuration error)
  Error: 6052
  Error data: Remote note id 2.
  Error object: TransporterCallback.cpp
  Program: ndbd
  Pid: 13219
  Trace: ./ndb_1_trace.log.2
  Version: Version 5.0.23
  ***EOM***

So far, so good; I should probably have copied that table
in smaller chunks. But then the other node went
down, too:

  626Time: Wednesday 9 August 2006 - 16:51:32
  Status: Temporary error, restart node
  Message: Node lost connection to other nodes and can not form a 
  unpartitioned   cluster, please investigate if there are error(s) 
  on other node(s) (Arbitration error)
  Error: 2305
  Error data: Arbitrator decided to shutdown this node
  Error object: QMGR (Line: 4556) 0x0000000e
  Program: ndbd
  Pid: 25427
  Trace: ./ndb_2_trace.log.2
  Version: Version 5.0.23
  ***EOM***

Two strange things here: 

- this is a 2 node, 2 replica cluster; a mysqld and the ndb_mgmd
  were up and running, so why did arbitration fail?

- why is the 2nd error log file truncated? it really looks like this,
  starting with "626Time: Wednesday 9 August 2006 - 16:51:32" on line 1,
  not the usual "Current byte-offset of file-pointer is: 1067" line ...
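
As background to the first question: the MySQL Cluster documentation describes arbitration as necessary only when no node group has all of its nodes alive in the surviving partition. The sketch below illustrates that rule only; it is not the actual QMGR logic, and the function name and return strings are made up for illustration.

```python
def survivor_action(node_groups, alive):
    """Decide what a surviving partition should do (illustrative only).

    node_groups: list of sets of data node ids, one set per node group
    alive:       iterable of data node ids still reachable in this partition
    """
    alive = set(alive)
    # A partition that has lost every node of some node group is missing data.
    if any(not (group & alive) for group in node_groups):
        return "shutdown"
    # If one node group is complete, no other partition can be viable.
    if any(group <= alive for group in node_groups):
        return "continue"
    # Otherwise a split-brain is possible, so the arbitrator must decide.
    return "ask_arbitrator"


# The setup in this report: one node group {1, 2}, node 1 has died.
print(survivor_action([{1, 2}], {2}))
```

With 2 nodes and 2 replicas there is a single node group, so the surviving node always has to consult the arbitrator (here the ndb_mgmd or a mysqld) before it may continue.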

The cluster log looks like this; note that here the end is truncated
(all systems had several GB of free disk space):

2006-08-09 16:49:58 [MgmSrvr] INFO     -- Node 1: Local checkpoint 91 started. Keep GCI = 38266 oldest restorable GCI = 38250
2006-08-09 16:51:16 [MgmSrvr] WARNING  -- Node 1: Transporter to node 2 reported error 0x16
2006-08-09 16:51:17 [MgmSrvr] WARNING  -- Node 1: Transporter to node 2 reported error 0x16
2006-08-09 16:51:19 [MgmSrvr] INFO     -- Node 3: Node 1 Connected
2006-08-09 16:51:19 [MgmSrvr] ALERT    -- Node 1: Forced node shutdown completed. Initiated by signal 6. Caused by error 6052: 'Signal lost, out of send buffer memory, please increase SendBufferMemory or lower the load(Resource configuration error). Permanent error, external action needed'.
2006-08-09 16:51:21 [MgmSrvr] INFO     -- Node 3: Node 2 Connected
2006-08-09 16:51:33 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary err
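
Copying the table in smaller chunks, as suggested in the description, might look roughly like the following. This is a minimal sketch that assumes t_old has an integer key column; the column name `id`, the key range and the chunk size are all hypothetical, and the generated statements would still have to be run one at a time through a client.

```python
def chunked_copy_statements(src, dst, key, min_id, max_id, chunk_size):
    """Build INSERT ... SELECT statements that each cover at most
    chunk_size consecutive key values, so every transaction stays small."""
    statements = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk_size - 1, max_id)
        statements.append(
            f"INSERT INTO {dst} SELECT * FROM {src} "
            f"WHERE {key} BETWEEN {lo} AND {hi};"
        )
        lo = hi + 1
    return statements


# Hypothetical key range 1..25, copied 10 key values at a time.
for stmt in chunked_copy_statements("t_old", "t_new", "id", 1, 25, 10):
    print(stmt)
```

Smaller chunks keep each transaction short, which should reduce pressure on the send buffers compared with a single INSERT ... SELECT over the whole table.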

How to repeat:
Not sure yet whether it is reproducible;
trying to repeat it just now.

Just an idea: maybe the TCP load got so high that the node lost its connection to ndb_mgmd
  (or ndb_mgmd did not reply within ArbitrationTimeout).

Can't reproduce in the 5.0-ndb-bj tree. Everything works well.

Can't reproduce in the 5.0-main tree.

I have a 2 node cluster and got the same 2305 error on a node after rebooting the other node. It happened on both nodes: after the first failure I restarted all processes manually, then rebooted the other node, and the same thing happened. This left me with both ndbd processes down, so no database access at all. Maybe you can reproduce the bug this way; hope it helps.

Reboots were because of kernel upgrade (both nodes run RHEL 4.4, and MySQL 5.0.24a).

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

I sent the feedback by email, sorry. Copying it here...

Attached you can find the network diagram and the config files you asked for. IP addresses have been masked except for the last octet; the DB servers are 68 and 69.

You can see that I have now configured management servers on 66 and 67. Since this change the "ndbd dies on reboot" issue has gone; it happened when 68 and 69 were also the management servers.

I can't upload the file here, I emailed it to Li Zhou on October 27.