Bug #23148 NDB with REPLICA=3 (or ODD NUMBER)
Submitted: 10 Oct 2006 19:31 Modified: 30 Oct 2006 14:37
Reporter: andy kwong
Status: Duplicate
Category:MySQL Server Severity:S2 (Serious)
Version:5.1.11 Beta OS:Linux (Redhat Linux ES4)
Assigned to: CPU Architecture:Any
Tags: cluster, ndb, replica

[10 Oct 2006 19:31] andy kwong
Description:
I want some extra redundancy, so I'm testing replica = 3 with 3 NDB nodes, 3 API nodes and 1 MGM node. When the replica count is set to 3 (and possibly any odd number), the cluster becomes unstable: it is not completely down, but it is unavailable from time to time because the mysqld nodes keep restarting automatically.

How to repeat:
To simulate a problem, I shut down one of the NDB nodes and got an error on the ndb_mgm console. After that, the mysqld nodes were intermittently unable to return data. While ndb_mgm shows a mysqld as connected, queries work. A second later, ndb_mgm shows that mysqld as disconnected and the query fails. I checked the mysqld error log, and it looks like mysqld is restarting over and over.

ndb_mgm> show 
Cluster Configuration 
--------------------- 
[ndbd(NDB)] 3 node(s) 
id=2 @10.0.0.144 (Version: 5.1.11, Nodegroup: 0, Master) 
id=3 (not connected, accepting connect from 10.0.0.145) 
id=4 @10.0.0.146 (Version: 5.1.11, Nodegroup: 0) 

[ndb_mgmd(MGM)] 1 node(s) 
id=1 @10.0.0.140 (Version: 5.1.11) 

[mysqld(API)] 3 node(s) 
id=5 @10.0.0.141 (Version: 5.1.11) 
id=6 @10.0.0.142 (Version: 5.1.11) 
id=7 @10.0.0.143 (Version: 5.1.11) 

ndb_mgm> Node 3: Forced node shutdown completed. Occured during startphase. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. 

ndb_mgm> Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'. 
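To catch the flapping described above without watching the console, the `show` output can be parsed for nodes that are not connected. The following is only a minimal sketch: the sample text is copied from the output in this report, the function names `parse_show` and `disconnected` are made up for illustration, and the line format is assumed to match what 5.1.11 printed here.

```python
import re

# Sample copied from the `show` output in this report.
SHOW_OUTPUT = """\
[ndbd(NDB)] 3 node(s)
id=2 @10.0.0.144 (Version: 5.1.11, Nodegroup: 0, Master)
id=3 (not connected, accepting connect from 10.0.0.145)
id=4 @10.0.0.146 (Version: 5.1.11, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1 @10.0.0.140 (Version: 5.1.11)

[mysqld(API)] 3 node(s)
id=5 @10.0.0.141 (Version: 5.1.11)
id=6 @10.0.0.142 (Version: 5.1.11)
id=7 @10.0.0.143 (Version: 5.1.11)
"""

# "[ndbd(NDB)] 3 node(s)" -> process name and node type
SECTION_RE = re.compile(r"^\[(\w+)\((\w+)\)\]")
# "id=2 @10.0.0.144 ..." (connected) or "id=3 (not connected, ..."
NODE_RE = re.compile(r"^id=(\d+)\s+(@\S+|\(not connected)")


def parse_show(text):
    """Return {node_type: [(node_id, connected), ...]} from `show` text."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        section = SECTION_RE.match(line)
        if section:
            current = section.group(2)  # "NDB", "MGM" or "API"
            nodes[current] = []
            continue
        node = NODE_RE.match(line)
        if node and current is not None:
            connected = node.group(2).startswith("@")
            nodes[current].append((int(node.group(1)), connected))
    return nodes


def disconnected(nodes, node_type):
    """List the ids of nodes of the given type that are not connected."""
    return [node_id for node_id, ok in nodes.get(node_type, []) if not ok]


if __name__ == "__main__":
    parsed = parse_show(SHOW_OUTPUT)
    print("data nodes down:", disconnected(parsed, "NDB"))
    print("API nodes down:", disconnected(parsed, "API"))
```

Running the same check in a loop against fresh `show` output would make the intermittent mysqld disconnects visible as they happen.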

mysql> select count(*) from account; 
+----------+ 
| count(*) | 
+----------+ 
| 1000000 | 
+----------+ 
1 row in set (0.00 sec) 

mysql> select count(*) from account; 
ERROR 2006 (HY000): MySQL server has gone away 
No connection. Trying to reconnect... 
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2) 
ERROR: 
Can't connect to the server 

The mysqld error log:
Number of processes running now: 0 
061008 02:32:46 mysqld restarted 
061008 2:32:46 InnoDB: Started; log sequence number 0 46403 
061008 2:32:53 [Note] Starting MySQL Cluster Binlog Thread 
/usr/local/mysql/bin/mysqld: Table 'general_log' is marked as crashed and should be repaired 
/usr/local/mysql/bin/mysqld: Table 'slow_log' is marked as crashed and should be repaired 
061008 2:32:54 [Note] /usr/local/mysql/bin/mysqld: ready for connections. 
Version: '5.1.11-beta' socket: '/tmp/mysql.sock' port: 3306 MySQL Community Server (GPL) 
061008 2:32:54 [Note] SCHEDULER: Manager thread booting 
061008 2:32:54 [Note] SCHEDULER: Loaded 0 events 
061008 2:32:54 [Note] SCHEDULER: Suspending operations 
INVALID SUB_GCP_COMPLETE_REP 
gci: 1630 
sender: 1010004 
count: 5 
bucket count: 4294967295 
nodes: 3 
mysqld got signal 6; 
This could be because you hit a bug. It is also possible that this binary 
or one of the libraries it was linked against is corrupt, improperly built, 
or misconfigured. This error can also be caused by malfunctioning hardware. 
We will try our best to scrape up some info that will hopefully help diagnose 
the problem, but since we have already crashed, something is definitely wrong 
and this may fail. 

key_buffer_size=8388600 
read_buffer_size=131072 
max_used_connections=0 
max_connections=100 
threads_connected=1 
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 225791 K 
bytes of memory 
Hope that's ok; if not, decrease some variables in the equation. 

Number of processes running now: 0 
061008 02:32:55 mysqld restarted 
061008 2:32:55 InnoDB: Started; log sequence number 0 46403 
061008 2:33:02 [Note] Starting MySQL Cluster Binlog Thread 
/usr/local/mysql/bin/mysqld: Table 'general_log' is marked as crashed and should be repaired 
/usr/local/mysql/bin/mysqld: Table 'slow_log' is marked as crashed and should be repaired 
061008 2:33:04 [Note] /usr/local/mysql/bin/mysqld: ready for connections. 
Version: '5.1.11-beta' socket: '/tmp/mysql.sock' port: 3306 MySQL Community Server (GPL) 
061008 2:33:04 [Note] SCHEDULER: Manager thread booting 
INVALID SUB_GCP_COMPLETE_REP 
gci: 1633 
sender: 1010004 
count: 5 
bucket count: 4294967295 
nodes: 3 
mysqld got signal 6; 
This could be because you hit a bug. It is also possible that this binary 
or one of the libraries it was linked against is corrupt, improperly built, 
or misconfigured. This error can also be caused by malfunctioning hardware. 
We will try our best to scrape up some info that will hopefully help diagnose 
the problem, but since we have already crashed, something is definitely wrong 
and this may fail. 

key_buffer_size=8388600 
read_buffer_size=131072 
max_used_connections=0 
max_connections=100 
threads_connected=1 
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 225791 K 
bytes of memory 
Hope that's ok; if not, decrease some variables in the equation.
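The restart loop in the log above can be summarized with a small script. This is a sketch only: the sample is an abridged copy of the log lines in this report, the function name `summarize` is hypothetical, and the field names are taken from the dump as printed rather than from any documented format.

```python
import re

# Abridged copy of the error log in this report.
SAMPLE_LOG = """\
061008 02:32:46 mysqld restarted
061008 2:32:54 [Note] SCHEDULER: Suspending operations
INVALID SUB_GCP_COMPLETE_REP
gci: 1630
sender: 1010004
count: 5
bucket count: 4294967295
nodes: 3
mysqld got signal 6;
061008 02:32:55 mysqld restarted
INVALID SUB_GCP_COMPLETE_REP
gci: 1633
sender: 1010004
count: 5
bucket count: 4294967295
nodes: 3
mysqld got signal 6;
"""

FIELD_RE = re.compile(r"^(gci|sender|count|bucket count|nodes):\s*(\d+)\s*$")


def summarize(log_text):
    """Return (restart_count, [fields of each SUB_GCP_COMPLETE_REP dump])."""
    restarts = 0
    crashes = []
    current = None  # dict being filled while inside a dump
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.endswith("mysqld restarted"):
            restarts += 1
            current = None
        elif line == "INVALID SUB_GCP_COMPLETE_REP":
            current = {}
            crashes.append(current)
        elif current is not None:
            field = FIELD_RE.match(line)
            if field:
                current[field.group(1)] = int(field.group(2))
            else:
                current = None  # first non-field line ends the dump
    return restarts, crashes


if __name__ == "__main__":
    restarts, crashes = summarize(SAMPLE_LOG)
    print("restarts:", restarts)
    for crash in crashes:
        # 4294967295 equals 2**32 - 1, the largest unsigned 32-bit value.
        suspicious = crash.get("bucket count") == 2**32 - 1
        print("gci", crash.get("gci"), "bucket count is 2**32 - 1:", suspicious)
```

The bucket count of 4294967295 is 2**32 - 1, which is also how -1 appears when read as an unsigned 32-bit integer. Whether that value is a sentinel or an uninitialized field is not established by this report.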

Suggested fix:
None, except perhaps to use only an even number of replicas?

[10 Oct 2006 20:19] Geert Vanderkelen
Thanks for the report, I could indeed reproduce this.
I'm going to dig a bit further into it before setting this to verified.

[11 Oct 2006 9:16] Geert Vanderkelen
Thanks for the report Andy!
Verified using latest 5.1bk, could not repeat using latest 5.0bk.

When NoOfReplicas > 2 (3 or 4) and you shut down 1 data node, the mysqld connected to the cluster crashes. When you bring the data node back up, the mysqld is still unable to reconnect.

A related bug is #18621, but I think this is a different problem.

Used config.ini:

[COMPUTER]
Id = 1
Hostname = somehost

[API DEFAULT]

[TCP DEFAULT]

[NDB_MGMD DEFAULT]
DataDir=/data1/users/geert/single/

[NDBD DEFAULT]
DataDir=/data1/users/geert/single/
NoOfReplicas: 3
DataMemory: 80M
IndexMemory: 20M
DiskLess: 1

[NDB_MGMD]
id=1
ExecuteOnComputer = 1
ArbitrationRank = 1

[NDBD]
id=3
ExecuteOnComputer = 1

[NDBD]
id=4
ExecuteOnComputer = 1

[NDBD]
id=5
ExecuteOnComputer = 1

[MYSQLD]
id=7

[MYSQLD]
[MYSQLD]
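As a sanity check on a config.ini like the one above, the number of node groups can be derived from the data node count and NoOfReplicas (data nodes divided by replicas). The sketch below is illustrative only: the sample is an abridged copy of this config, the helper names `summarize_config` and `node_groups` are hypothetical, and the parser handles just the subset of the syntax used here.

```python
import re

# Abridged copy of the config.ini in this report.
CONFIG_INI = """\
[NDBD DEFAULT]
NoOfReplicas: 3

[NDBD]
id=3

[NDBD]
id=4

[NDBD]
id=5
"""


def summarize_config(text):
    """Return (number of [NDBD] sections, NoOfReplicas from [NDBD DEFAULT])."""
    section = None
    data_nodes = 0
    replicas = None  # stays None if the parameter is not set explicitly
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop "#" comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().upper()
            if section == "NDBD":
                data_nodes += 1
            continue
        # Both "name=value" and "name: value" appear in the config above.
        parts = re.split(r"\s*[=:]\s*", line, maxsplit=1)
        if len(parts) == 2 and section == "NDBD DEFAULT":
            if parts[0].lower() == "noofreplicas":
                replicas = int(parts[1])
    return data_nodes, replicas


def node_groups(data_nodes, replicas):
    """Node groups = data nodes / replicas; the division must be exact."""
    if replicas < 1 or data_nodes % replicas != 0:
        raise ValueError("data node count must be a multiple of NoOfReplicas")
    return data_nodes // replicas


if __name__ == "__main__":
    nodes, replicas = summarize_config(CONFIG_INI)
    print("data nodes:", nodes, "replicas:", replicas)
    print("node groups:", node_groups(nodes, replicas))
```

With 3 data nodes and NoOfReplicas set to 3 there is a single node group, which is consistent with the `Nodegroup: 0` shown for every data node in the reporter's `show` output. The configuration is therefore valid as written, and the crash is not explained by a node count mismatch.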

Used my.cnf:

[mysqld]
skip-networking
basedir=/data1/mysql/5.1bk
datadir=/data1/users/geert/single/mysql
socket=/tmp/mysql_geert.sock
port=3306

ndbcluster
ndb-connectstring="localhost"

[30 Oct 2006 14:37] Jonas Oreland
Hi

This is an exact duplicate of #18621, as Geert suggested.

Currently, more than 2 replicas do not work well in 5.1
  (due to replication, which is always at least "almost" on in order to
   handle distribution of DDL)

I'm not sure that this will be fixed before 5.1 GA

/Jonas