Description:
After killing one of the data nodes (while the systems were under transaction stress) to see if the replication clusters stayed up on a single failure, I issued a shutdown request through ndb_mgmd. The system reported that the request had completed, but the data nodes and management process were still listed in the process tree.
14061 ? 00:01:26 ndb_mgmd
14346 ? 00:00:00 ndbd
14347 ? 01:53:29 ndbd
14381 ? 00:00:00 ndbd
14382 ? 01:47:44 ndbd
14584 pts/0 00:01:22 bankTimer
14592 pts/0 01:25:42 bankMakeGL
14600 pts/0 00:11:41 bankTransaction
ndbdev@ndb08:~/jmiller/builds/run> kill -9 14346
ndbdev@ndb08:~/jmiller/builds/run> kill -9 14347
ndbdev@ndb08:~/jmiller/builds/run> ps
ERROR: 4028 Node failure caused abort of transaction
Status: Temporary error, Classification: Node Recovery error
File: Bank.cpp (Line: 1090)
performMakeGLForAccountType returned NDBT_FAILED
ERROR: 4031 Node failure caused abort of transaction
Status: Temporary error, Classification: Node Recovery error
File: Bank.cpp (Line: 487)
TEMPORARY_ERRROR retrying
14061 ? 00:01:26 ndb_mgmd
14381 ? 00:00:00 ndbd
14382 ? 01:47:46 ndbd
14584 pts/0 00:01:22 bankTimer
14592 pts/0 01:25:44 bankMakeGL
14600 pts/0 00:11:42 bankTransaction
ndb_mgm> shutdown
4 NDB Cluster storage node(s) have shutdown.
NDB Cluster management server shutdown.
ndb_mgm> exit
14061 ? 00:01:26 ndb_mgmd
14381 ? 00:00:00 ndbd
14382 ? 01:48:40 ndbd
14584 pts/0 00:01:23 bankTimer
14592 pts/0 01:26:21 bankMakeGL
14600 pts/0 00:11:47 bankTransaction
ndbdev@ndb08:~/jmiller/builds/run> kill -9 14584
[1] Killed NDB_CONNECTSTRING=ndb08:14000 ../bin/bankTimer -w 5 >>btBank.out
ndbdev@ndb08:~/jmiller/builds/run> kill -9 14592
[2]- Killed NDB_CONNECTSTRING=ndb08:14000 ../bin/bankMakeGL >>bGL_Bank.out
ndbdev@ndb08:~/jmiller/builds/run> kill -9 14600
[3]+ Killed NDB_CONNECTSTRING=ndb08:14000 ../bin/bankTransactionMaker >>bTrans_Bank.out
ndbdev@ndb08:~/jmiller/builds/run> ../bin/ndb_mgm ndb11:14000
-- NDB Cluster -- Management Client --
ndb_mgm> show
Unable to connect with connect string: nodeid=0,ndb11:14000
Retrying every 5 seconds. Attempts left: 2, 1 failed.
ndb_mgm> exit
14061 ? 00:01:24 ndb_mgmd
14381 ? 00:00:00 ndbd
14382 ? 01:49:08 ndbd
NOTES:
1) Two BANK tests had been running for the last 5 hours, one against BANK and one against BANK2. This consisted of 2 bankTimer, 2 bankMakeGL and 2 bankTransaction processes.
2) Just completed the replication_sample.txt script.
3) Transactions were still running when the master data node was killed (nodeid=2, the TC).
4) The cluster was still responding after the data node (nodeid=2) was terminated.
5) Transactions were still running when a regular "shutdown" was issued in ndb_mgmd.
All logs have been moved off to ndb08:/tmp/bug#### where #### = this bug report number.
On cluster restart, the cluster hangs with 2 nodes in phase 0 and 2 nodes in phase 1, just like http://bugs.mysql.com/bug.php?id=10893
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @10.100.1.93 (Version: 5.1.0, starting, Nodegroup: 0)
id=3 @10.100.1.94 (Version: 5.1.0, starting, Nodegroup: 0)
id=4 @10.100.1.93 (Version: 5.1.0, starting, Nodegroup: 0)
id=5 @10.100.1.94 (Version: 5.1.0, starting, Nodegroup: 0, Master)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @10.100.1.93 (Version: 5.1.0)
[mysqld(API)] 10 node(s)
id=6 (not connected, accepting connect from ndb08)
id=7 (not connected, accepting connect from ndb08)
id=8 (not connected, accepting connect from ndb08)
id=9 (not connected, accepting connect from ndb09)
id=10 (not connected, accepting connect from ndb09)
id=11 (not connected, accepting connect from ndb09)
id=12 (not connected, accepting connect from ndb10)
id=13 (not connected, accepting connect from ndb10)
id=14 (not connected, accepting connect from ndb10)
id=15 (not connected, accepting connect from ndb10)
ndb_mgm> all status
Node 2: starting (Phase 1) (Version 5.1.0)
Node 3: starting (Phase 1) (Version 5.1.0)
Node 4: starting (Phase 0) (Version 5.1.0)
Node 5: starting (Phase 0) (Version 5.1.0)
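To make the hang easier to spot when checking repeatedly, the "all status" output above can be summarized per start phase. The following is only a rough sketch: the helper name start_phases is hypothetical, and it assumes the status lines keep the exact "Node N: starting (Phase P)" form shown in this report.

```python
import re
from collections import Counter

# Matches lines such as "Node 2: starting (Phase 1) (Version 5.1.0)".
# The pattern is an assumption based on the output pasted in this report.
STATUS_LINE = re.compile(r"Node (\d+): starting \(Phase (\d+)\)")

def start_phases(status_output):
    """Map node id -> start phase for every node reported as starting."""
    return {int(m.group(1)): int(m.group(2))
            for m in STATUS_LINE.finditer(status_output)}

sample = """\
Node 2: starting (Phase 1) (Version 5.1.0)
Node 3: starting (Phase 1) (Version 5.1.0)
Node 4: starting (Phase 0) (Version 5.1.0)
Node 5: starting (Phase 0) (Version 5.1.0)
"""

phases = start_phases(sample)
per_phase = Counter(phases.values())
for phase in sorted(per_phase):
    print("phase", phase, "->", per_phase[phase], "node(s)")
```

If the same counts come back across several checks, the nodes are probably stuck rather than slowly progressing.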
How to repeat:
Follow steps above.
Suggested fix:
The management client/server should make sure the ndb processes are shut down.
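Until that is fixed, a check like the one below could be run after "shutdown" to confirm that nothing was left behind. It is a minimal sketch, not part of the cluster tools: the function name leftover_cluster_processes is hypothetical, and it assumes the default "PID TTY TIME CMD" columns of ps as pasted in the description.

```python
# Process names assumed to belong to the cluster, taken from the listings above.
CLUSTER_PROCESSES = {"ndbd", "ndb_mgmd"}

def leftover_cluster_processes(ps_output):
    """Return (pid, name) pairs for cluster processes still present in ps output."""
    leftovers = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not "PID TTY TIME CMD".
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        pid, name = int(fields[0]), fields[3]
        if name in CLUSTER_PROCESSES:
            leftovers.append((pid, name))
    return leftovers

# Listing captured after "ndb_mgm> shutdown" in the description.
sample = """\
14061 ? 00:01:26 ndb_mgmd
14381 ? 00:00:00 ndbd
14382 ? 01:48:40 ndbd
14584 pts/0 00:01:23 bankTimer
14592 pts/0 01:26:21 bankMakeGL
14600 pts/0 00:11:47 bankTransaction
"""

for pid, name in leftover_cluster_processes(sample):
    print("still running:", pid, name)
```

An empty result after a shutdown would mean the request really did stop every data node and the management server on that host.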