Bug #10938 System shutdown does not work correctly for clusters under load
Submitted: 28 May 2005 2:33
Modified: 16 Sep 2005 11:45
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 4.1, 5.0
OS: Linux (Linux)
Assigned to: Jonas Oreland
CPU Architecture: Any

[28 May 2005 2:33] Jonathan Miller
Description:
After killing one of the data nodes (while the systems were under transaction stress) to see whether the replication clusters stayed up after a single failure, I issued a shutdown request through ndb_mgmd. The system reported that the request had completed, but the data nodes and management process were still listed in the process tree.

14061 ?        00:01:26 ndb_mgmd
14346 ?        00:00:00 ndbd
14347 ?        01:53:29 ndbd
14381 ?        00:00:00 ndbd
14382 ?        01:47:44 ndbd
14584 pts/0    00:01:22 bankTimer
14592 pts/0    01:25:42 bankMakeGL
14600 pts/0    00:11:41 bankTransaction

ndbdev@ndb08:~/jmiller/builds/run> kill -9 14346
ndbdev@ndb08:~/jmiller/builds/run> kill -9 14347
ndbdev@ndb08:~/jmiller/builds/run> ps
ERROR: 4028 Node failure caused abort of transaction
           Status: Temporary error, Classification: Node Recovery error
           File: Bank.cpp (Line: 1090)
performMakeGLForAccountType returned NDBT_FAILED
 ERROR: 4031 Node failure caused abort of transaction
           Status: Temporary error, Classification: Node Recovery error
           File: Bank.cpp (Line: 487)
TEMPORARY_ERRROR retrying
 
14061 ?        00:01:26 ndb_mgmd
14381 ?        00:00:00 ndbd
14382 ?        01:47:46 ndbd
14584 pts/0    00:01:22 bankTimer
14592 pts/0    01:25:44 bankMakeGL
14600 pts/0    00:11:42 bankTransaction

ndb_mgm> shutdown
4 NDB Cluster storage node(s) have shutdown.
NDB Cluster management server shutdown.
ndb_mgm> exit

14061 ?        00:01:26 ndb_mgmd
14381 ?        00:00:00 ndbd
14382 ?        01:48:40 ndbd
14584 pts/0    00:01:23 bankTimer
14592 pts/0    01:26:21 bankMakeGL
14600 pts/0    00:11:47 bankTransaction

ndbdev@ndb08:~/jmiller/builds/run> kill -9 14584
[1]   Killed                  NDB_CONNECTSTRING=ndb08:14000 ../bin/bankTimer -w 5 >>btBank.out
ndbdev@ndb08:~/jmiller/builds/run> kill -9 14592
[2]-  Killed                  NDB_CONNECTSTRING=ndb08:14000 ../bin/bankMakeGL >>bGL_Bank.out
ndbdev@ndb08:~/jmiller/builds/run> kill -9 14600
[3]+  Killed                  NDB_CONNECTSTRING=ndb08:14000 ../bin/bankTransactionMaker >>bTrans_Bank.out
ndbdev@ndb08:~/jmiller/builds/run> ../bin/ndb_mgm ndb11:14000
-- NDB Cluster -- Management Client --
ndb_mgm> show
Unable to connect with connect string: nodeid=0,ndb11:14000
Retrying every 5 seconds. Attempts left: 2, 1 failed.
ndb_mgm> exit

14061 ?        00:01:24 ndb_mgmd
14381 ?        00:00:00 ndbd
14382 ?        01:49:08 ndbd

NOTES:

1) Two BANK tests had been running for the last 5 hours, one against BANK and one against BANK2. This consisted of 2 bankTimer, 2 bankMakeGL and 2 bankTransaction processes.
2) I had just completed the replication_sample.txt script.
3) Transactions were still running when the master data node was killed (nodeid=2, the TC).
4) The cluster was still responding after the data node (nodeid=2) was terminated.
5) Transactions were still running when a regular "shutdown" was issued in ndb_mgmd.

All logs have been moved off to ndb08:/tmp/bug#### where #### = this bug report number.

On cluster restart, the cluster hangs with 2 nodes in phase 0 and 2 nodes in phase 1, just like http://bugs.mysql.com/bug.php?id=10893

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=2    @10.100.1.93  (Version: 5.1.0, starting, Nodegroup: 0)
id=3    @10.100.1.94  (Version: 5.1.0, starting, Nodegroup: 0)
id=4    @10.100.1.93  (Version: 5.1.0, starting, Nodegroup: 0)
id=5    @10.100.1.94  (Version: 5.1.0, starting, Nodegroup: 0, Master)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @10.100.1.93  (Version: 5.1.0)

[mysqld(API)]   10 node(s)
id=6 (not connected, accepting connect from ndb08)
id=7 (not connected, accepting connect from ndb08)
id=8 (not connected, accepting connect from ndb08)
id=9 (not connected, accepting connect from ndb09)
id=10 (not connected, accepting connect from ndb09)
id=11 (not connected, accepting connect from ndb09)
id=12 (not connected, accepting connect from ndb10)
id=13 (not connected, accepting connect from ndb10)
id=14 (not connected, accepting connect from ndb10)
id=15 (not connected, accepting connect from ndb10)

ndb_mgm> all status
Node 2: starting (Phase 1) (Version 5.1.0)
Node 3: starting (Phase 1) (Version 5.1.0)
Node 4: starting (Phase 0) (Version 5.1.0)
Node 5: starting (Phase 0) (Version 5.1.0)
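
The split between start phases in the "all status" output above can be checked mechanically. The following is a minimal sketch in Python, not part of any MySQL Cluster tooling: the function name nodes_by_start_phase is hypothetical, and it assumes the line format is exactly what is shown in this report.

```python
import re
from collections import defaultdict

# Matches lines such as "Node 4: starting (Phase 0) (Version 5.1.0)".
# The format is taken from the transcript in this report (an assumption
# about other versions).
STATUS_LINE = re.compile(r"^Node (\d+): starting \(Phase (\d+)\)")


def nodes_by_start_phase(status_text):
    """Group node ids by the start phase reported in `all status` output."""
    phases = defaultdict(list)
    for line in status_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            node_id = int(match.group(1))
            phase = int(match.group(2))
            phases[phase].append(node_id)
    return dict(phases)


sample = """\
Node 2: starting (Phase 1) (Version 5.1.0)
Node 3: starting (Phase 1) (Version 5.1.0)
Node 4: starting (Phase 0) (Version 5.1.0)
Node 5: starting (Phase 0) (Version 5.1.0)
"""

phases = nodes_by_start_phase(sample)
for phase in sorted(phases):
    print(f"phase {phase}: nodes {phases[phase]}")
if len(phases) > 1:
    print("nodes are in different start phases")
```

Run against two snapshots taken some time apart, an unchanged grouping would suggest the restart is stuck rather than merely slow.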

How to repeat:
Follow steps above.

Suggested fix:
The management client/server should make sure the ndb processes are shut down.
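
Until such a check exists in the management client/server, a test script can verify the process list itself after issuing the shutdown. The following is a minimal sketch in Python under stated assumptions: the function name lingering_cluster_processes is hypothetical, and the input is assumed to be the default "ps" layout (PID TTY TIME CMD) shown in the listings above.

```python
# Process names of the cluster daemons, as they appear in the ps listings
# in this report.
CLUSTER_BINARIES = {"ndbd", "ndb_mgmd"}


def lingering_cluster_processes(ps_text):
    """Return (pid, command) pairs for cluster daemons still listed by ps."""
    survivors = []
    for line in ps_text.splitlines():
        fields = line.split()
        # Expect the default ps layout: PID TTY TIME CMD.
        if (
            len(fields) >= 4
            and fields[0].isdigit()
            and fields[3] in CLUSTER_BINARIES
        ):
            survivors.append((int(fields[0]), fields[3]))
    return survivors


# Listing captured in this report after "shutdown" had reported success.
after_shutdown = """\
14061 ?        00:01:24 ndb_mgmd
14381 ?        00:00:00 ndbd
14382 ?        01:49:08 ndbd
"""

for pid, command in lingering_cluster_processes(after_shutdown):
    print(f"still running after shutdown: {command} (pid {pid})")
```

In a real test run the text would come from invoking ps after the shutdown command returns; a non-empty result reproduces the symptom described here.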
[28 May 2005 14:58] Jorge del Conde
I was able to reproduce this with the latest 5.1 pull under FC2.
[16 Jun 2005 21:43] Tomas Ulin
This will most certainly already be an issue in 5.0, if not in 4.1.

Assigning it to Martin for him to decide what to do about it.
[7 Sep 2005 12:14] Jonas Oreland
This is most likely the same as BUG#11623,
for which I submitted a new patch.
[12 Sep 2005 12:32] Jonas Oreland
Pushed into 4.1.15 and 5.0.13
[16 Sep 2005 11:45] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented fix in 4.1.15 and 5.0.13 changelogs.