MySQL Bugs: #21017: Second Data Nodes has failure at the end of first Data Node's recovery.

Bug #21017	Second Data Nodes has failure at the end of first Data Node's recovery.
Submitted:	12 Jul 2006 20:41	Modified:	1 Sep 2006 13:13
Reporter:	Jonathan Miller	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:		OS:
Assigned to:	Pekka Nousiainen	CPU Architecture:	Any

Description:
In testing DN failure and recovery for cluster, Data Node ID3 had just completed recovery when Data Node ID2 failed. I had issued a sig 11 of ID3 and then restarted it.

Cluster log as follows:
2006-07-12 21:12:13 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2006-07-12 21:12:13 [MgmSrvr] ALERT    -- Node 3: Node 5 Disconnected
2006-07-12 21:12:13 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 closed
2006-07-12 21:12:13 [MgmSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=1eff0005642750b0]
2006-07-12 21:12:14 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 freed, m_reserved_nodes 0000000000000012.
2006-07-12 21:12:14 [MgmSrvr] INFO     -- Mgmt server state: nodeid 4 freed, m_reserved_nodes 0000000000000002.
2006-07-12 21:12:15 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Error log for ID2:
Time: Wednesday 12 July 2006 - 21:12:12
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 13547) 0x0000000e
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 29686
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.12 (beta)

Trace File:
BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
       ?009195 009206 009243 002971
DBTUX   060382
DBTUX   060285 060304 060306
DBTUX   060388 060434 010224 010224 010224 060781 060781 060781
        060781 060781 060781
DBTUP   028109
DBTUX   040064 040577 040580 060796 060844 060852

--------------- Signal ----------------
r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 459739878 gsn: 420 "DICT_UNLOCK_ORD" prio: 1
s.bn: 246 "DBDIH", s.proc: 3, s.sigId: 20556091 length: 2 trace: 1 #sec: 0 fragInf: 0
 H'00000000 H'00000001
--------------- Signal ----------------
r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 459739877 gsn: 517 "DROP_INDX_CONF" prio: 1
s.bn: 250 "DBDICT", s.proc: 3, s.sigId: -1 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'00036319 H'00fa0002 H'00000010
--------------- Signal ----------------
r.bn: 258 "DBTUX", r.proc: 2, r.sigId: 459739876 gsn: 332 "NEXT_SCANREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 459739870 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000010 H'ffffffff H'00000002
--------------- Signal ----------------
r.bn: 258 "DBTUX", r.proc: 2, r.sigId: 459739875 gsn: 332 "NEXT_SCANREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 459739869 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000027 H'ffffffff H'00000002
--------------- Signal ----------------
r.bn: 248 "DBACC", r.proc: 2, r.sigId: 459739874 gsn: 332 "NEXT_SCANREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 459739868 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000006 H'000075c5 H'00000003
--------------- Signal ----------------
r.bn: 248 "DBACC", r.proc: 2, r.sigId: 459739873 gsn: 72 "ACC_CHECK_SCAN" prio: 1
s.bn: 248 "DBACC", s.proc: 2, s.sigId: 459739867 length: 2 trace: 1 #sec: 0 fragInf: 0
 H'0000000b H'00000000

How to repeat:
This happened during an attempt to repeat bug#21015

I have not been able to repeat it yet. 

Testing

Master Cluster
3 Hosts

N07: NDBD
N08: NDBD, MySQLD, ndb_mgm
N09: MySQLD

Slave Cluster
3 Hosts

N10: NDBD
N11: NDBD, ndb_mgm
N12: MySQLD

Test running on master:

N08: cid_ndb_dd.pl
N09: DBT2

I had issues a killall -11 ndbd on N07. No MySQLD crash this time as I think that I had missed the delete call. So I restarted NDBD on N07 and allowed it to recover. Then at the end of the recovery, NDBD on N08 failed and the program running on this host reported:

Table TESTER.t1 Dropped.
Database TESTER  Dropped.
Disconnected from MySQLD
************ End Loop **************
********* Start New Loop ***********
Connected to MySQLD
Database Created.
Creating TESTER.t1, please stand by....
DBD::mysql::st execute failed: Table 't1' already exists at cid_ndb_dd.pl line 460.
Create Table Error: Table 't1' already exists at cid_ndb_dd.pl line 460.

This seems like another bug.

Bug in pekkas new "dict lock during node restart"

pushed to 5.1.12

Documented bugfix in 5.1.12 changelog.