Bug #43601 on-line add node test causes file system inconsistency error DBLQH (Line: 16380)
Submitted: 12 Mar 2009 15:38 Modified: 11 Aug 2009 10:55
Reporter: Jonathan Miller Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:mysql-5.1-telco-7.0 OS:Linux
Assigned to: Jonas Oreland CPU Architecture:Any

[12 Mar 2009 15:38] Jonathan Miller
Description:
While trying to stitch Wen's tests into ACRT, the test produced the following:

Time: Wednesday 11 March 2009 - 22:46:03
Status: Ndbd file system error, restart node initial
Message: Error while reading the REDO log (Ndbd file system inconsistency error, please report a bug)
Error: 2310
Error data: Error while reading REDO log. from 16273
D=8, F=0 Mb=78 FP=2527 W1=8178 W2=7 : end of log wo/ having found last GCI gci: 10713
Error object: DBLQH (Line: 16380) 0x00000002
Program: /data0/cr_autotest/libexec/ndbd
Pid: 5229
Trace: ./ndb_2_trace.log.1
Version: mysql-5.1.32 ndb-6.

Clone information:
revno: 2927
revision-id: tomas.ulin@sun.com-20090311144605-5npg9fk8jqweuvee
parent: tomas.ulin@sun.com-20090311143451-j2pfku5fj9tdnr5b
committer: Tomas Ulin <tomas.ulin@sun.com>
branch nick: mysql-5.1-telco-6.4
timestamp: Wed 2009-03-11 15:46:05 +0100
message:
  change defaults for some more params

How to repeat:
ACRT config = 2-dn-func.cnf

ACRT Command Line:
sh -x drivers/online-add-node-tester.sh /space/cluster_rep_auto/cr-autotest.conf

Test Command Line:

online-add-node.sh --run=/data0/cr_autotest/run-2-dn-func-mysql-5.1-telco-6.4/run --conf=/space/cluster_rep_auto/cr-autotest.conf --log=/space/cluster_rep_auto/logs/online-add-node.log --csv=/space/cluster_rep_auto/logs/online-add-node.csv --pk=/space/cluster_rep_auto/logs/online-add-node.pk --row-num=1000000,1000000,1000000
[12 Mar 2009 15:48] Jonathan Miller
DBLQH ->

        017402 016025 016092 016025 016092 016025 016092 016144
        018603 016162 017737 017402 016025 016092 016144 018603
        016162 017737 017737 017402 016025 016092 016263 016380

--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 9164455 gsn: 164 "CONTINUEB" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9164453 length: 2 trace: 2 #sec: 0 fragInf: 0
 H'00000006 H'00000000
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 9164454 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164452 length: 1 trace: 1 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 9164453 gsn: 262 "FSREADCONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164452 length: 1 trace: 2 #sec: 0 fragInf: 0
 UserPointer: 3
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 9164452 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164450 length: 1 trace: 1 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 9164451 gsn: 255 "FSCLOSECONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164450 length: 1 trace: 2 #sec: 0 fragInf: 0
 UserPointer: 20
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 9164450 gsn: 164 "CONTINUEB" prio: 0
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164448 length: 1 trace: 1 #sec: 0 fragInf: 0
 Scanning the memory channel every 10ms
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 2, r.sigId: 9164449 gsn: 164 "CONTINUEB" prio: 0
s.bn: 252 "QMGR", s.proc: 2, s.sigId: 9164447 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 9164446 gsn: 264 "FSREADREQ" prio: 0
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9164445 length: 15 trace: 2 #sec: 0 fragInf: 0
 UserPointer: 3
 FilePointer: 77
 UserReference: H'00f70002 Operation flag: H'00000002 (No sync, Format=List of mem pages)
 varIndex: 1
 numberOfPages: 8
 pageData:  H'0000002f,  H'0000002e,  H'0000002d,  H'0000002c,  H'0000002b,  H'0000002a,  H'00000029,  H'00000028,  H'000009d8,
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 9164445 gsn: 164 "CONTINUEB" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9164443 length: 2 trace: 2 #sec: 0 fragInf: 0
 H'00000006 H'00000000
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 9164444 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164440 length: 1 trace: 1 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 9164443 gsn: 262 "FSREADCONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164440 length: 1 trace: 2 #sec: 0 fragInf: 0
 UserPointer: 3
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 9164442 gsn: 257 "FSCLOSEREQ" prio: 0
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9164441 length: 5 trace: 2 #sec: 0 fragInf: 0
 UserPointer: 20
 FilePointer: 79
 UserReference: H'00f70002
 Flags: H'00000000, Don't remove file
 H'0000004f H'00f70002 H'00000014 H'00000000 H'00003ec8
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 9164441 gsn: 164 "CONTINUEB" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9164439 length: 2 trace: 2 #sec: 0 fragInf: 0
 H'00000006 H'00000002
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 9164440 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164438 length: 1 trace: 1 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 9164439 gsn: 262 "FSREADCONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164438 length: 1 trace: 2 #sec: 0 fragInf: 0
 UserPointer: 0
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 9164438 gsn: 164 "CONTINUEB" prio: 0
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9164436 length: 1 trace: 1 #sec: 0 fragInf: 0
 Scanning the memory channel every 10ms
[12 Mar 2009 15:53] Jonathan Miller
Note:

This happened during the restart of original data nodes after configuration had been updated to include data nodes that where to be added.
[12 Mar 2009 19:03] Jonathan Miller
It's reproducable:

Time: Thursday 12 March 2009 - 19:30:31
Status: Ndbd file system error, restart node initial
Message: Error while reading the REDO log (Ndbd file system inconsistency error, please report a bug)
Error: 2310
Error data: Error while reading REDO log. from 16303
D=8, F=0 Mb=78 FP=2527 W1=8188 W2=7 : end of log wo/ having found last GCI gci: 2698
Error object: DBLQH (Line: 16410) 0x00000002

------
Database DB1 is Created.
Creating t1, t2, t3, please stand by....
Table t1 Created!!!!!
Table t2 Created!!!!!
Table t3 Created!!!!!
row num is 1000000,1000000,1000000
Load data.
Insert data to table.
Disconnected from MySQLD
Script -> All the data has been load to table t1!
Script -> All the data has been load to table t2!
Script -> All the data has been load to table t3!
Script -> Get unused node id for new ndbd
node id for the first added "new" ndbd is 9
node id for the second added "new" ndbd is 10
Script -> Stopping ndb_mgmd
Connected to Management Server at: ndb13:15000
Node 1 has shutdown.
Disconnecting to allow Management Server to shutdown

Script -> Maximum time for stopping ndb_mgmd is 60 seconds, otherwise the test will fail
Script -> ndb_mgmd has stoped!
Script -> Editing the config file
Config file my.cnf exists!
Script -> New config file has generated
Script -> Starting ndb_mgmd
2009-03-12 19:28:27 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.32 ndb-7.0.4-beta
2009-03-12 19:28:27 [MgmSrvr] INFO     -- Loaded config from '/data0/cr_autotest/run-2-dn-func-mysql-5.1-telco-6.4/run/ndb_mgmd.1/ndb_1_config.bin.1'
Script -> Check if ndb_mgmd has started and the old ndbd has connected
Node 2 has started!
Node 3 has started!
Connected to Management Server at: ndb13:15000
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=2    @10.128.22.113  (mysql-5.1.32 ndb-7.0.4, Nodegroup: 0, Master)
id=3    @10.128.22.114  (mysql-5.1.32 ndb-7.0.4, Nodegroup: 0)
id=9 (not connected, accepting connect from ndb13)
id=10 (not connected, accepting connect from ndb14)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @10.128.22.113  (mysql-5.1.32 ndb-7.0.4)

[mysqld(API)]   5 node(s)
id=4    @10.128.22.113  (mysql-5.1.32 ndb-7.0.4)
id=5    @10.128.22.113  (mysql-5.1.32 ndb-7.0.4)
id=6    @10.128.22.113  (mysql-5.1.32 ndb-7.0.4)
id=7 (not connected, accepting connect from ndb13)
id=8 (not connected, accepting connect from ndb13)

Script -> Restarting original data nodes
Script -> Maximum time for restarting each node is 800 seconds, otherwise the test will fail
Script -> Restarting the first original data node with node id 2
Connected to Management Server at: ndb13:15000
Node 2 is being restarted

Script -> Node 2 (re)start failed since timeout!
===========================================
Thu Mar 12 19:43:34 CET 2009
===========================================
===========================================
Thu Mar 12 19:43:34 CET 2009
===========================================
NDBT_ProgramExit: Failed
NDBT_ProgramExit: Failed
Script -> FAILED!!: /space/cluster_rep_auto/tests/online-add-node/online-add-node.sh --run=/data0/cr_autotest/run-2-dn-func-mysql-5.1-telco-6.4/run --conf=/space/cluster_rep_auto/cr-autotest.conf --log=/space/cluster_rep_auto/logs/online-add-node.log --csv=/space/cluster_rep_auto/logs/online-add-node.csv --pk=/space/cluster_rep_auto/logs/online-add-node.pk --row-num=1000000,1000000,1000000

Program: /data0/cr_autotest/libexec/ndbd
Pid: 28961
Trace: ./ndb_2_trace.log.1
Version: mysql-5.1.32 ndb-7.0
[11 Aug 2009 10:55] Jonas Oreland
this is a duplicate of bug#44952