Bug #18780 DD: DBT2 causes cluster failure using disk data in dbtup/DbtupDiskAlloc.cpp
Submitted: 4 Apr 2006 19:47 Modified: 14 Jun 2006 0:39
Reporter: Jonathan Miller Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Disk Data Severity:S1 (Critical)
Version:5.1.9 OS:Linux (Linux)
Assigned to: Jonas Oreland CPU Architecture:Any

[4 Apr 2006 19:47] Jonathan Miller
Description:
I am in the process of altering the DBT2 scripts to run with Cluster, Cluster Disk Data and Cluster with a mix of memory and disk data tables. Today, I got the Disk Data test running. Shortly into the test run of a 60 second test the cluster crashed.

Test Config:

DATABASE NAME:                dbt2
DATABASE USER:                root
DATABASE CONNECTIONS:         20
TERMINAL THREADS:             40
SCALE FACTOR(WARHOUSES):      4
TERMINALS PER WAREHOUSE:      10
DURATION OF TEST(in sec):     60
SLEEPY in (msec)              300
ZERO DELAYS MODE:             1

LG and TS creates:

LOGGROUP="CREATE LOGFILE GROUP $NDB_LFG ADD UNDOFILE './$NDB_LFG/undofile.dat' INITIAL_SIZE 500M UNDO_BUFFER_SIZE = 4M ENGINE=NDB"

TABLESPACE="CREATE TABLESPACE  $NDB_TS ADD DATAFILE './$NDB_TS/datafile.dat' USE LOGFILE GROUP $NDB_LFG INITIAL_SIZE 750M ENGINE=NDB"

Cluster Log:
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-04-04 21:26:37 [MgmSrvr] ALERT    -- Node 3: Node 2 Disconnected
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 closed
2006-04-04 21:26:37 [MgmSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]2006-04-04 21:26:37 [MgmSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: GCP Take over started
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: LCP Take over started
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: ParticipatingDIH = 0000000000000000
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: ParticipatingLQH = 0000000000000000
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 1
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: GCP Take over completed
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: LCP Take over completed (state = 4)
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: ParticipatingDIH = 0000000000000000
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: ParticipatingLQH = 0000000000000000
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 1
2006-04-04 21:26:37 [MgmSrvr] ALERT    -- Node 3: Node 6 Disconnected
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Node 3: Communication to Node 6 closed
2006-04-04 21:26:37 [MgmSrvr] INFO     -- Mgmt server state: nodeid 6 freed, m_reserved_nodes 0000000000000002.
2006-04-04 21:26:38 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2006-04-04 21:26:38 [MgmSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=02c40003665eecf4]
2006-04-04 21:26:41 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 opened
2006-04-04 21:26:41 [MgmSrvr] INFO     -- Node 3: Communication to Node 6 opened
2006-04-04 21:26:49 [MgmSrvr] INFO     -- Mgmt server state: nodeid 6 reserved for ip 10.100.1.93, m_reserved_nodes 0000000000000042.
2006-04-04 21:26:49 [MgmSrvr] INFO     -- Node 6: mysqld --server-id=1
2006-04-04 21:26:50 [MgmSrvr] INFO     -- Node 3: Node 6 Connected
2006-04-04 21:26:50 [MgmSrvr] INFO     -- Node 3: Node 6: API version 5.1.9
2006-04-04 21:28:24 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2006-04-04 21:28:24 [MgmSrvr] INFO     -- Mgmt server state: nodeid 6 freed, m_reserved_nodes 0000000000000002.
2006-04-04 21:28:24 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
006-04-04 21:28:31 [MgmSrvr] INFO     -- Mgmt server state: nodeid 6 reserved for ip 10.100.1.93, m_reserved_nodes 0000000000000042.
2006-04-04 21:28:31 [MgmSrvr] INFO     -- Node 6: mysqld --server-id=1

Will append more in comments

How to repeat:
Create Disk Data DBT2 database:
./mysql_load_db.sh -d dbt2 -f . -m ORIG -s /tmp/mysql.sock -h ndb08 -u root -e NDBDD
Run DBT2
./run_mysql.sh  -c 20 -t 60 -w 4 -h ndb08 -u root -e
[4 Apr 2006 19:48] Jonathan Miller
Ndb 2 error log:
Time: Tuesday 4 April 2006 - 21:26:37
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupDiskAlloc.cpp
Error object: DBTUP (Line: 870) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 701
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.9 (beta)
***EOM***

ndb2 Trace:

--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 201308218 gsn: 164 "CONTINUEB" prio: 0
s.bn: 261 "PGMAN", s.proc: 2, s.sigId: 201308217 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000002
--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 201308216 gsn: 164 "CONTINUEB" prio: 1
s.bn: 261 "PGMAN", s.proc: 2, s.sigId: 201308213 length: 1 trace: 1 #sec: 0 fragInf: 0
 H'00000001
--------------- Signal ----------------
r.bn: 248 "DBACC", r.proc: 2, r.sigId: 201308215 gsn: 332 "NEXT_SCANREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 201308212 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000003 H'0000757c H'00000002
--------------- Signal ----------------
r.bn: 248 "DBACC", r.proc: 2, r.sigId: 201308214 gsn: 332 "NEXT_SCANREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 201308211 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'0000000f H'0000761d H'00000002
--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 201308213 gsn: 164 "CONTINUEB" prio: 1
s.bn: 261 "PGMAN", s.proc: 2, s.sigId: 201308210 length: 1 trace: 1 #sec: 0 fragInf: 0
 H'00000001
--------------- Signal ----------------
[4 Apr 2006 19:50] Jonathan Miller
ndb 3 error log:
Time: Tuesday 4 April 2006 - 21:28:22
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupDiskAlloc.cpp
Error object: DBTUP (Line: 870) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 708
Trace: /space/run/ndb_3_trace.log.1
Version: Version 5.1.9 (beta)
***EOM***

ndb3 trace log:

--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 3, r.sigId: 187492690 gsn: 164 "CONTINUEB" prio: 0
s.bn: 261 "PGMAN", s.proc: 3, s.sigId: 187492689 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000002
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 3, r.sigId: 187492688 gsn: 351 "SCAN_FRAGCONF" prio: 1
s.bn: 247 "DBLQH", s.proc: 3, s.sigId: 187492687 length: 6 trace: 1 #sec: 0 fragInf: 0
 H'0000001f H'00000000 H'00000002 H'000000b5 H'00d00600 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 187492687 gsn: 389 "STORED_PROCCONF" prio: 1
s.bn: 249 "DBTUP", s.proc: 3, s.sigId: 187492686 length: 2 trace: 1 #sec: 0 fragInf: 0
 H'000000b8 H'00000014
--------------- Signal ----------------
r.bn: 249 "DBTUP", r.proc: 3, r.sigId: 187492686 gsn: 391 "STORED_PROCREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 3, s.sigId: 187492684 length: 5 trace: 1 #sec: 0 fragInf: 0
 H'000000b8 H'0000000a H'00000007 H'00000003 H'00000014
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 3, r.sigId: 187492685 gsn: 351 "SCAN_FRAGCONF" prio: 1
s.bn: 247 "DBLQH", s.proc: 3, s.sigId: 187492683 length: 6 trace: 1 #sec: 0 fragInf: 0
 H'00000019 H'00000001 H'00000000 H'000000b5 H'00d00600 H'00000074
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 187492684 gsn: 330 "NEXT_SCANCONF" prio: 1
s.bn: 258 "DBTUX", s.proc: 3, s.sigId: 187492682 length: 3 trace: 1 #sec: 0 fragInf: 0
 H'00000019 H'ffffff00 H'ffffff00
--------------- Signal ----------------
[4 Apr 2006 20:19] Jonathan Miller
Note: Cluster Recovered from crash:

2006-04-04 22:04:16 [MgmSrvr] INFO     -- Node 3: CM_REGCONF president = 2, own Node = 3, our dynamic id = 2
2006-04-04 22:04:16 [MgmSrvr] INFO     -- Node 3: Node 2: API version 5.1.9
2006-04-04 22:04:16 [MgmSrvr] INFO     -- Node 3: Start phase 1 completed
2006-04-04 22:04:16 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (system restart)
2006-04-04 22:04:16 [MgmSrvr] INFO     -- Node 2: Start phase 2 completed (system restart)
2006-04-04 22:04:16 [MgmSrvr] INFO     -- Node 3: Start phase 3 completed (system restart)
2006-04-04 22:04:16 [MgmSrvr] INFO     -- Node 2: Start phase 3 completed (system restart)
2006-04-04 22:04:26 [MgmSrvr] INFO     -- Node 3: DICT: index 10 activated done
2006-04-04 22:04:26 [MgmSrvr] INFO     -- Node 3: DICT: index 12 activated done
2006-04-04 22:04:26 [MgmSrvr] INFO     -- Node 3: DICT: index 14 activated done
2006-04-04 22:04:26 [MgmSrvr] INFO     -- Node 3: DICT: index 16 activated done
2006-04-04 22:04:26 [MgmSrvr] INFO     -- Node 3: DICT: index 18 activated done
2006-04-04 22:04:26 [MgmSrvr] INFO     -- Node 3: DICT: index 22 activated done
2006-04-04 22:04:26 [MgmSrvr] INFO     -- Node 3: DICT: index 24 activated done
2006-04-04 22:04:26 [MgmSrvr] INFO     -- Node 3: DICT: index 26 activated done
2006-04-04 22:05:54 [MgmSrvr] INFO     -- Node 3: Applying undo to LCP: 55
2006-04-04 22:05:55 [MgmSrvr] INFO     -- Node 3: Undo head - ./lg1/undofile.dat page: 2146 lsn: 402011
2006-04-04 22:05:56 [MgmSrvr] INFO     -- Node 2: Applying undo to LCP: 55
2006-04-04 22:05:57 [MgmSrvr] INFO     -- Node 2: Undo head - ./lg1/undofile.dat page: 2167 lsn: 401665
2006-04-04 22:06:01 [MgmSrvr] INFO     -- Node 3: Logfile group: 5
2006-04-04 22:06:01 [MgmSrvr] INFO     -- Node 3:   head: ./lg1/undofile.dat page: 2146
2006-04-04 22:06:01 [MgmSrvr] INFO     -- Node 3:   tail: ./lg1/undofile.dat page: 2099
2006-04-04 22:06:01 [MgmSrvr] INFO     -- Node 3: Flushing page cache after undo completion
2006-04-04 22:06:02 [MgmSrvr] INFO     -- Node 3: Flushing complete
2006-04-04 22:06:03 [MgmSrvr] INFO     -- Node 2: Logfile group: 5
2006-04-04 22:06:03 [MgmSrvr] INFO     -- Node 2:   head: ./lg1/undofile.dat page: 2167
2006-04-04 22:06:03 [MgmSrvr] INFO     -- Node 2:   tail: ./lg1/undofile.dat page: 2121
2006-04-04 22:06:03 [MgmSrvr] INFO     -- Node 2: Flushing page cache after undo completion
2006-04-04 22:06:03 [MgmSrvr] INFO     -- Node 2: Flushing complete
2006-04-04 22:06:10 [MgmSrvr] INFO     -- Node 2: Start phase 4 completed (system restart)
2006-04-04 22:06:10 [MgmSrvr] INFO     -- Node 3: Start phase 4 completed (system restart)
2006-04-04 22:06:11 [MgmSrvr] INFO     -- Node 3: Local checkpoint 56 started. Keep GCI = 39617 oldest restorable GCI = 39603
2006-04-04 22:09:14 [MgmSrvr] INFO     -- Node 2: Start phase 5 completed (system restart)
2006-04-04 22:09:14 [MgmSrvr] INFO     -- Node 2: Start phase 6 completed (system restart)
2006-04-04 22:09:14 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=1]2006-04-04 22:09:14 [MgmSrvr] INFO     -- Node 2: Start phase 7 completed (system restart)
2006-04-04 22:09:14 [MgmSrvr] INFO     -- Node 3: Start phase 5 completed (system restart)
2006-04-04 22:09:14 [MgmSrvr] INFO     -- Node 3: Start phase 6 completed (system restart)
2006-04-04 22:09:14 [MgmSrvr] INFO     -- Node 3: Start phase 7 completed (system restart)
2006-04-04 22:09:25 [MgmSrvr] INFO     -- Node 3: DICT: index 10 rebuild done
2006-04-04 22:09:27 [MgmSrvr] INFO     -- Node 3: DICT: index 12 rebuild done
2006-04-04 22:10:18 [MgmSrvr] INFO     -- Node 3: DICT: index 14 rebuild done
2006-04-04 22:10:22 [MgmSrvr] INFO     -- Node 3: DICT: index 16 rebuild done
2006-04-04 22:10:23 [MgmSrvr] INFO     -- Node 3: DICT: index 18 rebuild done
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: DICT: index 22 rebuild done
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: DICT: index 24 rebuild done
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: DICT: index 26 rebuild done
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 6 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 7 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 8 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 9 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 10 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 11 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 12 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 13 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 14 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Communication to Node 15 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Start phase 8 completed (system restart)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Start phase 9 completed (system restart)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 6 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 7 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 8 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 9 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 10 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 11 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 12 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 13 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 14 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Communication to Node 15 opened
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Start phase 8 completed (system restart)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Start phase 9 completed (system restart)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Start phase 100 completed (system restart)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Start phase 101 completed (system restart)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Started (version 5.1.9)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Start phase 100 completed (system restart)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Start phase 101 completed (system restart)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Started (version 5.1.9)
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Node 6 Connected
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 3: Node 6: API version 5.1.9
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Node 6 Connected
2006-04-04 22:10:27 [MgmSrvr] INFO     -- Node 2: Node 6: API version 5.1.9
2006-04-04 22:10:28 [MgmSrvr] INFO     -- Node 2: Node 1: API version 5.1.9
2006-04-04 22:10:28 [MgmSrvr] INFO     -- Node 3: Node 1: API version 5.1.9
2006-04-04 22:10:28 [MgmSrvr] INFO     -- Node 3: Prepare arbitrator node 1 [ticket=482200016685f0ce]
2006-04-04 22:10:28 [MgmSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=482200016685f0ce]

But the crash is repeatable:
Time: Tuesday 4 April 2006 - 22:17:42
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupDiskAlloc.cpp
Error object: DBTUP (Line: 870) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 18470
Trace: /space/run/ndb_3_trace.log.2
Version: Version 5.1.9 (beta)
***EOM***
[4 Apr 2006 21:20] Jonas Oreland
Jeb, where can I find the test program?
[24 Apr 2006 7:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/5372
[12 Jun 2006 11:38] Tomas Ulin
pushed to 5.1.12
[14 Jun 2006 0:39] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 5.1.12 changelog. Closed.