Bug #21073 CR: Slave dies and NDBD goes to 99% CPU using DBT2
Submitted: 14 Jul 2006 23:17
Modified: 30 Nov 2007 18:23
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Disk Data
Severity: S2 (Serious)
Version: 5.1.12
OS: Linux
Assigned to: Jonathan Miller
CPU Architecture: Any

[14 Jul 2006 23:17] Jonathan Miller
Description:
I started a DBT2 test running against a cluster replication setup before I went to bed last night. The test was set to run for 11 hours, and this run of DBT2 used a mix of in-memory and disk data tables. I am opening this under Disk Data due to the PGMAN signals in the trace log.

When I checked on it this morning, the slave cluster had stopped replicating with a write error:

060714  4:54:44 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table dbt2.stock, Error_code: 899
060714  5:57:29 [ERROR] Slave: Error in Delete_rows event: row application failed, Error_code: 410
060714  5:57:29 [ERROR] Slave: Error in Delete_rows event: error during transaction execution on table dbt2.stock, Error_code: 410
060714  5:57:29 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:57:30 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:57:32 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:57:35 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:57:39 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:57:44 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:57:49 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:57:54 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:57:59 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:58:04 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table cluster.apply_status, Error_code: 410
060714  5:58:04 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
060714  5:58:04 [ERROR] Slave (additional info): Unknown error Error_code: 1105
060714  5:58:04 [Warning] Slave: Got temporary error 410 'REDO log files overloaded, consult online manual (decrease TimeBetweenLocalCheckpoints, and|or incre' from NDB Error_code: 1297
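
For reference, the retry limit the SQL thread gave up at is controlled by the slave_transaction_retries system variable; the default of 10 matches the "retried transaction 10 time(s)" message above. A minimal my.cnf sketch for the slave, assuming a higher limit is acceptable while the underlying redo overload is investigated (the value 50 is illustrative, not tuned):

[mysqld]
# Retry transactions that fail with a temporary error (such as NDB
# error 410) up to 50 times before the SQL thread gives up.
slave_transaction_retries = 50

Note that raising the retry count only buys time; the 410 errors themselves point at redo log pressure on the slave cluster.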

In looking at the cluster log, I found that the cluster had not checkpointed in over an hour. The cluster was still up and running, but looking closer at the processes, one NDBD was using 0% CPU and the other was using 99% CPU. I issued "killall -6 ndbd" to get a trace log. The log is full of:
--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: -2002331445 gsn: 164 "CONTINUEB" prio: 1
s.bn: 261 "PGMAN", s.proc: 2, s.sigId: -2002331446 length: 1 trace: 4 #sec: 0 fragInf: 0
 H'00000003
--------------- Signal ----------------

I then tried to restart this data node. The node got to phase 4 and then dropped to 0% CPU usage. The other data node was now at 99% usage. A "killall -6 ndbd" on that data node produced a trace log that looks pretty much like the first.

Slave Config.ini

[DB DEFAULT]
NoOfReplicas: 2
IndexMemory: 500M
DataMemory: 1000M
BackupMemory: 64M
MaxNoOfConcurrentScans: 100
DiskPageBufferMemory: 300M
MaxNoOfConcurrentOperations: 100000
DataDir: /space/run
FileSystemPath: /space/run

[MGM DEFAULT]
PortNumber: 14000
ArbitrationRank: 1
DataDir: .
[ndb_mgmd]
Id: 1
HostName: ndb10

[ndbd]
Id: 2
HostName: ndb10

[ndbd]
Id: 3
HostName: ndb11

[api]
Id: 5
[api]
Id: 6
[api]
Id: 7
[api]
Id: 8

[mysqld]
Id: 4
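
A possible mitigation for the 410 "REDO log files overloaded" errors above, following the hint in the error text: checkpoint more often and/or enlarge the redo log. A sketch of [DB DEFAULT] additions with illustrative values only, not tuned for this workload; TimeBetweenLocalCheckpoints is a base-2 logarithm of write volume rather than a time, and the total redo log size is NoOfFragmentLogFiles * 4 * 16 MB with the default fragment log file size:

[DB DEFAULT]
# Trigger local checkpoints four times as often as the default of 20
# (each step of 1 halves the write volume between LCPs).
TimeBetweenLocalCheckpoints: 18
# More redo log headroom: 32 * 4 * 16 MB = 2 GB total.
NoOfFragmentLogFiles: 32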
 

How to repeat:
See above
[14 Jul 2006 23:48] Jonathan Miller
I am not able to recover the slave at this point:
Time: Saturday 15 July 2006 - 01:44:52
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupDiskAlloc.cpp
Error object: DBTUP (Line: 730) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 13277
Trace: /space/run/ndb_3_trace.log.3
Version: Version 5.1.12 (beta)
***EOM***

BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
       ?004368 004377
DBLQH   006387 006464 002828 002837 002506 005090 005659 006587
DBLQH   002481 002485 002488
DBTC    004050
DBTUP   002032
DBLQH   003299 003506 016323 003617
DBACC   001054 003266 003330 003330 003330 003330 003330 003330
        003330 003330 003266 003330 003330 003330 003330 003336

--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 6979696 gsn: 4 "ATTRINFO" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9743852 length: 4 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'00010000 H'0f700200 H'00000075
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 6979695 gsn: 4 "ATTRINFO" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9743852 length: 25 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'00010000 H'0f700200 H'00000c5e H'00030004 H'00000002 H'00040004
 H'00000ff0 H'00050004 H'00000004 H'00060004 H'44b70841 H'00070008 H'00000000
 H'3ff00000 H'00080008 H'00000000 H'40440000 H'00090019 H'59435718 H'6b4a6676
 H'664d4961 H'4e537757 H'4a624344 H'56714e78
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 6979694 gsn: 316 "LQHKEYREQ" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9743852 length: 23 trace: 0 #sec: 0 fragInf: 0
 ClientPtr = H'00000002 hashValue = H'76251550 tcBlockRef = H'00f70002
 transId1 = H'00010000 transId2 = H'0f700200 savePointId = H'00000000
 Op: 2 Lock: 1 Flags: Simple Dirty Rowid GCI ScanInfo/noFiredTriggers: H'0
 AttrLen: 28 (5 in this) KeyLen: 4 TableId: 25 SchemaVer: 2
 FragId: 0 ReplicaNo: 0 LastReplica: 0 NextNodeId: 65535
 KeyInfo: H'00000c5e H'00000001 H'00000004 H'00000002
 Rowid: [ page: 857 idx: 8030 ]
 GCI: 55857 AttrInfo: H'00020004 H'00000004 H'00010004 H'00000001 H'00000004
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 6979693 gsn: 4 "ATTRINFO" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9743843 length: 20 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'00010000 H'0f700200 H'000f0008 H'00000000 H'00000000 H'00100033
 H'6f66351a H'4fe9e371 H'f76669f9 H'fc46e5ee H'32512fe9 H'38f76032 H'403948e7
 H'694ee8f9 H'ed23dd57 H'6341443c H'25dd564a H'216d633b H'00fd75de
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 6979692 gsn: 4 "ATTRINFO" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9743843 length: 25 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'00010000 H'0f700200 H'000b0019 H'684c7318 H'62627558 H'74534b6e
 H'786b6259 H'455a4877 H'4d744a71 H'0000006b H'000c0019 H'5a556318 H'715a6a59
 H'704f6856 H'5464504a H'4a527768 H'4f485068 H'0000006f H'000d0008 H'00000080
 H'00000000 H'000e0008 H'00000000 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 6979691 gsn: 4 "ATTRINFO" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9743843 length: 25 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'00010000 H'0f700200 H'696b7859 H'6c5a504a H'70714b58 H'6c6e4947
 H'6e506a78 H'0000006e H'00090019 H'4d4b5318 H'6b42664d H'6c4a496e H'68757253
 H'62474771 H'46685677 H'00000069 H'000a0019 H'4a544418 H'4a4f6752 H'57564151
 H'6a4b4761 H'68627670 H'57794b51 H'00000070
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 6979690 gsn: 4 "ATTRINFO" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 9743843 length: 25 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'00010000 H'0f700200 H'4d4e4f49 H'6964586e H'4762754a H'00000072
 H'00060019 H'4a637518 H'69654d4b H'42795a58 H'524f6c68 H'67617153 H'42744d64
 H'00000070 H'00070019 H'464c6a18 H'4c694471 H'72575366 H'766f6577 H'50657064
 H'55685645 H'00000056 H'00080019 H'68687818

ndb_mgm> Node 3: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
Node 2: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
[26 Oct 2007 5:48] Jonas Oreland
Lots of fixes have been made since then; please retest.
[30 Nov 2007 18:23] Jonathan Miller
Hi,

I spent time this week adding this test to the ACRT framework. It is currently running on the latest telco clones with no issues.

Closing the bug at this time.

Best wishes,
/Jeb