Bug #21124 DBT2 cause temp failures 4028 that lead to scan error 245
Submitted: 18 Jul 2006 18:45 Modified: 23 Aug 2006 21:06
Reporter: Jonathan Miller
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.1.12
OS: Linux
Assigned to: Jonas Oreland
CPU Architecture: Any
Tags: Blocking Testing at customer site

[18 Jul 2006 18:45] Jonathan Miller
Description:
Running a Cluster Replication stress test using DBT2 for 11 hours produced several issues.

1) The test was started at Mon Jul 17 23:50:50 2006. At Mon Jul 17 23:58:24 2006 the test client started getting ERROR: 1297 Got temporary error 4028 'Node failure caused abort of transaction' from NDBCLUSTER. These errors continued about every 5 - 10 minutes. Around Tue Jul 18 04:50:35 2006, the errors started repeating every 3 minutes.

2) By Tue Jul 18 08:19:19 2006 the cluster started reporting Got temporary error 245 'Too many active scans' from NDBCLUSTER. 

3) The slave appeared to still be running and was caught up with the master. I restarted the master cluster to clear the scan issue. A dump of the master and slave data showed that the data was not the same on the two clusters.

A dump of ACC scans by Pekka showed 10 still running and 200 closing.


How to repeat:
Create two 3-host clusters with 2 data nodes and 2 replicas each.
Load the DBT2 database and start an 11-hour run.

./run_mysql.sh -n dbt2 -h ndb09  -u root -s 0  -t 40000 -z "initial testing for stability" -c 10 -w 4 -o /tmp/mysql.sock -v -e -x 3000
[18 Jul 2006 21:09] Jonathan Miller
Hi,

Looking at the test error log, you will see that the 4028s always happen on the new order stored procedure call. So I created a perl script to do that call and put 16 of them running. After some more thinking, new order is the only transaction in the test that has built-in rollbacks. That is, 1% of all new orders will be rolled back by the test.

In looking at it with Mikael, he found a small bug where it is reporting 4028 when it should be reporting 4008, scan timeout. These scans start and never return. This causes a scan record leak, where we finally run out of free scan records and all the transactions start failing.

So looking at it, there seems to be an issue with doing rollbacks in a heavily loaded cluster that causes scans to not return, which uses up all the free scan records. In addition, this causes data to not be replicated correctly, giving mismatched data on the slave.

The perl script does not reproduce these conditions, but the test with all the calls and 40 threads does.

In addition, a new order transaction is a very large transaction to have to rollback.
[27 Jul 2006 20:02] Jonathan Miller
Slave data not really corrupted. 

The table "history" has no primary key, and therefore mysqldump cannot order the dump. The table dumps differently from cluster to cluster.
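Since row order in a dump of a PK-less table is nondeterministic, one way to rule out real divergence is to sort both dumps before comparing them. A minimal sketch (the file names and row contents below are made up for illustration, not taken from the actual test run):

```shell
# Fabricate two per-cluster dumps of a PK-less table: same rows,
# different order -- exactly the situation described above.
printf 'row-b\nrow-a\nrow-c\n' > /tmp/history_master.dump
printf 'row-a\nrow-c\nrow-b\n' > /tmp/history_slave.dump

# A raw byte-for-byte diff reports a spurious difference.
diff /tmp/history_master.dump /tmp/history_slave.dump >/dev/null \
  && echo "raw dumps match" || echo "raw dumps differ"

# Sorting first makes the comparison order-independent.
sort /tmp/history_master.dump > /tmp/master.sorted
sort /tmp/history_slave.dump  > /tmp/slave.sorted
diff /tmp/master.sorted /tmp/slave.sorted >/dev/null \
  && echo "sorted dumps match" || echo "sorted dumps differ"
```

This prints "raw dumps differ" followed by "sorted dumps match", i.e. the data is identical and only the dump order differs.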
[21 Aug 2006 11:12] Jonas Oreland
"So I created a perl script to do that call and put 16 of them running."

Did you get the error this way?
[21 Aug 2006 12:11] Jonathan Miller
No, I did not get the errors by using the perl scripts.
[22 Aug 2006 10:48] Jonas Oreland
Hi,

I've now run for 17h w/o getting the 4028.
Can you please _upload_ (or mail)
1) your copy of dbt2 (including _all_ potential changes you made)
2) your my.cnf for all mysqlds
3) your config.ini

And also specify exact command line for both load and run.

/Jonas
[22 Aug 2006 12:28] Jonathan Miller
Do you have "engine_condition_pushdown = on"? This might have something to do with it.

Just a thought.
[22 Aug 2006 13:42] Jonas Oreland
1) Can you upload/mail the config for 64-bit also?
2) How long did you run?
[22 Aug 2006 14:27] Jonathan Miller
The config.ini is the same as for the 32-bit testing. I copied it over and just changed the hosts.

I ran for 40000 seconds.

In addition, what seems strange is that the cluster is not able to clean up the scans even though the testing has stopped.

mysql> use dbt2
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select count(*) from warehouse;
ERROR 1297 (HY000): Got temporary error 245 'Too many active scans' from NDBCLUSTER

So it is stuck in this state.

I am going to try restarting each DN and see if it clears.
/jeb
[22 Aug 2006 14:31] Jonathan Miller
ndb_mgm> 2 restart
Node 2: Node shutdown initiated
Node 2: Node shutdown completed, restarting, no start.
Node 2 is being restarted

ndb_mgm> Node 2: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Time: Tuesday 22 August 2006 - 16:28:11
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 8 received; Floating point exception
Error object: main.cpp
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 1804
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.12 (beta)

BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
       ?000077 001200 003759 003874 003759 003874 003759 003874
        003759 003874 003759 004084 004049 003759 004002 003946
        003948 003759 004084 004049 003759 004084 004049 003759

--------------- Signal ----------------
r.bn: 262 "RESTORE", r.proc: 2, r.sigId: 762227 gsn: 164 "CONTINUEB" prio: 1
s.bn: 262 "RESTORE", s.proc: 2, s.sigId: 762222 length: 2 trace: 2 #sec: 0 fragInf: 0
 H'00000000 H'00000000
--------------- Signal ----------------
r.bn: 248 "DBACC", r.proc: 2, r.sigId: 762226 gsn: 253 "EXPANDCHECK2" prio: 1
s.bn: 248 "DBACC", s.proc: 2, s.sigId: 762222 length: 3 trace: 2 #sec: 0 fragInf: 0
 H'00000016 H'0000030f H'000003ff
--------------- Signal ----------------
r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 762225 gsn: 504 "CREATE_TRIG_REQ" prio: 1
s.bn: 257 "SUMA", s.proc: 2, s.sigId: 762221 length: 13 trace: 0 #sec: 0 fragInf: 0
User: 16842752, Type: UNKNOWN [9], Action: DETACHED, Event: UPDATE, Trigger id: 917510, Table id: 16, Monitor replicas: true Monitor all attributes: false Attribute mask: 000000000000000000000000001fffff
 H'01010000 H'00000006 H'00030001 H'00000010 H'004c9c29 H'000e0006 H'01020309
 H'00000010 H'01010000 H'001fffff H'00000000 H'00000000 H'00000000
--------------- Signal ----------------

Time: Tuesday 22 August 2006 - 16:28:12
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 13643) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 8760
Trace: /space/run/ndb_3_trace.log.1
Version: Version 5.1.12 (beta)

BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
       ?001862 001862 001862 001862 001862 001862 001862 001862
        001862 001862 001862 001862 001862 001862 001862 001862
        001862 001862 001862 001862 001862 001862 001862 001862

--------------- Signal ----------------
r.bn: 250 "DBDICT", r.proc: 3, r.sigId: 1198440022 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 1198440018 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 3, r.sigId: 1198440021 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 1198440018 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 1198440020 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 1198440018 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 3, r.sigId: 1198440019 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 1198440018 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000
--------------- Signal ----------------
r.bn: 251 "NDBCNTR", r.proc: 3, r.sigId: 1198440018 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 1198440014 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000
[23 Aug 2006 21:06] Jonathan Miller
This report shows 2 different bugs, but the original bug can be avoided.

Main issues were:

1) MaxNoOfConcurrentScans was = 100
and
2) TransactionDeadlockDetectionTimeout was = 300000

So, transactions would get deadlocked and would time out after 2 min.
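For reference, a hypothetical config.ini fragment with these two settings. Only the two parameter values are taken from this report; the section placement and the surrounding comments are assumptions, since the reporter's actual config.ini was never attached:

```ini
# Hypothetical [ndbd default] fragment -- values quoted in this report.
[ndbd default]
# Pool of scan records per data node; once timed-out scans leak their
# records, this pool is exhausted and new scans fail with error 245.
MaxNoOfConcurrentScans=100
# Milliseconds a transaction may wait on locks before it is considered
# deadlocked and aborted.
TransactionDeadlockDetectionTimeout=300000
```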

The bugs this shows are:

1) The 4028 is the wrong error to return. The correct error would be 4008.
and
2) When the scans time out, they are not releasing their scan records. After a while, the cluster runs out of scan records and stops processing. Basically, the cluster is useless at this point until it is restarted.

I am closing this bug and opening a P3 for number 1) and a P1 for number 2) above.