Bug #21124 | DBT2 cause temp failures 4028 that lead to scan error 245 | ||
---|---|---|---|
Submitted: | 18 Jul 2006 18:45 | Modified: | 23 Aug 2006 21:06 |
Reporter: | Jonathan Miller | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 5.1.12 | OS: | Linux (Linux) |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
Tags: | Blocking Testing at customer site |
[18 Jul 2006 18:45]
Jonathan Miller
[18 Jul 2006 21:09]
Jonathan Miller
Hi, Looking at the test error log, you will see that the 4028's always happens on the new order stored procdure call. So I created a perl script to do that call and put 16 of them running. After some more thinking, new order is the only transaction in the test that has built in rollbacks. That is 1 % of all new orders will be rolled back buy the test. In looking at it with Mikael, he found a small bug where it is reporting 4028 when it should be reporting 4008, scan time out. These scans start and never return. The causes the scan record leak where we finial run out of free scan records and all the transaction start failing. So looking at it, there seems to be an issue doing rollbacks in a Heavily loaded cluster that causes scans to not return which use up all the free scan records. In addition, these cause data to not be replicated correctly giving missed match data on the slave. The perl script does not reproduce these conditions, but the test with all the calls and 40 threads do. In addition, a new order transaction is a very large transaction to have to rollback.
[27 Jul 2006 20:02]
Jonathan Miller
Slave data not really corrupted. The table "history" has not primary key and therefore mysqldump can not order the dump. The tables dumps different from cluster to cluster.
[21 Aug 2006 11:12]
Jonas Oreland
"So I created a perl script to do that call and put 16 of them running." Did you get error this way?
[21 Aug 2006 12:11]
Jonathan Miller
No, I did not get the errors by using the perl scripts.
[22 Aug 2006 10:48]
Jonas Oreland
Hi, I've now run for 17h wi/ getting the 4028. Can you please _upload_ (or mail) 1) your copy of dbt2 (including _all_ potential changes you made) 2) your my.cnf for all mysqlds 3) your config.ini And also specify exact command line for both load and run. /Jonas
[22 Aug 2006 12:28]
Jonathan Miller
Do you have "engine_condition_pushdown = on"? THis might have something to do with it. Just a thought.
[22 Aug 2006 13:42]
Jonas Oreland
1) can upload/mail config for 64-bit also? 2) how long did you run
[22 Aug 2006 14:27]
Jonathan Miller
The config.ini is the same as the 32 bit testing. Copied it over and just changed out the host. I ran for 40000 seconds. In addition, what seems strange is the cluster is not able to cleanup the scans even though the testing has stopped. mysql> use dbt2 Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Database changed mysql> select count(*) from warehouse; ERROR 1297 (HY000): Got temporary error 245 'Too many active scans' from NDBCLUSTER So it is stuck in this state. I am going to try restarting each DN and see if it clears. /jeb
[22 Aug 2006 14:31]
Jonathan Miller
ndb_mgm> 2 restart Node 2: Node shutdown initiated Node 2: Node shutdown completed, restarting, no start. Node 2 is being restarted ndb_mgm> Node 2: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. Time: Tuesday 22 August 2006 - 16:28:11 Status: Temporary error, restart node Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug) Error: 6000 Error data: Signal 8 received; Floating point exception Error object: main.cpp Program: /home/ndbdev/jmiller/builds/libexec/ndbd Pid: 1804 Trace: /space/run/ndb_2_trace.log.1 Version: Version 5.1.12 (beta) BLOCK ADDR ADDR ADDR ADDR ADDR ADDR ADDR ADDR ?000077 001200 003759 003874 003759 003874 003759 003874 003759 003874 003759 004084 004049 003759 004002 003946 003948 003759 004084 004049 003759 004084 004049 003759 --------------- Signal ---------------- r.bn: 262 "RESTORE", r.proc: 2, r.sigId: 762227 gsn: 164 "CONTINUEB" prio: 1 s.bn: 262 "RESTORE", s.proc: 2, s.sigId: 762222 length: 2 trace: 2 #sec: 0 fragInf: 0 H'00000000 H'00000000 --------------- Signal ---------------- r.bn: 248 "DBACC", r.proc: 2, r.sigId: 762226 gsn: 253 "EXPANDCHECK2" prio: 1 s.bn: 248 "DBACC", s.proc: 2, s.sigId: 762222 length: 3 trace: 2 #sec: 0 fragInf: 0 H'00000016 H'0000030f H'000003ff --------------- Signal ---------------- r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 762225 gsn: 504 "CREATE_TRIG_REQ" prio: 1 s.bn: 257 "SUMA", s.proc: 2, s.sigId: 762221 length: 13 trace: 0 #sec: 0 fragInf: 0 User: 16842752, Type: UNKNOWN [9], Action: DETACHED, Event: UPDATE, Trigger id: 917510, Table id: 16, Monitor replicas: true Monitor all attributes: false Attribute mask: 000000000000000000000000001fffff H'01010000 H'00000006 H'00030001 H'00000010 H'004c9c29 H'000e0006 H'01020309 H'00000010 H'01010000 H'001fffff H'00000000 H'00000000 H'00000000 --------------- Signal ---------------- Time: Tuesday 22 August 2006 - 16:28:12 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: dbdict/Dbdict.cpp Error object: DBDICT (Line: 13643) 0x0000000a Program: /home/ndbdev/jmiller/builds/libexec/ndbd Pid: 8760 Trace: /space/run/ndb_3_trace.log.1 Version: Version 5.1.12 (beta) BLOCK ADDR ADDR ADDR ADDR ADDR ADDR ADDR ADDR ?001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 001862 --------------- Signal ---------------- r.bn: 250 "DBDICT", r.proc: 3, r.sigId: 1198440022 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 1198440018 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 246 "DBDIH", r.proc: 3, r.sigId: 1198440021 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 1198440018 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 1198440020 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 1198440018 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 245 "DBTC", r.proc: 3, r.sigId: 1198440019 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 251 "NDBCNTR", s.proc: 3, s.sigId: 1198440018 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000 --------------- Signal ---------------- r.bn: 251 "NDBCNTR", r.proc: 3, r.sigId: 1198440018 gsn: 26 "NODE_FAILREP" prio: 1 s.bn: 252 "QMGR", s.proc: 3, s.sigId: 1198440014 length: 5 trace: 8 #sec: 0 fragInf: 0 H'00000003 H'00000003 H'00000001 H'00000004 H'00000000
[23 Aug 2006 21:06]
Jonathan Miller
This bug show 2 different bugs, but the original bug can be avoided. Main issues where: 1) MaxNoOfConcurrentScans where = 100 and 2) TransactionDeadlockDetectionTimeout: where = 300000 So, trancations would get deadlocked and would time out after 2 min. The bugs this shows are: 1) the 4028 is the wrong error to return. The correct error would be 4008. and 2) When the scans time out, they are not releasing the scan records. After a while, the cluster runs out of scan records and stops processing. Basically the cluster is useless at this point until restarted I am closing this bug and opening a P3 for number 1) and a P1 for number 2) above.