Bug #60851 Complete Cluster crashed by 'Temporary error', Error: 2341 in 'dbtc/DbtcMain.cpp'
Submitted: 13 Apr 2011 10:34
Modified: 15 Apr 2011 9:05
Reporter: Stefan Auweiler
Status: Duplicate
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: mysql-5.1.44 ndb-7.1.4b
OS: Solaris (10 x86 on SUN X4600)
Assigned to: Assigned Account
CPU Architecture: Any
Tags: cluster crash

[13 Apr 2011 10:34] Stefan Auweiler
Description:
Yesterday, we lost our complete cluster.

Within seconds, two nodes of the same node group crashed with the above-mentioned error, so all other nodes (6 in all) decided to shut down.

We had a system outage of about 6 hours during main working hours.

I will attach the ndb_error_reporter file to this issue.

Node4 (Group1)
Time: Tuesday 12 April 2011 - 10:01:35
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtc/DbtcMain.cpp
Error object: DBTC (Line: 3868) 0x0000000a
Program: /usr/local/mysqlCluster/mysql/bin/ndbmtd
Pid: 9173 thr: 0
Version: mysql-5.1.44 ndb-7.1.4b
Trace: /DB/mysql/data/ndb_4_trace.log.10 /DB/mysql/data/ndb_4_trace.log.10_t1 /DB/mysql/data/ndb_4_tra

Node6 (Group2)
Time: Tuesday 12 April 2011 - 10:01:39
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: 
Error object: DBTC (Line: 1646) 0x0000000a
Program: /usr/local/mysqlCluster/mysql/bin/ndbmtd
Pid: 9144 thr: 0
Version: mysql-5.1.44 ndb-7.1.4b
Trace: /DB/mysql/data/ndb_6_trace.log.10 /DB/mysql/data/ndb_6_trace.log.10_t1 /DB/mysql/data/ndb_6_trace.log.10_t2 /DB/

Node5 (Group2)
Time: Tuesday 12 April 2011 - 10:01:51
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtc/DbtcMain.cpp
Error object: DBTC (Line: 3868) 0x0000000a
Program: /usr/local/mysqlCluster/mysql/bin/ndbmtd
Pid: 28115 thr: 0
Version: mysql-5.1.44 ndb-7.1.4b
Trace: /DB/mysql/data/ndb_5_trace.log.12 /DB/mysql/data/ndb_5_trace.log.12_t1 /DB/mysql/data/ndb_5_tr

Node3 (Group1)
Time: Tuesday 12 April 2011 - 10:01:52
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 5532) 0x0000000a
Program: /usr/local/mysqlCluster/mysql/bin/ndbmtd
Pid: 25087 thr: 0
Version: mysql-5.1.44 ndb-7.1.4b
Trace: /DB/mysql/data/ndb_3_trace.log.12 /DB/my

Node7 (Group3)
Time: Tuesday 12 April 2011 - 10:01:52
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 5532) 0x0000000e
Program: /usr/local/mysqlCluster/mysql/bin/ndbmtd
Pid: 6537 thr: 0
Version: mysql-5.1.44 ndb-7.1.4b
Trace: /DB/mysql/data/ndb_7_trace.log.13 /DB/mys

Node8 (Group3)
Time: Tuesday 12 April 2011 - 10:01:53
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 5532) 0x0000000e
Program: /usr/local/mysqlCluster/mysql/bin/ndbmtd
Pid: 14472 thr: 0
Version: mysql-5.1.44 ndb-7.1.4b
Trace: /DB/mysql

How to repeat:
It happened once and will hopefully not happen again.
My customer requires an explanation of what happened and how we can prevent the system from being hit again.
[13 Apr 2011 11:49] Jonas Oreland
Hi

Are you also uploading trace files?

/Jonas
[13 Apr 2011 11:57] Stefan Auweiler
Hi Jonas,
Thanks for the quick response.

I've uploaded the complete ndb_error_report_20110412115440.tar.bz2 file to your FTP Server.

As it is about 176 MB in size and your FTP server disconnected several times, I've split it into 4 files:

"Bug_60851_ ndb_error_report_20110412115440.tar.bz2.001" to  "...004"

Do you see them? (The FTP directory is hidden from me.)

Regards
Stefan
[13 Apr 2011 12:01] Stefan Auweiler
Sorry, Jonas,

the readme for the FTP upload was still waiting to be sent ... in another browser window :-)
Here it is ...

Regards Stefan
[13 Apr 2011 13:24] Jonas Oreland
Jonas notes:

Analysis node 6:
Trx: 0135edfb
START - 3 x DEL (UI,IE) - COMMIT

One base-table delete gets LQHKEYREF 410
The other 2 have triggers...
Crash is when LQHKEYCONF last (trigger op)
[14 Apr 2011 13:05] Jonas Oreland
Hi,

I can inform you that this has been fixed for the upcoming 7.0.24 and 7.1.13:
http://lists.mysql.com/commits/134425

I verified your testcase before/after that fix.
Before the fix, it crashed in either of the 2 places that you
found (DBTC (Line: 3868) or DBTC (Line: 1646)), but with the fix
it all works out ok.

I'll however commit your testcase to our regression suite regardless,
'cause it's not exactly the same as the one in the above-mentioned commit.

Thx for great bug report
/Jonas
[14 Apr 2011 13:07] Jonas Oreland
As a side note: this bug confirms my observation that when you fix a bug that you found yourself, people will shortly afterwards start reporting it. Really weird.
[14 Apr 2011 16:07] Stefan Auweiler
Hi Jonas,

thanks for the good news.

What exactly caused the crash? Do you have any advice on what queries/inserts/updates to have a look at, until I can get 7.1.13 in my hands?

I'm a little bit afraid that we might face this problem again.
We have not within the last 17 months, but for a while now we have been using batching more and more :-)

Thanks.
Stefan
[14 Apr 2011 16:15] Jonas Oreland
Not really sure, but my guess would be a statement like
delete from Table where unique_key in (1,2,3);

In your case, the middle delete encountered a temporary error (410),
and incorrect error handling caused the crash.

delete from Table where unique_key in (1) - would not have had the same problem.
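
To make the contrast between the two statement shapes concrete, here is a minimal Python sketch that turns a multi-key delete into one single-key delete per value. The helper name `single_key_deletes` is hypothetical, `Table` and `unique_key` are just the placeholders used above, and `%s` is the parameter style of common Python MySQL drivers. Whether this actually avoids the crash on an unpatched version is an untested assumption based on the guess above, not a confirmed workaround.

```python
from typing import Iterable, List, Tuple


def single_key_deletes(
    table: str, column: str, keys: Iterable[int]
) -> List[Tuple[str, tuple]]:
    """Build one parameterized single-key DELETE per value.

    Hypothetical helper for illustration only. The table and column names
    are interpolated directly, so they must come from trusted code, never
    from user input; the key values are passed as parameters.
    """
    sql = f"DELETE FROM {table} WHERE {column} = %s"
    return [(sql, (key,)) for key in keys]


# Placeholder names taken from the example statement above.
statements = single_key_deletes("Table", "unique_key", [1, 2, 3])
for sql, params in statements:
    # With a real driver this would be cursor.execute(sql, params).
    print(sql, params)
```

Each tuple can be handed to a driver's `cursor.execute(sql, params)`; how the statements are grouped into transactions is left open here.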

---

If you can build from source, you can take the patch and apply it yourself.

/Jonas