MySQL Bugs: #64788: Repeatable crash in DBDIH in 7.2.5 on node restart after import of new .sql

Bug #64788	Repeatable crash in DBDIH in 7.2.5 on node restart after import of new .sql
Submitted:	28 Mar 2012 14:39	Modified:	3 Apr 2012 6:10
Reporter:	Carl Krumins
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	7.2.5	OS:	Linux (Binary x64 linux generic versions on latest Ubuntu)
Assigned to:		CPU Architecture:	Any
Tags:	crash, dbdih, Failure, ndbmtd, node restart

Description:
A data node crashes repeatably in DBDIH with error 2341 when it is restarted after a .sql import into a new cluster started with --initial. This occurs in every version tested.

This follows on from the old, still-open bug #62650, where the patch suggested by Jonas Oreland did not fix the problem.
The same issue is still repeatable with the latest 7.2.5 version of ndbmtd.

How to repeat:
Start a fresh cluster with --initial and 4 data nodes; all of them start successfully with no data.
Restore data with either ndb_restore or a mysqldump .sql file (tried both, same result) while all 4 data nodes are online.
Restart one data node after the restore; it crashes with exactly the same error every time.
That node never starts again once the .sql data has been imported after the first --initial startup.
Restart another data node and it also fails to start, with the same DBDIH error 2341 crash.
Only 2 of the 4 nodes remain up and running after the data is loaded from the .sql file.
This is repeatable across different NDB versions, including the latest 7.2.5.
Loading the same .sql file into a separate, brand new --initial 4-data-node cluster on different hardware and a different network gives exactly the same error.

Tried 7.1.9, 7.1.15a, 7.2.2, 7.2.4 and 7.2.5; all of them crash.
7.2.5, 7.2.2 and 7.1.15a report error 2341 in DBDIH; the 7.1.9 log below shows DBDIH error 2306 (Pointer too large).
7.2.4 also reported error 2341, but in DBLQH and DBTUP.
The ndb_error_reporter output from the latest 7.2.5 version is included.
All versions are the binary Linux generic x64 builds of ndbmtd, except for the build with the patch suggested by Jonas Oreland in bug #62650 (results below).

===Latest Version 7.2.5=======
Time: Monday 26 March 2012 - 18:10:55
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbdihMain.cpp
Error object: DBDIH (Line: 14745) 0x00000000
Program: ndbmtd
Pid: 96411 thr: 0
Version: mysql-5.5.20 ndb-7.2.5
Trace: /data/mysqlcluster//ndb_3_trace.log.15 [t1..t17]
***EOM***

===Old Version 7.2.4==========
Time: Tuesday 13 March 2012 - 12:23:49
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 9764) 0x00000004
Program: ndbmtd
Pid: 25812 thr: 3
Version: mysql-5.5.19 ndb-7.2.4
Trace: /data/mysqlcluster//ndb_3_trace.log.1 [t1..t7]
***EOM***

===Old Version 7.2.4 Part 2===
Time: Sunday 25 March 2012 - 23:36:26
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbtupRoutines.cpp
Error object: DBTUP (Line: 728) 0x00000004
Program: ndbmtd
Pid: 27403 thr: 2
Version: mysql-5.5.19 ndb-7.2.4
Trace: /data/mysqlcluster//ndb_2_trace.log.2 [t1..t7]
***EOM***

===Old Version 7.2.2==========
Time: Sunday 15 January 2012 - 22:26:41
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: /pb2/build/sb_0-4474442-1322943363.11/mysql-cluster-gpl-7.2.2/storage/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp
Error object: DBDIH (Line: 14625) 0x00000000
Program: /usr/local/mysql/mysql-cluster-gpl-7.2.2-linux2.6-x86_64/bin/ndbmtd
Pid: 1288 thr: 0
Version: my

===Old Version 7.1.15a========
Time: Wednesday 5 October 2011 - 00:39:59
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdih/DbdihMain.cpp
Error object: DBDIH (Line: 14611) 0x00000000
Program: /usr/local/mysql/mysql-7.1.15a-linux-x86_64/bin/ndbmtd
Pid: 3609 thr: 0
Version: mysql-5.1.56 ndb-7.1.15a
Trace: /data/mysqlcluster//ndb_4_trace.log.1 /data/mysqlcluster//ndb_4_trace.log

===Old Version 7.1.9==========
Time: Friday 14 January 2011 - 10:16:03
Status: Temporary error, restart node
Message: Pointer too large (Internal error, programming error or missing error message, please report a bug)
Error: 2306
Error data: dbdih/DbdihMain.cpp
Error object: DBDIH (Line: 17161) 0x00000006
Program: ndbmtd
Pid: 29803 thr: 0
Version: mysql-5.1.51 ndb-7.1.9
Trace: /data/mysqlcluster//ndb_2_trace.log.3 /data/mysqlcluster//ndb_2_trace.log.3_t1 /data/mysqlcluster//ndb_2_trace.log.3_t2 /data/mysqlcluster//ndb_2_tra

==============================
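To compare the crash sites across versions, the entries above can be reduced to (block, line, error code) tuples. The following is a minimal Python sketch, assuming the entries keep the "Key: value" layout shown in this report; the function names `parse_error_entry` and `crash_site` are hypothetical helpers, not part of any NDB tool.

```python
import re

# Field names as they appear in the error-log entries quoted in this report.
FIELD_RE = re.compile(
    r"^(Time|Status|Message|Error data|Error object|Error|Program|Pid|Version|Trace):\s*(.*)$"
)
# Matches e.g. "DBDIH (Line: 14745) 0x00000000".
OBJECT_RE = re.compile(r"^(\w+) \(Line: (\d+)\)")


def parse_error_entry(text):
    """Parse one error-log entry into a dict of field name -> value."""
    entry = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, section separators and the end-of-message marker.
        if not line or line.startswith("===") or line.startswith("***EOM***"):
            continue
        match = FIELD_RE.match(line)
        if match:
            last_key = match.group(1)
            entry[last_key] = match.group(2)
        elif last_key is not None:
            # A line without a known key is treated as a wrapped continuation.
            entry[last_key] += " " + line
    return entry


def crash_site(entry):
    """Return (block, line, error code), or None if the entry is incomplete."""
    match = OBJECT_RE.match(entry.get("Error object", ""))
    if match is None or "Error" not in entry:
        return None
    return (match.group(1), int(match.group(2)), int(entry["Error"]))


# Excerpt of the 7.2.5 entry quoted in this report.
SAMPLE = """\
Time: Monday 26 March 2012 - 18:10:55
Status: Temporary error, restart node
Error: 2341
Error data: DbdihMain.cpp
Error object: DBDIH (Line: 14745) 0x00000000
Version: mysql-5.5.20 ndb-7.2.5
***EOM***
"""

if __name__ == "__main__":
    print(crash_site(parse_error_entry(SAMPLE)))
```

Truncated entries (for example one whose Version field is cut off) still parse; the cut-off value is kept as-is rather than guessed.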

I applied the patch suggested by Jonas Oreland in http://bugs.mysql.com/bug.php?id=62650, which was:
=== modified file 'storage/ndb/src/kernel/blocks/dbdih/Dbdih.hpp' 
-#define ZPAGEREC 100
+#define ZPAGEREC 400

With ZPAGEREC set to 400, the result was exactly the same: the data node crashed on startup with the same error as above (no improvement or change at all).
I then increased ZPAGEREC further (to 2000, as a guess). The cluster no longer crashed, but logged messages such as:
2012-03-10 09:32:27 [MgmtSrvr] WARNING  -- Node 4: GCP Monitor: GCP_SAVE lag 60 seconds (no max lag)
2012-03-10 09:33:29 [MgmtSrvr] WARNING  -- Node 4: GCP Monitor: GCP_SAVE lag 120 seconds (no max lag)
2012-03-10 09:43:30 [MgmtSrvr] WARNING  -- Node 4: GCP Monitor: GCP_COMMIT lag 10 seconds (no max lag)
2012-03-10 09:43:40 [MgmtSrvr] WARNING  -- Node 4: GCP Monitor: GCP_COMMIT lag 20 seconds (no max lag)
...and so on, repeated for hours with increasing lag times; the cluster never started.
The cluster eventually just ‘hung’ and became unresponsive.
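To see how the lag grew over those hours, the GCP Monitor warnings can be pulled out of the management server log. This is a rough Python sketch, assuming the log lines have exactly the format quoted above; `parse_gcp_lag` and `max_lag_by_phase` are hypothetical names chosen for illustration.

```python
import re
from datetime import datetime

# Pattern based only on the warning lines quoted in this report.
GCP_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] WARNING\s+-- "
    r"Node (\d+): GCP Monitor: (GCP_SAVE|GCP_COMMIT) lag (\d+) seconds"
)


def parse_gcp_lag(lines):
    """Return a list of (timestamp, node id, phase, lag in seconds)."""
    records = []
    for line in lines:
        match = GCP_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            records.append(
                (stamp, int(match.group(2)), match.group(3), int(match.group(4)))
            )
    return records


def max_lag_by_phase(records):
    """Return the worst lag seen for each (node id, phase) pair."""
    worst = {}
    for _stamp, node, phase, lag in records:
        key = (node, phase)
        worst[key] = max(worst.get(key, 0), lag)
    return worst


# The four warnings quoted in this report.
SAMPLE_LOG = """\
2012-03-10 09:32:27 [MgmtSrvr] WARNING  -- Node 4: GCP Monitor: GCP_SAVE lag 60 seconds (no max lag)
2012-03-10 09:33:29 [MgmtSrvr] WARNING  -- Node 4: GCP Monitor: GCP_SAVE lag 120 seconds (no max lag)
2012-03-10 09:43:30 [MgmtSrvr] WARNING  -- Node 4: GCP Monitor: GCP_COMMIT lag 10 seconds (no max lag)
2012-03-10 09:43:40 [MgmtSrvr] WARNING  -- Node 4: GCP Monitor: GCP_COMMIT lag 20 seconds (no max lag)
"""

if __name__ == "__main__":
    for key, lag in max_lag_by_phase(parse_gcp_lag(SAMPLE_LOG.splitlines())).items():
        print(key, lag)
```

Lines that do not match the pattern are ignored, so the whole cluster log can be fed in unfiltered.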

The cluster uses approx. 70% of DataMemory=38800M across (normally) 4 nodes in 2 node groups, so I am unsure which piece of data is causing the crash, if it is data related.
The cluster is currently running on 2 out of 4 nodes in 1 node group. The other node group refuses to start.
As a result, the cluster is currently running with zero redundancy, so please treat this with priority.

Suggested fix:
Not sure.

Have uploaded ndb_error_report to FTP filename:
bug-64788_ndb_error_report_20120328232448.tar.bz2

Fixed in 7.0.32, 7.1.21, and 7.2.6.

Thanks, Bogdan Kecman, for the update. Is the latest 7.2.6 available on Launchpad (or elsewhere)? Has the fix for this bug already been applied to 7.2.6, or is there another patch we can test to confirm that it resolves our issue?
Is there any further technical information you can disclose about the problem, the solution, and the appropriate fix?
Thanks
Carl

7.2.6 and 7.1.21 are tagged but not released yet. 

The problem is related to the size of the table definition. When a table definition is too large (over 32k), you hit this bug.

Data node crash log

Attachment: ndb_2_trace.log.zip (application/x-zip-compressed, text), 48.09 KiB.