Bug #46782 ndbmtd crashes with error code 2341 when running a DBT2 test
Submitted: 18 Aug 2009 10:20
Modified: 25 Aug 2009 11:30
Reporter: raid fifa
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.1.35-ndb-7.0.6
OS: Linux (suse EL 10 sp2)
Assigned to: Jonas Oreland
CPU Architecture: Any

[18 Aug 2009 10:20] raid fifa
Description:
Environment:
4 machines: 
2 IBM x3850m2 (4 cores*4, 8GB mem) hosting one mgmd and two mysqld nodes; 2 IBM x3950m2 (4 cores*8, 32GB mem) hosting four ndbmtd nodes.
OS:
SuSE Enterprise Linux 10 SP2
MySQL Cluster:
mysql-com-5.1.35-ndb-7.0.6 for Linux x86_64
DBT2:
dbt2-0.37.45.tar.gz from http://www.iclaustron.com/

I ran the following step:
./mysql_load_db.sh --database dbt2w5 --path /tmp/dbt2test --socket /tmp/mysql.sock --engine NDB

One ndbmtd node always crashes with the following error information:
Time: Friday 14 August 2009 - 13:20:23
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: suma/Suma.cpp
Error object: SUMA (Line: 4121) 0x0000000a
Program: ndbmtd
Pid: 5887

How to repeat:
After building the MySQL Cluster, run the DBT2 test in the environment described above.

NOTE:
I have tested MySQL Cluster with DBT2 many times, and I am sure I am following the correct procedure.
I used dbt2-0.37.45 to test mysql-cluster-com-5.1.35-ndb-7.0.6 (x86_64 version) on other Intel multi-core machines (4 cores*2, 2 cores*2) and saw no problems.

Suggested fix:
I use the real-time extension and multi-threading parameters in my config.ini file; please refer to the attached config.ini.
I think there may be a bug that occurs when ndb-7.0.6 runs on multi-core machines (number of cores > 16).
[18 Aug 2009 10:23] raid fifa
Output log and error log from one ndbmtd node, plus configuration files

Attachment: ndb_log.zip (application/x-zip-compressed, text), 358.11 KiB.

[18 Aug 2009 11:39] Jonas Oreland
see bug#46123
see bug#46723
see bug#45612
[19 Aug 2009 15:27] Jonas Oreland
proposed patch

Attachment: bug46782.patch (text/x-patch), 8.82 KiB.

[19 Aug 2009 15:27] Jonas Oreland
Hi,

If you could retest with the patch that I attached, that would be great.

/Jonas
[19 Aug 2009 15:30] raid fifa
I'm glad to hear from you, and I'll test this patch.
Thank you!
[19 Aug 2009 15:35] Robert Klikics
Thanks for the patch, Jonas.

Because my cluster setup is in a production environment, I can't test the patch right now, especially not under load ... sorry!

Regards,
Robert
[21 Aug 2009 5:19] Jonas Oreland
Hi raid (and rest),

Did you test the patch that I made for the bug?
I'm in a bit of a hurry because we have a 7.0.7 release scheduled for next week...

and I can only reproduce the problem myself by being really nasty:
- start ndbmtd with MaxNoOfExecutionThreads=8 (on a 4-core machine)
- start high update load
- start big background load (compiling)

etc...but the patch fixed the problem in this scenario.

and I'm really keen on getting feedback

/Jonas
[22 Aug 2009 10:46] raid fifa
Sorry, this MySQL Cluster was deployed on our customer's production system. I have been migrating some Oracle data to MySQL Cluster and have been very busy recently. I may be able to test this patch in the next one or two weeks.
[24 Aug 2009 8:19] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/81378

2964 Jonas Oreland	2009-08-24
      ndb - bug#46782
        crash in SUMA.
        For each global checkpoint, schedule each thread running an LQH to prevent
          uneven load causing SUMA to overflow circular buffer
[24 Aug 2009 8:24] Jonas Oreland
pushed to 7.0.7
[25 Aug 2009 11:30] Jon Stephens
Documented bugfix in the NDB-7.0.7 changelog as follows:

        During a global checkpoint, LQH threads could run unevenly, causing a
        circular buffer overflow by the Subscription Manager, which led to data
        node failure.
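
The failure mode described in the changelog entry can be illustrated with a small toy model. This is only a sketch and not the actual SUMA code: the class EpochRing, the function run, the ring capacity of 8 and the thread counts are all hypothetical names and values chosen for illustration. The model assumes that a checkpoint slot is released only after every LQH thread has reported for it, so a single starved thread keeps slots occupied until the fixed-size buffer is full.

```python
class EpochRing:
    """Fixed-size ring of in-flight checkpoints (toy model, not NDB code)."""

    def __init__(self, capacity, n_threads):
        self.capacity = capacity
        self.n_threads = n_threads
        # epoch number -> set of thread ids that have not reported yet
        self.pending = {}

    def open_epoch(self, epoch):
        # A new checkpoint needs a free slot; none is left if earlier
        # checkpoints are still waiting for a lagging thread.
        if len(self.pending) >= self.capacity:
            raise OverflowError(f"ring full at epoch {epoch}")
        self.pending[epoch] = set(range(self.n_threads))

    def report(self, epoch, thread):
        # A slot is released only once every thread has reported for it.
        waiting = self.pending[epoch]
        waiting.discard(thread)
        if not waiting:
            del self.pending[epoch]


def run(n_epochs, capacity, n_threads, starved_thread=None):
    """Return the epoch at which the ring overflows, or None if it never does."""
    ring = EpochRing(capacity, n_threads)
    for epoch in range(n_epochs):
        try:
            ring.open_epoch(epoch)
        except OverflowError:
            return epoch
        for thread in range(n_threads):
            # The starved thread never gets scheduled, so it never reports.
            if thread != starved_thread:
                ring.report(epoch, thread)
    return None


if __name__ == "__main__":
    print("all threads scheduled:", run(100, capacity=8, n_threads=4))
    print("thread 3 starved:", run(100, capacity=8, n_threads=4, starved_thread=3))
```

In this model, the committed fix ("for each global checkpoint, schedule each thread running an LQH") corresponds to the case where no thread is starved, so every slot is released before the ring fills up.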