Bug #49664 MySQL Cluster data node failure
Submitted: 14 Dec 2009 10:32 Modified: 21 Dec 2009 13:55
Reporter: MySQL DBA Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:mysql-5.1-telco-7.0 OS:Linux (Fedora)
Assigned to: Assigned Account CPU Architecture:Any
Tags: 5.1.35-ndb-7.0.7-cluster-gpl, data, Failure, MySQLCluster, node

[14 Dec 2009 10:32] MySQL DBA
Description:
Hi All,

I am getting this error from quite a few days now. I try to run ndb_error_reporter on this but it didn't work. I am getting following error:-

Time: Sunday 13 December 2009 - 23:27:28
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBLQH (Line: 512) 0x0000000a
Program: ndbmtd
Pid: 26172 thr: 2
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /usr/local/mysql/data/ndb_4_trace.log.14 /usr/local/mysql/data/ndb_4_trace.log.14_t1 /usr/local/mysql/data/ndb_4_trace.log.14

How to repeat:
This error is occurring again and again when I trying to load bulk data.
[14 Dec 2009 10:34] MySQL DBA
This is configuration file.

Attachment: Config.ini (, text), 2.43 KiB.

[14 Dec 2009 10:35] MySQL DBA
This is the trace log related to this error

Attachment: Trace_log.txt (text/plain), 28.89 KiB.

[14 Dec 2009 10:35] MySQL DBA
This is the second trace log

Attachment: ndb_4_trace.log.14_t1.txt (text/plain), 28.89 KiB.

[14 Dec 2009 10:38] MySQL DBA
These are the logs from management nodes.

2009-12-13 00:52:23 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 409 started. Keep GCI = 702757 oldest restorable GCI = 664173
2009-12-13 01:50:19 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 410 started. Keep GCI = 704461 oldest restorable GCI = 664173
2009-12-13 02:48:15 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 411 started. Keep GCI = 706165 oldest restorable GCI = 664173
2009-12-13 03:46:11 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 412 started. Keep GCI = 707869 oldest restorable GCI = 664173
2009-12-13 04:44:10 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 413 started. Keep GCI = 709574 oldest restorable GCI = 664173
2009-12-13 05:42:07 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 414 started. Keep GCI = 711280 oldest restorable GCI = 664173
2009-12-13 06:40:06 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 415 started. Keep GCI = 712986 oldest restorable GCI = 664173
2009-12-13 07:38:07 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 416 started. Keep GCI = 714691 oldest restorable GCI = 664173
2009-12-13 08:36:05 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 417 started. Keep GCI = 716399 oldest restorable GCI = 664173
2009-12-13 09:34:06 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 418 started. Keep GCI = 718106 oldest restorable GCI = 664173
2009-12-13 10:32:03 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 419 started. Keep GCI = 719815 oldest restorable GCI = 664173
2009-12-13 11:29:56 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 420 started. Keep GCI = 721519 oldest restorable GCI = 664173
2009-12-13 12:27:55 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 421 started. Keep GCI = 723220 oldest restorable GCI = 664173
2009-12-13 13:25:53 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 422 started. Keep GCI = 724926 oldest restorable GCI = 664173
2009-12-13 14:23:48 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 423 started. Keep GCI = 726631 oldest restorable GCI = 664173
2009-12-13 15:21:47 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 424 started. Keep GCI = 728333 oldest restorable GCI = 664173
2009-12-13 16:19:45 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 425 started. Keep GCI = 730039 oldest restorable GCI = 664173
2009-12-13 17:17:42 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 426 started. Keep GCI = 731745 oldest restorable GCI = 664173
2009-12-13 18:15:43 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 427 started. Keep GCI = 733450 oldest restorable GCI = 664173
2009-12-13 19:13:40 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 428 started. Keep GCI = 735160 oldest restorable GCI = 664173
2009-12-13 20:11:38 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 429 started. Keep GCI = 736865 oldest restorable GCI = 664173
2009-12-13 21:09:38 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 430 started. Keep GCI = 738570 oldest restorable GCI = 664173
2009-12-13 22:07:35 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 431 started. Keep GCI = 740277 oldest restorable GCI = 664173
2009-12-13 23:05:31 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 432 started. Keep GCI = 741982 oldest restorable GCI = 664173
2009-12-13 23:31:15 [MgmtSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-12-13 23:31:15 [MgmtSrvr] ALERT    -- Node 3: Node 4 Disconnected
2009-12-13 23:31:15 [MgmtSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2009-12-13 23:31:15 [MgmtSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2009-12-13 23:31:15 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 closed
2009-12-13 23:31:15 [MgmtSrvr] ALERT    -- Node 1: Node 4 Disconnected
2009-12-13 23:31:15 [MgmtSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2009-12-13 23:31:15 [MgmtSrvr] ALERT    -- Node 1: Node 4 Disconnected
2009-12-13 23:31:15 [MgmtSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=4bc00003c1427405]
2009-12-13 23:31:18 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 opened
2009-12-14 00:03:20 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 433 started. Keep GCI = 743685 oldest restorable GCI = 664173
2009-12-14 01:01:11 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 434 started. Keep GCI = 745386 oldest restorable GCI = 744481
2009-12-14 01:59:07 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 435 started. Keep GCI = 747088 oldest restorable GCI = 747072
2009-12-14 02:52:02 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 436 started. Keep GCI = 748793 oldest restorable GCI = 748538
2009-12-14 02:52:34 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 437 started. Keep GCI = 750348 oldest restorable GCI = 750360
2009-12-14 03:50:18 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 438 started. Keep GCI = 750365 oldest restorable GCI = 750360
2009-12-14 04:48:09 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 439 started. Keep GCI = 752059 oldest restorable GCI = 751726
[14 Dec 2009 10:45] Jonas Oreland
Hi,

The problem is with send-buffer (and has been addressed in http://bugs.mysql.com/bug.php?id=48852)
Remove these lines from your config.ini
TotalSendBufferMemory=4M
SendBufferMemory=2M

And instead add:
[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=2M
[14 Dec 2009 10:53] MySQL DBA
Thanks Jonas for a quick response.

I'll implement it and get back to you.

Regards, 
Manasi Save.
[14 Dec 2009 11:34] MySQL DBA
Hi Jonas, 

I changed the config.ini TCP DEFAULT section.

But immediately after connecting one node got disconnected with following :- 

error.Time: Monday 14 December 2009 - 06:26:11
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBLQH (Line: 512) 0x0000000a
Program: ndbmtd
Pid: 20136 thr: 2
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /usr/local/mysql/data/ndb_3_trace.log.13 /usr/local/mysql/data/ndb_3_trace.log.13_t1 /usr/local/mysql/data/ndb_3_trace.log.13

How should I do now?
[14 Dec 2009 11:37] Jonas Oreland
Hi again,

1) check that you removed it from ndbd default section
2) check that your config changes took effect
   (you need to restart ndb_mgmd using the "--reload" switch)
3) Increase the value of SendBufferMemory to say 16M
[14 Dec 2009 11:45] MySQL DBA
Hi Jonas, 

Yes, I have removed it completely from ndbd section.
I have started ndb_mgmd with --reload.
and now I'll try to increase the SendBufferMemory.

Also,

These are few more warnigns I got after connecting to all nodes: -2009-10-06 02:45:53 [MgmSrvr] WARNING  -- Failed to convert connection from '10.1.138.3:37239' to transporter
2009-10-06 02:45:53 [MgmSrvr] WARNING  -- Failed to convert connection from '10.1.138.3:37240' to transporter
2009-10-06 02:45:53 [MgmSrvr] WARNING  -- Failed to convert connection from '10.1.138.3:37241' to transporter
2009-10-06 02:45:53 [MgmSrvr] WARNING  -- Failed to convert connection from '10.1.138.3:37242' to transporter
2009-10-06 02:45:53 [MgmSrvr] WARNING  -- Failed to convert connection from '10.1.138.3:37244' to transporter
2009-10-06 02:45:53 [MgmSrvr] WARNING  -- Failed to convert connection from '10.1.138.3:37245' to transporter
[21 Dec 2009 12:26] Jonas Oreland
How is testing of new config doing ?
[21 Dec 2009 12:41] MySQL DBA
Hi Jonas,

Sorry for not giving status. The new config is working fine. so, far there is no problem.

Thanks,
Regards,
Manasi Save
MySQL DBA
[21 Dec 2009 12:55] Jonas Oreland
super!

so i'll close this a duplicate of the one with the error message (for now)
[21 Dec 2009 13:55] MySQL DBA
Ok.

Thanks for your active support.

Regards,
Manasi Save
MySQL DBA