Bug #21969 DN failure during restart -i to increase of NoOfFragmentLogFiles
Submitted: 1 Sep 2006 14:15
Modified: 18 Dec 2006 14:28
Reporter: Jonathan Miller
Status: Duplicate
Category: MySQL Cluster: Disk Data
Severity: S2 (Serious)
Version: 5.1.12
OS: Linux (Linux 64 bit OS)
Assigned to:
CPU Architecture: Any

[1 Sep 2006 14:15] Jonathan Miller
Description:
I had set up a Cluster Disk Data replication stress test last night using DBT2 and left it running overnight. When I checked the slave this morning, I found that it had stopped on Error 410:

Last_Errno: 410
Last_Error: Error in Write_rows event: error during transaction execution on table dbt2.new_order

The mysqld error log showed:

060901  8:44:22 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table dbt2.orders, Error_code: 410
.
.

Perror showed:

 ~/jmiller/builds/bin/perror --ndb 410
NDB error code 410: REDO log files overloaded, consult online manual (decrease TimeBetweenLocalCheckpoints, and|or increase NoOfFragmentLogFiles): Temporary error: Overload error
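
For a sense of scale, the redo log sizes implied by the NoOfFragmentLogFiles values in this report (151, 181 and 200) can be estimated with a minimal sketch. It assumes that each unit of NoOfFragmentLogFiles is a set of 4 files of 16 MB each, as documented for MySQL Cluster 5.1; the function and constant names are illustrative only.

```python
# Rough redo log sizing sketch.
# Assumption: one NoOfFragmentLogFiles unit = a set of 4 files of 16 MB each.
FILES_PER_SET = 4
FILE_SIZE_MB = 16


def redo_log_mb(no_of_fragment_log_files):
    """Return the approximate redo log size per data node, in MB."""
    return no_of_fragment_log_files * FILES_PER_SET * FILE_SIZE_MB


for value in (151, 181, 200):
    print(f"NoOfFragmentLogFiles={value}: about {redo_log_mb(value)} MB per data node")
```

Under that assumption, going from 151 to 181 adds a little under 2 GB of redo log per data node.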

NoOfFragmentLogFiles was set to 151. I increased it to 181 and restarted the ndb_mgmd process:
-- NDB Cluster -- Management Client --
ndb_mgm> 1 restart
Connected to Management Server at: 
Node 1 is being restarted

Once it had restarted, I began restarting the data nodes, starting with ID#2:
ndb_mgm> 2 restart -i
Node 2: Node shutdown initiated
Node 2: Node shutdown completed, restarting, no start, initial.

During this restart I got the following message:
Restart failed.
*     0: No error
*        Executing: ndb_mgm_disconnect

I tried to do a show but got back:
ndb_mgm> show
Could not get status
*  1010: Management server not connected
*

So I disconnected and reconnected to ndb_mgm and was able to do a show. ID#2 was no longer trying to restart.

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2 (not connected, accepting connect from n16)
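
When scripting a rolling restart, the "not connected" state shown above can be detected from captured `show` output. The following is a minimal sketch that assumes the line format is exactly as pasted in this report; the function name and regular expression are illustrative and not part of any MySQL tool.

```python
import re

# Matches data node lines such as:
#   id=2 (not connected, accepting connect from n16)
NOT_CONNECTED = re.compile(
    r"^id=(\d+)\s+\(not connected, accepting connect from (\S+)\)"
)


def not_connected_nodes(show_output):
    """Return (node_id, host) for every node reported as not connected."""
    found = []
    for line in show_output.splitlines():
        match = NOT_CONNECTED.match(line.strip())
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


sample = """Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2 (not connected, accepting connect from n16)
"""
print(not_connected_nodes(sample))
```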

Looking in the data directory, I found a different type of error log, no trace file and a core.

$ ls
config.ini  ndb_1_cluster.log  ndb_1.pid  ndb_2_out.log  ndb_pid10600_error.log
core.10600  ndb_1_out.log      ndb_2_fs   ndb_2.pid

The error log showed:

Time: Friday 1 September 2006 - 14:27:38
Status: Permanent error, external action needed
Message: Invalid configuration received from Management Server (Configuration error)
Error: 2350
Error data: Unable to alloc node id
Error object: Error : Could not alloc node id at n16 port 14000: Cluster refused allocation of id 2. Error: 1703 (Node failure handling not completed: Permanent error: Application error).
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 10600
Trace: <no tracefile>
Vers

I tried several times to restart ndbd with --initial but got the same error each time.
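
To compare error logs across repeated attempts like these, the "Key: value" layout of the error log above can be read into a dictionary. This is a minimal sketch that assumes every field sits on one line and splits at the first colon; the function name is illustrative, and the sample is abbreviated from the log in this report.

```python
def parse_error_log(text):
    """Split 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


sample = """Time: Friday 1 September 2006 - 14:27:38
Status: Permanent error, external action needed
Error: 2350
Error data: Unable to alloc node id
Trace: <no tracefile>
Vers"""

log = parse_error_log(sample)
print(log["Error"], "-", log["Error data"])
```

The truncated final line (`Vers`) has no colon and is ignored rather than guessed at.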

The backtrace showed:
#0  0x00000033b835bd3d in fflush () from /lib64/libc.so.6
(gdb) bt
#0  0x00000033b835bd3d in fflush () from /lib64/libc.so.6
#1  0x00000000004a8560 in writeChildInfo ()
#2  0x00000000004a8594 in childReportError ()
#3  0x00000000006a7542 in ErrorReporter::handleError ()
#4  0x000000000069ea53 in Configuration::fetch_configuration ()
#5  0x00000000004a93af in main ()

I finally took a backup from the other data node that was still up, and shut down the slave cluster.

I then edited config.ini and set NoOfFragmentLogFiles = 200.

I then had to manually rm -rf the 2 ndbd file systems, because the cluster uses disk data and I had to do a restore.

I brought the cluster back up and did a restore. During the restore I received about 50 of the following:

_____________________________________________________
Processing data in table: dbt2/def/NDB$BLOB_25_20(26) fragment 0
Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)
Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)
Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)
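
Counts such as "about 50" can be made exact by tallying the temporary error codes in saved restore output. This is a minimal sketch that assumes the line format shown above; the function name is illustrative, and the sample contains only the three lines pasted in this report.

```python
import re
from collections import Counter

# Matches lines such as:
#   Temporary error: 1221: REDO buffers overloaded, ...
TEMP_ERROR = re.compile(r"^Temporary error: (\d+):")


def tally_temporary_errors(restore_output):
    """Count occurrences of each temporary error code in restore output."""
    counts = Counter()
    for line in restore_output.splitlines():
        match = TEMP_ERROR.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts


sample = """Processing data in table: dbt2/def/NDB$BLOB_25_20(26) fragment 0
Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)
Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)
Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)
"""
print(tally_temporary_errors(sample))
```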

The restore completed and I have since restarted the slave, which is now at about Seconds_Behind_Master: 25577.

 

How to repeat:
Not sure. See above.
[21 Sep 2006 23:27] Jonathan Miller
Yep, it is the same bug, but this was not a debug build. I would like to take it before the bug committee.

Thanks
[16 Dec 2006 10:11] Jonas Oreland
Hi,

I'll close this as duplicate,
of the two bugs.

If you find an error log/tracefile for the "signal" crash at the end,
then please open a new bug report supplying that.

/Jonas
[18 Dec 2006 13:27] Lars Bo Svenningsen
> I'll close this as duplicate,
> of the two bugs.

What two bugs?
[18 Dec 2006 14:28] Jonathan Miller
bug http://bugs.mysql.com/bug.php?id=10894