Description:
I set up a Cluster Disk Data replication stress test last night using DBT2 and left it running overnight. When I checked the slave this morning, I found that it had stopped on error 410:
Last_Errno: 410
Last_Error: Error in Write_rows event: error during transaction execution on table dbt2.new_order
The mysqld error log showed:
060901 8:44:22 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table dbt2.orders, Error_code: 410
.
.
Running perror showed:
~/jmiller/builds/bin/perror --ndb 410
NDB error code 410: REDO log files overloaded, consult online manual (decrease TimeBetweenLocalCheckpoints, and|or increase NoOfFragmentLogFiles): Temporary error: Overload error
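The perror line above packs three fields into one string: the message, the status ("Temporary error") and the classification ("Overload error"). As a rough sketch only, it can be split apart as follows. The function name parse_perror_line is made up for illustration, and the layout is assumed from this single line of output rather than from any documented format.

```python
import re

def parse_perror_line(line):
    """Split one 'perror --ndb' output line into its parts.

    Assumes the layout seen in this report:
    'NDB error code <n>: <message>: <status>: <classification>'.
    """
    match = re.match(r"NDB error code\s+(\d+):\s+(.*)$", line.strip())
    if match is None:
        raise ValueError("not a perror --ndb line: %r" % line)
    code = int(match.group(1))
    # The message itself may contain colons, so split from the right.
    message, status, classification = match.group(2).rsplit(": ", 2)
    return {
        "code": code,
        "message": message,
        "status": status,
        "classification": classification,
    }

line = ("NDB error code 410: REDO log files overloaded, consult online manual "
        "(decrease TimeBetweenLocalCheckpoints, and|or increase NoOfFragmentLogFiles): "
        "Temporary error: Overload error")
info = parse_perror_line(line)
print(info["code"], "|", info["status"], "|", info["classification"])
```

Separating the status from the message makes it easy to tell a temporary error such as this one from a permanent error.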
NoOfFragmentLogFiles was set to 151. I increased it to 181 and restarted the ndb_mgmd process:
-- NDB Cluster -- Management Client --
ndb_mgm> 1 restart
Connected to Management Server at:
Node 1 is being restarted
Once it had restarted, I began restarting the data nodes, starting with ID#2:
ndb_mgm> 2 restart -i
Node 2: Node shutdown initiated
Node 2: Node shutdown completed, restarting, no start, initial.
During this restart I got the following message:
Restart failed.
* 0: No error
* Executing: ndb_mgm_disconnect
I tried to do a show but got back:
ndb_mgm> show
Could not get status
* 1010: Management server not connected
*
So I disconnected from and reconnected to ndb_mgm, and was then able to do a show. ID#2 was no longer trying to restart.
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=2 (not connected, accepting connect from n16)
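For anyone scripting a check for this state, the "not connected" line in the show output can be picked out with a regular expression. This is a minimal sketch: the pattern is assumed from the one line quoted above, and disconnected_nodes is a hypothetical helper, not part of any MySQL tool.

```python
import re

# Pattern assumed from the single line shown in this report, e.g.
# 'id=2 (not connected, accepting connect from n16)'.
NOT_CONNECTED = re.compile(
    r"id=(\d+)\s+\(not connected, accepting connect from (\S+)\)"
)

def disconnected_nodes(show_output):
    """Return (node_id, host) pairs for nodes reported as not connected."""
    return [(int(node_id), host)
            for node_id, host in NOT_CONNECTED.findall(show_output)]

sample = """Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=2 (not connected, accepting connect from n16)
"""
print(disconnected_nodes(sample))
```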
Looking in the data directory, I found a different type of error log, no trace file and a core.
$ ls
config.ini ndb_1_cluster.log ndb_1.pid ndb_2_out.log ndb_pid10600_error.log
core.10600 ndb_1_out.log ndb_2_fs ndb_2.pid
The error log showed:
Time: Friday 1 September 2006 - 14:27:38
Status: Permanent error, external action needed
Message: Invalid configuration received from Management Server (Configuration error)
Error: 2350
Error data: Unable to alloc node id
Error object: Error : Could not alloc node id at n16 port 14000: Cluster refused allocation of id 2. Error: 1703 (Node failure handling not completed: Permanent error: Application error).
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 10600
Trace: <no tracefile>
Vers
I tried several times to restart ndbd with --initial but got the same error each time.
The backtrace showed:
#0 0x00000033b835bd3d in fflush () from /lib64/libc.so.6
(gdb) bt
#0 0x00000033b835bd3d in fflush () from /lib64/libc.so.6
#1 0x00000000004a8560 in writeChildInfo ()
#2 0x00000000004a8594 in childReportError ()
#3 0x00000000006a7542 in ErrorReporter::handleError ()
#4 0x000000000069ea53 in Configuration::fetch_configuration ()
#5 0x00000000004a93af in main ()
I finally took a backup from the other data node, which was still up, and shut down the slave cluster.
I then edited config.ini and set NoOfFragmentLogFiles = 200.
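To put the three NoOfFragmentLogFiles values used in this report (151, 181 and 200) into perspective, the sketch below estimates the REDO log space each one allocates per data node. It assumes the default layout of 4 files of 16 MB per unit; that is an assumption about this build, not something stated in the report, and redo_log_size_mb is a hypothetical helper.

```python
# Assumption: each unit counted by NoOfFragmentLogFiles is a set of
# 4 files of 16 MB each (the default file size), i.e. 64 MB per unit.
FILES_PER_SET = 4
FILE_SIZE_MB = 16

def redo_log_size_mb(no_of_fragment_log_files):
    """Approximate REDO log space per data node, in MB."""
    return no_of_fragment_log_files * FILES_PER_SET * FILE_SIZE_MB

for setting in (151, 181, 200):
    size_mb = redo_log_size_mb(setting)
    print("NoOfFragmentLogFiles = %d -> %d MB (%.1f GB)"
          % (setting, size_mb, size_mb / 1024.0))
```

Under that assumption the three settings correspond to 9664 MB, 11584 MB and 12800 MB respectively.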
I then had to manually rm -rf the two ndbd file systems, because the cluster uses disk data and I needed to do a restore.
I brought the cluster back up and did a restore. During the restore I received about 50 of the following:
_____________________________________________________
Processing data in table: dbt2/def/NDB$BLOB_25_20(26) fragment 0
Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)
Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)
Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)
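To get an exact count instead of "about 50", the restore output could be saved to a file and tallied by error code. The following is only a sketch: the line format is assumed from the three lines quoted above, and count_temporary_errors is a hypothetical helper.

```python
import re
from collections import Counter

# Line format assumed from the restore output quoted in this report.
TEMP_ERROR = re.compile(r"^Temporary error: (\d+): (.*)$")

def count_temporary_errors(lines):
    """Count 'Temporary error' lines per NDB error code."""
    counts = Counter()
    for line in lines:
        match = TEMP_ERROR.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts

sample = [
    "Processing data in table: dbt2/def/NDB$BLOB_25_20(26) fragment 0",
    "Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)",
    "Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)",
    "Temporary error: 1221: REDO buffers overloaded, consult online manual (increase RedoBuffer)",
]
counts = count_temporary_errors(sample)
print(counts)
```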
The restore completed, and I have since restarted the slave, which is now at about Seconds_Behind_Master: 25577.
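For readability, the Seconds_Behind_Master value can be converted to hours, minutes and seconds. A small sketch, with format_lag as a hypothetical helper name:

```python
def format_lag(seconds):
    """Render a Seconds_Behind_Master value as H:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "%d:%02d:%02d" % (hours, minutes, secs)

print(format_lag(25577))  # 7:06:17
```

So the slave was a little over seven hours behind the master at that point.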
How to repeat:
Not sure. See above.