Bug #18043 Reboot cause node failure on other server in cluster
Submitted: 7 Mar 2006 17:22 Modified: 19 Jun 2006 9:52
Reporter: Andrew Harrison Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.0.18 OS:Linux (SLES 9 SP1)
Assigned to: CPU Architecture:Any

[7 Mar 2006 17:22] Andrew Harrison
Description:
I have seen this three times recently.

We have an 8-node cluster across two servers (4 nodes per server).  One of the servers (A) required a reboot, which should be no problem as it's a cluster after all.  The reboot of this server (A) caused the 4 nodes on the other server (B) to die, resulting in loss of service.  Thankfully, the third time, I was watching for this and caught it quickly, restarting the nodes on (B) and resuming service.

The message output for what appears to be the node failure that started this is:

Time: Tuesday 7 March 2006 - 16:12:32
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbtcMain.cpp
Error object: DBTC (Line: 1345) 0x0000000e
Program: /usr/local/mysql/bin/ndbd
Pid: 11992
Trace: /usr/local/mysql/cluster/ndb_8_trace.log.23
Version: Version 5.0.18
***EOM***

We have not seen this before and do not believe that we have made any configuration changes to MySQL except to redirect the MySQL temporary file storage area out of root and into another filespace.

How to repeat:
As far as I can see, this occurs every time we reboot one of the servers in the cluster.
[7 Mar 2006 17:26] Andrew Harrison
Trace log for node that caused the rest of the nodes to fail

Attachment: ndb_8_trace.log.zip (application/zip, text), 78.49 KiB.

[7 Mar 2006 17:27] Andrew Harrison
Changed category to cluster
[9 Mar 2006 12:06] Hartmut Holzgraefe
Can you please also add the cluster log (ndb_?_cluster.log) from the management
node?
[9 Mar 2006 13:34] Andrew Harrison
Cluster log from the Server that was not rebooted.

Attachment: ndb_2_cluster.zip (application/x-zip-compressed, text), 59.15 KiB.

[9 Mar 2006 13:35] Andrew Harrison
Cluster log

Attachment: ndb_1_cluster.zip (application/x-zip-compressed, text), 33.01 KiB.

[9 Mar 2006 13:46] Andrew Harrison
The set-up that we have is:
Two WebSphere application servers running the node management daemon
Two servers running the MySQL daemon.

The Application Servers and MySQL servers are a paired failover (i.e. AppServer1 primarily uses MySQLServer1, but fails over to MySQLServer2 and vice-versa.)

The root cause:
MySQLServer1 suffered a message storm (originating in oictl32).  We are going to upgrade the kernel version soon.
The message storm filled up the root filespace.  The only way around this was to delete the syslog file and reboot the server.

It would appear that rebooting the server causes the nodes on the other MySQL server to fail.

In this instance, node 1 is the management daemon on AppServer1 and node 2 is the management daemon on AppServer2.
Nodes 3, 5, 7 & 9 are the ndbd instances on MySQLServer1, and nodes 4, 6, 8 & 10 are the ndbd instances on MySQLServer2.
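For reference, the node layout described above would correspond to a config.ini of roughly this shape (a sketch only; the hostnames and the DataDir value are placeholders, not taken from this report):

```ini
# Sketch of a config.ini matching the layout described above.
# HostName and DataDir values are placeholders, not from this report.

[ndbd default]
NoOfReplicas=2          ; each fragment replicated across the two MySQL servers

[ndb_mgmd]
NodeId=1
HostName=appserver1     ; management daemon on AppServer1

[ndb_mgmd]
NodeId=2
HostName=appserver2     ; management daemon on AppServer2

[ndbd]
NodeId=3
HostName=mysqlserver1
DataDir=/usr/local/mysql/cluster

[ndbd]
NodeId=4
HostName=mysqlserver2
DataDir=/usr/local/mysql/cluster

[ndbd]
NodeId=5
HostName=mysqlserver1
DataDir=/usr/local/mysql/cluster

[ndbd]
NodeId=6
HostName=mysqlserver2
DataDir=/usr/local/mysql/cluster

[ndbd]
NodeId=7
HostName=mysqlserver1
DataDir=/usr/local/mysql/cluster

[ndbd]
NodeId=8
HostName=mysqlserver2
DataDir=/usr/local/mysql/cluster

[ndbd]
NodeId=9
HostName=mysqlserver1
DataDir=/usr/local/mysql/cluster

[ndbd]
NodeId=10
HostName=mysqlserver2
DataDir=/usr/local/mysql/cluster
```

With NoOfReplicas=2 and alternating hosts, each node group spans both servers, which is why losing one server should normally leave the cluster running.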

MySQLServer1 was rebooted (I could see when this happened, as I was watching the Node Management Console at the time), and nodes 3, 5, 7 & 9 disappeared as expected.  Very soon afterwards, nodes 4, 6, 8 & 10 also died unexpectedly.

Hope this helps
[19 May 2006 9:52] Jonas Oreland
Hi,

Can you upload your config.ini
and the error/trace files of the other ndbd nodes that crashed?

Also, a number of bugs in this area have been fixed since 5.0.18; can you try a newer
version?

/Jonas
[19 Jun 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".