Description:
We run a very large database on a MySQL Cluster that recently went into production; we call it our "production cluster". It is a 3-machine cluster consisting of a management server and two dual-processor machines, each running an sql node and an ndb node. (We also have a "development cluster" with identical hardware in the office.)
Node ID 1 = ndb_mgmd
Node ID 2 = ndbd node 1
Node ID 3 = ndbd node 2
Node ID 4 = mysqld node 1
Node ID 5 = mysqld node 2
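For reference, this topology corresponds to a config.ini roughly like the one below (reconstructed from the node IDs above and the addresses in the ndb_mgm output further down; DataDir and memory settings omitted, NoOfReplicas=2 inferred from both data nodes being in node group 0):

[ndbd default]
NoOfReplicas=2

[ndb_mgmd]
Id=1
HostName=192.168.2.41

[ndbd]
Id=2
HostName=192.168.2.17

[ndbd]
Id=3
HostName=192.168.2.16

[mysqld]
Id=4
HostName=192.168.2.16

[mysqld]
Id=5
HostName=192.168.2.17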
Yesterday evening, all 5 nodes (2xndb, 2xsql, 1xmgm) were running when an administrator accidentally restarted a data node using ndb_mgm. While the accidental restart should not have happened, there should not have been any data loss, as the node was restarted "normally".
There is currently a 1.5-hour "hole" in our data, between 8:40 PM EDT and 10:02 PM EDT.
- 8:40 PM EDT is the minute the node was restarted.
- 10:02 PM EDT does not seem to correspond to any event in the cluster log.
Since then, our system has been malfunctioning.
- Node 3, when restarted, seemed to take over as master.
- Node 2 stopped recording any data (insert/update), but continued to respond to select statements, so some select statements were returning out-of-date results.
For example, this morning, I got this:
Connected to node 4:
mysql> select max(trans_id) from transaction;
+---------------+
| max(trans_id) |
+---------------+
|         37692 |
+---------------+
1 row in set (0.00 sec)
Connected to node 5:
mysql> select max(trans_id) from transaction;
+---------------+
| max(trans_id) |
+---------------+
|         65771 |
+---------------+
1 row in set (0.00 sec)
Upon discovering this, I restarted the out-of-date ndb node, and everything appears to be working normally again.
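For what it's worth, the discrepancy is easy to see by running the same query against each SQL node directly from a shell, along these lines:

# "our_db" below is just a placeholder for the actual database name
mysql -h 192.168.2.16 -e "select max(trans_id) from transaction" our_db
mysql -h 192.168.2.17 -e "select max(trans_id) from transaction" our_db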
I'm not sure what the root cause of the data loss is, but I suspect it may be related to a problem we've been seeing on our development cluster. On the development cluster, both ndb nodes often seem to be the master at the same time. (ndb_mgm reports both as "Master".)
When I tried to stop one node or the other this morning, I was unable to:
~# /usr/local/mysql/bin/ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: 192.168.2.41:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=2 @192.168.2.17 (Version: 5.0.19, Nodegroup: 0)
id=3 @192.168.2.16 (Version: 5.0.19, Nodegroup: 0, Master)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @192.168.2.41 (Version: 5.0.19)
[mysqld(API)] 2 node(s)
id=4 @192.168.2.16 (Version: 5.0.19)
id=5 @192.168.2.17 (Version: 5.0.19)
ndb_mgm> 2 stop
Node 2: Node shutdown aborted
Shutdown failed.
* 2002: Stop failed
* Node shutdown would cause system crash
ndb_mgm> 3 stop
Node 3: Node shutdown aborted
Shutdown failed.
* 2002: Stop failed
* Node shutdown would cause system crash
This seems odd to me...
I've attached all the logs from the ndb nodes and the mgm node. Maybe you can make a little more sense of them. If there's any other piece of information I can give you, please don't hesitate to ask. You have my full co-operation.
The logs make this post too long, so they will follow in a follow-up post.
How to repeat:
Unknown - In this case, it was simply a node restart - "3 restart" from ndb_mgm. This has been done countless times before without error.
Suggested fix:
Unknown - For now, we're going to tread extremely lightly on the database servers and the management node.
If an NDB node must be restarted, we will simply shut it down and restart it with --initial during off-peak hours.
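Concretely, the plan is something along these lines (using node 2 as the example and assuming the same install prefix as the ndb_mgm binary above; --initial discards the node's local data files and rebuilds them from the other replica):

# on the management host: stop the data node
/usr/local/mysql/bin/ndb_mgm -e "2 stop"

# then, on the data node's own host (192.168.2.17 for node 2), restart it with a clean file system
/usr/local/mysql/bin/ndbd --initial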