Bug #36718 Hard node crash causes error 701 'System busy with other schema operation'
Submitted: 14 May 2008 17:31
Modified: 27 Jan 2009 14:27
Reporter: David Shrewsbury
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1.24 ndb-6.3.13
OS: Linux
Assigned to: Martin Skold
CPU Architecture: Any

[14 May 2008 17:31] David Shrewsbury
Description:
This is fairly easy to repeat. So far I have only repeated it on the ndb-6.3 branch.

Place a considerable load on a two data node cluster and kill one of the data nodes (kill -9). The system becomes unrecoverable: all schema operations now fail with the 701 error. Existing NDB tables can still be queried with SELECT, though.

How to repeat:
Create a simple two data node Cluster. I will upload the sample config.ini and my.cnf that I used.

Use mysqlslap to create the load:

  shell> mysqlslap -a --create-schema=junk -e ndb --commit=10 -i 30 -c 10 -T

Kill one of the ndbd nodes with kill -9. mysqlslap should fail with an error similar to:

mysqlslap: Cannot run query CREATE TABLE `t1` (intcol1 INT(32) ,charcol1 VARCHAR(128)) ERROR : Can't create table 'junk.t1' (errno: 701)

Might need to repeat the process above a couple of times to duplicate the error.
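
Since the failure may take a few attempts, a small helper can make it easier to tell when a run has actually hit it. This is only a sketch: the function name `saw_error_701` is made up for illustration, and it assumes the error text contains `errno: 701` as in the mysqlslap message above.

```shell
# Succeeds (exit status 0) if the text on stdin mentions NDB error 701.
saw_error_701() {
  grep -q 'errno: 701'
}

# Hypothetical usage, not executed here: rerun the load until the error shows up.
# The kill -9 of an ndbd process still has to be issued by hand from another
# shell, and a killed node may need to be restarted between attempts.
#   until mysqlslap -a --create-schema=junk -e ndb --commit=10 -i 30 -c 10 -T 2>&1 | saw_error_701; do
#     echo "no 701 yet, retrying"
#   done
```

mysqlslap's stderr is merged into the pipe with `2>&1` on the assumption that the error message is printed there.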

Once the error happens, trying to create a table through mysql CLI will fail, too:

mysql> create table abcd (id int) engine=ndb;
ERROR 1005 (HY000): Can't create table 'test.abcd' (errno: 701)
mysql> show warnings;
+-------+------+----------------------------------------------------------------------------+
| Level | Code | Message                                                                    |
+-------+------+----------------------------------------------------------------------------+
| Error | 1297 | Got temporary error 701 'System busy with other schema operation' from NDB | 
| Error | 1005 | Can't create table 'test.abcd' (errno: 701)                                | 
+-------+------+----------------------------------------------------------------------------+
[14 May 2008 17:41] David Shrewsbury
Config file used

Attachment: config.ini (application/octet-stream, text), 795 bytes.

[14 May 2008 17:42] David Shrewsbury
mysqld config file

Attachment: my.cnf (application/octet-stream, text), 176 bytes.

[15 May 2008 6:51] Tomas Ulin
If you start up the killed node again, and shut down the other, does it recover?

Also, do you know which node you are killing? The master? It should make a difference which one you kill. If you kill the master, this may occur (because there is no code written to handle that failure case), but if you kill the non-master, this should not occur.

BR,

T
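
To check which data node is currently the master before killing one, the output of `ndb_mgm -e show` can be filtered. The sketch below assumes that the master's line starts with `id=<node id>` and carries a `Master` marker, which may differ between versions; the function name `master_node_id` is hypothetical.

```shell
# Print the node id of the data node flagged "Master".
# Reads `ndb_mgm -e show` output on stdin.
# Assumption: the master's line looks like "id=2 @host (..., Master)".
master_node_id() {
  sed -n 's/^id=\([0-9][0-9]*\).*Master.*/\1/p'
}

# Hypothetical usage, not executed here:
#   ndb_mgm -e show | master_node_id
```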
[15 May 2008 8:29] Tomas Ulin
see also:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-multiple-nodes.html "DDL operations"

We are currently addressing this in ndb version 6.4
[15 May 2008 16:58] David Shrewsbury
Tomas,

Yes, you are correct. This happens consistently when I kill the master.

However, when I kill the non-master, the Cluster goes into a "frozen" state. No queries against an NDB table will return, the mysqld process refuses to shut down unless I first shut down the cluster, and I see this from the ndb_show_tables command:

id    type                 state    logging database     schema   name
6     UserTable            Online   Yes     test         def      x
1     SystemTable          Online   Yes     sys          def      NDB$EVENTS_0
4     UserTable            Online   Yes     mysql        def      ndb_apply_status
5     UserTable            Dropping Yes     junk         def      t1
3     UserTable            Online   Yes     mysql        def      NDB$BLOB_2_3
0     SystemTable          Online   Yes     sys          def      SYSTAB_0
2     UserTable            Online   Yes     mysql        def      ndb_schema
1     TableEvent           Online   -                             REPL$mysql/ndb_schema
2     TableEvent           Online   -                             NDB$BLOBEVENT_REPL$mysql/ndb_schema_3
5     TableEvent           Online   -                             REPL$test/x
3     TableEvent           Online   -                             REPL$mysql/ndb_apply_status
11    TableEvent           Online   -                             REPL$junk/t1

Notice that id=5 remains in the "Dropping" state.
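
A stuck table like this can be picked out of a longer listing with a small filter. This is a sketch that assumes the column layout shown above (id, type, state, logging, database, schema, name); the function name `stuck_tables` is made up for illustration.

```shell
# Print "id database.name state" for every user table whose state is not "Online".
# Reads ndb_show_tables output on stdin.
stuck_tables() {
  awk '$2 == "UserTable" && $3 != "Online" { print $1, $5 "." $7, $3 }'
}

# Hypothetical usage, not executed here:
#   ndb_show_tables | stuck_tables
```

For the listing above, this would report only the junk.t1 row.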
[30 Oct 2008 7:51] Martin Skold
WL#4331 Ensuring resilience against master node failures (Ndb)
now pushed to mysql-5.1-telco-6.4
[27 Jan 2009 14:27] Jon Stephens
Documented bugfix in the NDB-6.4.0 changelog as follows:

        The failure of a master node during a DDL operation caused the
        cluster to be unavailable for further DDL operations until it
        was restarted; failures of non-master nodes during DDL
        operations caused the cluster to become completely inaccessible.

Also updated MySQL Cluster Limitations and MySQL Cluster NDB 6.4 Roadmap sections of 5.1 Manual.