MySQL Bugs: #36718: Hard node crash causes error 701 'System busy with other schema operation'

Bug #36718	Hard node crash causes error 701 'System busy with other schema operation'
Submitted:	14 May 2008 17:31	Modified:	27 Jan 2009 14:27
Reporter:	David Shrewsbury
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1.24 ndb-6.3.13	OS:	Linux
Assigned to:	Martin Skold	CPU Architecture:	Any

Description:
This is fairly easy to repeat. So far I have only repeated it on the ndb-6.3 branch.

Place a considerable load on a two-data-node cluster and kill one of the data nodes (kill -9). The system becomes unrecoverable, as all schema operations now fail with error 701. Selecting from existing NDB tables still works, though.

How to repeat:
Create a simple two data node Cluster. I will upload the sample config.ini and my.cnf I used.

Use mysqlslap to create the load:

  shell> mysqlslap -a --create-schema=junk -e ndb --commit=10 -i 30 -c 10 -T

Kill one of the ndbd nodes with kill -9. mysqlslap should fail with an error similar to:

mysqlslap: Cannot run query CREATE TABLE `t1` (intcol1 INT(32) ,charcol1 VARCHAR(128)) ERROR : Can't create table 'junk.t1' (errno: 701)

You might need to repeat the process above a couple of times to trigger the error.

Once the error happens, trying to create a table through the mysql CLI fails, too:

mysql> create table abcd (id int) engine=ndb;
ERROR 1005 (HY000): Can't create table 'test.abcd' (errno: 701)
mysql> show warnings;
+-------+------+----------------------------------------------------------------------------+
| Level | Code | Message                                                                    |
+-------+------+----------------------------------------------------------------------------+
| Error | 1297 | Got temporary error 701 'System busy with other schema operation' from NDB | 
| Error | 1005 | Can't create table 'test.abcd' (errno: 701)                                | 
+-------+------+----------------------------------------------------------------------------+
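
Note that the 1005 error only carries the errno, while the 1297 warning carries the NDB classification ("temporary") and the error text. A minimal sketch in Python of pulling those fields out of a SHOW WARNINGS message follows. It assumes the message follows the server's "Got error ..." / "Got temporary error ..." wording seen above; the names `NdbError` and `parse_ndb_warning` are hypothetical and not part of any MySQL API.

```python
import re
from typing import NamedTuple, Optional

# Matches warning text such as:
#   Got temporary error 701 'System busy with other schema operation' from NDB
# The word "temporary" is optional, so non-temporary NDB errors match as well.
NDB_WARNING = re.compile(
    r"Got (?P<temp>temporary )?error (?P<code>\d+) '(?P<text>.*)' from NDB"
)


class NdbError(NamedTuple):
    code: int        # NDB error code, e.g. 701
    temporary: bool  # True if the server reported it as a temporary error
    text: str        # NDB error text


def parse_ndb_warning(message: str) -> Optional[NdbError]:
    """Return the NDB error embedded in a warning message, or None."""
    match = NDB_WARNING.search(message)
    if match is None:
        return None
    return NdbError(
        code=int(match.group("code")),
        temporary=match.group("temp") is not None,
        text=match.group("text"),
    )
```

Although NDB reports 701 as temporary, in the situation described in this report retrying the DDL statement does not help: the error persists for all further schema operations.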

Config file used

Attachment: config.ini (application/octet-stream, text), 795 bytes.

mysqld config file

Attachment: my.cnf (application/octet-stream, text), 176 bytes.

If you start up the killed node again, and shut down the other, does it recover?

Also, do you know which node you are killing? The master? It should make a difference which one you kill. If you kill the master, this may occur (because no code has been written to handle that failure case), but if you kill the non-master node, this should not occur.

BR,

T
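
To tell which data node is the master before killing one, the output of `ndb_mgm -e show` can be inspected. The following is a rough Python sketch, assuming the data node lines look like `id=2 @host (version, Nodegroup: 0, Master)`, with `*` in place of `Master` in later releases. That format is an assumption about the management client output, the sample addresses are made up, and `find_master_node` is a hypothetical helper.

```python
import re
from typing import Optional

# One connected node line, e.g.:
#   id=2  @192.0.2.11  (mysql-5.1.24 ndb-6.3.13, Nodegroup: 0, Master)
# Lines for unconnected nodes have no "@host" part and are skipped.
NODE_LINE = re.compile(r"^id=(?P<id>\d+)\s+@\S+\s+\((?P<info>[^)]*)\)")


def find_master_node(show_output: str) -> Optional[int]:
    """Return the node id flagged as master in `ndb_mgm -e show` output."""
    for line in show_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match is None:
            continue
        fields = [part.strip() for part in match.group("info").split(",")]
        # Assumption: the master flag is the last comma-separated field.
        if fields[-1] in ("Master", "*"):
            return int(match.group("id"))
    return None


# Hypothetical sample output (addresses are placeholders).
SAMPLE_SHOW = """\
[ndbd(NDB)]     2 node(s)
id=2    @192.0.2.11  (mysql-5.1.24 ndb-6.3.13, Nodegroup: 0, Master)
id=3    @192.0.2.12  (mysql-5.1.24 ndb-6.3.13, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.0.2.10  (mysql-5.1.24 ndb-6.3.13)
"""
```

With the sample above, `find_master_node(SAMPLE_SHOW)` picks node 2, so killing node 3 would exercise the non-master case.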

see also:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-multiple-nodes.html "DDL operations"

We are currently addressing this in ndb version 6.4

Tomas,

Yes, you are correct. This happens consistently when I kill the master.

However, when I kill the non-master, the Cluster goes into a "frozen" state. No queries against an NDB table will return, the mysqld process refuses to shut down unless I first shut down the cluster, and I see this from the ndb_show_tables command:

id    type                 state    logging database     schema   name
6     UserTable            Online   Yes     test         def      x
1     SystemTable          Online   Yes     sys          def      NDB$EVENTS_0
4     UserTable            Online   Yes     mysql        def      ndb_apply_status
5     UserTable            Dropping Yes     junk         def      t1
3     UserTable            Online   Yes     mysql        def      NDB$BLOB_2_3
0     SystemTable          Online   Yes     sys          def      SYSTAB_0
2     UserTable            Online   Yes     mysql        def      ndb_schema
1     TableEvent           Online   -                             REPL$mysql/ndb_schema
2     TableEvent           Online   -                             NDB$BLOBEVENT_REPL$mysql/ndb_schema_3
5     TableEvent           Online   -                             REPL$test/x
3     TableEvent           Online   -                             REPL$mysql/ndb_apply_status
11    TableEvent           Online   -                             REPL$junk/t1

Notice that id=5 remains in the "Dropping" state.
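
Spotting a stuck object in a longer listing can be automated. Below is a minimal Python sketch that scans ndb_show_tables output in the column layout shown above and reports every object whose state is not "Online". The function name `non_online_objects` is hypothetical, and the parsing assumes whitespace-separated columns with the object name last, as in this listing.

```python
from typing import List, Tuple


def non_online_objects(listing: str) -> List[Tuple[int, str, str, str]]:
    """Return (id, type, state, name) for each object not in state Online."""
    found = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header row, blank lines, and anything not starting with an id.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        obj_id, obj_type, state, name = int(fields[0]), fields[1], fields[2], fields[-1]
        if state != "Online":
            found.append((obj_id, obj_type, state, name))
    return found


# Excerpt of the listing from this report.
SAMPLE_LISTING = """\
id    type                 state    logging database     schema   name
6     UserTable            Online   Yes     test         def      x
4     UserTable            Online   Yes     mysql        def      ndb_apply_status
5     UserTable            Dropping Yes     junk         def      t1
5     TableEvent           Online   -                             REPL$test/x
11    TableEvent           Online   -                             REPL$junk/t1
"""
```

Run against the excerpt, this reports only table id 5 (junk.t1, state "Dropping"); the TableEvent row that also has id 5 is Online and is not reported.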

WL#4331 Ensuring resilience against master node failures (Ndb)
now pushed to mysql-5.1-telco-6.4

Documented bugfix in the NDB-6.4.0 changelog as follows:

        The failure of a master node during a DDL operation caused the
        cluster to be unavailable for further DDL operations until it
        was restarted; failures of non-master nodes during DDL
        operations caused the cluster to become completely inaccessible.

Also updated MySQL Cluster Limitations and MySQL Cluster NDB 6.4 Roadmap sections of 5.1 Manual.