Bug #36718 Hard node crash causes error 701 'System busy with other schema operation'
Submitted: 14 May 2008 19:31 Modified: 27 Jan 15:27
Reporter: David Shrewsbury
Status: Closed
Category: Server: Cluster  Severity: S1 (Critical)
Version: mysql-5.1.24 ndb-6.3.13  OS: Linux
Assigned to: Martin Skold  Target Version: ndb-6.4
Triage: Needs Triage: D1 (Critical)

[14 May 2008 19:31] David Shrewsbury
Description:
This is fairly easy to repeat. So far have only repeated on ndb-6.3 branch.

Place a considerable load on a 2 data node cluster and kill one of the data nodes
(kill -9). The system becomes unrecoverable, as all schema operations now fail with the
701 error. Can still select from existing NDB tables, though.

How to repeat:
Create a simple two data node Cluster. Will upload my sample config.ini and my.cnf used.

Use mysqlslap to create the load:

  shell> mysqlslap -a --create-schema=junk -e ndb --commit=10 -i 30 -c 10 -T

Kill one of the ndbd nodes with kill -9. mysqlslap should fail with an error similar to:

mysqlslap: Cannot run query CREATE TABLE `t1` (intcol1 INT(32) ,charcol1 VARCHAR(128))
ERROR : Can't create table 'junk.t1' (errno: 701)

Might need to repeat the process above a couple of times to duplicate the error.

Once the error happens, trying to create a table through mysql CLI will fail, too:

mysql> create table abcd (id int) engine=ndb;
ERROR 1005 (HY000): Can't create table 'test.abcd' (errno: 701)
mysql> show warnings;
+-------+------+----------------------------------------------------------------------------+
| Level | Code | Message                                                                    |
+-------+------+----------------------------------------------------------------------------+
| Error | 1297 | Got temporary error 701 'System busy with other schema operation' from NDB |
| Error | 1005 | Can't create table 'test.abcd' (errno: 701)                                |
+-------+------+----------------------------------------------------------------------------+
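
When scripting the steps above, it can help to detect the 701 condition from the client output automatically. The following is a minimal Python sketch, not part of the original report. The helper names `ndb_errno` and `is_schema_busy` are hypothetical, and it assumes the messages look like the two forms shown above ("errno: 701" and "temporary error 701").

```python
import re

# Assumption: the error number follows either "errno: " (error 1005 text)
# or "temporary error " (warning 1297 text), as in the output shown above.
ERRNO_RE = re.compile(r"(?:errno: |temporary error )(\d+)")


def ndb_errno(message):
    """Return the NDB error number embedded in a message, or None."""
    match = ERRNO_RE.search(message)
    return int(match.group(1)) if match else None


def is_schema_busy(message):
    """True if the message carries NDB error 701 (hypothetical helper)."""
    return ndb_errno(message) == 701
```

A reproduction script could call `is_schema_busy()` on each failed CREATE TABLE to tell this failure apart from unrelated errors.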
[14 May 2008 19:41] David Shrewsbury
Config file used

Attachment: config.ini (application/octet-stream, text), 795 bytes.

[14 May 2008 19:42] David Shrewsbury
mysqld config file

Attachment: my.cnf (application/octet-stream, text), 176 bytes.

[15 May 2008 8:51] Tomas Ulin
If you start up the killed node again, and shut down the other, does it recover?

Also do you know which node you are killing?  The master?  It should make a difference
which one you kill.  If you kill the master, this may occur (because there is no code
written to handle that failure case), but if you kill the non master this should not
occur.

BR,

T
[15 May 2008 10:29] Tomas Ulin
see also:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-multiple-nodes.html "DDL
operations"

We are currently addressing this in ndb version 6.4
[15 May 2008 18:58] David Shrewsbury
Tomas,

Yes, you are correct. This happens consistently when I kill the master.

However, when I kill the non-master, the Cluster goes into a "frozen" state. No queries
against an NDB table will return, the mysqld process refuses to shut down unless I first
shut down the cluster, and I see this from the ndb_show_tables command:

id    type                 state    logging database     schema   name
6     UserTable            Online   Yes     test         def      x
1     SystemTable          Online   Yes     sys          def      NDB$EVENTS_0
4     UserTable            Online   Yes     mysql        def      ndb_apply_status
5     UserTable            Dropping Yes     junk         def      t1
3     UserTable            Online   Yes     mysql        def      NDB$BLOB_2_3
0     SystemTable          Online   Yes     sys          def      SYSTAB_0
2     UserTable            Online   Yes     mysql        def      ndb_schema
1     TableEvent           Online   -                             REPL$mysql/ndb_schema
2     TableEvent           Online   -                             NDB$BLOBEVENT_REPL$mysql/ndb_schema_3
5     TableEvent           Online   -                             REPL$test/x
3     TableEvent           Online   -                             REPL$mysql/ndb_apply_status
11    TableEvent           Online   -                             REPL$junk/t1

Notice that id=5 remains in the "Dropping" state.
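
A stuck entry like this can be picked out of the listing mechanically. The following is a minimal Python sketch, not part of the original report. The function name `stuck_objects` is hypothetical, and it assumes the whitespace-separated column layout shown above (id, type, state first, name last).

```python
def stuck_objects(listing):
    """Return (id, type, name) for rows whose state is not 'Online'."""
    stuck = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a numeric id; this skips the header,
        # blank lines, and any name that wrapped onto its own line.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        obj_id, obj_type, state = fields[0], fields[1], fields[2]
        if state != "Online":
            stuck.append((int(obj_id), obj_type, fields[-1]))
    return stuck


# Three rows copied from the listing above.
SAMPLE = """\
id    type                 state    logging database     schema   name
6     UserTable            Online   Yes     test         def      x
5     UserTable            Dropping Yes     junk         def      t1
11    TableEvent           Online   -                             REPL$junk/t1
"""

print(stuck_objects(SAMPLE))  # prints [(5, 'UserTable', 't1')]
```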
[30 Oct 2008 8:51] Martin Skold
WL#4331 Ensuring resilience against master node failures (Ndb)
now pushed to mysql-5.1-telco-6.4
[27 Jan 15:27] Jon Stephens
Documented bugfix in the NDB-6.4.0 changelog as follows:

        The failure of a master node during a DDL operation caused the
        cluster to be unavailable for further DDL operations until it
        was restarted; failures of non-master nodes during DDL
        operations caused the cluster to become completely inaccessible.

Also updated the MySQL Cluster Limitations and MySQL Cluster NDB 6.4 Roadmap sections of
the 5.1 Manual.