Bug #21050 Drop db during DN recovery causes other DN to fail & causes recovery issues
Submitted: 13 Jul 2006 23:03 Modified: 2 Sep 2006 6:09
Reporter: Jonathan Miller Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.1.12 OS:Linux (Linux 32 Bit OS)
Assigned to: Pekka Nousiainen CPU Architecture:Any
Tags: MySQLD Crash back trace included

[13 Jul 2006 23:03] Jonathan Miller
Description:
I found this yesterday using the DBT2 database. with that test I was actaully getting a crash within the MySQLD. I have been trying to break it down to a smaller test case. I have not been able to get the mysqld to crash with this smaller test case, but I have been able to get a consitent DN failure. When trying to recover from that data node failure it hangs in phase 2.

Test: I have a 2 DN w/ 2 replicas setup. I create a database and small table. I insert about 300,000 rows of data into that table. I then issue a "killall -11 ndbd" on one of the DN hosts. I restart that DN then issue a drop database. The drop database will hang and once the restart data node start completly the other DN will fail. When trying to recover that DN, it will hang in phase 2.

mysql> use foo
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> CALL dorepeat(100000);
Query OK, 0 rows affected (3 min 44.46 sec)

**** NOTE DATA NODE KILLED AT THIS POINT ******
**** See Warning Below *********

mysql> CALL dorepeat(100000);
Query OK, 0 rows affected, 1 warning (2 min 37.17 sec)

mysql> show warnings;
+---------+------+-------------------------------------------------+
| Level   | Code | Message                                         |
+---------+------+-------------------------------------------------+
| Warning | 1264 | Out of range value for column 'c1' at row 51179 |
+---------+------+-------------------------------------------------+
1 row in set (0.00 sec)

mysql> select count(*) from foo1; +----------+
| count(*) |
+----------+
|   279683 |
+----------+
1 row in set (0.00 sec)

mysql> CALL dorepeat(10);
Query OK, 0 rows affected (0.01 sec)

mysql> CALL dorepeat(10);
Query OK, 0 rows affected (0.01 sec)

mysql> select count(*) from foo1;
+----------+
| count(*) |
+----------+
|   279705 |
+----------+
1 row in set (0.00 sec)

DN restarted at this point

ndb_mgm> Node 3: Started (version 5.1.12)
Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

mysql> drop database foo; 

Query aborted by Ctrl+C
Query OK, 1 row affected, 1 warning (45.96 sec)

mysql> show warnings;
+-------+------+---------------------------------------------------------------------------------------+
| Level | Code | Message       |
+-------+------+---------------------------------------------------------------------------------------+
| Error | 1296 | Got error 723 'No such table existed' from Could not log query '%s' on other mysqld's |
| Error | 1317 | Query execution was interrupted       |
+-------+------+---------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

mysql> drop database foo;
ERROR 1008 (HY000): Can't drop database 'foo'; database doesn't exist

RESTART DN ID#3

ndb_mgm> all status
Node 2: starting (Phase 2) (Version 5.1.12)
Node 3: started (Version 5.1.12)

2006-07-14 00:46:13 [MgmSrvr] INFO     -- Node 3: DICT: lock bs: 3 ops: 2 poll: 1 cnt: 0 queue: 22006-07-14 00:46:13 [MgmSrvr] INFO     -- Node 3: DICT: lock request by node 2 for NodeRestart
2006-07-14 00:46:13 [MgmSrvr] INFO     -- Node 3: DICT: lock bs: 3 ops: 2 poll: 1 cnt: 1 queue: 22006-07-14 00:46:13 [MgmSrvr] INFO     -- Node 3: DICT: lock bs: 3 ops: 2 poll: 1 cnt: 2 queue: 2
.
.
.

How to repeat:
setup 2 DN 2 replica cluster w/ mysqld
mysql> create database foo; 
mysql> use foo;
Database changed
mysql> create table foo1 (c1 int auto_increment key)engine=ndb;
mysql> delimiter | mysql> CREATE PROCEDURE dorepeat(p1 INT) BEGIN set @x=0; REPEAT insert into foo1 values(null); SET @x = @x + 1; UNTIL @x > p1 END REPEAT; END|
mysql> CALL dorepeat(300000);

killall -11 ndbd on one of the DN host.
restart that DN
mysql> drop database foo;

Suggested fix:
I will add log information in another posting.
[13 Jul 2006 23:05] Jonathan Miller
DN Error log:
Time: Friday 14 July 2006 - 00:45:01
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 13547) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 18504
Trace: /space/run/ndb_2_trace.log.2
Version: Version 5.1.12 (beta)

Trace:
BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
       ?010826 011344
DBDICT  010851 010869 010936
DBLQH   002481
DBTC    004050
DBTUP   002032
DBDICT  010851 010869 010941
DBDICT  010851 010869 010884 008403 008416
DBDICT  006393 006411

--------------- Signal ----------------
r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 11279186 gsn: 420 "DICT_UNLOCK_ORD" prio: 1
s.bn: 246 "DBDIH", s.proc: 3, s.sigId: 1109139 length: 2 trace: 2 #sec: 0 fragInf: 0
 H'00000000 H'00000001
--------------- Signal ----------------
r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 11279185 gsn: 605 "ALTER_INDX_CONF" prio: 1
s.bn: 250 "DBDICT", s.proc: 2, s.sigId: 11279184 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'00fa0002 H'00003298 H'00000010
--------------- Signal ----------------
r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 11279184 gsn: 603 "ALTER_INDX_REQ" prio: 1
s.bn: 250 "DBDICT", s.proc: 2, s.sigId: 11279163 length: 8 trace: 0 #sec: 0 fragInf: 0
 H'00fa0002 H'00003297 H'00100003 H'00000002 H'00000003 H'00000003 H'00000000
 H'00003298
--------------- Signal ----------------
r.bn: 262 "RESTORE", r.proc: 2, r.sigId: 11279183 gsn: 502 "NODE_START_REP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 2, s.sigId: 11279162 length: 1 trace: 2 #sec: 0 fragInf: 0
 H'00000003
--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 11279182 gsn: 502 "NODE_START_REP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 2, s.sigId: 11279162 length: 1 trace: 2 #sec: 0 fragInf: 0
 H'00000003
--------------- Signal ----------------
r.bn: 260 "LGMAN", r.proc: 2, r.sigId: 11279181 gsn: 502 "NODE_START_REP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 2, s.sigId: 11279162 length: 1 trace: 2 #sec: 0 fragInf: 0
 H'00000003
--------------- Signal ----------------
[27 Jul 2006 21:49] Jonathan Miller
Did a little more investigation on this one with the cid_dd_ndb.pl script.

If you kill a DN at the right position in the test (i.e. not during the delete data from table) the test will continue. The if you restart the DN the test will hang at the drop table part of the test. Once the DN recovers the other DN dies.

You have to shutdown the cluster and restart to get both DN operational again and you have to ctl-c out of the test. If you restart the test, if first tries to drop the database if it exists. When you do this, the MySQLD will crash.

#0  0x004af402 in __kernel_vsyscall ()
(gdb) bt
#0  0x004af402 in __kernel_vsyscall ()
#1  0x0064364f in pthread_kill () from /lib/libpthread.so.0
#2  0x0834a457 in write_core ()
#3  0x081f6336 in handle_segfault ()
#4  <signal handler called>
#5  0x00537ce8 in strcmp () from /lib/libc.so.6
#6  0x08240d8c in remove_db_from_cache ()
#7  0x0830588b in mysql_rm_db ()
#8  0x0821375b in mysql_execute_command ()
#9  0x08217d2b in mysql_parse ()
#10 0x08218609 in dispatch_command ()
#11 0x0821973a in do_command ()
#12 0x08219af1 in handle_one_connection ()
#13 0x00640bd4 in start_thread () from /lib/libpthread.so.0
#14 0x005984fe in clone () from /lib/libc.so.6

The mysqld_safe restarts the mysqld and then if you run the test again, it procceds without issue.

So we have a couple of issues here.

1) The drop database hangs when a data node dies
2) The 2nd data node dies when the first recovers
3) An additional attempt to drop the database from the same mysqld process will cause a crash one the cluster has been recoverd unless that mysqld process as been restarted first.

Since we have hangs and crashes associated with this bug, I am rasing it to P1.

Thanks
/jeb
[18 Aug 2006 7:57] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/10612

ChangeSet@1.2245, 2006-08-18 09:56:52+02:00, pekka@orca.ndb.mysql.com +8 -0
  ndb - bug#18781 bug#21017 bug#21050 : block index ops during NR + fix asserts
[23 Aug 2006 8:47] Pekka Nousiainen
patch for 2), open new bug# if other bugs remains
[1 Sep 2006 7:58] Jonas Oreland
pushed to 5.1.12
[1 Sep 2006 13:21] Jon Stephens
Documented bugfix in 5.1.12 changelog.
[1 Sep 2006 19:27] Jonas Oreland
pushed to 5.0.25
[2 Sep 2006 6:09] Jon Stephens
Documented bugfix in 5.0.25 changelog.