MySQL Bugs: #33619: Cluster Disk Data: sysbench causes ndbd crash using disk data

Bug #33619	Cluster Disk Data: sysbench causes ndbd crash using disk data
Submitted:	2 Jan 2008 9:19	Modified:	20 Feb 2008 21:58
Reporter:	raid fifa	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Disk Data	Severity:	S1 (Critical)
Version:	5.1.22-rc community edition	OS:	Linux (2.6.9-8.11smp i386 GNU/Linux)
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
I run sysbench-0.4.8 against cluster to test disk data feature in oltp mode, but this caused ndbd crash.
Builded one cluster on three linux boxes:
one mgmd: 172.16.74.66
one mysqld: 172.16.74.62
two ndbd: 172.16.74.62 and 172.16.74.116

After preparing sbtest table, run sysbench :
[root@qa3-62 mysql]# /usr/local/sysbench048/bin/sysbench --test=oltp --mysql-table-engine=ndbcluster --oltp-table-size=1000000 --max-requests=50 --num-threads=1 --mysql-user=root run
......
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Maximum number of requests for OLTP test is limited to 50
Threads started!
ALERT: failed to execute mysql_stmt_execute(): Err1297 Got temporary error 4031 'Node failure caused abort of transaction' from NDBCLUSTER
FATAL: database error, exiting...
Done.

The ndb_mgm console printed:
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @172.16.74.62  (Version: 5.1.22, Nodegroup: 0, Master)
id=3    @172.16.74.116  (Version: 5.1.22, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @172.16.74.66  (Version: 5.1.22)

[mysqld(API)]   3 node(s)
id=5    @172.16.74.62  (Version: 5.1.22)
id=6 (not connected, accepting connect from 172.16.74.116)
id=7 (not connected, accepting connect from any host)

ndb_mgm> Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

When I tried to start node2 with ndbd command, the ndb_mgm console printed:
ndb_mgm> show
... ...
ndb_mgm> Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

Node 2 error log:
=================
Time: Wednesday 2 January 2008 - 15:32:26
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupDiskAlloc.cpp
Error object: DBTUP (Line: 946) 0x0000000e
Program: bin/ndbd
Pid: 6031
Trace: /usr/local/mysql/clu_data/ndb_2_trace.log.1
Version: Version 5.1.22 (rc)
***EOM***

Time: Wednesday 2 January 2008 - 15:42:11
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
Error: 2308
Error data: Node 3 disconnected
Error object: QMGR (Line: 2648) 0x0000000a
Program: bin/ndbd
Pid: 6174
Trace: /usr/local/mysql/clu_data/ndb_2_trace.log.2
Version: Version 5.1.22 (rc)
***EOM***

Node 3 error log:
================
Time: Wednesday 2 January 2008 - 15:45:40
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupDiskAlloc.cpp
Error object: DBTUP (Line: 946) 0x0000000e
Program: bin/ndbd
Pid: 22198
Trace: /usr/local/mysql/clu_data/ndb_3_trace.log.1
Version: Version 5.1.22 (rc)
***EOM***

cluster log:
============
2008-01-02 15:32:32 [MgmSrvr] INFO     -- Node 3: LCP Take over completed (state = 4)
2008-01-02 15:32:32 [MgmSrvr] INFO     -- Node 3: ParticipatingDIH = 0000000000000000
2008-01-02 15:32:32 [MgmSrvr] INFO     -- Node 3: ParticipatingLQH = 0000000000000000
2008-01-02 15:32:32 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2008-01-02 15:32:32 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2008-01-02 15:32:32 [MgmSrvr] INFO     -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2008-01-02 15:32:32 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 1
2008-01-02 15:32:32 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-01-02 15:32:32 [MgmSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=56b60002396d926c]
2008-01-02 15:32:35 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 opened
2008-01-02 15:41:48 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 reserved for ip 172.16.74.62, m_reserved_nodes 0000000000000026.
2008-01-02 15:41:48 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-01-02 15:41:49 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000022.
2008-01-02 15:41:49 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2008-01-02 15:41:49 [MgmSrvr] INFO     -- Node 2: CM_REGCONF president = 3, own Node = 2, our dynamic id = 3
2008-01-02 15:41:49 [MgmSrvr] INFO     -- Node 3: Node 2 Connected
2008-01-02 15:41:49 [MgmSrvr] INFO     -- Node 3: Node 2: API version 5.1.22
2008-01-02 15:41:49 [MgmSrvr] INFO     -- Node 2: Node 3: API version 5.1.22
2008-01-02 15:41:49 [MgmSrvr] INFO     -- Node 2: Start phase 1 completed
2008-01-02 15:41:49 [MgmSrvr] INFO     -- Node 2: Start phase 2 completed (node restart)
2008-01-02 15:41:49 [MgmSrvr] INFO     -- Node 2: Receive arbitrator node 1 [ticket=56b60002396d926c]
2008-01-02 15:41:50 [MgmSrvr] INFO     -- Node 2: Start phase 3 completed (node restart)
2008-01-02 15:41:50 [MgmSrvr] INFO     -- Node 2: Start phase 4 completed (node restart)
2008-01-02 15:41:50 [MgmSrvr] INFO     -- Node 3: DICT: locked by node 2 for NodeRestart
2008-01-02 15:41:50 [MgmSrvr] INFO     -- Node 3: DICT: lock bs: 4 ops: 0 poll: 0 cnt: 0 queue: 2L
2008-01-02 15:41:51 [MgmSrvr] INFO     -- Node 2: DICT: index 13 activated
2008-01-02 15:41:51 [MgmSrvr] INFO     -- Node 2: DICT: index 14 activated
2008-01-02 15:41:51 [MgmSrvr] INFO     -- Node 3: Node: 2 StartLog: [GCI Keep: 395 LastCompleted: 645 NewestRestorable: 918]
2008-01-02 15:42:05 [MgmSrvr] INFO     -- Node 2: Applying undo to LCP: 29
2008-01-02 15:42:05 [MgmSrvr] INFO     -- Node 2: Undo head - undo_1.dat page: 2626 lsn: 1000459
2008-01-02 15:42:07 [MgmSrvr] INFO     -- Node 2: Logfile group: 8
2008-01-02 15:42:07 [MgmSrvr] INFO     -- Node 2:   head: undo_1.dat page: 2626
2008-01-02 15:42:07 [MgmSrvr] INFO     -- Node 2:   tail: undo_1.dat page: 2590
2008-01-02 15:42:07 [MgmSrvr] INFO     -- Node 2: Flushing page cache after undo completion
2008-01-02 15:42:07 [MgmSrvr] INFO     -- Node 2: Flushing complete
2008-01-02 15:42:08 [MgmSrvr] INFO     -- Node 3: Local checkpoint 30 started. Keep GCI = 409 oldest restorable GCI = 410
2008-01-02 15:42:15 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-01-02 15:42:16 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-01-02 15:42:16 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-01-02 15:42:16 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

How to repeat:
1.prepare 'sbtest' table for sysbench test:
/usr/local/sysbench048/bin/sysbench --test=oltp --mysql-table-engine=ndbcluster --oltp-table-size=1000000 --mysql-user=root prepare
2.change sbtest table to disk data:
mysql> create logfile group lg_1 add undofile 'undo_1.dat' initial_size 200M undo_buffer_size 10M engine ndb;
Query OK, 0 rows affected (5.43 sec)
mysql> create tablespace ts_1 add datafile 'data_1.dat' use logfile group lg_1  initial_size 700M engine ndb;
Query OK, 0 rows affected (18.67 sec)
mysql> alter table sbtest tablespace ts_1 storage disk engine ndb;
Query OK, 1000000 rows affected (9 min 31.35 sec)
Records: 1000000  Duplicates: 0  Warnings: 0
3.run sysbench test scripts:
[root@qa3-62 mysql]# /usr/local/sysbench048/bin/sysbench --test=oltp --mysql-table-engine=ndbcluster --oltp-table-size=1000000 --max-requests=50 --num-threads=1 --mysql-user=root run

Can you provide the /usr/local/mysql/clu_data/ndb_2_trace.log.1 and /usr/local/mysql/clu_data/ndb_3_trace.log.1 trace files, too?

I had upload a compressed file named "bug-data-33619.rar" that included all log files of node2&node3 to ftp://ftp.mysql.com/pub/mysql/upload/

this is 
update (only MM columns)
delete
insert
commit;

somewhere the "update MM columns"-optimizations, makes clears load_disk_page_on_commit flag

Proposed patch

Attachment: pp (application/octet-stream, text), 1.23 KiB.

Hi,

I attached a patch (that I tested locally) which seems to work.
Can you also please test it

/Jonas

Testcase which reproduces problem:
1) create table (a int primary key, b int, c int, key(b)) storage disk engine ndb;
2) insert into t1 values (1,1,1);
3) begin;
   update t1 set b = b + 1 where a = 1; // Update *only* MM-columns
   delete from t1 where a = 1;          // Delete row (here is bug)
   insert into t1 values (1,1,1);       // Re-insert row
   commit;

Note:
1) Same primary key every operation
2) First at least 1 operations touching *only* MM-part
3) Then delete+insert

correction: the insert is not needed.
it sufficient with the delete...after MM-only updates

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41613

ChangeSet@1.2675, 2008-02-03 21:24:59+01:00, jonas@perch.ndb.mysql.com +3 -0
  ndb - bug#33619
    make sure to alloc logspace and set bits
    if doing delete after previous update wo/ touching DD part

pushed to 51-ndb, telco-6.x (>= 2)

Documented bugfix in the 5.1.23-ndb-6.3.9 changelog as follows:

        Updating in-memory columns of one or more rows of Disk Data
        table, followed by deletion of these rows and re-insertion of
        them, caused data node failures.

Left in PQ status pending further merges.

Also documented for 5.1.23-ndb-6.2.12.

Pushed into 5.1.24-rc

Pushed into 6.0.5-alpha

Also documented for 5.1.24 and 6.0.5.