Bug #30149 NDB: on-line add column during inserts causes complete and catastrophic failure
Submitted: 31 Jul 2007 14:56 Modified: 5 Oct 2007 11:39
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1-telco-6.2
OS: Linux (32 bit)
Assigned to: Jonathan Miller
CPU Architecture: Any

[31 Jul 2007 14:56] Jonathan Miller
Description:
I pulled Tomas's ForceVarPart patch into the MySQLD and retested using the scripts I built. During this testing, the entire cluster, including both MySQLDs, crashed.

The logs and core files show the following:

Cluster:
Node 3 crashes during memcpy(dst, srcPtr, srcBytes);
Node 2 crashes on assert(page_idx < high_index);

Local = the server the scripts run on
Remote = the MySQLD running on a different server, used by some of the scripts

The local MySQLD crashes in:

107       case AbortBackupOrd::LogBufferFull:
108         fprintf(out, " LogBufferFull: backupPtr: %d backupId: %d\n",
109                 sig->backupPtr, sig->backupId);
110         return true;

The remote MySQLD crashes on:
2637      ptrCheckGuard(failedNodePtr, MAX_NDB_NODES, nodeRec);

A complete crash log will be attached to this report after it is opened.

How to repeat:
Create a cluster with 2 data nodes and 2 MySQLDs, with the MySQLDs on different hosts.
Edit the shell scripts to contain the remote MySQLD information.
Run the main script on the local system using: perl ./ndb_alter_dd.pl --sock
[31 Jul 2007 15:10] Jonathan Miller
AlterTableInsertCrash.log

Attachment: AlterTableInsertCrash.log (text/x-log), 33.46 KiB.

[31 Jul 2007 19:42] Jonathan Miller
With the binlog turned off, the cluster still crashes, but both MySQLDs stay up and running.

Local MySQLD error log:
ERR: receiveResponse - theImpl->theWaiter.m_state = 1
Node failed when TCRELEASE sent
Node failed when TCRELEASE sent
Node failed when TCRELEASE sent
NDB: Found 3 NdbTransaction's that have not been released
NDB: Found 2 NdbReceiver's that have not been released

Remote MySQLD error log:
070731 21:32:59 [Note] NDB_SHARE: trailing share ./TABLE_ALTER/t1(connect_count: 0) released after NSS_DROPPED check at connect_count: 0
NDB: Found 2 NdbTransaction's that have not been released
NDB: Found 2 NdbReceiver's that have not been released
070731 21:33:16 [Note] NDB_SHARE: trailing share ./TABLE_ALTER/t1(connect_count: 0) released after NSS_DROPPED check at connect_count: 0
070731 21:33:18 [Note] NDB_SHARE: trailing share ./TABLE_ALTER/t1(connect_count: 0) released after NSS_DROPPED check at connect_count: 0
070731 21:33:30 [Note] NDB Binlog: Node: 2, down, Subscriber bitmask 00
070731 21:33:30 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema at epoch 63.
070731 21:33:33 [ERROR] /home/ndbdev/jmiller/builds/libexec/mysqld: Incorrect information in file: './TABLE_ALTER/t1.frm'
Node failed when TCRELEASE sent
NDB: Found 1 NdbTransaction that has not been released
NDB: Found 2 NdbReceiver's that have not been released
[1 Aug 2007 21:14] Jonathan Miller
NOTE: This was using Fixed tables, not Dynamic ones.
[5 Oct 2007 11:39] Jonathan Miller
A retest shows that the issue has been corrected.
/Jeb