MySQL Bugs: #50534: First datanode crashes when creating a new nodegroup

Bug #50534	First datanode crashes when creating a new nodegroup
Submitted:	22 Jan 2010 7:48	Modified:	27 Jan 2010 7:45
Reporter:	Oli Sennhauser
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.1-telco-7.0	OS:	Any (Linux)
Assigned to:	Jonas Oreland	CPU Architecture:	Any
Tags:	7.0.9, cluster, crash, datanode, MySQL, nodegroup

Description:
When adding data nodes online according to our documentation:

  http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-online-add-node-example.html

In step 6, CREATE NODEGROUP causes the first data node to crash.

How to repeat:
Follow the example in the documentation:

CREATE TABLE ips (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  country_code CHAR(2) NOT NULL,
  type CHAR(4) NOT NULL,
  ip_address varchar(15) NOT NULL,
  addresses BIGINT UNSIGNED DEFAULT NULL,
  date BIGINT UNSIGNED DEFAULT NULL
) ENGINE NDBCLUSTER;

INSERT INTO ips VALUES (NULL, 'CH', 'test', '192.168.1.33', 12345678901234567890, 12345678901234567890);
INSERT INTO ips SELECT NULL, 'CH', 'test', '192.168.1.33', 12345678901234567890, 12345678901234567890 FROM ips LIMIT 32000;

Suggested fix:
No idea except to NOT crash.

Error log of crash

Attachment: ndb_error_report_20100122085004.tar.bz2 (application/x-redhat-package-manager, text), 41.35 KiB.

ndb_mgm> CREATE NODEGROUP 3,4
*   322: Error
*        322-Invalid node(s) specified for new nodegroup, node already in nodegroup: Permanent error: Application error
ndb_mgm> create nodegroup 5,6

Node 3: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
*    -1: Error
*        -1-Unknown error code: Unknown result: Unknown error code

----

Time: Friday 22 January 2010 - 08:34:40
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 270) 0x0000000a
Program: ndbd
Pid: 7173
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /home/mysql/cluster/7.0.9/ndb_3_trace.log.2
***EOM***
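
For anyone collecting several of these entries, the following is a minimal sketch of reading one error log entry into fields. It assumes the entry keeps the `Key: value` layout shown above and ends with `***EOM***`; the function name `parse_error_entry` is hypothetical and not part of any NDB tool.

```python
def parse_error_entry(text):
    """Parse the 'Key: value' lines of one error log entry into a dict.

    Assumption: each field sits on its own line and the entry is
    terminated by a line reading ***EOM***.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "***EOM***":
            break
        # Split on the first ': ' only, so values may contain colons.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields


entry = """Time: Friday 22 January 2010 - 08:34:40
Status: Temporary error, restart node
Error: 2303
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 270) 0x0000000a
Program: ndbd
***EOM***"""

fields = parse_error_entry(entry)
print(fields["Error"], "-", fields["Error data"])
```

Splitting on the first `: ` keeps values such as the `Time` field and `NDBCNTR (Line: 270) 0x0000000a` intact.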

----

DBDIH   000338 000532 013341 013357 013624
NDBCNTR 000214 016621
NDBCNTR 000224
LGMAN   000351
NDBCNTR 000231 000270
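
The jam lines above pair a kernel block name with a run of numeric entries, several of which (224, 231, 351) match line numbers in the source excerpts quoted later in this report. A minimal sketch for turning such lines into (block, numbers) pairs follows. It assumes whitespace-separated fields with the block name first; `parse_jam_trace` is a hypothetical helper, not an NDB utility, and reading the numbers as source lines is itself an assumption.

```python
def parse_jam_trace(lines):
    """Turn jam trace lines into (block_name, [numbers]) tuples.

    Assumption: each non-empty line is a block name followed by
    zero-padded decimal entries, as in the excerpt above.
    """
    entries = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        block = parts[0]
        numbers = [int(p) for p in parts[1:]]
        entries.append((block, numbers))
    return entries


jam = [
    "DBDIH   000338 000532 013341 013357 013624",
    "NDBCNTR 000214 016621",
    "NDBCNTR 000224",
    "LGMAN   000351",
    "NDBCNTR 000231 000270",
]

trace = parse_jam_trace(jam)
last_block, last_numbers = trace[-1]
print(last_block, last_numbers)
```

The final entry of the last line, 270 in NDBCNTR, matches the `Error object: NDBCNTR (Line: 270)` field in the error log.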

--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 3, r.sigId: 1856691 gsn: 164 "CONTINUEB" prio: 0
s.bn: 246 "DBDIH", s.proc: 3, s.sigId: 1856687 length: 1 trace: 8 #sec: 0 fragInf: 0
 Check GCP Stop
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 3, r.sigId: 1856690 gsn: 164 "CONTINUEB" prio: 0
s.bn: 246 "DBDIH", s.proc: 3, s.sigId: 1856686 length: 1 trace: 2 #sec: 0 fragInf: 0
 Start GCP
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 1856689 gsn: 164 "CONTINUEB" prio: 0
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 1856685 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel every 10ms
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 1856688 gsn: 164 "CONTINUEB" prio: 0
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 1856684 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'00000004 H'00000000 H'0061e1cc
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 1856683 gsn: 409 "TIME_SIGNAL" prio: 1
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 1856679 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004

NDBCNTR:

204 /*******************************/
205 /*  SYSTEM_ERROR               */
206 /*******************************/
207 void Ndbcntr::execSYSTEM_ERROR(Signal* signal)
208 {
    ...
215   switch (sysErr->errorCode){
216   case SystemError::GCPStopDetected:
217   {
218     BaseString::snprintf(buf, sizeof(buf),
219              "Node %d killed this node because "
220              "GCP stop was detected",
221              killingNode);
222     signal->theData[0] = 7025;
223     EXECUTE_DIRECT(DBDIH, GSN_DUMP_STATE_ORD, signal, 1);
224     jamEntry();
225
226     {
227       signal->theData[0] = 12002;
228       EXECUTE_DIRECT(LGMAN, GSN_DUMP_STATE_ORD, signal, 1, 0);
229     }
230
231     jamEntry();
232     break;
233   }

----

LGMAN:

349 void
350 Lgman::execDUMP_STATE_ORD(Signal* signal){
351   jamEntry();
352   if (signal->theData[0] == 12001 || signal->theData[0] == 12002)
353   {
    ...

With 7.0.7, not even step 1 works:

shell> ndb_mgmd -f config.ini --configdir=/home/mysql/cluster/7.0.7
2010-01-26 14:12:17 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.35 ndb-7.0.7
2010-01-26 14:12:17 [MgmSrvr] INFO     -- Reading cluster configuration from 'config.ini'
shell> ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=10 (not connected, accepting connect from localhost)
id=20 (not connected, accepting connect from localhost)
id=30 (not connected, accepting connect from localhost)
id=40 (not connected, accepting connect from localhost)

[ndb_mgmd(MGM)] 1 node(s)
id=2    @localhost  (mysql-5.1.35 ndb-7.0.7)

trying 7.0.8 now...

This is a regression; I am not entirely sure when it was introduced.
And our test cases for it were disabled :-(

The patch will fix the problem and re-enable the testing.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98195

3369 Jonas Oreland	2010-01-26
      ndb - bug#50534 - fix regression in create/drop nodegroup, and make sure that it's properly tested

Pushed into 5.1.41-ndb-7.0.11 (revid:jonas@mysql.com-20100126140723-ec25q36v55cw5awp) (version source revid:jonas@mysql.com-20100126140352-0ld0q4gk0yc8wh7v) (merge vers: 5.1.41-ndb-7.0.11) (pib:16)

pushed into 7.0.11

Documented in the NDB-7.0.11 changelog as follows:

      CREATE NODEGROUP could sometimes cause a data node forced 
      shutdown.

Closed.