Bug #32239 failed ndbrequire on slave server after intensive db activity
Submitted: 9 Nov 2007 16:23   Modified: 24 Feb 2008 21:07
Reporter: Bogdan Kecman
Status: Closed
Category: MySQL Cluster: Replication   Severity: S3 (Non-critical)
Version: 5.1.22-ndb-6.3.2-telco-log   OS: Any
Assigned to:   CPU Architecture: Any
Tags: ndbrequire

[9 Nov 2007 16:23] Bogdan Kecman
Description:
data nodes on the slave cluster crash after intensive db activity

one node crash with
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtc/DbtcMain.cpp
Error object: DBTC (Line: 8545) 0x0000000e
Program: /usr/mysql/libexec/ndbd

other node crash with
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dblqh/DblqhMain.cpp
Error object: DBLQH (Line: 7010) 0x0000000e
Program: /usr/mysql/libexec/ndbd

storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:7010

/**
* Only primary replica can get ZTUPLE_ALREADY_EXIST || ZNO_TUPLE_FOUND
*
* Unless it's a simple or dirty read
*
* NOT TRUE!
* 1) op1 - primary insert ok
* 2) op1 - backup insert fail (log full or what ever)
* 3) op1 - delete ok @ primary
* 4) op1 - delete fail @ backup
*
* -> ZNO_TUPLE_FOUND is possible
*/
ndbrequire(tcPtr->seqNoReplica == 0 ||
           errCode != ZTUPLE_ALREADY_EXIST ||
           (tcPtr->operation == ZREAD && (tcPtr->dirtyOp || tcPtr->opSimple))); //7010

tcPtr->abortState = TcConnectionrec::ABORT_FROM_LQH;
abortCommonLab(signal);

storage/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp:8545

const Uint32 noOfLqhs = tmp.p->noOfLqhs;
ndbrequire(noOfLqhs < MAX_REPLICAS); //8545
tmp.p->lqhNodeId[noOfLqhs] = tnodeid;
tmp.p->noOfLqhs = (noOfLqhs + 1);

How to repeat:
Configuration:
One 64-bit PC running VMware with 4 virtual machines:
2 run ndb_mgm, mysqld, and our application,
2 run an ndb data node.
This machine runs the "master" cluster.

A separate PC runs the "slave" cluster with the same configuration.
There are 2 replication flows.

I want to test the replication with this table and procedure:

create table if not exists loadreptable (
  nid INTEGER NOT NULL,
  nom CHAR(255), prenom CHAR(255), abc CHAR(255), wkz CHAR(255), xyz CHAR(255),
  PRIMARY KEY USING HASH (nid)
) engine=ndb PARTITION BY KEY (nid);

delimiter //
CREATE PROCEDURE loadreplication (in p1 INT)
BEGIN
label1: LOOP
SET p1 = p1 - 1;
IF p1 < 0 THEN LEAVE label1;
END IF;
DELETE FROM loadreptable WHERE nid > 2;
UPDATE loadreptable SET nid=nid+1 ORDER BY nid DESC;
UPDATE loadreptable SET nom="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
INSERT INTO loadreptable VALUES(1,"wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww",
"tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt",
"yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy",
"kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk",
"bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb");
END LOOP label1;
END;
//
delimiter ;

When I call loadreplication with 20, everything works fine.
When I call loadreplication with 200, the 2 data nodes on the slave side crash.

Suggested fix:
n/a
[9 Nov 2007 18:11] Tomas Ulin
6.3.2 is quite an old version.

Can you retry with the latest, 6.3.6?

BR,

Tomas
[12 Nov 2007 14:51] Bogdan Kecman
I was not able to duplicate the behavior using ndb-6.3.6 so I assume this bug is fixed somewhere between 6.3.2 and 6.3.6.