MySQL Bugs: #65037: ndbd can't start while old transactions are kept open

Bug #65037: ndbd can't start while old transactions are kept open
Submitted: 19 Apr 2012 15:19
Modified: 30 Sep 2015 12:50
Reporter: Hartmut Holzgraefe
Status: Verified
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 7.1.x, 7.2.12
OS: Linux
CPU Architecture: Any
Assigned to:

Description:
A starting data node seems to have to lock all rows for a short period of time during start phase 5. 

If a long-running transaction holds row locks that the starting node needs, the node remains in phase 5 until that transaction ends, without giving any hint as to what it is waiting for.

How to repeat:
* create a test table like:

   CREATE TABLE t1(id int primary key, val int) ENGINE=NDBCLUSTER;
   INSERT INTO t1 VALUES(1,1);

* stop one data node

* update the test table row in a loop, inside one transaction that is never committed:

   BEGIN
   while(true)
      UPDATE t1 SET val = val + 1 WHERE id = 1
      sleep 1

* restart the stopped node, see how it hangs in phase 5

* kill the test transaction

* see how the node restart finally continues
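
The update loop from the steps above can be sketched in Python as follows. This is not the attached test.php; the function name `hold_row_lock`, its parameters, and the use of a DB-API 2.0 style connection object are assumptions made for illustration. The caller supplies the connection (for example from a MySQL driver), and the function deliberately never commits, so the lock on the row with `id = 1` stays held between updates.

```python
import time

UPDATE_SQL = "UPDATE t1 SET val = val + 1 WHERE id = 1"


def hold_row_lock(conn, iterations=None, delay=1.0, sleep=time.sleep):
    """Update the test row repeatedly inside one transaction that is never committed.

    conn       -- a DB-API 2.0 style connection (hypothetical; supplied by the caller)
    iterations -- number of updates to run, or None to loop until interrupted
    delay      -- seconds to wait between updates
    sleep      -- sleep function, replaceable for testing
    """
    cur = conn.cursor()
    cur.execute("BEGIN")  # open the transaction explicitly; no COMMIT follows
    done = 0
    while iterations is None or done < iterations:
        cur.execute(UPDATE_SQL)  # the row lock is held from the first update on
        done += 1
        sleep(delay)
    return done
```

Stopping the script (or closing the connection) should end the transaction, which corresponds to the "kill the test transaction" step.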

Suggested fix:
Minimum: have the starting node report at regular intervals (e.g. once per minute) that it is waiting for locks held by another node's transactions.

Preferred, if possible: after a certain grace period (configurable?), force a timeout of the offending transaction so that the node restart can proceed, instead of leaving availability at risk for as long as the long-running transaction is still humming along ...

Making this situation clearer in the documentation would also help a bit in the meantime ;)

Hello Hartmut,

Thank you for the report.
I cannot repeat the described behavior with the provided test case on mysql-5.1.63 ndb-7.1.24.

Could you please tell me the version in which you experienced this issue?

Regards,
Umesh

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

test code

Attachment: test.php (application/x-php, text), 271 bytes.

test schema / data

Attachment: schema.sql (text/x-sql), 76 bytes.

I can still reproduce this easily on 7.1.24 and 7.2.12 using the attached schema/data and php code file:

* create a 2 node cluster (in this case with everything on localhost, but that shouldn't really matter)
* mysql ... test < schema.sql
* ndb_mgm -e "3 stop" # assuming data nodes have ids 2 and 3
* wait for the data node to stop
* run "php test.php" (assuming everything is on localhost and mysql "root" has no password set, else modify the php code accordingly)
* start the 2nd data node again
* watch how it never gets beyond start phase 4 ...
* stop the php script
* see how the 2nd data node completes the remaining start phases

Hello Hartmut,

Thank you for the report.
Verified as described.

Just a correction to the test case (so as not to confuse the 2nd data node with nodeid 2):

* create a 2 node cluster (in this case with everything on localhost, but that shouldn't really matter)
* mysql ... test < schema.sql
* ndb_mgm -e "3 stop" # assuming data nodes have ids 2 and 3
* wait for the data node to stop
* run "php test.php" (assuming everything is on localhost and mysql "root" has no password set, else modify the php code accordingly)
* start the data node again (nodeid: 3, which was stopped in step 3)
* watch how it never gets beyond start phase 4 ...
* stop the php script
* see how the data node (nodeid: 3) completes the remaining start phases

###

ndb_mgm> 3 status
Node 3: starting (Last completed phase 4) (mysql-5.5.30 ndb-7.2.12)
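
Until the data node itself reports what it is waiting for, a stall like the one shown above can at least be noticed from outside. The sketch below assumes the status line format printed above and that the caller obtains such lines by polling `ndb_mgm -e "3 status"` (the polling itself is left out). `parse_status`, `StallWatcher`, and the 60 second threshold are hypothetical names and values, not part of MySQL Cluster. It cannot tell which transaction holds the locks; it only flags that the last completed phase has not changed.

```python
import re

# Matches lines such as:
#   Node 3: starting (Last completed phase 4) (mysql-5.5.30 ndb-7.2.12)
NODE_RE = re.compile(r"Node (\d+): (.*)")
PHASE_RE = re.compile(r"starting \(Last completed phase (\d+)\)")


def parse_status(line):
    """Return (node_id, phase) for a node status line, or None for other lines.

    phase is the last completed start phase, or None if the node is not starting.
    """
    m = NODE_RE.search(line)
    if m is None:
        return None
    p = PHASE_RE.match(m.group(2))
    return (int(m.group(1)), int(p.group(1)) if p else None)


class StallWatcher:
    """Reports when a starting node shows the same completed phase for too long."""

    def __init__(self, threshold_seconds=60.0):
        self.threshold = threshold_seconds
        self._seen = {}  # node_id -> (phase, time the phase was first observed)

    def observe(self, line, now):
        """Feed one status line and a timestamp in seconds; return a warning or None."""
        parsed = parse_status(line)
        if parsed is None:
            return None
        node, phase = parsed
        if phase is None:
            self._seen.pop(node, None)  # node is no longer starting; forget it
            return None
        prev = self._seen.get(node)
        if prev is None or prev[0] != phase:
            self._seen[node] = (phase, now)  # first sighting of this phase
            return None
        stalled = now - prev[1]
        if stalled >= self.threshold:
            return ("node %d: no progress past phase %d for %.0f s; "
                    "check for long-running transactions" % (node, phase, stalled))
        return None
```

Timestamps are passed in rather than read from the clock, so the watcher can be driven by whatever polling interval the caller chooses.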

Thanks,
Umesh

Any news here?