Bug #48040 Ndb : TC trigger infinite loop in abort scenario
Submitted: 14 Oct 2009 13:49 Modified: 29 Oct 2009 9:22
Reporter: Frazer Clement
Status: Closed
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:mysql-5.1-telco-6.2 OS:Any
Assigned to: Frazer Clement CPU Architecture:Any
Tags: mysql-5.1-telco-6.2+

[14 Oct 2009 13:49] Frazer Clement
Description:
TC uses TUP triggers to maintain unique indexes and perform table reorganisation.

When executing a batch of operations TC received from an API node, TUP triggers fire and send FIRE_TRIG_ORD signals back to TC.

TC buffers these FIRE_TRIG_ORD requests until it has processed all of the requests from the API (that is, until the connection state enters CS_STARTED or CS_START_COMMITTING).

This is done by sending a CONTINUEB signal to itself that tests whether the state is CS_STARTED or CS_START_COMMITTING. If it is, TC executes the triggers; if not, it sends another CONTINUEB.

When the transaction batch fails, the state never transitions to CS_STARTED or CS_START_COMMITTING, so the CONTINUEB loop is effectively infinite: it runs at least until the connection record is reused. This results in an observable increase in CPU usage.
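The polling behaviour described above can be modelled with a small simulation. This is only an illustrative sketch, not the TC code: the function name, the tick-based state callback, and the state names other than CS_STARTED and CS_START_COMMITTING are assumptions made for the example, and a cap on handled signals stands in for "until the connection record is reused".

```python
from collections import deque

# States in which TC may execute the buffered triggers (from the description).
READY_STATES = {"CS_STARTED", "CS_START_COMMITTING"}


def run_continueb_poll(state_by_tick, max_signals=1000):
    """Model TC re-sending CONTINUEB to itself until triggers can run.

    state_by_tick: hypothetical callback returning the connection state
                   seen when the n-th CONTINUEB is handled.
    max_signals:   safety cap so the simulation itself terminates.
    Returns (triggers_executed, continueb_signals_handled).
    """
    queue = deque(["CONTINUEB"])
    handled = 0
    while queue and handled < max_signals:
        queue.popleft()
        state = state_by_tick(handled)
        handled += 1
        if state in READY_STATES:
            return True, handled      # execute the pending triggers
        queue.append("CONTINUEB")     # not ready yet: poll again
    return False, handled             # cap reached, triggers never ran


# Normal case: the batch completes and the state becomes CS_STARTED.
normal = run_continueb_poll(lambda t: "CS_STARTED" if t >= 3 else "CS_RECEIVING")

# Failure case: the batch aborts, so a ready state is never reached.
aborted = run_continueb_poll(lambda t: "CS_ABORTING", max_signals=500)
```

In the normal case the loop ends after a few signals. In the aborted case every handled CONTINUEB produces another one, so the only thing that stops the simulation is the cap, which corresponds to the CPU usage seen on ndbd.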

How to repeat:
Create a table with a unique index on some column.
Create a batch of AbortOnError DML operations, of which some later operation(s) will fail (e.g. due to no data found, or data already exists).
Execute the batch.
Observe CPU usage on ndbd after the execute failure.

Suggested fix:
1) Modify ContinueB code to break out of the 'loop' if the state is not as expected.
2) Modify ContinueB code to use TransId to verify context before execution.
3) Potentially modify the mechanism not to use 'polling' ContinueB in this way: have TC execute triggers on transition into CS_STARTED/CS_START_COMMITTING.
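Suggestions 1 and 2 can be sketched together as a single decision made each time the CONTINUEB is handled. This is a hypothetical model rather than the actual patch: the `ApiConnect` record, the `handle_continueb` function, the returned action strings, and the choice of CS_RECEIVING as the only state in which polling continues are all assumptions for illustration.

```python
from dataclasses import dataclass

# States in which the buffered triggers may be executed (from the description).
READY_STATES = {"CS_STARTED", "CS_START_COMMITTING"}
# Assumption: the only state in which it still makes sense to keep polling.
WAITING_STATES = {"CS_RECEIVING"}


@dataclass
class ApiConnect:
    """Hypothetical stand-in for a TC connection record."""
    state: str
    transid: tuple


def handle_continueb(conn, signal_transid):
    """Decide what to do with a 'pending triggers' CONTINUEB.

    Returns one of "execute", "resend" or "drop".
    """
    # Suggestion 2: the signal carries the TransId it was sent for. If the
    # record now belongs to another transaction, the signal is stale.
    if conn.transid != signal_transid:
        return "drop"
    if conn.state in READY_STATES:
        return "execute"
    if conn.state in WAITING_STATES:
        return "resend"
    # Suggestion 1: any other state (e.g. an abort in progress) is not
    # expected, so break out of the loop instead of polling forever.
    return "drop"


conn = ApiConnect(state="CS_ABORTING", transid=(1, 42))
action_after_abort = handle_continueb(conn, (1, 42))
```

With both checks in place, an aborted transaction yields "drop" on the next CONTINUEB, and a reused connection record is never acted on by a signal left over from the previous transaction.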
[22 Oct 2009 16:06] Frazer Clement
Proposed fix adding transid checks to TcContinueB

Attachment: bug48040.patch (text/x-patch), 2.42 KiB.

[22 Oct 2009 16:14] Frazer Clement
The proposed patch is a low-risk fix.
ContinueB is checked for TransId alignment in the pending triggers scenario.

Connection State is checked before attempting to execute pending triggers.

No attempt is made to delay abort, or to avoid the need for ContinueB in this scenario.
[22 Oct 2009 16:15] Frazer Clement
Bug can be exposed by testBlobs -skip p -bug 45768.

Results in excessive ndbd CPU consumption due to the infinite loop.
[27 Oct 2009 10:22] Frazer Clement
http://lists.mysql.com/commits/88221

Pushed to 
6.2.19
6.3.28
7.0.9
7.1.0
[28 Oct 2009 19:47] Jon Stephens
Hi Frazer,

Is this something that could be triggered only by an NDBAPI application? If not, could you give me an example of a situation where this issue might come up?

Thanks!
[28 Oct 2009 20:19] Frazer Clement
The original problem was noticed while testing the fix to bug#41674.

It can occur in any case where there's a kernel-triggered abort of bulk DML with Blobs (e.g. ALTER TABLE runs out of space, or a bulk insert runs out of space or hits a duplicate key error).
[29 Oct 2009 9:22] Jon Stephens
Documented bugfix in the NDB-6.2.19, 6.3.28, and 7.0.9 changelogs (together with BUG#34583 and duplicates).

See BUG#45768 for changelog entry.