Bug #74510 DICT master crash during takeover during 'rollforward' of schemaTrans
Submitted: 22 Oct 2014 14:09 Modified: 15 May 2015 9:26
Reporter: Ole John Aske
Status: Closed
Category:MySQL Cluster: Disk Data Severity:S1 (Critical)
Version:7.1.33 OS:Any
Assigned to: CPU Architecture:Any

[22 Oct 2014 14:09] Ole John Aske
When a DICT master fails late in the commit processing of a schema transaction, the new DICT master should decide to roll this transaction forward (complete the commit) during takeover processing. We observe that this sometimes causes a crash in the new DICT master:

2014-10-22 15:55:00 [ndbd] ALERT    -- Arbitration check won - node group majority
2014-10-22 15:55:00 [ndbd] INFO     -- President restarts arbitration thread [state=6]
2014-10-22 15:55:00 [ndbd] INFO     -- DBTC instance 0: Starting take over of node 3
Dbdict::execDICT_TAKEOVER_REF: error 1, from 4
execDICT_TAKEOVER_CONF: Node 5, trans 10(13), count 2, rollf 50331659/13, rb 0/0
Dbdict::execDICT_TAKEOVER_REF: error 1, from 2
New master seized transaction 10
New master locked transaction 10
Adding node 5 to transaction 10
Analyzing transaction progress, trans 10/0, lowest/highest 13/13
Setting transaction state to 13 for rollforward
Setting start state for transaction 10 to 13
Node 5 had 2 operations, master has 4227595259
Node 5 did not have all operations for transaction 10, skip < 50331659
Comparing node 5 rollforward(13(50331659)<13(50331659))/rollback(0(0)<0(0))
2014-10-22 15:55:00 [ndbd] INFO     -- /net/fimafeng09/export/home/tmp/oleja/mysql/mysql-5.6-cluster-7.4-new/storage/ndb/src/kernel/blocks/dbdict/Dbdict.cpp
2014-10-22 15:55:00 [ndbd] INFO     -- DBDICT (Line: 21299) 0x00000002
2014-10-22 15:55:00 [ndbd] INFO     -- Error handler shutting down system
2014-10-22 15:55:00 [ndbd] INFO     -- Error handler shutdown completed - aborting
2014-10-22 15:55:20 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
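
Two of the numbers in the log above are easier to read in hexadecimal. The following standalone helpers are not NDB code; all names in them are made up for illustration. The master's operation count 4227595259 is 0xFBFBFBFB, a single byte repeated four times, which may point to a value read from uninitialized or already released memory. The rollforward op key 50331659 is 0x0300000B, that is 3 in the top byte and 11 in the low 24 bits. Reading the top byte as a node id (the log shows node 3 being taken over) is an assumption made here, not something the log states.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helpers for reading the log values; not part of NDB.

// Top 8 bits of a 32-bit value.
static uint32_t high_byte(uint32_t v) { return v >> 24; }

// Low 24 bits of a 32-bit value.
static uint32_t low_24_bits(uint32_t v) { return v & 0x00FFFFFFu; }

// True if all four bytes of v are identical (for example a fill pattern).
static bool is_repeated_byte(uint32_t v) {
  const uint32_t b = v & 0xFFu;
  return v == b * 0x01010101u;
}

// Formats v as an 8-digit upper-case hex string with a 0x prefix.
static std::string to_hex(uint32_t v) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "0x%08X", static_cast<unsigned>(v));
  return std::string(buf);
}
```

Calling `to_hex` on the two log values and `is_repeated_byte` on the operation count is enough to reproduce the observation; nothing here depends on the cluster sources.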

Crash in:

void Dbdict::check_takeover_replies(Signal* signal)
{
    ...
    // Set current op to the lowest/highest reported by slaves
    // depending on if decision is to rollforward/rollback.
    if (trans_ptr.p->m_master_recovery_state == SchemaTrans::TRS_ROLLFORWARD)
    {
      SchemaOpPtr rollforward_op_ptr;
      ndbrequire(findSchemaOp(rollforward_op_ptr, trans_ptr.p->m_rollforward_op));  // << Line 21299
      trans_ptr.p->m_curr_op_ptr_i = rollforward_op_ptr.i;
      ...
    }
    ...
}

The log above shows that DICT_TAKEOVER_REF was returned from node 4 (this node).
This is because the commit had already completed on this node before it became master.
Thus, the schema transaction object and its schema operations have already been removed
from node 4.

The expected behaviour in this case is that ::check_takeover_replies()
recreates 'dummy' transaction and schema op objects in order to let
the takeover master complete the takeover.

The failed require indicates that the schemaOp for some reason was
not recreated as expected.
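
The expected behaviour described above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the Dbdict implementation: `SchemaOp`, `OpTable`, `find_op`, `ensure_placeholder_op` and `set_rollforward_op` are hypothetical names, and the actual fix may look quite different. It shows only the ordering constraint: if the new master has already released its operations, a placeholder for the rollforward operation has to exist before the lookup that corresponds to the failing ndbrequire.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical stand-in for a schema operation record.
struct SchemaOp {
  uint32_t op_key;
  bool placeholder;  // true if the op was recreated during takeover
};

// Hypothetical stand-in for the new master's table of schema operations.
class OpTable {
 public:
  // Returns the op with the given key, or nullopt if none exists.
  std::optional<SchemaOp> find_op(uint32_t op_key) const {
    auto it = ops_.find(op_key);
    if (it == ops_.end()) return std::nullopt;
    return it->second;
  }

  // Creates a placeholder op for op_key unless an op is already present.
  void ensure_placeholder_op(uint32_t op_key) {
    ops_.emplace(op_key, SchemaOp{op_key, true});
  }

 private:
  std::map<uint32_t, SchemaOp> ops_;
};

// Models the rollforward branch: returns the op to resume from, or nullopt
// in the situation where the real code would hit the failed require.
std::optional<SchemaOp> set_rollforward_op(OpTable& table,
                                           uint32_t rollforward_key,
                                           bool recreate_missing) {
  if (recreate_missing) table.ensure_placeholder_op(rollforward_key);
  return table.find_op(rollforward_key);
}
```

With `recreate_missing` set to false on an empty table the lookup returns nothing, which mirrors the crash reported here; with it set to true the lookup finds the placeholder and the takeover can proceed in this model.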

How to repeat:
Reproduced with the 'y9' testcase in AutoTest 'testDict -n schemaTrans'
[15 May 2015 9:26] Jon Stephens
Documented fix in the NDB 7.1.34, 7.2.19, and 7.3.8 changelogs, as follows:

    In some cases, during DICT master takeover, the new master could
    crash while attempting to roll forward an ongoing schema