Bug #80554 Binlog schema distribution timeout and fails when another mysql node start
Submitted: 29 Feb 2016 9:43 Modified: 30 Mar 2016 10:37
Reporter: Ole John Aske Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:7.4.11 OS:Any
Assigned to: CPU Architecture:Any

[29 Feb 2016 9:43] Ole John Aske
Description:
2016-02-25 14:17:51 21644 [Note] NDB: distributed .lg_1(10/1) type: LOGFILE_GROUP(9) query: 'CREATE LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo_1.dat'
    INITIAL_SIZE 4M
    UNDO_BUFFER_SIZE 2M
    ENGINE NDB' to ffffffffffffffff
2016-02-25 14:17:51 21644 [Note] NDB: got schema event on .lg_1(10/1) query: 'CREATE LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo_1.dat'
    INITIAL_SIZE 4M
    UNDO_BUFFER_SIZE 2M
    ENGINE NDB' type: LOGFILE_GROUP(9) node: 6 slock: ffffffffffffffff
2016-02-25 14:17:51 21644 [Note] NDB: reply to .lg_1(10/1) from ffffffffffffffff to ffffffbfffffffff
2016-02-25 14:17:51 21644 [Note] NDB Schema dist: Data node: 1 reports subscribe from node 9, subscriber bitmask 0200
2016-02-25 14:17:51 21644 [Note] NDB Schema dist: Data node: 2 reports subscribe from node 9, subscriber bitmask 0200
2016-02-25 14:17:51 21644 [Note] NDB: got schema event on .lg_1(10/1) query: '' type: CLEAR_SLOCK(7) node: 6 slock: ffffffbfffffffff
2016-02-25 14:17:51 21644 [Note] CLEAR_SLOCK - .lg_1
2016-02-25 14:17:51 21644 [Note] NDB: CLEAR_SLOCK key: .//lg_1(10/1) from 2000 to ffffffbfffffffff
2016-02-25 14:17:52 21644 [Note] NDB logfile group: waiting max 119 sec for distributing .//lg_1.  epochs: (0/0,0/0,14/31)  injector proc_info: Waiting for event from ndbcluster map: 2000
2016-02-25 14:17:53 21644 [Note] NDB logfile group: waiting max 118 sec for distributing .//lg_1.  epochs: (0/0,0/0,15/8)  injector proc_info: Waiting for event from ndbcluster map: 2000
2016-02-25 14:17:54 21644 [Note] NDB logfile group: waiting max 117 sec for distributing .//lg_1.  epochs: (0/0,0/0,15/18)  injector proc_info: Waiting for event from ndbcluster map: 2000
2016-02-25 14:17:55 21644 [Note] NDB logfile group: waiting max 116 sec for distributing .//lg_1.  epochs: (0/0,0/0,15/28)  injector proc_info: Waiting for event from ndbcluster map: 2000
2016-02-25 14:17:56 21644 [Note] NDB logfile group: waiting max 115 sec for distributing .//lg_1.  epochs: (0/0,0/0,16/8)  injector proc_info: Waiting for event from ndbcluster map: 2000
......
2016-02-25 14:19:50 21644 [Note] NDB logfile group: waiting max 3 sec for distributing .//lg_1.  epochs: (0/0,0/0,53/9)  injector proc_info: Waiting for event from ndbcluster map: 2000
2016-02-25 14:19:51 21644 [Note] NDB logfile group: waiting max 2 sec for distributing .//lg_1.  epochs: (0/0,0/0,53/19)  injector proc_info: Waiting for event from ndbcluster map: 2000
2016-02-25 14:19:52 21644 [Note] NDB logfile group: waiting max 1 sec for distributing .//lg_1.  epochs: (0/0,0/0,53/29)  injector proc_info: Waiting for event from ndbcluster map: 2000
2016-02-25 14:19:53 21644 [ERROR] NDB logfile group: distributing .//lg_1 timed out. Ignoring...
mysqld: /export/home/pb2/build/sb_0-18057359-1456397880.82/mysql-cluster-gpl-7.4.11/sql/ha_ndbcluster_binlog.cc:2046: int ndbcluster_log_schema_op(THD*, const char*, int, const char*, const char*, uint32, uint32, SCHEMA_OP_TYPE, const char*, const char*): Assertion `false' failed.
11:19:53 UTC - mysqld got signal 6 ;

The above is from debug compiled binaries, where we hit an assert when timing out.
Else, only an error is returned which fails the schema operation.

We find from the log that the mysqld node=9, joined *after* the schema operation started.
We then wait for that node to 'ack' the participation in the schema change.
As that node was not active when the operation was initiated, we should
actually not wait for that node at all.

How to repeat:

Start a mysqld node concurrently with schema operations
being issues to another mysqld node.

A test program doing that will be provided as part of the patch
[30 Mar 2016 10:37] Jon Stephens
Fixed in NDB 7.4.11 and 7.5.2. Documented as follows:

    When an SQL node was started, and joined the schema distribution
    protocol, another SQL node, already waiting for a schema change
    to be distributed, timed out during that wait. This was because
    the code incorrectly assumed that the new SQL node would also
    acknowledge the schema distribution even though the new node
    joined too late to be a participant in it.

    As part of this fix, printouts of schema distribution progress
    now always print the more significant part of a bitmask before
    the less significant; formatting of bitmasks in such prinouts
    has also been improved.

Closed.