Bug #105098 Node 1 constantly reports error 1204
Submitted: 1 Oct 2021 10:12    Modified: 19 Nov 2021 15:10
Reporter: Mikael Ronström      Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S3 (Non-critical)
Version: 8.0.26                OS: Any
Assigned to:                   CPU Architecture: Any

[1 Oct 2021 10:12] Mikael Ronström
Description:
See below

How to repeat:
The following test case should find it in a safe manner after a few minutes:
./mtr --suite=ndb --start-and-exit ndb_basic_3rpl
testNodeRestart -n Bug16895311 T1

This should report error 1204 and bail out of the test case when restarting
node 3.

Suggested fix:
Change either i = 1 to i = 0, or change i+1 to i, in the search of the bitmap
in execCOPY_FRAGREQ.
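
As a rough illustration only, here is a minimal, self-contained sketch of that kind of off-by-one; it is NOT the actual execCOPY_FRAGREQ code. It assumes a node bitmask where bit (nodeId - 1) is set for each node that must be told about the new fragment distribution key, and send_update_frag_dist_key() is a hypothetical stand-in for sending the per-node signal (UPDATE_FRAG_DIST_KEY_ORD, per the changelog entry later in this bug).

// Minimal sketch of the off-by-one described above; NOT the actual NDB
// kernel code. Assumption: bit (nodeId - 1) of the mask is set for each
// node that must be told about the new fragment distribution key.
#include <bitset>
#include <cstdio>

static const unsigned MAX_SKETCH_NODES = 256;

// Hypothetical stand-in for sending the per-node signal.
static void send_update_frag_dist_key(unsigned nodeId)
{
  std::printf("signal -> node %u\n", nodeId);
}

// Buggy scan: starting at i = 1 never tests bit 0, so node id 1 (= 0 + 1)
// is silently skipped.
static void notify_nodes_buggy(const std::bitset<MAX_SKETCH_NODES> &mask)
{
  for (unsigned i = 1; i < MAX_SKETCH_NODES; i++)
    if (mask.test(i))
      send_update_frag_dist_key(i + 1);   // node id = bit index + 1
}

// One of the two fixes suggested above: start the scan at i = 0 so the bit
// for node id 1 is included.
static void notify_nodes_fixed(const std::bitset<MAX_SKETCH_NODES> &mask)
{
  for (unsigned i = 0; i < MAX_SKETCH_NODES; i++)
    if (mask.test(i))
      send_update_frag_dist_key(i + 1);
}

int main()
{
  std::bitset<MAX_SKETCH_NODES> mask;
  mask.set(0);   // node id 1
  mask.set(1);   // node id 2

  std::printf("buggy scan:\n");
  notify_nodes_buggy(mask);   // only node 2 is signalled
  std::printf("fixed scan:\n");
  notify_nodes_fixed(mask);   // nodes 1 and 2 are signalled
  return 0;
}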
[1 Oct 2021 12:23] MySQL Verification Team
Hi Mikael, 

Thanks for the report.

all best
Bogdan
[19 Nov 2021 14:23] Mauritz Sundell
Posted by developer:
 
Conditions for bug

  * using 8.0.23 or newer
  * using 3 or 4 replicas
  * one data node should have node id 1

Rolling restart under load should be enough to hit it.

SQL queries will show the warning: Got temporary error 1204 'Temporary failure, distribution changed' from NDB.
Note that not every occurrence of error 1204 indicates that this bug has been hit; the warning typically shows up temporarily during node restarts.
But when this bug is hit, it is not a temporary condition in data node 1, and as a workaround data node 1 can be restarted.
Also note that if no node in the same node group as node id 1 is down when queries fail, this bug is not the cause of the failed queries.

This bug can also impact any operation that uses NDB tables in its implementation, such as DDL, autoincrement, backup, binlogging and replication.
Check whether error 1204 shows up in SHOW WARNINGS after the failed command.
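
To make that check concrete, here is a minimal sketch using the MySQL C API that runs SHOW WARNINGS after a failed statement and scans the message text for the 1204 error quoted above. The connection parameters and the table t1 are placeholders, not taken from this report.

// Build (typically): g++ check_1204.cc $(mysql_config --cflags --libs)
#include <mysql.h>
#include <cstdio>
#include <cstring>

int main()
{
  MYSQL *conn = mysql_init(nullptr);
  // Placeholder connection parameters.
  if (!mysql_real_connect(conn, "127.0.0.1", "root", "", "test", 3306,
                          nullptr, 0))
  {
    std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
    return 1;
  }

  // Placeholder statement; while the bug is active, key lookups routed to
  // data node 1 may fail with error 1297.
  if (mysql_query(conn, "SELECT * FROM t1 WHERE pk = 42") != 0)
  {
    std::fprintf(stderr, "query failed: %u %s\n", mysql_errno(conn),
                 mysql_error(conn));

    // Inspect the warnings of the failed statement.
    if (mysql_query(conn, "SHOW WARNINGS") == 0)
    {
      MYSQL_RES *res = mysql_store_result(conn);
      MYSQL_ROW row;
      while (res != nullptr && (row = mysql_fetch_row(res)) != nullptr)
      {
        // SHOW WARNINGS columns: Level, Code, Message. The NDB error 1204
        // appears inside the message text, as quoted in the comment above.
        if (row[2] != nullptr && std::strstr(row[2], "error 1204") != nullptr)
          std::printf("hit NDB error 1204: %s\n", row[2]);
      }
      mysql_free_result(res);
    }
  }

  mysql_close(conn);
  return 0;
}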

A more proactive workaround is to always restart data node 1 whenever any other node in the same node group has restarted.
[19 Nov 2021 14:25] Mauritz Sundell
Posted by developer:
 
How bug works:

  * While node 1 is alive, one other node in the same node group should restart, to change the distribution key on fragments.

  * Still while node 1 is alive, yet another node in the same node group should stop, such that node 1 becomes primary for some fragment.

  * There should be some request by key against node 1 on such a fragment.
    SQL queries will typically fail with error 1297, and SHOW WARNINGS will reveal error 1204.

By restarting data node 1, it will get the correct distribution keys from the other nodes during startup.
[19 Nov 2021 15:10] Jon Stephens
Documented fix as follows in the NDB 8.0.28 changelog:

    Following improvements in LDM handling made in NDB 8.0.23, an
    UPDATE_FRAG_DIST_KEY_ORD signal was never sent when needed to a
    data node using 1 as its node ID. When running the cluster with
    3 or 4 replicas and another node in the same node group
    restarted, this could result in SQL statements being rejected
    with error 1297 and, subsequently, SHOW WARNINGS reporting error
    1204.

    NOTE Prior to upgrading to this release, you can work around
    the issue by restarting data node 1 whenever any other node
    in the same node group has been restarted.

Closed.