Bug #105098 Node 1 constantly reports error 1204
Submitted: 1 Oct 2021 10:12    Modified: 19 Nov 2021 15:10
Reporter: Mikael Ronström      Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S3 (Non-critical)
Version: 8.0.26                OS: Any
Assigned to:                   CPU Architecture: Any

[1 Oct 2021 10:12] Mikael Ronström
Description:
See below

How to repeat:
The following test case should find it in a safe manner after a few minutes:
./mtr --suite=ndb --start-and-exit ndb_basic_3rpl
testNodeRestart -n Bug16895311 T1

This should report error 1204 and bail out of the test case when restarting
node 3.

Suggested fix:
Change either i = 1 to i = 0, or change i+1 to i, in the search of the bitmap
in execCOPY_FRAGREQ.
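
As a rough illustration only, here is a minimal, self-contained sketch of that kind of off-by-one; it is NOT the actual execCOPY_FRAGREQ code. It assumes a node bitmask where bit (nodeId - 1) is set for each node that must be told about the new fragment distribution key, and send_update_frag_dist_key() is a hypothetical stand-in for sending the per-node signal (UPDATE_FRAG_DIST_KEY_ORD, per the changelog entry later in this bug).

// Minimal sketch of the off-by-one described above; NOT the actual NDB
// kernel code. Assumption: bit (nodeId - 1) of the mask is set for each
// node that must be told about the new fragment distribution key.
#include <bitset>
#include <cstdio>

static const unsigned MAX_SKETCH_NODES = 256;

// Hypothetical stand-in for sending the per-node signal.
static void send_update_frag_dist_key(unsigned nodeId)
{
  std::printf("signal -> node %u\n", nodeId);
}

// Buggy scan: starting at i = 1 never tests bit 0, so node id 1 (= 0 + 1)
// is silently skipped.
static void notify_nodes_buggy(const std::bitset<MAX_SKETCH_NODES> &mask)
{
  for (unsigned i = 1; i < MAX_SKETCH_NODES; i++)
    if (mask.test(i))
      send_update_frag_dist_key(i + 1);   // node id = bit index + 1
}

// One of the two fixes suggested above: start the scan at i = 0 so the bit
// for node id 1 is included.
static void notify_nodes_fixed(const std::bitset<MAX_SKETCH_NODES> &mask)
{
  for (unsigned i = 0; i < MAX_SKETCH_NODES; i++)
    if (mask.test(i))
      send_update_frag_dist_key(i + 1);
}

int main()
{
  std::bitset<MAX_SKETCH_NODES> mask;
  mask.set(0);   // node id 1
  mask.set(1);   // node id 2

  std::printf("buggy scan:\n");
  notify_nodes_buggy(mask);   // only node 2 is signalled
  std::printf("fixed scan:\n");
  notify_nodes_fixed(mask);   // nodes 1 and 2 are signalled
  return 0;
}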
[1 Oct 2021 12:23] MySQL Verification Team
Hi Mikael, 

Thanks for the report.

all best
Bogdan
[19 Nov 2021 14:23] Mauritz Sundell
Posted by developer:
 
Conditions for bug

  * using 8.0.23 or newer
  * using 3 or 4 replicas
  * one data node should have node id 1

Rolling restart under load should be enough to hit it.

SQL queries will show the warning: Got temporary error 1204 'Temporary failure, distribution changed' from NDB.
Note that not every occurrence of error 1204 indicates that this bug has been hit; the warning typically shows up temporarily during node restarts.
But when this bug is hit, it is not a temporary condition in data node 1, and as a workaround data node 1 can be restarted.
Also note that if no node in the same node group as node id 1 is down when queries fail, this bug is not the cause of the failed queries.

This bug can also impact any operation that uses NDB tables in its implementation, such as DDL, autoincrement, backup, binlogging and replication.
Check whether error 1204 shows up in SHOW WARNINGS after the failed command.
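
To make that check concrete, here is a minimal sketch using the MySQL C API that runs SHOW WARNINGS after a failed statement and scans the message text for the 1204 error quoted above. The connection parameters and the table t1 are placeholders, not taken from this report.

// Build (typically): g++ check_1204.cc $(mysql_config --cflags --libs)
#include <mysql.h>
#include <cstdio>
#include <cstring>

int main()
{
  MYSQL *conn = mysql_init(nullptr);
  // Placeholder connection parameters.
  if (!mysql_real_connect(conn, "127.0.0.1", "root", "", "test", 3306,
                          nullptr, 0))
  {
    std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
    return 1;
  }

  // Placeholder statement; while the bug is active, key lookups routed to
  // data node 1 may fail with error 1297.
  if (mysql_query(conn, "SELECT * FROM t1 WHERE pk = 42") != 0)
  {
    std::fprintf(stderr, "query failed: %u %s\n", mysql_errno(conn),
                 mysql_error(conn));

    // Inspect the warnings of the failed statement.
    if (mysql_query(conn, "SHOW WARNINGS") == 0)
    {
      MYSQL_RES *res = mysql_store_result(conn);
      MYSQL_ROW row;
      while (res != nullptr && (row = mysql_fetch_row(res)) != nullptr)
      {
        // SHOW WARNINGS columns: Level, Code, Message. The NDB error 1204
        // appears inside the message text, as quoted in the comment above.
        if (row[2] != nullptr && std::strstr(row[2], "error 1204") != nullptr)
          std::printf("hit NDB error 1204: %s\n", row[2]);
      }
      mysql_free_result(res);
    }
  }

  mysql_close(conn);
  return 0;
}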

A more proactive workaround is to always restart data node 1 whenever any other node in the same node group has restarted.
[19 Nov 2021 14:25] Mauritz Sundell
Posted by developer:
 
How bug works:

  * While node 1 is alive, one other node in the same node group should restart, to change the distribution key on fragments.

  * Still while node 1 is alive, yet another node in the same node group should stop, such that node 1 becomes primary for some fragment.

  * There should be some request by key against node 1 on such a fragment.
    SQL queries will typically fail with error 1297, and SHOW WARNINGS will reveal error 1204.

By restarting data node 1, it will get the correct distribution keys from the other nodes during startup.
[19 Nov 2021 15:10] Jon Stephens
Documented fix as follows in the NDB 8.0.28 changelog:

    Following improvements in LDM handling made in NDB 8.0.23, an
    UPDATE_FRAG_DIST_KEY_ORD signal was never sent when needed to a
    data node using 1 as its node ID. When running the cluster with
    3 or 4 replicas and another node in the same node group
    restarted, this could result in SQL statements being rejected
    with error 1297 and, subsequently, SHOW WARNINGS reporting error
    1204.

    NOTE Prior to upgrading to this release, you can work around
    the issue by restarting data node 1 whenever any other node
    in the same node group has been restarted.

Closed.