Bug #103294 TRUNCATE table causes data node to pause starting
Submitted: 12 Apr 2021 20:42 Modified: 16 Apr 2021 10:33
Reporter: Shawn Hogan Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:mysql-5.7.33 ndb-7.6.17 OS:SUSE
Assigned to: CPU Architecture:x86

[12 Apr 2021 20:42] Shawn Hogan
Description:
I was starting a data node and I noticed that it was still starting close to 2 hours longer than it should have taken in total.  Checked the data node log and it had not written anything for more than 90 minutes.

Here's a copy of the log at the point where the pause happened (10 lines before and 10 lines after):

2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7254F12. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7254F28. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7255F12. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7255F28. Changed +0/-1019 rows, 57064 bytes. 101900 pct churn to 1 rows.
2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7256F12. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7256F28. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7257F12. Changed +0/-815 rows, 45640 bytes. 81500 pct churn to 1 rows.
2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7257F28. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7258F12. Changed +2/-904 rows, 50856 bytes. 45300 pct churn to 2 rows.
2021-04-12 11:01:20 [ndbd] INFO     -- LDM(4): Completed copy of fragment T7258F28. Changed +2/-904 rows, 50856 bytes. 45300 pct churn to 2 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T2F20. Changed +8/-1063 rows, 60168 bytes. 1050 pct churn to 102 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T3F4. Changed +0/-13 rows, 728 bytes. 14 pct churn to 92 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T3F20. Changed +0/-20 rows, 1120 bytes. 23 pct churn to 85 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T4F4. Changed +0/-503 rows, 28168 bytes. 408 pct churn to 123 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T4F20. Changed +0/-512 rows, 28672 bytes. 449 pct churn to 114 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T5F4. Changed +0/-3624 rows, 202944 bytes. 2 pct churn to 148643 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T5F20. Changed +0/-1812 rows, 101472 bytes. 1 pct churn to 148160 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T6F4. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T6F20. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 12:39:39 [ndbd] INFO     -- LDM(2): Completed copy of fragment T7F4. Changed +0/-349 rows, 19544 bytes. 591 pct churn to 59 rows.

It turns out that a SQL node had given a truncate table command during the start up and just sat there for 90 minutes.  From the SQL node:

| 10345979 | root        | localhost | database | Execute | 7719 | checking permissions              | TRUNCATE TABLE xf_thread_view |

I know TRUNCATE will not work while doing a start up, but I would think that it shouldn't halt the start of the data node.

Interestingly, the TRUNCATE had been "running" for over 2 hours, but the logs show the startup paused for a little over 90 minutes.

The solution was to kill the process on the SQL node that was trying to truncate a table and instantly the data node continued starting up and finished without issue.

It seems to me that there shouldn't be a SQL statement (TRUNCATE or otherwise) that would pause the startup of the data node indefinitely.

How to repeat:
Truncate a table while a data node is starting up?

Suggested fix:
Kill the process on the SQL node that is trying to truncate the table.
[16 Apr 2021 10:33] MySQL Verification Team
Hi,

I reproduced this twice on 7.6.17 but could not reproduce on 8.0.23

Thanks for the report
Bogdan