Description:
I was starting a data node and noticed that the start was taking close to 2 hours longer than it should have in total. I checked the data node log and found that nothing had been written for more than 90 minutes.
Here's a copy of the log around the point where the pause happened (10 lines before and 10 lines after the gap):
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7254F12. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7254F28. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7255F12. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7255F28. Changed +0/-1019 rows, 57064 bytes. 101900 pct churn to 1 rows.
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7256F12. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7256F28. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7257F12. Changed +0/-815 rows, 45640 bytes. 81500 pct churn to 1 rows.
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7257F28. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7258F12. Changed +2/-904 rows, 50856 bytes. 45300 pct churn to 2 rows.
2021-04-12 11:01:20 [ndbd] INFO -- LDM(4): Completed copy of fragment T7258F28. Changed +2/-904 rows, 50856 bytes. 45300 pct churn to 2 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T2F20. Changed +8/-1063 rows, 60168 bytes. 1050 pct churn to 102 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T3F4. Changed +0/-13 rows, 728 bytes. 14 pct churn to 92 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T3F20. Changed +0/-20 rows, 1120 bytes. 23 pct churn to 85 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T4F4. Changed +0/-503 rows, 28168 bytes. 408 pct churn to 123 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T4F20. Changed +0/-512 rows, 28672 bytes. 449 pct churn to 114 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T5F4. Changed +0/-3624 rows, 202944 bytes. 2 pct churn to 148643 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T5F20. Changed +0/-1812 rows, 101472 bytes. 1 pct churn to 148160 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T6F4. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T6F20. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2021-04-12 12:39:39 [ndbd] INFO -- LDM(2): Completed copy of fragment T7F4. Changed +0/-349 rows, 19544 bytes. 591 pct churn to 59 rows.
It turned out that an SQL node had issued a TRUNCATE TABLE command during the startup, and the statement just sat there for over 90 minutes. From the SQL node's process list:
| 10345979 | root | localhost | database | Execute | 7719 | checking permissions | TRUNCATE TABLE xf_thread_view |
I know that TRUNCATE cannot complete while a data node is starting up, but I would not expect it to halt the data node's start.
Interestingly, the TRUNCATE had been "running" for over 2 hours (7719 seconds in the process list), while the log shows the startup paused for a little over 90 minutes.
The workaround was to kill the session on the SQL node that was trying to truncate the table; the moment it was killed, the data node resumed starting up and finished without issue.
It seems to me that no SQL statement (TRUNCATE or otherwise) should be able to pause the startup of a data node indefinitely.
How to repeat:
Truncate an NDB table from an SQL node while a data node is starting up; a sketch of the steps follows.
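A minimal reproduction sketch. The node id 5 is an assumption, and xf_thread_view stands in for any NDB table (the name is taken from the process list above):

-- Restart a data node so that it enters the fragment-copy phase,
-- e.g. from the management client:
--   ndb_mgm -e "5 RESTART"     (node id 5 is an assumption)
-- While that data node is still starting, run on any SQL node:
TRUNCATE TABLE xf_thread_view;
-- Observed behaviour: the statement hangs in the
-- "checking permissions" state and the data node's fragment-copy
-- log output stops until the session is killed.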
Suggested fix:
As a workaround, kill the session on the SQL node that is trying to truncate the table (see the example below). A proper fix should ensure that a pending TRUNCATE cannot stall a data node restart indefinitely.
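For reference, a sketch of the workaround on the SQL node; the connection id 10345979 is the one from the process list above, and yours will differ:

-- Find the stuck TRUNCATE and its connection id:
SHOW PROCESSLIST;
-- Kill that connection; the data node resumes starting immediately:
KILL 10345979;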