Bug #90940 | MySQL Cluster: 7.5.10: copyfrag failed, error: 1501 : NDBCNTR (Line: 295) 0x0000 | |
---|---|---|---|
Submitted: | 19 May 2018 7:21 | Modified: | 11 Jul 2018 11:01 |
Reporter: | Thomas Waibel-BGo | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | 7.5.10 + commit e3f905ccb844db3cf9da7bc6 | OS: | Debian (Jessie) |
Assigned to: | MySQL Verification Team | CPU Architecture: | x86 (Xeon(R) CPU E5-2643 v2 @ 3.50GHz) |
[19 May 2018 7:21]
Thomas Waibel-BGo
[19 May 2018 21:47]
Thomas Waibel-BGo
It just happened again...

----
2018-05-19 21:41:14 [ndbd] INFO -- LDM(11): Completed copy of fragment T58F44. Changed +1163734/-3500522 rows, 358950584 bytes. 400 pct churn to 1163734 rows.
2018-05-19 21:41:14 [ndbd] INFO -- LDM(11): Completed copy of fragment T59F20. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:41:14 [ndbd] INFO -- LDM(11): Completed copy of fragment T59F44. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:41:27 [ndbd] INFO -- LDM(6): Completed copy of fragment T54F10. Changed +4929575/-697 rows, 492996532 bytes. 100 pct churn to 4929575 rows.
2018-05-19 21:41:35 [ndbd] INFO -- LDM(9): Completed copy of fragment T58F16. Changed +1176820/-3413180 rows, 355891568 bytes. 390 pct churn to 1176820 rows.
2018-05-19 21:41:36 [ndbd] INFO -- LDM(1): Completed copy of fragment T64F24. Changed +28493645/-259 rows, 2849379004 bytes. 100 pct churn to 28493645 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T54F34. Changed +5045224/-104 rows, 504528224 bytes. 100 pct churn to 5045224 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T55F10. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T55F34. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T56F10. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T56F34. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T57F10. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T57F34. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:37 [ndbd] INFO -- Killed by node 3 as copyfrag failed, error: 1501
2018-05-19 21:42:37 [ndbd] INFO -- NDBCNTR (Line: 295) 0x00000000
2018-05-19 21:42:37 [ndbd] INFO -- Error handler restarting system
2018-05-19 21:42:37 [ndbd] INFO -- Error handler shutdown completed - exiting
2018-05-19 21:43:41 [ndbd] INFO -- Angel detected startup failure, count: 2
2018-05-19 21:43:41 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2018-05-19 21:43:41 [ndbd] INFO -- Ndb has terminated (pid 1066) restarting
2018-05-19 21:43:41 [ndbd] INFO -- Angel reconnected to '10.20.56.5:1186'
2018-05-19 21:43:41 [ndbd] INFO -- Angel reallocated nodeid: 3
2018-05-19 21:43:41 [ndbd] INFO -- Angel pid: 35492 started child: 44717
2018-05-19 21:43:41 [ndbd] INFO -- Normal start of data node using checkpoint and log info if existing
----

So somehow we are able to reproduce it - it is just unclear how...
[19 May 2018 21:52]
Thomas Waibel-BGo
Uploaded //support/incoming/mysql-bug-data-90940-db_error_report_20180519234840.tar.bz2
[21 May 2018 7:08]
Thomas Waibel-BGo
Even with a start using --initial we are not able to bring Node 3 back online again... therefore I would like to change the severity from S2 to S1...
[21 May 2018 7:13]
Thomas Waibel-BGo
Uploaded //support/incoming/mysql-bug-data-90940-ndb_error_report_20180521090930.tar.bz2
[28 May 2018 15:54]
MySQL Verification Team
Hi,

This looks like a misconfigured/under-sized cluster rather than a bug; contacting the Support team is, in my opinion, the best course of action.

Can you confirm that:
- you start the crashed node
- while the node is starting you execute that truncate
- the node start fails

Is that how I understood the problem, or do you just start the node with --initial, do not execute a truncate or anything similar, have "normal load" on your cluster, and the node fails to start?

Thanks
Bogdan
[1 Jun 2018 10:32]
Thomas Waibel-BGo
Hi Bogdan,

the TRUNCATE happens every Monday through Friday at 0:30 AM. So we basically have a window from Friday 0:35 until Monday 0:15 to start the missing node. This window also includes the weekend, where we experience low traffic and basically no write activity.

We tried starting the missing Node 3 during that window using --initial, which did not succeed. And we tried starting the missing node without using --initial.

We plan to upgrade to 7.6 soon. Since this requires a rolling --initial restart due to the new file format, this copyfrag problem poses quite a risk here...

Best regards
Rainer Stumbaum
[1 Jun 2018 10:35]
Thomas Waibel-BGo
Just for your reference: an NDB cluster with a starting node does not allow executing a DDL (schema) transaction. If you try to execute a TRUNCATE, DROP TABLE, or CREATE TABLE while a data node is starting, the command will just fail/hang...
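As a minimal illustration of that behavior (the table name below is a made-up example, not from our schema):

  -- Hypothetical example: while a data node is still in its start phases,
  -- a schema transaction like this one hangs or fails instead of running:
  TRUNCATE TABLE mydb.example_table;

  -- Plain DML against the same table keeps working during the restart:
  SELECT COUNT(*) FROM mydb.example_table;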
[4 Jun 2018 12:41]
MySQL Verification Team
Hi,

I'm having issues making a reproducible test case, and am trying to figure out from the log what's going on. Please upload a new set of logs with the latest crash, just to give us more data, if that's not a problem.

Now a question: when you start the node with --initial on Friday at 00:35, when does it fail?

all best
Bogdan

p.s. we are going through the logs we already have, but please send a second batch so we have more data
[4 Jun 2018 13:40]
Thomas Waibel-BGo
Hi Bogdan,

since the crash we have not tried to restart that node. I will restart it again this coming Friday at 0:35 and let you know.

Looking at the ndb_3_out.log I extracted the following times:

...
2018-05-18 01:13:36 [ndbd] INFO -- Angel pid: 35492 started child: 35496
...
2018-05-18 20:33:58 [ndbd] ALERT -- Node 3: Forced node shutdown completed.
...
2018-05-18 20:33:58 [ndbd] INFO -- Angel pid: 35492 started child: 1066
...
2018-05-19 21:43:41 [ndbd] ALERT -- Node 3: Forced node shutdown completed.
...
2018-05-20 00:15:50 [ndbd] INFO -- Angel pid: 15452 started child: 15453
...
2018-05-21 04:45:49 [ndbd] ALERT -- Node 3: Forced node shutdown completed.

So the first two were not started with --initial - only the last one was.

Regarding ndb_error_reporter - do you need all the files? I did set directory permissions so that the file does not get too big...

Best regards
Rainer Stumbaum
[4 Jun 2018 14:38]
MySQL Verification Team
Hi,

Having all the files is useful, but what's required are ndb_3_error.log and ndb_3_trace.log.*

> Looking at the ndb_3_out.log I extracted the following times:

Seen that; pretty inconsistent, 19 hours, 1 hour, 4 hours... I doubt you'll get anything new this Friday, but...

Thanks
Bogdan
[5 Jun 2018 9:06]
Thomas Waibel-BGo
Hi Bogdan,

is there any option I can set to get more information into these trace files? Should I create an ndb_error_reporter output including the filesystems?

Best regards
Rainer Stumbaum
[5 Jun 2018 9:11]
Thomas Waibel-BGo
And I left the dates in the logfile extract on purpose... the actual intervals are 19h20m - 25h10m - 28h30m... Is it somehow possible to pin down the copyfrag error to a specific table in the trace?
[5 Jun 2018 10:20]
MySQL Verification Team
Hi Thomas, Rainer,

I see another bug with this copyfrag issue similar to yours (an older version of cluster), and they don't have any DDL running. Looking at the logs you already provided, this crash happens long before that truncate is executed, so it's questionable whether that truncate has anything to do with the problem... It looks like I was drawn to it as a possible cause while ignoring everything else.

> is there any option I can give to have more information on these trace files?

Well, there are no additional files except the log files that are already included, so the only possible additional info would be a full log of all SQL statements executed on the server around crash time. But if I'm right, that won't help either (in the other bug I'm looking at, they stopped all clients, disconnected the mysqld's, and still got the copyfrag crash).

> Should I create a ndb_error_reporter output including filesystems?

That actually could help - having your whole ndb filesystem - but I'm not sure whether you are comfortable sharing all your data with us. It would allow us to reproduce the issue locally, which should lead to a quick bug fix. The major issue at the moment is that I do see what happened from the log files, I just can't reproduce it in any way to see why it happened and how to prevent it. Let me know if this is possible and how big the whole ndb filesystem is.

> is it somehow possible to pin down the copyfrag error to a specific table in the trace?

When you look at the error file it lists the crash, and in the description of the crash it lists which trace files are linked to that crash.

all best
Bogdan
[5 Jun 2018 10:50]
Thomas Waibel-BGo
Hi Bogdan,

there is no personal data in that database, so it should be no problem. I think the issue is the size of the DB.

Node 3:

rstumbaum@ndb13.dc1:~$ du -hd1 /var/lib/mysql-cluster/ndb_3_fs
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D19
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D17
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D13
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D14
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D15
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D16
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D18
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D9
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D11
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D8
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D10
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D12
71M   /var/lib/mysql-cluster/ndb_3_fs/D1
71M   /var/lib/mysql-cluster/ndb_3_fs/D2
29G   /var/lib/mysql-cluster/ndb_3_fs
rstumbaum@ndb13.dc1:~$

Node 4 (partner of Node 3):

rstumbaum@ndb14.dc1:~$ sudo du -hd1 /var/lib/mysql-cluster/ndb_4_fs
76G   /var/lib/mysql-cluster/ndb_4_fs/D8
76G   /var/lib/mysql-cluster/ndb_4_fs/D16
76G   /var/lib/mysql-cluster/ndb_4_fs/D13
76G   /var/lib/mysql-cluster/ndb_4_fs/D15
76G   /var/lib/mysql-cluster/ndb_4_fs/D14
76G   /var/lib/mysql-cluster/ndb_4_fs/D17
76G   /var/lib/mysql-cluster/ndb_4_fs/D18
76G   /var/lib/mysql-cluster/ndb_4_fs/D19
117G  /var/lib/mysql-cluster/ndb_4_fs/LCP
76G   /var/lib/mysql-cluster/ndb_4_fs/D10
76G   /var/lib/mysql-cluster/ndb_4_fs/D11
76G   /var/lib/mysql-cluster/ndb_4_fs/D12
76G   /var/lib/mysql-cluster/ndb_4_fs/D9
74M   /var/lib/mysql-cluster/ndb_4_fs/D1
74M   /var/lib/mysql-cluster/ndb_4_fs/D2
1018G /var/lib/mysql-cluster/ndb_4_fs
rstumbaum@ndb14.dc1:~$

And this is still missing the undo and data files (disk-based tablespace) of 420 GByte...

Would you need the filesystems of all 4 nodes?
[5 Jun 2018 11:10]
MySQL Verification Team
Hi,

That's kinda huge :(

I'm having issues with /support/incoming/mysql-bug-data-90940-db_error_report_20180519234840.tar.bz2 and the rest of the files - they are gone from the FTP?! Can you upload the latest ndb error report?

all best
Bogdan
[5 Jun 2018 13:04]
Thomas Waibel-BGo
I just uploaded a newly created report: https://data.boerse-go.de/it-operations/ndb_error_report_20180605145847.tar.bz2
[7 Jun 2018 23:16]
Thomas Waibel-BGo
After looking through previous bugs related to copyfrag and error 1501, I added further undo files using

ALTER LOGFILE GROUP `DEFAULT-LG` ADD UNDOFILE 'undo24.log' INITIAL_SIZE 1G ENGINE NDBCLUSTER;

We have 25 1G undo files now. Just started node 3 with --initial now...

...
2018-06-08 01:10:25 [ndbd] INFO -- Angel pid: 31439 started child: 31441
2018-06-08 01:10:25 [ndbd] INFO -- Initial start of data node, ignoring any info on disk
...

Let's see how it goes...
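For reference, undo consumption during the restart can also be watched from SQL; a minimal sketch against the ndbinfo.logspaces table (the percentage column is just a convenience of ours, not from the report):

  -- Undo log space per log part; 'used' approaching 'total' means the
  -- logfile group is about to run out (which surfaces as NDB error 1501).
  SELECT node_id, log_id, log_part, total, used,
         ROUND(100 * used / total, 1) AS pct_used
  FROM ndbinfo.logspaces
  WHERE log_type = 'DD-UNDO';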
[8 Jun 2018 4:16]
Thomas Waibel-BGo
Grepping through the out log should have shown that pattern...

rstumbaum@ndb13.dc1:~$ grep "LGMAN: changing file from undo" /var/lib/mysql-cluster/ndb_3_out.log
2018-05-17 02:20:36 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-17 03:37:10 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-17 05:27:14 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-17 07:38:40 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-17 09:10:46 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-17 10:22:34 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-17 11:34:59 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-17 12:45:09 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-17 13:54:40 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-17 15:02:55 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-05-17 16:11:05 [ndbd] INFO -- LGMAN: changing file from undo4.log to undo3.log
2018-05-17 17:19:31 [ndbd] INFO -- LGMAN: changing file from undo3.log to undo2.log
2018-05-17 18:30:35 [ndbd] INFO -- LGMAN: changing file from undo2.log to undo1.log
2018-05-18 02:43:44 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-18 03:59:46 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-18 06:02:13 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-18 08:07:36 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-18 09:35:42 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-18 10:50:39 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-18 12:03:49 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-18 13:15:17 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-18 14:25:51 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-18 15:36:01 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-05-18 16:49:23 [ndbd] INFO -- LGMAN: changing file from undo4.log to undo3.log
2018-05-18 18:05:42 [ndbd] INFO -- LGMAN: changing file from undo3.log to undo2.log
2018-05-18 19:23:32 [ndbd] INFO -- LGMAN: changing file from undo2.log to undo1.log
2018-05-18 23:21:09 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-19 01:12:58 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-19 02:08:39 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-19 03:09:24 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-19 04:21:18 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-19 05:35:11 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-19 07:38:15 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-19 09:46:12 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-19 11:53:14 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-19 13:59:51 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-05-19 16:06:47 [ndbd] INFO -- LGMAN: changing file from undo4.log to undo3.log
2018-05-19 18:09:35 [ndbd] INFO -- LGMAN: changing file from undo3.log to undo2.log
2018-05-19 20:06:35 [ndbd] INFO -- LGMAN: changing file from undo2.log to undo1.log
2018-05-20 03:13:16 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-20 05:22:40 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-20 07:27:42 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-20 09:37:16 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-20 11:46:53 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-20 13:55:37 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-20 15:50:33 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-20 17:33:07 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-20 19:21:47 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-20 21:26:48 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-05-20 23:31:51 [ndbd] INFO -- LGMAN: changing file from undo4.log to undo3.log
2018-05-21 01:29:05 [ndbd] INFO -- LGMAN: changing file from undo3.log to undo2.log
2018-05-21 03:13:16 [ndbd] INFO -- LGMAN: changing file from undo2.log to undo1.log
2018-05-21 06:21:17 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-21 07:29:21 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-21 08:39:41 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-21 09:48:19 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-21 11:11:59 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-21 13:02:59 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-21 14:54:06 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-21 16:39:23 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-21 18:27:49 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-21 20:25:25 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-06-08 03:31:34 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo24.log
2018-06-08 05:41:25 [ndbd] INFO -- LGMAN: changing file from undo24.log to undo23.log
rstumbaum@ndb13.dc1:~$

I hope I created enough undo log files...
[10 Jun 2018 22:17]
Thomas Waibel-BGo
Hi Bogdan,

so the problem is indeed coming from the insufficient size of the undo space. Starting from Friday at 23:05 until just now (Monday 0:15) I was able to get to 54% data usage (of the 91% on the other nodes):

rstumbaum@ndb13.dc1:~$ /opt/mysql/server-5.7.22/bin/ndb_mgm -e "all report mem"
Connected to Management Server at: 10.20.56.5:1186
Node 3: Data usage is 54%(9514781 32K pages of total 17530880)
Node 3: Index usage is 50%(9923706 8K pages of total 19661184)
Node 4: Data usage is 91%(16116770 32K pages of total 17530880)
Node 4: Index usage is 83%(16480782 8K pages of total 19661184)
Node 5: Data usage is 92%(16168326 32K pages of total 17530880)
Node 5: Index usage is 83%(16499023 8K pages of total 19661184)
Node 6: Data usage is 92%(16168523 32K pages of total 17530880)
Node 6: Index usage is 83%(16499006 8K pages of total 19661184)

The problem was that more data than usual came in on Friday, so already-processed data had to be truncated at around 20:00. Therefore the starting node 3 had to be stopped again...

The copying of the fragments seems slow - there is almost no CPU activity and only about 5 MBytes/sec of disk activity during that procedure...
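The same numbers can be pulled via SQL instead of ndb_mgm; a minimal sketch using the ndbinfo.memoryusage table (the percentage column is our own convenience):

  -- SQL equivalent of "all report mem": data/index memory used per node.
  SELECT node_id, memory_type, used, total,
         ROUND(100 * used / total, 1) AS pct_used
  FROM ndbinfo.memoryusage
  ORDER BY node_id, memory_type;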
[10 Jun 2018 22:32]
Thomas Waibel-BGo
Latest error report... https://data.boerse-go.de/it-operations/ndb_error_report_20180611001857.tar.bz2
[11 Jun 2018 11:03]
MySQL Verification Team
Hi,

This actually makes more sense now (and explains why I was not able to reproduce it)... the slow disk usage is a problem here. I see you have:

ODirect=1
MinDiskWriteSpeed=30M
MaxDiskWriteSpeed=200M
MaxDiskWriteSpeedOtherNodeRestart=300M
MaxDiskWriteSpeedOwnRestart=500M

so this looks good, but hm... I need to consult further with my colleagues, stay tuned! At the moment this looks more like a misconfiguration than a bug, so more work for the MySQL Support team than bug report handling, but let us be sure what's happening first.

all best
Bogdan
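Whether those configured limits are actually being reached during the restart can be checked from SQL as well; a sketch using the ndbinfo.disk_write_speed_aggregate table (available in 7.4+; the column names are from that ndbinfo schema - verify with DESCRIBE on your version before relying on them):

  -- Observed vs. target disk write speed per LDM thread.
  SELECT node_id, thr_no,
         backup_lcp_speed_last_sec,        -- bytes/sec written for LCP/backup
         redo_speed_last_sec,              -- bytes/sec written to the redo log
         slowdowns_due_to_io_lag,          -- slowdowns caused by lagging disk I/O
         current_target_disk_write_speed
  FROM ndbinfo.disk_write_speed_aggregate;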
[11 Jul 2018 11:01]
MySQL Verification Team
Hi,

After further investigation and discussion with the dev team, this is, as I initially assumed, definitely not a bug but a misconfigured system. Our Support & Consulting team is always at your disposal to help you properly size and configure everything.

all best
Bogdan