Bug #90940 | MySQL Cluster: 7.5.10: copyfrag failed, error: 1501 : NDBCNTR (Line: 295) 0x0000 | |
---|---|---|---|
Submitted: | 19 May 2018 7:21 | Modified: | 11 Jul 2018 11:01 |
Reporter: | Thomas Waibel-BGo | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | 7.5.10 + commit e3f905ccb844db3cf9da7bc6 | OS: | Debian (Jessie) |
Assigned to: | MySQL Verification Team | CPU Architecture: | x86 (Xeon(R) CPU E5-2643 v2 @ 3.50GHz) |
[19 May 2018 7:21]
Thomas Waibel-BGo
[19 May 2018 21:47]
Thomas Waibel-BGo
It just happened again...

----
2018-05-19 21:41:14 [ndbd] INFO -- LDM(11): Completed copy of fragment T58F44. Changed +1163734/-3500522 rows, 358950584 bytes. 400 pct churn to 1163734 rows.
2018-05-19 21:41:14 [ndbd] INFO -- LDM(11): Completed copy of fragment T59F20. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:41:14 [ndbd] INFO -- LDM(11): Completed copy of fragment T59F44. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:41:27 [ndbd] INFO -- LDM(6): Completed copy of fragment T54F10. Changed +4929575/-697 rows, 492996532 bytes. 100 pct churn to 4929575 rows.
2018-05-19 21:41:35 [ndbd] INFO -- LDM(9): Completed copy of fragment T58F16. Changed +1176820/-3413180 rows, 355891568 bytes. 390 pct churn to 1176820 rows.
2018-05-19 21:41:36 [ndbd] INFO -- LDM(1): Completed copy of fragment T64F24. Changed +28493645/-259 rows, 2849379004 bytes. 100 pct churn to 28493645 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T54F34. Changed +5045224/-104 rows, 504528224 bytes. 100 pct churn to 5045224 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T55F10. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T55F34. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T56F10. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T56F34. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T57F10. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:13 [ndbd] INFO -- LDM(6): Completed copy of fragment T57F34. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2018-05-19 21:42:37 [ndbd] INFO -- Killed by node 3 as copyfrag failed, error: 1501
2018-05-19 21:42:37 [ndbd] INFO -- NDBCNTR (Line: 295) 0x00000000
2018-05-19 21:42:37 [ndbd] INFO -- Error handler restarting system
2018-05-19 21:42:37 [ndbd] INFO -- Error handler shutdown completed - exiting
2018-05-19 21:43:41 [ndbd] INFO -- Angel detected startup failure, count: 2
2018-05-19 21:43:41 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2018-05-19 21:43:41 [ndbd] INFO -- Ndb has terminated (pid 1066) restarting
2018-05-19 21:43:41 [ndbd] INFO -- Angel reconnected to '10.20.56.5:1186'
2018-05-19 21:43:41 [ndbd] INFO -- Angel reallocated nodeid: 3
2018-05-19 21:43:41 [ndbd] INFO -- Angel pid: 35492 started child: 44717
2018-05-19 21:43:41 [ndbd] INFO -- Normal start of data node using checkpoint and log info if existing
----

So somehow we are able to reproduce it - it is just unclear how...
[19 May 2018 21:52]
Thomas Waibel-BGo
Uploaded //support/incoming/mysql-bug-data-90940-db_error_report_20180519234840.tar.bz2
[21 May 2018 7:08]
Thomas Waibel-BGo
Even with a start using --initial we are not able to bring Node 3 back online again... therefore I would like to change the severity from S2 to S1...
[21 May 2018 7:13]
Thomas Waibel-BGo
Uploaded //support/incoming/mysql-bug-data-90940-ndb_error_report_20180521090930.tar.bz2
[28 May 2018 15:54]
MySQL Verification Team
Hi,

This looks like a misconfigured/under-sized cluster rather than a bug; contacting the Support team is, in my opinion, the best course of action.

Can you confirm that:
- you start the crashed node
- while the node is starting you execute that truncate
- the node start fails

Is that how I understood the problem, or do you just start the node with --initial, do not execute a truncate or anything similar, have "normal load" on your cluster, and the node fails to start?

Thanks
Bogdan
[1 Jun 2018 10:32]
Thomas Waibel-BGo
Hi Bogdan,

the TRUNCATE happens every Monday through Friday at 0:30 AM. So we basically have a window from Friday 0:35 until Monday 0:15 to start the missing node. This window also includes the weekend, where we experience low traffic and basically no write activity.

We tried starting the missing Node 3 during that window using --initial, which did not succeed. And we tried starting the missing node without using --initial.

We plan to upgrade to 7.6 soon. Since this requires a rolling --initial restart due to the new file format, this copyfrag problem poses quite a risk here...

Best regards
Rainer Stumbaum
[1 Jun 2018 10:35]
Thomas Waibel-BGo
Just for your reference: an NDB cluster with a starting node does not allow executing a DDL (schema) transaction. If you try to execute a TRUNCATE, DROP TABLE, or CREATE TABLE while a data node is starting, the command will just fail/hang...
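As a minimal illustration of that behavior (the table name below is a made-up example, not from our schema):

  -- Hypothetical example: while a data node is still in its start phases,
  -- a schema transaction like this one hangs or fails instead of running:
  TRUNCATE TABLE mydb.example_table;

  -- Plain DML against the same table keeps working during the restart:
  SELECT COUNT(*) FROM mydb.example_table;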
[4 Jun 2018 12:41]
MySQL Verification Team
Hi,

I'm having issues making a reproducible test case, and am trying to figure out from the log what's going on. Please upload a new set of logs with the latest crash, just to give us more data, if that's not a problem.

Now a question: when you start the node with --initial on Friday at 00:35, when does it fail?

all best
Bogdan

p.s. we are going through the logs we already have, but please send a second batch so we have more data
[4 Jun 2018 13:40]
Thomas Waibel-BGo
Hi Bogdan,

since the crash we have not tried to restart that node. I will restart it again this coming Friday at 0:35 and let you know.

Looking at the ndb_3_out.log I extracted the following times:

...
2018-05-18 01:13:36 [ndbd] INFO -- Angel pid: 35492 started child: 35496
...
2018-05-18 20:33:58 [ndbd] ALERT -- Node 3: Forced node shutdown completed.
...
2018-05-18 20:33:58 [ndbd] INFO -- Angel pid: 35492 started child: 1066
...
2018-05-19 21:43:41 [ndbd] ALERT -- Node 3: Forced node shutdown completed.
...
2018-05-20 00:15:50 [ndbd] INFO -- Angel pid: 15452 started child: 15453
...
2018-05-21 04:45:49 [ndbd] ALERT -- Node 3: Forced node shutdown completed.

So the first two were not started with --initial - only the last one was.

Regarding ndb_error_reporter - do you need all the files? I did set directory permissions so that the file does not get too big...

Best regards
Rainer Stumbaum
[4 Jun 2018 14:38]
MySQL Verification Team
Hi,

Having all the files is useful, but what's required are ndb_3_error.log and ndb_3_trace.log.*

> Looking at the ndb_3_out.log I extracted the following times:

Seen that; pretty inconsistent, 19 hours, 1 hour, 4 hours... I doubt you'll get anything new this Friday, but...

Thanks
Bogdan
[5 Jun 2018 9:06]
Thomas Waibel-BGo
Hi Bogdan,

is there any option I can set to get more information into these trace files? Should I create an ndb_error_reporter output including the filesystems?

Best regards
Rainer Stumbaum
[5 Jun 2018 9:11]
Thomas Waibel-BGo
And I left the dates in the logfile extract on purpose... the actual intervals are 19h20m - 25h10m - 28h30m... Is it somehow possible to pin down the copyfrag error to a specific table in the trace?
[5 Jun 2018 10:20]
MySQL Verification Team
Hi Thomas, Rainer,

I see another bug with this copyfrag issue similar to yours (an older version of cluster), and they don't have any DDL running. Looking at the logs you already provided, this crash happens long before that truncate is executed, so it's questionable whether that truncate has anything to do with the problem... It looks like I was drawn to it as a possible cause while ignoring everything else.

> is there any option I can give to have more information on these trace files?

Well, there are no additional files except the log files that are already included, so the only possible additional info would be a full log of all SQL statements executed on the server around crash time. But if I'm right, that won't help either (in the other bug I'm looking at, they stopped all clients, disconnected the mysqld's, and still got the copyfrag crash).

> Should I create a ndb_error_reporter output including filesystems?

That actually could help - having your whole ndb filesystem - but I'm not sure whether you are comfortable sharing all your data with us. It would allow us to reproduce the issue locally, which should lead to a quick bug fix. The major issue at the moment is that I do see what happened from the log files, I just can't reproduce it in any way to see why it happened and how to prevent it. Let me know if this is possible and how big the whole ndb filesystem is.

> is it somehow possible to pin down the copyfrag error to a specific table in the trace?

When you look at the error file it lists the crash, and in the description of the crash it lists which trace files are linked to that crash.

all best
Bogdan
[5 Jun 2018 10:50]
Thomas Waibel-BGo
Hi Bogdan,

there is no personal data in that database, so it should be no problem. I think the issue is the size of the DB.

Node 3:

rstumbaum@ndb13.dc1:~$ du -hd1 /var/lib/mysql-cluster/ndb_3_fs
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D19
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D17
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D13
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D14
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D15
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D16
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D18
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D9
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D11
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D8
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D10
2.4G  /var/lib/mysql-cluster/ndb_3_fs/D12
71M   /var/lib/mysql-cluster/ndb_3_fs/D1
71M   /var/lib/mysql-cluster/ndb_3_fs/D2
29G   /var/lib/mysql-cluster/ndb_3_fs
rstumbaum@ndb13.dc1:~$

Node 4 (partner of Node 3):

rstumbaum@ndb14.dc1:~$ sudo du -hd1 /var/lib/mysql-cluster/ndb_4_fs
76G   /var/lib/mysql-cluster/ndb_4_fs/D8
76G   /var/lib/mysql-cluster/ndb_4_fs/D16
76G   /var/lib/mysql-cluster/ndb_4_fs/D13
76G   /var/lib/mysql-cluster/ndb_4_fs/D15
76G   /var/lib/mysql-cluster/ndb_4_fs/D14
76G   /var/lib/mysql-cluster/ndb_4_fs/D17
76G   /var/lib/mysql-cluster/ndb_4_fs/D18
76G   /var/lib/mysql-cluster/ndb_4_fs/D19
117G  /var/lib/mysql-cluster/ndb_4_fs/LCP
76G   /var/lib/mysql-cluster/ndb_4_fs/D10
76G   /var/lib/mysql-cluster/ndb_4_fs/D11
76G   /var/lib/mysql-cluster/ndb_4_fs/D12
76G   /var/lib/mysql-cluster/ndb_4_fs/D9
74M   /var/lib/mysql-cluster/ndb_4_fs/D1
74M   /var/lib/mysql-cluster/ndb_4_fs/D2
1018G /var/lib/mysql-cluster/ndb_4_fs
rstumbaum@ndb14.dc1:~$

And this is still missing the undo and data files (disk-based tablespace) of 420 GByte...

Would you need the filesystems of all 4 nodes?
[5 Jun 2018 11:10]
MySQL Verification Team
Hi,

That's kinda huge :(

I'm having issues with /support/incoming/mysql-bug-data-90940-db_error_report_20180519234840.tar.bz2 and the rest of the files - they are gone from the FTP?! Can you upload the latest ndb error report?

all best
Bogdan
[5 Jun 2018 13:04]
Thomas Waibel-BGo
I just uploaded a newly created report: https://data.boerse-go.de/it-operations/ndb_error_report_20180605145847.tar.bz2
[7 Jun 2018 23:16]
Thomas Waibel-BGo
After looking through previous bugs related to copyfrag and error 1501, I added further undo files using

ALTER LOGFILE GROUP `DEFAULT-LG` ADD UNDOFILE 'undo24.log' INITIAL_SIZE 1G ENGINE NDBCLUSTER;

We have 25 1G undo files now. Just started node 3 with --initial now...

...
2018-06-08 01:10:25 [ndbd] INFO -- Angel pid: 31439 started child: 31441
2018-06-08 01:10:25 [ndbd] INFO -- Initial start of data node, ignoring any info on disk
...

Let's see how it goes...
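For reference, undo consumption during the restart can also be watched from SQL; a minimal sketch against the ndbinfo.logspaces table (the percentage column is just a convenience of ours, not from the report):

  -- Undo log space per log part; 'used' approaching 'total' means the
  -- logfile group is about to run out (which surfaces as NDB error 1501).
  SELECT node_id, log_id, log_part, total, used,
         ROUND(100 * used / total, 1) AS pct_used
  FROM ndbinfo.logspaces
  WHERE log_type = 'DD-UNDO';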
[8 Jun 2018 4:16]
Thomas Waibel-BGo
Grepping through the out log should have shown that pattern...

rstumbaum@ndb13.dc1:~$ grep "LGMAN: changing file from undo" /var/lib/mysql-cluster/ndb_3_out.log
2018-05-17 02:20:36 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-17 03:37:10 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-17 05:27:14 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-17 07:38:40 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-17 09:10:46 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-17 10:22:34 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-17 11:34:59 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-17 12:45:09 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-17 13:54:40 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-17 15:02:55 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-05-17 16:11:05 [ndbd] INFO -- LGMAN: changing file from undo4.log to undo3.log
2018-05-17 17:19:31 [ndbd] INFO -- LGMAN: changing file from undo3.log to undo2.log
2018-05-17 18:30:35 [ndbd] INFO -- LGMAN: changing file from undo2.log to undo1.log
2018-05-18 02:43:44 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-18 03:59:46 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-18 06:02:13 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-18 08:07:36 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-18 09:35:42 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-18 10:50:39 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-18 12:03:49 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-18 13:15:17 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-18 14:25:51 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-18 15:36:01 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-05-18 16:49:23 [ndbd] INFO -- LGMAN: changing file from undo4.log to undo3.log
2018-05-18 18:05:42 [ndbd] INFO -- LGMAN: changing file from undo3.log to undo2.log
2018-05-18 19:23:32 [ndbd] INFO -- LGMAN: changing file from undo2.log to undo1.log
2018-05-18 23:21:09 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-19 01:12:58 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-19 02:08:39 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-19 03:09:24 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-19 04:21:18 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-19 05:35:11 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-19 07:38:15 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-19 09:46:12 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-19 11:53:14 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-19 13:59:51 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-05-19 16:06:47 [ndbd] INFO -- LGMAN: changing file from undo4.log to undo3.log
2018-05-19 18:09:35 [ndbd] INFO -- LGMAN: changing file from undo3.log to undo2.log
2018-05-19 20:06:35 [ndbd] INFO -- LGMAN: changing file from undo2.log to undo1.log
2018-05-20 03:13:16 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-20 05:22:40 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-20 07:27:42 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-20 09:37:16 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-20 11:46:53 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-20 13:55:37 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-20 15:50:33 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-20 17:33:07 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-20 19:21:47 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-20 21:26:48 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-05-20 23:31:51 [ndbd] INFO -- LGMAN: changing file from undo4.log to undo3.log
2018-05-21 01:29:05 [ndbd] INFO -- LGMAN: changing file from undo3.log to undo2.log
2018-05-21 03:13:16 [ndbd] INFO -- LGMAN: changing file from undo2.log to undo1.log
2018-05-21 06:21:17 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo13.log
2018-05-21 07:29:21 [ndbd] INFO -- LGMAN: changing file from undo13.log to undo12.log
2018-05-21 08:39:41 [ndbd] INFO -- LGMAN: changing file from undo12.log to undo11.log
2018-05-21 09:48:19 [ndbd] INFO -- LGMAN: changing file from undo11.log to undo10.log
2018-05-21 11:11:59 [ndbd] INFO -- LGMAN: changing file from undo10.log to undo9.log
2018-05-21 13:02:59 [ndbd] INFO -- LGMAN: changing file from undo9.log to undo8.log
2018-05-21 14:54:06 [ndbd] INFO -- LGMAN: changing file from undo8.log to undo7.log
2018-05-21 16:39:23 [ndbd] INFO -- LGMAN: changing file from undo7.log to undo6.log
2018-05-21 18:27:49 [ndbd] INFO -- LGMAN: changing file from undo6.log to undo5.log
2018-05-21 20:25:25 [ndbd] INFO -- LGMAN: changing file from undo5.log to undo4.log
2018-06-08 03:31:34 [ndbd] INFO -- LGMAN: changing file from undo0.log to undo24.log
2018-06-08 05:41:25 [ndbd] INFO -- LGMAN: changing file from undo24.log to undo23.log
rstumbaum@ndb13.dc1:~$

I hope I created enough undo log files...
[10 Jun 2018 22:17]
Thomas Waibel-BGo
Hi Bogdan,

so the problem is indeed coming from the insufficient size of the undo space. Starting from Friday at 23:05 until just now (Monday 0:15) I was able to get to 54% data usage (of the 91% on the other nodes):

rstumbaum@ndb13.dc1:~$ /opt/mysql/server-5.7.22/bin/ndb_mgm -e "all report mem"
Connected to Management Server at: 10.20.56.5:1186
Node 3: Data usage is 54%(9514781 32K pages of total 17530880)
Node 3: Index usage is 50%(9923706 8K pages of total 19661184)
Node 4: Data usage is 91%(16116770 32K pages of total 17530880)
Node 4: Index usage is 83%(16480782 8K pages of total 19661184)
Node 5: Data usage is 92%(16168326 32K pages of total 17530880)
Node 5: Index usage is 83%(16499023 8K pages of total 19661184)
Node 6: Data usage is 92%(16168523 32K pages of total 17530880)
Node 6: Index usage is 83%(16499006 8K pages of total 19661184)

The problem was that more data than usual came in on Friday, so already-processed data had to be truncated at around 20:00. Therefore the starting node 3 had to be stopped again...

The copying of the fragments seems slow - there is almost no CPU activity and only about 5 MBytes/sec of disk activity during that procedure...
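The same numbers can be pulled via SQL instead of ndb_mgm; a minimal sketch using the ndbinfo.memoryusage table (the percentage column is our own convenience):

  -- SQL equivalent of "all report mem": data/index memory used per node.
  SELECT node_id, memory_type, used, total,
         ROUND(100 * used / total, 1) AS pct_used
  FROM ndbinfo.memoryusage
  ORDER BY node_id, memory_type;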
[10 Jun 2018 22:32]
Thomas Waibel-BGo
Latest error report... https://data.boerse-go.de/it-operations/ndb_error_report_20180611001857.tar.bz2
[11 Jun 2018 11:03]
MySQL Verification Team
Hi,

This actually makes more sense now (and explains why I was not able to reproduce it)... the slow disk usage is a problem here. I see you have:

ODirect=1
MinDiskWriteSpeed=30M
MaxDiskWriteSpeed=200M
MaxDiskWriteSpeedOtherNodeRestart=300M
MaxDiskWriteSpeedOwnRestart=500M

so this looks good, but hm... I need to consult further with my colleagues, stay tuned! At the moment this looks more like a misconfiguration than a bug, so more work for the MySQL Support team than bug report handling, but let us be sure what's happening first.

all best
Bogdan
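Whether those configured limits are actually being reached during the restart can be checked from SQL as well; a sketch using the ndbinfo.disk_write_speed_aggregate table (available in 7.4+; the column names are from that ndbinfo schema - verify with DESCRIBE on your version before relying on them):

  -- Observed vs. target disk write speed per LDM thread.
  SELECT node_id, thr_no,
         backup_lcp_speed_last_sec,        -- bytes/sec written for LCP/backup
         redo_speed_last_sec,              -- bytes/sec written to the redo log
         slowdowns_due_to_io_lag,          -- slowdowns caused by lagging disk I/O
         current_target_disk_write_speed
  FROM ndbinfo.disk_write_speed_aggregate;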
[11 Jul 2018 11:01]
MySQL Verification Team
Hi,

After further investigation and discussion with the dev team, this is, as I initially assumed, definitely not a bug but a misconfigured system. Our Support & Consulting team is always at your disposal to help you properly size and configure everything.

all best
Bogdan