Bug #72537 | Replication stalls using Parallel Workers | |
---|---|---|---
Submitted: | 5 May 2014 14:22 | Modified: | 5 Jun 2018 0:17
Reporter: | Van Stokes | Email Updates: |
Status: | No Feedback | Impact on me: |
Category: | MySQL Server: Replication | Severity: | S1 (Critical)
Version: | 5.6.17 | OS: | Any (Windows, Linux)
Assigned to: | MySQL Verification Team | CPU Architecture: | Any
Tags: | myisam, parallel, replication, slave, workers | |
[5 May 2014 14:22]
Van Stokes
[7 May 2014 13:42]
Van Stokes
I modified the Synopsis because we have now encountered this "stall" when no related errors are logged in the error.log.
[7 May 2014 13:48]
Van Stokes
OS: Ubuntu 14.04 LTS (x64)
MySQL Server: 5.6.16 (x64), Ubuntu distro

Here are some diagnostics from where the SLAVE has stalled again:
1) Show the process list (file p1.txt)
2) Show slave status (file s.txt)
3) STOP SLAVE command issued (the client is hung performing STOP SLAVE)
4) Show the process list again (file p2.txt)
[7 May 2014 13:49]
Van Stokes
SHOW FULL PROCESSLIST
Attachment: p1.txt (text/plain), 3.42 KiB.
[7 May 2014 13:49]
Van Stokes
SHOW SLAVE STATUS
Attachment: s.txt (text/plain), 2.27 KiB.
[7 May 2014 13:49]
Van Stokes
SHOW FULL PROCESSLIST after STOP SLAVE
Attachment: p2.txt (text/plain), 3.46 KiB.
[7 May 2014 14:12]
Van Stokes
Had to restart the MySQL server because of the hung slave.

# service mysql stop   (hung attempting to shut down)

From the mysql error.log file:

2014-05-07 10:00:04 8313 [Note] /usr/sbin/mysqld: Normal shutdown
2014-05-07 10:00:04 8313 [Note] Giving 19 client threads a chance to die gracefully
2014-05-07 10:00:04 8313 [Note] Event Scheduler: Purging the queue. 2 events
2014-05-07 10:00:04 8313 [Note] Shutting down slave threads
2014-05-07 10:00:04 8313 [Note] Slave I/O thread killed while reading event
2014-05-07 10:00:04 8313 [Note] Slave I/O thread exiting, read up to log 'master-bin.000093', position 95623245
(no further entries)

Typically the server takes around one minute to shut down; this time it took over five minutes.

Error log entries on startup:

2014-05-07 10:06:08 2103 [Note] InnoDB: The log sequence numbers 80885068473 and 80885068473 in ibdata files do not match the log sequence number 107226483453 in the ib_logfiles!
2014-05-07 10:06:08 2103 [Note] InnoDB: Database was not shutdown normally!
2014-05-07 10:06:08 2103 [Note] InnoDB: Starting crash recovery.
2014-05-07 10:06:08 2103 [Note] InnoDB: Reading tablespace information from the .ibd files...
2014-05-07 10:06:29 2103 [Note] InnoDB: Restoring possible half-written data pages
2014-05-07 10:06:29 2103 [Note] InnoDB: from the doublewrite buffer...
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 3199 row operations to undo
InnoDB: Trx id counter is 224093184
InnoDB: Last MySQL binlog file position 0 617585925, file name master-bin.000091
2014-05-07 10:06:33 2103 [Note] InnoDB: 128 rollback segment(s) are active.
InnoDB: Starting in background the rollback of uncommitted transactions
2014-05-07 10:06:33 7f3e7e45b700 InnoDB: Rolling back trx with id 217297891, 3199 rows to undo
InnoDB: Progress in percents: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2014-05-07 10:06:34 2103 [Note] InnoDB: Waiting for purge to start
16
2014-05-07 10:06:34 2103 [Note] InnoDB: 5.6.17 started; log sequence number 107226483453
2014-05-07 10:06:34 7f3e837fe700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
2014-05-07 10:06:34 2103 [Note] Recovering after a crash using master-bin
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
2014-05-07 10:06:34 2103 [Note] InnoDB: Rollback of trx with id 217297891 completed
2014-05-07 10:06:34 7f3e7e45b700 InnoDB: Rollback of non-prepared transactions completed
2014-05-07 10:06:35 7f3e837fe700 InnoDB: Buffer pool(s) load completed at 140507 10:06:35
2014-05-07 10:06:37 2103 [Note] Starting crash recovery...
2014-05-07 10:06:37 2103 [Note] Crash recovery finished.

And now my SLAVE hang is back on the same transaction!

6 system user (NULL) Connect 4717 Waiting for an event from Coordinator INSERT INTO lcl_exp_ratesheet SET comcde = pCom...
[7 May 2014 14:22]
Van Stokes
I had to shut down the MySQL server - again - because the same transaction had hung the SLAVE. It took over five minutes to shut down, and it still did not shut down cleanly (per the error log). I suspected the problem was in the relay log, so I added this parameter to my my.cnf file:

relay-log-recovery = 1

I restarted MySQL and that entry seemed to work. The slave started running again.
[7 May 2014 14:24]
Van Stokes
Current slave settings (from my.cnf):

###############################
# Replication: SLAVE
log-slave-updates = 1
relay-log = slave-relay-bin
relay-log-index = slave-relay-bin.index
relay-log-info-repository = TABLE
relay-log-recovery = 1              # Added 2014-05-07
replicate-same-server-id = 0
skip-slave-start = 0
slave-net-timeout = 15
slave-parallel-workers = 4
slave-pending-jobs-size-max = 32M   # default 16M
slave-sql-verify-checksum = 1
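[Editor's note] For reference, a quick way to confirm which of these settings are actually in effect on the running slave is to query the server directly rather than trusting my.cnf. This is a minimal sketch using standard 5.6 variable names; it is not taken from the report's attachments:

-- Multi-threaded-slave and relay-log related variables on the running server
SHOW GLOBAL VARIABLES LIKE 'slave_parallel_workers';
SHOW GLOBAL VARIABLES LIKE 'relay_log%';

-- Current state of the I/O and SQL/worker threads
SHOW SLAVE STATUS\G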
[7 May 2014 16:54]
Van Stokes
Slave keeps stalling. This time there doesn't appear to be any SQL activity executing in the process list. I issued another STOP SLAVE from the client and the client is hung. The process list shows me attempting to kill the slave. I had to forcibly stop the server again. Escalated this to critical; I cannot believe I am the only one with this issue. I have changed my configuration from slave-parallel-workers = 4 to slave-parallel-workers = 0 to see if this helps.
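[Editor's note] For anyone trying the same workaround, slave_parallel_workers can also be changed at runtime instead of editing my.cnf and restarting the server. A hedged sketch follows; in 5.6 the new value only takes effect when the slave threads are restarted, and this assumes STOP SLAVE is not itself hung:

STOP SLAVE;
SET GLOBAL slave_parallel_workers = 0;  -- disable the multi-threaded applier
START SLAVE;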
[9 May 2014 14:15]
Van Stokes
Setting slave-parallel-workers = 0 is the only solution. Today I tried setting it back to four, only to have it stall again. I issued a STOP SLAVE, which hung. No slave threads were performing any work. This time I killed each slave thread, and one of them freed up my hung STOP SLAVE. The slave system was stopped; I did not have to stop the server.

This particular server handles mostly MyISAM databases and replicates them downstream to other masters and slaves. It imports data dumped by a mainframe system into a staging database (MyISAM, not replicated, not binlogged) and then 'syncs' the staging database to the replicated database (MyISAM, binlogged) using Webyog's SJA. This MyISAM database is treated as READ-ONLY on all downstream masters and slaves; only this server modifies it. My only assumption here is that MyISAM is not handled properly by the parallel slave system.
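[Editor's note] If the MyISAM theory is worth checking, a quick way to see which replicated tables would go through the non-transactional apply path is to list the MyISAM tables on the slave. This is a generic sketch; it simply excludes the system schemas and is not taken from the report:

-- List MyISAM tables outside the system schemas
SELECT TABLE_SCHEMA, TABLE_NAME
FROM information_schema.TABLES
WHERE ENGINE = 'MyISAM'
  AND TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema');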
[9 May 2014 15:00]
Van Stokes
Well, killing the slave thread was not good. I cannot restart the slave. I get errors in the error log:

2014-05-09 10:03:49 4085 [Warning] Slave SQL: Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0
2014-05-09 10:03:49 4085 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000107' at position 120
2014-05-09 10:03:53 4085 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
2014-05-09 10:03:53 4085 [Note] Slave I/O thread killed while reading event
2014-05-09 10:03:53 4085 [Note] Slave I/O thread exiting, read up to log 'master-bin.000109', position 923623464
2014-05-09 10:15:34 4085 [Note] Slave: MTS group recovery relay log info based on Worker-Id 0, group_relay_log_name ./slave-relay-bin.000256, group_relay_log_pos 100595791 group_master_log_name master-bin.000107, group_master_log_pos 979175316
2014-05-09 10:15:35 4085 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 1001471773.
.... many lines of "Slave: MTS group recovery relay log info" ....
2014-05-09 10:15:58 4085 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 923607991.
2014-05-09 10:15:58 4085 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 923623464.
2014-05-09 10:15:58 4085 [ERROR] Error looking for file after ./slave-relay-bin.000261.

I had to stop the server. Now I have this error:

Error No: 1872 Slave failed to initialize relay log info structure from the repository

2014-05-09 10:47:40 4845 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 923623464.
2014-05-09 10:47:40 4845 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 908532420.
2014-05-09 10:47:40 4845 [ERROR] Error looking for file after ./slave-relay-bin.000262.
2014-05-09 10:47:40 4845 [ERROR] Failed to initialize the master info structure
2014-05-09 10:47:40 4845 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.
2014-05-09 10:47:40 4845 [Note] Event Scheduler: Loaded 2 events
2014-05-09 10:47:40 4845 [Note] /usr/sbin/mysqld: ready for connections. Version: '5.6.17-0ubuntu0.14.04.1-log' socket: '/var/run/mysqld/mysqld.sock' port: 3306 (Ubuntu)
2014-05-09 10:54:32 7f7408d2b700 InnoDB: Buffer pool(s) load completed at 140509 10:54:32
2014-05-09 10:58:17 4845 [ERROR] Slave SQL: Slave failed to initialize relay log info structure from the repository, Error_code: 1872
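[Editor's note] For readers who end up in the same Error 1872 state, the usual way out in similar reports is to discard the damaged relay logs and re-point the slave, after restarting the server if needed (as above). This is not a fix confirmed by the reporter, and it throws away unexecuted relay log events, so the coordinates must come from the last known-good executed position (Exec_Master_Log_Pos in a saved SHOW SLAVE STATUS). The file name and position below are placeholders, not values from this report:

STOP SLAVE;

-- Discard the damaged relay logs and the MTS recovery info
RESET SLAVE;

-- Re-point replication at the last *executed* master position
CHANGE MASTER TO
  MASTER_LOG_FILE = 'master-bin.000107',
  MASTER_LOG_POS = 4;

START SLAVE;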
[1 Aug 2016 18:12]
MySQL Verification Team
Seeing this on 5.7.14 when I stop slave and start slave again.

mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log

Error log contains:
-----
[Note] Slave I/O thread exiting for channel '', read up to log 'i7-bin.000001', position 51698390
[Note] Slave: MTS group recovery relay log info based on Worker-Id 0, group_relay_log_name .\thei7-relay.000007, group_relay_log_pos 258564868 group_master_log_name i7-bin.000001, group_master_log_pos 273638888
[ERROR] Error looking for file after .\thei7-relay.000009.
[5 May 2018 0:17]
MySQL Verification Team
Hi,

I can reproduce this "partially", using parallel workers and crashed MyISAM tables, but:
- I can see errors
- the tables actually are crashed

I'm not sure why that would be a bug. Initially you state it "silently" stopped replication, but I can see errors (5.7), and MyISAM is not very "safe" to begin with. With InnoDB I can't reproduce it, and I can't reproduce the scenario where nothing is in the logs.

all best
Bogdan
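[Editor's note] For completeness, checking and repairing a crashed MyISAM table like the ones referenced above would look roughly like this. The schema and table names are placeholders, not taken from this report:

-- Report whether the MyISAM table is marked as crashed
CHECK TABLE some_db.some_table;

-- Attempt an in-server repair if CHECK TABLE reports corruption
REPAIR TABLE some_db.some_table;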
[5 Jun 2018 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".