Bug #82980 | Multi-threaded slave leaks worker threads in case of thread create failure | ||
---|---|---|---|
Submitted: | 14 Sep 2016 7:02 | Modified: | 14 Jul 2017 14:28 |
Reporter: | Laurynas Biveinis (OCA) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | 5.6 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | error handling, MTS, replication |
[14 Sep 2016 7:02]
Laurynas Biveinis
[16 Sep 2016 9:30]
MySQL Verification Team
Hello Laurynas, Thank you for the report. Thanks, Umesh
[16 Sep 2016 9:30]
MySQL Verification Team
-- 5.6.33 SHOW PROCESSLIST; Id User Host db Command Time State Info 2 root localhost test Query 0 init SHOW PROCESSLIST 6 root localhost:41632 test Sleep 1 NULL 7 root localhost:41633 test Sleep 1 NULL 8 root localhost:41636 NULL Binlog Dump 1 Master has sent all binlog to slave; waiting for binlog to be updated NULL main.rpl_bug82980 [ pass ] 1008 -- 5.7.15 SHOW PROCESSLIST; Id User Host db Command Time State Info 3 root localhost test Query 0 starting SHOW PROCESSLIST 7 root localhost:42379 test Sleep 1 NULL 8 root localhost:42380 test Sleep 1 NULL 9 root localhost:42384 NULL Binlog Dump 1 Master has sent all binlog to slave; waiting for more updates NULL main.rpl_bug82980 [ fail ] Found warnings/errors in server log file! Test ended at 2016-09-16 10:42:02 line 2016-09-16T08:42:01.394622Z 9 [ERROR] Failed during slave worker thread creation for channel '' (errno= 0) ^ Found warnings in /export/umesh/server/binaries/GABuilds/mysql-advanced-5.7.15/mysql-test/var/log/mysqld.2.err ok - the logfile can be found in '/export/umesh/server/binaries/GABuilds/mysql-advanced-5.7.15/mysql-test/var/log/main.rpl_bug82980/rpl_bug82980.log'
[19 Sep 2016 10:55]
Laurynas Biveinis
Bug 82980 fix for 5.6 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: bug82980-5.6.patch (application/octet-stream, text), 822 bytes.
[19 Sep 2016 10:56]
Laurynas Biveinis
Bug 82980 fix for 5.7 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: bug82980-5.7.patch (application/octet-stream, text), 828 bytes.
[19 Sep 2016 10:56]
Laurynas Biveinis
Bug 82980 fix for 8.0 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: bug82980-8.0.patch (application/octet-stream, text), 842 bytes.
[21 Mar 2017 10:27]
NEHA KUMARI
Hi Laurynas, I tried reproducing this bug on MySQL-5.6 but couldn't, it passes all the time. With MySQL-5.7 I could reproduce the error, with the test given by you. This was the error in mysqld.2.err: 2017-03-21T09:41:43.359887Z 10 [ERROR] Failed during slave worker thread creation for channel '' (errno= 0) 2017-03-21T09:41:43.359974Z 10 [ERROR] Slave SQL for channel '': Fatal error: Failed during slave workers initialization, Error_code: 1593 But this error persists even with the patch you have submitted. I will try to look more closely at the errors.
[22 Mar 2017 4:26]
Laurynas Biveinis
Hello Neha - I will review too and should get back to you soon
[24 Mar 2017 9:35]
Laurynas Biveinis
My "SHOW PROCESSLIST" command added in the original report is wrong, as it executes on the master instead of the slave. With current public 5.6 (5.6.35), the reduced testcase shows the following PROCESSLIST on slave at the end: SHOW PROCESSLIST; Id User Host db Command Time State Info 6 root localhost:55768 test Query 0 init SHOW PROCESSLIST 7 root localhost:55769 test Sleep 0 NULL 9 system user NULL Connect 0 Waiting for an event from Coordinator NULL If the testcase is modified to comment out "SET @@global.debug= "d,mts_worker_thread_fails";", wait for slave SQL error, and a include/stop_slave.inc is added before final include/start_slave.inc, then the slave processlist at the end becomes: SHOW PROCESSLIST; Id User Host db Command Time State Info 6 root localhost:55742 test Query 0 init SHOW PROCESSLIST 7 root localhost:55743 test Sleep 1 NULL i.e. one less thread
[24 Mar 2017 9:37]
Laurynas Biveinis
And with the patch applied, the reduced testcase processlist has two threads: SHOW PROCESSLIST; Id User Host db Command Time State Info 6 root localhost:55881 test Query 0 init SHOW PROCESSLIST 7 root localhost:55882 test Sleep 1 NULL But now I'm suspecting that the patch traded thread leak for a memory leak, checking.
[24 Mar 2017 9:52]
Laurynas Biveinis
Indeed, the patch causes rpl.rpl_mts_debug 'mix' [ pass ] 54107 rpl.rpl_mts_debug 'row' [ skipped ] Doesn't support --binlog-format='row' rpl.rpl_mts_debug 'stmt' [ skipped ] Doesn't support --binlog-format='statement' worker[1] Valgrind report from /home/laurynas/obj-5.6-valgrind/mysql-test/var/log/mysqld.2.err after tests: rpl.rpl_mts_debug -------------------------------------------------------------------------- HEAP SUMMARY: in use at exit: 68,054,661 bytes in 1,230 blocks total heap usage: 68,147 allocs, 66,917 frees, 368,180,952 bytes allocated 6 bytes in 1 blocks are possibly lost in loss record 1 of 468 at 0x4C2CB3F: malloc (vg_replace_malloc.c:299) by 0xA3184B: my_malloc (my_malloc.c:38) by 0xA07841: Rpl_info_table::Rpl_info_table(unsigned int, char const*, char const*) (rpl_info_table.cc:28) by 0x9FD4FD: Rpl_info_factory::init_repositories(Rpl_info_factory::struct_table_data, Rpl_info_factory::struct_file_data, unsigned int, unsigned int, Rpl_info_handler**, Rpl_info_handler**, char const**) (rpl_info_factory.cc:849) by 0x9FC7D1: Rpl_info_factory::create_worker(unsigned int, unsigned int, Relay_log_info*, bool) (rpl_info_factory.cc:404) by 0x9E5E29: slave_start_single_worker(Relay_log_info*, unsigned long) (rpl_slave.cc:5806) by 0x9E6351: slave_start_workers(Relay_log_info*, unsigned long, bool*) (rpl_slave.cc:5932) by 0x9E6D61: handle_slave_sql (rpl_slave.cc:6139) by 0xE00DB0: pfs_spawn_thread (pfs.cc:1860) by 0x547E6C9: start_thread (pthread_create.c:333) by 0x60450AE: clone (clone.S:105) ... 131,072 bytes in 1 blocks are possibly lost in loss record 437 of 468 at 0x4C2CB3F: malloc (vg_replace_malloc.c:299) by 0xA3184B: my_malloc (my_malloc.c:38) by 0xA0B167: init_dynamic_array2 (array.c:70) by 0x9FF5EF: Slave_worker::init_worker(Relay_log_info*, unsigned long) (rpl_rli_pdb.cc:282) by 0x9E5E6F: slave_start_single_worker(Relay_log_info*, unsigned long) (rpl_slave.cc:5813) by 0x9E6351: slave_start_workers(Relay_log_info*, unsigned long, bool*) (rpl_slave.cc:5932) by 0x9E6D61: handle_slave_sql (rpl_slave.cc:6139) by 0xE00DB0: pfs_spawn_thread (pfs.cc:1860) by 0x547E6C9: start_thread (pthread_create.c:333) by 0x60450AE: clone (clone.S:105) LEAK SUMMARY: definitely lost: 0 bytes in 0 blocks indirectly lost: 0 bytes in 0 blocks possibly lost: 146,792 bytes in 16 blocks still reachable: 0 bytes in 0 blocks of which reachable via heuristic: newarray : 392 bytes in 1 blocks suppressed: 67,907,869 bytes in 1,214 blocks For counts of detected and suppressed errors, rerun with: -v ERROR SUMMARY: 16 errors from 16 contexts (suppressed: 228 from 228) valgrind_report [ fail ] I will submit a fixed version once it passes review/testing here.
[30 Mar 2017 13:29]
Laurynas Biveinis
Here's an improved 5.6 patch. Pasting inline as unable to upload new contributions here due to the status of existing ones. diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc index 4a2d1d4..0d9861c 100644 --- a/sql/rpl_slave.cc +++ b/sql/rpl_slave.cc @@ -5962,10 +5962,6 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited) { int i; THD *thd= rli->info_thd; - if (!*mts_inited) - return; - else if (rli->slave_parallel_workers == 0) - goto end; /* If request for stop slave is received notify worker @@ -6025,7 +6021,10 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited) mysql_mutex_unlock(&w->jobs_lock); } - if (thd->killed == THD::NOT_KILLED) + if (!*mts_inited) + return; + + if (rli->slave_parallel_workers != 0 && thd->killed == THD::NOT_KILLED) (void) mts_checkpoint_routine(rli, 0, false, true/*need_data_lock=true*/); // TODO:consider to propagate an error out of the function for (i= rli->workers.elements - 1; i >= 0; i--) @@ -6035,7 +6034,7 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited) delete_dynamic_element(&rli->workers, i); delete w; } - if (log_warnings > 1) + if (rli->slave_parallel_workers != 0 && log_warnings > 1) sql_print_information("Total MTS session statistics: " "events processed = %llu; " "worker queues filled over overrun level = %lu; " @@ -6049,7 +6048,6 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited) DBUG_ASSERT(rli->pending_jobs == 0); DBUG_ASSERT(rli->mts_pending_jobs_size == 0); -end: rli->mts_group_status= Relay_log_info::MTS_NOT_IN_GROUP; destroy_hash_workers(rli); delete rli->gaq;
[30 Mar 2017 13:29]
Laurynas Biveinis
5.7 version diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc index fd00f1a..3e9dcad 100644 --- a/sql/rpl_slave.cc +++ b/sql/rpl_slave.cc @@ -6770,6 +6770,8 @@ int slave_start_single_worker(Relay_log_info *rli, ulong i) err: if (error && w) { + if (w->current_mts_submode) + delete w->current_mts_submode; delete w; /* Any failure after array inserted must follow with deletion @@ -6896,11 +6898,6 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited) { THD *thd= rli->info_thd; - if (!*mts_inited) - return; - else if (rli->slave_parallel_workers == 0) - goto end; - /* If request for stop slave is received notify worker to stop. @@ -6985,7 +6982,10 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited) mysql_mutex_unlock(&w->jobs_lock); } - if (thd->killed == THD::NOT_KILLED) + if (!*mts_inited) + return; + + if (rli->slave_parallel_workers != 0 && thd->killed == THD::NOT_KILLED) (void) mts_checkpoint_routine(rli, 0, false, true/*need_data_lock=true*/); // TODO:consider to propagate an error out of the function while (!rli->workers.empty()) @@ -7020,7 +7020,6 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited) DBUG_ASSERT(rli->pending_jobs == 0); DBUG_ASSERT(rli->mts_pending_jobs_size == 0); -end: rli->mts_group_status= Relay_log_info::MTS_NOT_IN_GROUP; destroy_hash_workers(rli); delete rli->gaq;
[30 Mar 2017 13:30]
Laurynas Biveinis
A 8.0.0 delta patch, to be applied on the top of the previous 8.0.0 patch. It should be identical to the 5.7 version, give or take line offsets diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc index a440efd..375a043 100644 --- a/sql/rpl_slave.cc +++ b/sql/rpl_slave.cc @@ -6767,6 +6767,8 @@ static int slave_start_single_worker(Relay_log_info *rli, ulong i) err: if (error && w) { + if (w->current_mts_submode) + delete w->current_mts_submode; delete w; /* Any failure after array inserted must follow with deletion @@ -6981,10 +6983,8 @@ static void slave_stop_workers(Relay_log_info *rli, bool *mts_inited) if (!*mts_inited) return; - else if (rli->slave_parallel_workers == 0) - goto end; - if (thd->killed == THD::NOT_KILLED) + if (rli->slave_parallel_workers != 0 && thd->killed == THD::NOT_KILLED) (void) mts_checkpoint_routine(rli, 0, false, true/*need_data_lock=true*/); // TODO:consider to propagate an error out of the function while (!rli->workers.empty()) @@ -7019,7 +7019,6 @@ static void slave_stop_workers(Relay_log_info *rli, bool *mts_inited) DBUG_ASSERT(rli->pending_jobs == 0); DBUG_ASSERT(rli->mts_pending_jobs_size == 0); -end: rli->mts_group_status= Relay_log_info::MTS_NOT_IN_GROUP; destroy_hash_workers(rli); delete rli->gaq;
[14 Apr 2017 9:04]
Laurynas Biveinis
Bug 82980 fix for 8.0.1
Attachment: bug82980-8.0.1.patch (application/octet-stream, text), 1.82 KiB.
[14 Jul 2017 14:28]
Margaret Fisher
Posted by developer: Changelog entry added for MySQL 8.0.3, 5.7.20, and 5.6.38: In case of a failure while creating multiple slave applier worker threads, some threads would be left orphaned and their resources would not be collected. Thanks to Laurynas Biveinis for his contribution to fixing this bug.