| Bug #82980 | Multi-threaded slave leaks worker threads in case of thread create failure | | |
|---|---|---|---|
| Submitted: | 14 Sep 2016 7:02 | Modified: | 14 Jul 2017 14:28 |
| Reporter: | Laurynas Biveinis (OCA) | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
| Version: | 5.6 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
| Tags: | error handling, MTS, replication | | |
[16 Sep 2016 9:30]
MySQL Verification Team
Hello Laurynas,

Thank you for the report.

Thanks,
Umesh
[16 Sep 2016 9:30]
MySQL Verification Team
-- 5.6.33
SHOW PROCESSLIST;
Id User Host db Command Time State Info
2 root localhost test Query 0 init SHOW PROCESSLIST
6 root localhost:41632 test Sleep 1 NULL
7 root localhost:41633 test Sleep 1 NULL
8 root localhost:41636 NULL Binlog Dump 1 Master has sent all binlog to slave; waiting for binlog to be updated NULL
main.rpl_bug82980 [ pass ] 1008
-- 5.7.15
SHOW PROCESSLIST;
Id User Host db Command Time State Info
3 root localhost test Query 0 starting SHOW PROCESSLIST
7 root localhost:42379 test Sleep 1 NULL
8 root localhost:42380 test Sleep 1 NULL
9 root localhost:42384 NULL Binlog Dump 1 Master has sent all binlog to slave; waiting for more updates NULL
main.rpl_bug82980 [ fail ] Found warnings/errors in server log file!
Test ended at 2016-09-16 10:42:02
line
2016-09-16T08:42:01.394622Z 9 [ERROR] Failed during slave worker thread creation for channel '' (errno= 0)
^ Found warnings in /export/umesh/server/binaries/GABuilds/mysql-advanced-5.7.15/mysql-test/var/log/mysqld.2.err
ok
- the logfile can be found in '/export/umesh/server/binaries/GABuilds/mysql-advanced-5.7.15/mysql-test/var/log/main.rpl_bug82980/rpl_bug82980.log'
[19 Sep 2016 10:55]
Laurynas Biveinis
Bug 82980 fix for 5.6

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: bug82980-5.6.patch (application/octet-stream, text), 822 bytes.
[19 Sep 2016 10:56]
Laurynas Biveinis
Bug 82980 fix for 5.7

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: bug82980-5.7.patch (application/octet-stream, text), 828 bytes.
[19 Sep 2016 10:56]
Laurynas Biveinis
Bug 82980 fix for 8.0

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: bug82980-8.0.patch (application/octet-stream, text), 842 bytes.
[21 Mar 2017 10:27]
NEHA KUMARI
Hi Laurynas,

I tried reproducing this bug on MySQL-5.6 but couldn't; it passes every time. With MySQL-5.7 I could reproduce the error with the test you provided. This was the error in mysqld.2.err:

2017-03-21T09:41:43.359887Z 10 [ERROR] Failed during slave worker thread creation for channel '' (errno= 0)
2017-03-21T09:41:43.359974Z 10 [ERROR] Slave SQL for channel '': Fatal error: Failed during slave workers initialization, Error_code: 1593

However, this error persists even with the patch you have submitted. I will try to look more closely at the errors.
[22 Mar 2017 4:26]
Laurynas Biveinis
Hello Neha, I will review this too and should get back to you soon.
[24 Mar 2017 9:35]
Laurynas Biveinis
My "SHOW PROCESSLIST" command added in the original report is wrong: it executes on the master instead of the slave. With the current public 5.6 (5.6.35), the reduced testcase shows the following PROCESSLIST on the slave at the end:

SHOW PROCESSLIST;
Id User Host db Command Time State Info
6 root localhost:55768 test Query 0 init SHOW PROCESSLIST
7 root localhost:55769 test Sleep 0 NULL
9 system user NULL Connect 0 Waiting for an event from Coordinator NULL

If the testcase is modified to comment out "SET @@global.debug= "d,mts_worker_thread_fails";", wait for the slave SQL error, and add an include/stop_slave.inc before the final include/start_slave.inc, then the slave processlist at the end becomes:

SHOW PROCESSLIST;
Id User Host db Command Time State Info
6 root localhost:55742 test Query 0 init SHOW PROCESSLIST
7 root localhost:55743 test Sleep 1 NULL

i.e. one thread fewer.
[24 Mar 2017 9:37]
Laurynas Biveinis
And with the patch applied, the reduced testcase processlist has two threads:

SHOW PROCESSLIST;
Id User Host db Command Time State Info
6 root localhost:55881 test Query 0 init SHOW PROCESSLIST
7 root localhost:55882 test Sleep 1 NULL

But now I suspect that the patch traded the thread leak for a memory leak; checking.
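The leak pattern suspected here can be reduced to a few lines of standalone C++. This is a hedged sketch, not the real server code: `Worker` and `Repository` are hypothetical stand-ins for `Slave_worker` and its `Rpl_info_table` repository, and the point is only that deleting the worker object on the error path is not enough if its init step allocated members the destructor does not own.

```cpp
#include <cstdlib>

// Hypothetical stand-in for the worker's info repository; the real classes
// live in sql/rpl_rli_pdb.h and sql/rpl_info_table.cc.
struct Repository {
  char *buf;
  Repository() : buf(static_cast<char *>(std::malloc(64))) {}
  ~Repository() { std::free(buf); }
};

struct Worker {
  Repository *repo = nullptr;  // allocated in init(), NOT freed by ~Worker()
  bool init() {
    repo = new Repository();
    return false;  // simulate the injected worker-thread creation failure
  }
};

// Error path resembling the first patch: the worker is deleted, but the
// repository allocated during init() is left behind -- the kind of
// allocation Valgrind later reports as "possibly lost".
int leaky_error_path() {
  Worker *w = new Worker;
  if (!w->init()) {
    delete w;  // w->repo leaks here
    return 1;
  }
  return 0;
}

// Corrected error path: release everything init() allocated first.
int clean_error_path() {
  Worker *w = new Worker;
  if (!w->init()) {
    delete w->repo;
    delete w;
    return 1;
  }
  return 0;
}
```

Both paths report the same failure to the caller; only the second one would come back clean under Valgrind.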
[24 Mar 2017 9:52]
Laurynas Biveinis
Indeed, the patch causes:
rpl.rpl_mts_debug 'mix' [ pass ] 54107
rpl.rpl_mts_debug 'row' [ skipped ] Doesn't support --binlog-format='row'
rpl.rpl_mts_debug 'stmt' [ skipped ] Doesn't support --binlog-format='statement'
worker[1] Valgrind report from /home/laurynas/obj-5.6-valgrind/mysql-test/var/log/mysqld.2.err after tests:
rpl.rpl_mts_debug
--------------------------------------------------------------------------
HEAP SUMMARY:
in use at exit: 68,054,661 bytes in 1,230 blocks
total heap usage: 68,147 allocs, 66,917 frees, 368,180,952 bytes allocated
6 bytes in 1 blocks are possibly lost in loss record 1 of 468
at 0x4C2CB3F: malloc (vg_replace_malloc.c:299)
by 0xA3184B: my_malloc (my_malloc.c:38)
by 0xA07841: Rpl_info_table::Rpl_info_table(unsigned int, char const*, char const*) (rpl_info_table.cc:28)
by 0x9FD4FD: Rpl_info_factory::init_repositories(Rpl_info_factory::struct_table_data, Rpl_info_factory::struct_file_data, unsigned int, unsigned int, Rpl_info_handler**, Rpl_info_handler**, char const**) (rpl_info_factory.cc:849)
by 0x9FC7D1: Rpl_info_factory::create_worker(unsigned int, unsigned int, Relay_log_info*, bool) (rpl_info_factory.cc:404)
by 0x9E5E29: slave_start_single_worker(Relay_log_info*, unsigned long) (rpl_slave.cc:5806)
by 0x9E6351: slave_start_workers(Relay_log_info*, unsigned long, bool*) (rpl_slave.cc:5932)
by 0x9E6D61: handle_slave_sql (rpl_slave.cc:6139)
by 0xE00DB0: pfs_spawn_thread (pfs.cc:1860)
by 0x547E6C9: start_thread (pthread_create.c:333)
by 0x60450AE: clone (clone.S:105)
...
131,072 bytes in 1 blocks are possibly lost in loss record 437 of 468
at 0x4C2CB3F: malloc (vg_replace_malloc.c:299)
by 0xA3184B: my_malloc (my_malloc.c:38)
by 0xA0B167: init_dynamic_array2 (array.c:70)
by 0x9FF5EF: Slave_worker::init_worker(Relay_log_info*, unsigned long) (rpl_rli_pdb.cc:282)
by 0x9E5E6F: slave_start_single_worker(Relay_log_info*, unsigned long) (rpl_slave.cc:5813)
by 0x9E6351: slave_start_workers(Relay_log_info*, unsigned long, bool*) (rpl_slave.cc:5932)
by 0x9E6D61: handle_slave_sql (rpl_slave.cc:6139)
by 0xE00DB0: pfs_spawn_thread (pfs.cc:1860)
by 0x547E6C9: start_thread (pthread_create.c:333)
by 0x60450AE: clone (clone.S:105)
LEAK SUMMARY:
definitely lost: 0 bytes in 0 blocks
indirectly lost: 0 bytes in 0 blocks
possibly lost: 146,792 bytes in 16 blocks
still reachable: 0 bytes in 0 blocks
of which reachable via heuristic:
newarray : 392 bytes in 1 blocks
suppressed: 67,907,869 bytes in 1,214 blocks
For counts of detected and suppressed errors, rerun with: -v
ERROR SUMMARY: 16 errors from 16 contexts (suppressed: 228 from 228)
valgrind_report [ fail ]
I will submit a fixed version once it passes review/testing here.
[30 Mar 2017 13:29]
Laurynas Biveinis
Here's an improved 5.6 patch. I am pasting it inline, as I cannot upload new contributions here due to the status of the existing ones.
diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc
index 4a2d1d4..0d9861c 100644
--- a/sql/rpl_slave.cc
+++ b/sql/rpl_slave.cc
@@ -5962,10 +5962,6 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited)
{
int i;
THD *thd= rli->info_thd;
- if (!*mts_inited)
- return;
- else if (rli->slave_parallel_workers == 0)
- goto end;
/*
If request for stop slave is received notify worker
@@ -6025,7 +6021,10 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited)
mysql_mutex_unlock(&w->jobs_lock);
}
- if (thd->killed == THD::NOT_KILLED)
+ if (!*mts_inited)
+ return;
+
+ if (rli->slave_parallel_workers != 0 && thd->killed == THD::NOT_KILLED)
(void) mts_checkpoint_routine(rli, 0, false, true/*need_data_lock=true*/); // TODO:consider to propagate an error out of the function
for (i= rli->workers.elements - 1; i >= 0; i--)
@@ -6035,7 +6034,7 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited)
delete_dynamic_element(&rli->workers, i);
delete w;
}
- if (log_warnings > 1)
+ if (rli->slave_parallel_workers != 0 && log_warnings > 1)
sql_print_information("Total MTS session statistics: "
"events processed = %llu; "
"worker queues filled over overrun level = %lu; "
@@ -6049,7 +6048,6 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited)
DBUG_ASSERT(rli->pending_jobs == 0);
DBUG_ASSERT(rli->mts_pending_jobs_size == 0);
-end:
rli->mts_group_status= Relay_log_info::MTS_NOT_IN_GROUP;
destroy_hash_workers(rli);
delete rli->gaq;
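The intent of the reordering in the patch above can be illustrated with a self-contained sketch. The types here are simplified and hypothetical (the real function operates on `Relay_log_info` and actual worker threads); the point is that the stop notification must reach every worker that was actually spawned, so the early exits move below that loop.

```cpp
#include <vector>

struct Worker {
  bool stop_requested = false;
};

// Simplified mirror of the patched slave_stop_workers() control flow.
// Before the patch, "if (!*mts_inited) return;" sat at the top, so a failed
// MTS initialization skipped the notification loop entirely and orphaned
// any worker threads that had already been created.
void stop_workers(std::vector<Worker> &workers, bool mts_inited,
                  unsigned long parallel_workers, bool killed) {
  // Always notify the workers that exist, even on the failed-init path.
  for (Worker &w : workers) w.stop_requested = true;

  if (!mts_inited) return;  // early exit moved after the notification loop

  if (parallel_workers != 0 && !killed) {
    // mts_checkpoint_routine() and the MTS session statistics would run
    // here; both are meaningful only when workers actually processed events.
  }

  // ... worker structures are then reaped and deleted ...
}
```

With this shape, the "leaked worker thread" from the original report gets its stop signal regardless of how far initialization got.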
[30 Mar 2017 13:29]
Laurynas Biveinis
5.7 version
diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc
index fd00f1a..3e9dcad 100644
--- a/sql/rpl_slave.cc
+++ b/sql/rpl_slave.cc
@@ -6770,6 +6770,8 @@ int slave_start_single_worker(Relay_log_info *rli, ulong i)
err:
if (error && w)
{
+ if (w->current_mts_submode)
+ delete w->current_mts_submode;
delete w;
/*
Any failure after array inserted must follow with deletion
@@ -6896,11 +6898,6 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited)
{
THD *thd= rli->info_thd;
- if (!*mts_inited)
- return;
- else if (rli->slave_parallel_workers == 0)
- goto end;
-
/*
If request for stop slave is received notify worker
to stop.
@@ -6985,7 +6982,10 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited)
mysql_mutex_unlock(&w->jobs_lock);
}
- if (thd->killed == THD::NOT_KILLED)
+ if (!*mts_inited)
+ return;
+
+ if (rli->slave_parallel_workers != 0 && thd->killed == THD::NOT_KILLED)
(void) mts_checkpoint_routine(rli, 0, false, true/*need_data_lock=true*/); // TODO:consider to propagate an error out of the function
while (!rli->workers.empty())
@@ -7020,7 +7020,6 @@ void slave_stop_workers(Relay_log_info *rli, bool *mts_inited)
DBUG_ASSERT(rli->pending_jobs == 0);
DBUG_ASSERT(rli->mts_pending_jobs_size == 0);
-end:
rli->mts_group_status= Relay_log_info::MTS_NOT_IN_GROUP;
destroy_hash_workers(rli);
delete rli->gaq;
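The extra `delete` in the 5.7 error path above can also be reduced to a standalone sketch. The types are hypothetical simplifications (the real member is `Slave_worker::current_mts_submode`), and the sketch assumes, as the patch implies, that the worker's destructor does not release the submode object on this early error path. Note that `delete` on a null pointer is a no-op in C++, so the null check in the patch is defensive rather than required.

```cpp
struct Mts_submode {};  // stand-in for the real MTS submode class

struct Worker {
  // Assumption mirrored from the patch: ~Worker() does not release this.
  Mts_submode *current_mts_submode = nullptr;
};

// Error path of slave_start_single_worker(), sketched: free the submode
// object the worker acquired before deleting the worker itself.
int error_path(Worker *w, bool error) {
  if (error && w) {
    delete w->current_mts_submode;  // safe even when still nullptr
    delete w;
    return 1;
  }
  return 0;
}
```

Without the submode `delete`, every injected creation failure would leave one `Mts_submode` allocation behind, in the same way the Valgrind report above shows leftover per-worker allocations.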
[30 Mar 2017 13:30]
Laurynas Biveinis
An 8.0.0 delta patch, to be applied on top of the previous 8.0.0 patch. It should be identical to the 5.7 version, give or take line offsets.
diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc
index a440efd..375a043 100644
--- a/sql/rpl_slave.cc
+++ b/sql/rpl_slave.cc
@@ -6767,6 +6767,8 @@ static int slave_start_single_worker(Relay_log_info *rli, ulong i)
err:
if (error && w)
{
+ if (w->current_mts_submode)
+ delete w->current_mts_submode;
delete w;
/*
Any failure after array inserted must follow with deletion
@@ -6981,10 +6983,8 @@ static void slave_stop_workers(Relay_log_info *rli, bool *mts_inited)
if (!*mts_inited)
return;
- else if (rli->slave_parallel_workers == 0)
- goto end;
- if (thd->killed == THD::NOT_KILLED)
+ if (rli->slave_parallel_workers != 0 && thd->killed == THD::NOT_KILLED)
(void) mts_checkpoint_routine(rli, 0, false, true/*need_data_lock=true*/); // TODO:consider to propagate an error out of the function
while (!rli->workers.empty())
@@ -7019,7 +7019,6 @@ static void slave_stop_workers(Relay_log_info *rli, bool *mts_inited)
DBUG_ASSERT(rli->pending_jobs == 0);
DBUG_ASSERT(rli->mts_pending_jobs_size == 0);
-end:
rli->mts_group_status= Relay_log_info::MTS_NOT_IN_GROUP;
destroy_hash_workers(rli);
delete rli->gaq;
[14 Apr 2017 9:04]
Laurynas Biveinis
Bug 82980 fix for 8.0.1
Attachment: bug82980-8.0.1.patch (application/octet-stream, text), 1.82 KiB.
[14 Jul 2017 14:28]
Margaret Fisher
Posted by developer:
Changelog entry added for MySQL 8.0.3, 5.7.20, and 5.6.38:
In case of a failure while creating multiple slave applier
worker threads, some threads would be left orphaned and their
resources would not be collected. Thanks to Laurynas Biveinis
for his contribution to fixing this bug.
Description:

If one adds SHOW PROCESSLIST to the end of rpl.mts_debug_slave, it shows an orphaned SQL worker thread that remains even though the slave is stopped. It is a thread created by the mts_worker_thread_fails failure-injection-point testcase, which tries to set up MTS with two worker threads and simulates a pthread_create failure for one of them.

How to repeat:

The testcase reduces to:

--source include/not_gtid_enabled.inc
--source include/master-slave.inc
--source include/have_debug.inc
--source include/have_innodb.inc

#
# Bug#12964649 MTS memory not released
#
# Originally reported when opt_slave_checkpoint_group is set to be greater
# than sizeof(BLOB) * 8 - 1 Workers initialization code fails.
# VALID_RANGE() for opt_slave_checkpoint_group is fixed in bug12979420.
#
# The current bug test simulates a failure to start by a Worker
# thread, checks the error, and verifies that all memory is released
# (valgrind proves, the user can watch top report).
#

connection slave;
call mtr.add_suppression('Error reading slave worker configuration');
call mtr.add_suppression('Failed during slave workers initialization');
call mtr.add_suppression('Failed during slave worker thread create');

source include/stop_slave.inc;
set @save.slave_checkpoint_group= @@global.slave_checkpoint_group;
set @save.slave_parallel_workers= @@global.slave_parallel_workers;

# The following assignment can work only with debug builds.
# It sets opt_slave_checkpoint_group to one more than its VALID_RANGE() maximum.
# set @@global.slave_checkpoint_group= 524280 + 1;
set @save.debug= @@global.debug;
SET @@global.debug= "d,mts_worker_thread_fails";
set @@global.slave_parallel_workers= 2;

start slave sql_thread;
--let $slave_sql_errno= 1593
source include/wait_for_slave_sql_error.inc;

set @@global.slave_checkpoint_group= @save.slave_checkpoint_group;
set @@global.slave_parallel_workers= @save.slave_parallel_workers;
set @@global.debug= @save.debug;
source include/start_slave.inc;
--source include/rpl_end.inc
SHOW PROCESSLIST;

Suggested fix:

Remove the early exits from slave_stop_workers?