Bug #33815 | Improve reporting with duplicate server_id / server_uuid values | ||
---|---|---|---|
Submitted: | 11 Jan 2008 9:27 | Modified: | 24 Jul 2014 15:10 |
Reporter: | Simon Mudd (OCA) | Email Updates: | |
Status: | Duplicate | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S4 (Feature request) |
Version: | 5.0.52, 5.1.17, 5.6.14 | OS: | Linux (RHEL4) |
Assigned to: | Assigned Account | CPU Architecture: | Any |
Tags: | Contribution, qc, replication, server-id |
[11 Jan 2008 9:27]
Simon Mudd
[4 Feb 2008 18:16]
Susanne Ebrecht
Many thanks for writing a feature request.
[10 Jul 2008 0:13]
James Day
Setting bug #28795 as a duplicate of this one. That has the following feature request that's already part of this bug report: "retry in larger intervals. Currently it is retrying as soon as possible in the default-configuration."
[25 Jul 2009 20:31]
Mike Calandra
FYI, I was just burned by this problem. As the original author states, the error message is very unhelpful. Thanks for considering this enhancement.
[26 Jul 2009 8:14]
Simon Mudd
This ticket has been open for over a year. Is there any plan for this suggested fix to be implemented? I reported it in 5.0.52. We're currently running a newer version of 5.0 and starting to look at 5.1, but I see the same problem still exists. The code change of at least reporting the error in the master logs should be pretty trivial as it deliberately disconnects from the previously running slave. Some of the other suggestions possibly require more work. I wonder if any attention is likely to be given to this problem?
[26 Jul 2009 8:46]
Simon Mudd
As a suggestion (not tested) the following patch would seem to work against 5.1.31. This at least gives more information to the user that something is wrong. diff --git a/sql/sql_repl.cc b/sql/sql_repl.cc index 298d810..b715750 100644 --- a/sql/sql_repl.cc +++ b/sql/sql_repl.cc @@ -1110,6 +1110,17 @@ void kill_zombie_dump_threads(uint32 slave_server_id) it will be slow because it will iterate through the list again. We just to do kill the thread ourselves. */ + + /* + #33815 - Suggested logging: Duplicate slave connection with server-id 12345 from rep_user@1.2.3.4, dropping exiting connection from other_user@5.6.7.8 + Below is not complete but better than nothing. + */ + sql_print_information("Duplicate slave connection with server-id %d, dropping existing connection from '%s'@'%s'", + slave_server_id, + tmp->main_security_ctx.user, + tmp->main_security_ctx.host_or_ip + ); + tmp->awake(THD::KILL_QUERY); pthread_mutex_unlock(&tmp->LOCK_delete); } Perhaps someone could see if this is workable at least as a way of enabling a warning message to appear on the server?
[28 Jul 2009 22:22]
Simon Mudd
I should actually try and compile this but guess this is close to being what is needed. commit 4ec8446496d8461ef2e7ab4cb15793b5446ca6ee Author: Simon J Mudd <sjmudd@pobox.com> Date: Tue Jul 28 17:43:44 2009 +0200 fix logging diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc index 5247f19..af2b0d9 100644 --- a/sql/sql_parse.cc +++ b/sql/sql_parse.cc @@ -1400,7 +1400,7 @@ bool dispatch_command(enum enum_server_command command, THD *thd, flags = uint2korr(packet + 4); thd->server_id=0; /* avoid suicide */ if ((slave_server_id= uint4korr(packet+6))) // mysqlbinlog.server_id==0 - kill_zombie_dump_threads(slave_server_id); + kill_zombie_dump_threads(thd, slave_server_id); thd->server_id = slave_server_id; general_log_print(thd, command, "Log: '%s' Pos: %ld", packet+10, diff --git a/sql/sql_repl.cc b/sql/sql_repl.cc index 298d810..5c1e37b 100644 --- a/sql/sql_repl.cc +++ b/sql/sql_repl.cc @@ -1082,12 +1082,13 @@ err: SYNOPSIS kill_zombie_dump_threads() + thd the new thread that is triggering killing other threads. slave_server_id the slave's server id */ -void kill_zombie_dump_threads(uint32 slave_server_id) +void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id) { pthread_mutex_lock(&LOCK_thread_count); I_List_iterator<THD> it(threads); @@ -1110,6 +1111,18 @@ void kill_zombie_dump_threads(uint32 slave_server_id) it will be slow because it will iterate through the list again. We just to do kill the thread ourselves. */ + + /* + #33815 - Suggested logging: Duplicate slave connection with server-id 12345 from rep_user@1.2.3.4, dropping exiting connection from other_user@5.6.7.8 + */ + sql_print_information("Duplicate slave connection with server-id %d from '%s'@'%s', dropping existing connection from '%s'@'%s'", + slave_server_id, + thd->main_security_ctx.user, + thd->main_security_ctx.host_or_ip + tmp->main_security_ctx.user, + tmp->main_security_ctx.host_or_ip + ); + tmp->awake(THD::KILL_QUERY); pthread_mutex_unlock(&tmp->LOCK_delete); } diff --git a/sql/sql_repl.h b/sql/sql_repl.h index b47c1a4..248fd42 100644 --- a/sql/sql_repl.h +++ b/sql/sql_repl.h @@ -51,7 +51,7 @@ bool log_in_use(const char* log_name); void adjust_linfo_offsets(my_off_t purge_offset); bool show_binlogs(THD* thd); extern int init_master_info(Master_info* mi); -void kill_zombie_dump_threads(uint32 slave_server_id); +void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id); int check_binlog_magic(IO_CACHE* log, const char** errmsg); typedef struct st_load_file_info
[29 Jul 2009 9:50]
Simon Mudd
The patch had a minor typo: commit 4ec8446496d8461ef2e7ab4cb15793b5446ca6ee Author: Simon J Mudd <sjmudd@pobox.com> Date: Tue Jul 28 17:43:44 2009 +0200 fix logging diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc index 5247f19..af2b0d9 100644 --- a/sql/sql_parse.cc +++ b/sql/sql_parse.cc @@ -1400,7 +1400,7 @@ bool dispatch_command(enum enum_server_command command, THD *thd, flags = uint2korr(packet + 4); thd->server_id=0; /* avoid suicide */ if ((slave_server_id= uint4korr(packet+6))) // mysqlbinlog.server_id==0 - kill_zombie_dump_threads(slave_server_id); + kill_zombie_dump_threads(thd, slave_server_id); thd->server_id = slave_server_id; general_log_print(thd, command, "Log: '%s' Pos: %ld", packet+10, diff --git a/sql/sql_repl.cc b/sql/sql_repl.cc index 298d810..5c1e37b 100644 --- a/sql/sql_repl.cc +++ b/sql/sql_repl.cc @@ -1082,12 +1082,13 @@ err: SYNOPSIS kill_zombie_dump_threads() + thd the new thread that is triggering killing other threads. slave_server_id the slave's server id */ -void kill_zombie_dump_threads(uint32 slave_server_id) +void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id) { pthread_mutex_lock(&LOCK_thread_count); I_List_iterator<THD> it(threads); @@ -1110,6 +1111,18 @@ void kill_zombie_dump_threads(uint32 slave_server_id) it will be slow because it will iterate through the list again. We just to do kill the thread ourselves. */ + + /* + #33815 - Suggested logging: Duplicate slave connection with server-id 12345 from rep_user@1.2.3.4, dropping exiting connection from other_user@5.6.7.8 + */ + sql_print_information("Duplicate slave connection with server-id %d from '%s'@'%s', dropping existing connection from '%s'@'%s'", + slave_server_id, + thd->main_security_ctx.user, + thd->main_security_ctx.host_or_ip, + tmp->main_security_ctx.user, + tmp->main_security_ctx.host_or_ip + ); + tmp->awake(THD::KILL_QUERY); pthread_mutex_unlock(&tmp->LOCK_delete); } diff --git a/sql/sql_repl.h b/sql/sql_repl.h index b47c1a4..248fd42 100644 --- a/sql/sql_repl.h +++ b/sql/sql_repl.h @@ -51,7 +51,7 @@ bool log_in_use(const char* log_name); void adjust_linfo_offsets(my_off_t purge_offset); bool show_binlogs(THD* thd); extern int init_master_info(Master_info* mi); -void kill_zombie_dump_threads(uint32 slave_server_id); +void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id); int check_binlog_magic(IO_CACHE* log, const char** errmsg); typedef struct st_load_file_info I have applied against 5.1.36 and this patches cleanly. Then built a binary distribution and tested using the mysql sandbox: [sjmudd@mad07 mysql-5.1.36]$ make_replication_sandbox mysql-5.1.36-osx10.5-i386.tar.gz installing and starting master installing slave 1 installing slave 2 starting slave 1 . sandbox server started starting slave 2 . sandbox server started initializing slave 1 initializing slave 2 replication directory installed in $HOME/sandboxes/rsandbox_5_1_36 [sjmudd@mad07 mysql-5.1.36]$ cd /Users/sjmudd/sandboxes/ [sjmudd@mad07 sandboxes]$ ls -la total 56 drwxr-xr-x 11 sjmudd staff 374 29 Jul 11:33 . drwxr-xr-x+ 82 sjmudd staff 2788 29 Jul 11:29 .. -rwxr-xr-x 1 sjmudd staff 62 29 Jul 11:33 clear_all -rwxr-xr-x 1 sjmudd staff 64 29 Jul 11:33 restart_all drwxr-xr-x 2 sjmudd staff 68 29 Jul 09:42 rsandbox_5_0_67 drwxr-xr-x 16 sjmudd staff 544 29 Jul 11:33 rsandbox_5_1_36 -rwxr-xr-x 1 sjmudd staff 2112 29 Jul 11:33 sandbox_action -rwxr-xr-x 1 sjmudd staff 66 29 Jul 11:33 send_kill_all -rwxr-xr-x 1 sjmudd staff 62 29 Jul 11:33 start_all -rwxr-xr-x 1 sjmudd staff 61 29 Jul 11:33 stop_all -rwxr-xr-x 1 sjmudd staff 60 29 Jul 11:33 use_all clear_all restart_all rsandbox_5_0_67 rsandbox_5_1_36 sandbox_action send_kill_all start_all stop_all use_all [sjmudd@mad07 sandboxes]$ cd rsandbox_5_1_36/ [sjmudd@mad07 rsandbox_5_1_36]$ ls -la total 88 drwxr-xr-x 16 sjmudd staff 544 29 Jul 11:33 . drwxr-xr-x 11 sjmudd staff 374 29 Jul 11:33 .. -rwxr-xr-x 1 sjmudd staff 158 29 Jul 11:33 check_slaves -rwxr-xr-x 1 sjmudd staff 357 29 Jul 11:33 clear_all -rwxr-xr-x 1 sjmudd staff 593 29 Jul 11:33 initialize_slaves -rwxr-xr-x 1 sjmudd staff 66 29 Jul 11:33 m drwxr-xr-x 17 sjmudd staff 578 29 Jul 11:33 master drwxr-xr-x 17 sjmudd staff 578 29 Jul 11:33 node1 drwxr-xr-x 17 sjmudd staff 578 29 Jul 11:33 node2 -rwxr-xr-x 1 sjmudd staff 114 29 Jul 11:33 restart_all -rwxr-xr-x 1 sjmudd staff 55 29 Jul 11:33 s1 -rwxr-xr-x 1 sjmudd staff 55 29 Jul 11:33 s2 -rwxr-xr-x 1 sjmudd staff 313 29 Jul 11:33 send_kill_all -rwxr-xr-x 1 sjmudd staff 502 29 Jul 11:33 start_all -rwxr-xr-x 1 sjmudd staff 283 29 Jul 11:33 stop_all -rwxr-xr-x 1 sjmudd staff 299 29 Jul 11:33 use_all [sjmudd@mad07 rsandbox_5_1_36]$ vi node1/my.sandbox.cnf # make the server-id the same [sjmudd@mad07 rsandbox_5_1_36]$ vi node2/my.sandbox.cnf # make the server-id the same [sjmudd@mad07 rsandbox_5_1_36]$ ./restart_all executing "stop" on slave 1 executing "stop" on slave 2 executing "stop" on master executing "start" on master . sandbox server started executing "start" on slave 1 . sandbox server started executing "start" on slave 2 . sandbox server started [sjmudd@mad07 rsandbox_5_1_36]$ tail -f master/data/msandbox.err shows: 090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:47 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' 090729 11:43:47 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost' Unfortunately as both slaves are from the same server you can't see the switch. However it's clear that the problem is now being reported properly. Please could you consider applying this patch to 5.0/5.1 and later... Thanks
[26 Aug 2009 8:42]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/81575 3078 Andrei Elkin 2009-08-26 Bug #33815 Improve reporting with duplicate server-ids The report designate actually a serious issue: if two slaves happen to have the same server id then: 1. The first one's dump thread session will be killed 2. Neither of them in the following could establish connection ever until manual action. Fixed with deploying an unique signature to the slave-to-master session that complements server_id. The signature is caclulated on the slave at its replication module start-up. It's passed to the master at `start slave'. In the following master verifies the signature and kills only those zombi dump threads that have an existing signature more additionally to equality of the server_id. If server_id:s are the same but not the signature then a new connection attempt is rejected and the existing dump thread remains to operate undistirbingly. modified: sql/rpl_mi.cc sql/rpl_mi.h sql/share/errmsg.txt sql/slave.cc sql/sql_parse.cc sql/sql_repl.cc sql/sql_repl.h
[26 Aug 2009 8:43]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/81576 3079 Andrei Elkin 2009-08-26 Bug #33815 Improve reporting with duplicate server-ids test case is added added: mysql-test/suite/rpl/r/rpl_server_id_duplicate.result mysql-test/suite/rpl/t/rpl_server_id_duplicate-master.opt mysql-test/suite/rpl/t/rpl_server_id_duplicate-slave.opt mysql-test/suite/rpl/t/rpl_server_id_duplicate.test modified: sql/sql_class.cc support-files/build-tags
[28 Aug 2009 13:23]
Andrei Elkin
The suggested patch has a flaw in that the new connection from the restarted slave *server* (not IO thread) comes in with a new uuid signature, and that does not help identifying a possible remaining zombi. There are two (partly overlapping) alternatives: 1. to make the uuid persistent, be generated at the slave server first time deploying and span all the slave server life time. 2. to change the current zombi killing by the new reconnecting, to leave zombi dump thread to exit eventually on its own. If master supports HeartBeat (it does in 5.4) exiting time is proportional
[28 Aug 2009 13:25]
Andrei Elkin
The suggested patch has a flaw in that the new connection from the restarted slave *server* (not IO thread) comes in with a new uuid signature, and that does not help identifying a possible remaining zombi. There are two (partly overlapping) alternatives: 1. to make the uuid persistent, be generated at the slave server first time deploying and span all the slave server life time. 2. to change the current zombi killing by the new reconnecting, to leave zombi dump thread to exit eventually on its own. If master supports HeartBeat (it does in 5.4) exiting time is of order of the HB period. Effectively, a zombi would leave in a few HB periods in a case of binlog is not updated (the motivating use case for the current zombi killing alg).
[5 Oct 2010 13:07]
Andrei Elkin
WL#4677 implemented a suggested in this framework solution.
[18 Dec 2013 16:46]
Simon Mudd
This should I think be revisited. I see a reference to WL#4677 but no actual reference to a working GA version of MySQL where the problem has been resolved. I have been bitten recently again by this, this time with the equivalent 5.6 server_uuid value being used on 2 servers. Again: no logging on the master and the slave shows no clear indication of the problem, so it's not apparent what's going on. Given we now have server_id duplicate value issues and MySQL 5.6 server_uuid duplicate value issues, it would seem really helpful if when this happens something USEFUL is logged somewhere. That does not seem to be happening now and while the symptoms, once you have seen them may be recognisable, the fact no server actually tells you what the problem is can be really frustrating. So hoping that despite the various bug reports which mention this issue bug#16927, bug#28795 having comments about WL#4677 and "the fix" I don't see it, at least on the 5.6.14 servers I'm looking at. note: I've updated the Synopsis to take into account the MySQL 5.6 specific variable but I don't think there's a difference in the underlying issue of lack of transparency when this happens.
[24 Jul 2014 15:10]
Simon Mudd
Bitten again during GTID testing 6 years later. Oh well. (this time due to duplicate server_uuid settings) Related to bug#72578
[13 Aug 2014 7:51]
Andrei Elkin
bug#72578 must be a parent for this one and its patch is already in 5.6.