| Bug #33815 | Improve reporting with duplicate server-ids | ||
|---|---|---|---|
| Submitted: | 11 Jan 2008 10:27 | Modified: | 28 Aug 2009 15:08 |
| Reporter: | Simon Mudd | ||
| Status: | In progress | ||
| Category: | Server: Replication | Severity: | S4 (Feature request) |
| Version: | 5.0.52 enterprise 5.1.17 | OS: | Linux (RHEL4) |
| Assigned to: | Andrei Elkin | Target Version: | |
| Tags: | server-id, qc, replication, Contribution | ||
| Triage: | Needs Triage: D5 (Feature request) | ||
[11 Jan 2008 10:27]
Simon Mudd
[4 Feb 2008 19:16]
Susanne Ebrecht
Many thanks for writing a feature request.
[10 Jul 2008 2:13]
James Day
Setting bug #28795 as a duplicate of this one. That has the following feature request that's already part of this bug report: "retry in larger intervals. Currently it is retrying as soon as possible in the default-configuration."
[25 Jul 2009 22:31]
Mike Calandra
FYI, I was just burned by this problem. As the original author states, the error message is very unhelpful. Thanks for considering this enhancement.
[26 Jul 2009 10:14]
Simon Mudd
This ticket has been open for over a year. Is there any plan for this suggested fix to be implemented? I reported it in 5.0.52. We're currently running a newer version of 5.0 and starting to look at 5.1, but I see the same problem still exists. The code change of at least reporting the error in the master logs should be pretty trivial as it deliberately disconnects from the previously running slave. Some of the other suggestions possibly require more work. I wonder if any attention is likely to be given to this problem?
[26 Jul 2009 10:46]
Simon Mudd
As a suggestion (not tested) the following patch would seem to work against 5.1.31. This
at least gives more information to the user that something is wrong.
diff --git a/sql/sql_repl.cc b/sql/sql_repl.cc
index 298d810..b715750 100644
--- a/sql/sql_repl.cc
+++ b/sql/sql_repl.cc
@@ -1110,6 +1110,17 @@ void kill_zombie_dump_threads(uint32 slave_server_id)
it will be slow because it will iterate through the list
again. We just to do kill the thread ourselves.
*/
+
+ /*
+ #33815 - Suggested logging: Duplicate slave connection with server-id 12345 from
rep_user@1.2.3.4, dropping exiting connection from other_user@5.6.7.8
+ Below is not complete but better than nothing.
+ */
+ sql_print_information("Duplicate slave connection with server-id %d, dropping
existing connection from '%s'@'%s'",
+ slave_server_id,
+ tmp->main_security_ctx.user,
+ tmp->main_security_ctx.host_or_ip
+ );
+
tmp->awake(THD::KILL_QUERY);
pthread_mutex_unlock(&tmp->LOCK_delete);
}
Perhaps someone could see if this is workable at least as a way of enabling a warning
message to appear on the server?
[29 Jul 2009 0:22]
Simon Mudd
I should actually try and compile this but guess this is close to being what is needed.
commit 4ec8446496d8461ef2e7ab4cb15793b5446ca6ee
Author: Simon J Mudd <sjmudd@pobox.com>
Date: Tue Jul 28 17:43:44 2009 +0200
fix logging
diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
index 5247f19..af2b0d9 100644
--- a/sql/sql_parse.cc
+++ b/sql/sql_parse.cc
@@ -1400,7 +1400,7 @@ bool dispatch_command(enum enum_server_command command, THD *thd,
flags = uint2korr(packet + 4);
thd->server_id=0; /* avoid suicide */
if ((slave_server_id= uint4korr(packet+6))) // mysqlbinlog.server_id==0
- kill_zombie_dump_threads(slave_server_id);
+ kill_zombie_dump_threads(thd, slave_server_id);
thd->server_id = slave_server_id;
general_log_print(thd, command, "Log: '%s' Pos: %ld", packet+10,
diff --git a/sql/sql_repl.cc b/sql/sql_repl.cc
index 298d810..5c1e37b 100644
--- a/sql/sql_repl.cc
+++ b/sql/sql_repl.cc
@@ -1082,12 +1082,13 @@ err:
SYNOPSIS
kill_zombie_dump_threads()
+ thd the new thread that is triggering killing other threads.
slave_server_id the slave's server id
*/
-void kill_zombie_dump_threads(uint32 slave_server_id)
+void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id)
{
pthread_mutex_lock(&LOCK_thread_count);
I_List_iterator<THD> it(threads);
@@ -1110,6 +1111,18 @@ void kill_zombie_dump_threads(uint32 slave_server_id)
it will be slow because it will iterate through the list
again. We just to do kill the thread ourselves.
*/
+
+ /*
+ #33815 - Suggested logging: Duplicate slave connection with server-id 12345 from
rep_user@1.2.3.4, dropping exiting connection from other_user@5.6.7.8
+ */
+ sql_print_information("Duplicate slave connection with server-id %d from '%s'@'%s',
dropping existing connection from '%s'@'%s'",
+ slave_server_id,
+ thd->main_security_ctx.user,
+ thd->main_security_ctx.host_or_ip
+ tmp->main_security_ctx.user,
+ tmp->main_security_ctx.host_or_ip
+ );
+
tmp->awake(THD::KILL_QUERY);
pthread_mutex_unlock(&tmp->LOCK_delete);
}
diff --git a/sql/sql_repl.h b/sql/sql_repl.h
index b47c1a4..248fd42 100644
--- a/sql/sql_repl.h
+++ b/sql/sql_repl.h
@@ -51,7 +51,7 @@ bool log_in_use(const char* log_name);
void adjust_linfo_offsets(my_off_t purge_offset);
bool show_binlogs(THD* thd);
extern int init_master_info(Master_info* mi);
-void kill_zombie_dump_threads(uint32 slave_server_id);
+void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id);
int check_binlog_magic(IO_CACHE* log, const char** errmsg);
typedef struct st_load_file_info
[29 Jul 2009 11:50]
Simon Mudd
The patch had a minor typo:
commit 4ec8446496d8461ef2e7ab4cb15793b5446ca6ee
Author: Simon J Mudd <sjmudd@pobox.com>
Date: Tue Jul 28 17:43:44 2009 +0200
fix logging
diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
index 5247f19..af2b0d9 100644
--- a/sql/sql_parse.cc
+++ b/sql/sql_parse.cc
@@ -1400,7 +1400,7 @@ bool dispatch_command(enum enum_server_command command, THD *thd,
flags = uint2korr(packet + 4);
thd->server_id=0; /* avoid suicide */
if ((slave_server_id= uint4korr(packet+6))) // mysqlbinlog.server_id==0
- kill_zombie_dump_threads(slave_server_id);
+ kill_zombie_dump_threads(thd, slave_server_id);
thd->server_id = slave_server_id;
general_log_print(thd, command, "Log: '%s' Pos: %ld", packet+10,
diff --git a/sql/sql_repl.cc b/sql/sql_repl.cc
index 298d810..5c1e37b 100644
--- a/sql/sql_repl.cc
+++ b/sql/sql_repl.cc
@@ -1082,12 +1082,13 @@ err:
SYNOPSIS
kill_zombie_dump_threads()
+ thd the new thread that is triggering killing other threads.
slave_server_id the slave's server id
*/
-void kill_zombie_dump_threads(uint32 slave_server_id)
+void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id)
{
pthread_mutex_lock(&LOCK_thread_count);
I_List_iterator<THD> it(threads);
@@ -1110,6 +1111,18 @@ void kill_zombie_dump_threads(uint32 slave_server_id)
it will be slow because it will iterate through the list
again. We just to do kill the thread ourselves.
*/
+
+ /*
+ #33815 - Suggested logging: Duplicate slave connection with server-id 12345 from
rep_user@1.2.3.4, dropping exiting connection from other_user@5.6.7.8
+ */
+ sql_print_information("Duplicate slave connection with server-id %d from '%s'@'%s',
dropping existing connection from '%s'@'%s'",
+ slave_server_id,
+ thd->main_security_ctx.user,
+ thd->main_security_ctx.host_or_ip,
+ tmp->main_security_ctx.user,
+ tmp->main_security_ctx.host_or_ip
+ );
+
tmp->awake(THD::KILL_QUERY);
pthread_mutex_unlock(&tmp->LOCK_delete);
}
diff --git a/sql/sql_repl.h b/sql/sql_repl.h
index b47c1a4..248fd42 100644
--- a/sql/sql_repl.h
+++ b/sql/sql_repl.h
@@ -51,7 +51,7 @@ bool log_in_use(const char* log_name);
void adjust_linfo_offsets(my_off_t purge_offset);
bool show_binlogs(THD* thd);
extern int init_master_info(Master_info* mi);
-void kill_zombie_dump_threads(uint32 slave_server_id);
+void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id);
int check_binlog_magic(IO_CACHE* log, const char** errmsg);
typedef struct st_load_file_info
I have applied against 5.1.36 and this patches cleanly. Then built a binary distribution
and tested using the mysql sandbox:
[sjmudd@mad07 mysql-5.1.36]$ make_replication_sandbox mysql-5.1.36-osx10.5-i386.tar.gz
installing and starting master
installing slave 1
installing slave 2
starting slave 1
. sandbox server started
starting slave 2
. sandbox server started
initializing slave 1
initializing slave 2
replication directory installed in $HOME/sandboxes/rsandbox_5_1_36
[sjmudd@mad07 mysql-5.1.36]$ cd /Users/sjmudd/sandboxes/
[sjmudd@mad07 sandboxes]$ ls -la
total 56
drwxr-xr-x 11 sjmudd staff 374 29 Jul 11:33 .
drwxr-xr-x+ 82 sjmudd staff 2788 29 Jul 11:29 ..
-rwxr-xr-x 1 sjmudd staff 62 29 Jul 11:33 clear_all
-rwxr-xr-x 1 sjmudd staff 64 29 Jul 11:33 restart_all
drwxr-xr-x 2 sjmudd staff 68 29 Jul 09:42 rsandbox_5_0_67
drwxr-xr-x 16 sjmudd staff 544 29 Jul 11:33 rsandbox_5_1_36
-rwxr-xr-x 1 sjmudd staff 2112 29 Jul 11:33 sandbox_action
-rwxr-xr-x 1 sjmudd staff 66 29 Jul 11:33 send_kill_all
-rwxr-xr-x 1 sjmudd staff 62 29 Jul 11:33 start_all
-rwxr-xr-x 1 sjmudd staff 61 29 Jul 11:33 stop_all
-rwxr-xr-x 1 sjmudd staff 60 29 Jul 11:33 use_all
clear_all restart_all rsandbox_5_0_67 rsandbox_5_1_36 sandbox_action send_kill_all start_all stop_all use_all
[sjmudd@mad07 sandboxes]$ cd rsandbox_5_1_36/
[sjmudd@mad07 rsandbox_5_1_36]$ ls -la
total 88
drwxr-xr-x 16 sjmudd staff 544 29 Jul 11:33 .
drwxr-xr-x 11 sjmudd staff 374 29 Jul 11:33 ..
-rwxr-xr-x 1 sjmudd staff 158 29 Jul 11:33 check_slaves
-rwxr-xr-x 1 sjmudd staff 357 29 Jul 11:33 clear_all
-rwxr-xr-x 1 sjmudd staff 593 29 Jul 11:33 initialize_slaves
-rwxr-xr-x 1 sjmudd staff 66 29 Jul 11:33 m
drwxr-xr-x 17 sjmudd staff 578 29 Jul 11:33 master
drwxr-xr-x 17 sjmudd staff 578 29 Jul 11:33 node1
drwxr-xr-x 17 sjmudd staff 578 29 Jul 11:33 node2
-rwxr-xr-x 1 sjmudd staff 114 29 Jul 11:33 restart_all
-rwxr-xr-x 1 sjmudd staff 55 29 Jul 11:33 s1
-rwxr-xr-x 1 sjmudd staff 55 29 Jul 11:33 s2
-rwxr-xr-x 1 sjmudd staff 313 29 Jul 11:33 send_kill_all
-rwxr-xr-x 1 sjmudd staff 502 29 Jul 11:33 start_all
-rwxr-xr-x 1 sjmudd staff 283 29 Jul 11:33 stop_all
-rwxr-xr-x 1 sjmudd staff 299 29 Jul 11:33 use_all
[sjmudd@mad07 rsandbox_5_1_36]$ vi node1/my.sandbox.cnf # make the server-id the same
[sjmudd@mad07 rsandbox_5_1_36]$ vi node2/my.sandbox.cnf # make the server-id the same
[sjmudd@mad07 rsandbox_5_1_36]$ ./restart_all
executing "stop" on slave 1
executing "stop" on slave 2
executing "stop" on master
executing "start" on master
. sandbox server started
executing "start" on slave 1
. sandbox server started
executing "start" on slave 2
. sandbox server started
[sjmudd@mad07 rsandbox_5_1_36]$ tail -f master/data/msandbox.err shows:
090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:47 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:47 [Note] Duplicate slave connection with server-id 10199 from
'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
Unfortunately as both slaves are from the same server you can't see the switch.
However it's clear that the problem is now being reported properly.
Please could you consider applying this patch to 5.0/5.1 and later...
Thanks
[26 Aug 2009 10:42]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/81575 3078 Andrei Elkin 2009-08-26 Bug #33815 Improve reporting with duplicate server-ids The report designate actually a serious issue: if two slaves happen to have the same server id then: 1. The first one's dump thread session will be killed 2. Neither of them in the following could establish connection ever until manual action. Fixed with deploying an unique signature to the slave-to-master session that complements server_id. The signature is caclulated on the slave at its replication module start-up. It's passed to the master at `start slave'. In the following master verifies the signature and kills only those zombi dump threads that have an existing signature more additionally to equality of the server_id. If server_id:s are the same but not the signature then a new connection attempt is rejected and the existing dump thread remains to operate undistirbingly. modified: sql/rpl_mi.cc sql/rpl_mi.h sql/share/errmsg.txt sql/slave.cc sql/sql_parse.cc sql/sql_repl.cc sql/sql_repl.h
[26 Aug 2009 10:43]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/81576 3079 Andrei Elkin 2009-08-26 Bug #33815 Improve reporting with duplicate server-ids test case is added added: mysql-test/suite/rpl/r/rpl_server_id_duplicate.result mysql-test/suite/rpl/t/rpl_server_id_duplicate-master.opt mysql-test/suite/rpl/t/rpl_server_id_duplicate-slave.opt mysql-test/suite/rpl/t/rpl_server_id_duplicate.test modified: sql/sql_class.cc support-files/build-tags
[28 Aug 2009 15:23]
Andrei Elkin
The suggested patch has a flaw in that the new connection from the restarted slave *server* (not IO thread) comes in with a new uuid signature, and that does not help identifying a possible remaining zombi. There are two (partly overlapping) alternatives: 1. to make the uuid persistent, be generated at the slave server first time deploying and span all the slave server life time. 2. to change the current zombi killing by the new reconnecting, to leave zombi dump thread to exit eventually on its own. If master supports HeartBeat (it does in 5.4) exiting time is proportional
[28 Aug 2009 15:25]
Andrei Elkin
The suggested patch has a flaw in that the new connection from the restarted slave *server* (not IO thread) comes in with a new uuid signature, and that does not help identifying a possible remaining zombi. There are two (partly overlapping) alternatives: 1. to make the uuid persistent, be generated at the slave server first time deploying and span all the slave server life time. 2. to change the current zombi killing by the new reconnecting, to leave zombi dump thread to exit eventually on its own. If master supports HeartBeat (it does in 5.4) exiting time is of order of the HB period. Effectively, a zombi would leave in a few HB periods in a case of binlog is not updated (the motivating use case for the current zombi killing alg).
