MySQL Bugs: #33815: Improve reporting with duplicate server_id / server

Bug #33815	Improve reporting with duplicate server_id / server_uuid values
Submitted:	11 Jan 2008 9:27	Modified:	24 Jul 2014 15:10
Reporter:	Simon Mudd (OCA)	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S4 (Feature request)
Version:	5.0.52, 5.1.17, 5.6.14	OS:	Linux (RHEL4)
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	Contribution, qc, replication, server-id

Description:
Slaves should NOT use the same server-id when connecting to a master. However if you do this (by mistake) in a big network of MySQL servers the error reporting is difficult to immediately diagnose:

1. Nothing reported on the master.
2. The slave complains about "apparent master shutdown" (which is wrong and misleading)

080110 15:29:22 [Note] Slave: received end packet from server, apparent master shutdown:
080110 15:29:22 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binlog.000016' position 887506728
080110 15:29:22 [Note] Slave: connected to master 'rep_user@mysql-master:3306',replication resumed in log 'binlog.000016' at position 887506728
080110 15:29:22 [Note] Slave: received end packet from server, apparent master shutdown:
080110 15:29:22 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binlog.000016' position 887506728
080110 15:29:22 [Note] Slave: connected to master 'rep_user@mysql-master:3306',replication resumed in log 'binlog.000016' at position 887506728
080110 15:29:22 [Note] Slave: received end packet from server, apparent master shutdown:
080110 15:29:22 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binlog.000016' position 887506728
080110 15:29:22 [Note] Slave: connected to master 'rep_user@mysql-master:3306',replication resumed in log 'binlog.000016' at position 887506728
080110 15:29:22 [Note] Slave: received end packet from server, apparent master shutdown:
080110 15:29:22 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binlog.000016' position 887506728
080110 15:29:22 [Note] Slave: connected to master 'rep_user@mysql-master:3306',replication resumed in log 'binlog.000016' at position 887506728

How to repeat:
1. have a busy replicating master/slave database.
2. start up a new slave with the same server-id as an existing slave.

Suggested fix:
My suggestion would be (if possible) to:
1. make the master tell the slave what the problem is.
2. make the master log the problem:

080110 15:29:22 Duplicate slave connection with server-id 12345 from rep_user@1.2.3.4, dropping exiting connection from 5.6.7.8

3. LIMIT the number of messages this error can generate in the master log (similar to syslog)

080115 15:29:27 Duplicate slave connection with server-id 12345 (repeated 100 times), dropping active/existing connection

4. Perhaps make the slave "backoff" a little before retrying again. (pause 30 seconds?). While this will affect "performance" this type of behaviour should NOT take place and you don't want to overload the master.

Many thanks for writing a feature request.

Setting bug #28795 as a duplicate of this one. That has the following feature request that's already part of this bug report:

"retry in larger intervals. Currently it is retrying as soon as possible in the
default-configuration."

FYI, I was just burned by this problem.  As the original author states, the error message is very unhelpful.  Thanks for considering this enhancement.

This ticket has been open for over a year. Is there any plan for this suggested fix to be implemented? I reported it in 5.0.52. We're currently running a newer version of 5.0 and starting to look at 5.1, but I see the same problem still exists. The code change of at least reporting the error in the master logs should be pretty trivial as it deliberately disconnects from the previously running slave. Some of the other suggestions possibly require more work.

I wonder if any attention is likely to be given to this problem?

As a suggestion (not tested) the following patch would seem to work against 5.1.31. This at least gives more information to the user that something is wrong.

diff --git a/sql/sql_repl.cc b/sql/sql_repl.cc
index 298d810..b715750 100644
--- a/sql/sql_repl.cc
+++ b/sql/sql_repl.cc
@@ -1110,6 +1110,17 @@ void kill_zombie_dump_threads(uint32 slave_server_id)
       it will be slow because it will iterate through the list
       again. We just to do kill the thread ourselves.
     */
+
+    /*
+      #33815 - Suggested logging: Duplicate slave connection with server-id 12345 from rep_user@1.2.3.4, dropping exiting connection from other_user@5.6.7.8
+      Below is not complete but better than nothing.
+    */
+    sql_print_information("Duplicate slave connection with server-id %d, dropping existing connection from '%s'@'%s'",
+        slave_server_id,
+        tmp->main_security_ctx.user,
+        tmp->main_security_ctx.host_or_ip
+    );
+
     tmp->awake(THD::KILL_QUERY);
     pthread_mutex_unlock(&tmp->LOCK_delete);
   }

Perhaps someone could see if this is workable at least as a way of enabling a warning message to appear on the server?

I should actually try and compile this but guess this is close to being what is needed.

commit 4ec8446496d8461ef2e7ab4cb15793b5446ca6ee
Author: Simon J Mudd <sjmudd@pobox.com>
Date:   Tue Jul 28 17:43:44 2009 +0200

    fix logging

diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
index 5247f19..af2b0d9 100644
--- a/sql/sql_parse.cc
+++ b/sql/sql_parse.cc
@@ -1400,7 +1400,7 @@ bool dispatch_command(enum enum_server_command command, THD *thd,
       flags = uint2korr(packet + 4);
       thd->server_id=0; /* avoid suicide */
       if ((slave_server_id= uint4korr(packet+6))) // mysqlbinlog.server_id==0
-	kill_zombie_dump_threads(slave_server_id);
+	kill_zombie_dump_threads(thd, slave_server_id);
       thd->server_id = slave_server_id;
 
       general_log_print(thd, command, "Log: '%s'  Pos: %ld", packet+10,
diff --git a/sql/sql_repl.cc b/sql/sql_repl.cc
index 298d810..5c1e37b 100644
--- a/sql/sql_repl.cc
+++ b/sql/sql_repl.cc
@@ -1082,12 +1082,13 @@ err:
 
   SYNOPSIS
     kill_zombie_dump_threads()
+    thd                 the new thread that is triggering killing other threads.
     slave_server_id     the slave's server id
 
 */
 
 
-void kill_zombie_dump_threads(uint32 slave_server_id)
+void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id)
 {
   pthread_mutex_lock(&LOCK_thread_count);
   I_List_iterator<THD> it(threads);
@@ -1110,6 +1111,18 @@ void kill_zombie_dump_threads(uint32 slave_server_id)
       it will be slow because it will iterate through the list
       again. We just to do kill the thread ourselves.
     */
+
+    /*
+      #33815 - Suggested logging: Duplicate slave connection with server-id 12345 from rep_user@1.2.3.4, dropping exiting connection from other_user@5.6.7.8
+    */
+    sql_print_information("Duplicate slave connection with server-id %d from '%s'@'%s', dropping existing connection from '%s'@'%s'",
+        slave_server_id,
+        thd->main_security_ctx.user,
+        thd->main_security_ctx.host_or_ip
+        tmp->main_security_ctx.user,
+        tmp->main_security_ctx.host_or_ip
+    );
+
     tmp->awake(THD::KILL_QUERY);
     pthread_mutex_unlock(&tmp->LOCK_delete);
   }
diff --git a/sql/sql_repl.h b/sql/sql_repl.h
index b47c1a4..248fd42 100644
--- a/sql/sql_repl.h
+++ b/sql/sql_repl.h
@@ -51,7 +51,7 @@ bool log_in_use(const char* log_name);
 void adjust_linfo_offsets(my_off_t purge_offset);
 bool show_binlogs(THD* thd);
 extern int init_master_info(Master_info* mi);
-void kill_zombie_dump_threads(uint32 slave_server_id);
+void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id);
 int check_binlog_magic(IO_CACHE* log, const char** errmsg);
 
 typedef struct st_load_file_info

The patch had a minor typo:

commit 4ec8446496d8461ef2e7ab4cb15793b5446ca6ee
Author: Simon J Mudd <sjmudd@pobox.com>
Date:   Tue Jul 28 17:43:44 2009 +0200

    fix logging

diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
index 5247f19..af2b0d9 100644
--- a/sql/sql_parse.cc
+++ b/sql/sql_parse.cc
@@ -1400,7 +1400,7 @@ bool dispatch_command(enum enum_server_command command, THD *thd,
       flags = uint2korr(packet + 4);
       thd->server_id=0; /* avoid suicide */
       if ((slave_server_id= uint4korr(packet+6))) // mysqlbinlog.server_id==0
-	kill_zombie_dump_threads(slave_server_id);
+	kill_zombie_dump_threads(thd, slave_server_id);
       thd->server_id = slave_server_id;
 
       general_log_print(thd, command, "Log: '%s'  Pos: %ld", packet+10,
diff --git a/sql/sql_repl.cc b/sql/sql_repl.cc
index 298d810..5c1e37b 100644
--- a/sql/sql_repl.cc
+++ b/sql/sql_repl.cc
@@ -1082,12 +1082,13 @@ err:
 
   SYNOPSIS
     kill_zombie_dump_threads()
+    thd                 the new thread that is triggering killing other threads.
     slave_server_id     the slave's server id
 
 */
 
 
-void kill_zombie_dump_threads(uint32 slave_server_id)
+void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id)
 {
   pthread_mutex_lock(&LOCK_thread_count);
   I_List_iterator<THD> it(threads);
@@ -1110,6 +1111,18 @@ void kill_zombie_dump_threads(uint32 slave_server_id)
       it will be slow because it will iterate through the list
       again. We just to do kill the thread ourselves.
     */
+
+    /*
+      #33815 - Suggested logging: Duplicate slave connection with server-id 12345 from rep_user@1.2.3.4, dropping exiting connection from other_user@5.6.7.8
+    */
+    sql_print_information("Duplicate slave connection with server-id %d from '%s'@'%s', dropping existing connection from '%s'@'%s'",
+        slave_server_id,
+        thd->main_security_ctx.user,
+        thd->main_security_ctx.host_or_ip,
+        tmp->main_security_ctx.user,
+        tmp->main_security_ctx.host_or_ip
+    );
+
     tmp->awake(THD::KILL_QUERY);
     pthread_mutex_unlock(&tmp->LOCK_delete);
   }
diff --git a/sql/sql_repl.h b/sql/sql_repl.h
index b47c1a4..248fd42 100644
--- a/sql/sql_repl.h
+++ b/sql/sql_repl.h
@@ -51,7 +51,7 @@ bool log_in_use(const char* log_name);
 void adjust_linfo_offsets(my_off_t purge_offset);
 bool show_binlogs(THD* thd);
 extern int init_master_info(Master_info* mi);
-void kill_zombie_dump_threads(uint32 slave_server_id);
+void kill_zombie_dump_threads(THD *thd, uint32 slave_server_id);
 int check_binlog_magic(IO_CACHE* log, const char** errmsg);
 
 typedef struct st_load_file_info

I have applied against 5.1.36 and this patches cleanly. Then built a binary distribution and tested using the mysql sandbox:

[sjmudd@mad07 mysql-5.1.36]$ make_replication_sandbox mysql-5.1.36-osx10.5-i386.tar.gz 
installing and starting master
installing slave 1
installing slave 2
starting slave 1
. sandbox server started
starting slave 2
. sandbox server started
initializing slave 1
initializing slave 2
replication directory installed in $HOME/sandboxes/rsandbox_5_1_36
[sjmudd@mad07 mysql-5.1.36]$ cd /Users/sjmudd/sandboxes/
[sjmudd@mad07 sandboxes]$ ls -la
total 56
drwxr-xr-x  11 sjmudd  staff   374 29 Jul 11:33 .
drwxr-xr-x+ 82 sjmudd  staff  2788 29 Jul 11:29 ..
-rwxr-xr-x   1 sjmudd  staff    62 29 Jul 11:33 clear_all
-rwxr-xr-x   1 sjmudd  staff    64 29 Jul 11:33 restart_all
drwxr-xr-x   2 sjmudd  staff    68 29 Jul 09:42 rsandbox_5_0_67
drwxr-xr-x  16 sjmudd  staff   544 29 Jul 11:33 rsandbox_5_1_36
-rwxr-xr-x   1 sjmudd  staff  2112 29 Jul 11:33 sandbox_action
-rwxr-xr-x   1 sjmudd  staff    66 29 Jul 11:33 send_kill_all
-rwxr-xr-x   1 sjmudd  staff    62 29 Jul 11:33 start_all
-rwxr-xr-x   1 sjmudd  staff    61 29 Jul 11:33 stop_all
-rwxr-xr-x   1 sjmudd  staff    60 29 Jul 11:33 use_all
clear_all	restart_all	rsandbox_5_0_67	rsandbox_5_1_36	sandbox_action	send_kill_all	start_all	stop_all	use_all
[sjmudd@mad07 sandboxes]$ cd rsandbox_5_1_36/
[sjmudd@mad07 rsandbox_5_1_36]$ ls -la
total 88
drwxr-xr-x  16 sjmudd  staff  544 29 Jul 11:33 .
drwxr-xr-x  11 sjmudd  staff  374 29 Jul 11:33 ..
-rwxr-xr-x   1 sjmudd  staff  158 29 Jul 11:33 check_slaves
-rwxr-xr-x   1 sjmudd  staff  357 29 Jul 11:33 clear_all
-rwxr-xr-x   1 sjmudd  staff  593 29 Jul 11:33 initialize_slaves
-rwxr-xr-x   1 sjmudd  staff   66 29 Jul 11:33 m
drwxr-xr-x  17 sjmudd  staff  578 29 Jul 11:33 master
drwxr-xr-x  17 sjmudd  staff  578 29 Jul 11:33 node1
drwxr-xr-x  17 sjmudd  staff  578 29 Jul 11:33 node2
-rwxr-xr-x   1 sjmudd  staff  114 29 Jul 11:33 restart_all
-rwxr-xr-x   1 sjmudd  staff   55 29 Jul 11:33 s1
-rwxr-xr-x   1 sjmudd  staff   55 29 Jul 11:33 s2
-rwxr-xr-x   1 sjmudd  staff  313 29 Jul 11:33 send_kill_all
-rwxr-xr-x   1 sjmudd  staff  502 29 Jul 11:33 start_all
-rwxr-xr-x   1 sjmudd  staff  283 29 Jul 11:33 stop_all
-rwxr-xr-x   1 sjmudd  staff  299 29 Jul 11:33 use_all
[sjmudd@mad07 rsandbox_5_1_36]$ vi node1/my.sandbox.cnf # make the server-id the same
[sjmudd@mad07 rsandbox_5_1_36]$ vi node2/my.sandbox.cnf # make the server-id the same
[sjmudd@mad07 rsandbox_5_1_36]$ ./restart_all 
executing "stop" on slave 1
executing "stop" on slave 2
executing "stop" on master
executing "start" on master
. sandbox server started
executing "start" on slave 1
. sandbox server started
executing "start" on slave 2
. sandbox server started
[sjmudd@mad07 rsandbox_5_1_36]$ tail -f master/data/msandbox.err shows:

090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:44 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:45 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:46 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:47 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'
090729 11:43:47 [Note] Duplicate slave connection with server-id 10199 from 'msandbox'@'localhost', dropping existing connection from 'msandbox'@'localhost'

Unfortunately as both slaves are from the same server you can't see the switch.
However it's clear that the problem is now being reported properly.

Please could you consider applying this patch to 5.0/5.1 and later...

Thanks

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/81575

3078 Andrei Elkin	2009-08-26
      Bug #33815 Improve reporting with duplicate server-ids
      
      The report designate actually a serious issue:
      if two slaves happen to have the same server id then:
      
      1. The first one's dump thread session will be killed
      2. Neither of them in the following could establish connection ever until
      manual action.
      
      Fixed with
      deploying an unique signature to the slave-to-master session that complements
      server_id.
      The signature is caclulated on the slave at its replication module start-up.
      It's passed to the master at `start slave'.
      In the following master verifies the signature and kills only those zombi dump threads
      that have an existing signature more additionally to equality of the server_id.
      If server_id:s are the same but not the signature then a new connection attempt is rejected
      and the existing dump thread remains to operate undistirbingly.
      modified:
        sql/rpl_mi.cc
        sql/rpl_mi.h
        sql/share/errmsg.txt
        sql/slave.cc
        sql/sql_parse.cc
        sql/sql_repl.cc
        sql/sql_repl.h

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/81576

3079 Andrei Elkin	2009-08-26
      Bug #33815  Improve reporting with duplicate server-ids
      
      test case is added
      added:
        mysql-test/suite/rpl/r/rpl_server_id_duplicate.result
        mysql-test/suite/rpl/t/rpl_server_id_duplicate-master.opt
        mysql-test/suite/rpl/t/rpl_server_id_duplicate-slave.opt
        mysql-test/suite/rpl/t/rpl_server_id_duplicate.test
      modified:
        sql/sql_class.cc
        support-files/build-tags

The suggested patch has a flaw in that the new connection from the restarted
slave *server* (not IO thread) comes in with a new uuid signature, and that
does not help identifying a possible remaining zombi.
There are two (partly overlapping) alternatives:
1. to make the uuid persistent, be generated at the slave
   server first time deploying and span all the slave server life time.
2. to change the current zombi killing by the new reconnecting, to leave
   zombi dump thread to exit eventually on its own.
   If master supports HeartBeat (it does in 5.4) exiting time is proportional

The suggested patch has a flaw in that the new connection from the restarted
slave *server* (not IO thread) comes in with a new uuid signature, and that
does not help identifying a possible remaining zombi.
There are two (partly overlapping) alternatives:
1. to make the uuid persistent, be generated at the slave
   server first time deploying and span all the slave server life time.
2. to change the current zombi killing by the new reconnecting, to leave
   zombi dump thread to exit eventually on its own.
   If master supports HeartBeat (it does in 5.4) exiting time is of order
   of the HB period. Effectively, a zombi would leave in a few HB periods
   in a case of binlog is not updated (the motivating use case for the current
   zombi killing alg).

WL#4677 implemented a suggested in this framework solution.

This should I think be revisited.

I see a reference to WL#4677 but no actual reference to a working GA version of MySQL where the problem has been resolved.

I have been bitten recently again by this, this time with the equivalent 5.6 server_uuid value being used on 2 servers. Again: no logging on the master and the slave shows no clear indication of the problem, so it's not apparent what's going on.

Given we now have server_id duplicate value issues and MySQL 5.6 server_uuid duplicate value issues, it would seem really helpful if when this happens something USEFUL is logged somewhere. That does not seem to be happening now and while the symptoms, once you have seen them may be recognisable, the fact no server actually tells you what the problem is can be really frustrating.

So hoping that despite the various bug reports which mention this issue  bug#16927, bug#28795 having comments about WL#4677 and "the fix" I don't see it, at least on the 5.6.14 servers I'm looking at.

note: I've updated the Synopsis to take into account the MySQL 5.6 specific variable but I don't think there's a difference in the underlying issue of lack of transparency when this happens.

Bitten again during GTID testing 6 years later. Oh well.
(this time due to duplicate server_uuid settings)

Related to bug#72578

bug#72578 must be a parent for this one and its patch is already in 5.6.