MySQL Bugs: #72578: Duplicate slave server_uuid in replication forum gives no specific error message

Bug #72578	Duplicate slave server_uuid in replication forum gives no specific error message
Submitted:	8 May 2014 14:57	Modified:	21 Aug 2014 7:57
Reporter:	Hartmut Holzgraefe	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.6.17, 5.6.19	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
http://bugs.mysql.com/bug.php?id=16927 "Duplicate slave server-id gives no specific error message, poor relay log use" was closed as with the introduction of server_uuid it should not happen anymore.

Having two slaves with same server_id indeed doesn't matter anymore, but if you happen to end up with the same server_uuid, even with different server_id, e.g. as you created a new slave by cloning an existing slaves datadir and didn't know about the uuid in the auto.cnf file, so you only took care of changing the server_id parameter, you'll end up with the same 

  Slave: received end packet from server, apparent master shutdown: 
  Slave I/O thread: Failed reading log event, reconnecting to retry,...

cycle

How to repeat:
Create two slaves with same server_uuid (server_id may be different), see that only one can connect to the master while the other is receiving missleading

    received end packet from server, apparent master shutdown

log messages instead of pointing out the actual problem

Also on the master side you will see repeated messsages like this:

  Stop semi-sync binlog_dump to slave (server_id: 1003)
  Start semi-sync binlog_dump to slave (server_id: 1001), pos(...
  Start binlog_dump to master_thread_id(10755) slave_server(1003), pos(...

So no mention of the actual reason for terminating the connection either,
and bonus points for only showing the slave_id, not the slave_uuid (so in this case obfuscating the reason even further as the server_id is obviously different on the two slaves (1001, 1003))

Suggested fix:
Master may be able to provide some extra information in the end packet as to why it ended the connection? Or the error message could at least mention both "master shutdown" and "duplicate server_id/server_uuid" as possible causes? 

Maybe the slave could even be as clever as only reporting duplicate id/uuid as possible cause if it finds that the master connection is terminated right away, and if it receives an end packet later on it could still just mention "master shutdown" as only apparent reason?

And make the salve log server_uuid along with server_id, and print an extra line when closing a connection due to duplicate server_uuid?

It is actually not the connecting slave that gets rejected, but the existing connection is closed in favor of the new one ... which causes a steady flow of reconnects from the competing slaves with same server_uuid ... so my "only suggest duplicate IDs if connection fails right away" suggestion doesn't make sense ...

Separate but report about the wrong connection being closed will follow (this was also already mentioned in one of the bugs that were marked duplicate of 16927) ...

Hello Hartmut,

Thank you for the bug report.
Verified as described.

Thanks,
Umesh

// How to repeat

1. Setup simple replication (Master-Slave1)
2. Clone Slave2 from Slave1(keeping auto.cnf as is)
3. Bring up Slave2 and instruct Slave2 to connect Master

On master> HOW SLAVE HOSTS shows just one slave as connected(whichever connected first in Slave1/Slave2)

Suppose Slave1 already connected, check Slave2 error log which is flooded with
"[Note] Slave: received end packet from server, apparent master shutdown:"

Also, check master's error log - it is flooded with

2014-05-11 05:39:35 9407 [Note] Start binlog_dump to master_thread_id(56) slave_server(2), pos(master-bin.000003, 813)
2014-05-11 05:39:35 9407 [Note] Start binlog_dump to master_thread_id(57) slave_server(3), pos(master-bin.000003, 813)
2014-05-11 05:39:35 9407 [Note] Start binlog_dump to master_thread_id(58) slave_server(2), pos(master-bin.000003, 813)
2014-05-11 05:39:35 9407 [Note] Start binlog_dump to master_thread_id(59) slave_server(3), pos(master-bin.000003, 813)
2014-05-11 05:39:35 9407 [Note] Start binlog_dump to master_thread_id(60) slave_server(2), pos(master-bin.000003, 813)
2014-05-11 05:39:35 9407 [Note] Start binlog_dump to master_thread_id(61) slave_server(3), pos(master-bin.000003, 813)

## Work around I see is if we delete auto.cnf before bringing up Slave2

The following was added to the 5.6.21 and 5.7.5 change logs with commit 3981:

The error messages generated when a duplicate server UUID causes issues during replication have been improved. The slave error now identifies the duplicate server UUID and the master error identifies the zombie thread that has been killed.

Posted by developer:
 
The following was added to the 5.6.21 and 5.7.5 change logs with commit 3981:

The error messages generated when a duplicate server UUID causes issues during replication have been improved. The slave error now identifies the duplicate server UUID and the master error identifies the zombie thread that has been killed.

revno: 6082
committer: Venkatesh Duggirala<venkatesh.duggirala@oracle.com>
branch nick: mysql-5.6
timestamp: Thu 2014-07-17 16:56:55 +0530
message:
  Bug#18731211  Duplicate slave server_uuid in replication 
  forum gives no specific error message
  
  Problem & Analysis: 
  ===================
  When a new slave joins to a replication forum with uuid which
  is exactly same as one of the existing slave's uuid, then master is 
  confused between old slave and the newly joined slave. When com_binlog_send
  request comes from newly joined slave, it thinks that old slave's dump
  thread is zombie and kills it. After some time, old slave tries to reconnect
  to the master and sends com_binlog_send request to Master. Then Master
  thinks that newly joined slave's dump thread is zombie and kills it.
  This ping-pong reconnections between Master->old_slave and Master->new_slave
  happens continously. This is known expected behaviour but the
  error message in old_slave and new_slave's error log file is not clear
  enough to identify this situation.
  
  Information in error log file was:
  "received end packet from server, apparent master shutdown" which
  does not say anything about server_uuid misconfiguration in the
  replication forum.
  
  Fix: Changing the error message which clearly mentions all the possible
  causes which resulted this disconnection.
  
  New information in error log file is: 
  "Slave: received end packet from server due to dump thread being killed on master.
  Dump threads are killed for example during master shutdown,
  explicitly by a user, or when the master receives a binlog send
  request from a duplicate server UUID <server_uuid>. "
  
  And also a NOTE is added on master's error log file with the following
  information when log_warnings is greater than 1.
  "While initializing dump thread for slave with UUID <slave_uuid>,
  found a zombie dump thread with the same UUID.
  Master is killing the zombie dump thread."

Posted by developer:
 
Reopened by merge, no new changelog entries needed. Set back to Closed.

Related to bug#33815.

Note: calling the "other thread" a zombie is using a terminology which may be confusing or simply wrong, especially if the action is triggered by 2 slaves with the same uuid and they are both "active".  "zombie" implies "dead or not working", when in this case it may clearly not be true.  So perhaps it's best to simply say something like:

"While initializing dump thread for slave with UUID <slave_uuid>,
  found an existing dump thread with the same UUID.
  Master is killing the existing dump thread."

Maybe it's worth adding if the time between "reconnections" under these circumstances is very short: "Seeing this message frequently for the same UUID is probably a sign of more than one slave using the same uuid, a configuration mistake", but if you do this rate limit the frequency of this message?

Related, Bug #84358