Bug #16927 Duplicate slave server-id gives no specific error message, poor relay log use
Submitted: 31 Jan 2006 10:01 Modified: 3 Sep 2010 16:50
Reporter: Bruce Dembecki Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S4 (Feature request)
Version:5.0, 5.1 OS:Any
Assigned to: Libing Song CPU Architecture:Any

[31 Jan 2006 10:01] Bruce Dembecki
Description:
Running these versions of MySQL under OS X 10.4.4 as the master causes it to send some sort of end packet on a frequent basis which is handled by a MySQL 5.0.15, 5,0,17 or 5.0.18 slave as the misbelief that the Master has shutdown.

One side affect of this is that many many relay-log files are created. Another side affect of this if Replication runs from time to time with SQL_THREAD turned off is that these relay log files grow rapidly... ending in our currewnt situation where there are 120K+ relay log files in our data directory and any operation on them or on the directory as a whole is very slow.

If I roll the Master back to 5.0.15 then the logs are sent in a manner that can be read by the slaves... this includes old logs created by MySQL 5.0.18... this it is not a file format issue, but rather something in how the newer versions of MySQL are sending the data.

Here's an example from the error log on the slave (there is nothing in the error log omn the master).

060131  0:40:07 [Note] Slave: received end packet from server, apparent master shutdown: 
060131  0:40:07 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binary-log.001012' position 5344449
060131  0:40:07 [Note] Slave: connected to master 'repl@data6:3306',replication resumed in log 'binary-log.001012' at position 5344449
060131  0:40:10 [Note] Slave: received end packet from server, apparent master shutdown: 
060131  0:40:10 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binary-log.001012' position 5354255
060131  0:40:10 [Note] Slave: connected to master 'repl@data6:3306',replication resumed in log 'binary-log.001012' at position 5354255

I can give you the log files if you need them... or some examples (you can't possibly want all 120000).

Of course replication can't catch up because everything involving changing to another log file needs a directory scan and the OS isn't very good at dealing with all those files in the directory.

How to repeat:
Connect to an OS X server running 5.0.18  as a slave and watch the relay log count increase rapidly if there's no real backlog... if you run without the replication SQL_THREAD running you will start to see the number of relay-log files grow rapidly... if you get enough of them you can't catch up with replication.
[1 Feb 2006 9:55] Valeriy Kravchuk
Thank you for a problem report. Please, upload that 'binary-log.001012' file. Can you provide a small set of SQL statements to execute on master to repeat the behaviour you described each and every time?
[1 Feb 2006 19:30] Bruce Dembecki
I can give you any combination of b inary and relay logs you may want... The relay logs tend to be small, a handfull of queries... here is a portion of the ls -l of where I stashed the relay logs to give you an idea. I can upload a binary log from before I rolled back to 5.0.15... how about one from the middle of the night, wee roll binaries every two hours, so the middle of the night ones are much smaller at about 10M each. This exceeds the file upload ability on the web interface, where would you like me to put it? FTP would be better, I'm offsite this week so would like to send it directly from the server, rather than routing it to my laptop first.

Here is a small piece of that directory listing so you can see the typical file size of those relay logs:

-rw-rw----   1 mysql  mysql      1580 Jan 30 23:31 data-admin-relay-bin.586501
-rw-rw----   1 mysql  mysql      2551 Jan 30 23:31 data-admin-relay-bin.586502
-rw-rw----   1 mysql  mysql      5154 Jan 30 23:31 data-admin-relay-bin.586503
-rw-rw----   1 mysql  mysql      1548 Jan 30 23:31 data-admin-relay-bin.586504
-rw-rw----   1 mysql  mysql      8699 Jan 30 23:31 data-admin-relay-bin.586505
-rw-rw----   1 mysql  mysql       710 Jan 30 23:31 data-admin-relay-bin.586506
-rw-rw----   1 mysql  mysql      3665 Jan 30 23:31 data-admin-relay-bin.586507
-rw-rw----   1 mysql  mysql      1846 Jan 30 23:31 data-admin-relay-bin.586508
-rw-rw----   1 mysql  mysql      7486 Jan 30 23:31 data-admin-relay-bin.586509
-rw-rw----   1 mysql  mysql      2624 Jan 30 23:31 data-admin-relay-bin.586510
-rw-rw----   1 mysql  mysql      1195 Jan 30 23:31 data-admin-relay-bin.586511
-rw-rw----   1 mysql  mysql     12683 Jan 30 23:31 data-admin-relay-bin.586512
-rw-rw----   1 mysql  mysql      6765 Jan 30 23:31 data-admin-relay-bin.586513
-rw-rw----   1 mysql  mysql      5477 Jan 30 23:32 data-admin-relay-bin.586514
-rw-rw----   1 mysql  mysql      1613 Jan 30 23:32 data-admin-relay-bin.586515
-rw-rw----   1 mysql  mysql      1373 Jan 30 23:32 data-admin-relay-bin.586516
-rw-rw----   1 mysql  mysql     10729 Jan 30 23:32 data-admin-relay-bin.586517
-rw-rw----   1 mysql  mysql       473 Jan 30 23:32 data-admin-relay-bin.586518
-rw-rw----   1 mysql  mysql      1159 Jan 30 23:32 data-admin-relay-bin.586519
-rw-rw----   1 mysql  mysql      5469 Jan 30 23:32 data-admin-relay-bin.586520
-rw-rw----   1 mysql  mysql     12258 Jan 30 23:32 data-admin-relay-bin.586521
[8 Feb 2006 8:22] Bruce Dembecki
It appears when setting up an offsite slave I used a my.cnf file from another slave and there was multiple slaves with the same serverID... Apparently this behaviour is a known problem when there s a serverID conflict... I'll close this issue as a stupid user issue :-)
[9 Feb 2006 19:19] James Day
Bruce,

I've reopened this with the original 'Slave receives "end packet" from Master' title updated to reflect the underlying problem. This is a fairly common problem, usually for new setups.

Three feature requests:

1. A more clear error message in the slave. Might require the master telling the slave being disconnected the server IP and/or name of the slave it is accepting the connection from just before closing the connection. Then the slave can see that those differ from its own and can report both that information from the master and that it differs from what the slave thinks it is called.

2. The master error log should be logging the disconnect/reconnect with the same server ID. But no more than once a minute per server-id/slave name pair, so the error log doesn't grow to a huge size.

3. Perhaps only create a new relay log file if it has been more than five seconds since the last disconnect from the master (ideally also only if it looks like a duplicate server-id issue)?

I'm also adding a check for this to the desired feature list for a possible server monitoring tool.
[9 Feb 2006 23:59] Lachlan Mulcahy
In addition/essentially:

- We shouldn't allow the new connections since there's already an active server connection with that server id.
- The error message and error handling should be better
[10 Feb 2006 0:35] Bruce Dembecki
I do believe not allowing a new connection if there is already a slave with that serverID prevents both the problem and the Stupid User Error, assuming the error message covers the reason. I like that solution a lot (as the Stupid User in question I'm allowed to call it a Stupid User Error).  :-)
[24 Feb 2010 2:18] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101279

3352 Luis Soares	2010-02-24
      BUG#16927: Duplicate slave server-id gives no specific error message, poor relay log use
      
      Sketch of a test case.
[24 Feb 2010 2:19] Luis Soares
Setting back to verified. Previous patch was just a sample of a test case.
[22 Mar 2010 10:07] Libing Song
This bug happens when killing a zombie thread. When a slave connects to its master, the master always kills the other slave thread which has the same server-id. 

Let us demonstrate it.
Slave1 and Slave2 have the same server-id.
1. Slave1 connects to master.
2. Slave2 connects to master. 
   Slave1 has the same server-id, so it is killed.
3. Slave1 tries to connect to master automatically.
   Slave1 connects successful.
   Slave2 is killed.
4. Slave2 tries to connect to master again.
   Slave2 connects successful.
   Slave1 is killed.
5. ......
   ......

This bug is fixed by WL#4677. WL#4677 gives each server one UUID. it is generated automatically and can not be same between different severs.
In the worklog, the rule of killing zombie slave thread is changed.
When a slave connects to its master, the master only kills the slave thread
which has the same UUID instead of the thread which has the same server-id.
[23 Mar 2010 15:59] Paul DuBois
Resetting report to Need Merge. It cannot be set to Documenting until it specifies the version(s) into which the fix was pushed.
[29 Mar 2010 8:22] Libing Song
WL#4677 is not pushed yet, so go back to verified.
[23 Jun 2010 2:42] Libing Song
Fixed by the patch for WL#4677.
The patch's revid: li-bing.song@sun.com-20100622014153-9jsvfdejgw78r4qe
It has arrived mysql-next-mr-rpl-merge now and will arrive mysql-next-mr-bugfixing soon.
[3 Sep 2010 16:50] Jon Stephens
This is being documented as part of docs work for WL#4766, so closing.