MySQL Bugs: #16927: Duplicate slave server-id gives no specific error message, poor relay log use

Bug #16927	Duplicate slave server-id gives no specific error message, poor relay log use
Submitted:	31 Jan 2006 10:01	Modified:	3 Sep 2010 16:50
Reporter:	Bruce Dembecki	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S4 (Feature request)
Version:	5.0, 5.1	OS:	Any
Assigned to:	Libing Song	CPU Architecture:	Any

Description:
Running these versions of MySQL under OS X 10.4.4 as the master causes it to send some sort of end packet on a frequent basis which is handled by a MySQL 5.0.15, 5,0,17 or 5.0.18 slave as the misbelief that the Master has shutdown.

One side affect of this is that many many relay-log files are created. Another side affect of this if Replication runs from time to time with SQL_THREAD turned off is that these relay log files grow rapidly... ending in our currewnt situation where there are 120K+ relay log files in our data directory and any operation on them or on the directory as a whole is very slow.

If I roll the Master back to 5.0.15 then the logs are sent in a manner that can be read by the slaves... this includes old logs created by MySQL 5.0.18... this it is not a file format issue, but rather something in how the newer versions of MySQL are sending the data.

Here's an example from the error log on the slave (there is nothing in the error log omn the master).

060131  0:40:07 [Note] Slave: received end packet from server, apparent master shutdown: 
060131  0:40:07 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binary-log.001012' position 5344449
060131  0:40:07 [Note] Slave: connected to master 'repl@data6:3306',replication resumed in log 'binary-log.001012' at position 5344449
060131  0:40:10 [Note] Slave: received end packet from server, apparent master shutdown: 
060131  0:40:10 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'binary-log.001012' position 5354255
060131  0:40:10 [Note] Slave: connected to master 'repl@data6:3306',replication resumed in log 'binary-log.001012' at position 5354255

I can give you the log files if you need them... or some examples (you can't possibly want all 120000).

Of course replication can't catch up because everything involving changing to another log file needs a directory scan and the OS isn't very good at dealing with all those files in the directory.

How to repeat:
Connect to an OS X server running 5.0.18  as a slave and watch the relay log count increase rapidly if there's no real backlog... if you run without the replication SQL_THREAD running you will start to see the number of relay-log files grow rapidly... if you get enough of them you can't catch up with replication.

Thank you for a problem report. Please, upload that 'binary-log.001012' file. Can you provide a small set of SQL statements to execute on master to repeat the behaviour you described each and every time?

I can give you any combination of b inary and relay logs you may want... The relay logs tend to be small, a handfull of queries... here is a portion of the ls -l of where I stashed the relay logs to give you an idea. I can upload a binary log from before I rolled back to 5.0.15... how about one from the middle of the night, wee roll binaries every two hours, so the middle of the night ones are much smaller at about 10M each. This exceeds the file upload ability on the web interface, where would you like me to put it? FTP would be better, I'm offsite this week so would like to send it directly from the server, rather than routing it to my laptop first.

Here is a small piece of that directory listing so you can see the typical file size of those relay logs:

-rw-rw----   1 mysql  mysql      1580 Jan 30 23:31 data-admin-relay-bin.586501
-rw-rw----   1 mysql  mysql      2551 Jan 30 23:31 data-admin-relay-bin.586502
-rw-rw----   1 mysql  mysql      5154 Jan 30 23:31 data-admin-relay-bin.586503
-rw-rw----   1 mysql  mysql      1548 Jan 30 23:31 data-admin-relay-bin.586504
-rw-rw----   1 mysql  mysql      8699 Jan 30 23:31 data-admin-relay-bin.586505
-rw-rw----   1 mysql  mysql       710 Jan 30 23:31 data-admin-relay-bin.586506
-rw-rw----   1 mysql  mysql      3665 Jan 30 23:31 data-admin-relay-bin.586507
-rw-rw----   1 mysql  mysql      1846 Jan 30 23:31 data-admin-relay-bin.586508
-rw-rw----   1 mysql  mysql      7486 Jan 30 23:31 data-admin-relay-bin.586509
-rw-rw----   1 mysql  mysql      2624 Jan 30 23:31 data-admin-relay-bin.586510
-rw-rw----   1 mysql  mysql      1195 Jan 30 23:31 data-admin-relay-bin.586511
-rw-rw----   1 mysql  mysql     12683 Jan 30 23:31 data-admin-relay-bin.586512
-rw-rw----   1 mysql  mysql      6765 Jan 30 23:31 data-admin-relay-bin.586513
-rw-rw----   1 mysql  mysql      5477 Jan 30 23:32 data-admin-relay-bin.586514
-rw-rw----   1 mysql  mysql      1613 Jan 30 23:32 data-admin-relay-bin.586515
-rw-rw----   1 mysql  mysql      1373 Jan 30 23:32 data-admin-relay-bin.586516
-rw-rw----   1 mysql  mysql     10729 Jan 30 23:32 data-admin-relay-bin.586517
-rw-rw----   1 mysql  mysql       473 Jan 30 23:32 data-admin-relay-bin.586518
-rw-rw----   1 mysql  mysql      1159 Jan 30 23:32 data-admin-relay-bin.586519
-rw-rw----   1 mysql  mysql      5469 Jan 30 23:32 data-admin-relay-bin.586520
-rw-rw----   1 mysql  mysql     12258 Jan 30 23:32 data-admin-relay-bin.586521

It appears when setting up an offsite slave I used a my.cnf file from another slave and there was multiple slaves with the same serverID... Apparently this behaviour is a known problem when there s a serverID conflict... I'll close this issue as a stupid user issue :-)

Bruce,

I've reopened this with the original 'Slave receives "end packet" from Master' title updated to reflect the underlying problem. This is a fairly common problem, usually for new setups.

Three feature requests:

1. A more clear error message in the slave. Might require the master telling the slave being disconnected the server IP and/or name of the slave it is accepting the connection from just before closing the connection. Then the slave can see that those differ from its own and can report both that information from the master and that it differs from what the slave thinks it is called.

2. The master error log should be logging the disconnect/reconnect with the same server ID. But no more than once a minute per server-id/slave name pair, so the error log doesn't grow to a huge size.

3. Perhaps only create a new relay log file if it has been more than five seconds since the last disconnect from the master (ideally also only if it looks like a duplicate server-id issue)?

I'm also adding a check for this to the desired feature list for a possible server monitoring tool.

In addition/essentially:

- We shouldn't allow the new connections since there's already an active server connection with that server id.
- The error message and error handling should be better

I do believe not allowing a new connection if there is already a slave with that serverID prevents both the problem and the Stupid User Error, assuming the error message covers the reason. I like that solution a lot (as the Stupid User in question I'm allowed to call it a Stupid User Error).  :-)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101279

3352 Luis Soares	2010-02-24
      BUG#16927: Duplicate slave server-id gives no specific error message, poor relay log use
      
      Sketch of a test case.

Setting back to verified. Previous patch was just a sample of a test case.

This bug happens when killing a zombie thread. When a slave connects to its master, the master always kills the other slave thread which has the same server-id. 

Let us demonstrate it.
Slave1 and Slave2 have the same server-id.
1. Slave1 connects to master.
2. Slave2 connects to master. 
   Slave1 has the same server-id, so it is killed.
3. Slave1 tries to connect to master automatically.
   Slave1 connects successful.
   Slave2 is killed.
4. Slave2 tries to connect to master again.
   Slave2 connects successful.
   Slave1 is killed.
5. ......
   ......

This bug is fixed by WL#4677. WL#4677 gives each server one UUID. it is generated automatically and can not be same between different severs.
In the worklog, the rule of killing zombie slave thread is changed.
When a slave connects to its master, the master only kills the slave thread
which has the same UUID instead of the thread which has the same server-id.

Resetting report to Need Merge. It cannot be set to Documenting until it specifies the version(s) into which the fix was pushed.

WL#4677 is not pushed yet, so go back to verified.

Fixed by the patch for WL#4677.
The patch's revid: li-bing.song@sun.com-20100622014153-9jsvfdejgw78r4qe
It has arrived mysql-next-mr-rpl-merge now and will arrive mysql-next-mr-bugfixing soon.

This is being documented as part of docs work for WL#4766, so closing.