Bug #15318 Replication Slave IO thread lags periodically
Submitted: 29 Nov 2005 16:24 Modified: 4 Jan 2006 13:24
Reporter: Jack Chadowitz Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1.15nt-log OS:Windows (Windows 2000 professional)
Assigned to: CPU Architecture:Any

[29 Nov 2005 16:24] Jack Chadowitz
Description:
I set up replication between two windows 2000 machines running MySQL 4.1.15. Replication works extremely well initially. I noticed after about 1 hour slave Read_Master_Log_Pos lags master position by about 653308 bytes.
Slave_IO_State is "Waiting for master to send event". Master thread is sleep. Seconds_Behind_Master = 0.
Stopping and starting the IO Thread via SLAVE STOP IO_THREAD and SLAVE STOP IO_THREAD causes almost instant synchronization.
There is no loading on the communication between master and slave.  
Connect_Retry =5.

SHOW PROCESSLIST on slave shows slave IO thread "Waiting for master to send event" 

Then without any intervention the slave IO thread will start updating and continue updating until the next hangup.

I suspect that the lag may be caused by a short communications failure although the error log shows no failures.

How to repeat:
Set up 2 servers linked over the internet using Windows 2000 professional and MySQL 4.1.15 installed from binaries. Set up replication. Create a task that runs on the master that periodically does inserts into tables. Monitor the master position using " show master status;" and the slave poistion using " show slave status \G".
Compare master position with  Read_Master_Log_Pos on the slave. It may take up to an hour or more to see the lagging and the resumption of slave IO operation may occur 10 minutes later. Once the slave resumes it works well for a while until the lag again occurs

Suggested fix:
The operation of the slave IO thread is not clear. Does it poll the master periodically or does the master inform the slave when there is more data. If, as I suspect the slave is polling the master, what controls the poll rate and what causes the periodic lags? Could this be related to windows port resource limitations or to delays in recovering from a communications failure?
[29 Nov 2005 16:44] Valeriy Kravchuk
Thank you for a problem report. I want to check that this lag was not caused by TCP/IP connection being closed by one of the machines, and then reopended upon next update sent. So, please, send the values from the registry folder

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]

from both master and slave.
[29 Nov 2005 17:21] Jack Chadowitz
Master Registry Settings

Attachment: master.png (image/png, text), 31.56 KiB.

[29 Nov 2005 17:22] Jack Chadowitz
Slave registry settings

Attachment: slave.png (image/png, text), 36.03 KiB.

[29 Nov 2005 17:23] Jack Chadowitz
As per your request screen captures of the registry values for master and slave have been added. Please let me know if you need anything else
[30 Nov 2005 8:48] Valeriy Kravchuk
Please, try to set KeepAliveTime and KeepAlivaInterval in the registry to something like:

"KeepAliveTime"=dword:000927c0
"KeepAliveInterval"=dword:000003e8

See http://www.microsoft.com/resources/documentation/Windows/2000/server/reskit/en-us/Default.... and http://www.microsoft.com/resources/documentation/Windows/2000/server/reskit/en-us/Default....
for the details. Try to work with these parameters explicitely set and inform about the results (will you see that lags or not).
[30 Nov 2005 12:53] Jack Chadowitz
Should I change these settings on the master, slave or both

Thanks

Jack
[30 Nov 2005 13:23] Valeriy Kravchuk
Change them on both, please, if you can. At least - on slave.
[2 Dec 2005 12:37] Jack Chadowitz
The suggested registry additions were made on the slave. The slave caught up immediately without restarting the slave.
After about a day I noticed that the slave lag had reappeared.

The registry additions have now been added to the master and the slave has been restarted. The restart caused the slave to catch up. I will check to see if this solves the problem.

Do the computers require a reboot for the registry additions to take effect?

Thanks

Jack
[2 Dec 2005 13:09] Valeriy Kravchuk
Thank you for the additional tests.

TCP/IP base services should be restarted, I believe (check the Microsoft documentation to clarify, if you want). The simplest way to do it is to restart the machine.

Please, restart both master and slave with these parameters, and inform about the results after a reasonable period of work.
[3 Dec 2005 18:59] Jack Chadowitz
Both master and slave registry entries were changed and both machines rebooted. It has now been more than 24 hours and the problem has not returned.

Your solution appears to have solved the problem. Many thanks for your efforts. 
Could you explain what caused the problem and how adding the registry entries solved the problem.

Jack
[4 Dec 2005 13:24] Valeriy Kravchuk
Thank you for the additional test. The problem is that "by default" MS Windows may close TCP/IP sockets after some period of inactivity to "free unused resources". I learned "the trick" using other RDBMS on this platform (IBM Informix). That settings (not present by default, but documented by Microsoft) makes it send some packages over the TCP/IP connection periodically to "keep it alive" in case of both connected sides inactivity. 

The description you presented lead me to the idea that this may be a real reason for the lag you got. Slave's connection was simply closed by OS, and slave does not note it - it simply "thinks" there is nothing to replicate.

Please, keep looking at your servers and reopen this bug report if you'll see the same lag with these registry settings. The report will be closed automatically after a month.
[5 Jan 2006 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".