MySQL Bugs: #15318: Replication Slave IO thread lags periodically

Bug #15318	Replication Slave IO thread lags periodically
Submitted:	29 Nov 2005 16:24	Modified:	4 Jan 2006 13:24
Reporter:	Jack Chadowitz	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	4.1.15nt-log	OS:	Windows (Windows 2000 professional)
Assigned to:		CPU Architecture:	Any

Description:
I set up replication between two windows 2000 machines running MySQL 4.1.15. Replication works extremely well initially. I noticed after about 1 hour slave Read_Master_Log_Pos lags master position by about 653308 bytes.
Slave_IO_State is "Waiting for master to send event". Master thread is sleep. Seconds_Behind_Master = 0.
Stopping and starting the IO Thread via SLAVE STOP IO_THREAD and SLAVE STOP IO_THREAD causes almost instant synchronization.
There is no loading on the communication between master and slave.
Connect_Retry =5.

SHOW PROCESSLIST on slave shows slave IO thread "Waiting for master to send event"

Then without any intervention the slave IO thread will start updating and continue updating until the next hangup.

I suspect that the lag may be caused by a short communications failure although the error log shows no failures.

How to repeat:
Set up 2 servers linked over the internet using Windows 2000 professional and MySQL 4.1.15 installed from binaries. Set up replication. Create a task that runs on the master that periodically does inserts into tables. Monitor the master position using " show master status;" and the slave poistion using " show slave status \G".
Compare master position with Read_Master_Log_Pos on the slave. It may take up to an hour or more to see the lagging and the resumption of slave IO operation may occur 10 minutes later. Once the slave resumes it works well for a while until the lag again occurs

Suggested fix:
The operation of the slave IO thread is not clear. Does it poll the master periodically or does the master inform the slave when there is more data. If, as I suspect the slave is polling the master, what controls the poll rate and what causes the periodic lags? Could this be related to windows port resource limitations or to delays in recovering from a communications failure?

Thank you for a problem report. I want to check that this lag was not caused by TCP/IP connection being closed by one of the machines, and then reopended upon next update sent. So, please, send the values from the registry folder

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]

from both master and slave.

Master Registry Settings

Attachment: master.png (image/png, text), 31.56 KiB.

Slave registry settings

Attachment: slave.png (image/png, text), 36.03 KiB.

As per your request screen captures of the registry values for master and slave have been added. Please let me know if you need anything else

Please, try to set KeepAliveTime and KeepAlivaInterval in the registry to something like:

"KeepAliveTime"=dword:000927c0
"KeepAliveInterval"=dword:000003e8

See http://www.microsoft.com/resources/documentation/Windows/2000/server/reskit/en-us/Default.... and http://www.microsoft.com/resources/documentation/Windows/2000/server/reskit/en-us/Default....
for the details. Try to work with these parameters explicitely set and inform about the results (will you see that lags or not).

Should I change these settings on the master, slave or both

Thanks

Jack

Change them on both, please, if you can. At least - on slave.

The suggested registry additions were made on the slave. The slave caught up immediately without restarting the slave.
After about a day I noticed that the slave lag had reappeared.

The registry additions have now been added to the master and the slave has been restarted. The restart caused the slave to catch up. I will check to see if this solves the problem.

Do the computers require a reboot for the registry additions to take effect?

Thanks

Jack

Thank you for the additional tests.

TCP/IP base services should be restarted, I believe (check the Microsoft documentation to clarify, if you want). The simplest way to do it is to restart the machine.

Please, restart both master and slave with these parameters, and inform about the results after a reasonable period of work.

Both master and slave registry entries were changed and both machines rebooted. It has now been more than 24 hours and the problem has not returned.

Your solution appears to have solved the problem. Many thanks for your efforts. 
Could you explain what caused the problem and how adding the registry entries solved the problem.

Jack

Thank you for the additional test. The problem is that "by default" MS Windows may close TCP/IP sockets after some period of inactivity to "free unused resources". I learned "the trick" using other RDBMS on this platform (IBM Informix). That settings (not present by default, but documented by Microsoft) makes it send some packages over the TCP/IP connection periodically to "keep it alive" in case of both connected sides inactivity. 

The description you presented lead me to the idea that this may be a real reason for the lag you got. Slave's connection was simply closed by OS, and slave does not note it - it simply "thinks" there is nothing to replicate.

Please, keep looking at your servers and reopen this bug report if you'll see the same lag with these registry settings. The report will be closed automatically after a month.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".