Bug #15318 | Replication Slave IO thread lags periodically | | |
---|---|---|---|
Submitted: | 29 Nov 2005 16:24 | Modified: | 4 Jan 2006 13:24 |
Reporter: | Jack Chadowitz | Email Updates: | |
Status: | No Feedback | Impact on me: | |
Category: | MySQL Server | Severity: | S2 (Serious) |
Version: | 4.1.15nt-log | OS: | Windows (Windows 2000 professional) |
Assigned to: | | CPU Architecture: | Any |
[29 Nov 2005 16:24]
Jack Chadowitz
[29 Nov 2005 16:44]
Valeriy Kravchuk
Thank you for the problem report. I want to check that this lag was not caused by the TCP/IP connection being closed by one of the machines and then reopened when the next update was sent. So please send the values from the registry key [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters] on both master and slave.
[29 Nov 2005 17:21]
Jack Chadowitz
Master Registry Settings
Attachment: master.png (image/png, text), 31.56 KiB.
[29 Nov 2005 17:22]
Jack Chadowitz
Slave registry settings
Attachment: slave.png (image/png, text), 36.03 KiB.
[29 Nov 2005 17:23]
Jack Chadowitz
As per your request, screen captures of the registry values for master and slave have been added. Please let me know if you need anything else.
[30 Nov 2005 8:48]
Valeriy Kravchuk
Please try setting KeepAliveTime and KeepAliveInterval in the registry to something like: "KeepAliveTime"=dword:000927c0 "KeepAliveInterval"=dword:000003e8 See http://www.microsoft.com/resources/documentation/Windows/2000/server/reskit/en-us/Default.... and http://www.microsoft.com/resources/documentation/Windows/2000/server/reskit/en-us/Default.... for the details. Try to work with these parameters explicitly set and report the results (whether you still see the lags or not).
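For readers unfamiliar with the `dword:` notation, the two values suggested above can be decoded with a few lines of Python. This is only an illustrative sketch: the helper name `dword_to_ms` is made up for this example, and it assumes (per Microsoft's TCP/IP parameter documentation) that both values are expressed in milliseconds.

```python
# Registry values quoted in the comment above, in .reg file syntax.
REG_VALUES = {
    "KeepAliveTime": "dword:000927c0",
    "KeepAliveInterval": "dword:000003e8",
}


def dword_to_ms(reg_value: str) -> int:
    """Convert a 'dword:XXXXXXXX' string to an integer (assumed milliseconds).

    Hypothetical helper for illustration; not part of any MySQL or Windows API.
    """
    prefix, _, hex_digits = reg_value.partition(":")
    if prefix.lower() != "dword":
        raise ValueError(f"not a dword value: {reg_value!r}")
    return int(hex_digits, 16)


if __name__ == "__main__":
    for name, raw in REG_VALUES.items():
        ms = dword_to_ms(raw)
        print(f"{name}: {ms} ms ({ms / 1000:g} s)")
```

Under that assumption, the suggested KeepAliveTime works out to 600000 ms (10 minutes of idle time before the first keepalive probe) and KeepAliveInterval to 1000 ms (1 second between retries of an unanswered probe).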
[30 Nov 2005 12:53]
Jack Chadowitz
Should I change these settings on the master, the slave, or both? Thanks, Jack
[30 Nov 2005 13:23]
Valeriy Kravchuk
Change them on both, please, if you can. At least on the slave.
[2 Dec 2005 12:37]
Jack Chadowitz
The suggested registry additions were made on the slave. The slave caught up immediately without restarting the slave. After about a day I noticed that the slave lag had reappeared. The registry additions have now been added to the master and the slave has been restarted. The restart caused the slave to catch up. I will check to see if this solves the problem. Do the computers require a reboot for the registry additions to take effect? Thanks Jack
[2 Dec 2005 13:09]
Valeriy Kravchuk
Thank you for the additional tests. I believe the base TCP/IP services have to be restarted (check the Microsoft documentation to confirm, if you want). The simplest way to do that is to restart the machine. Please restart both master and slave with these parameters, and report the results after a reasonable period of work.
[3 Dec 2005 18:59]
Jack Chadowitz
Both master and slave registry entries were changed and both machines rebooted. It has now been more than 24 hours and the problem has not returned. Your solution appears to have solved the problem. Many thanks for your efforts. Could you explain what caused the problem and how adding the registry entries solved it? Jack
[4 Dec 2005 13:24]
Valeriy Kravchuk
Thank you for the additional test. The problem is that "by default" MS Windows may close TCP/IP sockets after some period of inactivity to "free unused resources". I learned "the trick" using another RDBMS on this platform (IBM Informix). Those settings (not present by default, but documented by Microsoft) make it send some packets over the TCP/IP connection periodically to "keep it alive" when both connected sides are inactive. The description you presented led me to the idea that this may be the real reason for the lag you got. The slave's connection was simply closed by the OS, and the slave did not notice it: it simply "thinks" there is nothing to replicate. Please keep watching your servers and reopen this bug report if you see the same lag with these registry settings. The report will be closed automatically after a month.
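To illustrate the keepalive mechanism described above, here is a minimal Python sketch that turns on TCP keepalive for a single socket using the standard `SO_KEEPALIVE` option. This is not how the fix in this report was applied (the fix here was the system-wide registry values), and it makes no claim about what the MySQL server or slave does internally. The function name `make_keepalive_socket` is hypothetical, and the timing of the probes is left to the operating system's settings.

```python
import socket


def make_keepalive_socket() -> socket.socket:
    """Create a TCP socket with keepalive probes enabled for this socket only.

    Hypothetical helper for illustration. Probe timing is not set here and
    is assumed to come from the operating system's configuration.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the OS to send periodic probes on this connection while it is idle,
    # so that a silently dropped connection is eventually detected.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


if __name__ == "__main__":
    s = make_keepalive_socket()
    try:
        enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
        print("keepalive enabled:", bool(enabled))
    finally:
        s.close()
```

With keepalive active, an idle connection carries occasional probe packets, and a peer that stops answering them causes the connection to be reported as broken instead of appearing merely quiet, which is the failure mode described in the comment above.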
[5 Jan 2006 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".