Bug #29976 | Excessive Slave I/O errors in replication tests | |
---|---|---|---
Submitted: | 23 Jul 2007 9:53 | Modified: | 14 Jan 2008 17:26
Reporter: | Ingo Strüwing | Email Updates: |
Status: | Closed | Impact on me: |
Category: | Tests: Replication | Severity: | S7 (Test Cases)
Version: | 5.1.21, 5.2.5 | OS: | Linux (Debian x86)
Assigned to: | Alexey Kopytov | CPU Architecture: | Any
Tags: | pbwarn | |
[23 Jul 2007 9:53]
Ingo Strüwing
[23 Oct 2007 12:34]
Lars Thalmann
This is a problem of the test framework. If the system is too heavily loaded, the slave can't connect to the master. Magnus, can you have a look at this and try to figure out a solution? Actually, I wonder why this is happening. I can imagine the following causes:

1. The slave I/O thread does not get CPU time.
2. The network is saturated with other TCP/IP packets.
3. The master dump thread does not get CPU time.

If it is 1 or 3, then multi-tasking is starving a thread. In that case, the test framework could monitor the CPU and, if the load is above, let's say, 80%, mark the PB tests black (hardware failure) rather than red (software failure). If it is 2, then we could monitor the network and, if it is over, let's say, 80% capacity (this needs to be defined), mark them black instead of red.
[23 Oct 2007 14:05]
Ingo Strüwing
Hehe. Great advice. Like not implementing wait-for-lock, and telling the user to wait until no one else uses a table. :-/ Why don't we make the slave I/O thread wait until the other threads reach the required state? And if there is some retry mechanism implemented, why is this an [ERROR] and not a [NOTE]? If you can tell for sure that this is nothing to worry about, you can add the message to the filter in mysql-test/lib/mtr_report.pl.
[24 Oct 2007 9:06]
Mats Kindahl
Ingo, your metaphor is grossly inaccurate and does not apply to this situation. Even table locks have time-outs and will give an error message if a lock cannot be acquired within a certain time limit. In this case, if the slave cannot contact the master within a certain time limit (based on time-outs and retries), the timeout expires and the slave reports an error, since the master does not reply. This is a correct error message, and we cannot filter it out, since there can be other reasons for the failure. In other words:

1. We are waiting for the master to reach the required state, but it fails to do so in time.
2. We cannot filter out the error, or report it as a note, since the error correctly reflects a potential failure.
3. The error occurs because the timeout is set too low when running the tests under pushbuild, since there the machines are heavily loaded.
[24 Oct 2007 9:41]
Ingo Strüwing
I see. In that case the timeout should be increased so that the error does not appear on our test machines. Users could also have heavily loaded machines. And for me it happens on an idle machine, just running one test suite. But it is a slow one: uniprocessor, 1.8 GHz, 1.5 GB RAM. Such machines can still be found at our users' sites. A failing test suite is very annoying for me, because every failure (including warnings) breaks any automatic script. Every second day or so I have to restart a merge script because of this Slave I/O warning. This is what makes me slightly upset, as I sense reluctance to fix it from the people who know how to do it. As team mates, I suggest we not slow each other down unnecessarily.
[1 Nov 2007 15:52]
Alexey Kopytov
The problem is easily reproducible on a single-core machine by running any replication test with a low scheduling priority and a concurrent CPU-intensive load at a high priority (I used cpuburn), e.g.:

$ sudo nice --20 burnP6

In another session:

$ nice -19 ./mysql-test-run.pl rpl_ssl
[8 Nov 2007 11:16]
Lars Thalmann
Refined Problem Description
---------------------------
Slave registration on the master sometimes takes a long time (because the master CPU is too loaded). The timeout applied to it is then too small.

Related code
------------
The slave executes:

slave.cc:register_slave_on_master()
{
  ...
  simple_command(mysql, COM_REGISTER_SLAVE, buf, (size_t) (pos - buf), 0)
  ...
}

The master responds by executing:

repl_failsafe.cc:register_slave()

(No real problems here, except that it takes too long to execute this, since the master is too loaded.)

Suggested fix
-------------
When executing the simple_command in the slave, make sure to wait for it for a longer time. These variables control other timeouts for the replication connection:

* `--master-connect-retry=SECONDS'
  The number of seconds that the slave thread sleeps before trying to reconnect to the master in case the master goes down or the connection is lost. The value in the `master.info' file takes precedence if it can be read. If not set, the default is 60. Connection retries are not invoked until the slave times out reading data from the master according to the value of `--slave-net-timeout'. The number of reconnection attempts is limited by the `--master-retry-count' option.

* `--slave-net-timeout=SECONDS'
  The number of seconds to wait for more data from the master before the slave considers the connection broken, aborts the read, and tries to reconnect. The first retry occurs immediately after the timeout. The interval between retries is controlled by the `--master-connect-retry' option, and the number of reconnection attempts is limited by the `--master-retry-count' option. The default is 3600 seconds (one hour).

Note that in later versions we have heartbeating, so the --slave-net-timeout value might be very low in many future production systems.

I suggest using, as the value for this timeout, either MAX(master-connect-retry, slave-net-timeout) or introducing a new variable --slave-register-timeout. The latter is probably best, since it can then be tested more easily.
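For illustration, a minimal sketch of the MAX() alternative; the helper name and the plain parameters are hypothetical, not the server's actual option plumbing:

/* Hypothetical helper: choose the COM_REGISTER_SLAVE timeout as the
   larger of the two existing replication timeouts suggested above. */
static unsigned long register_timeout(unsigned long master_connect_retry,
                                      unsigned long slave_net_timeout)
{
  return master_connect_retry > slave_net_timeout ?
         master_connect_retry : slave_net_timeout;
}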
[12 Nov 2007 16:14]
Alexey Kopytov
Here is the problem: net_clear() calls net_data_is_ready() which calls poll() or select() to see if the socket is ready for I/O. However, poll()/select() may fail with EINTR when a signal occurs. We catch EINTR when doing the I/O itself by checking vio_should_retry() and restarting vio_read()/vio_write() up to net_retry_count times. The same must also be done in net_data_is_ready().
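As a sketch of that fix, assuming a plain POSIX socket rather than the server's Vio layer (data_is_ready() is a hypothetical stand-in for net_data_is_ready(), and the retry limit mirrors the net_retry_count idea):

#include <cerrno>
#include <poll.h>

/* Returns 1 if fd has data to read, 0 on timeout, -1 on a real error.
   poll() is restarted on EINTR up to retry_count times; note that each
   retry restarts the full timeout, which a production fix would have to
   account for. */
static int data_is_ready(int fd, int timeout_ms, unsigned retry_count)
{
  struct pollfd pfd;
  pfd.fd= fd;
  pfd.events= POLLIN;
  pfd.revents= 0;

  int res;
  do
    res= poll(&pfd, 1, timeout_ms);
  while (res < 0 && errno == EINTR && retry_count-- > 0);

  if (res < 0)
    return -1;            /* non-EINTR error, or EINTR retries exhausted */
  return res > 0 ? 1 : 0;
}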
[13 Nov 2007 14:15]
Alexey Kopytov
Fixing the above does not actually solve the problem; the error still occurs. The real problem is much simpler. START SLAVE is asynchronous with respect to the replication threads: it only starts the I/O and SQL threads, without waiting for the necessary initialization steps to complete (for example, for the I/O thread to register on the master). When a subsequent STOP SLAVE terminates the I/O thread, there is a chance that the I/O thread is still registering on the master. This aborts all network operations, causes the COM_REGISTER_SLAVE command to fail, and results in the "couldn't register on master" error. The chances of terminating the slave I/O thread inside COM_REGISTER_SLAVE increase on a heavily loaded system and in test cases with frequent START SLAVE/STOP SLAVE commands. This is the case for rpl_ssl:

# Do the same thing a number of times
disable_query_log;
disable_result_log;
let $i= 100;
while ($i)
{
  start slave;
  connection master;
  insert into t1 values (NULL);
  select * from t1; # Some variance
  connection slave;
  select * from t1; # Some variance
  stop slave;
  dec $i;
}
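The race can be sketched outside the server, with std::thread standing in for the slave I/O thread; the names and the sleep below are illustrative only:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> io_killed{false};

static void io_thread()
{
  /* Registering on a loaded master takes a while... */
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  if (io_killed.load())
    std::puts("killed mid-registration -> spurious error");   /* the bug */
  else
    std::puts("registered on master");
}

int main()
{
  std::thread t(io_thread);  /* START SLAVE returns once the thread is launched */
  io_killed.store(true);     /* STOP SLAVE arrives while registration is in flight */
  t.join();
  return 0;
}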
[14 Nov 2007 11:43]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:

http://lists.mysql.com/commits/37738

ChangeSet@1.2589, 2007-11-14 14:42:40+03:00, kaa@polly.(none) +1 -0

Fix for bug #29976: Excessive Slave I/O errors in replication tests

Problem: The "Slave I/O thread couldn't register on master" error sporadically occurred in replication tests because the slave I/O thread got killed by STOP SLAVE before or while registering on master.

Fixed by checking the state of the I/O thread, and issuing the error only if it was not explicitly killed by a user.
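A hedged reconstruction of the approach the changeset describes, not the committed code; SlaveIoThread, send_register_slave(), and the error plumbing are stand-ins for the server internals:

#include <atomic>
#include <cstdio>

/* Stand-in for the slave I/O thread state; 'killed' is set by STOP SLAVE. */
struct SlaveIoThread
{
  std::atomic<bool> killed{false};
};

/* Stand-in for issuing COM_REGISTER_SLAVE; returns false on failure. */
static bool send_register_slave()
{
  return false;   /* pretend the command was aborted */
}

static bool register_on_master(SlaveIoThread &io)
{
  if (!send_register_slave())
  {
    /* The essence of the fix: a failure caused by a deliberate STOP SLAVE
       is expected, so raise the error only if the thread was NOT killed. */
    if (!io.killed.load())
      std::fprintf(stderr,
                   "[ERROR] Slave I/O thread couldn't register on master\n");
    return false;
  }
  return true;
}

int main()
{
  SlaveIoThread io;
  io.killed.store(true);    /* simulate STOP SLAVE during registration */
  register_on_master(io);   /* fails silently, as intended */
  return 0;
}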
[21 Nov 2007 11:15]
Mats Kindahl
The problem with the above errors is a failure in the tests, not in the server. When starting a slave, the test should wait for the slave to start before proceeding, and when stopping a slave, the test should wait for the slave thread to stop before proceeding. There is a missing wait in "include/master-slave.inc" after the slave is started, which is what causes rpl000017, rpl_server_id1, and rpl_slave_skip to fail on a loaded machine, since these test cases begin by stopping the slave.

I really don't think it is a good idea to eliminate this error message, even for this particular situation. Imagine the following scenario:

- The DBA does a CHANGE MASTER, but accidentally points it to the wrong master.
- The DBA does a START SLAVE to fetch some changes from the master, but the I/O thread fails to register because there is no reply.
- The DBA does a STOP SLAVE and expects to see the changes, but there are none. Nor is there an error message indicating a potential problem, so he is none the wiser.

In this situation, an error message should be generated indicating that the slave failed to register on the master before it was told to stop. We cannot know if this is the scenario at hand, so we should print some form of message if the (asynchronous) start failed, whatever the reason.

Specifically, since there is an easy patch that solves Ingo's immediate problems (i.e., adding a wait for slave start to master-slave-reset.inc), I see no reason to change the slave's behavior. Warnings are easily missed, especially for the slave threads, and any error generated as a result of a stop is something that the DBA or a script can handle.
[21 Nov 2007 11:57]
Ingo Strüwing
Aha. I understand this explanation. (I did not understand the former explanations, which also pointed at the tests.) Now I beg that this please be fixed in the tests. Thanks, Ingo
[22 Nov 2007 16:23]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:

http://lists.mysql.com/commits/38292

ChangeSet@1.2589, 2007-11-22 19:22:54+03:00, kaa@polly.(none) +1 -0

Fix for bug #29976: Excessive Slave I/O errors in replication tests

Problem: The "Slave I/O thread couldn't register on master" error sporadically occurred in replication tests because the slave I/O thread got killed by STOP SLAVE before or while registering on master.

Fixed by checking the state of the I/O thread, and issuing the error only if it was not explicitly killed by a user.
[14 Dec 2007 8:18]
Bugs System
Pushed into 5.1.23-rc
[14 Dec 2007 8:22]
Bugs System
Pushed into 6.0.5-alpha
[14 Jan 2008 17:26]
Jon Stephens
This appears to be a test framework issue only, with nothing to document for end users. Closed w/o further action.