MySQL Bugs: #29976: Excessive Slave I/O errors in replication tests

Bug #29976	Excessive Slave I/O errors in replication tests
Submitted:	23 Jul 2007 9:53	Modified:	14 Jan 2008 17:26
Reporter:	Ingo Strüwing	Email Updates:
Status:	Closed	Impact on me:	None
Category:	Tests: Replication	Severity:	S7 (Test Cases)
Version:	5.1.21, 5.2.5	OS:	Linux (Debian x86)
Assigned to:	Alexey Kopytov	CPU Architecture:	Any
Tags:	pbwarn

Description:
These are random errors. The affected test cases vary. The errors do mainly show up on slower systems.

On my local system (Pentium 4, 1 GB, 1.8 GHz) the warnings file often looks similar to:

slave.err: rpl.rpl000017: 070723 10:28:46 [ERROR] Slave I/O thread couldn't register on master
slave.err: rpl.rpl_server_id1: 070723 10:42:42 [ERROR] Slave I/O thread couldn't register on master
slave.err: rpl.rpl_slave_skip: 070723 10:42:56 [ERROR] Slave I/O thread couldn't register on master

This stops every automatic script and requires much manual intervention.

These problems do also sometimes show up on pushbuild machines.

How to repeat:
See platform 'debx86-b' 'double whopper' on https://intranet.mysql.com/secure/pushbuild/showdir.pl?dir=mysql-5.2-engines or https://intranet.mysql.com/secure/pushbuild/showdir.pl?dir=mysql-5.1-engines

Or see platform 'powermacg5' -max on https://intranet.mysql.com/secure/pushbuild/showdir.pl?dir=mysql-5.1

This is a problem of the test framework.  If "the system" is too
heavily loaded then the slave can't connect to the master.

Magnus, can you have a look at this and try to figure out a solution
for this?

Actually, I wonder why this is happening.  I can imaging the following
causes:

1. Slave IO thread does not get CPU power
2. Network is too filled up with other TCP/IP packages
3. Master dump thread does not get CPU power

If it is 1 or 3, then it is the multi-tasking that starves a thread.
If this is the cause, then the test framework could monitor the CPU
and if it is above, lets say, 80% then the PB tests should be marked
with black (hardware failure) rather then with red (software failure).

If it is 2, then we could monitor the network and if over, lets say,
80% capacity (needs to be defined), then mark as black instead of red.

Hehe. Great advice. Like not implementing wait for lock, and telling the user to wait until noone else uses a table. :-/

Why don't we implement waiting of the slave I/O thread until the other threads reach the required state?

If there is some retry mechanism implemented, why is it an [ERROR], not a [NOTE]?

If you can tell for sure that this is nothing to worry about, you can add the message to the filter in mysql-test/lib/mtr_report.pl.

Ingo, your metaphor is grossly inaccurate and does not apply to this situation. Even table locks have time-outs and will give an error message if a lock cannot be acquired within a certain time limit (based on time-outs and retries).

In this case, if the slave cannot contact the master within a certain time limit, the timeout expires and hence the slave reports an error since the master does not reply. This is a correct error message and we cannot filter it out, since there can be other reasons for the failure.

In other words:

1. We are waiting for the master to reach the required state, but it fails
   to do that in time

2. We cannot filter out the error, or report it as a note, since the error
   correctly reflects a potential failure

2) The error occurs because the timeout is set too low when running the
   tests under pushbuild, since there the machines are heavily loaded

I see. In this case the timeout should be increased so that it does not appear on our test machines. Users could also have heavily loaded machines. And for me it happens on an idle machine, just running one test suite. But it is a slow one: uniprocessor 1.8GHz, 1.5GB RAM. Still such machines can be around at our users sites.

A failing test suite is very annoying for me because every failure (including warnings) breaks any automatic script. Every second or so day I have to restart a merge script due to this Slave I/O warning. This is what makes me slightly upset as I feel reluctance to fix it by people who know how to do it. As a team mate I suggest not to slow down each other unnecessarily.

The problem is easily reproducible on a single core machine by running any replication test with a low scheduling priority and a concurrent cpu-intensive load with a high priority (I used cpuburn), e.g.:

$ sudo nice --20 burnP6

In other session:

$ nice -19 ./mysql-test-run.pl rpl_ssl

Refined Problem Description
---------------------------
Slave registration on master sometimes take long time (since the
master CPU is too loaded).  There is then a too small timeout.

Related code
------------
The slave executes:
  slave.cc:register_slave_on_master() {
    ...
    simple_command(mysql, COM_REGISTER_SLAVE, buf, (size_t) (pos- buf), 0)
    ...
  }

The master responds by executing:
  repl_failsafe.cc:register_slave()
  (No real problems here, except that it takes too long 
  time to execute this, since master is too loaded.)

Suggested fix
-------------
When executing the simple_command in the slave, make sure to wait for
this for a longer time.

These are variables to control other timeouts for replication
connection:

   *  `--master-connect-retry=SECONDS'

     The number of seconds that the slave thread sleeps before trying
     to reconnect to the master in case the master goes down or the
     connection is lost. The value in the `master.info' file takes
     precedence if it can be read. If not set, the default is 60.
     Connection retries are not invoked until the slave times out
     reading data from the master according to the value of
     `--slave-net-timeout'. The number of reconnection attempts is
     limited by the `--master-retry-count' option.

   *  `--slave-net-timeout=SECONDS'

     The number of seconds to wait for more data from the master before
     the slave considers the connection broken, aborts the read, and
     tries to reconnect. The first retry occurs immediately after the
     timeout. The interval between retries is controlled by the
     `--master-connect-retry' option and the number of reconnection
     attempts is limited by the `--master-retry-count' option. The
     default is 3600 seconds (one hour).

Note that in later version, we have heartbeating so the
--slave-net-timeout value might be very low in many future production
systems.

I suggest to use, as value to the timeout, either
MAX(master-connect-retry, slave-net-timeout) or introduce a new
variable --slave-register-timeout.  The latter is probably best, since
it can then be more easily tested.

Here is the problem: net_clear() calls net_data_is_ready() which calls poll() or select() to see if the socket is ready for I/O. However, poll()/select() may fail with EINTR when a signal occurs.

We catch EINTR when doing the I/O itself by checking vio_should_retry() and restarting vio_read()/vio_write() up to net_retry_count times. The same must also be done in net_data_is_ready().

Fixing the above does not actually solve the problem, the error still occurs.

The real problem is much more simple. START SLAVE is asynchronous with respect to the replication threads, it only starts I/O and SQL threads without waiting for the necessary initialization steps to complete (for example, for the I/O thread to register on master). When a subsequent STOP SLAVE terminates the I/O thread, there are chances that the I/O thread is still registering on master. This aborts all network operations, causes the COM_REGISTER_SLAVE command to fail and results in the "couldn't register on master" error.

The chances of terminating the slave I/O thread in COM_REGISTER_SLAVE increase on a heavily loaded system, and on test cases with frequent START SLAVE/STOP SLAVE commands. This is the case for rpl_ssl:

# Do the same thing a number of times
disable_query_log;
disable_result_log;
let $i= 100;
while ($i)
{
  start slave;
  connection master;
  insert into t1 values (NULL);
  select * from t1; # Some variance
  connection slave;
  select * from t1; # Some variance
  stop slave;
  dec $i;
}

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/37738

ChangeSet@1.2589, 2007-11-14 14:42:40+03:00, kaa@polly.(none) +1 -0
  Fix for bug #29976: Excessive Slave I/O errors in replication tests
  
  Problem:
  
  The "Slave I/O thread couldn't register on master" error sporadically
  occurred in replication tests because the slave I/O thread got
  killed by STOP SLAVE before or while registering on master.
  
  Fixed by checking the state of the I/O thread, and issueing
  the error only if it was not explicitely killed by a user.

The problem with the above errors is a failure in the tests, not in the server. When starting a slave, the test should wait for the slave to start before proceeding, and when stopping a slave, the test should wait for the slave thread to stop before proceeding. There is a missing wait in "include/master-slave.inc" after the slave is started, which is what causes rpl000017, rpl_server_id1, and rpl_slave_skip to fail on a loaded machine since the test case begins with stopping the slave.

I really don't think it is a good idea to eliminate this error message even for this particular situation. Imagine the following scenario:

- The DBA does a CHANGE MASTER, but accidentally points it to the wrong master.
- The DBA does a START SLAVE to fetch some changes on the master, but
the I/O thread fails to register because there is no reply.
- The DBA does a STOP SLAVE and expects to see the changes, but there is none.
There is neither an error message that indicates a potential problem, so he
is no wiser.

In this situation, an error message should be generated indicating that the slave failed to register on the master before it was told to stop. We cannot know if this is the scenario at hand, so we should print some form of message if the (asynchronous) start failed, whatever the reason.

Specifically, since there is an easy patch that solves Ingo's immediate problems (i.e., adding a wait for slave start to master-slave-reset.inc), I see no reason to change the slave's behavior. Warnings are easily missed, especially for the slave threads, and any error generated as a result of a stop is something that the DBA or script can handle.

Aha. I understand this explanation. (I did not understand the former explanations, which pointed at the tests too). Now I beg that this is please fixed in the tests.

Thanks
Ingo

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/38292

ChangeSet@1.2589, 2007-11-22 19:22:54+03:00, kaa@polly.(none) +1 -0
  Fix for bug #29976: Excessive Slave I/O errors in replication tests
  
  Problem:
  
  The "Slave I/O thread couldn't register on master" error sporadically
  occurred in replication tests because the slave I/O thread got
  killed by STOP SLAVE before or while registering on master.
  
  Fixed by checking the state of the I/O thread, and issueing
  the error only if it was not explicitely killed by a user.

Pushed into 5.1.23-rc

Pushed into 6.0.5-alpha

This appears to be a test framework issue only, with nothing to document for end users.

Closed w/o further action.