Bug #34050 I/O thread disconnects/killed when replicating under load with partition tables
Submitted: 25 Jan 2008 4:40          Modified: 25 Jul 2008 19:31
Reporter: Omer Barnir (OCA)          Status: Can't repeat
Category: MySQL Server: Replication  Severity: S2 (Serious)
Version: 5.1.23                      OS: Any
Assigned to: Mats Kindahl            CPU Architecture: Any

[25 Jan 2008 4:40] Omer Barnir
Description:
When running a load with 100 concurrent users, where some of the users execute calls to the following stored procedure:

delimiter |;

CREATE PROCEDURE test.pinsdel2()
BEGIN
   DECLARE ins_count INT DEFAULT 100; #<--This should be changed to 1,000,000
   DECLARE del_count INT;
   DECLARE cur_user VARCHAR(255);
   DECLARE local_uuid VARCHAR(255);
   DECLARE local_time TIMESTAMP;

   SET local_time= NOW();
   SET cur_user = CURRENT_USER();
   SET local_uuid=UUID();

   WHILE ins_count > 0 DO
       INSERT INTO test.pinsdel2_tbl VALUES (NULL, local_time, cur_user, local_uuid,
                                    ins_count,'Going to test SBR for MySQL');
     SET ins_count = ins_count - 1;
   END WHILE;

   SELECT MAX(id) FROM test.pinsdel2_tbl INTO del_count;
   WHILE del_count > 0 DO
     DELETE FROM test.pinsdel2_tbl WHERE id = del_count;
   SET del_count = del_count - 2;
   END WHILE;
END|

delimiter ;|
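
For reference, each of the concurrent connections in the load simply calls the
procedure; a minimal sketch of one client session (the exact driver lives in
the attached stress test files, so the session below is only an assumption):

   -- Hypothetical client session; in the actual test the procedure is driven
   -- through mysql-test-run.pl rather than a plain client.
   USE test;
   CALL test.pinsdel2();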

The replication I/O thread disconnects every few seconds and then dies completely when the test.pinsdel2_tbl table is defined as:

CREATE TABLE test.pinsdel2_tbl(id MEDIUMINT NOT NULL AUTO_INCREMENT,
                           dt TIMESTAMP, user CHAR(255), uuidf LONGBLOB,
                           fkid MEDIUMINT, filler VARCHAR(255),
                           PRIMARY KEY(id)) ENGINE=InnoDB
                                PARTITION BY RANGE(id)
                                SUBPARTITION BY HASH(id) SUBPARTITIONS 2
                                (PARTITION pa1 VALUES LESS THAN (10),
                                 PARTITION pa2 VALUES LESS THAN (20),
                                 PARTITION pa3 VALUES LESS THAN (30),
                                 PARTITION pa4 VALUES LESS THAN (40),
                                 PARTITION pa5 VALUES LESS THAN (50),
                                 PARTITION pa6 VALUES LESS THAN (60),
                                 PARTITION pa7 VALUES LESS THAN (70),
                                 PARTITION pa8 VALUES LESS THAN (80),
                                 PARTITION pa9 VALUES LESS THAN (90),
                                 PARTITION pa10 VALUES LESS THAN (100),
                                 PARTITION pa11 VALUES LESS THAN MAXVALUE);

The problem is not observed when the table is defined without partitions, i.e.:

CREATE TABLE test.pinsdel2_tbl(id MEDIUMINT NOT NULL AUTO_INCREMENT,
                           dt TIMESTAMP, user CHAR(255), uuidf LONGBLOB,
                           fkid MEDIUMINT, filler VARCHAR(255),
                           PRIMARY KEY(id)) ENGINE=InnoDB;

The mode of replication does not seem to play a role in the above.
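
For completeness, the two modes can be compared by switching the binary log
format on the master between runs; a minimal sketch, assuming a 5.1 server
where binlog_format is dynamic and the SUPER privilege is available:

   -- Sketch only: switch the master's binary log format between runs.
   -- New sessions pick up the new format; open sessions keep the old one.
   SET GLOBAL binlog_format = 'STATEMENT';
   -- ... run the load and check slave.err ...
   SET GLOBAL binlog_format = 'ROW';
   -- ... rerun the load and check slave.err again ...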

How to repeat:
Isolation in progress, steps to be posted

Suggested fix:
The I/O thread should not experience these disconnects.
[30 Jan 2008 0:56] Omer Barnir
How to repeat
=============
1) download the attached tar.gz file and extract it in the mysql-test directory
2) Start the server with:
   perl ./mysql-test-run.pl --suite=rpl --do-test=rpl_alter 
                            --mysqld=--innodb --start-and-exit 
   note: 'rpl_alter' is used so that both master and slave will be started
         (it is not related to the test itself)
3) Using the client, log into the slave and issue a 'start slave' command
   >> Verify using 'show slave status' that replication is running (see the
      sketch after these steps)
4) Start the 'stress test' using the following command
    perl ./mysql-test-run.pl --extern --stress --stress-init-file=rpl_init.txt 
         --stress-test-file=rpl_sys_test.txt --stress-threads=100 
         --stress-test-duration=600 --user=root 
         --socket=<path_to_mysql-test_dir>/var/tmp/master.sock 
    This will start a 10-minute stress test with 100 concurrent connections.
    (On the screen you will see messages like
        test_loop[0:0 0:4708]: TID 10 test: 'rpl_row_sys_pinsdel2'
        Errors: No Errors. Test Passed OK)
5) Once the test is completed, check the slave.err file in the var/log
   directory. You will notice the I/O thread disconnecting and reconnecting a
   few times before being killed permanently.
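
The commands behind step 3 and the check above amount to the following (a
minimal sketch; the columns to watch in the SHOW SLAVE STATUS output are
Slave_IO_Running and Slave_SQL_Running):

   -- On the slave (step 3): start replication and confirm both threads run.
   START SLAVE;
   SHOW SLAVE STATUS\G
   -- Expect Slave_IO_Running: Yes and Slave_SQL_Running: Yes before launching
   -- the stress run, and check them again after step 5.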

If the same test is run when the tables are not partitioned (see
t/rpl_setup.test for more details), the disconnects are not observed.

The replication 'mode' does not seem to affect this test case.
[30 Jan 2008 0:57] Omer Barnir
test files for load test case

Attachment: files.tar.gz (application/x-gzip, text), 3.57 KiB.

[1 Feb 2008 0:32] Omer Barnir
To clarify: the problem was reported with a test run against 5.1.23, but it is also observed when running this test against 5.1.22.
[25 Jul 2008 14:25] MySQL Verification Team
Failed to repeat the bug by calling the SP in 100 threads; got no disconnections. See the attachment for all info.

Attachment: bug34050_not_repeated_infos.txt (text/plain), 49.52 KiB.

[25 Jul 2008 14:51] MySQL Verification Team
When I rerun the test with --slave_net_timeout=2, the I/O thread does fail a few times due to the high load on the box:

[Note] Slave I/O thread: connected to master 'root@127.0.0.1:3306',replication started in log 'xp64-bin.000002' at position 6123850
[ERROR] Slave I/O: error reconnecting to master 'root@127.0.0.1:3306' - retry-time: 60  retries: 86400, Error_code: 1159
[Note] Slave: connected to master 'root@127.0.0.1:3306',replication resumed in log 'xp64-bin.000002' at position 6295775
[ERROR] Slave I/O: error reconnecting to master 'root@127.0.0.1:3306' - retry-time: 60  retries: 86400, Error_code: 1159
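
For reference, the timeout and retry values visible in the log lines above can
be set as follows (a sketch, assuming a 5.1 slave; "retry-time: 60" corresponds
to the MASTER_CONNECT_RETRY option of CHANGE MASTER, and the retries value to
the server's --master-retry-count option):

   -- Sketch: shorten the I/O thread's read timeout to provoke reconnects
   -- under heavy load; equivalent to starting the slave with
   -- --slave_net_timeout=2 as described above.
   STOP SLAVE;
   SET GLOBAL slave_net_timeout = 2;
   CHANGE MASTER TO MASTER_CONNECT_RETRY = 60;  -- matches "retry-time: 60"
   START SLAVE;
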
[25 Jul 2008 19:31] Omer Barnir
I believe this issue is related more to the load on the machine than directly
to the partitioned tables. It did show in a configuration where the master and
slave were running on the same box under load. Increasing the connect values
(the opposite of what Shane did) decreased the overall number of
disconnects/reconnects, and the problem did not show.

Also, this was not observed in tests run lately.

Based on the above, I am setting the bug to 'Can't repeat' (at least until I
run into it again).
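
A sketch of the kind of adjustment meant by "increasing the connect values"
above; the specific number is an assumption, not a value taken from the
original runs:

   -- Sketch only: give the I/O thread a longer read timeout so a master that
   -- is briefly stalled under load is not treated as gone.
   STOP SLAVE;
   SET GLOBAL slave_net_timeout = 3600;   -- back toward the 5.1 default
   START SLAVE;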