Bug #52528 Protocol desynchronisation between client and server on a DDL workload
Submitted: 1 Apr 2010 15:17 Modified: 12 Jul 2010 8:04
Reporter: Philip Stoev Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:mysql-trunk,mysql-trunk-runtime-exp, 5.1, next-mr OS:Any
Assigned to: Assigned Account CPU Architecture:Any

[1 Apr 2010 15:17] Philip Stoev
Description:
When executing a RQG DDL/MDL workload, sometimes the server will forever remain waiting for a packet from the client, but the client does not appear to think that it is supposed to send such a packet, and vice versa.

The server backtrace looks like this:

Thread 3 (Thread 0x7fffc32ac950 (LWP 15171)):
#0  0x000000315b00df5b in read () from /lib64/libpthread.so.0
#1  0x0000000000a1755f in vio_read (vio=0x7fffbc1491c8, buf=0x7fffbc0ae598 "\001", size=4) at /1tb/bzr/mysql-trunk/vio/viosocket.c:44
#2  0x0000000000522e62 in my_real_read (net=0x7fffbc1462b8, complen=0x7fffc32abfc8) at /1tb/bzr/mysql-trunk/sql/net_serv.cc:836
#3  0x00000000005234e2 in my_net_read (net=0x7fffbc1462b8) at /1tb/bzr/mysql-trunk/sql/net_serv.cc:1026
#4  0x00000000005b3806 in do_command (thd=0x7fffbc1460b8) at /1tb/bzr/mysql-trunk/sql/sql_parse.cc:656
#5  0x000000000067e584 in do_handle_one_connection (thd_arg=0x7fffbc1460b8) at /1tb/bzr/mysql-trunk/sql/sql_connect.cc:1174
#6  0x000000000067e653 in handle_one_connection (arg=0x7fffbc1460b8) at /1tb/bzr/mysql-trunk/sql/sql_connect.cc:1113
#7  0x000000315b0073da in start_thread () from /lib64/libpthread.so.0
#8  0x000000315a4e627d in clone () from /lib64/libc.so.6

SHOW PROCESSLIST looks like this:

*************************** 2. row ***************************
     Id: 32
   User: root
   Host: localhost:51720
     db: test
Command: Killed
   Time: 345
  State:
   Info: NULL

As it is visible from SHOW PROCESSLIST, applying KILL did not achieve anything, just flagged the connection as "Killed".

How to repeat:
This is easily reproducible with the RQG with the following command line:

perl runall.pl \
--grammar=conf/runtime/WL5004_sql.yy \
--gendata=conf/runtime/WL5004_data.zz \
--queries=1M \
--duration=120 \
--reporters=Deadlock,ErrorLog,Backtrace,Shutdown \
--mysqld=--innodb-lock-wait-timeout=1 \
--mysqld=--loose-lock-wait-timeout=1 \
--mysqld=--secure-file-priv=/tmp/ \
--engine=MyISAM \
--rows=1 \
--threads=16 \
--mysqld=--transaction-isolation=READ-COMMITTED \
--mysqld=--log-output=table,file \
--mysqld=--innodb_flush_log_at_trx_commit=1 \
--mask-level=1 \
--mysqld=--big-tables \
--mysqld=--open_files_limit=10 \
--mysqld=--table_cache=10 \
--mysqld=--tmp_table_size=1K \
--mask=55166 \
--queries=100000000 \
--basedir=/1tb/bzr/mysql-trunk-runtime-exp \
--seed=time 

Ignore any output from the test, but notice that the final lines will be 

# 2010-04-01T18:10:37 Child process completed successfully.
# 2010-04-01T18:10:37 Child process completed successfully.

but the test will not terminate. The reason is that some RQG testing children still expect packets from mysql, which never arrive.
[1 Apr 2010 15:46] Philip Stoev
Last queries from the thread that handged:

                  152 Query     SET SESSION SQL_MODE=''
                  152 Query     SAVEPOINT A
                  152 Query     SAVEPOINT B
                  152 Query     FLUSH TABLE testdb_S . t1_temp1_S
                  152 Query     COMMIT WORK
                  152 Query     UNLOCK TABLES
                  152 Query     FLUSH TABLE testdb_S . t1_view1_N
                  152 Query     ROLLBACK WORK
                  152 Query     COMMIT WORK  NO RELEASE
                  152 Query     CREATE TRIGGER testdb_N . tr1_1_N  BEFORE DELETE ON t1_base2_N FOR EACH ROW BEGIN REPLACE   testdb_N . t1_temp2_N  ( `col_int` ) VALUES ( 6 ) ; END
                  152 Query     SELECT * FROM test.table1_int_autoinc INTO OUTFILE '/tmp/gentest18505.tmp'
                  152 Query     LOAD DATA  INFILE '/tmp/gentest18505.tmp'  INTO TABLE testdb_S . t1_temp1_S
                  152 Query     START TRANSACTION
                  152 Query     SELECT * FROM test.table1_int_autoinc INTO OUTFILE '/tmp/gentest18505.tmp'
                  152 Query     LOAD DATA  INFILE '/tmp/gentest18505.tmp'  INTO TABLE testdb_N . t1_base1_N
                  152 Query     /* Sequence start */ CREATE FUNCTION testdb_S . f1_2_S  () RETURNS INTEGER RETURN ( SELECT MOD( COUNT( DISTINCT `pk` ) , 10 ) FROM testdb_S . t1_view2_S  ) ; COMMIT ; SELECT SLEEP( 0.5 * 0.94 * 1 ) ; DROP FUNCTION testdb_S . f1_2_S /* Sequence end */

core,binary and logs:

http://mysql-systemqa.s3.amazonaws.com/var-bug52528.zip

source:

revision-id: kostja@sun.com-20100327120632-xvhcrtay402f8x52
date: 2010-03-27 15:06:32 +0300
build-date: 2010-04-01 18:41:13 +0300
revno: 3138
branch-nick: mysql-trunk-runtime-exp
[13 Apr 2010 10:02] Philip Stoev
Unfortunately this issue does not lend itself to automatic simplification, even if the RQG command line above reproduces it easily.

It is possible that KILL and/or COMMIT|ROLLBACK RELEASE, which cause the RQG to establish a fresh connection, are required to reproduce.
[13 Apr 2010 15:40] Sveta Smirnova
Bug #48515 can be duplicate of this one.
[21 Apr 2010 8:56] Sveta Smirnova
See also bug #50399

Bug #19610 was marked as duplicate of this one.
[11 Jul 2010 20:56] Elena Stepanova
After running some additional tests, I am quite certain that Philip´s initial RQG test and my MTR test used to hit the same problem: when an idle connection got killed, it did not disappear but stayed in 'Killed' forever. 

The RQG grammar contains KILL QUERY and KILL statements. In a number of failures that I checked, a hanging connection <ID> always appeared after some other connection(s) ran KILL QUERY <ID> followed by KILL <ID>. The race condition apparently is that at the moment of 'KILL' the previous query has to be already killed, but a new one should not start yet.

The MTR test was initially designed to kill an idle thread, so it failed deterministically. I have modified it a little to avoid confusion about pass/fail, and to remove explicit sleeps. It can be found below as test case 'Kill Idle'.

Both tests fail on 5.5 cmake builds before the change http://lists.mysql.com/commits/104251, and do not fail on builds which contain this change (mid April, depending on the tree).

Bug#19610, which is marked as a duplicate of this one, initially reports two part of the problem: first, idle connection do not disappear after wait/interactive_timeout, and second, these connections, while killed manually get stuck in 'Killed'. The second part is exactly the same as we observed.

However, while running tests on a SUSE 10.2 box, I could not get the failure on *normal* 5.1 builds, neither on source code from March-April, nor on the current trees. 

Both the life-after-timeout and the stay-in-killed problems are easily reproducible if I compile 5.1 explicitly with NO_ALARM (and disappear again if I add SIGNAL_WITH_VIO_CLOSE); I'm not sure however it is a valid experiment. 
Just in case, the test which shows both issues on a 5.1 "NO_ALARM" build can be found below as 'Kill Timed Out' test case. 

#---------------------
# Kill Idle test case 
# Should fail on a 5.5 cmake build before http://lists.mysql.com/commits/104251
# and pass after the change 

connect(con1,localhost,root,,test,,);
let $con1_id = `SELECT CONNECTION_ID()`;
connection default;
let $wait_condition = SELECT command = 'Sleep'
        FROM information_schema.processlist
        WHERE id = $con1_id;
--source include/wait_condition.inc
eval KILL $con1_id;
let $wait_timeout = 5;
let $wait_condition = SELECT command = 'Killed' AND time > 3
        FROM information_schema.processlist
        WHERE id = $con1_id;
--source include/wait_condition.inc

if ($success)
{
        SHOW PROCESSLIST;
        --die # Hanging connection detected
}
--echo # No hanging connections detected
--exit

# EOF (Kill Idle test case)

#---------------------
# Kill Timed Out test case

connect(con1,localhost,root,,test,,);
SET @@wait_timeout=3;
SET @@interactive_timeout=3;
let $con1_id = `SELECT CONNECTION_ID()`;

connection default;
let $wait_timeout=6;
let $wait_condition = SELECT time > 4 FROM information_schema.processlist WHERE id = $con1_id;
--source include/wait_condition.inc
if (!$success)
{
        --echo # No timed out connections detected, nothing to kill
        --exit
}
--echo # Timed out connection detected:
SHOW PROCESSLIST;
eval KILL $con1_id;
let $wait_timeout = 3;
let $wait_condition = SELECT time > 6 AND command = 'Killed' FROM information_schema.processlist WHERE id = $con1_id;
--source include/wait_condition.inc
if (!$success)
{
        --echo # No hanging connections detected
        --die # Timed out connection, but no hang up
}
SHOW PROCESSLIST;
--die # Timed out and hanging connection detected

# EOF (Kill Timed Out test case)
[12 Jul 2010 8:04] Dmitry Shulga
This is a duplicate of Bug#37780