MySQL Bugs: #52528: Protocol desynchronisation between client and server on a DDL workload

Bug #52528	Protocol desynchronisation between client and server on a DDL workload
Submitted:	1 Apr 2010 15:17	Modified:	12 Jul 2010 8:04
Reporter:	Philip Stoev	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	mysql-trunk,mysql-trunk-runtime-exp, 5.1, next-mr	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
When executing a RQG DDL/MDL workload, sometimes the server will forever remain waiting for a packet from the client, but the client does not appear to think that it is supposed to send such a packet, and vice versa.

The server backtrace looks like this:

Thread 3 (Thread 0x7fffc32ac950 (LWP 15171)):
#0  0x000000315b00df5b in read () from /lib64/libpthread.so.0
#1  0x0000000000a1755f in vio_read (vio=0x7fffbc1491c8, buf=0x7fffbc0ae598 "\001", size=4) at /1tb/bzr/mysql-trunk/vio/viosocket.c:44
#2  0x0000000000522e62 in my_real_read (net=0x7fffbc1462b8, complen=0x7fffc32abfc8) at /1tb/bzr/mysql-trunk/sql/net_serv.cc:836
#3  0x00000000005234e2 in my_net_read (net=0x7fffbc1462b8) at /1tb/bzr/mysql-trunk/sql/net_serv.cc:1026
#4  0x00000000005b3806 in do_command (thd=0x7fffbc1460b8) at /1tb/bzr/mysql-trunk/sql/sql_parse.cc:656
#5  0x000000000067e584 in do_handle_one_connection (thd_arg=0x7fffbc1460b8) at /1tb/bzr/mysql-trunk/sql/sql_connect.cc:1174
#6  0x000000000067e653 in handle_one_connection (arg=0x7fffbc1460b8) at /1tb/bzr/mysql-trunk/sql/sql_connect.cc:1113
#7  0x000000315b0073da in start_thread () from /lib64/libpthread.so.0
#8  0x000000315a4e627d in clone () from /lib64/libc.so.6

SHOW PROCESSLIST looks like this:

*************************** 2. row ***************************
     Id: 32
   User: root
   Host: localhost:51720
     db: test
Command: Killed
   Time: 345
  State:
   Info: NULL

As it is visible from SHOW PROCESSLIST, applying KILL did not achieve anything, just flagged the connection as "Killed".

How to repeat:
This is easily reproducible with the RQG with the following command line:

perl runall.pl \
--grammar=conf/runtime/WL5004_sql.yy \
--gendata=conf/runtime/WL5004_data.zz \
--queries=1M \
--duration=120 \
--reporters=Deadlock,ErrorLog,Backtrace,Shutdown \
--mysqld=--innodb-lock-wait-timeout=1 \
--mysqld=--loose-lock-wait-timeout=1 \
--mysqld=--secure-file-priv=/tmp/ \
--engine=MyISAM \
--rows=1 \
--threads=16 \
--mysqld=--transaction-isolation=READ-COMMITTED \
--mysqld=--log-output=table,file \
--mysqld=--innodb_flush_log_at_trx_commit=1 \
--mask-level=1 \
--mysqld=--big-tables \
--mysqld=--open_files_limit=10 \
--mysqld=--table_cache=10 \
--mysqld=--tmp_table_size=1K \
--mask=55166 \
--queries=100000000 \
--basedir=/1tb/bzr/mysql-trunk-runtime-exp \
--seed=time 

Ignore any output from the test, but notice that the final lines will be 

# 2010-04-01T18:10:37 Child process completed successfully.
# 2010-04-01T18:10:37 Child process completed successfully.

but the test will not terminate. The reason is that some RQG testing children still expect packets from mysql, which never arrive.

Last queries from the thread that handged:

                  152 Query     SET SESSION SQL_MODE=''
                  152 Query     SAVEPOINT A
                  152 Query     SAVEPOINT B
                  152 Query     FLUSH TABLE testdb_S . t1_temp1_S
                  152 Query     COMMIT WORK
                  152 Query     UNLOCK TABLES
                  152 Query     FLUSH TABLE testdb_S . t1_view1_N
                  152 Query     ROLLBACK WORK
                  152 Query     COMMIT WORK  NO RELEASE
                  152 Query     CREATE TRIGGER testdb_N . tr1_1_N  BEFORE DELETE ON t1_base2_N FOR EACH ROW BEGIN REPLACE   testdb_N . t1_temp2_N  ( `col_int` ) VALUES ( 6 ) ; END
                  152 Query     SELECT * FROM test.table1_int_autoinc INTO OUTFILE '/tmp/gentest18505.tmp'
                  152 Query     LOAD DATA  INFILE '/tmp/gentest18505.tmp'  INTO TABLE testdb_S . t1_temp1_S
                  152 Query     START TRANSACTION
                  152 Query     SELECT * FROM test.table1_int_autoinc INTO OUTFILE '/tmp/gentest18505.tmp'
                  152 Query     LOAD DATA  INFILE '/tmp/gentest18505.tmp'  INTO TABLE testdb_N . t1_base1_N
                  152 Query     /* Sequence start */ CREATE FUNCTION testdb_S . f1_2_S  () RETURNS INTEGER RETURN ( SELECT MOD( COUNT( DISTINCT `pk` ) , 10 ) FROM testdb_S . t1_view2_S  ) ; COMMIT ; SELECT SLEEP( 0.5 * 0.94 * 1 ) ; DROP FUNCTION testdb_S . f1_2_S /* Sequence end */

core,binary and logs:

http://mysql-systemqa.s3.amazonaws.com/var-bug52528.zip

source:

revision-id: kostja@sun.com-20100327120632-xvhcrtay402f8x52
date: 2010-03-27 15:06:32 +0300
build-date: 2010-04-01 18:41:13 +0300
revno: 3138
branch-nick: mysql-trunk-runtime-exp

Unfortunately this issue does not lend itself to automatic simplification, even if the RQG command line above reproduces it easily.

It is possible that KILL and/or COMMIT|ROLLBACK RELEASE, which cause the RQG to establish a fresh connection, are required to reproduce.

Bug #48515 can be duplicate of this one.

See also bug #50399

Bug #19610 was marked as duplicate of this one.

After running some additional tests, I am quite certain that Philip´s initial RQG test and my MTR test used to hit the same problem: when an idle connection got killed, it did not disappear but stayed in 'Killed' forever. 

The RQG grammar contains KILL QUERY and KILL statements. In a number of failures that I checked, a hanging connection <ID> always appeared after some other connection(s) ran KILL QUERY <ID> followed by KILL <ID>. The race condition apparently is that at the moment of 'KILL' the previous query has to be already killed, but a new one should not start yet.

The MTR test was initially designed to kill an idle thread, so it failed deterministically. I have modified it a little to avoid confusion about pass/fail, and to remove explicit sleeps. It can be found below as test case 'Kill Idle'.

Both tests fail on 5.5 cmake builds before the change http://lists.mysql.com/commits/104251, and do not fail on builds which contain this change (mid April, depending on the tree).

Bug#19610, which is marked as a duplicate of this one, initially reports two part of the problem: first, idle connection do not disappear after wait/interactive_timeout, and second, these connections, while killed manually get stuck in 'Killed'. The second part is exactly the same as we observed.

However, while running tests on a SUSE 10.2 box, I could not get the failure on *normal* 5.1 builds, neither on source code from March-April, nor on the current trees. 

Both the life-after-timeout and the stay-in-killed problems are easily reproducible if I compile 5.1 explicitly with NO_ALARM (and disappear again if I add SIGNAL_WITH_VIO_CLOSE); I'm not sure however it is a valid experiment. 
Just in case, the test which shows both issues on a 5.1 "NO_ALARM" build can be found below as 'Kill Timed Out' test case. 

#---------------------
# Kill Idle test case 
# Should fail on a 5.5 cmake build before http://lists.mysql.com/commits/104251
# and pass after the change 

connect(con1,localhost,root,,test,,);
let $con1_id = `SELECT CONNECTION_ID()`;
connection default;
let $wait_condition = SELECT command = 'Sleep'
        FROM information_schema.processlist
        WHERE id = $con1_id;
--source include/wait_condition.inc
eval KILL $con1_id;
let $wait_timeout = 5;
let $wait_condition = SELECT command = 'Killed' AND time > 3
        FROM information_schema.processlist
        WHERE id = $con1_id;
--source include/wait_condition.inc

if ($success)
{
        SHOW PROCESSLIST;
        --die # Hanging connection detected
}
--echo # No hanging connections detected
--exit

# EOF (Kill Idle test case)

#---------------------
# Kill Timed Out test case

connect(con1,localhost,root,,test,,);
SET @@wait_timeout=3;
SET @@interactive_timeout=3;
let $con1_id = `SELECT CONNECTION_ID()`;

connection default;
let $wait_timeout=6;
let $wait_condition = SELECT time > 4 FROM information_schema.processlist WHERE id = $con1_id;
--source include/wait_condition.inc
if (!$success)
{
        --echo # No timed out connections detected, nothing to kill
        --exit
}
--echo # Timed out connection detected:
SHOW PROCESSLIST;
eval KILL $con1_id;
let $wait_timeout = 3;
let $wait_condition = SELECT time > 6 AND command = 'Killed' FROM information_schema.processlist WHERE id = $con1_id;
--source include/wait_condition.inc
if (!$success)
{
        --echo # No hanging connections detected
        --die # Timed out connection, but no hang up
}
SHOW PROCESSLIST;
--die # Timed out and hanging connection detected

# EOF (Kill Timed Out test case)

This is a duplicate of Bug#37780