Bug #42503 "Lost connection" errors when using compression protocol
Submitted: 31 Jan 22:09 Modified: 18 Aug 9:53
Reporter: Tech Magos
Status: Verified
Category:Server Severity:S2 (Serious)
Version:5.1.34, 5.1 bzr OS:Linux
Assigned to: Lars-Erik Bjørk Target Version:5.1+
Triage: Triaged: D1 (Critical) / R2 (Low) / E3 (Medium)

[31 Jan 22:09] Tech Magos
Description:
Doing a full retrieve:

select NumID,StringMap from SimpleTable

from a simple innodb table of numbers (mediumint) to strings (varchar(80)) leads to lost
connection when the table is 700K rows.

When the table has less rows, it works fine.

The problem is resolved when switching OFF compression for the connection.

max_allowed_packet was set up to 400M but the problem still occurred. And it worked with
much smaller max_allowed_packet, when compression was off.

Not running compression from the client is not a sustainable option, as we need it for
remote clients to be more efficient.

How to repeat:
Create an innodb table of 700K rows of medium int to varchar(80)

Run above query with compression enabled in the connection.
[1 Feb 11:47] Valeriy Kravchuk
Thank you for a problem report. Please, check your in server's error log if there was a
server crash at that time. What version of client do you use? Check with mysql --version,
please.
[1 Feb 15:38] Tech Magos
Server reports nothing in the logs. It seems to have no idea of the problem. No server
crash either.

mysql client Ver 14.14, 5.1.30 (linux); from same version as the server.

Note exactly the same behaviour is seen by Windows client code, connecting by direct use
of the client API. I.e. if compression is on, library returns lost connection, after call
to mysq_real_query call. If compression is off, then query works for win client code too.
[1 Feb 20:58] Miguel Solorzano
I couldn't repeat with current source server. Could you please provide your my.cnf file
and create table statement. Thanks in advance.
[2 Feb 19:29] Tech Magos
[mysqld]
slow_query_log
slow_query_log_file=/path/slow.log
skip-safemalloc
skip-name-resolve
skip-locking
default-character-set=latin1
default-storage-engine=INNODB
max_allowed_packet=250M
query_cache_type=1
query_cache_size=800M
query_cache_limit=100M
max_connections=150
max_connect_errors=100
wait_timeout=3600
connect_timeout=16
net_read_timeout=50
net_write_timeout=100
interactive_timeout=300

innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2
innodb_buffer_pool_size = 16000MB
innodb_log_files_in_group = 2
innodb_log_file_size = 320M
innodb_log_buffer_size=16M
innodb_data_file_path = ibdata1:100M:autoextend
innodb_additional_mem_pool_size=25M
innodb_file_per_table=1
innodb_open_files = 2048
innodb_max_dirty_pages_pct=95
innodb_checksums=0
innodb_support_xa=0
innodb_doublewrite=0
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit = 2
innodb_lock_wait_timeout=50
innodb_thread_concurrency=16
innodb_commit_concurrency=20
innodb_read_io_threads=16
innodb_write_io_threads=4

key_buffer_size=256M
sort_buffer_size=8M
myisam_sort_buffer_size=8M
read_rnd_buffer_size=32M
read_buffer_size=3M
join_buffer_size=16M
tmp_table_size=256M
max_heap_table_size=256M
max_tmp_tables=100

thread_cache_size=128
thread_concurrency=16

[safe_mysqld]
timezone=GMT

CREATE TABLE `ID2NAME`
(
 `Id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
 `Name` varchar(255) NOT NULL,
  PRIMARY KEY (`Id`),
  UNIQUE KEY `iName` (`Name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
[12 Feb 11:41] Sveta Smirnova
Thank you for the feedback.

I can not repeat described behavior. Please try current version 5.1.31 and inform us if
problem still exists.
[13 Mar 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[22 May 13:40] Shane Bester
the might be verified in bug #45031 . crash is very hard to repeat, and depends on certain
properties of the compressed stream.  hence, using valgrind is needed is alot of times to
catch the problem.
[29 May 0:39] Tech Magos
This is a real issue and the last cross-reference seems to suggest where it is (valgrind
report)

It only seems to happen when requested data is pretty large; e.g. 300K rows or more of
data of 6 20 byte columns or so.

It does not seem to be related to some memory exhaustion problem, as when the compression
protocol is off, then this works fine.

The valgrind report on the cross-referenced issue suggests that this is due to a
memory-related bug, i.e. trying to read off bounds.

Could you please check if things work ok in the case
- The client is 32-bit and server is 64-bit; i recall seeeing complen in my_real_read
being declared size_t; this is probably different size in 32 and 64 bit.
- Could it be that the size of the compressed block is not always well defined or aligned
so the reader gets confused? And this happens only when we get very large data returned?
- It is not clear to me the problem originates on the server or the client.
The cross-referenced bug report suggests this happens on the client side.
Server reports nothing on the logs when the error hits.
- When it happens, it occurs from both linux 64-client, 32-bit client and windows 32-bit
client.
[2 Jun 0:41] Tech Magos
This bug occurs also in 5.1.34 and seems to be also connected to the query  cache. Here
are some important findings:

- The problem happens AFTER the first time the query is run, i.e. after it stores itself
in the query cache. First time, the query works fine.
- If query cache is disabled, the bug never happens.
- If SQL_NO_CACHE is placed after the SELECT, the bug never happens.
- After the problem happens, if RESET QUERY CACHE is run, the problem does not happen for
run of the query after the reset. It does happens again for runs after that.
- Seems that if qcache is 1 and the query result does not fit inside the qcache (e.g. if
query_cache_limit is small), then the query never fails

So the problem seems to appear when the compressed protocol is used AND the query
produces a large result that is already cached in the query cache.

It is worth mentioning that when the problem appears the server does pipe data through
the channel of the same size as when the query succeeds(as SHOW GLOBAL STATUS reveals). 

For 5.1.34, when valgrind is run on client (mysql) or server (mysqld), no issues are
reported. The valgrind problem as mentioned in the cross-referenced bug related to 6.x is
NOT reported.

So it looks like there is probably malformed data produced when the query result is
compressed from the query cache.
[30 Jun 8:35] Sveta Smirnova
Thank you for the feedback.

Verified as described.

To repeat:

1. filled table with insert into ID2NAME (Name) values(md5(rand()));, then different
combinations of  insert into ID2NAME (Name) select md5(rand()) from ID2NAME;, insert into
ID2NAME (Name) select concat(Id, md5(rand()), Id) from ID2NAME; etc.

2. mysql> select count(*) from ID2NAME;
+----------+
| count(*) |
+----------+
|  1048576 | 
+----------+
1 row in set (5.89 sec)

3. Run following command:

while ./bin/mysql -hsol10-sparc-c -P33051 -uroot -C test -e "select Id, Name from
ID2NAME" >/dev/null; do echo 1; done

Server started as:

./libexec/mysqld --defaults-file=support-files/my-small.cnf --basedir=. --datadir=./data
--port=33051 --socket=/tmp/mysql_ssmirnova.sock -O thread_stack=1280000 --log-error
--query_cache_type=1 --query_cache_size=800M --query_cache_limit=100M &
[30 Jun 8:35] Sveta Smirnova
Backtrace in my env:

(dbx) where
current thread: t@13
  [1] __lwp_kill(0x0, 0x6, 0xff06cbc0, 0xfef21c00, 0xff06c388, 0x6), at 0xff04155c 
=>[2] my_write_core(sig = 6), line 310 in "stacktrace.c"
  [3] handle_segfault(sig = 6), line 2536 in "mysqld.cc"
  [4] __sighndlr(0x6, 0x0, 0xfe638ce0, 0x270964, 0x0, 0x0), at 0xff040494 
  ---- called from signal handler with signal 6 (SIGABRT) ------
  [5] __lwp_kill(0x0, 0x6, 0x85274, 0xff3b3904, 0xff06fb20, 0x0), at 0xff04155c 
  [6] raise(0x6, 0x6, 0x5, 0x6, 0x32aa0, 0x0), at 0xfefdfef8 
  [7] abort(0xfe639058, 0xff0700b8, 0xff06928c, 0xa8274, 0xff068288, 0xff070904), at
0xfefc00d8 
  [8] __assert(0x9c1c50, 0x9c1bd0, 0x212, 0xfe63930c, 0xa7f90, 0xb02750), at 0xfefc0354 
  [9] Diagnostics_area::disable_status(this = 0x33b92568), line 530 in "sql_class.cc"
  [10] Query_cache::send_result_to_client(this = 0xc2c480, thd = 0x33b91808, sql =
0x33bc0840 "select Id, Name from ID2NAME", query_length = 28U), line 1527 in
"sql_cache.cc"
  [11] mysql_parse(thd = 0x33b91808, inBuf = 0x33bc0840 "select Id, Name from ID2NAME",
length = 28U, found_semicolon = 0xfe639cf0), line 5886 in "sql_parse.cc"
  [12] dispatch_command(command = COM_QUERY, thd = 0x33b91808, packet = 0x33ba412d
"select Id, Name from ID2NAME", packet_length = 28U), line 1213 in "sql_parse.cc"
  [13] do_command(thd = 0x33b91808), line 854 in "sql_parse.cc"
  [14] handle_one_connection(arg = 0x33b91808), line 1115 in "sql_connect.cc"
(dbx)
[30 Jun 8:49] Sveta Smirnova
If don't use compression protocol, run same client test, then kill thread connected
several times server crashes with same trace as well.
[18 Aug 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[18 Aug 18:49] Konstantin Osipov
This can easily lead to disconnect on production.
[19 Aug 11:49] Andrei Elkin
A failure
http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=2281048

can relate.

Some details on that crash:
test-max-win_ws2008-x86_64  loki05 ps_stm

innodb                         [ fail ]

mysqltest: At line 1001: query 'update t1 set c=a' failed: 2013: Lost connection to MySQL
server during query
[24 Aug 18:18] Sinisa Milivojevic
Sveta,

Please check whether this bug is fixed with a patch made for a bug #45031.

Thanks ...
[25 Aug 9:54] Sveta Smirnova
Bug is still repeatable after fix of bug #45031.

Please note

 [10] Query_cache::send_result_to_client(this = 0xc30888, thd = 0x33b95d88, sql =
0x33bc3f80 "select Id, Name from ID2NAME", query_length = 28U), line 1624 in
"sql_cache.cc"

in the trace. Also bug is not repeatable with query cache disabled. So this is
compression + query cache problem.