MySQL Bugs: #42503: "Lost connection" errors when using compression protocol

Bug #42503	"Lost connection" errors when using compression protocol
Submitted:	31 Jan 2009 21:09	Modified:	12 Jan 2011 17:31
Reporter:	Tech Magos	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	5.1.34, 5.1 bzr	OS:	Linux
Assigned to:	Dmitry Shulga	CPU Architecture:	Any

Description:
Doing a full retrieve:

select NumID,StringMap from SimpleTable

from a simple innodb table of numbers (mediumint) to strings (varchar(80)) leads to lost connection when the table is 700K rows.

When the table has less rows, it works fine.

The problem is resolved when switching OFF compression for the connection.

max_allowed_packet was set up to 400M but the problem still occurred. And it worked with much smaller max_allowed_packet, when compression was off.

Not running compression from the client is not a sustainable option, as we need it for remote clients to be more efficient.

How to repeat:
Create an innodb table of 700K rows of medium int to varchar(80)

Run above query with compression enabled in the connection.

Thank you for a problem report. Please, check your in server's error log if there was a server crash at that time. What version of client do you use? Check with mysql --version, please.

Server reports nothing in the logs. It seems to have no idea of the problem. No server crash either.

mysql client Ver 14.14, 5.1.30 (linux); from same version as the server.

Note exactly the same behaviour is seen by Windows client code, connecting by direct use of the client API. I.e. if compression is on, library returns lost connection, after call to mysq_real_query call. If compression is off, then query works for win client code too.

I couldn't repeat with current source server. Could you please provide your my.cnf file and create table statement. Thanks in advance.

[mysqld]
slow_query_log
slow_query_log_file=/path/slow.log
skip-safemalloc
skip-name-resolve
skip-locking
default-character-set=latin1
default-storage-engine=INNODB
max_allowed_packet=250M
query_cache_type=1
query_cache_size=800M
query_cache_limit=100M
max_connections=150
max_connect_errors=100
wait_timeout=3600
connect_timeout=16
net_read_timeout=50
net_write_timeout=100
interactive_timeout=300

innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2
innodb_buffer_pool_size = 16000MB
innodb_log_files_in_group = 2
innodb_log_file_size = 320M
innodb_log_buffer_size=16M
innodb_data_file_path = ibdata1:100M:autoextend
innodb_additional_mem_pool_size=25M
innodb_file_per_table=1
innodb_open_files = 2048
innodb_max_dirty_pages_pct=95
innodb_checksums=0
innodb_support_xa=0
innodb_doublewrite=0
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit = 2
innodb_lock_wait_timeout=50
innodb_thread_concurrency=16
innodb_commit_concurrency=20
innodb_read_io_threads=16
innodb_write_io_threads=4

key_buffer_size=256M
sort_buffer_size=8M
myisam_sort_buffer_size=8M
read_rnd_buffer_size=32M
read_buffer_size=3M
join_buffer_size=16M
tmp_table_size=256M
max_heap_table_size=256M
max_tmp_tables=100

thread_cache_size=128
thread_concurrency=16

[safe_mysqld]
timezone=GMT

CREATE TABLE `ID2NAME`
(
 `Id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
 `Name` varchar(255) NOT NULL,
  PRIMARY KEY (`Id`),
  UNIQUE KEY `iName` (`Name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Thank you for the feedback.

I can not repeat described behavior. Please try current version 5.1.31 and inform us if problem still exists.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

the might be verified in bug #45031 . crash is very hard to repeat, and depends on certain properties of the compressed stream.  hence, using valgrind is needed is alot of times to catch the problem.

This is a real issue and the last cross-reference seems to suggest where it is (valgrind report)

It only seems to happen when requested data is pretty large; e.g. 300K rows or more of data of 6 20 byte columns or so.

It does not seem to be related to some memory exhaustion problem, as when the compression protocol is off, then this works fine.

The valgrind report on the cross-referenced issue suggests that this is due to a memory-related bug, i.e. trying to read off bounds.

Could you please check if things work ok in the case
- The client is 32-bit and server is 64-bit; i recall seeeing complen in my_real_read being declared size_t; this is probably different size in 32 and 64 bit.
- Could it be that the size of the compressed block is not always well defined or aligned so the reader gets confused? And this happens only when we get very large data returned?
- It is not clear to me the problem originates on the server or the client.
The cross-referenced bug report suggests this happens on the client side.
Server reports nothing on the logs when the error hits.
- When it happens, it occurs from both linux 64-client, 32-bit client and windows 32-bit client.

This bug occurs also in 5.1.34 and seems to be also connected to the query  cache. Here are some important findings:

- The problem happens AFTER the first time the query is run, i.e. after it stores itself in the query cache. First time, the query works fine.
- If query cache is disabled, the bug never happens.
- If SQL_NO_CACHE is placed after the SELECT, the bug never happens.
- After the problem happens, if RESET QUERY CACHE is run, the problem does not happen for run of the query after the reset. It does happens again for runs after that.
- Seems that if qcache is 1 and the query result does not fit inside the qcache (e.g. if query_cache_limit is small), then the query never fails

So the problem seems to appear when the compressed protocol is used AND the query produces a large result that is already cached in the query cache.

It is worth mentioning that when the problem appears the server does pipe data through the channel of the same size as when the query succeeds(as SHOW GLOBAL STATUS reveals). 

For 5.1.34, when valgrind is run on client (mysql) or server (mysqld), no issues are reported. The valgrind problem as mentioned in the cross-referenced bug related to 6.x is NOT reported.

So it looks like there is probably malformed data produced when the query result is compressed from the query cache.

Thank you for the feedback.

Verified as described.

To repeat:

1. filled table with insert into ID2NAME (Name) values(md5(rand()));, then different combinations of  insert into ID2NAME (Name) select md5(rand()) from ID2NAME;, insert into ID2NAME (Name) select concat(Id, md5(rand()), Id) from ID2NAME; etc.

2. mysql> select count(*) from ID2NAME;
+----------+
| count(*) |
+----------+
|  1048576 | 
+----------+
1 row in set (5.89 sec)

3. Run following command:

while ./bin/mysql -hsol10-sparc-c -P33051 -uroot -C test -e "select Id, Name from ID2NAME" >/dev/null; do echo 1; done

Server started as:

./libexec/mysqld --defaults-file=support-files/my-small.cnf --basedir=. --datadir=./data --port=33051 --socket=/tmp/mysql_ssmirnova.sock -O thread_stack=1280000 --log-error --query_cache_type=1 --query_cache_size=800M --query_cache_limit=100M &

Backtrace in my env:

(dbx) where
current thread: t@13
  [1] __lwp_kill(0x0, 0x6, 0xff06cbc0, 0xfef21c00, 0xff06c388, 0x6), at 0xff04155c 
=>[2] my_write_core(sig = 6), line 310 in "stacktrace.c"
  [3] handle_segfault(sig = 6), line 2536 in "mysqld.cc"
  [4] __sighndlr(0x6, 0x0, 0xfe638ce0, 0x270964, 0x0, 0x0), at 0xff040494 
  ---- called from signal handler with signal 6 (SIGABRT) ------
  [5] __lwp_kill(0x0, 0x6, 0x85274, 0xff3b3904, 0xff06fb20, 0x0), at 0xff04155c 
  [6] raise(0x6, 0x6, 0x5, 0x6, 0x32aa0, 0x0), at 0xfefdfef8 
  [7] abort(0xfe639058, 0xff0700b8, 0xff06928c, 0xa8274, 0xff068288, 0xff070904), at 0xfefc00d8 
  [8] __assert(0x9c1c50, 0x9c1bd0, 0x212, 0xfe63930c, 0xa7f90, 0xb02750), at 0xfefc0354 
  [9] Diagnostics_area::disable_status(this = 0x33b92568), line 530 in "sql_class.cc"
  [10] Query_cache::send_result_to_client(this = 0xc2c480, thd = 0x33b91808, sql = 0x33bc0840 "select Id, Name from ID2NAME", query_length = 28U), line 1527 in "sql_cache.cc"
  [11] mysql_parse(thd = 0x33b91808, inBuf = 0x33bc0840 "select Id, Name from ID2NAME", length = 28U, found_semicolon = 0xfe639cf0), line 5886 in "sql_parse.cc"
  [12] dispatch_command(command = COM_QUERY, thd = 0x33b91808, packet = 0x33ba412d "select Id, Name from ID2NAME", packet_length = 28U), line 1213 in "sql_parse.cc"
  [13] do_command(thd = 0x33b91808), line 854 in "sql_parse.cc"
  [14] handle_one_connection(arg = 0x33b91808), line 1115 in "sql_connect.cc"
(dbx)

If don't use compression protocol, run same client test, then kill thread connected several times server crashes with same trace as well.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

This can easily lead to disconnect on production.

A failure
http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=2281048

can relate.

Some details on that crash:
test-max-win_ws2008-x86_64  loki05 ps_stm

innodb                         [ fail ]

mysqltest: At line 1001: query 'update t1 set c=a' failed: 2013: Lost connection to MySQL server during query

Sveta,

Please check whether this bug is fixed with a patch made for a bug #45031.

Thanks ...

Bug is still repeatable after fix of bug #45031.

Please note

 [10] Query_cache::send_result_to_client(this = 0xc30888, thd = 0x33b95d88, sql = 0x33bc3f80 "select Id, Name from ID2NAME", query_length = 28U), line 1624 in "sql_cache.cc"

in the trace. Also bug is not repeatable with query cache disabled. So this is compression + query cache problem.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/115748

3480 Dmitry Shulga	2010-08-16
      Fixed bug #42503 - "Lost connection" errors when using
      compression protocol.
     @ sql/sql_cache.cc
        modified Query_cache::append_result_data: added checking for max
        packet size limit (2^24) when forming result data block.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117561

3480 Dmitry Shulga	2010-09-06
      Fixed bug #42503 - "Lost connection" errors when using
      compression protocol.
     @ sql/net_serv.cc
        net_realloc() modified: consider already used memory
        when compare packet buffer length
     @ sql/sql_cache.cc
        modified Query_cache::send_result_to_client: send result to client
        in chunks limited by 2^24-1 bytes.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118285

3480 Dmitry Shulga	2010-09-15
      Fixed bug#42503 - "Lost connection" errors when using
      compression protocol.
     @ sql/net_serv.cc
        net_realloc() modified: consider already used memory
        when compare packet buffer length
     @ sql/sql_cache.cc
        modified Query_cache::send_result_to_client: send result to client
        in chunks limited by 1 megabyte.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118345

3509 Dmitry Shulga	2010-09-16
      Fixed bug#42503 - "Lost connection" errors when using
      compression protocol.
      
      The loss of connection was caused by a malformed packet
      sent by the server in case when query cache was in use.
      When storing data in the query cache, the query  cache
      memory allocation algorithm had a tendency to reduce
      the amount of memory block necessary to store a result
      set, up to finally storing the entire result set in a single
      block. With a significant result set, this memory block
      could turn out to be quite large - 30, 40 MB and on.
      When such a result set was sent to the client, the entire
      memory block was compressed and written to network as a
      single network packet. However, the length of the
      network packet is limited by 0xFFFFFF (16MB), since
      the packet format only allows 3 bytes for packet length.
      As a result, a malformed, overly large packet
      with truncated length would be sent to the client
      and break the client/server protocol.
      
      
      The solution is, when sending result sets from the query
      cache, to ensure that the data is chopped into
      network packets of size <= 16MB, so that there
      is no corruption of packet length. This solution,
      however, has a shortcoming: since the result set
      is still stored in the query cache as a single block,
      at the time of sending, we've lost boundaries of individual
      logical packets (one logical packet = one row of the result
      set) and thus can end up sending a truncated logical
      packet in a compressed network packet.
      
      As a result, on the client we may require more memory than 
      max_allowed_packet to keep, both, the truncated
      last logical packet, and the compressed next packet.
      This never (or in practice never) happens without compression,
      since without compression it's very unlikely that
      a) a truncated logical packet would remain on the client
      when it's time to read the next packet
      b) a subsequent logical packet that is being read would be
      so large that size-of-new-packet + size-of-old-packet-tail >
      max_allowed_packet.
      To remedy this issue, we send data in 1MB sized packets,
      that's below the current client default of 16MB for
      max_allowed_packet, but large enough to ensure there is no
      unnecessary overhead from too many syscalls per result set.
     @ sql/net_serv.cc
        net_realloc() modified: consider already used memory
        when compare packet buffer length
     @ sql/sql_cache.cc
        modified Query_cache::send_result_to_client: send result to client
        in chunks limited by 1 megabyte.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118363

3510 Dmitry Shulga	2010-09-16
      Fixed bug#42503 - "Lost connection" errors when using
      compression protocol.
      
      The loss of connection was caused by a malformed packet
      sent by the server in case when query cache was in use.
      When storing data in the query cache, the query  cache
      memory allocation algorithm had a tendency to reduce
      the amount of memory block necessary to store a result
      set, up to finally storing the entire result set in a single
      block. With a significant result set, this memory block
      could turn out to be quite large - 30, 40 MB and on.
      When such a result set was sent to the client, the entire
      memory block was compressed and written to network as a
      single network packet. However, the length of the
      network packet is limited by 0xFFFFFF (16MB), since
      the packet format only allows 3 bytes for packet length.
      As a result, a malformed, overly large packet
      with truncated length would be sent to the client
      and break the client/server protocol.
      
      The solution is, when sending result sets from the query
      cache, to ensure that the data is chopped into
      network packets of size <= 16MB, so that there
      is no corruption of packet length. This solution,
      however, has a shortcoming: since the result set
      is still stored in the query cache as a single block,
      at the time of sending, we've lost boundaries of individual
      logical packets (one logical packet = one row of the result
      set) and thus can end up sending a truncated logical
      packet in a compressed network packet.
      
      As a result, on the client we may require more memory than 
      max_allowed_packet to keep, both, the truncated
      last logical packet, and the compressed next packet.
      This never (or in practice never) happens without compression,
      since without compression it's very unlikely that
      a) a truncated logical packet would remain on the client
      when it's time to read the next packet
      b) a subsequent logical packet that is being read would be
      so large that size-of-new-packet + size-of-old-packet-tail >
      max_allowed_packet.
      To remedy this issue, we send data in 1MB sized packets,
      that's below the current client default of 16MB for
      max_allowed_packet, but large enough to ensure there is no
      unnecessary overhead from too many syscalls per result set.
     @ sql/net_serv.cc
        net_realloc() modified: consider already used memory
        when compare packet buffer length
     @ sql/sql_cache.cc
        modified Query_cache::send_result_to_client: send result to client
        in chunks limited by 1 megabyte.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118370

3206 Dmitry Shulga	2010-09-16 [merge]
      Auto-merge from mysql-5.1-bugteam for bug#42503.

Pushed to mysql-5.1-bugteam. Merged and pushed to mysql-5.5-merge.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/119026

3516 Dmitry Shulga	2010-09-24
      Follow-up for Bug#42503: fix a compilation warning.
     @ sql/sql_cache.cc
        Added include of send_data_in_chunks() definiton when macros EMBEDDED_LIBRARY is on.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/119028

3212 Dmitry Shulga	2010-09-24 [merge]
      Merged changes from 5.1-bugteam for bug#42503.

Pushed into mysql-trunk 5.6.1-m4 (revid:alik@sun.com-20100928153607-tdsxkdm5cmuym5sq) (version source revid:alik@sun.com-20100928153508-0saa6v93dinqx1u7) (merge vers: 5.6.1-m4) (pib:21)

Pushed into mysql-next-mr (revid:alik@sun.com-20100928153646-pqp8o1a92mxtuj3h) (version source revid:alik@sun.com-20100928153532-lr3gtvnyp2en4y75) (pib:21)

Pushed into mysql-5.5 5.5.7-rc (revid:alik@sun.com-20100928153459-4nudf4zgzlou4s7q) (version source revid:alik@sun.com-20100928153459-4nudf4zgzlou4s7q) (merge vers: 5.5.7-rc) (pib:21)

Noted in 5.5.7, 5.6.1 changelogs.

A malformed packet sent by the server when the query cache was in use
resulted in lost-connection errors. 

Setting to Need Merge pending push to 5.1.x.

Setting to Need Merge pending push to 5.1.x.

Bug#57074 mysql_store_result() fails if compression is enabled was marked a duplicate of this bug.

Noted in 5.1.52 changelog.

Pushed into mysql-5.1 5.1.53 (revid:build@mysql.com-20101101184443-o2olipi8vkaxzsqk) (version source revid:build@mysql.com-20101101184443-o2olipi8vkaxzsqk) (merge vers: 5.1.53) (pib:21)

Pushed into mysql-5.1 5.1.55 (revid:dmitry.shulga@oracle.com-20110111151825-d376jh0jefmsdslw) (version source revid:dmitry.shulga@oracle.com-20110111151825-d376jh0jefmsdslw) (merge vers: 5.1.55) (pib:24)

Pushed into mysql-trunk 5.6.2 (revid:dmitry.shulga@oracle.com-20110111153355-uvff16i445mk2spu) (version source revid:dmitry.shulga@oracle.com-20110111153355-uvff16i445mk2spu) (merge vers: 5.6.2) (pib:24)

Pushed into mysql-5.5 5.5.9 (revid:dmitry.shulga@oracle.com-20110111152631-jezu2dbglg05g6ff) (version source revid:dmitry.shulga@oracle.com-20110111152631-jezu2dbglg05g6ff) (merge vers: 5.5.9) (pib:24)