Bug #61790 "Got an error writing communication packets" on not FEDERATED DBs
Submitted: 7 Jul 2011 19:04 Modified: 28 Jan 2012 14:03
Reporter: Dario Fiumicello Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Federated storage engine Severity:S2 (Serious)
Version:5.1.41, 5.5.14 OS:Linux (Ubuntu 10.04)
Assigned to: CPU Architecture:Any
Tags: federated network error communication packets
Triage: Needs Triage: D2 (Serious)

[7 Jul 2011 19:04] Dario Fiumicello
Description:
Scenario:

We have two servers, one in DMZ and one in LAN. The server in DMZ runs a mysql instance with one database (say db_dmz) that needs to be reached by LAN, the server in LAN runs a mysql instance with several database, (say db_a, db_b ...)

LAN Server:
Ubuntu 10.04
kernel 2.6.32-32-server
mysql 5.1.41
apache Apache/2.2.14 (Ubuntu)
php 5.3.5-1ubuntu7.1
php5-mysql 5.3.5-1ubuntu7.1

DMZ Server:
Ubuntu 10.04
2.6.32-32-server
mysql 5.1.41
apache Apache/2.2.14 (Ubuntu)
php 5.3.2-1ubuntu4.9 
php5-mysql 5.3.2-1ubuntu4.9 

The server in DMZ (say ip 10.0.0.1) is reached by the server in LAN (say ip 192.168.1.1) through a router which NAT the DMZ server to a private lan IP (say 192.168.1.254)

Both servers has TCP Keepalive enabled and set to a very small interval (90s) in order to avoid dropping mysql connection through the NAT. I checked the Keepalive packets using a sniffer

One of the servers in LAN has a database (say db_lan) that uses some federated tables pointing to db_dmz.

This is the odd thing:

At a certain moment (after some hour) if I use a web application that uses a database in lan (say db_a) which DOESN'T INVOLVE ANY FEDERATED TABLE I obtain an error like this:

Database errorInvalid SQL: (SELECT egw_cal_repeats.*,egw_cal.*,cal_start,cal_end,cal_recur_date FROM egw_cal JOIN egw_cal_dates ON egw_cal.cal_id=egw_cal_dates.cal_id JOIN egw_cal_user ON egw_cal.cal_id=egw_cal_user.cal_id LEFT JOIN egw_cal_repeats ON egw_cal.cal_id=egw_cal_repeats.cal_id WHERE 1309903200 < cal_end AND cal_start < 1309989599 AND cal_user_type='u' AND cal_user_id IN ('30','-1') AND recur_type IS NULL AND cal_recur_date=0) UNION (SELECT egw_cal_repeats.*,egw_cal.*,cal_start,cal_end,cal_recur_date FROM egw_cal JOIN egw_cal_dates ON egw_cal.cal_id=egw_cal_dates.cal_id JOIN egw_cal_user ON egw_cal.cal_id=egw_cal_user.cal_id LEFT JOIN egw_cal_repeats ON egw_cal.cal_id=egw_cal_repeats.cal_id WHERE 1309903200 < cal_end AND cal_start < 1309989599 AND cal_user_type='u' AND cal_user_id IN ('30','-1') AND cal_recur_date=cal_start) ORDER BY cal_start: Got an error writing communication packets

And there is no way either then restarting LAN mysql for getting things work again.

I obtain this error not with a single web application but also with other web applications installed in LAN. Moreover, it appears in differents parts of the application and sometimes it appears or disapper changing the query content, like:

select ... where  id='3' ; two spaces between "where" and "id" <<<<< ERROR
select ... where id='3'  ; one space between "where" and "id"  <<<<< NO ERROR

While those applications are not working the application which uses federated tables on db_dmz continues to work smoothly

Note that before creating the db_lan database (which USES federated tables) all the other databases runs smoothly

I'm bumping on this error for weeks and I can't find a solution so I think this is a bug which involve FEDERATED engine some way.

How to repeat:
Unable to repeat it. It appears randomly after some hour using it
[8 Jul 2011 20:04] Sveta Smirnova
Thank you for the report.

But version 5.1.41 is very old and many bugs were fixed since. Please upgrade to current version 5.1.58, try with it and inform us if problem still exists.
[20 Jul 2011 15:29] Alexey Kopytov
We've encountered a similar problem on a customer machine. After adding some diagnostics we've traced this down to FEDERATED + table cache:

- when putting a table (not necessarily a FEDERATED one) to the table cache, server may evict and close another if the table cache is full
- if the victim happens to be a FEDERATED table, its client connection to remote server is closed
- if the remote server has gone for whatever reasons (stopped, connection timed out, network issues), mysql_close() in ha_federated::close() fails with ER_NET_ERROR_ON_WRITE
- the above error is propagated back to the client that issued the original query, even if it was against a local table

Here's the diagnostic stacktrace (in this case it was a SHOW TABLE STATUS against an InnoDB table):

stack_bottom = 0x4a6520e8 thread_stack 0x31000
/usr/sbin/mysqld(my_print_stacktrace+0x39)[0x89dca9]
/usr/sbin/mysqld(net_real_write+0x1e7)[0x5b1987]
/usr/sbin/mysqld(net_flush+0x23)[0x5b1e63]
/usr/sbin/mysqld(net_write_command+0x18c)[0x5b201c]
/usr/sbin/mysqld(cli_advanced_command+0xed)[0x72697d]
/usr/sbin/mysqld(mysql_close+0x64)[0x726524]
/usr/sbin/mysqld(_ZN12ha_federated5closeEv+0x29)[0x730179]
/usr/sbin/mysqld(_Z8closefrmP8st_tableb+0x10a)[0x613a6a]
/usr/sbin/mysqld[0x609611]
/usr/sbin/mysqld(my_hash_delete+0x2c0)[0x892b90]
/usr/sbin/mysqld(_Z19close_thread_tablesP3THD+0x18a)[0x60935a]
/usr/sbin/mysqld(_Z14get_all_tablesP3THDP10TABLE_LISTP4Item+0x8ac)[0x6c966c]
/usr/sbin/mysqld(_Z24get_schema_tables_resultP4JOIN23enum_schema_table_state+0x1d1)[0x6c1911]
/usr/sbin/mysqld(_ZN4JOIN4execEv+0x3fe)[0x63853e]
/usr/sbin/mysqld(_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x199)[0x63a609]
/usr/sbin/mysqld(_Z13handle_selectP3THDP6st_lexP13select_resultm+0x1c5)[0x63af35]
/usr/sbin/mysqld[0x5c66a1]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x3c6)[0x5c9566]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjPPKc+0x51b)[0x5cf79b]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x9f1)[0x5d0191]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x128)[0x5d0dd8]
/usr/sbin/mysqld(handle_one_connection+0x7b1)[0x5c2d91]
/lib64/libpthread.so.0[0x3f15206617]
/lib64/libc.so.6(clone+0x6d)[0x3f14ad3c2d]

Replication is obviously affected by the same problem, so bug #51196 is likely a duplicate of this one.

The solution seems to be to clear errors coming from mysql_close() in ha_federated::close().
[29 Jul 2011 20:05] Sveta Smirnova
Alexey,

I can not repeat described error since 5.1.50. I used generic test case and got no error when ha_federated::close was called. Which version do you use?
[19 Dec 2011 7:02] Shane Bester
Percona's fix: https://code.launchpad.net/~akopytov/percona-server/bug843587-5.5
[28 Jan 2012 14:03] Jon Stephens
Duplicate of BUG#51196.