MySQL Bugs: #51196: Slave SQL: Got an error writing communication packets, Error

Bug #51196	Slave SQL: Got an error writing communication packets, Error_code: 1160
Submitted:	15 Feb 2010 17:03	Modified:	28 Jan 2012 15:35
Reporter:	Alvaro Palacios	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.1.42, 5.1.47, 5.0, 5.1.50	OS:	Linux (Centos 2.6.15.1)
Assigned to:	Daogang Qu	CPU Architecture:	Any
Tags:	federated, replication

Description:
We get this error when a execute long process in a server that disapears if FEDERATED ENGINE is off. 
The server act as replication slave and sql replication thread die.
The federated tables don´t access by process.

How to repeat:
Mysql 5.1
Server master
Server slave
Federated engine on
replace statements long process

Thank you for the report.

Which exact minor version do you use?

The minor versión is 5.1.36 and also we have test in 5.1.42.

Thank you for the feedback.

I the initial description you wrote federation engine should be off while in "how to repeat" section you wrote it should be on. So should it be on or off?

Please also send us output of SHOW SLAVE STATUS\G after error occur. Query which caused the problem would be helpful too.

You said you had test with version 5.1.42. Can you send this test to us, so we can repeat the problem in our environment?

Sorry, I want to say that ENGINE FEDERATED may be ON in slave server.

We get the error when execute 120k rows COMMAND REPLACE aprox. but i think that is not relevant because we get it in two diferent slave servers, one with various consecutive 5.1 upgrades since three month. 
Due to work in a production enviroment, we had to make the FEDERATED ENGINE = OFF in slave, so i can not test.

Configuration basic:
Two servers with Mysql 5.1.x Master - Slave
Federated ON  
High load COMMAND REPLACE in master.

Result:
Slave SQL thread die

We experience the same problem.
Example output:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: X.X.X.X
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: XXXX-bin.000091
          Read_Master_Log_Pos: 978446181
               Relay_Log_File: XXXX-relay.000228
                Relay_Log_Pos: 591993295
        Relay_Master_Log_File: XXXX-bin.000091
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1160
                   Last_Error: Error 'Got an error writing communication packets' on query. Default database: 'snapshot_work'. Query: 'DROP TABLE IF EXISTS federate_tables.player_remote'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 591993151
              Relay_Log_Space: 978446517
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1160
               Last_SQL_Error: Error 'Got an error writing communication packets' on query. Default database: 'snapshot_work'. Query: 'DROP TABLE IF EXISTS federate_tables.player_remote'
1 row in set (0.00 sec)

mysql> select version();
+------------------------------------+
| version()                          |
+------------------------------------+
| 5.1.47-enterprise-gpl-advanced-log | 
+------------------------------------+
1 row in set (0.00 sec)

Thank you for the feedback.

Problem is replication affected by communication between federated and server federated table connects to.

I can not repeat exactly same problem, but communication between master and slave failure can be shown by following test. I used 5.0 here, results for 5.1 will be posted later.

1. Create test for MTR:
--source include/master-slave.inc

CREATE TABLE `t1` (
  `f1` int(11) DEFAULT NULL
) ENGINE=FEDERATED DEFAULT CHARSET=utf8 CONNECTION='mysql://root@127.0.0.1:4040/test/t1';
--sync_slave_with_master
select * from t1;

connection master;
#set binlog_format='row';
create table t2(f1 int);
insert into t2 values(1),(2),(3),(4),(5);
--sync_slave_with_master
--vertical_results
show slave status;
stop slave;
start slave;
sleep 2;
show slave status;
2. Create table t1 with same definition as federated one on one of local servers. Insert 6 rows into it.
3. Start MySQL Proxy with backend this third server.
4. Run ./mtr --manual-gdb rpl+_bug51196
5. On slave server add breakpoints:
b log_event.cc:1922 and mysql_parse
6. When you hit "Breakpoint 2, Query_log_event::exec_event (this=0x36050e8, rli=0x4846098, query_arg=0x36051a1 "insert into t2 values(1),(2),(3),(4),(5)", q_len_arg=40) at log_event.cc:1929
1929      thd->catalog= catalog_len ? (char *) catalog : (char *)"";" kill MySQL Proxy process, then continue execution
7. In the first SHOW SLAVE STATUS OUTPUT you see "Slave_IO_Running    No" although no connection to master was interrupted.

All reporters,

I can not repeat it with current 5.1 BZR sources anymore. Please try if this still repeatable with 5.1.49 for you and inform us about results.

A customer reports that stopping and starting the slave did not resume replication while a start/stop slave did.

Issue may happen even if FEDERATED tables are not actually replicated (FEDERATED turned ON, but database which contains FEDERATED tables excluded using any of the repliate* or binlog* do/ignore options). All reporters, please add comments if you find the same.

Could you please generate a test case to fix the problem?

many folks report federated table data is not even being replicated, but the error is still seen.  this is strange.  perhaps it is some table cache problem?  i'd be curious to know if regular "flush tables" helps to reduce the frequency of this breakage?

I've seen this error when run REPLACE LOW_PRIORITY in master into the database with federated tables, but not applied on then.
¿It's possible that the federated source server connection has expired when wait_timeout variable it's over and failed to reconnect when the write operation arrive to the database?

This is likely the same problem with FEDERATED + table cache as in bug #61790, see my analysis there.

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Fixed in 5.1+. Documented bugfix in the 5.1.62, 5.5.21, and 5.6.5 changelogs, as follows:

      It was possible on replication slaves where FEDERATED tables were in use
      to get timeouts on long-running operations, such as Error 1160 -Got an
      error writing communication packets-. The FEDERATED tables did not need to
      be replicated for the issue to occur.

Closed.

We're seeing intermittent 1160 errors in a related environment: FEDERATED engine is on, but the error is triggered (intermittently, not reliably reproducable) during mysqldump of databases that have, and do not have actual FEDERATED tables.  That is, we have FEDERATED enabled, but databases that don't make use of FEDERATED tables also intermittently fail with 1160 type errors (even when setting mysqldump to SOCKETS vice tcp).

Sorry, using:
MySQL-client-5.5.25a-1.rhel5.x86_64
MySQL-server-5.5.25a-1.rhel5.x86_64
MySQL-shared-compat-5.5.25a-1.rhel5.x86_64

Running 5.5.24, we see these errors on a setup with FEDERATED enabled too. Happens mostly during the scheduled MySQL dump runs.

I experienced the same error (i.e.  Got an error writing communication packets, Error_code: 1160) with MySQL 5.6.16 x64 (which is a later version than what you said has the bug fix) when executing stored procedures and functions from the command line or from running a .bat files. The SP selects data from a federated table and I suspect that my temporary nightly export/dump of the database is failing for the same reason. I plan to attach to this bug all the related files right after I post this comment. 

Can you please reopen this bug and fix it?

Thanks!

Attachment with more info on our bug has been attached for the developers to see. Thanks!

For my bug, MysqlServer is installed on Windows 2008 R2

The workaround it try running it multiple times first from double clicking the bat and second from the command line (as shown in the screenshot above by “cd” to the directory in cmd and then running the .bat file there). Then it started working every time for that minute. But then fails again intermittently if you try a week later.

We have the same issue in our multi-master setup on MySQL Ver 5.6.17 for Linux on x86_64.

When the federated engine is queried for the first time, it fetches the rows without a problem.  But if we kill the open connection on the remote server, and query the the federated table again, mysql says:

ERROR 1158 (08S01): Got an error reading communication packets

Shouldn't it re-establish the connection and query the table, instead of just throwing the error?  The open-connection is automatically closed due to wait_timeout.

Related to Federated Table :

Whenever a query is issued against a federated table, a single/multiple [ depends on Number = table_open_cache_instances] connection will be created from the DB instance hosting the federated table to the DB instance hosting the main table. If those single/mutiple connection on the DB instance hosting main table expires because of [wait_timeout,connect_timeout] or if those gets terminated manually , any new queries on federated table will result in "ERROR 1158 (08S01) at line 1: Got an error reading communication packets" on the DB instance hosting the federated table.

Reason : Flushing of the federated table does not happen properly from the table_open_cache_instances ,

Solution : To avoid this, the temporary solution is to manually/forcefully flush only those federated tables once before every [wait_timeout,connect_timeout] seconds. In our case we put the flush command in event_scheduler since the number of federated tables were very less and we knew exactly which federated tables were rarely getting queried with gap of more than [wait_timeout,connect_timeout] seconds. This used to affect even our MEB backup when it used execute "flush tables with READ lock".

flush table <federated_table_name>; -- executed every 4 hours

Key : The frequency with which it errors depends on how many table_open_cache_instances is configured on the Instance.

Since then, after putting the manual flush of federated tables with a frequence less than [wait_timeout,connect_timeout] , this issue has not showed up. Hope this helps.

Thanks
Prasad

That is a very helpful comment. Thank you!
Austin

http://bugs.mysql.com/bug.php?id=74618 marked as duplicate of this one.