Bug #19157: Replication stops with Solaris machine
Submitted: 18 Apr 2006 2:51    Modified: 2 Jul 2006 16:03
Reporter: Haroon Anwar
Status: No Feedback
Category: MySQL Server    Severity: S2 (Serious)
Version: 4.1.18    OS: Linux (RHEL, master) / Solaris (slave)
Assigned to:    CPU Architecture: Any

[18 Apr 2006 2:51] Haroon Anwar
Description:
Hi,

I am running 4.0.18 on RHEL 4.0 and Solaris 10. The RHEL 4.0 machine acts as the master and the Solaris machine as a slave. I have another RHEL 4.0 machine running 4.0.18 as a slave, so there are two slaves replicating from the master.

The Solaris machine is off-site and all updates are replicated to it over the 512 WAN link. Then, suddenly, replication between the Solaris machine and the master stops. The first time it happened I tried to fix the problem but was clueless, so I took a snapshot again, restored it on the Solaris machine, and started replication from scratch. Everything worked for 4 days. Here is the output below.
MySQL Replication Status as at 2006-04-18 12:42
===============================================
MASTER (master.intranet):
        Position: 53517890

SLAVE (slave1.intranet):
             Slave_IO_State: Waiting for master to send event
            Master_Log_File: online-bin.000015
        Read_Master_Log_Pos: 53517890

SLAVE (slave2.ipwan):
             Slave_IO_State: Waiting for master to send event
            Master_Log_File: online-bin.000015
        Read_Master_Log_Pos: 47295713

Here you can see that two machines are still in sync but the Solaris machine is not. It just stops at this position and does not go any further. I have plenty of disk space available on both master and slave. There is no error in the hostname.err file; it shows the slave as connected to the master.
I have looked at this position with the mysqlbinlog command: one of the programmers had uploaded a 2M file into MySQL. That file was replicated to the onsite slave, but it never went across to the Solaris machine.
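
The same position can also be inspected from the master itself; as a minimal check (using the log file name and position from the status output above, and assuming that position is a valid event boundary), the events recorded there can be listed with:

mysql> SHOW BINLOG EVENTS IN 'online-bin.000015' FROM 47295713 LIMIT 10;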

How to repeat:
I have repeated this problem by uploading the 2M file on the master. It got replicated to the onsite RHEL 4.0 slave but refused to replicate to the Solaris slave, which is connected to the master through a WAN link.
[18 Apr 2006 12:02] Valeriy Kravchuk
Thank you for the problem report. Please send your my.cnf files from the master and both slaves.
[19 Apr 2006 0:16] Haroon Anwar
Master my.cnf (RHEL 4.0 Master)
-----------------------------------------------
[client]
port            = 3306
socket          = /var/lib/mysql/mysql.sock
[mysqld]
port            = 3306
socket          = /var/lib/mysql/mysql.sock
back_log = 50
max_connections = 100
max_connect_errors = 10
table_cache = 2048
max_allowed_packet = 16M
binlog_cache_size = 1M
max_heap_table_size = 64M
sort_buffer_size = 8M
join_buffer_size = 8M
thread_cache = 8
thread_concurrency = 40
query_cache_size = 64M
query_cache_limit = 2M
ft_min_word_len = 4
default_table_type = InnoDB
thread_stack = 192K
transaction_isolation = REPEATABLE-READ
tmp_table_size = 64M
log_bin
log_slow_queries
long_query_time = 2
log_long_format
tmpdir = /var/tmp/
server-id = 1
key_buffer_size = 32M
read_buffer_size = 2M
read_rnd_buffer_size = 16M
bulk_insert_buffer_size = 64M
myisam_sort_buffer_size = 128M
myisam_max_sort_file_size = 10G
myisam_max_extra_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover
skip-bdb
innodb_additional_mem_pool_size = 16M
innodb_buffer_pool_size = 2000M
innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
innodb_data_home_dir = /var/lib/mysql/ibdata/
innodb_file_io_threads = 4
innodb_thread_concurrency = 16
innodb_flush_log_at_trx_commit = 1
innodb_log_buffer_size = 8M
innodb_log_file_size = 1500M
innodb_log_files_in_group = 2
innodb_log_group_home_dir = /var/lib/mysql/iblogs/
innodb_max_dirty_pages_pct = 90
innodb_flush_method=O_DIRECT
innodb_lock_wait_timeout = 120
datadir = /var/lib/mysql
lower_case_table_names = 1
innodb_file_per_table = 1

[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash
[isamchk]
key_buffer = 512M
sort_buffer_size = 512M
read_buffer = 8M
write_buffer = 8M

[myisamchk]
key_buffer = 512M
sort_buffer_size = 512M
read_buffer = 8M
write_buffer = 8M

[mysqlhotcopy]
interactive-timeout

[mysqld_safe]
open-files-limit = 8192

Slave1 my.cnf (RHEL 4.0 Slave)
----------------------------------------------
[client]
port            = 3306
socket          = /var/lib/mysql/mysql.sock

[mysqld]
port            = 3306
socket          = /var/lib/mysql/mysql.sock
skip-locking
key_buffer = 16M
max_allowed_packet = 16M
table_cache = 64
sort_buffer_size = 512K
net_buffer_length = 8K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 8M
server-id       = 2
innodb_data_home_dir = /var/lib/mysql/ibdata
innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
innodb_log_group_home_dir = /var/lib/mysql/iblogs
innodb_buffer_pool_size = 16M
innodb_additional_mem_pool_size = 2M
innodb_log_file_size = 1500M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
innodb_log_files_in_group = 2
datadir = /var/lib/mysql
lower_case_table_names = 1
innodb_file_per_table = 1
tmpdir = /var/spool/mysql

[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash

[isamchk]
key_buffer = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[myisamchk]
key_buffer = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[mysqlhotcopy]
interactive-timeout

Slave2 my.cnf (Running Solaris 10)
--------------------------------------------------
[client]
port            = 3306
socket          = /var/lib/mysql/mysql.sock

[mysqld]
port            = 3306
socket          = /var/lib/mysql/mysql.sock
skip-locking
key_buffer = 16M
max_allowed_packet = 16M
table_cache = 64
sort_buffer_size = 512K
net_buffer_length = 8K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 8M
server-id       = 3
relay-log=bentleigh-replica-relay-bin
log-warnings
innodb_data_home_dir = /var/lib/mysql/ibdata
innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
innodb_log_group_home_dir = /var/lib/mysql/iblogs
innodb_buffer_pool_size = 16M
innodb_additional_mem_pool_size = 2M
innodb_log_file_size = 1500M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
innodb_log_files_in_group = 2
datadir = /var/lib/mysql
lower_case_table_names = 1
innodb_file_per_table = 1
tmpdir = /var/spool/mysql

[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash

[isamchk]
key_buffer = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[myisamchk]
key_buffer = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[mysqlhotcopy]
interactive-timeout
[19 Apr 2006 6:03] Valeriy Kravchuk
Thank you for the additional information.

I have found a difference between the my.cnf files on the master and (both) slaves. On the master you have:

read_buffer_size = 2M
read_rnd_buffer_size = 16M
bulk_insert_buffer_size = 64M

On both slaves you have:

read_buffer_size = 256K
read_rnd_buffer_size = 512K

and no bulk_insert_buffer_size. I am not sure this can lead to the problem you described, but, in any case, please send the output of

SHOW VARIABLES LIKE 'bulk%';

from both of your slaves, just to check for (possibly different) defaults.

Is there anything in Solaris slave's error log for the appropriate period?

Your version is noted as 4.1.18 in the "header" and 4.0.18 in your description. Please clarify which version is really used.
[19 Apr 2006 8:02] Haroon Anwar
Hi,

The version that I am using is 4.1.18; my mistake, sorry about that. To help further, I upgraded the MySQL version on the Solaris machine to 5.0.20 to rule out any possibility of a bug in the older releases, but the situation is still the same.

To rule out the possibility that this problem is caused by the WAN connection, I installed version 4.1.18 on a Solaris machine on the LAN and tested replication with that machine. It was able to go through and stayed in sync with the master afterwards. So I believe this problem is related to the WAN connection. Maybe the problem is that when we send a file as an attachment to the slaves over a slow connection, and the slave cannot complete an event within some amount of time x, it restarts the event, and this becomes an endless loop. I am just guessing. Also, transferring a 2M file over my connection takes around 1.5 minutes, whereas on the LAN it takes around 0.5 seconds. Maybe this further information helps you sort out the problem.
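
If that guess is anywhere near right, the settings that govern large events and slow links would be one place to look. As a suggestion only (I have not changed anything yet), these standard server variables could be checked on the master, which has to push each event across the slow link, and compared with the values on the slave:

mysql> SHOW VARIABLES LIKE 'net_write_timeout';
mysql> SHOW VARIABLES LIKE 'max_allowed_packet';

Setting slave_compressed_protocol = 1 in the [mysqld] section of the Solaris slave's my.cnf might also be worth trying, to shrink what has to travel over the WAN link, though I have not tested that either.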

Please find the further information requested below.

Slave 1 (RHEL 4.0)
-----------------------------
mysql> SHOW VARIABLES LIKE 'bulk%';
+-------------------------+---------+
| Variable_name           | Value   |
+-------------------------+---------+
| bulk_insert_buffer_size | 8388608 |
+-------------------------+---------+
1 row in set (0.00 sec)

Slave 2 (Solaris 10)
-----------------------------
mysql> SHOW VARIABLES LIKE 'bulk%';
+-------------------------+---------+
| Variable_name           | Value   |
+-------------------------+---------+
| bulk_insert_buffer_size | 8388608 |
+-------------------------+---------+
1 row in set (0.01 sec)
[19 Apr 2006 8:53] Valeriy Kravchuk
Thank you for the additional information and tests. I have not yet received an answer to the following question:

Is there anything in Solaris slave's error log for the appropriate period?

Send the 

show variables like 'slave%';

results from the problematic slave also.
[19 Apr 2006 9:04] Haroon Anwar
Hi Valeriy,

Many thanks for the quick reply. I have checked the error log many times, but to my surprise there is nothing in there, no error message at all. The last few lines just say it connected to the master at position xxx. With no error message and everything apparently behaving normally, I do not know where to start troubleshooting.
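
One more place to look is the full slave status on the Solaris machine; the Slave_IO_Running, Slave_SQL_Running, Last_Errno/Last_Error, Read_Master_Log_Pos and Exec_Master_Log_Pos fields should show whether it is the I/O thread or the SQL thread that has stalled:

mysql> SHOW SLAVE STATUS\G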

Anyway, please find the additional information requested below.

mysql> show variables like 'slave%';
+---------------------------+-------------------+
| Variable_name             | Value             |
+---------------------------+-------------------+
| slave_compressed_protocol | OFF               |
| slave_load_tmpdir         | /var/spool/mysql/ |
| slave_net_timeout         | 3600              |
| slave_skip_errors         | OFF               |
| slave_transaction_retries | 10                |
+---------------------------+-------------------+
5 rows in set (0.00 sec)

Cheers
[24 Apr 2006 1:40] Haroon Anwar
Hi,
I was just checking on the status of this report. Is it a bug? Could you please let me know?

Thanks
[2 Jun 2006 16:03] Valeriy Kravchuk
I am not sure about the status. I really have no idea how to repeat this or what the reason is.

Please send the show variables like 'slave%'; results from the local RHEL slave, and the uname -a results from your master and both slaves.
[2 Jul 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".