MySQL Bugs: #65518: Test rpl.rpl_innodb_bug28430 fails on certain hardware

Bug #65518	Test rpl.rpl_innodb_bug28430 fails on certain hardware
Submitted:	5 Jun 2012 8:55	Modified:	30 Jan 2013 18:38
Reporter:	Andreas Mock	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Tests	Severity:	S3 (Non-critical)
Version:	5.5.25	OS:	Linux (Scientific Linux release 6.2)
Assigned to:		CPU Architecture:	Any
Tags:	test failure rpl_innodb_bug28430 RHEL

Description:
While running the testsuite after compiling from sources the test rpl.rpl_innodb_bug28430 fails with a timeout. But this happens on a certain type of machine. The test doesn't fail on another machine with the same OS.
In both cases it is the current package status of Scientific Linux 6.2 (RHEL 6.2). A compare of the rpm packages shows that the package versions are identical.
In both cases it's 64bit. When I switch to another machine of the same first type I can replicate the problem, so that the special hardware in combination with the OS seems to produce the error.
The same error happens with 5.5.24.

How to repeat:
I'm not sure if it's repeatable as I think it depends on hardware/OS combination. In the bugtracking system are hints that this kind of error happened to others too.

Untar sources. 5.5.25.
cmake -DBUILD_CONFIG=mysql_release -DCMAKE_INSTALL_PREFIX=/opt/mysql -DWITH_DEBUG=0 -DENABLED_PROFILING=ON .
make
cd mysql-test
./mysql-test-run.pl --force rpl.rpl_innodb_bug28430 2>&1 | tee mysql-test.out

==============================================================================

TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
worker[1] Trying to dump core for [mysqltest - pid: 19893, winpid: 19893]
worker[1] Trying to dump core for [mysqld.1 - pid: 19854, winpid: 19854]
worker[1] Trying to dump core for [mysqld.2 - pid: 19857, winpid: 19857]
rpl.rpl_innodb_bug28430 'mix'            [ fail ]  timeout after 900 seconds
        Test ended at 2012-06-05 10:47:39

machine information

Attachment: machine-info.txt (text/plain), 44.04 KiB.

Output of test run

Attachment: mysql-test.out (application/octet-stream, text), 36.85 KiB.

Thank you for the report.

Backtrace shows issue is in fsync operation. Please check disk on problematic machine.

Hi Sveta,

I have two identical machines. I can reproduce test failure on both.
On a third machine which has different hardware the test doesn't fail.

For what kind of things concerning disks shall I watch out?
/var/log/messages doesn't show up an error message.

Hints welcome.

Best regards
Andreas

Thank you for the feedback.

I want you to ensure:

1. Disks have not damages. Any utility which your OS has can be used to check this. On Linux you can use fsck or any other utility your OS has.
2. If disks are OK please inform us about model of disks.

* Booted in rescue mode
* Made forced filesystem check on all filesystems
* => no errors reported

* hard drives are two Seagate ST3000DM001-9YN166
* disk setup:
- partitions are linux raid
- raid mirror of the installed two disks
- LVM volume groups on top of raid mirror
- logical volumes sourced by volume groups
- file system ext4

How do you know from stack trace that fsync was the cause of timeout?

Best regards
Andreas

Thank you for the feedback.

Will try to find similar disks to test.

Regarding your question.

After last successful query there is a call of stored procedure which updates InnoDB table and information about thread which does writing job in the trace:

...
Thread 14 (Thread 0x7f1d18c77700 (LWP 20456)):
#0  0x0000003a3ee0ebad in fsync () from /lib64/libpthread.so.0
#1  0x000000000097023f in os_file_fsync (file=9) at /opt/data/install/mysql/mysql-5.5.25/storage/innobase/os/os0file.c:2058
#2  os_file_flush_func (file=9) at /opt/data/install/mysql/mysql-5.5.25/storage/innobase/os/os0file.c:2156
#3  0x0000000000922654 in pfs_os_file_flush_func (space_id=<value optimized out>) at /opt/data/install/mysql/mysql-5.5.25/storage/innobase/include/os0file.ic:381
#4  fil_flush (space_id=<value optimized out>) at /opt/data/install/mysql/mysql-5.5.25/storage/innobase/fil/fil0fil.c:4685
#5  0x000000000095c5d6 in log_write_up_to (lsn=<value optimized out>, wait=<value optimized out>, flush_to_disk=1) at /opt/data/install/mysql/mysql-5.5.25/storage/innobase/log/log0log.c:1514
#6  0x00000000008b5d02 in trx_prepare_off_kernel (trx=<value optimized out>) at /opt/data/install/mysql/mysql-5.5.25/storage/innobase/trx/trx0trx.c:2005
#7  0x00000000008b7993 in trx_prepare_for_mysql (trx=0x7f1d0400e598) at /opt/data/install/mysql/mysql-5.5.25/storage/innobase/trx/trx0trx.c:2041
....

How do you know that the cited thread dump is the thread causing the problem? Sorry, but I'm not used to read dumps with several threads.
Which instance is causing the timeout? Locking code? Mutex code?
From which instance is the sigabrt being sent?

Can reproduce with 5.5.22 on same hardware. Third instance of same machine type.

Thank you for the feedback.

Actually we have such machines and this was reported as bug #46029. Setting as duplicate.