MySQL Bugs: #6987: "rpl_max_relay_size" fails on SGI IRIX

Bug #6987	"rpl_max_relay_size" fails on SGI IRIX
Submitted:	3 Dec 2004 13:04	Modified:	17 Jul 2005 0:43
Reporter:	Lenz Grimmer	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	4.1.8-pre + 4.1.10 + 4.1.10a	OS:	SGI IRIX 6.5 + Tru64 + ...
Assigned to:	Lars Thalmann	CPU Architecture:	Any

Description:
When running the test suite, the "rpl_max_relay_size" fails with the following diff:

Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/rpl_max_relay_size.result Fri Dec  3 00:16:09 2004
--- r/rpl_max_relay_size.reject Fri Dec  3 05:42:38 2004
***************
*** 16,22 ****
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File     Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running    Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table      Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error  Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Unti
l_Log_Pos       Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert     Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50477   slave-relay-bin.000014      1221    master-bin.000001       Yes     Yes
                0               0       50477   1221    None            0       No
                                #
  stop slave;
  reset slave;
  set global max_relay_log_size=(5*4096);
--- 16,22 ----
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File     Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running    Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table      Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last
_Error  Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos       Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert     Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50477   slave-relay-bin.000014      1221    master-bin.000001       Yes     Yes
                0               0       50477   1159    None            0       No
                                #
  stop slave;
  reset slave;
  set global max_relay_log_size=(5*4096);
-------------------------------------------------------

How to repeat:
Run the test and notice the failure.

Please disregard - I was not able to reproduce it manually afterwards :(

Very similar symptom in 4.1.10 on 'barney':

-------------------------------------------------------
*** r/rpl_max_relay_size.result Mon Feb 14 19:58:34 2005
--- r/rpl_max_relay_size.reject Fri Feb 25 10:46:29 2005
***************
*** 49,60 ****
  create table t1 (a int);
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50535   slave-relay-bin.000009  62      master-bin.000001       Yes     Yes                                                     0               0       50535   62      None            0       No                                              #
  flush logs;
  drop table t1;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50583   slave-relay-bin.000010  52      master-bin.000001       Yes     Yes                                                     0               0       50583   52      None            0       No                                              #
  flush logs;
  show master status;
  File  Position        Binlog_Do_DB    Binlog_Ignore_DB
--- 49,60 ----
  create table t1 (a int);
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50535   slave-relay-bin.000009  62      master-bin.000001       Yes     Yes                                                     0               0       50535   8256    None            0       No                                              #
  flush logs;
  drop table t1;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50583   slave-relay-bin.000010  52      master-bin.000001       Yes     Yes                                                     0               0       50583   8246    None            0       No                                              #
  flush logs;
  show master status;
  File  Position        Binlog_Do_DB    Binlog_Ignore_DB
-------------------------------------------------------

Lars looked into this, this is his comment on IRC:

<lars> joerg: I've looked into the test failure and could repeat it a couple of times and then I could *not* anymore.  The failure is that the system reports the wrong total size of all the relay logs (I've checked all logs and everything else is working fine). I think that there could be the following reasons for the failure:
1) The disk is corrupt, 
2) mysys.c:my_stat() has a bug when finding file sizes, 
3) Summarizing log sizes in rpl code is wrong (but seems right from reading it). 
Anyhow, to check this I need to set the test program to not purge the relay logs to see what those files really are.  In the only saved failure I have, these were purged (as they should), so I can't see any error there.  But since it is not failing anymore, I can't do much...  
<joerg> lars: So you propose to ignore that for the moment?
<joerg> lars: Ok. Just for completeness, I will file a bug report and include "not reproducible for certain", so that the info does not get lost.
<lars> joerg: Ok, sounds good

As agreed with Brian, assigning to Lars as he already worked on this bug.

This seems to be an intermittent problem, happening and hiding without any pattern.
A similar symptom occurred in the tests for the "certified" (!) 4.1.10a on 'nocona' (Linux x86_64):

-------------------------------------------------------
*** r/rpl_rotate_logs.result    Sun Mar  6 03:49:23 2005
--- r/rpl_rotate_logs.reject    Sun Mar  6 05:50:07 2005
***************
*** 84,93 ****
  testing temporary tables part 2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher     Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     60      master-bin.000004       2886    slave-relay-bin.000001  7891    mast er-bin.000004       Yes     Yes                                                     0               0       2886    7891    None            0       No                                              #
  lock tables t3 read;
  select count(*) from t3 where n >= 4;
  count(*)
! 100
   unlock tables;
   drop table if exists t1,t2,t3,t4;
--- 84,93 ----
  testing temporary tables part 2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher     Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     60      master-bin.000004       2886    slave-relay-bin.000001  7655    mast er-bin.000004       Yes     Yes                                                     0               0       2886    7655    None            0       No                                              #
  lock tables t3 read;
  select count(*) from t3 where n >= 4;
  count(*)
! 97
  unlock tables;
  drop table if exists t1,t2,t3,t4;
-------------------------------------------------------

Found only later: Similar 'rpl_rotate_logs' difference in 4.1.10a on 'hammer'
(also Linux on x86_64, but 'hammer' uses gcc whereas nocona uses icc).

Found still later:
The "original" failure, in "rpl_max_relay_size", also happens on "intelxeon3" in 4.1.10a.

It might be significant that all these failures occur in "debug" builds only (at least for the 4.1.10a tests). "hammer" and "intelxeon3" happen in "default" test run, whereas "nocona" passes there and fails only in "--ps-protocol" test run.

Lenz saw this again in the 4.0.25 build for 64bit solaris, here is the diff from the test failure:

-------------------------------------------------------
09503: *** r/rpl_max_relay_size.result	Wed Jun 29 20:32:01 2005
09504: --- r/rpl_max_relay_size.reject	Thu Jun 30 07:54:18 2005
09505: ***************
09506: *** 16,22 ****
09507:   start slave;
09508:   show slave status;
09509:   Master_Host	Master_User	Master_Port	Connect_retry	Master_Log_File	Read_Master_Log_Pos	Relay_Log_File	Relay_Log_Pos	Relay_Master_Log_File	Slave_IO_Running	Slave_SQL_Running	Replicate_do_db	Replicate_ignore_db	Last_errno	Last_error	Skip_counter	Exec_master_log_pos	Relay_log_space
09510: ! 127.0.0.1	root	MASTER_PORT	1	master-bin.001	50477	slave-relay-bin.014	1221	master-bin.001	Yes	Yes			0		0	50477	1221
09511:   stop slave;
09512:   reset slave;
09513:   set global max_relay_log_size=(5*4096);
09514: --- 16,22 ----
09515:   start slave;
09516:   show slave status;
09517:   Master_Host	Master_User	Master_Port	Connect_retry	Master_Log_File	Read_Master_Log_Pos	Relay_Log_File	Relay_Log_Pos	Relay_Master_Log_File	Slave_IO_Running	Slave_SQL_Running	Replicate_do_db	Replicate_ignore_db	Last_errno	Last_error	Skip_counter	Exec_master_log_pos	Relay_log_space
09518: ! 127.0.0.1	root	MASTER_PORT	1	master-bin.001	50477	slave-relay-bin.014	1221	master-bin.001	Yes	Yes			0		0	50477	5368
09519:   stop slave;
09520:   reset slave;
09521:   set global max_relay_log_size=(5*4096);
09522: -------------------------------------------------------

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/26923

The patch has a fix that prints a note when there is an erronous reading of the file size of a relay log.  This could help in detecting possible failures in the future.
There is also a fix for a mutex when rotating the relay log.

I have pushed a patch for 4.1 (but not for 5.0, since the code is different in this version) 
where an extra mutex is used to make sure the relay log is not rotated at the same 
time as other operations occur.

The patch is included in the coming 4.1.13 release.

Noted in 4.1.13 changelog.