Bug #6987 "rpl_max_relay_size" fails on SGI IRIX
Submitted: 3 Dec 2004 13:04 Modified: 17 Jul 2005 0:43
Reporter: Lenz Grimmer Email Updates:
Status: Closed
Category:Server: Replication Severity:S2 (Serious)
Version:4.1.8-pre + 4.1.10 + 4.1.10a OS:SGI IRIX 6.5 + Tru64 + ...
Assigned to: Lars Thalmann Target Version:

[3 Dec 2004 13:04] Lenz Grimmer
Description:
When running the test suite, the "rpl_max_relay_size" fails with the following diff:

Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/rpl_max_relay_size.result Fri Dec  3 00:16:09 2004
--- r/rpl_max_relay_size.reject Fri Dec  3 05:42:38 2004
***************
*** 16,22 ****
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File     Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running    Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table     
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error 
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Unti
l_Log_Pos       Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert     Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50477  
slave-relay-bin.000014      1221    master-bin.000001       Yes     Yes
                0               0       50477   1221    None            0       No
                                #
  stop slave;
  reset slave;
  set global max_relay_log_size=(5*4096);
--- 16,22 ----
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File     Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running    Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table     
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last
_Error  Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition
Until_Log_File  Until_Log_Pos       Master_SSL_Allowed      Master_SSL_CA_File     
Master_SSL_CA_Path      Master_SSL_Cert     Master_SSL_Cipher       Master_SSL_Key 
Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50477  
slave-relay-bin.000014      1221    master-bin.000001       Yes     Yes
                0               0       50477   1159    None            0       No
                                #
  stop slave;
  reset slave;
  set global max_relay_log_size=(5*4096);
-------------------------------------------------------

How to repeat:
Run the test and notice the failure.
[3 Dec 2004 14:38] Lenz Grimmer
Please disregard - I was not able to reproduce it manually afterwards :(
[25 Feb 2005 13:14] Joerg Bruehe
Very similar symptom in 4.1.10 on 'barney':

-------------------------------------------------------
*** r/rpl_max_relay_size.result Mon Feb 14 19:58:34 2005
--- r/rpl_max_relay_size.reject Fri Feb 25 10:46:29 2005
***************
*** 49,60 ****
  create table t1 (a int);
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50535  
slave-relay-bin.000009  62      master-bin.000001       Yes     Yes                      
                              0               0       50535   62      None            0   
   No                                              #
  flush logs;
  drop table t1;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50583  
slave-relay-bin.000010  52      master-bin.000001       Yes     Yes                      
                              0               0       50583   52      None            0   
   No                                              #
  flush logs;
  show master status;
  File  Position        Binlog_Do_DB    Binlog_Ignore_DB
--- 49,60 ----
  create table t1 (a int);
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50535  
slave-relay-bin.000009  62      master-bin.000001       Yes     Yes                      
                              0               0       50535   8256    None            0   
   No                                              #
  flush logs;
  drop table t1;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       50583  
slave-relay-bin.000010  52      master-bin.000001       Yes     Yes                      
                              0               0       50583   8246    None            0   
   No                                              #
  flush logs;
  show master status;
  File  Position        Binlog_Do_DB    Binlog_Ignore_DB
-------------------------------------------------------

Lars looked into this, this is his comment on IRC:

<lars> joerg: I've looked into the test failure and could repeat it a couple of times and
then I could *not* anymore.  The failure is that the system reports the wrong total size
of all the relay logs (I've checked all logs and everything else is working fine). I
think that there could be the following reasons for the failure:
1) The disk is corrupt, 
2) mysys.c:my_stat() has a bug when finding file sizes, 
3) Summarizing log sizes in rpl code is wrong (but seems right from reading it). 
Anyhow, to check this I need to set the test program to not purge the relay logs to see
what those files really are.  In the only saved failure I have, these were purged (as
they should), so I can't see any error there.  But since it is not failing anymore, I
can't do much...  
<joerg> lars: So you propose to ignore that for the moment?
<joerg> lars: Ok. Just for completeness, I will file a bug report and include "not
reproducible for certain", so that the info does not get lost.
<lars> joerg: Ok, sounds good
[28 Feb 2005 12:28] Guilhem Bichot
As agreed with Brian, assigning to Lars as he already worked on this bug.
[8 Mar 2005 11:57] Joerg Bruehe
This seems to be an intermittent problem, happening and hiding without any pattern.
A similar symptom occurred in the tests for the "certified" (!) 4.1.10a on 'nocona'
(Linux x86_64):

-------------------------------------------------------
*** r/rpl_rotate_logs.result    Sun Mar  6 03:49:23 2005
--- r/rpl_rotate_logs.reject    Sun Mar  6 05:50:07 2005
***************
*** 84,93 ****
  testing temporary tables part 2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher     Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     60      master-bin.000004       2886   
slave-relay-bin.000001  7891    mast er-bin.000004       Yes     Yes                     
                               0               0       2886    7891    None            0  
    No                                              #
  lock tables t3 read;
  select count(*) from t3 where n >= 4;
  count(*)
! 100
   unlock tables;
   drop table if exists t1,t2,t3,t4;
--- 84,93 ----
  testing temporary tables part 2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher     Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     60      master-bin.000004       2886   
slave-relay-bin.000001  7655    mast er-bin.000004       Yes     Yes                     
                               0               0       2886    7655    None            0  
    No                                              #
  lock tables t3 read;
  select count(*) from t3 where n >= 4;
  count(*)
! 97
  unlock tables;
  drop table if exists t1,t2,t3,t4;
-------------------------------------------------------
[8 Mar 2005 12:06] Joerg Bruehe
Found only later: Similar 'rpl_rotate_logs' difference in 4.1.10a on 'hammer'
(also Linux on x86_64, but 'hammer' uses gcc whereas nocona uses icc).
[8 Mar 2005 14:56] Joerg Bruehe
Found still later:
The "original" failure, in "rpl_max_relay_size", also happens on "intelxeon3" in
4.1.10a.

It might be significant that all these failures occur in "debug" builds only (at least
for the 4.1.10a tests). "hammer" and "intelxeon3" happen in "default" test run, whereas
"nocona" passes there and fails only in "--ps-protocol" test run.
[1 Jul 2005 12:00] Elliot Murphy
Lenz saw this again in the 4.0.25 build for 64bit solaris, here is the diff from the test
failure:

-------------------------------------------------------
09503: *** r/rpl_max_relay_size.result	Wed Jun 29 20:32:01 2005
09504: --- r/rpl_max_relay_size.reject	Thu Jun 30 07:54:18 2005
09505: ***************
09506: *** 16,22 ****
09507:   start slave;
09508:   show slave status;
09509:  
Master_Host	Master_User	Master_Port	Connect_retry	Master_Log_File	Read_Master_Log_Pos	Relay_Log_File	Relay_Log_Pos	Relay_Master_Log_File	Slave_IO_Running	Slave_SQL_Running	Replicate_do_db	Replicate_ignore_db	Last_errno	Last_error	Skip_counter	Exec_master_log_pos	Relay_log_space
09510: !
127.0.0.1	root	MASTER_PORT	1	master-bin.001	50477	slave-relay-bin.014	1221	master-bin.001	Yes	Yes			0		0	50477	1221
09511:   stop slave;
09512:   reset slave;
09513:   set global max_relay_log_size=(5*4096);
09514: --- 16,22 ----
09515:   start slave;
09516:   show slave status;
09517:  
Master_Host	Master_User	Master_Port	Connect_retry	Master_Log_File	Read_Master_Log_Pos	Relay_Log_File	Relay_Log_Pos	Relay_Master_Log_File	Slave_IO_Running	Slave_SQL_Running	Replicate_do_db	Replicate_ignore_db	Last_errno	Last_error	Skip_counter	Exec_master_log_pos	Relay_log_space
09518: !
127.0.0.1	root	MASTER_PORT	1	master-bin.001	50477	slave-relay-bin.014	1221	master-bin.001	Yes	Yes			0		0	50477	5368
09519:   stop slave;
09520:   reset slave;
09521:   set global max_relay_log_size=(5*4096);
09522: -------------------------------------------------------
[12 Jul 2005 4:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/26923
[12 Jul 2005 4:23] Lars Thalmann
The patch has a fix that prints a note when there is an erronous reading of the file size
of a relay log.  This could help in detecting possible failures in the future.
There is also a fix for a mutex when rotating the relay log.
[14 Jul 2005 23:42] Lars Thalmann
I have pushed a patch for 4.1 (but not for 5.0, since the code is different in this
version) 
where an extra mutex is used to make sure the relay log is not rotated at the same 
time as other operations occur.

The patch is included in the coming 4.1.13 release.
[17 Jul 2005 0:43] Paul DuBois
Noted in 4.1.13 changelog.