Bug #6987 | "rpl_max_relay_size" fails on SGI IRIX | ||
---|---|---|---|
Submitted: | 3 Dec 2004 13:04 | Modified: | 17 Jul 2005 0:43 |
Reporter: | Lenz Grimmer | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 4.1.8-pre + 4.1.10 + 4.1.10a | OS: | SGI IRIX 6.5 + Tru64 + ... |
Assigned to: | Lars Thalmann | CPU Architecture: | Any |
[3 Dec 2004 13:04]
Lenz Grimmer
[3 Dec 2004 14:38]
Lenz Grimmer
Please disregard - I was not able to reproduce it manually afterwards :(
[25 Feb 2005 13:14]
Joerg Bruehe
Very similar symptom in 4.1.10 on 'barney': ------------------------------------------------------- *** r/rpl_max_relay_size.result Mon Feb 14 19:58:34 2005 --- r/rpl_max_relay_size.reject Fri Feb 25 10:46:29 2005 *************** *** 49,60 **** create table t1 (a int); show slave status; Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master ! # 127.0.0.1 root MASTER_PORT 1 master-bin.000001 50535 slave-relay-bin.000009 62 master-bin.000001 Yes Yes 0 0 50535 62 None 0 No # flush logs; drop table t1; show slave status; Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master ! # 127.0.0.1 root MASTER_PORT 1 master-bin.000001 50583 slave-relay-bin.000010 52 master-bin.000001 Yes Yes 0 0 50583 52 None 0 No # flush logs; show master status; File Position Binlog_Do_DB Binlog_Ignore_DB --- 49,60 ---- create table t1 (a int); show slave status; Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master ! # 127.0.0.1 root MASTER_PORT 1 master-bin.000001 50535 slave-relay-bin.000009 62 master-bin.000001 Yes Yes 0 0 50535 8256 None 0 No # flush logs; drop table t1; show slave status; Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master ! # 127.0.0.1 root MASTER_PORT 1 master-bin.000001 50583 slave-relay-bin.000010 52 master-bin.000001 Yes Yes 0 0 50583 8246 None 0 No # flush logs; show master status; File Position Binlog_Do_DB Binlog_Ignore_DB ------------------------------------------------------- Lars looked into this, this is his comment on IRC: <lars> joerg: I've looked into the test failure and could repeat it a couple of times and then I could *not* anymore. The failure is that the system reports the wrong total size of all the relay logs (I've checked all logs and everything else is working fine). I think that there could be the following reasons for the failure: 1) The disk is corrupt, 2) mysys.c:my_stat() has a bug when finding file sizes, 3) Summarizing log sizes in rpl code is wrong (but seems right from reading it). Anyhow, to check this I need to set the test program to not purge the relay logs to see what those files really are. In the only saved failure I have, these were purged (as they should), so I can't see any error there. But since it is not failing anymore, I can't do much... <joerg> lars: So you propose to ignore that for the moment? <joerg> lars: Ok. Just for completeness, I will file a bug report and include "not reproducible for certain", so that the info does not get lost. <lars> joerg: Ok, sounds good
[28 Feb 2005 12:28]
Guilhem Bichot
As agreed with Brian, assigning to Lars as he already worked on this bug.
[8 Mar 2005 11:57]
Joerg Bruehe
This seems to be an intermittent problem, happening and hiding without any pattern. A similar symptom occurred in the tests for the "certified" (!) 4.1.10a on 'nocona' (Linux x86_64): ------------------------------------------------------- *** r/rpl_rotate_logs.result Sun Mar 6 03:49:23 2005 --- r/rpl_rotate_logs.reject Sun Mar 6 05:50:07 2005 *************** *** 84,93 **** testing temporary tables part 2 show slave status; Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master ! # 127.0.0.1 root MASTER_PORT 60 master-bin.000004 2886 slave-relay-bin.000001 7891 mast er-bin.000004 Yes Yes 0 0 2886 7891 None 0 No # lock tables t3 read; select count(*) from t3 where n >= 4; count(*) ! 100 unlock tables; drop table if exists t1,t2,t3,t4; --- 84,93 ---- testing temporary tables part 2 show slave status; Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master ! # 127.0.0.1 root MASTER_PORT 60 master-bin.000004 2886 slave-relay-bin.000001 7655 mast er-bin.000004 Yes Yes 0 0 2886 7655 None 0 No # lock tables t3 read; select count(*) from t3 where n >= 4; count(*) ! 97 unlock tables; drop table if exists t1,t2,t3,t4; -------------------------------------------------------
[8 Mar 2005 12:06]
Joerg Bruehe
Found only later: Similar 'rpl_rotate_logs' difference in 4.1.10a on 'hammer' (also Linux on x86_64, but 'hammer' uses gcc whereas nocona uses icc).
[8 Mar 2005 14:56]
Joerg Bruehe
Found still later: The "original" failure, in "rpl_max_relay_size", also happens on "intelxeon3" in 4.1.10a. It might be significant that all these failures occur in "debug" builds only (at least for the 4.1.10a tests). "hammer" and "intelxeon3" happen in "default" test run, whereas "nocona" passes there and fails only in "--ps-protocol" test run.
[1 Jul 2005 12:00]
Elliot Murphy
Lenz saw this again in the 4.0.25 build for 64bit solaris, here is the diff from the test failure: ------------------------------------------------------- 09503: *** r/rpl_max_relay_size.result Wed Jun 29 20:32:01 2005 09504: --- r/rpl_max_relay_size.reject Thu Jun 30 07:54:18 2005 09505: *************** 09506: *** 16,22 **** 09507: start slave; 09508: show slave status; 09509: Master_Host Master_User Master_Port Connect_retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_do_db Replicate_ignore_db Last_errno Last_error Skip_counter Exec_master_log_pos Relay_log_space 09510: ! 127.0.0.1 root MASTER_PORT 1 master-bin.001 50477 slave-relay-bin.014 1221 master-bin.001 Yes Yes 0 0 50477 1221 09511: stop slave; 09512: reset slave; 09513: set global max_relay_log_size=(5*4096); 09514: --- 16,22 ---- 09515: start slave; 09516: show slave status; 09517: Master_Host Master_User Master_Port Connect_retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_do_db Replicate_ignore_db Last_errno Last_error Skip_counter Exec_master_log_pos Relay_log_space 09518: ! 127.0.0.1 root MASTER_PORT 1 master-bin.001 50477 slave-relay-bin.014 1221 master-bin.001 Yes Yes 0 0 50477 5368 09519: stop slave; 09520: reset slave; 09521: set global max_relay_log_size=(5*4096); 09522: -------------------------------------------------------
[12 Jul 2005 4:01]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/26923
[12 Jul 2005 4:23]
Lars Thalmann
The patch has a fix that prints a note when there is an erronous reading of the file size of a relay log. This could help in detecting possible failures in the future. There is also a fix for a mutex when rotating the relay log.
[14 Jul 2005 23:42]
Lars Thalmann
I have pushed a patch for 4.1 (but not for 5.0, since the code is different in this version) where an extra mutex is used to make sure the relay log is not rotated at the same time as other operations occur. The patch is included in the coming 4.1.13 release.
[17 Jul 2005 0:43]
Paul DuBois
Noted in 4.1.13 changelog.