Bug #8693 Test 'rpl_log_pos' fails sometimes
Submitted: 22 Feb 2005 17:25 Modified: 31 Aug 14:23
Reporter: Joerg Bruehe
Status: Verified
Category:Tests: Replication Severity:S1 (Critical)
Version:5.0.3 OS:IBM AIX (AIX 5.2, QNX, Mac OS X)
Assigned to: Serge Kozlov Target Version:
Triage: D3 (Medium)

[22 Feb 2005 17:25] Joerg Bruehe
Description:
Build based on changeset
  1.1880 05/02/21 20:41:48 monty@mysql.com +8 -0
  Generate warning in ha_delete_table() if files is missing in handler
fails in 'classic' (others not yet done, 'debug' makes compiler crash). Result output:

-------------------------------------------------------
*** r/rpl_log_pos.result        Tue Feb 22 01:56:23 2005
--- r/rpl_log_pos.reject        Tue Feb 22 13:34:17 2005
***************
*** 21,33 ****
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos
    Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running       
Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table
    Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos    
Relay_Log_Space Until_Condition Until_Log_File
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher
    Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       73      #
      #       master-bin.000001
    No      Yes                                                     0               0    
  73      #       None
0       No                                              #
  stop slave;
  change master to master_log_pos=173;
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos
    Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running       
Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table
    Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos    
Relay_Log_Space Until_Condition Until_Log_File
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher
    Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       173     #
      #       master-bin.000001
    No      Yes                                                     0               0    
  173     #       None
0       No                                              #
  show master status;
  File  Position        Binlog_Do_DB    Binlog_Ignore_DB
  master-bin.000001     96
--- 21,33 ----
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos
    Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running       
Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table
    Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos    
Relay_Log_Space Until_Condition Until_Log_File
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher
    Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       73      #
      #       master-bin.000001
    No      Yes                                                     0       Rolling back
unfinished transaction (no COMMIT o
r ROLLBACK) from relay log. A probable cause is that the master died while writing the
transaction to its binary log.   0
    73      #       None            0       No                                           
  #
  stop slave;
  change master to master_log_pos=173;
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos
    Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running       
Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table
    Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos    
Relay_Log_Space Until_Condition Until_Log_File
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher
    Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       173     #
      #       master-bin.000001
    No      Yes                                                     0       Rolling back
unfinished transaction (no COMMIT o
r ROLLBACK) from relay log. A probable cause is that the master died while writing the
transaction to its binary log.   0
    173     #       None            0       No                                           
  #
  show master status;
  File  Position        Binlog_Do_DB    Binlog_Ignore_DB
  master-bin.000001     96
-------------------------------------------------------

How to repeat:
Build + test there.

I saved the binary tree on that machine 'aix52' in a tree
'~mysqldev/KEEP_503_20050222_classic'. Please delete it when it is not needed any more -
disk space is scarce everywhere.
[23 Feb 2005 16:58] Lars Thalmann
After analyzing this bug, the followings has been found out:

The bug only shows up if one runs 
  ./mysql-test-run rpl_log rpl_log_pos

Running the following, shows no failure:
  ./mysql-test-run rpl_log_pos

The problems shows up in the following code:

int Format_description_log_event::exec_event(struct st_relay_log_info* rli)
{
  DBUG_ENTER("Format_description_log_event::exec_event");

  /* save the information describing this binlog */
  delete rli->relay_log.description_event_for_exec;
  rli->relay_log.description_event_for_exec= this;

  /*
    As a transaction NEVER spans on 2 or more binlogs:
    if we have an active transaction at this point, the master died
    while writing the transaction to the binary log, i.e. while
    flushing the binlog cache to the binlog. As the write was started,
    the transaction had been committed on the master, so we lack of
    information to replay this transaction on the slave; all we can do
    is stop with error.
    Note: this event could be sent by the master to inform us of the
    format of its binlog; in other words maybe it is not at its
    original place when it comes to us; we'll know this by checking
    log_pos ("artificial" events have log_pos == 0).
  */
  if (!artificial_event && created && thd->transaction.all.nht)
  {
    slave_print_error(rli, 0, "Rolling back unfinished transaction (no "
                      "COMMIT or ROLLBACK) from relay log. A probable cause "
                      "is that the master died while writing the transaction "
                      "to its binary log.");
    end_trans(thd, ROLLBACK);
  }

When this is executed (and there is a failure), we have that
  thd->options & OPTION_BEGIN = 0
  thd->options & OPTION_NOT_AUTOCOMMIT = 0

The first of these tests (OPTION_BEGIN) was removed from the code 
earlier and adding it back removes the failure.  According to Serg 
this is not enough however, and one needs to study this in more detail.
[28 Feb 2005 12:18] Lars Thalmann
Many changes has been done to the 5.0 tree.
Have discussed with Serg and Brian.  Since this 
failure did not repeat itself in any of the later tests, 
it is being closed.
[24 Mar 2005 18:44] Joerg Bruehe
This symptom re-occurs in the final release build of 5.0.3 on some machines:
- AIX 4.3 "classic" in both default + PS-protocol tests,
- AIX 5.2 (64 bit) "classic" just in default test (first difference only, lines 21 -
27),
- QNX only in PS-protocol test, and also earlier in test case: Lines 9 - 15 (diff added
below),
- Max OS X 10.2 (not 10.3 !) "classic" just in default test, and only for lines 9 - 15.

Difference from QNX:
-------------------------------------------------------
*** r/rpl_log_pos.result        Wed Mar 23 02:46:44 2005
--- r/rpl_log_pos.reject        Thu Mar 24 08:19:01 2005
***************
*** 9,15 ****
  master-bin.000001     98
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       98      #
      #       master-bin.000001       Yes     Yes                                         
           0               0       98      #       None            0       No             
                                #
  stop slave;
  change master to master_log_pos=73;
  start slave;
--- 9,15 ----
  master-bin.000001     98
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       98      #
      #       master-bin.000001       Yes     Yes                                         
           0       Rolling back unfinished transaction (no COMMIT or ROLLBACK) from relay
log. A probable cause is that the master died while writing the transaction to its binary
log.   0       98      #       None            0       No                                 
            #
  stop slave;
  change master to master_log_pos=73;
  start slave;
(( continues with lines 21 - 33 as in original report ))

It might be significant that this was noticed only in "classic" builds (without InnoDB).
[29 Jan 2007 13:11] Sveta Smirnova
Bug #25713 can be dependent from this one
[27 Nov 2007 17:26] Ingo Strüwing
Raised priority. It presented itself as a crashing bug on Debian x86_64. But it is not
always repeatable.

strlen () from /lib/libc.so.6
Protocol::store (this=0xffcf70, from=0x8f8f8f8f8f8f8f8f <Address 0x8f8f8f8f8f8f8f8f out
of bounds>, cs=0xf6e2c0) at protocol.cc:723
show_master_info (thd=0xffcbc8, mi=0xff1ca0) at slave.cc:2457
mysql_execute_command (thd=0xffcbc8) at sql_parse.cc:2858
mysql_parse (thd=0xffcbc8, inBuf=0x1006df8 "SHOW SLAVE STATUS", length=18,
found_semicolon=0x40881e70) at sql_parse.cc:6113
dispatch_command (command=COM_QUERY, thd=0xffcbc8, packet=0xffed69 "", packet_length=20)
at sql_parse.cc:1821
do_command (thd=0xffcbc8) at sql_parse.cc:1595
handle_one_connection (arg=0xffcbc8) at sql_parse.cc:1201
start_thread () from /lib/libpthread.so.0
clone () from /lib/libc.so.6

How to repeat:
OS: Debian GNU/Linux/x86_64
OS: Debian Sid kernel 2.6.22  SMP PREEMPT
gcc (GCC) 4.2.3 20071014 (prerelease) (Debian 4.2.2-3)

bk clone bk-internal.mysql.com:/home/bk/mysql-5.0-engines mysql-5.0-axmrg
cd mysql-5.0-axmrg
BUILD/compile-pentium-debug-max --with-debug=full
make test-force

Disabling the test case. Please re-enable after fix.
[28 Nov 2007 11:24] Bugs System
Pushed into 6.0.4-alpha
[28 Nov 2007 11:26] Bugs System
Pushed into 5.1.23-rc
[28 Nov 2007 11:27] Bugs System
Pushed into 5.0.54
[23 Apr 2008 11:57] Serge Kozlov
Couldn't repeat so probably with others bugs
[24 Apr 2008 23:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/45975

ChangeSet@1.2613, 2008-04-25 00:56:29+04:00, skozlov@mysql.com +1 -0
  Bug#8693 is enabled.
[31 Aug 14:23] Andrei Elkin
Test case rpl_log_pos is still disabled in 5.0-bugteam tree.