Bug #24615 Server crash during replication
Submitted: 27 Nov 2006 10:15 Modified: 19 Mar 2007 15:48
Reporter: Scott Wilson Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.0.27 OS:MacOS (Mac OSX 10.4.6)
Assigned to: CPU Architecture:Any
Tags: binlog, replication

[27 Nov 2006 10:15] Scott Wilson
Description:
This server does nothing but run replication of a master running freebsd 6.1, mysql 5.0.24.  Another replication slave running freebsd 6.1 and mysql 5.0.27 continued replicating correctly.

I've included various details below that will hopefully help. The "show slave status" is after the crashed server was re-spawned by mysqld_safe which happened to have been run with --skip-slave-start hence replication had not restarted automatically.

This server had also been exhibiting behaviour like Issue: 23838, however I realize now the key file corruption was correlated to the server crashes which were being automatically restarted.

061124 11:02:26 [Warning] Got signal 1 from thread 1
mysqld got signal 10;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=209715200
read_buffer_size=2093056
max_used_connections=4
max_connections=50
threads_connected=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 409399 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

061127 03:11:57  mysqld restarted

**********

Host Name:      XServe2
Date/Time:      2006-11-27 03:11:56.967 -0500
OS Version:     10.4.6 (Build 8I127)
Report Version: 4

Command: mysqld
Path:    /usr/local/mysql/bin/mysqld
Parent:  sh [19889]

Version: ??? (???)

PID:    19913
Thread: Unknown

Exception:  EXC_BAD_ACCESS (0x0001)
Codes:      KERN_PROTECTION_FAILURE (0x0002) at 0x000001b8

Unable to generate backtrace for 64 bit task.

Unknown thread crashed with PPC Thread State 64:
  srr0: 0x0000000000394620 srr1: 0x900000000200f030                        vrsave: 0x0000000000000000
    cr: 0x24024244          xer: 0x0000000000000000   lr: 0x0000000000104990  ctr: 0x0000000000394604
    r0: 0x0000000000000001   r1: 0x00000000f0182010   r2: 0x0000000000000000   r3: 0x0000000002247430
    r4: 0x0000000002210108   r5: 0x0000000000000001   r6: 0x0000000000000001   r7: 0x0000000000000000
    r8: 0x0000000000000004   r9: 0x0000000000520478  r10: 0x00000000003789a4  r11: 0x00000000030d2c00
   r12: 0x0000000000394604  r13: 0x0000000003097400  r14: 0x000000000319cac8  r15: 0x00000000f01828d8
   r16: 0x000000000033c930  r17: 0x0000000000709010  r18: 0x0000000000000000  r19: 0x0000000003034050
   r20: 0x0000000000000001  r21: 0x0000000002210758  r22: 0x00000000f01828a8  r23: 0x0000000003034078
   r24: 0x00000000033312a0  r25: 0x00000000f01828d8  r26: 0x0000000000000000  r27: 0x0000000000000000
   r28: 0x0000000003034030  r29: 0x0000000000000001  r30: 0x0000000002247430  r31: 0x00000000001043b0

Binary Images Description:

******

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 26 to server version: 5.0.27-standard-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show slave status \G
*************************** 1. row ***************************
             Slave_IO_State: 
                Master_Host: pe1
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: pe1-binlog.000198
        Read_Master_Log_Pos: 462730627
             Relay_Log_File: xs2-relay-bin.000006
              Relay_Log_Pos: 462728624
      Relay_Master_Log_File: pe1-binlog.000198
           Slave_IO_Running: No
          Slave_SQL_Running: No
            Replicate_Do_DB: 
        Replicate_Ignore_DB: test,ccnbench
         Replicate_Do_Table: 
     Replicate_Ignore_Table: ccn.SITE_INDEX_ENTRY,ccn.SITE_INDEX_WORD_POSITION
    Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
                 Last_Errno: 0
                 Last_Error: 
               Skip_Counter: 0
        Exec_Master_Log_Pos: 462728486
            Relay_Log_Space: 462730863
            Until_Condition: None
             Until_Log_File: 
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File: 
         Master_SSL_CA_Path: 
            Master_SSL_Cert: 
          Master_SSL_Cipher: 
             Master_SSL_Key: 
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

******

...
#061127  3:11:54 server id 40  end_log_pos 462728304    Query   thread_id=1619362       exec_time=0     error_code=0
SET TIMESTAMP=1164615114;
UPDATE SESSION SET MODIFICATION_DATE=1164615115, `CONTEXT_NUMBER`='32', `LAST_ACTIVE_DATE`='1164615115' WHERE ID=79581317;
# at 462728304
#061127  3:11:54 server id 40  end_log_pos 462728486    Query   thread_id=1618853       exec_time=0     error_code=0
SET TIMESTAMP=1164615114;
UPDATE SESSION SET MODIFICATION_DATE=1164615115, `CONTEXT_NUMBER`='5', `LAST_ACTIVE_DATE`='1164615115' WHERE ID=79581892;
# at 462728486
#061127  3:11:41 server id 40  end_log_pos 462728514    Intvar
SET LAST_INSERT_ID=15064434;
# at 462728514
#061127  3:11:41 server id 40  end_log_pos 462728542    Intvar
SET INSERT_ID=15064434;
# at 462728542
#061127  3:11:41 server id 40  end_log_pos 462730627    Query   thread_id=1619590       exec_time=15    error_code=0
SET TIMESTAMP=1164615101;
... big long query follows ...

How to repeat:
I haven't found a test case yet, but the problem happens every few days on our system.  I'm installing the debug build of the server now to see if I can get more info out of it.  The system is only in use as a source of binary snapshots for backup so it's not a problem to turn on debugging and/or tinker with it.
[28 Nov 2006 9:45] Valeriy Kravchuk
Thank you for a problem report. Please, send my.cnf content from both master, normal slave and problematic slave. Are that FreeBSD instances running on 64-bit platforms? Send SHOW CREATE TABLE for that SESSION table. Is it possible for you to upload binary/relay log that corresponds to this crash?
[28 Nov 2006 10:56] Scott Wilson
master my.cnf (freebsd 6.1 amd64, 5.0.24)

Attachment: my.cnf.master (application/octet-stream, text), 1.75 KiB.

[28 Nov 2006 10:56] Scott Wilson
normal slave my.cnf (freebsd 6.1 amd64, 5.0.27)

Attachment: my.cnf.slave_normal (application/octet-stream, text), 1.62 KiB.

[28 Nov 2006 10:57] Scott Wilson
problem slave my.cnf (OS X Server 10.4.6 64-bit,  5.0.27 64-bit)

Attachment: my.cnf.slave_problem (application/octet-stream, text), 1.63 KiB.

[28 Nov 2006 11:19] Scott Wilson
Hi,

Yes all three machines are running 64-bit platforms.  I've attached the SHOW CREATE TABLE output below for the SESSION table and the MY_IDEALIST_MATCH table which is where I suspect the problem lies.  I think the key corruption manifests itself in the SESSION table because it is very busy with writes.

The crash seems to correspond to the first complex insert query into MY_IDEALIST_MATCH (see attached binlog output).  Once a day a process is run that executes a whole series of these, and once the server crashes and restarts it runs them all without incident.  

I've uploaded two binlog excerpts, in the second one the crash is still very near the first of these complex queries, but the restart point is after rather than before the query in the second. 

CREATE TABLE `SESSION` (
  `ID` int(11) NOT NULL auto_increment,
  `CREATION_DATE` int(11) NOT NULL default '0',
  `MODIFICATION_DATE` int(11) NOT NULL default '0',
  `LAST_ACTIVE_DATE` int(11) NOT NULL default '0',
  `CLIENT_IP` char(16) NOT NULL default '0',
  `CONTEXT_NUMBER` int(11) NOT NULL default '0',
  `USER_ID` int(11) default NULL,
  `USER_IS_AUTHENTICATED` smallint(6) NOT NULL default '0',
  PRIMARY KEY  (`ID`),
  KEY `LAST_ACTIVE_DATE` (`LAST_ACTIVE_DATE`)
) ENGINE=MyISAM AUTO_INCREMENT=79697223 DEFAULT CHARSET=latin1 DELAY_KEY_WRITE=1

CREATE TABLE `MY_IDEALIST_MATCH` (
  `ID` int(11) NOT NULL auto_increment,
  `CREATION_DATE` int(11) NOT NULL default '0',
  `MODIFICATION_DATE` int(11) NOT NULL default '0',
  `USER_ID` int(11) NOT NULL default '0',
  `ASSET_TYPE_ID` int(11) NOT NULL default '0',
  `ASSET_ID` int(11) NOT NULL default '0',
  `HAS_BEEN_VISITED` tinyint(4) NOT NULL default '0',
  `MY_IDEALIST_EMAIL_ME_PREFERENCE_ID` int(11) NOT NULL default '0',
  `MY_IDEALIST_MATCH_RUN_ID` int(11) NOT NULL default '0',
  `ASSET_TAG_ID` int(11) NOT NULL default '0',
  PRIMARY KEY  (`ID`),
  KEY `MY_IDEALIST_QUALIFIER_ID` (`MY_IDEALIST_EMAIL_ME_PREFERENCE_ID`),
  KEY `MY_IDEALIST_EMAIL_ME_PREFERENCE_ID` (`MY_IDEALIST_EMAIL_ME_PREFERENCE_ID`),
  KEY `ASSET_ID` (`ASSET_ID`,`ASSET_TYPE_ID`,`MY_IDEALIST_MATCH_RUN_ID`),
  KEY `USER_ID` (`USER_ID`,`ASSET_ID`),
  KEY `MY_IDEALIST_MATCH_RUN_ID` (`MY_IDEALIST_MATCH_RUN_ID`,`USER_ID`)
) ENGINE=MyISAM AUTO_INCREMENT=17435900 DEFAULT CHARSET=latin1

*************************** 1. row ***************************
           Name: MY_IDEALIST_MATCH
         Engine: MyISAM
        Version: 10
     Row_format: Fixed
           Rows: 17289589
 Avg_row_length: 38
    Data_length: 657004952
Max_data_length: 10696049115004927
   Index_length: 1609876480
      Data_free: 570
 Auto_increment: 17449410
    Create_time: 2006-11-10 04:31:32
    Update_time: 2006-11-28 05:46:40
     Check_time: 2006-11-28 05:06:43
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment:
[19 Feb 2007 15:48] Valeriy Kravchuk
Sorry for a delay with this report. Please, send the results of:

free

and

ulimit -a

commands for the user that runs MySQL server on that problematic slave. You may need up to 1600M of memory (and a lot of file descriptors) in some cases with that my.cnf you have. 

Have you tried to upgrade OS to 10.4.7 or 10.4.8?
[20 Mar 2007 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".