Bug #2164 MYI files being corrupted on slaves.
Submitted: 18 Dec 2003 13:08    Modified: 29 Jan 2004 6:36
Reporter: Chad Clark
Status: No Feedback
Category: MySQL Server: Replication    Severity: S2 (Serious)
Version: 4.0.15 -> 4.0.17    OS: Linux (Debian 3.0)
Assigned to: Assigned Account    CPU Architecture: Any

[18 Dec 2003 13:08] Chad Clark
Description:
Sorry, I don't know how to reproduce this, but I've seen it a few times on different hardware, so it looks like a software issue.

We have a master running 4.0.15 and two slaves, both running 4.0.17.  Occasionally we find queries to one table failing.  Upon inspection it appears the contents of relay-log.info are appearing at the top of a .MYI file.  This happened to the same file twice (that I recorded) over a couple of weeks.  We changed the table to InnoDB and have not seen the problem since.

About a week later we noticed other queries failing (error 130 was reported again, if I recall).  This time we found similar text at the top of a different .MYI table file.

Originally we only saw this on the one table on the one slave machine, so we thought it could be filesystem- or hardware-related.  I asked for a badblocks check, but we didn't want the downtime (even though it was on a slave).

The issue experienced this morning (on the second table so far) occurred on both slaves.  That is, both slave tables were corrupted.  I didn't get a copy of both, so I'm not sure if they contained exactly the same data.

As far as I know the master tables never get corrupted.

The first table we noticed this on has 12 rows and has not been updated in several months.  Here is what the top of the file looked like, first uncorrupted and then corrupt.

offset    0  1  2  3   4  5  6  7   8  9  a  b   c  d  e  f  0123456789abcdef
00000000  fe fe 07 01  00 01 01 93  00 b0 00 64  00 d0 00 02  þþ.......°.d.Ð..
00000010  00 00 02 00  08 01 00 00  00 00 30 ff  00 00 00 00  ..........0ÿ....
00000020  00 00 00 09  00 00 00 00  00 00 00 00  00 00 00 00  ................
00000030  00 00 00 09  ff ff ff ff  ff ff ff ff  00 00 00 00  ....ÿÿÿÿÿÿÿÿ....
00000040  00 00 0c 00  00 00 00 00  00 00 01 5c  00 00 00 00  ...........\....
00000050  00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  ................
00000060  00 00 00 09  00 00 00 00  00 00 00 00  00 00 7a 5d  ..............z]
(above: AppCategories.MYI, uncorrupted)

offset    0  1  2  3   4  5  6  7   8  9  a  b   c  d  e  f  0123456789abcdef
00000000  2e 2f 69 73  74 6f 63 6b  64 62 2d 72  65 6c 61 79  ./istockdb-relay
00000010  2d 62 69 6e  2e 30 30 31  0a 34 0a 69  73 74 6f 63  -bin.001.4.istoc
00000020  6b 2d 62 69  6e 2e 30 30  31 0a 33 39  35 38 0a 00  k-bin.001.3958..
00000030  00 00 00 09  ff ff ff ff  ff ff ff ff  00 00 00 00  ....ÿÿÿÿÿÿÿÿ....
00000040  00 00 0c 00  00 00 00 00  00 00 01 5c  00 00 00 00  ...........\....
00000050  00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  ................
00000060  00 00 00 09  00 00 00 00  00 00 00 00  00 00 1c 44  ...............D
(above: AppCategories.MYI, corrupted)
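
A minimal sketch (Python; the /var/lib/mysql path is only an example) of the kind of header check described above: walk a data directory and flag any .MYI file whose leading bytes do not match the fe fe 07 magic visible at the top of the clean dump.  This assumes every healthy .MYI header starts with those bytes, which matches the dump here but is not an official MyISAM format check.

#!/usr/bin/env python
# Rough sketch: flag .MYI files whose leading bytes differ from the
# magic seen in the uncorrupted dump above (fe fe 07).  The default
# data directory path below is only an example.
import os
import sys

MAGIC = b"\xfe\xfe\x07"   # first three bytes of the clean AppCategories.MYI

def check_myi_headers(datadir):
    suspects = []
    for root, _dirs, files in os.walk(datadir):
        for name in files:
            if not name.endswith(".MYI"):
                continue
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                head = f.read(len(MAGIC))
            if head != MAGIC:
                suspects.append((path, head))
    return suspects

if __name__ == "__main__":
    datadir = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/mysql"
    for path, head in check_myi_headers(datadir):
        print("suspect header in %s: %r" % (path, head))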

Someone has pointed out that the corrupt table is always the second MyISAM table (lexicographically).  This could be a coincidence.  We have, however, added dummy MyISAM tables at the top of our database to protect our data in case this is true.  (There were new (real data) tables added between the time the first table was corrupted and the second table was corrupted.)

I have not enabled --log-isam at this time because the logs would be huge, but if it might help I'll push for it.

We are "pretty sure" we did the binary installs on all machines but I'm checking on it.

Originally I was under the impression that nothing was being done to the database by developers / admins when these issues were introduced.  However, it looks like machines were being synced last night, and there are entries in the hostname.err log with the same starting and ending offsets as the numbers appearing in the corrupt data.

031217 22:01:06  Slave I/O thread: connected to master 'replusr@ip:3306',  replication started in log 'FIRST' at position 4
ERROR: 1146  Table 'dbname.user' doesn't exist
031217 22:01:06  Slave: Error 'Table 'dbname.user' doesn't exist' on query 'ALTER TABLE `user` TYPE = MyISAM'. Default database: 'dbname', Error_code: 1146
031217 22:01:06  Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'istock-bin.001' position 79
031217 22:06:26  Slave I/O thread exiting, read up to log 'istock-bin.001', position 2124

In this case the data we saw in the corrupt file are:
  ./istockdb-relay-bin.001
  4
  istock-bin.001
  79
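
Along the same lines, a rough sketch (Python again, with example paths) of the cross-check above: read relay-log.info, which appears to hold one value per line (relay log name, relay log position, master log name, master log position, as in the corrupt dump), and look for that text copied over the front of any .MYI file.

#!/usr/bin/env python
# Rough sketch: does the text of relay-log.info appear at the start of
# any .MYI file?  Paths are examples; the relay-log.info layout and the
# datadir/<db>/<table>.MYI layout are inferred, not taken from the manual.
import glob
import os
import sys

def relay_info_leaks(datadir, relay_info_path):
    with open(relay_info_path, "rb") as f:
        info = f.read()
    if not info:
        return []
    # Matching on the first line (the relay log name) is enough, since the
    # corruption copied the file verbatim over the .MYI header.
    first_line = info.splitlines(True)[0]
    hits = []
    for path in glob.glob(os.path.join(datadir, "*", "*.MYI")):
        with open(path, "rb") as f:
            head = f.read(len(info))
        if head.startswith(first_line):
            hits.append(path)
    return hits

if __name__ == "__main__":
    datadir = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/mysql"
    info_path = sys.argv[2] if len(sys.argv) > 2 else os.path.join(datadir, "relay-log.info")
    for path in relay_info_leaks(datadir, info_path):
        print("relay-log.info text found at the start of %s" % path)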

Note that there were stack traces with some of these log entries.

./bin/resolve_stack_dump -s mysqld.sym -n stacktrace-2003-12-18
0x8070610 handle_segfault + 420
0x8289628 pthread_sighandler + 184
0x8287adb pthread_mutex_lock + 11
0x82b211a free + 122
0x826ee16 my_no_flags_free + 22
0x826f31c free_root + 120
0x807abc0 dispatch_command__F19enum_server_commandP3THDPcUi + 4216
0x8079b3d do_command__FP3THD + 165
0x8079329 handle_one_connection + 641
0x8286ddc pthread_start_thread + 220
0x82bc5ba thread_start + 4

And

0x8070610 handle_segfault + 420
0x8289628 pthread_sighandler + 184
0x82b1b69 chunk_alloc + 649
0x82b17910 _end + 2053977172
0x826f19a init_alloc_root + 90
0x82750c3 my_dir + 167
0x80a9c15 find_uniq_filename__FPc + 101
0x80a9e69 generate_new_name__9MYSQL_LOGPcPCc + 81
0x80a9fd7 open__9MYSQL_LOGPCc13enum_log_typeT1T110cache_typebUl + 195
0x8070d0f open_log__FP9MYSQL_LOGPCcN3113enum_log_typebT6Ul + 175
0x80e636a init_relay_log_info__FP17st_relay_log_infoPCc + 358
0x80e6dc0 init_master_info__FP14st_master_infoPCcT1b + 784
0x80eb8ca start_slave__FP3THDP14st_master_infob + 166
0x807ba23 mysql_execute_command__Fv + 3663
0x807eb8a mysql_parse__FP3THDPcUi + 146
0x807a0e3 dispatch_command__F19enum_server_commandP3THDPcUi + 1435
0x8079b3d do_command__FP3THD + 165
0x8079329 handle_one_connection + 641
0x8286ddc pthread_start_thread + 220
0x82bc5ba thread_start + 4

Unfortunately I can't say for sure how all of these log entries are related, etc.

If anyone has any advice or suggestions I would really like to hear them.

Thanks /Chad  <  chad AT evolvs DOT com  >

PS: All machines are running Debian 3.0 and a vanilla kernel 2.4.23

PPS: similar bug reports:
http://bugs.mysql.com/bug.php?id=563
http://bugs.mysql.com/bug.php?id=197

How to repeat:
Sorry, I don't know how to reproduce this, but I've seen it a few times on different hardware, so it looks like a software issue.  Therion on irc.freenode.net in #mysql recommended I post this as a bug.
[19 Dec 2003 9:28] Dean Ellis
Some questions I didn't ask (or whose answers didn't make it into the report) when we were talking earlier:

Are you using the official MySQL AB binaries or the Debian binaries (or self-compiled, or...)?

What file system are the slaves using?

Are the slaves using identical hardware?

Is it possible to add a slave which uses different hardware/kernel to see if the issue appears there as well?

Is this always accompanied by errors such as your failed ALTER TABLE statement?

Is this always accompanied by stack traces?

Are both slaves showing the same stack traces at the same times?

Are you showing other stack traces/crashes which are not accompanied by this corruption?

Do the relay logs on the slaves appear "sane" (check with mysqlbinlog)?

Can you run some hardware tests on the slaves (memory tests, drive tests, etc)?

As we discussed, this sounded very much like a hardware or kernel problem until it appeared on more than one slave, but that is still very possible, so we need as much information as we can get (particularly as there is not enough information for us to reproduce this if it is in fact our bug).  It does appear to be something specific to your particular environment so far.

Thank you
[23 Dec 2003 15:12] Chad Clark
Our master server froze up within hours of our noticing a problem on the website.  It does look like this may have caused a table on both slaves to become corrupt.  I do, however, stress that we have seen corruption which looked the same as what we saw last week without any noted hardware trouble.

Following is a timeline of what happened and a description of the hardware.  We are in the middle of adding new hardware, so some of the machines have been swapped around a bit, making this troubleshooting somewhat more difficult.

Right now I don't have any more information (other than some log files, but I'm not sure when the corruption occurred).  I think we may have to suspend researching this issue until we observe it again, when we can hopefully get more information as it happens.

Thanks,
/Chad

Trouble on Dec 18th: Timeline

Total slaves: db1, db2, iweb1, iweb2  (see hardware description below).

Slaves running : db1, iweb2

1) At 17:26 (17 Dec 2003) the master server stopped logging apache requests.
   Also syslogd contains nothing past 17:25.

2) At 18:27 db1 logged: "Slave I/O thread: error reconnecting to master 'repluser@MASTERS_IP:3306': Error: 'Lost connection to MySQL server during query'  errno: 2013  retry-time: 60  retries: 86400"

3) At 18:28 the master machine was hard rebooted.  Keyboard was not responding.

4) 18:28 db1 logged: "Got fatal error 1236: 'Client requested master to start replication from impossible position' from master when reading data from binary log" followed by "Slave I/O thread exiting".

5) Around 21:00 - 22:00 attempts were made to resync db1, db2, iweb1.
   Only db1 and db2 succeeded.  iweb1 would not start replication for an unknown reason.

6) Around 03:00 (18 Dec 2003) reports of strange values appearing on the site led to investigation.  It was discovered that slaves (not noted which ones) were reporting different values.

7) Around 04:00-04:30 the site was switched to use only the master for all queries.

8) 09:00-12:00 (18 Dec 2003) The master's tables were converted to MyISAM.  The slaves' data/ directories were cleared and replication was started via "LOAD DATA FROM MASTER".

Slaves running : db1, db2

Hardware descriptions:

Master:
------
note: also runs apache
OS: Debian 3.0
kernel: custom built, vanilla 2.4.21 (since then updated to 2.4.23)
filesystem: ext2 (mounted with noatime on the data/ directory)
mysql: mysql-standard-pc-linux-i686 4.0.15 - mysql binary install
        (since updated to mysql-standard-pc-linux-i686 4.0.17 - mysql binary install)
Dual Xeon 2.4 GHz w/ hyperthreading
RAM: 5 Gig ECC
ICP Vortex SCSI Raid 5

Slave 1: (db1)
-------
OS: Debian 3.0
kernel: custom built, vanilla 2.4.20 (since then updated to 2.4.23)
filesystem: ext2 (atime is enabled.)
mysql: mysql-standard-pc-linux-i686 4.0.17 - mysql binary install
Dual Athlon MP 2000 Plus
RAM: 1.5 Gig Not ECC
Single IDE drive.

Slave 2: (db2)
-------
OS: Debian 3.0
kernel: custom built, vanilla 2.4.23
filesystem: ext2 (atime is enabled.)
mysql: mysql-standard-pc-linux-i686 4.0.17 - mysql binary install
Dual Xeon 2.4 GHz with hyperthreading
RAM: 2 Gig ECC
Single IDE drive.

Slave 3: (iweb1)
-------
OS: Slackware 9.1
kernel: custom built 2.4.22 (with I2C patches applied)
filesystem: reiserfs (mounted with noatime on the data directory.)
mysql: mysql-standard-pc-linux-i686 4.0.16 - mysql binary install
Single Xeon 2.4 GHz on a Dual Motherboard
RAM: 3 Gig ECC
Single IDE drive.

Slave 4: (iweb2)
-------
OS: Slackware 9.1
kernel: custom built 2.4.22 (with I2C patches applied)
filesystem: reiserfs (mounted with noatime on the data directory.)
mysql: mysql-standard-pc-linux-i686 4.0.16 - mysql binary install
Single Xeon 2.4 GHz on a Dual Motherboard
RAM: 3 Gig ECC
Single IDE drive.
[14 Feb 2005 22:54] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".