MySQL Bugs: #45489: Endless tables corruptions

Bug #45489	Endless tables corruptions
Submitted:	14 Jun 2009 23:16	Modified:	28 Jul 2009 5:36
Reporter:	Jérôme Blion	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S1 (Critical)
Version:	5.1.34-0.dotdeb.1-log and 5.1.35-log	OS:	Linux (Debian Lenny 64b)
Assigned to:		CPU Architecture:	Any
Tags:	corruption, crash, ext4, ext4dev, innodb

Description:
Hello, 

I'm using one 40G database on 2 mysql servers (master/slave scheme).I'm backing up this database with mylvmbackup (both on master and slave). mysqladmin status reported 6600 queries per second.

Several times per day, Mysql crashes complaining about a data corruption. Tables involved can change (today, tloc_link_user, smsgateway and invites). Several times per day, I'm doing mysqldump and then importing the dump back...

Some days, I have no crashes, some others days, server keeps crashing for several hours. Disks have been checked on both servers and reported no errors.

What could I do to fix these crashes ?

Here is the main part of my my.cnf

[mysqld]
user                    = mysql
port                    = 3306
old_passwords           = false
max_connect_errors      = 10000
socket                  = /var/run/mysqld/mysqld.sock
pid-file                = /var/run/mysqld/mysqld.pid
datadir                 = /home/mysql
basedir                 = /usr
tmpdir                  = /tmp
skip-locking
key_buffer              = 32M
max_allowed_packet      = 4M
table_cache             = 20000
sort_buffer_size        = 2M
read_buffer_size        = 3M
read_rnd_buffer_size    = 3M
myisam_sort_buffer_size = 1M
thread_cache_size       = 128
query_cache_type        = 1
query_cache_limit       = 4M
query_cache_size        = 8M
join_buffer_size        = 2M
max_connections         = 400   # JBN : Don't increase this value. Or you will burry the server !
low_priority_updates    = 1
concurrent_insert       = 2
long_query_time         = 5
thread_concurrency      = 128
tmp_table_size          = 128M
max_heap_table_size     = 128M
max_tmp_tables          = 16
log-error               = /var/log/mysql/mysql.error.log.txt                    
log-slow-queries        = /var/log/mysql/mysql.slowqueries.log.txt
skip-federated
log-bin                 = mysql-bin
relay-log               = /home/mysql/mysql-relay
max_binlog_size         = 100M
expire_logs_days        = 3
binlog-format           = ROW # NEVER EVER MODIFY THIS: innodb_autoinc_lock_mode = 2
ft_min_word_len         = 3
server-id               = 3
character-set-server    = utf8
collation-server        = utf8_general_ci
language                = /usr/share/mysql/english
innodb_data_file_path           = ibdata1:1G:autoextend
innodb_table_locks              = OFF
innodb_support_xa               = 0
innodb_autoinc_lock_mode        = 2 # WARNING : binlog-format = ROW IS MANDATORY
innodb_flush_log_at_trx_commit  = 2
innodb_buffer_pool_size         = 7G
innodb_additional_mem_pool_size = 8M
innodb_log_file_size            = 128M
innodb_log_buffer_size          = 8M
innodb_file_per_table
innodb_thread_concurrency       = 30

I will attach the log file I have...

How to repeat:
When table is corrupted, any request on the corrupted segment will crash Mysql.
Otherwise, I don't know how to make and crash my servers.

Mysql error log

Attachment: 45489-mysql.error.log.zip (application/zip, text), 368.71 KiB.

Thank you for the report.

Have you run mysql_upgrade when upgraded? have you run CHECK TABLE on problem tables as in error log suggested?

Hello, 

Thanks for your answer.
The database has always been hosted on 5.1 Mysql Servers.

# mysql_upgrade
Looking for 'mysql' as: mysql
Looking for 'mysqlcheck' as: mysqlcheck
This installation of MySQL is already upgraded to 5.1.34, use --force if you still need to run mysql_upgrade

Will CHECK TABLE have a better result than mysqldump + import of the dump ?
(the mysqldump drops the table)

Check table done on phpbb_users (the table which crashed today)

New corruption on tprofile_info and smsgateway !
On check table is running accross the first one (several GB, it will take some times, the server is loaded)

090615 12:21:22  InnoDB: error clustered record for sec rec not found
InnoDB: index `fdate` of table `mobiluck`.`tprofile_info`
InnoDB: sec index record PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 8; hex 8000124120a8c566; asc    A   f;; 1: len 4; hex 80032a74; asc   *t;; 2: len 5; hex 4130333031; asc A0301;;

InnoDB: clust index record PHYSICAL RECORD: n_fields 18; compact format; info bits 0
 0: len 4; hex 80040c73; asc    s;; 1: len 5; hex 4131313031; asc A1101;; 2: len 6; hex 0000717814b9; asc   qx  ;; 3: len 7; hex 80000080060fd8; asc        ;; 4: len 6; hex 202020202020; asc       ;; 5: len 4; hex 800d00b1; asc     ;; 6: len 4; hex 80000000; asc     ;; 7: len 10; hex 616e6169733030373833; asc anais00783;; 8: len 2; hex 2020; asc   ;; 9: len 1; hex 03; asc  ;; 10: len 4; hex 80000000; asc     ;; 11: len 8; hex 8000000000000000; asc         ;; 12: len 4; hex 4718a500; asc G   ;; 13: len 8; hex 8000000000000000; asc         ;; 14: SQL NULL; 15: SQL NULL; 16: SQL NULL; 17: len 4; hex 00000000; asc     ;;

TRANSACTION 0 2012146312, ACTIVE 35 sec, process no 14167, OS thread id 1154926928 fetching rows, thread declared inside InnoDB 298
mysql tables in use 1, locked 0
MySQL thread id 76760, query id 19021108 ns60332.ovh.net 91.121.188.151 mobiluck Sending data
SELECT userid, stringValue, fdate, moderation_status, moderation_date, moderation_userid, comment FROM tprofile_info
                                                                        WHERE infoTemplateid = 'photo' AND (moderation_status='todo')
                                                                        ORDER BY fdate DESC
                                                                        LIMIT 42, 42

InnoDB: Submit a detailed bug report to http://bugs.mysql.com

20090615 Mysql error log.

Attachment: 45489-20090615-mysql.error.log.txt.zip (application/octet-stream, text), 240.18 KiB.

tloc_link_user which I recreated this night crashed once more !
When trying to do a CHECK TABLE on it, Mysql server went away !

I have to mysqldump it and reimport it.
This drives me totally crazy...

Read severity definitions... Increased the severity...
Service it totally unusable.

smsgateway crashed one more time, whereas the check table reported nothing wrong...

Packages installed on the Mysql master

Attachment: mysql-dpkg.txt (text/plain), 30.25 KiB.

Packages installed on the Apache server

Attachment: apache-dpkg.txt (text/plain), 47.93 KiB.

Thank you for the feedback.

CHECK TABLE is not required after loading dump, I just wanted to be sure table is OK before load.

Can you repeat crash in any case when you run specific query on same table? Anyway please provide dump of one of tables you have repeatable crashes with.

Also, please, check with our MySQL binaries accessible from http://dev.mysql.com/downloads and current version 5.1.35 to be sure this is not problem of Debian binaries.

smsgateway crashed one more time today... I didn't repaired it yet.

Do you want me to upload the .ibd file too ?
I will do that this night, when the server will have less load.

Thank you for the feedback.

> Do you want me to upload the .ibd file too ?

Yes, it can be helpful. Please provide *frm file and dump as well.

Mysql error log for 2009-06-16...

Attachment: 45489-20090616-mysql.error.log.zip (application/zip, text), 16.69 KiB.

One more crash on tloc_msg.
CHECK TABLE ==> Crash
Mysqldump ==> Crash
innodb_force_recovery=[1|3|4|5] + mysqldump ==> crash

Tried to upgrade to 5.1.35. Same behaviour.

ns6702:/backup# mysqldump xxxxx tloc_msg > /backup/tloc_msg.sql
mysqldump: Error 2013: Lost connection to MySQL server during query when dumping table `tloc_msg` at row: 2170611

I will post the ibd+frm files...
I cannot provide the SQL dump !

# ncftpput ftp.mysql.com /pub/mysql/upload/ 45489-tloc_msg.tar.gz
45489-tloc_msg.tar.gz:                                   1.45 GB    2.38 MB/s

# ncftpput ftp.mysql.com /pub/mysql/upload/ 45489-20090619-mysql.error.log.txt.gz
45489-20090619-mysql.error.log.txt.gz:                   1.81 MB    1.40 MB/s

I upgraded the server to 5.1.35 downloaded from Mysql website.
I installed the binary version compiled with gcc.
Mysql is still randomly crashing.

I dumped all databases which uses InnoDB.
I dropped them all. I moved ib* files in mysql datadir root outside of it.

Now, I'm importing all my data pieces (22GB of sql file...)

Did you find some interesting things in all files I provided you?
This is becoming really critical for my activity.

Thank you for the feedback.

We are still analyzing files uploaded. No information from these files.

Although I'd recommend to check hard disk also.

After more than a day with no crash, I cloned my master to create a slave.
The backup is performed with a LVM snapshot.

2 hours later: 
 - on the master : 
090623 19:12:12 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.1.35-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 131198.
InnoDB: You may have to recover from a backup.
090623 21:09:24  InnoDB: Page dump in ascii and hex (16384 bytes):

 - on the slave: 
090623 18:01:58 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.1.35-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)
090623 19:08:42 [Note] Slave: received end packet from server, apparent master shutdown:
090623 19:08:42 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000037' at postion 45936090
090623 19:08:42 [ERROR] Slave I/O: error reconnecting to master 'slave_user@ns6511.ovh.net:3306' - retry-time: 60  retries: 86400, Error_code: 2013
090623 19:12:42 [Note] Slave: connected to master 'slave_user@ns6511.ovh.net:3306',replication resumed in log 'mysql-bin.000037' at position 45936090
090623 21:01:25 - mysqld got signal 11 ;

At 7PM, I increased innodb_log_file_size to 128MB...

Erratum : The backup has been made à 3AM with no errors.
I didn't realize my snapshot just before starting the slave, but many hours ago.

"InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 131198.
"

This indicates possible hardware problems.  Had you checked disks and memory are okay?  Does this corruption happen on more than one machine, or just this machine ?

Hello, 

This problem happens on 2 machines.
On both servers, no errors are reported by the RAID controller neither the syslog. I ran several fsck -f on disks which reported no error some days ago.
RAM has been checked on the slave. It reported no error.

Corruption came back 2 hours after enabling the slave server !
I shut down the slave. I didn't do any corrective action since yesterday and it didn't crash one more time.

Changing master/slave order does not solve the issue. There was no backup between the slave activation and the first crash.

It seems this problem is occuring since I upgraded Mysql from 5.1.32 to 5.1.34 and was still present when I upgraded to 5.1.35...

ns6511:/backup/bug# ncftpput ftp.mysql.com /pub/mysql/upload/ 45489-20090627-mobiluck-20090627.sql.gz 45489-20090627-mys                                     ql.error.log.txt.gz
45489-20090627-mobiluck-20090627.sql.gz:                 3.05 GB    2.25 MB/s
45489-20090627-mysql.error.log.txt.gz:                 285.43 kB  686.89 kB/s

45489-20090627-my.cnf:                                   5.66 kB   56.21 kB/s
45489-20090627-whole-database.tar.gz:                   11.31 GB    2.59 MB/s

Now you have the entire database...
Hope you will find some interesting things...

I have no slave anymore (performance issue, but that's not the worst)
On the master, a "start slave" has been issued yesterday. There was no slave data filled, so slave replication didn't start. But the master crashed one hour later and corrupted the smsgateway table once again ! Had to restore it.

Did you find some interesting things?
Could it be linked to a LVM snapshot bug?

(sorry, outputs are in french)

# fsck /dev/vg1/lv_home -f
fsck 1.41.3 (12-Oct-2008)
e2fsck 1.41.3 (12-Oct-2008)
Passe 1 : vérification des i-noeuds, des blocs et des tailles
Passe 2 : vérification de la structure des répertoires
Passe 3 : vérification de la connectivité des répertoires
Passe 4 : vérification des compteurs de référence
Passe 5 : vérification de l'information du sommaire de groupe
/dev/vg1/lv_home : 747/4718592 fichiers (46.3% non contigus), 14389071/18874368 blocs

==> no error !

# dd if=/dev/vg1/lv_home of=/dev/null
150994944+0 enregistrements lus
150994944+0 enregistrements écrits
77309411328 bytes (77 GB) copied, 740,691 s, 104 MB/s

==> no error !

It was a ext4dev filesystem. Mounted with options noatime, nodiratime.
Filesytem has been recreated as ext3 with defaults options.

Since the last change I did on filesystem, it did not crash yet.

Thank you for the feedback.

No news from our side so far - your database work fine. Looks like related to ext4dev. Will see what can be checked here.

Thank you for the feedback.

I can not repeat described behavior on ext4 volume. So closed as "Can't repeat". Feel free to reopen the report if problem reoccurs.