Bug #39090 Corruption in the InnoDB tablespace.
Submitted: 28 Aug 2008 12:56 Modified: 7 Nov 2015 7:45
Reporter: Dimitrij HIlt
Status: Can't repeat
Category: MySQL Server: InnoDB storage engine    Severity: S1 (Critical)
Version: 5.0.51a-3-log    OS: Linux (Debian 4.0 (etch)/2.6.25.6)
Assigned to: Assigned Account    CPU Architecture: Any
Tags: drbd, InnoDB Corruption, Signal11

[28 Aug 2008 12:56] Dimitrij HIlt
Description:
Hello,

We have hit this error twice (on different hardware): the mysql server dies with signal 11, and after a restart the InnoDB files are corrupt and must be recovered with "innodb_force_recovery = 5" via dump & load. On an 836 GB data warehouse database the recovery time is unacceptably long. Even with a lot of optimisation (copying MyISAM files directly, parallel dump & load, etc.) it can take 8 hours or more.
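(For reference, a rough sketch of the forced-recovery dump & load procedure described above; the file names and paths are only illustrative, not the exact commands used:)

# my.cnf, [mysqld] section -- start InnoDB read-only in forced-recovery mode
innodb_force_recovery = 5

# dump whatever is still readable, then rebuild and reload
mysqldump --all-databases > recovered_dump.sql
# after moving the damaged ibdata/ib_logfile files aside and restarting
# with innodb_force_recovery removed:
mysql < recovered_dump.sql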

This server runs in an HA cluster with DRBD and Heartbeat. Switching to the fail-over server after the master failure produced the same issue, so the hardware is probably not at fault.

I am attaching a complete logfile from the server as a private link, so you can see what happens.

While it happened, the server was being backed up with mylvmbackup, but that runs every night. We got this issue on 27.07 and on 27.08. After the first failure we checked the hardware and switched to another server. That did not help.

We have a backup of the corrupt data, so you can analyse these files directly too.

How to repeat:
I don't know how to repeat this issue, sorry.

Suggested fix:
None.
[28 Aug 2008 12:59] Dimitrij HIlt
Logfile from corrupt server.

Attachment: mysql_innodb_bug.log.gz (application/x-gzip, text), 31.74 KiB.

[29 Aug 2008 11:35] Dimitrij HIlt
Hi,

we've figured out that the same table was involved in both crashes. It is a table for monthly aggregation of data. Both crashes happened at the end of the month (28 July and 28 August), at night, while other jobs were running on other databases and the backup was running on the server with mylvmbackup.

At the end of the month these tables are big. Please see the private attachment with 'show create table' for such tables.

This table also crashes the server when dumping with 'innodb_force_recovery = 5', so we could not dump any data from it.

Dimitrij

I hope we ca
[31 Aug 2008 16:34] Valeriy Kravchuk
Thank you for the problem report. Can you please check whether the problem is repeatable with a newer version, 5.0.67?
[31 Aug 2008 16:52] Dimitrij HIlt
Valeriy,

I will try it on new test hardware, but it may be a race condition, so it is very difficult to reproduce this bug.

Dimitrij
[2 Sep 2008 12:03] Dimitrij HIlt
I've installed 5.0.67 on different hardware and tried to start the database with the broken files. The result is the same:
080902 14:00:57  InnoDB: Error: page 7 log sequence number 302 1417099970
InnoDB: is in the future! Current system log sequence number 302 1357368825.
InnoDB: Your database may be corrupt or you may have copied the InnoDB
InnoDB: tablespace but not the InnoDB log files. See
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: for more information.
080902 14:00:57  InnoDB: Error: page 2 log sequence number 302 1628332110
InnoDB: is in the future! Current system log sequence number 302 1357368825.
InnoDB: Your database may be corrupt or you may have copied the InnoDB
InnoDB: tablespace but not the InnoDB log files. See
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: for more information.
080902 14:00:57  InnoDB: Error: page 6 log sequence number 302 1628332110
InnoDB: is in the future! Current system log sequence number 302 1357368825.
InnoDB: Your database may be corrupt or you may have copied the InnoDB
InnoDB: tablespace but not the InnoDB log files. See
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: for more information.
InnoDB: Error: trying to access page number 211961600 in space 0,
InnoDB: space name /db/mysql/ibdata1,
InnoDB: which is outside the tablespace bounds.
InnoDB: Byte offset 0, len 16384, i/o type 10.
InnoDB: If you get this error at mysqld startup, please check that
InnoDB: your my.cnf matches the ibdata files that you have in the
InnoDB: MySQL server.
080902 14:00:57InnoDB: Assertion failure in thread 140270393509600 in file fil0fil.c line 3959
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: about forcing recovery.
080902 14:00:57 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_connections=2600
threads_connected=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 5657579 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=(nil)
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
frame pointer is NULL, did you compile with
-fomit-frame-pointer? Aborting backtrace!
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file

Dimitrij
[16 Oct 2008 18:38] Valeriy Kravchuk
Do you know what exact sequence of steps leads to this corrupted state?
[16 Oct 2008 22:30] Dimitrij HIlt
Hi,

we have uploaded the logfile. It shows the crash step by step. Do you need any information that is not in this log?

Dimitrij
[24 Oct 2008 12:46] Heikki Tuuri
Assigning this to Inaam.

There is a huge lag in the checkpoint.

The LVM backup process may be associated with this.
[24 Oct 2008 12:49] Heikki Tuuri
Michael I. noted that it crashes on the 28th day of a month. Why?
[24 Oct 2008 12:53] Dimitrij HIlt
Because the data grows at the end of the month.

Dimitrij
[24 Oct 2008 13:02] Marko Mäkelä
There seem to be several issues at play.

First, are you deleting a large amount of data from the table at the end of every month? Do you get this crash every month? Instead of doing large deletes, it would be better to partition the data across multiple tables (e.g., by month, if the primary key is a timestamp). Then you could delete simply by dropping a table, which is fast. Failing that, please try to make the transactions smaller by deleting fewer records at a time and committing in between. The DELETE statement supports a LIMIT clause, which could be used here.
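(A minimal sketch of the batched-delete idea; the table and column names below are made up for illustration:)

-- delete old rows in small batches instead of one huge transaction
DELETE FROM monthly_aggregation
 WHERE period_start < '2008-07-01'
 LIMIT 10000;
-- with autocommit enabled each statement commits on its own;
-- repeat the statement until it reports 0 rows affected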

Second, in the error log you can see warnings that the InnoDB redo log (for crash recovery) is too small. The first warning is:

080826  3:12:15  InnoDB: ERROR: the age of the last checkpoint is 1932731949,
InnoDB: which exceeds the log group capacity 1932731597.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.

The last one is:

080828  4:00:04  InnoDB: ERROR: the age of the last checkpoint is 2472663546,
InnoDB: which exceeds the log group capacity 1932731597.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.

Can you make the redo log bigger?
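(For reference, a rough sketch of how the redo log could be enlarged on 5.0; the sizes below are only an example, and the combined size must stay below 4 GB on this version:)

# my.cnf, [mysqld] section -- example values only
innodb_log_file_size      = 1900M
innodb_log_files_in_group = 2

# procedure: clean shutdown, move the old ib_logfile* out of the datadir
# (e.g. /db/mysql/ib_logfile0 and ib_logfile1), then restart mysqld so
# InnoDB recreates the log files at the new size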

Was a single DELETE statement running all this time?

Was the LVM backup running at this time frame by any chance? It could be that it is reducing the I/O performance of the system too much.

Then we have this long semaphore wait that finally caused the 600-second InnoDB watchdog to kick in and crash the server:

InnoDB: Warning: a long semaphore wait:
--Thread 1168726368 has waited at dict0dict.c line 781 for 241.00 seconds the semaphore:
Mutex at 0x7fdace1838b8 created file dict0dict.c line 709, lock var 1
waiters flag 1

This is the dict_sys->mutex, which dict_table_get_and_increment_handle_count() is waiting for.

Finally, because the redo log was configured to be too small, the InnoDB crash recovery failed, and you ended up with corrupt tables.

If the problem persists, please try to get a core dump, so that we can investigate the root cause (what is holding the dict_sys->mutex for over 600 seconds).
[24 Oct 2008 13:48] Dimitrij HIlt
Marko,

we have 2 x 1 GB logfile size. Is that not enough? We can only get 4 GB of logfiles in total, so it is impossible to make them 10 times bigger than the largest such row.

Yes, the LVM backup was running at this time. But it is recommended by MySQL and should not corrupt any tables. Sure, I/O performance was reduced, but that can also happen when a hard drive fails and the RAID does a rebuild.

We have split the databases and moved the database that caused the crash. I hope we do not get any more crashes. Also, our customer got an enterprise subscription for this server with the hotfix build 5.0.72-br27715-enterprise-gpl-log. We will install this build on the server at the next opportunity.

Next week I will provide information about the DELETE and the amount of data.
[2 Nov 2008 5:45] Inaam Rana
Dimitrij,

Any update on the DELETE and the size of the data being deleted?

regards,
inaam
[3 Nov 2008 13:42] Mikhail Izioumtchenko
Setting 'need feedback'. The questions are: aside from the data growing, what else happens at the end of the month? Do you delete or move the data from the big table? Can you still reproduce it? If so, we would like to see the core dump.
Did it fail at the end of September and October? If not, why not? Did you implement a workaround, or did it just stop happening?
[3 Nov 2008 13:45] Dimitrij HIlt
Hello,

I am waiting for a report from our customer about the traffic at the end of August and July. It did not fail at the end of September and October because the database that got corrupted was moved to another host.
And we got the fix build from the Enterprise Team, but it is not activated at the moment.

Dimitrij
[4 Nov 2008 9:45] Marko Mäkelä
Side note: the mentioned hotfix build 5.0.72-br27715-enterprise-gpl-log apparently contains the fix for Bug #39483 (a hang in the InnoDB adaptive hash index because of an out-of-order ha_innobase::open() call by MySQL). That bug might have caused the 600-second watchdog that killed the server in the first place. But we would like to see a core dump to be sure.
[4 Nov 2008 9:55] Dimitrij HIlt
Hello,

We do not have (and hopefully will never get) a core dump, sorry.
The crash recovery for this database took 34 hours with all hacks and tricks.

But we still have the corrupted data.

Dimitrij
[4 Nov 2008 10:11] Marko Mäkelä
Dimitrij,

I believe that the corruption can be explained by InnoDB running out of redo log. Nevertheless, it is good to keep the corrupted files for future analysis, if it is not too much trouble.

I'm not suggesting or wishing that the production server should crash again. I was thinking that you might have a core dump of the earlier crash. It can be a little tricky to enable the generation of core dumps in Linux, but the core dump would help us a lot. The root cause could be Bug #39483, but without a core dump (or complete stack traces of all mysqld threads) we will never know.
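(For reference, a rough sketch of one way to enable core dumps for a mysqld crash on Linux; the options are real, but treat the exact setup as an example:)

# my.cnf -- ask mysqld to write a core file if it crashes
[mysqld]
core-file

[mysqld_safe]
core-file-size = unlimited

# shell, if mysqld is started manually instead of via mysqld_safe
ulimit -c unlimited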

Best regards,
Marko
[7 Nov 2015 7:45] MySQL Verification Team
If this happens on a modern version, 5.6.27 or newer, please open a new bug.
5.0 is no longer actively supported.