MySQL Bugs: #25255: MyISAM table corruption when using thread_cache

Bug #25255	MyISAM table corruption when using thread_cache_size
Submitted:	23 Dec 2006 13:18	Modified:	3 Mar 2007 19:35
Reporter:	[ name withheld ]	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: MyISAM storage engine	Severity:	S1 (Critical)
Version:	5.0.22 and 5.0.27	OS:	Linux (Linux Suse ES9; kernel 2.6.5)
Assigned to:		CPU Architecture:	Any

Description:
When using official build on Linux Novell/Suse and thread_cache_size=512 is set in my.cnf and also 100s of simultaneouis connections are on which may be doing multi and large INSERTs and also numerous UPDATES and SELECTs per second as well as frequent KILL <thread> are issued from a machine to various remote threads, we get table corruption as follows:

We occasionally see this on error log:
[ERROR] Got error 127 when reading table './DB/I'
Error 130 appears too sometimes, i think it was 130.

And after running a repair on table I, the error log gets lines like below:
Found 1723 of 1724 rows when repairing './DB/I'

I.e. data rows are lost!

Note, the mysql linux load in our environment, despite the 500 - 700 simultaneous clients hardly exceeds 1 and memory takeup can max out to 600Mb on a machine with 4GB ram. So there seems to be no issue with huge load or memory stress.

The mysql build uses NPTL, under Novel/Suse Enterprise with all latest Novel system patches applied.

How to repeat:
It is extremely difficult to post you a case that recreates the problem.

This is because the problem only occurs in our environment where there are many 100s of simultaneous connections open to the db, doing writing and reading and reading. It seems to also be related to the issuing of "KILL <threadid>" commands or when large multi-insert or other statments are stopped in the middle of their execution.

Some times the corruption problem may take hours or days to reproduce.
However there is definitely a problem and is related to Bug #20814 report, for which there was no fix (months back).

Only recently we found this problem seems to haev something to do with the thread_cache_size setting.

I would sugegst you had a regression/stress test where large queries from multiple client processes accessing the db can be killed at any time (e.g. in the middle of a large insert or update, or in the middle of a select pulling large data). Also 100s of small queries per second occurring and in addtion to that some machine periodically issuing a "KILL thread_id" on many of those connections randomly. All this wihile thread_cache_size is set = 512.

Suggested fix:
Removing the thread_cache_size directive from my.cnf seems to fix the problem (at least we have not seen it for weeks). This is using the official 5.0.22 build shipped by you for Novel/Suse Enterprise v9.

A custom build from sources we made on the same system of mysql 5.0.27 (using the Novel supported gcc 3.3.3 compiler) still presented the same corruption behaviour, which got cleared when again removing thread_cache_size from the my.cnf file. However, on this build, after a week of operation of our database, we got this error in the error log twice:
[ERROR] Got error 127 when reading table './DB/I'

However, in this case, running "repair table I" did not report any data lost.
It could be that official 5.0.22 may still have same issue, but we have not seen that one yet.

All this suggests some race condition that has been in mysql/myisam for some time. We first noticed this over 6 months ago.

Hi! Please send us output from:
SHOW CREATE TABLE I\G
SHOW TABLE STATUS LIKE I\G

We can try recreate a random dataset, using many concurrent connections and thread_cache_size setting as you described, also killing random connections.

a sample file for queries

Attachment: bug25255.query (application/octet-stream, text), 769 bytes.

An update: while running the above queries in 45 threads on 5.0.34BK, I managed to get CHECK TABLE to report a corrupted table at least once.  But, the next time CHECK TABLE is run, it returns OK. So, there is something worth investigating here.

On another run of 45 threads, I issued a REPAIR TABLE instead, and it crashed the server, with a debug assertion.

061226 18:22:53 [Note] Retrying repair of: './test/t1' with keycache
mysqld: my_seek.c:57: my_seek: Assertion `fd != -1' failed.

This is probably another bug, which I'll deal with seperately.

we've seen many customers who had corruption issues which vanished after setting thread_cache_size to 4 or even 0.  So, this bug report has merit, but I haven't been able to reproduce it exactly yet.   Perhaps a code review would be easier.

an update.  a huge problem was found with thread_cache_size.  I filed it in bug #25966 .  So, maybe that was an underlying cause of this? (especially if slaves exist, or if KILL sql command is used).