Bug #42648 | Maria hang in read_block() on recovery | | |
---|---|---|---|
Submitted: | 6 Feb 2009 14:09 | Modified: | 3 Mar 2009 8:35 |
Reporter: | Philip Stoev | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: Maria storage engine | Severity: | S1 (Critical) |
Version: | 6.0 | OS: | Any |
Assigned to: | Guilhem Bichot | CPU Architecture: | Any |
[6 Feb 2009 14:09]
Philip Stoev
[6 Feb 2009 14:12]
Philip Stoev
YY file
Attachment: bug42648.yy (application/octet-stream, text), 884 bytes.
[6 Feb 2009 14:15]
Philip Stoev
ZZ file for this bug 42648
Attachment: bug42648.zz (text/plain), 382 bytes.
[6 Feb 2009 14:27]
Philip Stoev
To reproduce with the RQG:

    $ perl runall.pl \
        --basedir=/build/bzr/6.0-maria/ \
        --engine=Maria \
        --grammar=conf/bug42648.yy \
        --gendata=conf/bug42648.zz \
        --reporter=Recovery \
        --mysqld=--skip-falcon

This will quickly cause a crash due to a UTF32 bug (filed separately). Recovery will then hang forever.
[19 Feb 2009 10:57]
Philip Stoev
This hang afflicts a wide range of workloads, including situations where Maria tables are not used at all.
[2 Mar 2009 21:38]
Guilhem Bichot
Tested on 32-bit Linux with the latest mysql-test-extra-6.0 and the latest 6.0-maria, debug build: no hang (tried several times, with and without --mem). But I don't believe we fixed anything related to such a hang, so I'm puzzled. Is there a machine where the problem happens that I can log into?
[3 Mar 2009 8:35]
Philip Stoev
I am afraid I have not seen it in a while. It was 100% reproducible at the time the bug was filed. If you are still interested in debugging it, please use an older Maria tree. Otherwise, I will re-open the bug if I see it again.
[3 Mar 2009 15:23]
Guilhem Bichot
I was able to repeat the hang using the old 6.0-maria tree at revision-id: guilhem@mysql.com-20090129200110-yxugmqmjqcwdiey3

This bug was fixed by the change to pagecache_unlock_by_link() made in revision monty@mysql.com-20081227020516-bmta8shmtz0hqfhc, which reached 6.0-maria on 2009-02-13. I am sure of this: when I take the datadir at the time of the crash and let the old 6.0-maria do the recovery, it always hangs in read_block(), but if I apply only the change made by the above revision, recovery succeeds.

The per-file description for the change, in the revision, is "Mark page as read when we do a write of a full page. This fixes a bug when we got an error during read and then used direct write to page to update it".

I assume the mentioned "error during read" could happen if Recovery finds in the log:
- first, a record which says to create page 2;
- then, a record which says to create page 1.

Applying the first record creates page 2 in memory (in the page cache, not on disk) and increases share->state.state.data_file_length. Applying the second record (_ma_apply_redo_insert_row_head_or_tail()) therefore tries to read the page from the page cache: because data_file_length now extends beyond page 1, the page may already exist. The page is found neither in the cache nor on disk (the file is still empty on disk), so the page cache returns HA_ERR_FILE_TOO_SHORT and automatically creates the page in memory (that is, it provides the caller with a buffer to fill). Recovery then fills this page and unlocks it (pagecache_unlock_by_link()). At that moment the page must be marked as read (PCBLOCK_READ); otherwise a future reader (find_block()) will consider the page PAGE_TO_BE_READ, and read_block() will wait forever.

In other words: when the page cache automatically creates a missing page, it must mark it as read, because the page is now ready in the block's buffer and does not need to be read from disk.