Bug #62395 Off-by-one bug in MLOG_COMP_REC_DELETE record during Crash Recovery
Submitted: 9 Sep 2011 0:35 Modified: 31 Jan 2012 21:28
Reporter: Nizameddin Ordulu Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB Plugin storage engine Severity:S2 (Serious)
Version:5.1.52 OS:Linux
Assigned to: Marko Mäkelä CPU Architecture:Any
Tags: compression, innodb, page_zip_validate

[9 Sep 2011 0:35] Nizameddin Ordulu
Description:
This bug happened during crash recovery after I applied the patch provided in bug 61191(http://bugs.mysql.com/file.php?id=17054), however I don't think the bug is related to crash recovery. It's also possible, albeit less likely, for the bug to occur even if the patch was not applied. 

The crash recovery worked fine after we reverted the patch from 61191 so I am positive that the problem is related to the changes in that patch. That patch fixes other bugs so simply reverting that patch is not a solution. Furthermore even if we reverted that patch the same bug will continue to exist for the reasons discussed below.

I was able to investigate the bug, because it caused the crash recovery to crash deterministically. With UNIV_DEBUG enabled and UNIV_ZIP_DEBUG disabled, the crash recovery crashes at the following assertion in rem0rec.ic around line 283:

 283    ut_ad((field_value >= 32768
 284           ? field_value - 65536
 285           : field_value)
 286          + ut_align_offset(rec, UNIV_PAGE_SIZE)
 287          < UNIV_PAGE_SIZE);

I think the above is a side-effect of the actual bug. The above assertion fails after the 82nd redo log record for the problematic page. With UNIV_ZIP_DEBUG enabled, page_zip_validate() in the end of page_zip_write_rec() fails during applying the 22nd log record:

3634#ifdef UNIV_ZIP_DEBUG
3635  ut_a(page_zip_validate(page_zip, page_align(rec)));
3636#endif /* UNIV_ZIP_DEBUG */

The log transaction log records that are applied to the problematic page are as follows:
1. applying log record type 42 to page_no:455 space_no:24789 num_records=5 [DELETE]
2. applying log record type 38 to page_no:455 space_no:24789 num_records=4 [INSERT]
3. applying log record type 49 to page_no:455 space_no:24789 num_records=5 [WRITE_BLOB_PTR]
4. applying log record type 52 to page_no:455 space_no:24789 num_records=5 [COMPRESS_NO_DATA]
5. applying log record type 41 to page_no:455 space_no:24789 num_records=5 [UPDATE_IN_PLACE]
6. applying log record type 41 to page_no:455 space_no:24789 num_records=5 [UPDATE_IN_PLACE]
7. applying log record type 41 to page_no:455 space_no:24789 num_records=5 [UPDATE_IN_PLACE]
8. applying log record type 42 to page_no:455 space_no:24789 num_records=5 [DELETE]
9. applying log record type 38 to page_no:455 space_no:24789 num_records=4 [INSERT]
10. applying log record type 49 to page_no:455 space_no:24789 num_records=5 [WRITE_BLOB_PTR]
11. applying log record type 41 to page_no:455 space_no:24789 num_records=5 [UPDATE_IN_PLACE]
12. applying log record type 42 to page_no:455 space_no:24789 num_records=5 [DELETE]
13. applying log record type 38 to page_no:455 space_no:24789 num_records=4 [INSERT]
14. applying log record type 49 to page_no:455 space_no:24789 num_records=5 [WRITE_BLOB_PTR]
15. applying log record type 42 to page_no:455 space_no:24789 num_records=5 [DELETE]
16. applying log record type 38 to page_no:455 space_no:24789 num_records=4 [INSERT]
17. applying log record type 49 to page_no:455 space_no:24789 num_records=5 [WRITE_BLOB_PTR]
18. applying log record type 42 to page_no:455 space_no:24789 num_records=5 [DELETE]
19. applying log record type 38 to page_no:455 space_no:24789 num_records=4 [INSERT]
20. applying log record type 49 to page_no:455 space_no:24789 num_records=5 [WRITE_BLOB_PTR]
21. applying log record type 42 to page_no:455 space_no:24789 num_records=5 [DELETE]
22. applying log record type 38 to page_no:455 space_no:24789 num_records=4 [INSERT]
110908 17:06:15  InnoDB: page_zip_validate: content

Here are my findings regarding the page that causes the crash:
-The page contains a reference to an externally stored blob field in its compressed part.
-The record that owns the above external field is marked deleted after applying the 18th transaction log record above. The compressed part of the page (not modification log or uncompressed fields) still has reference to now-obsolete blob reference but when the page is decompressed the reference is overridden because the record is marked deleted. Therefore page_zip_validate() succeeds after applying 18.
-19th log record inserts another record that uses the same offset as the deleted record above. page_zip_validate() still succeeds at this point because the new record has its own external field pointer which overwrites the blob pointer on the compressed part of the page inside page_zip_decompress(). page_zip_validate() still succeeds.
-20th record doesn't have any affect on the status of the obsolete blob pointer.
-21st log record deletes the record that was inserted by 19.
-22nd log record inserts a new record to the table at the same offset. This record has a small blob field (<20) therefore it has no external field reference. This time page_zip_validate() fails after the record gets written to both compressed and uncompressed page (in the end of page_zip_write_rec()). This is because page_zip_decompress() doesn't have any information to clear out the obsolete external field pointer.

How to repeat:
Build with UNIV_ZIP_DEBUG, create a table with blob fields. Insert and delete rows to the same page and wait until a row with a shorter blob field is inserted in the slot of a previously deleted row with a blob field reference. I wasn't able to reproduce, but I'm not sure why.

Suggested fix:
The bug disappears when I add the following to page_zip_clear_rec()

		/* Clear only the data bytes, because the allocator and
		the decompressor depend on the extra bytes. */
		memset(rec, 0, rec_offs_data_size(offsets));

before
		if (!page_is_leaf(page)) {
			/* Clear node_ptr on the compressed page. */
			byte*	storage	= page_zip->data
				+ page_zip_get_size(page_zip)
				- (page_dir_get_n_heap(page)
				   - PAGE_HEAP_NO_USER_LOW)
				* PAGE_ZIP_DIR_SLOT_SIZE;

and 
		/* Log that the data was zeroed out. */
		data = page_zip->data + page_zip->m_end;
		ut_ad(!*data);
		if (UNIV_UNLIKELY(heap_no - 1 >= 64)) {
			*data++ = (byte) (0x80 | (heap_no - 1) >> 7);
			ut_ad(!*data);
		}
		*data++ = (byte) ((heap_no - 1) << 1 | 1);
		ut_ad(!*data);
		ut_ad((ulint) (data - page_zip->data)
		      < page_zip_get_size(page_zip));
		page_zip->m_end = data - page_zip->data;
		page_zip->m_nonempty = TRUE;
before 
} else if (page_is_leaf(page) && dict_index_is_clust(index)) {
		/* Do not clear the record, because there is not enough space
		to log the operation. */

		if (rec_offs_any_extern(offsets)) {
			ulint	i;

This is essentially reverting the patch provided in 61191, but that has its own problems.
[9 Sep 2011 0:42] Nizameddin Ordulu
The main reason for the bug is that we don't log the deletion of rows in the page zip modification log when the patch is applied. Even if the patch is not applied the deleted rows are logged in the modification log if there is enough space. Therefore it's possible to not log a deletion if there is not enough space in the log.
[12 Sep 2011 6:55] Marko Mäkelä
Thanks, Nizameddin, for the detailed analysis. I am setting the bug to Verified, even though I have not repeated the bug yet.

Our QA engineer Michael has reported a rise of page_zip_validate() assertion failures. Also it could be due to the faulty fix of Bug #61191. I have not looked into that yet, but it is high on my list.
[12 Sep 2011 18:38] Nizameddin Ordulu
Thanks Marko. You are right that this is related to bug 61191 patch, but the bug exists even without that patch. That patch just makes the bug more likely.
[28 Sep 2011 10:22] Marko Mäkelä
I analyzed the core dumps that we got from an internal page_zip_validate() failure, Oracle bug #12701488. Those failures are on a secondary index, index(a) of a table whose primary key is (b(255),a,d) and KEY_BLOCK_SIZE=16. One failure is on the leaf page (in 5.1 plugin) and one is on a node-pointer page (in 5.5). Curiously, when I copied the compressed page to a fresh mysql process and tricked it into decompressing it, the resulting page matched the uncompressed page byte-for-byte, that is, the page_zip_validate() should not have failed at all.

Your case seems to be different, because it has something to do with BLOB pointers. BLOBs only exist in leaf pages of clustered indexes, and our internally observed bug occurs in a secondary index.

Do you still have the data and log files for reproducing this bug? Could you dump the compressed page before each of the following steps?

19. applying log record type 38 to page_no:455 space_no:24789 num_records=4 [INSERT]
20. applying log record type 49 to page_no:455 space_no:24789 num_records=5
[WRITE_BLOB_PTR]
21. applying log record type 42 to page_no:455 space_no:24789 num_records=5 [DELETE]
22. applying log record type 38 to page_no:455 space_no:24789 num_records=4 [INSERT]
110908 17:06:15  InnoDB: page_zip_validate: content

Also, can you dump the uncompressed and compressed page after the page_zip_validate fails? Commands like this should do the trick in gdb:

dump binary memory page_zip.19.bin page_zip.data page_zip.data+16384
...
dump binary memory page_zip.22.bin page_zip.data page_zip.data+16384
dump binary memory page_zip.22fail.bin page_zip.data page_zip.data+16384
dump binary memory page.22fail.bin page page+0x4000

Replace 16384 with your compressed page size. The "page" should be aligned to a multiple of 0x4000. You can get it from the rec pointer by &~0x3fff, or from page_cur->block->frame.
[28 Sep 2011 11:22] Marko Mäkelä
One more thing: can you please post the page dumps from the error log?

I would really like to see the last steps before your failure. That it fails in crash recovery deterministically is the ideal situation for analysis. The failures in our tests occurred during normal operation, and thus it is harder to see the history of the page.

I examined the error logs from our internal crash, and found two mismatches, not counting differences in the page checksum (bytes 0..3) and LSN (16..23).

Both occurred in secondary index (a) of a table, using key_block_size=16. The table schema could differ a little.

5.1 plugin (leaf page): The temp_page (decompressed from the compressed page) at 0x5b8 contains "gclgrtfuylfluyfy", while the uncompressed page contains 16 zero bytes. The "gclgrtfuylfluyfy" probably used to belong to a prefix of column b in the primary key, which is something like (b(255),a,d). b is a BLOB or TEXT column, and a and d are 32-bit integers.

5.5 (level=1 node pointer page): temp_page at 0x27c5 contains "o5wl" and "page" contains 4 zero bytes.

These failures might be related to this bug, after all. Next, I shall analyze if the bytes belong to a record (or if they just belong to a deleted record in the garbage area), and how the page dumps differ from the dumps that I extracted from the core dumps.

Some notes about deletion. First, a user DELETE actually is an UPDATE that sets the delete-mark flag. This operation is done entirely in-place. We just flip the PAGE_ZIP_DIR_SLOT_DEL and REC_INFO_DELETED_FLAG bits.

Second, the actual deletion (purge) should be done without writing the modification log (well, if it were not for Bug#61191). We simply permute the dense page directory so that the deleted records (the PAGE_FREE list) are at the end of the array.

On delete, if the record is a node pointer or a clustered index record, we will also delete the node pointer bytes or DB_TRX_ID,DB_ROLL_PTR and optionally n*BTR_EXTERN_FIELD_REF_SIZE bytes by memmove()ing data around in the compressed page and zero-filling.
[28 Sep 2011 13:41] Marko Mäkelä
I did a little deeper analysis of our internal crash. In the 5.5 crash, we have apparently deleted (put into PAGE_FREE) and reused a record three times. The mismatch is that when decompressing the page, we fail to zero out the child_page pointer of "v2" of the record. Instead, we would write bytes from the "v1" of the record in place of the node pointer.

Note that only the "v3" of the record really exists; the "v1" was deleted, purged and overwritten by the "v2", which in turn was deleted, purged and overwritten by "v3".

This particular mismatch is harmless in the sense that it is in the unused portion of the page. That space can never be reused until the entire record is deleted or the page is rebuilt.

Next, I will check the page dump from our 5.1 crash.

Your case is somewhat different, because it really affects "live" data on the page.

A quick workaround should be to disable the use of page_mem_alloc_free() on compressed pages. That is, add

	goto use_heap;
before this in page0cur.c:
	free_rec = page_header_get_ptr(page, PAGE_FREE);
	if (UNIV_LIKELY_NULL(free_rec)) {
		/* Try to allocate from the head of the free list. */
[28 Sep 2011 18:39] Nizameddin Ordulu
Marko: I will need to make sure that the data in the page dumps do not reveal any private information before sharing them. This may take a while, I will try your fix in the mean time.
[29 Sep 2011 7:24] Marko Mäkelä
Nizameddin, I understand. Compression bugs are nasty in this respect; you cannot XXX out the sensitive bits easily.

I checked our 5.1 page_zip_validate() mismatch. It turned out to be a mismatch in a clustered index page. The record had apparently been freed and replaced with a shorter record, twice. The most recently stored record (v3) had the columns b='khd0',c='khd0'. I guess that "v2" of the record must have had a 20-byte BLOB pointer where c='khd0' was stored. So, we got a 16-byte mismatch immediately following the record. The uncompressed page would have 16 zeros, while the compressed page would decompress 16 bytes that apparently used to belong to "v1".

In short, our internally observed bug just looks like "overasserting" in UNIV_ZIP_DEBUG. The bug is merely making UNIV_ZIP_DEBUG less useful; it does not seem to affect correctness. Your bug must be something else.
[13 Oct 2011 1:43] Nizameddin Ordulu
Marko: I want to apply the small change above (goto use_heap) to see if current crash happens elsewhere. But there are two locations that match your description in page0cur.c. One is in page_cur_insert_rec_zip() and the other is in page_cur_insert_rec_low(). Which one are you referring to?
[4 Nov 2011 23:54] Nizameddin Ordulu
Hi Marko,

I had some more time to work on this bug and I disabled UNIV_ZIP_DEBUG to ignore unnecessary assertions. I agree that the above page_zip_validate() failure is not the main cause of the crash. It turns out that the crash happens because in one of the MLOG_COMP_REC_DELETE log records, the offset specified by the log record does not correspond to a record on the page. More specifically, it points to the offset of the last inserted record minus 1. I get the following in gdb:
<code>
(gdb) p page_header_get_field(page, 10)
$30 = 15743
(gdb) p page_offset(cursor->rec)
$31 = 15742
(gdb) f
#8  0x00000000008deb19 in page_cur_delete_rec (cursor=0x44c99a10, index=0x2ab9b482e778, offsets=0x44c996e0, 
    mtr=0x44c99af0) at page/page0cur.c:1900
1900    in page/page0cur.c
</code>

The stacktrace looks like:
<code>
#4  0x0000003939430155 in raise () from /lib64/libc.so.6
#5  0x0000003939431bf0 in abort () from /lib64/libc.so.6
#6  0x00000000008f7006 in rec_get_next_ptr_const (rec=0x2aac9a2cfd7e "�\200", comp=1)
    at /data/users/nizamordulu/mysql/5138-build-xdbg/storage/innodb_plugin/include/rem0rec.ic:283
#7  0x00000000008e084f in page_dir_find_owner_slot (rec=0x2aac9a2cfd7e "�\200") at page/page0page.c:109
#8  0x00000000008deb19 in page_cur_delete_rec (cursor=0x44c99a10, index=0x2ab9b482e778, offsets=0x44c996e0, 
    mtr=0x44c99af0) at page/page0cur.c:1900
#9  0x00000000008de99b in page_cur_parse_delete_rec (ptr=0x2aab9fe3d3d2 "�fEu", end_ptr=0x2aab9fe3d3d2 "�fEu", 
    block=0x2aaab8c4b9c0, index=0x2ab9b482e778, mtr=0x44c99af0) at page/page0cur.c:1844
#10 0x00000000008c7e94 in recv_parse_or_apply_log_rec_body (type=42 '*', ptr=0x2aab9fe3d3d0 "=~�fEu", 
    end_ptr=0x2aab9fe3d3d2 "�fEu", block=0x2aaab8c4b9c0, mtr=0x44c99af0) at log/log0recv.c:1215
#11 0x00000000008c8ac0 in recv_recover_page_func (just_read_in=1, block=0x2aaab8c4b9c0) at log/log0recv.c:1613
#12 0x000000000096845e in buf_page_io_complete (bpage=0x2aaab8c4b9c0) at buf/buf0buf.c:3371
#13 0x000000000099d198 in fil_aio_wait (segment=5) at fil/fil0fil.c:5580
#14 0x00000000009250d7 in io_handler_thread (arg=0x10b4b68) at srv/srv0start.c:479
#15 0x0000003939c062f7 in start_thread () from /lib64/libpthread.so.0
#16 0x00000039394d1e3d in clone () from /lib64/libc.so.6

</code>

The assertion that fails belongs to the following lines in rem0rec.ic:
<code>
283    ut_ad((field_value >= 32768
 284           ? field_value - 65536
 285           : field_value)
 286          + ut_align_offset(rec, UNIV_PAGE_SIZE)
 287          < UNIV_PAGE_SIZE);
</code>

I believe that the crash is because the log record has the wrong offset for the deleted record. What do you think? Are there similar bug reports?
[5 Nov 2011 0:12] Nizameddin Ordulu
changing the subject.
[15 Nov 2011 14:01] Marko Mäkelä
I do not remember any off-by-one bugs in the recent past. I fear that I would need to see the page dumps for this. Would UNIV_ZIP_DEBUG fail on the same page, a little earlier?
[31 Jan 2012 21:28] Nizameddin Ordulu
closing this bug until I see another instance  of it.