MySQL Bugs: #57611: ibdata file and continuous growing undo logs

Bug #57611	ibdata file and continuous growing undo logs
Submitted:	20 Oct 2010 22:25	Modified:	10 Dec 2010 0:54
Reporter:	Lig Isler-Turmelle	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB Plugin storage engine	Severity:	S1 (Critical)
Version:	5.0,5.1,5.5.6-rc	OS:	Any
Assigned to:	Inaam Rana	CPU Architecture:	Any

Description:
A continually growing undo log in InnoDB, which causes the main tablesapce to grow forever.

baselevel:
blade12:~$ ll ./mysql_installs/5.5.6-rc/mysql-5.5.6-rc-linux2.6-x86_64/data/test/
total 1968128
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:53 sbtest1.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 18:54 sbtest1.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:53 sbtest2.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 18:54 sbtest2.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:54 sbtest3.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 18:55 sbtest3.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:54 sbtest4.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 18:55 sbtest4.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:54 sbtest5.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 18:56 sbtest5.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:55 sbtest6.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 18:56 sbtest6.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:55 sbtest7.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 18:56 sbtest7.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:55 sbtest8.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 18:57 sbtest8.ibd

blade12:~$ ll ./mysql_installs/5.5.6-rc/mysql-5.5.6-rc-linux2.6-x86_64/data/
total 1043492
-rw-rw---- 1 lislerturmelle lislerturmelle         6 Oct 20 17:15 blade12.pid
-rw-rw---- 1 lislerturmelle lislerturmelle         0 Oct 20 17:16 error-log.err
-rw-rw---- 1 lislerturmelle lislerturmelle      8151 Oct 20 17:16 error-log.err-old
-rw-rw---- 1 lislerturmelle lislerturmelle  18874368 Oct 20 18:57 ibdata1
-rw-rw---- 1 lislerturmelle lislerturmelle 524288000 Oct 20 18:57 ib_logfile0
-rw-rw---- 1 lislerturmelle lislerturmelle 524288000 Oct 20 18:56 ib_logfile1
drwxr-xr-x 2 lislerturmelle lislerturmelle      4096 Oct 19 23:34 mysql
-rw-r--r-- 1 lislerturmelle lislerturmelle         8 Oct 19 23:34 mysql_upgrade_info
drwx------ 2 lislerturmelle lislerturmelle      4096 Oct 19 23:34 performance_schema
drwxr-xr-x 2 lislerturmelle lislerturmelle      4096 Oct 20 18:55 test

After running test:
blade12:~$ ll ./mysql_installs/5.5.6-rc/mysql-5.5.6-rc-linux2.6-x86_64/data/test/total 1968128
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:53 sbtest1.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 19:27 sbtest1.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:53 sbtest2.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 19:27 sbtest2.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:54 sbtest3.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 19:27 sbtest3.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:54 sbtest4.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 19:27 sbtest4.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:54 sbtest5.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 19:27 sbtest5.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:55 sbtest6.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 19:27 sbtest6.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:55 sbtest7.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 19:27 sbtest7.ibd
-rw-rw---- 1 lislerturmelle lislerturmelle      8632 Oct 20 18:55 sbtest8.frm
-rw-rw---- 1 lislerturmelle lislerturmelle 251658240 Oct 20 19:27 sbtest8.ibd

blade12:~$ ll ./mysql_installs/5.5.6-rc/mysql-5.5.6-rc-linux2.6-x86_64/data/total 4011892
-rw-rw---- 1 lislerturmelle lislerturmelle          6 Oct 20 17:15 blade12.pid
-rw-rw---- 1 lislerturmelle lislerturmelle          0 Oct 20 17:16 error-log.err
-rw-rw---- 1 lislerturmelle lislerturmelle       8151 Oct 20 17:16 error-log.err-old
-rw-rw---- 1 lislerturmelle lislerturmelle 3055550464 Oct 20 19:30 ibdata1
-rw-rw---- 1 lislerturmelle lislerturmelle  524288000 Oct 20 19:30 ib_logfile0
-rw-rw---- 1 lislerturmelle lislerturmelle  524288000 Oct 20 19:27 ib_logfile1
drwxr-xr-x 2 lislerturmelle lislerturmelle       4096 Oct 19 23:34 mysql
-rw-r--r-- 1 lislerturmelle lislerturmelle          8 Oct 19 23:34 mysql_upgrade_info
drwx------ 2 lislerturmelle lislerturmelle       4096 Oct 19 23:34 performance_schema
drwxr-xr-x 2 lislerturmelle lislerturmelle       4096 Oct 20 18:55 test

Output from customized innochecksum:
blade12:~/innochecksum$ ./innodbcheck ~/mysql_installs/5.5.6-rc/mysql-5.5.6-rc-linux2.6-x86_64/data/ibdata1
posix_fadvise failed: Success
0 bad checksum
15 FIL_PAGE_INDEX
182680 FIL_PAGE_UNDO_LOG    <-- Note size
2154 FIL_PAGE_INODE
0 FIL_PAGE_IBUF_FREE_LIST
1487 FIL_PAGE_TYPE_ALLOCATED
13 FIL_PAGE_IBUF_BITMAP
132 FIL_PAGE_TYPE_SYS
1 FIL_PAGE_TYPE_TRX_SYS
2 FIL_PAGE_TYPE_FSP_HDR
12 FIL_PAGE_TYPE_XDES
0 FIL_PAGE_TYPE_BLOB
0 FIL_PAGE_TYPE_ZBLOB
0 other
1 max index_id
#pages index_id
<snip>

Attaching InnoDB Status output for review.

How to repeat:
my.cnf 

 [mysqld]
 innodb_buffer_pool_size=5G
 innodb_log_file_size=500M
 innodb_flush_log_at_trx_commit=2
 innodb_doublewrite=1
 innodb_flush_method=O_DIRECT
 innodb_thread_concurrency=0
 innodb_max_dirty_pages_pct=80
 innodb_file_format=barracuda
 innodb_file_per_table
 innodb_io_capacity=1000
 innodb_thread_concurrency=64
 innodb_buffer_pool_instances=1
 max_connections=2000
 table_cache=2000
 key_buffer_size=200M

sysbench - prep to generate tables:

blade12:~$ for i in $( seq 1 8 ); do> sysbench --batch --batch-delay=60 --test=oltp  --oltp-table-size=1000000 \
>  --max-time=1200 --max-requests=0 --mysql-socket=/tmp/mysql.5.5.6-rc.sock --mysql-table-engine=innodb --mysql-db=test \
> --db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest${i} \
> --oltp-skip-trx --oltp-test-mode=simple --oltp-point-select-all-cols \
> --oltp-simple-update --oltp-dist-type=uniform --oltp-range-size=1 \
> --num-threads=1 --seed-rng=1 prepare
> done

Take your base levels

second run:

blade12:~$ for nt in 1 2 4 8 16 32; do> > for x in $( seq 1 8 ); do
> 
> (sysbench --batch --batch-delay=60 --test=oltp --mysql-db=test --oltp-table-size=100000 \
> --max-time=1200 --max-requests=0 --mysql-socket=/tmp/mysql.5.5.6-rc.sock --mysql-table-engine=innodb \
> --db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest${x} \
> --oltp-skip-trx --oltp-test-mode=simple --oltp-point-select-all-cols \
> --oltp-simple-update --oltp-dist-type=uniform --oltp-range-size=1 \
> --num-threads=${nt} --seed-rng=${x} run & )
> 
> done
> 
> done

Look at
1) InnoDB Status - make sure you have a small history list
2) the size of the ibdata file
3) when the history list is short, run innodbchecksum on the ibdata file

Suggested fix:
unknown

Why is it only growing in the 5.5.6 branch? Have previous versions 5.5.x been
confirmed to not suffer from this issue?

Regards,
-sunny

I have one instance of this in production with 5.1.47 + facebook patch. The database files probably were created from 5.0.84 and then an in-place upgrade was done -- no dump&reload.

Ligaya,

What you have to do to actually verify this is:

1) Run the workload once. Keep an eye on the history length. My guess is that it will keep growing.
2) Once the test is finished keep the server up and you'll see the history length going down
3) When the history length is back to around 100, without stopping the server rerun the workload
4) If you see ibdata file started growing again then we have a problem. If not then it is working as expected

I tried the above on one of our boxes (with vanilla sysbench i.e.: single table) and during the first run the history length grew to 1.4 million with around 65,000 UNDO pages. When I reran the workload after cooling the server down the UNDO pages stayed steady at 65,000.

The config parameters and the sysbench options clearly indicate that we'll have a very cpu intensive test and it is not a surprise if purge is unable to keep up with that. The surprise will be if after purge has caught up we are still unable to use the 65,000 presumably freed UNDO pages.

secondary testing using steps provided by Inaam earlier:

Ran workload test twice.  

As expected ibdata file grew on first run with large FIL_PAGE_UNDO_LOG value at end.  History list grew and then shrunk again as purge caught up.  ibdata file size at end - 2283798528

Ran second workload test.  History list again grew to a large value. FIL_PAGE_UNDO_LOG value had *very* minor growth.  ibdata file did not grow at all.

Will upload raw files for review.

I don't have much to describe the reproduction. This occurred on a production server and a test server (sysbench). The prod server did an in-place (no dump & reload) upgrade from 5.0.84 to 5.1.47 and with 5.1 uses barracuda and file-per-table -- but no files were created after upgrade. The prod server also gets frequent errors for attempts to exceed 1023 concurrent transactions (cannot find undo slot).

The ibdata file on the prod server appears to have grown at a regular rate over 3+ weeks until it reached ~900GB with ~700GB for undo. History list length on this host goes up to ~10M each night when mysqldump is run to dump a few tables. It returns to a small value afterwards.

Does InnoDB not return blocks allocated to undo back to the free block list?
Why would it grow slowly for 3 weeks? There were no transactions open for that long and I did not see any "lost" XA prepared transactions either.

I modified innochecksum to collect more stats on undo blocks. From the server that runs 5.1.47 and has ~700GB of undo space.

file /data/mysql/ibdata1 = 1058650718208 bytes (64614912 pages)...
checking pages in range 0 to 64614911
0       bad checksum
17248722        FIL_PAGE_INDEX
45325414        FIL_PAGE_UNDO_LOG
533186  FIL_PAGE_INODE
10      FIL_PAGE_IBUF_FREE_LIST
1488532 FIL_PAGE_TYPE_ALLOCATED
2769    FIL_PAGE_IBUF_BITMAP
0       FIL_PAGE_TYPE_SYS
0       FIL_PAGE_TYPE_TRX_SYS
0       FIL_PAGE_TYPE_FSP_HDR
2773    FIL_PAGE_TYPE_XDES
13506   FIL_PAGE_TYPE_BLOB
0       FIL_PAGE_TYPE_ZBLOB
0       other
258     max index_id
undo type: 29 insert, 45325383 update, 2 other
undo state: 0 active, 3 cached, 44862541 to_free, 462643 to_purge, 0 prepared, 227 other

Note:
* most of the undo blocks have type TRX_UNDO_UPDATE
* most of the undo blocks have state TRX_UNDO_TO_FREE

Why are they not freed?

From other servers that handle a similar workload, they have similar allocation:
* lots of undo blocks
* most undo blocks have type TRX_UNDO_UPDATE
* most undo blocks have state TRX_UNDO_TO_FREE

The server listed previously was idle. The servers listed below are in production.

-rw-rw---- 1 mysql mysql 12G Oct 21 22:41 /data/mysql/ibdata1
0       bad checksum
141     FIL_PAGE_INDEX
769642  FIL_PAGE_UNDO_LOG
8678    FIL_PAGE_INODE
0       FIL_PAGE_IBUF_FREE_LIST
191     FIL_PAGE_TYPE_ALLOCATED
48      FIL_PAGE_IBUF_BITMAP
3       FIL_PAGE_TYPE_SYS
1       FIL_PAGE_TYPE_TRX_SYS
1       FIL_PAGE_TYPE_FSP_HDR
47      FIL_PAGE_TYPE_XDES
0       FIL_PAGE_TYPE_BLOB
0       FIL_PAGE_TYPE_ZBLOB
0       other
1       max index_id
undo type: 363 insert, 769279 update, 0 other
undo state: 0 active, 493 cached, 736773 to_free, 347 to_purge, 10 prepared, 32019 other

-rw-rw---- 1 mysql mysql 124G Oct 21 22:40 /data/mysql/ibdata1
0       bad checksum
138     FIL_PAGE_INDEX
7954539 FIL_PAGE_UNDO_LOG
93583   FIL_PAGE_INODE
0       FIL_PAGE_IBUF_FREE_LIST
41376   FIL_PAGE_TYPE_ALLOCATED
492     FIL_PAGE_IBUF_BITMAP
3       FIL_PAGE_TYPE_SYS
1       FIL_PAGE_TYPE_TRX_SYS
1       FIL_PAGE_TYPE_FSP_HDR
491     FIL_PAGE_TYPE_XDES
0       FIL_PAGE_TYPE_BLOB
0       FIL_PAGE_TYPE_ZBLOB
0       other
1       max index_id
undo type: 23 insert, 7954516 update, 0 other
undo state: 0 active, 23 cached, 7953777 to_free, 739 to_purge, 0 prepared, 0 other

-rw-r--r-- 1 mysql mysql 592G Oct 21 22:40 /data/mysql/ibdata1
0       bad checksum
34882723        FIL_PAGE_INDEX
3312938 FIL_PAGE_UNDO_LOG
36921   FIL_PAGE_INODE
473     FIL_PAGE_IBUF_FREE_LIST
377080  FIL_PAGE_TYPE_ALLOCATED
1209    FIL_PAGE_IBUF_BITMAP
0       FIL_PAGE_TYPE_SYS
0       FIL_PAGE_TYPE_TRX_SYS
0       FIL_PAGE_TYPE_FSP_HDR
1209    FIL_PAGE_TYPE_XDES
166327  FIL_PAGE_TYPE_BLOB
0       FIL_PAGE_TYPE_ZBLOB
0       other
72369   max index_id
undo type: 5829 insert, 3302610 update, 4499 other
undo state: 0 active, 5 cached, 2759890 to_free, 380192 to_purge, 0 prepared, 172851 other

-rw-rw---- 1 mysql mysql 597G Oct 21 22:40 /data/mysql/ibdata1
0       bad checksum
16534528        FIL_PAGE_INDEX
20467834        FIL_PAGE_UNDO_LOG
240578  FIL_PAGE_INODE
28      FIL_PAGE_IBUF_FREE_LIST
402731  FIL_PAGE_TYPE_ALLOCATED
583     FIL_PAGE_IBUF_BITMAP
0       FIL_PAGE_TYPE_SYS
0       FIL_PAGE_TYPE_TRX_SYS
0       FIL_PAGE_TYPE_FSP_HDR
583     FIL_PAGE_TYPE_XDES
1455023 FIL_PAGE_TYPE_BLOB
0       FIL_PAGE_TYPE_ZBLOB
0       other
608806  max index_id
undo type: 291 insert, 20429861 update, 37682 other
undo state: 0 active, 111 cached, 20400059 to_free, 26266 to_purge, 0 prepared, 41398 other

-rw-rw---- 1 mysql mysql 805G Oct 21 22:41 /data/mysql/ibdata1
0       bad checksum
31277512        FIL_PAGE_INDEX
20334639        FIL_PAGE_UNDO_LOG
238664  FIL_PAGE_INODE
172     FIL_PAGE_IBUF_FREE_LIST
530657  FIL_PAGE_TYPE_ALLOCATED
0       FIL_PAGE_IBUF_BITMAP
0       FIL_PAGE_TYPE_SYS
0       FIL_PAGE_TYPE_TRX_SYS
0       FIL_PAGE_TYPE_FSP_HDR
0       FIL_PAGE_TYPE_XDES
326196  FIL_PAGE_TYPE_BLOB
0       FIL_PAGE_TYPE_ZBLOB
0       other
231118  max index_id
undo type: 675 insert, 20321566 update, 12398 other
undo state: 0 active, 30 cached, 19956753 to_free, 332515 to_purge, 0 prepared, 45341 other

-rw-rw---- 1 mysql mysql 798G Oct 21 22:41 /data/mysql/ibdata1
0       bad checksum
29481088        FIL_PAGE_INDEX
21746699        FIL_PAGE_UNDO_LOG
255501  FIL_PAGE_INODE
59      FIL_PAGE_IBUF_FREE_LIST
525471  FIL_PAGE_TYPE_ALLOCATED
0       FIL_PAGE_IBUF_BITMAP
0       FIL_PAGE_TYPE_SYS
0       FIL_PAGE_TYPE_TRX_SYS
0       FIL_PAGE_TYPE_FSP_HDR
0       FIL_PAGE_TYPE_XDES
277134  FIL_PAGE_TYPE_BLOB
0       FIL_PAGE_TYPE_ZBLOB
0       other
3800    max index_id
undo type: 785 insert, 21745694 update, 220 other
undo state: 0 active, 26 cached, 21170534 to_free, 557793 to_purge, 0 prepared, 18346 other

plugin code: TRX_UNDO_TO_FREE comes from INSERT undo.
It looks convincing that it is freed on commit.
It isn't immediately clear that it is freed on rollback,
unless rollback ends up calling trx_commit_off_kernel().
A simple test case could be tried then with a txn that does insert then rollback.

What about the first branch in trx_undo_set_state_at_finish?

        if (undo->size == 1
            && mach_read_from_2(page_hdr + TRX_UNDO_PAGE_FREE)
               < TRX_UNDO_PAGE_REUSE_LIMIT) {

                /* This is a heuristic to avoid the problem of all UNDO
                slots ending up in one of the UNDO lists. Previously if
                the server crashed with all the slots in one of the lists,
                transactions that required the slots of a different type
                would fail for lack of slots. */

                if (UT_LIST_GET_LEN(rseg->update_undo_list) < 500
                    && UT_LIST_GET_LEN(rseg->insert_undo_list) < 500) {

                        state = TRX_UNDO_CACHED;
                } else {
                        state = TRX_UNDO_TO_FREE;
                }

        } else if (undo->type == TRX_UNDO_INSERT) {

                state = TRX_UNDO_TO_FREE;
        } else {
                state = TRX_UNDO_TO_PURGE;
        }

trx_finish_rollback_off_kernel() does call trx_commit_off_kernel(trx);
however there are a couple of returns before that that are worth looking at.

let's call it Verified for now since someone needs to look at the code
given the information that Mark provided. 
- it is not entirely clear that we always clean up insert undo on rollback
- in the branch that Mark spotted it is possible that we set state
to TRX_UNDO_TO_FREE even though type is not TRX_UNDO_INSERT.
If so, it's not clear that  this will ever be freed.

yet another possibility is that TRX_UNDO_TO_FREE is a good a valid state
to which we transition correctly yet when we need a new page for a rollback
segment we fail to use those TRX_UNDO_TO_FREE pages instead
extending the file.

Michael, Mark,

This does look suspicious:

* During commit we call trx_undo_set_state_at_finish() which, as both of you mentioned, can set the state of an TRX_UNDO_UPDATE type undo log to TRX_UNDO_TO_FREE based on some heuristic.

* trx_undo_update_cleanup() does not free up the pages as they are still needed by purge.

* When purge actually gets around to freeing the pages it does the following:

 556         if ((mach_read_from_2(seg_hdr + TRX_UNDO_STATE) == TRX_UNDO_TO_PURGE)
 557             && (mach_read_from_2(log_hdr + TRX_UNDO_NEXT_LOG) == 0)) {
 558 
 559                 /* We can free the whole log segment */
 560 
 561                 mutex_exit(&(rseg->mutex));
 562                 mtr_commit(&mtr);
 563 
 564                 trx_purge_free_segment(rseg, hdr_addr, n_removed_logs);

It does check for the state TRX_UNDO_TO_PURGE and only then calls trx_purge_free_segment() which calls fseg_free_step().

This can explain:
1) Why pages of type TRX_UNDO_UPDATE are in state TRX_UNDO_TO_FREE
2) Why TRX_UNDO_TO_FREE state pages are not getting freed

I don't have a fair idea what this heuristic is about and why we need to check page type when cleaning during purge. Sunny, is our purge and undo expert. I'll let him comment on this and either deny the above or come up with some solution.

it would be interesting to dump the pages that actually say what
is free and what is not.
insert undo freeing routine is different from the purge freeing routine
but the logic appear to be essentially the same.

Mark, I think Inaam has a good point. You could try either of the following
two fixes to see if the problem goes away:

1. replace, where we decide to free or not to free in trx0purge.c

(mach_read_from_2(seg_hdr + TRX_UNDO_STATE) == TRX_UNDO_TO_PURGE)

with 

(mach_read_from_2(seg_hdr + TRX_UNDO_STATE) == TRX_UNDO_TO_PURGE
 || mach_read_from_2(seg_hdr + TRX_UNDO_STATE) == TRX_UNDO_TO_FREE)

or an optimized equivalent

2. in the heuristics in trx_undo_set_state_at_finish()

replace

state = TRX_UNDO_TO_FREE;

with

state = undo->type == TRX_UNDO_INSERT ? TRX_UNDO_TO_FREE : TRX_UNDO_TO_PURGE;

Inaam,

The heuristic was added for pre 5.5 code to avoid caching all update UNDO
slots in memory. If there was a crash and all UNDO slots were in the say
update UNDO list then during recovery there would be no insert UNDO slots
to use and recovery would fail.

For updates we can't free the segment like we do for inserts at the end of a
commit like trx_undo_insert_cleanup()  because they are needed by MVCC
and purge (as you noted). trx_undo_update_cleanup() should set the state
of the page to TRX_UNDO_TO_PURGE before freeing the memory but not
delete the segment as we do in trx_undo_insert_cleanup().  Therefore this
is a bug that was introduced when the heuristic was added.

In 5.5 code we can delete the heuristics, I think I already did in my branch
and we could back port that to 5.5.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/121976

3643 Inaam Rana	2010-10-26
      Bug #57611	ibdata file and continuous growing undo logs
      rb://498
      
      Fix handling of update_undo_logs at trx commit. Previously, when
      rseg->update_undo_list grows beyond 500 the update_undo_logs were
      marked with state TRX_UNDO_TO_FREE which should have been
      TRX_UNDO_TO_PURGE.
      
      Approved by: Sunny Bains

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/121979

3203 Inaam Rana	2010-10-26
        Bug #57611    ibdata file and continuous growing undo logs
        rb://498
        
        Fix handling of update_undo_logs at trx commit. Previously, when
        rseg->update_undo_list grows beyond 500 the update_undo_logs were
        marked with state TRX_UNDO_TO_FREE which should have been
        TRX_UNDO_TO_PURGE.
        In 5.5 we don't need the heuristic as we support multiple rollback
        segments.
        
        Approved by: Sunny Bains

Thanks to Mark C. for not only finding this bug but also helping us with debugging.

Can a patch for the custom innochecksum be made available, so that we can perform similar tests?

Sub question: With the bug fixed, is there or will there be a way to clear the pages that are incorrectly marked?

changes to count allocation by type

Attachment: innochecksum.c.orig (, text), 18.78 KiB.

I uploaded extra/innochecksum.c with changes to count allocation by type. This is the version from 5.1.50

Thanks Mark.

Daniel, the issue has been around since 5.0.68 was released in August 2008 so for the moment we'll watch to see whether many people report being significantly affected by it. It would be helpful for anyone who is to let us know.

It's not as simple as just removing the TRX_UNDO_FREE flag because some pages may still be needed for purge. Even after that, the shared tablespace size wouldn't be reduced, just the pages made available for reuse.

The workaround for the moment is to get the fix then do the usual mysqladump and reload, or conversion to MyISAM and erasing the InnoDB files, that is required for other ibdata size reduction requirements.

I've been using MySQL quite heavily and manage several boxes which despite having innodb_file_per_table set also have quite large ibdata1 files.

There's no way to clear this especially on master servers which are hard to take down for any period of time and so I'd like to suggest that any way to fix this would be most appreciated.

Yes we can live with this but it's like warts: they are ugly and you really don't want to have them. If you get rid of them you feel much more comfortable ....

@James

Surely it would be fine to reset TRX_UNDO_FREE if the server was cleanly shut down? 

I do appreciate that it wouldn't reduce the size (I'll come to that in a second), but at least freeing up the pages for re-use is better than leaving them in limbo.

Regarding the size of the file, surely if a server was shutdown cleanly then the ibdata file could be rebuilt? 

I pretty much exclusively use file_per_table so on an adhoc basis it is possible to rebuild tables and reclaim space. However the shared data (esp with this bug) can be large and unrecoverable.

Simon, it's normal for the ibdata1 file to be quite large even when using innodb_file_per_table because the undo records can take up a lot of space even when all is working as it should. First thing to do is find out how much this bug is affecting you, if at all, by using the report. Size alone isn't sufficient to know that it is, however big ibdata1 gets, it takes verifying that there a re a significant number of pages set to TRX_UNDO_FREE as well.

Daniel, resetting after a slow shut down (innodb_fast_shutdown = 0, not the 1 default) should be safe. Rebuilding the ibdata1 file would break all of the tablespace id links so it's not completely straightforward.

Best thing to do for now is find out how much it affects you, if at all, and then let us know using the report from Mark's modified innochecksum plus the total size of the whole database, not just ibdata1. Once we have a fair range of reports we can see how much and how widespread the noticeable effects of this are.

An initial thought is that anyone who hasn't also been affected by the 1k open transaction limit in InnoDB isn't likely to be affected by this, since it looks to require at least 500 undo records of one type before the heuristic which did this can be triggered.

Using BLOB's? See bug #57980

Pushed into mysql-trunk 5.6.99-m5 (revid:alexander.nozdrin@oracle.com-20101113155825-czmva9kg4n31anmu) (version source revid:alexander.nozdrin@oracle.com-20101113152450-2zzcm50e7i4j35v7) (merge vers: 5.6.1-m4) (pib:21)

Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101113160336-atmtmfb3mzm4pz4i) (version source revid:alexander.nozdrin@oracle.com-20101113152540-gxro4g0v29l27f5x) (pib:21)

Pushed into mysql-5.1 5.1.54 (revid:build@mysql.com-20101118153531-693taxtxyxpt037i) (version source revid:build@mysql.com-20101118153531-693taxtxyxpt037i) (merge vers: 5.1.54) (pib:21)

I am not sure the bug has been fixed. We had a server under severe load yesterday with frequent attempts to start more than 1023 concurrent transactions. The ibdata1 file grew 17GB in 1 hour. From other monitoring it has grown by 140GB over the past month.

Data from the enhanced innochecksum shows that:
* most pages are FIL_PAGE_UNDO_LOG
* most pages are in state to_free

I confirmed that the mysqld server running there has the patch.

4	FIL_PAGE_INDEX
8906492	FIL_PAGE_UNDO_LOG
104744	FIL_PAGE_INODE
0	FIL_PAGE_IBUF_FREE_LIST
54368	FIL_PAGE_TYPE_ALLOCATED
550	FIL_PAGE_IBUF_BITMAP
0	FIL_PAGE_TYPE_SYS
0	FIL_PAGE_TYPE_TRX_SYS
0	FIL_PAGE_TYPE_FSP_HDR
548	FIL_PAGE_TYPE_XDES
0	FIL_PAGE_TYPE_BLOB
0	FIL_PAGE_TYPE_ZBLOB
0	other
0	max index_id
undo type: 0 insert, 8906492 update, 0 other
undo state: 0 active, 0 cached, 8906492 to_free, 0 to_purge, 0 prepared, 0 other

Mark, this could be an independent issue, could you file a new bug?
It could be easier to track commits to two different problems,
to two different bugs even though the symptoms were the same.

What could help with the body of ignorance about the problem at hand,
from this bug:

[22 Oct 19:13] Michael Izioumtchenko

trx_finish_rollback_off_kernel() does call trx_commit_off_kernel(trx);
however there are a couple of returns before that that are worth looking at.

[22 Oct 19:49] Michael Izioumtchenko

let's call it Verified for now since someone needs to look at the code
given the information that Mark provided. 
- it is not entirely clear that we always clean up insert undo on rollback

this line of reasoning was never followed up. A note for myself
when I'm back from vacation: try to reproduce a rollback 
rich environment.

Mark, is the purge lagging behind by any chance?

This can't be because of purge lag. If we have that then the undo_page::state should not be 'to free'. The fact that all pages are of type 'update' and in state 'to free' is hard to explain after the fix. May be:
1) The server running this doesn't have the fix or
2) The dump restore was not performed on the server after applying the fix to reclaim the pages already lost or
3) There is some error in the enhanced innochecksum or
4) We have screwed the ::type and ::state coupling which seems very strange. The only place where state is set to TRX_UNDO_TO_FREE is in trx_undo_set_state_at_finish():

inaam@bling:~/w/mysql-5.1-innodb/storage/innodb_plugin$ grep -riIn TRX_UNDO_TO_FREE .
./include/trx0undo.h:354:#define        TRX_UNDO_TO_FREE        3       /* insert undo segment can be freed */
./trx/trx0undo.c:1325:  if (state == TRX_UNDO_TO_FREE) {
./trx/trx0undo.c:1834:          state = TRX_UNDO_TO_FREE;
./trx/trx0undo.c:1927:                || undo->state == TRX_UNDO_TO_FREE);
./trx/trx0undo.c:1960:          ut_ad(undo->state == TRX_UNDO_TO_FREE);

1832         } else if (undo->type == TRX_UNDO_INSERT) {
1833 
1834                 state = TRX_UNDO_TO_FREE;
1835         } else {
1836                 state = TRX_UNDO_TO_PURGE;
1837         }
1838 
1839         undo->state = state;
1840 
1841         mlog_write_ulint(seg_hdr + TRX_UNDO_STATE, state, MLOG_2BYTES, mtr);

I'd like to be 100% sure that this is not a false alarm.

I am reasonably certain the server has the fix. Dump & reload was not done. Does that need to be done to stop the leak? This server continued to leak with the fix. I assumed the fix would not reclaim previously leaked blocks but would stop the leak from growing.

Mark,

The patch should fix the leak. I have to look into code to see what happens when a 'to be purged' undo log page is purged.

Added to change log:

The InnoDB system tablespace could grow continually for a server
under heavy load.

Pushed into mysql-5.5 5.5.9 (revid:jonathan.perkin@oracle.com-20101216101358-fyzr1epq95a3yett) (version source revid:jonathan.perkin@oracle.com-20101216101358-fyzr1epq95a3yett) (merge vers: 5.5.9) (pib:24)

I am running sysbench tests for 5.1.52+facebook-patch and 5.5.8 unmodified. Binaries don't share datafiles, so the files used by 5.5.8 are only used by it. After running the tests a few times and using file-per-table, the ibdata1 file is 247G for 5.5.8. I am running enhanced innochecksum on it now to understand why it is so big.

Does this bug exist in 5.5.8?

On the bright side, my patched 5.1.52 build does not have the leak so I think the patch fixed the problem while my test of 5.1.47 unmodified also reproduces the leak.

Mark,

The bug reported has been fixed in 5.5.8 as well.

Having said that, there are many changes in 5.5 relating to purge and rbs. If you see unexplained growth in the ibdata file in your tests, it'd make sense if you open a new bug report to nail that down.

It was purge lag. Lag is much greater on 5.5.8 at benchmark end than on 5.1.52. I will modify my scripts to wait for it to go to zero after each test and then revisit why lag gets worse on 5.5.8.

Mark, you could also try a separate purge thread available in 5.5,
--innodb-purge-threads=1.
But the fact that the purge lag became bigger in 5.5 in a real world
application is also very interesting to investigate.
If it is bigger, it could have something to do with multiple RBS in 5.5.
Or it could be because of a bigger throughput in 5.5 so the master/purge thread
can't handle it.

Discussion related to this continues in bug #59291.

luo

Attachment: luo.txt (text/plain), 4.41 KiB.

luo

Attachment: luo.txt (text/plain), 4.41 KiB.