Bug #42339 Backup of a small database takes too long time with high load
Submitted: 26 Jan 2009 9:09    Modified: 11 Aug 2009 12:05
Reporter: Bernt Marius Johnsen
Status: Duplicate
Category: MySQL Server: Backup    Severity: S2 (Serious)
Version: mysql-6.0-backup    OS: Any
Assigned to: Rafal Somla    CPU Architecture: Any

[26 Jan 2009 9:09] Bernt Marius Johnsen
Description:
Backup takes a very long time if there is high load on the database: more than 2 hours in this case, for a backup that takes 2 seconds with no load. This might be a duplicate of bug report #42338, but not necessarily.

How to repeat:
Run the jar file jl.jar attached to bug report #42338, with the property file attached to the same report, like this:

java -classpath jl.jar:mysql-connector-java-5.1.7-bin.jar com.sun.hadb.misc.jl.JL
test-mysql.prop

Wait 10-15 minutes after the line

2009-01-26 09:47:10.867** Starting test

has appeared. Then take a backup of the database named longtest.
[26 Jan 2009 13:04] Sveta Smirnova
Thank you for the report.

Verified as described. The problem is only repeatable if updates happen on InnoDB tables.
Probably duplicate of bug #34414 as well.
[27 Jan 2009 10:15] Jørgen Løland
Further info from original reporter:

The test program runs a mix of selects, inserts, deletes, and updates, with extensive use of aggregates. The client runs with autocommit, so there are no long transactions.
[27 Jan 2009 10:24] Jørgen Løland
More from the original reporter: the problems started to occur between 8 and 16 concurrent threads running back-to-back transactions. The response time for backup then increases in an exponential-like way.

------------

This problem *may* be due to the simple commit blocker implementation used by backup. WL#4610 is for refining that piece of code. Thus, further investigation is required - this may or may not be a duplicate of bug#34414.

An interesting next step could be to use longer transactions, e.g. the same operations as in the test client, but with 5-10 operations bundled into each transaction. If the CB (commit blocker) really is the root of this problem, the reported effects should be observable with far fewer threads running back-to-back transactions.
[22 Apr 2009 8:04] Rafal Somla
The load from jl.jar, implemented in the MySQL test framework using RQG

Attachment: jl.test (application/octet-stream, text), 1.05 KiB.

[22 Apr 2009 8:05] Rafal Somla
RQG grammar for jl.test

Attachment: jl.rqg (application/octet-stream, text), 1.89 KiB.

[22 Apr 2009 8:09] Rafal Somla
I have transcribed the jl.jar code into a test case that uses the Random Query Generator (RQG). To run jl.test, I do the following:

- place jl.rqg in mysql-test/ directory,
- place jl.test in mysql-test/t directory,
- set RQG_HOME environment variable to point at the directory where RQG is installed,
- run test with "./mtr jl".

One needs to have RQG installed. See http://forge.mysql.com/wiki/RandomQueryGenerator.
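
A minimal shell transcript of the steps above (the RQG install path is an assumption; adjust it to your setup):

```shell
# Assumes RQG is installed under /opt/randgen (illustrative path).
export RQG_HOME=/opt/randgen

# Copy the attached files into the test tree.
cp jl.rqg  mysql-test/
cp jl.test mysql-test/t/

# Run the test through mysql-test-run.
cd mysql-test
./mtr jl
```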
[22 Apr 2009 10:35] Rafal Somla
An improved version of jl.test

Attachment: jl.test (application/octet-stream, text), 1.58 KiB.

[22 Apr 2009 10:42] Rafal Somla
The improved version of jl.test prepares the test data much faster and can handle 500000 rows per table, like the original jl.jar. It also uses debug synchronization points so that it can be synchronized with a client performing a parallel BACKUP operation.

After copying files to mysql-test/ directory and setting the RQG_HOME variable, I use this test as follows:

- I start the test with "./mtr jl --testcase-timeout=30000". After server startup test stops waiting for jl_start signal.
- I start another connection to the server instance created by MTR: "mysql -uroot -S var/tmp/mysqld.1.sock".
- From this connection I send the jl_start signal: "set debug_sync= 'now signal jl_start';".
- I wait for jl test to create and fill the tables until it starts waiting for jl_load signal.
- Now I can see how much time it takes to backup db1 database without any load.
- I send jl_load signal: "set debug_sync= 'now signal jl_load';".
- Now I can see how much time it takes to backup db1 under the load.
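
The sequence above can be sketched as a second-client session like this (the backup image file names are illustrative; the socket path comes from the MTR instance):

```shell
# Second client talking to the server started by MTR.
MYSQL="mysql -uroot -S var/tmp/mysqld.1.sock"

# Let jl.test create and populate the tables.
$MYSQL -e "set debug_sync= 'now signal jl_start'"

# Once the test blocks waiting for jl_load, time a backup of the idle
# database (the image name 'db1-idle.bak' is just an example).
time $MYSQL -e "BACKUP DATABASE db1 TO 'db1-idle.bak'"

# Start the parallel load and time the same backup again.
$MYSQL -e "set debug_sync= 'now signal jl_load'"
time $MYSQL -e "BACKUP DATABASE db1 TO 'db1-load.bak'"
```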
[22 Apr 2009 12:48] Rafal Somla
MEASUREMENT RESULTS
-------------------
Here are the results of measurements I did using the provided jl.test script. I manually backed up a database containing 4 tables with around 500k rows each, both with an idle server and with 64 parallel threads modifying data in the same database. The tables used the InnoDB storage engine.

I compared times for three operations: a backup of the database, select count(*) from one of the tables, and a simple scan of one of the tables performed at the handlerton level using modified server code. The latter operation mimics what the CS backup driver does, but without any I/O or other overhead - all rows of the table are simply fetched and ignored.

Times on an idle server:

- backup of the database	  7 sec
- select count(*) from one table  1 sec
- scan of one table               1 sec

Times on server serving 64 parallel threads:

- backup of the database	  21 min
- select count(*) from one table   5 min  (x4 = 20 min)
- scan of one table                6 min  (x4 = 24 min)

Since the database being backed up contains 4 tables of approximately the same size, the times shown for a single table scan should be multiplied by 4 for comparison with the backup operation. Given that, one can see that BACKUP does not add significant overhead over a simple scan of a table.

Under load, the time required to perform both operations increases dramatically, but this is most probably an inefficiency of the InnoDB storage engine under concurrent load. This is confirmed by the fact that, under the same load, both selects and BACKUP take only a few seconds if the MyISAM storage engine is used.
[22 Apr 2009 12:57] Rafal Somla
Triage Team,

I propose to move this bug to the "won't fix" state.

Given the measurements and research I did, it seems there is no problem with the backup code. Perhaps a new bug against the InnoDB engine should be opened after developing a good test showing the problem (a test which would not involve the BACKUP operation, but e.g. SELECT COUNT(*), as I did in my testing). Note that the backup team has WL#4406 for systematic testing of backup/restore performance.
[11 Aug 2009 12:05] Jørgen Løland
Using the attached Java test program, I am able to reproduce the reported problem. Attaching gdb to mysqld shows that the BACKUP process is blocked while trying to acquire the global read lock (see stack below).

Since Rafal was unable to reproduce the problem with RQG, it may be that RQG cannot generate a high enough load to starve backup. mysql-stress-test.pl can also reproduce the problem reported in this bug (see BUG#44568).
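
For reference, a hedged sketch of how such a stack can be captured (assumes a single mysqld process and gdb available on the host):

```shell
# Attach non-interactively and dump all thread backtraces; the BACKUP
# thread is the one sitting in lock_global_read_lock().
gdb -batch -p "$(pidof mysqld)" -ex "thread apply all bt"
```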

Stack of BACKUP thread:
-----------------------
#0  0xb8010430 in __kernel_vsyscall ()
#1  0xb7fe30e5 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/tls/i686/cmov/libpthread.so.0
#2  0x08652e4e in safe_cond_wait (cond=0x8971fa0, mp=0x8971900, 
    file=0x875b097 "lock.cc", line=1161) at thr_mutex.c:423
#3  0x082955a9 in lock_global_read_lock (thd=0x9709670) at lock.cc:1161
#4  0x08704149 in backup::block_commits (thd=0x9709670, tables=0x0)
    at data_backup.cc:395
#5  0x0870480d in backup::write_table_data (thd=0x9709670, info=@0xa2c32f0, 
    s=@0x9d1b700) at data_backup.cc:700
#6  0x086f9f1b in Backup_restore_ctx::do_backup (this=0xb73e9b64)
    at kernel.cc:1315
#7  0x086fcca8 in execute_backup_command (thd=0x9709670, lex=0x970a5bc, 
    backupdir=0xb73ea140, overwrite=false, skip_gap_event=false)
    at kernel.cc:218
#8  0x082b335e in mysql_execute_command (thd=0x9709670) at sql_parse.cc:2432
[11 Aug 2009 12:18] Jørgen Løland
Duplicate of BUG#42338 and BUG#44568.