MySQL Bugs: #34701: Restore Hangs for TPCB database

Bug #34701	Restore Hangs for TPCB database
Submitted:	20 Feb 2008 19:24	Modified:	29 Aug 2008 9:11
Reporter:	Hema Sridharan	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Backup	Severity:	S1 (Critical)
Version:	mysql-6.0-backup	OS:	Linux
Assigned to:	Øystein Grøvlen	CPU Architecture:	Any

Description:
1) I create a huge database TPCB
2) I backup the database to local file
3) I restore the database from the local file and the restore hangs.

How to repeat:
1) Create a large database of around (450MB)

you can check this,

ndbdev@ndb16:/data2/hema/backup_dmp> ls -lhrt

-rw-rw----  1 ndbdev ndbdev  363 Feb 20 04:50 sp
-rw-rw----  1 ndbdev ndbdev  363 Feb 20 04:50 sp1
-rw-rw----  1 ndbdev ndbdev 4.3M Feb 20 06:54 tpcb1.dmp
-rw-rw----  1 ndbdev ndbdev 424M Feb 20 13:57 tpcb2

2) I backup this database to 
backup database TPCB to '/data2/hema/backup_dmp/tpcb2'

mysql> backup database TPCB to '/data2/hema/backup_dmp/tpcb2';
+-----------+
| backup_id |
+-----------+
| 2         |
+-----------+
1 row in set (1 min 11.97 sec)

3)I drop the database TPCB

mysql> drop database TPCB;
Query OK, 6 rows affected (0.85 sec)

4)mysql> restore from '/data2/hema/backup_dmp/tpcb2';

5) The above restore step hangs

6) I check the mysql.online_backup table, this is what I get,

mysql> select * from mysql.online_backup;
+-----------+------------+------------+-------------+--------------+-----------+-----------+-------------+-------------+---------------------+---------------------+---------------------+---------------------+----------+-------------------------------------+--------------+------------------------------------------------------------+-------------------+
| backup_id | process_id | binlog_pos | binlog_file | backup_state | operation | error_num | num_objects | total_bytes | validity_point_time | start_time          | stop_time           | host_or_server_name | username | backup_file                         | user_comment | command                                                    | engines           |
+-----------+------------+------------+-------------+--------------+-----------+-----------+-------------+-------------+---------------------+---------------------+---------------------+---------------------+----------+-------------------------------------+--------------+------------------------------------------------------------+-------------------+
       2 |          0 |          0 | NULL        | complete     | backup    |         0 |           6 |   366000000 | 2008-02-20 12:57:13 | 2008-02-20 12:56:01 | 2008-02-20 12:57:13 | localhost           | root     | /data2/hema/backup_dmp/tpcb2        |              | backup database TPCB to '/data2/hema/backup_dmp/tpcb2'     | Default           |
|        6 |          0 |          0 | NULL        | running      | restore   |         0 |           0 |        NULL | NULL                | 2008-02-20 13:34:46 | NULL                | localhost           | root     | /data2/hema/backup_dmp/tpcb2        |              | restore from '/data2/hema/backup_dmp/tpcb2'                | Default           |
+-----------+------------+------------+-------------+--------------+-----------+-----------+-------------+-------------+---------------------+---------------------+---------------------+---------------------+----------+-------------------------------------+--------------+------------------------------------------------------------+-------------------+

7) The restore never completes.

Suggested fix:
The restore operation should not hang regardless of size of database.

Thank you for the bug report.

I managed to repeat the problem and found the possible reason.

DETAILED PROBLEM DESCRIPTION

RESTORE hangs because of a bug in the backup stream library code causing it to fail to detect end of the backup stream. 

Backup stream is divided into fixed size blocks. The size of the block is stored at the beginning of the stream. It must be correctly read before the stream can be processed. It happens in this place inside load_buffer() function in stream_v1_transport.c:

>     for (i= 0; i<4; ++i)
>     {
>       block_size >>= 8;
>       block_size |= (*(s->buf.begin++)) << 3*8;
>     }

Here block_size is of type "long unsigned int" and s->buf.begin is a pointer to "unsigned char". The problem is caused by implicit cast to "int" type which happens inside the expression "(*(s->buf.begin++)) << 3*8". If the highest bit becomes 1, compiler thinks that this is a negative integer. When assigning negative integer to block_size which is of unsigned type, if the size of block_size is bigger than the size of int type, the overflowing bits are set to 1, not to 0! This leads to wrong value of block_size. A fix is to use an explicit cast to "long unsigned int" like this

>     for (i= 0; i<4; ++i)
>     {
>       block_size >>= 8;
>       block_size |= ((unsigned long int)*(s->buf.begin++)) << 3*8;
>     }

Interestingly, the problem manifests itself only on 64bit platform, because only there "long int" is bigger thant "int". Also, due to the logic of reading backup stream, the problem shows up only when more than one input block is read and that required restoring a big database.

There are two problems to be fixed:

- verify that no such incorrect implicit casts are done in the code,
- make backup stream library more robust, so that it terminates gracefully in 
  case stream ends, even if it is confused about block size or other parameters 
  of the stream.

Note that completely satisfactory solution of this problem is to introduce checksums in the stream which is planned to be done in the future.

is it possible that 5.0.54a-enterprise-gpl-log has the same bug?
a restore of my database hangs too!

There are two related bugs: BUG#36624 and BUG#36586. After closing these bugs it should be checked if the issue was resolved.

This is probably duplicate of BUG#36571.

I am not able to repeat this.  As Rafal commented, it is probably fixed by BUG#36624 and BUG#36586.