Bug #34701 Restore Hangs for TPCB database
Submitted: 20 Feb 2008 20:24 Modified: 29 Aug 2008 11:11
Reporter: Hema Sridharan
Status: Can't repeat
Category:Server: Backup Severity:S1 (Critical)
Version:mysql-6.0-backup OS:Linux
Assigned to: Oystein Grovlen Target Version:6.0-rc
Triage: D2 (Serious)

[20 Feb 2008 20:24] Hema Sridharan
Description:
1) I create a huge database TPCB
2) I backup the database to local file
3) I restore the database from the local file and the restore hangs.

How to repeat:
1) Create a large database of around (450MB)

you can check this,

ndbdev@ndb16:/data2/hema/backup_dmp> ls -lhrt

-rw-rw----  1 ndbdev ndbdev  363 Feb 20 04:50 sp
-rw-rw----  1 ndbdev ndbdev  363 Feb 20 04:50 sp1
-rw-rw----  1 ndbdev ndbdev 4.3M Feb 20 06:54 tpcb1.dmp
-rw-rw----  1 ndbdev ndbdev 424M Feb 20 13:57 tpcb2

2) I backup this database to 
backup database TPCB to '/data2/hema/backup_dmp/tpcb2'

mysql> backup database TPCB to '/data2/hema/backup_dmp/tpcb2';
+-----------+
| backup_id |
+-----------+
| 2         |
+-----------+
1 row in set (1 min 11.97 sec)

3)I drop the database TPCB

mysql> drop database TPCB;
Query OK, 6 rows affected (0.85 sec)

4)mysql> restore from '/data2/hema/backup_dmp/tpcb2';

5) The above restore step hangs

6) I check the mysql.online_backup table, this is what I get,

mysql> select * from mysql.online_backup;
+-----------+------------+------------+-------------+--------------+-----------+-----------+-------------+-------------+---------------------+---------------------+---------------------+---------------------+----------+-------------------------------------+--------------+------------------------------------------------------------+-------------------+
| backup_id | process_id | binlog_pos | binlog_file | backup_state | operation |
error_num | num_objects | total_bytes | validity_point_time | start_time          |
stop_time           | host_or_server_name | username | backup_file                       
 | user_comment | command                                                    | engines    
      |
+-----------+------------+------------+-------------+--------------+-----------+-----------+-------------+-------------+---------------------+---------------------+---------------------+---------------------+----------+-------------------------------------+--------------+------------------------------------------------------------+-------------------+
       2 |          0 |          0 | NULL        | complete     | backup    |         0 |
          6 |   366000000 | 2008-02-20 12:57:13 | 2008-02-20 12:56:01 | 2008-02-20
12:57:13 | localhost           | root     | /data2/hema/backup_dmp/tpcb2        |        
     | backup database TPCB to '/data2/hema/backup_dmp/tpcb2'     | Default           |
|        6 |          0 |          0 | NULL        | running      | restore   |         0
|           0 |        NULL | NULL                | 2008-02-20 13:34:46 | NULL            
   | localhost           | root     | /data2/hema/backup_dmp/tpcb2        |              |
restore from '/data2/hema/backup_dmp/tpcb2'                | Default           |
+-----------+------------+------------+-------------+--------------+-----------+-----------+-------------+-------------+---------------------+---------------------+---------------------+---------------------+----------+-------------------------------------+--------------+------------------------------------------------------------+-------------------+

7) The restore never completes.

Suggested fix:
The restore operation should not hang regardless of size of database.
[21 Feb 2008 12:03] Miguel Solorzano
Thank you for the bug report.
[4 Mar 2008 13:39] Rafal Somla
I managed to repeat the problem and found the possible reason.

DETAILED PROBLEM DESCRIPTION

RESTORE hangs because of a bug in the backup stream library code causing it to fail to
detect end of the backup stream. 

Backup stream is divided into fixed size blocks. The size of the block is stored at the
beginning of the stream. It must be correctly read before the stream can be processed. It
happens in this place inside load_buffer() function in stream_v1_transport.c:

>     for (i= 0; i<4; ++i)
>     {
>       block_size >>= 8;
>       block_size |= (*(s->buf.begin++)) << 3*8;
>     }

Here block_size is of type "long unsigned int" and s->buf.begin is a pointer to "unsigned
char". The problem is caused by implicit cast to "int" type which happens inside the
expression "(*(s->buf.begin++)) << 3*8". If the highest bit becomes 1, compiler thinks
that this is a negative integer. When assigning negative integer to block_size which is
of unsigned type, if the size of block_size is bigger than the size of int type, the
overflowing bits are set to 1, not to 0! This leads to wrong value of block_size. A fix
is to use an explicit cast to "long unsigned int" like this

>     for (i= 0; i<4; ++i)
>     {
>       block_size >>= 8;
>       block_size |= ((unsigned long int)*(s->buf.begin++)) << 3*8;
>     }

Interestingly, the problem manifests itself only on 64bit platform, because only there
"long int" is bigger thant "int". Also, due to the logic of reading backup stream, the
problem shows up only when more than one input block is read and that required restoring
a big database.

There are two problems to be fixed:

- verify that no such incorrect implicit casts are done in the code,
- make backup stream library more robust, so that it terminates gracefully in 
  case stream ends, even if it is confused about block size or other parameters 
  of the stream.

Note that completely satisfactory solution of this problem is to introduce checksums in
the stream which is planned to be done in the future.
[13 Mar 2008 12:50] Sander de Boer
is it possible that 5.0.54a-enterprise-gpl-log has the same bug?
a restore of my database hangs too!
[9 May 2008 14:44] Rafal Somla
There are two related bugs: BUG#36624 and BUG#36586. After closing these bugs it should be
checked if the issue was resolved.
[21 May 2008 14:13] Rafal Somla
This is probably duplicate of BUG#36571.
[29 Aug 2008 11:11] Oystein Grovlen
I am not able to repeat this.  As Rafal commented, it is probably fixed by BUG#36624 and
BUG#36586.