Bug #48861 Restoring backups aborts the ndbd's leaving processes 'hanging'
Submitted: 18 Nov 2009 10:48
Modified: 6 Dec 2009 10:08
Reporter: Geert Vanderkelen
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.1.39-ndb-6.3.28b
OS: Linux
Assigned to: Jonas Oreland
CPU Architecture: Any
Tags: Backup, crash, ndbd, restore

[18 Nov 2009 10:48] Geert Vanderkelen
Description:
Doing something nobody should: restoring backups in parallel (caused by a bad copy/paste).

With 2 data nodes, if you restore both data and metadata on the first node and, at the same time, launch a data-only restore on the other node, the ndbd processes abort with signal 6.

The problem, at least in debug builds, is that the ndbd processes never exit. If you start the ndbd processes again, they also report no errors.

Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: main.cpp

How to repeat:
1) Start empty data nodes.
2) Restore the backup (meta+data) on the first node.
3) While 2) is still running, restore data only on the second node.

Watch it all go wrong. A failure is expected here, but aborting the data nodes is a bit harsh IMHO.
[18 Nov 2009 11:04] Geert Vanderkelen
Actually, this happens with just a single restore (table name obfuscated):

_____________________________________________________
Processing data in table: ******/def/NDB$BLOB_5_4(6) fragment 0
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 266: Time-out in NDB, probably caused by deadlock
Temporary error: 4010: Node failure caused abort of transaction
Unknown: 4009: Cluster Failure
Cannot start transaction

Verified using MySQL Cluster 6.3.28 (on Linux, debug build)
[18 Nov 2009 15:01] Geert Vanderkelen
Same backup restores fine (except for some temporary redo buffer errors) in MySQL Cluster 7.0.9b.
[26 Nov 2009 12:10] Geert Vanderkelen
ndb_restore parallelism problem. If you use -p 1 it restores just fine.
Trying with -p 64, same restore fails again.

Workaround: use "-p 1" (that is, --parallelism=1) when restoring.
[30 Nov 2009 9:22] Jonas Oreland
proposed patch

Attachment: bug48861.patch (application/octet-stream, text), 4.42 KiB.

[30 Nov 2009 10:52] Geert Vanderkelen
Applying the patch to 6.3.28 does indeed fix the problem.
[30 Nov 2009 11:07] Jonas Oreland
Docs: When performing tasks that generate a lot of I/O (such as running ndb_restore), an internal memory buffer could overflow, causing the data node to abort with signal 6.

The patch removes the internal buffer entirely, since it is not needed.
[30 Nov 2009 11:11] Jonas Oreland
to be pushed to 6.2.19, 6.3.29 and 7.0.10
[30 Nov 2009 11:15] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/92054

3046 Jonas Oreland	2009-11-30
      ndb - bug#48861 - remove memory channels internal storage (which can overflow) and link the request in a linked list instead
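
Editorial sketch of the idea in the commit message above, not the actual patch. It contrasts a channel that keeps requests in its own fixed-size storage, which can fill up under heavy I/O, with one that links the requests themselves into a list, so the channel has no storage of its own to overflow. All names here (Request, BoundedChannel, LinkedChannel) are hypothetical and are not taken from MemoryChannel.hpp. The sketch is single-threaded and leaves out the locking and signalling that a real inter-thread channel would need.

```cpp
#include <cstddef>

// Hypothetical request type; the real NDB request structure is not shown in this report.
struct Request {
    explicit Request(int i) : id(i), next(nullptr) {}
    int id;
    Request* next;  // intrusive link: the request carries its own list pointer
};

// Channel with internal fixed-size storage (the kind of design that can overflow).
template <std::size_t N>
class BoundedChannel {
public:
    BoundedChannel() : head_(0), count_(0) {}

    // Returns false when the internal buffer is full. The report describes the
    // data node aborting in this situation; this sketch only reports failure.
    bool push(Request* r) {
        if (count_ == N) return false;
        slots_[(head_ + count_) % N] = r;
        ++count_;
        return true;
    }

    // Returns the oldest request, or nullptr if the channel is empty.
    Request* pop() {
        if (count_ == 0) return nullptr;
        Request* r = slots_[head_];
        head_ = (head_ + 1) % N;
        --count_;
        return r;
    }

private:
    Request* slots_[N];
    std::size_t head_;
    std::size_t count_;
};

// Channel without internal storage: requests are chained through Request::next,
// so capacity is bounded only by how many requests exist.
class LinkedChannel {
public:
    LinkedChannel() : head_(nullptr), tail_(nullptr), size_(0) {}

    void push(Request* r) {
        r->next = nullptr;
        if (tail_) tail_->next = r; else head_ = r;
        tail_ = r;
        ++size_;
    }

    // Returns the oldest request, or nullptr if the channel is empty.
    Request* pop() {
        Request* r = head_;
        if (!r) return nullptr;
        head_ = r->next;
        if (!head_) tail_ = nullptr;
        r->next = nullptr;
        --size_;
        return r;
    }

    std::size_t size() const { return size_; }

private:
    Request* head_;
    Request* tail_;
    std::size_t size_;
};
```

With a BoundedChannel<2>, a third push is rejected, whereas a LinkedChannel accepts any number of requests and still returns them in FIFO order. The trade-off is that each request can sit in only one such list at a time, because the link lives inside the request.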
[30 Nov 2009 12:21] Bugs System
Pushed into 5.1.39-ndb-6.3.29 (revid:jonas@mysql.com-20091130113057-vvfogdxcst814cjn) (version source revid:jonas@mysql.com-20091130113057-vvfogdxcst814cjn) (merge vers: 5.1.39-ndb-6.3.29) (pib:13)
[30 Nov 2009 12:21] Bugs System
Pushed into 5.1.39-ndb-7.0.10 (revid:jonas@mysql.com-20091130115355-vmqycis77g5pd0yt) (version source revid:jonas@mysql.com-20091130115355-vmqycis77g5pd0yt) (merge vers: 5.1.39-ndb-7.0.10) (pib:13)
[30 Nov 2009 12:22] Bugs System
Pushed into 5.1.39-ndb-7.1.0 (revid:jonas@mysql.com-20091130121550-lltariazkvytcjox) (version source revid:jonas@mysql.com-20091130121550-lltariazkvytcjox) (merge vers: 5.1.39-ndb-7.1.0) (pib:13)
[1 Dec 2009 13:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/92273

3167 Martin Skold	2009-12-01 [merge]
      Merge
      modified:
        storage/ndb/src/common/debugger/EventLogger.cpp
        storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
        storage/ndb/src/kernel/blocks/ndbfs/AsyncIoThread.hpp
        storage/ndb/src/kernel/blocks/ndbfs/MemoryChannel.hpp
        storage/ndb/src/kernel/blocks/pgman.cpp
        storage/ndb/src/kernel/blocks/pgman.hpp
        storage/ndb/src/mgmsrv/MgmtSrvr.cpp
        storage/ndb/src/ndbapi/NdbOperationDefine.cpp
        storage/ndb/src/ndbapi/NdbOperationSearch.cpp
        storage/ndb/test/ndbapi/testBlobs.cpp
        storage/ndb/test/run-test/daily-basic-tests.txt
        storage/ndb/test/run-test/daily-devel-tests.txt
[1 Dec 2009 13:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/92279

3244 Martin Skold	2009-12-01 [merge]
      Merge
      modified:
        storage/ndb/src/common/debugger/EventLogger.cpp
        storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
        storage/ndb/src/kernel/blocks/ndbfs/AsyncIoThread.hpp
        storage/ndb/src/kernel/blocks/ndbfs/MemoryChannel.hpp
        storage/ndb/src/kernel/blocks/pgman.cpp
        storage/ndb/src/kernel/blocks/pgman.hpp
        storage/ndb/src/mgmsrv/MgmtSrvr.cpp
        storage/ndb/src/ndbapi/NdbOperationDefine.cpp
        storage/ndb/src/ndbapi/NdbOperationSearch.cpp
        storage/ndb/test/ndbapi/testBlobs.cpp
        storage/ndb/test/run-test/daily-basic-tests.txt
        storage/ndb/test/run-test/daily-devel-tests.txt
[1 Dec 2009 14:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/92287

3170 Martin Skold	2009-12-01 [merge]
      Merge
      modified:
        storage/ndb/src/common/debugger/EventLogger.cpp
        storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
        storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.hpp
        storage/ndb/src/kernel/blocks/ndbfs/MemoryChannel.hpp
        storage/ndb/src/kernel/blocks/pgman.cpp
        storage/ndb/src/kernel/blocks/pgman.hpp
        storage/ndb/src/mgmsrv/MgmtSrvr.cpp
        storage/ndb/src/ndbapi/NdbOperationDefine.cpp
        storage/ndb/src/ndbapi/NdbOperationSearch.cpp
        storage/ndb/test/ndbapi/testBlobs.cpp
        storage/ndb/test/run-test/daily-basic-tests.txt
        storage/ndb/test/run-test/daily-devel-tests.txt
[1 Dec 2009 14:22] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/92291

3040 Martin Skold	2009-12-01 [merge]
      Merge
      modified:
        storage/ndb/src/kernel/blocks/ndbfs/AsyncFile.hpp
        storage/ndb/src/kernel/blocks/ndbfs/MemoryChannel.hpp
        storage/ndb/src/kernel/blocks/pgman.cpp
        storage/ndb/src/kernel/blocks/pgman.hpp
        storage/ndb/src/ndbapi/NdbOperationDefine.cpp
        storage/ndb/src/ndbapi/NdbOperationSearch.cpp
        storage/ndb/test/ndbapi/testBlobs.cpp
        storage/ndb/test/run-test/daily-basic-tests.txt
[6 Dec 2009 10:08] Jon Stephens
Documented bugfix in the NDB-6.2.19, 6.3.29, and 7.0.10 changelogs, as follows:

      When performing tasks that generated large amounts of I/O (such as 
      using ndb_restore), an internal memory buffer could overflow, causing 
      data nodes to fail with signal 6.
      
      Subsequent analysis showed that this buffer was not actually required, 
      so this fix removes it.

Closed.