Bug #36044 START BACKUP causes ndb crash with >4096 tables
Submitted: 14 Apr 2008 2:31
Modified: 30 May 2008 13:30
Reporter: Jeff Sturm
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1.24-ndb-6.3.13-telco
OS: Any
Assigned to: Martin Skold
CPU Architecture: Any
Tags: Backup, Contribution, ndb

[14 Apr 2008 2:31] Jeff Sturm
Description:
START BACKUP results in an ndbd crash and loss of cluster services, with the following diagnostic:

Time: Sunday 13 April 2008 - 13:30:45
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: backup/Backup.cpp
Error object: BACKUP (Line: 3348) 0x0000000a
Program: /opt/mysql/libexec/ndbd
Pid: 8911
Trace: /var/lib/mysql/data/ndb_2_trace.log.8
Version: mysql-5.1.24 ndb-6.3.13-RC
***EOM***

This occurs whenever the total number of tables+indexes exceeds 4096.

Related to this, the listObjects API call returns incorrect tableId values for any tableId above 4095 (the largest value a 12-bit field can hold), and ndb_show_tables displays incorrect output for the same reason: all of them use a common bitmask named ListTablesData.
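
The wraparound can be illustrated with a minimal sketch. The setField/getField helpers below are hypothetical stand-ins for BitmaskImpl::setField and BitmaskImpl::getField on a single 32-bit word; they are not the NDB implementation. The sketch assumes that the last two arguments are (first bit, field width) and that values wider than the field are truncated to its low bits.

```cpp
#include <cstdint>

// Hypothetical stand-ins for BitmaskImpl::setField/getField, restricted to
// one 32-bit word. pos = first bit of the field, len = field width (< 32).
inline void setField(std::uint32_t& data, unsigned pos, unsigned len,
                     std::uint32_t val) {
  const std::uint32_t mask = ((1u << len) - 1u) << pos;
  // Bits of val that do not fit in the field are silently dropped.
  data = (data & ~mask) | ((val << pos) & mask);
}

inline std::uint32_t getField(std::uint32_t data, unsigned pos, unsigned len) {
  return (data >> pos) & ((1u << len) - 1u);
}

// Store a tableId in a 12-bit field at bit 0 (the pre-patch layout)
// and read it back.
inline std::uint32_t roundTripTableId12(std::uint32_t tableId) {
  std::uint32_t data = 0;
  setField(data, 0, 12, tableId);
  return getField(data, 0, 12);
}
```

Under these assumptions, roundTripTableId12(4095) returns 4095, but roundTripTableId12(4096) returns 0 and roundTripTableId12(4100) returns 4, so any consumer of the packed word sees the wrong table.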

We can work around the ndb_show_tables bug in our production systems.  The backup issue, however, is critical to us because it will result in a complete cluster crash if our schema grows beyond 4096 tables and indexes.  Further, this limit is far short of the documented maximum of 20,320 tables in NDB 5.1.

How to repeat:
Create ~1,500 instances of a table with two indexes similar to:

create table junk (
  id integer auto_increment not null primary key,
  time datetime,
  index junk_time (time)
);

Then:

a) Run ndb_show_tables and verify that the tableId column wraps around after 4095, and
b) Attempt START BACKUP from the management console and observe the ndbd failure.
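
One way to produce the test schema is to generate the statements programmatically. The sketch below is only an illustration: the junk_N naming scheme and the engine=ndbcluster clause are assumptions added here (the report does not say how the instances were named or how the engine was selected), and objectCount assumes each table contributes three schema objects, namely the table plus its two indexes.

```cpp
#include <sstream>
#include <string>

// Build n copies of the table definition above, named junk_1 .. junk_n.
// The names and the ENGINE clause are assumptions made for illustration.
inline std::string makeSchema(int n) {
  std::ostringstream out;
  for (int i = 1; i <= n; ++i) {
    out << "create table junk_" << i << " (\n"
        << "  id integer auto_increment not null primary key,\n"
        << "  time datetime,\n"
        << "  index junk_time (time)\n"
        << ") engine=ndbcluster;\n";
  }
  return out.str();
}

// Schema objects created, assuming one table plus two indexes per table.
constexpr int objectCount(int tables) { return tables * 3; }
```

Writing makeSchema(1500) to standard output and feeding it to the mysql client would create the tables; under the stated assumption objectCount(1500) is 4500, which is past the 4096 threshold.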

Suggested fix:
The ListTablesData bitfield allocates 12 bits for tableId and 8 bits for tableType.  The former was sufficient for 5.0, which had a hard limit of 1600 tables, but is insufficient for 5.1, which can grow to 20320 tables.

However, the maximum tableType is currently 23, which fits in 5 bits, so it seems you could shave 3 bits off the tableType field and give them to tableId.
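
The bit budget behind this suggestion can be checked with a small sketch. bitsNeeded is a hypothetical helper written for this illustration, not part of NDB, and the constexpr loop assumes a C++14 compiler.

```cpp
#include <cstdint>

// Smallest number of bits able to represent every value in 0..maxValue.
constexpr unsigned bitsNeeded(std::uint32_t maxValue) {
  unsigned bits = 0;
  while (maxValue != 0) {
    ++bits;
    maxValue >>= 1;
  }
  return bits;
}

// A tableType of at most 23 fits in 5 bits, freeing 3 of the original 8.
static_assert(bitsNeeded(23) == 5, "tableType <= 23 fits in 5 bits");
// Table ids up to 20320 need 15 bits, i.e. the original 12 plus those 3.
static_assert(bitsNeeded(20320) == 15, "ids up to 20320 fit in 15 bits");
// The two fields together still occupy 20 bits, so a field starting at
// bit 20 is not disturbed.
static_assert(15 + 5 == 12 + 8, "combined width is unchanged");
```

Because the combined width stays at 20 bits, only the boundary between the two fields moves; the trade-off is that tableType values above 31 would no longer fit.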

The following patch effectively solves our problem.

diff -ur mysql-5.1.24-ndb-6.3.13-telco/storage/ndb/include/kernel/signaldata/ListTables.hpp mysql-5.1.24-ndb-6.3.13-telco-eprize/storage/ndb/include/kernel/signaldata/ListTables.hpp
--- mysql-5.1.24-ndb-6.3.13-telco/storage/ndb/include/kernel/signaldata/ListTables.hpp  2008-04-10 13:27:21.000000000 -0400
+++ mysql-5.1.24-ndb-6.3.13-telco-eprize/storage/ndb/include/kernel/signaldata/ListTables.hpp   2008-04-13 21:01:07.000000000 -0400
@@ -26,16 +26,16 @@
 class ListTablesData {
 public:
   static Uint32 getTableId(Uint32 data) {
-    return BitmaskImpl::getField(1, &data, 0, 12);
+    return BitmaskImpl::getField(1, &data, 0, 15);
   }
   static void setTableId(Uint32& data, Uint32 val) {
-    BitmaskImpl::setField(1, &data, 0, 12, val);
+    BitmaskImpl::setField(1, &data, 0, 15, val);
   }
   static Uint32 getTableType(Uint32 data) {
-    return BitmaskImpl::getField(1, &data, 12, 8);
+    return BitmaskImpl::getField(1, &data, 15, 5);
   }
   static void setTableType(Uint32& data, Uint32 val) {
-    BitmaskImpl::setField(1, &data, 12, 8, val);
+    BitmaskImpl::setField(1, &data, 15, 5, val);
   }
   static Uint32 getTableStore(Uint32 data) {
     return BitmaskImpl::getField(1, &data, 20, 3);
[9 May 2008 22:18] Martin Skold
Patch sent for review
[16 May 2008 13:13] Martin Skold
A patch has been committed:
http://lists.mysql.com/commits/46766
[22 May 2008 6:55] Martin Skold
Fix pushed to 5.1-telco-6.2
http://lists.mysql.com/commits/46766
merged to 5.1-telco-6.3.
[30 May 2008 13:05] Jon Stephens
Documented in the 5.1.24-ndb-6.2.16 and 5.1.24-ndb-6.1.15 changelogs as follows:

        If the combined total of tables and indexes in the cluster was greater
        than 4096, issuing START BACKUP caused data nodes to fail.

Left PQ status pending any further merges.
[30 May 2008 13:30] Jon Stephens
Closed. (Not going anywhere other than NDB 6.2/6.3.)