Bug #46585 Inconsistent cluster or crash during SR following a table-reorg
Submitted: 6 Aug 2009 14:06 Modified: 16 Oct 2009 11:33
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: mysql-5.1-telco-7.0 OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[6 Aug 2009 14:06] Jonas Oreland
Description:
After performing a table reorg that adds partitions, a subsequent
SR (system restart) goes wrong in one of two ways:

1) If an LCP has not completed, the new partitions will not be restored,
   i.e. all data that was moved will be lost.
   This is because DIH hasn't saved its table definition and will have
   the old version, which has fewer partitions.

2) If an LCP has completed, ordered indexes/no-logging tables will still
   not be saved, as they don't participate in LCP.

   For an ordered index this means that the cluster will crash during SR,
   as it will find the inconsistency between the main table and the
   ordered index.

How to repeat:
add node group
do table reorg
perform system restart
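
The three steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: the node IDs (3 and 4), the table name `t1`, and the database `test` are hypothetical placeholders, and the new data nodes are assumed to be configured and already running. The helper only builds and prints the commands; it does not run them.

```python
import shlex

def repro_commands(new_node_ids, table, database="test"):
    """Build the command sequence for the repro (hypothetical helper).

    new_node_ids, table and database are placeholders chosen for
    illustration; they do not come from the bug report.
    """
    nodes = ",".join(str(n) for n in new_node_ids)
    return [
        # 1. add node group: form a node group from the new data nodes
        ["ndb_mgm", "-e", f"CREATE NODEGROUP {nodes}"],
        # 2. do table reorg: redistribute the table over the new partitions
        ["mysql", database, "-e",
         f"ALTER ONLINE TABLE {table} REORGANIZE PARTITION"],
        # 3. perform system restart: shut the whole cluster down;
        #    the management and data node processes are then started again
        ["ndb_mgm", "-e", "SHUTDOWN"],
    ]

commands = repro_commands([3, 4], "t1")
for cmd in commands:
    print(shlex.join(cmd))
```

After the shutdown, the management server and data nodes have to be started again by hand to complete the system restart. Whether an LCP completes between the reorg and the shutdown decides which of the two cases in the description is hit.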

Suggested fix:
DICT has the correct information.
Force DIH to use it instead of relying on its own copy.
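
To illustrate why a stale table definition loses data in case 1, and why taking the definition from DICT avoids that, here is a toy model. It is not NDB kernel code: `TableDef`, `system_restart` and the `use_dict` flag are hypothetical names invented for this sketch, and the partition counts (2 before the reorg, 4 after) are assumed.

```python
from dataclasses import dataclass

@dataclass
class TableDef:
    version: int
    partitions: int

def system_restart(dict_def, dih_def, fragments, use_dict):
    """Simulate which partitions are restored during a system restart.

    Toy model: only partitions listed in the chosen table definition
    are restored; anything beyond that count is silently skipped.
    """
    table_def = dict_def if use_dict else dih_def
    return {
        part_id: list(fragments.get(part_id, []))
        for part_id in range(table_def.partitions)
    }

def row_count(restored):
    return sum(len(rows) for rows in restored.values())

# After the reorg, DICT knows about 4 partitions, but DIH still holds
# the old definition with 2 because it has not saved the new one yet.
dict_def = TableDef(version=2, partitions=4)
dih_def = TableDef(version=1, partitions=2)

# One row per partition on disk; rows "c" and "d" were moved by the reorg.
fragments = {0: ["a"], 1: ["b"], 2: ["c"], 3: ["d"]}

buggy = system_restart(dict_def, dih_def, fragments, use_dict=False)
fixed = system_restart(dict_def, dih_def, fragments, use_dict=True)

print(row_count(buggy))  # 2: rows in the new partitions are lost
print(row_count(fixed))  # 4: all partitions are restored
```

The model covers only the partition-count mismatch. It does not attempt to represent ordered indexes or unlogged tables (case 2).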
[15 Oct 2009 13:00] Jonas Oreland
patch ready...now "only" need to write test-prg
[16 Oct 2009 6:33] Jonas Oreland
pushed to 7.0.9
[16 Oct 2009 11:33] Jon Stephens
Documented bugfix in the NDB-7.0.9 changelog as follows:

        Performing a system restart of the cluster after having
        performed a table reorganization which added partitions caused
        the cluster to become inconsistent, possibly leading to a forced
        shutdown, in either of the following cases:

            1. When a local checkpoint was in progress but had not yet
            completed, new partitions were not restored; that is, data
            that was supposed to be moved could be lost instead, leading
            to an inconsistent cluster. This was due to an issue whereby
            the DBDIH kernel block did not save the new table definition
            and instead used the old one (the version having fewer
            partitions).

            2. When the most recent LCP had completed, ordered indexes and
            unlogged tables were still not saved (since these did not
            participate in the LCP). In this case, the cluster crashed
            during a subsequent system restart, due to the inconsistency
            between the main table and the ordered index.

        Now, DBDIH is forced to use the version of the table definition
        held by the DBDICT kernel block, which was (already) correct and
        up to date.

Closed.