Bug #43268 Ndb : Change Blob part tables partitioning for UserDefined partitioning
Submitted: 27 Feb 2009 17:09 Modified: 10 Aug 2009 11:25
Reporter: Frazer Clement
Status: Closed
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:mysql-5.1-telco-6.2 OS:Any
Assigned to: Frazer Clement CPU Architecture:Any

[27 Feb 2009 17:09] Frazer Clement
Description:
Background

Blob part tables created to store Blob data generally take the same partitioning information as the table containing the Blob.  

For tables with UserDefined partitioning (as used by MySQLD for PARTITION BY [LINEAR] HASH / RANGE / LIST), the Blob parts table is set to use DistrKeyHash partitioning; however, the partition used to store Blob parts is always explicitly set to be the same partition as the main row takes.

This is implemented as follows :
 1) When the main operation has a partition id set, the Blob parts operations take this partition id.
 2) For main table scan operations with no partition id set, an internal read of the NDB$FRAGMENT value for the main table row is inserted.  This value is then used to set the fragment id for the part table rows.

This mechanism is also used for Blobs in non user defined partitioned tables, when the Blob stripe size is 0 (i.e. the partition id is manually set to be the same as the main table's partitioning).
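
As a rough illustration of the two cases above, the choice of partition id for a Blob part operation can be modelled as below. This is a minimal sketch, not the NDB API code: the function name and the callable standing in for the internal NDB$FRAGMENT read are hypothetical.

```python
from typing import Callable, Optional


def blob_part_partition_id(main_op_partition_id: Optional[int],
                           read_main_row_fragment: Callable[[], int]) -> int:
    """Pick the partition id for a Blob part operation (illustrative model).

    main_op_partition_id   -- partition id set on the main operation, or None
    read_main_row_fragment -- hypothetical stand-in for the internal read of
                              the main table row's NDB$FRAGMENT value
    """
    # Case 1: the main operation has a partition id, so the part
    # operations simply take the same one.
    if main_op_partition_id is not None:
        return main_op_partition_id
    # Case 2: no partition id on the main operation (e.g. a scan), so the
    # fragment of the main row is read and reused for the part rows.
    return read_main_row_fragment()
```

In both cases the part rows end up in the partition with the same number as the main row's partition, which is what ties the two tables' partition ids together.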

For Blob v1, the Blob table's primary key includes an unsigned integer array containing the concatenated max length primary key columns from the main table.  The Blob table's distribution key is this 'array' primary key and a dist integer column.  The distribution hash of the array form of the Blob part table's distribution key is not guaranteed to be the same as the distribution hash of the main table's distribution keys.  Stripe size zero is not supported for Blob v1, so Blob v1 parts are always distributed according to their distribution keys or the explicit partition id (UserDefined).

For Blob v2, the Blob table's primary key includes all of the main table's primary keys and an optional NDB$DIST column.  The distribution keys are set to be the same as the main table.  When the Blob's stripe size is zero there is no NDB$DIST column and all part rows for a blob will be in the same partition, and will be distributed aligned with the main table rows.  This would happen naturally, but is currently enforced by a setPartitionId() call.
Blob v2 defaults to stripe size 0, and cannot currently be set otherwise from MySQLD.

Problems :
 1) Setting partitionId of part table operations to be same as main table row's fragmentid
    This assumes that :
    a) Main table and blob part table *always* have the same number of partitions 
       (Not necessarily true with online add node)
    b) PartitionId can be set for natively partitioned operations
       (Not the case for HashMap partitioned tables)

 2) Having a table which should be distributed according to key hash but whose rows are actually distributed using a user defined algorithm
    This seems generally inconsistent and requires that we allow the use of a setPartitionId() call on a non UserDefined partitioned table.
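
One way assumption 1a can fail is sketched below. The helper name and the partition counts are invented for illustration; the sketch only checks whether a fragment id copied from the main table is a valid partition number in the part table.

```python
def copied_partition_id_is_usable(main_fragment_id: int,
                                  part_table_partitions: int) -> bool:
    """Return True if a fragment id copied from the main table row names
    an existing partition of the Blob part table (illustrative check)."""
    return 0 <= main_fragment_id < part_table_partitions


# Hypothetical situation: the main table has been reorganised to 8
# partitions while its Blob part table still has 4.
main_row_fragment = 6
part_table_partitions = 4
usable = copied_partition_id_is_usable(main_row_fragment, part_table_partitions)
```

Here `usable` is False: fragment 6 of the main table has no counterpart in a 4-partition part table, so copying the id only works while both tables are kept at the same partition count.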

Constraints :
 1) For Blob v2 stripe size 0, blob parts should be located on the same node as their corresponding main table rows.
   - For performance reasons and 'logical partitioning model' reasons.
 2) For striped Blobs (i.e. Blob v1, Blob v2 with stripe size != 0), blob parts should be spread around all Blob part table partitions, except for UserDefined partitioning
   - To support semantics of 'stripe' option
 3) For Blob v1 and v2, existing tables (created on a running cluster before software upgrade / cluster restart / backup restore) should be accessible after any changes. (Could potentially relax for Blob v1 / UserDefined partitioned tables.)
 4) Index and Table scans with explicit partitioning information should work as expected.
 5) Index and Table scans with no explicit partitioning information should work as expected.

Desirable improvements :
 - Enable online repartitioning of HashMap partitioned tables including Blobs by removing the tight coupling of partition ids between main tables and their Blob part tables.

How to repeat:
Look at Blob code for UserDefined partitioning, form headache.

Suggested fix:
Proposed Solution :
  1) Change Blob part tables for UserDefined main tables to use UserDefined partitioning
     - This is effectively what they use already.
     - No changes to ndb_restore, node restart, online upgrade etc.
       So code has to still work with Native Partitioned part tables.
  2) Change Blob handling code to only setPartitionId() on Blob part operations for UserDefined main tables.
     - This is required for UserDefined part tables, but not for any other tables.  This has the side-effect that stripe != 0 has no effect for UserDefined Partitioned tables.  We can optionally add a warning / note to documentation describing this limitation.
     - If the Blob table comes from an old release then the parts table may still use native partitioning, so the implementation may still set an explicit partitionId on a natively partitioned table.
  3) Optionally change Blob handling code to not request main table's NDB$FRAGMENT for non UserDefined partitioned tables.
     - No need for this information for non UserDefined partitioned tables.
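
The three proposed steps can be summarised as the decision logic sketched below. This is a model under stated assumptions rather than the patch itself: the enum and function names are hypothetical, and only the partitioning type names are taken from this report.

```python
from enum import Enum


class Partitioning(Enum):
    USER_DEFINED = "UserDefined"
    DISTR_KEY_HASH = "DistrKeyHash"
    HASH_MAP = "HashMap"


def new_part_table_partitioning(main: Partitioning) -> Partitioning:
    # Step 1: a newly created part table for a UserDefined main table is
    # itself UserDefined; other part tables follow the main table as before.
    return main


def should_set_partition_id(main: Partitioning) -> bool:
    # Step 2: only UserDefined main tables get an explicit partition id on
    # their Blob part operations. This holds even if a legacy part table
    # is still natively partitioned.
    return main is Partitioning.USER_DEFINED


def should_read_fragment(main: Partitioning, is_scan: bool,
                         has_explicit_partition: bool) -> bool:
    # Step 3 (optional): the main row's NDB$FRAGMENT is only requested
    # when it is needed, i.e. for scans on UserDefined tables that have
    # no explicit partition id.
    return (main is Partitioning.USER_DEFINED
            and is_scan
            and not has_explicit_partition)
```

For natively partitioned and HashMap partitioned main tables both predicates are False, so Blob part operations fall back to their distribution keys.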

Implications
 - Constraint 1 is met as Blob v2 stripe size 0 is partitioned by its distribution keys which are the same as the main table's
 - Constraint 2 is met except for UserDefined tables, as today.
 - Constraint 3 is met as : 
   For native tables :
   - Blob v2 stripe size 0 is correctly partitioned without explicitly setting partitionId
   - Blob v2 stripe size != 0 is correctly partitioned by distribution key as now.
   - Blob v1 stripe size 0 was not and is not supported.
   - Blob v1 stripe size != 0 is correctly partitioned by distribution key as now.
   For user defined tables :
   - Blob v2 stripe size 0 is partitioned as today (with manual partition id)
   - Blob v2 stripe size !=0 is partitioned as today (with manual partition id)
   - Blob v1 stripe size 0 is not supported
   - Blob v1 stripe size !=0 is partitioned as today (with manual partition id)
 - Constraint 4 is met as :
   - For UserDefined main tables, the explicitly set partitionId is used for the Blob part operations, which will all be in the same partition number in the blob parts table
   - For Natively partitioned main tables, the explicitly set partitionId is ignored for the Blob part table operations which use normal distribution keys to find their partition
 - Constraint 5 is met as scans without explicit partitioning information 
   On natively partitioned tables
     - Will use the normal distribution key mechanism
   On UserDefined partitioned tables
     - Will request and then use the main table's NDB$FRAGMENT value for setting the blob part table operation's partitionId. 
 - For main tables partitioned by HashMap :
   - Blob tables are partitioned by HashMap
   - Distribution keys are used as normal for accessing Blob parts
   - Blob part table or main table can be independently re-distributed as there is no hard linkage between number of partitions or partition id.
 - Manual setting of partitionId for natively partitioned tables is still required for legacy user defined partitioned Blob tables
   - Perhaps this can be dropped eventually.
[2 Mar 2009 21:19] Frazer Clement
Proposed patch against 6.3 : 

http://lists.mysql.com/commits/68005

Proposed merge to 6.4 including test of HashMapPartition distribution :

http://lists.mysql.com/commits/68017
[2 Mar 2009 21:45] Frazer Clement
Caveat to this proposed patch : 

Actual Table distribution of Blob part rows with (stripe != 0) for UserDefined tables is changed.  Previously parts were distributed using DistrKeyHash (md5 dist key) and now they are distributed as specified by the main table's partition key.  This means that :
  1) Stripe has no effect for UserDefined partitioned Blobs - they all have Stripe == 0
  2) Any existing table with UserDefined partitioning, Blobs and Stripe Size != 0 will not work with NDBAPI code using this patch.  This affects tables existing over an online upgrade, or restored from a backup.

1) Seems reasonable as the application controls the distribution of the rows to specific partitions with UserDefined partitioning.  Stripe size != 0 conflicts with this.

2) Seems reasonable as a) MySQLD currently always creates Blob v2 tables with stripe size 0, so the only tables which could be affected would be Blob v1 tables (with default stripe size 4), and b) UserDefined partitioning with Ndb requires the --new option.
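
The two caveats can be restated as simple predicates. This is a sketch with hypothetical names, intended only to make explicit which existing tables are affected.

```python
def effective_stripe_size(user_defined: bool, stripe_size: int) -> int:
    """Caveat 1: with this patch, stripe has no effect for UserDefined
    partitioned Blobs, so they behave as if stripe size were 0."""
    return 0 if user_defined else stripe_size


def existing_table_affected(user_defined: bool, has_blobs: bool,
                            stripe_size: int) -> bool:
    """Caveat 2: an existing table stops working with patched NDBAPI code
    only if it combines UserDefined partitioning, Blobs and a non-zero
    stripe size."""
    return user_defined and has_blobs and stripe_size != 0
```

For example, a UserDefined table with Blob v1 columns at the default stripe size 4 is affected, while a UserDefined table with Blob v2 columns at stripe size 0 is not.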
[9 Mar 2009 14:24] Frazer Clement
Pushed to 6.3.24, 6.4.4
[10 Aug 2009 11:25] Jon Stephens
This appears to have been taken care of as part of previous documentation work, with no new changelog entry necessary.

Closed.