Description:
The current algorithm used for partitioning data into the adaptive hash index (which has innodb_adaptive_hash_index_parts parts, by default 8) produces an almost worst case uneven distribution when the server contains many tables with an identical schema. The algorithm is as follows:
/** Get the hash-table based on index attributes.
A table is selected from an array of tables using pair of index-id, space-id.
@param[in] index index handler
@return hash table */
UNIV_INLINE
hash_table_t*
btr_get_search_table(const dict_index_t* index)
{
ut_ad(index != NULL);
ulint ifold = ut_fold_ulint_pair(static_cast<ulint>(index->id),
static_cast<ulint>(index->space));
return(btr_search_sys->hash_tables[ifold % btr_ahi_parts]);
}
The ut_fold_ulint_pair function is as follows:
#define UT_HASH_RANDOM_MASK 1463735687
#define UT_HASH_RANDOM_MASK2 1653893711
/*************************************************************//**
Folds a pair of ib_ulint_ts.
@return folded value */
static
ib_ulint_t
ut_fold_ib_ulint_t_pair(
/*====================*/
ib_ulint_t n1, /*!< in: ib_ulint_t */
ib_ulint_t n2) /*!< in: ib_ulint_t */
{
return(((((n1 ^ n2 ^ UT_HASH_RANDOM_MASK2) << 8) + n1)
^ UT_HASH_RANDOM_MASK) + n2);
}
So in summary the AHI is partitioned by ut_fold_ulint_pair(index_id, space) % 8. This is, in practice, essentially a proxy for the distance between the index_id and space. Since both index_id and space are essentially auto-incremented values, and with innodb_file_per_table, a new space is allocated for each table, the two maintain a fixed distance when many tables with the same schema are created on a server, causing the AHI to be partitioned and used unevenly (and in the worst case for a server with thousands of identical tables, such as in a partitioned system, these could all fall into the same single AHI partition), such as:
Hash table size 5843083, node heap has 0 buffer(s)
Hash table size 5843083, node heap has 0 buffer(s)
Hash table size 5843083, node heap has 0 buffer(s)
Hash table size 5843083, node heap has 0 buffer(s)
Hash table size 5843083, node heap has 20889 buffer(s)
Hash table size 5843083, node heap has 1 buffer(s)
Hash table size 5843083, node heap has 0 buffer(s)
Hash table size 5843083, node heap has 1 buffer(s)
This is due to a bad interaction with both the ut_fold_ulint_pair (which is a poor hash anyway) and the default number of partitions being 8. The result of ut_fold_ulint_pair(n1, n2) and ut_fold_ulint_pair(n1+1, n2+1) are often the exact same value, and in all cases have the same result modulus 8.
I believe this bug affects all versions of MySQL, but I have not exhaustively tested.
How to repeat:
This problem can be trivially reproduced:
1. Create several tables with only a primary key.
2. Insert data.
3. Create secondary keys (if desired).
4. Run a workload against the tables.
All primary keys (and thus all tables) will fall into the same single AHI partition, rather than being evenly distributed across partitions.
A small Ruby program which demonstrates the problem follows:
#!/usr/bin/env ruby
UT_HASH_RANDOM_MASK = 1463735687
UT_HASH_RANDOM_MASK2 = 1653893711
def ut_fold_ulint_pair(n1, n2)
(((((n1 ^ n2 ^ UT_HASH_RANDOM_MASK2) << 8) + n1) ^ UT_HASH_RANDOM_MASK) + n2)
end
innodb_adaptive_hash_index_parts = 8
index_id = (rand * 1000000000).floor
space_id = (rand * 1000000000).floor
puts "innodb_adaptive_hash_index_parts=%i" % [ innodb_adaptive_hash_index_parts ]
(0...10).each do |i|
index_id += 1
space_id += 1
fold = ut_fold_ulint_pair(index_id, space_id)
puts "index_id=%i, space_id=%i, fold=%i, ahi_slot=%i" % [
index_id, space_id, fold,
fold % innodb_adaptive_hash_index_parts,
]
end
And its output is, e.g.:
innodb_adaptive_hash_index_parts=8
index_id=936096520, space_id=759598558, fold=516258342765, ahi_slot=5
index_id=936096521, space_id=759598559, fold=516258342765, ahi_slot=5
index_id=936096522, space_id=759598560, fold=516258339693, ahi_slot=5
index_id=936096523, space_id=759598561, fold=516258339693, ahi_slot=5
index_id=936096524, space_id=759598562, fold=516258336621, ahi_slot=5
index_id=936096525, space_id=759598563, fold=516258336621, ahi_slot=5
index_id=936096526, space_id=759598564, fold=516258339693, ahi_slot=5
index_id=936096527, space_id=759598565, fold=516258339693, ahi_slot=5
index_id=936096528, space_id=759598566, fold=516258350973, ahi_slot=5
index_id=936096529, space_id=759598567, fold=516258350973, ahi_slot=5
Suggested fix:
Use a better hashing algorithm to partition the adaptive hash index. As a workaround, setting innodb_adaptive_hash_index_parts to a different value (not a power of 2, ideally prime?) also partitions more usefully, although it still doesn't seem great. I have tried innodb_adaptive_hash_index_parts=11 for instance and it partitions a bit better, but not great, as can be demonstrated by the Ruby program:
innodb_adaptive_hash_index_parts=11
index_id=49870460, space_id=375996162, fold=508786651645, ahi_slot=8
index_id=49870461, space_id=375996163, fold=508786651645, ahi_slot=8
index_id=49870462, space_id=375996164, fold=508786654717, ahi_slot=0
index_id=49870463, space_id=375996165, fold=508786654717, ahi_slot=0
index_id=49870464, space_id=375996166, fold=508786616589, ahi_slot=9
index_id=49870465, space_id=375996167, fold=508786616589, ahi_slot=9
index_id=49870466, space_id=375996168, fold=508786617613, ahi_slot=10
index_id=49870467, space_id=375996169, fold=508786617613, ahi_slot=10
index_id=49870468, space_id=375996170, fold=508786614541, ahi_slot=7
index_id=49870469, space_id=375996171, fold=508786614541, ahi_slot=7