MySQL Bugs: #90987: partition table index next same scan ineffective

Bug #90987	partition table index next same scan ineffective
Submitted:	23 May 2018 10:03	Modified:	1 Jun 2018 2:18
Reporter:	Wei Zhao (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Storage Engines	Severity:	S5 (Performance)
Version:	5.7.17	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
This bug happens in a partitioned table backed by innodb. As the below test shows, when both db transactions T1 and T2 are in read committed isolation level, T1 updates the row R1(b=14) first, and when T2 tries to update R2(b=12), but T2 is blocked. This should not happen because under RC isolevel, both T1 and T2 should only hold record lock on R1 and R2 respectively, and T1 should not block T2. Although the query result is correct, performance is not optimal --- more deadlocks can be caused by this issue and parallelism is degraded, so it is a performance bug.

And this issue doesn't happen if the table 't4' is not a partitioned table but an ordinary innodb table.

After some debugging and investigation I found the cause, as the patch shows.

How to repeat:
mysql> create table t4 (a int unsigned auto_increment primary key, b int, c int, key(b)) PARTITION BY LIST (murmurHashCodeAndMod(`a`,4))(PARTITION p0 VALUES IN (0), PARTITION p1 VALUES IN (1),PARTITION p2 VALUES IN (2) ,PARTITION p3 VALUES IN (3)  );
Query OK, 0 rows affected (0.21 sec)

mysql> insert into t4 (b,c) values(11,111),(12,112),(13,113),(14,114);
Query OK, 4 rows affected (0.02 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> insert into t4 (b,c) values(15,115),(16,116),(17,117),(18,118);
Query OK, 4 rows affected (0.01 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select*from t4 partition(p0);
+---+------+------+
| a | b    | c    |
+---+------+------+
| 2 |   12 |  112 |
| 4 |   14 |  114 |
| 7 |   17 |  117 |
+---+------+------+
3 rows in set (0.00 sec)

mysql> select @@tx_isolation;
+----------------+
| @@tx_isolation |
+----------------+
| READ-COMMITTED |
+----------------+
1 row in set (0.00 sec)

mysql> begin; update t4 partition(p0) set c=c+1 where b=14;
Query OK, 0 rows affected (0.00 sec)

Query OK, 1 row affected (0.02 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Then, open another session and execute these stmts:

mysql> select @@tx_isolation;
+----------------+
| @@tx_isolation |
+----------------+
| READ-COMMITTED |
+----------------+
1 row in set (0.00 sec)

mysql> begin;update t4 partition(p0) set c=c+1 where b=12;
Query OK, 0 rows affected (0.00 sec)

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
mysql> 

Suggested fix:
see the patch

this patch fixes the bug (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: partition_index_same.diff (application/octet-stream, text), 971 bytes.

the hash function and partitioning is not mysql official feature, but the issue still exists if you use official partitioning.

C:\tmp\mysql-5.7.17-winx64>bin\mysql -uroot -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.17 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE a;
Query OK, 1 row affected (0.00 sec)

mysql> USE a
Database changed
mysql> create table t4 (a int unsigned auto_increment primary key, b int, c int, key(b)) PARTITION BY LIST (murmurHashCodeAndMod(`a`,4))(PARTITION p0 VALUES IN (0), PARTITION p1 VALUES IN (1),PARTITION p2 VALUES IN (2) ,PARTITION p3 VALUES IN (3)  );
ERROR 1064 (42000): Constant, random or timezone-dependent expressions in (sub)partitioning function are not allowed near ')(PARTITION p0 VALUES IN (0), PARTITION p1 VALUES IN (1),PARTITION p2 VALUES IN ' at line 1
mysql>

Hi Mr. Zhao,

In order to process this report we have to be able to reproduce the behaviour.

Our Miguel Godofredo Solorzano has found a difficulty that I encounter as well in creating the table. Hence, could you, first of all, clear out that small matter???

One more small detail.

Since we can not run your test case, please send us innodb status while the second transaction waits for the locks. Simply, this could be some gap lock or similar.

Hi Sinisa,

This is a test case that you should be able to run on original mysql-5.7:

1. in session 1, do these ops:
mysql> create table t4 (a int unsigned auto_increment , b int, c int, key(b), primary key(a,b)) PARTITION BY LIST (b)(partition p0 values in (0,1, 2 ,3, 4), partition p1 values in(5,6,7,8,9,10));
Query OK, 0 rows affected (0.02 sec)

mysql> insert into t4(b,c) values(1,11),(2,22),(3,33),(4,44),(5,55),(6,66),(7,77),(8,88),(9,99);
Query OK, 9 rows affected (0.01 sec)
Records: 9  Duplicates: 0  Warnings: 0

mysql> begin;update t4 set c=c+1 where b=3;                                                     
Query OK, 0 rows affected (0.00 sec)

Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> 

2. in session 2, do these ops:
mysql> begin; update t4 set c=c+1 where b=2;
Query OK, 0 rows affected (0.00 sec)

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
mysql> 

session 2's stmt is blocked and later the lock wait timeouts.

Hi,

Would you be so kind to provide us with innodb status after second transaction has started and began waiting.

Also, please do not forget that deadlock is probably occurring here since number of rows per partition is way too small. That leads to scanning of all rows and all rows get locked. Hence a lock wait .......

InnoDB status would show up whether it is a consequence of this , well documented behaviour. Scanning small number of rows is MUCH faster then using index.

We are waiting on your feedback.

the 'show engine innodb status;' output after session 2 is blocked

Attachment: idb-stat.txt.gz (application/gzip, text), 2.72 KiB.

Hi Sinisa,

Thank you for your patient explanation, I've uploaded the 'show engine innodb status;' output, taken after session 2 is blocked waiting for the row lock.

There is no dead lock when session 2 blocks, as  the uploaded output also shows. And session 2's stmt returns with ER_LOCK_WAIT_TIMEOUT rather than ER_LOCK_DEADLOCK, showing the same fact.

Hi Mr. Zhao,

Actually, yes, lock wait timeout is expected in these situations and it is what you get.

As the status clearly shows, all rows in the partition are scanned, during which scanning all rows are locked, which is why one transaction has to wait for another. Number of row is too small for the indexed search.

This is all explained in our Reference Manual, sections on InnoDB and Optimiser.

Thank you for your report.

But in the test case example, the transaction(T2) in the 2nd session doesn't have to be blocked by the 1st session's transaction (T1), because in RC isolation level, according to how mysql and innodb works, T1 should lock (4,14,114) and T2 should lock only (2,12,112).

The reason T2 is blocked is because when T2 executes the update stmt, handler::index_next() is called instead of handler::index_next_same(), so innodb tries to return the row (4, 14,114), thus T2 tries to acquire row lock on (4, 14,114), and got blocked by T1. See my patch and do some debugging, and you will figure out why this happens.

Hi Zhao,

I have done some debugging and it seems that you are right !!!

Verified as a performance problem.

Hi Sinisa,

Good to know the bug is verified.but it's a bug of the mysql server code, not a bug of innodb --- mysql server makes use of innodb in a wrong way. 

The correct way to scan the partitioned table when executing the test case I provided, is to do an 'index next same' scan, because the scan is an equality scan on a non-unique index. 

The cause of the bug is that in the two functions 
int Partition_helper::handle_unordered_next(uchar *buf, bool is_next_same)
and
int Partition_helper::handle_ordered_next(uchar *buf, bool is_next_same)

the 'is_next_same' argument is actually ignored or neglected, scans of 'next same' type is actually performed as ordinary range scan(read_range_next_in_part() is called rather than index_next_same_in_part()), causing transaction T2 to try to read&lock the row R1(R1.b=3), and got blocked by T1 because T1 already locked row R1. The correct way for T2 to execute its update stmt is to stop the index scan after innodb returning the row R2(R2.b=2) because innodb should be doing an index next same scan and should find that there is no more index entries of same value(b=2), and thus T2's update stmt should return rather than be blocked by T1.

And my patch fixes the issue.

My explanation in previous message refers to rows not consistent with the test case I provided at the top of the bug report, which might have made some confusion. The explanation in this message is consistent with the test case, hopefully it makes things clear.

I am putting this bug in the category that belongs to the executioner, more precisely to the SE API.