Bug #117783 optimizer selects the wrong index due to the "low_limit" mechanism
Submitted: 25 Mar 8:45 Modified: 26 Mar 7:40
Reporter: ksql- team Email Updates:
Status: Verified Impact on me: None
Category: MySQL Server: Optimizer Severity: S2 (Serious)
Version: 8.0.32, 8.0.41 OS: Any
Assigned to: CPU Architecture: Any
Tags: INDEX, low_limit, Optimizer bug, sort

[25 Mar 8:45] ksql- team
Description:
When a query contains an ORDER BY clause with a small LIMIT value, the MySQL optimizer triggers the "low_limit" optimization mode and rechecks index selection. If the currently chosen index does not support the required sort order, or is an index merge, the optimizer switches to an index that matches the sorting keys. However, this recheck fails to adequately consider the filtering efficiency of the WHERE conditions. As a result, the index it ultimately selects supports the sort order but often forces a scan of a large number of unnecessary rows. This is clearly a defect in the optimizer that needs to be addressed.

table design:
CREATE TABLE `t1` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `a` tinyint(4) DEFAULT NULL,
  `b` date DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_b` (`b`)
) ENGINE=InnoDB AUTO_INCREMENT=500001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

mysql> explain select * from `t1` where b = '2025-03-03' and id > 100 and a=1 order by id limit 1;
+----+-------------+-------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | range | PRIMARY,idx_b | PRIMARY | 8       | NULL | 249789 |     1.00 | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> select * from `t1` where b = '2025-03-03' and id > 100 and a=1 order by id limit 1;
+--------+------+------------+
| id     | a    | b          |
+--------+------+------------+
| 479751 |    1 | 2025-03-03 |
+--------+------+------------+
1 row in set (20.78 sec)

mysql> select * from `t1` force index(idx_b) where b = '2025-03-03' and id > 100 and a=1 order by id limit 1;
+--------+------+------------+
| id     | a    | b          |
+--------+------+------------+
| 479751 |    1 | 2025-03-03 |
+--------+------+------------+
1 row in set (0.00 sec)
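The gap between the two plans can be illustrated with a back-of-the-envelope calculation. This is not MySQL internals, just arithmetic; the row counts are assumptions taken from the data generator in this report (500,000 rows spread evenly over the ~198 distinct dates between 2024-08-25 and 2025-03-10, with the matching date's rows sitting near the end of the id range, around id 479,751 as shown above):

```python
# Rough illustration of why the PRIMARY-key plan is so expensive here.
# Assumed data shape (from the generator script in this report):
total_rows = 500_000
distinct_dates = 198
rows_per_date = total_rows // distinct_dates   # ~2,525 rows share each date

# Plan chosen by the optimizer: walk PRIMARY in id order (id > 100),
# filtering on b as it goes. LIMIT 1 stops at the first match, but the
# first matching row sits near id 479,751, so almost the whole table is
# read before the limit is satisfied.
first_match_id = 479_751
rows_read_primary = first_match_id - 100       # rows scanned before the hit

# Plan with FORCE INDEX(idx_b): read only the rows where b matches the
# constant, then pick the smallest id among them.
rows_read_idx_b = rows_per_date

print(rows_read_primary, rows_read_idx_b)      # 479651 vs 2525
```

Under these assumptions the chosen plan reads roughly 190x more rows than the forced one, which matches the 20.78 s vs 0.00 s timings above.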

Obviously, the optimizer chose the wrong index (the primary key), resulting in a scan of a large number of unnecessary rows. Below is a partial summary of the optimizer trace. It shows that the "low_limit" recheck was triggered, after which the optimizer re-selected an index that supports the sort order. In this example, the only index that supports the sort is the primary key.
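For reference, a trace like the one below can be captured in a client session using the standard optimizer-trace workflow (these are stock MySQL statements; the query is the one from this report):

```sql
SET optimizer_trace = 'enabled=on';
SELECT * FROM `t1` WHERE b = '2025-03-03' AND id > 100 AND a = 1 ORDER BY id LIMIT 1;
SELECT TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';
```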

        {
            "attaching_conditions_to_tables": {
              "original_condition": "((`t1`.`a` = 1) and (`t1`.`b` = '2025-03-03') and (`t1`.`id` > 100))",
              "attached_conditions_computation": [
                {
                  "table": "`t1`",
                  "rechecking_index_usage": {
                    "recheck_reason": "low_limit",
                    "limit": 1,
                    "row_estimate": 126.2,
                    "range_analysis": {
                      "table_scan": {
                        "rows": 499579,
                        "cost": 599497
                      },
                      "potential_range_indexes": [
                        {
                          "index": "PRIMARY",
                          "usable": true,
                          "key_parts": [
                            "id"
                          ]
                        },
                        {
                          "index": "idx_b",
                          "usable": false,
                          "cause": "not_applicable"
                        }
                      ],
                      "setup_range_conditions": [
                      ],
                      "group_index_range": {
                        "chosen": false,
                        "cause": "not_group_by_or_distinct"
                      },
                      "analyzing_range_alternatives": {
                        "range_scan_alternatives": [
                          {
                            "index": "PRIMARY",
                            "ranges": [
                              "100 < id"
                            ],
                            "index_dives_for_eq_ranges": true,
                            "rowid_ordered": true,
                            "using_mrr": false,
                            "index_only": false,
                            "rows": 249789,
                            "cost": 50203,
                            "chosen": true
                          }
                        ],
                        "analyzing_roworder_intersect": {
                          "usable": false,
                          "cause": "too_few_roworder_scans"
                        }
                      },
                      "chosen_range_access_summary": {
                        "range_access_plan": {
                          "type": "range_scan",
                          "index": "PRIMARY",
                          "rows": 249789,
                          "ranges": [
                            "100 < id"
                          ]
                        },
                        "rows_for_plan": 249789,
                        "cost_for_plan": 50203,
                        "chosen": true
                      }
                    }
                  }
                }
              ],
              "attached_conditions_summary": [
                {
                  "table": "`t1`",
                  "attached": "((`t1`.`a` = 1) and (`t1`.`b` = '2025-03-03') and (`t1`.`id` > 100))"
                }
              ]
            }
          },

How to repeat:
We provide a Python script that creates the table and imports the data. You only need to pass the connection parameters on the command line, for example:

python3 /path/to/bug_report_data.py --host 127.0.0.1 --port 3306 --user root --password 123456 --db test

import mysql.connector
from datetime import datetime, timedelta
import random
import argparse

def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(description='Generate and insert test data into MySQL database')
    parser.add_argument('--port', type=int, required=True, help='MySQL port')
    parser.add_argument('--host', type=str, required=True, help='MySQL host')
    parser.add_argument('--db', type=str, required=True, help='MySQL database name')
    parser.add_argument('--user', type=str, required=True, help='MySQL username')
    parser.add_argument('--password', type=str, required=True, help='MySQL password')
    parser.add_argument('--records', type=int, default=500000, help='Total records to generate (default: 500000)')
    args = parser.parse_args()

    # Database connection configuration
    config = {
        'user': args.user,
        'password': args.password,
        'port': args.port,
        'host': args.host,
        'database': args.db,
        'raise_on_warnings': True
    }

    # Data generation configuration
    total_records = args.records
    batch_size = 1000

    try:
        # Connect to database
        print(f"Connecting to MySQL database at {args.host}:{args.port}...")
        conn = mysql.connector.connect(**config)
        cursor = conn.cursor()

        # Create table if not exists
        print("Creating table if it doesn't exist...")
        create_table_sql = """
        CREATE TABLE IF NOT EXISTS t1 (
            id BIGINT AUTO_INCREMENT PRIMARY KEY,
            a TINYINT,
            b DATE,
            INDEX idx_b (b)
        )
        """
        cursor.execute(create_table_sql)
        conn.commit()

        # Generate date range
        print("Generating dates...")
        start_date = datetime(2024, 8, 25)
        end_date = datetime(2025, 3, 10)
        delta = end_date - start_date
        all_dates = [start_date + timedelta(days=i) for i in range(delta.days + 1)]
        date_strs = [d.strftime("%Y-%m-%d") for d in all_dates]

        # Calculate records per day to distribute evenly
        days_count = len(all_dates)
        records_per_day = total_records // days_count
        remaining_records = total_records % days_count

        # Prepare insert statement
        insert_sql = """
        INSERT INTO t1 (
            a, b
        ) VALUES (%s, %s)
        """

        # Generate and insert data in batches
        print("Starting data insertion...")
        for i in range(0, total_records, batch_size):
            # Generate batch data with sequential dates
            data = []
            for j in range(min(batch_size, total_records - i)):
                idx = i + j
                # Calculate which day this record belongs to
                day_idx = idx // records_per_day
                if day_idx >= days_count:  # Handle remaining records
                    day_idx = days_count - 1
                
                # Date field (sequential distribution) - now named b
                b = date_strs[day_idx]
                
                # Valid field - now named a
                a = random.choices([0, 1], weights=[0.1, 0.9])[0]
                
                data.append((a, b))
            
            # Insert batch
            cursor.executemany(insert_sql, data)
            conn.commit()
            print(f"Inserted {i + len(data)}/{total_records} records ({(i + len(data))/total_records*100:.1f}%)...")

    except mysql.connector.Error as err:
        print("Error:", err)
    finally:
        if 'conn' in locals() and conn.is_connected():
            cursor.close()
            conn.close()
            print("Database connection closed.")

    print("Data insertion completed.")

if __name__ == "__main__":
    main()

Suggested fix:
When the "low_limit" recheck is triggered, the optimizer reconsiders index selection by simply choosing the lowest-cost index among those that support the sort order, without considering filtering costs. This approach is unreasonable. MySQL 5.7.33 and above does provide the optimizer switch "prefer_ordering_index", and disabling it resolves the index selection in this case, but doing so may cause incorrect index choices for other queries. Therefore, we believe the best solution is for the optimizer to weigh both sorting capability and filtering cost when the "low_limit" recheck is triggered.
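For completeness, the two workarounds discussed in this report look like this (prefer_ordering_index exists in 5.7.33+/8.0.21+; as noted above, disabling it session- or server-wide may degrade other queries, so test before applying broadly):

```sql
-- Workaround 1: disable the ordering-index preference for this session
SET optimizer_switch = 'prefer_ordering_index=off';
SELECT * FROM `t1` WHERE b = '2025-03-03' AND id > 100 AND a = 1 ORDER BY id LIMIT 1;

-- Workaround 2: pin the filtering index for this statement only
SELECT * FROM `t1` FORCE INDEX (idx_b)
WHERE b = '2025-03-03' AND id > 100 AND a = 1 ORDER BY id LIMIT 1;
```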
[26 Mar 7:40] MySQL Verification Team
Hello!

Thank you for the report and detailed steps to reproduce.
Verified as described.

Sincerely,
Umesh