Bug #104108 cumulative frequency may grow over 1.0 due to float errors
Submitted: 24 Jun 2021 17:06 Modified: 11 Oct 2021 23:52
Reporter: Kaiwang CHen (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:8.0.25 OS:Any
Assigned to: CPU Architecture:Any

[24 Jun 2021 17:06] Kaiwang CHen
Description:
When histogram is constructed, cumulative frequency is calculated by summing frequencies of previous buckets and the current bucket. This may lead to accumulated float errors and the last cumulative frequency may be over 1.0 (cf > 1.0). The out-of-range cumulative frequency may be returned and cause abnormal behaviors.

For example, Singleton performs calculation:

  double cumulative_frequency = 0.0;

  try {
    for (const auto &node : value_map) {
      const double frequency = node.second / static_cast<double>(total_count);
      cumulative_frequency += frequency;

      m_buckets.emplace(node.first, cumulative_frequency);
    }
  } catch (const std::bad_alloc &) {
    // Out of memory.
    return true;
  }

How to repeat:
In histograms.test, the following ANALYZE statement generates the last bucket with cumulative frequency being 1.0000000000000002.

ANALYZE TABLE t2 UPDATE HISTOGRAM ON col_datetime WITH 100 BUCKETS;

{"buckets": [["1900-01-01 00:00:00.000000", 0.3], ["2001-01-20 12:47:23.000000", 0.35], ["2001-07-25 08:40:24.000000", 0.39999999999999997], ["2001-10-22 11:13:24.000000", 0.44999999999999996], ["2002-08-25 20:35:06.000000", 0.49999999999999994], ["2002-12-08 11:34:58.000000", 0.5499999999999999], ["2003-03-12 02:00:34.000000", 0.6], ["2003-12-04 11:14:26.000000", 0.65], ["2005-08-15 00:00:00.000000", 0.7000000000000001], ["2006-09-11 18:25:21.000000", 0.7500000000000001], ["2007-04-10 12:16:04.000000", 0.8000000000000002], ["2008-05-16 08:09:06.000000", 0.8500000000000002], ["2008-07-02 00:00:00.000000", 0.9000000000000002], ["2009-06-07 13:48:58.000000", 0.9500000000000003], ["2009-11-07 00:00:00.000000", 1.0000000000000002]], "data-type": "datetime", "null-values": 0.0, "collation-id": 8, "last-updated": "2021-06-24 16:14:56.077219", "sampling-rate": 1.0, "histogram-type": "singleton", "number-of-buckets-specified": 100}

Suggested fix:
Add std::min(cf, 1.0) to ensure it is in range [0, 1.0].

In addition, the precision may be limited to discard those meaningless tail figures.
[25 Jun 2021 6:46] MySQL Verification Team
Hello Kaiwang,

Thank you for the report and feedback.

regards,
Umesh
[25 Jun 2021 6:47] MySQL Verification Team
- 8.0.25 - either use ./mtr main.histograms or run below which is extracted from the same source i.e histograms.test

CREATE TABLE t2 (
  pk int(11) NOT NULL AUTO_INCREMENT,
  col_int int(11) NOT NULL,
  col_int_key int(11) GENERATED ALWAYS AS ((col_int + col_int)) VIRTUAL,
  col_date date NOT NULL,
  col_date_key date GENERATED ALWAYS AS ((col_date + interval 30 day))
VIRTUAL,
  col_datetime datetime NOT NULL,
  col_time time NOT NULL,
  col_datetime_key datetime GENERATED ALWAYS AS
(addtime(col_datetime,col_time)) VIRTUAL,
  col_time_key time GENERATED ALWAYS AS (addtime(col_datetime,col_time))
VIRTUAL,
  col_varchar varchar(1) NOT NULL,
  col_varchar_key varchar(2) GENERATED ALWAYS AS
(concat(col_varchar,col_varchar)) VIRTUAL,
  PRIMARY KEY (pk),
  KEY col_date_key (col_date_key)
) ENGINE=InnoDB AUTO_INCREMENT=21 DEFAULT CHARSET=utf8mb4;
INSERT INTO t2 (pk, col_int, col_date, col_datetime, col_time, col_varchar)
VALUES (1,2,'2002-10-13','1900-01-01
00:00:00','00:00:00','s'),(2,4,'1900-01-01','2005-08-15
00:00:00','15:57:25','r'),(3,8,'0000-00-00','1900-01-01
00:00:00','07:05:51','m'),(4,4,'2006-03-09','2008-05-16
08:09:06','19:22:21','b'),(5,4,'2001-06-05','2001-01-20
12:47:23','03:53:16','x'),(6,7,'2006-05-28','2008-07-02
00:00:00','09:16:38','g'),(7,4,'2001-04-19','1900-01-01
00:00:00','15:37:26','p'),(8,1,'1900-01-01','2002-12-08
11:34:58','00:00:00','q'),(9,9,'2004-08-20','1900-01-01
00:00:00','05:03:03','w'),(10,4,'2004-10-10','1900-01-01
00:00:00','02:59:24','d'),(11,8,'2000-04-02','2002-08-25
20:35:06','00:01:58','e'),(12,4,'2006-11-02','2001-10-22
11:13:24','00:00:00','b'),(13,8,'2009-01-28','2003-03-12
02:00:34','02:20:16','y'),(14,0,'2005-04-19','2007-04-10
12:16:04','04:59:50','p'),(15,0,'2006-08-12','2009-11-07
00:00:00','21:14:04','f'),(16,0,'2005-03-12','2003-12-04
11:14:26','00:00:00','p'),(17,7,'1900-01-01','2006-09-11
18:25:21','12:59:27','d'),(18,7,'1900-01-01','1900-01-01
00:00:00','16:39:36','f'),(19,5,'0000-00-00','2001-07-25
08:40:24','00:00:00','j'),(20,3,'2007-09-09','2009-06-07
13:48:58','00:00:00','e');

analyze table t2;
ANALYZE TABLE t2 UPDATE HISTOGRAM ON col_datetime WITH 100 BUCKETS;
select * from information_schema.column_statistics;

-

# SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, HISTOGRAM
'u', 't', 'c1', '{\"buckets\": [[1.0, 3.0, 0.5, 3], [4.0, 5.0, 1.0, 2]], \"data-type\": \"double\", \"null-values\": 0.0, \"collation-id\": 8, \"last-updated\": \"2021-06-25 06:13:27.195500\", \"sampling-rate\": 1.0, \"histogram-type\": \"equi-height\", \"number-of-buckets-specified\": 2}'
'u', 't2', 'col_datetime', '{\"buckets\": [[\"1900-01-01 00:00:00.000000\", 0.3], [\"2001-01-20 12:47:23.000000\", 0.35], [\"2001-07-25 08:40:24.000000\", 0.39999999999999997], [\"2001-10-22 11:13:24.000000\", 0.44999999999999996], [\"2002-08-25 20:35:06.000000\", 0.49999999999999994], [\"2002-12-08 11:34:58.000000\", 0.5499999999999999], [\"2003-03-12 02:00:34.000000\", 0.6], [\"2003-12-04 11:14:26.000000\", 0.65], [\"2005-08-15 00:00:00.000000\", 0.7000000000000001], [\"2006-09-11 18:25:21.000000\", 0.7500000000000001], [\"2007-04-10 12:16:04.000000\", 0.8000000000000002], [\"2008-05-16 08:09:06.000000\", 0.8500000000000002], [\"2008-07-02 00:00:00.000000\", 0.9000000000000002], [\"2009-06-07 13:48:58.000000\", 0.9500000000000003], [\"2009-11-07 00:00:00.000000\", 1.0000000000000002]], \"data-type\": \"datetime\", \"null-values\": 0.0, \"collation-id\": 8, \"last-updated\": \"2021-06-25 06:45:02.674980\", \"sampling-rate\": 1.0, \"histogram-type\": \"singleton\", \"number-of-buckets-specified\": 100}'
[17 Sep 2021 6:10] casa zhang
Hi, Looks like only Singleton has such problem because it accumulates floats, Equi-height hasn’t because it accumulates integers.

I have a precise fix for this case. See enclosed.
[17 Sep 2021 6:11] casa zhang
CUMULATIVE FREQUENCY MAY GROW OVER 1.0 DUE TO FLOAT ERRORS

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-Bug-104108-CUMULATIVE-FREQUENCY-MAY-GROW-OVER-1.0-DU.patch (application/octet-stream, text), 9.18 KiB.

[11 Oct 2021 23:52] Jon Stephens
Documented fix as follows in the MySQL 8.0.28 changelog:

        When a singleton histogram is constructed, its cumulative
        frequency is calculated by adding frequencies of previous
        buckets with the current bucket; because a floating-point value
        was used for the accumulator, this sometimes led to accumulated
        float errors, with the final cumulative frequency fractionally
        greater than 1.0.

        This fix accumulates the frequency with an integer type instead,
        to avoid intermediate floating-point errors.

        Our thanks to Casa Zhang and the Tencent team for the contribution.

Closed.