Worklog Bug #117826 Add histogram selectivity for key ranges

1 Description

The goal of this WL is to provide key range estimation by statistics. A key range, represented by KEY_MULTI_RANGE, consists of one or two endpoint values in KeyTupleFormat and a range flag describing the properties of the range. It is also known as a scan filter because it filters rows during the scan, whereas a post-scan filter works on the rows produced by the scan.

Before this WL:

1. A key range was primarily estimated by index sampling, that is, handler::records_in_range().
2. A key range could also be estimated by index statistics, or as a magic number, in limited cases (WL#5957, WL#6526).
3. A post-scan filter could be estimated by a histogram, that is, Item::get_filtering_effect(), if one was available.

This WL extends the scope of estimation by statistics to key ranges. In other words, any local filter can now be estimated by statistics. Although a join filter could also be estimated by a histogram, that is not discussed in this WL. Statistics need to be kept up to date; how to achieve that is also not discussed in this WL.

Here is an example of estimation by histogram rather than index statistics:

  CREATE TABLE tbl_int (col1 INT, KEY(col1));
  INSERT INTO tbl_int VALUES (1), (2), (3), (4), (4), (4), (4), (8), (NULL), (NULL);
  ANALYZE TABLE tbl_int;
  ANALYZE TABLE tbl_int UPDATE HISTOGRAM ON col1 WITH 10 BUCKETS;
  -- Enable estimation by index statistics (WL#5957).
  SET eq_range_index_dive_limit = 2;
  -- Use estimation by histogram instead of index statistics if a histogram is available.
  SET range_estimation = ESTIMATE_USE_STATISTICS_EQ_LIMIT;
  -- Estimation: 6 rows (histogram, handles skew) vs. 3 rows (index statistics, assumes uniformity)
  EXPLAIN SELECT * FROM tbl_int WHERE col1 IN (1, 3, 4);

The new sysvar range_estimation can be used to specify when the new estimator kicks in.
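As a rough sketch of how the histogram side of the example above arrives at 6 rows, assume a singleton-bucket histogram that records a row count per distinct value (a simplification for illustration; the names below are hypothetical, not server code). Because value 4 has its own bucket, the skew toward 4 is captured directly, whereas uniformity-based index statistics spread the rows evenly over the distinct values.

```cpp
#include <map>
#include <vector>

// Hypothetical sketch: sum per-value bucket row counts for an IN list.
// NULLs are tracked separately by MySQL histograms and are not involved here.
long histogram_rows_for_in_list(const std::map<long, long> &bucket_rows,
                                const std::vector<long> &in_values) {
  long rows = 0;
  for (long v : in_values) {
    auto it = bucket_rows.find(v);
    if (it != bucket_rows.end()) rows += it->second;  // rows in that bucket
  }
  return rows;
}

// The tbl_int example: value -> row count from the inserted data.
long tbl_int_example() {
  const std::map<long, long> buckets = {{1, 1}, {2, 1}, {3, 1}, {4, 4}, {8, 1}};
  return histogram_rows_for_in_list(buckets, {1, 3, 4});  // 1 + 1 + 4
}
```

The per-value lookup returns 6 rows for IN (1, 3, 4), matching the EXPLAIN output above.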
Replacing the eq limit is safe because histograms are good at describing skew, and the estimator can also exploit the correlation information embedded in index statistics. Replacing index sampling is not always safe, so another sysvar, range_estimation_adjust_threshold, is introduced as a multiplier against table rows to select big ranges.

The design reflects the fact that any specific estimator has assumptions, pros and cons:

1. Index sampling should give the best estimation if all its assumptions are met: the sampling overhead is negligible, the sample is unbiased, and there is a real index to sample. However, the sampling overhead can be significant in some cases, which is why WL#5957 and WL#6526 were introduced. The sample can be biased when the range is very big or changing (Bug #117037), implying a big error in those cases. Although MySQL does not currently support hypothetical indexes, they might be added in the future.

2. Histograms are the traditional way to estimate any filter with negligible overhead. The assumptions are: an unbiased sample from which the histogram is built, uniform distribution within a bucket, an up-to-date histogram, and, for a one-dimensional histogram, correlation that is insignificant or can be compensated by some correlation factor. The assumption of an unbiased sample can be met to some extent because gathering a histogram is asynchronous and thus has a larger budget than synchronous index sampling. However, uniform distribution within a bucket and lack of correlation are inherent to histograms, which implies big estimation errors for small ranges and correlated columns.

3. Index statistics, or the density vector in the database literature, can estimate equality filters with negligible overhead, and correlation is embedded in the statistics. The assumptions are: an unbiased sample from which the index statistics are built, uniform distribution over the whole value domain, and up-to-date statistics.
Similarly, the assumption of an unbiased sample can be met to some extent. However, the assumption of uniform distribution implies a big error for skewed values, including the notable NULL value.

4. Magic numbers are static and thus imply unbounded errors. They are used only when other estimators are unavailable or when the user explicitly requests it (WL#6526). A special case is a key range without null values backed by a full unique key, which estimates to at most one row.

Although other specific statistics and sampling methods could be added in the future, a general conclusion can be drawn: sampling has more overhead and does not scale to big ranges, while estimating by statistics has negligible overhead but does not work well for small ranges and, sometimes, correlated columns. These factors, as well as the freshness of statistics, aside, the estimators should be interchangeable.

Note that previous efforts focused on sampling overhead rather than estimation error:

- WL#5957: Range optimizer: use statistics instead of index dives if there are many ranges. This enabled estimating equality ranges without null parts by index statistics instead of index dives when their count exceeds eq_range_index_dive_limit.
- WL#6526: FORCE INDEX to avoid index dives when possible. This enabled certain simple queries with FORCE INDEX to avoid index dives because cost calculation is useless for them. As a result, a scan filter estimates to 1 row unless it is supported by index statistics.

2 Requirements

2.1 Functional requirements

f1) Two new session/global scoped sysvars, range_estimation and range_estimation_adjust_threshold, are added to specify when the range estimator kicks in.

f2) When range_estimation=COMPATIBLE, range estimation behaves as without this feature.
f3) When range_estimation=ESTIMATE_USE_STATISTICS_EQ_LIMIT and the number of equality ranges without null parts exceeds the sysvar eq_range_index_dive_limit, the range estimator tries to use the histogram estimator rather than index statistics. This is the default setting.

f4) When range_estimation=ESTIMATE_USE_STATISTICS, the range estimator tries to use histograms for any kind of range, namely equality ranges without null parts, equality ranges with null parts, and strict non-equality ranges.

f5) When range_estimation=ESTIMATE_USE_STATISTICS_ONLY, the range estimator always takes over, even when there are no supporting statistics. This resembles FORCE INDEX skipping index dives for all queries. (XXX-1: warnings?)

f6) When range_estimation is not COMPATIBLE, range_estimation_adjust_threshold is greater than zero, records_in_range() returns more than "table_rows * range_estimation_adjust_threshold", and statistics are available, the range estimator also takes over as an adjustment. This is for potentially big ranges, that is, equality ranges with null parts and strict non-equality ranges.

f7) When range_estimation is COMPATIBLE, range_estimation_adjust_threshold has no effect.

f8) For a multi-column key, the range selectivity is calculated from the selectivities of the single parts.

f9) The histogram of the first key part must be present to use the histogram estimator. Histograms of other key parts are not required but, if present, provide better estimation.

f10) The histogram estimator is said to be available when the histogram of the first key part is available. An available histogram estimator never fails to give a result.

f11) Geometry types have no histograms, so there is no available histogram estimator for geometry ranges.

f12) A prefix index can be built on blob types and long string types; however, there is currently no way to build a histogram on a column prefix, so there is no available histogram estimator for this case.
Such histograms might be added by other WLs in the future.

f13) There are two ways to produce the final selectivity. One is the typical formula (formula-1)

  P(ab) = f * P(a) + (1 - f) * P(a) * P(b)

where P stands for probability and f stands for the correlation factor. The other (formula-2) exploits the correlation information in the index statistics:

  P(ab) = P(a) * records_per_key[i+1] / records_per_key[i]

Both formulas can be applied recursively.

f14) If f = 0 and "records_per_key[i+1] / records_per_key[i] > P(b)", correlation should play a significant role, so formula-2 should be chosen. (XXX-2: strategy?)

2.2 Non-functional requirements

nf1) Although the range estimator takes over real unique ranges, which are equality ranges without null parts that use the full key of a unique index, it should behave the same, that is, estimate 1 row.

nf2) WL#5957 is a special case of this WL, except that it has an additional sysvar, eq_range_index_dive_limit. Bug #117776 is considered a defect and should be fixed separately.

nf3) WL#6526 is a special case of this WL. Bug #117750 and Bug #117791 are considered defects and should be fixed separately.

nf4) For queries satisfying WL#6526 where the histogram estimator is available, the range estimator gives a meaningful row estimate rather than 1 row. This resembles the case of estimating by index statistics when WL#6526 was introduced.

nf5) When this feature is OFF, the overhead should be almost the same as when the feature is compiled away.

3 High Level Architecture

Here are some design principles.

3.1 Unified range estimator

Although the previous efforts, WL#5957 and WL#6526, focused on sampling overhead, they provide special kinds of estimators that do not involve index sampling. Let's name them EQ_RANGE_ESTIMATOR and ONE_ROW_ESTIMATOR, respectively. A third estimator, HISTOGRAM_ESTIMATOR, is introduced by this WL to estimate by histogram and correlation statistics.
They are all classified as statistics-based range estimators and share the same interface. The same interface might also be extended to perform table sampling in the future. However, a table sample is expensive to obtain, so it should only be used in restricted cases.

3.2 Extensibility to new estimators

It should be relatively easy to add new estimators to the unified range estimator interface. One extensible approach is to introduce a registry of concrete estimators and try each until one succeeds; however, this requires a class hierarchy, virtual function calls and a dedicated loop, and, most importantly, a single kind of statistics might be used by multiple estimators. As a result, the unified range estimator essentially provides a conceptual interface to the statistics of the current table, plus estimator functions identified by enum values. If the estimators are ever exposed to the component architecture, a refactor can be introduced; it should be local to the unified range estimator.

3.3 Minimal changes

Estimating with endpoint values in key range buffers and estimating with comparisons against literals differ only in the form of the values, so they should share the same selectivity algorithms.

3.4 Clean interface

The code related to estimation should be isolated in dedicated files. At any usage site, the invocation of the range estimator should consist of three steps:

(1) Decide the estimation method, given the user settings and a key range. This must ensure that an estimator is available when it is selected.
(2) Check whether an estimator is selected.
(3) If one is selected, use it to estimate the range; otherwise, use index sampling.

And an optional fourth step for index sampling:

(4) If index sampling suggests a big range, try an adjustment estimator.
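The invocation steps above can be sketched as follows. This is a minimal sketch, not the real interfaces: Range, Estimator, and handler_records_in_range are hypothetical stand-ins, and the decision rule and return values are placeholders.

```cpp
// Hypothetical stand-ins for the real optimizer types.
struct Range { bool big; };

struct Estimator {
  bool selected = false;
  // Stand-in decision rule: pretend statistics cover only small ranges.
  void decide(const Range &r) { selected = !r.big; }       // step (1)
  bool use_statistics_only() const { return selected; }    // step (2)
  long estimate(const Range &) const { return 6; }         // stand-in estimate
};

// Stand-in for index sampling via the handler.
long handler_records_in_range(const Range &) { return 1000; }

long records_in_range(Estimator &est, const Range &range) {
  est.decide(range);                          // (1) decide estimation method
  if (est.use_statistics_only())              // (2) check selection
    return est.estimate(range);               // (3) estimate by statistics
  long rows = handler_records_in_range(range);  // (3) else index sampling
  // (4) not shown: if rows is big, an adjustment estimator may take over.
  return rows;
}

long estimate_small_range() { Estimator e; Range r{false}; return records_in_range(e, r); }
long estimate_big_range()   { Estimator e; Range r{true};  return records_in_range(e, r); }
```

The point of the shape is that every usage site goes through the same decide/check/estimate sequence, with index sampling as the fallback path.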
3.5 Safe improvement, backward-compatible operation mode

The WL#5957 estimator can be expressed as follows: the base factor is provided by the statistic of the first part, a scale factor is provided by the statistics of each subsequent part divided by those of the previous part, and the final result is the product of the base factor and the scale factors. Similarly, the histogram estimator can be expressed as: the base factor is provided by the histogram of the first part, a scale factor is provided by the statistics of each subsequent part divided by those of the previous part, and the final result is the product of the base factor and the scale factors.

Since a practical histogram expresses a skewed distribution through its buckets, which should have a smaller error than the global uniformity assumption of index statistics, the histogram estimator should have a better base factor and the same scale factors, thus producing a better estimate. The freshness of histograms and index statistics should be the same since the introduction of WL#15786 (Automatically updated histograms), or assuming explicit maintenance by the user.

As a result, ESTIMATE_USE_STATISTICS_EQ_LIMIT is chosen as the default operation mode. If there is any problem, the user can still fall back by setting it to COMPATIBLE.

3.6 Progressively reduced error

As more statistics are added, the range estimator should gradually produce better estimates. Additional statistics include histograms of key parts other than the first one. However, a histogram could support the first part of one key and a non-first part of another key. Adding a histogram without knowledge of correlation is not always better; this problem can only be solved by explicit correlation statistics. For now, assume that adding a new histogram provides better estimation for all involved queries. (XXX-2)

3.7 Minimal overhead

Although a few hooked invocations are needed to (re)initialize the estimator properly and decide the estimation method, they should incur little overhead.
Any heavy operation should be lazy and only carried out when really needed. The only optional operation is preparing the histogram lookup table, an optimization that avoids looking up histograms for each range when the range list is really big. It adds as many lookups as there are key parts, which should be negligible.

3.8 Be prompt

Statistics need to be gathered before they are ready for estimation. Forcing index sampling to be skipped leads to estimation by magic numbers, which is usually a bad sign unless the filter never contributes to a plan change. Prompt messages might be added for these cases (XXX-1); they could be introduced in a successive WL.

4 Low Level Design

4.1 How to read a value in KeyTupleFormat?

A key tuple is an endpoint of a key range, consisting of multiple key parts. KeyTupleFormat is a sequence of memory segments, one per key part, each stored_length bytes long. Each memory segment consists of a null byte, if the field is nullable, followed by the area for the value. The value can be read with a few fix-ups in certain cases: it is the same as in TableRecordFormat, except for varstring, blob and bit types. For these special types additional fix-ups are needed, and a utility class Use_field is introduced for this purpose.

- A varstring has 2 length bytes in KeyTupleFormat but 1 or 2 length bytes in TableRecordFormat.
- A blob has 2 length bytes with an inlined prefix in KeyTupleFormat but 1-4 length bytes with an overflow pointer in TableRecordFormat.
- A bit field has its uneven bits inlined in KeyTupleFormat but stored separately in the null-bits storage in TableRecordFormat.

4.2 How to estimate a single field range?

A field range, conceptually a SEL_ARG, is a construct of one or two endpoints. An endpoint can be present, inclusive, and of the same value as the other endpoint. A field range can have additional properties such as equality and containing null values.
Estimating a field range is achieved by estimating against the endpoints separately and combining the selectivities with regular math. One thing to note is that histograms in MySQL separate the null value from regular values, so the complete domain of regular values is actually Histogram::get_non_null_values_fraction() rather than 1. The combining algorithm and the forms of a single field range are described in the following comment.

  /*
    Note that a histogram separates the null ratio from the value buckets,
    while a field range might include NULL, a value range, or both. So the
    selectivity formula of a field range is:

      Prange = Pnull + Px + Py - Pnon_null

    where,
      1) Px is for 'fld > x' or 'fld >= x', and x can be any value or -Inf.
      2) Py is for 'fld < y' or 'fld <= y', and y can be any value or +Inf.
      3) Pnon_null is for (3.1) 'fld > -Inf', (3.2) 'fld < +Inf' or both.
      4) Pnull is the null fraction if 'fld IS NULL' is present, or 0 otherwise.
      5) Obviously, Pnull + Pnon_null = 1 when 'fld IS NULL' is present.
    Note that a field range, represented by endpoint values in keypart_data
    format as well as flags, can take any of these forms:

      id  | field range  | expression                     | flag
      ----|--------------|--------------------------------|--------
      F1  | [X, Y]       | fld >= X AND fld <= Y          | -
      F2  | (X, Y]       | fld > X AND fld <= Y           | NEAR_MIN
      F3  | [X, Y)       | fld >= X AND fld < Y           | NEAR_MAX
      F4  | (X, Y)       | fld > X AND fld < Y            | NEAR_MIN NEAR_MAX
      F5  | [X, +Inf)    | fld >= X                       | NO_MAX_RANGE
      F6  | (X, +Inf)    | fld > X                        | NO_MAX_RANGE NEAR_MIN
      F7  | [NULL, +Inf) | fld IS NULL OR fld IS NOT NULL | NO_MAX_RANGE
      F8  | (NULL, +Inf) | fld IS NOT NULL  (**)          | NO_MAX_RANGE NEAR_MIN
      F9  | (-Inf, Y) (*)| fld < Y                        | NO_MIN_RANGE NEAR_MAX
      F10 | (-Inf, Y]    | fld <= Y                       | NO_MIN_RANGE
      F11 | (NULL, Y) (*)| fld < Y                        | NEAR_MIN NEAR_MAX
      F12 | (NULL, Y]    | fld <= Y                       | NEAR_MIN
      F13 | [NULL, Y)    | fld IS NULL OR fld < Y         | NEAR_MAX
      F14 | [NULL, Y]    | fld IS NULL OR fld <= Y        | -
      F15 | [NULL, NULL] | fld IS NULL                    | -

    (*)  Although the forms for 'fld NOT NULL' and 'fld NULL' are different,
         they are the same in essence.
    (**) 'fld IS NOT NULL' with 'fld NOT NULL' evaluates to true, so there is
         no such field range.

    Here are some observations:
      (A) An inclusive NULL at the lower end implies null selectivity.
      (B) A NULL at the lower end implies 'fld > -Inf'.
      (C) A NULL at the upper end is meaningless with respect to the value
          range (F15).

    If we define Py = 0 for the third observation, then 'Px + Py - Pnon_null'
    evaluates to 0, and the selectivity formula applies to all forms.
  */

4.3 How to deduce field range properties from range properties?
Given a key range, the field range properties, that is, presence and inclusiveness, are deduced from the keypart map and the range flag. Usually a key range consists of equality field ranges followed by at most one strict non-equality field range at the end. Note that the range property bits (enum key_range_flags) are designed to be zero for present and inclusive, so the range properties essentially reflect the properties of the last field. However, there is also a "next keypart" optimization: for example, "a >= 2 AND b >= 3" stores two key parts instead of one, because the second key part reduces the span of the key range to some extent. Then the inclusiveness of 3 cannot be deduced reliably; the deduction is best-effort.

4.4 How to estimate a key range?

A key range consists of a sequence of field ranges. The joint selectivity is combined from the individual field range selectivities, and the algorithm is described in the following comment.

  /**
    Get the selectivity of a key range, based on pure statistics.

    Taking correlation into account gives a better joint selectivity, which
    is bigger than the one obtained assuming independence. Correlation can
    be explicitly provided or implicitly represented in records_per_key.

    Typically the joint selectivity is obtained by formula-1:
      P(ab) = f * P(a) + (1 - f) * P(a) * P(b)

    With records_per_key, given that
      records_per_key[i]   = rows * P(a)
      records_per_key[i+1] = rows * P(ab)
    we have formula-2:
      P(ab) = P(a) * records_per_key[i+1] / records_per_key[i]
            = P(a) * rpk_selectivity

    Note that although a histogram with an explicit correlation factor (f) is
    usually the better choice, f is not always available, while
    rpk_selectivity is almost ready.

    Besides, we have an observation:
      (*) If f = 0 and rpk_selectivity > P(b), correlation should play a
          significant role, consequently formula-2 should be chosen.
    @param range            The range
    @param[out] selectivity The calculated selectivity

    @retval false Success
    @retval true  Error
  */

4.5 The unified range estimator

The unified range estimator is associated with each TABLE instance.

  struct TABLE {
    // The estimator is associated with the table in TABLE::init().
    Range_estimator range_estimator;
  };

The module source files are range_estimator.h and range_estimator.cc. The base code is introduced in range_estimator_4_of_8_extract.patch, and then extended in patches 7 and 8.

  /// Range estimator types.
  enum enum_range_estimator {
    /**
      An estimator that always estimates to 1 row. It can be used when there
      is such knowledge, or when cost calculation is useless.
    */
    ONE_ROW_ESTIMATOR,
    /// An estimator for equality non-null ranges.
    EQ_RANGE_ESTIMATOR,
    /// An estimator for any range.
    HISTOGRAM_ESTIMATOR,
    NUM_ESTIMATORS
  };

  struct Range_settings {
    /// True: always use an estimator. ONE_ROW_ESTIMATOR is the catch-all.
    bool skip;
    /// True: use index statistics for equality ranges.
    bool eq_range;
    /// True: use index statistics for non-equality ranges.
    bool strict_range;
    /// True: use index statistics for range estimation.
    bool use_index_statistics;
    /// True: use histograms for range estimation.
    bool use_histogram;
  };

  // Decompose user settings into the uniform settings struct.
  void init_range_settings(Range_settings &settings,
                           const RANGE_OPT_PARAM &param, bool force_skip);

  class Range_estimator {
   public:
    void init(TABLE *_table);

    /// Set the index used to evaluate key ranges.
    void use_index(uint keyno);
    void reset_index();

    /**
      Try to select a statistics-only estimator for the range.

      @param settings Settings that control the selection of estimators.
      @param range    The range.
    */
    void decide_estimation_method(const Range_settings &settings,
                                  KEY_MULTI_RANGE &range);

    /**
      Check whether any statistics-only estimator is selected.

      @retval true  An estimator is selected.
      @retval false No available estimator.
    */
    bool use_statistics_only() const {
      return selected_estimator != NUM_ESTIMATORS && !adjust_only;
    }

    bool use_statistics_adjust() const {
      return selected_estimator != NUM_ESTIMATORS && adjust_only;
    }

    /**
      Estimate the records in a single key range.

      Note that the range is const in essence. It is declared non-const only
      because of the lower-level handler interface.

      @param range The range.
    */
    ha_rows records_in_range(KEY_MULTI_RANGE &range);

   private:
    bool check_index_statistics(KEY_MULTI_RANGE &range);
    ha_rows estimate_by_index_statistics(KEY_MULTI_RANGE &range);
    bool check_histogram_statistics(KEY_MULTI_RANGE &range);
    bool get_range_selectivity(const KEY_MULTI_RANGE &range,
                               double &selectivity);

    // The TABLE as statistics provider.
    const TABLE *table;
    // Backing index of the current range.
    uint keyno;
    // Key parts used for the current range.
    int keyparts_used;
    // A lookup table for the histograms of each key part.
    Histogram_array histogram_array;
    // Selected estimator.
    enum enum_range_estimator selected_estimator;
    // The selected estimator is only used as a way to adjust.
    bool adjust_only;
  };

4.6 Notes on code changes

4.6.1 The patch list

- range_estimator_1_of_8_bug_117750.patch
- range_estimator_2_of_8_bug_117776.patch
- range_estimator_3_of_8_bug_117791.patch
- range_estimator_4_of_8_extract.patch
  Extracts the range estimator into dedicated files.
- range_estimator_5_of_8_templatize.patch
  Makes the fundamental histogram estimation a template function that is not limited to Item literals.
- range_estimator_6_of_8_histogram_field.patch
  Adds histogram estimation for field ranges.
- range_estimator_7_of_8_histogram_range.patch
  Adds histogram estimation for key ranges, and the sysvar range_estimation.
- range_estimator_8_of_8_histogram_adjust.patch
  Adds histogram adjustment for big key ranges, and the sysvar range_estimation_adjust_threshold.
- range_estimator_postfix_1_keyno_flag.patch
  Some bug fixes for the feature.
- range_estimator_postfix_2_comment.patch
  Refines the comment on the field range selectivity formula.

4.6.2 Changes to previous efforts

Bug #117750, Bug #117776 and Bug #117791 are considered bugs and are fixed in preliminary patches (patches 1-3 of 8). They are necessary because fixing these problems yields cleaner code, which makes extracting the range estimators much easier.

4.6.3 Changes to usage sites

The range estimator is extracted into dedicated files, and the usage sites in handler::multi_range_read_info_const() and ROR_intersect_plan::get_scan_selectivity() are refactored to enforce the same invocation steps. (patch 4)

4.6.4 Changes to the range iterator

The range estimator interacts with the range iterator to maintain estimator state and the range description flag. When key range iteration is initialized in sel_arg_range_seq_init(), the range estimator is updated by Range_estimator::use_index() to reflect the backing index. (patch 4)

When a key range is materialized, in addition to the changes in the preliminary patches, the DESC flag is added (patch 7) to mark key ranges over descending indexes (WL#1074: Add Descending indexes support), because the endpoint values are stored in reversed order: the max end stores the lower bound value and the min end stores the upper bound value. The DESC flag is recognized in Histogram::get_raw_range_selectivity().

4.6.5 Changes to histograms

Histogram estimation is templatized to ensure the same algorithm for values from either an Item literal or a key buffer. (patch 5) Then Histogram::get_raw_range_selectivity() is added to estimate what is essentially a SEL_ARG range.
(patch 6)

4.6.6 Changes to SEL_ARG flag operations

SEL_ARG::invert_min_max_flag() is added to recover the flags for ranges over descending indexes. (patch 6)

4.6.7 Changes to field evaluation

Values in key range endpoints are stored in KeyTupleFormat and are accessed through the same table field, which is temporarily set to use the key buffer. A utility class Use_field is introduced for this purpose. (patch 6)

4.6.8 Changes related to sysvars

enum_range_estimation and THD { range_estimation } are added to map the user settings. They are passed on to RANGE_OPT_PARAM { range_estimation }. (patch 7) THD { range_estimation_adjust_threshold } is added to recognize big ranges. (patch 8)
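To close, the two joint-selectivity formulas from f13/f14 and section 4.4 can be written down directly. This is a minimal sketch under the document's own definitions; the function names are hypothetical, and the selection rule is exactly the f14 observation (with f = 0, prefer formula-2 when the records_per_key ratio exceeds P(b)).

```cpp
#include <cmath>

// formula-1: P(ab) = f * P(a) + (1 - f) * P(a) * P(b),
// where f is the correlation factor (f = 0 means independence).
double formula1(double pa, double pb, double f) {
  return f * pa + (1.0 - f) * pa * pb;
}

// formula-2: P(ab) = P(a) * records_per_key[i+1] / records_per_key[i],
// i.e. the correlation implicitly stored in the index statistics.
double formula2(double pa, double rpk_next, double rpk_cur) {
  return pa * (rpk_next / rpk_cur);
}

// f14: if f = 0 and the rpk ratio exceeds P(b), correlation is significant,
// so formula-2 is chosen; otherwise fall back to formula-1.
double joint_selectivity(double pa, double pb, double f,
                         double rpk_next, double rpk_cur) {
  const double rpk_ratio = rpk_next / rpk_cur;
  if (f == 0.0 && rpk_ratio > pb) return formula2(pa, rpk_next, rpk_cur);
  return formula1(pa, pb, f);
}
```

For example, with P(a) = 0.1, P(b) = 0.5, f = 0 and an rpk ratio of 0.8, the ratio exceeds P(b), so formula-2 gives 0.08 instead of the independence estimate 0.05; both formulas can be applied recursively across the remaining key parts.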