| Bug #9622 | Optimizer makes join index mistake | ||
|---|---|---|---|
| Submitted: | 4 Apr 2005 21:31 | Modified: | 24 Apr 2006 8:01 |
| Reporter: | Matthew Lord | ||
| Status: | In progress | ||
| Category: | Server: Optimizer | Severity: | S2 (Serious) |
| Version: | 5.0,4,1,5.0 | OS: | Any (*) |
| Assigned to: | Sergey Petrunia | Target Version: | 6.0 |
| Tags: | bfsm_2007_11_15 | ||
| Triage: | D2 (Serious) | ||
[4 Apr 2005 21:31]
Matthew Lord
[24 Apr 2005 23:02]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/24265
[13 May 2005 22:22]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/24882
[16 May 2005 21:17]
Sergey Petrunia
Note for the docs team: The bug was that ALTER TABLE ... ENABLE INDEXES statistics collection code treated all NULL values as equal (that is NULL == NULL), while ANALYZE TABLE statistics collection code treated NULL values as inequal (NULL != NULL). We've fixed it - now all statistics collection code treats NULLs as inequal.
[16 May 2005 21:39]
Paul DuBois
Noted in 4.1.13 changelog.
[16 May 2005 21:39]
Sergey Petrunia
A slightly corrected note for the docs team: The bug was that for MyISAM tables, ALTER TABLE ... ENABLE INDEXES and bulk insert statistics collection code treated all NULL values as equal (NULL == NULL), while ANALYZE TABLE statistics collection code treated NULL values as inequal (NULL != NULL). We've fixed it - now all statistics collection code treats NULLs as inequal.
[19 Oct 2005 0:07]
Matthew Lord
If you use this mysqld option: myisam_stats_method=nulls_equal The test fails again as it did prior to this patch.
[20 Oct 2005 17:47]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/31277
[20 Oct 2005 17:49]
Sergey Petrunia
The patch introduces a new statistics collection method "nulls_ignored". With this method, all queries run fast: TESTCASE nulls_equal nulls_unequal nulls_ignored BUG9622 >60 sec 0.08 sec 0.08 sec BUG12232-case1 0.00 sec 1.42 sec 0.01 sec BUG12232-case2 0.03 sec 0.98 sec 0.01 sec BUG12232-case3 1.70 sec 47.32 sec 1.67 sec
[21 Oct 2005 5:03]
Sergey Petrunia
A patch with initial review feedback: http://lists.mysql.com/internals/31284
[23 Oct 2005 0:45]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/31355
[23 Oct 2005 8:24]
Sergey Petrunia
Fix pushed into 4.1.16 tree. Documentation will be soon provided
[27 Oct 2005 22:54]
Sergey Petrunia
PROBLEM WITH NULLS_IGNORED METHOD
=================================
Consider an index defined as KEY(kp1, kp2), where all values of kp1 are not
nulls, but a good number of values of kp2 are NULLs. Graphically this can be
depicted as follows:
kp1 kp2
+-----+-------+
| | |
| A | C |
| | |
+-----+-------+
| | |
| B | NULL |
| | |
+-----+-------+
Areas A, B, C contain non-NULL values here.
In current nulls_ignored method (both spec and implementation):
rec_per_key[0] is an estimate on how many records one will get if he chooses
some value of kp1 from region A or region B.
rec_per_key[1] is an estimate on how many records one will get if he chooses
some value of kp1 from A region, and value of kp2 from region C.
So,
for rec_per_key[0] we assume that kp1 is in A
for rec_per_key[1] - that kp1 is in (A + B). (***)
This can cause undesirable effects. Suppose the table is populated as
follows:
area A: AVG(value_group_size) = $big_value
area B: AVG(value_group_size) = $small_value
area C: Let each tuple in combined {A|C} area have {kp1 == kp2}, so
AVG(value_group_size) = $big_value
Then:
rec_per_key[1] = AVG(value_group_size in AC area ) = $big_value
rec_per_key[0] = (somewhere between $small_value and $big_value);
i.e.
rec_per_key[0] < rec_per_key[1]
i.e.
E(#records(ref(kp1,kp12))) > E(#records(ref(kp1)))
which is obviously not true. I state that the problem lies in (***).
Proposed fix
============
Make different assumptions about the domain that the "probe tuple" belongs
to.
Let's start from 2-part key. Here an index values diagram (Vxy are
!NULLS):
kp1 kp2
+------+-------+
| | |
| NULL | V2B |
| | |
+------+-------+
| | |
| V1A | V2A |
| | |
+------+-------+
| | |
| V1B | NULL |
| | |
+------+-------+
| | |
| NULL | NULL |
| | |
+------+-------+
We'll assume that the "probe tuple" is {x,y} where
x - random value from combined area V1A+V1B
y - random value from combined area V2A+V2B
Note that we include values like {x from V1B, y from V2B}.
Then new rec_per_key values (denote them as rpc2) are:
rpc2[0] = AVG(#records equal to a probe value from V1A+V2A) =
= rec_per_key[0], no changes here.
rpc2[1] = P(x in V1A AND y in V2A) *
AVG(#records equal to a probe tuple in {a in V1A, b in V2A})
+
P(x not in V1A AND y not in V2A) * AVG(...) // AVG=0 here.
The first AVG(...) is the same as rec_per_key[0] in nulls_ignored method,
i.e.
rpc2[1] = P(x in V1A AND y in V2A) * #rec_per_key[1] =
= {assume independence} = P(x in V1A)*P(y in V2A) * #rec_per_key[1].
P(x in V1A) = (#tuples that have kp1 NOT NULL and kp2 NOT NULL) /
(#tuples that have kp1 NOT NULL and kp2 IS NULL)
P(y in V2A) is calculated analogously.
[2 Oct 2007 23:41]
Sergey Petrunia
See also BUG#30423
