Bug #100485 There is no way to provide ECP row estimates for scan cost modeling
Submitted: 10 Aug 2020 14:37 Modified: 11 Aug 2020 21:20
Reporter: Justin Swanhart Email Updates:
Status: Verified Impact on me:
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:8.0.21 OS:Any
Assigned to: CPU Architecture:Any

[10 Aug 2020 14:37] Justin Swanhart
Consider I have two tables:
t1(id pk, c1 key, c2, c3) - 1B rows
t2(id pk, c4, c5, c6) - 1M rows

when executing query:
select * from t1 join t2 on t1.c1 = t2.id where t1.c2 = 5 and t1.c2 = 6

let's say ECP can be used to filter t1 to 10 rows

MySQL will still join in order t2->t1 because it gets the table scan cost estimates from handler::info before ECP filters any rows. If NL join is used, then t1 will be probed 1M times.  If hash join is used, t2 will be considered the smaller side and may have to spill to disk when only 10 rows are visible in t1.

How to repeat:
Use an engine with ECP that reduces rows in such a way as to affect plan order.

Suggested fix:
Not sure what specific approach that MySQL would take.  Possibly adding a new handler function like ::ecp_estimate which returns a row estimate for a table after ECP is evaluated.  Of course this makes things more complicate, since joins can be handled in ::engine_push and the join order could change after ECP estimates are taken into account.  So maybe you need ::engine_push_before_join() which adds filters for non-join columns, which can set row estimates, and then after join order is decided call ::engine_push_join which can evaluate join pushdown?
[10 Aug 2020 15:18] Justin Swanhart
note that t1 and t2 might not have any keys, and the evaluation of the size of the relations is completely dependent on ECP...
[10 Aug 2020 15:31] Justin Swanhart
maybe call cond_push() without any join expressions, then ha_info could set stat->records appropriately given the cond_push() pushdown info, and then after the query plan is created call engine_push which would not have any of the expressions pushed down in cond_push() which would just be join conditions?
[10 Aug 2020 15:32] Justin Swanhart
(usually) just be join conditions...
[10 Aug 2020 15:45] Justin Swanhart
But that causes some weirdness too, since constant and equality propagation are determined by join order.  Let me know your thoughts, as this is something I would maybe have to handle by parsing the query in a preparse rewrite plugin, using EXPLAIN ANALYZE to gather cost information that I pass in with explain_extra, then passing that info into the storage engine with a variable, etc, and finally executing the original query.  That is a real mess..
[10 Aug 2020 20:24] Justin Swanhart
Oops..  Example query should be:
where t1.c2 = 5 AND t1.c3 = 6
but it could be something complex like:
(a>1 and b=2) or (c=3 and d< 12) or e=13 or f is null

My storage engine can evaluate those using bitmap indexes and filter the table efficiently.  Regardless, even in NDB which would have to scan the table, it could provide an estimate for the scan by using histograms for example.  

Maybe ANALYZE TABLE should collect histogram information for all columns so that hash join can choose a proper order using statistics and so that ICP can properly estimate the cost of pushed down conditions (I filed another bug about ICP too).
[11 Aug 2020 12:18] MySQL Verification Team

I presume that your SE can't do histograms.

Also, do you handle ECP well in your SE ????
[11 Aug 2020 12:52] MySQL Verification Team
Hi Justin,

After careful deliberation, it was concluded that this is a very good feature request.

Verified as a feature request.
[11 Aug 2020 21:20] Justin Swanhart
changed synopsis