Bug #92759 Add TABLESAMPLE support
Submitted: 12 Oct 2018 6:14 Modified: 19 Nov 2018 14:22
Reporter: Daniël van Eeden (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: compatibility, tablesample

[12 Oct 2018 6:14] Daniël van Eeden
Description:
For various reasons it can be needed to sample the data in a table.

If a table has say 100_000 rows I want to get 100 random rows.

Getting the first or last 100 rows is easy with ORDER BY and LIMIT or via HANDLER tatements. But getting 100 random rows isn't.

If the table as a auto_increment without gaps and without deleted rows you can do this by generating 100 numbers in the 0 to AUTO_INCREMENT range. But if there is no auto_increment or if the PK isn't numeric or if the PK is combined this doesn't work.

And SELECT * FROM t1 ORDER BY RAND() LIMIT 100 works... but fails in practice for big tables as it is very inefficent.

How to repeat:
Try to get 100 random rows from a big table.

Suggested fix:
PostgreSQL has this:
SELECT * FROM foo TABLESAMPLE SYSTEM (10);

See also:
https://www.postgresql.org/docs/current/static/sql-select.html

Another option would be to extend the HANDLER command to allow this:
HANDLER t1 OPEN;
HANDLER t1 READ `PRIMARY` RANDOM;
HANDLER t1 CLOSE;

Basically what I would like the database to do is:
Pick a nr of random pages and return a random row from each.

This probably has overlap with index stat collection.
[12 Oct 2018 6:15] Daniël van Eeden
Corrected the severity
[12 Oct 2018 8:00] Daniël van Eeden
Microsoft SQL Server, Hive and IBM DB2 also seem to implement this.
[12 Oct 2018 11:08] MySQL Verification Team
Hello Daniël,

Thank you for the report and feature request!

regards,
Umesh
[13 Oct 2018 18:50] Ruud H.G. van Tol
If you can accept the extra index-disk-space-usage, 
and there is <some_column> with plenty of variance, 
then this is a workaround:

ALTER TABLE t1
  ADD column randy smallint unsigned as (CONV(LEFT(MD5(some_column),4),16,10))
, ADD INDEX (randy)
;

The next steps would obviously be:

SELECT * FROM t1 
WHERE randy >= FLOOR(65536 * RAND())
LIMIT 100
;

That can easily return less than 100 rows, 
so in that case you could retry with:

SELECT * FROM t1 
WHERE randy < FLOOR(65536 * RAND())
ORDER BY randy DESC
LIMIT 100
;

or alternatively:

SELECT * FROM t1 
WHERE randy >= 0
LIMIT 42
;

(where 42 is the number of missing rows)
[19 Nov 2018 14:22] Daniël van Eeden
Using mysqldump with a tablesample could be useful to setup a development/acceptance copy with a percentage of the data.
This will probably not really work with foreign keys.
[11 Mar 2024 9:16] bro Simon
May I ask how mysqldump uses tablesample? I couldn't find any relevant options in the man pages of mysqldump