Bug #2049 PROCEDURE ANALYSE() trigger-happy for ENUM
Submitted: 8 Dec 2003 14:17 Modified: 5 Feb 2020 16:04
Reporter: Arjen Lentz Email Updates:
Status: Won't fix Impact on me:
None 
Category:MySQL Server: DML Severity:S4 (Feature request)
Version:any OS:Any (any)
Assigned to: CPU Architecture:Any

[8 Dec 2003 14:17] Arjen Lentz
Description:
The docs say:
PROCEDURE ANALYSE([max elements,[max memory]])
- max elements (default 256) is the maximum number of distinct values analyse will notice per column. This is used by analyse to check if the optimal column type should be of type ENUM.

PROCEDURE ANALYSE() is trigger happy for ENUMs. It gets very silly.
a) The default limit should be way lower than 256 elements.
b) ENUM('1','2','3','4','5') for an integer column is very silly. We teach that people shouldn't use ENUM for numerical values, only for strings. As things get too confusing.
c) The max_elements parameter is apparently only applied to integer columns, not string columns. Given b) it should be exactly the other way around: string columns are the only relevant column that should be handled this way.

How to repeat:
b)
CREATE TABLE seq (i INT);
INSERT INTO seq VALUES (1),(2),(3),(4),(5);
SELECT * FROM seq PROCEDURE ANALYSE();

c)
CREATE TABLE strings (s CHAR(20));
INSERT INTO strings VALUES ('foo'),('bar'),('bla');
SELECT * FROM strings PROCEDURE ANALYSE(1);

Suggested fix:
1) Numerical columns should never produce an ENUM suggestion.
2) The max_elements parameter should be applied for string columns.
3) Make max_elements default to 0. Then ENUMs are not suggested unless the user explicitly allows this. Space is often less an issue than relational practicalities. ENUM columns are more difficult to handle in queries, causing the resulting table design to be *slower*. Not what we want.
[8 Dec 2003 17:42] Arjen Lentz
Specifying the second parameter with a small value does have the desired effect:
  SELECT * FROM strings PROCEDURE ANALYSE(0,0);

The optional second param is documented as:
 - max memory (default 8192) is the maximum memory analyse should allocate
   per column while trying to find all distinct values.

The original fix suggestion remains.
Perhaps defaults of 16 and 256 are good, like PROCEDURE ANALYSE(16,256)
That will at least make the default results more sensible (together with the change of not suggesting ENUM for numerical columns).
[8 Dec 2003 22:47] Sergei Golubchik
this is documentation issue. Procedure analyse() is an example procedure that shows how to change the table structure to make the table *more compact* - it need not and does not care about "relational practicalities" or alter table difficulties.

Above said, it's behaviour is perfectly logical -
1. Numeric columns are suggested to be converted to ENUM, as long as this ENUM takes less space - that is has less than 256 elements.

2. max_elements does not apply to strings, ans for strings ENUM is almost always better

3. max_elements is default to 256 - because with this value suggested table are more compact than with max_elements=0.

Thank you.
[9 Dec 2003 0:11] Arjen Lentz
Even a sample function can be useful. In its current form, ANALYSE() is clearly not. The "make more compact" objective has no practical benefit.

Let's call this issue a "feature request" instead. I am suggesting to change the default parameter values (and perhaps the string behaviour, slightly - if it's not too much work) so that the function becomes genuinely useful to a lot of people, and doesn't just remain a demo gimmick.
[10 Dec 2005 17:24] Valeriy Kravchuk
Thank you for a feature request.
[4 Jul 2013 8:39] Kamil Dziedzic
While I can understand that this "feature request" isn't extremely important then I can't understand why documentation wasn't updated to not confuse people.

There are still some misleading or missing informations:
1. max_elements doesn't apply to all column types, e.g. it doesn't apply to varchar columns
2. there is no information how to force this procedure to not suggest ENUMs for varchars (passing (0, 0) as parameters)

Of course there is still problem that you can't allow using ENUMs for integers and forbid this for varchars but still it would be just nice to make clarification about all those problems in documentation so people won't fall in same problems again, again and again. It's now 10 years from original bug report and I was probably another person which fall into this problem and who needed to figure out all those problems by myself because the documentation is lacking information. So it will be really nice If could be last, such person :)

And BTW, about not suggesting enums for varchars - I think too that it shouldn't suggest enums for varchars (or at least there should be simple option to turn off that). Yes, I agree that enums are always better than constant strings but if you design application and database model then at this early point you know you are using constants, and if so you will use integers and map them on application side or you will use enums. If you define field as a varchar in database then you probably meant that such field doesn't have constant values - user can input whatever he wants. Thats why I think you shouldn't suggest enums instead of varchar because as a developer you already decided that this field might have any value (varchar) or it has list of constant values (ENUM). You can also notice that statistically if you follow such ENUM suggestions for varchar you will brake your application because users won't be able to add new values to columns with defined constant lists of values.

Regards, Kamil Dziedzic
[5 Feb 2020 16:04] Erlend Dahl
PROCEDURE ANALYSE was removed in 8.0.1 (WL#4745).