Bug #44060 First option of PROCEDURE ANALYSE() does not work, second needs some work
Submitted: 3 Apr 2009 3:07 Modified: 6 Apr 2009 8:47
Reporter: Roel Van de Paar Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: DML Severity:S3 (Non-critical)
Version:4.1, 5.0, 5.1, 6.0 bzr OS:Any
Assigned to: CPU Architecture:Any
Tags: PROCEDURE ANALYSE

[3 Apr 2009 3:07] Roel Van de Paar
Description:
The manual says this:

SELECT ... FROM ... WHERE ... PROCEDURE ANALYSE([max_elements,[max_memory]])
http://dev.mysql.com/doc/refman/5.0/en/procedure-analyse.html

While currently the PROCEDURE ANALYSE function is working like this:
SELECT ... FROM ... WHERE ... PROCEDURE ANALYSE([max_memory,[max_elements]])

How to repeat:
mysql> CREATE TABLE test_pa (testfield char(200));
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO test_pa VALUES ("1"),("2"),("3"),("4"),("5"),("6");
Query OK, 6 rows affected (0.08 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3,256)\G
*************************** 1. row ***************************
             Field_name: roelt.test_pa.testfield
              Min_value: 1
              Max_value: 6
             Min_length: 1
             Max_length: 1
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 1.0000
                    Std: NULL
      Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL <<< INCORRECT
1 row in set (0.00 sec)

mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(256,3)\G
*************************** 1. row ***************************
             Field_name: roelt.test_pa.testfield
              Min_value: 1
              Max_value: 6
             Min_length: 1
             Max_length: 1
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 1.0000
                    Std: NULL
      Optimal_fieldtype: CHAR(1) NOT NULL <<< INCORRECT
1 row in set (0.00 sec)

mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3)\G
*************************** 1. row ***************************
             Field_name: roelt.test_pa.testfield
              Min_value: 1
              Max_value: 6
             Min_length: 1
             Max_length: 1
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 1.0000
                    Std: NULL
      Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL <<< INCORRECT
1 row in set (0.00 sec)

Suggested fix:
The documentation cannot really be updated since the second option (max_memory) for the (correct) command should be optional, so the function should be corrected.
[3 Apr 2009 6:23] Sveta Smirnova
Thank you for the report.

Results are repeatable. But how did you do such conclusion?

SELECT * FROM test_pa PROCEDURE ANALYSE(3,256) and SELECT * FROM test_pa PROCEDURE ANALYSE(3) give same results, while SELECT * FROM test_pa PROCEDURE ANALYSE(256,3) gives other.

Additionally proc_analyse_init sets pc->max_tree_elements and pc->max_treemem in order described in the manual:

  86   else if (param->next)
  87   {
  88     // first parameter
  89     if (!(*param->item)->fixed && (*param->item)->fix_fields(thd, param->item))
  90     {
  91       DBUG_PRINT("info", ("fix_fields() for the first parameter failed"));
  92       goto err;
  93     }
  94     if ((*param->item)->type() != Item::INT_ITEM ||
  95     (*param->item)->val_real() < 0)
  96     {
  97       my_error(ER_WRONG_PARAMETERS_TO_PROCEDURE, MYF(0), proc_name);
  98       goto err;
  99     }
 100     pc->max_tree_elements = (uint) (*param->item)->val_int();
 101     param = param->next;
 102     if (param->next)  // no third parameter possible
 103     {
 104       my_error(ER_WRONG_PARAMCOUNT_TO_PROCEDURE, MYF(0), proc_name);
 105       goto err;
 106     }
 107     // second parameter
 108     if (!(*param->item)->fixed && (*param->item)->fix_fields(thd, param->item))
 109     {
 110       DBUG_PRINT("info", ("fix_fields() for the second parameter failed"));
 111       goto err;
 112     }
 113     if ((*param->item)->type() != Item::INT_ITEM ||
 114     (*param->item)->val_real() < 0)
 115     {
 116       my_error(ER_WRONG_PARAMETERS_TO_PROCEDURE, MYF(0), proc_name);
 117       goto err;
 118     }
 119     pc->max_treemem = (uint) (*param->item)->val_int();
 120   }
[3 Apr 2009 6:31] Sveta Smirnova
Probably this code in  analyse::end_of_records() leads to the behavior described, but question is why to consider it a bug:

 732     if (((*f)->treemem || (*f)->tree_elements) &&
 733     (*f)->tree.elements_in_tree &&
!!!!
 734     (((*f)->treemem ? max_treemem : max_tree_elements) >
!!!!
 735      (((*f)->treemem ? (*f)->treemem : (*f)->tree_elements) +
 736        ((*f)->tree.elements_in_tree * 3 - 1 + 6))))
 737     {
 738       char tmp[331]; //331, because one double prec. num. can be this long
 739       String tmp_str(tmp, sizeof(tmp),&my_charset_bin);
[4 Apr 2009 1:08] Roel Van de Paar
> but question is why to consider it a bug

1. Since the behavior is different from the manual. The options are reversed.
2. Since a simple fix to the manual (reverse options) will not work.
[4 Apr 2009 17:53] Sveta Smirnova
Thank you for the feedback.

>  Since the behavior is different from the manual. The options are reversed.

Not. Please look into source code:

 100     pc->max_tree_elements = (uint) (*param->item)->val_int();
...
 119     pc->max_treemem = (uint) (*param->item)->val_int();

Only place where max_treemem becomes max_tree_elements is:

 734     (((*f)->treemem ? max_treemem : max_tree_elements) >

(Quotes from file ./sql/sql_analyse.cc)

And max_tree_elements never becomes max_treemem. So question still the same: "Why do you think options are reversed?"
[6 Apr 2009 1:26] Roel Van de Paar
On doing some more in-depth research, I discovered that the reason why it looked like the options were reversed was that it seemed from the output of this test:

-----------
mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(256,3)\G

      Optimal_fieldtype: CHAR(1) NOT NULL <<< INCORRECT
-----------

That the options were reversed. 

However, this particular output is correct since 3 < the length of the generated ENUM output. 

Additionally, though the code is (partly) correct for this second option, there is an error from the manual in this regard. More on this below.

Let's start with the description for max_memory in the manual:

max_memory (default 8192) is the maximum amount of memory that analyse should allocate per column while trying to find all distinct values. 

This is incorrect. we are not talking here about 'memory per column' but rather the 'max. number of characters in the Optimal_fieldtype output when considering ENUM fields' as is clear from the source code below.

What it actually should say is something like this:

max_memory (default 8192): The maximum number of characters that an ENUM field definition can take up.

This can be seen from this source code:

  (((*f)->treemem ? (*f)->treemem : (*f)->tree_elements) + ((*f)->tree.elements_in_tree * 3 - 1 + 6))))

  i.e. the actual lenght of the data elements + the needed ENUM declaration formatting given as 'tree.elements_in_tree * 3 - 1 + 6':

Example: ENUM('1','2','3','4','5','6') NOT NULL

o  3 since we need a single quote on both sides and a ',' in the definition for each element.
o -1 since there is no need for a final ','
o +6 since we have 'ENUM()'

[Side note: another bug is that this way of calculating is not fully correct since for instance ' NOT NULL' would be another +9]

This should clarify the statement 'though the code is (partly) correct for this second option, there is an error from the manual in this regard'.

However, all this said, the other 2 examples are still incorrect. The maximum number of ENUM values requested is 3, and in both cases more values are returned:

-----------
mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3,256)\G

      Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL <<< INCORRECT
-----------

-----------
mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3)\G

      Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL <<< INCORRECT
-----------

In fact, these examples are (in this case) exactly the same. Since the generated output length = 38 (though the current coding only counts 29 of this as mentioned above) is < 256, the ENUM option is shown. This can be seen by doing this test:

-----------
mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3,29)\G

      Optimal_fieldtype: CHAR(1) NOT NULL

mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3,30)\G

      Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL
-----------

To find the cause, let's try to simplify the next piece of code a bit:

if (((*f)->treemem || (*f)->tree_elements) &&
  (*f)->tree.elements_in_tree &&
  (((*f)->treemem ? max_treemem : max_tree_elements) >
   (((*f)->treemem ? (*f)->treemem : (*f)->tree_elements) +
     ((*f)->tree.elements_in_tree * 3 - 1 + 6))))

To:

if ( (treemem || tree_elements) && tree.elements_in_tree &&
 
   ( ( treemem ? max_treemem : max_tree_elements) > ((treemem ? treemem : tree_elements) + (tree.elements_in_tree * 3 - 1 + 6)) ) )

The variables used here are:

o max_tree_elements = 1st parameter to the analyse function (optional), or is the default of 256
o max_treemem = 2nd parameter to the analyse function (optional), or is the default of 8192
o tree.elements_in_tree = the actual # of unique elements in a given field (\mysys\tree.c)
o treemem = used to measure the size of tree for strings (as per the comments above this calculation), more on this below
o tree_elements = used to count the elements (as per the comments above this calculation), more on this below

The 'treemem' variable gets increased (+lenght) in the following function as long as max_treemem is not exceeded:

void field_str::add(){
...
  if (room_in_tree){
  ...
    if ((treemem += length) > pc->max_treemem){
      room_in_tree = 0;	 // Remove tree, too big tree
      delete_tree(&tree);
    }
  ...
  }
...
} // field_str::add

The 'tree_elements' variable gets increased (+1) in the following function as long as max_tree_elements is not exceeded:

void field_ulonglong::add(){
...
  if (room_in_tree){
    if (!(element = tree_insert(&tree, (void*) &num, 0, tree.custom_arg))){
      room_in_tree = 0;    // Remove tree, out of RAM ?
      delete_tree(&tree);
    }
    /*
      if element->count == 1, this element can be found only once from tree
      if element->count == 2, or more, this element is already in tree
    */
    else if (element->count == 1 && (tree_elements++) >= pc->max_tree_elements){
      room_in_tree = 0;  // Remove tree, too many elements
      delete_tree(&tree);
    }
    ...
  }
...
} // field_ulonglong::add

Notice that the calculation for tree_elements differs from that for treemem in regards of resulting tree deletion:
treemem:                                   if ((treemem+=length) >  pc->max_treemem){
tree_elements: else if (element->count == 1 && (tree_elements++) >= pc->max_tree_elements){
[6 Apr 2009 1:26] Roel Van de Paar
Now let's take the simple incorrect example:

-----------
mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3)\G

      Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL <<< INCORRECT
-----------

ENUM should not be suggested as a type, since (as per the manual):

'if there are more than max_elements [3] distinct values [6], then ENUM is not a suggested type.'
http://dev.mysql.com/doc/refman/5.0/en/procedure-analyse.html

Let's look at variables and the code again:

o max_tree_elements = 3
o max_treemem = 8192 (default as it is not given)
o tree.elements_in_tree = 6: ("1"),("2"),("3"),("4"),("5"),("6")
o treemem:
  if ((treemem += length) > pc->max_treemem){
  the total length of the string is 29 characters so
  if (29 > 8192){
  Will not happen. So the tree continues to exist.
o tree_elements = used to count the elements (as per the comments above this calculation)
  /*
    if element->count == 1, this element can be found only once from tree
    if element->count == 2, or more, this element is already in tree
  */
  else if (element->count == 1 && (tree_elements++) >= pc->max_tree_elements){

  *This seems incorrect*

  Since the need to check (tree_elements++) >= pc->max_tree_elements) should be separate from any 'element->count == 1'

  Normally, the tree would need to be deleted in this case since:
  (tree_elements++) >= pc->max_tree_elements) 
  6>=3

  However, I think that for some reason the tree does not get deleted here (more on this below)

Also, let's look at the delete_tree function in \mysys\tree.c:

void delete_tree(TREE* tree){
  free_tree(tree, MYF(0)); /* my_free() mem_root if applicable */
}

static void free_tree(TREE *tree, myf free_flags){
...
  tree->elements_in_tree=0;
...
}

So it is clear that:

o tree.elements_in_tree should be 0 *if* the tree was deleted, or *6* if the tree was not deleted

So the simplified code:

if ( (treemem || tree_elements) && tree.elements_in_tree &&
 
   ( ( treemem ? max_treemem : max_tree_elements) > ((treemem ? treemem : tree_elements) + (tree.elements_in_tree * 3 - 1 + 6)) ) )

Will be either:

-----------
#1# (the tree was not deleted)

if ( (TRUE<29> || TRUE<6>) && TRUE<6> &&
 
   ( (8192) > ((29) + (0 * 3 - 1 + 6)) ) )

End result = TRUE, which would be *incorrect* (i.e. provide a ENUM suggestion)
-----------

or:

-----------
#2# (the tree was deleted, and tree.elements_in_tree = 0 = FALSE)

if ( (TRUE<29> || TRUE<6>) && FALSE<0> &&
 
   ( (8192) > ((29) + (0 * 3 - 1 + 6)) ) )

End result = FALSE, which would be *correct* (i.e. do not provide a ENUM suggestion since 6 values in table >= 3 max ENUM values)
-----------

I will leave it to the programmers to find the exact cause, though it seems to have something to do with the non-deletion of the tree, possibly as a result of '...element->count == 1 && ...'.
[6 Apr 2009 8:47] Sveta Smirnova
Thank you for the feedback.

Inconsistency verified as described.

Variables from last test ("[6 Apr 3:26] Roel Van de Paar"):

732         if (((*f)->treemem || (*f)->tree_elements) &&
(gdb) p max_tree_elements
$1 = 3
(gdb) p max_treemem
$2 = 8192
(gdb) p (*f)->tree.elements_in_tree
$3 = 6
(gdb) p (*f)->treemem
$4 = 6
(gdb) p (*f)->tree_elements
$5 = 0
[15 Apr 2009 3:13] Roel Van de Paar
< PARTIAL WORKAROUND >

In regards the issue with 'ENUM column recommendation output' for PROCEDURE ANALYSE, you can still 'parly' use this function based on the second argument only. 

For instance, if you would like to have a maximum of 50 characters (excluding 'NOT NULL') for any ENUM column declaration, use the function as follows:

PROCEDURE ANALYSE(1,50);

The '1' will not do anything (as per the bug), and the '50' will define the maximum numbers of characters for any ENUM (excluding the text 'NOT NULL', as per the bug).

If you do not want to use any ENUM columns at all (and for instance use a linked lookup table with IDs instead), you can use:

PROCEDURE ANALYSE(1,1);

Having a linked lookup table, allows you the advantage of being able to add new values to the lookup table later on, and then start inserting the new IDs into the main table immediately (i.e. no ALTER of the ENUM column is required).
[23 Oct 2009 12:04] Peter Laursen
I did not know this report when I posted:
http://bugs.mysql.com/bug.php?id=46972
[23 Oct 2009 13:22] Valeriy Kravchuk
Bug #46972 was marked as a duplicate of this one.
[16 Nov 2010 9:09] Roel Van de Paar
Any updates?
[27 May 2012 16:41] Kevin Benton
Please bump the priority of this issue. It is tremendously frustrating to have this happening and having to write a wrapper to fix this problem as a part of code I'm writing to do full analysis on all tables in a database. In my case, I'm using ... procedure analyse( 10, 1000 ) and of course, it suggests enum far too often. I'm working around it in my Perl code by detecting the number of ENUM matches and when it's obviously wrong, I re-run the procedure analyse with (1,1) and that gives me what I expected. It's expensive on the database, however, because I have to run the analyse twice.
[27 May 2012 16:42] Kevin Benton
Please bump the priority of this issue. It is tremendously frustrating to have this happening and having to write a wrapper to fix this problem as a part of code I'm writing to do full analysis on all tables in a database. In my case, I'm using ... procedure analyse( 10, 1000 ) and of course, it suggests enum far too often. I'm working around it in my Perl code by detecting the number of ENUM matches and when it's obviously wrong, I re-run the procedure analyse with (1,1) and that gives me what I expected. It's expensive on the database, however, because I have to run the analyse twice.
[4 Apr 2013 8:38] Hartmut Holzgraefe
Still reproducible in 5.6.10