Bug #44060 | First option of PROCEDURE ANALYSE() does not work, second needs some work | ||
---|---|---|---|
Submitted: | 3 Apr 2009 3:07 | Modified: | 6 Apr 2009 8:47 |
Reporter: | Roel Van de Paar | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: DML | Severity: | S3 (Non-critical) |
Version: | 4.1, 5.0, 5.1, 6.0 bzr | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | PROCEDURE ANALYSE |
[3 Apr 2009 3:07]
Roel Van de Paar
[3 Apr 2009 6:23]
Sveta Smirnova
Thank you for the report. Results are repeatable. But how did you do such conclusion? SELECT * FROM test_pa PROCEDURE ANALYSE(3,256) and SELECT * FROM test_pa PROCEDURE ANALYSE(3) give same results, while SELECT * FROM test_pa PROCEDURE ANALYSE(256,3) gives other. Additionally proc_analyse_init sets pc->max_tree_elements and pc->max_treemem in order described in the manual: 86 else if (param->next) 87 { 88 // first parameter 89 if (!(*param->item)->fixed && (*param->item)->fix_fields(thd, param->item)) 90 { 91 DBUG_PRINT("info", ("fix_fields() for the first parameter failed")); 92 goto err; 93 } 94 if ((*param->item)->type() != Item::INT_ITEM || 95 (*param->item)->val_real() < 0) 96 { 97 my_error(ER_WRONG_PARAMETERS_TO_PROCEDURE, MYF(0), proc_name); 98 goto err; 99 } 100 pc->max_tree_elements = (uint) (*param->item)->val_int(); 101 param = param->next; 102 if (param->next) // no third parameter possible 103 { 104 my_error(ER_WRONG_PARAMCOUNT_TO_PROCEDURE, MYF(0), proc_name); 105 goto err; 106 } 107 // second parameter 108 if (!(*param->item)->fixed && (*param->item)->fix_fields(thd, param->item)) 109 { 110 DBUG_PRINT("info", ("fix_fields() for the second parameter failed")); 111 goto err; 112 } 113 if ((*param->item)->type() != Item::INT_ITEM || 114 (*param->item)->val_real() < 0) 115 { 116 my_error(ER_WRONG_PARAMETERS_TO_PROCEDURE, MYF(0), proc_name); 117 goto err; 118 } 119 pc->max_treemem = (uint) (*param->item)->val_int(); 120 }
[3 Apr 2009 6:31]
Sveta Smirnova
Probably this code in analyse::end_of_records() leads to the behavior described, but question is why to consider it a bug: 732 if (((*f)->treemem || (*f)->tree_elements) && 733 (*f)->tree.elements_in_tree && !!!! 734 (((*f)->treemem ? max_treemem : max_tree_elements) > !!!! 735 (((*f)->treemem ? (*f)->treemem : (*f)->tree_elements) + 736 ((*f)->tree.elements_in_tree * 3 - 1 + 6)))) 737 { 738 char tmp[331]; //331, because one double prec. num. can be this long 739 String tmp_str(tmp, sizeof(tmp),&my_charset_bin);
[4 Apr 2009 1:08]
Roel Van de Paar
> but question is why to consider it a bug 1. Since the behavior is different from the manual. The options are reversed. 2. Since a simple fix to the manual (reverse options) will not work.
[4 Apr 2009 17:53]
Sveta Smirnova
Thank you for the feedback. > Since the behavior is different from the manual. The options are reversed. Not. Please look into source code: 100 pc->max_tree_elements = (uint) (*param->item)->val_int(); ... 119 pc->max_treemem = (uint) (*param->item)->val_int(); Only place where max_treemem becomes max_tree_elements is: 734 (((*f)->treemem ? max_treemem : max_tree_elements) > (Quotes from file ./sql/sql_analyse.cc) And max_tree_elements never becomes max_treemem. So question still the same: "Why do you think options are reversed?"
[6 Apr 2009 1:26]
Roel Van de Paar
On doing some more in-depth research, I discovered that the reason why it looked like the options were reversed was that it seemed from the output of this test: ----------- mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(256,3)\G Optimal_fieldtype: CHAR(1) NOT NULL <<< INCORRECT ----------- That the options were reversed. However, this particular output is correct since 3 < the length of the generated ENUM output. Additionally, though the code is (partly) correct for this second option, there is an error from the manual in this regard. More on this below. Let's start with the description for max_memory in the manual: max_memory (default 8192) is the maximum amount of memory that analyse should allocate per column while trying to find all distinct values. This is incorrect. we are not talking here about 'memory per column' but rather the 'max. number of characters in the Optimal_fieldtype output when considering ENUM fields' as is clear from the source code below. What it actually should say is something like this: max_memory (default 8192): The maximum number of characters that an ENUM field definition can take up. This can be seen from this source code: (((*f)->treemem ? (*f)->treemem : (*f)->tree_elements) + ((*f)->tree.elements_in_tree * 3 - 1 + 6)))) i.e. the actual lenght of the data elements + the needed ENUM declaration formatting given as 'tree.elements_in_tree * 3 - 1 + 6': Example: ENUM('1','2','3','4','5','6') NOT NULL o 3 since we need a single quote on both sides and a ',' in the definition for each element. o -1 since there is no need for a final ',' o +6 since we have 'ENUM()' [Side note: another bug is that this way of calculating is not fully correct since for instance ' NOT NULL' would be another +9] This should clarify the statement 'though the code is (partly) correct for this second option, there is an error from the manual in this regard'. However, all this said, the other 2 examples are still incorrect. The maximum number of ENUM values requested is 3, and in both cases more values are returned: ----------- mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3,256)\G Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL <<< INCORRECT ----------- ----------- mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3)\G Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL <<< INCORRECT ----------- In fact, these examples are (in this case) exactly the same. Since the generated output length = 38 (though the current coding only counts 29 of this as mentioned above) is < 256, the ENUM option is shown. This can be seen by doing this test: ----------- mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3,29)\G Optimal_fieldtype: CHAR(1) NOT NULL mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3,30)\G Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL ----------- To find the cause, let's try to simplify the next piece of code a bit: if (((*f)->treemem || (*f)->tree_elements) && (*f)->tree.elements_in_tree && (((*f)->treemem ? max_treemem : max_tree_elements) > (((*f)->treemem ? (*f)->treemem : (*f)->tree_elements) + ((*f)->tree.elements_in_tree * 3 - 1 + 6)))) To: if ( (treemem || tree_elements) && tree.elements_in_tree && ( ( treemem ? max_treemem : max_tree_elements) > ((treemem ? treemem : tree_elements) + (tree.elements_in_tree * 3 - 1 + 6)) ) ) The variables used here are: o max_tree_elements = 1st parameter to the analyse function (optional), or is the default of 256 o max_treemem = 2nd parameter to the analyse function (optional), or is the default of 8192 o tree.elements_in_tree = the actual # of unique elements in a given field (\mysys\tree.c) o treemem = used to measure the size of tree for strings (as per the comments above this calculation), more on this below o tree_elements = used to count the elements (as per the comments above this calculation), more on this below The 'treemem' variable gets increased (+lenght) in the following function as long as max_treemem is not exceeded: void field_str::add(){ ... if (room_in_tree){ ... if ((treemem += length) > pc->max_treemem){ room_in_tree = 0; // Remove tree, too big tree delete_tree(&tree); } ... } ... } // field_str::add The 'tree_elements' variable gets increased (+1) in the following function as long as max_tree_elements is not exceeded: void field_ulonglong::add(){ ... if (room_in_tree){ if (!(element = tree_insert(&tree, (void*) &num, 0, tree.custom_arg))){ room_in_tree = 0; // Remove tree, out of RAM ? delete_tree(&tree); } /* if element->count == 1, this element can be found only once from tree if element->count == 2, or more, this element is already in tree */ else if (element->count == 1 && (tree_elements++) >= pc->max_tree_elements){ room_in_tree = 0; // Remove tree, too many elements delete_tree(&tree); } ... } ... } // field_ulonglong::add Notice that the calculation for tree_elements differs from that for treemem in regards of resulting tree deletion: treemem: if ((treemem+=length) > pc->max_treemem){ tree_elements: else if (element->count == 1 && (tree_elements++) >= pc->max_tree_elements){
[6 Apr 2009 1:26]
Roel Van de Paar
Now let's take the simple incorrect example: ----------- mysql> SELECT * FROM test_pa PROCEDURE ANALYSE(3)\G Optimal_fieldtype: ENUM('1','2','3','4','5','6') NOT NULL <<< INCORRECT ----------- ENUM should not be suggested as a type, since (as per the manual): 'if there are more than max_elements [3] distinct values [6], then ENUM is not a suggested type.' http://dev.mysql.com/doc/refman/5.0/en/procedure-analyse.html Let's look at variables and the code again: o max_tree_elements = 3 o max_treemem = 8192 (default as it is not given) o tree.elements_in_tree = 6: ("1"),("2"),("3"),("4"),("5"),("6") o treemem: if ((treemem += length) > pc->max_treemem){ the total length of the string is 29 characters so if (29 > 8192){ Will not happen. So the tree continues to exist. o tree_elements = used to count the elements (as per the comments above this calculation) /* if element->count == 1, this element can be found only once from tree if element->count == 2, or more, this element is already in tree */ else if (element->count == 1 && (tree_elements++) >= pc->max_tree_elements){ *This seems incorrect* Since the need to check (tree_elements++) >= pc->max_tree_elements) should be separate from any 'element->count == 1' Normally, the tree would need to be deleted in this case since: (tree_elements++) >= pc->max_tree_elements) 6>=3 However, I think that for some reason the tree does not get deleted here (more on this below) Also, let's look at the delete_tree function in \mysys\tree.c: void delete_tree(TREE* tree){ free_tree(tree, MYF(0)); /* my_free() mem_root if applicable */ } static void free_tree(TREE *tree, myf free_flags){ ... tree->elements_in_tree=0; ... } So it is clear that: o tree.elements_in_tree should be 0 *if* the tree was deleted, or *6* if the tree was not deleted So the simplified code: if ( (treemem || tree_elements) && tree.elements_in_tree && ( ( treemem ? max_treemem : max_tree_elements) > ((treemem ? treemem : tree_elements) + (tree.elements_in_tree * 3 - 1 + 6)) ) ) Will be either: ----------- #1# (the tree was not deleted) if ( (TRUE<29> || TRUE<6>) && TRUE<6> && ( (8192) > ((29) + (0 * 3 - 1 + 6)) ) ) End result = TRUE, which would be *incorrect* (i.e. provide a ENUM suggestion) ----------- or: ----------- #2# (the tree was deleted, and tree.elements_in_tree = 0 = FALSE) if ( (TRUE<29> || TRUE<6>) && FALSE<0> && ( (8192) > ((29) + (0 * 3 - 1 + 6)) ) ) End result = FALSE, which would be *correct* (i.e. do not provide a ENUM suggestion since 6 values in table >= 3 max ENUM values) ----------- I will leave it to the programmers to find the exact cause, though it seems to have something to do with the non-deletion of the tree, possibly as a result of '...element->count == 1 && ...'.
[6 Apr 2009 8:47]
Sveta Smirnova
Thank you for the feedback. Inconsistency verified as described. Variables from last test ("[6 Apr 3:26] Roel Van de Paar"): 732 if (((*f)->treemem || (*f)->tree_elements) && (gdb) p max_tree_elements $1 = 3 (gdb) p max_treemem $2 = 8192 (gdb) p (*f)->tree.elements_in_tree $3 = 6 (gdb) p (*f)->treemem $4 = 6 (gdb) p (*f)->tree_elements $5 = 0
[15 Apr 2009 3:13]
Roel Van de Paar
< PARTIAL WORKAROUND > In regards the issue with 'ENUM column recommendation output' for PROCEDURE ANALYSE, you can still 'parly' use this function based on the second argument only. For instance, if you would like to have a maximum of 50 characters (excluding 'NOT NULL') for any ENUM column declaration, use the function as follows: PROCEDURE ANALYSE(1,50); The '1' will not do anything (as per the bug), and the '50' will define the maximum numbers of characters for any ENUM (excluding the text 'NOT NULL', as per the bug). If you do not want to use any ENUM columns at all (and for instance use a linked lookup table with IDs instead), you can use: PROCEDURE ANALYSE(1,1); Having a linked lookup table, allows you the advantage of being able to add new values to the lookup table later on, and then start inserting the new IDs into the main table immediately (i.e. no ALTER of the ENUM column is required).
[23 Oct 2009 12:04]
Peter Laursen
I did not know this report when I posted: http://bugs.mysql.com/bug.php?id=46972
[23 Oct 2009 13:22]
Valeriy Kravchuk
Bug #46972 was marked as a duplicate of this one.
[16 Nov 2010 9:09]
Roel Van de Paar
Any updates?
[27 May 2012 16:41]
Kevin Benton
Please bump the priority of this issue. It is tremendously frustrating to have this happening and having to write a wrapper to fix this problem as a part of code I'm writing to do full analysis on all tables in a database. In my case, I'm using ... procedure analyse( 10, 1000 ) and of course, it suggests enum far too often. I'm working around it in my Perl code by detecting the number of ENUM matches and when it's obviously wrong, I re-run the procedure analyse with (1,1) and that gives me what I expected. It's expensive on the database, however, because I have to run the analyse twice.
[27 May 2012 16:42]
Kevin Benton
Please bump the priority of this issue. It is tremendously frustrating to have this happening and having to write a wrapper to fix this problem as a part of code I'm writing to do full analysis on all tables in a database. In my case, I'm using ... procedure analyse( 10, 1000 ) and of course, it suggests enum far too often. I'm working around it in my Perl code by detecting the number of ENUM matches and when it's obviously wrong, I re-run the procedure analyse with (1,1) and that gives me what I expected. It's expensive on the database, however, because I have to run the analyse twice.
[4 Apr 2013 8:38]
Hartmut Holzgraefe
Still reproducible in 5.6.10