MySQL Bugs: #19194: Right recursion in parser for CASE causes excessive stack usage, limitation

Bug #19194	Right recursion in parser for CASE causes excessive stack usage, limitation
Submitted:	19 Apr 2006 12:07	Modified:	16 Jan 2007 6:40
Reporter:	Kristian Nielsen	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Parser	Severity:	S3 (Non-critical)
Version:	5.0.21-BK	OS:	Any (All)
Assigned to:	Marc ALFF	CPU Architecture:	Any

Description:
The parser (in sql_yacc.yy) for CASE expressions is written in a right-recursive way (rather than left-recursive as is usually done in LALR parsers).

    sp_case: expr THEN_SYM sp_proc_stmts1 sp_whens;
    sp_whens: /* Empty */ | WHEN_SYM sp_case;

This means that the parser will use stack proportional to the size of the CASE statement, so the number of cases is limited to around 500.

How to repeat:
$ perl -e 'print "delimiter |\nDROP PROCEDURE IF EXISTS p0;\nCREATE PROCEDURE p0() BEGIN DECLARE dummy TINYINT DEFAULT 0; CASE 0\n"; print "WHEN $_ THEN SET dummy = 0;\n" for (1..500); print "END CASE;\nEND|\n"' | mysql test
ERROR 1064 (42000) at line 2: parser stack overflow near 'dummy = 0;
WHEN 454 THEN SET dummy = 0;
WHEN 455 THEN SET dummy = 0;
WHEN 456 TH' at line 453

Suggested fix:
Rewrite the parser to be right-recursive, something like

    sp_case: expr THEN_SYM sp_proc_stmts1;
    sp_whens: /* Empty */ | sp_whens WHEN_SYM sp_case;

Alternatively, if this is difficult to do because of the needs of semantic actions in the parser, it would be good to have an explanation describing why in a comment in sql_yacc.yy.

Thank you for a bug report. Verified just as described (with 5000 instead of 500) on 5.0.21-BK (ChangeSet@1.2160.1.3, 2006-04-17 16:46:56+03:00):

openxs@suse:~/dbs/5.0> perl -e 'print "delimiter |\nDROP PROCEDURE IF EXISTS p0
;\nCREATE PROCEDURE p0() BEGIN DECLARE dummy TINYINT DEFAULT 0; CASE 0\n"; print "WHEN $_ THEN SET dummy = 0;\n" for (1..5000); print "END CASE;\nEND|\n"' | bin/mysql -uroot test
ERROR 1064 (42000) at line 2: parser stack overflow near '0;
WHEN 4568 THEN SET dummy = 0;
WHEN 4569 THEN SET dummy = 0;
WHEN 4570 THEN SE' at line 4567

Calvin, Antony, I'm stealing this.

Manually changing to Patch Pending :

The commit email was bounced by by the commit list because of it's size,
but was delivered to development list:
  bk commit - 5.0 tree (malff:1.2285) BUG#19194 WL#2999

A copy of the Change Set is provided as an attachment.

Change Set comments

Attachment: cs_19194.txt (text/plain), 7.09 KiB.

Complete patch

Attachment: patch_19194.txt.gz (application/x-gzip, text), 106.81 KiB.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/14309

ChangeSet@1.2285, 2006-10-24 15:33:09-07:00, malff@weblab.(none) +10 -0
  Bug#19194 (Right recursion in parser for CASE causes excessive stack usage,
    limitation)
  
  Note to the reviewer
  ====================
  
  Warning: reviewing this patch is somewhat involved.
  Due to the nature of several issues all affecting the same area,
  fixing separately each issue is not practical, since each fix can not be
  implemented and tested independently.
  In particular, the issues with
  - rule recursion
  - nested case statements
  - forward jump resolution (backpatch list)
  are tightly coupled (see below).
  
  Definitions
  ===========
  
  The expression
    CASE expr
    WHEN expr THEN expr
    WHEN expr THEN expr
    ...
    END
  is a "Simple Case Expression".
  
  The expression
    CASE
    WHEN expr THEN expr
    WHEN expr THEN expr
    ...
    END
  is a "Searched Case Expression".
  
  The statement
    CASE expr
    WHEN expr THEN stmts
    WHEN expr THEN stmts
    ...
    END CASE
  is a "Simple Case Statement".
  
  The statement
    CASE
    WHEN expr THEN stmts
    WHEN expr THEN stmts
    ...
    END CASE
  is a "Searched Case Statement".
  
  A "Left Recursive" rule is like
    list:
        element
      | list element
      ;
  
  A "Right Recursive" rule is like
    list:
        element
      | element list
      ;
  
  Left and right recursion produces the same language, the difference only
  affects the *order* in which the text is parsed.
  
  In a descendant parser (usually written manually), right recursion works
  very well, and is typically implemented with a while loop.
  In an ascendant parser (yacc/bison) left recursion works very well,
  and is implemented naturally by the parser stack.
  In both cases, using the wrong type or recursion is very bad and should be
  avoided, as it causes technical issues with the parser implementation.
  
  Before this change
  ==================
  
  The "Simple Case Expression" and "Searched Case Expression" were both
  implemented by the "when_list" and "when_list2" rules, which are left
  recursive (ok).
  
  These rules, however, used lex->when_list instead of using the parser stack,
  which is more complex that necessary, and potentially dangerous because
  of other rules using THD::reset_lex.
  
  The "Simple Case Statement" and "Searched Case Statements" were implemented
  by the "sp_case", "sp_whens" and in part by "sp_proc_stmt" rules.
  Both cases were right recursive (bad).
  
  The grammar involved was convoluted, and is assumed to be the results of
  tweaks to get the code generation to work, but is not what someone would
  naturally write.
  
  In addition, using a common rule for both "Simple" and "Searched" case
  statements was implemented with sp_head::m_flags |= IN_SIMPLE_CASE,
  which is a flag and not a stack, and therefore does not take into account
  *nested* case statements. This leads to incorrect generated code, and either
  a server crash or an incorrect result.
  
  With regards to the backpatch mechanism, a *different* backpatch list was
  created for each jump from "WHEN expr THEN stmt" to "END CASE", which
  relied on the grammar to be right recursive.
  This is a mis-use of the backpatch list, since this list can resolve
  multiple references to the same target at once.
  
  The optimizer algorithm used to detect dead code in the "assembly" SQL
  instructions, implemented by sp_head::opt_mark(uint ip), was recursive
  in some cases (a conditional jump pointing forward to another conditional
  jump).
  In case of specially crafted code, like
  - a long list of "IF expr THEN stmt END IF"
  - a long CASE statement
  this would actually cause a server crash with a stack overflow.
  In general, having a stack that grows proportionally with user data (the
  SQL code given by the client in a CREATE PROCEDURE) is to be avoided.
  
  In debug builds only, creating a SP / SF / Trigger which had a significant
  amount of code would spend --literally-- several minutes in sp_head::create,
  because of the debug code involved with DBUG_PRINT("info", ("Code %s ...
  There are several issues with this code:
  - in a CASE with 5 000 WHEN, there are 15 000 instructions generated,
    which create a sting representation of the code which is 500 000 bytes
    long,
  - using a String instead of an io stream causes performances to degrade
    to a total server freeze, as time is spent doing realloc of a buffer
    always too short,
  - Printing a 500 000 long string in the debug log is too verbose,
  - Generating this string even when DBUG_PRINT is off is useless,
  - Having code that potentially can affect the server behavior, used with
    #ifdef / #endif is useful in some cases, but is also a bad practice.
  
  After this change
  =================
  
  "Case Expressions" (both simple and searched) have been simplified to
  not use LEX::when_list, which has been removed.
  
  Considering all the issues affecting case statements, the grammar for these
  has been totally re written.
  
  The existing actions, used to generate "assembly" sp_inst* code, have been
  preserved but moved in the new grammar, with the following changes:
  
  a) Code is no longer shared between "Simple" and "Searched" case statements,
  because there is a slight difference in the actions.
  Nested statements are handled naturally by the parser stack, which by
  definition uses the correct rule in the correct context.
  Nested statements of the opposite type (simple vs searched) works correctly.
  The flag sp_head::IN_SIMPLE_CASE is no longer used.
  This is a step towards resolution of WL#2999, which correctly identified
  that temporary parsing flags do not belong to sp_head.
  
  b) The backpatch mechanism, used to resolve forward jumps in the generated
  code, has been changed to:
  - create a label for the instruction following 'END CASE',
  - register each jump at the end of a "WHEN expr THEN stmt" in a *unique*
    backpatch list associated with the 'END CASE' label
  - resolve all the forward jumps for this label at once.
  
  In addition, the code involving backpatch has been commented, so that a
  reader can now understand by reading matching "Registering" and "Resolving"
  comments how the forward jumps are resolved and what target they resolve to,
  as this is far from evident when reading the code alone.
  
  The implementation of sp_head::opt_mark() has been revised to avoid
  recursive calls from jump instructions, and instead add the jump location
  to the list of paths to explore during the flow analysis of the instruction
  graph, with a call to sp_head::add_mark_lead().
  In addition, the flow analysis will stop if an instruction has already
  been marked as reachable, which the previous code failed to do in the
  recursive case.
  sp_head::opt_mark() is now private, to prevent new calls to this method from
  being introduced.
  
  The debug code present in sp_head::create() has been removed.
  Considering that SHOW PROCEDURE CODE is also available in debug builds,
  and can be used anytime regardless of the trace level, as opposed to
  "CREATE PROCEDURE" time and only if the trace was on,
  removing the code actually makes debugging easier (usable trace).
  
  Tests have been written to cover the parser overflow (big CASE),
  and to cover nested CASE statements.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/15521

ChangeSet@1.2285, 2006-11-17 12:14:29-07:00, malff@weblab.(none) +10 -0
  Bug#19194 (Right recursion in parser for CASE causes excessive stack usage,
    limitation)
  
  Note to the reviewer
  ====================
  
  Warning: reviewing this patch is somewhat involved.
  Due to the nature of several issues all affecting the same area,
  fixing separately each issue is not practical, since each fix can not be
  implemented and tested independently.
  In particular, the issues with
  - rule recursion
  - nested case statements
  - forward jump resolution (backpatch list)
  are tightly coupled (see below).
  
  Definitions
  ===========
  
  The expression
    CASE expr
    WHEN expr THEN expr
    WHEN expr THEN expr
    ...
    END
  is a "Simple Case Expression".
  
  The expression
    CASE
    WHEN expr THEN expr
    WHEN expr THEN expr
    ...
    END
  is a "Searched Case Expression".
  
  The statement
    CASE expr
    WHEN expr THEN stmts
    WHEN expr THEN stmts
    ...
    END CASE
  is a "Simple Case Statement".
  
  The statement
    CASE
    WHEN expr THEN stmts
    WHEN expr THEN stmts
    ...
    END CASE
  is a "Searched Case Statement".
  
  A "Left Recursive" rule is like
    list:
        element
      | list element
      ;
  
  A "Right Recursive" rule is like
    list:
        element
      | element list
      ;
  
  Left and right recursion produces the same language, the difference only
  affects the *order* in which the text is parsed.
  
  In a descendant parser (usually written manually), right recursion works
  very well, and is typically implemented with a while loop.
  In an ascendant parser (yacc/bison) left recursion works very well,
  and is implemented naturally by the parser stack.
  In both cases, using the wrong type or recursion is very bad and should be
  avoided, as it causes technical issues with the parser implementation.
  
  Before this change
  ==================
  
  The "Simple Case Expression" and "Searched Case Expression" were both
  implemented by the "when_list" and "when_list2" rules, which are left
  recursive (ok).
  
  These rules, however, used lex->when_list instead of using the parser stack,
  which is more complex that necessary, and potentially dangerous because
  of other rules using THD::reset_lex.
  
  The "Simple Case Statement" and "Searched Case Statements" were implemented
  by the "sp_case", "sp_whens" and in part by "sp_proc_stmt" rules.
  Both cases were right recursive (bad).
  
  The grammar involved was convoluted, and is assumed to be the results of
  tweaks to get the code generation to work, but is not what someone would
  naturally write.
  
  In addition, using a common rule for both "Simple" and "Searched" case
  statements was implemented with sp_head::m_flags |= IN_SIMPLE_CASE,
  which is a flag and not a stack, and therefore does not take into account
  *nested* case statements. This leads to incorrect generated code, and either
  a server crash or an incorrect result.
  
  With regards to the backpatch mechanism, a *different* backpatch list was
  created for each jump from "WHEN expr THEN stmt" to "END CASE", which
  relied on the grammar to be right recursive.
  This is a mis-use of the backpatch list, since this list can resolve
  multiple references to the same target at once.
  
  The optimizer algorithm used to detect dead code in the "assembly" SQL
  instructions, implemented by sp_head::opt_mark(uint ip), was recursive
  in some cases (a conditional jump pointing forward to another conditional
  jump).
  In case of specially crafted code, like
  - a long list of "IF expr THEN stmt END IF"
  - a long CASE statement
  this would actually cause a server crash with a stack overflow.
  In general, having a stack that grows proportionally with user data (the
  SQL code given by the client in a CREATE PROCEDURE) is to be avoided.
  
  In debug builds only, creating a SP / SF / Trigger which had a significant
  amount of code would spend --literally-- several minutes in sp_head::create,
  because of the debug code involved with DBUG_PRINT("info", ("Code %s ...
  There are several issues with this code:
  - in a CASE with 5 000 WHEN, there are 15 000 instructions generated,
    which create a sting representation of the code which is 500 000 bytes
    long,
  - using a String instead of an io stream causes performances to degrade
    to a total server freeze, as time is spent doing realloc of a buffer
    always too short,
  - Printing a 500 000 long string in the debug log is too verbose,
  - Generating this string even when DBUG_PRINT is off is useless,
  - Having code that potentially can affect the server behavior, used with
    #ifdef / #endif is useful in some cases, but is also a bad practice.
  
  After this change
  =================
  
  "Case Expressions" (both simple and searched) have been simplified to
  not use LEX::when_list, which has been removed.
  
  Considering all the issues affecting case statements, the grammar for these
  has been totally re written.
  
  The existing actions, used to generate "assembly" sp_inst* code, have been
  preserved but moved in the new grammar, with the following changes:
  
  a) Bison rules are no longer shared between "Simple" and "Searched" case
  statements, because a stack instead of a flag is required to handle them.
  Nested statements are handled naturally by the parser stack, which by
  definition uses the correct rule in the correct context.
  Nested statements of the opposite type (simple vs searched) works correctly.
  The flag sp_head::IN_SIMPLE_CASE is no longer used.
  This is a step towards resolution of WL#2999, which correctly identified
  that temporary parsing flags do not belong to sp_head.
  The code in the action is shared by mean of the case_stmt_action_xxx()
  helpers.
  
  b) The backpatch mechanism, used to resolve forward jumps in the generated
  code, has been changed to:
  - create a label for the instruction following 'END CASE',
  - register each jump at the end of a "WHEN expr THEN stmt" in a *unique*
    backpatch list associated with the 'END CASE' label
  - resolve all the forward jumps for this label at once.
  
  In addition, the code involving backpatch has been commented, so that a
  reader can now understand by reading matching "Registering" and "Resolving"
  comments how the forward jumps are resolved and what target they resolve to,
  as this is far from evident when reading the code alone.
  
  The implementation of sp_head::opt_mark() has been revised to avoid
  recursive calls from jump instructions, and instead add the jump location
  to the list of paths to explore during the flow analysis of the instruction
  graph, with a call to sp_head::add_mark_lead().
  In addition, the flow analysis will stop if an instruction has already
  been marked as reachable, which the previous code failed to do in the
  recursive case.
  sp_head::opt_mark() is now private, to prevent new calls to this method from
  being introduced.
  
  The debug code present in sp_head::create() has been removed.
  Considering that SHOW PROCEDURE CODE is also available in debug builds,
  and can be used anytime regardless of the trace level, as opposed to
  "CREATE PROCEDURE" time and only if the trace was on,
  removing the code actually makes debugging easier (usable trace).
  
  Tests have been written to cover the parser overflow (big CASE),
  and to cover nested CASE statements.

Sent a code rewiew by email.
Asked for a couple of minor changes.

See related Bug#24854.

Merged into 5.0.34 and 5.1.15

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.0.34 and 5.1.15 changelogs.