Bug #84566 Crash in TABLE_LIST::query_block_id_for_explain during EXPLAIN FOR CONNECTION
Submitted: 19 Jan 2017 14:39 Modified: 8 Mar 2017 18:09
Reporter: John Embretsen Email Updates:
Status: Closed Impact on me:
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:8.0.1 OS:Any
Assigned to: CPU Architecture:Any

[19 Jan 2017 14:39] John Embretsen
The server crashes (segfault signal 11) when running a concurrency RQG test involving regular DML/DDL along with EXPLAIN FOR CONNECTION statements.


Program terminated with signal SIGSEGV, Segmentation fault.
#0  in  __pthread_kill (threadid=<optimized out>, signo=11)
    at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:61
#0  in  __pthread_kill (threadid=<optimized out>, signo=11)
    at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:61
#1  in  my_write_core (sig=11) at mysys/stacktrace.cc:291
#2  in  handle_fatal_signal (sig=11) at sql/signal_handler.cc:233
#3  <signal handler called>
#4  in  TABLE_LIST::query_block_id_for_explain (this=0x0)
    at sql/table.cc:6628
#5  in  Explain::explain_subqueries (this=0x7fa3ac7f5350)
    at sql/opt_explain.cc:613
#6  in  Explain::send (this=0x7fa3ac7f5350) at sql/opt_explain.cc:724
#7  in  explain_no_table (thd=0x7fa340000be0, select_lex=0x7fa348007bd0,
    message=0x31e0c20 <plan_not_ready> "Plan isn't ready yet", ctx=CTX_JOIN)
    at sql/opt_explain.cc:1970
#8  in  explain_query_specification (ethd=0x7fa340000be0,
    select_lex=0x7fa348007bd0, ctx=CTX_JOIN) at sql/opt_explain.cc:2117
#9  in  mysql_explain_unit (ethd=0x7fa340000be0, unit=0x7fa348007ea0)
    at sql/opt_explain.cc:2315
#10 in  explain_query (ethd=0x7fa340000be0, unit=0x7fa348007ea0)
    at sql/opt_explain.cc:2262
#11 in  mysql_explain_other (thd=0x7fa340000be0) at sql/opt_explain.cc:2459
#12 in  mysql_execute_command (thd=0x7fa340000be0, first_level=true)
    at sql/sql_parse.cc:4418

Stacktrace for all threads and full RQG log will be attached to this bug report.
The issue has been observed in weekly RQG testing since around December 4 2016. The same tests passed on November 27.
"git bisect" indicates the following culprit push:

7580e00a63bfc68abd46fbc5d81436ce969eef2d is the first bad commit
commit 7580e00a63bfc68abd46fbc5d81436ce969eef2d
Author: Knut Anders Hatlen <knut.hatlen@oracle.com>
Date:   Tue Nov 29 10:04:50 2016 +0100



    1) Make erase() use std::move() instead of std::copy() to allow
    erasing elements that can be moved, but not copied, from a

    2) Change the signatures of the erase() functions to match the
    signatures of std::vector's erase() functions in C++11. That is, the
    arguments are now const_iterator instead of iterator.

    3) Add cbegin() and cend() functions modelled after std::vector.

    Change-Id: I58f010055f05a3aaa495b862812ca3e96d0060b7

:040000 040000 f7a3efeaeadb33947eab9b19a288d91d14037545 bb4f1d1b5f0ec3efa5fad3ee479dc86e82a41b69 M      include
:040000 040000 74abf075b9fe51d87a3fec320b952becea553931 6bb7223f17d389031e54fc0b591626f1eed19618 M      unittest
bisect run success

How to repeat:
So far unable to repeat manually. It may be possible to trigger the issue with some clever DEBUG_SYNC usage.

The issue repeats reliably using RQG tests. An example (used for git bisect) follows below.

perl $RQG_HOME/runtest.pl \
  --queries=100000000 \
  --debug-server \
  --queries=10M \
  --seed=30080 \
  --threads=5 \
  --duration=1800 \
  --mysqld=--log_error_verbosity=3 \
  --mysqld=--sql_mode='' \
  --reporter=Backtrace,QueryTimeout:querytimeout=180,ErrorLog,ErrorLogAlarm \
  --termination_timeout=2700 \
  --mysqld=--lock-wait-timeout=5 \
  --mysqld=--innodb-lock-wait-timeout=5 \
  --mysqld=--wait-timeout=300 \
  --Validator=Transformer,ExplainOther,ExplainOtherKill \
  --mysqld=--debug-sync-timeout=15 \
  --grammar=$TRUNK_SRC/internal/rqg_grammars/optimizer/outer_join.yy \
  --gendata=$TRUNK_SRC/internal/rqg_grammars/optimizer/outer_join.zz \
  --basedir1=$TRUNK_SRC/jbisect/jbuild-bisect/install \
[23 Jan 2017 9:00] John Embretsen
Posted by developer:
Adding --valgrind to the RQG command line seems to make it easier to reproduce the issue.

After doing this I can reproduce also against the parent commit, 7e2101fcc9. 

A new git bisect run now indicates that the following commit is related:

ec3c0d079b2b566366d22238bdc0efa82968c3a7 is the first bad commit
commit ec3c0d079b2b566366d22238bdc0efa82968c3a7
Author: Guilhem Bichot <guilhem.bichot@oracle.com>
Date:   Tue Nov 29 15:46:27 2016 +0100

    WL#883 non-recursive CTE
    WL#3634 recursive CTE

It seems that this was committed after Knut's patch, so it may be another false positive.
[8 Mar 2017 18:09] Paul DuBois
Posted by developer:
Fixed in 8.0.1.

Bug affects no released version. No changelog entry needed.