Bug #79629 NDB_SHARE prematurely destructed by binlog restart,due to incorrect ref counting
Submitted: 14 Dec 2015 9:29 Modified: 11 Jan 2016 17:09
Reporter: Ole John Aske Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:7.2.22 OS:Any
Assigned to: CPU Architecture:Any

[14 Dec 2015 9:29] Ole John Aske
Description:
The NDB_SHARE object may be referred by multiple threads and multiple data structures
shared between these threads. A reference counting schema is used to keep track of
number of references to the NDB_SHARE, such that it is not destructed while still
being referred from somewhere. To avoid keeping locks while updating a 'share',
a temporary ref count may also be taken on the share to avoid that it is destructed
in the middle of a set of operations (typical while the binlog thread handle distribution
of schema changes)

One of the data structures referring a 'share', is the hash-list: 'ndbcluster_open_tables'.
A share is registered here as soon as it is created or 'discovered' by the first operation
opening a table referring this share. The 'state' of the share has to be '!= NSS_DROPPED'
while it is registered in the 'open_tables' hash.

When the binlog thread restarts, it will empty the 'open_tables' hash to force 
a later reopen of these tables. Thus, the shares in this tables are unrefed, and 
possibly destructed.

However, a possible race condition exists between a client thread creating or
opening a share, and the binlog thread emptying the 'open_tables' list. As the
::open and ::create() methods did not have their own (temporary) reference 
to the share they operated on, the binlog thread could destruct the share 
in the middle of their operation. The stack dump below is one such case:

current thread: t@27
  [1] _lwp_kill(0x1b, 0xb, 0x0, 0xfe522168), at 0xfe529265 
  [2] thr_kill(0x1b, 0xb, 0xfe528365, 0x14, 0x1708e9c, 0x9011efc), at 0xfe52218a 
=>[3] my_write_core(sig = 0), line 422 in "stacktrace.c"
  [4] handle_fatal_signal(sig = 0), line 230 in "signal_handler.cc"
  [5] __sighndlr(0xb, 0x0, 0xfcb04d20, 0x84f9380), at 0xfe5247d5 
  [6] call_user_handler(0xb), at 0xfe517dae 
  [7] sigacthandler(0xb, 0x0, 0xfcb04d20, 0x1f, 0x0, 0x807007), at 0xfe518007 
  ---- called from signal handler with signal 11 (SIGSEGV) ------

-> While ::open() sets up the event subscription on the share, the share 
   itself is destructed under our feets ->  SIGSEGV 

  [8] strcmp(0x0, 0x8f86b50, 0xfcb052c8, 0x8c4e13e, 0x10291660, 0x8000000f), at 0xfe49c9d3 
  [9] Ndb_dist_priv_util::is_distributed_priv_table(db = (nil), table = (nil)), line 54 in "ndb_dist_priv_util.h"
  [10] ndbcluster_create_event_ops(thd = (nil), share = (nil), ndbtab = (nil), event_name = (nil)), line 4898 in "ha_ndbcluster_binlog.cc"
  [11] ndbcluster_create_binlog_setup(thd = (nil), ndb = (nil), key = (nil), key_len = 0, db = (nil), table_name = (nil), table = (nil)), line 4679 in "ha_ndbcluster_binlog.cc"
  [12] ha_ndbcluster::open(this = (nil), name = (nil), mode = 0, test_if_locked = 0), line 11819 in "ha_ndbcluster.cc"
  [13] handler::ha_open(this = (nil), table_arg = (nil), name = (nil), mode = 0, test_if_locked = 0), line 2505 in "handler.cc"
  [14] open_table_from_share(thd = (nil), share = (nil), alias = (nil), db_stat = 0, prgflag = 0, ha_open_flags = 0, outparam = (nil), is_create_table = false), line 2355 in "table.cc"
  [15] open_table(thd = (nil), table_list = (nil), ot_ctx = (nil)), line 3167 in "sql_base.cc"
  [16] _ZL22open_and_process_tableP3THDP3LEXP10TABLE_LISTPjjP19Prelocking_strategybP18Open_table_context(thd = (nil), lex = (nil), tables = (nil), counter = (nil), flags = 0, prelocking_strategy = (nil), has_prelocking_list = false, ot_ctx = (nil)), line 4726 in "sql_base.cc"
  [17] open_tables(thd = (nil), start = (nil), counter = (nil), flags = 0, prelocking_strategy = (nil)), line 5159 in "sql_base.cc"
  [18] open_normal_and_derived_tables(thd = (nil), tables = (nil), flags = 0), line 5856 in "sql_base.cc"
  [19] mysql_insert(thd = (nil), table_list = (nil), fields = CLASS, values_list = CLASS, update_fields = CLASS, update_values = CLASS, duplic = DUP_ERROR, ignore = false), line 737 in "sql_insert.cc"
  [20] mysql_execute_command(thd = (nil)), line 3457 in "sql_parse.cc"
  [21] mysql_parse(thd = (nil), rawbuf = (nil), length = 0, parser_state = (nil)), line 6393 in "sql_parse.cc"
  [22] dispatch_command(command = COM_SLEEP, thd = (nil), packet = (nil), packet_length = 0), line 1340 in "sql_parse.cc"
  [23] do_command(thd = (nil)), line 1037 in "sql_parse.cc"
  [24] do_handle_one_connection(thd_arg = (nil)), line 982 in "sql_connect.cc"
  [25] handle_one_connection(arg = (nil)), line 898 in "sql_connect.cc"
  [26] pfs_spawn_thread(arg = (nil)), line 1860 in "pfs.cc"
  [27] _thrp_setup(0xfe2ec240), at 0xfe52444c 
  [28] _lwp_start(0x1b, 0xb, 0x0, 0xfe522168, 0xfcb04bd8, 0x9011efc), at 0xfe5246f0 

Below, we find the binlog thread in its initial steps, waiting for the first
events to arrive. Immediately before arriving at this place, it has unref'ed
all shares in the 'open_tables'

current thread: t@23
=>[1] __lwp_park(0x0, 0xfcbce838, 0x0, 0x0), at 0xfe524749 
  [2] cond_sleep_queue(0x9531494, 0x953147c), at 0xfe51de00 
  [3] cond_wait_queue(0x9531494, 0x953147c, 0xfcbce838, 0xfe51e32b), at 0xfe51e0e3 
  [4] cond_wait_common(0x9531494, 0x953147c, 0xfcbce838, 0xfe51e749), at 0xfe51e501 
  [5] __cond_timedwait(0x9531494, 0x953147c, 0xfcbcee6c, 0xfe51e834), at 0xfe51e7a1 
  [6] cond_timedwait(0x9531494, 0x953147c, 0xfcbcee6c, 0xfe51e87b), at 0xfe51e845 
  [7] pthread_cond_timedwait(0x9531494, 0x953147c, 0xfcbcee6c, 0x0, 0x12, 0x0), at 0xfe51e893 
  [8] Ndb_binlog_thread::do_run(this = (nil)), line 6452 in "ha_ndbcluster_binlog.cc"
  [9] Ndb_component::run_impl(this = (nil)), line 93 in "ndb_component.cc"
  [10] Ndb_component_run_C(arg = (nil)), line 52 in "ndb_component.cc"
  [11] _thrp_setup(0xfe2ea240), at 0xfe52444c 
  [12] _lwp_start(0x0, 0xfcbce838, 0x0, 0x0, 0xfe51dd04, 0x1), at 0xfe5246f0 

How to repeat:
Run the new testcase 'ndb_binlog_restart.test' which will be pushed as part of
fix for bug#22204186.
[11 Jan 2016 17:11] Jon Stephens
Updated changelog entry to note that this bug is also fixed in NDB 7.2.23/7.3.12/7.4.9. Closed.