Bug #79311 NDB binlog injector use incorrect mutex in condition signaling
Submitted: 17 Nov 2015 11:25 Modified: 24 Nov 2015 11:22
Reporter: Ole John Aske Email Updates:
Status: Closed Impact on me:
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:7.2.22 OS:Any
Assigned to: CPU Architecture:Any

[17 Nov 2015 11:25] Ole John Aske
The patch for Bug#20957068 required locking the injector_mutex in the
binlog-thread loop before calling pollEvents(). This had the effect of
slowing down the distribution of schema operations.

Investigating this, we find that the 'injector_mutex'
is also used as an argument to pthread_cond_timedwait(), both while
dropping (ndbcluster_handle_drop_table()) and while creating
a table (ndbcluster_log_schema_op()). The intention of this
condition wait is to wait for the schema change to be distributed
to all mysqlds.

However, as the binlog injector thread now pretty much monopolizes
the injector_mutex while polling for events, the waiting thread
had to wait a long time to re-acquire the injector_mutex after it
had been signaled - thus the delay.

Inspecting the two pthread_cond_timedwait() calls, it turns out that the
conditions being waited for do not need the protection of the
injector_mutex at all:

1) In ndbcluster_log_schema_op() we wait for the condition:
'bitmap_is_clear_all(&ndb_schema_object->slock_bitmap)' which is
set by ::handle_clear_slock(), and protected by ndb_schema_object->mutex.

2) In ndbcluster_handle_drop_table() we wait for the condition
'share->op==NULL' which is protected by 'share->mutex'.

... So using the injector_mutex in the pthread_cond_timedwait() calls is
simply wrong, in addition to creating unnecessary contention on the
injector_mutex.

How to repeat:
ndb_binlog_variant_ddl.test shows a significant (~50%) improvement in latency
when this mutex contention is removed.

Suggested fix:
Use the mutex that actually protects the wait condition as the argument to pthread_cond_*().
[24 Nov 2015 11:22] Jon Stephens
Fixed in NDB 7.2.23, 7.3.12, and 7.4.9. Closed.

See BUG#20957068 for documentation notes.