Bug #68634: DROP/CREATE TABLE running into timeouts on certain API nodes, but not all
Submitted: 11 Mar 2013 10:59    Modified: 14 Mar 2013 13:37
Reporter: Hartmut Holzgraefe    Status: Verified
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S3 (Non-critical)
Version: 5.5.22-ndb-7.2.6-log    OS: Linux
Assigned to:    CPU Architecture: Any

[11 Mar 2013 10:59] Hartmut Holzgraefe
Description:
CREATE / DROP TABLE commands run into timeouts on some API nodes, producing this error message:

  NDB: Releasing global schema lock (266)Time-out in NDB, probably caused by deadlock

This situation persists on the affected nodes while the same operation still executes without problems on other API nodes in that cluster. The situation even seems to persist across API node restarts.
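
For illustration, the contrast looks roughly like this (a hypothetical sketch; the table and database names are placeholders mirroring the anonymized logs below):

  -- Issued on an affected API node: hangs for up to ~120 seconds, then the error log
  -- shows "NDB: Releasing global schema lock (266)Time-out in NDB, probably caused by deadlock"
  DROP TABLE some_db.some_table;

  -- A comparable DDL statement issued on an unaffected API node of the same cluster
  -- completes immediately.
  CREATE TABLE some_db.some_other_table (id INT PRIMARY KEY) ENGINE=NDBCLUSTER;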

Below are some (anonymized) log snippets related to this situation:

* The first real timeout incident started with

130305 3:00:32 [Note] NDB drop table: waiting max 119 sec for distributing ./some_db/some_table. epochs: (0/0,0/0,19124125/0) injector proc_info: Waiting for event from ndbcluster map: 90814804040000
[...]
130305 3:02:30 [Note] NDB drop table: waiting max 1 sec for distributing ./some_db/some_table. epochs: (0/0,0/0,19124125/0) injector proc_info: Waiting for event from ndbcluster map: 90814804040000
130305 3:02:31 [ERROR] NDB drop table: distributing ./some_db/some_table timed out. Ignoring...
130305 3:02:31 [Warning] NDB: Releasing global schema lock (266)Time-out in NDB, probably caused by deadlock
130305 3:02:32 [Note] NDB Binlog: CREATE TABLE Event: REPL$some_db/some_table

followed by

130305 3:02:33 [Note] NDB create table: waiting max 119 sec for distributing ./some_db/some_table. epochs: (0/0,0/0,19124125/0) injector proc_info: Waiting for event from ndbcluster map: 90814804040000

and so on

There was an earlier, isolated incident on 121002, but there the "waiting max ... sec for distributing"
messages only lasted for a few seconds; after that the operation succeeded without a timeout.

* In the cluster log I can see "Event buffer status" messages:

2013-03-05 03:00:31 [MgmtSrvr] INFO -- Node 50: Event buffer status: used=2196B(39%) alloc=5524B(0%) max=0B apply_epoch=19136430/9 latest_epoch=19136430/12
2013-03-05 03:00:31 [MgmtSrvr] INFO -- Node 10: Event buffer status: used=2216B(55%) alloc=4008B(0%) max=0B apply_epoch=19136430/9 latest_epoch=19136430/12
2013-03-05 03:00:31 [MgmtSrvr] INFO -- Node 50: Event buffer status: used=2440B(43%) alloc=5640B(0%) max=0B apply_epoch=19136430/9 latest_epoch=19136430/13
2013-03-05 03:00:32 [MgmtSrvr] INFO -- Node 50: Event buffer status: used=2440B(43%) alloc=5640B(0%) max=0B apply_epoch=19136430/9 latest_epoch=19136430/14
2013-03-05 03:00:32 [MgmtSrvr] INFO -- Node 50: Event buffer status: used=2440B(43%) alloc=5640B(0%) max=0B apply_epoch=19136430/9 latest_epoch=19136431/0
2013-03-05 03:00:32 [MgmtSrvr] INFO -- Node 50: Event buffer status: used=2440B(43%) alloc=5640B(0%) max=0B apply_epoch=19136430/9 latest_epoch=19136431/1

Similar event buffer messages are logged at 03:02:32, the time at which the
second timeout sequence started on api_node_6.
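
For reference, the thresholds that make a SQL node emit these "Event buffer status" reports can be inspected directly on the API nodes; a minimal sketch, assuming the usual ndb_report_thresh_* variables are available in this build:

  -- Report thresholds: epoch lag and event buffer memory usage at which a
  -- SQL node logs "Event buffer status" lines.
  SHOW GLOBAL VARIABLES LIKE 'ndb_report_thresh_binlog%';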

* In the ndb_3_out.log file I can see several messages like

2013-03-05 03:03:38 [ndbd] ERROR -- c_gcp_list.seize() failed: gci: 82191105516371971 nodes: 00000000000000000000000000000000000000000000000024040000090a1480

and

2013-03-05 03:03:38 [ndbd] WARNING -- ACK wo/ gcp record (gci: 19136608/3) ref: 0fa20032 from: 0fa20032

but as these only start at 3:03:38, over a minute after the second incident started,
and as messages like this have also appeared on previous occasions in ndb_3_out.log
without any related CREATE/DROP timeouts, they are probably unrelated to the
problem at hand ...

* In the error log from api_node_10 I can see two incidents that eventually ran into a timeout:

130214 15:25:05 [Note] NDB drop table: waiting max 119 sec for distributing ./some_db/other_table. epochs: (0/0,0/0,17245691/7) injector proc_info: Waiting for event from ndbcluster map: 900018011000100

130214 15:31:25 [Note] NDB create table: waiting max 119 sec for distributing ./some_db/other_table. epochs: (0/0,0/0,17245691/7) injector proc_info: Waiting for event from ndbcluster map: 900018011000100

Unfortunately the cluster log doesn't reach back far enough to check whether
these were also accompanied by "Event buffer status" messages ...

How to repeat:
No idea ...
[11 Mar 2013 11:12] Hartmut Holzgraefe
API nodes are using multiple NDB connections here:
 
  ndb-cluster-connection-pool=8

Maybe the problem is related to connection pool handling?
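
To rule out a configuration mismatch, the pool size the affected mysqld is actually running with can be checked; a minimal sketch, assuming the standard variable and status names:

  -- Configured connection pool size and the node ID of the first connection in the pool.
  SHOW GLOBAL VARIABLES LIKE 'ndb_cluster_connection_pool';
  SHOW GLOBAL STATUS LIKE 'Ndb_cluster_node_id';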
[14 Mar 2013 12:19] MySQL Verification Team
Hello Hartmut,

Thank you for the report.
Verified as described.

Thanks,
Umesh
[14 Mar 2013 12:27] MySQL Verification Team
error log contents - 7.2.6/10

Attachment: 68634.txt (text/plain), 131.61 KiB.

[14 Mar 2013 13:37] Hartmut Holzgraefe
Hi Umesh,

How did you verify it? Are you able to reproduce it at will?
[2 Apr 2015 16:54] Odis Yuck
Hello. I am now running into the same bug. It reproduces even after a restart with "--initial" and recovery from an NDB backup. Recently I tried a new master-master replication setup with the NDB_EPOCH2_TRANS function.
I had problems in both clusters, but after removing all data in /var/lib/mysql and starting the data nodes with "--initial", one cluster works fine.
So I can work around it now with this rather odd method, but that was just an experiment. I don't want to face such a problem in production.
Is there any progress on solving this issue?
[9 Feb 2018 19:24] Dev Das
Hi,
We are running MySQL Cluster (Server version: 5.7.16-ndb-7.5.4-cluster-gpl-log MySQL Cluster Community Server (GPL)) and intermittently get the following message in the MySQL error logs:
[Warning] NDB: Failed to release global schema lock, error: (266)Time-out in NDB, probably caused by deadlock.
As mentioned, this does not always happen, but it does happen with two consecutive requests:
2018-02-09T16:26:20.070589Z	DROP TABLE IF EXISTS temp_account_pool
2018-02-09T16:26:20.071043Z	CREATE TABLE temp_account_pool (...)
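
For context, a self-contained reconstruction of that back-to-back pattern (the column list is elided in the report, so the definition below is made up purely for illustration):

  -- Hypothetical reconstruction; the real columns are not shown in the report.
  DROP TABLE IF EXISTS temp_account_pool;
  CREATE TABLE temp_account_pool (
    id      INT NOT NULL,
    account VARCHAR(64),
    PRIMARY KEY (id)
  ) ENGINE=NDBCLUSTER;
  -- Intermittently this sequence is accompanied by the
  -- "(266)Time-out in NDB, probably caused by deadlock" warning in the error log.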

Is there a proper solution to this?
Thanks in advance for any suggestions.