MySQL Bugs: #54595: Node startup fails if other nodes are in a hung shutdown state

Bug #54595	Node startup fails if other nodes are in a hung shutdown state
Submitted:	17 Jun 2010 19:25	Modified:	2 Sep 2016 16:13
Reporter:	Andrew Hutchings
Status:	Can't repeat
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.1-telco-6.3	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Related to bug #54594: if nodes are started while another node is in a hung shutdown state, the starting nodes fail to start, citing the hung node as having failed.

How to repeat:
.

Some time was spent trying to create scenarios similar to the hang observed in bug #54594.

These were created by recompiling with sleep(10000) in various places in the Ndbd error handling / shutdown code.

Placing a sleep(10000) in ErrorReporter::handleError() between the WriteMessage and g_eventLogger->info() calls is somewhat similar to the potential lock wait in the g_eventLogger->info() calls. This results in:
  - Main thread blocks in handleError()
  - The Watchdog reports problems and then also blocks in handleError()
    (It would block attempting to report the problems if the g_eventLogger was the problem)
  - Transporter infrastructure still running, as that is not shut down until later in NdbShutdown()
    - SocketServer thread (accepting connections)
    - Start clients thread (originating connections)
    - Ndbd and Mgmd connections are in state CONNECTED according to the TransporterRegistry
  - Heartbeat listener eventually declares the node dead and informs others
    - They close communications and disconnect (according to Cluster log)
    - Remote disconnects not handled/actioned locally, as that occurs as part of performReceive() in the main thread
    - Sockets enter CLOSE_WAIT state, waiting for socket close calls from the main thread.
  - NdbApi clients can connect to the cluster
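
The CLOSE_WAIT observation can be illustrated outside of NDB with plain TCP sockets. The following is only a minimal sketch, not Ndbd code: the function name and the loopback setup are made up for illustration. It shows that when the remote side closes and the local owner never reads or closes the socket (as with a blocked main thread), the local descriptor stays open, and the end-of-stream is only seen once the owner finally calls recv().

```python
import socket


def remote_close_unnoticed():
    """Toy model: the peer disconnects while the local owner is 'hung'."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # any free loopback port
    listener.listen(1)

    # 'peer' plays the healthy remote node, 'hung_side' the hung node's socket.
    peer = socket.create_connection(listener.getsockname())
    hung_side, _ = listener.accept()

    # The remote node gives up and closes its end of the connection.
    peer.close()

    # The hung side does nothing here. Its descriptor is still open; on most
    # systems a tool such as netstat would report this socket as CLOSE_WAIT
    # until the owner closes it.
    still_open = hung_side.fileno() != -1

    # Only when the owning thread finally reads does it observe end-of-stream.
    hung_side.settimeout(2.0)
    data = hung_side.recv(1)

    hung_side.close()
    listener.close()
    return still_open, data
```

In this model the disconnect is only acted upon at the recv() call, which mirrors the report's point that remote disconnects are handled as part of the receive processing in the main thread.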

Hard killing (-9) and then starting a node in another node group succeeds without any problems.

So far I have not managed to reproduce a situation where a starting node fails to start due to a node being hung like this. The scenario appeared to be something like:
 - Starting node attempts to connect to hung node
 - Hung node disconnects (or starting node gives up on connect attempt)
 - Starting node treats this as 'another node failed during startup' and fails to start

For further investigation :
-------------------------
- Perhaps the hung node's role as client/server in the connection setup process affects the behaviour?
- Perhaps a more realistic reproduction has different behaviour (e.g. block one thread while holding the g_eventLogger mutex)?
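
The second idea (one thread blocked while holding the logger mutex) can be sketched as a toy model. This is not the Ndbd implementation: logger_mutex is a hypothetical stand-in for whatever lock g_eventLogger uses internally, and log_info() uses a timeout only so that the demonstration reports the stall instead of hanging.

```python
import threading


def simulate_logger_mutex_hang(wait=0.2):
    """Toy model: a thread stalls while holding the logger's mutex."""
    logger_mutex = threading.Lock()    # hypothetical stand-in for the logger lock
    holder_ready = threading.Event()
    release = threading.Event()
    logged = []

    def stuck_holder():
        # Takes the mutex and then blocks, like a thread hung inside logging.
        with logger_mutex:
            holder_ready.set()
            release.wait()

    def log_info(msg, timeout):
        # A real logger would block indefinitely; the timeout is for the demo.
        if logger_mutex.acquire(timeout=timeout):
            try:
                logged.append(msg)
                return True
            finally:
                logger_mutex.release()
        return False

    holder = threading.Thread(target=stuck_holder)
    holder.start()
    holder_ready.wait()

    # Any other thread that tries to log now stalls behind the holder.
    blocked = not log_info("shutdown message", timeout=wait)

    # Once the holder is released, logging works again.
    release.set()
    holder.join()
    recovered = log_info("after release", timeout=wait)
    return blocked, recovered, logged
```

Compared with a plain sleep in the error handler, this variant stalls every thread that tries to log, not only the threads that reach the error handler, which may be closer to the originally observed hang.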

Investigation continues.

While this can be fairly easily reproduced with 6.3, I can't for the life of me reproduce it with 7.4, so setting to "Can't repeat". It seems we fixed this at some point in the past few years.