Bug #54595 Node startup fails if other nodes are in a hung shutdown state
Submitted: 17 Jun 2010 19:25
Modified: 2 Sep 2016 16:13
Reporter: Andrew Hutchings
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.1-telco-6.3
OS: Any
Assigned to:
CPU Architecture: Any
Triage: Triaged: D3 (Medium) / R6 (Needs Assessment) / E6 (Needs Assessment)

[17 Jun 2010 19:25] Andrew Hutchings
Description:
Related to bug #54594: if nodes are subsequently started while another node is in a hung shutdown state, these nodes will fail to start, citing that the hung node has failed.

How to repeat:
.
[19 Jul 2010 11:09] Frazer Clement
Some time was spent trying to create scenarios 'similar' to the hang which was observed (bug #54594).

These are created by recompiling with sleep(10000) in various places in the Ndbd error handling / shutdown code.

Placing a sleep(10000) in ErrorReporter::handleError(), between the WriteMessage and g_eventLogger->info() calls, is somewhat similar to the potential lock wait in the g_eventLogger->info() calls. This results in:
  - Main thread blocks in handleError()
  - The Watchdog reports problems and then also blocks in handleError()
    (It would block attempting to report the problems if the g_eventLogger was the problem)
  - Transporter infrastructure still running as that is not shut down until later in NdbShutdown()
    - SocketServer thread (accepting connections)
    - Start clients thread (originating connections)
    - Ndbd and Mgmd connections are in state CONNECTED according to the TransporterRegistry
  - Heartbeat listener eventually declares the node dead and informs others
    - They close communications and disconnect (according to Cluster log)
    - Remote disconnects not handled/actioned locally as that occurs as part of performReceive() in main thread
    - Sockets enter CLOSE_WAIT state, waiting for socket close calls from main thread.
  - NdbApi clients can connect to the cluster
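
The thread behaviour listed above can be modelled outside the cluster code. The following is a minimal standalone sketch, not the Ndbd source: HangPoint, HangObservation and simulateHungShutdown are hypothetical names, the injected sleep(10000) is replaced by a wait on a release flag so the demo terminates, and the "transporter" is only a counter loop standing in for the SocketServer thread.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

using namespace std::chrono_literals;

// Hypothetical stand-in for the error handler with the injected sleep.
// Callers stay inside handleError() until `release` is set.
struct HangPoint {
    std::atomic<int> blocked{0};       // threads that have entered the handler
    std::atomic<bool> release{false};  // set to true to end the simulated hang

    void handleError() {
        ++blocked;
        while (!release.load()) {
            std::this_thread::sleep_for(1ms);
        }
    }
};

struct HangObservation {
    int threadsBlocked;      // threads stuck in the handler when sampled
    long acceptsDuringHang;  // transporter iterations completed meanwhile
};

HangObservation simulateHungShutdown() {
    HangPoint hang;
    std::atomic<long> accepts{0};
    std::atomic<bool> stopTransporter{false};

    // Transporter stand-in: keeps running because nothing has shut it down.
    std::thread transporter([&] {
        while (!stopTransporter.load()) {
            ++accepts;
            std::this_thread::sleep_for(1ms);
        }
    });

    // Main thread stand-in: hits an error and blocks in the handler.
    std::thread mainThread([&] { hang.handleError(); });

    // Watchdog stand-in: once the main thread is stuck, it reports through
    // the same handler and blocks there as well.
    std::thread watchdog([&] {
        while (hang.blocked.load() < 1) {
            std::this_thread::sleep_for(1ms);
        }
        hang.handleError();
    });

    // Wait until both threads are stuck, then sample transporter progress.
    while (hang.blocked.load() < 2) {
        std::this_thread::sleep_for(1ms);
    }
    const long before = accepts.load();
    std::this_thread::sleep_for(50ms);
    HangObservation obs{hang.blocked.load(), accepts.load() - before};

    // End the simulated hang so every thread can be joined.
    hang.release = true;
    stopTransporter = true;
    mainThread.join();
    watchdog.join();
    transporter.join();
    return obs;
}
```

Calling simulateHungShutdown() should report two blocked threads while the transporter counter keeps advancing, which mirrors the observation that the process looks alive at the socket level even though nothing is servicing the connections.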

Hard killing (-9) and then starting a node in another node group succeeds without any problems.

I have not so far managed to reproduce a situation where a starting node fails to start due to a node being hung like this.  The scenario appeared to be something like :
 - Starting node attempts to connect to hung node
 - Hung node disconnects (or starting node gives up on connect attempt)
 - Starting node treats as 'another node failed during startup' and fails to start

For further investigation :
-------------------------
- Perhaps the hung node's role as client/server in the connection setup process affects the behaviour?
- Perhaps a more realistic reproduction has different behaviour (e.g. block one thread while holding the g_eventLogger mutex)?
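
The second idea (one thread blocked while holding the logger mutex) could be prototyped along these lines. This is only a sketch under assumptions: GuardedLogger is a hypothetical logger protected by a std::timed_mutex, not the real g_eventLogger, and the timed lock attempt exists purely so the probe can observe the block and return instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

// Hypothetical logger guarded by a single mutex (stand-in for g_eventLogger).
struct GuardedLogger {
    std::timed_mutex lock;
    int messages = 0;

    // Returns false if the lock could not be taken within `timeout`,
    // i.e. the caller would have blocked on a plain lock().
    bool info(std::chrono::milliseconds timeout) {
        std::unique_lock<std::timed_mutex> guard(lock, timeout);
        if (!guard.owns_lock()) {
            return false;
        }
        ++messages;
        return true;
    }
};

struct LoggerProbe {
    bool whileHeld;     // did info() succeed while another thread held the lock?
    bool afterRelease;  // did info() succeed once the lock was released?
};

LoggerProbe probeLoggerHang() {
    GuardedLogger logger;
    std::atomic<bool> held{false};
    std::atomic<bool> release{false};

    // Holder stand-in: takes the logger lock and then stalls with it held.
    std::thread holder([&] {
        std::lock_guard<std::timed_mutex> guard(logger.lock);
        held = true;
        while (!release.load()) {
            std::this_thread::sleep_for(1ms);
        }
    });
    while (!held.load()) {
        std::this_thread::sleep_for(1ms);
    }

    LoggerProbe probe{};
    // Reporter stand-in (e.g. a watchdog) tries to log during the stall.
    probe.whileHeld = logger.info(20ms);

    release = true;
    holder.join();
    probe.afterRelease = logger.info(20ms);
    return probe;
}
```

In a real reproduction the reporter would use an untimed lock and simply never return; the timeout here only makes the blocked call visible as a false return value.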

Investigation continues.
[2 Sep 2016 16:13] Bogdan Kecman
While this can be fairly easily reproduced with 6.3, I can't for the life of me reproduce it with 7.4, so setting to "can't reproduce". It seems we fixed this sometime in the past few years.