Bug #54595 | Node startup fails if other nodes are in a hung shutdown state | ||
---|---|---|---|
Submitted: | 17 Jun 2010 19:25 | Modified: | 2 Sep 2016 16:13 |
Reporter: | Andrew Hutchings | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | mysql-5.1-telco-6.3 | OS: | Any |
Assigned to: | | CPU Architecture: | Any |
[17 Jun 2010 19:25]
Andrew Hutchings
[19 Jul 2010 11:09]
Frazer Clement
Some time spent trying to create 'similar' scenarios to the hang which was observed (bug #54594). These are created by recompiling with sleep(10000) in various places in the Ndbd error handling / shutdown code.

Placing a sleep(10000) in the ErrorReporter::handleError() between WriteMessage and the g_eventLogger->info() calls is somewhat similar to the potential lock wait in the g_eventLogger->info() calls. This results in :

- Main thread blocks in handleError()
- The Watchdog reports problems and then also blocks in handleError() (It would block attempting to report the problems if the g_eventLogger was the problem)
- Transporter infrastructure still running as that is not shutdown until later in NdbShutdown()
  - SocketServer thread (accepting connections)
  - Start clients thread (originating connections)
- Ndbd and Mgmd connections are in state CONNECTED according to the TransporterRegistry
- Heartbeat listener eventually declares the node dead and informs others
- They close communications and disconnect (according to Cluster log)
- Remote disconnects not handled/actioned locally as that occurs as part of performReceive() in main thread
- Sockets enter CLOSE_WAIT state, waiting for socket close calls from main thread.
- NdbApi clients can connect to the cluster

Hard killing (-9) and then starting a node in another node group succeeds without any problems.

I have not so far managed to reproduce a situation where a starting node fails to start due to a node being hung like this.

The scenario appeared to be something like :

- Starting node attempts to connect to hung node
- Hung node disconnects (or starting node gives up on connect attempt)
- Starting node treats as 'another node failed during startup' and fails to start

For further investigation :
-------------------------

- Perhaps the hung node's role as client/server in the connection setup process affects the behaviour?
- Perhaps a more realistic reproduction has different behaviour (e.g. block one thread while holding the g_eventLogger mutex)?

Investigation continues.
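As a rough illustration of the last point, here is a minimal Python sketch of one thread hanging while it holds a logger mutex, with a watchdog that then cannot report the stall. It is an analogue only, not Ndbd code: `event_logger_lock`, `log_info` and `run_scenario` are hypothetical names standing in for the g_eventLogger mutex and its callers. The watchdog uses a timeout so that the script terminates, which is an assumption of the sketch; a real blocked thread would presumably wait indefinitely.

```python
import threading

# Hypothetical stand-in for the mutex guarding g_eventLogger; not NDB code.
event_logger_lock = threading.Lock()


def log_info(message, timeout):
    """Try to log; return False if the logger lock is not acquired in time."""
    if not event_logger_lock.acquire(timeout=timeout):
        return False
    try:
        print(message)
        return True
    finally:
        event_logger_lock.release()


def run_scenario(timeout=0.2):
    """Hold the logger lock in a 'hung' thread and let a watchdog try to log."""
    holder_has_lock = threading.Event()
    release_holder = threading.Event()
    result = {}

    def hung_main_thread():
        # Simulates a thread stuck in error handling while holding the lock.
        with event_logger_lock:
            holder_has_lock.set()
            release_holder.wait()

    def watchdog():
        # Wait until the hang is in place, then try to report it.
        holder_has_lock.wait()
        result["reported"] = log_info("watchdog: main thread is stuck", timeout)

    main_thread = threading.Thread(target=hung_main_thread)
    watchdog_thread = threading.Thread(target=watchdog)
    main_thread.start()
    watchdog_thread.start()

    watchdog_thread.join()
    release_holder.set()  # end the simulated hang so the script can exit
    main_thread.join()
    return result["reported"]


if __name__ == "__main__":
    print("watchdog managed to report:", run_scenario())
```

In this sketch the watchdog's report fails for as long as the lock is held, and logging works again once the simulated hang is released.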
[2 Sep 2016 16:13]
MySQL Verification Team
While this can be fairly easily reproduced with 6.3, I can't for the life of me reproduce it with 7.4, so setting to "can't reproduce" ... seems we fixed this sometime in the past few years.