Bug #88627 MySQL Cluster is wiping database data during initialization of one ndb node
Submitted: 23 Nov 2017 15:23    Modified: 18 Dec 2017 16:20
Reporter: Zeljko Zuvic
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: ndb-7.4.7    OS: CentOS (6.6)
Assigned to: MySQL Verification Team    CPU Architecture: Any
Tags: wiping data

[23 Nov 2017 15:23] Zeljko Zuvic
Description:
Hi,

We have a MySQL Cluster with real-time replication, one management node and two data nodes.
A RAM module on the first data node failed, and the node restarted by itself. This happened overnight during the full cluster backup, and since we have passphrase protection on the data disk, the passphrase must be entered manually after a server reboot.
So after a while we had the second data node operational and the first out of sync. When we started cluster recovery, we restarted the first node with "--initial" so it would fetch data from the operational second node (this is a production environment), and at the end all database data was wiped on both nodes.
We have had this situation twice on production (the second time a UPS failure caused the ndb cluster to crash).
It may be worth mentioning that we have binary logging enabled on the first node. The strange thing is that I cannot find any message in the ndb_x_out or mysqld logs saying that the node is wiping data or anything similar.
I am now stuck troubleshooting the problem, and skeptical because the logs do not correspond to the behavior I am experiencing.

Could you please take a look at the logs I have attached and confirm whether this is a bug, or whether I am wrong about this mystery.

Thanks,

Zeljko

How to repeat:
Every time one data node had a sudden hard restart, initializing that node wiped the database data on both nodes instead of fetching the data from the other node (from the cluster).
[5 Dec 2017 11:41] MySQL Verification Team
Hi,
This is very weird behavior. Can you tell me exactly how you start the nodes after the crash? You are not using MCM, right?

thanks
Bogdan
[5 Dec 2017 12:29] Zeljko Zuvic
Hi Bogdan,

After a crash of one node of the MySQL Cluster, we first initialize the ndb file system on that node:
../mysql/bin/ndbd --initial
then, monitoring the ndb log, once start phase 101 has been reached we wait a few more minutes and start the mysql service on that node:
#service mysql start
After the service starts, I check whether the DB content was successfully replicated from the healthy node, but apart from the DB name all tables are wiped on both data nodes (in our case).
During this process the healthy node and the management (arbitrator) node are up and running.
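
Put together, a sketch of that procedure (the ndb_mgm status check is added here for illustration; it reports each data node's start phase):

  # on the crashed data node: rebuild its ndb file system from scratch
  ../mysql/bin/ndbd --initial

  # from the management client: wait until the node reports "Started"
  # (start phase 101 completed), plus a few extra minutes
  ndb_mgm -e "ALL STATUS"

  # then start the SQL node on the recovered host
  service mysql start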

Please let me know if you need any further information from our side.

Thanks,

Zeljko
[5 Dec 2017 12:35] MySQL Verification Team
Hi Zeljko, 

So, you have one node running properly with data on it; you connect to mysqld, check the data, and see it; then you start the second ndbd with --initial, and after it moves to the "running" state all your data is gone?!

And you have seen the same behavior twice already? On 7.4.xx?

thanks
Bogdan
[5 Dec 2017 12:52] Zeljko Zuvic
Bogdan,

That is right, and it happened twice on our production MySQL Cluster during an unplanned crash of one of the data nodes.
But I have to say that I tried to simulate this behavior (by powering off the VM of one node) on our test environment, with the exact same configuration and the same MySQL Cluster version but reduced resources and a significantly smaller DB, and everything worked perfectly without any loss.

Thanks,

Zeljko
[6 Dec 2017 15:35] MySQL Verification Team
Hi,
Weird. I just tried a medium-sized setup (32G RAM per node) and was not able to reproduce the issue. Let me discuss this with colleagues before I get back to you; this really does not make much sense, and it is scaring me.

all best
Bogdan
[6 Dec 2017 15:54] Zeljko Zuvic
Hi Bogdan,

It may be worth mentioning that we have binary logs enabled on the first data node, and also that after the crash the second node had a lot of DB write activity before we started recovery of the first node.

Thank you for your time spent on this case.

Best regards,

Zeljko
[6 Dec 2017 16:29] MySQL Verification Team
Hi,
Few questions,
- any chance the missing data was innodb/myisam and not ndbcluster data?
- any chance you started the "broken" node first?
 
The potential sequence that can lead to this type of situation is:
1. stop the cluster (crash, shutdown, whatever..)
2. destroy the data on node1 and start it (rm -rf datadir, or ndbd --initial)

at this point you have only node1 running, and it's empty

3. start node2, which used to have the good data

in this case the cluster will wipe the data from node2 (expected behavior), as the "latest valid data" is the empty cluster (node1). See the sketch below.
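
A sketch of that sequence in shell terms (the datadir path is a placeholder; this illustrates the dangerous sequence, it is not a recommendation):

  # cluster fully stopped (crash, shutdown, ...)

  # node1: destroy its data and start it empty
  ndbd --initial            # or: rm -rf <datadir> && ndbd
  # (for node1 to complete the restart alone, it must be allowed to
  #  start without node2, e.g. via --nowait-nodes)

  # only node1 is running now, and it is empty

  # node2: joins the running (empty) cluster; its good data is
  # discarded as it syncs to the "latest valid" state
  ndbd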

any chance your scenario can fit into this?

all best
bogdan
[7 Dec 2017 14:09] MySQL Verification Team
Hi,

additional question - are you maybe using NOLOGGING tables?
https://dev.mysql.com/doc/refman/5.7/en/create-table-ndb-table-comment-options.html

because nologging tables will survive a node crash but will not survive a cluster crash (what you had), since nologging tables are not included in the redo logs, so they cannot be consistently restored with the other data, and hence they are not included in the LCP either
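
For illustration, a hypothetical table created like this ends up nologging (the NDB_TABLE comment syntax is from the manual page above; tables created while the session variable ndb_table_no_logging is set to 1 behave the same way):

  mysql -e 'CREATE TABLE test.t1 (
      id BIGINT NOT NULL PRIMARY KEY,
      payload VARCHAR(255)
  ) ENGINE=NDBCLUSTER COMMENT="NDB_TABLE=NOLOGGING=1";'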

all best
Bogdan
[14 Dec 2017 2:29] MySQL Verification Team
Timeline from the cluster log (ndb_1_cluster.log.5)

1.  Cluster ok
e.g.
  2017-07-04 09:07:03 [MgmtSrvr] INFO     -- Node 3: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
  2017-07-04 09:07:03 [MgmtSrvr] INFO     -- Node 2: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
  2017-07-04 09:07:03 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 222 completed

2.  MGMD is killed and restarted
  2017-07-04 17:11:29 [MgmtSrvr] INFO     -- Loaded config from '/var/lib/mysql-cluster//ndb_1_config.bin.1'
  2017-07-04 17:11:29 [MgmtSrvr] INFO     -- Id: 1, Command port: *:1186
  2017-07-04 17:11:29 [MgmtSrvr] INFO     -- MySQL Cluster Management Server mysql-5.6.25 ndb-7.4.7 started
  2017-07-04 17:11:29 [MgmtSrvr] INFO     -- Node 1: Node 1 Connected
  2017-07-04 17:11:29 [MgmtSrvr] INFO     -- Node 1 connected

3.  Nodes connect for System Restart (SR)
  2017-07-04 17:15:52 [MgmtSrvr] INFO     -- Node 2: System Restart: master node: 2, num starting: 2, gci: 6852095
  ...
  2017-07-04 17:16:03 [MgmtSrvr] INFO     -- Node 2: LDM instance 0: Restored LCP : 14 fragments, 2117 rows, 60 millis, 35283 rows/s
  2017-07-04 17:16:03 [MgmtSrvr] INFO     -- Node 3: LDM instance 0: Restored LCP : 14 fragments, 2117 rows, 91 millis, 23263 rows/s

4.  Node 3 has a problem with REDO 'corruption' :
  2017-07-04 17:16:03 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2310: 'Error while reading the REDO log(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

5.  Another attempt at SR, fails in the same way :
  2017-07-04 17:32:07 [MgmtSrvr] INFO     -- Node 2: System Restart: master node: 2, num starting: 2, gci: 6852095
  ...
  2017-07-04 17:32:13 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2310: 'Error while reading the REDO log(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

6.  Another attempt at SR, but node 3 has initial specified from somewhere
  2017-07-04 17:33:16 [MgmtSrvr] INFO     -- Node 2: Start phase 0 completed
  2017-07-04 17:33:16 [MgmtSrvr] INFO     -- Node 2: Communication to Node 3 opened
  ...
  2017-07-04 17:34:18 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 opened
  2017-07-04 17:34:18 [MgmtSrvr] INFO     -- Node 3: Initial start, waiting for 2 to connect,  nodes [ all: 2 and 3 connected: 3 no-wait:  ]
  2017-07-04 17:34:18 [MgmtSrvr] INFO     -- Node 2: Node 3 Connected
  2017-07-04 17:34:18 [MgmtSrvr] INFO     -- Node 3: Node 2 Connected
  2017-07-04 17:34:18 [MgmtSrvr] INFO     -- Node 2: Start with all nodes 2 and 3
  ...
  2017-07-04 17:34:21 [MgmtSrvr] INFO     -- Node 2: System Restart: master node: 1, num starting: 2, gci: 6852095

  Note that the System Restart log line says 'master node: 1', but the master is actually node 2; node 1 is the MGMD.

7.  SR proceeds just recovering node 2 :
   2017-07-04 17:34:27 [MgmtSrvr] INFO     -- Node 2: LDM instance 0: Restored LCP : 14 fragments, 2117 rows, 19 millis, 111421 rows/s

8.  Start of node 2 completes :
   2017-07-04 17:34:28 [MgmtSrvr] INFO     -- Node 2: Start phase 101 completed (system restart)
   2017-07-04 17:34:28 [MgmtSrvr] INFO     -- Node 2: Started (mysql-5.6.25 ndb-7.4.7)

9.  Node 3 begins INR as part of SR to join node 2
   2017-07-04 17:34:28 [MgmtSrvr] INFO     -- Node 2: Start node: 3 using initial node restart as part of system restart

10.  Node 3 copies schema, activates indexes etc ok

11.  Node 3 restores from LCP (but there's none as it's INR)
   2017-07-04 17:39:08 [MgmtSrvr] INFO     -- Node 3: LDM instance 0: Restored LCP : 0 fragments, 0 rows, 0 millis, 0 rows/s

12.  Node 3 copies fragments from Node 2 :
   2017-07-04 17:39:08 [MgmtSrvr] INFO     -- Node 3: LDM(0): CopyFrag complete. 124 frags, +2117/-3279 rows, 376936 bytes/76 ms 4959684 bytes/s.

  Note that 124 fragments are mentioned here, whereas 14 fragments are mentioned in the LCP restore. However, the number of rows added (2117) aligns.

13.  Wait for an LCP to complete :
   2017-07-04 17:39:08 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 224 started. Keep GCI = 6852095 oldest restorable GCI = 6852095
   2017-07-04 17:39:08 [MgmtSrvr] INFO     -- Node 3: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
   2017-07-04 17:39:09 [MgmtSrvr] INFO     -- Node 2: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
   2017-07-04 17:39:09 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 224 completed

14.  Restart completes
   2017-07-04 17:39:12 [MgmtSrvr] INFO     -- Node 3: Start phase 101 completed (initial node restart)
   2017-07-04 17:39:12 [MgmtSrvr] INFO     -- Node 3: Started (mysql-5.6.25 ndb-7.4.7)

15.  APIs connect
   2017-07-04 17:53:57 [MgmtSrvr] INFO     -- Node 3: Node 5: API mysql-5.6.25 ndb-7.4.7
   2017-07-04 17:54:23 [MgmtSrvr] INFO     -- Node 2: Node 4: API mysql-5.6.25 ndb-7.4.7

16.  Restore of metadata from a backup occurs
   2017-07-04 18:46:00 [MgmtSrvr] INFO     -- Alloc node id 4 succeeded
   2017-07-04 18:46:00 [MgmtSrvr] INFO     -- Nodeid 4 allocated for API at 192.168.0.4
   2017-07-04 18:46:00 [MgmtSrvr] INFO     -- Node 4: ndb_restore -b 15 -n 2 -m -e -p 128
   2017-07-04 18:46:17 [MgmtSrvr] INFO     -- Node 4: Restore meta data: backup 15 from node 2 #Tables: 13 #Tablespaces: 0 #Logfilegroups: 0 #datafiles: 0 #undofiles: 0

17.  Restore of data from a backup occurs
   2017-07-04 18:47:08 [MgmtSrvr] INFO     -- Node 4: ndb_restore -b 15 -n 2 -r -e -p 128
   2017-07-04 18:47:09 [MgmtSrvr] INFO     -- Node 4: Restore started: backup 15 from node 2
   2017-07-04 18:47:09 [MgmtSrvr] INFO     -- Node 4: Restore meta data: backup 15 from node 2 #Tables: 13 #Tablespaces: 0 #Logfilegroups: 0 #datafiles: 0 #undofiles: 0
   2017-07-04 18:51:02 [MgmtSrvr] INFO     -- Node 4: Restore data: backup 15 from node 2 #Records: 2564234 Data: 9125230207 bytes
   2017-07-04 18:51:02 [MgmtSrvr] INFO     -- Node 4: Restore log: backup 15 from node 2 #Records: 0 Data: 0 bytes
   2017-07-04 18:51:02 [MgmtSrvr] INFO     -- Node 4: Restore completed: backup 15 from node 2
   ...
   2017-07-04 18:51:49 [MgmtSrvr] INFO     -- Node 5: ndb_restore -b 15 -n 3 -r -e -p 128
   2017-07-04 18:51:50 [MgmtSrvr] INFO     -- Node 5: Restore started: backup 15 from node 3
   2017-07-04 18:51:50 [MgmtSrvr] INFO     -- Node 5: Restore meta data: backup 15 from node 3 #Tables: 13 #Tablespaces: 0 #Logfilegroups: 0 #datafiles: 0 #undofiles: 0
   2017-07-04 19:00:45 [MgmtSrvr] INFO     -- Node 5: Restore data: backup 15 from node 3 #Records: 2545933 Data: 9053962610 bytes
   2017-07-04 19:00:45 [MgmtSrvr] INFO     -- Node 5: Restore log: backup 15 from node 3 #Records: 0 Data: 0 bytes
   2017-07-04 19:00:45 [MgmtSrvr] INFO     -- Node 5: Restore completed: backup 15 from node 3
[14 Dec 2017 2:31] MySQL Verification Team
1) The LCP size in the 'before' case is small relative to the Backup size :

  LCP :
  2017-07-04 09:07:03 [MgmtSrvr] INFO     -- Node 3: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
  2017-07-04 09:07:03 [MgmtSrvr] INFO     -- Node 2: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076

  Backup :
   2017-07-04 18:51:02 [MgmtSrvr] INFO     -- Node 4: Restore data: backup 15 from node 2 #Records: 2564234 Data: 9125230207 bytes
   2017-07-04 19:00:45 [MgmtSrvr] INFO     -- Node 5: Restore data: backup 15 from node 3 #Records: 2545933 Data: 9053962610 bytes

  2) Seems like the cluster restores everything that was included in the LCP (14 frags, 2117 rows).
      Note that the cluster copies the content of 124 fragments as part of CopyFrag.

  3) Therefore it seems like the problem is that the tables are not included in the LCP.

This is why I'm asking if you are using NOLOGGING.
[14 Dec 2017 2:33] MySQL Verification Team
> Any chance that significant difference between last state 
> in bin logs (enabled only on node1 that crushed last time) 
> and the current state (epoch) on the running node (node2) 
> can cause the issue?

Hi Željko,

I don't think so. Especially not looking at the logs.

You had the same problem in October:

2017-10-29 23:26:59 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 533 started. Keep GCI = 16484622 oldest restorable GCI = 16483797
2017-10-29 23:26:59 [MgmtSrvr] INFO     -- Node 3: LDM(0): Completed LCP, #frags = 14 #records = 2129, #bytes = 87524
2017-10-29 23:27:00 [MgmtSrvr] INFO     -- Node 2: LDM(0): Completed LCP, #frags = 14 #records = 2129, #bytes = 87524
2017-10-29 23:27:00 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 533 completed
2017-10-30 00:10:07 [MgmtSrvr] INFO     -- Node 2: Backup 32 started from node 1
2017-10-30 00:19:04 [MgmtSrvr] INFO     -- Node 2: Backup 32 started from node 1 completed. StartGCP: 16518386 StopGCP: 16518889 #Records: 6209360 #LogRecords: 365 Data: 720191156 bytes Log: 14284 bytes

LCP #533 starts and finishes in 1 second, recording 14 fragments and ~87kB on each node.
Backup 32 takes ~9 minutes (to record what should be ~half the data) and writes ~720MB overall (~360MB per node).

Looking at all of the LCP logs since July, the LCP has *never* contained > 14 fragments, and always exactly the same amount of data...

So for some reason your data is not in LCP and this *will* happen again!
[14 Dec 2017 2:36] MySQL Verification Team
Can you get me the output of 

*  ndb_show_tables
*  ndb_desc -pn output for all tables

If the tables are NOLOGGING then that property would be restored from the metadata in the backup!!

Possible fix would be to recreate the tables from SQL without nologging, and then restore the data from the backup. 

Anyhow, please get me ndb_show_tables and ndb_desc -pn for each table so we can look (or you can look yourself), but looking at the logs you don't have persistent data there!

all best
Bogdan
[18 Dec 2017 14:02] Zeljko Zuvic
Hi Bogdan,

Sorry for my delayed response. I was away on a private trip last week and hence didn't provide the requested information.
As you requested, I captured and attached the output of the command

ndb_show_tables

On the other hand, for the command ndb_desc -pn I got the same short output for all tables, for example:

ndb_desc -pn -t user_record

NDBT_ProgramExit: 0 - OK

Is that the expected output, or have I missed something?

Many thanks for your support on this case.

Br.

Zeljko
[18 Dec 2017 14:27] MySQL Verification Team
Hi,

as for ndb_desc, call it with

ndb_desc tablename -c connstring -d database -t tablename

(so yes, the table name appears 2 times)

For example:

[root@localhost bin]# ./ndb_desc nekitest -c localhost -u -n -p -i -d test -t nekitest
-- nekitest --
Version: 5
Fragment type: HashMapPartition
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 4
Number of primary keys: 1
Length of frm data: 307
Max Rows: 0
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
PartitionCount: 2
FragmentCount: 2
PartitionBalance: FOR_RP_BY_LDM
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
Table options:
HashMap: DEFAULT-HASHMAP-3840-2
-- Attributes --
id Bigint PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR
x Int NULL AT=FIXED ST=MEMORY
y Int NULL AT=FIXED ST=MEMORY
z Varchar(255;latin1_swedish_ci) NULL AT=SHORT_VAR ST=MEMORY
-- Indexes --
PRIMARY KEY(id) - UniqueHashIndex
PRIMARY(id) - OrderedIndex
-- Per partition info --
Partition  Row count  Commit count  Frag fixed memory  Frag varsized memory  Extent_space  Free extent_space  Nodes
0          10132      10132         458752             557056                0             0                  2,3
1          9868       9868          458752             557056                0             0                  3,2

-- nekitest/PRIMARY --
Version: 5
Base table: nekitest
Number of attributes: 1
Logging: 0
Index type: OrderedIndex
Index status: Retrieved
-- Attributes --
id Bigint NOT NULL AT=FIXED ST=MEMORY
-- IndexTable 20/PRIMARY --
Version: 5
Fragment type: FragUndefined
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: yes
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 0
Max Rows: 0
Row Checksum: 1
Row GCI: 1
SingleUserMode: 2
ForceVarPart: 0
PartitionCount: 2
FragmentCount: 2
PartitionBalance: FOR_RP_BY_LDM
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
Table options:
-- Attributes --
id Bigint NOT NULL DISTRIBUTION KEY AT=FIXED ST=MEMORY
NDB$TNODE Unsigned [64] PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY
-- Indexes --
PRIMARY KEY(NDB$TNODE) - UniqueHashIndex

NDBT_ProgramExit: 0 - OK
[18 Dec 2017 14:33] MySQL Verification Team
The more important part: ndb_show_tables. Look at the "logging" column; most of your tables are no-logging. (A quick filter sketch follows.)

Everything I wrote about nologging tables applies here: they are not persistent, they do not write to disk (their content is not part of the LCP), and they are part of a backup, but that's it. They will survive a single node restart but not a cluster restart - a cluster restart will lose all the data.
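
A quick way to spot such tables from the command line (a sketch; it assumes the usual ndb_show_tables column order of id, type, state, logging, database, schema, name):

  # list user tables whose "logging" column is "No"
  ndb_show_tables | awk '$2 == "UserTable" && $4 == "No"'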

How you ended up with nologging tables, I can't say, but that property is restored from the metadata in the backup, so every time you restore from a backup they are again set to nologging.

One fix would be to recreate the tables from SQL without nologging, and then restore the data from the backup. So you do a mysqldump to dump the structure, modify it if needed, make a cluster backup, then initialize the cluster, create the tables from the SQL script created with mysqldump, and restore the data (without metadata) from the cluster backup. A sketch of those steps is below.
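
Roughly (a sketch under assumptions: "mydb", the backup id, and the backup path are placeholders, and ndb_restore is run once per data node's backup files):

  # 1. dump the table structure only; review/edit it so the tables
  #    are created as normal logging tables
  mysqldump --no-data mydb > schema.sql

  # 2. take a cluster backup from the management client
  ndb_mgm -e "START BACKUP"

  # 3. initialize the cluster (all data nodes restarted with --initial),
  #    then recreate the tables from the edited schema
  mysql mydb < schema.sql

  # 4. restore the data only (no -m, so the nologging table metadata
  #    from the backup is not reapplied)
  ndb_restore -b <backup_id> -n 2 -r --backup_path=/path/to/BACKUP-<backup_id>
  ndb_restore -b <backup_id> -n 3 -r --backup_path=/path/to/BACKUP-<backup_id>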

Sometime in the past these tables were set to nologging. This significantly reduces the disk usage of the cluster (as no LCP is performed for those tables).

hope this helps
Bogdan
[18 Dec 2017 14:52] MySQL Verification Team
By the way, no need for the ndb_desc output now; we can see from ndb_show_tables that all the tables are nologging.

all best
b.
[18 Dec 2017 15:31] Zeljko Zuvic
Hi Bogdan,

Thank you for the confirmation; everything makes sense now. I cannot figure out how we ended up with nologging tables, but I will investigate.
Also, on our TEST environment the logging parameter is set to YES, which is why I could not recreate the issue there.
The fix you have proposed also sounds good, so I am going to prepare the conditions for it.
Many thanks for all your support so far, not just on this case. Every time, you have guided us in the right direction, finding the exact cause and also suggesting a fix for it.

Really GREAT support, I am impressed! :)

All the best!

Zeljko
[18 Dec 2017 16:20] MySQL Verification Team
Hi Zeljko, 

You are welcome.

> Thank you for the confirmation; everything makes sense now.
> I cannot figure out how we ended up with nologging tables,
> but I will investigate.

It is possible that at some point in the past someone set the tables to nologging, perhaps because of disk I/O issues, without understanding all the implications of that action. Making tables nologging reduces disk I/O a lot (since nothing is written to disk), but then, as you experienced, the tables don't survive a cluster crash/shutdown.

all best
Bogdan

p.s. I changed the bug status to "not a bug"