Bug #84986 ndbmtd worker thread blocked / overlept
Submitted: 14 Feb 2017 9:36 Modified: 20 Feb 2017 16:03
Reporter: d g Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:7.5.5 OS:Red Hat (Centos 7.3)
Assigned to: MySQL Verification Team CPU Architecture:Any

[14 Feb 2017 9:36] d g
Description:
Restoring a modyfied dump (changed engine via regex from Innodb|myisam to ndbcluster and added tablespace line) into mysql-cluster causes data loss and logs full of: 

WARNING  -- thr: 7: Overslept 1507 ms, expected ~10ms
WARNING  -- thr: 4: Overslept 1512 ms, expected ~10ms
WARNING  -- thr: 6: Overslept 2559 ms, expected ~10ms

and 

thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
thr_no:15 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)

after that data memory usage goes down from  like 4% to 2% and then 0% and starting again to fill up but very slowly the dump does not abort. 

datanodes hardware: 

24cpu(HT)
128GB RAM
2.5TB raid5
2x 1GB LACP Ethernet

Management / sql nodes(2x KVM): 

4 vCPUS
4 GB RAM

my config : 

NoOfReplicas=2
DataMemory=90000M
IndexMemory=5000M
DiskPageBufferMemory=4000M
CompressedBackup=true
datadir=/var/lib/mysql-cluster
NoOfFragmentLogParts=10
MaxNoOfConcurrentOperations=10000000
MaxNoOfAttributes=100000
NoOfFragmentLogFiles=32
#Alle 64MB Lokalen Snapshot erstellen 20=4MB 21=8MB 22=16MB
TimeBetweenLocalCheckpoints=26
#Alle 8 Sekunden(8000) Globalen Snapshot erstellen
TimeBetweenGlobalCheckpoints=10000
MaxDiskWriteSpeed=600M
MinDiskWriteSpeed=200M
MaxDiskWriteSpeedOwnRestart=300M
MaxNoOfExecutionThreads=20
#RealtimeScheduler=1
TransactionDeadlockDetectionTimeout=3000
StopOnError=0
ODirect=1
#ThreadConfig=ldm={count=10,cpubind=0-4,12-16},tc={count=4,cpubind=6-7,18-19},send={count=1,cpubind=8},recv={count=1,cpubind=20},main={count=1,cpubind=9,21},rep={count=1,cpubind=9,21},io={count=1,cpubind=9,21},watchdog={count=1,cpubind=9,21}

#ThreadConfig=ldm={count=10,cpubind=0-4,12-16,thread_prio=9,spintime=200}, \
#tc={count=4,cpuset=6-7,18-19,thread_prio=8},send={count=1,cpuset=8}, \
#recv={count=1,cpuset=20},main={count=1,cpuset=9,21},rep={count=1,cpuset=9,21}, \
#io={count=1,cpuset=9,21,thread_prio=8},watchdog={count=1,cpuset=9,21,thread_prio=9}

[tcp default]
SendBufferMemory=64M
ReceiveBufferMemory=64M

[ndb_mgmd]
NodeId=1
hostname=172.16.17.11
datadir=/var/lib/mysql-cluster

[ndb_mgmd]
NodeId=2
hostname=172.16.17.12
datadir=/var/lib/mysql-cluster

[ndbd]
hostname=172.16.17.1

[ndbd]
hostname=172.16.17.2

[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]

the dump itself is 33GB.

ndb_mgm output: 

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]	2 node(s)
id=3	@172.16.17.1  (mysql-5.7.17 ndb-7.5.5, Nodegroup: 0, *)
id=4	@172.16.17.2  (mysql-5.7.17 ndb-7.5.5, Nodegroup: 0)

[ndb_mgmd(MGM)]	2 node(s)
id=1	@172.16.17.11  (mysql-5.7.17 ndb-7.5.5)
id=2	@172.16.17.12  (mysql-5.7.17 ndb-7.5.5)

[mysqld(API)]	20 node(s)
id=5	@172.16.17.11  (mysql-5.7.17 ndb-7.5.5)
id=6	@172.16.17.11  (mysql-5.7.17 ndb-7.5.5)
id=7	@172.16.17.11  (mysql-5.7.17 ndb-7.5.5)
id=8	@172.16.17.11  (mysql-5.7.17 ndb-7.5.5)
id=9	@172.16.17.12  (mysql-5.7.17 ndb-7.5.5)
id=10	@172.16.17.12  (mysql-5.7.17 ndb-7.5.5)
id=11	@172.16.17.12  (mysql-5.7.17 ndb-7.5.5)
id=12	@172.16.17.12  (mysql-5.7.17 ndb-7.5.5)
id=13 (not connected, accepting connect from any host)
id=14 (not connected, accepting connect from any host)
id=15 (not connected, accepting connect from any host)
id=16 (not connected, accepting connect from any host)
id=17 (not connected, accepting connect from any host)
id=18 (not connected, accepting connect from any host)
id=19 (not connected, accepting connect from any host)
id=20 (not connected, accepting connect from any host)
id=21 (not connected, accepting connect from any host)
id=22 (not connected, accepting connect from any host)
id=23 (not connected, accepting connect from any host)
id=24 (not connected, accepting connect from any host)

How to repeat:
1. clean start of datanode with empty directory and --initial
2. on mysqld do : 
   CREATE LOGFILE GROUP lg1 ADD UNDOFILE 'undo.dat' INITIAL_SIZE = 10000M engine=NDB;
   create tablespace ts_jt add datafile 'ts_01.dat' use logfile group lg1 initial_size 100G engine ndb;
   create database testdb
3. mysql -u root -ppassword testdb < testdb.sql
[14 Feb 2017 11:37] d g
error_log

Attachment: ndb_3_error.log (text/plain), 24.43 KiB.

[14 Feb 2017 11:44] d g
datanode01 log

Attachment: datanode01_log (application/octet-stream, text), 1.63 MiB.

[14 Feb 2017 11:46] d g
datanode02 log

Attachment: datanode02_log (application/octet-stream, text), 2.22 MiB.

[14 Feb 2017 12:14] d g
Related Post in Forum : 

https://forums.mysql.com/read.php?25,654936,654936#msg-654936
[20 Feb 2017 11:18] d g
Hi, 

after more testing i found the solution. The Problem was the System where the dump came from. On the Soucre System max-allowed-packet was configured to 128M. It seems that this value is too high and cause the ndb datanodes to overload. i dumped the data again with 16M Packet size with the following command and my restore runs without error but a bit slower: 

ysqldump --hex-blob --routines --triggers --add-drop-database --single-transaction --max_allowed_packet=16M

Maybe 32M or 64M is also ok but i didnt tested it yet. 

Regards 
Denny
[20 Feb 2017 16:03] MySQL Verification Team
Hi,

As you already discovered yourself, this is not a bug but a missconfigured server. For help about properly configuring mysql/mysql cluster you can get great support from Oracle.

best regards
Bogdan