Bug #102486 | util.dumpSchema fails if any worker thread times out (error 2006 / error 2013) | |
---|---|---|---|
Submitted: | 5 Feb 2021 3:45 | Modified: | 14 Mar 2021 20:39 |
Reporter: | Robert Roland | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: Document Store: MySQL Shell | Severity: | S1 (Critical) |
Version: | 8.0.23 | OS: | Windows (10 x64) |
Assigned to: | MySQL Verification Team | CPU Architecture: | x86 (i9 9900K) |
Tags: | ERROR 2006, ERROR 2013, net_read_timeout, util.dumpInstance, util.dumpSchema, util.loadDump |
[5 Feb 2021 3:45]
Robert Roland
[5 Feb 2021 4:19]
Robert Roland
Under "How to Repeat" the command should be:

```
util.dumpSchemas(["tablename"], "C:/dumpfolder", {ocimds: true})
```

or

```
util.dumpTables("schemaname", ["tablename"], "C:/dumpfolder", {ocimds: true})
```
[6 Feb 2021 5:53]
Robert Roland
I also encountered "MySQL server has gone away" (error 2006) when using util.loadDump to populate the new instance. This happened even though all of the steps I outlined above were taken. At this point I am unsure how to proceed and cannot work without the ability to restore these databases. I have already lost several days and it looks like there are going to be more to come. I find it hard to believe that the shell's connection parameters or the util.loadDump options cannot be modified to account for a very straightforward restore of a large table created by the util.dump utility. I am pretty frustrated, to say the least.
[6 Feb 2021 6:11]
Robert Roland
wait_timeout and interactive_timeout are set to the standard 8hrs
[6 Feb 2021 6:14]
Robert Roland
The error arrives in under a minute after launching the data load, so it is more likely packet size, but I do have global max_allowed_packet=1073741824 everywhere. I do not know if the client utility needs that setting also, and I cannot find mention of it anywhere.
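As a quick sanity check, the server-side value can be compared against the 1 GiB the dump was produced with. This is a minimal sketch only: the live `mysql` query is commented out (it needs a running server) and its result is substituted by hand.

```shell
# Minimal sketch: verify the server's max_allowed_packet is at least the
# 1 GiB (1073741824 bytes) the dump assumes. The mysql call requires a
# live server, so it is commented out and its result hard-coded here.
required=1073741824
# server_value=$(mysql -N -B -e "SELECT @@GLOBAL.max_allowed_packet")
server_value=1073741824   # substitute the value the query above returns
if [ "$server_value" -ge "$required" ]; then
  echo "max_allowed_packet OK: $server_value bytes"
else
  echo "max_allowed_packet too small: $server_value < $required"
fi
```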
[6 Feb 2021 6:39]
Robert Roland
In mysql-shell/modules/util/dump/dumper.cc:

```cpp
1097   // set size of max packet (~size of 1 row) we can get from server
1098   if (!co.has(mysqlshdk::db::kMaxAllowedPacket)) {
1099     const auto k_one_gb = "1073741824";
1100     co.set(mysqlshdk::db::kMaxAllowedPacket, k_one_gb);
1101   }
```

The dumper is clearly creating dump files with large packets in mind if the server has max_allowed_packet=1G. In mysql-shell/modules/util/load/load_dump_options.cc there is no reference to packet size (I assume the sender just sends what it has and the server setting determines whether it can be received), but there are clear references to long timeouts being set:

```cpp
61   m_target = get_classic_connection_options(m_base_session);
62
63   if (m_target.has(mysqlshdk::db::kLocalInfile)) {
64     m_target.remove(mysqlshdk::db::kLocalInfile);
65   }
66   m_target.set(mysqlshdk::db::kLocalInfile, "true");
67
68   // Set long timeouts by default
69   std::string timeout = "86400000";  // 1 day in milliseconds
70   if (!m_target.has(mysqlshdk::db::kNetReadTimeout)) {
71     m_target.set(mysqlshdk::db::kNetReadTimeout, timeout);
72   }
73   if (!m_target.has(mysqlshdk::db::kNetWriteTimeout)) {
74     m_target.set(mysqlshdk::db::kNetWriteTimeout, timeout);
75   }

231  m_wait_dump_timeout_ms = wait_dump_timeout * 1000;
```

This adds to my belief that it isn't a timeout issue. This Q&A: https://stackoverflow.com/questions/51946078/change-and-show-max-allowed-packet-of-client-... seems to clarify that any max_allowed_packet setting for the client only matters for the dumper, not the loader. So if timeouts are ruled out, and I have confirmed the server has a global max packet of 1G:

```
show variables like 'max_allowed_packet';

Variable_name       Value
max_allowed_packet  1073741824
```

I am at a loss as to how to restore my data created by util.dumpSchema.
[6 Feb 2021 6:56]
Robert Roland
This is mentioned in the documentation as one possible cause for server closing connection (server has gone away): "You can also encounter this error with applications that fork child processes, all of which try to use the same connection to the MySQL server. This can be avoided by using a separate connection for each child process." I was not sure if the threads created in the shell util are considered forked child processes which could cause a problem.
[6 Feb 2021 8:06]
Robert Roland
After executing `set global wait_timeout=60000;` I tried loading the dump again and it worked. A second dump load of 1.2TB is also almost complete without error. I don't know if this made the difference, but it was the only change I made. wait_timeout was 28800 prior to the change.
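For anyone following along, MySQL 8.0's SET PERSIST can make this change survive a restart. This is a sketch only: the statements are just written to a file here, and the `mysql` invocation that would apply them is commented out because it needs a live server.

```shell
# Sketch: persist the timeout change with SET PERSIST (MySQL 8.0+) so it
# survives a server restart. The statements are written to a file;
# applying them is commented out since it requires a live server.
cat <<'SQL' > set_timeouts.sql
SET PERSIST wait_timeout = 60000;
SET PERSIST interactive_timeout = 60000;
SQL
# mysql -u root -p < set_timeouts.sql
echo "wrote set_timeouts.sql with $(wc -l < set_timeouts.sql) statements"
```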
[6 Feb 2021 20:26]
Robert Roland
screenshot of error showing thread losing connection at ~25 sec elapsed
Attachment: screenshot.jpg (image/jpeg, text), 1.19 MiB.
[6 Feb 2021 20:31]
Robert Roland
I added a screenshot of the lost connection error (I did not have a text log available). After having success with changing wait_timeout to 60000 I have continued having more success throughout the last 12 hours and have now been able to restore 1.6 TB across 480 tables and 5 schema dumps. I am still unsure why changing wait_timeout from default 8hrs to 60000s made the difference but I am relieved.
[6 Feb 2021 20:41]
Robert Roland
These are all of the server timeout settings that are currently working:

```
# set global wait_timeout=60000;
show variables like '%timeout';

Variable_name                      Value
-------------                      -----
connect_timeout                    10000
delayed_insert_timeout             10000
have_statement_timeout             YES
innodb_flush_log_at_timeout        1
innodb_lock_wait_timeout           50
innodb_rollback_on_timeout         OFF
interactive_timeout                60000
lock_wait_timeout                  31536000
mysqlx_connect_timeout             30
mysqlx_idle_worker_thread_timeout  60
mysqlx_interactive_timeout         28800
mysqlx_port_open_timeout           0
mysqlx_read_timeout                30
mysqlx_wait_timeout                28800
mysqlx_write_timeout               60
net_read_timeout                   360000
net_write_timeout                  360000
rpl_stop_slave_timeout             31536000
slave_net_timeout                  60
wait_timeout                       60000
```
[6 Feb 2021 22:01]
Robert Roland
screenshot2 - Error occurred again even after wait_timeout=60000
Attachment: screenshot2.jpg (image/jpeg, text), 1.40 MiB.
[6 Feb 2021 22:04]
Robert Roland
Unfortunately the error did come up again after some good success with wait_timeout=60000. I attached a screenshot that shows the error occurred on chunk 80 of the 34th table, after 21G of data had successfully loaded in 20 min.
[7 Feb 2021 20:39]
Robert Roland
I continued to experiment with timeout settings. I continued to see errors of this type:

```
MySQL Error 2013 (HY000): Lost connection to MySQL server during query: LOAD DATA LOCAL INFILE
```

Shell log excerpt:

```
2021-02-06 05:44:32: Info: Using credential store helper: c:\program files\mysql\mysql shell 8.0\bin\mysql-secret-store-windows-credential.exe
2021-02-06 05:44:32: Info: Setting Python home to 'c:\program files\mysql\mysql shell 8.0\lib\Python3.7'
2021-02-06 05:44:32: Info: Loading startup files...
2021-02-06 05:44:32: Info: Loading plugins...
2021-02-06 05:44:32: Info: Dump \\?\e:\dumps\soa is complete
2021-02-06 05:44:32: Info: Logging load progress to \\?\e:\dumps\soa\load-progress.0273c489-6824-11eb-a362-6e33b4eb80bf.json
---------------------------------------------------------------------
2021-02-06 05:44:58: Error: [Worker002] soa@ilec0215@@0.tsv.zst: MySQL Error 2013 (HY000): Lost connection to MySQL server during query: LOAD DATA LOCAL INFILE '\\\\?\\e:\\dumps\\soa\\soa@ilec0215@@0.tsv.zst' REPLACE INTO TABLE `soa`.`ilec0215` CHARACTER SET 'utf8mb4' FIELDS TERMINATED BY ' ' ESCAPED BY '\\' LINES STARTING BY '' TERMINATED BY '\n' (`ID`, `Observation_Year`, `Common_Company_Indicator`, `Preferred_Indicator`, `Gender`, `Smoker_Status`, `Insurance_Plan`, `Issue_Age`, `Duration`, `Attained_Age`, `Age_Basis`, `Face_Amount_Band`, `Issue_Year`, `Number_Of_Preferred_Classes`, `Preferred_Class`, `SOA_Anticipated_Level_Term_Period`, `SOA_Guaranteed_Level_Term_Period`, `SOA_Post_Level_Term_Indicator`, `Select_Ultimate_Indicator`, `Number_of_Deaths`, `Death_Claim_Amount`, `Policies_Exposed`, `Amount_Exposed`, `Expected_Death_QX7580E_by_Amount`, `Expected_Death_QX2001VBT_by_Amount`, `Expected_Death_QX2008VBT_by_Amount`, `Expected_Death_QX2008VBTLU_by_Amount`, `Expected_Death_QX2015VBT_by_Amount`, `Expected_Death_QX7580E_by_Policy`, `Expected_Death_QX2001VBT_by_Policy`, `Expected_Death_QX2008VBT_by_Policy`, `Expected_Death_QX2008VBTLU_by_Policy`, `Expected_Death_QX2015VBT_by_Policy`, `Contestable`, `Age_Group`)
2021-02-06 05:44:58: Error: Aborting load..
--------------------------------------------------------------------
```

I finally had some consistent success with a large schema load (schema dump characteristics: 339 chunks, 332M rows, 56.8G), which loaded without any warnings or errors in 1hr and 39min. The settings I used were:

```
SHOW SESSION VARIABLES LIKE "%timeout%";

Variable_name                      Value
-------------                      -----
connect_timeout                    99999
delayed_insert_timeout             10000
have_statement_timeout             YES
innodb_flush_log_at_timeout        1
innodb_lock_wait_timeout           50
innodb_rollback_on_timeout         OFF
interactive_timeout                99999
lock_wait_timeout                  31536000
mysqlx_connect_timeout             99999
mysqlx_idle_worker_thread_timeout  3600
mysqlx_interactive_timeout         99999
mysqlx_port_open_timeout           0
mysqlx_read_timeout                99999
mysqlx_wait_timeout                99999
mysqlx_write_timeout               99999
net_read_timeout                   360000
net_write_timeout                  360000
rpl_stop_slave_timeout             31536000
slave_net_timeout                  60
wait_timeout                       99999
```

I started the shell as follows:

```
mysqlsh.exe --mc --interactive --file restore.js
```

And "restore.js" was as follows:

```
\connect root@localhost:3306?connect-timeout=99999
shell.options.setPersist("dba.restartWaitTimeout", 99999)
shell.options.setPersist("dba.gtidWaitTimeout", 99999)
util.loadDump("e:/dumps/schemaname", {threads: 5, skipBinlog: true, analyzeTables: "on"})
```

I cannot pinpoint the exact reason why this worked. A few things I considered were:

1) Connecting using SSL vs. the classic MySQL protocol (perhaps a default timeout that I could not directly control was different for each type).
2) Setting timeouts to arbitrarily large values might be ignored if they were above a certain number, such as a smallint or a length of 5 digits (i.e. 99999 was the max timeout the server would recognize).
3) Since I start the shell using --interactive, I was sure to change the setting for interactive_timeout rather than wait_timeout.

My current thought is that connect_timeout and interactive_timeout (wait_timeout) are the most important. This already assumes that these other settings have been set (to high values) in my.ini:

```
[client]
local_infile=1
max_allowed_packet=1073741824
net_read_timeout=360000
net_write_timeout=360000
wait_timeout=99999
net_buffer_length=1048576

[mysqldump]
max_allowed_packet=1073741824
net_read_timeout=360000
net_write_timeout=360000
net_buffer_length=1048576

[mysql]
local-infile=1
max_allowed_packet=1073741824
net_read_timeout=360000
net_write_timeout=360000
```
[7 Feb 2021 21:21]
Robert Roland
One more thing to note: I found I could make it further through the loading of chunks if the server had just been restarted (and I then immediately set the global timeout settings as previously noted).
[8 Feb 2021 19:36]
Robert Roland
I am happy to report that after 7 days of effort I have over 95% of my instance recovered. I continue to have an intermittent util.loadDump thread that loses connection, but it is less frequent using the settings I described earlier. Since util.loadDump is able to resume where it left off after failing, you can eventually get the load to complete if you continually monitor the progress and resume upon failure.
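The monitor-and-resume cycle described above can be scripted. Here is a rough sketch: the loop simply re-invokes a command until it exits cleanly, relying on util.loadDump picking up from its load-progress file on each re-run; the mysqlsh command line at the bottom is a placeholder taken from this report, not a verified invocation.

```shell
# Rough sketch: re-run a command until it succeeds, mimicking the manual
# "resume upon failure" workflow. util.loadDump resumes from its
# load-progress file, so re-invoking it continues where it stopped.
retry_until_success() {
  attempts=1
  until "$@"; do
    echo "attempt $attempts failed; retrying in 30s..." >&2
    attempts=$((attempts + 1))
    sleep 30
  done
  echo "load completed after $attempts attempt(s)"
}

# Placeholder invocation (credentials and dump path from this report):
# retry_until_success mysqlsh --mc root@localhost:3306 -e \
#   'util.loadDump("e:/dumps/schemaname", {threads: 5, skipBinlog: true})'
```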
[9 Feb 2021 19:25]
Robert Roland
All of the instance data has been restored except for 1 large table which still has the intermittent MySQL Error 2013 (HY000): Lost connection to MySQL server during query: LOAD DATA LOCAL INFILE. Sometimes as many as 10 or 12 chunks get processed before the error occurs, and sometimes it is almost immediate as soon as LOAD DATA is invoked. All timeouts and packet size settings are as previously described. Here is the verbose shell log:

```
Copyright (c) 2018, 2021, Oracle and/or its affiliates. All rights reserved.
2021-02-09 19:16:58: Debug2: Exit code: 0
2021-02-09 19:16:58: Info: Using credential store helper: c:\program files\mysql\mysql shell 8.0\bin\mysql-secret-store-windows-credential.exe
2021-02-09 19:16:58: Info: Setting Python home to 'c:\program files\mysql\mysql shell 8.0\lib\Python3.7'
2021-02-09 19:16:58: Info: Loading startup files...
2021-02-09 19:16:58: Debug: Searching for plugins at 'c:\program files\mysql\mysql shell 8.0\lib\mysqlsh\plugins'
2021-02-09 19:16:58: Info: Loading plugins...
2021-02-09 19:16:58: Debug: - c:\program files\mysql\mysql shell 8.0\lib\mysqlsh\plugins\plugins_plugin\init.py
2021-02-09 19:16:59: Debug: The 'plugins' extension object has been registered as a global object.
2021-02-09 19:16:59: Debug: The 'info' function has been registered into the 'plugins' extension object.
2021-02-09 19:16:59: Debug: The 'version' function has been registered into the 'plugins' extension object.
2021-02-09 19:16:59: Debug: The 'about' function has been registered into the 'plugins' extension object.
2021-02-09 19:16:59: Debug: The 'list' function has been registered into the 'plugins' extension object.
2021-02-09 19:16:59: Debug: The 'install' function has been registered into the 'plugins' extension object.
2021-02-09 19:16:59: Debug: The 'uninstall' function has been registered into the 'plugins' extension object.
2021-02-09 19:16:59: Debug: The 'update' function has been registered into the 'plugins' extension object.
2021-02-09 19:16:59: Debug: The 'details' function has been registered into the 'plugins' extension object.
2021-02-09 19:16:59: Debug: Using color mode 3
2021-02-09 19:16:59: Debug: Using prompt theme file c:\program files\mysql\mysql shell 8.0\share\mysqlsh\prompt\prompt_256.json
2021-02-09 19:16:59: Debug2: Invoking helper
2021-02-09 19:16:59: Debug2: Command line: c:\program files\mysql\mysql shell 8.0\bin\mysql-secret-store-windows-credential.exe get
2021-02-09 19:16:59: Debug2: Input: {"SecretType":"password","ServerURL":"root@localhost:3306"}
2021-02-09 19:16:59: Debug2: Output: {"SecretType":"password","ServerURL":"root@localhost:3306","Secret":"****"}
2021-02-09 19:16:59: Debug2: Exit code: 0
2021-02-09 19:16:59: Debug: Loading DDL and Data from 'e:/dumps/MyDBase' using 4 threads.
2021-02-09 19:16:59: Debug: Opening dump...
2021-02-09 19:16:59: Info: Dump \\?\e:\dumps\MyDBase is complete
2021-02-09 19:16:59: Debug: Target is MySQL 8.0.23. Dump was produced from MySQL 8.0.23
2021-02-09 19:16:59: Info: Load progress file detected. Load will be resumed from where it was left, assuming no external updates were made.
2021-02-09 19:16:59: Debug: You may enable the 'resetProgress' option to discard progress for this MySQL instance and force it to be completely reloaded.
2021-02-09 19:16:59: Info: Resuming load, last loaded 0 bytes
2021-02-09 19:16:59: Debug: Scanning contents of schema 'MyDBase'
2021-02-09 19:16:59: Debug: MyDBase has 1 tables
2021-02-09 19:16:59: Debug: MyDBase has 0 views
2021-02-09 19:16:59: Debug: All metadata for schema `MyDBase` was scanned
2021-02-09 19:16:59: Debug: All metadata for dump was scanned
2021-02-09 19:16:59: Debug: Executing common preamble SQL
2021-02-09 19:16:59: Debug: Begin loading table DDL
2021-02-09 19:16:59: Debug2: Got event READY from worker 2
2021-02-09 19:16:59: Debug: Schema DDL for 'MyDBase' (DONE)
2021-02-09 19:16:59: Debug: Processing table DDL for `MyDBase`.`mytable` (placeholder=0)
2021-02-09 19:16:59: Debug2: Got event READY from worker 3
2021-02-09 19:16:59: Debug2: Got event READY from worker 0
2021-02-09 19:16:59: Debug2: Got event READY from worker 1
2021-02-09 19:16:59: Debug: worker2 will load DDL file \\?\e:\dumps\MyDBase\MyDBase@mytable.sql for table `MyDBase`.`mytable`
2021-02-09 19:16:59: Debug2: Got event TABLE_DDL_START from worker 2
2021-02-09 19:16:59: Debug: worker2 done
2021-02-09 19:16:59: Debug2: Got event TABLE_DDL_END from worker 2
2021-02-09 19:16:59: Debug: End loading table DDL
2021-02-09 19:16:59: Debug: Begin loading view DDL
2021-02-09 19:16:59: Debug2: Got event READY from worker 3
2021-02-09 19:16:59: Debug2: Got event READY from worker 0
2021-02-09 19:16:59: Debug2: Got event READY from worker 1
2021-02-09 19:16:59: Debug2: Got event READY from worker 2
2021-02-09 19:16:59: Debug: End loading view DDL
2021-02-09 19:16:59: Debug2: Got event READY from worker 3
2021-02-09 19:16:59: Debug: Table data for 'MyDBase'.'mytable'/-1 (INTERRUPTED)
2021-02-09 19:16:59: Debug: Scheduling chunk for table MyDBase.mytable (\\?\e:\dumps\MyDBase\MyDBase@mytable.tsv.zst) - worker3
2021-02-09 19:16:59: Debug: Loading data for `MyDBase`.`mytable` (chunk -1)
2021-02-09 19:16:59: Debug2: Got event READY from worker 0
2021-02-09 19:16:59: Debug: Analyzing table `MyDBase`.`mytable`
2021-02-09 19:16:59: Debug: worker3 will load chunk \\?\e:\dumps\MyDBase\MyDBase@mytable.tsv.zst for table `MyDBase`.`mytable`
2021-02-09 19:16:59: Debug2: Got event READY from worker 1
2021-02-09 19:16:59: Debug: worker0 will analyze table `MyDBase`.`mytable`
2021-02-09 19:16:59: Debug2: Got event READY from worker 2
2021-02-09 19:16:59: Debug2: Got event LOAD_START from worker 3
2021-02-09 19:16:59: Debug: Analyzing table `MyDBase`.`mytable`
2021-02-09 19:16:59: Debug2: Got event ANALYZE_START from worker 0
2021-02-09 19:16:59: Debug: worker0 done
2021-02-09 19:16:59: Debug2: Got event ANALYZE_END from worker 0
2021-02-09 19:17:02: Error: [Worker003] MyDBase@mytable.tsv.zst: MySQL Error 2013 (HY000): Lost connection to MySQL server during query: LOAD DATA LOCAL INFILE '\\\\?\\e:\\dumps\\MyDBase\\MyDBase@mytable.tsv.zst' REPLACE INTO TABLE `MyDBase`.`mytable` CHARACTER SET 'utf8mb4' FIELDS TERMINATED BY ' ' ESCAPED BY '\\' LINES STARTING BY '' TERMINATED BY '\n' (`ID`, `FULLNAME`, `wcount`, `num`, `num2`, `num3`, `num4`, `num5`, `num6`)
2021-02-09 19:17:02: Debug2: Got event FATAL_ERROR from worker 3
2021-02-09 19:17:02: Error: Aborting load...
2021-02-09 19:17:02: Debug: Import done
2021-02-09 19:17:02: Debug: Waiting on worker threads...
2021-02-09 19:17:02: Debug: All worker threads stopped
2021-02-09 19:17:02: Debug: There was no remaining data left to be loaded.
2021-02-09 19:17:02: Debug: 1 errors and 0 warnings messages were reported during the load.
2021-02-09 19:17:02: Debug: Util.loadDump: Error loading dump (RuntimeError)
```
[9 Feb 2021 21:29]
Robert Roland
As previously mentioned, 1 last table has not been able to complete restoring all chunks using util.loadDump and has been exiting fairly quickly after processing 0 to 3 chunks (rarely up to 10). I tried changing the shell connect option connect-timeout to 0:

```
\connect root@localhost:3306?connect-timeout=0
```

I completed 108 chunks (65M rows) before again receiving ERROR 2013 (Lost connection during query: LOAD DATA). It was an unexpectedly large amount of progress, which I thought important to report.
[9 Feb 2021 21:31]
Robert Roland
adjusted tags to include error 2013
[11 Feb 2021 19:03]
Robert Roland
I am still unable to fully restore 1 table due to the shell aborting the connection (Error 2013) after just a few seconds regardless of any settings previously mentioned. I have spent a good deal of time looking through much of the python code the shell uses without any obvious place where a hidden setting overrides the server or command line settings. It looks like the shell may be aborting rather than the server closing the connection. If anyone has any insight as to how to proceed (even 1 chunk at a time) I would be grateful for a recommendation that works.
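One chunk-at-a-time fallback worth trying is loading individual dump chunk files with util.importTable instead of util.loadDump, so a lost connection only costs a single chunk. This is a hypothetical sketch: the chunk file name follows the pattern seen in the logs above, and whether this shell version accepts the zstd-compressed chunk directly is an assumption on my part.

```shell
# Hypothetical fallback: load one dump chunk at a time with util.importTable
# so a lost connection only costs a single chunk. The path and names come
# from this report; zstd support in importTable is an assumption -- if it
# is not accepted, decompress the chunk first:
# zstd -d 'e:/dumps/MyDBase/MyDBase@mytable@@0.tsv.zst'
# mysqlsh --mc root@localhost:3306 -e \
#   'util.importTable("e:/dumps/MyDBase/MyDBase@mytable@@0.tsv",
#      {schema: "MyDBase", table: "mytable"})'
msg="sketch only - run against a live server"
echo "$msg"
```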
[14 Feb 2021 1:40]
Robert Roland
When I put the server under load by executing other update queries on a different connection, I was able to get over 320 more chunks added without the error. I then tried loading more chunks after the server became idle, and the loadDump failed right away. I then put the server under load again and was able to get over 150 more chunks before I got the error. I tried this back and forth several times yesterday, and the loadDump made it much, much farther if the server was under load. Maybe this points to a buffer/flow-rate issue.

The hardware for this server: Windows 10 x64, i9 9900K (16 logical cores at 5GHz), 64G memory (buffer pool 54G / 64 pool instances), 2 TB NVMe SSD RAID0. The iblog, undo logs, and ibdata are all located on 3 separate local SSDs, the connection for the shell is via localhost using the classic MySQL protocol, and the binlog is disabled. This should be adequate to digest a single-threaded LOAD DATA INFILE.
[27 Feb 2021 4:38]
MySQL Verification Team
Hi,

Just to inform you on the progress: I'm failing to reproduce this on a normal Linux server. I am testing on Windows now and I'm not seeing the problem here either, but I might be doing something wrong. I assume you are loading the dump you created with util.dumpSchema in bug 102485? So a single table, 10M rows? Can you share the table structure?

all best
Bogdan
[14 Mar 2021 20:39]
MySQL Verification Team
Hi,

I cannot reproduce this.