Bug #4382 | Replicated LOAD DATA INFILE blocks queries in other dbs on slave | ||
---|---|---|---|
Submitted: | 2 Jul 2004 15:20 | Modified: | 30 Aug 2004 21:20 |
Reporter: | [ name withheld ] | Email Updates: | |
Status: | No Feedback | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 4.0.20 | OS: | Linux (RedHat Linux (MySQL rpms)) |
Assigned to: | Assigned Account | CPU Architecture: | Any |
[2 Jul 2004 15:20]
[ name withheld ]
[2 Jul 2004 15:35]
[ name withheld ]
Sorry, ID 6 was not another query from the CGI. It was an update from the backend process to tables in the nagios db.
[6 Jul 2004 19:26]
[ name withheld ]
FWIW this problem also exists in version 4.0.16.
[7 Jul 2004 18:03]
Guilhem Bichot
Hi. To better study this bug (which I have not been able to reproduce yet), I would like to see the blocked query untruncated; you can get that with SHOW FULL PROCESSLIST. Could you please copy here the output of SHOW FULL PROCESSLIST at the moment the slave experiences the "system lock"? If this is confidential data, you can simply email it to guilhem@mysql.com. Thanks, Guilhem
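For example, from the shell on the slave, something along these lines should do (just a sketch; connection options such as -u/-p/-h are omitted):

  # Vertical output (\G) keeps the long Info values readable
  mysql -e "SHOW FULL PROCESSLIST\G"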
[7 Jul 2004 18:23]
[ name withheld ]
Ok. Since the LDIs in question are large and run overnight, I'll set something up to get the full processlist output from each of the affected servers when the LDIs run and get back to you.
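Roughly, I have something like this in mind (just a sketch; the connection options and the output path are placeholders, not necessarily what I'll actually use):

  #!/bin/sh
  # Dump the full process list once a minute while the overnight LDI runs.
  while true; do
      date
      mysql -e "SHOW FULL PROCESSLIST"
      sleep 60
  done >> /tmp/processlist.log 2>&1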
[7 Jul 2004 18:33]
[ name withheld ]
One other point that may be relevant: the table the LDI loads into is a very large MyISAM table with ~83.7 million rows. The .MYD file is 2.4GB and the .MYI is 8GB.
[8 Jul 2004 15:31]
[ name withheld ]
Below is a set of SHOW FULL PROCESSLIST outputs (taken once a minute) from one of the servers that experiences this. 8206 is the LDI. I suppose this shows that 50028 and 50036 are blocking on the stalled update of 6, but then the question becomes why 6 is stuck in an "end" state for so long while the LDI is occurring.

Thu Jul 8 05:42:00 EDT 2004
Id User Host db Command Time State Info
6 nagios park:55000 nagios Sleep 4 NULL
404 system user NULL Connect 1192697 Waiting for master to send event NULL
8206 system user mailer Connect 6220 NULL NULL
50021 root localhost NULL Query 0 NULL show full processlist

Thu Jul 8 05:43:00 EDT 2004
Id User Host db Command Time State Info
6 nagios park:55000 nagios Query 4 end UPDATE programstatus SET last_update=FROM_UNIXTIME(1089279776),program_start=FROM_UNIXTIME(1087647491),nagios_pid='2352',daemon_mode='1',last_command_check=FROM_UNIXTIME(1089279776),last_log_rotation=FROM_UNIXTIME(1089259200),enable_notifications='1',execute_service_checks='1',accept_passive_service_checks='1',enable_event_handlers='1',obsess_over_services='0',enable_flap_detection='0',enable_failure_prediction='1',process_performance_data='0'
404 system user NULL Connect 1192757 Waiting for master to send event NULL
8206 system user mailer Connect 6280 NULL NULL
50027 root localhost NULL Query 0 NULL show full processlist

Thu Jul 8 05:44:00 EDT 2004
Id User Host db Command Time State Info
6 nagios park:55000 nagios Query 64 end UPDATE programstatus SET last_update=FROM_UNIXTIME(1089279776),program_start=FROM_UNIXTIME(1087647491),nagios_pid='2352',daemon_mode='1',last_command_check=FROM_UNIXTIME(1089279776),last_log_rotation=FROM_UNIXTIME(1089259200),enable_notifications='1',execute_service_checks='1',accept_passive_service_checks='1',enable_event_handlers='1',obsess_over_services='0',enable_flap_detection='0',enable_failure_prediction='1',process_performance_data='0'
404 system user NULL Connect 1192817 Waiting for master to send event NULL
8206 system user mailer Connect 6340 NULL NULL
50028 nagios park:47964 nagios Query 28 System lock SELECT UNIX_TIMESTAMP(last_update) AS last_update, UNIX_TIMESTAMP(program_start) AS program_start, nagios_pid, daemon_mode, UNIX_TIMESTAMP(last_command_check) AS last_command_check, UNIX_TIMESTAMP(last_log_rotation) AS last_log_rotation, enable_notifications, execute_service_checks, accept_passive_service_checks, enable_event_handlers, obsess_over_services, enable_flap_detection, enable_failure_prediction, process_performance_data FROM programstatus
50030 root localhost NULL Query 0 NULL show full processlist

[snip]

Thu Jul 8 05:48:00 EDT 2004
Id User Host db Command Time State Info
6 nagios park:55000 nagios Query 304 end UPDATE programstatus SET last_update=FROM_UNIXTIME(1089279776),program_start=FROM_UNIXTIME(1087647491),nagios_pid='2352',daemon_mode='1',last_command_check=FROM_UNIXTIME(1089279776),last_log_rotation=FROM_UNIXTIME(1089259200),enable_notifications='1',execute_service_checks='1',accept_passive_service_checks='1',enable_event_handlers='1',obsess_over_services='0',enable_flap_detection='0',enable_failure_prediction='1',process_performance_data='0'
404 system user NULL Connect 1193057 Waiting for master to send event NULL
8206 system user mailer Connect 6580 NULL NULL
50028 nagios park:47964 nagios Query 268 System lock SELECT UNIX_TIMESTAMP(last_update) AS last_update, UNIX_TIMESTAMP(program_start) AS program_start, nagios_pid, daemon_mode, UNIX_TIMESTAMP(last_command_check) AS last_command_check, UNIX_TIMESTAMP(last_log_rotation) AS last_log_rotation, enable_notifications, execute_service_checks, accept_passive_service_checks, enable_event_handlers, obsess_over_services, enable_flap_detection, enable_failure_prediction, process_performance_data FROM programstatus
50036 nagios park:47965 nagios Query 16 System lock SELECT UNIX_TIMESTAMP(last_update) AS last_update, UNIX_TIMESTAMP(program_start) AS program_start, nagios_pid, daemon_mode, UNIX_TIMESTAMP(last_command_check) AS last_command_check, UNIX_TIMESTAMP(last_log_rotation) AS last_log_rotation, enable_notifications, execute_service_checks, accept_passive_service_checks, enable_event_handlers, obsess_over_services, enable_flap_detection, enable_failure_prediction, process_performance_data FROM programstatus
50037 root localhost NULL Query 0 NULL show full processlist

[this continues on for five more minutes with more selects backing up until]

Thu Jul 8 05:53:00 EDT 2004
Id User Host db Command Time State Info
6 nagios park:55000 nagios Sleep 4 NULL
404 system user NULL Connect 1193357 Waiting for master to send event NULL
8206 system user mailer Connect 6258 NULL NULL
50053 root localhost NULL Query 0 NULL show full processlist

This suggests that the LDI has changed from copying the temp file to loading, or some such; I'll know for sure once I've installed 4.0.21 with the patch you made for me in another bug report. :) At this point the blocked update completes, along with the selects that had queued up behind it. This is a fairly decent machine (2x1.8GHz Xeon with HyperThreading on, 2GB RAM, RedHat 7.3), so I wouldn't expect the LDI to bog the system down so much that updates on another database would block on it. Please let me know if I can provide any additional info. Thanks,
[8 Jul 2004 22:51]
Guilhem Bichot
Thanks for all this interesting info :) I'm looking at what could cause 'end' to last so long.
[8 Jul 2004 23:09]
Guilhem Bichot
Question: does the lock happen only on the slave, or do you see the same problem on the master? This question is only relevant if the master also has to process the LDI and an UPDATE on another table at the same time. Does the slave run with --log-bin and --log-slave-updates? Thank you.
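(If checking the startup options is inconvenient, something like this on the slave should show whichever of those settings the server exposes as variables; just a sketch, connection options omitted:)

  # List the logging-related server variables (log_bin, log_slave_updates, ...)
  mysql -e "SHOW VARIABLES LIKE 'log%'"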
[8 Jul 2004 23:30]
Guilhem Bichot
More questions. As it's not clear what the slave SQL thread is doing when it is blocking others and when it's not (due to the deficiency in SHOW PROCESSLIST which you already reported separately), I would need the same listing (SHOW FULL PROCESSLIST) as you just provided, but together with the output of SHOW SLAVE STATUS taken at the same moments. Then I may ask you to tell me which event in the master's binlog corresponds to the Exec_master_log_pos positions we'll see in SHOW SLAVE STATUS (you'll be able to tell by using mysqlbinlog -j<the_position> <the_master's_binlog> | head -n20). That way I will be able to know whether, when the LDI was blocking others, it was in Append_block or in Exec_load, which will give a hint. Thank you.
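Concretely, the sequence I have in mind is roughly the following (a sketch only; <the_position> and <the_master's_binlog> are placeholders for the values seen in SHOW SLAVE STATUS, and connection options are omitted):

  # On the slave, at the moment of the "system lock":
  mysql -e "SHOW FULL PROCESSLIST"
  mysql -e "SHOW SLAVE STATUS\G"
  # On the master, using the Exec_master_log_pos and binlog file name
  # reported by the SHOW SLAVE STATUS above:
  mysqlbinlog -j<the_position> <the_master's_binlog> | head -n20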
[8 Jul 2004 23:49]
[ name withheld ]
I'm not sure about the master; I'll set up the same processlist logging there to check the overnight LDI. It's possible that some updates are occurring on other dbs at the time, but it's not as regular as elsewhere.

It's probably a good idea to go into more detail on my environment at this point. I was hoping I wouldn't have to deal with trying to explain it in this ticket. :) In short:

mattapan: 4.0.16 running on 2x2.4GHz Xeon with HyperThreading, 4GB RAM, RedHat 9
central: 4.0.16 running on 2x2.4GHz Xeon with HyperThreading, 4GB RAM, RedHat 9
park: 4.0.20 running on 2x1.8GHz Xeon with HyperThreading, 2GB RAM, RedHat 7.3

mattapan is master for 2 dbs (including the db where the LDI in question happens) and slaves 4 dbs from central. It has log-bin on but log-slave-updates off.
central is master for 23 databases and slaves the 2 dbs from mattapan. It has both log-bin and log-slave-updates on.
park is master for 1 database (the nagios db) and slaves 23 dbs from central. It has log-bin on and log-slave-updates off.

I've updated the logging on central and park to include SHOW SLAVE STATUS with each corresponding SHOW FULL PROCESSLIST. Based on the next overnight occurrence of this problem, I'll include the mysqlbinlog output you requested in another update.

I should note that I think I'm also seeing this behavior on central, but it's been harder to pin down. We have occasional periods where reads time out in the same way, which we believe correspond to its replication of the LDI into the large db in question from mattapan. However, it's not clear that an update is occurring at the same time that might be stuck in an "end" state. Hopefully I can find one of those as well. As soon as I have the additional info I'll add another comment. Thanks
[30 Jul 2004 21:20]
Guilhem Bichot
Waiting for more info (SHOW FULL PROCESSLIST + SHOW SLAVE STATUS + binlogs at the moment of hang).
[14 Feb 2005 22:54]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".