Bug #93708 Page Cleaner will sleep for long time if clock changes
Submitted: 20 Dec 2018 20:10 Modified: 21 Dec 2018 18:07
Reporter: Marcelo Altmann (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:5.7.24, 8.0.13 OS:Linux
Assigned to: CPU Architecture:Any

[20 Dec 2018 20:10] Marcelo Altmann
Description:
Page cleaner will sleep for a long time in case you adjust your clock backward, for example, if ntp adjusts the clock or when you exit summer time.

Error log:
2018-12-20T15:26:28.981196Z 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool
2018-12-20T15:26:28.981951Z 0 [Note] InnoDB: Buffer pool(s) dump completed at 181220 15:26:28
2018-12-20T15:26:29.181297Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:27:29.412452Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:28:29.636237Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:29:31.497178Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:30:34.856757Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:31:38.337117Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:32:41.654854Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:33:45.085653Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:34:48.594154Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:35:52.294568Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:36:55.932855Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:37:59.590748Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:39:03.387422Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:40:07.109075Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:41:10.742600Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:42:14.363348Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:43:18.191026Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:44:21.817929Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-12-20T15:45:25.487459Z 0 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool

Thread stack:
[Switching to thread 3 (Thread 0x7f1dd2897700 (LWP 3570))]
#0  0x00007f1decbb8d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f1decbb8d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000fce162 in os_event::timed_wait (this=this@entry=0x393e528, abstime=abstime@entry=0x7f1dd2896b70) at /export/home/pb2/build/sb_0-30854123-1538633024.42/rpm/BUILD/mysql-5.7.24/mysql-5.7.24/storage/innobase/os/os0event.cc:285
#2  0x0000000000fce391 in os_event::wait_time_low (this=0x393e528, time_in_usec=<optimized out>, reset_sig_count=<optimized out>, reset_sig_count@entry=2)
    at /export/home/pb2/build/sb_0-30854123-1538633024.42/rpm/BUILD/mysql-5.7.24/mysql-5.7.24/storage/innobase/os/os0event.cc:412
#3  0x0000000000fce5fa in os_event_wait_time_low (event=<optimized out>, time_in_usec=<optimized out>, reset_sig_count=reset_sig_count@entry=2) at /export/home/pb2/build/sb_0-30854123-1538633024.42/rpm/BUILD/mysql-5.7.24/mysql-5.7.24/storage/innobase/os/os0event.cc:517
#4  0x000000000113b19f in pc_sleep_if_needed (sig_count=2, next_loop_time=1545320288739) at /export/home/pb2/build/sb_0-30854123-1538633024.42/rpm/BUILD/mysql-5.7.24/mysql-5.7.24/storage/innobase/buf/buf0flu.cc:2690
#5  buf_flush_page_cleaner_coordinator (arg=<optimized out>) at /export/home/pb2/build/sb_0-30854123-1538633024.42/rpm/BUILD/mysql-5.7.24/mysql-5.7.24/storage/innobase/buf/buf0flu.cc:3191
#6  0x00007f1decbb4e25 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f1deb66ebad in clone () from /lib64/libc.so.6

If clock changed after buf0flu.cc where we set next_loop_time = curr_time + 1000 the time passed to pc_sleep_if_needed will be too far in the future, which will cause a long delay on shutdown.

How to repeat:

1. Start MySQL
2. Change time: date $(date +%m%d%H%M%Y.%S -d '50 min ago')
3. Stop MySQL

Suggested fix:
As we always set next_loop_time to be 1 second, on buf0flu.cc at function pc_sleep_if_needed we should also check if the diff between (next_loop_time - cur_time) is not greater than 1 second.
[21 Dec 2018 1:25] Marcelo Altmann
On further analysis, looks like os_event_wait_time_low->os_event::wait_time_low->ut_gettimeofday is returning previous saved time (in the future).

This affects page cleaner in general, as it won't flush any pages while coordinator is in os_event::timed_wait->pthread_cond_timedwait()

When talking about os_event_wait_time_low, looks like there are other 10 places where it can be potentially affected by the same issue:

1 buf0flu.cc   pc_sleep_if_needed             2689 return(os_event_wait_time_low(buf_flush_event,
2 os0event.h   os_event_wait_time              133 #define os_event_wait_time(e, t) os_event_wait_time_low((e), (t), 0)
3 lock0wait.cc DECLARE_THREAD                  501 os_event_wait_time_low(event, 1000000, sig_count);
4 os0event.cc  os_event_wait_time_low          507 os_event_wait_time_low(
5 row0merge.cc row_merge_read_clustered_index 2574 os_event_wait_time_low(fts_parallel_sort_event,
6 row0merge.cc row_merge_build_indexes        4473 os_event_wait_time_low(
7 srv0srv.cc   DECLARE_THREAD                 1585 os_event_wait_time_low(srv_monitor_event, 5000000, sig_count);
8 srv0srv.cc   DECLARE_THREAD                 1751 os_event_wait_time_low(srv_error_event, 1000000, sig_count);
9 srv0srv.cc   srv_purge_coordinator_suspend  2679 ret = os_event_wait_time_low(
a ut0wqueue.cc ib_wqueue_timedwait             158 error = os_event_wait_time_low(wq->event,
[21 Dec 2018 13:14] Sinisa Milivojevic
Hi,

I have managed to repeat this behaviour. The only missing step in the How-To-Repeat paragraph is that massive DML have to be run during changing of the system clock. Actually, there are 6 (six) steps needed to definitely repeat the behaviour:

1. Start MySQL
2. Run massive number of DMLs on some large table
3. During those DMLs  change time: date $(date +%m%d%H%M%Y.%S -d '50 min ago')
4. After those DMLs, also change time: date $(date +%m%d%H%M%Y.%S -d '50 min ago')
5. Stop MySQL
6. Read the error log

However, this is not a bug, because changing the system time should not be done during running of the server.

However, this needs to be documented, hence I change this to the "Documentation" bug.
[21 Dec 2018 13:19] Laurynas Biveinis
This is a bug - cleaner coordinator code should use monotonic clock
[21 Dec 2018 13:23] Sinisa Milivojevic
Laurinas,

I agree with you.

Cleaner coordinator should use a monotonic clock.

But, this makes it into a feature request, so I set it to the feature request.
[21 Dec 2018 13:58] Marcelo Altmann
This can be repeated with the 3 steps I mentioned.

Time passed to os_event::timed_wait is based on before the time change, doesn't matter if you have pages to flush or not.

> However, this is not a bug, because changing the system time should not be done during running of the server.

With all due respect, I don't agree with the above sentence. What about daylight saving and NTP?
[21 Dec 2018 14:08] Jeff Leuschner
We have upgraded to 5.7.24 and have numerous customers and are hitting this at multiple sites. This is critical to us and our customers, so looking for a fix or a work-around in short order if possible.
[21 Dec 2018 17:49] Sinisa Milivojevic
Hello Mr. Leschner,

If you wish to escalate this problem, you or anyone of your customers should create the appropriate ticket in our support system.
[21 Dec 2018 18:07] Marcelo Altmann
This issue doesn't seem like a feature request. It's rather a BUG.

clock can drift for various reasons and a simple correction made by NTP can freeze InnoDB page_cleaner for as many seconds/minutes the clock was adjusted.
[25 Dec 2018 18:34] Miguel Solorzano
Back to original severity reported.