MySQL Bugs: #77740: silent failure to start if mysql.gtids_executed gets HA_ERR_LOCK_WAIT

Bug #77740	silent failure to start if mysql.gtids_executed gets HA_ERR_LOCK_WAIT_TIMEOUT
Submitted:	15 Jul 2015 20:41	Modified:	8 Dec 2015 11:14
Reporter:	Shane Bester (Platinum Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	5.7.9	OS:	Windows
Assigned to:		CPU Architecture:	Any

Description:
Take a look at this error log and try determine what the problem is:

mysqld.exe --no-defaults --console --innodb-lock-wait-timeout=1 
--log-bin --server-id=1 --binlog-format=mixed 
--binlog_direct_non_transactional_updates=1 --binlog_row_image=minimal 

[Note] mysqld (mysqld 5.7.9-log) starting as process 10948 ...
[Note] InnoDB: Mutexes and rw_locks use Windows interlocked functions
[Note] InnoDB: Uses event mutexes
[Note] InnoDB: _mm_lfence() and _mm_sfence() are used for memory barrier
[Note] InnoDB: Compressed tables use zlib 1.2.3
[Note] InnoDB: Number of pools: 1
[Note] InnoDB: Not using CPU crc32 instructions
[Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
[Note] InnoDB: Completed initialization of buffer pool
[Note] InnoDB: Highest supported file format is Barracuda.
[Note] InnoDB: Transaction 21964352 was in the XA prepared state.
[Note] InnoDB: Transaction 21964352 was in the XA prepared state.
[Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo
[Note] InnoDB: Trx id counter is 21972480
[Note] InnoDB: Creating shared tablespace for temporary tables
[Note] InnoDB: Starting in background the rollback of uncommitted transactions
[Note] InnoDB: Setting file '.\ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
[Note] InnoDB: Rollback of non-prepared transactions completed
[Note] InnoDB: File '.\ibtmp1' size is now 12 MB.
[Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
[Note] InnoDB: 32 non-redo rollback segment(s) are active.
[Note] InnoDB: Waiting for purge to start
[Note] InnoDB: 5.7.9 started; log sequence number 48567253845
[Note] InnoDB: Loading buffer pool(s) from .\\ib_buffer_pool
[Note] Plugin 'FEDERATED' is disabled.
[Note] InnoDB: Starting recovery for XA transactions...
[Note] InnoDB: Transaction 21964352 in prepared state after recovery
[Note] InnoDB: Transaction contains changes to 2 rows
[Note] InnoDB: 1 transactions in prepared state after recovery
[Note] Found 1 prepared transaction(s) in InnoDB
[Note] InnoDB: Buffer pool(s) load completed at 150715 21:24:26
[Warning] Found 1 prepared XA transactions
[ERROR] Aborting

[Note] Binlog end
[Note] Shutting down plugin 'ngram'
[Note] Shutting down plugin 'PERFORMANCE_SCHEMA'
[Note] Shutting down plugin 'partition'
[Note] Shutting down plugin 'BLACKHOLE'
[Note] Shutting down plugin 'ARCHIVE'
[Note] Shutting down plugin 'MRG_MYISAM'
[Note] Shutting down plugin 'MyISAM'

It's failing to start after one second because my innodb-lock-wait-timeout is 1.   
In normal servers it would hang for 50 seconds before failing.
The lock wait timeout is here:

00 ntdll!NtWaitForKeyedEvent+0xa
01 ntdll!RtlSleepConditionVariableCS+0xd4
02 kernel32!SleepConditionVariableCS+0x28
03 mysqld_debug!os_event::wait+0x3f
04 mysqld_debug!os_event::wait_low+0x6d
05 mysqld_debug!os_event_wait_low+0x32
06 mysqld_debug!lock_wait_suspend_thread+0x579
07 mysqld_debug!row_mysql_handle_errors+0x1ce
08 mysqld_debug!row_search_mvcc+0x4090
09 mysqld_debug!ha_innobase::general_fetch+0x1b8
0a mysqld_debug!ha_innobase::rnd_next+0xcc
0b mysqld_debug!handler::ha_rnd_next+0x306
0c mysqld_debug!Gtid_table_persistor::fetch_gtids+0x113
0d mysqld_debug!Gtid_state::read_gtid_executed_from_table+0x38
0e mysqld_debug!win_main+0x468
0f mysqld_debug!mysql_service+0x62
10 mysqld_debug!mysqld_main+0x52c
11 mysqld_debug!main+0x2f
12 mysqld_debug!__tmainCRTStartup+0xec
13 mysqld_debug!mainCRTStartup+0xe
14 kernel32!BaseThreadInitThunk+0xd
15 ntdll!RtlUserThreadStart+0x1d

How to repeat:
How the server got into this state is a mystery.  It is my testing server.
But why does error log contain NO useful information?
I will upload my datadir to the bug report.

Suggested fix:
Don't fail silently.  
Find out why lock wait timeouts are happening during startup and prevent them...

setting as verified so I can upload datadir to internal bug report.
the bug is real,  but I do NOT know how server got into this state, but I think that doesn't matter.

Posted by developer:
 
Resurrecting prepared XA at the server restart is a rather regular feature.
I can't imagine how it could lead to a mentioned frm corruption.

As to if this xa trx could be locking the gtid table, that's an open question.
I recommended to DaoGang to set up a scenario where the prepared XA would modify
the gtid table, e.g explicitly, prior to the server crash, and see how things would unfold
at restart.

Overall, it's sad there's no how-reproduce it.

Thanks for your feedback. This has been fixed in upcoming versions and the following was noted in the 5.7.11 change log:
If the server stopped unexpectedly immediately before committing an XA transaction which had been prepared, and the transaction modified the mysql.gtid_executed table, then the subsequent recovery was aborting due to an innodb_lock_wait_timeout error when it was reading the mysql.gtid_executed table. To fix the problem, XA transactions are no longer permitted to modify the mysql.gtid_executed table.