Bug #76262 5.7.6 crash when running mysql_upgrade after binary upgrade
Submitted: 11 Mar 2015 13:21 Modified: 17 Mar 2015 22:46
Reporter: Simon Mudd (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:5.7.6 OS:Any
Assigned to: Marko Mäkelä CPU Architecture:Any
Tags: 5.7.6, crash, mysql_upgrade

[11 Mar 2015 13:21] Simon Mudd
Description:
I had a 5.7.5 slave running for 100 days, running fine, so seeing that 5.7.6 is out I thought I'd upgrade it to 5.7.6. This was a binary rpm upgrade using the CentOS yum rpms.

I did a clean shutdown , updated the rpms and then started mysql with replication stopped
using /etc/init.d/mysqld start --skip-slave-start

I notice logging which indicates there are table differences and understand that's expected etc with minor version changes in a DEV version of MySQL.

So I ran mysql_upgrade and mysqld crashed.

How to repeat:
See: above.

Logging shows:

2015-03-11T11:38:44.443996Z 3326 [Warning] System table 'servers' is expected to be transactional.
2015-03-11T11:38:44.472964Z 3326 [Warning] InnoDB: Resetting unknown page [page id: space=0, page number=0] type 0 to 13 before writing to data file.
2015-03-11T11:38:44.475942Z 3326 [ERROR] InnoDB: Wrong owned count 15, 3, rec 191
2015-03-11T11:38:44.476038Z 3326 [Note] InnoDB: Page dump in ascii and hex (16384 bytes):
…
InnoDB: End of page dump
2015-03-11T11:38:44.697455Z 3326 [Note] InnoDB: Uncompressed page, stored checksum in field1 2637354485, calculated checksums for field1: crc32 584182572, innodb 2637354485, none 3735928559, stored checksum in field2 2812270395, calculated checksums for field2: crc32 584182572, innodb 2812270395, none 3735928559,  page LSN 5287 3296975656, low 4 bytes of LSN at page end 3296975656, page number (if stored to page already) 32768, space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an index page where index id is 748
2015-03-11T11:38:44.697620Z 3326 [ERROR] [FATAL] InnoDB: Apparent corruption of an index page [page id: space=0, page number=32768] to be written to data file. We intentionally crash the server to prevent corrupt data from ending up in data files.
2015-03-11 12:38:44 0x7fbdb410e700  InnoDB: Assertion failure in thread 140452746553088 in file ut0ut.cc line 906
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
11:38:44 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=805306368
read_buffer_size=131072
max_used_connections=4
max_threads=3000
thread_count=3
connection_count=3
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4280049 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fa904015cc0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fbdb410de28 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0xd076b5]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6fa7f4]
/lib64/libpthread.so.0(+0xf710)[0x7fbdbf1cc710]
/lib64/libc.so.6(gsignal+0x35)[0x7fbdbde77635]
/lib64/libc.so.6(abort+0x175)[0x7fbdbde78e15]
/usr/sbin/mysqld(_Z18ut_print_timestampP8_IO_FILE+0x0)[0x6e963c]
/usr/sbin/mysqld(_ZN2ib5fatalD1Ev+0xb3)[0xf3d323]
/usr/sbin/mysqld[0xf85efe]
/usr/sbin/mysqld(_Z31buf_dblwr_flush_buffered_writesv+0x3f8)[0xf86d88]
/usr/sbin/mysqld(_Z18buf_flush_do_batchP10buf_pool_t11buf_flush_tmmPmP16ut_stage_alter_t+0x2c5)[0xf8e445]
/usr/sbin/mysqld(_Z15buf_flush_listsmmPmP16ut_stage_alter_t+0x322)[0xf8f5b2]
/usr/sbin/mysqld(_Z22log_make_checkpoint_atmbP16ut_stage_alter_t+0x204)[0xe3e8a4]
/usr/sbin/mysqld(_Z23row_merge_build_indexesP5trx_tP12dict_table_tS2_bPP12dict_index_tPKmmP5TABLEPK8dtuple_tS7_mR13ib_sequence_tbP16ut_stage_alter_t+0x1cbf)[0xea6c4f]
/usr/sbin/mysqld(_ZN11ha_innobase19inplace_alter_tableEP5TABLEP18Alter_inplace_info+0x1b1)[0xe093d1]
/usr/sbin/mysqld(_Z17mysql_alter_tableP3THDPKcS2_P24st_ha_create_informationP10TABLE_LISTP10Alter_info+0x36b5)[0xb63aa5]
/usr/sbin/mysqld(_ZN19Sql_cmd_alter_table7executeEP3THD+0x4a7)[0xc4e5a7]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0xea3)[0xaf75a3]
/usr/sbin/mysqld(_Z11mysql_parseP3THDP12Parser_state+0x438)[0xafb5d8]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcm+0x15c4)[0xafcc54]
/usr/sbin/mysqld(handle_connection+0x284)[0xbc3144]
/usr/sbin/mysqld(pfs_spawn_thread+0x12a)[0xd6bc0a]
/lib64/libpthread.so.0(+0x79d1)[0x7fbdbf1c49d1]
/lib64/libc.so.6(clone+0x6d)[0x7fbdbdf2d86d]

I did check and may have missed the fact that upgrading from 5.7.5 to 5.7.6 is not supported because of the table changes but don't see any clear warning not to do this.
Did I miss such a comment?  If that's the case then that's fine. If such a recommendation should be added (and is missing) please make it prominent as I had done a 5.7.4 to 5.7.5 upgrade previously with not ill effect.

If this behaviour is not expected then please let me know what other info you might need.

Suggested fix:
Point me where it says: don't do a binary upgrade from 5.7.5 (if it exists), or
Add such a comment (if needed), or 
fix the bug if you consider this behaviour a bug.
[11 Mar 2015 13:32] Simon Mudd
I missed some logging:

...

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fa904004860): is an invalid pointer
Connection ID (thread ID): 3326
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
150311 12:38:45 mysqld_safe Number of processes running now: 0
150311 12:38:45 mysqld_safe mysqld restarted
2015-03-11T11:38:46.650963Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2015-03-11T11:38:46.656725Z 0 [Note] /usr/sbin/mysqld (mysqld 5.7.6-m16-log) starting as process 43243 ...
2015-03-11T11:38:46.891487Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-03-11T11:38:46.891630Z 0 [Note] InnoDB: Uses event mutexes
2015-03-11T11:38:46.891673Z 0 [Note] InnoDB: GCC builtin __sync_synchronize() is used for memory barrier
2015-03-11T11:38:46.891705Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.3
2015-03-11T11:38:46.891735Z 0 [Note] InnoDB: Using Linux native AIO
2015-03-11T11:38:46.895971Z 0 [Note] InnoDB: Number of pools: 1
2015-03-11T11:38:46.896273Z 0 [Note] InnoDB: Using CPU crc32 instructions
2015-03-11T11:38:46.907189Z 0 [Note] InnoDB: Initializing buffer pool, total size = 76G, instances = 38, chunk size = 128M
2015-03-11T11:38:57.717219Z 0 [Note] InnoDB: Completed initialization of buffer pool
2015-03-11T11:38:58.783350Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2015-03-11T11:38:58.802946Z 0 [ERROR] InnoDB: Checksum mismatch in datafile: ./ibdata1, Space ID:0, Flags: 0. Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2015-03-11T11:38:58.803091Z 0 [ERROR] InnoDB: Corrupted page [page id: space=0, page number=0] of datafile './ibdata1' could not be found in the doublewrite buffer.
2015-03-11T11:38:59.403702Z 0 [ERROR] Plugin 'InnoDB' init function returned error.
2015-03-11T11:38:59.403804Z 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2015-03-11T11:38:59.403857Z 0 [ERROR] Failed to initialize plugins.
2015-03-11T11:38:59.403896Z 0 [ERROR] Aborting

2015-03-11T11:38:59.403945Z 0 [Note] Binlog end
2015-03-11T11:38:59.404059Z 0 [Note] Shutting down plugin 'MyISAM'
2015-03-11T11:38:59.417995Z 0 [Note] /usr/sbin/mysqld: Shutdown complete

150311 12:38:59 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
[11 Mar 2015 22:31] Sunny Bains
Simon, thanks for the bug report, we will investigate and hopefully have something sensible to say about this soon.
[12 Mar 2015 7:01] MySQL Verification Team
The resolved stack trace.
---------------------------
my_print_stacktrace at ./mysql-5.7.6-m16/mysys/stacktrace.c:225
handle_fatal_signal at ./mysql-5.7.6-m16/sql/signal_handler.cc:150
ut_print_timestamp(_IO_FILE*) at ./mysql-5.7.6-m16/storage/innobase/ut/ut0ut.cc:218
~fatal at ./mysql-5.7.6-m16/storage/innobase/ut/ut0ut.cc:906
buf_dblwr_assert_on_corrupt_block at ./mysql-5.7.6-m16/storage/innobase/buf/buf0dblwr.cc:717
(inlined by) buf_dblwr_check_block at ./mysql-5.7.6-m16/storage/innobase/buf/buf0dblwr.cc:808
buf_dblwr_flush_buffered_writes() at ./mysql-5.7.6-m16/storage/innobase/buf/buf0dblwr.cc:940
buf_flush_do_batch at ./mysql-5.7.6-m16/storage/innobase/buf/buf0flu.cc:1943
buf_flush_lists at ./mysql-5.7.6-m16/storage/innobase/buf/buf0flu.cc:2075
log_preflush_pool_modified_pages at ./mysql-5.7.6-m16/storage/innobase/log/log0log.cc:1439
(inlined by) log_make_checkpoint_at at ./mysql-5.7.6-m16/storage/innobase/log/log0log.cc:1877
row_merge_build_indexes at ./mysql-5.7.6-m16/storage/innobase/row/row0merge.cc:4502
ha_innobase::inplace_alter_table at ./mysql-5.7.6-m16/storage/innobase/handler/handler0alter.cc:4628
mysql_inplace_alter_table at ./mysql-5.7.6-m16/sql/sql_table.cc:6955
(inlined by) mysql_alter_table at ./mysql-5.7.6-m16/sql/sql_table.cc:8882
Sql_cmd_alter_table::execute at ./mysql-5.7.6-m16/sql/sql_alter.cc:316
mysql_execute_command at ./mysql-5.7.6-m16/sql/sql_parse.cc:4452
mysql_parse at ./mysql-5.7.6-m16/sql/sql_parse.cc:5161
Parser_state::reset at ./mysql-5.7.6-m16/sql/sql_lex.h:3489
(inlined by) dispatch_command at ./mysql-5.7.6-m16/sql/sql_parse.cc:1326
handle_connection at ./mysql-5.7.6-m16/sql/conn_handler/connection_handler_per_thread.cc:298
pfs_spawn_thread at ./mysql-5.7.6-m16/storage/perfschema/pfs.cc:2150
[12 Mar 2015 9:10] Marko Mäkelä
This issue caused by InnoDB having been really sloppy before MySQL 5.1, and being more strict in MySQL 5.7.

Old versions of InnoDB did not initialize all fields when creating buffer pages. We would only initialize page[FIL_PAGE_TYPE]=FIL_PAGE_INDEX on B-tree index pages. On other types of pages, the field FIL_PAGE_TYPE could be anything, including FIL_PAGE_INDEX.

In the log, we see that page 0 in the system tablespace happened to have 0 in the FIL_PAGE_TYPE field. MySQL 5.7 is initializing all unknown page types to 13 to avoid trouble in the future (in case some future version of MySQL is going to introduce more page types).

The crash occurs, because page 32768 in the system tablespace, which must be an allocation bitmap page, is carrying the page type FIL_PAGE_INDEX. What probably happened was that when this page was originally created before MySQL 5.1, a buffer pool page frame that once belonged to a freed index page (after DROP TABLE?) got reused when the tablespace was extended beyond 32767 pages (512MiB).

When InnoDB is writing back an index page from the buffer pool to the datafile, it is performing a consistency check. This consistency check detects that the bitmap page does not look like a valid index page.

The reason why earlier versions of MySQL do not have trouble with this data file is that we had an extra flag to control the flush-time check.

8821 Marko Makela	2014-09-11
      Bug#17332603 REMOVE BUF_BLOCK_T::CHECK_INDEX_PAGE_AT_FLUSH
      Bug#17345513 CHECKING FIL_PAGE_TYPE BREAKS COMPATIBILITY WITH OLD
      INNODB DATA FILES

      The field block->check_index_page_at_flush was needed in the bad old
      days when InnoDB did not initialize FIL_PAGE_TYPE when creating a
      page. It would only initialize this field to FIL_PAGE_INDEX when it
      was about to flush a B-tree page. For other page types, InnoDB would
      leave garbage in the field. That garbage could also be FIL_PAGE_INDEX
      by chance. There was no problem with this in the past, because we
      would only invoke the B-tree validation functions when
      block->check_index_page_at_flush had been set by some B-tree operation
      while the page was in the buffer pool.

      Since MySQL 5.1, InnoDB initializes the FIL_PAGE_TYPE for every
      page. But, there could be some old data files that contain garbage
      FIL_PAGE_TYPE values in some pages. This garbage must be reset when
      flushing the page.

      buf_block_t: Remove check_index_page_at_flush.

      buf_dblwr_check_block(): Reset unknown FIL_PAGE_TYPE values to
      FIL_PAGE_TYPE_UNKNOWN if the tablespace flags are 0.
      For newer tablespaces, the flags never are 0, and we will complain
      about a corrupted page.

      FIL_PAGE_TYPE_UNKNOWN: New constant (13) for FIL_PAGE_TYPE.

      i_s_page_type[]: Map I_S_PAGE_TYPE_UNKNOWN to FIL_PAGE_TYPE_UNKNOWN.
      Both will be reported as "UNKNOWN".

      I_S_PAGE_TYPE_BITS: A new macro, to make compile-time checks more
localized.

      With this fix, it is possible that some non-index page that happens
      to be tagged as FIL_PAGE_INDEX or FIL_PAGE_RTREE will start to trigger
      failures in page_simple_validate_new() or page_simple_validate_old().
      This situation can be handled by manually patching the data file to
replace
      the FIL_PAGE_TYPE on that page with FIL_PAGE_UNKNOWN. The error
      message will identify the tablespace ID and page number.

      rb#6604 approved by Vasil Dimov

Unfortunately, it is not only the InnoDB system tablespace and not only the bitmap pages that are affected by the garbage FIL_PAGE_TYPE. Also *.ibd files can be affected. Support for innodb_file_per_table was introduced already in MySQL 4.1, and the FIL_PAGE_TYPE on BLOB pages will be garbage (may be FIL_PAGE_INDEX by accident). Only the undo tablespaces created by MySQL 5.6 should have reliable FIL_PAGE_TYPE. Likewise, tablespaces using innodb_page_size other than the default 16k will be reliable, because this feature was introduced in MySQL 5.6.

I do not remember that this consistency check has failed recently. Close to 10 years ago, it once seemed to catch faulty RAM. I think that serious users should use ECC RAM, and we can make this consistency check debug-only in MySQL 5.7. That should be enough to catch any bugs during the development of new features.
[12 Mar 2015 11:15] Simon Mudd
Thanks for the long and descriptive answer.  However, I don't understand the last paragraph.

I am not a serious user?
My servers don't run ECC ram?  Certainly that is not the case.

What seems clear is that people like myself who do an inline binary upgrade obviously have brought forward "issues from 5.1" into the current 5.7 release.  That's not expected as I've always run mysql_upgrade, at least when running major version changes, so any special magic that's required to keep thing clean and correct "should have been done".

Shouldn't mysql_upgrade should deal with this if needed?
Can this not be resolved in the background on a 5.6 box to clean up these "ill-defined pages"?
Can this not be resolved by a special job/program to check the ibdataX or .ibd files with MySQL down to correct this?

Looking forward to seeing how this will be resolved.
[13 Mar 2015 3:39] Sunny Bains
Simon, we will basically revert this change, make it a debug change.
[13 Mar 2015 13:54] Marko Mäkelä
Simon, I was giving reasons why we could make the flush-time change a debug-only check. I am glad to hear that my assumption about you using ECC RAM was correct. Basically, the flush-time check would only make sense for those users who do not use ECC RAM. And because the check is rather lightweight, it might be blind to many cases of memory corruption.

That said, based on reviewer feedback, I tried to see if we can keep the flush-time check. It turns out that all MySQL/InnoDB versions initialized the FIL_PAGE_TYPE on b-tree and undo log pages. I started writing an alternative patch that would reset the page type to valid values on other types of pages. Old BLOB pages and other pages that are never modified would continue to contain garbage in FIL_PAGE_TYPE.

Whichever fix we choose, I plan to write a test case that modifies existing data files to contain FIL_PAGE_INDEX in every page except pages tagged FIL_PAGE_UNDO, and then executes heavy R/W workload on the dataset.
[13 Mar 2015 14:05] Marko Mäkelä
A note on the mysql_upgrade: Unfortunately, InnoDB never had proper file format version tagging from the start. For some class of files, we do cannot know if they contain garbage or not. The mysql_upgrade is supposed to be a fast metadata-only operation. Because we do not have the metadata to say whether the data contains garbage that should be made valid, I think that we should do the cleanup in a lazy fashion.

The case of having uninitialized garbage in data pages is a very nasty one. If we choose my bigger fix, it would essentially become the "background task" that you suggest, running in the 5.7 server. But it would only touch the pages that are modified. Garbage could survive in old read-mostly data files.

The metadata situation can change with the Global Data Dictionary, which we are working on (not for the MySQL 5.7 series). With that, we could properly tag old datafiles as "may contain garbage" and deal with them. But, I guess that if we implement anything along those lines, it should be made optional, so that users who originally created their data files with a newer server could avoid the clean-up operation.
[17 Mar 2015 12:10] Marko Mäkelä
Posted by developer:
 
My fix implements the more comprehensive fix while keeping the consistency check in the doublewrite buffer code.
The new test innodb.file_format_upgrade_16k is setting misleading FIL_PAGE_TYPE on the first few pages of an *.ibd file and checking the outcome.
[17 Mar 2015 22:46] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 5.7.8, 5.8.0 releases, and here's the changelog entry:

Running "mysql_upgrade" after a binary upgrade to MySQL 5.7.6 caused a
server exit on a system with data files that were originally created with
MySQL 5.0 or earlier. 

Thank you for the bug report.
[23 Mar 2015 6:48] Marko Mäkelä
Posted by developer:
 
There is another problem in the code that shipped with MySQL 5.7.6.
All implementations of InnoDB page checksums include the FIL_PAGE_TYPE field in the checksum. When we reset the FIL_PAGE_TYPE field to 13 in buf0dblwr.cc, we would fail to recompute the page checksum, making the page look corrupted for subsequent server startup attempts.

This situation could be rectified by using the innochecksum tool.

With the bug fix, we never write the updated FIL_PAGE_TYPE to files before recomputing the page checksum.
[26 Mar 2015 11:20] Daniel Price
Posted by developer:
 
The previously noted changelog entry has been revised as follows:

Running "mysql_upgrade" after a binary upgrade to MySQL 5.7.6 caused a
server exit on a system with tablespace data files that were created in
MySQL 5.0 or earlier. The fix for Bug #17345513 in MySQL 5.7.6 failed to
address all instances of garbage "FIL_PAGE_TYPE" values in tablespace data
files created in MySQL 5.0 or earlier. 

With this patch, the manual process described in the MySQL 5.7.6 release 
notes entry for Bug #17345513 for repairing non-index pages that contain invalid
"FIL_PAGE_TYPE" values is no longer necessary. 

The patch for Bug #17345513 also failed to recompute page checksums after resetting
invalid "FIL_PAGE_TYPE" values. Upon restarting the server, a failure
would occur due to an apparent page corruption. The page checksum is now
recomputed before the new "FIL_PAGE_TYPE" value is written to the data
file. 

Thank you for the bug report.
[26 Mar 2015 12:22] Daniel Price
Posted by developer:
 
The fix for this bug is in MySQL 5.7.7. The Release Note will now appear in the 5.7.7 Release Notes document.