Bug #79634 transaction written to the binary log were not actually applied to the DB
Submitted: 14 Dec 2015 13:04 Modified: 18 Jan 2016 10:02
Reporter: Eyal Sorek Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:5.6.25-log OS:Debian (7.8)
Assigned to: MySQL Verification Team CPU Architecture:Any

[14 Dec 2015 13:04] Eyal Sorek
Description:
Application has written 5 records to a table, which were recorded in the binary log, but in the database only 3 records were actually inserted into the table.
From the binary log, after doing - mysqlbinlog -vv mysql-bin.004071 --base64-output=DECODE-ROWS - and grepping  all relevant transaction, by the key meta_site_id='57b1e706-a621-415b-a5ed-39533fecaf9a':

DELETE FROM application_instances WHERE meta_site_id = '57b1e706-a621-415b-a5ed-39533fecaf9a'
INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('f6614884-3ec5-4060-b913-12ba1f0096a0', 'HtmlWeb', '57b1e706-a621-415b-a5ed-39533fecaf9a', 1449963427604423964) ON DUPLICATE KEY UPDATE meta_site_id = '57b1e706-a621-415b-a5ed-39533fecaf9a', last_update_date = 1449963427604423964
INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('141f4df1-e18d-08e1-2c65-af5810ebae03', 'AppBuilder', '57b1e706-a621-415b-a5ed-39533fecaf9a', 1449963427604425047) ON DUPLICATE KEY UPDATE meta_site_id = '57b1e706-a621-415b-a5ed-39533fecaf9a', last_update_date = 1449963427604425047
INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('141f4df1-e33f-f2f4-b4ee-a05fa19209ee', 'SiteMembers', '57b1e706-a621-415b-a5ed-39533fecaf9a', 1449963427604425347) ON DUPLICATE KEY UPDATE meta_site_id = '57b1e706-a621-415b-a5ed-39533fecaf9a', last_update_date = 1449963427604425347
INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('0ad48b1f-c74a-419b-ba64-708a343cd2e9', '135c3d92-0fea-1f9d-2ba5-2a1dfb04297e', '57b1e706-a621-415b-a5ed-39533fecaf9a', 1449963427604425543) ON DUPLICATE KEY UPDATE meta_site_id = '57b1e706-a621-415b-a5ed-39533fecaf9a', last_update_date = 1449963427604425543
INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('597c4135-2cd0-44d4-96a3-a2066416bdb6', '139ef4fa-c108-8f9a-c7be-d5f492a2c939', '57b1e706-a621-415b-a5ed-39533fecaf9a', 1449963427604425739) ON DUPLICATE KEY UPDATE meta_site_id = '57b1e706-a621-415b-a5ed-39533fecaf9a', last_update_date = 1449963427604425739

SELECT * FROM application_instances WHERE meta_site_id = '57b1e706-a621-415b-a5ed-39533fecaf9a';
+--------------------------------------+-------------+--------------------------------------+---------------------+
| instance_id                          | app_def_id  | meta_site_id                         | last_update_date    |
+--------------------------------------+-------------+--------------------------------------+---------------------+
| 141f4df1-e18d-08e1-2c65-af5810ebae03 | AppBuilder  | 57b1e706-a621-415b-a5ed-39533fecaf9a | 1449963427604425047 |
| 141f4df1-e33f-f2f4-b4ee-a05fa19209ee | SiteMembers | 57b1e706-a621-415b-a5ed-39533fecaf9a | 1449963427604425347 |
| f6614884-3ec5-4060-b913-12ba1f0096a0 | HtmlWeb     | 57b1e706-a621-415b-a5ed-39533fecaf9a | 1449963427604423964 |
+--------------------------------------+-------------+--------------------------------------+---------------------+
3 rows in set (0.00 sec)

Our config file settings:
cat /etc/mysql/my_meta_site.cnf

[client]
port   = 3306
socket = /var/run/mysqld/mysqld_meta_site.sock

[mysqld_safe]
socket            = /var/run/mysqld/mysqld_meta_site.sock
nice              = 0
open-files-limit  = 16384

# *** Application-specific options follow here ***
#
# The MySQL server
#
[mysqld]

#
# * Basic Settings
#
user                      = mysql
pid-file                  = /var/run/mysqld/mysqld_meta_site.pid
socket                    = /var/run/mysqld/mysqld_meta_site.sock
port                      = 3306
basedir                   = /usr
datadir                   = /var/lib/mysql/meta_site/data
tmpdir                    = /var/lib/mysql/meta_site/tmp
innodb_log_group_home_dir = /var/lib/mysql/meta_site/journal

language = /usr/share/mysql/english
skip-name-resolve
skip-external-locking

net_read_timeout  = 120
net_write_timeout = 120
net_retry_count   = 10
net_buffer_length = 16384

bind-address  = 172.16.103.8
report-host   = db3-8.aus.wixpress.com

character_set_server = utf8
collation_server     = utf8_general_ci

max_allowed_packet      = 512M
thread_stack            = 192K
query_alloc_block_size  = 16K
transaction_isolation   = REPEATABLE-READ
tmp_table_size          = 64M
default_storage_engine  = InnoDB
max_heap_table_size     = 64M
sort_buffer_size        = 8M
join_buffer_size        = 8M
thread_cache_size       = 256
back_log                = 50
max_connections         = 1024
max_connect_errors      = 9999999
table_open_cache        = 8192
thread_concurrency      = 12
query_cache_size        = 0
query_cache_type        = 0
query_cache_limit       = 2M

# * Logging and Replication
#
general_log      = 0
general_log_file = /var/lib/mysql/meta_site/log/db3-8.aus.wixpress.com-general.log

# sync_binlog ensures that all writes to the binary log are immediately
# flushed to disk.  This is important, especially for replication, because
# if the server crashes and has not written all of the binary log to disk (and flushed it)
# then some rows will not make it to the slave
sync_binlog               = 1
slave_compressed_protocol = 1

slave-skip-errors = 1032,1062

log_error             = /var/lib/mysql/meta_site/log/db3-8.aus.wixpress.com.err
slow_query_log        = 1
slow_query_log_file   = /var/lib/mysql/meta_site/log/db3-8.aus.wixpress.com-slow.log
long_query_time       = 2

server-id = 2016103008

log_bin          = /var/lib/mysql/meta_site/binary/mysql-bin
log_bin_index    = /var/lib/mysql/meta_site/binary/mysql-bin.index
expire_logs_days = 10
max_binlog_size  = 1G

binlog_cache_size = 1M

relay_log       = /var/lib/mysql/meta_site/relay/relay-bin
relay_log_index = /var/lib/mysql/meta_site/relay/relay-bin.index
relay_log_info_file = /var/lib/mysql/meta_site/relay/relay-log.info

auto_increment_increment = 1
auto_increment_offset = 1

log_slave_updates=ON
binlog_format=MIXED
binlog_row_image=MINIMAL

# Enable the full query log. Every query (even ones with incorrect
# syntax) that the server receives will be logged. This is useful for
# debugging, it is usually disabled in production use.
#log

# Log warnings to the error log
log_warnings = 1

# Log more information in the slow query log. Normally it is good to
# have this turned on. This will enable logging of queries that are not
# using indexes in addition to long running queries.
# this parameter is removed as of mysql5.5

innodb_doublewrite = 1

innodb_file_format = Barracuda
innodb_additional_mem_pool_size = 16M
innodb_buffer_pool_size = 48G
innodb_file_per_table
innodb_thread_concurrency = 0
innodb_flush_log_at_trx_commit = 2
innodb_fast_shutdown
innodb_log_buffer_size = 64M
innodb_log_file_size = 1G
innodb_log_files_in_group = 3
innodb_max_dirty_pages_pct = 80
innodb_flush_method = O_DIRECT
innodb_lock_wait_timeout = 120

innodb_commit_concurrency  = 0
innodb_open_files = 2000

[mysqldump]
quick
quote-names
max_allowed_packet = 512M
net_buffer_length  = 16384

[mysql]
prompt = \u@\h(db3-8.aus):[\d]>\_

How to repeat:
It happens to use sporadically. 
Unfortunately we still can not reproduce on demand.

Suggested fix:
I have no idea
[14 Dec 2015 16:44] MySQL Verification Team
Please provide the create table statement private if you wish and the whole binlog file too. Thanks.
[14 Dec 2015 18:38] Eyal Sorek
Uploaded decoded as text file the all binary log file to:  //support/incoming/mysql_bin.txt_bug-data_79634.gz
[18 Dec 2015 11:20] MySQL Verification Team
Hi,

Thanks for the submission. 
I can't reproduce this but from what I read you can't reproduce on demand either.

Did you perform check table on this table? It is possible table is corrupted? Anything in the error log?

Are you 100% sure you don't have any process in parallel modifying data while you are running this? 

thanks
Bogdan Kecman
[30 Dec 2015 15:12] Eyal Sorek
Hi,
1. There was nothing in the error log and yes we are  100% sure we don't have any process in parallel modifying data
while running this, otherwise it would be presented in the binary log. Everything done in the DB, is logged in the binary log.
2. I verified also in subsequent binary log, and there was no activity on this meta_site_id.
3. We did not check the table, can check be done online with not downtime for a service writing to this table ? but is it relevant ? it happens sporadically, most of the time it is ok, but we see very low percent of transactions getting this mutation.
In addition, I believe it is not relevant, since the binary log has all 5 records and also in ALL secondaries the DB contains only 3 records.
This eliminates the option that physical issue on the master DB table, is the issues, since the secondaries suppose to apply, what is contained in the binary logs...

Thanks,
Eyal
[30 Dec 2015 19:42] MySQL Verification Team
Hi,

> 1. There was nothing in the error log and yes we are
>    100% sure we don't have any process in parallel modifying data
>    while running this, otherwise it would be presented in the binary log.
>    Everything done in the DB, is logged in the binary log.

If this config file is what mysqld is running with then yes but often you can
have mysqld running with replicate-do-db or similar parameter and way these 
work with mysql are not always properly understood and allow for non logged
changes to happen (for e.g. you can replicate-do-db db1; and then exec
use db2; delete from db1.table1; and you will end up with blank table1 and
nothing logged!). 

This is why I only trust general query log.

> 3. We did not check the table, can check be done online with not downtime
> for a service writing to this table ?

On a non ndbcluster storage engine - unfortunately no.

> but is it relevant ? it happens sporadically, most of the time it is ok, 
> but we see very low percent of transactions getting this mutation.

This is a *very* dangerous behavior and if we have a bug like this 
then I guess we'd have a huge number of users complaining. Also, if this is
a bug in MySQL then there would be more then one table where you would 
experience this behavior; don't you agree?

Anyhow I do agree with rest of your comment that this might not be about that
table. I'd still like to rebuild it :D but I do very much understand the
downtime issues you want to avoid.

> In addition, I believe it is not relevant, 
> since the binary log has all 5 records and also in 
> ALL secondaries the DB contains only 3 records.

> This eliminates the option that physical issue on the master DB table,
> is the issues, since the secondaries suppose to apply, what is contained in the binary logs...

I agree with you that it might not be linked to a table on master.

Now lemme just try to summarize so that you can confirm that I understand the problem
you are reporting, as at this moment I'm not 100% sure I understand it.

- you have master + few slaves
- master is binloging all (no filters)
- slaves are executing everything (no filters)
- you are sure you have, from your app, executed delete where metasiteid=X and then 5 inserts with metasiteid=X
- you see your delete + 5 inserts in the binlog on the master
- you see delete + 5 inserts on relay log on slaves (did you check this actually?)
- when you select where metasiteid=X you get 3 rows on master as well on slaves?

I tried this on number of releases - no chance of reproducing. Without ability to reproduce it's seriously difficult to see what's going on. I can suggest some things to try to debug this but they require a lot of work and might not be possible at all in production environment (for e.g. promote one of your slaves to master and see if this will happen again if another machine is a slave) 

I'm still confused how come this is happening on one table only and from time to time only. I'd normally suggest trying out ROW instead of MIXED binlog format, but you have all 5 queries in your binlog so I doubt it's the binlog format that's problem here.

Knowing what's in the relay log might help, if you can check your relay log and see if all 5 queries are there that might bring us closer to the solution. Can you maybe add one "testing" slave to the mix that we can use to debug what's going on?

kind regards
Bogdan Kecman
[5 Jan 2016 5:25] Jesper wisborg Krogh
Hi,

On both the master and slave, what is the output of:

   SHOW TRIGGERS LIKE 'application_instances';

I also note that the rows that are applied in the database have app_def_id set to the values AppBuilder, SiteMembers, and HtmlWeb whereas the rows that did not apply have UUIDs as values for app_def_id.

Is that always the case?

Best regards,
Jesper Krogh
[10 Jan 2016 16:56] Eyal Sorek
Hi,
- The summary above is correct.
- In addition, I just checked recent cases and the DELETE & INSERT's are fully in the binary log & in the relay log on the secondaries. So the correct number of  DELETE & INSERT's are being recorded in both MASTER binary logs  & Slaves relay logs.
- And we saw that there were cases that the app_def_if is also with UUID that is being saved, while others are not saved.
- We are not using triggers at all. SHOW TRIGGERS on the table comes out empty on both Master & Slaves:
MASTER:
root@localhost(db3-8.aus):[wix_meta_site]> SHOW TRIGGERS LIKE 'application_instances';
Empty set (0.01 sec)
root@localhost(db3-8.aus):[wix_meta_site]> select count(*) from information_schema.triggers;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.01 sec)
SLAVE:
root@localhost(db6-8.aus):[wix_meta_site]> SHOW TRIGGERS LIKE 'application_instances';
Empty set (0.00 sec)

root@localhost(db6-8.aus):[wix_meta_site]> select count(*) from information_schema.triggers;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.01 sec)

root@localhost(db6-8.aus):[wix_meta_site]>

- See last sample data of additional missing records (out of 30K missing records in the last few days) from the binary log in the master, with same exact pictures at the slaves:

In the BINARY LOG (DELETE + INSERT of 10 Records !!!):
==========================================
DELETE FROM application_instances WHERE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189'

INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('4cbd4401-6ecd-4157-ae98-d8aa98d55588', 'HtmlWeb', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) 
ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232

INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('aba048fa-cbab-43b3-bc91-26745914e723', 'SiteMembers', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232

INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('142185f6-0727-ada2-128d-50759ebb9594', '61f33d50-3002-4882-ae86-d319c1a249ab', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232

INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('142185f6-093a-635b-ae1a-0521655b5806', 'AppBuilder', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232
INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('08f06b68-d2db-4dca-89df-1dae6e9a3a68', '13322a7c-6039-ac58-86e8-48b76f901d91', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232

INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('40239c08-e8bc-48a8-8229-674837a82b17', '132f257b-2102-5c50-86e8-48b76f901d91', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232

INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('2e16da50-5b58-47b3-92e3-26f484b7fccd', '1346fcbb-e41e-6d12-ad1b-95ddfe1577d4', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232

INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('2cfb0ccb-6246-49f7-9d98-f419eaed8967', '135f6b38-764d-e005-2be3-1c77bf8d8f1d', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232

INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('b7b51de2-0fda-4c19-b4db-a22498451c91', '139ef4fa-c108-8f9a-c7be-d5f492a2c939', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232

INSERT INTO application_instances (instance_id, app_def_id, meta_site_id, last_update_date) VALUES ('59d527a9-a69b-4bca-ab7c-dfaff1324bde', '135c3d92-0fea-1f9d-2ba5-2a1dfb04297e', '5ca81f53-5bda-4ec4-bfe3-51c83118a189', 1452374546899399232) ON DUPLICATE KEY UPDATE meta_site_id = '5ca81f53-5bda-4ec4-bfe3-51c83118a189', last_update_date = 1452374546899399232

In the DB (Only 4 records exists !!! for the specific meta_site_id='5ca81f53-5bda-4ec4-bfe3-51c83118a189' with the  same last_update_date timestamp):
=============================================================

root@localhost(db3-8.aus):[wix_meta_site]> select * from application_instances where meta_site_id='5ca81f53-5bda-4ec4-bfe3-51c83118a189';
+--------------------------------------+--------------------------------------+--------------------------------------+---------------------+
| instance_id                          | app_def_id                           | meta_site_id                         | last_update_date    |
+--------------------------------------+--------------------------------------+--------------------------------------+---------------------+
| 142185f6-0727-ada2-128d-50759ebb9594 | 61f33d50-3002-4882-ae86-d319c1a249ab | 5ca81f53-5bda-4ec4-bfe3-51c83118a189 | 1452374546899399232 |
| 142185f6-093a-635b-ae1a-0521655b5806 | AppBuilder                           | 5ca81f53-5bda-4ec4-bfe3-51c83118a189 | 1452374546899399232 |
| 4cbd4401-6ecd-4157-ae98-d8aa98d55588 | HtmlWeb                              | 5ca81f53-5bda-4ec4-bfe3-51c83118a189 | 1452374546899399232 |
| aba048fa-cbab-43b3-bc91-26745914e723 | SiteMembers                          | 5ca81f53-5bda-4ec4-bfe3-51c83118a189 | 1452374546899399232 |
+--------------------------------------+--------------------------------------+--------------------------------------+---------------------+
4 rows in set (0.00 sec)

I know it is VERY weird and I can not start to explain what the hell is going on here.
This is the first time I bump into such weird case/bug in MySQL.
We investigated any direction in the code & db that can cause it and we don't have any clue
how to proceed with this issue.
The fact the the binary log has all the data, both the master binary logs & slaves relay logs and all the DB's (master & slaves) are missing the data, is too weird for us to handle.
Since I would begin to understand some logic, if it was only on the master binary log, but the fact the missing inserts exists in the slaves relay logs, eliminated any so called "master manipulation stuff"...
Will appreciate an advise how to process..
Eyal.
[10 Jan 2016 18:10] Eyal Sorek
OK, After deep investigation, of couple of us,  we found a rare edge case scenario of applicative bug that might explain this issue, of updating the app_def_id, instance_id with another metasite_id (moved the app_def_id, instance_id combination to another metasite_id), which explains why we did not find the same application instances that were entered to the specific metasite_id, that it was initially inserted to.
Sorry for the bother, but it seems, our applicative bug.
As i mentioned above, it was to me, as a DBA, too weird to be true, in the DB level.
And, at last, I got relief... :-).
You can close this bug.
[18 Jan 2016 10:02] MySQL Verification Team
Hi,

Thanks for the update.

Kind regards
Bogdan Kecman