MySQL Bugs: #26081: AMD platform: Failing assertion: block->buf_fix

Bug #26081	AMD platform: Failing assertion: block->buf_fix_count > 0
Submitted:	5 Feb 2007 12:34	Modified:	30 Dec 2008 8:27
Reporter:	Shane Bester (Platinum Quality Contributor)	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	5.0.38, 5.0.36, 5.0.34, 5.0.32, 5.0.30, 5.0.40, 5.0.45, 5.0.46, 5.0.50a,	OS:	Linux (AMD: linux 64 bit, solaris, linux 32 bit on 64-bit hardware)
Assigned to:	Sunny Bains	CPU Architecture:	Any
Tags:	crash, innodb, read committed

Description:
070201 07:53:16 mysqld started
070201 7:53:31 InnoDB: Started; log sequence number 651 2222318171
070201 7:53:31 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=xx' to avoid this problem.
070201 7:53:31 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.0.32-enterprise-gpl-log' socket: '/tmp/mysql.sock' port: 3306 MySQL Enterprise Server (GPL)
070201 7:53:31 [Note] Slave SQL thread initialized, starting replication in log 'xx.000091' at position 941419361, relay log './xx.000269' position: 941419496
070201 7:53:31 [Note] Slave I/O thread: connected to master 'xx:3306', replication started in log 'xx.000091' at position 941419361
070203 19:32:04InnoDB: Assertion failure in thread 1188542816 in file ../include/buf0buf.ic line 626
InnoDB: Failing assertion: block->buf_fix_count > 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: about forcing recovery.
mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

InnoDB: Thread 1216764256 stopped in file os0sync.c line 238
InnoDB: Thread 1176029536 stopped in file trx0trx.c line 1601
InnoDB: Thread 1165113696 stopped in file row0sel.c line 2088
key_buffer_size=402653184
read_buffer_size=1044480
InnoDB: Thread 1161386336 stopped in file trx0trx.c line 1601
InnoDB: Thread 1169373536 stopped in file row0sel.c line 2088
InnoDB: Thread 1166178656 stopped in file row0sel.c line 2088
InnoDB: Thread 1206380896 stopped in file os0sync.c line 546
max_used_connections=232
max_connections=360
threads_connected=134
InnoDB: Thread 1202121056 stopped in file srv0srv.c line 1127
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 1129053 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0x31325d2640
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x46d7b118, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
(nil)
Stack trace seems successful - bottom reached
Please read http://dev.mysql.com/doc/mysql/en/using-stack-trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x3133804580 is invalid pointer
thd->thread_id=909345
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.
InnoDB: Thread 1187211616 stopped in file row0sel.c line 2088

Number of processes running now: 0
070203 19:32:11 mysqld restarted

How to repeat:
no idea yet. opening a bug report in case other users have seen this crash.

Suggested fix:
.

buf0buf.ic in 5.0:

/************************************************************************
Decrements the bufferfix count of a buffer control block and releases
a latch, if specified. */
UNIV_INLINE
void
buf_page_release(
/*=============*/
        buf_block_t*    block,          /* in: buffer block */
        ulint           rw_latch,       /* in: RW_S_LATCH, RW_X_LATCH,
                                        RW_NO_LATCH */
        mtr_t*          mtr)            /* in: mtr */
{
        ulint   buf_fix_count;

        ut_ad(block);

        ut_a(block->state == BUF_BLOCK_FILE_PAGE);
        ut_a(block->buf_fix_count > 0);

        if (rw_latch == RW_X_LATCH && mtr->modifications) {
                mutex_enter(&buf_pool->mutex);
                buf_flush_note_modification(block, mtr);
                mutex_exit(&buf_pool->mutex);
        }

        mutex_enter(&block->mutex);

#ifdef UNIV_SYNC_DEBUG
        rw_lock_s_unlock(&(block->debug_latch));
#endif
        buf_fix_count = block->buf_fix_count;
        block->buf_fix_count = buf_fix_count - 1;

        mutex_exit(&block->mutex);

        if (rw_latch == RW_S_LATCH) {
                rw_lock_s_unlock(&(block->lock));
        } else if (rw_latch == RW_X_LATCH) {
                rw_lock_x_unlock(&(block->lock));
        }
}

I do not recall a similar bug report prior to this. This might be memory corruption, even bad hardware.

Setting this bug report to 'Can't repeat' until we can repeat this.

Heikki, the common factor between the two independent occurrences of this crash I've seen is this:

64-bit
READ-COMMITTED isolation level.

Got any clues?

Shane,

maybe the #15815 and #22868 bug fixes have broken the buffer fix count keeping for a page.

Are both users using >= 5.0.30?

Regards,

Heikki

Sunny,

please add diagnostic code to buf0buf.ic in 5.0, so that when this the next time fails, we know what is the thread doing. Printing the trx, if any, may help.

Regards,

Heikki

1 user had 5.0.32-enterprise-gpl-log and the other had 5.0.30-ent...

Hi -- we have also seen this same crash:

070322  6:34:37InnoDB: Assertion failure in thread 1258297696 in file ../include/buf0buf.ic line 626
InnoDB: Failing assertion: block->buf_fix_count > 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Thread 1201588576 stopped in file ../include/sync0sync.ic line 111
InnoDB: Thread 1230608736 stopped in file ../include/sync0sync.ic line 111

We are using enterprise-gpl-5.0.32-linux-x86_64-glibc23, and like your other two cases, we are also using read-committed transaction isolation.

error log from offermatica's mysql crash (starts 070322)

Attachment: mysqlcrash.err (application/octet-stream, text), 10.71 KiB.

I've uploaded a chunk of our .err file; let me know if I can give you any more information.

Heikki, we have another reported 64 bit + read-committed crash for this bug. I'll try come up with a stress test soon.

Sunny,

please study the READ COMMITTED case.

Can you add some assertions to the code to catch this bug?

Regards,

Heikki

I don't have much new info, but I hit this recently and thought I'd add my error log to the list. Linux 64-bit, mysql 5.0.40

Error log, let me know if I can do anything else...

Attachment: mysqld.err (application/octet-stream, text), 2.61 KiB.

Me too, 64-bit read-committed, error log attached. 5.0.45, built from source. Debian etch.

Attachment: db0.err (application/octet-stream, text), 1.71 KiB.

Config file in case that helps.

Attachment: db0.conf (application/octet-stream, text), 1.00 KiB.

all reporters: please try get a corefile on next crash. has anybody repeated this crash in a controlled environment?

A lot of debugging information has been added to help us diagnose this crash. See this patch:
http://lists.mysql.com/commits/30607?f=plain

I just experienced what may be the same thing under 64bit Solaris

System is an 8 core Opteron Sun Fire V40z with 32GB memory running Solaris 10 x64 (AMD64). Databases are exclusively InnoDB.

SunOS sql-gamma.freedomvoice.com 5.10 Generic_118855-36 i86pc i386 i86pc

5.0.38-enterprise-gpl-log MySQL Enterprise Server (GPL)

070918 22:20:01InnoDB: Assertion failure in thread 26 in file ./../include/buf0buf.ic line 626
InnoDB: Failing assertion: block->buf_fix_count > 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: about forcing recovery.
InnoDB: Thread 89 stopped in file ./../include/sync0sync.ic line 111
InnoDB: Thread 204 stopped in file ./../include/sync0sync.ic line 111
InnoDB: Thread 484 stopped in file btr0pcur.c line 236
070918 22:20:01 - mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
InnoDB: Thread 367 stopped in file ha_innodb.cc line 2025
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=134217728
read_buffer_size=1044480
max_used_connections=555
max_connections=1200
threads_connected=139
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 2583862 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

InnoDB: Thread 125 stopped in file ./../include/sync0sync.ic line 111

Mysql-5.0.45 builded from source (linux 64 bit):

071029 19:39:44InnoDB: Assertion failure in thread 1189607744 in file ./../include/buf0buf.ic line 626
InnoDB: Failing assertion: block->buf_fix_count > 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: about forcing recovery.
InnoDB: Thread 1260161344 stopped in file row0sel.c line 2930
InnoDB: Thread 1166444864 stopped in file btr0sea.c line 336
InnoDB: Thread 1163516224 stopped in file ./../include/buf0buf.ic line 558
InnoDB: Thread 1184016704 stopped in file ./../include/sync0sync.ic line 111
InnoDB: Thread 1282791744 stopped in file row0row.c line 480
071029 19:39:44 - mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=268435456
read_buffer_size=1044480
max_used_connections=598
max_connections=1000
threads_connected=236
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 5378136 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0x2ab048859c70
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x46e7f150, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
(nil)
Stack trace seems successful - bottom reached
Please read http://dev.mysql.com/doc/mysql/en/using-stack-trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x2ab05ca6b020 is invalid pointer
thd->thread_id=41424
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.
InnoDB: Thread 1212238144 stopped in file ha_innodb.cc line 977

Number of processes running now: 0
071029 19:39:52 mysqld restarted

Nikolay, what is your hardware?

Regards,

Heikki

4xDual-Core Opteron 885/32Gb RAM
Linux kernel 2.6.17-1.2157_FC5 x86_64

thread_concurrency = 16
transaction-isolation = READ-COMMITTED
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 20G 
innodb_additional_mem_pool_size = 64M

update: I've been running 160 threads against 5.0.40 all inserting/updating/deleting/replacing/selecting on our 64-bit box with 7G buffer pool, read-committed for past 72 hours.  No crashes yet.  Will update again, after another few days.

I get this on my servers (4 and 8 core AMD). It occurs about 1 time every 1000 machine days. I don't have core files. I use 5.0.37. I am not sure what cursor isolation level apps choose, but the default is repeatable read.

My current hypothesis is that this is a problem in AMD's cache coherence protocol.

Has anyone hit this on a non-AMD platform?

Regards,

Heikki

Update: I ran some stress tests in 160 threads for over 7 days and couldn't repeat a crash on 64-bit Intel CPUs.  Will have to try AMD next.

I don't think we've ever seen >=5.0.48 crashing due to this bug.  Is there a chance it's been fixed unknowingly with some other fixes in the mean time?

Shane,
I do not think it could have been fixed in 5.0.48. The bug was a total mystery from the start.
--Heikki

There is a bug for some AMD systems. This bug is _very_ unlikely, which matches the frequency with which we have this problem and we are still determining whether it causes our crashes.

Look at src/base/atomicops-internals-x86.{h,cc} from
http://code.google.com/p/google-perftools/

Mark,

thank you. Maybe time to contact AMD?

--Heikki

I think they know about it. I was told it was in the errata.

Thank you Mark!

Let us keep this bug report still open for a year or two, so that we get feedback from users. I guess AMD has fixed the bug in their latest processors, and these bug reports will slowly taper out.

Regards,

Heikki

Well, we have hit this bug a couple of times on our 5.0.32 servers, but it seems that no server never than 5.0.48 is affected. Is there anyone out there who has hit this bug on a newer version of 5.0?

This is the log readout:

Version: '5.0.32-enterprise-gpl-log' socket: '/tmp/mysql.sock' port: 3306 MySQL Enterprise Server (GPL)
080401 8:40:20InnoDB: Assertion failure in thread 1189874016 in file ../include/buf0buf.ic line 626
InnoDB: Failing assertion: block->buf_fix_count > 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
InnoDB: about forcing recovery.
mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
InnoDB: Thread 1225017696 stopped in file row0sel.c line 2088
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=52428800
read_buffer_size=2093056
max_used_connections=382
max_connections=1200
threads_connected=97
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 4961590 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0x2b4814df50
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x46ec0118, backtrace may not be correct.
Bogus stack limit or frame pointer, fp=0x46ec0118, stack_bottom=0x46ec0000, thread_stack=262144, aborting backtrace.
InnoDB: Thread 1165113696 stopped in file row0sel.c line 2088
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x2b4560e660 is invalid pointer
thd->thread_id=1423637
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

This instance is running on a 8 core AMD server.

Anyone who knows more about how involved AMD is in this issue? Are they aware of this issue? If they already have done something about it, is there a date from where processors have been fixed? Maybe a processor version in where it has been fixed?

I have pulled this comment from the files that Mark referenced earlier:

// Opteron Rev E has a bug in which on very rare occasions a locked
// instruction doesn't act as a read-acquire barrier if followed by a
// non-locked read-modify-write instruction.  Rev F has this bug in 
// pre-release versions, but not in versions released to customers,
// so we test only for Rev E, which is family 15, model 32..63 inclusive.

It would be interesting to know if anybody has encountered this crash on a processor that falls outside this family/model range.

Saw this on:

mysql 5.1.26

2 x dual core system

vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 270
stepping        : 2
cpu MHz         : 2009.261

We use a few AMD models, but the error only occurs on this one:
model name      : Dual Core AMD Opteron(tm) Processor 270

/proc/cpuinfo data is important (cpu family, model, model name and stepping). 

Knowing that this happens on OTHER models than limited by
family == 15 && 32 <= model && model <= 63 rule would be important.

db crash error log

Attachment: db-cst.log (, text), 12.13 KiB.

Encountered similar issue with AMD-64 bit 4 core processor but transaction set to innodb default.

This seems to be open for a long time. Is there any way to shield this other than changing the processors? Hope the logs help 

http://bugs.mysql.com/file.php?id=10257&text=1

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : AMD Opteron(tm) Processor 280
stepping        : 2
cpu MHz         : 2004.590
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov           pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext lm 3dnowext 3dnow pni
bogomips        : 3607.67
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

The bug is known to linux kernel developers but for now no developer was interested in fixing the problem. 

The generic workaround are also described there. OpenSolaris has in kernel workarounds for the problem.

http://bugzilla.kernel.org/show_bug.cgi?id=11305

This should probably be set to 'not a bug' since the problem looks to be
well understood:

- can happen with any MySQL version, observed in 5.0 and 5.1
- the cause is a bug in AMD processor family 15, models 32-63 inclusive
- the workaround can be done only at OS kernel level, not in MySQL or InnoDB

so outside of scope of MySQL bugdb

Patch for Linux kernel (waiting for testers) at
http://bugzilla.kernel.org/show_bug.cgi?id=11305

It is still too early to say for sure, but Bug #35077 may be related to this.

http://support.amd.com/us/Processor_TechDocs/25759.pdf
See Errata #147

So far, no mainline kernel fixes the problem : 2.6.39 does not.

see : 

http://timetobleed.com/mysql-doesnt-always-suck-this-time-its-amd/

There is a unofficial patch there :

http://marc.info/?l=linux-kernel&m=124043356704100