Bug #5910 | InnoDB: Next record offset nonsensical | ||
---|---|---|---|
Submitted: | 5 Oct 2004 22:20 | Modified: | 8 Oct 2004 4:46 |
Reporter: | Eric Korpela | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S1 (Critical) |
Version: | 4.0.21-linux-x86_64 (Official binary) | OS: | Linux (Linux 2.4.21-20.ELsmp) |
Assigned to: | Heikki Tuuri | CPU Architecture: | Any |
[5 Oct 2004 22:20]
Eric Korpela
[5 Oct 2004 22:25]
Eric Korpela
Oops, I truncated part of the log. Following the page dump is... 041005 11:14:33 InnoDB: Page checksum 2827990002, prior-to-4.0.14-form chec ksum 2846856100 InnoDB: stored checksum 2827990002, prior-to-4.0.14-form stored checksum 284 6856100 InnoDB: Page lsn 2 4185473273, low 4 bytes of lsn at page end 4185473273 InnoDB: Page may be an index page where index id is 0 848 InnoDB: (index `ra` of table `new_spear_fims/photon`) 041005 11:14:33 InnoDB: Apparent corruption of an index page n:o 681454 in space 0 InnoDB: to be written to data file. We intentionally crash server InnoDB: to prevent corrupt data from ending up in data InnoDB: files. 041005 11:14:33InnoDB: Assertion failure in thread 1147140464 in file buf0fl u.c line 268 InnoDB: We intentionally generate a memory trap. InnoDB: Submit a detailed bug report to http://bugs.mysql.com. InnoDB: If you get repeated assertion failures or crashes, even InnoDB: immediately after the mysqld startup, there may be InnoDB: corruption in the InnoDB tablespace. See section 6.1 of InnoDB: http://www.innodb.com/ibman.php about forcing recovery. mysqld got signal 11; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. We will try our best to scrape up some info that will hopefully help diagnos e the problem, but since we have already crashed, something is definitely wron g and this may fail. key_buffer_size=268435456 read_buffer_size=8384512 max_used_connections=2 max_connections=100 threads_connected=2 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 19 00143 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. Number of processes running now: 0 041005 11:14:37 mysqld restarted It appears to be random whether the restart succeeds or crashes.
[5 Oct 2004 22:31]
Eric Korpela
mysqld log
Attachment: zork.err (application/octet-stream, text), 74.33 KiB.
[5 Oct 2004 22:32]
Eric Korpela
mysql configuration file
Attachment: my.cnf (application/octet-stream, text), 5.01 KiB.
[6 Oct 2004 8:36]
Heikki Tuuri
Eric, please send the entire .err log to heikki.tuuri@innodb.com Do not cut anything off. Also print what SHOW CREATE TABLE ... says about the suspect table(s) new_spear_fims.photon and others that crash. If this is a repeatable way to create corruption in InnoDB, this is the first report in a long time. "Next record offset nonsensical 60952 for rec 12575": this means that the index record at offset 12575 bytes has the next record pointer pointing to offset 60952 bytes. This is impossible, because the InnoDB page size is 16 kB. Regards, Heikki
[6 Oct 2004 8:41]
Heikki Tuuri
Eric, oops I noticed that you already uploaded the .err file for this crash. Please send also the .err files and SHOW CREATE TABLE about the other crashes that you have seen. Regards, Heikki
[6 Oct 2004 8:43]
Heikki Tuuri
Eric, I notice that you are using a 64-bit Opteron or Nocona. Please describe your hardware. Regards, Heikki
[6 Oct 2004 9:20]
Heikki Tuuri
Hi! This is interesting. Looks like 64 bytes have been copied from offset 0x1100 on the page to offset 0x3100! 110600000000000000000000000000000000800000000191b5ed18100804041007 112700000000000000000000000000000000800000000191b5ee18100800041807 114800000000000000000000000000000000800000000191b5ef18100800042007 116900000000000000000000000000000000800000000191b5f018100800042807 118a00000000000000000000000000000000800000000191b5f118100804043007 ... 30bc00000000000000000000000000000000800000000191b6e3181008000bc007 30dd00000000000000000000000000000000800000000191b6e4181008000bc807 30fe00000000000000000000000000000000800000000191b6e5181008040bd007 311f000004041007112700000000000000000000000000000000800000000191b5 ee18100800041807114800000000000000000000000000000000800000000191b5 ef1800000000000000000000000000000000800000000191b6e8181008000be807 318200000000000000000000000000000000800000000191b6e9181008040bf007 31a300000000000000000000000000000000800000000191b6ea181008000bf807 31c400000000000000000000000000000000800000000191b6eb181008000c0007 The copied block: 04041007112700000000 00000000000000000000 0000800000000191b5ee 18100800041807114800 00000000000000000000 00000000008000000001 91b5ef18 This might be bad RAM. But let us first look at the other crashes that you have seen. Regards, Heikki
[6 Oct 2004 18:20]
Eric Korpela
Thanks for the response! The hardware is a dual Operton 246 with 6GB EEC RAM. EEC is on as is EEC scrub and EEC correction writeback. 4 GB RAM is used for innodb buffers. Innodb logs are on an internal 250 GB SATA drive. The myisam and innodb data files are on a 360 GB SCSI RAID-5 array. We tried loading another large database last night, which crashed again this morning. I'll upload the error log and the SHOW CREATE TABLE output once the rollbacks are complete.
[6 Oct 2004 23:21]
Eric Korpela
output of show create table new_spear_fims.photon
Attachment: photon (application/octet-stream, text), 982 bytes.
[6 Oct 2004 23:37]
Eric Korpela
Error log for crash of 2004/10/06
Attachment: latest_crash.err (application/octet-stream, text), 197.96 KiB.
[6 Oct 2004 23:41]
Eric Korpela
I've uploaded the "show create table" output and the error log from today's crash. I can't be sure that today's crash isn't due to corruption from yesterdays, so I will rebuild the innodb files and logs in order to generate a newer crash dump. I will also use the opportunity to run more detailed memory/burn-in tests than I had run earlier to see if there might be an intermittent hardware problem.
[7 Oct 2004 7:08]
Heikki Tuuri
Eric, in the new printout, looks like something has zeroed bytes at offsets 0x20fc to 0x20ff (inclusive) or 0x20fc to 0x20fe (inclusive) on the page. The previous corruption occurred at 0x3100. Some similarity in offsets. Again, this looks like a fault in hardware, like in RAM, processor page tables, processor caches, memory controller. Or a bug in the OS memory management. Experience from at least two other users is that there are unexplained crashes in Opteron computers. I think we have to have this bug repeated in some other 64-bit (best to try with little-endian processors) computer to show that this is an InnoDB bug. In principle, this might be a bug in some InnoDB code that handles with addresses divisible by 256. For example, the additional mem pool. Bad pointers there might cause symptoms like this. Regards, Heikki 20dd800001118000000002f7e0700c04000d0005 20f1800001118000000002f7e0710c04000d0805 2105800001118000000002f7e0000000000d1005 2119800001118000000002f7e0730c04000d1805 212d800001118000000002f7e0740c04000d2005 2141800001118000000002f7e0750c04000d2805 2155800001118000000002f7e0760c04040d3005
[7 Oct 2004 15:57]
Eric Korpela
The memory test has been running all night (memtest86) with no errors reported. memtest86 is a pretty stringent test suite, so I think hardware memory issues are probably out. Once we've rebooted, I'll run a disk test suite to ensure we don't have a subtle disk problem. Beyond that we have either innodb or the Linux x86_64 kernel. I suppose it could also be a glibc bug?
[7 Oct 2004 16:00]
Eric Korpela
Anyone have an IA64 machine with a couple hundred gigs of disk space I could try to recreate this bug on?
[7 Oct 2004 23:05]
Eric Korpela
It turns out that innodb is off the hook. I wrote a simple program to memory map a 20GB file as an array of off_t. Then I wrote into each element of the array its own offset. During readback, some items were misplaced. Error at offset 641248800 != 641248288 Error at offset 641248801 != 641248289 Error at offset 641248802 != 641248290 Error at offset 641248803 != 641248291 Error at offset 641248804 != 641248292 Error at offset 641248805 != 641248293 Error at offset 641248806 != 641248294 Error at offset 641248807 != 641248295 Error at offset 718280736 != 722050080 Error at offset 718280737 != 722050081 Error at offset 718280738 != 722050082 Error at offset 718280739 != 722050083 Error at offset 718280740 != 722050084 Error at offset 718280741 != 722050085 Error at offset 718280742 != 722050086 Error at offset 718280743 != 722050087 Look familiar? I'm now doing some tests to see whether this is a problem with the disk hardware, the linux kernel, or the driver. Sorry to trouble you with this.
[8 Oct 2004 4:40]
Heikki Tuuri
Eric, it would be great if you can send your test program to the vendor of your computer, to AMD, and the producer of your Linux kernel and drivers. Problems in Opterons with Linux seem to be so common that there probably is some bug or hardware design flaw that needs to be corrected. In Itanium computers there also were problems, but they seem to work well now. The standard advice in cases like this is to try upgrading to the latest Linux kernel and drivers. By the way, your family name Korpela is a Finnish name. Thank you, Heikki
[8 Oct 2004 4:46]
Heikki Tuuri
I am changing the status of the bug to 'Can't repeat', because bug is not in MySQL. But please report to the bugs database if and how you were able to resolve the problem. Regards, Heikki
[19 Oct 2004 18:32]
Eric Korpela
A little additional info in case someone else stumbles into this. The bug is apparently related in some way to the IOMMU mechanism in the Linux kernel. One workaround is to specify "mem=4032M" (4GB-64MB) as a kernel parameter. Of course that throws away any RAM above 4G, so it's an unsatisfactory solution. I've submitted a bug report to RedHat, since it's a RedHat distribution I'm using. The bugzilla URL for this item is https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=135017 Eric