MySQL Bugs: #4053: too many of "error 1236: 'binlog truncated in the middle of event' from master"

Bug #4053	too many of "error 1236: 'binlog truncated in the middle of event' from master"
Submitted:	8 Jun 2004 10:00	Modified:	28 Nov 2006 19:38
Reporter:	Olivier Müller	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	4.0.20	OS:	Linux (linux)
Assigned to:	Timothy Smith	CPU Architecture:	Any

Description:
Standard replication setup between a master and 2 slaves, everything running fine, replicating a 1.5GB database with a very big table. 

But sometimes (every few days) without any visible reason the replication "crashes", with this message comming on the slaves:

040608  7:32:21  Error reading packet from server: binlog truncated in the middle of event (server_errno=1236)
040608  7:32:21  Got fatal error 1236: 'binlog truncated in the middle of event' from master when reading data from binary log
040608  7:32:21  Slave I/O thread exiting, read up to log 'vega3-bin.003', position 403828671

Nothing appears on the master's error log, and using "slave stop; slave start" on the slave, or "SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; SLAVE START;" doesn't help -> the only way to make it work again is to stop everything, copy the tables manually (2 GB data each time, takes 15 minutes), and restart the servers. This happens nearly every 2-3 days, at random time: no specific cronjob running, etc.

How can I help and find out what is wrong, debug the issue? Such an unstable replication is not ok :(   The load is quite high on the master (it's a news spider), but there is enough ram and disk space.

I would be glad if I could look at that problem with somebody of the mysql devel team working on the replication routines, thanks.

How to repeat:
"wait 2-3 days"...

Hi,

Such problems are usually hard to troubleshoot. It can be a hardware problem, OS problem or MySQL bug. Here's how you can help; if for example the slave says:
040608  7:32:21  Error reading packet from server: binlog truncated in the
middle of event (server_errno=1236)
040608  7:32:21  Got fatal error 1236: 'binlog truncated in the middle of event'
from master when reading data from binary log
040608  7:32:21  Slave I/O thread exiting, read up to log 'vega3-bin.003',
position 403828671
Then save the master's binlog named vega3-bin.003, and run
mysqlbinlog vega3-bin.003 2>/tmp/errors
If the "/tmp/errors" file contains anything, please copy-paste it here.
Thank you.

Guilhem

ok thanks, I'll try that on the next crash.

Here the stderr of the mysqlbinlog command ran after last crash:

ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 369229824, event_type: 0
Could not read entry at offset 210767880:Error in log format or read error

Hi Olivier,
Your master's binary log is corrupted starting from position 210767880. If this happens every 2 or 3 days, it may be a hardware/OS problem, not necessarily a MySQL bug (over the last 2 years we spotted no bug which could lead to binary log unreadability like you experience).
I recommend you to try with another machine/OS (other Linux distro, other kernel, I don't know; some Linux kernels are suspicious; for example 2.4.18 of Mandrake and Redhat has a bad reputation (not only at MySQL AB) as it was built with gcc 2.96 which was a buggy gcc.
Memtest86 could be used to test the memory of the machine if it's an x86.

Guilhem

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

We also have this problem quite repeatably. It occurs when very large blobs are written to the binlog.

What I suspect happens is:
- a database thread starts writing complex data to the binlog
- slave's reader starts sending the log to the slave
- the reader is faster :-( and returns an error when it can't read all the data

A simple "slave start" always allows the slave to continue.

Hello Matthias,
Please could you attach (in the "Files" section) a compressed copy of such a query using large BLOBs which causes replication to stop?
Which version do you have on master and slave?
When you say "quite repeatably" do you mean that you can repeat it whenever you decide, or it happens frequently but without you being able to decide when it happens ?

Hi again,
more questions (there are also some questions in my previous post) : InnoDB or MyISAM? is the query in a transaction or in autocommit mode?
The race condition is improbable because binlog reads and writes happen under mutex. But I agree there is something fishy.

Just to say I tried to repeat with a 40 MB query and failed to produce the bug. Looks like I would need your information to have more luck.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

The problem occurred again:

051012 12:29:44 [ERROR] Error reading packet from server: binlog truncated in the middle of event (server_errno=1236)
051012 12:29:44 [ERROR] Schwerer Fehler 1236: 'binlog truncated in the middle of event vom Master beim Lesen des bin\uffffren Logs aufgetreten
051012 12:29:44 [ERROR] Slave I/O thread exiting, read up to log 'binlog.000100', position 435323250
051012 12:50:30 [Note] Slave I/O thread: connected to master 'sklave@213.95.11.247:3306',  replication started in log 'binlog.000100' at position 435323250

As you see, after restarting the slave everything works.

The binlog looks like this:

## mysqlbinlog --start-position=435323250 binlog.000100 |less -S

/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 435323250
#051012 12:29:39 server id 2  log_pos 415459555         Query   thread_id=4710
use bmw2;
SET TIMESTAMP=1129112979;
insert into archivepreview (id, sliceid, pdfpreviewsize, docid, mimetype, sortor[...]
values (46447, 0, '%PDF-1.4\n%\uffff\uffff\uffff\uffff\n3 0 obj\n<</Length 4 0 R\n/Filter/FlateDeco[...]
# at 439950880
#051012 12:29:39 server id 2  log_pos 415459555         Query   thread_id=4710
SET TIMESTAMP=1129112979;
insert into archivepreview (id, sliceid, pdfpreviewsize, docid, mimetype, sortor[...]
values (46448, 0, '%PDF-1.4\n%\uffff\uffff\uffff\uffff\n3 0 obj\n<</Length 4 0 R\n/Filter/FlateDeco[...]
# at 440673917
#

That still looksl like the master->slave update is triggered too early. Maybe flushing the binlog to disk before signalling the master thread would help?

Hi Matthias,
By default only a fflush() of the binlog is done at every write.
There is an option to force a fsync() of the binlog at every write: --sync-binlog.
It is going to slow things down, though.

Thanks; I'll try that option and report back.

Following are the entries from the error log file:
_______________________________________________________________________

051214 16:22:28 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect key file for table '/tmp/#sql_69c6_0.MYI'; try to repair it
051214 16:29:16 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect key file for table '/tmp/#sql_69c6_0.MYI'; try to repair it
051214 16:39:27 [ERROR] Error reading packet from server: error reading log entry ( server_errno=1236)
051214 16:39:27 [ERROR] Got fatal error 1236: 'error reading log entry' from master when reading data from binary log
051214 16:39:27 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000045', position 51711500
051214 16:39:27 [ERROR] Slave: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''��C' at line 1' on query. Default database: 'cms'. Query: Error_code: 1064
051214 16:39:27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000045' position 51654054
_______________________________________________________________________

Following are the entries in the binary log file where the error took place:
 mysqlbinlog --start-position='51711500' mysql-bin.000045
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 50724864, event_type: 0
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
__________________________________________________________________________

We have recently begun to experience this problem sporadically as well with a 4.0.26 master, several weeks after migrating our master to a new platform.

Our current theory is that we've managed to exhaust or fragment MySQL's heap to the point where allocations are beginning to randomly fail. (We also see sporadic "out of memory" errors.)  We have a 2GB key_buffer_cache (on a 32-bit machine with 8GB of RAM) and hundreds of clients running on a 2.6.9 kernel (Centos 4.1), and believe we're bouncing off the 3GB hard limit (e.g., akin to <http://www.spack.org/wiki/LinuxRamLimits>) in the kernel.  We're going to try reducing our key_buffer_size to 1GB (which is what we had on our older server, with only 4GB of RAM), and cross our fingers...

We have begun to encounter this on 4.0.26. It appears to happen for large (> 6MB) statements. Eventually, the slaves (after many restarts) are able to fetch and execute that statement -- so the binlog does not appear to be corrupt. Our workaround is to limit statement sizes when possible, although this is difficult to do for blob columns.

We have spent some time looking at the code. This error is represented as LOG_TRUNC_READ in the server. It is caused by memory allocation errors or by a failure to read the event from the binlog. Memory allocation errors are logged, and our masters have no such error messages in their log, so it does not appear to be caused by that (for us).

The other parts of the code path where errors occur do not log an error message at the initial cause of the error, so all of this is based on guesswork. But it _might_ be possible for the error to occur if the read() call for the binlog file (done in my_read) returns less than the number of bytes requested (request 6MB, get back 5MB). The read(2) man page and the POSIX spec state that read() may have this behavior. The Stevens Advanced Unix Programming book (1st and 2nd editions) states that reads from the local file system (Ext2 in our case) do not have this behavior (they are not partial, unless at EOF).

I am not sure why MY_FULL_IO is not set for the binlog file read in this case. If partial reads occur, then my_read might do the right thing when that flag is set. Of couse, I do not know if partial reads occur. It is just a guess.

This problem would be much easier to diagnose were logging done at the first failure in the code path.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11130

ChangeSet@1.2183, 2006-08-30 17:28:34-04:00, cmiller@zippy.cornsilk.net +1 -0
  Bug#4053: too many of "error 1236: 'binlog truncated in the middle of \
  	event' from master"
  
  Since there is no repeatable test case, and this is obviously wrong, this is
  the most conservative change that might possibly work.  
  
  The syscall  read()  wasn't checked for a negative return value for an
  interrupted read.  The kernel  sys_read()  returns -EINTR, and the "library" 
  layer maps that to return value of -1 and sets  errno  to EINTR.  It's 
  impossible (on Linux) for  read()  to set errno EINTR without the return 
  value being -1 .
  
  So, if we're checking for EINTR behavior, we should not require that the
  return value be zero.

Chad, I agree with pushing this without a test case, since many people have tried without success to make one, and the change is very small and conservative.

Thanks,

Tim

I've applied the patch to the Maintenance team trees, 4.0 through 5.1.

Documenters, Support:  I propose we keep this bug open ("Need Feedback") after this code goes into the wild, and close it ("Closed") or reopen it ("Verified" (or "Open")) if the problem one month after it.  I'll leave it up to Support to manage this as they track their support-issue item.

Pushed to 5.0 (will be in 5.0.25)

TODO: Push to 4.0, 4.1, 5.1

An entry for this bug has been added to the 5.0.25 changelog.

This is pushed to 5.1.12.

Docs team: please return it to "Need Feedback", to remind us that we need confirmation from users who've been hitting this if it's fixed or not with 5.0.25.

Also, TODO: push into 4.0 & 4.1 (is now queued in -maint trees)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11984

ChangeSet@1.2186, 2006-09-14 20:51:50-06:00, tsmith@siva.hindu.god +6 -0
  Bug #4053: too many of "error 1236: 'binlog truncated in the middle of event' from master"
  
  - Fix my_read(), my_write() and my_read_charset_file() to handle return values from read/write correctly
  - Add debugging 'deprecated function' warning to my_lread/my_lwrite
  - Add debugging 'error, read/write interupt not handled' warning to my_quick_read/my_quick_write

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11985

ChangeSet@1.2540, 2006-09-14 21:06:14-06:00, tsmith@siva.hindu.god +1 -0
  Bug #4053: too many of "error 1236: 'binlog truncated in the middle of event' from master"
  - Detect read failure in my_read_charset_file

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11990

ChangeSet@1.2186, 2006-09-14 23:19:24-06:00, tsmith@siva.hindu.god +6 -0
  Bug #4053: too many of "error 1236: 'binlog truncated in the middle of event' from master"
  
  - Fix my_read/my_write to handle return values from read/write correctly
  - Add debugging 'deprecated function' warning to my_lread/my_lwrite
  - Add debugging 'error, read/write interrupt not handled' warning to my_quick_read/my_quick_write
  
  There is no test case associated with these changes.  However, this is a conservative change,
  and no repeatable test case is available.

The following changes need review:

In 4.0+: http://lists.mysql.com/commits/11990
- Update to Chad's original fix for this bug; also address other
functions mentioned by Jonas

In 4.1+: http://lists.mysql.com/commits/11985
- Fix for my_read_charset_file(), which isn't present in 4.0

In 5.1:  http://lists.mysql.com/commits/11994
- Remove my_lread and my_lwrite, which are unused
- It was recommended that they not be removed from earlier versions

Moved 5.0.25 changelog entry to 5.0.30 (not 5.0.29).
Moved 5.1.12 changelog entry to 5.1.13.
Added entry to 4.1.23 changelog.

Resetting reporrt to NDI - need three-part version number
for the 4.0.x changelog. (Looks to me like 4.0.28, but would
like confirmation of that.)

Added changelog entry to 4.0.28 changelog.

The problem may be because of upgraded version. As in 5.1.x the start binlog pos is 106 and in Mysql 5.5.x it is 107 so you need to give binlog pos as 107 at a time of starting raplication from master.

??? Do you have the wrong bug?

(a) 5.5 wasn't even thought of when this bug was created.
(b) you're wrong: zero works with every version.