| Bug #28359 | Intermitted lost connection at 'reading authorization packet' errors | ||
|---|---|---|---|
| Submitted: | 10 May 2007 18:25 | Modified: | 25 Oct 2007 21:24 |
| Reporter: | Jon Ribbens | ||
| Status: | Closed | ||
| Category: | Server: Errors | Severity: | S2 (Serious) |
| Version: | 5.0.41 | OS: | Linux (RedHat Enterprise Linux ES release 4) |
| Assigned to: | Magnus Svensson | Target Version: | |
| Tags: | 64-bit, linux, authorization, Lost connection | ||
[10 May 2007 18:25]
Jon Ribbens
[10 May 2007 18:30]
Jon Ribbens
Oh, I forgot to mention, if you google for "reading authorization packet" you'll see thousands of sites where this intermittent error has leaked into the site content (i.e. usually a PHP site), which provides further evidence it's a general problem/bug and not just something weird about our setup.
[10 May 2007 21:28]
Jon Ribbens
I did some debugging with --debug, below is a partial log. It shows a single thread from connection to where it dies. As you can see it is dying because of EINTR on read(). What I don't get is (a) why this doesn't happen at any time other than during connection setup (connections once established are rock-solid), and (b) why MySQLd isn't coping with this, since it's completely standard and expected to get EINTR from read(). T@10874248: ?func: info: New connection received on TCP/IP (70) T@10874248: vio_peer_addr: enter: sd: 70 T@10874248: vio_peer_addr: exit: addr: 192.168.1.74 T@10874248: my_malloc: my: size: 13 my_flags: 0 T@10874248: my_malloc: exit: ptr: 0xf5d5a0 T@10874248: ?func: info: Host: unknown host ip: 192.168.1.74 T@10874248: vio_keepalive: enter: sd: 70 set_keep_alive: 1 T@10874248: net_write_command: enter: length: 67 T@10874248: vio_is_blocking: exit: 0 T@10874248: vio_write: enter: sd: 70 buf: 0x10016e0 size: 72 T@10874248: vio_write: exit: 72 T@10874248: vio_is_blocking: exit: 0 T@10874248: vio_read: enter: sd: 70 buf: 0x10016e0 size: 4 T@10874248: vio_read: vio_error: Got error 11 during read T@10874248: vio_read: exit: -1 T@10874248: ?func: info: vio_read returned -1, errno: 11 T@10874248: thr_alarm: enter: thread: T@10874248 sec: 5 T@10874248: vio_blocking: enter: set_blocking_mode: 1 old_mode: 0 T@10874248: vio_blocking: exit: 0 T@10874248: vio_read: enter: sd: 70 buf: 0x10016e0 size: 4 T@10874248: vio_read: vio_error: Got error 4 during read T@10874248: vio_read: exit: -1 T@10874248: ?func: info: vio_read returned -1, errno: 4 T@10874248: ?func: error: Couldn't read packet: remain: 4 errno: 4 length: -1 T@10874248: vio_blocking: enter: set_blocking_mode: 0 old_mode: 1 T@10874248: vio_blocking: exit: 0 T@10874248: net_printf_error: enter: message: 1043 T@10874248: vio_is_blocking: exit: 0 T@10874248: push_warning: enter: code: 1043, msg: Bad handshake T@10874248: close_connection: enter: fd: TCP/IP (70) error: ''
[13 May 2007 16:00]
Jeff C
I get this often as well. I'm on 32bit. I noticed it happening when using 5.1 clients going to 5.0. I haven't seen 5.0->5.0 clients producing this error as of yet... Jon Ribbens, are your clients 5.1 going to 5.0 as well? I hope they fix this bug as fast as possible as this bug hurts mysql's image. All the errors being generated thru php (search google and stuff)... most people will get a bad view of Mysql when it is great!
[13 May 2007 16:37]
Jon Ribbens
Our clients were 5.0.27 talking to 5.0.27 servers. I upgraded to see if this would fix the problem (which it didn't) - they are now 5.0.41 clients talking to 5.0.41 servers.
[13 May 2007 19:39]
Jon Ribbens
I have discovered what's going on, sort of. The login is sometimes taking over 5 seconds, and therefore it is timing out. Although the MySQL documentation says that this should produce a "Bad handshake" error on the client, it doesn't - the server just drops the connection. I have worked-around the problem by increasing the connect_timeout global variable to 30 seconds. What I do not yet know is why the login is taking over 5 seconds. Neither our client machine or server machine are overloaded, the logins should be completing practically instantaneously.
[13 May 2007 20:59]
Jeff C
Well at least you found a semi-solution that works. I hope they figure it out soon. I've added 30 seconds into connect_timeout... Thanks for the tip Jon.
[6 Jun 2007 20:31]
Gabriel Barazer
I got this error many times on a moderately loaded server as well. The error string and code (0) isn't clear at all : Reading the source related to these errors shows some tests during the authentication process to the mysql server. See sql-common/client.c:2313 to 2321 : Error "reading authorization packet" is returned when mysql->net.last_errno == CR_SERVER_LOST, which by the constant CR_SERVER_LOST, obviously means that the server has dropped the connection. Reasons for this can be various, but I think it has something to do with timeouts during authentication on the server side. As wrote Jon, the problem might be a bug in the network error handling (which could explain the randomness of this bug). Temporary workaround is indeed setting a long timeout, to avoid the server dropping the connection. I will check if this is enough to make this bug disappear. But there is clearly something to do on the server code to avoid that. (somewhere next to sql/sql_parse.cc:1143 , version 5.0.41)
[9 Aug 2007 23:52]
Sveta Smirnova
Thankk you for the report. Error 2013, "Lost connection to MySQL server" is client error, not server. Even if server sent Bad Handshake this means client didn't receive anything. Also there is "T@10874248: push_warning: enter: code: 1043, msg: Bad handshake" in the server stack trace. Connection outage seems to be somewhere in your environment. So I don't see bug here. Please explain what is wrong?
[28 Aug 2007 11:56]
Jon Ribbens
Sveta, firstly the documentation says you will get the 'Bad handshake' error, and this is false. You will never see this error. So that's either a bug in the code or a mistake in the documentation. The only way I could find to diagnose this problem was to either 'gdb' the mysql server binary, or to run the 'debug' mysql server and try and use the completely-undocumented debug macros. (The "T@10874248" stuff you mention is not in the 'server stack trace', it is in the debug macro output, and only if you know how to use that, which I suspect essentially no non-mysql-employees do.) Secondly, I explained in my original report why the problem is unlikely to be caused by the network. Please explain why you say "Connection outage seems to be somewhere in your environment" when all the evidence seems to point to the contrary.
[28 Aug 2007 15:41]
Sveta Smirnova
Thank you for the feedback. The manual says "The number of seconds that the mysqld server waits for a connect packet before responding with Bad handshake" at http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#option_mysqld_connect_ timeout which means server will send this error and nothing about client will receive this error. If you mind other quote from manual please provide exact quote and URL. Also how will client receive any error from server if it lost connection? > Secondly, I explained in my original report why the problem is unlikely > to be caused by the network. Please explain why you say "Connection > outage seems to be somewhere in your environment" when all the evidence > seems to point to the contrary. You said "Temporary workaround is indeed setting a long timeout, to avoid the server dropping the connection." Additionally you get "lost connection" error. This shows something in your environment make network connection between MySQL client and server slow and buggy sometimes. So I don't see MySQL bug here. If you will be able to provide test case which shows problem is MySQL code and not environment which makes network connection slow and/or buggy feel free to provide the test case and reopen the report.
[28 Aug 2007 18:21]
Jon Ribbens
> The manual says "The number of seconds that the mysqld server waits > for a connect packet before responding with Bad handshake" Indeed. What do you think the word 'respond' means? > Also how will client receive any error from server if it lost connection? It didn't lose the connection, the server deliberately dropped it. The server could quite easily send an error before dropping the connection. > Additionally you get "lost connection" error. This shows something in your > environment make network connection between MySQL client and server slow > and buggy sometimes. "Lost connection" shows nothing except that the server has deliberately dropped the connection. I have already explained why the problem does not appear to be the network.
[28 Aug 2007 18:23]
Jon Ribbens
I am setting the bug back to 'open' because, at the very least, the "Bad handshake" thing is a documentation error.
[31 Aug 2007 11:21]
Magnus Svensson
Hi, will try to explain how a connection to the MySQL Server works. You have provided a very nice tracefile that show what happens in the server when the client get the error "2013 Lost connection to MySQL server at 'reading authorization packet', system error: 0" and I'll comment on that one directly. It actually starts before the beginning of the trace file where the server accepts the connection and starts a new thread to handle this incoming connection. It will set the network read and write timeouts to the value of "connect_timeout", which is 5 seconds by default in order to quickly be able to disconnect clients that don't respond(try telnet to the mysqld port...). Then call 'check_connection' in sql_parse.cc > T@10874248: ?func: info: New connection received on TCP/IP (70) > T@10874248: vio_peer_addr: enter: sd: 70 > T@10874248: vio_peer_addr: exit: addr: 192.168.1.74 The connection is from 192.168.1.74, we use 'vio_peer_addr' to ask the network about that. > T@10874248: my_malloc: my: size: 13 my_flags: 0 > T@10874248: my_malloc: exit: ptr: 0xf5d5a0 > T@10874248: ?func: info: Host: unknown host ip: 192.168.1.74 > T@10874248: vio_keepalive: enter: sd: 70 set_keep_alive: 1 Call to 'vio_keep_alive', this shows that the ip was allowed to connect du to ACL lists. At sql_parse.cc:880 > T@10874248: net_write_command: enter: length: 67 > T@10874248: vio_is_blocking: exit: 0 > T@10874248: vio_write: enter: sd: 70 buf: 0x10016e0 size: 72 > T@10874248: vio_write: exit: 72 The server builds the "initial greeting message" and sends it to the client. > T@10874248: vio_is_blocking: exit: 0 > T@10874248: vio_read: enter: sd: 70 buf: 0x10016e0 size: 4 > T@10874248: vio_read: vio_error: Got error 11 during read > T@10874248: vio_read: exit: -1 > T@10874248: ?func: info: vio_read returned -1, errno: 11 The server starts to read from the client in non blocking mode just to see if there is anything there. errno 11 shows no data was available at this time. > T@10874248: thr_alarm: enter: thread: T@10874248 sec: 5 Setup alarm that will break the below read in blocking mode after 5 seconds. > T@10874248: vio_blocking: enter: set_blocking_mode: 1 old_mode: 0 > T@10874248: vio_blocking: exit: 0 Turn on blocking mode. > T@10874248: vio_read: enter: sd: 70 buf: 0x10016e0 size: 4 > T@10874248: vio_read: vio_error: Got error 4 during read > T@10874248: vio_read: exit: -1 > T@10874248: ?func: info: vio_read returned -1, errno: 4 > T@10874248: ?func: error: Couldn't read packet: remain: 4 errno: 4 length: -1 Read from the client. Fails after 5 seconds because the client hasn't said anything(or at least it hasn't yet reached the server). This is "such a serious error" that the server decides not to write anything more to the socket, to enforce this it set the flag "net->error= 2"(this can be seen in function 'my_real_read' in net_serv.cc). This flag has the effect that even if the server calls a function that writes to the net it will be thrown away. Even if the server would write to the client(which might not exist anymore), the client would most likely be out of sync. Remember that packets can be "in the air" as if throwing a ball back and forth. > T@10874248: vio_blocking: enter: set_blocking_mode: 0 old_mode: 1 > T@10874248: vio_blocking: exit: 0 > T@10874248: net_printf_error: enter: message: 1043 Write the "Bad handshake" error message to the client. But since "net->error" is 2 it will be thrown away and never sent. Sorry! > T@10874248: vio_is_blocking: exit: 0 > T@10874248: push_warning: enter: code: 1043, msg: Bad handshake > T@10874248: close_connection: enter: fd: TCP/IP (70) error: Finally close the connection to the client. If the client is still there as in this case, it will be in the state of reading from the network/server expecting an answer to it's "Login request" that it has sent(although the server didn't get it in time). So instead of that reply it'll get an error since the server closed the connection and thus will display 2013 indicating "The client didn't get an error when writing to the server, but it didn't get a full answer (or any answer) to the question." How can we then determine this is not a server bug, after all the "thr_alarm" function could be buggy in some way and thus breaking the read from the client before the 5 seconds has passed. Or something... 1. Use wireshark/etheral to log the connection between the client and the server. Since it will include the time of each packet it receive that will effectively show how long the server have waited before seding out the socket reset(a RST packet). Wireshark is very nice since it also knows how to interpret the MySQL protocol. If you are running on the default port 3306 it will interpret automatically, otherwise use "Decode as.." and set the port being used to MySQL. Example trace(reduced): Time From/To What 0.000251 127.0.0.1:3306->127.0.0.1:10508 MySQL Server Greeting proto=10 version=5.0.50-debug-log 0.004448 127.0.0.1:10508<-127.0.0.1:3306 MySQL Login Request user=root db=test 0.004586 127.0.0.1:3306->127.0.0.1:10508 MySQL Response OK 2. Insert sleep(10) in the appropriate place in server or client code. - When doing this in the client code, I get exactly the error described here. - When doing it in the server code, it's not possible to get any error at all since the client is reading with almost infinite timeout. 3. Check the value of "aborted_connections" by using "SHOW STATUS like 'aborted_connections'" - it should have increased with one after the client got this error. Can the server send ER_BAD_HANDSHAKE as it says in the manual? Yes, it will do that in case of other errors during this handshake phase(for example if the length of the clients Login request is wrong). But most likely _never_ in the case where "connect_timeout" has expired - since it's "such a serious error" ;) Why don't we have a higher "connect_timeout" value by default? 5 seconds is fairly low in order to be "secure by default" - the low value allows spurious connects to that port to be discarded quickly and free up that thread to do some useful work. But maybe 5 seconds is too low?
[31 Aug 2007 11:22]
Magnus Svensson
I'll recommend the manual be updated. We really don't want our users have to produce tracefiles or use wireshark in order to find out they need to increase the "connect_timeout" value. An error message on the client saying "2013 Lost connection to MySQL Server at '<something>', system error: <errno>" most likely indicates the the "connect_timeout" value is to low on the server. There is also "2013 Lost connection to MySQL Server during query" but that is a _different_ problem.
[21 Sep 2007 16:59]
Magnus Svensson
Talked with Monty about this for an explanation why we have such a low connect_timeout by default. It's primarily a security feature i.e if there are "clients" connecting to the server without doing anything, they will only block the server(i.e one thread in the server) for the 5 seconds. Since quite a lot of these error messages show up in a web search, we will increase the default value to 10 seconds. Increasing the connect_timeout value in the servers my.cnf is recommended if this problem occurs frequently. It does not really make the server insecure. Check the value of "aborted_connections" by using "SHOW STATUS like 'aborted_connections'" to easily determine this is the problem you experienc, it will increase with one for each connection the server aborts.
[21 Sep 2007 17:11]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/34456 ChangeSet@1.2526, 2007-09-21 17:10:45+02:00, msvensson@shellback.(none) +1 -0 Bug#28359 Intermitted lost connection at 'reading authorization packet' errors - Increase default 'connect_timeout' value to 10 seconds
[18 Oct 2007 23:35]
Bugs System
Pushed into 5.1.23-beta
[18 Oct 2007 23:37]
Bugs System
Pushed into 5.0.52
[25 Oct 2007 21:24]
Paul DuBois
Noted in 5.0.52, 5.1.23 changelogs. The default value of the connect_timeout system variable was increased from 5 to 10 seconds. This might help in cases where clients frequently encounter errors of the form Lost connection to MySQL server at 'XXX', system error: errno. Also will add a new section about the "Lost connection" error at: http://dev.mysql.com/doc/refman/5.0/en/common-errors.html
[9 Dec 2007 8:00]
Ivan Pupkin
Hi! I'd like to comment this issue. U think the problem is more complicated. I have several 5.0.38 mysql servers. On 32 bit platform and on 64 bit platform. On 64-bit platform i've noticed those problems You describe - a random connection lost to mysql server. The encreased timeout helped, but there is at least one issue when it doesn't help. When i'm copying a large file from the mysql server with rsync protocol - it's impossible to connect to mysql server - the connection lost appears. I was dumping traffic with tcpdump - and noticed a lot of packets with checksum error. I must add that this doesn't happen when You are loading the network with another protocols (at least that i tried didnt). And this issue does not happen on 32 bit pltform. If You are able to do this please try to reproduce it, the keys are: 1) 64 bit linux 2) copying a large file (>10Gb) with rsync from mysql server 3) maybe master-slave replication (my servers are in master-slave replication)
