Bug #48929 | Error in Accept() if using many file descriptors | ||
---|---|---|---|
Submitted: | 20 Nov 2009 7:26 | Modified: | 7 Mar 2010 20:07 |
Reporter: | Peter Zaitsev (Basic Quality Contributor) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: General | Severity: | S3 (Non-critical) |
Version: | all | OS: | Linux |
Assigned to: | Magne Mæhre | CPU Architecture: | Any |
Tags: | qc |
[20 Nov 2009 7:26]
Peter Zaitsev
[20 Nov 2009 8:32]
Kristian Nielsen
The problem here is that one of the server sockets ends up with file descriptor 16393, which is bigger than FD_SETSIZE. This causes the code to pass random memory into the select() call. The fix would seem to be to use poll() instead of select() in handle_connections_sockets() in sql/mysqld.cc.
[23 Nov 2009 10:47]
Valeriy Kravchuk
I failed to repeat. I've started server, 5.1.42-debug (after setting ulimit -n 40000) as follows: suse:/home2/openxs/dbs/5.1 # bin/mysqld_safe --user=openxs --innodb_file_per_table --innodb_open_files=20000 & then I executed the following shell script to create 20000 InnoDB tables: openxs@suse:/home2/openxs/dbs/5.1> i=0; while [ $i -le 20000 ]; do echo $i; bin/mysql -uroot test -e "create table tii$i(c1 int primary key, c2 int) engine=InnoDB"; let i=i+1; done and after it finished killed -9 both mysqld and mysqld_safe. Upon restart I've got only these messages in the error log: 100228 15:36:48 mysqld_safe Starting mysqld daemon with databases from /home2/openxs/dbs/5.1/var 100228 15:36:48 [Note] Plugin 'FEDERATED' is disabled. 100228 15:36:48 [Note] Plugin 'ndbcluster' is disabled. InnoDB: The log sequence number in ibdata files does not match InnoDB: the log sequence number in the ib_logfiles! 100228 15:36:49 InnoDB: Database was not shut down normally! InnoDB: Starting crash recovery. InnoDB: Reading tablespace information from the .ibd files... InnoDB: Restoring possible half-written data pages from the doublewrite InnoDB: buffer... 100228 15:39:02 InnoDB: Started; log sequence number 0 66300224 100228 15:39:03 [Note] Event Scheduler: Loaded 0 events 100228 15:39:03 [Note] /home2/openxs/dbs/5.1/libexec/mysqld: ready for connections. Version: '5.1.42-debug' socket: '/tmp/mysql.sock' port: 3306 Source distribution
[23 Nov 2009 14:42]
Sveta Smirnova
Peter, thank you for the report. Could you please upload full error log file?
[23 Nov 2009 15:13]
Kristian Nielsen
It is pretty clear from the code that there is a problem: pthread_handler_t handle_connections_sockets(void *arg __attribute__((unused))) ... uint max_used_connection= (uint) (max(ip_sock,unix_sock)+1); ... FD_SET(ip_sock,&clientFDs); FD_SET(unix_sock,&clientFDs); ... if (select((int) max_used_connection,&readFDs,0,0,0) < 0) ... There is no check for if ip_sock or unix_sock are >= FD_SETSIZE. So if they are, the FD_SET() will overflow the buffer, and select() will be passed undefined stack memory. The easiest way to reproduce is probably to add something like this at the start of network_init() (in sql/mysqld.cc): for (int i= 0; i < 2000; i++) (void)open("/dev/null", O_RDONLY); This will force the sockets to get file descriptors outside of the range of FD_SETSIZE (which is 1024 on Linux).
[23 Nov 2009 15:24]
Lenz Grimmer
Blog post related to this bug report: http://www.mysqlperformanceblog.com/2009/11/20/rare-evil-mysql-bug/
[24 Nov 2009 7:25]
Sveta Smirnova
Kristian, thank you for the hint! I can not repeat problem on my side though. Peter, please also upload your configuration file.
[25 Nov 2009 6:52]
Sveta Smirnova
Kristian, have you tried your patch? I only can see error: "091125 9:39:19 [ERROR] Error in accept: Too many open files" and not original problem.
[25 Nov 2009 7:52]
Kristian Nielsen
Sounds like you ran mysqld with a limit on the number of open file handles allowed, which will prevent the problem from showing. On my system the default is 1024, so you need something like ulimit -n 20000 before starting the mysqld. (But I did not try the patch).
[25 Nov 2009 11:31]
Sveta Smirnova
Kristian, thank you for the update. Problem still not repeatable for me.
[26 Nov 2009 12:32]
Andrii Nikitin
From 'man fd_set': " An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. " <Undefined behavior> doesn't guarantee errors will be shown. Sveta, could you in the testcase just put following assert into handle_connections_sockets before SD_SET: (ip_sock < FD_SETSIZE && unix_sock < FD_SETSIZE) and confirm that no invalid values which cause undefined behavior go to FD_SET call?
[27 Nov 2009 9:18]
Sveta Smirnova
Andrii, thank you for the hint. I modify sql/mysqld.cc as follows: $bzr diff === modified file 'sql/mysqld.cc' --- sql/mysqld.cc 2009-11-03 00:52:57 +0000 +++ sql/mysqld.cc 2009-11-27 09:13:35 +0000 @@ -5074,6 +5074,7 @@ for (uint retry=0; retry < MAX_ACCEPT_RETRY; retry++) { size_socket length=sizeof(struct sockaddr_in); + sock=1025; new_sock = accept(sock, my_reinterpret_cast(struct sockaddr *) (&cAddr), &length); #ifdef __NETWARE__ MySQL error log spammed with ... 4893 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4894 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4895 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4896 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4897 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4898 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4899 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4900 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4901 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4902 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4903 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4904 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4905 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4906 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4907 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4908 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4909 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor 4910 091127 12:08:39 [ERROR] Error in accept: Bad file descriptor .... I believe if resource is temporarily not available due to some reason it would repeat trying to use not available resource again and again as it does with this case. So bug can be considered verified although original problem has not been repeated.
[10 Jan 2010 16:00]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/96472 2957 Magne Mahre 2010-01-10 Bug#48929 Error in Accept() if using many file descriptors In POSIX systems, the file descriptor set used in the select(2) system call is represented by a bit vector, of size FD_SETSIZE. When select(2) is used on file/socket descriptors with a value that is beyond the this size, unpredictable errors may occur. In this case, the error happens when there are a large number of tables that need repair. These tables are opened before the sockets for incoming connections are acquired, resulting in these sockets getting descriptor id which is higher than FD_SETSIZE. Replacing the call to select(2) with poll(2) fixes the problem, as poll takes an array of the wanted descriptors, instead of a bit vector. MS Windows has a different implementation of 'select', and is not affected by this bug. @ configure.in Added a test for the <poll.h> file @ sql/mysqld.cc Restructured some of the code to reduce the number of #ifdef's. Removed some HP/UX 10-specific code.
[3 Feb 2010 22:17]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/99173 2981 Magne Mahre 2010-02-03 Bug#48929 Error in Accept() if using many file descriptors In POSIX systems, the file descriptor set used in the select(2) system call is represented by a bit vector of size FD_SETSIZE. When select(2) is used on file/socket descriptors with a value that is beyond this size, unpredictable errors may occur. In this case, the error happens when there are a large number of tables that need repair. These tables are opened before the sockets for incoming connections are acquired, resulting in these sockets getting descriptor id which is higher than FD_SETSIZE. Replacing the call to select(2) with poll(2) fixes the problem, as poll takes an array of the wanted descriptors, instead of a bit vector. MS Windows has a different implementation of 'select', and is not affected by this bug. @ configure.in Added a test for the <poll.h> file @ sql/mysqld.cc Restructured some of the code to reduce the number of #ifdef's. Removed some HP/UX 10-specific code.
[14 Feb 2010 18:13]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/100308 2962 Magne Mahre 2010-02-14 Bug#48929 Error in Accept() if using many file descriptors In POSIX systems, the file descriptor set used in the select(2) system call is represented by a bit vector of size FD_SETSIZE. When select(2) is used on file/socket descriptors with a value that is beyond this size, unpredictable errors may occur. In this case, the error happens when there are a large number of tables that need repair. These tables are opened before the sockets for incoming connections are acquired, resulting in these sockets getting descriptor id which is higher than FD_SETSIZE. Replacing the call to select(2) with poll(2) fixes the problem, as poll takes an array of the wanted descriptors, instead of a bit vector. MS Windows has a different implementation of 'select', and is not affected by this bug. @ configure.in Added a test for the <poll.h> file @ sql/mysqld.cc Restructured some of the code to reduce the number of #ifdef's. Removed some HP/UX 10-specific code.
[25 Feb 2010 19:47]
Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100225194420-p60r4u90tszj8q2x) (version source revid:magne.mahre@sun.com-20100214183042-8kst82v1rm3887e6) (merge vers: 6.0.14-alpha) (pib:16)
[25 Feb 2010 19:49]
Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100225194305-h49uyjrlfl3mwo60) (version source revid:magne.mahre@sun.com-20100214181258-n0wu2x683akayt3h) (pib:16)
[6 Mar 2010 10:58]
Bugs System
Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:alik@sun.com-20100225195857-farb6yvy8x06bylj) (merge vers: 5.5.99-m3) (pib:16)
[7 Mar 2010 20:07]
Paul DuBois
Noted in 5.5.3, 6.0.14 changelogs. On POSIX systems, calls to select() with a file descriptor set larger than FD_SETSIZE resulted in unpredictable I/O errors; for example, when a large number of tables required repair.