Bug #58662	File descriptor leak in mysql_real_connect? (replication related?)
Submitted:	2 Dec 2010 15:18	Modified:	30 Dec 2012 10:12
Reporter:	Hartmut Holzgraefe
Status:	Can't repeat
Category:	MySQL Server: C API (client library)	Severity:	S3 (Non-critical)
Version:	mysql-5.1	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
After returning the following connection errors quite a few times in a loop:

  Jul 9 14:22:49 PL_2_3 mysqld: 100709 14:22:49 [ERROR] Slave I/O: error connecting to master 'mysql@10.22.35.11:3822' - retry-time: 1 retries: 86400, Error_code: 2004
  Jul 9 14:22:49 PL_2_3 mysqld: 100709 14:22:49 [Note] Slave SQL thread initialized, starting replication in log 'log-bin.000061' at position 6289, relay log './pid-relay-bin.000001' position: 4
  Jul 9 14:22:54 PL_2_3 mysqld: 100709 14:22:54 [Note] Slave I/O thread killed while connecting to master

a mysqld slave finally got stuck with 

  Jul 9 16:38:26 PL_2_3 mysqld: 100709 16:38:26 [ERROR] Error in accept: Too many open files

The server was kept running for a few days in that state for analysis (it was a test machine only, not a production system), but never recovered, as none of the used file descriptors were freed.

lsof shows lots of entries like

  mysqld  19222 mysql  289u  sock                0,5          7762570 can't identify protocol

"can't identify protocol" says that the file descriptor is for a socket, but either no connect() has happened on that socket yet or connect() has failed. The actual protocol for the socket is only known after connect() has succeeded.

Looks as if there's a code path in which a socket is created but is not properly closed again after connect() failures ...
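To make the suspected pattern concrete, here is a minimal sketch of a connect attempt that releases its socket on failure. It is not code from the MySQL client library; connect_or_close() is a hypothetical helper, and IPv4 with a numeric address is assumed for brevity. The point is only the close() on the connect() failure path: if it were missing, every failed retry would leave one unconnected socket behind, which is what the lsof output above would look like.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, NOT taken from the MySQL client library.
 * Returns a connected descriptor, or -1 on failure. On every failure
 * path after socket() has succeeded, the descriptor is closed again. */
int connect_or_close(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    int fd;

    assert(ip != NULL);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;              /* no socket created yet, nothing to close */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;              /* socket() itself failed, nothing to close */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);              /* without this, each failed retry leaks one fd */
        return -1;
    }
    return fd;                  /* caller owns the descriptor and must close it */
}
```

Calling this in a retry loop against an unreachable master should leave the number of open descriptors unchanged, which can be checked with lsof between iterations.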

How to repeat:
no idea, only seen once in the wild so far ...

I have tried to find a code path in mysql_real_connect() that could explain this, but everything looks ok to me in there.

The fact that "error connecting ... Error_code: 2004" is logged also indicates that everything is ok on the mysqld side. As far as I can tell, error code 2004 (CR_IPSOCK_ERROR) is only raised if the socket() system call returns -1, so it looks as if it's actually the system call or libc that is leaking the file descriptor already ...?
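For reference, socket() also returns -1 (with errno set to EMFILE) once a process has reached its open-descriptor limit, so a failing socket() call does not necessarily mean a descriptor was created and lost. The sketch below shows this in isolation. exhaust_descriptors() and MAX_PROBE are hypothetical names made up for this demo, and whether this is what actually happened on the affected server is not known.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_PROBE 4096          /* hypothetical upper bound for this demo */

/* Hypothetical demo helper: temporarily lowers the soft RLIMIT_NOFILE to
 * `limit`, opens sockets until socket() fails, stores the errno of that
 * failure in *saved_errno, then closes everything and restores the limit.
 * Returns the number of sockets opened, or -1 if the limit could not be
 * read or changed. */
int exhaust_descriptors(rlim_t limit, int *saved_errno)
{
    static int fds[MAX_PROBE];
    struct rlimit old, lowered;
    int n = 0, fd, i;

    assert(saved_errno != NULL);
    *saved_errno = 0;

    if (getrlimit(RLIMIT_NOFILE, &old) != 0)
        return -1;
    lowered = old;
    lowered.rlim_cur = limit;
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
        return -1;

    while (n < MAX_PROBE) {
        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            *saved_errno = errno;   /* EMFILE once the limit is reached */
            break;
        }
        fds[n++] = fd;
    }

    for (i = 0; i < n; i++)         /* release everything opened above */
        close(fds[i]);
    setrlimit(RLIMIT_NOFILE, &old); /* restore the original soft limit */
    return n;
}
```

If the slave I/O thread were in this situation, repeated 2004 errors could be a consequence of descriptor exhaustion rather than its cause; the errno value behind the failing socket() call would help tell the two cases apart.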

Setting to "can't repeat" for now - haven't ever seen this happen myself, nor on other servers. Please comment if the problem is seen again and include exact OS details too.

I think I have hit an issue similar to the one explained here.
I have two NDB clusters (server1 and server2) and a master-master replication between them.  A few days ago I started to get "too many open files" errors from server2.  At the same time server1 wasn't able to connect to server2 for replication; it was displaying Error_code: 2004 as the reason when I ran ‘SHOW SLAVE STATUS\G’. I thought replication was down because the mysql server on server2 was not accepting any connections, so I restarted server2. After the restart, it was accepting connections but replication was still down. When I tried to connect to server2 from server1 using the replication user, I was able to connect with:
# mysql -h<IP_OF_SERVER2> -u<REP_USER> -p<REP_PWD> -P<3306>
But even though I stopped and started the slave on server1 several times, it didn’t connect; it always gave error 2004. I checked the number of files opened by mysql on server2 using the pfiles command and the pid of mysqld; it increased gradually each time I stopped and started the slave on server2.
Now there are two things I don't understand here. First, why do I get error code 2004 while I am able to connect from server1 to server2 using the same user/password/IP/port? Second, why does getting error 2004 cause mysql to hit the "max open file" limit? Is it possible that it doesn't close some files when it gets error 2004?

My mysql version is 5.1.30-ndb-6.3.20-cluster-gpl-log MySQL Cluster Server (GPL)