MySQL Bugs: #48832: Disconnection is slow when using non-persistent mysql connections

Bug #48832	Disconnection is slow when using non-persistent mysql connections
Submitted:	17 Nov 2009 11:00	Modified:	23 Apr 2010 7:35
Reporter:	Yoshinori Matsunobu	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Locking	Severity:	S5 (Performance)
Version:	5.0+	OS:	Any
Assigned to:	Magnus Blåudd	CPU Architecture:	Any

Description:
I encountered performance slowdown
when running PHP(non-persistent connections) + MySQL + NDB (7.0.8).

SHOW PROCESSLIST told that many MySQL connections were kept
waiting at "cleaning up" phase.

I investigated in detail then found the following.

1. When terminating MySQL connection, THD object is destroyed.
This is done by "delete thd;" at unlink_thd() in mysqld.cc .

2. "delete thd" is called under critical section.

mysqld.cc bool one_thread_per_connection_end(THD *thd, bool put_in_cache)
{
  DBUG_ENTER("one_thread_per_connection_end");
  unlink_thd(thd);
  if (put_in_cache)
    put_in_cache= cache_thread();
  pthread_mutex_unlock(&LOCK_thread_count);

mysqld.cc unlink_thd()
  (void) pthread_mutex_lock(&LOCK_thread_count);
  thread_count--;
  delete thd;
  DBUG_VOID_RETURN;

So delete thd; is protected by LOCK_thread_count.

3. ha_close_connection() is called inside ~THD()(THD destructor).

4. ha_close_connection() ends up doing network access.
Here is a stack trace I got during tests.

Thread 9 (Thread 0x42511940 (LWP 3528)):
#0  0x0000003ddd0d4018 in epoll_wait () from /lib64/libc.so.6
#1  0x000000000089f74b in TransporterRegistry::pollReceive (this=0x10e92ff0,
    timeOutMillis=10) at TransporterRegistry.cpp:873
#2  0x0000000000847b5d in TransporterFacade::external_poll (this=0x10e49130,
    wait_time=10) at TransporterFacade.cpp:611
#3  0x000000000084a49e in PollGuard::wait_for_input_in_loop
(this=0x42510810,
    wait_time=120000, forceSend=<value optimized out>) at
TransporterFacade.cpp:1895
#4  0x000000000084a546 in PollGuard::wait_n_unlock (this=0xb,
wait_time=283734752,
    nodeId=<value optimized out>, state=4294967295, forceSend=4)
    at TransporterFacade.cpp:1870
#5  0x000000000082cf11 in Ndb::sendRecSignal (this=<value optimized
out>, node_id=2,
    aWaitState=4, aSignal=0x42510870, conn_seq=0, ret_conn_seq=<value
optimized out>)
    at Ndbif.cpp:1425
#6  0x000000000085232f in Ndb::releaseConnectToNdb (this=0x112a9cf0,
    a_con=0x2aaab812e560) at Ndblist.cpp:484
#7  0x000000000082e421 in Ndb::doDisconnect (this=0x112a9cf0) at Ndb.cpp:255
#8  0x000000000084cf78 in ~Ndb (this=0x112a9cf0) at Ndbinit.cpp:151
#9  0x00000000007a0085 in ~Thd_ndb (this=0x2aaab003f460) at
ha_ndbcluster.cc:701
#10 0x00000000007af2fe in ha_ndbcluster::release_thd_ndb
(thd_ndb=0x2aaab003f460)
    at ha_ndbcluster.cc:8497
#11 0x00000000007afbf1 in ndbcluster_close_connection (hton=<value
optimized out>,
    thd=0x2aaab01ae130) at ha_ndbcluster.cc:8601
#12 0x000000000070a4ad in closecon_handlerton (thd=0x2aaab01ae130,
    plugin=<value optimized out>, unused=<value optimized out>) at
handler.cc:606
#13 0x000000000078fcd7 in plugin_foreach_with_mask (thd=0x2aaab01ae130,
    func=0x70a460 <closecon_handlerton>, type=<value optimized out>,
    state_mask=4294967287, arg=0x0) at sql_plugin.cc:1859
#14 0x000000000060ceb5 in ~THD (this=0x2aaab01ae130) at sql_class.cc:952
#15 0x000000000061b510 in one_thread_per_connection_end (thd=0x2aaab01ae130,
    put_in_cache=true) at mysqld.cc:1872
#16 0x0000000000625381 in handle_one_connection (arg=0x2aaab01ae130)
    at sql_connect.cc:1140
#17 0x0000003dddc064a7 in start_thread () from /lib64/libpthread.so.0
#18 0x0000003ddd0d3c2d in clone () from /lib64/libc.so.6

The fundamental cause of this performance slowdown seems that
"ha_close_connection(this);" is called under critical section
(inside ~THD()). Each disconnection process requires network access
so when 50 clients disconnect at the same time 50 serialized 
network access will happen.

How to repeat:
/* Establish 50 concurrent clients, then repeat the following */
for (int i=0; i< 2000; i++){
  connect to mysqld
  select id from table where id=1;
  disconnect from mysqld
}

On my test environment,
 MyISAM: 5.4 sec
 NDB: 16 min 

I wrote the attached C program to repeat.

Suggested fix:
Using persistent connection helps, but this is not always possible. 

We can think of a couple of design..

1. Moving "delete thd" out of critical section
Currently "delete thd;" line is protected by LOCK_thread_count mutex,
so if it can be moved outside, it will be no longer serialized.

--- sql/mysqld.cc    2009-10-23 19:20:27 +0000
+++ sql/mysqld.cc    2009-11-16 14:07:24 +0000
@@ -1881,13 +1881,14 @@ void unlink_thd(THD *thd)
   DBUG_PRINT("enter", ("thd: 0x%lx", (long) thd));
   thd->cleanup();

+  delete thd;
+
   pthread_mutex_lock(&LOCK_connection_count);
   --connection_count;
   pthread_mutex_unlock(&LOCK_connection_count);

   (void) pthread_mutex_lock(&LOCK_thread_count);
   thread_count--;
-  delete thd;
   DBUG_VOID_RETURN;
 }

I tried the above, but unfortunately mysqld crashed by SIGSEGV.
I'm not 100% sure but ~THD() seems not thread safe so currently
it has to be protected by mutex anyway. 

2. Moving ha_close_connection() out of critical section

I tried moving ha_close_connection() from ~THD() to THD::cleanup().

--- orig/sql/sql_class.cc  2009-08-29 02:27:31.000000000 +0900
+++ new/sql/sql_class.cc    2009-11-17 23:39:55.000000000 +0900
@@ -910,6 +910,7 @@ void THD::cleanup(void)
     pthread_mutex_unlock(&LOCK_user_locks);
     ull= NULL;
   }
+  ha_close_connection(this);

   cleanup_done=1;
   DBUG_VOID_RETURN;

@@ -938,7 +939,6 @@ THD::~THD()
   if (!cleanup_done)
     cleanup();

-  ha_close_connection(this);
   plugin_thdvar_cleanup(this);

   DBUG_PRINT("info", ("freeing security context"));

THD::cleanup() is called at the timing of disconnection,
and not called under critical section.

I tried the same benchmarking and it performs great (and
not crashed). 

Current:
 MyISAM: 5.4 sec
 NDB: 16 min

Patched:
 MyISAM: 5.4 sec
 NDB: 17 sec 

3. Modifying ha_ndbcluster.cc and NDB layer so that 
disconnection operations are not executed by ha_close_connection()

C program to test

Attachment: bench.c (text/plain), 1.51 KiB.

Tentative implementation of alt 3

Attachment: bug48832_alt3.patch (text/x-patch), 5.25 KiB.

I've tried the suggested fix to move the call to 'ha_close_connection' and also implemented a third way with a small pool for Ndb objects in ha_ndbcluster so that we don't have to do "delete ndb" all the time. Since that would be possible to turn on and off, it would feel safer in some way.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/105161

Will rewrite patch and add a THD::teardown function which can be used to perform things which does not need the global lock. Thus moving it out of THD::~THD.

However, a flag will still be used so that if teardown has not been run already, it will be run by ~THD.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/105212

Pushed to MySQL Cluster 7.0.15 and 7.1.4

Documented fix in the NDB-7.0.15 and 7.1.4 changelogs as follows:

      The performance of MySQL applications using impersistent 
      client connections was adversely affected by due to many of 
      these connections being kept waiting for an excessive length of 
      time in cleanup phase while being closed.

Closed.