Description:
I encountered performance slowdown
when running PHP(non-persistent connections) + MySQL + NDB (7.0.8).
SHOW PROCESSLIST told that many MySQL connections were kept
waiting at "cleaning up" phase.
I investigated in detail then found the following.
1. When terminating MySQL connection, THD object is destroyed.
This is done by "delete thd;" at unlink_thd() in mysqld.cc .
2. "delete thd" is called under critical section.
mysqld.cc bool one_thread_per_connection_end(THD *thd, bool put_in_cache)
{
DBUG_ENTER("one_thread_per_connection_end");
unlink_thd(thd);
if (put_in_cache)
put_in_cache= cache_thread();
pthread_mutex_unlock(&LOCK_thread_count);
mysqld.cc unlink_thd()
(void) pthread_mutex_lock(&LOCK_thread_count);
thread_count--;
delete thd;
DBUG_VOID_RETURN;
So delete thd; is protected by LOCK_thread_count.
3. ha_close_connection() is called inside ~THD()(THD destructor).
4. ha_close_connection() ends up doing network access.
Here is a stack trace I got during tests.
Thread 9 (Thread 0x42511940 (LWP 3528)):
#0 0x0000003ddd0d4018 in epoll_wait () from /lib64/libc.so.6
#1 0x000000000089f74b in TransporterRegistry::pollReceive (this=0x10e92ff0,
timeOutMillis=10) at TransporterRegistry.cpp:873
#2 0x0000000000847b5d in TransporterFacade::external_poll (this=0x10e49130,
wait_time=10) at TransporterFacade.cpp:611
#3 0x000000000084a49e in PollGuard::wait_for_input_in_loop
(this=0x42510810,
wait_time=120000, forceSend=<value optimized out>) at
TransporterFacade.cpp:1895
#4 0x000000000084a546 in PollGuard::wait_n_unlock (this=0xb,
wait_time=283734752,
nodeId=<value optimized out>, state=4294967295, forceSend=4)
at TransporterFacade.cpp:1870
#5 0x000000000082cf11 in Ndb::sendRecSignal (this=<value optimized
out>, node_id=2,
aWaitState=4, aSignal=0x42510870, conn_seq=0, ret_conn_seq=<value
optimized out>)
at Ndbif.cpp:1425
#6 0x000000000085232f in Ndb::releaseConnectToNdb (this=0x112a9cf0,
a_con=0x2aaab812e560) at Ndblist.cpp:484
#7 0x000000000082e421 in Ndb::doDisconnect (this=0x112a9cf0) at Ndb.cpp:255
#8 0x000000000084cf78 in ~Ndb (this=0x112a9cf0) at Ndbinit.cpp:151
#9 0x00000000007a0085 in ~Thd_ndb (this=0x2aaab003f460) at
ha_ndbcluster.cc:701
#10 0x00000000007af2fe in ha_ndbcluster::release_thd_ndb
(thd_ndb=0x2aaab003f460)
at ha_ndbcluster.cc:8497
#11 0x00000000007afbf1 in ndbcluster_close_connection (hton=<value
optimized out>,
thd=0x2aaab01ae130) at ha_ndbcluster.cc:8601
#12 0x000000000070a4ad in closecon_handlerton (thd=0x2aaab01ae130,
plugin=<value optimized out>, unused=<value optimized out>) at
handler.cc:606
#13 0x000000000078fcd7 in plugin_foreach_with_mask (thd=0x2aaab01ae130,
func=0x70a460 <closecon_handlerton>, type=<value optimized out>,
state_mask=4294967287, arg=0x0) at sql_plugin.cc:1859
#14 0x000000000060ceb5 in ~THD (this=0x2aaab01ae130) at sql_class.cc:952
#15 0x000000000061b510 in one_thread_per_connection_end (thd=0x2aaab01ae130,
put_in_cache=true) at mysqld.cc:1872
#16 0x0000000000625381 in handle_one_connection (arg=0x2aaab01ae130)
at sql_connect.cc:1140
#17 0x0000003dddc064a7 in start_thread () from /lib64/libpthread.so.0
#18 0x0000003ddd0d3c2d in clone () from /lib64/libc.so.6
The fundamental cause of this performance slowdown seems that
"ha_close_connection(this);" is called under critical section
(inside ~THD()). Each disconnection process requires network access
so when 50 clients disconnect at the same time 50 serialized
network access will happen.
How to repeat:
/* Establish 50 concurrent clients, then repeat the following */
for (int i=0; i< 2000; i++){
connect to mysqld
select id from table where id=1;
disconnect from mysqld
}
On my test environment,
MyISAM: 5.4 sec
NDB: 16 min
I wrote the attached C program to repeat.
Suggested fix:
Using persistent connection helps, but this is not always possible.
We can think of a couple of design..
1. Moving "delete thd" out of critical section
Currently "delete thd;" line is protected by LOCK_thread_count mutex,
so if it can be moved outside, it will be no longer serialized.
--- sql/mysqld.cc 2009-10-23 19:20:27 +0000
+++ sql/mysqld.cc 2009-11-16 14:07:24 +0000
@@ -1881,13 +1881,14 @@ void unlink_thd(THD *thd)
DBUG_PRINT("enter", ("thd: 0x%lx", (long) thd));
thd->cleanup();
+ delete thd;
+
pthread_mutex_lock(&LOCK_connection_count);
--connection_count;
pthread_mutex_unlock(&LOCK_connection_count);
(void) pthread_mutex_lock(&LOCK_thread_count);
thread_count--;
- delete thd;
DBUG_VOID_RETURN;
}
I tried the above, but unfortunately mysqld crashed by SIGSEGV.
I'm not 100% sure but ~THD() seems not thread safe so currently
it has to be protected by mutex anyway.
2. Moving ha_close_connection() out of critical section
I tried moving ha_close_connection() from ~THD() to THD::cleanup().
--- orig/sql/sql_class.cc 2009-08-29 02:27:31.000000000 +0900
+++ new/sql/sql_class.cc 2009-11-17 23:39:55.000000000 +0900
@@ -910,6 +910,7 @@ void THD::cleanup(void)
pthread_mutex_unlock(&LOCK_user_locks);
ull= NULL;
}
+ ha_close_connection(this);
cleanup_done=1;
DBUG_VOID_RETURN;
@@ -938,7 +939,6 @@ THD::~THD()
if (!cleanup_done)
cleanup();
- ha_close_connection(this);
plugin_thdvar_cleanup(this);
DBUG_PRINT("info", ("freeing security context"));
THD::cleanup() is called at the timing of disconnection,
and not called under critical section.
I tried the same benchmarking and it performs great (and
not crashed).
Current:
MyISAM: 5.4 sec
NDB: 16 min
Patched:
MyISAM: 5.4 sec
NDB: 17 sec
3. Modifying ha_ndbcluster.cc and NDB layer so that
disconnection operations are not executed by ha_close_connection()
Description: I encountered performance slowdown when running PHP(non-persistent connections) + MySQL + NDB (7.0.8). SHOW PROCESSLIST told that many MySQL connections were kept waiting at "cleaning up" phase. I investigated in detail then found the following. 1. When terminating MySQL connection, THD object is destroyed. This is done by "delete thd;" at unlink_thd() in mysqld.cc . 2. "delete thd" is called under critical section. mysqld.cc bool one_thread_per_connection_end(THD *thd, bool put_in_cache) { DBUG_ENTER("one_thread_per_connection_end"); unlink_thd(thd); if (put_in_cache) put_in_cache= cache_thread(); pthread_mutex_unlock(&LOCK_thread_count); mysqld.cc unlink_thd() (void) pthread_mutex_lock(&LOCK_thread_count); thread_count--; delete thd; DBUG_VOID_RETURN; So delete thd; is protected by LOCK_thread_count. 3. ha_close_connection() is called inside ~THD()(THD destructor). 4. ha_close_connection() ends up doing network access. Here is a stack trace I got during tests. Thread 9 (Thread 0x42511940 (LWP 3528)): #0 0x0000003ddd0d4018 in epoll_wait () from /lib64/libc.so.6 #1 0x000000000089f74b in TransporterRegistry::pollReceive (this=0x10e92ff0, timeOutMillis=10) at TransporterRegistry.cpp:873 #2 0x0000000000847b5d in TransporterFacade::external_poll (this=0x10e49130, wait_time=10) at TransporterFacade.cpp:611 #3 0x000000000084a49e in PollGuard::wait_for_input_in_loop (this=0x42510810, wait_time=120000, forceSend=<value optimized out>) at TransporterFacade.cpp:1895 #4 0x000000000084a546 in PollGuard::wait_n_unlock (this=0xb, wait_time=283734752, nodeId=<value optimized out>, state=4294967295, forceSend=4) at TransporterFacade.cpp:1870 #5 0x000000000082cf11 in Ndb::sendRecSignal (this=<value optimized out>, node_id=2, aWaitState=4, aSignal=0x42510870, conn_seq=0, ret_conn_seq=<value optimized out>) at Ndbif.cpp:1425 #6 0x000000000085232f in Ndb::releaseConnectToNdb (this=0x112a9cf0, a_con=0x2aaab812e560) at Ndblist.cpp:484 #7 0x000000000082e421 in Ndb::doDisconnect (this=0x112a9cf0) at Ndb.cpp:255 #8 0x000000000084cf78 in ~Ndb (this=0x112a9cf0) at Ndbinit.cpp:151 #9 0x00000000007a0085 in ~Thd_ndb (this=0x2aaab003f460) at ha_ndbcluster.cc:701 #10 0x00000000007af2fe in ha_ndbcluster::release_thd_ndb (thd_ndb=0x2aaab003f460) at ha_ndbcluster.cc:8497 #11 0x00000000007afbf1 in ndbcluster_close_connection (hton=<value optimized out>, thd=0x2aaab01ae130) at ha_ndbcluster.cc:8601 #12 0x000000000070a4ad in closecon_handlerton (thd=0x2aaab01ae130, plugin=<value optimized out>, unused=<value optimized out>) at handler.cc:606 #13 0x000000000078fcd7 in plugin_foreach_with_mask (thd=0x2aaab01ae130, func=0x70a460 <closecon_handlerton>, type=<value optimized out>, state_mask=4294967287, arg=0x0) at sql_plugin.cc:1859 #14 0x000000000060ceb5 in ~THD (this=0x2aaab01ae130) at sql_class.cc:952 #15 0x000000000061b510 in one_thread_per_connection_end (thd=0x2aaab01ae130, put_in_cache=true) at mysqld.cc:1872 #16 0x0000000000625381 in handle_one_connection (arg=0x2aaab01ae130) at sql_connect.cc:1140 #17 0x0000003dddc064a7 in start_thread () from /lib64/libpthread.so.0 #18 0x0000003ddd0d3c2d in clone () from /lib64/libc.so.6 The fundamental cause of this performance slowdown seems that "ha_close_connection(this);" is called under critical section (inside ~THD()). Each disconnection process requires network access so when 50 clients disconnect at the same time 50 serialized network access will happen. How to repeat: /* Establish 50 concurrent clients, then repeat the following */ for (int i=0; i< 2000; i++){ connect to mysqld select id from table where id=1; disconnect from mysqld } On my test environment, MyISAM: 5.4 sec NDB: 16 min I wrote the attached C program to repeat. Suggested fix: Using persistent connection helps, but this is not always possible. We can think of a couple of design.. 1. Moving "delete thd" out of critical section Currently "delete thd;" line is protected by LOCK_thread_count mutex, so if it can be moved outside, it will be no longer serialized. --- sql/mysqld.cc 2009-10-23 19:20:27 +0000 +++ sql/mysqld.cc 2009-11-16 14:07:24 +0000 @@ -1881,13 +1881,14 @@ void unlink_thd(THD *thd) DBUG_PRINT("enter", ("thd: 0x%lx", (long) thd)); thd->cleanup(); + delete thd; + pthread_mutex_lock(&LOCK_connection_count); --connection_count; pthread_mutex_unlock(&LOCK_connection_count); (void) pthread_mutex_lock(&LOCK_thread_count); thread_count--; - delete thd; DBUG_VOID_RETURN; } I tried the above, but unfortunately mysqld crashed by SIGSEGV. I'm not 100% sure but ~THD() seems not thread safe so currently it has to be protected by mutex anyway. 2. Moving ha_close_connection() out of critical section I tried moving ha_close_connection() from ~THD() to THD::cleanup(). --- orig/sql/sql_class.cc 2009-08-29 02:27:31.000000000 +0900 +++ new/sql/sql_class.cc 2009-11-17 23:39:55.000000000 +0900 @@ -910,6 +910,7 @@ void THD::cleanup(void) pthread_mutex_unlock(&LOCK_user_locks); ull= NULL; } + ha_close_connection(this); cleanup_done=1; DBUG_VOID_RETURN; @@ -938,7 +939,6 @@ THD::~THD() if (!cleanup_done) cleanup(); - ha_close_connection(this); plugin_thdvar_cleanup(this); DBUG_PRINT("info", ("freeing security context")); THD::cleanup() is called at the timing of disconnection, and not called under critical section. I tried the same benchmarking and it performs great (and not crashed). Current: MyISAM: 5.4 sec NDB: 16 min Patched: MyISAM: 5.4 sec NDB: 17 sec 3. Modifying ha_ndbcluster.cc and NDB layer so that disconnection operations are not executed by ha_close_connection()