Description:
I encountered performance slowdown
when running PHP(non-persistent connections) + MySQL + NDB (7.0.8).
SHOW PROCESSLIST told that many MySQL connections were kept
waiting at "cleaning up" phase.
I investigated in detail then found the following.
1. When terminating MySQL connection, THD object is destroyed.
This is done by "delete thd;" at unlink_thd() in mysqld.cc .
2. "delete thd" is called under critical section.
mysqld.cc bool one_thread_per_connection_end(THD *thd, bool put_in_cache)
{
DBUG_ENTER("one_thread_per_connection_end");
unlink_thd(thd);
if (put_in_cache)
put_in_cache= cache_thread();
pthread_mutex_unlock(&LOCK_thread_count);
mysqld.cc unlink_thd()
(void) pthread_mutex_lock(&LOCK_thread_count);
thread_count--;
delete thd;
DBUG_VOID_RETURN;
So delete thd; is protected by LOCK_thread_count.
3. ha_close_connection() is called inside ~THD()(THD destructor).
4. ha_close_connection() ends up doing network access.
Here is a stack trace I got during tests.
Thread 9 (Thread 0x42511940 (LWP 3528)):
#0 0x0000003ddd0d4018 in epoll_wait () from /lib64/libc.so.6
#1 0x000000000089f74b in TransporterRegistry::pollReceive (this=0x10e92ff0,
timeOutMillis=10) at TransporterRegistry.cpp:873
#2 0x0000000000847b5d in TransporterFacade::external_poll (this=0x10e49130,
wait_time=10) at TransporterFacade.cpp:611
#3 0x000000000084a49e in PollGuard::wait_for_input_in_loop
(this=0x42510810,
wait_time=120000, forceSend=<value optimized out>) at
TransporterFacade.cpp:1895
#4 0x000000000084a546 in PollGuard::wait_n_unlock (this=0xb,
wait_time=283734752,
nodeId=<value optimized out>, state=4294967295, forceSend=4)
at TransporterFacade.cpp:1870
#5 0x000000000082cf11 in Ndb::sendRecSignal (this=<value optimized
out>, node_id=2,
aWaitState=4, aSignal=0x42510870, conn_seq=0, ret_conn_seq=<value
optimized out>)
at Ndbif.cpp:1425
#6 0x000000000085232f in Ndb::releaseConnectToNdb (this=0x112a9cf0,
a_con=0x2aaab812e560) at Ndblist.cpp:484
#7 0x000000000082e421 in Ndb::doDisconnect (this=0x112a9cf0) at Ndb.cpp:255
#8 0x000000000084cf78 in ~Ndb (this=0x112a9cf0) at Ndbinit.cpp:151
#9 0x00000000007a0085 in ~Thd_ndb (this=0x2aaab003f460) at
ha_ndbcluster.cc:701
#10 0x00000000007af2fe in ha_ndbcluster::release_thd_ndb
(thd_ndb=0x2aaab003f460)
at ha_ndbcluster.cc:8497
#11 0x00000000007afbf1 in ndbcluster_close_connection (hton=<value
optimized out>,
thd=0x2aaab01ae130) at ha_ndbcluster.cc:8601
#12 0x000000000070a4ad in closecon_handlerton (thd=0x2aaab01ae130,
plugin=<value optimized out>, unused=<value optimized out>) at
handler.cc:606
#13 0x000000000078fcd7 in plugin_foreach_with_mask (thd=0x2aaab01ae130,
func=0x70a460 <closecon_handlerton>, type=<value optimized out>,
state_mask=4294967287, arg=0x0) at sql_plugin.cc:1859
#14 0x000000000060ceb5 in ~THD (this=0x2aaab01ae130) at sql_class.cc:952
#15 0x000000000061b510 in one_thread_per_connection_end (thd=0x2aaab01ae130,
put_in_cache=true) at mysqld.cc:1872
#16 0x0000000000625381 in handle_one_connection (arg=0x2aaab01ae130)
at sql_connect.cc:1140
#17 0x0000003dddc064a7 in start_thread () from /lib64/libpthread.so.0
#18 0x0000003ddd0d3c2d in clone () from /lib64/libc.so.6
The fundamental cause of this performance slowdown seems that
"ha_close_connection(this);" is called under critical section
(inside ~THD()). Each disconnection process requires network access
so when 50 clients disconnect at the same time 50 serialized
network access will happen.
How to repeat:
/* Establish 50 concurrent clients, then repeat the following */
for (int i=0; i< 2000; i++){
connect to mysqld
select id from table where id=1;
disconnect from mysqld
}
On my test environment,
MyISAM: 5.4 sec
NDB: 16 min
I wrote the attached C program to repeat.
Suggested fix:
Using persistent connection helps, but this is not always possible.
We can think of a couple of design..
1. Moving "delete thd" out of critical section
Currently "delete thd;" line is protected by LOCK_thread_count mutex,
so if it can be moved outside, it will be no longer serialized.
--- sql/mysqld.cc 2009-10-23 19:20:27 +0000
+++ sql/mysqld.cc 2009-11-16 14:07:24 +0000
@@ -1881,13 +1881,14 @@ void unlink_thd(THD *thd)
DBUG_PRINT("enter", ("thd: 0x%lx", (long) thd));
thd->cleanup();
+ delete thd;
+
pthread_mutex_lock(&LOCK_connection_count);
--connection_count;
pthread_mutex_unlock(&LOCK_connection_count);
(void) pthread_mutex_lock(&LOCK_thread_count);
thread_count--;
- delete thd;
DBUG_VOID_RETURN;
}
I tried the above, but unfortunately mysqld crashed by SIGSEGV.
I'm not 100% sure but ~THD() seems not thread safe so currently
it has to be protected by mutex anyway.
2. Moving ha_close_connection() out of critical section
I tried moving ha_close_connection() from ~THD() to THD::cleanup().
--- orig/sql/sql_class.cc 2009-08-29 02:27:31.000000000 +0900
+++ new/sql/sql_class.cc 2009-11-17 23:39:55.000000000 +0900
@@ -910,6 +910,7 @@ void THD::cleanup(void)
pthread_mutex_unlock(&LOCK_user_locks);
ull= NULL;
}
+ ha_close_connection(this);
cleanup_done=1;
DBUG_VOID_RETURN;
@@ -938,7 +939,6 @@ THD::~THD()
if (!cleanup_done)
cleanup();
- ha_close_connection(this);
plugin_thdvar_cleanup(this);
DBUG_PRINT("info", ("freeing security context"));
THD::cleanup() is called at the timing of disconnection,
and not called under critical section.
I tried the same benchmarking and it performs great (and
not crashed).
Current:
MyISAM: 5.4 sec
NDB: 16 min
Patched:
MyISAM: 5.4 sec
NDB: 17 sec
3. Modifying ha_ndbcluster.cc and NDB layer so that
disconnection operations are not executed by ha_close_connection()