Bug #41092 Maria: timeout in waiting_threads-t
Submitted: 28 Nov 2008 12:56 Modified: 2 Apr 2009 7:03
Reporter: Guilhem Bichot Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Maria storage engine Severity:S3 (Non-critical)
Version:6.0-maria OS:Linux (itanium)
Assigned to: Assigned Account CPU Architecture:Any
Tags: pushbuild, sporadic, test failure, timeout

[28 Nov 2008 12:56] Guilhem Bichot
Description:
In pushbuild/6.0-maria (see private post for URL), we see this timeout:
# kill strategy: YOUNGEST
# Testing waiting_threads with 30 threads, 3000 iterations... 

TIMEOUT (7200 seconds), ABORTING.

It's random and rare (no related code changes done in the previous weeks, first failure in ten last pushes).

How to repeat:
It's random and rare (first failure in ten last pushes).
[28 Nov 2008 12:57] Guilhem Bichot
Suggest I4 as this is a unit test, random and rare failure on one Itanium platform.
[8 Dec 2008 10:02] Guilhem Bichot
Serg said that we don't use in the server the code which is tested by the timing-out portion ("YOUNGEST") so this test failure is low priority, we will disable this portion if requested.
[10 Dec 2008 18:09] Ingo Strüwing
Would be great, if you can do something about it.

I have similar on Debian GNU/Linux unstable x86_64.

After today's pull from 6.0-bugteam (revision-id:satya.bn@sun.com-20081210052907-zy3mj8sj4sqejerz) I had an infinite loop for 10 hours (!). I wonder why there is no timeout.

The loop happened inside libpthread:

#0  0x00002b2d39b7cd29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x000000000041474a in safe_cond_wait (cond=0x762f40, mp=0x761660, file=0x44950e "thr_template.c", line=47) at thr_mutex.c:249
#2  0x000000000040c385 in test_concurrently (test=0x44959d "waiting_threads", handler=0x40c538 <test_wt>, n=0, m=3000) at thr_template.c:47
#3  0x000000000040cc88 in do_one_test () at waiting_threads-t.c:144
#4  0x000000000040d45b in do_tests () at waiting_threads-t.c:254
#5  0x000000000040c4f5 in main (argc=1, argv=0x7fff71789cf8) at thr_template.c:79

do_tests () is at: test_kill_strategy(YOUNGEST);
do_one_test () is at: test_concurrently("waiting_threads", test_wt, THREADS, CYCLES);
test_concurrently () is at: pthread_cond_wait(&cond, &mutex);
safe_cond_wait () is at: error=pthread_cond_wait(cond,&mp->mutex);

When I hit ddd's 'next' button at this line, the CPU goes to 100% again and the function does not return.

I have seen hour-long consumption of CPU before, but it used to time out. Casual checks with 'top' used to show waiting_threads-t at the top. So it doesn't seem to be a seldom problem. At least not on my machine. [Why do all problems manifest on my system? Sigh.]
[12 Feb 2009 10:12] Sergei Golubchik
http://lists.mysql.com/maria/442
[17 Feb 2009 11:46] Bugs System
Pushed into 6.0.10-alpha (revid:serg@mysql.com-20090217113558-vpsqsyjule7nz0gk) (version source revid:guilhem@mysql.com-20090213163054-rsg204z5qzcekbfe) (merge vers: 6.0.10-alpha) (pib:6)
[4 Mar 2009 13:01] Jonathan Perkin
This appears to still happen in 6.0.10.  I noticed it in 6.0.9 but it seemed limited to solaris 64bit and linux-powerpc platforms, however now it has spread to all Solaris platforms.

If I leave the test running for a few days it finally completes (at least on reasonably fast machines), but this is unworkable for release builds.

truss gives no output, I have pstack output upon request.  I will leave the tests running on a couple of machines for now in case you want to live debug.
[4 Mar 2009 16:14] Jonathan Perkin
I left truss running for a while and managed to get some output:

Base time stamp:  1236172669.3688  [ Wed Mar  4 13:17:49 Europe/Oslo 2009 ]
13885:  2945.2669       lwp_unpark(81, 1)                               = 0
13885:  2945.2676       lwp_park(0x00000000, 0)                         = 0
13885:  2945.2683       lwp_unpark(81, 1)                               = 0
13885:  2945.2686       lwp_park(0x00000000, 0)                         = 0
13885:  2945.2690       lwp_park(0x00000000, 0)                         = 0
13885:  2945.2689       lwp_unpark(73, 1)                               = 0
13885:  2945.2788       lwp_park(0x00000000, 0)                         = 0
13885:  2945.2791       lwp_park(0x00000000, 0)                         = 0
13885:  2945.2795       lwp_unpark(91, 1)                               = 0
13885:  2945.2795       lwp_park(0x00000000, 0)                         = 0
13885:  2945.2799       lwp_park(0x7EA7BC70, 91)                        Err#62 ETIME
13885:  2945.2803       lwp_park(0x7F67BC70, 91)                        Err#62 ETIME
13885:  2945.2806       lwp_unpark(81, 1)                               = 0
13885:  2945.2809       lwp_park(0x00000000, 0)                         = 0

13885:  4701.5802       lwp_park(0x00000000, 0)                         = 0
13885:  4701.5801       lwp_unpark(70, 1)                               = 0
13885:  4701.5811       lwp_unpark(70, 1)                               = 0
13885:  4701.5815       lwp_park(0x00000000, 0)                         = 0
13885:  4701.5820       lwp_park(0x00000000, 0)                         = 0
13885:  4701.5824       lwp_park(0x00000000, 0)                         = 0
13885:  4701.5828       lwp_park(0x00000000, 0)                         = 0
13885:  2945.2799       lwp_park(0x7EA7BC70, 91)                        Err#62 ETIME
13885:  2945.2803       lwp_park(0x7F67BC70, 91)                        Err#62 ETIME
13885:  2945.2806       lwp_unpark(81, 1)                               = 0
13885:  2945.2809       lwp_park(0x00000000, 0)                         = 0

13885:  4701.5802       lwp_park(0x00000000, 0)                         = 0
13885:  4701.5801       lwp_unpark(70, 1)                               = 0
13885:  4701.5811       lwp_unpark(70, 1)                               = 0
13885:  4701.5815       lwp_park(0x00000000, 0)                         = 0
13885:  4701.5820       lwp_park(0x00000000, 0)                         = 0
13885:  4701.5824       lwp_park(0x00000000, 0)                         = 0
13885:  4701.5828       lwp_park(0x00000000, 0)                         = 0
13885:  4701.5831       lwp_park(0x7F47BC70, 91)                        Err#62 ETIME
13885:  4701.5834       lwp_park(0x7D27BC70, 85)                        Err#62 ETIME
13885:  4701.5837       lwp_park(0x7D87BC70, 72)                        Err#62 ETIME

13885:  8428.9742       lwp_unpark(81, 1)                               = 0
13885:  8428.9748       lwp_park(0x00000000, 0)                         = 0
13885:  8428.9753       lwp_unpark(81, 1)                               = 0
13885:  8428.9757       lwp_park(0x00000000, 0)                         = 0
13885:  8428.9760       lwp_unpark(72, 1)                               = 0
13885:  8428.9763       lwp_park(0x00000000, 0)                         = 0
13885:  8428.9766       lwp_unpark(72, 1)                               = 0
13885:  8428.9769       lwp_park(0x00000000, 0)                         = 0

13885:  9697.6838       lwp_unpark(70, 1)                               = 0
13885:  9697.6848       lwp_park(0x00000000, 0)                         = 0
13885:  9697.8719       lwp_unpark(70, 1)                               = 0
13885:  9697.8723       lwp_park(0x00000000, 0)                         = 0
13885:  9697.8727       lwp_unpark(81, 1)                               = 0
13885:  9697.8730       lwp_park(0x00000000, 0)                         = 0
13885:  9697.8733       lwp_unpark(81, 1)                               = 0
13885:  9697.8736       lwp_park(0x00000000, 0)                         = 0
13885:  9697.8739       lwp_park(0x00000000, 0)                         = 0
13885:  9697.9116       lwp_park(0x7EA7BC70, 85)                        Err#62 ETIME
[...]
[10 Mar 2009 8:04] Alexander Nozdrin
See also Bug#43359.
[2 Apr 2009 7:03] Guilhem Bichot
duplicate of BUG#43359