Description:
Even with MySQL compiled --with-debug=full, Slurm (a cluster resource management program) triggers a SIGABRT that is raised from the free() routine in libc.so.
Here is the backtrace:
#0 0x00000031ab030215 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00000031ab030215 in raise () from /lib64/libc.so.6
#1 0x00000031ab031cc0 in abort () from /lib64/libc.so.6
#2 0x00000031ab06a7fb in __libc_message () from /lib64/libc.so.6
#3 0x00000031ab071ce2 in _int_free () from /lib64/libc.so.6
#4 0x00000031ab07590c in free () from /lib64/libc.so.6
#5 0x00002b535eb07b0f in _myfree (ptr=0x1c019af8, filename=<value optimized out>, lineno=<value optimized out>,
myflags=<value optimized out>) at safemalloc.c:326
#6 0x00002b535eb2c60a in vio_delete (vio=0x1b8e20b8) at vio.c:238
#7 0x00002b535eb276e0 in end_server (mysql=0x1b8283b8) at client.c:949
#8 0x00002b535eb279a8 in cli_safe_read (mysql=0x359a) at client.c:702
#9 0x00002b535eb27f0a in cli_read_rows (mysql=0x1b8283b8, mysql_fields=0x2aaaac30cc98, fields=2) at client.c:1389
#10 0x00002b535eb2602b in mysql_store_result (mysql=<value optimized out>) at client.c:2954
#11 0x00002b535e8d5e0c in _get_first_result (mysql_db=0x1b8283b8) at mysql_common.c:59
#12 0x00002b535e8d76fe in mysql_db_query_ret (mysql_db=0x1b8283b8,
query=0x2aaaac001978 "select cpu_count, cluster_nodes from cluster_event_table where cluster=\"andytest\" and period_end=0 and node_name='' limit 1", last=false) at mysql_common.c:617
#13 0x00002b535e8ca901 in clusteracct_storage_p_cluster_procs (mysql_conn=0x1b828118, cluster=0x1b824018 "andytest",
cluster_nodes=0x2aaaac0016b8 "node[11,13-17]", procs=40, event_time=1285026316) at accounting_storage_mysql.c:10505
#14 0x0000000000526286 in clusteracct_storage_g_cluster_procs (db_conn=0x1b828118, cluster=0x1b824018 "andytest",
cluster_nodes=0x2aaaac0016b8 "node[11,13-17]", procs=40, event_time=1285026316) at slurm_accounting_storage.c:8402
#15 0x0000000000425b86 in _accounting_cluster_ready () at controller.c:1057
#16 0x000000000042653c in _slurmctld_background (no_data=0x0) at controller.c:1353
#17 0x0000000000424d7f in main (argc=1, argv=0x7fff4c424098) at controller.c:525
(gdb) up 5
#5 0x00002b535eb07b0f in _myfree (ptr=0x1c019af8, filename=<value optimized out>, lineno=<value optimized out>,
myflags=<value optimized out>) at safemalloc.c:326
326 free((char*) irem);
(gdb) print irem
$1 = (struct st_irem *) 0x1c019ad0
(gdb) print *irem
$2 = {next = 0x1b8e2090, prev = 0x1c01db10, filename = 0x2b535eb3b9e3 "vio.c", datasize = 16384, linenum = 44,
SpecialValue = 3957108073}
(gdb)
How to repeat:
The problem is sporadic. Reproducing it typically takes 1-3 hours and 60,000 Slurm jobs, using a test program that puts a heavy load on Slurm.
If I can find an easy way to reproduce it, I'll post it here.
Suggested fix:
1. It looks like vio_delete() should validate its argument more thoroughly, even when it is called with a non-null argument.
2. It looks like safemalloc.c is also missing a check. (Since the SIGABRT emanates from libc, I assume safemalloc missed the problem.)
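
To illustrate the kind of check suggested in item 2, here is a minimal sketch of a debug allocator wrapper that refuses to pass a pointer to libc's free() unless it is currently tracked as a live allocation. This is not the safemalloc.c code, and it is not a proposed patch. All names (dbg_malloc, dbg_free, dbg_live_count, dbg_node) are hypothetical, and the sketch is single-threaded, so a real client library would need locking around the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not MySQL's safemalloc code. */
typedef struct dbg_node {
    struct dbg_node *next;
    void *user;                      /* pointer handed out by dbg_malloc() */
} dbg_node;

static dbg_node *live_head = NULL;   /* list of blocks not yet freed */
static size_t live_count = 0;

/* Allocate size bytes and record the block as live. */
void *dbg_malloc(size_t size)
{
    dbg_node *node = malloc(sizeof(*node));
    if (node == NULL)
        return NULL;
    node->user = malloc(size ? size : 1);
    if (node->user == NULL) {
        free(node);
        return NULL;
    }
    node->next = live_head;
    live_head = node;
    live_count++;
    return node->user;
}

/* Free ptr only if it is a live block from dbg_malloc().
   Returns 0 on success, or -1 if ptr is NULL, unknown, or already freed.
   On failure libc free() is never called, so libc cannot abort. */
int dbg_free(void *ptr)
{
    dbg_node **link;

    if (ptr == NULL)
        return -1;
    for (link = &live_head; *link != NULL; link = &(*link)->next) {
        if ((*link)->user == ptr) {
            dbg_node *node = *link;
            *link = node->next;      /* unlink before releasing memory */
            assert(live_count > 0);
            live_count--;
            free(node->user);
            free(node);
            return 0;
        }
    }
    return -1;                       /* not a live block: report, do not free */
}

/* Number of blocks currently tracked as live. */
size_t dbg_live_count(void)
{
    return live_count;
}
```

With a wrapper like this, a second dbg_free() on the same pointer returns -1 and can be logged with the caller's file and line, instead of reaching free() and aborting inside libc. The lookup is linear in the number of live blocks, which is acceptable for a debug build but not for production use.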