MySQL Bugs: #116714: With 8000 to 15,999 Tablespaces, MySQL Startup Spawns one Useless Sub-Thread.

Bug #116714	With 8000 to 15,999 Tablespaces, MySQL Startup Spawns one Useless Sub-Thread.
Submitted:	19 Nov 2024 16:04	Modified:	20 Nov 2024 10:58
Reporter:	Jean-François Gagné	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

the number of threads used for the MySQL startup phase "InnoDB Tablespace Duplicate Check" (and for "InnoDB Tablespace Validation" - the phase names are my own, more about them in [1]) is computed calling the function fil_get_scan_threads (definition: [2]; call for Duplicate Check: [3]; call for Tablespace Validation: [4]).  This function returns 0 for 0 to 7999 tablespaces, 1 for 8000 to 15,999, 2 for 16,000 to 23,999, and so on...

[1]: https://jfg-mysql.blogspot.com/2024/11/long-and-silent-stressful-mysql-startup.html#phases

[2]: https://github.com/mysql/mysql-server/blob/mysql-9.1.0/storage/innobase/fil/fil0fil.cc#L13...

[3]: https://github.com/mysql/mysql-server/blob/mysql-9.1.0/storage/innobase/fil/fil0fil.cc#L11...

[4]: https://github.com/mysql/mysql-server/blob/mysql-9.1.0/storage/innobase/handler/ha_innodb....

This thread number is then passed to the macro / function par_for (macro: [5]; function: [6]; call for Duplicate Check: [7]; call for Tablespace Validation: [8]).  When called with a number of threads equal to 0, par_for does all the work in the current thread.  With 1, it does all the work in a sub-thread, which is wasteful / spawning one useless sub-thread.  See How to repeat for details.

[5]: https://github.com/mysql/mysql-server/blob/mysql-9.1.0/storage/innobase/include/os0thread-...

[6]: https://github.com/mysql/mysql-server/blob/mysql-9.1.0/storage/innobase/include/os0thread-...

[7]: https://github.com/mysql/mysql-server/blob/mysql-9.1.0/storage/innobase/handler/ha_innodb....

[8]: https://github.com/mysql/mysql-server/blob/mysql-9.1.0/storage/innobase/handler/ha_innodb....

I might eventually submit a patch for this.

Many thanks for looking into this,

Jean-François Gagné

How to repeat:
v=9.1.0
dbdeployer deploy single mysql_$v -c log_error_verbosity=3
cd ~/sandboxes/msb_mysql_${v//./_}

# No MY-012207 in logs --> nb thread is 0.
./use <<< "create database test_jfg"; \
  seq -f "CREATE TABLE t%06.0f(i int);" 1 7000 | ./use test_jfg; \
  { ./stop; rm data/msandbox.err; ./start; } > /dev/null; \
  grep MY-012207 data/msandbox.err | wc -l
0

# Consider doing gdb test described below before creating more tables,
#   because DROP DATABASE is slow with many tables.

# "Using 2 threads" --> nb thread is 1 (more about this in section 2 of Bug#116620).
./use <<< "create database test_jfg2"; \
  seq -f "CREATE TABLE u%06.0f(i int);" 1 7000 | ./use test_jfg2; \
  { ./stop; rm data/msandbox.err; ./start; } > /dev/null; \
  grep MY-012207 data/msandbox.err
2024-11-11T18:59:43.437011Z 1 [Note] [MY-012207] [InnoDB] Using 2 threads to scan 14002 tablespace files

# To actually show we are creating a sub-thread, we need to use gdb,
#   but for that, we need a few things first.
pgrep="$(pgrep -a mysqld$)"
cmd="$(cut -d " " -f 2 <<< "$pgrep")"
args="$(cut -d " " -f 3- <<< "$pgrep")"

# Use gdp with MySQL debug binaries.
./stop
gdb ${cmd}-debug -ex "set args $args" -ex "break par_for" -ex start

# Make sure we are on the right breakpoint, if not, continue.
# We want to be in par_for, with Fil_system::scan and Tablespace_dirs::scan on the call-stack.
bt

# When in Fil_system::scan, add a breakpoint on thread creation.
break create_detached_thread

# Below, we expect to hit the breakpoint, and this means we created a thread.
finish

# Delete all breakpoints and continue.
delete breakpoints
continue

# Out of gdb, delete half the tables, stop MySQL, and restart all gdb above,
#   we expect not to hit the create_detached_thread for less than 8000 tables.
# Note that the drop is long, because of what Kris blogged about in below
#   (unclear if there is a bug for this, if not, there should be).
# https://blog.koehntopp.info/2024/10/28/mysql-information-schema-table-performance-regressi...
./use <<< "drop database test_jfg2"
./stop
gdb ...

Suggested fix:
I do not think par_for should handle 1 thread the same way as 0.  There might be a good reason to do the work in a sub-thread (like testing with sub-threading, which might have allowed finding Bug#115517 which is private, but which has been discussed publicly [a]).

[a]: https://www.percona.com/blog/do-not-upgrade-to-any-version-of-mysql-after-8-0-37/

I do not think this should be fixed at the call of par_for, because this would make it more complicated to have a parameter for the thread number (see bug that will be open after this one referenced in the comments).

I think the function fil_get_scan_threads should be changed in such a way that, instead of returning 1, it would return 0.  Another fix, which I am not in favor of because it is an incompatible change and might introduce other behavior modifications which could be un-welcomed (would be fine for MySQL 9, not for 8.4 and 8.0), is to change 1 to 2, 2 to 3, and so on.  Another possible fix: instead of returning 1, return 2 (without changing 2 to 3, 3 to 4, ...).

Salut Monsieur Gagne,

We are waiting for the team in charge regarding resolving these issues.

We are grateful in advance for your patch.

Thanks a lot.

Hello Jean-François Gagné,
yes `par_for` is not optimal in how it splits the work among the threads.
In case `n>0` it uses the main thread to process the remainder of c.size()%n items of c, which is often tiny compared to c.size()/n assigned to helper threads, perhaps even 0.

Perhaps a fix could be to do just:
```
- size_t slice = (n > 0) ? c.size() / n : 0;
+ const size_t slice = c.size() / (n+1);
```

Then:
- for n=0, the main thread would do everything.
- for n=1, the helper thread would do floor(c.size()/2) and the main thread would do ceil(c.size()/2)
- in general for n>0, each helper thread would do floor(c.size()/(n+1)), while the main thread would do floor(c.size()/(n+1)) + c.size()%(n+1) which is a little unfair maybe, in particular for large values of n. 

In general the right way of splitting it would be to use two slice sizes:
```
-  size_t slice = (n > 0) ? c.size() / n : 0;
+  const size_t small_slice = c.size() / (n + 1);
+  const size_t big_slice = small_slice + 1;
+  const size_t big_slices = c.size() % (n + 1);
+  ut_ad_eq(big_slices * big_slice + (n+1-big_slices) * small_slice, c.size());

   using Workers = std::vector<IB_thread>;

   Workers workers;

   workers.reserve(n);
-
+  auto b = c.being();
   for (size_t i = 0; i < n; ++i) {
-    auto b = c.begin() + (i * slice);
-    auto e = b + slice;
+    const auto e = b + (i < big_slices ? big_slice : small_slice);

     auto worker = os_thread_create(pfs_key, i, f, b, e, i, args...);
     worker.start();

     workers.push_back(std::move(worker));
+    b = e;
   }

-  f(c.begin() + (n * slice), c.end(), n, args...);
+  f(b, c.end(), n, args...);
```

> I do not think this should be fixed at the call of par_for, because this would make it more complicated to have a parameter for the thread number (see bug that will be open after this one referenced in the comments).

The feature request for having the number of threads as a parameter is Bug#116716.

Hello Jakub, thanks for joining the discussion.

> yes `par_for` is not optimal in how it splits the work among the threads.

Indeed, but this was not the specific point I wanted to raise in this report.

> Perhaps a fix could be to do just [...]

It could indeed be a fix, but then it is addressing a broader problem than what I wanted to report here.

My suggestion for this report was to avoid touching par_for and only focussing on Duplicate Check Threads.

If we change par_for, I would suggest that either the main thread does everything (like for 0 threads) or it "just" wait for sub-threads.  IMHO, either all or no processing should happen in the main thread.  Also IMHO, the current implementation where most of the work is done in sub-threads and the remainder is done in the main thread, or your suggestion with different splitting, makes things complicated to understand.

> In general the right way of splitting it would be to use two slice sizes:

Maybe, but you then "break" my "expectation" from above (either all or no processing should happen in the main thread).  I kind of see where you are going there, with "nb processing units" being n+1, and this actually starts to make sense with section "2." of Bug#116620, but I do not like that the main thread and sub-threads are sharing work.  The main thread "context" is different than sub-thread context (Bug#115517), and mixing work looks wrong to me.

All that goes back to the "definition" of par_for:

1. do we want to split work between main and sub threads (IMHO no, but I understand your suggestion);

2. do we want to "spawn" n threads (referring here to the argument of par_for named n) OR do we want the work to be done by n threads which would potentially keep up to n CPUs busy (IMHO, n should be the number of CPU we keep busy, and I understand it is weird with 0).

Jakub, I thought of another reason you suggested change is problematic: it changes then interface of par_for.

Currently, with a value of n threads and large collection (c many orders of magnitudes larger than n), work is mostly done in n sub threads (floor(c/n)) with little work being done in the main thread (c%n).  Your suggested change make work happening in n+1 threads, which significantly changes the "interface" of par_for.

Currently, par_for is called with n as the "total number of threads to use", not the number of threads to "spawn".

For your change to "work", it would need to be in a new par_for (parfor2, parfor_ng, ...), and we would have to change where par_for is called.

Cheers, Jean-François

Salut, Monsieur Gagne,

Thank you for your bug report.

This is now a verified bug for the versions 8.0 and all higher supported versions.

Thank you.

With all due respect, but `par_for` is an internal function of InnoDB code-base, not exposed to end users, and thus the "expectations" for it come from the contract stated in the doxygen:
```
/** Parallel for loop over a container.
@param[in]      pfs_key  Performance schema thread key
@param[in]      c        Container to iterate over in parallel
@param[in]      n        Number of threads to create
@param[in]      f        Callable instance
@param[in]      args     Zero or more args */
```
Here we see `n` is the number of threads to create.
Clearly the value `n==0` is allowed and frequently used.

There is no promise as to how exactly the work will be split between the 
threads, should it be fair, or should the main thread participate or not.

Because we support, and most often use, the case of n=0, it must mean that at least in this case the main thread participates (does all the work).
So it is only natural (for me, as developer) to assume that n+1 threads will participate, and indeed that's what is happening right now, it's just that the main thread often gets very little to do, which I judge to be a bug, because it clearly makes the overall duration of par_for needlessly longer than needed.
That there's an expectation that n+1 threads should be doing something is not clearly stated in the doxygen, but can be deduced from:
1. supporting n=0 case, and natural, least surprising, extension of it to n>0 cases
2. the surrounding code which actually uses it. In particular:
```
  /* Get the number of additional threads needed to scan the files. */
  size_t n_threads = fil_get_scan_threads(ibd_files.size());

  if (n_threads > 0) {
    ib::info(ER_IB_MSG_382)
        << "Using " << (n_threads + 1) << " threads to"
        << " scan " << ibd_files.size() << " tablespace files";
  }
```
where we clearly talk about *additional* threads and report `n_threads+1` as the number of threads working on the problem. 
There's definitely some confusion in our code base, too, in other places.
For example `fil_get_scan_threads()` itself starts with a reasonable comment:
```
/* Number of additional threads required to scan all the files.
n_threads == 0 means that the main thread itself will do all the
work instead of spawning any additional threads. */
size_t n_threads = num_files / FIL_SCAN_MAX_TABLESPACES_PER_THREAD;
```
but then later on, somehow forgets to take main thread into account when it computes the max_threads:
```
/* Number of concurrent threads supported by the host machine. */
size_t max_threads =
    FIL_SCAN_THREADS_PER_CORE * std::thread::hardware_concurrency();

/* If the number of concurrent threads supported by the host
machine could not be calculated, assume the supported threads
to be FIL_SCAN_MAX_THREADS. */
max_threads = max_threads == 0 ? FIL_SCAN_MAX_THREADS : max_threads;

/* Restrict the number of threads to the lower of number of threads
supported by the host machine or FIL_SCAN_MAX_THREADS. */
if (n_threads > max_threads) {
  n_threads = max_threads;
}

if (n_threads > FIL_SCAN_MAX_THREADS) {
  n_threads = FIL_SCAN_MAX_THREADS;
}
```
where arguably the logic should subtract 1 from max_threads and FIL_SCAN_MAX_THREADS, if we really want the number of threads per core to be 2, or the total number of threads operating on the task to be 16.
The only other usage of `par_for` is in clone0api.cc, where it also seems a bit confused as it passes get_num_tasks() instead of get_num_tasks()-1 to the function. However it can be argued, that despite its name `get_num_tasks()` is actually the number of *new* *threads* to spawn (people sometimes use words "task" and "thread" interchangebly), because if you look at `set_num_tasks()`'s doxygen it has:
```
  /** Calculate and set number of new tasks to spawn.
  @param[in]    num_entries     number of entries to handle */
  void set_num_tasks(size_t num_entries) {
```

So, perhaps we would have to do some cleanup, too, but basically I can see only three self-consistent contracts:
A) n is the number of new threads to spawn, all (n+1) threads participate in work evenly
B) n is the total number of threads including the main thread, thus has to be > 0, and all the n threads participate in work evenly
C) n is the number of new threads to spawn, has to be > 0, and only they participate in the work

The C) option, which I believe you advocate for, doesn't seem reasonable to me, as it means that even for small data sets we will be forced to spawn at least one new thread, and do at least two context-switches, or complicate the code in the caller to handle that case differently.

Option A) seems to me to be most in line with the original intent of the author, as evidenced by the oldest usage and comments and messages in log.
Option B) seems to me perhaps cleaner/natural in that it seems conceptually easier to think in (high-level) terms of total number of workers, than in (low-level) terms of how many new threads we need.