Bug #116714 | With 8000 to 15,999 Tablespaces, MySQL Startup Spawns one Useless Sub-Thread. | ||
---|---|---|---|
Submitted: | 19 Nov 2024 16:04 | Modified: | 20 Nov 2024 10:58 |
Reporter: | Jean-François Gagné | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S3 (Non-critical) |
Version: | 8.0 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[19 Nov 2024 16:04]
Jean-François Gagné
[19 Nov 2024 16:33]
MySQL Verification Team
Salut Monsieur Gagné, we are waiting for the team in charge to resolve these issues. We are grateful in advance for your patch. Thanks a lot.
[19 Nov 2024 16:57]
Jakub Lopuszanski
Hello Jean-François Gagné, yes, `par_for` is not optimal in how it splits the work among the threads. In the case `n > 0` it uses the main thread to process the remainder of `c.size() % n` items of `c`, which is often tiny compared to the `c.size() / n` assigned to each helper thread, perhaps even 0. Perhaps a fix could be just:
```
- size_t slice = (n > 0) ? c.size() / n : 0;
+ const size_t slice = c.size() / (n + 1);
```
Then:
- for n = 0, the main thread would do everything;
- for n = 1, the helper thread would do floor(c.size()/2) and the main thread would do ceil(c.size()/2);
- in general for n > 0, each helper thread would do floor(c.size()/(n+1)), while the main thread would do floor(c.size()/(n+1)) + c.size()%(n+1), which is a little unfair maybe, in particular for large values of n.

In general, the right way of splitting it would be to use two slice sizes:
```
- size_t slice = (n > 0) ? c.size() / n : 0;
+ const size_t small_slice = c.size() / (n + 1);
+ const size_t big_slice = small_slice + 1;
+ const size_t big_slices = c.size() % (n + 1);
+ ut_ad_eq(big_slices * big_slice + (n + 1 - big_slices) * small_slice, c.size());

  using Workers = std::vector<IB_thread>;
  Workers workers;
  workers.reserve(n);
+ auto b = c.begin();
  for (size_t i = 0; i < n; ++i) {
-   auto b = c.begin() + (i * slice);
-   auto e = b + slice;
+   const auto e = b + (i < big_slices ? big_slice : small_slice);

    auto worker = os_thread_create(pfs_key, i, f, b, e, i, args...);

    worker.start();

    workers.push_back(std::move(worker));
+   b = e;
  }

- f(c.begin() + (n * slice), c.end(), n, args...);
+ f(b, c.end(), n, args...);
```
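For reference, the two-slice-size arithmetic above can be sketched standalone (a minimal sketch; `split_ranges` is an illustrative name, not InnoDB code):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of the two-slice-size split: with n helper threads plus the
// main thread (n + 1 workers total), the first `big_slices` workers get
// one extra item, so every item is assigned exactly once and no two
// shares differ by more than 1.
std::vector<std::pair<std::size_t, std::size_t>> split_ranges(
    std::size_t total, std::size_t n) {
  const std::size_t workers = n + 1;
  const std::size_t small_slice = total / workers;
  const std::size_t big_slice = small_slice + 1;
  const std::size_t big_slices = total % workers;

  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  std::size_t b = 0;
  for (std::size_t i = 0; i < workers; ++i) {
    const std::size_t e = b + (i < big_slices ? big_slice : small_slice);
    ranges.emplace_back(b, e);
    b = e;
  }
  assert(b == total);  // the ranges tile [0, total) with no gaps
  return ranges;
}
```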
[19 Nov 2024 17:08]
Jean-François Gagné
> I do not think this should be fixed at the call of par_for, because this would make it more complicated to have a parameter for the thread number (see the bug that will be opened after this one, referenced in the comments). The feature request for having the number of threads as a parameter is Bug#116716.
[19 Nov 2024 19:19]
Jean-François Gagné
Hello Jakub, thanks for joining the discussion.

> yes `par_for` is not optimal in how it splits the work among the threads.

Indeed, but this was not the specific point I wanted to raise in this report.

> Perhaps a fix could be to do just [...]

It could indeed be a fix, but then it is addressing a broader problem than what I wanted to report here. My suggestion for this report was to avoid touching `par_for` and to focus only on the Duplicate Check Threads. If we change `par_for`, I would suggest that either the main thread does everything (as for 0 threads) or it "just" waits for the sub-threads. IMHO, either all or none of the processing should happen in the main thread. Also IMHO, the current implementation, where most of the work is done in sub-threads and the remainder in the main thread, or your suggestion with a different splitting, makes things complicated to understand.

> In general the right way of splitting it would be to use two slice sizes:

Maybe, but then you "break" my "expectation" from above (either all or no processing should happen in the main thread). I kind of see where you are going, with the "number of processing units" being n+1, and this actually starts to make sense with section "2." of Bug#116620, but I do not like that the main thread and sub-threads share work. The main thread "context" is different from the sub-thread context (Bug#115517), and mixing work looks wrong to me. All of that goes back to the "definition" of `par_for`:

1. do we want to split work between main and sub-threads (IMHO no, but I understand your suggestion);
2. do we want to "spawn" n threads (referring here to the argument of `par_for` named n), OR do we want the work to be done by n threads, which would potentially keep up to n CPUs busy (IMHO, n should be the number of CPUs we keep busy, and I understand it is weird with 0).
[19 Nov 2024 21:40]
Jean-François Gagné
Jakub, I thought of another reason your suggested change is problematic: it changes the interface of `par_for`. Currently, with a value of n threads and a large collection (c many orders of magnitude larger than n), work is mostly done in the n sub-threads (floor(c/n) each), with little work being done in the main thread (c%n). Your suggested change makes the work happen in n+1 threads, which significantly changes the "interface" of `par_for`. Currently, `par_for` is called with n as the "total number of threads to use", not the number of threads to "spawn". For your change to "work", it would need to be in a new par_for (parfor2, parfor_ng, ...), and we would have to change where `par_for` is called. Cheers, Jean-François
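For concreteness, the current split can be sketched standalone (`main_thread_items` is an illustrative helper, not InnoDB code). It shows that for n = 1 the main thread always gets 0 items, so the single spawned sub-thread adds no parallelism, matching the "one useless sub-thread" behavior in the report's title:

```cpp
#include <cstddef>

// Sketch of the current par_for split: each of the n helper threads
// takes c.size() / n items, and the main thread takes only the
// remainder. Returns how many items the main thread processes.
std::size_t main_thread_items(std::size_t total, std::size_t n) {
  if (n == 0) return total;              // no helpers: main thread does it all
  const std::size_t slice = total / n;   // per-helper share
  return total - n * slice;              // remainder, i.e. total % n
}
```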
[20 Nov 2024 10:58]
MySQL Verification Team
Salut, Monsieur Gagné, thank you for your bug report. This is now a verified bug for version 8.0 and all higher supported versions. Thank you.
[20 Nov 2024 11:24]
Jakub Lopuszanski
With all due respect, `par_for` is an internal function of the InnoDB code base, not exposed to end users, and thus the "expectations" for it come from the contract stated in the doxygen:
```
/** Parallel for loop over a container.
@param[in]      pfs_key Performance schema thread key
@param[in]      c       Container to iterate over in parallel
@param[in]      n       Number of threads to create
@param[in]      f       Callable instance
@param[in]      args    Zero or more args */
```
Here we see `n` is the number of threads to create. Clearly the value `n == 0` is allowed and frequently used. There is no promise as to how exactly the work will be split between the threads, whether it should be fair, or whether the main thread should participate or not. Because we support, and most often use, the case of n=0, it must mean that at least in this case the main thread participates (does all the work). So it is only natural (for me, as a developer) to assume that n+1 threads will participate, and indeed that is what happens right now; it is just that the main thread often gets very little to do, which I judge to be a bug, because it clearly makes the overall duration of `par_for` needlessly long. That there is an expectation that n+1 threads should be doing something is not clearly stated in the doxygen, but can be deduced from:

1. supporting the n=0 case, and the natural, least surprising extension of it to n>0 cases;
2. the surrounding code which actually uses it. In particular:
```
/* Get the number of additional threads needed to scan the files. */
size_t n_threads = fil_get_scan_threads(ibd_files.size());

if (n_threads > 0) {
  ib::info(ER_IB_MSG_382) << "Using " << (n_threads + 1) << " threads to"
                          << " scan " << ibd_files.size()
                          << " tablespace files";
}
```
where we clearly talk about *additional* threads and report `n_threads + 1` as the number of threads working on the problem. There is definitely some confusion in other places of our code base, too.
For example, `fil_get_scan_threads()` itself starts with a reasonable comment:
```
/* Number of additional threads required to scan all the files.
n_threads == 0 means that the main thread itself will do all the
work instead of spawning any additional threads. */
size_t n_threads = num_files / FIL_SCAN_MAX_TABLESPACES_PER_THREAD;
```
but then later on somehow forgets to take the main thread into account when it computes `max_threads`:
```
/* Number of concurrent threads supported by the host machine. */
size_t max_threads =
    FIL_SCAN_THREADS_PER_CORE * std::thread::hardware_concurrency();

/* If the number of concurrent threads supported by the host machine could
not be calculated, assume the supported threads to be FIL_SCAN_MAX_THREADS. */
max_threads = max_threads == 0 ? FIL_SCAN_MAX_THREADS : max_threads;

/* Restrict the number of threads to the lower of number of threads supported
by the host machine or FIL_SCAN_MAX_THREADS. */
if (n_threads > max_threads) {
  n_threads = max_threads;
}

if (n_threads > FIL_SCAN_MAX_THREADS) {
  n_threads = FIL_SCAN_MAX_THREADS;
}
```
where arguably the logic should subtract 1 from `max_threads` and `FIL_SCAN_MAX_THREADS`, if we really want the number of threads per core to be 2, or the total number of threads operating on the task to be 16.

The only other usage of `par_for` is in clone0api.cc, where it also seems a bit confused, as it passes `get_num_tasks()` instead of `get_num_tasks() - 1` to the function. However, it can be argued that, despite its name, `get_num_tasks()` is actually the number of *new* *threads* to spawn (people sometimes use the words "task" and "thread" interchangeably), because if you look at `set_num_tasks()`'s doxygen it has:
```
/** Calculate and set number of new tasks to spawn.
@param[in]      num_entries     number of entries to handle */
void set_num_tasks(size_t num_entries) {
```
So perhaps we would have to do some cleanup, too, but basically I can see only three self-consistent contracts:

A) n is the number of new threads to spawn, and all (n+1) threads participate in the work evenly;
B) n is the total number of threads including the main thread, thus has to be > 0, and all n threads participate in the work evenly;
C) n is the number of new threads to spawn, has to be > 0, and only they participate in the work.

The C) option, which I believe you advocate for, does not seem reasonable to me, as it means that even for small data sets we will be forced to spawn at least one new thread and do at least two context switches, or complicate the code in the caller to handle that case differently. Option A) seems to me most in line with the original intent of the author, as evidenced by the oldest usage, the comments, and the messages in the log. Option B) seems to me perhaps cleaner/more natural, in that it is conceptually easier to think in (high-level) terms of the total number of workers than in (low-level) terms of how many new threads we need.