MySQL Bugs: #116796: get_oldest_view cost too much and block other oprator

Bug #116796	get_oldest_view cost too much and block other oprator
Submitted:	27 Nov 2024 6:23	Modified:	28 Nov 2024 17:13
Reporter:	ksql- team	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server	Severity:	S5 (Performance)
Version:	8.0.32	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	purge, read view

Description:

   The clone_oldest_view holds the trx_sys->mutex lock to perform the get_oldest_view operation, which requires traversing the view linked list to find the oldest view that has not been closed. 
   
   Therefore, when the view linked list is too long and there are too many closed read_views, this traversal operation will become time-consuming. 

   The trx_sys->mutex is the lock for the transaction system, which needs to be acquired for writing operations in the database and can therefore block other operations on the database.

   in our scenary, tps in replica with 3.6w RO connections decrease from 2w/s  
to 0.6w/s, cpu cost of clone_oldest_view raise to 30% or even more.

   related functions: MVCC::clone_oldest_view, MVCC::view_close

How to repeat:

1. create 3.6w connections to db, each connections do a simple select; here is an example use python

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import MySQLdb as mdb
import time

config = {
    'host': 'db-host',
    'port': db-port,
    'user': 'db-user',
    'passwd': 'db-password',
    'db': 'db-name',
}

connections = []
for i in range(36000):
    try:
        conn = mdb.connect(**config)
        conn.autocommit(True)
        connections.append(conn)
        cursor=conn.cursor()
        cursor.execute("select k from sbtest1 where id=1")
    except Exception as e:
        print(e)

time.sleep(600000)

for connection in connections:
    connection.close()

2. use sysbench load pressure, here is an example:

   sysbench oltp_update_index.lua --mysql-host=host --mysql-port=port --mysql-user=user --mysql-password=password --mysql-db=db --tables=100 --db-driver=mysql --table_size=1000 --report-interval=5 --threads=256 --time=120 run
   

3. observing the performance of databases, as well as cpu cost.

   in our scenary, tps in replica with 3.6w RO connections decrease from 2w/s  
to 0.6w/s, cpu cost of clone_oldest_view raise to 30% or even more.

Suggested fix:

    Provide an optional switch to disable the linked list removal operation in read view to ensure performance in related scenarios.

Hi Ksql-team,

Thank you for your bug report.

We would like to repeat the problem that you are reporting, but we have problems in that respect.

First of all, you wrote:

"

   in our scenary, tps in replica with 3.6w RO connections decrease from 2w/s  
to 0.6w/s, cpu cost of clone_oldest_view raise to 30% or even more.

"

What replica are you writing about above ??? Is that a replica from Group Replication or something similar ??

Next, what does a number of 3.6w stands for ???

Next, how do you measure the CPU cost. Please, explain the procedure in detail ......

Next, you are writing about a linked list, but can you be specific what linked list it is ??? Can you provide us with a source code file name and the function name ???

Last, but not least, you are asking for a new option in MySQL. Can you describe to us more precisely what kind of option would you want and what would it do ???

If your report is asking for the new option, then, according to our rules,  this is not a bug, but a feature request. 

We are waiting for your response.

Can't repeat.

Hi MySQL Verification Team,

Thank you for your reply, 

Above are the answers to the questions mentioned before:

We used a MySQL semi-synchronous cluster with one master and three replicas for testing. Therefore, the mentioned replicas are actually nodes of a semi-synchronous cluster. The deployed mode does not have a big impact on the actual issue. If a single-master database or other deployment modes are used, there could still be significant performance loss in certain cases. The key point is that if a node has a large number of read-only connections, for example, 36,000 read-only connections, its overall read-wirte performance will have a significant decrease.

The number "3.6w" represents the read-only connections on the performance evaluation instance, which means there are 36,000 connections that only execute select operations. The python code provided in the "How to repeat" section demonstrates a simple simulation method.

We used the "perf top" tool to observe CPU utilization. The specific command used is "perf top -p pid", which shows the CPU consumption of relevant functions. Other similar tools can also be used.

The mentioned linked list refers to ReadView::m_view_list, which is used to store all readviews in the current system. The implementation can be found in the file storage/innobase/include/read0types.h. The issue arises during the purge process when the operation clone_oldest_view is executed. It requires traversing the linked list ReadView::m_view_list to find the oldest view, which holds the trx_sys->mutex lock. When the linked list is long, finding the oldest view can take a long time, causing a block on other operations that require trx_sys->mutex lock.

We encountered a significant performance issue during usage, as described above. The requirement is to solve this problem. If the issue can be thoroughly resolved without the use of parameters, that would be the better solution.

related func: MVCC::clone_oldest_view MVCC::get_oldest_view

related code:

ReadView *MVCC::get_oldest_view() const {
  ReadView *view;

  ut_ad(trx_sys_mutex_own());

  for (view = UT_LIST_GET_LAST(m_views); view != nullptr;
       view = UT_LIST_GET_PREV(m_view_list, view)) {
    if (!view->is_closed()) {
      break;
    }
  }

  return (view);
}

regards

Hi ksql-team,

Please, let us. know whether this behaviour can only be observed in synchronous or semi-synchronous group replication, or is it observable also on the standalone server ???

Next,  you stated that you have 36000 concurrent connections. You did not specify whether it was on a single node or a total. 

More important is the fact that 36000 concurrent queries will require a VERY powerful hardware, with many multi-core  CPU's, very large installed RAM,  RAID disks and fully optimised settings for the MySQL server.  Otherwise having some connections blocked by the other is a totally expected behaviour. That can not be a bug ...... It is simply an overload.

Linked lists have to be used for this operation, because there is no way to know how many concurrent reads will occur. Hence, you can not optimise this by using , for example, arrays, since constantly increasing the array size, copying, traversing and then enlarging it further would be even slower.

Hence, we do not see any bug here. You can ask for the performance improvement, by changing Severity, but in that case, please explain which method would you use in place of the linked lists ????

Thanks in advance.

Hi MySQL Verification Team,

Thank you for your reply, 

Here are some replies to your questions:

In the case where there are many read-only connections, the behavior of purge holding locks to traverse the linked list and affecting performance can be observed in both synchronous or semi-synchronous group replication and standalone servers. The key factor is whether the observed node has many read-only connections.

The 3.6w connection is on a single node.

If the 3.6w connection only performs a simple select query and has no other actions, as provided by the logic of creating a connection in the aforementioned Python script, its resource demand is relatively small and theoretically should not affect the read and write performance of the node.

Traversing the linked list is necessary, but because of the special handling of read-only connections, in the absence of high concurrency for read and write operations, traversing the linked list takes longer. We analyzed the reason as follows: when executing the view close function (MVCC::view_close), as seen in the note section, the view will not be removed from m_views for AC-NL-RO transactions, leading to an excessively long linked list in the above scenario and high time consumption for the purge operation to traverse the list.

void MVCC::view_close(ReadView *&view, bool own_mutex) {
  uintptr_t p = reinterpret_cast<uintptr_t>(view);

  /* Note: The assumption here is that AC-NL-RO transactions will
  call this function with own_mutex == false. */
  if (!own_mutex) {
    /* Sanitise the pointer first. */
    ReadView *ptr = reinterpret_cast<ReadView *>(p & ~1);

    /* Note this can be called for a read view that
    was already closed. */
    ptr->m_closed = true;

    /* Set the view as closed. */
    view = reinterpret_cast<ReadView *>(p | 0x1);
  } else {
    view = reinterpret_cast<ReadView *>(p & ~1);

    view->close();

    UT_LIST_REMOVE(m_views, view);
    UT_LIST_ADD_LAST(m_free, view);

    ut_ad(validate());

    view = nullptr;
  }
}

Our suggestion is to provide an optional parameter to determine whether the view will be removed from m_views when view close is executed, so as to avoid significant performance degradation in the above scenario. However, this is only a suggestion, and if there is a better solution to this problem, it would be preferable.

The following is the information of the testing machine：

cpu(use lscpu to see): 
   CPU(s):56
   NUMA node(s):2
   Model name:Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz

Linux version 4.18.0

by the way:
If your testing machine that executes the Python script cannot create 36,000 connections at once, you can use multiple machines to create them. For example, you can use four machines to create 9,000 connections each. All connections need to be connected to the same MySQL node, which is a single node, not a cluster.

regards

Also note that for reproducing the issue, it is essential to first execute the Python script and confirm the existence of 36,000 connections before running the sysbench test, instead of running the connection and sysbench scripts simultaneously.

Additionally, the tps mentioned in the problem description dropped from 2w to 0.6w. This specific value is only applicable to our test environment, and the data performance in different test environments may vary, just as the performance values in different test environments will also vary. In different test environments, it is possible to focus on the percentage of performance degradation. For example, the performance of a node with 36,000 read-only connections versus the performance without 36,000 read-only connections.

suggested fix, add srv_enable_readview_remove switch, if is on, remove view from list instead of set closed but still in list m_views.

void MVCC::view_close(ReadView *&view, bool own_mutex) {
  uintptr_t p = reinterpret_cast<uintptr_t>(view);
  bool need_lock = srv_enable_readview_remove && !own_mutex;

  if (need_lock) {
    trx_sys_mutex_enter();
    own_mutex= true;
  }

  /* Note: The assumption here is that AC-NL-RO transactions will
  call this function with own_mutex == false. */
  if (!own_mutex) {
    /* Sanitise the pointer first. */
    ReadView *ptr = reinterpret_cast<ReadView *>(p & ~1);
    /* Note this can be called for a read view that
    was already closed. */
    ptr->m_closed = true;

    /* Set the view as closed. */
    view = reinterpret_cast<ReadView *>(p | 0x1);
  } else {
    view = reinterpret_cast<ReadView *>(p & ~1);

    view->close();

    UT_LIST_REMOVE(m_views, view);
    UT_LIST_ADD_LAST(m_free, view);

    ut_ad(validate());

    view = nullptr;
  }
  if (need_lock) {
    trx_sys_mutex_exit();
  }
}

change Severity to S5(Performance)

Hi ksql-team,

Thank you for your suggestions ...

However, the idea that you have expressed here:

"

Traversing the linked list is necessary, but because of the special handling of read-only connections, in the absence of high concurrency for read and write operations, traversing the linked list takes longer. We analyzed the reason as follows: when executing the view close function (MVCC::view_close), as seen in the note section, the view will not be removed from m_views for AC-NL-RO transactions, leading to an excessively long linked list in the above scenario and high time consumption for the purge operation to traverse the list.

"

can not be implemented. 

Simply, the standard for transactional engines explicitly states that  even in auto-commit mode, a single SELECT (without any locks), is executed in the following manner .....

BEGIN;
SELECT .......;
COMMIT:

If you are not, by default,  in auto-commit mode, then standard stipulates that your SELECT will only start a new transaction. Hence, you did not specify how have you notified the server that all your connections are in read-only mode ONLY.

That is a reason why your above idea can not work.

If you have answers to these questions, we shall definitely consider your ideas.

Hi MySQL Verification Team,

Thank you for your reply, 

there are two point,

Firstly, as you said before

'''
Simply, the standard for transactional engines explicitly states that  even in auto-commit mode, a single SELECT (without any locks), is executed in the following manner .....
BEGIN;
SELECT .......;
COMMIT:
'''

But the invocation of non-autocommit read-only transactions also causes the issue, which is unrelated to autocommit. 

The relevant code can be found in storage/innobase/trx/trx0trx.cc, in the function trx_commit_in_memory. The critical logic is as follows:

if (trx_is_autocommit_non_locking(trx)) {
...
	if (trx->read_view != nullptr) {
      trx_sys->mvcc->view_close(trx->read_view, false);
    }
...
} else {
...
    if (trx->read_only || trx->rsegs.m_redo.rseg == nullptr) {
      MONITOR_INC(MONITOR_TRX_RO_COMMIT);
      if (trx->read_view != nullptr) {
        trx_sys->mvcc->view_close(trx->read_view, false);
      }
    }
...
}

It can be analyzed from the code that for non-autocommit read-only transactions, when execute fucntion view_close(), the second parameter passed is false. This means that when the trx doing commit, the read view is retained in the read view list instead of being removed.

And Secondly,

As the code mentioned in trx_commit_in_memory, it can be seen that in function view_close() for autocommit or non-locking or read-only transactions, the view is not removed from the view list. 

But, in function view_open(), there is a restriction that only autocommit and non-locking transactions can reuse read views. The logic of the before and after restrictions is inconsistent, and we are perplexed as to why read-only transactions cannot reuse read views.

And as the note said in function view_open, 
'''
This can be optimised further...There is an inherent race here between purge and this thread...
'''
It is exactly the situation we met.

related code:
void MVCC::view_open(ReadView *&view, trx_t *trx) {
	...
  /** If no new RW transaction has been started since the last view
  was created then reuse the the existing view. */
  if (view != nullptr) {
    uintptr_t p = reinterpret_cast<uintptr_t>(view);

    view = reinterpret_cast<ReadView *>(p & ~1);

    ut_ad(view->m_closed);

    /* NOTE: This can be optimised further, for now we only
    reuse the view if there are no active RW transactions.

    There is an inherent race here between purge and this
    thread. Purge will skip views that are marked as closed.
    Therefore we must set the low limit id after we reset the
    closed status after the check. */

    if (trx_is_autocommit_non_locking(trx) && view->empty()) {
      view->m_closed = false;

      if (view->m_low_limit_id == trx_sys_get_next_trx_id_or_no()) {
        return;
      } else {
        view->m_closed = true;
      }
    }
  }

  ...

  trx_sys_mutex_exit();
}

As for the proposed solution, we currently suggest that the server does not need to adjust automatically based on the number of read-only connections. However, a configuration parameter that can be set using "set global xxx" can be provided. When encountering such scenarios, we can manually adjust it to avoid serious performance degradation.

Hi ksql-team,

First and first of all, you have not replied to answer our questions on what are read-only transactions ???

How do you define them and how do you run them as read-only ????

We asked you that before, but you did not answer.

Regarding committed transactions retained in the read view, they are very much needed for all concurrent transactions running in the READ-COMMITTED mode. Simply, if transaction T1 has committed in the middle of READ-COMMITTED transaction T2, T2 must be able to see the changes. That is why that read is retained .......

Hi MySQL Verification Team,

Firstly, a read-only transaction is a transaction that only involves read operations. Additionally, in the above Python script, there is an implementation method for quickly generating read-only transactions, and you can refer to the replication steps for better understanding. 

For example, if there are 10 connections, each executing a begin, select, and commit, there will be 10 views in the read view list, and their status will be "closed". Then, the purge thread needs to find the oldest read view before purging, which requires iterating through these 10 connections. Similarly, if there are many connections, iterating through them can be time-consuming. 

Closing the read view without removing it from the linked list is an optimization for the trx_sys->mutex and is not necessary. For more details, please refer to the above-mentioned function.

The above discussion is about read-only transactions that do not generate any writes, meaning that a transaction only involves read operations. This is also the optimization point of view_close and view_open. Please refer again to the code and related comments, which clearly express that there will be contention with the purge operation. We encountered this issue.

Hi ksql-team,

Standard for transactional processing does not have such thing as read-only transactions. There is no definition like that in the standard.

Also, this code is written as is for all the reasons that we wrote about already.

Changing anything there would cause a flood of regression bug, which would result in many , many problems. For our product and those who maintain it .....

Not a bug.

This is the relevant issues we have found, which can be used as reference, and the problem still exists in the implementation of version 8.0.32.

https://bugs.mysql.com/bug.php?id=88422

Hi ksql-team,

Yes, it is ...... 

However, our team in charge did not accept that idea.

However, we shall set this bug as a duplicate of that one and will write an e-mail to the team in charge, so they will re-consider the idea.

We shall also update the original bug report.

This report is a duplicate of the bug :

https://bugs.mysql.com/bug.php?id=88422

Duplicate.

Thank you for your help,

But our issue is not exactly the same as the bug mentioned above. We have additional points of concern:

First, the performance issue when view_list is too long is consistent;

Second, there is a logical problem with the implementation of view_close and view_open. view_open only reuses views for AC, RO, and NL transactions, but view_close sets views to close for all AC, RO, or NL transactions without removing them from the view_list. Related function: trx_commit_in_memory and MVCC::view_open. We believe that reuse of views can also be done for RO or NL transactions during view_open.

Please kindly explain these two issues to the relevant team. 

Thank you again.

Hi Ksql-team,

We definitely shall.