Bug #107635 event scheduler cause error on group replication
Submitted: 22 Jun 2022 14:47 Modified: 15 Jul 2022 19:47
Reporter: lou shuai (OCA)
Status: Closed
Category:MySQL Server: Group Replication Severity:S1 (Critical)
Version:8.0.* OS:Any
Assigned to: CPU Architecture:Any
Tags: Contribution

[22 Jun 2022 14:47] lou shuai
Description:
Running the test case below causes a Group Replication error.

Assertion failure in a debug build:

```
#0  __GI_raise (sig=...) at raise.c:50
#1  0x00007fc873279535 in __GI_abort () at abort.c:79
#2  0x00007fc87327940f in __assert_fail_base (fmt=..., assertion=..., file=..., line=..., function=...) at assert.c:92
#3  0x00007fc873287102 in __GI___assert_fail (assertion=..., file=..., line=..., function=...) at assert.c:101
#4  0x000055e3f8f44bab in Disable_autocommit_guard::~Disable_autocommit_guard (this=..., __in_chrg=...) at thd_raii.h:79
#5  0x000055e3f985561f in Event_queue::recalculate_activation_times (this=..., thd=...) at event_queue.cc:390
#6  0x000055e3f985b4ab in Event_scheduler::run (this=..., thd=...) at event_scheduler.cc:568
#7  0x000055e3f985a372 in event_scheduler_thread (arg=...) at event_scheduler.cc:279
#8  0x000055e3faee89df in pfs_spawn_thread (arg=...) at pfs.cc:2899
#9  0x00007fc87341efa3 in start_thread (arg=...) at pthread_create.c:486
#10 0x00007fc87334feff in clone () at clone.S:95
```

In a release build, the server leaves the MGR group:

```
2022-06-22T14:32:18.412045Z 15 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
```

How to repeat:
SET GLOBAL EVENT_SCHEDULER = OFF; 

CREATE EVENT e1 ON SCHEDULE EVERY 1 SECOND ENDS NOW() + INTERVAL 1 SECOND DO SELECT 1;  

SET GLOBAL EVENT_SCHEDULER = ON;  

Suggested fix:
Allow the daemon event scheduler thread to be found in the THD manager,
or set the event scheduler thread's command to something other than COM_DAEMON in function
[22 Jun 2022 14:53] lou shuai
Patch to fix this bug.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-Bug-107635-MGR-Assertion-failure-in-event_scheduler_.patch (application/octet-stream, text), 3.86 KiB.

[23 Jun 2022 6:33] lou shuai
Analysis:

trans_commit_stmt returns an error in the recalculate_activation_times function.
The error in trans_commit_stmt is caused by group_replication_trans_before_commit.

The error code is passed from the applier thread in MGR:
```
#0  0x000056386186e909 in set_transaction_ctx (transaction_termination_ctx=...) at rpl_transaction_ctx.cc:107
#1  0x00007f8e5220afad in Certification_handler::handle_transaction_id (this=..., pevent=..., cont=...) at certification_handler.cc:308
#2  0x00007f8e5220a349 in Certification_handler::handle_event (this=..., pevent=..., cont=...) at certification_handler.cc:127
#3  0x00007f8e5220982a in Event_handler::next (this=..., event=..., continuation=...) at pipeline_interfaces.h:716
#4  0x00007f8e5220f20e in Event_cataloger::handle_event (this=..., pevent=..., cont=...) at event_cataloger.cc:53
#5  0x00007f8e521b7621 in Applier_module::inject_event_into_pipeline (this=..., pevent=..., cont=...) at applier.cc:258
#6  0x00007f8e521b801b in Applier_module::apply_data_packet (this=..., data_packet=..., fde_evt=..., cont=...) at applier.cc:388
#7  0x00007f8e521b8fba in Applier_module::applier_thread_handle (this=...) at applier.cc:613
#8  0x00007f8e521b692e in launch_handler_thread (arg=...) at applier.cc:50
#9  0x00005638634d39df in pfs_spawn_thread (arg=...) at pfs.cc:2899
#10 0x00007f8e8d37efa3 in start_thread (arg=...) at pthread_create.c:486
```

When set_transaction_ctx tries to find the THD, the daemon event scheduler THD is ignored, so the lookup fails and ER_NO_SUCH_THREAD is returned.

```
int set_transaction_ctx(
    Transaction_termination_ctx transaction_termination_ctx) {
  DBUG_TRACE;
  DBUG_PRINT("enter", ("thread_id=%lu, rollback_transaction=%d, "
                       "generated_gtid=%d, sidno=%d, gno=%" PRId64,
                       transaction_termination_ctx.m_thread_id,
                       transaction_termination_ctx.m_rollback_transaction,
                       transaction_termination_ctx.m_generated_gtid,
                       transaction_termination_ctx.m_sidno,
                       transaction_termination_ctx.m_gno));

  uint error = ER_NO_SUCH_THREAD;
  Find_thd_with_id find_thd_with_id(transaction_termination_ctx.m_thread_id);

  THD_ptr thd_ptr =
      Global_THD_manager::get_instance()->find_thd(&find_thd_with_id);
  if (thd_ptr) {
    error = thd_ptr->get_transaction()
                ->get_rpl_transaction_ctx()
                ->set_rpl_transaction_ctx(transaction_termination_ctx);
  }
  return error;
}

bool Find_thd_with_id::operator()(THD *thd) {
  if (thd->get_command() == COM_DAEMON) return false;
  return (thd->thread_id() == m_thread_id);
}

```
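To make the failure mode above easier to follow, here is a minimal standalone model of the lookup. It is only a sketch: `Thd`, `FindThdWithId`, `find_thd`, and `set_transaction_ctx_model` are simplified stand-ins written for illustration, not the server's real classes, and the thread ids are arbitrary.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-ins for the server types (assumptions, not MySQL source).
enum Command { COM_QUERY, COM_DAEMON };
constexpr unsigned ER_NO_SUCH_THREAD = 1094;  // value used here for illustration

struct Thd {
  std::uint64_t thread_id;
  Command command;
};

// Mirrors the predicate quoted above: daemon threads never match.
struct FindThdWithId {
  std::uint64_t wanted_id;
  bool operator()(const Thd &thd) const {
    if (thd.command == COM_DAEMON) return false;
    return thd.thread_id == wanted_id;
  }
};

// Linear search over a toy "THD manager".
const Thd *find_thd(const std::vector<Thd> &threads,
                    const FindThdWithId &pred) {
  for (const Thd &thd : threads) {
    if (pred(thd)) return &thd;
  }
  return nullptr;
}

// Same control flow as set_transaction_ctx: the error stays at
// ER_NO_SUCH_THREAD unless the lookup finds a matching thread.
unsigned set_transaction_ctx_model(const std::vector<Thd> &threads,
                                   std::uint64_t id) {
  unsigned error = ER_NO_SUCH_THREAD;
  if (find_thd(threads, FindThdWithId{id}) != nullptr) error = 0;
  return error;
}

void run_demo() {
  // Thread 1 is an ordinary session, thread 2 plays the event scheduler.
  std::vector<Thd> threads = {{1, COM_QUERY}, {2, COM_DAEMON}};
  assert(set_transaction_ctx_model(threads, 1) == 0);
  assert(set_transaction_ctx_model(threads, 2) == ER_NO_SUCH_THREAD);
}
```

In this model, the thread with the COM_DAEMON command exists in the list but is never matched, which corresponds to the ER_NO_SUCH_THREAD result described above.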
[23 Jun 2022 10:12] MySQL Verification Team
Hello lou shuai,

Thank you for the report and contribution.

regards,
Umesh
[23 Jun 2022 10:37] lou shuai
Hi Umesh,

I saw that you changed the severity to S6, but this problem does not happen only in DEBUG mode.
In release mode, the node leaves the MGR group and is set to read_only.
So I think you should change it to a higher severity.
[23 Jun 2022 11:37] MySQL Verification Team
Hello lou shuai,

Ack, changed the sev. Thank you.

Regards,
Umesh
[15 Jul 2022 19:47] Margaret Fisher
Posted by developer:
 
Changelog entry added for MySQL 8.0.31:

  After checking a transaction commit has no conflicts and is in the correct order, Group Replication reports back to the committing session. When the event scheduler thread was started, Group Replication was not able to find the committing session, resulting in the member entering ERROR state and leaving the group. The procedure to locate the committing session was extended to find daemon threads, as used to start the event scheduler thread. Thanks to Lou Shuai for the contribution.
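
One way a lookup could be "extended to find daemon threads" is an opt-in flag on the search predicate. The sketch below is an assumption for illustration only: `Session`, `SessionCommand`, `DaemonAwareFinder`, and the `daemon_allowed` parameter are hypothetical names, and this is not presented as the code that shipped in 8.0.31.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified types (not MySQL source).
enum class SessionCommand { Query, Daemon };

struct Session {
  std::uint64_t thread_id;
  SessionCommand command;
};

// Predicate that skips daemon threads by default, but can be told to
// accept them. Existing callers keep the old behavior.
class DaemonAwareFinder {
 public:
  explicit DaemonAwareFinder(std::uint64_t id, bool daemon_allowed = false)
      : id_(id), daemon_allowed_(daemon_allowed) {}

  bool operator()(const Session &s) const {
    if (s.command == SessionCommand::Daemon && !daemon_allowed_) return false;
    return s.thread_id == id_;
  }

 private:
  std::uint64_t id_;
  bool daemon_allowed_;
};

void run_daemon_lookup_demo() {
  const Session scheduler{2, SessionCommand::Daemon};
  // Default lookup still ignores the daemon thread.
  assert(!DaemonAwareFinder(2)(scheduler));
  // Opt-in lookup finds it.
  assert(DaemonAwareFinder(2, true)(scheduler));
}
```

Defaulting the flag to false keeps other callers unchanged, so only the commit-reporting path would need to opt in.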