Bug #112619 STOP GROUP_REPLICATION on primary during ALTER leads to a confusing message.
Submitted: 4 Oct 2023 15:44 Modified: 6 Oct 2023 9:45
Reporter: Jean-François Gagné Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S3 (Non-critical)
Version:8.0.34 OS:Any
Assigned to: CPU Architecture:Any

[4 Oct 2023 15:44] Jean-François Gagné
Description:
Hi,

When running STOP GROUP_REPLICATION on the primary node of a Group while an ALTER TABLE is running, I get the below error message, see How to repeat for details.

ERROR 3100 (HY000) at line 1: Error on observer while running replication hook 'before_commit'.

This message is confusing at best, and unexpected at worst.

It is confusing because it is not telling me what fails or anything allowing me to take action.

It is unexpected as I would expect STOP GROUP_REPLICATION to block until after the ALTER completes.  This blocking behavior is what happens if I change the primary during an ALTER, see How to repeat for this behavior.

Or maybe it should not be possible to run STOP GROUP_REPLICATION on a primary: a primary change needing to be done beforehand.  The final solution might need some refinement.

Many thanks for looking into this,

Jean-François Gagné

How to repeat:
# Create a 8.0.34 sandbox.
dbdeployer deploy --topology=group replication mysql_8.0.34 --single-primary

# Create a schema and table.
./n1 <<< "
  create database test_jfg;
  create table test_jfg.t(id bigint not null auto_increment primary key)
"

# Fill the table.
# n chosen for having a table that will take a few seconds to ALTER.
# "paste | sed" to insert many rows per trx for speed.
n=131072
time yes "(null)" | head -n $n |
  paste -s -d "$(printf ',%.0s' {1..1000})\n" |
  sed -e 's/.*/INSERT INTO t (id) values &;/' |
  ./n1 test_jfg

# Grow the rows for having ALTER FORCE takes some time.
time { echo "ALTER TABLE t ADD COLUMN c0 CHAR(240) DEFAULT ''"
       seq -f " ADD COLUMN c%.0f CHAR(240) DEFAULT ''" 1 3
} | paste -s -d "," | ./n1 test_jfg

# Actually grow the table (because instant ADD COLUMN).
time ./n1 test_jfg <<< "ALTER TABLE t FORCE"

##################################################
# Output of above, comments added for readability.

# Inserts.
real    0m5.784s
user    0m0.039s
sys     0m0.066s

# ALTER ADD.
real    0m0.048s
user    0m0.010s
sys     0m0.009s

# ALTER FORCE.
real    0m15.905s
user    0m0.013s
sys     0m0.000s

################################################
# STOP GROUP_REPLICATION while ALTER is running.
# With the SELECT now(), we can see that the ALTER blocks STOP GR.
# We can also infer, from the date after the ALTER,
#   that the error is reported at the end of the ALTER.
./n1 -N <<< "
    do sleep(1);
    STOP GROUP_REPLICATION;
    select 'Completed stop GR:', now()" & \
  date; ./n1 test_jfg <<< "ALTER TABLE t FORCE"; date; \
  wait

[1] 4169
Tue Oct  3 16:49:22 UTC 2023
ERROR 3100 (HY000) at line 1: Error on observer while running replication hook 'before_commit'.
Completed stop GR:      2023-10-03 16:49:31
Tue Oct  3 16:49:31 UTC 2023
[1]+  Done                    ./n1 -N <<< "
    do sleep(1);
    STOP GROUP_REPLICATION;
    select 'Completed stop GR:', now()"

##########################################################
# group_replication_set_as_primary while ALTER is running.
# This needs to be run with n1 a the primary,
#   the best way is to re-create the sandbox,
#   and run below after "Actually grow the table".
# We see that when changing the primary, the ALTER completes first,
#   and then we have a delay before completing the primary change.
#   (The delay is probably for replicating the ALTER before promoting n2)
./n2 -N <<< "
    do sleep(1);
    select group_replication_set_as_primary(@@server_uuid);
    select 'Completed set primary at:', now()" & \
  date; ./n1 test_jfg <<< "ALTER TABLE t FORCE"; date; \
  wait

[1] 10276
Tue Oct  3 16:56:48 UTC 2023
Tue Oct  3 16:56:58 UTC 2023
Primary server switched to: 00024436-2222-2222-2222-222222222222
Completed set primary at:       2023-10-03 16:57:10
[1]+  Done                    ./n2 -N <<< "
    do sleep(1);
    select group_replication_set_as_primary(@@server_uuid);
    select 'Completed set primary at:', now()"
[6 Oct 2023 9:45] MySQL Verification Team
Hello Jean-François,

Thank you for the report and feedback.
Verified as described.

regards,
Umesh