Bug #108905 offline-mode is persisted when doing dba.rebootClusterFromCompleteOutage
Submitted: 28 Oct 2022 0:02    Modified: 8 Mar 2023 11:07
Reporter: Marcos Albe (OCA)    Status: Closed
Category: Shell AdminAPI InnoDB Cluster / ReplicaSet    Severity: S2 (Serious)
Version:    OS: Any
Assigned to:    CPU Architecture: Any

[28 Oct 2022 0:02] Marcos Albe
Description:
Hello! When restoring the cluster via dba.rebootClusterFromCompleteOutage(), the variable offline_mode gets persisted, overriding the value we had manually set in my.cnf. This action is invisible to the human operator and can lead to the mistake of starting a node assuming offline_mode will be ON, when in fact the persisted variable will be OFF and will allow traffic to the node.
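Note that persisted variables (stored in mysqld-auto.cnf) are applied after my.cnf at server startup, which is why the persisted value silently wins. One way to check where the running value came from is performance_schema.variables_info, e.g.:

-- VARIABLE_SOURCE = PERSISTED means mysqld-auto.cnf overrode my.cnf:
SELECT VARIABLE_NAME, VARIABLE_SOURCE, VARIABLE_PATH
  FROM performance_schema.variables_info
 WHERE VARIABLE_NAME = 'offline_mode';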

I set some options and make the cluster multi-primary:
MySQL  127.0.0.1:33060+ ssl  JS > c = dba.getCluster();
<Cluster:cluster1>
 
MySQL  127.0.0.1:33060+ ssl  JS > c.setOption('exitStateAction', 'OFFLINE_MODE');
Setting the value of 'exitStateAction' to 'OFFLINE_MODE' in all cluster members ...

Successfully set the value of 'exitStateAction' to 'OFFLINE_MODE' in the 'cluster1' cluster.
 
MySQL  127.0.0.1:33060+ ssl  JS > c.setOption('consistency', 'BEFORE_ON_PRIMARY_FAILOVER');
Setting the value of 'consistency' to 'BEFORE_ON_PRIMARY_FAILOVER' in all cluster members ...

Successfully set the value of 'consistency' to 'BEFORE_ON_PRIMARY_FAILOVER' in the 'cluster1' cluster.
  
MySQL  127.0.0.1:33060+ ssl  JS > c.switchToMultiPrimaryMode()
Switching cluster 'cluster1' to Multi-Primary mode...

Instance '10.124.33.46:3306' was switched from SECONDARY to PRIMARY.
Instance '10.124.33.119:3306' remains PRIMARY.
Instance '10.124.33.52:3306' was switched from SECONDARY to PRIMARY.

The cluster successfully switched to Multi-Primary mode.

Then I make sure I have offline-mode=ON in my.cnf (it's the same on every node in the cluster):
[root@marcos-albe-node1 ~]# grep offline /etc/my.cnf
offline-mode=ON

I then stop every instance and start them again; they all appear as individual OFFLINE nodes:

MySQL  127.0.0.1:33060+ ssl  SQL > select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier | 4759ab67-5540-11ed-a7a3-00163effa087 | 10.124.33.119 |        3306 | OFFLINE      |             |                | XCom                       |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+----------------------------+
1 row in set (0.0014 sec)

Then we move to bootstrap:

We verify that before bootstrapping, offline_mode is correctly ON:
MySQL  127.0.0.1:33060+ ssl  JS > \sql show global variables where Variable_name in('read_only', 'super_read_only', 'offline_mode', 'group_replication_exit_state_action');
+-------------------------------------+--------------+
| Variable_name                       | Value        |
+-------------------------------------+--------------+
| group_replication_exit_state_action | OFFLINE_MODE |
| offline_mode                        | ON           |
| read_only                           | ON           |
| super_read_only                     | ON           |
+-------------------------------------+--------------+
4 rows in set (0.0023 sec)

And we verify the variable is NOT among the persisted variables:
MySQL  127.0.0.1:33060+ ssl  SQL > \sql select * from performance_schema.persisted_variables;
+----------------------------------------------------+---------------------------------------+
| VARIABLE_NAME                                      | VARIABLE_VALUE                        |
+----------------------------------------------------+---------------------------------------+
| group_replication_consistency                      | BEFORE_ON_PRIMARY_FAILOVER            |
| auto_increment_increment                           | 7                                     |
| auto_increment_offset                              | 6                                     |
| super_read_only                                    | ON                                    |
| group_replication_enforce_update_everywhere_checks | ON                                    |
| group_replication_exit_state_action                | OFFLINE_MODE                          |
| group_replication_ssl_mode                         | REQUIRED                              |
| group_replication_group_seeds                      | 10.124.33.46:33061,10.124.33.52:33061 |
| group_replication_recovery_use_ssl                 | ON                                    |
| group_replication_member_expel_timeout             | 5                                     |
| group_replication_ip_allowlist                     | AUTOMATIC                             |
| group_replication_single_primary_mode              | OFF                                   |
| group_replication_local_address                    | 10.124.33.119:33061                   |
| group_replication_member_weight                    | 50                                    |
| group_replication_start_on_boot                    | ON                                    |
| group_replication_autorejoin_tries                 | 3                                     |
| group_replication_group_name                       | c9a67da3-5540-11ed-8be7-00163effa087  |
| group_replication_view_change_uuid                 | c9a68431-5540-11ed-8be7-00163effa087  |
+----------------------------------------------------+---------------------------------------+
18 rows in set (0.0012 sec)
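An equivalent targeted check, instead of scanning the full list:

-- Should return no rows at this point, since offline_mode is not persisted yet:
SELECT * FROM performance_schema.persisted_variables
 WHERE VARIABLE_NAME = 'offline_mode';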

We reboot the cluster:
MySQL  127.0.0.1:33060+ ssl  JS > dba.rebootClusterFromCompleteOutage()
Restoring the cluster 'cluster1' from complete outage...
...snip...
* Waiting for seed instance to become ONLINE...
10.124.33.119:3306 was restored.
Rejoining '10.124.33.52:3306' to the cluster.
Rejoining instance '10.124.33.52:3306' to cluster 'cluster1'...

The instance '10.124.33.52:3306' was successfully rejoined to the cluster.

Rejoining '10.124.33.46:3306' to the cluster.
Rejoining instance '10.124.33.46:3306' to cluster 'cluster1'...

The instance '10.124.33.46:3306' was successfully rejoined to the cluster.

The cluster was successfully rebooted.

<Cluster:cluster1>

And check persisted variables again:
MySQL  127.0.0.1:33060+ ssl  JS > \sql select * from performance_schema.persisted_variables;
+----------------------------------------------------+---------------------------------------+
| VARIABLE_NAME                                      | VARIABLE_VALUE                        |
+----------------------------------------------------+---------------------------------------+
| auto_increment_offset                              | 6                                     |
| auto_increment_increment                           | 7                                     |
| group_replication_consistency                      | BEFORE_ON_PRIMARY_FAILOVER            |
| offline_mode                                       | OFF                                   |  <---- ouch!
| super_read_only                                    | ON                                    |
...
19 rows in set (0.0005 sec)

The next time we restart, we will have offline_mode=OFF even though we expected it to be ON, because we have it set that way in my.cnf and were not aware that the variable had been persisted... It doesn't make much sense to persist it, or at least not without a seriously visible warning.
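One way to recover on affected versions (assuming the persisted entry is the only problem) is to clear it, so the my.cnf value applies again at the next restart:

-- Removes only the persisted offline_mode entry from mysqld-auto.cnf;
-- the my.cnf setting takes effect again on the next server start:
RESET PERSIST offline_mode;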

Hope it makes sense

How to repeat:
See description for steps

Suggested fix:
Don't persist the offline_mode variable, as it's important to be able to control it in emergency situations, and it's incredibly easy to miss the fact that it was persisted behind the scenes.
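For context on why this matters: offline_mode is typically used as an emergency switch. When enabled, the server disconnects and rejects ordinary clients, while users with the CONNECTION_ADMIN (or SUPER) privilege can still connect, e.g.:

-- Emergency drain: ordinary clients are disconnected/rejected,
-- while CONNECTION_ADMIN users keep access for repairs:
SET GLOBAL offline_mode = ON;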
[28 Oct 2022 0:05] Marcos Albe
Changing category from Group Replication to Shell AdminAPI, which I believe is more appropriate.
[28 Oct 2022 15:14] MySQL Verification Team
Hi Marcos,
I must be missing something here: I tried this out myself just now and it behaved as I expected. Did you run this with 8.0.31 or something much older?

I will be redoing the test, as I first did it the way I "assumed" you did rather than following your steps one by one (doing that now).

Thanks
[28 Oct 2022 21:27] Marcos Albe
Hello Verification team,

Current environment is 8.0.28, so not MUCH older, but sure, some versions behind...
I haven't tried with 8.0.31; will do so and let you know.
[28 Oct 2022 21:31] MySQL Verification Team
Thanks for the update. Please try to reproduce with 8.0.31, though I do not see any changes between .28 and .31 that could influence this behavior. I'll finish my tests on Monday; I hope I'll be able to reproduce it with 8.0.31.
[8 Nov 2022 15:24] MySQL Verification Team
I did reproduce this with .28 following the exact steps. Thanks for the report.
[15 Nov 2022 12:17] Joao Ramos
Posted by developer:
 
The "offline-mode" variable is persisted with the value OFF, when a cluster is rebooted or an instance is added / rejoined into a cluster. This change must be made so that, in case of a reboot (and the instances are rejoined), to prevent replication errors and connection failures, and in case of add / rejoin and instance, to prevent replication errors in case that instance later becomes the primary. There's also the interaction with the "exitStateAction" option, that also controls how the server behaves if it leaves the group unintentionally, and can also be set to "offline_mode".

In a scenario where all the members of a cluster were expelled and "exitStateAction" was "offline_mode", if the user executes dba.rebootClusterFromCompleteOutage(), it's expected that the cluster is put back online and the members rejoined. With this in mind, changing the current behavior of dba.rebootClusterFromCompleteOutage() could have a very significant impact and break general user expectations. Given that the scenario described isn't very common, a possible approach could be to create a new option, "keepOfflineModeSettings", in the API.

This option, if explicitly enabled, could simply skip persisting the "offline-mode" variable, so that any config stored in the user .cnf file wouldn't be ignored.
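For illustration only, a sketch of how the proposed option might be used; "keepOfflineModeSettings" is only the suggestion above and not part of any released AdminAPI:

// Hypothetical: reboot the cluster but skip persisting offline_mode,
// leaving the my.cnf configuration in control:
var cluster = dba.rebootClusterFromCompleteOutage('cluster1', {keepOfflineModeSettings: true});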
[9 Feb 2023 16:51] Aaditya Dubey
Hi Team,

Any clue when this can be fixed?
[8 Mar 2023 11:07] Edward Gilmore
Posted by developer:
 
Added the following note to the MySQL Shell 8.0.33 release notes:

MySQL Shell disabled and persisted offline_mode when an instance was added to or rejoined a Cluster, or when the Cluster was rebooted. If this variable was enabled explicitly by the user, it was overwritten by MySQL Shell. As of this release, offline_mode is disabled globally, not persisted, and a new warning is added to inform the user of the risks of enabling this variable.
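To illustrate the difference in standard MySQL terms:

-- Pre-8.0.33 Shell behavior: the value is also written to mysqld-auto.cnf,
-- so it survives restarts and silently overrides my.cnf:
SET PERSIST offline_mode = OFF;

-- 8.0.33+ Shell behavior: only the running value changes; whatever is in
-- my.cnf applies again at the next server start:
SET GLOBAL offline_mode = OFF;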