MySQL Bugs: #113974: improve GR error reporting

Bug #113974	improve GR error reporting
Submitted:	13 Feb 2024 9:05	Modified:	13 Feb 2024 9:12
Reporter:	Simon Mudd (OCA)
Status:	Verified
Category:	MySQL Server: Group Replication	Severity:	S4 (Feature request)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
I've been asked to provide information on how I'd like to see GR logging improved. Here is an example.

I managed to break a GR cluster while doing some work on it.

Error reporting by GR could be improved. What I saw was this:

2024-02-09T15:23:37.037339Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] TCP_NODELAY already set'
2024-02-09T15:23:37.037370Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sucessfully connected to peer <some host>. Sending a request to be added to the group'
2024-02-09T15:23:37.037386Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sending add_node request to a peer XCom node'
2024-02-09T15:23:37.087993Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sending a request to a remote XCom failed. Please check the remote node log for more details.'
2024-02-09T15:23:37.088071Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Failed to send add_node request to a peer XCom node.'

This does not seem helpful.
(1) 'Sending add_node request to a peer XCom node'

- Suggestion: Consider adding the hostname / IP address and destination port to this log line for clarity.

(2) 2024-02-09T15:23:37.087993Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Sending a request to a remote XCom failed. Please check the remote node log for more details.'

- Something failed?
- What failed?
- I would like the local log to give me some idea of what the problem is, ideally the actual error message. Having to look at log lines on a *different server* is not really acceptable.
  The GCS communication is not that well defined. It might be good to publish a spec? Either way, if the protocol has error conditions and expectations about communication behaviour, I would expect it to carry enough error detail for the receiver to understand what error, or type of error, has occurred.
- Please consider including in the log message the error information that indicates the failure, as this would considerably simplify debugging and diagnosis of issues.
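
To illustrate the kind of thing I mean for (1) and (2), here is a minimal sketch in Python. It is not how XCom works today: the reply structure, the reason codes and the function name are all hypothetical, and the host and port are placeholders. The only point is that if the peer's reply carried a reason, the joining node could log the peer address and the cause itself.

```python
from dataclasses import dataclass
from enum import Enum


class RejectReason(Enum):
    # Hypothetical reason codes; NOT part of the real XCom protocol.
    OLD_INCARNATION_PRESENT = 1
    UNKNOWN = 99


@dataclass
class AddNodeReply:
    # Hypothetical reply a peer could send back instead of a bare failure.
    accepted: bool
    reason: RejectReason = RejectReason.UNKNOWN
    detail: str = ""


def describe_add_node_failure(peer_host: str, peer_port: int,
                              reply: AddNodeReply) -> str:
    """Build a local log message naming the peer and the remote reason."""
    return (
        f"[GCS] add_node request to peer {peer_host}:{peer_port} "
        f"was rejected: {reply.reason.name}: {reply.detail}"
    )


if __name__ == "__main__":
    # Placeholder host and port, for illustration only.
    reply = AddNodeReply(
        accepted=False,
        reason=RejectReason.OLD_INCARNATION_PRESENT,
        detail="an old incarnation of this node is still in the group",
    )
    print(describe_add_node_failure("peer.example", 33061, reply))
```

With something along these lines the joining node's own log would say which peer refused the request and why, without a trip to the other server.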

Note: I did look at the other node and its log messages said:

2024-02-09T14:38:05.501885Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Old incarnation found while trying to add node <original_node> 17074894805576519. Please stop the old node or wait for it to leave the group.'
2024-02-09T14:38:06.074954Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2024-02-09T14:38:06.195180Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] xcom_client_remove_node: Try to push xcom_client_remove_node to XCom'
2024-02-09T14:38:06.198174Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Old incarnation found while trying to add node <original_node> 17074894805576519. Please stop the old node or wait for it to leave the group.'
2024-02-09T14:38:06.212631Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Updating physical connections to other servers'

This logging is also far from clear for a non-developer. On the original node I had later tried STOP GROUP_REPLICATION; START GROUP_REPLICATION; but it seems (I am guessing) that the node I was trying to restart / reconnect was in a different / inconsistent state from what the receiving node expected.

I was unable to "stop the old node" as it was the same as the "new node". So the node synchronisation seemed to get somewhat confused.
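
For now the only way to see the whole picture is to line up the logs from both nodes by hand. A rough Python sketch of that workaround follows. It assumes the error log format shown in the excerpts above (UTC timestamp, thread id, [level], [MY-code], [subsystem], message); the node labels and function names are my own invention.

```python
import re
from datetime import datetime
from typing import Dict, Iterable, Iterator, List, NamedTuple

# Matches lines shaped like the excerpts in this report:
# <timestamp> <thread> [<level>] [MY-<code>] [<subsystem>] <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>\d+)\s+\[(?P<level>[^\]]+)\]\s+"
    r"\[(?P<code>MY-\d+)\]\s+\[(?P<subsys>[^\]]+)\]\s+(?P<msg>.*)$"
)


class Entry(NamedTuple):
    when: datetime
    node: str
    level: str
    code: str
    message: str


def parse(node: str, lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per recognisable log line; skip everything else."""
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue
        when = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
        yield Entry(when, node, m["level"].upper(), m["code"], m["msg"])


def merged_timeline(logs: Dict[str, Iterable[str]]) -> List[Entry]:
    """Merge the logs of several nodes into one list ordered by timestamp."""
    entries = [e for node, lines in logs.items() for e in parse(node, lines)]
    return sorted(entries, key=lambda e: e.when)
```

Feeding it the lines from both servers, keyed by a label for each node, gives a single timeline. This only helps if the clocks agree and the log timestamps on both nodes are in UTC.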

I'm not asking for my specific issue to be resolved, simply that logging of information should be clearer.

Do NOT expect the user to have to check logs on another server. The server having issues should log enough information for the user to understand what is happening.
I hope that the current GCS protocol carries enough information (even if it is at a lower level than the higher-level SQL/GR handling) for the user to understand what is going wrong. Ideally GCS should fix itself. It is designed for HA and fault tolerance, and ideally that should be achieved with minimal user interaction.
If the user has to fix something then the reporting should clearly indicate what the problem is, and ideally "the manual/documentation" will then provide instructions on how to proceed.

How to repeat:
See above.

(Note to oneself: do not kill all threads on a GR server at the same time. GR does not like this and seems unable to recover.)

Suggested fix:
- Improve GR / GCS logging to be clear and more precise about abnormal conditions seen.
- Avoid "refer to the manual" as a generic response to errors without providing more specific links / references to specific actions to take.
- Avoid telling the user to "check the remote node log". This should not be necessary. Ensure errors returned to the node carry enough information to be reported directly to the user or handled directly by the server as appropriate.

Hello Simon,

Thank you for the feature request!

regards,
Umesh