MySQL Bugs: #74203: 5.7.5: Noisyness and no log throttling (SQL coordinator messages)

Bug #74203	5.7.5: Noisyness and no log throttling (SQL coordinator messages)
Submitted:	3 Oct 2014 6:33	Modified:	5 May 2015 12:13
Reporter:	Simon Mudd (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.7.4,5.7.5	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	coordinator, Logs, noisy, replication, SQL

Description:
Upgrading from 5.7.4 shows:

2014-10-03T06:17:39.063283Z 96 [Note] Coordinator and workers enter synchronization procedure when scheduling event relay-log: ../log/relaylog.012229 pos: 530098
2014-10-03T06:17:39.063316Z 96 [Note] Coordinator and workers enter synchronization procedure when scheduling event relay-log: ../log/relaylog.012229 pos: 530284
2014-10-03T06:17:39.065478Z 96 [Note] Coordinator synchronized with Workers, waited entries: 0, cant_sync: 0
2014-10-03T06:17:39.070484Z 96 [Note] Coordinator and workers enter synchronization procedure when scheduling event relay-log: ../log/relaylog.012229 pos: 530436
2014-10-03T06:17:39.070534Z 96 [Note] Coordinator and workers enter synchronization procedure when scheduling event relay-log: ../log/relaylog.012229 pos: 530662
...

Triggered by slave_parallel_workers = 10

5.7.4 was not noisy.

How to repeat:
See above.

Suggested fix:
I thought there was log throttling to prevent this type of thing in MySQL 5.7 as that's been sorely needed, but this particular case does not seem to be caught. (maybe only the last message is recorded so if you get several together then this does not work?)

I had to set log_warnings = 0 to avoid this noise but then I'm unaware of _any_ issues which may need my attention and that's not good. So setting log_warnings = 1 is good but if the logging can be throttled and tell me about problems but not flood me with so much noise I fill up disks or can't see the occasional message because of the noise from another one.

So please:
1. ensure there is throttling (I think that's in place?)
2. the throttling can handle more than 1 noisy message at a time
3. if possible the thresholds for triggering this throttling and the time the messages are throttled for is configurable
4. there is some indication in the logging of how long the throttling applies for (again may be done).

Correction:

I simply had not noticed the noise in 5.7.4. It was there before too.
So there's no change from 5.7.4 to 5.7.5 in the noise but it generates 40MB of logging in 7 hours (on this specific not very busy box).

So the issue here really is about throttling in a way which doesn't fill up the logs with too much noise.

related: http://bugs.mysql.com/bug.php?id=72464

I'm Verifying this.
Workaround is to set:    --log_error_verbosity=2

On recent mysql-trunk source, we have such output, which I'm not sure is even useful to print with default log-error-verbosity setting.
Maybe we need a separate "rpl_debug" global variable to control replication log levels? I'm not sure.

----
Version: '5.7.6-m16-log'  socket: ''  port: 3308  MySQL Community Server (GPL)
2014-10-03T08:15:00.078886Z 2 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2014-10-03T08:15:00.103889Z 2 [Note] Slave I/O thread for channel '': connected to master 'root@127.0.0.1:3307',replication started in log 'i7-bin.000001' at position 120
2014-10-03T08:15:00.150395Z 3 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'i7-bin.000001' at position 120, relay log '.\i7-relay-bin.000005' position: 164
2014-10-03T08:15:04.452441Z 3 [Note] Coordinator and workers enter synchronization procedure when scheduling event relay-log: .\i7-relay-bin.000006 pos: 120
2014-10-03T08:15:04.452941Z 3 [Note] Coordinator synchronized with Workers, waited entries: 0, cant_sync: 0
2014-10-03T08:15:16.614985Z 3 [Note] Coordinator and workers enter synchronization procedure when scheduling event relay-log: .\i7-relay-bin.000007 pos: 120
2014-10-03T08:15:16.615985Z 3 [Note] Coordinator synchronized with Workers, waited entries: 0, cant_sync: 0
2014-10-03T08:15:16.615985Z 3 [Note] Slave SQL thread for channel '' exiting, replication stopped in log 'i7-bin.000001' at position 120
2014-10-03T08:15:19.672374Z 3 [Note] Coordinator and workers enter synchronization procedure when scheduling event relay-log: .\i7-relay-bin.000007 pos: 164
2014-10-03T08:15:19.672874Z 3 [Note] Coordinator synchronized with Workers, waited entries: 0, cant_sync: 0
2014-10-03T08:15:19.673874Z 3 [Note] Notifying worker 9 for channel '' to exit, thd 0x20e496a0
2014-10-03T08:15:19.673874Z 13 [Note] Worker 9 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.674374Z 3 [Note] Notifying worker 8 for channel '' to exit, thd 0x20dd78c0
2014-10-03T08:15:19.674374Z 12 [Note] Worker 8 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.674374Z 13 [Note] Worker 9 statistics: events processed = 0 online time = 19555983300 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.674374Z 3 [Note] Notifying worker 7 for channel '' to exit, thd 0x20cfc750
2014-10-03T08:15:19.674374Z 11 [Note] Worker 7 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.674374Z 12 [Note] Worker 8 statistics: events processed = 0 online time = 19557483500 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.674874Z 3 [Note] Notifying worker 6 for channel '' to exit, thd 0x20cefd10
2014-10-03T08:15:19.674874Z 10 [Note] Worker 6 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.675874Z 11 [Note] Worker 7 statistics: events processed = 0 online time = 19560983900 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.676874Z 3 [Note] Notifying worker 5 for channel '' to exit, thd 0x20c14ba0
2014-10-03T08:15:19.676874Z 9 [Note] Worker 5 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.676874Z 10 [Note] Worker 6 statistics: events processed = 0 online time = 19563484200 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.677374Z 8 [Note] Worker 4 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.677374Z 3 [Note] Notifying worker 4 for channel '' to exit, thd 0x20ba3700
2014-10-03T08:15:19.677874Z 9 [Note] Worker 5 statistics: events processed = 0 online time = 19565984500 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.678874Z 8 [Note] Worker 4 statistics: events processed = 0 online time = 19568985000 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.678874Z 3 [Note] Notifying worker 3 for channel '' to exit, thd 0x20b900d0
2014-10-03T08:15:19.678874Z 7 [Note] Worker 3 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.680875Z 3 [Note] Notifying worker 2 for channel '' to exit, thd 0x20b1c370
2014-10-03T08:15:19.680875Z 6 [Note] Worker 2 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.680875Z 3 [Note] Notifying worker 1 for channel '' to exit, thd 0x20aa84f0
2014-10-03T08:15:19.680875Z 5 [Note] Worker 1 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.680875Z 7 [Note] Worker 3 statistics: events processed = 0 online time = 19572485400 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.681375Z 6 [Note] Worker 2 statistics: events processed = 0 online time = 19574985700 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.681375Z 3 [Note] Notifying worker 0 for channel '' to exit, thd 0x20a30740
2014-10-03T08:15:19.681375Z 5 [Note] Worker 1 statistics: events processed = 0 online time = 19577486000 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.681375Z 4 [Note] Worker 0 is exiting: killed 0, error 0, running_status 3
2014-10-03T08:15:19.686875Z 4 [Note] Worker 0 statistics: events processed = 0 online time = 19586987200 events exec time = 0 events read time = 0 hungry waits = 1 priv queue overfills = 0
2014-10-03T08:15:19.690876Z 3 [Note] Total MTS session statistics: events processed = 0; online time = 19592988000 worker queues filled over overrun level = 0 waited due a Worker queue full = 0 waited due the total size = 0 total wait at clock conflicts = 0 found (count) Workers occupied = 0 waited when Workers occupied = 0
2014-10-03T08:15:19.697877Z 2 [Note] Slave I/O thread killed while reading event for channel ''
2014-10-03T08:15:19.699377Z 2 [Note] Slave I/O thread exiting for channel '', read up to log 'i7-bin.000001', position 120
-------

I'm not sure but for this specific logging I'm tempted to think the "loglevel" is wrong and really there's no need to bother the DBA about this, it's information for the developers. I understand they need to see stuff sometimes but the normal logging should be useful and meaningful to DBAs.  So maybe that's the easy fix for this specific problem.

In the more general case there are various replication type messages which can happen and the warnings may be understood by the DBA (but it's good to see them up to a point) so this is where the throttling is helpful.

Having said that having control over the throttling is something we'd like to see. It has been a source of frustration, so getting this right and flexible enough so we can see messages that are important and not ones that are not is vital.  You really need to allow the DBAs to configure this and not try to pre-judge what you think the answers are as different sites are likely to have different opinions or needs.

By default being noisy is fine: it makes people read up on the potential problem and then decide whether it's an issue or not, and if needed they can then filter/throttle this noise so they will notice other issues that may come up.

The work around is good and does stop the noise. So thanks for the suggestion.

Thanks for your feedback. This has been fixed in an upcoming version and the following was noted in the 5.7.8 changelog:
Some messages that were meant to be printed when log_warnings was greater than 1 started appearing in the error log. The information regarding the multi-threaded applier in particular was too verbose and much of this information was very technical and development oriented. The fix ensures that these messages are not logged.