Bug #42581 Agent won't reconnect to monitored DB if started when monitored DB is down
Submitted: 4 Feb 2009 7:24
Modified: 14 Jan 2010 14:58
Reporter: Andrii Nikitin
Status: Closed
Category: MySQL Enterprise Monitor: Agent
Severity: S2 (Serious)
Version: 2.0.3.7134, 2.0.2.7131, 2.1.*
OS: Any
Assigned to: Kay Roepke
CPU Architecture: Any
Tags: a2memj, mem_20_maint, mem_discuss_me, regression, up_for_grabs, windmill

[4 Feb 2009 7:24] Andrii Nikitin
Description:
Agent won't reconnect to monitored DB if started when monitored DB is down.
It logs error:
Can't connect to MySQL server on '127.0.0.1' (0) (mysql-errno = 2003)

and sends only OS data to the Dashboard. When the DB is started later, no reconnect attempts are logged.

Restarting the agent while the DB is up fixes the problem.

(If the DB is temporarily down after the first successful connect, the agent reconnects properly later.)

2009-02-04 07:48:37: (debug) job (task 9223372036854775807 (list_known_data_items)) executed only once
2009-02-04 07:48:38: (critical) C:\cygwin\home\mysqldev\bs\merlin\agent-2.0\src\mysql-proxy-0.7.0r1190\plugins\agent\agent_mysqld.c:606: agent connecting to mysql-server failed: mysql_real_connect(host = '127.0.0.1', port = 3304, socket = ''): Can't connect to MySQL server on '127.0.0.1' (0) (mysql-errno = 2003)
2009-02-04 07:48:40: (message) C:\cygwin\home\mysqldev\bs\merlin\agent-2.0\src\mysql-proxy-0.7.0r1190\plugins\agent\network-io.c:964: found list_known_data_items ... uncorking

How to repeat:
1. Shutdown monitored DB
2. Shutdown agent
3. Start agent 
4. Wait 2 min and check host appears crossed out in Dashboard and agent log contains fresh error like "mysql_real_connect ... (0) (mysql-errno = 2003)"
5. Start monitored DB
6. Wait 10-30 min
7. Refresh Dashboard; host still appears crossed out in Dashboard <= BUG
8. Restart agent
9. Wait 1 min -> host appears properly in Dashboard

Suggested fix:
Try reconnecting to DB even with initial connect errors
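
A minimal sketch of this suggestion, not the agent's actual code: keep retrying the initial connect instead of giving up after the first failure. All names here (connect_fn, wait_fn, connect_with_retry, fake_server) are hypothetical; in a real agent the callback would presumably wrap mysql_real_connect(), and the fake server only stands in for a mysqld that comes up late.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true once the monitored server accepts a
 * connection. A real agent would wrap mysql_real_connect() here. */
typedef bool (*connect_fn)(void *ctx);

/* Hypothetical callback used to wait between attempts (e.g. a sleep). */
typedef void (*wait_fn)(unsigned int seconds);

/* Try to connect up to max_attempts times, waiting retry_secs between
 * failed attempts. Returns the 1-based number of the attempt that
 * succeeded, or 0 if every attempt failed. */
static unsigned int connect_with_retry(connect_fn try_connect, void *ctx,
                                       wait_fn wait, unsigned int retry_secs,
                                       unsigned int max_attempts)
{
    assert(try_connect != NULL);
    for (unsigned int attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_connect(ctx)) {
            return attempt;
        }
        /* No point in waiting after the final failed attempt. */
        if (wait != NULL && attempt < max_attempts) {
            wait(retry_secs);
        }
    }
    return 0;
}

/* Stand-in for a server that only becomes reachable after a few tries. */
struct fake_server {
    unsigned int up_after_calls; /* the call on which connects start to succeed */
    unsigned int calls;          /* connect attempts seen so far */
};

static bool fake_connect(void *ctx)
{
    struct fake_server *s = ctx;
    s->calls++;
    return s->calls >= s->up_after_calls;
}
```

With a fake_server whose up_after_calls is 3, connect_with_retry(fake_connect, &server, NULL, 30, 5) succeeds on the third attempt; the point is only that a failed first connect no longer ends the attempts.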
[18 Feb 2009 18:52] Diego Medina
"We think the solution is: the agent never shuts down;"

There is another use case, where this should really be fixed:

1- The complete server (the box) restarts
2- The agent starts up
3- Then mysqld starts up

But as the agent started first and its attempt to connect to mysqld failed, the agent will not try to connect to the DB any more, and it will report mysqld as down while it is up.

See http://bugs.mysql.com/bug.php?id=41634 for more info
[27 Feb 2009 13:09] Jan Kneschke
These are two bugs in one:

* agent-side: no auto-report of new items/attributes/value when the mysql-server comes up
* server-side: os-data displayed without a mysql-server reported.

It should be moved to the next release to implement properly.
[27 Feb 2009 13:11] Jan Kneschke
* server-side: os-data is _not_ displayed without the initial connect to the mysql-server
[27 Feb 2009 13:14] Jan Kneschke
We should split this bug into 2 bugs (agent/server) and close this one.
[22 Jun 2009 15:18] Jan Kneschke
The problem is that the mem-server only asks for the LKDI at startup and never again afterwards. Most DIs are known at startup and don't change.

As a mysql-server might be down at startup, and the LKDI isn't executed again later for the "unknown" items, we have to keep running the LKDI as long as the mysql-server is down and send back the LKDI result as soon as it is reachable.

The LKDI should return the KDIs of:

mysql::server
mysql::status
mysql::variables
mysql::innodbstatus
... and the other mysql::* classes of the mysql-collector.

That should trigger the LI on the mem-server side automatically and lead to a working late-discovery of the mysql-server.
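
A rough sketch of the late-discovery idea described above, under stated assumptions: the state struct, the tick function and the class table are hypothetical names, and the table holds only the classes named in this comment (the remaining mysql::* classes are left open, as above). The LKDI result is held back while the server is down and reported exactly once when it becomes reachable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Classes named in the comment; further mysql::* classes are not listed. */
static const char *const late_classes[] = {
    "mysql::server",
    "mysql::status",
    "mysql::variables",
    "mysql::innodbstatus",
};

#define LATE_CLASS_COUNT (sizeof(late_classes) / sizeof(late_classes[0]))

/* Returns the class name at index i, or NULL if i is out of range. */
static const char *late_class_name(size_t i)
{
    return i < LATE_CLASS_COUNT ? late_classes[i] : NULL;
}

/* Hypothetical state of the late-discovery task. */
struct late_lkdi {
    bool reported;      /* true once the LKDI result has been sent back */
    unsigned int polls; /* polls made before and including the report */
};

/* Called on every poll. While the server is unreachable nothing is
 * reported and polling continues. On the first poll that finds the server
 * reachable, returns the number of classes whose KDIs should be sent;
 * every later call returns 0 because the report was already made. */
static size_t late_lkdi_tick(struct late_lkdi *st, bool server_reachable)
{
    assert(st != NULL);
    if (st->reported) {
        return 0;
    }
    st->polls++;
    if (!server_reachable) {
        return 0;
    }
    st->reported = true;
    return LATE_CLASS_COUNT;
}
```

Reporting only once matters here: the mem-server is expected to start its LI calls in response, so repeating the LKDI result on every poll would presumably trigger them again.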
[22 Jun 2009 15:20] Jan Kneschke
=== modified file 'plugins/agent/network-io.c'
--- plugins/agent/network-io.c  2009-06-17 21:13:17 +0000
+++ plugins/agent/network-io.c  2009-06-22 15:20:31 +0000
@@ -871,6 +871,25 @@
                                                }
                                        } else if (0 == strcmp(job_resp->command, "resynchronize")) {
                                                g_hash_table_remove_all(tracked_uuids); /* flush the table of tracked UUIDs as we got resynced */
+                                       } else if (0 == strcmp(job_resp->command, "list_known_data_items")) {
+                                               /* check if we send mysql::server items up to the server
+                                                *
+                                                * see #42581
+                                                *
+                                                * if not, start a internal task that tries to get list_known_data_items for mysql::server 
+                                                * every 30sec. If that succeeds, send the data up and start list_known_data_items for
+                                                * 
+                                                * - mysql::status
+                                                * - mysql::variables
+                                                * - mysql::innodbstatus
+                                                * - mysql::...
+                                                *
+                                                * and send its result back too
+                                                *
+                                                * the mem-server should start the list-instances for those data-items right away
+                                                */
+
+
                                        }
                                }
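
The check sketched in the patch comment above could look roughly like this. It is an illustration only: the patch itself leaves the branch empty, and the function name and the idea that the response exposes its classes as an array of strings are assumptions. Only the class name mysql::server and the 30-second interval come from the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SERVER_CLASS "mysql::server"
#define RETRY_INTERVAL_SECS 30 /* interval named in the patch comment */

/* Returns true if none of the classes sent up to the mem-server is
 * mysql::server, i.e. the internal retry task should be started.
 * sent_classes is a hypothetical view of the list_known_data_items
 * response; NULL entries are skipped. */
static bool needs_server_discovery(const char *const *sent_classes,
                                   size_t count)
{
    assert(sent_classes != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (sent_classes[i] != NULL &&
            0 == strcmp(sent_classes[i], SERVER_CLASS)) {
            return false;
        }
    }
    return true;
}
```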
[23 Jun 2009 10:31] Jan Kneschke
After investigation: 

* the agent sends a response to LKDI(mysql::server); the items are static (server.reachable, ...)
* but it does not return a response to LI(mysql::server) as expected

We need some infrastructure work to return the instances automatically when they appear or change.
[2 Jul 2009 13:36] Jan Kneschke
revno: 1402
committer: jan@mysql.com
branch nick: trunk
timestamp: Thu 2009-07-02 15:10:56 +0200
message:
  added an internal task that checks unknown mysql::server instances
  
    * moved the network-io internal structures into network_io_state_t
    * added a _before_send() function to intercept the result of internal tasks
    * start internal list-instances(mysql::server) if no mysql::server instances
      are reported on startup 
------------------------------------------------------------
revno: 1401
committer: jan@mysql.com
branch nick: trunk
timestamp: Thu 2009-07-02 13:18:10 +0200
message:
  moved the job_task_t structure into the job_response_t to see which task the response is for
  
    * at the time the task arrives in network-io, the corresponding agent_task
      might be gone
    * the job_task is the current instance of that agent_task
    * we need it to see if list-instances() call could have included a mysql::server
      instance or not to start an internal task for it
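
To illustrate the design described in the commit message above: copying the task into the response by value means the response still knows which task it answers after the original agent_task has gone away. This is a sketch under assumptions. Only the type names job_task_t and job_response_t come from the commit message; the field layouts, the "list_instances" command string and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical layout; only the type name is taken from the commit. */
typedef struct {
    char command[32];    /* assumed command string, e.g. "list_instances" */
    char class_name[32]; /* class the task asked about */
} job_task_t;

/* The task is embedded by value, so the response does not depend on the
 * lifetime of the task it was created from. */
typedef struct {
    job_task_t task;
    int item_count; /* number of instances in the response */
} job_response_t;

/* Builds a response that carries its own copy of the task. */
static job_response_t job_response_for(const job_task_t *task, int item_count)
{
    job_response_t resp;
    assert(task != NULL);
    resp.task = *task;
    resp.item_count = item_count;
    return resp;
}

/* True if the response answers a list-instances call for mysql::server
 * that came back empty, i.e. an internal task should be started. */
static bool should_start_internal_task(const job_response_t *resp)
{
    assert(resp != NULL);
    return 0 == strcmp(resp->task.command, "list_instances") &&
           0 == strcmp(resp->task.class_name, "mysql::server") &&
           resp->item_count == 0;
}
```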
[2 Jul 2009 13:37] Jan Kneschke
oops, I should have set it to patch queued.
[6 Jul 2009 19:44] Enterprise Tools JIRA Robot
Darren Oldag writes: 
the fix appears sufficient for the single-server monitored case.
[6 Jul 2009 19:47] Enterprise Tools JIRA Robot
Darren Oldag writes: 
patch was pushed prior to review, which is "no big deal" to me.
[6 Jul 2009 22:27] Enterprise Tools JIRA Robot
Keith Russell writes: 
Patch applied in versions => 2.1.0.1074.
[7 Jul 2009 18:11] Enterprise Tools JIRA Robot
Diego Medina writes: 
Verified fixed on 2.1.0.1074
[20 Jul 2009 15:43] Tony Bedford
An entry was added to the 2.1.0 changelog:

The Agent would not reconnect to a monitored database if it was started when the monitored server was down. The agent log contained the following error:

Can't connect to MySQL server on '127.0.0.1' (0) (mysql-errno = 2003)

The agent only sent OS data to the Dashboard. Further, when the monitored server was later started, no attempts to reconnect were logged.

The problem could be worked around by restarting the agent when the monitored server was running again.
[16 Dec 2009 19:31] Enterprise Tools JIRA Robot
Keith Russell writes: 
Patch installed in versions => 2.2.0.1560.
[18 Dec 2009 18:02] Enterprise Tools JIRA Robot
Diego Medina writes: 
Agent 2.2.0.1588 still has the problem.
[12 Jan 2010 19:53] Enterprise Tools JIRA Robot
Keith Russell writes: 
Patch installed in versions => 2.2.0.1605.
[13 Jan 2010 17:04] Enterprise Tools JIRA Robot
Carsten Segieth writes: 
checked fixed in 2.2.0.1605
[14 Jan 2010 14:58] MC Brown
Entry has been added to the 2.2.0 changelog