MySQL Bugs: #42581: Agent won't reconnect to monitored DB if started when monitored DB is down

Bug #42581	Agent won't reconnect to monitored DB if started when monitored DB is down
Submitted:	4 Feb 2009 7:24	Modified:	14 Jan 2010 14:58
Reporter:	Andrii Nikitin
Status:	Closed
Category:	MySQL Enterprise Monitor: Agent	Severity:	S2 (Serious)
Version:	2.0.3.7134, 2.0.2.7131,2.1.*	OS:	Any
Assigned to:	Kay Roepke	CPU Architecture:	Any
Tags:	a2memj, mem_20_maint, mem_discuss_me, regression, up_for_grabs, windmill

Description:
The agent won't reconnect to the monitored DB if it is started while the monitored DB is down.
It logs the error:
Can't connect to MySQL server on '127.0.0.1' (0) (mysql-errno = 2003)

and sends only OS data to the dashboard. When the DB is started later, no reconnect attempts are logged.

Restarting the agent while the DB is up fixes the problem.

(If the DB goes down temporarily after a first successful connect, the agent reconnects properly later.)

2009-02-04 07:48:37: (debug) job (task 9223372036854775807 (list_known_data_items)) executed only once
2009-02-04 07:48:38: (critical) C:\cygwin\home\mysqldev\bs\merlin\agent-2.0\src\mysql-proxy-0.7.0r1190\plugins\agent\agent_mysqld.c:606: agent connecting to mysql-server failed: mysql_real_connect(host = '127.0.0.1', port = 3304, socket = ''): Can't connect to MySQL server on '127.0.0.1' (0) (mysql-errno = 2003)
2009-02-04 07:48:40: (message) C:\cygwin\home\mysqldev\bs\merlin\agent-2.0\src\mysql-proxy-0.7.0r1190\plugins\agent\network-io.c:964: found list_known_data_items ... uncorking

How to repeat:
1. Shut down the monitored DB
2. Shut down the agent
3. Start the agent
4. Wait 2 min and check that the host appears crossed out in the Dashboard and that the agent log contains a fresh error like "mysql_real_connect ... (0) (mysql-errno = 2003)"
5. Start the monitored DB
6. Wait 10-30 min
7. Refresh the Dashboard; the host still appears crossed out <= BUG
8. Restart the agent
9. Wait 1 min -> the host appears properly in the Dashboard

Suggested fix:
Keep trying to reconnect to the DB even if the initial connect fails.
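
A minimal sketch of the suggested behaviour, not the agent's actual code: a periodic tick keeps attempting the connect until one attempt succeeds, so a failed first attempt does not end the retries. All names here (monitor_conn_t, monitor_conn_tick, connect_fn, fake_server_t, fake_connect) are hypothetical; in a real agent the callback would presumably wrap mysql_real_connect().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if one connection attempt succeeded. */
typedef bool (*connect_fn)(void *user_data);

/* Hypothetical per-server connection state kept by the agent. */
typedef struct {
    bool connected;     /* true once any attempt has succeeded */
    unsigned attempts;  /* number of attempts made so far */
} monitor_conn_t;

/* Meant to be called from a periodic timer. A failed attempt, including
 * the very first one, leaves the state ready for another attempt on the
 * next tick. Once connected, no further attempts are made. */
bool monitor_conn_tick(monitor_conn_t *conn, connect_fn try_connect,
                       void *user_data)
{
    assert(conn != NULL && try_connect != NULL);
    if (conn->connected) {
        return true;
    }
    conn->attempts++;
    conn->connected = try_connect(user_data);
    return conn->connected;
}

/* Test double standing in for the monitored server: it is "down" for the
 * first `up_after` attempts and "up" afterwards. */
typedef struct {
    unsigned calls;
    unsigned up_after;
} fake_server_t;

bool fake_connect(void *user_data)
{
    fake_server_t *srv = user_data;
    srv->calls++;
    return srv->calls > srv->up_after;
}
```

With a fake_server_t whose up_after is 2, the first two ticks fail and the third succeeds; later ticks return immediately without touching the server.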

"We think the solution is: the agent never shuts down;"

There is another use case, where this should really be fixed:

1- The complete server (the box) restarts
2- The agent starts up
3- Then mysqld starts up
Because the agent started first and its attempt to connect to mysqld failed, the agent will not try to connect to the DB again, and it will report mysqld as down while it is up.

See http://bugs.mysql.com/bug.php?id=41634 for more info

These are two bugs in one:

* agent-side: no auto-report of new items/attributes/value when the mysql-server comes up
* server-side: os-data displayed without a mysql-server reported.

It should be moved to the next release to implement properly.

* server-side: os-data is _not_ displayed without the initial connect to the mysql-server

We should split this bug into 2 bugs (agent/server) and close this one.

The problem is that the mem-server asks for the LKDI (list_known_data_items) only at startup and never again afterwards. Most DIs (data items) are known at startup and don't change.

Because a mysql-server might be down at startup, and the LKDI isn't executed again later for the "unknown" items, the agent has to keep running the LKDI as long as the mysql-server is down and send back the LKDI result as soon as the server is reachable.

The LKDI should return the KDIs (known data items) of:

mysql::server
mysql::status
mysql::variables
mysql::innodbstatus
... and the other mysql::* classes of the mysql-collector.

That should trigger the LI (list-instances) on the mem-server side automatically and lead to a working late discovery of the mysql-server.
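
The plan above can be sketched roughly as follows. This is an illustration under assumptions, not the agent's implementation: the type and function names (late_discovery_t, late_discovery_poll, report_fn, count_report) are invented for the example, and only the four classes named above are listed because the remaining mysql::* classes are not spelled out in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Classes named in the comment above; the other mysql::* classes are
 * elided there and are deliberately not guessed here. */
static const char *const mysql_classes[] = {
    "mysql::server",
    "mysql::status",
    "mysql::variables",
    "mysql::innodbstatus",
};
#define N_MYSQL_CLASSES (sizeof(mysql_classes) / sizeof(mysql_classes[0]))

/* Hypothetical callback that sends the known data items of one class
 * up to the mem-server. */
typedef void (*report_fn)(const char *class_name, void *user_data);

/* Hypothetical state of the late-discovery task. */
typedef struct {
    bool reported;   /* true once the classes have been sent */
    unsigned polls;  /* polls performed while still unreported */
} late_discovery_t;

/* One polling step. While the mysql-server is unreachable nothing is
 * sent and the task stays armed. On the first poll that finds the
 * server reachable, every class is reported exactly once.
 * Returns the number of classes reported by this call. */
size_t late_discovery_poll(late_discovery_t *st, bool server_reachable,
                           report_fn report, void *user_data)
{
    size_t i;

    assert(st != NULL && report != NULL);
    if (st->reported) {
        return 0;
    }
    st->polls++;
    if (!server_reachable) {
        return 0;
    }
    for (i = 0; i < N_MYSQL_CLASSES; i++) {
        report(mysql_classes[i], user_data);
    }
    st->reported = true;
    return N_MYSQL_CLASSES;
}

/* Example callback: counts how many classes were reported. */
void count_report(const char *class_name, void *user_data)
{
    (void)class_name;
    (*(size_t *)user_data)++;
}
```

The caller decides how often to poll and how to probe reachability; the sketch only captures the "keep asking until reachable, then report once" part.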

=== modified file 'plugins/agent/network-io.c'
--- plugins/agent/network-io.c  2009-06-17 21:13:17 +0000
+++ plugins/agent/network-io.c  2009-06-22 15:20:31 +0000
@@ -871,6 +871,25 @@
                                                }
                                        } else if (0 == strcmp(job_resp->command, "resynchronize")) {
                                                g_hash_table_remove_all(tracked_uuids); /* flush the table of tracked UUIDs as we got resynced */
+                                       } else if (0 == strcmp(job_resp->command, "list_known_data_items")) {
+                                               /* check if we send mysql::server items up to the server
+                                                *
+                                                * see #42581
+                                                *
+                                                * if not, start a internal task that tries to get list_known_data_items for mysql::server 
+                                                * every 30sec. If that succeeds, send the data up and start list_known_data_items for
+                                                * 
+                                                * - mysql::status
+                                                * - mysql::variables
+                                                * - mysql::innodbstatus
+                                                * - mysql::...
+                                                *
+                                                * and send its result back too
+                                                *
+                                                * the mem-server should start the list-instances for those data-items right away
+                                                */
+
+
                                        }
                                }

After investigation: 

* the agent does send a response to LKDI(mysql::server); those items are static (server.reachable, ...)
* but it does not return anything for LI(mysql::server), as would be expected

We need some infrastructure work to return the instances automatically when they appear or change.

revno: 1402
committer: jan@mysql.com
branch nick: trunk
timestamp: Thu 2009-07-02 15:10:56 +0200
message:
  added a internal task that checks unknown mysql::server instances 
  
    * moved the network-io internal structures into network_io_state_t
    * added a _before_send() function to intercep the result of internal tasks
    * start internal list-instances(mysql::server) if no mysql::server instances
      are reported on startup 
------------------------------------------------------------
revno: 1401
committer: jan@mysql.com
branch nick: trunk
timestamp: Thu 2009-07-02 13:18:10 +0200
message:
  moved the job_task_t structure into the job_response_t to see which task the response is for
  
    * at the time the task arrives in network-io, the corresponding agent_task
      might be gone
    * the job_task is the current instance of that agent_task
    * we need it to see if list-instances() call could have included a mysql::server
      instance or not to start a internal task for it
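
The decision described in the two commit messages above might look roughly like the following. This is only a sketch under assumptions: the real layouts of job_task_t and job_response_t are not shown in this report, so the structs below (sketch_task_t, sketch_response_t) and the command string "list_instances" are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the task a response belongs to. */
typedef struct {
    const char *command;     /* assumed command name, e.g. "list_instances" */
    const char *class_name;  /* class the task asked about */
} sketch_task_t;

/* Hypothetical stand-in for a response that carries its own copy of the
 * task, so the check still works after the original task is gone. */
typedef struct {
    sketch_task_t task;
    size_t n_instances;      /* instances returned for task.class_name */
} sketch_response_t;

/* Returns true when the response answers a list-instances call for
 * mysql::server and contains no instance, which is the situation in
 * which an internal retry task would be started. */
bool needs_internal_probe(const sketch_response_t *resp)
{
    assert(resp != NULL);
    if (resp->task.command == NULL || resp->task.class_name == NULL) {
        return false;
    }
    if (0 != strcmp(resp->task.command, "list_instances")) {
        return false;
    }
    if (0 != strcmp(resp->task.class_name, "mysql::server")) {
        return false;
    }
    return resp->n_instances == 0;
}
```

Embedding the task in the response by value, rather than pointing back at it, mirrors the stated reason for the change: by the time the response reaches network-io, the original task may no longer exist.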

oops, I should have set it to patch queued.

Darren Oldag writes: 
the fix appears sufficient for the single-server monitored case.

Darren Oldag writes: 
patch was pushed prior to review, which is "no big deal" to me.

Keith Russell writes: 
Patch applied in versions => 2.1.0.1074.

Diego Medina writes: 
Verified fixed on 2.1.0.1074

An entry was added to the 2.1.0 changelog:

The Agent would not reconnect to a monitored database if it was started when the monitored server was down. The agent log contained the following error:

Can't connect to MySQL server on '127.0.0.1' (0) (mysql-errno = 2003)

The agent only sent OS data to the Dashboard. Further, when the monitored server was later started, no attempts to reconnect were logged.

The problem could be worked around by restarting the agent when the monitored server was running again.

Keith Russell writes: 
Patch installed in versions => 2.2.0.1560.

Diego Medina writes: 
Agent 2.2.0.1588 still has the problem.

Keith Russell writes: 
Patch installed in versions => 2.2.0.1605.

Carsten Segieth writes: 
checked fixed in 2.2.0.1605

Entry has been added to the 2.2.0 changelog