Bug #4641 | MySQL Cluster API loses data during node restarts - NOT fully connected | ||
---|---|---|---|
Submitted: | 20 Jul 2004 3:42 | Modified: | 23 Jul 2004 10:04 |
Reporter: | Jim Hoadley | Status: | Closed |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 4.1.3-beta-nightly-20040628 | OS: | Linux (Red Hat Linux 3.2.3-26) |
Assigned to: | Magnus Blåudd | CPU Architecture: | Any |
[20 Jul 2004 3:42]
Jim Hoadley
[22 Jul 2004 11:16]
Magnus Blåudd
The problem occurs because the DB nodes are NOT fully connected to all API nodes. Each DB node needs to set up one TCP connection to each API node (this is done automatically, but it needs some basic information to set it up correctly). Since you have written "localhost" as HostName on COMPUTER 4-7, the DB nodes on BOX2 and BOX3 will only wait for connections from localhost, and thus the DB node on BOX2 will not connect to any API node other than a local one.

    [COMPUTER]
    Id: 4
    ByteOrder: Little
    HostName: localhost

    [COMPUTER]
    Id: 5
    ByteOrder: Little
    HostName: localhost

    [COMPUTER]
    Id: 6
    ByteOrder: Little
    HostName: localhost

    [COMPUTER]
    Id: 7
    ByteOrder: Little
    HostName: localhost

Either you can change the hostname on half of the COMPUTERS from localhost to BOX2 and BOX3 respectively. Or I actually think you can remove the COMPUTERS section completely, since you should only define 3 COMPUTERS if you have three boxes, but please try the other solution first.

    [COMPUTER]
    Id: 4
    ByteOrder: Little
    HostName: BOX2

    [COMPUTER]
    Id: 5
    ByteOrder: Little
    HostName: BOX3

    [COMPUTER]
    Id: 6
    ByteOrder: Little
    HostName: BOX3

    [COMPUTER]
    Id: 7
    ByteOrder: Little
    HostName: BOX3

To make sure each DB node has its connections properly set up, you can add the configuration parameter LogLevelConnection to the DB sections of config.ini. The DB node will then print information about its connections.

    [DB DEFAULT]
    LogLevelConnection: 15

Or, if you have compiled in debug mode, you can also use the ndb_mgm command

    ALL DUMP 2600

That will make all NDB nodes print information about active connections, like this:

    2004-07-22 11:13:55 [NDB] INFO -- Connection to 1 (MGM) is connected
    2004-07-22 11:13:55 [NDB] INFO -- Connection to 2 (DB) is connected
    2004-07-22 11:13:55 [NDB] INFO -- Connection to 3 (DB) does nothing
    2004-07-22 11:13:55 [NDB] INFO -- Connection to 4 (API) is trying to connect
    2004-07-22 11:13:55 [NDB] INFO -- Connection to 5 (API) is trying to connect
    2004-07-22 11:13:55 [NDB] INFO -- Connection to 6 (API) is trying to connect
    2004-07-22 11:13:55 [NDB] INFO -- Connection to 7 (API) is trying to connect
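The misconfiguration described in this comment (nodes whose COMPUTER has HostName "localhost") can be spotted mechanically. What follows is only a minimal sketch in Python, not a MySQL tool: the function names `parse_sections` and `nodes_on_localhost` are made up for illustration, and the parser assumes nothing beyond the `[SECTION]` / `Key: Value` layout quoted in this report.

```python
def parse_sections(text):
    """Parse the old-style config text into a list of (section_name, {key: value}).

    Sections such as [COMPUTER] repeat, so a list is used instead of a dict.
    """
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = (line[1:-1].strip(), {})
            sections.append(current)
        elif ":" in line and current is not None:
            key, value = line.split(":", 1)
            current[1][key.strip()] = value.strip()
    return sections


def nodes_on_localhost(text):
    """Return (node_type, node_id) for nodes whose COMPUTER has HostName localhost."""
    sections = parse_sections(text)
    hosts = {
        values.get("Id"): values.get("HostName", "")
        for name, values in sections
        if name == "COMPUTER"
    }
    flagged = []
    for name, values in sections:
        # "[DB DEFAULT]" and "[TCP DEFAULT]" are not node definitions and are skipped.
        if name in ("API", "DB", "MGM"):
            if hosts.get(values.get("ExecuteOnComputer")) == "localhost":
                flagged.append((name, values.get("Id")))
    return flagged


# Hypothetical, shortened config in the same layout as the one in this report.
SAMPLE = """
[COMPUTER]
Id: 2
HostName: BOX2

[COMPUTER]
Id: 4
HostName: localhost

[DB]
Id: 2
ExecuteOnComputer: 2

[API]
Id: 11
ExecuteOnComputer: 4
"""

print(nodes_on_localhost(SAMPLE))
```

On a multi-box cluster, any node this reports is a candidate for the hostname change suggested above. On a single-box test setup, "localhost" may be perfectly fine.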
[22 Jul 2004 21:39]
Jim Hoadley
Magnus -- Fixed! :)

Thank you for your help. I am quite pleased. Your first suggestion worked. Details below.

Maybe removing the definitions for API13 and API14 would've fixed it as well. I've got 4 APIs defined but only API11 and API12 are used. Your second suggestion was to remove the definitions for COMPUTER6 and COMPUTER7, but that probably wouldn't work unless you also remove the definitions for API13 and API14.

    [API]
    Id: 13
    ExecuteOnComputer: 6

    [API]
    Id: 14
    ExecuteOnComputer: 7

I added "LogLevelConnection: 15" as suggested, then restarted all nodes without making any other changes to config.ini. (Note: the clock is off between BOX2 and BOX3; will fix.)

Here is the output on "BOX3":

    [root@BOX3 3.ndb_db]#
    2004-07-22 09:29:42 [NDB] INFO -- Angel pid: 1992 ndb pid: 1993
    2004-07-22 09:29:42 [NDB] INFO -- NDB Cluster -- DB node 3
    2004-07-22 09:29:42 [NDB] INFO -- Version 3.5.0 (beta) --
    2004-07-22 09:29:42 [NDB] INFO -- Start initiated (version 3.5.0)
    2004-07-22 09:29:44 [NDB] INFO -- Communication to Node 2 opened
    2004-07-22 09:29:45 [NDB] INFO -- Node 1 Connected
    2004-07-22 09:30:11 [NDB] INFO -- Node 2 Connected
    2004-07-22 09:30:13 [NDB] INFO -- Node 2: API version 3.5.0
    2004-07-22 09:30:19 [NDB] INFO -- Communication to Node 11 opened
    2004-07-22 09:30:19 [NDB] INFO -- Communication to Node 12 opened
    2004-07-22 09:30:19 [NDB] INFO -- Communication to Node 13 opened
    2004-07-22 09:30:19 [NDB] INFO -- Communication to Node 14 opened
    2004-07-22 09:30:19 [NDB] INFO -- Communication to Node 0 opened
    2004-07-22 09:30:19 [NDB] INFO -- Started (version 3.5.0)
    2004-07-22 09:30:19 [NDB] INFO -- Node 1: API version 3.5.0

Here is the output on "BOX2":

    [root@BOX2 2.ndb_db]#
    2004-07-22 09:25:41 [NDB] INFO -- Angel pid: 2406 ndb pid: 2407
    2004-07-22 09:25:41 [NDB] INFO -- NDB Cluster -- DB node 2
    2004-07-22 09:25:41 [NDB] INFO -- Version 3.5.0 (beta) --
    2004-07-22 09:25:41 [NDB] INFO -- Start initiated (version 3.5.0)
    2004-07-22 09:25:43 [NDB] INFO -- Communication to Node 3 opened
    2004-07-22 09:25:44 [NDB] INFO -- Node 1 Connected
    2004-07-22 09:25:44 [NDB] INFO -- Node 3 Connected
    2004-07-22 09:25:45 [NDB] INFO -- Node 3: API version 3.5.0
    2004-07-22 09:25:51 [NDB] INFO -- Communication to Node 11 opened
    2004-07-22 09:25:51 [NDB] INFO -- Communication to Node 12 opened
    2004-07-22 09:25:51 [NDB] INFO -- Communication to Node 13 opened
    2004-07-22 09:25:51 [NDB] INFO -- Communication to Node 14 opened
    2004-07-22 09:25:51 [NDB] INFO -- Communication to Node 0 opened
    2004-07-22 09:25:51 [NDB] INFO -- Started (version 3.5.0)
    2004-07-22 09:25:52 [NDB] INFO -- Node 1: API version 3.5.0

Here is the NDB "show" output:

    NDB> show
    Cluster Configuration
    ---------------------
    2 NDB Node(s)
    DB node: 2 (Version: 3.5.0)
    DB node: 3 (Version: 3.5.0)

    4 API Node(s)
    API node: 11 (not connected)
    API node: 12 (not connected)
    API node: 13 (not connected)
    API node: 14 (not connected)

    1 MGM Node(s)
    MGM node: 1 (Version: 3.5.0)

Then I edited config.ini to incorporate your first suggestion:

    [COMPUTER]
    Id: 4
    ByteOrder: Little
    HostName: BOX3

    [COMPUTER]
    Id: 5
    ByteOrder: Little
    HostName: BOX2

    [COMPUTER]
    Id: 6
    ByteOrder: Little
    HostName: BOX2

    [COMPUTER]
    Id: 7
    ByteOrder: Little
    HostName: BOX3

Then I restarted ndb_mgmd and the ndbd processes on BOX2 and BOX3.
Output on "BOX3":

    [root@BOX3 3.ndb_db]# ndbd &
    [1] 2092
    [root@BOX3 3.ndb_db]#
    2004-07-22 09:50:12 [NDB] INFO -- Angel pid: 2092 ndb pid: 2093
    2004-07-22 09:50:12 [NDB] INFO -- NDB Cluster -- DB node 3
    2004-07-22 09:50:12 [NDB] INFO -- Version 3.5.0 (beta) --
    2004-07-22 09:50:12 [NDB] INFO -- Start initiated (version 3.5.0)
    2004-07-22 09:50:13 [NDB] INFO -- Communication to Node 2 opened
    2004-07-22 09:50:13 [NDB] INFO -- Node 1 Connected
    2004-07-22 09:50:14 [NDB] INFO -- Node 2 Connected
    2004-07-22 09:50:14 [NDB] INFO -- Node 2: API version 3.5.0
    NR: setLcpActiveStatusEnd - m_participatingLQH
    2004-07-22 09:50:15 [NDB] INFO -- Communication to Node 11 opened
    2004-07-22 09:50:15 [NDB] INFO -- Communication to Node 12 opened
    2004-07-22 09:50:15 [NDB] INFO -- Communication to Node 13 opened
    2004-07-22 09:50:15 [NDB] INFO -- Communication to Node 14 opened
    2004-07-22 09:50:15 [NDB] INFO -- Communication to Node 0 opened
    2004-07-22 09:50:15 [NDB] INFO -- Started (version 3.5.0)
    2004-07-22 09:50:16 [NDB] INFO -- Node 1: API version 3.5.0

Output on "BOX2":

    [root@BOX2 2.ndb_db]#
    2004-07-22 09:44:57 [NDB] INFO -- Angel pid: 2467 ndb pid: 2468
    2004-07-22 09:44:57 [NDB] INFO -- NDB Cluster -- DB node 2
    2004-07-22 09:44:57 [NDB] INFO -- Version 3.5.0 (beta) --
    2004-07-22 09:44:57 [NDB] INFO -- Start initiated (version 3.5.0)
    2004-07-22 09:44:59 [NDB] INFO -- Communication to Node 3 opened
    2004-07-22 09:44:59 [NDB] INFO -- Node 1 Connected
    2004-07-22 09:44:59 [NDB] INFO -- Node 3 Connected
    2004-07-22 09:44:59 [NDB] INFO -- Node 3: API version 3.5.0
    NR: setLcpActiveStatusEnd - m_participatingLQH
    2004-07-22 09:45:01 [NDB] INFO -- Communication to Node 11 opened
    2004-07-22 09:45:01 [NDB] INFO -- Communication to Node 12 opened
    2004-07-22 09:45:01 [NDB] INFO -- Communication to Node 13 opened
    2004-07-22 09:45:01 [NDB] INFO -- Communication to Node 14 opened
    2004-07-22 09:45:01 [NDB] INFO -- Communication to Node 0 opened
    2004-07-22 09:45:01 [NDB] INFO -- Started (version 3.5.0)
    2004-07-22 09:45:02 [NDB] INFO -- Node 1: API version 3.5.0

Then started API on "BOX3":

    [root@BOX3 3.ndb_db]# export NDB_CONNECTSTRING="host=BOX1:2200;nodeid=11"
    [root@BOX3 3.ndb_db]# mysqld_safe --ndbcluster --default-storage-engine=ndbcluster &
    [2] 2123
    [root@BOX3 3.ndb_db]# Starting mysqld daemon with databases from /usr/local/mysql/var
    2004-07-22 10:10:28 [NDB] INFO -- Node 11 Connected
    2004-07-22 10:10:28 [NDB] INFO -- Node 11: API version 3.5.0

Then started API on "BOX2":

    [root@BOX2 2.ndb_db]# mysqld_safe --ndbcluster --default-storage-engine=ndbcluster &
    [2] 2499
    [root@BOX2 2.ndb_db]# Starting mysqld daemon with databases from /usr/local/mysql/var
    2004-07-22 10:08:36 [NDB] INFO -- Node 12 Connected
    2004-07-22 10:08:36 [NDB] INFO -- Node 12: API version 3.5.0

Then I ran a script to SELECT records from BOX2 every second. Then I shut down node2 on BOX2 and MySQL didn't stop serving records from the cluster.

THANK YOU VERY MUCH!

-- Jim

---------- Forwarded message ----------
Date: Thu, 22 Jul 2004 08:46:08 -0700 (PDT)
From: Jim Hoadley <j_hoadley@yahoo.com>
To: jhoadley@dealerfusion.com
Subject: Bug #4641 [Opn]: MySQL Cluster API loses data during node restarts

--- Bug Database <dev-bugs@mysql.com> wrote:

> Date: 22 Jul 2004 09:16:02 -0000
> To: j_hoadley@yahoo.com
> Subject: Bug #4641 [Opn]: MySQL Cluster API loses data during node restarts
> From: Bug Database <dev-bugs@mysql.com>
>
> ATTENTION! Do NOT reply to this email!
> To reply, use the web interface found at
> http://bugs.mysql.com/?id=4641&edit=2
>
> ID: 4641
> Updated by: Magnus Svensson
> Reported by: Jim Hoadley
> User Type: User
> Status: Open
> Priority: Medium
> Severity: Serious
> Category: MySQL Cluster
> Operating System: Red Hat Linux 3.2.3-26
> Version: 4.1.3-beta-nightly-20040628
> -Assigned To:
> +Assigned To: msvensson@mysql.com
>
> Previous Comments:
> ------------------------------------------------------------------------
>
> [2004-07-19 18:42:24] Jim Hoadley
>
> Description:
> What I did:
>
> See "How to repeat" and the "API loses data during node restarts"
> thread in the cluster section of lists.mysql.com.
>
> What I want to happen:
>
> I want to run a 2-node MySQL Cluster (on 2 computers with 2 APIs) and
> have either API serve up queries uninterrupted when either DB node is
> taken offline.
>
> What actually happened:
>
> When I stop and start a database node on the same computer that the API
> is running on, mysqld reports "ERROR 1015: Can't lock file (errno:
> 4009)". When the node is started again the API begins answering queries
> again.
>
> I consistently see this behavior on one node but never on the other,
> and the MySQL server API always resumes serving cluster data once the
> database node starts back up.
>
> How to repeat:
> 1. Set up a 2-node MySQL Cluster, on 2 Linux boxes, with an API on each
> node. Put the MGM on either node or, optionally, on a third node.
> Use this MGM config.ini file:
> ---
> [DB DEFAULT]
> NoOfReplicas: 2
> MaxNoOfConcurrentOperations: 10000
> DataMemory: 40M
> IndexMemory: 12M
> Discless: 0
>
> [COMPUTER]
> Id: 1
> ByteOrder: Little
> HostName: BOX3
>
> [COMPUTER]
> Id: 2
> ByteOrder: Little
> HostName: BOX2
>
> [COMPUTER]
> Id: 3
> ByteOrder: Little
> HostName: BOX3
>
> [COMPUTER]
> Id: 4
> ByteOrder: Little
> HostName: localhost
>
> [COMPUTER]
> Id: 5
> ByteOrder: Little
> HostName: localhost
>
> [COMPUTER]
> Id: 6
> ByteOrder: Little
> HostName: localhost
>
> [COMPUTER]
> Id: 7
> ByteOrder: Little
> HostName: localhost
>
> [MGM]
> Id: 1
> ExecuteOnComputer: 1
> PortNumber: 2200
>
> [DB]
> Id: 2
> ExecuteOnComputer: 2
> FileSystemPath: /var/ndbcluster/mysql-test/ndbcluster/node-2-fs-2200
>
> [DB]
> Id: 3
> ExecuteOnComputer: 3
> FileSystemPath: /var/ndbcluster/mysql-test/ndbcluster/node-3-fs-2200
>
> [API]
> Id: 11
> ExecuteOnComputer: 4
>
> [API]
> Id: 12
> ExecuteOnComputer: 5
>
> [API]
> Id: 13
> ExecuteOnComputer: 6
>
> [API]
> Id: 14
> ExecuteOnComputer: 7
>
> [TCP DEFAULT]
> PortNumber: 2202
> ---
>
> 2. Log into the database cluster and create a test table in the test
> database (e.g. "create table simpsons(id integer not null primary key,
> first_name char(20)) Engine=NDB;")
>
> 3. Insert some values in the test table (e.g. "INSERT INTO simpsons
> VALUES(1, 'Bart'); INSERT INTO simpsons VALUES(2, 'Lisa');")
>
> 4. Connect to the API on BOX1 (from BOX3 or ...) and query the NDB
> table (e.g. "SELECT * FROM simpsons;")
>
> 5. Connect to the MGM node (with "ndb_mgm") and stop DB node 2 ("MGM> 2
> stop")
>
> 6. Again, query the NDB table (e.g. "SELECT * FROM simpsons;")
>
> 7. Bring DB node 2 back up (can't do this from MGM, but you know how)
>
> 8. Stop DB node 3 ("MGM> 3 stop")
>
> 9. Again, query the NDB table (e.g. "SELECT * FROM simpsons;")
>
> 10. Bring DB node 3 back up
>
> [ Note that everything is working up to this point ]
>
> 11. Now connect to the API on BOX2 (from BOX3 or ...) and query the
> NDB table (e.g. "SELECT * FROM simpsons;")
>
> 12. Stop DB node 3 ("MGM> 3 stop")
>
> 13. Again, query the NDB table (e.g. "SELECT * FROM simpsons;")
>
> [ Note that everything is working up to this point ]
>
> 14. Bring DB node 3 back up
>
> 15. Stop DB node 2 ("MGM> 2 stop"). Note: this node is on the same box
> as the API
>
> 16. Again, query the NDB table (e.g. "SELECT * FROM simpsons;")
>
> [ Note that this is where things break ]
>
> 17. Bring DB node 2 back up
>
> 18. Again, query the NDB table (e.g. "SELECT * FROM simpsons;")
>
> [ Note that everything is working again ]
>
> [ Another diagnostic: you can add records to the test table from
> another API while node 2 is offline. When node 2 is brought back online
> it "sees" those records ]
>
> Suggested fix:
> None.
>
> ------------------------------------------------------------------------
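The verification used in this comment (run a SELECT every second while a DB node is stopped and restarted) can be approximated with a small polling loop. This is a rough sketch only, not the script that was actually run: `poll_availability` and the `run_query` callable are hypothetical names, and how `run_query` reaches mysqld (client library, `mysql` command line, and so on) is left to the caller.

```python
import time


def poll_availability(run_query, attempts, interval=1.0, sleep=time.sleep):
    """Call run_query() `attempts` times, pausing `interval` seconds in between.

    Returns a list of (attempt_index, error_text) for every call that raised,
    so an empty list means the API answered every query.
    """
    failures = []
    for i in range(attempts):
        try:
            run_query()
        except Exception as exc:  # any failure counts as "API did not answer"
            failures.append((i, str(exc)))
        if i < attempts - 1:
            sleep(interval)
    return failures


def make_flaky_query(failing_calls, message):
    """Build a stand-in query (for demonstration only) that fails on given call numbers."""
    state = {"calls": 0}

    def query():
        state["calls"] += 1
        if state["calls"] in failing_calls:
            raise RuntimeError(message)
        return [(1, "Bart"), (2, "Lisa")]

    return query
```

Started before step 15 of the reproduction steps and left running through step 17, a non-empty result would correspond to the "ERROR 1015: Can't lock file (errno: 4009)" window described in the report, and an empty result to the fixed configuration.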
[23 Jul 2004 10:04]
Magnus Blåudd
> Your second suggestion was to remove the definitions for COMPUTER6 and
> COMPUTER7, but that probably wouldn't work unless you also remove the
> definitions for API13 and API14.

You need only one COMPUTER for each physical computer you have; more than one API, MGM or DB node can run on the same COMPUTER. In my test config, I'm running all nodes on the same box and hence I have only one COMPUTER defined. Simply change the ExecuteOnComputer parameter:

    [API]
    Id: 13
    ExecuteOnComputer: 2

    [API]
    Id: 14
    ExecuteOnComputer: 3

> [root@BOX3 3.ndb_db]# export NDB_CONNECTSTRING="host=BOX1:2200;nodeid=11"

This is why your mysqld is using nodeid 11: you are telling it to! If you have a fixed setup like this, it's good to hardcode the nodeids. But it is also possible to skip specifying the nodeid, in which case mysqld selects the first free API nodeid.

    [root@BOX3 3.ndb_db]# export NDB_CONNECTSTRING="host=BOX1:2200"
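To make the difference between the two connect strings concrete, here is a minimal sketch in Python that splits them into their parts. `parse_connectstring` is a hypothetical helper, not part of MySQL, and it only handles the two forms quoted in this thread (`host=NAME:PORT` with an optional `;nodeid=N`), not any other connect string syntax.

```python
def parse_connectstring(value):
    """Split e.g. 'host=BOX1:2200;nodeid=11' into host, port and nodeid.

    nodeid is None when it is omitted, i.e. when the choice of the first
    free API nodeid is left to the cluster.
    """
    parts = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        parts[key.strip()] = val.strip()
    host, _, port = parts.get("host", "").partition(":")
    nodeid = parts.get("nodeid")
    return {
        "host": host,
        "port": int(port) if port else None,
        "nodeid": int(nodeid) if nodeid else None,
    }


print(parse_connectstring("host=BOX1:2200;nodeid=11"))
print(parse_connectstring("host=BOX1:2200"))
```

The first call yields a fixed nodeid of 11; the second yields `None` for the nodeid, matching the "first free API nodeid" behavior described above.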