Thanks. 

It has been some time since I reported the problem. Unfortunately, I still cannot provide a test case, but I have gathered more experience with it since then. The whole thing is highly mysterious. It is best if I simply tell what I know; maybe somebody can relate to this and knows more.

Originally, I suspected that "ALTER TABLE" statements were not replicated at all. This is not true. It was easy to set up a test case and prove that this hypothesis is definitely wrong, although it had some appeal.

For example, if the slave user did not have the appropriate administrative rights, this would have been the logical effect, and it would not be a bug but a setup fault. But administrative commands are executed as a rule, failing only now and then, so the user evidently has all the rights he needs.

For example, for logging purposes, the application creates a new log table every day. No problem whatsoever. But then, on occasion, when a table is created from scratch, we find that one machine doesn't execute this command. We find out immediately because the subsequent statements referring to this (nonexistent) table cannot be executed, which generates an error, which in turn stops the slave and sends an e-mail. Creating that table on that machine manually works fine. Funny indeed. The same applies to "ALTER TABLE" statements.

Example of trouble on the Debian machine:

ALTER TABLE `editions` ADD `createdEd` TIMESTAMP NOT NULL, ADD `publishedEd` TIMESTAMP NOT NULL;  

ALTER TABLE `editions` DROP `createdEd`, DROP `publishedEd`;  

The first statement was not executed; the second was, and consequently produced an error:

MySQL Debian1>SHOW SLAVE STATUS\G
*************************** 1. row ***************************
          Master_Host: pz-server1.xxx.com
          Master_User: repl
          Master_Port: 3306
        Connect_retry: 60
      Master_Log_File: pz-server1-bin.497
  Read_Master_Log_Pos: 71699778
       Relay_Log_File: server8324611625-relay-bin.091
        Relay_Log_Pos: 51390041
Relay_Master_Log_File: pz-server1-bin.497
     Slave_IO_Running: Yes
    Slave_SQL_Running: No
      Replicate_do_db: xxx
  Replicate_ignore_db: 
           Last_errno: 1091
           Last_error: Error 'Can't DROP 'createdEd'. Check that column/key exists' on query 'ALTER TABLE `editions` DROP `createdEd`, DROP `publishedEd`'. Default database: 'xxx'
         Skip_counter: 0
  Exec_master_log_pos: 71659732
      Relay_log_space: 51430087
1 row in set (0.00 sec)
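
A minimal sketch of how status output like the above could be checked by a script instead of by eye. It is written in Python purely for illustration; the names parse_slave_status and sql_thread_failed are hypothetical, and the parser assumes that every field fits on one line (output wrapped by the terminal is not handled).

```python
import re

# Matches lines such as "   Slave_SQL_Running: No" in SHOW SLAVE STATUS\G output.
_FIELD = re.compile(r"^\s*(\w+):\s?(.*)$")


def parse_slave_status(text):
    """Turn the vertical output of SHOW SLAVE STATUS into a dict of strings.

    Assumption: each field sits on a single line. Separator rows and the
    trailing "1 row in set" line do not match the pattern and are skipped.
    """
    status = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            status[match.group(1)] = match.group(2).strip()
    return status


def sql_thread_failed(status):
    """True if the SQL thread is stopped and an error number is recorded."""
    return (status.get("Slave_SQL_Running") == "No"
            and status.get("Last_errno", "0") != "0")
```

Applied to the output above, the second function would report a failure, because Slave_SQL_Running is "No" and Last_errno is 1091.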

Ok, now the following test is issued on the master:

DROP TABLE IF EXISTS `editionsTest`;
CREATE TABLE editionsTest(
  idEd smallint(5) unsigned NOT NULL auto_increment,
  titleEd varchar(50) NOT NULL default '',
  subTitle varchar(100) NOT NULL default '',
  datumEd date NOT NULL default '0000-00-00',
  nrEd varchar(5) NOT NULL default '',
  typEd enum('p','r') NOT NULL default 'p',
  keysEd text NOT NULL,
  idAutor smallint(5) unsigned NOT NULL default '1',
  prevEds varchar(40) NOT NULL default '',
  PRIMARY KEY  (idEd),
  UNIQUE KEY datTitNr (datumEd,titleEd,nrEd,idEd),
  KEY datumEd (datumEd),
  KEY typEd (typEd),
  KEY nrEd (nrEd),
  KEY idAutor (idAutor,nrEd),
  FULLTEXT KEY titleEd (titleEd,subTitle,keysEd)
) TYPE=MyISAM;

ALTER TABLE `editionsTest` ADD `createdEd` TIMESTAMP NOT NULL, ADD `publishedEd` TIMESTAMP NOT NULL;  

No problem with that. Hm.

Those defects have appeared on both the Debian and the Windows machine, but seldom on both on the same occasion. There are other problems which I haven't had time to delve into yet, but which might prove similarly hard to track down. Those are bugs in their own right, so I should open a new bug report for each of them once I know more. Right now I just observe these things and hope to learn enough to give some clues or, at best, a test case. Whenever something like that happened, I thought hard about setting up a test case, to no avail so far.

A couple of weeks ago, I had some really bad days: tons of database problems without any clue as to what had happened. I had this kind of trouble once back in 2001, and I remember hunting that kind of problem for months. It disappeared all of a sudden, and it must have been related to some kind of hardware problem, because the provider found out that a memory module was defective and had to be replaced. In the recent case, the problem disappeared after a couple of days. I can't even remember exactly what kind of problems I had (I do have records, though, so I could look it up).

But one problem related to replication can be documented well: we should never see a duplicate key error on a (slave) machine triggered by a "REPLACE INTO" statement, as this statement uses the duplicate key condition to find out what should be done, and then does it. It just doesn't make sense to get back the very error condition that the statement is supposed to handle.

MySQL on MAX ver 4.0.20 instance 1>show slave status\G
*************************** 1. row ***************************
          Master_Host: pz-server1.xxx.com
          Master_User: repl
          Master_Port: 3306
        Connect_retry: 60
      Master_Log_File: pz-server1-bin.516
  Read_Master_Log_Pos: 45745139
       Relay_Log_File: max-relay-bin.045
        Relay_Log_Pos: 55102161
Relay_Master_Log_File: pz-server1-bin.516
     Slave_IO_Running: Yes
    Slave_SQL_Running: No
      Replicate_do_db:
  Replicate_ignore_db: servcontrol,snapshot
           Last_errno: 1062
           Last_error: Error 'Duplicate entry '339' for key 1' on query. Default database: 'xxx'. Query: 'replace INTO `editions` (`idEd`, `titleEd`, `subTitle`, `datumEd`, `nrEd`, `typEd`, `keysEd`, `idAutor`, `prevEds`, `createdEd`, `publishedEd`) VALUES ('', '', '', '050821', '334', 'p', '', '1', '315,316,317,332', NOW(NULL), '00000000000000')'
         Skip_counter: 0
  Exec_master_log_pos: 44205827
      Relay_log_space: 56641469
1 row in set (0.02 sec)
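
The expectation behind this complaint can be illustrated with a small sketch. It uses Python's built-in sqlite3 module only as a stand-in (SQLite is not MySQL 4.0, and the two-column table is a hypothetical reduction of `editions`), but in both systems REPLACE is meant to remove the conflicting row and insert the new one instead of reporting a duplicate key.

```python
import sqlite3

# SQLite serves only as a stand-in here; it is not MySQL 4.0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE editions (idEd INTEGER PRIMARY KEY, nrEd TEXT)")

conn.execute("REPLACE INTO editions (idEd, nrEd) VALUES (339, '333')")
# Same primary key again: the old row is removed and the new one inserted,
# so no duplicate-key error is raised.
conn.execute("REPLACE INTO editions (idEd, nrEd) VALUES (339, '334')")

rows = conn.execute("SELECT idEd, nrEd FROM editions").fetchall()
print(rows)  # [(339, '334')]
```

This only shows the intended semantics; it does not reproduce or explain the error 1062 seen on the slave.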

Then I had the idea that this weird behavior might be due to some internal problem of a table. I had occasionally seen this kind of problem with a single table years ago, and a "REPAIR TABLE" fixed it. But in this case, all the tables in all databases were healthy. No problem whatsoever could be found with the tables.

One problem only happens on the Windows machine. As this is my development machine, I can literally hear the disk working as replication keeps up. And sometimes, possibly several times a day, replication stops. It stops without an error. I can hear that it has stopped because the disk is silent all of a sudden. I found out how to get things going again. It's funny as well.

I have recorded the proceedings so I can prove it, but it is easier to tell the story first. The first thing to notice is that the values for Master_Log_File and Exec_master_log_pos do not change anymore. Normally, they change all the time. I suspect that the machine just hangs, so I issue "STOP SLAVE;" and the command returns immediately. Next I issue "START SLAVE;" and again the command returns immediately, but nothing happens. Then I do the whole process again, and now "STOP SLAVE;" needs some 15 seconds to return, so something is different. Now I start the slave again, and I may succeed, or I may have to repeat the whole procedure once more, but then the slave picks up work again.

MySQL on MAX ver 4.0.20 instance 1>show slave status\G
*************************** 1. row ***************************
          Master_Host: pz-server1.xxx.com
          Master_User: repl
          Master_Port: 3306
        Connect_retry: 60
      Master_Log_File: pz-server1-bin.543
  Read_Master_Log_Pos: 31597425
       Relay_Log_File: max-relay-bin.050
        Relay_Log_Pos: 31266636
Relay_Master_Log_File: pz-server1-bin.543
     Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
      Replicate_do_db:
  Replicate_ignore_db: servcontrol,snapshot
           Last_errno: 0
           Last_error:
         Skip_counter: 0
  Exec_master_log_pos: 31597425
      Relay_log_space: 31266632
1 row in set (0.00 sec)

MySQL on MAX ver 4.0.20 instance 1>stop slave;
Query OK, 0 rows affected (0.03 sec)

MySQL on MAX ver 4.0.20 instance 1>stop slave;
ERROR 1199: This operation requires a running slave, configure slave and do SLAVE START
MySQL on MAX ver 4.0.20 instance 1>start slave;
Query OK, 0 rows affected (0.00 sec)

MySQL on MAX ver 4.0.20 instance 1>start slave;
ERROR 1198: This operation cannot be performed with a running slave, run SLAVE STOP first
MySQL on MAX ver 4.0.20 instance 1>stop slave;
Query OK, 0 rows affected (15.70 sec)

To prove that the procedure has to be done twice (notice that Read_Master_Log_Pos and Exec_master_log_pos do not change the first time):

MySQL on MAX ver 4.0.20 instance 1>show slave status\G
*************************** 1. row ***************************
          Master_Host: pz-server1.xxx.com
          Master_User: repl
          Master_Port: 3306
        Connect_retry: 60
      Master_Log_File: pz-server1-bin.549
  Read_Master_Log_Pos: 134791141
       Relay_Log_File: max-relay-bin.058
        Relay_Log_Pos: 140584020
Relay_Master_Log_File: pz-server1-bin.549
     Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
      Replicate_do_db:
  Replicate_ignore_db: servcontrol,snapshot
           Last_errno: 0
           Last_error:
         Skip_counter: 0
  Exec_master_log_pos: 134791141
      Relay_log_space: 140584016
1 row in set (0.00 sec)

MySQL on MAX ver 4.0.20 instance 1>stop slave ;
Query OK, 0 rows affected (0.05 sec)

MySQL on MAX ver 4.0.20 instance 1>start slave ;
Query OK, 0 rows affected (0.00 sec)

MySQL on MAX ver 4.0.20 instance 1>stop slave ;
Query OK, 0 rows affected (17.72 sec)


MySQL on MAX ver 4.0.20 instance 1>start slave ;
Query OK, 0 rows affected (0.00 sec)

MySQL on MAX ver 4.0.20 instance 1>show slave status\G
*************************** 1. row ***************************
          Master_Host: pz-server1.xxx.com
          Master_User: repl
          Master_Port: 3306
        Connect_retry: 60
      Master_Log_File: pz-server1-bin.549
  Read_Master_Log_Pos: 134791141
       Relay_Log_File: max-relay-bin.058
        Relay_Log_Pos: 140584020
Relay_Master_Log_File: pz-server1-bin.549
     Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
      Replicate_do_db:
  Replicate_ignore_db: servcontrol,snapshot
           Last_errno: 0
           Last_error:
         Skip_counter: 0
  Exec_master_log_pos: 134791141
      Relay_log_space: 140584016
1 row in set (0.00 sec)

MySQL on MAX ver 4.0.20 instance 1>stop slave ;
Query OK, 0 rows affected (9.97 sec)

MySQL on MAX ver 4.0.20 instance 1>start slave ;
Query OK, 0 rows affected (0.00 sec)


MySQL on MAX ver 4.0.20 instance 1>show slave status\G
*************************** 1. row ***************************
          Master_Host: pz-server1.xxx.com
          Master_User: repl
          Master_Port: 3306
        Connect_retry: 60
      Master_Log_File: pz-server1-bin.549
  Read_Master_Log_Pos: 135058689
       Relay_Log_File: max-relay-bin.058
        Relay_Log_Pos: 140749543
Relay_Master_Log_File: pz-server1-bin.549
     Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
      Replicate_do_db:
  Replicate_ignore_db: servcontrol,snapshot
           Last_errno: 0
           Last_error:
         Skip_counter: 0
  Exec_master_log_pos: 134956619
      Relay_log_space: 140851609
1 row in set (0.00 sec)
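
The manual procedure recorded above could be automated roughly as follows. This is only a sketch in Python under stated assumptions: run_sql and read_exec_pos are hypothetical callables supplied by the caller (one sends a statement to the slave, the other returns the current Exec_master_log_pos), and the master is assumed to be busy enough that a working slave advances within the settle time.

```python
import time


def kick_slave(run_sql, read_exec_pos, max_rounds=3, settle_seconds=5.0,
               sleep=time.sleep):
    """Repeat STOP SLAVE / START SLAVE until the executed position advances.

    run_sql(statement) and read_exec_pos() are hypothetical callables:
    the first sends one statement to the slave, the second returns the
    current Exec_master_log_pos as an int.
    Returns the number of rounds used, or None if the position never moved.
    """
    start_pos = read_exec_pos()
    for round_no in range(1, max_rounds + 1):
        run_sql("STOP SLAVE")
        run_sql("START SLAVE")
        sleep(settle_seconds)  # give the SQL thread time to make progress
        if read_exec_pos() != start_pos:
            return round_no
    return None
```

A return value of None means the workaround did not help within max_rounds, which would be worth recording together with the slave status at that moment.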

Sometimes I don't realize that the slave has stopped, so it has to work for quite some time to catch up. The master log is read quite fast, so the execution thread works like mad, which can be heard quite well from the sound the disk produces. Several times I noticed this problem through unexpected behavior of the application on my local machine. In fact, this is how I found out the first time. For example, a record that had been inserted on the master should have been there, but it wasn't. Why? Well, the slave was taking a break, sleeping.
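
A silent stall of this kind could in principle be noticed by comparing two status samples taken some time apart. The following Python sketch shows the idea; looks_stalled is a hypothetical name, the inputs are dicts keyed by the field names of SHOW SLAVE STATUS, and the check assumes a master that writes continuously (an idle master would be misreported as a stall).

```python
def looks_stalled(previous, current):
    """Compare two SHOW SLAVE STATUS samples taken some time apart.

    Reports a stall only when both threads claim to be running, no error
    is recorded, and neither the log file nor the positions have moved.
    Assumption: the master writes continuously.
    """
    threads_running = (current.get("Slave_IO_Running") == "Yes"
                       and current.get("Slave_SQL_Running") == "Yes")
    no_error = current.get("Last_errno", "0") == "0"
    watched = ("Master_Log_File", "Read_Master_Log_Pos", "Exec_master_log_pos")
    unchanged = all(previous.get(key) == current.get(key) for key in watched)
    return threads_running and no_error and unchanged
```

With the samples from the transcript above, the first two status outputs (both at position 134791141) would count as stalled, while the last one (Exec_master_log_pos 134956619) would not.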

I guess that the slave sometimes wakes up all by itself, but I'm not sure, because I deduce this from the disk suddenly making the kind of noise that indicates it is desperately catching up. In this case there is no way to find out whether it had indeed slept. One reason might be that the Windows machine is busy with other things, but this hypothesis is weak, because I would know in these cases, and it wasn't busier than at other times.

I started using my local development machine as a slave when I got DSL. I was glad I could do this because otherwise it was hard to synchronize the databases on the master and on the development machine; in fact, they were out of sync almost all of the time, which wasn't a real problem but annoying enough. I wasn't sure if this would work out, but to begin with, it didn't look like there was any problem.

As I usually shut down my development machine at night, I wasn't sure how this would work out. But the first day it worked out fine. I just shut down my Windows machine, fired it up again the next day, and replication was running. Great! Another day it wasn't. The slave was not running. I was stunned because the slave didn't show an error. That was the first time that I found the slave sleeping, and that was the occasion on which I invented the workaround described above. It has happened again at startup, but very rarely.

Since then I have shut down and fired up this machine countless times. Of course, in the morning, the slave has a hard time catching up on all the work the master has done at night. It's great to hear it working. For a long time I had thought this must be possible, and indeed it is. Fantastic.

One of these days I got the following error on the Debian slave:

ERROR 1201: Could not initialize master info structure, more error messages can be found in the MySQL error log

I had never seen this before and couldn't find anything about it on MySQL.com or with Google except questions from other people, and if I remember correctly, this question even appeared on your mailing list, but without an answer, so I was on my own. I remember that I had to spend quite some time to get rid of this error. Simply issuing "CHANGE MASTER TO" would not do the trick. I'm not sure, but I think that erasing the master.info file did not help either. I don't remember exactly how I got it running again. Oh yes, now I remember faintly: I think I got rid of the slave setup altogether and started the slave from the master log position where it had left off.

As the Debian machine was set up differently, it took me some time to find the error log, but I don't remember whether I found anything of interest there; I guess not. As you guys didn't seem to care, I didn't keep records that meticulously.

Another problem I have is on my second SuSE machine. It has never worked correctly as a slave. I have no idea why. I have set up quite a number of slaves and never experienced this kind of phenomenon. The slave appears to work correctly: the master log is read and executed, and everything looks just perfect. But it isn't. The slave does not execute a single statement at all. The database doesn't change. As I didn't have time to work out a solution for this, I constructed a quick and dirty workaround: the machine was set up as an Apache machine only; it reads the data from the master instead of from its own slave server (which "runs" all the time, though).

One of these days, I had some time, so I had a closer look and found something that might have been the cause of the problem, or so I thought. The relay log on that machine was corrupt. No idea why. It looks like this:

𤨤in蟅澫^B^B^@^@^@罠@^@^@)宭^B^@^@`氛@^@^@^@^@^@^M^@^@db_name^@UPDATE item_counter ...
^B^@^@j氛@^@^@^@^@^@^M^@^@db_name^@UPDATE page_counter ...
洯@^@^@U狿^B^@^@l氛@^@^@^@^@^@^M^@^@db_name^@UPDATE page_counter...
甌@^@^@'莩^B^@^@h氛@^@^@^@^@^@^M^@^@db_name^@UPDATE anzeigen ...

and so on all the way ... But anyway, this problem should be easy enough to fix. I took the last backup from the master and set up this machine from scratch, more or less routine. Same result. It seems to be working, but it isn't. The latest records are exactly those from the setup procedure. Nothing is changed after that.

The next thing would be to look at the relay log again to find out if it's corrupt as well, but alas, I didn't have the time, so I left it at that. Funny thing. But it looks like the problem can be reproduced, which is a good thing.
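
One caveat when judging a relay log by eye: relay logs use the same binary format as the master's binary logs, so unreadable characters in a text viewer are not by themselves proof of corruption; the mysqlbinlog utility is the usual way to print their contents. A rough first check, sketched here in Python with a hypothetical function name, is whether the file starts with the four-byte binary-log magic number (assumed here to be 0xfe followed by "bin"). It says nothing about the events that follow the header.

```python
# Assumed magic number at the start of a MySQL binary or relay log file.
BINLOG_MAGIC = b"\xfe\x62\x69\x6e"


def has_binlog_magic(path):
    """Return True if the file starts with the binary-log magic number."""
    with open(path, "rb") as handle:
        return handle.read(4) == BINLOG_MAGIC
```

A file that fails this check is damaged right at the start; a file that passes still needs to be read with mysqlbinlog to see whether the events themselves are intact.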

I'm glad you picked up the dialog again, and I will do everything I can to record phenomena, think about them, and look for any kind of hint that might enable me to set up a test case. So far, the first condition for bug hunting has been met: the problems appear again and again. So eventually we will be able to track them down. I would be most grateful if you could provide me with ideas as to what I could do (if you can).