MySQL Bugs: #51203: Make workbench mwb file's document.mwb.xml file more diff friendly

Bug #51203	Make workbench mwb file's document.mwb.xml file more diff friendly
Submitted:	16 Feb 2010 4:08	Modified:	16 Mar 2010 14:47
Reporter:	Edward Rudd	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Workbench	Severity:	S3 (Non-critical)
Version:	5.2.15	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Currently the document.mwb.xml file that is embedded in the main mwb zip is saved with _ptr_ attributes that I'm assuming are references to internal memory pointers from when the file was saved. However I have this "magic" setup in Git to provide diffs of the MWB files for me

(in .gitconfig)
[diff "mwbdiff"]
        textconv = mwb-diff

(in .gitattributes)
*.mwb diff=mwbdiff

(~/bin/mwb-diff)
#!/bin/sh
unzip -p $1 document.mwb.xml

this generates a friendly diff when I run git diff mymodel.mwb

However everytime the model is saved (even if there are no changes) the file itself changes due to new memory pointers (in the _ptr_).

How to repeat:
1) create a new Model 
2) add a new table
3) save
4) make a backup (cp test.mwb test2.mwb)
5) add a column
6) save
7) run mwb-diff test2.mwb > file1.txt
8) run mwb-diff test.mwb > file2.txt
9) diff -u file1.txt file2.txt

notice all the EXTRA changes other than JUST the new column.

Suggested fix:
it would be really nice if the _ptr_'s were a more non-changing unique ID so that diffs could work better and so that the file generated would be more consistently the same if nothing has actually changed.  As right now it causes much pain with SCM which is why I have the diff "magic" above setup so I can really see if there are any changes.

Please can I second this.  It would be VERY helpful to have a diffable MWB format.  It doesn't need to be a plain text file, but unnecessary changes to the embedded XML are a real pain.  Database design should be something that can be version controlled and this is a real impediment to that.

Incidentally one very ugly hack which is nevertheless very helpful for my workflow is to obscure all _ptrs before running diff.  This would mean changing the example script given by Edward Rudd to:

#!/bin/sh
unzip -p $1 document.mwb.xml |sed -E 's/_ptr_="0x[0-9a-f]{8}"/_ptr_="0xXXXXXXXX"/'

Still not a real solution, but helpful.

I second this proposal, and would go even further to simply save the file in a clear XML format without zipping.

There were some threads mentioning that now additional files are stored inside, but why cant they be serialized to xml too?

Mysql workbench is a great tool and many would use it if it would be SCM friendly. 

We have big issues when storing MWB files in the SVN and when two people change the file there is a unsolvable conflict. As a workaround we forward export to SQL to see what the changes were, but cant go further from that to synchronise the MWB file.

It would be so awesome to have a solution for this.

kind regards
Tibor

Please! The binary file requires us to implement a branching strategy that revolves around the binary mwb file to avoid source control conflicts.

btw, MWB can also open directly the document.mwb.xml file.
So there may be a solution involving storing only this file, unzipped (and with _ptr_ attributes reset or even removed).
This way the SCM could store it with delta-fication.

I'm trying to implement such a solution for git.

I also second this proposal.

It would be great if we could get some insight from a team member on _ptr_ and it's purpose/reason for having to be in the xml. Removing it would be a great first step toward SCM.

I have been meaning to look at this issue for some time and have finally got around to it. I have undertaken the following analysis with version 6.3.4, but much of it is likely to be true for many previous versions. The following are preliminary findings, I haven't yet found any show stoppers, but that doesn't mean there aren't any.

First, there are three inconveniences when trying to revision control the .mwb file

1) the @db directory, which is an sqlite3 database (and hence binary
2) the _ptr_ attributes in the document.mwb.xml
3) the struct-checksum attributes in the document.mwb.xml

Second, whilst it is true that you can open the xml file itself in workbench, when it is saved, it reverts to a zipped version. Plus there are complications if you have notes or other inclusions in the original model.

However;
1) @db directory can be omitted. It seems that it will be generated automatically if it does not exist
2) the _ptr_ values are just numbers, not actually pointers. In fact in the source code they are read into std::string variables and so they are in fact just strings. What is critical though is that they must all be unique
3) the struct-checksum can be 0. I'm sure they are probably important for something but they can be set to "0x0" and a .mwb will load fine.

So it seems to me that we can
unzip the .mwb to a directory, say "model.mwb.dir"
delete the @db directory
set the __ptr__ to "0x0" and the struct-checksums to "0x0" in the document.mwb.xml file
put "model.mwb.dir" into text based revision control (including @notes and @scripts if you use them)

Track, merge and diff at your leisure

when time comes to use the model, from inside "model.mwb.dir"

fix the _ptr_ records to be a set of distinct values
zip -r ../model.mwb *

to rebuild the model file for use in the application. At this early stage, this is working for me, but it's still early days.

BTW here is some python to (hopefully) toggle the _ptr_ values between 0x0 and an increasing sequence and set the struct-checksum to 0x0.

#!/bin/env python
import sys, re

with open(sys.argv[1], 'rb') as file:
   counter = 1
   for line in [l.strip('\n') for l in file.readlines()]:
      ptr_line = re.sub(r'_ptr_="[^"]*"', r'_ptr_="0x0"', line)
      use_line = re.sub(r'struct-checksum="[^"]*"', r'struct-checksum="0x0"', ptr_line)
      if ptr_line == line:
         ptr_value = '"0x%06x"' % counter
         use_line = re.sub(r'_ptr_="[^"]*"', r'_ptr_="0x%06x"' % counter, use_line)
         counter += 1
      print use_line

Thanks for your long comment, Philip Haynes.
It's nice to have such a procedure, it will make easier the writing of a git handler.

Sorry, my english is bad, but i will try to explain what i found about '_ptr_'.
If you make in project copy of a table some information in document.mwb.xml will be linked by _ptr_.

like this

<value type="object" struct-name="db.mysql.Table" id="c35e459a-e352-11e4-ae83-50465db4fe4e" struct-checksum="0x0">
  ...
  <value _ptr_="0x0" type="dict" key="customData">
    <value _ptr_="0x5d4d0e0" type="list" content-type="int" key="InsertsColumnWidths">
      <value type="int">100</value>
      <value type="int">100</value>
      <value type="int">100</value>
      <value type="int">100</value>
      <value type="int">100</value>
      <value type="int">100</value>
      <value type="int">100</value>
      <value type="int">100</value>
    </value>
  </value>
  ...
</value>
<value type="object" struct-name="db.mysql.Table" id="41ea714e-e354-11e4-ae83-50465db4fe4e" struct-checksum="0x0">
  ...
  <value _ptr_="0x0" type="dict" key="customData">
    <link type="list" key="InsertsColumnWidths">0x5d4d0e0</link>
  </value>
  ...
</value>

in this case you should keep _ptr_ linked or just replace link with linked value.