Minimal Site Synchronisation using Changesets

Today, I implemented a feature in my home-brew site publishing system that has been on my list for some while.

I had been relying on Transmit to synchronise my local copy of dx13 with the server, but this was becoming untenable. Transmit compares the entire local site with the copy on the FTP server. With several hundred files and folders to examine, the synchronisation process took a long time. It would also often fail partway through, with the server refusing further requests¹.

This was overkill when all that was needed was to upload a single article and the regenerated homepage, archives and feed.

The new feature, as may be obvious by now, is that the Ruby scripts take care of uploading changed files. Because the scripts are what modify the files, they know exactly which changes need to be uploaded, and so should be able to upload them more efficiently than the brute-force Transmit approach.

I’d put off writing this feature because I thought it would be fiddly, and that I would have to do lots of things like checking that directories existed on the FTP server before sending files. Once I realised that this needn’t be the case, however, I became more motivated.

The key insight I’d missed was that my set of scripts is the sole way the site gets modified (beyond logging into the FTP server). Therefore, all my scripts had to do was record a changeset of their actions as they went about generating the site. This changeset could then be replayed verbatim on the FTP server to bring it up to date with my local copy. In part, I was inspired by a chapter in a book I’ve been reading, Beautiful Code, which discusses Subversion’s method of processing diffs between source trees.

The implementation is fairly straightforward. I describe it here for completeness, and because I may spot some errors in the logic if I try to explain it. Design review by weblog, you might say.

The changeset is composed of a sequence of ordered changes. The changes must be ordered as there are some things which need to occur in sequence. For example, a directory must be created before a file inside the directory can be uploaded.

Each part of the site is generated by a specialised publisher class. Each of these publishers was augmented with code to record modifications it makes to the filesystem during the publish process. A simple Change class is used for this purpose, which stores the type of change (add file, add directory, change file etc.) and the path of the changed item.
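A minimal sketch of what such a record might look like is below. The names (`Change`, `Changeset`, `record`) and the symbols used for the change types are illustrative assumptions, not necessarily those used in the real scripts. The one property that matters is that changes come back in the order they were recorded.

```ruby
# Hypothetical sketch: a change is just a type and a path.
Change = Struct.new(:type, :path)

# An ordered list of changes. Insertion order is replay order, so a
# directory recorded before a file inside it is also replayed first.
class Changeset
  include Enumerable

  def initialize
    @changes = []
  end

  # Called by a publisher each time it touches the filesystem.
  def record(type, path)
    @changes << Change.new(type, path)
    self
  end

  def each(&block)
    @changes.each(&block)
  end
end

# Example: publishing one new article (paths are made up).
changes = Changeset.new
changes.record(:add_directory, "2008/05")
changes.record(:add_file, "2008/05/changesets.html")
changes.record(:change_file, "index.html")
```

Each publisher would be handed the same `Changeset` instance, so the recorded order reflects the order in which the whole publish run touched the filesystem.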

Then an FTP changeset processing class takes the list of changes and replays it over the wire on the FTP server. The beauty of this method is that the FTP processor doesn’t have to worry about missing folders or other hiccups as it is merely replaying what the script has done, which should be the only modifications needed.
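The replay step could be sketched roughly as follows. This is an assumption-laden illustration rather than the actual class: `FtpChangesetProcessor`, `RecordingConnection` and the change-type symbols are hypothetical names. The connection is anything that responds to `mkdir` and `putbinaryfile`, which a `Net::FTP` instance does; a stand-in is used here so the example runs without a server.

```ruby
# Hypothetical sketch of replaying a changeset over a connection.
Change = Struct.new(:type, :path)

class FtpChangesetProcessor
  # connection: any object responding to mkdir(path) and
  #             putbinaryfile(local, remote), e.g. a Net::FTP instance.
  # local_root: directory holding the locally generated site.
  def initialize(connection, local_root)
    @connection = connection
    @local_root = local_root
  end

  # Apply each change in order; no existence checks are made because the
  # changeset already lists every directory that needs creating.
  def replay(changes)
    changes.each do |change|
      case change.type
      when :add_directory
        @connection.mkdir(change.path)
      when :add_file, :change_file
        local = File.join(@local_root, change.path)
        @connection.putbinaryfile(local, change.path)
      else
        raise ArgumentError, "unknown change type: #{change.type.inspect}"
      end
    end
  end
end

# Stand-in connection that records calls instead of talking to a server.
class RecordingConnection
  attr_reader :calls

  def initialize
    @calls = []
  end

  def mkdir(path)
    @calls << [:mkdir, path]
  end

  def putbinaryfile(local, remote)
    @calls << [:put, local, remote]
  end
end

connection = RecordingConnection.new
processor = FtpChangesetProcessor.new(connection, "output")
processor.replay([
  Change.new(:add_directory, "2008/05"),
  Change.new(:add_file, "2008/05/changesets.html")
])
```

Passing the connection in, rather than opening it inside the class, keeps the replay logic testable without a network.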

Of course, there are problems with my current implementation. The most severe is that if the changeset processor fails during the replay process, the list of changes is lost: the next time the script runs it will have forgotten the changes that should have been made but which failed, leaving the remote site out of date. The script also doesn’t know about changes to non-generated files, such as stylesheets or images.

For now, I’ll fall back on Transmit if either scenario occurs, revisiting the publishing process if this becomes tiresome. For the first problem, a better approach would be to store the changeset in a file, so it is persistent across script invocations. As for the second, more thought would be required.
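One way the persistent changeset might work is sketched below, purely as an assumption about a future design. The class name `PendingChanges`, the file name and the one-JSON-object-per-line format are all hypothetical. Changes are appended to a file before any upload is attempted, and the file is deleted only once a replay has finished without error.

```ruby
require "json"

# Hypothetical sketch: keep unreplayed changes in a file between runs.
Change = Struct.new(:type, :path)

class PendingChanges
  def initialize(file)
    @file = file
  end

  # Append changes to the file, one JSON object per line.
  def append(changes)
    File.open(@file, "a") do |f|
      changes.each do |c|
        f.puts(JSON.generate("type" => c.type.to_s, "path" => c.path))
      end
    end
  end

  # Read back every pending change, oldest first.
  def load
    return [] unless File.exist?(@file)

    File.readlines(@file, chomp: true).reject(&:empty?).map do |line|
      data = JSON.parse(line)
      Change.new(data["type"].to_sym, data["path"])
    end
  end

  # Forget the pending changes; call only after a successful replay.
  def clear
    File.delete(@file) if File.exist?(@file)
  end
end
```

A publish run would then call `append` with its new changes, replay the result of `load`, and call `clear` only if the replay did not raise. After a partial failure, the replay would see some changes a second time, so it would also need to tolerate, say, creating a directory that already exists; that part is not shown here.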

If you can read this post, the implementation is holding together so far.

¹ My guess is that I was exceeding some limit on the number of requests to the server, because of the number of checks required.