Long ago, before storing things in the cloud was commonplace or economical, I used an online mirroring tool to replicate all our family photos between my desktop PC and my wife’s netbook. This was both effective and painful - here’s why.
We wanted all our family photos to be kept on both machines so that we wouldn’t lose any (well, very many) if one of the hard drives failed. This also allowed us to work independently - I could add photos from my camera to the pictures folder on my desktop, and my wife could do the same with pictures from her camera on her netbook, and all the photos would be available on both machines.
I found a very smart piece of software that did this job well. First, it would scan the folder of pictures on my machine, then it would scan the matching folder on my wife’s netbook. Once both scans were complete, it would work out which files needed to be copied, which needed to be deleted, and which were simple moves.
Correctly identifying the file moves was a critical feature because it saved a lot of time and bandwidth. When we reorganized our photos (say, by combining pictures of the same event from different cameras), the system wouldn’t recopy all of the photos a second time; it would just move them around.
The software needed to complete both file system scans before any file transfers took place - and, due to the number of files involved, the scans took around 45 minutes to complete if the machines were both idle, longer if either machine was in active use.
While my desktop machine was on for much of the evening (on a typical weekday I’d turn it on when I got home from work and leave it on until bedtime), my wife’s netbook was only turned on when she was actively using it.
This meant that photos copied onto one machine were often not replicated onto the other machine until weeks later. This was hardly satisfactory - either for convenience, or for backup.
Once I became aware of this situation and started researching the problem, I discovered that the machines would commonly sit churning away for over an hour before any photos actually got transferred between the machines.
Lots of activity going on, but very little actual progress.
Since encountering this situation, I’ve seen many other systems that have a similar problem - they can sit there churning away, doing a terribly large amount of work but not actually getting any closer to the end goal - and if they’re interrupted, the whole process starts over again.
Fortunately, there’s a better way - by designing systems to guarantee progress, the effect of interruptions can be minimized.
Here are some suggestions for how that file replication tool could have been more effective.
Parallel scanning

For some reason, the two machines did not scan in parallel, even though the required software was installed on both. Instead, one machine would complete a full local scan and only then ask the other to do the same.
Only once both scans were complete would any file replication occur.
A simple modification to run both scans at the same time would have greatly reduced the time before any progress began.
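A minimal sketch of that change, assuming a Python-style implementation (the original tool’s language is unknown). The two scans are kicked off together; in a real tool the second scan would run on the other machine over the network, but here both run as local threads purely for illustration:

```python
import concurrent.futures
import hashlib
import os

def scan_folder(root):
    """Walk a folder and return {relative_path: content_hash}."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            index[os.path.relpath(path, root)] = digest
    return index

def scan_both(local_root, remote_root):
    """Run both scans concurrently instead of one after the other."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        local = pool.submit(scan_folder, local_root)
        remote = pool.submit(scan_folder, remote_root)
        return local.result(), remote.result()
```

With the scans overlapped, the wait before transfers could begin is bounded by the slower of the two scans rather than their sum.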
Offline indexing

The tool only scanned the folders being replicated when both machines were online at the same time, presumably so it could compare their current state. This wasted time that could have been spent copying, moving and deleting files.
Instead, the tool should have scanned the local copy of the folder regularly, maintaining an index of all the files. As the contents of the folders changed, the index could track the changes that had been made:
- Photo 892.jpg with hash 2e3f8b created in folder 2001-01-01 New Year's Day
- Photo 645.jpg with hash 0604ca moved to folder 2000-12-31 New Year's Eve from folder 2001-01-01 New Year's Day
- Photo 344.jpg with hash 0c2c4f deleted from folder 2001-01-01 New Year's Day
With both machines maintaining such an index, they could have synchronized very quickly once both were online, and started copying, moving and deleting files as required.
Note that a hash of the file contents would be a necessary part of the index so that different files with the same name (perhaps as the result of an edit) could be differentiated. In 2000, this would have been done using MD5 but today we’d use something like SHA-256.
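Here is a sketch of how two such indexes might be compared to produce the copy/move/delete plan; the function name `plan_actions` and the tuple action format are illustrative inventions, not the original tool’s design:

```python
def plan_actions(local, remote):
    """Compare two {path: hash} indexes and derive the actions
    needed to make the remote folder match the local one."""
    actions = []
    remote_by_hash = {h: p for p, h in remote.items()}
    for path, digest in local.items():
        if remote.get(path) == digest:
            continue  # already in sync
        if digest in remote_by_hash:
            # Same content already exists remotely under another
            # path: a cheap move instead of a full copy.
            actions.append(("move", remote_by_hash[digest], path))
        else:
            actions.append(("copy", path))
    local_hashes = set(local.values())
    for path, digest in remote.items():
        # Delete only content that exists nowhere locally; files
        # handled by a move above still appear here under their
        # old path but share a hash with a local file.
        if path not in local and digest not in local_hashes:
            actions.append(("delete", path))
    return actions
```

Because both indexes are maintained offline, this comparison takes milliseconds once the machines connect, rather than waiting on two fresh 45-minute scans.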
Partial file updates
When a file was changed, it was always copied across in full, even if the change was only at the end of the file.
By dividing a file into a number of chunks and calculating a hash for each chunk, only the changed parts of a file need to be shared between the machines, decreasing the bandwidth requirements. While the original program might have used 1MB chunking, we’d probably expect 16MB or larger chunks if implemented today.
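A rough sketch of that idea using fixed-size chunks (the helper names are hypothetical). Note that fixed-offset chunking misses insertions near the start of a file, which shift every later chunk; tools like rsync solve this with rolling hashes, but the simple version still handles the common append-at-the-end case:

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB, as the original tool might have used

def chunk_hashes(data):
    """Split file contents into fixed-size chunks and hash each one."""
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

def changed_chunks(old, new):
    """Return the indexes of chunks that would need to be re-sent."""
    old_hashes = chunk_hashes(old)
    new_hashes = chunk_hashes(new)
    return [i for i, h in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != h]
```

Only the chunks whose hashes differ need to cross the wire; for a small edit at the end of a large photo library export, that’s one chunk instead of the whole file.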
Combining this chunking mechanism with offline indexing would have offered even greater opportunity for improvements.
Tracking deferred actions
The folder replication process started over from scratch every time, throwing away everything that was known.
Instead, the application could have kept a record of every proposed action that hadn’t yet been completed. When the machines connected again, this record would have given the new session a head start.
This would have allowed missing files to be copied across the link between machines even while the new file scans were taking place.
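One hypothetical way to keep that record: journal each proposed action to disk and reload the journal on startup. The file name and JSON format here are illustrative assumptions, not the original tool’s behaviour:

```python
import json
import os

JOURNAL = "pending_actions.json"  # hypothetical journal file name

def save_pending(actions, path=JOURNAL):
    """Persist not-yet-completed actions so a restart can resume them."""
    with open(path, "w") as f:
        json.dump(actions, f)

def load_pending(path=JOURNAL):
    """On startup, reload any actions left over from the last session."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)

def mark_done(action, actions, path=JOURNAL):
    """Remove a completed action and rewrite the journal."""
    actions.remove(action)
    save_pending(actions, path)
```

An interrupted session then resumes exactly where it left off: transfers from the journal can begin immediately while the fresh scans run in the background.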
The key lessons here are less to do with the actual changes that might have been made to the file synchronization program and more to do with the idea of ensuring that progress is made, even under adverse conditions.
It’s easy to design and implement software that works properly when running on a machine with infinite CPU, memory, bandwidth, and time. Unfortunately, such machines are exceptionally rare - so we need to design software that works well when running with limited CPU, memory, bandwidth, and time.
Designing for guaranteed progression is one technique that can help.