I should really put links in this post to all of the resources I found when looking into this, however it's stupid-early in the morning and I thought I'd just try and get some thoughts down about this.
The first iteration of my prototype deduplicating backup system used fixed length blocks (henceforth known as chunks to match the terminology found in various papers online) which is a very easy thing to do -essentially just read the file into a buffer and chop it up based on the size you've chosen. There is a problem with this however. Consider a large text file that you are chopping into 4k blocks for deduplication purposes. Now consider adding a couple of characters to the start of the text file. What this essentially does is cause all of the fixed blocks to be different. So the next time the deduplication process runs each block will be considered new.
There is a way round this however, using variable length blocks and content defined chunking. Essentially looking for a pattern within the data at which a chunk boundary is declared and using that as the end of the chunk. In this model only the first chunk will now be different, so greater deduplication ratios will be achieved. Plenty of information can be found on this looking for content defined chunking on Google (look for LBFS specifically).
A lot of the papers suggest using Rabin fingerprinting over a sliding window, a collision resistant method of fingerprinting a string based on random polynomials. After quite a bit of reading to try and understand this method (I suck at maths) I had the thought that it might be overkill. Collision resistance is not actually a property that is required for this task -really the only thing required is that the chunk boundary be repeatedly found. What I did was to look at simpler ways of fingerprinting, in the end settling on three, a simple summation of the bytes, a crc32 and an adler32 (the latter two because they are in the ruby zlib library). I take a sliding window over the bytes from a buffer, fingerprint the window contents and then use a bitwise AND with a mask to detect chunk boundaries. I need to run a bunch of comparisons now with different parameters to see which gives the best throughput, average block-length, etc.
Subscribe to:
Post Comments (Atom)
0 comments:
Post a Comment