Some people would say I am a sad geek, and they may be correct. For fun I've started (yet another) project at home, writing a very simple proof-of-concept disk backup system. I'm really doing it to explore the issues surrounding the technologies that may be present in such a system, because the backup system is part of my responsibilities at work and it is clear that the days of tape-based backup here are numbered. But what to replace it with? Keep NetWorker and use disk devices as targets? What about smart disk targets that deduplicate? We'd certainly want to minimise the amount of disk used by any backup system to keep it cost effective. The weekly total for full-level backups alone is roughly 31TB being shoved off to tape (and growing).
So I'm having a bit of a play. No doubt this will never see the light of day in any really useful sense; I can't see myself using it at work, for instance. I may end up implementing it properly at home for backups from, say, Gemma's laptop.
As per usual with my kick-about projects at home, I'm basing it on the Ruby language. I hesitate to use the words 'design decisions', as I'm not sure the scrawl on my pad could properly count as design. But this 'design' has two levels of deduplication: the first is a single instance store and the second is block-level deduplication. Since I implemented block-level deduplication first, I will talk about that first. Essentially all I am doing is carving up a file into blocks, storing them on disk if they are unique and keeping an index of the blocks that make up the file. Uniqueness is determined by a SHA256 hash of the block contents. Block size is an interesting question: too small a block and performance drops off dramatically, too large a block and the number of repeated blocks drops off. If a block's hash already exists at ingest time, I double-check uniqueness by comparing the block contents (although this could possibly be removed, as the chance of a collision is so small).
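To make the block-level idea concrete, here is a minimal sketch rather than the actual ingester. It assumes fixed-size blocks and uses a plain Hash in place of the on-disk store; the `ingest` method name and the tiny `BLOCK_SIZE` are made up for illustration.

```ruby
require 'digest'
require 'stringio'

# Assumption: a tiny fixed block size so the example is easy to follow.
# A real store would use something far larger.
BLOCK_SIZE = 4

# Split an IO into fixed-size blocks, keep each unique block once in
# `store` (a Hash of digest => block contents, standing in for disk),
# and return the ordered list of digests that make up the file.
def ingest(io, store, block_size = BLOCK_SIZE)
  index = []
  while (block = io.read(block_size))
    digest = Digest::SHA256.hexdigest(block)
    if store.key?(digest)
      # Belt and braces: confirm the stored block really is identical.
      raise "hash collision on #{digest}" unless store[digest] == block
    else
      store[digest] = block
    end
    index << digest
  end
  index
end

store = {}
index = ingest(StringIO.new('abcdabcdabcdxyz'), store)
puts "blocks in file: #{index.size}, unique blocks stored: #{store.size}"
```

The block size trade-off shows up directly here: shrink `BLOCK_SIZE` and the loop runs more often with more digests to index, grow it and fewer blocks will match one another.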
The ingester writes its blocks to multiple destinations per block. Ey up, aren't you just duplicating that deduplicated data? Well, yes, but that is in the name of redundancy and fault tolerance. Although in development the system is writing blocks to a couple of subdirectories, these could easily be NFS mounts from systems in separate datacenters, for example. The index of blocks keeps a list of where the blocks have been written.
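Writing each block to several destinations might look something like the following. This is a sketch under assumptions: blocks are stored as files named after their SHA256 digest, and the `write_block` helper and the `store_a`/`store_b` directory names are hypothetical.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Write one block into every destination directory, skipping any copy
# that is already there, and return the paths the block now lives at.
# Those paths are what the index would record for the block.
def write_block(block, destinations)
  digest = Digest::SHA256.hexdigest(block)
  destinations.map do |dir|
    FileUtils.mkdir_p(dir)
    path = File.join(dir, digest)
    File.binwrite(path, block) unless File.exist?(path)
    path
  end
end

# Two subdirectories of a temp dir stand in for the separate mounts.
Dir.mktmpdir do |root|
  destinations = [File.join(root, 'store_a'), File.join(root, 'store_b')]
  puts write_block('some block data', destinations)
end
```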
As an aside, I decided to use the key-value NoSQL database Redis for the storage in this case, basically to play with it and because its list-centric operation lends itself quite nicely to this sort of thing.
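As a rough illustration of why lists fit, a file's block index is just an ordered list of digests, which maps onto the Redis `RPUSH` and `LRANGE` commands. The sketch below is an assumption about how that could look, not the real schema: the `file:<id>:blocks` key naming and the helper methods are invented, and `FakeListStore` is an in-memory stand-in so the example runs without a Redis server. A client from the redis gem (`Redis.new`) responds to the same two calls.

```ruby
# In-memory stand-in for the two list calls used below.
# Not part of any library; it only mimics RPUSH and LRANGE.
class FakeListStore
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  # Append a value to the list at key and return the new list length.
  def rpush(key, value)
    @lists[key].push(value.to_s).size
  end

  # Return elements from start to stop inclusive (-1 is the last element).
  def lrange(key, start, stop)
    @lists.key?(key) ? (@lists[key][start..stop] || []) : []
  end
end

# Append one block digest to a file's ordered block list.
def record_block(db, file_id, digest)
  db.rpush("file:#{file_id}:blocks", digest)
end

# Fetch every digest for a file, in the order it was ingested.
def block_list(db, file_id)
  db.lrange("file:#{file_id}:blocks", 0, -1)
end

db = FakeListStore.new
%w[aaa bbb aaa].each { |digest| record_block(db, 'report.doc', digest) }
puts block_list(db, 'report.doc').inspect
```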
So I can chop a file up into blocks and deduplicate based on those blocks. I can even do the reverse, which is to reconstitute a file from its deduplicated version. This is a good thing. Being able to get your file back is quite important.
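Reconstitution is the simpler direction: walk the index in order and glue the blocks back together. A minimal sketch, again assuming a Hash of digest => block in place of the disk store, with a hypothetical `restore` method that also re-checks each block against its digest on the way out:

```ruby
require 'digest'

# Rebuild the original bytes by looking up each digest in order.
# Each block is verified against its digest before it is used.
def restore(index, store)
  index.map do |digest|
    block = store.fetch(digest) # raises KeyError if a block is missing
    raise "corrupt block #{digest}" unless Digest::SHA256.hexdigest(block) == digest
    block
  end.join
end

# Build a small store and index by hand: three blocks, two of them unique.
blocks = %w[abcd abcd xyz]
store = blocks.uniq.map { |b| [Digest::SHA256.hexdigest(b), b] }.to_h
index = blocks.map { |b| Digest::SHA256.hexdigest(b) }
puts restore(index, store)
```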
The latest thing I've been looking at is a very simple client and web service for creating savesets and uploading files. I mentioned that I was using two levels of deduplication, the block-based and the single instance store. For each file that the client processes, it takes an MD5 sum and checks whether that sum already exists at that file path. If it exists, the uploader merely updates pointers to the file in the saveset and carries on; if it doesn't exist, it uploads the file first. After that will be the reverse, a simple recovery client, and job done.
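The single instance check boils down to something like this. It is only a sketch of the decision, with the web service replaced by an in-memory catalogue; the `backup_file` method, the catalogue shape (path => set of MD5 sums already held) and the saveset entries are assumptions made for the example.

```ruby
require 'digest'
require 'set'

# catalogue: Hash of path => Set of MD5 sums the server already holds.
# saveset:   Array collecting a pointer for each file in this backup run.
# Returns :linked when only a pointer was added, :uploaded when the
# content would have been sent first.
def backup_file(path, content, catalogue, saveset)
  md5 = Digest::MD5.hexdigest(content)
  known = (catalogue[path] ||= Set.new)
  action = known.include?(md5) ? :linked : :uploaded
  known << md5 # stands in for the real upload when action is :uploaded
  saveset << { path: path, md5: md5 }
  action
end

catalogue = {}
monday = []
tuesday = []
puts backup_file('/home/gemma/notes.txt', 'first draft', catalogue, monday)
puts backup_file('/home/gemma/notes.txt', 'first draft', catalogue, tuesday)
```

An unchanged file in a later saveset costs one checksum and one pointer rather than another upload.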