Summer Sundae 2010

At the weekend we went to the Summer Sundae festival in Leicester. It was interesting going back there after some time away from it, having previously gone in 2003 and 2004. The festival has sprawled back into Victoria Park and gained a couple of stages: the comedy tent and the rising stage. It has also gained the ability for stuff to go on a bit later than normal in the guise of a silent disco. Brilliant for events that have to put noise curfews in place, I would've thought. In previous years we'd actually gone out to clubs in Leicester after the bands finished. In adding all the space they haven't upped the capacity massively, so it is still quite small and not too squishy to get around.

Big Rich was volunteering down there so we met up with him fairly shortly after we arrived, got the tent up and nipped back out to Tesco for a couple of bits. We had a couple of beers inside the festival, then popped to a pub on London Road. A relatively early night didn't buy us much in the way of sleep because we were in the noisy camping field, but oddly the Thursday night was the only night it was actually that noisy. Every morning other than Friday it was deathly quiet.

Unfortunately one of Richard's 8 hour shifts was Friday starting at 2, so he had to leave us relatively early. In the morning we'd popped out to Morrisons and brought back boxes of wine and a bottle of Morgan's Spiced, which we duly decanted into various bottles. We didn't take a massive amount of the wine out with us and managed to stay in the nice tipsy zone most of the evening, only really switching onto the rum around the time Roots Manuva came on. Absolutely wicked set and the highlight of the Friday as far as I was concerned. After seeing an ace set by Teenage Fanclub I was ready to step it up a notch and managed to persuade Gemma that she'd be better off seeing Roots Manuva rather than Seasick Steve again. Other good things were the always brilliant Slow Club and the new-to-me Spotlight Kid. The Orchard on the Musician stage I found utterly baffling, like three different bands playing on stage at once, and not in a good way. I could be charitable and blame it on technical difficulties I guess. I wish I'd seen more of By The Rivers; I only caught their last song and they sounded amazing.

It rained a lot on Saturday. We started off the day back in Leicester, buying more rum, breakfast and some snacks for later. Richard was not at work so we spent some time running between stages and drinking booze. This time the full boxes had come out, which led to the latter part of the evening being a little more hazy. However, we did see some things throughout the day. My favourite was Caribou, who were absolutely rattlingly good. Gemma ran away from Richard and me during The Go! Team as we were 'being annoyed'. We once again ended up in the silent disco, which ended rather abruptly, but that was probably a good thing.

I was nursing a pretty bad hangover on Sunday morning and it was the only morning we didn't get up early and get a decent breakfast down us. By the time a couple of Gemma's old nursing student mates met up with us I was almost out of my suffering (pint number 3 normally does it). Because we were out chatting we didn't move much from the field in front of the main stage, making just a couple of quick diversions to see Pokey LaFarge and the South City Three and Los Campesinos! Pokey LaFarge was absolutely brilliant. The Local Natives were also very good on the main stage, and all the ladies went a bit giddy for Mumford and Sons. After we said goodbye to Gemma's student nurse mates we realised we'd run out of rum, which resulted in us bumping into a pair of guys who were partying off the hook. We kept bumping into them for the rest of the evening. Gemma was giving life advice to 15 year olds. I think the advice was: life sucks at 15, just ride it out, there's nothing you can do about it.



I have taken the plunge at home and switched to using Ubuntu as my main OS rather than Windows. My laptop was creaking slightly from 4 years of installs and uninstalls and I sort of couldn't face the job of a reinstall, so I bought a new hard disk and installed Ubuntu. I settled on a HITACHI HTS725050A9A364 Travelstar 500GB Hard Disk Drive 7200rpm SATA with 16MB cache. I'm pretty pleased with it as it's quick and quiet; at the moment the fan certainly makes more noise. That's probably stuffed with dust though, and I was having an issue where the laptop would suddenly shut down on me. I reckon that was due to overheating, so I will try and give the thing a good clean once my can of air arrives.


MySQL on Linux on NFS on Netapp

I recently moved some storage from a local volume on a Linux virtual server over to an NFS volume on one of our Filers. This was all well and good until the users ran their backup script, which stops the software and MySQL, copies some files about* and then restarts everything. MySQL was failing to restart, with error log messages complaining about locks on the InnoDB files (ibdata1, ib_logfile0, ib_logfile1). It turns out that the Linux NFS client doesn't do locking in the same way as other clients (e.g. Solaris) and was never clearing the locks out. The simplest solution I found, and which looks to work, is to add the 'nolock' mount parameter to the NFS entry in the Linux box's /etc/fstab.
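For reference, an fstab entry with that option might look something like the fragment below. This is only a sketch: the filer name, export path and mount point are made up for illustration, and the other options are just common choices rather than anything from the setup described above.

```
# /etc/fstab (illustrative only: "filer01", the export and the mount point are hypothetical)
filer01:/vol/mysql_data  /var/lib/mysql  nfs  rw,hard,nolock  0  0
```

Worth bearing in mind that with 'nolock' any file locks only apply between processes on the same client, so this is only sensible when a single box is using those files.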

* Of course they don't really need to be doing traditional backups now, because they can tell the filer to create a snapshot instead.


Progress Bars

The last couple of days at work have consisted of watching progress bars of various types (sometimes not bars at all) scroll along. Restores, scanning tapes for restores, software/OS installs, Svmotions, VMDK aligns, and more. When you're tired to begin with from not sleeping too well this level of dullness makes it hard to stay awake.


Variable-length blocks for deduplication

I should really put links in this post to all of the resources I found when looking into this, however it's stupid-early in the morning and I thought I'd just try and get some thoughts down about this.

The first iteration of my prototype deduplicating backup system used fixed length blocks (henceforth known as chunks to match the terminology found in various papers online), which is a very easy thing to do: essentially just read the file into a buffer and chop it up based on the size you've chosen. There is a problem with this however. Consider a large text file that you are chopping into 4k blocks for deduplication purposes. Now consider adding a couple of characters to the start of the text file. This shifts everything after it along by a couple of bytes, so the contents of every fixed block are now different. The next time the deduplication process runs, each block will be considered new.
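To make the problem concrete, here is a minimal sketch in Ruby. It is not the prototype's actual code: the helper names (fixed_chunks, shared_chunk_count) are invented for illustration, and random bytes stand in for the text file.

```ruby
require 'digest'
require 'set'

# Split data into fixed-size chunks (the last chunk may be shorter).
def fixed_chunks(data, size)
  chunks = []
  offset = 0
  while offset < data.bytesize
    chunks << data.byteslice(offset, size)
    offset += size
  end
  chunks
end

# Count how many chunks of new_data already exist among the chunks of old_data.
def shared_chunk_count(old_data, new_data, size)
  seen = fixed_chunks(old_data, size).map { |c| Digest::SHA256.hexdigest(c) }.to_set
  fixed_chunks(new_data, size).count { |c| seen.include?(Digest::SHA256.hexdigest(c)) }
end

original = Random.new(42).bytes(64 * 1024) # 16 chunks of 4k
shifted  = 'ab'.b + original               # two bytes added at the start

puts shared_chunk_count(original, original, 4096) # every chunk matches
puts shared_chunk_count(original, shifted, 4096)  # nothing lines up any more
```

Appending to the end of the file is harmless with fixed chunks; it is only inserts and deletes earlier in the file that throw every later boundary out.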

There is a way round this however, using variable length blocks and content defined chunking. Essentially you look for a pattern within the data, declare a chunk boundary wherever it occurs and use that as the end of the chunk. Because the boundaries now depend on the data itself rather than on the offset within the file, in this model only the first chunk will be different, so greater deduplication ratios will be achieved. Plenty of information can be found on this by searching for content defined chunking on Google (look for LBFS specifically).

A lot of the papers suggest using Rabin fingerprinting over a sliding window, a collision resistant method of fingerprinting a string based on random polynomials. After quite a bit of reading to try and understand this method (I suck at maths) I had the thought that it might be overkill. Collision resistance is not actually a property that is required for this task. Really the only thing required is that the same chunk boundary is found again whenever the same data turns up. What I did was to look at simpler ways of fingerprinting, in the end settling on three: a simple summation of the bytes, a crc32 and an adler32 (the latter two because they are in the Ruby zlib library). I take a sliding window over the bytes from a buffer, fingerprint the window contents and then use a bitwise AND with a mask to detect chunk boundaries. I need to run a bunch of comparisons now with different parameters to see which gives the best throughput, average block length, etc.
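A rough sketch of that sliding window approach is below, using crc32 from Ruby's zlib. The function name and the parameters (a 48 byte window, a 10 bit mask, an 8k maximum chunk size) are assumptions picked for illustration, not the values from my prototype. It also recomputes the checksum for every window position, which is simple but slow.

```ruby
require 'zlib'

# Content-defined chunking: declare a boundary wherever the fingerprint of the
# last `window` bytes has all of the masked bits clear. `max_chunk` forces a
# boundary if the pattern fails to turn up for a long stretch.
def cdc_chunks(data, window: 48, mask: 0x3FF, max_chunk: 8192)
  chunks = []
  start = 0
  (window..data.bytesize).each do |pos|
    fingerprint = Zlib.crc32(data.byteslice(pos - window, window))
    at_boundary = (fingerprint & mask).zero?
    next unless at_boundary || pos - start >= max_chunk

    chunks << data.byteslice(start, pos - start)
    start = pos
  end
  # Whatever is left after the last boundary becomes the final chunk.
  chunks << data.byteslice(start, data.bytesize - start) if start < data.bytesize
  chunks
end

original = Random.new(1).bytes(64 * 1024)
shifted  = 'ab'.b + original

before = cdc_chunks(original)
after  = cdc_chunks(shifted)
reused = before.count { |chunk| after.include?(chunk) }
puts "#{reused} of #{before.size} chunks reused after the insert"
```

Swapping Zlib.crc32 for Zlib.adler32 or a plain byte sum only changes the fingerprint line. With a 10 bit mask the average chunk should come out at around 1k on random data, and a wider mask gives bigger chunks.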


Backup system

Some people would say I am a sad geek, and they may be correct. For fun I've started (yet another) project at home, writing a very simple proof-of-concept disk backup system. I'm really doing it to explore the issues surrounding the technologies that may be present in such a system, because the backup system is part of my responsibilities at work and it is clear that the days of tape based backup here are numbered. But what to replace it with? Keep NetWorker and use disk devices as targets? What about smart disk targets that de-duplicate? We'd certainly want to minimise the amount of disk used by any backup system to keep it cost effective. The weekly total for just the full level backups being shoved off to tape is roughly 31TB (and growing).

So I'm having a bit of a play. No doubt this will never see the light of day in any really useful sense; I can't see me using it at work for instance. I may end up implementing it properly at home for backups from, say, Gemma's laptop.

As per usual with my kick-about projects at home I'm basing it on the Ruby language. I hesitate to use the words 'design decisions', as I'm not sure the scrawl on my pad could properly count as design. But this 'design' has two levels of deduplication, the first being single instance store and the second being block-level deduplication. Since I implemented block-level deduplication first I will talk about that first. Essentially all I am doing is carving up a file into blocks, storing them on disk if they are unique and keeping an index of the blocks that make up the file. The uniqueness is determined by a SHA256 hash of the block contents. Block size is an interesting question: too small a block and performance drops off dramatically, too large a block and the number of repeated blocks drops off. I am double checking uniqueness by a comparison of block contents if the block already exists at ingest time (although this could possibly be removed as the chance of a collision is so small).
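In outline the block-level side looks something like the sketch below. It is a simplification rather than the real code: the class and method names are invented here, and a Hash in memory stands in for the blocks on disk and the index.

```ruby
require 'digest'

class BlockStore
  attr_reader :blocks

  def initialize(block_size: 4096)
    @block_size = block_size
    @blocks = {} # SHA256 hex digest => block contents (stand-in for disk)
  end

  # Carve data into blocks, store the unseen ones and return the ordered
  # list of digests that describes the file.
  def ingest(data)
    index = []
    offset = 0
    while offset < data.bytesize
      block = data.byteslice(offset, @block_size)
      key = Digest::SHA256.hexdigest(block)
      if @blocks.key?(key)
        # Belt and braces: same hash should mean same contents.
        raise "hash collision on #{key}" unless @blocks[key] == block
      else
        @blocks[key] = block
      end
      index << key
      offset += @block_size
    end
    index
  end

  # Rebuild the original data from its list of digests.
  def restore(index)
    index.map { |key| @blocks.fetch(key) }.join
  end
end

store = BlockStore.new(block_size: 4)
index = store.ingest('aaaabbbbaaaacc')
puts index.size                # 4 blocks make up the file
puts store.blocks.size         # but only 3 unique blocks are stored
puts store.restore(index)      # aaaabbbbaaaacc
```

The tiny block size is only there to make the repeated block easy to see.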

The ingester writes its blocks to multiple destinations per block. Ey up, aren't you just duplicating that deduplicated data? Well, yes, but that is in the name of redundancy and fault tolerance. Although in development the system is writing blocks to a couple of subdirectories, these could easily be NFS mounts from systems in separate datacenters, for example. The index of blocks keeps a list of where the blocks have been written.
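Writing each block to several places could be sketched along these lines. Again this is only illustrative: write_block and read_block are names made up for this example, and temporary directories stand in for the subdirectories or NFS mounts.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Write one block into every destination directory, named by its SHA256
# digest, and return the list of locations for the index to remember.
def write_block(block, destinations)
  key = Digest::SHA256.hexdigest(block)
  destinations.map do |dir|
    FileUtils.mkdir_p(dir)
    path = File.join(dir, key)
    File.binwrite(path, block) unless File.exist?(path)
    path
  end
end

# Read a block back from the first location that still has it.
def read_block(locations)
  path = locations.find { |candidate| File.exist?(candidate) }
  raise 'block missing from every destination' unless path

  File.binread(path)
end

Dir.mktmpdir do |root|
  destinations = [File.join(root, 'store_a'), File.join(root, 'store_b')]
  locations = write_block('some block data', destinations)
  File.delete(locations.first)   # pretend one copy has been lost
  puts read_block(locations)     # still readable from the other copy
end
```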

As an aside, I decided to use the key-value NoSQL database Redis for the storage in this case, basically to play with it and because its list-centric operation lends itself quite nicely to this sort of thing.

So I can chop a file up into blocks and deduplicate based on those blocks. I can even do the reverse, which is to re-constitute a file from its deduplicated version. This is a good thing. Being able to get your file back is quite important.

The latest thing I've been looking at is a very simple client and web service for creation of savesets and uploading of files. I mentioned that I was using two levels of deduplication, the block based and the single instance store. For each file that the client processes it takes an MD5 sum and checks whether a file with that checksum already exists at that file path. If it exists the uploader merely updates pointers to the file in the saveset and carries on; if it doesn't exist it uploads the file first. After that will be the reverse, a simple recovery client, and job done.
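The single instance check boils down to something like the following sketch. It is not the real client or web service: the class name is invented, a Hash stands in for the server's catalogue and a counter stands in for the actual upload.

```ruby
require 'digest'

class SingleInstanceStore
  attr_reader :uploads

  def initialize
    @known = {}   # [path, md5] => true, stand-in for the server side catalogue
    @uploads = 0  # counts how many times a file body had to be sent
  end

  # Add a file to a saveset, uploading it only if this path and checksum
  # combination has not been seen before.
  def add_to_saveset(saveset, path, contents)
    key = [path, Digest::MD5.hexdigest(contents)]
    unless @known.key?(key)
      @uploads += 1 # the real client would upload the file here
      @known[key] = true
    end
    saveset << key  # the saveset only ever holds pointers
    saveset
  end
end

store = SingleInstanceStore.new
monday  = store.add_to_saveset([], '/home/gemma/notes.txt', 'first draft')
tuesday = store.add_to_saveset([], '/home/gemma/notes.txt', 'first draft')
puts store.uploads        # 1, the unchanged file was only sent once
puts monday == tuesday    # true, both savesets point at the same instance
```

The file path in the example is hypothetical. A changed file gets a different MD5 sum and so would be uploaded again.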


Enabling cut and paste in the VI client

For when I forget (again), you enable cut and paste from a VM accessed by the VI client by sticking in the advanced parameters, set to true: