Friday, July 26, 2013

How to create a 322107-part Virtual Disk

I generated this meme to remind me of the consequences...

Several months back I switched from Arch Linux to SuSE, then to Ubuntu (which I had used for many years in the past). Ultimately I ended up running back to Arch.

When I re-installed Arch on my laptop, I decided that I was going to try me some Btrfs. I hear all the time how it's "ready for production", and now SuSE is using it by default, and Netgear is using it in their newer ReadyNAS (and maybe ReadyDATA) products, blah blah blah. The install and everything was great, no real difference from any other file system.

I installed Snapper and started doing hourly snapshots of my data and that worked well.  After about a month or two of use the hard disk would sit there and spin; my computer would grind to a halt and I had to wait until it was finished.  The only clues I had were the btrfs-endio processes that were apparently the culprits.

When this reached the 5 minute mark I decided that something was really wrong.  My first thought was Snapper is taking a snapshot - so I disabled Snapper.

When the problem reached the 10 minute mark I finally decided that this had to be something that I could fix.

After I RTFM, err Wiki - I discovered that Btrfs doesn't play well with large files that have lots of small writes to them. It totally made sense once I thought about it.

Btrfs Wiki: Gotchas

Files with a lot of random writes can become heavily fragmented (10000+ extents) causing thrashing on HDDs and excessive multi-second spikes of CPU load on systems with an SSD or large amount of RAM.
On servers and workstations this affects databases and virtual machine images.

Btrfs is a copy-on-write file system: instead of overwriting data in place, it writes the new version to an empty portion of the hard disk. ZFS also behaves this way, and it's a great feature that makes quick snapshots possible and protects your data. If power dies while the file system is still writing, your original data is still intact, because the write is going to an empty part of the disk and leaving your original files alone. After the write finishes, the filesystem updates its data about the data (the metadata) to point to the new portion of the file. A snapshot simply keeps the old blocks in place while new writes go to other areas of the disk, so all your files are still there and just a few commands away from recovery.
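To see why this turns ugly for files that get lots of small random writes, here is a toy Python model. It is only a sketch under a big simplifying assumption: every overwritten block simply lands in the next free block on disk, which is not how the real Btrfs allocator works. The function names (simulate_cow_writes, count_extents) are made up for illustration.

```python
import random


def count_extents(physical):
    """Count runs of logically adjacent blocks that are also physically adjacent."""
    if not physical:
        return 0
    extents = 1
    for prev, cur in zip(physical, physical[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate_cow_writes(num_blocks, num_writes, seed=0):
    """Return the physical location of each logical block after random overwrites.

    Toy assumption: the file starts as one contiguous run, and every
    overwrite is redirected to the next unused block instead of being
    written in place.
    """
    rng = random.Random(seed)
    physical = list(range(num_blocks))  # freshly written file: one extent
    next_free = num_blocks
    for _ in range(num_writes):
        logical = rng.randrange(num_blocks)
        physical[logical] = next_free  # copy-on-write: never overwrite in place
        next_free += 1
    return physical


if __name__ == "__main__":
    for writes in (0, 100, 5_000):
        layout = simulate_cow_writes(num_blocks=10_000, num_writes=writes)
        print(writes, "random writes ->", count_extents(layout), "extents")
```

With zero writes the file is a single extent; every random overwrite after that tends to split an existing run, so the extent count climbs quickly even though the file never grows.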

With my virtual disk, however, this is terrible. It's a Windows 7 Pro virtual machine that I use for some Windows-only work applications. Windows likes to write things to the hard disk constantly, and each time Btrfs writes the changed blocks to another area of the disk.

So, a long story long: I ran filefrag on my vdi file and blam - 322107 fragments (extents) in this one file. That means each time I start my VM the computer has to read from all of these fragments, which sit in physically disparate parts of the hard disk.
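If you want to check your own files, something like the following Python sketch would do it. It assumes filefrag (from e2fsprogs) is installed and prints its usual one-line summary of the form "/path/to/file: N extents found". The helper names and the threshold constant are hypothetical; the 10000 figure is just the number from the Gotchas text quoted above.

```python
import re
import subprocess
import sys

# "Heavily fragmented" threshold taken from the quoted Btrfs Wiki text.
HEAVY_FRAGMENTATION = 10_000


def parse_extent_count(filefrag_output):
    """Extract N from a summary line like '/path/to/file: 322107 extents found'."""
    match = re.search(r"(\d+) extents? found\s*$", filefrag_output)
    if match is None:
        raise ValueError("no extent count found in filefrag output")
    return int(match.group(1))


def extent_count(path):
    """Run filefrag on path and return the number of extents it reports."""
    result = subprocess.run(
        ["filefrag", path], capture_output=True, text=True, check=True
    )
    return parse_extent_count(result.stdout)


if __name__ == "__main__" and len(sys.argv) > 1:
    count = extent_count(sys.argv[1])
    verdict = "heavily fragmented" if count >= HEAVY_FRAGMENTATION else "probably fine"
    print(f"{sys.argv[1]}: {count} extents ({verdict})")
```

Save it under any name you like and pass the path of the disk image as the only argument.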

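I tried to defragment the file with the btrfs defragment tool ($ btrfs fi defrag /path/to/file), but no luck; my system would just crash after a while and eventually reboot. I ultimately ended up copying the vdi to a new file with COW turned off for that file ($ chattr +C /path/to/file). One catch: on Btrfs, +C only works reliably on a new or empty file, so the flag has to go on the empty destination before the data is copied into it.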
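The copy step can be sketched in Python roughly like this. It is a minimal sketch, not a polished tool: copy_without_cow and its set_nocow parameter are hypothetical names, and the only external command it relies on is chattr +C. The order of operations is the important part: create the empty file, set the flag, then write the data.

```python
import shutil
import subprocess


def copy_without_cow(src, dst, set_nocow=True):
    """Copy src to a brand-new dst that has the No_COW attribute set.

    set_nocow=False skips the chattr call, which is handy for trying
    the function on a filesystem that does not support the attribute.
    """
    # 1. Create an empty destination; mode "x" refuses to clobber an existing file.
    with open(dst, "xb"):
        pass

    # 2. Set the flag while the file is still empty, before any data is written.
    if set_nocow:
        subprocess.run(["chattr", "+C", dst], check=True)

    # 3. Copy the data with plain reads and writes, in 16 MiB chunks.
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        shutil.copyfileobj(fin, fout, length=16 * 1024 * 1024)


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        copy_without_cow(sys.argv[1], sys.argv[2])
```

Afterwards, lsattr on the new file should show a C among the attributes, and once the VM boots fine from the copy the old fragmented file can be deleted.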

Moral of the story - RTFM.