« Feel the 1U love... | Main | Ruby debugger hacking »
April 30, 2005
File under 'R' for redundant
I love how over the past few years, there's been a lot activity in the distributed filesystem arena. Corporations are realizing how expensive it is to recreate data versus just keeping it indefinitely in the first place. And storage requirements keep growing higher and higher.
There are a few distributed filesystem implementations that I like, namely AFS, GFS (Google's implementation, not the original GFS), and Xsan. Each is designed for very different usage patterns, but all are worth a look for different reasons.
First, a quick recap of this trio. All three let you build SANs (storage area networks), but with varying degrees of coupling between your storage hosts. Additionally, AFS will let you create SWANs, which are like SANs, but connected over a WAN. From reading the GFS paper, it can theoretically support SWANs, but I don't know if Google is employing it in this fashion.
AFS
AFS is the Andrew File System, which was developed at Carnegie Mellon University, and was subsequently worked on by Transarc and IBM. Somewhere along the way, Andrew was dropped from the name and AFS now just means AFS. OpenAFS is the branch
most commonly used. AFS uses a cell-based architecture, where each cell corresponds to a geographic cluster of storage hosts. For example, if your company has three offices in New York, Paris, and Tokyo, each office would constitute a single cell.
AFS Volumes
Each cell subscribes to a set of volumes, where each volume holds a category of files and enforces its space allocation, replication counts, user ACLs, etc. The physical data for the volume is only stored at the cell where it is located, but it is possible to do read-only replication of the volume data to other cells for redundancy purposes.
Let's pretend I have three cells, /afs/home, /afs/work, and /afs/hawaii, corresponding to three locations I have AFS installed at, and that I have these volumes:
- /afs/home/config
- /afs/home/src
- /afs/work/work-mail
- /afs/hawaii/photos
By default, each volume is only stored on the cell where it is defined. The physical data for /afs/hawaii/photos only resides in my hut in Oahu. Other cells (home, work) can grab images from my hut by using the full cell path (/afs/hawaii/photos/hut.jpg), but they don't have a local copy of the data in my photos volume.
When a cell wants to edit a file located in another cell's volume, it grabs a copy of the file from the remote cell and stores a cached copy locally. All file operations happen on the local copy, until the file is closed. Once the file is closed, the updates are sent to the remote cell, and the new changes show up globally, since remote cells are grabbing a newly cached copy of the file whenever they read/write to it.
AFS Replication
Now, it doesn't always make sense to always grab files from AFS volumes remotely all the time. What if your files are heavy? What if you want to access more data than what can be streamed over the network between your cells? AFS addresses that with read-only replication.
The basic idea is that volumes will always have a master cell where they are stored (/afs/hawaii/photos), but that other cells can be configured to keep read-only local copies of those volumes. This places a burden on the master cell to determine when the replicas should be updated.
For example, let's say I have /afs/home/dvds, which holds a couple terabytes of ripped dvds (which I legally own, thank you very much). If I want to watch a film from Hawaii and don't have replication enabled, I have to depend on the network link between Seattle and Oahu to be reliable enough to stream my VOB file. That dog won't hunt, monsignor.
Now, if I setup my hawaii cell to have a read-only replica of /afs/home/dvds, then I can watch my dvds in sunny Hawaii and not worry about network latency. Whenever I rip a new dvd at home, I can issue a volume update and make the remote read-only replicas update their copies of my dvd volume. As you can imagine, AFS replication works best when you have data that isn't constantly being updated.
Commercial AFS Installations
AFS should be more famous than it is, it's the leader when it comes to distributed storage across WANs. A former colleague of mine used to work as a system engineer for one of the large brokerage firms on Wall Street, and he explained to me how AFS was great for their worldwide storage needs, since it even worked with their branch offices in SE Asia that didn't have the same bandwidth as their North American and European offices. The slower offices just subscribed to a smaller number of volumes that were vital, and picked up updates as they came along. The company had dozens and dozens of datacenters around the world with terabytes and terabytes of storage, and it all just worked (TM).
It's hard to find a public list of large companies that use AFS, probably because they see it as a secret weapon of sorts. It's not perfect; you have to construct your volumes carefully in order to manage update frequencies, and you'll probably need a dedicated AFS admin for large installs, but it's used by all sorts of industry leaders. The key is to go to AFS conferences and look at everone's nametags. You'll see employees of major financial institutions, auto manufacturers, e-commerce companies, retailers, government contractors, etc, etc.
AFS Downsides
AFS isn't perfect. It depends on Kerberos for user authentication, which is a headache all of its own. You have to carefully manage your volumes, especially when replication comes into play. While you can resize the quota of different volumes easily, there is still a bit of a tightrope act required in order to balance data across cells and volumes properly. (There's even a mailing list dedicated to balancing AFS.)
AFS is best suited for storing large filesets that don't have high update frequencies. It's great for distributing binaries across organizations, storing and replicating videos, mp3s, etc. If you have unflexible replication requirements (i.e. changes have to show up in read-only volume copies immediately after being commited to the master read/write volume), then you can't support high volume change velocities. That, or you're stuck with creating tons of very small volumes in order to gate your updates per-volume per-timeslice, which increases the admin overhead of your AFS deployment.
All that said, I rather like AFS, and have been considering deploying it for the home storage network I've been working on. I've been ripping all of my media to disk, and AFS complements rips nicely. They aren't changed very often, but they are heavy and I want to make sure that my media is replicated offsite. This is where read-only replication works wonders.
GFS (Google File System)
GFS is Google's distributed file system (read the whitepaper here). Its design is Google-centric (naturally), so it assumes that it will run on commodity hardware with non-RAID drives. It optimizes for bulk reads/writes over random reads/writes, and it relaxes some of concurrency requirements of normal distributed file systems in favor of letting application logic detect anomalies.
GFS isn't a traditional file system, though. GFS clients use a custom library to read/write files, and GFS doesn't use the normal kernel-level vnode hooks that other file systems use. The server side also runs completely in userland.
Basic GFS Design
GFS breaks files into 64MB chunks. Chunks are stored on hosts called chunk servers. Each chunk is replicated onto three chunk servers, which optimally are in different physical racks. Chunks are placed in servers which have the most free space available, and over time, their placement is fairly randomized.
Here is a diagram of a GFS cluster, taken from Google's whitepaper on GFS that I linked to above:

A master server stores a directory of which chunks are available and which chunks reside on each chunk server. The client API keeps track of a cluster's chunk size, and translates the read/write requests of the user into chunk offsets for a given file.
GFS Read Example
For example, let's say I grabbed a 150MB apache log and stored it in my GFS cluster as httpd.log two weeks ago, and now I want to read it back and build a unique list of IPs that hit my web server...
I would ask the GFS client API to read the file httpd.log, starting at byte offset 0. Since that is within the first 64MB of the file, it would ask the master server for a list of chunk locations for httpd.log's first 64MB chunk. The master server then returns a global chunk id for that file/chunk offset combo, plus a list of chunk servers that hold the chunk mapped by that chunk id. The client library looks at the list of chunk servers and picks the closest one, grabbing the chunk data and sending the bytes back to my read call.
These steps are repeated for chunks 2 and 3 of httpd.log (64-128MB, 128-150MB). The third chunk isn't 64MB, but only 22MB. Chunks are allocated as their size fills up, so it's easy for the client library to know it's reached the end of the file.
The chunks are streamed to the client library in proper chunk order, and the user doesn't have to know what's happening under the covers. The large chunk size means that read bandwidth stays pretty high.
GFS also has some optimizations; multiple chunk offsets are requested from the master server at one time to cut down on back-and-forth traffic between the client and the master, and chunk/chunk server metadata (but not actual the actual payload) is cached on the client side.
The key to the system is that while the master server holds the directory service that maps abstract files to chunk id lists and the servers that hold the chunks, the actual streaming of the data for a chunk is a direct communication between the client and one of the three chunk servers that holds the chunk.
GFS Writes
Similar to how the client talks to chunk servers directly for reads, it also talks to them directly for writes. If you're writing to a file, the master server is queried for the chunks that will need to be mutated. For each chunk, the master server picks one of the three chunk servers that hold a replica of that chunk, and makes it the master replica server.
All the mutations for that chunk are applied on the master replica server, and then it turns around and tells the other two chunk servers to update their replicas of the chunk to match its new copy. Once all three chunk servers have confirmed that they have written the new chunk to stable storage, the master replica returns a success code to the client, which moves to write the next chunk.
GFS also has optimizations that make it possible to write several chunks at once, and to have multiple simultaneous writers on a chunk (as long as they are appending and not writing at random offsets in the chunk, which isn't a common case for writes anyways). The interesting thing is that while GFS guarantees that multiple writes can occur atomically, it doesn't guarantee that a given write happens only once. This means that the records appended to chunks have to contain some sort of header or sequence id inside them, otherwise the application that's reading the data might accidentally process a record multiple times.
The nice thing is that consumers of the client library aren't issuing low-level chunk read/write calls, they're just using the api to write to a file.
Interesting GFS Architecture Items
GFS has other features, but you should really just read the whitepaper to get a better idea for what it can do. Because Google wrote it specifically to address their internal needs for a filesystem, it is pretty specialized to Google's business domain.
For example, it uses a non-standard client library for access to files, and is much more focused on streaming high-bandwidth reads and writes versus supporting random file access. Concurrency on writes is supported, but you don't get the guarantee of the write happening only one time. Latency isn't emphasized much, but bandwidth sure is.
One interesting thing is that Google was seeing i/o corruption with early deployments of GFS, so they built-in a checksum system that sits at the chunk server level (not the master server level). Chunks are checksummed into 64KB blocks, so each chunk has 1024 checksum entries. The checksums are checked on both reads and writes, and stored in memory on the chunk servers. A common theme for GFS is aggressive storage of metadata in memory, with checkpoint flushing to disk for critical things.
I also like that chunk servers are the authority for which chunks they hold, the master server doesn't have a persistent authoritative store of which chunks are in which servers. Instead, when a chunk server starts up, it contacts the master server and tells it which chunk ids it can serve up. Once a chunk server is running, it also periodically sends a heartbeat back to the master server, making sure it has an up-to-date list of all the chunks it holds. Google realized that it would have been more complex to make the central server be authoritative, and who better to know what a chunk server holds than the chunk server itself? I like that.
One small thing I'm not a fan of is calling the central directory server the master server, it think it can be confusing at times. I don't know what I would call it, maybe something like chunk directory or chunk router. Heh, or maybe... Chunk matchmaker. :-)
If you read the GFS paper, I highly recommend reading these sections closely:
- 2.6.3 -- Operation Log
- 3.1 -- Leases and Mutation Order
- 3.2 -- Data Flow
- 3.3 -- Atomic Record Appends
- 3.4 -- Snapshot
- 4.3 -- Creation, Re-replication, Rebalancing
- 5.1.3 -- Master Replication
- 5.2 -- Data Integrity
And sections 6 & 7 (benchmarks and real-world GFS experiences) are a fun read too. GFS fills its niche pretty well!
Xsan
Xsan is Apple's entry into the SAN market. It runs on top of Xserve RAID boxes and supports storage clusters of up to 16TB. Xsan uses fibre channel for its fabric, so it's tuned for short-range, high-throughput storage. Xsan's technology overview paper goes into good detail on how it works.
Xsan is an unapologetic, super-high bandwidth storage solution. Here are just a few features it has that are uncommon for a NAS/SAN system:
- Reservation of FC bandwidth so hosts don't have to worry about contention when streaming high-bandwidth payloads like HD video.
- Supports different RAID levels for different volumes. For example, use RAID-5 for scratch storage when working on a commercial for a client, but store the final version on mirrored RAID-1 for peace of mind.
- Metadata controller uses gigabit ethernet to manage reads/writes between clients and the SAN, freeing up the FC fabric to just shove raw payloads around.
- Metadata controller can automatically fallover to a sequence of backup hosts, or do an impromptu election and pick a new host.
- 16TB is the limit for a virtual Xsan volume, but a single metadata controller can support multiple volumes.
- Clients can have two FC connections to the SAN, and access to a given volume is automatically routed through whatever connection is loaded less. This means you can have two FC fabrics attached to the same SAN, or make each host attach to two SANs at once. 32TB of virtual storage from a single client, anyone?
And you get a host of gui tools to help you configure and manage your differrent volumes, setup access groups, measure SAN throughput, etc, etc. Oh, if only I had a budget for HPC clusters, I would get a rack of Xserves and a few Xserve RAIDs, and put Xsan to use. When you combine the throughput of Xsan on Xserve RAID's already beefy hardware, and then attach machines with wide and deep buses like the Xserves, well, it seems like you're getting as close as you can to supercomputer territory without paying a few million bucks. I keep hearing good things about Apple's server market, and I hope they keep fighting this fight, their server solutions are excellent and relatively affordable when you compare them to the competition. Casual PowerMac owners might freak out when hearing what these solutions cost, but people who work with storage and servers for a living realize that for what you get, it's a bargain.
Why talk about these filesystems?
You might be wondering why I spent all this time discussing this trio of filesystems. Well, I talked about AFS because I've been thinking about using it for helping me store and manage all of my media and email archives. Offsite replication is pretty straightforward, and I get POSIX-y access to my files without having to change much application logic. It's mature, stable, and is a much better alternative than NFS for what I want to do with it.
As far as GFS goes, well, I guess I'm just a fan of GFS's principles, I like how Google was unapologetic about its design and tweaked it as much as possible for their problem domain. It's not a general purpose distributed filesystem for tech companies, but it doesn't have to be. That said, it has a lot of interesting ideas, and I like how its design isn't overly complex with lots of distributed locks and transaction managers.
And last but not least, I like Xsan because Xserve RAID is damn sexy, and Xsan exploits its performance well. GFS/AFS use ethernet, but Xsan uses fibre channel links (for smokin' speed) and has good concurrency AND throughput. I don't have the money to afford Xserve RAID, but if I did, I would use Xsan for my home storage needs. Xserve RAID + Xsan is relatively cheap compared to the SAN solutions offered by other companies in the storage market. You wouldn't believe how much NetApp and friends charge for their storage cabinets...
I should have my storage cluster setup six months from now, I'm going to have a heavy node at home that stores all my media, and have an offsite node that stores vital things like email, config files, source code, etc. It will be interesting to see where my storage is at a few years down the road once HD dvds are more common. I'm thinking that 5TB won't seem like a lot by then...
Posted by djb at April 30, 2005 01:25 AM