<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
   <channel>
      <title>Bitwise Philosophy</title>
      <link>http://blog.beaver.net/</link>
      <description></description>
      <language>en</language>
      <copyright>Copyright 2006</copyright>
      <lastBuildDate>Wed, 20 Jul 2005 12:10:48 -0800</lastBuildDate>
      <generator>http://www.sixapart.com/movabletype/?v=3.2</generator>
      <docs>http://blogs.law.harvard.edu/tech/rss</docs> 

            <item>
         <title>Movement</title>
         <description><![CDATA[<p>I'm super busy wrapping up the sale of my place and packing up everything, so I haven't had much time for non-commercial hacking lately.  I've got a VPN and have been working remotely for my new job, so that has been working out well.  My place cleaned up real nice, I'm taking pictures of all the staging that got done so I have good ideas for things to do with my stuff once I move in down south.  I found a cool apartment in Sunnyvale, I'm right near a shopping complex with grocery stores, restaurants, etc, etc.  Should be pretty swanky.</p>

<p>I'll be flying back and forth for training and then leaving town the first week of August, driving down to California in my car.  The only stuff I won't let the movers pack up are my computers, so they will be packed away in my trunk.  I'm debating whether or not to do a one day drive straight down (which is doable if I leave Seattle at 4am), or to do a two night drive.  I kind of don't want to stop, but I won't be driving if I get exhausted.  I'm setting up a little iPod mount for my dash, so I'm looking forward to having lots of tunes ready, I'm creating a special 24 hour long playlist for the drive.</p>]]></description>
         <link>http://blog.beaver.net/2005/07/movement.html</link>
         <guid>http://blog.beaver.net/2005/07/movement.html</guid>
         <category></category>
         <pubDate>Wed, 20 Jul 2005 12:10:48 -0800</pubDate>
      </item>
            <item>
         <title>No more GC loops</title>
         <description><![CDATA[<p>I fixed the cycles in the mark phase.  I had a few bugs in my changes where the mark table (what I'm calling the hashtable that holds the list of marked node addresses) wasn't getting updated properly, so certain nodes were being continually marked in a runaway loop.</p>

<p>It was interesting to fix, since I couldn't tickle the problem with simple test scripts.  I had to use the miniruby build calls used when compiling ruby from source.  Debugging was difficult as well, since the cycle didn't occur immediately and it was hard to tell how much work had been done when cycles occurred.  So, I added extensive logging statements to watch all the state changes and see what addresses were being marked in what order.  I then wrote a small script to parse the GC debug logs and find the cycle nodes.  It ended up working pretty well!</p>
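
<p>A minimal sketch of that kind of log-parsing script.  The log format here (one <code>mark 0x...</code> line per marked node) is an assumption made up for illustration, since the real GC debug output isn't shown, and <code>repeated_marks</code> is a hypothetical helper name.</p>

```ruby
# Hypothetical log format (an assumption, not the real GC debug output):
#   mark 0x0804a010
#   sweep 0x0804a030
# An address that keeps reappearing in the mark log hints at a runaway loop.

# Returns a hash of address => times marked, keeping only addresses
# that were marked more than `threshold` times.
def repeated_marks(lines, threshold = 1)
  counts = Hash.new(0)
  lines.each do |line|
    match = line.match(/\Amark\s+(0x[0-9a-f]+)/i)
    counts[match[1]] += 1 if match
  end
  counts.select { |_addr, n| n > threshold }
end

log = <<~LOG
  mark 0x0804a010
  mark 0x0804a020
  mark 0x0804a010
  sweep 0x0804a030
  mark 0x0804a010
LOG

repeated_marks(log.lines).each do |addr, n|
  puts "#{addr} marked #{n} times"
end
```

<p>Counting repeats only finds the suspicious nodes; working out the order in which they were marked still means reading the surrounding log lines.</p>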

<p>I still have some bugs to fix on the sweep phase, but the mark phase is looking relatively complete.</p>

<p>I've also got to put together a few small slides for the seattle.rb meeting on Tuesday night, since I'm giving a progress report on the GC fixes.  I'm not sure if I will have the fixes working by Tuesday, but at least I'll be able to explain the design and implementation of the changes.</p>

<p>As an aside, it's been a while since I wrote bare C, so this project has been fun to work on.  I don't miss C's tedious memory management when I'm working in Java, but there are times where I miss the terseness and flexibility of pointers, and being able to tightly control and keep track of allocated memory.  Don't worry though, it doesn't take a lot of C to make me want to start doing Java and Ruby code again.  It's just that having lots of constraints and steps can be oddly liberating sometimes.</p>]]></description>
         <link>http://blog.beaver.net/2005/06/no_more_gc_loops.html</link>
         <guid>http://blog.beaver.net/2005/06/no_more_gc_loops.html</guid>
         <category></category>
         <pubDate>Sat, 25 Jun 2005 23:54:06 -0800</pubDate>
      </item>
            <item>
         <title>GC update</title>
         <description><![CDATA[<p>I've been doing lots of move-related stuff, but I've been able to make progress on the Ruby GC work in the evenings.  I'd say the work is 70% done.  There are a few unintended cycles that I need to close, but I've got lots of printfs in there and I've spent a lot of time in the debugger.  I bundled all my changes inside C preprocessor (CPP) conditionals, so I can flip a switch and build either the old or the new code as desired.</p>

<p>I'm hoping to have it in beta form for the seattle.rb meeting at the end of the month, which will be my second-to-last meeting.</p>

<p>While doing some packing in my office this week, I found an old handout from a GC tutorial I took at OOPSLA 2000.  I had forgotten all about it, but reading through the slides, I realized it was full of useful hints and improvements to make with the mark/sweep family of garbage collectors.  The handout is 130+ pages long and full of diagrams, plus handwritten notes I made while taking the tutorial, so it was a bit of a find.  And very timely, considering the GC stuff I'm doing this month.</p>

<p>This upcoming week is mostly going to be spent going through all my unpacked boxes in the garage, and all the papers and books in my office, and deciding what I'm going to take with me.  I aim to only take 30% of what I have now, just to keep the clutter down.  It should be fun!</p>]]></description>
         <link>http://blog.beaver.net/2005/06/gc_update.html</link>
         <guid>http://blog.beaver.net/2005/06/gc_update.html</guid>
         <category></category>
         <pubDate>Sun, 19 Jun 2005 14:16:11 -0800</pubDate>
      </item>
            <item>
         <title>OS X Intel builds</title>
         <description><![CDATA[<p>Well, Apple announced today that they're moving to Intel.  Wild!</p>

<p>I'm pretty pleased with this development, especially because now all the player haters get to eat crow.  There were so many people saying that it couldn't be done, that the porting would be too tough, that Apple would lose market share...  Well, the market share part is yet to be determined, but I think a move to Intel can only spell good news for Apple.</p>

<p>And one can only hope that as time goes on, OS X gains more market share.  It's my favorite development platform, both for the ease of development, and wide range of features developers can use to build their apps.  I don't know if Apple would ever just sell the OS and let it run on any type of x86 hardware, but pretty soon, developers will have the luxury of a single architecture that can more or less run Windows or OS X.  How many people will stick with Windows, given the glut of spyware and viruses?  I'm hoping there are more cheap Mac Mini-like machines coming down the road as well.</p>

<p>The other great upside is that video cards should speed up again.  I don't think the drivers will be swappable, but at least we won't need separate video cards for our Macs anymore.  You never know on the driver front, though: FreeBSD 6 has support for using binary Windows network drivers, so maybe there's some magic glue that will let OS X take advantage of the tuned Windows video card drivers.  Either way, I'm hoping that Apple spends time on OpenGL performance in Leopard, since Tiger is fast on the UI side, but not as fast as it could be when crunching OpenGL for things like video or image servers.</p>]]></description>
         <link>http://blog.beaver.net/2005/06/os_x_intel_builds.html</link>
         <guid>http://blog.beaver.net/2005/06/os_x_intel_builds.html</guid>
         <category></category>
         <pubDate>Mon, 06 Jun 2005 12:37:41 -0800</pubDate>
      </item>
            <item>
         <title>I&apos;ve got a new job...</title>
         <description><![CDATA[<p>Well, this is exciting news...  I'm moving to Silicon Valley in a couple of months.  I'll be living somewhere around San Jose, but I'm not sure exactly where yet.  I'm also looking at living in San Francisco and taking mass transit south for my daily commute.</p>

<p>So the next couple months are going to be very busy for me.  I'll be going through all my stuff, and trying to only take 25% of it down with me, with the rest going to Goodwill and the local dump.  I'll be finding a new home for my Sony WEGA tv, because there's no way in hell I'm getting movers to carry that 350lb behemoth into my new place, the last two times I moved it was two times too many.  I've had people move that tv into four residences in the past five years.  I'll also be preparing my condo for sale, which should be fun, I'll end up making all the little improvements I wanted to do over the past couple years but didn't have time to do yet.  Optimally, I wouldn't be selling it this summer but instead in early fall, but we'll see what kind of interest there is.  I have a great view of Lake Union from my deck, so I'm thinking it will go pretty quickly.</p>

<p>I've worked out a budget for air travel, so barring any global crash in the oil market, I'll be able to fly back to Seattle every couple months and for Thanksgiving and Christmas.  I'm really looking forward to exploring San Francisco's food options, and hoping that I'll be able to find a relatively urban place that has character.  I wouldn't be happy in the suburban developments.</p>

<p>It's sad to leave Seattle, but I've never lived anywhere but here, so moving somewhere more cosmopolitan is very appealing to me.  The next few months are going to be pretty wild.</p>]]></description>
         <link>http://blog.beaver.net/2005/06/ive_got_a_new_job.html</link>
         <guid>http://blog.beaver.net/2005/06/ive_got_a_new_job.html</guid>
         <category></category>
         <pubDate>Sat, 04 Jun 2005 17:50:10 -0800</pubDate>
      </item>
            <item>
         <title>seattle.rb hackfest and monthly meeting</title>
         <description><![CDATA[<p>The seattle.rb hackfest last weekend <a href="http://blog.zenspider.com/archives/2005/05/rubygems_hackfe.html">turned out pretty well</a>.  I was only able to make it on Saturday afternoon, since I had homework, but it was a lot of fun.</p>

<p>We also had a seattle.rb meeting on Tuesday night, using the new meeting location at Amazon's building down in the international district.  The ID is a great place for user group meetings, there are several great restaurants for pre-meeting meals.  A group of us met up at Shanghai Garden, where we introduced each other to our favorite dishes.  Afterwards, we walked through the Uwajimaya grocery store and got ice cream at the milk tea stand.</p>

<p>The meeting was interesting, it was a little less structured than normal.  We had nine people show up, with a mix of experience represented.  We summarized the hackfest changes to rubygems and that led to an impromptu demo of 43 things by Eric.  This led to a discussion of how different people were using Ruby at work, and what were the shortcomings that people had run into.  The main thing mentioned was how the green threads made things difficult at times, and the lack of a unified bundling format for third-party ruby code out there, and the large amount of abandonware there was.  rubygems is addressing the packaging and install aspects of third-party code, but there's a lot of good stuff out there that is either abandoned or only has docs in Japanese.</p>

<p>I got the room to do a little brainstorming on new debugger features, but it turned out that most of the users at the meeting hadn't used it much before.  So we took some time to cover some of the debugger's more interesting features.</p>

<p>Eric also demoed his new database query grapher.  It's pretty sweet.  Rails keeps a detailed log of all queries done, so Eric wrote a script that parses the sql trace for each query and finds all queries that joined tables together.  For each join pair, he created an edge in a graph, with the label being how many times those two tables were joined in the logs.  He plugged the graph object into a Graphviz dot file, and then used the <a href="http://www.pixelglow.com/graphviz/">OS X Graphviz client</a> (I use it too, it rocks!) to manipulate the graph to show us which tables were most heavily linked together.  This was an interesting way to determine how to partition the existing 43 Things database into multiple databases.</p>
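
<p>The idea behind the grapher can be sketched roughly as below.  This is not Eric's script: the sample queries, the regexes, and the helper names <code>join_counts</code> and <code>to_dot</code> are all assumptions for illustration, and real Rails logs would need a sturdier SQL parser.</p>

```ruby
# Counts which table pairs appear together in logged SQL joins and emits
# a Graphviz dot graph with the join count as the edge label.
# The regexes only handle simple "FROM a JOIN b" queries (an assumption).

def join_counts(sql_lines)
  counts = Hash.new(0)
  sql_lines.each do |sql|
    from = sql[/\bFROM\s+(\w+)/i, 1]
    next unless from
    sql.scan(/\bJOIN\s+(\w+)/i).flatten.each do |joined|
      pair = [from, joined].sort   # undirected edge: order doesn't matter
      counts[pair] += 1
    end
  end
  counts
end

def to_dot(counts)
  edges = counts.map do |(a, b), n|
    %(  "#{a}" -- "#{b}" [label="#{n}"];)
  end
  (["graph joins {"] + edges + ["}"]).join("\n")
end

# Made-up sample queries standing in for a real query log.
log = [
  "SELECT * FROM users JOIN goals ON goals.user_id = users.id",
  "SELECT * FROM goals JOIN users ON users.id = goals.user_id",
  "SELECT * FROM goals JOIN entries ON entries.goal_id = goals.id",
  "SELECT * FROM users WHERE id = 1"
]

puts to_dot(join_counts(log))
```

<p>The printed text can be saved to a .dot file and opened in any Graphviz viewer; heavily joined table pairs are candidates for staying in the same database.</p>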

<p>I also discussed a change I'd like to make to the Ruby garbage collector.  I posted back in March how Ruby's garbage collector <a href="http://beaver.net/blog/archives/2005/03/ruby_gc_and_cop.html">destroys copy-on-write semantics</a> because when it does its mark pass through the list of ruby objects, it does the marking on the objects themselves instead of a separate mark tree.</p>

<p>I used a small program that required several libraries, and then forked off 10 children.  In the normal case, the kids didn't do anything, so they used 250KB of memory apiece.  When I flipped a flag, each child forced a GC run as soon as it was spawned, and as a result each child consumed 2500KB, or 90% of what the parent was using.</p>
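
<p>A sketch of what such a test program might look like.  It is a reconstruction, not the original: the required libraries, the <code>--gc</code> flag, the idle time, and the <code>spawn_children</code> helper are placeholders, and memory has to be measured from outside with a tool like ps or top.  Results will depend on the Ruby version; the numbers above describe the collector as it stood in 2005.</p>

```ruby
# Forks `count` children; each one optionally forces a GC run and then
# idles so its memory use can be inspected from another terminal.
# The libraries below are arbitrary stand-ins for "several libraries".
require "set"
require "time"

def spawn_children(count, force_gc:, idle_seconds:)
  Array.new(count) do
    fork do
      # With in-object mark bits, a GC run writes to every live object,
      # which forces the kernel to copy the pages shared with the parent.
      GC.start if force_gc
      sleep idle_seconds
    end
  end
end

# Pass --gc on the command line to flip the "force a GC run" flag.
pids = spawn_children(10, force_gc: ARGV.include?("--gc"), idle_seconds: 0.1)
pids.each { |pid| Process.wait(pid) }
puts "reaped #{pids.size} children"
```

<p>Raise <code>idle_seconds</code> to something like 60 to leave time for comparing the children's memory use with and without the flag.</p>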

<p>This doesn't seem like a big deal, since most small Ruby programs just use threads for concurrency, but this ignores the juggernaut that is Rails.  Production Rails installs use small httpds for the frontend that connect to Rails FastCGI processes running on the backend.  43 Things is consuming approximately 50MB of RSS (i.e. non-shared and actually allocated) memory per FCGI child.  They don't have to, though.  Each child should only be consuming 2-5MB apiece, and if Ruby's GC didn't stomp on copy-on-write semantics, they would.</p>

<p>The work of moving the marks to a separate tree structure isn't as bad as you'd think, since all the code for traversing the nodes and marking them up only exists in gc.c.  Ruby already provides a nice C tree implementation, so it's basically a task of refactoring 500 lines of a 2000 line file.  Once the work is done, forked children will consume much less memory than they do now, which means that Rails installs will be able to scale more linearly.  Right now, you can only run so many Rails FCGI processes on a box before you start to thrash virtual memory.</p>
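
<p>To make the design concrete, here is a toy model of marking into a separate structure.  It is only an illustration in Ruby: the real change lives in C inside gc.c, the tiny hash-based heap is made up, and a Set stands in for the tree mentioned above.</p>

```ruby
require "set"

# Toy heap: node id => ids it references.  Purely illustrative; the real
# collector walks C structs, not Ruby hashes.
heap = {
  a: [:b, :c],
  b: [:a],     # cycle back to :a
  c: [],
  d: [:c]      # not reachable from the root
}

# Mark phase that records marks in a separate set instead of writing a
# flag into each node, so the nodes themselves are never modified.
def mark(roots, heap)
  marked = Set.new
  stack = roots.dup
  until stack.empty?
    node = stack.pop
    next unless marked.add?(node)   # already marked: skip (also ends cycles)
    stack.concat(heap.fetch(node))
  end
  marked
end

# Sweep phase: anything missing from the mark set is garbage.
def sweep(heap, marked)
  heap.keys.reject { |node| marked.include?(node) }
end

marked = mark([:a], heap)
puts "garbage: #{sweep(heap, marked).inspect}"
```

<p>Because the heap is only read during marking, a forked child running this phase would dirty the pages holding the mark set rather than the pages holding the objects.</p>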

<p>I've got at least one hacker who's interested in doing the work with me, so we're going to tackle that over the next couple months.  I'm thinking it would only take a week to make the bulk of the changes, and then a few more weeks to flush out the bugs and write some good unit tests to ensure correctness.</p>]]></description>
         <link>http://blog.beaver.net/2005/06/seattlerb_hackfest_and_monthly.html</link>
         <guid>http://blog.beaver.net/2005/06/seattlerb_hackfest_and_monthly.html</guid>
         <category></category>
         <pubDate>Thu, 02 Jun 2005 10:56:23 -0800</pubDate>
      </item>
            <item>
         <title>E3 and recent announcements</title>
         <description><![CDATA[<p>A good friend of mine went to E3 to cover it for the local newspaper recently, so I got to hear about all the exciting video games that are going to be coming out soon.  That was my big excitement for the past couple weeks, that, and hearing the final specs for the Cell processor in the PS3.</p>

<p>ArsTechnica did a good article covering the <a href="http://arstechnica.com/articles/paedia/cpu/xbox360-1.ars">Xenon, the new IBM-made processor inside Xbox2</a>.  I'm still much more of a Cell fan than a Xenon fan, but my interest lies more in the video and audio stream capabilities of the Cell than its video game capabilities.  And I'm getting moderately excited for the Xbox2, although the odds are very low I'd buy one, simply because one, I've bought into the PlayStation franchise since the PS1, and two, I don't really feel like rewarding Microsoft for their bad behavior, even if Xbox is a separate unit in the company.</p>

<p>I almost bought an Xbox because of Splinter Cell, though.  Luckily, the second and third installments in that series came out for PS2.  SC3 has reworked the controls and features, it's much more fun than Metal Gear 3 was (I know, sacrilege!).</p>

<p>Recently, Slashdot linked to <a href="http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=163106213">an EE Times article on IBM opening up the Cell architecture</a>, presumably to entice open source involvement.  I was pretty excited to read that.  I just hope that the hardware options don't suck the way the PS2 Linux kits did.  Yes, they booted Linux, but you had to have a special boot CD to run software you wrote (i.e. you couldn't make DVDs that held Linux images and give them to friends who didn't have Linux kits), and most of the interesting bits of the hardware weren't documented.</p>

<p>Now if you can afford the development box (and a Sony license), you can develop games for the PS2.  The outlay there is something like $10k for the hardware, and substantially more for the license to distribute, but I forget the exact numbers.  The PS3's hardware specs are great, so I'm hoping that Sony relaxes their stance a bit and gives the open source tinkerers more access to the hardware.  Imagine MythTV running on a PS3, that would be great.</p>

<p>Hmmm, what else...  The local Ruby group is having a hack fest this weekend, so I'm going to take some time and show up for the hacking and socializing.  I haven't done a lot of Ruby the past six months, just the RubyCocoa work I did.  I really miss it, but with the Java and HTTP stuff I've been doing, I haven't had a lot of extra time.</p>]]></description>
         <link>http://blog.beaver.net/2005/05/e3_and_recent_announcements.html</link>
         <guid>http://blog.beaver.net/2005/05/e3_and_recent_announcements.html</guid>
         <category></category>
         <pubDate>Wed, 25 May 2005 23:42:19 -0800</pubDate>
      </item>
            <item>
         <title>Disks in the ether</title>
         <description><![CDATA[<p>I was browsing the new Linux Journal and saw the article on ATA-over-Ethernet, where raw disk blocks are transmitted over ethernet and used to build cheap SANs, instead of the traditional fibre channel route.</p>

<p>FC gives you raw access to drives, so this is taking the same approach but with ethernet.  I don't quite know what I think about it yet.  But it sounds promising.</p>

<p>Upon further research, I also found iSCSI, which implements SCSI over ethernet.  Originally, iSCSI just ran on gigabit networks, but with 10GbE becoming common in datacenters, it's becoming very competitive with FC, which normally goes 2Gbps and can do up to 4Gbps.  And there are new iSCSI enhancements that let you do error correction and route fixing, similar to how Xsan detects breaks in the FC fabric and can pass blocks through one of N controllers.</p>

<p>I need to read more about iSCSI, but it sounds very interesting.  I had never heard of ethernet-based solutions for reading/writing raw disk blocks before now, but it has several upsides over traditional FC SANs, the main ones being higher speed and unlimited range.  Well, unlimited range in the case where you're streaming versus doing random access.  If you're reading a 50MB range of contiguous blocks from a remote disk, then the N ms latency isn't a big deal since you're going to take an initial pause of N ms and then get all your data in one nice stream.  But if you're doing random seeks on a disk that's 50ms away, then you're gonna cry.</p>
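
<p>A back-of-the-envelope model of that difference.  Everything here is an assumption for illustration: requests are issued one at a time with no pipelining, each pays a full round trip, the link delivers 100MB/s, and random reads are 4KB each.  <code>transfer_seconds</code> is a made-up helper.</p>

```ruby
# Rough model: every request pays one round trip, then the data streams
# at link speed.  All numbers below are illustrative assumptions.
def transfer_seconds(total_bytes, request_bytes, latency_s, bytes_per_s)
  requests = (total_bytes.to_f / request_bytes).ceil
  requests * latency_s + total_bytes.to_f / bytes_per_s
end

mb        = 1024 * 1024
total     = 50 * mb
latency   = 0.050        # 50 ms round trip
bandwidth = 100 * mb     # assumed 100 MB/s of usable throughput

streaming = transfer_seconds(total, total, latency, bandwidth)  # one big read
random    = transfer_seconds(total, 4096, latency, bandwidth)   # 4 KB at a time

printf("streaming: %.2f s\n", streaming)
printf("random 4 KB reads: %.2f s\n", random)
```

<p>Under these assumptions the single streaming read takes about half a second, while 12,800 serial 4KB reads spend over ten minutes just waiting on round trips.</p>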

<p>I'm thinking the WAN disk solution probably isn't used as much as the LAN/SAN one, so latency probably isn't a big deal for most installs.  But since you're making the disks available as a raw device, I wonder how disk fragmentation comes into play and whether or not that starts to limit your throughput.  I'm guessing FC installs have similar issues, but since their range is so limited, other sources of error/slowness show up first.</p>

<p>Using ethernet instead of FC makes sense from the perspective that so much more brainpower is going into making ethernet-based switches, routers and computers use the network as efficiently as possible.  There might be 1000 engineers worldwide working on FC enhancements, but you could imagine there's 100,000 working on ethernet hardware and protocols.  FC does do things that TCP/IP can't, so it's not like FC is going to die in the storage market, I just think that for many sites, its cost won't make up for its benefits, and iSCSI or an alternative implementation will be used.</p>

<p>And that doesn't mean that userland SAN (like Google FS) is going away anytime soon either.  It's certainly the cheapest option out there for people who want to build storage clusters, and putting abstractions in front of the raw disk blocks and serving up data via HTTP or similar protocols gives you capabilities you can't do with iSCSI.</p>

<p>For example, while some companies use SAN systems for storing databases or high update velocity filesets, a lot of folks use SANs to store filesets that are fairly static.  Think Netflix storing raw VOB images of all the DVDs they rent, or Amazon's archive of product images.  Once you store a copy of something, you rarely update it, if at all.</p>

<p>And if your objects aren't being updated very often, then that screams out for caching.  Using HTTP or similar protocols to transport data lets you plug in caching and load balancing pretty easily, but raw blocks over the network with solutions like iSCSI aren't exactly easy to cache.  I'd imagine they're pretty hard to load balance as well.</p>

<p>But, I admit that my brain is fairly addled with the userland SAN point-of-view, so I'm going to do some research on blocks-over-the-network technologies and think about them some more.  The major storage players are deploying/selling iSCSI systems now, and they're not dummies, so I want to read up more on the technologies.</p>

<p>As an aside, it seems that the more interesting technologies I run into, the more my home datacenter (and its corresponding budget) expands.  Some people buy boats, some buy cabins at ski resorts, but I spend my extra money on servers, disk, and switches.  As it is now, I can't run my hairdryer in the bathroom, since my office is on the same circuit, and the hairdryer combined with my machines trips the circuit breaker.  If I end up staying in my place, I'm going to get an electrician to come over and install two isolated circuits in my office so I can run all the servers and blinky-light boxes my heart desires.</p>]]></description>
         <link>http://blog.beaver.net/2005/05/disks_in_the_ether.html</link>
         <guid>http://blog.beaver.net/2005/05/disks_in_the_ether.html</guid>
         <category></category>
         <pubDate>Fri, 13 May 2005 13:21:57 -0800</pubDate>
      </item>
            <item>
         <title>Co-dependent co-processors</title>
         <description><![CDATA[<p>I remember in the early '90s, debating with a friend of mine whether or not a 40MHz 386SX would be faster than a 25MHz 386DX.  Co-processors were still a big deal then, with both the 386 and 486 processor lines having models that didn't have a dedicated FPU.</p>

<p>I think that co-processors are poised to make a huge comeback in the coming years.  There are lots of new computing technologies coming down the road where it doesn't make sense to run an entire operating system on top of them.</p>

<p>I'm talking about things like <a href="http://en.wikipedia.org/wiki/Quantum_computing">quantum computing</a>, <a href="http://en.wikipedia.org/wiki/Dna_computing">DNA/Molecular computing</a>, GPU/<a href="http://www.gdhardware.com/interviews/agiea/001.htm">PPU</a>/FPGA chips, and IBM/Sony/Toshiba's <a href="http://arstechnica.com/articles/paedia/cpu/cell-1.ars">new Cell processor</a>.</p>

<p>It's not like the idea of the co-processor has really died, anyways.  PCs have had add-on sound and graphics cards for 10-15 years now; it's just that instead of having a socket on the motherboard, we plug cards into the expansion bus.  The buses used to be a lot slower than using a socket to get a direct line to the cpu, but modern buses are very fast.  PCIe is making it possible for people to create 4-8 port gigabit ethernet cards without worrying about oversaturation, since PCIe currently scales from 200MB/s to 3.2GB/s (in both directions!).  The connections between cpu and memory will probably always be faster than those between the cpu and its peripheral devices, but modern buses reach a much higher fraction of that performance than 33MHz PCI did in the olden days.</p>

<p>Ultimately, I envision physicists plugging four PPU cards into their cluster hosts and running specialized software to model n-body simulations, allowing them to purchase 1/10th the number of normal cluster hosts, which means lower power, cooling, and network requirements, which end up being the hidden costs of clusters.  Or an OS X-based HDTV video streaming host that has 4 Cell boards plugged in for offloading complex encoding tasks to Cell's optimized architecture, freeing the machine's CPU and GPU for UI, job control, brokering jobs, streaming the encoded streams over the network, etc.  The list goes on and on.</p>

<p>With these hypothetical systems, the OS still runs on the CPU architectures we've grown to love and hate, it's just that certain types of computations are pushed off onto specialized hardware.  It's taking the GPU model and applying it not just to graphics, but any type of computation.  PPUs are starting to get a bit of mindshare, so I think it's only a matter of time before the onboard CPU becomes less of a measure of performance compared to the add-on processing boards that are added to a machine.  </p>

<p>Servers might still just have vanilla CPUs, but multimedia and scientific applications could take huge advantage of specialized engines.  Think of the performance increase you get with Altivec on bioinformatics algorithms, and multiply it by one or two orders of magnitude.  Altivec is more specialized than a cpu, but there are other solutions out there that are more specialized, if your problem space is narrow enough.  For example, FPGAs can <b>SCREAM</b> on certain types of algorithms, and blow away even dedicated vector units like Altivec.  There's a reason why Echelon runs on FPGA-based hosts with solid-state memory instead of Xserves.  :-)</p>

<p>And the more exotic technologies (quantum, molecular, and optical computing) aren't too far away.  Maybe the first coarse models in 5-10 years?  I read about quantum crypto getting small scale field tests today, so I wouldn't be surprised to see expansion boards for it in five years.  Of course, all this talk of expansion boards assumes that the technology is small enough to fit on a board inside a normal case, which is a big assumption.  But I like to believe that eventually, they will be that small.</p>

<p>And the physics processing units (PPUs) wouldn't necessarily be for scientists.  Imagine how much more intelligent artificial opponents in games could be if all the graphics and physics calculations are taking place in your GPU and PPU chips, leaving your CPU free to focus on AI and networking.  I love to boot up old games in my emulators and remember how basic the graphics and sound were 15 years ago; I think we don't stop often enough to reflect and marvel at the progress computing technology has made in our lifetimes.  It can be sobering if you think about it.</p>

<p>In fact, a couple years ago, a good friend of mine and I were talking about when we'd start to see the first cases of PTSD from playing over-realistic FPS games.  I'd argue that there are probably cases out there already, but I think we could see widespread mental health issues 5-10 years down the road when the games are that much more real.</p>]]></description>
         <link>http://blog.beaver.net/2005/05/codependent_coprocessors.html</link>
         <guid>http://blog.beaver.net/2005/05/codependent_coprocessors.html</guid>
         <category></category>
         <pubDate>Thu, 05 May 2005 12:06:31 -0800</pubDate>
      </item>
            <item>
         <title>One is the loneliest number...</title>
         <description><![CDATA[<p>I saw tonight's Slashdot link on Orion's <a href="http://www.orionmulti.com/products/specs_ds96">96 cpu desktop cluster</a> and it got me thinking.</p>

<p>It has 96 1.2GHz x86 cpus, which is an interesting direction to take.  I was thinking tonight about how power, speed and price are intertwined with each other, and I wonder if Orion's system is signaling a future trend.  More cpus per box, but with relaxed speeds for each core.  Their system costs $100k, but the main concept could be used to build cheaper systems than that.</p>

<p>Imagine 32 1.2 GHz cpus (relative to today's Opteron or G5 processors) on a single host.  You wouldn't be doing crazy 2-3GHz speeds, which means much less power consumption, and higher tolerances for manufacturing error.  Perhaps slower speeds and power consumption would let you fit more cores per die compared to the models that maximize performance-per-core.  For example, maybe I can fit four 1.6GHz cores per socket instead of the normal pair of 2.5 GHz cores.  More cores with slower speeds means you'll have to utilize each core more fully, though.</p>

<p>Along those lines, I liked Herb Sutter's recent article in DDJ about <a href="http://www.gotw.ca/publications/concurrency-ddj.htm">the coming concurrency revolution</a>.  He described how performance-per-core isn't increasing as much as it used to, and how concurrency will soon become a required skill for most developers, similar to how OO is such a common idiom today.  I agree!</p>

<p>Concurrency is hard, though.  I've been working with threaded software for a couple of years, and the more I learn, the more I realize how damn hard it is to get right.  Herb mentioned that perhaps future languages will have greater support for concurrency built-in, and hinted at a future article covering that topic.  I'm very much looking forward to reading it.</p>

<p>I don't know when low-level primitives will be a requirement for programming languages, but probably not for a while.  I like the new java.util.concurrent package in 5.0, it really expanded my thread toolbox.  I bet that future Java releases include more support for concurrency, 5.0 showed that Sun wasn't afraid to make relatively large changes to the language features and syntax.</p>

<p>I think that deeply understanding concurrency will be a great long-term skill to have.  I recently started a long-term project so I could spend more of my research/study time learning new things about threaded programming and its philosophy.  I'm going to spend 3-6 months refreshing the basics and just doing lots of concurrent programming, and then I'll start to study more deeply on interesting topics that pop up during the first period.  I'm guessing there'll be a lot of time spent on locks, transactions, and ordering...</p>]]></description>
         <link>http://blog.beaver.net/2005/05/one_is_the_loneliest_number.html</link>
         <guid>http://blog.beaver.net/2005/05/one_is_the_loneliest_number.html</guid>
         <category></category>
         <pubDate>Tue, 03 May 2005 23:34:54 -0800</pubDate>
      </item>
            <item>
         <title>Ruby debugger hacking</title>
         <description><![CDATA[<p>ZenSpider and I got together this morning since we hadn't hung out in quite a while.  He showed me his new zenprofiler, which uses his RubyInline module to create a hybrid Ruby/C profiler that is much faster than the default pure-Ruby profiler, but with much more readability and less verbosity than a pure-C implementation hooked into Ruby itself.</p>

<h4><font color="#008800">Meta&lt;Foo&gt;</font></h4>

<p>Ruby's profiler/debugger is a lot like Perl's.  The stock introspection tools aren't written in C, they're written in the high level language.  The core C engine of the language makes calls to callback hooks whenever a line of code is executed, a function is called, etc, but all the smarts and UI live in high level code.  This is mostly a good thing, except it means that your debuggers and profilers can have a high overhead.  Still, those are problems that can be fixed.</p>

<p>The philosophy of implementing core functionality in a higher-level language is one of the reasons Ryan is working on RubyToC and metaruby.  He believes that if the Ruby engine itself were written in Ruby, it would be easier to find developers willing to add features to the language, and maintenance overall would be much easier.</p>

<p>Now, self-bootstrapping is one of those through-the-rabbit-hole things that freaks you out at first.  But Ryan is one of those guys who just groks compilers and low level stuff, and he's a patient instructor.  I'm not at a point where I can jump in and pair with him and Eric on tasks yet, but I can read the code and grok what's going on, and not make funny noises while I read the code.</p>

<p>If you want to read more about their RubyToC and metaruby projects, read <a href="http://blog.zenspider.com/archives/metaruby/index.html">their blog category for metaruby</a>, and check out their <a href="http://www.zenspider.com/~ryand/Ruby2C.pdf">overview slides for the project</a>.  It's really great stuff, and just a small slice of what the Ruby community as a whole is working on.</p>

<h4><font color="#008800">Ruby debugger guts</font></h4>

<p>But, back to the zenprofiler and the debugger work we did today.  zenprofiler uses RubyInline, a module Ryan wrote which was inspired by Brian Ingerson's set of Inline modules for Perl.  It allows you to define C code inside your Ruby code, and at runtime, the C code is automatically built and linked into your program.  The resulting shared libs are cached for future runs.</p>

<p>Now, the default Ruby profiler and debugger are very slow beasts.  That's because Ruby's debugger and profilers tell the core engine to activate a trace function which is called every time a new line of code is executed, a wrapped C call is made, a function returns, etc, etc.  The trace function receives information about the state of the interpreter, and does appropriate things with it, depending on if it's a profiler or a debugger.</p>

<p>The C engine doesn't filter events before it calls the trace func; it just passes along a string which says "call", "c_call", "return", etc, etc.  So your trace function is called for every line of code, even if it doesn't need to be.</p>
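<p>If you haven't seen one of these hooks before, here's a rough sketch of the same idea using Python's sys.settrace, which works on a similar principle to Ruby's trace func.  It is an analogy, not the Ruby code we worked on, and the add function is just a made-up target to trace.</p>

```python
import sys

events = []  # (event kind, function name) pairs seen by the trace function


def tracer(frame, event, arg):
    # The interpreter hands us the event kind as a plain string
    # ("call", "line", "return", ...), much like Ruby's trace func.
    events.append((event, frame.f_code.co_name))
    return tracer  # keep receiving per-line events inside this frame


def add(a, b):
    total = a + b
    return total


sys.settrace(tracer)
add(1, 2)
sys.settrace(None)

# Every call, line, and return inside add() went through the tracer.
kinds = [kind for kind, name in events if name == "add"]
print(kinds)
```

<p>Even for a two-line function, the tracer fires on the call, on each line, and on the return, which is where all the overhead comes from.</p>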

<p>Our work today centered around pushing as much of the switching logic inside the debugger's trace function down into an inlined C class.  There are still some segvs to be worked out, but we saw fairly encouraging initial results.  In the next iteration, we're probably going to do something I've been wanting to do for quite some time.  If you're in profiler mode, you have to make annotations for each line of code and each function call/return.  Debuggers, not so much!  With a debugger, you're setting breakpoints and telling the code to run, or you're single-stepping through the code.  If you're single-stepping, then overhead isn't a big issue for you.</p>

<p>But if you're setting breakpoints, the code should run as close to non-debugger speed as possible!  That's the next set of changes, I think.  A small set of hooks that makes the trace func only get called on lines or functions that we want to break on, and in all other cases, skips the trace func altogether.</p>
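<p>Python's tracing hook has a cheap version of this built in, so it makes a handy illustration of the idea.  This is only a sketch of the concept, not what we plan to do inside the Ruby interpreter: BREAK_ON and the two toy functions are invented for the example, and the hook still fires once per function call.</p>

```python
import sys

BREAK_ON = {"interesting"}  # hypothetical breakpoint list, keyed by function name
seen = []                   # (function name, event) pairs that were actually traced


def local_trace(frame, event, arg):
    # Only runs inside frames we chose to trace.
    seen.append((frame.f_code.co_name, event))
    return local_trace


def global_trace(frame, event, arg):
    # Called once per function call.  Returning None tells the interpreter
    # not to deliver line events for that frame at all.
    if frame.f_code.co_name in BREAK_ON:
        return local_trace
    return None


def boring():
    x = 1
    return x + 1


def interesting():
    y = boring()
    return y * 2


sys.settrace(global_trace)
result = interesting()
sys.settrace(None)

traced_functions = {name for name, _ in seen}
print(result, traced_functions)
```

<p>The lines inside boring() run without ever touching the per-line trace code, which is the behavior we want for everything that doesn't have a breakpoint on it.</p>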

<p>With the stock debugger, we were seeing slowdowns of 100-250x when running some regression tests with REXML, a pure-Ruby XML parsing library.  I think we can get the in-debugger performance of breakpoint code to be 1.05-1.1x the normal performance, with maybe 10-20 hours of work.  RubyInline makes this kind of stuff a lot easier, although in this case, we'll have to make some core interpreter changes just because of what we want to do.  But it will be pretty sweet.</p>

<h4><font color="#008800">How random hacking can pay off ten-fold down the road...</font></h4>

<p>I've always been very interested in scripting language debuggers.  Back when Amazon was using perl 5.003 in 1997, I back-ported the 5.004 debugger and added a bunch of features/macros to make it easier on me when fixing code.  I spent a few hours a night for about 10 days, poring over the debugger's code, learning all of its intricacies and gotchas, and ended up with a pretty deep knowledge of perl internals because of it.  The debugger used all sorts of tricks with symbolic references and symbol globs in order to peek at the symbol table and properly handle Perl's control flow, which we all know can be a bit... chaotic, to put it nicely.</p>

<p>And the funny thing about learning about that sort of stuff is that you never know when it will come in handy.  When we were grafting Mason onto our new website display engine in 2001-2002, all that knowledge I learned from hacking the Perl debugger paid off in spades, since I was able to work at different layers of our embedded perl engine and fix problems quickly, without having to mentally context switch too much.  But when I originally worked on my debugger fixes, I was simply focused on peeling apart a tool I used often to see how it worked; I had no goal or plan.  I just wanted a few annoying bugs to get fixed, and once I did that, I realized I had enough knowledge to start adding features.  I think it's similar to people who move to France, and bemoan the fact that they can't learn French, until all of a sudden one day, they're talking to themselves about groceries they need to buy, and they realize they're talking to themselves in French!  I love aha! moments like that.</p>

<p>I started thinking about some favorite aha! moments over the years, but then I realized that's a separate blog entry altogether.  But I do have fond memories of reverse-engineering saved game formats on my Apple IIgs (Bard's Tale, anyone?) and playing games methodically to figure out their logic, and then exploiting that logic in order to get past tough spots.  I guess I always had a bit of a hacker spirit, even when I was a little kid.</p>

<p>I blame the legos.</p>]]></description>
         <link>http://blog.beaver.net/2005/05/ruby_debugger_hacking.html</link>
         <guid>http://blog.beaver.net/2005/05/ruby_debugger_hacking.html</guid>
         <category></category>
         <pubDate>Tue, 03 May 2005 18:05:47 -0800</pubDate>
      </item>
            <item>
         <title>File under &apos;R&apos; for redundant</title>
         <description><![CDATA[<p>I love how over the past few years, there's been a lot of activity in the distributed filesystem arena.  Corporations are realizing how expensive it is to recreate data versus just keeping it indefinitely in the first place.  And storage requirements keep growing higher and higher.</p>

<p>There are a few distributed filesystem implementations that I like, namely AFS, GFS (Google's implementation, not the original GFS), and Xsan.  Each is designed for very different usage patterns, but all are worth a look for different reasons.</p>

<p>First, a quick recap of this trio.  All three let you build SANs (storage area networks), but with varying degrees of coupling between your storage hosts.  Additionally, AFS will let you create SWANs, which are like SANs, but connected over a WAN.  From reading the GFS paper, it can theoretically support SWANs, but I don't know if Google is employing it in this fashion.</p>

<h4><font color="#008800">AFS</font></h4>

<p>AFS is the Andrew File System, which was developed at Carnegie Mellon University, and was subsequently worked on by Transarc and IBM.  Somewhere along the way, Andrew was dropped from the name and AFS now just means AFS.  <a href="http://openafs.org/">OpenAFS</a> is the branch most commonly used.  AFS uses a cell-based architecture, where each cell corresponds to a geographic cluster of storage hosts.  For example, if your company has three offices in New York, Paris, and Tokyo, each office would constitute a single cell.</p>

<h4><font color="#008800">AFS Volumes</font></h4>

<p>Each cell subscribes to a set of volumes, where each volume holds a category of files and enforces its space allocation, replication counts, user ACLs, etc.  The physical data for the volume is only stored at the cell where it is located, but it is possible to do read-only replication of the volume data to other cells for redundancy purposes.</p>

<p>Let's pretend I have three cells, /afs/home, /afs/work, and /afs/hawaii, corresponding to three locations I have AFS installed at, and that I have these volumes:</p>

<ul>
<li>/afs/home/config
<li>/afs/home/src
<li>/afs/work/work-mail
<li>/afs/hawaii/photos
</ul>

<p>By default, each volume is only stored on the cell where it is defined.  The physical data for /afs/hawaii/photos only resides in my hut in Oahu.  Other cells (home, work) can grab images from my hut by using the full cell path (/afs/hawaii/photos/hut.jpg), but they don't have a local copy of the data in my photos volume.</p>

<p>When a cell wants to edit a file located in another cell's volume, it grabs a copy of the file from the remote cell and stores a cached copy locally.  All file operations happen on the local copy, until the file is closed.  Once the file is closed, the updates are sent to the remote cell, and the new changes show up globally, since remote cells are grabbing a newly cached copy of the file whenever they read/write to it.</p>
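<p>Here's a toy model of that fetch-on-open, write-back-on-close behavior.  It's only a sketch to show the ordering of events: RemoteCell, CachedFile, and notes.txt are names I made up, and real AFS does a lot more (callbacks, locking, partial caching) than this.</p>

```python
class RemoteCell:
    """Stands in for the cell that owns the volume (hypothetical)."""

    def __init__(self):
        self.files = {}


class CachedFile:
    """Local working copy: fetched on open, written back on close."""

    def __init__(self, cell, path):
        self.cell = cell
        self.path = path
        # "Opening" the file pulls a copy into the local cache.
        self.data = cell.files.get(path, "")

    def write(self, text):
        # Only the local cached copy changes here.
        self.data += text

    def close(self):
        # Closing pushes the update back to the owning cell.
        self.cell.files[self.path] = self.data


hawaii = RemoteCell()
path = "/afs/hawaii/photos/notes.txt"
hawaii.files[path] = "hut"

f = CachedFile(hawaii, path)
f.write(" at sunset")
before_close = hawaii.files[path]  # the owning cell still has the old contents
f.close()
after_close = hawaii.files[path]   # now the change is visible to everyone

print(before_close, "|", after_close)  # hut | hut at sunset
```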

<h4><font color="#008800">AFS Replication</font></h4>

<p>Now, it doesn't always make sense to grab files from AFS volumes remotely.  What if your files are heavy?  What if you want to access more data than what can be streamed over the network between your cells?  AFS addresses that with read-only replication.</p>

<p>The basic idea is that volumes will always have a master cell where they are stored (/afs/hawaii/photos), but that other cells can be configured to keep read-only local copies of those volumes.  This places a burden on the master cell to determine when the replicas should be updated.</p>

<p>For example, let's say I have /afs/home/dvds, which holds a couple terabytes of ripped dvds (which I legally own, thank you very much).  If I want to watch a film from Hawaii and don't have replication enabled, I have to depend on the network link between Seattle and Oahu to be reliable enough to stream my VOB file.  That dog won't hunt, monsignor.</p>

<p>Now, if I set up my hawaii cell to have a read-only replica of /afs/home/dvds, then I can watch my dvds in sunny Hawaii and not worry about network latency.  Whenever I rip a new dvd at home, I can issue a volume update and make the remote read-only replicas update their copies of my dvd volume.  As you can imagine, AFS replication works best when you have data that isn't constantly being updated.</p>
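<p>The important detail is that replicas only change when the master pushes an update (in OpenAFS that's the job of the vos release command).  Here's a small sketch of that behavior; the Volume class and the file names are invented for illustration and have nothing to do with the real AFS code.</p>

```python
class Volume:
    """Toy model of a read/write master volume with read-only replicas."""

    def __init__(self):
        self.master = {}    # read/write copy at the home cell
        self.replicas = {}  # cell name -> read-only snapshot of the master

    def add_replica(self, cell):
        self.replicas[cell] = dict(self.master)

    def release(self):
        # Push the current master contents out to every read-only replica.
        for cell in self.replicas:
            self.replicas[cell] = dict(self.master)


dvds = Volume()
dvds.master["film_one.vob"] = "film data"
dvds.add_replica("hawaii")

dvds.master["film_two.vob"] = "film data"  # rip a new dvd at home
stale = sorted(dvds.replicas["hawaii"])    # the replica has not seen it yet
dvds.release()
fresh = sorted(dvds.replicas["hawaii"])    # now it has

print(stale)  # ['film_one.vob']
print(fresh)  # ['film_one.vob', 'film_two.vob']
```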

<h4><font color="#008800">Commercial AFS Installations</font></h4>

<p>AFS should be more famous than it is; it's the leader when it comes to distributed storage across WANs.  A former colleague of mine used to work as a system engineer for one of the large brokerage firms on Wall Street, and he explained to me how AFS was great for their worldwide storage needs, since it even worked with their branch offices in SE Asia that didn't have the same bandwidth as their North American and European offices.  The slower offices just subscribed to a smaller number of volumes that were vital, and picked up updates as they came along.  The company had dozens and dozens of datacenters around the world with terabytes and terabytes of storage, and it all just worked (TM).</p>

<p>It's hard to find a public list of large companies that use AFS, probably because they see it as a secret weapon of sorts.  It's not perfect; you have to construct your volumes carefully in order to manage update frequencies, and you'll probably need a dedicated AFS admin for large installs, but it's used by all sorts of industry leaders.  <a href="http://www-conf.slac.stanford.edu/AFSBestPractices/register/default.asp">The key is to go to AFS conferences and look at everyone's nametags</a>. You'll see employees of major financial institutions, auto manufacturers, e-commerce companies, retailers, government contractors, etc, etc.</p>

<h4><font color="#008800">AFS Downsides</font></h4>

<p>AFS isn't perfect.  It depends on Kerberos for user authentication, which is a headache all of its own.  You have to carefully manage your volumes, especially when replication comes into play.  While you can resize the quota of different volumes easily, there is still a bit of a tightrope act required in order to balance data across cells and volumes properly.  (There's even a mailing list dedicated to balancing AFS.)</p>

<p>AFS is best suited for storing large filesets that don't have high update frequencies.  It's great for distributing binaries across organizations, storing and replicating videos, mp3s, etc.  If you have inflexible replication requirements (i.e. changes have to show up in read-only volume copies immediately after being committed to the master read/write volume), then you can't support high volume change velocities.  That, or you're stuck with creating tons of very small volumes in order to gate your updates per-volume per-timeslice, which increases the admin overhead of your AFS deployment.</p>

<p>All that said, I rather like AFS, and have been considering deploying it for the home storage network I've been working on.  I've been ripping all of my media to disk, and AFS complements rips nicely.  They aren't changed very often, but they are heavy and I want to make sure that my media is replicated offsite.  This is where read-only replication works wonders.</p>

<h4><font color="#008800">GFS (Google File System)</font></h4>

<p>GFS is Google's distributed file system (<a href="http://labs.google.com/papers/gfs-sosp2003.pdf">read the whitepaper here</a>).  Its design is Google-centric (naturally), so it assumes that it will run on commodity hardware with non-RAID drives. It optimizes for bulk reads/writes over random reads/writes, and it relaxes some of the concurrency requirements of normal distributed file systems in favor of letting application logic detect anomalies.</p>

<p>GFS isn't a traditional file system, though.  GFS clients use a custom library to read/write files, and GFS doesn't use the normal kernel-level vnode hooks that other file systems use.  The server side also runs completely in userland.</p>

<h4><font color="#008800">Basic GFS Design</font></h4>

<p>GFS breaks files into 64MB chunks.  Chunks are stored on hosts called chunk servers.  Each chunk is replicated onto three chunk servers, which optimally are in different physical racks.  Chunks are placed in servers which have the most free space available, and over time, their placement is fairly randomized.</p>

<p>Here is a diagram of a GFS cluster, taken from Google's whitepaper on GFS that I linked to above:</p>

<center><img src="http://blog.beaver.net/images/GFS-overview.jpg"></center>

<p>A master server stores a directory of which chunks are available and which chunks reside on each chunk server.  The client API keeps track of a cluster's chunk size, and translates the read/write requests of the user into chunk offsets for a given file.</p>

<h4><font color="#008800">GFS Read Example</font></h4>

<p>For example, let's say I grabbed a 150MB apache log and stored it in my GFS cluster as httpd.log two weeks ago, and now I want to read it back and build a unique list of IPs that hit my web server...</p>

<p>I would ask the GFS client API to read the file httpd.log, starting at byte offset 0.  Since that is within the first 64MB of the file, it would ask the master server for a list of chunk locations for httpd.log's first 64MB chunk.  The master server then returns a global chunk id for that file/chunk offset combo, plus a list of chunk servers that hold the chunk mapped by that chunk id.  The client library looks at the list of chunk servers and picks the closest one, grabbing the chunk data and sending the bytes back to my read call.</p>

<p>These steps are repeated for chunks 2 and 3 of httpd.log (64-128MB, 128-150MB).  The third chunk isn't 64MB, but only 22MB.  Chunks are allocated as their size fills up, so it's easy for the client library to know it's reached the end of the file.</p>

<p>The chunks are streamed to the client library in proper chunk order, and the user doesn't have to know what's happening under the covers.  The large chunk size means that read bandwidth stays pretty high.</p>
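<p>The offset-to-chunk translation is simple arithmetic.  Here's a sketch of it for the 150MB httpd.log example; the function names are mine, and the real client library obviously does much more than this.</p>

```python
MB = 1024 * 1024
CHUNK_SIZE = 64 * MB  # fixed chunk size described in the GFS paper


def chunk_index(offset):
    """Which chunk of the file does this byte offset fall into?"""
    return offset // CHUNK_SIZE


def chunk_ranges(file_size):
    """Yield (chunk index, start byte, end byte) for each chunk of a file."""
    start = 0
    index = 0
    while start < file_size:
        end = min(start + CHUNK_SIZE, file_size)
        yield index, start, end
        start = end
        index += 1


# The 150MB log splits into two full chunks and one 22MB tail chunk.
sizes = [(end - start) // MB for _, start, end in chunk_ranges(150 * MB)]
print(sizes)                  # [64, 64, 22]
print(chunk_index(0))         # 0
print(chunk_index(130 * MB))  # 2
```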

<p>GFS also has some optimizations; multiple chunk offsets are requested from the master server at one time to cut down on back-and-forth traffic between the client and the master, and chunk/chunk server metadata (but not the actual payload) is cached on the client side.</p>

<p>The key to the system is that while the master server holds the directory service that maps abstract files to chunk id lists and the servers that hold the chunks, the actual streaming of the data for a chunk is a direct communication between the client and one of the three chunk servers that holds the chunk.</p>

<h4><font color="#008800">GFS Writes</font></h4>

<p>Similar to how the client talks to chunk servers directly for reads, it also talks to them directly for writes.  If you're writing to a file, the master server is queried for the chunks that will need to be mutated.  For each chunk, the master server picks one of the three chunk servers that hold a replica of that chunk, and makes it the master replica server.</p>

<p>All the mutations for that chunk are applied on the master replica server, and then it turns around and tells the other two chunk servers to update their replicas of the chunk to match its new copy.  Once all three chunk servers have confirmed that they have written the new chunk to stable storage, the master replica returns a success code to the client, which moves to write the next chunk.</p>

<p>GFS also has optimizations that make it possible to write several chunks at once, and to have multiple simultaneous writers on a chunk (as long as they are appending and not writing at random offsets in the chunk, which isn't a common case for writes anyways).  The interesting thing is that while GFS guarantees that multiple writes can occur atomically, it doesn't guarantee that a given write happens only once. This means that the records appended to chunks have to contain some sort of header or sequence id inside them, otherwise the application that's reading the data might accidentally process a record multiple times.</p>
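<p>On the reading side, coping with duplicate appends can be as simple as remembering which record ids you've already seen.  This is a minimal sketch that assumes each record carries a unique id chosen by the writer; the (id, payload) tuple format is made up for the example.</p>

```python
def unique_records(records):
    """Yield each payload once, skipping records whose id was already seen."""
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue  # a retried append landed twice, so ignore the repeat
        seen.add(record_id)
        yield payload


# Record 2 was appended twice, as can happen when an append is retried.
appended = [(1, "GET /index.html"), (2, "GET /about.html"),
            (2, "GET /about.html"), (3, "GET /feed.xml")]

result = list(unique_records(appended))
print(result)  # ['GET /index.html', 'GET /about.html', 'GET /feed.xml']
```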

<p>The nice thing is that consumers of the client library aren't issuing low-level chunk read/write calls, they're just using the api to write to a file.</p>

<h4><font color="#008800">Interesting GFS Architecture Items</font></h4>

<p>GFS has other features, but you should really just read the whitepaper to get a better idea for what it can do.  Because Google wrote it specifically to address their internal needs for a filesystem, it is pretty specialized to Google's business domain.</p>

<p>For example, it uses a non-standard client library for access to files, and is much more focused on streaming high-bandwidth reads and writes versus supporting random file access.  Concurrency on writes is supported, but you don't get the guarantee of the write happening only one time.  Latency isn't emphasized much, but bandwidth sure is.</p>

<p>One interesting thing is that Google was seeing i/o corruption with early deployments of GFS, so they built in a checksum system that sits at the chunk server level (not the master server level).  Chunks are checksummed in 64KB blocks, so each chunk has 1024 checksum entries. The checksums are checked on both reads and writes, and stored in memory on the chunk servers.  A common theme for GFS is aggressive storage of metadata in memory, with checkpoint flushing to disk for critical things.</p>
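<p>Here's a small sketch of per-block checksumming.  I'm using zlib's CRC32 as a stand-in, which is an assumption on my part since I don't know the exact checksum function Google uses, and the demo works on a tiny four-block buffer instead of a full 64MB chunk.</p>

```python
import zlib

BLOCK_SIZE = 64 * 1024         # 64KB checksum blocks
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks

blocks_per_chunk = CHUNK_SIZE // BLOCK_SIZE
print(blocks_per_chunk)  # 1024


def block_checksums(chunk):
    """One CRC32 per 64KB block of the chunk data."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]


def corrupt_blocks(chunk, checksums):
    """Return the indexes of blocks whose checksum no longer matches."""
    return [i for i, c in enumerate(block_checksums(chunk)) if c != checksums[i]]


chunk = bytearray(4 * BLOCK_SIZE)  # small stand-in for a chunk
sums = block_checksums(chunk)
chunk[2 * BLOCK_SIZE + 5] ^= 0xFF  # simulate corruption inside block 2
bad = corrupt_blocks(chunk, sums)
print(bad)  # [2]
```

<p>Checking per block means a read only has to verify the blocks it actually touches, and a bad block pinpoints the corruption instead of condemning the whole chunk.</p>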

<p>I also like that chunk servers are the authority for which chunks they hold, the master server doesn't have a persistent authoritative store of which chunks are in which servers.  Instead, when a chunk server starts up, it contacts the master server and tells it which chunk ids it can serve up.  Once a chunk server is running, it also periodically sends a heartbeat back to the master server, making sure it has an up-to-date list of all the chunks it holds.  Google realized that it would have been more complex to make the central server be authoritative, and who better to know what a chunk server holds than the chunk server itself? I like that.</p>
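<p>To show why this is simpler, here's a toy master that builds its chunk location table purely from what chunk servers report.  The Master class, the server names, and the chunk ids are all invented for the sketch; it only models the "chunk server is the authority" idea, not heartbeat timing or failure detection.</p>

```python
from collections import defaultdict


class Master:
    """Toy directory: chunk id -> set of chunk servers that reported it."""

    def __init__(self):
        self.locations = defaultdict(set)

    def report(self, server, chunk_ids):
        # Throw away whatever we believed about this server and
        # replace it with what the server itself says it holds.
        for holders in self.locations.values():
            holders.discard(server)
        for chunk_id in chunk_ids:
            self.locations[chunk_id].add(server)


master = Master()
master.report("cs1", ["c1", "c2"])  # startup report from chunk server 1
master.report("cs2", ["c2"])        # startup report from chunk server 2
master.report("cs1", ["c2"])        # later heartbeat: cs1 no longer holds c1

print(sorted(master.locations["c1"]))  # []
print(sorted(master.locations["c2"]))  # ['cs1', 'cs2']
```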

<p>One small thing I'm not a fan of is calling the central directory server the master server; I think it can be confusing at times.  I don't know what I would call it, maybe something like chunk directory or chunk router.  Heh, or maybe...  Chunk matchmaker.  :-)</p>

<p>If you read the GFS paper, I highly recommend reading these sections closely:</p>

<ul>
<li>2.6.3 -- Operation Log
<li>3.1 -- Leases and Mutation Order
<li>3.2 -- Data Flow
<li>3.3 -- Atomic Record Appends
<li>3.4 -- Snapshot
<li>4.3 -- Creation, Re-replication, Rebalancing
<li>5.1.3 -- Master Replication
<li>5.2 -- Data Integrity
</ul>

<p>And sections 6 and 7 (benchmarks and real-world GFS experiences) are a fun read too.  GFS fills its niche pretty well!</p>

<h4><font color="#008800">Xsan</font></h4>

<p>Xsan is Apple's entry into the SAN market.  It runs on top of Xserve RAID boxes and supports storage clusters of up to 16TB.  Xsan uses fibre channel for its fabric, so it's tuned for short-range, high-throughput storage.  Xsan's <a href="http://images.apple.com/xsan/pdf/20050128Xsan_Technology_Overview.pdf">technology overview paper</a> goes into good detail on how it works.</p>

<p>Xsan is an unapologetic, super-high bandwidth storage solution.  Here are just a few features it has that are uncommon for a NAS/SAN system:</p>

<ul>
<li>Reservation of FC bandwidth so hosts don't have to worry about 
contention when streaming high-bandwidth payloads like HD video.
<li>Supports different RAID levels for different volumes.  For example, use RAID-5 for scratch
storage when working on a commercial for a client, but store the final version on mirrored RAID-1 for
peace of mind.
<li>Metadata controller uses gigabit ethernet to manage reads/writes between clients and
the SAN, freeing up the FC fabric to just shove raw payloads around.
<li>Metadata controller can automatically fail over to a sequence of backup hosts, or do an impromptu election and pick a new host.
<li>16TB is the limit for a virtual Xsan volume, but a single metadata controller can support multiple volumes.
<li>Clients can have two FC connections to the SAN, and access to a given volume is automatically routed through whatever connection is loaded less.  This means you can have two FC fabrics attached to the same SAN, or make each host attach to two SANs at once.  32TB of virtual storage from a single client, anyone?
</ul>

<p>And you get a host of gui tools to help you configure and manage your different volumes, set up access groups, measure SAN throughput, etc, etc.  Oh, if only I had a budget for HPC clusters, I would get a rack of Xserves and a few Xserve RAIDs, and put Xsan to use.  When you combine the throughput of Xsan on Xserve RAID's already beefy hardware, and then attach machines with wide and deep buses like the Xserves, well, it seems like you're getting as close as you can to supercomputer territory without paying a few million bucks.  I keep hearing good things about Apple's server market, and I hope they keep fighting this fight; their server solutions are excellent and relatively affordable when you compare them to the competition.  Casual PowerMac owners might freak out when hearing what these solutions cost, but people who work with storage and servers for a living realize that for what you get, it's a bargain.</p>

<h4><font color="#008800">Why talk about these filesystems?</font></h4>

<p>You might be wondering why I spent all this time discussing this trio of filesystems.  Well, I talked about AFS because I've been thinking about using it for helping me store and manage all of my media and email archives. Offsite replication is pretty straightforward, and I get POSIX-y access to my files without having to change much application logic.  It's  mature, stable, and is a much better alternative than NFS for what I want to do with it.</p>

<p>As far as GFS goes, well, I guess I'm just a fan of GFS's principles.  I like how Google was unapologetic about its design and tweaked it as much as possible for their problem domain.  It's not a general purpose distributed filesystem for tech companies, but it doesn't have to be.  That said, it has a lot of interesting ideas, and I like how its design isn't overly complex with lots of distributed locks and transaction managers.</p>

<p>And last but not least, I like Xsan because Xserve RAID is damn sexy, and Xsan exploits its performance well.  GFS/AFS use ethernet, but Xsan uses fibre channel links (for smokin' speed) and has good concurrency AND throughput.  I don't have the money to afford Xserve RAID, but if I did, I would use Xsan for my home storage needs.  Xserve RAID + Xsan is relatively cheap compared to the SAN solutions offered by other companies in the storage market.  You wouldn't believe how much NetApp and friends charge for their storage cabinets...</p>

<p>I should have my storage cluster set up six months from now.  I'm going to have a heavy node at home that stores all my media, and an offsite node that stores vital things like email, config files, source code, etc.  It will be interesting to see where my storage is at a few years down the road once HD dvds are more common.  I'm thinking that 5TB won't seem like a lot by then...</p>]]></description>
         <link>http://blog.beaver.net/2005/04/file_under_r_for_redundant.html</link>
         <guid>http://blog.beaver.net/2005/04/file_under_r_for_redundant.html</guid>
         <category></category>
         <pubDate>Sat, 30 Apr 2005 01:25:49 -0800</pubDate>
      </item>
            <item>
         <title>Feel the 1U love...</title>
         <description><![CDATA[<p>So, I'm in the market for a 1U server.  My only real choice is an Xserve G5 or an Opteron-based solution.  Intel doesn't factor in for me; they might hold the bulk of the CPU market, but as Microsoft goes to show you, being the market leader doesn't necessarily mean you have the best product in the market.  I'm not a huge fan of the Intel server offerings out there.</p>

<p>Tyan makes a cool <a href="http://www.tyan.com/products/html/gx28b2881.html">dual-socket 1U Opteron case</a> which I can find online for ~$900.  It's certainly not as sexy as an Xserve (and doesn't have all the neat monitoring/admin tools bundled), but it has four sata trays, which would let me hold 1.6TB in 1U of height.  The cooling on it looks adequate enough to run four drives.</p>

<p>I'm pretty excited about the new Opteron dual-core cpus that are starting to trickle out.  The 270 seems to be perfect for what I'm looking for, a dual-socket box would give 4 x 2.0GHz.  The prices are pretty high on the new Opterons though, it looks like I'd be paying $1k per cpu for the 270, which is the dual-core 2.0GHz model.  Hopefully the price would go down over the next few months, but who knows.  I'd certainly take 4 x 1.8GHz versus 2 x 2.6GHz (2x265 vs 2x252, they're roughly $850 per socket now).</p>

<p>That's $2600 for a 4 x 1.8 Opteron 1U case, and then after factoring in 4x400GB sata drives ($1000) and N x 1GB of DDR400 ECC ram ($150 per gig), you come out at $4200 for a 4GB system, and $4800 for an 8GB system, with 7.2 GHz of cpu and 1.6TB of raw storage.  Oh, and tack on another $125 for an internal 250GB operating system drive.  So basically, $5k.</p>
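<p>For anyone who wants to check my math or plug in their own prices, here's the same arithmetic as a quick script.  The numbers are the rough prices quoted above; the variable and function names are just for this sketch.</p>

```python
CASE = 900         # Tyan dual-socket 1U case
CPU = 850          # per socket, dual-core 1.8GHz Opteron
SOCKETS = 2
CORES_PER_CPU = 2
GHZ_PER_CORE = 1.8

DRIVES = 1000      # 4 x 400GB sata
RAM_PER_GB = 150   # 1GB DDR400 ECC
OS_DRIVE = 125     # internal 250GB operating system drive

base = CASE + SOCKETS * CPU


def system_cost(ram_gb, with_os_drive=False):
    """Total price for the box with the given amount of ram."""
    total = base + DRIVES + ram_gb * RAM_PER_GB
    if with_os_drive:
        total += OS_DRIVE
    return total


total_ghz = SOCKETS * CORES_PER_CPU * GHZ_PER_CORE
raw_storage_tb = 4 * 400 / 1000

print(base)                                # 2600
print(system_cost(4))                      # 4200
print(system_cost(8))                      # 4800
print(system_cost(8, with_os_drive=True))  # 4925
print(round(total_ghz, 1), raw_storage_tb) # 7.2 1.6
```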

<p>And I can dial that back by putting in fewer drives up front, and getting less ram.  $2600 for the base 1U case and the four cpus is pretty good, though.</p>

<p>On the Xserve front, things are a little bit more expensive.  $4k for the base 2x2.3 model, and then I have to put in ram and drives at about the same cost as for the Opteron solution ($250 per 400GB of sata drive with a three drive limit, $125 per 1GB of ECC ram).  The Xserve has better monitoring and system information (there are multiple temperature and fan sensors you can monitor), and many more options for multimedia programming.  It's easier to get support for, and it's pretty much plug-and-play with Xserve RAID.  I'm also guaranteed that things like the onboard gigabit ethernet will just work without any tweaking or driver machinations, and that the system in general will be pretty rock solid stable.  From what I've learned, the amd64 variants of Linux and FreeBSD are still shaking the bugs out and aren't quite as stable as their i386 cousins.</p>

<p>So in the short term, I'll probably end up purchasing an Xserve and exploring using it for video and image transcoding, and then will eventually purchase a dual-core Opteron for doing low-level http cache work, or perhaps put it into service as a database host running PostgreSQL.</p>

<p>The nice thing about 1U is that it's not just a standard for racking computer hardware, it's a standard for audio hardware as well.  And there are tons of relatively cheap 8U and 12U gig cases with hardened shells that would make a great portable rack.  They have latches on the front and back, so you can lock up the whole shebang and put it in your car and go somewhere, and then take the front and back panels off the case when you've plugged it in.  Throw in a 1U power conditioner and a 1U gigabit switch, and you have a human-portable micro datacenter.</p>

<p>That's the appeal of 1U servers to me, I can fit 25-50GHz in a large suitcase and take it anywhere (well, it would weigh 35 lbs x the number of servers, so maybe I'd need something with wheels and a handle, but you get the idea).</p>

<p>I'll leave the business case for having that much power in a small portable case versus in a rack in a HVAC-enabled datacenter as an exercise for the reader...</p>]]></description>
         <link>http://blog.beaver.net/2005/04/feel_the_1u_love.html</link>
         <guid>http://blog.beaver.net/2005/04/feel_the_1u_love.html</guid>
         <category></category>
         <pubDate>Wed, 27 Apr 2005 14:38:33 -0800</pubDate>
      </item>
            <item>
         <title>Java games and field trips</title>
         <description><![CDATA[<p>I've been taking CSC 143, which is the second Java course in the CS foundation courses required for transfer to UW.  It is pretty fun, our first two assignments covered writing Tetris, mostly from scratch.  We were given some scaffolding and swing stubs for drawing the squares and grid, and then we had to implement the shapes, the row removal and the rotation.</p>

<p>It was pretty enlightening, since the rules of the assignment weren't exactly how Tetris itself played back in the day.  I must have eliminated a thousand rows while play testing the solution, and after a while, my brain adjusted and it was as if Tetris had always behaved that way.  Two of the big differences were the initial starting orientations of the pieces when they dropped through the top of the screen, and the logic requirements for determining when it was safe to rotate a piece.</p>

<p>I'm glad there were some differences, though, since it made the project more challenging.  The newest one is a historical stock graph that shows daily prices (with high/low/closing prices marked) and quarterly earnings and their trends over time.  This is pretty similar to the Cocoa code I wrote for graphing audio samples, except there are never any negative values.  I've written the model and controller classes and have implemented most of the text view, so now I have a week to write the graphical view that does the pretty graphs.  Here is <a href="http://beaver.net/javachart.jpg">an example</a> of what we're shooting for, although our graphs won't be quite as complex.</p>

<p>This is the third of eight projects for the quarter, so I'm excited to see what's coming down the road.  Last quarter, we did an adjustable LED clock, implemented a Photoshop-like app with convolution filters for different image manipulations, and created a pinball game, among other things.</p>

<p>The image app was my favorite, which probably isn't a surprise to people who know me, since I used to work on image software quite a bit.</p>

<p>I've been pretty busy with school and interviews.  I've got an interesting possibility in the pipeline right now, but I'm not going to talk about it until I know it's a sure thing.  It's been a long time since I was on the receiving side of an interview, so that was a real eye opener.  I went to the Bay Area last week, and on the flight back, I saw several people I knew from Amazon: a couple of DBAs who had since left the company and were down there for a MySQL conference, and two people from the tech leadership of A9.com, Amazon's search spinoff that's located near Palo Alto.  A couple of other people on the flight looked familiar, but I wasn't sure if I knew them or not.  Turns out the Thursday night flight to Seattle is always packed, since a lot of people are flying home for the weekend.  There wasn't an empty seat on the plane.</p>

<p>It actually ended up being cooler in the Bay Area than it was up in Seattle; it's been roasting up here lately.  Summer in Seattle can be hit or miss.  If it gets really hot, people aren't really prepared mentally for it, and they start doing crazy things.  It's the only time of year that Seattle drivers forget their normal, sedate, friendly driving routine and start driving like the rest of the world.</p>

<p>It has yet to be determined whether this is good or bad.</p>]]></description>
         <link>http://blog.beaver.net/2005/04/java_games_and_field_trips.html</link>
         <guid>http://blog.beaver.net/2005/04/java_games_and_field_trips.html</guid>
         <category></category>
         <pubDate>Tue, 26 Apr 2005 12:47:44 -0800</pubDate>
      </item>
            <item>
         <title>kqueue and libevent</title>
         <description><![CDATA[<p>I've been spending a lot of time this past week looking at the kqueue and libevent examples out there, specifically benchmarking my G5 to see what kind of HTTP performance I can expect from it.</p>

<p>A simple non-keepalive, select-based daemon is serving 9k requests per second (RPS) per CPU.  I've been hacking on PLB (Pure Load Balancer, a free software load balancer) and have made some changes: I kept the core libevent code for listening on sockets and watching for read/write conditions, but I tore out the proxy code so I could make it behave like a plain httpd.  I'm still tuning and profiling it, but I'm getting obscene request rates with that codebase.  I think that once I'm finished, I'll be able to do 30k-50k RPS per CPU.  I've found some excellent papers on analyzing system call bottlenecks, so it's possible I could go even higher than my projections, but I think that 30k is a safe bet.</p>

<p>PLB uses <a href="http://www.monkey.org/~provos/libevent">libevent</a>, which is a cross-platform library that finds the best event notification mechanism on your OS (be it epoll, kqueue, or /dev/poll), and wraps it in a nice API.  If you use libevent, you can create server code that supports tens of thousands of concurrent connections without taking a huge performance hit.  <a href="http://www.monkey.org/~provos/libevent/libevent-benchmark2.jpg">This logarithmic graph</a> shows how well it scales.</p>
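
<p>To make the idea concrete, here is a minimal sketch of the same pattern in Python rather than C.  It is not PLB or libevent code: Python's standard <code>selectors</code> module plays the libevent role here, since <code>DefaultSelector</code> also picks the best readiness mechanism the platform offers.  The function name <code>echo_once</code> and the callback <code>on_readable</code> are made up for this illustration.</p>

```python
import selectors
import socket


def echo_once(payload: bytes) -> tuple:
    """Push payload through a socket pair and echo it back via one event-loop turn."""
    # DefaultSelector chooses the most efficient mechanism available on this
    # platform (epoll, kqueue, /dev/poll, poll or select).
    sel = selectors.DefaultSelector()
    client, server = socket.socketpair()
    server.setblocking(False)
    client.settimeout(1.0)  # never hang forever if something goes wrong
    received = []

    def on_readable(conn):
        # Callback in the libevent style: runs only once conn has data waiting.
        data = conn.recv(4096)
        received.append(data)
        conn.sendall(data)

    # Register interest in read readiness; the callback rides along as data.
    sel.register(server, selectors.EVENT_READ, on_readable)
    client.sendall(payload)

    # One turn of the event loop: wait until a registered socket is ready,
    # then dispatch to whatever callback was registered for it.
    for key, _mask in sel.select(timeout=1.0):
        key.data(key.fileobj)

    echoed = client.recv(4096)
    backend = type(sel).__name__
    sel.unregister(server)
    sel.close()
    client.close()
    server.close()
    return backend, received, echoed


if __name__ == "__main__":
    backend, received, echoed = echo_once(b"ping")
    print(backend, received, echoed)
```

<p>A real server would register the listening socket too and keep calling <code>select()</code> in a loop; the point of the sketch is only that the application code never names epoll or kqueue directly.</p>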

<p>BSD-based OSes use kqueue and kevent, which originally came from FreeBSD.  I like the kqueue framework because it allows you to set up observers on all sorts of system state changes.  For example, I can open a fd to a critical config file on my system, and if the file is unlinked, renamed, written to, extended, or hard linked to, I am notified as soon as these changes occur and can take appropriate action.  It doesn't block the actions from happening, but it gives me a cheap and guaranteed way to know they did.  If I were writing a deployment nanny, I could register all the config and binary files for my deployment (which won't be more than my per-process fd limit) and alarm if any of the files are changed while the process is running.  How do I know it's still running?  Because there's a kqueue watcher for processes as well, and it follows them across fork or exec calls.  :-)</p>
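
<p>As a rough sketch of the file-watching case, here is what registering a vnode observer can look like through Python's <code>select.kqueue</code> wrapper (Python rather than C, purely for brevity).  The helper names <code>changes_seen</code> and <code>append_byte</code> are hypothetical, and the function simply returns <code>None</code> on systems without kqueue, such as Linux.</p>

```python
import os
import select
import tempfile


def append_byte(path):
    """Example mutation: append a single byte to the watched file."""
    with open(path, "a") as handle:
        handle.write("x")


def changes_seen(path, mutate, timeout=1.0):
    """Return the set of vnode changes reported for path while mutate(path) runs.

    Returns None on platforms where kqueue is not available.
    """
    if not hasattr(select, "kqueue"):
        return None
    notes = {
        "write": select.KQ_NOTE_WRITE,
        "extend": select.KQ_NOTE_EXTEND,
        "delete": select.KQ_NOTE_DELETE,
        "rename": select.KQ_NOTE_RENAME,
        "link": select.KQ_NOTE_LINK,
    }
    mask = 0
    for bit in notes.values():
        mask |= bit
    fd = os.open(path, os.O_RDONLY)
    kq = select.kqueue()
    try:
        watch = select.kevent(
            fd,
            filter=select.KQ_FILTER_VNODE,
            flags=select.KQ_EV_ADD | select.KQ_EV_CLEAR,
            fflags=mask,
        )
        kq.control([watch], 0, 0)  # register the observer without waiting
        mutate(path)               # the change itself is never blocked
        seen = set()
        for event in kq.control(None, 1, timeout):
            for name, bit in notes.items():
                if event.fflags & bit:
                    seen.add(name)
        return seen
    finally:
        kq.close()
        os.close(fd)


if __name__ == "__main__":
    tmp_fd, tmp_path = tempfile.mkstemp()
    os.close(tmp_fd)
    print(changes_seen(tmp_path, append_byte))
    os.unlink(tmp_path)
```

<p>A deployment nanny along these lines would hold one fd per watched file and sit in a single <code>control()</code> call waiting for any of them to fire.</p>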

<p>Basically, kqueue lets you take a random fd (a network socket, a file on disk, a pipe, or a fifo) and register observers so that you are notified when it's ready for reading or writing.  It also has observers for vnodes (the kernel objects that model files inside filesystems; they're at a higher level than inodes) and for signals.  Darwin doesn't support two FreeBSD kqueue features: async I/O notifications (it doesn't have async I/O at all) and generic timer notifications.</p>

<p>Anyways, kqueue has beefier features than Linux's epoll, and lets you do some pretty freaky stuff.  All the BSD cousins have sysctl, so you can bump your per-process fd limits pretty high; Darwin goes all the way up to 64K.</p>

<p>I've also been reading up on Tux, which is Linux's replacement for the first-generation khttpd, an HTTP daemon that ran entirely in the kernel.  Tux is pretty darn fast, and will end up being faster than userland httpds, even if they use low-cost event frameworks like epoll or kqueue, simply because it doesn't have to context switch.  Tux is effectively the upper bound on performance for Linux httpds, so it's a good goal to shoot for.</p>

<p>I think a kqueue-enabled BSD httpd could get pretty close to Tux's speeds, and I think a kernel-level FreeBSD httpd could even beat Tux's speeds.  I'm still learning about the newer FreeBSD features like netgraph, zero-copy sockets and zero-copy APIs, but it sounds like I could get some pretty high rates if I ran this code on a FreeBSD box and employed them.</p>

<p>Hmmm, and I just noticed that Red Hat has their RH Content Accelerator, which is based on the Tux code base, but has all sorts of zero-copy and performance enhancements made to it.  So that seems to be the one to aim for performance-wise.</p>

<p>Now, I know that FreeBSD can hit some pretty high speeds; many HTTP cache and load balancer companies use it as their OS of choice for their products.  They make a lot of changes (and I bet a lot of their stuff runs in the kernel instead of userland), so it's not vanilla FreeBSD, but I know it will scale up pretty well.</p>

<p>I've been thinking a lot about FreeBSD and Darwin lately.  I love OS X and Mac hardware.  I think the proper server niche for both is to use FreeBSD for load balancers and HTTP caches, and to use OS X for dynamic content serving (like audio, images or video).  FreeBSD doesn't have rich multimedia APIs, but it makes up for that with super-fast network performance and lots of heavy-lifting APIs for shuffling bytes around.  OS X isn't as fast as FreeBSD (though it still runs at a sizeable fraction of FreeBSD's performance), but you can employ technologies like AltiVec, Core Audio, Core Image, and Core Video.</p>

<p>The two OSes combined make a really nice platform target, since OS X is already using large parts of FreeBSD for its userland base, and has ported a lot of FreeBSD kernel features too.  I'm waiting for netgraph to make its appearance in OS X, but maybe that will never happen.  Still, you have to look at your workload.  If you're scaling images all day long, then even if your FreeBSD-tuned httpd is 2x as fast as the one built for OS X, the OS X one is probably going to spit out the scaled image with a lower overall wall clock time because it has access to much more DSP power in the form of AltiVec and Core Image.  So as long as you don't blindly stick to one or the other, you can have your cake, and eat it too.</p>

<p>Now I need to go and read up on FreeBSD in-kernel httpds; there have been a couple of those posted to the lists over the years...</p>

<p>BTW, I picked up the latest edition of Stevens's Unix Network Programming book today.  I loaned out my second edition to someone at Amazon, but silly me, forgot to write down who it was.  I've been flying without a copy of UNP when doing my recent server work, which makes me feel a little naked.  It's nice to have it back, and an upgraded edition to boot!</p>]]></description>
         <link>http://blog.beaver.net/2005/04/kqueue_and_libevent.html</link>
         <guid>http://blog.beaver.net/2005/04/kqueue_and_libevent.html</guid>
         <category></category>
         <pubDate>Fri, 15 Apr 2005 15:21:25 -0800</pubDate>
      </item>
      
   </channel>
</rss>
