May 25, 2005

E3 and recent announcements

A good friend of mine went to E3 to cover it for the local newspaper recently, so I got to hear about all the exciting video games that are going to be coming out soon. That, and hearing the final specs for the Cell processor in the PS3, was my big excitement for the past couple of weeks.

ArsTechnica did a good article covering the Xenon, the new IBM-made processor inside Xbox2. I'm still much more of a Cell fan than a Xenon fan, but my interest lies more in the video and audio stream capabilities of the Cell than its video game capabilities. And I'm getting moderately excited for the Xbox2, although the odds are very low I'd buy one, simply because one, I've bought into the PlayStation franchise since the PS1, and two, I don't really feel like rewarding Microsoft for their bad behavior, even if Xbox is a separate unit in the company.

I almost bought an Xbox because of Splinter Cell, though. Luckily, the second and third installments in that series came out for PS2. SC3 has reworked the controls and features, and it's much more fun than Metal Gear 3 was (I know, sacrilege!).

Recently, Slashdot linked to an EE Times article on IBM opening up the Cell architecture, presumably to entice open source involvement. I was pretty excited to read that. I just hope that the hardware options don't suck the way the PS2 Linux kits did. Yes, they booted Linux, but you had to have a special boot CD to run software you wrote (i.e. you couldn't make DVDs that held Linux images and give them to friends who didn't have Linux kits), and most of the interesting bits of the hardware weren't documented.

Now if you can afford the development box (and a Sony license), you can develop games for the PS2. The outlay there is something like $10k for the hardware, and substantially more for the license to distribute, but I forget the exact numbers. The PS3's hardware specs are great, so I'm hoping that Sony relaxes their stance a bit and gives the open source tinkerers more access to the hardware. Imagine MythTV running on a PS3; that would be great.

Hmmm, what else... The local Ruby group is having a hack fest this weekend, so I'm going to take some time and show up for the hacking and socializing. I haven't done a lot of Ruby the past six months, just the RubyCocoa work I did. I really miss it, but with the Java and HTTP stuff I've been doing, I haven't had a lot of extra time.

Posted by djb at 11:42 PM | Comments (0)

May 13, 2005

Disks in the ether

I was browsing the new Linux Journal and saw the article on ATA-over-Ethernet, where raw disk blocks are transmitted over ethernet and used to build cheap SANs, instead of the traditional fibre channel route.

FC gives you raw access to drives, so this is taking the same approach but with ethernet. I don't quite know what I think about it yet. But it sounds promising.

Upon further research, I also found iSCSI, which carries SCSI commands over TCP/IP, and so over ordinary ethernet. Originally, iSCSI just ran on gigabit networks, but with 10GbE becoming common in datacenters, it's becoming very competitive with FC, which normally goes 2Gbps and can do up to 4Gbps. And there are new iSCSI enhancements that let you do error correction and route fixing, similar to how Xsan detects breaks in the FC fabric and can pass blocks through one of N controllers.

I need to read more about iSCSI, but it sounds very interesting. I had never heard of ethernet-based solutions for reading/writing raw disk blocks before now, but they have several upsides over traditional FC SANs, the main ones being higher speed and unlimited range. Well, unlimited range in the case where you're streaming versus doing random access. If you're reading a 50MB range of contiguous blocks from a remote disk, then the N ms latency isn't a big deal since you're going to take an initial pause of N ms and then get all your data in one nice stream. But if you're doing random seeks on a disk that's 50ms away, then you're gonna cry.
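
To put rough numbers on that, here's a back-of-the-envelope sketch in Ruby. The 50MB read and the 50ms latency come from the example above; the 100MB/s of usable throughput, the 4KB block size, and the one-round-trip-per-request model (no pipelining, no read-ahead) are assumptions picked purely for illustration, not measurements of any real iSCSI setup.

```ruby
# Back-of-the-envelope model: time to read data from a remote disk.
# Assumed numbers (not measurements): 100 MB/s of usable link throughput,
# 4 KB blocks, and one full round trip per request with no pipelining.

LATENCY_S      = 0.050                # 50 ms round trip, from the example above
THROUGHPUT_BPS = 100.0 * 1024 * 1024  # assumed usable bytes per second
TOTAL_BYTES    = 50 * 1024 * 1024     # the 50MB read from the example above
BLOCK_BYTES    = 4 * 1024             # assumed block size for random access

# One request for a contiguous range: pay the latency once, then stream.
def streaming_seconds(bytes, latency, throughput)
  latency + bytes / throughput
end

# One request per block: pay the latency again on every block.
def random_seconds(bytes, block_bytes, latency, throughput)
  requests = (bytes.to_f / block_bytes).ceil
  requests * (latency + block_bytes / throughput)
end

stream = streaming_seconds(TOTAL_BYTES, LATENCY_S, THROUGHPUT_BPS)
random = random_seconds(TOTAL_BYTES, BLOCK_BYTES, LATENCY_S, THROUGHPUT_BPS)

puts format("streaming: %.2f s", stream)
puts format("random:    %.2f s", random)
```

With those assumptions the streaming read takes about half a second, while the block-at-a-time version takes over ten minutes, nearly all of it spent waiting on round trips.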

I'm thinking the WAN disk solution probably isn't used as much as the LAN/SAN one, so latency probably isn't a big deal for most installs. But since you're making the disks available as a raw device, I wonder how disk fragmentation comes into play and whether or not that starts to limit your throughput. I'm guessing FC installs have similar issues, but since their range is so limited, other sources of error/slowness show up first.

Using ethernet instead of FC makes sense from the perspective that so much more brainpower is going into making ethernet-based switches, routers and computers use the network as efficiently as possible. There might be 1000 engineers worldwide working on FC enhancements, but you could imagine there are 100,000 working on ethernet hardware and protocols. FC does do things that TCP/IP can't, so it's not like FC is going to die in the storage market. I just think that for many sites, its benefits won't make up for its cost, and iSCSI or an alternative implementation will be used.

And that doesn't mean that userland SAN (like Google FS) is going away anytime soon either. It's certainly the cheapest option out there for people who want to build storage clusters, and putting abstractions in front of the raw disk blocks and serving up data via HTTP or similar protocols gives you capabilities you can't do with iSCSI.

For example, while some companies use SAN systems for storing databases or high update velocity filesets, a lot of folks use SANs to store filesets that are fairly static. Think Netflix storing raw VOB images of all the DVDs they rent, or Amazon's archive of product images. Once you store a copy of something, you rarely update it, if at all.

And if your objects aren't being updated very often, then that screams out for caching. Using HTTP or similar protocols to transport data lets you plug in caching and load balancing pretty easily, but raw blocks over the network with solutions like iSCSI aren't exactly easy to cache. I'd imagine they're pretty hard to load balance as well.

But, I admit that my brain is fairly addled with the userland SAN point-of-view, so I'm going to do some research on blocks-over-the-network technologies and think about them some more. The major storage players are deploying/selling iSCSI systems now, and they're not dummies, so I want to read up more on the technologies.

As an aside, it seems that the more interesting technologies I run into, the more my home datacenter (and its corresponding budget) expands. Some people buy boats, some buy cabins at ski resorts, but I spend my extra money on servers, disk, and switches. As it is now, I can't run my hairdryer in the bathroom, since my office is on the same circuit, and the hairdryer combined with my machines trips the circuit breaker. If I end up staying in my place, I'm going to get an electrician to come over and install two isolated circuits in my office so I can run all the servers and blinky-light boxes my heart desires.

Posted by djb at 01:21 PM | Comments (0)

May 05, 2005

Co-dependent co-processors

I remember in the early '90s, debating with a friend of mine whether or not a 40MHz 386SX would be faster than a 25MHz 386DX. Co-processors were still a big deal then, with both the 386 and 486 processor lines having models that didn't have a dedicated FPU.

I think that co-processors are poised to make a huge comeback in the coming years. There are lots of new computing technologies coming down the road where it doesn't make sense to run an entire operating system on top of them.

I'm talking about things like quantum computing, DNA/Molecular computing, GPU/PPU/FPGA chips, and IBM/Sony/Toshiba's new Cell processor.

It's not like the idea of the co-processor has really died, anyways. PCs have had add-on sound and graphics cards for 10-15 years now; it's just that instead of having a socket on the motherboard, we plug cards into the expansion bus. The buses used to be a lot slower than using a socket to get a direct line to the cpu, but modern buses are very fast. PCIe is making it possible for people to create 4-8 port gigabit ethernet cards without worrying about oversaturation, since PCIe currently scales from 200MB/s to 3.2GB/s (in both directions!). The connections between cpu and memory will probably always be faster than those between the cpu and its peripheral devices, but modern buses deliver a much higher fraction of that performance than in the olden days of 33MHz PCI.
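
A quick sanity check on the multi-port ethernet claim, as a Ruby sketch. The PCIe range is the one quoted above; the 133MB/s figure is the theoretical peak of 32-bit/33MHz PCI, and the comparison assumes every port runs flat out in one direction, which is a worst case rather than typical traffic.

```ruby
# Decimal units throughout: 1 Gbps = 1000 Mbps = 125 MB/s per direction.
GIGABIT_PORT_MB_S = 1000 / 8.0

# Bus capacity per direction, in MB/s. The PCIe range is the one quoted
# above; 133 MB/s is the theoretical peak of 32-bit/33MHz PCI, which is
# shared by every card on the bus.
BUSES = {
  "33MHz PCI"       => 133.0,
  "PCIe (low end)"  => 200.0,
  "PCIe (high end)" => 3200.0,
}

# Worst-case demand of a card with the given number of gigabit ports.
def card_demand(ports)
  ports * GIGABIT_PORT_MB_S
end

[4, 8].each do |ports|
  demand = card_demand(ports)
  BUSES.each do |name, capacity|
    verdict = demand <= capacity ? "fits" : "saturates the bus"
    puts format("%d ports (%.0f MB/s) on %s: %s", ports, demand, name, verdict)
  end
end
```

Even the 8-port card, at 1000MB/s, fits comfortably inside the high end of the PCIe range, while a single gigabit port already eats most of an old PCI bus.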

Ultimately, I envision physicists plugging four PPU cards into their cluster hosts and running specialized software to model n-body simulations, allowing them to purchase 1/10th the number of normal cluster hosts, which means lower power, cooling, and network requirements, which end up being the hidden costs of clusters. Or an OS X-based HDTV video streaming host that has 4 Cell boards plugged in for offloading complex encoding tasks to Cell's optimized architecture, freeing the machine's CPU and GPU for UI, job control, brokering jobs, streaming the encoded streams over the network, etc. The list goes on and on.

With these hypothetical systems, the OS still runs on the CPU architectures we've grown to love and hate, it's just that certain types of computations are pushed off onto specialized hardware. It's taking the GPU model and applying it not just to graphics, but any type of computation. PPUs are starting to get a bit of mindshare, so I think it's only a matter of time before the onboard CPU becomes less of a measure of performance compared to the add-on processing boards that are added to a machine.

Servers might still just have vanilla CPUs, but multimedia and scientific applications could take huge advantage of specialized engines. Think of the performance increase you get with Altivec on bioinformatics algorithms, and multiply it by one or two orders of magnitude. Altivec is more specialized than a cpu, but there are other solutions out there that are more specialized, if your problem space is narrow enough. For example, FPGAs can SCREAM on certain types of algorithms, and blow away even dedicated vector units like Altivec. There's a reason why Echelon runs on FPGA-based hosts with solid-state memory instead of Xserves. :-)

And the more exotic technologies (quantum, molecular, and optical computing) aren't too far away. Maybe the first coarse models in 5-10 years? I read about quantum crypto getting small scale field tests today, so I wouldn't be surprised to see expansion boards for it in five years. Of course, all this talk of expansion boards assumes that the technology is small enough to fit on a board inside a normal case, which is a big assumption. But I like to believe that eventually, they will be that small.

And the physics processing units (PPUs) wouldn't necessarily be for scientists. Imagine how much more intelligent artificial opponents in games could be if all the graphics and physics calculations are taking place in your GPU and PPU chips, leaving your CPU free to focus on AI and networking. I love to boot up old games in my emulators and remember how basic the graphics and sound were 15 years ago; I think we don't stop often enough to reflect and marvel at the progress computing technology has made in our lifetimes. It can be sobering if you think about it.

In fact, a couple years ago, a good friend of mine and I were talking about when we'd start to see the first cases of PTSD from playing over-realistic FPS games. I'd argue that there are probably cases out there already, but I think we could see widespread mental health issues 5-10 years down the road when the games are that much more real.

Posted by djb at 12:06 PM | Comments (0)

May 03, 2005

One is the loneliest number...

I saw tonight's Slashdot link on Orion's 96 cpu desktop cluster and it got me thinking.

It has 96 1.2GHz x86 cpus, which is an interesting direction to take. I was thinking tonight about how power, speed and price are intertwined with each other, and I wonder if Orion's system is signaling a future trend. More cpus per box, but with relaxed speeds for each core. Their system costs $100k, but the main concept could be used to build cheaper systems than that.

Imagine 32 1.2 GHz cpus (relative to today's Opteron or G5 processors) on a single host. You wouldn't be doing crazy 2-3GHz speeds, which means much less power consumption, and higher tolerances for manufacturing error. Perhaps slower speeds and power consumption would let you fit more cores per die compared to the models that maximize performance-per-core. For example, maybe I can fit four 1.6GHz cores per socket instead of the normal pair of 2.5 GHz cores. More cores with slower speeds means you'll have to utilize each core more fully, though.
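
Here's a crude way to see what "utilize each core more fully" means, using Amdahl's law, which isn't mentioned above but fits the question. It's a sketch under loud assumptions: performance scales linearly with clock speed, the parallel part of a job spreads perfectly across cores, and the two configurations are the hypothetical ones from the paragraph above.

```ruby
# Amdahl-style model. Time is in arbitrary units: a job that takes 1.0
# on a single 1GHz core. Assumes speed scales linearly with clock and
# that the parallel portion divides perfectly across cores.
def run_time(parallel_fraction, cores, ghz)
  serial   = (1.0 - parallel_fraction) / ghz
  parallel = parallel_fraction / (cores * ghz)
  serial + parallel
end

[0.5, 0.8, 0.9, 0.99].each do |p|
  fast_pair = run_time(p, 2, 2.5)  # the "normal pair of 2.5GHz cores"
  slow_quad = run_time(p, 4, 1.6)  # the hypothetical four 1.6GHz cores
  winner = slow_quad < fast_pair ? "four 1.6GHz cores" : "two 2.5GHz cores"
  puts format("parallel fraction %.2f: %s win", p, winner)
end
```

Under this model the four slower cores only pull ahead once roughly 84% of the work can run in parallel, which is a decent argument for getting good at concurrency.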

Along those lines, I liked Herb Sutter's recent article in DDJ about the coming concurrency revolution. He described how performance-per-core isn't increasing as much as it used to, and how concurrency will soon become a required skill for most developers, similar to how OO is such a common idiom today. I agree!

Concurrency is hard, though. I've been working with threaded software for a couple of years, and the more I learn, the more I realize how damn hard it is to get right. Herb mentioned that perhaps future languages will have greater support for concurrency built-in, and hinted at a future article covering that topic. I'm very much looking forward to reading it.

I don't know when low-level concurrency primitives will be a requirement for programming languages, but probably not for a while. I like the new java.util.concurrent package in Java 5.0; it really expanded my thread toolbox. I bet that future Java releases will include more support for concurrency, since 5.0 showed that Sun wasn't afraid to make relatively large changes to the language features and syntax.

I think that deeply understanding concurrency will be a great long-term skill to have. I recently started a long-term project so I could spend more of my research/study time learning new things about threaded programming and its philosophy. I'm going to spend 3-6 months refreshing the basics and just doing lots of concurrent programming, and then I'll start to study more deeply on interesting topics that pop up during the first period. I'm guessing there'll be a lot of time spent on locks, transactions, and ordering...

Posted by djb at 11:34 PM | Comments (0)

Ruby debugger hacking

ZenSpider and I got together this morning since we hadn't hung out in quite a while. He showed me his new zenprofiler, which uses his RubyInline module to create a hybrid Ruby/C profiler that is much faster than the default pure-Ruby profiler, but with much more readability and less verbosity than a pure-C implementation hooked into Ruby itself.

Meta<Foo>

Ruby's profiler/debugger is a lot like Perl's. The stock introspection tools aren't written in C, they're written in the high level language. The core C engine of the language makes calls to callback hooks whenever a line of code is executed, a function is called, etc, but all the smarts and UI live in high level code. This is mostly a good thing, except it means that your debuggers and profilers can have a high overhead. Still, those are problems that can be fixed.

The philosophy of implementing core functionality in a higher-level language is one of the reasons Ryan is working on RubyToC and metaruby, because he believes that if the Ruby engine itself was written in Ruby, it would be easier to find developers willing to add features to the language, and maintenance overall would be much easier.

Now, self-bootstrapping is one of those through-the-rabbit-hole things that freaks you out at first. But Ryan is one of those guys who just groks compilers and low level stuff, and he's a patient instructor. I'm not at a point where I can jump in and pair with him and Eric on tasks yet, but I can read the code and grok what's going on, and not make funny noises while I read the code.

If you want to read more about their RubyToC and metaruby projects, read their blog category for metaruby, and check out their overview slides for the project. It's really great stuff, and just a small slice of what the Ruby community as a whole is working on.

Ruby debugger guts

But, back to the zenprofiler and the debugger work we did today. zenprofiler uses RubyInline, a module Ryan wrote which was inspired by Brian Ingerson's set of Inline modules for Perl. They allow you to define C code inside your Ruby code, and at runtime, the C code is automatically built and linked into your program. The resulting shared libs are cached for future runs.

Now, the default Ruby profiler and debugger are very slow beasts. That's because Ruby's debugger and profilers tell the core engine to activate a trace function which is called every time a new line of code is executed, a wrapped C call is made, a function returns, etc, etc. The trace function receives information about the state of the interpreter, and does appropriate things with it, depending on if it's a profiler or a debugger.

The trace func called by the C engine doesn't differentiate between different code states; it just passes a string to the trace func which says "call", "c-call", "return", etc, etc. So your trace function is called for every line of code, even if it doesn't need to be.
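
To make that concrete, here's a small sketch using Ruby's classic set_trace_func hook, which hands the event name to your proc as a string (the C events are spelled with hyphens, "c-call" and "c-return"). The add method and the counting logic are made up for illustration; this isn't code from the stock debugger or from zenprofiler.

```ruby
# Count how often the interpreter invokes the trace function, per event.
# set_trace_func is the classic hook; newer Rubies mark it obsolete in
# favor of TracePoint, but it shows the string-based events plainly.

def add(a, b)
  a + b
end

counts = Hash.new(0)

# The proc receives the event name as a string, plus interpreter state.
set_trace_func proc { |event, file, line, id, binding, klass|
  counts[event] += 1
}

add(1, 2)
add(3, 4)
add(5, 6)

set_trace_func(nil)  # turn tracing off again

counts.sort.each { |event, n| puts "#{event}: #{n}" }
```

Three trivial method calls already produce a pile of callbacks: a "call" and a "return" for each, plus a "line" event for every line executed along the way.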

Our work today centered around pushing as much of the switching logic inside the debugger's trace function down into an inlined C class. There are still some segvs to be worked out, but we saw fairly encouraging initial results. In the next iteration, we're probably going to do something I've been wanting to do for quite some time. If you're in profiler mode, you have to make annotations for each line of code and each function call/return. Debuggers, not so much! With a debugger, you're setting breakpoints and telling the code to run, or you're single-stepping through the code. If you're single-stepping, then overhead isn't a big issue for you.

But if you're setting breakpoints, the code should run as close to non-debugger speed as possible! That's the next set of changes, I think. A small set of hooks that makes the trace func only get called on lines or functions that we want to break on, and in all other cases, skips the trace func altogether.
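
Here's a sketch of the problem in pure Ruby. The helper and target methods and the breakpoints set are hypothetical, and a real debugger would drop into a prompt instead of recording a hit. The point is that the filtering happens inside the trace func, after the interpreter has already paid for the callback, which is exactly the cost we want to push down into the engine.

```ruby
require "set"

def helper(n)
  n * 2
end

def target(n)
  helper(n) + 1
end

breakpoints = Set[:target]  # hypothetical: break whenever target is called
invocations = 0
hits = []

set_trace_func proc { |event, file, line, id, binding, klass|
  invocations += 1
  # The breakpoint check runs here, in Ruby, on every single event.
  hits << [id, line] if event == "call" && breakpoints.include?(id)
}

i = 0
while i < 100
  helper(i)  # lots of work we never asked to break on
  i += 1
end
target(1)

set_trace_func(nil)

puts "trace func invoked #{invocations} times, breakpoint hits: #{hits.size}"
```

The trace func fires hundreds of times to find a single breakpoint. If the engine only invoked it for lines or functions on the breakpoint list, nearly all of that overhead would disappear.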

With the stock debugger, we were seeing slowdowns of 100-250x when running some regression tests with REXML, a pure-Ruby XML parsing library. I think we can get the in-debugger performance of breakpoint code to be 1.05-1.1x the normal performance, with maybe 10-20 hours of work. RubyInline makes this kind of stuff a lot easier, although in this case, we'll have to make some core interpreter changes just because of what we want to do. But it will be pretty sweet.

How random hacking can pay off ten-fold down the road...

I've always been very interested in scripting language debuggers. Back when Amazon was using perl 5.003 in 1997, I back-ported the 5.004 debugger and added a bunch of features/macros to make it easier on me when fixing code. I spent a few hours a night for about 10 days, poring over the debugger's code, learning all of its intricacies and gotchas, and ended up with a pretty deep knowledge of perl internals because of it. The debugger used all sorts of tricks with symbolic references and symbol globs in order to peek at the symbol table and properly handle Perl's control flow, which we all know can be a bit... chaotic, to put it nicely.

And the funny thing about learning about that sort of stuff is that you never know when it will come in handy. When we were grafting Mason onto our new website display engine in 2001-2002, all that knowledge I learned from hacking the Perl debugger paid off in spades, since I was able to work at different layers of our embedded perl engine and fix problems quickly, without having to mentally context switch too much. But when I originally worked on my debugger fixes, I was simply focused on peeling apart a tool I used often to see how it worked; I had no goal or plan. I just wanted a few annoying bugs to get fixed, and once I did that, I realized I had enough knowledge to start adding features. I think it's similar to people who move to France, and bemoan the fact that they can't learn French, until all of a sudden one day, they're talking to themselves about groceries they need to buy, and they realize they're doing it in French! I love aha! moments like that.

I started thinking about some favorite aha! moments over the years, but then I realized that's a separate blog entry altogether. But I do have fond memories of reverse-engineering saved game formats on my Apple IIgs (Bard's Tale, anyone?) and playing games methodically to figure out their logic, and then exploiting that logic in order to get past tough spots. I guess I always had a bit of a hacker spirit, even when I was a little kid.

I blame the legos.

Posted by djb at 06:05 PM | Comments (0)