Architecture


(Note: This page describes the design of the system. Some of the things mentioned below are not yet implemented. It is left as an exercise to the reader to figure out which are the features that already exist. :)

    Overview

    The prototype consists of 20 PCs which host 370 8GB disks. The PCs are P6-200MHz (donated by Intel) with 98MB of RAM each. They run FreeBSD 3.0-current (developer's snapshot) with Justin Gibbs' CAM interface. The hosts are connected via switched 100Mbps Ethernet.

    To study the trade-offs between different hardware configurations, we use two different designs for individual nodes. A single Tertiary Disk node is composed of two PCs that share disks. Double-ending each disk to two PCs provides higher reliability: if a PC has a hardware or software failure, all disks connected to it remain accessible through the partner PC at the other end of the string. Both node designs use the Fast-Wide SCSI disk interface. The SCSI strings are shared between PCs, with two SCSI controllers per string. In normal mode (i.e. with no failures), each PC accesses half the disks. Both node designs have four SCSI controllers per PC.

    In node design 1, each SCSI string has 8 disks in one disk enclosure. In node design 2, each SCSI string has 14 disks in two disk enclosures. The complete prototype consists of eight nodes of design 1 and two nodes of design 2. (The picture to the left shows eight nodes of design 1.)
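    The following sketch illustrates the normal/failover split described above. It is only an illustration: the host names, device names, and the code itself are invented and are not part of the Tertiary Disk software.

        # Illustrative sketch only: how the disks on one double-ended SCSI
        # string might be split between its two PCs, and reassigned when one
        # PC fails. Host and disk names are made up.

        def assign_disks(disks, pcs, failed=None):
            """Return a mapping of PC -> list of disks it serves."""
            alive = [pc for pc in pcs if pc != failed]
            if not alive:
                raise RuntimeError("both PCs on the string are down")
            if len(alive) == 1:                  # failover: survivor serves everything
                return {alive[0]: list(disks)}
            half = len(disks) // 2               # normal mode: each PC takes half
            return {alive[0]: disks[:half], alive[1]: disks[half:]}

        string = ["da%d" % i for i in range(8)]             # 8-disk string (node design 1)
        print(assign_disks(string, ["td1", "td2"]))                  # no failures
        print(assign_disks(string, ["td1", "td2"], failed="td2"))    # td2 is down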

    Power and cooling for the disks are provided by the disk enclosures. For easier maintenance and monitoring, the enclosures are hot-pluggable and can be programmed from a remote host through a serial port. These features are important because, in large storage systems, management can cost even more than the storage itself. All of the components of two nodes of design 1, or of one node of design 2, fit in one 19-inch-wide, 7-foot-tall rack. Each rack contains PCs, disk enclosures and network switch hardware.

    Here are a PostScript picture and an xfig source that summarize the overall design.

    The system is designed to be an efficient web server. All disks are independent FFS filesystems. (I.e., we don't do any kind of parity-based RAID, although the whole system is a very good example of a Redundant Array of Inexpensive/Independent Disks. :) Within a single machine, we manage a tree of symlinks to give the application the illusion that all the data is in one filesystem. Between machines, files are distributed according to a set of placement rules, and a front-end machine (which is itself duplicated) forwards user requests to the appropriate server. The front-ends also monitor the status of the servers and hide machine failures from the user.
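    As a rough sketch of the single-machine symlink trick, one could populate the unified tree as below. The mount points and directory layout are invented for illustration and are not the actual Tertiary Disk paths.

        # Sketch: present many independent FFS filesystems (one per disk) as a
        # single tree by filling a directory with symlinks. All paths are
        # placeholders.
        import os

        DISK_MOUNTS = ["/disks/da0", "/disks/da1", "/disks/da2"]   # one FFS per disk
        UNIFIED_ROOT = "/export/data"                               # what the web server sees

        def build_symlink_tree():
            os.makedirs(UNIFIED_ROOT, exist_ok=True)
            for mount in DISK_MOUNTS:
                if not os.path.isdir(mount):
                    continue                        # disk offline or not mounted
                for name in os.listdir(mount):
                    link = os.path.join(UNIFIED_ROOT, name)
                    if not os.path.lexists(link):
                        os.symlink(os.path.join(mount, name), link)

        if __name__ == "__main__":
            build_symlink_tree()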

    [Pictures: network switch and UPS]

Network

The hosts are connected via a fast (100Mbits/sec) switched Ethernet. After testing 3Com 3c595TX (which actually came with the machines), SMC 9332DST (a fairly old card), and Intel EtherExpress Pro/100B, we decided on the Intel card, which had the highest throughput and lowest overhead. SMC was a close second; the 3Com card was a distant third, both in FreeBSD and Windows NT.

In case you are interested, these are the kinds of numbers we were seeing. All figures are sequential bandwidth, disk to disk, measured by ftp'ing a large file from one machine to another over a 100BaseTX crossover link. The "disk" on each end is actually a large striped filesystem of 14 disks. (A minimal sketch of the bandwidth arithmetic follows the results table.)

Legend:

  • 3 - 3Com 3c595TX
  • I - Intel EtherExpress Pro/100B
  • S - SMC 9332DST

Results:
Server       Client       Throughput
FreeBSD/I    FreeBSD/I    >10 MB/s
FreeBSD/S    FreeBSD/S    >10 MB/s
FreeBSD/3    FreeBSD/3    4.5 MB/s
FreeBSD/I    WinNT/I      4.4 MB/s
FreeBSD/3    WinNT/3      3.3 MB/s
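
The throughput figures above are simply bytes moved divided by elapsed wall-clock time. A minimal sketch of that arithmetic, applied here to a local sequential read (the file path and block size are placeholders; the table itself was measured with ftp transfers between two machines, not with this script):

    # Sketch: compute sequential bandwidth as bytes read / wall-clock seconds.
    import time

    def read_bandwidth(path, block_size=64 * 1024):
        start = time.time()
        total = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(block_size)
                if not chunk:
                    break
                total += len(chunk)
        return total / (1024.0 * 1024.0) / (time.time() - start)   # MB/s

    if __name__ == "__main__":
        print("%.1f MB/s" % read_bandwidth("/disks/da0/bigfile"))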

SCSI Subsystem

The prototype has 370 8GB IBM disks, half of which were donated by IBM. The SCSI adapters are Adaptec 3940UW (twin-channel, Ultra-wide). It was necessary to use twin-channel adapters because we needed 4 to 5 SCSI strings per host, with only 4 PCI slots on the motherboard (one of which is used by the network card).

Our SCSI strings are double-ended, i.e., each SCSI string has a SCSI adapter at each end (one in each PC). We use external feed-through terminators to provide termination on the bus even when a SCSI adapter completely loses power (for instance, because of a PC power supply failure). We could not run the SCSI strings in 20MHz (Ultra) mode with feed-through terminators. This is probably because there is a "stub" of about 30cm inside each PC that connects the second channel of the 3940UW to the feed-through terminators (which are mounted outside the PC case).

This is what each SCSI string looks like with feed-through terminators:

(A larger photo of one double-ended pair of node design 1 is available.)

The disks will be either shared between both machines (mounted read-only from both) or owned by one machine (mounted read-write on one machine, after being unmounted from the other). We are able to use such coarse-grain sharing because of the nature of the application: most accesses are reads, and writes come in spurts and can be delayed if necessary.
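A minimal sketch of this ownership hand-off is below. The device paths and the use of plain mount/umount commands are assumptions for illustration; the actual Tertiary Disk management code may do this differently.

    # Sketch of coarse-grain disk ownership: a disk is unmounted on its current
    # owner before being mounted read-write on the new owner. Device and mount
    # paths are placeholders.
    import subprocess

    def release_disk(mountpoint):
        """Run on the current owner: give up the filesystem."""
        subprocess.run(["umount", mountpoint], check=True)

    def take_ownership(device, mountpoint):
        """Run on the new owner: mount the FFS filesystem read-write."""
        subprocess.run(["mount", "-t", "ufs", device, mountpoint], check=True)

    def mount_read_only(device, mountpoint):
        """Either PC may mount a shared disk read-only."""
        subprocess.run(["mount", "-t", "ufs", "-r", device, mountpoint], check=True)

    # Example hand-off from td1 to td2 (hypothetical hosts):
    #   on td1:  release_disk("/disks/da3")
    #   on td2:  take_ownership("/dev/da3", "/disks/da3")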

Disk and SCSI adapter errors are detected by monitoring the kernel message buffer. All requests for files on a failed disk are forwarded to the backup copy.
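
As a sketch of the detection side, a monitor could periodically scan the kernel message buffer (e.g. via dmesg) for disk errors and redirect requests for affected files to their backup copies. The error patterns and function names below are invented for illustration, not the exact kernel output or the actual monitoring code.

    # Sketch: scan the kernel message buffer for SCSI/disk errors and keep a
    # set of failed disks so requests can be redirected to backup copies.
    import re
    import subprocess

    ERROR_PATTERN = re.compile(r"\((da\d+):.*(MEDIUM ERROR|timed out|lost device)")

    def scan_for_failed_disks():
        dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        failed = set()
        for line in dmesg.splitlines():
            match = ERROR_PATTERN.search(line)
            if match:
                failed.add(match.group(1))
        return failed

    def pick_copy(primary_disk, backup_disk, failed_disks):
        """Serve from the backup copy if the primary disk has failed."""
        return backup_disk if primary_disk in failed_disks else primary_disk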

Redundancy

The Tertiary Disk prototype is designed around the principle that, if you have enough hardware to begin with, you can add just a little extra to make the system redundant, thus eliminating many single points of failure and other weaknesses.

