NOW-Sort: Fastest Disk-to-Disk Sort in the World

April 1, 1997: NOW wins fastest sort trophies

The NOW team was awarded two trophies for having the fastest disk-to-disk sorts in 1997! The first of these was for the Datamation benchmark, and old industry standard that requires the entrants to sort 1 million 100-byte records as fast as possible. A 32-node NOW cluster was able to sort this amount of data in 2.41 seconds, beating the old record by more than a full second!

The more interesting challenge was MinuteSort; how much data can you sort in one minute of elapsed time? For the 1997 competition, the NOW team sorted 6.0 GB in just under 1 minute. The last posted record was 1.6 GB.

Update April 1, 1998: NOW-Sort retains records for another year

By running on 95 Ultra1 machines, NOW-Sort set a new MinuteSort record: 8.41 GB! The Datamation record stays the same at 2.41 seconds.

The biggest challenge came from Ordinal Technologies, whose commercially available nSort program sorted 5 GB in under one minute on the 1997 SGI Origin machine. Their work earned them the commercial MinuteSort trophy, known as the "Daytona" award. For more details about these benchmarks, visit the page.

What is NOW-Sort?

NOW-Sort is currently the fastest disk-to-disk sorting program, as measured by two well-known database benchmarks: Datamation and the Indy-version of MinuteSort. For more details about these benchmarks, visit the Sort Benchmark page.

NOW-Sort competes for the Indy trophy; that is, sorting programs that are not commercially supported products. The Daytona award goes to the fastest commercial product.

NOW-Sort is a collection of sorting implementations on a cluster of Sun workstations connected with a high-speed Myrinet switch, and is adaptable to the number of available processors, the amount of main memory, and the number and the relative speed of the local disks. The parallel implementations of NOW-Sort are built on top of two key pieces of NOW technology, Active Messages and Glunix. To acheive high I/O performance and minimize communication, the layout of records to local disks is carefully managed.

The basic algorithm when all records fit in the collective memory of the workstations is as follows. Each processor reads its records from its local disk into main memory. A bucket sort determines the workstation that will contain this record after all have been sorted. Active messages are used to send groups of records from the processor that read the key from disk to the destination workstation. After all records have been sent and received, each processor sorts its local keys and writes them to local disk.

When there are more records to be sorted than fit in the memory available in the cluster, two passes must be made over the records. In the first pass, the records are sorted into multiple runs according to the algorithm described above, where each run contains the amount of records that fit in main memory. In the second pass, each processor merges the runs that reside on its local disk, writing out the final sorted run to another local disk.

What is our hardware configuration?

In our work, we evaluated two different cluster environments. The first consists of 64 commodity UltraSPARC I workstations, each with 64 MB of memory. Each workstation houses two internal 5400 RPM Seagate Hawk disks on a single fast-narrow SCSI bus. The second cluster connects 8 more fully-equipped UltraSPARCs. Each contains 128 MB of main memory and an extra fast-wide SCSI card, with two 7200 Seagate Barracuda external disks. The workstations are connected through a Myrinet switch.

What is our performance?

The Datamation sorting benchmark was introduced in 1985 by a group of database experts as a test of a processor's I/O subsystem and operating system. The performance metric is the time to sort 1 million 100-byte records, where the first 10-bytes are the key.

NOW-Sort performs the Datamation benchmark in 2.41 seconds, more than 1 second faster than the previous record (3.52 seconds, held by an SGI Challenge). This record was attained on a 32-node cluster of UltraSPARCs. Roughly 1.1 seconds of the 2.41 was spent in start-up, clearly a problem for the GLUnix prototype.

Recognizing that the Datamation benchmark has become more of a test of startup and shutdown time, MinuteSort was introduced in 1994 in the landmark AlphaSort paper. The performance metric is now the number of 100-byte records that can be sorted in one minute. For the 1997 benchmark competition, NOW-Sort sorted 6.0 GB of data in 59.3 seconds, eclipsing the old mark of 1.6 GB, also held by the SGI Challenge Server. This was done on the full 64-node cluster. The next year, the record was increased to 8.41 GB on the full 95-node NOW cluster.

What are our conclusions?

  • NOWs are well-suited to I/O-intensive applications. The presence of multiple local disks per node enables a higher level of I/O performance than is traditionally possible from the I/O systems of MPPs.
  • Optimizing the single-node implementation is important for achieving scalable, high-performance parallel sorting algorithms. Examining the performance of individual workstations allows the programmer to isolate bottlenecks.
  • Disk I/O and fast communication can be overlapped, but the combination quickly stresses the limited bandwidth of the S-Bus in the UltraSparc I. This bottleneck implies that an UltraSparc with just two disks and a connection to a high-speed network may be the best building block.

If you want more information, please grab a draft of the paper on NOW-Sort

High-Performance Sorting on Networks of Workstations.
SIGMOD '97, Tucson, Arizona, May, 1997.
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson.
Available as: Abstract and PostScript

Here is another paper, more about the process of getting the sort to break the record:

Searching for the Sorting Record: Experiences in Tuning NOW-Sort.
The 1998 Symposium on Parallel and Distributed Tools (SPDT '98) , Welches, Oregon, August 3-4, 1998.
Andrea Arpaci-Dusseau,Remzi Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson.

A related paper compares sorting and other I/O primitives on workstations, clusters, and SMPs:

The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs.
HPCA 4, Las Vegas, February, 1998.
Remzi Arpaci-Dusseau, Andrea Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson.
Available as: PostScript

Check out our sorting bibliography! You will find references to sorting papers and links to other sorting work.