System support for Distributed Supercomputing on a Network of Workstations (NOW).
David A. Patterson, Thomas E. Anderson, David E. Culler, University of California at Berkeley.
Funded by Rome Laboratories, Griffiss Air Force Base.
The Berkeley NOW project is to provide software and hardware support for using a network of workstations (NOW) as a distributed supercomputer on a building-wide scale. Recent technology advances in switch-based networks, such as ATM and Myrinet, have made it possible to closely integrate processors, memory, and disks across the network. Example applications that will benefit from this work include databases, file servers, computer-aided engineering tasks, large simulations, large scale information servers such as Prodigy, as well as parallel scientific programs.
We are conducting research and development into network interface hardware, communications protocol software, network-wide resource management, distributed scheduling, and parallel file systems. Our approach is to leverage as much as possible from commercially available systems for fast prototyping and to track the dramatic growth in the technology. The demonstration is to be a 100 processor system that is (i) high performance (delivering a large portion of the NOW to demanding sequential and parallel applications while guaranteeing good performance to interactive users) (ii) incrementally scalable (system capacity can increase simply by adding more workstations), (iii) fault tolerant (the system continues to be usable even when a machine fails), and (iv) easy to administer. If sucessful, this project has the potential to redefine the high-end of the computing industry.
The NOW project aims to use desktop hardware and software and switch-based LANs as the building block for scalable computing systems. Not only do these desktop technologies dominate other technologies in cost/performance, but the fast time-to-market of desktop products means that systems built out of desktop building blocks will have an advantage in absolute performance as well.
We are developing key technology required for low-latency, high-bandwidth communications on networks with orders of magnitude better hardware latency and bandwidth than todays LANs. Active Messages will provide fast, reliable communication, avoiding operating system intervention through the technique of network virtualization. The GLUNIX global layer resource management will coordinate access to the system's resources for both sequential and parallel jobs. The file system project will take advantage of the systems disks for storage and memory for cooperative caching. The network RAM project will make the sys tem's DRAM globally available using a fast network interface increasing the peak memory size available to a machine by a factor of 100X.
As one example application, we have constructed the Inktomi Berkeley Search Engine. This scalable WWW search engine uses NOW technology to answer search queries 10 times faster than the contemporary commercial services of similar capacity using half as much hardware.
The NOW research will change the way large-scale systems are designed, enabling the construction of highly integrated systems geographically distributed over, for instance, the dimensions of a building or ship. These systems will allow communication at the speed of today's MPPs to harness the resources of hundreds of computers towards a single goal but with a seamless interface that makes using those hundreds of machines as simple as using a single workstation today.
During the last period we have completed milestones related to the construction of our NOW cluster. First, we have installed and tested new Myricom Lanai4.1 Sbus Myrinet cards in our 32 node NOW-1 cluster. This is now in "production" use for application development. We have tested and deployed a scalable 2nd-tier network connecting NOW-1 to the external world, consisting of switched ethernet with a 155Mb/sec ATM backbone.
We have installed 70 of the new UltraSparc 1 machines from Sun, forming an most of the NOW-2 UltraSparc cluster. These machines are in the process of being outfitted with switched ethernet and Myrinet interfaces.
We have developed a structured wiring concept and cable management plan to allow an efficient and robust construction of a large NOW system.
We have completed construction and testing of the initial GAM (Generic Active Messages) on Myrinet and constructed GAM on the Paragon and Meiko. We have conducted several studies implementing NOW system components on GAM and introducing experimental extensions as candidates for our final Active Message definition.
We have constucted a GAM communication microbenchmark suite which serves as a calibration tool for the entire range of Generic Active Message implementations, including those on the Myrinet 2.3, Intel Paragon, Meiko CS-2, and HP-NOW0 cluster. This tool is sensitive enough to reveal the impact of even small differences in either the hardware or software in the GAM imple mentation, and has provided valuable guidance in our on-going development. We have used this tools to study a variety of design trade-offs in network interface architectures on the NOW, Meiko, and Paragon. It allowed us to isolate bottlenecks in each case. A full study appears in the Feb. 1996 special issue of IEEE Micro on Hot Interconnects.
We have extended our test suite for the Myrinet hardware. This was initially developed to stress the first generation of Myrinet hardware and it revealed significant problems that had not otherwise been demonstrated.
We have now tested the second generation Myrinet 4.0 and 4.1 network interface card and switch extensively, and demonstrated it to be much more reliable than the previous versions. We have ported the Generic Active Message (GAM) layer to the new 32-bit processor in the Lanai 4.1. This version of Myrinet hardware is accompanied by a completely new compiler, and in the process of porting GAM several problems were uncovered as fixed. We have been able to evaluate the performance of the new myrinet hardware on Sparcs and Ultrasparcs using out GAM microbenchmark tools.
We have completed a study of an alternate communication model, the Remote Queue model (RQ) as a communicative model equivalent in power (in its simplest form) to active messages with polling is made attractive by message processing. In essence, we separate the queuing of messages from the invocation of handlers. This enables new optimizations, provides a clean atomicity model, and simplifies implementation and multiprogramming. We uses RQ to implement active messages, bulk transfers, and fine-grain applications on the Intel Paragon and Cray T3D using extremely different implementations of RQ.
We have completed the specification of the final Active Message specification. It is general purpose, supporting client-server, distributed server, kernel-to-kernel communication, as well as conventional parallel programs. It is integrated with POSIX threads and provides a clear error model.
We have completed a reference implementation of the new AM specification on top of UDP, which can run on any homogeneous set of workstation. The process of constructing this reference implementation shed light on several subtle aspects of the specification. The reference implementation allowed us to port other major components of the project to AM in advance of the final Myrinet implementation, include the XFS file system, the network RAM remote virtual memory system, and the Split-C parallel programmign language.
We have completed the detailed design of the native Myrinet AM layer and have coded and tested the major portion of it. The basic message functions are operational. We have demonstrated hot-swap of cables and switches while under heavy load. The Solaris driver that implements network virtualization is operational, paging virtual network endpoints onto and off of the network interface card.
We have developed a detailed microbenchmark suite for MPI and used it to analyze a num ber of commercial MPI implementations. Essentially of which demonstrate performance prob lems, correctness problems, or fail on certain operations. A detailed study is to appear in the International Conference on Parallel Processing.
We have completed a version of the network mapper and router, which is fully integrated with GLUNIX.
We have implemented a version of Unix sockets on AM and demonstrated subtantial improvements in latency and bandwidth over traditional TCP and UDP implementations.
Having previously implemented cosheduling for parallel programs in GLUNIX and experienced its true implementation costs, we have designed an alternative distributed algorithm for time-sharing parallel workloads, called implicit scheduling, which utlizes events inherent in the program to coordinate the scheduling across nodes. We have performed an extensive simulation study that indicates that implicit scheduling is competitive with ideal coscheduling, even with zero cost for global context switching. Implicit scheduling allows each local scheduler in the system to make independent decisions that dynamically coordinate the scheduling of cooperating processes across processors. Of particular importance is the blocking algorithm which decides the action of a process waiting for a communication or synchronization event to complete. Through simulation of bulk-synchronous parallel applications, we find that a simple two-phase fixed-spin blocking algorithm per forms well; a two-phase adaptive algorithm that gathers run-time data on barrier wait-times per forms slightly better. Our results hold for a range of machine parameters and parallel program characteristics. These findings are in direct contrast to the literature that states explicit cosheduling is necessary for fine-grained programs. We have shown that the choice of the local scheduler is crucial, with a priority-based scheduler performing two to three times better than a round-robin scheduler. Overall, we find that the performance of implicit scheduling is near that of coschedul ing (+/- 35%), without the requirement of explicit, global coordination.
We have implemented these ideas in the Split-C global memory and synchronization library and are testing the techniques on the AM reference implementation.
We have developed a method for extending operating system functionality in a way that is secure, efficient, simple, requires no kernel source changes, and is compatible with existing application binaries. Our approach is to enable extensions of the system call interface by loading a device driver into the kernel that redirects system calls to extension code running either in the kernel or in a user-level process. As a proof of concept, we have implemented a prototype of our system, Secure Loadable Interposition Code (SLIC), on Solaris 2.4. Measurements of our prototype implementation demonstrate that extensions can be added with little runtime overhead. We have also implemented several extensions using SLIC, which illustrates the functionality provided and demonstrate good performance.
We have conducted an extensive study of the load balancing and scheduling of sequential and parallel jobs in the context of an interactive load on a network of workstations. We employ both trace-driven and direct simulation to evaluate the impact sequential and parallel jobs may have on one another. Starting from a dedicated NOW just for parallel programs, we incrementally relax the restrictions until we have a multiprogrammed, multiuser NOW for both interactive sequential users and parallel programs. At each step along this path we focus on measurements that help answer policy question for several aspects of the NOW system.
A network RAM implementation running over active messages was completed, and a CAD application of BDD's was run, and shown to be roughly three times faster than paging to disk. The network RAM implementation is still untuned.
We have developed a network-based front-end, called the magicrouter, which provides a single IP access point to a parallel server. It provides a mechanism for load balancing and fail-over that is an important alternative to the traditional front-end machine of a cluster. To build the magicrouter, we implemented fast packet imterposing, which allows a user-level process nearly kernel-level performance when accessing device drivers.
We have prototyped a "smart client" interface to the NOW that provides transparent access to a collection of machines providing a service over the web. This generic interface is used as a building block for providing scalable internet services. It presents a single system view of the NOW to the interface user, a web browser. This research is unique in that it puts intelligence into the client software (such as load balancing, fault tolerance). Code that was once in a centralized server is now pushed out to the clients, elminating single points of failure. The interface is written as a Java class which can be subclassed or specialized to implement a wide variety of scalable services. We have demonstrated a scalable ftp server, telnet server, and a scalable chat room. The interface specification is currently being designed and tested.
Installation and configuration of the Solaris operating system on a NOW can be a difficult and time-consuming task. As the size of the cluster becomes larger, the ability of the system adminis trator to perform such a feat and maintain a consistent environment across machines becomes increasingly difficult. Using the Solaris custom JumpStart methodology, we have developed a set of installation tools which allow new machines to be added to the cluster and be up and running with no administrator assistance. In addition, the tools provide a centrally managed paradigm for recovering lost machines and for global reconfiguration of existing machines with simple modifi cations to a central file and a reboot.
We have designed a system diagnostic console will make it possible to scale network and host performance monitoring to hundreds of nodes through aggregation of information. It will support network and node testing so that problems can be diagnosed. The system is being implemented in Perl/Tk, and will be freely available.
We have designed a tool called, Authenticated action, to provide fine grained remote function execution in a distributed system that crossed organizational boundaries and includes mobile hosts. Two principals can mutually authenticate even if they are disconnected from the rest of the system. Authenticated actions pro vides an excellent infrastructure for developing secure system administration tools.
During this past year, we have been investigating "Serverless Network File Systems" . Serverless network file systems eliminate the central server bottleneck by allowing all data and control to be distributed over all machines in the system. Dynamic location of data and control improve performance, increase availability and reliability, and permit transparent data migration to and from tertiary storage.
We have implemented an initial prototype serverless network file system, called xFS. The pro totype demonstrates distributed network Redundant Array of Inexpensive Disks (RAID) striping to provide high-performance and reliable disk storage; it demonstrates cooperative caching to har ness the distributed caches of many machines as a single, large cache; and it demonstrates distributed metadata control that scales with its high-performance data paths.
Initial measurements of the xFS prototype showed that we achieved a peak read or write band width of 14 MB/s for a 32-node cluster of SPARCStation 10's and 20's. While this is significantly more bandwidth than a single such machine could achieve, it represents only about 20% of the aggregate bandwidth we expect to realize in a well-tuned system with this hardware. We identified RPC/TCP network overheads as a primary performance bottleneck, and we have nearly completed a port to Active Messages that we expect to significantly improve the prototype's performance.
In addition to efforts to improve performance, we are currently preparing to gather a detailed trace of file system activities across a large number of active machines. We plan to use this trace to study the performance impact of different algorithms for the placement of data on disks.
We have developed an initial prototype of a means of allowing users to execute computation on a remote site, such as now, while operating within the logical context of their local system. The explosive growth of the World Wide Web, along with the evolution of the HTTP, HTML, and CGI standards, has enabled many applications which would not have been feasible just a few years ago. We believe that this evolution will eventually lead to support for general-purpose distributed computing over the Web. Such computation minimally requires transparent access to: i) local and remote files, ii) private and public data, and iii) local and remote computation. We have demonstrated that these requirements can be met by integrating simple protocols from existing distributed systems into the Web. In turn, we show that adding these capabilities can qualitatively improve the utility of the Web by enabling a whole new class of distributed applications.
The Inktomi Search Engine has been running non-stop since early October. On December 29th we loaded a new database with 2.8 million Web pages, making it one of the largest online data bases. More importantly, the new database was brought online without any down time: we auto matically switched to the new database while there were users on the system. This test confirms that the fault-tolerant architecture enables hot swapping server contents, one of the keys to the 24 by 7 operation required by global information servers. The size of the database also stresses the scalability of the system, which currently uses 4 nodes and 8 GB of disk.
The Inktomi search engine continues to run with nearly 100% up time and a load of about 100,000 queries per day. Some of the key technology has been licensed from the University by Inktomi Corporation, which has built a 10-node commercial search engine using the NOW design (http: //www.hotbot.com). We expect to complete scalability and fault tolerance studies of the Berkeley version in the next three months.
We have continued the development of Split-C, a parallel extension of C intended for high-performance programming on distributed memory platforms. Split-C provides a mixture of the shared memory, message passing, and data parallel programming paradigms and gives the programmer a clear cost model, making optimization straightforward. Implementation of Split-C now exist on the Cray T3D, the IBM SP-1 and SP-2, the Intel Paragon, the Meiko CS-2, the Thinking Machines CM-5, and networks of Sun and HP workstations. Work is also underway to promote the language as a de facto standard.
We have developed a family of external sorts on NOW and characterized the performance characteristics and scaling with respect to machine size, number of disks per node, and problem size. This application stresses several new aspects of the AM layer, including the extensive use of threads and overlapping of I/O with internal communication. It illustrates the advantage of NOWs over traditional SMPs related to the presence of disks on every node. The NOW is competitive with the record, set on SMPs, for this benchmark.
Complete the AM myrinet implementation. Demonstrate its robustness on a large scale and its effective use in a wide spectrum of applications. Demonstrate sockets and MPI built upon this substrate.
Demonstrate AM ability to operate correctly in the presence of hardware faults, errant hardware, dynamic hot-swap reconfiguration.
Characterize the performance of virtual networks under real system load with several applications and system modules using AM over myrinet concurrently. Optimize the AM implementation toward the 10us target in the common case.
Characterize GLUnix performance and optimize critical aspects, such as job start up time.
Implement implicit scheduling on AM and evaluate is effectiveness on timeshare workloads of parallel and sequential jobs, including a comparison with explicit co-scheduling.
Demonstrate SLIC protability. Enhance the SLIC interface to support multiple OS extensions. Evaluate adding interrupts/traps to SLIC interface. Implement transparent remote execution in GLUnix using SLIC. Implement migration in GLUnix using SLIC
Make the current xFS prototype robust enough to be used on a regular basis. In particular, we will reengineer the file system protocol and verify its correctness with formal methods.
Conduct a large scale trace of the file system usage in the CS department. The principal benefit of the trace is that it will last two orders of magnitude longer than previous efforts; it will also contain more information about each access and cover more machines.
Develop a file data reorganizer that optimizes both read AND write performance, to make xFS have near optimal performance across a very wide spectrum of workloads.
Extend the file system recovery work to construct a toolkit that enables building reliable distributed applications.
Build and investigate a stable WebOS environment providing a global file system, authentication, and secure execution of jobs as a candidate mode of access to a NOW. It should also provide a framework for sysadmins to reason about access rights for outside/local users.
Investigate issues associated with scheduling multiple services on a NOW. With respect to the file system, a number of application caching policies are possible. One example of this is the caching policy we developed to support replicated webfs servers for the scalable chat application.
Build or modify a web browser to give transparent access to a replicated web server, and route around failures.
Implement and investigate a alternative "user views" of a NOW. The simplest view is that the entire cluster looks like a single multiprocessor server, i.e., a single system image. We have seen that this view is insufficient to support our own activities involve simultaneous use of the NOW for system development, production use, and performance studies. An extreme alternative is to provide a virtual NOW to every NOW user or portions of the NOW to identfiable groups of users. This group membership information will either live in a GLUnix database, or potentially be stored in a Smart Client database.
Construct a NOW Monitor. This Java tool will display NOW user information (which machines are reserved) and also the load on NOW machines. The goal is to expose all internal Smart Client information in graphical form, so that NOW information is easily accessible from any browser.
Our design of a Myrinet card for the Hewlett-Packard GSC+ bus was given to the companies and was improved and productized by Myricom under contract with HP. The card should enable a significantly better network interface for TAC-4. Our fast communications layer will run on this hardware platform.
Our technical contacts at SUN labs have developed and published a GLUnix-like layer for providing a single-system view of Solaris over a network of workstations. This is likely to result in a new set of products.
Intel has demonstrated strong support for NOW, contributing 35 PentiumPro machines to serve as the host machines for the Roboline tertiary disk array and to integrate this massive I/O facility with NOW. We are planning to develop an AM layer for these machines using the Myricom PCI interface card so NOW and Tertiary disk will connect at high bandwidth through the same Myrinet. This will also allow an investigation of NOW concepts and techniques in the PC realm.
July 24, 1996
For technical information, contact David A. Patterson
patterson@cs.berkeley.edu