Recent improvements in network and workstation performance have made clusters an attractive architecture for diverse workloads, including sequential and parallel interactive applications. Although viable hardware solutions are available today, the largest challenge in making such a cluster usable lies in the system software. This paper describes the design and implementation of GLUnix, an operating system layer for a cluster of workstations. GLUnix is designed to provide transparent remote execution, support for interactive parallel and sequential jobs, load balancing, and backward compatibility for existing application binaries. GLUnix is a multi-user, user-level system that was constructed to be easily portable to a number of platforms.
GLUnix has been in daily use for over two years and is currently running on a 100-node cluster of Sun UltraSparcs. Performance measurements indicate that a 100-node parallel program can be run in 1.3 seconds and that the centralized GLUnix master is not the performance bottleneck of the system.
This paper relates our experiences with designing, building, and running GLUnix. We compare the original goals of the project with the final features of the system. The GLUnix architecture and implementation are presented, along with performance and scalability measurements. The discussion focuses on the lessons we have learned from the system, including a characterization of the limitations of a user-level implementation and the social considerations encountered when supporting a large user community.