Quarterly Status Report for 2/6/97 through 4/30/97
Programming Support for Clusters of Multiprocessors (Clumps)

Submitted to Lawrence Livermore National Laboratory

Professor Katherine Yelick, PI
(yelick@cs.berkeley.edu, 510-642-8900)

Professor David Culler, co-PI
(culler@cs.berkeley.edu, 510-643-7572)

Professor James Demmel, co-PI
(demmel@cs.berkeley.edu, 510-643-5386)

University of California at Berkeley
Computer Science Division


Since the beginning of the contract, we have made significant progress towards the development of programming support for CLUMPs. The results are divided into the following areas: communication support, performance analysis, distributed data structures and load balancing, solvers, and compilation techniques.

1 Communication Support

1.1 Virtual Networks and Multiple NICs

We have begun to explore more substantial support for communications within CLUMPs than was previously available for networks of workstations or shared memory clusters. The first effort in this direction is the implementation and evaluation of virtual networks, which allow a large number of processes to have direct protected access to communication resources. We have completed the design, implementation, and preliminary performance analysis of the first true virtual network system. Virtual networks provide an arbitrary number of applications with the illusion of direct, memory-mapped access to multiple independent and protected networks in an on-demand, general-purpose, and protected manner. The key abstraction is a communication endpoint - an object that abstracts a process's handle on a virtual network. A large number of processes on both single- and multi-processor machine can have one or more, independent communication endpoints. A set of communication endpoints spread across NOWs and CLUMPS form virtual networks. This abstraction is implemented through a sophisticated segment driver and firmware on an intelligent network interface. In the next quarter we will complete a thorough evaluation of virtual networks, and will continue other research on the CLUMP communication system, including the use of multiple network interface cards per SMP node.

1.2 Multi-protocol Communication on CLUMPs

One model for programming CLUMPs, which is attractive due to its simplicity, is to treat the machine as a flat distributed memory multiprocessor. The programmer writes an SPMD program that communicates both within and between SMPs using a single communication interface, such as using MPI or Active Messages. We have designed a multi-protocol version of Active Messages to support such models; it is implemented using shared memory within an SMP and techniques developed for NOWs between SMP nodes. An initial implementation is underway, with preliminary results already available. We will continue to refine and evaluate this implementation, in addition to the novel "compiler-based" approach, during the remainder of this contract.

2 Performance Analysis and Modeling

A key to achieving high performance on large scale machines is understanding the observed performance in detail, from the architecture level through the applications. We have started in this effort by looking one application kernel, the 3D FFT, on two machines, the UCB NOW (uniprocessor Ultrasparcs connected by Myrinet) and the UCB CLUMP (quad processor Ultrasparc servers connected by Myrinet). The FFT performance is primarily limited by the performance of serial 2D FFTs and an all-to-all communication phase for remapping the 3D array. Using software pipelining and hand-modified register allocation, we are able to achieve 150 MFlops for a 1D FFT on the Ultrasparc nodes. The 2D FFT is somewhat slower, due to a remap between the two phases of 1D FFT. The parallel version speeds up will up to 16 processors, but the bisection bandwidth of the network saturates for larger numbers of processors. The CLUMP version shows a need for better architectural support for the network -- the network performance is limited by the bus bandwidth in the SMP, rather than any aspect of the network. The parallel versions of the 3D FFT were developed in conjunction with performance modeling. A very rough model, based only on bandwidth and floating point requirements, indicates that a full turbulence model using the 3D FFT would run for approximately 80 hours on a ASCI-sized machine. This assume that the code could reach a TFlop/s, which will require the kind of careful tuning by either compiler or programmer that our current NOW version used.

3 Distributed Data Structures and Load Balancing

We are pursuing the development of distributed data structures and the related locality and load balancing optimizations on two fronts. We have ported the Kelp grid library to the UCB NOW platform using two strategies: 1) to run on the standard MPI layer described above and 2) using Active Messages within the Kelp communication layer. We will continue performance tuning in the next quarter and evaluate the library for scalability. We conjecture that new load balancing algorithms may be needed to achieve load balance without destroying locality. The second main effort is the port of Multipol to the UCB NOW platform. It currently runs on TCP, but as with Kelp, we believe some performance advantage may result from using lighter weight Active Messages directly.

4 Solvers

4.1 Multigrid

We have been working to consolidate the multigrid (MG) code: running large problems and refining the algorithm and the implementation to reach a stable checkpoint. We would like a partially optimized implementation to get a solid idea of the performance characteristics and potential of our current algorithm. To date we have run on problems of up to 1:025 ?10 degree of freedom (dof) systems on the new SP3 at Argonne. The problem that we run has large jumps in material coefficients (10e4) with Poisson ratios of 0.49. We run a problem, which is scalable, and keep about 25,000 dof per processor; thus we run from one to 41 processors to get efficiency type data (see http://www.mcs.anl.gov/Projects/sp/report/adams97.html for an earlier version of this problem and pictures of the results). We now get about 40% efficiency on the 41 processor case. Note: this is real efficiency, i.e. the wall clock time to reduce the residual by a constant (10e6) factor, but this does not include the algebraic construction of the coarse grid matrices as our current implementation in not well optimized. We are currently investigating new MG methods to be able to handle highly incompressible problems more effectively. Incompressibility is important in its own right, as well as being similar in nature to plasticity, which is an important class of finite element applications.

4.2 Sparse Matrix Factorization

We are pursuing our development of SuperLU, a high performance implementation of sparse Gaussian elimination. The serial version, optimized for cache-based machines, was previously released, and the parallel (SMP) version is to be released soon. The SMP version is running and has been performance tuned and tested. We are currently revising the interface (calling sequence) to allow it to be called from either a serial or parallel client. The interface for serial clients is fairly straightforward, but the interface for parallel clients is complicated by the differences between thread libraries on different platforms and the need for shared data structures within the code. For example, the input array and a task queue used within the algorithm for scheduling need to be passed into the function.

5 Compilation Techniques

Our language and compiler effort is based in part on the Titanium project, which uses a C++ dialect extended with primitives from a grid library, Tin. It also build on the Split-C effort, which is a systems-programming language based on C. Titanium and Split-C differ in both the base language and the amount of compiler support; the Split-C compiler is a fairly simple translator while the Titanium compiler does more sophisticated communication optimizations. We are currently implementing two versions of the global communication, which is common to both Split-C and Titanium, in support of CLUMPs. One version runs on the multi-protocol Active Message layer, and was completed in the past quarter. The second uses memory operations for global address space accesses within an SMP. Over the next quarter we will complete this compiler-based implemented of a global address space, and in future quarters we will evaluate this implementation based on Titanium and Split-C benchmarks.

6 Reports and Publications

  1. S. S. Lumetta, D. E. Culler, "Multi-Protocol Message Passing in Clusters of SMP's," submitted for publication, March 1997.

  2. D. Gay, A. Mainwairing, R. Thomas, "3D FFTs on Networks of SMPs," CS267 Final Project Report, May 1997. Available from http://now.CS.Berkeley.EDU/cs267/.

  3. N. Bowman, N. Treuhaft, "Kelp on NOW," CS267 Final Project Report, May 1997.


Return to Clumps Quarterly Reports Page

Return to Clumps Home Page