Quarterly Status Report for 5/1/97 through 7/31/97
Programming Support for Clusters of Multiprocessors (Clumps)

Submitted to Lawrence Livermore National Laboratory

Professor Katherine Yelick, PI
(yelick@cs.berkeley.edu, 510-642-8900)

Professor David Culler, co-PI
(culler@cs.berkeley.edu, 510-643-7572)

Professor James Demmel, co-PI
(demmel@cs.berkeley.edu, 510-643-5386)

University of California at Berkeley
Computer Science Division


1 Communication Support

1.1 Multi-protocol Communication on CLUMPs

We have written a technical paper describing the multiprotocol active message layer and explored the space of implementations using various types of synchronization. Clusters of multiprocessors, or Clumps, promise to be the supercomputers of the future, but obtaining high performance on these architectures requires an understanding of interactions between the multiple levels of interconnection.

This paper presents the first multi-protocol implementation of a lightweight message layer---a version of Active Messages-II running on a cluster of Sun Enterprise 5000 servers connected with Myrinet. This research brings together several pieces of high-performance interconnection technology: bus backplanes for symmetric multiprocessors, low-latency networks for connections between machines, and simple, user-level primitives for communication. The paper describes the shared memory message- passing protocol and analyzes the multi-protocol implementation with both microbenchmarks and Split-C applications. Three aspects of the communication layer are critical to performance: the overhead of cache-coherence mechanisms, the method of managing concurrent access, and the cost of accessing state with the slower protocol. Through the use of an adaptive polling strategy, the multi-protocol implementation limits performance interactions between the protocols, delivering up to 160 MB/s of bandwidth with 3.6 microsecond end-to-end latency. Applications within an SMP benefit from this fast communication, running up to 75% faster than on a network of uniprocessor workstations. Applications running on the entire Clump are limited by the balance of NIC's to processors, and are typically slower than on the NOW.

2 Performance Analysis and Modeling

2.1 FFT

The Fast Fourier Transform is a kernel that is used in turbulence modeling, signal and image processing, and many other scientific and engineering domains. The 1-dimensional FFT is a basic block for multi-dimensional FFT and because of its repetitive access to data, optimizing 1-dimensional FFT for memory hierarchy helps to improve the overall performance of this algorithm. For the register level optimization, a radix-4 butterfly algorithm is adopted to improved reuse of data registers; the data is also remapped mid-computation to improve the cache reuse as the strides vary. This optimization showed 71% performance improvement at best on UltraSparc. This optimization is used for 2-dimensional FFT on one SMP on clumps.

2.2 Microbenchmarking

A set of benchmarking programs is being developed to evaluate different multiprocessor systems. Currently, it is being developed on 8-way Enterprise 5000 SMP. The purpose of this benchmark set is to determine machine parameters such as memory latency/bandwidth and cost of synchronization primitives, and to classify kernel routines in terms of such parameters. This result will be used to predict the adequacy of systems to particular application.

3 Solvers

3.1 Multigrid

We have recently consolidated our multigrid code to the point where it is robust and all features have been parallelized to our current satisfaction. We are now writing papers and continuing to investigate a new MG method for highly incompressible problems (or constrained problems in general). We have also begun to experiment with geometric and material nonlinear problems, we have seen somewhat encouraging results, but much work remains on this front and this will be the focus of our near term work.

To date we have run on problems of up to 1,721,000 degree of freedom (dof) systems on the SP3 at Argonne. The problem that we run has large jumps in material coefficients (1.e-4) with Poisson ratios of 0.49. We run a problem, which is scalable, and keep about 25,000 dof per processor, thus we run from one to 69 processors to get efficiency type data (see the World Wide Web address: http://www.mcs.anl.gov/Projects/sp/report/adams97.html for an earlier version of this problem and pictures of the results). We now get about 50% efficiency, of the solve after the setup phase is completed, for the 69 processor case. Note: this is real efficiency, i.e. the wall clock time to reduce the residual by a constant (1.e-6) factor, with a uniprocessor base case. Our total efficiency to solve a linear system once is about 35%, again for the 69 processor case, this is due to the less than fully optimized setup phase.

We have had one meeting recently with the ALE3D team, and have begun to discuss the issues that are of concern to them, and how to arrange to run their problems in our system so as to get a better idea of how well our method works on the problems of interest to the ALE3D project.

3.2 Sparse Matrix Factorization

We are investigating a new algorithm for balancing sparse matrices. Balancing is a preconditioning technique for nonsymmetric eigenvalue problems which permutes and/or diagonally scales a matrix to improve the accuracy or speed of its subsequent eigenvalue calculation. The traditional algorithm for balancing dense matrices requires the elements of a matrix to be given explicitly --- in practice, the elements of the large sparse matrices we are interested in are not always available. Our algorithm avoids this problem by using only Krylov information: matrix-vector and matrix-transpose-vector multiplies to access the matrix. Preliminary tests show that for most matrices our algorithm both reduces the norm and improves the accuracy to which eigenvalues are computed, in some cases by orders of magnitude. Over the next year, we plan to continue assessing the efficacy of our approach on sparse eigenproblems arising in practice, and to implement a parallel version of this algorithm.

4 Compilation Techniques

Over the past quarter we completed an initial implementation of the Titanium compiler, so that is generates C code for a sequential subset of the language. We are currently working on parallel code generation and support for packages, parameterized types, and other features to support large-scale programming projects. In addition, we have developed a small set of scientific benchmarks, including a multigrid solver, matrix multiply, and a grid-based simulation. A more sophisticated AMR poisson solver is under development in collaboration with Luigi Semenzato at NERSC/LBNL.

5 Reports and Publications

  1. S. S. Lumetta, A. M. Mainwaring, D. E. Culler, "Multi- Protocol Active Messages on a Cluster of SMP's," in the Proceedings of Supercomputing '97, San Jose, California, November 1997.

  2. A. Krishnamurthy, D. Culler, and K. Yelick, ``Empirical Evaluation of Global Memory Support on the Cray-T3D and Cray-T3E,'' submitted for publication, May 1997.


Return to Clumps Quarterly Reports Page

Return to Clumps Home Page