Quarterly Status Report for 8/1/97 through 10/31/97
Programming Support for Clusters of Multiprocessors (Clumps)

Submitted to Lawrence Livermore National Laboratory

Professor Katherine Yelick, PI
(yelick@cs.berkeley.edu, 510-642-8900)

Professor David Culler, co-PI
(culler@cs.berkeley.edu, 510-643-7572)

Professor James Demmel, co-PI
(demmel@cs.berkeley.edu, 510-643-5386)

University of California at Berkeley
Computer Science Division


EM3D

EM3D, a model 3D electromagnetic simulation, has been optimized for SMPs. This application operates on an irregular, linked-list bipartite graph structure. Most of the performance improvement is gained from modifying the data structure of the E and H nodes (the two parts of the bipartite graph), which were originally represented in a linked list, to be multiple arrays of each field in the node structure, which results in an increased cache hit rate. Also changing the implementation of the synchronization barrier from mutex and condition variables in the POSIX thread library to a faster assembly routine which uses Compare-and-Swap instruction improved the performance.

Multigrid Solver for Irregular Grids

We have spent much of this quarter scaling our multigrid solver implementation, and can now solve 2.5 x 10^6 degrees-of-freedom (dof) finite element problems with about 60 processors on the SP3 at Argonne National Lab (ANL). We have also continued to refine our method to improve the robustness and performance of the code. Our method scales well, i.e. the number of iterations required is fairly constant for finite element problems even with large jumps in material coefficients and on geometries with thin shell types of structures (we currently only work with 8 node "brick" elements).

We have also continued to investigate methods for using our solver in the solution of indefinite matrices that arise from constrained problems implemented with Lagrange multipliers. This problem is a primary concern of the ALE3D group as well as being a fundamental problem in computational mechanics.

SuperLU

We have continued to work on designing a version of SuperLU, our parallel sparse Gaussian elimiation algorithm, for distributed memory machines. The big bottleneck in making the current sequential and shared-memory codes work on distributed memory is their need for fine grain communication, synchronication, and load balancing, as well as their use of a 1D data layout. To overcome these problems, we have decided not to pivot dynamically. This will let us statically layout the data in a 2D fashion, load balance it, and schedule the communication, i.e. enjoy all the benefits of parallel sparse Cholesky. But not pivoting dynamically requires several techniques to retain stability. These include prepivoting to put large entries on the diagonal, setting tiny pivots to larger values and correcting later either with the Sherman-Morrison-Woodbury formula or iterative refinement, and use of mixed precision (higher precision when there is pivot growth). Initial experiments indicate that these techniques can retain stability in nearly all cases of interest.

High Performance Sparse Matrix-Vector Multiplication.

We are working towards developing high performance sparse matrix-vector multiply codes for CLUMPs. Sparse matrix-vector multiplication is a core part of many applications --- including large scale iterative and eigenvalue solvers. Although much work has been done on optimizing sparse matrix vector multiplication for uniprocessors and distributed memory machines, there has been less work on optimizations for SMPs and CLUMPs. Preliminary tests on a single SMP show that careful partitioning is necessary not only on distributed memory machines, but also between the nodes of one SMP. In addition, on a single SMP, the single processor performance appears to be a bottleneck. We plan to continue tests on SMPs and CLUMPs, and also to work on improving single processor performance.

Multiprotocal MPI for CLUMPs

Passing messages through shared memory plays an important role on symmetric multiprocessors. Messages can manage asynchronous data transfers, can serve to serialize operations on complex data structures, and can provide an attractive uniform interface for programming clusters of SMP's. The management of concurrent access to message queues is an important aspect of design for shared memory message-passing systems. Using both microbenchmarks and applications, this paper compares the performance of concurrent access algorithms for passing active messages on a Sun Enterprise 5000 server. The paper also presents a new lock-free algorithm that provides many of the advantages of non-blocking algorithms while avoiding the overhead necessary for true non-blocking behavior. The lock-free algorithm couples synchronization tightly to the data structure and demonstrates application performance superior to all others studied. The success of this algorithm implies that other practical problems might also benefit from a reexamination of the non-blocking literature.

A paper listed in the last quarterly report, "Multi-Protocol Active Messages on a Cluster of SMP's", by S. Lumetta, A. Mainwaring and D. Culler, has been accepted for publication in SuperComputing 97.

Another paper by S. Lumetta, and D. Culler, "Managing Concurrent Access for Shared Memory Active Messages," has been submitted to IPPS/SPDP 1998.


Return to Clumps Quarterly Reports Page

Return to Clumps Home Page