Quarterly Status Report for 8/1/97 through 10/31/97
Programming Support for Clusters of Multiprocessors (Clumps)
Submitted to Lawrence Livermore National Laboratory
Professor Katherine Yelick, PI
(yelick@cs.berkeley.edu,
510-642-8900)
Professor David Culler, co-PI
(culler@cs.berkeley.edu, 510-643-7572)
Professor James Demmel, co-PI
(demmel@cs.berkeley.edu, 510-643-5386)
University of California at Berkeley
Computer Science Division
EM3D
EM3D, a model 3D electromagnetic simulation, has been optimized for
SMPs. This application operates on an irregular, linked-list
bipartite graph structure. Most of the performance improvement is
gained from modifying the data structure of the E and H nodes (the
two parts of the bipartite graph), which were originally represented
in a linked list, to be multiple arrays of each field in the node
structure, which results in an increased cache hit rate. Also
changing the implementation of the synchronization barrier from mutex
and condition variables in the POSIX thread library to a faster
assembly routine which uses Compare-and-Swap instruction improved the
performance.
Multigrid Solver for Irregular Grids
We have spent much of this quarter scaling our multigrid solver
implementation, and can now solve 2.5 x 10^6 degrees-of-freedom
(dof) finite element problems with about 60 processors on the SP3 at
Argonne National Lab (ANL). We have also continued to refine our
method to improve the robustness and performance of the code. Our
method scales well, i.e. the number of iterations required is fairly
constant for finite element problems even with large jumps in
material coefficients and on geometries with thin shell types of
structures (we currently only work with 8 node "brick" elements).
We have also continued to investigate methods for using our solver in
the solution of indefinite matrices that arise from constrained
problems implemented with Lagrange multipliers. This problem is a
primary concern of the ALE3D group as well as being a fundamental
problem in computational mechanics.
SuperLU
We have continued to work on designing a version of SuperLU, our
parallel sparse Gaussian elimiation algorithm, for distributed memory
machines. The big bottleneck in making the current sequential and
shared-memory codes work on distributed memory is their need for fine
grain communication, synchronication, and load balancing, as well as
their use of a 1D data layout. To overcome these problems, we have
decided not to pivot dynamically. This will let us statically layout
the data in a 2D fashion, load balance it, and schedule the
communication, i.e. enjoy all the benefits of parallel sparse
Cholesky. But not pivoting dynamically requires several techniques
to retain stability. These include prepivoting to put large entries
on the diagonal, setting tiny pivots to larger values and correcting
later either with the Sherman-Morrison-Woodbury formula or iterative
refinement, and use of mixed precision (higher precision when there
is pivot growth). Initial experiments indicate that these techniques
can retain stability in nearly all cases of interest.
High Performance Sparse Matrix-Vector Multiplication.
We are working towards developing high performance sparse
matrix-vector multiply codes for CLUMPs. Sparse matrix-vector
multiplication is a core part of many applications --- including
large scale iterative and eigenvalue solvers. Although much work has
been done on optimizing sparse matrix vector multiplication for
uniprocessors and distributed memory machines, there has been less
work on optimizations for SMPs and CLUMPs. Preliminary tests on a
single SMP show that careful partitioning is necessary not only on
distributed memory machines, but also between the nodes of one SMP.
In addition, on a single SMP, the single processor performance
appears to be a bottleneck. We plan to continue tests on SMPs and
CLUMPs, and also to work on improving single processor performance.
Multiprotocal MPI for CLUMPs
Passing messages through shared memory plays an important role on
symmetric multiprocessors. Messages can manage asynchronous data
transfers, can serve to serialize operations on complex data
structures, and can provide an attractive uniform interface for
programming clusters of SMP's. The management of concurrent access
to message queues is an important aspect of design for shared memory
message-passing systems. Using both microbenchmarks and
applications, this paper compares the performance of concurrent
access algorithms for passing active messages on a Sun Enterprise
5000 server. The paper also presents a new lock-free algorithm that
provides many of the advantages of non-blocking algorithms while
avoiding the overhead necessary for true non-blocking behavior. The
lock-free algorithm couples synchronization tightly to the data
structure and demonstrates application performance superior to all
others studied. The success of this algorithm implies that other
practical problems might also benefit from a reexamination of the
non-blocking literature.
A paper listed in the last quarterly report, "Multi-Protocol Active
Messages on a Cluster of SMP's", by S. Lumetta, A. Mainwaring and D.
Culler, has been accepted for publication in SuperComputing 97.
Another paper by S. Lumetta, and D. Culler, "Managing Concurrent
Access for Shared Memory Active Messages," has been submitted to
IPPS/SPDP 1998.
Return to Clumps Quarterly Reports Page
Return to Clumps Home Page