Submitted to Lawrence Livermore National Laboratory
Professor Katherine Yelick, PI
(yelick@cs.berkeley.edu,
510-642-8900)
Professor David Culler, co-PI
(culler@cs.berkeley.edu, 510-643-7572)
Professor James Demmel, co-PI
(demmel@cs.berkeley.edu, 510-643-5386)
University of California at Berkeley
Computer Science Division
Since the beginning of the contract, we have made significant progress towards the development of programming support for CLUMPs. The results are divided into the following areas: communication support, performance analysis, distributed data structures and load balancing, solvers, and compilation techniques.
We have begun to explore more substantial support for communications within CLUMPs than was previously available for networks of workstations or shared memory clusters. The first effort in this direction is the implementation and evaluation of virtual networks, which allow a large number of processes to have direct protected access to communication resources. We have completed the design, implementation, and preliminary performance analysis of the first true virtual network system. Virtual networks provide an arbitrary number of applications with the illusion of direct, memory-mapped access to multiple independent and protected networks in an on-demand, general-purpose, and protected manner. The key abstraction is a communication endpoint - an object that abstracts a process's handle on a virtual network. A large number of processes on both single- and multi-processor machine can have one or more, independent communication endpoints. A set of communication endpoints spread across NOWs and CLUMPS form virtual networks. This abstraction is implemented through a sophisticated segment driver and firmware on an intelligent network interface. In the next quarter we will complete a thorough evaluation of virtual networks, and will continue other research on the CLUMP communication system, including the use of multiple network interface cards per SMP node.
1.2 Multi-protocol Communication on CLUMPs
One model for programming CLUMPs, which is attractive due to its simplicity, is to treat the machine as a flat distributed memory multiprocessor. The programmer writes an SPMD program that communicates both within and between SMPs using a single communication interface, such as using MPI or Active Messages. We have designed a multi-protocol version of Active Messages to support such models; it is implemented using shared memory within an SMP and techniques developed for NOWs between SMP nodes. An initial implementation is underway, with preliminary results already available. We will continue to refine and evaluate this implementation, in addition to the novel "compiler-based" approach, during the remainder of this contract.
We have been working to consolidate the multigrid (MG) code: running large problems and refining the algorithm and the implementation to reach a stable checkpoint. We would like a partially optimized implementation to get a solid idea of the performance characteristics and potential of our current algorithm. To date we have run on problems of up to 1:025 ?10 degree of freedom (dof) systems on the new SP3 at Argonne. The problem that we run has large jumps in material coefficients (10e4) with Poisson ratios of 0.49. We run a problem, which is scalable, and keep about 25,000 dof per processor; thus we run from one to 41 processors to get efficiency type data (see http://www.mcs.anl.gov/Projects/sp/report/adams97.html for an earlier version of this problem and pictures of the results). We now get about 40% efficiency on the 41 processor case. Note: this is real efficiency, i.e. the wall clock time to reduce the residual by a constant (10e6) factor, but this does not include the algebraic construction of the coarse grid matrices as our current implementation in not well optimized. We are currently investigating new MG methods to be able to handle highly incompressible problems more effectively. Incompressibility is important in its own right, as well as being similar in nature to plasticity, which is an important class of finite element applications.
4.2 Sparse Matrix Factorization
We are pursuing our development of SuperLU, a high performance implementation of sparse Gaussian elimination. The serial version, optimized for cache-based machines, was previously released, and the parallel (SMP) version is to be released soon. The SMP version is running and has been performance tuned and tested. We are currently revising the interface (calling sequence) to allow it to be called from either a serial or parallel client. The interface for serial clients is fairly straightforward, but the interface for parallel clients is complicated by the differences between thread libraries on different platforms and the need for shared data structures within the code. For example, the input array and a task queue used within the algorithm for scheduling need to be passed into the function.
Return to Clumps Quarterly Reports Page
Return to Clumps Home Page