Steven S. Lumetta
stevel@CS.Berkeley.EDU
http://http.CS.Berkeley.EDU/~stevel
Alan M. Mainwaring
alanm@CS.Berkeley.EDU
http://http.CS.Berkeley.EDU/~alanm
David E. Culler
culler@CS.Berkeley.EDU
http://http.CS.Berkeley.EDU/~culler
Several groups have studied the problem of programming Clumps [3][6][11][13][15][16]. Some of these efforts focus on issues related to shared virtual memory [17][32], but most relate to high-level message libraries such as MPI. The software overheads associated with memory allocation and tag matching in traditional message-passing libraries often obscure the machine-level performance interactions and design issues associated with the actual communication. Very little work has addressed the complicated set of tradeoffs involved in implementing a fast communication layer on the combination of cache-coherent memory and a low-latency network.
Three aspects of the communication layer are critical to performance on Clumps:
To quantify these dimensions, we have built a multi-protocol implementation of Active Messages-II [21] that transparently directs message traffic through the appropriate medium, either shared memory or a high-speed network. The implementation operates on a cluster of four Sun Enterprise 5000 servers running the Solaris 2.5 operating system and interconnected by a Myrinet with multiple NIC's per SMP.
This paper describes the shared memory message-passing protocol and analyzes the effects of the hardware and software architectures on communication performance. Using both microbenchmarks and a range of applications, we illuminate the important design tradeoffs for a multi-protocol communication layer. While Clumps have a clear engineering advantage over networks of uniprocessor workstations as the base architecture for large-scale systems, several obstacles limit their performance advantages at the application level. The paper brings several of these problems to light and discusses their implications.
The remainder of the paper is organized as follows:
Figure 1: Target architecture for the multi-protocol AM-II implementation--a cluster of symmetric multi-processors, or Clump. Processors inside an SMP are connected via Sun's Gigaplane Interconnect, which provides cache-coherency and delivers up to 2.7 GB/s of bandwidth [27]. Communication between SMP's utilizes multiple, independent SBUS connections to a Myrinet high-speed network with internal link bandwidths of 160 MB/s [4].
The critical components of hardware performance are the memory hierarchy and the network, as these characteristics have direct impact on the speed at which data moves from one processor to another. The cost of synchronization primitives is also pertinent when managing simultaneous access by multiple processes. Using microbenchmarks based upon those of Saavedra-Barrera [25], we are able to measure these values for our system, as shown in Table 1. For comparison, the table also gives the parameters for an UltraSPARC Model 170 workstation, which uses the same processor as the Enterprise 5000. For message-passing via shared memory, the latency of accesses to data within another processor's L2 cache is of particular interest, as it represents the minimum cost to transfer data between two processors. On the Enterprise 5000, such a transfer requires 80 cycles, significantly more than the base memory latency of 50 cycles.
| Parameter | Enterprise 5000 server | UltraSPARC 170 workstation |
|---|---|---|
| L2 size | 512 kB | 256 kB |
| L2 miss (memory) | 50 cycles | 42 cycles |
| L2 miss (other L2) | 80 cycles | N/A |
| memcpy bandwidth | 200 MB/s | 160 MB/s |
| NIC 32-bit read | 152 cycles | 114 cycles |
| NIC 32-bit write | 54 cycles | 34 cycles |
| SBUS bandwidth | 3x38 MB/s | 38 MB/s |
| compare-and-swap | 15 cycles | 15 cycles |
Most Active Message implementations assume the use of a SPMD model of programming and the availability of a reliable network that is space-shared between users or is time-shared with long, globally-scheduled time slices. The AM-II specification [21] defines a uniform communication interface that provides the functionality required for general-purpose distributed programming yet permits implementations yielding performance close to that available at the hardware level. AM-II abstracts communication into point-to-point messages between communication endpoints. A group of communicating endpoints form a virtual network with a unique protection domain. Traffic in one virtual network is never visible to a second virtual network, yet each virtual network retains the direct, user-level network access necessary for high performance. When distinct virtual networks share the same physical network resources, each continues to perceive private resources, albeit with potentially reduced performance. This communication multiplexing is critical to high-performance message-passing with Clumps, since many processes are expected to be communicating at once. Also, the ratio of NIC's to processors in an SMP might, in general, differ from one.
The Active Message layer assigns each endpoint a unique, global name. To create a virtual network, endpoints map a set of such names into a table of message destinations. Destinations in the virtual network are then indexed using a small integer. Access rights in a virtual network take the form of a 64-bit tag specified by each destination endpoint. A sender must know the value of an endpoint's tag before sending any messages to that endpoint. Tags provide a reasonable level of protection against both inadvertently misdirected and malicious messages.
Communication in AM-II uses a request-reply paradigm. Messages originating outside the Active Messages layer are called requests, and all request handler routines must issue a reply using an opaque message token which holds information about the requesting endpoint's name and tag. A message references a handler routine with a small integer, which is used by the recipient as an index into an endpoint-specific table of such routines. The Active Message layer reserves index 0 of the table for the user-defined handler to which messages are returned in the case of network failure or other faults, e.g., denial of access.
Two types of message are relevant to this work: short messages carry up to eight 32-bit arguments; a bulk data transfer extends a short message with a block of up to 8 kB of data.
Figure 2: Data layout for an AM-II endpoint. The control block resides in main memory, the network queue block resides on the network interface card (NIC), and the shared memory queue block resides in a shared memory segment. Only the shared memory queue block is accessible to other processes.
A diagram of the shared memory queue block appears in Figure 3. A copy of the endpoint tag is used for access control, while two queue structures hold request and reply messages received by the endpoint. Each queue structure further divides into three sections: queue tail information, accessed only by senders; queue head information, accessed only by recipients; and two FIFO data queues, accessed by both senders and recipients. The queues are the packet queue, which contains the handler index and arguments, and the bulk data queue, which holds data for bulk data transfers. Short messages use only the packet queue, while bulk data transfers use both queues.
Figure 3: Block diagram of a shared memory queue block. Short messages use only the packet queue. Bulk data transfers are written into the bulk data queue as well.
The shared memory queue block has been carefully tuned for performance. Data are laid out so as to eliminate false sharing and thereby to reduce bus transactions. Each packet, for example, occupies a distinct L2 cache line, and bulk data blocks begin on cache boundaries to increase copying speed.
In addition to the handler index and arguments, entries in the packet queue contain three other fields: a packet type, an inverse queue block mapping, and a bulk data index. The first of these, the packet type, differentiates between short messages and bulk data transfers. It also serves as the handshake state in transferring data from a sender to a recipient. A valid flag serves the latter purpose for the bulk data queue. The inverse queue block mapping points to the shared memory queue block of the sending endpoint in the address space of the process that owns the receiving endpoint, enabling reply messages to avoid a potentially expensive lookup operation. The last field, the bulk index, records the association between a bulk data transfer packet and the data itself. The bulk data queue is significantly shorter than the packet queue, allowing the shared memory queue block to fit into a reasonable amount of memory (293 kB).
The shared memory queue block differs significantly from the network queue block in its lack of send queues. The absence arises from a fundamental difference between the methods used to transmit data over the network and within an SMP. In the network case, a sender cannot directly deposit data into memory located across the network, and must instead rely on a third party, such as the intelligent Myrinet NIC's, to move the data. Within an SMP, the situation is just the opposite: direct access is possible through shared memory, and no third party exists to perform the transfer.
In this work, we assume the use of multiple processes within an SMP rather than a number of threads. Communication between multiple processes matches the message-passing model more closely than does communication between threads sharing an address space. Interprocess communication in Unix typically utilizes the System V IPC layer, which provides a number of standard mechanisms for communication. By choosing to use System V shared memory segments as storage for the shared memory queue block, we implicitly tie access control decisions to the model supported by System V interprocess communication. The IPC model is quite similar to that used by traditional Unix filesystems. Each segment has distinct read and write access bits for the owner of the segment, for a Unix group associated with the segment, and for all other users. Although perhaps not impossible to build, a system that addresses security issues through the IPC access model requires multiple segments and significantly more complex operations than our performance requirements can tolerate. These considerations compel us to assume a high level of trust between endpoints communicating through shared memory.
Processes other than the one that owns the endpoint can access only the shared memory queue block. To obtain such access, a process must map the block into its address space as follows. The segment identifier for the block is used to extend the endpoint name--other processes obtain the identifier when they learn the name. The actual mapping into another process' address space occurs when an endpoint owned by that process adds the endpoint associated with the block to its table of message destinations. The inverse mapping is performed at the same time to guarantee that reply messages also travel through shared memory. In Figure 2, processes A and B have mapped the shared memory queue blocks for endpoints 1 and 2 into their address spaces. A hash table guarantees that no shared memory segment is mapped into multiple locations in a single address space.
When sending a message, the Active Message layer first decides whether to use a shared memory protocol or a network protocol. For a local message, the layer next checks the tag in the destination queue block and returns any message that lacks access rights. After this check, the sender attempts to enqueue the message into the appropriate queue. To enqueue a short message, the sender first obtains a packet assignment by atomically incrementing the packet queue tail using the compare-and-swap instruction (CAS), then claims the assigned packet by changing its type from free to claimed, again using CAS. If the claim fails, the queue is full, and the sender backs off exponentially and polls for messages to prevent possible deadlock. Once the claim succeeds, the sender writes the data into the packet and completes the enqueue operation by changing the packet type to ready. For bulk data transfers, a sender claims a bulk data block before obtaining a packet assignment. After filling both packet and block, the sender marks the full packet with ready-bulk.
Given the effort made to achieve high performance, the use of two synchronization primitives, and in particular the CAS instruction, may seem peculiar. We have studied the performance of a range of mechanisms for managing concurrent access to the shared memory queue blocks, including the mutual exclusion techniques described in [23]. The communication regime is one of low resource contention. The time spent in the critical section of the send operation is small when compared with the total overhead of sending a message, and only intense all-to-one communication results in non-trivial contention for the shared queues. CAS is reasonably inexpensive on the Enterprise 5000, and the degree to which our send operation reduces the impact of senders being swapped out on the progress of other senders results in a level of robustness that proves quite advantageous in multiprogrammed systems. Furthermore, the method outlined above results in superior application performance even for a dedicated system. The interested reader is referred to [20]. for further detail.
Although the AM-II library provides support for protected access to an endpoint using multiple receiver threads, we have assumed the use of a single thread per process in this work. The issues and costs for concurrent access by receivers are similar to those for senders. In the absence of concurrency, the local poll operation need only check the type of the packet at the head of each packet queue. When a message is available, the recipient advances the packet queue head and passes the arguments and, for bulk data transfers, the associated data block, to the appropriate handler routine. After this call returns, the packet is marked as free and the data block is marked as invalid.
We explored both fractional and adaptive strategies for polling. A fractional strategy performs the more heavyweight poll for only a fraction of all polling operations. To balance the protocols, a successful poll accepts a correspondingly larger number of messages when using the more expensive protocol. For example, a strategy that polls the network only once in every four calls to poll then accepts up to four times as many network messages in a single network poll as it does shared memory messages in a shared memory poll. An adaptive strategy adjusts polling rates dynamically in response to traffic patterns. The adaptive strategies that we investigated varied a fractional polling rate for the network between minimum and maximum values based on a history of recent network polling efforts. Within the boundaries, the strategy polls whenever the history predicts the presence of a message.
After investigating a wide selection of strategies, we settled on an adaptive strategy with a maximum network polling frequency of one in eight, slightly above the cost ratio of one to ten between shared memory and network polling. Allowing more frequent network polling generally decreased application performance. Our strategy ranges between frequencies of one in eight and one in thirty-two based on the number of network messages received by the last thirty-two network polls. Applications were not very sensitive to small differences in these parameters, but neither did all applications respond in the same way to changes. The minimum network polling frequency of one in thirty-two is small enough that applications running inside of an SMP run within 12% of their execution times using a single-protocol shared memory layer with no network accesses.
The queue block holds a tag for access control and two queue structures for receiving local messages. No send queues are used for local messages--a sender deposits data directly into a receive queue at the destination endpoint. The local send operation uses CAS to minimize interference between senders. The operation takes five steps: checking the destination tag, obtaining a packet assignment, claiming the packet, writing the data, and marking the packet as ready for receipt. Bulk data transfers obtain a data block assignment and claim the block before obtaining a packet assignment. When a local message arrives, a recipient notices the change in the type of the packet at the head of the queue. After passing the data to the appropriate handler routine, the recipient frees the packet for reuse. The structure of the shared memory queue block keeps the number of bus transactions produced by these operations to a minimum.
Message polling operations are ubiquitous in Active Message layers, but polling the network typically costs an order of magnitude more than polling shared memory. The source of this difference lies in the storage used for the two queue blocks. Shared memory segments benefit from the Gigaplane's cache-coherence support. NIC memory does not, and is uncacheable. To retain the base performance of the shared memory protocol, we adopt an adaptive polling strategy. The strategy polls the network whenever a history of the last thirty-two network poll operations predicts the presence of a message, bounded by a minimum polling rate of one in thirty-two and a maximum rate of one in eight.
LogP parameters were measured using a microbenchmark from the suite described in [10]. To measure G, we constructed a second benchmark to fragment bulk data transfers of arbitrary length into 8 kB chunks and to pipeline those chunks through the Active Message layer. The sending process copies the data from a send buffer into the receiving endpoint, and the receiving process copies the data from the endpoint into a receive buffer.
| Parameter | Shared Memory | Multi-Protocol Shared Memory | Multi-Protocol Myrinet | Myrinet |
|---|---|---|---|---|
| Latency (L) | -0.3 | 0.1 | 15.5 | 13.8 |
| Send Overhead (o_s) | 1.8 | 2.1 | 3.3 | 5.6 |
| Receive Overhead (o_r) | 1.3 | 1.4 | 8.6 | 8.1 |
| Gap (g) | 3.1 | 3.6 | 17.6 | 17.6 |
| Gap per Byte (G) | 0.00602 | 0.00625 | 0.0305 | 0.0315 |
| Bandwidth (1/G) | 166 MB/s | 160 MB/s | 32.8 MB/s | 31.7 MB/s |
| Half-power Point | 5.1 kB | 5.3 kB | 8.6 kB | 8.7 kB |
| Round Trip Time (RTT) | 5.6 | 7.2 | 55 | 55 |
A breakdown of the send overhead for short messages appears in Figure 4. The left bar illustrates the base cost of a short message in the absence of error checking and concurrency management for the destination queue. The total of 108 cycles (0.65 microseconds) also assumes that the message packet is not resident in the receiver's cache. To reach the base cost, the sender prepares eight arguments and calls the Active Message layer in a total of 11 cycles. Locating the destination endpoint and advancing the tail of the queue require another 9 cycles. Finally, the layer obtains and fills a packet, incurring an L2 cache miss in the process. Filling the packet takes another 49 cycles, but allows the processor to hide 11 cycles of the miss latency.
Figure 4: Breakdown of send overhead in cycles for the shared memory protocol. The left bar shows the costs of each component for the base case, which performs no error checking or concurrency management for the destination queue. The cost of the latter appears in the right bar. The send overhead totals 299 cycles (1.8 microseconds) for the shared memory protocol and 345 cycles (2.1 microseconds) for the multi-protocol layer.
The right bar in the figure extends the base cost with measurements of the remaining components of send overhead. When queue packets are resident in the receiver's L2 cache, each message incurs an additional 22 cycle penalty. A check for destination queue overflow is responsible for the next 50 cycles, primarily due to an extra bus transaction. The check reads the packet type and makes immediate use of the result, incurring the full overhead of the first transaction. Filling the packet then results in an invalidation, the second transaction.2 The local poll operation performed before each send adds another 42 cycles. Function argument and endpoint tag checking by the Active Message layer introduce another 25 cycles of overhead. Concurrency management using CAS adds 45 cycles, bringing the total to 299 cycles (1.8 microseconds) when using only the shared memory protocol. Finally, inclusion of the network protocol more than doubles the time spent in the poll operation, bringing the total for the multi-protocol implementation to 345 cycles (2.1 microseconds).
The most important of these issues is the balance of processors to NIC's inside each SMP. A Clumps application tuned to take advantage of the shared memory protocol makes subsequently less use of the network protocol, allowing processors that share network resources to operate at full potential. But for some applications, such as those with phases of all-to-all communication, tuning for Clumps may not be possible. In our system, each SMP uses three network interfaces to handle network traffic for eight processors, leading to three-way sharing for two of the NIC's.
A second complication arises for applications written in a bulk synchronous style, which implicitly assume a reasonably balanced load. The use of multi-protocol communication can violate that assumption, as performance benefits depend on the fraction of traffic routed through the shared memory protocol. In such a case, the improvement in application execution time reflects only the minimum of the per-processor improvements--other processors idle until the slowest processor has finished the phase. As a first step towards tuning a bulk synchronous application for a Clump, a programmer can arrange the virtual processor layout to reduce the amount of communication traffic that must travel through the network.
Finally, the additional complexity of cache-coherence support within an SMP results in longer memory latencies. On the Enterprise 5000, the difference is roughly 20% for memory not shared between processors. This memory latency penalty does not usually play a large role in performance, however, as it is mitigated by a number of factors, including a larger L2 cache, higher per-processor memory bandwidth, and locality of access in the application codes. A more restrictive constraint occurs in the form of aggregate memory bandwidth. Although the Gigaplane provides more bandwidth than the eight processors can use, the memory banks in each of our SMP's are not fully populated, limiting aggregate memory bandwidth to 400 MB/s. This artifact compounds the effect of the memory latency penalty and has significant impact on performance.
| Run | Input Parameters | Memory |
|---|---|---|
| 3D FFT | 256x256x256 values | 14 MB |
| CON/comm |
3D underlying lattice 512,000 nodes/processor 25% edges present |
32 MB |
| CON/comp |
2D underlying lattice 640,000 nodes/processor 40% edges present |
40 MB |
| EM3D/naive |
2,500 nodes/processor degree 20, 40% remote naive layout |
8.3 MB |
| EM3D/good |
2,500 nodes/processor degree 20, 40% remote good layout |
8.3 MB |
| Run | Network Communication | Local Communication | % Local | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Short Messages | Bulk Data | Short Messages | Bulk Data | ||||||
| Mean | Minimum | Maximum | Mean | Minimum | Maximum | ||||
| 3D FFT | 3,270 | 3,072 | 3,468 | 3,072 x 4 kB | 954 | 896 | 1,292 | 896 x 4 kB | 22.6 |
| CON/comm | 67,236 | 51,541 | 240,919 | 292 x 16 B | 50,138 | 43,614 | 90,346 | 130 x 16 B | 42.7 |
| CON/comp | 5,993 | 5,601 | 6,498 | 10 x 16 B | 5,890 | 5,351 | 6,411 | 6 x 16 B | 49.5 |
| EM3D/naive | 608,817 | 199,600 | 1,015,500 | none | 1,013,094 | 804,065 | 1,222,500 | none | 62.5 |
| EM3D/good | 407,369 | 0 | 818,500 | none | 1,216,549 | 1,198,407 | 1,223,500 | none | 74.9 |
The first application, 3D FFT, performs a fast Fourier transform in three dimensions and typifies regular applications that rely primarily on bulk communication. The all-to-all communication pattern used in 3D FFT exposes the effect of the SMP's processor to NIC balance.
The second application, CON, finds the connected components of a distributed graph. CON performs a large amount of fine-grained communication in a statistically well-defined pattern. The balance between computation and communication in CON depends strongly on the input parameters. We selected a communication-bound run to highlight the benefits of the fast communication and a second, computation-bound run to demonstrate the effect of the SMP memory latency penalty. The input parameters for the first run result in a period of high contention and load imbalance near the end of the execution.
EM3D, the last application, propagates electromagnetic radiation in three dimensions on an irregular mesh and represents the class of applications that perform irregular, fine-grained communication. EM3D alternates between updates to the electric and magnetic fields in a bulk synchronous manner. We use two runs of EM3D to show the effect of the bulk synchronous style and the advantage of intelligent virtual processor layout. Both runs partition the underlying coordinate space on 32 processors into 4x4x2 blocks. The first run uses a naive layout for virtual processors, placing processors within an SMP into 4x2x1 blocks. In the second run, denoted EM3D/good in the tables, an SMP's processors instead occupy 2x2x2 blocks, reducing both the aggregate network traffic and the upper bound on per-processor network traffic.
| Run | 8-way SMP | NOW (8 processors) | ||
|---|---|---|---|---|
| Shared Memory | Multi-Protocol | Multi-Protocol | Myrinet | |
| 3D FFT | 8.7 | 9.0 | 7.0 | 6.76 |
| CON/comm | 1.96 | 2.1 | 4.01 | 4.01 |
| CON/comp | 1.40 | 1.44 | 1.18 | 1.19 |
| EM3D | 7.7 | 8.6 | 33.9 | 33.8 |
Application execution times in seconds on the Clump and a 32-processor NOW appear in Table 6. The NOW uses the single-protocol Myrinet implementation. The bottom section of the table provides information on the communication load balance between processors for each EM3D run; each entry represents the range across processors of time spent in the communication phase. The effect of the processor to NIC balance in our system is apparent in the degradation of 3D FFT performance, which takes 209% (a factor of three) longer on the Clump than on the NOW. The communication-bound CON run is also affected by the sharing of network resources, requiring 95% more time on the Clump. For the computation-bound CON run, the 108% slowdown on the Clump results from a combination of the memory latency penalty and the processor to NIC balance. The EM3D runs demonstrate the effect of the bulk synchronous style on execution time. In each case, execution time is limited by the slowest of the processors. The benefit of the improved layout in generating less network traffic is enhanced by the processor to NIC balance, leading to a 30% improvement in execution time. For the NOW, the naive layout proves superior due to bandwidth thinning in the upper regions of the tree network connecting the NOW. Machines in the same sub-branch of the NOW network enjoy greater aggregate bandwidth than do machines in different branches, and the naive layout takes better advantage of this arrangement. Comparing the best EM3D result on each platform, the Clump finishes in 21% less time.
| Run | Clump (4 8-way SMP's) | NOW (32 processors) |
|---|---|---|
| 3D FFT | 6.8 | 2.2 |
| CON/comm | 15 | 7.7 |
| CON/comp | 2.7 | 1.3 |
| EM3D/naive | 54 | 48 |
| EM3D/good | 38 | 52 |
| EM3D/naive bal. | 28 to 52 | 25 to 47 |
| EM3D/good bal. | 5.5 to 37 | 28 to 51 |
Also critical to performance is the aggregate memory bandwidth of each SMP. Underpopulation of the memory banks in our Clump results in an artificially low limit and severely restricts the performance of some applications. While the problem is not inherent in our system, it is one that future systems must remember to avoid: for large computational problems, caches do not always adequately buffer memory accesses.
The bulk synchronous style of programming has achieved some degree of popularity for parallel programming. When using a Clump, a programmer must put in a greater effort to balance the load between processors. Optimization of the virtual processor layout onto the Clump is a good first step.
One approach to Clumps that has received much attention over the years is the extension of shared memory between SMP's. Recent efforts on this front include SVM [17] and MGS [32]. These studies investigate a problem complementary to our own in that both seek to optimize common techniques in one medium to allow use of those techniques in both. Each view proves more natural and effective than the other for interesting classes of applications.
In the smaller body of message-passing work, Nexus comes closest to our own. Nexus is a portable programming system [13] that focuses primarily on portability and on support for heterogeneity. It supports arbitrary sets of machines, processes (or contexts, in Nexus terminology) and threads. Nexus generally builds on top of existing communication layers, resulting in somewhat higher overheads than those obtained with Active Messages. The communication abstractions are similar to those of AM-II, but the style of communication is different. Like AM-II, Nexus has endpoints that define tables of handler routines, but Nexus does not require that communication obey a request-reply paradigm. This flexibility allows Nexus to use endpoint names, or startpoints, to initiate messages. A startpoint can be bound to multiple endpoints, allowing for multicast communication.
Since Nexus platforms can support multiple communication protocols between a startpoint and an endpoint, Nexus has explored multi-protocol communication from a more general perspective than have we [14]. Although shared memory is mentioned in the work, numbers are provided only for more expensive underlying protocols, making a direct comparison impossible. The Nexus multi-protocol paper also notes the wide variance between polling costs for different protocols and presents data for fractional polling strategies. We explored more adaptive strategies to reduce the impact of network polling to a satisfactory level.
An interesting study by Lim et. al. [18] investigates the use of one processor in each SMP as a message proxy for the remaining processors. The work focuses on providing multiple users with protected access to a single network resource and evaluates the proxy approach in detail. AM-II sidesteps the question of protected access by taking advantage of an SMP's virtual memory system to grant direct access to a subset of network resources. An intelligent NIC plays an essential role in the AM-II approach.
The remaining message-passing work on Clumps pertains primarily to the problem of programming them for performance. This paper does not speak directly to that problem, although some insight can be gained from the section on applications. Such efforts often assume that a programmer is willing to rewrite most or all of an application to obtain performance.
The P4 programming system [6] was probably one of the first systems to recognize Clumps as a platform. P4 provides mechanisms start multiple threads on one or more machines and to communicate between such threads using either message-passing or shared memory constructs. The programmer must explicitly select the appropriate library call. The library also provides a number of useful reduction operations.
SIMPLE [3] provides functionality similar to P4, but extends the library with broadcast operations and a variety of tuned, many-processor communication methods. SIMPLE also attempts to lighten the programmer's burden by offering functions that involve all processors, all processors in an SMP, one processor in each SMP, and so forth.
A paper by Fink and Baden [11] attacks the problem of balance in bulk synchronous algorithms by rebalancing computation and communication for a regular problem within an SMP. Given a 2D domain partitioned in one dimension between SMP's in a Clump, the paper calculates a non-uniform partitioning of the domain within each SMP such that the time spent in a phase is roughly equal for each processor. Essentially, the analysis gives processors on boundaries less computation to balance the cost of communication.
KeLP, by the same authors, seeks to simplify the process of application development. Recent extensions to KeLP [12] add new functionality to support applications on Clumps. With KeLP, a programmer expresses data decomposition and motion in a block-structured style. The runtime system then employs inspector-executor analysis to overlap communication with computation. No global barriers are used; interprocessor synchronization occurs only through communication dependencies.
In work related less directly to Clumps, Mukherjee and Hill [24] have investigated the advantages of making NIC memory cacheable. For multi-protocol communication, the importance of cacheable NIC memory is the resulting reduction in the cost of polling the network. Such systems might not require a sophisticated polling strategy.
In the design of a multi-protocol layer for Clumps, three critical aspects must be considered: data layout, concurrent access, and polling strategy. For our implementation, the data for an endpoint are split between main memory, NIC memory, and shared memory. Local communication passes through the shared memory portion of the endpoint, which is structured to minimize bus transactions. Concurrent access by multiple senders is handled using the compare-and-swap instruction to reduce interference between senders. Accessing network message data in uncacheable NIC memory is expensive--typically an order of magnitude more costly than accessing shared memory data. To retain the base performance of the shared memory protocol, we developed an adaptive polling strategy that varies the rate of network polling between upper and lower bounds based on a history of recent network activity.
When measured with microbenchmarks, our implementation illustrates the performance tradeoffs for fast communication on Clumps. Use of a multi-protocol communication layer has little impact on network message performance, but both careful engineering and an adaptive polling strategy are necessary to retain high performance through shared memory. The shared memory protocol provides five times the bandwidth available from the network, peaking at 160 MB/sec. End-to-end latency for short local messages is 3.6 microseconds, a factor of eight less than the corresponding number for the network protocol. The abstractions necessary to support message-passing consume a significant fraction of this time. For example, the overhead involved in a local send operation is 2.1 microseconds. Bus transactions account for 32% of this number; basic mechanisms such as call overhead and packet filling make up another 29%; managing concurrent access leads to another 13%; and the remaining 26% is split nearly evenly between polls for each protocol.
We studied application-level performance issues using three Split-C applications with five sets of input parameters. The applications use a bulk synchronous style and perform significant amounts of communication. To establish a base case, we presented the same results for a NOW. Within an SMP, applications take advantage of the shared memory protocol to achieve improved performance, but some are constrained by aggregate memory bandwidth limitations. Inclusion of the network protocol slows these applications by no more than 12%. The use of the full Clump brings the balance between processors and NIC's to light. For 3D FFT, which uses all-to-all communication, we observed a factor of three slowdown, as we might expect given the three-way sharing of NIC's by processors in our system. The Clump numbers also illustrate a drawback of the bulk synchronous programming style: although many processors might be able to take advantage of fast communication, a performance increase requires that all processors do so. By rearranging the virtual processors to increase the fraction of traffic sent through shared memory, we demonstrated improved performance. These obstacles to performance are not insurmountable, but they do illustrate several potential pitfalls for Clumps.
The Clumps architecture presents a wealth of interesting new tradeoffs and possibilities. With this paper, we have begun to explore these issues and have illustrated our findings with measurements at two levels. In the future, we plan to continue our investigation of the low-level aspects of these systems in order to build a solid foundation of understanding for exploring more abstract interactions.
2 One transaction can be eliminated if the sender prewrites an unused part of the packet before the overflow check and uses a memory barrier to prevent reordering. This approach reduces the send overhead by 40 cycles when measured in isolation, but has a negative impact on the full LogP parameters and application results, presumably due to cache line thrashing when a receiver polls during the send operation.