Assignment 2

Be sure to read the newsgroup ucb.class.cs267for updated information

There are three main objectives of this problem set. First, you should learn how to run code on the NOW. Second, we hope you have a better idea of things to consider when trying to write efficient code. And finally, we hope you have a better appreciation for how difficult writing efficient code can be (although you already knew that, right?). Although this is a class on parallel computing, this problem set concerns optimizing serial code. However, insights gained here will be of use throughout the course.

This problem set is split into two parts. The first part shouldn't take too long. You only need to run some programs which we'll make available, to change a few parameters, and to observe and comment on the results. In the second part, you'll get to use some of the insight gained in the first part to code an efficient matrix multiply. To make things more interesting, the second part includes an optional contest. (All groups should try to write an efficient version of matrix multiply. Only the competition part is optional.)

This problem set will be done in groups of two or three. The groups should be interdisciplinary, and we'll try to help set up the groups.

Part I - Memory Benchmark

Logging into NOW

There are two ways to login to the Berkeley NOW. The first is a command line method, and the second is accessible over the web.
  1. From the command line
  2. From the web.

Setting up your path

Make sure that your path includes:
  1. /usr/sww/bin
  2. /usr/sww/share/bin
  3. /usr/now/bin
You can check this by typing
echo $path

If your path does not include the three directories, then type
set path=($path /usr/sww/bin /usr/sww/share/bin /usr/now/bin)

Make sure that your MANPATH includes:
  1. /usr/now/man
Check this by typing:
echo $MANPATH

If this is not set, you can type
setenv MANPATH ${MANPATH}:/usr/now/man

Running the memory benchmark code

Grab three files: The memory benchmark code steps through an array in memory, while changing the size of the array and the stride at which it steps. The effects of the memory hierarch (different latencies for different levels of the hierarchy) will be visible in the output.

Put the three files in a directory, and type 'glumake'. This will build an executable called 'cache'.
Type 'glurun ./cache > cache.out'
This will build an output file in your directory called 'cache.out'. Examination of the file will reveal the latency for different parts of the hierarchy.

In order to plot your file, type the following on the command line:

chmod 755 format.pl
format.pl cache.out
gnuplot cache.out.plot
This will build a postscript file called cache.out.ps

On the UltraSparc the access time to the first level cache is roughly 6ns, while the access time to the second level cache is roughly 40ns. Since the memory benchmark program is performing a read and a write operation, these times will be doubled (e.g. the accesses to the first level cache will be ~12 ns and the accesses to the second level cache will be ~80ns).

Your job is to plot the output file, and identify the first level cache, the second level cache, and main memory by circling the horizontal lines on the graph that correspond to the levels in the hierarchy. (Draw only three circles to identify the levels - the circles do not have to be exact.) Also compute the latency to each level of the hierarchy on your graph (answers will vary from group to group). For computing the latency to main memory, just read the next level up from the L2 cache - don't take into account other effects such as TLB misses.

How much slower is memory than the first level cache? Than the second level cache?

Part II - Matrix Multiplication

Now we'll use some of the insights gained by doing part I by coding a fast version of matrix multiply. As you probably all know, the simplest algorithm for computing C = C+A*B, where A, B, and C are all n-by-n matrices is the following:

      for i=1 to n
          for j=1 to n
             for k=1 to n
                 C(i,j) = C(i,j) + A(i,k)*B(k,j)
             end for
          end for
       end for
However, note that this algorithm does not take into consideration the memory hierarchy. Hopefully after doing part 1 of the problem set you've learned that good performance depends on writing code which takes the memory hierarchy into consideration -- especially if the matrices are ``large''. So, the assignment (as you've no doubt guessed) is to code a more efficient version of matrix multiply for the Ultrasparcs in the NOW cluster. Because there is an optional competition, we require everyone to write code with the same routine interface:
/*
**  mul_mfmf_mf:
**    Optimized matrix multiply routine.
**
**    Parameters:
**      int matdim: the matrix dimension
**      double * A: source1 matrix of size (matdim)x(matdim)
**      double * B: source2 matrix of size (matdim)x(matdim)
**      double * C: destination matrix of size (matdim)x(matdim)
**
**    Operation:
**      C = C + A*B;
**
*/
void mul_mfmf_mf(int matdim, 
                 const double *A,   
                 const double *B,
                 double *C);
Assume that the arrays are stored in row-major order, and that the compiler is ANSI-C compliant. A driver routine (which will call your routine and both check that it's answers are correct and get accurate MFLOPS rates on matrices of various sizes) and a Makefile are provided (see the following section). You may also want to look at last year's assignment (from which this writeup is partially taken). Among other things, it has a few pointers to more information.

Building the matrix multiplication program

Again, you'll need to login to one of the UltraSparc machines. Grab the three files below:
Type 'glumake'. This will build a program called 'matrix_multiply'

Type 'glurun ./matrix_multiply'

The output shows the matrix size and the megaflops rating. Here's a sample:

matrix_multiply^M^M
Checking for correctness on sizes: 192 232 10 240 124 186 14 149 55 222^M
Checking quad-word aligned sizes^M
16 37.470695^M
32 37.314091^M
64 23.385536^M
128 22.966635^M
256 7.076914^M
Checking arbitrary sizes^M
23 39.129846^M
43 37.438217^M
61 28.306794^M
79 28.533186^M
99 28.470845^M
119 28.415806^M
151 27.897837^M
This will run the generic matrix multiply routine.

You'll need to replace the function in the file 'my_mm.c' with your own matrix multiply routine.

Hand in:


Extra Credit

For groups which aren't interested in spending time trying to get every last MFLOPS out of the Ultrasparcs, or any other groups that may be interested, we also have the following problem. We know that data sets which are too large for the cache can slow down the running time of an algorithm such as matrix multiply significantly. However, some matrices have specific structure that we can try to exploit --- for example, if a matrix is symmetric, you need store only (a little more than) half its elements. Can you code a version of matrix multiply satisfying the following specification?
/*
 *  mul_sym:
 *      multiplying a symmetric matrix and a nonsymmetric matrix.
 *
 *      Parameters:
 *        int matdim: the matrix dimension
 *        double *A : source matrix 1.  symmetric with only half the
 *                    elements stored, so of size (matdim)x((matdim)+1)/2
 *        double *B : source matrix 2 of size (matdim)x(matdim) 
 *        double *C : output matrix of size (matdim)x(matdim)
 *
 *      Operation:
 *        C = C + A*B
 */
void mul_sym(int matdim,
             const double *A,
             const double *B,
             double *C);
Again, all matrices are stored in row-major order. (Yes, figuring out the indexing can be tricky.)