There are three main objectives of this problem set. First, you should learn how to run code on the NOW. Second, we hope you have a better idea of things to consider when trying to write efficient code. And finally, we hope you have a better appreciation for how difficult writing efficient code can be (although you already knew that, right?). Although this is a class on parallel computing, this problem set concerns optimizing serial code. However, insights gained here will be of use throughout the course.
This problem set is split into two parts. The first part shouldn't take too long. You only need to run some programs which we'll make available, to change a few parameters, and to observe and comment on the results. In the second part, you'll get to use some of the insight gained in the first part to code an efficient matrix multiply. To make things more interesting, the second part includes an optional contest. (All groups should try to write an efficient version of matrix multiply. Only the competition part is optional.)
This problem set will be done in groups of two or three. The groups should be interdisciplinary, and we'll try to help set up the groups.
echo $path
set path=($path /usr/sww/bin /usr/sww/share/bin /usr/now/bin)
echo $MANPATH
setenv MANPATH ${MANPATH}:/usr/now/man
Put the three files in a directory, and type 'glumake'. This will build an executable called 'cache'.
Type 'glurun ./cache > cache.out'
This will build an output file in your directory called 'cache.out'. Examination of the file will reveal the latency for different parts of the hierarchy.
In order to plot your file, type the following on the command line:
chmod 755 format.pl format.pl cache.out gnuplot cache.out.plotThis will build a postscript file called cache.out.ps
On the UltraSparc the access time to the first level cache is roughly 6ns, while the access time to the second level cache is roughly 40ns. Since the memory benchmark program is performing a read and a write operation, these times will be doubled (e.g. the accesses to the first level cache will be ~12 ns and the accesses to the second level cache will be ~80ns).
Your job is to plot the output file, and identify the first level cache, the second level cache, and main memory by circling the horizontal lines on the graph that correspond to the levels in the hierarchy. (Draw only three circles to identify the levels - the circles do not have to be exact.) Also compute the latency to each level of the hierarchy on your graph (answers will vary from group to group). For computing the latency to main memory, just read the next level up from the L2 cache - don't take into account other effects such as TLB misses.
How much slower is memory than the first level cache? Than the second level cache?
for i=1 to n
for j=1 to n
for k=1 to n
C(i,j) = C(i,j) + A(i,k)*B(k,j)
end for
end for
end for
However, note that this algorithm does not take into consideration
the memory hierarchy. Hopefully after doing part 1 of the problem set
you've learned that good performance depends on writing code which takes
the memory hierarchy into consideration -- especially if the matrices
are ``large''.
So, the assignment (as you've no doubt guessed) is to code a more
efficient version of matrix multiply for the Ultrasparcs in the NOW
cluster. Because there is an optional competition, we require everyone
to write code with the same routine interface:
/*
** mul_mfmf_mf:
** Optimized matrix multiply routine.
**
** Parameters:
** int matdim: the matrix dimension
** double * A: source1 matrix of size (matdim)x(matdim)
** double * B: source2 matrix of size (matdim)x(matdim)
** double * C: destination matrix of size (matdim)x(matdim)
**
** Operation:
** C = C + A*B;
**
*/
void mul_mfmf_mf(int matdim,
const double *A,
const double *B,
double *C);
Assume that the arrays are stored in row-major order, and that the
compiler is ANSI-C compliant. A driver routine (which will call
your routine and both check that it's answers are correct and get
accurate MFLOPS rates on matrices of various sizes) and a Makefile
are provided (see the following section).
You may also want to look at
last year's assignment (from which this writeup is partially taken).
Among other things, it has a few pointers to more information.
Type 'glurun ./matrix_multiply'
The output shows the matrix size and the megaflops rating. Here's a sample:
matrix_multiply^M^M Checking for correctness on sizes: 192 232 10 240 124 186 14 149 55 222^M Checking quad-word aligned sizes^M 16 37.470695^M 32 37.314091^M 64 23.385536^M 128 22.966635^M 256 7.076914^M Checking arbitrary sizes^M 23 39.129846^M 43 37.438217^M 61 28.306794^M 79 28.533186^M 99 28.470845^M 119 28.415806^M 151 27.897837^MThis will run the generic matrix multiply routine.
You'll need to replace the function in the file 'my_mm.c' with your own matrix multiply routine.
Hand in:
/*
* mul_sym:
* multiplying a symmetric matrix and a nonsymmetric matrix.
*
* Parameters:
* int matdim: the matrix dimension
* double *A : source matrix 1. symmetric with only half the
* elements stored, so of size (matdim)x((matdim)+1)/2
* double *B : source matrix 2 of size (matdim)x(matdim)
* double *C : output matrix of size (matdim)x(matdim)
*
* Operation:
* C = C + A*B
*/
void mul_sym(int matdim,
const double *A,
const double *B,
double *C);
Again, all matrices are stored in row-major order. (Yes, figuring
out the indexing can be tricky.)