Loop nest optimization Guide, Meaning , Facts, Information and Description
Since the 1990s, commodity CPUs attached to inexpensive memory systems have been able to deliver much of the performance of supercomputers with expensive high-bandwidth memory systems on certain problems. This development is due, in part, to ever-smaller processes making it possible to put very fast fully pipelined floating-point units onto commodity CPUs. But delivering that performance is also crucially dependent on compiler transformations that reduce the need for the high-bandwidth memory system.Loop Nest Optimization (LNO) was a crucial development in compiler technology that made possible large reductions in the cache bandwidth necessary for some pervasive codes.
| Table of contents |
|
2 Example: matrix multiply 3 See also 4 External links |
The essential transformation done by Loop Nest Optimization is to first
change a simple loop nest like this:
The useful property of this interchange is that it changes the access
patterns of array references inside the inner loop. The optimizer can
exploit this change in pattern to improve register data reuse or the
cache hit rate in the inner loop.
Many large mathematical operations on computers end up spending much
of their time doing matrix multiplication. Examining this loop nest
can be quite instructive. The operation is:
The basic loop is:
This code would run quite acceptably on a Cray Y-MP, which can sustain
0.8 multiply-adds per memory operation to main memory. The Y-MP was
built in the early 1980s. A machine like a 2.8 GHz Pentium 4, built
in 2003, has slightly less memory bandwidth and vastly better floating
point, so that it can sustain 16.5 multiply-adds per memory operation.
As a result, the code above will run slower on the 2.8 GHz Pentium 4
than on the 166 MHz Y-MP!
A machine with a longer floating-point add latency or with multiple
adders would require more accumulators to run in parallel. It is easy
to change the loop above to compute a 3x3 block
instead of a 2x2 block, but the resulting code is not always faster. The loop requires
registers to hold both the accumulators and the loaded and reused A
and B values. A 2x2 block requires 7 registers. A 3x3 block requires
13, which will not work on a machine with just 8 floating point
registers in the ISA. If the CPU does not have enough registers,
the compiler will schedule extra loads and stores to spill the registers
into stack slots, which will make the loop run slower than a smaller blocked
loop.
Matrix multiplication is like many other codes in that it can be memory
bandwidth limited, and that more registers can help the compiler and
programmer reduce the need for memory bandwidth. This register pressure
is why vendors of RISC CPUs, who intended to build machines more parallel
than the general purpose x86 and 68000 CPUs, adopted 32-entry floating point
register files.
The code above does not use the cache very well. During the
calculation of a horizontal stripe of C results, one horizontal stripe of
B is loaded and the entire matrix A is loaded. For the entire
calculation, C is stored once (that's good), B is loaded into the cache
once (assuming a stripe of B fits in the cache with a stripe of A), but
A is loaded N/ib times, where ib is the size of the strip in the C matrix,
for a total of N^3/ib doubleword loads from main memory. In the code above,
ib is 2.
The next step to reducing the memory traffic is to make ib as large as
possible. We want it to be larger than the "balance" number reported
by streams. In the case of one particular 2.8 GHz Pentium-4 system
used for this example, the balance number is 16.5. The
second code example above can't be extended directly, since
that would require many more accumulator registers. Instead, we block
the loop over i. (Technically, this is actually the second time we've
blocked i, as the first time was the factor of 2.)
So what size matrix fits? Our example system, a 2.8 GHz Pentium 4, has
a 16KB primary data cache. With ib=20, the slice of the B matrix in this
code will be larger than the primary cache when N > 100. For problems
larger than that, we'll need another trick.
That trick is reducing the size of the stripe of the B matrix by blocking
the k loop, so that the stripe is of size ib x kb. Blocking the k loop
means that the C array will be loaded and stored N/kb times, for a total
of 2*N^3/kb memory transfers. A is still transferred N/ib times, for N^3/ib
transfers. So long as
Here is the code with loop k blocked.
The above loop will only achieve 80% of peak flops on the example
system when blocked for the 16KB L1 cache size. It will do worse on
systems with even more unbalanced memory systems. Fortunately, the
Pentium 4 has a 256KB high-bandwidth level-2 cache as well as the
level-1 cache. We are presented with a choice:
Streams benchmark results - shows the overall balance between floating point operations and memory operations for many different computers.
Jim Dehnert's list of interesting SGI compiler publications on this problem This is an Article on Loop nest optimization. Page Contains Information, Facts Details or Explanation Guide About Loop nest optimization The transformation
for( i=0; i < N; i++ )
for( j=0; j < N; j++ )
foo(i,j)into an identical one that operates on blocks, like this: for( i=0; i < N; i++ )
for( jj=0; jj < N; jj += jb )
for( j=jj; j < jj+jb; j++ )
foo(i,j)The two above code sequences always have identical results, and the
second is slightly worse because it has extra loop overhead. The crucial
step is to interchange the outer two iterations, like this: for( jj=0; jj < N; jj += jb )
for( i=0; i < N; i++ )
for( j=jj; j < jj+jb; j++ )
foo(i,j)This code is not guaranteed to have identical results to the ones
above. If later invocations of foo() depend on earlier
invocations, and the order of those invocations are reversed,
this transformed loop will generate potentially wrong answers. The
compiler must detect this situation and avoid this transformation on
such loops.Example: matrix multiply
C = A*B
where A, B, and C are NxN arrays. Subscripts, for the following
description, are in the form C[row][column]. for( i=0; i < N; i++ )
for( j=0; j < N; j++ ) {
C[i][j] = 0;
for( k=0; k < N; k++ )
C[i][j] += A[k][j] * B[i][k];
}There are three problems to solve:
The original loop calculates the result for one entry in the result
matrix at a time. By calculating a small block of entries
simultaneously, the following loop reuses each loaded value twice, so
that the inner loop has four loads and four multiply-adds, thus
solving problem #2. By carrying four accumulators simultaneously,
this code can keep a single floating point adder with a latency of 4
busy nearly all the time (problem #1). However, the code does not address the
third problem. (Nor does it address the cleanup work necessary when N
is odd. Such details will be left out of the following discussion.) for( i=0; i < N; i += 2 )
for( j=0; j < N; j +=2 ) {
acc00 = acc01 = acc10 = acc11 = 0;
for( k=0; k < N; k++ ) {
acc00 += A[k][j+0] * B[i+0][k];
acc01 += A[k][j+1] * B[i+0][k];
acc10 += A[k][j+0] * B[i+1][k];
acc11 += A[k][j+1] * B[i+1][k];
C[i+0][j+0] = acc00;
C[i+0][j+1] = acc01;
C[i+1][j+0] = acc10;
C[i+1][j+1] = acc11;
}This code has had both the i and j iterations blocked by a factor of
two, and had both the resulting two-iteration inner loops completely
unrolled. for( ii=0; ii < N; ii += ib )
for( j=0; j < N; j += 2 )
for( i=ii; i < ii+ib; i += 2 ) {
acc00 = acc01 = acc10 = acc11 = 0;
for( k=0; k < N; k++ ) {
acc00 += A[k][j+0] * B[i+0][k];
acc01 += A[k][j+1] * B[i+0][k];
acc10 += A[k][j+0] * B[i+1][k];
acc11 += A[k][j+1] * B[i+1][k];
}
C[i+0][j+0] = acc00;
C[i+0][j+1] = acc01;
C[i+1][j+0] = acc10;
C[i+1][j+1] = acc11;
}With this code, we can set ib to be anything we like, and the number of
loads of the A matrix will be reduced by that factor. This freedom has
a cost: we are now keeping a Nxib slice of the B matrix in the cache.
So long as that fits, this code will not be limited by the memory system. 2*N/kb + N/ib < N/balance
the machine's memory system will keep up with the floating point unit and
the code will run at maximum performance. The 16KB cache of the Pentium
4 is not quite big enough: we might choose ib=24 and kb=64, thus using 12KB
of the cache -- we don't want to completely fill it, since the C and A
arrays have to have some room to flow through. These numbers comes within
20% of the peak floating-point speed of the processor. for( ii=0; ii < N; ii += ib )
for( kk=0; kk < N; kk += kb )
for( j=0; j < N; j+= 2 )
for( i=ii; i < ii+ib; i += 2 ) {
if( kk==0 )
acc00 = acc01 = acc10 = acc11 = 0;
else {
acc00 = C[i+0][j+0];
acc01 = C[i+0][j+1];
acc10 = C[i+1][j+0];
acc11 = C[i+1][j+1];
}
for( k=kk; k < kk+kb; k++ ) {
acc00 += A[k][j+0] * B[i+0][k];
acc01 += A[k][j+1] * B[i+0][k];
acc10 += A[k][j+0] * B[i+1][k];
acc11 += A[k][j+1] * B[i+1][k];
}
C[i+0][j+0] = acc00;
C[i+0][j+1] = acc01;
C[i+1][j+0] = acc10;
C[i+1][j+1] = acc11;
}The above code examples do not show the details of dealing with values
of N which are not multiples of the blocking factors. Compilers
which do loop nest optimization emit code to clean up the edges of the
computation. For example, most LNO compilers would probably split the
kk==0 iteration off from the rest of the kk iterations, in order to remove
the if statement from the i loop. This is one of the values of such a
compiler: while it is straightforward to code the simple cases of this
optimization, keeping all the details correct as the code is replicated
and transformed is an error-prone process.See also
External links
