Two followups on what we saw in the lecture, copied and pasted from the email I sent after class.

Local memory access patterns

First, I've decreased the overhead in the local memory access demo to the point where the predicted factor of two comes out pretty clearly. More specifically, I've changed this:

for (int j = 0; j < 1000; ++j)
  loc[li] += loc[ARGUMENT * li];

to this:

float x = 0;
// ARGUMENT is the stride being timed, li the work item's local index
for (int j = 0; j < 100; ++j)
{
  #pragma unroll
  for (int k = 0; k < 10; ++k)
    x += loc[ARGUMENT * li];
}
loc[li] = x;

Full code, if you'd like to play around yourself.
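If you'd rather see the whole shape of it in one place, here is a minimal sketch of a complete kernel around that loop. The kernel name, work-group size, local array size, and the initialization are my assumptions, not the actual lecture code; ARGUMENT is assumed to be the stride, supplied as a build-time define (e.g. -DARGUMENT=4):

#define LOCAL_SIZE 64                 // assumed work-group size

__kernel void local_stride_bench(__global float *out)
{
  // sized so that the largest stride tried (64) stays in bounds
  __local float loc[64 * LOCAL_SIZE];

  int li = get_local_id(0);

  // fill local memory so the timed reads touch initialized data
  for (int i = li; i < 64 * LOCAL_SIZE; i += LOCAL_SIZE)
    loc[i] = i;
  barrier(CLK_LOCAL_MEM_FENCE);

  float x = 0;
  for (int j = 0; j < 100; ++j)
  {
    #pragma unroll
    for (int k = 0; k < 10; ++k)
      x += loc[ARGUMENT * li];
  }
  loc[li] = x;
  barrier(CLK_LOCAL_MEM_FENCE);

  // write something back so the compiler can't discard the work
  out[get_global_id(0)] = loc[li];
}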

In other words: there was a local store in the inner timing loop that had nothing to do with what we were trying to time. That store just added fixed overhead and thus muddied the waters. In addition, I've unrolled a loop (that `#pragma unroll` is Nvidia-specific, before you get too excited). With these changes, the performance matches the prediction exactly:

stride 1 elapsed: 0.0174374 s
stride 2 elapsed: 0.0337717 s --- factor 1.936739422161561
stride 4 elapsed: 0.0674865 s --- factor 1.9983151573654865
stride 8 elapsed: 0.134922 s --- factor 1.9992442933031047
stride 16 elapsed: 0.269786 s --- factor 1.999570121996413
stride 32 elapsed: 0.53952 s --- factor 1.9998072546388617

If you don't remember, the reason for this is that more and more work items are fighting over fewer and fewer banks. With each doubling of the stride, the multiplicity of the conflict doubles as well, and so more "turns" have to be taken until all accesses have been performed.
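
To make the doubling concrete, here is a small, purely illustrative C snippet. It assumes the textbook model (32 banks, with consecutive 32-bit words assigned to consecutive banks) and counts how many of 32 simultaneous accesses to loc[ARGUMENT * li] land on the most contended bank, i.e. how many "turns" the hardware needs:

#include <stdio.h>

#define NUM_BANKS 32

/* For 32 work items accessing loc[stride * li], return how many of them
   hit the most heavily used bank, i.e. the number of serialized "turns". */
static int conflict_multiplicity(int stride)
{
  int per_bank[NUM_BANKS] = { 0 };
  int worst = 0;
  for (int li = 0; li < 32; ++li)
  {
    int bank = (stride * li) % NUM_BANKS;
    if (++per_bank[bank] > worst)
      worst = per_bank[bank];
  }
  return worst;
}

int main(void)
{
  int strides[] = { 1, 2, 3, 4, 8, 16, 17, 32, 64 };
  for (int i = 0; i < (int) (sizeof strides / sizeof *strides); ++i)
    printf("stride %2d -> %2d turn(s)\n",
           strides[i], conflict_multiplicity(strides[i]));
  return 0;
}

This prints 1, 2, 1, 4, 8, 16, 1, 32, 32 turns for the strides in that order, which lines up with the measured factors above and with the stride-64 and odd-stride timings below.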

There are 32 banks on this hardware, so a stride of 32 is already maximally conflicted:

stride 64 elapsed: 0.539527 s --- factor 1.0000129744958481

and here are some odd strides for comparison:

stride 3 elapsed: 0.0174662 s
stride 17 elapsed: 0.0174661 s

I.e., these proceed at full speed, as predicted: an odd stride is coprime to the bank count of 32, so the 32 work items still hit 32 distinct banks.

Global memory access patterns

Second, the correct response to the question of what happens when the strides in global memory get so big (and thus bad!) that they skip an entire bus width is "I make no prediction". It's such a bad access pattern that we really don't need to care, and thus the behavior in this regime *should* be viewed as an implementation detail.

That said, if you crave truth, read on, but be prepared: it gets a little complicated. The "super-wide bus" is a model that explains many things, such as why big strides are bad and why misalignment can hurt you. The true memory system on a GPU is actually a fair bit more complicated still.

See Section 5.1 of this PDF (check here for more) for a description of how global memory actually works on AMD GPUs.

(Why AMD? Nvidia doesn't even document these details. And the global memory systems are rather similar, because they interface with the same off-the-shelf memory anyhow.)

The 15-second version is this quote from that text:

  • Accesses that differ in the lower bits can run in parallel; those that differ only in the upper bits can be serialized.

In other words, some address bits are used to choose a "memory channel" (see the text). Once your addresses differ only in bits above these (as with the big strides we were trying), your accesses can end up fighting over just a few channels. This is rather similar conceptually to bank conflicts in local memory (and to cache associativity issues on CPUs!).
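
If you want to play with the idea, here is a toy model in plain C. The channel count and the bit positions below are made up for illustration (the real mapping is hardware-specific; see the AMD documentation mentioned above), but it shows how large power-of-two strides concentrate the traffic on fewer and fewer channels:

#include <stdio.h>

/* Toy model only: 8 channels selected by byte-address bits 8..10,
   i.e. 256-byte interleaving.  Real hardware differs. */
#define NUM_CHANNELS 8
#define CHANNEL_SHIFT 8

static int channel_of(unsigned byte_addr)
{
  return (byte_addr >> CHANNEL_SHIFT) % NUM_CHANNELS;
}

int main(void)
{
  /* 32 work items each read one 4-byte float at a given element stride. */
  int strides[] = { 64, 128, 256, 512 };
  for (int s = 0; s < 4; ++s)
  {
    int used[NUM_CHANNELS] = { 0 }, n = 0;
    for (int li = 0; li < 32; ++li)
      used[channel_of(li * strides[s] * 4)] = 1;
    for (int c = 0; c < NUM_CHANNELS; ++c)
      n += used[c];
    printf("element stride %3d -> %d of %d channels touched\n",
           strides[s], n, NUM_CHANNELS);
  }
  return 0;
}

In this toy mapping, doubling the stride from 64 to 128 to 256 to 512 elements halves the number of channels touched from 8 to 4 to 2 to 1, which is the same flavor of effect as the bank conflicts above.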
