NVIDIA G8 and CUDA Overview
Hardware
The GeForce 8800 of the G8 series contains 16 GPU chips, or
streaming multiprocessors (SMs). Each SM chip has eight cores,
which currently run at 1.35 GHz. The off-chip video memory is
currently limited to 1.5 GB. A single core has a 32-bit
single-precision floating point unit that can also handle integers.
(The G9 series will allow 64-bit operations.) Each SM chip also
has two special functional units (SFUs) that are presumably shared
among the eight cores. The SFUs perform reciprocal, square root, sine
and cosine. The 16 SMs and 2 SFUs per chip imply a peak theoretical
performance of (16 SM * 18 functional units * 1.35 GHz) FLOPs,
or 388.8 gigaflops (388.8 billion floating point
operations per second).
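The peak figure can be reproduced with a few lines of arithmetic. The sketch below is only a worked check of the numbers quoted in this section. The constant and function names are hypothetical, and it assumes one floating point operation per functional unit per cycle, as the formula above does.

```python
# Worked check of the peak-performance arithmetic quoted in the text.
# All names are illustrative; the figures themselves come from the text.
NUM_SMS = 16                   # streaming multiprocessors on the GeForce 8800
FUNCTIONAL_UNITS_PER_SM = 18   # per-SM count used in the formula in the text
CLOCK_GHZ = 1.35               # core clock in GHz


def peak_gflops(num_sms, units_per_sm, clock_ghz):
    """Return peak gigaflops, assuming one operation per unit per cycle."""
    return num_sms * units_per_sm * clock_ghz


if __name__ == "__main__":
    gflops = peak_gflops(NUM_SMS, FUNCTIONAL_UNITS_PER_SM, CLOCK_GHZ)
    print(f"{gflops:.1f} gigaflops")
```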
The aggregate bandwidth to off-chip global memory is 86.4 GB/s.
(In comparison, bandwidth of CPU to RAM is typically between 5 GB/s
and 10 GB/s.)
Programmer's Model
One CUDA kernel runs at a time on the device (the G8). (Maybe this
restriction will be lifted in the future?) The CUDA kernel
must declare how many threads it will use at the time that the kernel
is started. The kernel declares a grid of thread blocks,
each of which consists of threads. Hence a thread is identified by its
thread ID within its thread block, and by the thread block ID within the grid.
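To make the two-level naming concrete, here is a minimal sketch in plain Python (not CUDA code) that enumerates the (thread block ID, thread ID) pairs of a one-dimensional grid and flattens each pair into a single index. The flattening formula block_id * threads_per_block + thread_id is a common convention assumed here for illustration, and the function names are hypothetical.

```python
def global_index(block_id, thread_id, threads_per_block):
    """Flatten (block ID, thread ID) into one index (assumed convention)."""
    if not 0 <= thread_id < threads_per_block:
        raise ValueError("thread_id must lie inside the thread block")
    return block_id * threads_per_block + thread_id


def enumerate_grid(num_blocks, threads_per_block):
    """List every (block ID, thread ID, flat index) triple of a 1-D grid."""
    return [
        (b, t, global_index(b, t, threads_per_block))
        for b in range(num_blocks)
        for t in range(threads_per_block)
    ]


if __name__ == "__main__":
    # A tiny grid: 2 thread blocks of 4 threads each.
    for block_id, thread_id, index in enumerate_grid(2, 4):
        print(block_id, thread_id, index)
```

Each thread gets a distinct flat index, so a grid of B blocks with T threads each can cover an array of B * T elements with one element per thread.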
- Thread Block
-
The reason for two levels, thread block and grid, appears to be that
a thread block is a set of threads running on a single chip
(on a single multiprocessor). (Each thread block is assigned to exactly
one SM chip, but two or more thread blocks may run on the
same chip. If one thread block stalls while waiting for global memory, etc.,
then another thread block may run. This can be used to hide the latency.)
- Cache
-
The number of threads in a thread block
is currently limited to at most 512. All threads in a thread block will
share the memory found in the caches. Presumably, cache memory is also
located on the chip (the multiprocessor). Threads of a thread
block can synchronize using the
__syncthreads
primitive.
Synchronization between distinct thread blocks is not easily supported.
So the thread blocks of a grid run independently until the end of that
kernel.
- Global Memory
-
On the other hand, all threads in the entire grid will share the memory
found in global memory. The global memory is presumably the off-chip
memory that populates the video card. Currently, this global
memory is limited to 1.5 GB.
- Warp (performance issue, only)
-
As a performance issue, it is useful to be aware that the threads
of a thread block are organized into warps of 32 threads.
(Hence, one requires four GPU cycles for the eight cores of a multiprocessor
chip to execute the 32 threads of a warp.)
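The warp arithmetic can be checked with a short sketch. It assumes that a thread block whose size is not a multiple of 32 still occupies a whole (partly empty) final warp. The names are hypothetical.

```python
import math

WARP_SIZE = 32     # threads per warp, from the text
CORES_PER_SM = 8   # cores per multiprocessor, from the text


def warps_per_block(threads_per_block):
    """Warps needed to hold a thread block (the last warp may be partial)."""
    return math.ceil(threads_per_block / WARP_SIZE)


def cycles_per_warp_instruction():
    """Cycles for the cores of one SM to run one instruction for a full warp."""
    return WARP_SIZE // CORES_PER_SM


if __name__ == "__main__":
    for size in (32, 100, 512):
        print(size, "threads ->", warps_per_block(size), "warps")
    print("cycles per warp instruction:", cycles_per_warp_instruction())
```

Under that assumption, choosing a thread block size that is a multiple of 32 avoids a partly idle final warp.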
- Types of cache
-
The on-chip cache is split among three kinds. Unlike CPU cache, the
application program must explicitly copy data from global memory to
cache and back.
- shared cache -- This cache is standard read-write cache,
except that it is managed directly by the application.
The shared cache of an SM chip is 16 KB.
As a performance issue, it is organized into 16 banks, and
applications may be tuned to avoid bank conflicts. (One can
simultaneously access one word for each of the 16 banks
with no slowdown.)
- constant cache -- This cache is read-only cache. The constant
cache of an SM chip is 8 KB. The CUDA programming model
restricts one to 64 KB total instead of (16 SMs * 8 KB)
= 128 KB. It is
available primarily as an optimization. If all threads of a warp
access the same constant during an instruction, the read is cheapest;
reads from distinct locations cost more (see Constant Memory below).
- texture cache -- This cache supports 2-dimensional locality.
There is 8 KB per SM chip. This cache appears to be
managed in hardware, apparently with a wait of more than a hundred
GPU cycles on a cache miss. Details of its use are
omitted here.
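The bank organization of the shared cache can be illustrated with a small model. This is a sketch under stated assumptions, not a description of the actual hardware: it assumes 32-bit words, that successive words map to successive banks, and that a group of 16 threads accesses the cache together. The function names are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 16   # banks in the shared cache, from the text
WORD_BYTES = 4   # assumed 32-bit words


def bank_of(byte_address):
    """Bank holding a byte address, assuming successive words rotate banks."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


def conflict_degree(byte_addresses):
    """Largest number of distinct words that fall into a single bank.

    A result of 1 means no bank conflict in this model.
    """
    words_per_bank = defaultdict(set)
    for address in byte_addresses:
        words_per_bank[bank_of(address)].add(address // WORD_BYTES)
    return max((len(words) for words in words_per_bank.values()), default=0)


def strided_addresses(stride_words, num_threads=NUM_BANKS):
    """Byte addresses when thread t reads word t * stride_words."""
    return [t * stride_words * WORD_BYTES for t in range(num_threads)]


if __name__ == "__main__":
    for stride in (1, 2, 3, 16):
        print("stride", stride, "-> conflict degree",
              conflict_degree(strided_addresses(stride)))
```

In this model a stride of one word touches each bank once, while a stride of 16 words sends every thread to the same bank, which is the kind of pattern that tuning tries to avoid.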
- Registers
-
The number of registers for each thread of a thread block can be programmed
on a per-thread block basis. If a thread uses more "variables", the
extra variables will be encoded as additional registers. When there
are more program registers than the number allocated for the thread,
extra registers will automatically be spilled to a special region
of global memory, which is called local memory.
- Host
-
The host copies data from host to the G8 device global memory.
A program on the device can copy data from global memory to cache.
- Additional programming model or hardware constraints:
-
- 512 threads per thread block
- 8 thread blocks per SM
- 768 threads per SM
- 16,384 bytes of shared cache per SM
- 8,192 total registers per SM (shared among all thread blocks
assigned to that SM)
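These limits interact: whichever resource runs out first decides how many thread blocks can share one SM. The sketch below takes the minimum over the constraints listed above. It uses simple integer division and ignores any allocation granularity the hardware may impose, which is an assumption. The function name is hypothetical.

```python
# Per-SM limits, taken from the list of constraints in the text.
MAX_THREADS_PER_BLOCK = 512
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_SM = 768
SHARED_BYTES_PER_SM = 16384
REGISTERS_PER_SM = 8192


def blocks_per_sm(threads_per_block, registers_per_thread, shared_bytes_per_block):
    """Estimate how many thread blocks fit on one SM at the same time."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 512")
    limits = [MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block]
    if registers_per_thread > 0:
        registers_per_block = registers_per_thread * threads_per_block
        limits.append(REGISTERS_PER_SM // registers_per_block)
    if shared_bytes_per_block > 0:
        limits.append(SHARED_BYTES_PER_SM // shared_bytes_per_block)
    return min(limits)


if __name__ == "__main__":
    # 256 threads per block, 10 registers per thread, 4 KB of shared cache.
    print(blocks_per_sm(256, 10, 4096))
```

A result of 0 means that a single thread block already exceeds one of the per-SM limits in this simplified model.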
- Interesting CUDA Functions
-
- atomicXXX(): Appendix C of CUDA manual:
Atomic functions; atomic operations in global memory
- cuMemAllocPitch(): Appendix E.6 of CUDA manual;
Efficiency optimization to pad 2-dimensional arrays to
"meet the alignment requirements for coalescing as the
address is updated from row to row" (see Section 5.1.2.1)
- Global Memory: Section 5.1.2.1: optimization issues for coalescing
of addresses: on each half-warp, thread N should access address
HalfWarpBaseAddress + N.
Future hardware may coalesce on full warp.
For 2-dimensional arrays, it is recommended that the width
be rounded up to a multiple of 16.
- Constant Memory: Section 5.1.2.2: optimization issues:
Constant memory is a true cache for a region of global memory.
This is a transparent cache, but the time for the threads
of a half-warp
to read from two locations in constant memory is twice the
cost for the threads of a half-warp to all read from the same
location.
Future hardware may do this for a full warp.
- Shared Memory: Section 5.1.2.4: optimization issues for bank conflicts.
(CUDA has 16 memory banks.)
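The global memory advice above (thread N of a half-warp accesses HalfWarpBaseAddress + N, and 2-dimensional array widths are padded to 16) can be sketched as follows. Addresses are counted in array elements rather than bytes, and the requirement that each row start at a multiple of 16 elements is an assumption used to show why padding helps. All function names are hypothetical.

```python
HALF_WARP = 16   # threads per half-warp


def padded_width(width_elements, multiple=HALF_WARP):
    """Round a row width up to the next multiple of 16 elements."""
    return -(-width_elements // multiple) * multiple   # ceiling division


def row_access(row, pitch_elements):
    """Element indices touched when thread N reads element N of a row."""
    return [row * pitch_elements + n for n in range(HALF_WARP)]


def follows_coalescing_pattern(indices, base):
    """True if thread N of the half-warp accesses element base + N."""
    return len(indices) == HALF_WARP and all(
        index == base + n for n, index in enumerate(indices)
    )


def row_base_is_aligned(row, pitch_elements):
    """True if the row starts at a multiple of 16 elements (assumed rule)."""
    return (row * pitch_elements) % HALF_WARP == 0


if __name__ == "__main__":
    width = 100
    pitch = padded_width(width)
    print("padded width:", pitch)
    print("row 1 aligned, unpadded:", row_base_is_aligned(1, width))
    print("row 1 aligned, padded:", row_base_is_aligned(1, pitch))
```

With an unpadded width of 100 elements, row 1 starts at element 100, which is not a multiple of 16. Padding the width keeps every row start on a 16-element boundary in this model.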