NVIDIA G8 and CUDA Overview

Hardware

The GeForce 8800 of the G8 series contains 16 GPU chips, or streaming multiprocessors (SMs). Each SM chip has eight cores, which currently run at 1.35 GHz. The off-chip video memory is currently limited to 1.5 GB. A single core has a 32-bit single-precision floating point unit that can also handle integers. (The G9 series will allow 64-bit operations.) Each SM chip also has two special functional units (SFUs) that are presumably shared among the eight cores. The SFUs perform reciprocal, square root, sine and cosine. The 16 SMs, each with eight cores and 2 SFUs, imply a peak theoretical performance of (16 SMs * 18 functional units * 1.35 GHz) FLOPs, or 388.8 gigaflops (388.8 billion floating point operations per second). (The count of 18 presumably treats each of the eight cores as two units, for a multiply-add, plus the two SFUs.)
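
The peak figure above can be re-derived as a quick check. The sketch below is plain Python arithmetic rather than CUDA; the constant names are illustrative, and reading the 18 as eight multiply-add cores (two operations each) plus two SFUs is an assumption about how the figure was counted.

```python
# Peak-throughput arithmetic for the GeForce 8800 (illustrative names).
SM_COUNT = 16        # streaming multiprocessors
CORES_PER_SM = 8     # cores per SM
SFUS_PER_SM = 2      # special functional units per SM
CLOCK_GHZ = 1.35     # core clock

# Assumption: each core counts as 2 operations per cycle (multiply-add)
# and each SFU as 1, which yields the 18 used in the text.
ops_per_sm_per_cycle = CORES_PER_SM * 2 + SFUS_PER_SM

peak_gflops = SM_COUNT * ops_per_sm_per_cycle * CLOCK_GHZ
print(f"{ops_per_sm_per_cycle} ops/cycle/SM, {peak_gflops:.1f} GFLOPS peak")
```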

The aggregate bandwidth to off-chip global memory is 86.4 GB/s. (In comparison, bandwidth of CPU to RAM is typically between 5 GB/s and 10 GB/s.)

Programmer's Model

One CUDA kernel runs at a time on the device (the G8). (Maybe this restriction will be lifted in the future?) The number of threads must be declared at the time that the kernel is started: the launch specifies a grid of thread blocks, and each thread block consists of threads. Hence a thread is identified by its thread ID within its thread block, together with the ID of that thread block within the grid.
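
The two-level naming can be illustrated for the one-dimensional case. The sketch below is Python that mimics the index arithmetic; in a CUDA kernel the same expression is usually written as blockIdx.x * blockDim.x + threadIdx.x. The function names and the problem size are hypothetical.

```python
def global_thread_index(block_id, thread_id, threads_per_block):
    """Flat index of a thread, from its block ID and its ID within the block."""
    return block_id * threads_per_block + thread_id

def blocks_needed(n_items, threads_per_block):
    """Blocks required so that every item gets a thread (ceiling division)."""
    return (n_items + threads_per_block - 1) // threads_per_block

n_items = 1000              # hypothetical problem size
threads_per_block = 256     # must not exceed the 512-thread limit
n_blocks = blocks_needed(n_items, threads_per_block)

# Enumerate every (block, thread) pair of the grid. The last block is only
# partly used, so out-of-range indices are filtered out, as a kernel would
# do with an "if (i < n)" guard.
covered = [
    global_thread_index(b, t, threads_per_block)
    for b in range(n_blocks)
    for t in range(threads_per_block)
    if global_thread_index(b, t, threads_per_block) < n_items
]
print(n_blocks, len(covered))
```

The grid is rounded up to a whole number of blocks, which is why the bounds check is needed in the last block.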

Thread Block

The reason for two levels, thread block and grid, appears to be that a thread block is a set of threads running on a single chip (a single multiprocessor). Each thread block is assigned to exactly one SM chip, but two or more thread blocks may run on the same chip. If one thread block stalls while waiting for global memory, etc., then another thread block may run. This can be used to hide the latency.

Cache

The number of threads in a thread block is currently limited to at most 512. All threads in a thread block will share the memory found in the caches. Presumably, cache memory is also located on the chip (the multiprocessor). Threads of a thread block can synchronize using the __syncthreads primitive. Synchronization between distinct thread blocks is not easily supported. So the thread blocks of a grid run independently until the end of that kernel.

Global Memory

On the other hand, all threads in the entire grid will share the memory found in global memory. The global memory is presumably the off-chip memory that populates the video card. Currently, this global memory is limited to 1.5 GB.

Warp (performance issue, only)

As a performance issue, it is useful to be aware that the threads of a thread block are organized into warps of 32 threads. (Hence, one requires four GPU cycles for the eight cores of a multiprocessor chip to execute the 32 threads of a warp.)
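
The warp arithmetic can be spelled out as follows. This is a small Python sketch with illustrative names; it only restates the numbers given above and makes visible the padding that occurs when a block size is not a multiple of 32.

```python
WARP_SIZE = 32       # threads per warp
CORES_PER_SM = 8     # cores per multiprocessor

def warps_in_block(threads_per_block):
    """Number of warps needed to hold a thread block (ceiling division)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE

def idle_lanes(threads_per_block):
    """Thread slots in the last warp that have no thread assigned."""
    return warps_in_block(threads_per_block) * WARP_SIZE - threads_per_block

# Eight cores step through the 32 threads of a warp in four passes.
cycles_per_warp_instruction = WARP_SIZE // CORES_PER_SM

for size in (512, 100):
    print(size, warps_in_block(size), idle_lanes(size))
```

A block of 100 threads occupies four warps, the last of which is mostly empty, so block sizes that are multiples of 32 waste no slots.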

Types of cache

The on-chip cache is split among three kinds. Unlike CPU cache, the application program must explicitly copy data from global memory to cache and back.

shared cache -- This cache is standard read-write cache, except that it is managed directly by the application. The shared cache of an SM chip is 16 KB. As a performance issue, it is organized into 16 banks, and applications may be tuned to avoid bank conflicts. (One can simultaneously access one word for each of the 16 banks with no slowdown.)
constant cache -- This cache is read-only cache. The constant cache of an SM chip is 8 KB. The CUDA programming model restricts one to 64 KB total instead of (16 SMs * 8 KB) = 128 KB. It is available primarily as an optimization. If all threads of a warp access the same constant during an instruction, the read is at its cheapest; reads from several distinct locations cost proportionally more (see the Constant Memory notes below).
texture cache -- This cache supports 2-dimensional locality. There is 8 KB per SM chip. This cache appears to be managed in hardware, apparently with a wait of more than a hundred GPU cycles on a cache miss. Details of its use are omitted here.
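
The 16-bank organization of the shared cache can be explored with a small model. The Python sketch below assumes that successive 32-bit words fall into successive banks and that accesses are issued per half-warp of 16 threads; neither detail is stated in the list above, so both should be treated as assumptions, and the function name is hypothetical.

```python
from collections import Counter

NUM_BANKS = 16   # banks in the shared cache of one SM
HALF_WARP = 16   # assumption: accesses are issued 16 threads at a time

def conflict_degree(stride_words, base_word=0):
    """Worst-case number of threads of a half-warp that hit the same bank.

    Thread t reads the word at index base_word + t * stride_words.
    Use stride_words >= 1 so that all addresses are distinct.
    Assumption: word index w lives in bank w % NUM_BANKS.
    A result of 1 means no conflict; k means a k-way conflict.
    """
    banks = [(base_word + t * stride_words) % NUM_BANKS for t in range(HALF_WARP)]
    return max(Counter(banks).values())

for stride in (1, 2, 4, 16, 17):
    print(stride, conflict_degree(stride))
```

Under this model, odd strides are conflict-free while a stride of 16 words sends every thread to the same bank, which is why arrays are sometimes padded by one word per row.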

Registers

The number of registers for each thread of a thread block can be programmed on a per-thread-block basis. If a thread uses more "variables", the extra variables will be held in additional registers. When the program needs more registers than the number allocated for the thread, the extra registers are automatically spilled to a special region of global memory, which is called local memory.

Host

The host copies data from host to the G8 device global memory. A program on the device can copy data from global memory to cache.

Additional programming model or hardware constraints:

512 threads per thread block

8 thread blocks per SM

768 threads per SM

16,384 bytes of shared cache per SM

8,192 total registers per SM (shared among all thread blocks assigned to that SM)
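
Taken together, these limits bound how many thread blocks can be resident on one SM at a time. The Python sketch below is a simplified model under the limits listed above; it ignores any allocation granularity in the real hardware, and the function and parameter names are hypothetical.

```python
MAX_THREADS_PER_BLOCK = 512
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_SM = 768
SHARED_BYTES_PER_SM = 16384
REGISTERS_PER_SM = 8192

def resident_blocks_per_sm(threads_per_block, regs_per_thread, shared_bytes_per_block):
    """Upper bound on thread blocks that fit on one SM simultaneously."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 512")
    if regs_per_thread < 1:
        raise ValueError("regs_per_thread must be at least 1")
    limits = [
        MAX_BLOCKS_PER_SM,                                          # block limit
        MAX_THREADS_PER_SM // threads_per_block,                    # thread limit
        REGISTERS_PER_SM // (regs_per_thread * threads_per_block),  # register limit
    ]
    if shared_bytes_per_block > 0:
        limits.append(SHARED_BYTES_PER_SM // shared_bytes_per_block)  # shared cache limit
    # The tightest constraint decides.
    return min(limits)

print(resident_blocks_per_sm(256, 10, 4096))
print(resident_blocks_per_sm(256, 16, 4096))
```

In the second call the register limit becomes the tightest one, so raising the register count per thread from 10 to 16 costs one resident block.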

Interesting CUDA Functions

atomicXXX(): Appendix C of CUDA manual: Atomic functions; atomic operations in global memory

cuMemAllocPitch(): Appendix E.6 of CUDA manual; Efficiency optimization to pad 2-dimensional arrays to "meet the alignment requirements for coalescing as the address is updated from row to row" (see Section 5.1.2.1)

Global Memory: Section 5.1.2.1: optimization issues for coalescing of addresses: on each half-warp, thread N should access address HalfWarpBaseAddress + N. Future hardware may coalesce on full warp. For 2-dimensional arrays, it is recommended that the width be rounded up to a multiple of 16.

Constant Memory: Section 5.1.2.2: optimization issues: Constant memory is a true cache for a region of global memory. This cache is transparent, but the time for the threads of a half-warp to read from two locations in constant memory is twice the cost for the threads of a half-warp to all read from the same location. Future hardware may do this for a full warp.

Shared Memory: Section 5.1.2.4: optimization issues for bank conflicts. (CUDA has 16 memory banks.)
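
The row-padding advice in the Global Memory note above can be sketched in a few lines. This is Python working in element counts with hypothetical function names; cuMemAllocPitch() performs the corresponding padding for real allocations and reports the pitch it chose, so the sketch only illustrates the arithmetic.

```python
ALIGN_ELEMENTS = 16   # recommended row-width multiple for 2-dimensional arrays

def padded_width(width):
    """Round a row width up to the next multiple of ALIGN_ELEMENTS."""
    return (width + ALIGN_ELEMENTS - 1) // ALIGN_ELEMENTS * ALIGN_ELEMENTS

def element_offset(row, col, width):
    """Offset of element (row, col) in a padded row-major array."""
    return row * padded_width(width) + col

width = 100   # hypothetical logical row width
print(padded_width(width), element_offset(1, 5, width))
```

With padding, every row starts at an offset that is a multiple of 16, so a half-warp reading consecutive elements of any row begins at an aligned position.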