Programming Multi-Core and Many-Core Systems

GPU computing

GPUs offer higher performance and better power efficiency than CPUs, but they are not easy to program well or efficiently.

CPU:

GPU:

NVIDIA GPU architecture

Programming GPUs

Low level models: CUDA, OpenCL, and variations

CUDA:

CUDA program organisation:

Example of code:

// forward declaration of the GPU device helper used by the kernel
__device__ void myProcess(int *dataGPU);

// GPU kernel code (entry point, launched from the host)
__global__ void myKernel(int n, int *dataGPU) {
    myProcess(dataGPU);
}

// GPU device code (callable only from device code)
__device__ void myProcess(int *dataGPU) {
    // code
}

// CPU (host) code
int main(int argc, const char **argv) {
    // launch a grid of 100 blocks with 10 threads each;
    // myData must point to GPU memory
    myKernel<<<100, 10>>>(1000, myData);
}

Compiling CUDA:
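
CUDA source files (.cu) are compiled with NVIDIA's nvcc compiler driver, which separates host code (handed to the regular C/C++ compiler) from device code (compiled for the GPU). A minimal invocation might look as follows; the file name and the target architecture flag are placeholders:

nvcc -O2 -arch=sm_70 myKernel.cu -o myKernel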

Execution flow loop:

  1. GPU memory allocation
  2. Transfer data CPU → GPU
  3. CPU calls GPU kernel
  4. GPU kernel executes
  5. Transfer data GPU → CPU
  6. GPU memory release
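
A minimal sketch of these six steps in CUDA code (the kernel and data are placeholders; the vector-add example later in this section shows the same flow in full):

__global__ void incKernel(int n, int *d_data) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) d_data[i] += 1;
}

int main() {
    const int n = 1024;
    size_t size = n * sizeof(int);
    int h_data[1024] = {0};
    int *d_data;

    cudaMalloc(&d_data, size);                                  // 1. GPU memory allocation
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // 2. transfer data CPU -> GPU
    incKernel<<<(n + 255) / 256, 256>>>(n, d_data);             // 3. CPU calls GPU kernel, 4. kernel executes
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // 5. transfer data GPU -> CPU (waits for the kernel)
    cudaFree(d_data);                                           // 6. GPU memory release
    return 0;
}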

Creating a CUDA application

  1. Identify function to offload
  2. Determine mapping of operations and data to threads
  3. Write kernels & device functions (sequential, per-thread)
  4. Determine block geometry, i.e. threads per block and blocks per grid
  5. Write host code: memory initialization, kernel launch, inspect results
  6. Optimize kernels

GPU data transfer models

Example: vector add

First, the sequential code:

void vector_add(int size, float *a, float *b, float *c) {
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

What does each thread compute? One addition per thread, with each thread working on a different element. To determine which element, compute the mapping from the thread grid to the data.
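
For example, with blockDim.x = 256, the thread with blockIdx.x = 3 and threadIdx.x = 5 gets global index i = 3 * 256 + 5 = 773 and therefore handles element 773 of each vector.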

Example with CUDA:

#include <stdlib.h>

// GPU kernel code
// compute vector sum c = a+b
// each thread does one pair-wise addition
__global__ void vector_add(int n, float *a, float *b, float *c) {
    int i = threadIdx.x + blockDim.x * blockIdx.x; // map this thread to one data element
    if (i < n) c[i] = a[i] + b[i];
}

// Host CPU code
int main() {
    int N = 5000;
    int size = N * sizeof(float);
    float *hostA = (float *)malloc(size);
    float *hostB = (float *)malloc(size);
    float *hostC = (float *)malloc(size);
    float *deviceA, *deviceB, *deviceC;

    // initialize A, B arrays

    // allocate device memory
    cudaMalloc(&deviceA, size);
    cudaMalloc(&deviceB, size);
    cudaMalloc(&deviceC, size);

    // transfer data from host to device
    cudaMemcpy(deviceA, hostA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, size, cudaMemcpyHostToDevice);

    // launch N/256+1 blocks of 256 threads each
    // (the bounds check in the kernel masks the extra threads)
    vector_add<<< N/256 + 1, 256 >>>(N, deviceA, deviceB, deviceC);

    // transfer result back from device to host
    cudaMemcpy(hostC, deviceC, size, cudaMemcpyDeviceToHost);

    // release device and host memory
    cudaFree(deviceA);
    cudaFree(deviceB);
    cudaFree(deviceC);
    free(hostA);
    free(hostB);
    free(hostC);
}

With OpenACC:

void vector_add(int size, float *a, float *b, float *c) {
    #pragma acc kernels copyin(a[0:size], b[0:size]) copyout(c[0:size])
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}
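
The copyin/copyout clauses tell the compiler which array ranges to move to and from the device; the loop itself is turned into a GPU kernel. Building this requires an OpenACC-capable compiler, for example (the file name is a placeholder):

nvc -acc -O2 vector_add.c -o vector_add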

Execution model

Task queue & GigaThread engine

Host: tasks for the GPU are pushed into a queue (the “default stream”) and execute in order

Device: the GigaThread engine manages the GPU workload and dispatches blocks to the streaming multiprocessors (SMs)
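
A minimal sketch of this queueing model with explicitly created streams (the kernels and sizes are assumptions): work queued in one stream executes in order, while work queued in different streams may overlap on the device.

__global__ void k1(float *d) { d[threadIdx.x] *= 2.0f; }
__global__ void k2(float *d) { d[threadIdx.x] += 1.0f; }

int main() {
    float *dA, *dB;
    cudaMalloc(&dA, 256 * sizeof(float));
    cudaMalloc(&dB, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    k1<<<1, 256, 0, s1>>>(dA);   // queued in stream s1
    k2<<<1, 256, 0, s2>>>(dB);   // queued in stream s2, may run concurrently with k1

    cudaStreamSynchronize(s1);   // wait for all work queued in s1
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
}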

Scheduling: mapping and ordering application blocks on hardware resources

Context switching: swapping state and data of blocks that replace each other

Block scheduling:

Warp scheduling:

Memory spaces

Multiple device memory scopes:
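
A minimal sketch of the common CUDA memory scopes (registers and local memory are per thread, __shared__ memory is per block, global and __constant__ memory are visible to all threads and managed from the host); the kernel and names are hypothetical:

__constant__ float coeff[16];                         // constant memory: read-only in kernels, filled via cudaMemcpyToSymbol

__global__ void scale(float *data) {                  // data points to global memory (all threads, all blocks)
    __shared__ float tile[256];                       // shared memory: shared by the threads of one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // i lives in a register (private to this thread)
    tile[threadIdx.x] = data[i];
    __syncthreads();                                  // make shared-memory writes visible to the whole block
    data[i] = tile[threadIdx.x] * coeff[0];
}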

Unified/managed memory

To avoid expensive data movement: keep most operations on the device, including initialization; prefetch managed memory before it is needed; or fall back to explicit memory copies.

Prefetching: move data to the GPU before it is needed

Advise (cudaMemAdvise): establish the location where the data resides; devices where it does not reside only copy it on demand
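
A minimal sketch combining both techniques with managed memory (the kernel, the problem size, and the use of device 0 are assumptions):

__global__ void scaleKernel(int n, float *data) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    int device = 0;                                   // assumed: first GPU

    float *data;
    cudaMallocManaged(&data, bytes);                  // managed allocation, usable on host and device
    for (int i = 0; i < n; ++i) data[i] = 1.0f;       // initialized on the host

    // advise: the preferred home of this data is the GPU
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);

    // prefetch: move the data to the GPU before the kernel needs it
    cudaMemPrefetchAsync(data, bytes, device);
    scaleKernel<<<(n + 255) / 256, 256>>>(n, data);

    // prefetch the result back to the CPU before the host reads it
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(data);
}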

With explicit copies, the host (CPU) manages device (GPU) memory and copies data back and forth.

Example with unified memory:

#include <stdio.h>

__global__ void AplusB(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
    int *ret, i;
    cudaMallocManaged(&ret, 1000 * sizeof(int));        // managed memory, accessible from host and device
    AplusB<<< 1, 1000 >>>(ret, 10, 100);
    cudaDeviceSynchronize();                            // wait for the kernel before the host reads ret
    for (i = 0; i < 1000; ++i) printf("%d ", ret[i]);   // host reads the same pointer, no explicit copy
    cudaFree(ret);
}

Example with explicit copies

#include <stdio.h>
#include <stdlib.h>

__global__ void AplusB(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
    int *ret, i;
    cudaMalloc(&ret, 1000 * sizeof(int));               // device-only allocation
    AplusB<<< 1, 1000 >>>(ret, 10, 100);
    int *host_ret = (int *)malloc(1000 * sizeof(int));
    // explicit copy back to the host; the direction is inferred from the
    // pointers (cudaMemcpyDefault), and the copy waits for the kernel
    cudaMemcpy(host_ret, ret, 1000 * sizeof(int), cudaMemcpyDefault);
    for (i = 0; i < 1000; ++i) printf("%d ", host_ret[i]);
    free(host_ret); cudaFree(ret);
}

Memory coalescing