Programming Multi-Core and Many-Core Systems

GPU computing

GPUs offer higher performance and better power efficiency than CPUs, but they are not easy to program well or efficiently.

CPU:

GPU:

NVIDIA GPU architecture

Programming GPUs

Low level models: CUDA, OpenCL, and variations

CUDA:

CUDA program organisation:

Example of code:

// forward declaration of the GPU device helper used by the kernel
__device__ void myProcess(int *dataGPU);

// GPU kernel code (entry point, launched from the host)
__global__ void myKernel(int n, int *dataGPU) {
    myProcess(dataGPU);
}

// GPU device code (callable only from device code)
__device__ void myProcess(int *dataGPU) {
    // code
}

// CPU (host) code
int main(int argc, const char **argv) {
    // launch a grid of 100 blocks with 10 threads each;
    // myData must point to GPU memory
    myKernel<<<100, 10>>>(1000, myData);
}

Compiling CUDA:
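
CUDA source files (.cu) are compiled with NVIDIA's nvcc compiler driver, which separates host code (handed to the regular C/C++ compiler) from device code (compiled for the GPU). A minimal invocation might look as follows; the file name and the target architecture flag are placeholders:

nvcc -O2 -arch=sm_70 myKernel.cu -o myKernel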

Execution flow loop:

  1. GPU memory allocation
  2. Transfer data CPU → GPU
  3. CPU calls GPU kernel
  4. GPU kernel executes
  5. Transfer data GPU → CPU
  6. GPU memory release
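
A minimal sketch of these six steps in CUDA code (the kernel and data are placeholders; the vector-add example later in this section shows the same flow in full):

__global__ void incKernel(int n, int *d_data) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) d_data[i] += 1;
}

int main() {
    const int n = 1024;
    size_t size = n * sizeof(int);
    int h_data[1024] = {0};
    int *d_data;

    cudaMalloc(&d_data, size);                                  // 1. GPU memory allocation
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // 2. transfer data CPU -> GPU
    incKernel<<<(n + 255) / 256, 256>>>(n, d_data);             // 3. CPU calls GPU kernel, 4. kernel executes
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // 5. transfer data GPU -> CPU (waits for the kernel)
    cudaFree(d_data);                                           // 6. GPU memory release
    return 0;
}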

Creating a CUDA application

  1. Identify function to offload
  2. Determine mapping of operations and data to threads
  3. Write kernels & device functions (sequential, per-thread)
  4. Determine block geometry, i.e. threads per block and blocks per grid
  5. Write host code: memory initialization, kernel launch, inspect results
  6. Optimize kernels

GPU data transfer models

Example: vector add

First, the sequential code:

void vector_add(int size, float *a, float *b, float *c) {
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}

What does each thread compute? One addition per thread, with each thread working on a different element. To determine which element, compute the mapping from the thread grid to the data.
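
For example, with blockDim.x = 256, the thread with blockIdx.x = 3 and threadIdx.x = 5 gets global index i = 3 * 256 + 5 = 773 and therefore handles element 773 of each vector.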

Example with CUDA:

#include <stdlib.h>

// GPU kernel code
// compute vector sum c = a+b
// each thread does one pair-wise addition
__global__ void vector_add(int n, float *a, float *b, float *c) {
    int i = threadIdx.x + blockDim.x * blockIdx.x; // map this thread to one data element
    if (i < n) c[i] = a[i] + b[i];
}

// Host CPU code
int main() {
    int N = 5000;
    int size = N * sizeof(float);
    float *hostA = (float *)malloc(size);
    float *hostB = (float *)malloc(size);
    float *hostC = (float *)malloc(size);
    float *deviceA, *deviceB, *deviceC;

    // initialize A, B arrays

    // allocate device memory
    cudaMalloc(&deviceA, size);
    cudaMalloc(&deviceB, size);
    cudaMalloc(&deviceC, size);

    // transfer data from host to device
    cudaMemcpy(deviceA, hostA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, size, cudaMemcpyHostToDevice);

    // launch N/256+1 blocks of 256 threads each
    // (the bounds check in the kernel masks the extra threads)
    vector_add<<< N/256 + 1, 256 >>>(N, deviceA, deviceB, deviceC);

    // transfer result back from device to host
    cudaMemcpy(hostC, deviceC, size, cudaMemcpyDeviceToHost);

    // release device and host memory
    cudaFree(deviceA);
    cudaFree(deviceB);
    cudaFree(deviceC);
    free(hostA);
    free(hostB);
    free(hostC);
}

With OpenACC:

void vector_add(int size, float *a, float *b, float *c) {
    #pragma acc kernels copyin(a[0:size], b[0:size]) copyout(c[0:size])
    for (int i = 0; i < size; ++i) {
        c[i] = a[i] + b[i];
    }
}
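
The copyin/copyout clauses tell the compiler which array ranges to move to and from the device; the loop itself is turned into a GPU kernel. Building this requires an OpenACC-capable compiler, for example (the file name is a placeholder):

nvc -acc -O2 vector_add.c -o vector_add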

Execution model

Task queue & GigaThread engine

Host: tasks for the GPU are pushed into a queue (the “default stream”) and execute in order

Device: the GigaThread engine manages the GPU workload and dispatches blocks to the streaming multiprocessors (SMs)
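
A minimal sketch of this queueing model with explicitly created streams (the kernels and sizes are assumptions): work queued in one stream executes in order, while work queued in different streams may overlap on the device.

__global__ void k1(float *d) { d[threadIdx.x] *= 2.0f; }
__global__ void k2(float *d) { d[threadIdx.x] += 1.0f; }

int main() {
    float *dA, *dB;
    cudaMalloc(&dA, 256 * sizeof(float));
    cudaMalloc(&dB, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    k1<<<1, 256, 0, s1>>>(dA);   // queued in stream s1
    k2<<<1, 256, 0, s2>>>(dB);   // queued in stream s2, may run concurrently with k1

    cudaStreamSynchronize(s1);   // wait for all work queued in s1
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
}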

Scheduling: mapping and ordering application blocks on hardware resources

Context switching: swapping state and data of blocks that replace each other

Block scheduling:

Warp scheduling:

Memory spaces

Multiple device memory scopes:
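
A minimal sketch of the common CUDA memory scopes (registers and local memory are per thread, __shared__ memory is per block, global and __constant__ memory are visible to all threads and managed from the host); the kernel and names are hypothetical:

__constant__ float coeff[16];                         // constant memory: read-only in kernels, filled via cudaMemcpyToSymbol

__global__ void scale(float *data) {                  // data points to global memory (all threads, all blocks)
    __shared__ float tile[256];                       // shared memory: shared by the threads of one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // i lives in a register (private to this thread)
    tile[threadIdx.x] = data[i];
    __syncthreads();                                  // make shared-memory writes visible to the whole block
    data[i] = tile[threadIdx.x] * coeff[0];
}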

Unified/managed memory

To avoid expensive data movement: keep most operations on the device, including initialization; prefetch managed memory before it is needed; or fall back to explicit memory copies.

Prefetching: move data to the GPU before it is needed

Advise (cudaMemAdvise): establish the location where the data resides; devices where it does not reside only copy it on demand
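
A minimal sketch combining both techniques with managed memory (the kernel, the problem size, and the use of device 0 are assumptions):

__global__ void scaleKernel(int n, float *data) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    int device = 0;                                   // assumed: first GPU

    float *data;
    cudaMallocManaged(&data, bytes);                  // managed allocation, usable on host and device
    for (int i = 0; i < n; ++i) data[i] = 1.0f;       // initialized on the host

    // advise: the preferred home of this data is the GPU
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);

    // prefetch: move the data to the GPU before the kernel needs it
    cudaMemPrefetchAsync(data, bytes, device);
    scaleKernel<<<(n + 255) / 256, 256>>>(n, data);

    // prefetch the result back to the CPU before the host reads it
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(data);
}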

With explicit copies, the host (CPU) manages device (GPU) memory and copies data back and forth.

Example with unified memory:

#include <stdio.h>

__global__ void AplusB(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
    int *ret, i;
    cudaMallocManaged(&ret, 1000 * sizeof(int));        // managed memory, accessible from host and device
    AplusB<<< 1, 1000 >>>(ret, 10, 100);
    cudaDeviceSynchronize();                            // wait for the kernel before the host reads ret
    for (i = 0; i < 1000; ++i) printf("%d ", ret[i]);   // host reads the same pointer, no explicit copy
    cudaFree(ret);
}

Example with explicit copies

#include <stdio.h>
#include <stdlib.h>

__global__ void AplusB(int *ret, int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
    int *ret, i;
    cudaMalloc(&ret, 1000 * sizeof(int));               // device-only allocation
    AplusB<<< 1, 1000 >>>(ret, 10, 100);
    int *host_ret = (int *)malloc(1000 * sizeof(int));
    // explicit copy back to the host; the direction is inferred from the
    // pointers (cudaMemcpyDefault), and the copy waits for the kernel
    cudaMemcpy(host_ret, ret, 1000 * sizeof(int), cudaMemcpyDefault);
    for (i = 0; i < 1000; ++i) printf("%d ", host_ret[i]);
    free(host_ret); cudaFree(ret);
}

Memory coalescing