These are the notes I wrote while revising, sort of a summary of everything.
16*x ⇒ x << 4 (strength reduction)
-O1 and higher
restrict if memory is only accessed through one pointer
-O2 and higher or specific flags
#pragma GCC ivdep
float A[3] __attribute__((aligned(16)))
#pragma vector aligned
#pragma omp parallel for (combines the parallel and for pragmas)
nthreads chunks
#pragma omp parallel starts/terminates threads, for doesn’t
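A minimal C sketch of the compiler hints at the top of these notes (strength reduction, restrict, #pragma GCC ivdep, the aligned attribute), assuming GCC or Clang. The names `coeffs`, `coeffs_aligned` and `scale16` are made up for illustration. The shift is written by hand only to show the equivalence; an optimizing compiler would normally do it on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-byte aligned array (GCC/Clang attribute syntax; the name is illustrative). */
static float coeffs[4] __attribute__((aligned(16)));

/* Returns 1 if coeffs starts on a 16-byte boundary, 0 otherwise. */
int coeffs_aligned(void) {
    return ((uintptr_t)coeffs % 16) == 0;
}

/* out[i] = 16 * in[i], written as a shift (strength reduction by hand).
   restrict promises the compiler that out and in never overlap.
   unsigned is used so the shift is always well defined. */
void scale16(unsigned *restrict out, const unsigned *restrict in, size_t n) {
    /* GCC-specific hint: assume no loop-carried dependencies.
       Other compilers may ignore the pragma. */
#pragma GCC ivdep
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] << 4;
    }
}
```

Calling `scale16(out, in, n)` with overlapping buffers would break the restrict promise, so the caller has to guarantee that they are distinct.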
parallel blocks!
nowait removes the implicit barrier at the end of a construct, e.g. when the enclosing parallel already ends with a barrier
barrier construct creates sync barrier where all threads wait
single assigns block to one thread, with barrier after (unless nowait)
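The barrier rules above (nowait, the implicit barrier after for, single) can be sketched in one parallel region. This is only an illustration: `fill_two` and its parameters are hypothetical names, and the code gives the same result when compiled without OpenMP, because the pragmas are then ignored.

```c
/* Fills a[i] = i and b[i] = 2 * i, then sets *done exactly once. */
void fill_two(int *a, int *b, int n, int *done) {
#pragma omp parallel
    {
        /* nowait: threads may move on to the second loop immediately.
           Safe here because the two loops touch different arrays. */
#pragma omp for nowait
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }

        /* No nowait: implicit barrier at the end of this loop. */
#pragma omp for
        for (int i = 0; i < n; i++) {
            b[i] = 2 * i;
        }

        /* Exactly one thread runs this block; implicit barrier afterwards. */
#pragma omp single
        {
            *done = 1;
        }
    }
}
```

Both loops sit inside a single parallel region, so the threads are started once rather than once per loop.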
master assigns block to the master thread, without sync barrier (unless an explicit barrier is added)
sections: each section executed by exactly one thread, threads execute different code
threadprivate vars: accessible like global, different for each thread
copyin
#pragma omp task spawns async task on block, original thread continues
#pragma omp taskwait
OMP_PLACES = "{0:4}, {4:4}, {8:4}, {12:4}" defines 4 places with 4 execution units each
OMP_WAIT_POLICY
#pragma omp atomic read|write|update|capture
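A small sketch combining task, taskwait and atomic from the list above, using the usual recursive Fibonacci toy example. `fib`, `fib_tasks`, `fib_calls` and the counter `calls` are hypothetical names. Spawning a task per call is far too fine-grained to be fast; the point is only the structure. Without OpenMP the pragmas are ignored and the code runs serially with the same result.

```c
/* Shared counter of fib() invocations, updated atomically. */
static long calls = 0;

static long fib(int n) {
#pragma omp atomic update
    calls++;

    if (n < 2) {
        return n;
    }
    long a = 0, b = 0;
    /* Each recursive call becomes a task; the spawning thread continues. */
#pragma omp task shared(a)
    a = fib(n - 1);
#pragma omp task shared(b)
    b = fib(n - 2);
    /* Wait for both child tasks before reading a and b. */
#pragma omp taskwait
    return a + b;
}

/* One thread creates the root task tree; the others pick up tasks. */
long fib_tasks(int n) {
    long result = 0;
    calls = 0;
#pragma omp parallel
    {
#pragma omp single
        result = fib(n);
    }
    return result;
}

/* Number of fib() invocations made by the last fib_tasks() call. */
long fib_calls(void) {
    return calls;
}
```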
pthread_create(...)
pthread_join() – otherwise you get zombies
pthread_attr_init(), pthread_attr_destroy()
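A sketch of the create/join pattern with a default attribute object, summing an array in two halves. The `slice` struct and the functions `sum_slice` and `sum_two_threads` are made-up names for illustration; on older toolchains this probably needs `-pthread` when compiling.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work item: one slice of the input and its partial sum. */
struct slice {
    const int *data;
    size_t len;
    long sum;
};

/* Thread entry point: sums one slice. */
static void *sum_slice(void *arg) {
    struct slice *s = arg;
    s->sum = 0;
    for (size_t i = 0; i < s->len; i++) {
        s->sum += s->data[i];
    }
    return NULL;
}

long sum_two_threads(const int *data, size_t n) {
    struct slice slices[2] = {
        { data, n / 2, 0 },
        { data + n / 2, n - n / 2, 0 },
    };
    pthread_t tid[2];
    int started[2] = { 0, 0 };
    pthread_attr_t attr;
    int have_attr = (pthread_attr_init(&attr) == 0);

    for (int k = 0; k < 2; k++) {
        if (pthread_create(&tid[k], have_attr ? &attr : NULL,
                           sum_slice, &slices[k]) == 0) {
            started[k] = 1;
        } else {
            sum_slice(&slices[k]); /* fall back to doing the work inline */
        }
    }
    if (have_attr) {
        pthread_attr_destroy(&attr);
    }

    /* Join every started thread so its resources are released. */
    for (int k = 0; k < 2; k++) {
        if (started[k]) {
            pthread_join(tid[k], NULL);
        }
    }
    return slices[0].sum + slices[1].sum;
}
```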
trylock() always returns immediately, with EBUSY if already locked
pthread_cond_wait()
wake one waiter with pthread_cond_signal or all with pthread_cond_broadcast
__transaction_atomic { ... }
pthread_once() runs function at most once
pthread_cancel() cancels threads
__global__
__device__
kernelFunc<<<thread_blocks, threads_per_block>>>(args)
dim3
dim3 gridDim: grid dimension in blocks
dim3 blockDim: block dimension in threads
uint3 blockIdx: the block index in grid
uint3 threadIdx: the thread index in block
(blockIdx.x * blockDim.x) + threadIdx.x
cudaGetDeviceProperties()
cudaMalloc, cudaMemset, cudaFree
cudaMemcpy, last arg is enum stating copy direction
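To check the index formula (blockIdx.x * blockDim.x) + threadIdx.x by hand, here is a CPU-only sketch in plain C, not CUDA. It walks over every (block, thread) pair of a 1D launch serially. All names are hypothetical, and the ceiling division and the `i < n` guard are the usual idiom rather than something stated in these notes.

```c
#include <stddef.h>

/* Same arithmetic as (blockIdx.x * blockDim.x) + threadIdx.x. */
size_t global_index(size_t block_idx, size_t block_dim, size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

/* Smallest number of blocks with blocks * block_dim >= n (assumes block_dim > 0). */
size_t blocks_for(size_t n, size_t block_dim) {
    return (n + block_dim - 1) / block_dim;
}

/* Serial stand-in for a 1D kernel launch: counts how often each of the
   n elements is touched. Indices past the end are skipped by the guard. */
void emulate_launch(size_t grid_dim, size_t block_dim, int *hits, size_t n) {
    for (size_t b = 0; b < grid_dim; b++) {
        for (size_t t = 0; t < block_dim; t++) {
            size_t i = global_index(b, block_dim, t);
            if (i < n) {
                hits[i] += 1;
            }
        }
    }
}
```

With n = 10 and 4 threads per block, `blocks_for` gives 3 blocks, so 12 index values are generated and the guard drops the last two.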
cudaMemcpyAsync()
cudaMallocManaged
occupancy: (active warps)/(max active warps)
__constant__ variable
cudaMemcpyToSymbol
__global__
cudaMalloc and cudaMemcpy
cudaMallocManaged, no explicit copies needed
cudaMemPrefetchAsync
__shared__ type var
__syncthreads()
atomicAdd
cudaMemcpy is blocking
cudaDeviceSynchronize
__syncthreads() for block-level sync
cudaStreamCreate, cudaStreamDestroy
wait for one stream (cudaStreamSynchronize) or all streams (cudaDeviceSynchronize)
cudaEventCreate, cudaEventDestroy
cudaEventRecord
cudaEventSynchronize (wait for everything before event to happen)
cudaEventQuery