These are the notes I wrote while revising, sort of a summary of everything.
- `16*x` ⇒ `x << 4` (strength reduction), done at `-O1` and higher
- `restrict`: promises that memory is only accessed through one pointer, which lets the compiler vectorize (`-O2` and higher, or specific flags); `#pragma GCC ivdep` tells it to ignore assumed loop-carried dependencies
- `float A[3] __attribute__((aligned(16)))` aligns data for vector loads/stores
- `#pragma vector aligned` asserts that accesses in the following loop are aligned
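A rough sketch of how these hints fit together (the function name, array sizes, and loop body are made up; compile with `-O2` or higher):

```c
#include <stddef.h>

/* 16-byte aligned arrays so the vectorizer can use aligned loads/stores. */
int a[1024] __attribute__((aligned(16)));
int b[1024] __attribute__((aligned(16)));

/* restrict promises dst and src never alias; #pragma GCC ivdep tells the
 * compiler to ignore assumed loop-carried dependencies. */
void scale_add(int *restrict dst, const int *restrict src, size_t n)
{
#pragma GCC ivdep
    for (size_t i = 0; i < n; i++)
        dst[i] += 16 * src[i];   /* 16*x can be strength-reduced to x << 4 */
}
```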
- `#pragma omp parallel for` (combines the `parallel` and `for` pragmas); the iterations are split into `nthreads` chunks by default
- `#pragma omp parallel` starts/terminates threads, `for` doesn't – it only distributes the loop over the threads of an existing `parallel` block!
- `nowait` to get rid of implicit barriers, e.g. the enclosing `parallel` already has a barrier at its end
- the `barrier` construct creates a sync barrier where all threads wait
- `single` assigns the block to one thread, with a barrier after (unless `nowait`)
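A minimal sketch putting `parallel`, `for nowait`, `barrier`, and `single` together (the arrays and loop bodies are placeholders):

```c
#include <omp.h>
#include <stdio.h>

void demo(double *a, double *b, int n)
{
    #pragma omp parallel            /* starts the threads, barrier at the end */
    {
        #pragma omp for nowait      /* distributes iterations, nowait drops its barrier */
        for (int i = 0; i < n; i++)
            a[i] *= 2.0;

        #pragma omp barrier         /* explicit sync point: all threads wait here */

        #pragma omp single          /* one thread runs this, implicit barrier after */
        printf("phase one done (thread %d)\n", omp_get_thread_num());

        #pragma omp for             /* implicit barrier at the end of this for */
        for (int i = 0; i < n; i++)
            b[i] += a[i];
    }                               /* implicit barrier when the parallel region ends */
}
```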
- `master` assigns the block to the master thread, without a sync barrier (unless an explicit `barrier`)
- `sections`: each `section` is executed by exactly one thread, so threads execute different code
- `threadprivate` vars: accessible like globals, but each thread gets its own copy
- `copyin` initializes the threadprivate copies from the master thread's values
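A small sketch of `sections` combined with a `threadprivate` variable broadcast via `copyin` (the `seed` variable and the printed text are made up):

```c
#include <omp.h>
#include <stdio.h>

int seed = 42;                      /* global, but one copy per thread below */
#pragma omp threadprivate(seed)

void demo(void)
{
    seed = 7;                       /* master's copy, broadcast by copyin */

    #pragma omp parallel copyin(seed)
    {
        #pragma omp sections        /* each section runs on exactly one thread */
        {
            #pragma omp section
            printf("I/O section, seed = %d\n", seed);

            #pragma omp section
            printf("compute section, seed = %d\n", seed);
        }
    }
}
```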
- `#pragma omp task` spawns an async task on the block, the original thread continues
- `#pragma omp taskwait` waits for the child tasks to finish
- `OMP_PLACES="{0:4},{4:4},{8:4},{12:4}"` defines 4 places with 4 execution units each
- `OMP_WAIT_POLICY` (ACTIVE/PASSIVE) controls whether waiting threads spin or sleep
- `#pragma omp atomic read|write|update|capture`
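A sketch of tasks with `taskwait` plus an `atomic` update, here as a toy recursive sum (the function name and cutoff are made up):

```c
long total = 0;

void sum_range(const int *a, int lo, int hi)
{
    if (hi - lo < 1024) {               /* small enough: sum directly */
        long s = 0;
        for (int i = lo; i < hi; i++)
            s += a[i];
        #pragma omp atomic update       /* safe concurrent update of total */
        total += s;
        return;
    }
    int mid = lo + (hi - lo) / 2;
    #pragma omp task                    /* spawn async task, parent continues */
    sum_range(a, lo, mid);
    sum_range(a, mid, hi);
    #pragma omp taskwait                /* wait for the spawned child task */
}

void sum_all(const int *a, int n)
{
    #pragma omp parallel
    #pragma omp single                  /* one thread seeds the task tree */
    sum_range(a, 0, n);
}
```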
- `pthread_create(...)` creates a thread
- `pthread_join()` – otherwise you get zombie threads
- `pthread_attr_init()`, `pthread_attr_destroy()` set up / release thread attributes
- `pthread_mutex_trylock()` always returns immediately, giving `EBUSY` if the mutex is already locked
- `pthread_cond_wait()` blocks on a condition variable; wake one waiter with `pthread_cond_signal` or all with `pthread_cond_broadcast`
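A compact sketch of these calls: create/join, `trylock` falling back to a blocking lock, and a condition-variable handshake (the `done` flag and `worker` function are made up):

```c
#include <pthread.h>
#include <stdio.h>
#include <errno.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int done = 0;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&ready);        /* wake one waiter (broadcast wakes all) */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    if (pthread_mutex_trylock(&lock) == EBUSY)   /* returns immediately, never blocks */
        pthread_mutex_lock(&lock);               /* fall back to a blocking lock */

    while (!done)                       /* always re-check the predicate in a loop */
        pthread_cond_wait(&ready, &lock);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);              /* reap the thread, no zombies */
    return 0;
}
```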
- `__transaction_atomic { ... }` runs the block as a memory transaction (GCC transactional memory)
- `pthread_once()` runs a function at most once
- `pthread_cancel()` cancels a thread
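A minimal `pthread_once` sketch (the init function is made up):

```c
#include <pthread.h>
#include <stdio.h>

static pthread_once_t once = PTHREAD_ONCE_INIT;

static void init_once(void)            /* runs at most once, no matter how many callers */
{
    printf("initialized\n");
}

void from_any_thread(void)
{
    pthread_once(&once, init_once);     /* later calls return without running init_once */
}
```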
- `__global__` marks a kernel: runs on the device, launched from the host; `__device__` marks a function that runs on the device and is called from device code
- `kernelFunc<<<thread_blocks, threads_per_block>>>(args)` launches a kernel
- `dim3` holds up to three dimensions (x, y, z)
- `dim3 gridDim`: grid dimension in blocks; `dim3 blockDim`: block dimension in threads
- `blockIdx`: the block index in the grid; `threadIdx`: the thread index in the block
- global thread index: `(blockIdx.x * blockDim.x) + threadIdx.x`
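A sketch of the launch syntax and the index formula in a toy vector-add kernel (names and block size are made up):

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;   /* global thread index */
    if (i < n)                                         /* guard the partial last block */
        c[i] = a[i] + b[i];
}

void launch(const float *a, const float *b, float *c, int n)
{
    int threads_per_block = 256;
    int thread_blocks = (n + threads_per_block - 1) / threads_per_block;
    vecAdd<<<thread_blocks, threads_per_block>>>(a, b, c, n);
    cudaDeviceSynchronize();            /* kernel launches are asynchronous */
}
```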
- `cudaGetDeviceProperties()` queries device capabilities
- `cudaMalloc`, `cudaMemset`, `cudaFree` manage device memory
- `cudaMemcpy`: last arg is an enum stating the copy direction (e.g. `cudaMemcpyHostToDevice`)
- `cudaMemcpyAsync()` for asynchronous copies
- `cudaMallocManaged` allocates unified (managed) memory
- occupancy = (active warps) / (max active warps)
- `__constant__` variable lives in constant memory; fill it with `cudaMemcpyToSymbol`
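A sketch of a `__constant__` array filled via `cudaMemcpyToSymbol` (the coefficient table and kernel are made up):

```cuda
#include <cuda_runtime.h>

__constant__ float coeff[16];          /* lives in constant memory, cached & read-only */

__global__ void applyCoeff(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= coeff[i % 16];
}

void setup(const float *host_coeff)
{
    /* copy host data into the __constant__ symbol (not via plain cudaMemcpy) */
    cudaMemcpyToSymbol(coeff, host_coeff, 16 * sizeof(float));
}
```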
- explicit memory management: `cudaMalloc` device buffers and `cudaMemcpy` data in/out around the `__global__` kernel launch
- managed memory: `cudaMallocManaged`, no explicit copies needed; `cudaMemPrefetchAsync` prefetches managed memory to a device
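A sketch of both flows, explicit copies vs. managed memory with an optional prefetch (the kernel, names, and sizes are made up):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void explicit_version(float *host, int n)
{
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

void managed_version(int n)
{
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));            /* visible to host and device */
    int device;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(x, n * sizeof(float), device);  /* optional: avoid page faults */
    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();              /* host may touch x only after the sync */
    cudaFree(x);
}
```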
- `__shared__ type var`: shared memory, visible to all threads of a block
- `__syncthreads()`: barrier for all threads of a block
- atomic operations (e.g. `atomicAdd`)
- `cudaMemcpy` is blocking; `cudaDeviceSynchronize` waits for all device work; `__syncthreads()` for block-level sync
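A sketch of a block-level reduction with `__shared__` memory, `__syncthreads()`, and one `atomicAdd` per block (assumes a power-of-two block size of 256; names are made up):

```cuda
#include <cuda_runtime.h>

__global__ void sumAll(const float *in, float *out, int n)
{
    __shared__ float partial[256];                 /* one slot per thread in the block */
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               /* everyone has written its slot */

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                           /* block-level barrier each step */
    }

    if (tid == 0)
        atomicAdd(out, partial[0]);                /* one atomic add per block */
}
```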
- `cudaStreamCreate`, `cudaStreamDestroy`
- `cudaStreamSynchronize` waits for one stream, or wait for all streams with `cudaDeviceSynchronize`
- `cudaEventCreate`, `cudaEventDestroy`
- `cudaEventRecord` records an event into a stream
- `cudaEventSynchronize` (wait for everything before the event to happen)
- `cudaEventQuery` checks whether the event has completed, without blocking
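A sketch of a stream with events recorded around a kernel launch; `cudaEventElapsedTime` is an extra call (not in the list above) used here only for timing:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void work(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void timed_launch(float *dev_x, int n)
{
    cudaStream_t stream;
    cudaEvent_t start, stop;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);                      /* mark a point in the stream */
    work<<<(n + 255) / 256, 256, 0, stream>>>(dev_x, n); /* launch into this stream */
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop);                          /* wait for everything before 'stop' */
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel took %.3f ms\n", ms);

    cudaStreamSynchronize(stream);                       /* wait for this one stream */
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
}
```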