Intro
- A CUDA program ends with `.cu`. It's basically a C++ program, but with some special parts that can run on the GPU instead of the CPU. The `.cu` file defines the main logic of your program and also decides which parts should run on the GPU.
- The CPU is called the host, and the GPU is called the device. In CUDA, you explicitly mark which functions belong to the host and which to the device with a specifier. The ones that run on the GPU are called kernels.
- If a function doesn't have any specifier, it defaults to `__host__`, meaning it runs on the CPU as normal C++ code.
2 kinds of GPU-related functions
- `__global__` → runs on the GPU, but is called from the CPU. This is your main kernel.
- `__device__` → runs on the GPU, and can only be called by other GPU code (not by the CPU).
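Here is a minimal sketch of what the three specifiers look like in a `.cu` file (the `square` and `square_kernel` names are just illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__: runs on the GPU, callable only from other GPU code
__device__ float square(float x) {
    return x * x;
}

// __global__: a kernel, runs on the GPU but is launched from CPU code
__global__ void square_kernel(float* data) {
    data[threadIdx.x] = square(data[threadIdx.x]);
}

// No specifier (implicitly __host__): ordinary C++ running on the CPU
int main() {
    float h[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float* d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    square_kernel<<<1, 8>>>(d);   // launch syntax is covered later in these notes

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%.0f\n", h[7]);       // prints 49
    cudaFree(d);
    return 0;
}
```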
CUDA’s Execution Structure
- When we launch a function that’s meant to run on the GPU, CUDA sets up a logical structure called a grid.
- The grid is made up of thread blocks, and each block contains multiple threads which are the actual workers that do the computation.
- Each block runs independently and is scheduled onto a GPU Streaming Multiprocessor (SM), which is roughly like a core, except that each SM can run multiple blocks at once.
CUDA’s built-in variables to understand this structure
- `gridDim` → the dimensions of the grid
- `blockIdx` → which block this thread belongs to
- `blockDim` → the dimensions of one block
- `threadIdx` → the position (ID) of this thread within its block
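As a small sketch (the kernel name `whoami` is made up), a kernel can read these variables directly:

```cuda
#include <cstdio>

__global__ void whoami() {
    printf("block (%d,%d) of (%d,%d), thread (%d,%d) of (%d,%d)\n",
           blockIdx.x, blockIdx.y, gridDim.x, gridDim.y,
           threadIdx.x, threadIdx.y, blockDim.x, blockDim.y);
}

int main() {
    whoami<<<dim3(2, 2), dim3(4, 2)>>>();   // 4 blocks, 8 threads each
    cudaDeviceSynchronize();                // wait for the kernel (and its printf) to finish
    return 0;
}
```

Running this prints one line per thread; as discussed further below, the order of those lines is not deterministic.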
Warp
- Each block contains multiple warps.
- Each warp always has 32 threads, even if you only use part of it. If a block's thread count isn't a multiple of 32, the leftover lanes in its last warp just stay idle, which wastes resources.
Aim for better performance
We should aim for high warp occupancy, since warps are the actual execution units on the GPU.
Each block also carries scheduling overhead (hardware initialization, resource allocation, SM scheduling), so launching an excessive number of small blocks hurts performance.
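A common rule of thumb, sketched below with an assumed problem size `N`, is to pick a block size that is a multiple of 32 (so no warp lanes sit idle) and derive the grid size with a ceiling division:

```cuda
#include <cstdio>

int main() {
    int N = 1000000;             // assumed problem size
    int threadsPerBlock = 256;   // a multiple of 32, so no warp lanes sit idle
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    // A reasonably large block size means fewer blocks and less scheduling overhead,
    // while still giving each SM enough warps to keep busy.
    printf("%d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    return 0;
}
```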
Order of execution
The GPU fires multiple blocks in parallel, so the threads don't finish in a sequential order. Each warp runs its 32 threads as one logical unit, so their outputs often look sequential because they execute together, but this is not guaranteed: `printf` output from different warps (and even from threads within a warp) can still interleave.
Dimensions of Work
- Blocks can be 1D, 2D, or 3D, depending on the problem we’re solving:
- 1D block → good for simple lists or arrays
- 2D block → great for images (rows × columns)
- 3D block → used for 3D volumes or cubes
- This dimensional flexibility helps map threads directly to data structures (like pixels or voxels) in a natural way.
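For instance, a 2D launch maps naturally onto an image; the `invert` kernel and the 1920x1080 size below are only illustrative:

```cuda
#include <cuda_runtime.h>

// Each thread handles one pixel of a width x height grayscale image.
__global__ void invert(unsigned char* img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height) {                   // guard against threads outside the image
        int idx = y * width + x;                     // row-major pixel index
        img[idx] = 255 - img[idx];
    }
}

int main() {
    int width = 1920, height = 1080;
    unsigned char* d_img;
    cudaMalloc(&d_img, width * height);
    cudaMemset(d_img, 0, width * height);

    dim3 block(16, 16);                                   // 256 threads per block
    dim3 grid((width + 15) / 16, (height + 15) / 16);     // enough blocks to cover the image
    invert<<<grid, block>>>(d_img, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    return 0;
}
```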
Defining Grid and Block Layout
```cuda
dim3 threadsPerBlock(4, 4, 4); // 64 threads per block
dim3 blocksPerGrid(8, 4);      // 32 blocks in total
```

- Each thread has its own mini coordinate system: `(x, y)` for 2D, `(x, y, z)` for 3D. These coordinates help each thread figure out which piece of data it should handle.
Launching the Kernel
```cuda
kernel_name<<< Dg, Db, Ns, S >>>(arguments);
```

- `kernel_name` → the GPU function we're executing
- `Dg` → grid dimensions (`dim3`)
- `Db` → block dimensions (`dim3`)
- `Ns` → optional shared memory size per block (in bytes)
- `S` → optional CUDA stream (acts like a command queue that allows overlapping of computation and data transfer)
- If you don't specify `Ns` or `S`, CUDA just uses the defaults (`0` and the default stream).
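As a hedged example (the `scale` kernel and the sizes are assumptions, not from the notes above), here is the short launch form next to the full form with an explicit stream:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    int n = 1 << 20;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    // Short form: Ns and S default to 0 bytes of shared memory and the default stream.
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, n, 2.0f);

    // Full form: 0 bytes of dynamic shared memory, launched on an explicit stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    scale<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(d_data, n, 0.5f);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```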
Thread Identification
```cuda
// Combine threadIdx and blockIdx to calculate a global index
int global_x = blockIdx.x * blockDim.x + threadIdx.x;
```

- Each thread knows its own position using built-in CUDA variables:
  - `threadIdx` → its position within the block
  - `blockIdx` → its block's position within the grid
- This lets every thread figure out exactly which data element it should process.
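Putting these pieces together, a minimal (illustrative) vector-add kernel shows each thread using its global index to pick exactly one element, with a bounds check for the partially filled last block:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n)                                      // last block may have extra threads
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1000;
    float h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;             // 4 blocks for n = 1000
    vec_add<<<grid, block>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[999] = %f\n", h_c[999]);              // expect 999 + 1998 = 2997
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```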
CUDA Compilation
- To compile CUDA code, we use `nvcc`, NVIDIA's CUDA compiler.
- Not all NVIDIA GPUs support the same features. Each GPU generation has a Compute Capability (CC) number, like `7.0`, `8.0`, or `9.0`.
- This number tells CUDA what features or instructions the GPU supports.
Compile with the desired compute capability
If you don’t tell CUDA which GPU to target, it may default to an older architecture. You can specify the target architecture explicitly using flags like:
- `-arch=sm_70` → compile for a GPU with Compute Capability 7.0
- `-arch=native` → automatically match the GPU where `nvcc` is running
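For example, to target a Compute Capability 7.0 GPU (such as a V100) you might compile a file like `main.cu` (file name just an example) with `nvcc -arch=sm_70 main.cu -o main`; on a recent toolkit, `nvcc -arch=native main.cu -o main` builds for whatever GPU is in the local machine.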
Multi-Instance GPU
- Multi-Instance GPU (MIG) is a feature introduced on NVIDIA's A100 and H100 GPUs.
- It lets you split one physical GPU into multiple isolated "slices", so different users or jobs can safely share the same GPU without interfering with each other. Each instance behaves like a separate GPU.
