Memory Allocation
    __device__ int result[10];

- The `result` array is stored directly in the GPU's global memory.
- It is not created per block; there is only one global copy shared by all blocks and threads (see the sketch after this list).
- `cudaFree(pointer to GPU memory)` frees the GPU memory previously allocated with `cudaMalloc()`, to avoid memory leaks.
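Below is a minimal sketch of the single-copy behaviour. The kernel name `fill_result`, the launch configuration, and the squared values are illustrative assumptions, not from the notes; the read-back uses `cudaMemcpyFromSymbol()`, which is covered under Static Memory Allocation below.

```cuda
#include <cstdio>

__device__ int result[10];   // one copy in global memory, shared by all blocks

// Each thread fills one slot of the single shared array.
__global__ void fill_result()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 10)
        result[i] = i * i;
}

int main()
{
    fill_result<<<2, 5>>>();   // 2 blocks x 5 threads cover the 10 slots
    cudaDeviceSynchronize();

    int host_copy[10];
    // __device__ variables are read back with cudaMemcpyFromSymbol (see below),
    // not with a plain cudaMemcpy on a pointer.
    cudaMemcpyFromSymbol(host_copy, result, sizeof(host_copy));

    for (int i = 0; i < 10; ++i)
        printf("%d ", host_copy[i]);
    printf("\n");
    return 0;
}
```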
Dynamic Memory Allocation
- `cudaMalloc(pointer to the memory pointer, number of bytes to allocate)` allocates memory on the GPU. The first argument is a pointer to your pointer so that the caller's variable gets updated with the address of the allocated GPU memory.
- `cudaMemcpy(destination, source, size, direction)` copies data between the CPU and GPU (see the end-to-end sketch after this list).
  - `cudaMemcpyHostToDevice`: CPU → GPU
  - `cudaMemcpyDeviceToHost`: GPU → CPU
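A minimal end-to-end sketch of the dynamic-allocation workflow: allocate with `cudaMalloc()`, copy in, launch, copy back, and free. The kernel `scale`, the array size, and the doubling are illustrative assumptions, not from the notes.

```cuda
#include <cstdio>

// Doubles each element of an array that lives in GPU global memory.
__global__ void scale(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;
}

int main()
{
    const int n = 8;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));                        // &dev: pointer to the pointer
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    scale<<<1, n>>>(dev, n);

    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);                                            // avoid a memory leak

    for (int i = 0; i < n; ++i) printf("%d ", host[i]);
    printf("\n");
    return 0;
}
```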
Static Memory Allocation
    __device__ int result;   // lives in GPU global memory

    int start = 5;
    cudaMemcpyToSymbol(result, &start, sizeof(start));

- This means: “Take the value `5` from the CPU variable `start` and store it inside the GPU's global variable `result`.”
- `cudaMemcpyToSymbol()` and `cudaMemcpyFromSymbol()` are the special functions used to move data between the CPU (host) and GPU global-memory variables (those declared with `__device__` or `__constant__`).
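A minimal round-trip sketch: write the host value into the `__device__` variable, modify it on the GPU, then read it back with `cudaMemcpyFromSymbol()`. The kernel `add_ten` and the value it adds are illustrative assumptions.

```cuda
#include <cstdio>

__device__ int result;           // static allocation in GPU global memory

// Hypothetical kernel that updates the __device__ variable on the GPU.
__global__ void add_ten() { result += 10; }

int main()
{
    int start = 5;
    cudaMemcpyToSymbol(result, &start, sizeof(start));    // host -> device global variable

    add_ten<<<1, 1>>>();
    cudaDeviceSynchronize();

    int back = 0;
    cudaMemcpyFromSymbol(&back, result, sizeof(back));    // device global variable -> host
    printf("result = %d\n", back);                        // expected: 15
    return 0;
}
```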
CUDA Memory Management
- `__device__` basically tells CUDA to allocate memory in the GPU's VRAM (global memory). The CPU can't directly access this memory; it lives entirely on the device side. So if you want to read or write it from the CPU, you must explicitly copy data across, using `cudaMemcpy()` for `cudaMalloc()`ed buffers or `cudaMemcpyToSymbol()`/`cudaMemcpyFromSymbol()` for `__device__` variables.
- To make life easier, CUDA provides Unified Memory, through `__managed__` or `cudaMallocManaged()`. This gives you a single variable or pointer that both CPU and GPU can “see”, so you can use it on either side without doing manual copies (see the sketch after this list).
- But here's the catch: under the hood, the data still lives in two separate physical places: CPU RAM and GPU VRAM. CUDA just moves (or “pages”) the data automatically when one side needs it. This automation is convenient but can cause performance inefficiency, because you no longer control when or how often data gets transferred.
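A minimal sketch of the same vector update written with Unified Memory; no explicit `cudaMemcpy()` calls are needed. The kernel `scale`, the sizes, and the single-GPU assumption are illustrative, not from the notes.

```cuda
#include <cstdio>

__global__ void scale(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;
}

int main()
{
    const int n = 8;
    int *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));   // one pointer visible to both CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = i;     // written on the CPU, no copy call

    scale<<<1, n>>>(data, n);                    // pages migrate to the GPU on demand
    cudaDeviceSynchronize();                     // wait before touching the data on the CPU again

    for (int i = 0; i < n; ++i) printf("%d ", data[i]);
    printf("\n");

    cudaFree(data);
    return 0;
}
```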
When Unified Memory is okay
- You’re prototyping or learning CUDA.
- Your program has small or infrequent data transfers.
- The GPU and CPU share a fast interconnect.
- You want code portability (same code can run on devices with or without discrete GPUs).
CUDA Memory Hierarchy
| Memory Type | Scope / Who Uses It | Location (Hardware) | Speed (Latency) | Bandwidth (Example: H100) | Size (Typical) | Access Type | Declared As | Notes / Analogy |
|---|---|---|---|---|---|---|---|---|
| Registers (RMEM) | Per Thread | Inside each SM (SRAM) | ⚡ ~1 cycle (fastest) | ~124 TB/s | ~256 KB per SM (~33 MiB total) | R/W | Compiler decides (local vars) | Thread’s private brain 🧠 — fastest but tiny. |
| Shared Memory (SMEM) | Per Block | Inside SM (SRAM) | ⚡ ~20–30 cycles | ~31 TB/s | ~256 KB per SM | R/W | __shared__ float x[]; | Whiteboard for threads in the same block 🧑🏫. |
| L1 Cache | Per SM | On-chip SRAM | ⚡ ~20–30 cycles | ~31 TB/s | ~256 KB per SM | Auto | Managed by hardware | Stores recent global loads. Merged with SMEM in modern GPUs. |
| Distributed Shared Memory (DSMEM) | Per GPC (cluster of SMs) | On-chip SRAM | ⚡ ~30–40 cycles | — | ~3.5 MiB total (example) | R/W | Configured at kernel launch | Shared scratchpad between nearby SMs. |
| L2 Cache | Whole GPU | On-chip SRAM | 🐢 ~200 cycles | ~12–14 TB/s | ~50 MiB | Auto | Managed by hardware | Shared among all SMs — like a GPU-wide cache. |
| Local Memory | Per Thread | In global VRAM (off-chip DRAM) | 🐢 ~400–500 cycles | ~3 TB/s | Depends (spilled vars) | R/W | float x; (auto-managed) | Pretends to be “local” but actually slow — used when registers run out. |
| Global / Device Memory (VRAM / HBM) | Whole GPU | Off-chip DRAM | 🐌 ~500 cycles (slowest) | ~3 TB/s | 40–80 GB | R/W | __device__ float x[]; or via cudaMalloc() | Warehouse of data 📦 — huge but slow to access. |
| Constant Memory | Whole GPU | Cached in L1/L2 | ⚙️ ~1 cycle (when cached) | — | 64 KB limit | Read-only | __constant__ float x; | For constants shared by all threads (like configs). |
| Texture Memory | Whole GPU | Cached in special unit | 🐢 slow (optimized reads) | — | — | Read-only | texture<float> x; (legacy) or cudaTextureObject_t | For images, 2D/3D spatial data — supports interpolation. |
- Registers and shared memory = top-tier performance targets.
- Global / constant / texture memory = for large or read-only data.
- Good CUDA kernels minimize trips to global memory.
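As a concrete illustration of the last point, here is a minimal sketch of a block-wide sum that stages data in `__shared__` memory, so each thread reads global memory once on the way in and each block writes it once on the way out. The kernel `block_sum`, the block size, and the tree-reduction pattern are illustrative assumptions, not from the notes.

```cuda
#define BLOCK 256   // launch with blockDim.x == BLOCK (a power of two)

// Each block sums BLOCK input elements; the intermediate work stays on-chip.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // one global read per thread
    __syncthreads();

    // Tree reduction inside the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                // one global write per block
}

// Usage sketch: block_sum<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
```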
CUDA Thread Indexing
- The cheatsheet below helps because computer memory is laid out linearly, while GPUs often model real-world problems like 2D images or 3D objects. Proper thread indexing ensures that this linear memory maps correctly to those multidimensional problems, preventing corrupted results and redundant computations.
printf("Grid: (%d,%d,%d), Block: (%d,%d,%d), BlockIdx: (%d,%d,%d), ThreadIdx: (%d,%d,%d), globle index: %d\n",
gridDim.x, gridDim.y, gridDim.z,
blockDim.x, blockDim.y, blockDim.z,
blockIdx.x, blockIdx.y, blockIdx.z,
threadIdx.x, threadIdx.y, threadIdx.z,
threadId);- The above print code snippet is useful to observe which block and thread process what
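The snippet above assumes `threadId` has already been computed. A minimal sketch of a complete kernel and launch, assuming a 3D grid of 3D blocks (the kernel name `whoami` and the dimensions chosen here are arbitrary), showing one common way to linearize the indices with x varying fastest; the printf prints a subset of the fields above:

```cuda
#include <cstdio>

__global__ void whoami()
{
    // Linearize the block position within the grid (x fastest, then y, then z).
    int blockId = blockIdx.x
                + blockIdx.y * gridDim.x
                + blockIdx.z * gridDim.x * gridDim.y;

    // Linearize the thread position within its block, then offset by the block.
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int threadId = blockId * threadsPerBlock
                 + threadIdx.z * blockDim.x * blockDim.y
                 + threadIdx.y * blockDim.x
                 + threadIdx.x;

    printf("BlockIdx: (%d,%d,%d), ThreadIdx: (%d,%d,%d), global index: %d\n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           threadIdx.x, threadIdx.y, threadIdx.z,
           threadId);
}

int main()
{
    dim3 grid(2, 2, 1);    // 4 blocks
    dim3 block(2, 2, 2);   // 8 threads per block -> 32 threads total
    whoami<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}
```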
