Nsight Compute - a primer on profiling
Every time I run ncu, the report throws an overwhelming amount of data at me, so here's a checklist for my future self on how to study a profiling report.
The only things that matter are:
- DRAM bandwidth
- Compute throughput
- Latency / occupancy
Profiling is just figuring out which one is saturated.
Step 1 — Find the expensive kernels
Open Summary → Duration and sort descending. Only analyze kernels that dominate runtime.
Rule of thumb:
- focus on kernels that contribute >5-10% of total runtime
- ignore everything else; optimization payoff scales with a kernel's share of total time
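The same filter works outside the GUI, e.g. on durations pulled from ncu's CSV export. A minimal sketch; the kernel names and durations below are made up:

```python
def hot_kernels(durations, threshold=0.05):
    """Return (kernel, time share) pairs above `threshold`,
    sorted by time share, descending."""
    total = sum(durations.values())
    shares = {k: t / total for k, t in durations.items()}
    return sorted(
        ((k, s) for k, s in shares.items() if s > threshold),
        key=lambda ks: ks[1],
        reverse=True,
    )

# hypothetical per-kernel durations in microseconds
durations = {"gemm": 900.0, "softmax": 60.0, "bias_add": 30.0, "copy": 10.0}
print(hot_kernels(durations))  # only gemm and softmax clear the 5% bar
```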
Step 2 — Identify the bottleneck
Go to GPU Speed Of Light Throughput
Look at:
- Memory Throughput
- Compute (SM) Throughput
Decision rule:
- memory-bound: Memory > 60% and Compute < 30%
- compute-bound: Compute > 60%
- latency/occupancy-bound: both < 30%
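The decision rule above, written out. The 60/30 thresholds are rules of thumb, not hard limits, so anything in between deserves a closer look at both pipes:

```python
def classify(memory_pct, compute_pct):
    """Classify a kernel from its Speed Of Light throughput percentages."""
    if memory_pct > 60 and compute_pct < 30:
        return "memory-bound"
    if compute_pct > 60:
        return "compute-bound"
    if memory_pct < 30 and compute_pct < 30:
        return "latency/occupancy-bound"
    return "mixed: inspect both pipes"

print(classify(85, 20))  # memory-bound
print(classify(15, 75))  # compute-bound
print(classify(20, 25))  # latency/occupancy-bound
```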
Step 3 — Check if the hardware is already saturated
Go to Memory Workload Analysis → Max Bandwidth
If DRAM is saturated, kernel-level tweaks will not help.
The only remaining levers are algorithmic:
- reduce memory traffic
- fuse kernels
- increase arithmetic intensity
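Why "increase arithmetic intensity" is on this list follows from a roofline argument: a kernel whose FLOP-per-byte ratio sits below the machine balance (peak FLOP/s ÷ peak DRAM bandwidth) is pinned to the bandwidth roof no matter how cleverly it is written. The peak figures below are assumptions, roughly an A100-class part:

```python
def attainable_gflops(intensity, peak_gflops, peak_gbps):
    """Roofline model: min(compute roof, bandwidth roof × intensity)."""
    return min(peak_gflops, peak_gbps * intensity)

PEAK_GFLOPS = 19500.0  # assumed FP32 peak
PEAK_GBPS = 1555.0     # assumed DRAM bandwidth

balance = PEAK_GFLOPS / PEAK_GBPS  # ~12.5 FLOP/byte

# elementwise add: 1 FLOP per 12 bytes moved (two 4-byte loads, one store)
# → far below the balance point, so it is stuck on the bandwidth roof
print(attainable_gflops(1 / 12, PEAK_GFLOPS, PEAK_GBPS))
```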
Step 4 — Check occupancy / latency hiding
Go to Occupancy
Look at:
- Achieved Occupancy
- Waves Per SM
Rules of thumb:
- Waves per SM < 2 → latency likely exposed
- Low occupancy → register/shared-mem pressure
- High occupancy → latency not the issue
If occupancy is healthy, stop worrying about block sizes.
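Waves per SM can be sanity-checked by hand: it is the grid's block count divided by how many blocks the GPU can hold resident at once. The SM count and blocks-per-SM numbers below are hypothetical, for illustration:

```python
def waves_per_sm(grid_blocks, blocks_per_sm, num_sms):
    """One 'wave' = a full set of resident blocks across every SM."""
    return grid_blocks / (blocks_per_sm * num_sms)

# hypothetical launch: 4096 blocks on a 108-SM GPU, 8 resident blocks per SM
w = waves_per_sm(4096, 8, 108)
print(f"{w:.2f} waves")  # > 2, so tail/latency effects are small
```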
Step 5 — Check memory locality
Go to Memory Workload Analysis
Look at:
- L2 Hit Rate
- L1 Hit Rate
High L2 hit → data reuse
Low L2 hit → streaming workload
Streaming kernels are typically bandwidth-bound.
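One way to see why hit rate matters: DRAM only sees the sectors that miss in L2, so effective DRAM traffic shrinks with the hit rate. A first-order sketch that ignores write-backs and prefetching:

```python
def dram_traffic_gb(requested_gb, l2_hit_rate):
    """First-order estimate: only L2 misses reach DRAM."""
    return requested_gb * (1 - l2_hit_rate)

print(dram_traffic_gb(10.0, 0.80))  # reuse-heavy kernel: ~2 GB reaches DRAM
print(dram_traffic_gb(10.0, 0.05))  # streaming kernel: ~9.5 GB reaches DRAM
```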
Step 6 — Decide the optimization strategy
Once you know the bottleneck:
- Memory bound
  - reduce global memory passes
  - fuse kernels
  - improve coalescing
  - tile into shared memory
- Compute bound
  - reduce instruction count
  - increase ILP / vectorization
  - use tensor cores / faster math
- Latency bound
  - increase occupancy
  - reduce register usage
  - increase work per thread
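A concrete instance of "reduce global memory passes / fuse kernels": computing z = a*x + b in one fused pass instead of materializing y = a*x first halves the array-sized transfers. A back-of-the-envelope byte count, assuming float32 arrays of n elements:

```python
def bytes_unfused(n, elem=4):
    # pass 1: read x, write y;  pass 2: read y, write z
    return (2 * n + 2 * n) * elem

def bytes_fused(n, elem=4):
    # single pass: read x, write z (y never touches DRAM)
    return 2 * n * elem

n = 1 << 20
print(bytes_unfused(n) / bytes_fused(n))  # fusion halves DRAM traffic here
```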
Mental model
Profiling is about answering one question: which hardware resource is saturated?
Once that is clear, the optimization path becomes obvious. Everything else in the Nsight report is supporting evidence.
Final Checklist:
- Sort by duration (in Summary). For the kernels with the highest duration:
  - determine the bound type (Speed Of Light → SM vs. memory throughput)
  - check bandwidth usage (Memory Workload Analysis → Max Bandwidth)
  - check occupancy (Occupancy → Achieved Occupancy)
  - decide whether the fix is algorithmic or kernel-level
Some cool links
- CERN Nsight Compute
- NASA perf analysis with ncu → all kinds of multi-node, multi-process commands listed there
(living-doc)