Nsight Compute - a primer on profiling
Every time I run ncu, the report throws an overwhelming amount of data at me, so here's a checklist for my future self on how to study a profiling report.
The only things that matter are:
- DRAM bandwidth
- Compute throughput
- Latency / occupancy
Profiling is just figuring out which one is saturated.
Step 1 — Find the expensive kernels
Open Summary → Duration and sort descending. Only analyze kernels that dominate runtime.
Rule of thumb:
- focus on kernels that contribute >5-10% of total runtime
- ignore everything else; optimization payoff scales with a kernel's share of total time
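The same filter works outside the GUI, e.g. on durations pulled from ncu's CSV export. A minimal sketch; the kernel names and durations below are made up:

```python
def hot_kernels(durations, threshold=0.05):
    """Return (kernel, time share) pairs above `threshold`,
    sorted by time share, descending."""
    total = sum(durations.values())
    shares = {k: t / total for k, t in durations.items()}
    return sorted(
        ((k, s) for k, s in shares.items() if s > threshold),
        key=lambda ks: ks[1],
        reverse=True,
    )

# hypothetical per-kernel durations in microseconds
durations = {"gemm": 900.0, "softmax": 60.0, "bias_add": 30.0, "copy": 10.0}
print(hot_kernels(durations))  # only gemm and softmax clear the 5% bar
```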
Step 2 — Identify the bottleneck
Go to GPU Speed Of Light Throughput
Look at:
- Memory Throughput
- Compute (SM) Throughput
Decision rule:
- memory-bound: Memory > 60% and Compute < 30%
- compute-bound: Compute > 60%
- latency/occupancy-bound: both < 30%
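The decision rule above, written out. The 60/30 thresholds are rules of thumb, not hard limits, so anything in between deserves a closer look at both pipes:

```python
def classify(memory_pct, compute_pct):
    """Classify a kernel from its Speed Of Light throughput percentages."""
    if memory_pct > 60 and compute_pct < 30:
        return "memory-bound"
    if compute_pct > 60:
        return "compute-bound"
    if memory_pct < 30 and compute_pct < 30:
        return "latency/occupancy-bound"
    return "mixed: inspect both pipes"

print(classify(85, 20))  # memory-bound
print(classify(15, 75))  # compute-bound
print(classify(20, 25))  # latency/occupancy-bound
```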
Step 3 — Check if the hardware is already saturated
Go to Memory Workload Analysis → Max Bandwidth
If DRAM is saturated, kernel-level tweaks will not help.
The only remaining levers are algorithmic:
- reduce memory traffic
- fuse kernels
- increase arithmetic intensity
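Why "increase arithmetic intensity" is on this list follows from a roofline argument: a kernel whose FLOP-per-byte ratio sits below the machine balance (peak FLOP/s ÷ peak DRAM bandwidth) is pinned to the bandwidth roof no matter how cleverly it is written. The peak figures below are assumptions, roughly an A100-class part:

```python
def attainable_gflops(intensity, peak_gflops, peak_gbps):
    """Roofline model: min(compute roof, bandwidth roof × intensity)."""
    return min(peak_gflops, peak_gbps * intensity)

PEAK_GFLOPS = 19500.0  # assumed FP32 peak
PEAK_GBPS = 1555.0     # assumed DRAM bandwidth

balance = PEAK_GFLOPS / PEAK_GBPS  # ~12.5 FLOP/byte

# elementwise add: 1 FLOP per 12 bytes moved (two 4-byte loads, one store)
# → far below the balance point, so it is stuck on the bandwidth roof
print(attainable_gflops(1 / 12, PEAK_GFLOPS, PEAK_GBPS))
```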
Step 4 — Check occupancy / latency hiding
Go to Occupancy
Look at:
- Achieved Occupancy
- Waves Per SM
Rules of thumb:
- Waves per SM < 2 → latency likely exposed
- Low occupancy → register/shared-mem pressure
- High occupancy → latency not the issue
If occupancy is healthy, stop worrying about block sizes.
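Waves per SM can be sanity-checked by hand: it is the grid's block count divided by how many blocks the GPU can hold resident at once. The SM count and blocks-per-SM numbers below are hypothetical, for illustration:

```python
def waves_per_sm(grid_blocks, blocks_per_sm, num_sms):
    """One 'wave' = a full set of resident blocks across every SM."""
    return grid_blocks / (blocks_per_sm * num_sms)

# hypothetical launch: 4096 blocks on a 108-SM GPU, 8 resident blocks per SM
w = waves_per_sm(4096, 8, 108)
print(f"{w:.2f} waves")  # > 2, so tail/latency effects are small
```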
Step 5 — Check memory locality
Go to Memory Workload Analysis
Look at:
- L2 Hit Rate
- L1 Hit Rate
High L2 hit → data reuse
Low L2 hit → streaming workload
Streaming kernels are typically bandwidth-bound.
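One way to see why hit rate matters: DRAM only sees the sectors that miss in L2, so effective DRAM traffic shrinks with the hit rate. A first-order sketch that ignores write-backs and prefetching:

```python
def dram_traffic_gb(requested_gb, l2_hit_rate):
    """First-order estimate: only L2 misses reach DRAM."""
    return requested_gb * (1 - l2_hit_rate)

print(dram_traffic_gb(10.0, 0.80))  # reuse-heavy kernel: ~2 GB reaches DRAM
print(dram_traffic_gb(10.0, 0.05))  # streaming kernel: ~9.5 GB reaches DRAM
```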
Step 6 — Decide the optimization strategy
Once you know the bottleneck:
- Memory bound
  - reduce global memory passes
  - fuse kernels
  - improve coalescing
  - tile into shared memory
- Compute bound
  - reduce instruction count
  - increase ILP / vectorization
  - use tensor cores / faster math
- Latency bound
  - increase occupancy
  - reduce register usage
  - increase work per thread
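A concrete instance of "reduce global memory passes / fuse kernels": computing z = a*x + b in one fused pass instead of materializing y = a*x first halves the array-sized transfers. A back-of-the-envelope byte count, assuming float32 arrays of n elements:

```python
def bytes_unfused(n, elem=4):
    # pass 1: read x, write y;  pass 2: read y, write z
    return (2 * n + 2 * n) * elem

def bytes_fused(n, elem=4):
    # single pass: read x, write z (y never touches DRAM)
    return 2 * n * elem

n = 1 << 20
print(bytes_unfused(n) / bytes_fused(n))  # fusion halves DRAM traffic here
```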
Mental model
Profiling is about answering one question: which hardware resource is saturated?
Once that is clear, the optimization path becomes obvious. Everything else in the Nsight report is supporting evidence.
Final Checklist:
- Sort by duration (in Summary). For the kernels with the highest duration:
  - determine the bound type (Speed Of Light → SM vs. memory throughput)
  - check bandwidth usage (Memory Workload Analysis → Max Bandwidth)
  - check occupancy (Occupancy → Achieved Occupancy)
  - decide whether the fix is algorithmic or kernel-level
Some cool links
- CERN Nsight Compute
- NASA perf analysis with ncu → all kinds of multi-node, multi-process commands listed there
(living-doc)