How to Profile Linux Performance
Learn to profile Linux performance using the USE method, perf stat, perf record, flame graphs, and scheduler tracing to systematically find CPU, memory, and I/O bottlenecks.
Before you start
- Root or sudo access on the target system
- A kernel version of 5.10 or newer for full perf feature support
- The workload or process to profile must be reproducible under load
- Basic familiarity with reading terminal output and process management
Performance problems on Linux rarely announce themselves clearly. A server is slow, a process burns CPU, latency spikes unpredictably — finding the actual cause requires a disciplined methodology and the right tooling. This guide covers the USE method as a diagnostic framework, perf as the primary low-level profiler, and a practical workflow for tracking bottlenecks from high-level symptoms down to specific code paths or resources.
The USE Method
Coined by Brendan Gregg, the USE method gives you a consistent checklist before you start grabbing random tools. For every significant resource — CPUs, memory, disks, network interfaces, bus interconnects — ask three questions:
- Utilization: what percentage of the resource's capacity is in use?
- Saturation: is work queuing up because the resource is fully busy?
- Errors: are there error counts incrementing on this resource?
Work through every resource systematically before diving deep into any one area. This prevents the common trap of profiling CPU for hours when the real bottleneck is disk I/O wait or network receive drops.
Quick System-Wide Baseline
Before touching perf, collect a fast snapshot across all major resources.
CPU utilization and load
vmstat 1 5
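The us, sy, and wa columns show the percentage of CPU time spent in user code, kernel code, and waiting on I/O during each one-second interval; the first row reports averages since boot, so read from the second row on. High wa points to storage (including network filesystems) rather than CPU, and an r column that stays above the CPU count means runnable tasks are queuing for CPU.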
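As a minimal sketch of reading that output mechanically, the function below flags rows where the run queue exceeds the CPU count or I/O wait exceeds a limit. The name vmstat_flags and the 10% default limit are assumptions for illustration, not established thresholds.

```shell
# Flag saturation hints in `vmstat` output read from stdin.
# Columns are looked up by header name because vmstat layouts vary.
# Arguments (both illustrative): CPU count, iowait limit in percent.
vmstat_flags() {
    ncpu="${1:-$(nproc)}"
    wa_limit="${2:-10}"
    awk -v ncpu="$ncpu" -v wa_limit="$wa_limit" '
        # Header row: remember which field holds each column.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows start with a number; the banner row above the header does not.
        ("r" in col) && $1 ~ /^[0-9]+$/ {
            if ($(col["r"]) > ncpu)
                print "run queue " $(col["r"]) " exceeds " ncpu " CPUs"
            if ($(col["wa"]) > wa_limit)
                print "iowait " $(col["wa"]) "% exceeds " wa_limit "%"
        }'
}
# Usage: vmstat 1 5 | vmstat_flags
```

The first data row from vmstat is an average since boot, so a flag on that row alone says little about the current load.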
Per-resource snapshot with dstat
dstat -cdngy 1 10
Shows CPU, disk, network, page, and system stats in one view. Install it via apt install dstat, dnf install dstat, or pacman -S dstat.
Memory saturation
free -m
cat /proc/vmstat | grep -E 'pswp|pgmajfault'
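A pswpout value that keeps rising between two reads means the kernel is actively swapping, which is a hard saturation signal; the counters are cumulative since boot, so a non-zero value alone only shows that swapping happened at some point. A rising major fault count (pgmajfault) indicates pages being pulled from disk.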
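Because these counters only ever grow, comparing two snapshots is one way to see current activity. The sketch below assumes two saved copies of /proc/vmstat; the function name vmstat_delta and the snapshot paths are hypothetical.

```shell
# Print how much selected /proc/vmstat counters grew between two snapshots.
# The counters are cumulative since boot, so only the difference matters.
vmstat_delta() {
    before="$1"; after="$2"
    awk '
        $1 ~ /^(pswpin|pswpout|pgmajfault)$/ {
            if (NR == FNR) first[$1] = $2      # first file: remember the baseline
            else print $1, $2 - first[$1]      # second file: print the growth
        }' "$before" "$after"
}
# Usage (snapshot file names are arbitrary):
#   cat /proc/vmstat > /tmp/vm.before; sleep 10; cat /proc/vmstat > /tmp/vm.after
#   vmstat_delta /tmp/vm.before /tmp/vm.after
```

A pswpout delta of zero over a representative interval suggests swap is not the current bottleneck, even if the absolute counter is large.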
Disk errors and saturation
iostat -xz 1 5
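Watch %util (utilization), await (average I/O latency in ms; recent sysstat versions report it separately as r_await and w_await), and aqu-sz (average queue length, called avgqu-sz in older versions), which is the saturation signal. An await above 20 ms on an SSD or above 100 ms on a spinning disk warrants investigation. The -z flag suppresses idle devices.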
Network errors
ip -s link
cat /proc/net/dev
Look at the error and drop counters per interface. Incrementing values while the system is under load are a USE-method error hit.
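The counters in /proc/net/dev can be filtered with a few lines of awk. This is a sketch that assumes the usual layout of that file (two header lines, then eight receive fields followed by eight transmit fields per interface); net_errors is a hypothetical helper name.

```shell
# List interfaces with non-zero error or drop counters, read from a file in
# /proc/net/dev format (defaults to the live file).
net_errors() {
    awk '
        NR <= 2 { next }                  # skip the two header lines
        {
            sub(/:/, " ")                 # "eth0:123" -> "eth0 123"
            # After the interface name: 8 receive fields, then 8 transmit fields.
            rx_errs = $4; rx_drop = $5; tx_errs = $12; tx_drop = $13
            if (rx_errs + rx_drop + tx_errs + tx_drop > 0)
                print $1, "rx_errs=" rx_errs, "rx_drop=" rx_drop, "tx_errs=" tx_errs, "tx_drop=" tx_drop
        }' "${1:-/proc/net/dev}"
}
# Usage: net_errors            # live counters
#        net_errors saved.txt  # a saved copy
```

The values are cumulative, so run it twice under load and compare: only counters that are still growing count as a current error signal.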
Installing and Configuring perf
perf is the Linux kernel's built-in profiler. It uses hardware performance counters (PMU), kernel tracepoints, and software events. You need the version that matches your running kernel.
Debian / Ubuntu
sudo apt install linux-tools-$(uname -r) linux-tools-common
Fedora / RHEL / Rocky
sudo dnf install perf
Arch Linux
sudo pacman -S perf
Verify it works against the running kernel:
perf --version
perf stat ls
By default, unprivileged users are restricted from profiling kernel code and other users' processes. To allow broader profiling during your session (revert after), temporarily relax the paranoid level:
sudo sysctl -w kernel.perf_event_paranoid=1
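A value of 1 lets unprivileged users profile both user and kernel code in their own processes; -1 removes all restrictions, including system-wide profiling and raw tracepoint access. Restore the previous value when done (2 is the upstream kernel default, and some distributions ship a stricter one). Do not set this persistently on production systems unless you understand the security implications.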
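One way to make the revert automatic is to save the current level and restore it after the profiling command exits. The sketch below is illustrative: describe_paranoid and with_relaxed_paranoid are hypothetical helper names, and the level descriptions paraphrase the kernel's sysctl documentation.

```shell
# Describe a kernel.perf_event_paranoid level. Levels above 2 are
# distribution-specific and reported generically.
describe_paranoid() {
    case "$1" in
        -1) echo "no restrictions for any user" ;;
        0)  echo "raw tracepoint access needs privilege" ;;
        1)  echo "CPU-wide event access also needs privilege" ;;
        2)  echo "kernel profiling also needs privilege" ;;
        *)  echo "distribution-specific or unknown level" ;;
    esac
}

# Run a command with a temporarily relaxed level, then restore the old value.
# Requires sudo; the old value is not restored if the shell itself is killed.
with_relaxed_paranoid() {
    level="$1"; shift
    prev=$(cat /proc/sys/kernel/perf_event_paranoid)
    sudo sysctl -w kernel.perf_event_paranoid="$level" >/dev/null
    "$@"; status=$?
    sudo sysctl -w kernel.perf_event_paranoid="$prev" >/dev/null
    return "$status"
}
# Usage: with_relaxed_paranoid 1 perf record -F 99 -g -- ./your-binary
```

Restoring the saved value rather than a hard-coded 2 keeps a stricter distribution default intact.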
CPU Profiling with perf stat and perf record
Counting events on a command
perf stat -d ./your-binary arg1 arg2
The -d flag adds L1 data cache and last-level cache load and miss counters on top of the default set, which already includes branch misses. The output includes elapsed time, task-clock, context switches, page faults, cycles, instructions, and the IPC (instructions per cycle) ratio. An IPC below 1.0 on a modern out-of-order CPU often signals stalled, typically memory-bound, execution.
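For scripted checks, the IPC ratio can be derived from perf stat's CSV mode (-x,). The sketch below assumes the field order given in the perf-stat man page (value, unit, event name, then further fields); ipc_from_csv is a hypothetical helper name.

```shell
# Compute instructions per cycle from `perf stat -x,` CSV lines on stdin.
# Rows whose value is not a plain number (e.g. "<not counted>") are ignored.
ipc_from_csv() {
    awk -F, '
        $3 ~ /^cycles/       && $1 ~ /^[0-9]+$/ { cycles = $1 }
        $3 ~ /^instructions/ && $1 ~ /^[0-9]+$/ { insns  = $1 }
        END {
            if (cycles > 0) printf "IPC %.2f\n", insns / cycles
            else            print "IPC unavailable (no cycles count)"
        }'
}
# Usage (perf stat writes its counters to stderr):
#   perf stat -x, -e cycles,instructions ./your-binary 2>&1 >/dev/null | ipc_from_csv
```

Event names that do not start with cycles or instructions, such as the per-core-type names some hybrid CPUs report, would need the patterns adjusted.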
Attaching to a running process
perf stat -p $(pgrep -n nginx) -d sleep 10
System-wide sampling
sudo perf record -F 99 -a -g -- sleep 30
-F 99 samples at 99 Hz (avoids lock-step with 100 Hz timer interrupts), -a profiles all CPUs, -g records call graphs. This writes a perf.data file in the current directory.
Recording a specific PID with call graphs
sudo perf record -F 99 -g -p $(pgrep -n postgres) -- sleep 20
Viewing the report
perf report --stdio
The interactive TUI (perf report without --stdio) lets you expand call chains with arrow keys and Enter. Look for functions that own a large self-percentage — those are your hotspots. For kernel symbols to appear correctly, install the debug symbols package for your kernel (linux-image-$(uname -r)-dbgsym on Ubuntu, kernel-debuginfo on Fedora).
Flame Graphs
Flame graphs give a visual representation of sampled stack traces. They're the fastest way to see where on-CPU time is going across a whole application; time spent blocked or sleeping does not appear in a CPU profile.
git clone https://github.com/brendangregg/FlameGraph
sudo perf record -F 99 -a -g -- sleep 30
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg
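Open flame.svg in a browser. Wide plateaus at the top of the graph are hot code paths. The x-axis does not show the passage of time: frames are sorted alphabetically, and a frame's width is the proportion of samples in which it appears.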
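The intermediate folded format is also useful without the SVG. As a sketch, the function below sums samples per leaf frame (the function that was actually on-CPU), assuming the one-line-per-stack "frame;frame;frame count" layout that stackcollapse-perf.pl emits; top_leaf_frames is a hypothetical helper name.

```shell
# Sum samples per leaf frame from folded stacks ("a;b;c 42" per line) on stdin
# and print the top N (default 10) as "<samples> <frame>".
top_leaf_frames() {
    awk '
        {
            count = $NF                       # last field is the sample count
            stack = $0
            sub(/ [0-9]+$/, "", stack)        # drop the count; frames may contain spaces
            n = split(stack, frames, ";")
            self[frames[n]] += count          # credit the leaf frame only
        }
        END { for (f in self) print self[f], f }' |
        sort -k1,1nr -k2 | head -n "${1:-10}"
}
# Usage:
#   sudo perf script | ./FlameGraph/stackcollapse-perf.pl | top_leaf_frames 15
```

This corresponds to the plateaus at the top of the flame graph: a leaf with many samples is where the CPU was executing, not merely a caller on the path.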
Tracing Specific Events
Listing available events
perf list
Counting cache misses
perf stat -e cache-misses,cache-references,LLC-load-misses ./your-binary
A last-level cache (LLC) miss ratio above roughly 10% suggests the working set does not fit in cache and the workload is frequently going to main memory, which is relevant for tuning data structures or NUMA placement.
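The ratio itself is a simple division over two counters. The sketch below computes it from perf stat's CSV mode, again assuming the man page's field order (value, unit, event name); event_ratio is a hypothetical helper name, and the event names must match the CSV output exactly.

```shell
# Print numerator/denominator as a percentage from `perf stat -x,` CSV on stdin.
# Arguments: numerator event name, denominator event name.
event_ratio() {
    awk -F, -v num="$1" -v den="$2" '
        $1 ~ /^[0-9]+$/ { count[$3] = $1 }    # field 3 is the event name
        END {
            if (count[den] > 0)
                printf "%s/%s = %.1f%%\n", num, den, 100 * count[num] / count[den]
            else
                print "no usable count for " den
        }'
}
# Usage:
#   perf stat -x, -e cache-misses,cache-references ./your-binary 2>&1 >/dev/null |
#       event_ratio cache-misses cache-references
```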
Tracing system calls
sudo perf trace -p $(pgrep -n myapp) 2>&1 | head -40
perf trace is a modern alternative to strace with significantly lower overhead. Watch for unexpected read/write syscall storms or repeated futex contention.
Scheduler latency
sudo perf sched record -- sleep 5
sudo perf sched latency
This captures every scheduler event and reports per-task maximum and average wait times. Tasks with maximum latency in the tens of milliseconds are experiencing scheduling delays, which points to CPU saturation or priority inversion.
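To pull out only the tasks with long waits, the report can be filtered by its maximum-delay cell. This sketch assumes each task row contains a cell of the form "max: <number> ms", which may not hold for every perf version; sched_max_over and the 10 ms default are illustrative.

```shell
# Print tasks whose maximum scheduling delay exceeds a threshold in ms,
# reading `perf sched latency` output on stdin.
sched_max_over() {
    awk -F'|' -v limit="${1:-10}" '
        match($0, /max: *[0-9.]+ ms/) {
            cell = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9.]/, "", cell)         # keep only the number
            task = $1
            gsub(/^ +| +$/, "", task)         # trim padding around the task name
            if (cell + 0 > limit) print task, cell " ms"
        }'
}
# Usage: sudo perf sched latency | sched_max_over 10
```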
Verification: Confirming a Bottleneck
After identifying a candidate bottleneck, confirm it with a second independent tool before changing anything.
- CPU hotspot: cross-check with sudo perf top -g in real time while the workload runs.
- I/O bottleneck: confirm with sudo iotop -o and check which PID is generating the I/O.
- Memory pressure: run sudo smem -r -s swap | head -10 to identify the top swap consumers by process.
- Network saturation: use ss -s to check socket state counts and nstat -az for kernel network counters including retransmits.
Troubleshooting
perf report shows [unknown] symbols
Install kernel debug symbols for your exact kernel version. For JIT-compiled runtimes (Java, Node.js), you need to generate a symbol map: for the JVM use -XX:+PreserveFramePointer and the perf-map-agent; for Node.js use --perf-basic-prof.
perf record fails with "Permission denied"
Lower kernel.perf_event_paranoid as shown above, or run the recording under sudo. On systems with Secure Boot and a locked-down kernel (RHEL 9+, Ubuntu 22.04+ in some configurations), some tracepoints stay unavailable even to root until kernel lockdown is lifted; any such change lasts only until the next reboot.
Call graphs are flat or incomplete
The binary was compiled without frame pointers. Recompile with -fno-omit-frame-pointer, or use DWARF unwinding: perf record --call-graph dwarf -F 99 -p PID. DWARF unwinding is accurate but generates larger perf.data files.
System appears idle but latency is high
This is a classic saturation-without-utilization scenario. Check perf sched latency for scheduling delays, cat /proc/interrupts for interrupt imbalance across CPUs, and numastat for NUMA remote memory hits. Also inspect power management: run cpupower frequency-info to confirm that CPU frequency scaling isn't throttling under a conservative governor.
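Interrupt imbalance is hard to see by eye in /proc/interrupts on machines with many CPUs. As a rough sketch, the function below totals the per-CPU columns, assuming the usual layout (a header row of CPU names, then one row per interrupt source); irq_totals is a hypothetical helper name.

```shell
# Sum interrupt counts per CPU from a file in /proc/interrupts format
# (defaults to the live file).
irq_totals() {
    awk '
        NR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
        {
            # Field 1 is the IRQ label; the next ncpu fields are per-CPU counts.
            # Rows such as ERR: have fewer columns, so stop at the first non-number.
            for (i = 1; i <= ncpu; i++) {
                if ($(i + 1) !~ /^[0-9]+$/) break
                total[i] += $(i + 1)
            }
        }
        END { for (i = 1; i <= ncpu; i++) printf "%s %.0f\n", name[i], total[i] }
    ' "${1:-/proc/interrupts}"
}
# Usage: irq_totals
```

One CPU carrying most of the total, especially for a network or storage device's interrupts, is a hint to look at IRQ affinity.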
Frequently asked questions
- What is the difference between perf stat and perf record?
- perf stat counts hardware and software events across the full duration of a run and gives you aggregate totals. perf record samples the CPU at a set frequency and captures stack traces, letting you see exactly which functions are running most often.
- Can I profile a production system without significant overhead?
- Yes. perf record at 99 Hz typically adds less than 1% CPU overhead. Avoid very high frequencies (above 1000 Hz) or DWARF unwinding on hot paths in production, as those are more expensive.
- Why do flame graphs show [unknown] for most of my application's frames?
- Missing debug symbols or omitted frame pointers are the usual causes. Recompile with -fno-omit-frame-pointer, install debuginfo packages, or switch to DWARF-based call graph recording with --call-graph dwarf.
- The CPU utilization looks low but the application is still slow — what should I check?
- Low utilization with high latency usually means saturation or blocking somewhere else. Check I/O wait with iostat, scheduler delays with perf sched latency, NUMA imbalance with numastat, and lock contention with perf trace watching futex calls.
- Does perf work on containers and VMs?
- perf works in VMs as long as the hypervisor exposes PMU counters (KVM does by default). Inside containers it requires CAP_PERFMON (Linux 5.8+) or CAP_SYS_ADMIN, and the host kernel's perf_event_paranoid setting still applies.