What is the difference between perf stat and perf record?

perf stat counts hardware and software events over the whole run and reports aggregate totals. perf record samples the CPU at a set frequency and captures stack traces, so you can see which functions appear most often in the samples.

Can I profile a production system without significant overhead?

Yes. perf record at 99 Hz typically adds less than 1% CPU overhead. Avoid very high frequencies (above 1000 Hz) or DWARF unwinding on hot paths in production, as those are more expensive.

Why do flame graphs show [unknown] for most of my application's frames?

Missing debug symbols or omitted frame pointers are the usual causes. Recompile with -fno-omit-frame-pointer, install debuginfo packages, or switch to DWARF-based call graph recording with --call-graph dwarf.

The CPU utilization looks low but the application is still slow — what should I check?

Low utilization with high latency usually means saturation or blocking somewhere else. Check I/O wait with iostat, scheduler delays with perf sched latency, NUMA imbalance with numastat, and lock contention with perf trace watching futex calls.

Does perf work on containers and VMs?

perf works in VMs as long as the hypervisor exposes PMU counters (KVM does by default). Inside containers it requires CAP_PERFMON (Linux 5.8+) or CAP_SYS_ADMIN, and the host kernel's perf_event_paranoid setting still applies.

How to Profile Linux Performance

Performance problems on Linux rarely announce themselves clearly. A server is slow, a process burns CPU, latency spikes unpredictably — finding the actual cause requires a disciplined methodology and the right tooling. This guide covers the USE method as a diagnostic framework, perf as the primary low-level profiler, and a practical workflow for tracking bottlenecks from high-level symptoms down to specific code paths or resources.

The USE Method

Coined by Brendan Gregg, the USE method gives you a consistent checklist before you start grabbing random tools. For every significant resource — CPUs, memory, disks, network interfaces, bus interconnects — ask three questions:

Utilization: what percentage of the resource's capacity is in use?
Saturation: is work queuing up because the resource is fully busy?
Errors: are there error counts incrementing on this resource?

Work through every resource systematically before diving deep into any one area. This prevents the common trap of profiling CPU for hours when the real bottleneck is disk I/O wait or network receive drops.

Quick System-Wide Baseline

Before touching perf, collect a fast snapshot across all major resources.

CPU utilization and load

vmstat 1 5

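The us, sy, and wa columns show the percentage of CPU time spent in user code, kernel code, and waiting for I/O during each one-second interval; the first row reports averages since boot. The r column is the run-queue length: values persistently above the CPU count indicate CPU saturation. High wa points to storage (or a network filesystem such as NFS) rather than CPU.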
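To turn a vmstat run into a single line you can compare between runs, a small filter is enough. The sketch below is one possible approach, not a standard tool: vmstat_summary is a hypothetical helper name, it finds columns by header name because positions differ between procps versions, and it skips the first data row on the assumption that it holds since-boot averages.

```shell
# Summarise `vmstat 1 N` output read from stdin (hypothetical helper).
vmstat_summary() {
    awk '
        $1 == "r" {                          # header row: map column names to positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("us" in col) && $1 ~ /^[0-9]+$/ {   # data rows
            if (!seen++) next                # first row is the since-boot average
            n++
            us += $(col["us"]); sy += $(col["sy"]); wa += $(col["wa"])
            if ($(col["r"]) + 0 > maxr) maxr = $(col["r"]) + 0
        }
        END {
            if (n) printf "samples=%d avg_us=%.1f avg_sy=%.1f avg_wa=%.1f max_runq=%d\n",
                          n, us / n, sy / n, wa / n, maxr
        }'
}

# Usage on a live system:
#   vmstat 1 6 | vmstat_summary
```

A high avg_wa with a low max_runq suggests looking at storage first; a max_runq well above the CPU count suggests CPU saturation.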

Per-resource snapshot with dstat

dstat -cdngy 1 10

Shows CPU, disk, network, paging, and system (interrupts and context switches) stats in one view. Install it via apt install dstat, dnf install dstat, or pacman -S dstat.

Memory saturation

free -m
grep -E 'pswp|pgmajfault' /proc/vmstat

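An increasing pswpout count means the kernel is actively swapping, which is a hard saturation signal. These counters are cumulative since boot, so compare two readings rather than relying on a single non-zero value. Major faults (pgmajfault) count pages that had to be read from disk.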
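Because the /proc/vmstat counters only ever grow, what matters is how much they change over an interval. A minimal sketch of that comparison follows; vmstat_counter_delta and the /tmp file names are hypothetical, chosen for illustration.

```shell
# Print how much selected /proc/vmstat counters grew between two snapshots.
# $1 = snapshot taken first, $2 = snapshot taken later (hypothetical helper).
vmstat_counter_delta() {
    awk '
        NR == FNR { before[$1] = $2; next }          # first file: remember values
        $1 ~ /^(pswpin|pswpout|pgmajfault)$/ {       # second file: print growth
            print $1, $2 - before[$1]
        }' "$1" "$2"
}

# Usage on a live system (10-second window):
#   cat /proc/vmstat > /tmp/vmstat.before
#   sleep 10
#   cat /proc/vmstat > /tmp/vmstat.after
#   vmstat_counter_delta /tmp/vmstat.before /tmp/vmstat.after
```

Any pswpout growth during the window means pages were swapped out while you were watching.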

Disk errors and saturation

iostat -xz 1 5

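Watch %util (utilization), aqu-sz (average queue length, a saturation signal), and the await columns (average I/O latency in ms; recent sysstat versions split this into r_await and w_await). An await above 20 ms on an SSD or above 100 ms on spinning rust warrants investigation. The -z flag omits devices that had no activity during the interval.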
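On hosts with many block devices it can help to filter the iostat output down to the busy ones. The following is a sketch under assumptions: busy_disks is a hypothetical helper, the default threshold of 80 is arbitrary, and the %util column is located by header name because the column layout differs between sysstat versions.

```shell
# Print devices whose %util is at or above a threshold (default 80).
# Reads `iostat -x` output from stdin (hypothetical helper).
busy_disks() {
    threshold="${1:-80}"
    awk -v limit="$threshold" '
        /^Device/ {                        # header row of each device table
            util = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") util = i
            next
        }
        NF == 0 { util = 0; next }         # a blank line ends the device table
        util && $util + 0 >= limit { print $1, $util }
    '
}

# Usage on a live system:
#   iostat -xz 1 5 | busy_disks 80
```

Keep in mind that %util can read close to 100 on devices that serve requests in parallel (NVMe, RAID) without the device being saturated, so treat the output as a list of candidates to check against the latency columns.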

Network errors

ip -s link
cat /proc/net/dev

Look at the error and drop counters per interface. Incrementing values while the system is under load are a USE-method error hit.
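The raw /proc/net/dev table is hard to scan by eye. A minimal sketch that pulls out only the error and drop counters per interface is shown below; net_error_counters is a hypothetical helper name, and the field positions assume the standard /proc/net/dev layout (eight receive counters followed by eight transmit counters).

```shell
# Print receive/transmit error and drop counters per interface.
# Optional $1 = file to read, default /proc/net/dev (hypothetical helper).
net_error_counters() {
    awk 'NR > 2 {                          # skip the two header lines
        sub(/:/, " ")                      # the name may be glued to the first counter
        printf "%s rx_errs=%s rx_drop=%s tx_errs=%s tx_drop=%s\n", $1, $4, $5, $12, $13
    }' "${1:-/proc/net/dev}"
}

# Usage on a live system: run it twice under load and compare the values.
#   net_error_counters; sleep 10; net_error_counters
```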

Installing and Configuring perf

perf is the Linux kernel's built-in profiler. It uses hardware performance counters (PMU), kernel tracepoints, and software events. You need the version that matches your running kernel.

Debian / Ubuntu

sudo apt install linux-tools-$(uname -r) linux-tools-common

On Debian itself the package is named linux-perf (sudo apt install linux-perf).

Fedora / RHEL / Rocky

sudo dnf install perf

Arch Linux

sudo pacman -S perf

Verify it works against the running kernel:

perf --version
perf stat ls

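By default, unprivileged users cannot profile kernel code or collect system-wide samples. To allow unprivileged profiling of your own processes, including their kernel-side samples, temporarily relax the setting for your session and restore it afterwards:

sudo sysctl -w kernel.perf_event_paranoid=1

A value of 1 permits user and kernel profiling of processes you own, 0 additionally permits system-wide CPU events, and -1 removes the restrictions entirely, including access to raw tracepoints. Note the original value first (sysctl kernel.perf_event_paranoid) and restore it when done: the mainline default is 2, and some distributions such as Debian and Ubuntu ship stricter values. Do not set this persistently on production systems unless you understand the security implications.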
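Forgetting to restore the setting is easy, so one option is to wrap the change around a single command. The sketch below is illustrative only: with_paranoid, PARANOID_FILE, and SUDO are hypothetical names introduced here, and the function does not restore the value if it is interrupted before the wrapped command returns.

```shell
# Run one command with a temporary perf_event_paranoid value, then restore
# whatever value was there before (hypothetical helper).
# PARANOID_FILE and SUDO are variables so the sketch can be dry-run on a scratch file.
PARANOID_FILE="${PARANOID_FILE:-/proc/sys/kernel/perf_event_paranoid}"
SUDO="${SUDO-sudo}"

with_paranoid() {
    level="$1"; shift
    previous="$(cat "$PARANOID_FILE")" || return 1
    echo "$level" | $SUDO tee "$PARANOID_FILE" >/dev/null || return 1
    "$@"                                   # the command to run with the relaxed setting
    status=$?
    echo "$previous" | $SUDO tee "$PARANOID_FILE" >/dev/null
    return "$status"
}

# Usage:
#   with_paranoid 1 perf record -F 99 -g -- ./your-binary
```

Reading the previous value instead of hard-coding 2 matters on distributions that ship a stricter default.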

CPU Profiling with perf stat and perf record

Counting events on a command

perf stat -d ./your-binary arg1 arg2

The -d flag adds L1 data cache and last-level cache load and miss counters on top of the default set, which already includes branches and branch misses. The output includes elapsed time, task-clock, context switches, page faults, cycles, instructions, and the IPC (instructions per cycle) ratio. An IPC well below 1.0 on a modern out-of-order CPU often signals stalled, typically memory-bound, execution.
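If you want the IPC figure in a script, for example to compare builds, perf stat can emit comma-separated output with -x, and the ratio can be recomputed from the raw counts. The sketch below assumes the value is in the first CSV field and the event name in the third, and that events are reported under the plain names cycles and instructions (on hybrid CPUs they may appear under per-core-type names, which this does not handle). ipc_from_perf_csv is a hypothetical helper name.

```shell
# Compute instructions per cycle from `perf stat -x,` output on stdin
# (hypothetical helper).
ipc_from_perf_csv() {
    awk -F, '
        { name = $3; sub(/:.*/, "", name) }      # drop modifiers such as ":u"
        name == "cycles"       { cycles += $1 }
        name == "instructions" { instructions += $1 }
        END { if (cycles > 0) printf "%.2f\n", instructions / cycles }'
}

# Usage (perf stat writes its report to stderr, so swap the streams):
#   perf stat -x, -e cycles,instructions ./your-binary 2>&1 >/dev/null | ipc_from_perf_csv
```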

Attaching to a running process

perf stat -p $(pgrep -n nginx) -d -- sleep 10

System-wide sampling

sudo perf record -F 99 -a -g -- sleep 30

-F 99 samples at 99 Hz (avoids lock-step with 100 Hz timer interrupts), -a profiles all CPUs, -g records call graphs. This writes a perf.data file in the current directory.

Recording a specific PID with call graphs

sudo perf record -F 99 -g -p $(pgrep -n postgres) -- sleep 20

Viewing the report

perf report --stdio

The interactive TUI (perf report without --stdio) lets you expand call chains with arrow keys and Enter. Look for functions that own a large self-percentage — those are your hotspots. For kernel symbols to appear correctly, install the debug symbols package for your kernel (linux-image-$(uname -r)-dbgsym on Ubuntu, kernel-debuginfo on Fedora).

Flame Graphs

Flame graphs give a visual representation of sampled stack traces. They're the fastest way to see where CPU time is going across a whole application. Because perf record samples running code, time spent blocked off-CPU does not appear.

git clone https://github.com/brendangregg/FlameGraph
sudo perf record -F 99 -a -g -- sleep 30
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg

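Open flame.svg in a browser. Each box is a function, and its width is the share of samples in which that function was on the stack. Wide plateaus along the top edge are functions that were running on-CPU, and these are the hot code paths. The x-axis shows the proportion of samples, with stacks sorted alphabetically, not the passage of time.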
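The same "wide plateau at the top" information can be read as text from the folded stacks that stackcollapse-perf.pl produces (semicolon-separated frames followed by a sample count). The sketch below ranks the leaf frame of each stack by its share of samples; hot_leaf_frames is a hypothetical helper name, not part of the FlameGraph repository.

```shell
# Rank leaf (on-CPU) frames by share of samples.
# Reads folded stacks from stdin; optional $1 = number of rows (default 10).
hot_leaf_frames() {
    awk '
        {
            count = $NF                     # last field is the sample count
            stack = $0
            sub(/ [0-9]+$/, "", stack)      # keep only the frame list
            n = split(stack, frame, ";")
            self[frame[n]] += count         # credit the leaf frame
            total += count
        }
        END {
            for (name in self) printf "%.1f%% %s\n", 100 * self[name] / total, name
        }' | sort -rn | head -n "${1:-10}"
}

# Usage:
#   sudo perf script | ./FlameGraph/stackcollapse-perf.pl | hot_leaf_frames 15
```

This gives a quick text answer when you cannot open an SVG, for example over SSH.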

Tracing Specific Events

Listing available events

perf list

Counting cache misses

perf stat -e cache-misses,cache-references,LLC-load-misses ./your-binary

A miss ratio above roughly 10% at the last-level cache (LLC) suggests the workload frequently falls through to main memory, which is relevant for tuning data structures or NUMA placement.
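To check a ratio against that threshold automatically, the counts can be taken from perf stat's CSV mode. This is a sketch under assumptions: cache_miss_ratio is a hypothetical helper, it uses the generic cache-misses and cache-references events (which usually, but not always, map to the last-level cache), and the default limit of 10 simply mirrors the rule of thumb above.

```shell
# Report cache-misses as a percentage of cache-references and flag it
# against a limit (default 10). Reads `perf stat -x,` output from stdin.
cache_miss_ratio() {
    awk -F, -v limit="${1:-10}" '
        { name = $3; sub(/:.*/, "", name) }      # drop modifiers such as ":u"
        name == "cache-misses"     { misses += $1 }
        name == "cache-references" { refs += $1 }
        END {
            if (refs > 0) {
                pct = 100 * misses / refs
                printf "%.1f%% %s\n", pct, (pct > limit ? "above-threshold" : "ok")
            }
        }'
}

# Usage:
#   perf stat -x, -e cache-misses,cache-references ./your-binary 2>&1 >/dev/null | cache_miss_ratio 10
```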

Tracing system calls

sudo perf trace -p $(pgrep -n myapp) 2>&1 | head -40

perf trace is a modern alternative to strace with significantly lower overhead. Watch for unexpected read/write syscall storms or repeated futex contention.

Scheduler latency

sudo perf sched record -- sleep 5
sudo perf sched latency

This captures every scheduler event and reports per-task maximum and average wait times. Tasks with maximum latency in the tens of milliseconds are experiencing scheduling delays, which points to CPU saturation or priority inversion.

Verification: Confirming a Bottleneck

After identifying a candidate bottleneck, confirm it with a second independent tool before changing anything.

CPU hotspot: cross-check with sudo perf top -g in real time while the workload runs.
I/O bottleneck: confirm with sudo iotop -o and check which PID is generating the I/O.
Memory pressure: run sudo smem -r -s swap | head -10 to identify the top swap consumers by process.
Network saturation: use ss -s to check socket state counts and nstat -az for kernel network counters including retransmits.

Troubleshooting

perf report shows [unknown] symbols

Install debug symbols for your exact kernel version and for the affected user-space binaries and libraries. For JIT-compiled runtimes (Java, Node.js), you need to generate a symbol map: for the JVM use -XX:+PreserveFramePointer and the perf-map-agent; for Node.js use --perf-basic-prof.

perf record fails with "Permission denied"

Lower kernel.perf_event_paranoid as shown above, or run the recording under sudo. On systems with Secure Boot and locked-down kernels (RHEL 9+, Ubuntu 22.04+ in some configs), kernel lockdown can block some tracepoints even for root. Lifting lockdown, where the distribution allows it, lasts only until the next reboot.

Call graphs are flat or incomplete

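The binary was most likely compiled without frame pointers. Recompile with -fno-omit-frame-pointer, or use DWARF unwinding: perf record --call-graph dwarf -F 99 -p PID. DWARF unwinding does not need frame pointers, but it copies a chunk of the user stack with every sample, so it costs more CPU and generates much larger perf.data files.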

System appears idle but latency is high

This is a classic saturation-without-utilization scenario. Check perf sched latency for scheduling delays, cat /proc/interrupts for interrupt imbalance across CPUs, and numastat for NUMA remote memory hits. Also inspect power management: cpupower frequency-info to confirm CPU frequency scaling isn't throttling under a conservative governor.
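For the frequency-scaling check, the active governor of every CPU is also exposed through sysfs, which makes a quick summary possible without cpupower. The sketch below assumes the cpufreq sysfs layout (devices/system/cpu/cpuN/cpufreq/scaling_governor); governor_summary is a hypothetical helper, and the optional root argument exists only so the function can be pointed at a test directory.

```shell
# Count CPUs per active cpufreq governor.
# Optional $1 = sysfs root (default /sys); hypothetical helper.
governor_summary() {
    sysfs_root="${1:-/sys}"
    cat "$sysfs_root"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null |
        sort | uniq -c | awk '{ print $2 ": " $1 " CPUs" }'
}

# Usage:
#   governor_summary
```

No output means the cpufreq files are absent, which is common inside VMs and containers.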