An Introduction to eBPF
Learn what eBPF is, how the kernel verifier keeps it safe, and how to use bpftrace to trace syscalls, disk I/O, and CPU scheduling with working examples.
Before you start
- Linux kernel 5.8 or newer (check with uname -r)
- Root access or CAP_BPF + CAP_PERFMON capabilities
- ▸Kernel headers matching the running kernel installed
- ▸Basic familiarity with Linux system calls and the kernel/userspace boundary
eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs inside the Linux kernel without changing kernel source code or loading a kernel module. Originally a mechanism for filtering network packets, it has evolved into a general-purpose in-kernel execution engine used for tracing, observability, security enforcement, and networking. Tools like bpftrace, Cilium, and Falco are all built on it. Understanding eBPF at an operational level — what it does, how programs get loaded, and how to use bpftrace safely — is increasingly essential for serious Linux work.
What eBPF Actually Is
eBPF programs are written in a restricted subset of C (or a higher-level DSL like bpftrace's own language), compiled to eBPF bytecode, and loaded into the kernel via the bpf() syscall. The kernel's built-in verifier checks every program before it runs: the program must provably terminate (no unbounded loops in most contexts) and must not dereference arbitrary pointers. Only if verification passes is the program accepted, after which the JIT compiler translates the bytecode to native machine code.
Programs attach to hook points: kprobes (arbitrary kernel function entry/return), uprobes (userspace function entry/return), tracepoints (stable, ABI-guaranteed kernel instrumentation), perf events, network XDP hooks, cgroup hooks, and more. They communicate with userspace through eBPF maps — typed key/value stores shared between kernel and user contexts.
Because the verifier enforces safety, a buggy eBPF program cannot crash the kernel the way a bad kernel module can. That said, poorly designed programs can cause measurable overhead on hot code paths, so placement matters.
Kernel and Privilege Requirements
Most eBPF tracing capabilities require a kernel of 4.9 or newer; many modern features (ring buffers, BTF type information, CO-RE) need 5.8+. Current LTS kernels on any major distro are fine. Check yours:
uname -r
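The same check can be scripted so that it fails loudly on an old kernel. This is a minimal sketch for a POSIX shell; the function name kernel_at_least is a hypothetical helper, not part of any tool, and the 5.8 threshold is the one quoted above.

```shell
#!/bin/sh
# kernel_at_least RELEASE MAJOR MINOR
# Succeeds if RELEASE (a string in `uname -r` format, e.g. "5.15.0-91-generic")
# is at least MAJOR.MINOR. Hypothetical helper for illustration.
kernel_at_least() {
    release=$1 want_major=$2 want_minor=$3
    major=${release%%.*}            # text before the first dot
    rest=${release#*.}              # text after the first dot
    minor=${rest%%[!0-9]*}          # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least "$(uname -r)" 5 8; then
    echo "kernel is 5.8 or newer"
else
    echo "kernel is older than 5.8: some features in this guide are missing"
fi
```

Only the major and minor numbers are compared; patch levels and distro suffixes are ignored.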
Loading tracing eBPF programs traditionally required CAP_SYS_ADMIN; since kernel 5.8 the more targeted combination of CAP_BPF and CAP_PERFMON can be used instead. For the examples below, use sudo or run as root. Unprivileged eBPF (for network socket filtering) exists but is a narrower use case.
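To see which of these capabilities the current shell actually holds, one option is to decode the CapEff mask in /proc/self/status. This is a sketch, not an official tool: has_cap is a hypothetical helper, and the bit numbers (21, 38, 39) are taken from linux/capability.h.

```shell
#!/bin/sh
# has_cap MASK BIT
# Succeeds if bit number BIT is set in MASK, a hex capability mask as
# printed in the CapEff line of /proc/<pid>/status. Hypothetical helper.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Bit numbers from <linux/capability.h>.
CAP_SYS_ADMIN=21
CAP_PERFMON=38
CAP_BPF=39

# Effective capability mask (awk inherits it from this shell).
mask=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
mask=${mask:-0}    # fall back to an empty mask if /proc is unavailable

if has_cap "$mask" "$CAP_SYS_ADMIN"; then
    echo "CAP_SYS_ADMIN present"
elif has_cap "$mask" "$CAP_BPF" && has_cap "$mask" "$CAP_PERFMON"; then
    echo "CAP_BPF and CAP_PERFMON present"
else
    echo "missing capabilities: use sudo or run as root"
fi
```

Run it once as your normal user and once under sudo to compare the results.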
BTF (BPF Type Format) enables CO-RE (Compile Once – Run Everywhere) — programs that adapt to the running kernel's struct layouts without recompilation. Confirm BTF is exposed:
ls /sys/kernel/btf/vmlinux
If that file is present, your kernel supports CO-RE. Most distro kernels since 2021 ship with it.
Installing bpftrace
bpftrace is the quickest way to write and run eBPF tracing programs interactively. It uses an awk-inspired language and compiles probes on the fly via LLVM.
Debian / Ubuntu
sudo apt update
sudo apt install bpftrace linux-headers-$(uname -r)
Fedora / RHEL 9 / Rocky 9
sudo dnf install bpftrace kernel-devel
On RHEL/Rocky, enable the CodeReady Builder (CRB) repo first if bpftrace is not found in the default channels.
Arch Linux
sudo pacman -S bpftrace linux-headers
bpftrace Language Basics
A bpftrace program follows a simple structure: probe { action }. Multiple probe/action blocks can appear in one script. The runtime provides built-in variables (comm for process name, pid, tid, nsecs, args) and aggregation types like histograms and counts.
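As an illustration of that structure, the sketch below writes a small script with two probe blocks, a predicate, the comm built-in, and two aggregations. The file path is arbitrary, and the script assumes the sys_enter_read tracepoint exposes a count argument (check with bpftrace -lv on your kernel).

```shell
# Write a small two-probe script; the path under /tmp is arbitrary.
cat > /tmp/basics.bt <<'EOF'
#!/usr/bin/env bpftrace
// probe /predicate/ { action }: count read() calls and sum the bytes
// requested, per process name, skipping bpftrace itself.
tracepoint:syscalls:sys_enter_read
/comm != "bpftrace"/
{
    @reads[comm] = count();
    @bytes[comm] = sum(args->count);
}

// A second probe block: stop after 5 seconds; maps print on exit.
interval:s:5
{
    exit();
}
EOF
# Run it with: sudo bpftrace /tmp/basics.bt
```

Each @name is a map; count() and sum() are updated in the kernel and only read back when the script exits.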
List available tracepoints to understand what hook points exist:
sudo bpftrace -l 'tracepoint:syscalls:*' | head -30
List kernel functions probe-able via kprobe:
sudo bpftrace -l 'kprobe:vfs_*'
Observing the Kernel Safely: Practical Examples
Count syscalls per process
This one-liner attaches to the raw syscall entry tracepoint and builds a frequency count. It is read-only and extremely low overhead when syscall rates are modest.
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
Interrupt with Ctrl-C; bpftrace prints the map automatically on exit.
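The one-liner counts by process name. To see which syscalls a single process makes, the same tracepoint can be keyed by syscall number instead. This is a sketch that assumes the tracepoint exposes the number as args->id (confirm with sudo bpftrace -lv tracepoint:raw_syscalls:sys_enter); the process name nginx and the file path are placeholders.

```shell
cat > /tmp/syscalls_by_id.bt <<'EOF'
#!/usr/bin/env bpftrace
// Count syscalls by number for one process name ("nginx" is a placeholder).
tracepoint:raw_syscalls:sys_enter
/comm == "nginx"/
{
    @[args->id] = count();
}

// Exit after 10 seconds so the probe does not run indefinitely.
interval:s:10
{
    exit();
}
EOF
# Run it with: sudo bpftrace /tmp/syscalls_by_id.bt
```

The keys in the output are raw syscall numbers, which differ by architecture; look them up in your system's syscall table.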
Trace slow disk I/O (block layer)
Measure the time between block I/O request issue and completion, and print a histogram of latency in microseconds:
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @start[args->sector] = nsecs; }
tracepoint:block:block_rq_complete
/@start[args->sector]/
{
@usecs = hist((nsecs - @start[args->sector]) / 1000);
delete(@start[args->sector]);
}'
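The /@start[args->sector]/ filter (a predicate) ensures the completion probe only runs its action when a matching issue was recorded. Without it, requests that were already in flight when tracing started would read a start time of 0 and record a huge bogus latency, skewing the histogram.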
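On a machine with several disks, the same sector number can be in flight on two devices at once, so keying the map by sector alone can mix up requests. A possible variant, written as a script file, keys by device and sector together. It assumes both tracepoints expose dev and sector fields on your kernel (verify with sudo bpftrace -lv 'tracepoint:block:block_rq_*'); the file path is arbitrary.

```shell
cat > /tmp/biolat.bt <<'EOF'
#!/usr/bin/env bpftrace
// Key in-flight requests by (device, sector) so identical sector numbers
// on different disks do not collide.
tracepoint:block:block_rq_issue
{
    @start[args->dev, args->sector] = nsecs;
}

tracepoint:block:block_rq_complete
/@start[args->dev, args->sector]/
{
    @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]);
}

// Drop leftover in-flight entries so they are not printed on exit.
END
{
    clear(@start);
}
EOF
# Run it with: sudo bpftrace /tmp/biolat.bt
```

The END block is optional; without it, any requests still in flight at Ctrl-C are printed as raw timestamps alongside the histogram.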
Trace file opens by filename
Using the openat syscall tracepoint, print every file opened along with the calling process:
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_openat
{
printf("%s opened %s\n", comm, str(args->filename));
}'
On a busy system this can be very verbose. Add a comm filter to narrow it:
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_openat
/comm == "nginx"/
{
printf("%s\n", str(args->filename));
}'
CPU scheduler latency (run-queue time)
Measure how long tasks wait in the run queue before being scheduled — a key signal for CPU saturation:
sudo bpftrace -e '
tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{ @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch
/@qtime[args->next_pid]/
{
@usecs = hist((nsecs - @qtime[args->next_pid]) / 1000);
delete(@qtime[args->next_pid]);
}'
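Reading the resulting histogram is easier with the arithmetic in mind: the script divides a nanosecond difference by 1000 and hist() places the result in a power-of-2 bucket. The sketch below reproduces that calculation in plain shell for one sample. usecs_bucket is a hypothetical helper and the timestamps are made up; the bucket labels follow the [low, high) layout and may differ cosmetically from bpftrace's own output.

```shell
#!/bin/sh
# usecs_bucket START_NS END_NS
# Prints the microsecond latency and the power-of-2 bucket it falls into,
# mirroring the (nsecs - start) / 1000 arithmetic used with hist().
usecs_bucket() {
    usecs=$(( ($2 - $1) / 1000 ))      # integer division, as in the script
    if [ "$usecs" -lt 2 ]; then
        echo "$usecs us -> [$usecs]"
        return
    fi
    low=1
    while [ $(( low * 2 )) -le "$usecs" ]; do
        low=$(( low * 2 ))
    done
    echo "$usecs us -> [$low, $(( low * 2 )))"
}

# Woken at t = 1,000,000 ns, switched in at t = 1,150,000 ns.
usecs_bucket 1000000 1150000    # prints: 150 us -> [128, 256)
```

A task that waited 150 microseconds therefore adds one count to the [128, 256) row of @usecs.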
Writing a bpftrace Script File
For anything beyond a one-liner, save it to a .bt file. Here is a complete script that profiles CPU stack traces at 99 Hz — useful for finding hot functions without modifying code:
cat > /tmp/cpuprofile.bt <<'EOF'
#!/usr/bin/env bpftrace
// Sample all CPUs at 99 Hz and capture kernel+user stacks
profile:hz:99
{
@[kstack, ustack, comm] = count();
}
EOF
sudo bpftrace /tmp/cpuprofile.bt
Run it for 10–30 seconds, then Ctrl-C. The output is a flame-graph-ready stack count format compatible with Brendan Gregg's stackcollapse-bpftrace.pl.
Understanding Overhead and Safety
eBPF programs run in-kernel on every matching event. Placing a kprobe on a function called millions of times per second (vfs_read on a busy NFS server, for example) will add measurable latency. Tracepoints are generally lower overhead than kprobes because each one is a static call site compiled into the kernel that costs only a no-op instruction until it is enabled, whereas a kprobe patches the target function at run time. Prefer tracepoints for production use; reserve kprobes for development or targeted short-duration investigations.
Safe practices:
- Use predicates (/condition/) to filter early and reduce work per event.
- Prefer maps and histograms over per-event printf: aggregation in-kernel is far cheaper than copying data to userspace for every event.
- Avoid attaching to scheduler hot paths (schedule, __schedule) in production without careful testing.
- Always run with a duration or Ctrl-C discipline. bpftrace programs do not persist after the process exits, so there is no risk of leaving stale hooks.
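Several of these practices can be combined in one script. The sketch below reworks the file-open trace so that it filters with a predicate, aggregates into a map instead of printing per event, and stops by itself. The process name nginx, the 30-second duration, and the file path are placeholders.

```shell
cat > /tmp/opens_agg.bt <<'EOF'
#!/usr/bin/env bpftrace
// Filter early: events from other processes are dropped by the predicate.
tracepoint:syscalls:sys_enter_openat
/comm == "nginx"/
{
    // Aggregate in a map instead of calling printf for every event.
    @opens[str(args->filename)] = count();
}

// Bound the run: exit after 30 seconds; the map prints on exit.
interval:s:30
{
    exit();
}
EOF
# Run it with: sudo bpftrace /tmp/opens_agg.bt
```

The output is one line per distinct filename with its open count, printed once at exit.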
Verifying Your Setup
Confirm bpftrace can attach probes and return data:
sudo bpftrace -e 'BEGIN { printf("eBPF is working\n"); exit(); }'
Expected output: eBPF is working. If you see permission errors, ensure you are running as root or have CAP_BPF and CAP_PERFMON. If you see LLVM or BTF errors, check that linux-headers matching your running kernel are installed.
Troubleshooting
- "Cannot open kernel headers": Install headers for the exact running kernel version (uname -r), not just any version in the repo.
- Tracepoint arguments are wrong: Use sudo bpftrace -lv tracepoint:syscalls:sys_enter_openat to inspect the actual argument struct for your kernel version.
- Verifier errors when loading a program: The kernel rejected your program. The verifier log is returned to the loading process, not written to dmesg, so look at bpftrace's own output and rerun with -v if it only shows a short error. Read the verifier message carefully; it points to the exact instruction. Common causes: using a map lookup result without a null check, or reading past a buffer without a bounds check.
- High overhead noticed: Enable run-time statistics with sudo sysctl kernel.bpf_stats_enabled=1, then run sudo bpftool prog show while the tracer is active to see run_cnt and run_time_ns for each loaded program. Move to a lower-frequency hook or add stricter predicates.
- bpftrace not finding uprobes: Userspace symbols require unstripped binaries or debuginfo packages. On Fedora/RHEL: sudo dnf debuginfo-install <package>.
Frequently asked questions
- Can an eBPF program crash my kernel?
- A program that fails verification is rejected before it ever runs. A verified program cannot dereference arbitrary pointers or loop forever, so it cannot cause a kernel panic — but it can add latency on hot paths if placed carelessly.
- What is the difference between kprobes and tracepoints?
- Tracepoints are stable, explicitly placed instrumentation hooks with a maintained argument format; each is a static call site that costs only a no-op instruction until enabled, so overhead is lower. Kprobes attach dynamically to almost any kernel function but are not ABI-stable: a kernel update can change arguments or remove the function entirely.
- Do eBPF programs persist after I stop bpftrace?
- No. When the bpftrace process exits, all attached probes and maps are cleaned up automatically. Nothing persists in the kernel afterward unless you explicitly pin maps to the BPF filesystem at /sys/fs/bpf/.
- What is the difference between bpftrace and BCC?
- BCC (BPF Compiler Collection) provides Python and Lua bindings for writing full eBPF programs in C; it is more flexible but heavier. bpftrace offers a concise high-level language ideal for one-liners and short scripts, and it now uses BTF/CO-RE so it does not always need kernel headers at runtime.
- Is eBPF available on non-x86 architectures?
- Yes. The eBPF JIT compiler supports x86-64, arm64, s390, powerpc, RISC-V, and others. Performance and feature completeness vary slightly by architecture, but for tracing workloads all major architectures are well supported.