An Introduction to io_uring
Learn how io_uring's shared ring buffers work, benchmark it against libaio with fio, and harden multi-tenant Linux servers against its known attack surface.
Before you start
- Linux kernel 5.1 or newer (5.15 LTS recommended for production)
- Root or sudo access for benchmarks against block devices and sysctl changes
- fio installed from distro packages (recent builds include the io_uring engine)
- A spare block device or sufficient /tmp space for test files
Linux I/O has always been a story of trade-offs. Traditional blocking calls are simple but waste threads; Linux AIO was non-blocking but awkward, incomplete, and prone to silent fallbacks to blocking behaviour. io_uring, merged in Linux 5.1 (2019) and hardened through subsequent kernels, addresses both problems with a pair of lock-free ring buffers in memory shared between user space and the kernel. The result is lower syscall overhead, true async I/O for almost any file type, and throughput that comes much closer to raw hardware limits. This guide explains the mechanics, shows how to measure performance with fio, and covers the security considerations every sysadmin needs to know before enabling advanced features.
Why io_uring Exists
Before io_uring, the async I/O landscape looked like this:
- POSIX AIO: thread-pool based in glibc; not truly async at the kernel level for buffered I/O.
- Linux AIO (io_submit): works for O_DIRECT reads and writes on block devices, but not for sockets, pipes, or buffered files. Any path that couldn't satisfy the request immediately fell back to blocking.
- epoll + non-blocking I/O: genuinely async for sockets, but requires two syscalls per operation (wait, then read/write) and does not help with file I/O at all.
io_uring replaces all of these with a single, unified interface. A submission queue entry (SQE) describes any operation: read, write, accept, send, fsync, openat, even splice and statx. The kernel drains the submission queue, does the work (possibly on an internal async worker pool), and writes a completion queue entry (CQE). User space reaps completions straight from the shared completion ring without entering the kernel, and when the SQPOLL feature is active, submission needs no syscall either.
Request Flow in Detail
The Two Rings
io_uring_setup(2) allocates two ring buffers in memory shared between kernel and user space. The Submission Queue (SQ) is an array of indices into an SQE array; user space writes SQEs and advances the tail. The Completion Queue (CQ) is written by the kernel; user space reads CQEs and advances the head. Crucially, both sides read the other's head/tail via atomic loads — no locking, no context switch required for the common path.
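The index arithmetic behind both rings can be modelled in a few lines of shell. This is a toy sketch, not real io_uring code: the function names ring_slot and ring_pending are invented for illustration, and it assumes a power-of-two ring size and free-running 32-bit head and tail counters.

```shell
# Toy model of ring indexing (illustrative names, not a kernel API).
# Head and tail only ever increase; the array slot is counter & (entries - 1).

# ring_slot COUNTER ENTRIES -> array slot that the counter refers to
ring_slot() {
    echo $(( $1 & ($2 - 1) ))
}

# ring_pending HEAD TAIL -> entries produced but not yet consumed
ring_pending() {
    # Mask to 32 bits so the result stays correct after the counters wrap.
    echo $(( ($2 - $1) & 0xFFFFFFFF ))
}
```

With a 128-entry ring, ring_slot 130 128 lands in slot 2, and ring_pending 5 9 reports four outstanding entries. Because producer and consumer each write only their own counter, neither side needs a lock.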
Syscall Reduction
With a plain setup, submitting work costs one io_uring_enter(2) syscall per batch, regardless of batch size. Enable IORING_SETUP_SQPOLL and the kernel spawns a dedicated thread that polls the SQ; during steady-state operation user space does not call into the kernel at all. The trade-off is that the poller busy-spins on a CPU core until sq_thread_idle milliseconds pass without a submission, after which it sleeps and the application must wake it with io_uring_enter(2) and the IORING_ENTER_SQ_WAKEUP flag.
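One way to see the syscall reduction is to count io_uring_enter calls with strace -c. The helper below is a small sketch that only parses the strace summary; the name enter_calls is made up here, and it assumes the usual strace -c column layout in which the call count is the fourth column.

```shell
# enter_calls: read `strace -c` summary text on stdin and print the number of
# io_uring_enter calls (0 if the syscall never appears).
enter_calls() {
    awk '$NF == "io_uring_enter" { n = $4 } END { print n + 0 }'
}

# Example (needs strace and fio; strace -c writes its summary to stderr):
#   strace -f -c -e trace=io_uring_enter \
#     fio --name=t --ioengine=io_uring --filename=/tmp/iou_t \
#         --rw=read --size=64m --bs=4k --iodepth=16 2>&1 >/dev/null | enter_calls
```

Running the same job with and without SQPOLL should show the count shrinking sharply, although the exact numbers depend on the fio version and queue depth.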
Fixed Buffers and Registered Files
Every time you pass a user-space buffer pointer through a normal read/write, the kernel must validate and pin those pages. Register buffers once with io_uring_register(2) using IORING_REGISTER_BUFFERS, and subsequent IORING_OP_READ_FIXED operations skip that overhead entirely. Similarly, IORING_REGISTER_FILES pre-registers file descriptors so each SQE references a slot index rather than triggering repeated fdget calls.
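fio exposes both registrations through its io_uring engine options fixedbufs and registerfiles, so their effect can be measured without writing C. The sketch below only builds the argument list; the function name iou_fixed_args and the job name iou-fixed are illustrative, and the remaining values are arbitrary example settings.

```shell
# iou_fixed_args TARGET: print fio arguments (one per line) for a random-read
# job that uses registered buffers and registered files.
iou_fixed_args() {
    printf '%s\n' \
        --name=iou-fixed --ioengine=io_uring "--filename=$1" \
        --rw=randread --bs=4k --iodepth=128 --direct=1 \
        --fixedbufs=1 --registerfiles=1 \
        --runtime=30 --time_based
}
```

A run could then look like sudo fio $(iou_fixed_args /dev/nvme0n1), assuming the target path contains no whitespace. Comparing against the same job without the two options isolates the cost of per-request page pinning and descriptor lookup.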
Kernel Version Requirements
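The feature set grew rapidly across kernel versions. Notable milestones:
| Kernel | Notable addition |
|---|---|
| 5.1 | Initial merge: vectored read/write, fsync, fixed buffers, registered files, SQPOLL |
| 5.4 | Timeouts (IORING_OP_TIMEOUT), single-mmap ring setup |
| 5.6 | send and recv (accept and connect arrived in 5.5); openat, close, statx |
| 5.11 | SQPOLL usable with CAP_SYS_NICE instead of root |
| 5.18 | Registered ring fds |
| 5.19 | Multishot accept, NVMe passthrough (IORING_OP_URING_CMD) |
| 6.0 | Zero-copy send |
| 6.1 (LTS) | IORING_SETUP_DEFER_TASKRUN |
| 6.6 (LTS) | kernel.io_uring_disabled sysctl for restricting unprivileged use |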
Check your running kernel with uname -r. Ubuntu 24.04 LTS ships 6.8; Fedora 40 ships 6.8–6.9; RHEL 9 ships 5.14 with backports, so verify that support is built in with grep IO_URING /boot/config-$(uname -r).
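A script that depends on a particular feature can compare the running release against a minimum version. The following is a minimal sketch with a made-up function name, kernel_at_least; it compares only the major and minor numbers and cannot see vendor backports such as those in RHEL kernels.

```shell
# kernel_at_least RELEASE MAJOR MINOR
# Exit 0 if RELEASE (e.g. "6.8.0-45-generic") is at least MAJOR.MINOR.
kernel_at_least() {
    rel_major=${1%%.*}            # text before the first dot
    rest=${1#*.}                  # text after the first dot
    rel_minor=${rest%%[!0-9]*}    # leading digits of the remainder
    [ "$rel_major" -gt "$2" ] ||
        { [ "$rel_major" -eq "$2" ] && [ "$rel_minor" -ge "$3" ]; }
}
```

For example, kernel_at_least "$(uname -r)" 5 15 && echo "new enough for production SQPOLL" gates a benchmark on the 5.15 recommendation.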
Benchmarking with fio
Install fio
Debian/Ubuntu:
sudo apt install fio
Fedora/RHEL:
sudo dnf install fio
Arch:
sudo pacman -S fio
Baseline: libaio vs io_uring (random read, O_DIRECT)
Run both engines against the same NVMe device. Replace /dev/nvme0n1 with your target. These jobs only read, but a mistyped --rw value on a raw device destroys data, so prefer a scratch disk, or point --filename at a test file and add --size.
sudo fio \
--name=libaio-randread \
--ioengine=libaio \
--filename=/dev/nvme0n1 \
--rw=randread \
--bs=4k \
--iodepth=128 \
--numjobs=4 \
--direct=1 \
--runtime=30 \
--time_based \
--group_reporting
sudo fio \
--name=iou-randread \
--ioengine=io_uring \
--filename=/dev/nvme0n1 \
--rw=randread \
--bs=4k \
--iodepth=128 \
--numjobs=4 \
--direct=1 \
--runtime=30 \
--time_based \
--group_reporting
As a rough guide, on a mid-range NVMe (Samsung 980 Pro class) libaio might saturate around 650 k IOPS while io_uring reaches 750–800 k IOPS at the same queue depth, primarily because of reduced per-request overhead. Treat these as illustrative figures: results vary significantly by device, CPU, and queue depth.
SQPOLL mode (zero-syscall path)
sudo fio \
--name=iou-sqpoll \
--ioengine=io_uring \
--filename=/dev/nvme0n1 \
--rw=randread \
--bs=4k \
--iodepth=128 \
--numjobs=4 \
--direct=1 \
--runtime=30 \
--time_based \
--sqthread_poll=1 \
--sqthread_poll_cpu=0 \
--group_reporting
Pin the SQPOLL thread to an isolated CPU (sqthread_poll_cpu) and keep other tasks off that core via isolcpus= or cset for the cleanest numbers. Kernels before 5.11 require root for SQPOLL, 5.11 and 5.12 require CAP_SYS_NICE, and from 5.13 on no special privilege is needed.
Reading fio Output
Key fields in the summary line: IOPS, BW (bandwidth), and lat (usec) — specifically the 99th/99.9th percentile clat (completion latency). io_uring's advantage shows most clearly in clat tail latency under deep queues, not just peak IOPS.
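For scripted comparisons, fio's terse output (--output-format=terse) is easier to parse than the human-readable summary. The helpers below are a sketch with invented names, terse_read_iops and iops_gain_pct; they assume terse format version 3, in which field 8 holds read IOPS, and use integer arithmetic only.

```shell
# terse_read_iops: read fio terse (v3) lines on stdin, print field 8 (read IOPS).
terse_read_iops() {
    cut -d';' -f8
}

# iops_gain_pct BASE NEW -> whole-number percentage change from BASE to NEW
iops_gain_pct() {
    echo $(( ($2 - $1) * 100 / $1 ))
}
```

Capture the value from each engine's run by piping the fio output through terse_read_iops, then call iops_gain_pct with the libaio figure first. Going from 650000 to 780000 IOPS, for instance, is a 20 percent gain.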
Security Considerations
io_uring's power makes it a serious attack surface. Several CVEs (CVE-2022-29582, CVE-2023-2598, and others) have been found in the subsystem. The kernel community has responded, but the threat model for multi-tenant systems is real.
Restricting Unprivileged Access
Since kernel 6.6, the kernel exposes a sysctl to control access:
# 0 = unrestricted (default on most distros)
# 1 = io_uring_setup requires CAP_SYS_ADMIN or membership in the group set by kernel.io_uring_group
# 2 = completely disabled
cat /proc/sys/kernel/io_uring_disabled
To restrict to privileged processes only (recommended for shared servers):
sudo sysctl -w kernel.io_uring_disabled=1
# Persist across reboots:
echo 'kernel.io_uring_disabled=1' | sudo tee /etc/sysctl.d/99-io_uring.conf
sudo sysctl --system
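An audit script can translate the raw value into a readable policy. This is a minimal sketch: the function names describe_io_uring_disabled and io_uring_policy are hypothetical, and the optional file argument exists only so the logic can be exercised without touching /proc.

```shell
# describe_io_uring_disabled VALUE -> human-readable meaning of the sysctl value
describe_io_uring_disabled() {
    case "$1" in
        0) echo "unrestricted" ;;
        1) echo "privileged only" ;;
        2) echo "disabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# io_uring_policy [FILE] -> describe the live setting
# FILE defaults to the real sysctl path; older kernels do not have it.
io_uring_policy() {
    f=${1:-/proc/sys/kernel/io_uring_disabled}
    if [ -r "$f" ]; then
        describe_io_uring_disabled "$(cat "$f")"
    else
        echo "sysctl not present"
    fi
}
```

Calling io_uring_policy with no argument reports the state of the running host, which is handy in a fleet-wide compliance check.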
Some container runtimes (Docker, Podman with default seccomp profiles) already block io_uring_setup inside containers. Verify with:
docker run --rm alpine:latest /bin/sh -c 'apk add -q fio && fio --name=test --ioengine=io_uring --filename=/tmp/t --rw=read --size=1m --bs=4k 2>&1 | head -5'
seccomp and systemd Syscall Filters
If you run untrusted code, add io_uring_setup, io_uring_enter, and io_uring_register to your seccomp deny list. For systemd services, you can restrict the syscalls in the unit file:
# In a .service file [Service] section:
SystemCallFilter=~io_uring_setup io_uring_enter io_uring_register
After editing, reload and verify:
sudo systemctl daemon-reload
sudo systemctl restart your-service
systemctl show your-service --property=SystemCallFilter
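Instead of editing a packaged unit file directly, the same filter can live in a systemd drop-in. The helper below is a sketch: the function name write_io_uring_dropin and the file name 10-no-io_uring.conf are arbitrary choices for illustration, and the directory is passed in so the function can be tried against a scratch path first.

```shell
# write_io_uring_dropin DIR
# Create DIR/10-no-io_uring.conf, a drop-in that denies the io_uring syscalls.
write_io_uring_dropin() {
    mkdir -p "$1" &&
    printf '%s\n' '[Service]' \
        'SystemCallFilter=~io_uring_setup io_uring_enter io_uring_register' \
        > "$1/10-no-io_uring.conf"
}
```

Run as root with a directory such as /etc/systemd/system/your-service.service.d, then reload and restart the service as shown above.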
Verifying io_uring is Functional
A quick smoke test without needing a spare block device — write a 512 MB test file:
fio \
--name=smoke \
--ioengine=io_uring \
--filename=/tmp/iou_smoke \
--rw=write \
--bs=64k \
--size=512m \
--iodepth=32 \
--numjobs=1 \
--direct=0 \
--output-format=terse | cut -d';' -f49
# Field 49 in terse (v3) output is write IOPS (field 8 is read IOPS); a non-zero value confirms io_uring is working
Check kernel tracepoints to watch the ring in action (requires root):
sudo perf trace -e io_uring:* -- fio --name=t --ioengine=io_uring \
--filename=/tmp/iou_t --rw=read --size=64m --bs=4k --iodepth=16 --runtime=5 --time_based 2>/dev/null | head -30
Troubleshooting
- fio reports engine not available: Your fio was compiled without io_uring support; fio --enghelp lists the engines it knows. Install a current distro package (not a manually compiled old binary) or build a recent fio from source. fio's engine issues the io_uring syscalls itself, so liburing (apt install liburing-dev / dnf install liburing-devel) is only needed if you build your own io_uring programs.
- SQPOLL fails with EPERM: Kernels before 5.11 require root for SQPOLL, and 5.11 and 5.12 require CAP_SYS_NICE; from 5.13 on no special privilege is needed. On older kernels, run as root or grant the capability via setcap cap_sys_nice+ep on your binary.
- io_uring is disabled and the sysctl cannot change that: Some hardened kernels and certain cloud images ship with io_uring disabled at build time. Use grep IO_URING /boot/config-$(uname -r); if CONFIG_IO_URING is not set, you need a different kernel.
- Performance no better than libaio: At low queue depths (< 16) or on rotational storage, the overhead difference is negligible. io_uring's gains are most visible at high concurrency on NVMe or fast network I/O.
- Kernel OOM or hangs under SQPOLL: Kernels older than 5.15 had several SQPOLL stability bugs. Update your kernel; the 5.15 LTS branch is the minimum recommendable for production SQPOLL use.
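The build-time check from the list above can be wrapped in a small helper for triage scripts. This is a sketch under the assumption that the distribution installs its kernel config as a plain-text file; the function name io_uring_build_status is invented for illustration.

```shell
# io_uring_build_status FILE
# Report whether the kernel config in FILE enables io_uring.
io_uring_build_status() {
    if [ ! -r "$1" ]; then
        echo "unknown"        # config file missing or unreadable
    elif grep -q '^CONFIG_IO_URING=y' "$1"; then
        echo "enabled"
    else
        echo "not built"      # option absent or "is not set"
    fi
}
```

Typical use is io_uring_build_status "/boot/config-$(uname -r)"; an answer of "not built" means no sysctl or capability change will help.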
Frequently asked questions
- Does io_uring work with buffered (non-O_DIRECT) file I/O?
- Yes — unlike Linux AIO, io_uring genuinely supports buffered reads and writes. Operations that cannot complete immediately are handed to an internal async worker pool rather than silently falling back to blocking in the calling thread.
- Is io_uring safe to enable on a public-facing server?
- With care. Set `kernel.io_uring_disabled=1` to limit access to privileged processes, keep your kernel patched (several CVEs landed between 2022–2024), and block the three io_uring syscalls in seccomp profiles for any untrusted workloads such as containers running user-supplied code.
- What applications already use io_uring in production?
- NGINX (since 1.25.x with a patch), RocksDB, Tokio (the async Rust runtime via tokio-uring), PostgreSQL (experimental), and several high-performance storage daemons like Ceph's Crimson OSD all have io_uring backends or are actively developing them.
- How do fixed buffers and registered files help performance?
- Every normal read/write forces the kernel to validate, pin, and unpin user-space memory pages and look up the file descriptor on each call. Registering buffers and files once amortises that cost across thousands of operations, saving measurable CPU time at very high IOPS rates.
- Can io_uring replace epoll for network servers?
- Effectively yes. Since kernel 5.6, io_uring supports accept, connect, send, and recv, and multishot accept (5.19+) lets a single SQE continuously produce new connections without re-arming. Libraries like liburing and frameworks like Glommio are designed around exactly this model.