Hugepages and Transparent Huge Pages
Learn when Transparent Huge Pages hurt database performance, how to disable THP persistently with systemd, and how to configure static hugepages for PostgreSQL and other databases.
Before you start
- Root or sudo access on the target system
- Basic understanding of Linux memory concepts (pages, shared memory, TLB)
- A database already installed (PostgreSQL 15+ used in examples)
- Sufficient free RAM to satisfy the hugepage reservation
Linux hugepages let the kernel map memory in 2 MB (or 1 GB) chunks instead of the default 4 KB pages. Fewer page-table entries means fewer TLB misses, which matters enormously for workloads that touch large, contiguous memory regions—databases being the classic case. There are two mechanisms: static hugepages (reserved at boot, used via mmap(MAP_HUGETLB) or shared memory) and Transparent Huge Pages (THP), the kernel's attempt to give you the benefit automatically. Understanding when each helps—and when THP actively hurts—is the difference between a fast database and a mysteriously slow one.
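To put numbers on the page-table savings, here is a minimal sketch that counts how many pages are needed to map one region at each page size. The helper name pages_to_map and the 32 GiB region are illustrative assumptions, not part of any tool.

```shell
# Number of pages (and so leaf page-table entries) needed to map a region.
# $1: region size in bytes, $2: page size in bytes. Rounds up.
pages_to_map() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

region=$(( 32 * 1024 * 1024 * 1024 ))              # 32 GiB, an illustrative size
pages_to_map "$region" $(( 4 * 1024 ))             # 4 KB pages: prints 8388608
pages_to_map "$region" $(( 2 * 1024 * 1024 ))      # 2 MB pages: prints 16384
pages_to_map "$region" $(( 1024 * 1024 * 1024 ))   # 1 GB pages: prints 32
```

The same region needs 512 times fewer mappings with 2 MB pages, which is why a limited number of TLB entries covers far more memory.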
How THP Works and Why It Can Hurt
THP takes effect in two ways. The kernel can hand out a 2 MB hugepage directly at page-fault time, and khugepaged, a kernel thread, continuously scans virtual memory, finds 2 MB-aligned regions of 4 KB pages, and collapses them into a single hugepage. The kernel also does the reverse (splits hugepages) when memory pressure demands it. This sounds ideal, but the reality is more complicated.
- Allocation latency spikes: Collapsing or splitting hugepages is not free. Under memory pressure, khugepaged activity shows up as latency jitter, which is exactly what a latency-sensitive database cannot tolerate.
- Fragmentation: THP requires 2 MB of physically contiguous memory. On a long-running server, this becomes increasingly hard to satisfy, triggering compaction work that stalls application threads.
- Copy-on-write penalty: When a process forks (common in PostgreSQL), a CoW fault on a 2 MB hugepage can copy 2 MB instead of 4 KB.
- Database-specific pain: Oracle, MySQL InnoDB, PostgreSQL, Redis, and MongoDB have all been associated with THP-related latency anomalies. Oracle and Redis explicitly recommend disabling it. Check the vendor guidance for your version, since recommendations have changed over time.
Where THP genuinely helps: batch analytics workloads, HPC jobs, and JVM applications doing large, sequential heap operations—situations where allocations are predictable and latency variance is acceptable.
Checking Current State
Before changing anything, read the current configuration.
cat /sys/kernel/mm/transparent_hugepage/enabled
Output will look like: [always] madvise never. The bracketed value is active. always means THP is on for all anonymous memory. madvise means only for regions that explicitly request it. never disables THP entirely.
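If you want to use the active value in a script, the bracketed token can be extracted with sed. This is a small sketch; the function name thp_active_mode is made up for illustration, and it reads the line from standard input so it works for both the enabled and defrag files.

```shell
# Print the active (bracketed) value from a THP sysfs line such as
# "always madvise [never]". Reads the line from standard input.
thp_active_mode() {
    sed -n 's/.*\[\([a-z+_]*\)].*/\1/p'
}

printf '%s\n' 'always madvise [never]' | thp_active_mode   # prints: never

# On a live system:
#   thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled
```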
cat /sys/kernel/mm/transparent_hugepage/defrag
This controls memory compaction aggressiveness. always here is the most dangerous setting for latency; it blocks allocation until a hugepage can be assembled.
grep -i hugepage /proc/meminfo
Shows static hugepage pool size, usage, and free count. Also shows AnonHugePages (THP in use).
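The raw fields are easier to read once converted to megabytes. The sketch below is one way to do that with awk; hugepage_summary is a hypothetical helper, and the sample numbers are invented. Note that "used" here is simply total minus free, so pages that are reserved but not yet touched still count as free.

```shell
# Summarise hugepage fields from /proc/meminfo-style text on standard input.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        /^AnonHugePages:/   { thp_kb = $2 }
        END {
            printf "static_pool_mb=%d static_used_mb=%d thp_in_use_mb=%d\n",
                   total * size_kb / 1024,
                   (total - free) * size_kb / 1024,
                   thp_kb / 1024
        }'
}

# Sample input with made-up numbers.
# On a live system run: hugepage_summary < /proc/meminfo
hugepage_summary <<'EOF'
AnonHugePages:      4096 kB
HugePages_Total:     100
HugePages_Free:       40
Hugepagesize:       2048 kB
EOF
```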
Disabling THP System-Wide for Database Hosts
Runtime changes take effect immediately but don't survive reboots. Make them persistent with a systemd unit.
Runtime change (all distributions)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Persistent via systemd (recommended)
Create a one-shot service that runs before your database starts.
sudo tee /etc/systemd/system/disable-thp.service <<'EOF'
[Unit]
Description=Disable Transparent Huge Pages
DefaultDependencies=no
After=sysinit.target local-fs.target
Before=basic.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes
[Install]
WantedBy=basic.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now disable-thp.service
Kernel command-line (optional belt-and-suspenders)
Add transparent_hugepage=never to your bootloader. On systems using GRUB:
# Debian/Ubuntu
sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="transparent_hugepage=never /' /etc/default/grub
sudo update-grub
# Fedora/RHEL/Rocky
sudo grubby --update-kernel=ALL --args="transparent_hugepage=never"
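After the next reboot you can confirm the parameter actually reached the kernel by looking for it in /proc/cmdline. A minimal sketch, assuming a helper named cmdline_has (hypothetical) that matches whole space-separated tokens only:

```shell
# Succeed if the space-separated kernel command line in $1
# contains the exact token $2.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Sample command line, for illustration only.
sample='BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro transparent_hugepage=never quiet'
if cmdline_has "$sample" transparent_hugepage=never; then
    echo "THP disabled on the kernel command line"
fi

# On a live system:
#   cmdline_has "$(cat /proc/cmdline)" transparent_hugepage=never
```

The same check can guard the sed command above, which otherwise prepends the parameter again each time it is run.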
Configuring Static Hugepages for Databases
Static hugepages are reserved at boot (or shortly after) and never swapped. PostgreSQL uses them via shared memory (huge_pages = on); Oracle SGA and MySQL InnoDB buffer pool can also use them directly.
Calculate how many pages you need
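For PostgreSQL, the value to cover is the server's shared memory segment, which is dominated by shared_buffers but is somewhat larger. If shared_buffers = 32GB, you need at least 16,384 pages of 2 MB each, plus a small buffer. On PostgreSQL 15 and later, the shared_memory_size_in_huge_pages parameter reports the exact page count; read it with postgres -C while the server is stopped.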
# Show current hugepage size (usually 2048 kB = 2 MB)
grep Hugepagesize /proc/meminfo
# Calculate: ceil(shared_buffers_bytes / hugepage_size_bytes)
# Example: 32 GB shared_buffers
python3 -c "import math; print(math.ceil(32*1024**3 / (2*1024**2)))"
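The two steps above can be combined so the hugepage size is read from /proc/meminfo rather than assumed. This is a sketch only: nr_hugepages_for and its headroom-percentage argument are hypothetical, and the headroom you need depends on your configuration.

```shell
# Print a vm.nr_hugepages value covering a given amount of shared memory.
# $1: shared memory to cover, in MB
# $2: extra headroom in percent (a hypothetical knob, rounded up)
# stdin: /proc/meminfo-style text containing a Hugepagesize line
nr_hugepages_for() {
    awk -v need_mb="$1" -v pct="$2" '
        /^Hugepagesize:/ { size_kb = $2 }
        END {
            if (size_kb <= 0) {
                print "Hugepagesize not found" > "/dev/stderr"
                exit 1
            }
            pages = int((need_mb * 1024 + size_kb - 1) / size_kb)
            extra = int((pages * pct + 99) / 100)
            print pages + extra
        }'
}

# 32 GB of shared memory, 2 MB pages, 1% headroom: prints 16548
printf 'Hugepagesize:       2048 kB\n' | nr_hugepages_for 32768 1

# On a live system:
#   nr_hugepages_for 32768 1 < /proc/meminfo
```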
Reserve hugepages at runtime
sudo sysctl -w vm.nr_hugepages=16400
The kernel allocates hugepages from contiguous free memory. Do this early after boot before memory becomes fragmented. If the system can't satisfy the full count, it allocates as many as it can—check /proc/meminfo to confirm.
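Because a partial allocation is silent, it is worth checking the result in a script. A minimal sketch, with check_hugepage_pool as a hypothetical helper that compares HugePages_Total against the count you asked for:

```shell
# Compare HugePages_Total in /proc/meminfo-style text (stdin)
# with the requested page count ($1). Exit status 1 on a shortfall.
check_hugepage_pool() {
    awk -v want="$1" '
        BEGIN { got = 0 }
        /^HugePages_Total:/ { got = $2 }
        END {
            if (got + 0 >= want + 0) {
                print "ok: " got " pages reserved"
                exit 0
            }
            print "short: " got " of " want " pages reserved"
            exit 1
        }'
}

# Sample input, for illustration only.
printf 'HugePages_Total:   16400\n' | check_hugepage_pool 16400

# On a live system:
#   check_hugepage_pool 16400 < /proc/meminfo
```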
Make it persistent via sysctl
echo 'vm.nr_hugepages = 16400' | sudo tee /etc/sysctl.d/90-hugepages.conf
sudo sysctl --system
Set hugepage limits for the postgres user (PostgreSQL)
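By default PostgreSQL maps its main shared memory area with mmap(), adding MAP_HUGETLB when huge pages are requested; with shared_memory_type = sysv it uses shmget(SHM_HUGETLB) instead. Depending on the configuration, the process may need a sufficient locked-memory limit.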
sudo tee /etc/security/limits.d/postgres-hugepages.conf <<'EOF'
postgres soft memlock unlimited
postgres hard memlock unlimited
EOF
Then set huge_pages = on in postgresql.conf. With on, PostgreSQL refuses to start if it cannot allocate hugepages; with the default huge_pages = try it falls back to regular pages without failing, so on is the setting that makes a shortfall visible.
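One caveat: files under /etc/security/limits.d are applied by PAM to login sessions, not to services started by systemd. If PostgreSQL runs as a systemd service and needs a memlock limit in your setup, the limit is set in the unit instead. The sketch below writes such a drop-in; the helper name, the file name memlock.conf, and the unit name postgresql.service are assumptions, so check the actual unit name on your distribution.

```shell
# Write a systemd drop-in that lifts the locked-memory limit for a service.
# $1: drop-in directory, e.g. /etc/systemd/system/postgresql.service.d
#     (the unit name is an assumption; check yours with systemctl status).
write_memlock_dropin() {
    mkdir -p "$1" || return 1
    printf '[Service]\nLimitMEMLOCK=infinity\n' > "$1/memlock.conf"
}

# As root, on a live system:
#   write_memlock_dropin /etc/systemd/system/postgresql.service.d
#   systemctl daemon-reload && systemctl restart postgresql
```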
Fedora/RHEL: hugetlbfs mount
Some applications (Oracle, custom C code using MAP_HUGETLB) need the hugetlbfs filesystem mounted.
sudo mkdir -p /dev/hugepages
sudo mount -t hugetlbfs nodev /dev/hugepages
# Persist it
echo 'nodev /dev/hugepages hugetlbfs defaults 0 0' | sudo tee -a /etc/fstab
1 GB hugepages (NUMA-aware servers)
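1 GB pages should be reserved at boot time via the kernel command line. Recent kernels accept runtime requests for them, but finding 1 GB of physically contiguous memory on a running system rarely succeeds.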
# Fedora/RHEL/Rocky
sudo grubby --update-kernel=ALL --args="hugepagesz=1G hugepages=32"
# Debian/Ubuntu (edit /etc/default/grub then run update-grub)
# Add to GRUB_CMDLINE_LINUX: hugepagesz=1G hugepages=32
Verification
# Confirm THP is disabled
cat /sys/kernel/mm/transparent_hugepage/enabled
# Expected: always madvise [never]
# Confirm static hugepage pool
grep -E 'HugePages_(Total|Free|Rsvd)' /proc/meminfo
# For PostgreSQL: confirm hugepages are in use after pg restart
sudo -u postgres psql -c "SHOW huge_pages;"
sudo -u postgres psql -c "SELECT name, setting FROM pg_settings WHERE name LIKE '%huge%';"
# Watch khugepaged activity to confirm it's idle
grep thp /proc/vmstat | grep -v ' 0$'
If THP is properly disabled, thp_collapse_alloc and thp_fault_alloc counters in /proc/vmstat should stop incrementing after your database restarts.
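To see whether the counters are still moving, compare two snapshots taken some time apart. A minimal sketch, assuming a helper named thp_counter_deltas (hypothetical) and snapshot files you create yourself:

```shell
# Print THP counters whose values differ between two /proc/vmstat snapshots.
# $1: file with the earlier snapshot, $2: file with the later snapshot.
thp_counter_deltas() {
    awk 'NR == FNR { if ($1 ~ /^thp_/) before[$1] = $2; next }
         $1 ~ /^thp_/ && $2 != before[$1] { print $1, $2 - before[$1] }' "$1" "$2"
}

# On a live system (paths and the 60-second interval are arbitrary):
#   grep thp /proc/vmstat > /tmp/thp.before
#   sleep 60
#   grep thp /proc/vmstat > /tmp/thp.after
#   thp_counter_deltas /tmp/thp.before /tmp/thp.after
```

No output means no THP counter changed between the two snapshots.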
Troubleshooting
- Hugepages not fully allocated: Memory fragmentation prevents the kernel from satisfying vm.nr_hugepages. Reboot (fragmentation resets) or try echo 1 > /proc/sys/vm/compact_memory to trigger compaction, then re-check HugePages_Total.
- PostgreSQL still not using hugepages: Check huge_pages = on is set (not try) and that shared_buffers is less than or equal to the reserved hugepage pool. Verify memlock limits are in effect (ulimit -l as the postgres user).
- disable-thp.service fails on boot: Ensure After=sysinit.target is correct for your init ordering. On some minimal images, replace with After=local-fs.target. Check journalctl -u disable-thp.service.
- THP reappears after package update: Some tuning packages (like tuned on RHEL with a throughput-performance profile) re-enable THP. Check tuned-adm active and switch to latency-performance or create a custom profile.
- NUMA systems: On multi-socket servers, verify hugepages are reserved on each NUMA node with cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages. Use vm.nr_hugepages_mempolicy for NUMA-aware allocation.
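The per-node check in the last item prints bare numbers without saying which node each belongs to. A small sketch that labels them; per_node_hugepages is a hypothetical helper, and the optional directory argument exists only so it can be pointed at a test tree.

```shell
# Print "nodeN <count>" for each NUMA node's 2 MB hugepage pool.
# $1: sysfs node directory; defaults to /sys/devices/system/node.
per_node_hugepages() {
    base=${1:-/sys/devices/system/node}
    for f in "$base"/node*/hugepages/hugepages-2048kB/nr_hugepages; do
        [ -r "$f" ] || continue      # no match, or file not readable
        node=${f#"$base"/}           # strip the base directory
        node=${node%%/*}             # keep only the "nodeN" component
        echo "$node $(cat "$f")"
    done
}

per_node_hugepages
```

A badly uneven split across nodes is a hint that the pool was reserved after memory on one node had already fragmented.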
Frequently asked questions
- Can I use THP in 'madvise' mode instead of disabling it entirely for databases?
- Yes, madvise mode only activates THP for memory regions that explicitly call madvise(MADV_HUGEPAGE). Most databases do not issue this call, so madvise is effectively equivalent to never for them—but always verify with /proc/vmstat thp_fault_alloc after switching.
- Does disabling THP affect application performance outside the database?
- Potentially yes for JVM workloads or HPC jobs that benefit from THP. On dedicated database servers this is not a concern, but on mixed-workload hosts consider madvise mode, or opt individual processes out with prctl(PR_SET_THP_DISABLE), which is inherited by child processes.
- Why are 1 GB hugepages reserved at boot when 2 MB hugepages can be reserved at runtime?
- 1 GB hugepages require 1 GB of physically contiguous memory, which can only be guaranteed before the allocator has distributed memory across many small allocations. 2 MB pages are small enough that the compaction subsystem can often assemble them on a running system.
- Does the tuned daemon on RHEL/Rocky override my THP settings?
- Yes. Profiles like throughput-performance explicitly re-enable THP. Either switch to latency-performance (tuned-adm profile latency-performance), create a custom profile, or disable tuned entirely on dedicated database hosts.
- How do I confirm PostgreSQL is actually using hugepages and not silently falling back?
- Restart PostgreSQL with huge_pages = on. A successful start is itself the confirmation, because with on the server refuses to start rather than fall back (with try it falls back quietly). Also check HugePages_Rsvd and HugePages_Free in /proc/meminfo: the pool should show pages reserved or in use roughly matching your shared memory size. On PostgreSQL 17 and later, SHOW huge_pages_status reports the outcome directly.