Lock Down systemd Services (Sandboxing)
Harden Linux daemons using systemd sandboxing directives: ProtectSystem, PrivateTmp, NoNewPrivileges, CapabilityBoundingSet, and SystemCallFilter explained step by step.
Before you start
- systemd 232 or later (check with: systemctl --version)
- Kernel built with CONFIG_SECCOMP_FILTER enabled (required for SystemCallFilter)
- sudo or root access on the target system
- Basic familiarity with systemd unit files and journalctl
systemd ships a rich set of sandboxing directives that let you wrap a service in a tight security envelope without rewriting a single line of its code. When a misconfigured or compromised daemon escapes its expected behaviour, these directives limit the blast radius — restricting filesystem visibility, system call access, and privilege escalation paths. The following guide walks through the most impactful options: ProtectSystem, PrivateTmp, NoNewPrivileges, CapabilityBoundingSet, and SystemCallFilter.
How systemd Sandboxing Works
systemd leverages kernel namespaces, seccomp-BPF, and Linux capabilities to constrain services at start time. The restrictions live in a .service unit file under [Service] and are applied by the service manager before the service binary is executed. No kernel patches or external tools are needed. Most directives described here ship with systemd 232 or later on Linux 3.17 or later; a few are newer (RestrictNamespaces arrived in 233, LockPersonality in 235). Most modern LTS distributions are well within range.
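The version requirement can be checked in a script rather than by eye. The following is a minimal sketch, assuming the first line of systemctl --version looks like "systemd 249 (249.11-0ubuntu3)"; the function names systemd_major and systemd_at_least are hypothetical helpers, not part of systemd.

```shell
# Print the major version from `systemctl --version` output read on stdin.
# Assumes the first line starts with "systemd <number>".
systemd_major() {
  sed -n '1s/^systemd \([0-9][0-9]*\).*/\1/p'
}

# Succeed if the version read from stdin is at least $1 (default: 232).
systemd_at_least() {
  min="${1:-232}"
  major="$(systemd_major)"
  [ -n "$major" ] && [ "$major" -ge "$min" ]
}
```

Typical use would be: systemctl --version | systemd_at_least 232 && echo "systemd is new enough". For the kernel side, many distributions expose the build config at /boot/config-$(uname -r), where grep CONFIG_SECCOMP_FILTER shows whether seccomp filtering is built in; that path is an assumption and some systems provide /proc/config.gz instead.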
Always edit units with systemctl edit <service> to create a drop-in override rather than modifying the vendor-supplied unit directly. Drop-ins survive package upgrades.
Step 1: Create a Drop-in Override
Pick a service to harden. A network-facing service with no need for root is the ideal candidate — nginx, restic, or a custom application daemon all work well. The example below uses a generic myapp.service.
sudo systemctl edit myapp.service
This opens an editor with an empty override file at /etc/systemd/system/myapp.service.d/override.conf. All sandboxing directives go inside a [Service] block. Start with the section header and add directives as you progress through this guide.
[Service]
# directives added below
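When hardening is rolled out by a script or configuration management tool, an interactive editor is inconvenient. The sketch below writes the same drop-in file non-interactively; write_dropin is a hypothetical helper, and the optional second argument exists only so the function can be tried against a scratch directory.

```shell
# Write stdin to <root>/<unit>.d/override.conf, creating the directory if needed.
# $1: unit name, e.g. myapp.service
# $2: config root (default: /etc/systemd/system)
write_dropin() {
  unit="$1"
  root="${2:-/etc/systemd/system}"
  dir="$root/$unit.d"
  mkdir -p "$dir" || return 1
  cat > "$dir/override.conf"
}
```

Run as root, something like printf '[Service]\nNoNewPrivileges=true\n' | write_dropin myapp.service would produce the same file that systemctl edit creates. Unlike systemctl edit, this does not reload the manager, so follow it with systemctl daemon-reload.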
Step 2: Restrict Filesystem Access with ProtectSystem and ProtectHome
ProtectSystem bind-mounts parts of the OS read-only for the service. ProtectHome makes /home, /root, and /run/user either invisible or read-only.
- ProtectSystem=strict: mounts the entire filesystem tree read-only except for /dev, /proc, and /sys. The service can write only to paths explicitly allowed with ReadWritePaths=.
- ProtectSystem=full: makes /etc read-only in addition to /usr and the boot loader directories, which ProtectSystem=true already covers.
- ProtectHome=true: home directories are inaccessible and appear empty to the process.
- ProtectHome=tmpfs: home directories are hidden behind empty, read-only tmpfs mounts (useful if the service probes for home dirs).
[Service]
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/myapp /var/log/myapp
If you omit ReadWritePaths while using strict, any write the service attempts outside /dev, /proc, /sys, or a private /tmp (when PrivateTmp=true is set) will fail with EROFS. Map only the minimum paths the service genuinely writes to.
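Before switching to strict, it can help to compare the paths a service is known to write against the planned ReadWritePaths= value. The sketch below models only the prefix rule (a path is writable if it equals or lies beneath a listed path); it ignores /dev, /proc, /sys and private tmpfs mounts, does not handle paths containing spaces, and covered_by_rw_paths is a hypothetical name.

```shell
# Succeed if path $1 equals or lies beneath one of the
# space-separated prefixes in $2 (the ReadWritePaths= value).
covered_by_rw_paths() {
  path="$1"
  for prefix in $2; do
    prefix="${prefix%/}"   # tolerate a trailing slash on the prefix
    case "$path" in
      "$prefix"|"$prefix"/*) return 0 ;;
    esac
  done
  return 1
}
```

For example, covered_by_rw_paths /var/lib/myapp/db.sqlite "/var/lib/myapp /var/log/myapp" succeeds, while /var/lib/myapp2/cache fails because prefix matching stops at directory boundaries.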
Step 3: Isolate Temporary Files with PrivateTmp
PrivateTmp=true gives the service its own private /tmp and /var/tmp, set up inside a new mount namespace. Files placed there by the service are invisible to other processes and are cleaned up when the service stops.
[Service]
PrivateTmp=true
This prevents a common attack class where an adversary pre-creates predictable filenames in /tmp (symlink attacks, TOCTOU races) to redirect a privileged service into reading or writing attacker-controlled paths.
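One way to confirm that a service really runs in its own mount namespace is to compare the namespace links under /proc. This is a rough sketch: different_mount_ns is a hypothetical helper, and reading another process's /proc/<pid>/ns/mnt link generally requires root.

```shell
# Succeed if PIDs $1 and $2 are in different mount namespaces.
# Returns 2 if either namespace link cannot be read.
different_mount_ns() {
  a="$(readlink "/proc/$1/ns/mnt")" || return 2
  b="$(readlink "/proc/$2/ns/mnt")" || return 2
  [ "$a" != "$b" ]
}
```

Comparing the service's main PID against PID 1 (for example, different_mount_ns "$(systemctl show -p MainPID --value myapp.service)" 1, run as root) should succeed once PrivateTmp=true or any other mount-namespace directive is active.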
Step 4: Block Privilege Escalation with NoNewPrivileges
NoNewPrivileges=true sets the PR_SET_NO_NEW_PRIVS prctl flag on the service process and all its children. With it enabled, execve() can never gain new privileges — setuid and setgid bits on executables become inert, and the process cannot acquire capabilities it did not already hold.
[Service]
NoNewPrivileges=true
This is one of the highest-value single-line hardening measures available. It blocks an entire family of local privilege escalation exploits. Enable it unless the service intentionally relies on setuid helpers (e.g., sudo, su, PAM stacks that call setuid binaries).
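Note: NoNewPrivileges=true is tied to SystemCallFilter when the service does not start as root. The kernel lets a process without CAP_SYS_ADMIN install a seccomp filter only if the no-new-privs flag is set, so systemd implies NoNewPrivileges=yes in that case (for example, when User= is set). Setting it explicitly keeps the behaviour the same for root-started services.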
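Whether the flag is actually set on a running process can be read from /proc/<pid>/status, which carries a NoNewPrivs field on kernels 4.10 and later and a Seccomp field (0 = off, 1 = strict, 2 = filter). The helper name proc_status_field below is hypothetical; it simply extracts one field from status text on stdin.

```shell
# Print the value of field $1 (e.g. NoNewPrivs, Seccomp)
# from /proc/<pid>/status content read on stdin.
proc_status_field() {
  awk -v key="$1:" '$1 == key { print $2; exit }'
}
```

A possible invocation: proc_status_field NoNewPrivs < "/proc/$(systemctl show -p MainPID --value myapp.service)/status". A value of 1 means the flag is in effect for the service's main process.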
Step 5: Drop Linux Capabilities with CapabilityBoundingSet
Linux capabilities divide traditional root power into discrete units. Even if a service runs as root, you can strip capabilities it should never need. CapabilityBoundingSet sets the hard ceiling — capabilities not listed here can never be acquired, even via setuid.
[Service]
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
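The example above is appropriate for a service that must bind to port 80 or 443 but otherwise needs no root privilege. AmbientCapabilities matters when the service runs as a non-root user (via User=): it hands the listed capability to the otherwise unprivileged process. A service that needs no capabilities at all should use an empty set: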
[Service]
CapabilityBoundingSet=
AmbientCapabilities=
Common capabilities to consider removing: CAP_SYS_ADMIN, CAP_NET_RAW, CAP_SETUID, CAP_SETGID, CAP_SYS_PTRACE. Use getpcaps <pid> to inspect what a running process actually holds, or capsh --print to see the capabilities of the current shell.
getpcaps $(pgrep -x myapp)
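The bounding set itself appears as a hexadecimal mask in the CapBnd field of /proc/<pid>/status. The sketch below tests a single bit in such a mask; has_cap_bit is a hypothetical helper, and the bit numbers come from linux/capability.h (CAP_NET_BIND_SERVICE is 10, CAP_SYS_ADMIN is 21).

```shell
# Succeed if capability bit $2 is set in hexadecimal mask $1.
# $1: mask without a 0x prefix, as shown in /proc/<pid>/status (CapBnd)
# $2: capability number, e.g. 10 for CAP_NET_BIND_SERVICE
has_cap_bit() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}
```

With the CapabilityBoundingSet=CAP_NET_BIND_SERVICE example, the mask should be 0000000000000400, so has_cap_bit 0000000000000400 10 succeeds and the same call with 21 fails. capsh --decode=<mask> prints the capability names for a mask if you prefer a readable list.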
Step 6: Filter System Calls with SystemCallFilter
SystemCallFilter uses seccomp-BPF to allow or deny individual syscalls. When a denied syscall is attempted, the kernel kills the process with SIGSYS by default — or returns an error code if you use the errno action.
systemd ships predefined syscall sets grouped by function, prefixed with @. Start with a permissive allowlist and tighten over time:
[Service]
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
@system-service is a broad set covering what well-behaved daemons typically need. The second line (~ prefix means deny) removes the @privileged and @resources groups from that set. Useful predefined groups include:
- @system-service: general daemon syscalls (read, write, open, socket, etc.)
- @network-io: socket, bind, connect, sendmsg, recvmsg
- @privileged: calls that require superuser capabilities, such as chroot, capset, and setuid, plus module loading and kexec through nested sets
- @resources: setrlimit, ioprio_set, and similar resource manipulation
- @obsolete: syscalls that should never appear in modern software
To return a benign error instead of killing the process on a denied syscall (useful for debugging or when a library probes for features):
[Service]
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM
Use strace -c -p <pid> to profile what syscalls a running service actually makes before locking it down hard.
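The summary table that strace -c prints can be reduced to a plain list of syscall names, which is easier to compare against the predefined groups. This is a sketch that assumes the usual summary layout (a percentage in the first column and the syscall name in the last); strace_syscalls is a hypothetical helper.

```shell
# Print the unique syscall names from `strace -c` summary output on stdin.
# Rows are recognised by a leading percentage; the "total" row is skipped.
strace_syscalls() {
  awk '$1 ~ /^[0-9]+\.[0-9]+$/ && $NF != "total" { print $NF }' | sort -u
}
```

Since strace writes its summary to stderr, one approach is to capture it with -o, for example sudo strace -f -c -o /tmp/myapp.strace -p <pid> (the output path is just an example), interrupt it after a representative workload, and then run strace_syscalls < /tmp/myapp.strace.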
Step 7: Apply Additional High-Value Directives
A few more directives complement the above with minimal compatibility risk:
[Service]
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
- PrivateDevices: replaces /dev with a minimal set; no raw device access.
- ProtectKernelTunables: makes /proc/sys and /sys read-only.
- ProtectKernelModules: blocks CAP_SYS_MODULE and module loading.
- RestrictAddressFamilies: limits socket families; remove AF_INET6 if the service is IPv4-only.
- MemoryDenyWriteExecute: prevents JIT or self-modifying code; disable for JVM/Node.js/Python services.
Verification
After saving the override, reload the daemon and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart myapp.service
sudo systemctl status myapp.service
Check the security exposure score systemd computes for the unit (the security verb of systemd-analyze is available from systemd 240):
systemd-analyze security myapp.service
The output rates each directive and gives an overall exposure score from 0 (fully locked) to 10 (no sandboxing). A well-hardened daemon should score below 3. The command also flags directives not yet set, so you can iterate.
# Realistic output fragment (will vary)
# NAME DESCRIPTION EXPOSURE
# PrivateNetwork= Service has access to the host network 0.5
# CapabilityBoundingSet=~... Service has no capability 0.0
# Overall exposure level for myapp.service: 2.1 OK
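To keep a hardened unit from regressing, the overall score can be checked in a script. The sketch below assumes the summary line has the form shown in the fragment above ("Overall exposure level for <unit>: <score> ..."); exposure_score and exposure_below are hypothetical helper names.

```shell
# Print the overall exposure score from `systemd-analyze security <unit>`
# output read on stdin.
exposure_score() {
  sed -n 's/.*Overall exposure level for [^:]*: \([0-9][0-9.]*\).*/\1/p'
}

# Succeed if the score read from stdin is strictly below $1 (default: 3).
exposure_below() {
  limit="${1:-3}"
  score="$(exposure_score)"
  [ -n "$score" ] &&
    awk -v s="$score" -v l="$limit" 'BEGIN { exit !(s + 0 < l + 0) }'
}
```

A deployment check might then read: systemd-analyze security myapp.service | exposure_below 3 || echo "myapp.service needs more hardening". The threshold of 3 mirrors the target suggested above and can be adjusted per service.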
Troubleshooting
Service fails to start after adding SystemCallFilter
Check the journal for SIGSYS or Bad system call:
journalctl -u myapp.service -n 50
Use SystemCallErrorNumber=EPERM temporarily — this lets the process continue despite blocked calls so you can identify which syscall group is the problem. Then use strace to find the specific call and add the appropriate @group to your allowlist.
ProtectSystem=strict breaks file writes
Add the specific path to ReadWritePaths=. Multiple paths are space-separated. If the service writes to a socket or pipe instead of a regular file, check that the socket path itself is not under a read-only mount.
PrivateTmp breaks a service that communicates via /tmp sockets
Move the socket to /run/myapp/ (which is writable and shared) and update both the service and client configurations accordingly. Use RuntimeDirectory=myapp in the unit to have systemd create and own the directory automatically.
[Service]
RuntimeDirectory=myapp
RuntimeDirectoryMode=0750
Frequently asked questions
- Will these directives affect all distributions equally?
- The directives themselves are portable across any systemd-based distro, but some kernel features (e.g., seccomp-BPF for SystemCallFilter) require a kernel built with CONFIG_SECCOMP_FILTER. All major distro kernels on current LTS releases enable this by default.
- Can I apply sandboxing to a service that runs as root?
- Yes, and it is especially valuable there. CapabilityBoundingSet and SystemCallFilter both restrict root processes. ProtectSystem and PrivateTmp work independently of the UID the service runs as.
- Does MemoryDenyWriteExecute break JVM, Python, or Node.js services?
- Yes — JIT-compiling runtimes map memory as writable and then executable, which this directive blocks. Omit MemoryDenyWriteExecute for services running on those runtimes. All other directives still apply.
- How do I find out which syscall group is causing a SIGSYS crash?
- Set SystemCallErrorNumber=EPERM so the service survives blocked calls, then run strace -c -p <pid> to count syscall usage. Match the unexpected calls against the syscall group definitions printed by systemd-analyze syscall-filter; the groups are also described in man 5 systemd.exec.
- Are drop-in overrides safe to use with third-party package units?
- Yes, drop-ins are the correct approach. Never edit files under /lib/systemd/system/ directly; package upgrades overwrite them. Drop-ins in /etc/systemd/system/<name>.service.d/ are preserved and take precedence.