Will these directives affect all distributions equally?

The directives themselves are portable across any systemd-based distro, but some kernel features (e.g., seccomp-BPF for SystemCallFilter) require a kernel built with CONFIG_SECCOMP_FILTER. All major distro kernels on current LTS releases enable this by default.
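
One way to confirm the kernel side is to look for the option in the kernel build config. The sketch below assumes the config is available as a plain-text file (often /boot/config-$(uname -r), though not on every distro); the function name has_seccomp_filter is illustrative.

```shell
# Report whether a kernel build config enables seccomp filtering.
# Assumption: $1 is a plain-text kernel config, e.g. /boot/config-$(uname -r).
# Some distros only expose it compressed at /proc/config.gz.
has_seccomp_filter() {
    if grep -q '^CONFIG_SECCOMP_FILTER=y$' "$1" 2>/dev/null; then
        echo "enabled"
    else
        echo "not found"
    fi
}

# Usage on a live system:
#   has_seccomp_filter "/boot/config-$(uname -r)"
```

A "not found" result only means the option was not seen in that file; it may still be worth checking /proc/config.gz before drawing conclusions.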

Can I apply sandboxing to a service that runs as root?

Yes, and it is especially valuable there. CapabilityBoundingSet and SystemCallFilter both restrict root processes. ProtectSystem and PrivateTmp work independently of the UID the service runs as.

Does MemoryDenyWriteExecute break JVM, Python, or Node.js services?

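Often, yes. JIT-compiling runtimes such as the JVM and Node.js map memory as writable and then executable, which this directive blocks. CPython does not JIT by default, but extension modules that generate code at run time can hit the same restriction. Omit MemoryDenyWriteExecute for services that fail with it enabled. All other directives still apply.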

How do I find out which syscall group is causing a SIGSYS crash?

Set SystemCallErrorNumber=EPERM so the service survives blocked calls, then run strace -c -p <pid> to count syscall usage. Match the unexpected calls against the syscall group definitions printed by systemd-analyze syscall-filter and documented in man 5 systemd.exec.

Are drop-in overrides safe to use with third-party package units?

Yes, drop-ins are the correct approach. Never edit files under /lib/systemd/system/ directly; package upgrades overwrite them. Drop-ins in /etc/systemd/system/<unit>.service.d/ are preserved and take precedence.

Lock Down systemd Services (Sandboxing)

systemd ships a rich set of sandboxing directives that let you wrap a service in a tight security envelope without rewriting a single line of its code. When a misconfigured or compromised daemon escapes its expected behaviour, these directives limit the blast radius — restricting filesystem visibility, system call access, and privilege escalation paths. The following guide walks through the most impactful options: ProtectSystem, PrivateTmp, NoNewPrivileges, CapabilityBoundingSet, and SystemCallFilter.

How systemd Sandboxing Works

systemd leverages kernel namespaces, seccomp-BPF, and Linux capabilities to constrain services at start time. The restrictions live in a .service unit file under [Service] and are applied by the service manager before the process ever runs. No kernel patches or external tools are needed. Most directives described here are available from systemd 232 and Linux 3.17; a few arrived later, for example RestrictNamespaces (233), LockPersonality (235), and the @system-service syscall group (239). Most modern LTS distributions are well within range.

Always edit units with systemctl edit <service> to create a drop-in override rather than modifying the vendor-supplied unit directly. Drop-ins survive package upgrades.

Step 1: Create a Drop-in Override

Pick a service to harden. A network-facing service with no need for root is the ideal candidate — nginx, restic, or a custom application daemon all work well. The example below uses a generic myapp.service.

sudo systemctl edit myapp.service

This opens an editor with an empty override file at /etc/systemd/system/myapp.service.d/override.conf. All sandboxing directives go inside a [Service] block. Start with the section header and add directives as you progress through this guide.

[Service]
# directives added below
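
For automation, the same drop-in can be created without an interactive editor. This is a minimal sketch: the function name write_dropin and the file name hardening.conf are illustrative choices, and the first argument is assumed to be the systemd configuration root (normally /etc/systemd/system).

```shell
# Create a drop-in file containing only the [Service] header.
# Assumptions: $1 is the config root (normally /etc/systemd/system),
# $2 is the unit name; hardening.conf is an arbitrary file name.
write_dropin() (
    root=$1
    unit=$2
    dir="$root/$unit.d"
    mkdir -p "$dir" || exit 1
    # Start with the section header; directives are appended later.
    printf '[Service]\n' > "$dir/hardening.conf"
    echo "$dir/hardening.conf"
)

# Usage as root on a live system:
#   write_dropin /etc/systemd/system myapp.service
#   systemctl daemon-reload
```

Unlike systemctl edit, this approach does not reload systemd automatically, so the daemon-reload step has to be run by hand.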

Step 2: Restrict Filesystem Access with ProtectSystem and ProtectHome

ProtectSystem bind-mounts parts of the OS read-only for the service. ProtectHome makes /home, /root, and /run/user either invisible or read-only.

ProtectSystem=strict: mounts the entire filesystem tree read-only except for /dev, /proc, and /sys. The service can write only to paths explicitly allowed with ReadWritePaths=.
ProtectSystem=full: makes /etc read-only in addition to /usr and the boot loader directories that ProtectSystem=true already covers.
ProtectHome=true: home directories appear empty and are inaccessible to the process.
ProtectHome=tmpfs: home directories are hidden behind empty, read-only tmpfs mounts (useful if the service probes for home dirs).

[Service]
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/myapp /var/log/myapp

If you omit ReadWritePaths while using strict, any write the service attempts outside a tmpfs will fail with EROFS. Map only the minimum paths the service genuinely writes to.
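
When auditing a list of paths the service writes to, it can help to check each one against the ReadWritePaths= entries before restarting. The helper below is only a rough model using prefix matching on absolute paths; it does not resolve symlinks or reproduce how systemd builds the mount namespace, and the name is_writable_path is illustrative.

```shell
# Report whether a path falls under one of the given ReadWritePaths= entries.
# Assumption: plain prefix matching on absolute, already-normalised paths.
# Usage: is_writable_path <target> <allowed-path>...
is_writable_path() (
    target=$1
    shift
    for allowed in "$@"; do
        case "$target" in
            # Exact match, or anything below the allowed directory.
            "$allowed"|"$allowed"/*) echo "writable"; exit 0 ;;
        esac
    done
    echo "read-only"
)

# Example:
#   is_writable_path /var/lib/myapp/db.sqlite /var/lib/myapp /var/log/myapp
```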

Step 3: Isolate Temporary Files with PrivateTmp

PrivateTmp=true gives the service its own private /tmp and /var/tmp inside a new mount namespace. Files placed there by the service are invisible to other processes and are cleaned up when the service stops.

[Service]
PrivateTmp=true

This prevents a common attack class where an adversary pre-creates predictable filenames in /tmp (symlink attacks, TOCTOU races) to redirect a privileged service into reading or writing attacker-controlled paths.

Step 4: Block Privilege Escalation with NoNewPrivileges

NoNewPrivileges=true sets the PR_SET_NO_NEW_PRIVS prctl flag on the service process and all its children. With it enabled, execve() can never gain new privileges — setuid and setgid bits on executables become inert, and the process cannot acquire capabilities it did not already hold.

[Service]
NoNewPrivileges=true

This is one of the highest-value single-line hardening measures available. It blocks an entire family of local privilege escalation exploits. Enable it unless the service intentionally relies on setuid helpers (e.g., sudo, su, PAM stacks that call setuid binaries).

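Note: SystemCallFilter depends on the same flag. The kernel lets a process without CAP_SYS_ADMIN install a seccomp filter only when no-new-privs is set, so systemd implies NoNewPrivileges=yes for such services (for example, those using User=). Setting it explicitly keeps the behaviour visible in the unit.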
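
To confirm the flag actually took effect on a running service, recent kernels expose it as the NoNewPrivs field in /proc/<pid>/status. A small sketch, assuming that field is present (older kernels may not report it); the function name is illustrative.

```shell
# Print the NoNewPrivs value (0 or 1) from a /proc/<pid>/status file.
# Assumption: the kernel reports a "NoNewPrivs:" line in that file.
no_new_privs() {
    awk '$1 == "NoNewPrivs:" { print $2 }' "$1"
}

# Usage on a live system:
#   no_new_privs "/proc/$(pgrep -x myapp)/status"
```

A value of 1 means the flag is set; empty output means the field was not found in the file.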

Step 5: Drop Linux Capabilities with CapabilityBoundingSet

Linux capabilities divide traditional root power into discrete units. Even if a service runs as root, you can strip capabilities it should never need. CapabilityBoundingSet sets the hard ceiling — capabilities not listed here can never be acquired, even via setuid.

[Service]
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE

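The example above is appropriate for a service that must bind to port 80 or 443 but otherwise needs no root privilege. AmbientCapabilities matters when the service runs as a non-root User=, because it passes the capability on to the unprivileged process. A service that needs no capabilities at all should use an empty set: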

[Service]
CapabilityBoundingSet=
AmbientCapabilities=

Common capabilities to consider removing: CAP_SYS_ADMIN, CAP_NET_RAW, CAP_SETUID, CAP_SETGID, CAP_SYS_PTRACE. Use getpcaps <pid> to inspect what a running process actually holds; capsh --print reports the same for the current shell.

getpcaps $(pgrep -x myapp)
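
If getpcaps is not installed, the bounding set can also be read as a hex mask from the CapBnd field of /proc/<pid>/status and tested bit by bit. A minimal sketch; the function name has_cap is illustrative, and the bit numbers are the ones defined in <linux/capability.h> (CAP_NET_BIND_SERVICE is 10, CAP_SYS_ADMIN is 21).

```shell
# Test whether a capability bit is set in a hex mask such as the CapBnd
# value from /proc/<pid>/status.
# Usage: has_cap <hex-mask-without-0x> <bit-number>
has_cap() {
    mask=$1
    bit=$2
    if [ $(( (0x$mask >> bit) & 1 )) -eq 1 ]; then
        echo "yes"
    else
        echo "no"
    fi
}

# Usage on a live system (is CAP_SYS_ADMIN still in the bounding set?):
#   has_cap "$(awk '$1 == "CapBnd:" { print $2 }' /proc/<pid>/status)" 21
```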

Step 6: Filter System Calls with SystemCallFilter

SystemCallFilter uses seccomp-BPF to allow or deny individual syscalls. When a denied syscall is attempted, the kernel kills the process with SIGSYS by default — or returns an error code if you use the errno action.

systemd ships predefined syscall sets grouped by function, prefixed with @. Start with a permissive allowlist and tighten over time:

[Service]
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources

@system-service is a broad set covering what well-behaved daemons typically need. The second line (~ prefix means deny) removes the @privileged and @resources groups from that set. Useful predefined groups include:

@system-service: general daemon syscalls (read, write, open, socket, etc.)
@network-io: socket, bind, connect, sendmsg, recvmsg
@privileged: calls that need superuser capabilities, such as chroot, setuid, and module loading (mount and ptrace live in @mount and @debug)
@resources: setrlimit, ioprio_set, and similar resource manipulation
@obsolete: syscalls that should never appear in modern software

To return a benign error instead of killing the process on a denied syscall (useful for debugging or when a library probes for features):

[Service]
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM

Use strace -c -p <pid> to profile what syscalls a running service actually makes before locking it down hard.
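
The summary table that strace -c prints can be reduced to a plain list of syscall names for comparison against the allowlist. The sketch below assumes the usual table layout (a numeric "% time" column first, the syscall name last); the function name is illustrative.

```shell
# List the syscall names from an `strace -c` summary read on stdin.
# Data rows start with a numeric "% time" value; the header, the dashed
# rulers, and the final "total" row are skipped.
syscalls_from_summary() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ && $NF != "total" { print $NF }' | sort -u
}

# Usage on a live system (stop strace with Ctrl-C to write the summary):
#   sudo strace -f -c -p <pid> -o summary.txt
#   syscalls_from_summary < summary.txt
```

Profiling under a realistic workload matters here, since rarely used code paths (log rotation, reloads, error handling) may call syscalls that a short trace never sees.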

Step 7: Apply Additional High-Value Directives

A few more directives complement the above with minimal compatibility risk:

[Service]
PrivateDevices=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true

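PrivateDevices: replaces /dev with a minimal set; no raw device access.
ProtectKernelTunables: makes /proc/sys and /sys read-only.
ProtectKernelModules: blocks CAP_SYS_MODULE and module loading.
ProtectControlGroups: makes the /sys/fs/cgroup hierarchy read-only.
RestrictAddressFamilies: limits socket families; remove AF_INET6 if the service is IPv4-only.
RestrictNamespaces: stops the service from creating or entering new namespaces.
LockPersonality: locks the personality() execution domain so it cannot be changed.
MemoryDenyWriteExecute: blocks memory mappings that are both writable and executable, which stops JIT and self-modifying code; leave it off for JIT-based runtimes such as the JVM and Node.js.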

Verification

After saving the override, reload the daemon and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart myapp.service
sudo systemctl status myapp.service

Check the security exposure score systemd computes for the unit:

systemd-analyze security myapp.service

The output rates each directive and gives an overall exposure score from 0 (fully locked) to 10 (no sandboxing). A well-hardened daemon should score below 3. The command also flags directives not yet set, so you can iterate.

# Illustrative output fragment (exact wording and scores will vary)
# NAME                          DESCRIPTION                              EXPOSURE
# PrivateNetwork=               Service has access to the host network       0.5
# CapabilityBoundingSet=~...    Service has no capability                     0.0
# Overall exposure level for myapp.service: 2.1 OK
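
For a CI check or a deployment script, the overall score can be pulled out of that output and compared with a limit. This is a sketch under the assumption that the summary line keeps the "Overall exposure level for <unit>: <score>" wording shown above; the function name exposure_ok and the limit of 3 are illustrative.

```shell
# Read `systemd-analyze security <unit>` output on stdin and compare the
# overall exposure score with a limit.
# Assumption: the summary line contains
#   "Overall exposure level for <unit>: <score> ..."
# Usage: exposure_ok <limit>
exposure_ok() {
    limit=$1
    score=$(sed -n 's/.*Overall exposure level for [^:]*: \([0-9.]*\).*/\1/p')
    [ -n "$score" ] || { echo "no score found"; return 2; }
    # awk handles the floating-point comparison that plain sh lacks.
    if awk -v s="$score" -v l="$limit" 'BEGIN { exit !(s < l) }'; then
        echo "ok ($score)"
    else
        echo "too exposed ($score)"
        return 1
    fi
}

# Usage on a live system:
#   systemd-analyze security myapp.service | exposure_ok 3
```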

Troubleshooting

Service fails to start after adding SystemCallFilter

Check the journal for SIGSYS or Bad system call:

journalctl -u myapp.service -n 50

Use SystemCallErrorNumber=EPERM temporarily — this lets the process continue despite blocked calls so you can identify which syscall group is the problem. Then use strace to find the specific call and add the appropriate @group to your allowlist.
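
Once the offending syscall name is known, the groups that list it can be looked up from the output of systemd-analyze syscall-filter. The sketch below assumes that output format stays as it is in current releases (group headers at column 0 starting with "@", members indented one per line); the function name is illustrative, and it reports only groups that list the call directly, not groups that pull it in through a nested @group.

```shell
# Print every @group that directly lists the given syscall, reading
# `systemd-analyze syscall-filter` output on stdin.
# Usage: groups_for_syscall <syscall-name>
groups_for_syscall() {
    awk -v want="$1" '
        /^@/ { group = $1; next }      # unindented line: new group header
        $1 == want { print group }     # indented member matching the syscall
    '
}

# Usage on a live system:
#   systemd-analyze syscall-filter | groups_for_syscall chroot
```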

ProtectSystem=strict breaks file writes

Add the specific path to ReadWritePaths=. Multiple paths are space-separated. If the service writes to a socket or pipe instead of a regular file, check that the socket path itself is not under a read-only mount.

PrivateTmp breaks a service that communicates via /tmp sockets

Move the socket to /run/myapp/ (which is writable and shared) and update both the service and client configurations accordingly. Use RuntimeDirectory=myapp in the unit to have systemd create and own the directory automatically.

[Service]
RuntimeDirectory=myapp
RuntimeDirectoryMode=0750