Monitoring and Automatic Restart of Services with systemd: Liveness and Readiness Probes

Maintaining the reliability of critical services often requires continuous monitoring of their health and automatic recovery in case of failures. While Kubernetes offers built-in liveness and readiness probes for containerized applications, Linux system administrators can implement similar health checks and automated restarts for services managed by systemd. This article explores practical ways to perform health monitoring using systemd, focusing on DNS availability checks as an example, and how to configure systemd to automatically restart services that become unhealthy.

Understanding Liveness and Readiness Probes in a systemd Context

Liveness probe: Confirms that a service is running and responsive. If it fails, the service should be restarted.
Readiness probe: Checks if a service is ready to accept traffic or perform its tasks. Until it passes, dependent services or requests should be delayed.

In container orchestration, these probes are built into the platform. On systems using systemd, you can emulate similar behavior with external health check scripts, timers, and systemd service settings.
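As a rough sketch of the readiness idea, a script can poll a probe command until it succeeds or a retry budget runs out. The function name `wait_until_ready` and its arguments are illustrative assumptions, not part of systemd.

```shell
#!/bin/bash
# Hypothetical helper: retry a probe command until it succeeds.
# Usage: wait_until_ready <max_attempts> <delay_seconds> <command> [args...]
wait_until_ready() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1

    while [ "$attempt" -le "$max_attempts" ]; do
        # The probe passes when the command exits with status 0.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

One possible use is to call such a helper from an ExecStartPost= line, for example `wait_until_ready 10 1 dig @127.0.0.1 +short example.com`, so that the unit is not reported as started until the probe passes. Treat this as one option under those assumptions rather than the only way to model readiness.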

Automatic Service Restart in systemd

The simplest way to enable automatic restart on failure in systemd is by editing the service unit file and adding:

[Service]
Restart=on-failure
RestartSec=5s

This makes systemd restart the service if it exits with a non-zero code, is terminated by a signal, or hits a timeout, waiting 5 seconds before each restart. A clean exit (status 0) does not trigger a restart.
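Unit files shipped by a package may be replaced on upgrade, so these settings are often placed in a drop-in file instead of the unit itself. The following is a minimal sketch: the function name `write_restart_dropin`, the file name `restart.conf`, and the optional base directory argument are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: write a drop-in that enables restart-on-failure.
# Usage: write_restart_dropin <unit_name> [base_dir]
write_restart_dropin() {
    local unit="$1"
    local base_dir="${2:-/etc/systemd/system}"
    local dropin_dir="$base_dir/$unit.d"

    # systemd merges *.conf files from <unit>.d/ into the unit's settings.
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=5s
EOF
}
```

Running `write_restart_dropin dnsmasq.service` as root, followed by `systemctl daemon-reload`, should have the same effect as adding the two lines to the unit file by hand.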

Implementing a DNS Availability Check with systemd

To implement a liveness probe that verifies DNS availability (for example, for the dnsmasq service) and restarts the service if DNS is unreachable, follow these steps.

1. DNS Check Script

Create a script /usr/local/bin/dns-watchdog.sh:

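#!/bin/bash

# DNS server to probe and domain to query.
# 127.0.0.1 assumes dnsmasq listens on localhost; querying a public
# resolver such as 8.8.8.8 would bypass dnsmasq and not test it at all.
DNS_SERVER="127.0.0.1"
DOMAIN="google.com"
SERVICE="dnsmasq.service"

# Query the server with a short timeout so a hung resolver cannot stall the check.
# Lines starting with ';' are dig diagnostics (e.g. timeouts), not answers,
# so the probe passes only if at least one other line is printed.
if ! dig "@$DNS_SERVER" +short +time=2 +tries=2 "$DOMAIN" | grep -qv '^;' ; then
    logger "DNS unreachable via $DNS_SERVER, restarting $SERVICE"
    systemctl restart "$SERVICE"
fi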

Make the script executable:

chmod +x /usr/local/bin/dns-watchdog.sh

2. Create a systemd service unit for the check

File /etc/systemd/system/dns-watchdog.service:

[Unit]
Description=DNS Healthcheck and Service Restart

[Service]
Type=oneshot
ExecStart=/usr/local/bin/dns-watchdog.sh

3. Create a systemd timer to run the script periodically

File /etc/systemd/system/dns-watchdog.timer:

[Unit]
Description=Periodic DNS Healthcheck

[Timer]
OnBootSec=2min
OnUnitActiveSec=1min

[Install]
WantedBy=timers.target

Enable and start the timer:

systemctl daemon-reload
systemctl enable --now dns-watchdog.timer

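This setup first runs the DNS check two minutes after boot and then roughly every minute after that (timers fire with up to a minute of slack by default), restarting dnsmasq if the DNS query fails.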
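A single failed query may be transient, and restarting on every failure can cause needless churn. One possible refinement, sketched here, restarts only after several consecutive failures, similar in spirit to a failure threshold on a Kubernetes probe. The function name `record_probe_result` and the idea of keeping the counter in a state file are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count consecutive probe failures in a state file.
# Prints "ok" on success, "wait" while below the threshold, and
# "restart" once the threshold is reached.
# Usage: record_probe_result <state_file> <threshold> <probe_exit_status>
record_probe_result() {
    local state_file="$1"
    local threshold="$2"
    local status="$3"
    local failures=0

    # A successful probe resets the counter.
    if [ "$status" -eq 0 ]; then
        rm -f "$state_file"
        echo "ok"
        return 0
    fi

    # Load the previous count, if any, and add this failure.
    if [ -f "$state_file" ]; then
        failures=$(cat "$state_file")
    fi
    failures=$((failures + 1))

    if [ "$failures" -ge "$threshold" ]; then
        # Threshold reached: reset the counter and signal a restart.
        rm -f "$state_file"
        echo "restart"
    else
        echo "$failures" > "$state_file"
        echo "wait"
    fi
}
```

In the watchdog script, the exit status of the DNS check could be passed to this function with a state file such as /run/dns-watchdog.failures (a hypothetical path), calling `systemctl restart` only when the output is `restart`.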

Alternative Approaches

Cron job: Instead of a systemd timer, schedule the DNS check script with cron for simplicity.
Watchdog Feature: For more advanced use cases, use systemd’s built-in watchdog (the WatchdogSec= setting), where the service must notify systemd at regular intervals to show it is healthy. If the notifications stop, systemd treats the service as failed.
External Monitoring Tools: Use monitoring tools like Monit or Nagios that can run health checks and restart services as needed.
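To illustrate the watchdog option, the sketch below shows a wrapper script that pings systemd at half the configured timeout. It assumes the unit sets WatchdogSec= and a NotifyAccess= value that accepts messages from the script's child processes; the function name `heartbeat_interval` and the placeholder health check are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: derive a heartbeat interval (in seconds) from the
# WATCHDOG_USEC value (microseconds) that systemd exports when
# WatchdogSec= is set. Pinging at half the timeout leaves headroom.
heartbeat_interval() {
    local usec="${1:-0}"
    local seconds=$((usec / 2000000))
    # Never return less than one second.
    if [ "$seconds" -lt 1 ]; then
        seconds=1
    fi
    echo "$seconds"
}

# Hypothetical main loop: only runs when systemd has enabled the watchdog.
if [ -n "${WATCHDOG_USEC:-}" ]; then
    interval=$(heartbeat_interval "$WATCHDOG_USEC")
    while true; do
        # Replace 'true' with a real health check for the service.
        if true; then
            systemd-notify WATCHDOG=1
        fi
        sleep "$interval"
    done
fi
```

If the health check stops passing, the pings stop, and systemd should treat the missed deadline as a failure, which Restart=on-failure then handles.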

Summary

Use Restart=on-failure in systemd service units for basic crash recovery.
Implement custom liveness probes with scripts that test critical dependencies like DNS.
Schedule health checks using systemd timers or cron to monitor and restart services proactively.
This approach increases reliability by combining proactive monitoring with automatic recovery without complex orchestration.

Try adapting the example script and timer to your environment to keep your services healthy and resilient!