oracle-watchdog

Architecture

Distributed heartbeat monitoring and automatic OCI instance recovery

How It Works

Monitor Mode (Oracle Nodes)

  1. Each Oracle node runs the monitor as a systemd service
  2. Monitor creates a Consul session with 30-second TTL and delete behavior
  3. A KV pair is written at oracle-watchdog/nodes/{nodename}, locked to the session
  4. The session is renewed every 10 seconds; if a renewal fails, the monitor reconnects automatically
  5. If a node becomes unresponsive (for example, after being reclaimed by Oracle), renewals stop, the session expires, and the KV pair is deleted
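
The timing in the steps above can be illustrated with a small stdlib-only model. This is a sketch, not the monitor's actual code: the session type and heartbeatPresent are hypothetical names, and the real session lives in Consul rather than in memory. It assumes the session expires exactly at the TTL; Consul treats the TTL as a lower bound, so real expiry may come later.

```go
package main

import (
	"fmt"
	"time"
)

// session is a hypothetical in-memory stand-in for a Consul session with a TTL.
type session struct {
	ttl       time.Duration
	lastRenew time.Time
}

// renew records a successful renewal at time now.
func (s *session) renew(now time.Time) { s.lastRenew = now }

// expired reports whether a full TTL has elapsed since the last renewal.
func (s *session) expired(now time.Time) bool {
	return now.Sub(s.lastRenew) >= s.ttl
}

// heartbeatPresent simulates a monitor that renews every interval until it
// stops at stopAfter, and reports whether the node key still exists at checkAt.
// All durations are offsets from the moment the session was created.
func heartbeatPresent(ttl, interval, stopAfter, checkAt time.Duration) bool {
	start := time.Unix(0, 0)
	s := &session{ttl: ttl, lastRenew: start}
	for t := interval; t <= stopAfter && t <= checkAt; t += interval {
		s.renew(start.Add(t))
	}
	return !s.expired(start.Add(checkAt))
}

func main() {
	ttl, interval := 30*time.Second, 10*time.Second
	// The node stops renewing 60s in; the key survives until the TTL runs out.
	fmt.Println(heartbeatPresent(ttl, interval, 60*time.Second, 80*time.Second))
	fmt.Println(heartbeatPresent(ttl, interval, 60*time.Second, 90*time.Second))
}
```

With a 10-second renewal interval and a 30-second TTL, a healthy monitor can miss two consecutive renewals before the key disappears.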

Agent Mode

  1. The agent runs on infrastructure separate from the monitored nodes (Docker, Nomad, or any host that can reach Consul and the OCI API)
  2. On each check interval (default 30s), it queries Consul and notes which node KV pairs are missing
  3. When a node has been absent longer than the timeout (default 5m), it triggers a restart:
    • Issues an OCI stop command
    • Polls instance state until STOPPED (10s intervals, 5m max wait)
    • Issues an OCI start command
    • Polls instance state until RUNNING
  4. Consecutive restart attempts are tracked per node and reset when the node recovers
  5. Duplicate restart prevention ensures only one restart is in-flight per node at a time
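
The stop/poll/start/poll sequence in step 3 might look roughly like the following sketch. The instanceClient interface, fakeClient, and fakeClock are hypothetical and exist only for illustration; they are not the OCI SDK's types. The sketch also assumes the wait for RUNNING uses the same interval and maximum wait as the wait for STOPPED, which the text above does not state.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// instanceClient is a hypothetical abstraction over the compute calls the
// agent needs; it is not the OCI SDK's interface.
type instanceClient interface {
	Stop(id string) error
	Start(id string) error
	State(id string) (string, error)
}

// waitForState polls State every interval until want is seen or maxWait has
// passed. Errors from State are treated as transient and simply retried.
func waitForState(c instanceClient, id, want string, interval, maxWait time.Duration, sleep func(time.Duration)) error {
	for waited := time.Duration(0); ; waited += interval {
		state, err := c.State(id)
		if err == nil && state == want {
			return nil
		}
		if waited >= maxWait {
			return errors.New("timed out waiting for " + want)
		}
		sleep(interval)
	}
}

// restart runs stop -> wait STOPPED -> start -> wait RUNNING.
func restart(c instanceClient, id string, interval, maxWait time.Duration, sleep func(time.Duration)) error {
	if err := c.Stop(id); err != nil {
		return fmt.Errorf("stop %s: %w", id, err)
	}
	if err := waitForState(c, id, "STOPPED", interval, maxWait, sleep); err != nil {
		return err
	}
	if err := c.Start(id); err != nil {
		return fmt.Errorf("start %s: %w", id, err)
	}
	return waitForState(c, id, "RUNNING", interval, maxWait, sleep)
}

// fakeClient completes each transition after a fixed number of polls.
type fakeClient struct {
	state   string // state reported by State
	target  string // state the instance is moving toward
	delay   int    // polls needed before a transition completes
	pending int    // polls left in the current transition
	started bool   // whether Start was called
}

func (f *fakeClient) Stop(id string) error {
	f.target, f.pending = "STOPPED", f.delay
	return nil
}

func (f *fakeClient) Start(id string) error {
	f.target, f.pending, f.started = "RUNNING", f.delay, true
	return nil
}

func (f *fakeClient) State(id string) (string, error) {
	if f.pending > 0 {
		f.pending--
	} else {
		f.state = f.target
	}
	return f.state, nil
}

// fakeClock accumulates requested sleep time instead of actually sleeping.
type fakeClock struct{ slept time.Duration }

func (k *fakeClock) Sleep(d time.Duration) { k.slept += d }

func main() {
	c := &fakeClient{state: "RUNNING", target: "RUNNING", delay: 2}
	clk := &fakeClock{}
	err := restart(c, "instance-a", 10*time.Second, 5*time.Minute, clk.Sleep)
	fmt.Println(err, c.state, clk.slept)
}
```

Passing the sleep function in keeps the sequence testable without real delays; a real caller would pass time.Sleep.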

Safety Features

  • Configurable max restart attempts per node (0 = unlimited)
  • Dry-run mode for testing (-dry-run flag)
  • Connection health tracking with consecutive failure thresholds for both Consul and OCI
  • Connection state machine with automatic transitions: on connection failure the process keeps retrying instead of exiting
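
One way to combine the attempt limit with the per-node in-flight guard is sketched below. restartTracker and its methods are hypothetical names chosen for illustration, not the agent's real types, and the sketch counts an attempt at the moment a restart begins, which is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// restartTracker is a hypothetical per-node bookkeeping structure.
type restartTracker struct {
	mu          sync.Mutex
	maxAttempts int             // 0 means unlimited
	attempts    map[string]int  // consecutive restart attempts per node
	inFlight    map[string]bool // nodes with a restart currently running
}

func newRestartTracker(maxAttempts int) *restartTracker {
	return &restartTracker{
		maxAttempts: maxAttempts,
		attempts:    make(map[string]int),
		inFlight:    make(map[string]bool),
	}
}

// begin reports whether a restart may start for node. On success it marks
// the node in-flight and counts the attempt.
func (t *restartTracker) begin(node string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.inFlight[node] {
		return false // a restart is already running for this node
	}
	if t.maxAttempts > 0 && t.attempts[node] >= t.maxAttempts {
		return false // attempt limit reached
	}
	t.inFlight[node] = true
	t.attempts[node]++
	return true
}

// finish clears the in-flight marker once the restart sequence ends.
func (t *restartTracker) finish(node string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.inFlight, node)
}

// recovered resets the attempt counter when the node's heartbeat reappears.
func (t *restartTracker) recovered(node string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.attempts, node)
}

func main() {
	t := newRestartTracker(2)
	fmt.Println(t.begin("node-a")) // true: first attempt
	fmt.Println(t.begin("node-a")) // false: already in flight
	t.finish("node-a")
	fmt.Println(t.begin("node-a")) // true: second attempt
	t.finish("node-a")
	fmt.Println(t.begin("node-a")) // false: limit of 2 reached
	t.recovered("node-a")
	fmt.Println(t.begin("node-a")) // true: counter was reset
}
```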

Optional Features

Each mode ships with one optional subsystem that runs in the same process when enabled in the config file. Both subsystems are disabled by default and are independent of the core OCI-restart flow.

  • WireGuard Endpoint Resolver (monitor) - re-resolves a configured WG peer hostname on an interval and refreshes the kernel peer endpoint via wgctrl when the resolved IP changes. Forces an immediate re-resolve when the most recent peer handshake exceeds the staleness threshold.
  • Cloudflare WAN-IP DDNS Updater (agent) - detects the host’s public IPv4 via configurable HTTP providers and PATCHes a Cloudflare A record when the value changes. IPv4 only. The Cloudflare API token is read once at startup from a configurable env var.
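
The WireGuard resolver's decision logic could be sketched as two small pure functions. needsResolve and endpointChanged are hypothetical helpers, not part of wgctrl or of this package's documented API; reading the handshake time and setting the peer endpoint are left out.

```go
package main

import (
	"fmt"
	"net/netip"
	"time"
)

// needsResolve reports whether the peer hostname should be re-resolved now:
// either the regular interval has elapsed, or the most recent handshake is
// older than the staleness threshold.
func needsResolve(sinceResolve, sinceHandshake, interval, staleAfter time.Duration) bool {
	return sinceResolve >= interval || sinceHandshake > staleAfter
}

// endpointChanged reports whether the freshly resolved address differs from
// the endpoint currently configured on the peer. IPv4-mapped IPv6 addresses
// are unmapped first so that equivalent forms compare equal.
func endpointChanged(current, resolved string) (bool, error) {
	cur, err := netip.ParseAddr(current)
	if err != nil {
		return false, fmt.Errorf("current endpoint: %w", err)
	}
	res, err := netip.ParseAddr(resolved)
	if err != nil {
		return false, fmt.Errorf("resolved address: %w", err)
	}
	return cur.Unmap() != res.Unmap(), nil
}

func main() {
	interval, staleAfter := time.Minute, 3*time.Minute

	// Interval not yet elapsed, handshake fresh: nothing to do.
	fmt.Println(needsResolve(10*time.Second, 5*time.Second, interval, staleAfter))
	// Handshake is stale: resolve immediately even though the interval has not elapsed.
	fmt.Println(needsResolve(10*time.Second, 4*time.Minute, interval, staleAfter))

	changed, err := endpointChanged("192.0.2.1", "192.0.2.2")
	fmt.Println(changed, err)
}
```

The same compare-before-write idea applies to the DDNS updater: the A record only needs to be sent when the detected public IPv4 address differs from the last known value.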

See the README and the package godoc for configuration reference.