oracle-watchdog

Automatic recovery for Oracle Cloud free-tier nodes

Oracle periodically reclaims free-tier instances, leaving them in a stuck state that requires a full stop/start cycle to recover. Oracle Watchdog detects unresponsive nodes by polling Consul KV for session-locked heartbeats that expire when a node goes silent, then automatically triggers OCI restart cycles.

  • Monitor mode runs on each Oracle node, holding a session-locked KV entry in Consul as its heartbeat signal
  • Agent mode runs on infrastructure separate from the monitored nodes, polling Consul KV for missing heartbeats and orchestrating OCI stop/start cycles
  • Self-healing design keeps retrying Consul and OCI connections, so an outage of either does not crash the service
  • OpenTelemetry tracing provides visibility into restart cycles via Tempo

Key Features

Consul Session Heartbeat

Monitor processes maintain a Consul session with a 30s TTL on each Oracle node.

Sessions are renewed every 10 seconds. A KV pair locked to the session is written at oracle-watchdog/nodes/{nodename}. When a node becomes unresponsive, the session expires and the KV pair is automatically deleted.
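
The timing described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: heartbeatKey and expiredAfter are hypothetical helper names, and the check ignores that Consul treats the TTL as a lower bound, so a real session may survive somewhat longer than 30 seconds of silence.

```go
package main

import (
	"fmt"
	"time"
)

const (
	sessionTTL    = 30 * time.Second         // session TTL from the text above
	renewInterval = 10 * time.Second         // renewal period from the text above
	keyPrefix     = "oracle-watchdog/nodes/" // KV prefix from the text above
)

// heartbeatKey returns the KV path a monitor would lock for the given node.
// (Hypothetical helper name.)
func heartbeatKey(node string) string {
	return keyPrefix + node
}

// expiredAfter reports whether a session that has gone unrenewed for
// `silence` has reached its TTL. (Hypothetical helper name.)
func expiredAfter(silence, ttl time.Duration) bool {
	return silence >= ttl
}

func main() {
	fmt.Println(heartbeatKey("node-a"))
	// Two missed renewals (20s of silence) are still inside the TTL.
	fmt.Println(expiredAfter(2*renewInterval, sessionTTL))
	// A third missed renewal (30s of silence) reaches the TTL.
	fmt.Println(expiredAfter(3*renewInterval, sessionTTL))
}
```

With a 10-second renewal period and a 30-second TTL, a node can miss two renewals in a row before its heartbeat key is at risk of disappearing.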

Automatic OCI Recovery

Agent detects missing heartbeats and triggers OCI stop/start cycles to recover stuck instances.

Configurable timeout before restart (default 5m). Issues OCI stop, polls until STOPPED, then issues start and polls until RUNNING. Tracks consecutive attempts per node and resets on recovery.
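
The stop, poll, start, poll sequence above might look roughly like the sketch below. It assumes a hypothetical instanceAPI interface in place of the real OCI SDK, and the names restartCycle, waitFor, maxPolls, and fakeAPI are illustrative only. The fake client exists so the flow can be exercised without cloud access.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// instanceAPI is a hypothetical stand-in for the OCI calls the agent needs.
type instanceAPI interface {
	Stop(id string) error
	Start(id string) error
	State(id string) (string, error)
}

// waitFor polls State until it equals want or maxPolls attempts are used up.
func waitFor(api instanceAPI, id, want string, maxPolls int, every time.Duration) error {
	for i := 0; i < maxPolls; i++ {
		s, err := api.State(id)
		if err != nil {
			return err
		}
		if s == want {
			return nil
		}
		time.Sleep(every)
	}
	return errors.New("timed out waiting for " + want)
}

// restartCycle issues stop, waits for STOPPED, issues start, waits for RUNNING.
func restartCycle(api instanceAPI, id string, maxPolls int, every time.Duration) error {
	if err := api.Stop(id); err != nil {
		return fmt.Errorf("stop: %w", err)
	}
	if err := waitFor(api, id, "STOPPED", maxPolls, every); err != nil {
		return err
	}
	if err := api.Start(id); err != nil {
		return fmt.Errorf("start: %w", err)
	}
	return waitFor(api, id, "RUNNING", maxPolls, every)
}

// fakeAPI replays a scripted list of states and records Stop/Start calls.
type fakeAPI struct {
	states []string
	calls  []string
}

func (f *fakeAPI) Stop(id string) error {
	f.calls = append(f.calls, "stop")
	return nil
}

func (f *fakeAPI) Start(id string) error {
	f.calls = append(f.calls, "start")
	return nil
}

func (f *fakeAPI) State(id string) (string, error) {
	if len(f.states) == 0 {
		return "", errors.New("no more scripted states")
	}
	s := f.states[0]
	f.states = f.states[1:]
	return s, nil
}

func main() {
	api := &fakeAPI{states: []string{"STOPPING", "STOPPED", "STARTING", "RUNNING"}}
	err := restartCycle(api, "example-instance-id", 5, time.Millisecond)
	fmt.Println(api.calls, err)
}
```

Bounding each poll loop keeps a stuck instance from blocking the agent indefinitely; the per-node attempt counter mentioned above would sit one level up, around calls to restartCycle.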

Self-Healing Design

Never crashes due to Consul or OCI unavailability - continuously retries connections.

Both monitor and agent modes use state machines that transition between disconnected, connecting, and active states. Consecutive failure tracking triggers connection resets, and in-flight tracking prevents duplicate restarts of the same node.
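
A state machine of this shape could be sketched as follows. The three state names come from the description above; the machine type, the observe method, and the maxFails threshold are assumptions made for illustration, since the actual transition rules and thresholds are not stated here.

```go
package main

import "fmt"

type state int

const (
	disconnected state = iota
	connecting
	active
)

func (s state) String() string {
	return [...]string{"disconnected", "connecting", "active"}[s]
}

// machine tracks the connection state plus consecutive failures.
type machine struct {
	st       state
	failures int
	maxFails int // hypothetical threshold for forcing a connection reset
}

// observe records the outcome of one connection attempt or health check.
func (m *machine) observe(ok bool) {
	if ok {
		m.failures = 0
		m.st = active
		return
	}
	m.failures++
	if m.failures >= m.maxFails {
		// Too many failures in a row: reset and start over from scratch.
		m.failures = 0
		m.st = disconnected
		return
	}
	if m.st == disconnected {
		m.st = connecting // keep retrying instead of giving up
	}
}

func main() {
	m := &machine{maxFails: 3}
	for _, ok := range []bool{false, true, false, false, false} {
		m.observe(ok)
		fmt.Println(ok, "->", m.st)
	}
}
```

The point of the sketch is that every failure leads to another state rather than to process exit, which is what lets the service ride out an outage.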

Prometheus Metrics

13 metrics covering connection health, session status, and restart activity per node.

Monitor exposes connection and session gauges, renewal/failure counters, and reconnect attempts on port 9104. Agent exposes connection status, node counts, per-node restart counters, and check failures on port 9105.

OpenTelemetry Tracing

Every restart cycle is traced end-to-end with spans for OCI stop, poll, and start operations.

Exports traces to Tempo via OTLP gRPC. Each trace captures node name, instance ID, timing for stop/start operations, and error details on failure.

Flexible Deployment

One binary, two modes. Monitor ships as a systemd-friendly Debian package, agent ships as a Docker image.

The mode is selected with the -mode flag. Run monitors directly on each Oracle node and the agent anywhere that can reach Consul and the OCI API.
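
Mode selection with a single flag can be sketched using Go's standard flag package. The -mode flag name comes from the text above; the accepted values "monitor" and "agent" and the parseMode helper are assumptions for illustration, not the binary's confirmed interface.

```go
package main

import (
	"flag"
	"fmt"
)

// parseMode reads -mode from args and rejects anything but the two modes.
// The values "monitor" and "agent" are assumed, not confirmed by the source.
func parseMode(args []string) (string, error) {
	fs := flag.NewFlagSet("oracle-watchdog", flag.ContinueOnError)
	mode := fs.String("mode", "", "run mode: monitor or agent")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch *mode {
	case "monitor", "agent":
		return *mode, nil
	default:
		return "", fmt.Errorf("unknown -mode %q (want monitor or agent)", *mode)
	}
}

func main() {
	// A real entry point would pass os.Args[1:] here.
	mode, err := parseMode([]string{"-mode", "agent"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("starting in", mode, "mode")
}
```

Validating the value up front means a typo in a unit file or container command fails at startup instead of silently running the wrong role.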

Optional: WireGuard Endpoint Resolver

Monitor mode can re-resolve a configured WireGuard peer hostname and refresh the kernel peer endpoint when its IP changes.

Default-disabled and independent of the core OCI-restart flow. Forces an immediate re-resolve when the most recent peer handshake exceeds the configured staleness threshold. Updates the kernel via netlink (wgctrl).

Optional: Cloudflare WAN-IP DDNS Updater

Agent mode can detect the host's public IPv4 and keep a Cloudflare A record in sync.

Default-disabled and independent of the core OCI-restart flow. Polls configurable detection providers (ipify, Cloudflare trace) in order, IPv4 only. The Cloudflare API token is read once at startup from an environment variable.
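
Trying detection providers in order and accepting only IPv4 results can be sketched as below. The provider type and the detectIPv4, static, and failing helpers are hypothetical; a real provider would perform an HTTP request, which is replaced here by canned responses so the fallback logic can be shown in isolation.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
	"strings"
)

// provider is a hypothetical detection source returning the raw IP text it saw.
type provider struct {
	name  string
	fetch func() (string, error)
}

// static builds a provider that always returns the same raw response.
func static(name, raw string) provider {
	return provider{name: name, fetch: func() (string, error) { return raw, nil }}
}

// failing builds a provider that always errors, simulating an outage.
func failing(name string) provider {
	return provider{name: name, fetch: func() (string, error) {
		return "", errors.New("unreachable")
	}}
}

// detectIPv4 tries providers in order and returns the first valid IPv4 address.
func detectIPv4(providers []provider) (string, error) {
	errs := []error{errors.New("no provider returned an IPv4 address")}
	for _, p := range providers {
		raw, err := p.fetch()
		if err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", p.name, err))
			continue
		}
		addr, err := netip.ParseAddr(strings.TrimSpace(raw))
		if err != nil || !addr.Is4() {
			errs = append(errs, fmt.Errorf("%s: not an IPv4 address: %q", p.name, raw))
			continue
		}
		return addr.String(), nil
	}
	return "", errors.Join(errs...)
}

func main() {
	ip, err := detectIPv4([]provider{
		failing("first"),
		static("second", "2001:db8::1"),    // IPv6 is skipped
		static("third", "203.0.113.7\n"),   // documentation-range IPv4
	})
	fmt.Println(ip, err)
}
```

Collecting the per-provider errors instead of returning the first one makes it easier to see why every source was rejected when detection fails outright.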