oracle-watchdog

Session Heartbeat | OCI Stop/Start | Prometheus Metrics | Self-Healing | OpenTelemetry Tracing
Automatic recovery for Oracle Cloud free-tier nodes
Oracle periodically reclaims free-tier instances, leaving them in a stuck state that requires a full stop/start cycle to recover. Oracle Watchdog detects unresponsive nodes by polling Consul KV for session-locked heartbeats that expire when a node goes silent, then automatically triggers OCI restart cycles.
- Monitor mode runs on each Oracle node, holding a session-locked KV entry in Consul as its heartbeat signal
- Agent mode runs on infrastructure separate from the monitored nodes, polling Consul KV for missing heartbeats and orchestrating OCI stop/start cycles
- Self-healing design ensures the service never crashes due to Consul or OCI unavailability
- OpenTelemetry tracing provides visibility into restart cycles via Tempo
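The agent's detection pass described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the key prefix `oracle-watchdog/heartbeat/`, the function name `find_silent_nodes`, and the node names are all assumptions. It relies only on the fact that Consul's KV read API includes a `Session` field on an entry while a session holds its lock.

```python
from __future__ import annotations

from typing import Iterable, Mapping

# Hypothetical key layout; the prefix oracle-watchdog really uses is not stated here.
HEARTBEAT_PREFIX = "oracle-watchdog/heartbeat/"


def find_silent_nodes(
    expected_nodes: Iterable[str],
    kv_entries: Iterable[Mapping[str, object]],
) -> list[str]:
    """Return the expected nodes that have no session-locked heartbeat entry.

    kv_entries mirrors the JSON objects returned by a Consul KV read, where
    "Session" is present only while a session holds the lock on the key.
    """
    alive = set()
    for entry in kv_entries:
        key = str(entry.get("Key", ""))
        # A key without a "Session" means the lock was released (TTL expired).
        if key.startswith(HEARTBEAT_PREFIX) and entry.get("Session"):
            alive.add(key[len(HEARTBEAT_PREFIX):])
    return sorted(node for node in set(expected_nodes) if node not in alive)


# Example data: node-a holds its lock, node-b lost it, node-c has no key at all.
entries = [
    {"Key": HEARTBEAT_PREFIX + "node-a", "Session": "adf4238a-882b-9ddc-4a9d-5b6758e4159e"},
    {"Key": HEARTBEAT_PREFIX + "node-b"},
]
print(find_silent_nodes(["node-a", "node-b", "node-c"], entries))
# → ['node-b', 'node-c']
```

Treating both a missing key and a key without a session as "silent" covers either Consul session behavior (`release` or `delete`) when the TTL lapses.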
Key Features
- Monitor processes maintain a Consul session with a 30s TTL on each Oracle node.
- The agent detects missing heartbeats and triggers OCI stop/start cycles to recover stuck instances.
- The service never crashes when Consul or OCI is unavailable; it keeps retrying connections.
- 13 Prometheus metrics cover connection health, session status, and restart activity per node.
- Every restart cycle is traced end-to-end, with spans for the OCI stop, poll, and start operations.
- One binary, two modes: the monitor ships as a systemd-friendly Debian package, and the agent ships as a Docker image.
- Monitor mode can re-resolve a configured WireGuard (WG) peer hostname and refresh the kernel peer endpoint when its IP changes.
- Agent mode can detect the host's public IPv4 address and keep a Cloudflare A record in sync.
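The stop, poll, start cycle mentioned in the features above might look something like the sketch below. It is written under stated assumptions: `ComputeClient` is a hypothetical minimal interface (not the OCI SDK's real client), `restart_cycle` and `FakeClient` are illustrative names, and the timeout values are arbitrary. The `"STOP"`/`"START"` actions and the `"STOPPED"` state are meant to mirror OCI's instance action and lifecycle-state values.

```python
from __future__ import annotations

import time
from typing import Callable, Protocol


class ComputeClient(Protocol):
    """Hypothetical minimal interface; a real agent would wrap the OCI SDK."""

    def instance_action(self, instance_id: str, action: str) -> None: ...
    def get_state(self, instance_id: str) -> str: ...


def restart_cycle(
    client: ComputeClient,
    instance_id: str,
    poll_interval: float = 5.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Stop the instance, poll until it reports STOPPED, then start it.

    Returns False on timeout or client error instead of raising, so the
    caller can simply try again on its next pass.
    """
    try:
        client.instance_action(instance_id, "STOP")
        deadline = clock() + timeout
        while client.get_state(instance_id) != "STOPPED":
            if clock() >= deadline:
                return False
            sleep(poll_interval)
        client.instance_action(instance_id, "START")
        return True
    except Exception:
        # Swallow transient API failures; the next detection pass retries.
        return False


class FakeClient:
    """In-memory stand-in used only to exercise the sketch."""

    def __init__(self, states: list[str]) -> None:
        self.states = list(states)
        self.actions: list[str] = []

    def instance_action(self, instance_id: str, action: str) -> None:
        self.actions.append(action)

    def get_state(self, instance_id: str) -> str:
        # Walk through the scripted states, then keep returning the last one.
        return self.states.pop(0) if len(self.states) > 1 else self.states[0]


fake = FakeClient(["STOPPING", "STOPPING", "STOPPED"])
ok = restart_cycle(fake, "ocid1.instance.example", sleep=lambda _: None)
print(ok, fake.actions)
# → True ['STOP', 'START']
```

Injecting `sleep` and `clock` keeps the loop testable without real delays. Returning a boolean rather than raising matches the "never crashes" goal, though a real implementation would likely also record a metric and a trace span at each step.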