oracle-watchdog documentation :: oracle-watchdog

Architecture

Mon, 01 Jan 0001 00:00:00 +0000

Distributed heartbeat monitoring and automatic OCI instance recovery

How It Works

Monitor Mode (Oracle Nodes)

- Each Oracle node runs the monitor as a systemd service
- Monitor creates a Consul session with 30-second TTL and delete behavior
- A KV pair is written at oracle-watchdog/nodes/{nodename}, locked to the session
- The session is renewed every 10 seconds - if renewal fails, the monitor reconnects automatically
- If a node becomes unresponsive (reclaimed by Oracle), the session expires and the KV pair is deleted

Agent Mode

- The agent runs on infrastructure separate from the monitored nodes (Docker, Nomad, or any host that can reach Consul and the OCI API)
- On each check interval (default 30s), it polls Consul for missing node KV pairs
- When a node has been absent longer than the timeout (default 5m), it triggers a restart:
  - Issues an OCI stop command
  - Polls instance state until STOPPED (10s intervals, 5m max wait)
  - Issues an OCI start command
  - Polls instance state until RUNNING
- Consecutive restart attempts are tracked per node and reset when the node recovers
- Duplicate restart prevention ensures only one restart is in-flight per node at a time

Safety Features

- Configurable max restart attempts per node (0 = unlimited)
- Dry-run mode for testing (-dry-run flag)
- Connection health tracking with consecutive failure thresholds for both Consul and OCI
- Automatic connection state machine transitions - never crashes, always retries

Optional Features

Both modes ship with an additional optional subsystem that runs in the same process when enabled in the config file. Both are default-disabled and independent of the core OCI-restart flow.

README


A distributed monitoring and recovery system for Oracle Cloud free-tier instances. Oracle periodically reclaims free-tier instances, leaving them in a stuck state that requires a full stop/start cycle to recover. Oracle Watchdog detects unresponsive nodes by polling Consul KV for session-locked heartbeats that expire when a node goes silent, then automatically triggers OCI restart cycles.

- Monitor mode runs on each Oracle node, holding a session-locked KV entry in Consul as its heartbeat signal
- Agent mode runs on infrastructure separate from the monitored nodes, polling Consul KV for missing heartbeats and orchestrating OCI stop/start cycles
- Self-healing design ensures the service never crashes due to Consul or OCI unavailability - it continuously retries and emits metrics on current state
- OpenTelemetry tracing provides visibility into restart cycles via Tempo

    Oracle Node 1          Oracle Node 2
    (monitor mode)         (monitor mode)
         |                      |
         v                      v
    +---------+            +---------+
    | Consul  |            | Consul  |
    | Session |            | Session |
    +---------+            +---------+
          \                    /
           '----> Consul <-----'
                    |
                    v
           +-----------------+
           | oracle-watchdog |
           |  (agent mode)   |
           +-----------------+
                |        |
                v        v
               OCI    Prometheus
            (restart) (metrics)

Table of Contents

- How It Works
- Modes
- Prometheus Metrics
- Configuration
- Deployment
- Development
- Project Structure

How It Works

The system operates as a distributed heartbeat monitor:

Grafana Dashboard


Pre-built Grafana dashboard for monitoring Oracle Cloud node health and recovery

The dashboard is organized into three sections:

- Monitor Mode shows a per-node status table with Consul connection and session heartbeat health for each Oracle node, alongside a session activity timeseries and monitor logs from Loki.
- Agent Mode displays the agent’s Consul and OCI connection status, how many nodes are being watched, and whether any are currently missing.
- Restart Activity tracks cumulative restart attempts, successes, and failures per node in a table, with a timeseries view of restart events over time.

Agent logs round out the bottom of the dashboard.

Changelog


All notable changes to this project are documented in this file.

[1.4.1] - 2026-05-13

Added
- add endpoint resolver (monitor) and WAN DNS updater (agent)

Improved
- update CHANGELOG.md for v1.3.0 (#30)

Other
- publish-deb: target s3:munchbox: prefix instead of root

[1.3.0] - 2026-03-16

Added
- Add auto-generated Go API reference to documentation site

Improved
- update CHANGELOG.md for v1.2.0 (#27)

[1.2.0] - 2026-03-16

Added
- Add periodic heartbeat log to monitor mode

Improved
- update CHANGELOG.md for v1.1.0 (#25)

[1.1.0] - 2026-03-16

Added
- Add publish-deb target and bump version to v1.1.0

Improved
- update CHANGELOG.md for v0.0.9 (#24)

[0.0.9] - 2026-03-16

Added
- Add tracing to monitor mode Consul calls

Improved
- update CHANGELOG.md for v0.0.8 (#22)

[0.0.8] - 2026-03-16

Fixed
- Fix service graph visibility in Tempo and add Consul client spans

Improved
- update readme to have logo
- update CHANGELOG.md for v0.0.7 (#18)

Other
- Move logo above title in README and reorder header elements

[0.0.7] - 2026-03-15

Added
- Add Hugo documentation site

Improved
- update CHANGELOG.md for v0.0.6 (#15)

[0.0.6] - 2026-03-15

Improved
- update CHANGELOG.md for v0.0.5 (#9)

Other
- general repo housekeeping/setup

[0.0.4] - 2026-03-15

Other
- Test release functionality

[0.0.3] - 2026-03-15

Added
- Add Grafana dashboard, GoReleaser release pipeline, and fix metrics initialization
- Add README with architecture, metrics, config, and deployment docs

Fixed
- Fix reliability issues and improve Go best practices (#1)

Other
- Initial standalone repo with CI/CD infrastructure
- Initial commit