aegis-updates
Over-the-Air (OTA) Rolling Update System for Aegis Database Platform.
Overview
Provides zero-downtime rolling updates for Aegis clusters. Updates are applied to follower nodes first, then the leader last, with automatic rollback if any node fails health checks or quorum is lost.
Modules
version.rs
Version tracking:
- VERSION - Compile-time version from CARGO_PKG_VERSION
- NodeVersion - Node version info with node ID, name, version, binary hash
- ClusterVersionInfo - Aggregated version info across all cluster nodes
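As a rough illustration of how these three items could fit together, here is a minimal sketch. The field names, the `distinct_versions` and `is_uniform` helpers, and the "unknown" fallback are assumptions made for this example, not the crate's actual definitions.

```rust
use std::collections::BTreeSet;

/// Compile-time crate version. Falls back to "unknown" when the code is
/// built without Cargo (the fallback is an assumption of this sketch).
pub const VERSION: &str = match option_env!("CARGO_PKG_VERSION") {
    Some(v) => v,
    None => "unknown",
};

/// Per-node version record (field names are hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct NodeVersion {
    pub node_id: String,
    pub name: String,
    pub version: String,
    pub binary_hash: String,
}

/// Aggregated view of the versions running across the cluster.
#[derive(Debug, Clone)]
pub struct ClusterVersionInfo {
    pub nodes: Vec<NodeVersion>,
}

impl ClusterVersionInfo {
    /// Distinct version strings currently reported, sorted lexicographically.
    pub fn distinct_versions(&self) -> Vec<String> {
        let set: BTreeSet<&str> = self.nodes.iter().map(|n| n.version.as_str()).collect();
        set.into_iter().map(String::from).collect()
    }

    /// True when every node reports the same version (also true for an empty cluster).
    pub fn is_uniform(&self) -> bool {
        self.distinct_versions().len() <= 1
    }
}
```

A cluster that is midway through a rolling update would report two distinct versions and `is_uniform()` would return false until the leader has been updated.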
binary.rs
Binary management:
- download_binary() - Download update binary from URL
- verify_sha256() - Verify binary integrity with SHA-256 hash
- stage_binary() - Stage binary in staging directory
- backup_current_binary() - Back up current binary before update
- apply_binary() - Atomic binary replacement (rename)
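The backup and apply steps can be sketched with the standard library alone. The signatures and the `.bak` suffix below are assumptions for illustration; the crate's real functions may take different arguments. Note that `std::fs::rename` is only atomic when the staged file and the target live on the same filesystem.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Copy the current binary to `<path>.bak` and return the backup path.
/// The `.bak` suffix is a choice made for this sketch.
pub fn backup_current_binary(current: &Path) -> io::Result<PathBuf> {
    let mut backup = current.as_os_str().to_owned();
    backup.push(".bak");
    let backup = PathBuf::from(backup);
    fs::copy(current, &backup)?;
    Ok(backup)
}

/// Replace `current` with `staged` using a rename.
/// The staging directory must be on the same filesystem as the target
/// for the replacement to be atomic.
pub fn apply_binary(staged: &Path, current: &Path) -> io::Result<()> {
    fs::rename(staged, current)
}
```

Taking the backup before the rename means a failed update can be undone by copying the `.bak` file back into place.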
orchestrator.rs
Rolling update orchestration:
- UpdateOrchestrator - Main coordinator for cluster-wide updates
- UpdatePlan - Update plan with version, URL, SHA-256, and node list
- UpdateStatus - Pending, InProgress, Completed, Failed, RolledBack
- ClusterNode - Node info for update targeting
- Rolling strategy: followers first, leader last
- Automatic rollback on node failure or quorum loss
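The followers-first ordering can be illustrated with a stable sort on the leader flag. The `ClusterNode` fields mirror the usage example in this README; the `rolling_order` helper is hypothetical and only shows one way the ordering might be computed.

```rust
/// Node description, with the same fields as the README's usage example.
#[derive(Debug, Clone)]
pub struct ClusterNode {
    pub node_id: String,
    pub name: String,
    pub address: String,
    pub is_leader: bool,
}

/// Return the nodes in update order: followers first (keeping their
/// original relative order), then the leader.
pub fn rolling_order(nodes: &[ClusterNode]) -> Vec<&ClusterNode> {
    let mut ordered: Vec<&ClusterNode> = nodes.iter().collect();
    // `sort_by_key` is stable and `false < true`, so followers sort ahead of the leader.
    ordered.sort_by_key(|n| n.is_leader);
    ordered
}
```

Updating the leader last keeps leadership stable for as long as possible, so at most one leader change is needed during the whole update.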
rollback.rs
Rollback operations:
- restore_backup() - Restore backed-up binary
- rollback_node() - Roll back a single node (restore + restart)
- rollback_nodes() - Roll back multiple nodes in sequence
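A sequential multi-node rollback might look like the sketch below. The closure parameter stands in for the per-node restore and restart, and the choice to continue past a failed node and report it afterwards is an assumption; the crate may instead stop at the first error.

```rust
/// Roll back each node in order using `rollback_one`, a hypothetical
/// stand-in for "restore backup, then restart".
/// Continues after a failure and returns the IDs of nodes that could
/// not be rolled back.
pub fn rollback_nodes<F>(node_ids: &[&str], mut rollback_one: F) -> Vec<String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut failed = Vec::new();
    for &id in node_ids {
        if rollback_one(id).is_err() {
            failed.push(id.to_string());
        }
    }
    failed
}
```

An empty return value means every node was restored; a non-empty one lists the nodes that need manual attention.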
health.rs
Post-update health verification:
- HealthCheck - Health check configuration (timeout, retries, interval)
- check_node_health() - Single health check against node endpoint
- wait_for_healthy() - Retry health checks until success or timeout
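The retry-until-healthy behaviour can be sketched as a simple loop. This version is synchronous and takes the probe as a closure, whereas the crate's API is async and talks to a node endpoint. The field types, and the reading of `timeout` as an overall deadline rather than a per-request limit, are assumptions.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Health check configuration (field types are assumptions).
#[derive(Debug, Clone)]
pub struct HealthCheck {
    pub timeout: Duration,
    pub retries: u32,
    pub interval: Duration,
}

/// Call `probe` until it reports healthy, the retry budget is used up,
/// or the overall deadline passes. Returns the attempt number that succeeded.
pub fn wait_for_healthy<F>(cfg: &HealthCheck, mut probe: F) -> Result<u32, String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + cfg.timeout;
    for attempt in 1..=cfg.retries {
        if probe() {
            return Ok(attempt);
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out after {attempt} attempts"));
        }
        // Skip the pause after the final attempt.
        if attempt < cfg.retries {
            thread::sleep(cfg.interval);
        }
    }
    Err(format!("unhealthy after {} attempts", cfg.retries))
}
```

In the update flow, the probe would also compare the version reported by the node against the plan's target version, so that a node that came back up on the old binary does not count as healthy.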
Update Flow
1. Create UpdatePlan (target version, binary URL, SHA-256)
2. Stage binary on each node
3. For each FOLLOWER (then leader last):
   a. Drain node (stop accepting queries)
   b. Flush data to disk
   c. Apply staged binary (atomic rename)
   d. Process restarts (PM2 auto-restart)
   e. Wait for health check with expected version
   f. Verify cluster rejoin
4. If any node fails → rollback that node
5. If quorum lost → rollback entire cluster
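Steps 4 and 5 amount to a small decision rule, sketched below. The `RollbackAction` enum and both functions are hypothetical names, and quorum is assumed to mean a strict majority of nodes being healthy, which is the usual definition in consensus systems but is not stated in this README.

```rust
/// What the orchestrator should do after updating one node
/// (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum RollbackAction {
    /// Node is healthy and quorum holds: move on to the next node.
    Continue,
    /// The node failed but the cluster still has quorum: roll back that node only.
    RollbackNode,
    /// Quorum is lost: roll back the entire cluster.
    RollbackCluster,
}

/// Strict-majority quorum check (an assumption of this sketch).
pub fn has_quorum(healthy: usize, total: usize) -> bool {
    healthy > total / 2
}

/// Apply steps 4 and 5 of the update flow. Quorum loss takes priority
/// over a single node failure.
pub fn decide(node_failed: bool, healthy: usize, total: usize) -> RollbackAction {
    if !has_quorum(healthy, total) {
        RollbackAction::RollbackCluster
    } else if node_failed {
        RollbackAction::RollbackNode
    } else {
        RollbackAction::Continue
    }
}
```

For a three-node cluster, one failed node leaves two healthy nodes and only that node is rolled back; a second failure drops the cluster below a majority and triggers a full rollback.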
Usage Example
use aegis_updates::orchestrator::{UpdateOrchestrator, ClusterNode};

// Run the statements below inside an async function that returns a Result,
// since they use `?` and `.await`.

// Create orchestrator (server binary path, data directory)
let orchestrator = UpdateOrchestrator::new(
    "/usr/local/bin/aegis-server",
    "/var/lib/aegis/data",
);

// Create update plan: target version, binary URL, SHA-256, node list
let plan = orchestrator.create_plan(
    "0.2.5",
    "https://releases.example.com/aegis-server-0.2.5",
    "sha256hash...",
    vec![
        ClusterNode {
            node_id: "node-1".into(),
            name: "Dashboard".into(),
            address: "http://127.0.0.1:9090".into(),
            is_leader: true,
        },
        ClusterNode {
            node_id: "node-2".into(),
            name: "NexusScribe".into(),
            address: "http://127.0.0.1:9091".into(),
            is_leader: false,
        },
    ],
)?;

// Execute rolling update (followers first, leader last)
orchestrator.execute_plan(&plan.id).await?;

// Check status
let status = orchestrator.get_plan(&plan.id)?;
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/updates/version | Version info for all cluster nodes |
| POST | /api/v1/updates/plan | Create an update plan |
| POST | /api/v1/updates/execute | Execute a pending update plan |
| GET | /api/v1/updates/status/:plan_id | Get update plan status |
| GET | /api/v1/updates/history | List all update plans |
All endpoints require authentication.
Tests
634 tests (workspace total) covering version tracking, binary operations, health checks, orchestration, and rollback.