# Runbook: Postgres restore

This is the procedure to bring GuestGuard back from a Postgres backup after data loss. It assumes the infra side of Block G (`pg_basebackup` + WAL archiving to S3, daily logical dumps, cross-region replication) is already in place; see the homelab repo for those.

The application side (migration down-scripts, the [`restore-verify`](../cmd/restore-verify/main.go) tool, and this document) lives here in the GuestGuard repo so it ships in lockstep with the schema.

---

## Targets

| Metric | Target |
|---|---|
| RTO (recovery time objective) | ≤ 1 hour from "go" decision to traffic-serving |
| RPO (recovery point objective) | ≤ 5 minutes of data loss (WAL ships every 60s, S3 PUT every 5min) |

If RTO is going to slip past 1 hour, escalate per the comms plan in `docs/INCIDENT_RESPONSE.md` (infra repo).

## When to invoke this

- Primary Postgres is unreachable AND the standby has also failed
- Logical corruption discovered (e.g., a bad migration deleted rows)
- Region-wide outage at the primary's location
- A "what if we restored last Tuesday" drill (see [Drill](#drill-procedure))

If only the primary is unreachable and the standby is healthy, promote the standby (separate runbook). Don't use this procedure unnecessarily: restores are expensive.

## Prerequisites

Before starting:

- [ ] Decision authority has approved the restore (CTO or on-call lead)
- [ ] Read access to the S3 backup bucket: `s3://guestguard-pg-backups`
- [ ] `psql`, `pg_basebackup`, `wal-g` (or chosen WAL tool) installed
- [ ] Empty target Postgres instance provisioned (Kubernetes StatefulSet, RDS, or homelab box; same major version as the backup)
- [ ] `GG_DATABASE_URL` env var ready for the new instance
- [ ] Maintenance page deployed to the frontend (`/dashboard` returns 503)
- [ ] API + notifier pods scaled to 0 (`kubectl scale --replicas=0`)
- [ ] This document open in another tab

## Steps

### 1. Stop write traffic

```bash
# k8s
kubectl scale deployment/guestguard-api --replicas=0
kubectl scale deployment/guestguard-notifier --replicas=0

# Confirm no connections to the (broken) primary
kubectl exec -n postgres guestguard-pg-0 -- psql -U postgres -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='guestguard'"
```

If using docker-compose locally: `docker compose stop api notifier`.

### 2. Identify the recovery point

Pick the latest backup that's known-good. For corruption scenarios, this may mean going further back than the most recent dump.

```bash
# List the 10 most recent base backups
wal-g backup-list 2>/dev/null | tail -10

# Pick the backup name (e.g. base_000000010000000000000007) and decide
# the LSN target if doing point-in-time recovery
```

For corruption: pick the latest backup created **before** the corrupting event. For ransomware or a bad migration, that is probably 1–2 days back.

### 3. Restore the base backup

```bash
# Replace BACKUP_NAME with the chosen base
wal-g backup-fetch /var/lib/postgresql/data BACKUP_NAME

# Configure recovery target (omit recovery_target_time for "latest")
cat >> /var/lib/postgresql/data/postgresql.conf <