guestguard/docs/RUNBOOK_RESTORE.md

# Runbook — Postgres restore

This is the procedure to bring GuestGuard back from a Postgres backup
after data loss. It assumes the infra side of Block G (`pg_basebackup` +
WAL archiving to S3, daily logical dumps, cross-region replication) is
already in place — see the homelab repo for those.

The application side — migration down-scripts, the [`restore-verify`](../cmd/restore-verify/main.go)
tool, and this document — lives here in the GuestGuard repo so it ships
in lockstep with the schema.

---

## Targets

| Metric | Target |
|---|---|
| RTO (recovery time objective) | ≤ 1 hour from "go" decision to traffic-serving |
| RPO (recovery point objective) | ≤ 5 minutes of data loss (WAL ships every 60s, S3 PUT every 5min) |

If RTO is going to slip past 1 hour, escalate per the comms plan in `docs/INCIDENT_RESPONSE.md` (infra repo).

## When to invoke this

- Primary Postgres is unreachable AND the standby has also failed
- Logical corruption discovered (e.g., a bad migration deleted rows)
- Region-wide outage at the primary's location
- A "what if we restored last Tuesday" drill (see [Drill](#drill-procedure))

If only the primary is unreachable and the standby is healthy, promote
the standby (separate runbook). Don't use this procedure unnecessarily —
restores are expensive.

## Prerequisites

Before starting:

- [ ] Decision authority has approved the restore (CTO or on-call lead)
- [ ] Read access to the S3 backup bucket: `s3://guestguard-pg-backups`
- [ ] `psql`, `pg_basebackup`, `wal-g` (or chosen WAL tool) installed
- [ ] Empty target Postgres instance provisioned (Kubernetes StatefulSet,
      RDS, or homelab box; same major version as the backup)
- [ ] `GG_DATABASE_URL` env var ready for the new instance
- [ ] Maintenance page deployed to the frontend (`/dashboard` returns 503)
- [ ] API + notifier pods scaled to 0 (`kubectl scale --replicas=0`)
- [ ] This document open in another tab

## Steps

### 1. Stop write traffic

```bash
# k8s
kubectl scale deployment/guestguard-api --replicas=0
kubectl scale deployment/guestguard-notifier --replicas=0

# Confirm no connections to the (broken) primary
kubectl exec -n postgres guestguard-pg-0 -- psql -U postgres -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='guestguard'"
```

If using docker-compose locally: `docker compose stop api notifier`.

### 2. Identify the recovery point

Pick the latest backup that's known-good. For corruption scenarios,
this may mean going further back than the most recent dump.

```bash
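# List the ten newest base backups. wal-g normally prints oldest first,
# so the most recent backup is the last line. stderr stays visible so
# credential or bucket errors are not silently swallowed.
wal-g backup-list | tail -10

# Pick the backup name (e.g. base_000000010000000000000007) and decide
# the recovery target (timestamp or LSN) if doing point-in-time recovery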
```

For corruption: pick the latest backup created **before** the corrupting
event. For "ransomware / bad migration", probably 1–2 days back.
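
For the corruption case, the "latest backup before the event" rule can be applied mechanically. The helper below is a sketch, not part of the repo: `latest_backup_before` is a hypothetical name, and it assumes the `wal-g backup-list` output has the backup name in column 1 and an RFC 3339 UTC modified time in column 2.

```bash
# latest_backup_before CUTOFF
# Reads `wal-g backup-list` output on stdin and prints the name of the
# newest backup whose modified time is strictly earlier than CUTOFF.
# Assumption: column 2 looks like 2026-05-13T02:00:11Z, so plain string
# comparison orders rows chronologically. CUTOFF must use the same format.
latest_backup_before() {
  awk -v cutoff="$1" '
    BEGIN { best_time = ""; best = "" }
    $2 !~ /^[0-9][0-9][0-9][0-9]-/ { next }   # skip header and blank lines
    $2 < cutoff && $2 > best_time { best_time = $2; best = $1 }
    END { if (best == "") exit 1; print best }
  '
}
```

Usage: `wal-g backup-list | latest_backup_before 2026-05-13T14:00:00Z`. A non-zero exit means no backup predates the cutoff, which is an escalation, not something to work around.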

### 3. Restore the base backup

```bash
# Replace BACKUP_NAME with the chosen base
wal-g backup-fetch /var/lib/postgresql/data BACKUP_NAME

# Configure recovery target (omit both recovery_target_* lines for "latest")
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'wal-g wal-fetch "%f" "%p"'
recovery_target_time = '2026-05-13 14:30:00 UTC'  # set if doing PITR
recovery_target_action = 'promote'  # the default, 'pause', would leave step 4 waiting indefinitely
EOF

touch /var/lib/postgresql/data/recovery.signal
```

### 4. Start Postgres and let it replay WAL

```bash
systemctl start postgresql   # or your equivalent

# Watch the log — should see "consistent recovery state reached"
tail -f /var/log/postgresql/postgresql-16-main.log
```

Wait until recovery completes and Postgres is in normal (not recovery)
mode:

```bash
psql -c "SELECT pg_is_in_recovery()"
# Expected: f  (false)
```
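
WAL replay can take a while, so rather than re-running the query by hand, the wait can be scripted. This is a minimal polling sketch: `wait_for_promotion` is a hypothetical helper, and it assumes `psql` reaches the restored instance with the same connection settings as the command above.

```bash
# wait_for_promotion [ATTEMPTS] [INTERVAL_SECONDS]
# Polls pg_is_in_recovery() until it returns "f" (normal mode).
# Returns 0 once recovery has finished, 1 if the attempts run out.
wait_for_promotion() {
  local attempts="${1:-120}" interval="${2:-5}" state i=1
  while [ "$i" -le "$attempts" ]; do
    # -At prints the bare value ("t" or "f") without headers or padding.
    # A connection failure (server still starting) yields an empty string
    # and is treated as "not ready yet".
    state="$(psql -At -c 'SELECT pg_is_in_recovery()' 2>/dev/null || true)"
    if [ "$state" = "f" ]; then
      echo "recovery finished after $i check(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "still in recovery after $attempts checks" >&2
  return 1
}
```

With the defaults this waits up to ten minutes (`120 × 5s`). Keep the RTO budget in mind when picking larger values.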

### 5. Verify the restored database

This is the critical gate before any application traffic touches it.

```bash
# Build the verifier (only needed once)
go build -o restore-verify ./cmd/restore-verify

# Run it against the restored instance
GG_DATABASE_URL='postgres://guestguard:CHANGE_ME@RESTORED_HOST:5432/guestguard?sslmode=require' \
  ./restore-verify --verbose
```

Expected output: `OK: all N checks passed`. The tool checks:

- All expected tables exist (users, events, guests, tokens, rsvps, etc.)
- All migrations recorded in `schema_migrations`
- No orphan rows across the ten FK relationships we care about
- `users.email` is still unique (case-insensitive)
- No more than one "granting" subscription per user
- Row-count snapshot (for sanity, not pass/fail)

**If any check fails: STOP.** The restore is corrupt — go back to step 2
with an earlier backup OR escalate.
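
When an orphan-row check fails, it can help to spot-check one relationship by hand before escalating. The sketch below only generates the SQL and is not how `restore-verify` is implemented. `orphan_count_sql` is a hypothetical helper, and the column names in the usage line (`guest_id`, `id`) are assumptions to be replaced with the real schema.

```bash
# orphan_count_sql CHILD_TABLE FK_COLUMN PARENT_TABLE [PARENT_KEY]
# Prints a query that counts child rows whose foreign key points at a
# parent row that no longer exists. PARENT_KEY defaults to "id".
orphan_count_sql() {
  local child="$1" fk="$2" parent="$3" pk="${4:-id}"
  printf 'SELECT count(*) FROM %s c LEFT JOIN %s p ON c.%s = p.%s WHERE c.%s IS NOT NULL AND p.%s IS NULL;\n' \
    "$child" "$parent" "$fk" "$pk" "$fk" "$pk"
}
```

Usage: `psql "$GG_DATABASE_URL" -At -c "$(orphan_count_sql rsvps guest_id guests)"`. Any result other than `0` confirms the failure for that relationship.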

### 6. Apply pending migrations

If the backup is from before a recent migration that shipped to prod,
catch up:

```bash
# The API auto-migrates on boot, but we want to apply migrations
# before traffic, so kick a one-off:
docker run --rm \
  -e GG_DATABASE_URL='postgres://...' \
  ghcr.io/alchemistkay/guestguard-api:latest \
  /app/api --migrate-only

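# Or via psql if you don't have the image. Trim the list to the migrations
# not yet recorded in schema_migrations and apply them in order.
# ON_ERROR_STOP makes psql exit non-zero on a SQL error, so `|| break`
# actually stops the loop.
for m in 0001_init 0002_rsvps 0003_auth 0004_notifications_d 0005_billing; do
  psql "$GG_DATABASE_URL" -v ON_ERROR_STOP=1 \
    -f "internal/storage/migrations/${m}.up.sql" || break
done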
```

Run `restore-verify` again after migrations to confirm everything's
still coherent.
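
If the psql route is used, a hard-coded list is easy to get wrong under pressure. The sketch below applies only the `.up.sql` files newer than a given version. `apply_migrations_after` is a hypothetical helper; it assumes the numeric filename prefix (`0001`, `0002`, ...) is the migration version and that you have already read the last applied version from `schema_migrations`. Whether manual runs are recorded in `schema_migrations` depends on the migration tooling, so check that table afterwards.

```bash
# apply_migrations_after LAST_APPLIED [DIR]
# Applies every DIR/*.up.sql whose numeric prefix is greater than
# LAST_APPLIED, in filename order, stopping at the first failure.
apply_migrations_after() {
  local last="$1" dir="${2:-internal/storage/migrations}" f version
  for f in "$dir"/*.up.sql; do
    [ -e "$f" ] || continue          # the glob matched nothing
    version="$(basename "$f")"
    version="${version%%_*}"         # "0004_foo.up.sql" -> "0004"
    if [ "$version" -le "$last" ]; then
      continue                       # already present in the backup
    fi
    echo "applying $f"
    # ON_ERROR_STOP makes psql exit non-zero on the first SQL error.
    psql "$GG_DATABASE_URL" -v ON_ERROR_STOP=1 -f "$f" || return 1
  done
}
```

Usage: `apply_migrations_after 0003` applies `0004_*` and later from the default directory.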

### 7. Bring the API back up

```bash
kubectl scale deployment/guestguard-api --replicas=2
kubectl scale deployment/guestguard-notifier --replicas=1

# Watch the logs — expect "http server starting" + "billing enabled via stripe"
kubectl logs -f deployment/guestguard-api --tail=20
```

### 8. Smoke test

- [ ] Hit `/health` → 200
- [ ] Sign in as a known test user → dashboard loads, recent events visible
- [ ] Create a new event → succeeds, appears in list
- [ ] Tail API logs for 5 minutes → no 5xx storms
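
The first checklist item is easy to script so it can be repeated while the pods settle. A minimal sketch: `check_health` is a hypothetical helper and the base URL in the usage line is a placeholder, not the real endpoint.

```bash
# check_health BASE_URL
# Succeeds only if GET BASE_URL/health answers with HTTP 200.
check_health() {
  local base_url="$1" code
  # -w '%{http_code}' prints just the status code; the body is discarded.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$base_url/health" || true)"
  if [ "$code" = "200" ]; then
    echo "health OK"
    return 0
  fi
  echo "health check failed: HTTP ${code:-none}" >&2
  return 1
}
```

Usage: `check_health https://api.example.invalid`. The other smoke-test items still need a human in a browser.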

### 9. Re-enable traffic

- [ ] Remove the maintenance page from the frontend
- [ ] Announce restoration in the status channel + status page
- [ ] Note actual RTO + RPO achieved for the post-mortem
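
To record the achieved RTO and RPO against the targets at the top of this document, something like the following works. It is a sketch that assumes GNU `date` (the `-d` flag) and UTC timestamps in `YYYY-MM-DD hh:mm:ss` form. Both function names are hypothetical.

```bash
# minutes_between START END
# Prints whole minutes elapsed between two UTC timestamps.
minutes_between() {
  local start end
  start="$(date -u -d "$1" +%s)" || return 1
  end="$(date -u -d "$2" +%s)" || return 1
  echo $(( (end - start) / 60 ))
}

# report_against_target LABEL ACTUAL_MIN TARGET_MIN
# Prints one line suitable for pasting into the post-mortem.
report_against_target() {
  local label="$1" actual="$2" target="$3"
  if [ "$actual" -le "$target" ]; then
    echo "$label: $actual min (within $target min target)"
  else
    echo "$label: $actual min (MISSED $target min target)"
  fi
}
```

For RTO, pass the "go" decision time and the time traffic resumed, with a target of 60. For RPO, pass the recovery point and the time of the failure, with a target of 5.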

## Drill procedure

Run this monthly with no real outage to keep the team's hands warm.

1. Provision a throwaway Postgres instance (`postgres-drill-YYYYMM`).
2. Run steps 2–5 against it (skip 1, 7, 8, 9 — production stays untouched).
3. `restore-verify` MUST pass.
4. Bonus: spin up an API pointed at the drill DB on a one-off port and
   walk through the smoke-test scenarios.
5. Tear down the drill DB.
6. Log the time taken in `docs/RESTORE_DRILL_LOG.md` (or wherever your
   team tracks operational drills).

If any step fails during a drill, the production restore procedure is
**unreliable**; treat it as a P1 to fix before the next real failure.

## Rollback (if restore is wrong)

If you complete the restore and discover it's the wrong data:

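1. Scale API back to 0
2. Find the next earlier backup
3. Stop Postgres on the restored instance and clear its data directory
   (a base backup replaces the whole cluster, so dropping the database
   is not enough)
4. Repeat from [step 3 of the restore](#3-restore-the-base-backup)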

**Never** point production at a known-bad restored DB hoping to fix it
later — the API will write new data on top of the corruption and the
salvage gets exponentially harder.

## Migration down-scripts

Every `.up.sql` in `internal/storage/migrations/` has a matching
`.down.sql`. They're tested as part of CI and not exercised during
normal restores (the up-only sequence in step 6 is the path used).
They exist for:

- Drill scenarios where you want to "rewind" the schema
- Emergency rollback of a bad shipped migration

Down-script integrity: run the `TestMigrationRoundtrip` integration
test, which applies every migration up → down → up against a fresh
container.

## Application config that supports restored DBs

`GG_DATABASE_URL` is the single source of truth — no hardcoded
hostnames anywhere in the codebase. Verified by:

```bash
grep -rE 'postgres://|host=.*5432' --include='*.go' . | grep -v _test.go | grep -v config.go
# Expected: (empty)
```

If anything surfaces from that grep, file a bug — it'll bite the next
restore.

## Escalation

| Step fails | Who to call |
|---|---|
| Steps 1–4 | On-call infra lead |
| Step 5 (`restore-verify`) | On-call backend lead + DBA |
| Steps 7–8 (app won't start / smoke fails) | On-call backend lead |
| Drill failure | File P1 ticket, link the drill log |

## Change log

| Date | Author | Change |
|---|---|---|
| 2026-05-16 | kay | initial version (Block G) |