guestguard/docs/RUNBOOK_RESTORE.md
Kwaku Danso 59b8781659 feat: ship Tier 1 — auth, authz, rate limits, real notifications, CSV import, billing, backups/DR, privacy
Closes every block in docs/TIER1_PLAN.md from the Claude-scope side. The
homelab / cloud setup steps (SES verification, restore drill, lawyer-
drafted ToS) remain operator-owned but are unblocked.

Block A — Authentication
- Migration 0003: password_hash, email_verified, email_verification_tokens,
  password_reset_tokens, refresh_tokens (with replaced_by family chain).
- Bcrypt hasher, HS256 JWT signer, single-use refresh tokens with rotation
  + replay-detection (revokes the family on reuse).
- /auth/signup, /login, /refresh, /logout, /verify-email,
  /forgot-password, /reset-password — enumeration-safe.
- requireAuth middleware + GET /me.
- Frontend useAuth/useApi with auto-refresh-on-401, login/signup/verify/
  forgot/reset pages, route-guard middleware.

Block B — Authorisation
- EventRepo.GetForHost; Update/Delete scoped by host_id.
- All host routes behind requireAuth + ownership; cross-tenant returns
  404 (no enumeration). ?host_id removed.
- WS auth via short-lived single-use tickets (POST /auth/ws-ticket).
- Tests: TestCrossTenantIsolation — 9 probes.

Block C — Rate limiting
- Redis sliding-window via Lua (atomic ZADD+ZCARD+PEXPIRE).
- Per-route limits matching the plan (signup IP, login IP+email, RSVP/
  access by token, events/guests/tokens by user_id).
- 429 with Retry-After header and JSON body.
- Auth lockout: 5 failed logins → account locked, only password reset
  clears it.
- Frontend: useErrMessage normalises 429 + locked messaging.

Block D — Real notifications
- Migration 0004: provider_message_id, bounce_type, complained columns
  + unsubscribes (CITEXT) suppression table.
- Branded HTML + plaintext templates for verification, reset, invitation,
  confirmation, reminder. Per-page templates avoid html/template's
  contextual-escape collisions.
- Senders: SESv2, Twilio (SMS), SMTP (Mailpit-friendly), Resend HTTP.
- PickEmailSender priority Resend > SMTP > SES > Log — system boots
  cleanly in dev with Mailpit; production flips one env var.
- Webhook endpoints (Twilio status + SES SNS) — bounces add to suppression;
  signature verification stubbed pending creds.
- Auto-send: POST /tokens publishes invitation.send; notifier renders +
  delivers via the configured backend; suppression list honoured.
- Bulk + per-row invitation flow: POST /events/{id}/guests/invitations/bulk
  returns per-guest tokens so phone-only guests can be SMS'd manually.
- Unsubscribe: signed HMAC token (no TTL) + /unsubscribe/[token] page.
- WhatsApp Option A+: wa.me click-to-chat wizard with per-guest progress
  tracking, isLikelyE164 validation, edit-from-wizard.
- Token rotate (POST /tokens/rotate) invalidates the old URL — used by
  the regenerate-link flow.
- Mailpit added to docker-compose for dev inbox.

Block E — CSV import
- Streaming parser: tolerant header detection, UTF-8 BOM + UTF-16 LE/BE
  decoding, row-level validation, 5,000-row cap.
- Strict E.164 phone validation with helpful error message.
- POST /preview + /import + GET /template; preview UI on event page;
  atomic per-batch with dedup on existing emails.

Phone capture across UI
- PhoneInput component: country picker (~50 ISO codes) + national input +
  live E.164 preview + inline length validation.
- Used in Add Guest and Edit Guest modals. Smart paste-handling extracts
  country code from full E.164 strings.

Block F — Billing (Stripe)
- Migration 0005: subscriptions table (user_id → tier/status/period_end +
  Stripe customer/sub ids). Partial unique index keeps one granting sub
  per user.
- internal/billing: Tier + Limits model (Free 1/50, Pro 10/1000, Business
  ∞/5000), Stripe SDK wrapper with IgnoreAPIVersionMismatch for newer
  account API versions.
- /billing/checkout-session, /billing/portal, /billing/status,
  /webhooks/stripe (signature-verified, lifecycle events).
- Tier enforcement: 402 on POST /events, /guests, /import with
  {error, reason, tier, used, limit, upgrade_url} body.
- Frontend: useBilling composable, /dashboard/billing page (current plan,
  usage bars, tier cards), global UpgradeModal triggered by useApi's
  402 interceptor.
- Customer portal kept for self-service cancel/payment-method changes.

Block G — Backups & DR (application side)
- Every migration has a tested .down.sql.
- TestMigrationRoundtrip applies all ups → all downs → all ups against a
  fresh container; catches asymmetric down migrations.
- cmd/restore-verify: 28-check post-restore invariant tool (schema
  presence, no orphans across 10 FK relationships, email uniqueness,
  single-active subscription, row-count snapshot).
- docs/RUNBOOK_RESTORE.md: 9-step restore procedure with RTO/RPO
  targets, drill instructions, rollback path.

Block H — Privacy compliance (application side)
- Migration 0006: deleted_at + terms_accepted_at + privacy_policy_accepted_at
  on users. Partial index on email for live-only uniqueness.
- GET /me/data-export — synchronous JSON dump (user, events, guests,
  tokens, rsvps, access_logs, notifications).
- DELETE /me — soft-delete with PII scrub + refresh-token revocation;
  re-signup with same email works.
- POST /me/accept-terms — idempotent consent recording.
- Frontend /privacy + /terms placeholder pages with substantive (pending
  legal review) copy; footer links; signup terms checkbox; TermsGateModal
  for accounts created before the rollout; export + delete buttons on
  /dashboard/billing.

Tests
- All migrations verified up/down/up.
- Integration suite: TestE2EHappyPath, TestAuthFlow, TestCrossTenantIsolation,
  TestRateLimitSignup, TestLoginLockout, TestUnsubscribeFlow,
  TestSESBounceWebhook, TestTwilioStatusWebhook, TestCsvImportFlow,
  TestCsvImportAtomicRollback, TestBulkIssueInvitations, TestBulkIssueExplicitSubset,
  TestTokenIssuePublishesInvitation, TestTokenIssueWithoutGuestEmailSkipsInvitation,
  TestGuestUpdate, TestGuestDelete, TestTokenRotate, TestSMTPSenderAgainstMailpit,
  TestFreeTierEventLimit, TestFreeTierGuestLimit, TestBusinessTierBypassesLimits,
  TestDataExport, TestDeleteMe, TestAcceptTerms, TestMigrationRoundtrip.
  Full suite runs in ~120s against real Postgres + NATS + Redis + Mailpit.
- Unit suite green across internal/auth, internal/csvimport,
  internal/notification, internal/ratelimit, internal/domain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:54:22 +01:00


Runbook — Postgres restore

This is the procedure to bring GuestGuard back from a Postgres backup after data loss. It assumes the infra side of Block G (pg_basebackup + WAL archiving to S3, daily logical dumps, cross-region replication) is already in place — see the homelab repo for those.

The application side — migration down-scripts, the restore-verify tool, and this document — lives here in the GuestGuard repo so it ships in lockstep with the schema.


Targets

| Metric | Target |
| --- | --- |
| RTO (recovery time objective) | ≤ 1 hour from "go" decision to traffic-serving |
| RPO (recovery point objective) | ≤ 5 minutes of data loss (WAL ships every 60s, S3 PUT every 5min) |

If RTO is going to slip past 1 hour, escalate per the comms plan in docs/INCIDENT_RESPONSE.md (infra repo).
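
As a rough illustration of how the achieved figures could be compared with the targets after an incident, here is a minimal sketch. The function name `within_target` and the choice of epoch-second inputs are assumptions made for this example, not part of the GuestGuard tooling.

```shell
# Hypothetical helper: compare an achieved recovery figure with its target.
# Inputs are Unix epoch seconds so the arithmetic stays portable.
RTO_TARGET_SECONDS=3600   # 1 hour, from the Targets table
RPO_TARGET_SECONDS=300    # 5 minutes, from the Targets table

# within_target LABEL START_EPOCH END_EPOCH LIMIT_SECONDS
# Prints the elapsed time and returns non-zero when the limit is exceeded.
within_target() {
  wt_elapsed=$(( $3 - $2 ))
  if [ "$wt_elapsed" -le "$4" ]; then
    echo "$1: ${wt_elapsed}s (within ${4}s target)"
  else
    echo "$1: ${wt_elapsed}s (exceeds ${4}s target)"
    return 1
  fi
}

# Example: "go" decision at epoch 1000, traffic serving again at epoch 3530
# within_target RTO 1000 3530 "$RTO_TARGET_SECONDS"
```

For RPO, the start would be the timestamp of the last recovered transaction and the end the moment the incident began.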

When to invoke this

  • Primary Postgres is unreachable AND the standby has also failed
  • Logical corruption discovered (e.g., a bad migration deleted rows)
  • Region-wide outage at the primary's location
  • A "what if we restored last Tuesday" drill (see Drill)

If only the primary is unreachable and the standby is healthy, promote the standby (separate runbook). Don't use this procedure unnecessarily — restores are expensive.

Prerequisites

Before starting:

  • Decision authority has approved the restore (CTO or on-call lead)
  • Read access to the S3 backup bucket: s3://guestguard-pg-backups
  • psql, pg_basebackup, wal-g (or chosen WAL tool) installed
  • Empty target Postgres instance provisioned (Kubernetes StatefulSet, RDS, or homelab box; same major version as the backup)
  • GG_DATABASE_URL env var ready for the new instance
  • Maintenance page deployed to the frontend (/dashboard returns 503)
  • API + notifier pods scaled to 0 (kubectl scale --replicas=0)
  • This document open in another tab

Steps

1. Stop write traffic

# k8s
kubectl scale deployment/guestguard-api --replicas=0
kubectl scale deployment/guestguard-notifier --replicas=0

# Confirm no connections to the (broken) primary
kubectl exec -n postgres guestguard-pg-0 -- psql -U postgres -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='guestguard'"

If using docker-compose locally: docker compose stop api notifier.

2. Identify the recovery point

Pick the latest backup that's known-good. For corruption scenarios, this may mean going further back than the most recent dump.

# List the last 10 base backups (wal-g prints oldest first, so the newest is at the bottom)
wal-g backup-list 2>/dev/null | tail -10

# Pick the backup name (e.g. base_000000010000000000000007) and decide
# the recovery target time if doing point-in-time recovery

For corruption: pick the latest backup created before the corrupting event. For "ransomware / bad migration", probably 1-2 days back.
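
Picking "the latest backup before the event" can be sketched as a small filter over the backup listing. This assumes each row starts with the backup name followed by an ISO-8601 UTC modification time, which may not match the column layout of your wal-g version; `latest_backup_before` is a hypothetical helper name.

```shell
# latest_backup_before CUTOFF
# Reads backup-list style rows on stdin (assumed columns: name, ISO-8601 UTC
# modification time, ...) and prints the name of the newest backup taken
# strictly before CUTOFF. Timestamps in the same ISO-8601 UTC format sort
# lexicographically, so plain string comparison is enough.
latest_backup_before() {
  awk -v cutoff="$1" '
    $1 ~ /^base_/ && $2 < cutoff && $2 > best_time { best_time = $2; best = $1 }
    END { if (best == "") exit 1; print best }
  '
}

# Example: the corrupting event happened at 14:30 UTC on 2026-05-13
# wal-g backup-list | latest_backup_before 2026-05-13T14:30:00Z
```

The function exits non-zero when no backup predates the cutoff, so a calling script can stop instead of restoring the wrong base.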

3. Restore the base backup

# Postgres must be stopped while the data directory is restored.
# Replace BACKUP_NAME with the chosen base
wal-g backup-fetch /var/lib/postgresql/data BACKUP_NAME

# Configure recovery target (omit both recovery_target_* lines for "latest")
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'wal-g wal-fetch "%f" "%p"'
recovery_target_time = '2026-05-13 14:30:00 UTC'  # set if doing PITR
recovery_target_action = 'promote'  # default is 'pause', which leaves step 4 stuck in recovery
EOF

touch /var/lib/postgresql/data/recovery.signal

4. Start Postgres and let it replay WAL

systemctl start postgresql   # or your equivalent

# Watch the log — should see "consistent recovery state reached"
tail -f /var/log/postgresql/postgresql-16-main.log

Wait until recovery completes and Postgres is in normal (not recovery) mode:

psql -c "SELECT pg_is_in_recovery()"
# Expected: f  (false)
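
Rather than re-running the query by hand, the wait can be scripted. The following is a minimal polling sketch; `wait_for_recovery_end` is a hypothetical helper, and the timeout and interval values in the example are placeholders rather than recommendations.

```shell
# wait_for_recovery_end TIMEOUT_SECONDS INTERVAL_SECONDS PROBE...
# Runs PROBE repeatedly until it prints "f" (not in recovery) or the timeout passes.
wait_for_recovery_end() {
  wfr_timeout=$1; wfr_interval=$2; shift 2
  wfr_waited=0
  while :; do
    # A failing probe (e.g. server not accepting connections yet) counts as "not done"
    wfr_state=$("$@" 2>/dev/null) || wfr_state=""
    if [ "$wfr_state" = "f" ]; then
      echo "recovery finished after ~${wfr_waited}s"
      return 0
    fi
    if [ "$wfr_waited" -ge "$wfr_timeout" ]; then
      echo "still in recovery after ${wfr_waited}s" >&2
      return 1
    fi
    sleep "$wfr_interval"
    wfr_waited=$((wfr_waited + wfr_interval))
  done
}

# Example (psql -At prints the bare value, "t" or "f"):
# wait_for_recovery_end 1800 10 psql "$GG_DATABASE_URL" -Atc "SELECT pg_is_in_recovery()"
```

The waited time only counts the sleeps, so it is an approximation of wall-clock time.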

5. Verify the restored database

This is the critical gate before any application traffic touches it.

# Build the verifier (only needed once)
go build -o restore-verify ./cmd/restore-verify

# Run it against the restored instance
GG_DATABASE_URL='postgres://guestguard:CHANGE_ME@RESTORED_HOST:5432/guestguard?sslmode=require' \
  ./restore-verify --verbose

Expected output: OK: all N checks passed. The tool checks:

  • All expected tables exist (users, events, guests, tokens, rsvps, etc.)
  • All migrations recorded in schema_migrations
  • No orphan rows across the ten FK relationships we care about
  • users.email is still unique (case-insensitive)
  • No more than one "granting" subscription per user
  • Row-count snapshot (for sanity, not pass/fail)

If any check fails: STOP. The restore is corrupt — go back to step 2 with an earlier backup OR escalate.
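
To make the "no orphan rows" check concrete, here is what a manual spot check could look like. This is not how restore-verify is implemented; it is a sketch that assumes the parent table's primary key is called `id`, and the `guests.event_id` column in the example is a guess at the schema.

```shell
# orphan_count_sql CHILD_TABLE FK_COLUMN PARENT_TABLE
# Prints a query counting child rows whose foreign key points at no parent row.
# Assumes the parent primary key column is named "id".
orphan_count_sql() {
  printf 'SELECT count(*) FROM %s c LEFT JOIN %s p ON p.id = c.%s WHERE c.%s IS NOT NULL AND p.id IS NULL;\n' \
    "$1" "$3" "$2" "$2"
}

# Example, assuming guests.event_id references events.id:
# psql "$GG_DATABASE_URL" -Atc "$(orphan_count_sql guests event_id events)"
# A result of 0 means no orphans for that relationship.
```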

6. Apply pending migrations

If the backup is from before a recent migration that shipped to prod, catch up:

# The API auto-migrates on boot, but we want to apply migrations
# before traffic, so kick a one-off:
docker run --rm \
  -e GG_DATABASE_URL='postgres://...' \
  ghcr.io/alchemistkay/guestguard-api:latest \
  /app/api --migrate-only

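# Or via psql, applying each pending .up.sql in order if you don't have the image.
# ON_ERROR_STOP makes psql exit non-zero on a failed statement so the loop stops.
# Extend the list with any later migrations and skip ones already applied.
for m in 0001_init 0002_rsvps 0003_auth 0004_notifications_d 0005_billing; do
  psql "$GG_DATABASE_URL" -v ON_ERROR_STOP=1 -f "internal/storage/migrations/${m}.up.sql" || break
done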

Run restore-verify again after migrations to confirm everything's still coherent.
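
Working out which migrations are pending can be sketched as a comparison between the highest applied version and the files on disk. This assumes the highest applied version is available as an integer (how schema_migrations stores it depends on the migration tool) and that file names start with a zero-padded numeric prefix followed by an underscore; `pending_migrations` is a hypothetical helper.

```shell
# pending_migrations APPLIED_VERSION MIGRATIONS_DIR
# Prints the .up.sql files whose numeric prefix is greater than APPLIED_VERSION.
# The glob expands in lexical order, which is ascending for zero-padded prefixes.
pending_migrations() {
  pm_applied=$1; pm_dir=$2
  for pm_file in "$pm_dir"/*.up.sql; do
    [ -e "$pm_file" ] || continue          # no matches: the glob stays literal
    pm_name=${pm_file##*/}                 # strip the directory
    pm_version=${pm_name%%_*}              # keep the numeric prefix, e.g. 0003
    # Strip leading zeros so a prefix like 0008 is always read as decimal
    pm_version=${pm_version#"${pm_version%%[!0]*}"}
    if [ "${pm_version:-0}" -gt "$pm_applied" ]; then
      echo "$pm_file"
    fi
  done
}

# Example: the restored database is at version 3
# pending_migrations 3 internal/storage/migrations
```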

7. Bring the API back up

kubectl scale deployment/guestguard-api --replicas=2
kubectl scale deployment/guestguard-notifier --replicas=1

# Watch the logs — expect "http server starting" + "billing enabled via stripe"
kubectl logs -f deployment/guestguard-api --tail=20

8. Smoke test

  • Hit /health → 200
  • Sign in as a known test user → dashboard loads, recent events visible
  • Create a new event → succeeds, appears in list
  • Tail API logs for 5 minutes → no 5xx storms
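
The first smoke-test item can be scripted along these lines. It is a sketch only: `check_health` and the `CURL` override variable are hypothetical names introduced here so the HTTP client can be swapped out, and the base URL in the example is a placeholder.

```shell
# check_health BASE_URL
# Prints the HTTP status of BASE_URL/health and fails unless it is 200.
# CURL may name an alternative command; it defaults to curl.
check_health() {
  ch_code=$(${CURL:-curl} -s -o /dev/null -w '%{http_code}' --max-time 5 "$1/health") || ch_code=000
  echo "health: $ch_code"
  [ "$ch_code" = "200" ]
}

# Example (placeholder host):
# check_health https://api.example.invalid || echo "do not re-enable traffic yet"
```

The sign-in and create-event checks are left manual here because they depend on test credentials.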

9. Re-enable traffic

  • Remove the maintenance page from the frontend
  • Announce restoration in the status channel + status page
  • Note actual RTO + RPO achieved for the post-mortem

Drill procedure

Run this monthly with no real outage to keep the team's hands warm.

  1. Provision a throwaway Postgres instance (postgres-drill-YYYYMM).
  2. Run steps 2-5 against it (skip 1, 7, 8, 9; production stays untouched).
  3. restore-verify MUST pass.
  4. Bonus: spin up an API pointed at the drill DB on a one-off port and walk through the smoke-test scenarios.
  5. Tear down the drill DB.
  6. Log the time taken in docs/RESTORE_DRILL_LOG.md (or wherever your team tracks operational drills).
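
Recording the drill result can be as small as appending a row to the log. The table layout below is an assumption, since the format of docs/RESTORE_DRILL_LOG.md is not defined in this runbook, and `log_drill` is a hypothetical helper.

```shell
# log_drill LOG_FILE DATE DURATION_MINUTES RESULT
# Appends one row to the drill log, writing a header first if the file is empty or missing.
log_drill() {
  ld_file=$1
  if [ ! -s "$ld_file" ]; then
    printf '| Date | Duration (min) | restore-verify |\n| --- | --- | --- |\n' > "$ld_file"
  fi
  printf '| %s | %s | %s |\n' "$2" "$3" "$4" >> "$ld_file"
}

# Example:
# log_drill docs/RESTORE_DRILL_LOG.md 2026-06-01 42 pass
```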

If any step fails during a drill, the production restore procedure is unreliable. Treat it as a P1 to fix before the next real failure.

Rollback (if restore is wrong)

If you complete the restore and discover it's the wrong data:

  1. Scale API back to 0
  2. Find the next earlier backup
  3. Stop Postgres and clear the data directory on the restored instance (a base backup replaces the whole cluster, so dropping one database is not enough)
  4. Repeat from step 3

Never point production at a known-bad restored DB hoping to fix it later — the API will write new data on top of the corruption and the salvage gets exponentially harder.

Migration down-scripts

Every .up.sql in internal/storage/migrations/ has a matching .down.sql. They're tested as part of CI and not exercised during normal restores (the up-only sequence in step 6 is the path used). They exist for:

  • Drill scenarios where you want to "rewind" the schema
  • Emergency rollback of a bad shipped migration

Down-script integrity: run the TestMigrationRoundtrip integration test, which applies every migration up → down → up against a fresh container.

Application config that supports restored DBs

GG_DATABASE_URL is the single source of truth — no hardcoded hostnames anywhere in the codebase. Verified by:

grep -rE 'postgres://|host=.*5432' --include='*.go' . | grep -v _test.go | grep -v config.go
# Expected: (empty)

If anything surfaces from that grep, file a bug — it'll bite the next restore.
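
The same grep can be turned into a check that fails loudly, for example as a CI step. This is a sketch; `check_no_hardcoded_db` is a hypothetical name, and it applies the same filters as the command above.

```shell
# check_no_hardcoded_db ROOT_DIR
# Fails if any non-test, non-config Go file under ROOT_DIR contains a database address.
check_no_hardcoded_db() {
  # "|| true" because grep exits non-zero when nothing matches, which is the good case
  cnh_hits=$(grep -rE 'postgres://|host=.*5432' --include='*.go' "$1" \
    | grep -v '_test\.go' | grep -v 'config\.go') || true
  if [ -n "$cnh_hits" ]; then
    echo "hardcoded database addresses found:" >&2
    echo "$cnh_hits" >&2
    return 1
  fi
  echo "no hardcoded database addresses"
}

# Example:
# check_no_hardcoded_db . || exit 1
```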

Escalation

| Step fails | Who to call |
| --- | --- |
| Steps 1-4 | On-call infra lead |
| Step 5 (restore-verify) | On-call backend lead + DBA |
| Steps 7-8 (app won't start / smoke fails) | On-call backend lead |
| Drill failure | File P1 ticket, link the drill log |

Change log

| Date | Author | Change |
| --- | --- | --- |
| 2026-05-16 | kay | initial version (Block G) |