guestguard/docs/RUNBOOK_RESTORE.md
Kwaku Danso 59b8781659 feat: ship Tier 1 — auth, authz, rate limits, real notifications, CSV import, billing, backups/DR, privacy
Closes every block in docs/TIER1_PLAN.md from the Claude-scope side. The
homelab / cloud setup steps (SES verification, restore drill, lawyer-
drafted ToS) remain operator-owned but are unblocked.

Block A — Authentication
- Migration 0003: password_hash, email_verified, email_verification_tokens,
  password_reset_tokens, refresh_tokens (with replaced_by family chain).
- Bcrypt hasher, HS256 JWT signer, single-use refresh tokens with rotation
  + replay-detection (revokes the family on reuse).
- /auth/signup, /login, /refresh, /logout, /verify-email,
  /forgot-password, /reset-password — enumeration-safe.
- requireAuth middleware + GET /me.
- Frontend useAuth/useApi with auto-refresh-on-401, login/signup/verify/
  forgot/reset pages, route-guard middleware.

Block B — Authorisation
- EventRepo.GetForHost; Update/Delete scoped by host_id.
- All host routes behind requireAuth + ownership; cross-tenant returns
  404 (no enumeration). ?host_id removed.
- WS auth via short-lived single-use tickets (POST /auth/ws-ticket).
- Tests: TestCrossTenantIsolation — 9 probes.

Block C — Rate limiting
- Redis sliding-window via Lua (atomic ZADD+ZCARD+PEXPIRE).
- Per-route limits matching the plan (signup IP, login IP+email, RSVP/
  access by token, events/guests/tokens by user_id).
- 429 with Retry-After header and JSON body.
- Auth lockout: 5 failed logins → account locked, only password reset
  clears it.
- Frontend: useErrMessage normalises 429 + locked messaging.

Block D — Real notifications
- Migration 0004: provider_message_id, bounce_type, complained columns
  + unsubscribes (CITEXT) suppression table.
- Branded HTML + plaintext templates for verification, reset, invitation,
  confirmation, reminder. Per-page templates avoid html/template's
  contextual-escape collisions.
- Senders: SESv2, Twilio (SMS), SMTP (Mailpit-friendly), Resend HTTP.
- PickEmailSender priority Resend > SMTP > SES > Log — system boots
  cleanly in dev with Mailpit; production flips one env var.
- Webhook endpoints (Twilio status + SES SNS) — bounces add to suppression;
  signature verification stubbed pending creds.
- Auto-send: POST /tokens publishes invitation.send; notifier renders +
  delivers via the configured backend; suppression list honoured.
- Bulk + per-row invitation flow: POST /events/{id}/guests/invitations/bulk
  returns per-guest tokens so phone-only guests can be SMS'd manually.
- Unsubscribe: signed HMAC token (no TTL) + /unsubscribe/[token] page.
- WhatsApp Option A+: wa.me click-to-chat wizard with per-guest progress
  tracking, isLikelyE164 validation, edit-from-wizard.
- Token rotate (POST /tokens/rotate) invalidates the old URL — used by
  the regenerate-link flow.
- Mailpit added to docker-compose for dev inbox.

Block E — CSV import
- Streaming parser: tolerant header detection, UTF-8 BOM + UTF-16 LE/BE
  decoding, row-level validation, 5,000-row cap.
- Strict E.164 phone validation with helpful error message.
- POST /preview + /import + GET /template; preview UI on event page;
  atomic per-batch with dedup on existing emails.

Phone capture across UI
- PhoneInput component: country picker (~50 ISO codes) + national input +
  live E.164 preview + inline length validation.
- Used in Add Guest and Edit Guest modals. Smart paste-handling extracts
  country code from full E.164 strings.

Block F — Billing (Stripe)
- Migration 0005: subscriptions table (user_id → tier/status/period_end +
  Stripe customer/sub ids). Partial unique index keeps one granting sub
  per user.
- internal/billing: Tier + Limits model (Free 1/50, Pro 10/1000, Business
  ∞/5000), Stripe SDK wrapper with IgnoreAPIVersionMismatch for newer
  account API versions.
- /billing/checkout-session, /billing/portal, /billing/status,
  /webhooks/stripe (signature-verified, lifecycle events).
- Tier enforcement: 402 on POST /events, /guests, /import with
  {error, reason, tier, used, limit, upgrade_url} body.
- Frontend: useBilling composable, /dashboard/billing page (current plan,
  usage bars, tier cards), global UpgradeModal triggered by useApi's
  402 interceptor.
- Customer portal kept for self-service cancel/payment-method changes.

Block G — Backups & DR (application side)
- Every migration has a tested .down.sql.
- TestMigrationRoundtrip applies all ups → all downs → all ups against a
  fresh container; catches asymmetric down migrations.
- cmd/restore-verify: 28-check post-restore invariant tool (schema
  presence, no orphans across 10 FK relationships, email uniqueness,
  single-active subscription, row-count snapshot).
- docs/RUNBOOK_RESTORE.md: 9-step restore procedure with RTO/RPO
  targets, drill instructions, rollback path.

Block H — Privacy compliance (application side)
- Migration 0006: deleted_at + terms_accepted_at + privacy_policy_accepted_at
  on users. Partial index on email for live-only uniqueness.
- GET /me/data-export — synchronous JSON dump (user, events, guests,
  tokens, rsvps, access_logs, notifications).
- DELETE /me — soft-delete with PII scrub + refresh-token revocation;
  re-signup with same email works.
- POST /me/accept-terms — idempotent consent recording.
- Frontend /privacy + /terms placeholder pages with substantive (pending
  legal review) copy; footer links; signup terms checkbox; TermsGateModal
  for accounts created before the rollout; export + delete buttons on
  /dashboard/billing.

Tests
- All migrations verified up/down/up.
- Integration suite: TestE2EHappyPath, TestAuthFlow, TestCrossTenantIsolation,
  TestRateLimitSignup, TestLoginLockout, TestUnsubscribeFlow,
  TestSESBounceWebhook, TestTwilioStatusWebhook, TestCsvImportFlow,
  TestCsvImportAtomicRollback, TestBulkIssueInvitations, TestBulkIssueExplicitSubset,
  TestTokenIssuePublishesInvitation, TestTokenIssueWithoutGuestEmailSkipsInvitation,
  TestGuestUpdate, TestGuestDelete, TestTokenRotate, TestSMTPSenderAgainstMailpit,
  TestFreeTierEventLimit, TestFreeTierGuestLimit, TestBusinessTierBypassesLimits,
  TestDataExport, TestDeleteMe, TestAcceptTerms, TestMigrationRoundtrip.
  Full suite runs in ~120s against real Postgres + NATS + Redis + Mailpit.
- Unit suite green across internal/auth, internal/csvimport,
  internal/notification, internal/ratelimit, internal/domain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:54:22 +01:00

# Runbook — Postgres restore
This is the procedure to bring GuestGuard back from a Postgres backup
after data loss. It assumes the infra side of Block G (`pg_basebackup` +
WAL archiving to S3, daily logical dumps, cross-region replication) is
already in place — see the homelab repo for those.
The application side — migration down-scripts, the [`restore-verify`](../cmd/restore-verify/main.go)
tool, and this document — lives here in the GuestGuard repo so it ships
in lockstep with the schema.

---
## Targets
| Metric | Target |
|---|---|
| RTO (recovery time objective) | ≤ 1 hour from "go" decision to traffic-serving |
| RPO (recovery point objective) | ≤ 5 minutes of data loss (WAL ships every 60s, S3 PUT every 5min) |

If RTO is going to slip past 1 hour, escalate per the comms plan in `docs/INCIDENT_RESPONSE.md` (infra repo).
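
The one-hour budget is easier to hold if the clock starts at the "go" decision. A minimal sketch, assuming times are recorded as Unix epoch seconds; `rto_remaining` and `rto_status` are hypothetical helpers, not part of the repo:

```bash
# Hypothetical helper: seconds left in the RTO budget.
# $1 = epoch seconds of the "go" decision, $2 = current epoch seconds,
# $3 = budget in seconds (defaults to 3600, the 1-hour RTO target).
rto_remaining() {
  go_at=$1
  now=$2
  budget=${3:-3600}
  echo $(( budget - (now - go_at) ))
}

# Prints "ok" while budget remains, "escalate" once it is spent.
rto_status() {
  if [ "$(rto_remaining "$1" "$2" "${3:-3600}")" -gt 0 ]; then
    echo ok
  else
    echo escalate
  fi
}
```

For example, record `GO_AT=$(date +%s)` when the restore is approved, then run `rto_status "$GO_AT" "$(date +%s)"` between steps.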
## When to invoke this
- Primary Postgres is unreachable AND the standby has also failed
- Logical corruption discovered (e.g., a bad migration deleted rows)
- Region-wide outage at the primary's location
- A "what if we restored last Tuesday" drill (see [Drill](#drill-procedure))

If only the primary is unreachable and the standby is healthy, promote
the standby (separate runbook). Don't use this procedure unnecessarily —
restores are expensive.
## Prerequisites
Before starting:
- [ ] Decision authority has approved the restore (CTO or on-call lead)
- [ ] Read access to the S3 backup bucket: `s3://guestguard-pg-backups`
- [ ] `psql`, `pg_basebackup`, `wal-g` (or chosen WAL tool) installed
- [ ] Empty target Postgres instance provisioned (Kubernetes StatefulSet,
RDS, or homelab box; same major version as the backup)
- [ ] `GG_DATABASE_URL` env var ready for the new instance
- [ ] Maintenance page deployed to the frontend (`/dashboard` returns 503)
- [ ] API + notifier pods scaled to 0 (`kubectl scale --replicas=0`)
- [ ] This document open in another tab
## Steps
### 1. Stop write traffic
```bash
# k8s
kubectl scale deployment/guestguard-api --replicas=0
kubectl scale deployment/guestguard-notifier --replicas=0
# Confirm no connections to the (broken) primary
kubectl exec -n postgres guestguard-pg-0 -- psql -U postgres -c \
"SELECT count(*) FROM pg_stat_activity WHERE datname='guestguard'"
```
If using docker-compose locally: `docker compose stop api notifier`.
### 2. Identify the recovery point
Pick the latest backup that's known-good. For corruption scenarios,
this may mean going further back than the most recent dump.
```bash
# List the 10 most recent base backups (oldest first, newest last)
wal-g backup-list 2>/dev/null | tail -10
# Pick the backup name (e.g. base_000000010000000000000007) and decide
# the recovery target time if doing point-in-time recovery
```
For corruption: pick the latest backup created **before** the corrupting
event. For ransomware or a bad migration, that is probably 1-2 days back.
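
Picking "the latest backup before the event" can be done mechanically rather than by eye. The sketch below is a hypothetical helper, and it assumes each listing line has the form `<name> <modified> ...` with ISO-8601 UTC timestamps; check that against the output of your WAL tool before relying on it:

```bash
# Hypothetical helper: read a backup listing on stdin and print the name of
# the newest backup modified strictly before the cutoff in $1.
# ASSUMPTION: each data line is "<name> <modified> ..." where <modified> is an
# ISO-8601 UTC timestamp such as 2026-05-13T14:30:00Z, so plain string
# comparison orders the timestamps correctly.
latest_backup_before() {
  cutoff=$1
  awk -v cutoff="$cutoff" '
    $1 ~ /^base_/ && $2 < cutoff && $2 > best_time { best_time = $2; best = $1 }
    END { if (best != "") print best }
  '
}
```

Usage would look like `wal-g backup-list | latest_backup_before 2026-05-13T14:30:00Z`. Empty output means no backup predates the cutoff.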
### 3. Restore the base backup
```bash
# Replace BACKUP_NAME with the chosen base
wal-g backup-fetch /var/lib/postgresql/data BACKUP_NAME
# Configure recovery target (omit recovery_target_time for "latest")
cat >> /var/lib/postgresql/data/postgresql.conf <<EOF
restore_command = 'wal-g wal-fetch "%f" "%p"'
recovery_target_time = '2026-05-13 14:30:00 UTC' # set if doing PITR
recovery_target_action = 'promote' # default 'pause' would leave the server in recovery at the target
EOF
touch /var/lib/postgresql/data/recovery.signal
```
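The recovery settings differ only by whether a point-in-time target is set, so they can be generated instead of hand-edited. A minimal sketch; `recovery_settings` is a hypothetical helper, and promoting at the target (rather than PostgreSQL's default of pausing there) is a choice made here so that the server leaves recovery on its own:

```bash
# Hypothetical helper: print the recovery settings for postgresql.conf.
# With no argument it recovers to the end of the archived WAL; with a
# timestamp in $1 it stops there and promotes instead of pausing.
recovery_settings() {
  printf "restore_command = 'wal-g wal-fetch \"%%f\" \"%%p\"'\n"
  if [ -n "${1:-}" ]; then
    printf "recovery_target_time = '%s'\n" "$1"
    printf "recovery_target_action = 'promote'\n"
  fi
}
```

For example: `recovery_settings '2026-05-13 14:30:00 UTC' >> /var/lib/postgresql/data/postgresql.conf`.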
### 4. Start Postgres and let it replay WAL
```bash
systemctl start postgresql # or your equivalent
# Watch the log — should see "consistent recovery state reached"
tail -f /var/log/postgresql/postgresql-16-main.log
```
Wait until recovery completes and Postgres is in normal (not recovery)
mode:
```bash
psql -c "SELECT pg_is_in_recovery()"
# Expected: f (false)
```
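Rather than re-running the query by hand, the wait can be scripted with a timeout. A minimal sketch; `wait_for_recovery_end` is a hypothetical helper, and the probe command is passed in so it works with whatever connection flags your environment needs:

```bash
# Hypothetical helper: poll until Postgres reports it has left recovery.
# $1 = max attempts, $2 = seconds between attempts; the remaining arguments
# are the probe command, which must print "t" or "f".
wait_for_recovery_end() {
  attempts=$1
  interval=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$("$@" 2>/dev/null)" = "f" ]; then
      echo "recovery finished"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "still in recovery after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_recovery_end 120 5 psql -At -c "SELECT pg_is_in_recovery()"` polls every 5 seconds for up to 10 minutes; `-At` makes `psql` print the bare `t` or `f`.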
### 5. Verify the restored database
This is the critical gate before any application traffic touches it.
```bash
# Build the verifier (only needed once)
go build -o restore-verify ./cmd/restore-verify
# Run it against the restored instance
GG_DATABASE_URL='postgres://guestguard:CHANGE_ME@RESTORED_HOST:5432/guestguard?sslmode=require' \
./restore-verify --verbose
```
Expected output: `OK: all N checks passed`. The tool checks:
- All expected tables exist (users, events, guests, tokens, rsvps, etc.)
- All migrations recorded in `schema_migrations`
- No orphan rows across the ten FK relationships we care about
- `users.email` is still unique (case-insensitive)
- No more than one "granting" subscription per user
- Row-count snapshot (for sanity, not pass/fail)

**If any check fails: STOP.** The restore is corrupt. Go back to step 2
with an earlier backup OR escalate.
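
To make the orphan check concrete, the sketch below generates the kind of query such a check could run. It illustrates the idea and is not the query `restore-verify` actually uses; `orphan_count_sql`, the `event_id` column in the usage line, and a parent key named `id` are all assumptions:

```bash
# Hypothetical helper: print a query that counts child rows whose foreign key
# points at no parent row. $1 = child table, $2 = FK column, $3 = parent table.
# ASSUMPTION: the parent key column is named "id".
orphan_count_sql() {
  printf 'SELECT count(*) FROM %s c LEFT JOIN %s p ON p.id = c.%s WHERE c.%s IS NOT NULL AND p.id IS NULL;\n' \
    "$1" "$3" "$2" "$2"
}
```

For example, `psql "$GG_DATABASE_URL" -At -c "$(orphan_count_sql guests event_id events)"` should print `0` on a healthy restore.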
### 6. Apply pending migrations
If the backup is from before a recent migration that shipped to prod,
catch up:
```bash
# The API auto-migrates on boot, but we want to apply migrations
# before traffic, so kick a one-off:
docker run --rm \
-e GG_DATABASE_URL='postgres://...' \
ghcr.io/alchemistkay/guestguard-api:latest \
/app/api --migrate-only
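# Or via psql if you don't have the image. Trim the list to the migrations
# that are still pending and keep them in order. ON_ERROR_STOP=1 makes psql
# exit non-zero on a SQL error so that `|| break` actually fires.
for m in 0001_init 0002_rsvps 0003_auth 0004_notifications_d 0005_billing; do
  psql "$GG_DATABASE_URL" -v ON_ERROR_STOP=1 -f "internal/storage/migrations/${m}.up.sql" || break
done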
```
Run `restore-verify` again after migrations to confirm everything's
still coherent.
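
Working out which migrations are pending is a set difference between the files on disk and what the restored database has recorded. A minimal sketch; `pending_migrations` is a hypothetical helper, and the one-name-per-line input format is an assumption about how you export the contents of `schema_migrations`:

```bash
# Hypothetical helper: list migrations on disk that are not yet applied.
# $1 = migrations directory, $2 = file with one applied migration name per
# line (for example exported from schema_migrations).
# ASSUMPTION: applied names match the file name without the ".up.sql" suffix.
pending_migrations() {
  dir=$1
  applied=$2
  for f in "$dir"/*.up.sql; do
    [ -e "$f" ] || continue            # directory has no .up.sql files
    name=$(basename "$f" .up.sql)
    grep -qxF "$name" "$applied" || echo "$name"
  done
}
```

The shell expands the glob in sorted order, so the numeric prefixes keep the output in apply order.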
### 7. Bring the API back up
```bash
kubectl scale deployment/guestguard-api --replicas=2
kubectl scale deployment/guestguard-notifier --replicas=1
# Watch the logs — expect "http server starting" + "billing enabled via stripe"
kubectl logs -f deployment/guestguard-api --tail=20
```
### 8. Smoke test
- [ ] Hit `/health` → 200
- [ ] Sign in as a known test user → dashboard loads, recent events visible
- [ ] Create a new event → succeeds, appears in list
- [ ] Tail API logs for 5 minutes → no 5xx storms
### 9. Re-enable traffic
- [ ] Remove the maintenance page from the frontend
- [ ] Announce restoration in the status channel + status page
- [ ] Note actual RTO + RPO achieved for the post-mortem
## Drill procedure
Run this monthly with no real outage to keep the team's hands warm.
1. Provision a throwaway Postgres instance (`postgres-drill-YYYYMM`).
2. Run steps 2-5 against it (skip 1, 7, 8, 9; production stays untouched).
3. `restore-verify` MUST pass.
4. Bonus: spin up an API pointed at the drill DB on a one-off port and
walk through the smoke-test scenarios.
5. Tear down the drill DB.
6. Log the time taken in `docs/RESTORE_DRILL_LOG.md` (or wherever your
team tracks operational drills).

If any step fails during a drill, the production fail-over procedure is
**unreliable**. Treat it as a P1 to fix before the next real failure.
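
Logging the drill is more likely to happen if it is one command. A minimal sketch; `log_drill` is a hypothetical helper and the three-column table layout is an assumption, since the format of the drill log is not defined here:

```bash
# Hypothetical helper: append one row to a Markdown drill log.
# $1 = log file, $2 = date, $3 = start epoch, $4 = end epoch, $5 = result.
# ASSUMPTION: the log is a table with columns Date | Minutes | Result.
log_drill() {
  log=$1
  minutes=$(( ($4 - $3 + 59) / 60 ))   # round up to whole minutes
  if [ ! -s "$log" ]; then
    # New or empty file: write the table header first.
    printf '| Date | Minutes | Result |\n|---|---|---|\n' > "$log"
  fi
  printf '| %s | %s | %s |\n' "$2" "$minutes" "$5" >> "$log"
}
```

For example: `log_drill docs/RESTORE_DRILL_LOG.md "$(date +%F)" "$START" "$(date +%s)" pass`, where `START` was captured with `date +%s` when the drill began.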
## Rollback (if restore is wrong)
If you complete the restore and discover it's the wrong data:
1. Scale API back to 0
2. Find the next earlier backup
3. Drop and recreate the database on the restored instance
4. Repeat from step 3

**Never** point production at a known-bad restored DB hoping to fix it
later. The API will write new data on top of the corruption, and the
salvage gets much harder with every write.
## Migration down-scripts
Every `.up.sql` in `internal/storage/migrations/` has a matching
`.down.sql`. They're tested as part of CI and not exercised during
normal restores (the up-only sequence in step 6 is the path used).
They exist for:
- Drill scenarios where you want to "rewind" the schema
- Emergency rollback of a bad shipped migration

Down-script integrity: run the `TestMigrationRoundtrip` integration
test, which applies every migration up → down → up against a fresh
container.
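
The pairing rule itself (every `.up.sql` has a `.down.sql`) can also be checked without a database. A minimal sketch; `missing_down_scripts` is a hypothetical helper that complements the roundtrip test rather than replacing it:

```bash
# Hypothetical helper: print every .up.sql in $1 that lacks a .down.sql twin.
# Returns non-zero if any are missing.
missing_down_scripts() {
  missing=0
  for up in "$1"/*.up.sql; do
    [ -e "$up" ] || continue           # directory has no .up.sql files
    down="${up%.up.sql}.down.sql"
    if [ ! -e "$down" ]; then
      echo "$up"
      missing=1
    fi
  done
  return "$missing"
}
```

For example: `missing_down_scripts internal/storage/migrations`.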
## Application config that supports restored DBs
`GG_DATABASE_URL` is the single source of truth — no hardcoded
hostnames anywhere in the codebase. Verified by:
```bash
grep -rE 'postgres://|host=.*5432' --include='*.go' . | grep -v _test.go | grep -v config.go
# Expected: (empty)
```
If anything surfaces from that grep, file a bug — it'll bite the next
restore.
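
The same grep can be turned into a pass/fail check, for instance as a CI step. A minimal sketch; `check_no_hardcoded_dsn` is a hypothetical helper wrapping the command above, not something that exists in the repo:

```bash
# Sketch: wrap the grep so it returns non-zero (and prints the offending
# lines) when a connection string is hardcoded. $1 = repo root.
check_no_hardcoded_dsn() {
  # "|| true" keeps an empty result from being treated as a failure here.
  hits=$(grep -rE 'postgres://|host=.*5432' --include='*.go' "$1" \
    | grep -v _test.go | grep -v config.go || true)
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
  return 0
}
```

For example: `check_no_hardcoded_dsn . || echo "hardcoded connection string found"`.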
## Escalation
| Step fails | Who to call |
|---|---|
| Steps 1-4 | On-call infra lead |
| Step 5 (`restore-verify`) | On-call backend lead + DBA |
| Steps 7-8 (app won't start / smoke fails) | On-call backend lead |
| Drill failure | File P1 ticket, link the drill log |
## Change log
| Date | Author | Change |
|---|---|---|
| 2026-05-16 | kay | initial version (Block G) |