feat: ship Tier 1 — auth, authz, rate limits, real notifications, CSV import, billing, backups/DR, privacy

Closes every block in docs/TIER1_PLAN.md from the Claude-scope side. The homelab / cloud setup steps (SES verification, restore drill, lawyer-drafted ToS) remain operator-owned but are unblocked.

Block A: Authentication
- Migration 0003: password_hash, email_verified, email_verification_tokens, password_reset_tokens, refresh_tokens (with replaced_by family chain).
- Bcrypt hasher, HS256 JWT signer, single-use refresh tokens with rotation + replay-detection (revokes the family on reuse).
- /auth/signup, /login, /refresh, /logout, /verify-email, /forgot-password, /reset-password, all enumeration-safe.
- requireAuth middleware + GET /me.
- Frontend useAuth/useApi with auto-refresh-on-401, login/signup/verify/forgot/reset pages, route-guard middleware.

Block B: Authorisation
- EventRepo.GetForHost; Update/Delete scoped by host_id.
- All host routes behind requireAuth + ownership; cross-tenant returns 404 (no enumeration). ?host_id removed.
- WS auth via short-lived single-use tickets (POST /auth/ws-ticket).
- Tests: TestCrossTenantIsolation (9 probes).

Block C: Rate limiting
- Redis sliding-window via Lua (atomic ZADD+ZCARD+PEXPIRE).
- Per-route limits matching the plan (signup IP, login IP+email, RSVP/access by token, events/guests/tokens by user_id).
- 429 with Retry-After header and JSON body.
- Auth lockout: 5 failed logins → account locked, only password reset clears it.
- Frontend: useErrMessage normalises 429 + locked messaging.

Block D: Real notifications
- Migration 0004: provider_message_id, bounce_type, complained columns + unsubscribes (CITEXT) suppression table.
- Branded HTML + plaintext templates for verification, reset, invitation, confirmation, reminder. Per-page templates avoid html/template's contextual-escape collisions.
- Senders: SESv2, Twilio (SMS), SMTP (Mailpit-friendly), Resend HTTP.
- PickEmailSender priority Resend > SMTP > SES > Log: system boots cleanly in dev with Mailpit; production flips one env var.
- Webhook endpoints (Twilio status + SES SNS): bounces add to suppression; signature verification stubbed pending creds.
- Auto-send: POST /tokens publishes invitation.send; notifier renders + delivers via the configured backend; suppression list honoured.
- Bulk + per-row invitation flow: POST /events/{id}/guests/invitations/bulk returns per-guest tokens so phone-only guests can be SMS'd manually.
- Unsubscribe: signed HMAC token (no TTL) + /unsubscribe/[token] page.
- WhatsApp Option A+: wa.me click-to-chat wizard with per-guest progress tracking, isLikelyE164 validation, edit-from-wizard.
- Token rotate (POST /tokens/rotate) invalidates the old URL; used by the regenerate-link flow.
- Mailpit added to docker-compose for dev inbox.

Block E: CSV import
- Streaming parser: tolerant header detection, UTF-8 BOM + UTF-16 LE/BE decoding, row-level validation, 5,000-row cap.
- Strict E.164 phone validation with helpful error message.
- POST /preview + /import + GET /template; preview UI on event page; atomic per-batch with dedup on existing emails.

Phone capture across UI
- PhoneInput component: country picker (~50 ISO codes) + national input + live E.164 preview + inline length validation.
- Used in Add Guest and Edit Guest modals. Smart paste-handling extracts country code from full E.164 strings.

Block F: Billing (Stripe)
- Migration 0005: subscriptions table (user_id → tier/status/period_end + Stripe customer/sub ids). Partial unique index keeps one granting sub per user.
- internal/billing: Tier + Limits model (Free 1/50, Pro 10/1000, Business ∞/5000), Stripe SDK wrapper with IgnoreAPIVersionMismatch for newer account API versions.
- /billing/checkout-session, /billing/portal, /billing/status, /webhooks/stripe (signature-verified, lifecycle events).
- Tier enforcement: 402 on POST /events, /guests, /import with {error, reason, tier, used, limit, upgrade_url} body.
- Frontend: useBilling composable, /dashboard/billing page (current plan, usage bars, tier cards), global UpgradeModal triggered by useApi's 402 interceptor.
- Customer portal kept for self-service cancel/payment-method changes.

Block G: Backups & DR (application side)
- Every migration has a tested .down.sql.
- TestMigrationRoundtrip applies all ups → all downs → all ups against a fresh container; catches asymmetric down migrations.
- cmd/restore-verify: 28-check post-restore invariant tool (schema presence, no orphans across 10 FK relationships, email uniqueness, single-active subscription, row-count snapshot).
- docs/RUNBOOK_RESTORE.md: 9-step restore procedure with RTO/RPO targets, drill instructions, rollback path.

Block H: Privacy compliance (application side)
- Migration 0006: deleted_at + terms_accepted_at + privacy_policy_accepted_at on users. Partial index on email for live-only uniqueness.
- GET /me/data-export: synchronous JSON dump (user, events, guests, tokens, rsvps, access_logs, notifications).
- DELETE /me: soft-delete with PII scrub + refresh-token revocation; re-signup with same email works.
- POST /me/accept-terms: idempotent consent recording.
- Frontend /privacy + /terms placeholder pages with substantive (pending legal review) copy; footer links; signup terms checkbox; TermsGateModal for accounts created before the rollout; export + delete buttons on /dashboard/billing.

Tests
- All migrations verified up/down/up.
- Integration suite: TestE2EHappyPath, TestAuthFlow, TestCrossTenantIsolation, TestRateLimitSignup, TestLoginLockout, TestUnsubscribeFlow, TestSESBounceWebhook, TestTwilioStatusWebhook, TestCsvImportFlow, TestCsvImportAtomicRollback, TestBulkIssueInvitations, TestBulkIssueExplicitSubset, TestTokenIssuePublishesInvitation, TestTokenIssueWithoutGuestEmailSkipsInvitation, TestGuestUpdate, TestGuestDelete, TestTokenRotate, TestSMTPSenderAgainstMailpit, TestFreeTierEventLimit, TestFreeTierGuestLimit, TestBusinessTierBypassesLimits, TestDataExport, TestDeleteMe, TestAcceptTerms, TestMigrationRoundtrip. Full suite runs in ~120s against real Postgres + NATS + Redis + Mailpit.
- Unit suite green across internal/auth, internal/csvimport, internal/notification, internal/ratelimit, internal/domain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:54:22 +01:00
parent a0ed34f860
commit 59b8781659
124 changed files with 13702 additions and 445 deletions
@@ -0,0 +1,251 @@
+# Runbook — Postgres restore
+
+This is the procedure to bring GuestGuard back from a Postgres backup
+after data loss. It assumes the infra side of Block G (`pg_basebackup` +
+WAL archiving to S3, daily logical dumps, cross-region replication) is
+already in place — see the homelab repo for those.
+
+The application side — migration down-scripts, the [`restore-verify`](../cmd/restore-verify/main.go)
+tool, and this document — lives here in the GuestGuard repo so it ships
+in lockstep with the schema.
+
+---
+
+## Targets
+
+| Metric | Target |
+|---|---|
+| RTO (recovery time objective) | ≤ 1 hour from "go" decision to traffic-serving |
+| RPO (recovery point objective) | ≤ 5 minutes of data loss (WAL ships every 60s, S3 PUT every 5min) |
+
+If RTO is going to slip past 1 hour, escalate per the comms plan in `docs/INCIDENT_RESPONSE.md` (infra repo).
+
+## When to invoke this
+
+- Primary Postgres is unreachable AND the standby has also failed
+- Logical corruption discovered (e.g., a bad migration deleted rows)
+- Region-wide outage at the primary's location
+- A "what if we restored last Tuesday" drill (see [Drill](#drill-procedure))
+
+If only the primary is unreachable and the standby is healthy, promote
+the standby (separate runbook). Don't use this procedure unnecessarily —
+restores are expensive.
+
+## Prerequisites
+
+Before starting:
+
+- [ ] Decision authority has approved the restore (CTO or on-call lead)
+- [ ] Read access to the S3 backup bucket: `s3://guestguard-pg-backups`
+- [ ] `psql`, `pg_basebackup`, `wal-g` (or chosen WAL tool) installed
+- [ ] Empty target Postgres instance provisioned (Kubernetes StatefulSet,
+      RDS, or homelab box; same major version as the backup)
+- [ ] `GG_DATABASE_URL` env var ready for the new instance
+- [ ] Maintenance page deployed to the frontend (`/dashboard` returns 503)
+- [ ] API + notifier pods scaled to 0 (`kubectl scale --replicas=0`)
+- [ ] This document open in another tab
+
+## Steps
+
+### 1. Stop write traffic
+
+```bash
+# k8s
+kubectl scale deployment/guestguard-api --replicas=0
+kubectl scale deployment/guestguard-notifier --replicas=0
+
+# Confirm no connections to the (broken) primary
+kubectl exec -n postgres guestguard-pg-0 -- psql -U postgres -c \
+  "SELECT count(*) FROM pg_stat_activity WHERE datname='guestguard'"
+```
+
+If using docker-compose locally: `docker compose stop api notifier`.
+
+### 2. Identify the recovery point
+
+Pick the latest backup that's known-good. For corruption scenarios,
+this may mean going further back than the most recent dump.
+
+```bash
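+# List the ten most recent base backups (wal-g lists oldest first, so
+# tail keeps the newest, with the latest on the last line)
+wal-g backup-list 2>/dev/null | tail -10
+
+# Pick the backup name (e.g. base_000000010000000000000007) and, for
+# point-in-time recovery, decide the recovery_target_time to use in step 3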
+```
+
+For corruption: pick the latest backup created **before** the corrupting
+event. For "ransomware / bad migration", probably 1–2 days back.
+
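As a sketch of the selection rule in step 2, the helper below picks the newest base backup whose modification time is earlier than a cutoff. It assumes that `wal-g backup-list` prints one backup per line, with the backup name in the first column and an RFC 3339 UTC timestamp in the second; check that layout against your wal-g version. The name `pick_backup_before` is hypothetical and not part of the repo.

```shell
# Hypothetical helper: read `wal-g backup-list` output on stdin and print the
# newest backup whose timestamp (column 2) is strictly before the cutoff.
# Assumes RFC 3339 UTC timestamps, which sort correctly as plain strings.
pick_backup_before() {
  awk -v cutoff="$1" '
    BEGIN { best = "" }
    # Only consider data rows (names starting with base_), skipping any header.
    $1 ~ /^base_/ && $2 < cutoff && $2 > best { best = $2; name = $1 }
    END { if (name == "") exit 1; print name }
  '
}
```

A possible invocation for a corrupting event at 14:30 UTC on 2026-05-13 would be `wal-g backup-list | pick_backup_before 2026-05-13T14:30:00Z`. The function exits non-zero when no backup predates the cutoff, so the caller can stop instead of restoring the wrong base.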
+### 3. Restore the base backup
+
+```bash
+# Replace BACKUP_NAME with the chosen base
+wal-g backup-fetch /var/lib/postgresql/data BACKUP_NAME
+
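+# Configure recovery (omit recovery_target_time for "latest").
+# recovery_target_action defaults to 'pause', which would leave Postgres in
+# recovery once the target is reached, so promote explicitly. It has no
+# effect when no recovery target is set.
+cat >> /var/lib/postgresql/data/postgresql.conf <<EOF
+restore_command = 'wal-g wal-fetch "%f" "%p"'
+recovery_target_time = '2026-05-13 14:30:00 UTC'  # set if doing PITR
+recovery_target_action = 'promote'
+EOF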
+
+touch /var/lib/postgresql/data/recovery.signal
+```
+
+### 4. Start Postgres and let it replay WAL
+
+```bash
+systemctl start postgresql   # or your equivalent
+
+# Watch the log — should see "consistent recovery state reached"
+tail -f /var/log/postgresql/postgresql-16-main.log
+```
+
+Wait until recovery completes and Postgres is in normal (not recovery)
+mode:
+
+```bash
+psql -c "SELECT pg_is_in_recovery()"
+# Expected: f  (false)
+```
+
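The wait in step 4 can be scripted instead of checked by hand. The sketch below polls a probe command until it prints `f` or a timeout expires. `wait_until_primary` is a hypothetical helper, and the probe is passed in as arguments so the loop itself does not depend on any particular connection setup.

```shell
# Hypothetical helper: poll a probe command until it prints "f" (psql's
# rendering of false) or the timeout expires.
# Usage: wait_until_primary TIMEOUT_SECONDS INTERVAL_SECONDS PROBE [ARGS...]
# The ( ) body runs in a subshell, so its variables do not leak to the caller.
wait_until_primary() (
  timeout=$1
  interval=$2
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    # Probe errors (e.g. server still starting) count as "not ready yet".
    if [ "$("$@" 2>/dev/null)" = "f" ]; then
      exit 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "still in recovery after ${timeout}s" >&2
      exit 1
    fi
    sleep "$interval"
  done
)
```

Assuming `GG_DATABASE_URL` already points at the restored instance, a call such as `wait_until_primary 3600 5 psql "$GG_DATABASE_URL" -tAc 'SELECT pg_is_in_recovery()'` would give up after the one-hour RTO budget.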
+### 5. Verify the restored database
+
+This is the critical gate before any application traffic touches it.
+
+```bash
+# Build the verifier (only needed once)
+go build -o restore-verify ./cmd/restore-verify
+
+# Run it against the restored instance
+GG_DATABASE_URL='postgres://guestguard:CHANGE_ME@RESTORED_HOST:5432/guestguard?sslmode=require' \
+  ./restore-verify --verbose
+```
+
+Expected output: `OK: all N checks passed`. The tool checks:
+
+- All expected tables exist (users, events, guests, tokens, rsvps, etc.)
+- All migrations recorded in `schema_migrations`
+- No orphan rows across the ten FK relationships we care about
+- `users.email` is still unique (case-insensitive)
+- No more than one "granting" subscription per user
+- Row-count snapshot (for sanity, not pass/fail)
+
+**If any check fails: STOP.** The restore is corrupt — go back to step 2
+with an earlier backup OR escalate.
+
+### 6. Apply pending migrations
+
+If the backup is from before a recent migration that shipped to prod,
+catch up:
+
+```bash
+# The API auto-migrates on boot, but we want to apply migrations
+# before traffic, so kick a one-off:
+docker run --rm \
+  -e GG_DATABASE_URL='postgres://...' \
+  ghcr.io/alchemistkay/guestguard-api:latest \
+  /app/api --migrate-only
+
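+# Or via psql if you don't have the image. List only the migrations that
+# are not yet applied to the restored database, in order (this commit also
+# adds a 0006 migration; append it and any later ones). ON_ERROR_STOP makes
+# psql exit non-zero on a SQL error so the loop actually stops.
+for m in 0001_init 0002_rsvps 0003_auth 0004_notifications_d 0005_billing; do
+  psql "$GG_DATABASE_URL" -v ON_ERROR_STOP=1 -f "internal/storage/migrations/${m}.up.sql" || break
+done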
+```
+
+Run `restore-verify` again after migrations to confirm everything's
+still coherent.
+
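To work out which migrations are pending after a restore, one option is to compare the numeric prefix of each `.up.sql` file against the highest version already applied. The sketch below assumes the `NNNN_name.up.sql` naming seen in step 6 and that you have read the highest applied version from the restored database yourself; the layout of `schema_migrations` is not shown here, so that lookup is left out. `pending_migrations` is a hypothetical helper name.

```shell
# Hypothetical helper: print the .up.sql files in DIR whose numeric prefix is
# greater than APPLIED, in ascending order (the glob sorts zero-padded
# prefixes correctly).
# Usage: pending_migrations DIR APPLIED
# The ( ) body runs in a subshell, so its variables do not leak to the caller.
pending_migrations() (
  dir=$1
  applied=$2
  for f in "$dir"/*.up.sql; do
    [ -e "$f" ] || continue                 # no matches: glob stays literal
    ver=${f##*/}                            # strip the directory
    ver=${ver%%_*}                          # "0004_x.up.sql" -> "0004"
    case $ver in ''|*[!0-9]*) continue ;; esac
    ver=$(printf '%s' "$ver" | sed 's/^0*//')   # drop leading zeros
    if [ "${ver:-0}" -gt "$applied" ]; then
      printf '%s\n' "$f"
    fi
  done
)
```

For example, if the restored backup is at version 4, `pending_migrations internal/storage/migrations 4` would list only the files from `0005` onward, which can then be fed to `psql -v ON_ERROR_STOP=1 -f` one at a time.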
+### 7. Bring the API back up
+
+```bash
+kubectl scale deployment/guestguard-api --replicas=2
+kubectl scale deployment/guestguard-notifier --replicas=1
+
+# Watch the logs — expect "http server starting" + "billing enabled via stripe"
+kubectl logs -f deployment/guestguard-api --tail=20
+```
+
+### 8. Smoke test
+
+- [ ] Hit `/health` → 200
+- [ ] Sign in as a known test user → dashboard loads, recent events visible
+- [ ] Create a new event → succeeds, appears in list
+- [ ] Tail API logs for 5 minutes → no 5xx storms
+
+### 9. Re-enable traffic
+
+- [ ] Remove the maintenance page from the frontend
+- [ ] Announce restoration in the status channel + status page
+- [ ] Note actual RTO + RPO achieved for the post-mortem
+
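For the "note actual RTO" item in step 9, the arithmetic is simple enough to script. The sketch below takes the time of the "go" decision and the time traffic resumed, both as Unix epoch seconds, and compares the difference with the 1-hour target from the Targets table. `rto_report` is a hypothetical helper, not an existing tool in the repo.

```shell
# Hypothetical helper: report the achieved RTO against the 1-hour target.
# Usage: rto_report GO_DECISION_EPOCH TRAFFIC_RESUMED_EPOCH
# The ( ) body runs in a subshell, so its variables do not leak to the caller.
rto_report() (
  target=3600                       # 1 hour, from the Targets table
  elapsed=$(( $2 - $1 ))
  mins=$(( elapsed / 60 ))
  secs=$(( elapsed % 60 ))
  if [ "$elapsed" -le "$target" ]; then
    echo "RTO ${mins}m ${secs}s: within target"
  else
    over=$(( (elapsed - target) / 60 ))
    echo "RTO ${mins}m ${secs}s: exceeded target by ${over}m"
    exit 1
  fi
)
```

One way to use it is to record `start=$(date +%s)` when the restore is approved and run `rto_report "$start" "$(date +%s)"` once the maintenance page is removed; the non-zero exit on a miss makes it easy to flag in the post-mortem notes.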
+## Drill procedure
+
+Run this monthly with no real outage to keep the team's hands warm.
+
+1. Provision a throwaway Postgres instance (`postgres-drill-YYYYMM`).
+2. Run steps 2–5 against it (skip 1, 7, 8, 9 — production stays untouched).
+3. `restore-verify` MUST pass.
+4. Bonus: spin up an API pointed at the drill DB on a one-off port and
+   walk through the smoke-test scenarios.
+5. Tear down the drill DB.
+6. Log the time taken in `docs/RESTORE_DRILL_LOG.md` (or wherever your
+   team tracks operational drills).
+
+If any step fails during a drill, the production restore procedure is
+**unreliable**. Treat it as a P1 and fix it before the next real failure.
+
+## Rollback (if restore is wrong)
+
+If you complete the restore and discover it's the wrong data:
+
+1. Scale API back to 0
+2. Find the next earlier backup
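+3. Stop Postgres and empty the data directory on the restored instance (a base backup restores the whole cluster, so dropping a single database is not enough)
+4. Repeat from step 3 with the earlier backup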
+
+**Never** point production at a known-bad restored DB hoping to fix it
+later — the API will write new data on top of the corruption and the
+salvage gets exponentially harder.
+
+## Migration down-scripts
+
+Every `.up.sql` in `internal/storage/migrations/` has a matching
+`.down.sql`. They're tested as part of CI and not exercised during
+normal restores (the up-only sequence in step 6 is the path used).
+They exist for:
+
+- Drill scenarios where you want to "rewind" the schema
+- Emergency rollback of a bad shipped migration
+
+Down-script integrity: run the `TestMigrationRoundtrip` integration
+test, which applies every migration up → down → up against a fresh
+container.
+
+## Application config that supports restored DBs
+
+`GG_DATABASE_URL` is the single source of truth — no hardcoded
+hostnames anywhere in the codebase. Verified by:
+
+```bash
+grep -rE 'postgres://|host=.*5432' --include='*.go' . | grep -v _test.go | grep -v config.go
+# Expected: (empty)
+```
+
+If anything surfaces from that grep, file a bug — it'll bite the next
+restore.
+
+## Escalation
+
+| Step fails | Who to call |
+|---|---|
+| Steps 1–4 | On-call infra lead |
+| Step 5 (`restore-verify`) | On-call backend lead + DBA |
+| Steps 7–8 (app won't start / smoke fails) | On-call backend lead |
+| Drill failure | File P1 ticket, link the drill log |
+
+## Change log
+
+| Date | Author | Change |
+|---|---|---|
+| 2026-05-16 | kay | initial version (Block G) |