feat: ship Tier 1 — auth, authz, rate limits, real notifications, CSV import, billing, backups/DR, privacy

Closes every block in docs/TIER1_PLAN.md from the Claude-scope side. The
homelab / cloud setup steps (SES verification, restore drill, lawyer-
drafted ToS) remain operator-owned but are unblocked.

Block A — Authentication
- Migration 0003: password_hash, email_verified, email_verification_tokens,
  password_reset_tokens, refresh_tokens (with replaced_by family chain).
- Bcrypt hasher, HS256 JWT signer, single-use refresh tokens with rotation
  + replay-detection (revokes the family on reuse).
- /auth/signup, /login, /refresh, /logout, /verify-email,
  /forgot-password, /reset-password — enumeration-safe.
- requireAuth middleware + GET /me.
- Frontend useAuth/useApi with auto-refresh-on-401, login/signup/verify/
  forgot/reset pages, route-guard middleware.

Block B — Authorisation
- EventRepo.GetForHost; Update/Delete scoped by host_id.
- All host routes behind requireAuth + ownership; cross-tenant returns
  404 (no enumeration). ?host_id removed.
- WS auth via short-lived single-use tickets (POST /auth/ws-ticket).
- Tests: TestCrossTenantIsolation — 9 probes.

Block C — Rate limiting
- Redis sliding-window via Lua (atomic ZADD+ZCARD+PEXPIRE).
- Per-route limits matching the plan (signup IP, login IP+email, RSVP/
  access by token, events/guests/tokens by user_id).
- 429 with Retry-After header and JSON body.
- Auth lockout: 5 failed logins → account locked, only password reset
  clears it.
- Frontend: useErrMessage normalises 429 + locked messaging.

Block D — Real notifications
- Migration 0004: provider_message_id, bounce_type, complained columns
  + unsubscribes (CITEXT) suppression table.
- Branded HTML + plaintext templates for verification, reset, invitation,
  confirmation, reminder. Per-page templates avoid html/template's
  contextual-escape collisions.
- Senders: SESv2, Twilio (SMS), SMTP (Mailpit-friendly), Resend HTTP.
- PickEmailSender priority Resend > SMTP > SES > Log — system boots
  cleanly in dev with Mailpit; production flips one env var.
- Webhook endpoints (Twilio status + SES SNS) — bounces add to suppression;
  signature verification stubbed pending creds.
- Auto-send: POST /tokens publishes invitation.send; notifier renders +
  delivers via the configured backend; suppression list honoured.
- Bulk + per-row invitation flow: POST /events/{id}/guests/invitations/bulk
  returns per-guest tokens so phone-only guests can be SMS'd manually.
- Unsubscribe: signed HMAC token (no TTL) + /unsubscribe/[token] page.
- WhatsApp Option A+: wa.me click-to-chat wizard with per-guest progress
  tracking, isLikelyE164 validation, edit-from-wizard.
- Token rotate (POST /tokens/rotate) invalidates the old URL — used by
  the regenerate-link flow.
- Mailpit added to docker-compose for dev inbox.

Block E — CSV import
- Streaming parser: tolerant header detection, UTF-8 BOM + UTF-16 LE/BE
  decoding, row-level validation, 5,000-row cap.
- Strict E.164 phone validation with helpful error message.
- POST /preview + /import + GET /template; preview UI on event page;
  atomic per-batch with dedup on existing emails.

Phone capture across UI
- PhoneInput component: country picker (~50 ISO codes) + national input +
  live E.164 preview + inline length validation.
- Used in Add Guest and Edit Guest modals. Smart paste-handling extracts
  country code from full E.164 strings.

Block F — Billing (Stripe)
- Migration 0005: subscriptions table (user_id → tier/status/period_end +
  Stripe customer/sub ids). Partial unique index keeps one granting sub
  per user.
- internal/billing: Tier + Limits model (Free 1/50, Pro 10/1000, Business
  ∞/5000), Stripe SDK wrapper with IgnoreAPIVersionMismatch for newer
  account API versions.
- /billing/checkout-session, /billing/portal, /billing/status,
  /webhooks/stripe (signature-verified, lifecycle events).
- Tier enforcement: 402 on POST /events, /guests, /import with
  {error, reason, tier, used, limit, upgrade_url} body.
- Frontend: useBilling composable, /dashboard/billing page (current plan,
  usage bars, tier cards), global UpgradeModal triggered by useApi's
  402 interceptor.
- Customer portal kept for self-service cancel/payment-method changes.

Block G — Backups & DR (application side)
- Every migration has a tested .down.sql.
- TestMigrationRoundtrip applies all ups → all downs → all ups against a
  fresh container; catches asymmetric down migrations.
- cmd/restore-verify: 28-check post-restore invariant tool (schema
  presence, no orphans across 10 FK relationships, email uniqueness,
  single-active subscription, row-count snapshot).
- docs/RUNBOOK_RESTORE.md: 9-step restore procedure with RTO/RPO
  targets, drill instructions, rollback path.

Block H — Privacy compliance (application side)
- Migration 0006: deleted_at + terms_accepted_at + privacy_policy_accepted_at
  on users. Partial index on email for live-only uniqueness.
- GET /me/data-export — synchronous JSON dump (user, events, guests,
  tokens, rsvps, access_logs, notifications).
- DELETE /me — soft-delete with PII scrub + refresh-token revocation;
  re-signup with same email works.
- POST /me/accept-terms — idempotent consent recording.
- Frontend /privacy + /terms placeholder pages with substantive (pending
  legal review) copy; footer links; signup terms checkbox; TermsGateModal
  for accounts created before the rollout; export + delete buttons on
  /dashboard/billing.

Tests
- All migrations verified up/down/up.
- Integration suite: TestE2EHappyPath, TestAuthFlow, TestCrossTenantIsolation,
  TestRateLimitSignup, TestLoginLockout, TestUnsubscribeFlow,
  TestSESBounceWebhook, TestTwilioStatusWebhook, TestCsvImportFlow,
  TestCsvImportAtomicRollback, TestBulkIssueInvitations, TestBulkIssueExplicitSubset,
  TestTokenIssuePublishesInvitation, TestTokenIssueWithoutGuestEmailSkipsInvitation,
  TestGuestUpdate, TestGuestDelete, TestTokenRotate, TestSMTPSenderAgainstMailpit,
  TestFreeTierEventLimit, TestFreeTierGuestLimit, TestBusinessTierBypassesLimits,
  TestDataExport, TestDeleteMe, TestAcceptTerms, TestMigrationRoundtrip.
  Full suite runs in ~120s against real Postgres + NATS + Redis + Mailpit.
- Unit suite green across internal/auth, internal/csvimport,
  internal/notification, internal/ratelimit, internal/domain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Kwaku Danso
2026-05-16 23:54:22 +01:00
parent a0ed34f860
commit 59b8781659
124 changed files with 13702 additions and 445 deletions
@@ -0,0 +1,251 @@
# Runbook — Postgres restore
This is the procedure to bring GuestGuard back from a Postgres backup
after data loss. It assumes the infra side of Block G (`pg_basebackup` +
WAL archiving to S3, daily logical dumps, cross-region replication) is
already in place — see the homelab repo for those.
The application side — migration down-scripts, the [`restore-verify`](../cmd/restore-verify/main.go)
tool, and this document — lives here in the GuestGuard repo so it ships
in lockstep with the schema.
---
## Targets
| Metric | Target |
|---|---|
| RTO (recovery time objective) | ≤ 1 hour from "go" decision to traffic-serving |
| RPO (recovery point objective) | ≤ 5 minutes of data loss (WAL ships every 60s, S3 PUT every 5min) |
If RTO is going to slip past 1 hour, escalate per the comms plan in `docs/INCIDENT_RESPONSE.md` (infra repo).
## When to invoke this
- Primary Postgres is unreachable AND the standby has also failed
- Logical corruption discovered (e.g., a bad migration deleted rows)
- Region-wide outage at the primary's location
- A "what if we restored last Tuesday" drill (see [Drill](#drill-procedure))
If only the primary is unreachable and the standby is healthy, promote
the standby (separate runbook). Don't use this procedure unnecessarily —
restores are expensive.
## Prerequisites
Before starting:
- [ ] Decision authority has approved the restore (CTO or on-call lead)
- [ ] Read access to the S3 backup bucket: `s3://guestguard-pg-backups`
- [ ] `psql`, `pg_basebackup`, `wal-g` (or chosen WAL tool) installed
- [ ] Empty target Postgres instance provisioned (Kubernetes StatefulSet,
RDS, or homelab box — same major version as the backup)
- [ ] `GG_DATABASE_URL` env var ready for the new instance
- [ ] Maintenance page deployed to the frontend (`/dashboard` returns 503)
- [ ] API + notifier pods scaled to 0 (`kubectl scale --replicas=0`)
- [ ] This document open in another tab
## Steps
### 1. Stop write traffic
```bash
# k8s
kubectl scale deployment/guestguard-api --replicas=0
kubectl scale deployment/guestguard-notifier --replicas=0
# Confirm no connections to the (broken) primary
kubectl exec -n postgres guestguard-pg-0 -- psql -U postgres -c \
"SELECT count(*) FROM pg_stat_activity WHERE datname='guestguard'"
```
If using docker-compose locally: `docker compose stop api notifier`.
### 2. Identify the recovery point
Pick the latest backup that's known-good. For corruption scenarios,
this may mean going further back than the most recent dump.
```bash
# List the ten most recent base backups (check the timestamps; do not rely on output order)
wal-g backup-list 2>/dev/null | tail -10
# Pick the backup name (e.g. base_000000010000000000000007) and decide
# the recovery target (time or LSN) if doing point-in-time recovery
```
For corruption: pick the latest backup created **before** the corrupting
event. For "ransomware / bad migration", probably 1-2 days back.
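One way to make the "latest backup created before the corrupting event" rule mechanical is a small filter over the backup listing. This is a sketch only: `pick_backup_before` is a hypothetical helper, and it assumes each listing line carries the backup name in column 1 and a UTC RFC 3339 modification time (ending in `Z`) in column 2, so that plain string comparison orders the timestamps. Check that against the real `wal-g backup-list` output before relying on it.

```bash
# Hypothetical helper: print the newest backup modified strictly before CUTOFF.
# Reads listing lines on stdin; assumes "<name> <UTC RFC 3339 time> ..." columns.
pick_backup_before() {
  local cutoff=$1
  awk -v cutoff="$cutoff" '
    BEGIN { best = ""; name = "" }
    # Skip the header and anything that is not a base backup line.
    $1 !~ /^base_/ { next }
    # String comparison is enough because all timestamps share one UTC format.
    $2 < cutoff && $2 > best { best = $2; name = $1 }
    # Exit non-zero when no backup predates the cutoff.
    END { if (name == "") exit 1; print name }
  '
}
```

Possible usage: `wal-g backup-list | pick_backup_before '2026-05-13T14:30:00Z'`. A non-zero exit means no listed backup predates the event, which is itself a reason to escalate.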
### 3. Restore the base backup
```bash
# Replace BACKUP_NAME with the chosen base
wal-g backup-fetch /var/lib/postgresql/data BACKUP_NAME
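# Configure recovery target (for "latest", omit recovery_target_time and recovery_target_action)
cat >> /var/lib/postgresql/data/postgresql.conf <<EOF
restore_command = 'wal-g wal-fetch "%f" "%p"'
recovery_target_time = '2026-05-13 14:30:00 UTC' # set if doing PITR
recovery_target_action = 'promote' # default is 'pause', which would leave step 4 waiting in recovery
EOF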
touch /var/lib/postgresql/data/recovery.signal
```
### 4. Start Postgres and let it replay WAL
```bash
systemctl start postgresql # or your equivalent
# Watch the log — should see "consistent recovery state reached"
tail -f /var/log/postgresql/postgresql-16-main.log
```
Wait until recovery completes and Postgres is in normal (not recovery)
mode:
```bash
psql -c "SELECT pg_is_in_recovery()"
# Expected: f (false)
```
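Rather than re-running that query by hand, the wait can be scripted. The following is a minimal sketch, not part of the shipped tooling: `wait_for_recovery` is a hypothetical helper that polls any probe command until it prints `f`, and gives up after a timeout so a stuck replay is noticed within the RTO budget.

```bash
# Hypothetical helper: poll a probe command until it prints "f" (not in recovery).
# Usage: wait_for_recovery TIMEOUT_SECONDS INTERVAL_SECONDS PROBE [ARGS...]
wait_for_recovery() {
  local timeout_s=$1 interval_s=$2 state
  shift 2
  local deadline=$(( SECONDS + timeout_s ))
  while :; do
    # A failing probe (e.g. Postgres not accepting connections yet) counts as "not ready".
    state=$("$@" 2>/dev/null) || state=""
    if [ "$state" = "f" ]; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval_s"
  done
}
```

Possible usage, with a one-hour budget and a five-second poll: `wait_for_recovery 3600 5 psql -Atc "SELECT pg_is_in_recovery()"`. The `-At` flags make `psql` print the bare `t` or `f` value.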
### 5. Verify the restored database
This is the critical gate before any application traffic touches it.
```bash
# Build the verifier (only needed once)
go build -o restore-verify ./cmd/restore-verify
# Run it against the restored instance
GG_DATABASE_URL='postgres://guestguard:CHANGE_ME@RESTORED_HOST:5432/guestguard?sslmode=require' \
./restore-verify --verbose
```
Expected output: `OK: all N checks passed`. The tool checks:
- All expected tables exist (users, events, guests, tokens, rsvps, etc.)
- All migrations recorded in `schema_migrations`
- No orphan rows across the ten FK relationships we care about
- `users.email` is still unique (case-insensitive)
- No more than one "granting" subscription per user
- Row-count snapshot (for sanity, not pass/fail)
**If any check fails: STOP.** The restore is corrupt — go back to step 2
with an earlier backup OR escalate.
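To make the "no orphan rows" check concrete, here is a sketch of the kind of anti-join such a check can use. It is not the query `restore-verify` runs (the tool's source is the authority); `orphan_count_sql` is a hypothetical helper, `guests.event_id` is an assumed column name, and the parent key is assumed to be `id`.

```bash
# Hypothetical helper: build a query that counts child rows whose parent is missing.
# Args: child table, FK column on the child, parent table (parent key assumed to be "id").
orphan_count_sql() {
  printf 'SELECT count(*) FROM %s c LEFT JOIN %s p ON p.id = c.%s WHERE c.%s IS NOT NULL AND p.id IS NULL;\n' \
    "$1" "$3" "$2" "$2"
}
```

Possible usage for a manual spot check: `psql "$GG_DATABASE_URL" -Atc "$(orphan_count_sql guests event_id events)"`. Any result other than `0` means the relationship has orphans.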
### 6. Apply pending migrations
If the backup is from before a recent migration that shipped to prod,
catch up:
```bash
# The API auto-migrates on boot, but we want to apply migrations
# before traffic, so kick a one-off:
docker run --rm \
-e GG_DATABASE_URL='postgres://...' \
ghcr.io/alchemistkay/guestguard-api:latest \
/app/api --migrate-only
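# Or via psql if you don't have the image. List ONLY the migrations not yet
# applied to the restored DB, in order (migration 0006 ships in the same commit
# as this runbook; add its file name if it is pending). ON_ERROR_STOP makes
# psql exit non-zero on a SQL error so the loop actually stops:
for m in 0001_init 0002_rsvps 0003_auth 0004_notifications_d 0005_billing; do
psql "$GG_DATABASE_URL" -v ON_ERROR_STOP=1 -f internal/storage/migrations/${m}.up.sql || break
done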
```
Run `restore-verify` again after migrations to confirm everything's
still coherent.
### 7. Bring the API back up
```bash
kubectl scale deployment/guestguard-api --replicas=2
kubectl scale deployment/guestguard-notifier --replicas=1
# Watch the logs — expect "http server starting" + "billing enabled via stripe"
kubectl logs -f deployment/guestguard-api --tail=20
```
### 8. Smoke test
- [ ] Hit `/health` → 200
- [ ] Sign in as a known test user → dashboard loads, recent events visible
- [ ] Create a new event → succeeds, appears in list
- [ ] Tail API logs for 5 minutes → no 5xx storms
### 9. Re-enable traffic
- [ ] Remove the maintenance page from the frontend
- [ ] Announce restoration in the status channel + status page
- [ ] Note actual RTO + RPO achieved for the post-mortem
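The achieved figures for the post-mortem can be computed from four timestamps. This is a sketch under stated assumptions: it needs GNU `date` (for `-d`), and `minutes_between` and `within_targets` are hypothetical helpers, with the 60-minute and 5-minute limits taken from the Targets table above.

```bash
# Whole minutes elapsed between two timestamps (anything GNU `date -d` can parse).
minutes_between() {
  local start end
  start=$(date -u -d "$1" +%s) || return 2
  end=$(date -u -d "$2" +%s) || return 2
  echo $(( (end - start) / 60 ))
}

# Hypothetical helper: succeeds only if both targets hold.
# Args: achieved RTO in minutes, achieved RPO in minutes.
within_targets() {
  [ "$1" -le 60 ] && [ "$2" -le 5 ]
}
```

RTO is the time from the "go" decision to serving traffic; RPO is the time from the last recovered write to the moment of failure. For example, `minutes_between '2026-05-13T14:30:00Z' '2026-05-13T15:12:00Z'` covers a go decision at 14:30 and traffic restored at 15:12.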
## Drill procedure
Run this monthly with no real outage to keep the team's hands warm.
1. Provision a throwaway Postgres instance (`postgres-drill-YYYYMM`).
2. Run steps 2-5 against it (skip 1, 7, 8, 9; production stays untouched).
3. `restore-verify` MUST pass.
4. Bonus: spin up an API pointed at the drill DB on a one-off port and
walk through the smoke-test scenarios.
5. Tear down the drill DB.
6. Log the time taken in `docs/RESTORE_DRILL_LOG.md` (or wherever your
team tracks operational drills).
If any step fails during a drill, the production restore procedure is
**unreliable**; treat it as a P1 to fix before the next real failure.
## Rollback (if restore is wrong)
If you complete the restore and discover it's the wrong data:
1. Scale API back to 0
2. Find the next earlier backup
3. Drop and recreate the database on the restored instance
4. Repeat from step 3
**Never** point production at a known-bad restored DB hoping to fix it
later — the API will write new data on top of the corruption and the
salvage gets exponentially harder.
## Migration down-scripts
Every `.up.sql` in `internal/storage/migrations/` has a matching
`.down.sql`. They're tested as part of CI and not exercised during
normal restores (the up-only sequence in step 6 is the path used).
They exist for:
- Drill scenarios where you want to "rewind" the schema
- Emergency rollback of a bad shipped migration
Down-script integrity: run the `TestMigrationRoundtrip` integration
test, which applies every migration up → down → up against a fresh
container.
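A cheaper pre-check that every up-script has a down-script at all can run without a database. The sketch below only tests file presence, so it complements `TestMigrationRoundtrip` rather than replacing it; `check_down_scripts` is a hypothetical helper, not something in the repo.

```bash
# Hypothetical helper: report every NAME.up.sql in DIR that lacks NAME.down.sql.
# Returns 0 when all pairs exist, 1 otherwise.
check_down_scripts() {
  local dir=$1 missing=0 up down
  for up in "$dir"/*.up.sql; do
    # An unmatched glob stays literal; skip it so an empty directory passes.
    [ -e "$up" ] || continue
    down="${up%.up.sql}.down.sql"
    if [ ! -f "$down" ]; then
      echo "missing: ${down##*/}"
      missing=1
    fi
  done
  return "$missing"
}
```

Possible usage from the repo root: `check_down_scripts internal/storage/migrations`.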
## Application config that supports restored DBs
`GG_DATABASE_URL` is the single source of truth — no hardcoded
hostnames anywhere in the codebase. Verified by:
```bash
grep -rE 'postgres://|host=.*5432' --include='*.go' . | grep -v _test.go | grep -v config.go
# Expected: (empty)
```
If anything surfaces from that grep, file a bug — it'll bite the next
restore.
## Escalation
| Step fails | Who to call |
|---|---|
| Steps 1-4 | On-call infra lead |
| Step 5 (`restore-verify`) | On-call backend lead + DBA |
| Steps 7-8 (app won't start / smoke fails) | On-call backend lead |
| Drill failure | File P1 ticket, link the drill log |
## Change log
| Date | Author | Change |
|---|---|---|
| 2026-05-16 | kay | initial version (Block G) |