feat: ship Tier 1 — auth, authz, rate limits, real notifications, CSV import, billing, backups/DR, privacy
Closes every block in docs/TIER1_PLAN.md from the Claude-scope side. The
homelab / cloud setup steps (SES verification, restore drill, lawyer-
drafted ToS) remain operator-owned but are unblocked.
Block A — Authentication
- Migration 0003: password_hash, email_verified, email_verification_tokens,
password_reset_tokens, refresh_tokens (with replaced_by family chain).
- Bcrypt hasher, HS256 JWT signer, single-use refresh tokens with rotation
+ replay-detection (revokes the family on reuse).
- /auth/signup, /login, /refresh, /logout, /verify-email,
/forgot-password, /reset-password — enumeration-safe.
- requireAuth middleware + GET /me.
- Frontend useAuth/useApi with auto-refresh-on-401, login/signup/verify/
forgot/reset pages, route-guard middleware.
Block B — Authorisation
- EventRepo.GetForHost; Update/Delete scoped by host_id.
- All host routes behind requireAuth + ownership; cross-tenant returns
404 (no enumeration). ?host_id removed.
- WS auth via short-lived single-use tickets (POST /auth/ws-ticket).
- Tests: TestCrossTenantIsolation — 9 probes.
Block C — Rate limiting
- Redis sliding-window via Lua (atomic ZADD+ZCARD+PEXPIRE).
- Per-route limits matching the plan (signup IP, login IP+email, RSVP/
access by token, events/guests/tokens by user_id).
- 429 with Retry-After header and JSON body.
- Auth lockout: 5 failed logins → account locked, only password reset
clears it.
- Frontend: useErrMessage normalises 429 + locked messaging.
Block D — Real notifications
- Migration 0004: provider_message_id, bounce_type, complained columns
+ unsubscribes (CITEXT) suppression table.
- Branded HTML + plaintext templates for verification, reset, invitation,
confirmation, reminder. Per-page templates avoid html/template's
contextual-escape collisions.
- Senders: SESv2, Twilio (SMS), SMTP (Mailpit-friendly), Resend HTTP.
- PickEmailSender priority Resend > SMTP > SES > Log — system boots
cleanly in dev with Mailpit; production flips one env var.
- Webhook endpoints (Twilio status + SES SNS) — bounces add to suppression;
signature verification stubbed pending creds.
- Auto-send: POST /tokens publishes invitation.send; notifier renders +
delivers via the configured backend; suppression list honoured.
- Bulk + per-row invitation flow: POST /events/{id}/guests/invitations/bulk
returns per-guest tokens so phone-only guests can be SMS'd manually.
- Unsubscribe: signed HMAC token (no TTL) + /unsubscribe/[token] page.
- WhatsApp Option A+: wa.me click-to-chat wizard with per-guest progress
tracking, isLikelyE164 validation, edit-from-wizard.
- Token rotate (POST /tokens/rotate) invalidates the old URL — used by
the regenerate-link flow.
- Mailpit added to docker-compose for dev inbox.
Block E — CSV import
- Streaming parser: tolerant header detection, UTF-8 BOM + UTF-16 LE/BE
decoding, row-level validation, 5,000-row cap.
- Strict E.164 phone validation with helpful error message.
- POST /preview + /import + GET /template; preview UI on event page;
atomic per-batch with dedup on existing emails.
Phone capture across UI
- PhoneInput component: country picker (~50 ISO codes) + national input +
live E.164 preview + inline length validation.
- Used in Add Guest and Edit Guest modals. Smart paste-handling extracts
country code from full E.164 strings.
Block F — Billing (Stripe)
- Migration 0005: subscriptions table (user_id → tier/status/period_end +
Stripe customer/sub ids). Partial unique index keeps one granting sub
per user.
- internal/billing: Tier + Limits model (Free 1/50, Pro 10/1000, Business
∞/5000), Stripe SDK wrapper with IgnoreAPIVersionMismatch for newer
account API versions.
- /billing/checkout-session, /billing/portal, /billing/status,
/webhooks/stripe (signature-verified, lifecycle events).
- Tier enforcement: 402 on POST /events, /guests, /import with
{error, reason, tier, used, limit, upgrade_url} body.
- Frontend: useBilling composable, /dashboard/billing page (current plan,
usage bars, tier cards), global UpgradeModal triggered by useApi's
402 interceptor.
- Customer portal kept for self-service cancel/payment-method changes.
Block G — Backups & DR (application side)
- Every migration has a tested .down.sql.
- TestMigrationRoundtrip applies all ups → all downs → all ups against a
fresh container; catches asymmetric down migrations.
- cmd/restore-verify: 28-check post-restore invariant tool (schema
presence, no orphans across 10 FK relationships, email uniqueness,
single-active subscription, row-count snapshot).
- docs/RUNBOOK_RESTORE.md: 9-step restore procedure with RTO/RPO
targets, drill instructions, rollback path.
Block H — Privacy compliance (application side)
- Migration 0006: deleted_at + terms_accepted_at + privacy_policy_accepted_at
on users. Partial index on email for live-only uniqueness.
- GET /me/data-export — synchronous JSON dump (user, events, guests,
tokens, rsvps, access_logs, notifications).
- DELETE /me — soft-delete with PII scrub + refresh-token revocation;
re-signup with same email works.
- POST /me/accept-terms — idempotent consent recording.
- Frontend /privacy + /terms placeholder pages with substantive (pending
legal review) copy; footer links; signup terms checkbox; TermsGateModal
for accounts created before the rollout; export + delete buttons on
/dashboard/billing.
Tests
- All migrations verified up/down/up.
- Integration suite: TestE2EHappyPath, TestAuthFlow, TestCrossTenantIsolation,
TestRateLimitSignup, TestLoginLockout, TestUnsubscribeFlow,
TestSESBounceWebhook, TestTwilioStatusWebhook, TestCsvImportFlow,
TestCsvImportAtomicRollback, TestBulkIssueInvitations, TestBulkIssueExplicitSubset,
TestTokenIssuePublishesInvitation, TestTokenIssueWithoutGuestEmailSkipsInvitation,
TestGuestUpdate, TestGuestDelete, TestTokenRotate, TestSMTPSenderAgainstMailpit,
TestFreeTierEventLimit, TestFreeTierGuestLimit, TestBusinessTierBypassesLimits,
TestDataExport, TestDeleteMe, TestAcceptTerms, TestMigrationRoundtrip.
Full suite runs in ~120s against real Postgres + NATS + Redis + Mailpit.
- Unit suite green across internal/auth, internal/csvimport,
internal/notification, internal/ratelimit, internal/domain.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,251 @@
# Runbook: Postgres restore

This is the procedure to bring GuestGuard back from a Postgres backup
after data loss. It assumes the infra side of Block G (`pg_basebackup` +
WAL archiving to S3, daily logical dumps, cross-region replication) is
already in place; see the homelab repo for those.

The application side (migration down-scripts, the [`restore-verify`](../cmd/restore-verify/main.go)
tool, and this document) lives here in the GuestGuard repo so it ships
in lockstep with the schema.

---

## Targets

| Metric | Target |
|---|---|
| RTO (recovery time objective) | ≤ 1 hour from "go" decision to traffic-serving |
| RPO (recovery point objective) | ≤ 5 minutes of data loss (WAL ships every 60s, S3 PUT every 5min) |

If RTO is going to slip past 1 hour, escalate per the comms plan in `docs/INCIDENT_RESPONSE.md` (infra repo).

## When to invoke this

- Primary Postgres is unreachable AND the standby has also failed
- Logical corruption discovered (e.g., a bad migration deleted rows)
- Region-wide outage at the primary's location
- A "what if we restored last Tuesday" drill (see [Drill](#drill-procedure))

If only the primary is unreachable and the standby is healthy, promote
the standby (separate runbook). Don't use this procedure unnecessarily:
restores are expensive.

## Prerequisites

Before starting:

- [ ] Decision authority has approved the restore (CTO or on-call lead)
- [ ] Read access to the S3 backup bucket: `s3://guestguard-pg-backups`
- [ ] `psql`, `pg_basebackup`, `wal-g` (or chosen WAL tool) installed
- [ ] Empty target Postgres instance provisioned (Kubernetes StatefulSet,
      RDS, or homelab box; same major version as the backup)
- [ ] `GG_DATABASE_URL` env var ready for the new instance
- [ ] Maintenance page deployed to the frontend (`/dashboard` returns 503)
- [ ] API + notifier pods scaled to 0 (`kubectl scale --replicas=0`)
- [ ] This document open in another tab

## Steps

### 1. Stop write traffic

```bash
# k8s
kubectl scale deployment/guestguard-api --replicas=0
kubectl scale deployment/guestguard-notifier --replicas=0

# Confirm no connections to the (broken) primary
kubectl exec -n postgres guestguard-pg-0 -- psql -U postgres -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname='guestguard'"
```

If using docker-compose locally: `docker compose stop api notifier`.

### 2. Identify the recovery point

Pick the latest backup that's known-good. For corruption scenarios,
this may mean going further back than the most recent dump.

```bash
# List the ten most recent base backups (newest last)
wal-g backup-list 2>/dev/null | tail -10

# Pick the backup name (e.g. base_000000010000000000000007) and decide
# the recovery target (time or LSN) if doing point-in-time recovery
```

For corruption: pick the latest backup created **before** the corrupting
event. For "ransomware / bad migration", probably 1–2 days back.
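
The "latest backup before the corrupting event" rule can be applied mechanically. The sketch below is only an illustration: `latest_backup_before` is a hypothetical helper, and it assumes each input row carries the backup name in the first column and an ISO-8601 UTC timestamp in the second. Check that against the output of your `wal-g` version before relying on it.

```bash
# Print the name of the newest backup taken strictly before CUTOFF.
# Reads "NAME TIMESTAMP ..." rows on stdin; timestamps must be ISO-8601 UTC
# (e.g. 2026-05-13T02:00:00Z) so that string order equals time order.
# Rows whose second column is not a timestamp (such as a header) are ignored.
latest_backup_before() {
  local cutoff=$1
  awk -v cutoff="$cutoff" '
    $2 ~ /^[0-9][0-9][0-9][0-9]-/ && $2 < cutoff && $2 > best { best = $2; name = $1 }
    END { if (name == "") exit 1; print name }
  '
}
```

Typical use would be `wal-g backup-list | latest_backup_before 2026-05-13T14:30:00Z`; a non-zero exit status means no backup predates the cutoff.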

### 3. Restore the base backup

```bash
# Replace BACKUP_NAME with the chosen base
wal-g backup-fetch /var/lib/postgresql/data BACKUP_NAME

# Configure recovery target (omit both recovery_target_* lines for "latest")
cat >> /var/lib/postgresql/data/postgresql.conf <<EOF
restore_command = 'wal-g wal-fetch "%f" "%p"'
recovery_target_time = '2026-05-13 14:30:00 UTC'  # set if doing PITR
recovery_target_action = 'promote'  # PITR only: the default 'pause' keeps the server in recovery
EOF

touch /var/lib/postgresql/data/recovery.signal
```

### 4. Start Postgres and let it replay WAL

```bash
systemctl start postgresql  # or your equivalent

# Watch the log; it should show "consistent recovery state reached"
tail -f /var/log/postgresql/postgresql-16-main.log
```

Wait until recovery completes and Postgres is in normal (not recovery)
mode:

```bash
psql -c "SELECT pg_is_in_recovery()"
# Expected: f (false)
```
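
Rather than re-running that query by hand, the wait can be scripted. This is a minimal sketch, not part of the repo: `wait_until_out_of_recovery` is a hypothetical helper, and the default probe assumes `psql` can reach the restored instance through the usual `PG*` environment variables.

```bash
# Poll until Postgres reports it has left recovery, or give up.
# Usage: wait_until_out_of_recovery MAX_TRIES DELAY_SECONDS [PROBE_CMD...]
wait_until_out_of_recovery() {
  local max_tries=$1 delay=$2
  shift 2
  # Default probe: unaligned, tuples-only psql prints just "t" or "f".
  if [ "$#" -eq 0 ]; then
    set -- psql -At -c 'SELECT pg_is_in_recovery()'
  fi
  local i state
  for ((i = 1; i <= max_tries; i++)); do
    # A failed probe (server still starting) counts as "not ready yet".
    state=$("$@" 2>/dev/null) || state=""
    if [ "$state" = "f" ]; then
      echo "recovery finished after $i check(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still in recovery after $max_tries checks" >&2
  return 1
}
```

For example, `wait_until_out_of_recovery 120 5` would check every 5 seconds for up to 10 minutes. A timeout is a prompt to read the Postgres log, not a reason to proceed.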

### 5. Verify the restored database

This is the critical gate before any application traffic touches it.

```bash
# Build the verifier (only needed once)
go build -o restore-verify ./cmd/restore-verify

# Run it against the restored instance
GG_DATABASE_URL='postgres://guestguard:CHANGE_ME@RESTORED_HOST:5432/guestguard?sslmode=require' \
  ./restore-verify --verbose
```

Expected output: `OK: all N checks passed`. The tool checks:

- All expected tables exist (users, events, guests, tokens, rsvps, etc.)
- All migrations recorded in `schema_migrations`
- No orphan rows across the ten FK relationships we care about
- `users.email` is still unique (case-insensitive)
- No more than one "granting" subscription per user
- Row-count snapshot (for sanity, not pass/fail)

**If any check fails: STOP.** The restore is corrupt; go back to step 2
with an earlier backup OR escalate.
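
When a check fails and you want to inspect one relationship by hand, an orphan check boils down to an anti-join. The sketch below only illustrates that idea and is not how `restore-verify` is implemented: `orphan_check_sql` is a hypothetical helper, it assumes the parent table's primary key is named `id`, and the table and column names in the usage line are assumptions to be replaced with real ones from the schema.

```bash
# Print a query that counts child rows whose foreign key points at a
# missing parent row. Assumes the parent's primary key column is "id".
# Usage: orphan_check_sql CHILD_TABLE FK_COLUMN PARENT_TABLE
orphan_check_sql() {
  local child=$1 fk=$2 parent=$3
  printf 'SELECT count(*) FROM %s c LEFT JOIN %s p ON p.id = c.%s WHERE c.%s IS NOT NULL AND p.id IS NULL;\n' \
    "$child" "$parent" "$fk" "$fk"
}
```

Something like `psql "$GG_DATABASE_URL" -At -c "$(orphan_check_sql guests event_id events)"` should then print `0` on a healthy restore.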

### 6. Apply pending migrations

If the backup is from before a recent migration that shipped to prod,
catch up:

```bash
# The API auto-migrates on boot, but we want to apply migrations
# before traffic, so kick a one-off:
docker run --rm \
  -e GG_DATABASE_URL='postgres://...' \
  ghcr.io/alchemistkay/guestguard-api:latest \
  /app/api --migrate-only

# Or via psql, applying each .up.sql in order if you don't have the image.
# Trim the list to the migrations still pending and extend it with any newer than 0005.
# ON_ERROR_STOP makes psql exit non-zero on a failed statement, so the loop stops.
for m in 0001_init 0002_rsvps 0003_auth 0004_notifications_d 0005_billing; do
  psql "$GG_DATABASE_URL" -v ON_ERROR_STOP=1 -f "internal/storage/migrations/${m}.up.sql" || break
done
```

Run `restore-verify` again after migrations to confirm everything's
still coherent.

### 7. Bring the API back up

```bash
kubectl scale deployment/guestguard-api --replicas=2
kubectl scale deployment/guestguard-notifier --replicas=1

# Watch the logs; expect "http server starting" + "billing enabled via stripe"
kubectl logs -f deployment/guestguard-api --tail=20
```

### 8. Smoke test

- [ ] Hit `/health` → 200
- [ ] Sign in as a known test user → dashboard loads, recent events visible
- [ ] Create a new event → succeeds, appears in list
- [ ] Tail API logs for 5 minutes → no 5xx storms

### 9. Re-enable traffic

- [ ] Remove the maintenance page from the frontend
- [ ] Announce restoration in the status channel + status page
- [ ] Note actual RTO + RPO achieved for the post-mortem
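
For the post-mortem numbers, the arithmetic is a duration compared against the target. A small sketch, assuming you captured epoch-second timestamps (for example with `date -u +%s`) at the "go" decision and when traffic resumed; `check_objective` is a hypothetical helper name, not something in the repo.

```bash
# Compare an achieved duration against a target.
# Usage: check_objective LABEL START_EPOCH END_EPOCH TARGET_MINUTES
# Prints a one-line summary; returns non-zero when the target was missed.
check_objective() {
  local label=$1 start=$2 end=$3 target=$4
  local minutes=$(( (end - start + 59) / 60 ))  # round up to whole minutes
  if [ "$minutes" -le "$target" ]; then
    echo "$label: ${minutes} min (target ${target} min): met"
  else
    echo "$label: ${minutes} min (target ${target} min): missed"
    return 1
  fi
}
```

For RTO the interval runs from the "go" decision to traffic-serving with a target of 60. For RPO it runs from the last recovered transaction to the incident time with a target of 5.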

## Drill procedure

Run this monthly with no real outage to keep the team's hands warm.

1. Provision a throwaway Postgres instance (`postgres-drill-YYYYMM`).
2. Run steps 2–5 against it (skip 1, 7, 8, 9; production stays untouched).
   Include step 6 if the backup predates a shipped migration.
3. `restore-verify` MUST pass.
4. Bonus: spin up an API pointed at the drill DB on a one-off port and
   walk through the smoke-test scenarios.
5. Tear down the drill DB.
6. Log the time taken in `docs/RESTORE_DRILL_LOG.md` (or wherever your
   team tracks operational drills).

If any step fails during a drill, the production restore procedure is
**unreliable**; treat it as a P1 to fix before the next real failure.

## Rollback (if restore is wrong)

If you complete the restore and discover it's the wrong data:

1. Scale API back to 0
2. Find the next earlier backup
3. Stop Postgres on the restored instance and empty its data directory
   (a base-backup restore replaces the whole cluster, so dropping one
   database is not enough)
4. Repeat from step 3 of [Steps](#steps) (restore the base backup)

**Never** point production at a known-bad restored DB hoping to fix it
later: the API will write new data on top of the corruption and the
salvage gets exponentially harder.

## Migration down-scripts

Every `.up.sql` in `internal/storage/migrations/` has a matching
`.down.sql`. They're tested as part of CI and not exercised during
normal restores (the up-only sequence in step 6 is the path used).
They exist for:

- Drill scenarios where you want to "rewind" the schema
- Emergency rollback of a bad shipped migration

Down-script integrity: run the `TestMigrationRoundtrip` integration
test, which applies every migration up → down → up against a fresh
container.
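
A cheaper pre-check than the roundtrip test is to confirm the pairing itself. The sketch below covers only file presence, not correctness, and `missing_down_scripts` is a hypothetical helper rather than something shipped in the repo.

```bash
# List every NAME.down.sql that is missing for a NAME.up.sql in DIR.
# Prints the missing paths and returns non-zero if any are missing.
missing_down_scripts() {
  local dir=$1 up down missing=0
  for up in "$dir"/*.up.sql; do
    [ -e "$up" ] || continue  # glob did not match: no .up.sql files
    down="${up%.up.sql}.down.sql"
    if [ ! -f "$down" ]; then
      echo "$down"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `missing_down_scripts internal/storage/migrations` should print nothing and exit 0 when every migration is paired.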

## Application config that supports restored DBs

`GG_DATABASE_URL` is the single source of truth; there are no hardcoded
hostnames anywhere in the codebase. Verified by:

```bash
grep -rE 'postgres://|host=.*5432' --include='*.go' . | grep -v _test.go | grep -v config.go
# Expected: (empty)
```

If anything surfaces from that grep, file a bug: it'll bite the next
restore.
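
Because one variable decides which database the API talks to, it is worth confirming its target before scaling anything back up. A minimal sketch, with `db_target` as a hypothetical helper that prints only the host and port so the password never lands in a terminal log:

```bash
# Print the host:port that GG_DATABASE_URL points at, without credentials.
db_target() {
  if [ -z "${GG_DATABASE_URL:-}" ]; then
    echo "GG_DATABASE_URL is not set" >&2
    return 1
  fi
  local rest=${GG_DATABASE_URL#*://}  # drop the scheme
  rest=${rest%%[/?]*}                 # drop the path and query string
  echo "${rest##*@}"                  # drop user:password@
}
```

With the example URL from step 5 this prints `RESTORED_HOST:5432`.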

## Escalation

| Step fails | Who to call |
|---|---|
| Steps 1–4 | On-call infra lead |
| Step 5 (`restore-verify`) | On-call backend lead + DBA |
| Steps 7–8 (app won't start / smoke fails) | On-call backend lead |
| Drill failure | File P1 ticket, link the drill log |

## Change log

| Date | Author | Change |
|---|---|---|
| 2026-05-16 | kay | initial version (Block G) |