Closes every block in docs/TIER1_PLAN.md from the Claude-scope side. The
homelab / cloud setup steps (SES verification, restore drill, lawyer-
drafted ToS) remain operator-owned but are unblocked.
Block A — Authentication
- Migration 0003: password_hash, email_verified, email_verification_tokens,
password_reset_tokens, refresh_tokens (with replaced_by family chain).
- Bcrypt hasher, HS256 JWT signer, single-use refresh tokens with rotation
+ replay-detection (revokes the family on reuse).
- /auth/signup, /login, /refresh, /logout, /verify-email,
/forgot-password, /reset-password — enumeration-safe.
- requireAuth middleware + GET /me.
- Frontend useAuth/useApi with auto-refresh-on-401, login/signup/verify/
forgot/reset pages, route-guard middleware.
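The rotation and replay-detection rule above can be sketched in memory as follows. This is an illustration under assumptions, not the shipped code: the Store type, its field names, and the error values are hypothetical, and the real implementation persists tokens in the refresh_tokens table.

```go
package main

import (
	"errors"
	"fmt"
)

// refreshToken loosely mirrors a refresh_tokens row. Field names are illustrative.
type refreshToken struct {
	family  string
	used    bool
	revoked bool
}

// Store is an in-memory stand-in for the refresh_tokens table.
type Store struct {
	tokens map[string]*refreshToken
	next   int
}

func NewStore() *Store { return &Store{tokens: map[string]*refreshToken{}} }

// Issue creates a fresh token in the given family and returns its id.
func (s *Store) Issue(family string) string {
	s.next++
	id := fmt.Sprintf("rt-%d", s.next)
	s.tokens[id] = &refreshToken{family: family}
	return id
}

var (
	ErrInvalid = errors.New("invalid refresh token")
	ErrReplay  = errors.New("refresh token reuse detected; family revoked")
)

// Rotate exchanges a token for its successor. Presenting an already-used
// token is treated as replay and revokes every token in the same family.
func (s *Store) Rotate(id string) (string, error) {
	t, ok := s.tokens[id]
	if !ok || t.revoked {
		return "", ErrInvalid
	}
	if t.used {
		for _, other := range s.tokens {
			if other.family == t.family {
				other.revoked = true
			}
		}
		return "", ErrReplay
	}
	t.used = true
	return s.Issue(t.family), nil
}

func main() {
	s := NewStore()
	first := s.Issue("family-1")
	second, _ := s.Rotate(first)
	_, err := s.Rotate(first) // replay of a spent token
	fmt.Println(err)
	_, err = s.Rotate(second) // the successor was revoked along with the family
	fmt.Println(err)
}
```

Revoking the whole family means a stolen token is only useful until either party presents a spent one.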
Block B — Authorisation
- EventRepo.GetForHost; Update/Delete scoped by host_id.
- All host routes behind requireAuth + ownership; cross-tenant returns
404 (no enumeration). ?host_id removed.
- WS auth via short-lived single-use tickets (POST /auth/ws-ticket).
- Tests: TestCrossTenantIsolation — 9 probes.
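A minimal sketch of why cross-tenant probes see 404 rather than 403: the lookup itself is scoped by host, so "missing" and "owned by someone else" are indistinguishable. Only the name EventRepo.GetForHost comes from the notes above; its signature, the Event fields, and statusFor are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

type Event struct {
	ID     string
	HostID string
}

var ErrNotFound = errors.New("event not found")

// EventRepo is an in-memory stand-in for the Postgres-backed repository.
type EventRepo struct {
	events map[string]Event
}

// GetForHost returns the event only when it belongs to hostID. A missing
// event and an event owned by another host produce the same error.
func (r EventRepo) GetForHost(eventID, hostID string) (Event, error) {
	e, ok := r.events[eventID]
	if !ok || e.HostID != hostID {
		return Event{}, ErrNotFound
	}
	return e, nil
}

// statusFor maps a repository error to the HTTP status a handler would send.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	repo := EventRepo{events: map[string]Event{"evt-1": {ID: "evt-1", HostID: "alice"}}}
	for _, host := range []string{"alice", "bob"} {
		_, err := repo.GetForHost("evt-1", host)
		fmt.Println(host, statusFor(err))
	}
	_, err := repo.GetForHost("evt-missing", "bob")
	fmt.Println("bob (missing event)", statusFor(err))
}
```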
Block C — Rate limiting
- Redis sliding-window via Lua (atomic ZADD+ZCARD+PEXPIRE).
- Per-route limits matching the plan (signup IP, login IP+email, RSVP/
access by token, events/guests/tokens by user_id).
- 429 with Retry-After header and JSON body.
- Auth lockout: 5 failed logins → account locked, only password reset
clears it.
- Frontend: useErrMessage normalises 429 + locked messaging.
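The sliding-window idea can be illustrated without Redis. The sketch below keeps per-key timestamps in memory, where the real limiter uses a sorted set and a Lua script; the type and method names are hypothetical. Unlike a ZADD-then-ZCARD script, this version does not record rejected hits, so the two may differ at the edges.

```go
package main

import "fmt"

// SlidingWindow allows at most limit hits per key within windowMs milliseconds.
type SlidingWindow struct {
	limit    int
	windowMs int64
	hits     map[string][]int64 // per-key hit timestamps in ms, oldest first
}

func NewSlidingWindow(limit int, windowMs int64) *SlidingWindow {
	return &SlidingWindow{limit: limit, windowMs: windowMs, hits: map[string][]int64{}}
}

// Allow reports whether a hit at nowMs fits in the window. When it does not,
// retryAfterMs is the time until the oldest counted hit leaves the window,
// which is what a Retry-After header would be derived from.
func (s *SlidingWindow) Allow(key string, nowMs int64) (ok bool, retryAfterMs int64) {
	if s.limit <= 0 {
		return false, s.windowMs
	}
	cutoff := nowMs - s.windowMs
	kept := s.hits[key][:0]
	for _, t := range s.hits[key] {
		if t > cutoff { // drop hits that have aged out of the window
			kept = append(kept, t)
		}
	}
	if len(kept) >= s.limit {
		s.hits[key] = kept
		return false, kept[0] + s.windowMs - nowMs
	}
	s.hits[key] = append(kept, nowMs)
	return true, 0
}

func main() {
	rl := NewSlidingWindow(2, 10_000) // 2 hits per 10 seconds
	for _, now := range []int64{0, 1_000, 2_000, 11_000} {
		ok, retry := rl.Allow("signup:203.0.113.7", now)
		fmt.Printf("t=%dms allowed=%v retryAfterMs=%d\n", now, ok, retry)
	}
}
```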
Block D — Real notifications
- Migration 0004: provider_message_id, bounce_type, complained columns
+ unsubscribes (CITEXT) suppression table.
- Branded HTML + plaintext templates for verification, reset, invitation,
confirmation, reminder. Per-page templates avoid html/template's
contextual-escape collisions.
- Senders: SESv2, Twilio (SMS), SMTP (Mailpit-friendly), Resend HTTP.
- PickEmailSender priority Resend > SMTP > SES > Log — system boots
cleanly in dev with Mailpit; production flips one env var.
- Webhook endpoints (Twilio status + SES SNS) — bounces add to suppression;
signature verification stubbed pending creds.
- Auto-send: POST /tokens publishes invitation.send; notifier renders +
delivers via the configured backend; suppression list honoured.
- Bulk + per-row invitation flow: POST /events/{id}/guests/invitations/bulk
returns per-guest tokens so phone-only guests can be SMS'd manually.
- Unsubscribe: signed HMAC token (no TTL) + /unsubscribe/[token] page.
- WhatsApp Option A+: wa.me click-to-chat wizard with per-guest progress
tracking, isLikelyE164 validation, edit-from-wizard.
- Token rotate (POST /tokens/rotate) invalidates the old URL — used by
the regenerate-link flow.
- Mailpit added to docker-compose for dev inbox.
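For the unsubscribe link, a signed token with no TTL can be as simple as a payload plus an HMAC over it. The sketch below is one possible shape, assuming HMAC-SHA256 and a "payload.signature" layout; the function names and token format are illustrative and may not match the shipped code.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

var ErrBadToken = errors.New("invalid unsubscribe token")

// sign returns the HMAC-SHA256 of payload under secret.
func sign(secret []byte, payload string) []byte {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	return mac.Sum(nil)
}

// SignUnsubscribe builds a URL-safe token of the form base64(email).base64(mac).
// There is no expiry field, so the link keeps working indefinitely.
func SignUnsubscribe(secret []byte, email string) string {
	payload := base64.RawURLEncoding.EncodeToString([]byte(email))
	sig := base64.RawURLEncoding.EncodeToString(sign(secret, payload))
	return payload + "." + sig
}

// VerifyUnsubscribe checks the signature in constant time and returns the email.
func VerifyUnsubscribe(secret []byte, token string) (string, error) {
	payload, sig, ok := strings.Cut(token, ".")
	if !ok {
		return "", ErrBadToken
	}
	got, err := base64.RawURLEncoding.DecodeString(sig)
	if err != nil || !hmac.Equal(got, sign(secret, payload)) {
		return "", ErrBadToken
	}
	email, err := base64.RawURLEncoding.DecodeString(payload)
	if err != nil {
		return "", ErrBadToken
	}
	return string(email), nil
}

func main() {
	secret := []byte("dev-only-secret") // hypothetical; load from config in practice
	token := SignUnsubscribe(secret, "guest@example.com")
	email, err := VerifyUnsubscribe(secret, token)
	fmt.Println(email, err)
}
```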
Block E — CSV import
- Streaming parser: tolerant header detection, UTF-8 BOM + UTF-16 LE/BE
decoding, row-level validation, 5,000-row cap.
- Strict E.164 phone validation with helpful error message.
- POST /preview + /import + GET /template; preview UI on event page;
atomic per-batch with dedup on existing emails.
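Strict E.164 validation comes down to "a plus sign, a non-zero leading digit, at most 15 digits in total". A minimal sketch, assuming a regexp-based check; ValidateE164 and its error wording are hypothetical and not the importer's actual function.

```go
package main

import (
	"fmt"
	"regexp"
)

// e164Pattern: "+", a non-zero first digit, then 1 to 14 more digits.
var e164Pattern = regexp.MustCompile(`^\+[1-9][0-9]{1,14}$`)

// ValidateE164 returns nil for a strict E.164 number and a descriptive
// error otherwise. Spaces, dashes, and a missing "+" are all rejected.
func ValidateE164(phone string) error {
	if e164Pattern.MatchString(phone) {
		return nil
	}
	return fmt.Errorf("phone %q is not E.164: expected + followed by country code and number, at most 15 digits, e.g. +447911123456", phone)
}

func main() {
	for _, p := range []string{"+447911123456", "07911123456", "+44 7911 123456"} {
		fmt.Println(p, "->", ValidateE164(p))
	}
}
```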
Phone capture across UI
- PhoneInput component: country picker (~50 ISO codes) + national input +
live E.164 preview + inline length validation.
- Used in Add Guest and Edit Guest modals. Smart paste-handling extracts
country code from full E.164 strings.
Block F — Billing (Stripe)
- Migration 0005: subscriptions table (user_id → tier/status/period_end +
Stripe customer/sub ids). Partial unique index keeps one granting sub
per user.
- internal/billing: Tier + Limits model (Free 1/50, Pro 10/1000, Business
∞/5000), Stripe SDK wrapper with IgnoreAPIVersionMismatch for newer
account API versions.
- /billing/checkout-session, /billing/portal, /billing/status,
/webhooks/stripe (signature-verified, lifecycle events).
- Tier enforcement: 402 on POST /events, /guests, /import with
{error, reason, tier, used, limit, upgrade_url} body.
- Frontend: useBilling composable, /dashboard/billing page (current plan,
usage bars, tier cards), global UpgradeModal triggered by useApi's
402 interceptor.
- Customer portal kept for self-service cancel/payment-method changes.
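Tier enforcement can be pictured as a table lookup followed by a structured 402 body. The sketch below assumes the "1/50"-style figures mean events per account and guests per event. The type names, the error and reason strings, and the upgrade URL value are illustrative; only the JSON field names are taken from the notes above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Tier string

const (
	Free     Tier = "free"
	Pro      Tier = "pro"
	Business Tier = "business"
)

// Unlimited marks a limit that is never enforced.
const Unlimited = -1

type Limits struct {
	MaxEvents         int
	MaxGuestsPerEvent int
}

var tierLimits = map[Tier]Limits{
	Free:     {MaxEvents: 1, MaxGuestsPerEvent: 50},
	Pro:      {MaxEvents: 10, MaxGuestsPerEvent: 1000},
	Business: {MaxEvents: Unlimited, MaxGuestsPerEvent: 5000},
}

// LimitExceeded is the body sent alongside HTTP 402.
type LimitExceeded struct {
	Error      string `json:"error"`
	Reason     string `json:"reason"`
	Tier       Tier   `json:"tier"`
	Used       int    `json:"used"`
	Limit      int    `json:"limit"`
	UpgradeURL string `json:"upgrade_url"`
}

// CheckEventQuota returns nil when the user may create another event.
// An unknown tier has zero limits and is therefore always denied.
func CheckEventQuota(tier Tier, used int) *LimitExceeded {
	limit := tierLimits[tier].MaxEvents
	if limit == Unlimited || used < limit {
		return nil
	}
	return &LimitExceeded{
		Error:      "payment_required",     // hypothetical value
		Reason:     "event_limit_reached",  // hypothetical value
		Tier:       tier,
		Used:       used,
		Limit:      limit,
		UpgradeURL: "/dashboard/billing", // assumed to point at the billing page
	}
}

func main() {
	if denied := CheckEventQuota(Free, 1); denied != nil {
		body, _ := json.MarshalIndent(denied, "", "  ")
		fmt.Println(string(body))
	}
}
```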
Block G — Backups & DR (application side)
- Every migration has a tested .down.sql.
- TestMigrationRoundtrip applies all ups → all downs → all ups against a
fresh container; catches asymmetric down migrations.
- cmd/restore-verify: 28-check post-restore invariant tool (schema
presence, no orphans across 10 FK relationships, email uniqueness,
single-active subscription, row-count snapshot).
- docs/RUNBOOK_RESTORE.md: 9-step restore procedure with RTO/RPO
targets, drill instructions, rollback path.
Block H — Privacy compliance (application side)
- Migration 0006: deleted_at + terms_accepted_at + privacy_policy_accepted_at
on users. Partial index on email for live-only uniqueness.
- GET /me/data-export — synchronous JSON dump (user, events, guests,
tokens, rsvps, access_logs, notifications).
- DELETE /me — soft-delete with PII scrub + refresh-token revocation;
re-signup with same email works.
- POST /me/accept-terms — idempotent consent recording.
- Frontend /privacy + /terms placeholder pages with substantive (pending
legal review) copy; footer links; signup terms checkbox; TermsGateModal
for accounts created before the rollout; export + delete buttons on
/dashboard/billing.
Tests
- All migrations verified up/down/up.
- Integration suite: TestE2EHappyPath, TestAuthFlow, TestCrossTenantIsolation,
TestRateLimitSignup, TestLoginLockout, TestUnsubscribeFlow,
TestSESBounceWebhook, TestTwilioStatusWebhook, TestCsvImportFlow,
TestCsvImportAtomicRollback, TestBulkIssueInvitations, TestBulkIssueExplicitSubset,
TestTokenIssuePublishesInvitation, TestTokenIssueWithoutGuestEmailSkipsInvitation,
TestGuestUpdate, TestGuestDelete, TestTokenRotate, TestSMTPSenderAgainstMailpit,
TestFreeTierEventLimit, TestFreeTierGuestLimit, TestBusinessTierBypassesLimits,
TestDataExport, TestDeleteMe, TestAcceptTerms, TestMigrationRoundtrip.
Full suite runs in ~120s against real Postgres + NATS + Redis + Mailpit.
- Unit suite green across internal/auth, internal/csvimport,
internal/notification, internal/ratelimit, internal/domain.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Runbook — Postgres restore
This is the procedure to bring GuestGuard back from a Postgres backup
after data loss. It assumes the infra side of Block G (pg_basebackup +
WAL archiving to S3, daily logical dumps, cross-region replication) is
already in place — see the homelab repo for those.
The application side — migration down-scripts, the restore-verify
tool, and this document — lives here in the GuestGuard repo so it ships
in lockstep with the schema.
Targets
| Metric | Target |
|---|---|
| RTO (recovery time objective) | ≤ 1 hour from "go" decision to traffic-serving |
| RPO (recovery point objective) | ≤ 5 minutes of data loss (WAL ships every 60s, S3 PUT every 5min) |
If RTO is going to slip past 1 hour, escalate per the comms plan in docs/INCIDENT_RESPONSE.md (infra repo).
When to invoke this
- Primary Postgres is unreachable AND the standby has also failed
- Logical corruption discovered (e.g., a bad migration deleted rows)
- Region-wide outage at the primary's location
- A "what if we restored last Tuesday" drill (see Drill)
If only the primary is unreachable and the standby is healthy, promote the standby (separate runbook). Don't use this procedure unnecessarily — restores are expensive.
Prerequisites
Before starting:
- Decision authority has approved the restore (CTO or on-call lead)
- Read access to the S3 backup bucket: s3://guestguard-pg-backups
- psql, pg_basebackup, wal-g (or chosen WAL tool) installed
- Empty target Postgres instance provisioned (Kubernetes Statefulset, RDS, or homelab box; same major version as the backup)
- GG_DATABASE_URL env var ready for the new instance
- Maintenance page deployed to the frontend (/dashboard returns 503)
- API + notifier pods scaled to 0 (kubectl scale --replicas=0)
- This document open in another tab
Steps
1. Stop write traffic
# k8s
kubectl scale deployment/guestguard-api --replicas=0
kubectl scale deployment/guestguard-notifier --replicas=0
# Confirm no connections to the (broken) primary
kubectl exec -n postgres guestguard-pg-0 -- psql -U postgres -c \
"SELECT count(*) FROM pg_stat_activity WHERE datname='guestguard'"
If using docker-compose locally: docker compose stop api notifier.
2. Identify the recovery point
Pick the latest backup that's known-good. For corruption scenarios, this may mean going further back than the most recent dump.
# List base backups and keep the last ten lines of the listing
wal-g backup-list 2>/dev/null | tail -10
# Pick the backup name (e.g. base_000000010000000000000007) and decide
# the recovery target (timestamp or LSN) if doing point-in-time recovery
For corruption: pick the latest backup created before the corrupting event. For "ransomware / bad migration", probably 1–2 days back.
3. Restore the base backup
# Replace BACKUP_NAME with the chosen base
wal-g backup-fetch /var/lib/postgresql/data BACKUP_NAME
# Configure recovery target (omit recovery_target_time for "latest")
cat >> /var/lib/postgresql/data/postgresql.conf <<EOF
restore_command = 'wal-g wal-fetch "%f" "%p"'
recovery_target_time = '2026-05-13 14:30:00 UTC' # set if doing PITR
recovery_target_action = 'promote' # the default is 'pause', which would leave step 4 stuck in recovery
EOF
touch /var/lib/postgresql/data/recovery.signal
4. Start Postgres and let it replay WAL
systemctl start postgresql # or your equivalent
# Watch the log — should see "consistent recovery state reached"
tail -f /var/log/postgresql/postgresql-16-main.log
Wait until recovery completes and Postgres is in normal (not recovery) mode:
psql -c "SELECT pg_is_in_recovery()"
# Expected: f (false)
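If the wait is scripted rather than watched by hand, it amounts to a bounded polling loop. The sketch below is a generic version under assumptions: WaitForPromotion is a hypothetical helper, and the probe is injected so that a real caller could back it with a query for pg_is_in_recovery() through whichever driver the project uses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrStillInRecovery = errors.New("postgres is still in recovery")

// WaitForPromotion calls inRecovery up to maxAttempts times, sleeping
// interval between attempts, and returns nil as soon as the probe
// succeeds and reports false. Probe errors are retried like "still recovering".
func WaitForPromotion(inRecovery func() (bool, error), interval time.Duration, maxAttempts int) error {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		if i > 0 {
			time.Sleep(interval)
		}
		recovering, err := inRecovery()
		if err == nil && !recovering {
			return nil
		}
		lastErr = err
	}
	if lastErr != nil {
		return fmt.Errorf("last probe failed: %w", lastErr)
	}
	return ErrStillInRecovery
}

func main() {
	// Stand-in probe: reports "in recovery" twice, then "promoted".
	calls := 0
	probe := func() (bool, error) {
		calls++
		return calls < 3, nil
	}
	err := WaitForPromotion(probe, 10*time.Millisecond, 5)
	fmt.Println("calls:", calls, "err:", err)
}
```

Bounding the attempts keeps a stalled WAL replay from silently eating the RTO budget.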
5. Verify the restored database
This is the critical gate before any application traffic touches it.
# Build the verifier (only needed once)
go build -o restore-verify ./cmd/restore-verify
# Run it against the restored instance
GG_DATABASE_URL='postgres://guestguard:CHANGE_ME@RESTORED_HOST:5432/guestguard?sslmode=require' \
./restore-verify --verbose
Expected output: OK: all N checks passed. The tool checks:
- All expected tables exist (users, events, guests, tokens, rsvps, etc.)
- All migrations recorded in schema_migrations
- No orphan rows across the ten FK relationships we care about
- users.email is still unique (case-insensitive)
- No more than one "granting" subscription per user
- Row-count snapshot (for sanity, not pass/fail)
If any check fails: STOP. The restore is corrupt — go back to step 2 with an earlier backup OR escalate.
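To make the orphan check concrete: for each foreign key, count the child rows whose parent is missing. The sketch below only builds the SQL. It is not the code in cmd/restore-verify, and the FK type, the guests.event_id column, and the assumption that parent keys are named id are all illustrative.

```go
package main

import "fmt"

// FK describes one child-to-parent relationship to verify.
type FK struct {
	ChildTable  string
	ChildColumn string
	ParentTable string
}

// OrphanQuery returns SQL that counts child rows pointing at a parent
// row that no longer exists. A healthy restore yields 0 for every FK.
func OrphanQuery(fk FK) string {
	return fmt.Sprintf(
		"SELECT count(*) FROM %s c LEFT JOIN %s p ON c.%s = p.id WHERE c.%s IS NOT NULL AND p.id IS NULL",
		fk.ChildTable, fk.ParentTable, fk.ChildColumn, fk.ChildColumn)
}

func main() {
	checks := []FK{
		{ChildTable: "events", ChildColumn: "host_id", ParentTable: "users"},
		{ChildTable: "guests", ChildColumn: "event_id", ParentTable: "events"}, // assumed column name
	}
	for _, fk := range checks {
		fmt.Println(OrphanQuery(fk))
	}
}
```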
6. Apply pending migrations
If the backup is from before a recent migration that shipped to prod, catch up:
# The API auto-migrates on boot, but we want to apply migrations
# before traffic, so kick a one-off:
docker run --rm \
-e GG_DATABASE_URL='postgres://...' \
ghcr.io/alchemistkay/guestguard-api:latest \
/app/api --migrate-only
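# Or via psql if you don't have the image. List only the migrations not yet
# recorded in schema_migrations, in order. ON_ERROR_STOP makes psql exit
# non-zero on a SQL error, so "|| break" actually stops the loop.
for m in 0001_init 0002_rsvps 0003_auth 0004_notifications_d 0005_billing; do
psql "$GG_DATABASE_URL" -v ON_ERROR_STOP=1 -f internal/storage/migrations/${m}.up.sql || break
done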
Run restore-verify again after migrations to confirm everything's
still coherent.
7. Bring the API back up
kubectl scale deployment/guestguard-api --replicas=2
kubectl scale deployment/guestguard-notifier --replicas=1
# Watch the logs — expect "http server starting" + "billing enabled via stripe"
kubectl logs -f deployment/guestguard-api --tail=20
8. Smoke test
- Hit /health → 200
- Sign in as a known test user → dashboard loads, recent events visible
- Create a new event → succeeds, appears in list
- Tail API logs for 5 minutes → no 5xx storms
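The first smoke-test item is easy to automate. A minimal sketch, assuming /health answers a plain GET with 200; CheckHealth, fakeClient, and the example base URL are hypothetical, and the fake client exists only so that the sketch runs without a live API.

```go
package main

import (
	"fmt"
	"net/http"
)

// CheckHealth issues GET baseURL/health and fails on anything but 200.
func CheckHealth(client *http.Client, baseURL string) error {
	resp, err := client.Get(baseURL + "/health")
	if err != nil {
		return fmt.Errorf("health request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check returned %d, want 200", resp.StatusCode)
	}
	return nil
}

// roundTripFunc adapts a function to http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// fakeClient answers every request with the given status, for dry runs.
func fakeClient(status int) *http.Client {
	return &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		return &http.Response{StatusCode: status, Body: http.NoBody, Request: r}, nil
	})}
}

func main() {
	// Against a real deployment, pass http.DefaultClient and the API's base URL.
	fmt.Println(CheckHealth(fakeClient(200), "http://api.example.internal"))
	fmt.Println(CheckHealth(fakeClient(503), "http://api.example.internal"))
}
```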
9. Re-enable traffic
- Remove the maintenance page from the frontend
- Announce restoration in the status channel + status page
- Note actual RTO + RPO achieved for the post-mortem
Drill procedure
Run this monthly with no real outage to keep the team's hands warm.
- Provision a throwaway Postgres instance (postgres-drill-YYYYMM).
- Run steps 2–5 against it (skip 1, 7, 8, 9; production stays untouched).
- restore-verify MUST pass.
- Bonus: spin up an API pointed at the drill DB on a one-off port and walk through the smoke-test scenarios.
- Tear down the drill DB.
- Log the time taken in docs/RESTORE_DRILL_LOG.md (or wherever your team tracks operational drills).
If any step fails during a drill, the production restore procedure is unreliable. Treat it as a P1 and fix it before the next real failure.
Rollback (if restore is wrong)
If you complete the restore and discover it's the wrong data:
- Scale API back to 0
- Find the next earlier backup
- Drop and recreate the database on the restored instance
- Repeat from step 3
Never point production at a known-bad restored DB hoping to fix it later — the API will write new data on top of the corruption and the salvage gets exponentially harder.
Migration down-scripts
Every .up.sql in internal/storage/migrations/ has a matching
.down.sql. They're tested as part of CI and not exercised during
normal restores (the up-only sequence in step 6 is the path used).
They exist for:
- Drill scenarios where you want to "rewind" the schema
- Emergency rollback of a bad shipped migration
Down-script integrity: run the TestMigrationRoundtrip integration
test, which applies every migration up → down → up against a fresh
container.
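The order in which a roundtrip test applies files can be sketched as follows. RoundtripPlan is a hypothetical helper, and it assumes the downs run in reverse order so that later migrations are undone before the ones they depend on; the real TestMigrationRoundtrip may drive this through a migration library instead.

```go
package main

import "fmt"

// RoundtripPlan returns the file sequence for an up -> down -> up pass:
// every up in order, every down in reverse order, then every up again.
func RoundtripPlan(migrations []string) []string {
	plan := make([]string, 0, 3*len(migrations))
	for _, m := range migrations {
		plan = append(plan, m+".up.sql")
	}
	for i := len(migrations) - 1; i >= 0; i-- {
		plan = append(plan, migrations[i]+".down.sql")
	}
	for _, m := range migrations {
		plan = append(plan, m+".up.sql")
	}
	return plan
}

func main() {
	for _, step := range RoundtripPlan([]string{"0001_init", "0002_rsvps", "0003_auth"}) {
		fmt.Println(step)
	}
}
```

The second up pass is what catches an asymmetric down script: if a down leaves an object behind, the matching up fails when it tries to recreate it.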
Application config that supports restored DBs
GG_DATABASE_URL is the single source of truth — no hardcoded
hostnames anywhere in the codebase. Verified by:
grep -rE 'postgres://|host=.*5432' --include='*.go' . | grep -v _test.go | grep -v config.go
# Expected: (empty)
If anything surfaces from that grep, file a bug — it'll bite the next restore.
Escalation
| Step fails | Who to call |
|---|---|
| Steps 1–4 | On-call infra lead |
| Step 5 (restore-verify) | On-call backend lead + DBA |
| Steps 7–8 (app won't start / smoke fails) | On-call backend lead |
| Drill failure | File P1 ticket, link the drill log |
Change log
| Date | Author | Change |
|---|---|---|
| 2026-05-16 | kay | initial version (Block G) |