Kwaku Danso 4bf0dab853 docs: add 4-tier production roadmap and detailed Tier 1 plan

- CLAUDE.md: 4-tier feature roadmap appended after the build-order
  section (launch blockers → moat features). Future sessions
  reference this to know which tier a new feature belongs to.

- docs/TIER1_PLAN.md: detailed sequencing for the 8 blocks of
  Tier 1 work (auth, authz, rate limiting, notifications, CSV
  import, billing, backups, privacy) with schema changes,
  endpoints, tests, and effort estimates per block.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-11 21:09:25 +01:00

Tier 1 Production Plan

This document sequences the work to take GuestGuard from feature-complete demo to a product that can be sold to event hosts.

Read CLAUDE.md for project conventions and the full 4-tier roadmap. This document covers only what Tier 1 contains and the order in which to build it.

TL;DR

Eight work blocks (A–H), grouped into three waves that respect dependencies. Estimated effort: ~8–10 weeks for one engineer, ~5–6 weeks for two.

Wave 1 (foundation, must finish before anything else):
  A. Authentication ──┐
                      ├── B. Authorisation
                      └── C. Rate limiting (parallel)

Wave 2 (depends on auth being real):
  D. Notifications ───┐
                      ├── E. CSV import (parallel)
                      └── F. Billing

Wave 3 (ops + legal, can run alongside Wave 2):
  G. Backups & DR
  H. Privacy compliance

Block A — Authentication

Why first: every other Tier 1 item depends on knowing who's calling.

Goal

Replace the useHost() localStorage bootstrap with real auth: email + password, verified emails, password reset, JWT-based sessions with refresh tokens. The existing users table is reused.

Schema changes

Migration 0003_auth.up.sql:

ALTER TABLE users
  ADD COLUMN password_hash    TEXT,             -- bcrypt; nullable for OAuth-only users later
  ADD COLUMN email_verified   BOOLEAN NOT NULL DEFAULT FALSE,
  ADD COLUMN email_verified_at TIMESTAMPTZ;

CREATE TABLE email_verification_tokens (
  token_hash   TEXT PRIMARY KEY,
  user_id      UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  expires_at   TIMESTAMPTZ NOT NULL,
  consumed_at  TIMESTAMPTZ
);

CREATE TABLE password_reset_tokens (
  token_hash   TEXT PRIMARY KEY,
  user_id      UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  expires_at   TIMESTAMPTZ NOT NULL,
  consumed_at  TIMESTAMPTZ
);

CREATE TABLE refresh_tokens (
  token_hash   TEXT PRIMARY KEY,
  user_id      UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  expires_at   TIMESTAMPTZ NOT NULL,
  revoked_at   TIMESTAMPTZ,
  user_agent   TEXT,
  ip_address   INET,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_refresh_tokens_user ON refresh_tokens(user_id) WHERE revoked_at IS NULL;

Backend (`internal/auth`, `internal/api`)

New auth.PasswordHasher (bcrypt via golang.org/x/crypto/bcrypt, cost 12)
New auth.JWTSigner issuing access tokens (15min TTL) signed with GG_JWT_SECRET
New repos for verification, reset, and refresh tokens (token hashes stored, never raw)
New handlers:
- POST /auth/signup — creates unverified user, emits verification email
- POST /auth/login — verifies password, requires verified email, returns access + refresh
- POST /auth/refresh — rotates refresh token (single-use), returns new pair
- POST /auth/logout — revokes the refresh token
- POST /auth/verify-email — consumes verification token, sets email_verified
- POST /auth/forgot-password — emits reset email (no-op if email unknown — don't leak existence)
- POST /auth/reset-password — consumes reset token, updates password_hash, revokes all refresh tokens
New middleware requireAuth that pulls Authorization: Bearer …, validates, attaches userID to request context
Delete POST /users (the demo bootstrap)
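The "token hashes stored, never raw" rule above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names (NewToken, HashToken, Matches), the 32-byte token size, and the choice of SHA-256 with hex encoding are all assumptions made for the example.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// NewToken returns a random URL-safe token (for the email link or cookie)
// and the digest that would be written to a token_hash column.
// Hypothetical helper: name and sizes are assumptions.
func NewToken() (raw, hash string, err error) {
	buf := make([]byte, 32) // 256 bits of entropy
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	raw = base64.RawURLEncoding.EncodeToString(buf)
	return raw, HashToken(raw), nil
}

// HashToken derives the stored lookup key from a raw token.
// A fast hash is acceptable here because the input is high-entropy random
// data, unlike a password (which goes through bcrypt).
func HashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// Matches compares a presented token against a stored hash in constant time.
func Matches(raw, storedHash string) bool {
	return subtle.ConstantTimeCompare([]byte(HashToken(raw)), []byte(storedHash)) == 1
}

func main() {
	raw, hash, err := NewToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("send to user:", raw)
	fmt.Println("store in DB: ", hash)
	fmt.Println("matches:     ", Matches(raw, hash))
}
```

Only the hash ever reaches the database, so a leaked table dump does not yield usable verification, reset, or refresh tokens.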

Frontend

Delete useHost() composable; replace with useAuth() (access token in memory, refresh token in httpOnly cookie set by server)
New pages: /login, /signup, /verify-email, /forgot-password, /reset-password/:token
useApi() composable adds Authorization header; on 401, calls /auth/refresh; on refresh failure, redirects to /login
Dashboard route guard: redirect to /login if no session
Sign-out button calls /auth/logout, clears state, redirects to /

Notifications dependency

Verification + reset emails need real email delivery. Until Block D lands, use a stub EmailSender that prints the link to the API server logs so developers and the test environment can complete the flow without a Twilio/SES account. Document this in the block's README.

Tests

Unit: password hashing round-trip, JWT signing + parsing with expiry, token-hash storage
Integration: signup → verify-email → login → refresh → use-protected-endpoint → logout
Integration: forgot-password → reset-password → old refresh tokens revoked
Security: rate-limit signup (deferred to Block C, document the dependency)

Definition of done

Migration 0003_auth.up.sql applied
All /auth/* endpoints return appropriate status codes (verified against httpstatus.dev conventions)
Refresh-token rotation enforced (reusing a refresh token revokes the family — token-replay defence)
Email verification mandatory before first login
Frontend has working signup → verify → login → dashboard flow end-to-end
useHost() and POST /users removed from the codebase
No localhost-only assumptions in code paths
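The refresh-token rotation rule in the list above (single-use tokens, replay revokes the family) might look like this in outline. It is a sketch under stated assumptions: the Store type is an in-memory stand-in for the refresh_tokens table, the 30-day lifetime is invented for the demo, and because the schema has no family column, "the family" is approximated as every live token belonging to the same user.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical demo values; the plan does not fix a refresh-token lifetime.
var demoStart = time.Date(2026, 5, 11, 21, 0, 0, 0, time.UTC)

const demoTTL = 30 * 24 * time.Hour

var (
	ErrUnknownToken = errors.New("unknown or expired refresh token")
	ErrTokenReuse   = errors.New("refresh token reuse detected")
)

type refreshToken struct {
	userID    string
	expiresAt time.Time
	revokedAt *time.Time
}

// Store is an in-memory stand-in for the refresh_tokens table, keyed by
// token hash. It is not safe for concurrent use; a real version would run
// Rotate inside a database transaction.
type Store struct{ byHash map[string]*refreshToken }

func NewStore() *Store { return &Store{byHash: map[string]*refreshToken{}} }

// Issue records a new live token for userID.
func (s *Store) Issue(hash, userID string, now time.Time) {
	s.byHash[hash] = &refreshToken{userID: userID, expiresAt: now.Add(demoTTL)}
}

// Active reports whether the token exists, is unrevoked, and is unexpired.
func (s *Store) Active(hash string, now time.Time) bool {
	t, ok := s.byHash[hash]
	return ok && t.revokedAt == nil && now.Before(t.expiresAt)
}

// Rotate consumes oldHash and issues newHash. Presenting an already-revoked
// token is treated as replay: every live token of that user is revoked.
func (s *Store) Rotate(oldHash, newHash string, now time.Time) error {
	t, ok := s.byHash[oldHash]
	if !ok || !now.Before(t.expiresAt) {
		return ErrUnknownToken
	}
	if t.revokedAt != nil {
		for _, other := range s.byHash {
			if other.userID == t.userID && other.revokedAt == nil {
				other.revokedAt = &now
			}
		}
		return ErrTokenReuse
	}
	t.revokedAt = &now
	s.Issue(newHash, t.userID, now)
	return nil
}

func main() {
	s := NewStore()
	s.Issue("h1", "user-1", demoStart)
	fmt.Println(s.Rotate("h1", "h2", demoStart)) // first use succeeds
	fmt.Println(s.Rotate("h1", "h3", demoStart)) // replay of h1 is rejected
	fmt.Println(s.Active("h2", demoStart))       // h2 was revoked by the replay
}
```

If per-session families are wanted rather than per-user revocation, the table would need an extra column linking each token to its predecessor; that is a design decision this sketch leaves open.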

Effort: ~2 weeks for one engineer.

Block B — Authorisation

Why now: same PR cluster as Block A. Adding new endpoints without authz bakes in security debt.

Goal

Every host-facing endpoint enforces "this caller can only touch their own data". Audit the current API surface and add authz checks to each endpoint.

Schema changes

None — events.host_id already exists. We just need to start trusting the session-derived userID instead of the query parameter.

Backend

Apply requireAuth middleware to every route except: /health, /auth/*, the guest-facing /access/{token}, /rsvp/{token}, and the WS endpoint (note: WS auth needs its own design — see open questions)
For each event-scoped endpoint, derive hostID from session and reject if the event's host_id doesn't match:
- GET /events → list only events where host_id = session.userID
- GET /events/{id} → 404 (not 403, to avoid leaking existence) if owner mismatch
- All PATCH/DELETE /events/{id} → same
- POST /events/{id}/guests, GET /events/{id}/guests, POST /events/{id}/guests/{guest_id}/tokens, GET /events/{id}/activity → same
Remove the ?host_id=... query parameter from GET /events — derive from session
Update the integration test to authenticate first
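The ownership rule above (404 rather than 403 on owner mismatch) can be reduced to one lookup helper that every event-scoped handler calls. A minimal sketch: Event, EventRepo, OwnedEvent, and StatusFor are hypothetical names, and the in-memory repo stands in for the real database layer.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Event carries only the fields the check needs (hypothetical shape).
type Event struct{ ID, HostID, Name string }

// EventRepo is a hypothetical lookup interface over the events table.
type EventRepo interface {
	Get(id string) (Event, bool)
}

type memRepo map[string]Event

func (m memRepo) Get(id string) (Event, bool) { e, ok := m[id]; return e, ok }

var ErrNotFound = errors.New("event not found")

// OwnedEvent returns the event only if it belongs to sessionUserID.
// A missing event and someone else's event produce the same error, so the
// response cannot be used to probe which event IDs exist.
func OwnedEvent(repo EventRepo, sessionUserID, eventID string) (Event, error) {
	ev, ok := repo.Get(eventID)
	if !ok || ev.HostID != sessionUserID {
		return Event{}, ErrNotFound
	}
	return ev, nil
}

// StatusFor maps the helper's error to an HTTP status code.
func StatusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	repo := memRepo{"evt-1": {ID: "evt-1", HostID: "host-a", Name: "Wedding"}}
	for _, caller := range []string{"host-a", "host-b"} {
		_, err := OwnedEvent(repo, caller, "evt-1")
		fmt.Println(caller, StatusFor(err))
	}
}
```

In a real query the owner check is better pushed into the WHERE clause (id and host_id together) so the row is never loaded for the wrong caller; the helper above only shows the response policy.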

Frontend

All host-facing API calls include the access token (already handled if useApi() was updated in Block A)
Update GET /events calls to drop the host_id query param

WebSocket auth (open question)

The WS endpoint /ws/events/{id} is currently anonymous. Options:

1. Pass the JWT as a query param (?token=...), since browsers can't send Authorization headers on the WS handshake
2. Cookie-based session (httpOnly cookie set by /auth/login)
3. Short-lived WS ticket: client calls POST /auth/ws-ticket (auth required), receives a single-use 60s ticket, and passes it as ?ticket=... to the WS handshake

Recommend option 3 — most secure, no token in URL beyond a single request. Document the choice.
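A rough sketch of the short-lived ticket approach, assuming an in-memory store. TicketStore and its methods are hypothetical names; a multi-instance deployment would more likely keep tickets in Redis with a TTL, and binding the ticket to one event ID is an extra assumption of this example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

const ticketTTL = 60 * time.Second // single-use, 60s lifetime as described above

// demoStart is a fixed clock value for the demo.
var demoStart = time.Date(2026, 5, 11, 21, 0, 0, 0, time.UTC)

type ticket struct {
	userID, eventID string
	expiresAt       time.Time
}

// TicketStore is a hypothetical in-memory store for WS tickets.
type TicketStore struct {
	mu      sync.Mutex
	tickets map[string]ticket
}

func NewTicketStore() *TicketStore { return &TicketStore{tickets: map[string]ticket{}} }

// Issue would back POST /auth/ws-ticket: the caller is already authenticated,
// so the ticket records who they are and which event they asked for.
func (s *TicketStore) Issue(userID, eventID string, now time.Time) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	id := hex.EncodeToString(buf)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tickets[id] = ticket{userID: userID, eventID: eventID, expiresAt: now.Add(ticketTTL)}
	return id, nil
}

// Redeem would run during the WS handshake. The ticket is deleted on the
// first attempt, whether or not that attempt succeeds.
func (s *TicketStore) Redeem(id, eventID string, now time.Time) (userID string, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	t, found := s.tickets[id]
	if !found {
		return "", false
	}
	delete(s.tickets, id)
	if t.eventID != eventID || !now.Before(t.expiresAt) {
		return "", false
	}
	return t.userID, true
}

func main() {
	s := NewTicketStore()
	id, err := s.Issue("user-1", "evt-1", demoStart)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Redeem(id, "evt-1", demoStart.Add(10*time.Second))) // first use
	fmt.Println(s.Redeem(id, "evt-1", demoStart.Add(11*time.Second))) // second use
}
```

Because the ticket is useless after one handshake, its appearance in a URL or access log is far less sensitive than a JWT in the same position.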

Tests

Unit: authz middleware accepts valid tokens and rejects missing, malformed, or expired ones with 401 (redirecting to /login is the frontend's job)
Integration: host A cannot list, read, modify host B's events (verify 404)
Integration: WS ticket flow works end-to-end

Definition of done

Every host route requires a valid session
Cross-tenant data access returns 404, not 403 (don't leak existence)
WS authentication implemented (option 3 recommended)
?host_id=... query parameter removed everywhere
Pen-test pass: try to read/modify another user's event with their event_id but your own token

Effort: ~3–4 days, assuming Block A laid the middleware groundwork.

Block C — Rate limiting + abuse controls

Why now: small block, no dependency on auth other than knowing the userID for per-user limits. Redis is already provisioned but unused — this finally puts it to work.

Goal

Stop trivial abuse: someone scripting POST /auth/signup 10k times, brute-forcing the RSVP page, spamming token issuance, etc.

Schema changes

None — Redis only.

Backend

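New internal/ratelimit package with a window-based limiter backed by Redis. Note that INCR + EXPIRE alone gives a fixed-window counter; a true sliding window needs a sorted set of timestamps or a weighted two-window approximation. Either way, run the check-and-increment atomically (Lua script or MULTI/EXEC).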
Apply per-route, per-key limits via middleware:

| Endpoint | Key | Limit |
| --- | --- | --- |
| `POST /auth/signup` | IP | 5 / hour |
| `POST /auth/login` | IP + email | 10 / 5 min (lock on consecutive failures) |
| `POST /auth/forgot-password` | IP + email | 3 / hour |
| `POST /rsvp/{token}` | token | 10 / hour |
| `GET /access/{token}` | token | 60 / hour |
| `POST /events` | userID | 20 / day |
| `POST /events/{id}/guests` | userID | 1000 / day |
| `POST /events/{id}/guests/{guest_id}/tokens` | userID | 500 / day |

Return 429 Too Many Requests with Retry-After header on limit
CAPTCHA (hCaptcha or Cloudflare Turnstile) on POST /auth/signup and POST /auth/forgot-password
Lockout: after 5 consecutive failed logins, require password reset to unlock
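The limiter behaviour can be prototyped without Redis. The sketch below is an in-memory fixed-window counter, which mirrors what INCR + EXPIRE would do per key; it is not a sliding window and not the Redis-backed package itself. Limiter, Allow, and SignupLimiter are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoStart is a fixed clock value for the demo.
var demoStart = time.Date(2026, 5, 11, 21, 0, 0, 0, time.UTC)

const signupWindow = time.Hour

type window struct {
	count   int
	resetAt time.Time
}

// Limiter is an in-memory stand-in for a Redis-backed fixed-window limiter.
type Limiter struct {
	mu      sync.Mutex
	limit   int
	period  time.Duration
	windows map[string]*window
}

func NewLimiter(limit int, period time.Duration) *Limiter {
	return &Limiter{limit: limit, period: period, windows: map[string]*window{}}
}

// SignupLimiter uses the 5 / hour per-IP figure from the table above.
func SignupLimiter() *Limiter { return NewLimiter(5, signupWindow) }

// Allow counts one request for key. When the limit is exceeded it returns
// false and the duration a handler would put in the Retry-After header.
func (l *Limiter) Allow(key string, now time.Time) (ok bool, retryAfter time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	w := l.windows[key]
	if w == nil || !now.Before(w.resetAt) {
		// First request, or the previous window expired: start a new one.
		w = &window{resetAt: now.Add(l.period)}
		l.windows[key] = w
	}
	w.count++
	if w.count > l.limit {
		return false, w.resetAt.Sub(now)
	}
	return true, 0
}

func main() {
	l := SignupLimiter()
	for i := 1; i <= 6; i++ {
		ok, retry := l.Allow("ip:203.0.113.7", demoStart.Add(time.Duration(i)*time.Minute))
		fmt.Println(i, ok, retry)
	}
}
```

A fixed window allows up to twice the limit across a window boundary; whether that matters for signup abuse is a judgement call that determines if the extra cost of a sliding window is worth paying.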

Frontend

Render CAPTCHA widget on signup + forgot-password forms
On 429, show "You're going too fast — please try again in a minute" instead of generic error

Tests

Unit: limiter increments correctly, expires at window boundary
Integration: 6th signup from the same IP within an hour returns 429
Integration: CAPTCHA token validated server-side before processing signup

Definition of done

Limiter check-and-increment is atomic (Redis MULTI/EXEC or Lua script), confirmed by a concurrent test
All endpoints in the table above are limited
CAPTCHA wired on signup + forgot-password
Lockout flow tested end-to-end
Limiter exposes Prometheus metrics (already implicit — ratelimit_block_total per endpoint)

Effort: ~3–4 days.

Block D — Real notifications

Why now: Block A's email verification + password reset need real delivery. Don't ship auth to production with a logger stub.

Goal

Replace LogSender in internal/notification with real Twilio + SES adapters. Branded HTML email templates. Bounce + complaint handling. Unsubscribe.

Schema changes

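-- CITEXT (used by unsubscribes.email below) requires the citext extension.
CREATE EXTENSION IF NOT EXISTS citext;

ALTER TABLE notifications
  ADD COLUMN provider_message_id TEXT,
  ADD COLUMN bounce_type TEXT,             -- 'permanent' | 'transient' | NULL
  ADD COLUMN complained BOOLEAN NOT NULL DEFAULT FALSE,
  ADD COLUMN IF NOT EXISTS delivered_at TIMESTAMPTZ;  -- may already exist; confirm, IF NOT EXISTS keeps the migration safe either way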

CREATE TABLE unsubscribes (
  email        CITEXT PRIMARY KEY,
  reason       TEXT,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

Backend (`internal/notification`, `cmd/notifier`)

TwilioSender (real github.com/twilio/twilio-go client)
- Retry with exponential backoff: 1s, 5s, 30s, 5m, 30m
- Permanent failure codes mapped to bounce_type = 'permanent'
- Cost tracking: log message segments per send
SESSender (real github.com/aws/aws-sdk-go-v2/service/sesv2)
- HTML + plaintext multipart
- List-Unsubscribe header on every email
- Configuration set with SNS topic for bounces + complaints
HTML templates (internal/notification/templates/*.tmpl):
- invitation.html — "You're invited to {event_name}"
- confirmation.html — RSVP recorded
- verification.html — verify your email
- reset.html — reset your password
- reminder.html — 1-day-before reminder
Webhook endpoints (in internal/api, public, signed by provider):
- POST /webhooks/twilio/status — Twilio message status callbacks
- POST /webhooks/ses/notifications — SNS-delivered bounce/complaint notifications
- Both verify signatures before trusting the payload
Check unsubscribes table before sending any email; refuse silently if present
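The TwilioSender retry schedule above (1s, 5s, 30s, 5m, 30m, with permanent failures not retried) could be driven by a loop like the one below. This is an assumption-laden sketch: SendWithRetry, ErrPermanent, and the recorder type are hypothetical, the sleep function is injected so tests do not wait, and the schedule is read as one initial attempt followed by five retries.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoff is the delay schedule listed in the plan.
var backoff = []time.Duration{
	time.Second, 5 * time.Second, 30 * time.Second, 5 * time.Minute, 30 * time.Minute,
}

var (
	// ErrPermanent marks provider errors that must not be retried.
	ErrPermanent = errors.New("permanent provider failure")
	errTransient = errors.New("transient provider failure")
)

// recorder captures requested sleeps instead of actually sleeping.
type recorder struct{ waits []time.Duration }

func (r *recorder) Sleep(d time.Duration) { r.waits = append(r.waits, d) }

// SendWithRetry calls send until it succeeds, fails permanently, or the
// schedule is exhausted. It returns the number of attempts made and the
// last error (nil on success).
func SendWithRetry(send func() error, sleep func(time.Duration)) (attempts int, err error) {
	for attempt := 0; ; attempt++ {
		err = send()
		if err == nil || errors.Is(err, ErrPermanent) {
			return attempt + 1, err
		}
		if attempt == len(backoff) {
			return attempt + 1, err // no delays left: give up
		}
		sleep(backoff[attempt])
	}
}

func main() {
	rec := &recorder{}
	calls := 0
	attempts, err := SendWithRetry(func() error {
		calls++
		if calls < 3 {
			return errTransient // fail twice, then succeed
		}
		return nil
	}, rec.Sleep)
	fmt.Println(attempts, err, rec.waits)
}
```

In production the sleep argument would be time.Sleep, or the retries would be rescheduled through the notifier's queue so that a 30-minute wait does not pin a goroutine.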

Frontend

Unsubscribe page at /unsubscribe/:token — token signed so we know who's unsubscribing
Host setting: from-name + reply-to email per event (Tier 2 polish, defer if rushed)

Configuration

Required env vars (add to internal/config):

GG_TWILIO_ACCOUNT_SID
GG_TWILIO_AUTH_TOKEN
GG_TWILIO_FROM_NUMBER

GG_SES_REGION
GG_SES_FROM_EMAIL          # must be a verified identity
GG_SES_CONFIGURATION_SET

GG_PUBLIC_BASE_URL          # for unsubscribe + invitation links in templates

Tests

Unit: template rendering produces expected HTML and text
Unit: retry logic backs off correctly, gives up after N attempts
Integration (with stubs): bounce webhook marks notification, blocks future sends to that email
Manual: actually send to a test inbox in a staging Twilio + SES account

Definition of done

Email verification email arrives in a real inbox (Gmail, Outlook)
SMS arrives on a real phone
DKIM + SPF + DMARC verified for sender domain (this is human-owned infra setup)
Bounces and complaints recorded in notifications + unsubscribes
Unsubscribe link in every email; clicking it adds the address to the suppression list
Templates render correctly in Gmail web, Outlook web, iOS Mail, Apple Mail (litmus.com or equivalent)

Effort: ~1.5–2 weeks (mostly template polish + deliverability setup).

Block E — CSV guest import

Why now: highest user-visible impact of any Tier 1 item, no dependency on other blocks except Block B's authz. Marketing already promises it.

Goal

A host can drag a .csv onto the dashboard and have hundreds of guests added in seconds. Validation surfaces problems before commit. Dedup is automatic.

Schema changes

None — uses existing guests table.

Backend

POST /events/{id}/guests/import — multipart/form-data, single CSV file
- Header detection: tolerant of name|Name|guest_name, email|Email, phone|Phone|telephone, plus_ones|+1|plusones
- Validation: name required, email format if present, phone E.164-ish if present, plus_ones non-negative integer
- Dedup: skip rows whose email matches an existing guest on the same event
- Returns: { added: int, skipped: int, errors: [{ row: int, reason: string }] }
- Atomic per-batch: either all valid rows commit or none (transaction)
- Limit: 5,000 rows per import
POST /events/{id}/guests/import/preview — same payload, but doesn't write; returns parsed rows for confirm UI
Sample CSV download: GET /events/{id}/guests/import/template — returns a .csv with example rows
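The header detection and row validation described above might be sketched like this. Everything named here is hypothetical (Guest, RowError, ParseGuests, the alias map, the phone regex), and the sketch covers only parsing and validation: dedup, the transaction, and UTF-16 input are left out.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"net/mail"
	"regexp"
	"strconv"
	"strings"
)

const maxRows = 5000 // row cap from the plan

type Guest struct {
	Name, Email, Phone string
	PlusOnes           int
}

type RowError struct {
	Row    int // 1-based data row, header excluded
	Reason string
}

// aliases maps lower-cased header variants to canonical column names.
var aliases = map[string]string{
	"name": "name", "guest_name": "name", "email": "email",
	"phone": "phone", "telephone": "phone",
	"plus_ones": "plus_ones", "+1": "plus_ones", "plusones": "plus_ones",
}

// phoneRe is a loose "E.164-ish" check (assumption): optional +, 7 to 15 digits.
var phoneRe = regexp.MustCompile(`^\+?[1-9][0-9]{6,14}$`)

// validate returns the completed guest, or a non-empty reason on failure.
func validate(g Guest, plusOnes string) (Guest, string) {
	if g.Name == "" {
		return g, "name is required"
	}
	if g.Email != "" {
		if a, err := mail.ParseAddress(g.Email); err != nil || a.Address != g.Email {
			return g, "invalid email"
		}
	}
	if g.Phone != "" && !phoneRe.MatchString(g.Phone) {
		return g, "invalid phone"
	}
	if plusOnes != "" {
		n, err := strconv.Atoi(plusOnes)
		if err != nil || n < 0 {
			return g, "plus_ones must be a non-negative integer"
		}
		g.PlusOnes = n
	}
	return g, ""
}

// ParseGuests streams rows from r and calls emit for each valid guest, so
// memory use does not grow with file size. Unknown columns are ignored.
func ParseGuests(r io.Reader, emit func(Guest)) ([]RowError, error) {
	cr := csv.NewReader(r)
	cr.FieldsPerRecord = -1 // tolerate ragged rows
	header, err := cr.Read()
	if err != nil {
		return nil, fmt.Errorf("read header: %w", err)
	}
	col := map[string]int{}
	for i, h := range header {
		h = strings.ToLower(strings.TrimSpace(strings.TrimPrefix(h, "\ufeff"))) // drop UTF-8 BOM
		if canon, ok := aliases[h]; ok {
			col[canon] = i
		}
	}
	if _, ok := col["name"]; !ok {
		return nil, errors.New("no name column found")
	}
	var errs []RowError
	for row := 1; ; row++ {
		rec, err := cr.Read()
		if err == io.EOF {
			return errs, nil
		} else if err != nil {
			return errs, err
		} else if row > maxRows {
			return errs, fmt.Errorf("import exceeds %d rows", maxRows)
		}
		field := func(key string) string {
			if i, ok := col[key]; ok && i < len(rec) {
				return strings.TrimSpace(rec[i])
			}
			return ""
		}
		g, reason := validate(Guest{Name: field("name"), Email: field("email"), Phone: field("phone")}, field("plus_ones"))
		if reason != "" {
			errs = append(errs, RowError{Row: row, Reason: reason})
			continue
		}
		emit(g)
	}
}

// ParseGuestsString is a convenience wrapper for demos and tests.
func ParseGuestsString(data string, emit func(Guest)) ([]RowError, error) {
	return ParseGuests(strings.NewReader(data), emit)
}

func main() {
	data := "\ufeffName,Email,+1\nAda,ada@example.com,2\n,x@example.com,0\n"
	fmt.Println(ParseGuestsString(data, func(g Guest) { fmt.Printf("ok: %+v\n", g) }))
}
```

The same function can serve both endpoints: the preview handler passes an emit callback that collects rows for display, while the import handler passes one that inserts inside the transaction.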

Frontend

New section on event detail page: "Import guests from a spreadsheet"
Drag-drop zone (use vue-filepond or native HTML5 drag-drop)
After upload: hit /preview, show a sortable table of rows with row-level errors highlighted
"Looks good — import" button calls /import
Show success summary: "Imported 247 guests. Skipped 3 duplicates. 2 rows had errors."
Help text linking to the template CSV

Tests

Unit: header detection accepts the listed variants and rejects unknown columns gracefully
Unit: validation rejects bad emails, accepts blank emails (phone-only guests valid)
Integration: dedup leaves existing guests untouched
Integration: rolling back on mid-batch error doesn't leave partial state

Definition of done

Sample CSV downloadable from the import UI
Preview always shown before commit
Errors are row-level, not "the whole file is invalid"
Encoding: handles UTF-8 with BOM (Excel exports), UTF-16 (Mac Numbers exports)
File-size cap: 1MB / 5,000 rows enforced server-side
No memory blow-up: parse rows as a stream, not into a []Row of arbitrary size

Effort: ~3–5 days.

Block F — Billing

Why last in Wave 2: depends on real auth (Block A), real notifications (Block D, for receipts), and a stable data model. Don't build until those are solid.

Goal

Stripe-based subscriptions. Free tier with hard limits. Paid tiers unlock higher limits. Failed-payment dunning. Self-serve upgrade + downgrade.

Pricing model (decision required — see open questions)

Recommended starter pricing (placeholder, validate with target market):

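| Tier | Price | Events/mo | Guests/event | SMS/mo | Branding |
| --- | --- | --- | --- | --- | --- |
| Free | $0 | 1 | 50 | 0 (email only) | No |
| Personal | $19/event | 1 per purchase | 500 | 100 | Logo |
| Pro | $49/mo | 10 | 1,000 | 1,000 | Full |
| Business | $199/mo | Unlimited | 5,000 | 5,000 | + custom domain |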

Schema changes

CREATE TABLE subscriptions (
  id                    UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id               UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  stripe_customer_id    TEXT NOT NULL,
  stripe_subscription_id TEXT,
  tier                  TEXT NOT NULL,           -- 'free' | 'personal' | 'pro' | 'business'
  status                TEXT NOT NULL,           -- 'active' | 'past_due' | 'canceled' | 'incomplete'
  current_period_end    TIMESTAMPTZ,
  cancel_at_period_end  BOOLEAN NOT NULL DEFAULT FALSE,
  created_at            TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at            TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX ON subscriptions(user_id) WHERE status = 'active';

CREATE TABLE usage_counters (
  user_id      UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  period_start DATE NOT NULL,
  events_count INT NOT NULL DEFAULT 0,
  sms_count    INT NOT NULL DEFAULT 0,
  PRIMARY KEY (user_id, period_start)
);

Backend

internal/billing package wrapping the Stripe SDK
POST /billing/checkout-session — returns a Stripe Checkout URL for the requested tier
POST /billing/portal — returns a Stripe Customer Portal URL
POST /webhooks/stripe — signature-verified, handles:
- customer.subscription.created / .updated / .deleted → upsert into subscriptions
- invoice.payment_failed → trigger dunning email (Block D)
- invoice.payment_succeeded → clear past-due state
Enforcement: middleware checks usage against tier limits before allowing POST /events, POST /events/{id}/guests, SMS triggers. Returns 402 Payment Required with the upgrade URL on limit.
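The enforcement check could be reduced to a pure function that the middleware calls before POST /events. The sketch below uses the placeholder limits from the pricing table; Limits, tierLimits, and CheckEventQuota are hypothetical names, and the Personal tier is omitted because its "1 per purchase" allowance does not map onto a monthly counter.

```go
package main

import (
	"errors"
	"fmt"
)

// Unlimited marks a limit that is never enforced.
const Unlimited = -1

// Limits mirrors the columns of the placeholder pricing table.
type Limits struct{ EventsPerMonth, GuestsPerEvent, SMSPerMonth int }

// tierLimits uses the placeholder numbers from the plan; Personal is left
// out because it is priced per event rather than per month.
var tierLimits = map[string]Limits{
	"free":     {EventsPerMonth: 1, GuestsPerEvent: 50, SMSPerMonth: 0},
	"pro":      {EventsPerMonth: 10, GuestsPerEvent: 1000, SMSPerMonth: 1000},
	"business": {EventsPerMonth: Unlimited, GuestsPerEvent: 5000, SMSPerMonth: 5000},
}

var (
	ErrUnknownTier  = errors.New("unknown tier")
	ErrLimitReached = errors.New("plan limit reached")
)

// CheckEventQuota reports whether a user on tier, who has already created
// eventsThisPeriod events in the current period, may create another one.
// A handler would translate ErrLimitReached into 402 Payment Required.
func CheckEventQuota(tier string, eventsThisPeriod int) error {
	l, ok := tierLimits[tier]
	if !ok {
		return ErrUnknownTier
	}
	if l.EventsPerMonth != Unlimited && eventsThisPeriod >= l.EventsPerMonth {
		return ErrLimitReached
	}
	return nil
}

func main() {
	fmt.Println(CheckEventQuota("free", 0))        // first event is allowed
	fmt.Println(CheckEventQuota("free", 1))        // second event hits the limit
	fmt.Println(CheckEventQuota("business", 1000)) // unlimited tier
}
```

The count passed in would come from usage_counters.events_count for the current period_start; incrementing that counter and inserting the event should happen in the same transaction so concurrent requests cannot both slip under the limit.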

Frontend

/billing page: current plan, usage bars, upgrade/downgrade buttons
On 402, show modal: "You've hit your plan limit. Upgrade?"
Stripe Checkout opens in a new tab; on return, poll subscription state until updated by webhook

Tests

Integration: free user can create 1 event, second fails with 402
Integration: webhook signature verification rejects forged payloads
Integration: cancellation flow keeps access until period end

Definition of done

Stripe in test mode end-to-end working
Webhook signatures verified
Usage counters reset monthly (cron or compute on-demand)
Receipts emailed via Stripe (default behaviour, just confirm enabled)
Refund policy documented (referenced from billing page)

Effort: ~2 weeks.

Block G — Backups & disaster recovery

Mostly infra-owned, but the application side has documentation work.

Claude's scope

All migrations have a *.down.sql that's been tested locally
New docs/RUNBOOK_RESTORE.md documenting the restore procedure step-by-step
Confirm Postgres connection string env var supports the recovery instance (no hardcoded primary-only hostnames)
Optional: a cmd/restore-verify tool that runs after a restore to assert schema invariants (guest counts ≈ rsvp counts, no orphaned tokens, etc.)

Human / infra scope

pg_basebackup + WAL archiving to S3
Daily logical dump as a secondary safety net
Cross-region replication of the S3 bucket
Monthly restore drill scheduled
Documented RTO (e.g. 1 hour) and RPO (e.g. 5 minutes)

Definition of done

Every existing migration has a tested down migration
docs/RUNBOOK_RESTORE.md exists and a fresh engineer could follow it
First restore drill completed successfully

Effort: ~2 days for the application-side work.

Block H — Privacy compliance

Legal documents are human-owned. Application-level support is Claude scope.

Claude's scope

GET /me/data-export: produces a JSON document with every record (user, events, guests, tokens, RSVPs, access_logs, notifications) belonging to the authenticated user. The export is long-running, so handle it asynchronously: the request enqueues a job, and the user is emailed a download link when it completes.
DELETE /me — cascade-deletes the user and everything tied to them. Soft-delete first (set deleted_at), hard-delete on a cron after 30 days to honour any in-flight legal holds.
DELETE /events/{id}/guests/{guest_id} (host-triggered) — already exists in spirit; add a "forget this guest" action that removes RSVP/access-log rows but keeps the aggregate counter for the event.
Data retention: automated nightly job to soft-delete events whose event_date is older than 18 months (configurable per host once Tier 2).
Add privacy_policy_accepted_at and terms_accepted_at columns to users; block first login until both are accepted.

Human / legal scope

Privacy policy + ToS drafted by a lawyer
DPAs signed with Twilio, SES, Stripe, MaxMind, and any other subprocessor
Public privacy page at /privacy, ToS at /terms
Cookie banner (only required if analytics are added; currently we have none)
GDPR Article 30 record of processing activities

Definition of done

GET /me/data-export produces a complete, parseable JSON dump
DELETE /me cascades correctly with no orphan rows (verified by FK constraints)
Privacy + ToS pages live and linked from the footer + signup form
Acceptance enforced on first login after the launch date
Retention cron job tested

Effort: ~3–4 days for the application work; legal work runs in parallel.

Cross-cutting concerns

These touch most blocks above; bake them in as you go, not as a separate pass.

Logging + auditing

Every state-changing endpoint logs: userID, action, target_id, result, request_id. Use slog with a correlation ID middleware. Critical for post-incident forensics.

Observability lite (Tier 3 scope, but minimum viable for launch)

Prometheus /metrics endpoint on the API exposing: request rate by endpoint, latency percentiles, 4xx/5xx counts, ratelimit_block_total
Sentry (or self-hosted GlitchTip) for unhandled errors, with release tagging

Feature flags

Lightweight feature_flags table or env-var driven (no LaunchDarkly yet). Useful for rolling out Block F's billing without exposing it to all users at once.

Open questions

Resolve before starting:

Final pricing tiers — the table in Block F is a placeholder. Confirm with the target market (interview 10 wedding planners, 10 corporate event managers).
Email provider — SES vs Postmark vs SendGrid. SES is cheapest but has the harshest deliverability ramp; Postmark is best for transactional but pricier.
2FA at launch or v1.1? — Recommend v1.1; one less moving piece on the launch path.
Custom domain for RSVP pages at launch or v1.1? — Recommend v1.1 (Tier 2). Adds DNS + cert complexity.
WebSocket auth mechanism — Recommend Block B option 3 (short-lived ticket).
EU data residency at launch? — If targeting EU customers, this becomes Tier 1 (separate EU deployment). Otherwise defer to Tier 4.

Sequencing summary table

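| Wave | Block | Depends on | Effort (1 eng) | Can parallelise with |
| --- | --- | --- | --- | --- |
| 1 | A. Auth | none | 2w | none |
| 1 | B. Authz | A | 4d | C |
| 1 | C. Rate limiting | A (for `userID`) | 4d | B |
| 2 | D. Notifications | A | 2w | E |
| 2 | E. CSV import | B | 4d | D, F |
| 2 | F. Billing | A, D | 2w | E |
| 3 | G. Backups | none (infra) | 2d (Claude) | any |
| 3 | H. Privacy | A | 3d | any |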

One engineer, sequential: ~9 weeks. Two engineers, parallel-where-possible: ~5.5 weeks.
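The one-engineer figure can be checked by summing the Effort column of the table. The sketch assumes a 5-day working week (the plan does not state one) and covers only the sequential total, not the two-engineer estimate.

```go
package main

import "fmt"

// daysPerWeek is an assumption; the plan does not define the working week.
const daysPerWeek = 5

// block pairs a work block with its single-engineer effort in working days,
// converted from the table (2w = 10 days under the assumption above).
type block struct {
	name string
	days int
}

var blocks = []block{
	{"A. Auth", 2 * daysPerWeek},
	{"B. Authz", 4},
	{"C. Rate limiting", 4},
	{"D. Notifications", 2 * daysPerWeek},
	{"E. CSV import", 4},
	{"F. Billing", 2 * daysPerWeek},
	{"G. Backups", 2},
	{"H. Privacy", 3},
}

// TotalDays sums the effort of every block, as if done back to back.
func TotalDays() int {
	total := 0
	for _, b := range blocks {
		total += b.days
	}
	return total
}

func main() {
	d := TotalDays()
	fmt.Printf("%d days = %.1f weeks\n", d, float64(d)/daysPerWeek) // prints "47 days = 9.4 weeks"
}
```

At 47 working days the sequential total lands at 9.4 weeks, which is in line with the ~9 weeks quoted and inside the 8 to 10 week range given in the TL;DR.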

What's not in Tier 1 (deliberate)

These are tempting but are Tier 2:

Editable RSVPs (guests can change response after submitting)
Multi-host collaborators
Event branding (logo, colours, custom domain)
Day-of QR check-in
Better fraud-engine thresholds (false-positive feedback loop)
Calendar integration
Auto-reminders (1-day before, etc.)
Mobile push notifications

Ship Tier 1 first. The launch story is "personal invitations + live tracking + quiet fraud detection + works reliably + you can pay us money". Everything else is the second release.
