commit f760fc3e21fc7399dbc992b25dd246c09435bab1
Author: Kwaku Danso <72142185+cloud-dev101@users.noreply.github.com>
Date:   Thu May 7 12:35:40 2026 +0100

    chore: initial project setup with CLAUDE.md and architecture doc

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..22def97
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,192 @@
+# CLAUDE.md — GuestGuard
+
+> This file configures Claude Code's behaviour for this project.
+
+## What this project is
+
+GuestGuard is a secure event RSVP platform with real-time fraud detection. Guests receive unique, cryptographically signed invitation links. When someone accesses a link, the system collects device fingerprints and IP geolocation, scores the access for fraud risk, and blocks unauthorised users.
+
+The architecture has three components, each with a genuine reason to be separate:
+
+1. **Core API (Go)** — All synchronous CRUD: events, guests, RSVPs, tokens, auth. Publishes events to NATS. Serves REST + WebSocket.
+2. **Fraud Engine (Python/FastAPI)** — Consumes access events from NATS, runs ML risk scoring, publishes fraud scores back. Exposes gRPC for synchronous scoring during RSVP submission.
+3. **Notification Worker (Go)** — Consumes events from NATS, sends SMS via Twilio, email via AWS SES, with retry and backoff.
+
+Message broker: NATS JetStream.
+Frontend: Nuxt 3 with Tailwind CSS.
+Database: PostgreSQL.
+Cache: Redis.
+
+## Architecture document
+
+Read `docs/ARCHITECTURE.md` for the full architecture, data flows, database schema, fraud scoring model, and API design. **Always reference this file before building any feature** — it contains the detailed specifications for every component.
+
+## Project context
+
+- **Owner:** alchemistkay (Kwaku Danso)
+- **Platform:** k4scloud homelab (k3s) + GitHub mirror
+- **Domain:** guestguard.k4scloud.com
+- **Registry:** harbor.k4scloud.com/guestguard
+- **GitOps:** Gitea Actions → Harbor → ArgoCD
+- **Go module:** github.com/alchemistkay/guestguard
+
+## Repository structure
+
+```
+guestguard/
+├── CLAUDE.md                   # This file
+├── docs/
+│   └── ARCHITECTURE.md         # Full architecture spec (read this first)
+├── cmd/
+│   ├── api/main.go             # Core API entrypoint
+│   └── notifier/main.go        # Notification worker entrypoint
+├── internal/
+│   ├── api/                    # HTTP handlers, middleware, WebSocket
+│   ├── auth/                   # JWT, OAuth
+│   ├── domain/                 # Business logic (events, guests, RSVPs, tokens)
+│   ├── fraud/                  # gRPC client for fraud engine
+│   ├── nats/                   # NATS publisher/subscriber
+│   ├── notification/           # Twilio, SES adapters
+│   └── storage/                # PostgreSQL, Redis repositories
+├── fraud-engine/
+│   ├── app/                    # FastAPI app
+│   ├── scoring/                # Risk scoring logic
+│   ├── consumers/              # NATS event consumers
+│   └── Dockerfile
+├── frontend/
+│   ├── pages/
+│   ├── components/
+│   ├── composables/
+│   └── Dockerfile
+├── docker-compose.yml
+├── Makefile
+└── README.md
+```
+
+## Conventions
+
+### Go services (Core API + Notification Worker)
+- Module path: `github.com/alchemistkay/guestguard`
+- Use `log/slog` for logging (JSON handler in production)
+- Use `net/http` stdlib with Go 1.22+ routing for the Core API
+- Health endpoint at `GET /health` on every service
+- Config via `os.Getenv` with defaults — no Viper
+- Graceful shutdown with signal handling (SIGINT, SIGTERM)
+- Table-driven tests
+- PostgreSQL via `github.com/jackc/pgx/v5`
+- Redis via `github.com/redis/go-redis/v9`
+- NATS via `github.com/nats-io/nats.go`
+- JWT via `github.com/golang-jwt/jwt/v5`
+- gRPC via `google.golang.org/grpc` (client to fraud engine)
+
+### Python service (Fraud Engine)
+- Python 3.11+ with FastAPI
+- Pydantic v2 for schemas, pydantic-settings for config
+- `ruff` for linting and formatting
+- `pytest` for testing
+- scikit-learn for ML model (start with heuristic scoring, upgrade later)
+- NATS via `nats-py`
+- gRPC via `grpcio` (server)
+
+### Frontend (Nuxt 3)
+- Nuxt 3 with Tailwind CSS
+- Dark theme (zinc-950 background)
+- Brand colour: green (#22c55e) — conveys safety/security
+- All API calls use relative URLs via `useApi()` composable
+- WebSocket URLs derived from `window.location`
+- No NUXT_ env vars for API URLs in production
+- SSR for landing page, client-side for dashboard
+
+### Docker
+- Multi-stage builds
+- Alpine base for Go, slim base for Python, Alpine for Nuxt (node)
+- Non-root user in final image (UID 1000)
+- Health check via k8s probes (not HEALTHCHECK in Dockerfile)
+
+### Git
+- Branch: main (default), feature/*, release/*, hotfix/*
+- Commit messages: conventional commits (feat:, fix:, chore:, docs:, ci:)
+- Squash merge feature branches
+- Always commit package-lock.json
+
+## What to build vs what NOT to build
+
+### Claude Code builds:
+- All Go source code (cmd/, internal/)
+- All Python source code (fraud-engine/)
+- All frontend code (frontend/)
+- Unit tests and integration tests
+- docker-compose.yml for local development
+- Dockerfiles (multi-stage, following conventions above)
+- Database migrations (golang-migrate)
+- Protobuf definitions for gRPC (fraud scoring)
+- NATS subject definitions and message schemas
+- Makefile
+
+### Claude Code does NOT build (human handles):
+- Kubernetes manifests (Deployments, Services, etc.)
+- Gitea Actions CI/CD pipelines
+- ArgoCD Application definitions
+- NetworkPolicies
+- Sealed Secrets
+- Terraform
+- Grafana dashboards
+- Prometheus alert rules
+- Helm charts or Kustomize overlays
+- Harbor project configuration
+
+## Quality gates before handoff
+
+Before marking application code as "ready for deployment":
+
+1. `docker-compose up` starts all services (API, fraud engine, notifier, frontend, PostgreSQL, Redis, NATS)
+2. Health endpoints return 200 on API (`/health`), fraud engine (`/health`)
+3. Unit tests pass: `go test ./...` and `cd fraud-engine && pytest -v`
+4. Frontend loads at http://localhost:3000 and can reach backend APIs
+5. Core flow works: create event → add guest → generate token → access RSVP link → submit RSVP
+6. Fraud flow works: access with different fingerprint → risk score > 60 → SMS verification triggered
+7. Notification flow works: RSVP confirmed → confirmation event published to NATS → worker logs delivery attempt
+8. No hardcoded localhost URLs in production code paths
+9. Structured JSON logging on all services
+10. Prometheus metrics endpoint exposed on API (`/metrics`)
+
+## Build order
+
+When building this project, follow this sequence:
+
+### Phase 1: Core API foundation
+1. Go project scaffold (cmd/api, internal packages)
+2. Config loading from environment
+3. PostgreSQL connection + schema (events, guests, tokens, rsvps tables)
+4. Event CRUD endpoints (POST/GET/PATCH/DELETE /events)
+5. Guest management endpoints (POST/GET /events/:id/guests, CSV import)
+6. Token generation (HMAC-SHA256 signed, per-guest unique)
+7. RSVP endpoint (validate token → record response)
+8. Health endpoint
+9. Unit tests for token logic and RSVP validation
+10. Dockerfile
+
+### Phase 2: NATS + Fraud Engine
+1. NATS JetStream connection in Core API
+2. Publish access events on token validation (guest.access.attempted)
+3. Fraud Engine scaffold (FastAPI + NATS consumer)
+4. Device fingerprint collection (frontend JavaScript)
+5. Heuristic risk scoring (weighted features)
+6. gRPC service definition (proto file)
+7. gRPC server in Fraud Engine
+8. gRPC client in Core API (synchronous scoring during RSVP)
+9. Publish fraud.scored events back to NATS
+10. Core API consumes fraud scores, flags tokens
+11. Integration tests (NATS end-to-end)
+
+### Phase 3: Notifications + Frontend
+1. Notification Worker scaffold (cmd/notifier)
+2. NATS consumers (invitation.send, rsvp.confirmed, fraud.alert)
+3. Twilio SMS adapter (with retry logic)
+4. Nuxt 3 frontend scaffold
+5. Landing page
+6. RSVP flow pages (/rsvp/:token)
+7. Host dashboard (event list, guest management, RSVP tracking)
+8. Real-time monitor page (WebSocket for live RSVPs + fraud alerts)
+9. docker-compose.yml for full local development
+10. End-to-end testing
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..67cd235
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,789 @@
+# GUESTGUARD — REARCHITECTED
+
+## Honest Architecture for a DevOps Portfolio
+
+---
+
+# THE PHILOSOPHY: BUILD IT LIKE A STAFF ENGINEER WOULD
+
+GuestGuard is not a microservices project. It's an **event-driven security platform** — and the architecture should reflect that. The core CRUD (events, guests, RSVPs) is simple. The interesting engineering lives in the **real-time fraud detection pipeline**, the **notification delivery system**, and the **observability layer** that ties it all together.
+
+Every architectural decision below has a defensible answer to the question: *"Why didn't you just put this in the monolith?"*
+
+---
+
+# POSITIONING
+
+**Tagline:** "Real-time fraud detection for event access control — an event-driven platform with ML-powered risk scoring, observability, and production-grade delivery infrastructure."
+
+**What this project proves (that VerifyHub and Web3Escrow don't):**
+- You can build a **product** — not just infrastructure for infrastructure's sake
+- You understand **event-driven architecture** with genuine reasons for async processing
+- You can implement **ML inference in a pipeline** (not just call an API)
+- You know when to use a monolith and when to break things out
+- You can ship a polished, usable application that solves a real problem
+
+**Together with your other projects, your portfolio now covers:**
+| Project | Proves |
+|---------|--------|
+| VerifyHub | Microservices resilience, chaos engineering, observability |
+| Web3Escrow | Blockchain infrastructure, smart contract DevOps |
+| GuestGuard | Event-driven architecture, real-time ML pipeline, product thinking |
+
+---
+
+# THE ARCHITECTURE — HONEST VERSION
+
+## Three Components, Each With a Real Reason to Exist
+
+### 1. Core API (Go)
+**Why it exists:** Handles all synchronous request/response operations.
+**Why Go:** You're already using Go in VerifyHub. Consistent stack. Fast, small binaries, great for containerisation.
+**Why monolithic:** Events, guests, RSVPs, and auth are tightly coupled domain logic. Splitting them into separate services would add network latency, distributed transaction complexity, and operational overhead — all for zero benefit. A senior engineer would never split these.
+
+**What lives here:**
+- Event CRUD (create, update, delete, list)
+- Guest management (import CSV, manual add, remove)
+- RSVP handling (validate token, record response, update guest status)
+- Authentication (JWT issuance, session management, OAuth for hosts)
+- Token generation (cryptographically signed, per-guest unique links)
+- REST API serving the frontend
+- WebSocket endpoint for real-time dashboard updates
+- Publishes events to NATS (access attempts, RSVPs, fraud signals)
+
+### 2. Fraud Engine (Python)
+**Why it exists:** The fraud detection logic is genuinely different from CRUD. It consumes events asynchronously, runs ML inference, and has different scaling characteristics (CPU-bound scoring vs I/O-bound API serving).
+**Why Python:** scikit-learn, numpy, pandas — the ML ecosystem lives in Python. Using Go here would mean reimplementing or wrapping everything.
+**Why separate:** This is the one service that has a legitimate reason to be independent. It has a different deployment cadence (model updates), different resource profile (CPU for inference), and different failure characteristics (if scoring is slow, RSVPs should still work).
+
+**What lives here:**
+- Consumes `access.attempted` events from NATS
+- Device fingerprint analysis (browser, OS, screen, timezone, WebGL hash)
+- IP geolocation scoring (distance from guest's expected location)
+- Behavioural analysis (time patterns, click speed, navigation flow)
+- Risk score calculation (0-100, combining all signals)
+- Publishes `fraud.scored` events back to NATS
+- Exposes a simple gRPC endpoint for synchronous scoring (used during RSVP submission as a final gate)
+- Model retraining pipeline (batch job, not real-time)
+
+### 3. Notification Worker (Go)
+**Why it exists:** Sending SMS/email is inherently async, can fail, needs retries with backoff, and should never block the API response. This is the textbook case for a background worker.
+**Why separate:** If Twilio is down, guests should still be able to RSVP. If you're sending 500 invitations, the API shouldn't be tied up for 10 minutes. Different failure domain, different retry logic, different rate limits.
+
+**What lives here:**
+- Consumes events from NATS (invitation.send, verification.required, fraud.alert, rsvp.confirmed)
+- SMS delivery via Twilio (with retry, backoff, DLQ)
+- Email delivery via AWS SES (with templates, retry)
+- Delivery status tracking (delivered, failed, bounced)
+- Rate limiting per provider (Twilio: 100/sec, SES: 14/sec)
+- Host alert delivery (WebSocket push for real-time, email digest for batch)
+
+---
+
+## Why NATS (Not RabbitMQ or Kafka)
+
+You're already using RabbitMQ in VerifyHub. Using a **different** message broker in GuestGuard shows range and lets you talk about tradeoffs in interviews.
+
+**NATS JetStream fits here because:**
+- Lightweight (single binary, ~15MB RAM) — good for a portfolio project's budget
+- Built-in persistence with JetStream (you need durability for notifications)
+- Subject-based routing is natural for this domain (`guest.access.attempted`, `fraud.scored`, `notification.send.sms`)
+- Simpler operationally than Kafka, more capable than basic Redis pub/sub
+
+**Interview talking point:** "I used RabbitMQ in VerifyHub for task queues and NATS in GuestGuard for event streaming — I can explain when I'd pick each one."
+
+---
+
+# DETAILED DATA FLOW
+
+## Happy Path: Guest RSVPs Successfully
+
+```
+Guest clicks unique link
+         │
+         ▼
+┌─────────────────┐
+│  Core API (Go)  │
+│                 │
+│ 1. Validate JWT │ ← Token embedded in URL, cryptographically signed
+│    token        │
+│ 2. Check token  │ ← Not expired, not already used, not revoked
+│    status       │
+│ 3. Collect      │ ← Browser fingerprint, IP, user agent
+│    device data  │
+│ 4. Publish      │──────────► NATS: guest.access.attempted
+│    access event │            {guest_id, token, fingerprint, ip, timestamp}
+│ 5. Serve RSVP   │
+│    form         │
+└────────┬────────┘
+         │
+    Guest submits RSVP
+         │
+         ▼
+┌─────────────────┐
+│  Core API (Go)  │
+│                 │
+│ 6. Sync fraud   │──── gRPC ───► Fraud Engine: score this access
+│    check (fast) │◄── score ──── {score: 23, risk: LOW}
+│ 7. Record RSVP  │
+│ 8. Publish      │──────────► NATS: rsvp.confirmed
+│    confirmation │            {guest_id, event_id, response, plus_ones}
+│ 9. Return       │
+│    success      │
+└─────────────────┘
+         │
+         │ (async, non-blocking)
+         ▼
+┌──────────────────────┐
+│ Notification Worker  │
+│                      │
+│ Consumes rsvp.confirmed
+│ 10. Send confirmation│──► SMS: "Your RSVP is confirmed!"
+│     to guest         │──► Email: confirmation with calendar invite
+│ 11. Update host      │──► WebSocket: dashboard counter updates
+│     dashboard        │
+└──────────────────────┘
+```
+
+## Fraud Path: Shared Link Detected
+
+```
+Uninvited person clicks forwarded link
+         │
+         ▼
+┌─────────────────┐
+│  Core API (Go)  │
+│                 │
+│ 1. Validate     │ ← Token is valid (it's a real link)
+│    token        │
+│ 2. Collect      │ ← DIFFERENT fingerprint, DIFFERENT IP
+│    device data  │
+│ 3. Publish      │──────────► NATS: guest.access.attempted
+│    access event │            {guest_id, token, fingerprint: NEW, ip: NEW}
+└────────┬────────┘
+         │
+         ▼ (async)
+┌──────────────────────┐
+│  Fraud Engine        │
+│                      │
+│ Consumes guest.access.attempted
+│ 4. Compare finger-   │ ← Previous fingerprint on file
+│    print to baseline │
+│ 5. Check IP geo-     │ ← Guest registered in London,
+│    location          │   access from Lagos
+│ 6. Score risk        │ ← Score: 87 (HIGH)
+│ 7. Publish result    │──────────► NATS: fraud.scored
+│                      │            {guest_id, score: 87, risk: HIGH, reasons: [...]}
+└──────────────────────┘
+         │
+         ▼ (async)
+┌─────────────────┐     ┌──────────────────────┐
+│  Core API (Go)  │     │ Notification Worker  │
+│                 │     │                      │
+│ Consumes fraud.scored │ Consumes fraud.scored
+│ 8. Flag token   │     │ 9. Alert host        │──► Push notification
+│ 9. Require SMS  │     │ 10. Log attempt      │──► Email digest
+│    verification │     └──────────────────────┘
+│    on next      │
+│    access       │
+└────────┬────────┘
+         │
+  Uninvited person tries to RSVP
+         │
+         ▼
+┌─────────────────┐
+│  Core API       │
+│                 │
+│ 11. Sync fraud  │──── gRPC ───► Fraud Engine: score = 87
+│     check       │◄──────────── BLOCK
+│ 12. Require SMS │ ← Send code to ORIGINAL guest's phone
+│     to original │
+│     guest phone │
+│ 13. Uninvited   │ ← Can't receive the code
+│     person      │
+│     blocked     │
+└─────────────────┘
+```
+
+---
+
+# DATA MODEL
+
+## PostgreSQL Schema (Single Database)
+
+```sql
+-- Core domain
+events
+  id                 UUID PRIMARY KEY
+  host_id            UUID REFERENCES users(id)
+  name               VARCHAR(255)
+  slug               VARCHAR(100) UNIQUE
+  event_date         TIMESTAMPTZ
+  venue              TEXT
+  max_capacity       INTEGER
+  settings           JSONB  -- theme, custom fields, etc.
+  status             event_status  -- draft, published, closed, archived
+  created_at         TIMESTAMPTZ
+  updated_at         TIMESTAMPTZ
+
+guests
+  id                 UUID PRIMARY KEY
+  event_id           UUID REFERENCES events(id)
+  name               VARCHAR(255)
+  email              VARCHAR(255)
+  phone              VARCHAR(20)  -- for SMS verification
+  plus_ones          INTEGER DEFAULT 0
+  dietary_notes      TEXT
+  table_number       INTEGER
+  created_at         TIMESTAMPTZ
+
+tokens
+  id                 UUID PRIMARY KEY
+  guest_id           UUID REFERENCES guests(id) UNIQUE
+  token_hash         VARCHAR(64)  -- SHA-256 of the actual token
+  expires_at         TIMESTAMPTZ
+  status             token_status  -- active, used, revoked, expired
+  used_at            TIMESTAMPTZ
+  created_at         TIMESTAMPTZ
+
+rsvps
+  id                 UUID PRIMARY KEY
+  guest_id           UUID REFERENCES guests(id) UNIQUE
+  response           rsvp_response  -- attending, declined, maybe
+  plus_ones          INTEGER
+  dietary_notes      TEXT
+  submitted_at       TIMESTAMPTZ
+  device_fingerprint JSONB
+  ip_address         INET
+  risk_score         SMALLINT  -- 0-100, from fraud engine
+
+-- Fraud detection
+access_logs
+  id                 UUID PRIMARY KEY
+  guest_id           UUID REFERENCES guests(id)
+  token_id           UUID REFERENCES tokens(id)
+  fingerprint        JSONB  -- browser, OS, screen, timezone, etc.
+  ip_address         INET
+  geo_location       JSONB  -- {country, city, lat, lng}
+  risk_score         SMALLINT
+  risk_reasons       TEXT[]
+  flagged            BOOLEAN DEFAULT FALSE
+  created_at         TIMESTAMPTZ
+
+-- Notifications
+notifications
+  id                 UUID PRIMARY KEY
+  guest_id           UUID REFERENCES guests(id)
+  channel            notification_channel  -- sms, email
+  type               notification_type  -- invitation, verification, confirmation, reminder
+  status             delivery_status  -- queued, sent, delivered, failed, bounced
+  provider_id        VARCHAR(100)  -- Twilio SID or SES message ID
+  attempts           SMALLINT DEFAULT 0
+  last_attempt       TIMESTAMPTZ
+  delivered_at       TIMESTAMPTZ
+  error              TEXT
+  created_at         TIMESTAMPTZ
+
+-- Indexes that matter
+CREATE INDEX idx_tokens_hash ON tokens(token_hash) WHERE status = 'active';
+CREATE INDEX idx_access_logs_guest ON access_logs(guest_id, created_at DESC);
+CREATE INDEX idx_access_logs_flagged ON access_logs(flagged) WHERE flagged = TRUE;
+CREATE INDEX idx_rsvps_event ON rsvps(guest_id);  -- join through guests.event_id
+CREATE INDEX idx_notifications_status ON notifications(status) WHERE status IN ('queued', 'failed');
+```
+
+## Redis Usage (Specific, Not Vague)
+
+```
+token:{hash}           → guest_id, status, expires    (fast token lookup, TTL = token expiry)
+rate:{ip}              → counter                       (rate limiting, TTL = 1 minute)
+fingerprint:{guest}    → baseline fingerprint JSON     (comparison for fraud detection)
+event:{id}:stats       → {total, attending, declined}  (dashboard counters, updated on RSVP)
+ws:connections:{host}  → set of WebSocket conn IDs     (for push notifications)
+```
+
+---
+
+# FRAUD ENGINE — DETAIL
+
+## Risk Scoring Model
+
+This is not an API call. It's a lightweight ML pipeline you build and can explain.
+
+### Features (Input to Model)
+
+| Feature | Source | Weight | Why |
+|---------|--------|--------|-----|
+| Fingerprint match | Compare current vs baseline | HIGH | Different device = possible shared link |
+| IP geolocation distance | MaxMind GeoLite2 (free) | MEDIUM | Guest in London, access from Mumbai |
+| Access time pattern | Time of day, day of week | LOW | 3 AM access is unusual |
+| Browser consistency | Same browser as registration? | MEDIUM | Chrome → Safari = suspicious |
+| Repeated access | How many times token accessed | LOW | Normal people click once or twice |
+| Referrer analysis | Where did they come from? | LOW | Direct link vs social media share |
+
+### Model
+
+**Start simple:** Weighted scoring (not even ML yet). Each feature gets a score 0-100, multiply by weight, sum, normalise.
+
+**Then upgrade:** Train a Random Forest classifier on the access_logs table. Label data: host-confirmed fraudulent attempts become positive examples, successful RSVPs become negative examples. Export model with joblib, load at startup, inference in ~2ms.
+
+**Interview talking point:** "I started with heuristic scoring, then replaced it with a trained model once I had labelled data. The scoring interface didn't change — the fraud engine is a black box to the rest of the system."
+
+### Risk Thresholds
+
+| Score | Action | UX |
+|-------|--------|-----|
+| 0-30 | Allow | Smooth RSVP flow |
+| 31-60 | Soft verify | "Confirm your name" (knowledge check) |
+| 61-85 | SMS verify | Send code to registered phone |
+| 86-100 | Block + alert | "This invitation cannot be used. The host has been notified." |
+
+---
+
+# FRONTEND — NUXT 3
+
+## Why Nuxt 3 (Not Next.js)
+
+Your other two projects use Nuxt 3. **Consistency across your portfolio** means you demonstrate deep framework knowledge, not surface-level familiarity with three different frameworks.
+
+## Pages
+
+### Public / Guest-Facing
+```
+/                        → Landing page (product overview)
+/e/{slug}                → Event public page (if host enables)
+/rsvp/{token}            → RSVP flow (the core guest experience)
+/rsvp/{token}/confirm    → Post-RSVP confirmation
+```
+
+### Host Dashboard
+```
+/dashboard                        → Overview (events, stats)
+/dashboard/events/new             → Create event (wizard)
+/dashboard/events/{id}            → Event detail (guest list, RSVPs, fraud alerts)
+/dashboard/events/{id}/guests     → Guest management (import, add, remove)
+/dashboard/events/{id}/invites    → Send invitations (bulk SMS/email)
+/dashboard/events/{id}/monitor    → Live monitoring (real-time RSVPs + fraud)
+/dashboard/events/{id}/export     → Export guest list (CSV, PDF)
+/dashboard/settings               → Account, billing, notification preferences
+```
+
+### The Monitor Page — The DevOps Showcase
+
+This is GuestGuard's equivalent of VerifyHub's Chaos Panel. A real-time operational dashboard that shows:
+
+**Live RSVP Stream:** New RSVPs appearing in real time via WebSocket.
+**Fraud Activity:** Flagged access attempts with risk scores, reasons, and fingerprint diffs.
+**Delivery Status:** SMS/email send rates, failures, retries.
+**System Health:** API latency (p50/p95/p99), fraud engine scoring time, NATS queue depth.
+
+This page is what makes GuestGuard a DevOps project, not just a web app.
+
+---
+
+# DEVOPS & INFRASTRUCTURE
+
+## Repository Structure
+
+```
+guestguard/
+├── cmd/
+│   ├── api/                    # Go API binary entrypoint
+│   └── notifier/               # Go notification worker entrypoint
+├── internal/
+│   ├── api/                    # HTTP handlers, middleware, WebSocket
+│   ├── auth/                   # JWT, OAuth
+│   ├── domain/                 # Business logic (events, guests, RSVPs, tokens)
+│   ├── fraud/                  # gRPC client for fraud engine
+│   ├── nats/                   # NATS publisher/subscriber
+│   ├── notification/           # Twilio, SES adapters
+│   └── storage/                # PostgreSQL, Redis repositories
+├── fraud-engine/
+│   ├── app/                    # FastAPI app
+│   ├── scoring/                # Risk scoring logic + ML model
+│   ├── consumers/              # NATS event consumers
+│   ├── model/                  # Trained model artifacts
+│   ├── Dockerfile
+│   └── requirements.txt
+├── frontend/
+│   ├── nuxt.config.ts
+│   ├── pages/
+│   ├── components/
+│   ├── composables/
+│   └── Dockerfile
+├── infra/
+│   ├── terraform/
+│   │   ├── modules/
+│   │   │   ├── eks/
+│   │   │   ├── rds/
+│   │   │   ├── elasticache/
+│   │   │   └── networking/
+│   │   ├── environments/
+│   │   │   ├── staging/
+│   │   │   └── production/
+│   │   └── main.tf
+│   ├── kubernetes/
+│   │   ├── base/               # Kustomize base
+│   │   │   ├── api/
+│   │   │   ├── fraud-engine/
+│   │   │   ├── notifier/
+│   │   │   ├── nats/
+│   │   │   └── frontend/
+│   │   └── overlays/
+│   │       ├── staging/
+│   │       └── production/
+│   └── helm/
+│       └── guestguard/         # Helm chart for full deployment
+├── monitoring/
+│   ├── grafana/
+│   │   └── dashboards/
+│   │       ├── api-performance.json
+│   │       ├── fraud-detection.json
+│   │       ├── notification-delivery.json
+│   │       └── system-overview.json
+│   ├── prometheus/
+│   │   ├── prometheus.yml
+│   │   └── rules/
+│   │       ├── api-alerts.yml
+│   │       ├── fraud-alerts.yml
+│   │       └── notification-alerts.yml
+│   └── loki/
+│       └── loki-config.yml
+├── .github/
+│   └── workflows/
+│       ├── ci.yml              # Lint, test, build
+│       ├── cd-staging.yml      # Deploy to staging
+│       └── cd-prod.yml         # Deploy to production (manual approval)
+├── docker-compose.yml          # Local development
+├── Makefile
+└── README.md
+```
+
+## CI/CD Pipeline
+
+```
+Push to main
+              │
+              ▼
+┌─────────────────────────────┐
+│  GitHub Actions: CI         │
+│                             │
+│  1. Lint (golangci-lint,    │
+│     ruff for Python,        │
+│     eslint for frontend)    │
+│  2. Unit tests              │
+│     - Go: go test ./...     │
+│     - Python: pytest        │
+│     - Frontend: vitest      │
+│  3. Integration tests       │
+│     - docker-compose up     │
+│     - Test API → NATS →     │
+│       Fraud Engine flow     │
+│  4. Security scan           │
+│     - Trivy on images       │
+│     - gosec on Go code      │
+│     - bandit on Python      │
+│  5. Build Docker images     │
+│  6. Push to ECR             │
+└─────────────┬───────────────┘
+              │
+              ▼
+┌─────────────────────────────┐
+│  GitHub Actions: CD Staging │
+│                             │
+│  7. Update Kustomize        │
+│     overlay with new        │
+│     image tags              │
+│  8. ArgoCD syncs to         │
+│     staging EKS cluster     │
+│  9. Smoke tests against     │
+│     staging                 │
+└─────────────┬───────────────┘
+              │
+              ▼ (manual approval)
+┌─────────────────────────────┐
+│  GitHub Actions: CD Prod    │
+│                             │
+│  10. ArgoCD syncs to        │
+│      production             │
+│  11. Canary deployment      │
+│      (10% → 50% → 100%)     │
+│  12. Automated rollback     │
+│      on error rate spike    │
+└─────────────────────────────┘
+```
+
+## Kubernetes Resources
+
+| Component | Replicas | Resources | HPA |
+|-----------|----------|-----------|-----|
+| Core API | 2 | 256Mi / 0.25 CPU | Scale on CPU > 70% |
+| Fraud Engine | 1 | 512Mi / 0.5 CPU | Scale on queue depth |
+| Notification Worker | 1 | 128Mi / 0.1 CPU | Scale on queue depth |
+| NATS | 1 (JetStream) | 256Mi / 0.25 CPU | — |
+| Frontend (SSR) | 2 | 256Mi / 0.25 CPU | Scale on CPU > 70% |
+| PostgreSQL | RDS (not in cluster) | — | — |
+| Redis | ElastiCache (not in cluster) | — | — |
+
+**Interview talking point:** "I run stateful services (Postgres, Redis) as managed AWS services, not in Kubernetes. Running databases in K8s adds operational complexity without benefit for a platform of this scale."
+
+## Monitoring & Alerting
+
+### Grafana Dashboards
+
+**1. API Performance**
+- Request rate (req/sec by endpoint)
+- Latency percentiles (p50, p95, p99)
+- Error rate by status code
+- Active WebSocket connections
+
+**2. Fraud Detection**
+- Access attempts per minute
+- Risk score distribution (histogram)
+- Flagged vs clean ratio
+- Scoring latency (p50, p95)
+- Top flagged events
+
+**3. Notification Delivery**
+- Send rate by channel (SMS vs email)
+- Delivery success rate
+- Retry queue depth
+- Provider error rates (Twilio vs SES)
+- Cost per notification
+
+**4. System Overview**
+- NATS queue depths by subject
+- Consumer lag
+- Pod CPU/memory
+- Node health
+
+### Alert Rules
+
+```yaml
+# Fraud engine is slow (scoring should be < 50ms)
+- alert: FraudScoringLatencyHigh
+  expr: histogram_quantile(0.95, sum by (le) (rate(fraud_scoring_duration_seconds_bucket[5m]))) > 0.05
+  for: 5m
+  annotations:
+    summary: "Fraud scoring p95 latency above 50ms"
+
+# Notification delivery failing
+- alert: NotificationDeliveryFailureRate
+  expr: rate(notifications_failed_total[5m]) / rate(notifications_sent_total[5m]) > 0.1
+  for: 5m
+  annotations:
+    summary: "More than 10% of notifications failing"
+
+# NATS consumer falling behind
+- alert: NATSConsumerLag
+  expr: nats_consumer_num_pending > 1000
+  for: 2m
+  annotations:
+    summary: "NATS consumer has 1000+ pending messages"
+```
+
+---
+
+# TOKEN SECURITY — DETAIL
+
+## Token Generation
+
+```
+Token = base64url(
+  header: { alg: "HS256", typ: "GG" }
+  payload: {
+    gid: "guest-uuid",        // guest ID
+    eid: "event-uuid",        // event ID
+    iat: 1717200000,          // issued at
+    exp: 1719792000,          // expires (30 days default)
+    nonce: "random-16-bytes"  // prevents token prediction
+  }
+  signature: HMAC-SHA256(header.payload, server_secret)
+)
+```
+
+**URL format:** `https://guestguard.app/rsvp/tk_aBcDeFgHiJkLmNoPqRsT`
+
+**Properties:**
+- Cryptographically signed (can't be forged)
+- Contains no PII (guest name, email not in token)
+- Expires after configurable TTL
+- One-time use for RSVP (token marked `used` after submission)
+- Revocable by host at any time
+
+---
+
+# WHAT THE README SAYS
+
+```markdown
+# GuestGuard
+
+> Stop uninvited guests before they RSVP.
+
+An event-driven RSVP platform with real-time fraud detection.
+Unique, cryptographically signed invitation links + device
+fingerprinting + ML risk scoring = only your actual guests
+can respond.
+
+## Architecture Decisions
+
+**Why a monolith for the core API?**
+Events, guests, RSVPs, and auth are a single bounded context.
+Splitting them into microservices would add network hops,
+distributed transactions, and operational overhead with no
+benefit. The monolith serves REST, WebSocket, and publishes
+events — it's not a limitation, it's a deliberate choice.
+
+**Why a separate fraud engine?**
+Different language (Python for ML), different scaling profile
+(CPU-bound inference vs I/O-bound API), different deployment
+cadence (model updates). This is the one service that earns
+its independence.
+
+**Why a separate notification worker?**
+Sending SMS/email is async, can fail, needs retries, and
+should never block the API. Classic background worker pattern.
+
+**Why NATS?**
+Lightweight, persistent with JetStream, and I'm already using
+RabbitMQ in VerifyHub — so I can compare both in interviews.
+
+## Tech Stack
+
+| Layer | Technology | Why |
+|-------|-----------|-----|
+| Core API | Go | Fast, small binaries, consistent with VerifyHub |
+| Fraud Engine | Python (FastAPI) | ML ecosystem (scikit-learn, numpy) |
+| Frontend | Nuxt 3 | SSR, consistent across all portfolio projects |
+| Message Broker | NATS JetStream | Lightweight event streaming |
+| Database | PostgreSQL (RDS) | Relational data with JSONB for flexibility |
+| Cache | Redis (ElastiCache) | Token lookup, rate limiting, sessions |
+| Monitoring | Prometheus + Grafana | Metrics and dashboards |
+| Logging | Loki | Log aggregation |
+| CI/CD | GitHub Actions + ArgoCD | GitOps deployment |
+| Infrastructure | Terraform + EKS | Consistent with other projects |
+| SMS | Twilio | Industry standard |
+| Email | AWS SES | Cost-effective, already in AWS |
+| GeoIP | MaxMind GeoLite2 | Free, accurate, offline lookup |
+
+## Run Locally
+
+docker-compose up
+
+## Live Demo
+
+[guestguard.kwakudanso.dev](https://guestguard.kwakudanso.dev)
+```
+
+---
+
+# IMPLEMENTATION PLAN
+
+## Phase 1: Foundation (Week 1-2)
+
+**Goal:** Core API serving RSVP flow, tokens working, PostgreSQL schema live.
+
+- [ ] Go project scaffold (cmd/api, internal packages)
+- [ ] PostgreSQL schema + migrations (golang-migrate)
+- [ ] Token generation and validation
+- [ ] Event CRUD endpoints
+- [ ] Guest management (add, import CSV, list)
+- [ ] RSVP endpoint (validate token → record response)
+- [ ] Basic Nuxt 3 frontend (RSVP page only)
+- [ ] Docker Compose for local dev
+- [ ] Unit tests for token and RSVP logic
+
+## Phase 2: Fraud Engine (Week 3-4)
+
+**Goal:** Device fingerprinting working, risk scoring live, fraud alerts flowing.
+
+- [ ] NATS JetStream setup
+- [ ] Core API publishes access events to NATS
+- [ ] Fraud Engine consumes events
+- [ ] Device fingerprint collection (frontend JS)
+- [ ] IP geolocation (MaxMind GeoLite2)
+- [ ] Heuristic risk scoring (weighted features)
+- [ ] gRPC endpoint for synchronous scoring
+- [ ] Fraud Engine publishes scored events
+- [ ] Core API consumes fraud scores, flags tokens
+- [ ] Integration tests (end-to-end fraud flow)
+
+## Phase 3: Notifications + Dashboard (Week 5-6)
+
+**Goal:** SMS/email working, host dashboard live with real-time updates.
+
+- [ ] Notification Worker consuming NATS events
+- [ ] Twilio SMS integration (with retry logic)
+- [ ] AWS SES email integration (with templates)
+- [ ] WebSocket endpoint in Core API
+- [ ] Host dashboard pages (Nuxt 3)
+- [ ] Real-time RSVP stream on dashboard
+- [ ] Fraud alert display on dashboard
+- [ ] Notification delivery status tracking
+- [ ] Guest RSVP confirmation page (polished)
+
+## Phase 4: Infrastructure + DevOps (Week 7-8)
+
+**Goal:** Deployed to EKS, monitoring live, CI/CD complete.
+
+- [ ] Terraform modules (EKS, RDS, ElastiCache, networking)
+- [ ] Kubernetes manifests (Kustomize base + overlays)
+- [ ] GitHub Actions CI pipeline (lint, test, build, scan)
+- [ ] GitHub Actions CD pipeline (staging → prod)
+- [ ] ArgoCD setup
+- [ ] Prometheus + Grafana dashboards (4 dashboards)
+- [ ] Loki log aggregation
+- [ ] Alert rules
+- [ ] Canary deployment configuration
+- [ ] README + architecture documentation
+
+## Phase 5: Polish + ML Upgrade (Week 9-10)
+
+**Goal:** ML model trained, landing page polished, documentation complete.
+
+- [ ] Train Random Forest on access_logs data
+- [ ] Replace heuristic scoring with ML model
+- [ ] Model versioning and deployment pipeline
+- [ ] Landing page design and build
+- [ ] Event public page
+- [ ] Mobile responsive pass on all pages
+- [ ] Performance optimisation (API latency, SSR)
+- [ ] Security hardening (rate limits, input validation, CORS)
+- [ ] Architecture decision records (ADRs)
+- [ ] Blog post: "Building a Real-Time Fraud Detection Pipeline"
+
+---
+
+# COST ESTIMATE
+
+| Resource | Monthly Cost |
+|----------|-------------|
+| EKS cluster (shared with other projects) | $0 (already running) |
+| EKS worker nodes (2x t3.medium spot) | ~$30 (shared) |
+| RDS PostgreSQL (db.t3.micro) | ~$15 |
+| ElastiCache Redis (cache.t3.micro) | ~$12 |
+| Twilio SMS (dev volume) | ~$5 |
+| AWS SES | ~$1 |
+| MaxMind GeoLite2 | Free |
+| **Total incremental** | **~$63/month** |
+
+Since you're already running EKS for VerifyHub and Web3Escrow, the incremental cost is just the managed services plus a share of the spot worker nodes.
+
+---
+
+# INTERVIEW TALKING POINTS
+
+**"Why didn't you make this all microservices?"**
+"Because the core domain (events, guests, RSVPs) is a single bounded context. Splitting it would add distributed transaction complexity for zero benefit. The fraud engine and notification worker are separate because they have genuinely different characteristics: different languages, different scaling profiles, different failure domains."
+
+**"How does the fraud detection work?"**
+"When a guest accesses their RSVP link, we collect a device fingerprint and IP. That gets published to NATS and consumed by the fraud engine, which compares it against the baseline from when the invitation was sent. The risk score combines fingerprint similarity, geolocation distance, and behavioural signals. I started with heuristic scoring and then trained a Random Forest classifier once I had labelled data."
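The heuristic stage described in the fraud detection answer could start as a simple weighted sum. The sketch below is illustrative only and is not part of the commit: the `AccessSignals` type, the weights, and the 1000 km normalisation cap are all assumptions, and a real scorer would take its features from `docs/ARCHITECTURE.md`.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_DISTANCE_KM = 1000.0  # assumed cap: anything farther counts as maximum geo risk
# Assumed weights for illustration; they sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {"fingerprint": 0.5, "geo": 0.3, "behaviour": 0.2}


@dataclass(frozen=True)
class AccessSignals:
    """Hypothetical feature bundle for one RSVP link access."""
    fingerprint_similarity: float  # 1.0 = matches baseline, 0.0 = no overlap
    distance_km: float             # distance from the baseline location
    behaviour_anomaly: float       # 0.0 = normal, 1.0 = highly unusual


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two latitude/longitude points in km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def clamp(value: float) -> float:
    """Restrict a value to the range [0, 1]."""
    return max(0.0, min(1.0, value))


def risk_score(signals: AccessSignals) -> float:
    """Weighted sum of per-signal risks; 0.0 is safe, 1.0 is maximum risk."""
    fingerprint_risk = 1.0 - clamp(signals.fingerprint_similarity)
    geo_risk = clamp(signals.distance_km / MAX_DISTANCE_KM)
    behaviour_risk = clamp(signals.behaviour_anomaly)
    score = (
        WEIGHTS["fingerprint"] * fingerprint_risk
        + WEIGHTS["geo"] * geo_risk
        + WEIGHTS["behaviour"] * behaviour_risk
    )
    return round(score, 4)
```

Keeping each signal normalised to [0, 1] before weighting means the same feature vector could later be fed to the Random Forest without reshaping.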
+
+**"Why NATS instead of Kafka or RabbitMQ?"**
+"I used RabbitMQ in VerifyHub for task queues. GuestGuard needs event streaming: access events are published once and read by multiple consumers (fraud engine, notification worker, analytics). NATS JetStream fits that pattern with less operational overhead than Kafka. For a platform at this scale, Kafka would be over-engineering."
+
+**"What happens when the fraud engine is down?"**
+"RSVPs still work. The synchronous gRPC call has a 100ms timeout with a fallback: if the fraud engine is unreachable, the access is allowed but flagged for async review. The async scoring via NATS will catch up when the service recovers. Notifications also continue independently, since they are a different failure domain."
+
+**"How do you handle 500 simultaneous invitation sends?"**
+"The API publishes 500 notification.send events to NATS. The notification worker processes them with rate limiting (assuming a Twilio throughput limit of 100 messages/sec). If Twilio throttles us, messages go to the retry queue with exponential backoff. Delivery status is tracked per-notification. The host sees a progress bar via WebSocket, not a spinning loader for 10 minutes."
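The retry behaviour in the last answer can be sketched as capped exponential backoff. This is a minimal illustration and not part of the commit: the Notification Worker itself is written in Go, and `ThrottledError`, `backoff_delay`, `send_with_retry`, and the 0.5 s base and 30 s cap are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Tuple


class ThrottledError(Exception):
    """Hypothetical error raised when the provider asks the sender to slow down."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay in seconds before the next retry: base * 2^attempt, capped at cap."""
    return min(cap, base * (2 ** attempt))


def send_with_retry(
    send: Callable[[], None],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Call send() until it succeeds or max_attempts is reached.

    Returns (delivered, attempts_made) so delivery status can be tracked
    per notification.
    """
    for attempt in range(max_attempts):
        try:
            send()
            return True, attempt + 1
        except ThrottledError:
            if attempt == max_attempts - 1:
                break  # no sleep after the final failed attempt
            sleep(backoff_delay(attempt))
    return False, max_attempts
```

Passing `sleep` as a parameter lets tests record the delays instead of waiting. A production worker would probably also add random jitter so that 500 throttled messages do not all retry at the same instant.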