GUESTGUARD — REARCHITECTED

Honest Architecture for a DevOps Portfolio


THE PHILOSOPHY: BUILD IT LIKE A STAFF ENGINEER WOULD

GuestGuard is not a microservices project. It's an event-driven security platform — and the architecture should reflect that. The core CRUD (events, guests, RSVPs) is simple. The interesting engineering lives in the real-time fraud detection pipeline, the notification delivery system, and the observability layer that ties it all together.

Every architectural decision below has a defensible answer to the question: "Why didn't you just put this in the monolith?"


POSITIONING

Tagline: "Real-time fraud detection for event access control — an event-driven platform with ML-powered risk scoring, observability, and production-grade delivery infrastructure."

What this project proves (that VerifyHub and Web3Escrow don't):

  • You can build a product — not just infrastructure for infrastructure's sake
  • You understand event-driven architecture with genuine reasons for async processing
  • You can implement ML inference in a pipeline (not just call an API)
  • You know when to use a monolith and when to break things out
  • You can ship a polished, usable application that solves a real problem

Together with your other projects, your portfolio now covers:

| Project | Proves |
|---------|--------|
| VerifyHub | Microservices resilience, chaos engineering, observability |
| Web3Escrow | Blockchain infrastructure, smart contract DevOps |
| GuestGuard | Event-driven architecture, real-time ML pipeline, product thinking |

THE ARCHITECTURE — HONEST VERSION

Three Components, Each With a Real Reason to Exist

1. Core API (Go)

Why it exists: Handles all synchronous request/response operations.

Why Go: You're already using Go in VerifyHub. Consistent stack. Fast, small binaries, great for containerisation.

Why monolithic: Events, guests, RSVPs, and auth are tightly coupled domain logic. Splitting them into separate services would add network latency, distributed transaction complexity, and operational overhead, all for zero benefit. A senior engineer would never split these.

What lives here:

  • Event CRUD (create, update, delete, list)
  • Guest management (import CSV, manual add, remove)
  • RSVP handling (validate token, record response, update guest status)
  • Authentication (JWT issuance, session management, OAuth for hosts)
  • Token generation (cryptographically signed, per-guest unique links)
  • REST API serving the frontend
  • WebSocket endpoint for real-time dashboard updates
  • Publishes events to NATS (access attempts, RSVPs, fraud signals)

2. Fraud Engine (Python)

Why it exists: The fraud detection logic is genuinely different from CRUD. It consumes events asynchronously, runs ML inference, and has different scaling characteristics (CPU-bound scoring vs I/O-bound API serving).

Why Python: scikit-learn, numpy, pandas: the ML ecosystem lives in Python. Using Go here would mean reimplementing or wrapping everything.

Why separate: This is the one service that has a legitimate reason to be independent. It has a different deployment cadence (model updates), different resource profile (CPU for inference), and different failure characteristics (if scoring is slow, RSVPs should still work).

What lives here:

  • Consumes access.attempted events from NATS
  • Device fingerprint analysis (browser, OS, screen, timezone, WebGL hash)
  • IP geolocation scoring (distance from guest's expected location)
  • Behavioural analysis (time patterns, click speed, navigation flow)
  • Risk score calculation (0-100, combining all signals)
  • Publishes fraud.scored events back to NATS
  • Exposes a simple gRPC endpoint for synchronous scoring (used during RSVP submission as a final gate)
  • Model retraining pipeline (batch job, not real-time)
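
The IP geolocation signal in the list above boils down to a distance calculation. Here is a minimal sketch, assuming the GeoIP lookup has already produced latitude/longitude pairs. The function names and the 2000 km saturation point are hypothetical, chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_signal(distance_km: float, max_km: float = 2000.0) -> float:
    """Map a distance to a 0-100 signal that saturates at max_km (assumed cut-off)."""
    return min(100.0, 100.0 * distance_km / max_km)


# Guest registered in London, access from Lagos
london = (51.5074, -0.1278)
lagos = (6.5244, 3.3792)
distance = haversine_km(*london, *lagos)
print(f"{distance:.0f} km -> signal {geo_signal(distance):.0f}")
```

A linear ramp is the simplest choice; a real deployment would probably tolerate short distances (commuting, mobile carrier routing) before the signal starts to rise.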

3. Notification Worker (Go)

Why it exists: Sending SMS/email is inherently async, can fail, needs retries with backoff, and should never block the API response. This is the textbook case for a background worker. Why separate: If Twilio is down, guests should still be able to RSVP. If you're sending 500 invitations, the API shouldn't be tied up for 10 minutes. Different failure domain, different retry logic, different rate limits.

What lives here:

  • Consumes events from NATS (invitation.send, verification.required, fraud.alert, rsvp.confirmed)
  • SMS delivery via Twilio (with retry, backoff, DLQ)
  • Email delivery via AWS SES (with templates, retry)
  • Delivery status tracking (delivered, failed, bounced)
  • Rate limiting per provider, configured to each account's quota (for example Twilio: 100/sec, SES: 14/sec; actual limits vary by sender type and region)
  • Host alert delivery (WebSocket push for real-time, email digest for batch)
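
The retry, backoff, and DLQ behaviour listed above can be sketched as follows. The worker itself is planned in Go; this sketch uses Python only to keep the examples in this document in one language. `deliver_with_retry`, `backoff_delays`, and the default delays are hypothetical names and values, and `send` stands in for a Twilio or SES adapter.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0) -> List[float]:
    """Seconds to wait before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


@dataclass
class DeliveryResult:
    status: str                      # "sent" or "dead_lettered"
    attempts: int
    waited: List[float] = field(default_factory=list)


def deliver_with_retry(send: Callable[[dict], None], message: dict,
                       max_attempts: int = 4,
                       sleep: Callable[[float], None] = time.sleep,
                       dead_letter: Callable[[dict], None] = lambda m: None,
                       ) -> DeliveryResult:
    """Call `send`; on failure back off exponentially, then hand off to the DLQ."""
    delays = backoff_delays(max_attempts - 1)
    waited: List[float] = []
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return DeliveryResult("sent", attempt, waited)
        except Exception:
            if attempt == max_attempts:
                break                # out of attempts, fall through to the DLQ
            sleep(delays[attempt - 1])
            waited.append(delays[attempt - 1])
    dead_letter(message)
    return DeliveryResult("dead_lettered", max_attempts, waited)
```

Passing `sleep` and `dead_letter` in as parameters keeps the retry logic testable without real waits. Adding random jitter to each delay is a common refinement when many messages fail at once.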

Why NATS (Not RabbitMQ or Kafka)

You're already using RabbitMQ in VerifyHub. Using a different message broker in GuestGuard shows range and lets you talk about tradeoffs in interviews.

NATS JetStream fits here because:

  • Lightweight (single binary, ~15MB RAM) — good for a portfolio project's budget
  • Built-in persistence with JetStream (you need durability for notifications)
  • Subject-based routing is natural for this domain (guest.access.attempted, fraud.scored, notification.send.sms)
  • Simpler operationally than Kafka, more capable than basic Redis pub/sub

Interview talking point: "I used RabbitMQ in VerifyHub for task queues and NATS in GuestGuard for event streaming — I can explain when I'd pick each one."


DETAILED DATA FLOW

Happy Path: Guest RSVPs Successfully

Guest clicks unique link
        │
        ▼
┌─────────────────┐
│   Core API (Go) │
│                 │
│ 1. Validate JWT │ ← Token embedded in URL, cryptographically signed
│    token        │
│ 2. Check token  │ ← Not expired, not already used, not revoked
│    status       │
│ 3. Collect      │ ← Browser fingerprint, IP, user agent
│    device data  │
│ 4. Publish      │──────────► NATS: guest.access.attempted
│    access event │           {guest_id, token, fingerprint, ip, timestamp}
│ 5. Serve RSVP   │
│    form         │
└────────┬────────┘
         │
    Guest submits RSVP
         │
         ▼
┌─────────────────┐
│   Core API (Go) │
│                 │
│ 6. Sync fraud   │──── gRPC ───► Fraud Engine: score this access
│    check (fast) │◄── score ────  {score: 23, risk: LOW}
│ 7. Record RSVP  │
│ 8. Publish      │──────────► NATS: rsvp.confirmed
│    confirmation │           {guest_id, event_id, response, plus_ones}
│ 9. Return       │
│    success      │
└─────────────────┘
         │
         │ (async, non-blocking)
         ▼
┌──────────────────────┐
│ Notification Worker  │
│                      │
│ Consumes rsvp.confirmed
│ 10. Send confirmation│──► SMS: "Your RSVP is confirmed!"
│     to guest         │──► Email: confirmation with calendar invite
│ 11. Update host      │──► WebSocket: dashboard counter updates
│     dashboard        │
└──────────────────────┘

Fraud Path: Uninvited Person Uses a Forwarded Link

Uninvited person clicks forwarded link
        │
        ▼
┌─────────────────┐
│   Core API (Go) │
│                 │
│ 1. Validate     │ ← Token is valid (it's a real link)
│    token        │
│ 2. Collect      │ ← DIFFERENT fingerprint, DIFFERENT IP
│    device data  │
│ 3. Publish      │──────────► NATS: guest.access.attempted
│    access event │           {guest_id, token, fingerprint: NEW, ip: NEW}
└────────┬────────┘
         │
         ▼ (async)
┌──────────────────────┐
│    Fraud Engine      │
│                      │
│ Consumes guest.access.attempted
│ 4. Compare finger-  │ ← Previous fingerprint on file
│    print to baseline │
│ 5. Check IP geo-    │ ← Guest registered in London,
│    location          │   access from Lagos
│ 6. Score risk        │ ← Score: 87 (HIGH)
│ 7. Publish result   │──────────► NATS: fraud.scored
│                      │  {guest_id, score: 87, risk: HIGH, reasons: [...]}
└──────────────────────┘
         │
         ▼ (async)
┌─────────────────┐        ┌──────────────────────┐
│   Core API (Go) │        │ Notification Worker  │
│                 │        │                      │
│ Consumes fraud.scored    │ Consumes fraud.scored
│ 8. Flag token   │        │ 10. Alert host       │──► Push notification
│ 9. Require SMS  │        │ 11. Log attempt      │──► Email digest
│    verification │        └──────────────────────┘
│    on next      │
│    access       │
└────────┬────────┘
         │
    Uninvited person tries to RSVP
         │
         ▼
┌─────────────────┐
│   Core API      │
│                 │
│ 12. Sync fraud  │──── gRPC ───► Fraud Engine: score = 87
│     check       │◄────────────  BLOCK
│ 13. Require SMS │ ← Send code to ORIGINAL guest's phone
│     to original │
│     guest phone │
│ 14. Uninvited   │ ← Can't receive the code
│     person      │
│     blocked     │
└─────────────────┘

DATA MODEL

PostgreSQL Schema (Single Database)

-- Core domain
events
  id              UUID PRIMARY KEY
  host_id         UUID REFERENCES users(id)
  name            VARCHAR(255)
  slug            VARCHAR(100) UNIQUE
  event_date      TIMESTAMPTZ
  venue           TEXT
  max_capacity    INTEGER
  settings        JSONB          -- theme, custom fields, etc.
  status          event_status   -- draft, published, closed, archived
  created_at      TIMESTAMPTZ
  updated_at      TIMESTAMPTZ

guests
  id              UUID PRIMARY KEY
  event_id        UUID REFERENCES events(id)
  name            VARCHAR(255)
  email           VARCHAR(255)
  phone           VARCHAR(20)    -- for SMS verification
  plus_ones       INTEGER DEFAULT 0
  dietary_notes   TEXT
  table_number    INTEGER
  created_at      TIMESTAMPTZ

tokens
  id              UUID PRIMARY KEY
  guest_id        UUID REFERENCES guests(id) UNIQUE
  token_hash      VARCHAR(64)    -- SHA-256 of the actual token
  expires_at      TIMESTAMPTZ
  status          token_status   -- active, used, revoked, expired
  used_at         TIMESTAMPTZ
  created_at      TIMESTAMPTZ

rsvps
  id              UUID PRIMARY KEY
  guest_id        UUID REFERENCES guests(id) UNIQUE
  response        rsvp_response  -- attending, declined, maybe
  plus_ones       INTEGER
  dietary_notes   TEXT
  submitted_at    TIMESTAMPTZ
  device_fingerprint  JSONB
  ip_address      INET
  risk_score      SMALLINT       -- 0-100, from fraud engine

-- Fraud detection
access_logs
  id              UUID PRIMARY KEY
  guest_id        UUID REFERENCES guests(id)
  token_id        UUID REFERENCES tokens(id)
  fingerprint     JSONB          -- browser, OS, screen, timezone, etc.
  ip_address      INET
  geo_location    JSONB          -- {country, city, lat, lng}
  risk_score      SMALLINT
  risk_reasons    TEXT[]
  flagged         BOOLEAN DEFAULT FALSE
  created_at      TIMESTAMPTZ

-- Notifications
notifications
  id              UUID PRIMARY KEY
  guest_id        UUID REFERENCES guests(id)
  channel         notification_channel  -- sms, email
  type            notification_type     -- invitation, verification, confirmation, reminder
  status          delivery_status       -- queued, sent, delivered, failed, bounced
  provider_id     VARCHAR(100)          -- Twilio SID or SES message ID
  attempts        SMALLINT DEFAULT 0
  last_attempt    TIMESTAMPTZ
  delivered_at    TIMESTAMPTZ
  error           TEXT
  created_at      TIMESTAMPTZ

-- Indexes that matter
CREATE INDEX idx_tokens_hash ON tokens(token_hash) WHERE status = 'active';
CREATE INDEX idx_access_logs_guest ON access_logs(guest_id, created_at DESC);
CREATE INDEX idx_access_logs_flagged ON access_logs(flagged) WHERE flagged = TRUE;
CREATE INDEX idx_guests_event ON guests(event_id);  -- per-event RSVP queries join through guests; rsvps.guest_id is already indexed by its UNIQUE constraint
CREATE INDEX idx_notifications_status ON notifications(status) WHERE status IN ('queued', 'failed');

Redis Usage (Specific, Not Vague)

token:{hash}          → guest_id, status, expires  (fast token lookup, TTL = token expiry)
rate:{ip}             → counter                     (rate limiting, TTL = 1 minute)
fingerprint:{guest}   → baseline fingerprint JSON   (comparison for fraud detection)
event:{id}:stats      → {total, attending, declined} (dashboard counters, updated on RSVP)
ws:connections:{host}  → set of WebSocket conn IDs   (for push notifications)
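
The `rate:{ip}` key above describes a fixed-window counter. Here is a minimal sketch of that logic, with an in-memory dict standing in for Redis (in Redis this would typically be an INCR plus an expiry on the key). The class name and the default of 60 requests per window are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Fixed-window rate limiter; the dict stands in for Redis keys with a TTL."""

    def __init__(self, limit: int = 60, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._counters: Dict[str, Tuple[int, float]] = {}  # key -> (count, expires_at)

    def allow(self, ip: str) -> bool:
        key = f"rate:{ip}"
        now = self.clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            # Window elapsed (or first request): start a fresh counter
            count, expires_at = 0, now + self.window_s
        count += 1
        self._counters[key] = (count, expires_at)
        return count <= self.limit


limiter = FixedWindowLimiter(limit=3)
print([limiter.allow("203.0.113.7") for _ in range(4)])
```

Fixed windows allow a short burst across a window boundary. If that matters for the RSVP endpoint, a sliding window or token bucket is the usual next step.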

FRAUD ENGINE — DETAIL

Risk Scoring Model

This is not an API call. It's a lightweight ML pipeline you build and can explain.

Features (Input to Model)

| Feature | Source | Weight | Why |
|---------|--------|--------|-----|
| Fingerprint match | Compare current vs baseline | HIGH | Different device = possible shared link |
| IP geolocation distance | MaxMind GeoLite2 (free) | MEDIUM | Guest in London, access from Mumbai |
| Access time pattern | Time of day, day of week | LOW | 3 AM access is unusual |
| Browser consistency | Same browser as registration? | MEDIUM | Chrome → Safari = suspicious |
| Repeated access | How many times token accessed | LOW | Normal people click once or twice |
| Referrer analysis | Where did they come from? | LOW | Direct link vs social media share |

Model

Start simple: Weighted scoring (not even ML yet). Each feature gets a score 0-100, multiply by weight, sum, normalise.
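
A minimal sketch of that weighted scoring, assuming each feature has already been reduced to a 0-100 signal. The numeric weights (HIGH = 3, MEDIUM = 2, LOW = 1) and the feature keys are hypothetical; the source only fixes the HIGH/MEDIUM/LOW labels.

```python
from typing import Dict

# Hypothetical numeric values for the HIGH / MEDIUM / LOW labels
WEIGHTS = {"HIGH": 3.0, "MEDIUM": 2.0, "LOW": 1.0}

# Feature -> weight label, following the feature table
FEATURE_WEIGHTS = {
    "fingerprint_match": "HIGH",
    "ip_geo_distance": "MEDIUM",
    "access_time_pattern": "LOW",
    "browser_consistency": "MEDIUM",
    "repeated_access": "LOW",
    "referrer": "LOW",
}


def risk_score(signals: Dict[str, float]) -> int:
    """Combine per-feature signals (0-100 each) into one 0-100 risk score."""
    total_weight = sum(WEIGHTS[label] for label in FEATURE_WEIGHTS.values())
    weighted_sum = 0.0
    for name, label in FEATURE_WEIGHTS.items():
        value = min(100.0, max(0.0, signals.get(name, 0.0)))  # clamp; missing = 0
        weighted_sum += WEIGHTS[label] * value
    return round(weighted_sum / total_weight)  # normalise back to 0-100


# New device and far-away IP, nothing else unusual
print(risk_score({"fingerprint_match": 100, "ip_geo_distance": 100}))
```

Dividing by the total weight keeps the output on the same 0-100 scale as the inputs, so the thresholds do not need to change when a feature is added or re-weighted.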

Then upgrade: Train a Random Forest classifier on the access_logs table. Label data: host-confirmed fraudulent attempts become positive examples, successful RSVPs become negative examples. Export model with joblib, load at startup, inference in ~2ms.

Interview talking point: "I started with heuristic scoring, then replaced it with a trained model once I had labelled data. The scoring interface didn't change — the fraud engine is a black box to the rest of the system."

Risk Thresholds

| Score | Action | UX |
|-------|--------|----|
| 0-30 | Allow | Smooth RSVP flow |
| 31-60 | Soft verify | "Confirm your name" (knowledge check) |
| 61-85 | SMS verify | Send code to registered phone |
| 86-100 | Block + alert | "This invitation cannot be used. The host has been notified." |
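
The thresholds above map directly onto a small lookup. A sketch, with hypothetical action identifiers:

```python
import bisect

# Inclusive upper bound of each band, and the action for that band
UPPER_BOUNDS = [30, 60, 85, 100]
ACTIONS = ["allow", "soft_verify", "sms_verify", "block_and_alert"]


def action_for_score(score: int) -> str:
    """Return the action for a 0-100 risk score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    # bisect_left finds the first band whose upper bound is >= score
    return ACTIONS[bisect.bisect_left(UPPER_BOUNDS, score)]


for s in (23, 45, 70, 87):
    print(s, action_for_score(s))
```

Keeping the bounds in one list makes it easy to tune thresholds per event later without touching the calling code.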

FRONTEND — NUXT 3

Why Nuxt 3 (Not Next.js)

Your other two projects use Nuxt 3. Consistency across your portfolio means you demonstrate deep framework knowledge, not surface-level familiarity with three different frameworks.

Pages

Public / Guest-Facing

/                           → Landing page (product overview)
/e/{slug}                   → Event public page (if host enables)
/rsvp/{token}               → RSVP flow (the core guest experience)
/rsvp/{token}/confirm       → Post-RSVP confirmation

Host Dashboard

/dashboard                  → Overview (events, stats)
/dashboard/events/new       → Create event (wizard)
/dashboard/events/{id}      → Event detail (guest list, RSVPs, fraud alerts)
/dashboard/events/{id}/guests → Guest management (import, add, remove)
/dashboard/events/{id}/invites → Send invitations (bulk SMS/email)
/dashboard/events/{id}/monitor → Live monitoring (real-time RSVPs + fraud)
/dashboard/events/{id}/export → Export guest list (CSV, PDF)
/dashboard/settings         → Account, billing, notification preferences

The Monitor Page — The DevOps Showcase

This is GuestGuard's equivalent of VerifyHub's Chaos Panel. A real-time operational dashboard that shows:

  • Live RSVP Stream: New RSVPs appearing in real time via WebSocket.
  • Fraud Activity: Flagged access attempts with risk scores, reasons, and fingerprint diffs.
  • Delivery Status: SMS/email send rates, failures, retries.
  • System Health: API latency (p50/p95/p99), fraud engine scoring time, NATS queue depth.

This page is what makes GuestGuard a DevOps project, not just a web app.


DEVOPS & INFRASTRUCTURE

Repository Structure

guestguard/
├── cmd/
│   ├── api/              # Go API binary entrypoint
│   └── notifier/         # Go notification worker entrypoint
├── internal/
│   ├── api/              # HTTP handlers, middleware, WebSocket
│   ├── auth/             # JWT, OAuth
│   ├── domain/           # Business logic (events, guests, RSVPs, tokens)
│   ├── fraud/            # gRPC client for fraud engine
│   ├── nats/             # NATS publisher/subscriber
│   ├── notification/     # Twilio, SES adapters
│   └── storage/          # PostgreSQL, Redis repositories
├── fraud-engine/
│   ├── app/              # FastAPI app
│   ├── scoring/          # Risk scoring logic + ML model
│   ├── consumers/        # NATS event consumers
│   ├── model/            # Trained model artifacts
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/
│   ├── nuxt.config.ts
│   ├── pages/
│   ├── components/
│   ├── composables/
│   └── Dockerfile
├── infra/
│   ├── terraform/
│   │   ├── modules/
│   │   │   ├── eks/
│   │   │   ├── rds/
│   │   │   ├── elasticache/
│   │   │   └── networking/
│   │   ├── environments/
│   │   │   ├── staging/
│   │   │   └── production/
│   │   └── main.tf
│   ├── kubernetes/
│   │   ├── base/         # Kustomize base
│   │   │   ├── api/
│   │   │   ├── fraud-engine/
│   │   │   ├── notifier/
│   │   │   ├── nats/
│   │   │   └── frontend/
│   │   └── overlays/
│   │       ├── staging/
│   │       └── production/
│   └── helm/
│       └── guestguard/   # Helm chart for full deployment
├── monitoring/
│   ├── grafana/
│   │   └── dashboards/
│   │       ├── api-performance.json
│   │       ├── fraud-detection.json
│   │       ├── notification-delivery.json
│   │       └── system-overview.json
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── rules/
│   │       ├── api-alerts.yml
│   │       ├── fraud-alerts.yml
│   │       └── notification-alerts.yml
│   └── loki/
│       └── loki-config.yml
├── .github/
│   └── workflows/
│       ├── ci.yml         # Lint, test, build
│       ├── cd-staging.yml # Deploy to staging
│       └── cd-prod.yml    # Deploy to production (manual approval)
├── docker-compose.yml     # Local development
├── Makefile
└── README.md

CI/CD Pipeline

Push to main
    │
    ▼
┌─────────────────────────────┐
│  GitHub Actions: CI         │
│                             │
│  1. Lint (golangci-lint,    │
│     ruff for Python,        │
│     eslint for frontend)    │
│  2. Unit tests              │
│     - Go: go test ./...     │
│     - Python: pytest        │
│     - Frontend: vitest      │
│  3. Integration tests       │
│     - docker-compose up     │
│     - Test API → NATS →     │
│       Fraud Engine flow     │
│  4. Security scan           │
│     - Trivy on images       │
│     - gosec on Go code      │
│     - bandit on Python      │
│  5. Build Docker images     │
│  6. Push to ECR             │
└─────────────┬───────────────┘
              │
              ▼
┌─────────────────────────────┐
│  GitHub Actions: CD Staging │
│                             │
│  7. Update Kustomize        │
│     overlay with new        │
│     image tags              │
│  8. ArgoCD syncs to         │
│     staging EKS cluster     │
│  9. Smoke tests against     │
│     staging                 │
└─────────────┬───────────────┘
              │
              ▼ (manual approval)
┌─────────────────────────────┐
│  GitHub Actions: CD Prod    │
│                             │
│  10. ArgoCD syncs to        │
│      production             │
│  11. Canary deployment      │
│      (10% → 50% → 100%)    │
│  12. Automated rollback     │
│      on error rate spike    │
└─────────────────────────────┘

Kubernetes Resources

| Component | Replicas | Resources | HPA |
|-----------|----------|-----------|-----|
| Core API | 2 | 256Mi / 0.25 CPU | Scale on CPU > 70% |
| Fraud Engine | 1 | 512Mi / 0.5 CPU | Scale on queue depth |
| Notification Worker | 1 | 128Mi / 0.1 CPU | Scale on queue depth |
| NATS | 1 (JetStream) | 256Mi / 0.25 CPU | |
| Frontend (SSR) | 2 | 256Mi / 0.25 CPU | Scale on CPU > 70% |
| PostgreSQL | RDS (not in cluster) | | |
| Redis | ElastiCache (not in cluster) | | |

Interview talking point: "I run stateful services (Postgres, Redis) as managed AWS services, not in Kubernetes. Running databases in K8s adds operational complexity without benefit for a platform of this scale."

Monitoring & Alerting

Grafana Dashboards

1. API Performance

  • Request rate (req/sec by endpoint)
  • Latency percentiles (p50, p95, p99)
  • Error rate by status code
  • Active WebSocket connections

2. Fraud Detection

  • Access attempts per minute
  • Risk score distribution (histogram)
  • Flagged vs clean ratio
  • Scoring latency (p50, p95)
  • Top flagged events

3. Notification Delivery

  • Send rate by channel (SMS vs email)
  • Delivery success rate
  • Retry queue depth
  • Provider error rates (Twilio vs SES)
  • Cost per notification

4. System Overview

  • NATS queue depths by subject
  • Consumer lag
  • Pod CPU/memory
  • Node health

Alert Rules

# Fraud engine is slow (scoring should be < 50ms)
- alert: FraudScoringLatencyHigh
  expr: histogram_quantile(0.95, sum by (le) (rate(fraud_scoring_duration_seconds_bucket[5m]))) > 0.05
  for: 5m
  annotations:
    summary: "Fraud scoring p95 latency above 50ms"

# Notification delivery failing
- alert: NotificationDeliveryFailureRate
  expr: rate(notifications_failed_total[5m]) / rate(notifications_sent_total[5m]) > 0.1
  for: 5m
  annotations:
    summary: "More than 10% of notifications failing"

# NATS consumer falling behind
- alert: NATSConsumerLag
  expr: nats_consumer_num_pending > 1000
  for: 2m
  annotations:
    summary: "NATS consumer has 1000+ pending messages"

TOKEN SECURITY — DETAIL

Token Generation

header    = { alg: "HS256", typ: "GG" }
payload   = {
  gid: "guest-uuid",          // guest ID
  eid: "event-uuid",          // event ID
  iat: 1717200000,            // issued at
  exp: 1719792000,            // expires (30 days default)
  nonce: "random-16-bytes"    // prevents token prediction
}
signature = HMAC-SHA256(base64url(header) + "." + base64url(payload), server_secret)

Token = base64url(header) + "." + base64url(payload) + "." + base64url(signature)

URL format: https://guestguard.app/rsvp/tk_aBcDeFgHiJkLmNoPqRsT

Properties:

  • Cryptographically signed (can't be forged without the server secret)
  • Contains no PII (guest name, email not in token)
  • Expires after configurable TTL
  • One-time use for RSVP (token marked used after submission)
  • Revocable by host at any time
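
The signing scheme above can be sketched with the standard library alone. The Core API is planned in Go; this sketch uses Python only to keep the examples in this document in one language. It assumes a JWT-style compact layout (three dot-separated base64url parts), and the function names are hypothetical. One-time use and revocation are not shown because they live in the `tokens` table, keyed by the SHA-256 hash.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(guest_id: str, event_id: str, secret: bytes,
                ttl_seconds: int = 30 * 24 * 3600, now: Optional[int] = None) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "GG"}
    payload = {"gid": guest_id, "eid": event_id, "iat": now,
               "exp": now + ttl_seconds, "nonce": secrets.token_hex(16)}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload))
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(token: str, secret: bytes, now: Optional[int] = None) -> dict:
    """Return the payload if the signature and expiry check out, else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, sig_b64 = parts
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    now = int(time.time()) if now is None else now
    if now >= payload["exp"]:
        raise ValueError("token expired")
    return payload


def token_hash(token: str) -> str:
    """SHA-256 hex digest, the value stored in tokens.token_hash."""
    return hashlib.sha256(token.encode("ascii")).hexdigest()
```

Storing only the hash means a database leak does not expose usable invitation links, while lookups by hash stay a single indexed query.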

WHAT THE README SAYS

# GuestGuard

> Stop uninvited guests before they RSVP.

An event-driven RSVP platform with real-time fraud detection. 
Unique, cryptographically signed invitation links + device 
fingerprinting + ML risk scoring = only your actual guests 
can respond.

## Architecture Decisions

**Why a monolith for the core API?**
Events, guests, RSVPs, and auth are a single bounded context. 
Splitting them into microservices would add network hops, 
distributed transactions, and operational overhead with no 
benefit. The monolith serves REST, WebSocket, and publishes 
events — it's not a limitation, it's a deliberate choice.

**Why a separate fraud engine?**
Different language (Python for ML), different scaling profile 
(CPU-bound inference vs I/O-bound API), different deployment 
cadence (model updates). This is the one service that earns 
its independence.

**Why a separate notification worker?**
Sending SMS/email is async, can fail, needs retries, and 
should never block the API. Classic background worker pattern.

**Why NATS?**
Lightweight, persistent with JetStream, and I'm already using 
RabbitMQ in VerifyHub — so I can compare both in interviews.

## Tech Stack

| Layer | Technology | Why |
|-------|-----------|-----|
| Core API | Go | Fast, small binaries, consistent with VerifyHub |
| Fraud Engine | Python (FastAPI) | ML ecosystem (scikit-learn, numpy) |
| Frontend | Nuxt 3 | SSR, consistent across all portfolio projects |
| Message Broker | NATS JetStream | Lightweight event streaming |
| Database | PostgreSQL (RDS) | Relational data with JSONB for flexibility |
| Cache | Redis (ElastiCache) | Token lookup, rate limiting, sessions |
| Monitoring | Prometheus + Grafana | Metrics and dashboards |
| Logging | Loki | Log aggregation |
| CI/CD | GitHub Actions + ArgoCD | GitOps deployment |
| Infrastructure | Terraform + EKS | Consistent with other projects |
| SMS | Twilio | Industry standard |
| Email | AWS SES | Cost-effective, already in AWS |
| GeoIP | MaxMind GeoLite2 | Free, accurate, offline lookup |

## Run Locally

docker-compose up

## Live Demo

[guestguard.kwakudanso.dev](https://guestguard.kwakudanso.dev)

IMPLEMENTATION PLAN

Phase 1: Foundation (Week 1-2)

Goal: Core API serving RSVP flow, tokens working, PostgreSQL schema live.

  • Go project scaffold (cmd/api, internal packages)
  • PostgreSQL schema + migrations (golang-migrate)
  • Token generation and validation
  • Event CRUD endpoints
  • Guest management (add, import CSV, list)
  • RSVP endpoint (validate token → record response)
  • Basic Nuxt 3 frontend (RSVP page only)
  • Docker Compose for local dev
  • Unit tests for token and RSVP logic

Phase 2: Fraud Engine (Week 3-4)

Goal: Device fingerprinting working, risk scoring live, fraud alerts flowing.

  • NATS JetStream setup
  • Core API publishes access events to NATS
  • Fraud Engine consumes events
  • Device fingerprint collection (frontend JS)
  • IP geolocation (MaxMind GeoLite2)
  • Heuristic risk scoring (weighted features)
  • gRPC endpoint for synchronous scoring
  • Fraud Engine publishes scored events
  • Core API consumes fraud scores, flags tokens
  • Integration tests (end-to-end fraud flow)

Phase 3: Notifications + Dashboard (Week 5-6)

Goal: SMS/email working, host dashboard live with real-time updates.

  • Notification Worker consuming NATS events
  • Twilio SMS integration (with retry logic)
  • AWS SES email integration (with templates)
  • WebSocket endpoint in Core API
  • Host dashboard pages (Nuxt 3)
  • Real-time RSVP stream on dashboard
  • Fraud alert display on dashboard
  • Notification delivery status tracking
  • Guest RSVP confirmation page (polished)

Phase 4: Infrastructure + DevOps (Week 7-8)

Goal: Deployed to EKS, monitoring live, CI/CD complete.

  • Terraform modules (EKS, RDS, ElastiCache, networking)
  • Kubernetes manifests (Kustomize base + overlays)
  • GitHub Actions CI pipeline (lint, test, build, scan)
  • GitHub Actions CD pipeline (staging → prod)
  • ArgoCD setup
  • Prometheus + Grafana dashboards (4 dashboards)
  • Loki log aggregation
  • Alert rules
  • Canary deployment configuration
  • README + architecture documentation

Phase 5: Polish + ML Upgrade (Week 9-10)

Goal: ML model trained, landing page polished, documentation complete.

  • Train Random Forest on access_logs data
  • Replace heuristic scoring with ML model
  • Model versioning and deployment pipeline
  • Landing page design and build
  • Event public page
  • Mobile responsive pass on all pages
  • Performance optimisation (API latency, SSR)
  • Security hardening (rate limits, input validation, CORS)
  • Architecture decision records (ADRs)
  • Blog post: "Building a Real-Time Fraud Detection Pipeline"

COST ESTIMATE

| Resource | Monthly Cost |
|----------|--------------|
| EKS cluster (shared with other projects) | $0 (already running) |
| EKS worker nodes (2x t3.medium spot) | ~$30 (shared) |
| RDS PostgreSQL (db.t3.micro) | ~$15 |
| ElastiCache Redis (cache.t3.micro) | ~$12 |
| Twilio SMS (dev volume) | ~$5 |
| AWS SES | ~$1 |
| MaxMind GeoLite2 | Free |
| Total incremental | ~$63/month |

Since you're already running EKS for VerifyHub and Web3Escrow, the cluster itself adds nothing; the incremental cost is the shared worker nodes plus the managed services and per-message fees.


INTERVIEW TALKING POINTS

"Why didn't you make this all microservices?" "Because the core domain — events, guests, RSVPs — is a single bounded context. Splitting it would add distributed transaction complexity for zero benefit. The fraud engine and notification worker are separate because they have genuinely different characteristics: different languages, different scaling profiles, different failure domains."

"How does the fraud detection work?" "When a guest accesses their RSVP link, we collect a device fingerprint and IP. That gets published to NATS, consumed by the fraud engine, which compares it against the baseline from when the invitation was sent. The risk score combines fingerprint similarity, geolocation distance, and behavioural signals. I started with heuristic scoring and then trained a Random Forest classifier once I had labelled data."

"Why NATS instead of Kafka or RabbitMQ?" "I used RabbitMQ in VerifyHub for task queues. GuestGuard needs event streaming — publish access events, multiple consumers (fraud engine, notification worker, analytics). NATS JetStream fits that pattern with less operational overhead than Kafka. For a platform at this scale, Kafka would be over-engineering."

"What happens when the fraud engine is down?" "RSVPs still work. The synchronous gRPC call has a 100ms timeout with a fallback: if the fraud engine is unreachable, the access is allowed but flagged for async review. The async scoring via NATS will catch up when the service recovers. Notifications also continue independently — different failure domain."

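The timeout-with-fallback answer above can be sketched like this. The real call would be a gRPC stub invoked with a deadline from the Go API; here a plain callable stands in for it, and the names (`check_with_fallback`, `Decision`) are hypothetical. Only the block threshold (86 and above) is modelled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    allowed: bool
    score: Optional[int]        # None when the fraud engine did not answer in time
    flagged_for_review: bool    # True when the fallback path was taken


_executor = ThreadPoolExecutor(max_workers=4)


def check_with_fallback(score_fn: Callable[[], int], timeout_s: float = 0.1,
                        block_at: int = 86) -> Decision:
    """Run the synchronous fraud check; fail open (allow + flag) on timeout or error."""
    future = _executor.submit(score_fn)
    try:
        score = future.result(timeout=timeout_s)
    except Exception:
        # Timeout or engine failure: let the RSVP through, review asynchronously
        return Decision(allowed=True, score=None, flagged_for_review=True)
    return Decision(allowed=score < block_at, score=score, flagged_for_review=False)


print(check_with_fallback(lambda: 23))
```

Failing open is a product decision: it favours real guests over strict blocking, and relies on the async NATS path to catch anything the fallback let through.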
"How do you handle 500 simultaneous invitation sends?" "The API publishes 500 notification.send events to NATS. The notification worker processes them with rate limiting (Twilio allows 100/sec). If Twilio throttles us, messages go to the retry queue with exponential backoff. Delivery status is tracked per-notification. The host sees a progress bar via WebSocket, not a spinning loader for 10 minutes."