feat(tier2): finish the finish line — Block H follow-ups, Block G geolocation, cross-cutting

Three threads of work land here together to close out Tier 2.

### Block H follow-ups: day-of check-in

- Scanner is now an "open on your phone" magic-link flow. Hosts on desktop mint a scoped JWT via POST /events/{id}/scanner-ticket and render its URL into a QR; the phone scans it and lands on /scanner with the ticket as bearer. The ticket carries Audience=scanner so it can never substitute for a session token.
- Plus-one confirmation at the door: scan → POST /check-in/preview to fetch guest + expected party size → confirm buttons ("Just them", "Party of N", custom) → POST /check-in. No more silent arrival_count=1.
- Offline scan queue: failed POSTs go into an IndexedDB store and drain on the 'online' event with poison-message protection.
- Day-of arrivals headline widget on the event overview, gated to the host's local calendar date so it doesn't dominate the page weeks out.
- Tab nav restyled with inline heroicons + scrollable segmented control; Check-in moves to the rightmost slot.
- PWA: manifest + service worker scoped to /scanner, generated 192/512 icons (Go scripted renderer in scripts/gen-scanner-icons.go).
- Confirmation email QR was rendering broken because html/template rewrites data: URLs to #ZgotmplZ; mark the value as template.URL.
- Email "open your invitation" link 404'd because we had no token to put after /rsvp/. Threaded AccessLink through the RSVPConfirmed NATS event from the API at submit time.

### Block G remainder: geolocation + threshold preview

- Pluggable GeoResolver in the fraud engine (NullResolver, IPApiResolver for the free ip-api.com fallback, MaxMindResolver behind GG_GEOIP_DB_PATH). Wrapped in a Redis cache (30d TTL). Geo flows through both gRPC and NATS scoring paths.
- geo_jump scoring feature: >500km in <1h flags ("accessed from Lagos and Paris within 12 minutes"); >500km in <6h is a softer signal. The existing single-signal cap keeps a lone geo_jump in MEDIUM.
- FraudScored event carries geo_country/city/lat/lon; ApplyScore uses COALESCE so a later re-score without geo doesn't wipe earlier data.
- Threshold-slider live preview: GET /events/{id}/security/thresholds/preview returns band counts the host's existing access events would have fallen into under the proposed thresholds. Debounced (250ms) widget under the Advanced sliders so the host gets concrete feedback instead of guessing.

### Cross-cutting: audit, tier-gating, feature flags

- audit_log table + internal/audit.Recorder (async fire-and-forget on detached context so an audit blip never fails the real action). Wired into branding update, thresholds update, allowlist add/remove, collaborator invite/role-change/remove, message create/send-now/cancel.
- Tier-gating: extended billing.Limits with MaxCollaborators, CustomBranding, Scanner, Broadcasts. Free = none; Pro = 5 + all; Business = unlimited. Gates the scanner-ticket, message create, branding put, and collaborator invite endpoints with 402 + structured upgrade payload. Auto-reminders, fraud detection, and analytics deliberately stay on every tier: those are safety + visibility features, not upsell levers.
- Feature flags: feature_flags table + internal/flags.Store with 30s in-memory refresh, stable sha256(key + user_id) percent bucketing, unknown-key-defaults-on. Six Tier 2 flags pre-seeded. Three handlers (branding, broadcasts, scanner) check the kill switch ahead of the tier gate so ops can pull a feature back without a redeploy.

### Verified

- go test ./... + fraud-engine pytest (12/12 incl. 3 new geo_jump tests + 5 new flags tests).
- docker compose build + up across api, fraud-engine, notifier, frontend.
- /health endpoints 200; migrations 0014 + 0015 applied; 6 flags seeded; audit_log table + partial indexes confirmed.
- Fraud-engine logs confirm geo resolver kind=CachedGeoResolver provider=auto.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 20:30:02 +01:00
parent 003a320690
commit 98678ff5a3
49 changed files with 3798 additions and 238 deletions
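The geo_jump thresholds in the commit message (>500km in <1h as a flag, >500km in <6h as a softer signal) can be illustrated with a small sketch. This is not the fraud engine's code, which is not shown in this diff: the `Point` type, the `geoJump` function, the "strong"/"soft"/"none" labels, and the use of the haversine formula for distance are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Point is a hypothetical stand-in for a resolved geo location.
type Point struct{ Lat, Lon float64 }

const earthRadiusKm = 6371.0

// haversineKm returns the great-circle distance between a and b in kilometres.
func haversineKm(a, b Point) float64 {
	rad := func(deg float64) float64 { return deg * math.Pi / 180 }
	dLat := rad(b.Lat - a.Lat)
	dLon := rad(b.Lon - a.Lon)
	h := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(rad(a.Lat))*math.Cos(rad(b.Lat))*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusKm * math.Asin(math.Sqrt(h))
}

// geoJump classifies two consecutive accesses with the thresholds quoted in
// the commit message. The return labels are illustrative, not real band names.
func geoJump(a, b Point, gap time.Duration) string {
	if haversineKm(a, b) <= 500 {
		return "none"
	}
	switch {
	case gap < time.Hour:
		return "strong"
	case gap < 6*time.Hour:
		return "soft"
	default:
		return "none"
	}
}

func main() {
	lagos := Point{Lat: 6.5244, Lon: 3.3792}
	paris := Point{Lat: 48.8566, Lon: 2.3522}
	fmt.Println(geoJump(lagos, paris, 12*time.Minute)) // strong
	fmt.Println(geoJump(lagos, paris, 3*time.Hour))    // soft
	fmt.Println(geoJump(lagos, paris, 24*time.Hour))   // none
}
```

How the real scorer weights the two signals is not visible here; the commit message only states that the existing single-signal cap keeps a lone geo_jump in MEDIUM.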
@@ -0,0 +1,176 @@
+// Package flags loads + serves feature flag decisions.
+//
+// Why this exists: even with tier-gating and audit logs, you sometimes
+// just want to turn a feature off RIGHT NOW (e.g. the new geo-jump
+// scorer is throwing false positives during a real event). A row in
+// the feature_flags table flips it without a redeploy.
+//
+// Design notes:
+//
+//   - Flag values are loaded into an in-memory map and refreshed in the
+//     background every 30s. A check is a map lookup; the cost of a
+//     gate vanishes from the hot path.
+//
+//   - Default for unknown flags is enabled. New code wiring a gate
+//     ships live; ops disables later if needed.
+//
+//   - Percent rollout uses a stable hash of (flag_key, user_id) so the
+//     same user sees a consistent decision across requests. Anonymous
+//     callers (uuid.Nil) are on for any partial rollout (1-99%) because
+//     they're treated as the public path; 0% is still off for them.
+package flags
+
+import (
+	"context"
+	"crypto/sha256"
+	"encoding/binary"
+	"log/slog"
+	"sync"
+	"time"
+
+	"github.com/google/uuid"
+	"github.com/jackc/pgx/v5/pgxpool"
+)
+
+type Flag struct {
+	Key            string
+	Enabled        bool
+	PercentRollout int
+}
+
+// Store loads + serves feature flag decisions. Zero-value Store
+// allows everything (handy for tests).
+type Store struct {
+	pool   *pgxpool.Pool
+	logger *slog.Logger
+
+	mu    sync.RWMutex
+	flags map[string]Flag
+}
+
+// New returns a Store with no flags loaded yet. Call Refresh for a
+// one-off load, or Start to load once and then keep refreshing in the
+// background.
+func New(pool *pgxpool.Pool, logger *slog.Logger) *Store {
+	if logger == nil {
+		logger = slog.Default()
+	}
+	return &Store{
+		pool:   pool,
+		logger: logger,
+		flags:  map[string]Flag{},
+	}
+}
+
+// Start loads once (errors ignored, retried on the next tick), then refreshes
+// every 30 seconds. It returns a stop function for the caller to defer.
+func (s *Store) Start(ctx context.Context) func() {
+	if s == nil || s.pool == nil {
+		return func() {}
+	}
+	_ = s.Refresh(ctx)
+	tickerCtx, cancel := context.WithCancel(ctx)
+	go func() {
+		t := time.NewTicker(30 * time.Second)
+		defer t.Stop()
+		for {
+			select {
+			case <-tickerCtx.Done():
+				return
+			case <-t.C:
+				if err := s.Refresh(tickerCtx); err != nil {
+					s.logger.Warn("feature flags refresh failed", "err", err)
+				}
+			}
+		}
+	}()
+	return cancel
+}
+
+// Refresh re-reads the table. Errors leave the previous in-memory
+// snapshot intact so a transient DB blip doesn't black out the gate
+// for every request.
+func (s *Store) Refresh(ctx context.Context) error {
+	if s == nil || s.pool == nil {
+		return nil
+	}
+	rows, err := s.pool.Query(ctx, `SELECT key, enabled, percent_rollout FROM feature_flags`)
+	if err != nil {
+		return err
+	}
+	defer rows.Close()
+	next := make(map[string]Flag, 16)
+	for rows.Next() {
+		var f Flag
+		var pct int16
+		if err := rows.Scan(&f.Key, &f.Enabled, &pct); err != nil {
+			return err
+		}
+		f.PercentRollout = int(pct)
+		next[f.Key] = f
+	}
+	if err := rows.Err(); err != nil {
+		return err
+	}
+	s.mu.Lock()
+	s.flags = next
+	s.mu.Unlock()
+	return nil
+}
+
+// Enabled returns true when the gate identified by `key` is on for
+// `subject`. A subject is normally a userID; pass uuid.Nil for global
+// gates that have no user dimension (the call still respects on/off
+// state and a 0% rollout, but skips the split for 1-99% rollouts).
+//
+// Unknown flag → enabled (safe default for new code).
+func (s *Store) Enabled(key string, subject uuid.UUID) bool {
+	if s == nil {
+		return true
+	}
+	s.mu.RLock()
+	f, known := s.flags[key]
+	s.mu.RUnlock()
+	if !known {
+		return true
+	}
+	if !f.Enabled {
+		return false
+	}
+	if f.PercentRollout >= 100 {
+		return true
+	}
+	if f.PercentRollout <= 0 {
+		return false
+	}
+	if subject == uuid.Nil {
+		return true
+	}
+	return percentBucket(key, subject) < f.PercentRollout
+}
+
+// Snapshot returns the current flag set — handy for an admin GET so
+// ops can see what's actually loaded without diving into the DB.
+func (s *Store) Snapshot() map[string]Flag {
+	if s == nil {
+		return nil
+	}
+	s.mu.RLock()
+	defer s.mu.RUnlock()
+	out := make(map[string]Flag, len(s.flags))
+	for k, v := range s.flags {
+		out[k] = v
+	}
+	return out
+}
+
+// percentBucket maps (flag, subject) → [0, 100). Stable and near-uniform (modulo bias is negligible).
+func percentBucket(key string, subject uuid.UUID) int {
+	h := sha256.New()
+	h.Write([]byte(key))
+	h.Write([]byte{0})
+	h.Write(subject[:])
+	sum := h.Sum(nil)
+	n := binary.BigEndian.Uint32(sum[:4])
+	return int(n % 100)
+}
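To see the bucketing behave as described, the sketch below reproduces the same hashing scheme using only the standard library and checks two properties: the bucket for a given (key, subject) pair never changes, and roughly pct% of subjects land below pct. The `subjectID` type is a hypothetical stand-in for uuid.UUID (which is also a 16-byte array), and `shareBelow` with its synthetic counter-based IDs is an illustration, not part of the package.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// subjectID is a hypothetical stand-in for uuid.UUID, which is also [16]byte.
type subjectID [16]byte

// bucket follows the same scheme as percentBucket: SHA-256 over the key,
// a zero separator byte, and the subject, reduced to [0, 100).
func bucket(key string, subject subjectID) int {
	h := sha256.New()
	h.Write([]byte(key))
	h.Write([]byte{0})
	h.Write(subject[:])
	sum := h.Sum(nil)
	return int(binary.BigEndian.Uint32(sum[:4]) % 100)
}

// shareBelow reports the fraction of n synthetic subjects whose bucket is
// below pct, i.e. the share that a pct% rollout would switch on.
func shareBelow(key string, pct, n int) float64 {
	hits := 0
	for i := 0; i < n; i++ {
		var id subjectID
		binary.BigEndian.PutUint64(id[8:], uint64(i))
		if bucket(key, id) < pct {
			hits++
		}
	}
	return float64(hits) / float64(n)
}

func main() {
	var id subjectID
	id[15] = 42
	fmt.Println(bucket("scanner", id) == bucket("scanner", id)) // true
	fmt.Printf("%.3f\n", shareBelow("scanner", 25, 20000))      // close to 0.250
}
```

Because 2^32 is not a multiple of 100 (the remainder is 96), buckets 0 through 95 each receive one extra value out of roughly 42.9 million, which is far too small to matter for rollout percentages.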
@@ -0,0 +1,67 @@
+package flags
+
+import (
+	"testing"
+
+	"github.com/google/uuid"
+)
+
+// A nil *Store reports every flag as enabled, which is handy for hot
+// paths where the store isn't wired (tests, init).
+func TestNilStoreAllows(t *testing.T) {
+	var s *Store
+	if !s.Enabled("anything", uuid.New()) {
+		t.Fatal("nil *Store must report Enabled=true so absent infra never disables behaviour")
+	}
+}
+
+// Unknown keys default to enabled — the safe default for "we just
+// added a gate; ship code first, write the row later if needed".
+func TestUnknownKeyDefaultsOn(t *testing.T) {
+	s := New(nil, nil)
+	if !s.Enabled("never_seeded", uuid.New()) {
+		t.Errorf("unknown key should be enabled by default")
+	}
+}
+
+func TestExplicitlyDisabledKey(t *testing.T) {
+	s := New(nil, nil)
+	s.flags["kill"] = Flag{Key: "kill", Enabled: false, PercentRollout: 100}
+	if s.Enabled("kill", uuid.New()) {
+		t.Errorf("disabled flag must be off regardless of percent")
+	}
+}
+
+func TestPercentRolloutZeroDisablesForUsers(t *testing.T) {
+	s := New(nil, nil)
+	s.flags["k"] = Flag{Key: "k", Enabled: true, PercentRollout: 0}
+	if s.Enabled("k", uuid.New()) {
+		t.Errorf("0%% rollout should be off for any user")
+	}
+}
+
+// Stable bucketing — same (key, user) must always return the same
+// decision so a user doesn't flap on and off across requests.
+func TestPercentBucketIsStable(t *testing.T) {
+	s := New(nil, nil)
+	s.flags["k"] = Flag{Key: "k", Enabled: true, PercentRollout: 50}
+	u := uuid.New()
+	first := s.Enabled("k", u)
+	for i := 0; i < 20; i++ {
+		if s.Enabled("k", u) != first {
+			t.Fatalf("flag decision for user %v flapped on iteration %d", u, i)
+		}
+	}
+}
+
+// Anonymous callers (uuid.Nil) skip percent-rollout splits because
+// they're treated as the public path. Otherwise uuid.Nil would hash to
+// one fixed bucket per flag, so a 25%-rollout flag would either always
+// allow or always refuse anonymous traffic, depending on the key.
+func TestAnonymousIsAlwaysOn(t *testing.T) {
+	s := New(nil, nil)
+	s.flags["k"] = Flag{Key: "k", Enabled: true, PercentRollout: 1}
+	if !s.Enabled("k", uuid.Nil) {
+		t.Errorf("uuid.Nil subject should be Enabled=true for partial rollouts")
+	}
+}
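The commit message says three handlers check the kill switch ahead of the tier gate, and that tier failures answer with 402 plus a structured upgrade payload. The handler code is not part of this excerpt, so the following is only a sketch of that ordering under stated assumptions: `guard`, `flagChecker`, `tierChecker`, the JSON payload fields, and the 503 status for a disabled flag are all hypothetical. The real Store.Enabled also takes a uuid.UUID subject, which is omitted here to keep the example standard-library only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// flagChecker and tierChecker are hypothetical seams standing in for the
// flags store and the billing limits check.
type flagChecker func(key string) bool
type tierChecker func(r *http.Request) bool

// guard checks the kill switch first and the tier gate second, so a disabled
// flag short-circuits before any billing logic runs.
func guard(flagKey string, flagOn flagChecker, tierOK tierChecker, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !flagOn(flagKey) {
			// Assumed status; the real handlers may respond differently.
			http.Error(w, "feature temporarily unavailable", http.StatusServiceUnavailable)
			return
		}
		if !tierOK(r) {
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusPaymentRequired)
			// Assumed payload shape for the "structured upgrade payload".
			_ = json.NewEncoder(w).Encode(map[string]string{
				"error":   "upgrade_required",
				"feature": flagKey,
			})
			return
		}
		next.ServeHTTP(w, r)
	})
}

// status sends one request through the guard and returns the response code.
func status(flagOn, paid bool) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := guard("scanner",
		func(string) bool { return flagOn },
		func(*http.Request) bool { return paid },
		ok)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/events/1/scanner-ticket", nil))
	return rec.Code
}

func main() {
	fmt.Println(status(false, false), status(true, false), status(true, true)) // 503 402 200
}
```

Checking the flag first means a pulled feature looks the same to every tier, which matches the stated goal of letting ops pull a feature back without a redeploy.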