Prototype Scope

The prototype proves the operating model end-to-end on one representative workflow — bulk card spending-limit updates — with a real chat interface, LLM pipeline, async execution engine, and observability layer.

Tech stack

| Layer | Choice | Why |
|---|---|---|
| Frontend | Next.js 16 + React 19 + Tailwind 4 | App Router with real-time Convex hooks; dark-mode design system built on CSS tokens |
| Backend / DB | Convex 1.35 | Real-time subscriptions replace polling; mutations, queries, and scheduled actions in one runtime |
| LLM | OpenAI (gpt-4o, text-embedding-3-small) | Structured output mode for intent extraction; embeddings power semantic KB search |
| Schema validation | Zod 4 | Single source of truth for intent shape, shared between LLM output and Convex mutations |

What has been built

Chat interface

  • Natural-language input with five example prompts for a quick start
  • User bubbles are editable inline — click to revise and re-run without losing thread context
  • Per-response retry button re-runs the original request with a new idempotency key
  • Thumbs up / down feedback on every AI response (answer and bulk op) — wired to the metrics store

LLM pipeline (interpreter)

  • Single system prompt classifies input as either a question or a bulk_op request
  • RAG: top-k semantic search against embedded KB articles (OpenAI text-embedding-3-small, 1536-dim) injects relevant policy context into the prompt before calling the LLM
  • Structured output (Zod schema) extracts intent, targetGroup, newLimit, notifyCardholders
  • Unknown or unsupported intents surface a graceful "not supported" thread entry rather than silently failing
  • KB gap logging: questions with low-confidence KB matches are recorded as kb_gap metric events
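
The real pipeline defines this contract with Zod 4; as a rough, dependency-free sketch of the same intent shape (field names `intent`, `targetGroup`, `newLimit`, `notifyCardholders` are from above; the guard logic itself is illustrative):

```typescript
// Sketch of the structured-output contract the interpreter extracts.
type BulkOpIntent = {
  intent: "bulk_update_card_limit" | "bulk_freeze_cards" | "bulk_notify_cardholders";
  targetGroup: string;       // e.g. "marketing team"
  newLimit: number | null;   // SGD; null for non-limit operations
  notifyCardholders: boolean;
};

const KNOWN_INTENTS = [
  "bulk_update_card_limit",
  "bulk_freeze_cards",
  "bulk_notify_cardholders",
];

// Minimal runtime guard mirroring what the Zod schema enforces; returns null
// for unknown/unsupported intents so the caller can render "not supported".
function parseIntent(raw: unknown): BulkOpIntent | null {
  if (typeof raw !== "object" || raw === null) return null;
  const o = raw as Record<string, unknown>;
  if (!KNOWN_INTENTS.includes(o.intent as string)) return null;
  if (typeof o.targetGroup !== "string") return null;
  if (o.newLimit !== null && typeof o.newLimit !== "number") return null;
  if (typeof o.notifyCardholders !== "boolean") return null;
  return o as BulkOpIntent;
}
```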

Policy validation

  • bulk_update_card_limit only — bulk_freeze_cards and bulk_notify_cardholders are recognised but not yet wired end-to-end
  • Frozen and cancelled cards excluded automatically (policy rule P6)
  • Operations affecting > 25 eligible cards are flagged approvalRequired: true (P4)
  • Hard cap: 200 cards max per operation (P5); max SGD 5,000 per-card limit
  • Exclusion reasons and policy notes stored on the job record and surfaced in the confirmation UI
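
Rules P4–P6 compose into a single validation pass. A sketch of that pass (the card shape and function names are assumptions; thresholds are the ones stated above):

```typescript
// Sketch of the policy checks P4/P5/P6; not the prototype's actual module.
type Card = { id: string; status: "active" | "frozen" | "cancelled" };

const APPROVAL_THRESHOLD = 25; // P4: > 25 eligible cards needs approval
const HARD_CAP = 200;          // P5: max cards per operation
const MAX_LIMIT_SGD = 5000;    // P5: max per-card limit

type PolicyResult =
  | { ok: false; reason: string }
  | {
      ok: true;
      eligible: Card[];
      excluded: { card: Card; reason: string }[]; // surfaced in confirmation UI
      approvalRequired: boolean;
    };

function applyPolicy(cards: Card[], newLimit: number): PolicyResult {
  if (newLimit > MAX_LIMIT_SGD) {
    return { ok: false, reason: "limit exceeds SGD 5,000 per-card cap (P5)" };
  }
  // P6: frozen and cancelled cards are excluded automatically, with reasons.
  const excluded = cards
    .filter((c) => c.status !== "active")
    .map((card) => ({ card, reason: `card is ${card.status} (P6)` }));
  const eligible = cards.filter((c) => c.status === "active");
  if (eligible.length > HARD_CAP) {
    return { ok: false, reason: "more than 200 eligible cards (P5)" };
  }
  return { ok: true, eligible, excluded, approvalRequired: eligible.length > APPROVAL_THRESHOLD };
}
```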

Confirmation screen

  • Shows target group, new limit, total resolved, eligible, excluded (with per-card reasons), and approval flag
  • Idempotency key prevents duplicate jobs from double-submission or rapid retry
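
The idempotency mechanism can be sketched as a key-to-job lookup performed before creation; here the store is an in-memory stand-in for the Convex jobs table, and all names are illustrative:

```typescript
// Double-submits and rapid retries with the same key return the existing job
// instead of creating a duplicate.
const jobsByKey = new Map<string, string>(); // idempotencyKey -> jobId

function confirmJobOnce(idempotencyKey: string, createJob: () => string): string {
  const existing = jobsByKey.get(idempotencyKey);
  if (existing !== undefined) return existing; // duplicate submission: reuse
  const jobId = createJob();
  jobsByKey.set(idempotencyKey, jobId);
  return jobId;
}
```

Note that the per-response retry button (above) deliberately uses a *new* idempotency key, so it creates a fresh job rather than hitting this dedup path.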

Async execution engine

  • confirmJob fans out one job_item record per eligible card, then schedules each item as a Convex scheduled action
  • Staggered start times (random 500–3,000 ms) simulate realistic API call pacing
  • Deterministic mock card API: 18% transient failure rate, 100% permanent failure for compliance-locked cards
  • Exponential backoff retry: 1 s → 2 s → 4 s, max 3 attempts; retryable vs. permanent failure distinction enforced by failureCode
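
The retry schedule above (1 s → 2 s → 4 s, max 3 attempts, failureCode gating) reduces to a small pure function. A sketch, where the failureCode string values are assumptions:

```typescript
// Returns the backoff delay before the given retry attempt, or null if the
// item should not be retried (permanent failure or retries exhausted).
const BASE_DELAY_MS = 1000;
const MAX_ATTEMPTS = 3;

function nextRetryDelayMs(failureCode: string, attempt: number): number | null {
  if (failureCode === "COMPLIANCE_LOCKED") return null; // permanent: never retry
  if (attempt > MAX_ATTEMPTS) return null;              // retries exhausted
  return BASE_DELAY_MS * 2 ** (attempt - 1);            // 1 s, 2 s, 4 s
}
```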

Job lifecycle controls

  • Cancel: marks all queued items cancelled; in-flight items run to completion
  • Retry failed: creates a scoped re-run targeting only failed_retryable items; original job record unchanged
  • Job terminal states: completed, completed_with_failures, cancelled, failed
  • Item states: queued → in_progress → succeeded | failed_retryable | failed_permanent | cancelled | skipped
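
The item states above imply a small state machine. A sketch of the transition table, inferred from the listed lifecycle controls (e.g. cancel only affects queued items, retry re-enters execution from failed_retryable):

```typescript
type ItemState =
  | "queued" | "in_progress" | "succeeded"
  | "failed_retryable" | "failed_permanent" | "cancelled" | "skipped";

// Allowed transitions; terminal states have no outgoing edges.
const TRANSITIONS: Record<ItemState, ItemState[]> = {
  queued: ["in_progress", "cancelled", "skipped"],
  in_progress: ["succeeded", "failed_retryable", "failed_permanent"],
  failed_retryable: ["in_progress"], // "retry failed" re-runs these items
  succeeded: [],
  failed_permanent: [],
  cancelled: [],
  skipped: [],
};

function canTransition(from: ItemState, to: ItemState): boolean {
  return TRANSITIONS[from].includes(to);
}
```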

Real-time progress

  • Convex subscriptions push live counts to the job progress component — no polling
  • Live counts: succeeded · failed · retrying · cancelled · remaining queued
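
In the prototype these counts arrive via a Convex subscription; the reduction the progress component renders can be sketched as a pure fold over item states (mapping "failed" to failed_permanent is an assumption):

```typescript
type ItemStatus =
  | "queued" | "in_progress" | "succeeded"
  | "failed_retryable" | "failed_permanent" | "cancelled";

// Derive the live counters shown in the job progress component.
function liveCounts(items: { status: ItemStatus }[]) {
  const count = (s: ItemStatus) => items.filter((i) => i.status === s).length;
  return {
    succeeded: count("succeeded"),
    failed: count("failed_permanent"),
    retrying: count("failed_retryable"),
    cancelled: count("cancelled"),
    remaining: count("queued"),
  };
}
```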

Job detail page

  • Per-item status table with colour-coded badges
  • Per-card runbook panel: freeze / unfreeze, block (with confirmation guard), report fraud, list recent transactions — all wired to Convex mutations against the mock card store

Metrics dashboard (/metrics)

  • Stats cards: total jobs, AI draft acceptance rate, thumbs-up / thumbs-down counts
  • Top-5 KB gap table: questions where the KB returned no strong match, ranked by frequency
  • Data sourced from append-only metrics_events table (last 1 000 events)
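
A sketch of how a stat like the draft acceptance rate could be derived from that append-only table (the event type names are assumptions, not the prototype's actual schema):

```typescript
type MetricsEvent = {
  type: "draft_accepted" | "draft_rejected" | "thumbs_up" | "thumbs_down" | "kb_gap";
  question?: string;
};

// Acceptance rate over the dashboard's window (last 1,000 events);
// null when there is no accept/reject signal yet.
function acceptanceRate(events: MetricsEvent[]): number | null {
  const recent = events.slice(-1000);
  const accepted = recent.filter((e) => e.type === "draft_accepted").length;
  const rejected = recent.filter((e) => e.type === "draft_rejected").length;
  const total = accepted + rejected;
  return total === 0 ? null : accepted / total;
}
```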

KB article store

  • Convex table with OpenAI embeddings, vector index (by_embedding, 1536 dims)
  • Ingestion script reads from datasets/reap-help-center.jsonl; re-ingestion is idempotent on articleId
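
The "low-confidence KB match" signal behind kb_gap events amounts to thresholding a similarity score. A sketch with cosine similarity; the cutoff value is an assumption, not the prototype's tuned threshold:

```typescript
// Assumed cutoff below which the best KB hit counts as "no strong match".
const KB_GAP_THRESHOLD = 0.75;

// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function isKbGap(bestMatchScore: number): boolean {
  return bestMatchScore < KB_GAP_THRESHOLD; // record a kb_gap metric event
}
```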

Still mocked / out of scope

The prototype does NOT:
  • Call real Reap APIs — card service is a seeded Convex table; no real card is ever modified
  • Send real notifications — notifyCardholders is stored but no email or SMS is dispatched
  • Enforce authentication or RBAC — all users treated as admin; approval gate is flagged but has no enforcement UI
  • Wire bulk_freeze_cards / bulk_notify_cardholders end-to-end — recognised by the LLM but execution path is not implemented
  • Export failure reports as CSV — planned but not yet built
  • Support multi-region or multi-currency policy variations — single flat ruleset
  • Use durable workflow orchestration (Temporal) — Convex scheduled actions are sufficient for the prototype; durability gaps remain under extreme failure scenarios

Iteration roadmap

| Iteration | Focus | Key additions |
|---|---|---|
| v0 — Prototype ✓ | One workflow end-to-end: chat → intent → policy check → confirmation → async execution → retry / cancel | Proves the operating model. LLM pipeline with RAG, real-time job progress, deterministic failure simulation, per-card runbooks, metrics dashboard. |
| v1 — Reusable framework | Same job lifecycle for multiple operation types | Wire bulk_freeze_cards and bulk_notify_cardholders end-to-end. Add RBAC enforcement. Implement CSV failure export. Add approval workflow UI for flagged jobs. |
| v2 — Policy-aware copilot | Richer policy retrieval; better block / approval explanations; escalation notes | Connect to a real policy registry. Add KYC hold rules, regional carve-outs, approval expiry. LLM explains policy blocks in plain language. Draft customer-facing comms for successful operations. |
| v3 — Selective auto-resolution | Auto-resolve only well-understood, policy-safe, low-volume operations with historical metrics validating safety | Auto-approve routine operations (<10 items, no policy conflicts, operation type with >99% historical success rate). Escalate only exceptions. Requires 90 days of v2 data to establish baselines. |