
Chat with PDF.

Ask any document a question. Even the scanned ones nobody can search.

year · 2025
role · designed, built, shipped
tags · AI, Full-stack
read · ~2 min
live

§01 · architecture

How it moves the data.

§02 · problem

Most PDF interactions are limited to keyword search — useless for scanned documents and unable to understand context. Engineers, researchers, and analysts waste hours skimming long documents to find a single passage.

§03 · approach

An AI-powered document Q&A system. Upload any PDF — scanned or digital — and ask questions in natural language. An async pipeline runs Mistral OCR to extract page text, chunks and embeds it, then answers with hybrid retrieval (vector + full-text). Every answer streams back with citations that link to the source page and highlight the referenced text in an inline PDF viewer.

§04 · decisions

What was chosen.
What was rejected.

d/01

chosen · Convex (real-time DB + functions)

rejected · REST + Postgres + manual websockets

Convex gives real-time reactivity for free. Chat messages appear instantly without polling, and schema changes deploy without migrations. Vendor lock-in is the cost; shipping the full backend in days instead of weeks is the gain.

d/02

chosen · Mistral OCR 4

rejected · Google Document AI

Shipped first on Document AI, then migrated. Mistral OCR returns clean, markdown-structured text — tables, headings, and reading order preserved — from a single API call, instead of Document AI's processor setup and page-by-page batching. Comparable accuracy on scanned and multi-column PDFs, a far simpler integration, and lower cost. A 100-page-per-document cap is the tradeoff.

d/03

chosen · Hybrid retrieval inside Convex

rejected · Pinecone or Weaviate

OpenAI embeddings (text-embedding-3-small) live in Convex's native vector index alongside a full-text index — one data layer, no extra service to run. Each query fans out to both, then a rerank pass merges the results and pulls neighboring chunks for context. Works at current scale; a dedicated index becomes worth it around 10K docs, not before.
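
The merge and neighbor-expansion steps described above could look roughly like the sketch below. The write-up only says "a rerank pass", so reciprocal rank fusion (RRF) is an assumed stand-in for it, and the `Hit` shape, the `fuse` and `neighborKeys` names, and the `docId:index` key format are all hypothetical rather than taken from the project.

```typescript
// Hypothetical result shape; the project's real schema is not shown.
interface Hit {
  chunkId: string; // unique id of the chunk
  docId: string;   // document the chunk belongs to
  index: number;   // position of the chunk within its document
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) per chunk.
// ASSUMPTION: RRF stands in for the unspecified "rerank pass".
function fuse(vectorHits: Hit[], textHits: Hit[], k = 60, limit = 8): Hit[] {
  const scores = new Map<string, { hit: Hit; score: number }>();
  for (const list of [vectorHits, textHits]) {
    list.forEach((hit, i) => {
      const entry = scores.get(hit.chunkId) ?? { hit, score: 0 };
      entry.score += 1 / (k + i + 1); // ranks are 1-based
      scores.set(hit.chunkId, entry);
    });
  }
  return Array.from(scores.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((e) => e.hit);
}

// Neighbor expansion: for every hit, also request the chunks next to it.
// Keys past the end of a document are simply absent when looked up.
function neighborKeys(hits: Hit[], radius = 1): string[] {
  const keys = new Set<string>();
  for (const h of hits) {
    const first = Math.max(0, h.index - radius);
    for (let i = first; i <= h.index + radius; i++) {
      keys.add(`${h.docId}:${i}`);
    }
  }
  return Array.from(keys);
}
```

A chunk that appears in both result lists collects two contributions, so it tends to outrank a chunk that only one index found, which is the behavior a hybrid merge is after.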

§05 · tradeoffs

What this costs.

t/01
Chunks target 450 words with a 75-word overlap, tracking the page span each chunk covers so citations can point back to an exact page. Smaller chunks fragmented context; larger ones dropped retrieval precision. The overlap is what keeps a passage that straddles two chunks retrievable.
t/02
OCR runs asynchronously at upload, not on every query, and retries with backoff (15s, then 60s) on transient failures. The cost of making scanned PDFs queryable is paid once; the 100-page-per-document cap keeps a single job bounded.
t/03
Hybrid retrieval adds a rerank and neighbor-expansion pass over pure vector search, plus a routing step that decides between chunk lookup and document summaries. A few hundred extra milliseconds for a conversational interface, in exchange for answers that stay grounded when the evidence is thin.
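
The chunking in t/01 can be sketched as a sliding window over words. This is a minimal illustration, not the project's code: the `Page` and `Chunk` shapes and the `chunkPages` name are hypothetical, and splitting on whitespace is a simplification that ignores the markdown structure the OCR step returns.

```typescript
// Hypothetical shapes; the project's real types are not shown.
interface Page {
  pageNumber: number;
  text: string;
}

interface Chunk {
  text: string;
  startPage: number; // first page the chunk touches
  endPage: number;   // last page the chunk touches
}

function chunkPages(pages: Page[], targetWords = 450, overlapWords = 75): Chunk[] {
  if (overlapWords >= targetWords) {
    throw new Error("overlap must be smaller than the target size");
  }

  // Flatten the document to words, remembering which page each came from.
  const words: { word: string; page: number }[] = [];
  for (const p of pages) {
    for (const w of p.text.split(/\s+/)) {
      if (w.length > 0) words.push({ word: w, page: p.pageNumber });
    }
  }

  const chunks: Chunk[] = [];
  const step = targetWords - overlapWords; // each window starts this far after the last
  for (let start = 0; start < words.length; start += step) {
    const slice = words.slice(start, start + targetWords);
    chunks.push({
      text: slice.map((w) => w.word).join(" "),
      startPage: slice[0].page,
      endPage: slice[slice.length - 1].page,
    });
    if (start + targetWords >= words.length) break; // this window reached the end
  }
  return chunks;
}
```

Because each window starts `targetWords - overlapWords` words after the previous one, the last 75 words of a chunk reappear at the start of the next, and the recorded page span is what a citation would link back to.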
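
The retry schedule in t/02 (15s, then 60s) can be expressed as a small wrapper. This is a sketch under assumptions: `withRetry`, `isTransient`, and the injectable `sleep` are hypothetical names, and in a Convex backend the delay would more plausibly be a scheduled re-run of the job than an in-process wait.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `job`, retrying after each delay in `delaysMs` while the error is
// transient. With the default schedule that is at most three attempts.
async function withRetry<T>(
  job: () => Promise<T>,
  isTransient: (err: unknown) => boolean, // hypothetical classifier
  delaysMs: number[] = [15_000, 60_000],
  sleep: Sleep = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await job();
    } catch (err) {
      // Give up on permanent errors or once the schedule is exhausted.
      if (!isTransient(err) || attempt >= delaysMs.length) throw err;
      await sleep(delaysMs[attempt]);
    }
  }
}
```

Passing `sleep` as a parameter keeps the schedule testable without waiting for real timers.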

§06 · impact

What this returned.

150+ · documents indexed

100+ · active users

<3s · average answer time

§07 · stack

Next.js · Convex · Mistral OCR 4 · OpenAI · TypeScript · React · Tailwind · Clerk

last edited · 2026-05-11 · ~2 min read
