[Service Name] · RAG search
A retrieval-augmented generation service. Ingest documents, chunk them, embed them, store the vectors, retrieve at query time, ground the answer in real sources.
Source of truth
Production runs as a Node service on Fly.io (or anywhere with Postgres + outbound HTTPS). Pinecone holds vectors; Postgres holds metadata; S3-compatible storage holds the original docs. The ingestion job is the source of truth for what's searchable.
Tech stack
Node 22 + TypeScript + Fastify (lighter than Express for an API-only service). Pinecone for vectors. OpenAI text-embedding-3-large for embeddings. Anthropic Claude (or OpenAI GPT) for the final answer-generation step. Postgres for chunk metadata + source URLs. BullMQ + Redis for ingestion job queue. Pino logs.
Deploy
fly deploy from local. Postgres + Redis on Fly. Pinecone serverless index lives in their cloud. OpenAI / Anthropic keys via fly secrets set.
File map
- src/index.ts: Fastify app + route mounting
- src/ingest/: document loader, chunker (semantic + token-based), embedder, vector writer
- src/query/: retrieval (top-k from Pinecone), reranking, prompt assembly, LLM call, citation builder
- src/jobs/: BullMQ workers for async ingestion
- src/db/schema.ts: sources, chunks, embeddings_meta tables
- src/lib/openai.ts, src/lib/anthropic.ts: wrapped SDK clients
- src/lib/pinecone.ts: index client
- prompts/: system prompts for retrieval + answer steps, version-tracked
.env keys
- DATABASE_URL
- REDIS_URL
- PINECONE_API_KEY, PINECONE_INDEX_NAME
- OPENAI_API_KEY
- ANTHROPIC_API_KEY
- EMBEDDING_MODEL: default text-embedding-3-large
- ANSWER_MODEL: default claude-sonnet-4-6
- CHUNK_SIZE_TOKENS: default 800
- CHUNK_OVERLAP_TOKENS: default 100
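To make the two chunking keys concrete, here is a minimal sketch of how CHUNK_SIZE_TOKENS and CHUNK_OVERLAP_TOKENS could translate into token windows. The helper names loadChunkConfig and chunkWindows are hypothetical, not the names used in src/ingest/, and the sliding-window scheme is an assumption about how the token-based chunker behaves.

```typescript
// Sketch only: loadChunkConfig and chunkWindows are hypothetical helpers.
interface ChunkConfig {
  sizeTokens: number;
  overlapTokens: number;
}

// Read the two chunking keys, falling back to the documented defaults (800 / 100).
function loadChunkConfig(env: Record<string, string | undefined>): ChunkConfig {
  const sizeTokens = Number(env.CHUNK_SIZE_TOKENS ?? 800);
  const overlapTokens = Number(env.CHUNK_OVERLAP_TOKENS ?? 100);
  if (!Number.isInteger(sizeTokens) || sizeTokens <= 0) {
    throw new Error("CHUNK_SIZE_TOKENS must be a positive integer");
  }
  if (!Number.isInteger(overlapTokens) || overlapTokens < 0 || overlapTokens >= sizeTokens) {
    throw new Error("CHUNK_OVERLAP_TOKENS must be an integer in [0, CHUNK_SIZE_TOKENS)");
  }
  return { sizeTokens, overlapTokens };
}

// Return [start, end) token offsets for each chunk of a document.
// Consecutive windows share overlapTokens tokens, so the stride is size - overlap.
function chunkWindows(totalTokens: number, cfg: ChunkConfig): Array<[number, number]> {
  const step = cfg.sizeTokens - cfg.overlapTokens;
  const windows: Array<[number, number]> = [];
  for (let start = 0; start < totalTokens; start += step) {
    const end = Math.min(start + cfg.sizeTokens, totalTokens);
    windows.push([start, end]);
    if (end === totalTokens) break; // last window reached the end of the document
  }
  return windows;
}
```

In the service this would be called as loadChunkConfig(process.env). With the defaults, a 2000-token document yields the windows [0, 800), [700, 1500), [1400, 2000).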
Hard rules
- Chunk size + overlap are tuned once per corpus and not changed per query. Re-index if you change them.
- Every retrieved chunk MUST be paired with its source URL or doc ID. No ungrounded chunks in the prompt.
- The answer prompt instructs the model to refuse if the retrieved context is insufficient. Better to say "I don't know" than to hallucinate.
- Citations are returned as structured data, not extracted from the answer text. The retrieval layer knows what was used.
- Embedding API costs are linear in tokens. Cache embeddings by content hash; never re-embed identical chunks.
- Test the retrieval layer in isolation (no LLM). If retrieval is bad, the answer can't be saved by a smarter LLM.
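The grounding, refusal, and citation rules above can be sketched together in one prompt-assembly step. This is an illustration under assumptions, not the code in src/query/: the types RetrievedChunk and Citation, the function buildGroundedPrompt, and the wording of the refusal instruction are all hypothetical. Throwing on an ungrounded chunk is one possible policy; dropping it and logging would also satisfy the rule.

```typescript
// Sketch only: RetrievedChunk, Citation and buildGroundedPrompt are hypothetical names.
interface RetrievedChunk {
  chunkId: string;
  text: string;
  sourceUrl?: string; // at least one of sourceUrl / docId is required
  docId?: string;
  score: number;
}

interface Citation {
  marker: number; // the [n] label shown to the model
  chunkId: string;
  source: string; // source URL or doc ID
}

// Assumed wording; the real system prompt lives in prompts/.
const REFUSAL_INSTRUCTION =
  "Answer only from the numbered context below. If it is insufficient, say \"I don't know\".";

function buildGroundedPrompt(
  question: string,
  chunks: RetrievedChunk[],
): { prompt: string; citations: Citation[] } {
  const citations: Citation[] = [];
  const blocks: string[] = [];
  chunks.forEach((chunk, i) => {
    const source = chunk.sourceUrl ?? chunk.docId;
    if (!source) {
      // Enforce the rule: no ungrounded chunks reach the prompt.
      throw new Error(`Chunk ${chunk.chunkId} has no source URL or doc ID`);
    }
    const marker = i + 1;
    // Citations are recorded here, at assembly time, not parsed from the answer.
    citations.push({ marker, chunkId: chunk.chunkId, source });
    blocks.push(`[${marker}] (${source})\n${chunk.text}`);
  });
  const prompt = [
    REFUSAL_INSTRUCTION,
    "Context:",
    blocks.join("\n\n"),
    `Question: ${question}`,
  ].join("\n\n");
  return { prompt, citations };
}
```

The citations array can be returned alongside the model's answer as-is, so the response carries structured sources even if the answer text mentions none.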
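The content-hash caching rule above can be sketched as follows. This is a minimal illustration, assuming an in-memory Map as the cache and SHA-256 as the hash; in the real service the cache would more plausibly live in Postgres or Redis. EmbedFn, contentHash, and embedWithCache are hypothetical names, and EmbedFn stands in for whatever the wrapped client in src/lib/openai.ts exposes.

```typescript
import { createHash } from "node:crypto";

// Sketch only: EmbedFn, contentHash and embedWithCache are hypothetical names.
// EmbedFn takes a batch of texts and returns one vector per text, in order.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Identical chunk text always maps to the same cache key.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

async function embedWithCache(
  chunks: string[],
  cache: Map<string, number[]>,
  embed: EmbedFn,
): Promise<number[][]> {
  // Collect only texts whose hash is not cached yet, deduplicated within the batch.
  const missing = new Map<string, string>(); // hash -> text
  for (const text of chunks) {
    const hash = contentHash(text);
    if (!cache.has(hash)) missing.set(hash, text);
  }
  if (missing.size > 0) {
    const hashes = Array.from(missing.keys());
    const vectors = await embed(hashes.map((h) => missing.get(h)!));
    hashes.forEach((h, i) => cache.set(h, vectors[i]));
  }
  // Every chunk is now in the cache; return vectors in the original order.
  return chunks.map((text) => cache.get(contentHash(text))!);
}
```

Because the key is derived from the chunk text alone, a persistent version of this cache should also be keyed by the embedding model, otherwise switching EMBEDDING_MODEL would serve stale vectors.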
Recent significant changes
- 2026-05-20: Scaffolded. Locked: Pinecone over pgvector (latency at scale), separate embedding + answer models (different optimization targets), BullMQ for ingestion (async is mandatory).
Next session: start here
- Create the Pinecone index. Its dimension must match the embedding model's output dimension (3072 for text-embedding-3-large by default).
- Run npm run ingest -- --source ./sample-docs/ against a small corpus first.
- Test POST /query with curl. Inspect what was retrieved before judging the answer.
- Tune chunk size against your corpus (technical docs want smaller, prose wants larger).
- Add eval harness with 20 question + ground-truth-answer pairs before going to prod.
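One possible starting point for the eval harness is a retrieval-only check that never calls the LLM. The sketch below assumes each eval question is labelled with the source IDs that should be retrieved, which is an addition to the question + ground-truth-answer pairs mentioned above. EvalCase, Retriever, and hitRateAtK are hypothetical names, and Retriever stands in for the top-k lookup in src/query/.

```typescript
// Sketch only: EvalCase, Retriever and hitRateAtK are hypothetical names.
interface EvalCase {
  question: string;
  expectedSources: string[]; // source URLs or doc IDs that should be retrieved
}

// Returns the source IDs of the top-k retrieved chunks, best first.
type Retriever = (question: string, topK: number) => Promise<string[]>;

// Fraction of questions for which at least one expected source appears in the top k.
async function hitRateAtK(
  cases: EvalCase[],
  retrieve: Retriever,
  topK: number,
): Promise<number> {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const retrieved = new Set(await retrieve(c.question, topK));
    if (c.expectedSources.some((s) => retrieved.has(s))) hits += 1;
  }
  return hits / cases.length;
}
```

Tracking this number while tuning chunk size gives a signal that is independent of the answer model, so a retrieval regression is not masked by a fluent answer.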