
# [Service Name] · RAG search

A retrieval-augmented generation service. Ingest documents, chunk them, embed them, store the vectors, retrieve at query time, ground the answer in real sources.

## Source of truth
Production runs as a Node service on Fly.io (or anywhere with Postgres, Redis, and outbound HTTPS). Pinecone holds vectors; Postgres holds metadata; S3-compatible storage holds the original docs. The ingestion job is the source of truth for what's searchable: a document that has not been through ingestion does not exist as far as queries are concerned.

## Tech stack
Node 22 + TypeScript + Fastify (lighter than Express for an API-only service). Pinecone for vectors. OpenAI `text-embedding-3-large` for embeddings. Anthropic Claude (or OpenAI GPT) for the final answer-generation step. Postgres for chunk metadata + source URLs. BullMQ + Redis for the ingestion job queue. Pino for logging.

## Deploy
Run `fly deploy` from a local checkout. Postgres + Redis run on Fly. The Pinecone serverless index lives in Pinecone's cloud. OpenAI / Anthropic keys are set via `fly secrets set`.

## File map
- `src/index.ts` Fastify app + route mounting
- `src/ingest/` document loader, chunker (semantic + token-based), embedder, vector writer
- `src/query/` retrieval (top-k from Pinecone), reranking, prompt assembly, LLM call, citation builder
- `src/jobs/` BullMQ workers for async ingestion
- `src/db/schema.ts` `sources`, `chunks`, `embeddings_meta` tables
- `src/lib/openai.ts`, `src/lib/anthropic.ts` wrapped SDK clients
- `src/lib/pinecone.ts` index client
- `prompts/` system prompts for retrieval + answer steps, version-tracked

## .env keys
- `DATABASE_URL`
- `REDIS_URL`
- `PINECONE_API_KEY`, `PINECONE_INDEX_NAME`
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `EMBEDDING_MODEL` default `text-embedding-3-large`
- `ANSWER_MODEL` default `claude-sonnet-4-6`
- `CHUNK_SIZE_TOKENS` default 800
- `CHUNK_OVERLAP_TOKENS` default 100
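
A minimal sketch of how the chunking keys above might be read and applied. The function names (`loadChunkConfig`, `chunkTokens`) are hypothetical and not taken from this repo, and whitespace-separated words stand in for real tokenizer output. The point is the relationship between the two settings: each chunk starts `size - overlap` tokens after the previous one, so overlap must stay below size.

```typescript
// Hypothetical helpers; key names and defaults come from the .env list above.
type Env = Record<string, string | undefined>;

interface ChunkConfig {
  embeddingModel: string;
  chunkSizeTokens: number;
  chunkOverlapTokens: number;
}

// Read a non-negative integer from the environment, falling back to a default.
function readInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function loadChunkConfig(env: Env): ChunkConfig {
  const chunkSizeTokens = readInt(env, "CHUNK_SIZE_TOKENS", 800);
  const chunkOverlapTokens = readInt(env, "CHUNK_OVERLAP_TOKENS", 100);
  if (chunkSizeTokens === 0 || chunkOverlapTokens >= chunkSizeTokens) {
    throw new Error("CHUNK_OVERLAP_TOKENS must be smaller than CHUNK_SIZE_TOKENS");
  }
  return {
    embeddingModel: env["EMBEDDING_MODEL"] ?? "text-embedding-3-large",
    chunkSizeTokens,
    chunkOverlapTokens,
  };
}

// Sliding window: consecutive chunks share `overlap` tokens.
function chunkTokens<T>(tokens: T[], size: number, overlap: number): T[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const stride = size - overlap;
  const chunks: T[][] = [];
  for (let start = 0; start < tokens.length; start += stride) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // last window reached the end
  }
  return chunks;
}

// Usage with tiny values so the overlap is visible.
const config = loadChunkConfig({ CHUNK_SIZE_TOKENS: "4", CHUNK_OVERLAP_TOKENS: "1" });
const words = "one two three four five six seven eight nine ten".split(" ");
console.log(chunkTokens(words, config.chunkSizeTokens, config.chunkOverlapTokens));
// three chunks; each begins with the last word of the previous chunk
```

In the service itself the env argument would presumably be `process.env`, and the token array would come from whatever tokenizer the chunker in `src/ingest/` uses.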

## Hard rules
- Chunk size + overlap are tuned once per corpus and not changed per query. Re-index if you change them.
- Every retrieved chunk MUST be paired with its source URL or doc ID. No ungrounded chunks in the prompt.
- The answer prompt instructs the model to refuse if the retrieved context is insufficient. Better to say "I don't know" than to hallucinate.
- Citations are returned as structured data, not extracted from the answer text. The retrieval layer knows what was used.
- Embedding API costs are linear in tokens. Cache embeddings by content hash; never re-embed identical chunks.
- Test the retrieval layer in isolation (no LLM). If retrieval is bad, the answer can't be saved by a smarter LLM.
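
The content-hash caching rule above could look roughly like the sketch below. It is an illustration under assumptions, not this repo's embedder: `contentHash`, `embedWithCache`, the `EmbeddingCache` interface, and the injected `embed` function are all hypothetical names, and folding the model name into the hash is an added assumption (vectors from different models are not interchangeable). The embedding call is passed in as a function so the cache logic can be tested without touching the OpenAI API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical signature for whatever wraps the real embeddings call.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Minimal cache contract; a Map satisfies it, and a Postgres- or
// Redis-backed store could implement the same two methods.
interface EmbeddingCache {
  get(key: string): number[] | undefined;
  set(key: string, vector: number[]): void;
}

// SHA-256 over model name + chunk text, so identical text embedded
// with a different model gets a different cache key.
function contentHash(model: string, text: string): string {
  return createHash("sha256").update(model).update("\0").update(text).digest("hex");
}

async function embedWithCache(
  texts: string[],
  model: string,
  cache: EmbeddingCache,
  embed: EmbedFn,
): Promise<number[][]> {
  const keyed = texts.map((text) => ({ key: contentHash(model, text), text }));

  // Collect each uncached hash once, even if the same text repeats in the batch.
  const missing = new Map<string, string>();
  for (const { key, text } of keyed) {
    if (cache.get(key) === undefined) missing.set(key, text);
  }

  if (missing.size > 0) {
    const missingKeys = Array.from(missing.keys());
    const vectors = await embed(Array.from(missing.values()));
    if (vectors.length !== missingKeys.length) {
      throw new Error("embedder returned a different number of vectors than requested");
    }
    missingKeys.forEach((key, i) => cache.set(key, vectors[i]));
  }

  // Return vectors in the same order as the input texts.
  return keyed.map(({ key }) => {
    const vector = cache.get(key);
    if (vector === undefined) throw new Error(`no vector cached for ${key}`);
    return vector;
  });
}
```

Passing a `Map<string, number[]>` as the cache is enough for a local run; only the texts whose hashes are absent are sent to `embed`, so re-ingesting an unchanged document costs nothing.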

## Recent significant changes
- 2026-05-20: Scaffolded. Locked: Pinecone over pgvector (latency at scale), separate embedding + answer models (different optimization targets), BullMQ for ingestion (async is mandatory).

## Next session: start here
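1. Create the Pinecone index. The index dimension must match the embedding model's output dimension (3072 for `text-embedding-3-large` by default, unless a smaller `dimensions` value is requested from the embeddings API).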
2. Run `npm run ingest -- --source ./sample-docs/` against a small corpus first.
3. Test `POST /query` with curl. Inspect what was retrieved before judging the answer.
4. Tune chunk size against your corpus (technical docs want smaller, prose wants larger).
5. Add eval harness with 20 question + ground-truth-answer pairs before going to prod.
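
One possible starting point for the eval harness in step 5, scoped to retrieval only so it can run without an LLM. This is a sketch under assumptions: it presumes each eval question is additionally labeled with the source IDs that should come back, which the plan above does not specify, and the names `EvalCase`, `RetrieveFn`, and `evaluateRetrieval` are hypothetical. The retriever is injected, so the same function works against a stub or the real Pinecone-backed query path.

```typescript
// Hypothetical eval types; expectedSourceIds is an assumed extra label per question.
interface EvalCase {
  question: string;
  expectedSourceIds: string[];
}

// Returns the source IDs of the top-k retrieved chunks for a question.
type RetrieveFn = (question: string, k: number) => Promise<string[]>;

interface RetrievalReport {
  hitRate: number;    // fraction of questions with at least one expected source retrieved
  meanRecall: number; // average fraction of expected sources retrieved per question
  misses: string[];   // questions where nothing expected was retrieved
}

async function evaluateRetrieval(
  cases: EvalCase[],
  retrieve: RetrieveFn,
  k: number,
): Promise<RetrievalReport> {
  if (cases.length === 0) throw new Error("need at least one eval case");

  let hits = 0;
  let recallSum = 0;
  const misses: string[] = [];

  for (const evalCase of cases) {
    const retrieved = new Set((await retrieve(evalCase.question, k)).slice(0, k));
    const expected = evalCase.expectedSourceIds;
    const found = expected.filter((id) => retrieved.has(id)).length;

    if (found > 0) hits += 1;
    else misses.push(evalCase.question);

    // A case with no expected sources contributes zero recall rather than dividing by zero.
    recallSum += expected.length === 0 ? 0 : found / expected.length;
  }

  return {
    hitRate: hits / cases.length,
    meanRecall: recallSum / cases.length,
    misses,
  };
}
```

The `misses` list is the useful part when tuning chunk size in step 4: it names the questions to inspect by hand before changing anything about the answer prompt.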
