
RAG search service (Pinecone + OpenAI embeddings + Postgres)

A document ingestion and retrieval-augmented generation service. It covers chunking, embeddings via OpenAI, vector storage in Pinecone, source metadata in Postgres, and a typed query endpoint. Use it when you need 'ChatGPT for your docs' but want to own the stack.

Use when: you have a corpus of documents (support articles, internal wiki, product manuals, codebase) and you want users to ask questions and get answers grounded in the source material, with citations.

May 20, 2026 · 2,510 bytes · tags: rag, pinecone, openai, embeddings, search

[Service Name] · RAG search

A retrieval-augmented generation service. Ingest documents, chunk them, embed them, store the vectors, retrieve at query time, ground the answer in real sources.

Source of truth

Production runs as a Node service on Fly.io (or anywhere with Postgres + outbound HTTPS). Pinecone holds vectors; Postgres holds metadata; S3-compatible storage holds the original docs. The ingestion job is the source of truth for what's searchable.

Tech stack

Node 22 + TypeScript + Fastify (lighter than Express for an API-only service). Pinecone for vectors. OpenAI text-embedding-3-large for embeddings. Anthropic Claude (or OpenAI GPT) for the final answer-generation step. Postgres for chunk metadata + source URLs. BullMQ + Redis for the ingestion job queue. Pino for logs.

Deploy

Deploy with fly deploy from a local checkout. Postgres + Redis run on Fly. The Pinecone serverless index lives in Pinecone's cloud. OpenAI / Anthropic keys are set via fly secrets set.

File map

  • src/index.ts Fastify app + route mounting
  • src/ingest/ document loader, chunker (semantic + token-based), embedder, vector writer
  • src/query/ retrieval (top-k from Pinecone), reranking, prompt assembly, LLM call, citation builder
  • src/jobs/ BullMQ workers for async ingestion
  • src/db/schema.ts sources, chunks, embeddings_meta tables
  • src/lib/openai.ts, src/lib/anthropic.ts wrapped SDK clients
  • src/lib/pinecone.ts index client
  • prompts/ system prompts for retrieval + answer steps, version-tracked
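
To make the token-based half of the chunker in src/ingest/ concrete, here is a minimal sketch of a sliding window with overlap. It is not the template's actual chunker: chunkByTokens and the Chunk type are hypothetical names, and tokens are approximated by whitespace-separated words, where a real implementation would count tokens with the embedding model's tokenizer.

```typescript
// Hypothetical helper for illustration only.
// ASSUMPTION: "tokens" are whitespace-separated words, not model tokens.
export interface Chunk {
  index: number; // position of the chunk within the document
  text: string;
}

export function chunkByTokens(text: string, size: number, overlap: number): Chunk[] {
  if (size <= 0) throw new Error("size must be positive");
  if (overlap < 0 || overlap >= size) throw new Error("overlap must be in [0, size)");

  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const step = size - overlap; // how far the window advances each iteration
  const chunks: Chunk[] = [];

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push({
      index: chunks.length,
      text: tokens.slice(start, start + size).join(" "),
    });
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}
```

With a size of 4 and an overlap of 1, each chunk repeats the last word of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.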

.env keys

  • DATABASE_URL
  • REDIS_URL
  • PINECONE_API_KEY, PINECONE_INDEX_NAME
  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
  • EMBEDDING_MODEL default text-embedding-3-large
  • ANSWER_MODEL default claude-sonnet-4-6
  • CHUNK_SIZE_TOKENS default 800
  • CHUNK_OVERLAP_TOKENS default 100
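
One way these keys might be read at startup is sketched below. The key names and defaults come from the list above; loadConfig, RagConfig, and the validation rules (required keys fail fast, overlap must be smaller than chunk size) are assumptions, not part of the template.

```typescript
// Hypothetical config loader; only the env key names and defaults are from the list above.
export interface RagConfig {
  databaseUrl: string;
  redisUrl: string;
  pineconeApiKey: string;
  pineconeIndexName: string;
  openaiApiKey: string;
  anthropicApiKey: string;
  embeddingModel: string;
  answerModel: string;
  chunkSizeTokens: number;
  chunkOverlapTokens: number;
}

type Env = Record<string, string | undefined>;

function required(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value === "") {
    throw new Error(`Missing required env var: ${key}`);
  }
  return value;
}

function intOrDefault(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`${key} must be a non-negative integer`);
  }
  return parsed;
}

export function loadConfig(env: Env): RagConfig {
  const config: RagConfig = {
    databaseUrl: required(env, "DATABASE_URL"),
    redisUrl: required(env, "REDIS_URL"),
    pineconeApiKey: required(env, "PINECONE_API_KEY"),
    pineconeIndexName: required(env, "PINECONE_INDEX_NAME"),
    openaiApiKey: required(env, "OPENAI_API_KEY"),
    anthropicApiKey: required(env, "ANTHROPIC_API_KEY"),
    embeddingModel: env["EMBEDDING_MODEL"] || "text-embedding-3-large",
    answerModel: env["ANSWER_MODEL"] || "claude-sonnet-4-6",
    chunkSizeTokens: intOrDefault(env, "CHUNK_SIZE_TOKENS", 800),
    chunkOverlapTokens: intOrDefault(env, "CHUNK_OVERLAP_TOKENS", 100),
  };
  // ASSUMPTION: an overlap as large as the chunk itself is treated as a config error.
  if (config.chunkOverlapTokens >= config.chunkSizeTokens) {
    throw new Error("CHUNK_OVERLAP_TOKENS must be smaller than CHUNK_SIZE_TOKENS");
  }
  return config;
}
```

In a Node service this would typically be called once as loadConfig(process.env), so a missing key stops the process at boot rather than on the first query.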

Hard rules

  • Chunk size + overlap are tuned once per corpus and not changed per query. Re-index if you change them.
  • Every retrieved chunk MUST be paired with its source URL or doc ID. No ungrounded chunks in the prompt.
  • The answer prompt instructs the model to refuse if the retrieved context is insufficient. Better to say "I don't know" than to hallucinate.
  • Citations are returned as structured data, not extracted from the answer text. The retrieval layer knows what was used.
  • Embedding API costs are linear in tokens. Cache embeddings by content hash; never re-embed identical chunks.
  • Test the retrieval layer in isolation (no LLM). If retrieval is bad, the answer can't be saved by a smarter LLM.
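
Several of these rules can be sketched in a few lines. The first half below caches embeddings by content hash; the second half assembles a prompt in which every chunk carries its source and returns citations as structured data taken from the retrieval results, not parsed from the answer. Every name here (EmbedFn, embedWithCache, buildGroundedPrompt, the prompt wording) is hypothetical, and the in-memory Map stands in for whatever persistent store the service would actually use.

```typescript
import { createHash } from "node:crypto";

// ASSUMPTION: `embed` stands in for the wrapped OpenAI client in src/lib/openai.ts.
export type EmbedFn = (text: string) => Promise<number[]>;

export function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Identical chunk text maps to the same key, so it is embedded at most once.
export async function embedWithCache(
  text: string,
  embed: EmbedFn,
  cache: Map<string, number[]>,
): Promise<number[]> {
  const key = contentHash(text);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const vector = await embed(text);
  cache.set(key, vector);
  return vector;
}

export interface RetrievedChunk {
  text: string;
  sourceUrl: string; // URL or doc ID; must be non-empty
  score: number;
}

export interface Citation {
  marker: number; // the [n] label used in the prompt
  sourceUrl: string;
}

export function buildGroundedPrompt(
  question: string,
  chunks: RetrievedChunk[],
): { prompt: string; citations: Citation[] } {
  const citations: Citation[] = [];
  const contextLines: string[] = [];
  chunks.forEach((chunk, i) => {
    if (chunk.sourceUrl.trim() === "") {
      throw new Error("Refusing to add an ungrounded chunk to the prompt");
    }
    citations.push({ marker: i + 1, sourceUrl: chunk.sourceUrl });
    contextLines.push(`[${i + 1}] (${chunk.sourceUrl})\n${chunk.text}`);
  });
  const prompt = [
    "Answer using only the numbered context below.",
    'If the context is insufficient, reply "I don\'t know".',
    "",
    ...contextLines,
    "",
    `Question: ${question}`,
  ].join("\n");
  return { prompt, citations };
}
```

Because the citations array is built from the retrieved chunks themselves, the API response can list sources even when the model's answer omits or mangles the [n] markers.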

Recent significant changes

  • 2026-05-20: Scaffolded. Locked: Pinecone over pgvector (latency at scale), separate embedding + answer models (different optimization targets), BullMQ for ingestion (async is mandatory).

Next session: start here

  1. Create the Pinecone index. The index dimension must match the embedding model's output dimension (text-embedding-3-large returns 3072-dimensional vectors by default).
  2. Run npm run ingest -- --source ./sample-docs/ against a small corpus first.
  3. Test POST /query with curl. Inspect what was retrieved before judging the answer.
  4. Tune chunk size against your corpus (technical docs want smaller, prose wants larger).
  5. Add eval harness with 20 question + ground-truth-answer pairs before going to prod.
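
For steps 3 and 5, a retrieval-only check might look like the sketch below. It measures hit rate at k: the fraction of questions for which at least one expected source shows up in the top k results, with no LLM involved. This is an assumed starting point, not the template's harness; EvalCase, RetrieveFn, and hitRateAtK are hypothetical names, and it presumes each eval question is labeled with the source IDs that should be retrieved, in addition to its ground-truth answer.

```typescript
// Hypothetical eval helper for the retrieval layer only (no LLM call).
export interface EvalCase {
  question: string;
  expectedSourceIds: string[]; // ASSUMPTION: each case lists the sources that should be retrieved
}

// ASSUMPTION: `retrieve` stands in for the Pinecone top-k lookup and returns source IDs.
export type RetrieveFn = (question: string, k: number) => Promise<string[]>;

export async function hitRateAtK(
  cases: EvalCase[],
  retrieve: RetrieveFn,
  k: number,
): Promise<number> {
  if (cases.length === 0) throw new Error("need at least one eval case");
  let hits = 0;
  for (const c of cases) {
    const retrieved = new Set(await retrieve(c.question, k));
    // A case counts as a hit if any expected source is among the top-k results.
    if (c.expectedSourceIds.some((id) => retrieved.has(id))) hits += 1;
  }
  return hits / cases.length;
}
```

Tracking this number while changing chunk size or overlap gives a quick signal on whether a re-index helped before any answer quality is judged.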
