RAG search service (Pinecone + OpenAI embeddings + Postgres)

A document ingestion + retrieval-augmented generation service. Chunking, embeddings via OpenAI, vector storage in Pinecone, source metadata in Postgres, a typed query endpoint. Use when you need 'ChatGPT for your docs' but want to own the stack.

Use when: you have a corpus of documents (support articles, internal wiki, product manuals, codebase) and you want users to ask questions and get answers grounded in the source material with citations.

May 20, 2026 · 2,510 bytes · rag, pinecone, openai, embeddings, search


[Service Name] · RAG search

A retrieval-augmented generation service. Ingest documents, chunk them, embed them, store the vectors, retrieve at query time, ground the answer in real sources.

Source of truth

Production runs as a Node service on Fly.io (or anywhere with Postgres + outbound HTTPS). Pinecone holds vectors; Postgres holds metadata; S3-compatible storage holds the original docs. The ingestion job is the source of truth for what's searchable.

Tech stack

Node 22 + TypeScript + Fastify (lighter than Express for an API-only service). Pinecone for vectors. OpenAI text-embedding-3-large for embeddings. Anthropic Claude (or OpenAI GPT) for the final answer-generation step. Postgres for chunk metadata + source URLs. BullMQ + Redis for ingestion job queue. Pino logs.

Deploy

fly deploy from local. Postgres + Redis on Fly. Pinecone serverless index lives in their cloud. OpenAI / Anthropic keys via fly secrets set.

File map

src/index.ts Fastify app + route mounting
src/ingest/ document loader, chunker (semantic + token-based), embedder, vector writer
src/query/ retrieval (top-k from Pinecone), reranking, prompt assembly, LLM call, citation builder
src/jobs/ BullMQ workers for async ingestion
src/db/schema.ts sources, chunks, embeddings_meta tables
src/lib/openai.ts, src/lib/anthropic.ts wrapped SDK clients
src/lib/pinecone.ts index client
prompts/ system prompts for retrieval + answer steps, version-tracked
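
The src/query/ entry lists prompt assembly and a citation builder as separate steps. A minimal sketch of how the two might share one pass over the retrieved chunks, so that the citation list comes from retrieval rather than from parsing the answer. The `RetrievedChunk`, `Citation`, and `assemblePrompt` names are hypothetical and not taken from the codebase; the real shapes in src/db/schema.ts may differ.

```typescript
// Hypothetical shapes for illustration only.
interface RetrievedChunk {
  chunkId: string;
  sourceUrl: string; // a doc ID would work equally well
  text: string;
  score: number;
}

interface Citation {
  marker: number; // the [n] label the model sees in the context
  chunkId: string;
  sourceUrl: string;
}

interface AssembledPrompt {
  context: string;
  citations: Citation[];
}

// Builds the context block and the structured citation list together.
// A chunk without a source is rejected rather than silently included.
function assemblePrompt(chunks: RetrievedChunk[]): AssembledPrompt {
  const citations: Citation[] = [];
  const blocks: string[] = [];
  for (const chunk of chunks) {
    if (!chunk.sourceUrl) {
      throw new Error(`chunk ${chunk.chunkId} has no source; refusing to include it`);
    }
    const marker = citations.length + 1;
    citations.push({ marker, chunkId: chunk.chunkId, sourceUrl: chunk.sourceUrl });
    blocks.push(`[${marker}] (${chunk.sourceUrl})\n${chunk.text}`);
  }
  return { context: blocks.join("\n\n"), citations };
}
```

Throwing on a missing source is one possible policy; dropping the chunk and logging it would also keep ungrounded text out of the prompt. If `citations` comes back empty, the caller can skip the LLM call and return an "I don't know" response directly.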

.env keys

DATABASE_URL
REDIS_URL
PINECONE_API_KEY, PINECONE_INDEX_NAME
OPENAI_API_KEY
ANTHROPIC_API_KEY
EMBEDDING_MODEL default text-embedding-3-large
ANSWER_MODEL default claude-sonnet-4-6
CHUNK_SIZE_TOKENS default 800
CHUNK_OVERLAP_TOKENS default 100
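
A sketch of how the defaults above might be read, and how CHUNK_SIZE_TOKENS and CHUNK_OVERLAP_TOKENS interact in a token-window chunker. `loadConfig`, `intFromEnv`, and `chunkWindows` are hypothetical names. The sketch assumes the input has already been tokenized and covers only the token-based half of the chunker, not the semantic splitting.

```typescript
interface RagConfig {
  embeddingModel: string;
  answerModel: string;
  chunkSizeTokens: number;
  chunkOverlapTokens: number;
}

// Parses a non-negative integer from an env value, falling back when unset.
function intFromEnv(value: string | undefined, fallback: number, name: string): number {
  if (value === undefined || value === "") return fallback;
  const n = Number(value);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${value}"`);
  }
  return n;
}

// Takes the env as a parameter so it can be tested without touching process.env.
function loadConfig(env: Record<string, string | undefined>): RagConfig {
  const chunkSizeTokens = intFromEnv(env["CHUNK_SIZE_TOKENS"], 800, "CHUNK_SIZE_TOKENS");
  const chunkOverlapTokens = intFromEnv(env["CHUNK_OVERLAP_TOKENS"], 100, "CHUNK_OVERLAP_TOKENS");
  if (chunkOverlapTokens >= chunkSizeTokens) {
    throw new Error("CHUNK_OVERLAP_TOKENS must be smaller than CHUNK_SIZE_TOKENS");
  }
  return {
    embeddingModel: env["EMBEDDING_MODEL"] || "text-embedding-3-large",
    answerModel: env["ANSWER_MODEL"] || "claude-sonnet-4-6",
    chunkSizeTokens,
    chunkOverlapTokens,
  };
}

// Sliding window: each chunk starts (size - overlap) tokens after the previous one.
function chunkWindows<T>(tokens: T[], size: number, overlap: number): T[][] {
  if (overlap < 0 || overlap >= size) {
    throw new Error("overlap must be >= 0 and smaller than size");
  }
  const step = size - overlap;
  const windows: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    windows.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // last window reached the end
  }
  return windows;
}
```

In the service this would be called as `loadConfig(process.env)` at startup. With the defaults, consecutive chunks start 700 tokens apart and share 100 tokens.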

Hard rules

Chunk size + overlap are tuned once per corpus and not changed per query. Re-index if you change them.
Every retrieved chunk MUST be paired with its source URL or doc ID. No ungrounded chunks in the prompt.
The answer prompt instructs the model to refuse if the retrieved context is insufficient. Better to say "I don't know" than to hallucinate.
Citations are returned as structured data, not extracted from the answer text. The retrieval layer knows what was used.
Embedding API costs are linear in tokens. Cache embeddings by content hash; never re-embed identical chunks.
Test the retrieval layer in isolation (no LLM). If retrieval is bad, the answer can't be saved by a smarter LLM.
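
The caching rule above can be sketched as follows. This is one possible shape, not the project's actual embedder: `EmbedFn`, `EmbeddingCache`, `embedWithCache`, and `MemoryCache` are hypothetical names, and the embedding call is injected as a function so the sketch runs without the OpenAI SDK. The cache key includes the model name, on the assumption that vectors from different embedding models must never be mixed.

```typescript
import { createHash } from "node:crypto";

// Hypothetical: wraps whatever client call produces one vector per input text.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

interface EmbeddingCache {
  get(key: string): Promise<number[] | undefined>;
  set(key: string, vector: number[]): Promise<void>;
}

// SHA-256 over model name + chunk text; identical content maps to the same key.
function contentKey(model: string, text: string): string {
  return createHash("sha256").update(model).update("\0").update(text).digest("hex");
}

async function embedWithCache(
  texts: string[],
  model: string,
  cache: EmbeddingCache,
  embed: EmbedFn,
): Promise<number[][]> {
  const keys = texts.map((t) => contentKey(model, t));
  const result: (number[] | undefined)[] = await Promise.all(keys.map((k) => cache.get(k)));

  // Collect unique misses so duplicate chunks in one batch are embedded once.
  const missIndexByKey = new Map<string, number>();
  const missTexts: string[] = [];
  keys.forEach((k, i) => {
    if (result[i] === undefined && !missIndexByKey.has(k)) {
      missIndexByKey.set(k, missTexts.length);
      missTexts.push(texts[i]);
    }
  });

  if (missTexts.length > 0) {
    const fresh = await embed(missTexts);
    if (fresh.length !== missTexts.length) {
      throw new Error("embedder returned an unexpected number of vectors");
    }
    for (const [k, j] of Array.from(missIndexByKey.entries())) {
      await cache.set(k, fresh[j]);
    }
    keys.forEach((k, i) => {
      if (result[i] === undefined) result[i] = fresh[missIndexByKey.get(k)!];
    });
  }
  return result as number[][];
}

// In-memory cache for local runs and tests; production would likely back
// this with Postgres or Redis.
class MemoryCache implements EmbeddingCache {
  private store = new Map<string, number[]>();
  async get(key: string): Promise<number[] | undefined> {
    return this.store.get(key);
  }
  async set(key: string, vector: number[]): Promise<void> {
    this.store.set(key, vector);
  }
}
```

Because the embedder is injected, the same function can be exercised in tests with a fake `EmbedFn` that counts calls, which makes it easy to confirm that a re-ingest of unchanged documents sends nothing to the embedding API.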

Recent significant changes

2026-05-20: Scaffolded. Locked: Pinecone over pgvector (latency at scale), separate embedding + answer models (different optimization targets), BullMQ for ingestion (async is mandatory).

Next session: start here

Create the Pinecone index. Its dimension must match the embedding model's output dimension (3072 for text-embedding-3-large at its default setting).
Run npm run ingest -- --source ./sample-docs/ against a small corpus first.
Test POST /query with curl. Inspect what was retrieved before judging the answer.
Tune chunk size against your corpus (technical docs want smaller, prose wants larger).
Add eval harness with 20 question + ground-truth-answer pairs before going to prod.
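
One possible starting point for the eval harness, scoped to retrieval only (no LLM). It assumes each eval case also records which source IDs should be retrieved, which goes beyond the question + ground-truth-answer pairs mentioned above; `EvalCase`, `Retriever`, and `evaluateRetrieval` are hypothetical names, and the retriever is injected so the sketch runs without Pinecone.

```typescript
interface EvalCase {
  question: string;
  expectedSourceIds: string[]; // assumption: labeled by hand per question
}

// Hypothetical: wraps the top-k lookup in src/query/.
type Retriever = (question: string, topK: number) => Promise<{ sourceId: string }[]>;

interface EvalReport {
  hitRate: number; // fraction of cases with at least one expected source in top-k
  misses: string[]; // questions where no expected source was retrieved
}

async function evaluateRetrieval(
  cases: EvalCase[],
  retrieve: Retriever,
  topK: number,
): Promise<EvalReport> {
  if (cases.length === 0) throw new Error("no eval cases provided");
  const misses: string[] = [];
  for (const c of cases) {
    const results = await retrieve(c.question, topK);
    const retrieved = new Set(results.map((r) => r.sourceId));
    const hit = c.expectedSourceIds.some((id) => retrieved.has(id));
    if (!hit) misses.push(c.question);
  }
  return { hitRate: (cases.length - misses.length) / cases.length, misses };
}
```

Running this before and after a chunk-size change gives a number to compare, and the `misses` list shows which questions to inspect by hand.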
