
# [Service Name] · LangGraph orchestration

A multi-agent state machine. Nodes are agents, edges are decisions, state is typed. Checkpoints let you pause, inspect, and resume. Human-in-the-loop where the cost of being wrong is high.

## Source of truth
Production runs as a FastAPI service on Fly. Postgres stores checkpoints (LangGraph's `PostgresSaver`). Every graph run is replayable from any checkpoint.

## Tech stack
Python 3.13 + LangGraph 0.3+ + LangChain (only the lightweight prompt + tool bindings). Anthropic Claude for reasoning, OpenAI for tool calls when strict JSON structured output is the constraint. FastAPI for the run-orchestration API. PostgreSQL via `PostgresSaver` for checkpoints. Pydantic v2 for state schemas.

## Deploy
`fly deploy`. Postgres on Fly; the checkpoint tables are created by `PostgresSaver.setup()`, which runs LangGraph's own migrations. Run it once before the first graph run (re-running it is safe).

## File map
- `app/graph.py` graph definition: nodes, edges, conditional routing
- `app/state.py` Pydantic `State` model (the central state object)
- `app/nodes/researcher.py`, `writer.py`, `critic.py` one file per agent
- `app/tools/` shared tools (web search, file read, code interpreter)
- `app/api.py` FastAPI: POST `/runs` + GET `/runs/{thread_id}/state`
- `app/checkpoint.py` PostgresSaver setup
- `prompts/` system prompts per node, version-controlled

## .env keys
- `ANTHROPIC_API_KEY`
- `OPENAI_API_KEY`
- `DATABASE_URL`
- `TAVILY_API_KEY` (if you use Tavily for web search)
- `MAX_GRAPH_STEPS` default 30 (cycle protection)

## Hard rules
- State is typed end-to-end. NO untyped dicts in state. Pydantic models or it doesn't ship.
- Every node is pure: same state in, same update out (modulo LLM nondeterminism). Nodes return only the fields they change. No side effects in nodes; side effects go through tools.
- Conditional edges return a string identifier, never a node function reference. Keeps the graph serializable.
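- Checkpoint after every node, not just at the end. Mid-graph crashes happen. Compiling the graph with the checkpointer gives you this: LangGraph saves a checkpoint after every super-step.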
- Human-in-the-loop nodes use `interrupt()` (LangGraph 0.3+). The graph pauses; the API surfaces the interrupt; resume with `Command(resume=...)`.
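- Cap graph recursion: pass `recursion_limit` (from `MAX_GRAPH_STEPS`) on every run. LangGraph's limit (default 25) only aborts a runaway run with `GraphRecursionError`; preventing cycles in the first place is YOUR job, not LangGraph's.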

## Recent significant changes
- 2026-05-17: Scaffolded. Locked: LangGraph over building from scratch (the state + checkpoint machinery is non-trivial), Pydantic for state typing, PostgresSaver over in-memory (production survives restarts).

## Next session: start here
1. Define your state schema in `app/state.py`. Start tiny; you can add fields.
2. Write the simplest possible graph: one node, one edge to END. Confirm it runs.
3. Add nodes incrementally. After each, run a fresh thread end-to-end and read the checkpoint history.
4. Add a critic node early. It catches loops + dead ends faster than you will.
5. Build `eval/` with 10 representative inputs + expected final-state assertions before adding more nodes.
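A minimal, stdlib-only sketch of an `eval/` runner. It only assumes the graph exposes `.invoke(input, config)` and returns the final state as a dict, as a compiled LangGraph graph does. `EvalCase`, `run_evals`, and the `EchoGraph` stub are hypothetical. Each case gets a fresh random `thread_id` so a persistent checkpointer never resumes an old thread.

```python
import uuid
from collections.abc import Callable, Mapping
from dataclasses import dataclass
from typing import Any, Protocol


class Invokable(Protocol):
    def invoke(self, input: Any, config: Mapping[str, Any]) -> dict[str, Any]: ...


@dataclass(frozen=True)
class EvalCase:
    name: str
    input: dict[str, Any]
    check: Callable[[dict[str, Any]], bool]  # assertion on the final state


def run_evals(graph: Invokable, cases: list[EvalCase], max_steps: int = 30) -> dict[str, str]:
    """Run each case on a fresh thread; map case name -> 'pass' | 'fail' | 'error: <Type>'."""
    results: dict[str, str] = {}
    for case in cases:
        config = {
            "configurable": {"thread_id": f"eval-{case.name}-{uuid.uuid4()}"},
            "recursion_limit": max_steps,
        }
        try:
            final_state = graph.invoke(case.input, config)
            results[case.name] = "pass" if case.check(final_state) else "fail"
        except Exception as exc:  # crashes, including hitting the step cap, count as failures
            results[case.name] = f"error: {type(exc).__name__}"
    return results


class EchoGraph:
    # Hypothetical stand-in for a compiled graph.
    def invoke(self, input: Any, config: Mapping[str, Any]) -> dict[str, Any]:
        return {**input, "summary": input["topic"].upper()}


cases = [
    EvalCase("uppercases", {"topic": "llm"}, lambda s: s["summary"] == "LLM"),
    EvalCase("missing-topic", {}, lambda s: True),
]
report = run_evals(EchoGraph(), cases)
```

Swapping `EchoGraph()` for the real compiled graph, and the two cases for the ten representative ones, gives a quick pass/fail table before each new node lands.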
