[Service Name] · LangGraph orchestration
A multi-agent state machine. Nodes are agents, edges are decisions, state is typed. Checkpoints let you pause, inspect, and resume. Human-in-the-loop where the cost of being wrong is high.
Source of truth
Production runs as a FastAPI service on Fly. Postgres stores checkpoints (LangGraph's PostgresSaver). Every graph run is replayable from any checkpoint.
Tech stack
Python 3.13 + LangGraph 0.3+ + LangChain (only the lightweight prompt + tool bindings). Anthropic Claude for reasoning, OpenAI for tool calls when JSON-structured-output is the constraint. FastAPI for the run-orchestration API. PostgreSQL via PostgresSaver for checkpoints. Pydantic v2 for state schemas.
Deploy
fly deploy. Postgres on Fly, with the checkpoint tables created by PostgresSaver.setup() (LangGraph's built-in migrations).
File map
- app/graph.py: graph definition (nodes, edges, conditional routing)
- app/state.py: Pydantic State model (the central state object)
- app/nodes/researcher.py, writer.py, critic.py: one file per agent
- app/tools/: shared tools (web search, file read, code interpreter)
- app/api.py: FastAPI, POST /runs + GET /runs/{thread_id}/state
- app/checkpoint.py: PostgresSaver setup
- prompts/: system prompts per node, version-controlled
.env keys
- ANTHROPIC_API_KEY
- OPENAI_API_KEY
- DATABASE_URL
- TAVILY_API_KEY (if you use Tavily for web search)
- MAX_GRAPH_STEPS: default 30 (cycle protection)
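One possible way to load these keys at startup is sketched below, using only the standard library. The Settings class and load_settings function are hypothetical names, and treating the first three keys as required is an assumption; only the MAX_GRAPH_STEPS default of 30 and the optional Tavily key come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these three are required; the source only lists them.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "DATABASE_URL")


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: str
    openai_api_key: str
    database_url: str
    tavily_api_key: Optional[str]  # only needed if Tavily is the search tool
    max_graph_steps: int           # cycle protection, defaults to 30


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, failing fast on missing or invalid values."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing required env keys: {', '.join(missing)}")

    max_steps = int(env.get("MAX_GRAPH_STEPS", "30"))
    if max_steps < 1:
        raise ValueError("MAX_GRAPH_STEPS must be a positive integer")

    return Settings(
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        openai_api_key=env["OPENAI_API_KEY"],
        database_url=env["DATABASE_URL"],
        tavily_api_key=env.get("TAVILY_API_KEY") or None,
        max_graph_steps=max_steps,
    )
```

Taking the environment as a parameter keeps the loader testable without touching the real process environment.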
Hard rules
- State is typed end-to-end. NO untyped dicts in state. Pydantic models or it doesn't ship.
- Every node is pure: same state in -> same state out (modulo LLM nondeterminism). No side effects in nodes; side effects go through tools.
- Conditional edges return a string identifier, never a node function reference. Keeps the graph serializable.
- Checkpoint after every node, not just at end. Mid-graph crashes happen.
- Human-in-the-loop nodes use interrupt() (LangGraph 0.3+). The graph pauses; the API surfaces the interrupt; resume with Command(resume=...).
- Cap graph recursion. Cycle prevention is YOUR job, not LangGraph's.
Recent significant changes
- 2026-05-17: Scaffolded. Locked: LangGraph over building from scratch (the state + checkpoint machinery is non-trivial), Pydantic for state typing, PostgresSaver over in-memory (production survives restarts).
Next session: start here
- Define your state schema in app/state.py. Start tiny; you can add fields.
- Write the simplest possible graph: one node, one edge to END. Confirm it runs.
- Add nodes incrementally. After each, run a fresh thread end-to-end and read the checkpoint history.
- Add a critic node early. It catches loops + dead ends faster than you will.
- Build eval/ with 10 representative inputs + expected final-state assertions before adding more nodes.