
# [Agent Name] · Computer Use

A Claude-powered web agent. Takes a goal in English, drives a sandboxed Chromium to achieve it, returns a transcript + screenshots.

## Source of truth
GitHub. The Docker image is what runs in production. Agent state per run is stored in Postgres (transcript, screenshots, final result). Sessions never share a browser.

## Tech stack
Python 3.13 + Anthropic SDK (`claude-sonnet-4-6` minimum; Opus for harder tasks). Playwright (Chromium) inside Docker. FastAPI for the run-orchestration API. PostgreSQL for run history + transcripts. S3-compatible storage for screenshots. The agent uses Anthropic's computer use tool (tool type `computer_20250124`, enabled with the `computer-use-2025-01-24` beta flag) for screenshot/click/type/key/scroll. Newer models may require a later tool version; confirm against the model's documentation.
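
A minimal sketch of how the computer use tool might be declared in a request. It assumes the 2025-01-24 tool version named above, whose `type` string in the API is `computer_20250124`. The helper name `computer_tool` and the 1280x800 display size are illustrative, not taken from this repo.

```python
def computer_tool(
    width_px: int,
    height_px: int,
    tool_type: str = "computer_20250124",  # assumption: confirm the version your model supports
) -> dict:
    """Build the entry that goes into the `tools` list of a Messages API request."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("display dimensions must be positive")
    return {
        "type": tool_type,
        "name": "computer",
        # Should match the real viewport so click coordinates line up with screenshots.
        "display_width_px": width_px,
        "display_height_px": height_px,
    }


tool = computer_tool(1280, 800)
```

The resulting dict would be passed as `tools=[tool]` on a beta Messages call, together with the beta flag that matches the tool version.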

## Deploy
Build the Docker image, push to a runner that supports nested containers (Fly Machines, ECS on EC2 with privileged tasks, or a beefy EC2 with Docker-in-Docker). Fargate does not allow privileged containers, so it cannot host Docker-in-Docker. Each run spins up an isolated Chromium.

## File map
- `agent/loop.py` main agent loop: screenshot -> Claude -> action -> screenshot
- `agent/tools.py` computer use tool implementations (Playwright wrappers)
- `agent/prompts/` system prompts per task category
- `api/main.py` FastAPI: POST `/runs` to start, GET `/runs/{id}` for status
- `db/schema.sql` `runs`, `transcripts`, `screenshots` tables
- `Dockerfile` Python + Playwright + Chromium + Xvfb
- `eval/` task suite with expected outcomes
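
The screenshot -> Claude -> action -> screenshot cycle in `agent/loop.py` could be shaped roughly like this. It is a sketch, not the repo's code: `run_loop`, `Step`, `RunResult`, and the shape of the decision dict are hypothetical. `decide` stands in for the Claude call and `execute` for the Playwright wrappers in `agent/tools.py`; both are injected so the loop runs without a browser or API key.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    index: int
    action: dict
    reasoning: str  # kept so every action is logged with its rationale


@dataclass
class RunResult:
    status: str  # "done" or "step_cap"
    answer: Optional[str]
    steps: list = field(default_factory=list)


def run_loop(
    take_screenshot: Callable[[], bytes],
    decide: Callable[[str, bytes, list], dict],
    execute: Callable[[dict], None],
    goal: str,
    max_steps: int = 50,
) -> RunResult:
    steps: list = []
    for i in range(max_steps):
        shot = take_screenshot()            # what the model sees
        decision = decide(goal, shot, steps)
        if decision.get("done"):            # model reports the goal is met
            return RunResult("done", decision.get("answer"), steps)
        execute(decision["action"])         # click / type / key / scroll
        steps.append(Step(i, decision["action"], decision.get("reasoning", "")))
    return RunResult("step_cap", None, steps)


# Fake collaborators: scroll twice, then answer.
def fake_decide(goal: str, shot: bytes, history: list) -> dict:
    if len(history) < 2:
        return {"action": {"type": "scroll", "dy": 200}, "reasoning": "look further down"}
    return {"done": True, "answer": "Example Domain"}


executed: list = []
result = run_loop(lambda: b"png-bytes", fake_decide, executed.append, "tell me the H1")
```

Injecting the three callables keeps the loop testable on its own; the eval suite can reuse the same function with real implementations.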

## .env keys
- `ANTHROPIC_API_KEY`
- `DATABASE_URL`
- `S3_BUCKET`, `S3_ACCESS_KEY`, `S3_SECRET`
- `MAX_AGENT_STEPS` default 50
- `SCREENSHOT_INTERVAL_MS` default 1500
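
One way to load these keys at startup, as a sketch. `Settings` and `load_settings` are hypothetical names, and treating the five keys without a listed default as required is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumption: keys without a documented default must be present.
REQUIRED = ("ANTHROPIC_API_KEY", "DATABASE_URL", "S3_BUCKET", "S3_ACCESS_KEY", "S3_SECRET")


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: str
    database_url: str
    s3_bucket: str
    s3_access_key: str
    s3_secret: str
    max_agent_steps: int = 50
    screenshot_interval_ms: int = 1500


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, failing fast on missing keys."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError("missing required env keys: " + ", ".join(missing))
    return Settings(
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        database_url=env["DATABASE_URL"],
        s3_bucket=env["S3_BUCKET"],
        s3_access_key=env["S3_ACCESS_KEY"],
        s3_secret=env["S3_SECRET"],
        # Defaults mirror the values documented above.
        max_agent_steps=int(env.get("MAX_AGENT_STEPS", "50")),
        screenshot_interval_ms=int(env.get("SCREENSHOT_INTERVAL_MS", "1500")),
    )


# Demo with placeholder values instead of the real environment.
demo_env = {key: "placeholder" for key in REQUIRED}
demo_env["MAX_AGENT_STEPS"] = "20"
settings = load_settings(demo_env)
```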

## Hard rules
- Every run gets a fresh browser context. NEVER share cookies, localStorage, or auth state across runs unless the user explicitly provides a session cookie for that run.
- The agent runs in a network-namespaced container with allowlisted domains. No outbound to arbitrary hosts unless declared in the task config.
- Hard timeout per run: 5 minutes default. Hard step cap: 50 actions (`MAX_AGENT_STEPS`). Cost cap: $1.00 of Anthropic spend per run.
- Capture a screenshot AND the DOM snapshot at every step. The screenshot is what Claude sees; the DOM is for debugging when it fails.
- NEVER let the agent enter credentials, payment info, or other sensitive data unless the user pre-authorized it in the task config with explicit scope.
- Log every action with reasoning. You'll need this when 1 in 50 runs goes off the rails.
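
The timeout, step cap, and cost cap above could be enforced by one small guard checked before every action. This is a sketch under assumptions: `RunBudget` and `BudgetExceeded` are hypothetical names, and the per-step cost is passed in by the caller (in practice it would be derived from the token usage on each API response and current pricing).

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a run hits its time, step, or cost cap."""


@dataclass
class RunBudget:
    max_seconds: float = 300.0   # 5 minute hard timeout
    max_steps: int = 50          # hard step cap
    max_cost_usd: float = 1.00   # Anthropic spend cap per run
    steps: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        """Call before every action; raises once any cap has been reached."""
        if time.monotonic() - self.started_at >= self.max_seconds:
            raise BudgetExceeded("timeout")
        if self.steps >= self.max_steps:
            raise BudgetExceeded("step cap")
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded("cost cap")

    def record(self, step_cost_usd: float) -> None:
        """Call after every action with what that step cost."""
        self.steps += 1
        self.cost_usd += step_cost_usd


# Demo: pretend every step costs $0.30 and run until a cap trips.
budget = RunBudget()
reason = None
try:
    while True:
        budget.check()
        budget.record(0.30)
except BudgetExceeded as exc:
    reason = str(exc)
```

Because the check runs before each action, a run can overshoot the cost cap by at most one step's spend. Lower `max_cost_usd` slightly if the cap must be strict.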

## Recent significant changes
- 2026-05-19: Scaffolded. Locked: Playwright over Selenium (faster, modern), Docker isolation per run (no shared state), explicit step + cost caps (computer-use can spend fast).

## Next session: start here
1. Build the Docker image. Confirm Playwright can start Chromium inside the container: headed on the Xvfb virtual display, or headless (which does not need Xvfb).
2. Wire `ANTHROPIC_API_KEY`. Run `python -m agent.loop --task 'go to example.com and tell me the H1'`.
3. Inspect screenshots after each step. The first few runs always reveal prompt gaps.
4. Build the eval suite BEFORE adding more capabilities. Without evals you can't tell if a prompt change helps or hurts.
5. Expose `/runs` API. Auth required (this thing can spend $).
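
For step 4, the smallest useful eval harness is a pass rate over tasks with expected outcomes, computed once per prompt variant. The sketch below assumes exact-match scoring after trimming and lowercasing. `EvalTask`, `pass_rate`, the second task, and the two fake agents are hypothetical stand-ins for real runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalTask:
    goal: str
    expected: str


def pass_rate(tasks: list[EvalTask], run_agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose agent answer matches the expected outcome."""
    if not tasks:
        raise ValueError("eval suite is empty")
    passed = sum(
        1
        for task in tasks
        if run_agent(task.goal).strip().lower() == task.expected.strip().lower()
    )
    return passed / len(tasks)


tasks = [
    EvalTask("go to example.com and tell me the H1", "Example Domain"),
    EvalTask("hypothetical second task", "42"),
]

# Fake agents standing in for two prompt variants.
def baseline(goal: str) -> str:
    return "Example Domain"  # always gives the same answer


answers = {task.goal: task.expected for task in tasks}

def candidate(goal: str) -> str:
    return answers[goal]


baseline_rate = pass_rate(tasks, baseline)
candidate_rate = pass_rate(tasks, candidate)
```

Comparing the two rates on the same task list shows whether a prompt change helped or hurt. Real runs are non-deterministic, so each task would likely need several repetitions.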
