[Agent Name] · Computer Use
A Claude-powered web agent. Takes a goal in English, drives a sandboxed Chromium to achieve it, returns a transcript + screenshots.
Source of truth
GitHub. The Docker image is what runs in production. Agent state per run is stored in Postgres (transcript, screenshots, final result). Sessions never share a browser.
Tech stack
Python 3.13 + Anthropic SDK (claude-sonnet-4-6 minimum; Opus for harder tasks). Playwright (Chromium) inside Docker. FastAPI for the run-orchestration API. PostgreSQL for run history + transcripts. S3-compatible storage for screenshots. The agent uses Anthropic's computer_use_20250124 tool spec for screenshot/click/type/key/scroll.
Deploy
Build the Docker image, push to a runner that supports nested containers (Fly Machines, ECS Fargate with privileged tasks, or a beefy EC2 with Docker-in-Docker). Each task spins up an isolated Chromium.
File map
agent/loop.pymain agent loop: screenshot -> Claude -> action -> screenshotagent/tools.pycomputer use tool implementations (Playwright wrappers)agent/prompts/system prompts per task categoryapi/main.pyFastAPI: POST/runsto start, GET/runs/{id}for statusdb/schema.sqlruns,transcripts,screenshotstablesDockerfilePython + Playwright + Chromium + Xvfbeval/task suite with expected outcomes
.env keys
ANTHROPIC_API_KEYDATABASE_URLS3_BUCKET,S3_ACCESS_KEY,S3_SECRETMAX_AGENT_STEPSdefault 50SCREENSHOT_INTERVAL_MSdefault 1500
Hard rules
- Every run gets a fresh browser context. NEVER share cookies, localStorage, or auth state across runs unless the user explicitly provides a session cookie for that run.
- The agent runs in a network-namespaced container with allowlisted domains. No outbound to arbitrary hosts unless declared in the task config.
- Hard timeout per run: 5 minutes default. Hard step cap: 50 actions. Cost cap: $1.00 of Anthropic spend per run.
- Capture a screenshot AND the DOM snapshot at every step. The screenshot is what Claude sees; the DOM is for debugging when it fails.
- NEVER let the agent enter credentials, payment info, or other sensitive data unless the user pre-authorized it in the task config with explicit scope.
- Log every action with reasoning. You'll need this when 1 in 50 runs goes off the rails.
Recent significant changes
- 2026-05-19: Scaffolded. Locked: Playwright over Selenium (faster, modern), Docker isolation per run (no shared state), explicit step + cost caps (computer-use can spend fast).
Next session: start here
- Build the Docker image. Confirm Playwright + Chromium + Xvfb work in headless mode.
- Wire
ANTHROPIC_API_KEY. Runpython -m agent.loop --task 'go to example.com and tell me the H1'. - Inspect screenshots after each step. The first few runs always reveal prompt gaps.
- Build the eval suite BEFORE adding more capabilities. Without evals you can't tell if a prompt change helps or hurts.
- Expose
/runsAPI. Auth required (this thing can spend $).