Skills › Software Development › Dev workflow & Git
pr-test
E2E manual testing of PRs/branches using docker compose, agent-browser, and API calls. TRIGGER when user asks to manually test a PR, test a feature end-to-end, or run integration tests against a running system.
Tools: needed,-y
The full skill
—
name: pr-test
description: "E2E manual testing of PRs/branches using docker compose, agent-browser, and API calls. TRIGGER when user asks to manually test a PR, test a feature end-to-end, or run integration tests against a running system."
user-invocable: true
argument-hint: "[worktree path or PR number] — tests the PR in the given worktree. Optional flags: –fix (auto-fix issues found)"
metadata:
author: autogpt-team
version: "2.0.0"
—
# Manual E2E Test
Test a PR/branch end-to-end by building the full platform, interacting via browser and API, capturing screenshots, and reporting results.
## Critical Requirements
These are NON-NEGOTIABLE. Every test run MUST satisfy ALL the following:
### 1. Screenshots at Every Step
– Take a screenshot at EVERY significant test step — not just at the end
– Every test scenario MUST have at least one BEFORE and one AFTER screenshot
– Name screenshots sequentially: `{NN}-{action}-{state}.png` (e.g., `01-credits-before.png`, `02-credits-after.png`)
– If a screenshot is missing for a scenario, the test is INCOMPLETE — go back and take it
### 2. Screenshots MUST Be Posted to PR
– Push ALL screenshots to a temp branch `test-screenshots/pr-{N}`
– Post a PR comment with ALL screenshots embedded inline using GitHub raw URLs
– This is NOT optional — every test run MUST end with a PR comment containing screenshots
– If screenshot upload fails, retry. If it still fails, list failed files and require manual drag-and-drop/paste attachment in the PR comment
### 3. State Verification with Before/After Evidence
– For EVERY state-changing operation (API call, user action), capture the state BEFORE and AFTER
– Log the actual API response values (e.g., `credits_before=100, credits_after=95`)
– Screenshot MUST show the relevant UI state change
– Compare expected vs actual values explicitly — do not just eyeball it
### 4. Negative Test Cases Are Mandatory
– Test at least ONE negative case per feature (e.g., insufficient credits, invalid input, unauthorized access)
– Verify error messages are user-friendly and accurate
– Verify the system state did NOT change after a rejected operation
### 5. Test Report Must Include Full Evidence
Each test scenario in the report MUST have:
– **Steps**: What was done (exact commands or UI actions)
– **Expected**: What should happen
– **Actual**: What actually happened
– **API Evidence**: Before/after API response values for state-changing operations
– **Screenshot Evidence**: Before/after screenshots with explanations
## State Manipulation for Realistic Testing
When testing features that depend on specific states (rate limits, credits, quotas):
1. **Use Redis CLI to set counters directly:**
“`bash
# Find the Redis container
REDIS_CONTAINER=$(docker ps –format '{{.Names}}' | grep redis | head -1)
# Set a key with expiry
docker exec $REDIS_CONTAINER redis-cli SET key value EX ttl
# Example: Set rate limit counter to near-limit
docker exec $REDIS_CONTAINER redis-cli SET "rate_limit:user:[email protected]" 99 EX 3600
# Example: Check current value
docker exec $REDIS_CONTAINER redis-cli GET "rate_limit:user:[email protected]"
“`
2. **Use API calls to check before/after state:**
“`bash
# BEFORE: Record current state
BEFORE=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/credits | jq '.credits')
echo "Credits BEFORE: $BEFORE"
# Perform the action…
# AFTER: Record new state and compare
AFTER=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/credits | jq '.credits')
echo "Credits AFTER: $AFTER"
echo "Delta: $(( BEFORE – AFTER ))"
“`
3. **Take screenshots BEFORE and AFTER state changes** — the UI must reflect the backend state change
4. **Never rely on mocked/injected browser state** — always use real backend state. Do NOT use `agent-browser eval` to fake UI state. The backend must be the source of truth.
5. **Use direct DB queries when needed:**
“`bash
# Query via Supabase's PostgREST or docker exec into the DB
docker exec supabase-db psql -U supabase_admin -d postgres -c "SELECT credits FROM user_credits WHERE user_id = '…';"
“`
6. **After every API test, verify the state change actually persisted:**
“`bash
# Example: After a credits purchase, verify DB matches API
API_CREDITS=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/credits | jq '.credits')
DB_CREDITS=$(docker exec supabase-db psql -U supabase_admin -d postgres -t -c "SELECT credits FROM user_credits WHERE user_id = '…';" | tr -d ' ')
[ "$API_CREDITS" = "$DB_CREDITS" ] && echo "CONSISTENT" || echo "MISMATCH: API=$API_CREDITS DB=$DB_CREDITS"
“`
## Arguments
– `$ARGUMENTS` — worktree path (e.g. `$REPO_ROOT`) or PR number
– If `–fix` flag is present, auto-fix bugs found and push fixes (like pr-address loop)
## Step 0: Resolve the target
“`bash
# If argument is a PR number, find its worktree
gh pr view {N} –json headRefName –jq '.headRefName'
# If argument is a path, use it directly
“`
Determine:
– `REPO_ROOT` — the root repo directory: `git -C "$WORKTREE_PATH" worktree list | head -1 | awk '{print $1}'` (or `git rev-parse –show-toplevel` if not a worktree)
– `WORKTREE_PATH` — the worktree directory
– `PLATFORM_DIR` — `$WORKTREE_PATH/autogpt_platform`
– `BACKEND_DIR` — `$PLATFORM_DIR/backend`
– `FRONTEND_DIR` — `$PLATFORM_DIR/frontend`
– `PR_NUMBER` — the PR number (from `gh pr list –head $(git branch –show-current)`)
– `PR_TITLE` — the PR title, slugified (e.g. "Add copilot permissions" → "add-copilot-permissions")
– `RESULTS_DIR` — `$REPO_ROOT/test-results/PR-{PR_NUMBER}-{slugified-title}`
Create the results directory:
“`bash
PR_NUMBER=$(cd $WORKTREE_PATH && gh pr list –head $(git branch –show-current) –repo Significant-Gravitas/AutoGPT –json number –jq '.[0].number')
PR_TITLE=$(cd $WORKTREE_PATH && gh pr list –head $(git branch –show-current) –repo Significant-Gravitas/AutoGPT –json title –jq '.[0].title' | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9]/-/g' | sed 's/–*/-/g' | sed 's/^-//;s/-$//' | head -c 50)
RESULTS_DIR="$REPO_ROOT/test-results/PR-${PR_NUMBER}-${PR_TITLE}"
mkdir -p $RESULTS_DIR
“`
**Test user credentials** (for logging into the UI or verifying results manually):
– Email: `[email protected]`
– Password: `testtest123`
## Step 1: Understand the PR
Before testing, understand what changed:
“`bash
cd $WORKTREE_PATH
# Read PR description to understand the WHY
gh pr view {N} –json body –jq '.body'
git log –oneline dev..HEAD | head -20
git diff dev –stat
“`
Read the PR description (Why / What / How) and changed files to understand:
0. **Why** does this PR exist? What problem does it solve?
1. **What** feature/fix does this PR implement?
2. **How** does it work? What's the approach?
3. What components are affected? (backend, frontend, copilot, executor, etc.)
4. What are the key user-facing behaviors to test?
## Step 2: Write test scenarios
Based on the PR analysis, write a test plan to `$RESULTS_DIR/test-plan.md`:
“`markdown
# Test Plan: PR #{N} — {title}
## Scenarios
1. [Scenario name] — [what to verify]
2. …
## API Tests (if applicable)
1. [Endpoint] — [expected behavior]
– Before state: [what to check before]
– After state: [what to verify changed]
## UI Tests (if applicable)
1. [Page/component] — [interaction to test]
– Screenshot before: [what to capture]
– Screenshot after: [what to capture]
## Negative Tests (REQUIRED — at least one per feature)
1. [What should NOT happen] — [how to trigger it]
– Expected error: [what error message/code]
– State unchanged: [what to verify did NOT change]
“`
**Be critical** — include edge cases, error paths, and security checks. Every scenario MUST specify what screenshots to take and what state to verify.
## Step 3: Environment setup
### 3a. Copy .env files from the root worktree
The root worktree (`$REPO_ROOT`) has the canonical `.env` files with all API keys. Copy them to the target worktree:
“`bash
# CRITICAL: .env files are NOT checked into git. They must be copied manually.
cp $REPO_ROOT/autogpt_platform/.env $PLATFORM_DIR/.env
cp $REPO_ROOT/autogpt_platform/backend/.env $BACKEND_DIR/.env
cp $REPO_ROOT/autogpt_platform/frontend/.env $FRONTEND_DIR/.env
“`
### 3b. Configure copilot authentication
The copilot needs an LLM API to function. Two approaches (try subscription first):
#### Option 1: Subscription mode (preferred — uses your Claude Max/Pro subscription)
The `claude_agent_sdk` Python package **bundles its own Claude CLI binary** — no need to install `@anthropic-ai/claude-code` via npm. The backend auto-provisions credentials from environment variables on startup.
Run the helper script to extract tokens from your host and auto-update `backend/.env` (works on macOS, Linux, and Windows/WSL):
“`bash
# Extracts OAuth tokens and writes CLAUDE_CODE_OAUTH_TOKEN + CLAUDE_CODE_REFRESH_TOKEN into .env
bash $BACKEND_DIR/scripts/refresh_claude_token.sh –env-file $BACKEND_DIR/.env
“`
**How it works:** The script reads the OAuth token from:
– **macOS**: system keychain (`"Claude Code-credentials"`)
– **Linux/WSL**: `~/.claude/.credentials.json`
– **Windows**: `%APPDATA%/claude/.credentials.json`
It sets `CLAUDE_CODE_OAUTH_TOKEN`, `CLAUDE_CODE_REFRESH_TOKEN`, and `CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=true` in the `.env` file. On container startup, the backend auto-provisions `~/.claude/.credentials.json` inside the container from these env vars. The SDK's bundled CLI then authenticates using that file. No `claude login`, no npm install needed.
**Note:** The OAuth token expires (~24h). If copilot returns auth errors, re-run the script and restart: `$BACKEND_DIR/scripts/refresh_claude_token.sh –env-file $BACKEND_DIR/.env && docker compose up -d copilot_executor`
#### Option 2: OpenRouter API key mode (fallback)
If subscription mode doesn't work, switch to API key mode using OpenRouter:
“`bash
# In $BACKEND_DIR/.env, ensure these are set:
CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=false
CHAT_API_KEY=<value of OPEN_ROUTER_API_KEY from the same .env>
CHAT_BASE_URL=https://openrouter.ai/api/v1
CHAT_USE_CLAUDE_AGENT_SDK=true
“`
Use `sed` to update these values:
“`bash
ORKEY=$(grep "^OPEN_ROUTER_API_KEY=" $BACKEND_DIR/.env | cut -d= -f2)
[ -n "$ORKEY" ] || { echo "ERROR: OPEN_ROUTER_API_KEY is missing in $BACKEND_DIR/.env"; exit 1; }
perl -i -pe 's/CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=true/CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=false/' $BACKEND_DIR/.env
# Add or update CHAT_API_KEY and CHAT_BASE_URL
grep -q "^CHAT_API_KEY=" $BACKEND_DIR/.env && perl -i -pe "s|^CHAT_API_KEY=.*|CHAT_API_KEY=$ORKEY|" $BACKEND_DIR/.env || echo "CHAT_API_KEY=$ORKEY" >> $BACKEND_DIR/.env
grep -q "^CHAT_BASE_URL=" $BACKEND_DIR/.env && perl -i -pe 's|^CHAT_BASE_URL=.*|CHAT_BASE_URL=https://openrouter.ai/api/v1|' $BACKEND_DIR/.env || echo "CHAT_BASE_URL=https://openrouter.ai/api/v1" >> $BACKEND_DIR/.env
“`
### 3c. Stop conflicting containers
“`bash
# Stop any running app containers (keep infra: supabase, redis, rabbitmq, clamav)
docker ps –format "{{.Names}}" | grep -E "rest_server|executor|copilot|websocket|database_manager|scheduler|notification|frontend|migrate" | while read name; do
docker stop "$name" 2>/dev/null
done
“`
### 3e. Build and start
“`bash
cd $PLATFORM_DIR && docker compose build –no-cache 2>&1 | tail -20
if [ ${PIPESTATUS[0]} -ne 0 ]; then echo "ERROR: Docker build failed"; exit 1; fi
cd $PLATFORM_DIR && docker compose up -d 2>&1 | tail -20
if [ ${PIPESTATUS[0]} -ne 0 ]; then echo "ERROR: Docker compose up failed"; exit 1; fi
“`
**Note:** If the container appears to be running old code (e.g. missing PR changes), use `docker compose build –no-cache` to force a full rebuild. Docker BuildKit may sometimes reuse cached `COPY` layers from a previous build on a different branch.
**Expected time: 3-8 minutes** for build, 5-10 minutes with `–no-cache`.
### 3f. Wait for services to be ready
“`bash
# Poll until backend and frontend respond
for i in $(seq 1 60); do
BACKEND=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8006/docs 2>/dev/null)
FRONTEND=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:3000 2>/dev/null)
if [ "$BACKEND" = "200" ] && [ "$FRONTEND" = "200" ]; then
echo "Services ready"
break
fi
sleep 5
done
“`
### 3h. Create test user and get auth token
“`bash
ANON_KEY=$(grep "NEXT_PUBLIC_SUPABASE_ANON_KEY=" $FRONTEND_DIR/.env | sed 's/.*NEXT_PUBLIC_SUPABASE_ANON_KEY=//' | tr -d '[:space:]')
# Signup (idempotent — returns "User already registered" if exists)
RESULT=$(curl -s -X POST 'http://localhost:8000/auth/v1/signup' \
-H "apikey: $ANON_KEY" \
-H 'Content-Type: application/json' \
-d '{"email":"[email protected]","password":"testtest123"}')
# If "Database error finding user", restart supabase-auth and retry
if echo "$RESULT" | grep -q "Database error"; then
docker restart supabase-auth && sleep 5
curl -s -X POST 'http://localhost:8000/auth/v1/signup' \
-H "apikey: $ANON_KEY" \
-H 'Content-Type: application/json' \
-d '{"email":"[email protected]","password":"testtest123"}'
fi
# Get auth token
TOKEN=$(curl -s -X POST 'http://localhost:8000/auth/v1/token?grant_type=password' \
-H "apikey: $ANON_KEY" \
-H 'Content-Type: application/json' \
-d '{"email":"[email protected]","password":"testtest123"}' | jq -r '.access_token // ""')
“`
**Use this token for ALL API calls:**
“`bash
curl -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/…
“`
### 3i. Disable onboarding for test user
The frontend redirects to `/onboarding` when the `VISIT_COPILOT` step is not in `completedSteps`.
Mark it complete via the backend API so every browser test lands on the real feature UI:
“`bash
ONBOARDING_RESULT=$(curl -s –max-time 30 -X POST \
"http://localhost:8006/api/onboarding/step?step=VISIT_COPILOT" \
-H "Authorization: Bearer $TOKEN")
echo "Onboarding bypass: $ONBOARDING_RESULT"
# Verify it took effect
ONBOARDING_STATUS=$(curl -s –max-time 30 \
"http://localhost:8006/api/onboarding/completed" \
-H "Authorization: Bearer $TOKEN" | jq -r '.is_completed')
echo "Onboarding completed: $ONBOARDING_STATUS"
if [ "$ONBOARDING_STATUS" != "true" ]; then
echo "ERROR: onboarding bypass failed — browser tests will hit /onboarding instead of the target feature. Investigate before proceeding."
exit 1
fi
“`
## Step 4: Run tests
### Service ports reference
| Service | Port | URL |
|———|——|—–|
| Frontend | 3000 | http://localhost:3000 |
| Backend REST | 8006 | http://localhost:8006 |
| Supabase Auth (via Kong) | 8000 | http://localhost:8000 |
| Executor | 8002 | http://localhost:8002 |
| Copilot Executor | 8008 | http://localhost:8008 |
| WebSocket | 8001 | http://localhost:8001 |
| Database Manager | 8005 | http://localhost:8005 |
| Redis | 6379 | localhost:6379 |
| RabbitMQ | 5672 | localhost:5672 |
### API testing
Use `curl` with the auth token for backend API tests. **For EVERY API call that changes state, record before/after values:**
“`bash
# Example: List agents
curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/graphs | jq . | head -20
# Example: Create an agent
curl -s -X POST http://localhost:8006/api/graphs \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{…}' | jq .
# Example: Run an agent
curl -s -X POST "http://localhost:8006/api/graphs/{graph_id}/execute" \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"data": {…}}'
# Example: Get execution results
curl -s -H "Authorization: Bearer $TOKEN" \
"http://localhost:8006/api/graphs/{graph_id}/executions/{exec_id}" | jq .
“`
**State verification pattern (use for EVERY state-changing API call):**
“`bash
# 1. Record BEFORE state
BEFORE_STATE=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/{resource} | jq '{relevant_fields}')
echo "BEFORE: $BEFORE_STATE"
# 2. Perform the action
ACTION_RESULT=$(curl -s -X POST … | jq .)
echo "ACTION RESULT: $ACTION_RESULT"
# 3. Record AFTER state
AFTER_STATE=$(curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8006/api/{resource} | jq '{relevant_fields}')
echo "AFTER: $AFTER_STATE"
# 4. Log the comparison
echo "=== STATE CHANGE VERIFICATION ==="
echo "Before: $BEFORE_STATE"
echo "After: $AFTER_STATE"
echo "Expected change: {describe what should have changed}"
“`
### Browser testing with agent-browser
“`bash
# Close any existing session
agent-browser close 2>/dev/null || true
# Use –session-name to persist cookies across navigations
# This means login only needs to happen once per test session
agent-browser –session-name pr-test open 'http://localhost:3000/login' –timeout 15000
# Get interactive elements
agent-browser –session-name pr-test snapshot | grep "textbox\|button"
# Login
agent-browser –session-name pr-test fill {email_ref} "[email protected]"
agent-browser –session-name pr-test fill {password_ref} "testtest123"
agent-browser –session-name pr-test click {login_button_ref}
sleep 5
# Dismiss cookie banner if present
agent-browser –session-name pr-test click 'text=Accept All' 2>/dev/null || true
# Navigate — cookies are preserved so login persists
agent-browser –session-name pr-test open 'http://localhost:3000/copilot' –timeout 10000
# Take screenshot
agent-browser –session-name pr-test screenshot $RESULTS_DIR/01-page.png
# Interact with elements
agent-browser –session-name pr-test fill {ref} "text"
agent-browser –session-name pr-test press "Enter"
agent-browser –session-name pr-test click {ref}
agent-browser –session-name pr-test click 'text=Button Text'
# Read page content
agent-browser –session-name pr-test snapshot | grep "text:"
“`
**Key pages:**
– `/copilot` — CoPilot chat (for testing copilot features)
– `/build` — Agent builder (for testing block/node features)
– `/build?flowID={id}` — Specific agent in builder
– `/library` — Agent library (for testing listing/import features)
– `/library/agents/{id}` — Agent detail with run history
– `/marketplace` — Marketplace
### Checking logs
“`bash
# Backend REST server
docker logs autogpt_platform-rest_server-1 2>&1 | tail -30
# Executor (runs agent graphs)
docker logs autogpt_platform-executor-1 2>&1 | tail -30
# Copilot executor (runs copilot chat sessions)
docker logs autogpt_platform-copilot_executor-1 2>&1 | tail -30
# Frontend
docker logs autogpt_platform-frontend-1 2>&1 | tail -30
# Filter for errors
docker logs autogpt_platform-executor-1 2>&1 | grep -i "error\|exception\|traceback" | tail -20
“`
### Copilot chat testing
The copilot uses SSE streaming. To test via API:
“`bash
# Create a session
SESSION_ID=$(curl -s -X POST 'http://localhost:8006/api/chat/sessions' \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{}' | jq -r '.id // .session_id // ""')
# Stream a message (SSE – will stream chunks)
curl -N -X POST "http://localhost:8006/api/chat/sessions/$SESSION_ID/stream" \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"message": "Hello, what can you help me with?"}' \
–max-time 60 2>/dev/null | head -50
“`
Or test via browser (preferred for UI verification):
“`bash
agent-browser –session-name pr-test open 'http://localhost:3000/copilot' –timeout 10000
# … fill chat input and press Enter, wait 20-30s for response
“`
## Step 5: Record results and take screenshots
**Take a screenshot at EVERY significant test step** — before and after interactions, on success, and on failure. This is NON-NEGOTIABLE.
**Required screenshot pattern for each test scenario:**
“`bash
# BEFORE the action
agent-browser –session-name pr-test screenshot $RESULTS_DIR/{NN}-{scenario}-before.png
# Perform the action…
# AFTER the action
agent-browser –session-name pr-test screenshot $RESULTS_DIR/{NN}-{scenario}-after.png
“`
**Naming convention:**
“`bash
# Examples:
# $RESULTS_DIR/01-login-page-before.png
# $RESULTS_DIR/02-login-page-after.png
# $RESULTS_DIR/03-credits-page-before.png
# $RESULTS_DIR/04-credits-purchase-after.png
# $RESULTS_DIR/05-negative-insufficient-credits.png
# $RESULTS_DIR/06-error-state.png
“`
**Minimum requirements:**
– At least TWO screenshots per test scenario (before + after)
– At least ONE screenshot for each negative test case showing the error state
– If a test fails, screenshot the failure state AND any error logs visible in the UI
## Step 6: Show results to user with screenshots
**CRITICAL: After all tests complete, you MUST show every screenshot to the user using the Read tool, with an explanation of what each screenshot shows.** This is the most important part of the test report — the user needs to visually verify the results.
For each screenshot:
1. Use the `Read` tool to display the PNG file (Claude can read images)
2. Write a 1-2 sentence explanation below it describing:
– What page/state is being shown
– What the screenshot proves (which test scenario it validates)
– Any notable details visible in the UI
Format the output like this:
“`markdown
### Screenshot 1: {descriptive title}
[Read the PNG file here]
**What it shows:** {1-2 sentence explanation of what this screenshot proves}
—
“`
After showing all screenshots, output a **detailed** summary table:
| # | Scenario | Result | API Evidence | Screenshot Evidence |
|—|———-|——–|————-|——————-|
| 1 | {name} | PASS/FAIL | Before: X, After: Y | 01-before.png, 02-after.png |
| 2 | … | … | … | … |
**IMPORTANT:** As you show each screenshot and record test results, persist them in shell variables for Step 7:
“`bash
# Build these variables during Step 6 — they are required by Step 7's script
# NOTE: declare -A requires Bash 4.0+. This is standard on modern systems (macOS ships zsh
# but Homebrew bash is 5.x; Linux typically has bash 5.x). If running on Bash <4, use a
# plain variable with a lookup function instead.
declare -A SCREENSHOT_EXPLANATIONS=(
["01-login-page.png"]="Shows the login page loaded successfully with SSO options visible."
["02-builder-with-block.png"]="The builder canvas displays the newly added block connected to the trigger."
# … one entry per screenshot, using the same explanations you showed the user above
)
TEST_RESULTS_TABLE="| 1 | Login flow | PASS | N/A | 01-login-before.png, 02-login-after.png |
| 2 | Credits purchase | PASS | Before: 100, After: 95 | 03-credits-before.png, 04-credits-after.png |
| 3 | Insufficient credits (negative) | PASS | Credits: 0, rejected | 05-insufficient-credits-error.png |"
# … one row per test scenario with actual results
“`
## Step 7: Post test report as PR comment with screenshots
Upload screenshots to the PR using the GitHub Git API (no local git operations — safe for worktrees), then post a comment with inline images and per-screenshot explanations.
**This step is MANDATORY. Every test run MUST post a PR comment with screenshots. No exceptions.**
**CRITICAL — NEVER post a bare directory link like `https://github.com/…/tree/…`.** Every screenshot MUST appear as `` inline in the PR comment so reviewers can see them without clicking any links. After posting, the verification step below greps the comment for `![` tags and exits 1 if none are found — the test run is considered incomplete until this passes.
“`bash
# Upload screenshots via GitHub Git API (creates blobs, tree, commit, and ref remotely)
REPO="Significant-Gravitas/AutoGPT"
SCREENSHOTS_BRANCH="test-screenshots/pr-${PR_NUMBER}"
SCREENSHOTS_DIR="test-screenshots/PR-${PR_NUMBER}"
# Step 1: Create blobs for each screenshot and build tree JSON
# Retry each blob upload up to 3 times. If still failing, list them at end of report.
shopt -s nullglob
SCREENSHOT_FILES=("$RESULTS_DIR"/*.png)
if [ ${#SCREENSHOT_FILES[@]} -eq 0 ]; then
echo "ERROR: No screenshots found in $RESULTS_DIR. Test run is incomplete."
exit 1
fi
TREE_JSON='['
FIRST=true
FAILED_UPLOADS=()
for img in "${SCREENSHOT_FILES[@]}"; do
BASENAME=$(basename "$img")
B64=$(base64 < "$img")
BLOB_SHA=""
for attempt in 1 2 3; do
BLOB_SHA=$(gh api "repos/${REPO}/git/blobs" -f content="$B64" -f encoding="base64" –jq '.sha' 2>/dev/null || true)
[ -n "$BLOB_SHA" ] && break
sleep 1
done
if [ -z "$BLOB_SHA" ]; then
FAILED_UPLOADS+=("$img")
continue
fi
if [ "$FIRST" = true ]; then FIRST=false; else TREE_JSON+=','; fi
TREE_JSON+="{\"path\":\"${SCREENSHOTS_DIR}/${BASENAME}\",\"mode\":\"100644\",\"type\":\"blob\",\"sha\":\"${BLOB_SHA}\"}"
done
TREE_JSON+=']'
# Step 2: Create tree, commit, and branch ref
TREE_SHA=$(echo "$TREE_JSON" | jq -c '{tree: .}' | gh api "repos/${REPO}/git/trees" –input – –jq '.sha')
# Resolve parent commit so screenshots are chained, not orphan root commits
PARENT_SHA=$(gh api "repos/${REPO}/git/refs/heads/${SCREENSHOTS_BRANCH}" –jq '.object.sha' 2>/dev/null || echo "")
if [ -n "$PARENT_SHA" ]; then
COMMIT_SHA=$(gh api "repos/${REPO}/git/commits" \
-f message="test: add E2E test screenshots for PR #${PR_NUMBER}" \
-f tree="$TREE_SHA" \
-f "parents[]=$PARENT_SHA" \
–jq '.sha')
else
COMMIT_SHA=$(gh api "repos/${REPO}/git/commits" \
-f message="test: add E2E test screenshots for PR #${PR_NUMBER}" \
-f tree="$TREE_SHA" \
–jq '.sha')
fi
gh api "repos/${REPO}/git/refs" \
-f ref="refs/heads/${SCREENSHOTS_BRANCH}" \
-f sha="$COMMIT_SHA" 2>/dev/null \
|| gh api "repos/${REPO}/git/refs/heads/${SCREENSHOTS_BRANCH}" \
-X PATCH -f sha="$COMMIT_SHA" -F force=true
“`
Then post the comment with **inline images AND explanations for each screenshot**:
“`bash
REPO_URL="https://raw.githubusercontent.com/${REPO}/${SCREENSHOTS_BRANCH}"
# Build image markdown using uploaded image URLs; skip FAILED_UPLOADS (listed separately)
IMAGE_MARKDOWN=""
for img in "${SCREENSHOT_FILES[@]}"; do
BASENAME=$(basename "$img")
TITLE=$(echo "${BASENAME%.png}" | sed 's/^[0-9]*-//' | sed 's/-/ /g' | awk '{for(i=1;i<=NF;i++) $i=toupper(substr($i,1,1)) tolower(substr($i,2))}1')
# Skip images that failed to upload — they will be listed at the end
IS_FAILED=false
for failed in "${FAILED_UPLOADS[@]}"; do
[ "$(basename "$failed")" = "$BASENAME" ] && IS_FAILED=true && break
done
if [ "$IS_FAILED" = true ]; then
continue
fi
EXPLANATION="${SCREENSHOT_EXPLANATIONS[$BASENAME]}"
if [ -z "$EXPLANATION" ]; then
echo "ERROR: Missing screenshot explanation for $BASENAME. Add it to SCREENSHOT_EXPLANATIONS in Step 6."
exit 1
fi
IMAGE_MARKDOWN="${IMAGE_MARKDOWN}
### ${TITLE}

${EXPLANATION}
"
done
# Write comment body to file to avoid shell interpretation issues with special characters
COMMENT_FILE=$(mktemp)
# If any uploads failed, append a section listing them with instructions
FAILED_SECTION=""
if [ ${#FAILED_UPLOADS[@]} -gt 0 ]; then
FAILED_SECTION="
## ⚠️ Failed Screenshot Uploads
The following screenshots could not be uploaded via the GitHub API after 3 retries.
**To add them:** drag-and-drop or paste these files into a PR comment manually:
"
for failed in "${FAILED_UPLOADS[@]}"; do
FAILED_SECTION="${FAILED_SECTION}
– \`$(basename "$failed")\` (local path: \`$failed\`)"
done
FAILED_SECTION="${FAILED_SECTION}
**Run status:** INCOMPLETE until the files above are manually attached and visible inline in the PR."
fi
cat > "$COMMENT_FILE" <<INNEREOF
## E2E Test Report
| # | Scenario | Result | API Evidence | Screenshot Evidence |
|—|———-|——–|————-|——————-|
${TEST_RESULTS_TABLE}
${IMAGE_MARKDOWN}
${FAILED_SECTION}
INNEREOF
gh api "repos/${REPO}/issues/$PR_NUMBER/comments" -F body=@"$COMMENT_FILE"
rm -f "$COMMENT_FILE"
# Verify the posted comment contains inline images — exit 1 if none found
# Use separate –paginate + jq pipe: –jq applies per-page, not to the full list
LAST_COMMENT=$(gh api "repos/${REPO}/issues/$PR_NUMBER/comments" –paginate 2>/dev/null | jq -r '.[-1].body // ""')
if ! echo "$LAST_COMMENT" | grep -q '!\['; then
echo "ERROR: Posted comment contains no inline images (![). Bare directory links are not acceptable." >&2
exit 1
fi
echo "✓ Inline images verified in posted comment"
“`
**The PR comment MUST include:**
1. A summary table of all scenarios with PASS/FAIL and before/after API evidence
2. Every successfully uploaded screenshot rendered inline; any failed uploads listed with manual attachment instructions
3. A 1-2 sentence explanation below each screenshot describing what it proves
This approach uses the GitHub Git API to create blobs, trees, commits, and refs entirely server-side. No local `git checkout` or `git push` — safe for worktrees and won't interfere with the PR branch.
## Step 8: Evaluate and post a formal PR review
After the test comment is posted, evaluate whether the run was thorough enough to make a merge decision, then post a formal GitHub review (approve or request changes). **This step is mandatory — every test run MUST end with a formal review decision.**
### Evaluation criteria
Re-read the PR description:
“`bash
gh pr view "$PR_NUMBER" –json body –jq '.body' –repo "$REPO"
“`
Score the run against each criterion:
| Criterion | Pass condition |
|———–|—————|
| **Coverage** | Every feature/change described in the PR has at least one test scenario |
| **All scenarios pass** | No FAIL rows in the results table |
| **Negative tests** | At least one failure-path test per feature (invalid input, unauthorized, edge case) |
| **Before/after evidence** | Every state-changing API call has before/after values logged |
| **Screenshots are meaningful** | Screenshots show the actual state change, not just a loading spinner or blank page |
| **No regressions** | Existing core flows (login, agent create/run) still work |
### Decision logic
“`
ALL criteria pass → APPROVE
Any scenario FAIL or missing PR feature → REQUEST_CHANGES (list gaps)
Evidence weak (no before/after, vague shots) → REQUEST_CHANGES (list what's missing)
“`
### Post the review
“`bash
REVIEW_FILE=$(mktemp)
# Count results
PASS_COUNT=$(echo "$TEST_RESULTS_TABLE" | grep -c "PASS" || true)
FAIL_COUNT=$(echo "$TEST_RESULTS_TABLE" | grep -c "FAIL" || true)
TOTAL=$(( PASS_COUNT + FAIL_COUNT ))
# List any coverage gaps found during evaluation (populate this array as you assess)
# e.g. COVERAGE_GAPS=("PR claims to add X but no test covers it")
COVERAGE_GAPS=()
“`
**If APPROVING** — all criteria met, zero failures, full coverage:
“`bash
cat > "$REVIEW_FILE" <<REVIEWEOF
## E2E Test Evaluation — APPROVED
**Results:** ${PASS_COUNT}/${TOTAL} scenarios passed.
**Coverage:** All features described in the PR were exercised.
**Evidence:** Before/after API values logged for all state-changing operations; screenshots show meaningful state transitions.
**Negative tests:** Failure paths tested for each feature.
No regressions observed on core flows.
REVIEWEOF
gh pr review "$PR_NUMBER" –repo "$REPO" –approve –body "$(cat "$REVIEW_FILE")"
echo "✅ PR approved"
“`
**If REQUESTING CHANGES** — any failure, coverage gap, or missing evidence:
“`bash
FAIL_LIST=$(echo "$TEST_RESULTS_TABLE" | grep "FAIL" | awk -F'|' '{print "- Scenario" $2 "failed"}' || true)
cat > "$REVIEW_FILE" <<REVIEWEOF
## E2E Test Evaluation — Changes Requested
**Results:** ${PASS_COUNT}/${TOTAL} scenarios passed, ${FAIL_COUNT} failed.
### Required before merge
${FAIL_LIST}
$(for gap in "${COVERAGE_GAPS[@]}"; do echo "- $gap"; done)
Please fix the above and re-run the E2E tests.
REVIEWEOF
gh pr review "$PR_NUMBER" –repo "$REPO" –request-changes –body "$(cat "$REVIEW_FILE")"
echo "❌ Changes requested"
“`
“`bash
rm -f "$REVIEW_FILE"
“`
**Rules:**
– In `–fix` mode, fix all failures before posting the review — the review reflects the final state after fixes
– Never approve if any scenario failed, even if it seems like a flake — rerun that scenario first
– Never request changes for issues already fixed in this run
## Fix mode (–fix flag)
When `–fix` is present, the standard is HIGHER. Do not just note issues — FIX them immediately.
### Fix protocol for EVERY issue found (including UX issues):
1. **Identify** the root cause in the code — read the relevant source files
2. **Write a failing test first** (TDD): For backend bugs, write a test marked with `pytest.mark.xfail(reason="…")`. For frontend/Playwright bugs, write a test with `.fixme` annotation. Run it to confirm it fails as expected.
3. **Screenshot** the broken state: `agent-browser screenshot $RESULTS_DIR/{NN}-broken-{description}.png`
4. **Fix** the code in the worktree
5. **Rebuild** ONLY the affected service (not the whole stack):
“`bash
cd $PLATFORM_DIR && docker compose up –build -d {service_name}
# e.g., docker compose up –build -d rest_server
# e.g., docker compose up –build -d frontend
“`
6. **Wait** for the service to be ready (poll health endpoint)
7. **Re-test** the same scenario
8. **Screenshot** the fixed state: `agent-browser screenshot $RESULTS_DIR/{NN}-fixed-{description}.png`
9. **Remove the xfail/fixme marker** from the test written in step 2, and verify it passes
10. **Verify** the fix did not break other scenarios (run a quick smoke test)
11. **Commit and push** immediately:
“`bash
cd $WORKTREE_PATH
git add -A
git commit -m "fix: {description of fix}"
git push
“`
12. **Continue** to the next test scenario
### Fix loop (like pr-address)
“`text
test scenario → find issue (bug OR UX problem) → screenshot broken state
→ fix code → rebuild affected service only → re-test → screenshot fixed state
→ verify no regressions → commit + push
→ repeat for next scenario
→ after ALL scenarios pass, run full re-test to verify everything together
“`
**Key differences from non-fix mode:**
– UX issues count as bugs — fix them (bad alignment, confusing labels, missing loading states)
– Every fix MUST have a before/after screenshot pair proving it works
– Commit after EACH fix, not in a batch at the end
– The final re-test must produce a clean set of all-passing screenshots
## Known issues and workarounds
### Problem: "Database error finding user" on signup
**Cause:** Supabase auth service schema cache is stale after migration.
**Fix:** `docker restart supabase-auth && sleep 5` then retry signup.
### Problem: Copilot returns auth errors in subscription mode
**Cause:** `CHAT_USE_CLAUDE_CODE_SUBSCRIPTION=true` but `CLAUDE_CODE_OAUTH_TOKEN` is not set or expired.
**Fix:** Re-extract the OAuth token from macOS keychain (see step 3b, Option 1) and recreate the container (`docker compose up -d copilot_executor`). The backend auto-provisions `~/.claude/.credentials.json` from the env var on startup. No `npm install` or `claude login` needed — the SDK bundles its own CLI binary.
### Problem: agent-browser can't find chromium
**Cause:** The Dockerfile auto-provisions system chromium on all architectures (including ARM64). If your branch is behind `dev`, this may not be present yet.
**Fix:** Check if chromium exists: `which chromium || which chromium-browser`. If missing, install it: `apt-get install -y chromium` and set `AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/chromium` in the container environment.
### Problem: agent-browser selector matches multiple elements
**Cause:** `text=X` matches all elements containing that text.
**Fix:** Use `agent-browser snapshot` to get specific `ref=eNN` references, then use those: `agent-browser click eNN`.
### Problem: Frontend shows cookie banner blocking interaction
**Fix:** `agent-browser click 'text=Accept All'` before other interactions.
### Problem: Container loses npm packages after rebuild
**Cause:** `docker compose up –build` rebuilds the image, losing runtime installs.
**Fix:** Add packages to the Dockerfile instead of installing at runtime.
### Problem: Services not starting after `docker compose up`
**Fix:** Wait and check health: `docker compose ps`. Common cause: migration hasn't finished. Check: `docker logs autogpt_platform-migrate-1 2>&1 | tail -5`. If supabase-db isn't healthy: `docker restart supabase-db && sleep 10`.
### Problem: Docker uses cached layers with old code (PR changes not visible)
**Cause:** `docker compose up –build` reuses cached `COPY` layers from previous builds. If the PR branch changes Python files but the previous build already cached that layer from `dev`, the container runs `dev` code.
**Fix:** Always use `docker compose build –no-cache` for the first build of a PR branch. Subsequent rebuilds within the same branch can use `–build`.
### Problem: `agent-browser open` loses login session
**Cause:** Without session persistence, `agent-browser open` starts fresh.
**Fix:** Use `–session-name pr-test` on ALL agent-browser commands. This auto-saves/restores cookies and localStorage across navigations. Alternatively, use `agent-browser eval "window.location.href = '…'"` to navigate within the same context.
### Problem: Supabase auth returns "Database error querying schema"
**Cause:** The database schema changed (migration ran) but supabase-auth has a stale schema cache.
**Fix:** `docker restart supabase-db && sleep 10 && docker restart supabase-auth && sleep 8`. If user data was lost, re-signup.