feat: code-grader plain-text fallback + workspace env preflight by christso · Pull Request #1209 · EntityProcess/agentv

christso · 2026-05-02T13:47:20Z

Summary

`code-grader` plain-text fallback (#1207, #1210)

`code-grader` now works without the JSON protocol for simple pass/fail checks. The exit code determines the score and stdout becomes the assertion text:

| Exit code | Score | Verdict |
|---|---|---|
| 0 | 1.0 | pass |
| non-zero (no stderr) | 0.0 | fail |

```bash
#!/bin/bash
# check-pages.sh
pages=$(pdfinfo report.pdf | grep Pages | awk '{print $2}')
if [ "$pages" -ge 5 ]; then
  echo "PDF has $pages pages (≥5 required)"
else
  echo "PDF has only $pages pages (<5 required)"
  exit 1
fi
```

Silent one-liners work too:

```yaml
- type: code-grader
  command: ["bash", "-c", "[ $(wc -l < output.txt) -ge 10 ]"]
```

For numeric scores or multi-aspect results, the existing JSON protocol is unchanged:
```bash
echo '{"score": 0.75, "assertions": [{"text": "relevance score", "passed": true}]}'
```
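
For a numeric score, a grader script can assemble the JSON itself. The sketch below is illustrative only: `keyword_score_json` is a hypothetical helper, the keyword check is an invented scoring rule, and keywords are assumed to be plain words that need no JSON escaping. Only the `score`, `assertions`, `text`, and `passed` fields come from the example above.

```bash
# Hypothetical grader helper (not part of agentv):
# score = fraction of the given keywords that appear in a file.
# Usage: keyword_score_json <file> <keyword>...
keyword_score_json() {
  file=$1
  shift
  total=0
  hits=0
  assertions=""
  for kw in "$@"; do
    total=$((total + 1))
    if grep -q -- "$kw" "$file"; then
      passed=true
      hits=$((hits + 1))
    else
      passed=false
    fi
    # Append one assertion object, comma-separated after the first.
    assertions="${assertions}${assertions:+, }{\"text\": \"mentions $kw\", \"passed\": $passed}"
  done
  # Shell arithmetic is integer-only, so awk does the division.
  score=$(awk -v h="$hits" -v t="$total" 'BEGIN { printf "%.2f", (t ? h / t : 0) }')
  printf '{"score": %s, "assertions": [%s]}\n' "$score" "$assertions"
}
```

A grader script would call this once and exit 0, for example `keyword_score_json output.txt ffmpeg pandoc`, leaving the partial score to the JSON rather than the exit code.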

Scripts writing to stderr on non-zero exit still surface as errors (existing behaviour).
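
The exit-code convention, including the stderr case, can be sketched as a small shell function. This is a hedged illustration, not agentv's implementation: `classify_run` and its one-line output format are hypothetical, and JSON output is deliberately left out. The default assertion texts `exit 0` and `exit 1` mirror the UAT evidence in this PR.

```bash
# Hypothetical helper (not agentv code): classify one grader run.
# Usage: classify_run <command> [args...]
# Prints: "<verdict> <score> <assertion text>"
classify_run() {
  err_file=$(mktemp) || return 2
  # Capture stdout in a variable and stderr in a temp file;
  # the && / || form records the exit code without tripping `set -e`.
  out=$("$@" 2>"$err_file") && code=0 || code=$?
  err=$(cat "$err_file")
  rm -f "$err_file"

  if [ "$code" -eq 0 ]; then
    # Exit 0 passes; stdout (or a default) becomes the assertion text.
    echo "pass 1 ${out:-exit 0}"
  elif [ -n "$err" ]; then
    # Non-zero exit with stderr output is treated as an error, not a fail.
    echo "error - $err"
  else
    # Silent or stdout-only non-zero exit is a plain fail.
    echo "fail 0 ${out:-exit $code}"
  fi
}
```

For instance, `classify_run bash -c "[ 14 -ge 5 ]"` falls into the first branch with empty stdout, which is the case the Red evidence for #1207 reports as previously failing.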

Design note: the initial implementation used a `shell` grader type, and then a string/numeric interpretation of stdout. Both were dropped after reviewing how promptfoo handles this. No framework parses plain-text strings as scores; the right boundary is exit code (binary) vs JSON (everything else). See #1210 for the design discussion.

`workspace.env` preflight checks (#1208)

`workspace.env` declares required system dependencies, which are checked once before the `before_all` hooks run. A missing dependency fails the run immediately with a clear diagnostic, so a 30-minute eval doesn't burn time before hitting a missing `ffmpeg`.

```yaml
workspace:
  env:
    required_commands: [ffmpeg, pandoc]
    required_python_modules: [PIL, openai]
```
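
A rough idea of what such a preflight amounts to, written as shell functions. This is an assumption-laden sketch rather than the code in this PR: the function names are hypothetical, and probing modules with `python3 -c "import ..."` is one plausible approach, not a confirmed detail of agentv.

```bash
# Hypothetical preflight helpers (not agentv code).

# Print the subset of the given commands that are not on PATH.
missing_commands() {
  missing=""
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="${missing}${missing:+ }$cmd"
  done
  echo "$missing"
}

# Print the subset of the given Python modules that fail to import.
# If python3 itself is absent, every module is reported as missing.
missing_python_modules() {
  missing=""
  for mod in "$@"; do
    python3 -c "import $mod" >/dev/null 2>&1 || missing="${missing}${missing:+ }$mod"
  done
  echo "$missing"
}

# $1: space-separated commands, $2: space-separated Python modules.
# Collects every missing dependency before failing, so the diagnostic
# lists them all at once.
preflight() {
  # Word splitting on $1 and $2 is intentional here.
  bad_cmds=$(missing_commands $1)
  bad_mods=$(missing_python_modules $2)
  if [ -n "$bad_cmds$bad_mods" ]; then
    echo "preflight failed: missing commands [$bad_cmds], missing python modules [$bad_mods]" >&2
    return 1
  fi
}
```

With the config above, the equivalent call would be `preflight "ffmpeg pandoc" "PIL openai"`, run once before any hook or test starts.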

Red/Green UAT Evidence

#1207 Red: `bash -c "[ 14 -ge 5 ]"` as code-grader → score 0 (empty stdout wrongly failed)

#1207 Green:

✅ exit-code-pass   | score 1, assertion "exit 0"
⚠️ exit-code-fail   | score 0, assertion "exit 1"
✅ stdout-pass      | score 1, assertion "PDF has 14 pages (≥5 required)"
⚠️ stdout-fail      | score 0, assertion "PDF has only 3 pages (<5 required)"

#1208: eval with missing `nonexistent_command_xyz_abc` → immediate setup error before any test runs ✅

Test plan

8 unit tests for `code-grader` plain-text fallback
All 2314 tests pass
CI green ✅

Closes #1207
Closes #1208
Closes #1210

🤖 Generated with Claude Code

Adds two new eval features:

**Shell grader** (`type: shell`): runs a shell command and checks its stdout.
- No `expected`: passes when exit code is 0
- `expected` with no `operator`: exact string match (trimmed stdout)
- `expected` + `operator` (>, <, >=, <=, ==, !=): numeric float comparison

**Workspace env preflight** (`workspace.env`): declares required system dependencies that are checked once before before_all hooks run. Fails fast with a clear diagnostic listing all missing commands/modules.

Example:
```yaml
workspace:
  env:
    required_commands: [ffmpeg, pandoc]
    required_python_modules: [PIL, openai]

assertions:
  - type: shell
    command: "pdfinfo report.pdf | grep Pages | awk '{print $2}'"
    operator: ">="
    expected: "5"
```

Closes #1207, #1208

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-05-02T13:47:50Z

Deploying agentv with Cloudflare Pages

Latest commit:	`a5287c6`
Status:	✅ Deploy successful!
Preview URL:	https://19f827ef.agentv.pages.dev
Branch Preview URL:	https://feat-1207-1208-shell-grader.agentv.pages.dev


…1210)

Per design review: the `shell` grader type violated the "audit existing primitives first" principle, since `code-grader` already runs shell commands. Promptfoo solves this the same way (javascript/python fallbacks, no dedicated shell type).

Remove the `shell` grader type entirely and instead extend `code-grader` to accept plain-text stdout without requiring the JSON protocol:

| stdout (trimmed, case-insensitive) | score |
|---|---|
| empty string | 1 if exit 0, 0 if exit non-zero |
| "true", "pass", "1" | 1 |
| "false", "fail", "0" | 0 |
| numeric string | clamped float |
| anything else | 1 if exit 0, 0 if exit non-zero |

Scripts that write to stderr on non-zero exit still surface as errors (existing behavior). Silent non-zero exits (e.g. `[ "$pages" -ge 5 ]`) use exit-code convention.

Usage:

    # numeric comparison via exit code
    - type: code-grader
      command: ["bash", "-c", "[ $(pdfinfo report.pdf | grep Pages | awk '{print $2}') -ge 5 ]"]

    # score from stdout
    - type: code-grader
      command: ["bash", "-c", "echo 0.75"]

Closes #1210

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ertion text

Replace the string/numeric score interpretation with a clean two-convention model:
- Exit code: 0 = score 1 (pass), non-zero = score 0 (fail)
- Stdout: becomes the assertion text (human-readable context for the result)
- Stderr on non-zero exit: still surfaces as an error

For numeric scores or multi-aspect results, use the JSON protocol.

This removes the "0"/"1"/numeric string ambiguity and aligns with how Unix tooling (bats, make, shell builtins) already signals pass/fail.

Updates docs and tests to reflect the new model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

christso and others added 2 commits May 2, 2026 15:42

fix: resolve lint errors in shell grader and targets-validator imports

6c63a36

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

christso mentioned this pull request May 3, 2026

feat: extend code-grader to accept plain-text and exit-code output #1210

Closed

christso and others added 2 commits May 3, 2026 06:21

style: fix biome formatting in code-grader

54b2032

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

christso changed the title ~~feat: shell grader + workspace env preflight checks~~ feat: code-grader plain-text fallback + workspace env preflight May 3, 2026

christso and others added 2 commits May 4, 2026 04:33

style: fix biome formatting

a5287c6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

christso merged commit 5ae93a3 into main May 4, 2026
4 checks passed

christso deleted the feat/1207-1208-shell-grader-preflight branch May 4, 2026 03:09

christso merged 6 commits into main from feat/1207-1208-shell-grader-preflight
