feat: code-grader plain-text fallback + workspace env preflight #1209

Merged: christso merged 6 commits into main from feat/1207-1208-shell-grader-preflight on May 4, 2026

christso (Collaborator) commented May 2, 2026

Summary

Fixes #1207 and #1208.

`code-grader` plain-text fallback (#1207, #1210)

`code-grader` now works without the JSON protocol for simple pass/fail checks. The exit code determines the score and stdout becomes the assertion text:

| Exit code | Score | Verdict |
|---|---|---|
| 0 | 1.0 | pass |
| non-zero (no stderr) | 0.0 | fail |

```bash
#!/bin/bash
# check-pages.sh
pages=$(pdfinfo report.pdf | grep Pages | awk '{print $2}')
if [ "$pages" -ge 5 ]; then
  echo "PDF has $pages pages (≥5 required)"
else
  echo "PDF has only $pages pages (<5 required)"
  exit 1
fi
```

Silent one-liners work too:

```yaml
- type: code-grader
  command: ["bash", "-c", "[ $(wc -l < output.txt) -ge 10 ]"]
```

For numeric scores or multi-aspect results, the existing JSON protocol is unchanged:
```bash
echo '{"score": 0.75, "assertions": [{"text": "relevance score", "passed": true}]}'
```

Scripts writing to stderr on non-zero exit still surface as errors (existing behaviour).
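
The exit-code convention described above, including the stderr error case, can be sketched as a small Bash function. This is only an illustration under stated assumptions, not agentv's actual implementation: the name `grade_plain_text` and its one-line output format are hypothetical. The `exit <code>` fallback text for empty stdout mirrors the UAT evidence in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the plain-text fallback (not agentv's real code).
# Usage: grade_plain_text CMD [ARGS...]
# Prints "<score> <verdict> <assertion text>", or "error: <stderr>" on failure with stderr.
grade_plain_text() {
  local out err_file code text
  err_file=$(mktemp)
  # Capture stdout in $out, stderr in a temp file, and the exit code in $code.
  out=$("$@" 2>"$err_file")
  code=$?
  if [ "$code" -ne 0 ] && [ -s "$err_file" ]; then
    # Non-zero exit plus stderr output surfaces as an error, not a graded fail.
    echo "error: $(cat "$err_file")"
    rm -f "$err_file"
    return 2
  fi
  rm -f "$err_file"
  # Stdout becomes the assertion text; empty stdout falls back to "exit <code>".
  text="${out:-exit $code}"
  if [ "$code" -eq 0 ]; then
    echo "1.0 pass $text"
  else
    echo "0.0 fail $text"
  fi
}

grade_plain_text bash -c '[ 14 -ge 5 ]'   # prints "1.0 pass exit 0"
```

Numeric or multi-aspect results are out of scope for this sketch; those go through the JSON protocol.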

Design note: the initial implementation used a `shell` grader type, and a later revision interpreted string/numeric stdout as a score. Both were dropped after reviewing how promptfoo handles this: no framework parses plain-text strings as scores; the right boundary is exit code (binary) vs JSON (everything else). See #1210 for the design discussion.

`workspace.env` preflight checks (#1208)

Declares required system dependencies, which are checked once before the `before_all` hooks run. A missing dependency fails the run immediately with a clear diagnostic, so a 30-minute eval doesn't burn time before hitting a missing `ffmpeg`.

```yaml
workspace:
  env:
    required_commands: [ffmpeg, pandoc]
    required_python_modules: [PIL, openai]
```
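
For intuition, a preflight of this kind could look roughly like the Bash sketch below. It is not agentv's implementation: the function name `check_env` is hypothetical, and probing modules with `python3 -c "import ..."` is an assumption about how such a check might work. It collects every missing dependency before failing, matching the "listing all missing commands/modules" behaviour described in the commit message.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a workspace.env preflight (not agentv's real code).
# $1: space-separated required commands
# $2: space-separated required Python modules (assumed probe: python3 -c "import X")
check_env() {
  local missing=()
  local name
  for name in $1; do
    # command -v succeeds only if the command resolves on PATH (or is a builtin).
    command -v "$name" >/dev/null 2>&1 || missing+=("command: $name")
  done
  for name in $2; do
    python3 -c "import $name" >/dev/null 2>&1 || missing+=("python module: $name")
  done
  if [ "${#missing[@]}" -gt 0 ]; then
    # Report everything that is missing in one go, then fail.
    printf 'missing %s\n' "${missing[@]}" >&2
    return 1
  fi
  return 0
}

# Example call (commented out so that sourcing this file has no side effects):
# check_env "ffmpeg pandoc" "PIL openai" || exit 1
```

Reporting all gaps at once saves a fix-and-rerun cycle per missing tool.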

Red/Green UAT Evidence

#1207 Red: `bash -c "[ 14 -ge 5 ]"` as code-grader → score 0 (empty stdout wrongly failed)

#1207 Green:

✅ exit-code-pass   | score 1, assertion "exit 0"
⚠️ exit-code-fail   | score 0, assertion "exit 1"
✅ stdout-pass      | score 1, assertion "PDF has 14 pages (≥5 required)"
⚠️ stdout-fail      | score 0, assertion "PDF has only 3 pages (<5 required)"

#1208: eval with missing `nonexistent_command_xyz_abc` → immediate setup error before any test runs ✅

Test plan

  • 8 unit tests for `code-grader` plain-text fallback
  • All 2314 tests pass
  • CI green ✅

Closes #1207
Closes #1208
Closes #1210

🤖 Generated with Claude Code

christso and others added 2 commits May 2, 2026 15:42
Adds two new eval features:

**Shell grader** (`type: shell`): runs a shell command and checks its stdout.
- No `expected`: passes when exit code is 0
- `expected` with no `operator`: exact string match (trimmed stdout)
- `expected` + `operator` (>, <, >=, <=, ==, !=): numeric float comparison

**Workspace env preflight** (`workspace.env`): declares required system
dependencies that are checked once before before_all hooks run. Fails fast
with a clear diagnostic listing all missing commands/modules.

Example:
```yaml
workspace:
  env:
    required_commands: [ffmpeg, pandoc]
    required_python_modules: [PIL, openai]
assertions:
  - type: shell
    command: "pdfinfo report.pdf | grep Pages | awk '{print $2}'"
    operator: ">="
    expected: "5"
```

Closes #1207, #1208

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cloudflare-workers-and-pages (bot) commented May 2, 2026

Deploying agentv with Cloudflare Pages

Latest commit: a5287c6
Status: ✅ Deploy successful!
Preview URL: https://19f827ef.agentv.pages.dev
Branch Preview URL: https://feat-1207-1208-shell-grader.agentv.pages.dev

christso and others added 2 commits May 3, 2026 06:21
…1210)

Per design review: the `shell` grader type violated the "audit existing primitives
first" principle — `code-grader` already runs shell commands. Promptfoo solves this
the same way (javascript/python fallbacks, no dedicated shell type).

Remove the `shell` grader type entirely and instead extend `code-grader` to accept
plain-text stdout without requiring the JSON protocol:

| stdout (trimmed, case-insensitive) | score |
|---|---|
| empty string | 1 if exit 0, 0 if exit non-zero |
| "true", "pass", "1" | 1 |
| "false", "fail", "0" | 0 |
| numeric string | clamped float |
| anything else | 1 if exit 0, 0 if exit non-zero |

Scripts that write to stderr on non-zero exit still surface as errors (existing
behavior). Silent non-zero exits (e.g. `[ "$pages" -ge 5 ]`) use exit-code convention.

Usage:
  # numeric comparison via exit code
  - type: code-grader
    command: ["bash", "-c", "[ $(pdfinfo report.pdf | grep Pages | awk '{print $2}') -ge 5 ]"]

  # score from stdout
  - type: code-grader
    command: ["bash", "-c", "echo 0.75"]

Closes #1210
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

christso changed the title from "feat: shell grader + workspace env preflight checks" to "feat: code-grader plain-text fallback + workspace env preflight" on May 3, 2026
christso and others added 2 commits May 4, 2026 04:33
…ertion text

Replace the string/numeric score interpretation with a clean two-convention model:

- Exit code: 0 = score 1 (pass), non-zero = score 0 (fail)
- Stdout: becomes the assertion text (human-readable context for the result)
- Stderr on non-zero exit: still surfaces as an error

For numeric scores or multi-aspect results, use the JSON protocol.
This removes the "0"/"1"/numeric string ambiguity and aligns with how
Unix tooling (bats, make, shell builtins) already signals pass/fail.

Updates docs and tests to reflect the new model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

christso merged commit 5ae93a3 into main May 4, 2026
4 checks passed
christso deleted the feat/1207-1208-shell-grader-preflight branch May 4, 2026 03:09