feat: code-grader plain-text fallback + workspace env preflight#1209
Merged
feat: code-grader plain-text fallback + workspace env preflight#1209
Conversation
Adds two new eval features:
**Shell grader** (`type: shell`): runs a shell command and checks its stdout.
- No `expected`: passes when exit code is 0
- `expected` with no `operator`: exact string match (trimmed stdout)
- `expected` + `operator` (>, <, >=, <=, ==, !=): numeric float comparison
**Workspace env preflight** (`workspace.env`): declares required system
dependencies that are checked once before before_all hooks run. Fails fast
with a clear diagnostic listing all missing commands/modules.
Example:
```yaml
workspace:
env:
required_commands: [ffmpeg, pandoc]
required_python_modules: [PIL, openai]
assertions:
- type: shell
command: "pdfinfo report.pdf | grep Pages | awk '{print $2}'"
operator: ">="
expected: "5"
```
Closes #1207, #1208
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deploying agentv with
|
| Latest commit: |
a5287c6
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://19f827ef.agentv.pages.dev |
| Branch Preview URL: | https://feat-1207-1208-shell-grader.agentv.pages.dev |
…1210) Per design review: the `shell` grader type violated the "audit existing primitives first" principle — `code-grader` already runs shell commands. Promptfoo solves this the same way (javascript/python fallbacks, no dedicated shell type). Remove the `shell` grader type entirely and instead extend `code-grader` to accept plain-text stdout without requiring the JSON protocol: | stdout (trimmed, case-insensitive) | score | |---|---| | empty string | 1 if exit 0, 0 if exit non-zero | | "true", "pass", "1" | 1 | | "false", "fail", "0" | 0 | | numeric string | clamped float | | anything else | 1 if exit 0, 0 if exit non-zero | Scripts that write to stderr on non-zero exit still surface as errors (existing behavior). Silent non-zero exits (e.g. `[ "$pages" -ge 5 ]`) use exit-code convention. Usage: # numeric comparison via exit code - type: code-grader command: ["bash", "-c", "[ $(pdfinfo report.pdf | grep Pages | awk '{print $2}') -ge 5 ]"] # score from stdout - type: code-grader command: ["bash", "-c", "echo 0.75"] Closes #1210 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ertion text Replace the string/numeric score interpretation with a clean two-convention model: - Exit code: 0 = score 1 (pass), non-zero = score 0 (fail) - Stdout: becomes the assertion text (human-readable context for the result) - Stderr on non-zero exit: still surfaces as an error For numeric scores or multi-aspect results, use the JSON protocol. This removes the "0"/"1"/numeric string ambiguity and aligns with how Unix tooling (bats, make, shell builtins) already signals pass/fail. Updates docs and tests to reflect the new model. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1207 and #1208.
`code-grader` plain-text fallback (#1207, #1210)
`code-grader` now works without the JSON protocol for simple pass/fail checks. The exit code determines the score and stdout becomes the assertion text:
Silent one-liners work too:
For numeric scores or multi-aspect results, the existing JSON protocol is unchanged:
```bash
echo '{"score": 0.75, "assertions": [{"text": "relevance score", "passed": true}]}'
```
Scripts writing to stderr on non-zero exit still surface as errors (existing behaviour).
Design note: initial implementation used a `shell` grader type and then a string/numeric stdout interpretation. Both were dropped after reviewing how promptfoo handles this — no framework parses plain-text strings as scores, the right boundary is exit code (binary) vs JSON (everything else). See #1210 for the design discussion.
`workspace.env` preflight checks (#1208)
Declares required system dependencies checked once before `before_all` hooks. Fails immediately with a clear diagnostic — so a 30-minute eval doesn't burn time before hitting a missing `ffmpeg`.
Red/Green UAT Evidence
#1207 Red: `bash -c "[ 14 -ge 5 ]"` as code-grader → score 0 (empty stdout wrongly failed)
#1207 Green:
#1208: eval with missing `nonexistent_command_xyz_abc` → immediate setup error before any test runs ✅
Test plan
Closes #1207
Closes #1208
Closes #1210
🤖 Generated with Claude Code