
feat(openai_agents): expose real usage, response_id, plumb previous_response_id, opt-in prompt_cache_key for stateful responses and prompt caching#335

Open
x wants to merge 7 commits into main from dpeticolas/prompt-cache-upstream

Conversation


@x x commented Apr 30, 2026

Summary

Four changes to the streaming Responses API path in temporal_streaming_model.py.

My goal is to:

  • Make it easier to track prompt cache hits (usage)
  • Make it easier to get cache hits (prompt_cache_key)
  • Make it easier to use the responses API (response_id, previous_response_id)

  1. Capture real usage from ResponseCompletedEvent.response.usage and surface it in the span's output dict.

    TemporalStreamingModel.get_response was constructing a zero-filled Usage and discarding event.response.usage. It now reads usage off the streaming protocol and falls back to zeros only on the error path (stream ends without a completed event).

    The len(''.join(reasoning_contents)) // 4 reasoning-tokens estimator is dropped — the real value arrives in the API response.

    The streaming_model_get_response span now carries {input_tokens, output_tokens, total_tokens, cached_input_tokens, reasoning_tokens} so traces show cache-hit rate.

  2. Return real response_id captured from ResponseCompletedEvent.response.id instead of the previously fabricated f"resp_{uuid.uuid4().hex[:8]}".

    OpenAI Agents SDK reads this field for previous_response_id chaining (run.py:145), exposes it as RunResult.last_response_id (result.py:108), and writes it into traces (tracing/span_data.py:164). A client-side UUID has never been issued by any server, so any caller that picks it up and tries to chain (the documented use case) gets a 400 "response not found".

    Latent since this file was added (commit 2f2a6ed7) because nothing in the codebase chains previous_response_id yet.

  3. Forward previous_response_id, conversation_id, and prompt from the SDK kwarg.

    The SDK's abstract Model.get_response has previous_response_id, conversation_id, prompt as required keyword-only params; the SDK threads previous_response_id down through _ServerConversationTracker when callers set it on Runner.run / RunConfig.

    Our implementation declared **kwargs # noqa: ARG002 and silently swallowed all three. Callers who used the official chaining API got no stateful behavior and no error. Replaced with explicit named params.

    This change is not strictly required: we need to pass previous_response_id, but we could be doing that via extra_args instead. That said, the silent swallowing is sketchy. I could be convinced that the right thing to do here is to raise an exception when we see these params in kwargs and expect them in extra_args, but this felt less dangerous and more intuitive to the caller.

  4. Plumb prompt_cache_key to responses.create as an opt-in parameter.

    Callers set it via model_settings.extra_args["prompt_cache_key"]. We do not auto-inject a default. I don't trust prompt_cache_key to be standard across all OpenAI-compatible endpoints.

    When unset, the parameter resolves to NOT_GIVEN and is omitted from the request body entirely.
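The opt-in pattern in points 3 and 4 can be sketched as below. This is an illustrative stand-in, not the PR's actual code: `_NotGiven`/`NOT_GIVEN` mimic the OpenAI SDK's sentinel, and `build_responses_kwargs` is a hypothetical helper showing how `None` values stay off the wire while `prompt_cache_key` is popped from `extra_args` so it isn't passed twice.

```python
from typing import Any, Optional


class _NotGiven:
    """Stand-in for the OpenAI SDK's NOT_GIVEN sentinel."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


def _non_null_or_not_given(value: Optional[Any]) -> Any:
    # None resolves to NOT_GIVEN so the field is omitted from the request body.
    return value if value is not None else NOT_GIVEN


def build_responses_kwargs(
    extra_args: dict[str, Any],
    previous_response_id: Optional[str] = None,
) -> dict[str, Any]:
    # Pop prompt_cache_key so it isn't also forwarded inside extra_args.
    prompt_cache_key = extra_args.pop("prompt_cache_key", None)
    kwargs: dict[str, Any] = {
        "previous_response_id": _non_null_or_not_given(previous_response_id),
        "prompt_cache_key": _non_null_or_not_given(prompt_cache_key),
        **extra_args,
    }
    # Mirror the SDK's serialization: NOT_GIVEN entries never hit the wire.
    return {k: v for k, v in kwargs.items() if not isinstance(v, _NotGiven)}
```

Callers who don't opt in get an empty delta: `build_responses_kwargs({})` produces no extra fields, so the wire request is identical to today's.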

Test plan

All 38 tests in test_streaming_model.py pass locally:

  • test_usage_captured_from_completed_event
  • test_usage_falls_back_when_no_completed_event
  • test_usage_emitted_in_span_output
  • test_response_id_captured_from_completed_event
  • test_response_id_is_none_when_no_completed_event — guards against the fake-UUID footgun
  • test_prompt_cache_key_not_sent_by_default — verifies NOT_GIVEN fallback for non-OpenAI compat
  • test_prompt_cache_key_forwarded_when_opted_in
  • test_previous_response_id_not_sent_by_default — verifies NOT_GIVEN fallback
  • test_previous_response_id_forwarded_via_sdk_kwarg — verifies the SDK's official chaining API now works
  • test_conversation_id_and_prompt_accepted_but_not_forwarded — verifies SDK contract compliance without surface-area expansion
  • All 28 pre-existing tests still pass (after stripping vestigial task_id kwargs)

Why these changes together

The motivating workstream (downstream of this PR) is a stateful-Responses-API migration for ST&S agents to chain via previous_response_id for 40–80% better cache utilization on reasoning models. That migration needs:

  • Real Usage to measure cache hit rate before/after.
  • Real response_id to actually pass back as previous_response_id.
  • previous_response_id actually reaching responses.create instead of being silently swallowed.
  • Optional prompt_cache_key for callers who want it without forcing it on everyone.

Once a release containing these changes lands and aimi-scale bumps its agentex-sdk pin, the runtime monkey-patch in st_s/*/agentex_usage_patch.py and the corresponding _apply_agentex_usage_patch() calls in each run_worker.py can be deleted.

Greptile Summary

This PR fixes four latent issues in TemporalStreamingModel.get_response: (1) real Usage is now read from ResponseCompletedEvent.response.usage instead of zeros, (2) the real server-issued response_id replaces a fabricated UUID, (3) previous_response_id/conversation_id/prompt are surfaced as explicit kwargs and forwarded to responses.create, and (4) prompt_cache_key is added as an opt-in via model_settings.extra_args. The changes are well-tested and the core logic is correct, though the existing review comments on prompt_cache_key SDK compatibility warrant attention before merge.

Confidence Score: 4/5

Safe to merge once the open prompt_cache_key SDK-parameter concern from the previous review thread is resolved.

The four core fixes (real usage, real response_id, forwarding SDK kwargs, opt-in cache key) are logically correct and well-tested. Score is capped at 4 because the previously raised concern about prompt_cache_key — always passing it as an explicit named kwarg when the OpenAI Python SDK may not declare it — appears unresolved and could raise a TypeError for every caller.

temporal_streaming_model.py line 625: prompt_cache_key is passed as a named kwarg regardless of whether the SDK accepts it as a declared parameter

Important Files Changed

Filename Overview
src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py Captures real usage and response_id from ResponseCompletedEvent, plumbs previous_response_id/conversation_id/prompt as explicit kwargs, and adds opt-in prompt_cache_key. Core logic is correct; previous review comments flagged the prompt_cache_key SDK naming concern.
src/agentex/lib/core/temporal/plugins/openai_agents/tests/test_streaming_model.py Adds 10 new targeted tests for usage capture, response_id, prompt_cache_key opt-in, and SDK kwarg forwarding. Removes vestigial task_id kwargs from existing tests. Previous review flagged a vacuously-true assertion in test_prompt_cache_key_forwarded_when_opted_in.

Sequence Diagram

sequenceDiagram
    participant SDK as OpenAI Agents SDK
    participant TSM as TemporalStreamingModel
    participant OAI as responses.create

    SDK->>TSM: get_response(...,previous_response_id,conversation_id,prompt)
    TSM->>TSM: pop prompt_cache_key from extra_args (opt-in)
    TSM->>OAI: responses.create(previous_response_id,conversation,prompt,prompt_cache_key,...)
    OAI-->>TSM: stream of events
    loop stream events
        TSM->>TSM: process delta events
        TSM->>TSM: on ResponseCompletedEvent capture usage + response.id
    end
    TSM->>TSM: Build Usage from captured_usage (fallback to zeros if None)
    TSM-->>SDK: ModelResponse(output, usage, response_id=captured_response_id)

x added 3 commits April 30, 2026 17:12
…_key

Three related changes to the streaming Responses API path in
`TemporalStreamingModel.get_response`. All are observability/cache
improvements; none changes how the API is called for callers who
don't opt in.

1. Capture real `Usage` from `ResponseCompletedEvent.response.usage`.
   Previously a zero-filled `Usage` was constructed and `event.response.usage`
   from the streaming protocol was discarded. The reasoning-tokens
   estimator (`len(''.join(reasoning_contents)) // 4`) is also dropped —
   the real value arrives in the API response. Falls back to zeros only
   when the stream ends without a `ResponseCompletedEvent` (error path).

2. Surface usage in the span's `output` dict. The
   `streaming_model_get_response` span now carries
   `{input_tokens, output_tokens, total_tokens, cached_input_tokens,
   reasoning_tokens}` so traces show cache-hit rate without external
   log scraping.

3. Plumb `prompt_cache_key` to `responses.create` as an opt-in. Callers
   set it via `model_settings.extra_args["prompt_cache_key"]`. We do not
   auto-inject a default — `prompt_cache_key` is not standard across
   OpenAI-compatible endpoints, and a non-OpenAI server that strictly
   validates request bodies could reject the field. When unset, the
   parameter resolves to `NOT_GIVEN` and is omitted from the request
   body entirely. Behavior on alternative providers is identical to
   today's unless a caller explicitly opts in.
`TemporalStreamingModel.get_response` was synthesizing a client-side
UUID for `ModelResponse.response_id`:

    response_id=f"resp_{uuid.uuid4().hex[:8]}"

Replace with the real `response.id` captured off
`ResponseCompletedEvent.response.id` (alongside the `Usage` capture
already happening in the same branch). On the error path, where the
stream ends without a `ResponseCompletedEvent`, we return `None` —
matching the documented `str | None` contract on
`ModelResponse.response_id`.

## Why this matters

The OpenAI Agents SDK reads `ModelResponse.response_id` in three places:

- `agents/run.py:145` — gates whether the SDK chains via
  `previous_response_id` on the next call (the conditional is
  None-tolerant: a None value just leaves the chain pointer alone).
- `agents/result.py:108` — exposes the value to user code as
  `RunResult.last_response_id`.
- `agents/tracing/span_data.py:164` — written into trace records.

A client-side UUID was never issued by any server. Any caller that
picks it up and tries to chain via `previous_response_id` (the
documented use case for `RunResult.last_response_id`) gets a 400
"response not found" from the API, surfacing far from the actual
cause.

Comparable SDK providers do this correctly:

- `agents/models/openai_responses.py:149`: `response_id=response.id`
- `agents/models/openai_chatcompletions.py:135`: `response_id=None`
- `agents/extensions/models/litellm_model.py:182`: `response_id=None`

`None` is the documented sentinel for "this provider doesn't support
response_id," and the SDK is built to handle it.

The bug has been latent since this file was added (commit 2f2a6ed,
Oct 10) because nothing in the codebase's call paths chains
`previous_response_id` yet. The first caller that does (e.g. a
multi-turn stateful Responses API workflow) triggers it.

## Compatibility

This change is invisible to callers that don't read `response_id` — and
nothing in `scale-agentex-python` reads it. A repo-wide grep finds zero
consumers; only the (now-fixed) write site exists. The field is
serialized into Temporal event history and trace records but consumed
only by the OpenAI Agents SDK, which already handles `None`.
… key

Seven new tests in `TestStreamingModelUsageResponseIdAndCacheKey`:

- Usage captured from `ResponseCompletedEvent.response.usage`
- Usage falls back to zeros when stream ends without a completed event
- Usage emitted in span output_data["usage"]
- response_id captured from `ResponseCompletedEvent.response.id`
- response_id is None (NOT a fabricated UUID) when stream ends without
  a completed event — guards against the previous footgun where a
  client-side UUID would be returned and silently break downstream
  `previous_response_id` chaining
- prompt_cache_key resolves to NOT_GIVEN by default (omitted from
  request body, safe for non-OpenAI endpoints)
- prompt_cache_key forwarded when caller opts in via
  `model_settings.extra_args["prompt_cache_key"]`, and popped from
  extra_args so it isn't passed twice

Pre-existing tests in `TestStreamingModelBasics` (test_responses_api_streaming,
test_task_id_threading, test_redis_context_creation) updated to set
`response.id=None` on their `MagicMock(spec=ResponseCompletedEvent)`
mocks. Without this, the auto-generated MagicMock attribute for
`response.id` flows into `ModelResponse.response_id` and trips
pydantic's `str | None` validation.
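The MagicMock pitfall described above can be reproduced in miniature. The skeleton class below is a hypothetical stand-in for the real event type; the point is that a spec'd mock's child attributes are auto-generated mocks, not values:

```python
from unittest.mock import MagicMock


class ResponseCompletedEvent:
    """Skeleton spec with a `response` attribute (illustrative only)."""

    response = None


event = MagicMock(spec=ResponseCompletedEvent)
# Child attributes of a spec'd mock are auto-generated: event.response.id
# is itself a MagicMock, which would trip a pydantic `str | None` field
# if it flowed into ModelResponse.response_id.
assert isinstance(event.response.id, MagicMock)

# Pinning the attribute restores a value the validator accepts.
event.response.id = None
assert event.response.id is None
```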
@x x force-pushed the dpeticolas/prompt-cache-upstream branch from dc7de0f to eb8ff68 Compare April 30, 2026 21:15
@x x changed the title feat(openai_agents): capture real Usage + enable prompt_cache_key routing feat(openai_agents): expose real Usage and response_id, opt-in prompt_cache_key Apr 30, 2026
@x x changed the title feat(openai_agents): expose real Usage and response_id, opt-in prompt_cache_key feat(openai_agents): expose real response.usage, map response_id=response.id, opt-in prompt_cache_key Apr 30, 2026
The OpenAI Agents SDK's `Model.get_response` abstract has three
keyword-only parameters: `previous_response_id`, `conversation_id`,
`prompt`. The SDK threads them down through `_ServerConversationTracker`
when callers use `Runner.run(..., previous_response_id=X)` or set
`RunConfig` with `auto_previous_response_id=True`.

`TemporalStreamingModel.get_response` was declared with
`**kwargs # noqa: ARG002`, which silently swallowed all three. Callers
who used the SDK's official chaining API saw their `previous_response_id`
disappear and got no stateful behavior — without an error.

This commit:

- Replaces `**kwargs` with explicit `previous_response_id`,
  `conversation_id`, `prompt` params, matching the abstract.
- Forwards `previous_response_id` to `responses.create` via
  `_non_null_or_not_given` (so `None` resolves to `NOT_GIVEN` and the
  field is omitted from the request body — identical behavior to today
  for callers that don't opt in).
- Accepts `conversation_id` and `prompt` for SDK contract compliance
  but does not forward them yet (marked `# noqa: ARG002`); they can be
  wired through later if a use case appears.
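A minimal sketch of the signature change described above, assuming the three keyword-only names from the SDK abstract (this is illustrative, not the actual method body):

```python
from typing import Any, Optional


def get_response(
    *,
    previous_response_id: Optional[str] = None,
    conversation_id: Optional[str] = None,  # accepted for contract compliance
    prompt: Optional[Any] = None,           # accepted for contract compliance
) -> dict[str, Any]:
    # Replaces the old `**kwargs` that silently swallowed all three params.
    # Only previous_response_id is forwarded at this stage; None means the
    # field is omitted from the responses.create request entirely.
    request: dict[str, Any] = {}
    if previous_response_id is not None:
        request["previous_response_id"] = previous_response_id
    return request
```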

## Compatibility with non-OpenAI backends

Same opt-in pattern as `prompt_cache_key`. `TemporalStreamingModel`
calls `responses.create`, but the underlying client can be pointed at
any OpenAI-compatible server (LiteLLM proxy, Foundry, vLLM, etc.).
Some of those backends don't recognize `previous_response_id`. Because
we forward it only when explicitly set, callers who don't opt in see
no change in the wire request — the field is filtered out by
`NOT_GIVEN`. Callers who opt in are responsible for knowing whether
their backend supports it.

## Test housekeeping

The 27 existing tests that passed `task_id=sample_task_id` to
`get_response` were relying on `**kwargs` to silently swallow it.
Production reads `task_id` from a ContextVar (set by
`ContextInterceptor` in real Temporal flows, set by the
`_streaming_context_vars` fixture in tests), not from a function
argument. The kwarg was vestigial cruft. Removed.
@x x changed the title feat(openai_agents): expose real response.usage, map response_id=response.id, opt-in prompt_cache_key feat(openai_agents): expose real Usage/response_id, plumb previous_response_id, opt-in prompt_cache_key Apr 30, 2026
@x x changed the title feat(openai_agents): expose real Usage/response_id, plumb previous_response_id, opt-in prompt_cache_key feat(openai_agents): expose real usage, response_id, plumb previous_response_id, opt-in prompt_cache_key for stateful responses and prompt caching Apr 30, 2026
x added 2 commits April 30, 2026 18:08
…create

The SDK's ``Model.get_response`` abstract has three Responses API
server-state parameters: ``previous_response_id``, ``conversation_id``,
``prompt``. The prior commit wired up ``previous_response_id`` and
accepted the other two for SDK contract compliance but discarded them
with ``# noqa: ARG002``.

Accept-and-discard is a code smell: callers using the SDK's
``Runner.run(conversation_id=..., prompt=...)`` API would see their
arguments silently dropped. Since both map directly to ``responses.create``
kwargs and we're already on that endpoint, the cost of forwarding is two
lines and removes the smell entirely.

- ``conversation_id`` (SDK abstract name) → ``conversation`` (responses.create
  endpoint kwarg). The ``Conversation`` type accepts ``str`` directly, so
  no translation is needed.
- ``prompt`` is the same name on both sides.

Both follow the same opt-in pattern as ``previous_response_id`` and
``prompt_cache_key``: ``None`` resolves to ``NOT_GIVEN`` and is omitted
from the request body, so behavior on alternative OpenAI-compatible
backends is unchanged unless a caller explicitly opts in.
… union

Two ruff fixes for the test file:

- ARG002 on 25 test method signatures: the prior commit
  (forward previous_response_id from SDK kwarg) stripped the
  vestigial ``task_id=sample_task_id`` kwargs from get_response calls,
  but left ``sample_task_id`` in the test method parameter lists.
  The contextvars fixture (``_streaming_context_vars``) already pulls
  ``sample_task_id`` transitively, so the explicit param is redundant.
  Removed from the 25 flagged signatures; preserved on
  ``test_responses_api_streaming`` where it's still used inside the
  body to assert against the streaming context.

- FA102 on _make_response_completed_event: the new test helper used a
  PEP 604 union (``str | None``) without ``from __future__ import
  annotations``. Switched to ``Optional[str]`` to keep the change
  local to the helper rather than retrofitting future annotations
  across the file.
@x x enabled auto-merge May 1, 2026 15:34