Three related changes to the streaming Responses API path in
`TemporalStreamingModel.get_response`. All are observability/cache
improvements; none changes how the API is called for callers who don't
opt in.
1. Capture real `Usage` from `ResponseCompletedEvent.response.usage`.
Previously a zero-filled `Usage` was constructed and `event.response.usage`
from the streaming protocol was discarded. The reasoning-tokens
estimator (`len(''.join(reasoning_contents)) // 4`) is also dropped —
the real value arrives in the API response. Falls back to zeros only
when the stream ends without a `ResponseCompletedEvent` (error path).
2. Surface usage in the span's `output` dict. The
`streaming_model_get_response` span now carries
`{input_tokens, output_tokens, total_tokens, cached_input_tokens,
reasoning_tokens}` so traces show cache-hit rate without external
log scraping.
3. Plumb `prompt_cache_key` to `responses.create` as an opt-in. Callers
set it via `model_settings.extra_args["prompt_cache_key"]`. We do not
auto-inject a default — `prompt_cache_key` is not standard across
OpenAI-compatible endpoints, and a non-OpenAI server that strictly
validates request bodies could reject the field. When unset, the
parameter resolves to `NOT_GIVEN` and is omitted from the request
body entirely. Behavior on alternative providers is unchanged unless a
caller explicitly opts in; the opt-in resolution is sketched just below.
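A minimal sketch of that opt-in resolution, assuming the helper name `resolve_prompt_cache_key` and a plain-dict view of `model_settings.extra_args` (these names are illustrative, not the identifiers used in `TemporalStreamingModel`):

```python
from typing import Any, Optional, Tuple

from openai import NOT_GIVEN


def resolve_prompt_cache_key(extra_args: Optional[dict]) -> Tuple[Any, dict]:
    """Pop prompt_cache_key if the caller set it; otherwise return NOT_GIVEN.

    NOT_GIVEN makes the OpenAI client omit the field from the request body,
    so strictly-validating non-OpenAI backends never see an unknown field.
    Popping also keeps the key from being sent twice via **extra_args.
    """
    remaining = dict(extra_args or {})
    prompt_cache_key = remaining.pop("prompt_cache_key", NOT_GIVEN)
    return prompt_cache_key, remaining
```

Returning the popped dict alongside the value mirrors the behavior described in the test notes further down: the key is forwarded once and not re-sent through the remaining extra args.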
`TemporalStreamingModel.get_response` was synthesizing a client-side
UUID for `ModelResponse.response_id`:
response_id=f"resp_{uuid.uuid4().hex[:8]}"
Replace with the real `response.id` captured off
`ResponseCompletedEvent.response.id` (alongside the `Usage` capture
already happening in the same branch). On the error path, where the
stream ends without a `ResponseCompletedEvent`, we return `None` —
matching the documented `str | None` contract on
`ModelResponse.response_id`.
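A compact sketch of that capture branch. The dataclasses stand in for the OpenAI streaming types (the real code checks `isinstance(event, ResponseCompletedEvent)` from `openai.types.responses`), and `capture_completed` is an illustrative name, not the actual method:

```python
from dataclasses import dataclass
from typing import Any, AsyncIterator, Optional, Tuple


@dataclass
class FakeResponse:          # stand-in for the OpenAI Response object
    id: str
    usage: Any


@dataclass
class FakeCompletedEvent:    # stand-in for ResponseCompletedEvent
    response: FakeResponse


async def capture_completed(stream: AsyncIterator[Any]) -> Tuple[Optional[Any], Optional[str]]:
    """Capture (usage, response_id) off the terminal completed event.

    Both stay None when the stream ends without a completed event (the error
    path), which is what lets ModelResponse.response_id honor its
    `str | None` contract instead of carrying a fabricated UUID.
    """
    usage, response_id = None, None
    async for event in stream:
        if isinstance(event, FakeCompletedEvent):
            usage = event.response.usage       # real server-side token counts
            response_id = event.response.id    # real "resp_..." id
    return usage, response_id
```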
## Why this matters
The OpenAI Agents SDK reads `ModelResponse.response_id` in three places:
- `agents/run.py:145` — gates whether the SDK chains via
`previous_response_id` on the next call (the conditional is
None-tolerant: a None value just leaves the chain pointer alone).
- `agents/result.py:108` — exposes the value to user code as
`RunResult.last_response_id`.
- `agents/tracing/span_data.py:164` — written into trace records.
A client-side UUID was never issued by any server. Any caller that
picks it up and tries to chain via `previous_response_id` (the
documented use case for `RunResult.last_response_id`) gets a 400
"response not found" from the API, surfacing far from the actual
cause.
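Concretely, the documented chaining pattern that would hit that 400, assuming an openai-agents version where `Runner.run` accepts `previous_response_id` and results expose `last_response_id`:

```python
import asyncio

from agents import Agent, Runner


async def main() -> None:
    agent = Agent(name="assistant", instructions="Answer briefly.")

    first = await Runner.run(agent, "turn one")

    # With the old behavior, last_response_id was a client-side "resp_<uuid8>"
    # that no server ever issued, so this second call fails with a 400
    # "response not found". With the fix it is the real response.id (or None).
    await Runner.run(
        agent,
        "turn two",
        previous_response_id=first.last_response_id,
    )


asyncio.run(main())
```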
Comparable SDK providers do this correctly:
- `agents/models/openai_responses.py:149`: `response_id=response.id`
- `agents/models/openai_chatcompletions.py:135`: `response_id=None`
- `agents/extensions/models/litellm_model.py:182`: `response_id=None`
`None` is the documented sentinel for "this provider doesn't support
response_id," and the SDK is built to handle it.
The bug has been latent since this file was added (commit 2f2a6ed,
Oct 10) because nothing in the codebase's call paths chains
`previous_response_id` yet. The first caller that does (e.g. a
multi-turn stateful Responses API workflow) triggers it.
## Compatibility
This change is invisible to callers that don't read `response_id` — and
nothing in `scale-agentex-python` reads it. A repo-wide grep finds zero
consumers; only the (now-fixed) write site exists. The field is
serialized into Temporal event history and trace records but consumed
only by the OpenAI Agents SDK, which already handles `None`.
Seven new tests in `TestStreamingModelUsageResponseIdAndCacheKey`:

- Usage captured from `ResponseCompletedEvent.response.usage`
- Usage falls back to zeros when stream ends without a completed event
- Usage emitted in span `output_data["usage"]`
- response_id captured from `ResponseCompletedEvent.response.id`
- response_id is None (NOT a fabricated UUID) when stream ends without a completed event — guards against the previous footgun where a client-side UUID would be returned and silently break downstream `previous_response_id` chaining
- prompt_cache_key resolves to NOT_GIVEN by default (omitted from request body, safe for non-OpenAI endpoints)
- prompt_cache_key forwarded when caller opts in via `model_settings.extra_args["prompt_cache_key"]`, and popped from extra_args so it isn't passed twice

Pre-existing tests in `TestStreamingModelBasics` (test_responses_api_streaming, test_task_id_threading, test_redis_context_creation) updated to set `response.id = None` on their `MagicMock(spec=ResponseCompletedEvent)` mocks. Without this, the auto-generated MagicMock attribute for `response.id` flows into `ModelResponse.response_id` and trips pydantic's `str | None` validation.
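The mock fix in miniature. This is a sketch of the pattern, not the actual fixture code; the real tests may build the nested `response` mock differently:

```python
from unittest.mock import MagicMock

from openai.types.responses import ResponseCompletedEvent

event = MagicMock(spec=ResponseCompletedEvent)
event.response = MagicMock()

# Pin the attributes that flow into ModelResponse. Left unset, response.id
# would be an auto-generated MagicMock, and pydantic's `str | None`
# validation on ModelResponse.response_id would reject it.
event.response.id = None
event.response.usage = None
```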
The OpenAI Agents SDK's `Model.get_response` abstract has three keyword-only parameters: `previous_response_id`, `conversation_id`, `prompt`. The SDK threads them down through `_ServerConversationTracker` when callers use `Runner.run(..., previous_response_id=X)` or set `RunConfig` with `auto_previous_response_id=True`. `TemporalStreamingModel.get_response` was declared with `**kwargs  # noqa: ARG002`, which silently swallowed all three. Callers who used the SDK's official chaining API saw their `previous_response_id` disappear and got no stateful behavior — without an error.

This commit:

- Replaces `**kwargs` with explicit `previous_response_id`, `conversation_id`, `prompt` params, matching the abstract (sketched below).
- Forwards `previous_response_id` to `responses.create` via `_non_null_or_not_given` (so `None` resolves to `NOT_GIVEN` and the field is omitted from the request body — identical behavior to today for callers that don't opt in).
- Accepts `conversation_id` and `prompt` for SDK contract compliance but does not forward them yet (marked `# noqa: ARG002`); they can be wired through later if a use case appears.

## Compatibility with non-OpenAI backends

Same opt-in pattern as `prompt_cache_key`. `TemporalStreamingModel` calls `responses.create`, but the underlying client can be pointed at any OpenAI-compatible server (LiteLLM proxy, Foundry, vLLM, etc.). Some of those backends don't recognize `previous_response_id`. Because we forward it only when explicitly set, callers who don't opt in see no change in the wire request — the field is filtered out by `NOT_GIVEN`. Callers who opt in are responsible for knowing whether their backend supports it.

## Test housekeeping

The 27 existing tests that passed `task_id=sample_task_id` to `get_response` were relying on `**kwargs` to silently swallow it. Production reads `task_id` from a ContextVar (set by `ContextInterceptor` in real Temporal flows, set by the `_streaming_context_vars` fixture in tests), not from a function argument. The kwarg was vestigial cruft. Removed.
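A minimal sketch of the signature-and-forwarding change. `_non_null_or_not_given` is the helper named above; the class name, the bare `client` handle, and the trimmed parameter list are illustrative rather than the real method:

```python
from typing import Any, Optional

from openai import NOT_GIVEN


def _non_null_or_not_given(value: Any) -> Any:
    """None never reaches the wire: it becomes NOT_GIVEN and is omitted."""
    return value if value is not None else NOT_GIVEN


class StreamingModelSketch:
    def __init__(self, client: Any) -> None:
        self.client = client

    async def get_response(
        self,
        *,
        previous_response_id: Optional[str],
        conversation_id: Optional[str],  # accepted for contract compliance, not forwarded yet
        prompt: Optional[Any],           # accepted for contract compliance, not forwarded yet
    ) -> Any:
        # Explicit keyword-only params replace the old `**kwargs` catch-all,
        # so the SDK's chaining argument can no longer be silently swallowed.
        return await self.client.responses.create(
            model="gpt-...",  # placeholder; the real call passes model, input, stream, etc.
            previous_response_id=_non_null_or_not_given(previous_response_id),
        )
```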
The SDK's `Model.get_response` abstract has three Responses API server-state parameters: `previous_response_id`, `conversation_id`, `prompt`. The prior commit wired up `previous_response_id` and accepted the other two for SDK contract compliance but discarded them with `# noqa: ARG002`. Accept-and-discard is a code smell: callers using the SDK's `Runner.run(conversation_id=..., prompt=...)` API would see their arguments silently dropped. Since both map directly to `responses.create` kwargs and we're already on that endpoint, the cost of forwarding is two lines and removes the smell entirely.

- `conversation_id` (SDK abstract name) → `conversation` (responses.create endpoint kwarg). The `Conversation` type accepts `str` directly, so no translation is needed.
- `prompt` is the same name on both sides.

Both follow the same opt-in pattern as `previous_response_id` and `prompt_cache_key`: `None` resolves to `NOT_GIVEN` and is omitted from the request body, so behavior on alternative OpenAI-compatible backends is unchanged unless a caller explicitly opts in.
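The mapping in sketch form; the function name and plain-dict return are illustrative, and the real code passes these kwargs straight into `responses.create`:

```python
from typing import Any, Dict, Optional

from openai import NOT_GIVEN


def server_state_kwargs(
    previous_response_id: Optional[str],
    conversation_id: Optional[str],
    prompt: Optional[Any],
) -> Dict[str, Any]:
    def opt(value: Any) -> Any:
        return value if value is not None else NOT_GIVEN

    return {
        "previous_response_id": opt(previous_response_id),
        # SDK abstract name -> endpoint kwarg; a conversation id string is
        # accepted directly, so no translation is needed.
        "conversation": opt(conversation_id),
        # Same name on both sides.
        "prompt": opt(prompt),
    }
```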
Two ruff fixes for the test file:

- ARG002 on 25 test method signatures: the prior commit (forward previous_response_id from SDK kwarg) stripped the vestigial `task_id=sample_task_id` kwargs from get_response calls, but left `sample_task_id` in the test method parameter lists. The contextvars fixture (`_streaming_context_vars`) already pulls `sample_task_id` transitively, so the explicit param is redundant. Removed from the 25 flagged signatures; preserved on `test_responses_api_streaming` where it's still used inside the body to assert against the streaming context.
- FA102 on `_make_response_completed_event`: the new test helper used a PEP 604 union (`str | None`) without `from __future__ import annotations`. Switched to `Optional[str]` to keep the change local to the helper rather than retrofitting future annotations across the file.
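What the corrected helper plausibly looks like (a sketch; the real `_make_response_completed_event` may differ in defaults and extra fields):

```python
from typing import Any, Optional
from unittest.mock import MagicMock

from openai.types.responses import ResponseCompletedEvent


# Optional[str] instead of `str | None`: the test module has no
# `from __future__ import annotations`, so a PEP 604 union would trip FA102.
def _make_response_completed_event(response_id: Optional[str], usage: Any = None) -> MagicMock:
    event = MagicMock(spec=ResponseCompletedEvent)
    event.response = MagicMock()
    event.response.id = response_id
    event.response.usage = usage
    return event
```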
danielmiller98 approved these changes on May 1, 2026.
## Summary

Four changes to the streaming Responses API path in
`temporal_streaming_model.py`. My goal is to surface real usage
(`usage`), enable prompt caching (`prompt_cache_key`), and enable
stateful chaining (`response_id`, `previous_response_id`).
1. Capture real `usage` from `ResponseCompletedEvent.response.usage` and surface it in the span's `output` dict. `TemporalStreamingModel.get_response` was constructing a zero-filled `Usage` and discarding `event.response.usage`. It now reads usage off the streaming protocol and falls back to zeros only on the error path (stream ends without a completed event). The `len(''.join(reasoning_contents)) // 4` reasoning-tokens estimator is dropped — the real value arrives in the API response. The `streaming_model_get_response` span now carries `{input_tokens, output_tokens, total_tokens, cached_input_tokens, reasoning_tokens}` so traces show cache-hit rate.
2. Return the real `response_id` captured from `ResponseCompletedEvent.response.id` instead of the previously fabricated `f"resp_{uuid.uuid4().hex[:8]}"`. The OpenAI Agents SDK reads this field for `previous_response_id` chaining (run.py:145), exposes it as `RunResult.last_response_id` (result.py:108), and writes it into traces (tracing/span_data.py:164). A client-side UUID has never been issued by any server, so any caller that picks it up and tries to chain (the documented use case) gets a 400 "response not found". Latent since this file was added (commit 2f2a6ed7) because nothing in the codebase chains `previous_response_id` yet.
3. Forward `previous_response_id`, `conversation_id`, and `prompt` from the SDK kwargs. The SDK's abstract `Model.get_response` has `previous_response_id`, `conversation_id`, `prompt` as required keyword-only params; the SDK threads `previous_response_id` down through `_ServerConversationTracker` when callers set it on `Runner.run`/`RunConfig`. Our implementation declared `**kwargs  # noqa: ARG002` and silently swallowed all three. Callers who used the official chaining API got no stateful behavior and no error. Replaced with explicit named params. This change is not strictly required: we do need to pass `previous_response_id`, but we could be doing that via `extra_args`. That said, the silent behavior is sketchy. I could be convinced that the right thing to do here is to raise an exception when we see them in `kwargs` and expect them in `extra_args`, but this felt less dangerous and more intuitive for the caller.
4. Plumb `prompt_cache_key` to `responses.create` as an opt-in parameter. Callers set it via `model_settings.extra_args["prompt_cache_key"]`. We do not auto-inject a default. I don't trust `prompt_cache_key` to be standard across all OpenAI-compatible endpoints. When unset, the parameter resolves to `NOT_GIVEN` and is omitted from the request body entirely.

## Test plan
All 38 tests in `test_streaming_model.py` pass locally:

- `test_usage_captured_from_completed_event`
- `test_usage_falls_back_when_no_completed_event`
- `test_usage_emitted_in_span_output`
- `test_response_id_captured_from_completed_event`
- `test_response_id_is_none_when_no_completed_event` — guards against the fake-UUID footgun
- `test_prompt_cache_key_not_sent_by_default` — verifies `NOT_GIVEN` fallback for non-OpenAI compat
- `test_prompt_cache_key_forwarded_when_opted_in`
- `test_previous_response_id_not_sent_by_default` — verifies `NOT_GIVEN` fallback
- `test_previous_response_id_forwarded_via_sdk_kwarg` — verifies the SDK's official chaining API now works
- `test_conversation_id_and_prompt_accepted_but_not_forwarded` — verifies SDK contract compliance without surface-area expansion
- pre-existing tests updated (vestigial `task_id` kwargs removed)

## Why these changes together
The motivating workstream (downstream of this PR) is a stateful-Responses-API migration for ST&S agents to chain via `previous_response_id` for 40–80% better cache utilization on reasoning models. That migration needs:

- `Usage` to measure cache hit rate before/after.
- `response_id` to actually pass back as `previous_response_id`.
- `previous_response_id` actually reaching `responses.create` instead of being silently swallowed.
- `prompt_cache_key` for callers who want it without forcing it on everyone (caller-side opt-in sketched below).
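A hedged caller-side sketch of that end state, assuming an openai-agents release where `ModelSettings.extra_args`, the `previous_response_id` kwarg on `Runner.run`, and `RunResult.last_response_id` are all available; the cache-key value is illustrative:

```python
import asyncio

from agents import Agent, ModelSettings, RunConfig, Runner


async def main() -> None:
    agent = Agent(name="sts-agent", instructions="Answer briefly.")
    run_config = RunConfig(
        model_settings=ModelSettings(
            # Opt-in: forwarded to responses.create and popped from extra_args
            # so it is not sent twice. Omit this entirely on backends that
            # reject unknown fields.
            extra_args={"prompt_cache_key": "sts-agent-v1"},
        ),
    )

    first = await Runner.run(agent, "turn one", run_config=run_config)

    # Chain the next turn on server-side state via the now-real response_id.
    await Runner.run(
        agent,
        "turn two",
        run_config=run_config,
        previous_response_id=first.last_response_id,
    )


asyncio.run(main())
```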
Once a release containing these changes lands and `aimi-scale` bumps its `agentex-sdk` pin, the runtime monkey-patch in `st_s/*/agentex_usage_patch.py` and the corresponding `_apply_agentex_usage_patch()` calls in each `run_worker.py` can be deleted.

## Greptile Summary
This PR fixes four latent issues in `TemporalStreamingModel.get_response`: (1) real `Usage` is now read from `ResponseCompletedEvent.response.usage` instead of zeros, (2) the real server-issued `response_id` replaces a fabricated UUID, (3) `previous_response_id`/`conversation_id`/`prompt` are surfaced as explicit kwargs and forwarded to `responses.create`, and (4) `prompt_cache_key` is added as an opt-in via `model_settings.extra_args`. The changes are well-tested and the core logic is correct, though the existing review comments on `prompt_cache_key` SDK compatibility warrant attention before merge.

Confidence Score: 4/5
Safe to merge once the open prompt_cache_key SDK-parameter concern from the previous review thread is resolved.
The four core fixes (real usage, real response_id, forwarding SDK kwargs, opt-in cache key) are logically correct and well-tested. Score is capped at 4 because the previously raised concern about prompt_cache_key — always passing it as an explicit named kwarg when the OpenAI Python SDK may not declare it — appears unresolved and could raise a TypeError for every caller.
temporal_streaming_model.py line 625: prompt_cache_key is passed as a named kwarg regardless of whether the SDK accepts it as a declared parameter
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
    participant SDK as OpenAI Agents SDK
    participant TSM as TemporalStreamingModel
    participant OAI as responses.create
    SDK->>TSM: get_response(..., previous_response_id, conversation_id, prompt)
    TSM->>TSM: pop prompt_cache_key from extra_args (opt-in)
    TSM->>OAI: responses.create(previous_response_id, conversation, prompt, prompt_cache_key, ...)
    OAI-->>TSM: stream of events
    loop stream events
        TSM->>TSM: process delta events
        TSM->>TSM: on ResponseCompletedEvent capture usage + response.id
    end
    TSM->>TSM: Build Usage from captured_usage (fallback to zeros if None)
    TSM-->>SDK: ModelResponse(output, usage, response_id=captured_response_id)
```