fix: resolve lost-wakeup races in InMemoryTaskStore.wait_for_update#2536

Open
blackwell-systems wants to merge 1 commit into modelcontextprotocol:main from blackwell-systems:fix/task-store-lost-wakeup

Conversation

@blackwell-systems

Summary

This PR fixes two races in InMemoryTaskStore.wait_for_update that can cause indefinite hangs when polling task status.

Race 1: Concurrent waiters

wait_for_update overwrites _update_events[task_id] with a fresh event on every call. If two clients poll the same task, the second overwrites the first's event. notify_update sets only the latest event, so the first waiter hangs forever.

Fix: Use a list of events per task_id. Each waiter appends its own event. notify_update sets and removes all events for the task.
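A minimal sketch of the list-of-events approach (class and method names here are condensed assumptions for illustration, not the store's full interface):

```python
import asyncio
from collections import defaultdict


class TaskStoreSketch:
    """Condensed sketch: one event per active waiter, keyed by task_id."""

    def __init__(self) -> None:
        self._update_events: dict[str, list[asyncio.Event]] = defaultdict(list)

    async def wait_for_update(self, task_id: str) -> None:
        # Each waiter appends its own event instead of overwriting a shared one.
        event = asyncio.Event()
        self._update_events[task_id].append(event)
        try:
            await event.wait()
        finally:
            # Clean up our event if we were cancelled before being notified.
            if event in self._update_events[task_id]:
                self._update_events[task_id].remove(event)

    def notify_update(self, task_id: str) -> None:
        # Wake *all* waiters for this task, not just the most recent one.
        for event in self._update_events.pop(task_id, []):
            event.set()


async def demo() -> bool:
    store = TaskStoreSketch()
    a = asyncio.ensure_future(store.wait_for_update("t1"))
    b = asyncio.ensure_future(store.wait_for_update("t1"))
    await asyncio.sleep(0)  # let both waiters register their events
    store.notify_update("t1")
    await asyncio.wait_for(asyncio.gather(a, b), timeout=1)
    return True
```

The `finally` cleanup matters: a waiter cancelled by a client timeout should not leave a dead event in the list.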

Race 2: Notify before wait

If update_task calls notify_update before any wait_for_update is active, the signal is lost (no event exists to set). The next wait_for_update creates a fresh event and waits for an update that already happened.

Fix: Track pending updates in a set. notify_update adds to the set when no waiters exist. wait_for_update checks and consumes the pending flag before creating an event.
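The pending-set path can be sketched the same way (again with assumed, condensed names):

```python
import asyncio


class TaskStoreSketch:
    """Condensed sketch: a pending set covers notify-before-wait."""

    def __init__(self) -> None:
        self._update_events: dict[str, list[asyncio.Event]] = {}
        self._pending_updates: set[str] = set()

    async def wait_for_update(self, task_id: str) -> None:
        # Consume a pending notification first, so an update that landed
        # before we started waiting is not lost.
        if task_id in self._pending_updates:
            self._pending_updates.discard(task_id)
            return
        event = asyncio.Event()
        self._update_events.setdefault(task_id, []).append(event)
        await event.wait()

    def notify_update(self, task_id: str) -> None:
        waiters = self._update_events.pop(task_id, [])
        if not waiters:
            # No one is waiting yet; remember the update for the next waiter.
            self._pending_updates.add(task_id)
        for event in waiters:
            event.set()


async def demo() -> str:
    store = TaskStoreSketch()
    store.notify_update("t1")  # update arrives before any waiter exists
    await asyncio.wait_for(store.wait_for_update("t1"), timeout=1)
    return "woke"
```

Checking and consuming the flag before creating an event closes the window: a notification can no longer fall between "no event exists" and "waiter registered".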

Reproducer

Both races are reproducible with a 30-line script (included in issue #2535). Output before fix:

Race 1: concurrent waiters
  waiter a: HUNG
  waiter b: woke
  FAIL: lost wakeup

Race 2: notify before wait
  FAIL: signal lost, waiter hung forever

After fix:

Race 1: concurrent waiters
  waiter a: woke
  waiter b: woke
  PASS

Race 2: notify before wait
  PASS
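For reference, race 1 condenses to a few lines. This is a hypothetical distillation of the pre-fix behavior, not the actual script attached to issue #2535:

```python
import asyncio


class BuggyStore:
    """Hypothetical distillation of the pre-fix single-event behavior."""

    def __init__(self) -> None:
        self._update_events: dict[str, asyncio.Event] = {}

    async def wait_for_update(self, task_id: str) -> None:
        # Bug: every call overwrites the previous waiter's event.
        event = asyncio.Event()
        self._update_events[task_id] = event
        await event.wait()

    def notify_update(self, task_id: str) -> None:
        # Only the most recently stored event is set.
        event = self._update_events.pop(task_id, None)
        if event is not None:
            event.set()


async def demo() -> tuple[bool, bool]:
    store = BuggyStore()
    a = asyncio.ensure_future(store.wait_for_update("t1"))
    b = asyncio.ensure_future(store.wait_for_update("t1"))
    await asyncio.sleep(0)  # both register; b's event overwrites a's
    store.notify_update("t1")
    await asyncio.sleep(0.1)
    woke_a, woke_b = a.done(), b.done()
    a.cancel()  # waiter a is hung; clean it up before exiting
    return woke_a, woke_b
```

Waiter `a` never wakes because its event was silently replaced, matching the HUNG line in the before-fix output above.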

Test plan

  • Reproducer script confirms both races are fixed
  • All 217 existing task store tests pass (pytest tests/experimental/tasks/ -v)
  • 0 regressions

Fixes #2535

Two races in wait_for_update:

1. Concurrent waiters: second caller overwrites the first's event in
   _update_events[task_id], so the first waiter hangs forever.
   Fix: use a list of events per task_id so each waiter gets its own.

2. Notify before wait: if update_task completes before wait_for_update
   is called, the signal is lost because no event exists yet.
   Fix: track pending updates in a set; wait_for_update checks and
   consumes pending flags before creating an event.

Both races are reachable via task_result_handler.py:126 when multiple
clients poll the same task or when a task completes between status
checks.

Adds two tests: concurrent waiters and notify-before-wait.

Fixes modelcontextprotocol#2535
@blackwell-systems force-pushed the fix/task-store-lost-wakeup branch from 17b3ab2 to 065264d on May 2, 2026 at 15:01

Development

Successfully merging this pull request may close these issues.

InMemoryTaskStore.wait_for_update has lost-wakeup races (concurrent waiters, notify-before-wait)
