feat(block): Allow wait block to wait up to 30 days#4331

Open
TheodoreSpeaks wants to merge 50 commits into staging from feat/long-waits

Conversation

@TheodoreSpeaks
Collaborator

Summary

Reuses the human-in-the-loop logic to allow wait blocks to wait up to 30 days. This could be useful for things like email automation, where you want to send follow-ups after x days.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Other: ___________

Testing

  • Tested locally with a wait of 6 minutes. Ran resume and validated that later blocks run.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Screenshots/Videos

waleedlatif1 and others added 30 commits April 3, 2026 23:30
…ership workflow edits via sockets, ui improvements
…ration, signup method feature flags, SSO improvements
* feat(posthog): Add tracking on mothership abort (#4023)

Co-authored-by: Theodore Li <theo@sim.ai>

* fix(login): fix captcha headers for manual login  (#4025)

* fix(signup): fix turnstile key loading

* fix(login): fix captcha header passing

* Catch user already exists, remove login form captcha
…nts, secrets performance, polling refactors, drag resources in mothership
…endar triggers, docs updates, integrations/models pages improvements
…mat, logs performance improvements

fix(csp): add missing analytics domains, remove unsafe-eval, fix workspace CSP gap (#4179)
fix(landing): return 404 for invalid dynamic route slugs (#4182)
improvement(seo): optimize sitemaps, robots.txt, and core web vitals across sim and docs (#4170)
fix(gemini): support structured output with tools on Gemini 3 models (#4184)
feat(brightdata): add Bright Data integration with 8 tools (#4183)
fix(mothership): fix superagent credentials (#4185)
fix(logs): close sidebar when selected log disappears from filtered list; cleanup (#4186)
v0.6.46: mothership streaming fixes, brightdata integration
@vercel

vercel Bot commented Apr 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Ready | Preview, Comment | May 2, 2026 6:46pm |


@TheodoreSpeaks
Collaborator Author

@BugBot review

@cursor

cursor Bot commented Apr 29, 2026

PR Summary

Medium Risk
Adds a new time-based pause/resume path (DB schema + cron poller) that automatically resumes executions, which could impact workflow scheduling and pause/resume correctness if misconfigured. Changes are localized but touch execution persistence and resume dispatch logic.

Overview
Enables the Wait block to delay executions up to 30 days by splitting behavior into short in-process sleeps (≤5 min) vs long waits that suspend the workflow and later auto-resume.

This introduces a new pauseKind (human vs time) and resumeAt timestamp propagated through pause metadata/pause points, persists an indexed paused_executions.next_resume_at for efficient lookup, and adds a cron-authenticated /api/resume/poll endpoint (plus Helm CronJob) that claims due rows and dispatches resumes.

Manual resume routes and paused-execution listing/detail now exclude time-based pauses and explicitly restrict manual resume to allowedPauseKinds: ['human'], preventing users from resuming timed waits via the UI/API.
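The pauseKind/resumeAt plumbing described in this summary can be sketched roughly as follows. This is an inference from the summary text, not the PR's actual code; the real types live in apps/sim/executor/types.ts and the earliest-timestamp logic in the human-in-the-loop manager.

```typescript
// Assumed shapes: a pause point is either human- or time-based, and only
// time-based points carry a resumeAt timestamp.
type PauseKind = 'human' | 'time'

interface PausePoint {
  contextId: string
  pauseKind: PauseKind
  resumeAt?: string // ISO timestamp, set only for pauseKind === 'time'
}

// Mirror of the described behavior: next_resume_at is the earliest
// resumeAt across the execution's time-based pause points.
function computeNextResumeAt(points: PausePoint[]): Date | null {
  let earliest: Date | null = null
  for (const p of points) {
    if (p.pauseKind !== 'time' || !p.resumeAt) continue
    const at = new Date(p.resumeAt)
    if (Number.isNaN(at.getTime())) continue
    if (!earliest || at < earliest) earliest = at
  }
  return earliest
}
```

Under this reading, human-only pause sets yield null, so such rows are never selected by the cron query's isNotNull(nextResumeAt) filter and can only be resumed manually.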

Reviewed by Cursor Bugbot for commit c234b01. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/lib/core/config/feature-flags.ts Outdated
Comment thread apps/sim/app/api/resume/poll/route.ts
@TheodoreSpeaks TheodoreSpeaks marked this pull request as ready for review April 29, 2026 02:19
@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR extends the Wait block to support durations up to 30 days by reusing the human-in-the-loop pause/resume infrastructure. Waits ≤ 5 minutes continue to execute in-process; longer waits suspend the workflow by writing pauseKind: 'time' pause points to pausedExecutions, and a new per-minute cron endpoint (/api/resume/poll) resumes them when their resumeAt timestamp is reached.
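The dual-mode decision described here can be illustrated with a minimal sketch, assuming the 5-minute in-process threshold and 30-day ceiling from the summary. planWait and WaitResult are hypothetical names, not the PR's identifiers; the real handler is apps/sim/executor/handlers/wait/wait-handler.ts.

```typescript
// Illustrative constants from the summary: 5-minute in-process cap,
// 30-day overall ceiling.
const IN_PROCESS_MAX_MS = 5 * 60 * 1000 // 5 minutes
const MAX_WAIT_MS = 30 * 24 * 60 * 60 * 1000 // 30 days

type WaitResult =
  | { mode: 'in-process'; sleepMs: number }
  | { mode: 'suspended'; pauseKind: 'time'; resumeAt: Date }

function planWait(waitMs: number, now: Date = new Date()): WaitResult {
  if (waitMs <= 0 || waitMs > MAX_WAIT_MS) {
    throw new Error(`wait must be between 1 ms and 30 days, got ${waitMs} ms`)
  }
  // Short waits sleep inside the executor process.
  if (waitMs <= IN_PROCESS_MAX_MS) {
    return { mode: 'in-process', sleepMs: waitMs }
  }
  // Long waits suspend the workflow with a time-based pause point that the
  // per-minute /api/resume/poll cron later picks up via next_resume_at.
  return { mode: 'suspended', pauseKind: 'time', resumeAt: new Date(now.getTime() + waitMs) }
}
```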

  • P1 — permanently stranded executions on dispatch failure: In route.ts, when enqueueOrStartResume throws for a due pause point, the error is caught but nextRemaining (which controls the rescheduled nextResumeAt) only tracks future points. After the loop, nextResumeAt is set to NULL, so the cron query (isNotNull(nextResumeAt)) never selects the row again and the workflow hangs indefinitely with no retry or alerting.

Confidence Score: 3/5

Not safe to merge until the failed-dispatch silent-strand bug is fixed; any transient error during resume dispatch permanently locks a workflow.

A confirmed P1 defect in the new poll route means any transient DB or lock error during dispatch will permanently orphan a paused execution with no retry or observability. The rest of the implementation (schema migration, in-process vs. suspended branching, UI filtering, cron config) is well-structured and correct.

apps/sim/app/api/resume/poll/route.ts — the failed-dispatch no-retry bug and missing ORDER BY

Important Files Changed

  • apps/sim/app/api/resume/poll/route.ts — New cron-driven polling endpoint that resumes time-based paused executions; has a P1 bug where failed dispatches permanently strand executions, and lacks ORDER BY on the batch query
  • apps/sim/executor/handlers/wait/wait-handler.ts — Refactored to support in-process (≤5 min) and suspended (>5 min, up to 30 days) waits; implementation is clean but executeWithNode signature is narrower than the BlockHandler interface
  • apps/sim/executor/types.ts — Adds PauseKind union type and pauseKind/resumeAt fields to PauseMetadata and PausePoint; backward compat handled via ?? fallbacks in the manager
  • apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts — Correctly propagates pauseKind/resumeAt through persistPauseResult and computes nextResumeAt from the earliest time-based pause point
  • packages/db/migrations/0201_brave_kylun.sql — Adds next_resume_at column and partial index on paused_executions; migration looks correct
  • apps/sim/executor/handlers/wait/wait-handler.test.ts — Replaces old max-enforcement tests with suspended-workflow tests for hours/days/minutes above threshold; good coverage of the new branching logic
  • apps/sim/executor/handlers/human-in-the-loop/human-in-the-loop-handler.ts — Adds explicit pauseKind: 'human' to pause metadata; single-line change is correct and unambiguous
  • apps/sim/blocks/blocks/wait.ts — Adds hours/days options and updates documentation to reflect the new 30-day ceiling and dual-mode execution
  • helm/sim/values.yaml — Adds a 1-minute CronJob for the new poll endpoint with Forbid concurrency policy; configuration is consistent with other cron jobs

Sequence Diagram

sequenceDiagram
    participant W as WaitBlockHandler
    participant E as ExecutionEngine
    participant M as PauseResumeManager
    participant DB as pausedExecutions DB
    participant C as CronJob (/api/resume/poll)

    W->>W: execute(inputs)
    alt waitMs ≤ 5 min (in-process)
        W->>W: sleep(waitMs)
        W-->>E: {status: 'completed'}
    else waitMs > 5 min (suspended)
        W-->>E: {status: 'waiting', _pauseMetadata: {pauseKind: 'time', resumeAt}}
        E->>M: persistPauseResult(pausePoints)
        M->>M: compute nextResumeAt (earliest time pause point)
        M->>DB: INSERT/UPDATE pausedExecutions {nextResumeAt}
    end

    loop Every 1 minute
        C->>DB: SELECT WHERE status='paused' AND nextResumeAt <= now LIMIT 200
        DB-->>C: dueRows[]
        loop for each dueRow
            loop for each duePoint (pauseKind='time', resumeAt <= now)
                C->>M: enqueueOrStartResume({executionId, contextId})
                M-->>C: {status: 'starting', ...}
                C->>M: startResumeExecution() [fire and forget]
            end
            C->>DB: UPDATE SET nextResumeAt = nextRemaining (null if all done)
        end
    end

Reviews (1): Last reviewed commit: "restore ff"

Comment on lines +92 to +141
for (const point of duePoints) {
  const contextId = point.contextId
  if (!contextId) continue
  try {
    const enqueueResult = await PauseResumeManager.enqueueOrStartResume({
      executionId: row.executionId,
      contextId,
      resumeInput: {},
      userId,
    })

    if (enqueueResult.status === 'starting') {
      PauseResumeManager.startResumeExecution({
        resumeEntryId: enqueueResult.resumeEntryId,
        resumeExecutionId: enqueueResult.resumeExecutionId,
        pausedExecution: enqueueResult.pausedExecution,
        contextId: enqueueResult.contextId,
        resumeInput: enqueueResult.resumeInput,
        userId: enqueueResult.userId,
      }).catch((error) => {
        logger.error('Background time-pause resume failed', {
          executionId: row.executionId,
          contextId,
          error: toError(error).message,
        })
      })
    }
    dispatched++
  } catch (error) {
    const message = toError(error).message
    logger.warn('Failed to dispatch time-pause resume', {
      executionId: row.executionId,
      contextId,
      error: message,
    })
    failures.push({ executionId: row.executionId, contextId, error: message })
  }
}

await db
  .update(pausedExecutions)
  .set({ nextResumeAt: nextRemaining })
  .where(eq(pausedExecutions.id, row.id))
}

logger.info('Time-pause resume poll completed', {
  requestId,
  claimedRows,
  dispatched,
  failureCount: failures.length,

P1 Failed dispatches permanently strand executions

When enqueueOrStartResume throws for a due pause point, the error is caught and pushed to failures[], but nextRemaining is unaffected (it only tracks future points). The loop then runs UPDATE … SET next_resume_at = nextRemaining (effectively NULL when all points were due). After this update, the row no longer satisfies the cron query (isNotNull(nextResumeAt)), so it is silently abandoned and the workflow is permanently stuck in status = 'paused'.

Any transient failure — DB timeout, lock contention, network hiccup inside enqueueOrStartResume — turns into a permanent hang with no visible alert and no retry path.

A simple fix is to re-schedule failed points by putting their resumeAt back into nextRemaining:

for (const point of duePoints) {
  const contextId = point.contextId
  if (!contextId) continue
  try {
    // ... dispatch ...
    dispatched++
  } catch (error) {
    const message = toError(error).message
    logger.warn('Failed to dispatch time-pause resume', { ... })
    failures.push({ executionId: row.executionId, contextId, error: message })
    // Re-queue failed point
    if (point.resumeAt) {
      const retryAt = new Date(point.resumeAt)
      if (!Number.isNaN(retryAt.getTime())) {
        if (!nextRemaining || retryAt < nextRemaining) nextRemaining = retryAt
      }
    }
  }
}

Alternatively, schedule a short retry (e.g. new Date(Date.now() + 60_000)) to avoid hammering a bad point at full frequency.

Comment on lines +56 to +66
    metadata: pausedExecutions.metadata,
  })
  .from(pausedExecutions)
  .where(
    and(
      eq(pausedExecutions.status, 'paused'),
      isNotNull(pausedExecutions.nextResumeAt),
      lte(pausedExecutions.nextResumeAt, now)
    )
  )
  .limit(POLL_BATCH_LIMIT)

P2 No ORDER BY on batch query — high-volume queues risk row starvation

Without an explicit ORDER BY, PostgreSQL returns rows in an unspecified order. When the queue depth exceeds POLL_BATCH_LIMIT = 200, the same 200 rows may be returned on every invocation (e.g. lowest physical heap order), while later-inserted rows are perpetually skipped. Adding .orderBy(pausedExecutions.nextResumeAt) ensures the most-overdue entries are always processed first and that all rows are eventually drained.

.orderBy(pausedExecutions.nextResumeAt)
.limit(POLL_BATCH_LIMIT)

Comment on lines +106 to +117
async executeWithNode(
  ctx: ExecutionContext,
  block: SerializedBlock,
  inputs: Record<string, any>,
  nodeMetadata: {
    nodeId: string
    loopId?: string
    parallelId?: string
    branchIndex?: number
    branchTotal?: number
  }
): Promise<BlockOutput> {

P2 executeWithNode signature is narrower than the BlockHandler interface

BlockHandler.executeWithNode in types.ts declares nodeMetadata with three additional optional fields (originalBlockId, isLoopNode, executionOrder). The WaitBlockHandler implementation omits all three, so the method technically does not satisfy the declared interface contract. While TypeScript currently allows this (the extra fields are optional and ignored at runtime), it means callers that pass full nodeMetadata objects will silently drop fields the handler might need in a future iteration. Widening the implementation's parameter type to match the interface definition prevents this drift.
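The suggested widening could look like the following. NodeMetadata and describeNode are hypothetical names used only to show the full field set the interface reportedly declares; the real declaration is in apps/sim/executor/types.ts.

```typescript
// Hypothetical widened metadata type matching the BlockHandler interface
// as described above (three extra optional fields).
interface NodeMetadata {
  nodeId: string
  loopId?: string
  parallelId?: string
  branchIndex?: number
  branchTotal?: number
  originalBlockId?: string // declared by the interface,
  isLoopNode?: boolean     // currently omitted by the
  executionOrder?: number  // WaitBlockHandler implementation
}

// With the widened type, a handler that later needs the extra fields can
// read them without a signature change; callers lose nothing in the interim.
function describeNode(meta: NodeMetadata): string {
  const parts = [meta.nodeId]
  if (meta.loopId) parts.push(`loop:${meta.loopId}`)
  if (meta.executionOrder !== undefined) parts.push(`order:${meta.executionOrder}`)
  return parts.join(' ')
}
```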

icecrasher321 and others added 3 commits April 29, 2026 10:16
…rizations, mothership positional table row insertion, CI improvements, org-external users, file viewer improvements
v0.6.62: fix new copilot chat creation and selection on refresh
Comment thread apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts
waleedlatif1 and others added 3 commits May 2, 2026 01:55
…ixes, db query optimizations, contract boundaries code hygiene, CORS, toast improvements, tables infinite query, executor robustness, reranker support
@gitguardian

gitguardian Bot commented May 2, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
| GitGuardian id | Status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 29606901 | Triggered | Generic High Entropy Secret | a54dcbe | apps/sim/providers/utils.test.ts |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.


@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c234b01. Configure here.

status: 'completed',
status: 'waiting',
resumeAt,
_pauseMetadata: pauseMetadata,


Resumed wait block status stays "waiting" not "completed"

Medium Severity

When a long wait (>5 min) suspends and is later resumed by the poll, the block's output status remains 'waiting' because the resume merge in runResumeExecution spreads the existing output (which has status: 'waiting') but never updates it to 'completed'. In-process waits correctly return status: 'completed'. Downstream blocks referencing {{wait-block.status}} in conditional logic will see different values depending on whether the wait was short or long, potentially causing silent control-flow divergence.
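One possible shape of the fix implied by this report, assuming the resume merge can simply override the persisted status. mergeResumedOutput is a hypothetical helper, not the PR's runResumeExecution; the point is only that the spread must not be the last word on status.

```typescript
// A resumed wait block's persisted output still says 'waiting'; override it
// so short (in-process) and long (suspended) waits report the same status.
type BlockOutput = Record<string, unknown> & { status?: string }

function mergeResumedOutput(existing: BlockOutput): BlockOutput {
  return {
    ...existing, // keep resumeAt and any other persisted fields
    status: 'completed', // a resumed wait has finished waiting
  }
}
```

With this, downstream conditions on {{wait-block.status}} see 'completed' regardless of which wait path ran.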

Additional Locations (1)

Reviewed by Cursor Bugbot for commit c234b01. Configure here.

