feat(enterprise): add data drains for continuous export to S3 / webhook #4440
waleedlatif1 wants to merge 20 commits into staging from
Conversation
PR Summary (High Risk)
Overview: Implements the backend: new … Adds destination/source registries with initial implementations for S3 (deterministic keys, SSE …)
Reviewed by Cursor Bugbot for commit 57f6571.
Data drains let enterprise organizations continuously export Sim data (workflow logs, job logs, audit logs, copilot chats, copilot runs) to customer-controlled S3 buckets or HTTPS webhooks on hourly or daily schedules, pairing with data retention to support long-term compliance archiving. The feature is built around two registries (DrainSource + DrainDestination) so adding a new source or destination is a single-file change. Delivery is cursor-based and at-least-once: the cursor advances only on full success, and rows carry stable ids so consumers can dedupe. It also includes SSRF-validated webhooks with DNS pinning, HMAC-SHA256 timestamp signatures, S3 server-side encryption, audit logging on every config and run change, and self-hosted env var gating that mirrors data retention.
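For webhook consumers, the HMAC-SHA256 timestamp signature can be checked along these lines. This is a minimal sketch: the header values and the `${timestamp}.${body}` signing string are illustrative assumptions, not the contract documented by this PR.

```ts
import { createHmac, timingSafeEqual } from 'node:crypto'

// Verify an HMAC-SHA256 timestamp signature on an incoming drain delivery.
// The epoch-ms timestamp header and the `${timestamp}.${body}` signing string
// are assumptions for illustration; check the drain docs for the real contract.
export function verifyDrainSignature(
  rawBody: string,
  timestampHeader: string,
  signatureHeader: string,
  secret: string,
  toleranceMs = 5 * 60 * 1000
): boolean {
  const timestamp = Number(timestampHeader)
  // Reject stale or malformed timestamps to limit the replay window.
  if (!Number.isFinite(timestamp) || Math.abs(Date.now() - timestamp) > toleranceMs) {
    return false
  }
  const expected = createHmac('sha256', secret)
    .update(`${timestampHeader}.${rawBody}`)
    .digest('hex')
  const a = Buffer.from(expected, 'hex')
  const b = Buffer.from(signatureHeader, 'hex')
  // Constant-time comparison; a length mismatch already means the signature is invalid.
  return a.length === b.length && timingSafeEqual(a, b)
}
```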
Force-pushed from 53850b2 to db5d298
Greptile Summary
This PR introduces a full data-drain pipeline for continuously exporting workflow logs, job logs, audit logs, copilot chats, and copilot runs to customer-owned S3 buckets or HTTPS webhooks on hourly/daily schedules. It includes two destination backends (S3 + webhook), five source generators with composite keyset cursors, a dispatcher with conditional atomic claiming, an orphan reaper, Trigger.dev background task integration, and SSRF-safe delivery with HMAC-SHA256 signatures.
Confidence Score: 5/5
Safe to merge; the core delivery guarantees, security hardening, and access control gates are all correct. This is a large, well-structured feature with thorough security controls. All findings are minor: one metrics counter undercount in the dispatcher, one hard-parse failure mode that could affect list responsiveness after a schema evolution, and a missing Trigger.dev maxDuration that could impede very large first-time backfills. None of these affect data integrity, security, or correctness for the typical enterprise drain workflow.
Flagged files: apps/sim/lib/data-drains/dispatcher.ts (skipped counter), apps/sim/lib/data-drains/serializers.ts (hard-parse list failure), apps/sim/background/run-data-drain.ts (no maxDuration)
Important Files Changed
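The composite keyset cursors mentioned above can be pictured as an ordered (createdAt, id) position. The stand-alone sketch below shows the comparison and page selection in isolation; it is not the PR's source-generator code, and the field names are assumptions.

```ts
// Illustrative composite keyset cursor over (createdAt, id). Rows are emitted
// strictly after the cursor position; the id tie-break keeps ordering stable
// when many rows share a timestamp, so re-runs never skip or duplicate pages.
interface Cursor {
  createdAt: number // epoch millis
  id: string
}

interface Row extends Cursor {
  payload: unknown
}

function afterCursor(row: Row, cursor: Cursor | null): boolean {
  if (!cursor) return true // first run: export everything from the beginning
  if (row.createdAt !== cursor.createdAt) return row.createdAt > cursor.createdAt
  return row.id > cursor.id
}

function nextPage(rows: Row[], cursor: Cursor | null, pageSize: number): Row[] {
  return rows
    .filter((r) => afterCursor(r, cursor))
    .sort((a, b) => a.createdAt - b.createdAt || a.id.localeCompare(b.id))
    .slice(0, pageSize)
}
```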
Sequence Diagram
```mermaid
sequenceDiagram
    participant Cron as Cron (/run-data-drains)
    participant Dispatcher
    participant DB
    participant Queue as Job Queue (Trigger.dev)
    participant Task as run-data-drain task
    participant Source
    participant Destination
    Cron->>Dispatcher: dispatchDueDrains()
    Dispatcher->>DB: reapOrphanedRuns()
    Dispatcher->>DB: SELECT due drains
    loop per candidate
        Dispatcher->>DB: isOrganizationOnEnterprisePlan
        Dispatcher->>DB: UPDATE claim (conditional)
        Dispatcher->>Queue: enqueue run-data-drain
    end
    Queue->>Task: trigger with signal
    Task->>DB: SELECT drain + config
    Task->>Task: decryptCredentials
    Task->>DB: INSERT dataDrainRun status=running
    Task->>Destination: openSession()
    loop each chunk
        Task->>Source: pages() chunk[]
        Task->>Destination: deliver(body, metadata, signal)
        Destination-->>Task: locator
    end
    alt success
        Task->>DB: UPDATE dataDrains cursor + lastSuccessAt
        Task->>DB: UPDATE dataDrainRun status=success
    else cancelled / error
        Task->>DB: UPDATE dataDrains lastRunAt only
        Task->>DB: UPDATE dataDrainRun status=failed
    end
    Task->>Destination: close()
```
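The "UPDATE claim (conditional)" step works like a compare-and-set on the drain row so two dispatcher ticks cannot enqueue the same run twice. A hedged sketch with node-postgres; the table, column, and status names are assumptions inferred from the diagram, not the actual schema.

```ts
import { Pool } from 'pg'

const pool = new Pool({ connectionString: process.env.DATABASE_URL })

// Claim a due drain before enqueueing its run. The WHERE clause makes the
// claim conditional: if another dispatcher already claimed the drain or its
// schedule moved, zero rows update and we skip enqueueing. Names here are
// assumptions for illustration, not the PR's schema.
async function tryClaimDrain(drainId: string, now: Date): Promise<boolean> {
  const result = await pool.query(
    `UPDATE data_drains
        SET status = 'claimed', last_run_at = $2
      WHERE id = $1
        AND enabled = true
        AND status <> 'claimed'
        AND next_run_at <= $2`,
    [drainId, now]
  )
  return (result.rowCount ?? 0) === 1
}
```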
Reviews (12): Last reviewed commit: "refactor(data-drains): trim extraneous c..."
…ument copilot_chats cursor
- Thread AbortSignal through webhook test() and secureFetchWithPinnedIP so the route's 10s timeout actually cancels the outbound request
- Re-validate destinationConfig against the typed schema in serializeDrain so unexpected JSONB shapes surface instead of leaking
- Note in docs that drains export rows once on creation cursor; mutable copilot_chats fields are a point-in-time snapshot
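A minimal sketch of the AbortSignal threading described in the commit above: the route's timeout signal is passed all the way down to the outbound request, so an abort cancels the in-flight call instead of letting it run to completion. Function names here are illustrative, not the PR's webhook test() or secureFetchWithPinnedIP.

```ts
// Illustrative only: thread one AbortSignal from the route timeout down to
// the outbound request, so aborting cancels the socket rather than letting
// the handler return while the request keeps running in the background.
async function testWebhook(url: string, signal: AbortSignal): Promise<number> {
  const res = await fetch(url, { method: 'POST', body: '{"type":"test"}', signal })
  return res.status
}

async function handler(url: string) {
  // AbortSignal.timeout() mirrors the route's 10s budget mentioned above.
  const signal = AbortSignal.timeout(10_000)
  try {
    return await testWebhook(url, signal)
  } catch (err) {
    if (signal.aborted) return 'timed out'
    throw err
  }
}
```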
@greptile
@cursor review
…SRF, unused hook)
- webhook deliver: pass signal to secureFetchWithPinnedIP so aborts cancel the in-flight socket instead of waiting for the 30s timeout
- S3 config: SSRF-validate the optional endpoint via validateExternalUrl so an enterprise admin cannot point the AWS SDK at internal/metadata addresses
- hooks: remove unused useDataDrain (single-drain detail hook had no consumer)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
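The endpoint SSRF validation mentioned above boils down to resolving the hostname and rejecting private, loopback, link-local, and metadata addresses before the URL ever reaches the AWS SDK or an outbound fetch. The sketch below is an assumption-level illustration (IPv4 only), not the PR's validateExternalUrl or assertEndpointIsPublic.

```ts
import { lookup } from 'node:dns/promises'
import { isIP } from 'node:net'

// Reject URLs whose host resolves to a non-public address before using them
// as an S3 endpoint or webhook target. IPv4-only and deliberately simplified;
// a real check also needs IPv6 ranges and DNS pinning on the later request.
const PRIVATE_V4 = [
  /^10\./,
  /^127\./,
  /^169\.254\./, // link-local, includes the 169.254.169.254 metadata service
  /^172\.(1[6-9]|2\d|3[01])\./,
  /^192\.168\./,
  /^0\./,
]

export async function assertEndpointIsPublicSketch(rawUrl: string): Promise<void> {
  const url = new URL(rawUrl)
  if (url.protocol !== 'https:') throw new Error('endpoint must use https')
  const address = isIP(url.hostname) ? url.hostname : (await lookup(url.hostname, 4)).address
  if (PRIVATE_V4.some((range) => range.test(address))) {
    throw new Error(`endpoint resolves to a non-public address: ${address}`)
  }
}
```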
@greptile
@cursor review
…, self-hosted gate)
- update body schema: drop the discriminated-union-with-.optional() that silently required destinationType for any non-undefined body. The route already validates destination payloads against the typed configSchema/credentialsSchema for the existing drain, so the contract is now a flat partial — clients can send {enabled:false} without supplying destinationType
- S3 buildKey: partition by run startedAt instead of new Date() per chunk so a single run that crosses midnight still lands under one YYYY/MM/DD prefix
- self-hosted gate: wire DATA_DRAINS_ENABLED into authorizeDrainAccess and the cron dispatcher route so the docs claim ("reserved for server-side feature gating") is actually enforced — mutating endpoints 404 and the dispatcher no-ops when unset
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
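A sketch of the date-partitioned key scheme from the buildKey change above: every chunk of a run derives its prefix from the run's startedAt, so a run that crosses midnight still lands under one YYYY/MM/DD prefix. The exact key layout and file extension are assumptions, not the PR's format.

```ts
// Illustrative key layout only; the real prefix structure may differ.
// All chunks of a run share the run's startedAt date, never `new Date()`,
// so a run that crosses midnight still lands under one YYYY/MM/DD prefix.
function buildKey(source: string, runId: string, startedAt: Date, chunkIndex: number): string {
  const yyyy = startedAt.getUTCFullYear()
  const mm = String(startedAt.getUTCMonth() + 1).padStart(2, '0')
  const dd = String(startedAt.getUTCDate()).padStart(2, '0')
  return `${source}/${yyyy}/${mm}/${dd}/${runId}/chunk-${String(chunkIndex).padStart(5, '0')}.ndjson`
}

// buildKey('workflow_logs', 'run_123', new Date('2024-06-01T23:59:00Z'), 0)
// => 'workflow_logs/2024/06/01/run_123/chunk-00000.ndjson'
```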
@greptile
@cursor review
…elf-hosted
isOrganizationOnEnterprisePlan returns false on deployments without billing infrastructure, so the dispatcher would silently skip every drain on self-hosted even with DATA_DRAINS_ENABLED=true. Mirror the access.ts pattern: when isBillingEnabled is false, treat all orgs as eligible — the cron route's DATA_DRAINS_ENABLED gate already controls global on/off.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cher
A throw from isOrganizationOnEnterprisePlan (Stripe outage, DB timeout) for one org used to propagate out of the for-loop and abort the whole dispatch batch. Wrap the check in try-catch so a single bad lookup just skips that drain — the next cron tick retries it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
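Taken together, the last two commits describe an eligibility check that short-circuits on self-hosted deployments and isolates per-org lookup failures. A sketch, with the plan checker injected so the snippet stays self-contained; this is not the PR's dispatcher code.

```ts
// Sketch of the two dispatcher behaviours described above. The plan checker
// is passed in as a parameter purely to keep the example self-contained.
type PlanCheck = (orgId: string) => Promise<boolean>

async function isDrainEligible(
  orgId: string,
  billingEnabled: boolean,
  isOrganizationOnEnterprisePlan: PlanCheck
): Promise<boolean> {
  // Self-hosted deployments have no billing infrastructure, so every org is
  // eligible and DATA_DRAINS_ENABLED stays the single global switch.
  if (!billingEnabled) return true
  try {
    return await isOrganizationOnEnterprisePlan(orgId)
  } catch {
    // One failed lookup (Stripe outage, DB timeout) only skips this drain;
    // the rest of the batch continues and the next cron tick retries it.
    return false
  }
}
```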
@greptile
Promise reject is idempotent so this wasn't a correctness bug, but routing the already-aborted branch through settledReject keeps all settling paths consistent and ensures cleanupAbort runs even if a listener somehow gets registered later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
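The single-settle pattern being described can be sketched as below: every outcome (already aborted, abort during flight, resolve, reject) funnels through one settle path that also removes the abort listener. The names here are illustrative; the PR's settledReject and cleanupAbort may be shaped differently.

```ts
// Illustrative single-settle wrapper around an abortable request. Each
// settling path goes through settle(), which also detaches the abort
// listener so nothing fires after the promise has already settled.
function fetchWithAbort(url: string, signal: AbortSignal): Promise<Response> {
  return new Promise((resolve, reject) => {
    let settled = false
    const onAbort = () => settle(() => reject(new Error('aborted')))
    const cleanupAbort = () => signal.removeEventListener('abort', onAbort)
    const settle = (fn: () => void) => {
      if (settled) return
      settled = true
      cleanupAbort()
      fn()
    }
    if (signal.aborted) {
      // Route the already-aborted branch through the same settle path.
      onAbort()
      return
    }
    signal.addEventListener('abort', onAbort, { once: true })
    fetch(url, { signal })
      .then((res) => settle(() => resolve(res)))
      .catch((err) => settle(() => reject(err)))
  })
}
```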
@cursor review
@greptile
- service: throw on cancellation after pages loop so a run aborted mid-stream isn't recorded as success
- audit-logs: include org-scoped rows (workspace_id IS NULL with metadata->>organizationId match) alongside workspace rows
- access: require owner/admin for read routes too; drain configs leak bucket names and webhook URLs
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
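The org-scoped audit-log condition in the commit above roughly corresponds to the query below: workspace-scoped rows plus org-level rows where workspace_id is NULL and the metadata JSON carries the organization id. Table and column names are assumptions based on the commit text, not the actual schema.

```ts
import { Pool } from 'pg'

const pool = new Pool({ connectionString: process.env.DATABASE_URL })

// Include both workspace-scoped rows and org-level rows (workspace_id IS NULL
// with a matching organizationId in the metadata JSON). Column names are
// assumptions based on the commit message.
async function fetchAuditLogPage(workspaceIds: string[], organizationId: string, limit: number) {
  const { rows } = await pool.query(
    `SELECT *
       FROM audit_log
      WHERE workspace_id = ANY($1::text[])
         OR (workspace_id IS NULL AND metadata->>'organizationId' = $2)
      ORDER BY created_at, id
      LIMIT $3`,
    [workspaceIds, organizationId, limit]
  )
  return rows
}
```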
@cursor review
@greptile
…batch
If the rollback update threw (e.g. transient DB error), the exception bubbled out of the for loop and silently skipped the rest of the candidate drains for the cycle. Wrap it so the batch continues.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@greptile
@cursor review
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 4fe0d0e.
Audit cleanup before merge:
- service: drop chunk-empty defensive skip (sources already handle it), trim WHAT-comments
- dispatcher: tighten claim-race / rollback / enterprise-cache rationale to a single WHY each
- access: collapse the duplicated module-top + inline comments into one TSDoc on the gate function
- s3: fix orphaned doc block over assertEndpointIsPublic, soften the forcePathStyle TSDoc to match the actual default
- webhook: drop empty close() comment
- docs: clarify that drain reads also require owner/admin, drop the "on the dispatcher tick" implementation detail
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@greptile
@cursor review
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Reviewed by Cursor Bugbot for commit 57f6571.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add runsList factory to avoid bypassing the query-key factory, drop keepPreviousData on the near-static drains list, invalidate the drain detail on run-now/delete, remove the orphaned detail/runs caches on delete, add an aria-label on the row actions trigger, and use cn() for the conditional run-status class. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CLAUDE.md said `@/lib/utils` but the actual export lives at `@/lib/core/utils/cn`. Build was failing with "Module not found". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…es, audit-log index
- cursor: compare in millisecond buckets so PG's microsecond-precision timestamps don't cause the cursor row to re-emit forever. The JS Date round-trip truncates 00:00:00.123456 to 00:00:00.123, which made gt(col, cursor) match the cursor row itself.
- service: insert the run row before parse/decrypt so encryption-key rotation or schema drift surface as a failed run in the UI instead of vanishing into background-job logs while lastRunAt advances.
- audit_log: add (workspace_id, created_at, id) composite index so the audit-logs source's tie-breaking ORDER BY is satisfied by the index without a heap fetch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
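The millisecond-bucket fix above in miniature: Postgres stores microseconds, a JS Date keeps only milliseconds, so a strict greater-than against the round-tripped cursor timestamp keeps re-matching the cursor row. Comparing in millisecond buckets and tie-breaking on id stops the re-emit loop. A pure illustration, not the PR's cursor code.

```ts
// Pure illustration of the millisecond-bucket comparison. A PG timestamp of
// 00:00:00.123456 round-trips through a JS Date as 00:00:00.123, so a strict
// `createdAt > cursor.createdAt` keeps re-matching the cursor row forever.
// Comparing in ms buckets and tie-breaking on id excludes the cursor row.
interface CursorPos {
  createdAtMs: number
  id: string
}

function isAfterCursor(rowCreatedAtMs: number, rowId: string, cursor: CursorPos): boolean {
  if (rowCreatedAtMs > cursor.createdAtMs) return true
  if (rowCreatedAtMs < cursor.createdAtMs) return false
  // Same millisecond bucket: only rows with a strictly greater id advance.
  return rowId > cursor.id
}

// The cursor row itself (same ms, same id) is excluded, so it is not re-emitted:
// isAfterCursor(1717286400123, 'row_42', { createdAtMs: 1717286400123, id: 'row_42' }) === false
```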

Summary
DATA_DRAINS_ENABLED / NEXT_PUBLIC_DATA_DRAINS_ENABLED, mirroring data retention
Type of Change
Testing
bun run check:api-validation passing
Checklist