chore: instrument controlplane and cas with OpenTelemetry tracing #3081
javirln wants to merge 8 commits into chainloop-dev:main
Conversation
Conversation
Add distributed tracing across both services using OpenTelemetry SDK v1.43.0. TracerProvider is wired via Wire and configured through proto-based config.

- Add pkg/otelx helper with LayeredTracer for automatic chainloop.layer tagging
- Add TracerProvider with OTLP gRPC exporter to both controlplane and CAS
- Add otelgrpc stats handlers on both gRPC servers (server + client side)
- Instrument SQL queries via XSAM/otelsql in the controlplane data layer
- Add spans to all biz, data, middleware, and CAS service layer methods
- Extend Observability proto with Tracing config in both services

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
…hing

- Replace raw t.Fatal/t.Errorf with testify assert/require in otelx tests
- Fix staticcheck QF1008: remove redundant embedded field selector
- Update integration test mock expectations from exact ctx to mock.Anything

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
…tests

Update dispatcher_test.go, integration_test.go, and organization_integration_test.go to use mock.Anything for context parameters in Register, Attach, and SaveCredentials mock expectations.

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
…s disabled

Change NewTracerProvider to return the trace.TracerProvider interface, with noop.NewTracerProvider() when tracing is disabled, eliminating nil checks in consumers. The data layer now always uses otelsql since the provider is always valid.

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
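The "always return a valid provider" pattern described in that commit can be sketched without the OpenTelemetry SDK. In this stdlib-only sketch, TracerProvider, sdkProvider, and noopProvider are illustrative stand-ins for the real trace.TracerProvider types, not the PR's actual code:

```go
package main

import "fmt"

// TracerProvider is a stand-in for trace.TracerProvider.
type TracerProvider interface {
	Name() string
}

type sdkProvider struct{}

func (sdkProvider) Name() string { return "sdk" }

type noopProvider struct{}

func (noopProvider) Name() string { return "noop" }

// NewTracerProvider always returns a usable provider, so consumers
// (like the data layer's otelsql wiring) never need nil checks:
// disabled tracing simply yields an inert no-op implementation.
func NewTracerProvider(enabled bool) TracerProvider {
	if !enabled {
		return noopProvider{}
	}
	return sdkProvider{}
}

func main() {
	fmt.Println(NewTracerProvider(false).Name()) // noop
	fmt.Println(NewTracerProvider(true).Name())  // sdk
}
```

The payoff is at the call sites: every consumer can unconditionally instrument with the provider it receives, and the enabled/disabled decision lives in exactly one place.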
Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
5 issues found across 98 files
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritises the most important files to review.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="app/controlplane/pkg/biz/casbackend_checker.go">
<violation number="1" location="app/controlplane/pkg/biz/casbackend_checker.go:67">
P2: The new span wraps the entire checker lifecycle instead of each backend-check run, creating a long-lived parent span and coupling sampling across all periodic executions.</violation>
</file>
<file name="app/controlplane/internal/server/otel.go">
<violation number="1" location="app/controlplane/internal/server/otel.go:78">
P1: `sampling_ratio=0` is handled as `AlwaysSample`, so a config intended to disable tracing instead samples every trace.</violation>
</file>
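The sampler-selection logic the reviewer is asking for can be sketched as a pure function. The name pickSampler and the string sampler labels below are hypothetical; they mirror the OpenTelemetry sampler names (NeverSample, AlwaysSample, TraceIDRatioBased) but this is not the PR's actual otel.go:

```go
package main

import (
	"errors"
	"fmt"
)

// pickSampler shows explicit boundary handling: nil (unset) keeps a
// default, 0 disables sampling entirely, 1 samples everything, and
// out-of-range values are rejected rather than silently mapped to
// AlwaysSample.
func pickSampler(ratio *float64) (string, error) {
	switch {
	case ratio == nil:
		return "AlwaysSample", nil // unset: keep the previous default
	case *ratio == 0:
		return "NeverSample", nil
	case *ratio == 1:
		return "AlwaysSample", nil
	case *ratio < 0 || *ratio > 1:
		return "", errors.New("sampling_ratio must be in [0, 1]")
	default:
		return "TraceIDRatioBased", nil
	}
}

func ptr(f float64) *float64 { return &f }

func main() {
	s, _ := pickSampler(ptr(0))
	fmt.Println(s) // NeverSample
	_, err := pickSampler(ptr(1.5))
	fmt.Println(err != nil) // true
}
```

Taking a pointer (or a proto optional field) as input is what makes the nil-versus-explicit-0.0 distinction possible in the first place.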
<file name="app/controlplane/internal/conf/controlplane/config/v1/conf.proto">
<violation number="1" location="app/controlplane/internal/conf/controlplane/config/v1/conf.proto:66">
P2: `sampling_ratio` should use explicit presence (`optional`) to distinguish unset from an explicit `0.0` value.</violation>
</file>
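The explicit-presence fix the reviewer suggests would look roughly like this in proto3. Field names and numbers here are illustrative, not the PR's exact schema:

```proto
// Sketch of a Tracing config message with explicit presence.
message Tracing {
  bool enabled = 1;
  string endpoint = 2;
  bool insecure = 3;
  // proto3 "optional" gives explicit field presence, so an unset
  // sampling_ratio can be distinguished from an explicit 0.0.
  optional double sampling_ratio = 4;
}
```

In the generated Go code, an optional scalar becomes a pointer field, so the server can check for nil before deciding between a default sampler and NeverSample.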
<file name="app/artifact-cas/internal/server/otel.go">
<violation number="1" location="app/artifact-cas/internal/server/otel.go:77">
P2: Out-of-range `sampling_ratio` values are silently mapped to `AlwaysSample`, which can unexpectedly enable 100% tracing under misconfiguration.</violation>
</file>
<file name="app/controlplane/pkg/data/data.go">
<violation number="1" location="app/controlplane/pkg/data/data.go:129">
P2: `min_open_conns` is being mapped to `SetMaxIdleConns`, which does not enforce a minimum open pool size and changes config semantics.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
  log.Infof("DB: setting min open conns: %v", n)
- poolConfig.MinConns = c.MinOpenConns
+ // database/sql doesn't have MinOpenConns, but MaxIdleConns serves a similar purpose
+ db.SetMaxIdleConns(int(n))
P2: min_open_conns is being mapped to SetMaxIdleConns, which does not enforce a minimum open pool size and changes config semantics.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At app/controlplane/pkg/data/data.go, line 129:
<comment>`min_open_conns` is being mapped to `SetMaxIdleConns`, which does not enforce a minimum open pool size and changes config semantics.</comment>
<file context>
@@ -95,38 +96,44 @@ func NewData(c *config.DatabaseConfig, logger log.Logger) (*Data, func(), error)
log.Infof("DB: setting min open conns: %v", n)
- poolConfig.MinConns = c.MinOpenConns
+ // database/sql doesn't have MinOpenConns, but MaxIdleConns serves a similar purpose
+ db.SetMaxIdleConns(int(n))
}
</file context>
Suggested change:
- db.SetMaxIdleConns(int(n))
+ log.Warnf("DB: min_open_conns is not supported by database/sql; ignoring value: %v", n)
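The semantic gap the reviewer points out can be made concrete with a small decision function. This is a stdlib-only sketch; idleConnsFor is a hypothetical helper, not the PR's data.go:

```go
package main

import "fmt"

// idleConnsFor approximates a configured min_open_conns for
// database/sql, which has no MinOpenConns (pgx's pool does). Setting
// MaxIdleConns keeps that many connections warm once opened, but it
// never proactively opens connections, so it is not a hard minimum;
// callers should surface a warning to make that clear.
func idleConnsFor(minOpen, maxOpen int) (maxIdle int, warning string) {
	if minOpen <= 0 {
		return 2, "" // database/sql's default MaxIdleConns
	}
	if maxOpen > 0 && minOpen > maxOpen {
		minOpen = maxOpen // idle conns above the open cap are useless
	}
	return minOpen, "min_open_conns approximated via MaxIdleConns; not a hard minimum"
}

func main() {
	idle, warn := idleConnsFor(5, 20)
	fmt.Println(idle)       // 5
	fmt.Println(warn != "") // true
}
```

Either option in the review thread is defensible: map the value with a warning (as above) or ignore it with a warning; what matters is that the log message no longer implies a minimum is being enforced.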
Adopt the platform's LayeredTracer design: lazy tracer resolution via global provider, per-layer disable support via SetDisabledLayers, and TraceCarrier for async context propagation. The LayeredTracer no longer embeds trace.Tracer and instead resolves it at span creation time. Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
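The per-layer disable mechanism described above can be sketched with stdlib primitives. layerRegistry below is an illustrative analogue of the SetDisabledLayers design, not the actual pkg/otelx code:

```go
package main

import (
	"fmt"
	"sync"
)

// layerRegistry tracks which tracing layers (biz, data, middleware,
// ...) are currently disabled. A zero value is ready to use.
type layerRegistry struct {
	mu       sync.RWMutex
	disabled map[string]struct{}
}

// SetDisabledLayers replaces the disabled set atomically.
func (r *layerRegistry) SetDisabledLayers(layers ...string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.disabled = make(map[string]struct{}, len(layers))
	for _, l := range layers {
		r.disabled[l] = struct{}{}
	}
}

// Enabled is consulted at span-creation time (lazy resolution), so
// toggling layers takes effect without rebuilding any tracer.
func (r *layerRegistry) Enabled(layer string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	_, off := r.disabled[layer]
	return !off
}

func main() {
	var reg layerRegistry
	reg.SetDisabledLayers("data")
	fmt.Println(reg.Enabled("biz"))  // true
	fmt.Println(reg.Enabled("data")) // false
}
```

Resolving the tracer (and the disabled set) lazily is what lets the LayeredTracer stop embedding trace.Tracer: nothing is captured at construction time, so a provider installed later is still picked up.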
2 issues found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="pkg/otelx/otelx_test.go">
<violation number="1" location="pkg/otelx/otelx_test.go:67">
P2: Do not reset the global OpenTelemetry tracer provider to nil in test cleanup; it can cause panics in subsequent tracer creation.</violation>
</file>
<file name="pkg/otelx/otelx.go">
<violation number="1" location="pkg/otelx/otelx.go:102">
P1: Returning `trace.SpanFromContext(ctx)` for disabled layers can end the parent span when callers `defer span.End()`. Return an explicit noop span instead.</violation>
</file>
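The hazard in that P1 finding is worth spelling out: if a disabled layer hands back the parent span, the caller's usual `defer span.End()` ends the parent. A minimal stdlib sketch of the fix (Span, realSpan, and startSpan are illustrative stand-ins, not the otelx API):

```go
package main

import "fmt"

// Span is a stand-in for trace.Span; only End matters here.
type Span interface{ End() }

type realSpan struct{ ended *bool }

func (s realSpan) End() { *s.ended = true }

// noopSpan is pre-allocated and inert: ending it is harmless.
type noopSpan struct{}

func (noopSpan) End() {}

var noop = noopSpan{}

// startSpan returns the noop span for disabled layers instead of the
// parent span from context, so a caller's `defer span.End()` can
// never terminate the parent.
func startSpan(layerEnabled bool, parent Span) Span {
	if !layerEnabled {
		return noop
	}
	return parent // stand-in for starting a real child span
}

func main() {
	parentEnded := false
	span := startSpan(false, realSpan{ended: &parentEnded})
	span.End()
	fmt.Println(parentEnded) // false: the parent was untouched
}
```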
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
- Fix sampling_ratio=0 being treated as AlwaysSample by using an optional proto field and explicit 0/1 boundary handling with NeverSample
- Fix disabled layer returning the parent span from context (could end the parent on defer); use a pre-allocated noop span instead
- Move the CASBackendChecker span from Start (long-lived loop) to checkBackends (per-iteration) for correct span lifecycle
- Fix test cleanup setting the global TracerProvider to nil
- Clarify the min_open_conns mapping to MaxIdleConns in the log message

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
In client-streaming gRPC, Send can return EOF when the server has already terminated the stream with an error. Ignore Send errors in error-path subtests since the assertion happens on CloseAndRecv. Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
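The test pattern that commit describes can be sketched without gRPC. fakeStream below simulates a client stream whose server has already failed, which is exactly when Send may return io.EOF; the names are illustrative, not the PR's test code:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// fakeStream mimics a client-streaming RPC whose server has already
// terminated the stream with an error: Send reports io.EOF and the
// real status only surfaces on CloseAndRecv.
type fakeStream struct{}

func (fakeStream) Send(b []byte) error { return io.EOF }

func (fakeStream) CloseAndRecv() (string, error) {
	return "", errors.New("rpc error: backend unavailable")
}

func main() {
	s := fakeStream{}
	// Error-path pattern: ignore Send's result, since io.EOF here is
	// a legitimate signal that the server already closed the stream...
	_ = s.Send([]byte("chunk"))
	// ...and assert on CloseAndRecv, which carries the real error.
	_, err := s.CloseAndRecv()
	fmt.Println(err != nil) // true
}
```

Asserting on Send in the error-path subtests would make the tests flaky: whether Send or CloseAndRecv observes the failure depends on timing.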
This pull request adds OpenTelemetry tracing support to the artifact-cas service, making tracing configurable via the application config and wiring it into the application's startup. The configuration schema, protobuf definitions, and generated code are updated to support tracing options. The gRPC server is now instrumented for tracing, and the configuration file includes a section for observability tracing settings.
Observability and Tracing Integration:
- Added an observability.tracing section to the config schema (conf.proto and config.devel.yaml), allowing tracing to be enabled/disabled, with endpoint configuration, insecure mode, and sampling ratio.
- Added the Bootstrap_Observability_Tracing message and related getters/setters in conf.pb.go.

Application Wiring and Startup:

- Updated wire.go, wire_gen.go, and main.go to pass the full bootstrap config and initialize a TracerProvider if tracing is enabled. The tracer provider is now passed to the app and properly cleaned up on shutdown.

gRPC Server Instrumentation:

- Instrumented the gRPC server with the otelgrpc stats handler, enabling distributed tracing for all gRPC requests.

Other:

- wire.go.

These changes make the service ready for distributed tracing with OpenTelemetry, improving observability and making it easier to diagnose issues in production environments.
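For orientation, the observability.tracing section might look roughly like this in config.devel.yaml. The key names below follow the options listed above (enable/disable, endpoint, insecure mode, sampling ratio) but are an illustrative sketch, not the PR's exact schema:

```yaml
# Illustrative dev-config sketch; key names are assumptions.
observability:
  tracing:
    enabled: true
    endpoint: "localhost:4317"  # OTLP gRPC collector
    insecure: true              # dev only: skip TLS to the collector
    sampling_ratio: 0.25        # omit entirely to keep the default
```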