Skip to content

chore: instrument controlplane and cas with OpenTelemetry tracing#3081

Open
javirln wants to merge 8 commits intochainloop-dev:mainfrom
javirln:feat/otel-instrumentation
Open

chore: instrument controlplane and cas with OpenTelemetry tracing#3081
javirln wants to merge 8 commits intochainloop-dev:mainfrom
javirln:feat/otel-instrumentation

Conversation

@javirln
Copy link
Copy Markdown
Member

@javirln javirln commented Apr 30, 2026

This pull request adds OpenTelemetry tracing support to the artifact-cas service, making tracing configurable via the application config and wiring it into the application's startup. The configuration schema, protobuf definitions, and generated code are updated to support tracing options. The gRPC server is now instrumented for tracing, and the configuration file includes a section for observability tracing settings.

Observability and Tracing Integration:

  • Added a new observability.tracing section to the config schema (conf.proto and config.devel.yaml), allowing tracing to be enabled/disabled, endpoint configuration, insecure mode, and sampling ratio. [1] [2]
  • Updated the protobuf definitions and generated Go code to include the new tracing configuration options (Bootstrap_Observability_Tracing message and related getters/setters in conf.pb.go). [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14]

Application Wiring and Startup:

  • Modified the application wiring (wire.go, wire_gen.go, and main.go) to pass the full bootstrap config and initialize a TracerProvider if tracing is enabled. The tracer provider is now passed to the app and properly cleaned up on shutdown. [1] [2] [3] [4] [5] [6]

gRPC Server Instrumentation:

  • Instrumented the gRPC server with OpenTelemetry by adding the otelgrpc stats handler, enabling distributed tracing for all gRPC requests. [1] [2]

Other:

  • Updated copyright year in wire.go.

These changes make the service ready for distributed tracing with OpenTelemetry, improving observability and making it easier to diagnose issues in production environments.

javirln added 5 commits April 30, 2026 09:11
Add distributed tracing across both services using OpenTelemetry SDK v1.43.0.
TracerProvider is wired via Wire and configured through proto-based config.

- Add pkg/otelx helper with LayeredTracer for automatic chainloop.layer tagging
- Add TracerProvider with OTLP gRPC exporter to both controlplane and CAS
- Add otelgrpc stats handlers on both gRPC servers (server + client side)
- Instrument SQL queries via XSAM/otelsql in the controlplane data layer
- Add spans to all biz, data, middleware, and CAS service layer methods
- Extend Observability proto with Tracing config in both services

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
…hing

- Replace raw t.Fatal/t.Errorf with testify assert/require in otelx tests
- Fix staticcheck QF1008: remove redundant embedded field selector
- Update integration test mock expectations from exact ctx to mock.Anything

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
…tests

Update dispatcher_test.go, integration_test.go, and
organization_integration_test.go to use mock.Anything for context
parameters in Register, Attach, and SaveCredentials mock expectations.

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
…s disabled

Change NewTracerProvider to return trace.TracerProvider interface with
noop.NewTracerProvider() when disabled, eliminating nil checks in
consumers. The data layer now always uses otelsql since the provider
is always valid.

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
@javirln javirln requested review from jiparis and migmartri April 30, 2026 10:15
@javirln javirln changed the title Feat/otel instrumentation chore: instrument controlplane and cas with OpenTelemetry tracing Apr 30, 2026
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 98 files

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritises the most important files to review.

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="app/controlplane/pkg/biz/casbackend_checker.go">

<violation number="1" location="app/controlplane/pkg/biz/casbackend_checker.go:67">
P2: The new span wraps the entire checker lifecycle instead of each backend-check run, creating a long-lived parent span and coupling sampling across all periodic executions.</violation>
</file>

<file name="app/controlplane/internal/server/otel.go">

<violation number="1" location="app/controlplane/internal/server/otel.go:78">
P1: `sampling_ratio=0` is handled as `AlwaysSample`, so a config intended to disable tracing instead samples every trace.</violation>
</file>

<file name="app/controlplane/internal/conf/controlplane/config/v1/conf.proto">

<violation number="1" location="app/controlplane/internal/conf/controlplane/config/v1/conf.proto:66">
P2: `sampling_ratio` should use explicit presence (`optional`) to distinguish unset from an explicit `0.0` value.</violation>
</file>

<file name="app/artifact-cas/internal/server/otel.go">

<violation number="1" location="app/artifact-cas/internal/server/otel.go:77">
P2: Out-of-range `sampling_ratio` values are silently mapped to `AlwaysSample`, which can unexpectedly enable 100% tracing under misconfiguration.</violation>
</file>

<file name="app/controlplane/pkg/data/data.go">

<violation number="1" location="app/controlplane/pkg/data/data.go:129">
P2: `min_open_conns` is being mapped to `SetMaxIdleConns`, which does not enforce a minimum open pool size and changes config semantics.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread app/controlplane/internal/server/otel.go Outdated
Comment thread app/controlplane/pkg/biz/casbackend_checker.go Outdated
Comment thread app/controlplane/internal/conf/controlplane/config/v1/conf.proto Outdated
Comment thread app/artifact-cas/internal/server/otel.go
log.Infof("DB: setting min open conns: %v", n)
poolConfig.MinConns = c.MinOpenConns
// database/sql doesn't have MinOpenConns, but MaxIdleConns serves a similar purpose
db.SetMaxIdleConns(int(n))
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: min_open_conns is being mapped to SetMaxIdleConns, which does not enforce a minimum open pool size and changes config semantics.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At app/controlplane/pkg/data/data.go, line 129:

<comment>`min_open_conns` is being mapped to `SetMaxIdleConns`, which does not enforce a minimum open pool size and changes config semantics.</comment>

<file context>
@@ -95,38 +96,44 @@ func NewData(c *config.DatabaseConfig, logger log.Logger) (*Data, func(), error)
 		log.Infof("DB: setting min open conns: %v", n)
-		poolConfig.MinConns = c.MinOpenConns
+		// database/sql doesn't have MinOpenConns, but MaxIdleConns serves a similar purpose
+		db.SetMaxIdleConns(int(n))
 	}
 
</file context>
Suggested change
db.SetMaxIdleConns(int(n))
log.Warnf("DB: min_open_conns is not supported by database/sql; ignoring value: %v", n)
Fix with Cubic

Comment thread app/controlplane/pkg/data/apitoken.go
jiparis
jiparis previously approved these changes Apr 30, 2026
Copy link
Copy Markdown
Member

@jiparis jiparis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. thanks!

Adopt the platform's LayeredTracer design: lazy tracer resolution via
global provider, per-layer disable support via SetDisabledLayers, and
TraceCarrier for async context propagation. The LayeredTracer no longer
embeds trace.Tracer and instead resolves it at span creation time.

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/otelx/otelx_test.go">

<violation number="1" location="pkg/otelx/otelx_test.go:67">
P2: Do not reset the global OpenTelemetry tracer provider to nil in test cleanup; it can cause panics in subsequent tracer creation.</violation>
</file>

<file name="pkg/otelx/otelx.go">

<violation number="1" location="pkg/otelx/otelx.go:102">
P1: Returning `trace.SpanFromContext(ctx)` for disabled layers can end the parent span when callers `defer span.End()`. Return an explicit noop span instead.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread pkg/otelx/otelx.go Outdated
Comment thread pkg/otelx/otelx_test.go Outdated
javirln added 2 commits April 30, 2026 14:38
- Fix sampling_ratio=0 being treated as AlwaysSample by using optional
  proto field and explicit 0/1 boundary handling with NeverSample
- Fix disabled layer returning parent span from context (could end
  parent on defer); use pre-allocated noop span instead
- Move CASBackendChecker span from Start (long-lived loop) to
  checkBackends (per-iteration) for correct span lifecycle
- Fix test cleanup setting global TracerProvider to nil
- Clarify min_open_conns mapping to MaxIdleConns in log message

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
In client-streaming gRPC, Send can return EOF when the server has
already terminated the stream with an error. Ignore Send errors in
error-path subtests since the assertion happens on CloseAndRecv.

Signed-off-by: Javier Rodriguez <javier@chainloop.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants