Preamble

Logs tell stories; metrics aggregate; traces show latency structure across services. When Celery workers, Kafka consumers, and HTTP handlers share a request’s fate, grep alone loses the thread. OpenTelemetry gives Python and Java a common propagation format so spans stitch into one timeline.


What OpenTelemetry achieves

OpenTelemetry is a vendor-neutral way to produce, propagate, and export telemetry—primarily traces, metrics, and logs—using one data model and one set of APIs/SDKs across languages. The same instrumentation can send data to Jaeger, Grafana Tempo, Honeycomb, Datadog, Google Cloud Trace, AWS X-Ray (often via a collector or vendor exporter), and so on. That separation matters: your code describes behavior; the backend stores and queries it.

At runtime, OTel solves correlation: every span shares a trace id; parent/child links show which work happened in service of which user action. Context propagation (W3C traceparent / tracestate, optional baggage) carries that identity across process boundaries so a Python gateway and a Java microservice appear as one trace, not two unrelated timelines.
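To make the carried identity concrete, here is the shape of a W3C traceparent header; the parser below is illustrative stdlib Python, not an OTel API:

```python
# Illustrative only: the W3C traceparent header that OTel propagators emit.
# Format: version - 16-byte trace id - 8-byte parent span id - flags (all hex).
def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,          # "00" in the current spec
        "trace_id": trace_id,        # 32 hex chars, shared by every span in the trace
        "parent_span_id": span_id,   # 16 hex chars, the caller's span
        "sampled": int(flags, 16) & 0x01 == 1,
    }

parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# parsed["trace_id"] is the 32-char id every service in the trace shares
```

The trace id is what makes the Python gateway and the Java service land in the same tree: both report spans carrying the same 32-character id.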


Problems without a common telemetry layer

Without something like OTel, teams typically get fragmented views: each language picks ad hoc libraries, each service logs to different fields, and there is no guaranteed link between an HTTP request, a queue message, and a database call. Debugging becomes manual stitching (“find the log line near 14:32:05”) instead of following a trace id.

Vendor lock-in is the other pain: rewriting all instrumentation when you change APM tools is expensive and risky. OTel pushes portable semantics (span names, attributes, resource metadata) so backends compete on query UX and cost, not on owning your instrumentation API.

Sampling and cost stay explicit: you decide head-based (at trace start) or tail-based (after completion, often in a collector) policies instead of each SDK inventing its own rules.


Spans users care about

Name spans after operations with meaning: “authorize payment,” “fetch user profile,” not internalCall2. Attributes carry stable keys—tenant, feature flag, queue name—so dashboards group sensibly.


Propagation through async and thread pools

Context must survive asyncio, executor services, and reactive stacks. Dropped context is the classic “two traces for one button click” bug. Verify propagation in tests where possible: swap in an in-memory exporter and assert the parent/child links.


Complementing logs and metrics

Structured logs with trace_id still matter: they bridge systems that are not fully instrumented yet. Metrics capture SLOs; traces explain which hop blew the budget.


Minimal reproducible trace (Python ↔ Java)

Why the code shape matters: the outgoing side must inject the current context into carriers (HTTP headers or gRPC metadata). The incoming side must extract that context and start a child span so the backend shows a single tree.

Python (manual inject/extract; frameworks often do this via middleware):

from opentelemetry import context, trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def call_java(base_url: str):
    with tracer.start_as_current_span("call_java_service") as span:
        span.set_attribute("peer.service", "java-edge")
        headers = {}
        inject(headers)  # writes traceparent (+ tracestate/baggage if configured)
        # requests.post(f"{base_url}/api/edge", headers=headers, timeout=5)

def handle_incoming_request(request_headers: dict):
    # extract() takes the carrier first; the default getter handles dict-likes
    ctx = extract(request_headers)
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("handle_incoming_http") as span:
            span.set_attribute("http.route", "/webhook")
            # business logic; further outgoing calls inject the same trace
    finally:
        context.detach(token)

Java (OpenTelemetry API; Spring Boot with Micrometer Tracing or the OTel starter usually extracts automatically):

import java.util.Map;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;

Tracer tracer = GlobalOpenTelemetry.get().getTracer("com.example.edge");

public void handle(Map<String, String> incomingHeaders, String tenant) {
    Context parent = GlobalOpenTelemetry.get().getPropagators()
        .getTextMapPropagator()
        .extract(Context.current(), incomingHeaders, MAP_GETTER);

    Span span = tracer.spanBuilder("handlePythonCaller")
        .setParent(parent)
        .startSpan();
    try (var scope = span.makeCurrent()) {
        span.setAttribute("tenant.id", tenant);
        // business logic; downstream clients inject the current context
    } finally {
        span.end();
    }
}

// MAP_GETTER: a TextMapGetter<Map<String, String>> implementing keys() and get()

Acceptance test: one integration test that fails if the child span is not linked under the HTTP/gRPC parent. Same discipline as pdb and the Java Debugger: Breakpoints That Teach: prove the tool path works on every PR.


Docker: wiring the OTLP pipeline

Containers should not hardcode a vendor URL in application code. Use environment variables the OpenTelemetry SDKs understand (OTEL_*), and point them at an OpenTelemetry Collector or a vendor’s OTLP endpoint.

Typical service container (gRPC OTLP to a collector on the same Docker network):

ENV OTEL_SERVICE_NAME=billing-api
ENV OTEL_RESOURCE_ATTRIBUTES=deployment.environment=staging,service.namespace=payments
ENV OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
ENV OTEL_EXPORTER_OTLP_PROTOCOL=grpc
# Optional: traces only to start
ENV OTEL_TRACES_EXPORTER=otlp

Collector as a sidecar or shared service (conceptual docker-compose fragment): your apps export OTLP; the collector receives, batches, samples, and exports to Jaeger, Tempo, S3, etc. That keeps TLS, retries, and secrets out of every language runtime.
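A minimal sketch of that fragment (image tag, service names, and ports are assumptions, not recommendations):

```yaml
# docker-compose fragment (illustrative): apps speak OTLP to one collector.
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
  billing-api:
    build: .
    environment:
      OTEL_SERVICE_NAME: billing-api
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
```

The collector’s own config (receivers, batch processor, exporters) lives in the mounted file, so rotating a vendor key or changing backends touches one file, not every service image.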

Design choices to document: one collector per host vs per cluster; 4317 (gRPC) vs 4318 (HTTP); whether the app sends directly to SaaS or always through a collector for PII scrubbing and tail sampling.


Terraform (and IaC): what you actually declare

Terraform rarely “installs OpenTelemetry” inside the JVM; it shapes the platform the SDKs talk to. Common patterns:

  • Endpoints and secrets: variables for OTEL_EXPORTER_OTLP_ENDPOINT, API keys (often as ECS task secrets, Kubernetes secrets, or Vault references), and TLS.
  • Collector deployment: helm_release for the OpenTelemetry Helm charts, or an ECS service / EC2 ASG running the collector image—receivers (OTLP), processors (batch, attributes, tail sampling), exporters (Tempo, S3, vendor).
  • Cloud-native tracing: AWS X-Ray sampling rules (aws_xray_sampling_rule), IAM for OTLP ingress, or managed offerings where Terraform provisions the destination and your apps only set exporter env vars.
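As a hedged sketch of the collector-deployment bullet (chart version and values are assumptions; the repository and chart name are the community OpenTelemetry Helm charts):

```hcl
# Illustrative: Terraform deploys the collector; apps only set OTEL_* env vars.
resource "helm_release" "otel_collector" {
  name       = "otel-collector"
  repository = "https://open-telemetry.github.io/opentelemetry-helm-charts"
  chart      = "opentelemetry-collector"
  namespace  = "observability"

  set {
    name  = "mode"
    value = "deployment"   # or "daemonset" for one collector per node
  }
}
```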

Why this belongs in system design: sampling, retention, and who pays for egress are operational contracts between services and the observability tier—same as SLIs and on-call runbooks.


Conclusion

Observability is queryable architecture. Polyglot Interop: HTTP and gRPC Between Python and Java treats contracts (including error shapes) as part of the public API; Architecture for Evolvable Services After Polyglot Expansion is where ADRs and README-driven decisions belong in a growing system.