When OpenTelemetry Auto-Instrumentation Meets Python PEX, A Debugging Journey

Picture this: You’re tasked with implementing distributed tracing across your microservices. “Easy,” you think, “OpenTelemetry has auto-instrumentation!” Six hours later, you’re staring at empty trace dashboards wondering why your Python FastAPI service refuses to send a single span. This is that story.


The Setup

I started with a local minikube cluster to test OpenTelemetry auto-instrumentation. The plan:

  • Deploy Jaeger for trace storage/UI
  • Deploy OpenTelemetry Collector as a gateway
  • Use the OpenTelemetry Operator to auto-instrument Python apps
  • Watch traces flow without touching application code
# Start fresh
minikube start --memory=8192 --cpus=4
kubectl create namespace observability

# Install Jaeger
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.52.0/jaeger-operator.yaml
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: AllInOne
EOF

# Install OpenTelemetry Operator
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

Act 1: The Silent Treatment

Created the collector and instrumentation:

# collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
  namespace: observability
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      jaeger:
        endpoint: jaeger-collector.observability:14250
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [jaeger]
---
# instrumentation.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instrumentation
  namespace: default
spec:
  exporter:
    endpoint: http://otel-gateway-collector.observability:4318
  propagators:
    - tracecontext
    - baggage
  python:
    env:
      - name: OTEL_PYTHON_LOG_CORRELATION
        value: "true"

Deployed a test FastAPI app with the magic annotation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-app
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"
    spec:
      containers:
      - name: app
        image: myapp:latest
        ports:
        - containerPort: 8000

Port-forwarded to Jaeger UI:

kubectl port-forward -n observability svc/jaeger-query 16686:16686

Result? Nothing. Zero traces.


Act 2: The REPL Detective Work

Time to get my hands dirty. I exec’d into the pod:

kubectl exec -it deployment/fastapi-app -- bash

Test 1: Is auto-instrumentation even loaded?

$ python
>>> import sys
>>> 'sitecustomize' in sys.modules
True
>>> import sitecustomize
>>> print(sitecustomize.__file__)
/otel-auto-instrumentation-python/sitecustomize.py

Good! The operator injected its magic.

Test 2: Can we reach the collector?

>>> import requests
>>> endpoint = "http://otel-gateway-collector.observability:4318"
>>> r = requests.post(f"{endpoint}/v1/traces", 
...                   data=b"junk", 
...                   headers={"content-type": "application/x-protobuf"})
>>> r.status_code, r.text
(400, 'proto: illegal wireType 6')

Perfect! The collector is reachable and trying to parse our junk.

Test 3: Can we manually send a span?

>>> from opentelemetry import trace
>>> tracer = trace.get_tracer("manual-test")
>>> with tracer.start_as_current_span("test-span"):
...     print("Hello from manual span")
... 
>>> # Force flush
>>> trace.get_tracer_provider()._active_span_processor.force_flush()

Checked Jaeger… The manual span appeared!

So REPL can send traces, but the FastAPI server can’t? 🤔


Act 3: The Split-Brain Mystery

Let’s check what the actual server process sees:

# Check PID 1 environment
$ tr '\0' '\n' < /proc/1/environ | grep -E '^(PYTHONPATH|OTEL_)'
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-gateway-collector.observability:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=fastapi-app
_PEX_PYTHONPATH=/otel-auto-instrumentation-python

Hold on… _PEX_PYTHONPATH? Not PYTHONPATH?

# Check the actual process
$ ps aux | head -2
USER  PID  COMMAND
root    1  /usr/bin/python3.11 -sE /app/.bootstrap/pex/pex.py --python /usr/bin/python3.11 /app/service.pex

There’s the smoking gun! The app is packaged as a PEX with -sE flags:

  • -E: Ignores all PYTHON* environment variables
  • -s: Ignores user site directory

Act 4: Understanding the PEX Problem

PEX (Python EXecutable) creates hermetic Python environments. Here’s what’s happening:

  1. OpenTelemetry Operator sets PYTHONPATH=/otel-auto-instrumentation-python
  2. PEX launcher starts Python with -E flag, ignoring PYTHONPATH
  3. Python never loads /otel-auto-instrumentation-python/sitecustomize.py
  4. No auto-instrumentation happens
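
Steps 2–3 are easy to reproduce locally with nothing but the stdlib. A minimal sketch (no PEX required) showing that -E discards PYTHONPATH before site initialization ever sees it:

```python
import os
import subprocess
import sys

# Simulate the operator's injected environment.
env = dict(os.environ, PYTHONPATH="/otel-auto-instrumentation-python")
probe = "import sys; print('/otel-auto-instrumentation-python' in sys.path)"

# Normal startup: PYTHONPATH entries are added to sys.path.
normal = subprocess.run([sys.executable, "-c", probe],
                        env=env, capture_output=True, text=True)

# PEX-style startup with -E: all PYTHON* variables are ignored.
hermetic = subprocess.run([sys.executable, "-E", "-c", probe],
                          env=env, capture_output=True, text=True)

print(normal.stdout.strip())    # True  -> instrumentation dir visible
print(hermetic.stdout.strip())  # False -> instrumentation dir ignored
```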

But why does the REPL work? Because running the bare python command bypasses the PEX launcher entirely!

Let’s verify this theory:

# In REPL (works)
>>> import sys
>>> '/otel-auto-instrumentation-python' in sys.path
True

# Check server's sys.path
$ cat > check_path.py << EOF
import sys
import json
with open('/tmp/syspath.json', 'w') as f:
    json.dump(sys.path, f)
EOF

$ python /app/service.pex check_path.py
$ cat /tmp/syspath.json | jq
# Result: No /otel-auto-instrumentation-python!

The Solutions

Solution A: Rebuild PEX with Non-Hermetic Scripts (Clean)

# Original PEX build
pex -r requirements.txt -c gunicorn -o service.pex .

# Fixed PEX build
pex -r requirements.txt \
    -c gunicorn \
    --venv \
    --non-hermetic-venv-scripts \
    -o service.pex .

The --non-hermetic-venv-scripts flag creates venv scripts that respect environment variables.

Solution B: Runtime Wrapper (Quick & Dirty)

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        command: ["/bin/sh"]
        args:
          - -c
          - |
            export PYTHONPATH="/otel-auto-instrumentation-python"
            exec python /app/service.pex

Solution C: Why Other Approaches Fail

These don’t work:

  • PEX_EXTRA_SYS_PATH: Appends to sys.path after startup (too late for sitecustomize.py)
  • PEX_INHERIT_PATH: Still blocked by -E flag
  • Manual instrumentation: Defeats the whole “zero-code” purpose
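
The PEX_EXTRA_SYS_PATH failure mode is worth seeing concretely. This stdlib-only sketch mimics “append to sys.path after startup” and shows that a sitecustomize.py added that way never runs, because site initialization is already over:

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as hook_dir:
    # A stand-in for the operator's instrumentation hook.
    with open(os.path.join(hook_dir, "sitecustomize.py"), "w") as f:
        f.write("print('hook fired')\n")

    # Mimic PEX_EXTRA_SYS_PATH: the directory joins sys.path only at
    # runtime, after the automatic sitecustomize import already happened.
    late = (
        "import sys\n"
        f"sys.path.append({hook_dir!r})\n"
        "print('app running')\n"
    )
    result = subprocess.run([sys.executable, "-E", "-c", late],
                            capture_output=True, text=True)

print(result.stdout)  # 'app running' only -- the hook never fired
```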

Key Takeaways

1. The Quick Diagnostic

When OpenTelemetry auto-instrumentation seems broken:

# 1. Check if instrumentation is loaded in REPL
kubectl exec <pod> -- python -c "import sitecustomize; print('Loaded from:', sitecustomize.__file__)"

# 2. Check PID 1 environment
kubectl exec <pod> -- sh -c 'tr "\0" "\n" < /proc/1/environ | grep -E "^(PYTHONPATH|_PEX_)"'

# 3. Check the actual process command
kubectl exec <pod> -- ps aux | grep python

# 4. Test manual spans
kubectl exec -it <pod> -- python
>>> from opentelemetry import trace
>>> with trace.get_tracer("test").start_as_current_span("test"): pass
>>> trace.get_tracer_provider()._active_span_processor.force_flush()

2. Python Packaging Compatibility

| Packaging Method   | Auto-instrumentation Works? | Why?                             |
|--------------------|-----------------------------|----------------------------------|
| Plain Python/pip   | ✅ Yes                      | Respects PYTHONPATH              |
| Virtualenv         | ✅ Yes                      | Normal Python startup            |
| PEX (default)      | ❌ No                       | Hermetic mode ignores PYTHONPATH |
| PEX (non-hermetic) | ✅ Yes                      | Respects environment             |
| Zipapp             | ❌ Usually no               | Similar to PEX                   |

3. The Hidden Assumption

OpenTelemetry’s Python auto-instrumentation relies on a clever but fragile trick:

  • Set PYTHONPATH to include instrumentation
  • Python loads sitecustomize.py at startup
  • This hooks and instruments your code

Any packaging that breaks Python’s normal startup breaks this mechanism.
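
The trick itself takes nothing but the stdlib to demonstrate: drop a sitecustomize.py onto PYTHONPATH and it runs before any application code, which is exactly where the operator hooks in:

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as hook_dir:
    # A stand-in for /otel-auto-instrumentation-python/sitecustomize.py.
    with open(os.path.join(hook_dir, "sitecustomize.py"), "w") as f:
        f.write("print('instrumentation loaded')\n")

    # Normal startup imports sitecustomize before running any app code.
    out = subprocess.run(
        [sys.executable, "-c", "print('app code')"],
        env=dict(os.environ, PYTHONPATH=hook_dir),
        capture_output=True, text=True,
    ).stdout

print(out)  # 'instrumentation loaded' precedes 'app code'
```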