Observability Platform — Traces
This post is part of a series of blogs about setting up an Observability platform for an organization. This series includes the details of observability platform components, architecture, and the tool stack used to build the platform.
- Observability Platform — Introduction
- Observability Platform — Components and tools
- Observability Platform — Metrics (Prometheus & Grafana)
- Observability Platform — Logs
- Observability Platform — Traces
Traces
Collecting traces from services is an essential part of setting up an observability platform.
To collect traces from applications we can use OpenTelemetry, and to store, query, and visualise them we can use Grafana Tempo together with Grafana.
OpenTelemetry:
OpenTelemetry is an open-source project that provides a set of APIs, libraries, agents, and instrumentation to enable observability for cloud-native software. It allows developers to collect, process, and export telemetry data such as metrics, traces, and logs from applications and services.
The project aims to standardise and simplify observability in distributed systems by providing a vendor-neutral, community-driven framework. It supports multiple programming languages and frameworks, making it versatile for various environments.
Below are the steps involved in collecting telemetry data and using it for monitoring:
- Install and configure the OpenTelemetry Collector
- Instrument your applications
- Export telemetry data to the Backend
- Deploy and monitor
Install and Configure the OpenTelemetry Collector
Here we can use the OpenTelemetry Operator Helm chart to deploy the OpenTelemetry Collector and auto-instrumentation. The OpenTelemetry Operator is a Kubernetes operator that manages OpenTelemetry Collectors and auto-instrumentation of workloads.
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator open-telemetry/opentelemetry-operator -n ingress-basic -f values.yaml
You can refer to the values.yaml file for more details about the configuration. You can use the default values.yaml file if you don't have custom configurations. In this case, you can use an automatically generated self-signed certificate by setting the values below; Helm will create a self-signed cert and a secret for you.
admissionWebhooks.certManager.enabled: false
admissionWebhooks.autoGenerateCert.enabled: true
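In values.yaml form, these two settings translate to the following snippet (a minimal sketch; all other chart values stay at their defaults):
admissionWebhooks:
  certManager:
    enabled: false
  autoGenerateCert:
    enabled: true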
Once the OpenTelemetry Operator is deployed, we need to deploy the OpenTelemetry Collector. For this, we can use the config file below:
# kubectl apply -f opentelemetry-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: deployment
  nodeSelector:
    agentpool: appmonitor
  resources: {}
  targetAllocator: {}
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      batch:
        send_batch_size: 10000
        timeout: 10s
    exporters:
      logging:
        loglevel: debug
      otlp:
        headers:
          x-scope-orgid: TEMPO_TENANT_VALUE
        endpoint: TEMPO_ENDPOINT_VALUE
      prometheusremotewrite:
        endpoint: OTL_PROMETHEUS_ENDPOINT_VALUE
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, otlp]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, otlp]
We will discuss TEMPO_TENANT_VALUE and TEMPO_ENDPOINT_VALUE in detail below. 'OTL_PROMETHEUS_ENDPOINT_VALUE' is the Prometheus remote-write endpoint where the OpenTelemetry Collector can write metrics, for example: http://<prometheus-deployment-name>.<namespace>.svc.cluster.local:9090/api/v1/write
Enabling auto-instrumentation:
To enable auto-instrumentation, we can deploy an Instrumentation custom resource using the config below. The operator can inject and configure OpenTelemetry auto-instrumentation libraries. Currently, Apache HTTPD, DotNet, Go, Java, Nginx, NodeJS and Python are supported.
kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  python:
    env:
      # Required if endpoint is set to 4317.
      # Python autoinstrumentation uses http/proto by default
      # so data must be sent to 4318 instead of 4317.
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector:4318
  dotnet:
    env:
      # Required if endpoint is set to 4317.
      # Dotnet autoinstrumentation uses http/proto by default
      # See https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/blob/888e2cd216c77d12e56b54ee91dafbc4e7452a52/docs/config.md#otlp
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector:4318
  go:
    env:
      # Required if endpoint is set to 4317.
      # Go autoinstrumentation uses http/proto by default
      # so data must be sent to 4318 instead of 4317.
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector:4318
EOF
Then add an annotation to enable injection. The annotation can be added to a namespace, so that all pods within that namespace get instrumented, or to individual PodSpec objects, available as part of Deployment, StatefulSet, and other resources; a sample Deployment is shown after the annotation list below.
# Java:
instrumentation.opentelemetry.io/inject-java: "true"
# NodeJS:
instrumentation.opentelemetry.io/inject-nodejs: "true"
#Python:
instrumentation.opentelemetry.io/inject-python: "true"
# .NET
instrumentation.opentelemetry.io/inject-dotnet: "true"
instrumentation.opentelemetry.io/otel-dotnet-auto-runtime: "linux-x64" # for Linux glibc based images, this is default value and can be omitted
instrumentation.opentelemetry.io/otel-dotnet-auto-runtime: "linux-musl-x64" # for Linux musl based images
# GO
instrumentation.opentelemetry.io/inject-go: "true"
instrumentation.opentelemetry.io/otel-go-auto-target-exe: "/path/to/container/executable"
# Apache HTTPD
instrumentation.opentelemetry.io/inject-apache-httpd: "true"
# NGINX
instrumentation.opentelemetry.io/inject-nginx: "true"
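For example, here is a minimal sketch of a Deployment with the Java injection annotation on the pod template (the application name orders-api and the image are hypothetical placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
      annotations:
        # Ask the operator to inject the Java auto-instrumentation agent into this pod
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: orders-api
          image: example.azurecr.io/orders-api:1.0.0 # placeholder image
When pods are created from this template, the operator's admission webhook injects the agent and wires up the exporter endpoint, propagators, and sampler from the Instrumentation resource.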
.NET auto-instrumentation also honours an annotation that sets the .NET Runtime Identifier (RID). Currently, only two RIDs are supported: linux-x64 and linux-musl-x64. By default, linux-x64 is used.
Go auto-instrumentation also honours an annotation that will be used to set the OTEL_GO_AUTO_TARGET_EXE env var. This env var can also be set via the Instrumentation resource, with the annotation taking precedence. Since Go auto-instrumentation requires OTEL_GO_AUTO_TARGET_EXE to be set, you must supply a valid executable path via the annotation or the Instrumentation resource. Failure to set this value causes the instrumentation injection to abort, leaving the original pod unchanged.
Go auto-instrumentation also requires elevated permissions. The below permissions are set automatically and are required.
securityContext:
  privileged: true
  runAsUser: 0
Grafana Tempo:
Grafana Tempo is a distributed tracing backend, part of the broader Grafana observability stack. Tempo focuses on providing a highly scalable and cost-effective solution for storing and querying traces. By integrating with object storage (such as Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage), Tempo can store large amounts of trace data at a lower cost than solutions requiring more expensive block storage or databases.
We deploy Tempo using the tempo-distributed Helm chart in the cluster where Grafana is deployed.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install tempo grafana/tempo-distributed -f values.yaml -n app-monitoring
Storage:
In the values.yaml file we need to define where the trace data should be stored. The configuration below uses Azure Blob Storage as the backend.
storage:
  trace:
    backend: azure
    azure:
      container_name: "traces"
      storage_account_name: "<azure storage account name>"
      storage_account_key: "<azure storage account key>"
    blocklist_poll_tenant_index_builders: 1
    blocklist_poll_jitter_ms: 500
  # Settings for the Admin client storage backend and buckets. Only valid if enterprise.enabled is true.
  admin:
    # -- The supported storage backends are gcs, s3 and azure, as specified in https://grafana.com/docs/enterprise-traces/latest/config/reference/#admin_client_config
    backend: azure
Traces Receiver:
The configuration below enables Tempo to receive traces from the OpenTelemetry Collector:
traces:
  otlp:
    http:
      # -- Enable Tempo to ingest Open Telemetry HTTP traces
      enabled: true
      # -- HTTP receiver advanced config
      receiverConfig: {}
    grpc:
      # -- Enable Tempo to ingest Open Telemetry GRPC traces
      enabled: true
      # -- GRPC receiver advanced config
      receiverConfig: {}
Configure Ingress and Basic Auth:
We need to configure ingress and basic authentication on the Tempo gateway to enable external access.
gateway:
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations: {}
    hosts:
      - host: traces.domain.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: app-monitoring-tls
        hosts:
          - traces.domain.com
  basicAuth:
    enabled: true
    username: <username>
    password: <password>
    htpasswd: >-
      {{ htpasswd (required "'gateway.basicAuth.username' is required" .Values.gateway.basicAuth.username) (required "'gateway.basicAuth.password' is required" .Values.gateway.basicAuth.password) }}
    existingSecret: null
In the OpenTelemetry Collector configuration, we use the hostname defined above (traces.domain.com) as the 'TEMPO_ENDPOINT_VALUE'.
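As a rough sketch, assuming the ingress host above and a hypothetical tenant name my-app-cluster, the collector's otlp exporter section would then become:
exporters:
  otlp:
    # TEMPO_ENDPOINT_VALUE: the Tempo gateway ingress host (OTLP gRPC over TLS)
    endpoint: traces.domain.com:443
    headers:
      # TEMPO_TENANT_VALUE: identifies this cluster/application in Tempo
      x-scope-orgid: my-app-cluster
If the gateway enforces basic auth, the collector also needs matching credentials configured (not shown here).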
Now we need to add Tempo as a data source in Grafana.
Fill in the ingress domain in the URL field and the credentials in the Basic Auth section.
Then fill in the header details: add the header X-Scope-OrgID with 'TEMPO_TENANT_VALUE' as its value. This helps us identify the traces coming from each application, since traces from multiple application clusters are sent to the same Tempo instance in the Grafana cluster.
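If you prefer to provision the data source as code instead of through the Grafana UI, a rough sketch (reusing the ingress host and the placeholders from above) looks like this:
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: https://traces.domain.com
    basicAuth: true
    basicAuthUser: <username>
    jsonData:
      # Forward the tenant header so traces from this cluster land under the right tenant
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      basicAuthPassword: <password>
      httpHeaderValue1: TEMPO_TENANT_VALUE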
Now we can explore the data source and build dashboards from it or analyse the traces to get more application insights.
We discussed logs in the previous part of this series, Observability Platform — Logs.
Thanks for taking the time to read this post. I hope you found it interesting and informative.