Observability Platform — Traces
This post is part of a series of blogs about setting up an Observability platform for an organization. This series includes the details of observability platform components, architecture, and the tool stack used to build the platform.
- Observability Platform — Introduction
- Observability Platform — Components and tools
- Observability Platform — Metrics (Prometheus & Grafana)
- Observability Platform — Logs
- Observability Platform — Traces
Traces
Collecting traces from services is an essential part of setting up an observability platform.
To collect traces from applications we can use OpenTelemetry, and to store, query, and visualise them we can use Grafana Tempo together with Grafana.
OpenTelemetry:
OpenTelemetry is an open-source project that provides a set of APIs, libraries, agents, and instrumentation to enable observability for cloud-native software. It allows developers to collect, process, and export telemetry data such as metrics, traces, and logs from applications and services.
The project aims to standardise and simplify observability in distributed systems by providing a vendor-neutral, community-driven framework. It supports multiple programming languages and frameworks, making it versatile for various environments.
Below are the steps involved in collecting telemetry data and using it for monitoring:
- Install and configure the OpenTelemetry Collector
- Instrument your applications
- Export telemetry data to the Backend
- Deploy and monitor
Install and Configure the OpenTelemetry Collector
Here we can use the OpenTelemetry Operator Helm chart to deploy the OpenTelemetry Collector and auto-instrumentation. The OpenTelemetry Operator is a Kubernetes operator that manages OpenTelemetry Collectors and auto-instrumentation of workloads.
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator open-telemetry/opentelemetry-operator -n ingress-basic -f values.yaml
You can refer to the values.yaml file for more details about the configuration. You can use the default values.yaml file if you don't have custom configurations. In this case, you can use an automatically generated self-signed certificate by setting the values below; Helm will create a self-signed cert and a secret for you.
admissionWebhooks.certManager.enabled: false
admissionWebhooks.autoGenerateCert.enabled: true
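In values.yaml form, these two settings translate to the following snippet (a minimal sketch; all other chart values stay at their defaults):
admissionWebhooks:
  certManager:
    enabled: false
  autoGenerateCert:
    enabled: true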
Once the OpenTelemetry Operator is deployed, we need to deploy the OpenTelemetry Collector. For this, we can use the config file below:
# kubectl apply -f opentelemetry-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: deployment
  nodeSelector:
    agentpool: appmonitor
  resources: {}
  targetAllocator: {}
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      batch:
        send_batch_size: 10000
        timeout: 10s
    exporters:
      logging:
        loglevel: debug
      otlp:
        headers:
          x-scope-orgid: TEMPO_TENANT_VALUE
        endpoint: TEMPO_ENDPOINT_VALUE
      prometheusremotewrite:
        endpoint: OTL_PROMETHEUS_ENDPOINT_VALUE
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, otlp]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, otlp]
We will discuss TEMPO_TENANT_VALUE and TEMPO_ENDPOINT_VALUE in detail below. 'OTL_PROMETHEUS_ENDPOINT_VALUE' is the Prometheus remote-write endpoint where the OpenTelemetry Collector can write metrics, for example: http://<prometheus-deployment-name>.<namespace>.svc.cluster.local:9090/api/v1/write
Enabling auto-instrumentation:
To enable auto-instrumentation, we can deploy an Instrumentation custom resource using the config below. The operator can inject and configure OpenTelemetry auto-instrumentation libraries. Currently, Apache HTTPD, DotNet, Go, Java, Nginx, NodeJS and Python are supported.
kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  python:
    env:
      # Required if endpoint is set to 4317.
      # Python autoinstrumentation uses http/proto by default
      # so data must be sent to 4318 instead of 4317.
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector:4318
  dotnet:
    env:
      # Required if endpoint is set to 4317.
      # Dotnet autoinstrumentation uses http/proto by default
      # See https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/blob/888e2cd216c77d12e56b54ee91dafbc4e7452a52/docs/config.md#otlp
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector:4318
  go:
    env:
      # Required if endpoint is set to 4317.
      # Go autoinstrumentation uses http/proto by default
      # so data must be sent to 4318 instead of 4317.
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector:4318
EOF
Then add an annotation to enable injection. The annotation can be added to a namespace, so that all pods within that namespace get instrumented, or to individual PodSpec objects, available as part of Deployment, StatefulSet, and other resources; a sample Deployment is shown after the annotation list below.
# Java:
instrumentation.opentelemetry.io/inject-java: "true"
# NodeJS:
instrumentation.opentelemetry.io/inject-nodejs: "true"
#Python:
instrumentation.opentelemetry.io/inject-python: "true"
# .NET
instrumentation.opentelemetry.io/inject-dotnet: "true"
instrumentation.opentelemetry.io/otel-dotnet-auto-runtime: "linux-x64" # for Linux glibc based images, this is default value and can be omitted
instrumentation.opentelemetry.io/otel-dotnet-auto-runtime: "linux-musl-x64" # for Linux musl based images
# GO
instrumentation.opentelemetry.io/inject-go: "true"
instrumentation.opentelemetry.io/otel-go-auto-target-exe: "/path/to/container/executable"
# Apache HTTPD
instrumentation.opentelemetry.io/inject-apache-httpd: "true"
# NGINX
instrumentation.opentelemetry.io/inject-nginx: "true"
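For example, here is a minimal sketch of a Deployment with the Java injection annotation on the pod template (the application name orders-api and the image are hypothetical placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
      annotations:
        # Ask the operator to inject the Java auto-instrumentation agent into this pod
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: orders-api
          image: example.azurecr.io/orders-api:1.0.0 # placeholder image
When pods are created from this template, the operator's admission webhook injects the agent and wires up the exporter endpoint, propagators, and sampler from the Instrumentation resource.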
.NET auto-instrumentation also honours an annotation that sets the .NET Runtime Identifier (RID). Currently, only two RIDs are supported: linux-x64 and linux-musl-x64. By default, linux-x64 is used.
Go auto-instrumentation also honours an annotation that will be used to set the OTEL_GO_AUTO_TARGET_EXE env var. This env var can also be set via the Instrumentation resource, with the annotation taking precedence. Since Go auto-instrumentation requires OTEL_GO_AUTO_TARGET_EXE to be set, you must supply a valid executable path via the annotation or the Instrumentation resource. Failure to set this value causes the instrumentation injection to abort, leaving the original pod unchanged.
Go auto-instrumentation also requires elevated permissions. The below permissions are set automatically and are required.
securityContext:
  privileged: true
  runAsUser: 0
Grafana Tempo:
Grafana Tempo is a distributed tracing backend, part of the broader Grafana observability stack. Tempo focuses on providing a highly scalable and cost-effective solution for storing and querying traces. By integrating with object storage (such as Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage), Tempo can store large amounts of trace data at a lower cost than solutions requiring more expensive block storage or databases.
We deploy Tempo using the tempo-distributed Helm chart in the cluster where Grafana is deployed.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install tempo grafana/tempo-distributed -f values.yaml -n app-monitoring
Storage:
In the values.yaml file we need to define where the trace data should be stored. The configuration below uses Azure Blob Storage as the backend.
storage:
  trace:
    backend: azure
    azure:
      container_name: "traces"
      storage_account_name: "<azure storage account name>"
      storage_account_key: "<azure storage account key>"
    blocklist_poll_tenant_index_builders: 1
    blocklist_poll_jitter_ms: 500
  # Settings for the Admin client storage backend and buckets. Only valid if enterprise.enabled is true.
  admin:
    # -- The supported storage backends are gcs, s3 and azure, as specified in https://grafana.com/docs/enterprise-traces/latest/config/reference/#admin_client_config
    backend: azure
Traces Receiver:
The configuration below enables Tempo to receive traces from the OpenTelemetry Collector:
traces:
  otlp:
    http:
      # -- Enable Tempo to ingest Open Telemetry HTTP traces
      enabled: true
      # -- HTTP receiver advanced config
      receiverConfig: {}
    grpc:
      # -- Enable Tempo to ingest Open Telemetry GRPC traces
      enabled: true
      # -- GRPC receiver advanced config
      receiverConfig: {}
Configure Ingress and Basic Auth:
We need to configure ingress and basic authentication on the Tempo gateway to enable external access.
gateway:
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations: {}
    hosts:
      - host: traces.domain.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: app-monitoring-tls
        hosts:
          - traces.domain.com
  basicAuth:
    enabled: true
    username: <username>
    password: <password>
    htpasswd: >-
      {{ htpasswd (required "'gateway.basicAuth.username' is required" .Values.gateway.basicAuth.username) (required "'gateway.basicAuth.password' is required" .Values.gateway.basicAuth.password) }}
    existingSecret: null
In the OpenTelemetry Collector configuration, we use the hostname defined above (traces.domain.com) as the 'TEMPO_ENDPOINT_VALUE'.
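As a rough sketch, assuming the ingress host above and a hypothetical tenant name my-app-cluster, the collector's otlp exporter section would then become:
exporters:
  otlp:
    # TEMPO_ENDPOINT_VALUE: the Tempo gateway ingress host (OTLP gRPC over TLS)
    endpoint: traces.domain.com:443
    headers:
      # TEMPO_TENANT_VALUE: identifies this cluster/application in Tempo
      x-scope-orgid: my-app-cluster
If the gateway enforces basic auth, the collector also needs matching credentials configured (not shown here).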
Now we need to add Tempo as a data source in Grafana.
Fill in the ingress domain in the URL field and the credentials in the Basic Auth section.
Then fill in the header details: add the header X-Scope-OrgID with 'TEMPO_TENANT_VALUE' as its value. This helps us identify the traces coming from each application, since traces from multiple application clusters are sent to the same Tempo instance in the Grafana cluster.
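If you prefer to provision the data source as code instead of through the Grafana UI, a rough sketch (reusing the ingress host and the placeholders from above) looks like this:
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: https://traces.domain.com
    basicAuth: true
    basicAuthUser: <username>
    jsonData:
      # Forward the tenant header so traces from this cluster land under the right tenant
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      basicAuthPassword: <password>
      httpHeaderValue1: TEMPO_TENANT_VALUE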
Now we can explore the data source and build dashboards from it or analyse the traces to get more application insights.
We discussed logs in the previous part of this series, Observability Platform — Logs.
Thanks for taking the time to read this post. I hope you found it interesting and informative.