
service-mesh-observability

Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.

Free · risk: medium

Tags: servicemesh, observability, kubernetes, grpc


The full skill

---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---

# Service Mesh Observability

A guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

## When to Use This Skill

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

## Core Concepts

### 1. Three Pillars of Observability

```
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```

### 2. Golden Signals for Mesh

| Signal         | Description               | Alert Threshold   |
| -------------- | ------------------------- | ----------------- |
| **Latency**    | Request duration P50, P99 | P99 > 500ms       |
| **Traffic**    | Requests per second       | Anomaly detection |
| **Errors**     | 5xx error rate            | > 1%              |
| **Saturation** | Resource utilization      | > 80%             |

## Templates

### Template 1: Istio with Prometheus & Grafana

```yaml
# Prometheus scrape configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            # istio-telemetry is the Mixer-era service (Istio < 1.5);
            # on newer releases, scrape istiod and the sidecar proxies instead.
            regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```

### Template 2: Key Istio Metrics Queries

```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx), as a percentage
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency (milliseconds)
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections opened (cumulative counter)
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

# P99 request size (bytes)
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
```

### Template 3: Jaeger Distributed Tracing

```yaml
# Istio tracing configuration pointing at Jaeger
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # 100% in dev, lower in prod
        zipkin:
          # Assumes a Service named jaeger-collector fronts the Deployment below
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment (all-in-one image; intended for dev and testing)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775 # UDP: legacy zipkin.thrift (deprecated)
              protocol: UDP
            - containerPort: 6831 # UDP: jaeger.thrift compact
              protocol: UDP
            - containerPort: 6832 # UDP: jaeger.thrift binary
              protocol: UDP
            - containerPort: 5778 # HTTP: agent config and sampling
            - containerPort: 16686 # HTTP: query UI
            - containerPort: 14268 # HTTP: collector (jaeger.thrift)
            - containerPort: 14250 # gRPC: collector
            - containerPort: 9411 # HTTP: Zipkin-compatible collector
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```

### Template 4: Linkerd Viz Dashboard

```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard (blocks the terminal; append & to run it in the background)
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```

### Template 5: Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\", response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```

### Template 6: Kiali Service Mesh Visualization

```yaml
# Kiali installation (requires the Kiali operator)
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous # no login; use openid or token outside dev
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```

### Template 7: OpenTelemetry Integration

```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411
    processors:
      batch:
        timeout: 10s
    exporters:
      # Newer Collector releases dropped the jaeger exporter; there, use an
      # otlp exporter pointed at Jaeger's OTLP endpoint instead.
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry API with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel # must match an extensionProviders entry in meshConfig
      randomSamplingPercentage: 10
```

## Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        # Fires when more than 5% of requests to a service return 5xx
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"
        # Fires when P99 latency exceeds 1000 ms
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
              by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"
        # Assumes cert-manager issues the mesh certificates
        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```

## Best Practices

### Do's

- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate trace headers consistently across every hop
- **Set up alerts** - Cover the golden signals
- **Correlate metrics and traces** - Use exemplars
- **Retain strategically** - Use hot and cold storage tiers

### Don'ts

- **Don't over-sample** - Trace storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor what the observability stack itself costs
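
### Sketch: Forwarding Trace Headers

The "Use trace context" practice above depends on application code: sidecar proxies emit spans, but they can only join the inbound and outbound legs of a request into one trace if the service copies the trace headers from the request it received onto the requests it makes. The following is a minimal sketch of that copy step, assuming the B3 and W3C Trace Context headers that Istio's tracing documentation lists. The names `TRACE_HEADERS` and `extract_trace_headers` are hypothetical and not part of any mesh SDK.

```python
from typing import Dict, Mapping

# Headers assumed from Istio's tracing documentation (B3 and W3C Trace Context).
# Adjust the tuple to match the propagation format your tracer actually uses.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
)


def extract_trace_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Return only the trace-propagation headers found on an inbound request.

    Header names are matched case-insensitively and returned in lowercase.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    # Hypothetical inbound request headers, for illustration only.
    inbound = {
        "X-Request-Id": "req-1",
        "X-B3-TraceId": "463ac35c9f6413ad48485a3953bb6124",
        "Content-Type": "application/json",
    }
    # Merge the result into the headers of every outbound call made
    # while handling this request.
    print(extract_trace_headers(inbound))
```

In a real service this usually lives in HTTP client middleware rather than in each handler, and an OpenTelemetry SDK with auto-instrumentation can take over the job entirely.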