prometheus-configuration

Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.

Free · risk: medium
Tags: prometheus, configuration, docker, kubernetes

The full skill

---
name: prometheus-configuration
description: Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
---

# Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## When to Use

- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery

## Prometheus Architecture

```
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
```

## Installation

### Kubernetes with Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Retention and PVC size are set on the Prometheus custom resource spec
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

### Docker Compose

```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

volumes:
  prometheus-data:
```

## Configuration File

**prometheus.yml:**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      # Strip the port so the instance label is just the hostname
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods with annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
```

**Reference:** See `assets/prometheus.yml.template`

## Scrape Configurations

### Static Targets

```yaml
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"
```

### File-based Service Discovery

```yaml
scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
```

**targets/production.json:**

```json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
```

### Kubernetes Service Discovery

```yaml
scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

**Reference:** See `references/scrape-configs.md`

## Recording Rules

Create pre-computed metrics for frequently queried expressions:

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```

**Reference:** See `references/recording-rules.md`

## Alert Rules

```yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
```

## Validation

```bash
# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'
```

**Reference:** See `scripts/validate-prometheus.sh`

## Best Practices

1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**

## Troubleshooting

**Check scrape targets:**

```bash
curl http://localhost:9090/api/v1/targets
```

**Check configuration:**

```bash
curl http://localhost:9090/api/v1/status/config
```

**Test query:**

```bash
curl 'http://localhost:9090/api/v1/query?query=up'
```

## Related Skills

- `grafana-dashboards` - For visualization
- `slo-implementation` - For SLO monitoring
- `distributed-tracing` - For request tracing
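
Beyond the syntax checks in the Validation section, alert rules can also be exercised against synthetic series with `promtool test rules`. The following is a minimal sketch, not part of the original skill: the file name `alert_rules_test.yml` is hypothetical, and the `instance` value is assumed from the `my-app` scrape job shown earlier. It checks that `ServiceDown` fires once `up` has been 0 for longer than the rule's `for: 1m` window.

```yaml
# alert_rules_test.yml (hypothetical file name, for illustration only)
rule_files:
  - /etc/prometheus/rules/alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m          # spacing between the synthetic samples below
    input_series:
      # Assumed target label set; adjust to match your real instances
      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
        values: "0 0 0 0 0"
    alert_rule_test:
      - eval_time: 2m     # past the 1m "for" window, so the alert should be firing
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1.example.com:9090
            exp_annotations:
              summary: "Service app1.example.com:9090 is down"
              description: "my-app has been down for more than 1 minute"
```

Running `promtool test rules alert_rules_test.yml` evaluates the rule file against these samples and reports a failure if the fired alerts differ from `exp_alerts`. Note that `external_labels` from `prometheus.yml` are not applied in unit tests unless the test file sets them itself.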