grafana-dashboards

Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.


The full skill

---
name: grafana-dashboards
description: Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
---

# Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

## Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

## When to Use

- Visualize Prometheus metrics
- Create custom dashboards
- Implement SLO dashboards
- Monitor infrastructure
- Track business KPIs

## Dashboard Design Principles

### 1. Hierarchy of Information

```
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers)      │
├─────────────────────────────────────┤
│ Key Trends (Time Series)            │
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps)  │
└─────────────────────────────────────┘
```

### 2. RED Method (Services)

- **Rate** - Requests per second
- **Errors** - Error rate
- **Duration** - Latency/response time

### 3. USE Method (Resources)

- **Utilization** - % time resource is busy
- **Saturation** - Queue length/wait time
- **Errors** - Error count

## Dashboard Structure

### API Monitoring Dashboard

```json
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [5], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "type": "query"
            }
          ]
        },
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
      }
    ]
  }
}
```

**Reference:** See `assets/api-dashboard.json`

## Panel Types

### 1. Stat Panel (Single Value)

```json
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 80, "color": "yellow" },
          { "value": 90, "color": "red" }
        ]
      }
    }
  }
}
```

### 2. Time Series Graph

```json
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "yaxes": [
    { "format": "percent", "max": 100, "min": 0 },
    { "format": "short" }
  ]
}
```

### 3. Table Panel

```json
{
  "type": "table",
  "title": "Service Status",
  "targets": [
    {
      "expr": "up",
      "format": "table",
      "instant": true
    }
  ],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": { "Time": true },
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}
```

### 4. Heatmap

```json
{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap"
    }
  ],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
```

## Variables

### Query Variables

```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}
```

### Use Variables in Queries

```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
```

## Alerts in Dashboards

```json
{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": { "params": [5], "type": "gt" },
        "operator": { "type": "and" },
        "query": { "params": ["A", "5m", "now"] },
        "reducer": { "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [{ "uid": "slack-channel" }]
  }
}
```

## Dashboard Provisioning

**dashboards.yml:**

```yaml
apiVersion: 1

providers:
  - name: "default"
    orgId: 1
    folder: "General"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
```

## Common Dashboard Patterns

### Infrastructure Dashboard

**Key Panels:**

- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status

**Reference:** See `assets/infrastructure-dashboard.json`

### Database Dashboard

**Key Panels:**

- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries

**Reference:** See `assets/database-dashboard.json`

### Application Dashboard

**Key Panels:**

- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length

## Best Practices

1. **Start with templates** (Grafana community dashboards)
2. **Use consistent naming** for panels and variables
3. **Group related metrics** in rows
4. **Set appropriate time ranges** (default: Last 6 hours)
5. **Use variables** for flexibility
6. **Add panel descriptions** for context
7. **Configure units** correctly
8. **Set meaningful thresholds** for colors
9. **Use consistent colors** across dashboards
10. **Test with different time ranges**

## Dashboard as Code

### Terraform Provisioning

```hcl
resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}
```

### Ansible Provisioning

```yaml
- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana
```

## Related Skills

- `prometheus-configuration` - For metric collection
- `slo-implementation` - For SLO dashboards
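
## Appendix: Generating Dashboard JSON with a Script

The dashboard JSON in this skill is hand-written, and its `gridPos` values have to be kept consistent with Grafana's 24-column grid. As a minimal sketch, that layout arithmetic can be delegated to a small script that builds the RED-method panels from this skill and prints the JSON. The helper names (`make_panel`, `layout`, `build_dashboard`) are hypothetical and belong to no Grafana library. The default panel type `timeseries` is an assumption for newer Grafana versions; pass `panel_type="graph"` to match the examples in this skill.

```python
import json

GRID_WIDTH = 24  # Grafana lays panels out on a 24-column grid


def make_panel(title, expr, legend="", panel_type="timeseries"):
    """Return a minimal panel dict with one Prometheus target (hypothetical helper)."""
    return {
        "title": title,
        "type": panel_type,
        "targets": [{"expr": expr, "legendFormat": legend}],
    }


def layout(panels, width=12, height=8):
    """Assign gridPos left to right, wrapping to a new row when a panel would overflow."""
    x = y = 0
    placed = []
    for panel in panels:
        if x + width > GRID_WIDTH:
            x, y = 0, y + height
        placed.append({**panel, "gridPos": {"x": x, "y": y, "w": width, "h": height}})
        x += width
    return placed


def build_dashboard(title, panels, tags=()):
    """Assemble the top-level dashboard model using the fields shown in this skill."""
    return {
        "title": title,
        "tags": list(tags),
        "timezone": "browser",
        "refresh": "30s",
        "panels": layout(panels),
    }


# RED method panels, using the same PromQL expressions as the API Monitoring example.
red_panels = [
    make_panel(
        "Request Rate",
        "sum(rate(http_requests_total[5m])) by (service)",
        "{{service}}",
    ),
    make_panel(
        "Error Rate %",
        '(sum(rate(http_requests_total{status=~"5.."}[5m]))'
        " / sum(rate(http_requests_total[5m]))) * 100",
        "Error Rate",
    ),
    make_panel(
        "P95 Latency",
        "histogram_quantile(0.95, "
        "sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
        "{{service}}",
    ),
]

dashboard = build_dashboard("API Monitoring", red_panels, tags=["api", "production"])
print(json.dumps(dashboard, indent=2))
```

Redirecting the output to a file under the provisioning path (for example `/etc/grafana/dashboards/`) is one way to feed it into the file provider shown in `dashboards.yml`. The script emits the bare dashboard model; the `{"dashboard": ...}` wrapper used in the API Monitoring example is, as far as I know, the shape expected by the HTTP API rather than by provisioned files, so check which one your deployment path requires.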