In a modern DevOps architecture, "it's working" isn't an answer—it's a temporary state. As an Architect, I’ve learned that the difference between a 2 AM emergency and a peaceful night's sleep is the quality of your observability stack. Today, we’re diving into the "Gold Standard" of monitoring: Prometheus and Grafana.
1. The Architecture of Observability
Before we look at charts, we need to understand the flow. Prometheus doesn't wait for your apps to send data; it pulls (scrapes) data from defined endpoints.
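- Node Exporter: Runs on each EC2 or bare-metal host and exposes hardware and OS metrics (CPU, memory, disk, network) over HTTP, on port 9100 by default.
- Prometheus Server: Scrapes the targets on a schedule, stores the samples in its time-series database, and answers PromQL queries.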
- Grafana: The visualization layer that turns raw numbers into actionable insights.
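To make the pull model concrete, here is a minimal Python sketch of what a scrape target looks like from the Prometheus side: an HTTP endpoint at /metrics that returns plain-text samples. The metric name, the sample values, and port 8000 are assumptions for illustration, and a real service would normally use an official client library rather than rendering the format by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter values, keyed by their label pairs.
SAMPLES = {
    (("status", "200"),): 1027,
    (("status", "500"),): 3,
}


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics, which is the request a Prometheus scrape sends."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_counter(
            "http_requests_total", "Total HTTP requests.", SAMPLES
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint; blocks until interrupted. Port is an assumption."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding localhost:8000 as a scrape target would let Prometheus pull these samples on its own schedule; the application never pushes anything.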
2. Configuring the Scrape Job
To get started, you need to define your targets in prometheus.yml. Here is a snippet of a job I recently configured for a cluster of self-hosted runners.
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100', '10.0.1.50:9100']
    relabel_configs:
      # Copy the host part of __address__ (port stripped) into the instance label.
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(?::\d+)?'
        replacement: '${1}'
3. The Four Golden Signals
When building your Grafana dashboard, don't just track "everything." Focus on the Four Golden Signals of SRE:
- Latency: The time it takes to service a request.
- Traffic: A measure of how much demand is being placed on your system.
- Errors: The rate of requests that fail, either explicitly or implicitly.
- Saturation: How "full" your service is (CPU usage, memory limits, etc.).
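As a starting point, each signal can be mapped to one PromQL expression. The sketch below keeps them in a Python dictionary and builds the URL for the Prometheus instant-query HTTP endpoint (/api/v1/query). The metric names http_requests_total and http_request_duration_seconds_bucket are assumptions about how your application is instrumented; node_cpu_seconds_total comes from Node Exporter. The server address in the usage note is a placeholder.

```python
from urllib.parse import urlencode

# One illustrative PromQL expression per golden signal.
# The HTTP metric names are assumed; adjust them to your instrumentation.
GOLDEN_SIGNAL_QUERIES = {
    # 95th percentile request latency from a histogram.
    "latency": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
    ),
    # Requests per second across all instances.
    "traffic": "sum(rate(http_requests_total[5m]))",
    # Fraction of requests that returned a 5xx status.
    "errors": (
        'sum(rate(http_requests_total{status=~"5.."}[5m])) '
        "/ sum(rate(http_requests_total[5m]))"
    ),
    # Fraction of CPU time that was not idle, per host.
    "saturation": (
        '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'
    ),
}


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params
```

For example, build_query_url("http://localhost:9090", GOLDEN_SIGNAL_QUERIES["traffic"]) yields a URL you can open with curl or paste into a browser. The same expressions can go straight into Grafana panels.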
4. Writing Your First PromQL Query
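Once data is flowing, you’ll use PromQL to select and aggregate it. For example, assuming your application exports an http_requests_total counter with a status label, this query returns the per-second rate of HTTP 5xx responses, averaged over the last 5 minutes: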
rate(http_requests_total{status=~"5.."}[5m])
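To see what rate() does with those samples, here is a simplified Python sketch: it sums the counter's increases across the window, treats any drop as a counter reset, and divides by the elapsed time. This is an approximation for intuition only. Prometheus also extrapolates toward the window boundaries, so its results can differ slightly. The sample values are made up.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None

    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (e.g. process restart): it started again from zero.
            increase += value
        else:
            increase += value - previous
        previous = value

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Hypothetical 5xx counter scraped once a minute over a 5-minute window.
window = [(0, 100), (60, 130), (120, 160), (180, 190), (240, 220), (300, 250)]
errors_per_second = simple_rate(window)  # 150 errors over 300 s
```

The reset handling is why rate() should be applied to raw counters before any aggregation such as sum().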
5. From Visualization to Alerting
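A dashboard is only useful if someone looks at it. For everything else, we use Alertmanager: Prometheus evaluates your alerting rules and forwards firing alerts to Alertmanager, which deduplicates, groups, and routes them. I prefer routing these to Slack or PagerDuty so that critical issues (like a Liquibase lock or a downed EMR node) reach the on-call engineer right away.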
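The routing idea can be sketched in a few lines of Python: match an alert's labels against an ordered list of routes and fall back to a default receiver. This only illustrates the concept. In practice the routes live in Alertmanager's YAML configuration, and the receiver names, label values, and alert names here are hypothetical.

```python
# Hypothetical receivers; real ones are defined in the Alertmanager config.
DEFAULT_RECEIVER = "slack-devops"

# Ordered routes: the first entry whose matchers all match wins.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-oncall"),
    ({"severity": "warning"}, "slack-devops"),
]


def pick_receiver(labels):
    """Return the receiver for an alert based on its labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


# A downed node pages the on-call; an unlabeled alert falls back to Slack.
page_target = pick_receiver({"alertname": "EmrNodeDown", "severity": "critical"})
chat_target = pick_receiver({"alertname": "DiskFillingUp"})
```

Keeping a severity label on every alerting rule is what makes this kind of split possible without a separate route for each alert name.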