Alertmanager
Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie.
Prometheus ───▶ Alertmanager ───▶ Receivers
 (Alerts)         (Routing)       (Email, PagerDuty)
Key Features
Deduplication
Before: 100 identical alerts from 10 instances
After: 1 consolidated alert
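A minimal sketch of the idea, assuming an alert's identity is its full label set (Alertmanager derives a fingerprint from the labels; the `fingerprint` and `deduplicate` helpers here are illustrative names, not Alertmanager APIs):

```python
from typing import Dict, Iterable, List, Tuple

Labels = Dict[str, str]


def fingerprint(labels: Labels) -> Tuple[Tuple[str, str], ...]:
    """Identity of an alert: its label set, independent of key order."""
    return tuple(sorted(labels.items()))


def deduplicate(alerts: Iterable[Labels]) -> List[Labels]:
    """Keep the first alert seen for each distinct label set."""
    seen = set()
    unique: List[Labels] = []
    for labels in alerts:
        key = fingerprint(labels)
        if key not in seen:
            seen.add(key)
            unique.append(labels)
    return unique


# 100 copies of the same alert (hypothetical labels) collapse into one entry.
incoming = [{"alertname": "HighCPU", "instance": "web-01"}] * 100
print(len(deduplicate(incoming)))  # prints 1
```

Alerts whose labels differ in any way (for example, a different `instance`) remain separate entries; combining those into one notification is the job of grouping.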
Grouping
Group: Database
├── MySQL-01: Connection timeout
├── MySQL-02: High CPU usage
└── PostgreSQL: Disk space low
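The grouping above can be sketched as bucketing alerts by the values of the labels named in `group_by`. This is a simplified illustration; the `group` label and the alert names are assumptions made up for the example, and `group_alerts` is not an Alertmanager function:

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Alert = Dict[str, str]


def group_alerts(
    alerts: Sequence[Alert], group_by: Sequence[str]
) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        # Alerts missing a group_by label fall into the empty-string bucket.
        key = tuple(alert.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "ConnectionTimeout", "instance": "MySQL-01", "group": "database"},
    {"alertname": "HighCPUUsage", "instance": "MySQL-02", "group": "database"},
    {"alertname": "DiskSpaceLow", "instance": "PostgreSQL", "group": "database"},
]

grouped = group_alerts(alerts, group_by=["group"])
for key, members in grouped.items():
    print(key, [a["instance"] for a in members])
```

All three alerts share the same `group` value, so they land in a single bucket and would be delivered as one notification.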
Routing
Alert Rules ──▶ Alert{labels} ──▶ Routes ──▶ Receivers ──▶ Notifications
                                    │
                                    ├─ Match Labels ──▶ Select Integration
                                    │
                                    └─ Silence Check ──▶ Suppress
Alert Routing Flow
Alerts flow through Alertmanager following this process:
- Label Evaluation: Alertmanager examines alert labels (e.g., severity=critical, team=backend)
- Route Matching: Routes compare alert labels against match conditions:
  - match: {severity="critical"} → matches critical alerts
  - match: {team="backend"} → matches backend team alerts
- Receiver Selection: Matched routes determine the receiver integration:
  - Email receiver for warnings
  - PagerDuty receiver for critical alerts
  - Slack receiver for general notifications
- Silence Checking: Before sending, Alertmanager checks for active silences that match the alert labels
Example Flow:
Alert: {severity="critical", service="web", team="backend"}
↓
Route: match {team="backend"} AND {severity="critical"}
↓ (matches)
Receiver: backend-pagerduty-integration
↓
Notification: PagerDuty incident created
Silence Example:
Alert: {severity="warning", instance="web-01"}
↓
Silence: match {instance="web-01"} (maintenance window)
↓ (silenced)
Result: Alert suppressed, no notification sent
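The silence check can be approximated as follows. This sketch assumes equality-only matchers and a silence that is active between its start and end times; the `Silence` class and `is_silenced` helper are hypothetical names for illustration only:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List


@dataclass
class Silence:
    matchers: Dict[str, str]  # label name -> required value (equality only)
    starts_at: datetime
    ends_at: datetime

    def is_active(self, now: datetime) -> bool:
        return self.starts_at <= now < self.ends_at

    def matches(self, labels: Dict[str, str]) -> bool:
        # Every matcher must be satisfied by the alert's labels.
        return all(labels.get(k) == v for k, v in self.matchers.items())


def is_silenced(labels: Dict[str, str], silences: List[Silence], now: datetime) -> bool:
    """True if any active silence matches the alert's labels."""
    return any(s.is_active(now) and s.matches(labels) for s in silences)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
maintenance = Silence(
    matchers={"instance": "web-01"},
    starts_at=now - timedelta(hours=1),
    ends_at=now + timedelta(hours=1),
)
alert = {"severity": "warning", "instance": "web-01"}
print(is_silenced(alert, [maintenance], now))  # prints True
```

Once the maintenance window ends, the same alert is no longer suppressed and would be routed normally.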
Configuration
Basic Structure
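As a rough illustration of the top-level layout, the sketch below models a configuration as a Python dictionary using the `global`, `route`, and `receivers` keys found in an Alertmanager configuration file. The concrete values and receiver names are assumptions chosen for the example, and `undefined_receivers` is a hypothetical helper that mimics one consistency check (every route must point at a defined receiver):

```python
from typing import Any, Dict, Iterator, List

# Hypothetical configuration mirroring the top-level keys of alertmanager.yml.
config: Dict[str, Any] = {
    "global": {"resolve_timeout": "5m"},
    "route": {
        "receiver": "default",
        "group_by": ["alertname"],
        "routes": [
            {"match": {"severity": "critical"}, "receiver": "pagerduty"},
        ],
    },
    "receivers": [
        {"name": "default"},
        {"name": "pagerduty"},
    ],
}


def referenced_receivers(route: Dict[str, Any]) -> Iterator[str]:
    """Yield every receiver name used in a route and its child routes."""
    if "receiver" in route:
        yield route["receiver"]
    for child in route.get("routes", []):
        yield from referenced_receivers(child)


def undefined_receivers(cfg: Dict[str, Any]) -> List[str]:
    """Receiver names that routes refer to but `receivers` does not define."""
    defined = {r["name"] for r in cfg.get("receivers", [])}
    return sorted(set(referenced_receivers(cfg["route"])) - defined)


print(undefined_receivers(config))  # prints []
```

A real deployment would keep this in YAML and rely on Alertmanager's own validation; the dictionary form is only meant to show how the pieces relate.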
Advanced Routing
Root Route
├── group_by: ['alertname']
├── receiver: 'default'
│
└── Child Route
    ├── match: { severity: 'critical' }
    ├── receiver: 'pagerduty'
    │
    └── Child Route
        ├── match: { team: 'database' }
        └── receiver: 'db-team'
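The tree above can be walked with a short recursive function. This is a simplified sketch: it assumes equality matchers only, takes the first matching child at each level, and ignores options such as `continue`. The `Route` class and `resolve_receiver` function are illustrative names, not part of Alertmanager:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Route:
    receiver: Optional[str] = None
    match: Dict[str, str] = field(default_factory=dict)
    routes: List["Route"] = field(default_factory=list)

    def matches(self, labels: Dict[str, str]) -> bool:
        # An empty match (the root route) accepts every alert.
        return all(labels.get(k) == v for k, v in self.match.items())


def resolve_receiver(
    route: Route, labels: Dict[str, str], inherited: Optional[str] = None
) -> Optional[str]:
    """Descend into the first matching child; otherwise stay at this node."""
    receiver = route.receiver or inherited
    for child in route.routes:
        if child.matches(labels):
            return resolve_receiver(child, labels, receiver)
    return receiver


root = Route(
    receiver="default",
    routes=[
        Route(
            match={"severity": "critical"},
            receiver="pagerduty",
            routes=[Route(match={"team": "database"}, receiver="db-team")],
        )
    ],
)

print(resolve_receiver(root, {"severity": "critical", "team": "database"}))  # db-team
print(resolve_receiver(root, {"severity": "critical", "team": "backend"}))   # pagerduty
print(resolve_receiver(root, {"severity": "warning"}))                       # default
```

The deepest matching route wins, and an alert that matches no child falls back to the receiver of the last route it did match.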
Receivers
PagerDuty
Silencing
Temporary Silences
Matchers:
├── alertname = "DatabaseDown"
└── instance = "db-01"
Duration: 2 hours
Comment: "Scheduled maintenance"
Created by: [email protected]
Silence via API
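One way to create a silence programmatically is to POST a JSON document to Alertmanager's v2 API at `/api/v2/silences`. The sketch below assumes that endpoint and its `matchers`, `startsAt`, `endsAt`, `createdBy`, and `comment` fields; the base URL, the `ops-team` author, and the helper names are placeholders for illustration:

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def build_silence(
    matchers: Dict[str, str],
    duration: timedelta,
    created_by: str,
    comment: str,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build a silence payload with equality matchers and RFC 3339 timestamps."""
    start = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": (start + duration).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(base_url: str, silence: Dict[str, Any]) -> Dict[str, Any]:
    """Send the silence to Alertmanager; not called here, since it needs a live server."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


payload = build_silence(
    matchers={"alertname": "DatabaseDown", "instance": "db-01"},
    duration=timedelta(hours=2),
    created_by="ops-team",  # placeholder author
    comment="Scheduled maintenance",
    now=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(payload, indent=2))
```

Against a running instance, `post_silence("http://localhost:9093", payload)` would submit the silence, assuming Alertmanager listens on its default port.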
Alert States
FIRING ───▶ ACTIVE ───▶ RESOLVED
  │           │           │
  │           │           └─▶ Notification sent
  │           └─▶ Grouped & routed
  └─▶ Deduplicated
Best Practices
Alert Design
- Actionable: Alerts should require human intervention
- Informative: Include context and resolution steps
- Appropriate: Critical alerts for immediate action only
Configuration
- Test routing: Use amtool to validate configurations
- Monitor Alertmanager: Watch for dropped alerts and delays
- Version control: Keep configurations in Git
Maintenance
- Regular review: Audit silences and inhibition rules
- Update integrations: Keep receiver configurations current
- Document procedures: Runbooks for common alert responses
Monitoring Alertmanager
Key Metrics
alertmanager_alerts{state="active"} # Active alerts
alertmanager_alerts_invalid_total # Invalid alerts received
alertmanager_notifications_total # Notification attempts, per integration
alertmanager_silences{state="active"} # Active silences
Health Checks
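Alertmanager exposes `/-/healthy` and `/-/ready` HTTP endpoints that return 200 when the process is up and ready to serve traffic. A small probe might look like the sketch below; the base URL is an assumption, and `http_status` and `check_alertmanager` are hypothetical helper names:

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:  # covers URLError, refused connections, and timeouts
        return 0


def check_alertmanager(
    base_url: str, fetch: Callable[[str], int] = http_status
) -> Dict[str, bool]:
    """Probe the health and readiness endpoints of one Alertmanager instance."""
    base = base_url.rstrip("/")
    return {
        "healthy": fetch(base + "/-/healthy") == 200,
        "ready": fetch(base + "/-/ready") == 200,
    }


# Offline demonstration with a stubbed fetch; omit `fetch` to probe a real server.
def fake_fetch(url: str) -> int:
    return 200 if url.endswith("/-/healthy") else 503


print(check_alertmanager("http://localhost:9093/", fetch=fake_fetch))
```

Passing the fetch function as a parameter keeps the probe testable without a network; in production the default `http_status` would be used.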