Alertmanager

Alertmanager handles alerts sent by client applications such as the Prometheus server. It deduplicates, groups, and routes them to the correct receiver integration such as email, PagerDuty, or Opsgenie, and also takes care of silencing and inhibition of alerts.

Prometheus ───▶ Alertmanager ───▶ Receivers
(Alerts)        (Routing)         (Email, PagerDuty)

Key Features

Deduplication

Before: 100 identical alerts from 10 instances
After:  1 consolidated alert

Grouping

Group: Database
├── MySQL-01:    Connection timeout
├── MySQL-02:    High CPU usage
└── PostgreSQL:  Disk space low
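
The grouping above could be produced by a route like the following (a sketch; the `team` label name and values are assumptions about how these hosts are labeled):

```yaml
route:
  # One notification per value of the 'team' label, e.g. team="database"
  group_by: ['team']
  group_wait: 30s        # collect alerts for a group before the first notification
  receiver: 'default'
```

All three database alerts then arrive as a single grouped notification instead of three separate ones.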

Routing

Alert Rules ──▶ Alert{labels} ──▶ Routes ──▶ Receivers ──▶ Notifications
                                      │
                                ┌─ Match Labels ──▶ Select Integration
                                │
                                └─ Silence Check ──▶ Suppress

Alert Routing Flow

Alerts flow through Alertmanager following this process:

  1. Label Evaluation: Alertmanager examines alert labels (e.g., severity=critical, team=backend)

  2. Route Matching: Routes compare alert labels against match conditions:

    • match: {severity="critical"} → matches critical alerts
    • match: {team="backend"} → matches backend team alerts
  3. Receiver Selection: Matched routes determine the receiver integration:

    • Email receiver for warnings
    • PagerDuty receiver for critical alerts
    • Slack receiver for general notifications
  4. Silence Checking: Before sending, Alertmanager checks for active silences that match the alert labels
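
Steps 2 and 3 above can be expressed as a routing tree. This sketch assumes receiver names like 'pagerduty-critical'; the receiver definitions themselves are omitted:

```yaml
route:
  receiver: 'slack-general'         # default: general notifications go to Slack
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'  # critical alerts page the on-call
  - match:
      severity: warning
    receiver: 'email-warnings'      # warnings go to email
```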

Example Flow:

Alert: {severity="critical", service="web", team="backend"}
       ↓
Route: match {team="backend"} AND {severity="critical"}
       ↓ (matches)
Receiver: backend-pagerduty-integration
       ↓
Notification: PagerDuty incident created

Silence Example:

Alert: {severity="warning", instance="web-01"}
       ↓
Silence: match {instance="web-01"} (maintenance window)
       ↓ (silenced)
Result: Alert suppressed, no notification sent
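
The maintenance-window silence above can also be created from the command line with amtool (the Alertmanager URL is an assumption):

```shell
# Silence every alert on web-01 for the 2-hour maintenance window
amtool silence add instance=web-01 \
  --duration=2h \
  --comment="maintenance window" \
  --alertmanager.url=http://alertmanager:9093
```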

Configuration

Basic Structure

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: '[email protected]'

route:
  group_by: ['alertname']   # batch alerts that share the same alertname
  group_wait: 10s           # wait before the first notification for a new group
  group_interval: 10s       # wait before notifying about new alerts in a group
  repeat_interval: 1h       # re-notify interval while alerts stay firing
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: '[email protected]'

Advanced Routing

Root Route
├── group_by: ['alertname']
├── receiver: 'default'
│
└── Child Route
    ├── match: { severity: 'critical' }
    ├── receiver: 'pagerduty'
    │
    └── Child Route
        ├── match: { team: 'database' }
        └── receiver: 'db-team'
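
The tree above corresponds to this configuration (receiver definitions omitted):

```yaml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
  - match:
      severity: 'critical'
    receiver: 'pagerduty'
    routes:
    - match:
        team: 'database'
      receiver: 'db-team'
```

The most specific matching child route wins, so a critical database alert reaches 'db-team' rather than 'pagerduty'.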

Receivers

Email

receivers:
- name: 'email'
  email_configs:
  - to: '[email protected]'
    from: '[email protected]'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager'
    auth_password: 'password'

PagerDuty

receivers:
- name: 'pagerduty'
  pagerduty_configs:
  - service_key: 'your-service-key'   # Events API v1; use routing_key for v2
    description: '{{ .CommonAnnotations.summary }}'

Silencing

Temporary Silences

Matchers:
├── alertname = "DatabaseDown"
└── instance = "db-01"

Duration: 2 hours
Comment: "Scheduled maintenance"
Created by: [email protected]

Silence via API

# Create silence
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighCPUUsage", "isRegex": false}
    ],
    "startsAt": "2024-01-01T00:00:00Z",
    "endsAt": "2024-01-01T02:00:00Z",
    "createdBy": "admin",
    "comment": "Scheduled maintenance"
  }'
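
Existing silences can be listed and expired early with amtool as well (the URL and silence ID are placeholders):

```shell
# List active silences
amtool silence query --alertmanager.url=http://alertmanager:9093

# Expire a silence before its endsAt time, by ID
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager:9093
```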

Alert States

RECEIVED ───▶ ACTIVE ───▶ RESOLVED
    │           │            │
    │           │            └─▶ Resolved notification (if send_resolved)
    │           └─▶ Grouped, routed & notified
    └─▶ Deduplicated

Best Practices

Alert Design

  • Actionable: Alerts should require human intervention
  • Informative: Include context and resolution steps
  • Appropriate: Critical alerts for immediate action only

Configuration

  • Test routing: Use amtool to validate configurations
  • Monitor Alertmanager: Watch for dropped alerts and delays
  • Version control: Keep configurations in Git
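
Routing changes can be tested offline before deploying: amtool resolves which receiver a given label set would reach (the config file name follows the example above):

```shell
# Print the routing tree from a config file
amtool config routes show --config.file=alertmanager.yml

# Which receiver would this label set reach?
amtool config routes test --config.file=alertmanager.yml severity=critical team=database
```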

Maintenance

  • Regular review: Audit silences and inhibition rules
  • Update integrations: Keep receiver configurations current
  • Document procedures: Runbooks for common alert responses

Monitoring Alertmanager

Key Metrics

alertmanager_alerts{state="active"}       # Currently active alerts (gauge)
alertmanager_alerts_invalid_total         # Invalid alerts received
alertmanager_notifications_total          # Notification attempts
alertmanager_silences{state="active"}     # Active silences (gauge)
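
These metrics can feed a Prometheus "meta" alert on Alertmanager itself. This sketch (rule name, threshold, and durations are assumptions) fires when notification deliveries fail:

```yaml
groups:
- name: alertmanager-meta
  rules:
  - alert: AlertmanagerNotificationsFailing
    expr: rate(alertmanager_notifications_failed_total[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Alertmanager is failing to deliver notifications"
```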

Health Checks

# Check Alertmanager health
curl http://alertmanager:9093/-/healthy

# Check readiness
curl http://alertmanager:9093/-/ready

# Validate configuration
amtool check-config alertmanager.yml
Last updated: 2024-01-01