Alerting and Rules
Kibana's alerting system monitors your data and triggers actions when specified conditions are met. It supersedes the older Elasticsearch Watcher feature with a more flexible, Kibana-native approach.
Alerting Architecture
┌───────────────────────────────────────────────────────────┐
│                           Kibana                          │
│                                                           │
│  ┌──────────┐     ┌───────────┐     ┌──────────────┐      │
│  │   Rule   │────→│ Condition │────→│   Actions    │      │
│  │(Schedule)│     │  (Check)  │     │ (Notify/Act) │      │
│  └──────────┘     └───────────┘     └──────────────┘      │
│       │                 │                  │              │
│  Runs every N     Evaluates query   Sends alert:          │
│  minutes/hours    against threshold   - Email             │
│                                       - Slack             │
│                                       - Webhook           │
│                                       - PagerDuty         │
│                                       - Jira              │
│                                       - Teams             │
└───────────────────────────────────────────────────────────┘
Key Concepts
| Term | Definition |
|---|---|
| Rule | A scheduled check that evaluates conditions |
| Rule type | What kind of check (threshold, query, anomaly, etc.) |
| Condition | The criteria that triggers an alert |
| Action | What happens when the condition is met |
| Connector | Integration with external service (Slack, email, etc.) |
| Alert | A specific instance of a triggered condition |
| Recovery | When a condition returns to normal |
| Muting | Temporarily silencing alerts |
| Snooze | Pausing a rule for a set duration |
Setting Up Connectors
Before creating rules, configure connectors for alert delivery.
Available Connectors
| Connector | Purpose | Setup Needed |
|---|---|---|
| Email | Send alert emails | SMTP server config |
| Slack | Post to Slack channels | Webhook URL |
| Microsoft Teams | Post to Teams channels | Webhook URL |
| PagerDuty | Trigger incidents | Integration key |
| Jira | Create/update tickets | URL + credentials |
| ServiceNow | Create incidents | Instance + credentials |
| Webhook | HTTP request to any URL | URL + auth |
| Server log | Write to Kibana logs | None (built-in) |
| Index | Write to Elasticsearch | Index name |
| Opsgenie | Alert management | API key |
| xMatters | Incident management | URL + credentials |
Creating a Slack Connector
- Go to Stack Management → Alerts and Insights → Connectors
- Click "Create connector"
- Select "Slack"
- Configure:
Name: Production Alerts Slack
Webhook URL: https://hooks.slack.com/services/T00/B00/xxxx
Test: Click "Test" tab
Message: "Test alert from Kibana"
Click "Run" → Verify message appears in Slack
Creating an Email Connector
- Create the connector in the UI with your SMTP details:
Name: Alert Emails
From: alerts@example.com
Service: Other
Host: smtp.example.com
Port: 587
Secure: true
Authentication:
  User: alerts@example.com
  Password: ********
- Alternatively, preconfigure the connector in
kibana.yml so it exists without any UI setup:
# kibana.yml
xpack.actions.preconfigured:
  alert-emails:
    name: Alert Emails
    actionTypeId: .email
    config:
      service: other
      from: alerts@example.com
      host: smtp.example.com
      port: 587
      secure: true
    secrets:
      user: alerts@example.com
      password: "********"
Creating a Webhook Connector
For custom integrations:
Name: Custom Alert Webhook
Method: POST
URL: https://api.example.com/alerts
Headers:
Content-Type: application/json
Authorization: Bearer your-api-token
Body (configured per rule action):
{
  "alert": "{{alertName}}",
  "status": "{{alertActionGroup}}",
  "message": "{{context.message}}",
  "timestamp": "{{date}}"
}
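Kibana renders these `{{...}}` placeholders with Mustache before the request is sent. A minimal sketch of that substitution (illustrative only, not Kibana's actual template engine; the sample values are fabricated):

```python
import json
import re

def render_template(template: str, variables: dict) -> str:
    """Replace {{dotted.path}} placeholders with values from a nested dict."""
    def lookup(match: re.Match) -> str:
        value = variables
        for key in match.group(1).strip().split("."):
            value = value[key]  # walk one level per dot, e.g. context -> message
        return str(value)
    return re.sub(r"\{\{([^}]+)\}\}", lookup, template)

body = '{"alert": "{{alertName}}", "message": "{{context.message}}"}'
variables = {
    "alertName": "High Error Rate",
    "context": {"message": "52 errors in the last 15 minutes"},
}
rendered = render_template(body, variables)
print(json.loads(rendered)["alert"])  # → High Error Rate
```

Because the substitution is plain text replacement, values containing quotes or newlines can break a JSON body; keeping webhook payload values simple avoids that.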
Rule Types
Elasticsearch Query Rule
Triggers when a query matches documents above a threshold.
Creating a Query Rule:
- Go to Stack Management → Rules (or Observability → Alerts → Manage Rules)
- Click "Create rule"
- Select "Elasticsearch query"
- Configure:
Name: High Error Rate
Tags: [production, errors, critical]
Schedule: Check every 5 minutes
Query type: KQL
Data view: filebeat-*
Query: level: "error" AND service.name: "payment-service"
Condition:
WHEN count() OVER all documents
IS ABOVE 50
FOR THE LAST 15 minutes
Actions:
When alert status changes:
Connector: Production Alerts Slack
Message: |
🚨 *High Error Rate Alert*
Service: payment-service
Error count: {{context.value}} in last 15 minutes
Threshold: 50
Link: {{context.link}}
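Under the hood, a rule like this reduces to a filtered count query run every interval. A sketch of the equivalent Elasticsearch query body (field names taken from the rule above; the hit count here is a stand-in for a real search response):

```python
# Equivalent of: level:"error" AND service.name:"payment-service",
# counted over the rule's 15-minute time window.
query_body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "error"}},
                {"term": {"service.name": "payment-service"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    "size": 0,               # only the hit count matters, not the documents
    "track_total_hits": True,
}

threshold = 50
total_hits = 52              # would come from the search response
print(total_hits > threshold)  # → True: the rule fires
```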
Index Threshold Rule
Monitors a metric in an index against a threshold:
Name: CPU Threshold Alert
Schedule: Every 1 minute
Index: metricbeat-*
Time field: @timestamp
Aggregation: max(system.cpu.total.pct)
Group by: host.name
Condition:
IS ABOVE 0.90 (90%)
FOR THE LAST 5 minutes
Actions (alert):
Slack: "⚠️ High CPU on {{context.group}}: {{context.value}}%"
Actions (recovery):
Slack: "✅ CPU recovered on {{context.group}}: {{context.value}}%"
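The alert/recovery pairing above behaves like a small state machine: the alert action fires when a group crosses the threshold, and the recovery action fires when it drops back. A sketch of that transition logic (illustrative, not Kibana's implementation):

```python
def evaluate(previous_state: str, value: float, threshold: float = 0.90):
    """Return (new_state, action) for one scheduled check of a single group."""
    if value > threshold:
        # Notify only on the transition into "active", not on every check.
        return "active", ("alert" if previous_state == "ok" else None)
    return "ok", ("recovery" if previous_state == "active" else None)

state = "ok"
for cpu in [0.45, 0.93, 0.95, 0.70]:
    state, action = evaluate(state, cpu)
    if action:
        print(f"{action}: cpu={cpu:.0%}")
# alert: cpu=93%
# recovery: cpu=70%
```

Note that 0.95 produces no second notification: the group is already active, so only the edge transitions generate actions.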
Log Threshold Rule (Observability)
Specialized for log monitoring:
Name: Error Spike Detection
Schedule: Every 1 minute
Data view: filebeat-*
Condition:
WHEN count of log entries
WITH level: "error"
IS MORE THAN 100
FOR THE LAST 5 minutes
Group by: service.name
Actions:
Connector: PagerDuty
Severity: Critical
Summary: "Error spike in {{context.group}}: {{context.value}} errors in 5 min"
Metric Threshold Rule (Observability)
Monitor infrastructure metrics:
Name: Disk Space Warning
Schedule: Every 5 minutes
Metric: system.filesystem.used.pct
Aggregation: max
Group by: host.name, system.filesystem.mount_point
Alert conditions:
Warning: IS ABOVE 80%
Critical: IS ABOVE 90%
Actions (Warning):
Email: "⚠️ Disk at {{context.value}}% on {{context.group}}"
Actions (Critical):
PagerDuty: Severity high
Slack: "🔴 CRITICAL: Disk at {{context.value}}% on {{context.group}}"
Anomaly Detection Rule (ML)
Trigger on ML anomaly score:
Name: Unusual Transaction Volume
Schedule: Every 15 minutes
ML job: transaction-volume-anomaly
Severity: Critical (score ≥ 75)
Condition:
WHEN anomaly score IS ABOVE 75
FOR job "transaction-volume-anomaly"
Actions:
Slack: |
🔍 *ML Anomaly Detected*
Job: transaction-volume-anomaly
Score: {{context.anomalyScore}}
Description: {{context.message}}
Action Variables
Actions support template variables for dynamic content:
Common Variables
| Variable | Description |
|---|---|
| {{alertId}} | Unique alert identifier |
| {{alertName}} | Rule name |
| {{spaceId}} | Kibana space ID |
| {{date}} | Timestamp of alert |
| {{alertActionGroup}} | Current status (e.g., "alert", "recovered") |
| {{context.value}} | The value that triggered the alert |
| {{context.conditions}} | Human-readable condition summary |
| {{context.group}} | Group-by value (e.g., host name) |
| {{context.message}} | Default alert message |
| {{context.link}} | Link to the alert in Kibana |
Using Variables in Messages
Slack message:
🚨 *Alert: {{alertName}}*
Status: {{alertActionGroup}}
Value: {{context.value}}
Threshold: {{context.conditions}}
Group: {{context.group}}
Time: {{date}}
<{{context.link}}|View in Kibana>
Email (HTML):
<h2>Alert: {{alertName}}</h2>
<table>
  <tr><td><b>Status:</b></td><td>{{alertActionGroup}}</td></tr>
  <tr><td><b>Value:</b></td><td>{{context.value}}</td></tr>
  <tr><td><b>Time:</b></td><td>{{date}}</td></tr>
</table>
<p><a href="{{context.link}}">View in Kibana</a></p>
Webhook JSON:
{
  "alert_name": "{{alertName}}",
  "status": "{{alertActionGroup}}",
  "value": "{{context.value}}",
  "group": "{{context.group}}",
  "timestamp": "{{date}}",
  "kibana_url": "{{context.link}}"
}
Managing Rules
Rule List
View all rules at Stack Management → Rules:
┌────────────────────────────────────────────────────────────┐
│ Rules                                        [Create rule] │
├────────────────────────────────────────────────────────────┤
│ Name              │ Status  │ Last run │ Next run │ Alerts │
│───────────────────│─────────│──────────│──────────│────────│
│ High Error Rate   │ Active  │ 2m ago   │ 3m       │ 2      │
│ CPU Threshold     │ OK      │ 1m ago   │ 0m       │ 0      │
│ Disk Space        │ Snoozed │ 5m ago   │ -        │ 1      │
│ Login Failures    │ Error   │ 30m ago  │ -        │ -      │
└────────────────────────────────────────────────────────────┘
Rule Statuses
| Status | Meaning |
|---|---|
| OK | Rule ran, no conditions met |
| Active | Conditions met, alerts firing |
| Error | Rule execution failed |
| Pending | Rule waiting for next check |
| Snoozed | Rule temporarily paused |
| Disabled | Rule turned off |
Snoozing a Rule
Temporarily pause alerts (useful during maintenance):
- Click rule name → "Snooze"
- Choose duration:
- 1 hour
- 8 hours
- 24 hours
- Custom duration
- Indefinitely
- Rule stops sending actions but continues evaluating
Muting Specific Alerts
Mute alerts for specific groups without stopping the entire rule:
Rule: CPU Threshold (grouped by host.name)
Active alerts:
✅ prod-web-01: 95% → Sending alerts
🔇 prod-db-01: 92% → Muted (under maintenance)
✅ prod-api-01: 88% → Sending alerts
- Open rule details
- Find the specific alert
- Click "Mute" on that alert
Editing a Rule
- Click rule name in the list
- Click "Edit"
- Modify conditions, actions, or schedule
- Click "Save"
Changes take effect on the next scheduled check.
Disabling a Rule
- Click rule name
- Click "Disable"
- Rule stops evaluating entirely
Alert History
Viewing Alert Events
- Open a rule
- Click "Alert history" tab
- See timeline of:
- When alerts fired
- When alerts recovered
- Action execution results
Event Log
The event log shows detailed execution history:
Timestamp            Event                     Status
2024-01-15 10:30:00  Rule executed             OK
2024-01-15 10:35:00  Rule executed             Active (2 alerts)
2024-01-15 10:35:01  Action: Slack             Success
2024-01-15 10:35:02  Action: Email             Success
2024-01-15 10:40:00  Rule executed             Active (1 alert)
2024-01-15 10:40:01  Action: Slack (recovery)  Success
2024-01-15 10:45:00  Rule executed             OK
Advanced Patterns
Tiered Alerting
Set up multiple severity levels:
Rule 1: Warning Level
Condition: error_rate > 5%
Action: Slack #monitoring channel
Schedule: Every 5 minutes
Rule 2: Critical Level
Condition: error_rate > 15%
Action: PagerDuty + Slack #incidents
Schedule: Every 1 minute
Rule 3: Emergency Level
Condition: error_rate > 50%
Action: PagerDuty (high urgency) + SMS + Slack #all-hands
Schedule: Every 30 seconds
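The routing logic of the three rules amounts to a severity ladder: each error rate maps to at most one tier. A sketch using the thresholds above:

```python
def severity(error_rate: float):
    """Map an error rate (0.0-1.0) to the alerting tier that should fire."""
    if error_rate > 0.50:
        return "emergency"   # PagerDuty high urgency + SMS + #all-hands
    if error_rate > 0.15:
        return "critical"    # PagerDuty + #incidents
    if error_rate > 0.05:
        return "warning"     # Slack #monitoring only
    return None              # below all thresholds: no alert

print(severity(0.08))  # → warning
print(severity(0.60))  # → emergency
```

Because each rule also has its own schedule, a worsening incident escalates naturally: the emergency rule's 30-second interval catches a spike long before the warning rule's next 5-minute check.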
Composite Alerts
Combine multiple signals:
Rule: Service Health Composite
Query: (response_time_p95 > 2000 OR error_rate > 10%)
AND service.name: "checkout-service"
This fires when EITHER condition is true for the checkout service.
Alert Deduplication
Prevent alert fatigue with action frequency:
Action settings:
Run action: Every 1 hour
(Sends at most 1 alert per hour even if condition remains true)
Alternatives:
- On status change only (alert/recovery)
- On each check interval
- Throttled (every N minutes/hours)
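Throttling is essentially a timestamp check in front of each send. A sketch of the "at most once per interval" behavior (illustrative, not Kibana's internal implementation):

```python
from datetime import datetime, timedelta

class ThrottledAction:
    """Send at most once per `interval`, even if the condition stays true."""
    def __init__(self, interval: timedelta):
        self.interval = interval
        self.last_sent = None

    def maybe_send(self, now: datetime) -> bool:
        if self.last_sent is None or now - self.last_sent >= self.interval:
            self.last_sent = now
            return True   # action executes
        return False      # suppressed by the throttle

action = ThrottledAction(timedelta(hours=1))
start = datetime(2024, 1, 15, 10, 0)
# Condition stays true across four checks at minutes 0, 5, 30, 65.
sent = [action.maybe_send(start + timedelta(minutes=m)) for m in (0, 5, 30, 65)]
print(sent)  # → [True, False, False, True]
```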
Maintenance Windows
Schedule planned maintenance to suppress alerts:
- Go to Stack Management → Maintenance Windows
- Click "Create"
- Configure:
Name: Weekend Deployment Window
Schedule: Every Saturday 02:00-06:00 UTC
Filter: Tags match "production"
Repeat: Weekly
During this window:
- Rules still evaluate
- Actions are suppressed
- Alerts are tracked but not sent
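At action time, the suppression decision is just a window check. A sketch for the weekly Saturday window above (UTC; illustrative):

```python
from datetime import datetime, timezone

def in_maintenance_window(now: datetime) -> bool:
    """Weekend Deployment Window: every Saturday 02:00-06:00 UTC."""
    now = now.astimezone(timezone.utc)
    return now.weekday() == 5 and 2 <= now.hour < 6  # Saturday is weekday 5

# During the window the rule still evaluates, but actions are skipped.
alert_time = datetime(2024, 1, 13, 3, 30, tzinfo=timezone.utc)  # a Saturday
print(in_maintenance_window(alert_time))  # → True
```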
Rules via API
Create a Rule
curl -X POST "localhost:5601/api/alerting/rule" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High Error Rate",
    "rule_type_id": ".es-query",
    "consumer": "alerts",
    "schedule": { "interval": "5m" },
    "params": {
      "searchType": "esQuery",
      "timeWindowSize": 15,
      "timeWindowUnit": "m",
      "threshold": [50],
      "thresholdComparator": ">",
      "size": 100,
      "esQuery": "{\"query\":{\"bool\":{\"filter\":[{\"term\":{\"level\":\"error\"}}]}}}",
      "index": ["filebeat-*"],
      "timeField": "@timestamp"
    },
    "actions": [
      {
        "id": "slack-connector-id",
        "group": "query matched",
        "params": {
          "message": "Error count: {{context.value}} in last 15 min"
        }
      }
    ],
    "tags": ["production", "errors"]
  }'
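The same payload can be built from a script before posting it to the API. A sketch using only the standard library (the connector ID and URL are placeholders; the commented-out `urllib` call shows where the request would actually be sent):

```python
import json

rule = {
    "name": "High Error Rate",
    "rule_type_id": ".es-query",
    "consumer": "alerts",
    "schedule": {"interval": "5m"},
    "params": {
        "searchType": "esQuery",
        "timeWindowSize": 15,
        "timeWindowUnit": "m",
        "threshold": [50],
        "thresholdComparator": ">",
        "size": 100,
        # esQuery is a JSON *string* inside the JSON payload, so it is
        # serialized separately here.
        "esQuery": json.dumps(
            {"query": {"bool": {"filter": [{"term": {"level": "error"}}]}}}
        ),
        "index": ["filebeat-*"],
        "timeField": "@timestamp",
    },
    "tags": ["production", "errors"],
}

body = json.dumps(rule)
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:5601/api/alerting/rule", data=body.encode(),
#     method="POST",
#     headers={"kbn-xsrf": "true", "Content-Type": "application/json"})
print(json.loads(body)["rule_type_id"])  # → .es-query
```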
List Rules
curl -X GET "localhost:5601/api/alerting/rules/_find?per_page=20&page=1" \
-H "kbn-xsrf: true"
Disable a Rule
curl -X POST "localhost:5601/api/alerting/rule/RULE_ID/_disable" \
-H "kbn-xsrf: true"
Get Alert History
curl -X GET "localhost:5601/api/alerting/rule/RULE_ID/_alert_summary" \
-H "kbn-xsrf: true"
Practical Examples
Example 1: Website Uptime Alert
Rule: Website Down
Schedule: Every 1 minute
Type: Elasticsearch query
Query (KQL):
monitor.status: "down" AND monitor.name: "production-website"
Condition:
WHEN count() IS ABOVE 0
FOR THE LAST 3 minutes
Actions (Alert):
PagerDuty: Critical incident
Slack: "🔴 Website is DOWN! Status check failed for {{context.group}}"
Actions (Recovery):
Slack: "✅ Website is back UP. Downtime: ~{{context.value}} minutes"
Example 2: Business Hours Only Alert
Rule: Slow API Response (Business Hours)
Schedule: Every 5 minutes
Type: Index threshold
Index: apm-*
Condition:
WHEN avg(transaction.duration.us) IS ABOVE 5000000
GROUP BY service.name
FOR THE LAST 10 minutes
Actions:
Connector: Slack
Run action: Only between 09:00-18:00 Mon-Fri
Message: |
⚠️ Slow API Response
Service: {{context.group}}
Avg response: {{context.value}}μs
Threshold: 5,000,000μs (5 seconds)
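The "business hours only" setting is a simple time gate in front of the action. A sketch (times assumed to be in the server's local zone):

```python
from datetime import datetime

def within_business_hours(now: datetime) -> bool:
    """Send only 09:00-18:00, Monday-Friday (weekday 0 = Monday)."""
    return now.weekday() < 5 and 9 <= now.hour < 18

print(within_business_hours(datetime(2024, 1, 15, 14, 0)))  # → True  (Mon 14:00)
print(within_business_hours(datetime(2024, 1, 14, 14, 0)))  # → False (Sunday)
```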
Example 3: Security Alert
Rule: Brute Force Detection
Schedule: Every 1 minute
Type: Elasticsearch query
Query (KQL):
event.action: "authentication_failed"
Condition:
WHEN count() IS ABOVE 20
GROUP BY source.ip
FOR THE LAST 5 minutes
Actions:
Webhook (to firewall API):
POST https://firewall.example.com/api/block
Body: { "ip": "{{context.group}}", "reason": "brute_force", "duration": "1h" }
Slack:
"🔒 Brute force detected from {{context.group}}: {{context.value}} failed attempts in 5 min"
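The grouped count is the core of this detection: bucket failed logins by `source.ip` over the sliding window and flag any bucket past the threshold. A sketch (the sample events are fabricated for illustration):

```python
from collections import Counter

THRESHOLD = 20

# (source.ip, event.action) pairs from the last 5 minutes (fabricated sample).
events = (
    [("203.0.113.7", "authentication_failed")] * 25
    + [("198.51.100.9", "authentication_failed")] * 3
)

# Count failures per source IP, then keep only IPs over the threshold.
failures = Counter(ip for ip, action in events if action == "authentication_failed")
offenders = [ip for ip, count in failures.items() if count > THRESHOLD]
print(offenders)  # → ['203.0.113.7']
```

Each offending IP becomes its own alert instance (one per `source.ip` group), which is what lets the webhook action block individual addresses rather than everything at once.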
Common Issues
Rule shows "Error" status
Causes:
- Connector misconfigured → Test connector independently
- Invalid query syntax → Verify query in Discover first
- Index doesn't exist → Check index pattern
- Insufficient permissions → Check user role has alerting privileges
Alerts not firing
Checklist:
- Rule is enabled (not disabled or snoozed)
- Time range in condition matches data availability
- Threshold is appropriate (not too high)
- Data view or index has recent data
- Rule schedule is running (check "Last run" timestamp)
Too many alerts (alert fatigue)
Solutions:
- Increase thresholds to reduce noise
- Use action frequency throttling (e.g., once per hour)
- Group by relevant field to consolidate
- Use maintenance windows for planned events
- Implement tiered alerting (warning → critical → emergency)
Actions failing
Debug steps:
- Check connector test (Stack Management → Connectors → Test)
- View execution log for error details
- Verify network connectivity (firewall, proxy)
- Check API keys and credentials haven't expired
- Review Kibana server logs for detailed errors
# Check Kibana logs for action errors
grep "action" /var/log/kibana/kibana.log | grep -i error
Summary
In this chapter, you learned:
- ✅ Alerting architecture: rules, conditions, actions, and connectors
- ✅ Setting up connectors for Slack, email, PagerDuty, and webhooks
- ✅ Creating different rule types (query, threshold, metric, ML)
- ✅ Using template variables for dynamic alert messages
- ✅ Managing rules: snoozing, muting, disabling
- ✅ Advanced patterns: tiered alerting, maintenance windows, deduplication
- ✅ Automating rule management via API
- ✅ Troubleshooting common alerting issues
Next: Securing your Kibana instance with users, roles, and spaces!