Alerting and Rules

Kibana's alerting system monitors your data and triggers actions when specified conditions are met. It offers a more flexible, Kibana-native alternative to Elasticsearch's older Watcher feature.

Alerting Architecture

┌─────────────────────────────────────────────────────┐
│                    Kibana                            │
│                                                     │
│  ┌──────────┐    ┌──────────┐    ┌──────────────┐  │
│  │   Rule   │───→│ Condition │───→│   Actions    │  │
│  │(Schedule)│    │  (Check)  │    │ (Notify/Act) │  │
│  └──────────┘    └──────────┘    └──────────────┘  │
│       │                               │             │
│  Runs every N    Evaluates query      Sends alert:  │
│  minutes/hours   against threshold    - Email       │
│                                       - Slack       │
│                                       - Webhook     │
│                                       - PagerDuty   │
│                                       - Jira        │
│                                       - Teams       │
└─────────────────────────────────────────────────────┘

Key Concepts

Term        Definition
Rule        A scheduled check that evaluates conditions
Rule type   What kind of check (threshold, query, anomaly, etc.)
Condition   The criteria that triggers an alert
Action      What happens when the condition is met
Connector   Integration with an external service (Slack, email, etc.)
Alert       A specific instance of a triggered condition
Recovery    When a condition returns to normal
Muting      Temporarily silencing alerts
Snooze      Pausing a rule's notifications for a set duration (the rule keeps evaluating)

Setting Up Connectors

Before creating rules, configure connectors for alert delivery.

Available Connectors

Connector         Purpose                    Setup Needed
Email             Send alert emails          SMTP server config
Slack             Post to Slack channels     Webhook URL
Microsoft Teams   Post to Teams channels     Webhook URL
PagerDuty         Trigger incidents          Integration key
Jira              Create/update tickets      URL + credentials
ServiceNow        Create incidents           Instance + credentials
Webhook           HTTP request to any URL    URL + auth
Server log        Write to Kibana logs       None (built-in)
Index             Write to Elasticsearch     Index name
Opsgenie          Alert management           API key
xMatters          Incident management        URL + credentials

Creating a Slack Connector

  1. Go to Stack Management → Alerts and Insights → Connectors
  2. Click "Create connector"
  3. Select "Slack"
  4. Configure:
Name: Production Alerts Slack
Webhook URL: https://hooks.slack.com/services/T00/B00/xxxx

Test: Click "Test" tab
  Message: "Test alert from Kibana"
  Click "Run" → Verify message appears in Slack
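The same connector can be scripted. The sketch below assumes Kibana's connector API (`POST /api/actions/connector` with the `.slack` connector type) — the webhook URL is the placeholder from the example above. It sanity-checks the payload locally before the (commented) request is sent:

```shell
# Hypothetical API-based setup; the endpoint and field names assume Kibana's
# connector API, and the webhook URL is a placeholder.
payload='{
  "name": "Production Alerts Slack",
  "connector_type_id": ".slack",
  "secrets": { "webhookUrl": "https://hooks.slack.com/services/T00/B00/xxxx" }
}'

# Quick local check that the payload contains what we expect.
echo "$payload" | grep -q '"connector_type_id": ".slack"' && echo "payload ok"

# Then send it (uncomment once Kibana is reachable):
# curl -X POST "localhost:5601/api/actions/connector" \
#   -H "kbn-xsrf: true" -H "Content-Type: application/json" \
#   -d "$payload"
```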

Creating an Email Connector

  1. First, configure SMTP in kibana.yml:
# kibana.yml
xpack.actions.email:
  service: other
  host: smtp.example.com
  port: 587
  secure: true
  2. Create the connector:
Name: Alert Emails
From: alerts@example.com
Service: Other (configured in kibana.yml)
Host: smtp.example.com
Port: 587
Secure: true
Authentication:
  User: alerts@example.com
  Password: ********

Creating a Webhook Connector

For custom integrations:

Name: Custom Alert Webhook
Method: POST
URL: https://api.example.com/alerts
Headers:
  Content-Type: application/json
  Authorization: Bearer your-api-token

Body (configured per rule action):
{
  "alert": "{{alertName}}",
  "status": "{{alertActionGroup}}",
  "message": "{{context.message}}",
  "timestamp": "{{date}}"
}
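To preview what the rendered body will look like, you can substitute sample values for the mustache variables locally — sed here is only a stand-in for the substitution Kibana performs at action time:

```shell
# Render a webhook body template with sample values in place of the
# mustache variables (Kibana does this substitution when the action runs).
template='{"alert":"{{alertName}}","status":"{{alertActionGroup}}","message":"{{context.message}}"}'
echo "$template" \
  | sed -e 's/{{alertName}}/High Error Rate/' \
        -e 's/{{alertActionGroup}}/alert/' \
        -e 's/{{context.message}}/Error count over threshold/'
# → {"alert":"High Error Rate","status":"alert","message":"Error count over threshold"}
```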

Rule Types

Elasticsearch Query Rule

Triggers when a query matches documents above a threshold.

Creating a Query Rule:

  1. Go to Stack Management → Rules (or Observability → Alerts → Manage Rules)
  2. Click "Create rule"
  3. Select "Elasticsearch query"
  4. Configure:
Name: High Error Rate
Tags: [production, errors, critical]

Schedule: Check every 5 minutes

Query type: KQL
Data view: filebeat-*
Query: level: "error" AND service.name: "payment-service"

Condition:
  WHEN count() OVER all documents
  IS ABOVE 50
  FOR THE LAST 15 minutes

Actions:
  When alert status changes:
    Connector: Production Alerts Slack
    Message: |
      🚨 *High Error Rate Alert*
      Service: payment-service
      Error count: {{context.value}} in last 15 minutes
      Threshold: 50
      Link: {{context.link}}

Index Threshold Rule

Monitors a metric in an index against a threshold:

Name: CPU Threshold Alert
Schedule: Every 1 minute

Index: metricbeat-*
Time field: @timestamp
Aggregation: max(system.cpu.total.pct)
Group by: host.name

Condition:
  IS ABOVE 0.90 (90%)
  FOR THE LAST 5 minutes

Actions (alert):
  Slack: "⚠️ High CPU on {{context.group}}: {{context.value}}%"

Actions (recovery):
  Slack: "✅ CPU recovered on {{context.group}}: {{context.value}}%"
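The CPU rule above can also be created programmatically. This sketch assumes Kibana's `.index-threshold` rule type and its parameter names (`aggType`, `termField`, `thresholdComparator`, etc.); check your version's API reference before relying on them:

```shell
# Hypothetical API version of the CPU threshold rule; rule_type_id and
# param names assume Kibana's .index-threshold rule type.
payload='{
  "name": "CPU Threshold Alert",
  "rule_type_id": ".index-threshold",
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "params": {
    "index": ["metricbeat-*"],
    "timeField": "@timestamp",
    "aggType": "max",
    "aggField": "system.cpu.total.pct",
    "groupBy": "top",
    "termField": "host.name",
    "termSize": 10,
    "thresholdComparator": ">",
    "threshold": [0.9],
    "timeWindowSize": 5,
    "timeWindowUnit": "m"
  }
}'
echo "$payload" | grep -q '".index-threshold"' && echo "rule payload ok"
# curl -X POST "localhost:5601/api/alerting/rule" \
#   -H "kbn-xsrf: true" -H "Content-Type: application/json" -d "$payload"
```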

Log Threshold Rule (Observability)

Specialized for log monitoring:

Name: Error Spike Detection
Schedule: Every 1 minute

Data view: filebeat-*
Condition:
  WHEN count of log entries
  WITH level: "error"
  IS MORE THAN 100
  FOR THE LAST 5 minutes

Group by: service.name

Actions:
  Connector: PagerDuty
  Severity: Critical
  Summary: "Error spike in {{context.group}}: {{context.value}} errors in 5 min"

Metric Threshold Rule (Observability)

Monitor infrastructure metrics:

Name: Disk Space Warning
Schedule: Every 5 minutes

Metric: system.filesystem.used.pct
Aggregation: max
Group by: host.name, system.filesystem.mount_point

Alert conditions:
  Warning:  IS ABOVE 80%
  Critical: IS ABOVE 90%

Actions (Warning):
  Email: "⚠️ Disk at {{context.value}}% on {{context.group}}"

Actions (Critical):
  PagerDuty: Severity high
  Slack: "🔴 CRITICAL: Disk at {{context.value}}% on {{context.group}}"

Anomaly Detection Rule (ML)

Trigger on ML anomaly score:

Name: Unusual Transaction Volume
Schedule: Every 15 minutes

ML job: transaction-volume-anomaly
Severity: Critical (score ≥ 75)

Condition:
  WHEN anomaly score IS ABOVE 75
  FOR job "transaction-volume-anomaly"

Actions:
  Slack: |
    🔍 *ML Anomaly Detected*
    Job: transaction-volume-anomaly
    Score: {{context.anomalyScore}}
    Description: {{context.message}}

Action Variables

Actions support template variables for dynamic content:

Common Variables

{{alertId}}              Unique alert identifier
{{alertName}}            Rule name
{{spaceId}}              Kibana space ID
{{date}}                 Timestamp of alert
{{alertActionGroup}}     Current status (e.g., "alert", "recovered")
{{context.value}}        The value that triggered the alert
{{context.conditions}}   Human-readable condition summary
{{context.group}}        Group-by value (e.g., host name)
{{context.message}}      Default alert message
{{context.link}}         Link to the alert in Kibana

Using Variables in Messages

Slack message:

🚨 *Alert: {{alertName}}*

Status: {{alertActionGroup}}
Value: {{context.value}}
Threshold: {{context.conditions}}
Group: {{context.group}}
Time: {{date}}

<{{context.link}}|View in Kibana>

Email (HTML):

<h2>Alert: {{alertName}}</h2>
<table>
  <tr><td><b>Status:</b></td><td>{{alertActionGroup}}</td></tr>
  <tr><td><b>Value:</b></td><td>{{context.value}}</td></tr>
  <tr><td><b>Time:</b></td><td>{{date}}</td></tr>
</table>
<p><a href="{{context.link}}">View in Kibana</a></p>

Webhook JSON:

{
  "alert_name": "{{alertName}}",
  "status": "{{alertActionGroup}}",
  "value": "{{context.value}}",
  "group": "{{context.group}}",
  "timestamp": "{{date}}",
  "kibana_url": "{{context.link}}"
}

Managing Rules

Rule List

View all rules at Stack Management → Rules:

┌────────────────────────────────────────────────────────────┐
│  Rules                                [Create rule]         │
├────────────────────────────────────────────────────────────┤
│ Name              │ Status  │ Last run  │ Next run │ Alerts│
│───────────────────│─────────│───────────│──────────│───────│
│ High Error Rate   │ Active  │ 2m ago    │ 3m       │ 2     │
│ CPU Threshold     │ OK      │ 1m ago    │ 0m       │ 0     │
│ Disk Space        │ Snoozed │ 5m ago    │ -        │ 1     │
│ Login Failures    │ Error   │ 30m ago   │ -        │ -     │
└────────────────────────────────────────────────────────────┘

Rule Statuses

Status     Meaning
OK         Rule ran, no conditions met
Active     Conditions met, alerts firing
Error      Rule execution failed
Pending    Rule waiting for next check
Snoozed    Rule temporarily paused
Disabled   Rule turned off

Snoozing a Rule

Temporarily pause alerts (useful during maintenance):

  1. Click rule name → "Snooze"
  2. Choose duration:
    • 1 hour
    • 8 hours
    • 24 hours
    • Custom duration
    • Indefinitely
  3. Rule stops sending actions but continues evaluating
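Snoozing is primarily a UI feature. From the API, a similar effect is available by muting all of a rule's alerts — the `_mute_all`/`_unmute_all` endpoints are assumed from Kibana's alerting API, and RULE_ID is a placeholder:

```shell
# Dry-run sketch: compose the mute/unmute requests (endpoints assumed from
# Kibana's alerting API; RULE_ID stands in for a real rule id).
KIBANA="localhost:5601"
RULE_ID="RULE_ID"

echo "POST ${KIBANA}/api/alerting/rule/${RULE_ID}/_mute_all"     # silence all notifications
echo "POST ${KIBANA}/api/alerting/rule/${RULE_ID}/_unmute_all"   # restore notifications

# curl -X POST "${KIBANA}/api/alerting/rule/${RULE_ID}/_mute_all" -H "kbn-xsrf: true"
```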

Muting Specific Alerts

Mute alerts for specific groups without stopping the entire rule:

Rule: CPU Threshold (grouped by host.name)

Active alerts:
  ✅ prod-web-01: 95% → Sending alerts
  🔇 prod-db-01: 92%  → Muted (under maintenance)
  ✅ prod-api-01: 88% → Sending alerts

To mute a specific alert:

  1. Open rule details
  2. Find the specific alert
  3. Click "Mute" on that alert

Editing a Rule

  1. Click rule name in the list
  2. Click "Edit"
  3. Modify conditions, actions, or schedule
  4. Click "Save"

Changes take effect on the next scheduled check.

Disabling a Rule

  1. Click rule name
  2. Click "Disable"
  3. Rule stops evaluating entirely

Alert History

Viewing Alert Events

  1. Open a rule
  2. Click "Alert history" tab
  3. See timeline of:
    • When alerts fired
    • When alerts recovered
    • Action execution results

Event Log

The event log shows detailed execution history:

Timestamp            Event                    Status
2024-01-15 10:30:00  Rule executed            OK
2024-01-15 10:35:00  Rule executed            Active (2 alerts)
2024-01-15 10:35:01  Action: Slack            Success
2024-01-15 10:35:02  Action: Email            Success
2024-01-15 10:40:00  Rule executed            Active (1 alert)
2024-01-15 10:40:01  Action: Slack (recovery) Success
2024-01-15 10:45:00  Rule executed            OK

Advanced Patterns

Tiered Alerting

Set up multiple severity levels:

Rule 1: Warning Level
  Condition: error_rate > 5%
  Action: Slack #monitoring channel
  Schedule: Every 5 minutes

Rule 2: Critical Level
  Condition: error_rate > 15%
  Action: PagerDuty + Slack #incidents
  Schedule: Every 1 minute

Rule 3: Emergency Level
  Condition: error_rate > 50%
  Action: PagerDuty (high urgency) + SMS + Slack #all-hands
  Schedule: Every 30 seconds

Composite Alerts

Combine multiple signals:

Rule: Service Health Composite
  Query: (response_time_p95 > 2000 OR error_rate > 10%)
         AND service.name: "checkout-service"
  
  This fires when EITHER condition is true for the checkout service.

Alert Deduplication

Prevent alert fatigue with action frequency:

Action settings:
  Run action: Every 1 hour
  (Sends at most 1 alert per hour even if condition remains true)

Alternatives:
  - On status change only (alert/recovery)
  - On each check interval
  - Throttled (every N minutes/hours)
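In the rule API, this throttling maps onto a per-action frequency setting. The field names below (`notify_when` with values like onActionGroupChange, onActiveAlert, onThrottleInterval, plus `throttle`) are assumed from recent Kibana versions:

```shell
# Action fragment with per-action frequency; field names assume the Kibana
# alerting API. "onThrottleInterval" + "throttle": "1h" means at most one
# notification per hour even if the condition stays true.
action='{
  "id": "slack-connector-id",
  "group": "query matched",
  "frequency": {
    "notify_when": "onThrottleInterval",
    "throttle": "1h",
    "summary": false
  },
  "params": { "message": "Error count: {{context.value}}" }
}'
echo "$action" | grep -q '"throttle": "1h"' && echo "throttle set"
```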

Maintenance Windows

Schedule planned maintenance to suppress alerts:

  1. Go to Stack Management → Maintenance Windows
  2. Click "Create"
  3. Configure:
Name: Weekend Deployment Window
Schedule: Every Saturday 02:00-06:00 UTC
Filter: Tags match "production"
Repeat: Weekly

During this window:
  - Rules still evaluate
  - Actions are suppressed
  - Alerts are tracked but not sent

Rules via API

Create a Rule

curl -X POST "localhost:5601/api/alerting/rule" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High Error Rate",
    "rule_type_id": ".es-query",
    "consumer": "alerts",
    "schedule": { "interval": "5m" },
    "params": {
      "searchType": "esQuery",
      "timeWindowSize": 15,
      "timeWindowUnit": "m",
      "threshold": [50],
      "thresholdComparator": ">",
      "size": 100,
      "esQuery": "{\"query\":{\"bool\":{\"filter\":[{\"term\":{\"level\":\"error\"}}]}}}",
      "index": ["filebeat-*"],
      "timeField": "@timestamp"
    },
    "actions": [
      {
        "id": "slack-connector-id",
        "group": "query matched",
        "params": {
          "message": "Error count: {{context.value}} in last 15 min"
        }
      }
    ],
    "tags": ["production", "errors"]
  }'

List Rules

curl -X GET "localhost:5601/api/alerting/rules/_find?per_page=20&page=1" \
  -H "kbn-xsrf: true"
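The find endpoint returns JSON with a `data` array (shape assumed here). If jq isn't available, plain grep/sed can pull out the rule names — the filter below runs against a canned sample; in practice, pipe the curl output through the same commands:

```shell
# Extract rule names from a sample _find response using grep/sed
# (the response shape is an assumption based on the API above).
response='{"page":1,"total":2,"data":[{"name":"High Error Rate"},{"name":"CPU Threshold"}]}'
echo "$response" | grep -o '"name":"[^"]*"' | sed 's/"name":"\([^"]*\)"/\1/'
# → High Error Rate
# → CPU Threshold
```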

Disable a Rule

curl -X POST "localhost:5601/api/alerting/rule/RULE_ID/_disable" \
  -H "kbn-xsrf: true"

Get Alert History

curl -X GET "localhost:5601/api/alerting/rule/RULE_ID/_alert_summary" \
  -H "kbn-xsrf: true"

Practical Examples

Example 1: Website Uptime Alert

Rule: Website Down
Schedule: Every 1 minute
Type: Elasticsearch query

Query (KQL):
  monitor.status: "down" AND monitor.name: "production-website"

Condition:
  WHEN count() IS ABOVE 0
  FOR THE LAST 3 minutes

Actions (Alert):
  PagerDuty: Critical incident
  Slack: "🔴 Website is DOWN! Status check failed for {{context.group}}"

Actions (Recovery):
  Slack: "✅ Website is back UP. Downtime: ~{{context.value}} minutes"

Example 2: Business Hours Only Alert

Rule: Slow API Response (Business Hours)
Schedule: Every 5 minutes
Type: Index threshold

Index: apm-*
Condition:
  WHEN avg(transaction.duration.us) IS ABOVE 5000000
  GROUP BY service.name
  FOR THE LAST 10 minutes

Actions:
  Connector: Slack
  Run action: Only between 09:00-18:00 Mon-Fri
  Message: |
    ⚠️ Slow API Response
    Service: {{context.group}}
    Avg response: {{context.value}}μs
    Threshold: 5,000,000μs (5 seconds)

Example 3: Security Alert

Rule: Brute Force Detection
Schedule: Every 1 minute
Type: Elasticsearch query

Query (KQL):
  event.action: "authentication_failed"

Condition:
  WHEN count() IS ABOVE 20
  GROUP BY source.ip
  FOR THE LAST 5 minutes

Actions:
  Webhook (to firewall API):
    POST https://firewall.example.com/api/block
    Body: { "ip": "{{context.group}}", "reason": "brute_force", "duration": "1h" }
  
  Slack:
    "🔒 Brute force detected from {{context.group}}: {{context.value}} failed attempts in 5 min"

Common Issues

Rule shows "Error" status

Causes:

  1. Connector misconfigured → Test connector independently
  2. Invalid query syntax → Verify query in Discover first
  3. Index doesn't exist → Check index pattern
  4. Insufficient permissions → Check user role has alerting privileges

Alerts not firing

Checklist:

  1. Rule is enabled (not disabled or snoozed)
  2. Time range in condition matches data availability
  3. Threshold is appropriate (not too high)
  4. Data view or index has recent data
  5. Rule schedule is running (check "Last run" timestamp)

Too many alerts (alert fatigue)

Solutions:

  1. Increase thresholds to reduce noise
  2. Use action frequency throttling (e.g., once per hour)
  3. Group by relevant field to consolidate
  4. Use maintenance windows for planned events
  5. Implement tiered alerting (warning → critical → emergency)

Actions failing

Debug steps:

  1. Check connector test (Stack Management → Connectors → Test)
  2. View execution log for error details
  3. Verify network connectivity (firewall, proxy)
  4. Check API keys and credentials haven't expired
  5. Review Kibana server logs for detailed errors
# Check Kibana logs for action errors
grep "action" /var/log/kibana/kibana.log | grep -i error

Summary

In this chapter, you learned:

  • ✅ Alerting architecture: rules, conditions, actions, and connectors
  • ✅ Setting up connectors for Slack, email, PagerDuty, and webhooks
  • ✅ Creating different rule types (query, threshold, metric, ML)
  • ✅ Using template variables for dynamic alert messages
  • ✅ Managing rules: snoozing, muting, disabling
  • ✅ Advanced patterns: tiered alerting, maintenance windows, deduplication
  • ✅ Automating rule management via API
  • ✅ Troubleshooting common alerting issues

Next: Securing your Kibana instance with users, roles, and spaces!