Alerting and Rules

Kibana's alerting system monitors your data and triggers actions when specified conditions are met. It offers a more flexible, Kibana-native alternative to Elasticsearch's older Watcher feature.

Alerting Architecture

┌─────────────────────────────────────────────────────┐
│                    Kibana                            │
│                                                     │
│  ┌──────────┐    ┌──────────┐    ┌──────────────┐  │
│  │   Rule   │───→│ Condition │───→│   Actions    │  │
│  │(Schedule)│    │  (Check)  │    │ (Notify/Act) │  │
│  └──────────┘    └──────────┘    └──────────────┘  │
│       │                               │             │
│  Runs every N    Evaluates query      Sends alert:  │
│  minutes/hours   against threshold    - Email       │
│                                       - Slack       │
│                                       - Webhook     │
│                                       - PagerDuty   │
│                                       - Jira        │
│                                       - Teams       │
└─────────────────────────────────────────────────────┘

Key Concepts

Term        Definition
Rule        A scheduled check that evaluates conditions
Rule type   What kind of check (threshold, query, anomaly, etc.)
Condition   The criteria that triggers an alert
Action      What happens when the condition is met
Connector   Integration with an external service (Slack, email, etc.)
Alert       A specific instance of a triggered condition
Recovery    When a condition returns to normal
Muting      Temporarily silencing alerts
Snooze      Pausing a rule's notifications for a set duration (the rule keeps evaluating)

Setting Up Connectors

Before creating rules, configure connectors for alert delivery.

Available Connectors

Connector         Purpose                    Setup Needed
Email             Send alert emails          SMTP server config
Slack             Post to Slack channels     Webhook URL
Microsoft Teams   Post to Teams channels     Webhook URL
PagerDuty         Trigger incidents          Integration key
Jira              Create/update tickets      URL + credentials
ServiceNow        Create incidents           Instance + credentials
Webhook           HTTP request to any URL    URL + auth
Server log        Write to Kibana logs       None (built-in)
Index             Write to Elasticsearch     Index name
Opsgenie          Alert management           API key
xMatters          Incident management        URL + credentials

Creating a Slack Connector

  1. Go to Stack Management → Alerts and Insights → Connectors
  2. Click "Create connector"
  3. Select "Slack"
  4. Configure:
Name: Production Alerts Slack
Webhook URL: https://hooks.slack.com/services/T00/B00/xxxx

Test: Click "Test" tab
  Message: "Test alert from Kibana"
  Click "Run" → Verify message appears in Slack
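The same connector can be scripted. The sketch below assumes Kibana's connector API (`POST /api/actions/connector` with the `.slack` connector type) — the webhook URL is the placeholder from the example above. It sanity-checks the payload locally before the (commented) request is sent:

```shell
# Hypothetical API-based setup; the endpoint and field names assume Kibana's
# connector API, and the webhook URL is a placeholder.
payload='{
  "name": "Production Alerts Slack",
  "connector_type_id": ".slack",
  "secrets": { "webhookUrl": "https://hooks.slack.com/services/T00/B00/xxxx" }
}'

# Quick local check that the payload contains what we expect.
echo "$payload" | grep -q '"connector_type_id": ".slack"' && echo "payload ok"

# Then send it (uncomment once Kibana is reachable):
# curl -X POST "localhost:5601/api/actions/connector" \
#   -H "kbn-xsrf: true" -H "Content-Type: application/json" \
#   -d "$payload"
```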

Creating an Email Connector

  1. First, configure SMTP in kibana.yml:
# kibana.yml
xpack.actions.email:
  service: other
  host: smtp.example.com
  port: 587
  secure: true
  2. Create the connector:
Name: Alert Emails
From: alerts@example.com
Service: Other (configured in kibana.yml)
Host: smtp.example.com
Port: 587
Secure: true
Authentication:
  User: alerts@example.com
  Password: ********

Creating a Webhook Connector

For custom integrations:

Name: Custom Alert Webhook
Method: POST
URL: https://api.example.com/alerts
Headers:
  Content-Type: application/json
  Authorization: Bearer your-api-token

Body (configured per rule action):
{
  "alert": "{{alertName}}",
  "status": "{{alertActionGroup}}",
  "message": "{{context.message}}",
  "timestamp": "{{date}}"
}
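To preview what the rendered body will look like, you can substitute sample values for the mustache variables locally — sed here is only a stand-in for the substitution Kibana performs at action time:

```shell
# Render a webhook body template with sample values in place of the
# mustache variables (Kibana does this substitution when the action runs).
template='{"alert":"{{alertName}}","status":"{{alertActionGroup}}","message":"{{context.message}}"}'
echo "$template" \
  | sed -e 's/{{alertName}}/High Error Rate/' \
        -e 's/{{alertActionGroup}}/alert/' \
        -e 's/{{context.message}}/Error count over threshold/'
# → {"alert":"High Error Rate","status":"alert","message":"Error count over threshold"}
```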

Rule Types

Elasticsearch Query Rule

Triggers when a query matches documents above a threshold.

Creating a Query Rule:

  1. Go to Stack Management → Rules (or Observability → Alerts → Manage Rules)
  2. Click "Create rule"
  3. Select "Elasticsearch query"
  4. Configure:
Name: High Error Rate
Tags: [production, errors, critical]

Schedule: Check every 5 minutes

Query type: KQL
Data view: filebeat-*
Query: level: "error" AND service.name: "payment-service"

Condition:
  WHEN count() OVER all documents
  IS ABOVE 50
  FOR THE LAST 15 minutes

Actions:
  When alert status changes:
    Connector: Production Alerts Slack
    Message: |
      🚨 *High Error Rate Alert*
      Service: payment-service
      Error count: {{context.value}} in last 15 minutes
      Threshold: 50
      Link: {{context.link}}

Index Threshold Rule

Monitors a metric in an index against a threshold:

Name: CPU Threshold Alert
Schedule: Every 1 minute

Index: metricbeat-*
Time field: @timestamp
Aggregation: max(system.cpu.total.pct)
Group by: host.name

Condition:
  IS ABOVE 0.90 (90%)
  FOR THE LAST 5 minutes

Actions (alert):
  Slack: "⚠️ High CPU on {{context.group}}: {{context.value}}%"

Actions (recovery):
  Slack: "✅ CPU recovered on {{context.group}}: {{context.value}}%"
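The CPU rule above can also be created programmatically. This sketch assumes Kibana's `.index-threshold` rule type and its parameter names (`aggType`, `termField`, `thresholdComparator`, etc.); check your version's API reference before relying on them:

```shell
# Hypothetical API version of the CPU threshold rule; rule_type_id and
# param names assume Kibana's .index-threshold rule type.
payload='{
  "name": "CPU Threshold Alert",
  "rule_type_id": ".index-threshold",
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "params": {
    "index": ["metricbeat-*"],
    "timeField": "@timestamp",
    "aggType": "max",
    "aggField": "system.cpu.total.pct",
    "groupBy": "top",
    "termField": "host.name",
    "termSize": 10,
    "thresholdComparator": ">",
    "threshold": [0.9],
    "timeWindowSize": 5,
    "timeWindowUnit": "m"
  }
}'
echo "$payload" | grep -q '".index-threshold"' && echo "rule payload ok"
# curl -X POST "localhost:5601/api/alerting/rule" \
#   -H "kbn-xsrf: true" -H "Content-Type: application/json" -d "$payload"
```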

Log Threshold Rule (Observability)

Specialized for log monitoring:

Name: Error Spike Detection
Schedule: Every 1 minute

Data view: filebeat-*
Condition:
  WHEN count of log entries
  WITH level: "error"
  IS MORE THAN 100
  FOR THE LAST 5 minutes

Group by: service.name

Actions:
  Connector: PagerDuty
  Severity: Critical
  Summary: "Error spike in {{context.group}}: {{context.value}} errors in 5 min"

Metric Threshold Rule (Observability)

Monitor infrastructure metrics:

Name: Disk Space Warning
Schedule: Every 5 minutes

Metric: system.filesystem.used.pct
Aggregation: max
Group by: host.name, system.filesystem.mount_point

Alert conditions:
  Warning:  IS ABOVE 80%
  Critical: IS ABOVE 90%

Actions (Warning):
  Email: "⚠️ Disk at {{context.value}}% on {{context.group}}"

Actions (Critical):
  PagerDuty: Severity high
  Slack: "🔴 CRITICAL: Disk at {{context.value}}% on {{context.group}}"

Anomaly Detection Rule (ML)

Trigger on ML anomaly score:

Name: Unusual Transaction Volume
Schedule: Every 15 minutes

ML job: transaction-volume-anomaly
Severity: Critical (score ≥ 75)

Condition:
  WHEN anomaly score IS ABOVE 75
  FOR job "transaction-volume-anomaly"

Actions:
  Slack: |
    🔍 *ML Anomaly Detected*
    Job: transaction-volume-anomaly
    Score: {{context.anomalyScore}}
    Description: {{context.message}}

Action Variables

Actions support template variables for dynamic content:

Common Variables

{{alertId}}              Unique alert identifier
{{alertName}}            Rule name
{{spaceId}}              Kibana space ID
{{date}}                 Timestamp of alert
{{alertActionGroup}}     Current status (e.g., "alert", "recovered")
{{context.value}}        The value that triggered the alert
{{context.conditions}}   Human-readable condition summary
{{context.group}}        Group-by value (e.g., host name)
{{context.message}}      Default alert message
{{context.link}}         Link to the alert in Kibana

Using Variables in Messages

Slack message:

🚨 *Alert: {{alertName}}*

Status: {{alertActionGroup}}
Value: {{context.value}}
Threshold: {{context.conditions}}
Group: {{context.group}}
Time: {{date}}

<{{context.link}}|View in Kibana>

Email (HTML):

<h2>Alert: {{alertName}}</h2>
<table>
  <tr><td><b>Status:</b></td><td>{{alertActionGroup}}</td></tr>
  <tr><td><b>Value:</b></td><td>{{context.value}}</td></tr>
  <tr><td><b>Time:</b></td><td>{{date}}</td></tr>
</table>
<p><a href="{{context.link}}">View in Kibana</a></p>

Webhook JSON:

{
  "alert_name": "{{alertName}}",
  "status": "{{alertActionGroup}}",
  "value": "{{context.value}}",
  "group": "{{context.group}}",
  "timestamp": "{{date}}",
  "kibana_url": "{{context.link}}"
}

Managing Rules

Rule List

View all rules at Stack Management → Rules:

┌────────────────────────────────────────────────────────────┐
│  Rules                                [Create rule]         │
├────────────────────────────────────────────────────────────┤
│ Name              │ Status  │ Last run  │ Next run │ Alerts│
│───────────────────│─────────│───────────│──────────│───────│
│ High Error Rate   │ Active  │ 2m ago    │ 3m       │ 2     │
│ CPU Threshold     │ OK      │ 1m ago    │ 0m       │ 0     │
│ Disk Space        │ Snoozed │ 5m ago    │ -        │ 1     │
│ Login Failures    │ Error   │ 30m ago   │ -        │ -     │
└────────────────────────────────────────────────────────────┘

Rule Statuses

Status     Meaning
OK         Rule ran, no conditions met
Active     Conditions met, alerts firing
Error      Rule execution failed
Pending    Rule waiting for next check
Snoozed    Rule temporarily paused
Disabled   Rule turned off

Snoozing a Rule

Temporarily pause alerts (useful during maintenance):

  1. Click rule name → "Snooze"
  2. Choose duration:
    • 1 hour
    • 8 hours
    • 24 hours
    • Custom duration
    • Indefinitely
  3. Rule stops sending actions but continues evaluating
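Snoozing is primarily a UI feature. From the API, a similar effect is available by muting all of a rule's alerts — the `_mute_all`/`_unmute_all` endpoints are assumed from Kibana's alerting API, and RULE_ID is a placeholder:

```shell
# Dry-run sketch: compose the mute/unmute requests (endpoints assumed from
# Kibana's alerting API; RULE_ID stands in for a real rule id).
KIBANA="localhost:5601"
RULE_ID="RULE_ID"

echo "POST ${KIBANA}/api/alerting/rule/${RULE_ID}/_mute_all"     # silence all notifications
echo "POST ${KIBANA}/api/alerting/rule/${RULE_ID}/_unmute_all"   # restore notifications

# curl -X POST "${KIBANA}/api/alerting/rule/${RULE_ID}/_mute_all" -H "kbn-xsrf: true"
```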

Muting Specific Alerts

Mute alerts for specific groups without stopping the entire rule:

Rule: CPU Threshold (grouped by host.name)

Active alerts:
  ✅ prod-web-01: 95% → Sending alerts
  🔇 prod-db-01: 92%  → Muted (under maintenance)
  ✅ prod-api-01: 88% → Sending alerts

To mute a specific alert:

  1. Open rule details
  2. Find the specific alert
  3. Click "Mute" on that alert

Editing a Rule

  1. Click rule name in the list
  2. Click "Edit"
  3. Modify conditions, actions, or schedule
  4. Click "Save"

Changes take effect on the next scheduled check.

Disabling a Rule

  1. Click rule name
  2. Click "Disable"
  3. Rule stops evaluating entirely

Alert History

Viewing Alert Events

  1. Open a rule
  2. Click "Alert history" tab
  3. See timeline of:
    • When alerts fired
    • When alerts recovered
    • Action execution results

Event Log

The event log shows detailed execution history:

Timestamp            Event                    Status
2024-01-15 10:30:00  Rule executed            OK
2024-01-15 10:35:00  Rule executed            Active (2 alerts)
2024-01-15 10:35:01  Action: Slack            Success
2024-01-15 10:35:02  Action: Email            Success
2024-01-15 10:40:00  Rule executed            Active (1 alert)
2024-01-15 10:40:01  Action: Slack (recovery) Success
2024-01-15 10:45:00  Rule executed            OK

Advanced Patterns

Tiered Alerting

Set up multiple severity levels:

Rule 1: Warning Level
  Condition: error_rate > 5%
  Action: Slack #monitoring channel
  Schedule: Every 5 minutes

Rule 2: Critical Level
  Condition: error_rate > 15%
  Action: PagerDuty + Slack #incidents
  Schedule: Every 1 minute

Rule 3: Emergency Level
  Condition: error_rate > 50%
  Action: PagerDuty (high urgency) + SMS + Slack #all-hands
  Schedule: Every 30 seconds

Composite Alerts

Combine multiple signals:

Rule: Service Health Composite
  Query: (response_time_p95 > 2000 OR error_rate > 10%)
         AND service.name: "checkout-service"
  
  This fires when EITHER condition is true for the checkout service.

Alert Deduplication

Prevent alert fatigue with action frequency:

Action settings:
  Run action: Every 1 hour
  (Sends at most 1 alert per hour even if condition remains true)

Alternatives:
  - On status change only (alert/recovery)
  - On each check interval
  - Throttled (every N minutes/hours)
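In the rule API, this throttling maps onto a per-action frequency setting. The field names below (`notify_when` with values like onActionGroupChange, onActiveAlert, onThrottleInterval, plus `throttle`) are assumed from recent Kibana versions:

```shell
# Action fragment with per-action frequency; field names assume the Kibana
# alerting API. "onThrottleInterval" + "throttle": "1h" means at most one
# notification per hour even if the condition stays true.
action='{
  "id": "slack-connector-id",
  "group": "query matched",
  "frequency": {
    "notify_when": "onThrottleInterval",
    "throttle": "1h",
    "summary": false
  },
  "params": { "message": "Error count: {{context.value}}" }
}'
echo "$action" | grep -q '"throttle": "1h"' && echo "throttle set"
```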

Maintenance Windows

Schedule planned maintenance to suppress alerts:

  1. Go to Stack Management → Maintenance Windows
  2. Click "Create"
  3. Configure:
Name: Weekend Deployment Window
Schedule: Every Saturday 02:00-06:00 UTC
Filter: Tags match "production"
Repeat: Weekly

During this window:
  - Rules still evaluate
  - Actions are suppressed
  - Alerts are tracked but not sent

Rules via API

Create a Rule

curl -X POST "localhost:5601/api/alerting/rule" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High Error Rate",
    "rule_type_id": ".es-query",
    "consumer": "alerts",
    "schedule": { "interval": "5m" },
    "params": {
      "searchType": "esQuery",
      "timeWindowSize": 15,
      "timeWindowUnit": "m",
      "threshold": [50],
      "thresholdComparator": ">",
      "size": 100,
      "esQuery": "{\"query\":{\"bool\":{\"filter\":[{\"term\":{\"level\":\"error\"}}]}}}",
      "index": ["filebeat-*"],
      "timeField": "@timestamp"
    },
    "actions": [
      {
        "id": "slack-connector-id",
        "group": "query matched",
        "params": {
          "message": "Error count: {{context.value}} in last 15 min"
        }
      }
    ],
    "tags": ["production", "errors"]
  }'

List Rules

curl -X GET "localhost:5601/api/alerting/rules/_find?per_page=20&page=1" \
  -H "kbn-xsrf: true"
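The find endpoint returns JSON with a `data` array (shape assumed here). If jq isn't available, plain grep/sed can pull out the rule names — the filter below runs against a canned sample; in practice, pipe the curl output through the same commands:

```shell
# Extract rule names from a sample _find response using grep/sed
# (the response shape is an assumption based on the API above).
response='{"page":1,"total":2,"data":[{"name":"High Error Rate"},{"name":"CPU Threshold"}]}'
echo "$response" | grep -o '"name":"[^"]*"' | sed 's/"name":"\([^"]*\)"/\1/'
# → High Error Rate
# → CPU Threshold
```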

Disable a Rule

curl -X POST "localhost:5601/api/alerting/rule/RULE_ID/_disable" \
  -H "kbn-xsrf: true"

Get Alert History

curl -X GET "localhost:5601/api/alerting/rule/RULE_ID/_alert_summary" \
  -H "kbn-xsrf: true"

Practical Examples

Example 1: Website Uptime Alert

Rule: Website Down
Schedule: Every 1 minute
Type: Elasticsearch query

Query (KQL):
  monitor.status: "down" AND monitor.name: "production-website"

Condition:
  WHEN count() IS ABOVE 0
  FOR THE LAST 3 minutes

Actions (Alert):
  PagerDuty: Critical incident
  Slack: "🔴 Website is DOWN! Status check failed for {{context.group}}"

Actions (Recovery):
  Slack: "✅ Website is back UP. Downtime: ~{{context.value}} minutes"

Example 2: Business Hours Only Alert

Rule: Slow API Response (Business Hours)
Schedule: Every 5 minutes
Type: Index threshold

Index: apm-*
Condition:
  WHEN avg(transaction.duration.us) IS ABOVE 5000000
  GROUP BY service.name
  FOR THE LAST 10 minutes

Actions:
  Connector: Slack
  Run action: Only between 09:00-18:00 Mon-Fri
  Message: |
    ⚠️ Slow API Response
    Service: {{context.group}}
    Avg response: {{context.value}}μs
    Threshold: 5,000,000μs (5 seconds)

Example 3: Security Alert

Rule: Brute Force Detection
Schedule: Every 1 minute
Type: Elasticsearch query

Query (KQL):
  event.action: "authentication_failed"

Condition:
  WHEN count() IS ABOVE 20
  GROUP BY source.ip
  FOR THE LAST 5 minutes

Actions:
  Webhook (to firewall API):
    POST https://firewall.example.com/api/block
    Body: { "ip": "{{context.group}}", "reason": "brute_force", "duration": "1h" }
  
  Slack:
    "🔒 Brute force detected from {{context.group}}: {{context.value}} failed attempts in 5 min"

Common Issues

Rule shows "Error" status

Causes:

  1. Connector misconfigured → Test connector independently
  2. Invalid query syntax → Verify query in Discover first
  3. Index doesn't exist → Check index pattern
  4. Insufficient permissions → Check user role has alerting privileges

Alerts not firing

Checklist:

  1. Rule is enabled (not disabled or snoozed)
  2. Time range in condition matches data availability
  3. Threshold is appropriate (not too high)
  4. Data view or index has recent data
  5. Rule schedule is running (check "Last run" timestamp)

Too many alerts (alert fatigue)

Solutions:

  1. Increase thresholds to reduce noise
  2. Use action frequency throttling (e.g., once per hour)
  3. Group by relevant field to consolidate
  4. Use maintenance windows for planned events
  5. Implement tiered alerting (warning → critical → emergency)

Actions failing

Debug steps:

  1. Check connector test (Stack Management → Connectors → Test)
  2. View execution log for error details
  3. Verify network connectivity (firewall, proxy)
  4. Check API keys and credentials haven't expired
  5. Review Kibana server logs for detailed errors
# Check Kibana logs for action errors
grep "action" /var/log/kibana/kibana.log | grep -i error

Summary

In this chapter, you learned:

  • ✅ Alerting architecture: rules, conditions, actions, and connectors
  • ✅ Setting up connectors for Slack, email, PagerDuty, and webhooks
  • ✅ Creating different rule types (query, threshold, metric, ML)
  • ✅ Using template variables for dynamic alert messages
  • ✅ Managing rules: snoozing, muting, disabling
  • ✅ Advanced patterns: tiered alerting, maintenance windows, deduplication
  • ✅ Automating rule management via API
  • ✅ Troubleshooting common alerting issues

Next: Securing your Kibana instance with users, roles, and spaces!