Machine Learning in Kibana

Kibana's machine learning features automatically detect anomalies, forecast trends, and classify data. ML runs within Elasticsearch and is managed through Kibana's interface, requiring no external tools or data export.

ML Overview

┌────────────────────────────────────────────────────────┐
│                    Kibana ML UI                         │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │   Anomaly    │  │  Data Frame  │  │   Trained    │ │
│  │  Detection   │  │  Analytics   │  │   Models     │ │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘ │
│         │                 │                  │         │
│  Unsupervised      Supervised &        NLP & Custom   │
│  time-series       unsupervised        models         │
│  analysis          batch analysis                     │
│                                                        │
│  • Anomaly jobs    • Outlier detection  • NER          │
│  • Forecasting     • Regression         • Sentiment    │
│  • Population      • Classification     • Text embed   │
│    analysis        • Inference          • ELSER        │
└────────────────────────────────────────────────────────┘

Prerequisites

  • License: Platinum or Enterprise (or trial)
  • ML nodes: Dedicated ML nodes recommended for production
  • Data: Time-series data with consistent patterns works best

Check ML availability:

curl -X GET "localhost:9200/_license" | jq '.license.type'
# "platinum", "enterprise", or "trial"

# Start a 30-day trial
curl -X POST "localhost:9200/_license/start_trial?acknowledge=true"

Anomaly Detection

Anomaly detection finds unexpected patterns in time-series data. It learns what "normal" looks like and flags deviations.

How It Works

Historical Data → ML Model Training → Real-time Scoring

  Normal pattern:    ──────/\──────/\──────/\────
  Anomaly detected:  ──────/\──────/\────┐ ↑
                                          │ Unexpected spike
                                          └───────────

The ML algorithm:

  1. Analyzes historical data to learn patterns
  2. Accounts for seasonality (hourly, daily, weekly)
  3. Builds a probability model of expected behavior
  4. Scores new data points by how unusual they are (0-100)

Creating an Anomaly Detection Job

Via the UI

  1. Go to Machine Learning → Anomaly Detection → Create job
  2. Select your data view (e.g., kibana_sample_data_ecommerce)
  3. Choose a job type:
  Job Type         Description                     Use Case
  Single metric    One metric, simple analysis     Total request count
  Multi-metric     Multiple metrics, same entity   CPU + memory + disk per host
  Population       Compare individuals to group    Users vs typical user behavior
  Advanced         Full configuration control      Complex custom analysis
  Categorization   Group log messages              Identify new error types

Single Metric Job

Example: Detect unusual order volume

Step 1: Choose data view
  kibana_sample_data_ecommerce

Step 2: Pick time range
  Use full data range

Step 3: Configure
  Aggregation: Count
  Bucket span: 1h
  (How often to check - shorter = more sensitive, longer = less noisy)

Step 4: Job details
  Job ID: ecommerce-order-volume
  Group: ecommerce
  Description: "Detect unusual order volume"

Step 5: Validation
  Review settings, check for issues

Step 6: Create and open
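
The same job can be created through the API. A minimal sketch of the configuration from the steps above (the time field assumes the sample e-commerce data's `order_date`):

```shell
# Minimal API equivalent of the single metric job above:
# a count detector with a 1h bucket span
curl -X PUT "localhost:9200/_ml/anomaly_detectors/ecommerce-order-volume" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Detect unusual order volume",
    "groups": ["ecommerce"],
    "analysis_config": {
      "bucket_span": "1h",
      "detectors": [ { "function": "count" } ]
    },
    "data_description": { "time_field": "order_date" }
  }'
```

A datafeed still has to be created and started before the job processes data, as shown in the API section later in this chapter.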

Multi-Metric Job

Example: Monitor server health

Data view: metricbeat-*

Detectors:
  1. high_mean(system.cpu.total.pct)
  2. high_mean(system.memory.used.pct)
  3. high_mean(system.filesystem.used.pct)

Split field: host.name
Bucket span: 15m

Result: Each host is analyzed independently for CPU, memory, and disk anomalies
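
In API terms, a multi-metric job is simply several detectors sharing a partition field (the UI's "split field" maps to partition_field_name). A sketch, assuming Metricbeat's default field names:

```shell
# Multi-metric job: three detectors, each modeled per host
curl -X PUT "localhost:9200/_ml/anomaly_detectors/server-health" \
  -H "Content-Type: application/json" \
  -d '{
    "analysis_config": {
      "bucket_span": "15m",
      "detectors": [
        { "function": "high_mean", "field_name": "system.cpu.total.pct",
          "partition_field_name": "host.name" },
        { "function": "high_mean", "field_name": "system.memory.used.pct",
          "partition_field_name": "host.name" },
        { "function": "high_mean", "field_name": "system.filesystem.used.pct",
          "partition_field_name": "host.name" }
      ],
      "influencers": ["host.name"]
    },
    "data_description": { "time_field": "@timestamp" }
  }'
```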

Population Job

Example: Detect unusual user behavior

Data view: webserver-logs-*

Population (over) field: user.name
Detector: high_count
By field: url.path (what pages they visit)

Result: Identifies users whose browsing patterns differ
        significantly from the typical user
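
As a detector configuration, this population setup corresponds to the following sketch (the job ID is illustrative):

```shell
# Population job: each user's request count per URL path is
# compared against the population of all users
curl -X PUT "localhost:9200/_ml/anomaly_detectors/user-behavior" \
  -H "Content-Type: application/json" \
  -d '{
    "analysis_config": {
      "bucket_span": "15m",
      "detectors": [
        {
          "function": "high_count",
          "by_field_name": "url.path",
          "over_field_name": "user.name"
        }
      ],
      "influencers": ["user.name", "url.path"]
    },
    "data_description": { "time_field": "@timestamp" }
  }'
```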

Bucket Span

The bucket span determines analysis granularity:

  Bucket Span   Best For                 Sensitivity
  5m            High-frequency metrics   Very sensitive, more noise
  15m           Server metrics, logs     Balanced
  1h            Business metrics         Less sensitive, less noise
  1d            Daily aggregates         Low sensitivity, trend-level

Rule of thumb: Set bucket span to match the shortest meaningful pattern in your data. If you expect issues to last at least 15 minutes before they matter, use 15m.

Detectors

Detectors define what the ML job looks for:

FUNCTIONS (what to measure):
  count           Number of documents
  high_count      Unusually high count
  low_count       Unusually low count
  mean            Average of a field
  high_mean       Unusually high average
  low_mean        Unusually low average
  sum             Total of a field
  high_sum        Unusually high total
  low_sum         Unusually low total
  median          Median value
  min / max       Extreme values
  distinct_count  Number of unique values
  rare            Rare values in a field
  freq_rare       Rarely seen values by frequency
  info_content    Information content changes
  metric          General metric analysis

MODIFIERS (how to split):
  by_field_name         Split into sub-buckets
  over_field_name       Population analysis
  partition_field_name  Separate models per value

Detector Examples

// Detect high request latency per service
{
  "function": "high_mean",
  "field_name": "response_time",
  "by_field_name": "endpoint",
  "partition_field_name": "service.name"
}

// Detect unusual error counts
{
  "function": "high_count",
  "by_field_name": "error.type"
}

// Detect rare HTTP status codes
{
  "function": "rare",
  "by_field_name": "response.status_code"
}

// Detect unusual geographic access patterns
{
  "function": "rare",
  "by_field_name": "geoip.country_iso_code",
  "over_field_name": "user.name"
}

Viewing Anomaly Results

Anomaly Explorer

Navigate to ML → Anomaly Detection → Anomaly Explorer.

┌─────────────────────────────────────────────────────────┐
│  Anomaly Explorer                     [Time Picker]      │
├─────────────────────────────────────────────────────────┤
│  Severity: [Critical ▼]  [Warning ▼]  [Minor ▼]         │
│                                                         │
│  Overall Anomaly Timeline:                              │
│  ░░░░░░█░░░░░█░░░░░░░░░░░██░░░░░░░░░░░░░░░░░░░░░      │
│                                                         │
│  Top Anomalies:                                         │
│  ┌────────────┬───────┬───────────┬──────────────────┐  │
│  │ Time       │ Score │ Detector  │ Description      │  │
│  │ Jan 15 10h │  92   │ high_mean │ Unusual CPU on   │  │
│  │            │       │           │ prod-web-03      │  │
│  │ Jan 15 8h  │  78   │ high_count│ Spike in errors  │  │
│  │            │       │           │ for /api/checkout│  │
│  │ Jan 14 23h │  65   │ rare      │ Unusual country  │  │
│  │            │       │           │ code: XX         │  │
│  └────────────┴───────┴───────────┴──────────────────┘  │
└─────────────────────────────────────────────────────────┘

Anomaly Scores

  Score Range   Severity      Meaning
  75-100        Critical      Highly unusual, investigate immediately
  50-74         Major         Significant deviation from normal
  25-49         Minor         Notable but may not require action
  0-24          Low/Warning   Slight deviation, informational

Single Metric Viewer

Detailed view for individual metrics:

┌──────────────────────────────────────────────────────┐
│  Single Metric Viewer                                │
│                                                      │
│  Job: ecommerce-order-volume                         │
│                                                      │
│  Expected range (gray band):                         │
│          ┌──────────────────────────┐                │
│  ────────│     ██████              │── Actual         │
│          │   ██      ██            │                  │
│  ────────│ ██          ████████    │── Expected       │
│          │                   ██████│                  │
│          └──────────────────────────┘                │
│                                                      │
│  ● = Anomaly point (colored by severity)             │
│  Gray band = Expected range                          │
│  Blue line = Actual value                            │
└──────────────────────────────────────────────────────┘

Click an anomaly point to see:

  • Anomaly score
  • Actual value vs expected
  • Typical value
  • Influencers (contributing factors)

Forecasting

Forecast future values based on learned patterns.

Creating a Forecast

  1. Open a job in Single Metric Viewer
  2. Click "Forecast"
  3. Set duration: 1 day, 1 week, etc.
  4. Click "Run"

                    Forecast
  ─────────────────|──────────────
  Historical data  | Predicted range
                   |  ╱──────────╲  (upper bound)
  ████████████████ | ────────────── (prediction)
                   |  ╲──────────╱  (lower bound)
          Today -->|
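
Forecasts can also be triggered through the API. A sketch against the single metric job created earlier:

```shell
# Run a 7-day forecast for an open job
curl -X POST "localhost:9200/_ml/anomaly_detectors/ecommerce-order-volume/_forecast" \
  -H "Content-Type: application/json" \
  -d '{ "duration": "7d" }'

# Forecast results are written alongside anomaly results in
# .ml-anomalies-* with result_type: model_forecast
```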

Forecast Considerations

✅ Works best with:
  - Consistent seasonal patterns
  - Sufficient historical data (2+ cycles)
  - Single metric jobs
  - Stationary data (no permanent trend changes)

❌ Not reliable for:
  - Highly irregular data
  - Very short history
  - Data affected by external unknown events
  - Multi-metric or population jobs (limited support)

Data Frame Analytics

Batch analysis jobs that run on entire datasets rather than streaming time-series data.

Outlier Detection

Find data points that don't fit the pattern:

1. Go to ML → Data Frame Analytics → Create job
2. Type: Outlier detection
3. Source index: kibana_sample_data_ecommerce
4. Destination index: ecommerce-outliers

Configuration:
  Included fields:
    - taxful_total_price
    - products.quantity
    - total_quantity
  
  Method: ensemble (default, combines multiple methods)
  N neighbors: 10

Result: Each document gets an outlier_score (0.0 to 1.0)
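
The equivalent data frame analytics job via the API (a sketch; the job must be started separately, and omitting method gives the default ensemble of several algorithms):

```shell
# Create the outlier detection job
curl -X PUT "localhost:9200/_ml/data_frame/analytics/ecommerce-outliers" \
  -H "Content-Type: application/json" \
  -d '{
    "source": { "index": "kibana_sample_data_ecommerce" },
    "dest": { "index": "ecommerce-outliers" },
    "analysis": {
      "outlier_detection": { "n_neighbors": 10 }
    },
    "analyzed_fields": {
      "includes": ["taxful_total_price", "products.quantity", "total_quantity"]
    }
  }'

# Start it
curl -X POST "localhost:9200/_ml/data_frame/analytics/ecommerce-outliers/_start"
```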

Regression

Predict a continuous numeric value:

Type: Regression
Source: house-prices-*
Destination: house-prices-predictions

Configuration:
  Dependent variable: price (what to predict)
  
  Training percent: 80% (80% train, 20% test)
  
  Included fields:
    - square_feet
    - bedrooms
    - bathrooms
    - location
    - year_built
    - lot_size

Result:
  - Model trained on 80% of data
  - Predictions added to destination index
  - Feature importance shows which fields matter most
  - Evaluation metrics (RMSE, R-squared)
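
Once predictions exist in the destination index, the evaluate API can compute these metrics on the held-out test split. A sketch, assuming the default predicted field name convention (ml.<dependent>_prediction):

```shell
# Evaluate regression results on the 20% test split only
curl -X POST "localhost:9200/_ml/data_frame/_evaluate" \
  -H "Content-Type: application/json" \
  -d '{
    "index": "house-prices-predictions",
    "query": { "term": { "ml.is_training": false } },
    "evaluation": {
      "regression": {
        "actual_field": "price",
        "predicted_field": "ml.price_prediction",
        "metrics": { "r_squared": {}, "mse": {} }
      }
    }
  }'
```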

Classification

Predict a category:

Type: Classification
Source: network-traffic-*
Destination: traffic-classified

Configuration:
  Dependent variable: is_malicious (true/false)
  
  Training percent: 80%
  
  Included fields:
    - bytes_in
    - bytes_out
    - duration
    - protocol
    - source_port
    - dest_port
    - packet_count

Result:
  - Documents get predicted class + probability
  - Confusion matrix shows accuracy
  - Feature importance for model interpretability

Viewing Data Frame Analytics Results

  1. Go to ML → Data Frame Analytics
  2. Click your job
  3. See:
    • Results explorer: Scatter plots, tables
    • Feature importance: Which fields drive predictions
    • Evaluation: Accuracy metrics

Feature Importance (Regression: house price):
  square_feet     ████████████████  0.45
  location        ██████████        0.28
  bedrooms        █████             0.12
  year_built      ████              0.10
  lot_size        ██                0.05

Creating ML Jobs via API

Anomaly Detection Job

# Create a job
curl -X PUT "localhost:9200/_ml/anomaly_detectors/api-latency-job" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Detect unusual API response times",
    "analysis_config": {
      "bucket_span": "15m",
      "detectors": [
        {
          "function": "high_mean",
          "field_name": "response_time_ms",
          "by_field_name": "endpoint",
          "partition_field_name": "service.name"
        }
      ],
      "influencers": ["service.name", "endpoint", "host.name"]
    },
    "data_description": {
      "time_field": "@timestamp"
    },
    "analysis_limits": {
      "model_memory_limit": "256mb"
    }
  }'

# Create a datafeed
curl -X PUT "localhost:9200/_ml/datafeeds/datafeed-api-latency-job" \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "api-latency-job",
    "indices": ["apm-*"],
    "query": {
      "bool": {
        "filter": [
          { "term": { "transaction.type": "request" } }
        ]
      }
    }
  }'

# Start the datafeed
curl -X POST "localhost:9200/_ml/datafeeds/datafeed-api-latency-job/_start"

Check Job Status

# Job stats
curl "localhost:9200/_ml/anomaly_detectors/api-latency-job/_stats"

# Get anomaly results
# Get anomaly results (sort takes a field name; desc and page
# control ordering and result count)
curl "localhost:9200/_ml/anomaly_detectors/api-latency-job/results/records" \
  -H "Content-Type: application/json" \
  -d '{
    "sort": "record_score",
    "desc": true,
    "page": { "size": 10 }
  }'

Integrating ML with Alerts

Anomaly Alert Rule

  1. Go to Rules → Create rule
  2. Type: Anomaly detection alert
  3. Configure:
Name: Critical API Anomaly
ML Job: api-latency-job
Severity: 75 (critical only)
Check interval: 5 minutes

Result type: Record (individual anomaly)
  OR Bucket (overall bucket score)
  OR Influencer (top influencers)

Actions:
  Slack: |
    🤖 *ML Anomaly Detected*
    Job: {{context.jobIds}}
    Score: {{context.anomalyScore}}
    Top influencers: {{context.topInfluencers}}
    {{context.message}}

Integrating ML Results in Dashboards

  1. Create a saved search for anomaly results:

    • Index: .ml-anomalies-*
    • Filter: job_id: "api-latency-job" AND record_score > 50
  2. Add to dashboard:

    • Lens visualization: Anomaly score over time
    • Data table: Top anomalies with details

Lens chart:
  Data view: .ml-anomalies-*
  X-axis: timestamp (date histogram)
  Y-axis: max(record_score)
  Filter: job_id: "api-latency-job"
  Color: By record_score (gradient red)

Practical Examples

Example 1: E-Commerce Anomaly Detection

Job: ecommerce-revenue-anomaly
Purpose: Detect unusual revenue patterns

Detectors:
  1. high_sum(taxful_total_price)   → Revenue spikes
  2. low_sum(taxful_total_price)    → Revenue drops
  3. high_distinct_count(customer_id) → Unusual customer volume

Bucket span: 1h
Influencers: category.keyword, day_of_week, geoip.country_iso_code

Expected outcomes:
  - Detects flash sale spikes
  - Detects outage-related revenue drops
  - Catches bot activity (unusual customer patterns)

Example 2: Security Anomaly Detection

Job: security-login-anomaly
Purpose: Detect unusual login patterns

Detectors:
  1. high_count by source.ip                → Brute force
  2. rare by geoip.country_iso_code 
       over user.name                       → Unusual login location
  3. high_distinct_count(user.name) 
       by source.ip                         → Credential stuffing

Bucket span: 15m
Influencers: user.name, source.ip, geoip.country_iso_code

Alert: Score > 75 → PagerDuty + Slack #security

Example 3: Infrastructure Capacity Planning

Job: infra-capacity-forecast
Purpose: Predict when resources will be exhausted

Job configuration:
  Detector: mean(system.filesystem.used.pct)
  Partition: host.name (one model per host)

After training (2+ weeks of data):
  1. Open Single Metric Viewer
  2. Select host
  3. Forecast 30 days
  4. Identify when disk reaches 90%

Use for:
  - Proactive capacity upgrades
  - Budget planning
  - SLA compliance

Tips and Best Practices

Job Configuration

✅ Start with single metric jobs to learn
✅ Use appropriate bucket span (match your data cadence)
✅ Set influencers to help diagnose anomalies
✅ Use partition fields for per-entity analysis
✅ Set model memory limits to prevent resource exhaustion

❌ Don't use very small bucket spans (< 5m) without good reason
❌ Don't create too many detectors per job (< 10 recommended)
❌ Don't skip the validation step
❌ Don't expect immediate results (model needs training time)

Data Quality

✅ Ensure consistent data flow (gaps confuse the model)
✅ Pre-filter irrelevant data with datafeed queries
✅ Provide sufficient history (2+ seasonal cycles minimum)
✅ Use keyword fields for categorical analysis

❌ Don't train on data with known outages (or exclude those periods)
❌ Don't mix fundamentally different data sources in one job
❌ Don't ignore data gaps (they affect model quality)

Resource Management

ML node sizing:
  - RAM: 8GB minimum per node, 32GB+ for production
  - CPU: 4+ cores
  - Disk: Fast storage for model state

Model memory limits:
  - Simple jobs: 16-64MB
  - Multi-metric with partitions: 128-512MB
  - Population jobs: 256MB-2GB
  - Large cardinality: 1-4GB

Monitor ML memory usage in Kibana under Machine Learning → Memory Usage

Common Issues

Job not finding anomalies

Causes:

  1. Insufficient training data → Wait for 2+ seasonal cycles of data
  2. Bucket span too large → Try smaller intervals
  3. Data is truly consistent → No anomalies to find
  4. Wrong detector function → Review detector configuration

High model memory usage

Fix:

  1. Reduce partition field cardinality
  2. Increase bucket span (fewer buckets = less memory)
  3. Limit by_field to top N values
  4. Increase model_memory_limit if justified
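
If raising the limit is justified, it can be done through the update API. A sketch, reusing the job from the API section (the job must be closed before the limit can be changed):

```shell
# Close the job, raise the memory limit, reopen
curl -X POST "localhost:9200/_ml/anomaly_detectors/api-latency-job/_close"

curl -X POST "localhost:9200/_ml/anomaly_detectors/api-latency-job/_update" \
  -H "Content-Type: application/json" \
  -d '{ "analysis_limits": { "model_memory_limit": "512mb" } }'

curl -X POST "localhost:9200/_ml/anomaly_detectors/api-latency-job/_open"
```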

Too many false positives

Fix:

  1. Increase severity threshold in alerts (75+ for critical)
  2. Add custom rules to suppress known patterns
  3. Increase bucket span for less sensitivity
  4. Exclude known maintenance periods from training

# Add a calendar to exclude events
curl -X PUT "localhost:9200/_ml/calendars/maintenance" \
  -H "Content-Type: application/json" \
  -d '{ "job_ids": ["api-latency-job"] }'

curl -X POST "localhost:9200/_ml/calendars/maintenance/events" \
  -H "Content-Type: application/json" \
  -d '{
    "events": [
      {
        "description": "Weekly maintenance",
        "start_time": "2024-01-20T02:00:00Z",
        "end_time": "2024-01-20T06:00:00Z"
      }
    ]
  }'

Summary

In this chapter, you learned:

  • ✅ ML capabilities: anomaly detection, data frame analytics, forecasting
  • ✅ Creating anomaly detection jobs (single, multi-metric, population)
  • ✅ Understanding detectors, bucket spans, and influencers
  • ✅ Interpreting anomaly scores and results
  • ✅ Forecasting future values from learned patterns
  • ✅ Data frame analytics: outlier detection, regression, classification
  • ✅ Integrating ML with alerting and dashboards
  • ✅ Resource management and troubleshooting

Next: Monitoring application performance with APM!