Machine Learning in Kibana

Kibana's machine learning features automatically detect anomalies, forecast trends, and classify data. ML runs within Elasticsearch and is managed through Kibana's interface, requiring no external tools or data export.

ML Overview

┌────────────────────────────────────────────────────────┐
│                    Kibana ML UI                         │
│                                                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │   Anomaly    │  │  Data Frame  │  │   Trained    │ │
│  │  Detection   │  │  Analytics   │  │   Models     │ │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘ │
│         │                 │                  │         │
│  Unsupervised      Supervised &        NLP & Custom   │
│  time-series       unsupervised        models         │
│  analysis          batch analysis                     │
│                                                        │
│  • Anomaly jobs    • Outlier detection  • NER          │
│  • Forecasting     • Regression         • Sentiment    │
│  • Population      • Classification     • Text embed   │
│    analysis        • Inference          • ELSER        │
└────────────────────────────────────────────────────────┘

Prerequisites

  • License: Platinum or Enterprise (or trial)
  • ML nodes: Dedicated ML nodes recommended for production
  • Data: Time-series data with consistent patterns works best

Check ML availability:

curl -X GET "localhost:9200/_license" | jq '.license.type'
# "platinum", "enterprise", or "trial"

# Start a 30-day trial
curl -X POST "localhost:9200/_license/start_trial?acknowledge=true"

Anomaly Detection

Anomaly detection finds unexpected patterns in time-series data. It learns what "normal" looks like and flags deviations.

How It Works

Historical Data → ML Model Training → Real-time Scoring

  Normal pattern:    ──────/\──────/\──────/\────
  Anomaly detected:  ──────/\──────/\────┐ ↑
                                          │ Unexpected spike
                                          └───────────

The ML algorithm:

  1. Analyzes historical data to learn patterns
  2. Accounts for seasonality (hourly, daily, weekly)
  3. Builds a probability model of expected behavior
  4. Scores new data points by how unusual they are (0-100)

Creating an Anomaly Detection Job

Via the UI

  1. Go to Machine Learning → Anomaly Detection → Create job
  2. Select your data view (e.g., kibana_sample_data_ecommerce)
  3. Choose a job type:
  Job Type         Description                     Use Case
  Single metric    One metric, simple analysis     Total request count
  Multi-metric     Multiple metrics, same entity   CPU + memory + disk per host
  Population       Compare individuals to group    Users vs typical user behavior
  Advanced         Full configuration control      Complex custom analysis
  Categorization   Group log messages              Identify new error types

Single Metric Job

Example: Detect unusual order volume

Step 1: Choose data view
  kibana_sample_data_ecommerce

Step 2: Pick time range
  Use full data range

Step 3: Configure
  Aggregation: Count
  Bucket span: 1h
  (How often to check - shorter = more sensitive, longer = less noisy)

Step 4: Job details
  Job ID: ecommerce-order-volume
  Group: ecommerce
  Description: "Detect unusual order volume"

Step 5: Validation
  Review settings, check for issues

Step 6: Create and open
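
The same job can be created through the API. A minimal sketch of the configuration from the steps above (the time field assumes the sample e-commerce data's `order_date`):

```shell
# Minimal API equivalent of the single metric job above:
# a count detector with a 1h bucket span
curl -X PUT "localhost:9200/_ml/anomaly_detectors/ecommerce-order-volume" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Detect unusual order volume",
    "groups": ["ecommerce"],
    "analysis_config": {
      "bucket_span": "1h",
      "detectors": [ { "function": "count" } ]
    },
    "data_description": { "time_field": "order_date" }
  }'
```

A datafeed still has to be created and started before the job processes data, as shown in the API section later in this chapter.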

Multi-Metric Job

Example: Monitor server health

Data view: metricbeat-*

Detectors:
  1. high_mean(system.cpu.total.pct)
  2. high_mean(system.memory.used.pct)
  3. high_mean(system.filesystem.used.pct)

Split field: host.name
Bucket span: 15m

Result: Each host is analyzed independently for CPU, memory, and disk anomalies
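
In API terms, a multi-metric job is simply several detectors sharing a partition field (the UI's "split field" maps to partition_field_name). A sketch, assuming Metricbeat's default field names:

```shell
# Multi-metric job: three detectors, each modeled per host
curl -X PUT "localhost:9200/_ml/anomaly_detectors/server-health" \
  -H "Content-Type: application/json" \
  -d '{
    "analysis_config": {
      "bucket_span": "15m",
      "detectors": [
        { "function": "high_mean", "field_name": "system.cpu.total.pct",
          "partition_field_name": "host.name" },
        { "function": "high_mean", "field_name": "system.memory.used.pct",
          "partition_field_name": "host.name" },
        { "function": "high_mean", "field_name": "system.filesystem.used.pct",
          "partition_field_name": "host.name" }
      ],
      "influencers": ["host.name"]
    },
    "data_description": { "time_field": "@timestamp" }
  }'
```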

Population Job

Example: Detect unusual user behavior

Data view: webserver-logs-*

Population (over) field: user.name
Detector: high_count
By field: url.path (what pages they visit)

Result: Identifies users whose browsing patterns differ
        significantly from the typical user
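
As a detector configuration, this population setup corresponds to the following sketch (the job ID is illustrative):

```shell
# Population job: each user's request count per URL path is
# compared against the population of all users
curl -X PUT "localhost:9200/_ml/anomaly_detectors/user-behavior" \
  -H "Content-Type: application/json" \
  -d '{
    "analysis_config": {
      "bucket_span": "15m",
      "detectors": [
        {
          "function": "high_count",
          "by_field_name": "url.path",
          "over_field_name": "user.name"
        }
      ],
      "influencers": ["user.name", "url.path"]
    },
    "data_description": { "time_field": "@timestamp" }
  }'
```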

Bucket Span

The bucket span determines analysis granularity:

  Bucket Span   Best For                 Sensitivity
  5m            High-frequency metrics   Very sensitive, more noise
  15m           Server metrics, logs     Balanced
  1h            Business metrics         Less sensitive, less noise
  1d            Daily aggregates         Low sensitivity, trend-level

Rule of thumb: Set bucket span to match the shortest meaningful pattern in your data. If you expect issues to last at least 15 minutes before they matter, use 15m.

Detectors

Detectors define what the ML job looks for:

FUNCTIONS (what to measure):
  count           Number of documents
  high_count      Unusually high count
  low_count       Unusually low count
  mean            Average of a field
  high_mean       Unusually high average
  low_mean        Unusually low average
  sum             Total of a field
  high_sum        Unusually high total
  low_sum         Unusually low total
  median          Median value
  min / max       Extreme values
  distinct_count  Number of unique values
  rare            Rare values in a field
  freq_rare       Rarely seen values by frequency
  info_content    Information content changes
  metric          General metric analysis

MODIFIERS (how to split):
  by_field_name         Split into sub-buckets
  over_field_name       Population analysis
  partition_field_name  Separate models per value

Detector Examples

// Detect high request latency per service
{
  "function": "high_mean",
  "field_name": "response_time",
  "by_field_name": "endpoint",
  "partition_field_name": "service.name"
}

// Detect unusual error counts
{
  "function": "high_count",
  "by_field_name": "error.type"
}

// Detect rare HTTP status codes
{
  "function": "rare",
  "by_field_name": "response.status_code"
}

// Detect unusual geographic access patterns
{
  "function": "rare",
  "by_field_name": "geoip.country_iso_code",
  "over_field_name": "user.name"
}

Viewing Anomaly Results

Anomaly Explorer

Navigate to ML → Anomaly Detection → Anomaly Explorer.

┌─────────────────────────────────────────────────────────┐
│  Anomaly Explorer                     [Time Picker]      │
├─────────────────────────────────────────────────────────┤
│  Severity: [Critical ▼]  [Warning ▼]  [Minor ▼]         │
│                                                         │
│  Overall Anomaly Timeline:                              │
│  ░░░░░░█░░░░░█░░░░░░░░░░░██░░░░░░░░░░░░░░░░░░░░░      │
│                                                         │
│  Top Anomalies:                                         │
│  ┌────────────┬───────┬───────────┬──────────────────┐  │
│  │ Time       │ Score │ Detector  │ Description      │  │
│  │ Jan 15 10h │  92   │ high_mean │ Unusual CPU on   │  │
│  │            │       │           │ prod-web-03      │  │
│  │ Jan 15 8h  │  78   │ high_count│ Spike in errors  │  │
│  │            │       │           │ for /api/checkout│  │
│  │ Jan 14 23h │  65   │ rare      │ Unusual country  │  │
│  │            │       │           │ code: XX         │  │
│  └────────────┴───────┴───────────┴──────────────────┘  │
└─────────────────────────────────────────────────────────┘

Anomaly Scores

  Score Range   Severity      Meaning
  75-100        Critical      Highly unusual, investigate immediately
  50-74         Major         Significant deviation from normal
  25-49         Minor         Notable but may not require action
  0-24          Low/Warning   Slight deviation, informational

Single Metric Viewer

Detailed view for individual metrics:

┌──────────────────────────────────────────────────────┐
│  Single Metric Viewer                                │
│                                                      │
│  Job: ecommerce-order-volume                         │
│                                                      │
│  Expected range (gray band):                         │
│          ┌──────────────────────────┐                │
│  ────────│     ██████              │── Actual         │
│          │   ██      ██            │                  │
│  ────────│ ██          ████████    │── Expected       │
│          │                   ██████│                  │
│          └──────────────────────────┘                │
│                                                      │
│  ● = Anomaly point (colored by severity)             │
│  Gray band = Expected range                          │
│  Blue line = Actual value                            │
└──────────────────────────────────────────────────────┘

Click an anomaly point to see:

  • Anomaly score
  • Actual value vs expected
  • Typical value
  • Influencers (contributing factors)

Forecasting

Forecast future values based on learned patterns.

Creating a Forecast

  1. Open a job in Single Metric Viewer
  2. Click "Forecast"
  3. Set duration: 1 day, 1 week, etc.
  4. Click "Run"

                    Forecast
  ─────────────────|──────────────
  Historical data  | Predicted range
                   |  ╱──────────╲  (upper bound)
  ████████████████ | ────────────── (prediction)
                   |  ╲──────────╱  (lower bound)
          Today -->|
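
Forecasts can also be triggered through the API. A sketch against the single metric job created earlier:

```shell
# Run a 7-day forecast for an open job
curl -X POST "localhost:9200/_ml/anomaly_detectors/ecommerce-order-volume/_forecast" \
  -H "Content-Type: application/json" \
  -d '{ "duration": "7d" }'

# Forecast results are written alongside anomaly results in
# .ml-anomalies-* with result_type: model_forecast
```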

Forecast Considerations

✅ Works best with:
  - Consistent seasonal patterns
  - Sufficient historical data (2+ cycles)
  - Single metric jobs
  - Stationary data (no permanent trend changes)

❌ Not reliable for:
  - Highly irregular data
  - Very short history
  - Data affected by external unknown events
  - Multi-metric or population jobs (limited support)

Data Frame Analytics

Batch analysis jobs that run on entire datasets rather than streaming time-series data.

Outlier Detection

Find data points that don't fit the pattern:

1. Go to ML → Data Frame Analytics → Create job
2. Type: Outlier detection
3. Source index: kibana_sample_data_ecommerce
4. Destination index: ecommerce-outliers

Configuration:
  Included fields:
    - taxful_total_price
    - products.quantity
    - total_quantity
  
  Method: ensemble (default, combines multiple methods)
  N neighbors: 10

Result: Each document gets an outlier_score (0.0 to 1.0)
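
The equivalent data frame analytics job via the API (a sketch; the job must be started separately, and omitting method gives the default ensemble of several algorithms):

```shell
# Create the outlier detection job
curl -X PUT "localhost:9200/_ml/data_frame/analytics/ecommerce-outliers" \
  -H "Content-Type: application/json" \
  -d '{
    "source": { "index": "kibana_sample_data_ecommerce" },
    "dest": { "index": "ecommerce-outliers" },
    "analysis": {
      "outlier_detection": { "n_neighbors": 10 }
    },
    "analyzed_fields": {
      "includes": ["taxful_total_price", "products.quantity", "total_quantity"]
    }
  }'

# Start it
curl -X POST "localhost:9200/_ml/data_frame/analytics/ecommerce-outliers/_start"
```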

Regression

Predict a continuous numeric value:

Type: Regression
Source: house-prices-*
Destination: house-prices-predictions

Configuration:
  Dependent variable: price (what to predict)
  
  Training percent: 80% (80% train, 20% test)
  
  Included fields:
    - square_feet
    - bedrooms
    - bathrooms
    - location
    - year_built
    - lot_size

Result:
  - Model trained on 80% of data
  - Predictions added to destination index
  - Feature importance shows which fields matter most
  - Evaluation metrics (RMSE, R-squared)
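
Once predictions exist in the destination index, the evaluate API can compute these metrics on the held-out test split. A sketch, assuming the default predicted field name convention (ml.<dependent>_prediction):

```shell
# Evaluate regression results on the 20% test split only
curl -X POST "localhost:9200/_ml/data_frame/_evaluate" \
  -H "Content-Type: application/json" \
  -d '{
    "index": "house-prices-predictions",
    "query": { "term": { "ml.is_training": false } },
    "evaluation": {
      "regression": {
        "actual_field": "price",
        "predicted_field": "ml.price_prediction",
        "metrics": { "r_squared": {}, "mse": {} }
      }
    }
  }'
```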

Classification

Predict a category:

Type: Classification
Source: network-traffic-*
Destination: traffic-classified

Configuration:
  Dependent variable: is_malicious (true/false)
  
  Training percent: 80%
  
  Included fields:
    - bytes_in
    - bytes_out
    - duration
    - protocol
    - source_port
    - dest_port
    - packet_count

Result:
  - Documents get predicted class + probability
  - Confusion matrix shows accuracy
  - Feature importance for model interpretability

Viewing Data Frame Analytics Results

  1. Go to ML → Data Frame Analytics
  2. Click your job
  3. See:
    • Results explorer: Scatter plots, tables
    • Feature importance: Which fields drive predictions
    • Evaluation: Accuracy metrics

Feature Importance (Regression: house price):
  square_feet     ████████████████  0.45
  location        ██████████        0.28
  bedrooms        █████             0.12
  year_built      ████              0.10
  lot_size        ██                0.05

Creating ML Jobs via API

Anomaly Detection Job

# Create a job
curl -X PUT "localhost:9200/_ml/anomaly_detectors/api-latency-job" \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Detect unusual API response times",
    "analysis_config": {
      "bucket_span": "15m",
      "detectors": [
        {
          "function": "high_mean",
          "field_name": "response_time_ms",
          "by_field_name": "endpoint",
          "partition_field_name": "service.name"
        }
      ],
      "influencers": ["service.name", "endpoint", "host.name"]
    },
    "data_description": {
      "time_field": "@timestamp"
    },
    "analysis_limits": {
      "model_memory_limit": "256mb"
    }
  }'

# Create a datafeed
curl -X PUT "localhost:9200/_ml/datafeeds/datafeed-api-latency-job" \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "api-latency-job",
    "indices": ["apm-*"],
    "query": {
      "bool": {
        "filter": [
          { "term": { "transaction.type": "request" } }
        ]
      }
    }
  }'

# Start the datafeed
curl -X POST "localhost:9200/_ml/datafeeds/datafeed-api-latency-job/_start"

Check Job Status

# Job stats
curl "localhost:9200/_ml/anomaly_detectors/api-latency-job/_stats"

# Get anomaly results
# Get anomaly results (sort takes a field name; desc and page
# control ordering and result count)
curl "localhost:9200/_ml/anomaly_detectors/api-latency-job/results/records" \
  -H "Content-Type: application/json" \
  -d '{
    "sort": "record_score",
    "desc": true,
    "page": { "size": 10 }
  }'

Integrating ML with Alerts

Anomaly Alert Rule

  1. Go to Rules → Create rule
  2. Type: Anomaly detection alert
  3. Configure:
Name: Critical API Anomaly
ML Job: api-latency-job
Severity: 75 (critical only)
Check interval: 5 minutes

Result type: Record (individual anomaly)
  OR Bucket (overall bucket score)
  OR Influencer (top influencers)

Actions:
  Slack: |
    🤖 *ML Anomaly Detected*
    Job: {{context.jobIds}}
    Score: {{context.anomalyScore}}
    Top influencers: {{context.topInfluencers}}
    {{context.message}}

Integrating ML Results in Dashboards

  1. Create a saved search for anomaly results:

    • Index: .ml-anomalies-*
    • Filter: job_id: "api-latency-job" AND record_score > 50
  2. Add to dashboard:

    • Lens visualization: Anomaly score over time
    • Data table: Top anomalies with details

Lens chart:
  Data view: .ml-anomalies-*
  X-axis: timestamp (date histogram)
  Y-axis: max(record_score)
  Filter: job_id: "api-latency-job"
  Color: By record_score (gradient red)

Practical Examples

Example 1: E-Commerce Anomaly Detection

Job: ecommerce-revenue-anomaly
Purpose: Detect unusual revenue patterns

Detectors:
  1. high_sum(taxful_total_price)   → Revenue spikes
  2. low_sum(taxful_total_price)    → Revenue drops
  3. high_distinct_count(customer_id) → Unusual customer volume

Bucket span: 1h
Influencers: category.keyword, day_of_week, geoip.country_iso_code

Expected outcomes:
  - Detects flash sale spikes
  - Detects outage-related revenue drops
  - Catches bot activity (unusual customer patterns)

Example 2: Security Anomaly Detection

Job: security-login-anomaly
Purpose: Detect unusual login patterns

Detectors:
  1. high_count by source.ip                → Brute force
  2. rare by geoip.country_iso_code 
       over user.name                       → Unusual login location
  3. high_distinct_count(user.name) 
       by source.ip                         → Credential stuffing

Bucket span: 15m
Influencers: user.name, source.ip, geoip.country_iso_code

Alert: Score > 75 → PagerDuty + Slack #security

Example 3: Infrastructure Capacity Planning

Job: infra-capacity-forecast
Purpose: Predict when resources will be exhausted

Job configuration:
  Detector: mean(system.filesystem.used.pct)
  Partition: host.name (one model per host)

After training (2+ weeks of data):
  1. Open Single Metric Viewer
  2. Select host
  3. Forecast 30 days
  4. Identify when disk reaches 90%

Use for:
  - Proactive capacity upgrades
  - Budget planning
  - SLA compliance

Tips and Best Practices

Job Configuration

✅ Start with single metric jobs to learn
✅ Use appropriate bucket span (match your data cadence)
✅ Set influencers to help diagnose anomalies
✅ Use partition fields for per-entity analysis
✅ Set model memory limits to prevent resource exhaustion

❌ Don't use very small bucket spans (< 5m) without good reason
❌ Don't create too many detectors per job (< 10 recommended)
❌ Don't skip the validation step
❌ Don't expect immediate results (model needs training time)

Data Quality

✅ Ensure consistent data flow (gaps confuse the model)
✅ Pre-filter irrelevant data with datafeed queries
✅ Provide sufficient history (2+ seasonal cycles minimum)
✅ Use keyword fields for categorical analysis

❌ Don't train on data with known outages (or exclude those periods)
❌ Don't mix fundamentally different data sources in one job
❌ Don't ignore data gaps (they affect model quality)

Resource Management

ML node sizing:
  - RAM: 8GB minimum per node, 32GB+ for production
  - CPU: 4+ cores
  - Disk: Fast storage for model state

Model memory limits:
  - Simple jobs: 16-64MB
  - Multi-metric with partitions: 128-512MB
  - Population jobs: 256MB-2GB
  - Large cardinality: 1-4GB

Monitor ML memory usage in Kibana under Machine Learning → Memory Usage

Common Issues

Job not finding anomalies

Causes:

  1. Insufficient training data → Wait for 2+ seasonal cycles of data
  2. Bucket span too large → Try smaller intervals
  3. Data is truly consistent → No anomalies to find
  4. Wrong detector function → Review detector configuration

High model memory usage

Fix:

  1. Reduce partition field cardinality
  2. Increase bucket span (fewer buckets = less memory)
  3. Limit by_field to top N values
  4. Increase model_memory_limit if justified
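
If raising the limit is justified, it can be done through the update API. A sketch, reusing the job from the API section (the job must be closed before the limit can be changed):

```shell
# Close the job, raise the memory limit, reopen
curl -X POST "localhost:9200/_ml/anomaly_detectors/api-latency-job/_close"

curl -X POST "localhost:9200/_ml/anomaly_detectors/api-latency-job/_update" \
  -H "Content-Type: application/json" \
  -d '{ "analysis_limits": { "model_memory_limit": "512mb" } }'

curl -X POST "localhost:9200/_ml/anomaly_detectors/api-latency-job/_open"
```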

Too many false positives

Fix:

  1. Increase severity threshold in alerts (75+ for critical)
  2. Add custom rules to suppress known patterns
  3. Increase bucket span for less sensitivity
  4. Exclude known maintenance periods from training

# Add a calendar to exclude events
curl -X PUT "localhost:9200/_ml/calendars/maintenance" \
  -H "Content-Type: application/json" \
  -d '{ "job_ids": ["api-latency-job"] }'

curl -X POST "localhost:9200/_ml/calendars/maintenance/events" \
  -H "Content-Type: application/json" \
  -d '{
    "events": [
      {
        "description": "Weekly maintenance",
        "start_time": "2024-01-20T02:00:00Z",
        "end_time": "2024-01-20T06:00:00Z"
      }
    ]
  }'

Summary

In this chapter, you learned:

  • ✅ ML capabilities: anomaly detection, data frame analytics, forecasting
  • ✅ Creating anomaly detection jobs (single, multi-metric, population)
  • ✅ Understanding detectors, bucket spans, and influencers
  • ✅ Interpreting anomaly scores and results
  • ✅ Forecasting future values from learned patterns
  • ✅ Data frame analytics: outlier detection, regression, classification
  • ✅ Integrating ML with alerting and dashboards
  • ✅ Resource management and troubleshooting

Next: Monitoring application performance with APM!