Machine Learning in Kibana
Kibana's machine learning features automatically detect anomalies, forecast trends, and classify data. ML runs within Elasticsearch and is managed through Kibana's interface, requiring no external tools or data export.
ML Overview
┌────────────────────────────────────────────────────────┐
│ Kibana ML UI │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Anomaly │ │ Data Frame │ │ Trained │ │
│ │ Detection │ │ Analytics │ │ Models │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ Unsupervised Supervised & NLP & Custom │
│ time-series unsupervised models │
│ analysis batch analysis │
│ │
│ • Anomaly jobs • Outlier detection • NER │
│ • Forecasting • Regression • Sentiment │
│ • Population • Classification • Text embed │
│ analysis • Inference • ELSER │
└────────────────────────────────────────────────────────┘
Prerequisites
- License: Platinum or Enterprise (or trial)
- ML nodes: Dedicated ML nodes recommended for production
- Data: Time-series data with consistent patterns works best
Check ML availability:
curl -X GET "localhost:9200/_license" | jq '.license.type'
# "platinum", "enterprise", or "trial"
# Start a 30-day trial
curl -X POST "localhost:9200/_license/start_trial?acknowledge=true"
Anomaly Detection
Anomaly detection finds unexpected patterns in time-series data. It learns what "normal" looks like and flags deviations.
How It Works
Historical Data → ML Model Training → Real-time Scoring
Normal pattern: ──────/\──────/\──────/\────
Anomaly detected: ──────/\──────/\────┐ ↑
│ Unexpected spike
└───────────
The ML algorithm:
- Analyzes historical data to learn patterns
- Accounts for seasonality (hourly, daily, weekly)
- Builds a probability model of expected behavior
- Scores new data points by how unusual they are (0-100)
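The scoring idea can be sketched with a toy model. This is only a conceptual illustration, assuming a single Gaussian fitted to history; Elasticsearch's actual model is far richer (multi-modal, seasonal, and adaptive), but the principle is the same: the rarer an observation is under the learned distribution, the higher its score.

```python
import math
import statistics

def anomaly_score(history, value):
    """Toy anomaly score: map the Gaussian tail probability of `value`
    under the historical distribution to a 0-100 score."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history) or 1e-9
    z = abs(value - mu) / sigma
    # Two-sided tail probability: P(|X - mu| >= |value - mu|)
    p = math.erfc(z / math.sqrt(2))
    # Rarer observations -> higher score, capped at 100
    return min(100.0, -10.0 * math.log10(max(p, 1e-10)))

history = [100, 98, 103, 101, 99, 102, 100, 97]
print(anomaly_score(history, 101))  # within the normal range -> low score
print(anomaly_score(history, 160))  # far outside the range -> high score
```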
Creating an Anomaly Detection Job
Via the UI
- Go to Machine Learning → Anomaly Detection → Create job
- Select your data view (e.g., kibana_sample_data_ecommerce)
- Choose a job type:
| Job Type | Description | Use Case |
|---|---|---|
| Single metric | One metric, simple analysis | Total request count |
| Multi-metric | Multiple metrics, same entity | CPU + memory + disk per host |
| Population | Compare individuals to group | Users vs typical user behavior |
| Advanced | Full configuration control | Complex custom analysis |
| Categorization | Group log messages | Identify new error types |
Single Metric Job
Example: Detect unusual order volume
Step 1: Choose data view
kibana_sample_data_ecommerce
Step 2: Pick time range
Use full data range
Step 3: Configure
Aggregation: Count
Bucket span: 1h
(How often to check - shorter = more sensitive, longer = less noisy)
Step 4: Job details
Job ID: ecommerce-order-volume
Group: ecommerce
Description: "Detect unusual order volume"
Step 5: Validation
Review settings, check for issues
Step 6: Create and open
Multi-Metric Job
Example: Monitor server health
Data view: metricbeat-*
Detectors:
1. high_mean(system.cpu.total.pct)
2. high_mean(system.memory.used.pct)
3. high_mean(system.filesystem.used.pct)
Split field: host.name
Bucket span: 15m
Result: Each host is analyzed independently for CPU, memory, and disk anomalies
Population Job
Example: Detect unusual user behavior
Data view: webserver-logs-*
Population field: user.name
Detector: high_count
Over field: user.name
By field: url.path (what pages they visit)
Result: Identifies users whose browsing patterns differ
significantly from the typical user
Bucket Span
The bucket span determines analysis granularity:
| Bucket Span | Best For | Sensitivity |
|---|---|---|
| 5m | High-frequency metrics | Very sensitive, more noise |
| 15m | Server metrics, logs | Balanced |
| 1h | Business metrics | Less sensitive, less noise |
| 1d | Daily aggregates | Low sensitivity, trend-level |
Rule of thumb: Set bucket span to match the shortest meaningful pattern in your data. If you expect issues to last at least 15 minutes before they matter, use 15m.
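That rule of thumb can be expressed as a small helper. This is illustrative only; Kibana's own bucket span estimator inspects the actual data, and the thresholds below simply mirror the table above.

```python
def suggest_bucket_span(min_issue_minutes):
    """Pick the smallest standard bucket span that is no shorter than
    the shortest issue duration you care about."""
    for minutes, span in [(5, "5m"), (15, "15m"), (60, "1h"), (1440, "1d")]:
        if min_issue_minutes <= minutes:
            return span
    return "1d"

print(suggest_bucket_span(10))   # 15m: issues shorter than 15 minutes don't matter
print(suggest_bucket_span(60))   # 1h: hourly business metrics
```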
Detectors
Detectors define what the ML job looks for:
FUNCTIONS (what to measure):
count Number of documents
high_count Unusually high count
low_count Unusually low count
mean Average of a field
high_mean Unusually high average
low_mean Unusually low average
sum Total of a field
high_sum Unusually high total
low_sum Unusually low total
median Median value
min / max Extreme values
distinct_count Number of unique values
rare Rare values in a field
freq_rare Rarely seen values by frequency
info_content Information content changes
metric General metric analysis
MODIFIERS (how to split):
by_field_name Split into sub-buckets
over_field_name Population analysis
partition_field_name Separate models per value
Detector Examples
// Detect high request latency per service
{
"function": "high_mean",
"field_name": "response_time",
"by_field_name": "endpoint",
"partition_field_name": "service.name"
}
// Detect unusual error counts
{
"function": "high_count",
"by_field_name": "error.type"
}
// Detect rare HTTP status codes
{
"function": "rare",
"by_field_name": "response.status_code"
}
// Detect unusual geographic access patterns
{
"function": "rare",
"by_field_name": "geoip.country_iso_code",
"over_field_name": "user.name"
}
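Detector objects like these plug into a job's analysis_config. A sketch of assembling the full request body in Python follows; the field names match the Elasticsearch ML job schema shown later in this chapter, while the description and detector fields here are hypothetical examples.

```python
import json

# Two of the detector examples above, reused verbatim
detectors = [
    {
        "function": "high_mean",
        "field_name": "response_time",
        "by_field_name": "endpoint",
        "partition_field_name": "service.name",
    },
    {"function": "high_count", "by_field_name": "error.type"},
]

job_body = {
    "description": "Latency and error anomalies per service",
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": detectors,
        "influencers": ["service.name", "endpoint"],
    },
    "data_description": {"time_field": "@timestamp"},
}

# This JSON is what you would PUT to _ml/anomaly_detectors/<job_id>
print(json.dumps(job_body, indent=2))
```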
Viewing Anomaly Results
Anomaly Explorer
Navigate to ML → Anomaly Detection → Anomaly Explorer.
┌─────────────────────────────────────────────────────────┐
│ Anomaly Explorer [Time Picker] │
├─────────────────────────────────────────────────────────┤
│ Severity: [Critical ▼] [Warning ▼] [Minor ▼] │
│ │
│ Overall Anomaly Timeline: │
│ ░░░░░░█░░░░░█░░░░░░░░░░░██░░░░░░░░░░░░░░░░░░░░░ │
│ │
│ Top Anomalies: │
│ ┌────────────┬───────┬───────────┬──────────────────┐ │
│ │ Time │ Score │ Detector │ Description │ │
│ │ Jan 15 10h │ 92 │ high_mean │ Unusual CPU on │ │
│ │ │ │ │ prod-web-03 │ │
│ │ Jan 15 8h │ 78 │ high_count│ Spike in errors │ │
│ │ │ │ │ for /api/checkout│ │
│ │ Jan 14 23h │ 65 │ rare │ Unusual country │ │
│ │ │ │ │ code: XX │ │
│ └────────────┴───────┴───────────┴──────────────────┘ │
└─────────────────────────────────────────────────────────┘
Anomaly Scores
| Score Range | Severity | Meaning |
|---|---|---|
| 75-100 | Critical | Highly unusual, investigate immediately |
| 50-74 | Major | Significant deviation from normal |
| 25-49 | Minor | Notable but may not require action |
| 0-24 | Low/Warning | Slight deviation, informational |
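The severity bands map directly to code, with thresholds as in the table above:

```python
def severity(score):
    """Map a 0-100 anomaly score to its severity label."""
    if score >= 75:
        return "critical"
    if score >= 50:
        return "major"
    if score >= 25:
        return "minor"
    return "warning"

print(severity(92))  # critical
print(severity(65))  # major
```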
Single Metric Viewer
Detailed view for individual metrics:
┌──────────────────────────────────────────────────────┐
│ Single Metric Viewer │
│ │
│ Job: ecommerce-order-volume │
│ │
│ Expected range (gray band): │
│ ┌──────────────────────────┐ │
│ ────────│ ██████ │── Actual │
│ │ ██ ██ │ │
│ ────────│ ██ ████████ │── Expected │
│ │ ██████│ │
│ └──────────────────────────┘ │
│ │
│ ● = Anomaly point (colored by severity) │
│ Gray band = Expected range │
│ Blue line = Actual value │
└──────────────────────────────────────────────────────┘
Click an anomaly point to see:
- Anomaly score
- Actual value vs expected
- Typical value
- Influencers (contributing factors)
Forecasting
Forecast future values based on learned patterns.
Creating a Forecast
- Open a job in Single Metric Viewer
- Click "Forecast"
- Set duration: 1 day, 1 week, etc.
- Click "Run"
Forecast
─────────────────|──────────────
Historical data | Predicted range
| ╱──────────╲ (upper bound)
████████████████ | ────────────── (prediction)
| ╲──────────╱ (lower bound)
Today -->|
Forecast Considerations
✅ Works best with:
- Consistent seasonal patterns
- Sufficient historical data (2+ cycles)
- Single metric jobs
- Stationary data (no permanent trend changes)
❌ Not reliable for:
- Highly irregular data
- Very short history
- Data affected by external unknown events
- Multi-metric or population jobs (limited support)
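Why two or more seasonal cycles matter can be seen in a toy forecaster that simply averages matching positions across previous cycles. The real model is far more sophisticated (it also produces the upper and lower bounds shown above), but the data requirement is the same: without at least two complete cycles, there is no repeated pattern to learn.

```python
def naive_seasonal_forecast(history, cycle_len):
    """Predict the next cycle by averaging the same position
    across all complete cycles in `history`."""
    cycles = len(history) // cycle_len
    if cycles < 2:
        raise ValueError("need at least 2 complete cycles to see the pattern")
    forecast = []
    for pos in range(cycle_len):
        vals = [history[c * cycle_len + pos] for c in range(cycles)]
        forecast.append(sum(vals) / cycles)
    return forecast

# Two "daily" cycles, 4 buckets per day for brevity
history = [10, 50, 40, 15,   12, 48, 44, 13]
print(naive_seasonal_forecast(history, cycle_len=4))  # -> [11.0, 49.0, 42.0, 14.0]
```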
Data Frame Analytics
Batch analysis jobs that run on entire datasets rather than streaming time-series data.
Outlier Detection
Find data points that don't fit the pattern:
1. Go to ML → Data Frame Analytics → Create job
2. Type: Outlier detection
3. Source index: kibana_sample_data_ecommerce
4. Destination index: ecommerce-outliers
Configuration:
Included fields:
- taxful_total_price
- products.quantity
- total_quantity
Method: ensemble (default, combines multiple methods)
N neighbors: 10
Result: Each document gets an outlier_score (0.0 to 1.0)
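One of the methods the ensemble combines is distance-based: points far from their nearest neighbors score high. A toy one-feature version, assuming hypothetical price data, shows the shape of the result:

```python
def knn_outlier_scores(points, k=2):
    """Toy distance-based outlier score: distance to the k-th nearest
    neighbor, normalized to 0..1 across the dataset."""
    dists = []
    for i, p in enumerate(points):
        others = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        dists.append(others[k - 1])
    top = max(dists) or 1.0
    return [d / top for d in dists]

prices = [12.0, 13.5, 11.8, 12.7, 95.0]  # one obvious outlier
scores = knn_outlier_scores(prices)
print(scores)  # the last point scores 1.0, the rest stay near 0
```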
Regression
Predict a continuous numeric value:
Type: Regression
Source: house-prices-*
Destination: house-prices-predictions
Configuration:
Dependent variable: price (what to predict)
Training percent: 80% (80% train, 20% test)
Included fields:
- square_feet
- bedrooms
- bathrooms
- location
- year_built
- lot_size
Result:
- Model trained on 80% of data
- Predictions added to destination index
- Feature importance shows which fields matter most
- Evaluation metrics (RMSE, R-squared)
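The evaluation metrics are standard statistics; computing them on a tiny hypothetical set of house-price predictions makes their meaning concrete:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: average prediction error, in price units."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Fraction of variance in `actual` explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual    = [300_000, 450_000, 210_000, 520_000]
predicted = [310_000, 440_000, 230_000, 500_000]
print(round(rmse(actual, predicted)))           # -> 15811
print(round(r_squared(actual, predicted), 3))   # -> 0.983
```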
Classification
Predict a category:
Type: Classification
Source: network-traffic-*
Destination: traffic-classified
Configuration:
Dependent variable: is_malicious (true/false)
Training percent: 80%
Included fields:
- bytes_in
- bytes_out
- duration
- protocol
- source_port
- dest_port
- packet_count
Result:
- Documents get predicted class + probability
- Confusion matrix shows accuracy
- Feature importance for model interpretability
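For a binary job like this one, the confusion matrix is just the counts of (actual, predicted) pairs. A small sketch with toy labels:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Counts of (actual, predicted) pairs for a binary classifier."""
    return Counter(zip(actual, predicted))

actual    = [True, True, False, False, True, False]
predicted = [True, False, False, False, True, True]
cm = confusion_matrix(actual, predicted)

tp, fn = cm[(True, True)], cm[(True, False)]
fp, tn = cm[(False, True)], cm[(False, False)]
accuracy = (tp + tn) / len(actual)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} accuracy={accuracy:.2f}")
```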
Viewing Data Frame Analytics Results
- Go to ML → Data Frame Analytics
- Click your job
- See:
- Results explorer: Scatter plots, tables
- Feature importance: Which fields drive predictions
- Evaluation: Accuracy metrics
Feature Importance (Regression: house price):
square_feet ████████████████ 0.45
location ██████████ 0.28
bedrooms █████ 0.12
year_built ████ 0.10
lot_size ██ 0.05
Creating ML Jobs via API
Anomaly Detection Job
# Create a job
curl -X PUT "localhost:9200/_ml/anomaly_detectors/api-latency-job" \
-H "Content-Type: application/json" \
-d '{
"description": "Detect unusual API response times",
"analysis_config": {
"bucket_span": "15m",
"detectors": [
{
"function": "high_mean",
"field_name": "response_time_ms",
"by_field_name": "endpoint",
"partition_field_name": "service.name"
}
],
"influencers": ["service.name", "endpoint", "host.name"]
},
"data_description": {
"time_field": "@timestamp"
},
"analysis_limits": {
"model_memory_limit": "256mb"
}
}'
# Create a datafeed
curl -X PUT "localhost:9200/_ml/datafeeds/datafeed-api-latency-job" \
-H "Content-Type: application/json" \
-d '{
"job_id": "api-latency-job",
"indices": ["apm-*"],
"query": {
"bool": {
"filter": [
{ "term": { "transaction.type": "request" } }
]
}
}
}'
# Start the datafeed
curl -X POST "localhost:9200/_ml/datafeeds/datafeed-api-latency-job/_start"
Check Job Status
# Job stats
curl "localhost:9200/_ml/anomaly_detectors/api-latency-job/_stats"
# Get anomaly results
curl "localhost:9200/_ml/anomaly_detectors/api-latency-job/results/records" \
-H "Content-Type: application/json" \
-d '{
  "sort": "record_score",
  "desc": true,
  "page": { "size": 10 }
}'
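The records query body can also be assembled programmatically. This sketch only builds the JSON for the get-records API (sort field, descending order, page size); it does not call the cluster:

```python
import json

records_query = {
    "sort": "record_score",   # field to sort results by
    "desc": True,             # highest scores first
    "page": {"size": 10},
    "exclude_interim": True,  # skip partial (interim) buckets
}
print(json.dumps(records_query, indent=2))
```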
Integrating ML with Alerts
Anomaly Alert Rule
- Go to Rules → Create rule
- Type: Anomaly detection alert
- Configure:
Name: Critical API Anomaly
ML Job: api-latency-job
Severity: 75 (critical only)
Check interval: 5 minutes
Result type: Record (individual anomaly)
OR Bucket (overall bucket score)
OR Influencer (top influencers)
Actions:
Slack: |
🤖 *ML Anomaly Detected*
Job: {{context.jobIds}}
Score: {{context.anomalyScore}}
Top influencers: {{context.topInfluencers}}
{{context.message}}
Integrating ML Results in Dashboards
Create a saved search for anomaly results:
- Index: .ml-anomalies-*
- Filter: job_id: "api-latency-job" AND record_score > 50
Add to dashboard:
- Lens visualization: Anomaly score over time
- Data table: Top anomalies with details
Lens chart:
Data view: .ml-anomalies-*
X-axis: timestamp (date histogram)
Y-axis: max(record_score)
Filter: job_id: "api-latency-job"
Color: By record_score (gradient red)
Practical Examples
Example 1: E-Commerce Anomaly Detection
Job: ecommerce-revenue-anomaly
Purpose: Detect unusual revenue patterns
Detectors:
1. high_sum(taxful_total_price) → Revenue spikes
2. low_sum(taxful_total_price) → Revenue drops
3. high_distinct_count(customer_id) → Unusual customer volume
Bucket span: 1h
Influencers: category.keyword, day_of_week, geoip.country_iso_code
Expected outcomes:
- Detects flash sale spikes
- Detects outage-related revenue drops
- Catches bot activity (unusual customer patterns)
Example 2: Security Anomaly Detection
Job: security-login-anomaly
Purpose: Detect unusual login patterns
Detectors:
1. high_count by source.ip → Brute force
2. rare by geoip.country_iso_code
over user.name → Unusual login location
3. high_distinct_count(user.name)
by source.ip → Credential stuffing
Bucket span: 15m
Influencers: user.name, source.ip, geoip.country_iso_code
Alert: Score > 75 → PagerDuty + Slack #security
Example 3: Infrastructure Capacity Planning
Job: infra-capacity-forecast
Purpose: Predict when resources will be exhausted
Single metric job:
Detector: mean(system.filesystem.used.pct)
Partition: host.name
After training (2+ weeks of data):
1. Open Single Metric Viewer
2. Select host
3. Forecast 30 days
4. Identify when disk reaches 90%
Use for:
- Proactive capacity upgrades
- Budget planning
- SLA compliance
Tips and Best Practices
Job Configuration
✅ Start with single metric jobs to learn
✅ Use appropriate bucket span (match your data cadence)
✅ Set influencers to help diagnose anomalies
✅ Use partition fields for per-entity analysis
✅ Set model memory limits to prevent resource exhaustion
❌ Don't use very small bucket spans (< 5m) without good reason
❌ Don't create too many detectors per job (fewer than 10 is recommended)
❌ Don't skip the validation step
❌ Don't expect immediate results (model needs training time)
Data Quality
✅ Ensure consistent data flow (gaps confuse the model)
✅ Pre-filter irrelevant data with datafeed queries
✅ Provide sufficient history (2+ seasonal cycles minimum)
✅ Use keyword fields for categorical analysis
❌ Don't train on data with known outages (or exclude those periods)
❌ Don't mix fundamentally different data sources in one job
❌ Don't ignore data gaps (they affect model quality)
Resource Management
ML node sizing:
- RAM: 8GB minimum per node, 32GB+ for production
- CPU: 4+ cores
- Disk: Fast storage for model state
Model memory limits:
- Simple jobs: 16-64MB
- Multi-metric with partitions: 128-512MB
- Population jobs: 256MB-2GB
- Large cardinality: 1-4GB
Monitor memory use per job in the ML job management pages (each job's counts report model_bytes and memory_status)
Common Issues
Job not finding anomalies
Causes:
- Insufficient training data → Wait until the job has seen 2+ seasonal cycles
- Bucket span too large → Try smaller intervals
- Data is truly consistent → No anomalies to find
- Wrong detector function → Review detector configuration
High model memory usage
Fix:
- Reduce partition field cardinality
- Increase bucket span (fewer buckets = less memory)
- Limit by_field cardinality to the top N values
- Increase model_memory_limit if justified
Too many false positives
Fix:
- Increase severity threshold in alerts (75+ for critical)
- Add custom rules to suppress known patterns
- Increase bucket span for less sensitivity
- Exclude known maintenance periods from training
# Add a calendar to exclude events
curl -X PUT "localhost:9200/_ml/calendars/maintenance" \
-H "Content-Type: application/json" \
-d '{ "job_ids": ["api-latency-job"] }'
curl -X POST "localhost:9200/_ml/calendars/maintenance/events" \
-H "Content-Type: application/json" \
-d '{
"events": [
{
"description": "Weekly maintenance",
"start_time": "2024-01-20T02:00:00Z",
"end_time": "2024-01-20T06:00:00Z"
}
]
}'
Summary
In this chapter, you learned:
- ✅ ML capabilities: anomaly detection, data frame analytics, forecasting
- ✅ Creating anomaly detection jobs (single, multi-metric, population)
- ✅ Understanding detectors, bucket spans, and influencers
- ✅ Interpreting anomaly scores and results
- ✅ Forecasting future values from learned patterns
- ✅ Data frame analytics: outlier detection, regression, classification
- ✅ Integrating ML with alerting and dashboards
- ✅ Resource management and troubleshooting
Next: Monitoring application performance with APM!