Best Practices and Tips

This chapter consolidates practical advice for running Kibana effectively in production. It covers performance, organization, naming conventions, team workflows, and common pitfalls.

Dashboard Design

Layout Principles

1. Most important information at the top
2. Summary → Detail (top to bottom)
3. Time-series charts in wide panels
4. Related metrics side by side
5. Controls and filters at the very top

Recommended layout:
┌─────────────────────────────────────────────────────────┐
│ [Filters] [Controls] [Time Range]                       │ Row 1
├────────┬────────┬────────┬──────────────────────────────┤
│ KPI 1  │ KPI 2  │ KPI 3  │ KPI 4                        │ Row 2
├────────┴────────┴────────┴──────────────────────────────┤
│ Time-series chart (full width)                          │ Row 3
├──────────────────────────┬──────────────────────────────┤
│ Breakdown chart          │ Breakdown chart              │ Row 4
├──────────────────────────┴──────────────────────────────┤
│ Detail table (full width)                               │ Row 5
└─────────────────────────────────────────────────────────┘

Panel Count

Keep dashboards focused:

 8-15 panels: Fast loading, clear purpose
 15-25 panels: Acceptable for broad views
 30+ panels: Split into linked dashboards

Technique: Use drilldowns to link an overview dashboard
to detail dashboards instead of packing everything into one.

Color Consistency

Establish a color scheme and stick to it:

Status colors (across all dashboards):
 Green: #00BFA5 Success, healthy, within SLA
 Yellow: #FFB74D Warning, degraded, approaching limit
 Red: #FF5252 Error, critical, SLA breach
 Blue: #448AFF Informational, neutral
 Gray: #9E9E9E Inactive, disabled, no data

Category colors (per domain):
 Assign fixed colors to categories so they're recognizable:
 - "Men's Clothing" always uses blue
 - "Women's Clothing" always uses pink
 - "Shoes" always uses brown

Titles and Labels

Good:
 "Daily Revenue ($)": clear metric, clear unit
 "Error Rate (%) - Last 24h": metric, unit, time context
 "Top 10 Endpoints by Latency": scope, dimension, metric

Avoid:
 "Chart 1": meaningless
 "taxful_total_price": raw field name
 "Data": too vague

Naming Conventions

Saved Objects

Use a consistent naming scheme for all saved objects:

Pattern: [Team/Domain] Object Type - Description

Dashboards:
 [Ops] Overview - Production Health
 [Ops] Detail - API Gateway
 [Sales] Overview - Revenue Metrics
 [Sales] Detail - Product Performance

Visualizations:
 [Ops] Metric - Error Rate
 [Ops] Line - Request Latency p95
 [Sales] Bar - Top Products by Revenue
 [Sales] Pie - Revenue by Category

Saved Searches:
 [Ops] Search - 5xx Errors Last 24h
 [Sales] Search - High Value Orders

Index Patterns / Data Views:
 logs-prod-* (Production Logs)
 logs-staging-* (Staging Logs)
 metrics-prod-* (Production Metrics)
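
A naming scheme is easier to keep when titles are checked automatically, for example before an export is committed. The following is a minimal sketch, assuming the "[Team] Type - Description" pattern above; the regular expression, the ALLOWED_TYPES set, and the function name are illustrative and not part of Kibana.

```python
import re

# Hypothetical list of object-type words, taken from the examples above.
ALLOWED_TYPES = {"Overview", "Detail", "Metric", "Line", "Bar", "Pie", "Search"}

# Matches "[Team] Type - Description".
TITLE_PATTERN = re.compile(
    r"^\[(?P<team>[^\[\]]+)\] (?P<type>\w+) - (?P<description>\S.*)$"
)


def check_title(title: str) -> bool:
    """Return True if a saved-object title follows the naming pattern."""
    match = TITLE_PATTERN.match(title)
    return bool(match) and match.group("type") in ALLOWED_TYPES
```

With these assumptions, "[Ops] Overview - Production Health" passes and "Chart 1" does not.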

Tags

Apply tags consistently:

Tag categories:
 Team: ops, dev, sales, marketing, security
 Environment: production, staging, development
 Type: monitoring, analytics, reporting, investigation
 Status: active, archived, draft, template
 Priority: critical, important, reference

Example:
 Dashboard: "[Ops] Overview - Production Health"
 Tags: [ops, production, monitoring, critical]

Spaces

Space naming:
 production → Production monitoring and dashboards
 development → Dev/test dashboards and experiments
 marketing → Marketing analytics
 security → Security operations
 shared → Cross-team dashboards

Keep it simple: one space per team or function.

Query Performance

Index and Field Optimization

 Do: Use keyword fields for filtering and aggregations
 category.keyword: "Shoes"

 Avoid: Text fields for aggregations
 category: "Shoes" (analyzed field; not aggregatable by default, and memory-heavy if fielddata is enabled)

 Do: Use date histograms with appropriate intervals
 Auto interval or 1h for daily views, 1d for monthly

 Avoid: Tiny intervals on large time ranges
 1-minute intervals over 1 year = too many buckets
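
To see why small intervals on long ranges are a problem, it helps to count the buckets a date histogram would need. The helper below is a small sketch of that arithmetic; the function name is made up for illustration.

```python
from datetime import timedelta


def bucket_count(time_range: timedelta, interval: timedelta) -> int:
    """Number of date histogram buckets needed to cover time_range."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    # Ceiling division: a partial bucket at the end still counts.
    return -(-time_range // interval)


print(bucket_count(timedelta(days=365), timedelta(minutes=1)))  # 1-minute buckets over a year
print(bucket_count(timedelta(days=1), timedelta(hours=1)))      # hourly buckets over a day
```

One-minute buckets over 365 days come to 525,600 buckets per series, while hourly buckets over a day come to 24.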

Query Patterns

Fast queries:
 Term filter on keyword field
 Range filter on numeric/date field
 Bool filter combining few conditions
 Date histogram with auto interval

Slow queries:
 Wildcard prefix (*something)
 Regex on large text fields
 High-cardinality terms aggregation (top 10000)
 Nested aggregations 4+ levels deep
 Scripts in aggregations
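
As an illustration of the "fast" patterns, the sketch below builds a query body that combines a term filter on a keyword field with a range filter on a date field, both inside a bool filter. The field names come from the examples in this chapter; the function name and the sample timestamps are assumptions.

```python
import json


def build_fast_query(category: str, gte: str, lt: str) -> dict:
    """Bool query that uses only filter clauses (no scoring involved)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Exact match on a keyword field.
                    {"term": {"category.keyword": category}},
                    # Bounded time range on the date field.
                    {"range": {"@timestamp": {"gte": gte, "lt": lt}}},
                ]
            }
        }
    }


body = build_fast_query("Shoes", "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into Dev Tools as the body of a _search request for comparison against a wildcard or regex version of the same question.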

Time Range Strategy

Real-time dashboards: Last 15-30 minutes
Operational monitoring: Last 1-4 hours
Daily review: Last 24 hours / Today
Weekly reports: Last 7 days
Monthly analysis: Last 30 days
Historical research: Custom range (as narrow as possible)

Tip: Always set a time range. Querying "all time" on
production indices is the #1 cause of slow dashboards.

Caching

Kibana and Elasticsearch cache query results. Maximize cache hits:

 Use filters (cacheable) over query_string (not always cached)
 Round time ranges to hour/day boundaries
 Use consistent queries across dashboard panels
 Keep query caching enabled in Elasticsearch (static index setting):
 index.queries.cache.enabled: true (default)

Elasticsearch request cache works best when:
 - Shard data doesn't change (rolled-over indices)
 - Same query is repeated
 - Time range uses "now" rounding: now/h, now/d

Data View Strategy

Naming and Organization

Convention: source-environment-*

Examples:
 filebeat-prod-* All production Filebeat logs
 metricbeat-prod-* All production Metricbeat metrics
 apm-prod-* All production APM data
 custom-orders-* Custom order data (all environments)

Avoid:
 * (everything) Too broad, slow, confusing field list
 logs-* Ambiguous: prod? dev? staging?

Field Formatting

Set up field formatting once in the data view so every dashboard benefits:

Standard formatting:
 response → Color (green 2xx, yellow 3xx, red 4xx/5xx)
 bytes → Bytes (auto KB/MB/GB)
 response_time → Duration (ms)
 price → Currency ($0,0.00)
 url → URL (clickable)
 percentage → Percent (0.00%)
 ip_address → String (no special formatting)

Runtime Fields

Use runtime fields for common calculations so every dashboard has access:

Useful runtime fields:
 hour_of_day → Extract hour from @timestamp
 day_of_week → Extract day name
 response_class → "2xx", "3xx", "4xx", "5xx" from response code
 sla_status → "within_sla" / "breach" from response_time
 environment → Extract from index name or hostname

Alerting Best Practices

Alert Design

 Do: Alert on symptoms, not causes
 "Error rate > 5%" (symptom)
 NOT "Pod restarted" (cause - may be normal)

 Do: Include runbook links in alert messages
 "Error rate high. Runbook: https://wiki.example.com/runbooks/high-error-rate"

 Do: Set appropriate thresholds (not too sensitive)
 Test thresholds against historical data before enabling

 Do: Use tiered severity
 Warning: error_rate > 2% → Slack #monitoring
 Critical: error_rate > 10% → PagerDuty + Slack #incidents

 Avoid: Alerting on everything
 Alert fatigue = ignored alerts = missed incidents
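
The tiered-severity example can be read as a simple decision function. The sketch below uses the 2% and 10% thresholds and the channels named above; the function and constant names are illustrative, and in Kibana the same effect is normally achieved with separate rules or action groups per severity.

```python
from typing import List, Optional, Tuple

WARNING_THRESHOLD = 0.02   # 2%
CRITICAL_THRESHOLD = 0.10  # 10%


def classify(error_rate: float) -> Optional[Tuple[str, List[str]]]:
    """Return (severity, channels) for an error rate, or None if no alert is due."""
    if error_rate > CRITICAL_THRESHOLD:
        return "critical", ["PagerDuty", "Slack #incidents"]
    if error_rate > WARNING_THRESHOLD:
        return "warning", ["Slack #monitoring"]
    return None
```

Checking the critical tier first ensures a 12% error rate pages on-call instead of only posting a warning.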

Alert Noise Reduction

Techniques:
1. Aggregate: Group by service instead of per-instance
2. Throttle: Send at most once per hour while condition persists
3. Delay: Require condition true for 5+ minutes before alerting
4. Exclude: Filter out known false positives (health checks, test data)
5. Schedule: Suppress during maintenance windows
6. Snooze: Temporarily silence during known issues
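
Kibana rules expose settings for most of these techniques, so the following is not something to deploy; it is a sketch of how "delay" and "throttle" interact, using the 5-minute and 1-hour values from the list. The class name and its interface are hypothetical.

```python
from typing import Optional


class AlertGate:
    """Notify only after the condition has held for delay_s seconds,
    then at most once every throttle_s seconds while it persists."""

    def __init__(self, delay_s: float = 300, throttle_s: float = 3600) -> None:
        self.delay_s = delay_s
        self.throttle_s = throttle_s
        self._breach_started: Optional[float] = None
        self._last_notified: Optional[float] = None

    def should_notify(self, condition_met: bool, now: float) -> bool:
        if not condition_met:
            # Condition cleared: the next breach must persist again.
            self._breach_started = None
            self._last_notified = None
            return False
        if self._breach_started is None:
            self._breach_started = now
        if now - self._breach_started < self.delay_s:
            return False  # still inside the delay window
        if (self._last_notified is not None
                and now - self._last_notified < self.throttle_s):
            return False  # throttled
        self._last_notified = now
        return True
```

A short spike that clears within five minutes never notifies, and a long outage notifies once per hour instead of on every check.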

Recovery Actions

Always configure recovery (resolved) actions:

Alert fired:
 "Error rate 12% on payment-service (threshold: 5%)"

Alert recovered:
 "Error rate 1.2% on payment-service - back to normal
 Duration: 23 minutes"

Recovery actions help close the loop and prevent confusion
about whether an issue is still ongoing.

Security Best Practices

Access Control

Principles:
1. Least privilege: Users get minimum access needed
2. Role-based: Define roles per function, assign to users
3. Space isolation: Separate environments and teams
4. Audit: Enable audit logging in production

Common roles:
 viewer → Read-only dashboards (stakeholders)
 analyst → Read data + create visualizations (analysts)
 editor → Full dashboard management (dashboard authors)
 alert_manager → Manage alerts (on-call engineers)
 admin → Full Kibana management (admins only)

Sensitive Data

 Use field-level security to hide PII fields
 Use document-level security for multi-tenant data
 Never display raw credit card or SSN fields
 Mask email addresses in shared dashboards
 Use runtime fields to create masked versions of sensitive data

Example runtime field:
 masked_email: "jo***@example.com" (from "john@example.com")
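
The masking rule in the example (keep the first two characters of the local part, hide the rest) can be sketched as follows. A real runtime field would implement this in Painless; the Python function below only illustrates the transformation, and its name and the `visible` parameter are assumptions.

```python
def mask_email(email: str, visible: int = 2) -> str:
    """Keep the first `visible` characters of the local part and mask the rest."""
    local, sep, domain = email.partition("@")
    if not sep or not local or not domain:
        # Not a recognisable address: mask everything rather than leak it.
        return "***"
    return f"{local[:visible]}***@{domain}"


print(mask_email("john@example.com"))
```

Note that a runtime field only adds a masked copy; the original field still needs field-level security to stay hidden.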

Production Hardening

# kibana.yml production settings
server.ssl.enabled: true
server.ssl.certificate: /path/to/cert.pem
server.ssl.key: /path/to/key.pem

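# Security itself is enabled in elasticsearch.yml (xpack.security.enabled: true);
# Kibana 8.x no longer accepts that setting in kibana.yml.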
xpack.encryptedSavedObjects.encryptionKey: "min-32-char-key"
xpack.reporting.encryptionKey: "min-32-char-key"
xpack.security.session.idleTimeout: "1h"
xpack.security.session.lifespan: "8h"

# Disable telemetry in production
telemetry.enabled: false

# CSP headers
csp.strict: true
csp.warnLegacyBrowsers: false

Operational Workflow

Dashboard Lifecycle

1. DRAFT
 Create in development space
 Use sample or dev data
 Iterate on layout and queries

2. REVIEW
 Share with stakeholders
 Gather feedback
 Test with production-like data

3. DEPLOY
 Move to production space
 Connect to production data views
 Set up access controls

4. MAINTAIN
 Monitor performance
 Update as data schema changes
 Archive when no longer needed
 Version control via export/import

Version Control for Dashboards

Export dashboards as NDJSON and store in git:

# Export all dashboards
curl -X POST "localhost:5601/api/saved_objects/_export" \
 -H "kbn-xsrf: true" \
 -H "Content-Type: application/json" \
 -d '{
 "type": ["dashboard", "visualization", "search", "lens"],
 "includeReferencesDeep": true
 }' > kibana-dashboards.ndjson

# Commit to git
git add kibana-dashboards.ndjson
git commit -m "Export Kibana dashboards - Jan 2024"

# Import (restore or deploy to another instance)
curl -X POST "localhost:5601/api/saved_objects/_import?overwrite=true" \
 -H "kbn-xsrf: true" \
 --form file=@kibana-dashboards.ndjson
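
Before committing an export, it can help to summarize what the file contains, for example to write a more useful commit message. The sketch below counts objects per type in an NDJSON file. It assumes each saved object line carries a "type" field and that the trailing export summary line does not; the function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_export(lines: Iterable[str]) -> Counter:
    """Count saved objects per type in an NDJSON export."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        # Assumption: the export's final summary line has no "type" field.
        if "type" in obj:
            counts[obj["type"]] += 1
    return counts
```

Because a file object iterates line by line, the export can be passed in directly: open kibana-dashboards.ndjson and hand the file object to summarize_export.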

Backup Strategy

What to back up:
 Saved objects (dashboards, visualizations, data views)
 kibana.yml configuration
 Connector configurations (API keys, webhook URLs)
 ML job configurations
 Alert rules

How:
 - Saved objects: API export (ndjson) → git
 - Config files: Standard config management (Ansible, etc.)
 - Elasticsearch snapshots (includes .kibana index)

Schedule:
 - After any significant dashboard changes
 - Weekly automated export
 - Before Kibana version upgrades

Performance Tuning

Kibana Server

# kibana.yml performance settings

# Increase Node.js memory for large dashboards
# Set via environment variable:
# NODE_OPTIONS="--max-old-space-size=4096"

# Request timeout (increase for slow queries)
elasticsearch.requestTimeout: 60000

# Shard timeout
elasticsearch.shardTimeout: 30000

# Max payload size (for large imports)
server.maxPayload: 10485760

Elasticsearch Query Optimization

For Kibana-specific optimization:

1. Use index lifecycle management (ILM) to move old data to cheaper tiers
 Hot → Warm → Cold → Delete

2. Create summary indices for dashboard-heavy queries
 Daily/hourly rollups of frequently aggregated metrics

3. Optimize mappings
 - Disable _source on metrics indices if not needed for Discover
 - Use keyword instead of text for fields only used in aggs
 - Set doc_values: false for fields never aggregated

4. Tune shard count
 - Target 10-50GB per shard
 - Avoid too many small shards (overhead per shard)
 - Avoid too few large shards (slow queries)
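
The shard sizing guideline reduces to a small calculation. The sketch below assumes a 30 GB target, chosen only because it sits inside the 10-50 GB range above; the function name is illustrative.

```python
import math


def recommended_shards(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Primary shard count that keeps each shard near target_shard_gb."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


print(recommended_shards(200))  # 200 GB index
print(recommended_shards(5))    # small index
```

A 200 GB index comes out at 7 primary shards of roughly 29 GB each, while a 5 GB index stays at a single shard.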

Browser Performance

For users experiencing slow Kibana:

 Use Chrome or Firefox (latest versions)
 Close unused dashboard tabs
 Clear browser cache if behavior is unexpected
 Disable browser extensions that may interfere
 Use wired connection for large data sets

Dashboard-specific:
 Limit auto-refresh frequency (10s minimum)
 Set reasonable time ranges
 Use "Apply" button for controls (not auto-apply)
 Reduce panel count per dashboard

Monitoring Kibana Itself

Stack Monitoring

Enable monitoring to track Kibana's health:

# kibana.yml
monitoring.ui.enabled: true

Navigate to Stack Management → Stack Monitoring:

Kibana Instance Health:
┌──────────────────────────────────────────────┐
│ Requests: 45/s                               │
│ Response time: 120ms (avg), 450ms (p95)      │
│ Memory: 1.2GB / 4GB (30%)                    │
│ Status: Green                                │
│ Connected to: 3 Elasticsearch nodes          │
│ Uptime: 14 days                              │
└──────────────────────────────────────────────┘

Key Metrics to Watch

Metric                       Warning        Critical
Response time (p95)          > 2s           > 5s
Memory usage                 > 70%          > 90%
Request rate                 Varies         Sudden spike/drop
Elasticsearch connectivity   Intermittent   Lost
Status                       Yellow         Red

Health Check Endpoint

# Quick health check
curl -s "localhost:5601/api/status" | jq '.status.overall.level'
# "available" = healthy

# Detailed status
curl -s "localhost:5601/api/status" | jq '.status'
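
The same check can be scripted for a monitoring probe. The sketch below assumes the response shape implied by the jq filter above (status.overall.level); the function names and the default URL are illustrative, and fetch_status is defined but not called here because it needs a running Kibana.

```python
import json
import urllib.request


def is_healthy(status_payload: dict) -> bool:
    """True if the payload reports an overall level of "available"."""
    level = status_payload.get("status", {}).get("overall", {}).get("level")
    return level == "available"


def fetch_status(base_url: str = "http://localhost:5601") -> dict:
    """Fetch /api/status from a Kibana instance (add authentication as needed)."""
    with urllib.request.urlopen(f"{base_url}/api/status", timeout=10) as resp:
        return json.load(resp)
```

Keeping the parsing separate from the HTTP call makes the health logic easy to test against saved responses.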

Upgrade Strategy

Before Upgrading

1. Read release notes for breaking changes
2. Export all saved objects (backup)
3. Test upgrade in staging/dev first
4. Check plugin compatibility
5. Verify Elasticsearch compatibility matrix
6. Plan rollback procedure

Kibana-Elasticsearch Compatibility

Rule: Kibana version must match Elasticsearch minor version

Supported:
 Kibana 8.11.x + Elasticsearch 8.11.x
 Kibana 8.11.0 + Elasticsearch 8.11.3 (patch mismatch OK)

Not supported:
 Kibana 8.11.x + Elasticsearch 8.10.x (minor mismatch)
 Kibana 8.x + Elasticsearch 7.x (major mismatch)

Upgrade order: Elasticsearch first, then Kibana
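
The rule as stated here (major and minor must match, patch may differ) is simple enough to check in a pre-upgrade script. This is a minimal sketch of that rule only; the function names are hypothetical and the official support matrix remains the authority.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split "8.11.3" into (8, 11, 3)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(kibana: str, elasticsearch: str) -> bool:
    """Major and minor versions must match; the patch version may differ."""
    return parse_version(kibana)[:2] == parse_version(elasticsearch)[:2]


print(is_compatible("8.11.0", "8.11.3"))
print(is_compatible("8.11.1", "8.10.4"))
```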

Post-Upgrade Checklist

 Verify Kibana starts and connects to Elasticsearch
 Check saved objects migration (Stack Management → Upgrade Assistant)
 Test critical dashboards render correctly
 Verify alerting rules are running and firing as expected
 Check ML jobs are running
 Test user authentication
 Review deprecated features and plan migration

Common Pitfalls

1. Querying Without Time Bounds

Problem: Dashboard queries scan entire index (years of data)
Impact: Slow queries, high memory, potential timeout

Fix: Always set a time range appropriate to the use case

2. Text Fields in Aggregations

Problem: Using "category" (text) instead of "category.keyword"
Impact: "Field is not aggregatable" error or unexpected results

Fix: Use the .keyword sub-field (or another keyword-mapped field) for exact matches and aggregations

3. Too Many Unique Values

Problem: Terms aggregation on high-cardinality field (e.g., user_id with millions)
Impact: Extremely slow, high memory, inaccurate

Fix: Use Top N (limit to 10-20), or use filters for specific values

4. Ignoring Error Messages

Problem: Visualization shows "No results" but user ignores it
Impact: Decisions made on missing data

Fix: Investigate: check time range, filters, data view, field names

5. Single Kibana Instance for Everything

Problem: One Kibana instance serves dev, staging, and production
Impact: Resource contention, security risk, messy organization

Fix: Separate instances per environment, or at minimum use Spaces

6. Not Using Saved Objects API for Migrations

Problem: Manually recreating dashboards in new environment
Impact: Time-consuming, error-prone, inconsistent

Fix: Export/import via API, store in version control

7. Alert Fatigue

Problem: Too many low-value alerts firing constantly
Impact: Team ignores alerts, real issues missed

Fix: 
 - Review and remove noisy alerts quarterly
 - Require each alert to have a clear action (what should the recipient do?)
 - Use tiered severity
 - Set proper thresholds based on historical data

Quick Reference

Keyboard Shortcuts

Shortcut        Action
/               Focus search bar
Ctrl/Cmd + K    Command palette
Ctrl/Cmd + /    Toggle navigation
Ctrl/Cmd + S    Save current object
Escape          Close modal
Ctrl/Cmd + Z    Undo (in Lens editor)

Useful API Endpoints

# System status
GET /api/status

# Saved objects
POST /api/saved_objects/_export
POST /api/saved_objects/_import
GET /api/saved_objects/_find?type=dashboard

# Data views
GET /api/data_views
POST /api/data_views/data_view

# Alerting
GET /api/alerting/rules/_find
POST /api/alerting/rule

# Spaces
GET /api/spaces/space
POST /api/spaces/space

Configuration Files

Kibana: /etc/kibana/kibana.yml
Elasticsearch: /etc/elasticsearch/elasticsearch.yml
Filebeat: /etc/filebeat/filebeat.yml
Metricbeat: /etc/metricbeat/metricbeat.yml
APM Server: /etc/apm-server/apm-server.yml

Docker volumes:
 kibana: /usr/share/kibana/config/kibana.yml
 elasticsearch: /usr/share/elasticsearch/config/elasticsearch.yml

Where to Go From Here

You now have what you need to build effective dashboards, write efficient queries, set up monitoring and alerting, secure your deployment, and run it in production.

A few project ideas to lock in what you learned:

  • Build an end-to-end log pipeline (Filebeat, Logstash, Elasticsearch, Kibana) for an application you maintain
  • Create a service-level dashboard with SLO tracking and tiered alerts
  • Set up an ML anomaly detection job on a real metric and tune it over a week
  • Wire up APM in a small service and trace a real request across two hops