Best Practices and Tips

This chapter consolidates practical advice for running Kibana effectively in production. It covers performance, organization, naming conventions, team workflows, and common pitfalls.

Dashboard Design

Layout Principles

1. Most important information at the top
2. Summary → Detail (top to bottom)
3. Time-series charts in wide panels
4. Related metrics side by side
5. Controls and filters at the very top

Recommended layout:
┌─────────────────────────────────────────────────────────┐
│  [Filters]  [Controls]  [Time Range]                     │  Row 1
├────────┬────────┬────────┬──────────────────────────────┤
│ KPI 1  │ KPI 2  │ KPI 3  │ KPI 4                        │  Row 2
├────────┴────────┴────────┴──────────────────────────────┤
│  Time-series chart (full width)                          │  Row 3
├──────────────────────────┬──────────────────────────────┤
│  Breakdown chart         │  Breakdown chart              │  Row 4
├──────────────────────────┴──────────────────────────────┤
│  Detail table (full width)                               │  Row 5
└─────────────────────────────────────────────────────────┘

Panel Count

Keep dashboards focused:

✅ 8-15 panels: Fast loading, clear purpose
✅ 15-25 panels: Acceptable for broad views
❌ 30+ panels: Split into linked dashboards

Technique: Use drilldowns to link an overview dashboard
to detail dashboards instead of packing everything into one.

Color Consistency

Establish a color scheme and stick to it:

Status colors (across all dashboards):
  Green:   #00BFA5   Success, healthy, within SLA
  Yellow:  #FFB74D   Warning, degraded, approaching limit
  Red:     #FF5252   Error, critical, SLA breach
  Blue:    #448AFF   Informational, neutral
  Gray:    #9E9E9E   Inactive, disabled, no data

Category colors (per domain):
  Assign fixed colors to categories so they're recognizable:
  - "Men's Clothing" always uses blue
  - "Women's Clothing" always uses pink
  - "Shoes" always uses brown

Titles and Labels

✅ "Daily Revenue ($)": clear metric, clear unit
✅ "Error Rate (%) - Last 24h": metric, unit, time context
✅ "Top 10 Endpoints by Latency": scope, dimension, metric

❌ "Chart 1": meaningless
❌ "taxful_total_price": raw field name
❌ "Data": too vague

Naming Conventions

Saved Objects

Use a consistent naming scheme for all saved objects:

Pattern: [Team/Domain] Object Type - Description

Dashboards:
  [Ops] Overview - Production Health
  [Ops] Detail - API Gateway
  [Sales] Overview - Revenue Metrics
  [Sales] Detail - Product Performance

Visualizations:
  [Ops] Metric - Error Rate
  [Ops] Line - Request Latency p95
  [Sales] Bar - Top Products by Revenue
  [Sales] Pie - Revenue by Category

Saved Searches:
  [Ops] Search - 5xx Errors Last 24h
  [Sales] Search - High Value Orders

Index Patterns / Data Views:
  logs-prod-* (Production Logs)
  logs-staging-* (Staging Logs)
  metrics-prod-* (Production Metrics)

Tags

Apply tags consistently:

Tag categories:
  Team:        ops, dev, sales, marketing, security
  Environment: production, staging, development
  Type:        monitoring, analytics, reporting, investigation
  Status:      active, archived, draft, template
  Priority:    critical, important, reference

Example:
  Dashboard: "[Ops] Overview - Production Health"
  Tags: [ops, production, monitoring, critical]

Spaces

Space naming:
  production     → Production monitoring and dashboards
  development    → Dev/test dashboards and experiments
  marketing      → Marketing analytics
  security       → Security operations
  shared         → Cross-team dashboards

Keep it simple: one space per team or function.
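Spaces can also be created programmatically, which keeps naming consistent across environments. A minimal sketch using the Kibana Spaces API (the id and description here mirror the table above):

```
POST /api/spaces/space
{
  "id": "production",
  "name": "Production",
  "description": "Production monitoring and dashboards"
}
```

The same endpoint accepts optional color, initials, and disabledFeatures fields if you want spaces to be visually distinct or feature-restricted.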

Query Performance

Index and Field Optimization

✅ Use keyword fields for filtering and aggregations
   category.keyword: "Shoes"

❌ Use text fields for aggregations
   category: "Shoes" (analyzed field; fails as "not aggregatable" or requires fielddata)

✅ Use date histogram with appropriate intervals
   Auto interval or 1h for daily views, 1d for monthly

❌ Use tiny intervals on large time ranges
   1-minute intervals over 1 year = too many buckets
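The keyword-vs-text distinction shows up directly in the query DSL. A sketch of a fast aggregation request, assuming an index with the ecommerce-style fields used above (the index name my-index is illustrative):

```
GET my-index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "category.keyword": "Shoes" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "aggs": {
    "orders_over_time": {
      "date_histogram": { "field": "@timestamp", "calendar_interval": "1h" }
    }
  }
}
```

Note the term filter targets the keyword sub-field, and the histogram interval (1h over 24h) yields a modest bucket count.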

Query Patterns

Fast queries:
  ✅ Term filter on keyword field
  ✅ Range filter on numeric/date field
  ✅ Bool filter combining few conditions
  ✅ Date histogram with auto interval

Slow queries:
  ❌ Wildcard prefix (*something)
  ❌ Regex on large text fields
  ❌ High-cardinality terms aggregation (top 10000)
  ❌ Nested aggregations 4+ levels deep
  ❌ Scripts in aggregations

Time Range Strategy

Real-time dashboards:    Last 15-30 minutes
Operational monitoring:  Last 1-4 hours
Daily review:            Last 24 hours / Today
Weekly reports:          Last 7 days
Monthly analysis:        Last 30 days
Historical research:     Custom range (as narrow as possible)

Tip: Always set a time range. Querying "all time" on
production indices is the #1 cause of slow dashboards.

Caching

Kibana and Elasticsearch cache query results. Maximize cache hits:

✅ Use filters (cacheable) over query_string (not always cached)
✅ Round time ranges to hour/day boundaries
✅ Use consistent queries across dashboard panels
✅ Enable query caching in Elasticsearch:
   indices.queries.cache.enabled: true (default)

Elasticsearch request cache works best when:
  - Shard data doesn't change (rolled-over indices)
  - Same query is repeated
  - Time range uses "now" rounding: now/h, now/d
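Rounded date math is what makes repeated dashboard loads cache-friendly: every request within the same hour produces an identical query body, so the shard request cache can serve the repeats. A sketch of the filter shape:

```
{
  "range": {
    "@timestamp": {
      "gte": "now-24h/h",
      "lte": "now/h"
    }
  }
}
```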

Data View Strategy

Naming and Organization

Convention: source-environment-*

Examples:
  filebeat-prod-*        All production Filebeat logs
  metricbeat-prod-*      All production Metricbeat metrics
  apm-prod-*             All production APM data
  custom-orders-*        Custom order data (all environments)

Avoid:
  * (everything)         Too broad, slow, confusing field list
  logs-*                 Ambiguous: prod? dev? staging?

Field Formatting

Set up field formatting once in the data view so every dashboard benefits:

Standard formatting:
  response         → Color (green 2xx, yellow 3xx, red 4xx/5xx)
  bytes            → Bytes (auto KB/MB/GB)
  response_time    → Duration (ms)
  price            → Currency ($0,0.00)
  url              → URL (clickable)
  percentage       → Percent (0.00%)
  ip_address       → String (no special formatting)

Runtime Fields

Use runtime fields for common calculations so every dashboard has access:

Useful runtime fields:
  hour_of_day     → Extract hour from @timestamp
  day_of_week     → Extract day name
  response_class  → "2xx", "3xx", "4xx", "5xx" from response code
  sla_status      → "within_sla" / "breach" from response_time
  environment     → Extract from index name or hostname
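As a sketch, response_class can be defined as a runtime field in the index mapping (console syntax; the response field name follows the examples above, and the exact script is illustrative):

```
PUT my-index/_mapping
{
  "runtime": {
    "response_class": {
      "type": "keyword",
      "script": {
        "source": "if (doc['response'].size() != 0) { emit((doc['response'].value / 100) + 'xx') }"
      }
    }
  }
}
```

Runtime fields can also be added per data view in Kibana's field editor, which keeps the calculation out of the index mapping entirely.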

Alerting Best Practices

Alert Design

✅ Alert on symptoms, not causes
   "Error rate > 5%" (symptom)
   NOT "Pod restarted" (cause - may be normal)

✅ Include runbook links in alert messages
   "Error rate high. Runbook: https://wiki.example.com/runbooks/high-error-rate"

✅ Set appropriate thresholds (not too sensitive)
   Test thresholds against historical data before enabling

✅ Use tiered severity
   Warning: error_rate > 2%  → Slack #monitoring
   Critical: error_rate > 10% → PagerDuty + Slack #incidents

❌ Alert on everything
   Alert fatigue = ignored alerts = missed incidents

Alert Noise Reduction

Techniques:
1. Aggregate: Group by service instead of per-instance
2. Throttle: Send at most once per hour while condition persists
3. Delay: Require condition true for 5+ minutes before alerting
4. Exclude: Filter out known false positives (health checks, test data)
5. Schedule: Suppress during maintenance windows
6. Snooze: Temporarily silence during known issues
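As an illustration of throttling (technique 2), the alerting API accepts a notification policy on the rule. This sketch assumes an Elasticsearch query rule; the rule params are omitted for brevity, and field placement varies by Kibana version (newer releases move notify_when and throttle into each action's frequency object):

```
POST /api/alerting/rule
{
  "name": "[Ops] Error rate high",
  "rule_type_id": ".es-query",
  "consumer": "alerts",
  "schedule": { "interval": "5m" },
  "notify_when": "onThrottleInterval",
  "throttle": "1h",
  "params": { },
  "actions": [ ]
}
```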

Recovery Actions

Always configure recovery (resolved) actions:

Alert fired:
  "🔴 Error rate 12% on payment-service (threshold: 5%)"

Alert recovered:
  "✅ Error rate 1.2% on payment-service - back to normal
   Duration: 23 minutes"

Recovery actions help close the loop and prevent confusion
about whether an issue is still ongoing.

Security Best Practices

Access Control

Principles:
1. Least privilege: Users get minimum access needed
2. Role-based: Define roles per function, assign to users
3. Space isolation: Separate environments and teams
4. Audit: Enable audit logging in production

Common roles:
  viewer         → Read-only dashboards (stakeholders)
  analyst        → Read data + create visualizations (analysts)
  editor         → Full dashboard management (dashboard authors)
  alert_manager  → Manage alerts (on-call engineers)
  admin          → Full Kibana management (admins only)
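Roles like these can be provisioned via the security API so they stay consistent across clusters. A sketch (role name, index pattern, and space are illustrative):

```
PUT /api/security/role/prod_viewer
{
  "elasticsearch": {
    "indices": [
      { "names": ["logs-prod-*"], "privileges": ["read", "view_index_metadata"] }
    ]
  },
  "kibana": [
    { "base": ["read"], "spaces": ["production"] }
  ]
}
```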

Sensitive Data

✅ Use field-level security to hide PII fields
✅ Use document-level security for multi-tenant data
✅ Never display raw credit card or SSN fields
✅ Mask email addresses in shared dashboards
✅ Use runtime fields to create masked versions of sensitive data

Example runtime field:
  masked_email: "jo***@example.com" (from "john@example.com")
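One way to implement such a field is as a runtime keyword with a Painless script (the email.keyword field name and the exact masking rule are illustrative; triple quotes are Dev Tools console syntax):

```
PUT my-index/_mapping
{
  "runtime": {
    "masked_email": {
      "type": "keyword",
      "script": {
        "source": """
          if (doc['email.keyword'].size() != 0) {
            String e = doc['email.keyword'].value;
            int at = e.indexOf('@');
            emit(at > 2 ? e.substring(0, 2) + '***' + e.substring(at) : '***');
          }
        """
      }
    }
  }
}
```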

Production Hardening

# kibana.yml production settings
server.ssl.enabled: true
server.ssl.certificate: /path/to/cert.pem
server.ssl.key: /path/to/key.pem

xpack.security.enabled: true
xpack.encryptedSavedObjects.encryptionKey: "min-32-char-key"
xpack.reporting.encryptionKey: "min-32-char-key"
xpack.security.session.idleTimeout: "1h"
xpack.security.session.lifespan: "8h"

# Disable telemetry in production
telemetry.enabled: false

# CSP headers
csp.strict: true
csp.warnLegacyBrowsers: false

Operational Workflow

Dashboard Lifecycle

1. DRAFT
   Create in development space
   Use sample or dev data
   Iterate on layout and queries

2. REVIEW
   Share with stakeholders
   Gather feedback
   Test with production-like data

3. DEPLOY
   Move to production space
   Connect to production data views
   Set up access controls

4. MAINTAIN
   Monitor performance
   Update as data schema changes
   Archive when no longer needed
   Version control via export/import

Version Control for Dashboards

Export dashboards as NDJSON and store in git:

# Export all dashboards
curl -X POST "localhost:5601/api/saved_objects/_export" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -d '{
    "type": ["dashboard", "visualization", "search", "lens"],
    "includeReferencesDeep": true
  }' > kibana-dashboards.ndjson

# Commit to git
git add kibana-dashboards.ndjson
git commit -m "Export Kibana dashboards - Jan 2024"

# Import (restore or deploy to another instance)
curl -X POST "localhost:5601/api/saved_objects/_import?overwrite=true" \
  -H "kbn-xsrf: true" \
  --form file=@kibana-dashboards.ndjson

Backup Strategy

What to back up:
  ✅ Saved objects (dashboards, visualizations, data views)
  ✅ kibana.yml configuration
  ✅ Connector configurations (API keys, webhook URLs)
  ✅ ML job configurations
  ✅ Alert rules

How:
  - Saved objects: API export (ndjson) → git
  - Config files: Standard config management (Ansible, etc.)
  - Elasticsearch snapshots (includes .kibana index)

Schedule:
  - After any significant dashboard changes
  - Weekly automated export
  - Before Kibana version upgrades

Performance Tuning

Kibana Server

# kibana.yml performance settings

# Increase Node.js memory for large dashboards
# Set via environment variable:
# NODE_OPTIONS="--max-old-space-size=4096"

# Request timeout (increase for slow queries)
elasticsearch.requestTimeout: 60000

# Shard timeout
elasticsearch.shardTimeout: 30000

# Max payload size (for large imports)
server.maxPayload: 10485760

Elasticsearch Query Optimization

For Kibana-specific optimization:

1. Use index lifecycle management (ILM) to move old data to cheaper tiers
   Hot → Warm → Cold → Delete

2. Create summary indices for dashboard-heavy queries
   Daily/hourly rollups of frequently aggregated metrics

3. Optimize mappings
   - Disable _source on metrics indices if not needed for Discover
   - Use keyword instead of text for fields only used in aggs
   - Set doc_values: false for fields never aggregated

4. Tune shard count
   - Target 10-50GB per shard
   - Avoid too many small shards (overhead per shard)
   - Avoid too few large shards (slow queries)
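For point 1, an ILM policy is a single API call. A sketch with illustrative thresholds (the 50gb rollover size lines up with the shard-sizing guidance in point 4):

```
PUT _ilm/policy/logs-prod
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "set_priority": { "priority": 50 } }
      },
      "cold": {
        "min_age": "30d",
        "actions": { "set_priority": { "priority": 0 } }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```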

Browser Performance

For users experiencing slow Kibana:

✅ Use Chrome or Firefox (latest versions)
✅ Close unused dashboard tabs
✅ Clear browser cache if behavior is unexpected
✅ Disable browser extensions that may interfere
✅ Use wired connection for large data sets

Dashboard-specific:
✅ Limit auto-refresh frequency (10s minimum)
✅ Set reasonable time ranges
✅ Use "Apply" button for controls (not auto-apply)
✅ Reduce panel count per dashboard

Monitoring Kibana Itself

Stack Monitoring

Enable monitoring to track Kibana's health:

# kibana.yml
monitoring.ui.enabled: true

Navigate to Stack Management → Stack Monitoring:

Kibana Instance Health:
┌──────────────────────────────────────────────┐
│  Requests:      45/s                         │
│  Response time: 120ms (avg), 450ms (p95)     │
│  Memory:        1.2GB / 4GB (30%)            │
│  Status:        Green                        │
│  Connected to:  3 Elasticsearch nodes        │
│  Uptime:        14 days                      │
└──────────────────────────────────────────────┘

Key Metrics to Watch

Metric                        Warning         Critical
Response time (p95)           > 2s            > 5s
Memory usage                  > 70%           > 90%
Request rate                  Varies          Sudden spike/drop
Elasticsearch connectivity    Intermittent    Lost
Status                        Yellow          Red

Health Check Endpoint

# Quick health check
curl -s "localhost:5601/api/status" | jq '.status.overall.level'
# "available" = healthy

# Detailed status
curl -s "localhost:5601/api/status" | jq '.status'

Upgrade Strategy

Before Upgrading

1. Read release notes for breaking changes
2. Export all saved objects (backup)
3. Test upgrade in staging/dev first
4. Check plugin compatibility
5. Verify Elasticsearch compatibility matrix
6. Plan rollback procedure

Kibana-Elasticsearch Compatibility

Rule: Kibana and Elasticsearch must run the same major.minor version

✅ Kibana 8.11.x + Elasticsearch 8.11.x
✅ Kibana 8.11.0 + Elasticsearch 8.11.3 (patch mismatch OK)
❌ Kibana 8.11.x + Elasticsearch 8.10.x (minor mismatch)
❌ Kibana 8.x + Elasticsearch 7.x (major mismatch)

Upgrade order: Elasticsearch first, then Kibana

Post-Upgrade Checklist

✅ Verify Kibana starts and connects to Elasticsearch
✅ Check saved objects migration (Stack Management → Upgrade Assistant)
✅ Test critical dashboards render correctly
✅ Verify alerts are firing
✅ Check ML jobs are running
✅ Test user authentication
✅ Review deprecated features and plan migration

Common Pitfalls

1. Querying Without Time Bounds

Problem: Dashboard queries scan entire index (years of data)
Impact:  Slow queries, high memory, potential timeout

Fix:     Always set a time range appropriate to the use case

2. Text Fields in Aggregations

Problem: Using "category" (text) instead of "category.keyword"
Impact:  "Field is not aggregatable" error or unexpected results

Fix:     Always use .keyword suffix for exact match and aggregations

3. Too Many Unique Values

Problem: Terms aggregation on high-cardinality field (e.g., user_id with millions)
Impact:  Extremely slow, high memory, inaccurate

Fix:     Use Top N (limit to 10-20), or use filters for specific values

4. Ignoring Error Messages

Problem: Visualization shows "No results" but user ignores it
Impact:  Decisions made on missing data

Fix:     Investigate: check time range, filters, data view, field names

5. Single Kibana Instance for Everything

Problem: One Kibana instance serves dev, staging, and production
Impact:  Resource contention, security risk, messy organization

Fix:     Separate instances per environment, or at minimum use Spaces

6. Not Using Saved Objects API for Migrations

Problem: Manually recreating dashboards in new environment
Impact:  Time-consuming, error-prone, inconsistent

Fix:     Export/import via API, store in version control

7. Alert Fatigue

Problem: Too many low-value alerts firing constantly
Impact:  Team ignores alerts, real issues missed

Fix:     
  - Review and remove noisy alerts quarterly
  - Require each alert to have a clear action (what should the recipient do?)
  - Use tiered severity
  - Set proper thresholds based on historical data

Quick Reference

Keyboard Shortcuts

Shortcut          Action
/                 Focus search bar
Ctrl/Cmd + K      Command palette
Ctrl/Cmd + /      Toggle navigation
Ctrl/Cmd + S      Save current object
Escape            Close modal
Ctrl/Cmd + Z      Undo (in Lens editor)

Useful API Endpoints

# System status
GET /api/status

# Saved objects
POST /api/saved_objects/_export
POST /api/saved_objects/_import
GET  /api/saved_objects/_find?type=dashboard

# Data views
GET  /api/data_views
POST /api/data_views/data_view

# Alerting
GET  /api/alerting/rules/_find
POST /api/alerting/rule

# Spaces
GET  /api/spaces/space
POST /api/spaces/space

Configuration Files

Kibana:         /etc/kibana/kibana.yml
Elasticsearch:  /etc/elasticsearch/elasticsearch.yml
Filebeat:       /etc/filebeat/filebeat.yml
Metricbeat:     /etc/metricbeat/metricbeat.yml
APM Server:     /etc/apm-server/apm-server.yml

Docker volumes:
  kibana:         /usr/share/kibana/config/kibana.yml
  elasticsearch:  /usr/share/elasticsearch/config/elasticsearch.yml

Summary

In this chapter, you learned:

  • ✅ Dashboard design principles for clarity and performance
  • ✅ Naming conventions for saved objects, tags, and spaces
  • ✅ Query performance optimization techniques
  • ✅ Alerting best practices to avoid alert fatigue
  • ✅ Security hardening for production deployments
  • ✅ Operational workflows: lifecycle, versioning, backups
  • ✅ Performance tuning for Kibana server and Elasticsearch
  • ✅ Monitoring Kibana itself and upgrade strategies
  • ✅ Common pitfalls and how to avoid them

This concludes the Kibana tutorial. You now have the knowledge to build effective dashboards, write efficient queries, set up monitoring and alerting, secure your deployment, and maintain it in production. The official Elastic documentation is an excellent resource for continued learning and reference.