Application Performance Monitoring (APM)

APM in Kibana provides real-time visibility into application performance: response times, error rates, transaction traces, and service dependencies. APM agents instrument your application code to collect this data, and Kibana displays it through dedicated views.

APM Architecture

┌──────────────────────────────────────────────────────────┐
│  Your Applications                                       │
│                                                          │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐   │
│  │ Node.js │  │ Python  │  │  Java   │  │   Go    │   │
│  │   App   │  │   App   │  │   App   │  │   App   │   │
│  │  + APM  │  │  + APM  │  │  + APM  │  │  + APM  │   │
│  │  Agent  │  │  Agent  │  │  Agent  │  │  Agent  │   │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘   │
│       └─────────┬──┴───────────┴──────────────┘         │
│                 │                                        │
│                 ▼                                        │
│         ┌──────────────┐                                 │
│         │  APM Server  │  (or Elastic Agent with         │
│         │              │   APM integration)               │
│         └──────┬───────┘                                 │
│                │                                        │
│                ▼                                        │
│         ┌──────────────┐                                 │
│         │Elasticsearch │  (stores APM data)              │
│         └──────┬───────┘                                 │
│                │                                        │
│                ▼                                        │
│         ┌──────────────┐                                 │
│         │   Kibana     │  (APM UI)                       │
│         │   APM App    │                                 │
│         └──────────────┘                                 │
└──────────────────────────────────────────────────────────┘

Setting Up APM

Step 1: APM Server

Install APM Server, or use Elastic Agent with the APM integration:

Docker:

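# APM Server settings are passed as -E flags after the image name
# (docker's own -e flag would only set an environment variable).
docker run -d \
  --name apm-server \
  --net elastic \
  -p 8200:8200 \
  docker.elastic.co/apm/apm-server:8.11.0 \
  --strict.perms=false -e \
  -E 'output.elasticsearch.hosts=["http://elasticsearch:9200"]'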

Elastic Agent (recommended for 8.x):

  1. Go to Fleet → Agent policies
  2. Add integration → Elastic APM
  3. Configure:
       Host: 0.0.0.0:8200
       Secret token: your-secret-token

Step 2: Instrument Your Application

Install the APM agent for your language:

Node.js:

npm install elastic-apm-node

// Add at the VERY TOP of your main file (before any other require)
const apm = require('elastic-apm-node').start({
  serviceName: 'my-api',
  serverUrl: 'http://localhost:8200',
  secretToken: 'your-secret-token',
  environment: 'production',
  // Capture request body (careful with PII)
  captureBody: 'errors',
  // Sample rate (1.0 = 100%)
  transactionSampleRate: 1.0,
});

Python:

pip install elastic-apm

# Flask
from flask import Flask
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)
app.config['ELASTIC_APM'] = {
    'SERVICE_NAME': 'my-flask-app',
    'SERVER_URL': 'http://localhost:8200',
    'SECRET_TOKEN': 'your-secret-token',
    'ENVIRONMENT': 'production',
}
apm = ElasticAPM(app)

# Django - settings.py
INSTALLED_APPS = [
    'elasticapm.contrib.django',
    # ...
]

ELASTIC_APM = {
    'SERVICE_NAME': 'my-django-app',
    'SERVER_URL': 'http://localhost:8200',
    'SECRET_TOKEN': 'your-secret-token',
    'ENVIRONMENT': 'production',
}
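
The snippets above hard-code the secret token for brevity. A minimal sketch of building the same settings from environment variables instead, assuming the ELASTIC_APM_* variable names that appear in the Go example further down. The helper name apm_config_from_env and its defaults are illustrative and not part of the agent:

```python
import os

def apm_config_from_env(default_service_name, environ=None):
    """Build an ELASTIC_APM-style settings dict from environment variables."""
    env = os.environ if environ is None else environ
    config = {
        'SERVICE_NAME': env.get('ELASTIC_APM_SERVICE_NAME', default_service_name),
        'SERVER_URL': env.get('ELASTIC_APM_SERVER_URL', 'http://localhost:8200'),
        'ENVIRONMENT': env.get('ELASTIC_APM_ENVIRONMENT', 'development'),
    }
    # Only include the token when one is provided, so a placeholder value
    # never ends up in source control.
    token = env.get('ELASTIC_APM_SECRET_TOKEN')
    if token:
        config['SECRET_TOKEN'] = token
    return config

# Hypothetical usage in settings.py:
# ELASTIC_APM = apm_config_from_env('my-django-app')
```

Keeping the token out of the code also makes it easier to use a different token per environment.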

Java:

# Download the agent JAR
curl -O https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/1.44.0/elastic-apm-agent-1.44.0.jar
# Run with agent attached
java -javaagent:/path/to/elastic-apm-agent-1.44.0.jar \
  -Delastic.apm.service_name=my-java-app \
  -Delastic.apm.server_urls=http://localhost:8200 \
  -Delastic.apm.secret_token=your-secret-token \
  -Delastic.apm.environment=production \
  -jar my-app.jar

Go:

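go get go.elastic.co/apm/v2
go get go.elastic.co/apm/module/apmhttp/v2

package main

import (
    "log"
    "net/http"

    "go.elastic.co/apm/module/apmhttp/v2"
)

// handleUsers is a placeholder handler for this example.
func handleUsers(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("[]"))
}

func main() {
    // The agent reads its settings from environment variables:
    // ELASTIC_APM_SERVICE_NAME=my-go-app
    // ELASTIC_APM_SERVER_URL=http://localhost:8200
    // ELASTIC_APM_SECRET_TOKEN=your-secret-token

    mux := http.NewServeMux()
    mux.HandleFunc("/api/users", handleUsers)

    // Wrap with APM middleware
    handler := apmhttp.Wrap(mux)
    log.Fatal(http.ListenAndServe(":8080", handler))
}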

Supported Agents

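┌──────────┬───────────────────────────────────┬──────────────┐
│ Language │ Auto-Instrumentation              │ Manual Spans │
├──────────┼───────────────────────────────────┼──────────────┤
│ Node.js  │ Express, Koa, Hapi, Fastify, HTTP │ Yes          │
│ Python   │ Django, Flask, Starlette          │ Yes          │
│ Java     │ Spring, Servlet, JAX-RS           │ Yes          │
│ Go       │ net/http, Gin, Echo, gRPC         │ Yes          │
│ .NET     │ ASP.NET Core, EF Core             │ Yes          │
│ Ruby     │ Rails, Sinatra, Grape             │ Yes          │
│ PHP      │ WordPress, Laravel                │ Yes          │
│ Rust     │ Actix, Rocket                     │ Community    │
│ RUM (JS) │ Browser, React, Angular, Vue      │ Yes          │
└──────────┴───────────────────────────────────┴──────────────┘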

The APM Interface

Services View

Navigate to Observability → APM → Services:

┌─────────────────────────────────────────────────────────┐
│  Services                          [Environment ▼]       │
├─────────────────────────────────────────────────────────┤
│  Service         │ Env  │ Latency  │ Throughput │ Errors│
│──────────────────│──────│──────────│────────────│───────│
│ api-gateway      │ prod │ 145ms    │ 1.2k/min  │ 0.3%  │
│ payment-service  │ prod │ 320ms    │ 450/min   │ 1.2%  │
│ user-service     │ prod │ 85ms     │ 2.1k/min  │ 0.1%  │
│ order-service    │ prod │ 210ms    │ 800/min   │ 0.5%  │
│ notification-svc │ prod │ 50ms     │ 300/min   │ 0.0%  │
└─────────────────────────────────────────────────────────┘

Service Detail

Click a service to see detailed performance data:

┌─────────────────────────────────────────────────────────┐
│  api-gateway                                            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Latency:    145ms (avg)  │  Throughput: 1.2k/min      │
│  ────────────────────     │  ────────────────────      │
│  p50: 85ms               │                              │
│  p95: 450ms              │  Error rate: 0.3%            │
│  p99: 1.2s               │  ────────────────────       │
│                                                         │
│  ┌─ Latency Distribution ──────────────────────────┐   │
│  │  ██                                              │   │
│  │  ████                                            │   │
│  │  ████████                                        │   │
│  │  ████████████████                                │   │
│  │  ██████████████████████                          │   │
│  │  0ms    100ms    500ms    1s    2s    5s         │   │
│  └──────────────────────────────────────────────────┘   │
│                                                         │
│  Transactions:                                          │
│  ┌────────────────────────┬────────┬──────┬──────────┐ │
│  │ Name                   │ Latency│ TPM  │ Errors   │ │
│  │ GET /api/users         │ 85ms   │ 450  │ 0.1%     │ │
│  │ POST /api/orders       │ 320ms  │ 200  │ 0.8%     │ │
│  │ GET /api/products      │ 120ms  │ 350  │ 0.2%     │ │
│  │ POST /api/checkout     │ 1.2s   │ 100  │ 2.1%     │ │
│  └────────────────────────┴────────┴──────┴──────────┘ │
└─────────────────────────────────────────────────────────┘
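
The average and percentile figures in the mock-up summarize the same set of transaction durations in different ways. A small sketch of how they relate, using the nearest-rank percentile method and made-up durations; Kibana computes its figures from the stored APM data with its own aggregations, so this only illustrates the idea:

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]

# Made-up transaction durations in milliseconds.
durations_ms = sorted([40, 55, 60, 70, 85, 90, 110, 150, 450, 1200])

average = sum(durations_ms) / len(durations_ms)
p50 = percentile(durations_ms, 50)
p95 = percentile(durations_ms, 95)
p99 = percentile(durations_ms, 99)
print(f"avg={average:.0f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

With this data the average lands well above the median because two slow requests dominate it, which is why the percentiles are worth reading alongside the average.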

Transactions

Transaction Types

APM categorizes operations:

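┌───────────┬──────────────────────────┬───────────────────────────┐
│ Type      │ Description              │ Example                   │
├───────────┼──────────────────────────┼───────────────────────────┤
│ request   │ HTTP request/response    │ GET /api/users            │
│ messaging │ Message queue operations │ Process order from queue  │
│ scheduled │ Cron/scheduled tasks     │ Nightly report generation │
│ custom    │ User-defined             │ Background data sync      │
└───────────┴──────────────────────────┴───────────────────────────┘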

Transaction Waterfall

Click a transaction to see the detailed trace waterfall:

Transaction: POST /api/checkout (1.2s total)
├── api-gateway (12ms)
│   └── HTTP POST /api/orders → order-service
│       ├── order-service (180ms)
│       │   ├── SELECT * FROM products (15ms)
│       │   ├── SELECT * FROM inventory (8ms)
│       │   └── HTTP POST /api/payments → payment-service
│       │       ├── payment-service (850ms)
│       │       │   ├── Stripe API call (780ms)  ← bottleneck!
│       │       │   └── INSERT INTO payments (25ms)
│       │       └── response (payment-service → order-service)
│       ├── INSERT INTO orders (12ms)
│       └── HTTP POST /api/notifications → notification-svc
│           └── notification-svc (45ms)
│               └── SendGrid API call (35ms)
└── response (api-gateway → client)
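
Spotting the bottleneck in a waterfall amounts to finding the slowest span that has no children of its own. A rough sketch of that search, with the durations copied from the waterfall above; the tuple layout is only an illustration and not how APM stores traces:

```python
# Each span: (name, duration in ms, child spans).
trace = ("POST /api/checkout", 1200, [
    ("order-service", 180, [
        ("SELECT * FROM products", 15, []),
        ("SELECT * FROM inventory", 8, []),
        ("payment-service", 850, [
            ("Stripe API call", 780, []),
            ("INSERT INTO payments", 25, []),
        ]),
        ("INSERT INTO orders", 12, []),
        ("notification-svc", 45, [
            ("SendGrid API call", 35, []),
        ]),
    ]),
])

def leaf_spans(node, path=()):
    """Yield (path, duration_ms) for every span without children."""
    name, duration_ms, children = node
    here = path + (name,)
    if not children:
        yield here, duration_ms
    for child in children:
        yield from leaf_spans(child, here)

slowest_path, slowest_ms = max(leaf_spans(trace), key=lambda item: item[1])
share = slowest_ms / trace[1] * 100  # share of the whole transaction
print(" > ".join(slowest_path), f"{slowest_ms}ms ({share:.0f}% of transaction)")
```

The share computed on the last lines is the same kind of percentage the UI shows next to a span's duration.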

Reading the Waterfall

Each span shows:

  • Service name and color coding
  • Span type (db, http, custom)
  • Duration (absolute and as percentage)
  • Status (success, error)
  • Details (SQL query, HTTP method/URL, etc.)

Click a span for details:

Span: Stripe API call
Type: external.http
Duration: 780ms (65% of transaction)
URL: https://api.stripe.com/v1/charges
Method: POST
Status: 200

Labels:
  stripe.charge_id: ch_1234
  amount: 9999

Errors

Error View

Navigate to a service → Errors tab:

┌─────────────────────────────────────────────────────────┐
│  Errors: payment-service                                │
├─────────────────────────────────────────────────────────┤
│  Error Group          │ Count │ Last Seen │ Handled     │
│───────────────────────│───────│───────────│─────────────│
│  StripeCardError:     │  45   │ 2m ago    │ Unhandled   │
│    Card declined      │       │           │             │
│  ConnectionError:     │  12   │ 15m ago   │ Handled     │
│    DB timeout         │       │           │             │
│  ValidationError:     │  8    │ 1h ago    │ Handled     │
│    Invalid amount     │       │           │             │
└─────────────────────────────────────────────────────────┘

Error Detail

Click an error group to see:

  • Stack trace (with source code context if source maps are configured)
  • Exception message and type
  • Transaction that caused the error
  • User and request context
  • Occurrence timeline

Error: StripeCardError
Message: Your card was declined.

Stack trace:
  at processPayment (src/services/payment.js:145:11)
  at async createOrder (src/routes/orders.js:67:5)
  at async handle (src/middleware/handler.js:23:7)

Context:
  User: user-12345
  Request: POST /api/checkout
  Body: { amount: 9999, currency: "usd" }

Metadata:
  service.version: 2.3.1
  host.name: prod-payment-02
  container.id: abc123def456

Service Map

The service map shows dependencies between services:

Navigate to: APM → Service Map

                    ┌──────────┐
                    │  Client  │
                    │ (Browser)│
                    └────┬─────┘
                         │
                    ┌────▼─────┐
                    │   API    │
                    │ Gateway  │
                    └────┬─────┘
                   ╱     │      ╲
          ┌───────▼──┐ ┌─▼──────┐ ┌▼──────────┐
          │  User    │ │ Order  │ │  Product  │
          │ Service  │ │Service │ │  Service  │
          └────┬─────┘ └───┬───┘ └─────┬─────┘
               │           │           │
          ┌────▼─────┐ ┌───▼───┐  ┌───▼──────┐
          │ User DB  │ │Order  │  │Product   │
          │(Postgres)│ │  DB   │  │Cache     │
          └──────────┘ │(MySQL)│  │(Redis)   │
                       └───┬───┘  └──────────┘
                           │
                    ┌──────▼──────┐
                    │  Payment   │
                    │  Service   │
                    └──────┬─────┘
                           │
                    ┌──────▼──────┐
                    │   Stripe   │
                    │   (ext)    │
                    └─────────────┘

Features:

  • Nodes sized by throughput
  • Edges colored by health (green/yellow/red)
  • Click a node to see service details
  • Click an edge to see transaction details between services
  • External services shown as separate nodes
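
Conceptually, a map like this can be derived from which service calls which. A minimal sketch of that idea, assuming a made-up list of (caller, destination) pairs; it is not how Kibana builds its service map:

```python
from collections import Counter

# Hypothetical outgoing calls: (calling service, destination).
calls = [
    ("api-gateway", "user-service"),
    ("api-gateway", "order-service"),
    ("api-gateway", "order-service"),
    ("order-service", "payment-service"),
    ("payment-service", "stripe"),
]

# Edge weight = number of observed calls between two nodes.
edges = Counter(calls)

def downstream(service):
    """Return the sorted list of nodes a service calls directly."""
    return sorted({dst for (src, dst) in edges if src == service})

for (src, dst), count in sorted(edges.items()):
    print(f"{src} -> {dst} ({count} calls)")
```

Here "stripe" only ever appears as a destination, which matches how external services show up as separate nodes.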

Correlations

APM Correlations helps you find why some transactions are slow or failing.

Latency Correlations

  1. Open a service → Correlations tab
  2. Select metric: Latency
  3. Kibana automatically finds fields correlated with high latency:

Correlated fields (high latency):
┌─────────────────────────────┬────────────┬─────────────┐
│ Field                       │ Value      │ Impact      │
│ host.name                   │ prod-web-3 │ +450ms      │
│ url.path                    │ /api/search│ +320ms      │
│ user_agent.name             │ IE 11      │ +200ms      │
│ geoip.country_iso_code      │ AU         │ +150ms      │
└─────────────────────────────┴────────────┴─────────────┘

Insight: Requests to prod-web-3 are consistently slower.
         Possible cause: resource constraints on that host.
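
To make the "Impact" column more concrete, here is a deliberately simplified sketch that compares the mean latency for each field value against the mean of all other samples. Kibana's correlations feature relies on its own statistical analysis, so treat this only as an intuition aid; the samples are made up:

```python
from collections import defaultdict
from statistics import mean

# Made-up transaction samples: (host.name, duration in ms).
samples = [
    ("prod-web-1", 120), ("prod-web-1", 140),
    ("prod-web-2", 130), ("prod-web-2", 150),
    ("prod-web-3", 580), ("prod-web-3", 600),
]

def latency_impact(samples):
    """Mean latency per field value minus the mean of all other samples."""
    by_value = defaultdict(list)
    for value, duration in samples:
        by_value[value].append(duration)
    impact = {}
    for value, durations in by_value.items():
        rest = [d for v, d in samples if v != value]
        if not rest:
            continue  # nothing to compare against
        impact[value] = mean(durations) - mean(rest)
    return impact

impact = latency_impact(samples)
worst = max(impact, key=impact.get)
print(worst, f"+{impact[worst]:.0f}ms")
```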

Error Correlations

Find what's different about failing transactions:

Correlated fields (errors):
┌─────────────────────────────┬────────────┬─────────────┐
│ Field                       │ Value      │ Impact      │
│ service.version             │ 2.3.1      │ 5x errors   │
│ host.name                   │ prod-api-2 │ 3x errors   │
│ user.id                     │ bot-crawler│ 2x errors   │
└─────────────────────────────┴────────────┴─────────────┘

Insight: Version 2.3.1 has significantly more errors.
         This was a recent deployment - likely a regression.
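
A figure such as "5x errors" can be read as the ratio between the failure rate for one field value and the failure rate for another. A small sketch under that assumption, with made-up transaction counts (the real feature's scoring may differ):

```python
from collections import Counter

# Made-up transactions: (service.version, failed?).
transactions = (
    [("2.3.0", False)] * 98 + [("2.3.0", True)] * 2 +
    [("2.3.1", False)] * 90 + [("2.3.1", True)] * 10
)

def error_rates(transactions):
    """Failed-transaction ratio per field value."""
    totals, failures = Counter(), Counter()
    for value, failed in transactions:
        totals[value] += 1
        failures[value] += failed  # True counts as 1, False as 0
    return {value: failures[value] / totals[value] for value in totals}

rates = error_rates(transactions)
ratio = rates["2.3.1"] / rates["2.3.0"]
print(f"2.3.1 fails {ratio:.0f}x as often as 2.3.0")
```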

Custom Instrumentation

Adding Custom Spans

Node.js:

const apm = require('elastic-apm-node');

async function processOrder(order) {
  // startSpan() returns null when there is no active transaction,
  // so every end() call below uses optional chaining.
  const span = apm.startSpan('process-order');

  try {
    // Validate
    const validateSpan = apm.startSpan('validate-order');
    await validateOrder(order);
    validateSpan?.end();

    // Charge payment
    const paymentSpan = apm.startSpan('charge-payment', 'external');
    const charge = await chargePayment(order.total);
    paymentSpan?.end();

    // Save to database
    const dbSpan = apm.startSpan('save-order', 'db');
    await saveOrder(order, charge);
    dbSpan?.end();

  } catch (err) {
    apm.captureError(err);
    throw err;
  } finally {
    span?.end();
  }
}

Python:

import elasticapm

@elasticapm.capture_span('process_order')
def process_order(order):
    with elasticapm.capture_span('validate_order'):
        validate_order(order)
    
    with elasticapm.capture_span('charge_payment', span_type='external'):
        charge = charge_payment(order['total'])
    
    with elasticapm.capture_span('save_order', span_type='db'):
        save_order(order, charge)

Custom Labels and Context

Add business context to transactions:

// Node.js - Add labels to current transaction
apm.setLabel('order_id', order.id);
apm.setLabel('order_total', order.total);
apm.setLabel('customer_tier', customer.tier);

// Add user context
apm.setUserContext({
  id: user.id,
  username: user.name,
  email: user.email,
});

// Add custom context (not indexed, but visible in UI)
apm.setCustomContext({
  cart_items: cart.items.length,
  payment_method: order.paymentMethod,
  promo_code: order.promoCode,
});

# Python - Add labels
elasticapm.label(order_id=order.id, customer_tier=customer.tier)

# Set user context
elasticapm.set_user_context(
    user_id=user.id,
    username=user.name,
    email=user.email,
)

Real User Monitoring (RUM)

Monitor frontend performance from the user's browser:

Setting Up RUM

npm install @elastic/apm-rum

import { init as initApm } from '@elastic/apm-rum';

const apm = initApm({
  serviceName: 'my-frontend',
  serverUrl: 'https://apm.example.com:8200',
  serviceVersion: '1.0.0',
  environment: 'production',
  
  // Distributed tracing (connect frontend to backend traces)
  distributedTracingOrigins: ['https://api.example.com'],
});

RUM Metrics

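┌──────────────────────────┬────────────────────────────┬─────────────┐
│ Metric                   │ Description                │ Good Target │
├──────────────────────────┼────────────────────────────┼─────────────┤
│ Page load time           │ Total time to load page    │ < 3s        │
│ First Contentful Paint   │ First visual content       │ < 1.8s      │
│ Largest Contentful Paint │ Main content loaded        │ < 2.5s      │
│ First Input Delay        │ Delay handling first input │ < 100ms     │
│ Cumulative Layout Shift  │ Visual stability           │ < 0.1       │
│ Time to First Byte       │ Server response time       │ < 600ms     │
└──────────────────────────┴────────────────────────────┴─────────────┘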
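
These targets can be turned into a simple pass/fail check, for example in a performance budget script. A minimal sketch, assuming measurements are already available as plain numbers; the snake_case metric keys and the failing_metrics helper are made up for this example:

```python
# "Good" targets from the table above: metric -> (limit, unit).
GOOD_TARGETS = {
    "page_load_time": (3.0, "s"),
    "first_contentful_paint": (1.8, "s"),
    "largest_contentful_paint": (2.5, "s"),
    "first_input_delay": (100, "ms"),
    "cumulative_layout_shift": (0.1, ""),
    "time_to_first_byte": (600, "ms"),
}

def failing_metrics(measured):
    """Return the metrics whose measured value is not below its target."""
    return sorted(
        name for name, value in measured.items()
        if value >= GOOD_TARGETS[name][0]
    )

# Made-up measurements for one page, in the units listed above.
checkout = {
    "largest_contentful_paint": 4.2,
    "first_input_delay": 80,
    "cumulative_layout_shift": 0.05,
}
print(failing_metrics(checkout))
```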

RUM in Kibana

Navigate to APM → Services → your frontend service:

User Experience Dashboard:
┌────────────────────────────────────────────────────┐
│  Page Load Distribution                            │
│  ██████████████████████████  < 1s (45%)            │
│  ████████████████            1-3s (30%)            │
│  ████████                    3-5s (15%)            │
│  ████                        5-10s (8%)            │
│  ██                          > 10s (2%)            │
├────────────────────────────────────────────────────┤
│  Slowest Pages:                                    │
│  /checkout          4.2s avg                       │
│  /search?q=...      3.1s avg                       │
│  /product/detail    2.8s avg                       │
├────────────────────────────────────────────────────┤
│  By Browser:       By Country:     By Connection:  │
│  Chrome  120ms     US   145ms      4G    150ms     │
│  Firefox 135ms     UK   210ms      3G    450ms     │
│  Safari  155ms     AU   380ms      WiFi  120ms     │
└────────────────────────────────────────────────────┘

APM Alerts

Latency Alert

Rule: High API Latency
Type: APM latency threshold

Service: api-gateway
Transaction type: request
Environment: production

Condition:
  WHEN p95 latency IS ABOVE 2000ms
  FOR THE LAST 5 minutes

Actions:
  Slack: "⚠️ API latency p95 at {{context.value}}ms (threshold: 2000ms)"

Error Rate Alert

Rule: Error Rate Spike
Type: APM error rate threshold

Service: payment-service
Environment: production

Condition:
  WHEN error rate IS ABOVE 5%
  FOR THE LAST 5 minutes

Actions:
  PagerDuty: Critical
  Slack: "🔴 Error rate: {{context.value}}% for payment-service"

Failed Transaction Rate

Rule: Checkout Failures
Type: APM failed transaction rate threshold

Service: order-service
Transaction name: POST /api/checkout
Environment: production

Condition:
  WHEN failed transaction rate IS ABOVE 10%

Actions:
  Webhook: POST to incident management
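
The threshold conditions in these rules boil down to computing a rate over a time window and comparing it with a limit. A sketch of that check for the failed transaction rate, assuming each transaction in the window is represented by an outcome string of "success" or "failure"; the function names are illustrative and the real rule evaluation happens inside Kibana:

```python
def failed_rate_percent(outcomes):
    """Percentage of failed transactions in a window of outcomes."""
    if not outcomes:
        return 0.0
    failures = sum(1 for outcome in outcomes if outcome == "failure")
    return failures / len(outcomes) * 100

def should_alert(outcomes, threshold_percent):
    """True when the failed transaction rate IS ABOVE the threshold."""
    return failed_rate_percent(outcomes) > threshold_percent

# Made-up window: 3 failures out of 20 checkout transactions.
window = ["failure"] * 3 + ["success"] * 17
print(should_alert(window, threshold_percent=10))
```

An empty window is treated as 0% here so that a quiet period does not trigger the rule; whether that is the right choice depends on the service.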

Practical Examples

Example 1: Debugging a Slow Endpoint

Scenario: Users report /api/search is slow.

1. APM → Services → api-gateway → Transactions
2. Find "GET /api/search" - avg latency 3.2s (SLA: 500ms)
3. Click transaction → View sample traces
4. Waterfall shows:
   - api-gateway: 50ms
   - search-service: 3150ms
     - Elasticsearch query: 3100ms  ← bottleneck
5. Click ES query span → See actual query
6. Query is missing index filter, scanning too many indices
7. Fix: Add date-based index pattern to query
8. Result: Latency drops to 200ms

Example 2: Correlating Errors with Deployments

1. APM → Services → payment-service → Errors
2. Spike in "ConnectionError: timeout" starting 2 hours ago
3. Check the service.version field → version 2.4.0 was deployed 2 hours ago
4. Click error → Stack trace shows new retry logic has a bug
5. Compare: version 2.3.9 had 0.1% errors, 2.4.0 has 5.2%
6. Action: Roll back to 2.3.9, fix bug, redeploy

Example 3: End-to-End Trace Analysis

1. User reports order #12345 failed
2. APM → Traces → Search for label "order_id: 12345"
3. Full trace:
   - Frontend (RUM): 8.5s total
   - API Gateway: 200ms
   - Order Service: 8.1s
     - Inventory check: 15ms
     - Payment: 8000ms  ← timeout!
       - Stripe API: timeout after 8000ms
   - Error: PaymentTimeoutError
4. Root cause: Stripe API was experiencing an outage
5. Fix: Add circuit breaker, reduce timeout, show user-friendly error
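
The circuit breaker mentioned in step 5 can be sketched in a few lines. This is a minimal single-threaded illustration and not a production library; the class name, parameters, and defaults are all assumptions made for the example:

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
            # A single failure now re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

A hypothetical call site would wrap the payment request, for example breaker.call(charge_payment, order_total), and map CircuitOpenError to a user-friendly message instead of waiting 8 seconds for a timeout.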

Tips and Best Practices

Instrumentation

✅ Start with auto-instrumentation (agents handle common frameworks)
✅ Add custom spans for business-critical code paths
✅ Set meaningful labels (order_id, customer_tier, feature_flag)
✅ Configure sampling rates appropriately
✅ Use distributed tracing to connect frontend and backend

❌ Don't instrument every function (too much overhead)
❌ Don't log sensitive data in labels (PII, credentials)
❌ Don't set sample rate to 100% in high-traffic production
❌ Don't forget to instrument async operations

Sampling Rates

Development:     1.0  (100% - capture everything)
Staging:         1.0  (100%)
Production:
  Low traffic:   1.0  (100%)
  Medium:        0.5  (50%)
  High traffic:  0.1  (10%)
  Very high:     0.01 (1%)

Dynamic sampling:
  Errors: Always capture (sample rate doesn't affect error capture)
  Slow transactions: Increase sampling for slow ones
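
What a sample rate means in practice: the decision is made once per transaction, and roughly that fraction of transactions is kept. A tiny sketch of such a decision, written for illustration only; the agents implement sampling internally, and is_sampled is a made-up name:

```python
import random

def is_sampled(sample_rate, rng=random):
    """Decide once, at transaction start, whether to record full details."""
    # rng.random() returns a float in [0.0, 1.0), so a rate of 1.0 keeps
    # everything and a rate of 0.0 keeps nothing.
    return rng.random() < sample_rate

rng = random.Random(42)  # fixed seed so the run is repeatable
kept = sum(is_sampled(0.1, rng) for _ in range(10_000))
print(f"kept {kept} of 10000 transactions")
```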

Performance Overhead

Typical APM agent overhead:
  CPU:    1-3% increase
  Memory: 10-50MB additional
  Latency: < 1ms per transaction

To minimize:
  - Use lower sample rates in production
  - Disable body capture unless needed
  - Limit custom span depth
  - Use async reporting (default)

Common Issues

No data showing in APM

Checklist:

  1. APM Server is running and accessible from application
  2. Agent is initialized before other code (Node.js: must be first require)
  3. Secret token matches between agent and APM Server
  4. Network allows traffic on port 8200
  5. Check APM Server logs for errors

# Test APM Server connectivity
curl http://localhost:8200/

# Expected (8.x): {"build_date":"...","build_sha":"...","publish_ready":true,"version":"8.11.0"}
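
The same connectivity check can be scripted, which is handy when curl is not available inside the application's container. A minimal sketch using only the standard library; check_apm_server is a name invented here, and the default URL assumes the local setup used throughout this chapter:

```python
import urllib.error
import urllib.request

def check_apm_server(url="http://localhost:8200/", timeout_s=3.0):
    """Return (reachable, detail) for a GET against the APM Server root."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 401 or 503).
        return True, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, blocked port, ...
        return False, str(err)

if __name__ == "__main__":
    reachable, detail = check_apm_server()
    print("reachable" if reachable else "unreachable", "-", detail)
```

Run it from the same host or container as the application, since that is the network path the agent uses.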

Traces not connecting across services

Fix: Ensure distributed tracing headers are propagated:

  • traceparent (W3C standard)
  • elastic-apm-traceparent (legacy Elastic format)

Check that HTTP clients forward these headers between services.
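
Agents normally handle propagation for supported HTTP clients. For a client that is not instrumented, forwarding means keeping the trace ID and sending a new span ID as the parent. A sketch of that for the W3C traceparent header, whose layout is version-traceid-parentid-flags; the helper names are invented for this example:

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Split a W3C traceparent header into its four fields, or return None."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C Trace Context format.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}

def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    fields = parse_traceparent(incoming)
    if fields is None:
        return None
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"{fields['version']}-{fields['trace_id']}-{new_span_id}-{fields['flags']}"

incoming = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
print(child_traceparent(incoming))
```

If the trace ID changes between two services, the UI has no way to join their spans into one trace, which is the symptom described above.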

High cardinality warnings

Cause: Too many unique transaction names (e.g., URL includes IDs).

Fix: Normalize transaction names:

// Bad: /api/users/12345, /api/users/67890 (unique per user)
// Good: /api/users/:id (grouped)

// Node.js - Set transaction name explicitly
apm.setTransactionName('GET /api/users/:id');
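
When a framework's route pattern is not available, names can be normalized by replacing ID-like path segments before setting the transaction name. A rough sketch of that approach; the regular expression only covers numeric IDs and UUIDs, and normalize_transaction_name is a hypothetical helper:

```python
import re

# Path segments that look like identifiers: all digits, or a UUID.
ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)

def normalize_transaction_name(method, path):
    """Replace ID-like path segments with ':id' and drop the query string."""
    path = path.split("?", 1)[0]
    segments = [":id" if ID_SEGMENT.match(seg) else seg for seg in path.split("/")]
    return f"{method.upper()} {'/'.join(segments)}"

print(normalize_transaction_name("get", "/api/users/12345"))
print(normalize_transaction_name("GET", "/api/users/67890?fields=name"))
```

Both calls produce the same grouped name, so the two requests end up in one transaction group.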

Summary

In this chapter, you learned:

  • ✅ APM architecture: agents, APM Server, Elasticsearch, Kibana
  • ✅ Setting up APM agents for Node.js, Python, Java, and Go
  • ✅ Navigating services, transactions, and error views
  • ✅ Reading transaction waterfall traces
  • ✅ Using the service map and correlations
  • ✅ Custom instrumentation with spans, labels, and context
  • ✅ Real User Monitoring for frontend performance
  • ✅ Setting up APM-specific alerts
  • ✅ Debugging performance issues with practical workflows

Next: Best practices and tips for getting the most out of Kibana!