Application Performance Monitoring (APM)

APM in Kibana provides real-time visibility into application performance: response times, error rates, transaction traces, and service dependencies. APM agents instrument your application code to collect this data, and Kibana displays it in dedicated views.

APM Architecture

┌────────────────────────────────────────────────────────────────┐
│                       Your Applications                        │
│                                                                │
│    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    │
│    │ Node.js │    │ Python  │    │  Java   │    │   Go    │    │
│    │   App   │    │   App   │    │   App   │    │   App   │    │
│    │  + APM  │    │  + APM  │    │  + APM  │    │  + APM  │    │
│    │  Agent  │    │  Agent  │    │  Agent  │    │  Agent  │    │
│    └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘    │
│         └──────────────┴──────┬───────┴──────────────┘         │
│                               │                                │
│                               ▼                                │
│                        ┌──────────────┐                        │
│                        │  APM Server  │  (or Elastic Agent     │
│                        │              │  with APM integration) │
│                        └──────┬───────┘                        │
│                               │                                │
│                               ▼                                │
│                        ┌──────────────┐                        │
│                        │Elasticsearch │  (stores APM data)     │
│                        └──────┬───────┘                        │
│                               │                                │
│                               ▼                                │
│                        ┌──────────────┐                        │
│                        │    Kibana    │  (APM UI)              │
│                        │   APM App    │                        │
│                        └──────────────┘                        │
└────────────────────────────────────────────────────────────────┘

Setting Up APM

Step 1: APM Server

Install APM Server or use Elastic Agent with APM integration:

Docker:

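# Settings are passed as -E flags after the image name,
# not as container environment variables (-e before the image).
docker run -d \
 --name apm-server \
 --net elastic \
 -p 8200:8200 \
 docker.elastic.co/apm/apm-server:8.11.0 \
 -e \
 -E 'output.elasticsearch.hosts=["http://elasticsearch:9200"]'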

Elastic Agent (recommended for 8.x):

  1. Go to Fleet → Agent policies
  2. Add integration → Elastic APM
  3. Configure:
     Host: 0.0.0.0:8200
     Secret token: your-secret-token

Step 2: Instrument Your Application

Install the APM agent for your language:

Node.js:

npm install elastic-apm-node

// Add at the VERY TOP of your main file (before any other require)
const apm = require('elastic-apm-node').start({
 serviceName: 'my-api',
 serverUrl: 'http://localhost:8200',
 secretToken: 'your-secret-token',
 environment: 'production',
 // Capture request body (careful with PII)
 captureBody: 'errors',
 // Sample rate (1.0 = 100%)
 transactionSampleRate: 1.0,
});

Python:

pip install elastic-apm

# Flask
from flask import Flask
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)
app.config['ELASTIC_APM'] = {
 'SERVICE_NAME': 'my-flask-app',
 'SERVER_URL': 'http://localhost:8200',
 'SECRET_TOKEN': 'your-secret-token',
 'ENVIRONMENT': 'production',
}
apm = ElasticAPM(app)

# Django - settings.py
INSTALLED_APPS = [
 'elasticapm.contrib.django',
 # ...
]

ELASTIC_APM = {
 'SERVICE_NAME': 'my-django-app',
 'SERVER_URL': 'http://localhost:8200',
 'SECRET_TOKEN': 'your-secret-token',
 'ENVIRONMENT': 'production',
}
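
The snippets above hard-code the secret token for brevity. A minimal sketch of building the same settings dict from environment variables follows; the helper name `apm_config_from_env` and the fallback defaults are assumptions made for illustration, while the `ELASTIC_APM_*` variable names are the ones the agents read themselves.

```python
import os


def apm_config_from_env(default_service, env=os.environ):
    """Build an ELASTIC_APM settings dict without hard-coding the secret token."""
    return {
        "SERVICE_NAME": env.get("ELASTIC_APM_SERVICE_NAME", default_service),
        "SERVER_URL": env.get("ELASTIC_APM_SERVER_URL", "http://localhost:8200"),
        # Empty by default on purpose: a missing token should be visible, not guessed.
        "SECRET_TOKEN": env.get("ELASTIC_APM_SECRET_TOKEN", ""),
        "ENVIRONMENT": env.get("ELASTIC_APM_ENVIRONMENT", "development"),
    }
```

With Flask this could be used as `app.config['ELASTIC_APM'] = apm_config_from_env('my-flask-app')`, and with Django as `ELASTIC_APM = apm_config_from_env('my-django-app')`.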

Java:

# Download the agent JAR
curl -O https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/1.44.0/elastic-apm-agent-1.44.0.jar
# Run with agent attached
java -javaagent:/path/to/elastic-apm-agent-1.44.0.jar \
 -Delastic.apm.service_name=my-java-app \
 -Delastic.apm.server_url=http://localhost:8200 \
 -Delastic.apm.secret_token=your-secret-token \
 -Delastic.apm.environment=production \
 -jar my-app.jar

Go:

go get go.elastic.co/apm/v2
go get go.elastic.co/apm/module/apmhttp/v2

package main

import (
  "net/http"

  "go.elastic.co/apm/module/apmhttp/v2"
)

// handleUsers is a placeholder handler for illustration.
func handleUsers(w http.ResponseWriter, r *http.Request) {
  w.Write([]byte("[]"))
}

func main() {
  // The agent is configured through environment variables:
  // ELASTIC_APM_SERVICE_NAME=my-go-app
  // ELASTIC_APM_SERVER_URL=http://localhost:8200
  // ELASTIC_APM_SECRET_TOKEN=your-secret-token

  mux := http.NewServeMux()
  mux.HandleFunc("/api/users", handleUsers)

  // Wrap with APM middleware
  handler := apmhttp.Wrap(mux)
  http.ListenAndServe(":8080", handler)
}

Supported Agents

Language | Auto-Instrumentation              | Manual Spans
---------|-----------------------------------|-------------
Node.js  | Express, Koa, Hapi, Fastify, HTTP | Yes
Python   | Django, Flask, Starlette          | Yes
Java     | Spring, Servlet, JAX-RS           | Yes
Go       | net/http, Gin, Echo, gRPC         | Yes
.NET     | ASP.NET Core, EF Core             | Yes
Ruby     | Rails, Sinatra, Grape             | Yes
PHP      | WordPress, Laravel                | Yes
Rust     | Actix, Rocket                     | Community
RUM (JS) | Browser, React, Angular, Vue      | Yes

The APM Interface

Services View

Navigate to Observability → APM → Services:

┌─────────────────────────────────────────────────────────┐
│ Services                                [Environment ▼] │
├─────────────────────────────────────────────────────────┤
│ Service          │ Env  │ Latency  │ Throughput │ Errors│
│──────────────────│──────│──────────│────────────│───────│
│ api-gateway      │ prod │ 145ms    │ 1.2k/min   │ 0.3%  │
│ payment-service  │ prod │ 320ms    │ 450/min    │ 1.2%  │
│ user-service     │ prod │ 85ms     │ 2.1k/min   │ 0.1%  │
│ order-service    │ prod │ 210ms    │ 800/min    │ 0.5%  │
│ notification-svc │ prod │ 50ms     │ 300/min    │ 0.0%  │
└─────────────────────────────────────────────────────────┘

Service Detail

Click a service to see detailed performance data:

┌─────────────────────────────────────────────────────────┐
│ api-gateway                                             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│ Latency: 145ms (avg)      │ Throughput: 1.2k/min        │
│ ────────────────────      │ ────────────────────        │
│ p50: 85ms                 │                             │
│ p95: 450ms                │ Error rate: 0.3%            │
│ p99: 1.2s                 │ ────────────────────        │
│                                                         │
│ ┌─ Latency Distribution ───────────────────────────┐    │
│ │     ██                                           │    │
│ │    ████                                          │    │
│ │   ████████                                       │    │
│ │  ████████████████                                │    │
│ │  ██████████████████████                          │    │
│ │  0ms   100ms   500ms   1s    2s    5s            │    │
│ └──────────────────────────────────────────────────┘    │
│                                                         │
│ Transactions:                                           │
│ ┌────────────────────────┬────────┬──────┬──────────┐   │
│ │ Name                   │ Latency│ TPM  │ Errors   │   │
│ │ GET /api/users         │ 85ms   │ 450  │ 0.1%     │   │
│ │ POST /api/orders       │ 320ms  │ 200  │ 0.8%     │   │
│ │ GET /api/products      │ 120ms  │ 350  │ 0.2%     │   │
│ │ POST /api/checkout     │ 1.2s   │ 100  │ 2.1%     │   │
│ └────────────────────────┴────────┴──────┴──────────┘   │
└─────────────────────────────────────────────────────────┘
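
The Service Detail view reports an average next to p50, p95, and p99 because a slow tail can hide behind a healthy-looking average. The sketch below uses a simple nearest-rank percentile on made-up latencies to show the effect; the sample data and the `percentile` helper are illustrative assumptions, not how Kibana computes its aggregations.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Hypothetical latencies in ms: mostly fast requests plus a slow tail.
latencies_ms = sorted([80] * 90 + [400] * 8 + [1500] * 2)

avg = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"avg={avg}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

Here the median stays at 80 ms while p99 is 1500 ms, which is why latency alerts are usually set on a high percentile rather than on the average.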

Transactions

Transaction Types

APM categorizes operations:

Type      | Description              | Example
----------|--------------------------|--------------------------
request   | HTTP request/response    | GET /api/users
messaging | Message queue operations | Process order from queue
scheduled | Cron/scheduled tasks     | Nightly report generation
custom    | User-defined             | Background data sync

Transaction Waterfall

Click a transaction to see the detailed trace waterfall:

Transaction: POST /api/checkout (1.2s total)
├── api-gateway (12ms)
│   └── HTTP POST /api/orders → order-service
│       ├── order-service (180ms)
│       │   ├── SELECT * FROM products (15ms)
│       │   ├── SELECT * FROM inventory (8ms)
│       │   └── HTTP POST /api/payments → payment-service
│       │       ├── payment-service (850ms)
│       │       │   ├── Stripe API call (780ms) ← bottleneck!
│       │       │   └── INSERT INTO payments (25ms)
│       │       └── response (payment-service → order-service)
│       ├── INSERT INTO orders (12ms)
│       └── HTTP POST /api/notifications → notification-svc
│           └── notification-svc (45ms)
│               └── SendGrid API call (35ms)
└── response (api-gateway → client)

Reading the Waterfall

Each span shows:

  • Service name and color coding
  • Span type (db, http, custom)
  • Duration (absolute and as percentage)
  • Status (success, error)
  • Details (SQL query, HTTP method/URL, etc.)

Click a span for details:

Span: Stripe API call
Type: external.http
Duration: 780ms (65% of transaction)
URL: https://api.stripe.com/v1/charges
Method: POST
Status: 200

Labels:
 stripe.charge_id: ch_1234
 amount: 9999
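
The "65% of transaction" figure and the bottleneck marker in the waterfall both come from comparing span durations. As a rough sketch, the bottleneck can be found by computing each span's self time (its duration minus its direct children). The `Span` class and helper names below are hypothetical, and the calculation assumes child spans run sequentially.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list = field(default_factory=list)


def self_time(span):
    """Time spent in the span itself, excluding its direct children."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def walk(span):
    """Yield the span and all of its descendants."""
    yield span
    for child in span.children:
        yield from walk(child)


# Durations taken from the payment-service part of the waterfall.
payment = Span("payment-service", 850, [
    Span("Stripe API call", 780),
    Span("INSERT INTO payments", 25),
])
total_ms = 1200  # POST /api/checkout

bottleneck = max(walk(payment), key=self_time)
share = round(100 * bottleneck.duration_ms / total_ms)
print(f"{bottleneck.name}: {share}% of transaction")
```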

Errors

Error View

Navigate to a service → Errors tab:

┌─────────────────────────────────────────────────────────┐
│ Errors: payment-service                                 │
├─────────────────────────────────────────────────────────┤
│ Error Group           │ Count │ Last Seen │ Handled     │
│───────────────────────│───────│───────────│─────────────│
│ StripeCardError:      │ 45    │ 2m ago    │ Unhandled   │
│   Card declined       │       │           │             │
│ ConnectionError:      │ 12    │ 15m ago   │ Handled     │
│   DB timeout          │       │           │             │
│ ValidationError:      │ 8     │ 1h ago    │ Handled     │
│   Invalid amount      │       │           │             │
└─────────────────────────────────────────────────────────┘

Error Detail

Click an error group to see:

  • Stack trace (with source code context if source maps are configured)
  • Exception message and type
  • Transaction that caused the error
  • User and request context
  • Occurrence timeline

Error: StripeCardError
Message: Your card was declined.

Stack trace:
 at processPayment (src/services/payment.js:145:11)
 at async createOrder (src/routes/orders.js:67:5)
 at async handle (src/middleware/handler.js:23:7)

Context:
 User: user-12345
 Request: POST /api/checkout
 Body: { amount: 9999, currency: "usd" }

Metadata:
 service.version: 2.3.1
 host.name: prod-payment-02
 container.id: abc123def456

Service Map

The service map shows dependencies between services:

Navigate to: APM → Service Map

                 ┌─────────────┐
                 │   Client    │
                 │  (Browser)  │
                 └──────┬──────┘
                        │
                 ┌──────▼──────┐
                 │ API Gateway │
                 └──────┬──────┘
       ┌────────────────┼────────────────┐
┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐
│    User     │  │    Order    │  │   Product   │
│   Service   │  │   Service   │  │   Service   │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
┌──────▼──────┐  ┌──────▼──────┐  ┌──────▼──────┐
│   User DB   │  │  Order DB   │  │Product Cache│
│ (Postgres)  │  │   (MySQL)   │  │   (Redis)   │
└─────────────┘  └──────┬──────┘  └─────────────┘
                        │
                 ┌──────▼──────┐
                 │   Payment   │
                 │   Service   │
                 └──────┬──────┘
                        │
                 ┌──────▼──────┐
                 │   Stripe    │
                 │    (ext)    │
                 └─────────────┘

Features:

  • Nodes sized by throughput
  • Edges colored by health (green/yellow/red)
  • Click a node to see service details
  • Click an edge to see transaction details between services
  • External services shown as separate nodes

Correlations

APM Correlations helps you find why some transactions are slow or failing.

Latency Correlations

  1. Open a service → Correlations tab
  2. Select metric: Latency
  3. Kibana automatically finds fields correlated with high latency:

Correlated fields (high latency):
┌─────────────────────────────┬────────────┬─────────────┐
│ Field                       │ Value      │ Impact      │
│ host.name                   │ prod-web-3 │ +450ms      │
│ url.path                    │ /api/search│ +320ms      │
│ user_agent.name             │ IE 11      │ +200ms      │
│ geoip.country_iso_code      │ AU         │ +150ms      │
└─────────────────────────────┴────────────┴─────────────┘

Insight: Requests to prod-web-3 are consistently slower.
 Possible cause: resource constraints on that host.

Error Correlations

Find what's different about failing transactions:

Correlated fields (errors):
┌─────────────────────────────┬────────────┬─────────────┐
│ Field                       │ Value      │ Impact      │
│ service.version             │ 2.3.1      │ 5x errors   │
│ host.name                   │ prod-api-2 │ 3x errors   │
│ user.id                     │ bot-crawler│ 2x errors   │
└─────────────────────────────┴────────────┴─────────────┘

Insight: Version 2.3.1 has significantly more errors.
 This was a recent deployment - likely a regression.
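
To build intuition for what an "Nx errors" impact means, the sketch below compares the failure rate for each value of a field against the overall failure rate. This is a deliberately simplified illustration on made-up data; Kibana's own correlation scoring is statistical and more involved, and the `error_lift` helper is a hypothetical name.

```python
from collections import Counter


def error_lift(transactions, field_name):
    """Failure rate per field value, divided by the overall failure rate."""
    failed = sum(1 for t in transactions if t["failed"])
    if failed == 0:
        return {}  # nothing failed, so there is nothing to correlate
    overall = failed / len(transactions)
    seen, bad = Counter(), Counter()
    for t in transactions:
        seen[t[field_name]] += 1
        bad[t[field_name]] += 1 if t["failed"] else 0
    return {value: (bad[value] / seen[value]) / overall for value in seen}


# Hypothetical sample: 5% failures on 2.3.0, 25% failures on 2.3.1.
txs = (
    [{"service.version": "2.3.0", "failed": False}] * 95
    + [{"service.version": "2.3.0", "failed": True}] * 5
    + [{"service.version": "2.3.1", "failed": False}] * 75
    + [{"service.version": "2.3.1", "failed": True}] * 25
)
lift = error_lift(txs, "service.version")
ratio = lift["2.3.1"] / lift["2.3.0"]  # how much worse 2.3.1 is than 2.3.0
```

In this sample the newer version fails about five times as often as the older one, which is the kind of signal the Correlations tab surfaces.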

Custom Instrumentation

Adding Custom Spans

Node.js:

const apm = require('elastic-apm-node');

async function processOrder(order) {
  // startSpan() returns null when no transaction is active,
  // so every end() call below uses optional chaining.
  const span = apm.startSpan('process-order');

  try {
    // Validate
    const validateSpan = apm.startSpan('validate-order');
    await validateOrder(order);
    validateSpan?.end();

    // Charge payment
    const paymentSpan = apm.startSpan('charge-payment', 'external');
    const charge = await chargePayment(order.total);
    paymentSpan?.end();

    // Save to database
    const dbSpan = apm.startSpan('save-order', 'db');
    await saveOrder(order, charge);
    dbSpan?.end();
  } catch (err) {
    apm.captureError(err);
    throw err;
  } finally {
    span?.end();
  }
}

Python:

import elasticapm

@elasticapm.capture_span('process_order')
def process_order(order):
    with elasticapm.capture_span('validate_order'):
        validate_order(order)

    with elasticapm.capture_span('charge_payment', span_type='external'):
        charge = charge_payment(order['total'])

    with elasticapm.capture_span('save_order', span_type='db'):
        save_order(order, charge)

Custom Labels and Context

Add business context to transactions:

// Node.js - Add labels to current transaction
apm.setLabel('order_id', order.id);
apm.setLabel('order_total', order.total);
apm.setLabel('customer_tier', customer.tier);

// Add user context
apm.setUserContext({
 id: user.id,
 username: user.name,
 email: user.email,
});

// Add custom context (not indexed, but visible in UI)
apm.setCustomContext({
 cart_items: cart.items.length,
 payment_method: order.paymentMethod,
 promo_code: order.promoCode,
});

# Python - Add labels
elasticapm.label(order_id=order.id, customer_tier=customer.tier)

# Set user context
elasticapm.set_user_context(
 user_id=user.id,
 username=user.name,
 email=user.email,
)

Real User Monitoring (RUM)

Monitor frontend performance from the user's browser:

Setting Up RUM

npm install @elastic/apm-rum

import { init as initApm } from '@elastic/apm-rum';

const apm = initApm({
 serviceName: 'my-frontend',
 serverUrl: 'https://apm.example.com:8200',
 serviceVersion: '1.0.0',
 environment: 'production',
 
 // Distributed tracing (connect frontend to backend traces)
 distributedTracingOrigins: ['https://api.example.com'],
});

RUM Metrics

Metric                   | Description                               | Good Target
-------------------------|-------------------------------------------|------------
Page load time           | Total time to load page                   | < 3s
First Contentful Paint   | First visual content                      | < 1.8s
Largest Contentful Paint | Main content loaded                       | < 2.5s
First Input Delay        | Delay before the first input is handled   | < 100ms
Cumulative Layout Shift  | Visual stability                          | < 0.1
Time to First Byte       | Server response time                      | < 600ms
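
As a small sketch of how these targets might be applied, for example in a synthetic check or a CI report, the code below flags metrics that miss the "good" target. The short metric keys and the `over_target` helper are hypothetical names chosen for this example; the thresholds are the ones listed above.

```python
# "Good" targets from the RUM metrics list; seconds for timings, unitless for CLS.
GOOD_TARGETS = {
    "page_load": 3.0,
    "fcp": 1.8,
    "lcp": 2.5,
    "fid": 0.1,
    "cls": 0.1,
    "ttfb": 0.6,
}


def over_target(measurements):
    """Return the names of metrics whose measured value is not below its target."""
    return sorted(
        name for name, value in measurements.items() if value >= GOOD_TARGETS[name]
    )


# Hypothetical measurements for a single page.
page = {"page_load": 4.2, "lcp": 2.1, "cls": 0.25, "ttfb": 0.3}
failing = over_target(page)
```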

RUM in Kibana

Navigate to APM → Services → your frontend service:

User Experience Dashboard:
┌────────────────────────────────────────────────────┐
│ Page Load Distribution                             │
│ ██████████████████████████  < 1s (45%)             │
│ ████████████████            1-3s (30%)             │
│ ████████                    3-5s (15%)             │
│ ████                        5-10s (8%)             │
│ ██                          > 10s (2%)             │
├────────────────────────────────────────────────────┤
│ Slowest Pages:                                     │
│   /checkout         4.2s avg                       │
│   /search?q=...     3.1s avg                       │
│   /product/detail   2.8s avg                       │
├────────────────────────────────────────────────────┤
│ By Browser:      By Country:      By Connection:   │
│ Chrome   120ms   US  145ms        4G    150ms      │
│ Firefox  135ms   UK  210ms        3G    450ms      │
│ Safari   155ms   AU  380ms        WiFi  120ms      │
└────────────────────────────────────────────────────┘

APM Alerts

Latency Alert

Rule: High API Latency
Type: APM latency threshold

Service: api-gateway
Transaction type: request
Environment: production

Condition:
 WHEN p95 latency IS ABOVE 2000ms
 FOR THE LAST 5 minutes

Actions:
 Slack: "API latency p95 at {{context.value}}ms (threshold: 2000ms)"

Error Rate Alert

Rule: Error Rate Spike
Type: APM error rate threshold

Service: payment-service
Environment: production

Condition:
 WHEN error rate IS ABOVE 5%
 FOR THE LAST 5 minutes

Actions:
 PagerDuty: Critical
 Slack: "Error rate: {{context.value}}% for payment-service"

Failed Transaction Rate

Rule: Checkout Failures
Type: APM failed transaction rate threshold

Service: order-service
Transaction name: POST /api/checkout
Environment: production

Condition:
 WHEN failed transaction rate IS ABOVE 10%

Actions:
 Webhook: POST to incident management
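
The threshold rules above all follow the same pattern: compute a rate over a look-back window and compare it to a limit. The sketch below illustrates that evaluation for a failed transaction rate; it is not Kibana's rule engine, and the event shape and function names are assumptions for this example.

```python
from datetime import datetime, timedelta


def failed_rate_pct(events, now, window=timedelta(minutes=5)):
    """Percentage of failed transactions among those inside the look-back window."""
    recent = [e for e in events if now - window <= e["timestamp"] <= now]
    if not recent:
        return 0.0
    return 100 * sum(1 for e in recent if e["failed"]) / len(recent)


def should_alert(events, now, threshold_pct):
    """True when the windowed failure rate is above the threshold."""
    return failed_rate_pct(events, now) > threshold_pct


now = datetime(2024, 1, 1, 12, 0)
events = (
    [{"timestamp": now - timedelta(minutes=1), "failed": True}] * 3
    + [{"timestamp": now - timedelta(minutes=2), "failed": False}] * 17
    # Older failures fall outside the 5-minute window and are ignored.
    + [{"timestamp": now - timedelta(minutes=30), "failed": True}] * 50
)
```

With this data the windowed rate is 15%, so a 10% threshold fires while a 20% threshold does not.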

Practical Examples

Example 1: Debugging a Slow Endpoint

Scenario: Users report /api/search is slow.

1. APM → Services → api-gateway → Transactions
2. Find "GET /api/search" - avg latency 3.2s (SLA: 500ms)
3. Click transaction → View sample traces
4. Waterfall shows:
 - api-gateway: 50ms
 - search-service: 3150ms
   - Elasticsearch query: 3100ms ← bottleneck
5. Click ES query span → See actual query
6. Query is missing index filter, scanning too many indices
7. Fix: Add date-based index pattern to query
8. Result: Latency drops to 200ms

Example 2: Correlating Errors with Deployments

1. APM → Services → payment-service → Errors
2. Spike in "ConnectionError: timeout" starting 2 hours ago
3. Check service.version label → version 2.4.0 deployed 2 hours ago
4. Click error → Stack trace shows new retry logic has a bug
5. Compare: version 2.3.9 had 0.1% errors, 2.4.0 has 5.2%
6. Action: Roll back to 2.3.9, fix bug, redeploy

Example 3: End-to-End Trace Analysis

1. User reports order #12345 failed
2. APM → Traces → Search for label "order_id: 12345"
3. Full trace:
 - Frontend (RUM): 8.5s total
   - API Gateway: 200ms
     - Order Service: 8.1s
       - Inventory check: 15ms
       - Payment: 8000ms ← timeout!
         - Stripe API: timeout after 8000ms
         - Error: PaymentTimeoutError
4. Root cause: Stripe API was experiencing an outage
5. Fix: Add circuit breaker, reduce timeout, show a clear error to the user
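
The circuit breaker mentioned in the fix can be sketched in a few lines. This is a minimal, single-threaded illustration with hypothetical names and thresholds; a production service would more likely use an established resilience library and add locking.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures instead of waiting on every timeout."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: dependency unavailable")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

A payment call would then be wrapped as `breaker.call(charge_payment, order_total)`, where `charge_payment` stands for whatever function talks to the payment provider. While the circuit is open, callers get an immediate error they can turn into a clear message for the user.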

Tips and Best Practices

Instrumentation

  • Start with auto-instrumentation (agents handle common frameworks)
  • Add custom spans for business-critical code paths
  • Set meaningful labels (order_id, customer_tier, feature_flag)
  • Configure sampling rates appropriately
  • Use distributed tracing to connect frontend and backend

  • Don't instrument every function (too much overhead)
  • Don't log sensitive data in labels (PII, credentials)
  • Don't set the sample rate to 100% in high-traffic production
  • Don't forget to instrument async operations
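
One way to keep sensitive data out of labels is to filter them before they reach the agent. The sketch below drops any label whose key looks sensitive; the marker list and the `safe_labels` helper are assumptions for illustration and would need tuning for a real data model.

```python
SENSITIVE_MARKERS = ("password", "secret", "token", "email", "card", "ssn")


def safe_labels(labels):
    """Drop labels whose key suggests PII or credentials; keep the rest unchanged."""
    return {
        key: value
        for key, value in labels.items()
        if not any(marker in key.lower() for marker in SENSITIVE_MARKERS)
    }


labels = safe_labels({
    "order_id": "12345",
    "customer_tier": "gold",
    "customer_email": "a@example.com",
    "card_number": "4242...",
})
```

The filtered dict can then be passed on, for example as `elasticapm.label(**labels)` in Python.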

Sampling Rates

Development: 1.0 (100% - capture everything)
Staging: 1.0 (100%)
Production:
 Low traffic: 1.0 (100%)
 Medium: 0.5 (50%)
 High traffic: 0.1 (10%)
 Very high: 0.01 (1%)

Dynamic sampling:
 Errors: Always capture (sample rate doesn't affect error capture)
 Slow transactions: Increase sampling for slow ones
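
To see what a sample rate means in practice, the sketch below shows a head-based sampling decision (made once, when the transaction starts) and the expected number of fully traced transactions per day. The function names are hypothetical; the throughput figure reuses the api-gateway example of 1.2k requests per minute.

```python
import random


def is_sampled(sample_rate, rng=random.random):
    """Head-based sampling: decide once, at the start of the transaction."""
    return rng() < sample_rate


def sampled_per_day(throughput_per_min, sample_rate):
    """Expected number of fully traced transactions stored per day."""
    return round(throughput_per_min * 60 * 24 * sample_rate)


full = sampled_per_day(1200, 1.0)   # every transaction traced
tenth = sampled_per_day(1200, 0.1)  # roughly one in ten traced
print(full, tenth)
```

At 1.2k requests per minute, dropping the rate from 1.0 to 0.1 reduces stored traces from about 1.7 million to about 170 thousand per day.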

Performance Overhead

Typical APM agent overhead:
 CPU: 1-3% increase
 Memory: 10-50MB additional
 Latency: < 1ms per transaction

To minimize:
 - Use lower sample rates in production
 - Disable body capture unless needed
 - Limit custom span depth
 - Use async reporting (default)

Common Issues

No data showing in APM

Checklist:

  1. APM Server is running and accessible from application
  2. Agent is initialized before other code (Node.js: must be first require)
  3. Secret token matches between agent and APM Server
  4. Network allows traffic on port 8200
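  5. Check APM Server logs for errors

# Test APM Server connectivity (send the secret token if one is configured)
curl -H "Authorization: Bearer your-secret-token" http://localhost:8200/

# Expected (8.x): {"build_date":"...","build_sha":"...","publish_ready":true,"version":"8.11.0"}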

Traces not connecting across services

Fix: Ensure distributed tracing headers are propagated:

  • traceparent (W3C standard)
  • elastic-apm-traceparent (legacy Elastic format)

Check that HTTP clients forward these headers between services.
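
APM agents propagate these headers automatically for the HTTP clients they instrument. For a client that is not instrumented, the sketch below shows what manual forwarding of the W3C `traceparent` header could look like: keep the trace ID, replace the parent ID with a new span ID for this hop. The function names are hypothetical, header keys are assumed to be lowercase, and only basic format validation is done.

```python
import re
import secrets

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Return (trace_id, parent_id, flags), or None if the header is malformed."""
    match = TRACEPARENT_RE.match(value.strip())
    if not match:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, flags


def outgoing_headers(incoming_headers):
    """Build headers for a downstream call that continue the incoming trace."""
    parsed = parse_traceparent(incoming_headers.get("traceparent", ""))
    if parsed is None:
        return {}  # no valid trace context to continue
    trace_id, _parent_id, flags = parsed
    new_span_id = secrets.token_hex(8)  # 16 hex characters for this hop
    return {"traceparent": f"00-{trace_id}-{new_span_id}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
forwarded = outgoing_headers(incoming)
```

If the trace ID differs between the incoming and outgoing header of a service, that service is where the trace breaks.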

High cardinality warnings

Cause: Too many unique transaction names (e.g., URL includes IDs).

Fix: Normalize transaction names:

// Bad: /api/users/12345, /api/users/67890 (unique per user)
// Good: /api/users/:id (grouped)

// Node.js - Set transaction name explicitly
apm.setTransactionName('GET /api/users/:id');
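
The same normalization can be done generically instead of per route. The sketch below replaces numeric and UUID path segments with `:id` and drops the query string; the regular expression and the `normalize_transaction_name` helper are illustrative assumptions, and routes with other kinds of identifiers would need extra patterns.

```python
import re

# Matches purely numeric segments and canonical UUIDs.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_transaction_name(method, path):
    """Replace numeric and UUID path segments with ':id' and drop the query string."""
    path = path.split("?", 1)[0]
    segments = [":id" if _ID_SEGMENT.match(s) else s for s in path.split("/")]
    return f"{method.upper()} {'/'.join(segments)}"


name_a = normalize_transaction_name("get", "/api/users/12345")
name_b = normalize_transaction_name("GET", "/api/users/67890")
```

Both calls produce the same name, so the two requests end up in one transaction group.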

Next Steps

Continue to 14-best-practices.md for best practices and production tips.