Application Performance Monitoring (APM)
APM in Kibana provides real-time visibility into application performance: response times, error rates, transaction traces, and service dependencies. APM agents instrument your application code to collect this data, and Kibana displays it through dedicated views.
APM Architecture
┌────────────────────────────────────────────────────────────┐
│                     Your Applications                      │
│                                                            │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐     │
│  │ Node.js │   │ Python  │   │  Java   │   │   Go    │     │
│  │   App   │   │   App   │   │   App   │   │   App   │     │
│  │  + APM  │   │  + APM  │   │  + APM  │   │  + APM  │     │
│  │  Agent  │   │  Agent  │   │  Agent  │   │  Agent  │     │
│  └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘     │
│       └─────────────┴──────┬──────┴─────────────┘          │
│                            │                               │
│                            ▼                               │
│                     ┌──────────────┐                       │
│                     │  APM Server  │ (or Elastic Agent     │
│                     │              │ with APM integration) │
│                     └──────┬───────┘                       │
│                            │                               │
│                            ▼                               │
│                     ┌──────────────┐                       │
│                     │Elasticsearch │ (stores APM data)     │
│                     └──────┬───────┘                       │
│                            │                               │
│                            ▼                               │
│                     ┌──────────────┐                       │
│                     │    Kibana    │ (APM UI)              │
│                     │   APM App    │                       │
│                     └──────────────┘                       │
└────────────────────────────────────────────────────────────┘
Setting Up APM
Step 1: APM Server
Install APM Server or use Elastic Agent with APM integration:
Docker:
# Settings are passed as -E overrides after the image name
docker run -d \
  --name apm-server \
  --net elastic \
  -p 8200:8200 \
  docker.elastic.co/apm/apm-server:8.11.0 \
  -E 'output.elasticsearch.hosts=["http://elasticsearch:9200"]'
Elastic Agent (recommended for 8.x):
- Go to Fleet → Agent policies
- Add integration → Elastic APM
- Configure:
Host: 0.0.0.0:8200
Secret token: your-secret-token
Step 2: Instrument Your Application
Install the APM agent for your language:
Node.js:
npm install elastic-apm-node
// Add at the VERY TOP of your main file (before any other require)
const apm = require('elastic-apm-node').start({
  serviceName: 'my-api',
  serverUrl: 'http://localhost:8200',
  secretToken: 'your-secret-token',
  environment: 'production',
  // Capture request body (careful with PII)
  captureBody: 'errors',
  // Sample rate (1.0 = 100%)
  transactionSampleRate: 1.0,
});
Python:
pip install elastic-apm
# Flask
from flask import Flask
from elasticapm.contrib.flask import ElasticAPM
app = Flask(__name__)
app.config['ELASTIC_APM'] = {
    'SERVICE_NAME': 'my-flask-app',
    'SERVER_URL': 'http://localhost:8200',
    'SECRET_TOKEN': 'your-secret-token',
    'ENVIRONMENT': 'production',
}
apm = ElasticAPM(app)
# Django - settings.py
INSTALLED_APPS = [
    'elasticapm.contrib.django',
    # ...
]
ELASTIC_APM = {
    'SERVICE_NAME': 'my-django-app',
    'SERVER_URL': 'http://localhost:8200',
    'SECRET_TOKEN': 'your-secret-token',
    'ENVIRONMENT': 'production',
}
Java:
# Download the agent JAR
curl -O https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/1.44.0/elastic-apm-agent-1.44.0.jar
# Run with agent attached
java -javaagent:/path/to/elastic-apm-agent-1.44.0.jar \
-Delastic.apm.service_name=my-java-app \
-Delastic.apm.server_urls=http://localhost:8200 \
-Delastic.apm.secret_token=your-secret-token \
-Delastic.apm.environment=production \
-jar my-app.jar
Go:
go get go.elastic.co/apm/v2
go get go.elastic.co/apm/module/apmhttp/v2
package main

import (
	"net/http"

	"go.elastic.co/apm/module/apmhttp/v2"
)

func main() {
	// The agent reads its configuration from environment variables:
	// ELASTIC_APM_SERVICE_NAME=my-go-app
	// ELASTIC_APM_SERVER_URL=http://localhost:8200
	// ELASTIC_APM_SECRET_TOKEN=your-secret-token
	mux := http.NewServeMux()
	mux.HandleFunc("/api/users", handleUsers) // handleUsers is your own handler
	// Wrap with APM middleware
	handler := apmhttp.Wrap(mux)
	http.ListenAndServe(":8080", handler)
}
Supported Agents
| Language | Auto-Instrumentation | Manual Spans |
|---|---|---|
| Node.js | Express, Koa, Hapi, Fastify, HTTP | Yes |
| Python | Django, Flask, Starlette | Yes |
| Java | Spring, Servlet, JAX-RS | Yes |
| Go | net/http, Gin, Echo, gRPC | Yes |
| .NET | ASP.NET Core, EF Core | Yes |
| Ruby | Rails, Sinatra, Grape | Yes |
| PHP | WordPress, Laravel | Yes |
| Rust | Actix, Rocket | Community |
| RUM (JS) | Browser, React, Angular, Vue | Yes |
The APM Interface
Services View
Navigate to Observability → APM → Services:
┌─────────────────────────────────────────────────────────┐
│ Services [Environment ▼] │
├─────────────────────────────────────────────────────────┤
│ Service │ Env │ Latency │ Throughput │ Errors│
│──────────────────│──────│──────────│────────────│───────│
│ api-gateway │ prod │ 145ms │ 1.2k/min │ 0.3% │
│ payment-service │ prod │ 320ms │ 450/min │ 1.2% │
│ user-service │ prod │ 85ms │ 2.1k/min │ 0.1% │
│ order-service │ prod │ 210ms │ 800/min │ 0.5% │
│ notification-svc │ prod │ 50ms │ 300/min │ 0.0% │
└─────────────────────────────────────────────────────────┘
Service Detail
Click a service to see detailed performance data:
┌─────────────────────────────────────────────────────────┐
│ api-gateway │
├─────────────────────────────────────────────────────────┤
│ │
│ Latency: 145ms (avg) │ Throughput: 1.2k/min │
│ ──────────────────── │ ──────────────────── │
│ p50: 85ms │ │
│ p95: 450ms │ Error rate: 0.3% │
│ p99: 1.2s │ ──────────────────── │
│ │
│ ┌─ Latency Distribution ──────────────────────────┐ │
│ │ ██ │ │
│ │ ████ │ │
│ │ ████████ │ │
│ │ ████████████████ │ │
│ │ ██████████████████████ │ │
│ │ 0ms 100ms 500ms 1s 2s 5s │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Transactions: │
│ ┌────────────────────────┬────────┬──────┬──────────┐ │
│ │ Name │ Latency│ TPM │ Errors │ │
│ │ GET /api/users │ 85ms │ 450 │ 0.1% │ │
│ │ POST /api/orders │ 320ms │ 200 │ 0.8% │ │
│ │ GET /api/products │ 120ms │ 350 │ 0.2% │ │
│ │ POST /api/checkout │ 1.2s │ 100 │ 2.1% │ │
│ └────────────────────────┴────────┴──────┴──────────┘ │
└─────────────────────────────────────────────────────────┘
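The detail view reports percentiles alongside the average because a few slow requests can distort the mean. The following is a minimal Python sketch of that effect, not how Kibana computes its aggregations; the sample durations and the function name `latency_summary` are made up for illustration.

```python
import statistics

def latency_summary(durations_ms):
    """Return avg, p50, p95 and p99 for a list of transaction durations (ms)."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {
        "avg": statistics.fmean(durations_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Hypothetical sample: mostly fast requests plus a few slow outliers.
sample = [80] * 90 + [400] * 8 + [1200] * 2
summary = latency_summary(sample)
print(summary)
```

With this sample the average lands well above the median, which is why p95 and p99 are usually the better signals for user-facing latency.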
Transactions
Transaction Types
APM categorizes operations:
| Type | Description | Example |
|---|---|---|
| request | HTTP request/response | GET /api/users |
| messaging | Message queue operations | Process order from queue |
| scheduled | Cron/scheduled tasks | Nightly report generation |
| custom | User-defined | Background data sync |
Transaction Waterfall
Click a transaction to see the detailed trace waterfall:
Transaction: POST /api/checkout (1.2s total)
├── api-gateway (12ms)
│ └── HTTP POST /api/orders → order-service
│ ├── order-service (180ms)
│ │ ├── SELECT * FROM products (15ms)
│ │ ├── SELECT * FROM inventory (8ms)
│ │ └── HTTP POST /api/payments → payment-service
│ │ ├── payment-service (850ms)
│ │ │ ├── Stripe API call (780ms) ← bottleneck!
│ │ │ └── INSERT INTO payments (25ms)
│ │ └── response (payment-service → order-service)
│ ├── INSERT INTO orders (12ms)
│ └── HTTP POST /api/notifications → notification-svc
│ └── notification-svc (45ms)
│ └── SendGrid API call (35ms)
└── response (api-gateway → client)
Reading the Waterfall
Each span shows:
- Service name and color coding
- Span type (db, http, custom)
- Duration (absolute and as percentage)
- Status (success, error)
- Details (SQL query, HTTP method/URL, etc.)
Click a span for details:
Span: Stripe API call
Type: external.http
Duration: 780ms (65% of transaction)
URL: https://api.stripe.com/v1/charges
Method: POST
Status: 200
Labels:
stripe.charge_id: ch_1234
amount: 9999
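The percentage shown next to a span is simply its duration divided by the transaction's total duration. As a rough sketch (the helper names are hypothetical and the durations are copied from the checkout waterfall above), finding the bottleneck by hand looks like this:

```python
def span_share(span_ms, transaction_ms):
    """Percentage of the transaction's total duration spent in one span."""
    return round(100 * span_ms / transaction_ms)

def slowest_span(spans):
    """Return the (name, duration_ms) pair with the longest duration."""
    return max(spans.items(), key=lambda item: item[1])

# Leaf spans from the POST /api/checkout waterfall (durations in ms).
leaf_spans = {
    "SELECT * FROM products": 15,
    "SELECT * FROM inventory": 8,
    "Stripe API call": 780,
    "INSERT INTO payments": 25,
    "INSERT INTO orders": 12,
    "SendGrid API call": 35,
}

name, duration = slowest_span(leaf_spans)
print(name, span_share(duration, 1200))
```

Comparing leaf spans rather than parent spans avoids blaming a service for time it spent waiting on its own downstream calls.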
Errors
Error View
Navigate to a service → Errors tab:
┌─────────────────────────────────────────────────────────┐
│ Errors: payment-service │
├─────────────────────────────────────────────────────────┤
│ Error Group │ Count │ Last Seen │ Handled │
│───────────────────────│───────│───────────│─────────────│
│ StripeCardError: │ 45 │ 2m ago │ Unhandled │
│ Card declined │ │ │ │
│ ConnectionError: │ 12 │ 15m ago │ Handled │
│ DB timeout │ │ │ │
│ ValidationError: │ 8 │ 1h ago │ Handled │
│ Invalid amount │ │ │ │
└─────────────────────────────────────────────────────────┘
Error Detail
Click an error group to see:
- Stack trace (with source code context if source maps configured)
- Exception message and type
- Transaction that caused the error
- User and request context
- Occurrence timeline
Error: StripeCardError
Message: Your card was declined.
Stack trace:
at processPayment (src/services/payment.js:145:11)
at async createOrder (src/routes/orders.js:67:5)
at async handle (src/middleware/handler.js:23:7)
Context:
User: user-12345
Request: POST /api/checkout
Body: { amount: 9999, currency: "usd" }
Metadata:
service.version: 2.3.1
host.name: prod-payment-02
container.id: abc123def456
Service Map
The service map shows dependencies between services:
Navigate to: APM → Service Map
┌──────────┐
│ Client │
│ (Browser)│
└────┬─────┘
│
┌────▼─────┐
│ API │
│ Gateway │
└────┬─────┘
╱ │ ╲
┌───────▼──┐ ┌─▼──────┐ ┌▼──────────┐
│ User │ │ Order │ │ Product │
│ Service │ │Service │ │ Service │
└────┬─────┘ └───┬───┘ └─────┬─────┘
│ │ │
┌────▼─────┐ ┌───▼───┐ ┌───▼──────┐
│ User DB │ │Order │ │Product │
│(Postgres)│ │ DB │ │Cache │
└──────────┘ │(MySQL)│ │(Redis) │
└───┬───┘ └──────────┘
│
┌──────▼──────┐
│ Payment │
│ Service │
└──────┬─────┘
│
┌──────▼──────┐
│ Stripe │
│ (ext) │
└─────────────┘
Features:
- Nodes sized by throughput
- Edges colored by health (green/yellow/red)
- Click a node to see service details
- Click an edge to see transaction details between services
- External services shown as separate nodes
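Conceptually, the map is a graph whose edges are derived from outgoing calls between services. The sketch below only illustrates that idea and is not Kibana's implementation; the record shape (`service`, `destination`) is an assumption made for the example.

```python
from collections import Counter

def build_edges(calls):
    """Count calls per (caller, callee) pair to form weighted map edges."""
    return Counter((call["service"], call["destination"]) for call in calls)

# Hypothetical outgoing-call records, one per call between services.
calls = [
    {"service": "api-gateway", "destination": "order-service"},
    {"service": "api-gateway", "destination": "user-service"},
    {"service": "order-service", "destination": "payment-service"},
    {"service": "api-gateway", "destination": "order-service"},
    {"service": "payment-service", "destination": "stripe"},
]

edges = build_edges(calls)
for (caller, callee), count in edges.most_common():
    print(f"{caller} -> {callee}: {count} calls")
```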
Correlations
APM Correlations helps you find why some transactions are slow or failing.
Latency Correlations
- Open a service → Correlations tab
- Select metric: Latency
- Kibana automatically finds fields correlated with high latency:
Correlated fields (high latency):
┌─────────────────────────────┬────────────┬─────────────┐
│ Field │ Value │ Impact │
│ host.name │ prod-web-3 │ +450ms │
│ url.path │ /api/search│ +320ms │
│ user_agent.name │ IE 11 │ +200ms │
│ geoip.country_iso_code │ AU │ +150ms │
└─────────────────────────────┴────────────┴─────────────┘
Insight: Requests to prod-web-3 are consistently slower.
Possible cause: resource constraints on that host.
Error Correlations
Find what's different about failing transactions:
Correlated fields (errors):
┌─────────────────────────────┬────────────┬─────────────┐
│ Field │ Value │ Impact │
│ service.version │ 2.3.1 │ 5x errors │
│ host.name │ prod-api-2 │ 3x errors │
│ user.id │ bot-crawler│ 2x errors │
└─────────────────────────────┴────────────┴─────────────┘
Insight: Version 2.3.1 has significantly more errors.
This was a recent deployment - likely a regression.
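A figure such as "5x errors" can be read as the error rate for one field value relative to the overall error rate. The following sketch shows that comparison under simplifying assumptions: the transaction records, the `failed` flag, and the function name are hypothetical, and Kibana's own correlation analysis is more involved.

```python
from collections import defaultdict

def error_rate_ratios(transactions, field):
    """For each value of `field`, compare its error rate with the overall rate."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for tx in transactions:
        totals[tx[field]] += 1
        failures[tx[field]] += 1 if tx["failed"] else 0
    overall = sum(failures.values()) / len(transactions)
    if overall == 0:
        return {}  # nothing failed, so there is nothing to correlate
    return {
        value: (failures[value] / totals[value]) / overall
        for value in totals
    }

# Hypothetical data: version 2.3.1 fails far more often than 2.3.0.
transactions = (
    [{"service.version": "2.3.0", "failed": False}] * 78
    + [{"service.version": "2.3.0", "failed": True}] * 2
    + [{"service.version": "2.3.1", "failed": False}] * 12
    + [{"service.version": "2.3.1", "failed": True}] * 8
)
ratios = error_rate_ratios(transactions, "service.version")
print(ratios)
```

A ratio well above 1 for a single value, as for the newer version here, is the kind of signal the Correlations view surfaces.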
Custom Instrumentation
Adding Custom Spans
Node.js:
const apm = require('elastic-apm-node');

async function processOrder(order) {
  // Create a custom span (startSpan returns null when no transaction is active)
  const span = apm.startSpan('process-order');
  try {
    // Validate
    const validateSpan = apm.startSpan('validate-order');
    await validateOrder(order);
    validateSpan?.end();
    // Charge payment
    const paymentSpan = apm.startSpan('charge-payment', 'external');
    const charge = await chargePayment(order.total);
    paymentSpan?.end();
    // Save to database
    const dbSpan = apm.startSpan('save-order', 'db');
    await saveOrder(order, charge);
    dbSpan?.end();
  } catch (err) {
    apm.captureError(err);
    throw err;
  } finally {
    span?.end();
  }
}
Python:
import elasticapm

@elasticapm.capture_span('process_order')
def process_order(order):
    with elasticapm.capture_span('validate_order'):
        validate_order(order)
    with elasticapm.capture_span('charge_payment', span_type='external'):
        charge = charge_payment(order['total'])
    with elasticapm.capture_span('save_order', span_type='db'):
        save_order(order, charge)
Custom Labels and Context
Add business context to transactions:
// Node.js - Add labels to current transaction
apm.setLabel('order_id', order.id);
apm.setLabel('order_total', order.total);
apm.setLabel('customer_tier', customer.tier);
// Add user context
apm.setUserContext({
  id: user.id,
  username: user.name,
  email: user.email,
});
// Add custom context (not indexed, but visible in UI)
apm.setCustomContext({
  cart_items: cart.items.length,
  payment_method: order.paymentMethod,
  promo_code: order.promoCode,
});
# Python - Add labels
elasticapm.label(order_id=order.id, customer_tier=customer.tier)
# Set user context
elasticapm.set_user_context(
    user_id=user.id,
    username=user.name,
    email=user.email,
)
Real User Monitoring (RUM)
Monitor frontend performance from the user's browser:
Setting Up RUM
npm install @elastic/apm-rum
import { init as initApm } from '@elastic/apm-rum';

const apm = initApm({
  serviceName: 'my-frontend',
  serverUrl: 'https://apm.example.com:8200',
  serviceVersion: '1.0.0',
  environment: 'production',
  // Distributed tracing (connect frontend to backend traces)
  distributedTracingOrigins: ['https://api.example.com'],
});
RUM Metrics
| Metric | Description | Good Target |
|---|---|---|
| Page load time | Total time to load page | < 3s |
| First Contentful Paint | First visual content | < 1.8s |
| Largest Contentful Paint | Main content loaded | < 2.5s |
| First Input Delay | Delay before the browser responds to the first user interaction | < 100ms |
| Cumulative Layout Shift | Visual stability | < 0.1 |
| Time to First Byte | Server response time | < 600ms |
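If you export these metrics for your own reporting, the targets in the table can be checked mechanically. This is a small sketch, assuming hypothetical metric keys such as `lcp_s` and `fid_ms`; the thresholds are the "Good Target" values from the table.

```python
# Upper bounds for a "good" measurement, taken from the table above.
GOOD_TARGETS = {
    "page_load_s": 3.0,
    "fcp_s": 1.8,
    "lcp_s": 2.5,
    "fid_ms": 100,
    "cls": 0.1,
    "ttfb_ms": 600,
}

def metrics_over_target(measurements):
    """Return the names of metrics that miss their target, sorted."""
    return sorted(
        name
        for name, value in measurements.items()
        if value >= GOOD_TARGETS[name]
    )

# Hypothetical measurements for one page.
page = {"lcp_s": 3.1, "cls": 0.05, "ttfb_ms": 450, "fid_ms": 180}
print(metrics_over_target(page))
```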
RUM in Kibana
Navigate to APM → Services → your frontend service:
User Experience Dashboard:
┌────────────────────────────────────────────────────┐
│ Page Load Distribution │
│ ██████████████████████████ < 1s (45%) │
│ ████████████████ 1-3s (30%) │
│ ████████ 3-5s (15%) │
│ ████ 5-10s (8%) │
│ ██ > 10s (2%) │
├────────────────────────────────────────────────────┤
│ Slowest Pages: │
│ /checkout 4.2s avg │
│ /search?q=... 3.1s avg │
│ /product/detail 2.8s avg │
├────────────────────────────────────────────────────┤
│ By Browser: By Country: By Connection: │
│ Chrome 120ms US 145ms 4G 150ms │
│ Firefox 135ms UK 210ms 3G 450ms │
│ Safari 155ms AU 380ms WiFi 120ms │
└────────────────────────────────────────────────────┘
APM Alerts
Latency Alert
Rule: High API Latency
Type: APM latency threshold
Service: api-gateway
Transaction type: request
Environment: production
Condition:
WHEN p95 latency IS ABOVE 2000ms
FOR THE LAST 5 minutes
Actions:
Slack: "⚠️ API latency p95 at {{context.triggerValue}} (threshold: 2000ms)"
Error Rate Alert
Rule: Error Rate Spike
Type: APM error rate threshold
Service: payment-service
Environment: production
Condition:
WHEN error rate IS ABOVE 5%
FOR THE LAST 5 minutes
Actions:
PagerDuty: Critical
Slack: "🔴 Error rate: {{context.triggerValue}} for payment-service"
Failed Transaction Rate
Rule: Checkout Failures
Type: APM failed transaction rate threshold
Service: order-service
Transaction name: POST /api/checkout
Environment: production
Condition:
WHEN failed transaction rate IS ABOVE 10%
Actions:
Webhook: POST to incident management
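To make the threshold conditions above concrete, here is a rough Python sketch of how a failed transaction rate check could be evaluated over one time window. It assumes each transaction carries an outcome of `success`, `failure`, or `unknown`, and that unknown outcomes are left out of the rate; the function names are hypothetical and this is not the rule engine's actual code.

```python
def failed_rate_percent(outcomes):
    """Failed transactions as a percentage of those with a known outcome."""
    known = [o for o in outcomes if o in ("success", "failure")]
    if not known:
        return 0.0
    failed = sum(1 for o in known if o == "failure")
    return 100 * failed / len(known)

def should_alert(outcomes, threshold_percent):
    """True when the failed transaction rate is above the threshold."""
    return failed_rate_percent(outcomes) > threshold_percent

# Hypothetical 5-minute window for POST /api/checkout.
window = ["success"] * 17 + ["failure"] * 3 + ["unknown"] * 2
print(failed_rate_percent(window), should_alert(window, 10))
```

The comparison is strict, matching the "IS ABOVE" wording in the rule: a rate exactly equal to the threshold does not fire.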
Practical Examples
Example 1: Debugging a Slow Endpoint
Scenario: Users report /api/search is slow.
1. APM → Services → api-gateway → Transactions
2. Find "GET /api/search" - avg latency 3.2s (SLA: 500ms)
3. Click transaction → View sample traces
4. Waterfall shows:
- api-gateway: 50ms
- search-service: 3150ms
- Elasticsearch query: 3100ms ← bottleneck
5. Click ES query span → See actual query
6. Query is missing index filter, scanning too many indices
7. Fix: Add date-based index pattern to query
8. Result: Latency drops to 200ms
Example 2: Correlating Errors with Deployments
1. APM → Services → payment-service → Errors
2. Spike in "ConnectionError: timeout" starting 2 hours ago
3. Check the service.version field → version 2.4.0 deployed 2 hours ago
4. Click error → Stack trace shows new retry logic has a bug
5. Compare: version 2.3.9 had 0.1% errors, 2.4.0 has 5.2%
6. Action: Roll back to 2.3.9, fix bug, redeploy
Example 3: End-to-End Trace Analysis
1. User reports order #12345 failed
2. APM → Traces → Search with the query labels.order_id: "12345"
3. Full trace:
- Frontend (RUM): 8.5s total
- API Gateway: 200ms
- Order Service: 8.1s
- Inventory check: 15ms
- Payment: 8000ms ← timeout!
- Stripe API: timeout after 8000ms
- Error: PaymentTimeoutError
4. Root cause: Stripe API was experiencing an outage
5. Fix: Add circuit breaker, reduce timeout, show user-friendly error
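The circuit breaker mentioned in the fix can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the class name, the thresholds, and the injectable `clock` parameter are all choices made for the example.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, allow a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of waiting for another timeout.
                raise RuntimeError("circuit open: dependency unavailable")
            # Cooldown elapsed: let one trial call through (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Wrapping the payment call in `breaker.call(...)` means that, once the provider has failed repeatedly, later checkouts get an immediate error the UI can handle rather than an 8-second wait.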
Tips and Best Practices
Instrumentation
✅ Start with auto-instrumentation (agents handle common frameworks)
✅ Add custom spans for business-critical code paths
✅ Set meaningful labels (order_id, customer_tier, feature_flag)
✅ Configure sampling rates appropriately
✅ Use distributed tracing to connect frontend and backend
❌ Don't instrument every function (too much overhead)
❌ Don't log sensitive data in labels (PII, credentials)
❌ Don't set sample rate to 100% in high-traffic production
❌ Don't forget to instrument async operations
Sampling Rates
Development: 1.0 (100% - capture everything)
Staging: 1.0 (100%)
Production:
Low traffic: 1.0 (100%)
Medium: 0.5 (50%)
High traffic: 0.1 (10%)
Very high: 0.01 (1%)
Dynamic sampling:
Errors: Always capture (sample rate doesn't affect error capture)
Slow transactions: Increase sampling for slow ones
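The sample rate is a probability applied when a transaction starts, so its effect on stored data is easy to estimate. The sketch below illustrates the idea only; the function names are hypothetical and the agents' internal sampling logic may differ.

```python
import random

def is_sampled(sample_rate, rng=random):
    """Head-based sampling: decide at the start of the transaction."""
    return rng.random() < sample_rate

def expected_sampled_per_minute(throughput_tpm, sample_rate):
    """Expected number of fully recorded transactions per minute."""
    return throughput_tpm * sample_rate

# A service handling 1,200 transactions/min at a 10% sample rate.
print(expected_sampled_per_minute(1200, 0.1))

# Simulate one minute of traffic with a seeded generator.
rng = random.Random(42)
kept = sum(is_sampled(0.1, rng) for _ in range(1200))
print(kept)
```

The simulated count will hover around the expected value rather than match it exactly, which is worth remembering when reading throughput on low-traffic services.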
Performance Overhead
Typical APM agent overhead:
CPU: 1-3% increase
Memory: 10-50MB additional
Latency: < 1ms per transaction
To minimize:
- Use lower sample rates in production
- Disable body capture unless needed
- Limit custom span depth
- Use async reporting (default)
Common Issues
No data showing in APM
Checklist:
- APM Server is running and accessible from application
- Agent is initialized before other code (Node.js: must be first require)
- Secret token matches between agent and APM Server
- Network allows traffic on port 8200
- Check APM Server logs for errors
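# Test APM Server connectivity (send the secret token to see build details)
curl -H "Authorization: Bearer your-secret-token" http://localhost:8200/
# Expected: {"build_date":"...","build_sha":"...","publish_ready":true,"version":"8.11.0"}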
Traces not connecting across services
Fix: Ensure distributed tracing headers are propagated:
- traceparent (W3C standard)
- elastic-apm-traceparent (Elastic format)
Check that HTTP clients forward these headers between services.
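APM agents normally propagate these headers for instrumented HTTP clients, but when you debug a broken trace it helps to know what a valid value looks like. The W3C `traceparent` header has four hyphen-separated hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags. The following sketch, with hypothetical helper names, parses an incoming header and builds the value for an outgoing call; it skips some validation the specification requires, such as rejecting all-zero IDs.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Split a traceparent header into its four fields, or return None."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
    }

def child_traceparent(incoming):
    """Header for an outgoing call: same trace ID, newly generated span ID."""
    fields = parse_traceparent(incoming)
    if fields is None:
        return None
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"{fields['version']}-{fields['trace_id']}-{new_span_id}-{fields['flags']}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))
```

If the trace ID changes between two services, the header was dropped or regenerated somewhere in between, typically by a proxy or an HTTP client that is not instrumented.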
High cardinality warnings
Cause: Too many unique transaction names (e.g., URL includes IDs).
Fix: Normalize transaction names:
// Bad: /api/users/12345, /api/users/67890 (unique per user)
// Good: /api/users/:id (grouped)
// Node.js - Set transaction name explicitly
apm.setTransactionName('GET /api/users/:id');
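When a framework's routes are not detected automatically, the same grouping can be applied with a small helper before setting the transaction name. This is a sketch, assuming that IDs appear as purely numeric or UUID path segments; the function name and the `:id` placeholder are illustrative.

```python
import re

# Path segments that look like numeric IDs or UUIDs (adjust for your API).
ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)

def normalize_transaction_name(method, path):
    """Replace ID-like path segments with ':id' so requests group together."""
    segments = path.split("?")[0].split("/")
    normalized = [":id" if ID_SEGMENT.match(seg) else seg for seg in segments]
    return f"{method.upper()} {'/'.join(normalized)}"

print(normalize_transaction_name("get", "/api/users/12345"))
print(normalize_transaction_name("GET", "/api/users/67890?expand=orders"))
```

Both calls produce the same name, so the two requests fall into one transaction group instead of two.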
Summary
In this chapter, you learned:
- ✅ APM architecture: agents, APM Server, Elasticsearch, Kibana
- ✅ Setting up APM agents for Node.js, Python, Java, and Go
- ✅ Navigating services, transactions, and error views
- ✅ Reading transaction waterfall traces
- ✅ Using the service map and correlations
- ✅ Custom instrumentation with spans, labels, and context
- ✅ Real User Monitoring for frontend performance
- ✅ Setting up APM-specific alerts
- ✅ Debugging performance issues with practical workflows
Next: Best practices and tips for getting the most out of Kibana!