Application Performance Monitoring (APM)
APM in Kibana provides real-time visibility into application performance: response times, error rates, transaction traces, and service dependencies. APM agents instrument your application code to collect this data, and Kibana displays it through dedicated views.
APM Architecture
┌──────────────────────────────────────────────────────────────────┐
│                        Your Applications                         │
│                                                                  │
│        ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐        │
│        │ Node.js │  │ Python  │  │  Java   │  │   Go    │        │
│        │   App   │  │   App   │  │   App   │  │   App   │        │
│        │  + APM  │  │  + APM  │  │  + APM  │  │  + APM  │        │
│        │  Agent  │  │  Agent  │  │  Agent  │  │  Agent  │        │
│        └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘        │
│             └────────────┴─────┬──────┴────────────┘             │
│                                │                                 │
│                                ▼                                 │
│                         ┌──────────────┐                         │
│                         │  APM Server  │  (or Elastic Agent with │
│                         │              │  APM integration)       │
│                         └──────┬───────┘                         │
│                                │                                 │
│                                ▼                                 │
│                         ┌──────────────┐                         │
│                         │Elasticsearch │  (stores APM data)      │
│                         └──────┬───────┘                         │
│                                │                                 │
│                                ▼                                 │
│                         ┌──────────────┐                         │
│                         │    Kibana    │  (APM UI)               │
│                         │   APM App    │                         │
│                         └──────────────┘                         │
└──────────────────────────────────────────────────────────────────┘
Setting Up APM
Step 1: APM Server
Install APM Server or use Elastic Agent with APM integration:
Docker:
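# APM Server settings are passed as -E overrides after the image name,
# not as Docker environment variables
docker run -d \
  --name apm-server \
  --net elastic \
  -p 8200:8200 \
  docker.elastic.co/apm/apm-server:8.11.0 \
  --strict.perms=false -e \
  -E 'output.elasticsearch.hosts=["http://elasticsearch:9200"]'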
Elastic Agent (recommended for 8.x):
- Go to Fleet → Agent policies
- Add integration → Elastic APM
- Configure:
Host: 0.0.0.0:8200
Secret token: your-secret-token
Step 2: Instrument Your Application
Install the APM agent for your language:
Node.js:
npm install elastic-apm-node
// Add at the VERY TOP of your main file (before any other require)
const apm = require('elastic-apm-node').start({
  serviceName: 'my-api',
  serverUrl: 'http://localhost:8200',
  secretToken: 'your-secret-token',
  environment: 'production',
  // Capture request body (careful with PII)
  captureBody: 'errors',
  // Sample rate (1.0 = 100%)
  transactionSampleRate: 1.0,
});
Python:
pip install elastic-apm
# Flask
from flask import Flask
from elasticapm.contrib.flask import ElasticAPM
app = Flask(__name__)
app.config['ELASTIC_APM'] = {
    'SERVICE_NAME': 'my-flask-app',
    'SERVER_URL': 'http://localhost:8200',
    'SECRET_TOKEN': 'your-secret-token',
    'ENVIRONMENT': 'production',
}
apm = ElasticAPM(app)
# Django - settings.py
INSTALLED_APPS = [
    'elasticapm.contrib.django',
    # ...
]
ELASTIC_APM = {
    'SERVICE_NAME': 'my-django-app',
    'SERVER_URL': 'http://localhost:8200',
    'SECRET_TOKEN': 'your-secret-token',
    'ENVIRONMENT': 'production',
}
Java:
# Download the agent JAR
curl -O https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/1.44.0/elastic-apm-agent-1.44.0.jar
# Run with agent attached
java -javaagent:/path/to/elastic-apm-agent-1.44.0.jar \
-Delastic.apm.service_name=my-java-app \
-Delastic.apm.server_urls=http://localhost:8200 \
-Delastic.apm.secret_token=your-secret-token \
-Delastic.apm.environment=production \
-jar my-app.jar
Go:
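go get go.elastic.co/apm/v2
go get go.elastic.co/apm/module/apmhttp/v2
package main
import (
	"log"
	"net/http"
	"go.elastic.co/apm/module/apmhttp/v2"
)
// handleUsers is a placeholder handler for illustration.
func handleUsers(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("[]"))
}
func main() {
	// The agent is configured through environment variables:
	// ELASTIC_APM_SERVICE_NAME=my-go-app
	// ELASTIC_APM_SERVER_URL=http://localhost:8200
	// ELASTIC_APM_SECRET_TOKEN=your-secret-token
	mux := http.NewServeMux()
	mux.HandleFunc("/api/users", handleUsers)
	// Wrap the mux so each incoming request is recorded as a transaction
	handler := apmhttp.Wrap(mux)
	log.Fatal(http.ListenAndServe(":8080", handler))
}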
Supported Agents
| Language | Auto-Instrumentation | Manual Spans |
|---|---|---|
| Node.js | Express, Koa, Hapi, Fastify, HTTP | Yes |
| Python | Django, Flask, Starlette | Yes |
| Java | Spring, Servlet, JAX-RS | Yes |
| Go | net/http, Gin, Echo, gRPC | Yes |
| .NET | ASP.NET Core, EF Core | Yes |
| Ruby | Rails, Sinatra, Grape | Yes |
| PHP | WordPress, Laravel | Yes |
| Rust | Actix, Rocket | Community |
| RUM (JS) | Browser, React, Angular, Vue | Yes |
The APM Interface
Services View
Navigate to Observability → APM → Services:
┌─────────────────────────────────────────────────────────┐
│ Services [Environment ▼] │
├─────────────────────────────────────────────────────────┤
│ Service │ Env │ Latency │ Throughput │ Errors│
│──────────────────│──────│──────────│────────────│───────│
│ api-gateway │ prod │ 145ms │ 1.2k/min │ 0.3% │
│ payment-service │ prod │ 320ms │ 450/min │ 1.2% │
│ user-service │ prod │ 85ms │ 2.1k/min │ 0.1% │
│ order-service │ prod │ 210ms │ 800/min │ 0.5% │
│ notification-svc │ prod │ 50ms │ 300/min │ 0.0% │
└─────────────────────────────────────────────────────────┘
Service Detail
Click a service to see detailed performance data:
┌─────────────────────────────────────────────────────────┐
│ api-gateway │
├─────────────────────────────────────────────────────────┤
│ │
│ Latency: 145ms (avg) │ Throughput: 1.2k/min │
│ ──────────────────── │ ──────────────────── │
│ p50: 85ms │ │
│ p95: 450ms │ Error rate: 0.3% │
│ p99: 1.2s │ ──────────────────── │
│ │
│ ┌─ Latency Distribution ──────────────────────────┐ │
│ │ ██ │ │
│ │ ████ │ │
│ │ ████████ │ │
│ │ ████████████████ │ │
│ │ ██████████████████████ │ │
│ │ 0ms 100ms 500ms 1s 2s 5s │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Transactions: │
│ ┌────────────────────────┬────────┬──────┬──────────┐ │
│ │ Name │ Latency│ TPM │ Errors │ │
│ │ GET /api/users │ 85ms │ 450 │ 0.1% │ │
│ │ POST /api/orders │ 320ms │ 200 │ 0.8% │ │
│ │ GET /api/products │ 120ms │ 350 │ 0.2% │ │
│ │ POST /api/checkout │ 1.2s │ 100 │ 2.1% │ │
│ └────────────────────────┴────────┴──────┴──────────┘ │
└─────────────────────────────────────────────────────────┘
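The gap between the average and the percentiles in this view comes from a few slow requests pulling the tail out. The sketch below illustrates that with a nearest-rank percentile over a handful of hypothetical durations. Kibana computes its percentiles in Elasticsearch, so this shows the idea only and is not the implementation.

```python
import math

def percentile(durations_ms, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not durations_ms:
        raise ValueError("no samples")
    ordered = sorted(durations_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical transaction durations in milliseconds
samples = [60, 70, 80, 85, 90, 95, 110, 150, 450, 1200]
average = sum(samples) / len(samples)
print(f"avg={average:.0f}ms p50={percentile(samples, 50)}ms "
      f"p95={percentile(samples, 95)}ms p99={percentile(samples, 99)}ms")
```

Two slow requests out of ten are enough to push the average well above the median, which is why the percentiles are usually the better signal for user-facing latency.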
Transactions
Transaction Types
APM categorizes operations:
| Type | Description | Example |
|---|---|---|
| request | HTTP request/response | GET /api/users |
| messaging | Message queue operations | Process order from queue |
| scheduled | Cron/scheduled tasks | Nightly report generation |
| custom | User-defined | Background data sync |
Transaction Waterfall
Click a transaction to see the detailed trace waterfall:
Transaction: POST /api/checkout (1.2s total)
├── api-gateway (12ms)
│ └── HTTP POST /api/orders → order-service
│ ├── order-service (180ms)
│ │ ├── SELECT * FROM products (15ms)
│ │ ├── SELECT * FROM inventory (8ms)
│ │ └── HTTP POST /api/payments → payment-service
│ │ ├── payment-service (850ms)
│ │ │ ├── Stripe API call (780ms) ← bottleneck!
│ │ │ └── INSERT INTO payments (25ms)
│ │ └── response (payment-service → order-service)
│ ├── INSERT INTO orders (12ms)
│ └── HTTP POST /api/notifications → notification-svc
│ └── notification-svc (45ms)
│ └── SendGrid API call (35ms)
└── response (api-gateway → client)
Reading the Waterfall
Each span shows:
- Service name and color coding
- Span type (db, http, custom)
- Duration (absolute and as percentage)
- Status (success, error)
- Details (SQL query, HTTP method/URL, etc.)
Click a span for details:
Span: Stripe API call
Type: external.http
Duration: 780ms (65% of transaction)
URL: https://api.stripe.com/v1/charges
Method: POST
Status: 200
Labels:
stripe.charge_id: ch_1234
amount: 9999
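The percentage shown next to a span is simply its duration divided by the transaction duration. As a minimal sketch, using the durations from the checkout waterfall above and a hypothetical `Span` record, the bottleneck can be found like this:

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    duration_ms: float

def slowest_span(spans, transaction_ms):
    """Return the longest span and its share of the transaction duration in percent."""
    worst = max(spans, key=lambda s: s.duration_ms)
    return worst, round(worst.duration_ms / transaction_ms * 100)

# Durations taken from the checkout waterfall above
spans = [
    Span("SELECT * FROM products", 15),
    Span("SELECT * FROM inventory", 8),
    Span("Stripe API call", 780),
    Span("INSERT INTO payments", 25),
    Span("INSERT INTO orders", 12),
    Span("SendGrid API call", 35),
]
worst, share = slowest_span(spans, transaction_ms=1200)
print(f"{worst.name}: {worst.duration_ms}ms ({share}% of transaction)")
```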
Errors
Error View
Navigate to a service → Errors tab:
┌─────────────────────────────────────────────────────────┐
│ Errors: payment-service │
├─────────────────────────────────────────────────────────┤
│ Error Group │ Count │ Last Seen │ Handled │
│───────────────────────│───────│───────────│─────────────│
│ StripeCardError: │ 45 │ 2m ago │ Unhandled │
│ Card declined │ │ │ │
│ ConnectionError: │ 12 │ 15m ago │ Handled │
│ DB timeout │ │ │ │
│ ValidationError: │ 8 │ 1h ago │ Handled │
│ Invalid amount │ │ │ │
└─────────────────────────────────────────────────────────┘
Error Detail
Click an error group to see:
- Stack trace (with source code context if source maps are configured)
- Exception message and type
- Transaction that caused the error
- User and request context
- Occurrence timeline
Error: StripeCardError
Message: Your card was declined.
Stack trace:
at processPayment (src/services/payment.js:145:11)
at async createOrder (src/routes/orders.js:67:5)
at async handle (src/middleware/handler.js:23:7)
Context:
User: user-12345
Request: POST /api/checkout
Body: { amount: 9999, currency: "usd" }
Metadata:
service.version: 2.3.1
host.name: prod-payment-02
container.id: abc123def456
Service Map
The service map shows dependencies between services:
Navigate to: APM → Service Map
              ┌──────────┐
              │  Client  │
              │ (Browser)│
              └────┬─────┘
                   │
              ┌────▼─────┐
              │   API    │
              │ Gateway  │
              └────┬─────┘
     ┌─────────────┼─────────────┐
┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐
│   User   │  │  Order   │  │ Product  │
│ Service  │  │ Service  │  │ Service  │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐
│ User DB  │  │ Order DB │  │ Product  │
│(Postgres)│  │ (MySQL)  │  │  Cache   │
└──────────┘  └────┬─────┘  │ (Redis)  │
                   │        └──────────┘
              ┌────▼─────┐
              │ Payment  │
              │ Service  │
              └────┬─────┘
                   │
              ┌────▼─────┐
              │  Stripe  │
              │  (ext)   │
              └──────────┘
Features:
- Nodes sized by throughput
- Edges colored by health (green/yellow/red)
- Click a node to see service details
- Click an edge to see transaction details between services
- External services shown as separate nodes
Correlations
APM Correlations helps you find why some transactions are slow or failing.
Latency Correlations
- Open a service → Correlations tab
- Select metric: Latency
- Kibana automatically finds fields correlated with high latency:
Correlated fields (high latency):
┌─────────────────────────────┬────────────┬─────────────┐
│ Field │ Value │ Impact │
│ host.name │ prod-web-3 │ +450ms │
│ url.path │ /api/search│ +320ms │
│ user_agent.name │ IE 11 │ +200ms │
│ geoip.country_iso_code │ AU │ +150ms │
└─────────────────────────────┴────────────┴─────────────┘
Insight: Requests to prod-web-3 are consistently slower.
Possible cause: resource constraints on that host.
Error Correlations
Find what's different about failing transactions:
Correlated fields (errors):
┌─────────────────────────────┬────────────┬─────────────┐
│ Field │ Value │ Impact │
│ service.version │ 2.3.1 │ 5x errors │
│ host.name │ prod-api-2 │ 3x errors │
│ user.id │ bot-crawler│ 2x errors │
└─────────────────────────────┴────────────┴─────────────┘
Insight: Version 2.3.1 has significantly more errors.
This was a recent deployment - likely a regression.
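To make an impact figure such as "5x errors" concrete, the sketch below compares the failure rate of transactions carrying a field value with the failure rate of all other transactions. This is a simplified illustration on hypothetical data. Kibana's correlation analysis uses its own statistical methods, and the function name `error_lift` is made up for this example.

```python
from collections import defaultdict

def error_lift(transactions, field):
    """For each value of `field`, compare its failure rate with the rate of all other transactions."""
    totals = defaultdict(lambda: [0, 0])  # value -> [failed, total]
    for tx in transactions:
        bucket = totals[tx[field]]
        bucket[0] += int(tx["failed"])
        bucket[1] += 1
    all_failed = sum(failed for failed, _ in totals.values())
    all_total = sum(total for _, total in totals.values())
    lifts = {}
    for value, (failed, total) in totals.items():
        rest_total = all_total - total
        rest_failed = all_failed - failed
        if rest_total == 0 or rest_failed == 0:
            continue  # no baseline to compare against
        lifts[value] = (failed / total) / (rest_failed / rest_total)
    return lifts

# Hypothetical sample: 100 transactions per version
transactions = (
    [{"service.version": "2.3.0", "failed": i < 2} for i in range(100)]
    + [{"service.version": "2.3.1", "failed": i < 10} for i in range(100)]
)
lifts = error_lift(transactions, "service.version")
print({version: round(lift, 1) for version, lift in lifts.items()})
```

A lift well above 1 for a single value, as for version 2.3.1 here, is the kind of signal worth investigating first.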
Custom Instrumentation
Adding Custom Spans
Node.js:
const apm = require('elastic-apm-node');
// Run fn inside a span and end the span even if fn throws.
// startSpan returns null when no transaction is active, hence the optional chaining.
async function withSpan(name, type, fn) {
  const span = apm.startSpan(name, type);
  try {
    return await fn();
  } finally {
    span?.end();
  }
}
async function processOrder(order) {
  try {
    await withSpan('process-order', 'custom', async () => {
      // Validate
      await withSpan('validate-order', 'custom', () => validateOrder(order));
      // Charge payment
      const charge = await withSpan('charge-payment', 'external', () => chargePayment(order.total));
      // Save to database
      await withSpan('save-order', 'db', () => saveOrder(order, charge));
    });
  } catch (err) {
    apm.captureError(err);
    throw err;
  }
}
Python:
import elasticapm
@elasticapm.capture_span('process_order')
def process_order(order):
    with elasticapm.capture_span('validate_order'):
        validate_order(order)
    with elasticapm.capture_span('charge_payment', span_type='external'):
        charge = charge_payment(order['total'])
    with elasticapm.capture_span('save_order', span_type='db'):
        save_order(order, charge)
Custom Labels and Context
Add business context to transactions:
// Node.js - Add labels to current transaction
apm.setLabel('order_id', order.id);
apm.setLabel('order_total', order.total);
apm.setLabel('customer_tier', customer.tier);
// Add user context
apm.setUserContext({
  id: user.id,
  username: user.name,
  email: user.email,
});
// Add custom context (not indexed, but visible in UI)
apm.setCustomContext({
  cart_items: cart.items.length,
  payment_method: order.paymentMethod,
  promo_code: order.promoCode,
});
# Python - Add labels
elasticapm.label(order_id=order.id, customer_tier=customer.tier)
# Set user context
elasticapm.set_user_context(
    user_id=user.id,
    username=user.name,
    email=user.email,
)
Real User Monitoring (RUM)
Monitor frontend performance from the user's browser:
Setting Up RUM
npm install @elastic/apm-rum
import { init as initApm } from '@elastic/apm-rum';
const apm = initApm({
  serviceName: 'my-frontend',
  serverUrl: 'https://apm.example.com:8200',
  serviceVersion: '1.0.0',
  environment: 'production',
  // Distributed tracing (connect frontend to backend traces)
  distributedTracingOrigins: ['https://api.example.com'],
});
RUM Metrics
| Metric | Description | Good Target |
|---|---|---|
| Page load time | Total time to load page | < 3s |
| First Contentful Paint | First visual content | < 1.8s |
| Largest Contentful Paint | Main content loaded | < 2.5s |
| First Input Delay | Delay between the first user interaction and the browser's response | < 100ms |
| Cumulative Layout Shift | Visual stability | < 0.1 |
| Time to First Byte | Server response time | < 600ms |
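As a small illustration of how the targets in the table above might be applied, the sketch below flags measurements that miss their "good" target. The metric keys and the sample page view are hypothetical and are not field names used by the RUM agent.

```python
# "Good" targets copied from the table above (seconds, except CLS which is unitless)
GOOD_TARGETS = {
    "page_load_time": 3.0,
    "first_contentful_paint": 1.8,
    "largest_contentful_paint": 2.5,
    "first_input_delay": 0.1,
    "cumulative_layout_shift": 0.1,
    "time_to_first_byte": 0.6,
}

def metrics_missing_target(measurements):
    """Return the names of measured metrics that are at or above their 'good' target."""
    return sorted(
        name for name, value in measurements.items()
        if name in GOOD_TARGETS and value >= GOOD_TARGETS[name]
    )

# Hypothetical measurements for one page view
page_view = {
    "page_load_time": 4.2,
    "largest_contentful_paint": 2.1,
    "cumulative_layout_shift": 0.25,
    "time_to_first_byte": 0.4,
}
print(metrics_missing_target(page_view))
```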
RUM in Kibana
Navigate to APM → Services → your frontend service:
User Experience Dashboard:
┌────────────────────────────────────────────────────┐
│ Page Load Distribution │
│ ██████████████████████████ < 1s (45%) │
│ ████████████████ 1-3s (30%) │
│ ████████ 3-5s (15%) │
│ ████ 5-10s (8%) │
│ ██ > 10s (2%) │
├────────────────────────────────────────────────────┤
│ Slowest Pages: │
│ /checkout 4.2s avg │
│ /search?q=... 3.1s avg │
│ /product/detail 2.8s avg │
├────────────────────────────────────────────────────┤
│ By Browser: By Country: By Connection: │
│ Chrome 120ms US 145ms 4G 150ms │
│ Firefox 135ms UK 210ms 3G 450ms │
│ Safari 155ms AU 380ms WiFi 120ms │
└────────────────────────────────────────────────────┘
APM Alerts
Latency Alert
Rule: High API Latency
Type: APM latency threshold
Service: api-gateway
Transaction type: request
Environment: production
Condition:
WHEN p95 latency IS ABOVE 2000ms
FOR THE LAST 5 minutes
Actions:
Slack: "API latency p95 at {{context.triggerValue}} (threshold: 2000ms)"
Error Rate Alert
Rule: Error Rate Spike
Type: APM error rate threshold
Service: payment-service
Environment: production
Condition:
WHEN error rate IS ABOVE 5%
FOR THE LAST 5 minutes
Actions:
PagerDuty: Critical
Slack: "Error rate: {{context.triggerValue}}% for payment-service"
Failed Transaction Rate
Rule: Checkout Failures
Type: APM failed transaction rate threshold
Service: order-service
Transaction name: POST /api/checkout
Environment: production
Condition:
WHEN failed transaction rate IS ABOVE 10%
Actions:
Webhook: POST to incident management
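Each of these rules boils down to "compute a value over a time window and compare it with a threshold". Kibana evaluates the rules server-side against Elasticsearch. The sketch below only illustrates the failed transaction rate condition on hypothetical in-memory data, and the function name and record layout are assumptions made for the example.

```python
from datetime import datetime, timedelta

def failed_rate_breached(transactions, now, window=timedelta(minutes=5), threshold_pct=10.0):
    """Return (breached, rate_pct) for transactions that finished inside the window."""
    recent = [tx for tx in transactions if now - window <= tx["timestamp"] <= now]
    if not recent:
        return False, 0.0
    failed = sum(1 for tx in recent if tx["outcome"] == "failure")
    rate_pct = failed / len(recent) * 100
    return rate_pct > threshold_pct, rate_pct

now = datetime(2024, 1, 1, 12, 0, 0)
# Hypothetical data: one transaction every 10 seconds, every fifth one fails
transactions = [
    {"timestamp": now - timedelta(seconds=10 * i),
     "outcome": "failure" if i % 5 == 0 else "success"}
    for i in range(60)
]
breached, rate = failed_rate_breached(transactions, now)
print(f"breached={breached} rate={rate:.1f}%")
```

Transactions older than the window are ignored, so a burst of failures ten minutes ago does not keep the rule firing.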
Practical Examples
Example 1: Debugging a Slow Endpoint
Scenario: Users report /api/search is slow.
1. APM → Services → api-gateway → Transactions
2. Find "GET /api/search" - avg latency 3.2s (SLA: 500ms)
3. Click transaction → View sample traces
4. Waterfall shows:
- api-gateway: 50ms
- search-service: 3150ms
- Elasticsearch query: 3100ms ← bottleneck
5. Click ES query span → See actual query
6. Query is missing index filter, scanning too many indices
7. Fix: Add date-based index pattern to query
8. Result: Latency drops to 200ms
Example 2: Correlating Errors with Deployments
1. APM → Services → payment-service → Errors
2. Spike in "ConnectionError: timeout" starting 2 hours ago
3. Check service.version label → version 2.4.0 deployed 2 hours ago
4. Click error → Stack trace shows new retry logic has a bug
5. Compare: version 2.3.9 had 0.1% errors, 2.4.0 has 5.2%
6. Action: Roll back to 2.3.9, fix bug, redeploy
Example 3: End-to-End Trace Analysis
1. User reports order #12345 failed
2. APM → Traces → Search for label "order_id: 12345"
3. Full trace:
- Frontend (RUM): 8.5s total
- API Gateway: 200ms
- Order Service: 8.1s
- Inventory check: 15ms
- Payment: 8000ms ← timeout!
- Stripe API: timeout after 8000ms
- Error: PaymentTimeoutError
4. Root cause: Stripe API was experiencing an outage
5. Fix: Add circuit breaker, reduce timeout, show a clear error to the user
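The circuit breaker mentioned in step 5 can be sketched as follows. This is a minimal, single-threaded illustration under stated assumptions, not a production implementation: the class, its parameters, and the `charge` stand-in are all hypothetical.

```python
import time

class CircuitOpenError(Exception):
    """Raised when calls are rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("payment provider unavailable, try again later")
            # Cool-down elapsed: let one trial call through (half-open state)
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result

def charge(amount):
    # Hypothetical stand-in for a payment call that times out
    raise TimeoutError("payment provider timed out")

breaker = CircuitBreaker(max_failures=3, reset_after_s=30.0)
outcomes = []
for _ in range(5):
    try:
        breaker.call(charge, 9999)
    except CircuitOpenError:
        outcomes.append("rejected fast")
    except TimeoutError:
        outcomes.append("timed out")
print(outcomes)
```

After three consecutive timeouts the breaker opens, so later calls fail immediately with a clear error instead of each waiting for the provider to time out.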
Tips and Best Practices
Instrumentation
- Start with auto-instrumentation (agents handle common frameworks)
- Add custom spans for business-critical code paths
- Set meaningful labels (order_id, customer_tier, feature_flag)
- Configure sampling rates appropriately
- Use distributed tracing to connect frontend and backend
- Don't instrument every function (too much overhead)
- Don't log sensitive data in labels (PII, credentials)
- Don't set sample rate to 100% in high-traffic production
- Don't forget to instrument async operations
Sampling Rates
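Development: 1.0 (100% - capture everything)
Staging: 1.0 (100%)
Production:
  Low traffic: 1.0 (100%)
  Medium: 0.5 (50%)
  High traffic: 0.1 (10%)
  Very high: 0.01 (1%)
Dynamic sampling:
  Errors: Always captured (the sample rate doesn't affect error capture)
  Slow transactions: The agent's sample rate is applied when a transaction starts, so it cannot single out slow ones. To keep more of them, consider tail-based sampling in APM Server, which decides after the trace has completed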
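To get a feel for what these rates mean in stored traces, the sketch below simulates a per-transaction sampling decision made at transaction start. The traffic figures attached to each tier are hypothetical. The document does not define where "medium" or "high" traffic begins.

```python
import random

def sampled_count(transactions, sample_rate, rng):
    """Head-based sampling: decide at transaction start, before the outcome is known."""
    return sum(1 for _ in range(transactions) if rng.random() < sample_rate)

rng = random.Random(42)  # fixed seed so the run is repeatable
for label, per_minute, rate in [
    ("low traffic", 100, 1.0),
    ("medium", 2_000, 0.5),
    ("high traffic", 20_000, 0.1),
    ("very high", 200_000, 0.01),
]:
    kept = sampled_count(per_minute, rate, rng)
    print(f"{label:>12}: {per_minute:>7}/min at rate {rate} -> ~{kept} traces kept")
```

With these assumed volumes, the lower rates at higher traffic keep the number of stored traces per minute in the same order of magnitude across tiers.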
Performance Overhead
Typical APM agent overhead:
CPU: 1-3% increase
Memory: 10-50MB additional
Latency: < 1ms per transaction
To minimize:
- Use lower sample rates in production
- Disable body capture unless needed
- Limit custom span depth
- Use async reporting (default)
Common Issues
No data showing in APM
Checklist:
- APM Server is running and accessible from the application
- Agent is initialized before other code (Node.js: must be first require)
- Secret token matches between agent and APM Server
- Network allows traffic on port 8200
- Check APM Server logs for errors
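# Test APM Server connectivity (when a secret token is configured,
# an unauthenticated request returns an empty response body)
curl -H "Authorization: Bearer your-secret-token" http://localhost:8200/
# Expected: {"build_date":"...","build_sha":"...","publish_ready":true,"version":"8.11.0"}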
Traces not connecting across services
Fix: Ensure distributed tracing headers are propagated:
- traceparent (W3C standard)
- elastic-apm-traceparent (Elastic format)
Check that HTTP clients forward these headers between services.
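When debugging broken traces, it can help to log the incoming headers and check that the `traceparent` value is well-formed. The sketch below parses the W3C format (`version-traceid-parentid-flags`) and picks out the tracing headers. The helper names are hypothetical. APM agents normally inject and read these headers automatically for instrumented HTTP clients, so this is a diagnostic aid and not a replacement for the agent.

```python
import re

# W3C Trace Context: version-traceid-parentid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

def parse_traceparent(value):
    """Return the header fields as a dict, or None if the value is malformed."""
    match = TRACEPARENT_RE.fullmatch(value.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # invalid version or all-zero IDs per the spec
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

def tracing_headers(incoming_headers):
    """Pick out the tracing headers present on a request (case-insensitive)."""
    wanted = {"traceparent", "tracestate", "elastic-apm-traceparent"}
    return {k: v for k, v in incoming_headers.items() if k.lower() in wanted}

incoming = {
    "Content-Type": "application/json",
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
}
print(parse_traceparent(incoming["traceparent"]))
print(tracing_headers(incoming))
```

If `tracing_headers` comes back empty on the downstream service, a proxy or an uninstrumented HTTP client between the two services is probably dropping the headers.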
High cardinality warnings
Cause: Too many unique transaction names (e.g., URL includes IDs).
Fix: Normalize transaction names:
// Bad: /api/users/12345, /api/users/67890 (unique per user)
// Good: /api/users/:id (grouped)
// Node.js - Set transaction name explicitly
apm.setTransactionName('GET /api/users/:id');
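Where the framework's route template is not available, names can be normalized from the raw path. The sketch below is one possible approach, with hypothetical rules (numeric segments become `:id`, UUIDs become `:uuid`). Real applications may need rules of their own.

```python
import re

# Hypothetical rule: segments that look like UUIDs become a placeholder
_UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I)

def normalize_transaction_name(method, path):
    """Replace variable path segments so similar requests share one transaction name."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = []
    for segment in path.split("/"):
        if segment.isdigit():
            segments.append(":id")
        elif _UUID.fullmatch(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return f"{method.upper()} {'/'.join(segments)}"

for path in ["/api/users/12345", "/api/users/67890", "/api/search?q=shoes"]:
    print(normalize_transaction_name("get", path))
```

The resulting string can then be passed to the agent's transaction-naming call, such as `apm.setTransactionName` in Node.js.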
Next Steps
Continue to 14-best-practices.md for best practices and production tips.