Chapter 11: Monitoring and Observability

Overview

Monitoring WSO2 in production requires visibility into logs, metrics, API analytics, and distributed traces. This chapter covers built-in tools, integration with external observability stacks, and common troubleshooting patterns.

Logging

Log Files

Log File                 Location            Contents
wso2carbon.log           repository/logs/    Main server log
http_access.log          repository/logs/    HTTP access log (gateway)
audit.log                repository/logs/    Security audit events
gc.log                   repository/logs/    JVM garbage collection
correlation.log          repository/logs/    Request correlation
wso2-apigw-errors.log    repository/logs/    Gateway-specific errors

Log Configuration

deployment.toml:

# Root logger
[logging]
level = "INFO"

# Per-package log levels
[[logging.loggers]]
name = "org.apache.synapse"
level = "WARN"

[[logging.loggers]]
name = "org.wso2.carbon.apimgt"
level = "INFO"

[[logging.loggers]]
name = "org.apache.synapse.transport.http.wire"
level = "OFF"   # Enable DEBUG for wire logs (verbose)

# Audit log
[[logging.loggers]]
name = "AUDIT_LOG"
level = "INFO"
appender = "AUDIT_LOGFILE"

Wire Logs (Debug HTTP Traffic)

Wire logs capture every byte sent and received on the HTTP transport, including sensitive headers such as Authorization. Enable them temporarily for debugging; they are extremely verbose, so never leave them on in production.

# Enable wire logs
[[logging.loggers]]
name = "org.apache.synapse.transport.http.wire"
level = "DEBUG"

Wire log output:

[2026-02-11 10:30:15] DEBUG - wire >> "POST /api/employees HTTP/1.1[\r][\n]"
[2026-02-11 10:30:15] DEBUG - wire >> "Content-Type: application/json[\r][\n]"
[2026-02-11 10:30:15] DEBUG - wire >> "Authorization: Bearer eyJ4NX...[\r][\n]"
[2026-02-11 10:30:15] DEBUG - wire >> "{"name":"John Doe","email":"john@example.com"}"

Correlation Logging

Track a request across all components with a correlation ID.

# Enable correlation logs
[monitoring.correlation]
enable = true
log_all_methods = true

Usage:

# Set correlation ID in request
curl -H "activityid: abc-123-def" https://gateway:8243/api/data

# Find in logs
grep "abc-123-def" repository/logs/correlation.log

Correlation log output:

abc-123-def|HTTP-Listener|2026-02-11 10:30:15|0|api/data|GET|InSequence|Start
abc-123-def|HTTP-Sender|2026-02-11 10:30:15|45|backend:8080/data|GET|Call|End
abc-123-def|HTTP-Listener|2026-02-11 10:30:15|48|api/data|GET|OutSequence|End
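
The elapsed-time field (the fourth column) makes it easy to see where a request spends its time. A minimal sketch with awk, using sample lines standing in for repository/logs/correlation.log:

```shell
# Sample lines standing in for repository/logs/correlation.log
cat > /tmp/correlation.sample <<'EOF'
abc-123-def|HTTP-Listener|2026-02-11 10:30:15|0|api/data|GET|InSequence|Start
abc-123-def|HTTP-Sender|2026-02-11 10:30:15|45|backend:8080/data|GET|Call|End
abc-123-def|HTTP-Listener|2026-02-11 10:30:15|48|api/data|GET|OutSequence|End
EOF

# Print component, stage, and elapsed ms (field 4) for one activity ID
awk -F'|' -v id="abc-123-def" \
    '$1 == id { printf "%-14s %-12s %sms\n", $2, $7, $4 }' /tmp/correlation.sample
```

This prints one line per stage (0ms, 45ms, 48ms here), immediately pointing at the backend call as the dominant cost.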

HTTP Access Logging

[transport.http.access_log]
enable = true
format = "combined"
# Format: %h %l %u %t "%r" %s %b "%{Referer}i" "%{User-Agent}i" %D

# Or custom format
# format = "%h %t %r %s %D %{X-Correlation-ID}i"
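
With %D appended, slow requests can be pulled straight from the access log. A sketch assuming the combined-plus-%D shape configured above, with the response time as the last field; sample lines stand in for repository/logs/http_access.log:

```shell
# Sample lines in the combined-plus-%D shape configured above
cat > /tmp/access.sample <<'EOF'
10.0.0.5 - - [11/Feb/2026:10:30:15 +0000] "GET /api/data HTTP/1.1" 200 512 "-" "curl/8.0" 38
10.0.0.5 - - [11/Feb/2026:10:30:16 +0000] "POST /api/employees HTTP/1.1" 200 128 "-" "curl/8.0" 1250
EOF

# Report requests whose last field (%D, response time) exceeds 1000
awk '$NF > 1000 { print $NF "ms", $7 }' /tmp/access.sample
```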

Metrics

JMX Monitoring

WSO2 exposes JMX MBeans for JVM and application metrics.

# Enable JMX
[monitoring.jmx]
rmi_hostname = "localhost"
rmi_port = 9999

Connect with JConsole or VisualVM:

jconsole localhost:9999
# Or
jvisualvm

Key MBeans:

  • java.lang:type=Memory: Heap usage
  • java.lang:type=Threading: Thread counts
  • java.lang:type=GarbageCollector: GC stats
  • org.apache.synapse:type=Transport: HTTP transport metrics
  • org.wso2.carbon:type=ServerAdmin: Server status

Prometheus Integration

Export metrics in Prometheus format.

# Enable Prometheus metrics
[monitoring.prometheus]
enable = true
port = 9201

Scrape config (prometheus.yml):

scrape_configs:
  - job_name: 'wso2-apim-gateway'
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - 'gw1.example.com:9201'
          - 'gw2.example.com:9201'
        labels:
          component: gateway

  - job_name: 'wso2-mi'
    metrics_path: /metric-service/metrics
    static_configs:
      - targets:
          - 'mi1.example.com:9201'
        labels:
          component: micro-integrator

Key Prometheus Metrics:

Metric                            Description
wso2_api_request_count_total      Total API requests
wso2_api_error_count_total        Total API errors
wso2_api_response_time_seconds    Response time histogram
jvm_memory_bytes_used             JVM heap usage
jvm_threads_current               Active threads
http_requests_total               HTTP transport requests
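
These counters can also be inspected directly from a scrape. A sketch that computes an overall error percentage, with a canned exposition standing in for the real /metrics response (in practice the input would come from something like curl -sk against the gateway's metrics port):

```shell
# Canned exposition standing in for a real /metrics scrape
cat > /tmp/metrics.sample <<'EOF'
wso2_api_request_count_total{api="EmployeeAPI"} 1800
wso2_api_request_count_total{api="OrderAPI"} 200
wso2_api_error_count_total{api="EmployeeAPI"} 36
wso2_api_error_count_total{api="OrderAPI"} 4
EOF

# Sum requests and errors across all label sets, then print the error ratio
awk '/^wso2_api_request_count_total/ { req += $2 }
     /^wso2_api_error_count_total/   { err += $2 }
     END { printf "%.1f%% errors (%d/%d)\n", 100 * err / req, err, req }' /tmp/metrics.sample
```

This is the same arithmetic the Grafana error-rate panel performs with rate() over a time window.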

Grafana Dashboards

API Gateway Dashboard:

{
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {"expr": "rate(wso2_api_request_count_total[5m])"}
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {"expr": "rate(wso2_api_error_count_total[5m]) / rate(wso2_api_request_count_total[5m]) * 100"}
      ]
    },
    {
      "title": "P95 Response Time",
      "type": "graph",
      "targets": [
        {"expr": "histogram_quantile(0.95, rate(wso2_api_response_time_seconds_bucket[5m]))"}
      ]
    },
    {
      "title": "JVM Heap Usage",
      "type": "gauge",
      "targets": [
        {"expr": "jvm_memory_bytes_used{area='heap'} / jvm_memory_bytes_max{area='heap'} * 100"}
      ]
    }
  ]
}

API Analytics

Built-in Analytics

WSO2 API Manager provides a built-in analytics dashboard.

Enable Analytics:

[apim.analytics]
enable = true

[apim.analytics.properties]
"publisher.reporter.class" = "org.wso2.am.analytics.publisher.sample.reporter.AnalyticsMetricReporter"

Metrics Available:

  • API usage by time period
  • Top APIs by request count
  • Response time distributions
  • Error breakdown by type
  • Geographic distribution of requests
  • Top applications and subscribers

ELK Stack Integration

Send logs and analytics to Elasticsearch for centralized analysis.

Filebeat Configuration (filebeat.yml):

filebeat.inputs:
  - type: log
    paths:
      - /opt/wso2am/repository/logs/wso2carbon.log
    fields:
      log_type: server
    multiline:
      pattern: '^\d{4}-\d{2}-\d{2}'
      negate: true
      match: after

  - type: log
    paths:
      - /opt/wso2am/repository/logs/http_access*.log
    fields:
      log_type: access

  - type: log
    paths:
      - /opt/wso2am/repository/logs/audit.log
    fields:
      log_type: audit

  - type: log
    paths:
      - /opt/wso2am/repository/logs/correlation.log
    fields:
      log_type: correlation

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]
  index: "wso2-%{[fields.log_type]}-%{+yyyy.MM.dd}"
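
The multiline rule on the wso2carbon.log input is what keeps Java stack traces attached to their parent event: any line that does not start with a date is appended to the previous line. The grouping logic can be sketched in awk over sample lines:

```shell
# Sample log: one ERROR with a stack trace, then a separate INFO event
cat > /tmp/carbon.sample <<'EOF'
2026-02-11 10:30:15 ERROR {org.apache.synapse} - Request failed
java.lang.NullPointerException
    at org.example.Handler.handle(Handler.java:42)
2026-02-11 10:30:16 INFO {org.wso2.carbon} - Server healthy
EOF

# A line starting with a date begins a new event; anything else is a continuation
awk '/^[0-9][0-9][0-9][0-9]-/ { if (ev != "") print ev; ev = $0; next }
     { ev = ev " | " $0 }
     END { if (ev != "") print ev }' /tmp/carbon.sample
```

Two events come out: the ERROR line with its stack trace folded in, and the INFO line on its own.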

Logstash Filter (for structured parsing):

filter {
  if [fields][log_type] == "access" {
    grok {
      match => {
        "message" => "%{IPORHOST:client_ip} - - \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}\" %{NUMBER:status} %{NUMBER:bytes} \"%{DATA:referrer}\" \"%{DATA:agent}\" %{NUMBER:response_time}"
      }
    }
    mutate {
      convert => {
        "status" => "integer"
        "bytes" => "integer"
        "response_time" => "integer"
      }
    }
  }
}
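
Before shipping logs through Logstash, it is worth sanity-checking that real access-log lines have the shape the pattern expects. An illustrative check against the combined-plus-%D format configured earlier (the sample line and ERE are assumptions mirroring the grok fields, not part of WSO2):

```shell
# One line in the combined-plus-%D shape; the ERE mirrors the grok fields
line='10.0.0.5 - - [11/Feb/2026:10:30:15 +0000] "POST /api/employees HTTP/1.1" 201 128 "-" "curl/8.0" 42'

printf '%s\n' "$line" \
  | grep -qE '^[0-9.]+ - - \[[^]]+\] "[A-Z]+ [^ ]+ HTTP/[0-9.]+" [0-9]+ [0-9-]+ "[^"]*" "[^"]*" [0-9]+$' \
  && echo "matches" || echo "no match"
```

If a line prints "no match", the Logstash event will carry a _grokparsefailure tag instead of structured fields.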

Distributed Tracing

OpenTelemetry Integration

# Enable OpenTelemetry tracing
[opentelemetry]
enable = true
exporter_type = "otlp"

[opentelemetry.remote]
host = "otel-collector.example.com"
port = 4317

Jaeger Integration

[opentelemetry]
enable = true
exporter_type = "jaeger"

[opentelemetry.remote]
host = "jaeger.example.com"
port = 14250

Trace flow through WSO2:

Client → Gateway (Span 1)
           → Key Validation (Span 2)
           → Backend Call (Span 3)
           → Response Processing (Span 4)
         ← Response to Client

Health Monitoring

Health Check Endpoints

API Manager:

# Gateway health
curl -k https://localhost:8243/services/Version

# Carbon health
curl -k https://localhost:9443/carbon/admin/login.jsp -o /dev/null -w "%{http_code}"

Micro Integrator:

# Liveness (server is running)
curl http://localhost:9164/liveness
# Response: {"status": "active"}

# Readiness (ready to accept requests)
curl http://localhost:9164/readiness
# Response: {"status": "ready"}

# List deployed services
curl http://localhost:9164/management/apis
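
These endpoints are what a load balancer or Kubernetes probe should poll. A probe-script sketch: the real curl call is commented out and a canned response used so the parsing step stays runnable (READINESS_URL is a placeholder, not a WSO2 setting):

```shell
# Probe sketch: succeed only when the readiness endpoint reports "ready".
# In a real probe: body=$(curl -sf "$READINESS_URL") with
# READINESS_URL=http://localhost:9164/readiness
check_ready() {
  body='{"status": "ready"}'   # canned response for illustration
  printf '%s' "$body" | grep -q '"status": *"ready"'
}

if check_ready; then
  echo "ready"
else
  echo "not ready" >&2
  exit 1
fi
```

A nonzero exit marks the instance unhealthy, which is the contract both HAProxy-style checks and Kubernetes exec probes rely on.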

Custom Health Check (MI)

<api name="HealthAPI" context="/health" xmlns="http://ws.apache.org/ns/synapse">
    <resource methods="GET" uri-template="/">
        <inSequence>
            <!-- Check backend connectivity -->
            <call>
                <endpoint>
                    <http method="get" uri-template="http://backend:8080/ping">
                        <timeout>
                            <duration>5000</duration>
                        </timeout>
                    </http>
                </endpoint>
            </call>
            
            <filter source="$axis2:HTTP_SC" regex="200">
                <then>
                    <payloadFactory media-type="json">
                        <format>
                            {"status": "healthy", "backend": "UP", "timestamp": "$1"}
                        </format>
                        <args>
                            <arg expression="get-property('SYSTEM_TIME')"/>
                        </args>
                    </payloadFactory>
                </then>
                <else>
                    <payloadFactory media-type="json">
                        <format>
                            {"status": "degraded", "backend": "DOWN", "timestamp": "$1"}
                        </format>
                        <args>
                            <arg expression="get-property('SYSTEM_TIME')"/>
                        </args>
                    </payloadFactory>
                    <property name="HTTP_SC" value="503" scope="axis2"/>
                </else>
            </filter>
            <respond/>
        </inSequence>
    </resource>
</api>

Alerting

Prometheus Alerting Rules

groups:
  - name: wso2-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(wso2_api_error_count_total[5m]) / rate(wso2_api_request_count_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 5%"

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(wso2_api_response_time_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 response time above 2 seconds"

      - alert: HighMemoryUsage
        expr: jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage above 85%"

      - alert: GatewayDown
        expr: up{job="wso2-apim-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Gateway instance is down"

Troubleshooting

Common Issues and Solutions

Symptom                   Likely Cause            Action
401 Unauthorized          Expired/invalid token   Check token expiry, regenerate
403 Forbidden             Missing scopes          Verify scope assignment
500 Internal Error        Backend failure         Check backend logs, fault sequence
503 Service Unavailable   Backend down            Check endpoint connectivity
Slow responses            Resource exhaustion     Check JVM heap, thread count, DB pool
OOM crash                 Insufficient heap       Increase -Xmx, check for memory leaks
Connection timeout        Network or firewall     Verify connectivity, increase timeout

Diagnostic Commands

# Check server status
curl -k https://localhost:9443/services/Version

# Thread dump (find deadlocks, stuck threads)
kill -3 <PID>
# Or
jstack <PID> > thread_dump.txt

# Heap dump (analyze memory)
jmap -dump:format=b,file=heap.hprof <PID>

# Check open file descriptors
ls /proc/<PID>/fd | wc -l

# Check active connections
ss -tnp | grep <PID> | wc -l

# Monitor GC in real-time
jstat -gcutil <PID> 1000
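
jstat output is easiest to act on when filtered. A sketch that flags samples where old-generation occupancy stays high; the column layout (O is the 4th column, FGC the 9th) is the typical JDK -gcutil layout and canned output stands in for a live PID:

```shell
# Canned jstat -gcutil output (columns: S0 S1 E O M CCS YGC YGCT FGC FGCT GCT)
cat > /tmp/gcutil.sample <<'EOF'
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  97.52  68.10  54.23  95.01  89.11    341    4.213     6    1.820    6.033
  0.00  98.11  22.45  91.77  95.01  89.11    360    4.501     9    3.914    8.415
EOF

# Flag samples with old gen (column 4) above 85%, reporting full-GC count (column 9)
awk 'NR > 1 && $4 > 85 { printf "old gen at %.1f%% after %d full GCs\n", $4, $9 }' /tmp/gcutil.sample
```

Old gen that stays above the threshold while the full-GC count climbs is the classic signature of a heap that is too small or leaking, which is when to take the heap dump shown above.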

Analyzing Slow APIs

# 1. Enable correlation logging
# 2. Send request with correlation ID
curl -H "activityid: debug-001" https://gateway:8243/api/slow-endpoint

# 3. Analyze correlation log for time spent in each stage
grep "debug-001" repository/logs/correlation.log

# Output shows time per step:
# debug-001|HTTP-Listener|...|0ms|InSequence|Start
# debug-001|HTTP-Sender|...|1250ms|Backend|Call
# debug-001|HTTP-Listener|...|1255ms|OutSequence|End
# → Backend call took 1250ms; investigate backend

Log Level Changes at Runtime

# MI CLI: change log level without restart
mi update log-level org.apache.synapse DEBUG

# Revert
mi update log-level org.apache.synapse INFO

Key Takeaways

  • Correlation logging traces requests end-to-end across components
  • Wire logs are essential for debugging but too verbose for production
  • Prometheus + Grafana provides real-time metrics and dashboards
  • ELK stack centralizes logs for search and analysis
  • OpenTelemetry/Jaeger adds distributed tracing across services
  • Health check endpoints enable load balancer and Kubernetes probes
  • Thread dumps and heap dumps diagnose JVM-level issues
  • Runtime log level changes avoid restarts during incident investigation

Next Steps

Continue to Chapter 12: Advanced Topics to learn about advanced integration patterns, streaming, and microservices.