Chapter 11: Monitoring and Observability

Overview

Monitoring WSO2 in production requires visibility into logs, metrics, API analytics, and distributed traces. This chapter covers built-in tools, integration with external observability stacks, and common troubleshooting patterns.

Logging

Log Files

Log File                 Location            Contents
wso2carbon.log           repository/logs/    Main server log
http_access.log          repository/logs/    HTTP access log (gateway)
audit.log                repository/logs/    Security audit events
gc.log                   repository/logs/    JVM garbage collection
correlation.log          repository/logs/    Request correlation
wso2-apigw-errors.log    repository/logs/    Gateway-specific errors

Log Configuration

deployment.toml:

# Root logger
[logging]
level = "INFO"

# Per-package log levels
[[logging.loggers]]
name = "org.apache.synapse"
level = "WARN"

[[logging.loggers]]
name = "org.wso2.carbon.apimgt"
level = "INFO"

[[logging.loggers]]
name = "org.apache.synapse.transport.http.wire"
level = "OFF"   # Enable DEBUG for wire logs (verbose)

# Audit log
[[logging.loggers]]
name = "AUDIT_LOG"
level = "INFO"
appender = "AUDIT_LOGFILE"

Wire Logs (Debug HTTP Traffic)

Wire logs capture every byte sent and received on the HTTP transport, including sensitive headers such as Authorization. Enable them temporarily for debugging; they are extremely verbose, so never leave them on in production.

# Enable wire logs
[[logging.loggers]]
name = "org.apache.synapse.transport.http.wire"
level = "DEBUG"

Wire log output:

[2026-02-11 10:30:15] DEBUG - wire >> "POST /api/employees HTTP/1.1[\r][\n]"
[2026-02-11 10:30:15] DEBUG - wire >> "Content-Type: application/json[\r][\n]"
[2026-02-11 10:30:15] DEBUG - wire >> "Authorization: Bearer eyJ4NX...[\r][\n]"
[2026-02-11 10:30:15] DEBUG - wire >> "{"name":"John Doe","email":"john@example.com"}"

Correlation Logging

Track a request across all components with a correlation ID.

# Enable correlation logs
[monitoring.correlation]
enable = true
log_all_methods = true

Usage:

# Set correlation ID in request
curl -H "activityid: abc-123-def" https://gateway:8243/api/data

# Find in logs
grep "abc-123-def" repository/logs/correlation.log

Correlation log output:

abc-123-def|HTTP-Listener|2026-02-11 10:30:15|0|api/data|GET|InSequence|Start
abc-123-def|HTTP-Sender|2026-02-11 10:30:15|45|backend:8080/data|GET|Call|End
abc-123-def|HTTP-Listener|2026-02-11 10:30:15|48|api/data|GET|OutSequence|End
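
The elapsed-time field (the fourth column) makes it easy to see where a request spends its time. A minimal sketch with awk, using sample lines standing in for repository/logs/correlation.log:

```shell
# Sample lines standing in for repository/logs/correlation.log
cat > /tmp/correlation.sample <<'EOF'
abc-123-def|HTTP-Listener|2026-02-11 10:30:15|0|api/data|GET|InSequence|Start
abc-123-def|HTTP-Sender|2026-02-11 10:30:15|45|backend:8080/data|GET|Call|End
abc-123-def|HTTP-Listener|2026-02-11 10:30:15|48|api/data|GET|OutSequence|End
EOF

# Print component, stage, and elapsed ms (field 4) for one activity ID
awk -F'|' -v id="abc-123-def" \
    '$1 == id { printf "%-14s %-12s %sms\n", $2, $7, $4 }' /tmp/correlation.sample
```

This prints one line per stage (0ms, 45ms, 48ms here), immediately pointing at the backend call as the dominant cost.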

HTTP Access Logging

[transport.http.access_log]
enable = true
format = "combined"
# Format: %h %l %u %t "%r" %s %b "%{Referer}i" "%{User-Agent}i" %D

# Or custom format
# format = "%h %t %r %s %D %{X-Correlation-ID}i"
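
With %D appended, slow requests can be pulled straight from the access log. A sketch assuming the combined-plus-%D shape configured above, with the response time as the last field; sample lines stand in for repository/logs/http_access.log:

```shell
# Sample lines in the combined-plus-%D shape configured above
cat > /tmp/access.sample <<'EOF'
10.0.0.5 - - [11/Feb/2026:10:30:15 +0000] "GET /api/data HTTP/1.1" 200 512 "-" "curl/8.0" 38
10.0.0.5 - - [11/Feb/2026:10:30:16 +0000] "POST /api/employees HTTP/1.1" 200 128 "-" "curl/8.0" 1250
EOF

# Report requests whose last field (%D, response time) exceeds 1000
awk '$NF > 1000 { print $NF "ms", $7 }' /tmp/access.sample
```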

Metrics

JMX Monitoring

WSO2 exposes JMX MBeans for JVM and application metrics.

# Enable JMX
[monitoring.jmx]
rmi_hostname = "localhost"
rmi_port = 9999

Connect with JConsole or VisualVM:

jconsole localhost:9999
# Or
jvisualvm

Key MBeans:

  • java.lang:type=Memory: Heap usage
  • java.lang:type=Threading: Thread counts
  • java.lang:type=GarbageCollector: GC stats
  • org.apache.synapse:type=Transport: HTTP transport metrics
  • org.wso2.carbon:type=ServerAdmin: Server status

Prometheus Integration

Export metrics in Prometheus format.

# Enable Prometheus metrics
[monitoring.prometheus]
enable = true
port = 9201

Scrape config (prometheus.yml):

scrape_configs:
  - job_name: 'wso2-apim-gateway'
    metrics_path: /metrics
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - 'gw1.example.com:9201'
          - 'gw2.example.com:9201'
        labels:
          component: gateway

  - job_name: 'wso2-mi'
    metrics_path: /metric-service/metrics
    static_configs:
      - targets:
          - 'mi1.example.com:9201'
        labels:
          component: micro-integrator

Key Prometheus Metrics:

Metric                            Description
wso2_api_request_count_total      Total API requests
wso2_api_error_count_total        Total API errors
wso2_api_response_time_seconds    Response time histogram
jvm_memory_bytes_used             JVM heap usage
jvm_threads_current               Active threads
http_requests_total               HTTP transport requests
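
These counters can also be inspected directly from a scrape. A sketch that computes an overall error percentage, with a canned exposition standing in for the real /metrics response (in practice the input would come from something like curl -sk against the gateway's metrics port):

```shell
# Canned exposition standing in for a real /metrics scrape
cat > /tmp/metrics.sample <<'EOF'
wso2_api_request_count_total{api="EmployeeAPI"} 1800
wso2_api_request_count_total{api="OrderAPI"} 200
wso2_api_error_count_total{api="EmployeeAPI"} 36
wso2_api_error_count_total{api="OrderAPI"} 4
EOF

# Sum requests and errors across all label sets, then print the error ratio
awk '/^wso2_api_request_count_total/ { req += $2 }
     /^wso2_api_error_count_total/   { err += $2 }
     END { printf "%.1f%% errors (%d/%d)\n", 100 * err / req, err, req }' /tmp/metrics.sample
```

This is the same arithmetic the Grafana error-rate panel performs with rate() over a time window.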

Grafana Dashboards

API Gateway Dashboard:

{
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {"expr": "rate(wso2_api_request_count_total[5m])"}
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {"expr": "rate(wso2_api_error_count_total[5m]) / rate(wso2_api_request_count_total[5m]) * 100"}
      ]
    },
    {
      "title": "P95 Response Time",
      "type": "graph",
      "targets": [
        {"expr": "histogram_quantile(0.95, rate(wso2_api_response_time_seconds_bucket[5m]))"}
      ]
    },
    {
      "title": "JVM Heap Usage",
      "type": "gauge",
      "targets": [
        {"expr": "jvm_memory_bytes_used{area='heap'} / jvm_memory_bytes_max{area='heap'} * 100"}
      ]
    }
  ]
}

API Analytics

Built-in Analytics

WSO2 API Manager provides a built-in analytics dashboard.

Enable Analytics:

[apim.analytics]
enable = true

[apim.analytics.properties]
"publisher.reporter.class" = "org.wso2.am.analytics.publisher.sample.reporter.AnalyticsMetricReporter"

Metrics Available:

  • API usage by time period
  • Top APIs by request count
  • Response time distributions
  • Error breakdown by type
  • Geographic distribution of requests
  • Top applications and subscribers

ELK Stack Integration

Send logs and analytics to Elasticsearch for centralized analysis.

Filebeat Configuration (filebeat.yml):

filebeat.inputs:
  - type: log
    paths:
      - /opt/wso2am/repository/logs/wso2carbon.log
    fields:
      log_type: server
    multiline:
      pattern: '^\d{4}-\d{2}-\d{2}'
      negate: true
      match: after

  - type: log
    paths:
      - /opt/wso2am/repository/logs/http_access*.log
    fields:
      log_type: access

  - type: log
    paths:
      - /opt/wso2am/repository/logs/audit.log
    fields:
      log_type: audit

  - type: log
    paths:
      - /opt/wso2am/repository/logs/correlation.log
    fields:
      log_type: correlation

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]
  index: "wso2-%{[fields.log_type]}-%{+yyyy.MM.dd}"
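
The multiline rule on the wso2carbon.log input is what keeps Java stack traces attached to their parent event: any line that does not start with a date is appended to the previous line. The grouping logic can be sketched in awk over sample lines:

```shell
# Sample log: one ERROR with a stack trace, then a separate INFO event
cat > /tmp/carbon.sample <<'EOF'
2026-02-11 10:30:15 ERROR {org.apache.synapse} - Request failed
java.lang.NullPointerException
    at org.example.Handler.handle(Handler.java:42)
2026-02-11 10:30:16 INFO {org.wso2.carbon} - Server healthy
EOF

# A line starting with a date begins a new event; anything else is a continuation
awk '/^[0-9][0-9][0-9][0-9]-/ { if (ev != "") print ev; ev = $0; next }
     { ev = ev " | " $0 }
     END { if (ev != "") print ev }' /tmp/carbon.sample
```

Two events come out: the ERROR line with its stack trace folded in, and the INFO line on its own.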

Logstash Filter (for structured parsing):

filter {
  if [fields][log_type] == "access" {
    grok {
      match => {
        "message" => "%{IPORHOST:client_ip} - - \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}\" %{NUMBER:status} %{NUMBER:bytes} \"%{DATA:referrer}\" \"%{DATA:agent}\" %{NUMBER:response_time}"
      }
    }
    mutate {
      convert => {
        "status" => "integer"
        "bytes" => "integer"
        "response_time" => "integer"
      }
    }
  }
}
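
Before shipping logs through Logstash, it is worth sanity-checking that real access-log lines have the shape the pattern expects. An illustrative check against the combined-plus-%D format configured earlier (the sample line and ERE are assumptions mirroring the grok fields, not part of WSO2):

```shell
# One line in the combined-plus-%D shape; the ERE mirrors the grok fields
line='10.0.0.5 - - [11/Feb/2026:10:30:15 +0000] "POST /api/employees HTTP/1.1" 201 128 "-" "curl/8.0" 42'

printf '%s\n' "$line" \
  | grep -qE '^[0-9.]+ - - \[[^]]+\] "[A-Z]+ [^ ]+ HTTP/[0-9.]+" [0-9]+ [0-9-]+ "[^"]*" "[^"]*" [0-9]+$' \
  && echo "matches" || echo "no match"
```

If a line prints "no match", the Logstash event will carry a _grokparsefailure tag instead of structured fields.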

Distributed Tracing

OpenTelemetry Integration

# Enable OpenTelemetry tracing
[opentelemetry]
enable = true
exporter_type = "otlp"

[opentelemetry.remote]
host = "otel-collector.example.com"
port = 4317

Jaeger Integration

[opentelemetry]
enable = true
exporter_type = "jaeger"

[opentelemetry.remote]
host = "jaeger.example.com"
port = 14250

Trace flow through WSO2:

Client → Gateway (Span 1)
           → Key Validation (Span 2)
           → Backend Call (Span 3)
           → Response Processing (Span 4)
         ← Response to Client

Health Monitoring

Health Check Endpoints

API Manager:

# Gateway health
curl -k https://localhost:8243/services/Version

# Carbon health
curl -k https://localhost:9443/carbon/admin/login.jsp -o /dev/null -w "%{http_code}"

Micro Integrator:

# Liveness (server is running)
curl http://localhost:9164/liveness
# Response: {"status": "active"}

# Readiness (ready to accept requests)
curl http://localhost:9164/readiness
# Response: {"status": "ready"}

# List deployed services
curl http://localhost:9164/management/apis
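
These endpoints are what a load balancer or Kubernetes probe should poll. A probe-script sketch: the real curl call is commented out and a canned response used so the parsing step stays runnable (READINESS_URL is a placeholder, not a WSO2 setting):

```shell
# Probe sketch: succeed only when the readiness endpoint reports "ready".
# In a real probe: body=$(curl -sf "$READINESS_URL") with
# READINESS_URL=http://localhost:9164/readiness
check_ready() {
  body='{"status": "ready"}'   # canned response for illustration
  printf '%s' "$body" | grep -q '"status": *"ready"'
}

if check_ready; then
  echo "ready"
else
  echo "not ready" >&2
  exit 1
fi
```

A nonzero exit marks the instance unhealthy, which is the contract both HAProxy-style checks and Kubernetes exec probes rely on.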

Custom Health Check (MI)

<api name="HealthAPI" context="/health" xmlns="http://ws.apache.org/ns/synapse">
    <resource methods="GET" uri-template="/">
        <inSequence>
            <!-- Check backend connectivity -->
            <call>
                <endpoint>
                    <http method="get" uri-template="http://backend:8080/ping">
                        <timeout>
                            <duration>5000</duration>
                        </timeout>
                    </http>
                </endpoint>
            </call>
            
            <filter source="$axis2:HTTP_SC" regex="200">
                <then>
                    <payloadFactory media-type="json">
                        <format>
                            {"status": "healthy", "backend": "UP", "timestamp": "$1"}
                        </format>
                        <args>
                            <arg expression="get-property('SYSTEM_TIME')"/>
                        </args>
                    </payloadFactory>
                </then>
                <else>
                    <payloadFactory media-type="json">
                        <format>
                            {"status": "degraded", "backend": "DOWN", "timestamp": "$1"}
                        </format>
                        <args>
                            <arg expression="get-property('SYSTEM_TIME')"/>
                        </args>
                    </payloadFactory>
                    <property name="HTTP_SC" value="503" scope="axis2"/>
                </else>
            </filter>
            <respond/>
        </inSequence>
    </resource>
</api>

Alerting

Prometheus Alerting Rules

groups:
  - name: wso2-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(wso2_api_error_count_total[5m]) / rate(wso2_api_request_count_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 5%"

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(wso2_api_response_time_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 response time above 2 seconds"

      - alert: HighMemoryUsage
        expr: jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage above 85%"

      - alert: GatewayDown
        expr: up{job="wso2-apim-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Gateway instance is down"

Troubleshooting

Common Issues and Solutions

Symptom                   Likely Cause            Action
401 Unauthorized          Expired/invalid token   Check token expiry, regenerate
403 Forbidden             Missing scopes          Verify scope assignment
500 Internal Error        Backend failure         Check backend logs, fault sequence
503 Service Unavailable   Backend down            Check endpoint connectivity
Slow responses            Resource exhaustion     Check JVM heap, thread count, DB pool
OOM crash                 Insufficient heap       Increase -Xmx, check for memory leaks
Connection timeout        Network or firewall     Verify connectivity, increase timeout

Diagnostic Commands

# Check server status
curl -k https://localhost:9443/services/Version

# Thread dump (find deadlocks, stuck threads)
kill -3 <PID>
# Or
jstack <PID> > thread_dump.txt

# Heap dump (analyze memory)
jmap -dump:format=b,file=heap.hprof <PID>

# Check open file descriptors
ls /proc/<PID>/fd | wc -l

# Check active connections
ss -tnp | grep <PID> | wc -l

# Monitor GC in real-time
jstat -gcutil <PID> 1000
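
jstat output is easiest to act on when filtered. A sketch that flags samples where old-generation occupancy stays high; the column layout (O is the 4th column, FGC the 9th) is the typical JDK -gcutil layout and canned output stands in for a live PID:

```shell
# Canned jstat -gcutil output (columns: S0 S1 E O M CCS YGC YGCT FGC FGCT GCT)
cat > /tmp/gcutil.sample <<'EOF'
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  97.52  68.10  54.23  95.01  89.11    341    4.213     6    1.820    6.033
  0.00  98.11  22.45  91.77  95.01  89.11    360    4.501     9    3.914    8.415
EOF

# Flag samples with old gen (column 4) above 85%, reporting full-GC count (column 9)
awk 'NR > 1 && $4 > 85 { printf "old gen at %.1f%% after %d full GCs\n", $4, $9 }' /tmp/gcutil.sample
```

Old gen that stays above the threshold while the full-GC count climbs is the classic signature of a heap that is too small or leaking, which is when to take the heap dump shown above.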

Analyzing Slow APIs

# 1. Enable correlation logging
# 2. Send request with correlation ID
curl -H "activityid: debug-001" https://gateway:8243/api/slow-endpoint

# 3. Analyze correlation log for time spent in each stage
grep "debug-001" repository/logs/correlation.log

# Output shows time per step:
# debug-001|HTTP-Listener|...|0ms|InSequence|Start
# debug-001|HTTP-Sender|...|1250ms|Backend|Call
# debug-001|HTTP-Listener|...|1255ms|OutSequence|End
# → Backend call took 1250ms; investigate backend

Log Level Changes at Runtime

# MI CLI: change log level without restart
mi update log-level org.apache.synapse DEBUG

# Revert
mi update log-level org.apache.synapse INFO

Key Takeaways

  • Correlation logging traces requests end-to-end across components
  • Wire logs are essential for debugging but too verbose for production
  • Prometheus + Grafana provides real-time metrics and dashboards
  • ELK stack centralizes logs for search and analysis
  • OpenTelemetry/Jaeger adds distributed tracing across services
  • Health check endpoints enable load balancer and Kubernetes probes
  • Thread dumps and heap dumps diagnose JVM-level issues
  • Runtime log level changes avoid restarts during incident investigation

Next Steps

Continue to Chapter 12: Advanced Topics to learn about advanced integration patterns, streaming, and microservices.