Chapter 10: Monitoring & Observability

You cannot manage what you cannot measure. Azure Monitor is the unified observability platform for all Azure resources: metrics, logs, alerts, and dashboards.

Azure Monitor Overview

Azure Monitor Architecture

Data Sources                  Azure Monitor               Destinations
──────────────────────────────────────────────────────────────────────
Applications (App Insights) ─┐
Azure Resources (Metrics)   ─┤
Azure Resources (Logs)      ─┤─► Log Analytics Workspace ─► Alerts
VMs (Agent)                 ─┤                            ─► Dashboards
Containers (Container Ins.) ─┘                            ─► Workbooks
                                                          ─► Power BI
Custom metrics / logs ────────► Metrics store            ─► Logic Apps
                                                          ─► Event Hub

The Three Pillars of Observability

Pillar    Azure Service                         What It Captures
───────────────────────────────────────────────────────────────────────────────────
Metrics   Azure Monitor Metrics                 Numeric time-series data (CPU %, request/s)
Logs      Log Analytics (Azure Monitor Logs)    Structured events, traces, errors
Traces    Application Insights                  End-to-end request traces across services

Azure Monitor Metrics

Platform metrics are collected automatically for Azure resources, with no configuration required. View them in the Portal under Monitor > Metrics.

# List available metrics for an App Service
az monitor metrics list-definitions \
  --resource $(az webapp show --resource-group myapp-rg --name myapp-api --query id -o tsv) \
  --output table

# Query a metric (total CPU time consumed, per minute, over the last hour)
# Note: `date -d` is GNU date syntax; on macOS/BSD use `date -u -v-1H` instead
az monitor metrics list \
  --resource $(az webapp show --resource-group myapp-rg --name myapp-api --query id -o tsv) \
  --metric "CpuTime" \
  --aggregation Total \
  --interval PT1M \
  --start-time $(date -u -d "1 hour ago" +%Y-%m-%dT%H:%MZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%MZ) \
  --output table
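
The `date -d` calls above depend on GNU date. As a portable alternative, the following minimal sketch builds the same start and end timestamps in Python. The function name `metric_time_window` is hypothetical and not part of any Azure tooling.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def metric_time_window(hours: int, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (start, end) UTC timestamps such as 2024-01-01T12:00Z.

    `now` must be timezone-aware; it defaults to the current UTC time.
    """
    end = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    start = end - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%MZ"
    return start.strftime(fmt), end.strftime(fmt)


# Print the two arguments ready to paste into the az command
start, end = metric_time_window(1)
print(f"--start-time {start} --end-time {end}")
```

Passing an explicit `now` makes the window reproducible, which is handy when comparing the same hour across several resources.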

Log Analytics Workspace

Log Analytics is the centralised log store. Resources send their diagnostic logs here; you query them with KQL (Kusto Query Language).

# Create a Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group myapp-rg \
  --workspace-name myapp-logs \
  --location eastus \
  --sku PerGB2018   # Pay per GB ingested

# Get the workspace ID (the customerId GUID used by CLI queries)
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group myapp-rg \
  --workspace-name myapp-logs \
  --query customerId -o tsv)

# Send diagnostic logs from an App Service to the workspace
az monitor diagnostic-settings create \
  --name "AppServiceLogs" \
  --resource $(az webapp show --resource-group myapp-rg --name myapp-api --query id -o tsv) \
  --workspace $(az monitor log-analytics workspace show --resource-group myapp-rg --workspace-name myapp-logs --query id -o tsv) \
  --logs '[
    {"category": "AppServiceHTTPLogs", "enabled": true},
    {"category": "AppServiceConsoleLogs", "enabled": true},
    {"category": "AppServiceAppLogs", "enabled": true}
  ]' \
  --metrics '[{"category": "AllMetrics", "enabled": true}]'

# Enable for Azure SQL
az monitor diagnostic-settings create \
  --name "SQLLogs" \
  --resource $(az sql db show --resource-group myapp-rg --server myapp-sql-server --name myappdb --query id -o tsv) \
  --workspace $(az monitor log-analytics workspace show --resource-group myapp-rg --workspace-name myapp-logs --query id -o tsv) \
  --logs '[{"category": "SQLInsights", "enabled": true}, {"category": "Errors", "enabled": true}]' \
  --metrics '[{"category": "Basic", "enabled": true}]'
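
Hand-writing the JSON for `--logs` gets error-prone once several resources share the same categories. One possible approach, sketched here under the assumption that you script the CLI from Python, is to generate the array. The helper `build_log_settings` is a hypothetical name, not an Azure API.

```python
import json
from typing import List


def build_log_settings(categories: List[str], enabled: bool = True) -> str:
    """Serialise log categories into the JSON array shape passed to --logs."""
    return json.dumps([{"category": name, "enabled": enabled} for name in categories])


# Same categories as the App Service example above
app_service_logs = build_log_settings(
    ["AppServiceHTTPLogs", "AppServiceConsoleLogs", "AppServiceAppLogs"]
)
print(app_service_logs)
```

The printed string can be passed as the `--logs` value in place of the inline JSON.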

KQL (Kusto Query Language)

KQL is used to query logs in Log Analytics, Application Insights, and Azure Data Explorer.

// All App Service HTTP requests in the last hour
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| project TimeGenerated, CsMethod, CsUriStem, ScStatus, TimeTaken

// HTTP 5xx errors count by URL
AppServiceHTTPLogs
| where TimeGenerated > ago(24h)
| where ScStatus >= 500
| summarize ErrorCount = count() by CsUriStem
| order by ErrorCount desc
| take 20

// Average response time by hour
AppServiceHTTPLogs
| where TimeGenerated > ago(7d)
| summarize AvgResponseMs = avg(TimeTaken) by bin(TimeGenerated, 1h)
| render timechart

// Application exceptions
AppExceptions
| where TimeGenerated > ago(1h)
| project TimeGenerated, Message, OuterMessage, ExceptionType, Assembly
| order by TimeGenerated desc

// Cosmos DB request units consumed by operation
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB"
| where TimeGenerated > ago(1h)
| summarize TotalRU = sum(todouble(requestCharge_s)) by operationType_s
| order by TotalRU desc

// VM CPU > 90%
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where TimeGenerated > ago(1h)
| where CounterValue > 90
| project TimeGenerated, Computer, CounterValue
| order by CounterValue desc

# Run a KQL query from the CLI
az monitor log-analytics query \
  --workspace $WORKSPACE_ID \
  --analytics-query "AppServiceHTTPLogs | where ScStatus >= 500 | summarize count() by CsUriStem | order by count_ desc | take 10" \
  --output table
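
To make the KQL operators concrete, the sketch below reproduces two of the queries above (5xx count by URL, and average response time per hour) in plain Python. The sample rows and function names are made up for illustration; only the column names come from the queries.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical rows shaped like AppServiceHTTPLogs records
rows = [
    {"TimeGenerated": datetime(2024, 1, 1, 10, 5), "CsUriStem": "/orders", "ScStatus": 500, "TimeTaken": 120},
    {"TimeGenerated": datetime(2024, 1, 1, 10, 20), "CsUriStem": "/orders", "ScStatus": 503, "TimeTaken": 300},
    {"TimeGenerated": datetime(2024, 1, 1, 10, 40), "CsUriStem": "/users", "ScStatus": 200, "TimeTaken": 60},
    {"TimeGenerated": datetime(2024, 1, 1, 11, 10), "CsUriStem": "/users", "ScStatus": 500, "TimeTaken": 200},
    {"TimeGenerated": datetime(2024, 1, 1, 11, 30), "CsUriStem": "/orders", "ScStatus": 200, "TimeTaken": 100},
]


def errors_by_url(records, top=20):
    """where ScStatus >= 500 | summarize count() by CsUriStem | order desc | take top"""
    counts = Counter(r["CsUriStem"] for r in records if r["ScStatus"] >= 500)
    return counts.most_common(top)


def avg_response_by_hour(records):
    """summarize avg(TimeTaken) by bin(TimeGenerated, 1h)"""
    buckets = defaultdict(list)
    for r in records:
        # bin(..., 1h) truncates each timestamp to the start of its hour
        hour = r["TimeGenerated"].replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(r["TimeTaken"])
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}


print(errors_by_url(rows))         # [('/orders', 2), ('/users', 1)]
print(avg_response_by_hour(rows))  # 10:00 -> 160.0, 11:00 -> 150.0
```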

Application Insights

Application Insights is an Application Performance Management (APM) service. It provides:

  • Request and dependency tracking
  • Exception logging
  • Performance profiling
  • User behaviour analytics
  • Distributed tracing across microservices
  • Live Metrics Stream (real-time)

# Create an Application Insights resource
az monitor app-insights component create \
  --resource-group myapp-rg \
  --app myapp-insights \
  --location eastus \
  --workspace $(az monitor log-analytics workspace show --resource-group myapp-rg --workspace-name myapp-logs --query id -o tsv)

# Get the connection string (for the SDK)
az monitor app-insights component show \
  --resource-group myapp-rg \
  --app myapp-insights \
  --query connectionString \
  --output tsv
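
The connection string is a semicolon-separated list of `Key=Value` pairs. When debugging configuration it can help to split it apart, as in this small sketch. The helper name and the sample values are placeholders, not a real resource.

```python
from typing import Dict


def parse_connection_string(conn: str) -> Dict[str, str]:
    """Split 'Key1=Value1;Key2=Value2' into a dictionary."""
    settings: Dict[str, str] = {}
    for segment in conn.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing semicolon
        key, sep, value = segment.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Malformed segment: {segment!r}")
        settings[key.strip()] = value.strip()
    return settings


# Placeholder values for illustration only
example = (
    "InstrumentationKey=00000000-0000-0000-0000-000000000000;"
    "IngestionEndpoint=https://example.invalid/"
)
settings = parse_connection_string(example)
print(sorted(settings))  # ['IngestionEndpoint', 'InstrumentationKey']
```

In application code you would normally pass the whole string to the SDK unchanged; parsing is only useful for inspection.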

Instrumenting Your App

Python (Flask / FastAPI):

# requirements.txt
# azure-monitor-opentelemetry
# flask

import os

from azure.monitor.opentelemetry import configure_azure_monitor
from flask import Flask, jsonify
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Configure the exporter before creating the app
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route("/orders/<order_id>")
def get_order(order_id):
    # This request is automatically traced
    # `db` stands in for your own data-access layer (not defined here)
    order = db.get_order(order_id)   # Traced if the DB client library is instrumented
    return jsonify(order)

Node.js:

// app.js: MUST be the first import
// With no options, the connection string is read from APPLICATIONINSIGHTS_CONNECTION_STRING
const { useAzureMonitor } = require("@azure/monitor-opentelemetry");
useAzureMonitor();

const express = require("express");
const app = express();
// All HTTP requests and dependencies are now auto-instrumented

Custom telemetry:

from opentelemetry import trace
from opentelemetry.metrics import get_meter

tracer = trace.get_tracer(__name__)
meter = get_meter(__name__)
orders_counter = meter.create_counter("orders_processed")

def process_order(order_id: str):
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.region", "eastus")

        result = do_processing(order_id)

        orders_counter.add(1, {"status": "success"})
        return result
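
A span such as `process-order` joins a distributed trace only if the trace context crosses service boundaries. OpenTelemetry normally propagates it for you in the W3C `traceparent` header. The standard-library sketch below shows what that header carries and how a downstream call keeps the trace ID while getting a new span ID. The function names are hypothetical; this is not how the SDK is implemented.

```python
import re
import secrets
from typing import Dict

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Dict[str, str]:
    """Validate a traceparent header and return its four fields."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"Invalid traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("trace-id and parent-id must not be all zeros")
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call: same trace-id, fresh span id."""
    ctx = parse_traceparent(incoming)
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

If a proxy or gateway strips this header, the downstream spans start a new trace and the end-to-end view breaks.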

Application Map

The Application Map in the portal visualises all dependencies (databases, external APIs, queues) and shows failure rates and latency for each connection automatically.

Smart Detection

Application Insights automatically detects:

  • Slow server response times
  • Degraded dependency response times
  • Abnormal failure rate increases
  • Memory leaks

No configuration required. Alerts appear in the Failures and Performance blades.

Alerts

Alerts notify you (email, SMS, Teams, PagerDuty, webhook) when a condition is met.

# Create an action group (who to notify)
az monitor action-group create \
  --resource-group myapp-rg \
  --name oncall-team \
  --short-name oncall \
  --action email ops-email ops@mycompany.com

# Metric alert: notify when App Service HTTP 5xx > 10 in 5 minutes
az monitor metrics alert create \
  --resource-group myapp-rg \
  --name "High-Error-Rate" \
  --scopes $(az webapp show --resource-group myapp-rg --name myapp-api --query id -o tsv) \
  --condition "total Http5xx > 10" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 1 \
  --action oncall-team \
  --description "More than 10 HTTP 5xx errors in 5 minutes"

# Log alert: alert on exceptions from Application Insights
# The condition counts the rows returned by the named query within the window
az monitor scheduled-query create \
  --resource-group myapp-rg \
  --name "App-Exceptions" \
  --scopes $(az monitor app-insights component show --resource-group myapp-rg --app myapp-insights --query id -o tsv) \
  --condition "count 'ExceptionQuery' > 5" \
  --condition-query ExceptionQuery="exceptions" \
  --evaluation-frequency 5m \
  --window-size 5m \
  --severity 2 \
  --action-groups $(az monitor action-group show --resource-group myapp-rg --name oncall-team --query id -o tsv)
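
The interplay of window size and evaluation frequency is easier to see with numbers. This simplified sketch models the metric alert above: every minute it adds up the last 5 minutes of Http5xx and compares the total with the threshold. It is an illustrative model only; it ignores alert state, resolution, and data latency, and the sample series is made up.

```python
from typing import List


def evaluate_metric_alert(per_minute_totals: List[int], window: int = 5, threshold: int = 10) -> List[int]:
    """Return the minute indexes at which the sliding-window total exceeds the threshold."""
    fired = []
    for minute in range(len(per_minute_totals)):
        # Look back over the most recent `window` minutes, including this one
        recent = per_minute_totals[max(0, minute - window + 1): minute + 1]
        if sum(recent) > threshold:
            fired.append(minute)
    return fired


# Hypothetical Http5xx counts per minute: a short burst at minutes 5 and 6
http5xx = [0, 2, 3, 1, 0, 6, 4, 0, 0, 0, 0, 0]
print(evaluate_metric_alert(http5xx))  # [5, 6, 7]
```

Note that the condition stays true at minute 7 even though no new errors arrived, because the burst is still inside the 5-minute window.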

Alert Severity Levels

Severity             Meaning               Example
──────────────────────────────────────────────────────────────────
0 (Critical)         Service is down       All requests failing
1 (Error)            Significant impact    High error rate
2 (Warning)          Potential issue       CPU > 80%
3 (Informational)    Non-urgent            Deployment completed
4 (Verbose)          Debugging only        Metric crossed threshold

Azure Monitor Workbooks

Workbooks are interactive visual reports combining metrics, logs, and text. Create custom dashboards for SLA reporting, cost analysis, or capacity planning.

Create in the Portal: Monitor > Workbooks > + New

Container Insights (AKS Monitoring)

# Enable Container Insights on AKS
az aks enable-addons \
  --resource-group myapp-rg \
  --name myapp-aks \
  --addons monitoring \
  --workspace-resource-id $(az monitor log-analytics workspace show --resource-group myapp-rg --workspace-name myapp-logs --query id -o tsv)

This gives you:

  • Container CPU and memory usage
  • Node resource utilisation
  • Pod restart counts
  • Kubernetes event logs
  • Live container logs

Observability Best Practices

  1. Centralise logs: Send all diagnostic logs to a single Log Analytics workspace
  2. Instrument your apps: Add Application Insights to every service
  3. Alert on symptoms, not causes: Alert on high error rate, not CPU (end-user impact first)
  4. Set up dashboards before incidents: Create runbook dashboards so on-call engineers know where to look
  5. Use distributed tracing: Ensure traceparent headers propagate across service boundaries
  6. Define SLOs: Set explicit targets (e.g., 99.9% of requests under 500 ms) and alert when approaching the error budget
  7. Review and tune alerts: Too many false-positive alerts cause alert fatigue and get ignored
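
The error budget from item 6 is simple arithmetic. The sketch below assumes a request-based SLO and an alert once 80% of the budget is spent; both the function names and the 80% cut-off are illustrative choices, not Azure settings.

```python
def error_budget(slo_target: float, total_requests: int, bad_requests: int) -> dict:
    """Summarise error-budget usage for a request-based SLO.

    slo_target is a fraction (0.999 for 99.9%); a "bad" request is one that
    failed or exceeded the latency target.
    """
    allowed_bad = (1.0 - slo_target) * total_requests
    consumed = bad_requests / allowed_bad if allowed_bad > 0 else float("inf")
    return {
        "allowed_bad": allowed_bad,
        "remaining_bad": allowed_bad - bad_requests,
        "consumed_fraction": consumed,
    }


def should_alert(budget: dict, alert_at: float = 0.8) -> bool:
    """True once the consumed share of the budget reaches the alert level."""
    return budget["consumed_fraction"] >= alert_at


# 99.9% of 1,000,000 requests allows about 1,000 bad ones; 850 have occurred
budget = error_budget(0.999, total_requests=1_000_000, bad_requests=850)
print(round(budget["allowed_bad"]), round(budget["consumed_fraction"], 2), should_alert(budget))
```

The values are floating point, so round them for display and avoid comparing them for exact equality.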

Cost Management

Track and control your Azure spend:

# List usage details for the current billing period
az consumption usage list \
  --billing-period-name $(date +%Y%m) \
  --output table

# Create a monthly cost budget ($500/month)
az consumption budget create \
  --budget-name MyBudget \
  --amount 500 \
  --category cost \
  --time-grain monthly \
  --start-date $(date +%Y-%m-01) \
  --end-date 2026-12-31

# This command does not take notification settings. To email
# finance@mycompany.com at 80% of the budget, add an alert condition to the
# budget in the Portal (Cost Management + Billing > Budgets).
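
A budget notification with an 80% threshold on a $500 budget triggers once spend passes $400. The following sketch spells out that check; `budget_status` is a hypothetical helper, and the spend figure is made up.

```python
def budget_status(amount: float, spend_to_date: float, threshold_percent: float = 80) -> dict:
    """Compare spend with a percentage threshold of the budget amount."""
    threshold_amount = amount * threshold_percent / 100
    return {
        "threshold_amount": threshold_amount,
        "percent_used": spend_to_date * 100 / amount,
        "notify": spend_to_date > threshold_amount,  # mirrors a "GreaterThan" rule
    }


status = budget_status(amount=500, spend_to_date=420)
print(status)  # {'threshold_amount': 400.0, 'percent_used': 84.0, 'notify': True}
```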

In the Portal: Cost Management + Billing > Cost Analysis gives a breakdown by service, resource group, tag, or time period.

Next Steps

Continue to 11-architecture-best-practices.md to learn about the Azure Well-Architected Framework, reliability patterns, and production readiness.