Chapter 10: Monitoring & Observability | Microsoft Azure Tutorial

How to know what your system is doing. Azure Monitor is the umbrella product. Underneath it sit metrics, Log Analytics, Application Insights, and a small army of alert types. The names overlap. The model below makes them stop fighting each other.

Azure Monitor Overview

Azure Monitor Architecture

Data Sources                  Azure Monitor               Destinations
──────────────────────────────────────────────────────────────────────
Applications (App Insights) ─┐
Azure Resources (Metrics)   ─┤
Azure Resources (Logs)      ─┤─► Log Analytics Workspace ─► Alerts
VMs (Agent)                 ─┤                            ─► Dashboards
Containers (Container Ins.) ─┘                            ─► Workbooks
                                                          ─► Power BI
Custom metrics / logs ────────► Metrics store            ─► Logic Apps
                                                          ─► Event Hub

The Three Pillars of Observability

Pillar	Azure Service	What It Captures
Metrics	Azure Monitor Metrics	Numeric time-series data (CPU %, request/s)
Logs	Log Analytics (Azure Monitor Logs)	Structured events, traces, errors
Traces	Application Insights	End-to-end request traces across services

Azure Monitor Metrics

Azure collects platform metrics automatically for most resources, with no configuration required. View them in the Portal under Monitor > Metrics.

# List available metrics for an App Service
az monitor metrics list-definitions \
  --resource $(az webapp show --resource-group myapp-rg --name myapp-api --query id -o tsv) \
  --output table

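# Query a metric (CPU time consumed per minute over the last hour)
# CpuTime is reported in seconds, so Total is the meaningful aggregation.
# Note: `date -d` is GNU date syntax; on macOS use `date -u -v-1H` instead.
az monitor metrics list \
  --resource $(az webapp show --resource-group myapp-rg --name myapp-api --query id -o tsv) \
  --metric "CpuTime" \
  --aggregation Total \
  --interval PT1M \
  --start-time $(date -u -d "1 hour ago" +%Y-%m-%dT%H:%MZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%MZ) \
  --output table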
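The `date -d "1 hour ago"` expression above only works with GNU date. As a portable alternative, the following minimal Python sketch computes the same UTC window; the function name `metric_window` is an illustrative assumption, not part of any Azure tooling.

```python
from datetime import datetime, timedelta, timezone


def metric_window(now: datetime, hours: int = 1) -> tuple[str, str]:
    """Return (start, end) as UTC ISO-8601 strings covering the last `hours` hours."""
    end = now.astimezone(timezone.utc).replace(microsecond=0)
    start = end - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)


start, end = metric_window(datetime.now(timezone.utc))
# These values can be passed to `az monitor metrics list`.
print(f"--start-time {start} --end-time {end}")
```

Passing the clock in as an argument keeps the function easy to test with a fixed timestamp.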

Log Analytics Workspace

Log Analytics is the centralised log store. Resources send their diagnostic logs here once a diagnostic setting points at the workspace; you query them with KQL (Kusto Query Language).

# Create a Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group myapp-rg \
  --workspace-name myapp-logs \
  --location eastus \
  --sku PerGB2018   # Pay per GB ingested

# Get the workspace ID (the customerId GUID used when querying)
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group myapp-rg \
  --workspace-name myapp-logs \
  --query customerId -o tsv)

# Send diagnostic logs from an App Service to the workspace
az monitor diagnostic-settings create \
  --name "AppServiceLogs" \
  --resource $(az webapp show --resource-group myapp-rg --name myapp-api --query id -o tsv) \
  --workspace $(az monitor log-analytics workspace show --resource-group myapp-rg --workspace-name myapp-logs --query id -o tsv) \
  --logs '[
    {"category": "AppServiceHTTPLogs", "enabled": true},
    {"category": "AppServiceConsoleLogs", "enabled": true},
    {"category": "AppServiceAppLogs", "enabled": true}
  ]' \
  --metrics '[{"category": "AllMetrics", "enabled": true}]'

# Enable for Azure SQL
az monitor diagnostic-settings create \
  --name "SQLLogs" \
  --resource $(az sql db show --resource-group myapp-rg --server myapp-sql-server --name myappdb --query id -o tsv) \
  --workspace $(az monitor log-analytics workspace show --resource-group myapp-rg --workspace-name myapp-logs --query id -o tsv) \
  --logs '[{"category": "SQLInsights", "enabled": true}, {"category": "Errors", "enabled": true}]' \
  --metrics '[{"category": "Basic", "enabled": true}]'
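Hand-writing the JSON for `--logs` is easy to get wrong once quoting is involved. One possible approach, sketched here under the assumption that you generate the payload in a small helper script, is to build it with `json.dumps`. The function name `log_settings` is hypothetical.

```python
import json


def log_settings(categories: list[str], enabled: bool = True) -> str:
    """Build the JSON array passed to --logs from a list of category names."""
    return json.dumps([{"category": c, "enabled": enabled} for c in categories])


# Same categories as the App Service example above.
app_service_logs = log_settings(
    ["AppServiceHTTPLogs", "AppServiceConsoleLogs", "AppServiceAppLogs"]
)
print(app_service_logs)
```

The printed string can then be substituted into the command, for example with shell command substitution.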

KQL (Kusto Query Language)

KQL is used to query logs in Log Analytics, Application Insights, and Azure Data Explorer.

// All App Service HTTP requests in the last hour
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| project TimeGenerated, CsMethod, CsUriStem, ScStatus, TimeTaken

// HTTP 5xx errors count by URL
AppServiceHTTPLogs
| where TimeGenerated > ago(24h)
| where ScStatus >= 500
| summarize ErrorCount = count() by CsUriStem
| order by ErrorCount desc
| take 20

// Average response time by hour
AppServiceHTTPLogs
| where TimeGenerated > ago(7d)
| summarize AvgResponseMs = avg(TimeTaken) by bin(TimeGenerated, 1h)
| render timechart

// Application exceptions
AppExceptions
| where TimeGenerated > ago(1h)
| project TimeGenerated, Message, OuterMessage, ExceptionType, Assembly
| order by TimeGenerated desc

// Cosmos DB request units consumed by operation
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB"
| where TimeGenerated > ago(1h)
| summarize TotalRU = sum(todouble(requestCharge_s)) by operationType_s
| order by TotalRU desc

// VM CPU > 90%
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where TimeGenerated > ago(1h)
| where CounterValue > 90
| project TimeGenerated, Computer, CounterValue
| order by CounterValue desc
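A KQL query reads as a pipeline: each `|` hands the rows produced so far to the next operator. The Python sketch below mirrors the 5xx-count query and the hourly `bin()` query on a handful of in-memory rows. The sample rows and function names are invented for illustration; this only approximates the semantics and is not how Log Analytics executes queries.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample rows standing in for the AppServiceHTTPLogs table.
rows = [
    {"TimeGenerated": datetime(2024, 1, 1, 10, 5), "CsUriStem": "/orders", "ScStatus": 500, "TimeTaken": 120},
    {"TimeGenerated": datetime(2024, 1, 1, 10, 40), "CsUriStem": "/orders", "ScStatus": 200, "TimeTaken": 80},
    {"TimeGenerated": datetime(2024, 1, 1, 11, 15), "CsUriStem": "/orders", "ScStatus": 503, "TimeTaken": 300},
    {"TimeGenerated": datetime(2024, 1, 1, 11, 20), "CsUriStem": "/cart", "ScStatus": 502, "TimeTaken": 100},
    {"TimeGenerated": datetime(2024, 1, 1, 11, 50), "CsUriStem": "/health", "ScStatus": 200, "TimeTaken": 20},
]


def top_error_urls(rows, limit=20):
    # | where ScStatus >= 500
    errors = (r for r in rows if r["ScStatus"] >= 500)
    # | summarize ErrorCount = count() by CsUriStem
    counts = Counter(r["CsUriStem"] for r in errors)
    # | order by ErrorCount desc | take <limit>
    return counts.most_common(limit)


def avg_time_taken_by_hour(rows):
    # bin(TimeGenerated, 1h) floors each timestamp to the start of its hour
    buckets = defaultdict(list)
    for r in rows:
        hour = r["TimeGenerated"].replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(r["TimeTaken"])
    # | summarize AvgResponseMs = avg(TimeTaken) by bin(...)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}


print(top_error_urls(rows))
print(avg_time_taken_by_hour(rows))
```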

# Run a KQL query from the CLI
az monitor log-analytics query \
  --workspace $WORKSPACE_ID \
  --analytics-query "AppServiceHTTPLogs | where ScStatus >= 500 | summarize count() by CsUriStem | order by count_ desc | take 10" \
  --output table

Application Insights

Application Insights is an Application Performance Management (APM) service. It provides:

Request and dependency tracking
Exception logging
Performance profiling
User behaviour analytics
Distributed tracing across microservices
Live Metrics Stream (real-time)

# Create an Application Insights resource
az monitor app-insights component create \
  --resource-group myapp-rg \
  --app myapp-insights \
  --location eastus \
  --workspace $(az monitor log-analytics workspace show --resource-group myapp-rg --workspace-name myapp-logs --query id -o tsv)

# Get the connection string (for the SDK)
az monitor app-insights component show \
  --resource-group myapp-rg \
  --app myapp-insights \
  --query connectionString \
  --output tsv

Instrumenting Your App

Python (Flask / FastAPI):

# requirements.txt
# azure-monitor-opentelemetry
# flask

import os

from azure.monitor.opentelemetry import configure_azure_monitor

# Configure telemetry first. The distro bundles Flask instrumentation,
# so no separate FlaskInstrumentor call is needed.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)

# Import Flask after configure_azure_monitor() so that it is instrumented
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/orders/<order_id>")
def get_order(order_id):
    # This request is automatically traced
    # `db` is a placeholder for your data layer; calls made through
    # supported libraries (e.g. psycopg2, requests) are traced as dependencies
    order = db.get_order(order_id)
    return jsonify(order)

Node.js:

// app.js: MUST run before any other import
// Reads the connection string from APPLICATIONINSIGHTS_CONNECTION_STRING
const { useAzureMonitor } = require("@azure/monitor-opentelemetry");
useAzureMonitor();

const express = require("express");
const app = express();
// All HTTP requests and dependencies are now auto-instrumented

Custom telemetry:

from opentelemetry import trace
from opentelemetry.metrics import get_meter

tracer = trace.get_tracer(__name__)
meter = get_meter(__name__)
orders_counter = meter.create_counter("orders_processed")

def process_order(order_id: str):
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.region", "eastus")

        result = do_processing(order_id)  # placeholder for your own business logic

        orders_counter.add(1, {"status": "success"})
        return result
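Spans like the one above are stitched into a single trace across services by the W3C Trace Context `traceparent` header, which the OpenTelemetry instrumentation normally injects and extracts for you. The sketch below shows what the header contains, using only the standard library. The function names are illustrative assumptions, and real services should rely on the OpenTelemetry propagators rather than hand-rolled parsing.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields, validating the format."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming)
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(child_traceparent(incoming))
```

The key property is that the trace ID stays constant from hop to hop while each service contributes its own span ID. If a proxy or client strips the header, the trace splits in two.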

Application Map

The Application Map in the portal visualises all dependencies (databases, external APIs, queues) and shows failure rates and latency for each connection automatically.

Smart Detection

Application Insights automatically detects:

Slow server response times
Degraded dependency response times
Abnormal failure rate increases
Memory leaks

No configuration required. Alerts appear in the Failures and Performance blades.

Alerts

Alerts notify you (email, SMS, Teams, PagerDuty, webhook) when a condition is met.

# Create an action group (who to notify)
az monitor action-group create \
  --resource-group myapp-rg \
  --name oncall-team \
  --short-name oncall \
  --action email ops-email ops@mycompany.com

# Metric alert: notify when App Service HTTP 5xx > 10 in 5 minutes
az monitor metrics alert create \
  --resource-group myapp-rg \
  --name "High-Error-Rate" \
  --scopes $(az webapp show --resource-group myapp-rg --name myapp-api --query id -o tsv) \
  --condition "total Http5xx > 10" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 1 \
  --action oncall-team \
  --description "More than 10 HTTP 5xx errors in 5 minutes"

# Log alert: alert on exceptions from Application Insights
# The condition counts the rows returned by the named query placeholder,
# so the query must return one row per exception (no summarize).
az monitor scheduled-query create \
  --resource-group myapp-rg \
  --name "App-Exceptions" \
  --scopes $(az monitor app-insights component show --resource-group myapp-rg --app myapp-insights --query id -o tsv) \
  --condition "count 'Placeholder_1' > 5" \
  --condition-query Placeholder_1="exceptions" \
  --evaluation-frequency 5m \
  --window-size 5m \
  --severity 2 \
  --action-groups $(az monitor action-group show --resource-group myapp-rg --name oncall-team --query id -o tsv)
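To see how a window size of 5 minutes interacts with an evaluation frequency of 1 minute, the following simplified Python sketch re-evaluates a sliding 5-minute total every minute. The per-minute counts are invented, and the sketch does not model Azure Monitor's alert state handling (firing, resolving, or suppression).

```python
from collections import deque


def evaluate_alert(per_minute_counts, window=5, threshold=10):
    """Return (minute, window_total, fired) for each one-minute evaluation."""
    recent = deque(maxlen=window)  # keeps only the last `window` minutes
    results = []
    for minute, count in enumerate(per_minute_counts):
        recent.append(count)
        total = sum(recent)
        results.append((minute, total, total > threshold))
    return results


# Hypothetical HTTP 5xx counts, one value per minute.
counts = [0, 2, 3, 1, 4, 6, 0, 0, 0, 0]
for minute, total, fired in evaluate_alert(counts):
    print(f"minute {minute}: total over window = {total:2d}, condition met = {fired}")
```

Because the window slides, a single burst keeps the condition true for several consecutive evaluations, and a total of exactly 10 does not satisfy `> 10`.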

Alert Severity Levels

Severity	Meaning	Example
0 (Critical)	Service is down	All requests failing
1 (Error)	Significant impact	High error rate
2 (Warning)	Potential issue	CPU > 80%
3 (Informational)	Non-urgent	Deployment completed
4 (Verbose)	Debugging only	Metric crossed threshold

Azure Monitor Workbooks

Workbooks are interactive visual reports combining metrics, logs, and text. Create custom dashboards for SLA reporting, cost analysis, or capacity planning.

Create in the Portal: Monitor > Workbooks > + New

Container Insights (AKS Monitoring)

# Enable Container Insights on AKS
az aks enable-addons \
  --resource-group myapp-rg \
  --name myapp-aks \
  --addons monitoring \
  --workspace-resource-id $(az monitor log-analytics workspace show --resource-group myapp-rg --workspace-name myapp-logs --query id -o tsv)

This gives you:

Container CPU and memory usage
Node resource utilisation
Pod restart counts
Kubernetes event logs
Live container logs

Observability Best Practices

Centralise logs: Send all diagnostic logs to a single Log Analytics workspace
Instrument your apps: Add Application Insights to every service
Alert on symptoms, not causes: Alert on high error rate, not CPU (end-user impact first)
Set up dashboards before incidents: Create runbook dashboards so on-call engineers know where to look
Use distributed tracing: Ensure traceparent headers propagate across service boundaries
Define SLOs: Set explicit targets (e.g., 99.9% of requests under 500 ms) and alert when approaching the error budget
Review and tune alerts: Too many false-positive alerts cause alert fatigue and get ignored
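As a worked example of the SLO item above: with a 99.9% target, 0.1% of requests may be "bad" (failed or slower than 500 ms) before the objective is missed. The sketch below computes how much of that error budget has been consumed. The function names and the 80% alert threshold are illustrative assumptions.

```python
import math


def error_budget(slo_target: float, total_requests: int, bad_requests: int) -> dict:
    """Compute the allowed number of bad requests and the fraction already used."""
    allowed_bad = (1.0 - slo_target) * total_requests
    consumed = bad_requests / allowed_bad if allowed_bad > 0 else math.inf
    return {"allowed_bad": allowed_bad, "consumed": consumed}


def should_alert(budget: dict, burn_threshold: float = 0.8) -> bool:
    """Alert once the consumed share of the budget reaches the threshold."""
    return budget["consumed"] >= burn_threshold


# 1,000,000 requests this period, 400 of them bad, against a 99.9% SLO.
budget = error_budget(0.999, 1_000_000, 400)
print(f"allowed bad requests: {budget['allowed_bad']:.0f}")
print(f"budget consumed: {budget['consumed']:.0%}")
print(f"alert: {should_alert(budget)}")
```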

Cost Management

Track and control your Azure spend:

# List usage details (cost per usage record) for the current month so far
az consumption usage list \
  --start-date $(date +%Y-%m-01) \
  --end-date $(date +%Y-%m-%d) \
  --output table

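# Create a budget alert ($500/month)
# Note: `az consumption` is a preview command group and its flags differ
# between CLI versions. Confirm with `az consumption budget create --help`;
# if --notifications is not accepted, configure the alert in the Portal.
az consumption budget create \
  --budget-name MyBudget \
  --category cost \
  --amount 500 \
  --time-grain monthly \
  --start-date $(date +%Y-%m-01) \
  --end-date 2026-12-31 \
  --notifications '[{
    "enabled": true,
    "operator": "GreaterThan",
    "threshold": 80,
    "contactEmails": ["finance@mycompany.com"]
  }]'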
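The notification threshold is a percentage of the budget amount, so a threshold of 80 on a $500 budget means the notification is sent once spend exceeds $400. A small sketch of that arithmetic, with a hypothetical function name:

```python
def budget_status(amount: float, spend: float, threshold_percent: float = 80) -> dict:
    """Report where the notification triggers and whether current spend exceeds it."""
    trigger_at = amount * threshold_percent / 100
    return {
        "trigger_at": trigger_at,
        "percent_used": spend * 100 / amount,
        "notify": spend > trigger_at,  # mirrors the "GreaterThan" operator
    }


print(budget_status(500, 425))
```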

In the Portal: Cost Management + Billing > Cost Analysis gives a breakdown by service, resource group, tag, or time period.

Next Steps

Continue to 11-architecture-best-practices.md to learn about the Azure Well-Architected Framework, reliability patterns, and production readiness.