Chapter 11: Architecture & Best Practices

Building on Azure is not just about knowing the services. It's about combining them in ways that are reliable, secure, cost-efficient, and operationally manageable. This chapter covers the Azure Well-Architected Framework and practical production patterns.

Azure Well-Architected Framework

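The Azure Well-Architected Framework defines five pillars for evaluating cloud workload quality. (The abbreviation WAF in the diagrams later in this chapter refers to Web Application Firewall, not to the framework.) Run the Azure Well-Architected Review to get a scored assessment of your architecture.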

┌──────────────────────────────────────────────────────────────┐
│           Azure Well-Architected Framework                   │
├──────────────┬──────────────┬──────────────┬────────────────┤
│  Reliability │  Security    │  Cost Optim. │  Operational   │
│              │              │              │  Excellence    │
├──────────────┴──────────────┴──────────────┴────────────────┤
│                   Performance Efficiency                      │
└──────────────────────────────────────────────────────────────┘

Pillar 1: Reliability

Goal: The system recovers from failures and continues to function.

Key Concepts

  • SLA (Service Level Agreement): Microsoft's uptime commitment per service
  • SLO (Service Level Objective): Your internal target (stricter than SLA)
  • RTO (Recovery Time Objective): Maximum acceptable downtime after failure
  • RPO (Recovery Point Objective): Maximum acceptable data loss in time
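To make these targets concrete, the sketch below converts an availability percentage into the downtime it permits per month, and multiplies the availabilities of services that must all be up for a request to succeed. The 30-day month and the two-service chain are illustrative assumptions, not figures from any specific Azure SLA.

```python
# Worked example: availability percentages translated into allowed downtime.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month (assumption)


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per 30-day month permitted at a given availability."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)


def composite_availability(*availabilities_pct: float) -> float:
    """Availability of a request path that needs every listed service to be up."""
    result = 1.0
    for pct in availabilities_pct:
        result *= pct / 100
    return result * 100


if __name__ == "__main__":
    for pct in (99.9, 99.95, 99.99):
        print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} min/month")
    # Hypothetical chain: a web tier at 99.95% in front of a database at 99.99%
    print(f"composite: {composite_availability(99.95, 99.99):.3f}%")
```

A 99.9% target allows roughly 43 minutes of downtime per month, and a chain of dependent services is always less available than its weakest member. This is why an SLO should be set against the composite figure rather than against any single service's SLA.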

Reliability Patterns

Multi-region active-passive:

Traffic Manager (DNS failover)
├── Primary: East US (active: serves all traffic)
└── Secondary: West US (passive: hot standby, failover in minutes)

Multi-region active-active:

Azure Front Door (global load balancer + CDN + WAF)
├── East US (App Service + Azure SQL)
└── West Europe (App Service + Azure SQL with geo-replication)
(Both regions serve traffic; Front Door routes to the nearest healthy region)

# Create Traffic Manager for DNS-based failover
az network traffic-manager profile create \
  --resource-group myapp-rg \
  --name myapp-tm \
  --routing-method Priority \
  --unique-dns-name myapp-tm-dns \
  --monitor-port 443 \
  --monitor-protocol HTTPS \
  --monitor-path /health

# Add primary endpoint
az network traffic-manager endpoint create \
  --resource-group myapp-rg \
  --profile-name myapp-tm \
  --type azureEndpoints \
  --name primary \
  --target-resource-id $(az webapp show --resource-group myapp-rg --name myapp-api-eastus --query id -o tsv) \
  --priority 1

# Add secondary endpoint
az network traffic-manager endpoint create \
  --resource-group myapp-rg \
  --profile-name myapp-tm \
  --type azureEndpoints \
  --name secondary \
  --target-resource-id $(az webapp show --resource-group myapp-rg --name myapp-api-westus --query id -o tsv) \
  --priority 2
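The Priority routing method configured above can be pictured as a small selection rule: among the endpoints whose health probe is passing, answer with the one that has the lowest priority number. The following is a simplified model for illustration only; the `Endpoint` class and `resolve` function are hypothetical names, and the sketch ignores DNS TTLs, probe intervals, and what the real service does when every endpoint is unhealthy.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int   # lower number is preferred, as with --priority above
    healthy: bool   # outcome of the latest health probe


def resolve(endpoints: Sequence[Endpoint]) -> Optional[str]:
    """Return the name of the healthy endpoint with the lowest priority number."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        return None  # simplification: the all-unhealthy case is not modelled
    return min(candidates, key=lambda e: e.priority).name


if __name__ == "__main__":
    normal = [Endpoint("primary", 1, True), Endpoint("secondary", 2, True)]
    outage = [Endpoint("primary", 1, False), Endpoint("secondary", 2, True)]
    print(resolve(normal))  # primary serves all traffic
    print(resolve(outage))  # traffic fails over to secondary
```

Because the decision is made at DNS resolution time, clients keep using a cached answer until its TTL expires, which is one reason failover takes minutes rather than seconds.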

Health Endpoints

Every service should expose a /health (liveness) and /ready (readiness) endpoint:

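# health.py: FastAPI example
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy import text
from sqlalchemy.orm import Session

# get_db is the application's own session dependency (assumed to live in database.py)
from database import get_db

router = APIRouter()

@router.get("/health")
async def health():
    """Liveness: is the process running?"""
    return {"status": "ok"}

@router.get("/ready")
def readiness(db: Session = Depends(get_db)):
    """Readiness: can we serve traffic?"""
    # Plain def: FastAPI runs it in a threadpool, so the blocking
    # database call does not stall the event loop.
    try:
        db.execute(text("SELECT 1"))
        return {"status": "ready"}
    except Exception:
        # Generic message so internal error details are not exposed to callers
        raise HTTPException(status_code=503, detail="database unavailable")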

Reliability Checklist

  • [ ] Deploy across Availability Zones (zone-redundant services or VMs in multiple zones)
  • [ ] Set up Azure SQL Failover Groups or Cosmos DB multi-region writes
  • [ ] Configure App Service auto-scale (minimum 2 instances in production)
  • [ ] Implement circuit breaker patterns for downstream service failures
  • [ ] Set retry policies with exponential backoff for transient failures
  • [ ] Test failures regularly (chaos engineering)
  • [ ] Define RTO and RPO and verify they are met
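Two items in this checklist, retries with exponential backoff and circuit breakers, are easier to reason about with code in front of you. The sketch below is a minimal, dependency-free illustration; the function and class names, default thresholds, and the choice of "full jitter" are assumptions made for the example, and production code would normally rely on the retry features of the Azure SDKs or a maintained resilience library.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying transient errors with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt and is capped at max_delay
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))  # full jitter spreads out retry storms


class CircuitBreaker:
    """Fail fast after repeated failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: downstream call skipped")
            # Cool-down elapsed: half-open, so one trial call is let through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

Retries handle brief, transient faults; the circuit breaker covers the opposite case, where a dependency is down for long enough that continued retries would only add load and latency.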

Pillar 2: Security

Goal: Protect against threats and meet compliance requirements.

Zero Trust Model

Traditional perimeter model:  "Trust everything inside the network"
Zero Trust model:             "Never trust, always verify, regardless of location"

Verify explicitly:   Authenticate and authorise every request
Use least privilege: Minimal access, time-limited where possible
Assume breach:       Design as if attackers are already inside

Security Checklist

  • [ ] Enable MFA for all users; use Conditional Access policies
  • [ ] Use Managed Identity: eliminate stored credentials entirely
  • [ ] Store all secrets in Key Vault: use Key Vault references in App Service
  • [ ] Enable Private Endpoints for databases and Key Vault in production
  • [ ] Apply NSGs to all subnets with explicit deny-all + allow rules
  • [ ] Enable Defender for Cloud: review Secure Score weekly
  • [ ] Enable Azure DDoS Protection (the Network Protection or IP Protection tier; formerly DDoS Protection Standard) for internet-facing workloads
  • [ ] Enable soft delete and purge protection on Key Vault
  • [ ] Rotate secrets regularly: use Key Vault expiry notifications
  • [ ] Scan images with Defender for Containers / ACR vulnerability scanning
  • [ ] Enable audit logs: Entra ID sign-in logs, Key Vault access logs, SQL audit
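As an illustration of the Managed Identity and Key Vault items above, the following sketch reads a secret without any stored credential. It assumes the `azure-identity` and `azure-keyvault-secrets` packages are installed, that the vault lives in the public Azure cloud, and that the calling identity has been granted permission to read secrets. The helper names and the example vault and secret names are hypothetical.

```python
def vault_url(vault_name: str) -> str:
    """Build the Key Vault URL for a vault name (public Azure cloud assumed)."""
    return f"https://{vault_name}.vault.azure.net"


def get_secret(secret_name: str, vault_name: str) -> str:
    """Read a secret using the ambient identity instead of a stored credential."""
    # Imported lazily so this module still loads where the Azure SDK is absent.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    # DefaultAzureCredential picks up the Managed Identity when running in Azure
    # and falls back to developer credentials (for example az login) locally.
    client = SecretClient(vault_url=vault_url(vault_name),
                          credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value


# Example call (placeholder names):
#   password = get_secret("db-password", "myapp-prod-kv")
```

No connection string, client secret, or key appears in code or configuration, so there is nothing to leak and nothing to rotate on the application side.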

Pillar 3: Cost Optimisation

Goal: Deliver business value at the lowest cost.

Cost Reduction Strategies

Strategy                  Saving potential   How
Right-size VMs            20–40%             Monitor CPU/memory; use Azure Advisor recommendations
Reserved Instances        Up to 72%          1- or 3-year commitment for predictable workloads
Spot VMs                  Up to 90%          Use for batch jobs, fault-tolerant workloads
Azure Hybrid Benefit      Up to 40%          Apply existing Windows Server / SQL Server licences
Serverless                Variable           Pay-per-use instead of idle capacity
Auto-shutdown             ~65%               Turn off dev/test VMs outside business hours
Storage tiers             30–90%             Move infrequently accessed data to Cool/Archive
Delete unused resources   100%               Regular audits using Azure Advisor

# Auto-shutdown a dev VM at 7pm UTC
az vm auto-shutdown \
  --resource-group myapp-rg \
  --name mydevvm \
  --time 1900 \
  --email devteam@mycompany.com

# View Azure Advisor cost recommendations
az advisor recommendation list \
  --category Cost \
  --output table

# View month-to-date costs by resource group, one row per day
# (needs the costmanagement CLI extension; --scope is required)
az costmanagement query \
  --type Usage \
  --scope "subscriptions/<subscription-id>" \
  --timeframe MonthToDate \
  --dataset-granularity Daily \
  --dataset-aggregation '{"totalCost": {"name": "Cost", "function": "Sum"}}' \
  --dataset-grouping name=ResourceGroup type=Dimension
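The auto-shutdown figure in the table above can be checked with simple arithmetic. The sketch below assumes a dev VM that runs 12 hours a day on 5 days a week; the hourly rate and the 730-hour month are illustrative values, not Azure prices.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def shutdown_saving(hours_on_per_day: float, days_on_per_week: int) -> float:
    """Fraction of compute cost avoided by running only during the given window."""
    hours_on = hours_on_per_day * days_on_per_week
    return 1 - hours_on / HOURS_PER_WEEK


def monthly_cost(hourly_rate: float, saving: float = 0.0, hours_in_month: int = 730) -> float:
    """Approximate monthly compute cost after applying a fractional saving."""
    return hourly_rate * hours_in_month * (1 - saving)


if __name__ == "__main__":
    saving = shutdown_saving(hours_on_per_day=12, days_on_per_week=5)
    print(f"saving: {saving:.1%}")  # 60 of 168 hours running
    # Hypothetical rate of 0.20 per hour
    print(f"always on: {monthly_cost(0.20):.2f}")
    print(f"with shutdown: {monthly_cost(0.20, saving):.2f}")
```

Running 60 of 168 hours avoids about 64% of the compute charge, which matches the rough 65% in the table. The calculation covers compute only: disks attached to a stopped VM continue to be billed, and the scheduled shutdown does not start the VM again in the morning.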

Cost Tagging Strategy

# Enforce cost tracking tags via Azure Policy
# Required tags: environment, project, owner, costcenter
# Indexed mode evaluates only resource types that support tags.
# A definition has no effect until it is assigned to a scope
# (az policy assignment create).

az policy definition create \
  --name "require-tags" \
  --display-name "Require tags on resources" \
  --description "Require environment, project, owner, and costcenter tags" \
  --rules '{
    "if": {
      "anyOf": [
        {"field": "tags[environment]", "exists": "false"},
        {"field": "tags[project]", "exists": "false"},
        {"field": "tags[owner]", "exists": "false"},
        {"field": "tags[costcenter]", "exists": "false"}
      ]
    },
    "then": {"effect": "deny"}
  }' \
  --mode Indexed
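The same rule can be checked earlier, for example in a CI step that inspects planned resources before deployment. The sketch below mirrors the four required tags named in the comment above; the function name and the extra check on allowed `environment` values are assumptions added for illustration and are not part of the policy itself.

```python
REQUIRED_TAGS = ("environment", "project", "owner", "costcenter")
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "test"}


def tag_problems(tags: dict) -> list:
    """Return human-readable problems; an empty list means the tags pass."""
    # A tag that is absent or empty counts as missing
    problems = [f"missing tag: {name}" for name in REQUIRED_TAGS if not tags.get(name)]
    env = tags.get("environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid environment: {env}")
    return problems


if __name__ == "__main__":
    good = {"environment": "prod", "project": "myapp",
            "owner": "teamname", "costcenter": "12345"}
    bad = {"environment": "qa", "project": "myapp", "owner": "teamname"}
    print(tag_problems(good))
    print(tag_problems(bad))
```

Catching a missing tag in the pipeline gives the author a clear message at review time, while the deny policy remains the backstop for anything created outside the pipeline.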

Pillar 4: Operational Excellence

Goal: Operate and monitor systems to deliver business value continuously.

Key Practices

Infrastructure as Code (IaC):

  • All resources defined in Bicep or Terraform (no manual portal changes in production)
  • IaC stored in Git with code review and version history
  • Use terraform plan or az deployment group what-if to preview changes before applying them

Deployment Strategy:

Strategy     How                                    Risk              Rollback
Blue/Green   Two identical environments, swap DNS   Low               Swap back
Canary       Route small % to new version first     Very low          Reduce traffic %
Rolling      Replace instances one by one           Medium            Roll back instance by instance
Recreate     Stop old, start new                    High (downtime)   Redeploy old version

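App Service deployment slots implement blue/green natively: deploy to a staging slot, then swap it with production. Weighted routing in Azure Traffic Manager or Front Door enables canary deployments by sending a small share of traffic to the new version.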
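Canary routing comes down to sending a fixed share of traffic to the new version. The sketch below shows one common application-level way to do that, hashing a user ID into one of 100 buckets so that each user consistently sees the same version. This is an illustrative technique with hypothetical function names, not how Traffic Manager or Front Door implement weighting internally.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the new version, consistently per user."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(choose_version(u, 10) == "canary" for u in users) / len(users)
    print(f"share routed to canary: {share:.1%}")
```

Raising `canary_percent` in steps widens the rollout, and setting it back to 0 is the rollback. Hash-based bucketing keeps a user on one version across requests, which DNS-level weighting does not guarantee.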

Runbooks: Document operational procedures for:

  • Scaling up/down
  • Responding to alerts
  • Database failover procedures
  • Secret rotation
  • Disaster recovery drills

Pillar 5: Performance Efficiency

Goal: Use resources efficiently to meet system requirements and maintain that efficiency as demand changes.

Caching Strategies

Request flow without caching:
Client → App → Database (slow)

With Redis cache:
Client → App → Redis (fast hit)
                 ↓ miss
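               Database → populate Redis

import json
import os
from functools import wraps

import redis

redis_client = redis.from_url(os.environ["REDIS_URL"])

def cached(ttl_seconds=300):
    """Cache a function's JSON-serialisable return value in Redis."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Sort kwargs so the key does not depend on argument order
            cache_key = f"{func.__name__}:{args}:{sorted(kwargs.items())}"
            hit = redis_client.get(cache_key)
            if hit is not None:
                return json.loads(hit)
            result = func(*args, **kwargs)
            redis_client.setex(cache_key, ttl_seconds, json.dumps(result))
            return result
        return wrapper
    return decorator

@cached(ttl_seconds=60)
def get_product(product_id: str):
    # db and Product come from the application's SQLAlchemy setup (not shown).
    # Return a plain dict: ORM objects are not JSON-serialisable.
    # The name attribute is an assumed column, used here for illustration.
    product = db.query(Product).filter_by(id=product_id).first()
    if product is None:
        return None
    return {"id": product.id, "name": product.name}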

CDN with Azure Front Door

Serve static assets (JS, CSS, images) from the nearest edge location:

az afd profile create \
  --resource-group myapp-rg \
  --profile-name myapp-afd \
  --sku Standard_AzureFrontDoor

az afd endpoint create \
  --resource-group myapp-rg \
  --profile-name myapp-afd \
  --endpoint-name myapp

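az afd origin-group create \
  --resource-group myapp-rg \
  --profile-name myapp-afd \
  --origin-group-name default-og \
  --probe-path /health \
  --probe-protocol Https \
  --probe-request-type GET \
  --probe-interval-in-seconds 60 \
  --sample-size 4 \
  --successful-samples-required 3 \
  --additional-latency-in-milliseconds 50

# Still needed before traffic flows: an origin (az afd origin create)
# and a route with caching enabled (az afd route create)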

Reference Architectures

Web Application (Standard)

Internet
    │
Azure Front Door (CDN + WAF + global LB)
    │
App Service Plan (Premium v3, 2+ instances, across AZs; zone redundancy is not available on Standard)
    │               │
Azure SQL DB    Azure Cache for Redis
(Business Crit.)   (session + query cache)
    │
Blob Storage (static assets, user uploads)
    │
Key Vault (secrets)
    │
Application Insights -> Log Analytics Workspace

Microservices on AKS

Internet
    │
Azure Application Gateway + WAF
    │
AKS Cluster (across 3 AZs)
├── Ingress (nginx / AGIC)
├── Service A (3 replicas)  ─► Azure SQL
├── Service B (3 replicas)  ─► Cosmos DB
├── Service C (3 replicas)  ─► Redis Cache
│                            ─► Service Bus
└── Shared sidecar (Dapr / Envoy)
    │
Azure Container Registry (image source)
Key Vault (CSI driver for secrets)
Azure Monitor + Container Insights

Serverless Event Processing

IoT Devices / Web Clients
    │
Event Hubs (ingestion: millions of events/sec)
    │
Azure Functions (stream processing, triggered by Event Hubs)
    │
Cosmos DB (write processed results)
    │
Power BI / Azure Synapse (analytics and reporting)
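The processing step in the middle of this pipeline can be sketched as a pure function that a Functions trigger would call with each batch of event bodies. Everything specific here is an assumption made for illustration: the event shape (`device_id`, `temperature`), the function name, and the layout of the output documents. The trigger binding and the Cosmos DB write are deliberately left out.

```python
import json
from collections import defaultdict


def summarise_batch(raw_events):
    """Parse a batch of JSON event bodies and average the reading per device."""
    totals = defaultdict(lambda: [0.0, 0])  # device_id -> [sum, count]
    skipped = 0
    for raw in raw_events:
        try:
            event = json.loads(raw)
            device_id = event["device_id"]
            value = float(event["temperature"])
        except (ValueError, KeyError, TypeError):
            skipped += 1  # malformed events are counted, not allowed to fail the batch
            continue
        totals[device_id][0] += value
        totals[device_id][1] += 1
    # One document per device, in a shape that could be upserted to a document store
    documents = [
        {"id": device_id, "avg_temperature": total / count, "count": count}
        for device_id, (total, count) in totals.items()
    ]
    return documents, skipped


if __name__ == "__main__":
    batch = [
        '{"device_id": "a", "temperature": 20}',
        '{"device_id": "a", "temperature": 22}',
        "not json",
    ]
    print(summarise_batch(batch))
```

Keeping the logic separate from the trigger makes it testable without Azure, and skipping malformed events rather than raising stops one bad message from blocking the rest of the batch.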

Production Readiness Checklist

Before going live, validate these items:

Reliability:

  • [ ] Zone-redundant or multi-region deployment
  • [ ] Auto-scale configured with appropriate min/max
  • [ ] Health endpoints exposed and monitored
  • [ ] Backup and restore tested
  • [ ] Disaster recovery runbook documented and tested

Security:

  • [ ] MFA enabled for all accounts
  • [ ] Managed Identity used (no credentials in code)
  • [ ] Key Vault in use with RBAC access model
  • [ ] NSGs configured (deny-all default)
  • [ ] Private Endpoints for databases and Key Vault
  • [ ] Defender for Cloud enabled
  • [ ] SSL/TLS enforced (HTTPS only)

Monitoring:

  • [ ] Application Insights instrumented
  • [ ] Diagnostic logs routed to Log Analytics
  • [ ] Metric and log alerts configured
  • [ ] On-call runbook linked from alerts
  • [ ] Dashboard created for key service KPIs

Cost:

  • [ ] Budget alerts configured
  • [ ] Dev/test VMs auto-shutdown enabled
  • [ ] Reserved instances purchased for stable workloads
  • [ ] Resource tagging policy enforced

Operations:

  • [ ] All infrastructure defined in Bicep/Terraform (no manual resources)
  • [ ] CI/CD pipeline deploys to staging before production
  • [ ] Deployment uses blue/green or canary strategy
  • [ ] Rollback procedure documented and tested

Azure Naming & Tagging Convention (Template)

# Resource groups
{project}-{env}-rg                # myapp-prod-rg

# Compute
{project}-{component}-{env}-vm-{n}   # myapp-web-prod-vm-01
{project}-{env}-aks                  # myapp-prod-aks
{project}-{env}-plan                 # myapp-prod-plan
{project}-{env}-{service}            # myapp-prod-api

# Storage
{project}{env}st{n}              # myappprodst01  (no hyphens, lowercase)

# Networking
{project}-{env}-vnet             # myapp-prod-vnet
{project}-{env}-{tier}-snet      # myapp-prod-web-snet
{project}-{env}-{tier}-nsg       # myapp-prod-web-nsg

# Security
{project}-{env}-kv               # myapp-prod-kv

# Required tags for every resource
environment: prod | staging | dev | test
project:     myapp
owner:       teamname
costcenter:  12345
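A convention like this is easier to keep when names are generated rather than typed by hand. The helpers below are a hypothetical sketch of the templates above; the function names are invented for the example. The storage account check reflects Azure's rule that such names are 3 to 24 characters long and contain only lowercase letters and digits.

```python
import re

# Storage account names: 3-24 characters, lowercase letters and digits only
STORAGE_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def resource_group(project: str, env: str) -> str:
    return f"{project}-{env}-rg"


def vm_name(project: str, component: str, env: str, n: int) -> str:
    return f"{project}-{component}-{env}-vm-{n:02d}"


def storage_account(project: str, env: str, n: int) -> str:
    """Build a storage account name and reject anything Azure would refuse."""
    name = f"{project}{env}st{n:02d}".lower()
    if not STORAGE_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid storage account name: {name!r}")
    return name


if __name__ == "__main__":
    print(resource_group("myapp", "prod"))
    print(vm_name("myapp", "web", "prod", 1))
    print(storage_account("myapp", "prod", 1))
```

Failing fast on an invalid name is cheaper than discovering it halfway through a deployment, and generated names keep the resource group, compute, and storage conventions consistent across environments.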

Further Learning

Topic                        Resource
AZ-900 (Fundamentals)        Microsoft Learn path
AZ-104 (Administrator)       Microsoft Learn path
AZ-204 (Developer)           Microsoft Learn path
AZ-305 (Architect)           Microsoft Learn path
Well-Architected Framework   Official docs
Architecture Center          Reference architectures
Free hands-on labs           Microsoft Learn
Azure pricing                Pricing calculator