Chapter 11: Architecture & Best Practices

Knowing the services is not the same as building a system that survives a busy Tuesday. This chapter pulls the rest of the tutorial together: the Well-Architected Framework, the patterns that come up over and over, and a production-readiness checklist you can run through before launch.

Azure Well-Architected Framework

The Azure Well-Architected Framework (WAF) defines five pillars for evaluating cloud workload quality. Run the Azure Well-Architected Review to get a scored assessment of your architecture.

┌──────────────────────────────────────────────────────────────┐
│           Azure Well-Architected Framework                   │
├──────────────┬──────────────┬──────────────┬────────────────┤
│  Reliability │  Security    │  Cost Optim. │  Operational   │
│              │              │              │  Excellence    │
├──────────────┴──────────────┴──────────────┴────────────────┤
│                   Performance Efficiency                      │
└──────────────────────────────────────────────────────────────┘

Pillar 1: Reliability

Goal: The system recovers from failures and continues to function.

Key Concepts

  • SLA (Service Level Agreement): Microsoft's uptime commitment per service
  • SLO (Service Level Objective): Your internal target (stricter than SLA)
  • RTO (Recovery Time Objective): Maximum acceptable downtime after failure
  • RPO (Recovery Point Objective): Maximum acceptable data loss in time
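
To make these targets concrete, it can help to convert an availability percentage into a downtime budget. The sketch below is illustrative only: the function names are made up for this example, the 30-day month is a simplifying assumption, and the percentages are sample inputs rather than quoted Azure SLAs.

```python
# Sketch: turn availability targets into a monthly downtime budget.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumes a 30-day month)


def downtime_budget_minutes(availability_percent: float) -> float:
    """Allowed downtime per 30-day month for a given availability target."""
    return MINUTES_PER_30_DAY_MONTH * (1 - availability_percent / 100)


def composite_availability(*availability_percents: float) -> float:
    """Availability (in percent) of components in series: all must be up."""
    result = 1.0
    for pct in availability_percents:
        result *= pct / 100
    return result * 100


# Sample inputs only; look up the real SLA for each service you depend on.
print(round(downtime_budget_minutes(99.9), 1))          # 43.2
print(round(composite_availability(99.95, 99.99), 3))   # 99.94
```

Chaining dependencies lowers the composite figure, which is why an SLO cannot be stricter than what the underlying services can jointly deliver unless you add redundancy.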

Reliability Patterns

Multi-region active-passive:

Traffic Manager (DNS failover)
├── Primary: East US (active: serves all traffic)
└── Secondary: West US (passive: hot standby, failover in minutes)

Multi-region active-active:

Azure Front Door (global load balancer + CDN + WAF)
├── East US (App Service + Azure SQL)
└── West Europe (App Service + Azure SQL with geo-replication)
(Both regions serve traffic; Front Door routes to the nearest healthy region)

# Create Traffic Manager for DNS-based failover
az network traffic-manager profile create \
  --resource-group myapp-rg \
  --name myapp-tm \
  --routing-method Priority \
  --unique-dns-name myapp-tm-dns \
  --port 443 \
  --protocol HTTPS \
  --path /health

# Add primary endpoint
az network traffic-manager endpoint create \
  --resource-group myapp-rg \
  --profile-name myapp-tm \
  --type azureEndpoints \
  --name primary \
  --target-resource-id $(az webapp show --resource-group myapp-rg --name myapp-api-eastus --query id -o tsv) \
  --priority 1

# Add secondary endpoint
az network traffic-manager endpoint create \
  --resource-group myapp-rg \
  --profile-name myapp-tm \
  --type azureEndpoints \
  --name secondary \
  --target-resource-id $(az webapp show --resource-group myapp-rg --name myapp-api-westus --query id -o tsv) \
  --priority 2
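
The effect of Priority routing can be modelled in a few lines. This is a simplified sketch of the selection rule, not Traffic Manager's implementation; the `Endpoint` class and `pick_endpoint` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    priority: int  # lower number is preferred
    healthy: bool  # outcome of the latest health probe


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority, default=None)


endpoints = [
    Endpoint("primary", priority=1, healthy=True),
    Endpoint("secondary", priority=2, healthy=True),
]
print(pick_endpoint(endpoints).name)  # primary

endpoints[0].healthy = False  # simulate a failed probe on the primary
print(pick_endpoint(endpoints).name)  # secondary
```

In practice, failover time also depends on the probe interval and the DNS TTL, which this model ignores.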

Health Endpoints

Every service should expose a /health (liveness) and /ready (readiness) endpoint:

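# health.py: FastAPI example
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy import text
from sqlalchemy.orm import Session

from app.database import get_db  # assumed: your project's session dependency

router = APIRouter()

@router.get("/health")
async def health():
    """Liveness: is the process running?"""
    return {"status": "ok"}

# Plain def: FastAPI runs it in a thread pool, so the blocking
# database call does not stall the event loop
@router.get("/ready")
def readiness(db: Session = Depends(get_db)):
    """Readiness: can we serve traffic?"""
    try:
        db.execute(text("SELECT 1"))
        return {"status": "ready"}
    except Exception:
        # Return 503 without leaking connection details to callers
        raise HTTPException(status_code=503, detail="database unavailable")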

Reliability Checklist

  • [ ] Deploy across Availability Zones (zone-redundant services or VMs in multiple zones)
  • [ ] Set up Azure SQL Failover Groups or Cosmos DB multi-region writes
  • [ ] Configure App Service auto-scale (minimum 2 instances in production)
  • [ ] Implement circuit breaker patterns for downstream service failures
  • [ ] Set retry policies with exponential backoff for transient failures
  • [ ] Test failures regularly (chaos engineering)
  • [ ] Define RTO and RPO and verify they are met
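
The circuit breaker and retry items in the checklist above can be sketched as follows. This is a minimal, single-threaded illustration under stated assumptions: the class and function names are hypothetical, the thresholds are arbitrary, and real code would retry only exceptions known to be transient (or use an established resilience library).

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so one trial call goes through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Retry func on exception with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out retry storms
```

A typical combination wraps the downstream call in the breaker and the breaker call in the retry, for example `retry_with_backoff(lambda: breaker.call(fetch_orders))`, where `fetch_orders` stands in for your own function.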

Pillar 2: Security

Goal: Protect against threats and meet compliance requirements.

Zero Trust Model

Traditional perimeter model:  "Trust everything inside the network"
Zero Trust model:             "Never trust, always verify, regardless of location"

Verify explicitly:   Authenticate and authorise every request
Use least privilege: Minimal access, time-limited where possible
Assume breach:       Design as if attackers are already inside

Security Checklist

  • [ ] Enable MFA for all users; use Conditional Access policies
  • [ ] Use Managed Identity: eliminate stored credentials entirely
  • [ ] Store all secrets in Key Vault: use Key Vault references in App Service
  • [ ] Enable Private Endpoints for databases and Key Vault in production
  • [ ] Apply NSGs to all subnets with explicit deny-all + allow rules
  • [ ] Enable Defender for Cloud: review Secure Score weekly
  • [ ] Enable Azure DDoS Protection (Network Protection or IP Protection; formerly "Standard") for internet-facing workloads
  • [ ] Enable soft delete and purge protection on Key Vault
  • [ ] Rotate secrets regularly: use Key Vault expiry notifications
  • [ ] Scan images with Defender for Containers / ACR vulnerability scanning
  • [ ] Enable audit logs: Entra ID sign-in logs, Key Vault access logs, SQL audit

Pillar 3: Cost Optimisation

Goal: Deliver business value at the lowest cost.

Cost Reduction Strategies

Strategy                  Saving Potential   How
Right-size VMs            20–40%             Monitor CPU/memory; use Azure Advisor recommendations
Reserved Instances        Up to 72%          1- or 3-year commitment for predictable workloads
Spot VMs                  Up to 90%          Use for batch jobs, fault-tolerant workloads
Azure Hybrid Benefit      Up to 40%          Apply existing Windows Server / SQL Server licences
Serverless                Variable           Pay-per-use instead of idle capacity
Auto-shutdown             ~65%               Turn off dev/test VMs outside business hours
Storage tiers             30–90%             Move infrequently accessed data to Cool/Archive
Delete unused resources   100%               Regular audits using Azure Advisor

# Auto-shutdown a dev VM at 7pm UTC
az vm auto-shutdown \
  --resource-group myapp-rg \
  --name mydevvm \
  --time 1900 \
  --email devteam@mycompany.com

# View Azure Advisor cost recommendations
az advisor recommendation list \
  --category Cost \
  --output table

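# View month-to-date costs by resource group
# (needs the costmanagement CLI extension; --scope is the subscription to query)
az costmanagement query \
  --type Usage \
  --scope "/subscriptions/$(az account show --query id -o tsv)" \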
  --timeframe MonthToDate \
  --dataset-granularity Daily \
  --dataset-aggregation '{"totalCost": {"name": "Cost", "function": "Sum"}}' \
  --dataset-grouping '[{"type": "Dimension", "name": "ResourceGroup"}]'
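
The roughly 65% auto-shutdown saving quoted in the strategies table can be checked with simple arithmetic. The sketch below assumes a dev VM that runs 12 hours a day on weekdays only; the function name and the schedule are illustrative assumptions, and the saving applies to compute charges, since disks keep billing while a VM is deallocated.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def shutdown_saving_percent(hours_on_per_day: float, days_on_per_week: int) -> float:
    """Share of weekly compute hours avoided by switching the VM off."""
    hours_on = hours_on_per_day * days_on_per_week
    return 100 * (1 - hours_on / HOURS_PER_WEEK)


# 12 hours x 5 weekdays = 60 of 168 hours running
print(round(shutdown_saving_percent(12, 5), 1))  # 64.3
```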

Cost Tagging Strategy

# Enforce cost tracking tags via Azure Policy
# Required tags: environment, project, owner, costcenter
# Mode "Indexed" limits evaluation to resource types that support tags.
# The definition has no effect until it is assigned (az policy assignment create).

az policy definition create \
  --name "require-tags" \
  --display-name "Require tags on resources" \
  --description "Require environment, project, owner, and costcenter tags" \
  --rules '{
    "if": {
      "anyOf": [
        {"field": "tags[environment]", "exists": "false"},
        {"field": "tags[project]", "exists": "false"},
        {"field": "tags[owner]", "exists": "false"},
        {"field": "tags[costcenter]", "exists": "false"}
      ]
    },
    "then": {"effect": "deny"}
  }' \
  --mode Indexed
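
The `anyOf` rule denies a resource as soon as any one required tag is missing. A small local model of that logic can be handy in a CI check before deployment. This is a sketch that mirrors the intent of the policy for the four required tags named in the comment above; it does not call Azure, and the function names are hypothetical.

```python
from typing import Dict, List

REQUIRED_TAGS = ("environment", "project", "owner", "costcenter")


def missing_tags(resource_tags: Dict[str, str]) -> List[str]:
    """Return the required tags that are absent from a resource's tags."""
    return [tag for tag in REQUIRED_TAGS if tag not in resource_tags]


def policy_effect(resource_tags: Dict[str, str]) -> str:
    """Mirror the anyOf/deny rule: deny if any required tag is missing."""
    return "deny" if missing_tags(resource_tags) else "allow"


tags = {"environment": "prod", "project": "myapp", "owner": "teamname"}
print(missing_tags(tags))   # ['costcenter']
print(policy_effect(tags))  # deny
```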

Pillar 4: Operational Excellence

Goal: Operate and monitor systems to deliver business value continuously.

Key Practices

Infrastructure as Code (IaC):

  • All resources defined in Bicep or Terraform (no manual portal changes in production)
  • IaC stored in Git with code review and version history
  • Use terraform plan / az deployment what-if to preview changes

Deployment Strategy:

Strategy     How                                    Risk              Rollback
Blue/Green   Two identical environments, swap DNS   Low               Swap back
Canary       Route small % to new version first     Very low          Reduce traffic %
Rolling      Replace instances one by one           Medium            Redeploy the previous version, instance by instance
Recreate     Stop old, start new                    High (downtime)   Redeploy old version

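App Service deployment slots implement blue/green natively: deploy to a staging slot, warm it up, then swap it into production. For canary releases, Traffic Manager (weighted routing) and Front Door (weighted origins) can send a small share of traffic to the new version.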
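
The canary idea of routing a small percentage to the new version can be sketched in application code. The example below uses deterministic hashing so a given user always lands on the same version; this is one possible approach under that assumption, not how Traffic Manager or Front Door distribute traffic, and `route_version` is a hypothetical name.

```python
import hashlib


def route_version(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the canary, stably per user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# Rolling back is a matter of lowering the percentage to 0
users = [f"user-{i}" for i in range(1000)]
on_canary = sum(route_version(u, 10) == "canary" for u in users)
print(f"{on_canary} of {len(users)} users routed to the canary")
```

Sticky assignment keeps a user's experience consistent during the rollout, at the cost of the split being approximate rather than exact.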

Runbooks: Document operational procedures for:

  • Scaling up/down
  • Responding to alerts
  • Database failover procedures
  • Secret rotation
  • Disaster recovery drills

Pillar 5: Performance Efficiency

Goal: Use resources efficiently to meet system requirements and maintain that efficiency as demand changes.

Caching Strategies

Request flow without caching:
Client → App → Database (slow)

With Redis cache:
Client → App → Redis (fast hit)
                 ↓ miss
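               Database → populate Redis

import json
import os
from functools import wraps

import redis

redis_client = redis.from_url(os.environ["REDIS_URL"])

def cached(ttl_seconds=300):
    """Cache a function's JSON-serialisable return value in Redis."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Sort kwargs so the key does not depend on argument order
            cache_key = f"{func.__name__}:{args!r}:{sorted(kwargs.items())!r}"
            hit = redis_client.get(cache_key)
            if hit is not None:
                return json.loads(hit)
            result = func(*args, **kwargs)
            redis_client.setex(cache_key, ttl_seconds, json.dumps(result))
            return result
        return wrapper
    return decorator

@cached(ttl_seconds=60)
def get_product(product_id: str):
    # db and Product are assumed to be the app's SQLAlchemy session and model
    product = db.query(Product).filter_by(id=product_id).first()
    # Return a plain dict (ORM objects are not JSON-serialisable);
    # the name field is illustrative
    return {"id": product.id, "name": product.name} if product else None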

CDN with Azure Front Door

Serve static assets (JS, CSS, images) from the nearest edge location:

az afd profile create \
  --resource-group myapp-rg \
  --profile-name myapp-afd \
  --sku Standard_AzureFrontDoor

az afd endpoint create \
  --resource-group myapp-rg \
  --profile-name myapp-afd \
  --endpoint-name myapp

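az afd origin-group create \
  --resource-group myapp-rg \
  --profile-name myapp-afd \
  --origin-group-name default-og \
  --probe-request-type GET \
  --probe-protocol Https \
  --probe-path /health \
  --probe-interval-in-seconds 60 \
  --sample-size 4 \
  --successful-samples-required 3 \
  --additional-latency-in-milliseconds 50
# Still needed before traffic flows: az afd origin create, then az afd route create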

Reference Architectures

Web Application (Standard)

Internet
    │
Azure Front Door (CDN + WAF + global LB)
    │
App Service Plan (Premium v3, 2+ instances, zone-redundant)
    │               │
Azure SQL DB    Azure Cache for Redis
(Business Crit.)   (session + query cache)
    │
Blob Storage (static assets, user uploads)
    │
Key Vault (secrets)
    │
Application Insights → Log Analytics Workspace

Microservices on AKS

Internet
    │
Azure Application Gateway + WAF
    │
AKS Cluster (across 3 AZs)
├── Ingress (nginx / AGIC)
├── Service A (3 replicas)  ─► Azure SQL
├── Service B (3 replicas)  ─► Cosmos DB
├── Service C (3 replicas)  ─► Redis Cache
│                            ─► Service Bus
└── Shared sidecar (Dapr / Envoy)
    │
Azure Container Registry (image source)
Key Vault (CSI driver for secrets)
Azure Monitor + Container Insights

Serverless Event Processing

IoT Devices / Web Clients
    │
Event Hubs (ingestion: millions of events/sec)
    │
Azure Functions (stream processing, triggered by Event Hubs)
    │
Cosmos DB (write processed results)
    │
Power BI / Azure Synapse (analytics and reporting)

Production Readiness Checklist

Before going live, validate these items:

Reliability:

  • [ ] Zone-redundant or multi-region deployment
  • [ ] Auto-scale configured with appropriate min/max
  • [ ] Health endpoints exposed and monitored
  • [ ] Backup and restore tested
  • [ ] Disaster recovery runbook documented and tested

Security:

  • [ ] MFA enabled for all accounts
  • [ ] Managed Identity used (no credentials in code)
  • [ ] Key Vault in use with RBAC access model
  • [ ] NSGs configured (deny-all default)
  • [ ] Private Endpoints for databases and Key Vault
  • [ ] Defender for Cloud enabled
  • [ ] SSL/TLS enforced (HTTPS only)

Monitoring:

  • [ ] Application Insights instrumented
  • [ ] Diagnostic logs routed to Log Analytics
  • [ ] Metric and log alerts configured
  • [ ] On-call runbook linked from alerts
  • [ ] Dashboard created for key service KPIs

Cost:

  • [ ] Budget alerts configured
  • [ ] Dev/test VMs auto-shutdown enabled
  • [ ] Reserved instances purchased for stable workloads
  • [ ] Resource tagging policy enforced

Operations:

  • [ ] All infrastructure defined in Bicep/Terraform (no manual resources)
  • [ ] CI/CD pipeline deploys to staging before production
  • [ ] Deployment uses blue/green or canary strategy
  • [ ] Rollback procedure documented and tested

Azure Naming & Tagging Convention (Template)

# Resource groups
{project}-{env}-rg                # myapp-prod-rg

# Compute
{project}-{component}-{env}-vm-{n}   # myapp-web-prod-vm-01
{project}-{env}-aks                  # myapp-prod-aks
{project}-{env}-plan                 # myapp-prod-plan
{project}-{env}-{service}            # myapp-prod-api

# Storage
{project}{env}st{n}              # myappprodst01  (no hyphens, lowercase)

# Networking
{project}-{env}-vnet             # myapp-prod-vnet
{project}-{env}-{tier}-snet      # myapp-prod-web-snet
{project}-{env}-{tier}-nsg       # myapp-prod-web-nsg

# Security
{project}-{env}-kv               # myapp-prod-kv

# Required tags for every resource
environment: prod | staging | dev | test
project:     myapp
owner:       teamname
costcenter:  12345
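
A convention is easier to follow when names are generated rather than typed by hand. The sketch below builds two of the names from the template and checks the storage account rule (3 to 24 characters, lowercase letters and digits only). The function names are hypothetical and only these two patterns are covered.

```python
import re


def resource_group_name(project: str, env: str) -> str:
    """{project}-{env}-rg"""
    return f"{project}-{env}-rg"


def storage_account_name(project: str, env: str, n: int) -> str:
    """{project}{env}st{n}: no hyphens, lowercase, 3 to 24 characters."""
    name = f"{project}{env}st{n:02d}".lower()
    if not re.fullmatch(r"[a-z0-9]{3,24}", name):
        raise ValueError(f"invalid storage account name: {name!r}")
    return name


print(resource_group_name("myapp", "prod"))      # myapp-prod-rg
print(storage_account_name("myapp", "prod", 1))  # myappprodst01
```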

Where to Go From Here

You now have the vocabulary and the patterns. The next step is to build something you would not be embarrassed to put in production. Pick one architecture from the section above, deploy it, then deliberately break it: fail over a region, kill a pod, rotate a secret without warning the app. The lessons stick when something behind you is on fire.

A few directions worth your time:

Topic                        Resource
AZ-900 (Fundamentals)        Microsoft Learn path
AZ-104 (Administrator)       Microsoft Learn path
AZ-204 (Developer)           Microsoft Learn path
AZ-305 (Architect)           Microsoft Learn path
Well-Architected Framework   Official docs
Architecture Center          Reference architectures
Free hands-on labs           Microsoft Learn
Azure pricing                Pricing calculator