Chapter 11: Architecture & Best Practices
Building on Azure is not just about knowing the services. It's about combining them in ways that are reliable, secure, cost-efficient, and operationally manageable. This chapter covers the Azure Well-Architected Framework and practical production patterns.
Azure Well-Architected Framework
The Azure Well-Architected Framework (WAF) defines five pillars for evaluating cloud workload quality. Run the Azure Well-Architected Review to get a scored assessment of your architecture.
┌─────────────────────────────────────────────────────────────┐
│              Azure Well-Architected Framework               │
├──────────────┬──────────────┬──────────────┬────────────────┤
│ Reliability  │ Security     │ Cost Optim.  │ Operational    │
│              │              │              │ Excellence     │
├──────────────┴──────────────┴──────────────┴────────────────┤
│                   Performance Efficiency                    │
└─────────────────────────────────────────────────────────────┘
Pillar 1: Reliability
Goal: The system recovers from failures and continues to function.
Key Concepts
- SLA (Service Level Agreement): Microsoft's uptime commitment per service
- SLO (Service Level Objective): Your internal target (stricter than SLA)
- RTO (Recovery Time Objective): Maximum acceptable downtime after failure
- RPO (Recovery Point Objective): Maximum acceptable data loss in time
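These targets are easier to reason about as numbers. The following is a minimal sketch, not an official calculator: it converts an availability percentage into a monthly downtime budget, multiplies the availabilities of components that must all be up at once, and checks a drill result against RTO and RPO. The function names and the sample percentages are illustrative assumptions, not published Azure SLA figures.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Downtime budget implied by an availability target over a period."""
    return period_minutes * (1 - availability_pct / 100)


def composite_availability(*availabilities_pct: float) -> float:
    """Availability of a chain of components that must all be up at once."""
    result = 1.0
    for pct in availabilities_pct:
        result *= pct / 100
    return result * 100


def meets_objectives(outage_minutes: float, data_loss_minutes: float,
                     rto_minutes: float, rpo_minutes: float) -> bool:
    """True if a failover drill stayed within both RTO and RPO."""
    return outage_minutes <= rto_minutes and data_loss_minutes <= rpo_minutes


if __name__ == "__main__":
    # Illustrative figures only; look up the real SLA for each service you use.
    print(round(allowed_downtime_minutes(99.95), 1))        # 21.6
    print(round(composite_availability(99.95, 99.99), 4))   # 99.94
    print(meets_objectives(12, 3, rto_minutes=15, rpo_minutes=5))  # True
```

Note that chaining two components lowers the combined figure below either one on its own, which is why an SLO has to account for every dependency in the request path.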
Reliability Patterns
Multi-region active-passive:
Traffic Manager (DNS failover)
├── Primary: East US (active: serves all traffic)
└── Secondary: West US (passive: hot standby, failover in minutes)
Multi-region active-active:
Azure Front Door (global load balancer + CDN + WAF)
├── East US (App Service + Azure SQL)
└── West Europe (App Service + Azure SQL with geo-replication)
(Both regions serve traffic; Front Door routes to the nearest healthy region)
# Create Traffic Manager for DNS-based failover
az network traffic-manager profile create \
--resource-group myapp-rg \
--name myapp-tm \
--routing-method Priority \
--unique-dns-name myapp-tm-dns \
--port 443 \
--protocol HTTPS \
--path /health
# Add primary endpoint
az network traffic-manager endpoint create \
--resource-group myapp-rg \
--profile-name myapp-tm \
--type azureEndpoints \
--name primary \
--target-resource-id $(az webapp show --resource-group myapp-rg --name myapp-api-eastus --query id -o tsv) \
--priority 1
# Add secondary endpoint
az network traffic-manager endpoint create \
--resource-group myapp-rg \
--profile-name myapp-tm \
--type azureEndpoints \
--name secondary \
--target-resource-id $(az webapp show --resource-group myapp-rg --name myapp-api-westus --query id -o tsv) \
--priority 2
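The Priority routing method configured above can be pictured as "answer with the healthy endpoint that has the lowest priority number". The sketch below models only that selection rule. The `Endpoint` class and `pick_endpoint` function are hypothetical names for illustration, and the sketch ignores DNS TTLs, probe intervals, and whatever the real service does when every endpoint is unhealthy.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int   # lower number = preferred
    healthy: bool   # outcome of the most recent health probe


def pick_endpoint(endpoints: list[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, if any."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None  # simplification: this sketch gives no answer at all
    return min(healthy, key=lambda e: e.priority)


if __name__ == "__main__":
    normal = [Endpoint("primary", 1, True), Endpoint("secondary", 2, True)]
    outage = [Endpoint("primary", 1, False), Endpoint("secondary", 2, True)]
    print(pick_endpoint(normal).name)   # primary
    print(pick_endpoint(outage).name)   # secondary
```

Because failover happens at the DNS layer, clients that have cached the old answer keep using it until the record's TTL expires, so the TTL is part of your effective RTO.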
Health Endpoints
Every service should expose a /health (liveness) and /ready (readiness) endpoint:
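# health.py: FastAPI example
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy import text
from sqlalchemy.orm import Session

from app.db import get_db  # assumption: your project's session dependency; adjust the import path

router = APIRouter()


@router.get("/health")
async def health():
    """Liveness: is the process running?"""
    return {"status": "ok"}


@router.get("/ready")
def readiness(db: Session = Depends(get_db)):
    """Readiness: can we serve traffic?"""
    # Plain def: FastAPI runs it in a thread pool, so the blocking
    # database call does not stall the event loop.
    try:
        db.execute(text("SELECT 1"))
        return {"status": "ready"}
    except Exception as e:
        # Keep the response generic; the exception text may expose connection details.
        raise HTTPException(status_code=503, detail="database unavailable") from e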
Reliability Checklist
- [ ] Deploy across Availability Zones (zone-redundant services or VMs in multiple zones)
- [ ] Set up Azure SQL Failover Groups or Cosmos DB multi-region writes
- [ ] Configure App Service auto-scale (minimum 2 instances in production)
- [ ] Implement circuit breaker patterns for downstream service failures
- [ ] Set retry policies with exponential backoff for transient failures
- [ ] Test failures regularly (chaos engineering)
- [ ] Define RTO and RPO and verify they are met
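Two checklist items, retries with exponential backoff and circuit breakers, are simple enough to sketch in a few lines. The following is a minimal illustration under stated assumptions, not a production library: `TransientError`, `retry_with_backoff`, and `CircuitBreaker` are hypothetical names, the thresholds are arbitrary, and the breaker is not thread-safe. Many Azure SDK clients ship their own retry policies, so check those before adding another layer.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call func, retrying TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Ceiling doubles each attempt; a random delay below it spreads out retries.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one call through as a trial.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

A typical composition wraps the retry inside the breaker, for example `breaker.call(lambda: retry_with_backoff(fetch))`, so that a dependency that stays down stops consuming retry budget.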
Pillar 2: Security
Goal: Protect against threats and meet compliance requirements.
Zero Trust Model
Traditional perimeter model: "Trust everything inside the network"
Zero Trust model: "Never trust, always verify, regardless of location"
Verify explicitly: Authenticate and authorise every request
Use least privilege: Minimal access, time-limited where possible
Assume breach: Design as if attackers are already inside
Security Checklist
- [ ] Enable MFA for all users; use Conditional Access policies
- [ ] Use Managed Identity: eliminate stored credentials entirely
- [ ] Store all secrets in Key Vault: use Key Vault references in App Service
- [ ] Enable Private Endpoints for databases and Key Vault in production
- [ ] Apply NSGs to all subnets with explicit deny-all + allow rules
- [ ] Enable Defender for Cloud: review Secure Score weekly
- [ ] Enable Azure DDoS Protection (Network Protection or IP Protection; formerly "Standard") for internet-facing workloads
- [ ] Enable soft delete and purge protection on Key Vault
- [ ] Rotate secrets regularly: use Key Vault expiry notifications
- [ ] Scan images with Defender for Containers / ACR vulnerability scanning
- [ ] Enable audit logs: Entra ID sign-in logs, Key Vault access logs, SQL audit
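The "explicit deny-all + allow rules" item relies on how NSG rules are evaluated: rules are checked in priority order, lowest number first, and the first match decides the outcome. The sketch below models just that behaviour. It is a simplification under stated assumptions: the `Rule` class and `evaluate` function are hypothetical, and real NSGs also consider direction, protocol, service tags, port ranges, and built-in default rules.

```python
import ipaddress
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int                # lower number is evaluated first
    source_prefix: str           # CIDR block, or "*" for any source
    dest_port: Union[int, str]   # a single port, or "*" for any port
    access: str                  # "Allow" or "Deny"


def evaluate(rules, source_ip: str, dest_port: int) -> str:
    """Return the access decision of the first matching rule, by priority."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        source_ok = (rule.source_prefix == "*"
                     or ip in ipaddress.ip_network(rule.source_prefix))
        port_ok = rule.dest_port == "*" or rule.dest_port == dest_port
        if source_ok and port_ok:
            return rule.access
    return "Deny"  # nothing matched: this sketch denies by default


if __name__ == "__main__":
    rules = [
        Rule("deny-all-inbound", 4096, "*", "*", "Deny"),
        Rule("allow-https-from-web-subnet", 100, "10.0.1.0/24", 443, "Allow"),
    ]
    print(evaluate(rules, "10.0.1.7", 443))  # Allow
    print(evaluate(rules, "10.0.2.7", 443))  # Deny
    print(evaluate(rules, "10.0.1.7", 22))   # Deny
```

Giving the deny-all rule the highest priority number keeps it last in the evaluation order, so every allow rule you add above it is a deliberate, reviewable exception.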
Pillar 3: Cost Optimisation
Goal: Deliver business value at the lowest cost.
Cost Reduction Strategies
| Strategy | Saving Potential | How |
|---|---|---|
| Right-size VMs | 20–40% | Monitor CPU/memory; use Azure Advisor recommendations |
| Reserved Instances | Up to 72% | 1- or 3-year commitment for predictable workloads |
| Spot VMs | Up to 90% | Use for batch jobs, fault-tolerant workloads |
| Azure Hybrid Benefit | Up to 40% | Apply existing Windows Server / SQL Server licences |
| Serverless | Variable | Pay-per-use instead of idle capacity |
| Auto-shutdown | ~65% | Turn off dev/test VMs outside business hours |
| Storage tiers | 30–90% | Move infrequently accessed data to Cool/Archive |
| Delete unused resources | 100% of that resource's cost | Regular audits using Azure Advisor |
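The auto-shutdown figure in the table follows from simple arithmetic. As a rough sketch, assuming a dev VM that runs 12 hours a day on 5 days a week and is deallocated otherwise, the compute saving comes out close to 65%. The function name and schedule are illustrative assumptions, and the calculation covers compute charges only, since disks keep billing while a VM is deallocated.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def shutdown_saving_pct(hours_on_per_day: float, days_on_per_week: int) -> float:
    """Percentage of always-on compute cost saved by running only on a schedule."""
    hours_on = hours_on_per_day * days_on_per_week
    return 100 * (1 - hours_on / HOURS_PER_WEEK)


if __name__ == "__main__":
    # 12 h x 5 days = 60 of 168 hours running
    print(round(shutdown_saving_pct(12, 5), 1))  # 64.3
    # 10 h x 5 days = 50 of 168 hours running
    print(round(shutdown_saving_pct(10, 5), 1))  # 70.2
```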
# Auto-shutdown a dev VM at 7pm UTC
az vm auto-shutdown \
--resource-group myapp-rg \
--name mydevvm \
--time 1900 \
--email devteam@mycompany.com
# View Azure Advisor cost recommendations
az advisor recommendation list \
--category Cost \
--output table
# View month-to-date costs by resource group (daily granularity)
# Requires the costmanagement CLI extension; replace <subscription-id> with your own
az costmanagement query \
--type Usage \
--scope "/subscriptions/<subscription-id>" \
--timeframe MonthToDate \
--dataset-granularity Daily \
--dataset-aggregation '{"totalCost": {"name": "Cost", "function": "Sum"}}' \
--dataset-grouping name="ResourceGroup" type="Dimension"
Cost Tagging Strategy
# Enforce cost tracking tags via Azure Policy
# Required tags: environment, project, owner, costcenter
az policy definition create \
--name "require-tags" \
--display-name "Require tags on resources" \
--description "Require environment, project, owner, and costcenter tags" \
--rules '{
"if": {
"anyOf": [
{"field": "tags[environment]", "exists": "false"},
{"field": "tags[project]", "exists": "false"},
{"field": "tags[owner]", "exists": "false"},
{"field": "tags[costcenter]", "exists": "false"}
]
},
"then": {"effect": "deny"}
}' \
--mode Indexed
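The same rule is useful earlier in the pipeline, for example as a pre-deployment check in CI so that a missing tag is caught before the policy rejects the deployment. The sketch below is a hypothetical helper rather than part of any Azure SDK: it reports which of the four required tags are absent. Keys are compared case-insensitively, on the assumption that Azure treats tag names that way.

```python
REQUIRED_TAGS = ("environment", "project", "owner", "costcenter")


def missing_tags(tags: dict) -> list:
    """Return the required tag names that are absent, in a stable order."""
    present = {name.lower() for name in tags}
    return [name for name in REQUIRED_TAGS if name not in present]


def policy_effect(tags: dict) -> str:
    """Mirror the policy: deny when any required tag is missing."""
    return "deny" if missing_tags(tags) else "allow"


if __name__ == "__main__":
    complete = {"environment": "prod", "project": "myapp",
                "owner": "teamname", "costcenter": "12345"}
    partial = {"Environment": "dev", "project": "myapp"}
    print(policy_effect(complete))   # allow
    print(missing_tags(partial))     # ['owner', 'costcenter']
```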
Pillar 4: Operational Excellence
Goal: Operate and monitor systems to deliver business value continuously.
Key Practices
Infrastructure as Code (IaC):
- All resources defined in Bicep or Terraform (no manual portal changes in production)
- IaC stored in Git with code review and version history
- Use terraform plan or az deployment group what-if to preview changes
Deployment Strategy:
| Strategy | How | Risk | Rollback |
|---|---|---|---|
| Blue/Green | Two identical environments, swap DNS | Low | Swap back |
| Canary | Route small % to new version first | Very low | Reduce traffic % |
| Rolling | Replace instances one by one | Medium | Roll back instance by instance (slow) |
| Recreate | Stop old, start new | High (downtime) | Redeploy old version |
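App Service deployment slots implement blue/green natively: deploy to a staging slot, then swap it with production. Traffic Manager (weighted routing) and Front Door (weighted origins) can split traffic between versions for canary deployments.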
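A canary split can also be made at the application layer when each user should stay on one version for the whole rollout. The following is a minimal sketch of that idea, assuming a stable user ID is available. The `bucket` and `route` functions are hypothetical, and this is not how Traffic Manager or Front Door distribute weighted traffic internally.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the new version."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    share = sum(route(u, 10) == "canary" for u in users) / len(users)
    print(f"canary share at 10%: {share:.1%}")
```

Because the bucket depends only on the user ID, raising the percentage from 10 to 50 keeps every existing canary user on the canary, and rolling back is a matter of setting the percentage to 0.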
Runbooks: Document operational procedures for:
- Scaling up/down
- Responding to alerts
- Database failover procedures
- Secret rotation
- Disaster recovery drills
Pillar 5: Performance Efficiency
Goal: Use resources efficiently to meet system requirements and maintain that efficiency as demand changes.
Caching Strategies
Request flow without caching:
Client → App → Database (slow)
With Redis cache:
Client → App → Redis (fast hit)
               ↓ miss
               Database → populate Redis
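import json
import os
from functools import wraps

import redis

redis_client = redis.from_url(os.environ["REDIS_URL"])


def cached(ttl_seconds=300):
    """Cache-aside decorator; the wrapped function must return JSON-serialisable data."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Sort kwargs so the same call always maps to the same key
            cache_key = f"{func.__name__}:{args!r}:{sorted(kwargs.items())!r}"
            hit = redis_client.get(cache_key)
            if hit is not None:
                return json.loads(hit)
            result = func(*args, **kwargs)
            redis_client.setex(cache_key, ttl_seconds, json.dumps(result))
            return result
        return wrapper
    return decorator


@cached(ttl_seconds=60)
def get_product(product_id: str):
    # db and Product come from the application's SQLAlchemy setup (not shown)
    product = db.query(Product).filter_by(id=product_id).first()
    # Return a plain dict: ORM objects cannot be passed to json.dumps.
    # The name column is illustrative.
    return {"id": product.id, "name": product.name} if product else None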
CDN with Azure Front Door
Serve static assets (JS, CSS, images) from the nearest edge location:
az afd profile create \
--resource-group myapp-rg \
--profile-name myapp-afd \
--sku Standard_AzureFrontDoor
az afd endpoint create \
--resource-group myapp-rg \
--profile-name myapp-afd \
--endpoint-name myapp
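az afd origin-group create \
--resource-group myapp-rg \
--profile-name myapp-afd \
--origin-group-name default-og \
--probe-request-type GET \
--probe-protocol Https \
--probe-interval-in-seconds 60 \
--probe-path /health \
--sample-size 4 \
--successful-samples-required 3 \
--additional-latency-in-milliseconds 50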
Reference Architectures
Web Application (Standard)
Internet
│
Azure Front Door (CDN + WAF + global LB)
│
App Service Plan (Premium v3, 2+ instances, zone-redundant)
       │                      │
  Azure SQL DB        Azure Cache for Redis
(Business Crit.)     (session + query cache)
│
Blob Storage (static assets, user uploads)
│
Key Vault (secrets)
│
Application Insights -> Log Analytics Workspace
Microservices on AKS
Internet
│
Azure Application Gateway + WAF
│
AKS Cluster (across 3 AZs)
├── Ingress (nginx / AGIC)
├── Service A (3 replicas) ─► Azure SQL
├── Service B (3 replicas) ─► Cosmos DB
├── Service C (3 replicas) ─► Redis Cache
│                          ─► Service Bus
└── Shared sidecar (Dapr / Envoy)
│
Azure Container Registry (image source)
Key Vault (CSI driver for secrets)
Azure Monitor + Container Insights
Serverless Event Processing
IoT Devices / Web Clients
│
Event Hubs (ingestion: millions of events/sec)
│
Azure Functions (stream processing, triggered by Event Hubs)
│
Cosmos DB (write processed results)
│
Power BI / Azure Synapse (analytics and reporting)
Production Readiness Checklist
Before going live, validate these items:
Reliability:
- [ ] Zone-redundant or multi-region deployment
- [ ] Auto-scale configured with appropriate min/max
- [ ] Health endpoints exposed and monitored
- [ ] Backup and restore tested
- [ ] Disaster recovery runbook documented and tested
Security:
- [ ] MFA enabled for all accounts
- [ ] Managed Identity used (no credentials in code)
- [ ] Key Vault in use with RBAC access model
- [ ] NSGs configured (deny-all default)
- [ ] Private Endpoints for databases and Key Vault
- [ ] Defender for Cloud enabled
- [ ] SSL/TLS enforced (HTTPS only)
Monitoring:
- [ ] Application Insights instrumented
- [ ] Diagnostic logs routed to Log Analytics
- [ ] Metric and log alerts configured
- [ ] On-call runbook linked from alerts
- [ ] Dashboard created for key service KPIs
Cost:
- [ ] Budget alerts configured
- [ ] Dev/test VMs auto-shutdown enabled
- [ ] Reserved instances purchased for stable workloads
- [ ] Resource tagging policy enforced
Operations:
- [ ] All infrastructure defined in Bicep/Terraform (no manual resources)
- [ ] CI/CD pipeline deploys to staging before production
- [ ] Deployment uses blue/green or canary strategy
- [ ] Rollback procedure documented and tested
Azure Naming & Tagging Convention (Template)
# Resource groups
{project}-{env}-rg # myapp-prod-rg
# Compute
{project}-{component}-{env}-vm-{n} # myapp-web-prod-vm-01
{project}-{env}-aks # myapp-prod-aks
{project}-{env}-plan # myapp-prod-plan
{project}-{env}-{service} # myapp-prod-api
# Storage
{project}{env}st{n} # myappprodst01 (no hyphens, lowercase)
# Networking
{project}-{env}-vnet # myapp-prod-vnet
{project}-{env}-{tier}-snet # myapp-prod-web-snet
{project}-{env}-{tier}-nsg # myapp-prod-web-nsg
# Security
{project}-{env}-kv # myapp-prod-kv
# Required tags for every resource
environment: prod | staging | dev | test
project: myapp
owner: teamname
costcenter: 12345
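A convention like this holds up better when names are generated than when they are typed by hand. The sketch below builds a few of the names from the template and validates the storage account name, which has to be 3 to 24 characters of lowercase letters and digits. The function names are hypothetical, and only some of the patterns above are covered.

```python
import re


def resource_group_name(project: str, env: str) -> str:
    return f"{project}-{env}-rg"


def vm_name(project: str, component: str, env: str, n: int) -> str:
    return f"{project}-{component}-{env}-vm-{n:02d}"


def storage_account_name(project: str, env: str, n: int) -> str:
    """Build a storage account name and enforce its stricter character rules."""
    name = f"{project}{env}st{n:02d}".lower()
    if not re.fullmatch(r"[a-z0-9]{3,24}", name):
        raise ValueError(
            f"invalid storage account name {name!r}: "
            "use 3-24 lowercase letters and digits, no hyphens"
        )
    return name


if __name__ == "__main__":
    print(resource_group_name("myapp", "prod"))        # myapp-prod-rg
    print(vm_name("myapp", "web", "prod", 1))          # myapp-web-prod-vm-01
    print(storage_account_name("myapp", "prod", 1))    # myappprodst01
```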
Further Learning
| Topic | Resource |
|---|---|
| AZ-900 (Fundamentals) | Microsoft Learn path |
| AZ-104 (Administrator) | Microsoft Learn path |
| AZ-204 (Developer) | Microsoft Learn path |
| AZ-305 (Architect) | Microsoft Learn path |
| Well-Architected Framework | Official docs |
| Architecture Center | Reference architectures |
| Free hands-on labs | Microsoft Learn |
| Azure pricing | Pricing calculator |