Chapter 11: Architecture & Best Practices
Knowing the services is not the same as building a system that survives a busy Tuesday. This chapter pulls the rest of the tutorial together: the Well-Architected Framework, the patterns that come up over and over, and a production-readiness checklist you can run through before launch.
Azure Well-Architected Framework
The Azure Well-Architected Framework (WAF) defines five pillars for evaluating cloud workload quality. Run the Azure Well-Architected Review to get a scored assessment of your architecture.
┌──────────────────────────────────────────────────────────────┐
│ Azure Well-Architected Framework │
├──────────────┬──────────────┬──────────────┬────────────────┤
│ Reliability │ Security │ Cost Optim. │ Operational │
│ │ │ │ Excellence │
├──────────────┴──────────────┴──────────────┴────────────────┤
│ Performance Efficiency │
└──────────────────────────────────────────────────────────────┘
Pillar 1: Reliability
Goal: The system recovers from failures and continues to function.
Key Concepts
- SLA (Service Level Agreement): Microsoft's uptime commitment per service
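- SLO (Service Level Objective): Your own target for the workload, usually set tighter than any SLA you depend on or promise, so you can react before a commitment is breached
- RTO (Recovery Time Objective): Maximum acceptable downtime after a failure
- RPO (Recovery Point Objective): Maximum acceptable data loss, measured as a window of time (for example, the last 15 minutes of writes)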
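To make these terms concrete, here is a minimal sketch of how per-service availability figures combine when every component in a request path must be up, and how much downtime that leaves per 30 days. The availability numbers are illustrative assumptions, not quoted SLAs; check the current SLA for each service you use.

```python
from math import prod

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def composite_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_30_DAYS):
    """Minutes of downtime allowed per period at the given availability."""
    return (1 - availability) * period_minutes


# Illustrative figures only, e.g. a web tier and a database in series
chain = [0.9995, 0.9999]
combined = composite_availability(chain)
print(f"Composite availability: {combined:.4%}")
print(f"Downtime budget: {downtime_budget_minutes(combined):.1f} min per 30 days")
```

The composite figure is always lower than the weakest link, which is why an SLO should be derived from the whole dependency chain rather than from any single service.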
Reliability Patterns
Multi-region active-passive:
Traffic Manager (DNS failover)
├── Primary: East US (active: serves all traffic)
└── Secondary: West US (passive: hot standby, failover in minutes)
Multi-region active-active:
Azure Front Door (global load balancer + CDN + WAF)
├── East US (App Service + Azure SQL)
└── West Europe (App Service + Azure SQL with geo-replication)
(Both regions serve traffic; Front Door routes to the nearest healthy region)
# Create Traffic Manager for DNS-based failover
az network traffic-manager profile create \
--resource-group myapp-rg \
--name myapp-tm \
--routing-method Priority \
--unique-dns-name myapp-tm-dns \
--port 443 \
--protocol HTTPS \
--path /health
# Add primary endpoint
az network traffic-manager endpoint create \
--resource-group myapp-rg \
--profile-name myapp-tm \
--type azureEndpoints \
--name primary \
--target-resource-id $(az webapp show --resource-group myapp-rg --name myapp-api-eastus --query id -o tsv) \
--priority 1
# Add secondary endpoint
az network traffic-manager endpoint create \
--resource-group myapp-rg \
--profile-name myapp-tm \
--type azureEndpoints \
--name secondary \
--target-resource-id $(az webapp show --resource-group myapp-rg --name myapp-api-westus --query id -o tsv) \
--priority 2
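Priority routing sends traffic to the healthy endpoint with the lowest priority number. The sketch below models only that selection rule in plain Python; the `Endpoint` class and `pick_endpoint` function are hypothetical names for illustration and are not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int   # lower number = preferred
    healthy: bool   # result of the most recent health probe


def pick_endpoint(endpoints):
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority, default=None)


endpoints = [
    Endpoint("primary", priority=1, healthy=False),   # East US is failing probes
    Endpoint("secondary", priority=2, healthy=True),  # West US is healthy
]
print(pick_endpoint(endpoints).name)  # secondary
```

Because the failover happens in DNS, clients only move once cached records expire, so the real failover time also depends on the profile's TTL and probe settings.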
Health Endpoints
Every service should expose a /health (liveness) and /ready (readiness) endpoint:
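# health.py: FastAPI example
from fastapi import APIRouter, Depends, HTTPException
from sqlalchemy import text
from sqlalchemy.orm import Session
from .database import get_db  # assumption: your app's session dependency lives here
router = APIRouter()
@router.get("/health")
async def health():
    """Liveness: is the process running?"""
    return {"status": "ok"}
@router.get("/ready")
def readiness(db: Session = Depends(get_db)):
    """Readiness: can we serve traffic?"""
    # Plain def: FastAPI runs it in a thread pool, so the blocking DB call
    # does not stall the event loop
    try:
        db.execute(text("SELECT 1"))
        return {"status": "ready"}
    except Exception:
        # Generic message: raw exception text can leak connection details
        raise HTTPException(status_code=503, detail="database unavailable")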
Reliability Checklist
- [ ] Deploy across Availability Zones (zone-redundant services or VMs in multiple zones)
- [ ] Set up Azure SQL Failover Groups or Cosmos DB multi-region writes
- [ ] Configure App Service auto-scale (minimum 2 instances in production)
- [ ] Implement circuit breaker patterns for downstream service failures
- [ ] Set retry policies with exponential backoff for transient failures
- [ ] Test failures regularly (chaos engineering)
- [ ] Define RTO and RPO and verify they are met
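Two of the checklist items, retries with exponential backoff and circuit breakers, are easier to judge with code in front of you. The following is a minimal sketch using only the standard library; `TransientError`, `retry_with_backoff`, and `CircuitBreaker` are hypothetical names, and a production system would more likely rely on a maintained resilience library or the retry policies built into the Azure SDKs.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, throttling)."""


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(); on TransientError wait up to base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronised retries


class CircuitBreaker:
    """Open after `threshold` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: let one trial call through (half-open)
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

The two patterns complement each other: retries absorb brief blips, while the breaker stops a struggling dependency from being hammered by retries from every caller.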
Pillar 2: Security
Goal: Protect against threats and meet compliance requirements.
Zero Trust Model
Traditional perimeter model: "Trust everything inside the network"
Zero Trust model: "Never trust, always verify, regardless of location"
Verify explicitly: Authenticate and authorise every request
Use least privilege: Minimal access, time-limited where possible
Assume breach: Design as if attackers are already inside
Security Checklist
- [ ] Enable MFA for all users; use Conditional Access policies
- [ ] Use Managed Identity: eliminate stored credentials entirely
- [ ] Store all secrets in Key Vault: use Key Vault references in App Service
- [ ] Enable Private Endpoints for databases and Key Vault in production
- [ ] Apply NSGs to all subnets: an explicit deny-all rule plus narrowly scoped allow rules
- [ ] Enable Defender for Cloud: review Secure Score weekly
- [ ] Enable Azure DDoS Protection (the Network Protection tier, formerly called Standard) for internet-facing workloads
- [ ] Enable soft delete and purge protection on Key Vault
- [ ] Rotate secrets regularly: use Key Vault expiry notifications
- [ ] Scan images with Defender for Containers / ACR vulnerability scanning
- [ ] Enable audit logs: Entra ID sign-in logs, Key Vault access logs, SQL audit
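A Key Vault reference is an app setting whose value points at a secret instead of containing it. As a minimal sketch, the helper below builds that value in the `VaultName`/`SecretName` form; the function name, vault name, and secret name are hypothetical, and the app still needs a managed identity with read access to the vault for the reference to resolve.

```python
from typing import Optional


def key_vault_reference(vault_name: str, secret_name: str,
                        version: Optional[str] = None) -> str:
    """Build an App Service app-setting value that points at a Key Vault secret."""
    parts = [f"VaultName={vault_name}", f"SecretName={secret_name}"]
    if version:
        parts.append(f"SecretVersion={version}")  # omit to track the latest version
    return "@Microsoft.KeyVault(" + ";".join(parts) + ")"


# Hypothetical vault and secret names, for illustration only
app_settings = {
    "DATABASE_PASSWORD": key_vault_reference("myapp-prod-kv", "db-password"),
}
print(app_settings["DATABASE_PASSWORD"])
# @Microsoft.KeyVault(VaultName=myapp-prod-kv;SecretName=db-password)
```

The application reads `DATABASE_PASSWORD` as an ordinary environment variable, so no secret value ever appears in code, configuration files, or the deployment pipeline.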
Pillar 3: Cost Optimisation
Goal: Deliver business value at the lowest cost.
Cost Reduction Strategies
| Strategy | Saving Potential | How |
|---|---|---|
| Right-size VMs | 20–40% | Monitor CPU/memory; use Azure Advisor recommendations |
| Reserved Instances | Up to 72% | 1- or 3-year commitment for predictable workloads |
| Spot VMs | Up to 90% | Use for batch jobs, fault-tolerant workloads |
| Azure Hybrid Benefit | Up to 40% | Apply existing Windows Server / SQL Server licences |
| Serverless | Variable | Pay-per-use instead of idle capacity |
| Auto-shutdown | ~65% | Turn off dev/test VMs outside business hours |
| Storage tiers | 30–90% | Move infrequently accessed data to Cool/Archive |
| Delete unused resources | 100% of that resource's cost | Regular audits using Azure Advisor |
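The auto-shutdown figure in the table is simple arithmetic, and it is worth rerunning with your own schedule. This sketch assumes a dev VM that runs 12 hours a day on weekdays only; the function name is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def shutdown_saving(hours_on_per_day: float, days_on_per_week: int) -> float:
    """Fraction of compute hours saved compared with running 24x7."""
    hours_on = hours_on_per_day * days_on_per_week
    return 1 - hours_on / HOURS_PER_WEEK


print(f"{shutdown_saving(12, 5):.0%}")  # 64%
```

The saving applies to compute charges only: a deallocated VM still pays for its managed disks, so the reduction on the total bill is somewhat smaller.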
# Auto-shutdown a dev VM at 7pm UTC
az vm auto-shutdown \
--resource-group myapp-rg \
--name mydevvm \
--time 1900 \
--email devteam@mycompany.com
# View Azure Advisor cost recommendations
az advisor recommendation list \
--category Cost \
--output table
# View month-to-date costs by resource group, one row per day
# (requires the costmanagement CLI extension; replace the placeholder scope)
az costmanagement query \
--type Usage \
--scope "/subscriptions/<subscription-id>" \
--timeframe MonthToDate \
--dataset-granularity Daily \
--dataset-aggregation '{"totalCost": {"name": "Cost", "function": "Sum"}}' \
--dataset-grouping name="ResourceGroup" type="Dimension"
Cost Tagging Strategy
# Enforce cost tracking tags via Azure Policy
# Required tags: environment, project, owner, costcenter
# Mode Indexed limits evaluation to resource types that support tags.
# The definition takes effect only once assigned (az policy assignment create).
az policy definition create \
--name "require-tags" \
--display-name "Require tags on resources" \
--description "Require environment, project, owner, and costcenter tags" \
--rules '{
  "if": {
    "anyOf": [
      {"field": "tags[environment]", "exists": "false"},
      {"field": "tags[project]", "exists": "false"},
      {"field": "tags[owner]", "exists": "false"},
      {"field": "tags[costcenter]", "exists": "false"}
    ]
  },
  "then": {"effect": "deny"}
}' \
--mode Indexed
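A deny policy catches missing tags at deployment time, but the same rule is useful earlier, for example in a CI step that inspects planned resources. The following is a minimal sketch of that check; `REQUIRED_TAGS` mirrors the four tags listed above, and `missing_tags` is a hypothetical helper. Unlike the policy's `exists` test, it also treats empty values as missing, which is a deliberate and stricter assumption.

```python
REQUIRED_TAGS = ("environment", "project", "owner", "costcenter")


def missing_tags(tags):
    """Return the required tag names that are absent or empty, in a stable order."""
    return [name for name in REQUIRED_TAGS
            if not str(tags.get(name, "")).strip()]


resource_tags = {"environment": "prod", "project": "myapp", "owner": "teamname"}
print(missing_tags(resource_tags))  # ['costcenter']
```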
Pillar 4: Operational Excellence
Goal: Operate and monitor systems to deliver business value continuously.
Key Practices
Infrastructure as Code (IaC):
- All resources defined in Bicep or Terraform (no manual portal changes in production)
- IaC stored in Git with code review and version history
- Use terraform plan or az deployment group what-if to preview changes before applying them
Deployment Strategy:
| Strategy | How | Risk | Rollback |
|---|---|---|---|
| Blue/Green | Two identical environments, swap DNS | Low | Swap back |
| Canary | Route small % to new version first | Very low | Reduce traffic % |
| Rolling | Replace instances one by one | Medium | Roll back instance by instance (slow) |
| Recreate | Stop old, start new | High (downtime) | Redeploy old version |
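App Service deployment slots implement blue/green natively through slot swaps, and can also send a percentage of traffic to a slot. Azure Traffic Manager (weighted routing) and Front Door (weighted origins) enable canary deployments across separate deployments.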
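The Azure services above split traffic by weight at the network edge. When you need the same user to see the same version on every request, a common complement is a deterministic split inside the application. This is a minimal sketch of that idea, not how Traffic Manager or Front Door work internally; `in_canary` is a hypothetical helper.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically place a user in the canary group for a given percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket, 0 to 99
    return bucket < canary_percent


# Roughly 10% of users land in the canary, and each user always gets the same answer
users = [f"user-{i}" for i in range(10_000)]
share = sum(in_canary(u, 10) for u in users) / len(users)
print(f"{share:.1%} of users routed to the canary")
```

Because the bucket depends only on the user ID, raising the percentage from 10 to 50 keeps every existing canary user in the canary, and setting it to 0 is an immediate rollback.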
Runbooks: Document operational procedures for:
- Scaling up/down
- Responding to alerts
- Database failover procedures
- Secret rotation
- Disaster recovery drills
Pillar 5: Performance Efficiency
Goal: Use resources efficiently to meet system requirements and maintain that efficiency as demand changes.
Caching Strategies
Request flow without caching:
Client → App → Database (slow)
With Redis cache:
Client → App → Redis (fast hit)
↓ miss
Database → populate Redis
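import json
import os
from functools import wraps
import redis
redis_client = redis.from_url(os.environ["REDIS_URL"])
def cached(ttl_seconds=300):
    """Cache a function's JSON-serialisable return value in Redis."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Sort kwargs so the same call always produces the same key
            cache_key = f"{func.__name__}:{args}:{sorted(kwargs.items())}"
            hit = redis_client.get(cache_key)
            if hit is not None:
                return json.loads(hit)
            result = func(*args, **kwargs)
            redis_client.setex(cache_key, ttl_seconds, json.dumps(result))
            return result
        return wrapper
    return decorator
@cached(ttl_seconds=60)
def get_product(product_id: str):
    # db and Product come from your app's data layer (not shown here)
    product = db.query(Product).filter_by(id=product_id).first()
    if product is None:
        return None
    # Return a plain dict: ORM objects are not JSON-serialisable
    # (field names are illustrative)
    return {"id": product.id, "name": product.name}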
CDN with Azure Front Door
Serve static assets (JS, CSS, images) from the nearest edge location:
az afd profile create \
--resource-group myapp-rg \
--profile-name myapp-afd \
--sku Standard_AzureFrontDoor
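az afd endpoint create \
--resource-group myapp-rg \
--profile-name myapp-afd \
--endpoint-name myapp \
--enabled-state Enabled
# Probe and load-balancing values are starting points; tune them for your workload
az afd origin-group create \
--resource-group myapp-rg \
--profile-name myapp-afd \
--origin-group-name default-og \
--probe-request-type GET \
--probe-protocol Https \
--probe-interval-in-seconds 60 \
--probe-path /health \
--sample-size 4 \
--successful-samples-required 3 \
--additional-latency-in-milliseconds 50
# Still needed before traffic flows: an origin (az afd origin create)
# and a route (az afd route create) linking the endpoint to the origin group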
Reference Architectures
Web Application (Standard)
Internet
│
Azure Front Door (CDN + WAF + global LB)
│
App Service Plan (Premium v3, 2+ instances, zone-redundant)
│ │
Azure SQL DB Azure Cache for Redis
(Business Crit.) (session + query cache)
│
Blob Storage (static assets, user uploads)
│
Key Vault (secrets)
│
Application Insights -> Log Analytics Workspace
Microservices on AKS
Internet
│
Azure Application Gateway + WAF
│
AKS Cluster (across 3 AZs)
├── Ingress (nginx / AGIC)
├── Service A (3 replicas) ─► Azure SQL
├── Service B (3 replicas) ─► Cosmos DB
├── Service C (3 replicas) ─► Redis Cache
│ ─► Service Bus
└── Shared sidecar (Dapr / Envoy)
│
Azure Container Registry (image source)
Key Vault (CSI driver for secrets)
Azure Monitor + Container Insights
Serverless Event Processing
IoT Devices / Web Clients
│
Event Hubs (ingestion: millions of events/sec)
│
Azure Functions (stream processing, triggered by Event Hubs)
│
Cosmos DB (write processed results)
│
Power BI / Azure Synapse (analytics and reporting)
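The processing step in the middle of this pipeline usually receives events in batches. As a minimal sketch of what that step might compute, the function below summarises a batch per device in plain Python. The event shape (`device_id`, `temperature`) and the function name are assumptions for illustration; a real function would receive the batch through the Event Hubs trigger binding and write each summary to Cosmos DB.

```python
from collections import defaultdict


def summarise_batch(events):
    """Group a batch of readings by device and compute a count and mean per device."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["device_id"]].append(event["temperature"])
    return {
        device: {"count": len(values), "mean": sum(values) / len(values)}
        for device, values in grouped.items()
    }


batch = [
    {"device_id": "sensor-a", "temperature": 20.0},
    {"device_id": "sensor-a", "temperature": 22.0},
    {"device_id": "sensor-b", "temperature": 5.0},
]
print(summarise_batch(batch))
```

Event Hubs delivers events at least once, so whatever writes these summaries downstream should be safe to repeat, for example an upsert keyed on device and time window.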
Production Readiness Checklist
Before going live, validate these items:
Reliability:
- [ ] Zone-redundant or multi-region deployment
- [ ] Auto-scale configured with appropriate min/max
- [ ] Health endpoints exposed and monitored
- [ ] Backup and restore tested
- [ ] Disaster recovery runbook documented and tested
Security:
- [ ] MFA enabled for all accounts
- [ ] Managed Identity used (no credentials in code)
- [ ] Key Vault in use with RBAC access model
- [ ] NSGs configured (deny-all default)
- [ ] Private Endpoints for databases and Key Vault
- [ ] Defender for Cloud enabled
- [ ] TLS enforced (HTTPS only, minimum TLS 1.2)
Monitoring:
- [ ] Application Insights instrumented
- [ ] Diagnostic logs routed to Log Analytics
- [ ] Metric and log alerts configured
- [ ] On-call runbook linked from alerts
- [ ] Dashboard created for key service KPIs
Cost:
- [ ] Budget alerts configured
- [ ] Dev/test VMs auto-shutdown enabled
- [ ] Reserved instances purchased for stable workloads
- [ ] Resource tagging policy enforced
Operations:
- [ ] All infrastructure defined in Bicep/Terraform (no manual resources)
- [ ] CI/CD pipeline deploys to staging before production
- [ ] Deployment uses blue/green or canary strategy
- [ ] Rollback procedure documented and tested
Azure Naming & Tagging Convention (Template)
# Resource groups
{project}-{env}-rg # myapp-prod-rg
# Compute
{project}-{component}-{env}-vm-{n} # myapp-web-prod-vm-01
{project}-{env}-aks # myapp-prod-aks
{project}-{env}-plan # myapp-prod-plan
{project}-{env}-{service} # myapp-prod-api
# Storage
{project}{env}st{n} # myappprodst01 (no hyphens, lowercase)
# Networking
{project}-{env}-vnet # myapp-prod-vnet
{project}-{env}-{tier}-snet # myapp-prod-web-snet
{project}-{env}-{tier}-nsg # myapp-prod-web-nsg
# Security
{project}-{env}-kv # myapp-prod-kv
# Required tags for every resource
environment: prod | staging | dev | test
project: myapp
owner: teamname
costcenter: 12345
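A convention only helps if names are generated rather than typed. The sketch below builds names from the templates above and checks the storage account case, which is the one most likely to fail because names must be 3 to 24 characters of lowercase letters and digits. Both helper names are hypothetical.

```python
import re


def hyphenated_name(*parts: str) -> str:
    """Join non-empty parts with hyphens, lowercased."""
    return "-".join(p.lower() for p in parts if p)


def storage_account_name(project: str, env: str, n: int) -> str:
    """Build {project}{env}st{n} and check it against storage account naming rules."""
    name = f"{project}{env}st{n:02d}".lower()
    if not re.fullmatch(r"[a-z0-9]{3,24}", name):
        raise ValueError(f"invalid storage account name: {name!r}")
    return name


print(hyphenated_name("myapp", "prod", "rg"))               # myapp-prod-rg
print(hyphenated_name("myapp", "web", "prod", "vm", "01"))  # myapp-web-prod-vm-01
print(storage_account_name("myapp", "prod", 1))             # myappprodst01
```

Storage account names must also be globally unique, which this check cannot verify; that is only known when the deployment runs.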
Where to Go From Here
You now have the vocabulary and the patterns. The next step is to build something you would not be embarrassed to put in production. Pick one architecture from the section above, deploy it, then deliberately break it: fail over a region, kill a pod, rotate a secret without warning the app. The lessons stick when something behind you is on fire.
A few directions worth your time:
| Topic | Resource |
|---|---|
| AZ-900 (Fundamentals) | Microsoft Learn path |
| AZ-104 (Administrator) | Microsoft Learn path |
| AZ-204 (Developer) | Microsoft Learn path |
| AZ-305 (Architect) | Microsoft Learn path |
| Well-Architected Framework | Official docs |
| Architecture Center | Reference architectures |
| Free hands-on labs | Microsoft Learn |
| Azure pricing | Pricing calculator |