Cheatsheets — Quick Reference for AI Infrastructure Engineers¶
A condensed, at-a-glance reference covering every chapter of the book. Print this, bookmark it, pin it to your monitor — these are the numbers, commands, and checklists you'll reach for daily.
Azure GPU VM Comparison (Ch3, Ch4)¶
| VM SKU | GPU | VRAM | Ideal For | Inference | Training | ~Cost/hr |
|---|---|---|---|---|---|---|
| Standard_NC4as_T4_v3 | 1× T4 | 16 GB | Cost-effective inference | ✅✅ | ❌ | ~$0.53 |
| Standard_NC6s_v3 | 1× V100 | 16 GB | General AI workloads | ✅ | ⚠️ | ~$0.90 |
| Standard_NC24ads_A100_v4 | 1× A100 | 80 GB | Single-GPU training/inference | ✅✅ | ✅✅ | ~$3.67 |
| Standard_NV36ads_A10_v5 | 1× A10 | 24 GB | Visualization + lightweight AI | ✅ | ❌ | ~$1.80 |
| Standard_ND96asr_v4 | 8× A100 | 320 GB total | Multi-GPU LLM training | ✅✅✅ | ✅✅✅ | ~$27.20 |
| Standard_ND96isr_H100_v5 | 8× H100 | 640 GB total | Frontier model training | ✅✅✅ | ✅✅✅ | ~$40+ |
💡 Pro Tip: For inference, start with T4 (NC4as_T4_v3). Only move to A10 or A100 when you've proven the model doesn't fit in 16 GB VRAM or needs higher throughput.
CPU vs GPU Quick Reference (Ch3)¶
| Attribute | CPU | GPU |
|---|---|---|
| Core count | 8–128 | 2,560–16,896 CUDA cores |
| Architecture | Optimized for serial/branching logic | Optimized for parallel SIMD operations |
| Memory | Shared system RAM (up to TBs) | Dedicated VRAM (16–80 GB per GPU) |
| Best for | Web apps, ETL, classical ML, preprocessing | Deep learning, embeddings, matrix math |
| Azure families | Dv5, Ev5, Fv2 | NCas_T4, NC_A100, ND_H100 |
| Cost profile | 💲 | 💲💲💲 |
⚠️ Production Gotcha: Don't put preprocessing (image resizing, tokenization) on GPUs. Use CPU nodes for data preparation and reserve GPU exclusively for model computation.
GPU Memory Math (Ch4)¶
Use these formulas to calculate whether your model fits in GPU VRAM before you provision.
Model Parameter Memory¶
| Precision | Formula | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | params × 4 bytes | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | params × 2 bytes | 14 GB | 26 GB | 140 GB |
| INT8 | params × 1 byte | 7 GB | 13 GB | 70 GB |
| INT4 (GPTQ/AWQ) | params × 0.5 bytes | 3.5 GB | 6.5 GB | 35 GB |
Training Memory (Full Fine-Tuning)¶
Total VRAM ≈ Model weights + Gradients + Optimizer states + Activations
≈ (params × 2) + (params × 2) + (params × 8) + activations
≈ params × 12 bytes (AdamW, FP16 mixed precision) + activations
Example: 7B model with AdamW → 7B × 12 = 84 GB before activations → requires 2× A100-80GB (or optimizer-state offloading, e.g. DeepSpeed ZeRO; gradient checkpointing shrinks only the activation term, so it won't fit the 84 GB base on a single 80 GB card).
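The two memory formulas above can be captured in a small helper, a minimal sketch using the same byte-per-parameter figures as the tables (decimal GB; activations excluded):

```python
# Bytes per parameter at each precision (matches the table above)
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def param_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Weight memory in GB: params × bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

def training_memory_gb(params_billions: float) -> float:
    """Full fine-tuning floor with AdamW in FP16 mixed precision:
    weights (2) + gradients (2) + optimizer states (8) = 12 bytes/param,
    before activations."""
    return params_billions * 12.0

print(param_memory_gb(7))            # 14.0 -> Llama 2 7B in FP16
print(param_memory_gb(70, "int4"))   # 35.0 -> 70B quantized to INT4
print(training_memory_gb(7))         # 84.0 -> 7B full fine-tune floor
```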
Inference Memory (Quick Estimate)¶
Total VRAM ≈ Model weights (per the precision table above) + KV cache + ~20% overhead
💡 Pro Tip: When a model just barely fits in VRAM, you'll OOM under load because the KV cache grows with sequence length and batch size. Leave 20% VRAM headroom.
GPU Count Calculator¶
| Model | Precision | Min VRAM | Fits On |
|---|---|---|---|
| Llama 2 7B | FP16 | 14 GB | 1× T4 (tight) or 1× A10 |
| Llama 2 13B | FP16 | 26 GB | 1× A100-80GB |
| Llama 2 70B | FP16 | 140 GB | 2× A100-80GB |
| Llama 2 70B | INT4 | 35 GB | 1× A100-80GB |
⚠️ Production Gotcha: These are weights only. KV cache for a 4K context with batch size 32 can add 8-12 GB. Always benchmark actual memory under realistic load.
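To put a number on that KV cache growth, here is a sketch of the standard KV cache formula (2 for K and V, times layers, KV heads, head dim, sequence length, batch size, and bytes per element). The Llama-2-7B-like dimensions in the example (32 layers, 32 KV heads, head dim 128) are illustrative assumptions; models with grouped-query attention have far fewer KV heads and a much smaller cache.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: 2 (K and V) x layers x KV heads x head dim
    x sequence length x batch size x bytes per element (2 for FP16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem / 1e9

# Llama-2-7B-like dims, FP16, 4K context:
print(round(kv_cache_gb(32, 32, 128, 4096, 1), 2))  # 2.15 GB per sequence
print(round(kv_cache_gb(32, 32, 128, 4096, 4), 2))  # 8.59 GB at batch 4
```

Run this with your model's actual config before trusting any "it fits" conclusion.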
Security Checklist for AI Workloads (Ch8)¶
| # | Control | Description | Priority |
|---|---|---|---|
| 1 | Managed Identity for all services | No service principal secrets; use SystemAssigned or UserAssigned | 🔴 Critical |
| 2 | Private Endpoints everywhere | Azure ML, Storage, ACR, Key Vault, OpenAI — no public endpoints | 🔴 Critical |
| 3 | Key Vault for all secrets | API keys, connection strings, certificates — never in code or env vars | 🔴 Critical |
| 4 | Network segmentation | Hub-spoke VNet, NSGs restricting east-west traffic | 🔴 Critical |
| 5 | Diagnostic logs enabled | All resources → central Log Analytics workspace | 🟡 High |
| 6 | RBAC least privilege | Cognitive Services User not Contributor; scope to resource not RG | 🟡 High |
| 7 | Prompt injection testing | Input validation, content filtering enabled on Azure OpenAI | 🟡 High |
| 8 | Data encryption | TLS 1.2+ in transit, SSE with CMK at rest for sensitive data | 🟡 High |
| 9 | Egress control | NSG/Firewall restricting outbound from inference nodes | 🟡 High |
| 10 | Container image scanning | Only signed images from a private ACR; block the "latest" tag | 🟠 Medium |
| 11 | API rate limiting | Azure API Management or Application Gateway WAF | 🟠 Medium |
| 12 | Model artifact integrity | Checksum validation on model files pulled from storage | 🟠 Medium |
Monitoring and Observability (Ch7)¶
The GPU Golden Signals¶
| Signal | Metric | Source | Alert Threshold |
|---|---|---|---|
| Utilization | DCGM_FI_DEV_GPU_UTIL | DCGM Exporter | < 20% (waste) or > 95% (saturation) |
| Memory | DCGM_FI_DEV_FB_USED / FB_FREE | DCGM Exporter | > 90% used (OOM risk) |
| Temperature | DCGM_FI_DEV_GPU_TEMP | DCGM Exporter | > 83°C (thermal throttling) |
| Errors | DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | DCGM Exporter | > 0 (hardware fault) |
Application-Level Metrics¶
| Metric | What It Tells You | Tool |
|---|---|---|
| P50/P95/P99 inference latency | User experience and SLA compliance | Application Insights |
| Requests per second (RPS) | Load pressure and capacity planning | Application Insights |
| Error rate (4xx, 5xx) | Client misuse vs. server issues | Application Insights |
| Token throughput (TPM) | Azure OpenAI consumption rate | Azure Monitor |
| 429 rate | Throttling frequency | Azure Monitor / Log Analytics |
| Queue depth | Backpressure in async pipelines | Prometheus / custom metric |
Essential KQL Queries¶
// GPU VM performance over last 24 hours
InsightsMetrics
| where Name == "dcgm_gpu_utilization"
| where TimeGenerated > ago(24h)
| summarize avg(Val), max(Val), percentile(Val, 95) by bin(TimeGenerated, 5m), Computer
| render timechart
// Azure OpenAI 429 errors by deployment
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultType == "429"
| summarize count() by bin(TimeGenerated, 1h), _ResourceId
| render barchart
MLOps Commands (Ch6)¶
Azure ML CLI v2 Essentials¶
# Register a model from local files
az ml model create --name fraud-detector --version 1 \
--path ./model/ --type custom_model
# Create a managed online endpoint
az ml online-endpoint create --name fraud-api \
--auth-mode aml_token
# Deploy a model to the endpoint
az ml online-deployment create --name blue \
--endpoint-name fraud-api \
--model azureml:fraud-detector:1 \
--instance-type Standard_NC4as_T4_v3 \
--instance-count 2 \
--file deployment.yml
# Blue-green traffic split
az ml online-endpoint update --name fraud-api \
--traffic "blue=80 green=20"
# Test the endpoint
az ml online-endpoint invoke --name fraud-api \
--request-file sample-request.json
Model Lifecycle Commands¶
# List model versions
az ml model list --name fraud-detector --output table
# Archive old model version (soft delete)
az ml model archive --name fraud-detector --version 1
# Download model artifacts for inspection
az ml model download --name fraud-detector --version 2 \
--download-path ./local-model/
MLOps Pipeline Trigger¶
# Submit a training pipeline run
az ml job create --file pipeline.yml \
--set inputs.training_data.path=azureml:transactions:latest
# Monitor a running job
az ml job show --name <job-name> --output table
# Stream job logs
az ml job stream --name <job-name>
💡 Pro Tip: Always version your models with semantic versioning in the model registry. When an inference endpoint degrades, you need to roll back to fraud-detector:3 not "whatever was deployed last Tuesday."
Infrastructure Quick Deploy Commands (Ch5)¶
Create a GPU VM¶
az vm create \
--resource-group rg-ai-lab \
--name vm-gpu-inference \
--image Ubuntu2204 \
--size Standard_NC4as_T4_v3 \
--admin-username azureuser \
--generate-ssh-keys \
--priority Spot \
--max-price 0.35 \
--eviction-policy Deallocate
⚠️ Production Gotcha: Always check GPU quota before deploying: az vm list-usage --location eastus -o table | grep -i "NC". Quota requests take 1-5 business days.
AKS GPU Node Pool (Terraform)¶
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpuinfer"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_NC4as_T4_v3"
  mode                  = "User"
  auto_scaling_enabled  = true
  min_count             = 0
  max_count             = 6
  node_labels = {
    "hardware" = "gpu"
    "gpu-type" = "t4"
  }
  node_taints = ["nvidia.com/gpu=present:NoSchedule"]
  tags = {
    Environment  = "production"
    WorkloadType = "inference"
    CostCenter   = "ai-platform"
  }
}
Kubernetes GPU Pod Manifest¶
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
  nodeSelector:
    hardware: gpu
  containers:
    - name: model
      image: myacr.azurecr.io/inference:v2.1
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
          memory: "8Gi"
          cpu: "4"
Cost Engineering Quick Reference (Ch9)¶
Cost Optimization Levers¶
| Lever | Savings Range | Risk | Best For |
|---|---|---|---|
| Spot VMs | 60-90% | Eviction (batch only) | Training, batch inference |
| Reserved Instances (1-yr) | 20-35% | Commitment | Steady-state inference |
| Reserved Instances (3-yr) | 40-55% | Long commitment | Permanent workloads |
| VM Scheduling | 30-50% | Cold start latency | Business-hours-only workloads |
| Right-sizing | 15-40% | Performance testing needed | Over-provisioned clusters |
| PTU vs Standard (OpenAI) | 20-50% | Capacity commitment | Sustained >100K TPM |
| Quantization (INT8/INT4) | 50-75% GPU reduction | Slight accuracy loss | Inference (not training) |
Quick Cost Estimation¶
Monthly GPU cost = (Hourly rate × Hours/day × 30) × VM count
= ($0.53 × 12 × 30) × 4 = $763/month (T4, 12hr/day)
Annual savings from RI = Monthly cost × 12 × 0.30 (avg 1-yr discount)
= $763 × 12 × 0.30 = $2,747/year
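The same arithmetic as a reusable sketch (the rates and the ~30% average discount are the illustrative figures from above):

```python
def monthly_gpu_cost(hourly_rate: float, hours_per_day: float, vm_count: int) -> float:
    """(Hourly rate x hours/day x 30 days) x VM count."""
    return hourly_rate * hours_per_day * 30 * vm_count

def annual_ri_savings(monthly_cost: float, discount: float = 0.30) -> float:
    """Annual savings at an average 1-yr Reserved Instance discount."""
    return monthly_cost * 12 * discount

cost = monthly_gpu_cost(0.53, 12, 4)   # 4x T4 at $0.53/hr, 12 hr/day
print(round(cost))                     # 763
print(round(annual_ri_savings(cost)))  # 2748
```

(The cheatsheet's $2,747 comes from rounding the monthly figure to $763 before multiplying.)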
Azure Cost Management Commands¶
# Current month spend by resource group
az consumption usage list \
--start-date $(date -d "$(date +%Y-%m-01)" +%Y-%m-%d) \
--end-date $(date +%Y-%m-%d) \
--query "[?contains(instanceName,'gpu')].{Name:instanceName,Cost:pretaxCost}" \
--output table
# Create a budget with alerts
az consumption budget create \
--budget-name ai-gpu-budget \
--amount 5000 \
--category cost \
--time-grain monthly \
--start-date 2025-01-01 \
--end-date 2025-12-31
Chargeback Tagging (Azure Policy)¶
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Compute/virtualMachines" },
      { "field": "Microsoft.Compute/virtualMachines/vmSize", "like": "Standard_N*" },
      { "field": "tags['CostCenter']", "exists": false }
    ]
  },
  "then": { "effect": "deny" }
}
💡 Pro Tip: Set up a weekly cost anomaly alert at 120% of your trailing 4-week average. GPU costs can spike fast when someone forgets to deallocate a training VM.
Platform Ops Commands (Ch10)¶
Namespace Isolation for Multi-Tenancy¶
# Resource quota per team namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    requests.cpu: "32"
    requests.memory: "128Gi"
    pods: "50"
---
# Limit range to prevent single-pod resource hogging
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: team-alpha
spec:
  limits:
    - type: Container
      max:
        nvidia.com/gpu: "2"
        memory: "64Gi"
      default:
        cpu: "2"
        memory: "8Gi"
Network Policy for Namespace Isolation¶
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-namespace
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: team-alpha
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: team-alpha
    - to: # Allow DNS
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
GPU Time-Slicing Configuration¶
# NVIDIA device plugin ConfigMap for time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
⚠️ Production Gotcha: GPU time-slicing provides no memory isolation. If Pod A allocates 14 GB on a 16 GB T4, Pod B will OOM even though Kubernetes shows the GPU as "available." Only use time-slicing in dev/test.
Azure OpenAI Throughput Reference (Ch11)¶
Deployment Types Comparison¶
| Attribute | Standard (PayGo) | Provisioned (PTU) | Global Standard |
|---|---|---|---|
| Billing | Per 1K tokens | Hourly (per PTU) | Per 1K tokens |
| Latency guarantee | None (best effort) | SLA-backed | None |
| Throttling | TPM/RPM limits | Provisioned capacity | Higher limits |
| Best for | Dev/test, bursty | Production, steady | Global routing |
| Min commitment | None | 1 month | None |
| Scale | Auto (with limits) | Fixed provisioned | Auto |
PTU Sizing Estimator¶
Step 1: Measure average tokens per request
Avg input tokens: 800
Avg output tokens: 400
Total per request: 1,200
Step 2: Calculate sustained TPM
Requests per minute: 150
Sustained TPM = 150 × 1,200 = 180,000 TPM
Step 3: Use Azure capacity calculator or:
~1 PTU ≈ 3,600 TPM (GPT-4o, approximate)
PTUs needed ≈ 180,000 ÷ 3,600 = 50 PTUs
Step 4: Add 25% headroom for bursts
Recommended: 63 PTUs
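The four steps above, condensed into one function (the ~3,600 TPM-per-PTU ratio is the same approximation as Step 3; confirm real numbers with the Azure capacity calculator):

```python
import math

def ptus_needed(avg_input_tokens: float, avg_output_tokens: float,
                requests_per_minute: float,
                tpm_per_ptu: float = 3600, headroom: float = 0.25) -> int:
    """Sustained TPM divided by TPM-per-PTU, plus burst headroom, rounded up."""
    sustained_tpm = (avg_input_tokens + avg_output_tokens) * requests_per_minute
    return math.ceil(sustained_tpm / tpm_per_ptu * (1 + headroom))

print(ptus_needed(800, 400, 150))  # 63
```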
Token Estimation Rules of Thumb¶
| Content Type | Approximate Tokens |
|---|---|
| 1 English word | ~1.3 tokens |
| 1 page of text (~500 words) | ~650 tokens |
| 1 line of code | ~10-15 tokens |
| A 100-line code file | ~1,200 tokens |
| JSON payload (1 KB) | ~300 tokens |
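A back-of-envelope estimator built on these rules of thumb (the ratios are rough averages, not tokenizer output; use a real tokenizer such as tiktoken when precision matters):

```python
def estimate_tokens(words: int = 0, code_lines: int = 0, json_kb: float = 0) -> int:
    """Rough token count from the rules of thumb above:
    ~1.3 tokens/word, ~12 tokens/code line, ~300 tokens/KB of JSON."""
    return round(words * 1.3 + code_lines * 12 + json_kb * 300)

print(estimate_tokens(words=500))       # 650  (one page of text)
print(estimate_tokens(code_lines=100))  # 1200 (a 100-line file)
```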
Azure OpenAI Monitoring Commands¶
# Check current deployments and their TPM limits
az cognitiveservices account deployment list \
--resource-group rg-ai-prod \
--name aoai-prod \
--output table
# Query 429 throttling events (last 24h)
az monitor metrics list \
--resource /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.CognitiveServices/accounts/{name} \
--metric "AzureOpenAIRequests" \
--dimension StatusCode \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--interval PT1H \
--output table
💡 Pro Tip: If your 429 rate exceeds 5%, you're leaving performance on the table. Either increase your TPM quota, migrate to PTU, or implement client-side exponential backoff with jitter.
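The backoff pattern from the tip, sketched with full jitter. ThrottledError and the call shape are stand-ins, not an SDK API; in practice, catch your client library's 429 error and honor the Retry-After header as a minimum delay:

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for an HTTP 429 response from the service (hypothetical)."""

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: uniform pick from [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_retries: int = 5, sleep=time.sleep):
    """Call fn(), retrying throttled attempts with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429
            sleep(backoff_delay(attempt))
```

Jitter matters: if every client backs off on the same deterministic schedule, the retries re-synchronize into a thundering herd and keep tripping the limit.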
Troubleshooting Decision Tree (Ch12)¶
GPU Not Detected¶
GPU not visible to workload?
├── Check VM SKU → Is it an N-series VM?
│ └── No → Wrong VM SKU. Redeploy with NC/ND/NV series.
├── nvidia-smi returns error?
│ ├── Driver not installed → Install NVIDIA driver extension:
│ │ az vm extension set --name NvidiaGpuDriverLinux \
│ │ --publisher Microsoft.HpcCompute --version 1.9 \
│ │ --vm-name <vm> --resource-group <rg>
│ └── Driver version mismatch → Check CUDA compatibility matrix
├── In AKS pod, `nvidia-smi` not found?
│ ├── Missing GPU device plugin → Deploy NVIDIA device plugin DaemonSet
│ ├── Pod missing resource request → Add `nvidia.com/gpu: 1` to limits
│ └── Pod not scheduled on GPU node → Check tolerations and nodeSelector
└── nvidia-smi shows GPU but CUDA fails?
└── CUDA toolkit version doesn't match driver → Rebuild container with matching CUDA
Inference Latency Spikes¶
P95 latency exceeding SLA?
├── Check GPU utilization
│ ├── > 95% → GPU saturated. Scale out (more replicas) or optimize batch size.
│ └── < 50% → GPU is not the bottleneck. Check below.
├── Check GPU memory
│ ├── > 90% → KV cache pressure. Reduce max sequence length or batch size.
│ └── Normal → Not memory-bound. Check below.
├── Check CPU utilization
│ ├── > 80% → Preprocessing bottleneck. Move to dedicated CPU pods.
│ └── Normal → Check below.
├── Check network / storage
│ ├── Model loading slow → Cache model on local NVMe, not Blob Storage.
│ └── High network latency → Check Private Endpoint routing.
└── Check application code
├── No request batching → Implement dynamic batching (batch size 8-32).
└── Synchronous preprocessing → Move to async pipeline.
Azure OpenAI 429 Throttling¶
Getting HTTP 429 responses?
├── Check Retry-After header value
│ ├── < 10 seconds → Implement exponential backoff with jitter
│ └── > 60 seconds → You've hit a hard quota limit
├── Check deployment TPM/RPM limits
│ ├── Near limit → Request quota increase or add second deployment
│ └── Well under limit → Check for retry storms amplifying actual TPM
├── Multiple consumers sharing one deployment?
│ └── Yes → Separate deployments per consumer with dedicated quotas
└── Sustained high usage?
└── Evaluate PTU migration (see PTU Sizing above)
OOM (Out of Memory) Errors¶
CUDA out of memory error?
├── Calculate model memory requirement (see GPU Memory Math above)
│ ├── Model too large for GPU → Quantize (INT8/INT4) or use larger GPU
│ └── Model fits → Batch size or sequence length too high
├── Reduce batch size by 50%, retry
│ ├── Works → Gradually increase to find sweet spot
│ └── Still OOM → Enable gradient checkpointing (training) or reduce max_seq_len
├── Multi-GPU available?
│ └── Enable tensor parallelism (inference) or ZeRO Stage 3 (training)
└── Still failing?
└── Profile with `torch.cuda.memory_summary()` to find the allocation spike
Performance and Throughput Formulas (Ch3, Ch11)¶
| Metric | Formula | Example |
|---|---|---|
| TPM | Avg tokens/request × RPM | 1,200 × 150 = 180K TPM |
| QPS | RPM ÷ 60 | 300 RPM = 5 QPS |
| Model VRAM (FP16) | Parameters × 2 bytes | 7B × 2 = 14 GB |
| Training VRAM (AdamW) | Parameters × 12 bytes + activations | 7B × 12 = 84 GB + activations |
| GPU utilization efficiency | Active compute time ÷ Total allocated time | 80% = healthy |
| Cost per 1K inferences | (VM $/hr ÷ inferences/hr) × 1000 | ($0.53 ÷ 3600) × 1000 = $0.15 |
| PTU break-even TPM | (PTU hourly cost ÷ Standard cost per token) ÷ 60 | Varies by model and region |
Where to Run Your Model — Decision Flow (Ch3, Ch10)¶
Is it a proprietary/custom model?
├── Yes → Do you need to train it?
│ ├── Yes → Azure ML Compute Clusters (managed) or GPU VMs (full control)
│ └── No (inference only) → How much traffic?
│ ├── < 10 QPS → Azure ML Managed Endpoint or single GPU VM
│ ├── 10-100 QPS → AKS with GPU node pool + HPA
│ └── > 100 QPS → AKS multi-node GPU pool + cluster autoscaler
└── No (using a foundation model) → Azure OpenAI Service
├── Dev/test or bursty → Standard (PayGo) deployment
├── Sustained production → Provisioned (PTU) deployment
└── Need global distribution → Global Standard deployment
Resource Tagging Convention (Ch9, Ch10)¶
| Tag | Example | Purpose | Required By |
|---|---|---|---|
| Environment | dev, staging, prod | Lifecycle stage | Azure Policy |
| CostCenter | CC-4521 | Chargeback and budgeting | Azure Policy |
| Team | ml-platform | Ownership and escalation | Convention |
| WorkloadType | training, inference, batch | Cost analysis and scheduling | Convention |
| Model | fraud-detector-v2, gpt-4o | Correlate to model metrics | Convention |
| DataClassification | public, confidential, restricted | Security controls | Azure Policy |
| AutoShutdown | true, business-hours | Cost automation triggers | Automation Runbook |