Cheatsheets — Quick Reference for AI Infrastructure Engineers¶
A condensed, at-a-glance reference covering every chapter of the book. Print this, bookmark it, pin it to your monitor — these are the numbers, commands, and checklists you'll reach for daily.
Azure GPU VM Comparison (Ch3, Ch4)¶
| VM SKU | GPU | VRAM | Ideal For | Inference | Training | ~Cost/hr |
|---|---|---|---|---|---|---|
| Standard_NC4as_T4_v3 | 1× T4 | 16 GB | Cost-effective inference | ✅✅ | ❌ | ~$0.53 |
| Standard_NC6s_v3 | 1× V100 | 16 GB | General AI workloads | ✅ | ⚠️ | ~$0.90 |
| Standard_NC24ads_A100_v4 | 1× A100 | 80 GB | Single-GPU training/inference | ✅✅ | ✅✅ | ~$3.67 |
| Standard_NV36ads_A10_v5 | 1× A10 | 24 GB | Visualization + lightweight AI | ✅ | ❌ | ~$1.80 |
| Standard_ND96asr_v4 | 8× A100 | 320 GB total | Multi-GPU LLM training | ✅✅✅ | ✅✅✅ | ~$27.20 |
| Standard_ND96isr_H100_v5 | 8× H100 | 640 GB total | Frontier model training | ✅✅✅ | ✅✅✅ | ~$40+ |
💡 Pro Tip: For inference, start with T4 (NC4as_T4_v3). Only move to A10 or A100 when you've proven the model doesn't fit in 16 GB VRAM or needs higher throughput.
CPU vs GPU Quick Reference (Ch3)¶
| Attribute | CPU | GPU |
|---|---|---|
| Core count | 8–128 | 2,560–16,896 CUDA cores |
| Architecture | Optimized for serial/branching logic | Optimized for parallel SIMD operations |
| Memory | Shared system RAM (up to TBs) | Dedicated VRAM (16–80 GB per GPU) |
| Best for | Web apps, ETL, classical ML, preprocessing | Deep learning, embeddings, matrix math |
| Azure families | Dv5, Ev5, Fv2 | NCas_T4, NC_A100, ND_H100 |
| Cost profile | 💲 | 💲💲💲 |
⚠️ Production Gotcha: Don't put preprocessing (image resizing, tokenization) on GPUs. Use CPU nodes for data preparation and reserve GPU exclusively for model computation.
GPU Memory Math (Ch4)¶
Use these formulas to calculate whether your model fits in GPU VRAM before you provision.
Model Parameter Memory¶
| Precision | Formula | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | params × 4 bytes | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | params × 2 bytes | 14 GB | 26 GB | 140 GB |
| INT8 | params × 1 byte | 7 GB | 13 GB | 70 GB |
| INT4 (GPTQ/AWQ) | params × 0.5 bytes | 3.5 GB | 6.5 GB | 35 GB |
Training Memory (Full Fine-Tuning)¶
Total VRAM ≈ Model weights + Gradients + Optimizer states + Activations
≈ (params × 2) + (params × 2) + (params × 8) + activations
≈ params × 12 bytes (AdamW, FP16 mixed precision) + activations
Example: 7B model with AdamW → 7B × 12 = 84 GB before activations → requires 2× A100-80GB (or optimizer-state offloading, e.g. DeepSpeed ZeRO; gradient checkpointing shrinks only the activation term, so it won't fit the 84 GB base on a single 80 GB card).
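The two memory formulas above can be captured in a small helper, a minimal sketch using the same byte-per-parameter figures as the tables (decimal GB; activations excluded):

```python
# Bytes per parameter at each precision (matches the table above)
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def param_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Weight memory in GB: params × bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

def training_memory_gb(params_billions: float) -> float:
    """Full fine-tuning floor with AdamW in FP16 mixed precision:
    weights (2) + gradients (2) + optimizer states (8) = 12 bytes/param,
    before activations."""
    return params_billions * 12.0

print(param_memory_gb(7))            # 14.0 -> Llama 2 7B in FP16
print(param_memory_gb(70, "int4"))   # 35.0 -> 70B quantized to INT4
print(training_memory_gb(7))         # 84.0 -> 7B full fine-tune floor
```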
Inference Memory (Quick Estimate)¶
Total VRAM ≈ Model weights (per the precision table above) + KV cache + ~20% overhead
💡 Pro Tip: When a model just barely fits in VRAM, you'll OOM under load because the KV cache grows with sequence length and batch size. Leave 20% VRAM headroom.
GPU Count Calculator¶
| Model | Precision | Min VRAM | Fits On |
|---|---|---|---|
| Llama 2 7B | FP16 | 14 GB | 1× T4 (tight) or 1× A10 |
| Llama 2 13B | FP16 | 26 GB | 1× A100-80GB |
| Llama 2 70B | FP16 | 140 GB | 2× A100-80GB |
| Llama 2 70B | INT4 | 35 GB | 1× A100-80GB |
⚠️ Production Gotcha: These are weights only. KV cache for a 4K context with batch size 32 can add 8-12 GB. Always benchmark actual memory under realistic load.
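To put a number on that KV cache growth, here is a sketch of the standard KV cache formula (2 for K and V, times layers, KV heads, head dim, sequence length, batch size, and bytes per element). The Llama-2-7B-like dimensions in the example (32 layers, 32 KV heads, head dim 128) are illustrative assumptions; models with grouped-query attention have far fewer KV heads and a much smaller cache.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: 2 (K and V) x layers x KV heads x head dim
    x sequence length x batch size x bytes per element (2 for FP16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem / 1e9

# Llama-2-7B-like dims, FP16, 4K context:
print(round(kv_cache_gb(32, 32, 128, 4096, 1), 2))  # 2.15 GB per sequence
print(round(kv_cache_gb(32, 32, 128, 4096, 4), 2))  # 8.59 GB at batch 4
```

Run this with your model's actual config before trusting any "it fits" conclusion.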
Security Checklist for AI Workloads (Ch8)¶
| # | Control | Description | Priority |
|---|---|---|---|
| 1 | Managed Identity for all services | No service principal secrets; use SystemAssigned or UserAssigned | 🔴 Critical |
| 2 | Private Endpoints everywhere | Azure ML, Storage, ACR, Key Vault, OpenAI — no public endpoints | 🔴 Critical |
| 3 | Key Vault for all secrets | API keys, connection strings, certificates — never in code or env vars | 🔴 Critical |
| 4 | Network segmentation | Hub-spoke VNet, NSGs restricting east-west traffic | 🔴 Critical |
| 5 | Diagnostic logs enabled | All resources → central Log Analytics workspace | 🟡 High |
| 6 | RBAC least privilege | Cognitive Services User not Contributor; scope to resource not RG | 🟡 High |
| 7 | Prompt injection testing | Input validation, content filtering enabled on Azure OpenAI | 🟡 High |
| 8 | Data encryption | TLS 1.2+ in transit, SSE with CMK at rest for sensitive data | 🟡 High |
| 9 | Egress control | NSG/Firewall restricting outbound from inference nodes | 🟡 High |
| 10 | Container image scanning | Only signed images from a private ACR; block the "latest" tag | 🟠 Medium |
| 11 | API rate limiting | Azure API Management or Application Gateway WAF | 🟠 Medium |
| 12 | Model artifact integrity | Checksum validation on model files pulled from storage | 🟠 Medium |
Monitoring and Observability (Ch7)¶
The GPU Golden Signals¶
| Signal | Metric | Source | Alert Threshold |
|---|---|---|---|
| Utilization | DCGM_FI_DEV_GPU_UTIL | DCGM Exporter | < 20% (waste) or > 95% (saturation) |
| Memory | DCGM_FI_DEV_FB_USED / FB_FREE | DCGM Exporter | > 90% used (OOM risk) |
| Temperature | DCGM_FI_DEV_GPU_TEMP | DCGM Exporter | > 83°C (thermal throttling) |
| Errors | DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | DCGM Exporter | > 0 (hardware fault) |
Application-Level Metrics¶
| Metric | What It Tells You | Tool |
|---|---|---|
| P50/P95/P99 inference latency | User experience and SLA compliance | Application Insights |
| Requests per second (RPS) | Load pressure and capacity planning | Application Insights |
| Error rate (4xx, 5xx) | Client misuse vs. server issues | Application Insights |
| Token throughput (TPM) | Azure OpenAI consumption rate | Azure Monitor |
| 429 rate | Throttling frequency | Azure Monitor / Log Analytics |
| Queue depth | Backpressure in async pipelines | Prometheus / custom metric |
Essential KQL Queries¶
// GPU VM performance over last 24 hours
InsightsMetrics
| where Name == "dcgm_gpu_utilization"
| where TimeGenerated > ago(24h)
| summarize avg(Val), max(Val), percentile(Val, 95) by bin(TimeGenerated, 5m), Computer
| render timechart
// Azure OpenAI 429 errors by deployment
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where ResultType == "429"
| summarize count() by bin(TimeGenerated, 1h), _ResourceId
| render barchart
MLOps Commands (Ch6)¶
Azure ML CLI v2 Essentials¶
# Register a model from local files
az ml model create --name fraud-detector --version 1 \
--path ./model/ --type custom_model
# Create a managed online endpoint
az ml online-endpoint create --name fraud-api \
--auth-mode aml_token
# Deploy a model to the endpoint
az ml online-deployment create --name blue \
--endpoint-name fraud-api \
--model azureml:fraud-detector:1 \
--instance-type Standard_NC4as_T4_v3 \
--instance-count 2 \
--file deployment.yml
# Blue-green traffic split
az ml online-endpoint update --name fraud-api \
--traffic "blue=80 green=20"
# Test the endpoint
az ml online-endpoint invoke --name fraud-api \
--request-file sample-request.json
Model Lifecycle Commands¶
# List model versions
az ml model list --name fraud-detector --output table
# Archive old model version (soft delete)
az ml model archive --name fraud-detector --version 1
# Download model artifacts for inspection
az ml model download --name fraud-detector --version 2 \
--download-path ./local-model/
MLOps Pipeline Trigger¶
# Submit a training pipeline run
az ml job create --file pipeline.yml \
--set inputs.training_data.path=azureml:transactions:latest
# Monitor a running job
az ml job show --name <job-name> --output table
# Stream job logs
az ml job stream --name <job-name>
💡 Pro Tip: Always version your models with semantic versioning in the model registry. When an inference endpoint degrades, you need to roll back to fraud-detector:3 not "whatever was deployed last Tuesday."
Infrastructure Quick Deploy Commands (Ch5)¶
Create a GPU VM¶
az vm create \
--resource-group rg-ai-lab \
--name vm-gpu-inference \
--image Ubuntu2204 \
--size Standard_NC4as_T4_v3 \
--admin-username azureuser \
--generate-ssh-keys \
--priority Spot \
--max-price 0.35 \
--eviction-policy Deallocate
⚠️ Production Gotcha: Always check GPU quota before deploying: az vm list-usage --location eastus -o table | grep -i "NC". Quota requests take 1-5 business days.
AKS GPU Node Pool (Terraform)¶
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpuinfer"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_NC4as_T4_v3"
  mode                  = "User"
  auto_scaling_enabled  = true
  min_count             = 0
  max_count             = 6
  node_labels = {
    "hardware" = "gpu"
    "gpu-type" = "t4"
  }
  node_taints = ["nvidia.com/gpu=present:NoSchedule"]
  tags = {
    Environment  = "production"
    WorkloadType = "inference"
    CostCenter   = "ai-platform"
  }
}
Kubernetes GPU Pod Manifest¶
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
  nodeSelector:
    hardware: gpu
  containers:
    - name: model
      image: myacr.azurecr.io/inference:v2.1
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
          memory: "8Gi"
          cpu: "4"
Cost Engineering Quick Reference (Ch9)¶
Cost Optimization Levers¶
| Lever | Savings Range | Risk | Best For |
|---|---|---|---|
| Spot VMs | 60-90% | Eviction (batch only) | Training, batch inference |
| Reserved Instances (1-yr) | 20-35% | Commitment | Steady-state inference |
| Reserved Instances (3-yr) | 40-55% | Long commitment | Permanent workloads |
| VM Scheduling | 30-50% | Cold start latency | Business-hours-only workloads |
| Right-sizing | 15-40% | Performance testing needed | Over-provisioned clusters |
| PTU vs Standard (OpenAI) | 20-50% | Capacity commitment | Sustained >100K TPM |
| Quantization (INT8/INT4) | 50-75% GPU reduction | Slight accuracy loss | Inference (not training) |
Quick Cost Estimation¶
Monthly GPU cost = (Hourly rate × Hours/day × 30) × VM count
= ($0.53 × 12 × 30) × 4 = $763/month (T4, 12hr/day)
Annual savings from RI = Monthly cost × 12 × 0.30 (avg 1-yr discount)
= $763 × 12 × 0.30 = $2,747/year
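The same arithmetic as a reusable sketch (the rates and the ~30% average discount are the illustrative figures from above):

```python
def monthly_gpu_cost(hourly_rate: float, hours_per_day: float, vm_count: int) -> float:
    """(Hourly rate x hours/day x 30 days) x VM count."""
    return hourly_rate * hours_per_day * 30 * vm_count

def annual_ri_savings(monthly_cost: float, discount: float = 0.30) -> float:
    """Annual savings at an average 1-yr Reserved Instance discount."""
    return monthly_cost * 12 * discount

cost = monthly_gpu_cost(0.53, 12, 4)   # 4x T4 at $0.53/hr, 12 hr/day
print(round(cost))                     # 763
print(round(annual_ri_savings(cost)))  # 2748
```

(The cheatsheet's $2,747 comes from rounding the monthly figure to $763 before multiplying.)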
Azure Cost Management Commands¶
# Current month spend by resource group
az consumption usage list \
--start-date $(date -d "$(date +%Y-%m-01)" +%Y-%m-%d) \
--end-date $(date +%Y-%m-%d) \
--query "[?contains(instanceName,'gpu')].{Name:instanceName,Cost:pretaxCost}" \
--output table
# Create a budget with alerts
az consumption budget create \
--budget-name ai-gpu-budget \
--amount 5000 \
--category cost \
--time-grain monthly \
--start-date 2025-01-01 \
--end-date 2025-12-31
Chargeback Tagging (Azure Policy)¶
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Compute/virtualMachines" },
      { "field": "Microsoft.Compute/virtualMachines/vmSize", "like": "Standard_N*" },
      { "field": "tags['CostCenter']", "exists": false }
    ]
  },
  "then": { "effect": "deny" }
}
💡 Pro Tip: Set up a weekly cost anomaly alert at 120% of your trailing 4-week average. GPU costs can spike fast when someone forgets to deallocate a training VM.
Platform Ops Commands (Ch10)¶
Namespace Isolation for Multi-Tenancy¶
# Resource quota per team namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.nvidia.com/gpu: "4"
    requests.cpu: "32"
    requests.memory: "128Gi"
    pods: "50"
---
# Limit range to prevent single-pod resource hogging
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: team-alpha
spec:
  limits:
    - type: Container
      max:
        nvidia.com/gpu: "2"
        memory: "64Gi"
      default:
        cpu: "2"
        memory: "8Gi"
Network Policy for Namespace Isolation¶
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-namespace
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: team-alpha
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: team-alpha
    - to: # Allow DNS
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
GPU Time-Slicing Configuration¶
# NVIDIA device plugin ConfigMap for time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
⚠️ Production Gotcha: GPU time-slicing provides no memory isolation. If Pod A allocates 14 GB on a 16 GB T4, Pod B will OOM even though Kubernetes shows the GPU as "available." Only use time-slicing in dev/test.
Azure OpenAI Throughput Reference (Ch11)¶
Deployment Types Comparison¶
| Attribute | Standard (PayGo) | Provisioned (PTU) | Global Standard |
|---|---|---|---|
| Billing | Per 1K tokens | Hourly (per PTU) | Per 1K tokens |
| Latency guarantee | None (best effort) | SLA-backed | None |
| Throttling | TPM/RPM limits | Provisioned capacity | Higher limits |
| Best for | Dev/test, bursty | Production, steady | Global routing |
| Min commitment | None | 1 month | None |
| Scale | Auto (with limits) | Fixed provisioned | Auto |
PTU Sizing Estimator¶
Step 1: Measure average tokens per request
Avg input tokens: 800
Avg output tokens: 400
Total per request: 1,200
Step 2: Calculate sustained TPM
Requests per minute: 150
Sustained TPM = 150 × 1,200 = 180,000 TPM
Step 3: Use Azure capacity calculator or:
~1 PTU ≈ 3,600 TPM (GPT-4o, approximate)
PTUs needed ≈ 180,000 ÷ 3,600 = 50 PTUs
Step 4: Add 25% headroom for bursts
Recommended: 63 PTUs
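The four steps above, condensed into one function (the ~3,600 TPM-per-PTU ratio is the same approximation as Step 3; confirm real numbers with the Azure capacity calculator):

```python
import math

def ptus_needed(avg_input_tokens: float, avg_output_tokens: float,
                requests_per_minute: float,
                tpm_per_ptu: float = 3600, headroom: float = 0.25) -> int:
    """Sustained TPM divided by TPM-per-PTU, plus burst headroom, rounded up."""
    sustained_tpm = (avg_input_tokens + avg_output_tokens) * requests_per_minute
    return math.ceil(sustained_tpm / tpm_per_ptu * (1 + headroom))

print(ptus_needed(800, 400, 150))  # 63
```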
Token Estimation Rules of Thumb¶
| Content Type | Approximate Tokens |
|---|---|
| 1 English word | ~1.3 tokens |
| 1 page of text (~500 words) | ~650 tokens |
| 1 line of code | ~10-15 tokens |
| A 100-line code file | ~1,200 tokens |
| JSON payload (1 KB) | ~300 tokens |
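A back-of-envelope estimator built on these rules of thumb (the ratios are rough averages, not tokenizer output; use a real tokenizer such as tiktoken when precision matters):

```python
def estimate_tokens(words: int = 0, code_lines: int = 0, json_kb: float = 0) -> int:
    """Rough token count from the rules of thumb above:
    ~1.3 tokens/word, ~12 tokens/code line, ~300 tokens/KB of JSON."""
    return round(words * 1.3 + code_lines * 12 + json_kb * 300)

print(estimate_tokens(words=500))       # 650  (one page of text)
print(estimate_tokens(code_lines=100))  # 1200 (a 100-line file)
```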
Azure OpenAI Monitoring Commands¶
# Check current deployments and their TPM limits
az cognitiveservices account deployment list \
--resource-group rg-ai-prod \
--name aoai-prod \
--output table
# Query 429 throttling events (last 24h)
az monitor metrics list \
--resource /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.CognitiveServices/accounts/{name} \
--metric "AzureOpenAIRequests" \
--dimension StatusCode \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--interval PT1H \
--output table
💡 Pro Tip: If your 429 rate exceeds 5%, you're leaving performance on the table. Either increase your TPM quota, migrate to PTU, or implement client-side exponential backoff with jitter.
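The backoff pattern from the tip, sketched with full jitter. ThrottledError and the call shape are stand-ins, not an SDK API; in practice, catch your client library's 429 error and honor the Retry-After header as a minimum delay:

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for an HTTP 429 response from the service (hypothetical)."""

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: uniform pick from [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_retries: int = 5, sleep=time.sleep):
    """Call fn(), retrying throttled attempts with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429
            sleep(backoff_delay(attempt))
```

Jitter matters: if every client backs off on the same deterministic schedule, the retries re-synchronize into a thundering herd and keep tripping the limit.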
Troubleshooting Decision Tree (Ch12)¶
GPU Not Detected¶
GPU not visible to workload?
├── Check VM SKU → Is it an N-series VM?
│ └── No → Wrong VM SKU. Redeploy with NC/ND/NV series.
├── nvidia-smi returns error?
│ ├── Driver not installed → Install NVIDIA driver extension:
│ │ az vm extension set --name NvidiaGpuDriverLinux \
│ │ --publisher Microsoft.HpcCompute --version 1.9 \
│ │ --vm-name <vm> --resource-group <rg>
│ └── Driver version mismatch → Check CUDA compatibility matrix
├── In AKS pod, `nvidia-smi` not found?
│ ├── Missing GPU device plugin → Deploy NVIDIA device plugin DaemonSet
│ ├── Pod missing resource request → Add `nvidia.com/gpu: 1` to limits
│ └── Pod not scheduled on GPU node → Check tolerations and nodeSelector
└── nvidia-smi shows GPU but CUDA fails?
└── CUDA toolkit version doesn't match driver → Rebuild container with matching CUDA
Inference Latency Spikes¶
P95 latency exceeding SLA?
├── Check GPU utilization
│ ├── > 95% → GPU saturated. Scale out (more replicas) or optimize batch size.
│ └── < 50% → GPU is not the bottleneck. Check below.
├── Check GPU memory
│ ├── > 90% → KV cache pressure. Reduce max sequence length or batch size.
│ └── Normal → Not memory-bound. Check below.
├── Check CPU utilization
│ ├── > 80% → Preprocessing bottleneck. Move to dedicated CPU pods.
│ └── Normal → Check below.
├── Check network / storage
│ ├── Model loading slow → Cache model on local NVMe, not Blob Storage.
│ └── High network latency → Check Private Endpoint routing.
└── Check application code
├── No request batching → Implement dynamic batching (batch size 8-32).
└── Synchronous preprocessing → Move to async pipeline.
Azure OpenAI 429 Throttling¶
Getting HTTP 429 responses?
├── Check Retry-After header value
│ ├── < 10 seconds → Implement exponential backoff with jitter
│ └── > 60 seconds → You've hit a hard quota limit
├── Check deployment TPM/RPM limits
│ ├── Near limit → Request quota increase or add second deployment
│ └── Well under limit → Check for retry storms amplifying actual TPM
├── Multiple consumers sharing one deployment?
│ └── Yes → Separate deployments per consumer with dedicated quotas
└── Sustained high usage?
└── Evaluate PTU migration (see PTU Sizing above)
OOM (Out of Memory) Errors¶
CUDA out of memory error?
├── Calculate model memory requirement (see GPU Memory Math above)
│ ├── Model too large for GPU → Quantize (INT8/INT4) or use larger GPU
│ └── Model fits → Batch size or sequence length too high
├── Reduce batch size by 50%, retry
│ ├── Works → Gradually increase to find sweet spot
│ └── Still OOM → Enable gradient checkpointing (training) or reduce max_seq_len
├── Multi-GPU available?
│ └── Enable tensor parallelism (inference) or ZeRO Stage 3 (training)
└── Still failing?
└── Profile with `torch.cuda.memory_summary()` to find the allocation spike
Performance and Throughput Formulas (Ch3, Ch11)¶
| Metric | Formula | Example |
|---|---|---|
| TPM | Avg tokens/request × RPM | 1,200 × 150 = 180K TPM |
| QPS | RPM ÷ 60 | 300 RPM = 5 QPS |
| Model VRAM (FP16) | Parameters × 2 bytes | 7B × 2 = 14 GB |
| Training VRAM (AdamW) | Parameters × 12 bytes + activations | 7B × 12 = 84 GB + activations |
| GPU utilization efficiency | Active compute time ÷ Total allocated time | 80% = healthy |
| Cost per 1K inferences | (VM $/hr ÷ inferences/hr) × 1000 | ($0.53 ÷ 3600) × 1000 = $0.15 |
| PTU break-even TPM | (PTU hourly cost ÷ Standard cost per token) ÷ 60 | Varies by model and region |
Where to Run Your Model — Decision Flow (Ch3, Ch10)¶
Is it a proprietary/custom model?
├── Yes → Do you need to train it?
│ ├── Yes → Azure ML Compute Clusters (managed) or GPU VMs (full control)
│ └── No (inference only) → How much traffic?
│ ├── < 10 QPS → Azure ML Managed Endpoint or single GPU VM
│ ├── 10-100 QPS → AKS with GPU node pool + HPA
│ └── > 100 QPS → AKS multi-node GPU pool + cluster autoscaler
└── No (using a foundation model) → Azure OpenAI Service
├── Dev/test or bursty → Standard (PayGo) deployment
├── Sustained production → Provisioned (PTU) deployment
└── Need global distribution → Global Standard deployment
Resource Tagging Convention (Ch9, Ch10)¶
| Tag | Example | Purpose | Required By |
|---|---|---|---|
| Environment | dev, staging, prod | Lifecycle stage | Azure Policy |
| CostCenter | CC-4521 | Chargeback and budgeting | Azure Policy |
| Team | ml-platform | Ownership and escalation | Convention |
| WorkloadType | training, inference, batch | Cost analysis and scheduling | Convention |
| Model | fraud-detector-v2, gpt-4o | Correlate to model metrics | Convention |
| DataClassification | public, confidential, restricted | Security controls | Azure Policy |
| AutoShutdown | true, business-hours | Cost automation triggers | Automation Runbook |