Technical Case Studies — AI for Infrastructure Engineers¶
Real-world scenarios showing how infrastructure teams deliver measurable business impact with AI workloads on Azure. Each case study maps directly to the concepts covered in this book, so you can trace every decision back to a specific chapter.
These are production stories — not proofs of concept. Every metric is grounded in realistic operational data, and every architecture decision reflects the trade-offs infrastructure engineers face daily.
Case Study 1 — GPU Cluster Buildout for Large-Scale Model Training¶
📖 Chapters Referenced: Ch3 (Compute), Ch4 (GPU Deep Dive), Ch5 (IaC)
Scenario¶
A financial services firm needed to fine-tune a 13-billion-parameter language model on proprietary transaction data for fraud detection. Their data science team had validated the approach using a single A100 GPU in a research environment, but the full training run required multi-node distributed training across 8 nodes — 64 GPUs total — with a 72-hour training window to meet a regulatory deadline.
The infrastructure team had deep experience with high-availability compute clusters but had never provisioned GPU hardware at this scale. They faced unfamiliar challenges: InfiniBand networking requirements, NCCL collective communication tuning, and GPU memory management for a model that consumed 48 GB of VRAM per GPU before optimizer states were even loaded.
The CTO gave them three weeks to go from zero GPU infrastructure to a production training run — with full observability and cost controls in place from day one.
Azure Services and Configuration¶
| Component | Service / SKU | Configuration |
|---|---|---|
| Compute | Standard_ND96asr_v4 (8× A100 80 GB) | 8 nodes, InfiniBand enabled, Proximity Placement Group |
| Networking | Accelerated Networking + InfiniBand | NCCL NCCL_IB_HCA=mlx5 environment variable set |
| Storage | Azure NetApp Files (Ultra tier) | 100 TiB volume, 4,500 MiB/s throughput for checkpoint I/O |
| IaC | Bicep modules | Parameterized templates for node count, SKU, and region |
| Monitoring | Azure Monitor + DCGM Exporter | GPU memory, SM occupancy, NVLink throughput dashboards |
| Cost control | Azure Budgets + auto-shutdown Logic App | $85K budget cap with 80%/90%/100% alert thresholds |
What the Infra Team Did¶
- Quota validation — Submitted quota requests for 768 A100 vCPUs in
southcentralustwo weeks early. First request was denied; they split acrosssouthcentralus(512) andeastus2(256) and got approved in 4 business days. - Network topology — Deployed all nodes in a single Proximity Placement Group within one availability zone. Configured InfiniBand SR-IOV with the
ND96asr_v4's built-in Mellanox ConnectX-6 adapters. Validated inter-node bandwidth at 200 Gbps usingib_write_bw. - Storage design — Chose Azure NetApp Files over Blob Storage after benchmarking checkpoint writes. A single 25 GB checkpoint across 64 GPUs completed in 11 seconds on ANF versus 4+ minutes on Premium Blob.
- IaC automation — Built Bicep modules with a
nodeCountparameter so the cluster could scale from 2 to 16 nodes without code changes. CI/CD pipeline in Azure DevOps ranaz deployment group what-ifon every PR. - GPU memory planning — Calculated memory requirements using the formula from Ch4:
Model params (13B) × 4 bytes (FP32) = 52 GB base. With mixed-precision training (BF16) and ZeRO Stage 3 optimizer sharding, actual per-GPU usage dropped to 38 GB — within the 80 GB A100 envelope. - Monitoring from day one — Deployed DCGM Exporter as a DaemonSet equivalent on each VM, feeding Prometheus. Built Grafana dashboards showing per-GPU SM activity, memory utilization, and NVLink throughput. Set alerts for GPU memory > 95% and SM occupancy < 20% (indicating a stalled rank).
Quantified Results¶
| Metric | Result |
|---|---|
| Training time (13B model, 500K samples) | 68 hours (under 72-hour deadline) |
| GPU utilization (average across 64 GPUs) | 87% SM occupancy |
| Checkpoint I/O time | 11 seconds per checkpoint (every 2 hours) |
| Total infrastructure cost | $73,400 (under $85K budget) |
| Time from zero to first training run | 16 days |
Lessons Learned¶
- Quota lead time is your critical path — start GPU quota requests before any other work. (Ch3)
- InfiniBand is non-negotiable for multi-node training — standard Ethernet added 3× communication overhead in their benchmarks. (Ch4)
- GPU memory math eliminates guesswork — the per-layer memory formula prevented two OOM incidents during hyperparameter sweeps. (Ch4)
- IaC pays for itself on the first resize — when the data science team requested a 12-node run, the infra team delivered in 20 minutes by changing one parameter. (Ch5)
Case Study 2 — Infrastructure as Code for Multi-Environment ML Pipelines¶
📖 Chapters Referenced: Ch5 (IaC), Ch6 (MLOps), Ch8 (Security)
Scenario¶
A healthcare analytics company ran their ML training pipelines using manually provisioned Azure ML workspaces. Each data scientist had their own workspace with inconsistent configurations — different compute SKUs, no network isolation, and secrets stored in plaintext config files. When an audit revealed PHI (Protected Health Information) was being processed in a workspace without Private Link, the CISO mandated a full infrastructure rebuild with compliance controls.
The platform engineering team had 6 weeks to deliver a repeatable, auditable infrastructure stack that supported three environments (dev, staging, prod) with identical security controls. They also needed to integrate the infrastructure lifecycle with the data science team's model training pipelines so that environment provisioning and model deployment were part of the same CI/CD flow.
Azure Services and Configuration¶
| Component | Service | Configuration |
|---|---|---|
| IaC | Terraform (AzureRM provider 3.x) | Modules for workspace, compute, networking, RBAC |
| ML Platform | Azure Machine Learning | Private Link-enabled workspace, managed VNet |
| Networking | VNet + Private Endpoints + NSGs | Hub-spoke topology, no public endpoints |
| Secrets | Azure Key Vault | Managed Identity access, no API key usage |
| CI/CD | GitHub Actions | terraform plan on PR, terraform apply on merge to main |
| Policy | Azure Policy | Deny public endpoints, require encryption, enforce tagging |
What the Infra Team Did¶
- Module architecture — Created four Terraform modules:
network(VNet, subnets, NSGs, Private DNS zones),workspace(Azure ML workspace, linked Key Vault, Storage, ACR),compute(training clusters with auto-scale), andrbac(role assignments per environment). Each module was versioned independently in a Terraform registry. - Environment parity — Used
tfvarsfiles per environment with identical module references. The only differences were compute SKU sizes (dev:Standard_DS3_v2, prod:Standard_NC6s_v3) and node counts. - Security hardening — Deployed Azure Policy assignments that denied creation of any Azure ML workspace without Private Link. Added Key Vault access policies using Managed Identity only — no service principal secrets. Enabled diagnostic settings on every resource, forwarding to a central Log Analytics workspace.
- MLOps integration — The model training pipeline (defined in Azure ML YAML) referenced compute targets by name. Since Terraform created compute targets with deterministic names (
train-cpu-dev,train-gpu-prod), the data science team's pipeline YAML worked across environments without modification. - Drift detection — Configured a nightly GitHub Actions workflow that ran
terraform planagainst each environment and opened an issue if drift was detected. In the first month, it caught 3 manual changes made through the portal.
Quantified Results¶
| Metric | Result |
|---|---|
| Environment provisioning time | 22 minutes (was 2-3 days manual) |
| Security audit findings | 0 critical (was 14) |
| Configuration drift incidents (monthly) | 3 detected and remediated automatically |
| Developer onboarding time (new data scientist) | 4 hours (was 2 days) |
| Terraform modules reused across teams | 3 teams adopted the same modules within 2 months |
Lessons Learned¶
- IaC is a security control, not just automation — the audit passed because infrastructure was code-reviewed, not because of a checklist. (Ch5)
- MLOps and IaC must share a naming contract — when compute target names are deterministic, pipeline YAML becomes environment-agnostic. (Ch6)
- Private Link adds 15-20 minutes to provisioning — plan for it in CI/CD timeouts and communicate the trade-off to stakeholders. (Ch8)
💡 Pro Tip: Run terraform plan in your PR pipeline and post the output as a PR comment. Reviewers catch misconfigurations before they reach any environment.
Case Study 3 — Observability Platform for GPU-Accelerated Inference¶
📖 Chapters Referenced: Ch7 (Monitoring), Ch4 (GPU Deep Dive), Ch12 (Troubleshooting)
Scenario¶
A media company ran a real-time content moderation system powered by a vision transformer model deployed on AKS with GPU node pools. The system processed 15,000 images per minute during peak hours. The operations team had basic Kubernetes monitoring (CPU, memory, pod restarts) but no GPU-specific telemetry. When latency spikes occurred during traffic surges, they had no way to distinguish between GPU saturation, model inefficiency, and storage I/O bottlenecks.
After a 45-minute outage where P95 latency exceeded 8 seconds (SLA was 500 ms), the VP of Engineering mandated a full observability overhaul. The infrastructure team needed to build a monitoring stack that could pinpoint the root cause of latency degradation within 2 minutes — fast enough for the on-call engineer to act before users noticed.
Azure Services and Configuration¶
| Component | Service | Configuration |
|---|---|---|
| Compute | AKS with Standard_NC16as_T4_v3 node pool | 6 nodes, cluster autoscaler (min 4, max 12) |
| GPU metrics | DCGM Exporter (DaemonSet) | Exported: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_SM_CLOCK |
| Metrics pipeline | Prometheus (Azure Managed) + Grafana | 15-second scrape interval, 30-day retention |
| Application telemetry | Application Insights (SDK instrumented) | Custom metrics: inference_latency_ms, batch_size, model_version |
| Alerting | Azure Monitor Action Groups | PagerDuty integration for P1, Teams webhook for P2/P3 |
| Log aggregation | Container Insights + Log Analytics | KQL queries for pod-level correlation |
What the Infra Team Did¶
- GPU telemetry pipeline — Deployed DCGM Exporter as a DaemonSet with tolerations for the GPU node pool taint. Configured Prometheus to scrape GPU metrics at 15-second intervals. Built Grafana dashboards with four panels: GPU utilization (%), GPU memory usage (MB), SM clock frequency (MHz), and PCIe throughput (GB/s).
- Application-level instrumentation — Worked with the ML engineering team to add custom Application Insights metrics to the inference service. Every request logged:
inference_latency_ms,preprocessing_time_ms,postprocessing_time_ms,batch_size, andmodel_version. This separated model computation time from I/O overhead. - Correlation dashboards — Built a Grafana dashboard that overlaid GPU utilization, P95 inference latency, and pod count on the same time axis. This immediately revealed a pattern: latency spikes correlated with GPU memory > 90%, not GPU compute utilization. The bottleneck was memory pressure from large batch sizes, not insufficient GPU cores.
- Alert hierarchy — Defined three alert tiers:
- P1 (PagerDuty, immediate): P95 latency > 2 seconds for 3 consecutive minutes
- P2 (Teams, 15-min response): GPU memory > 85% sustained for 10 minutes
- P3 (Daily digest): GPU utilization < 30% for 1 hour (waste detection)
- Runbook automation — Created an Azure Automation runbook triggered by P2 alerts that automatically reduced the inference batch size from 32 to 16, buying the team time to investigate without user impact.
Quantified Results¶
| Metric | Before | After |
|---|---|---|
| Mean time to detect (MTTD) | 12 minutes | 90 seconds |
| Mean time to resolve (MTTR) | 45 minutes | 8 minutes |
| P95 latency (peak hours) | 1,200 ms (with spikes to 8s) | 380 ms (stable) |
| False positive alerts (monthly) | 47 | 6 |
| GPU waste identified (monthly savings) | — | $2,800 from right-sizing batch sizes |
Lessons Learned¶
- GPU utilization alone is misleading — high utilization looks healthy, but if GPU memory is saturated, latency still degrades. Always monitor both. (Ch4, Ch7)
- Application-level metrics are non-negotiable — without
inference_latency_msbroken down by phase, the team would have blamed the GPU when the real bottleneck was image preprocessing on CPU. (Ch7) - Structured alert tiers prevent alert fatigue — the old system had 47 false positives per month. The new tiered approach reduced noise by 87%. (Ch7, Ch12)
⚠️ Production Gotcha: DCGM Exporter's default metric set is enormous. Export only the 8-10 metrics you actually dashboard — otherwise Prometheus storage costs will surprise you.
Case Study 4 — Cost Engineering for a Mixed GPU/CPU Inference Fleet¶
📖 Chapters Referenced: Ch9 (Cost Engineering), Ch3 (Compute), Ch11 (Azure OpenAI)
Scenario¶
A B2B SaaS company operated three AI-powered features: document summarization (Azure OpenAI GPT-4o), image classification (custom vision model on GPU VMs), and anomaly detection (classical ML on CPU). Their monthly Azure AI spend had grown from $18K to $67K in 6 months with no corresponding increase in customer usage. The CFO demanded a 35% cost reduction without degrading service quality.
The infrastructure team discovered several cost drivers: GPU VMs running 24/7 for a workload that peaked only 10 hours per day, Azure OpenAI standard deployments with 40% of requests hitting 429 throttling (triggering expensive retries), and no chargeback visibility — three product teams consumed GPU resources but costs were allocated to a single "AI Infrastructure" budget line.
What the Infra Team Did¶
- Usage analysis — Exported 90 days of Azure Cost Management data and correlated it with Prometheus GPU utilization metrics. Found that GPU VMs averaged 22% utilization during off-peak hours (10 PM – 8 AM) and 78% during business hours.
- GPU scheduling — Implemented Azure Automation runbooks to deallocate the image classification VMs from 10 PM to 7 AM daily. Added a 30-minute warm-up buffer with health checks before the first request was routed. Saved 37% on GPU compute.
- Azure OpenAI migration to PTU — Analyzed token consumption patterns: the document summarization feature consumed a steady 180K TPM during business hours. Migrated from standard (pay-per-token) to 300 PTU, which provided 200K+ TPM at a fixed hourly rate. This eliminated 429 errors and reduced cost per token by 42%.
- Spot VMs for batch workloads — Moved the nightly model retraining job (anomaly detection) to
Standard_NC4as_T4_v3Spot VMs with a max price of 60% of on-demand. Implemented checkpointing every 30 minutes so evictions only lost 30 minutes of work. Spot eviction rate averaged 3% over 90 days. - Chargeback tagging — Implemented mandatory resource tags (
CostCenter,Product,WorkloadType) enforced by Azure Policy. Built a Power BI dashboard showing cost per product team per AI feature. Within one month, the image classification team voluntarily reduced their batch size after seeing their per-request cost. - Reserved Instances — Purchased 1-year reservations for the 4 GPU VMs that ran during business hours, saving an additional 21% versus pay-as-you-go.
Quantified Results¶
| Cost Lever | Monthly Savings | Percentage |
|---|---|---|
| GPU VM scheduling (off-hours deallocation) | $8,900 | 37% of GPU compute |
| Azure OpenAI PTU migration | $6,200 | 42% per-token cost reduction |
| Spot VMs for batch training | $1,800 | 58% vs. on-demand |
| Reserved Instances (1-year) | $4,100 | 21% of reserved VM costs |
| Total monthly savings | $21,000 | 31% of total AI spend |
Post-optimization monthly spend: $46,000 (down from $67,000).
Lessons Learned¶
- GPU utilization data is the foundation of cost engineering — without Prometheus metrics, the team would have guessed at scheduling windows. (Ch9)
- PTU is almost always cheaper for sustained workloads — the break-even point was roughly 120K TPM sustained for 8+ hours/day. (Ch11)
- Chargeback changes behavior — teams optimize when they see their own costs. Tag enforcement is a cost control. (Ch9)
💡 Pro Tip: Before committing to PTU, run a 2-week analysis of your standard deployment's actual TPM consumption. Azure OpenAI's metrics in Azure Monitor give you the exact numbers to calculate break-even.
Case Study 5 — Multi-Team Platform Operations for AI Workloads¶
📖 Chapters Referenced: Ch10 (Platform Ops), Ch5 (IaC), Ch8 (Security)
Scenario¶
A technology company with 200+ engineers had four product teams running AI workloads on Azure — each with their own AKS clusters, storage accounts, and networking configurations. The result was infrastructure sprawl: 14 AKS clusters, 23 storage accounts, inconsistent security policies, and no shared GPU capacity. When one team needed A100 GPUs for a training sprint, they waited 3 weeks for quota approval while another team's A100 nodes sat idle at 12% utilization.
The CTO tasked the platform engineering team with building a shared AI platform that provided self-service capabilities while maintaining security guardrails, cost visibility, and efficient GPU utilization across all teams.
Azure Services and Configuration¶
| Component | Service | Configuration |
|---|---|---|
| Compute platform | AKS (3 clusters: dev, staging, prod) | Shared GPU node pools with namespace isolation |
| Multi-tenancy | Kubernetes namespaces + NetworkPolicy | Per-team namespaces, resource quotas, limit ranges |
| GPU sharing | NVIDIA MPS + time-slicing | 2× time-slicing on T4 nodes for dev, dedicated for prod |
| IaC | Terraform + Atlantis | PR-based workflow, mandatory plan review |
| Self-service | Backstage developer portal | Templates for "new ML project" provisioning |
| Cost allocation | Kubecost + Azure Cost Management | Per-namespace cost attribution with daily reports |
| Security | Azure Policy + OPA Gatekeeper | Enforce private registries, block privileged containers |
What the Infra Team Did¶
- Cluster consolidation — Reduced 14 AKS clusters to 3 (dev, staging, prod). Each cluster used namespace-level isolation with Kubernetes ResourceQuotas capping per-team GPU requests (
requests.nvidia.com/gpu). NetworkPolicies enforced namespace-to-namespace isolation — teams could not access each other's inference endpoints. - GPU sharing strategy — Implemented NVIDIA GPU time-slicing on T4 dev nodes, allowing 2 workloads to share a single GPU. Production workloads on A100 nodes got dedicated GPU access — no time-slicing. This increased dev GPU utilization from 12% to 64% without purchasing additional hardware.
- Self-service provisioning — Built Backstage templates that let data scientists provision a new ML project namespace with: a GPU resource quota, a dedicated Azure Container Registry scope, Key Vault access for secrets, and a pre-configured Prometheus ServiceMonitor. Provisioning time dropped from a 3-week ticket to 15 minutes.
- GitOps workflow — All infrastructure changes went through Terraform PRs reviewed by the platform team via Atlantis. No
kubectl applywas allowed in production — ArgoCD synced manifests from Git. This created a full audit trail of every infrastructure change. - Cost attribution — Deployed Kubecost with Azure Cost Management integration. Each namespace's compute, storage, and network costs were attributed to the owning team's cost center. Monthly reports were auto-generated and sent to team leads. GPU idle time was flagged separately, showing cost of reserved-but-unused GPU hours.
- Guardrails via policy — OPA Gatekeeper policies enforced: no
latestimage tags, mandatory resource limits on all pods, GPU requests must include matching limits (preventing overallocation), and all images must come from the internal ACR. Azure Policy enforced network-level controls — no public load balancers, mandatory Private Link for storage access.
Quantified Results¶
| Metric | Before | After |
|---|---|---|
| AKS clusters | 14 | 3 |
| GPU utilization (dev) | 12% | 64% |
| GPU utilization (prod) | 41% | 78% |
| New project provisioning time | 3 weeks | 15 minutes |
| Monthly GPU compute spend | $89,000 | $52,000 (42% reduction) |
| Security policy violations (monthly) | Unknown | 0 (enforced by Gatekeeper) |
| Infrastructure team headcount for AI ops | 6 FTEs across 4 teams | 2 FTEs on platform team |
Lessons Learned¶
- Shared platforms are a cost multiplier — consolidating 14 clusters to 3 saved more than the GPU scheduling optimization did. (Ch10)
- GPU time-slicing is safe for dev, dangerous for prod — in production, a noisy neighbor on a shared GPU caused a 3× latency spike during their first week. They reverted to dedicated GPUs for prod immediately. (Ch10, Ch4)
- Self-service without guardrails is chaos; guardrails without self-service are bottlenecks — the Backstage + Gatekeeper combination gave teams speed with safety. (Ch10, Ch8)
- Cost visibility drives organic optimization — two teams voluntarily reduced their GPU quotas after seeing Kubecost reports, freeing capacity for the team that actually needed it. (Ch9)
⚠️ Production Gotcha: Kubernetes ResourceQuotas cap the number of GPUs a namespace can request, but they don't prevent a single pod from consuming all GPU memory. Always pair ResourceQuotas with container-level resources.limits enforcement via OPA Gatekeeper.
"The best AI infrastructure isn't the one with the most GPUs — it's the one where every GPU-hour produces business value."