Chapter 5 — Infrastructure as Code for AI Full Book
"Instead of Standard_NC6s_v3, the node pool is running Standard_D16s_v5 — a general-purpose CPU VM with no GPU at all."
📖 This chapter is available in the full book
It started as a win. You manually provisioned a GPU cluster in East US 2 for an ML experiment — an AKS cluster with a Standard_NC6s_v3 node pool, accelerated networking, the right NVIDIA drivers, proper taints. It took most of a day, but it worked.
Three weeks later, the same team needs the identical setup in West US 3. Two days later, it's "done." Except it isn't. Someone fat-fingered the VM SKU. The training job launches, finds no CUDA device, falls back to CPU, and grinds along at a fraction of the expected speed. Nobody notices for three days. By the time someone checks, the cluster has burned through $4,000 in compute on a VM that can't do the one thing it was provisioned for.
What You'll Learn in This Chapter
- The $4,000 Typo
- Why IaC Is Non-Negotiable for AI
- The IaC Landscape for AI
- Terraform for AI Infrastructure
- Bicep for AI Infrastructure
- CI/CD Pipelines for AI Infrastructure
- Governance and Guardrails
- Hands-On: Deploy an AKS GPU Cluster with Terraform