🧪 AI infrastructure labs
Welcome to the hands-on labs section of AI for Infra Pros: The Practical Handbook for Infrastructure Engineers.
Each lab demonstrates how to apply the infrastructure concepts from the book in real-world Azure environments.
Lab scope and expectations
These labs are infrastructure-focused and designed for:
- Provisioning GPU-enabled environments
- Deploying inference-ready workloads
- Validating performance, access, and observability
They do not cover:
- Model training or fine-tuning
- Data science experimentation
- Advanced MLOps pipelines
The goal is to help infrastructure engineers confidently run AI workloads, not build models from scratch.
Lab index
| Lab | Description | Technologies |
|---|---|---|
| Lab 1: Bicep VM with GPU | Deploy a single GPU-enabled VM using Azure Bicep to host AI inference workloads. | Bicep, Azure CLI, NVIDIA Drivers |
| Lab 2: Terraform AKS GPU Cluster | Provision an Azure Kubernetes Service cluster with a dedicated GPU node pool for AI workloads. | Terraform, AKS, GPU, IaC |
| Lab 3: YAML Inference API (Azure ML) | Publish a trained model as an inference endpoint using Azure Machine Learning and YAML configuration. | Azure ML, YAML, CLI, REST API |
Prerequisites
Before running any of the labs:
- Have an active Azure Subscription
- Install the latest Azure CLI
- Install Terraform and/or Bicep depending on the lab
- Ensure GPU quotas are available in your target region
  - Common SKUs:
    - Standard_NC4as_T4_v3 (T4 inference)
    - Standard_NC6s_v3 (V100)
  - Check current quota usage in the target region before deploying
- Install and update the Azure ML CLI extension (tested with Azure CLI >= 2.55.0)
- Authenticate with Azure
- Have sufficient permissions (Owner or Contributor on the target Resource Group)
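The prerequisite checks above map onto a handful of Azure CLI calls. The following is a minimal sketch rather than the handbook's own commands: the function names are hypothetical, and the `grep "NC"` filter assumes that NC-family quota rows carry "NC" in their display name (it also drops the table header row).

```shell
# Hypothetical helpers for the prerequisite checks; names are illustrative.

# Sign in interactively, then select the subscription used for the labs.
# Argument: <subscription-id-or-name>
lab_login() {
  az login &&
  az account set --subscription "$1"
}

# Print vCPU quota usage for one region, keeping only NC-family rows.
# Assumption: NC-family quota names contain the string "NC".
# Argument: <location>, for example westus3
lab_gpu_quota() {
  az vm list-usage --location "$1" --output table | grep "NC"
}

# Install the Azure ML extension, or upgrade it if it is already installed.
lab_install_ml_extension() {
  az extension add --name ml --upgrade
}
```

After sourcing the snippet, a typical sequence would be `lab_login "<subscription-id>"`, then `lab_gpu_quota westus3`, then `lab_install_ml_extension`. Running `az version` shows whether the installed CLI meets the tested minimum.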
⚠️ Cost warning
These labs may create GPU-backed resources, which can incur significant costs if left running.
Always:
- Use the smallest GPU SKU possible
- Complete validation steps promptly
- Delete resource groups after finishing
GPU resources can cost $0.90 to $30+ per hour depending on SKU.
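To put the hourly range in perspective, here is a rough sketch of the arithmetic for a resource left running for a full day. The function name is hypothetical, and the two rates are simply the ends of the range quoted above, not a price quote for any particular SKU.

```shell
# Hypothetical helper: estimated cost = hourly rate (USD) x hours running.
# Arguments: <hourly-rate-usd> <hours>
estimate_gpu_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# A resource forgotten for 24 hours at each end of the quoted range:
estimate_gpu_cost 0.90 24   # prints 21.60
estimate_gpu_cost 30 24     # prints 720.00
```

Even the low end adds up over a forgotten weekend, which is why the cleanup step matters.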
Lab workflow
All labs follow a similar structure:
- Provision infrastructure (VM, AKS, or AML workspace)
- Configure access, security, and monitoring
- Deploy models or containers for inference
- Validate performance and connectivity
- Clean up resources to avoid unnecessary costs
Recommendations
- Prefer West US 3 or West Europe; they have historically offered broader GPU SKU availability, but quotas still apply
- Always tag resources with project and owner names
- Store deployment logs for auditing and rollback
- For production-grade deployments, add Private Endpoints and Azure Policy validation
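The tagging recommendation can be applied when the resource group is created. This is a minimal sketch under stated assumptions: the function name, the `rg-ai-lab` group name, and the tag values are hypothetical. `az group create` accepts space-separated `key=value` pairs for `--tags`.

```shell
# Hypothetical helper: create a resource group tagged with project and owner.
# Arguments: <resource-group> <location> <project> <owner>
create_tagged_group() {
  az group create \
    --name "$1" \
    --location "$2" \
    --tags "project=$3" "owner=$4"
}
```

For example, `create_tagged_group rg-ai-lab westus3 ai-labs alice 2>&1 | tee deploy.log` creates the group and keeps a copy of the CLI output in a local file, which is one simple way to retain a deployment log.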
Cleanup reminder
After finishing a lab, delete the resources it created to prevent billing surprises.
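One way to do this, assuming all of a lab's resources live in a single resource group (the name `rg-ai-lab` below is hypothetical, as is the function name), is to delete that group with the Azure CLI:

```shell
# Hypothetical helper: delete a lab's resource group and everything in it.
# --yes skips the confirmation prompt; --no-wait returns immediately
# instead of blocking until the deletion finishes.
# Argument: <resource-group>
cleanup_lab() {
  az group delete --name "$1" --yes --no-wait
}
```

After calling `cleanup_lab rg-ai-lab`, running `az group exists --name rg-ai-lab` reports `false` once the deletion has completed. Because `--yes` removes the safety prompt, double-check the group name before running it.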
References
- Azure Machine Learning Documentation
- Terraform on Azure
- Bicep Language Reference
- Azure AI Infrastructure Overview
"You don't scale AI with PowerPoint; you scale it with Infrastructure as Code."