Chapter 9 — Cost Engineering for AI Workloads¶
"The cloud doesn't have a spending problem. It has a visibility problem."
The $127,000 Monday Morning¶
It's Monday morning. You're halfway through your coffee when an email from finance lands with the subject line: "URGENT: Azure bill - $127,000 - please explain." Last month's forecast was $42,000. You open Azure Cost Management and start drilling down. Two ND96isr_H100_v5 VMs jump off the screen: provisioned three weeks ago for a "quick experiment" and never shut down. At roughly $98/hour each, running 24/7, that's approximately $33,000 in idle GPU time every week. Nobody was using them. Nobody even remembered they were running.
This isn't a hypothetical. Variations of this story play out in organizations every month. The ML engineer who provisioned those VMs wasn't being reckless — they were iterating fast, which is exactly what you want from your data science team. The failure wasn't human; it was systemic. No auto-shutdown policy, no budget alerts, no tagging to trace the VMs back to a project or owner.
This chapter gives you the frameworks, formulas, and operational practices to make sure that email never lands in your inbox. Not by slowing down experimentation, but by building guardrails that make cost awareness automatic.
Why AI Cost Engineering Is Different¶
If you've managed cloud costs for traditional workloads, you already know the fundamentals: right-size VMs, use reserved instances, shut down dev/test environments at night. AI workloads follow the same principles — but the stakes are dramatically higher and the spending patterns are far less predictable.
GPU VMs Cost 10–100× More Than General-Purpose VMs¶
A Standard_D4s_v5 (4 vCPUs, 16 GB RAM) costs roughly $0.19/hour. An ND96isr_H100_v5 (8× H100 GPUs) costs roughly $98/hour. That's a 500× difference. A misconfigured general-purpose VM running idle for a weekend costs you $9. A misconfigured GPU VM running idle for a weekend costs you $4,700. The margin for error shrinks dramatically.
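The weekend arithmetic above is easy to reproduce. The snippet below is a minimal sketch that uses the approximate hourly rates quoted in this section; the function name `idle_cost` is illustrative, not part of any Azure tooling.

```python
def idle_cost(hourly_rate_usd: float, idle_hours: float, vm_count: int = 1) -> float:
    """Cost of VMs that are running but doing no useful work."""
    return hourly_rate_usd * idle_hours * vm_count


WEEKEND_HOURS = 48  # two full days

# Approximate pay-as-you-go rates quoted in this chapter (verify before relying on them).
d4s_v5_weekend = idle_cost(0.19, WEEKEND_HOURS)
h100_vm_weekend = idle_cost(98.00, WEEKEND_HOURS)

print(f"Standard_D4s_v5 idle weekend:  ${d4s_v5_weekend:,.2f}")
print(f"ND96isr_H100_v5 idle weekend:  ${h100_vm_weekend:,.2f}")
print(f"Hourly price ratio: {98.00 / 0.19:.0f}x")
```

The same function covers multi-VM cases by passing `vm_count`, which is how a pair of forgotten VMs turns into a five-figure weekly line item.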
Training Is Bursty¶
Traditional workloads tend toward steady-state patterns — web servers handle predictable traffic, databases serve consistent queries. AI training is fundamentally different. A team might consume zero GPU hours for two weeks while preparing data, then spike to 64 GPUs for a five-day training run, then drop back to zero. This burst pattern makes forecasting difficult and makes reserved capacity commitments risky without careful planning.
Token-Based Pricing Adds a Variable Layer¶
When your teams consume Azure OpenAI services, costs scale with usage in a way that's harder to predict than VM hours. A chatbot that handles 1,000 queries per day with short prompts costs a fraction of one that processes 1,000 legal documents with 100K-token contexts. Both are "the same application" from an infrastructure perspective, but the cost profiles are wildly different.
Experimentation Culture Conflicts with Budget Discipline¶
Data scientists need to experiment — that's how models improve. But experimentation means spinning up resources on short notice, trying different configurations, and sometimes abandoning approaches midway. Telling the ML team "submit a purchase order before provisioning any GPU" kills velocity. The solution isn't less experimentation; it's better guardrails around experimentation.
Infra ↔ AI Translation: GPU idle time is like leaving every light on in a stadium after the game ends. The hourly electricity bill is enormous, nobody's benefiting from it, and the fix is a simple timer — but someone has to install the timer before the first game.
GPU Cost Modeling¶
Before you can optimize costs, you need to model them. AI workloads have two fundamentally different cost profiles: training (running your own models on GPU VMs) and inference (consuming a model API like Azure OpenAI). Let's build the formulas for each.
GPU VM Pricing by Family¶
The table below provides approximate pay-as-you-go costs for common Azure GPU VM SKUs. These prices change frequently — always verify against the Azure Pricing Calculator for current rates.
| VM SKU | GPU | GPU Count | GPU Memory | Approx. Cost/Hour (Pay-as-you-go) | Primary Use Case |
|---|---|---|---|---|---|
| NC4as_T4_v3 | NVIDIA T4 | 1 | 16 GB | ~$0.53 | Inference, light fine-tuning |
| NC24ads_A100_v4 | NVIDIA A100 | 1 | 80 GB | ~$3.67 | Training, inference |
| NC48ads_A100_v4 | NVIDIA A100 | 2 | 160 GB | ~$7.35 | Multi-GPU training |
| ND96asr_v4 | NVIDIA A100 | 8 | 320 GB | ~$27.20 | Large-scale training |
| ND96isr_H100_v5 | NVIDIA H100 | 8 | 640 GB | ~$98.00 | Frontier model training |
Note: Prices are approximate, in USD, and vary by region. East US and West US 2 tend to have the most availability for GPU SKUs.
Training Cost Formula¶
For training workloads running on GPU VMs, the core formula is:
Training cost = (number of VMs × hours × hourly rate per VM) + storage cost + networking cost
Worked example — fine-tuning a 7B parameter model:
| Component | Calculation | Cost |
|---|---|---|
| Compute | 2× A100 GPUs × 18 hours × $3.67/hr | $132.12 |
| Storage | 500 GB Premium SSD × 18 hours | ~$2.50 |
| Networking | Negligible (single VM) | ~$0 |
| Total | | ~$135 |
Worked example — pre-training a 70B parameter model:
| Component | Calculation | Cost |
|---|---|---|
| Compute | 64× H100 GPUs (8 VMs) × 72 hours × $98/hr per VM | $56,448 |
| Storage | 10 TB across nodes × 72 hours | ~$85 |
| Networking | Inter-node InfiniBand (included in ND SKU) | $0 |
| Total | | ~$56,533 |
The difference between these examples illustrates why right-sizing matters. Provisioning H100s for a job that runs fine on A100s doesn't just waste money — it wastes 3–4× the money.
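Both worked examples follow the same compute-plus-storage-plus-networking structure, so they can be captured in a small calculator. This is a sketch only: the `TrainingJob` class and its field names are illustrative, and the rates are the approximate figures from the tables above.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    units: int                  # billable units: GPUs or VMs, matching the rate below
    hours: float
    rate_per_unit_hour: float   # USD per unit per hour
    storage_usd: float = 0.0
    networking_usd: float = 0.0

    def compute_usd(self) -> float:
        return self.units * self.hours * self.rate_per_unit_hour

    def total_usd(self) -> float:
        return self.compute_usd() + self.storage_usd + self.networking_usd


# Fine-tuning a 7B model: 2 A100 GPUs at the per-GPU rate.
fine_tune = TrainingJob(units=2, hours=18, rate_per_unit_hour=3.67, storage_usd=2.50)

# Pre-training a 70B model: 8 H100 VMs at the per-VM rate.
pre_train = TrainingJob(units=8, hours=72, rate_per_unit_hour=98.00, storage_usd=85.00)

for name, job in [("fine-tune", fine_tune), ("pre-train", pre_train)]:
    print(f"{name}: compute ${job.compute_usd():,.2f}, total ${job.total_usd():,.2f}")
```

Keeping the unit explicit (per GPU versus per VM) avoids the most common modeling slip, which is multiplying a per-VM rate by a GPU count.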
Inference Cost Formula (Azure OpenAI)¶
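For Azure OpenAI consumption, costs are token-based:
Daily cost = requests per day × [(input tokens per request ÷ 1,000 × input price per 1K) + (output tokens per request ÷ 1,000 × output price per 1K)]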
Worked example — customer support chatbot (GPT-4o):
| Component | Calculation | Cost |
|---|---|---|
| Input tokens | 10,000 requests/day × 800 tokens × $0.0025/1K | $20.00/day |
| Output tokens | 10,000 requests/day × 400 tokens × $0.01/1K | $40.00/day |
| Daily total | $20.00 + $40.00 | $60.00/day |
| Monthly total | $60 × 30 | ~$1,800/month |
Worked example — same chatbot using GPT-4o-mini:
| Component | Calculation | Cost |
|---|---|---|
| Input tokens | 10,000 requests/day × 800 tokens × $0.00015/1K | $1.20/day |
| Output tokens | 10,000 requests/day × 400 tokens × $0.0006/1K | $2.40/day |
| Daily total | $1.20 + $2.40 | $3.60/day |
| Monthly total | $3.60 × 30 | ~$108/month |
That's a 94% cost reduction for queries where GPT-4o-mini delivers acceptable quality — and for many customer support scenarios, it does.
Note: Token prices shown are approximate and subject to change. Always verify current pricing on the Azure OpenAI pricing page.
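The two chatbot examples can be checked with a short function. This is a sketch using the approximate per-1K-token prices from the tables above; `daily_token_cost` is an illustrative name, not an Azure SDK call.

```python
def daily_token_cost(requests_per_day: int,
                     input_tokens: int,
                     output_tokens: int,
                     input_price_per_1k: float,
                     output_price_per_1k: float) -> float:
    """Daily spend in USD for a token-priced deployment."""
    input_cost = requests_per_day * input_tokens / 1000 * input_price_per_1k
    output_cost = requests_per_day * output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Same traffic profile, two models (prices are the chapter's approximations).
gpt4o_daily = daily_token_cost(10_000, 800, 400, 0.0025, 0.01)
mini_daily = daily_token_cost(10_000, 800, 400, 0.00015, 0.0006)
reduction = 1 - mini_daily / gpt4o_daily

print(f"GPT-4o:      ${gpt4o_daily:,.2f}/day, ${gpt4o_daily * 30:,.2f}/month")
print(f"GPT-4o-mini: ${mini_daily:,.2f}/day, ${mini_daily * 30:,.2f}/month")
print(f"Reduction:   {reduction:.0%}")
```

Re-running this with your own request volumes and token counts is usually the fastest way to see whether a model switch is worth evaluating.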
Decision Matrix: Compute Purchasing Models¶
| Factor | Pay-as-you-go | 1-Year Reserved | 3-Year Reserved | Spot VMs |
|---|---|---|---|---|
| Discount | 0% (baseline) | ~30–40% | ~50–60% | ~60–90% |
| Commitment | None | 1 year | 3 years | None |
| Eviction risk | None | None | None | High |
| Best for | Experimentation, unpredictable workloads | Steady-state inference | Long-term training clusters | Fault-tolerant training |
| Budget predictability | Low | High | High | Low |
| Flexibility | Maximum | Moderate (can exchange) | Low | Maximum |
💡 Pro Tip: Don't commit to reserved instances until you have at least 2–3 months of utilization data. Many organizations reserve too early, then end up paying for GPUs they don't use. Start with pay-as-you-go, measure actual consumption, then reserve only the baseline you're confident you'll sustain.
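One way to turn utilization data into a reservation decision is a break-even check. The sketch below assumes a simplified model in which a reservation is billed for every hour of the term at a discounted rate, while pay-as-you-go is billed only for hours actually used; it ignores exchange options and any other flexibility. The function names are illustrative.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of hours a VM must be in use for a reservation to beat pay-as-you-go."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    # Reserved cost:  (1 - discount) * rate * all_hours
    # On-demand cost: rate * utilization * all_hours
    # They are equal when utilization == 1 - discount.
    return 1.0 - discount


def cheaper_option(utilization: float, discount: float) -> str:
    return "reserved" if utilization > break_even_utilization(discount) else "pay-as-you-go"


for discount in (0.35, 0.55):
    print(f"{discount:.0%} discount -> break-even at "
          f"{break_even_utilization(discount):.0%} utilization")

print(cheaper_option(utilization=0.50, discount=0.35))
print(cheaper_option(utilization=0.80, discount=0.35))
```

Under this model, a GPU that is busy half the time does not justify a reservation at a 35% discount, which is why measuring the sustained baseline comes first.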
Spot and Low-Priority VMs for Training¶
Azure Spot VMs offer the same GPU hardware at a 60–90% discount, but Azure can reclaim them with as little as 30 seconds' notice when capacity is needed. For the right workloads, this is the single biggest cost lever available.
When Spot Is Safe¶
Spot VMs work well when your training framework supports checkpoint-and-resume. This means the training job periodically saves its state (model weights, optimizer state, learning rate schedule, current epoch) to durable storage. If the VM is evicted, a new Spot VM picks up from the last checkpoint instead of starting over.
Frameworks that support this well:
- PyTorch Lightning: Built-in checkpointing with the `ModelCheckpoint` callback
- DeepSpeed: Automatic checkpointing integrated with ZeRO optimizer
- Hugging Face Transformers: `save_steps` and `resume_from_checkpoint` parameters
- Azure ML: Managed checkpointing for training jobs
When Spot Is NOT Safe¶
Do not use Spot VMs when:
- Deadlines are non-negotiable: If a model must be trained by Friday, repeated evictions could push you past deadline
- Checkpointing isn't implemented: Without checkpointing, every eviction restarts training from scratch — potentially costing more than pay-as-you-go
- Jobs are very short (under 1 hour): The overhead of checkpoint/resume outweighs the savings
- You're running inference in production: Production endpoints need availability guarantees that Spot cannot provide
Implementing Checkpoint-and-Resume¶
The pattern is straightforward:
- Save checkpoints to Azure Blob Storage or Azure Files — not to local SSD (which is lost on eviction)
- Set checkpoint frequency based on cost of lost work. If training costs $50/hour, checkpoint every 15 minutes to cap re-work at $12.50 per eviction
- Build your startup script to check for existing checkpoints and resume if found
- Use Azure VM Scale Sets with Spot to automatically replace evicted VMs
⚠️ Production Gotcha: Spot VMs can be evicted with only 30 seconds' notice. Your checkpoint must write to durable storage, not local disk. If your checkpoint takes 5 minutes to write 20 GB of model state, you'll lose it on eviction. Either checkpoint more frequently with smaller state, or use faster storage (Premium SSD or Azure NetApp Files as a staging layer before Blob).
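The checkpoint-and-resume pattern described above can be sketched without any ML framework. The example below is a toy: the "training step" is a placeholder, the state is a small JSON document rather than model weights, and a temporary directory stands in for a mounted durable share. All function names are illustrative. What it does show is the shape of the pattern: atomic checkpoint writes, resume-if-found at startup, and bounded re-work after an eviction.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write atomically so a half-written file never replaces a good checkpoint."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path: str, total_steps: int, checkpoint_every: int,
          evict_at_step: Optional[int] = None) -> int:
    """Run (or resume) a toy training loop and return the last step reached."""
    state = load_checkpoint(path)  # resume if a checkpoint is found
    step = state["step"]
    while step < total_steps:
        step += 1
        if step == evict_at_step:
            raise RuntimeError(f"simulated eviction at step {step}")
        state = {"step": step, "loss": 1.0 / step}  # placeholder for real work
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return step


with tempfile.TemporaryDirectory() as durable_dir:  # stand-in for durable storage
    ckpt = os.path.join(durable_dir, "checkpoint.json")
    try:
        train(ckpt, total_steps=100, checkpoint_every=10, evict_at_step=47)
    except RuntimeError as evicted:
        print(evicted)
    print("resuming from step", load_checkpoint(ckpt)["step"])
    print("finished at step", train(ckpt, total_steps=100, checkpoint_every=10))
```

The work lost to the simulated eviction is the gap between the last checkpoint and the eviction point, which is exactly the quantity the checkpoint interval is meant to cap.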
Spot Savings Example¶
| Scenario | Pay-as-you-go | Spot (70% discount) | Savings |
|---|---|---|---|
| 8× A100 training, 72 hours | $1,958 | $587 | $1,371 |
| 8× H100 training, 72 hours | $7,056 | $2,117 | $4,939 |
| 4× T4 inference testing, 40 hours | $85 | $25 | $60 |
Even accounting for occasional evictions and re-work, Spot VMs typically deliver 50–80% net savings on fault-tolerant training workloads.
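To see how evictions eat into the headline discount, the sketch below adds re-work hours to the Spot bill. The eviction count and the re-work per eviction are hypothetical inputs chosen for illustration, and the model ignores time spent waiting for replacement capacity.

```python
def spot_net_cost(payg_rate: float, hours: float, discount: float,
                  evictions: int = 0, rework_hours_per_eviction: float = 0.0) -> float:
    """Spot bill including the extra hours needed to redo work lost to evictions."""
    spot_rate = payg_rate * (1 - discount)
    billed_hours = hours + evictions * rework_hours_per_eviction
    return spot_rate * billed_hours


# 8x A100 VM at ~$27.20/hour for a 72-hour job (figures from the table above).
payg_cost = 27.20 * 72
ideal_spot = spot_net_cost(27.20, 72, discount=0.70)

# Hypothetical: 6 evictions, each costing 30 minutes of repeated work.
realistic_spot = spot_net_cost(27.20, 72, discount=0.70,
                               evictions=6, rework_hours_per_eviction=0.5)
net_savings = 1 - realistic_spot / payg_cost

print(f"Pay-as-you-go:         ${payg_cost:,.2f}")
print(f"Spot, no evictions:    ${ideal_spot:,.2f}")
print(f"Spot, with evictions:  ${realistic_spot:,.2f}")
print(f"Net savings:           {net_savings:.1%}")
```

Even with several evictions, the net savings in this scenario stay close to the nominal discount, because each eviction costs only as much as one checkpoint interval.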
Right-Sizing Strategies¶
The most expensive GPU is the one that's doing nothing — or doing work that a cheaper GPU could handle equally well. Right-sizing AI workloads requires matching the GPU to the task, not defaulting to the most powerful hardware available.
Don't Use H100s When T4s Will Do¶
This is the most common cost mistake in AI infrastructure. A team requests H100s "because we want the best performance," but their actual workload is running inference on a 7B parameter model that fits in a T4's 16 GB of memory (comfortably so once quantized). An ND96isr_H100_v5 VM costs roughly 185× more per hour than a single-T4 NC4as_T4_v3, and even per GPU an H100 is about 23× the price of a T4. Unless they're training a frontier model or need the H100's specific capabilities (FP8 Tensor Cores, higher memory bandwidth), they're burning money.
General sizing guidelines:
| Workload | Recommended Starting SKU | Why |
|---|---|---|
| Inference (models ≤13B) | NC-series T4 | 16 GB memory, cost-effective (larger models in this range need quantization to fit) |
| Inference (models 13B–70B) | NC-series A100 | 80 GB memory, good throughput |
| Fine-tuning (models ≤13B) | NC-series A100 (1–2 GPUs) | Sufficient memory with LoRA/QLoRA |
| Fine-tuning (models 70B+) | ND-series A100 (8 GPUs) | Needs multi-GPU + NVLink |
| Pre-training | ND-series H100 | Maximum throughput, NVLink + InfiniBand |
GPU Utilization Benchmarking¶
Before scaling up, measure what you're actually using. Run nvidia-smi or use Azure Monitor GPU metrics to check:
- GPU Compute Utilization (%): If consistently below 30%, the GPU is oversized for the workload or the data pipeline is the bottleneck
- GPU Memory Utilization (%): If below 50%, a smaller GPU may work. If above 90%, you may need more memory or to enable gradient checkpointing
- GPU Memory Used (GB): Compare to the GPU's total memory to understand headroom
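The thresholds above can be applied mechanically to `nvidia-smi` output. The sketch below parses the CSV produced by `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`; to stay runnable on a machine without a GPU it reads a hard-coded sample rather than invoking the tool. The sample values and the `assess_gpus` function are illustrative, and a single snapshot is only a starting point since the guidance is about sustained behavior.

```python
import csv
import io

# Sample rows: utilization (%), memory used (MiB), memory total (MiB).
SAMPLE_OUTPUT = """\
12, 6100, 81920
85, 76000, 81920
"""


def assess_gpus(csv_text: str) -> list:
    """Flag GPUs that look oversized or memory-constrained for their workload."""
    findings = []
    for index, row in enumerate(csv.reader(io.StringIO(csv_text))):
        if not row:
            continue
        util_pct, mem_used, mem_total = (float(value) for value in row)
        mem_pct = round(100 * mem_used / mem_total, 1)
        notes = []
        if util_pct < 30:
            notes.append("compute below 30%: oversized GPU or data pipeline bottleneck")
        if mem_pct < 50:
            notes.append("memory below 50%: a smaller GPU may work")
        elif mem_pct > 90:
            notes.append("memory above 90%: consider more memory or gradient checkpointing")
        findings.append({"gpu": index, "util_pct": util_pct,
                         "mem_pct": mem_pct, "notes": notes})
    return findings


for finding in assess_gpus(SAMPLE_OUTPUT):
    print(finding)
```

In practice you would collect these readings at intervals over a representative run and apply the thresholds to the averages.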
Infra ↔ AI Translation: Right-sizing GPUs is exactly like right-sizing VMs in traditional infrastructure. You wouldn't run a static website on a 64-core VM. Same principle — but the cost of getting it wrong is 100× higher because GPU VMs are 100× more expensive.
Auto-Shutdown Policies for Dev/Test¶
Every GPU VM provisioned for development, experimentation, or testing should have an auto-shutdown policy. Azure supports this natively through three mechanisms:
- Azure DevTest Labs auto-shutdown: Set a daily shutdown time on individual VMs
- Azure Automation runbooks: Schedule shutdown across resource groups or by tag
- Azure Policy: Enforce that all GPU VMs in dev/test subscriptions must have auto-shutdown enabled
💡 Pro Tip: Set the default auto-shutdown time to 7:00 PM local time for all dev/test GPU VMs. Engineers who need their VM to run overnight can extend it manually, but the default should be "off." A single ND96isr_H100_v5 left running from Friday evening to Monday morning (roughly 62 hours) costs approximately $6,000. Auto-shutdown eliminates this entirely.
Scaling Down After Experiments¶
Establish a process — not just a hope — for decommissioning experiment resources:
- Tag every GPU resource with `experiment-name`, `owner`, and `expected-end-date`
- Run a weekly report listing GPU VMs older than their expected end date
- Auto-notify owners 48 hours before you deallocate
- Deallocate if no response — data is on persistent storage, the VM can be recreated
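The weekly report in this process boils down to comparing a tag against today's date. Here is a minimal sketch that works on a plain list of dictionaries; the inventory shown is invented for illustration, and in practice it would come from whatever resource export your team already uses.

```python
from datetime import date


def overdue_resources(resources: list, today: date) -> list:
    """Return (name, owner, days_overdue) for resources past their expected end date.

    Resources with a missing or unparseable date are reported with days_overdue=None
    so they get reviewed instead of silently skipped.
    """
    overdue = []
    for resource in resources:
        tags = resource.get("tags", {})
        owner = tags.get("owner", "unknown")
        try:
            end_date = date.fromisoformat(tags.get("expected-end-date"))
        except (TypeError, ValueError):
            overdue.append((resource["name"], owner, None))
            continue
        if end_date < today:
            overdue.append((resource["name"], owner, (today - end_date).days))
    return overdue


# Hypothetical inventory for illustration only.
inventory = [
    {"name": "gpu-exp-01", "tags": {"owner": "alice", "expected-end-date": "2025-03-15"}},
    {"name": "gpu-exp-02", "tags": {"owner": "bob", "expected-end-date": "2025-04-01"}},
    {"name": "gpu-exp-03", "tags": {}},
]

for name, owner, days in overdue_resources(inventory, today=date(2025, 3, 20)):
    status = f"{days} days overdue" if days is not None else "no valid end date"
    print(f"{name} (owner: {owner}): {status}")
```

Treating a missing tag as a finding, not as a pass, keeps untagged experiments from escaping the report.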
Azure OpenAI Cost Optimization¶
Azure OpenAI pricing splits into two models: Standard (pay-per-token) and Provisioned Throughput Units (PTU). Choosing the wrong one — or not choosing at all — is one of the most common sources of unexpected AI spend.
Standard (Pay-per-Token)¶
Standard deployment charges per 1,000 tokens consumed. It's simple, requires no commitment, and scales to zero when unused. This is the right choice for:
- Applications in development or early production
- Workloads with unpredictable or variable traffic
- Low-volume use cases (under a few hundred thousand tokens per day)
The risk is that costs scale linearly with usage. If your application goes viral or an upstream team starts sending higher volumes, your bill grows proportionally with no ceiling.
Provisioned Throughput Units (PTU)¶
PTU deployments reserve dedicated model capacity, measured in Provisioned Throughput Units. You pay a fixed hourly or monthly rate regardless of how many tokens you consume. Throughput per PTU varies by model, version, and region, so you should always use the Azure OpenAI capacity calculator to estimate PTU requirements for your specific workload.
PTU makes sense when:
- You have sustained, predictable traffic with high utilization
- You need guaranteed latency that shared (standard) deployments can't provide
- Your token volume is high enough that the per-token cost under PTU is lower than standard pricing
When PTU Pays for Itself¶
The break-even point depends on your model, region, and traffic pattern, but as a general guideline: if your standard deployment is consistently utilized at 60–70% or above of what a PTU allocation would provide, PTU typically becomes cheaper. Below that utilization, you're paying for reserved capacity you're not using.
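The break-even logic can be expressed as a monthly token volume. In the sketch below, the PTU monthly cost is a made-up placeholder and not an Azure price; the blended Standard price of $0.005 per 1K tokens is derived from the GPT-4o chatbot example earlier in this chapter ($60 per day for 12 million tokens). Substitute real figures from the capacity calculator before drawing conclusions.

```python
def break_even_tokens_per_month(ptu_monthly_cost_usd: float,
                                blended_price_per_1k_usd: float) -> float:
    """Monthly token volume at which a fixed PTU bill equals the Standard bill."""
    return ptu_monthly_cost_usd / blended_price_per_1k_usd * 1000


def cheaper_deployment(monthly_tokens: float,
                       ptu_monthly_cost_usd: float,
                       blended_price_per_1k_usd: float) -> str:
    standard_cost = monthly_tokens / 1000 * blended_price_per_1k_usd
    return "PTU" if standard_cost > ptu_monthly_cost_usd else "Standard"


HYPOTHETICAL_PTU_MONTHLY_USD = 30_000   # placeholder, not a real price
BLENDED_PRICE_PER_1K = 0.005            # from the chatbot example: $60 / 12,000K tokens

threshold = break_even_tokens_per_month(HYPOTHETICAL_PTU_MONTHLY_USD, BLENDED_PRICE_PER_1K)
chatbot_monthly_tokens = 10_000 * (800 + 400) * 30

print(f"Break-even volume: {threshold:,.0f} tokens/month")
print(f"Chatbot volume:    {chatbot_monthly_tokens:,.0f} tokens/month")
print("Cheaper option:", cheaper_deployment(chatbot_monthly_tokens,
                                            HYPOTHETICAL_PTU_MONTHLY_USD,
                                            BLENDED_PRICE_PER_1K))
```

This comparison covers cost only; a latency requirement can justify PTU below the break-even volume.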
Decision Matrix: Standard vs PTU¶
| Factor | Standard (Pay-per-Token) | Provisioned Throughput (PTU) |
|---|---|---|
| Pricing model | Per 1K tokens consumed | Fixed hourly/monthly rate |
| Commitment | None | Monthly or yearly |
| Best for | Variable/unpredictable traffic | Steady, high-volume traffic |
| Latency | Shared capacity (variable) | Dedicated capacity (consistent) |
| Cost at low volume | Lower | Higher (paying for idle capacity) |
| Cost at high volume | Higher (linear scaling) | Lower (amortized across tokens) |
| Scale to zero | Yes | No (minimum PTU commitment) |
Note: PTU pricing, throughput-per-unit ratios, and minimum commitments vary by model, version, and region. Always use the Azure OpenAI capacity calculator for accurate sizing.
Token Optimization Strategies¶
Regardless of whether you use Standard or PTU, reducing token consumption directly reduces cost:
Prompt caching: Azure OpenAI supports automatic prompt caching for repeated prefixes. If your system prompt is 2,000 tokens and identical across all requests, cached tokens are charged at a reduced rate. Structure your prompts with the static portion first.
Shorter system prompts: A 3,000-token system prompt that could be 800 tokens wastes 2,200 tokens per request. At 10,000 requests per day with GPT-4o, that's 22 million wasted input tokens per day: roughly $55/day, or $1,650/month, in unnecessary spend.
Response length limits: Use the max_tokens parameter to cap response length. If your application only needs 200-word answers, don't allow 2,000-token responses. This is both a cost and a latency optimization.
Multi-model routing: Not every request needs your most capable (and most expensive) model. Route simple classification, extraction, or FAQ queries to GPT-4o-mini and reserve GPT-4o for complex reasoning, multi-step analysis, or tasks where quality measurably suffers with the smaller model. A well-implemented routing layer can cut inference costs by 50–80%.
💡 Pro Tip: Build a simple evaluation harness that runs the same 200 representative queries through both GPT-4o and GPT-4o-mini, then have a domain expert score the outputs. If GPT-4o-mini scores within 5% on 70%+ of queries, you've identified a huge cost savings opportunity with minimal quality impact.
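Two small calculations sit behind the routing decision: the blended cost once a share of traffic moves to the smaller model, and the acceptance rule from the evaluation harness. The sketch below uses the monthly figures from the chatbot examples earlier in the chapter. It interprets "within 5%" as the smaller model scoring at least 95% of the larger model's score on a given query, which is an assumption; the scores themselves are invented for illustration.

```python
def blended_monthly_cost(large_model_cost: float, small_model_cost: float,
                         share_routed_to_small: float) -> float:
    """Monthly cost when a fraction of identical traffic goes to the smaller model."""
    return ((1 - share_routed_to_small) * large_model_cost
            + share_routed_to_small * small_model_cost)


def small_model_acceptable(large_scores: list, small_scores: list,
                           tolerance: float = 0.05, required_share: float = 0.70) -> bool:
    """True if the small model is within tolerance on enough of the evaluated queries."""
    close_enough = sum(
        1 for large, small in zip(large_scores, small_scores)
        if small >= large * (1 - tolerance)
    )
    return close_enough / len(large_scores) >= required_share


# Chapter figures: ~$1,800/month on GPT-4o, ~$108/month on GPT-4o-mini.
routed = blended_monthly_cost(1800.0, 108.0, share_routed_to_small=0.70)
savings = 1 - routed / 1800.0
print(f"Blended cost: ${routed:,.2f}/month ({savings:.0%} savings)")

# Invented expert scores (1-5 scale) for ten queries.
large = [4.0] * 10
small = [3.9] * 7 + [3.0] * 3
print("Route to smaller model:", small_model_acceptable(large, small))
```

Routing 70% of traffic lands inside the 50–80% savings range quoted above, which is a useful sanity check on the claim.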
FinOps Practices for AI¶
FinOps — the practice of bringing financial accountability to cloud spending — is critical for AI workloads because the cost of getting it wrong is so much higher. A team that over-provisions CPU VMs might waste hundreds of dollars. A team that over-provisions GPU VMs wastes tens of thousands.
Cost Attribution: Tagging¶
Every AI resource should be tagged with at minimum:
| Tag | Purpose | Example |
|---|---|---|
| `cost-center` | Financial attribution | CC-4521-ML |
| `project` | Which initiative | customer-churn-model |
| `team` | Who owns it | data-science-west |
| `environment` | Dev, test, prod | dev |
| `expected-end-date` | When to review/delete | 2025-03-15 |
Use Azure Policy to enforce that GPU VM SKUs (NC, ND) cannot be created without these tags. This is a non-negotiable governance control.
⚠️ Production Gotcha: Tags are only useful if they're enforced at provisioning time. If you allow untagged resources and try to tag retroactively, you'll always be playing catch-up. Deploy an Azure Policy with deny effect that blocks GPU VM creation without required tags. Teams will push back — hold the line.
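The rule a deny policy enforces is simple enough to express in a few lines, which is handy for pre-flight checks in a provisioning pipeline. The sketch below is not Azure Policy itself; it only mirrors the logic (GPU family plus required tags) so that requests fail early with a readable message. The prefixes and the function name are assumptions for illustration.

```python
REQUIRED_TAGS = ("cost-center", "project", "team", "environment", "expected-end-date")
GPU_SIZE_PREFIXES = ("Standard_NC", "Standard_ND")  # NC and ND families


def missing_required_tags(vm_size: str, tags: dict) -> list:
    """Return the required tags that are absent or empty on a GPU VM request.

    Non-GPU sizes are out of scope for this rule and always pass.
    """
    if not vm_size.startswith(GPU_SIZE_PREFIXES):
        return []
    return [tag for tag in REQUIRED_TAGS if not tags.get(tag)]


request_tags = {"cost-center": "CC-4521-ML", "project": "customer-churn-model"}

gaps = missing_required_tags("Standard_ND96isr_H100_v5", request_tags)
if gaps:
    print("Deny: missing tags ->", ", ".join(gaps))

print("General-purpose VM gaps:", missing_required_tags("Standard_D4s_v5", {}))
```

A check like this complements the policy; it does not replace it, because only enforcement at the platform level covers every provisioning path.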
Budgets and Alerts¶
Azure Cost Management supports budgets with action-triggered alerts. For AI workloads, set up a three-tier alerting strategy:
| Alert Threshold | Action | Purpose |
|---|---|---|
| 50% of monthly budget | Email notification to team leads | Early visibility |
| 75% of monthly budget | Email + Teams notification to team + finance | Escalation |
| 90% of monthly budget | Email + automated action (e.g., stop non-production VMs) | Prevention |
Create separate budgets for GPU compute, Azure OpenAI, and storage — don't lump them into one budget where a spike in GPU spend hides behind headroom in storage.
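The three-tier strategy and the per-category split can be modeled together. In the sketch below the thresholds and actions come from the table above, while the budget and spend figures are invented; it only illustrates how a spike in one category stays visible when budgets are kept separate.

```python
ALERT_TIERS = [
    (0.50, "email team leads"),
    (0.75, "email + Teams notification to team and finance"),
    (0.90, "email + automated action on non-production VMs"),
]


def triggered_alerts(spend_usd: float, budget_usd: float) -> list:
    """Return every alert action whose threshold the current spend has crossed."""
    ratio = spend_usd / budget_usd
    return [action for threshold, action in ALERT_TIERS if ratio >= threshold]


# Hypothetical month-to-date figures, one budget per cost category.
budgets = {"gpu-compute": 42_000, "azure-openai": 5_000, "storage": 2_000}
spend = {"gpu-compute": 33_000, "azure-openai": 1_200, "storage": 1_950}

for category, budget in budgets.items():
    actions = triggered_alerts(spend[category], budget)
    pct = spend[category] / budget
    print(f"{category}: {pct:.0%} of budget, {len(actions)} alert(s)")
    for action in actions:
        print(f"  - {action}")
```

With a single combined budget, the same figures would sit at about 74% and trigger only the first tier, hiding both the storage overrun and the GPU escalation.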
Chargeback and Showback¶
For organizations with shared GPU clusters, decide between:
- Showback: Teams see what they consume but aren't billed directly. Lower friction, but weaker incentive to optimize
- Chargeback: Teams are billed for consumption from their own budget. Stronger incentive, but requires accurate metering and can create perverse incentives (teams hoard reserved capacity)
Most organizations start with showback and move to chargeback as the AI practice matures and cost attribution tooling becomes reliable.
GPU Quota Governance¶
Azure GPU quotas are your first line of defense against runaway provisioning. By default, most subscriptions have zero quota for ND-series VMs — you must explicitly request it. Use this to your advantage:
- Centralize quota requests through a platform or FinOps team
- Approve quota by project, not by individual
- Set subscription-level quotas that cap how much GPU capacity any single team can provision (Azure expresses these limits in vCPUs per VM family and region)
- Review quota allocations quarterly and reclaim unused quota
Regular Cost Reviews¶
Schedule a monthly AI cost review that brings together infrastructure, data science, and finance stakeholders. Review:
- Total GPU spend vs budget
- GPU utilization rates across all VMs
- Azure OpenAI token consumption trends
- Top 5 cost drivers and optimization opportunities
- Resources older than their expected end date
This meeting is where you catch the "$33,000 idle GPU" problem before it becomes a "$127,000 email from finance" problem.
Cost Attribution in Shared Clusters (AKS)¶
When multiple teams share an AKS cluster with GPU node pools, cost attribution becomes more complex than simple VM tagging. You need namespace-level visibility into who's consuming what.
Namespace-Level Cost Tracking¶
In a shared AKS cluster, each team or project should have its own Kubernetes namespace. This gives you a natural boundary for cost attribution:
- Azure Cost Analysis can break down AKS costs by namespace when the AKS cost analysis add-on is enabled
- OpenCost (CNCF project) provides real-time cost allocation by namespace, pod, and label
- Kubecost offers similar functionality with additional optimization recommendations
Resource Quotas per Namespace¶
Kubernetes ResourceQuotas prevent any single namespace from consuming more than its share of cluster resources. For GPU workloads, this is essential:
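```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-nlp
spec:
  hard:
    # Extended resources such as GPUs are quota'd through the "requests." prefix.
    # A separate "limits." entry is not needed because GPUs cannot be overcommitted.
    requests.nvidia.com/gpu: "4"
```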
This caps the team-nlp namespace at 4 GPUs, regardless of how many pods they try to schedule. Without this, a single team's runaway training job could consume every GPU in the cluster.
Cluster Autoscaler¶
Use the cluster autoscaler to scale GPU node pools to zero when no GPU pods are pending. This ensures you're not paying for idle GPU nodes during off-hours. Configure the autoscaler with:
- Scale-down delay: How long a node must be idle before being removed (e.g., 10 minutes for dev clusters, 30 minutes for production)
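- Scale-down utilization threshold: Remove nodes whose requested resources fall below a set fraction of their capacity (the autoscaler looks at pod resource requests, not measured GPU utilization)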
- Maximum node count: Hard cap on how many GPU nodes the autoscaler can provision
Tools Comparison¶
| Tool | Cost | Namespace Attribution | GPU Support | Optimization Recommendations |
|---|---|---|---|---|
| Azure Cost Analysis | Included | Yes (with add-on) | Yes | Basic |
| OpenCost | Free (open-source) | Yes | Yes | Limited |
| Kubecost | Free tier + paid | Yes | Yes | Detailed |
💡 Pro Tip: Enable the AKS cost analysis add-on (az aks update --enable-cost-analysis) before you need it. It requires time to accumulate data before it becomes useful. If you enable it after a cost incident, you won't have historical data to analyze.
Chapter Checklist¶
Before moving on, confirm you have these cost engineering practices in place:
- Cost model documented for both training (GPU-hours) and inference (tokens) workloads
- Tagging policy enforced via Azure Policy — all GPU resources tagged with cost-center, project, team, and environment
- Budget alerts configured at 50%, 75%, and 90% thresholds with escalating actions
- Auto-shutdown enabled on all dev/test GPU VMs
- Spot VMs evaluated for fault-tolerant training workloads with checkpointing implemented
- Right-sizing validated — GPU utilization benchmarked before provisioning larger SKUs
- Azure OpenAI pricing model selected — Standard vs PTU evaluated based on utilization data
- Token optimization implemented — prompt caching, system prompt trimming, response length limits, multi-model routing
- GPU quota governance centralized with approval workflow
- Monthly cost review meeting scheduled with infra, data science, and finance stakeholders
- Namespace-level cost tracking enabled for shared AKS clusters
- Weekly idle resource report running to catch forgotten experiments
What's Next¶
Cost is controlled. But as your AI platform grows from one team to ten, you need operational patterns that scale: fleet management, multi-tenancy, scheduling, and SLA design. Chapter 10 takes you from running AI projects to running an AI platform.