
Azure ML Environment

This module explains Azure ML platform building blocks and how to choose compute and serving options based on scale, latency, and cost.

Main workspace assets

  • Workspace
  • Compute Instance
  • Compute Cluster
  • Data assets
  • Model registry
  • Endpoints

Control plane vs data plane

| Plane | Responsibility |
|---|---|
| Control plane | Asset metadata, run history, permissions, governance |
| Data plane | Actual compute execution, model inference, data movement |

Workspace Taxonomy

Azure ML workspace taxonomy

Note - What this shows: The Azure ML workspace taxonomy: how the workspace contains compute, data assets, models, and endpoints under one governance boundary. Use it to see which asset type owns each artifact you will create in later modules.

Azure ML environment taxonomy

Note - What this shows: How a versioned environment (base image + pinned dependencies) is reused across both training and inference. Sharing one environment is what prevents training/serving skew: the same code behaving differently in production than in training.

Key concepts:

  • Experiment: a tracked training run.
  • Registered model: trained artifact stored with version and lineage.
  • Endpoint: deployment surface for scoring requests.

Additional key terms:

  • Environment: pinned runtime dependencies and base image.
  • Datastore: registered storage connection.
  • Dataset/Data asset: versioned data reference used by jobs.

Azure endpoint concept

Note - What this shows: The anatomy of an Azure ML endpoint: the deployment surface that receives scoring requests, applies authentication, and routes traffic to one or more model versions. This is the object consumers actually call.

Compute guidance

  • Compute Instance for development
  • Compute Cluster for scalable training
  • ACI or AKS for serving

Practical split:

  • AML Compute Cluster: training, sweeps, AutoML parallel iterations.
  • AKS Inference Cluster: production-grade deployment and autoscaling.

Compute decision matrix

| Need | Recommended option |
|---|---|
| Notebook exploration and debugging | Compute Instance |
| Parallelized training and HPO | Compute Cluster |
| Quick endpoint prototype | ACI |
| Production, autoscale, high availability | AKS |

Security and governance baseline

  • Use managed identities for data access.
  • Restrict network paths with private endpoints where possible.
  • Use least-privilege RBAC.
  • Keep lineage from data to model to endpoint for auditability.

Backend execution flow (what happens after submit)

sequenceDiagram
    participant U as User/SDK/CLI
    participant W as Azure ML Workspace
    participant C as Compute Cluster
    participant S as Storage/Registry
    U->>W: Submit job spec (code, env, data refs)
    W->>C: Provision/attach compute
    W->>C: Resolve environment image
    C->>S: Mount/download datasets
    C->>C: Execute training command
    C->>S: Write logs, metrics, artifacts
    W->>W: Record lineage links
    W->>U: Return run status and outputs

Asset lineage map

| Asset | Versioned | Produced by | Consumed by |
|---|---|---|---|
| Data asset | Yes | Data registration job | Training/inference jobs |
| Environment | Yes | Environment build/pin | Training and deployment |
| Model | Yes | Training run output | Online/Batch endpoints |
| Endpoint deployment | Yes (revisioned) | Deploy pipeline | Consumers (apps/APIs) |

Enterprise considerations

  • Multi-workspace strategy: separate dev, test, prod with promotion gates.
  • Registry strategy: central model registry for cross-workspace sharing.
  • Access model: human access via RBAC groups; workload access via managed identity.
  • Compliance trail: preserve run IDs, model versions, dataset versions, and deployment revisions.

Azure ML RBAC roles reference

| Role | Typical assignee | Permissions |
|---|---|---|
| Owner | Platform team leads | Full control including role assignments |
| Contributor | ML engineers | Create/manage all assets, no role changes |
| AzureML Data Scientist | Data scientists | Run experiments, register models, deploy |
| AzureML Compute Operator | Ops team | Start/stop compute, view runs |
| Reader | Stakeholders | View assets and run history only |

Deep dive: every concept, explained

This section explains why each Azure ML building block exists and what problem it solves, not just what it is called.

The workspace as the unit of governance

A workspace is the top-level container that ties together compute, data, models, and endpoints under one identity and access boundary. It exists so that everything about a project (who can touch it, which runs produced which model, which data version trained it) is recorded in one auditable place. Behind the scenes a workspace provisions associated Azure resources: a storage account (artifacts, datasets), Key Vault (secrets), Container Registry (environment images), and Application Insights (telemetry). Understanding this mapping explains most permissions and networking issues you will hit later.

Control plane vs data plane: why the split matters

  • The control plane handles metadata and intent: "register this dataset", "start this job", "who is allowed to deploy". It is lightweight, always-on, and is where governance, lineage, and RBAC live.
  • The data plane handles actual work: spinning up VMs, moving gigabytes, running training loops, serving inference. It is where cost and performance are determined.

This separation is why you can submit a job (control plane) and have it queue until compute (data plane) is available, and why permission to see an asset is distinct from permission to run expensive compute with it.

Compute Instance vs Compute Cluster vs inference cluster

| Compute | Lifecycle | Why it exists |
|---|---|---|
| Compute Instance | Single, always-on dev VM | Interactive notebooks, debugging, attached to one user identity |
| Compute Cluster | Auto-scales 0→N nodes per job, then back to 0 | Parallel training, hyperparameter sweeps, AutoML trials; you pay only while jobs run |
| AKS / managed inference | Long-lived, autoscaling pods | Low-latency, high-availability serving with health probes |

The key economic idea: training compute should scale to zero when idle (bursty, batch), while serving compute stays warm (steady, latency-sensitive). Choosing the wrong one is a top cause of surprise cloud bills.

Assets, versioning, and lineage

Every first-class asset (data, environment, model, endpoint deployment) is versioned. This is not bureaucracy: it is what makes an ML system reproducible and auditable:

  • Data asset: a versioned pointer to data in a datastore, so a run records exactly which snapshot it trained on.
  • Environment: a pinned runtime (base image + dependency versions). Reusing the same environment for training and inference prevents the "works in training, breaks in production" class of bugs.
  • Model: the trained artifact plus metadata linking it back to the run, data, and environment that produced it (its lineage).
  • Endpoint deployment: a revisioned serving configuration, so traffic can be split or rolled back between versions.

Lineage is the chain data v → run → model v → endpoint revision. When a production prediction is questioned (audit, incident, fairness review), lineage lets you reconstruct precisely how it was produced.
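To make the lineage chain concrete, here is a minimal sketch of walking it backwards from an endpoint revision to the data version. The asset names, versions, and the `PRODUCED_BY` mapping are hypothetical; Azure ML records these links for you, and this only illustrates the shape of the lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    kind: str      # "data", "run", "model", or "endpoint"
    name: str
    version: str


# Hypothetical lineage records: each asset maps to the asset that produced it.
PRODUCED_BY = {
    Asset("endpoint", "fraud-api", "rev-7"): Asset("model", "fraud-clf", "12"),
    Asset("model", "fraud-clf", "12"): Asset("run", "train-run-001", "1"),
    Asset("run", "train-run-001", "1"): Asset("data", "transactions", "3"),
}


def trace_lineage(asset, produced_by):
    """Walk upstream from an asset until no producer is recorded."""
    chain = [asset]
    while chain[-1] in produced_by:
        chain.append(produced_by[chain[-1]])
    return chain


chain = trace_lineage(Asset("endpoint", "fraud-api", "rev-7"), PRODUCED_BY)
print(" <- ".join(f"{a.kind}:{a.name}:{a.version}" for a in chain))
```

An auditor starting from a questioned prediction would begin at the endpoint revision and read the chain right to the exact data version.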

Identity and access concepts

  • Managed identity: an Azure-managed credential attached to a workload (not a person) so jobs can read data or registries without embedded secrets. This is the secure default.
  • RBAC (role-based access control): permissions granted to identities via roles. The least-privilege principle means giving each identity the minimum role needed (e.g. Contributor for engineers, not Owner), limiting blast radius if credentials are compromised.
  • Private endpoint: routes traffic to the workspace over a private network path instead of the public internet, reducing exposure for regulated workloads.

The submit-to-result flow, demystified

When you submit a job, the control plane validates the spec, provisions or attaches data-plane compute, resolves the environment image (pulling or building the container), mounts the referenced data version, runs your command, streams logs/metrics/artifacts back to storage, and records lineage. Knowing this sequence is what lets you debug a stuck run: each arrow in the sequence diagram above is a place a job can fail (quota, image build, data mount, code error).


MLOps maturity model

MLOps (Machine Learning Operations) is the discipline of applying DevOps principles to machine learning workflows. This module uses a four-level maturity model, condensed from the MLOps maturity model that Microsoft publishes (which defines five levels, 0 to 4), to describe how organizations evolve from ad-hoc experimentation toward fully automated, self-healing ML systems. Understanding where your team sits today, and what the next level requires, is the starting point for any platform investment decision.

The four levels

| Level | Name | Training | Deployment | Monitoring | Retraining |
|---|---|---|---|---|---|
| 0 | Manual | Scripts run on a laptop | Manual copy of model artifact | None | Ad hoc, on request |
| 1 | Automated training pipeline | ML pipeline on compute cluster | Semi-manual or scripted | Basic metrics | Manual trigger |
| 2 | Full CI/CD ML | Pipeline triggered by code commit | Model deployed via release pipeline | Drift detection alerts | Threshold-triggered |
| 3 | Automated retraining | Triggered by data drift or schedule | Blue/green, canary, automated rollback | Full observability stack | Fully autonomous |

Level 0 — Manual

At Level 0 a data scientist runs Python scripts locally, stores the model as a pickle file, and emails it to an engineer who deploys it by hand. There is no versioning, no lineage, and no rollback path. The model is effectively a black box attached to tribal knowledge.

Azure ML components that lift you out of Level 0:

  • Creating a workspace and registering assets in it puts data, models, and endpoints into a governed store.
  • Using mlflow.log_metric and mlflow.log_artifact inside your script turns a local run into a tracked experiment without changing the training logic.

Level 1 — Automated training pipeline

At Level 1 training is a repeatable, parameterised pipeline. Any engineer can re-run the pipeline from the same data and get the same model. Deployment still requires human action.

Azure ML components that enable Level 1:

  • Compute Cluster with min_instances=0 so the pipeline runs on demand and shuts down.
  • Azure ML Pipelines (component DAGs) to wire data preparation → training → evaluation → model registration as discrete, rerunnable steps.
  • Registered environments so every step uses an immutable runtime image.
  • Data assets with version pinning so the pipeline records exactly which data version it trained on.

Level 2 — Full CI/CD ML

At Level 2 a code commit or data-drift event triggers the entire train-evaluate-deploy pipeline via a CI/CD orchestrator. A gate (automated test or approval) prevents bad models from reaching production.

Azure ML components that enable Level 2:

  • GitHub Actions / Azure DevOps integrated with azure/aml-run or CLI v2 az ml job create actions.
  • Model evaluation step in the pipeline that compares new model metrics against the current champion; the deploy step only proceeds if the challenger wins.
  • Managed online endpoints with blue/green traffic splitting so deployment is zero-downtime.
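The champion/challenger gate described above can be sketched as a small comparison function. This is an illustration only: the function name, the `auc` metric, and the `min_gain` threshold are assumptions, not part of any Azure ML API. In a real pipeline the metrics would come from the evaluation step and the registered champion model.

```python
def should_promote(challenger, champion, metric="auc", min_gain=0.005):
    """Return True only if the challenger beats the champion by at least min_gain.

    challenger / champion are plain dicts of metric name -> value.
    If there is no champion yet, the first model is promoted.
    """
    if champion is None:
        return True
    return challenger[metric] - champion[metric] >= min_gain


champion = {"auc": 0.924}

print(should_promote({"auc": 0.931}, champion))  # clear improvement
print(should_promote({"auc": 0.926}, champion))  # gain below the threshold
print(should_promote({"auc": 0.900}, None))      # no champion registered yet
```

Requiring a minimum gain rather than any improvement avoids redeploying on differences that are likely noise in the evaluation set.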

Level 3 — Automated retraining

At Level 3 the system monitors its own predictions, detects drift, and automatically kicks off a retraining pipeline without human involvement. Models are continuously current.

Azure ML components that enable Level 3:

  • Data drift monitor on the online endpoint that emits an alert or directly triggers a pipeline via Event Grid.
  • Schedule-based pipeline triggers as a simpler baseline for periodic retraining.
  • Model promotion registry across dev/test/prod workspaces with automated promotion gates.

Note - Maturity is not a race: Most production ML teams operate effectively at Level 2. Level 3 automation is appropriate when retraining latency is a business-critical KPI (e.g. fraud models where data distribution shifts daily). Choose the level that matches your business cadence, not the one that sounds most impressive.


Azure ML pipelines as first-class assets

The component concept

A component in Azure ML is a self-contained, reusable unit of computation. It is analogous to a function: it declares typed inputs, typed outputs, and a command to execute. Components are defined in YAML and versioned in the workspace registry, so they can be shared across projects and reused without copy-pasting code.

The key benefit is composability: a pipeline is just a DAG (directed acyclic graph) of component invocations wired together by connecting outputs of one component to inputs of the next. Azure ML handles scheduling, data movement, and lineage recording for you.
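To illustrate what "a DAG of component invocations" means for scheduling, the sketch below derives a valid execution order from step dependencies using Python's standard-library `graphlib`. The step names mirror the fraud pipeline used later in this module; the dependency dictionary is a simplified stand-in for what Azure ML infers from input/output bindings, not its actual scheduler.

```python
from graphlib import CycleError, TopologicalSorter

# step -> set of steps whose outputs it consumes
pipeline = {
    "prep_step": set(),
    "train_step": {"prep_step"},
    "evaluate_step": {"train_step", "prep_step"},
}

# A topological order guarantees every step runs after the steps it depends on.
order = list(TopologicalSorter(pipeline).static_order())
print(order)

# A cycle means the graph is not a DAG, so no valid order exists.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle rejected:", has_cycle)
```

The "acyclic" requirement is what makes the order well defined: a step that depended on its own downstream output could never start.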

YAML component definition example

# components/prep_data/component.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: prep_data
display_name: Prepare Training Data
version: "1"
type: command

inputs:
  raw_data:
    type: uri_folder
    description: Raw CSV files from the data lake
  validation_split:
    type: number
    default: 0.2

outputs:
  train_data:
    type: uri_folder
  val_data:
    type: uri_folder

code: ./src
environment: azureml:fraud-train@latest

command: >-
  python prep.py
  --raw_data ${{inputs.raw_data}}
  --validation_split ${{inputs.validation_split}}
  --train_data ${{outputs.train_data}}
  --val_data ${{outputs.val_data}}

Note - Input/output types: uri_folder is a path to a directory (blob container path or local path); uri_file is a single file. Azure ML resolves these references and mounts or downloads the data before your script runs. Use mlflow_model as an output type when the step produces a registered model artifact.

Pipeline composition in YAML

# pipelines/fraud_pipeline.yml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: Fraud Detection Training Pipeline
experiment_name: fraud-detection

inputs:
  raw_data:
    type: uri_folder
    path: azureml:fraud-raw-data@latest

jobs:
  prep_step:
    type: command
    component: azureml:prep_data:1
    inputs:
      raw_data: ${{parent.inputs.raw_data}}
      validation_split: 0.2
    outputs:
      train_data:
        mode: rw_mount
      val_data:
        mode: rw_mount

  train_step:
    type: command
    component: azureml:train_model:1
    inputs:
      train_data: ${{parent.jobs.prep_step.outputs.train_data}}
      val_data: ${{parent.jobs.prep_step.outputs.val_data}}
      learning_rate: 0.01
      n_estimators: 500
    outputs:
      model_output:
        mode: rw_mount

  evaluate_step:
    type: command
    component: azureml:evaluate_model:1
    inputs:
      model: ${{parent.jobs.train_step.outputs.model_output}}
      val_data: ${{parent.jobs.prep_step.outputs.val_data}}

Submit the pipeline:

az ml job create \
  --file pipelines/fraud_pipeline.yml \
  --workspace-name my-workspace \
  --resource-group my-rg \
  --stream

How pipelines enable reproducible workflows

Each pipeline run stores:

  1. The component version used at each step.
  2. The data asset version consumed.
  3. The environment version each step ran in.
  4. All inputs, outputs, and metrics per step.

This means any historical run can be exactly reproduced by re-submitting with the same component and data versions — no manual reconstruction required. Lineage links from data → component → model → endpoint are recorded automatically.

Difference from Azure Pipelines / DevOps pipelines

| Azure ML Pipeline | Azure DevOps / GitHub Actions Pipeline |
|---|---|
| Runs on ML compute (GPU/CPU clusters) | Runs on CI/CD agents (lightweight VMs) |
| Orchestrates data + model steps | Orchestrates code build, test, deploy steps |
| First-class ML lineage and experiment tracking | Source control and artifact management |
| Can be triggered by a DevOps pipeline | Cannot run ML training jobs natively |

The correct architecture is nested: a GitHub Actions workflow (DevOps pipeline) responds to a code commit and calls az ml job create to submit an Azure ML Pipeline. The DevOps layer handles CI/CD; the Azure ML layer handles ML execution.

Tip - Reusability: Register frequently-used components (data validation, model evaluation, feature engineering) once in a shared component registry. Teams can pull them by name and version instead of duplicating code, and improvements propagate to all pipelines that reference the latest version.


Feature stores: concept and motivation

What is a feature store?

A feature store is a centralised repository for storing, serving, and reusing engineered features — the transformed, business-meaningful signals derived from raw data that you feed into ML models. It exists to solve a class of problems that arise when the same feature (e.g. "30-day transaction velocity for customer X") needs to be computed consistently for both training and real-time inference.

Without a feature store, teams typically:

  • Reimplement the same feature logic in two places (training notebook and production service).
  • Discover months later that the implementations diverged, producing training/serving skew.
  • Cannot reuse features across projects, so every new model duplicates data engineering work.

Online store vs offline store

| Store | Latency | Contents | Used for |
|---|---|---|---|
| Offline store | Minutes–hours | Historical feature values, columnar format (Parquet) | Model training, batch scoring |
| Online store | Milliseconds | Latest feature values, key-value format (Redis/CosmosDB) | Real-time inference |

The offline store enables you to build training datasets by retrieving historical feature values as they existed at the time of each training label (point-in-time correctness). The online store serves the most recent value at inference time.

The training/serving consistency problem

Suppose you train a fraud model with the feature "number of transactions in the last 30 days for this card". Your training code queries a data warehouse and aggregates correctly. Your inference service, written by a different engineer, queries a different table with a slightly different window definition. The feature values differ systematically — and the model's performance in production degrades in ways that are very hard to diagnose.

A feature store enforces one canonical implementation of each feature transformation. The same pipeline that writes features to the offline store also writes them to the online store, guaranteeing both are derived from identical logic.

Point-in-time correctness

When building a training dataset you must not "look into the future". If label \(y_i\) was assigned at time \(t_i\), the feature vector \(\mathbf{x}_i\) must contain only information available at \(t_i\). A naive join on entity ID without time filtering causes label leakage, producing training metrics far better than production performance.

Feature stores solve this with time-travel queries:

\[\mathbf{x}_i = \text{FeatureStore.get\_historical\_features}(\text{entity}=e_i,\ t=t_i)\]

The store returns the feature value as it was stored at (or before) \(t_i\), not the current value.
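The "at or before \(t_i\)" rule can be sketched in a few lines of plain Python. The feature history, timestamps, and the helper name `value_as_of` are hypothetical; a real feature store performs the same lookup at scale inside a point-in-time join.

```python
from bisect import bisect_right

# Hypothetical history of one feature for one entity: (timestamp, value), sorted by time.
history = [(1, 3), (5, 7), (9, 12)]


def value_as_of(history, t):
    """Return the latest value whose timestamp is <= t, or None if none exists yet."""
    times = [ts for ts, _ in history]
    i = bisect_right(times, t)  # number of records with timestamp <= t
    return history[i - 1][1] if i else None


label_time = 6
point_in_time = value_as_of(history, label_time)  # only sees records written by t=6
naive_latest = history[-1][1]                     # leaks the record written at t=9

print(point_in_time, naive_latest)
```

The gap between the two values is exactly the leakage a naive join on entity ID would introduce into the training set.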

How it connects to Azure ML

Azure ML integrates with Microsoft Fabric Feature Store and the Azure ML managed feature store. You can:

  • Define feature sets as Python transformation code registered in the feature store.
  • Generate training datasets by joining feature sets with a label table using point-in-time joins via the Azure ML SDK.
  • Serve features at inference time from the online store behind a managed online endpoint.

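# Retrieve historical features for training (SDK v2)
from azure.identity import DefaultAzureCredential
from azureml.featurestore import FeatureStoreClient, get_offline_features
from pyspark.sql import SparkSession

# Reuse the active Spark session (or create one when running outside a notebook).
spark = SparkSession.builder.getOrCreate()

fs_client = FeatureStoreClient(
    credential=DefaultAzureCredential(),
    subscription_id="<sub>",
    resource_group_name="<rg>",
    name="my-feature-store",
)

# Look up version "1" of the "transactions" feature set.
feature_set = fs_client.feature_sets.get("transactions", "1")

# Observation data: one row per label, including the event timestamp column.
obs_df = spark.read.parquet("abfss://labels@datalake.dfs.core.windows.net/labels.parquet")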

training_df = get_offline_features(
    features=[feature_set.get_feature("tx_velocity_30d")],
    observation_data=obs_df,
    timestamp_column="event_time",
)

Note - Maturity: The Azure ML managed feature store reached general availability in 2024. For simpler projects, a well-designed feature engineering pipeline (with strict versioning and identical logic in training and inference) provides most of the benefits before you commit to a full feature store adoption.


Distributed training concepts in Azure ML

Why distributed training?

For large models or large datasets, a single GPU becomes the bottleneck. Distributed training spreads computation across multiple GPUs (or multiple nodes of GPUs). The two fundamental strategies are data parallelism and model parallelism.

Data parallelism

In data parallelism each GPU holds a complete copy of the model but processes a different mini-batch of data. After each forward and backward pass, the gradients are synchronised (averaged) across all GPUs before the weight update.

\[\text{Effective batch size} = B \times N_{\text{GPU}}\]

where \(B\) is the per-GPU batch size and \(N_{\text{GPU}}\) is the number of GPUs. Larger effective batch size means fewer gradient synchronisation rounds per epoch, but may require learning-rate scaling (linear scaling rule: \(\eta' = \eta \times N_{\text{GPU}}\)).
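As a quick worked example of the two formulas above, the sketch below computes the effective batch size, the linearly scaled learning rate, and the resulting optimizer steps per epoch. The 8-GPU, batch-64 setup mirrors the job configuration later in this module (treating 64 as the per-GPU batch size is an assumption); the 100,000-sample dataset is hypothetical.

```python
import math


def scaled_hyperparams(per_gpu_batch, n_gpus, base_lr):
    """Apply effective batch = B * N_GPU and the linear scaling rule lr' = lr * N_GPU."""
    effective_batch = per_gpu_batch * n_gpus
    scaled_lr = base_lr * n_gpus
    return effective_batch, scaled_lr


def steps_per_epoch(n_samples, effective_batch):
    """Number of optimizer steps (and gradient syncs) needed to see every sample once."""
    return math.ceil(n_samples / effective_batch)


effective_batch, scaled_lr = scaled_hyperparams(per_gpu_batch=64, n_gpus=8, base_lr=0.01)
single_gpu_steps = steps_per_epoch(100_000, 64)
ddp_steps = steps_per_epoch(100_000, effective_batch)

print(effective_batch, scaled_lr, single_gpu_steps, ddp_steps)
```

The linear scaling rule is a heuristic, usually paired with a warm-up period; treat the scaled learning rate as a starting point to validate rather than a guarantee.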

PyTorch Distributed Data Parallel (DDP) is the standard implementation:

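# train.py (DDP entry point, launched by Azure ML once per rank)
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup() -> int:
    """Join the process group and bind this process to its local GPU."""
    dist.init_process_group(backend="nccl")  # NCCL = GPU-to-GPU via NVLink/InfiniBand
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


def main():
    local_rank = setup()
    model = MyModel().cuda()  # MyModel: your torch.nn.Module, defined elsewhere
    model = DDP(model, device_ids=[local_rank])
    # ... training loop unchanged from single-GPU code
    dist.destroy_process_group()  # clean shutdown once training finishes


if __name__ == "__main__":
    main()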

Model parallelism

In model parallelism the model is too large to fit on a single GPU, so different layers are placed on different GPUs. Data flows sequentially through the pipeline of devices. This requires more careful engineering (pipeline parallelism, tensor parallelism) and is typically used for LLM pre-training (e.g. GPT-style models with billions of parameters).

| Strategy | Use when | Framework support |
|---|---|---|
| Data parallelism (DDP) | Model fits on one GPU; need faster training | PyTorch DDP, Horovod |
| Model parallelism | Model does NOT fit on one GPU | DeepSpeed, Megatron-LM |
| Pipeline parallelism | Very deep models; balance layer latency | DeepSpeed, FairScale |
| Tensor parallelism | Very wide layers (large attention heads) | Megatron-LM, NeMo |

Parameter server architecture

An older alternative to DDP is the parameter server model: some nodes act as servers storing the global parameter state, while worker nodes compute gradients and push them to the servers. This is less favoured for GPU training today (all-reduce is more bandwidth-efficient) but is still used in distributed CPU training and some sparse embedding workloads.

Configuring a distributed training job in Azure ML

# jobs/distributed_train.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
display_name: DDP Training Job

code: ./src
command: >-
  python train.py
  --epochs 50
  --batch_size 64

environment: azureml:pytorch-gpu-env@latest
compute: azureml:gpu-cluster

distribution:
  type: pytorch           # Azure ML injects MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
  process_count_per_instance: 4  # GPUs per node

resources:
  instance_count: 2       # 2 nodes × 4 GPUs = 8-way DDP

experiment_name: distributed-training

Azure ML sets the MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, and WORLD_SIZE environment variables automatically, so your script calls dist.init_process_group() without hard-coded addresses.

GPU cluster setup

# Create a GPU compute cluster (Standard_NC24s_v3 = 4x V100 per node)
az ml compute create \
  --name gpu-cluster \
  --type AmlCompute \
  --min-instances 0 \
  --max-instances 8 \
  --size Standard_NC24s_v3 \
  --workspace-name my-workspace \
  --resource-group my-rg

Tip - Spot VMs for distributed training: GPU spot (low-priority) VMs cost ~60–80% less than dedicated. For fault-tolerant training jobs (with checkpointing), prefer --tier low_priority. Azure ML automatically requeues the job if the VM is evicted.


Networking and private deployment

Why network isolation matters

In regulated industries (banking, healthcare, government), policy often requires that data never traverse the public internet. Even with TLS encryption, traffic that touches a public IP can create an audit finding. Azure ML's private networking features allow you to deploy a workspace and all associated resources entirely within a private address space.

Virtual network (VNet) injection

When you VNet-inject an Azure ML workspace, the associated resources (storage, key vault, container registry, Application Insights) all receive private endpoints — network interfaces within your VNet with private IP addresses. Traffic between compute and storage stays entirely within the Azure backbone, never leaving your VNet.

# Create a workspace with a managed virtual network (approved outbound traffic only)
az ml workspace create \
  --name secure-workspace \
  --resource-group my-rg \
  --location eastus \
  --managed-network allow_only_approved_outbound

Private endpoint for the workspace

A private endpoint for the workspace itself means the Azure ML Studio UI and all SDK/CLI calls route through a private IP in your VNet. External access requires one of:

  • A jumpbox VM inside the VNet.
  • Azure Bastion for browser-based access.
  • A VPN / ExpressRoute connecting your corporate network to the VNet.

Private DNS zones

When a private endpoint is created, the workspace FQDN (e.g. my-workspace.api.azureml.ms) must resolve to the private IP, not the public one. This requires:

  1. A private DNS zone (privatelink.api.azureml.ms) linked to the VNet.
  2. An A record mapping the workspace hostname to the private endpoint IP.

Azure can create these automatically during workspace provisioning (--public-network-access disabled) or you can manage them manually for complex hub-spoke topologies.

Outbound rules (managed VNet)

Azure ML's managed VNet feature (generally available 2024) handles egress without you managing NSGs and route tables. You declare allowed outbound destinations:

# workspace_managed_vnet.yml
managed_network:
  isolation_mode: allow_only_approved_outbound
  outbound_rules:
    - name: allow-pypi
      type: fqdn
      destination: pypi.org
    - name: allow-conda-forge
      type: fqdn
      destination: conda.anaconda.org
    - name: allow-storage
      type: private_endpoint
      destination:
        service_resource_id: /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/mydatalake
        subresource_target: blob

Summary: why financial and healthcare orgs require this

| Requirement | Azure ML feature |
|---|---|
| Data never leaves private network | VNet injection + private endpoints |
| No public IP on workspace | --public-network-access disabled |
| Audit trail of network paths | NSG flow logs + Azure Monitor |
| Encryption in transit | TLS 1.2+ enforced on all private endpoints |
| Encryption at rest | Storage + KV with customer-managed keys (CMK) |

Note - Planning: Private networking must be planned before workspace creation. Retrofitting private endpoints onto an existing public workspace is possible but requires re-deploying associated resources. Design the network topology upfront to avoid downtime.


Cost optimization patterns

Azure ML costs fall into two buckets: compute (VMs running training or inference) and storage + egress (data, artifacts, logs). Compute is almost always the dominant cost.

Spot VMs for training

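Spot (low-priority) VMs use spare Azure capacity at a 60–80% discount. They can be evicted at short notice (Azure Spot VMs may give as little as 30 seconds of warning), so a job should never rely on finishing cleanup at eviction time. For ML training, the mitigation is periodic checkpointing: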

# Save checkpoint every N epochs
if epoch % checkpoint_every == 0:
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, f"outputs/checkpoint_epoch_{epoch}.pt")

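Azure ML automatically requeues the job on eviction. The job resumes from the last checkpoint only if your training script looks for an existing checkpoint at startup and loads it; otherwise it restarts from epoch 0.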
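The resume side of checkpointing is often forgotten. A minimal, standard-library-only sketch of locating the most recent checkpoint is shown here; the helper name `latest_checkpoint` is hypothetical, and the file naming follows the `checkpoint_epoch_{epoch}.pt` pattern from the snippet above. Empty files stand in for real checkpoints so the example runs without PyTorch.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_RE = re.compile(r"checkpoint_epoch_(\d+)\.pt$")


def latest_checkpoint(directory):
    """Return (epoch, path) for the highest-numbered checkpoint, or None if there is none."""
    found = []
    for path in Path(directory).glob("checkpoint_epoch_*.pt"):
        match = CHECKPOINT_RE.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Compare epochs numerically so that epoch 10 sorts after epoch 9.
    return max(found, default=None)


with tempfile.TemporaryDirectory() as tmp:
    for epoch in (5, 9, 10):
        (Path(tmp) / f"checkpoint_epoch_{epoch}.pt").touch()
    resume_epoch, resume_path = latest_checkpoint(tmp)
    print(resume_epoch, resume_path.name)
    # In a real job you would now call torch.load(resume_path), restore
    # model_state and optimizer_state, and continue from resume_epoch + 1.
```

Parsing the epoch as an integer matters: a plain string sort would rank `checkpoint_epoch_9.pt` above `checkpoint_epoch_10.pt` and silently resume from the wrong state.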

az ml compute create \
  --name spot-cluster \
  --type AmlCompute \
  --size Standard_NC6s_v3 \
  --min-instances 0 \
  --max-instances 4 \
  --tier low_priority         # <-- spot pricing

Auto-shutdown for compute instances

A compute instance left running overnight costs ~$50–200/day depending on VM size. Set an auto-shutdown schedule:

az ml compute update \
  --name my-instance \
  --workspace-name my-workspace \
  --resource-group my-rg \
  --idle-time-before-shutdown-minutes 30

Compute cluster min=0

Cluster nodes are billed per minute. Setting min-instances=0 means the cluster scales to zero when idle. The trade-off is a cold-start delay of ~3–5 minutes to provision the first node. For development clusters this is almost always the right trade-off.

ACI vs AKS cost

| Serving option | Billing model | Best for | Avoid when |
|---|---|---|---|
| ACI | Per-second CPU+memory | Prototype, <5 RPS | High throughput, GPU inference |
| Managed Online Endpoint | Per-minute VM uptime | Production REST API | Batch scoring |
| Serverless Online Endpoint | Per-token / per-request (models) | Sporadic LLM traffic | Consistent high throughput |
| Batch Endpoint + Cluster | Per-minute compute only while scoring | Large offline batch jobs | Real-time SLA required |

Reserved instances for inference

If a production inference endpoint runs 24/7, a 1-year reserved instance reduces the VM cost by approximately 40%. A 3-year reservation reduces it by ~60%.
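As a rough illustration of what those discounts mean in practice, the sketch below applies them to the $0.20/h on-demand rate used in this module's cost example. The discount percentages are the approximate figures quoted above; actual reservation pricing varies by SKU and region, so treat the output as an order-of-magnitude estimate.

```python
HOURS_PER_MONTH = 730
ON_DEMAND_RATE = 0.20  # USD per hour, illustrative Standard_DS3_v2 rate from this module

# Approximate discounts from the text; not official pricing.
DISCOUNTS = {
    "on-demand": 0.0,
    "1-year reserved": 0.40,
    "3-year reserved": 0.60,
}

monthly_cost = {
    plan: round(ON_DEMAND_RATE * HOURS_PER_MONTH * (1 - discount), 2)
    for plan, discount in DISCOUNTS.items()
}

for plan, cost in monthly_cost.items():
    print(f"{plan:>16}: ${cost:.2f}/month")
```

Reservations only pay off when the endpoint really does run around the clock; for bursty traffic, autoscaling usually saves more.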

Real cost calculation example

Suppose you run a monthly retraining pipeline on a Standard_NC6s_v3 spot VM (1× V100, ~$0.90/h spot price) and serve a model on a Standard_DS3_v2 (4 vCPU, 14 GB, ~$0.20/h on-demand):

| Item | Duration | Spot/On-demand | Monthly cost |
|---|---|---|---|
| Prep + training pipeline | 4 h/month | Spot $0.90/h | $3.60 |
| Evaluation + registration | 0.5 h/month | Spot $0.90/h | $0.45 |
| Online endpoint (inference) | 730 h/month | On-demand $0.20/h | $146 |
| Storage (50 GB artifacts) | 730 h/month | $0.018/GB/month | $0.90 |
| Total | | | ~$151/month |

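The inference VM is 97% of the cost. This is typical and explains why right-sizing the inference VM SKU and tuning autoscale limits matters far more than optimising training compute. Note that a managed online deployment keeps at least one instance running, so it cannot scale to zero; to pay nothing while idle, use a batch endpoint or a serverless endpoint instead.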
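The arithmetic behind the table can be reproduced in a few lines, which makes it easy to substitute your own rates. All prices are the illustrative figures from the example above, not current Azure list prices.

```python
HOURS_PER_MONTH = 730

# Monthly cost per line item, in USD (illustrative rates from the example).
items = {
    "training": 4 * 0.90,                  # 4 h on a spot V100 at $0.90/h
    "evaluation": 0.5 * 0.90,              # 0.5 h on the same spot VM
    "inference": HOURS_PER_MONTH * 0.20,   # always-on endpoint at $0.20/h
    "storage": 50 * 0.018,                 # 50 GB at $0.018 per GB-month
}

total = sum(items.values())
inference_share = items["inference"] / total

print(f"total: ${total:.2f}/month")
print(f"inference share: {inference_share:.1%}")
```

Changing only the inference rate or hours moves the total far more than any change to the training rows, which is the point of the example.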

Tip - Cost alerts: Create an Azure Budget alert at 80% of your expected monthly spend so you are notified before costs spiral. Set a second alert at 100%. Training jobs that get stuck in an infinite loop on a non-spot cluster are a common cause of surprise bills.


Azure ML and DevOps: the MLOps CI/CD pipeline

Architecture overview

A production MLOps pipeline combines two orchestrators:

  1. GitHub Actions (or Azure DevOps): responds to source control events (code push, PR merge, scheduled trigger) and orchestrates the overall release workflow.
  2. Azure ML Pipelines: runs the ML-specific work (data prep, training, evaluation) on ML compute with full lineage tracking.

flowchart TD
    A[Developer pushes code] --> B[GitHub Actions: CI workflow]
    B --> C{Lint and unit tests pass?}
    C -- No --> D[Fail - notify developer]
    C -- Yes --> E[az ml job create - submit training pipeline]
    E --> F[Azure ML: data prep → train → evaluate]
    F --> G{New model beats champion?}
    G -- No --> H[Register as challenger only - no deploy]
    G -- Yes --> I[GitHub Actions: CD workflow]
    I --> J[Deploy to staging endpoint]
    J --> K[Integration tests against staging]
    K --> L{Tests pass?}
    L -- No --> M[Roll back - alert on-call]
    L -- Yes --> N[Promote to production endpoint - shift traffic]

GitHub Actions integration

# .github/workflows/train-and-deploy.yml
name: Train and Deploy Fraud Model

on:
  push:
    branches: [main]
    paths:
      - "src/**"
      - "pipelines/**"
      - "components/**"

env:
  AZURE_SUBSCRIPTION: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
  AZURE_RESOURCE_GROUP: my-rg
  AML_WORKSPACE: my-workspace

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Azure login
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Install Azure ML CLI extension
        run: az extension add --name ml --yes

      - name: Submit training pipeline
        id: train_job
        run: |
          JOB_NAME=$(az ml job create \
            --file pipelines/fraud_pipeline.yml \
            --workspace-name $AML_WORKSPACE \
            --resource-group $AZURE_RESOURCE_GROUP \
            --query name -o tsv)
          echo "job_name=$JOB_NAME" >> $GITHUB_OUTPUT
          az ml job stream --name $JOB_NAME \
            --workspace-name $AML_WORKSPACE \
            --resource-group $AZURE_RESOURCE_GROUP

      - name: Check job outcome
        run: |
          STATUS=$(az ml job show \
            --name ${{ steps.train_job.outputs.job_name }} \
            --workspace-name $AML_WORKSPACE \
            --resource-group $AZURE_RESOURCE_GROUP \
            --query status -o tsv)
          if [ "$STATUS" != "Completed" ]; then
            echo "Training job failed with status: $STATUS"
            exit 1
          fi

  deploy-staging:
    needs: train
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

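      - name: Azure login
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      # Each job starts on a fresh runner, so the ml extension must be installed again.
      - name: Install Azure ML CLI extension
        run: az extension add --name ml --yes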

      - name: Deploy to staging endpoint
        run: |
          az ml online-deployment create \
            --file deployments/staging_deployment.yml \
            --workspace-name $AML_WORKSPACE \
            --resource-group $AZURE_RESOURCE_GROUP \
            --all-traffic

      - name: Run integration tests
        run: python tests/integration/test_endpoint.py --env staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production   # requires manual approval in GitHub Environments
    steps:
      - uses: actions/checkout@v4

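      - name: Azure login
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      # Each job starts on a fresh runner, so the ml extension must be installed again.
      - name: Install Azure ML CLI extension
        run: az extension add --name ml --yes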

      - name: Shift 100% traffic to new deployment
        run: |
          az ml online-endpoint update \
            --name fraud-endpoint \
            --traffic "new-deployment=100" \
            --workspace-name $AML_WORKSPACE \
            --resource-group $AZURE_RESOURCE_GROUP

Model training triggered by code change

The paths filter on the workflow restricts training to pushes that touch src/, pipelines/, or components/, the three directories that affect model behaviour. Changes to documentation or unrelated configuration files therefore do not burn GPU budget.
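To make the filter's effect concrete, the sketch below approximates it in Python. It treats each `dir/**` pattern as a plain prefix match, which is a simplification of GitHub's glob rules, and `should_trigger_training` is a hypothetical helper, not something GitHub provides.

```python
# Approximation of the workflow's paths filter: a push triggers training
# only if at least one changed file lives under a watched directory.
WATCHED_PREFIXES = ("src/", "pipelines/", "components/")


def should_trigger_training(changed_files):
    """Return True if any changed path falls under a watched directory."""
    return any(path.startswith(WATCHED_PREFIXES) for path in changed_files)


print(should_trigger_training(["README.md", "docs/setup.md"]))     # False
print(should_trigger_training(["docs/setup.md", "src/train.py"]))  # True
```

A single matching file is enough to trigger the run, which is why a commit that mixes documentation and source changes still retrains.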

Promotion gates

In the GitHub Actions example, the production environment is configured in GitHub repository settings to require a reviewer approval before the job runs. This creates a manual gate between staging validation and production deployment. For fully automated Level 3 MLOps, this gate can be replaced by an automated test suite that checks:

  • Prediction latency \(\leq\) SLA threshold (e.g. \(p99 \leq 100\text{ ms}\)).
  • Model accuracy \(\geq\) champion accuracy on a held-out evaluation set.
  • No data drift detected on the new model's shadow predictions.
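The three checks above could be combined into a single automated gate. The following is a minimal sketch, not a prescribed implementation: `CandidateReport` and `passes_promotion_gate` are hypothetical names, the inputs are assumed to come from your own load tests, evaluation job, and drift monitor, and the 100 ms default simply mirrors the example threshold in the list.

```python
from dataclasses import dataclass


@dataclass
class CandidateReport:
    """Measurements for a candidate model (all fields are assumed inputs)."""
    p99_latency_ms: float
    accuracy: float
    drift_detected: bool


def passes_promotion_gate(candidate, champion_accuracy, sla_p99_ms=100.0):
    """Return (passed, reasons); reasons lists every check that failed."""
    reasons = []
    if candidate.p99_latency_ms > sla_p99_ms:
        reasons.append("p99 latency above SLA")
    if candidate.accuracy < champion_accuracy:
        reasons.append("accuracy below champion")
    if candidate.drift_detected:
        reasons.append("data drift detected")
    return (not reasons, reasons)


# Illustrative numbers only.
report = CandidateReport(p99_latency_ms=120.0, accuracy=0.93, drift_detected=False)
print(passes_promotion_gate(report, champion_accuracy=0.92))
```

Returning every failed reason, rather than stopping at the first, gives the on-call engineer the full picture from one pipeline run.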

Model testing in the pipeline

The evaluation component in the Azure ML pipeline should produce a JSON file with metrics that the GitHub Actions workflow reads to make the promotion decision:

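# evaluate.py: runs inside the Azure ML pipeline
import json
import os

import mlflow
from sklearn.metrics import roc_auc_score

# ... load model and validation data ...
# CHAMPION_AUC_THRESHOLD is assumed to be defined earlier in this script.
auc = float(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
mlflow.log_metric("val_auc", auc)

# Write a machine-readable result for the CI gate.
# float() and bool() keep NumPy scalar types out of the JSON payload,
# since json cannot serialize numpy.bool_.
result = {"val_auc": auc, "promoted": bool(auc >= CHAMPION_AUC_THRESHOLD)}
os.makedirs("outputs", exist_ok=True)
with open("outputs/eval_result.json", "w") as f:
    json.dump(result, f)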

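A GitHub Actions step can then download this file from the job outputs and fail the workflow if promoted is false, which prevents the deployment jobs from running. The workflow shown earlier does not include that step, so it must be added between the training and staging jobs.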
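One way to implement the gate is a small script that the workflow runs after fetching the job outputs (for example with `az ml job download`). The sketch below is an assumption about how such a script might look: `gate_exit_code` is a hypothetical helper, and the demonstration writes a throwaway file with made-up values in place of the real download.

```python
import json
import tempfile
from pathlib import Path


def gate_exit_code(result_path):
    """Return 0 if the evaluation result allows promotion, 1 otherwise."""
    with open(result_path) as f:
        result = json.load(f)
    # A missing "promoted" key is treated as a failed gate.
    promoted = bool(result.get("promoted", False))
    print(f"val_auc={result.get('val_auc')} promoted={promoted}")
    return 0 if promoted else 1


# Demonstration with a temporary file standing in for the downloaded output.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "eval_result.json"
    demo_path.write_text(json.dumps({"val_auc": 0.91, "promoted": False}))
    demo_code = gate_exit_code(demo_path)

print(demo_code)  # 1
```

In a workflow step, passing the return value to `sys.exit` makes the step fail on a non-promoted model, and jobs that declare `needs:` on it are then skipped.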

Tip - Secrets in Actions: Never hard-code subscription IDs or connection strings in YAML files. Use GitHub Encrypted Secrets (${{ secrets.NAME }}) for credentials and ${{ vars.NAME }} for non-sensitive configuration. The AZURE_CREDENTIALS secret is a service principal JSON created with az ad sp create-for-rbac --role Contributor --scopes /subscriptions/<sub> --json-auth and stored as a repository secret. Where possible, scope the role to the resource group that holds the workspace rather than the whole subscription.

Environment versioning strategy

Azure ML environments are immutable once published. Recommended versioning approach:

  1. Pin all packages with exact versions in conda.yml or requirements.txt.
  2. Use environment name + version (e.g., fraud-train:3) as the reference in jobs.
  3. Register a new environment version when any dependency changes; never mutate an existing version.
  4. Reuse the same environment for training and inference to guarantee compatibility.

# Example conda.yml
name: fraud-train
channels:
  - defaults
dependencies:
  - python=3.10
  - pip   # declare pip explicitly so conda installs it before the pip section
  - pip:
    - scikit-learn==1.3.0
    - azureml-sdk==1.55.0
    - pandas==2.0.3
    - lightgbm==4.0.0
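Pinning is easy to enforce mechanically before an environment is built. The following is a rough sketch, assuming requirements are available as plain `name==version` lines; `unpinned_requirements` is a hypothetical helper and does not handle every pip specifier (a wildcard such as `==1.*` would slip through).

```python
def unpinned_requirements(lines):
    """Return requirement lines that do not use an exact '==' pin."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


requirements = [
    "scikit-learn==1.3.0",
    "pandas>=2.0",          # range, not an exact pin
    "lightgbm",             # no version at all
    "# tooling",
    "azureml-sdk==1.55.0",
]
print(unpinned_requirements(requirements))  # ['pandas>=2.0', 'lightgbm']
```

Running a check like this in CI, and failing when the list is non-empty, catches an unpinned dependency before it produces a non-reproducible image.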

Cost management tips

| Practice | Saves |
| --- | --- |
| Set compute cluster min nodes = 0 | Avoids idle compute charges |
| Use spot/low-priority VMs for training | 60-80% compute cost reduction |
| Set auto-shutdown on compute instances | Prevents overnight idle spend |
| Use ACI for low-QPS endpoints instead of AKS | Eliminates cluster overhead |

Choosing compute: a decision flow

Picking compute is mostly a function of two questions: is the work interactive or batch, and is it latency-sensitive or throughput-oriented. This flow captures the common path.

flowchart TD
    A[What are you doing?] --> B{"Interactive dev or a job?"}
    B -->|Interactive| C["Compute Instance (auto-shutdown on)"]
    B -->|Job| D{"Training/HPO or serving?"}
    D -->|Training or sweep| E["Compute Cluster (min nodes = 0, spot VMs)"]
    D -->|Serving| F{"Steady high QPS or low/spiky QPS?"}
    F -->|High QPS, low latency| G["Managed online / AKS (autoscale, health probes)"]
    F -->|Low or spiky QPS| H["ACI or managed endpoint (scale-to-low)"]
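The same decision flow can be written as a small function, which is handy for encoding the default in team tooling. This is only a sketch of the flow as drawn: `choose_compute` and its boolean parameters are hypothetical, and real choices also depend on quota, region, and budget.

```python
def choose_compute(interactive, serving=False, high_qps=False):
    """Map the three questions in the decision flow to a compute choice."""
    if interactive:
        return "Compute Instance (auto-shutdown on)"
    if not serving:
        # Training, HPO, or sweep jobs
        return "Compute Cluster (min nodes = 0, spot VMs)"
    if high_qps:
        return "Managed online / AKS (autoscale, health probes)"
    return "ACI or managed endpoint (scale-to-low)"


print(choose_compute(interactive=True))
print(choose_compute(interactive=False))
print(choose_compute(interactive=False, serving=True, high_qps=True))
print(choose_compute(interactive=False, serving=True, high_qps=False))
```

The order of the checks mirrors the diagram: interactive work is settled first, so the serving questions are only asked for jobs.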

Common environment and compute pitfalls

These are the issues that most often block a first Azure ML deployment. Recognizing the symptom saves hours of debugging.

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Job stuck in "Queued" for a long time | Cluster at max nodes or quota exhausted | Raise quota, increase max nodes, or use a smaller VM SKU |
| "Image build failed" before training starts | Unpinned or conflicting dependencies in the environment | Pin exact versions; build and test the environment image once, then reuse it |
| Model works in a notebook but fails at the endpoint | Training/serving environment skew | Reuse the same registered environment for training and inference |
| Endpoint returns 401/403 | Missing key or token, or RBAC too restrictive | Use the endpoint key/token; grant the caller the right role |
| Surprise monthly bill | Compute instance left running, or AKS over-provisioned | Enable auto-shutdown; set cluster min nodes to 0; right-size serving |

Tip - Reproducibility checklist: Before you call a result "done", confirm three versions are pinned and recorded: the data version, the environment version, and the code commit. With those three, any run on this workspace can be reproduced exactly, which is the whole point of the asset model.
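The checklist can be captured as a small manifest stored alongside each run. The sketch below is illustrative: `run_manifest` is a hypothetical helper, and the data version and commit hash are made-up placeholder values (only `fraud-train:3` echoes the example used earlier on this page).

```python
import json


def run_manifest(data_version, environment_version, code_commit):
    """Build a reproducibility record; refuse to proceed if anything is unpinned."""
    manifest = {
        "data_version": data_version,
        "environment_version": environment_version,
        "code_commit": code_commit,
    }
    missing = [key for key, value in manifest.items() if not value]
    if missing:
        raise ValueError(f"not pinned: {', '.join(missing)}")
    return manifest


# Placeholder values for illustration only.
manifest = run_manifest("transactions:4", "fraud-train:3", "a1b2c3d")
print(json.dumps(manifest, sort_keys=True))
```

Raising on a missing value, instead of writing a partial record, keeps an incomplete run from being labelled reproducible.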

Quick self-check

| # | Question | Answer |
| --- | --- | --- |
| 1 | What is the difference between the control plane and the data plane, and which one determines cost? | The control plane handles asset metadata, run history, permissions, and governance; the data plane performs the actual compute execution, inference, and data movement. The data plane (compute) determines cost. |
| 2 | Why should a training compute cluster scale to zero but a serving cluster stay warm? | Training is bursty/batch, so scaling to zero when idle avoids paying for unused compute; serving must stay warm to meet low-latency requirements and avoid cold-start delays. |
| 3 | What three versioned things must be recorded to reproduce a training run? | The data version, the environment version, and the code commit. |
| 4 | Why does reusing one environment for training and inference prevent a whole class of bugs? | Identical pinned library versions at train and serve time eliminate train/serve skew caused by version mismatches. |