MLOps Interview Questions

AI Frameworks, Tools, and Best Practices

38 Questions · 10 Topic Areas · 20+ Frameworks · 3 Difficulty Levels
Section 01

Training Frameworks (PyTorch, TensorFlow, JAX)

Q1
What are the key differences between PyTorch and TensorFlow 2.x? When would you choose one over the other? (Junior)

Key Differences

  • Execution Model: PyTorch uses eager execution by default (define-by-run). TensorFlow 2.x also defaults to eager but supports @tf.function for graph compilation (see the sketch after this list).
  • API Style: PyTorch is more Pythonic and intuitive. TensorFlow has Keras as its high-level API, but lower-level ops can be verbose.
  • Debugging: PyTorch integrates naturally with Python debuggers (pdb). TensorFlow graphs are harder to debug.
  • Deployment: TensorFlow has better production tooling (TF Serving, TFLite, TF.js). PyTorch is catching up with TorchServe and ONNX.
  • Research vs Production: PyTorch dominates research papers. TensorFlow is stronger in enterprise production.
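
To make the execution-model difference concrete, here is a minimal sketch (assuming both libraries are installed; the function and tensor values are illustrative only):

import tensorflow as tf
import torch

# PyTorch: eager by default - each op runs immediately, easy to step through in pdb
x = torch.tensor([1.0, 2.0])
y = x * 2

# TensorFlow 2.x: also eager by default, but @tf.function traces the Python function
# into a graph on the first call and reuses the compiled graph afterwards
@tf.function
def scale(t):
    return t * 2

z = scale(tf.constant([1.0, 2.0]))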

When to Choose

  • PyTorch: Research, prototyping, NLP (Hugging Face), dynamic architectures, when team prefers Pythonic code
  • TensorFlow: Mobile/edge deployment, existing TF infrastructure, TPU training, production-first projects
Interview Tip
Mention that the gap has narrowed significantly. Both are production-ready. The choice often depends on team expertise and existing infrastructure rather than technical superiority.
Q2
Explain the difference between model.eval() and torch.no_grad() in PyTorch. (Junior)

model.eval()

Sets the model to evaluation mode. This affects layers that behave differently during training vs inference:

  • Dropout: Disabled (no random zeroing)
  • BatchNorm: Uses running mean/variance instead of batch statistics
  • Does NOT disable gradient computation

torch.no_grad()

Context manager that disables gradient computation:

  • Saves memory: No need to store intermediate activations for backward pass
  • Faster inference: Skip gradient tape operations
  • Does NOT affect layer behavior
# Correct inference pattern - use BOTH
model.eval()  # Change layer behavior
with torch.no_grad():  # Disable gradients
    outputs = model(inputs)

# Don't forget to switch back for training
model.train()
Key Insight
Always use both together for inference. model.eval() alone still computes gradients (wasting memory). torch.no_grad() alone doesn't fix BatchNorm/Dropout behavior.
Q3
What is JAX and when would you use it over PyTorch/TensorFlow? (Mid)

What is JAX?

JAX is Google's library for high-performance numerical computing. It's NumPy + automatic differentiation + XLA compilation + vectorization.

Key Features

  • Functional paradigm: Pure functions, no hidden state, explicit random keys
  • Transformations: grad (autodiff), jit (compilation), vmap (auto-batching), pmap (parallelization)
  • XLA compilation: Optimized kernels for GPU/TPU
  • Composable: Transformations can be combined freely
import jax.numpy as jnp
from jax import grad, jit, vmap

# Define a loss function
def loss_fn(params, x, y):
    pred = jnp.dot(x, params)
    return jnp.mean((pred - y) ** 2)

# Get gradient function (automatic!)
grad_fn = grad(loss_fn)

# JIT compile for speed
fast_grad = jit(grad_fn)

# Vectorize a per-example function over the batch dimension
# (predict_fn stands in for any single-example prediction function)
batched_pred = vmap(predict_fn)

When to Use JAX

  • Research requiring custom autodiff: Higher-order gradients, Hessians
  • TPU-first projects: JAX has excellent TPU support
  • Scientific computing: Physics simulations, differential equations
  • When you need vmap: Auto-vectorization is powerful

When NOT to Use JAX

  • Need mature ecosystem (Hugging Face, torchvision)
  • Team unfamiliar with functional programming
  • Standard deep learning tasks where PyTorch/TF suffice
Q4
How do you handle GPU memory issues during training? What strategies exist? (Mid)

Immediate Fixes

  • Reduce batch size: Most direct solution, but affects convergence
  • Gradient accumulation: Simulate larger batches without memory cost
  • Mixed precision (FP16/BF16): Halves memory, often faster too
# Gradient Accumulation in PyTorch
accumulation_steps = 4
optimizer.zero_grad()

for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Advanced Techniques

  • Gradient checkpointing: Trade compute for memory by recomputing activations (see the sketch below)
  • Model parallelism: Split model across GPUs (for very large models)
  • Offloading: Move optimizer states to CPU (DeepSpeed ZeRO)
  • 8-bit optimizers: bitsandbytes library reduces optimizer memory
# Mixed Precision Training with PyTorch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, labels in dataloader:
    optimizer.zero_grad()

    with autocast():  # FP16 forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

    scaler.scale(loss).backward()  # Scaled backward
    scaler.step(optimizer)
    scaler.update()
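
Gradient checkpointing, mentioned in the list above, can be sketched with torch.utils.checkpoint; the block structure below is hypothetical:

# Gradient checkpointing: recompute activations in backward instead of storing them
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

blocks = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)])

def forward_with_checkpointing(x, segments=4):
    # Only activations at segment boundaries are kept; the rest are recomputed
    return checkpoint_sequential(blocks, segments, x)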
Pro Tip
Start with mixed precision + gradient accumulation. They're easy to implement and often sufficient. Gradient checkpointing is next. Model parallelism/offloading for very large models only.
Q5
Explain the PyTorch DataLoader and how to optimize data loading for training. (Junior)

DataLoader Basics

DataLoader wraps a Dataset and provides batching, shuffling, and parallel data loading.

from torch.utils.data import DataLoader, Dataset

dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,          # Shuffle for training
    num_workers=4,          # Parallel data loading
    pin_memory=True,        # Faster GPU transfer
    prefetch_factor=2,      # Batches to prefetch per worker
    persistent_workers=True # Keep workers alive between epochs
)

Optimization Strategies

  • num_workers: Start with 4, increase until CPU-bound. Too many causes overhead.
  • pin_memory=True: Uses page-locked (pinned) host memory for faster CPU→GPU transfer
  • persistent_workers=True: Avoids worker restart overhead between epochs
  • prefetch_factor: Load next batches while GPU is computing

Common Issues

  • Slow first epoch: Workers initializing. Use persistent_workers.
  • Memory leak: Large objects in Dataset.__getitem__(). Process data lazily.
  • Bottleneck detection: If GPU utilization stays low during training, data loading is likely the bottleneck.
Rule of Thumb
Set num_workers = number of CPU cores / number of GPUs. Always use pin_memory=True for GPU training. Profile with torch.profiler to find bottlenecks.
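
A quick way to confirm a data-loading bottleneck with torch.profiler (a sketch; model and dataloader are assumed to already exist):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (inputs, labels) in enumerate(dataloader):
        outputs = model(inputs.cuda(non_blocking=True))
        if step >= 10:  # a few steps are enough for a first look
            break

# If most self-CPU time sits in the DataLoader rather than CUDA kernels,
# data loading is the bottleneck
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))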
Section 02

Distributed Training

Q1
What is the difference between Data Parallelism and Model Parallelism? (Mid)

Data Parallelism

Same model replicated across devices, each processes different data batches.

  • How it works: Split batch across GPUs → forward pass → all-reduce gradients → update
  • When to use: Model fits in single GPU memory
  • Scaling: Near-linear with more GPUs (communication overhead exists)
  • Tools: PyTorch DDP, tf.distribute.MirroredStrategy, Horovod

Model Parallelism

Model split across devices, each holds part of the model.

  • Pipeline parallelism: Split by layers (GPU1: layers 1-10, GPU2: layers 11-20)
  • Tensor parallelism: Split individual layers across GPUs
  • When to use: Model too large for single GPU
  • Challenge: Pipeline bubbles cause GPU idle time
  • Tools: DeepSpeed, Megatron-LM, FairScale
# PyTorch DistributedDataParallel (DDP)
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
model = DDP(model, device_ids=[local_rank])

# Training loop unchanged - DDP handles gradient sync
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)  # assuming the wrapped model returns the loss
    loss.backward()      # Gradients automatically synchronized across ranks
    optimizer.step()
Interview Tip
Most production systems use Data Parallelism (DDP). Model Parallelism is for LLMs (GPT, LLaMA) that don't fit on one GPU. Mention ZeRO (DeepSpeed) as a hybrid approach.
Q2
Explain DeepSpeed ZeRO and its different stages. (Senior)

What is ZeRO?

Zero Redundancy Optimizer - partitions model states across GPUs instead of replicating them, dramatically reducing memory per GPU.

Memory Breakdown (Standard DDP)

For a model with Ψ parameters in mixed precision:

  • Parameters (FP16): 2Ψ bytes
  • Gradients (FP16): 2Ψ bytes
  • Optimizer states (FP32): 12Ψ bytes (Adam: params + momentum + variance)
  • Total per GPU: 16Ψ bytes, all replicated (see the worked example after this list)
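
Plugging in a concrete size makes the savings tangible; the numbers below are illustrative (a 7B-parameter model on 8 GPUs) and ignore activation memory:

# Rough per-GPU memory for model states only (mixed-precision Adam)
params = 7e9          # illustrative 7B-parameter model
n_gpus = 8

ddp   = 16 * params                                  # everything replicated
zero1 = (2 + 2) * params + 12 * params / n_gpus      # optimizer states sharded
zero2 = 2 * params + (2 + 12) * params / n_gpus      # + gradients sharded
zero3 = (2 + 2 + 12) * params / n_gpus               # + parameters sharded

print(f"DDP: {ddp/1e9:.0f} GB, ZeRO-1: {zero1/1e9:.0f} GB, "
      f"ZeRO-2: {zero2/1e9:.0f} GB, ZeRO-3: {zero3/1e9:.0f} GB per GPU")
# -> DDP: 112 GB, ZeRO-1: 38 GB, ZeRO-2: 26 GB, ZeRO-3: 14 GB per GPU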

ZeRO Stages

  • ZeRO-1: Partition optimizer states → 4x memory reduction
  • ZeRO-2: + Partition gradients → 8x memory reduction
  • ZeRO-3: + Partition parameters → Linear scaling with GPUs
  • ZeRO-Offload: Offload to CPU RAM/NVMe
  • ZeRO-Infinity: Offload everything, train trillion-parameter models
# DeepSpeed config for ZeRO Stage 2
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8
    },
    "fp16": {"enabled": true},
    "train_batch_size": 32
}

Trade-offs

  • ZeRO-1/2: Minimal communication overhead, use by default
  • ZeRO-3: More communication (all-gather params), but enables huge models
  • Offloading: Slower but allows training on fewer GPUs
Q3
What is Horovod and how does it compare to PyTorch DDP? (Mid)

What is Horovod?

Uber's distributed training framework. Framework-agnostic (TensorFlow, PyTorch, MXNet). Uses ring-allreduce for gradient synchronization.

Key Features

  • Framework agnostic: Same API for TF and PyTorch
  • MPI-based: Leverages battle-tested HPC communication
  • Minimal code changes: Wrap optimizer, done
  • Elastic training: Add/remove workers dynamically
import torch
import torch.optim as optim
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model.cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap optimizer - handles gradient sync
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast initial state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

Horovod vs PyTorch DDP

Aspect            | Horovod            | PyTorch DDP
Performance       | Excellent          | Excellent (often faster)
Framework support | TF, PyTorch, MXNet | PyTorch only
Setup             | Requires MPI       | Built-in
Elastic training  | Yes                | TorchElastic
Recommendation
For PyTorch-only projects, use DDP (native, faster, simpler). For multi-framework environments or existing Horovod infrastructure, Horovod is solid. Both scale to thousands of GPUs.
Q4
How do you handle batch normalization in distributed training? (Senior)

The Problem

Standard BatchNorm computes statistics per-GPU. With small per-GPU batch sizes (large models), statistics become noisy and unstable.

Solutions

  • SyncBatchNorm: Synchronize statistics across all GPUs (PyTorch: nn.SyncBatchNorm)
  • GroupNorm/LayerNorm: Statistics per sample, not per batch - no sync needed
  • Virtual BatchNorm: Use reference batch for statistics
# Convert all BatchNorm to SyncBatchNorm
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])

# Or use GroupNorm instead (no sync overhead)
# Replace: nn.BatchNorm2d(64)
# With:    nn.GroupNorm(num_groups=32, num_channels=64)

Trade-offs

  • SyncBatchNorm: More accurate but adds communication overhead
  • GroupNorm: No overhead, works well for small batches, slightly different results
  • Large effective batch size: Regular BatchNorm often fine if total batch is large
Best Practice
If per-GPU batch ≥ 16, standard BatchNorm is usually fine. For smaller batches (large models), use SyncBatchNorm or switch to GroupNorm/LayerNorm. Modern architectures (Transformers) use LayerNorm anyway.
Section 03

Experiment Tracking (MLflow, W&B, Neptune)

Q1
What is MLflow and what are its main components? (Junior)

MLflow Components

  • MLflow Tracking: Log parameters, metrics, artifacts. Compare experiments.
  • MLflow Projects: Package code for reproducibility (MLproject file)
  • MLflow Models: Standard format for packaging models (multiple flavors)
  • Model Registry: Centralized model store with versioning and stages
import mlflow

# Start experiment
mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 100)

    # Train model...

    # Log metrics
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("loss", 0.05)

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts (plots, data)
    mlflow.log_artifact("confusion_matrix.png")

Key Benefits

  • Open source: No vendor lock-in
  • Self-hosted or managed: Databricks, AWS, Azure offerings
  • Framework agnostic: Works with any ML library
  • Model Registry: Staging → Production workflow
Q2
Compare MLflow, Weights & Biases, and Neptune. When would you use each? (Mid)

Comparison

Feature        | MLflow                | W&B             | Neptune
Hosting        | Self-hosted / Managed | Cloud-first     | Cloud-first
Open Source    | Yes (Apache 2.0)      | Client only     | Client only
UI/UX          | Basic                 | Excellent       | Good
Collaboration  | Basic                 | Excellent       | Good
Model Registry | Built-in              | Yes             | Yes
Price          | Free (self-host)      | Free tier, paid | Free tier, paid

When to Use Each

  • MLflow: Enterprise with data privacy requirements, need self-hosting, Databricks users
  • Weights & Biases: Research teams, need collaboration features, best visualizations, hyperparameter sweeps
  • Neptune: Production ML teams, need extensive metadata tracking, good API
Recommendation
For startups and research: W&B (best UX, free for individuals). For enterprises with compliance needs: MLflow (self-hosted). All three are solid choices - pick based on team needs and budget.
Q3
How do you implement hyperparameter tuning with experiment tracking? (Mid)

Approach 1: Optuna + MLflow

import optuna
import mlflow

def objective(trial):
    with mlflow.start_run(nested=True):
        # Sample hyperparameters
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        n_layers = trial.suggest_int("n_layers", 1, 5)

        mlflow.log_params({"lr": lr, "n_layers": n_layers})

        # Train and evaluate
        accuracy = train_model(lr, n_layers)

        mlflow.log_metric("accuracy", accuracy)
        return accuracy

with mlflow.start_run(run_name="hpo-study"):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)

    mlflow.log_params(study.best_params)

Approach 2: W&B Sweeps

# sweep_config.yaml
method: bayes
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.1
  batch_size:
    values: [16, 32, 64]

# In training script
import wandb

wandb.init()  # the sweep agent injects the sampled hyperparameters
config = wandb.config
# Access: config.learning_rate

Best Practices

  • Use nested runs: Group trials under parent experiment
  • Log early stopping metrics: Prune bad trials early
  • Save best model artifact: Register best trial's model
  • Use Bayesian optimization: More efficient than grid/random search
Section 04

Model Versioning & Registry

Q1
What is DVC (Data Version Control) and how does it work with Git? (Junior)

What is DVC?

DVC is Git for data and models. It tracks large files, datasets, and ML models alongside your code without storing them in Git.

How It Works

  • .dvc files: Small pointer files stored in Git (contain hash of actual data)
  • Remote storage: Actual data stored in S3, GCS, Azure, or local
  • Git integration: Data versions linked to code commits
# Initialize DVC in a Git repo
dvc init

# Track a large dataset
dvc add data/training_data.csv
git add data/training_data.csv.dvc data/.gitignore
git commit -m "Add training data"

# Configure remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store

# Push data to remote
dvc push

# Checkout data for specific Git commit
git checkout v1.0
dvc checkout  # Fetches matching data version

Key Benefits

  • Reproducibility: Exact data + code combination for any commit
  • Storage efficiency: Content-addressed cache deduplicates identical files across versions
  • Pipelines: Define and version ML pipelines (dvc.yaml)
Q2
Explain the MLflow Model Registry and its stage transitions. (Mid)

Model Registry Concepts

  • Registered Model: Named entity grouping model versions
  • Model Version: Specific iteration with artifacts, metrics, lineage
  • Stages: None → Staging → Production → Archived
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Register model from a run
result = mlflow.register_model(
    "runs:/<run_id>/model",
    "fraud-detection-model"
)

# Transition to staging
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=1,
    stage="Staging"
)

# After validation, promote to production
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=1,
    stage="Production",
    archive_existing_versions=True  # Archive old prod version
)

# Load production model for serving
model = mlflow.pyfunc.load_model(
    "models:/fraud-detection-model/Production"
)

Stage Workflow

  • None: Just registered, not validated
  • Staging: Under testing, A/B testing, shadow mode
  • Production: Serving live traffic
  • Archived: Old versions kept for rollback
Best Practice
Automate stage transitions with CI/CD. Run validation tests before Staging→Production. Keep archived versions for quick rollback.
Q3
How do you version ML models in production? What metadata should be tracked? (Mid)

Versioning Strategies

  • Semantic versioning: major.minor.patch (1.2.3)
  • Date-based: 2024-01-15-v1
  • Git SHA: Link to exact code commit
  • Experiment ID: Link to training run

Essential Metadata

  • Training data: Dataset version, hash, sample count, date range
  • Code: Git commit SHA, branch, repo URL
  • Environment: Python version, package versions (requirements.txt hash)
  • Hyperparameters: All training configuration
  • Metrics: Training/validation scores, evaluation results
  • Lineage: Parent model (for fine-tuning), training run ID
# Model metadata example (stored with model)
{
    "model_version": "2.1.0",
    "git_sha": "a1b2c3d4",
    "training_run_id": "mlflow-run-xyz",
    "dataset": {
        "name": "transactions_v3",
        "hash": "sha256:abc123...",
        "rows": 1000000,
        "date_range": ["2023-01-01", "2023-12-31"]
    },
    "metrics": {
        "auc_roc": 0.95,
        "precision": 0.92
    },
    "created_at": "2024-01-15T10:30:00Z",
    "created_by": "training-pipeline"
}
Key Insight
You should be able to reproduce any model from its metadata alone. If you can't answer "what data and code produced this model?", your versioning is incomplete.
Section 05

Model Serving Frameworks

Q1
Compare TensorFlow Serving, TorchServe, and Triton Inference Server. (Mid)

Comparison Overview

Feature           | TF Serving       | TorchServe       | Triton
Frameworks        | TensorFlow only  | PyTorch only     | TF, PyTorch, ONNX, TensorRT
Batching          | Dynamic batching | Dynamic batching | Advanced dynamic batching
Model Management  | Version policies | MAR archives     | Model repository
GPU Optimization  | Good             | Good             | Excellent (NVIDIA)
Concurrent Models | Yes              | Yes              | Best (GPU sharing)

When to Use Each

  • TF Serving: TensorFlow models, need battle-tested serving, gRPC preferred
  • TorchServe: PyTorch models, need custom handlers, AWS integration
  • Triton: Multi-framework, need maximum GPU efficiency, ensemble models
Production Recommendation
For heterogeneous model serving at scale, Triton is the best choice. It handles multiple frameworks, optimizes GPU utilization, and supports model ensembles natively.
Q2
What is ONNX and why is it important for model deployment? (Junior)

What is ONNX?

Open Neural Network Exchange - an open format for representing ML models. Enables interoperability between frameworks.

Key Benefits

  • Framework agnostic: Train in PyTorch, deploy with TensorRT
  • Optimization: ONNX Runtime optimizes for target hardware
  • Portability: Same model runs on cloud, edge, mobile
  • Ecosystem: Wide tooling support (converters, optimizers)
# Export PyTorch model to ONNX
import torch

model.eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    },
    opset_version=14
)

# Run with ONNX Runtime
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"input": input_array})

Common Use Cases

  • Edge deployment: Convert PyTorch → ONNX → TensorRT for Jetson
  • Mobile: ONNX → CoreML (iOS) or TFLite (Android)
  • Standardization: Single format for model artifacts
Q3
How do you implement A/B testing for ML models in production? (Senior)

A/B Testing Architecture

  • Traffic splitting: Route percentage of requests to each model version
  • Consistent assignment: Same user always sees same model (by user ID hash)
  • Metrics collection: Track business KPIs per variant
  • Statistical analysis: Determine winner with significance
# Simple traffic splitting with feature flags
import hashlib

def get_model_variant(user_id, experiment_config):
    # Consistent hashing for user assignment
    hash_val = int(hashlib.md5(
        f"{user_id}_{experiment_config['name']}".encode()
    ).hexdigest(), 16)

    bucket = hash_val % 100

    if bucket < experiment_config['control_percentage']:
        return "control", load_model("v1")
    else:
        return "treatment", load_model("v2")

# Log for analysis
def predict_with_logging(user_id, features):
    variant, model = get_model_variant(user_id, config)
    prediction = model.predict(features)

    log_prediction(
        user_id=user_id,
        variant=variant,
        prediction=prediction,
        timestamp=now()
    )
    return prediction

Key Considerations

  • Sample size: Calculate required samples for statistical power (a minimal significance-test sketch follows this list)
  • Guardrail metrics: Monitor for regressions in critical metrics
  • Ramp-up: Start with small percentage, increase gradually
  • Shadow mode: Run new model without affecting users first
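
For the statistical analysis step, a pooled two-proportion z-test is often enough for conversion-style metrics. This is a hand-rolled sketch with made-up counts, not any particular library's API:

import numpy as np
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    # Pooled two-proportion z-test on conversion counts
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided
    return z, p_value

z, p = two_proportion_ztest(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
# Evaluate against the pre-registered sample size and alpha - don't peek repeatedly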

Tools

  • Istio/Envoy: Service mesh traffic splitting
  • LaunchDarkly/Unleash: Feature flag platforms
  • Seldon/KServe: Built-in canary deployments
Statistical Note
Don't peek at results early! Pre-register your sample size and decision criteria. Use sequential testing if you need early stopping.
Q4
What is dynamic batching and why is it important for inference? (Mid)

The Problem

GPUs are optimized for parallel processing. Single requests underutilize GPU. But waiting too long for batches increases latency.

Dynamic Batching

Automatically groups incoming requests into batches based on:

  • Max batch size: Upper limit on batch
  • Max delay: Maximum time to wait for more requests
  • Preferred batch sizes: Optimize for specific sizes
# Triton config.pbtxt example
dynamic_batching {
    preferred_batch_size: [4, 8, 16]
    max_queue_delay_microseconds: 100
}

instance_group [
    {
        count: 2
        kind: KIND_GPU
    }
]

Trade-offs

  • Throughput vs Latency: Larger batches = higher throughput, longer wait
  • Memory: Larger batches need more GPU memory
  • Padding overhead: Variable-length inputs need padding
Tuning Tip
Start with max_delay = p50 latency target / 10. Increase batch size until GPU utilization is high but latency SLA is met. Profile under realistic load.
Section 06

Model Optimization & Compression

Q1
Explain quantization and its types (PTQ vs QAT). (Mid)

What is Quantization?

Converting model weights and activations from floating-point (FP32) to lower precision (INT8, FP16). Reduces model size and speeds up inference.

Post-Training Quantization (PTQ)

  • Process: Quantize after training using calibration data
  • Pros: Fast, no retraining needed
  • Cons: May lose accuracy, especially for sensitive models
  • Best for: Large models, CNNs, when time is limited

Quantization-Aware Training (QAT)

  • Process: Simulate quantization during training
  • Pros: Better accuracy preservation
  • Cons: Requires retraining, more complex
  • Best for: Accuracy-critical applications, smaller models
# PyTorch Post-Training Quantization
import torch.quantization

# Prepare model
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = torch.quantization.prepare(model)

# Calibrate with representative data
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

# Convert to quantized model
model_quantized = torch.quantization.convert(model_prepared)

# Size reduction: ~4x (FP32 → INT8)
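
A QAT counterpart to the PTQ example above might look like this sketch (eager-mode quantization API; model, criterion, optimizer, and train_loader are assumed to exist):

# PyTorch Quantization-Aware Training (sketch)
import torch

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)

# Fine-tune with fake-quantization ops inserted into the model
for inputs, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model_prepared(inputs), labels)
    loss.backward()
    optimizer.step()

model_prepared.eval()
model_quantized = torch.quantization.convert(model_prepared)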

Quantization Levels

  • FP16: 2x size reduction, minimal accuracy loss, wide hardware support
  • INT8: 4x size reduction, may need calibration, faster on CPUs
  • INT4/GPTQ: 8x reduction, for LLMs, requires careful tuning
Q2
What is knowledge distillation and when would you use it? (Mid)

Concept

Train a smaller "student" model to mimic a larger "teacher" model. Student learns from teacher's soft predictions (logits), not just hard labels.

Why Soft Labels Help

  • More information: Teacher's logits encode relationships between classes
  • Example: "Cat" image might have teacher output [cat: 0.9, dog: 0.08, bird: 0.02] - student learns cats look more like dogs than birds
  • Regularization: Soft targets provide smoother training signal
# Knowledge Distillation Loss
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                        temperature=4.0, alpha=0.7):
    # Soft targets from teacher (with temperature)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction='batchmean'
    ) * (temperature ** 2)

    # Hard targets (ground truth)
    hard_loss = F.cross_entropy(student_logits, labels)

    # Weighted combination
    return alpha * soft_loss + (1 - alpha) * hard_loss

When to Use

  • Edge deployment: Need smaller model for mobile/IoT
  • Latency requirements: Student can be 10x faster
  • Ensemble compression: Distill ensemble into single model
  • Proprietary models: Can't deploy teacher, only student
Pro Tip
Temperature (T) controls softness. Higher T = softer probabilities = more knowledge transfer. Start with T=4, tune based on results. Common alpha values: 0.5-0.9.
Q3
Explain pruning techniques for neural networks. (Mid)

What is Pruning?

Removing unnecessary weights or neurons from a neural network to reduce size and computation while maintaining accuracy.

Pruning Types

  • Unstructured (Weight) Pruning: Remove individual weights (sparse matrices)
  • Structured Pruning: Remove entire neurons, channels, or layers
  • Magnitude-based: Remove smallest weights
  • Gradient-based: Remove weights with smallest gradients
# PyTorch Pruning Example
import torch.nn.utils.prune as prune

# Prune 30% of weights in a layer
prune.l1_unstructured(
    module=model.fc1,
    name='weight',
    amount=0.3
)

# Structured pruning (remove channels)
prune.ln_structured(
    module=model.conv1,
    name='weight',
    amount=0.2,
    n=2,  # L2 norm
    dim=0  # Prune output channels
)

# Make pruning permanent
prune.remove(model.fc1, 'weight')

Pruning Workflow

  1. Train full model to convergence
  2. Prune (remove weights)
  3. Fine-tune (recover accuracy)
  4. Repeat (iterative pruning often better)

Practical Considerations

  • Unstructured: Higher sparsity possible, but needs sparse hardware/software
  • Structured: Actual speedup on standard hardware, but less aggressive
  • Typical results: 50-90% weights removed with <1% accuracy drop
Section 07

ML Pipelines & Orchestration

Q1
Compare Kubeflow Pipelines, Airflow, and Prefect for ML workflows. (Mid)

Overview

Aspect            | Kubeflow                          | Airflow              | Prefect
Focus             | ML-native                         | General data         | Modern data/ML
Infrastructure    | Kubernetes required               | Flexible             | Flexible
Learning Curve    | Steep                             | Medium               | Easy
ML Features       | Built-in (experiments, artifacts) | Via plugins          | Good integration
Dynamic Workflows | Yes                               | Limited (2.0 better) | Excellent

When to Use Each

  • Kubeflow Pipelines: Full ML platform on Kubernetes, need experiment tracking, caching, artifact management built-in
  • Airflow: Already using for data pipelines, need mature scheduling, large ecosystem
  • Prefect: Python-first, need dynamic workflows, modern API, quick setup
# Prefect Example - Clean Python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from prefect import flow, task

@task
def load_data():
    return pd.read_csv("data.csv")

@task
def train_model(data):
    model = RandomForestClassifier()
    model.fit(data.drop(columns=["label"]), data["label"])  # assumes a "label" column
    return model

@flow
def ml_pipeline():
    data = load_data()
    model = train_model(data)
    return model

# Just run it!
ml_pipeline()
Q2
What is Metaflow and why did Netflix create it? (Mid)

What is Metaflow?

A human-centric ML framework from Netflix. Focuses on data scientist productivity rather than infrastructure complexity.

Design Philosophy

  • "Write once, run anywhere": Same code runs locally and on AWS Batch/Step Functions
  • Versioning built-in: Every run's data, code, and artifacts automatically versioned
  • Failure handling: Resume from failed steps, not from scratch
  • No YAML/configs: Pure Python decorators
from metaflow import FlowSpec, step, Parameter

class TrainingFlow(FlowSpec):
    learning_rate = Parameter('lr', default=0.01)

    @step
    def start(self):
        self.data = load_data()
        self.next(self.train)

    @step
    def train(self):
        self.model = train(self.data, lr=self.learning_rate)
        self.next(self.end)

    @step
    def end(self):
        print(f"Done! Model: {self.model}")

# Run locally
# python flow.py run

# Run on AWS Batch
# python flow.py run --with batch

Key Features

  • @retry decorator: Automatic retries on failure
  • @resources: Specify CPU/GPU/memory per step
  • Artifacts: self.x automatically versioned and accessible
  • Client API: Access past runs' data programmatically
Why Netflix Created It
Data scientists were spending 80% of time on infrastructure, 20% on ML. Metaflow flips this. It's opinionated about infrastructure so you don't have to be.
Q3
How do you handle pipeline caching and artifact management? (Mid)

Why Caching Matters

  • Cost: Don't reprocess same data repeatedly
  • Time: Skip expensive steps when inputs unchanged
  • Iteration speed: Faster experimentation

Caching Strategies

  • Input-based: Cache key = hash of inputs (Kubeflow default)
  • Code + input: Invalidate when code changes too
  • Time-based: Force refresh after TTL
# Kubeflow Pipeline with caching
from kfp import dsl

@dsl.component
def preprocess(data_path: str) -> str:
    # Cached based on data_path
    ...

@dsl.pipeline
def my_pipeline():
    preprocess_task = preprocess(data_path="s3://...")
    # Enable caching
    preprocess_task.set_caching_options(enable_caching=True)

Artifact Management

  • Storage: S3, GCS, or artifact store (MLflow, W&B)
  • Naming: Include run ID, timestamp, hash in artifact names
  • Metadata: Store lineage (what inputs produced this?)
  • Cleanup: Retention policies for old artifacts
Best Practice
Cache aggressively but invalidate correctly. Use content-addressable storage (hash-based names). Always log which cached artifacts were used for reproducibility.
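
A minimal sketch of hash-based (content-addressable) artifact naming; the path layout here is made up:

import hashlib
from pathlib import Path

def content_address(path: str) -> str:
    # Identical content always maps to the same key, so cache hits are safe
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return f"artifacts/{digest[:16]}/{Path(path).name}"

key = content_address("model.pkl")  # log this key with the run for lineage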
Section 08

Feature Stores

Q1
What is a feature store and why do you need one? (Junior)

What is a Feature Store?

A centralized repository for storing, managing, and serving ML features. It bridges the gap between data engineering and data science.

Problems It Solves

  • Training-Serving Skew: Same feature computation for training and inference (see the sketch after this list)
  • Feature Duplication: Teams recomputing same features
  • Point-in-Time Correctness: Get features as they were at prediction time (no data leakage)
  • Low-Latency Serving: Pre-computed features for real-time inference
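
A toy illustration of avoiding training-serving skew by defining a feature once and reusing it in both paths (the dataframe and user record are hypothetical):

import pandas as pd

def days_since_signup(now, signup_ts):
    # Single feature definition shared by the training and serving code paths
    return (now - signup_ts).days

# Offline: batch-compute the feature for a training dataframe
train_df["days_since_signup"] = train_df["signup_ts"].apply(
    lambda ts: days_since_signup(pd.Timestamp("2024-01-15"), ts)
)

# Online: the same function at request time - no reimplementation, no skew
feature = days_since_signup(pd.Timestamp.now(), user["signup_ts"])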

Architecture Components

  • Offline Store: Historical features for training (data warehouse, Parquet)
  • Online Store: Latest features for serving (Redis, DynamoDB)
  • Feature Registry: Metadata, lineage, documentation
  • Transformation Engine: Compute features from raw data
When You Need One
Multiple ML models sharing features, real-time serving requirements, feature reuse across teams, or issues with training-serving skew. For single model projects, often overkill.
Q2
Compare Feast, Tecton, and Databricks Feature Store. (Mid)

Comparison

Feature              | Feast        | Tecton        | Databricks
Type                 | Open source  | Managed       | Managed (Unity Catalog)
Real-time transforms | Limited      | Excellent     | Good
Streaming            | Basic        | Native        | Spark Streaming
Setup                | Self-managed | Fully managed | Databricks-managed
Cost                 | Free + infra | $$$$          | Databricks pricing

When to Use Each

  • Feast: Budget-conscious, simple batch features, want open source, have engineering capacity
  • Tecton: Real-time features critical, need streaming, want managed service, enterprise budget
  • Databricks: Already on Databricks, want integrated experience, batch-first workflows
# Feast Example
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get training data with point-in-time join
training_df = store.get_historical_features(
    entity_df=entities_with_timestamps,
    features=[
        "user_features:age",
        "user_features:total_purchases",
        "product_features:price"
    ]
).to_df()

# Get online features for serving
features = store.get_online_features(
    features=[...],
    entity_rows=[{"user_id": 123}]
).to_dict()
Q3
Explain point-in-time joins and why they matter for ML. (Senior)

The Problem

When creating training data, you need features as they were at prediction time, not as they are now. Otherwise, you're leaking future information.

Example

Predicting fraud for a transaction on Jan 15:

  • Wrong: Use user's current purchase count (includes Jan 16-31)
  • Right: Use purchase count as of Jan 14 (before prediction)

Point-in-Time Join

For each entity + event timestamp, find the most recent feature values before that timestamp.

# Entities (what we're predicting for)
entities = [
    {"user_id": 1, "event_time": "2024-01-15 10:00"},
    {"user_id": 1, "event_time": "2024-01-20 14:00"},
]

# Features (change over time)
features = [
    {"user_id": 1, "feature_time": "2024-01-10", "purchases": 5},
    {"user_id": 1, "feature_time": "2024-01-18", "purchases": 8},
]

# Point-in-time join result:
# user_id=1, event=Jan15 → purchases=5 (from Jan10)
# user_id=1, event=Jan20 → purchases=8 (from Jan18)
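
In pandas, this join can be sketched with merge_asof, which picks the most recent feature row at or before each event timestamp (column names follow the toy data above):

import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1],
    "event_time": pd.to_datetime(["2024-01-15 10:00", "2024-01-20 14:00"]),
})
feats = pd.DataFrame({
    "user_id": [1, 1],
    "feature_time": pd.to_datetime(["2024-01-10", "2024-01-18"]),
    "purchases": [5, 8],
})

training_df = pd.merge_asof(
    events.sort_values("event_time"),
    feats.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",  # only consider feature values from before the event
)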

Why It's Hard

  • Complex joins: Not a simple SQL join - need ASOF semantics
  • Scale: Can be expensive with many entities and features
  • Multiple feature tables: Each with different update frequencies
Critical Insight
Data leakage from incorrect time handling is one of the most common ML bugs. Models look great in offline evaluation but fail in production. Feature stores automate point-in-time correctness.
Section 09

Model Monitoring & Observability

Q1
What is model drift and what types exist? (Junior)

What is Model Drift?

Model performance degradation over time due to changes in data patterns, user behavior, or the underlying phenomenon being modeled.

Types of Drift

  • Data Drift (Covariate Shift): Input distribution changes. P(X) changes, but P(Y|X) stays same. Example: New user demographics.
  • Concept Drift: Relationship between inputs and outputs changes. P(Y|X) changes. Example: Fraud patterns evolve.
  • Label Drift: Target distribution changes. P(Y) changes. Example: Seasonal purchase patterns.
  • Upstream Data Changes: Schema changes, missing features, new categories.

Detection Methods

  • Statistical tests: KS test, Chi-squared, PSI (Population Stability Index); a PSI sketch follows this list
  • Distribution comparison: Compare feature histograms over time
  • Performance monitoring: Track accuracy/precision if labels available
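
A minimal PSI implementation for a single numeric feature (a sketch; the bin count and thresholds are conventions, not a standard API):

import numpy as np

def population_stability_index(expected, actual, bins=10):
    # Bin by quantiles of the reference (training) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], np.min(actual))    # widen outer bins to cover current data
    edges[-1] = max(edges[-1], np.max(actual))
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift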
Real-World Impact
COVID-19 caused massive concept drift across industries. Models trained on 2019 data failed in 2020. Always monitor and be ready to retrain.
Q2
Compare Evidently, WhyLabs, and Great Expectations for ML monitoring. (Mid)

Tool Comparison

Feature         | Evidently            | WhyLabs               | Great Expectations
Focus           | ML monitoring, drift | ML observability      | Data quality
Type            | Open source          | Managed + open source | Open source
Drift Detection | Excellent            | Excellent             | Basic
Real-time       | Batch + real-time    | Real-time native      | Batch
Reports         | Beautiful HTML       | Dashboard             | Data docs
# Evidently - Drift Report
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(
    reference_data=train_df,
    current_data=production_df
)
report.save_html("drift_report.html")

# Check drift programmatically
result = report.as_dict()
if result["metrics"][0]["result"]["dataset_drift"]:
    trigger_retraining()

When to Use Each

  • Evidently: Data science teams, need quick drift detection, beautiful reports
  • WhyLabs: Production ML systems, need real-time monitoring, alerting
  • Great Expectations: Data engineering focus, pipeline data quality, schema validation
Q3
What metrics should you monitor for ML models in production? (Mid)

Model Performance Metrics

  • Online metrics: Accuracy, precision, recall (if labels available)
  • Proxy metrics: CTR, conversion rate, engagement (business KPIs)
  • Prediction distribution: Score histograms, class balance

Data Quality Metrics

  • Feature statistics: Mean, std, min, max, nulls per feature
  • Distribution drift: PSI, KL divergence, KS statistic
  • Schema compliance: Types, ranges, cardinality

Operational Metrics

  • Latency: p50, p95, p99 inference time
  • Throughput: Requests per second
  • Error rates: 4xx, 5xx, timeout rates
  • Resource usage: GPU/CPU utilization, memory
# Prometheus metrics for ML serving
from prometheus_client import Histogram, Counter, Gauge

# Latency histogram
INFERENCE_LATENCY = Histogram(
    'model_inference_seconds',
    'Time spent on inference',
    buckets=[.01, .025, .05, .1, .25, .5, 1]
)

# Prediction counter by class
PREDICTIONS = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model_version', 'predicted_class']
)

# Feature value gauge (for drift detection)
FEATURE_MEAN = Gauge(
    'feature_mean',
    'Rolling mean of feature',
    ['feature_name']
)
Monitoring Strategy
Start with operational metrics (can catch issues immediately). Add data drift detection (catches problems before they affect users). Add performance metrics when ground truth is available (may be delayed).
Q4
How do you set up alerting for ML model degradation? (Senior)

Alert Categories

  • Immediate (P0): Model serving errors, latency spikes, complete failures
  • Urgent (P1): Significant drift detected, performance drop >10%
  • Warning (P2): Gradual drift trends, minor performance changes

Alert Design Principles

  • Avoid alert fatigue: Too many alerts = ignored alerts
  • Use anomaly detection: Dynamic thresholds beat static ones
  • Multi-signal alerts: Combine metrics to reduce false positives
  • Actionable alerts: Include runbook links, context
# Example Prometheus alerting rules
groups:
  - name: ml_model_alerts
    rules:
      # High latency alert
      - alert: ModelLatencyHigh
        expr: histogram_quantile(0.95, rate(model_inference_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model p95 latency > 500ms"

      # Drift alert
      - alert: DataDriftDetected
        expr: feature_psi > 0.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "PSI > 0.25 indicates significant drift"
          runbook: "https://wiki/ml-drift-runbook"

      # Accuracy drop
      - alert: ModelAccuracyDrop
        expr: (model_accuracy - model_accuracy offset 7d) < -0.05
        for: 30m
        labels:
          severity: critical

Response Playbook

  1. Triage: Is it data, model, or infrastructure?
  2. Rollback: Can we switch to previous model version?
  3. Investigate: Check feature distributions, upstream changes
  4. Remediate: Retrain, fix data pipeline, or accept degradation
Section 10

ML Testing & Validation

Q1
What types of tests should you have for ML systems? (Mid)

Testing Pyramid for ML

  • Unit Tests: Test individual functions (preprocessing, feature engineering)
  • Data Tests: Validate data quality, schema, distributions
  • Model Tests: Validate model behavior and performance
  • Integration Tests: Test end-to-end pipeline
  • A/B Tests: Validate in production

Data Tests

# Great Expectations style data tests
def test_feature_ranges(df):
    assert df["age"].between(0, 120).all()
    assert df["income"].min() >= 0
    assert df["category"].isin(VALID_CATEGORIES).all()

def test_no_nulls_in_critical_features(df):
    critical = ["user_id", "timestamp", "target"]
    assert df[critical].notna().all().all()

def test_feature_distribution(train_df, test_df):
    for col in numerical_features:
        psi = calculate_psi(train_df[col], test_df[col])
        assert psi < 0.25, f"{col} has PSI {psi}"

Model Tests

def test_model_performance_threshold(model, test_data):
    y_pred = model.predict(test_data.X)
    accuracy = accuracy_score(test_data.y, y_pred)
    assert accuracy >= 0.85, f"Accuracy {accuracy} below threshold"

def test_model_not_worse_than_baseline(model, baseline, test_data):
    model_score = model.score(test_data)
    baseline_score = baseline.score(test_data)
    assert model_score >= baseline_score * 0.95

def test_model_invariance(model):
    # Prediction shouldn't change for semantically identical inputs
    input1 = "The movie was great!"
    input2 = "The movie was great !"  # Extra space
    assert model.predict(input1) == model.predict(input2)
Q2
How do you test for model fairness and bias? (Senior)

Fairness Metrics

  • Demographic Parity: Equal positive prediction rates across groups
  • Equalized Odds: Equal TPR and FPR across groups
  • Predictive Parity: Equal precision across groups
  • Calibration: Predicted probabilities match actual rates per group

Testing Approach

# Using Fairlearn
from fairlearn.metrics import MetricFrame, selection_rate

# Calculate metrics per group
metric_frame = MetricFrame(
    metrics={
        "accuracy": accuracy_score,
        "selection_rate": selection_rate,
        "precision": precision_score
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=test_df["gender"]
)

# Check disparities
print(metric_frame.by_group)
print(f"Selection rate ratio: {metric_frame.ratio()}")

# Assert fairness constraints
def test_demographic_parity(model, data, sensitive_attr):
    groups = data[sensitive_attr].unique()
    rates = {}
    for group in groups:
        mask = data[sensitive_attr] == group
        rates[group] = model.predict(data[mask]).mean()

    ratio = min(rates.values()) / max(rates.values())
    assert ratio >= 0.8, f"Demographic parity ratio {ratio} < 0.8"

Bias Mitigation Strategies

  • Pre-processing: Resampling, reweighting training data
  • In-processing: Add fairness constraints to training (Fairlearn)
  • Post-processing: Adjust thresholds per group
Important Consideration
Different fairness metrics can conflict - you can't satisfy all simultaneously. Choose metrics based on your use case and legal requirements. Document your choices and trade-offs.
Q3
What is shadow deployment and how do you implement it? (Mid)

What is Shadow Deployment?

Running a new model in parallel with production, processing real traffic but not affecting user experience. The new model's predictions are logged but not used.

Benefits

  • Zero risk: Users only see production model results
  • Real data testing: Validate on actual production traffic
  • Performance comparison: Compare latency, predictions, errors
  • Catch issues: Find edge cases before they affect users
# Shadow deployment pattern
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)

async def predict_with_shadow(features):
    # Production prediction (blocking, returned to user)
    prod_result = production_model.predict(features)

    # Shadow prediction (async, logged only)
    def shadow_predict():
        try:
            shadow_result = shadow_model.predict(features)
            log_shadow_prediction(
                features=features,
                prod=prod_result,
                shadow=shadow_result,
                match=prod_result == shadow_result
            )
        except Exception as e:
            log_shadow_error(e)

    executor.submit(shadow_predict)
    return prod_result  # Only production result returned

Analysis After Shadow Period

  • Prediction agreement: How often do models agree?
  • Disagreement analysis: What inputs cause different predictions?
  • Latency comparison: Is shadow model faster/slower?
  • Error rates: Any crashes or timeouts?
Implementation Tip
Make shadow predictions async/non-blocking so they don't increase user-facing latency. Log extensively for analysis. Run shadow for days/weeks to capture all edge cases and traffic patterns.
Q4
How do you validate model reproducibility? (Mid)

Sources of Non-Reproducibility

  • Random seeds: Model initialization, data shuffling, dropout
  • Non-deterministic operations: cuDNN, parallel reduction
  • Environment differences: Library versions, hardware
  • Data ordering: Different order = different results

Reproducibility Checklist

# Set all random seeds
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # For full reproducibility (slower)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Test reproducibility
def test_training_reproducibility():
    set_seed(42)
    model1, metrics1 = train_model(data)

    set_seed(42)
    model2, metrics2 = train_model(data)

    assert metrics1 == metrics2
    # Or for floats:
    assert np.allclose(metrics1, metrics2, rtol=1e-5)

Best Practices

  • Version everything: Code (Git), data (DVC), environment (Docker)
  • Log all hyperparameters: Including random seeds
  • Use deterministic operations: Accept performance trade-off
  • Hash inputs/outputs: Verify data pipeline consistency
  • Reproducibility tests: Run same training twice, compare results
Reality Check
Perfect reproducibility is often impossible (GPU non-determinism, floating-point variations). Instead, aim for "close enough" reproducibility - metrics within acceptable tolerance. Document known sources of variance.