MLOps Interview Questions

AI Frameworks, Tools, and Best Practices

38 Questions · 10 Topic Areas · 20+ Frameworks · 3 Difficulty Levels
Section 01

Training Frameworks (PyTorch, TensorFlow, JAX)

Q1
What are the key differences between PyTorch and TensorFlow 2.x? When would you choose one over the other? (Junior)

Key Differences

  • Execution Model: PyTorch uses eager execution by default (define-by-run). TensorFlow 2.x also defaults to eager but supports @tf.function for graph compilation (see the sketch after this list).
  • API Style: PyTorch is more Pythonic and intuitive. TensorFlow has Keras as its high-level API, but lower-level ops can be verbose.
  • Debugging: PyTorch integrates naturally with Python debuggers (pdb). TensorFlow graphs are harder to debug.
  • Deployment: TensorFlow has better production tooling (TF Serving, TFLite, TF.js). PyTorch is catching up with TorchServe and ONNX.
  • Research vs Production: PyTorch dominates research papers. TensorFlow is stronger in enterprise production.
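
To make the execution-model difference concrete, here is a minimal sketch (assuming both libraries are installed; the function and tensor values are illustrative only):

import tensorflow as tf
import torch

# PyTorch: eager by default - each op runs immediately, easy to step through in pdb
x = torch.tensor([1.0, 2.0])
y = x * 2

# TensorFlow 2.x: also eager by default, but @tf.function traces the Python function
# into a graph on the first call and reuses the compiled graph afterwards
@tf.function
def scale(t):
    return t * 2

z = scale(tf.constant([1.0, 2.0]))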

When to Choose

  • PyTorch: Research, prototyping, NLP (Hugging Face), dynamic architectures, when team prefers Pythonic code
  • TensorFlow: Mobile/edge deployment, existing TF infrastructure, TPU training, production-first projects
Interview Tip
Mention that the gap has narrowed significantly. Both are production-ready. The choice often depends on team expertise and existing infrastructure rather than technical superiority.
Q2
Explain the difference between model.eval() and torch.no_grad() in PyTorch. (Junior)

model.eval()

Sets the model to evaluation mode. This affects layers that behave differently during training vs inference:

  • Dropout: Disabled (no random zeroing)
  • BatchNorm: Uses running mean/variance instead of batch statistics
  • Does NOT disable gradient computation

torch.no_grad()

Context manager that disables gradient computation:

  • Saves memory: No need to store intermediate activations for backward pass
  • Faster inference: Skip gradient tape operations
  • Does NOT affect layer behavior
# Correct inference pattern - use BOTH
model.eval()  # Change layer behavior
with torch.no_grad():  # Disable gradients
    outputs = model(inputs)

# Don't forget to switch back for training
model.train()
Key Insight
Always use both together for inference. model.eval() alone still computes gradients (wasting memory). torch.no_grad() alone doesn't fix BatchNorm/Dropout behavior.
Q3
What is JAX and when would you use it over PyTorch/TensorFlow? (Mid)

What is JAX?

JAX is Google's library for high-performance numerical computing. It's NumPy + automatic differentiation + XLA compilation + vectorization.

Key Features

  • Functional paradigm: Pure functions, no hidden state, explicit random keys
  • Transformations: grad (autodiff), jit (compilation), vmap (auto-batching), pmap (parallelization)
  • XLA compilation: Optimized kernels for GPU/TPU
  • Composable: Transformations can be combined freely
import jax.numpy as jnp
from jax import grad, jit, vmap

# Define a loss function
def loss_fn(params, x, y):
    pred = jnp.dot(x, params)
    return jnp.mean((pred - y) ** 2)

# Get gradient function (automatic!)
grad_fn = grad(loss_fn)

# JIT compile for speed
fast_grad = jit(grad_fn)

# Vectorize a per-example function over the batch dimension
# (predict_fn stands in for any single-example prediction function)
batched_pred = vmap(predict_fn)

When to Use JAX

  • Research requiring custom autodiff: Higher-order gradients, Hessians
  • TPU-first projects: JAX has excellent TPU support
  • Scientific computing: Physics simulations, differential equations
  • When you need vmap: Auto-vectorization is powerful

When NOT to Use JAX

  • Need mature ecosystem (Hugging Face, torchvision)
  • Team unfamiliar with functional programming
  • Standard deep learning tasks where PyTorch/TF suffice
Q4
How do you handle GPU memory issues during training? What strategies exist? (Mid)

Immediate Fixes

  • Reduce batch size: Most direct solution, but affects convergence
  • Gradient accumulation: Simulate larger batches without memory cost
  • Mixed precision (FP16/BF16): Halves memory, often faster too
# Gradient Accumulation in PyTorch
accumulation_steps = 4
optimizer.zero_grad()

for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Advanced Techniques

  • Gradient checkpointing: Trade compute for memory by recomputing activations (see the sketch below)
  • Model parallelism: Split model across GPUs (for very large models)
  • Offloading: Move optimizer states to CPU (DeepSpeed ZeRO)
  • 8-bit optimizers: bitsandbytes library reduces optimizer memory
# Mixed Precision Training with PyTorch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, labels in dataloader:
    optimizer.zero_grad()

    with autocast():  # FP16 forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

    scaler.scale(loss).backward()  # Scaled backward
    scaler.step(optimizer)
    scaler.update()
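
Gradient checkpointing, mentioned in the list above, can be sketched with torch.utils.checkpoint; the block structure below is hypothetical:

# Gradient checkpointing: recompute activations in backward instead of storing them
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

blocks = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)])

def forward_with_checkpointing(x, segments=4):
    # Only activations at segment boundaries are kept; the rest are recomputed
    return checkpoint_sequential(blocks, segments, x)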
Pro Tip
Start with mixed precision + gradient accumulation. They're easy to implement and often sufficient. Gradient checkpointing is next. Model parallelism/offloading for very large models only.
Q5
Explain the PyTorch DataLoader and how to optimize data loading for training. (Junior)

DataLoader Basics

DataLoader wraps a Dataset and provides batching, shuffling, and parallel data loading.

from torch.utils.data import DataLoader, Dataset

dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,          # Shuffle for training
    num_workers=4,          # Parallel data loading
    pin_memory=True,        # Faster GPU transfer
    prefetch_factor=2,      # Batches to prefetch per worker
    persistent_workers=True # Keep workers alive between epochs
)

Optimization Strategies

  • num_workers: Start with 4, increase until CPU-bound. Too many causes overhead.
  • pin_memory=True: Uses page-locked (pinned) host memory for faster CPU→GPU transfer
  • persistent_workers=True: Avoids worker restart overhead between epochs
  • prefetch_factor: Load next batches while GPU is computing

Common Issues

  • Slow first epoch: Workers initializing. Use persistent_workers.
  • Memory leak: Large objects in Dataset.__getitem__(). Process data lazily.
  • Bottleneck detection: If GPU utilization stays low during training, data loading is likely the bottleneck.
Rule of Thumb
Set num_workers = number of CPU cores / number of GPUs. Always use pin_memory=True for GPU training. Profile with torch.profiler to find bottlenecks.
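
A quick way to confirm a data-loading bottleneck with torch.profiler (a sketch; model and dataloader are assumed to already exist):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (inputs, labels) in enumerate(dataloader):
        outputs = model(inputs.cuda(non_blocking=True))
        if step >= 10:  # a few steps are enough for a first look
            break

# If most self-CPU time sits in the DataLoader rather than CUDA kernels,
# data loading is the bottleneck
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))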
Section 02

Distributed Training

Q1
What is the difference between Data Parallelism and Model Parallelism? (Mid)

Data Parallelism

Same model replicated across devices, each processes different data batches.

  • How it works: Split batch across GPUs → forward pass → all-reduce gradients → update
  • When to use: Model fits in single GPU memory
  • Scaling: Near-linear with more GPUs (communication overhead exists)
  • Tools: PyTorch DDP, tf.distribute.MirroredStrategy, Horovod

Model Parallelism

Model split across devices, each holds part of the model.

  • Pipeline parallelism: Split by layers (GPU1: layers 1-10, GPU2: layers 11-20)
  • Tensor parallelism: Split individual layers across GPUs
  • When to use: Model too large for single GPU
  • Challenge: Pipeline bubbles cause GPU idle time
  • Tools: DeepSpeed, Megatron-LM, FairScale
# PyTorch DistributedDataParallel (DDP)
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
model = DDP(model, device_ids=[local_rank])

# Training loop unchanged - DDP handles gradient sync
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)  # assuming the wrapped model returns the loss
    loss.backward()      # Gradients automatically synchronized across ranks
    optimizer.step()
Interview Tip
Most production systems use Data Parallelism (DDP). Model Parallelism is for LLMs (GPT, LLaMA) that don't fit on one GPU. Mention ZeRO (DeepSpeed) as a hybrid approach.
Q2
Explain DeepSpeed ZeRO and its different stages. (Senior)

What is ZeRO?

Zero Redundancy Optimizer - partitions model states across GPUs instead of replicating them, dramatically reducing memory per GPU.

Memory Breakdown (Standard DDP)

For a model with Ψ parameters in mixed precision:

  • Parameters (FP16): 2Ψ bytes
  • Gradients (FP16): 2Ψ bytes
  • Optimizer states (FP32): 12Ψ bytes (Adam: params + momentum + variance)
  • Total per GPU: 16Ψ bytes, all replicated (see the worked example after this list)
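
Plugging in a concrete size makes the savings tangible; the numbers below are illustrative (a 7B-parameter model on 8 GPUs) and ignore activation memory:

# Rough per-GPU memory for model states only (mixed-precision Adam)
params = 7e9          # illustrative 7B-parameter model
n_gpus = 8

ddp   = 16 * params                                  # everything replicated
zero1 = (2 + 2) * params + 12 * params / n_gpus      # optimizer states sharded
zero2 = 2 * params + (2 + 12) * params / n_gpus      # + gradients sharded
zero3 = (2 + 2 + 12) * params / n_gpus               # + parameters sharded

print(f"DDP: {ddp/1e9:.0f} GB, ZeRO-1: {zero1/1e9:.0f} GB, "
      f"ZeRO-2: {zero2/1e9:.0f} GB, ZeRO-3: {zero3/1e9:.0f} GB per GPU")
# -> DDP: 112 GB, ZeRO-1: 38 GB, ZeRO-2: 26 GB, ZeRO-3: 14 GB per GPU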

ZeRO Stages

  • ZeRO-1: Partition optimizer states → 4x memory reduction
  • ZeRO-2: + Partition gradients → 8x memory reduction
  • ZeRO-3: + Partition parameters → Linear scaling with GPUs
  • ZeRO-Offload: Offload to CPU RAM/NVMe
  • ZeRO-Infinity: Offload everything, train trillion-parameter models
# DeepSpeed config for ZeRO Stage 2
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8
    },
    "fp16": {"enabled": true},
    "train_batch_size": 32
}

Trade-offs

  • ZeRO-1/2: Minimal communication overhead, use by default
  • ZeRO-3: More communication (all-gather params), but enables huge models
  • Offloading: Slower but allows training on fewer GPUs
Q3
What is Horovod and how does it compare to PyTorch DDP? (Mid)

What is Horovod?

Uber's distributed training framework. Framework-agnostic (TensorFlow, PyTorch, MXNet). Uses ring-allreduce for gradient synchronization.

Key Features

  • Framework agnostic: Same API for TF and PyTorch
  • MPI-based: Leverages battle-tested HPC communication
  • Minimal code changes: Wrap optimizer, done
  • Elastic training: Add/remove workers dynamically
import torch
import torch.optim as optim
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model.cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap optimizer - handles gradient sync
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast initial state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

Horovod vs PyTorch DDP

Aspect            | Horovod            | PyTorch DDP
Performance       | Excellent          | Excellent (often faster)
Framework support | TF, PyTorch, MXNet | PyTorch only
Setup             | Requires MPI       | Built-in
Elastic training  | Yes                | TorchElastic
Recommendation
For PyTorch-only projects, use DDP (native, faster, simpler). For multi-framework environments or existing Horovod infrastructure, Horovod is solid. Both scale to thousands of GPUs.
Q4
How do you handle batch normalization in distributed training? (Senior)

The Problem

Standard BatchNorm computes statistics per-GPU. With small per-GPU batch sizes (large models), statistics become noisy and unstable.

Solutions

  • SyncBatchNorm: Synchronize statistics across all GPUs (PyTorch: nn.SyncBatchNorm)
  • GroupNorm/LayerNorm: Statistics per sample, not per batch - no sync needed
  • Virtual BatchNorm: Use reference batch for statistics
# Convert all BatchNorm to SyncBatchNorm
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])

# Or use GroupNorm instead (no sync overhead)
# Replace: nn.BatchNorm2d(64)
# With:    nn.GroupNorm(num_groups=32, num_channels=64)

Trade-offs

  • SyncBatchNorm: More accurate but adds communication overhead
  • GroupNorm: No overhead, works well for small batches, slightly different results
  • Large effective batch size: Regular BatchNorm often fine if total batch is large
Best Practice
If per-GPU batch ≥ 16, standard BatchNorm is usually fine. For smaller batches (large models), use SyncBatchNorm or switch to GroupNorm/LayerNorm. Modern architectures (Transformers) use LayerNorm anyway.
Section 03

Experiment Tracking (MLflow, W&B, Neptune)

Q1
What is MLflow and what are its main components? (Junior)

MLflow Components

  • MLflow Tracking: Log parameters, metrics, artifacts. Compare experiments.
  • MLflow Projects: Package code for reproducibility (MLproject file)
  • MLflow Models: Standard format for packaging models (multiple flavors)
  • Model Registry: Centralized model store with versioning and stages
import mlflow

# Start experiment
mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 100)

    # Train model...

    # Log metrics
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("loss", 0.05)

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts (plots, data)
    mlflow.log_artifact("confusion_matrix.png")

Key Benefits

  • Open source: No vendor lock-in
  • Self-hosted or managed: Databricks, AWS, Azure offerings
  • Framework agnostic: Works with any ML library
  • Model Registry: Staging → Production workflow
Q2
Compare MLflow, Weights & Biases, and Neptune. When would you use each? (Mid)

Comparison

Feature        | MLflow                | W&B             | Neptune
Hosting        | Self-hosted / Managed | Cloud-first     | Cloud-first
Open Source    | Yes (Apache 2.0)      | Client only     | Client only
UI/UX          | Basic                 | Excellent       | Good
Collaboration  | Basic                 | Excellent       | Good
Model Registry | Built-in              | Yes             | Yes
Price          | Free (self-host)      | Free tier, paid | Free tier, paid

When to Use Each

  • MLflow: Enterprise with data privacy requirements, need self-hosting, Databricks users
  • Weights & Biases: Research teams, need collaboration features, best visualizations, hyperparameter sweeps
  • Neptune: Production ML teams, need extensive metadata tracking, good API
Recommendation
For startups and research: W&B (best UX, free for individuals). For enterprises with compliance needs: MLflow (self-hosted). All three are solid choices - pick based on team needs and budget.
Q3
How do you implement hyperparameter tuning with experiment tracking? (Mid)

Approach 1: Optuna + MLflow

import optuna
import mlflow

def objective(trial):
    with mlflow.start_run(nested=True):
        # Sample hyperparameters
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        n_layers = trial.suggest_int("n_layers", 1, 5)

        mlflow.log_params({"lr": lr, "n_layers": n_layers})

        # Train and evaluate
        accuracy = train_model(lr, n_layers)

        mlflow.log_metric("accuracy", accuracy)
        return accuracy

with mlflow.start_run(run_name="hpo-study"):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)

    mlflow.log_params(study.best_params)

Approach 2: W&B Sweeps

# sweep_config.yaml
method: bayes
metric:
  name: val_accuracy
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.1
  batch_size:
    values: [16, 32, 64]

# In training script
import wandb

wandb.init()  # the sweep agent injects the sampled hyperparameters
config = wandb.config
# Access: config.learning_rate

Best Practices

  • Use nested runs: Group trials under parent experiment
  • Log early stopping metrics: Prune bad trials early
  • Save best model artifact: Register best trial's model
  • Use Bayesian optimization: More efficient than grid/random search
Section 04

Model Versioning & Registry

Q1
What is DVC (Data Version Control) and how does it work with Git? (Junior)

What is DVC?

DVC is Git for data and models. It tracks large files, datasets, and ML models alongside your code without storing them in Git.

How It Works

  • .dvc files: Small pointer files stored in Git (contain hash of actual data)
  • Remote storage: Actual data stored in S3, GCS, Azure, or local
  • Git integration: Data versions linked to code commits
# Initialize DVC in a Git repo
dvc init

# Track a large dataset
dvc add data/training_data.csv
git add data/training_data.csv.dvc data/.gitignore
git commit -m "Add training data"

# Configure remote storage
dvc remote add -d myremote s3://my-bucket/dvc-store

# Push data to remote
dvc push

# Checkout data for specific Git commit
git checkout v1.0
dvc checkout  # Fetches matching data version

Key Benefits

  • Reproducibility: Exact data + code combination for any commit
  • Storage efficiency: Content-addressed cache deduplicates identical files across versions
  • Pipelines: Define and version ML pipelines (dvc.yaml)
Q2
Explain the MLflow Model Registry and its stage transitions. (Mid)

Model Registry Concepts

  • Registered Model: Named entity grouping model versions
  • Model Version: Specific iteration with artifacts, metrics, lineage
  • Stages: None → Staging → Production → Archived
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Register model from a run
result = mlflow.register_model(
    "runs:/<run_id>/model",
    "fraud-detection-model"
)

# Transition to staging
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=1,
    stage="Staging"
)

# After validation, promote to production
client.transition_model_version_stage(
    name="fraud-detection-model",
    version=1,
    stage="Production",
    archive_existing_versions=True  # Archive old prod version
)

# Load production model for serving
model = mlflow.pyfunc.load_model(
    "models:/fraud-detection-model/Production"
)

Stage Workflow

  • None: Just registered, not validated
  • Staging: Under testing, A/B testing, shadow mode
  • Production: Serving live traffic
  • Archived: Old versions kept for rollback
Best Practice
Automate stage transitions with CI/CD. Run validation tests before Staging→Production. Keep archived versions for quick rollback.
Q3
How do you version ML models in production? What metadata should be tracked? (Mid)

Versioning Strategies

  • Semantic versioning: major.minor.patch (1.2.3)
  • Date-based: 2024-01-15-v1
  • Git SHA: Link to exact code commit
  • Experiment ID: Link to training run

Essential Metadata

  • Training data: Dataset version, hash, sample count, date range
  • Code: Git commit SHA, branch, repo URL
  • Environment: Python version, package versions (requirements.txt hash)
  • Hyperparameters: All training configuration
  • Metrics: Training/validation scores, evaluation results
  • Lineage: Parent model (for fine-tuning), training run ID
# Model metadata example (stored with model)
{
    "model_version": "2.1.0",
    "git_sha": "a1b2c3d4",
    "training_run_id": "mlflow-run-xyz",
    "dataset": {
        "name": "transactions_v3",
        "hash": "sha256:abc123...",
        "rows": 1000000,
        "date_range": ["2023-01-01", "2023-12-31"]
    },
    "metrics": {
        "auc_roc": 0.95,
        "precision": 0.92
    },
    "created_at": "2024-01-15T10:30:00Z",
    "created_by": "training-pipeline"
}
Key Insight
You should be able to reproduce any model from its metadata alone. If you can't answer "what data and code produced this model?", your versioning is incomplete.
Section 05

Model Serving Frameworks

Q1
Compare TensorFlow Serving, TorchServe, and Triton Inference Server. (Mid)

Comparison Overview

Feature           | TF Serving       | TorchServe       | Triton
Frameworks        | TensorFlow only  | PyTorch only     | TF, PyTorch, ONNX, TensorRT
Batching          | Dynamic batching | Dynamic batching | Advanced dynamic batching
Model Management  | Version policies | MAR archives     | Model repository
GPU Optimization  | Good             | Good             | Excellent (NVIDIA)
Concurrent Models | Yes              | Yes              | Best (GPU sharing)

When to Use Each

  • TF Serving: TensorFlow models, need battle-tested serving, gRPC preferred
  • TorchServe: PyTorch models, need custom handlers, AWS integration
  • Triton: Multi-framework, need maximum GPU efficiency, ensemble models
Production Recommendation
For heterogeneous model serving at scale, Triton is the best choice. It handles multiple frameworks, optimizes GPU utilization, and supports model ensembles natively.
Q2
What is ONNX and why is it important for model deployment? (Junior)

What is ONNX?

Open Neural Network Exchange - an open format for representing ML models. Enables interoperability between frameworks.

Key Benefits

  • Framework agnostic: Train in PyTorch, deploy with TensorRT
  • Optimization: ONNX Runtime optimizes for target hardware
  • Portability: Same model runs on cloud, edge, mobile
  • Ecosystem: Wide tooling support (converters, optimizers)
# Export PyTorch model to ONNX
import torch

model.eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"}
    },
    opset_version=14
)

# Run with ONNX Runtime
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"input": input_array})

Common Use Cases

  • Edge deployment: Convert PyTorch → ONNX → TensorRT for Jetson
  • Mobile: ONNX → CoreML (iOS) or TFLite (Android)
  • Standardization: Single format for model artifacts
Q3
How do you implement A/B testing for ML models in production? (Senior)

A/B Testing Architecture

  • Traffic splitting: Route percentage of requests to each model version
  • Consistent assignment: Same user always sees same model (by user ID hash)
  • Metrics collection: Track business KPIs per variant
  • Statistical analysis: Determine winner with significance
# Simple traffic splitting with feature flags
import hashlib

def get_model_variant(user_id, experiment_config):
    # Consistent hashing for user assignment
    hash_val = int(hashlib.md5(
        f"{user_id}_{experiment_config['name']}".encode()
    ).hexdigest(), 16)

    bucket = hash_val % 100

    if bucket < experiment_config['control_percentage']:
        return "control", load_model("v1")
    else:
        return "treatment", load_model("v2")

# Log for analysis
def predict_with_logging(user_id, features):
    variant, model = get_model_variant(user_id, config)
    prediction = model.predict(features)

    log_prediction(
        user_id=user_id,
        variant=variant,
        prediction=prediction,
        timestamp=now()
    )
    return prediction

Key Considerations

  • Sample size: Calculate required samples for statistical power (a minimal significance-test sketch follows this list)
  • Guardrail metrics: Monitor for regressions in critical metrics
  • Ramp-up: Start with small percentage, increase gradually
  • Shadow mode: Run new model without affecting users first
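
For the statistical analysis step, a pooled two-proportion z-test is often enough for conversion-style metrics. This is a hand-rolled sketch with made-up counts, not any particular library's API:

import numpy as np
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    # Pooled two-proportion z-test on conversion counts
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided
    return z, p_value

z, p = two_proportion_ztest(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
# Evaluate against the pre-registered sample size and alpha - don't peek repeatedly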

Tools

  • Istio/Envoy: Service mesh traffic splitting
  • LaunchDarkly/Unleash: Feature flag platforms
  • Seldon/KServe: Built-in canary deployments
Statistical Note
Don't peek at results early! Pre-register your sample size and decision criteria. Use sequential testing if you need early stopping.
Q4
What is dynamic batching and why is it important for inference? (Mid)

The Problem

GPUs are optimized for parallel processing. Single requests underutilize GPU. But waiting too long for batches increases latency.

Dynamic Batching

Automatically groups incoming requests into batches based on:

  • Max batch size: Upper limit on batch
  • Max delay: Maximum time to wait for more requests
  • Preferred batch sizes: Optimize for specific sizes
# Triton config.pbtxt example
dynamic_batching {
    preferred_batch_size: [4, 8, 16]
    max_queue_delay_microseconds: 100
}

instance_group [
    {
        count: 2
        kind: KIND_GPU
    }
]

Trade-offs

  • Throughput vs Latency: Larger batches = higher throughput, longer wait
  • Memory: Larger batches need more GPU memory
  • Padding overhead: Variable-length inputs need padding
Tuning Tip
Start with max_delay = p50 latency target / 10. Increase batch size until GPU utilization is high but latency SLA is met. Profile under realistic load.
Section 06

Model Optimization & Compression

Q1
Explain quantization and its types (PTQ vs QAT). (Mid)

What is Quantization?

Converting model weights and activations from floating-point (FP32) to lower precision (INT8, FP16). Reduces model size and speeds up inference.

Post-Training Quantization (PTQ)

  • Process: Quantize after training using calibration data
  • Pros: Fast, no retraining needed
  • Cons: May lose accuracy, especially for sensitive models
  • Best for: Large models, CNNs, when time is limited

Quantization-Aware Training (QAT)

  • Process: Simulate quantization during training
  • Pros: Better accuracy preservation
  • Cons: Requires retraining, more complex
  • Best for: Accuracy-critical applications, smaller models
# PyTorch Post-Training Quantization
import torch.quantization

# Prepare model
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_prepared = torch.quantization.prepare(model)

# Calibrate with representative data
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

# Convert to quantized model
model_quantized = torch.quantization.convert(model_prepared)

# Size reduction: ~4x (FP32 → INT8)
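
A QAT counterpart to the PTQ example above might look like this sketch (eager-mode quantization API; model, criterion, optimizer, and train_loader are assumed to exist):

# PyTorch Quantization-Aware Training (sketch)
import torch

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)

# Fine-tune with fake-quantization ops inserted into the model
for inputs, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model_prepared(inputs), labels)
    loss.backward()
    optimizer.step()

model_prepared.eval()
model_quantized = torch.quantization.convert(model_prepared)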

Quantization Levels

  • FP16: 2x size reduction, minimal accuracy loss, wide hardware support
  • INT8: 4x size reduction, may need calibration, faster on CPUs
  • INT4/GPTQ: 8x reduction, for LLMs, requires careful tuning
Q2
What is knowledge distillation and when would you use it? (Mid)

Concept

Train a smaller "student" model to mimic a larger "teacher" model. Student learns from teacher's soft predictions (logits), not just hard labels.

Why Soft Labels Help

  • More information: Teacher's logits encode relationships between classes
  • Example: "Cat" image might have teacher output [cat: 0.9, dog: 0.08, bird: 0.02] - student learns cats look more like dogs than birds
  • Regularization: Soft targets provide smoother training signal
# Knowledge Distillation Loss
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                        temperature=4.0, alpha=0.7):
    # Soft targets from teacher (with temperature)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction='batchmean'
    ) * (temperature ** 2)

    # Hard targets (ground truth)
    hard_loss = F.cross_entropy(student_logits, labels)

    # Weighted combination
    return alpha * soft_loss + (1 - alpha) * hard_loss

When to Use

  • Edge deployment: Need smaller model for mobile/IoT
  • Latency requirements: Student can be 10x faster
  • Ensemble compression: Distill ensemble into single model
  • Proprietary models: Can't deploy teacher, only student
Pro Tip
Temperature (T) controls softness. Higher T = softer probabilities = more knowledge transfer. Start with T=4, tune based on results. Common alpha values: 0.5-0.9.
Q3
Explain pruning techniques for neural networks. (Mid)

What is Pruning?

Removing unnecessary weights or neurons from a neural network to reduce size and computation while maintaining accuracy.

Pruning Types

  • Unstructured (Weight) Pruning: Remove individual weights (sparse matrices)
  • Structured Pruning: Remove entire neurons, channels, or layers
  • Magnitude-based: Remove smallest weights
  • Gradient-based: Remove weights with smallest gradients
# PyTorch Pruning Example
import torch.nn.utils.prune as prune

# Prune 30% of weights in a layer
prune.l1_unstructured(
    module=model.fc1,
    name='weight',
    amount=0.3
)

# Structured pruning (remove channels)
prune.ln_structured(
    module=model.conv1,
    name='weight',
    amount=0.2,
    n=2,  # L2 norm
    dim=0  # Prune output channels
)

# Make pruning permanent
prune.remove(model.fc1, 'weight')

Pruning Workflow

  1. Train full model to convergence
  2. Prune (remove weights)
  3. Fine-tune (recover accuracy)
  4. Repeat (iterative pruning often better)

Practical Considerations

  • Unstructured: Higher sparsity possible, but needs sparse hardware/software
  • Structured: Actual speedup on standard hardware, but less aggressive
  • Typical results: 50-90% weights removed with <1% accuracy drop
Section 07

ML Pipelines & Orchestration

Q1
Compare Kubeflow Pipelines, Airflow, and Prefect for ML workflows. (Mid)

Overview

Aspect            | Kubeflow                          | Airflow              | Prefect
Focus             | ML-native                         | General data         | Modern data/ML
Infrastructure    | Kubernetes required               | Flexible             | Flexible
Learning Curve    | Steep                             | Medium               | Easy
ML Features       | Built-in (experiments, artifacts) | Via plugins          | Good integration
Dynamic Workflows | Yes                               | Limited (2.0 better) | Excellent

When to Use Each

  • Kubeflow Pipelines: Full ML platform on Kubernetes, need experiment tracking, caching, artifact management built-in
  • Airflow: Already using for data pipelines, need mature scheduling, large ecosystem
  • Prefect: Python-first, need dynamic workflows, modern API, quick setup
# Prefect Example - Clean Python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from prefect import flow, task

@task
def load_data():
    return pd.read_csv("data.csv")

@task
def train_model(data):
    model = RandomForestClassifier()
    model.fit(data.drop(columns=["label"]), data["label"])  # assumes a "label" column
    return model

@flow
def ml_pipeline():
    data = load_data()
    model = train_model(data)
    return model

# Just run it!
ml_pipeline()
Q2
What is Metaflow and why did Netflix create it? (Mid)

What is Metaflow?

A human-centric ML framework from Netflix. Focuses on data scientist productivity rather than infrastructure complexity.

Design Philosophy

  • "Write once, run anywhere": Same code runs locally and on AWS Batch/Step Functions
  • Versioning built-in: Every run's data, code, and artifacts automatically versioned
  • Failure handling: Resume from failed steps, not from scratch
  • No YAML/configs: Pure Python decorators
from metaflow import FlowSpec, step, Parameter

class TrainingFlow(FlowSpec):
    learning_rate = Parameter('lr', default=0.01)

    @step
    def start(self):
        self.data = load_data()
        self.next(self.train)

    @step
    def train(self):
        self.model = train(self.data, lr=self.learning_rate)
        self.next(self.end)

    @step
    def end(self):
        print(f"Done! Model: {self.model}")

# Run locally
# python flow.py run

# Run on AWS Batch
# python flow.py run --with batch

Key Features

  • @retry decorator: Automatic retries on failure
  • @resources: Specify CPU/GPU/memory per step
  • Artifacts: self.x automatically versioned and accessible
  • Client API: Access past runs' data programmatically
Why Netflix Created It
Data scientists were spending 80% of time on infrastructure, 20% on ML. Metaflow flips this. It's opinionated about infrastructure so you don't have to be.
Q3
How do you handle pipeline caching and artifact management? (Mid)

Why Caching Matters

  • Cost: Don't reprocess same data repeatedly
  • Time: Skip expensive steps when inputs unchanged
  • Iteration speed: Faster experimentation

Caching Strategies

  • Input-based: Cache key = hash of inputs (Kubeflow default)
  • Code + input: Invalidate when code changes too
  • Time-based: Force refresh after TTL
# Kubeflow Pipeline with caching
from kfp import dsl

@dsl.component
def preprocess(data_path: str) -> str:
    # Cached based on data_path
    ...

@dsl.pipeline
def my_pipeline():
    preprocess_task = preprocess(data_path="s3://...")
    # Enable caching
    preprocess_task.set_caching_options(enable_caching=True)

Artifact Management

  • Storage: S3, GCS, or artifact store (MLflow, W&B)
  • Naming: Include run ID, timestamp, hash in artifact names
  • Metadata: Store lineage (what inputs produced this?)
  • Cleanup: Retention policies for old artifacts
Best Practice
Cache aggressively but invalidate correctly. Use content-addressable storage (hash-based names). Always log which cached artifacts were used for reproducibility.
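
A minimal sketch of hash-based (content-addressable) artifact naming; the path layout here is made up:

import hashlib
from pathlib import Path

def content_address(path: str) -> str:
    # Identical content always maps to the same key, so cache hits are safe
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return f"artifacts/{digest[:16]}/{Path(path).name}"

key = content_address("model.pkl")  # log this key with the run for lineage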
Section 08

Feature Stores

Q1
What is a feature store and why do you need one? (Junior)

What is a Feature Store?

A centralized repository for storing, managing, and serving ML features. It bridges the gap between data engineering and data science.

Problems It Solves

  • Training-Serving Skew: Same feature computation for training and inference (see the sketch after this list)
  • Feature Duplication: Teams recomputing same features
  • Point-in-Time Correctness: Get features as they were at prediction time (no data leakage)
  • Low-Latency Serving: Pre-computed features for real-time inference
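
A toy illustration of avoiding training-serving skew by defining a feature once and reusing it in both paths (the dataframe and user record are hypothetical):

import pandas as pd

def days_since_signup(now, signup_ts):
    # Single feature definition shared by the training and serving code paths
    return (now - signup_ts).days

# Offline: batch-compute the feature for a training dataframe
train_df["days_since_signup"] = train_df["signup_ts"].apply(
    lambda ts: days_since_signup(pd.Timestamp("2024-01-15"), ts)
)

# Online: the same function at request time - no reimplementation, no skew
feature = days_since_signup(pd.Timestamp.now(), user["signup_ts"])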

Architecture Components

  • Offline Store: Historical features for training (data warehouse, Parquet)
  • Online Store: Latest features for serving (Redis, DynamoDB)
  • Feature Registry: Metadata, lineage, documentation
  • Transformation Engine: Compute features from raw data
When You Need One
Multiple ML models sharing features, real-time serving requirements, feature reuse across teams, or issues with training-serving skew. For single model projects, often overkill.
Q2
Compare Feast, Tecton, and Databricks Feature Store. (Mid)

Comparison

Feature              | Feast        | Tecton        | Databricks
Type                 | Open source  | Managed       | Managed (Unity Catalog)
Real-time transforms | Limited      | Excellent     | Good
Streaming            | Basic        | Native        | Spark Streaming
Setup                | Self-managed | Fully managed | Databricks-managed
Cost                 | Free + infra | $$$$          | Databricks pricing

When to Use Each

  • Feast: Budget-conscious, simple batch features, want open source, have engineering capacity
  • Tecton: Real-time features critical, need streaming, want managed service, enterprise budget
  • Databricks: Already on Databricks, want integrated experience, batch-first workflows
# Feast Example
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Get training data with point-in-time join
training_df = store.get_historical_features(
    entity_df=entities_with_timestamps,
    features=[
        "user_features:age",
        "user_features:total_purchases",
        "product_features:price"
    ]
).to_df()

# Get online features for serving
features = store.get_online_features(
    features=[...],
    entity_rows=[{"user_id": 123}]
).to_dict()
Q3
Explain point-in-time joins and why they matter for ML. (Senior)

The Problem

When creating training data, you need features as they were at prediction time, not as they are now. Otherwise, you're leaking future information.

Example

Predicting fraud for a transaction on Jan 15:

  • Wrong: Use user's current purchase count (includes Jan 16-31)
  • Right: Use purchase count as of Jan 14 (before prediction)

Point-in-Time Join

For each entity + event timestamp, find the most recent feature values before that timestamp.

# Entities (what we're predicting for)
entities = [
    {"user_id": 1, "event_time": "2024-01-15 10:00"},
    {"user_id": 1, "event_time": "2024-01-20 14:00"},
]

# Features (change over time)
features = [
    {"user_id": 1, "feature_time": "2024-01-10", "purchases": 5},
    {"user_id": 1, "feature_time": "2024-01-18", "purchases": 8},
]

# Point-in-time join result:
# user_id=1, event=Jan15 → purchases=5 (from Jan10)
# user_id=1, event=Jan20 → purchases=8 (from Jan18)
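
In pandas, this join can be sketched with merge_asof, which picks the most recent feature row at or before each event timestamp (column names follow the toy data above):

import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1],
    "event_time": pd.to_datetime(["2024-01-15 10:00", "2024-01-20 14:00"]),
})
feats = pd.DataFrame({
    "user_id": [1, 1],
    "feature_time": pd.to_datetime(["2024-01-10", "2024-01-18"]),
    "purchases": [5, 8],
})

training_df = pd.merge_asof(
    events.sort_values("event_time"),
    feats.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",  # only consider feature values from before the event
)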

Why It's Hard

  • Complex joins: Not a simple SQL join - need ASOF semantics
  • Scale: Can be expensive with many entities and features
  • Multiple feature tables: Each with different update frequencies
Critical Insight
Data leakage from incorrect time handling is one of the most common ML bugs. Models look great in offline evaluation but fail in production. Feature stores automate point-in-time correctness.
Section 09

Model Monitoring & Observability

Q1
What is model drift and what types exist? (Junior)

What is Model Drift?

Model performance degradation over time due to changes in data patterns, user behavior, or the underlying phenomenon being modeled.

Types of Drift

  • Data Drift (Covariate Shift): Input distribution changes. P(X) changes, but P(Y|X) stays same. Example: New user demographics.
  • Concept Drift: Relationship between inputs and outputs changes. P(Y|X) changes. Example: Fraud patterns evolve.
  • Label Drift: Target distribution changes. P(Y) changes. Example: Seasonal purchase patterns.
  • Upstream Data Changes: Schema changes, missing features, new categories.

Detection Methods

  • Statistical tests: KS test, Chi-squared, PSI (Population Stability Index); a PSI sketch follows this list
  • Distribution comparison: Compare feature histograms over time
  • Performance monitoring: Track accuracy/precision if labels available
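
A minimal PSI implementation for a single numeric feature (a sketch; the bin count and thresholds are conventions, not a standard API):

import numpy as np

def population_stability_index(expected, actual, bins=10):
    # Bin by quantiles of the reference (training) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], np.min(actual))    # widen outer bins to cover current data
    edges[-1] = max(edges[-1], np.max(actual))
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift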
Real-World Impact
COVID-19 caused massive concept drift across industries. Models trained on 2019 data failed in 2020. Always monitor and be ready to retrain.
Q2
Compare Evidently, WhyLabs, and Great Expectations for ML monitoring. (Mid)

Tool Comparison

Feature         | Evidently            | WhyLabs               | Great Expectations
Focus           | ML monitoring, drift | ML observability      | Data quality
Type            | Open source          | Managed + open source | Open source
Drift Detection | Excellent            | Excellent             | Basic
Real-time       | Batch + real-time    | Real-time native      | Batch
Reports         | Beautiful HTML       | Dashboard             | Data docs
# Evidently - Drift Report
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(
    reference_data=train_df,
    current_data=production_df
)
report.save_html("drift_report.html")

# Check drift programmatically
result = report.as_dict()
if result["metrics"][0]["result"]["dataset_drift"]:
    trigger_retraining()

When to Use Each

  • Evidently: Data science teams, need quick drift detection, beautiful reports
  • WhyLabs: Production ML systems, need real-time monitoring, alerting
  • Great Expectations: Data engineering focus, pipeline data quality, schema validation
Q3
What metrics should you monitor for ML models in production? (Mid)

Model Performance Metrics

  • Online metrics: Accuracy, precision, recall (if labels available)
  • Proxy metrics: CTR, conversion rate, engagement (business KPIs)
  • Prediction distribution: Score histograms, class balance

Data Quality Metrics

  • Feature statistics: Mean, std, min, max, nulls per feature
  • Distribution drift: PSI, KL divergence, KS statistic
  • Schema compliance: Types, ranges, cardinality

Operational Metrics

  • Latency: p50, p95, p99 inference time
  • Throughput: Requests per second
  • Error rates: 4xx, 5xx, timeout rates
  • Resource usage: GPU/CPU utilization, memory
# Prometheus metrics for ML serving
from prometheus_client import Histogram, Counter, Gauge

# Latency histogram
INFERENCE_LATENCY = Histogram(
    'model_inference_seconds',
    'Time spent on inference',
    buckets=[.01, .025, .05, .1, .25, .5, 1]
)

# Prediction counter by class
PREDICTIONS = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model_version', 'predicted_class']
)

# Feature value gauge (for drift detection)
FEATURE_MEAN = Gauge(
    'feature_mean',
    'Rolling mean of feature',
    ['feature_name']
)
Monitoring Strategy
Start with operational metrics (can catch issues immediately). Add data drift detection (catches problems before they affect users). Add performance metrics when ground truth is available (may be delayed).
Q4
How do you set up alerting for ML model degradation? (Senior)

Alert Categories

  • Immediate (P0): Model serving errors, latency spikes, complete failures
  • Urgent (P1): Significant drift detected, performance drop >10%
  • Warning (P2): Gradual drift trends, minor performance changes

Alert Design Principles

  • Avoid alert fatigue: Too many alerts = ignored alerts
  • Use anomaly detection: Dynamic thresholds beat static ones
  • Multi-signal alerts: Combine metrics to reduce false positives
  • Actionable alerts: Include runbook links, context
# Example Prometheus alerting rules
groups:
  - name: ml_model_alerts
    rules:
      # High latency alert
      - alert: ModelLatencyHigh
        expr: histogram_quantile(0.95, rate(model_inference_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model p95 latency > 500ms"

      # Drift alert
      - alert: DataDriftDetected
        expr: feature_psi > 0.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "PSI > 0.25 indicates significant drift"
          runbook: "https://wiki/ml-drift-runbook"

      # Accuracy drop
      - alert: ModelAccuracyDrop
        expr: (model_accuracy - model_accuracy offset 7d) < -0.05
        for: 30m
        labels:
          severity: critical

Response Playbook

  1. Triage: Is it data, model, or infrastructure?
  2. Rollback: Can we switch to previous model version?
  3. Investigate: Check feature distributions, upstream changes
  4. Remediate: Retrain, fix data pipeline, or accept degradation
Section 10

ML Testing & Validation

Q1
What types of tests should you have for ML systems? (Mid)

Testing Pyramid for ML

  • Unit Tests: Test individual functions (preprocessing, feature engineering)
  • Data Tests: Validate data quality, schema, distributions
  • Model Tests: Validate model behavior and performance
  • Integration Tests: Test end-to-end pipeline
  • A/B Tests: Validate in production

Data Tests

# Great Expectations style data tests
def test_feature_ranges(df):
    assert df["age"].between(0, 120).all()
    assert df["income"].min() >= 0
    assert df["category"].isin(VALID_CATEGORIES).all()

def test_no_nulls_in_critical_features(df):
    critical = ["user_id", "timestamp", "target"]
    assert df[critical].notna().all().all()

def test_feature_distribution(train_df, test_df):
    for col in numerical_features:
        psi = calculate_psi(train_df[col], test_df[col])
        assert psi < 0.25, f"{col} has PSI {psi}"

Model Tests

def test_model_performance_threshold(model, test_data):
    y_pred = model.predict(test_data.X)
    accuracy = accuracy_score(test_data.y, y_pred)
    assert accuracy >= 0.85, f"Accuracy {accuracy} below threshold"

def test_model_not_worse_than_baseline(model, baseline, test_data):
    model_score = model.score(test_data)
    baseline_score = baseline.score(test_data)
    assert model_score >= baseline_score * 0.95

def test_model_invariance(model):
    # Prediction shouldn't change for semantically identical inputs
    input1 = "The movie was great!"
    input2 = "The movie was great !"  # Extra space
    assert model.predict(input1) == model.predict(input2)
Q2
How do you test for model fairness and bias? (Senior)

Fairness Metrics

  • Demographic Parity: Equal positive prediction rates across groups
  • Equalized Odds: Equal TPR and FPR across groups
  • Predictive Parity: Equal precision across groups
  • Calibration: Predicted probabilities match actual rates per group

Testing Approach

# Using Fairlearn
from fairlearn.metrics import MetricFrame, selection_rate

# Calculate metrics per group
metric_frame = MetricFrame(
    metrics={
        "accuracy": accuracy_score,
        "selection_rate": selection_rate,
        "precision": precision_score
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=test_df["gender"]
)

# Check disparities
print(metric_frame.by_group)
print(f"Selection rate ratio: {metric_frame.ratio()}")

# Assert fairness constraints
def test_demographic_parity(model, data, sensitive_attr):
    groups = data[sensitive_attr].unique()
    rates = {}
    for group in groups:
        mask = data[sensitive_attr] == group
        rates[group] = model.predict(data[mask]).mean()

    ratio = min(rates.values()) / max(rates.values())
    assert ratio >= 0.8, f"Demographic parity ratio {ratio} < 0.8"

Bias Mitigation Strategies

  • Pre-processing: Resampling, reweighting training data
  • In-processing: Add fairness constraints to training (Fairlearn)
  • Post-processing: Adjust thresholds per group
Important Consideration
Different fairness metrics can conflict - you can't satisfy all simultaneously. Choose metrics based on your use case and legal requirements. Document your choices and trade-offs.
Q3
What is shadow deployment and how do you implement it? (Mid)

What is Shadow Deployment?

Running a new model in parallel with production, processing real traffic but not affecting user experience. The new model's predictions are logged but not used.

Benefits

  • Zero risk: Users only see production model results
  • Real data testing: Validate on actual production traffic
  • Performance comparison: Compare latency, predictions, errors
  • Catch issues: Find edge cases before they affect users
# Shadow deployment pattern
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)

async def predict_with_shadow(features):
    # Production prediction (blocking, returned to user)
    prod_result = production_model.predict(features)

    # Shadow prediction (async, logged only)
    def shadow_predict():
        try:
            shadow_result = shadow_model.predict(features)
            log_shadow_prediction(
                features=features,
                prod=prod_result,
                shadow=shadow_result,
                match=prod_result == shadow_result
            )
        except Exception as e:
            log_shadow_error(e)

    executor.submit(shadow_predict)
    return prod_result  # Only production result returned

Analysis After Shadow Period

  • Prediction agreement: How often do models agree?
  • Disagreement analysis: What inputs cause different predictions?
  • Latency comparison: Is shadow model faster/slower?
  • Error rates: Any crashes or timeouts?
Implementation Tip
Make shadow predictions async/non-blocking so they don't increase user-facing latency. Log extensively for analysis. Run shadow for days/weeks to capture all edge cases and traffic patterns.
Q4
How do you validate model reproducibility? (Mid)

Sources of Non-Reproducibility

  • Random seeds: Model initialization, data shuffling, dropout
  • Non-deterministic operations: cuDNN, parallel reduction
  • Environment differences: Library versions, hardware
  • Data ordering: Different order = different results

Reproducibility Checklist

# Set all random seeds
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # For full reproducibility (slower)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Test reproducibility
def test_training_reproducibility():
    set_seed(42)
    model1, metrics1 = train_model(data)

    set_seed(42)
    model2, metrics2 = train_model(data)

    assert metrics1 == metrics2
    # Or for floats:
    assert np.allclose(metrics1, metrics2, rtol=1e-5)

Best Practices

  • Version everything: Code (Git), data (DVC), environment (Docker)
  • Log all hyperparameters: Including random seeds
  • Use deterministic operations: Accept performance trade-off
  • Hash inputs/outputs: Verify data pipeline consistency
  • Reproducibility tests: Run same training twice, compare results
Reality Check
Perfect reproducibility is often impossible (GPU non-determinism, floating-point variations). Instead, aim for "close enough" reproducibility - metrics within acceptable tolerance. Document known sources of variance.