MLOps Pipeline: From Jupyter Notebook to Production Model Serving
Data scientists build models in Jupyter notebooks. Production needs reproducible training, versioned datasets, automated retraining, and scalable serving. MLOps bridges this gap.
The Pipeline Architecture
A production ML pipeline has five stages: data versioning → feature engineering → training → evaluation → deployment. Each stage must be reproducible and auditable.
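The five stages above can be sketched as plain functions chained together. This is an illustrative skeleton, not a real orchestration framework; all names and bodies are placeholders.

```python
# Illustrative sketch: each stage is a function consuming the previous
# stage's output. Bodies are toy placeholders, not real implementations.

def version_data(raw):
    return {"data": raw, "version": "v1"}              # pin a dataset version

def engineer_features(versioned):
    return [x * 2 for x in versioned["data"]]          # toy transformation

def train(features):
    return {"weights": sum(features) / len(features)}  # toy "model"

def evaluate(model, features):
    return {"accuracy": 0.94}                          # placeholder metric

def deploy(model, metrics, threshold=0.9):
    return metrics["accuracy"] >= threshold            # gate deployment on eval

def run_pipeline(raw):
    versioned = version_data(raw)
    features = engineer_features(versioned)
    model = train(features)
    metrics = evaluate(model, features)
    return deploy(model, metrics)

print(run_pipeline([1, 2, 3]))  # → True (accuracy 0.94 clears the 0.9 gate)
```

In a real system each stage would be a separate job run by an orchestrator (Airflow, Kubeflow, etc.), with artifacts passed by reference rather than in memory, but the gating pattern is the same.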
Data Versioning with DVC
# Track large datasets without bloating git
dvc init
dvc add data/training_set.parquet
git add data/training_set.parquet.dvc .gitignore
git commit -m "Add training dataset v1"
dvc push # Upload to S3/GCS
Experiment Tracking with MLflow
import mlflow

config = {"lr": 0.001, "epochs": 50}
with mlflow.start_run():
    mlflow.log_params(config)
    model = train(config)  # user-defined training function
    mlflow.log_metrics({"accuracy": 0.94, "f1": 0.91})
    mlflow.sklearn.log_model(model, "model")
Automated Retraining
Schedule retraining when data drift is detected. Monitor the distribution of live predictions against the distribution seen at training time. When the KL divergence between the two exceeds a chosen threshold, trigger the pipeline automatically. This keeps models fresh without manual intervention.
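A minimal sketch of that drift check, assuming model scores are binned into histograms on a shared grid; the bin count and the 0.1 threshold are illustrative and should be tuned per model.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) for two discrete distributions given as histograms.
    eps smoothing avoids division by zero in empty bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def drift_detected(train_scores, live_scores, bins=10, threshold=0.1):
    """Bin both score samples on shared edges and compare distributions."""
    edges = np.histogram_bin_edges(
        np.concatenate([train_scores, live_scores]), bins=bins)
    p, _ = np.histogram(train_scores, bins=edges)
    q, _ = np.histogram(live_scores, bins=edges)
    return kl_divergence(q, p) > threshold  # divergence of live from training

rng = np.random.default_rng(0)
baseline = rng.normal(0.5, 0.1, 10_000)   # training-time score distribution
same = rng.normal(0.5, 0.1, 10_000)       # live traffic, no drift
shifted = rng.normal(0.7, 0.1, 10_000)    # live traffic, shifted mean
print(drift_detected(baseline, same))     # → False
print(drift_detected(baseline, shifted))  # → True
```

The check itself would run on a schedule (or per batch of predictions); a `True` result is what kicks off the retraining pipeline.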