The Databricks MLOps book is a practical guide for anyone looking to streamline their machine learning operations. It walks through MLOps workflows end to end, helping you bridge the gap between data science and production.
With Databricks MLOps, you can automate the entire MLOps workflow, from data ingestion to model deployment. This means faster time-to-market, reduced costs, and improved collaboration among teams.
The book covers key concepts such as model monitoring, model serving, and model versioning, all of which are crucial for a successful MLOps workflow. By understanding these concepts, you can ensure that your models are accurate, reliable, and scalable.
By following the guidelines outlined in the Databricks MLOps book, you can create a repeatable and scalable MLOps workflow that meets the needs of your organization.
What Is MLOps?
MLOps is a set of processes and automated steps for managing code, data, and models to improve performance, stability, and long-term efficiency of ML systems. It combines DevOps, DataOps, and ModelOps.
ML systems progress from early development to a final production stage. As they advance, access restrictions tighten and testing becomes more rigorous.
The Databricks platform lets you manage data and ML assets on a single platform with unified access control. Developing data applications and ML applications side by side reduces the risks and delays associated with moving data between systems.
MLOps Workflow
An MLOps workflow is a set of processes and automation to manage data, code, and models. This workflow is crucial for meeting the two goals of stable performance and long-term efficiency in ML systems.
The MLOps workflow consists of three stages: development, staging, and production. In the development stage, data scientists can quickly get started iterating on ML code for new projects using Databricks MLOps Stacks. This stack provides a customizable ML project structure, unit-tested Python modules, and notebooks.
Here are the three modular components included in the default stack:

- ML code
- ML resources as code
- CI/CD (GitHub Actions or Azure DevOps)
In the staging stage, model validation and deployment pipelines are developed, and the model is promoted to production if it passes pre-deployment checks.
Validation and Deploy
In the validation and deploy stage, you'll determine if your model is ready for production. This stage uses the model URI from the training pipeline and loads the model from Unity Catalog.
Model validation is a crucial step, where you execute a series of checks to ensure the model meets your requirements. These checks can include format and metadata validations, performance evaluations on selected data slices, and compliance with organizational requirements.
The model deployment pipeline typically promotes the newly trained model to production status, either by directly updating the existing model or by comparing it to the existing model in production. This pipeline can also set up any required inference infrastructure, such as Model Serving endpoints.
Here's a step-by-step overview of the validation and deploy process:
- Model validation: loads the model from Unity Catalog and executes validation checks.
- Comparison: compares the newly trained model to the existing model in production to decide whether it should replace it.
- Model deployment: promotes the newly trained model to production status and sets up any required inference infrastructure.
The validation and deploy process is fully automated, but manual approval steps can be set up using workflow notifications or CI/CD callbacks.
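As a rough illustration, here is a minimal sketch of how a deployment task might load a candidate model from Unity Catalog, run a quick check, and promote it by alias. The model name, catalog, and validation data are hypothetical, and the real checks depend on your project.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Models are registered in Unity Catalog, so point the MLflow registry there.
mlflow.set_registry_uri("databricks-uc")

model_name = "prod.ml_models.movie_recommender"  # hypothetical three-level UC name
candidate_version = "3"                          # version produced by the training pipeline

# Load the candidate model and run a simple smoke test before promotion.
model = mlflow.pyfunc.load_model(f"models:/{model_name}/{candidate_version}")
predictions = model.predict(sample_batch)        # sample_batch: a small validation DataFrame (placeholder)

# If the checks pass, mark this version as the one serving production traffic.
client = MlflowClient()
client.set_registered_model_alias(model_name, alias="champion", version=candidate_version)
```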
In the case of real-time model serving, you'll need to set up the infrastructure to deploy the model as a REST API endpoint. This can be done using Mosaic AI Model Serving, which executes a zero-downtime update by keeping the existing configuration running until the new one is ready.
By following this validation and deploy process, you can ensure that your model is ready for production and meets your organization's requirements.
Gold Layer - FS
In the Gold Layer - FS, we're dealing with the last layer of data transformation where the result can be fed directly to our model training pipeline. This layer is all about creating feature tables using the Databricks Feature Store API.
A feature store is a data management layer that acts as an interface between your data and your models, helping with feature discovery, lineage tracking, and avoiding training/serving skew. If you're not familiar with feature stores, check out The Comprehensive Guide to Feature Stores | Databricks.
The implementation of a feature store in Databricks depends on whether your workspace is enabled for Unity Catalog. If it is, any Delta table with a primary key can serve as a feature table, and Unity Catalog acts as the feature store.
There are two ways to create, store, and manage features in Databricks: Databricks Feature Engineering in Unity Catalog and the Databricks Workspace Feature Store legacy solution. We'll go with the first solution, as it's the future of feature management.
To create a feature table, you need to select the catalog and schema, create a FeatureEngineeringClient instance, and exclude the target/label column. Each feature table must have at least one primary key, which Databricks uses to create the final training set by joining the feature table with the data frame that includes the target/label column.
Here's a quick summary of the steps:
- Select the catalog and schema
- Create a FeatureEngineeringClient instance
- Exclude the target/label column
- Include at least one primary key
With these steps, you can create feature tables and manage your features in Databricks.
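To make those steps concrete, here is a minimal sketch using the Databricks Feature Engineering client. The catalog, schema, table, and column names are illustrative and assume the features have already been computed into a Spark DataFrame.

```python
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()

# features_df holds the aggregated features (no target/label column);
# "user_id" is the primary key used later to join features with the labels.
fe.create_table(
    name="dev.feature_store.user_features",   # catalog.schema.table (illustrative)
    primary_keys=["user_id"],
    df=features_df,
    description="Aggregated user features for the recommender model",
)

# Build the final training set by joining the feature table with the
# DataFrame that contains the primary key and the target/label column.
training_set = fe.create_training_set(
    df=labels_df,                              # DataFrame with "user_id" and the label column
    feature_lookups=[
        FeatureLookup(table_name="dev.feature_store.user_features", lookup_key="user_id")
    ],
    label="rating",
)
training_df = training_set.load_df()
```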
Development Stage
In the development stage, data scientists focus on experimentation, developing features and models, and running experiments to optimize model performance. This stage is all about trying out new ideas and refining your approach.
The development environment is represented by the dev catalog in Unity Catalog, where data scientists have read-write access to create temporary data and feature tables. Data scientists also have read-only access to production data in the prod catalog, allowing them to analyze current production model predictions and performance.
A snapshot of production data can be written to the dev catalog if read-only access to the prod catalog is not possible. This enables data scientists to develop and evaluate project code.
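A minimal sketch of taking such a snapshot is shown below; the table names are illustrative and assume the prod and dev catalogs described above.

```python
# Snapshot a production table into the dev catalog for experimentation.
spark.table("prod.transformed.ratings") \
     .write.mode("overwrite") \
     .saveAsTable("dev.transformed.ratings_snapshot")
```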
Data scientists can iterate on ML code and file pull requests (PRs), which triggers unit tests and integration tests in an isolated staging Databricks workspace. Model training and batch inference jobs in staging will immediately update to run the latest code when a PR is merged into main.
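For illustration, a unit test the CI pipeline might run on every PR could look like the following sketch. The module and function names are hypothetical placeholders for your project's featurization code.

```python
# tests/test_features.py -- a unit test run by CI on every pull request.
import pandas as pd

from my_project.features import compute_user_features  # hypothetical project module


def test_compute_user_features_returns_expected_columns():
    ratings = pd.DataFrame({"user_id": [1, 1, 2], "rating": [3.0, 4.0, 5.0]})
    features = compute_user_features(ratings)
    assert {"user_id", "avg_rating"}.issubset(features.columns)
```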
The development stage involves two tasks in the model training pipeline: training and tuning, and evaluation. Training and tuning logs model parameters, metrics, and artifacts to the MLflow Tracking server, while evaluation assesses model quality on held-out data and logs the results to the same Tracking server.
Here is a summary of the development stage tasks:

- Training and tuning: fit the model, tune hyperparameters, and log parameters, metrics, and artifacts to the MLflow Tracking server.
- Evaluation: test model quality on held-out data and log the results to the MLflow Tracking server.
The output of the model training pipeline is an ML model artifact stored in the MLflow Tracking server for the development environment. If the pipeline is executed in the staging or production workspace, the model artifact is stored in the MLflow Tracking server for that workspace.
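As a minimal sketch, a training task might log its parameters, metrics, and model artifact like this. The model type, parameter values, and training data are purely illustrative.

```python
import mlflow
from sklearn.linear_model import LogisticRegression

with mlflow.start_run(run_name="train_recommender") as run:
    # Training-and-tuning task: fit the model and log hyperparameters and metrics.
    # X_train, y_train, and the metric value are placeholders.
    model = LogisticRegression(C=0.5).fit(X_train, y_train)
    mlflow.log_param("C", 0.5)
    mlflow.log_metric("val_accuracy", 0.91)

    # Log the fitted model as an artifact of this run.
    mlflow.sklearn.log_model(model, artifact_path="model")

    # The model URI handed to the downstream validation and deployment tasks.
    model_uri = f"runs:/{run.info.run_id}/model"
```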
Staging Stage
The staging stage is a crucial part of the MLOps process in Databricks. It's where you test your ML pipeline code to ensure it's ready for production.
In this stage, all ML pipeline code is thoroughly tested, including code for model training, feature engineering pipelines, inference code, and more. This ensures that everything is working as expected before moving on to the production stage.
A CI pipeline is created by ML engineers to implement unit and integration tests in this stage. The output of the staging process is a release branch that triggers the CI/CD system to start the production stage.
The staging environment should have its own catalog in Unity Catalog for testing ML pipelines and registering models. This catalog is temporary and only retains assets until testing is complete. The development environment may also require access to the staging catalog for debugging purposes.
Here are the key characteristics of the staging stage:
- Testing of ML pipeline code
- Unit and integration tests
- Release branch creation
- Staging environment with its own Unity Catalog
Production Stage
In the production stage, ML engineers own the environment where ML pipelines are deployed and executed. This is where model training, validation, and deployment happen.
Data scientists typically don't have write or compute access to the production environment, but they should have visibility to test results, logs, model artifacts, production pipeline status, and monitoring tables. This allows them to identify and diagnose problems in production and compare the performance of new models to models currently in production.
Data scientists can be granted read-only access to assets in the production catalog for these purposes.
Here are the key roles and responsibilities in the production stage:

- ML engineers: own the production environment and deploy, execute, and monitor the ML pipelines.
- Data scientists: have no write or compute access, but retain read-only visibility into test results, logs, model artifacts, pipeline status, and monitoring tables.
In the production stage, pipelines trigger model training, validate and deploy new model versions, publish predictions to downstream tables or applications, and monitor the entire process to avoid performance degradation and instability.
MLOps Tools
Databricks MLOps Stacks is a customizable stack for starting new ML projects on Databricks that follow production best-practices out of the box. It provides a quick way to get started iterating on ML code for new projects while ops engineers set up CI/CD and ML resources management.
The default stack in the repo includes three modular components: ML Code, ML Resources as Code, and CI/CD (GitHub Actions or Azure DevOps). These components are useful for quickly iterating on ML problems, governing and auditing ML resources, and shipping ML code faster with confidence.
Here are the three modular components of the default stack:

- ML code: an example ML project structure with unit-tested Python modules and notebooks, useful for quickly iterating on ML problems.
- ML resources as code: ML pipeline resources, such as training and batch inference jobs, defined as code so they can be governed and audited.
- CI/CD: GitHub Actions or Azure DevOps workflows to test and deploy ML code and resources, so teams can ship ML changes faster and with confidence.
Databricks asset bundles and Databricks asset bundle templates are also in public preview.
Data Ingestion and Transformation
Data Ingestion and Transformation is a crucial step in the MLOps process. It involves collecting data from various sources, transforming it into a usable format, and loading it into a data warehouse or lake.
Databricks provides a range of tools and technologies to simplify this process, including Delta Lake, which enables you to store data in a structured format that's easily queryable and scalable.
Delta Lake allows you to handle data in its raw, unprocessed form, which is particularly useful for real-time data pipelines. For instance, you can use it to store data from IoT devices or social media feeds.
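As a minimal sketch, landing raw data in a bronze Delta table might look like this; the volume path, catalog, and table names are illustrative.

```python
# Read raw files from a Unity Catalog volume and append them to a bronze Delta table.
raw_df = spark.read.json("/Volumes/dev/raw/iot_events/")   # illustrative path

raw_df.write.format("delta") \
      .mode("append") \
      .saveAsTable("dev.raw.iot_events")
```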
Data Analysis (EDA)
Data analysis is a crucial step in understanding your data. Exploratory data analysis, or EDA, is an interactive and iterative process where you analyze data in a notebook to assess its potential to solve a business problem.
In EDA, data scientists begin identifying data preparation and featurization steps for model training. This process is not typically part of a pipeline that will be deployed elsewhere.
AutoML can accelerate this process by generating baseline models for a dataset. It performs and records a set of trials, providing a Python notebook with source code for each trial run.
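A minimal sketch of kicking off such an AutoML run from a notebook is below, assuming a prepared DataFrame `df`; the target column and time budget are illustrative.

```python
from databricks import automl

# Run an AutoML experiment to produce baseline models and generated notebooks.
summary = automl.regress(dataset=df, target_col="rating", timeout_minutes=30)

print(summary.best_trial.metrics)        # metrics of the best baseline model
print(summary.best_trial.notebook_url)   # generated notebook with that trial's source code
```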
Data Ingestion & Transformation
We structure our data on UC using the Medallion Architecture, which is a recommended way to organize data based on data maturity. It has three layers: bronze (raw), silver (structured), and gold (enriched).
The Medallion Architecture is useful for our use case because it helps us keep our data organized and easily accessible. We can use it to track the progress of our data from raw to enriched.
We start with the raw data, which we collect from the Movielens data archive. We use the curl command to download the dataset as a .zip file into the volume storage.
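Here is a minimal sketch of that download step from a notebook cell, using Python in place of curl; the dataset URL, archive name, and volume path are illustrative.

```python
import urllib.request
import zipfile

# Download the MovieLens archive into a Unity Catalog volume and unpack it.
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"   # illustrative
zip_path = "/Volumes/dev/raw/landing/ml-latest-small.zip"                    # illustrative volume path

urllib.request.urlretrieve(url, zip_path)

with zipfile.ZipFile(zip_path) as archive:
    archive.extractall("/Volumes/dev/raw/landing/")
```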
The raw data is stored as a collection of .tsv files. We can then move on to the next layer, which is the transformed data. This is where we enforce schemas and filter columns for our data tables.
The transformed data is also where data preprocessing and cleaning takes place. This is an important step because it helps us ensure that our data is accurate and reliable.
Here are the three schemas we use in our Medallion Architecture:
- raw (bronze): raw data collected from the Movielens data archive
- transformed (silver): we enforce schemas and filter columns for our data tables
- feature_store (gold): this schema contains all our aggregated data that can be fed to our ML pipeline
Having a clear data transformation process helps us to ensure that our data is accurate and reliable. It also helps us to track the progress of our data and make sure that it is easily accessible.
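For illustration, promoting raw ratings from the bronze layer into the transformed (silver) schema might look like the sketch below; the file path, schema definition, and table names are illustrative.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, LongType

# Enforce a schema and keep only the columns we need when moving raw ratings into the silver layer.
ratings_schema = StructType([
    StructField("user_id", IntegerType(), False),
    StructField("movie_id", IntegerType(), False),
    StructField("rating", DoubleType(), True),
    StructField("timestamp", LongType(), True),
])

ratings_df = (
    spark.read
    .option("sep", "\t")
    .option("header", True)
    .schema(ratings_schema)
    .csv("/Volumes/dev/raw/landing/ratings.tsv")   # illustrative path
    .select("user_id", "movie_id", "rating")
)

ratings_df.write.mode("overwrite").saveAsTable("dev.transformed.ratings")
```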
Create Catalog and Schemas
To create a catalog and schemas, you can use Databricks UI or write SQL commands in a notebook or Databricks SQL editor. For this tutorial, we'll use notebooks.
You'll need to specify the name of your catalog and schemas in a config.json file. This file will serve as a reference point for creating your catalog and schemas.
To organize your data and AI access, you'll want to decide how to set up your project. Databricks recommends using Unity Catalog (UC) for this purpose. UC is a centralized metadata management service that stores information such as storage locations, table definitions, partition information, and schema details.
Here are the different levels of the data object hierarchy in Unity Catalog:

- Metastore: the top-level container for metadata, attached to your workspaces.
- Catalog: the first layer of the namespace, used to organize data assets (for example, dev, staging, and prod).
- Schema: also known as a database, the second layer, which groups related tables and volumes.
- Tables, views, volumes, models, and functions: the lowest level, holding the actual data and AI assets.
By following these steps, you'll be able to create a catalog and schemas that meet your project's needs. Remember to check out the release notes and Data, Analytics and AI Governance pages for more information on using Unity Catalog.
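A minimal sketch of that setup from a notebook follows; the config.json keys and the catalog and schema names are hypothetical.

```python
import json

# Read the catalog and schema names from the project's config.json (keys are illustrative).
with open("config.json") as f:
    config = json.load(f)

catalog = config["catalog_name"]     # e.g. "dev"
schemas = config["schema_names"]     # e.g. ["raw", "transformed", "feature_store"]

spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
for schema in schemas:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
```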
Experiment Tracking and Management
Experiment tracking is a crucial aspect of the machine learning lifecycle. It helps you keep track of different experiments, including models, results, parameters, code, data schema, data samples, and supplementary artifacts.
Experiment tracking is essential as projects get bigger and more complex. It's like trying to keep track of multiple balls in the air - without a system, it's easy to lose sight of what's working and what's not.
In Databricks, you can use MLflow for experiment tracking. MLflow offers a library and environment agnostic API to record ML Experiments. Experiments are the primary unit of organization in MLflow.
Each experiment consists of a group of related Runs. An MLflow run corresponds to a single execution of model code. When you run an experiment or train a model, it creates a run.
Here's a breakdown of how to create a workspace experiment in Databricks:
- Workspace experiment: a workspace experiment is not tied to any notebook; you can log runs to it from any notebook by referencing the experiment ID or name.
- Notebook experiment: an experiment tied to a specific notebook. By default, Databricks creates an experiment with the name and ID of the corresponding notebook when you use the MLflow API to log a run.
Note that you need to set the experiment name explicitly using the full path when you run a notebook using Databricks jobs.
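As a minimal sketch, setting a workspace experiment by its full path and logging a run might look like this; the experiment path, parameters, and metric values are illustrative.

```python
import mlflow

# Set the experiment explicitly by its full workspace path
# (required when the notebook runs as a Databricks job).
mlflow.set_experiment("/Users/someone@example.com/movielens-recommender")

with mlflow.start_run(run_name="als_baseline"):
    mlflow.log_param("rank", 16)
    mlflow.log_metric("val_rmse", 0.87)
```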
Develop ML Pipelines
Developing ML pipelines is a crucial step in the MLOps process. It involves creating a pipeline that can be used to train, validate, and deploy machine learning models.
Data scientists create new or updated pipelines in a development (“dev”) branch of the project repository. This is where the pipeline is developed, tested, and refined before being deployed to production.
A multitask Databricks workflow is recommended, where the first task is the model training pipeline, followed by model validation and model deployment tasks. This ensures that the model is thoroughly tested and validated before being deployed to production.
The code repository contains all of the pipelines, modules, and other project files for an ML project. Data scientists should work in a repository to share code and track changes.
Here are the key tasks involved in developing an ML pipeline:
- Training and tuning: train the model on the training data and tune hyperparameters to optimize performance.
- Evaluation: evaluate the model's performance on a held-out dataset and log the results to the MLflow Tracking server.
- Registration: register the trained model to Unity Catalog.
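A minimal sketch of the registration step follows; the run ID and the three-level model name are illustrative.

```python
import mlflow

# Register the model produced by the training run in Unity Catalog.
mlflow.set_registry_uri("databricks-uc")

model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",          # run_id produced by the training task (placeholder)
    name="dev.ml_models.movie_recommender",     # catalog.schema.model (illustrative)
)
```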
By following these steps and best practices, data scientists can develop robust and reliable ML pipelines that can be deployed to production with confidence.