Databricks MLOps is a platform approach that automates the entire machine learning lifecycle, from data preparation to model deployment.
It provides a unified environment where data engineers, data scientists, and ML engineers collaborate and automate the end-to-end data science process.
Databricks MLOps includes features like automated model training, model validation, and model deployment, making it easier to move machine learning models into production.
Automating the machine learning lifecycle with Databricks MLOps reduces the risk of human error, improves model quality, and shortens time-to-production.
What Is It?
Databricks is a cloud-based platform for big data processing and analytics. It provides an integrated environment for running Apache Spark and developing machine learning models.
Databricks also includes MLflow, an open source tool originally developed by Databricks for experiment tracking, which gives data scientists and engineers a shared place to collaborate and scale their work.
MLflow ships as a single Python package that covers key steps in model management, including experiment tracking, model packaging, and a model registry, helping data scientists and engineers manage their machine learning models from development through deployment.
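As a concrete illustration, here is a minimal sketch of the kind of experiment tracking MLflow provides; the experiment path and logged values are placeholders for this example, not details from the article.

```python
import mlflow

# Point MLflow at an experiment (created if it does not exist); the path is hypothetical.
mlflow.set_experiment("/Shared/demo-experiment")

with mlflow.start_run(run_name="baseline"):
    # Log hyperparameters and evaluation metrics for this run.
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.42)
    # Arbitrary files (configs, plots) can be attached as artifacts.
    mlflow.log_dict({"features": ["median_income", "households"]}, "features.json")
```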
MLOps, short for Machine Learning Operations, refers to a combination of systematic processes, technologies, and best practices to operationalize machine learning models into production.
Benefits and Features
Databricks MLOps offers a flexible architecture that combines the best features of a data warehouse and a data lake, known as the Data Lakehouse. This architecture allows you to use the same data source for your ML workflow without moving data from one solution to another.
One of the biggest advantages of implementing MLOps with Databricks is that it provides a unified platform for managing the three major components of machine learning operations: data, code, and models. This unified approach enhances collaboration between data scientists and operations professionals.
By leveraging the benefits of Data Lakehouse, you'll be able to implement MLOps practices in a more flexible and scalable way. This makes it easier to manage data and deploy machine learning models.
A unified platform like Databricks MLOps minimizes risks and delays associated with managing data and deploying machine learning models by providing unified access control.
Implementation and Best Practices
Implementing MLOps with Databricks requires careful planning and execution, and a successful implementation follows several best practices. The sections below walk through the main steps involved.
Scalability and Performance
Scalability and performance are crucial for data-driven projects. Databricks provides scalable clusters that can accommodate large data sets and computations.
This scalability is particularly important when working with big data technologies like Apache Spark and Google BigQuery, which require significant resources to handle large-scale data processing.
Databricks' scalable clusters are designed to support machine learning models of any size. This means you can train models on massive data sets without worrying about performance issues.
As a result, you can process and analyze large data sets efficiently, making it ideal for applications that require real-time insights.
Best Practices for Implementation
To implement MLOps with Databricks successfully, you should follow several best practices. Chief among them, Databricks recommends creating a separate environment for each phase of machine learning code and model development, with clearly defined transitions from one stage to the next. Keeping the stages isolated keeps the work organized and prevents errors in one stage from affecting another, which is essential for a smooth MLOps implementation.
Development and Training
In Databricks, machine learning (ML) pipelines are developed and trained using a structured approach. This includes data, code, and models, which need to be developed, validated, and deployed across different environments.
Data scientists can iterate on ML code and file pull requests (PRs), which will trigger unit tests and integration tests in an isolated staging Databricks workspace. This ensures that the code is thoroughly tested before moving to production.
An ML pipeline structure is defined in the default stack, which includes dev, staging, and prod environments. This allows for seamless deployment of automated model training and batch inference jobs.
With the help of Databricks AutoML, the data is preprocessed through tasks such as feature engineering and normalization, making it more suitable for training machine learning models.
To train models on Databricks, data scientists can use Databricks notebooks as a development tool for data engineering, feature engineering, and model training. This provides a unified data platform for accessing a wide variety of data sources.
Here's a brief overview of the development and training process:
- Dev: Develop and test ML code in a dev environment
- Staging: Thoroughly test the ML pipeline code in a staging environment
- Prod: Deploy the tested code to production
By following this structured approach, data scientists can ensure that their ML pipelines are reliable, efficient, and effective.
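To make the dev-stage work more concrete, here is a rough sketch of a training notebook cell. It assumes it runs in a Databricks notebook (where `spark` is predefined) against a hypothetical Delta feature table; the table name, target column, and registered model name are illustrative, not values from the article.

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Read a (hypothetical) feature table from the Lakehouse; `spark` is provided by the notebook.
df = spark.read.table("ml_demo.housing_features").toPandas()
X, y = df.drop(columns=["median_house_value"]), df["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
    mlflow.log_metric("r2_test", model.score(X_test, y_test))
    # Log the fitted model and register it so later pipeline stages can pick it up.
    mlflow.sklearn.log_model(model, "model", registered_model_name="housing_model")
```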
Deployment and Monitoring
Databricks MLOps provides a robust platform for automating the machine learning workflow and model deployment using the Databricks REST API and scripting languages like Python or Bash.
Automating deployment minimizes the chance of manual errors, ensures consistent model releases, and gets models into production quickly, reducing the time and effort required for manual deployment.
Databricks Deployment Services allow you to set up a Databricks Proof of Concept (PoC) in just a few days to test its performance and scalability for your business.
To automate the deployment of your models, you can use the Databricks REST API from a Python or Bash script to create a new workflow and add the necessary steps to it, as sketched below.
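Here is a minimal sketch of driving that automation from Python with the Databricks Jobs REST API. The workspace URL and token environment variables, notebook path, job name, and cluster settings are assumptions for the example, not values from the article.

```python
import os
import requests

# Workspace URL and token are assumed to be provided as environment variables.
host = os.environ["DATABRICKS_HOST"]      # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}

# Create a new workflow (job) with a single notebook task (Jobs API 2.1).
job_spec = {
    "name": "model-deployment-workflow",  # hypothetical job name
    "tasks": [{
        "task_key": "deploy_model",
        "notebook_task": {"notebook_path": "/Repos/mlops/deploy_model"},  # hypothetical path
        "new_cluster": {
            "spark_version": "13.3.x-cpu-ml-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
    }],
}

resp = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```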
Databricks provides real-time model serving capabilities, but you can also use MLRun for real-time serving with a more complex serving graph, deploying pre-processing, model inference, and post-processing together in the same graph.
You can deploy the serving graph as a real-time serverless function using the MLRun function deploy command, which will give you an endpoint URL for model inference.
To monitor the performance of your models, you can use the Databricks Lakehouse Monitoring feature, which allows you to check the statistics and quality of your data and detect shifts in model metrics and inferred data.
The Lakehouse Monitoring Dashboard provides visualizations for model performance drift and data drift, allowing you to detect variations in data distribution and model performance over time.
You can call inference through the MLRun function invoke API or directly through the HTTP endpoint, making it easy to integrate with other systems and tools.
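The following is only a rough sketch of what such an MLRun serving graph might look like. It assumes a local file serving_graph.py that defines preprocess, predict, and postprocess handler functions (the predict handler standing in for model inference); the function name, file, and request body are illustrative, and exact MLRun APIs can differ between versions.

```python
import mlrun

# Wrap a local file (assumed to define preprocess, predict, and postprocess handlers)
# as an MLRun real-time serving function.
serving_fn = mlrun.code_to_function(
    "housing-serving",            # hypothetical function name
    kind="serving",
    filename="serving_graph.py",  # assumed local file with the handler functions
    image="mlrun/mlrun",
)

# Chain the steps into one serving graph: pre-processing -> inference -> post-processing.
graph = serving_fn.set_topology("flow", engine="async")
graph.to(handler="preprocess").to(handler="predict").to(handler="postprocess").respond()

# Deploy the graph as a real-time serverless function; deploy() returns the endpoint URL.
endpoint_url = serving_fn.deploy()

# Call inference through the MLRun invoke API (or via plain HTTP against endpoint_url).
result = serving_fn.invoke("/predict", body={"inputs": [[1.0, 2.0, 3.0]]})
```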
In the production stage, model training is triggered by code changes or scheduled to train a new model using the latest production data, and the trained models are registered in the MLflow Registry.
Once the tests have passed, the model is deployed for serving or scoring, and you then need to monitor the statistical properties of the input data and model predictions, watching for data drift, degrading model performance, and other relevant metrics.
Databricks accommodates both automatic and manual retraining approaches to ensure that your models stay up to date and relevant to the current data.
Pipeline Structure and Development
An ML solution consists of data, code, and models, which need to be developed, validated, and deployed across different environments. These environments are represented as dev, staging, and prod in this repository.
The repository uses CI/CD workflows to test and deploy automated model training and batch inference jobs across these environments. Data scientists can iterate on ML code by filing pull requests, which trigger unit tests and integration tests in an isolated staging Databricks workspace.
A new release branch can be cut after merging a pull request into main, and then promoted to production as part of a regularly scheduled release process. This ensures that the latest code is used in production.
Here are the different stages of an ML pipeline:
- Build (create_model_version notebook): Reading raw data, applying preprocessing techniques, storing processed features, training the model, generating predictions, and creating a new model version in the Model Registry.
- Test (test_model notebook): Retrieving the new model, comparing input and output schemas, and generating predictions using features from the Feature Store.
- Promote (promote_model notebook): Comparing metrics of the new model and the model in production, and promoting the new model if it performs better.
These steps are contained in different notebooks and correspond to the steps outlined in the solution overview.
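To make the promote step more concrete, here is a minimal sketch of how that comparison could look with the MLflow client. The model name, metric, and stage handling are assumptions for illustration, not the exact code from the stack's notebooks.

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "housing_model"  # hypothetical registered model name

candidate = client.get_latest_versions(model_name, stages=["Staging"])[0]
current = client.get_latest_versions(model_name, stages=["Production"])

# Compare a validation metric logged on each model's training run (lower RMSE is better).
candidate_rmse = client.get_run(candidate.run_id).data.metrics["rmse"]
current_rmse = (
    client.get_run(current[0].run_id).data.metrics["rmse"] if current else float("inf")
)

if candidate_rmse < current_rmse:
    # Promote the candidate and archive whatever was in Production before.
    client.transition_model_version_stage(
        model_name, candidate.version, stage="Production", archive_existing_versions=True
    )
```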
Automated Deployment and CI/CD
Automated deployment is a game-changer in machine learning workflow and model deployment. It minimizes manual errors during deployment and guarantees consistent model releases.
Databricks allows you to automate machine learning workflow and model deployment using REST API and scripting languages like Python or Bash. The generated script can create a new workflow and add necessary steps to it.
Automating deployment also ensures that all code in the machine learning development project undergoes the same review and testing processes. This is achieved by deploying code instead of models from one environment to another.
Here's a summary of the benefits of automated deployment:
- Minimizes manual errors during deployment
- Guarantees consistent, repeatable model releases
- Ensures all code goes through the same review and testing processes, because code, rather than models, is promoted from one environment to another
By implementing automated deployment and CI/CD, you can ensure that your machine learning workflows and pipelines are efficient, scalable, and reliable.
Copy Trained Models to MLRun
To copy trained models to MLRun, you can use an MLRun Jupyter notebook. This allows you to leverage the power of Jupyter notebooks for interactive development and experimentation, and the efficiency and scalability of MLRun for deploying your models in production.
You'll first need to configure the authentication to your Databricks cluster, either using a token or a username and password. Once authenticated, you can use the MlflowClient to download the models trained on the Databricks cluster.
The snippet below shows how to copy the models for a specific run ID from the Databricks cluster to a destination on the MLRun cluster; for simplicity, the model is saved to a local directory there.
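This is a minimal sketch of that copy step; the host, token, run ID, and destination directory are placeholders you would replace with your own values.

```python
import os
from mlflow.tracking import MlflowClient

# Authenticate against the Databricks-hosted MLflow tracking server.
os.environ["DATABRICKS_HOST"] = "https://<your-workspace>.cloud.databricks.com"  # placeholder
os.environ["DATABRICKS_TOKEN"] = "<personal-access-token>"                       # placeholder

client = MlflowClient(tracking_uri="databricks")

run_id = "<run-id-from-databricks>"  # placeholder run ID
local_dir = "/home/jovyan/models"    # illustrative destination on the MLRun cluster
os.makedirs(local_dir, exist_ok=True)

# Download the run's model artifacts to the local directory.
local_path = client.download_artifacts(run_id, "model", local_dir)
print("Model copied to:", local_path)
```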
CI/CD with GitHub Actions
CI/CD with GitHub Actions is a game-changer for any organization looking to automate its machine learning workflow. In a real business scenario, Databricks already exists with at least two environments: development and production. For CI/CD, GitHub Actions trigger a workflow when a branch is pull-requested into the main branch, and that workflow runs through three stages: DEV, TEST, and PROD.
The DEV stage runs a workflow called 'eda_modeling' that does exploratory data analysis, modeling, and promoting of the best model to the Model Registry. This ensures that only the best models make it to the production environment.
The TEST stage runs a workflow called 'job-model-tests' that includes model tests for the transitions in the 'Staging' and 'Production' stages in the Model Registry. This guarantees that models are thoroughly tested before being deployed to production.
The PROD stage runs a workflow 'inference' for batch inference against new data. This ensures that models are regularly tested against new data to ensure they remain accurate.
In short, every pull request runs the DEV, TEST, and PROD stages in sequence. The DATABRICKS_HOST and DATABRICKS_TOKEN values should reflect the host and token of the development Databricks environment for the DEV and TEST stages, and of the production environment for the PROD stage, as in the sketch below.
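For example, a step inside one of those stages could trigger the corresponding Databricks job through the REST API. This is only a sketch: the job-ID secret and the specific endpoint usage are assumptions, not the article's actual workflow definition.

```python
import os
import requests

# In CI, DATABRICKS_HOST and DATABRICKS_TOKEN come from the repository's secrets.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Trigger an existing Databricks job (e.g. the 'inference' workflow).
resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": int(os.environ["DATABRICKS_JOB_ID"])},  # hypothetical secret holding the job ID
)
resp.raise_for_status()
print("Triggered run:", resp.json()["run_id"])
```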
Data Preparation and Engineering
Data Preparation is crucial for getting accurate results from machine learning models. Incomplete, inaccurate, and inconsistent data can lead to false conclusions and predictions.
Null values, duplicated rows, and multicollinearity are common issues that need to be addressed during data preparation. Exploratory Data Analysis (EDA) can help identify and fix these issues, as seen in a dataset with 20,640 entries where 207 null values were safely dropped.
Feature Engineering is also an essential step in data preparation: categorical variables are converted into numerical values using techniques like One-Hot Encoding, and the numerical variables are scaled so they can be combined with the one-hot-encoded categorical columns.
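As a small illustration of that step, here is a sketch using pandas and scikit-learn on a hypothetical housing dataframe; the file path and the categorical column name are assumptions based on the dataset described later in the article.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load a California-housing-style dataset; the path is hypothetical.
housing = pd.read_csv("housing.csv")

# One-Hot Encode the categorical column into indicator (0/1) columns.
encoded = pd.get_dummies(housing, columns=["ocean_proximity"])

# Scale the numerical columns so they share a comparable range, then keep them
# alongside the one-hot-encoded columns in the same dataframe.
numeric_cols = housing.select_dtypes(include="number").columns
encoded[numeric_cols] = StandardScaler().fit_transform(encoded[numeric_cols])
```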
Access Control and Versioning
Access control and versioning are crucial components of any successful software operations process, including implementing MLOps with Databricks.
To manage access control, Databricks recommends using Git for version control, allowing you to prepare and develop code within notebooks or integrated development environments (IDEs). This also enables seamless synchronization to Databricks workspaces through Databricks Repos integration with your Git provider.
It's essential to store data in a Lakehouse architecture in your cloud account for optimal data management. This involves storing raw data and feature tables in Delta Lake format, with access controls limiting who can access, read, and modify them.
To manage ML models and model development, you can use MLflow. This allows you to easily track the model development process and save code snapshots, metrics, model parameters, and other descriptive data on the go.
Here are the recommended steps for implementing access control and versioning:
- Use Git for version control and integrate it with Databricks Repos for seamless synchronization.
- Store data in a Lakehouse architecture, using Delta Lake format with access controls.
- Manage ML models and model development using MLflow.
Data Engineering
Data Engineering is a crucial step in the data preparation process. It involves transforming raw data into a format that's ready for analysis and modeling.
Digital data is a valuable asset for many businesses in the 21st century, boasting real-world applications in all types of industries. It's essential to handle data properly to extract insights and make informed decisions.
Data engineers use various techniques to load, transform, and prepare data for analysis. For example, they might load data into pandas DataFrames and perform exploratory data analysis, such as computing basic statistics and plotting heatmaps to visualize correlations.
Data engineers also need to handle categorical variables, which can be converted into numerical values using techniques like one-hot encoding. This process helps to create a more uniform dataset that's easier to work with.
After scaling, the numerical variables' distributions fall into more or less the same range, which is essential for modeling and analysis and helps prevent the false conclusions and predictions that inconsistent data can cause.
Data engineers must also be aware of multicollinearity, which occurs when features are highly correlated with each other. To address this, they can create new features that reduce multicollinearity, such as rooms_per_bedroom and population_per_house.
A Feature Store, like the one in Databricks, can be a centralized repository to store features that can be used for model training and inference. This helps to ensure reproducibility and reusability of features across different models and projects.
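Here is a rough sketch of those two ideas together: deriving ratio features to reduce multicollinearity, then publishing them to the Databricks Feature Store. It assumes the `housing` dataframe from the earlier sketch, a Databricks ML runtime (where `spark` and the feature-store client are available), and a hypothetical table name and key.

```python
# Derived ratio features that reduce multicollinearity between raw count columns.
housing["rooms_per_bedroom"] = housing["total_rooms"] / housing["total_bedrooms"]
housing["population_per_house"] = housing["population"] / housing["households"]

# Publish the engineered features to the Databricks Feature Store
# (only works inside a Databricks ML runtime; names here are hypothetical).
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
features_df = spark.createDataFrame(housing.reset_index().rename(columns={"index": "house_id"}))
fs.create_table(
    name="ml_demo.housing_features",  # hypothetical schema.table
    primary_keys=["house_id"],
    df=features_df,
    description="Engineered housing features for model training and inference",
)
```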
Exploratory Data Analysis
Exploratory Data Analysis is a crucial step in the data preparation process. It helps identify potential issues with the data that could impact the accuracy of the model.
The dataset in question consists of 20,640 entries.
Performing exploratory data analysis, we found that the feature total_bedrooms has 207 null values, which can be safely dropped.
Duplicated rows were also identified and dropped. This ensures that the data is clean and free from errors.
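A minimal sketch of those EDA checks with pandas, assuming the same hypothetical housing dataframe as above:

```python
# Inspect size and missing values (the article's dataset has 20,640 rows
# and 207 nulls in total_bedrooms).
print(housing.shape)
print(housing.isnull().sum())

# Drop the rows with null total_bedrooms and any duplicated rows.
housing = housing.dropna(subset=["total_bedrooms"]).drop_duplicates()

# Quick look at basic statistics and pairwise correlations.
print(housing.describe())
print(housing.corr(numeric_only=True))
```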
For machine learning lifecycle management, MLflow is used to track experiments, manage model versions, and handle other essential tasks.
Frequently Asked Questions
Is MLflow owned by Databricks?
MLflow is an open source platform rather than a proprietary Databricks product, but it was originally developed by Databricks, and Databricks is the company behind the managed version of MLflow offered on its platform.
What is the difference between MLOps and MLflow?
MLOps is the broader concept of managing the machine learning lifecycle in production, while MLflow is a tool that assists in the development and experimentation stages. Understanding the difference between these two is crucial for efficiently deploying and maintaining machine learning models in real-world applications.
Sources
- https://github.com/databricks/mlops-stacks
- https://www.clearpeaks.com/end-to-end-mlops-in-databricks/
- https://addepto.com/blog/implementing-mlops-with-databricks/
- https://www.iguazio.com/blog/integrating-mlops-with-mlrun-and-databricks/
- https://databuildcompany.com/end-to-end-mlops-with-databricks-a-hands-on-tutorial/