Machine learning operations, or MLOps, is the practice of taking a machine learning model from development into production and keeping it running reliably there. It is the step that turns an AI prototype into a working application.
MLOps involves automating the entire lifecycle of a model, from training to testing to deployment. This includes tasks such as data preparation, model training, model evaluation, and model deployment.
To get started with MLOps, you'll need to understand the different stages of the MLOps pipeline, which include data ingestion, data processing, model training, model evaluation, model deployment, and model monitoring.
Data Preparation
Data Preparation is a crucial step in the MLOps process. Incomplete, inaccurate, and inconsistent data can lead to false conclusions and predictions.
Cleaning and understanding the data are essential to the quality of the results. This involves identifying and correcting errors, handling missing values, and transforming data into a suitable format for analysis.
Inaccurate data can have a significant impact on the model's performance. Incomplete data, on the other hand, can lead to biased results.
Understanding the data is just as important as cleaning it. This includes knowing the data types, distributions, and correlations between variables.
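As a minimal sketch of the missing-value handling described above, the snippet below fills gaps in one column with that column's median, using only the Python standard library. The sample records and the choice of median imputation are assumptions for illustration, not the dataset or method from the original example.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
rows = [
    {"total_rooms": 880.0, "total_bedrooms": 129.0},
    {"total_rooms": 7099.0, "total_bedrooms": None},
    {"total_rooms": 1467.0, "total_bedrooms": 190.0},
]


def impute_median(records, column):
    """Return copies of the records with missing `column` values set to the column median."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


clean_rows = impute_median(rows, "total_bedrooms")
missing_before = sum(r["total_bedrooms"] is None for r in rows)
missing_after = sum(r["total_bedrooms"] is None for r in clean_rows)
```

The function returns new records instead of editing the input, so the raw data stays available for comparison.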
Feature Engineering and Management
Feature Engineering and Management is a crucial step in the MLOps process. It involves transforming raw data into features that can be used for model training and prediction.
One key aspect of feature engineering is handling categorical data, as seen in the ocean_proximity feature with 5 categories. To address this, One Hot Encoding is applied.
Feature engineering also involves checking for multicollinearity among features. Pearson's correlation revealed a few features with high multicollinearity, which was addressed by creating new features such as rooms_per_bedroom and population_per_house.
To ensure reproducibility, it's essential to store features in a centralized repository. In this example, the Databricks Feature Store serves that purpose: the cleaned dataset was registered there, with 20% held out for inference after model development.
Feature Engineering
Feature Engineering is a crucial step in the machine learning pipeline. It involves transforming raw data into features that can be used to train a model.
One of the key aspects of feature engineering is handling categorical variables. In the case of the ocean_proximity feature, which has 5 categories, One Hot Encoding is applied. This is a common technique used to convert categorical variables into numerical variables.
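One Hot Encoding can be sketched in a few lines of plain Python: each category becomes its own 0/1 indicator column. The category labels below are hypothetical, since the text only states that ocean_proximity has 5 categories. In a pandas workflow, `pandas.get_dummies` performs the same transformation.

```python
def one_hot_encode(values, prefix):
    """Map each categorical value to a dict of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


# Hypothetical labels for illustration only.
ocean_proximity = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]
encoded = one_hot_encode(ocean_proximity, "ocean_proximity")
```

Every encoded row contains exactly one 1, in the column matching its original category.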
To address multicollinearity, new features can be created by computing ratios of existing features. For example, rooms_per_bedroom and population_per_house were created by dividing total_rooms by total_bedrooms and population by households, respectively.
Here are some key considerations when creating new features:
- Identify correlations: Use techniques like Pearson's correlation to identify correlations between features.
- Create new features: Use ratios or other transformations to create new features that can help address multicollinearity.
By following these best practices, you can create a robust set of features that will help your model perform well.
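The two considerations above can be sketched as follows: compute Pearson's correlation between two features, then replace the correlated pair with a ratio. The sample values are made up for illustration; only the feature names come from the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical sample values.
total_rooms = [880.0, 7099.0, 1467.0, 1274.0]
total_bedrooms = [129.0, 1106.0, 190.0, 235.0]

r = pearson(total_rooms, total_bedrooms)

# A ratio feature carries the relationship between the two columns in one value.
rooms_per_bedroom = [rooms / beds for rooms, beds in zip(total_rooms, total_bedrooms)]
```

A coefficient close to 1 or -1 signals that the two features carry largely the same information, which is the case the ratio feature is meant to address.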
Notebook
To get started with feature engineering and management, you should begin by exploring the Jupyter notebook. Click on the Jupyter icon at the top right corner of our Anyscale Workspace page to open up our JupyterLab instance in a new tab.
Navigate to the notebooks directory and open the madewithml.ipynb notebook to walk through the core machine learning workloads interactively.
Compute Configuration
The compute configuration determines what resources our workloads will be executed on.
We've already created a compute configuration for our workloads, but we can create one from scratch if needed. This involves defining the specifications for our computing environment.
The compute configuration is essentially a blueprint for our computing resources. We can customize it to suit our specific requirements and workload needs.
For instance, we can specify the type of hardware, the amount of memory, and the processing power required for our workloads. This ensures that our workloads are executed efficiently and effectively.
Machine Learning
Machine Learning is a subset of artificial intelligence that enables systems to learn from data without being explicitly programmed. This is achieved through algorithms that can improve their performance on a task over time.
One of the key characteristics of Machine Learning is its ability to handle high-dimensional data, such as images and text, which can be difficult to work with using traditional programming methods. Machine Learning models can be trained on large datasets, allowing them to learn complex patterns and relationships.
In the context of MLOps, Machine Learning models are often deployed in production environments, where they can be used to make predictions or decisions in real-time. This requires careful consideration of issues such as model serving, model monitoring, and model maintenance.
Machine Learning Hyperparameter Tuning
Hyperparameter tuning is a crucial step in developing an accurate model. It involves adjusting the model's hyperparameters, the settings chosen before training rather than learned from the data, to optimize its performance.
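To measure the accuracy of a LightGBM Regression model, the adjusted R-squared is used. R-squared represents the proportion of the variance of the dependent variable that's explained by the independent variables; the adjusted version corrects it for the number of independent variables, so adding uninformative features does not inflate the score.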
Hyperparameter tuning can improve the selected performance metrics, though the size of the gain varies from model to model. In the example, Bayesian optimization with the hyperopt library was used to perform the tuning.
The Root Mean Squared Error (RMSE) is another important metric used in regression tasks to measure the error of a model. It has the advantage of being expressed in the same units as the target variable.
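Both metrics are short enough to compute by hand. The sketch below uses the standard formulas: RMSE is the square root of the mean squared error, and adjusted R-squared is 1 - (1 - R²)(n - 1)/(n - p - 1) for n samples and p features. The function names are illustrative, not taken from the example's code.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Proportion of target variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(y_true, y_pred, n_features):
    """R-squared penalized for the number of features used by the model."""
    n = len(y_true)
    return 1 - (1 - r_squared(y_true, y_pred)) * (n - 1) / (n - n_features - 1)
```

For the same predictions, the adjusted value is never higher than plain R-squared, and the gap widens as more features are used.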
For model training, features registered in the Feature Store are used to build the training dataset. The dataset is then split into training (70%) and test (30%) sets for modeling.
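A 70/30 split can be sketched with the standard library as shown below. The function name, the fixed seed, and the integer stand-in rows are assumptions for illustration; the original example may well have used a library helper instead.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 feature rows
train, test = split_train_test(rows)
```

Fixing the seed makes the split reproducible, so later runs evaluate against the same test rows.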
Hyperparameter tuning sometimes yields only small improvements in performance metrics, but it is still worth doing. In the example, the adjusted R-squared improved from 0.83 to 0.84 after tuning.
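A tuning loop has the same shape whatever the search strategy: define a search space, score each candidate, and keep the best. The sketch below uses plain random search as a stand-in; it is not the Bayesian optimization with hyperopt that the example used. The objective function and the parameter ranges are hypothetical, and a real objective would train a model and return its validation error.

```python
import random


def objective(params):
    """Hypothetical validation loss; lower is better."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["num_leaves"] - 31) ** 2 / 1000


def random_search(objective, n_trials=50, seed=0):
    """Sample candidates at random and return the best (params, loss) pair."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(0.01, 0.3),  # assumed range
            "num_leaves": rng.randint(8, 64),  # assumed range
        }
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(objective)
```

Bayesian optimization replaces the random sampling step with a model that proposes promising candidates based on earlier trials; the rest of the loop stays the same.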
Machine Learning Operations
Machine Learning Operations (MLOps) refers to a set of practices to efficiently and reliably deploy and maintain ML models in the production environment.
MLOps began as only a set of best practices, but today it has evolved into a completely independent approach to the process of machine learning lifecycle management.
The goal of MLOps is to ensure that ML models are deployed and maintained in a way that is efficient, reliable, and scalable.
MLOps involves a range of activities, including model training, testing, deployment, and monitoring.
To implement MLOps, DevOps engineers, ML engineers, and Data Scientists work together to transition algorithms to production systems.
Key features of MLOps include the ability to configure scheduled runs or event-driven runs, set up continuous training pipelines, and maintain and store older runs.
Here are some key features of training operationalization:
- Capability to configure scheduled runs or event-driven runs that are initiated when new data arrives or the model starts decaying.
- Set up continuous training pipelines with custom hyperparameter settings.
- Access to a model registry that serves as the repository for ML artifacts.
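An event-driven trigger of the kind listed above can be sketched as a single decision function. The thresholds and parameter names are assumptions chosen for illustration; a real pipeline would read them from its own configuration.

```python
def should_retrain(new_rows, current_metric, baseline_metric,
                   min_new_rows=1000, max_metric_drop=0.05):
    """Decide whether to start a training run.

    Retrain when enough new data has arrived, or when the monitored metric
    has fallen too far below the value recorded at deployment time.
    """
    has_new_data = new_rows >= min_new_rows
    has_decayed = (baseline_metric - current_metric) > max_metric_drop
    return has_new_data or has_decayed
```

For example, `should_retrain(1500, 0.83, 0.84)` returns True because enough new rows arrived, while `should_retrain(10, 0.83, 0.84)` returns False because neither condition is met.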
By implementing MLOps, organizations can ensure that their ML models are deployed and maintained in a way that is efficient, reliable, and scalable.
Deployment and Monitoring
Model deployment is a complex process that involves multiple components such as Continuous Integration (CI), Continuous Delivery (CD), online experimentation, and production deployment. This process is crucial for making the model available for use in the actual production environment.
In Databricks, open the registered models section and enable serving for the model. The serving status appears at the top left, and the process creates a cluster on which the currently registered model is deployed.
The key features of model deployment include continuous integration, continuous delivery, and different strategies of production deployment such as Canary deployment, Shadow deployment, and Blue/green deployment. These strategies help ensure that the model is properly tested and validated before being deployed to production.
Online experiments such as smoke testing, A/B testing, and multi-armed bandit (MAB) testing are carried out to check whether the new model performs better than the old one. When a new model is considered for production, the old model keeps running in parallel, and a subset of traffic is then routed to the newer version.
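Routing a subset of traffic to the newer model can be sketched as a small router that picks a model per request. The two model functions are hypothetical stand-ins, and the 10% share is an arbitrary choice for illustration.

```python
import random


def make_router(old_model, new_model, new_traffic_share, seed=None):
    """Return a function that sends a fixed share of requests to the new model."""
    rng = random.Random(seed)

    def route(features):
        if rng.random() < new_traffic_share:
            return "new", new_model(features)
        return "old", old_model(features)

    return route


def old_model(features):
    """Hypothetical stand-in for the model currently in production."""
    return 1.0


def new_model(features):
    """Hypothetical stand-in for the candidate model."""
    return 2.0


route = make_router(old_model, new_model, new_traffic_share=0.1, seed=7)
served = [route({"x": i})[0] for i in range(1000)]
new_share = served.count("new") / len(served)
```

Recording which model served each request is what makes the later comparison between the two versions possible.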
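The production deployment strategies named above differ in how traffic reaches the new model:
- Canary deployment: the new model serves a small share of live traffic first, and the share grows as confidence builds.
- Shadow deployment: the new model receives a copy of live traffic, but its predictions are not returned to users.
- Blue/green deployment: two parallel environments are kept, and traffic is switched from the old one to the new one.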
Model monitoring is an essential task after deployment to make sure the deployed model stays effective. This involves comparing the schema of the predictions against the expected schema and checking for anomalies.
Key features of model monitoring include guarding against data drift and concept drift, and tracking operational metrics such as memory utilization, resource utilization, latency, and throughput.
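A crude data drift check can be sketched by measuring how far the mean of a live feature has moved from its training-time mean, in units of the training standard deviation. The threshold of 0.5 and the sample values are assumptions for illustration; production monitoring typically relies on richer statistical tests.

```python
from statistics import mean, stdev


def mean_shift(reference, live):
    """Shift of the live mean from the reference mean, in reference standard deviations."""
    return abs(mean(live) - mean(reference)) / stdev(reference)


def has_drifted(reference, live, threshold=0.5):
    """Flag drift when the standardized mean shift exceeds the threshold."""
    return mean_shift(reference, live) > threshold


# Hypothetical feature values seen at training time and in production.
reference = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0]
stable = [11.0, 10.0, 12.0, 11.5]
shifted = [15.0, 16.0, 14.5, 17.0]
```

Here `has_drifted(reference, stable)` is False and `has_drifted(reference, shifted)` is True. This check only covers data drift; concept drift shows up in the relationship between inputs and targets, so it needs labeled outcomes to detect.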
Continuous Integration and Deployment
Continuous Integration and Deployment is a crucial step in the MLOps process. It ensures that changes to the code are verified automatically, which helps to prevent bugs and errors from making it into production.
In Databricks, you can use GitHub Actions to automate the CI/CD process. A pull request to the main branch triggers workflows that run through three stages: DEV, TEST, and PROD. The DEV stage runs a workflow called 'eda_modeling' that performs exploratory data analysis, trains models, and promotes the best model to the Model Registry.
The CI/CD process involves multiple components, including Continuous Integration, Continuous Delivery, and online experimentation. Continuous Integration reads the source code and the model from the model registry and checks that the model's input-output format is correct. Continuous Delivery involves three basic phases (deployment to staging, acceptance testing, and deployment to production), followed by progressive delivery.
Here are the key components of the CI/CD process:
- DEV: Runs a workflow called 'eda_modeling' that does exploratory data analysis, modeling, and promoting of the best model to the Model Registry.
- TEST: Runs a workflow called 'job-model-tests' that includes the model tests for the transitions in the 'Staging' and 'Production' stages in the Model Registry.
- PROD: Runs a workflow 'inference' for batch inference against new data.
By automating the CI/CD process, you can ensure that your ML models are deployed quickly and reliably, and that any errors or bugs are caught early on. This helps to improve the overall quality and performance of your models, and reduces the risk of errors or downtime in production.
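The input-output format check that Continuous Integration performs can be sketched as a small test function. The contract used here (one float prediction per input row, no NaN values) and the stand-in model are assumptions for illustration, not the tests from the 'job-model-tests' workflow.

```python
import math


def check_output_format(model, sample_inputs):
    """Raise ValueError if the model's predictions break the expected contract."""
    predictions = model(sample_inputs)
    if len(predictions) != len(sample_inputs):
        raise ValueError("expected one prediction per input row")
    for value in predictions:
        if not isinstance(value, float) or math.isnan(value):
            raise ValueError(f"invalid prediction: {value!r}")
    return True


def dummy_model(rows):
    """Hypothetical stand-in for a model loaded from the registry."""
    return [float(len(row)) for row in rows]


sample_inputs = [{"a": 1}, {"a": 1, "b": 2}]
passed = check_output_format(dummy_model, sample_inputs)
```

Running a check like this on every pull request catches a broken model interface before the model reaches the staging or production stage.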
CI/CD
CI/CD is a crucial part of any software development process, and for MLOps projects, it's essential to automate the deployment of ML models in production.
CI/CD can be achieved using tools like GitHub Actions, which can trigger workflows upon a pull request to the main branch.
These workflows can include multiple stages, such as DEV, TEST, and PROD, each running a specific workflow. For example, the DEV stage can run a workflow called ‘eda_modeling’ that does exploratory data analysis, modeling, and promoting of the best model to the Model Registry.
To set up CI/CD, you'll need to add credentials to the /settings/secrets/actions page of your GitHub repository. This includes a personal access token, which can be generated by following these steps: New GitHub personal access token → Add a name → Toggle repo and workflow → Click Generate token (scroll down) → Copy the token and paste it when prompted for your password.
Once you've set up your credentials, you can make changes to your code and push them to GitHub, which will trigger the workloads workflow. If the workflow succeeds, it will produce comments with the training and evaluation results directly on the pull request.
Versioning
Versioning is a crucial aspect of Continuous Integration and Deployment. It helps keep track of different model versions created during the process.
Model versioning includes storing model files, source code, training settings, and data split information. This allows for easy analysis of performance across different versions.
Having multiple versions of a model is essential, especially when something breaks in the current system: you can roll back to a previous stable version while you fix the issue.
Key features of model versioning include:
- Model tracking and storage for different versions.
- Easy access to model versions together with their parameter settings.
- Automatic creation of an MLflow model object after each run.
- Provision of the complete project environment, including the conda.yaml and requirements.txt files.
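The kind of record a versioning system keeps can be sketched in plain Python. This is a stand-in to show the idea and is not MLflow's API; the class, the function, and the in-memory list used as a registry are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ModelVersion:
    version: int
    model_sha256: str  # fingerprint of the serialized model file
    params: dict  # training settings
    data_split: dict  # how the data was divided


def register_version(registry, model_bytes, params, data_split):
    """Append a new version record to the registry and return it."""
    record = ModelVersion(
        version=len(registry) + 1,
        model_sha256=hashlib.sha256(model_bytes).hexdigest(),
        params=params,
        data_split=data_split,
    )
    registry.append(record)
    return record


registry = []
split = {"train": 0.7, "test": 0.3}
v1 = register_version(registry, b"model-a", {"num_leaves": 31}, split)
v2 = register_version(registry, b"model-b", {"num_leaves": 63}, split)
as_json = json.dumps(asdict(v2))
```

Because each record stores the settings next to a fingerprint of the model file, two versions can be compared, and an earlier one restored, without guessing how it was produced.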
Cluster
When setting up your cluster, you have several options to choose from. You can set up your cluster locally or via Anyscale, or use cloud providers like AWS and GCP, which have community-supported integrations.
To set up your cluster, you'll need to configure the environment and compute settings. You can also use a cluster environment that's already been created for you, like the one we used when setting up our Anyscale Workspace.
You can create or update a cluster environment yourself, which determines where your workloads will be executed, including the operating system and dependencies.
Here are some options for creating a cluster:
- On AWS and GCP
- On Kubernetes via the KubeRay project
- Deploy Ray manually on-prem or onto platforms not listed here
Each of these options has its own benefits and requirements, so be sure to choose the one that best fits your needs.
Anyscale Services
Once you've executed your ML workloads, you're ready to launch your model to production using Anyscale Services. This is where you'll serve your model to users.
To launch your service, make sure to change the $GITHUB_USERNAME in serve_model.yaml. This will allow you to save results from your workloads to S3 for retrieval later.
After updating the config, you're ready to launch your service.
Sources
- https://databuildcompany.com/end-to-end-mlops-with-databricks-a-hands-on-tutorial/
- https://github.com/GokuMohandas/mlops-course
- https://www.projectpro.io/data-science-in-python-tutorial/mlops-python-tutorial-for-beginners
- https://www.linkedin.com/pulse/mlops-tutorial-ishita-patil-fioxc
- https://www.igmguru.com/blog/mlops-tutorial