SageMaker AutoML automates much of the machine learning process, allowing you to go from data to model in a matter of minutes. With SageMaker AutoML, you can train models on large datasets and achieve good accuracy without requiring extensive machine learning expertise.
Amazon SageMaker AutoML supports regression, binary classification, and multiclass classification problems. It evaluates a range of algorithms for you, including popular ones like XGBoost and LightGBM.
SageMaker AutoML also provides a user-friendly interface: you upload your dataset, choose the column to predict, and let SageMaker AutoML do the rest.
What Is AutoML?
AutoML is a low-code approach to machine learning that automates the creation, training, and deployment of models.
It automatically prepares your data, selects the best algorithm and hyperparameters, and trains the model for you. This empowers users with limited coding experience to build and deploy sophisticated machine learning models.
With AutoML, you can deploy the model and use it to make predictions with very little intervention on your part.
It's not about devaluing the work of data scientists, but rather saving their time so they can focus on more important tasks like feature engineering and data cleaning.
The option to 'tinker under the hood' is always available, if more optimisation is desired.
Setting Up Training
To set up training for your SageMaker AutoML project, you'll need to select a column to predict, which is a required input. This is where you choose the variable your model will be predicting, such as the loan_status column.
You can also use the Model Type section to configure other aspects of the model training, like choosing the metric you train on or setting the maximum training time. However, for the purpose of exploring AutoML features, the quick build option should be sufficient.
Here are the basic steps to follow:
- Select the target variable (e.g. loan_status column)
- Configure advanced options (e.g. metric, split, max training time) in the Model Type section
- Click Quick build to start the training process
In more detail, start by selecting the dataset you want to work with. Then tell the model which variable to predict by choosing it in the Select a column to predict section; this input is required.
Once the target variable is selected, its distribution is displayed, which is useful for checking whether the classes are balanced.
The Model Type section offers advanced options: the metric you train on, the split of the data into training and validation sets, and the maximum time the model is allowed to train.
To start the training process, click Quick build in the top right-hand corner of the screen.
Note that you can select the standard build option for a more accurate model, but this will take hours rather than minutes.
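The same settings can also be supplied through the API instead of the UI. The following is a minimal sketch that builds the arguments for boto3's `create_auto_ml_job` call; the job name, bucket, role ARN, and the 3600-second limit are placeholder assumptions, and the function that would actually submit the job is defined but not called.

```python
def build_automl_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                         target_column, metric="Accuracy",
                         max_runtime_seconds=3600):
    """Build keyword arguments for boto3's SageMaker create_auto_ml_job call."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
            },
            # The column the model should predict (the required input).
            "TargetAttributeName": target_column,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ProblemType": "BinaryClassification",
        # The metric the candidates are optimized for.
        "AutoMLJobObjective": {"MetricName": metric},
        "AutoMLJobConfig": {
            # Upper bound on how long the whole job may run.
            "CompletionCriteria": {
                "MaxAutoMLJobRuntimeInSeconds": max_runtime_seconds
            }
        },
        "RoleArn": role_arn,
    }


def start_job(request):
    """Submit the job. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3
    return boto3.client("sagemaker").create_auto_ml_job(**request)


# Placeholder values, not real resources.
request = build_automl_request(
    job_name="loan-status-automl",
    train_s3_uri="s3://example-bucket/loans/train.csv",
    output_s3_uri="s3://example-bucket/loans/output/",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    target_column="loan_status",
)
print(request["InputDataConfig"][0]["TargetAttributeName"])
```

Keeping the request as a plain dictionary makes it easy to review or version the job settings before anything is submitted.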
Training Data Format – File Mode vs Pipe Mode vs Fast File Mode
SageMaker supports Simple Storage Service (S3), Elastic File System (EFS), and FSx for Lustre for training dataset location. These storage options give you flexibility when it comes to where you store your data.
Most built-in algorithms work best with the optimized protobuf RecordIO format for the training data. This format allows for efficient data processing and can improve training times.
Using the RecordIO format also lets algorithms that support it train in Pipe mode, which streams the data instead of copying it first. This can lead to faster training times.
SageMaker offers three input modes for reading training data: File mode, Pipe mode, and Fast File mode. Each mode has its own advantages and disadvantages.
Here's a brief overview of each mode:
- File mode: This is the default mode. SageMaker downloads the full dataset to the training instance before training starts, which suits most algorithms.
- Pipe mode: This mode streams the data from S3 directly to the algorithm, so training can start sooner and needs less local storage.
- Fast File mode: This mode exposes S3 objects through a file system interface and streams them on demand, so code written for File mode gets streaming behavior without being rewritten for Pipe mode.
Not all algorithms support Pipe mode, so be sure to check the documentation for your specific algorithm.
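In the API, the input mode is set per data channel of a training job. The sketch below builds one entry of the `InputDataConfig` list that boto3's `create_training_job` accepts; the channel name and S3 prefix are placeholder assumptions, and nothing is sent to AWS.

```python
VALID_INPUT_MODES = ("File", "Pipe", "FastFile")


def build_channel(name, s3_uri, input_mode="File", content_type=None):
    """Build one InputDataConfig channel for a SageMaker training job."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"input_mode must be one of {VALID_INPUT_MODES}")
    channel = {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_uri}
        },
        # File copies the data first; Pipe and FastFile stream it from S3.
        "InputMode": input_mode,
    }
    if content_type is not None:
        channel["ContentType"] = content_type
    return channel


# RecordIO-protobuf data streamed in Pipe mode (placeholder S3 prefix).
train_channel = build_channel(
    "train",
    "s3://example-bucket/train-recordio/",
    input_mode="Pipe",
    content_type="application/x-recordio-protobuf",
)
print(train_channel["InputMode"])
```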
Evaluating Performance
Evaluating performance is a crucial step in refining your machine learning model. You can do this in the Analyze tab, which shows the overall score for the metric you trained on, such as accuracy.
To better understand your model's decision-making process, you can also view the feature importance, which measures how much each independent variable contributes to the model's predictions.
In the advanced metrics tab, you'll find a confusion matrix, which compares actual and predicted categories so you can see where your model classifies your data correctly and where it does not.
The metrics that matter most depend on the type of problem you're tackling.
Candidate Metrics
Amazon SageMaker Autopilot produces metrics that measure the predictive quality of machine learning model candidates. These metrics vary depending on the problem type.
For example, regression candidates are scored with error metrics such as mean squared error (MSE), while classification candidates are scored with metrics such as accuracy, F1, and AUC. For classification problems, the confusion matrix in the advanced metrics tab complements these scores.
Together, these metrics show how well your model is performing and where it needs improvement.
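To make the classification metrics concrete, here is a small self-contained sketch that derives the four confusion-matrix counts for a binary problem and computes accuracy, precision, recall, and F1 from them. The labels are made-up illustration data, not output from a real model.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def classification_metrics(actual, predicted, positive):
    """Derive accuracy, precision, recall, and F1 from the confusion counts."""
    c = confusion_counts(actual, predicted, positive)
    tp, fp, fn, tn = c["tp"], c["fp"], c["fn"], c["tn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only.
actual = ["default", "paid", "paid", "default", "paid", "paid"]
predicted = ["default", "paid", "default", "paid", "paid", "paid"]
metrics = classification_metrics(actual, predicted, positive="default")
print(metrics)
```

In this example the accuracy is about 0.67 while precision and recall for the "default" class are both 0.5, which shows why accuracy alone can hide weak performance on the less frequent class.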
Monitor
Monitoring your models is a crucial step in evaluating their performance. It's like keeping an eye on your car's mileage to ensure it's running smoothly.
Continuous monitoring can be set up with a real-time endpoint or a batch transform job that runs regularly. This allows you to catch any issues before they become major problems.
SageMaker Model Monitor provides prebuilt monitoring capabilities that don't require coding, while still letting you write your own analysis code when you need more flexibility. You can set up alerts that notify you when model quality deviates.
Early detection of deviations is key, enabling you to take corrective actions such as retraining models or fixing quality issues without having to monitor models manually or build additional tooling. This saves you time and effort in the long run.
SageMaker Model Monitor can run in the following ways:
- Continuous monitoring with real-time endpoints or batch transform jobs
- On-schedule monitoring for asynchronous batch transform jobs
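To illustrate what "detecting a deviation" can mean, the sketch below compares a recent batch of one numeric feature against a training-time baseline. It is a hand-rolled illustration, not how SageMaker Model Monitor works internally; the threshold of one standard deviation and the sample values are arbitrary assumptions.

```python
from statistics import mean, pstdev


def mean_shift(baseline, current):
    """Distance between the two means, in baseline standard deviations."""
    spread = pstdev(baseline)
    difference = abs(mean(current) - mean(baseline))
    if spread == 0:
        return 0.0 if difference == 0 else float("inf")
    return difference / spread


def has_drifted(baseline, current, threshold=1.0):
    """Flag the batch when its mean moved more than `threshold` deviations."""
    return mean_shift(baseline, current) > threshold


# Made-up feature values for illustration only.
baseline = [10, 12, 11, 13, 9]
stable_batch = [11, 12, 10, 11]
shifted_batch = [15, 16, 14, 15]
print(has_drifted(baseline, stable_batch))
print(has_drifted(baseline, shifted_batch))
```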
Inference and Deployment
You can deploy your SageMaker AutoML model in one of two ways: automatic deployment or manual deployment. Automatic deployment means SageMaker will test multiple models and automatically deploy the model with the best results.
To deploy a model manually, you can select the best model after evaluating it and click "Deploy" in the top right corner. In SageMaker Studio, only provisioned endpoints are supported.
Here are the main options for inference and deployment:
- Batch Inference: runs predictions on a whole set of data at once, which is useful for regenerating all predictions after creating a new model.
- Real-Time Inference: best for applications that respond directly to users and where response time is critical.
- Multi-Model Endpoints: provide a scalable and cost-effective solution for deploying large numbers of models using a shared endpoint.
- Model Deployment: helps deploy the ML code to make predictions, supports auto-scaling and high availability.
Batch Inference
Batch Inference is a useful feature for making predictions on a set of data all at once.
You can find Batch Inference in the Predict tab of your model in Canvas. It's a good fit for generating predictions for all your data at once or for regenerating all predictions after creating a new model.
To set up a batch inference job, you provide the following details:
- The instance type and instance count, which determine the resources allocated to the job.
- The input data location, which is read from S3.
- The output location, which is also in S3.
Once you're done, you can create the batch transform job and monitor its progress in the processing jobs section of the AWS web console.
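The same details map onto the arguments of boto3's `create_transform_job` call. The sketch below only builds the request; the job name, model name, S3 locations, and the `ml.m5.large` instance type are placeholder assumptions, and CSV input with one record per line is assumed.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type="ml.m5.large", instance_count=1):
    """Build keyword arguments for boto3's SageMaker create_transform_job."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            # Treat every line of the input files as one record.
            "SplitType": "Line",
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        # Resources allocated to the batch prediction job.
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


# Placeholder values, not real resources.
transform_request = build_transform_request(
    job_name="loan-status-batch",
    model_name="loan-status-best-model",
    input_s3_uri="s3://example-bucket/loans/to-score/",
    output_s3_uri="s3://example-bucket/loans/predictions/",
)
print(transform_request["TransformResources"])
```

With boto3 installed and credentials configured, the request could be submitted with `boto3.client("sagemaker").create_transform_job(**transform_request)`.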
Real-Time Inference
Real-time inference suits applications that respond directly to users and where response time is critical. You can deploy a real-time inference endpoint in SageMaker Studio by selecting the model from the AutoML run and clicking on deploy in the top right corner.
To deploy a real-time inference endpoint, you only need to specify an endpoint name and an instance type. In contrast to batch inference, you don't need to worry about complex configurations.
If you want to deploy a serverless endpoint, you can do so by using the deployment mechanism provided by the AWS web console. This allows you to select a model to be deployed and click Create endpoint.
Once your endpoint deployment is finished, you can use the AWS SDK to query the model for predictions. This makes it easy to integrate real-time inference into your application.
You can also use variants to test multiple models or model versions behind the same endpoint. This is especially useful for A/B testing or comparing the performance of different models.
Here are the key benefits of using real-time inference:
- Low-latency responses for applications that interact directly with users
- Easy deployment in SageMaker Studio
- Support for serverless endpoints
- Use of AWS SDK for querying models
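The endpoint options above can be sketched as `ProductionVariants` entries for boto3's `create_endpoint_config`, together with a helper that queries a deployed endpoint through the `sagemaker-runtime` client. Model names, the endpoint name, the instance type, and the weights are placeholder assumptions; the query helper needs boto3, credentials, and a live endpoint, so it is defined but not called.

```python
def provisioned_variant(name, model_name, instance_type, weight=1.0):
    """A variant backed by a dedicated instance."""
    return {
        "VariantName": name,
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "InitialVariantWeight": weight,
    }


def serverless_variant(name, model_name, memory_mb=2048, max_concurrency=5):
    """A variant that runs on demand instead of on a dedicated instance."""
    return {
        "VariantName": name,
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


def traffic_shares(variants):
    """Each variant receives traffic in proportion to its weight."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


def predict(endpoint_name, csv_row):
    """Query a deployed endpoint. Not executed in this sketch."""
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="text/csv", Body=csv_row
    )
    return response["Body"].read().decode("utf-8")


# Two model versions behind one endpoint for an A/B test (placeholder names).
ab_test = [
    provisioned_variant("current", "loan-model-v1", "ml.m5.large", weight=3.0),
    provisioned_variant("challenger", "loan-model-v2", "ml.m5.large", weight=1.0),
]
print(traffic_shares(ab_test))
```

With weights of 3 and 1, the current model receives three quarters of the requests and the challenger one quarter.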
Step 2: Deploy
Deploying your machine learning model makes it accessible to users. As described above, you can let SageMaker deploy the best model automatically or deploy a model manually after evaluating it.
Automatic deployment is especially useful for beginners, because SageMaker tests multiple models and deploys the one with the best results.
Manual deployment gives you more control: you evaluate the candidates yourself and then deploy the model you choose.
With SageMaker Autopilot, you can deploy a model in three ways:
- Automatically: SageMaker will automatically deploy the best model from the experiment to an endpoint.
- Manually: You can choose not to auto-deploy your model and instead deploy it manually.
- API calls: You can also deploy your model through API calls, which is useful if you want to create AutoML jobs without logging into the AWS console.
By deploying your model, you'll be able to make predictions and provide value to your users.
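When deploying through API calls, a typical first step is to pick the best candidate from the job's results. The sketch below ranks entries shaped like the `Candidates` list returned by boto3's `list_candidates_for_auto_ml_job`; the candidate names and scores are made up, and the handling of the metric's `Type` field (treated as "Maximize" when absent) is an assumption of this example.

```python
def best_candidate(candidates):
    """Return the completed candidate with the best objective metric."""
    completed = [c for c in candidates if c.get("CandidateStatus") == "Completed"]
    if not completed:
        raise ValueError("no completed candidates to choose from")

    def score(candidate):
        metric = candidate["FinalAutoMLJobObjectiveMetric"]
        # Error metrics such as MSE are minimized, so flip their sign.
        sign = -1.0 if metric.get("Type") == "Minimize" else 1.0
        return sign * metric["Value"]

    return max(completed, key=score)


def register_model(candidate, model_name, role_arn):
    """Create a SageMaker model from a candidate. Not executed in this sketch."""
    import boto3
    return boto3.client("sagemaker").create_model(
        ModelName=model_name,
        Containers=candidate["InferenceContainers"],
        ExecutionRoleArn=role_arn,
    )


# Made-up candidates for illustration only.
candidates = [
    {"CandidateName": "candidate-a", "CandidateStatus": "Completed",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy",
                                       "Type": "Maximize", "Value": 0.91}},
    {"CandidateName": "candidate-b", "CandidateStatus": "Completed",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy",
                                       "Type": "Maximize", "Value": 0.94}},
    {"CandidateName": "candidate-c", "CandidateStatus": "Failed",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:accuracy",
                                       "Type": "Maximize", "Value": 0.99}},
]
print(best_candidate(candidates)["CandidateName"])
```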
Amazon Services and Tools
You can use Autopilot with code, via an AWS SDK, or without code, via Amazon SageMaker Studio. It supports data tables formatted as Parquet or CSV files.
Autopilot can handle large datasets of up to hundreds of GBs and supports regression, binary classification, and multiclass classification problems.
It accepts numerical, categorical, text, and time series columns, where a time series consists of strings of comma-separated numbers.
How Autopilot Works
Amazon SageMaker Autopilot automates key machine learning tasks, such as exploring data, selecting algorithms, and preparing data for model training or tuning.
It uses cross-validation procedures to evaluate the capacity of prospective algorithms to make predictions based on new data inputs.
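As a rough illustration of the idea behind cross-validation, the sketch below splits row indices into k folds and uses each fold once as validation data while training on the rest. It is a generic k-fold split written for this article, not Autopilot's actual procedure.

```python
def k_fold_indices(n_rows, k):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    # Row i goes to fold i % k, so fold sizes differ by at most one.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        training = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield training, validation


splits = list(k_fold_indices(6, 3))
for training, validation in splits:
    print("train on", training, "validate on", validation)
```

Every row is used for validation exactly once, so the averaged score reflects how the model handles data it was not trained on.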
Autopilot provides metrics that assess the predictive abilities of candidate machine learning models, simplifying the machine learning process using automated tasks in an AutoML pipeline.
It ranks the optimized models by performance and identifies the best-performing model, allowing you to deploy the appropriate model much faster than normal.
Autopilot generates reports indicating each feature's importance for the predictions that the best candidate model made, making it easier for AWS customers to understand ML models.
You can use Autopilot's model governance report to make informed risk and compliance decisions and present the report to external auditors and regulators.
Autopilot provides full visibility into the processes used to wrangle data, select models, and train and tune each candidate tested.
The main steps of an Autopilot-managed AutoML process are:
- Explore data
- Select algorithms
- Prepare data for model training or tuning
- Train and tune models
- Evaluate and rank models
- Deploy the best-performing model
Amazon Explainability
Explainability helps you understand how your machine learning models make predictions. This is crucial for building models that regulators and consumers can trust.
Amazon SageMaker Autopilot uses Amazon SageMaker Clarify tools to provide explainability functionality. These tools help you understand a model's characteristics before deploying it, and debug its predictions after deployment.
You can use SHAP (SHapley Additive exPlanations) values to explain why a model generated specific predictions. Autopilot provides a plot of SHAP values that shows the importance of each feature for your model.
This approach helps you understand why your trained model made a given prediction, providing per-instance explanations at inference. This is particularly useful for auditing and ensuring regulatory compliance.
In short, a SHAP value is a feature's average marginal contribution to a prediction, taken over all possible combinations of the other features.
By using SHAP values and other explainability tools, you can build models that are more transparent and easier to trust.
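To show what a SHAP value measures, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. It is an illustration of the concept only: SageMaker Clarify uses an approximation that scales to real models, and the toy scoring function, instance, and all-zero baseline here are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; the cost grows exponentially with feature count."""
    n = len(instance)

    def value(subset):
        # Features in `subset` keep the instance's value; the rest use the baseline.
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        values.append(total)
    return values


def toy_score(features):
    """A made-up linear model used only for this illustration."""
    return 2 * features[0] + 3 * features[1] - 1 * features[2]


instance = [1, 2, 3]
baseline = [0, 0, 0]
contributions = shapley_values(toy_score, instance, baseline)
print(contributions)
```

For this linear toy model the contributions come out as approximately 2, 6, and -3, and they add up to the difference between the prediction for the instance and the prediction for the baseline.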
Frequently Asked Questions
What is AutoML in SageMaker?
Amazon SageMaker Autopilot is SageMaker's AutoML capability. It automates the machine learning process, from building to deploying models, which simplifies the creation of accurate and reliable models.
What are the disadvantages of AutoML?
AutoML has several disadvantages, including limited customization options, data quality and preprocessing challenges, and a lack of interpretability and collaboration features. These limitations can hinder the effectiveness and reliability of machine learning models.