AutoML in Python can save you a significant amount of time and effort by automating many machine learning tasks. It provides a simple, intuitive interface for building and training machine learning models.
With AutoML in Python, you can easily experiment with different algorithms and hyperparameters to find the best combination for your specific problem. This can be a major time-saver, especially for complex tasks.
AutoML libraries also provide a range of pre-trained models that can be fine-tuned for your specific task, reducing the need for extensive training from scratch.
When to Use AutoML
Apply automated machine learning when you want Azure Machine Learning to train and tune a model for you using the target metric you specify.
Automated ML democratizes the machine learning model development process and empowers users to identify an end-to-end machine learning pipeline for any problem. It's perfect for implementing ML solutions without extensive programming knowledge.
You can use AutoML for classification, regression, forecasting, computer vision, and NLP tasks. It's a game-changer for saving time and resources, applying data science best practices, and providing agile problem-solving.
Here are some key benefits of using AutoML:
- Implement ML solutions without extensive programming knowledge
- Save time and resources
- Apply data science best practices
- Provide agile problem-solving
AutoML Tasks
AutoML tasks can be categorized into several types, including tabular data tasks, computer vision tasks, and natural language processing tasks. The task type determines the function used by your job and the model algorithms applied.
The main task types supported by Automated ML are classification, regression, and forecasting, as well as computer vision and natural language processing tasks. Classification is a type of supervised learning where models learn to use training data and apply those learnings to new data.
Common classification examples include fraud detection, handwriting recognition, and object detection. Classification models predict which categories new data fall into based on learnings from its training data.
Computer vision tasks include image classification, object detection, and instance segmentation. These tasks allow you to easily generate models trained on image data for scenarios like image classification and object detection.
Here are the supported computer vision tasks:
- Multi-class image classification
- Multi-label image classification
- Object detection
- Instance segmentation
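To give a feel for how this looks in code, here is a minimal, hedged sketch using the Azure Machine Learning Python SDK v2; the compute name, data path, and label column below are placeholders, not values from the original article:

```python
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

# Hypothetical compute target, data path, and label column.
image_job = automl.image_classification(
    compute="gpu-cluster",
    experiment_name="automl-image-demo",
    training_data=Input(type=AssetTypes.MLTABLE, path="./data/training-mltable"),
    target_column_name="label",
    primary_metric="accuracy",
)
image_job.set_limits(timeout_minutes=60)  # cap the overall run time
```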
When to Use: Classification, Regression, Forecasting
AutoML is incredibly versatile, and one of its most impressive features is its ability to tackle a wide range of machine learning tasks. Classification, regression, and forecasting are just a few examples of the many tasks AutoML can handle.
Classification is a type of supervised learning where models learn to identify categories or labels in data. With AutoML, you can use classification for tasks like fraud detection, handwriting recognition, and object detection.
Regression, on the other hand, is used for predicting continuous values, such as prices or inventory levels. AutoML makes it easy to implement regression models without extensive programming knowledge.
Forecasting is another crucial task that AutoML can handle. It's used for predicting future values, such as revenue, inventory, or customer demand. With AutoML, automated time-series forecasting combines multiple techniques and approaches to produce a recommended, high-quality forecast.
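As a hedged sketch with the Azure ML Python SDK v2 (the compute name, dataset path, and column names are assumptions for illustration), an automated forecasting job might be configured like this:

```python
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

# Hypothetical compute, data path, and column names.
forecasting_job = automl.forecasting(
    compute="cpu-cluster",
    experiment_name="automl-forecast-demo",
    training_data=Input(type=AssetTypes.MLTABLE, path="./data/train-mltable"),
    target_column_name="demand",
    primary_metric="normalized_root_mean_squared_error",
)
forecasting_job.set_forecast_settings(
    time_column_name="timestamp",  # the time axis of the series
    forecast_horizon=14,           # predict 14 periods ahead
)
```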
For classification tasks, commonly used metrics include accuracy and AUC; forecasting tasks are typically evaluated with error-based metrics such as normalized root mean squared error.
These metrics can help you evaluate the performance of your AutoML models and choose the best one for your specific use case.
Ensemble Models
Ensemble models are enabled by default in automated machine learning. They improve machine learning results and predictive performance by combining multiple models as opposed to using single models. Ensemble iterations appear as the final iterations of your job.
Automated machine learning uses both voting and stacking ensemble methods for combining models. Voting predicts based on the weighted average of predicted class probabilities (for classification tasks) or predicted regression targets (for regression tasks).
Stacking combines heterogeneous models and trains a meta-model based on the output from the individual models. The current default meta-models are LogisticRegression for classification tasks and ElasticNet for regression/forecasting tasks.
The Caruana ensemble selection algorithm with sorted ensemble initialization is used to decide which models to use within the ensemble. It initializes the ensemble with up to five models with the best individual scores, and verifies that these models are within a 5% threshold of the best score to avoid a poor initial ensemble.
To change default ensemble settings, see the AutoML package.
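For example, here is a minimal sketch with the Azure ML Python SDK v2 (the compute name, data path, and target column are placeholders) that turns off the default voting and stacking ensembles:

```python
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

classification_job = automl.classification(
    compute="cpu-cluster",                # hypothetical compute target
    experiment_name="automl-ensemble-demo",
    training_data=Input(type=AssetTypes.MLTABLE, path="./data/train-mltable"),
    target_column_name="target",          # hypothetical label column
    primary_metric="accuracy",
)
# Disable the final voting and stacking ensemble iterations.
classification_job.set_training(
    enable_vote_ensemble=False,
    enable_stack_ensemble=False,
)
```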
Preparation and Data
Preparing your data is a crucial step in automating machine learning with Python. To do this, you can specify separate training data and validation data sets, which is necessary for training automated machine learning jobs.
Training data must be provided to the training_data parameter in the factory function of your automated machine learning job. If you don't explicitly specify a validation_data or n_cross_validation parameter, Automated ML applies default techniques to determine how validation is performed.
The default validation technique depends on the number of rows in your dataset. If your training data has more than 20,000 rows, Automated ML will use a training and validation data split, taking 10% of the initial training data set as the validation set.
If your training data has fewer than 20,000 rows, Automated ML will use a cross-validation approach. If your dataset has fewer than 1,000 rows, ten folds are used. If your dataset has between 1,000 and 20,000 rows, three folds are used.
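A minimal sketch (Azure ML Python SDK v2; the compute name, data path, and column name are hypothetical) that relies on these defaults by passing only training data:

```python
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

classification_job = automl.classification(
    compute="cpu-cluster",                # hypothetical compute target
    experiment_name="automl-default-validation",
    training_data=Input(type=AssetTypes.MLTABLE, path="./data/train-mltable"),
    target_column_name="target",          # hypothetical label column
    primary_metric="accuracy",
    # No validation_data or n_cross_validations here, so Automated ML
    # chooses a validation technique based on the row count.
)
```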
Before preprocessing your data, you can use pandas profiling to perform a quick exploratory data analysis (EDA) and identify variables, types, ranges, and missing values. This can be a useful way to start the AutoML process and create a report with descriptive statistics for each variable.
Preprocessing options are also available; in H2O AutoML, for example, there is minimal support for automated Target Encoding of high-cardinality categorical variables. The only currently supported option is preprocessing=["target_encoding"], which automatically tunes a Target Encoder model and applies it to columns that meet certain cardinality requirements for tree-based algorithms.
Data Split
Data Split is a crucial step in the machine learning process. You need to specify what type of data will be used for training, validation, and testing.
Automated ML requires training data to train ML models, and you can specify what type of model validation to perform. If you don't explicitly specify a validation_data or n_cross_validation parameter, Automated ML applies default techniques to determine how validation is performed.
The choice of validation technique depends on the size of your training data. If it's larger than 20,000 rows, Automated ML will apply a training and validation data split. This means 10% of the initial training data set will be taken as the validation set.
If your training data is smaller than or equal to 20,000 rows, Automated ML will use a cross-validation approach. This involves dividing your data into multiple folds, with the default number of folds depending on the number of rows.
Here's a breakdown of the default validation techniques used by Automated ML:
- More than 20,000 rows: training/validation split, with 10% of the initial training data held out as the validation set
- 1,000 to 20,000 rows: three-fold cross-validation
- Fewer than 1,000 rows: ten-fold cross-validation
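To override these defaults, you can pass a validation set or a fold count explicitly. A hedged sketch with the SDK v2, using the same placeholder names as the earlier sketches:

```python
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

# Option 1: a separate holdout validation set.
job_with_holdout = automl.classification(
    compute="cpu-cluster",
    experiment_name="automl-explicit-validation",
    training_data=Input(type=AssetTypes.MLTABLE, path="./data/train-mltable"),
    validation_data=Input(type=AssetTypes.MLTABLE, path="./data/valid-mltable"),
    target_column_name="target",
    primary_metric="accuracy",
)

# Option 2: explicit cross-validation instead of a holdout set.
job_with_cv = automl.classification(
    compute="cpu-cluster",
    experiment_name="automl-explicit-cv",
    training_data=Input(type=AssetTypes.MLTABLE, path="./data/train-mltable"),
    target_column_name="target",
    primary_metric="accuracy",
    n_cross_validations=5,
)
```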
Preprocessing
Preprocessing is a crucial step in data preparation, and it's where the magic happens. Feature engineering is the process of using domain knowledge of the data to create features that help machine learning algorithms learn better.
Automated machine learning experiments in Azure Machine Learning can apply scaling and normalization techniques to facilitate feature engineering. These techniques, along with feature engineering, are referred to as featurization. Featurization can be applied automatically or customized based on your data.
Featurization steps, such as feature normalization, handling missing data, and converting text to numeric, become part of the underlying model. This means that when you use the model for predictions, the same featurization steps applied during training are applied to your input data automatically.
You can customize featurization in Azure Machine Learning studio by enabling automatic featurization in the View additional configuration section. Alternatively, you can specify featurization in your AutoML Job object using the Python SDK.
In automated machine learning experiments, your data is automatically transformed to numbers and vectors of numbers. The data is also scaled and normalized to help algorithms that are sensitive to features that are on different scales. This is called featurization.
The accepted settings for featurization are:
- "auto" (the default): featurization steps are applied automatically
- "off": no featurization is applied
- "custom": customized featurization is applied
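In the Python SDK v2, these modes are set on the job object. A brief, hedged sketch (the compute, path, and column names are placeholders, as in the earlier sketches):

```python
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

classification_job = automl.classification(
    compute="cpu-cluster",                # hypothetical compute target
    experiment_name="automl-featurization-demo",
    training_data=Input(type=AssetTypes.MLTABLE, path="./data/train-mltable"),
    target_column_name="target",          # hypothetical label column
    primary_metric="accuracy",
)
# "auto" is the default; "off" disables featurization entirely.
classification_job.set_featurization(mode="off")
```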
As of H2O 3.32.0.1, AutoML now has a preprocessing option with minimal support for automated Target Encoding of high cardinality categorical variables. The only currently supported option is preprocessing=["target_encoding"], which automatically tunes a Target Encoder model and applies it to columns that meet certain cardinality requirements for tree-based algorithms.
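In Python, this is enabled through the H2OAutoML constructor. A minimal sketch in which the dataset file and target column are hypothetical:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")         # hypothetical dataset

aml = H2OAutoML(
    max_models=10,
    seed=1,
    preprocessing=["target_encoding"],       # the only supported option
)
aml.train(y="target", training_frame=train)  # "target" is a placeholder column
```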
Pandas Profiling
Pandas profiling is a quick and easy way to perform exploratory data analysis with just a few lines of code. It's a useful starting point for the AutoML process.
Pandas profiling creates a report that contains several descriptive statistics for each variable. This report can help identify variables, types, ranges, and missing values in the dataset.
The EDA takes raw data and surfaces correlations within the dataset. This can be particularly useful for identifying relationships between different variables.
Pandas profiling won't replace the detailed analysis that an experienced data scientist could produce from the same dataset. However, it's a great way to get a quick overview of the data.
The results of pandas profiling are easy to read and share. This makes it a great tool for collaboration and communication with others.
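A quick sketch of what this looks like in practice; note that the package is now published as ydata-profiling (formerly pandas-profiling), and the file names here are placeholders:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # formerly pandas_profiling

df = pd.read_csv("train.csv")              # hypothetical dataset
profile = ProfileReport(df, title="Exploratory Data Analysis")
profile.to_file("eda_report.html")         # shareable HTML report
```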
Model Selection and Training
Automated machine learning supports ensemble models, which improve machine learning results and predictive performance by combining multiple models. Ensemble learning is enabled by default and uses both voting and stacking ensemble methods.
The Caruana ensemble selection algorithm is used to decide which models to use within the ensemble. This algorithm initializes the ensemble with up to five models with the best individual scores and verifies that these models are within a 5% threshold of the best score to avoid a poor initial ensemble.
The task method determines the list of algorithms or models to apply, and the supported algorithms per machine learning task include Logistic Regression, Elastic Net, AutoARIMA, and many others.
Use Pipelines
To use pipelines, you can add AutoML Job steps to your Azure Machine Learning Pipelines. This allows you to automate your entire machine learning operations workflow by hooking your data preparation scripts up to Automated ML.
You can add AutoML Job steps to your pipeline using the Azure Machine Learning SDKv2 or the Azure Machine Learning CLIv2. For example, the code for a sample pipeline with an Automated ML classification component and a command component that shows the resulting output is available in the examples repository.
Here's a step-by-step guide to using pipelines:
1. Identify the machine learning task type, such as classification, regression, or time series forecasting.
2. Choose the algorithms and models to apply, such as Logistic Regression, Light GBM, or AutoARIMA.
3. Configure the pipeline with the necessary inputs and outputs, including the training and validation data.
4. Run the pipeline using the Azure Machine Learning CLIv2 or the Azure Machine Learning SDKv2.
By following these steps, you can automate your machine learning operations workflows and streamline your model development process.
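Here is a hedged sketch of such a pipeline with the SDK v2's DSL, loosely modeled on the examples repository; the compute name, data path, and label column are assumptions:

```python
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

@pipeline(default_compute="cpu-cluster")       # hypothetical compute target
def automl_classification_pipeline(train_data):
    # Automated ML classification step inside the pipeline.
    classification_node = automl.classification(
        training_data=train_data,
        target_column_name="target",           # hypothetical label column
        primary_metric="accuracy",
    )
    classification_node.set_limits(max_trials=10)
    return {"best_model": classification_node.outputs.best_model}

pipeline_job = automl_classification_pipeline(
    train_data=Input(type=AssetTypes.MLTABLE, path="./data/train-mltable"),
)
# Submit with MLClient, e.g. ml_client.jobs.create_or_update(pipeline_job)
```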
Glm Hyperparameters
In the context of GLM, hyperparameters play a crucial role in determining the performance of the model. GLM uses its own internal grid search rather than the H2O Grid interface.
The search for optimal hyperparameters is done by AutoML, which builds a single model with lambda_search enabled. This allows it to pass a list of alpha values and return a single model with the best alpha-lambda combination.
The values that are searched over for GLM hyperparameters include six different alpha values: 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
These specific alpha values are used by AutoML to find the best combination with lambda, resulting in a well-performing model.
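To make the idea concrete, here is a hedged, hand-rolled equivalent of what AutoML does internally, using H2O's GLM estimator with those six alpha values and lambda_search enabled (the training frame and column are hypothetical):

```python
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
train = h2o.import_file("train.csv")         # hypothetical dataset

glm = H2OGeneralizedLinearEstimator(
    alpha=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],    # the alpha values AutoML searches
    lambda_search=True,                       # search lambda for each alpha
)
glm.train(y="target", training_frame=train)  # "target" is a placeholder
```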
GBM Hyperparameters
GBM, or Gradient Boosting Machine, is a popular machine learning algorithm. It has several hyperparameters that need to be tuned for optimal performance.
The learn rate for GBM is hard coded at 0.1, which means it cannot be searched over during hyperparameter tuning.
GBM's max depth can be set to any integer from 3 to 16.
The number of trees in GBM is hard coded at 10000, with early stopping determining how many trees are actually built.
GBM's sample rate can be set to 0.50, 0.60, 0.70, 0.80, 0.90, or 1.00.
To summarize, the GBM hyperparameters searched over here are max depth and sample rate, as shown in the sketch below; the learn rate and tree count stay fixed.
GBM is a powerful algorithm, but its performance can be sensitive to the choice of hyperparameters.
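Here is a hedged sketch of how you might reproduce this search by hand with H2O's grid search, fixing the learn rate and tree count as described above (the frame and column names are hypothetical):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("train.csv")          # hypothetical dataset

# Searchable values quoted above; learn_rate and ntrees stay fixed.
hyper_params = {
    "max_depth": list(range(3, 17)),
    "sample_rate": [0.50, 0.60, 0.70, 0.80, 0.90, 1.00],
}
grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(
        learn_rate=0.1,        # fixed, as in AutoML
        ntrees=10000,          # fixed upper bound
        stopping_rounds=3,     # early stopping trims the tree count
    ),
    hyper_params=hyper_params,
)
grid.train(y="target", training_frame=train)  # "target" is a placeholder
```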
Model Evaluation and Deployment
You can evaluate an experiment by grabbing the best performing model and its associated metrics once a run completes.
After identifying a suitable model, you can deploy it directly from code as either an Azure Container Instance or an Azure Kubernetes Service.
With several authentication and scaling options available, you have the flexibility to choose the best approach for your specific needs.
ONNX
You can convert models built with Azure Machine Learning's automated ML to the ONNX format.
This allows you to run them on various platforms and devices. ONNX is a widely adopted format that supports many algorithms, including those used in Azure Machine Learning.
The ONNX runtime supports C#, making it easy to use these models in .NET applications without needing to recode or deal with REST endpoint latencies.
You can learn more about using an AutoML ONNX model in a .NET application with ML.NET and inferencing ONNX models with the ONNX runtime C# API.
To get started, you can check out the Jupyter notebook example that shows how to convert models to ONNX format.
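As a small, hedged illustration in Python (the model file, input name, and feature shape are placeholders), scoring an exported model with the ONNX runtime looks roughly like this:

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" and the (1, 4) feature shape are hypothetical.
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
sample = np.random.rand(1, 4).astype(np.float32)

outputs = session.run(None, {input_name: sample})
print(outputs[0])  # model predictions for the sample row
```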
Leaderboard
The leaderboard is a crucial tool in evaluating the performance of your AutoML models. It ranks models based on their performance, with the default metric depending on the problem type.
In binary classification problems, the default metric is AUC, while in multiclass classification problems, it's mean per-class error. In regression problems, the default sort metric is deviance.
You can adjust the number of folds used in the model evaluation process by specifying the nfolds parameter. This allows you to fine-tune the evaluation process to suit your needs.
The leaderboard also includes additional metrics, such as training time and prediction time. You can add these columns to the leaderboard by specifying the extra_columns parameter.
Here are some of the extra columns you can add to the leaderboard:
- training_time_ms: A column providing the training time of each model in milliseconds.
- predict_time_per_row_ms: A column providing the average prediction time by the model for a single row.
- ALL: Adds columns for both training_time_ms and predict_time_per_row_ms.
By examining the leaderboard, you can get a clear picture of your models' performance and make informed decisions about which ones to deploy.
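For example, retrieving the extended leaderboard in Python (assuming aml is an H2OAutoML object whose training run has finished):

```python
import h2o

# "aml" is assumed to be a completed H2OAutoML run.
lb = h2o.automl.get_leaderboard(aml, extra_columns="ALL")
print(lb.head(rows=10))  # top models with timing columns included
```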
Log
Accessing meta information about your AutoML model's training process can be incredibly useful for post-analysis. You can do this by checking the event log, which is an H2OFrame that lists the selected AutoML backend events generated during training.
The event log is a valuable resource for understanding what happened during the training process. You can access it using the event_log property of the AutoML object.
The training_info dictionary exposes data that could be useful for post-analysis, such as various timings. However, if you want training and prediction times for each model, it's easier to explore that data in the extended leaderboard using the h2o.get_leaderboard() function.
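A brief sketch (again assuming aml is a completed H2OAutoML run; the specific keys inside training_info can vary, so inspect the dict first):

```python
# "aml" is assumed to be a completed H2OAutoML run.
events = aml.event_log        # H2OFrame of selected backend events
info = aml.training_info      # dict of timings and other metadata

print(events.head())
print(sorted(info.keys()))    # see which timing keys are available
```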
Model Deployment
Model deployment is a crucial step after testing a model. You can register it for later use in the Azure Machine Learning studio.
With a registered model, you can use one-click deployment. This makes it easy to deploy your model without having to write a lot of code.
You can also deploy a model directly from code as an Azure Container Instance or an Azure Kubernetes Service. This gives you flexibility in how you deploy your model.
Using automated ML, you can build a Python model and convert it to the ONNX format. This makes it possible to run the model on various platforms and devices.
The ONNX runtime supports C#, so you can use the model built automatically in your C# apps without needing to recode or use REST endpoints.
Tools and Integration
To get started with AutoML in Python, you'll need to install the necessary tools. This can be done by installing the AutoML Tools runtime environment for Windows or Linux.
The easiest way to do this is to download the ready-to-use AutoML Tools Python environment from the ActiveState Platform. To access this, you'll need to create an account using your GitHub credentials or email address.
Signing up is a straightforward process that unlocks the ActiveState Platform's benefits for you. For Windows users, you can download and install the AutoML Tools runtime by running a specific command at a CMD prompt.
Linux users can do the same by running a different command. Both commands automatically download and install the CLI, State Tool, and AutoML Tools runtime into a virtual environment.
Experiment Management
Experiment management is a crucial part of automating machine learning tasks with Python. Automated machine learning jobs are currently supported only on an Azure Machine Learning remote compute cluster or compute instance.
To run an experiment, you can use the MLClient created in the prerequisites to run a command in the workspace. This command can be used to return information about the job, including its final metrics score and generated models.
If you run an experiment with the same configuration settings and primary metric multiple times, you might see variation in each experiment's final metrics score and generated models due to inherent randomness in the algorithms used. You might also see results with the same model name but different hyperparameters.
Configuring your experiment settings is essential to ensure that your automated machine learning experiment behaves as expected. This includes setting job training settings and exit criteria with the training and limits settings, such as specifying accuracy as the primary metric and five cross-validation folds.
Running the experiment is where the magic happens. You configure how your machine learning experiment should behave: selecting the task, giving Azure a dataset, and telling it which column to predict and which machine learning metric is most important. You can also configure cross-validation, how many models to try, the compute resource to use, and how many nodes in the cluster to activate.
Once you have configured your experiment, you can create and submit it, causing Azure to run your experiment and compare the various models it generates until it finds the best performing model based on the validation criteria you specified. This process can be monitored directly in a Jupyter Notebook in your IDE using the Azure ML Python SDK's widgets.
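A minimal, hedged sketch of that submission step with the SDK v2; the workspace details are placeholders, and classification_job is an AutoML job configured as in the earlier sketches:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",     # placeholder
    resource_group_name="<resource-group>",  # placeholder
    workspace_name="<workspace>",            # placeholder
)
# "classification_job" is an AutoML job built as in the earlier sketches.
returned_job = ml_client.jobs.create_or_update(classification_job)
print(returned_job.studio_url)  # follow the run in Azure ML studio
```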
Frequently Asked Questions
What is AutoML in Python?
AutoML in Python automates the process of selecting and optimizing machine learning models, making it easier to apply machine learning to real-world problems. This simplifies the process, requiring less domain knowledge and expertise.
Which AutoML library is best?
There is no single "best" AutoML library, as each has its strengths and weaknesses, but popular choices include PyCaret and H2O AutoML for ease of use and Auto-sklearn and FLAML for advanced features. The right choice depends on your specific project needs and goals.