H2O AutoML is a powerful tool that automates the machine learning process, allowing data scientists to focus on higher-level tasks.
It covers the pipeline from data preparation through model selection and tuning, and it can train and tune a large number of models within a user-specified time or model-count budget.
This automation saves significant time and effort, freeing data scientists to explore more ideas and experiment with different models.
Getting Started
To get started with H2O AutoML, all you need to do is point to your dataset and identify the response column. The AutoML interface is designed to be user-friendly, with as few parameters as possible.
You specify the data arguments, such as x, y, training_frame, and validation_frame, exactly as you would for other H2O algorithms; most of the time these are the only parameters you'll need to set.
AutoML automates the supervised machine learning model training process, finding the best model given a training frame and response. It returns an H2OAutoML object, which contains a leaderboard of all the models trained, ranked by a default model performance metric.
You can configure values for max_runtime_secs and/or max_models to set explicit time or number-of-model limits on your run. This allows you to control how long the AutoML process takes and how many models are trained.
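Here is a minimal Python sketch of such a run. The file path and the "response" column name are placeholders for your own data:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")  # placeholder path for your dataset

# Cap the run at 20 models or 10 minutes, whichever comes first.
aml = H2OAutoML(max_models=20, max_runtime_secs=600, seed=1)
aml.train(y="response", training_frame=train)

# The leaderboard ranks every trained model by a default performance metric.
print(aml.leaderboard.head())
```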
Data Preparation
Data preparation in H2O AutoML requires only minimal manual work: you point AutoML at a training frame and identify the response. As of H2O 3.32.0.1, AutoML also includes an automated preprocessing option, covered in the Preprocessing section below.
Required Data Parameters
Every AutoML run requires two data parameters, and both identify the data you're modeling. The response column, denoted by y, specifies the name or index of the column to predict. The training set, specified by training_frame, determines the dataset used for training, and a well-prepared training set can make all the difference in achieving accurate results.
To recap, the key parameters to keep in mind are:
- y: The name or index of the response column.
- training_frame: The training set.
Optional Data Parameters
You can specify a list of predictor column names or indexes using the x argument. It only needs to be set when you want to exclude some columns from the set of predictors; if you omit it, every column other than the response is used.
The validation_frame argument is ignored unless nfolds is set to 0, in which case you can specify a validation frame for early stopping of individual models and grid searches.
You can use the leaderboard_frame argument to specify a data frame for leaderboard scoring. This frame won't be used for anything else besides ranking models.
The blending_frame argument allows you to specify a frame for computing predictions that serve as the training frame for Stacked Ensemble models. If provided, all Stacked Ensembles will be trained using Blending instead of the default Stacking method.
If you want to override the default, randomized, 5-fold cross-validation scheme, you can use the fold_column argument to specify a column with cross-validation fold index assignment per observation.
The weights_column argument lets you specify a column with observation weights. This can be useful if you have some observations that are more important than others, or if you want to repeat certain rows.
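As a sketch of how these optional parameters fit together in Python (the file paths, column names, and the fold and weight columns are all hypothetical):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")             # placeholder path
leaderboard_df = h2o.import_file("holdout.csv")  # used only to rank models

aml = H2OAutoML(max_models=10, seed=1)
aml.train(
    x=["age", "income", "tenure"],     # hypothetical predictor subset
    y="churned",                       # hypothetical response column
    training_frame=train,
    fold_column="fold_id",             # overrides the default 5-fold CV scheme
    weights_column="row_weight",       # per-observation weights
    leaderboard_frame=leaderboard_df,  # scored for the leaderboard only
)
```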
Preprocessing
Preprocessing is a crucial step in preparing your data for machine learning models, and H2O AutoML has made significant improvements in this area.
As of H2O 3.32.0.1, AutoML now has a preprocessing option with minimal support for automated Target Encoding of high cardinality categorical variables.
This means that you can automatically tune a Target Encoder model and apply it to columns that meet certain cardinality requirements for tree-based algorithms like XGBoost, H2O GBM, and Random Forest.
The only currently supported option is preprocessing=["target_encoding"], which allows for this automated Target Encoding.
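A minimal Python sketch of enabling this option (the path and response name are placeholders):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")  # placeholder path

# "target_encoding" is the only preprocessing value supported today.
aml = H2OAutoML(
    max_models=10,
    preprocessing=["target_encoding"],
    seed=1,
)
aml.train(y="response", training_frame=train)  # "response" is a placeholder
```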
Work to improve the automated preprocessing support, including improved model performance and customization, is tracked in an H2O development ticket.
This is a significant advancement in AutoML, and it's exciting to see the potential for improved model performance and reduced manual effort in data preparation.
GLM Hyperparameters
During the AutoML grid search, GLM uses its own internal grid search rather than the H2O Grid interface: a single model with lambda_search enabled is built and passed a list of alpha values.
The alpha values searched over are 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
This approach allows AutoML to return a single model with the best alpha-lambda combination rather than one model for each alpha.
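This behavior can be approximated with a standalone GLM, as in the following Python sketch (not AutoML's literal internals; the path and response name are placeholders):

```python
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
train = h2o.import_file("train.csv")  # placeholder path

# One GLM, lambda_search enabled, a list of alphas scanned internally.
glm = H2OGeneralizedLinearEstimator(
    alpha=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],  # the values AutoML searches over
    lambda_search=True,  # finds the best lambda along the regularization path
    seed=1,
)
glm.train(y="response", training_frame=train)  # "response" is a placeholder
```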
Best of Both Worlds: Sparkling Water
Sparkling Water is a powerful solution that combines the vast H2O machine learning toolkit with the data processing capabilities of Spark.
It's ideal for users who need to manage large data clusters and want to transfer data between Spark and H2O.
Users can query big datasets using Spark SQL, feed the results into an H2O cluster to build a model, and make predictions.
This consolidation of frameworks offers even more flexibility in deploying machine learning algorithms with the existing Spark implementation.
Results from the H2O pipeline can easily be deployed independently or within Spark.
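A PySparkling-flavored sketch of that workflow follows. The table name and response column are hypothetical, and the exact H2OContext API varies by Sparkling Water version, so treat this as illustrative:

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext  # ships with Sparkling Water
from h2o.automl import H2OAutoML

spark = SparkSession.builder.appName("sparkling-water-sketch").getOrCreate()
hc = H2OContext.getOrCreate()  # attach an H2O cluster to this Spark app

# Query a big dataset with Spark SQL, then hand the result to H2O.
events = spark.sql("SELECT * FROM events WHERE year = 2023")  # hypothetical table
train = hc.asH2OFrame(events)  # convert the Spark DataFrame to an H2OFrame

aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="label", training_frame=train)  # "label" is a placeholder
```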
Arguments
When working with H2O AutoML, you'll need to specify the required data parameters and stopping parameters. The required data parameters are the response column, specified by name or index with the y argument, and the training set, specified by the training_frame argument.
To ensure reproducibility, set the max_models parameter (together with a fixed seed) rather than relying on a time limit. This parameter specifies the maximum number of models to build in an AutoML run, excluding the Stacked Ensemble models, and it guarantees that all models are trained until convergence rather than being cut off by a time budget.
You can also adjust the runtime of your AutoML process by specifying the max_runtime_secs argument. This argument specifies the maximum time that the AutoML process will run for, and the default is 0, indicating no limit.
One engine argument worth highlighting is validation; the other commonly used engine arguments are listed in the Automating ML section below. By reserving a proportion of your training data as a validation set, you get a more accurate picture of your model's performance and can make data-driven decisions about when to stop the AutoML process.
Training and Prediction
Training with H2O AutoML is a straightforward process. You can use the h2o.automl() function in R or the H2OAutoML class in Python to get started.
To train a model, you specify the x argument, which represents the set of predictors. You can list every column other than the response explicitly, or rely on the default value of x, which is "all columns, excluding y"; both produce the same result.
The training process involves searching, screening, and evaluating many models for a specific dataset. AutoML is particularly useful as an exploratory approach for identifying the model families and parameterizations most likely to succeed.
You can use agua's helper functions to quickly wrangle and visualize AutoML's results. For instance, you can use the rank_results() function to return the leaderboard in a tidy format with rankings within each metric.
Here's a summary of agua's helper functions for AutoML results:
- rank_results() returns the leaderboard in a tidy format with rankings within each metric.
- collect_metrics() returns average statistics of performance metrics (summarized) per model, or raw value for each resample (unsummarized).
- tidy() returns a tibble with performance and individual model objects.
- member_weights() computes member importance for all stacked ensemble models.
These functions can help you quickly assess and visualize the results of your AutoML search.
Automating ML
Automating ML is a game-changer in the world of machine learning. The process, referred to as AutoML, is now a standard feature across platforms such as Azure, Google Cloud, and others, and it lets several steps in an end-to-end ML pipeline be handled with minimal human intervention, without affecting the model's efficiency.
H2O's AutoML trains and cross-validates a default random forest, an extremely-randomized forest, a random grid of gradient boosting machines (GBMs), a random grid of deep neural nets, a fixed grid of GLMs, and then two stacked ensemble models at the end. One ensemble contains all the models (optimized for model performance), and the second ensemble contains just the best performing model from each algorithm class/family (optimized for production use).
Some of the steps where AutoML proves useful are data preprocessing tasks (augmentation, standardization, feature selection, etc.), automatic generation of various models (random forests, GBMs, etc.), and deployment of the best of those generated models.
To automate ML, you can use H2O's AutoML, which requires so few parameters that all you need to do is point to your dataset, identify the columns needed for predictions, and, if the pipeline has a time constraint, specify a limit on the total number of models trained.
Here are some of the most commonly used engine arguments for H2O AutoML:
- max_runtime_secs and max_models: Adjust runtime.
- include_algos and exclude_algos: A character vector naming the algorithms to include or exclude.
- validation: A number between 0 and 1 specifying the proportion of training data reserved as a validation set.
It's worth noting that at least one of these stopping strategies (time-based or number-of-model-based) must be specified. When both options are set, the AutoML run will stop as soon as it hits either limit, as the sketch below illustrates.
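In the Python API, the stopping limits and the algorithm filters map directly onto H2OAutoML constructor arguments (validation, by contrast, is an engine argument of the R agua interface). The path, response name, and excluded algorithms below are illustrative:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")  # placeholder path

# Stop at whichever limit is hit first, and skip two algorithm families.
aml = H2OAutoML(
    max_models=15,
    max_runtime_secs=900,
    exclude_algos=["DeepLearning", "StackedEnsemble"],  # or use include_algos
    seed=1,
)
aml.train(y="response", training_frame=train)  # "response" is a placeholder
```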
Prediction
Making predictions is a crucial step in the machine learning process.
Using the predict() function with AutoML generates predictions on the leader model from the run. The order of the rows in the results is the same as the order in which the data was loaded, even if some rows fail due to missing values or unseen factor levels.
You can generate test set predictions by calling predict() on the AutoML object or directly on its leader model, as in the sketch below.
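Continuing the training sketches above, where aml is a trained H2OAutoML object and the test path is a placeholder:

```python
import h2o

# `aml` is a trained H2OAutoML object from one of the runs above.
test = h2o.import_file("test.csv")  # placeholder path for your test set
preds = aml.predict(test)           # scores with the leader model from the run
print(preds.head())
```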
Frequently Asked Questions
Is H2O AutoML good?
Yes, H2O AutoML is a good choice for both novice and experienced data scientists, as it automates repetitive tasks and saves time for higher-level problem-solving. It's ideal for accelerating model development pipelines and freeing up time for more complex tasks.