The GBM algorithm is a powerful tool in machine learning, and understanding its basics is crucial for leveraging its full potential. GBM stands for Gradient Boosting Machine, which is a type of ensemble learning algorithm.
GBM works by combining multiple weak models to create a strong predictive model. This is achieved through an iterative process where each subsequent model is trained to correct the errors of the previous model.
The GBM algorithm is particularly effective for handling complex data sets with many features, making it a popular choice in industries such as finance and healthcare. By using a combination of decision trees, GBM can handle non-linear relationships between variables.
GBM is also highly tunable, allowing users to adjust parameters such as the number of trees, learning rate, and maximum depth to optimize performance for specific tasks.
What Is the GBM Algorithm?
The Gradient Boosting Machine (GBM) algorithm helps minimize bias error in models. It works by building models sequentially, with each subsequent model trying to reduce the errors of the previous one.
The algorithm is based on the idea of adding weak learners to minimize a loss function. This loss function is different for regression and classification problems.
GBM is used for both regression and classification tasks, with the choice of algorithm depending on the type of problem. For continuous target columns, we use Gradient Boosting Regressor, while for classification problems, we use Gradient Boosting Classifier.
The key to GBM is its ability to reduce errors by building new models on the errors or residuals of the previous model. This is done using gradient descent to minimize the loss function.
The loss function used in GBM differs between the two settings: Mean Squared Error (MSE) is commonly used for regression, and log loss (the negative log-likelihood) for classification.
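To make the gradient descent idea concrete, here is the standard regression formulation in generic notation, not tied to any particular library. With squared-error loss, the negative gradient of the loss with respect to the current prediction is exactly the residual, and each boosting step adds a shrunken new learner fit to those residuals:

```latex
L(y, F) = \tfrac{1}{2}(y - F)^2, \qquad
-\frac{\partial L}{\partial F} = y - F, \qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
```

Here h_m is the weak learner fit to the residuals at step m, and nu is the learning rate.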
GBM Algorithm Basics
The Gradient Boosting Machine (GBM) has its roots in the 1990s with the development of the AdaBoost algorithm. AdaBoost was particularly successful in classification problems, and it paved the way for the typical GBM model we use today.
The GBM algorithm is a type of boosting algorithm that can be used for both classification and regression problems. In classification problems, the GBM algorithm uses a log-likelihood loss function, whereas in regression problems, it uses the mean squared error.
The GBM algorithm typically uses decision trees, especially CARTs, of a fixed size as base learners. This is because decision trees are simple to understand and interpret, and they can handle complex interactions between features.
What Is a Classifier?
A classifier is a type of machine learning model that predicts a discrete class label; in the examples here, the target column is binary.
In the context of the GBM algorithm, a classifier is used when the target column is binary.
The difference between a classifier and a regressor lies in the loss function used: classifiers use log loss (based on log-likelihood) instead of Mean Squared Error.
The GBM algorithm initializes the model with a constant value, which is log(odds) for classifiers.
This constant value is used as a starting point for the gradient boosting process.
The use of log(odds) as the initial constant value is a deliberate choice that helps the model converge to a solution.
By using log(odds), the model can effectively handle binary target columns and produce accurate predictions.
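Concretely, if p is the proportion of positive examples in the training data, the classifier's initial constant in the standard GBM formulation is:

```latex
F_0(x) = \log\left(\frac{p}{1-p}\right)
```

For example, with 60 positives and 40 negatives, F0(x) = log(60/40) ≈ 0.405.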
Base Model
The base model is a crucial step in the GBM algorithm. It's built by taking the average of the target column in the training dataset.
This is done for simplicity; mathematically, it amounts to finding the predicted value (gamma) that minimizes the loss function, where the loss is the squared difference between each observed value (yi) and the predicted value (gamma).
The loss is minimized when gamma equals the mean of the target column (14500 in this example), which becomes the prediction for the base model.
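In symbols, with squared-error loss the minimizing constant is the sample mean of the target, which is why the average price of 14500 is the base prediction:

```latex
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} (y_i - \gamma)^2 = \bar{y}
```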
Model Output
The GBM model output is a crucial aspect of understanding how the algorithm works.
The output of a GBM model includes information on the number of trees or iterations used in its execution. This is a key factor in determining the model's complexity and accuracy.
You can also use the summary function to produce a variable importance table and an accompanying plot of the model. The table ranks individual variables based on their relative influence, with higher values indicating greater importance.
In GBM modelling, variable importance is a vital feature: it shows at a glance which predictor variables drive the model and which had zero influence.
The Titanic Model is a great example of how Variable Importance can be used to identify key variables in a GBM model. In this case, Cabin and Sex were found to be the most important variables.
How the GBM Algorithm Works
The GBM algorithm is a powerful ensemble learning method. Rather than relying on a single predictive model, it combines many weak models, typically decision trees, to improve accuracy.
For classification, the algorithm starts by rewriting the loss function as a function of log(odds), which makes the calculations easier. This transformation is not compulsory, but it simplifies the derivation.
A decision tree is the base learner used in GBM. The value that minimizes the transformed loss function becomes the first prediction, also known as the base model prediction.
The GBM algorithm works by adding a new decision tree to the base model, which is trained on the residuals of the previous predictions. This process is repeated multiple times to improve the accuracy of the model.
The new predictions are made by combining the predictions from the individual base models, including the new decision tree. This is where the magic of GBM happens: because each tree corrects the errors of the ensemble before it, GBM works best with high-bias, low-variance base learners, which is exactly what shallow decision trees are.
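To make the loop concrete, here's a minimal from-scratch sketch of gradient boosting for regression with squared-error loss, where the negative gradient is simply the residual. The function names are our own, for illustration only:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Base model: a constant prediction, the mean of the target column.
    f0 = y.mean()
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                      # negative gradient of MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                    # fit the next tree to the errors
        pred += learning_rate * tree.predict(X)   # shrink and add its output
        trees.append(tree)
    return f0, trees

def predict_gbm(f0, trees, X, learning_rate=0.1):
    # Combine the base prediction with every tree's shrunken contribution.
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```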
GBM Algorithm Implementation
The gbm package in R has two training functions: gbm::gbm() and gbm::gbm.fit(). The primary difference is that gbm::gbm() uses the formula interface to specify your model, whereas gbm::gbm.fit() requires separate x and y matrices; gbm::gbm.fit() is more efficient and recommended for advanced users.
The default settings in gbm include a learning rate (shrinkage) of 0.001, which is a very small learning rate and typically requires a large number of trees to sufficiently minimize the loss function. We start with a learning rate of 0.1 and increase the number of trees to train.
The default depth of each tree (interaction.depth) is 1, which means we are ensembling a bunch of decision stumps and are not able to capture any interaction effects. For the Ames housing data set, we increase the tree depth to 3 and use the default value for minimum number of observations required in the trees terminal nodes (n.minobsinnode).
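For readers working in Python rather than R, a rough scikit-learn analogue of those settings might look like the sketch below. The mapping (shrinkage to learning_rate, interaction.depth to max_depth, n.minobsinnode to min_samples_leaf) is approximate, and the number of trees shown is an illustrative placeholder to be tuned:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Rough analogue of the gbm settings above: shrinkage 0.1 -> learning_rate,
# interaction.depth 3 -> max_depth, n.minobsinnode -> min_samples_leaf.
model = GradientBoostingRegressor(
    n_estimators=3000,    # illustrative; tune with cross-validation
    learning_rate=0.1,
    max_depth=3,
    min_samples_leaf=10,  # gbm's default n.minobsinnode is 10
)
```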
Update Previous Predictions
In the GBM algorithm, updating previous predictions is a crucial step in achieving accurate results.
The process involves updating the predictions of the previous model: the output of the new decision tree, multiplied by the learning rate, is added to the previous prediction, Fm-1(x).
In our example, for the first tree (m = 1) the previous prediction Fm-1(x) is just F0(x), the output of the base model, which is 14500.
We'll iterate through these steps again and again until the loss is negligible, refining our predictions with each new decision tree.
For instance, if a new data point comes in with a car height of 1.40, it will go through all the trees and produce a final output, F2(x), based on the two trees we've created so far.
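As a toy illustration of the update rule, the base prediction of 14500 comes from the example above, but the tree outputs here are made-up numbers:

```python
learning_rate = 0.1
f0 = 14500                                  # base model: mean of the target
tree1_output = -1000                        # hypothetical leaf value for this point
f1 = f0 + learning_rate * tree1_output      # F1(x) = 14400.0
tree2_output = -800                         # hypothetical output of the second tree
f2 = f1 + learning_rate * tree2_output      # F2(x) = 14320.0
```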
Scikit-learn Implementation
To implement the GBM algorithm using scikit-learn, you'll need to import the required libraries. The task at hand is to classify the income of an individual based on given inputs about their personal life.
First, you'll need to divide your data into train and test sets. This is a crucial step in training and testing your model. The data should be scaled to lie between 0 and 1, which will help improve the model's performance.
You can define the Gradient Boosting Classifier along with its hyperparameters. One hyperparameter to consider is max_depth, which controls the complexity of the individual decision trees, just as it does for standalone decision trees and random forests.
Next, you'll fit this model into the training data. The confusion matrix will give you a report of the number of classifications and misclassifications. In the example given, the model had 1334 misclassifications compared to 8302 correct classifications.
The accuracy of the model is 86%, which is pretty good but can be improved by tuning the hyperparameters or processing the data to remove outliers. This gives you a basic idea of how gradient boosting works and its underlying principles.
You can also tune the max_depth parameter to improve the model's performance.
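Putting those steps together, a minimal sketch might look like this, assuming the income data has already been loaded into a feature matrix X and binary labels y (the exact column handling will differ):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Assumes X (numeric features) and y (binary income labels) are already loaded.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scale the features to lie between 0 and 1, as described above.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the classifier with a few illustrative hyperparameters.
clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # classifications vs. misclassifications
print(accuracy_score(y_test, y_pred))    # e.g. ~0.86 in the example above
```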
Assessing Results
A confusion matrix is a really useful tool for comparing a model's actual values with its predicted values. It reveals how confused your model is between the two classes.
The columns of the confusion matrix are the true classes, while the rows are the predicted classes. You can create a confusion matrix using the ifelse() function to "cut" your predicted probabilities at a given threshold.
Our model correctly predicted 118 out of 141 observations, giving us an accuracy of 83.69%. The 23 cases it got wrong were split between false negatives (9) and false positives (14).
Our model has an AUC of 0.8389. The closer the ROC curve comes to the upper left corner in the grid, the better it is.
The upper left corner corresponds to a true positive rate of 1 and a false positive rate of 0. Good classifiers have bigger areas under the curves.
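The R workflow above uses ifelse() to threshold the probabilities; an equivalent sketch in scikit-learn terms, assuming a fitted classifier clf and test data as in the earlier example, might look like this:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Predicted probabilities for the positive class.
proba = clf.predict_proba(X_test)[:, 1]

# "Cut" the probabilities at a 0.5 threshold to get hard class labels.
y_pred = (proba >= 0.5).astype(int)

print(confusion_matrix(y_test, y_pred))  # correct vs. wrong calls per class
print(roc_auc_score(y_test, proba))      # area under the ROC curve
```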
GBM Algorithm Techniques
AdaBoost updates the weights by increasing the weights of incorrectly classified samples in each iteration, so that the next weak learner focuses more on these samples.
Gradient Boosting, on the other hand, does not reweight samples; instead, it fits each new weak learner to the negative gradient of the loss function with respect to the current predicted output.
Here are some key differences between AdaBoost and Gradient Boosting:

- AdaBoost reweights training samples at each iteration; Gradient Boosting fits each new weak learner to the negative gradient of the loss.
- AdaBoost is more susceptible to noise and outliers in the data, as it assigns high weights to misclassified samples; Gradient Boosting is generally more robust.
- AdaBoost lets you change the base estimator according to your needs; classic Gradient Boosting implementations typically use decision trees as the base learner.

XGBoost extends Gradient Boosting with additional regularization hyperparameters that provide added protection against overfitting, and it implements early stopping to halt training when additional trees offer no improvement. It also provides a wide range of base learners, including decision trees and linear models, and supports GPU and Spark compatibility for parallel processing.
Feature Importance Ranking
Feature Importance Ranking is a key aspect of the GBM algorithm, allowing us to understand which variables are most influential in our model. By aggregating the importance function of the base learners, we can obtain a ranking of the features.
The tree base learners in GBM rank feature importance by how much each feature improves the split criterion (an impurity measure) in the individual trees. This ranking is averaged out over all base learners, providing a comprehensive view of the most important features.
To calculate feature importance, the xgboost implementation provides three built-in measures, listed here for easy reference:

- Gain: the most common model-centric metric, equivalent to the impurity-based measure used in random forests; it captures the improvement a feature brings to the splits in which it appears.
- Coverage: quantifies the relative number of observations influenced by a feature.
- Frequency: represents the relative number of times a feature occurs in the trees of the model.
By examining the top 10 influential features in our final model using the Gain metric, we can see very similar results to those obtained with a random forest model.
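In scikit-learn, an analogous impurity-based ranking is exposed on any fitted gradient boosting model. A minimal sketch, assuming a fitted model clf and a feature_names list (both carried over from the earlier example):

```python
import numpy as np

# Impurity-based importances, averaged over all trees in the ensemble.
importances = clf.feature_importances_

# Print the top 10 features, most influential first.
for idx in np.argsort(importances)[::-1][:10]:
    print(feature_names[idx], round(float(importances[idx]), 4))
```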
AdaBoost Algorithm
The AdaBoost Algorithm is a boosting method that starts by building a decision stump and then assigning equal weights to all the data points.
It increases the weights for all the points that are misclassified and lowers the weight for those that are easy to classify or are correctly classified. This is done to improve the predictions made by the first stump.
A new decision stump is made for these weighted data points, and the process is repeated until the errors are minimized and the dataset is predicted correctly.
The main difference between AdaBoost and other gradient boosting techniques is that in AdaBoost, we can change the base estimator according to our needs, whereas in other techniques, the base estimator is fixed, such as decision trees.
In AdaBoost, the probability of selecting a wrongly classified observation increases, so the observations that were misclassified by the previous model are the ones most likely to be chosen for the next one.
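A minimal scikit-learn sketch of this procedure, where the stump depth and estimator count are illustrative choices and X_train/y_train are assumed to be prepared already (note that the estimator argument was called base_estimator in scikit-learn versions before 1.2):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Depth-1 trees (decision stumps) are the classic AdaBoost base estimator;
# the estimator argument is what lets you swap in a different one.
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=42)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))
```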
Dropout
Dropout is a technique that can help reduce overfitting in GBMs, and it's loosely related to regularization.
Dropout can be used to address overfitting in GBMs by randomly dropping trees in the boosting sequence, which is commonly referred to as DART.
DART stands for Dropout Additive Regression Trees, and it was initially explored in the context of Multiple Additive Regression Trees (MART).
The percentage of dropouts is another regularization parameter that can be tuned to control overfitting.
Typically, if gamma, alpha, or lambda can't help with overfitting, exploring DART hyperparameters would be the next best option.
Dropout was developed by Srivastava et al. in 2014 and has been widely employed in deep learning to prevent deep neural networks from overfitting.
DART is an alternative approach to reduce overfitting, and it's been shown to improve model performance by preventing the first few trees from dominating the model.
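In the xgboost Python package, DART is enabled through the booster parameter; a minimal sketch, with illustrative values to tune rather than recommendations:

```python
import xgboost as xgb

# booster="dart" switches on Dropout Additive Regression Trees;
# rate_drop is the fraction of trees randomly dropped in each boosting round.
model = xgb.XGBRegressor(
    booster="dart",
    rate_drop=0.1,       # illustrative dropout percentage
    n_estimators=500,
    learning_rate=0.05,
)
```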
GBM Algorithm Examples
The Gradient Boosting Machine (GBM) algorithm is a powerful tool in machine learning, and it's useful to see it in action. The car price walk-through earlier showed the mechanics step by step; the examples below show complete implementations.
To implement the GBM algorithm, you'll need to import the necessary libraries, set a seed for reproducibility, and load your dataset. This is shown in Examples 1 and 2.
The GBM algorithm works by iteratively adding trees to the model, with each tree attempting to correct the errors made by the previous trees. This is where the learning rate (nu) comes in: it is usually set between 0 and 1 to control the effect each tree has on the final prediction.
Example 1 uses a learning rate of 0.1, which is a common choice (and scikit-learn's default). The GBM algorithm can be used for both classification and regression tasks, as Examples 1 and 2 show.
Here are some common steps to implement the GBM algorithm:
- Import necessary libraries
- Set seed for reproducibility
- Load dataset
- Instantiate GBM model and fit the data
- Predict on test set and compute accuracy or RMSE
Example 1: Classification
To get started with Gradient Boosting Machine (GBM) classification, let's take a look at a simple example.
The first step is to import the necessary libraries, which in this case is scikit-learn.
Setting a seed for reproducibility is crucial to ensure that the results are consistent. In this example, the seed is set to 42.
Loading the digit dataset and splitting it into train and test sets is the next step.
To instantiate a Gradient Boosting classifier, we need to specify the number of estimators, learning rate, and other parameters. In this example, the classifier is instantiated with 100 estimators and a learning rate of 0.1.
The classifier is then fitted to the training data, which is the next step in the process.
Finally, the test set is predicted, and the accuracy score is computed.
Here's a summary of the steps involved:
- Import necessary libraries
- Set SEED for reproducibility
- Load the digit dataset and split it into train and test
- Instantiate Gradient Boosting classifier and fit the model
- Predict the test set and compute the accuracy score
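A minimal sketch of those steps, using the parameters described above (the 80/20 split is our own choice):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

SEED = 42  # set a seed for reproducibility

# Load the digits dataset and split it into train and test sets.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED)

# Instantiate the classifier with 100 estimators and a 0.1 learning rate.
gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=SEED)
gbc.fit(X_train, y_train)

# Predict the test set and compute the accuracy score.
y_pred = gbc.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```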
Example 2: Regression
Let's dive into an example of using the GBM algorithm for regression. We'll be using the diabetes dataset, which is a classic example of a regression problem.
We'll start by importing the necessary libraries, including scikit-learn, which provides the implementation of the GBM algorithm.
To ensure reproducibility, we'll set a seed for the random number generator. This is a good practice when working with machine learning algorithms.
Next, we'll load the diabetes dataset and split it into a training set and a test set. This will allow us to evaluate the performance of our model on unseen data.
Now, let's instantiate a Gradient Boosting Regressor and fit the model to the training data. We'll use the default parameters for this example.
After training the model, we'll use it to make predictions on the test set. We'll then compute the Root Mean Squared Error (RMSE) to evaluate the performance of our model.
Here's a summary of the steps we've taken:
- Imported necessary libraries
- Set seed for reproducibility
- Loaded diabetes dataset and split it into train and test
- Instantiated Gradient Boosting Regressor and fit the model
- Predicted on test set and computed RMSE
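A minimal sketch of those steps with default model parameters (the split ratio and seed value are our own choices):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

SEED = 42  # seed for reproducibility

# Load the diabetes dataset and split it into train and test sets.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED)

# Instantiate a Gradient Boosting Regressor with default parameters and fit it.
gbr = GradientBoostingRegressor(random_state=SEED)
gbr.fit(X_train, y_train)

# Predict on the test set and compute the RMSE.
y_pred = gbr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
```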
GBM Algorithm Advantages and Disadvantages
The Gradient Boosting Machine (GBM) algorithm has its advantages and disadvantages. It can increase the accuracy of a base learner, such as a decision tree or linear regression.
However, this comes at the cost of intelligibility and interpretability. Following the paths of hundreds or thousands of trees is much harder than tracing the decision-making process of a single tree.
GBM is widely used for classification tasks and is a canonical example of ensemble learning. It's a popular choice for its ability to improve model performance, but it can be harder to deploy because of its higher computational demand.
To achieve both performance and interpretability, some model compression techniques can transform an XGBoost into a single "born-again" decision tree that approximates the same decision function.
GBM Algorithm Hyperparameters
The GBM algorithm has a variety of hyperparameters that can be tuned to improve its performance. The two main boosting hyperparameters include the number of trees and the learning rate.
The number of trees is the total number of trees in the sequence or ensemble, and it can range from a few hundred to many thousands. In regression, GBMs will chase residuals as long as you allow them to, so it's essential to find the optimal number of trees that minimize the loss function of interest with cross-validation.
The learning rate determines the contribution of each tree on the final outcome and controls how quickly the algorithm proceeds down the gradient descent. Typical values range from 0.001 to 0.3, with smaller values making the model more robust to the specific characteristics of each individual tree.
GBM models also have tree-specific hyperparameters, including tree depth and the minimum number of observations in terminal nodes. Tree depth controls the depth of the individual trees, with typical values ranging from 3 to 8. Smaller depth trees are computationally efficient but require more trees, while higher depth trees allow the algorithm to capture unique interactions but increase the risk of over-fitting.
Here's a summary of the main GBM hyperparameters:
- Number of trees: 100-10,000
- Learning rate: 0.001-0.3
- Tree depth: 3-8
- Minimum number of observations in terminal nodes: 5-15
Stochastic hyperparameters, such as subsampling rows and columns, can also be used to improve the performance of GBM models. However, these hyperparameters are more data-dependent and require careful tuning.
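A hedged sketch of searching over these hyperparameters with cross-validation in scikit-learn: the grid values are illustrative starting points drawn from the ranges above, and X_train/y_train are assumed prepared:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid spanning the main boosting, tree-specific, and
# stochastic hyperparameters discussed above.
param_grid = {
    "n_estimators": [500, 1000, 3000],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 8],           # tree depth
    "min_samples_leaf": [5, 15],      # min observations in terminal nodes
    "subsample": [0.5, 0.8],          # stochastic row subsampling
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)
```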
Hyperparameters
GBM models contain two categories of hyperparameters: boosting hyperparameters and tree-specific hyperparameters. The boosting hyperparameters are crucial in determining the model's performance.
The number of trees in a GBM model is a key hyperparameter that controls the model's complexity. Too many trees can lead to overfitting, but it's not uncommon to have thousands of trees. We must find the optimal number of trees that minimize the loss function with cross-validation.
The learning rate, also known as shrinkage, determines the contribution of each tree to the final outcome. Values range from 0–1, with typical values between 0.001–0.3. Smaller values make the model robust to the characteristics of each individual tree, but they also increase the risk of not reaching the optimum within a fixed number of trees.
Tree depth is another crucial hyperparameter that controls the complexity of individual trees. Typical values range from 3–8, but smaller depth trees like decision stumps are computationally efficient.
The minimum number of observations in terminal nodes also controls the complexity of each tree. Typical values range from 5–15, with higher values helping to prevent overfitting.
Subsampling of rows before creating each tree can be beneficial, with typical values ranging between 0.5–0.8. Subsampling of columns and its impact on performance largely depends on the data, with higher values tending to perform better in noisy data and lower values performing well in data with many relevant predictors.
To recap, the main boosting hyperparameters are the number of trees and the learning rate, while the main tree-specific hyperparameters are tree depth and the minimum number of observations in terminal nodes; typical ranges are summarized in the list earlier in this section.
Stochastic Hyperparameters
Stochastic hyperparameters are a crucial aspect of the GBM algorithm, and they can significantly impact its performance. They add an extra layer of randomness to the algorithm, which can help prevent overfitting and improve prediction accuracy.
One way to introduce stochasticity is by subsampling rows before creating each tree. This is available in gbm, h2o, and xgboost, and it's generally beneficial to subsample aggressively, using 50% or less of the training data.
Subsampling columns before creating each tree is also an option, available in h2o and xgboost. However, the impact of this on performance largely depends on the nature of the data and whether there's strong multicollinearity or noisy features.
In cases with fewer relevant predictors, higher values of column subsampling tend to perform better. Conversely, when there are many relevant predictors, lower values of column subsampling tend to perform well.
Here's a summary of the stochastic hyperparameters:

- Subsampling of rows before creating each tree: typical values 0.5–0.8; available in gbm, h2o, and xgboost.
- Subsampling of columns before creating each tree: highly data-dependent; available in h2o and xgboost.
Friedman observed a substantial improvement in gradient boosting's accuracy by introducing subsampling, and found that 0.5 ≤ f ≤ 0.8 leads to good results for small and moderate-sized training sets.
Frequently Asked Questions
What is the difference between XGBoost and GBM?
XGBoost and GBM differ chiefly in how they handle missing values: XGBoost can learn default split directions from missing data natively, which makes it more flexible, and its optimized implementation makes it faster. This difference matters in real-world data analysis, where missing values are common.