Concepts of boosting algorithms in machine learning
What are the differences between Adaboost, gradient boosting and Newton boosting?
What is boosting?
In machine learning, boosting is an ensembling technique that combines a set of weak learners into a stronger learner with smaller bias and variance. A weak (strong) learner refers to a predictor, for example a classifier or a regressor, that is weakly (strongly) correlated with the target label.
Although boosting does not require a specific form of algorithm, most boosting algorithms learn the weak predictors iteratively and, at the end, combine all of them to form one strong predictor.
In mathematical form, it can be expressed as:
F_n(x) = \sum_{k=1}^{n} \alpha_k f_k(x)
where
F_n(x) is the final predictor at the n-th iteration,
f_k(x) is the weak predictor at the k-th iteration, and
\alpha_k is the weight associated with the k-th weak predictor, which is calculated differently depending on the boosting algorithm used.
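As a minimal sketch of this combination step (the weak learners, their weights, and the feature matrix X are placeholders assumed to come from some training procedure), the final prediction can be formed in Python as follows:

```python
import numpy as np

def boosted_predict(weak_learners, weights, X):
    """Combine already-trained weak learners into one boosted prediction.

    weak_learners: list of fitted models, each with a .predict(X) method
    weights: list of per-learner weights alpha_k (set by the boosting algorithm)
    X: feature matrix of shape (n_samples, n_features)
    """
    prediction = np.zeros(X.shape[0])
    for alpha_k, f_k in zip(weights, weak_learners):
        prediction += alpha_k * f_k.predict(X)
    return prediction
```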
In the following sections, we will explain how weak predictors and weights are determined for each boosting algorithm.
Adaboost
While there have been many developments in boosting algorithms, one of the earliest significant boosting algorithms is known as the Adaboost algorithm.
Learning weak predictors
The core idea of Adaboost is that, at each iteration, a weak predictor is fitted to a weighted dataset instead of the original dataset. The weight for each data point at a particular iteration is determined by its residual: typically, the larger the residual, the larger the weight given. This allows the weak predictors at later iterations to focus on correcting the prediction errors of more difficult samples.
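The loop below is a rough sketch of this reweighting idea, not the exact Adaboost update rule: decision stumps from scikit-learn stand in for the weak predictors, and the residual-based weight formula is a simplified illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_weighted_weak_learners(X, y, n_rounds=10):
    """Illustrative reweighting loop: each weak learner is fitted to a
    weighted dataset, and samples with larger residuals receive larger
    weights in the next round. Real Adaboost uses different formulas
    for both the sample weights and the learner weights."""
    n = len(y)
    sample_weights = np.full(n, 1.0 / n)          # start with uniform weights
    learners = []
    for _ in range(n_rounds):
        stump = DecisionTreeRegressor(max_depth=1)  # a weak learner
        stump.fit(X, y, sample_weight=sample_weights)
        residuals = np.abs(y - stump.predict(X))
        total = residuals.sum()
        if total > 0:
            # emphasise samples with larger residuals in the next round
            sample_weights = residuals / total
        learners.append(stump)
    return learners
```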
Shrinkage
As the number of iterations increases, more and more emphasis is placed on difficult samples, which makes the algorithm more susceptible to outliers.
One way to keep the algorithm stable against this is to employ a learning rate, also known as a shrinkage rate. The idea is that the contribution of the weak learners added at later iterations is shrunk by this rate, controlling the influence that later-fitted learners have on the overall prediction.
The mathematical formula for this can be written as:
F_n(x) = \sum_{k=1}^{n} w \, \alpha_k f_k(x)
where n is the number of boosting iterations and w is the shrinkage rate.
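Following the formula above (and assuming the weak learners and weights come from the earlier sketch), the shrinkage rate simply scales down each weak learner's contribution; the rate of 0.1 below is an arbitrary illustrative value.

```python
import numpy as np

def boosted_predict_with_shrinkage(weak_learners, weights, X, shrinkage=0.1):
    """Same combination as before, but each weak learner's contribution
    is scaled down by the shrinkage (learning) rate."""
    prediction = np.zeros(X.shape[0])
    for alpha_k, f_k in zip(weights, weak_learners):
        prediction += shrinkage * alpha_k * f_k.predict(X)
    return prediction
```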
Gradient boosting
Gradient boosting is a special kind of boosting algorithm that, at each iteration, fits a weak learner not to the residuals from previous iterations but to the negative gradients of the loss at that iteration. (Yes, even if you are training a classification model, at each iteration except the very first, the weak predictor is actually a regression model predicting negative gradients.)
In mathematical form, the update rule for the final predictor is
F_k(x) = F_{k-1}(x) + w_k f_k(x)
where, similar to above,
F_k(x) is the final predictor (not the weak predictor!) at the k-th iteration,
f_k(x) is the weak predictor at the k-th iteration, and
w_k is the shrinkage rate at the k-th iteration.
When the weak learner f_k(x) is trained on the negative gradients -\partial L(y, F_{k-1}(x)) / \partial F_{k-1}(x) as the target label, the update becomes approximately
F_k(x) \approx F_{k-1}(x) - w_k \, \nabla_F L(y, F_{k-1}(x))
This formula looks exactly like the iterative update of gradient descent optimisation. In fact, it can be shown that this update is mathematically equivalent to performing gradient descent in function space, hence the name gradient boosting. The approach generalises to any differentiable objective function, and popular open-source packages such as scikit-learn implement gradient boosting in various forms.
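To make this concrete, here is a minimal from-scratch sketch of gradient boosting for the squared-error loss, where the negative gradient happens to equal the residual y - F(x); the tree depth, shrinkage rate, and number of rounds are arbitrary choices, and scikit-learn's DecisionTreeRegressor stands in for the weak learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, shrinkage=0.1, max_depth=3):
    """Minimal gradient boosting for squared-error loss.
    For L = 0.5 * (y - F(x))^2, the negative gradient w.r.t. F(x) is y - F(x)."""
    init = y.mean()
    F = np.full(len(y), init)               # initial constant prediction
    trees = []
    for _ in range(n_rounds):
        negative_gradients = y - F          # residuals == negative gradients here
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, negative_gradients)     # regression tree fitted to gradients
        F += shrinkage * tree.predict(X)    # shrunk additive update
        trees.append(tree)
    return init, trees

def gradient_boost_predict(init, trees, X, shrinkage=0.1):
    """Apply the fitted ensemble to new data."""
    F = np.full(X.shape[0], init)
    for tree in trees:
        F += shrinkage * tree.predict(X)
    return F
```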
Newton boosting
Similar to gradient boosting, at each boosting iteration Newton boosting fits a weak predictor to a modified target label. Unlike gradient boosting, which uses only the negative gradients as the target label, it uses both the negative gradients and the hessians of the loss to perform the update at each iteration:
F_k(x) = F_{k-1}(x) + w_k f_k(x), \quad \text{with } f_k \text{ fitted to the targets } -g_i / h_i
where g_i and h_i are the gradient and hessian of the loss for data point i, so that approximately
F_k(x) \approx F_{k-1}(x) - w_k \, \frac{\nabla_F L(y, F_{k-1}(x))}{\nabla_F^2 L(y, F_{k-1}(x))}
Again, like gradient boosting, this update resembles Newton's method and corresponds to performing Newton's method in function space.
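As a sketch under the binary log loss (with labels in {0, 1}), where the gradient of the loss with respect to the raw score is p - y and the hessian is p(1 - p), one common way to implement the Newton step is to fit the weak learner to -g/h with the hessians as sample weights; the hyperparameters below are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def newton_boost_logloss(X, y, n_rounds=100, shrinkage=0.1, max_depth=3):
    """Sketch of Newton boosting for binary log loss (y in {0, 1}).
    The raw score F is boosted; the predicted probability is p = sigmoid(F).
    Per-sample gradient g = p - y, hessian h = p * (1 - p)."""
    F = np.zeros(len(y))                        # start from raw score 0
    trees = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-F))
        g = p - y
        h = p * (1.0 - p)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        # fit to the Newton step -g/h, weighting each sample by its hessian
        tree.fit(X, -g / np.maximum(h, 1e-12), sample_weight=h)
        F += shrinkage * tree.predict(X)
        trees.append(tree)
    return trees
```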
Gradient tree boosting
In practice, many of the most popular open-source tree-based gradient boosting libraries implement gradient boosting a bit differently.
In particular, like ordinary gradient boosting, at each boosting iteration, a tree is grown and added to the final predictor.
However, when growing a tree at each boosting round, the optimal split in each node is determined by maximising the loss reduction calculated by this formula:
\text{Gain} = \frac{1}{2} \left( \frac{G_L^2}{n_L} + \frac{G_R^2}{n_R} - \frac{(G_L + G_R)^2}{n_L + n_R} \right)
where
G_L = \sum_{j \in I_L} g_j and G_R = \sum_{j \in I_R} g_j are the sums of gradients over the data points in the left and right child nodes, n_L and n_R are the corresponding sample counts, and g_j is the gradient for data point j. This formula is derived by minimising a first-order Taylor approximation to the loss function.
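A small helper along these lines (the function name and interface are hypothetical) computes this gain from the per-sample gradients in a node, given a boolean mask describing a candidate left child:

```python
import numpy as np

def gradient_split_gain(gradients, left_mask):
    """Loss reduction of a candidate split in gradient tree boosting.

    gradients: per-sample gradients g_j for the data points in the current node
    left_mask: boolean array, True for samples going to the left child
    """
    g_left, g_right = gradients[left_mask], gradients[~left_mask]
    n_left, n_right = len(g_left), len(g_right)
    if n_left == 0 or n_right == 0:
        return 0.0                      # degenerate split, no gain
    total = gradients.sum()
    return 0.5 * (g_left.sum() ** 2 / n_left
                  + g_right.sum() ** 2 / n_right
                  - total ** 2 / len(gradients))
```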
Newton tree boosting
Similar to gradient tree boosting, one can define a more advanced boosting algorithm that also takes into account the hessian of the loss function.
In particular, when growing a tree at each boosting round, we instead use the second-order Taylor approximation to the loss function. All that needs to change is the loss-reduction criterion used for node splitting.
In other words, similar to gradient tree boosting, at each node split, the optimal split is the one giving the maximum loss reduction:
\text{Gain} = \frac{1}{2} \left( \frac{G_L^2}{H_L} + \frac{G_R^2}{H_R} - \frac{(G_L + G_R)^2}{H_L + H_R} \right)
where
G_L = \sum_{j \in I_L} g_j and H_L = \sum_{j \in I_L} h_j are the sums of gradients and hessians over the data points in the left child node (and similarly G_R and H_R for the right child), with g_j and h_j the gradient and hessian of the loss for data point j.
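The analogous helper for the Newton case simply swaps the sample counts for hessian sums; regularisation terms such as those used by XGBoost are omitted here.

```python
import numpy as np

def newton_split_gain(gradients, hessians, left_mask):
    """Loss reduction of a candidate split in Newton tree boosting
    (no regularisation terms included)."""
    g_l, g_r = gradients[left_mask].sum(), gradients[~left_mask].sum()
    h_l, h_r = hessians[left_mask].sum(), hessians[~left_mask].sum()
    if h_l <= 0 or h_r <= 0:
        return 0.0                      # avoid dividing by a zero hessian sum
    return 0.5 * (g_l ** 2 / h_l + g_r ** 2 / h_r
                  - (g_l + g_r) ** 2 / (h_l + h_r))
```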
Popular open-source packages such as LightGBM and XGBoost use Newton tree boosting (combined with additional regularisation) as their main boosting method.
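For reference, both libraries expose a scikit-learn-style interface; the hyperparameter values below are arbitrary, and only the estimator classes and the fit/predict convention are taken as given.

```python
# pip install xgboost lightgbm
import xgboost as xgb
import lightgbm as lgb

# gradient/Newton tree boosting with shrinkage (learning_rate) and tree-size limits
xgb_model = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
lgb_model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.1, num_leaves=31)

# both follow the usual fit/predict convention, e.g.:
# xgb_model.fit(X_train, y_train); xgb_model.predict(X_test)
# lgb_model.fit(X_train, y_train); lgb_model.predict(X_test)
```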
Summary
Understanding the fundamental concepts of boosting algorithms presented in this article is crucial for grasping the design choices behind advanced boosting algorithms like LightGBM and XGBoost.
These advanced algorithms often build upon the core principles of boosting explained here, but with additional optimisations on efficiency, scalability, and handling complex data structures. For instance, LightGBM and XGBoost leverage decision trees as weak learners and implement efficient algorithms for finding optimal splits within those trees. Their advancements lie in techniques like gradient-based decision tree learning, gradient-based sampling, and regularization to prevent overfitting.
By understanding the fundamental building blocks of boosting explored in this article, you can gain a deeper appreciation for the design philosophies behind cutting-edge boosting algorithms and how they achieve superior performance!