Tree-Based Supervised Learning Methods

Alysson Guimarães
15 min read · Feb 14, 2024


(This is a translated version of my Brazilian Portuguese article; the original is available here.)

Introduction

In this article, we will discuss tree-based methods: what decision trees are, how they are used for regression and classification, how the algorithm works, what an ensemble is, and some models built on this idea, such as bagging, random forests, boosting, and BART.

(Image: Tenor | Reproduction)

Decision Trees

Decision trees are non-parametric supervised learning methods for regression and classification. The goal is to create a model that predicts the value of a dependent variable by learning decision rules inferred from the features. This method has several advantages over others, but also some disadvantages.

Among the advantages:

  - They are simple to understand and interpret, and the trees can be visualized graphically.
  - They require little data preparation. Other techniques may require data normalization, creation of dummy variables, and removal of null/blank values. (The scikit-learn implementation still does not handle null values.)
  - The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train it.
  - They can handle both numerical and categorical data (although the scikit-learn implementation does not yet support categorical variables) and problems with multiple outputs.
  - They are a “white box” model, meaning that observed situations are easily explained by boolean logic.
  - It is possible to validate a model using statistical tests, which makes it possible to assess the reliability of the model.
  - They perform well even if their assumptions are somewhat violated by the true model from which the data were generated.

Some disadvantages concern the tree-building process itself:

  - It can create overly complex models that do not generalize well, that is, overfitting. Mechanisms such as pruning, setting a minimum number of samples per node, or setting a maximum depth for the tree are necessary to prevent this problem.
  - Decision trees can be unstable, because small variations in the data can result in a completely different tree being generated. This can be mitigated by using decision trees within an ensemble.
  - Decision tree predictions are neither smooth nor continuous, but piecewise constant approximations, so they are not good at extrapolation.
  - Decision tree learning is based on heuristic algorithms, such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee the globally optimal decision tree, but this can be mitigated by training multiple trees in an ensemble learner, where features and samples are randomly sampled with replacement.
  - There are concepts that decision trees do not express easily and are therefore hard for them to learn, such as XOR, parity, or multiplexer problems.
  - Decision tree learners create biased trees if some classes dominate, that is, if the data is imbalanced. It is therefore recommended to balance the dataset before fitting the decision tree.
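To make this concrete, here is a minimal sketch of fitting and inspecting a decision tree with scikit-learn. The dataset and hyperparameter values are illustrative assumptions, not part of the original article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative dataset; any tabular X, y would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_depth and min_samples_leaf limit tree growth to reduce overfitting.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree))            # "white box": the learned decision rules
print(tree.score(X_test, y_test))   # accuracy on held-out data
```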

1.1 Regression

The process of building a regression tree is divided into two steps. First, the predictor space is divided into J distinct and non-overlapping regions, R1, R2, …, RJ. Then, for every observation that falls in region Rj, a prediction is made whose value is (usually) the mean of the response values of the training observations in region Rj.

Imagine that you are doing a regression of workers’ salaries based on years of experience and years of study.

Figure 1.1. Tree-based Method

The figure above shows a regression tree fitted to this data. It consists of a set of splitting rules (splits), starting at the top of the tree. The first split at the top of the tree assigns workers with less than 10 years of experience to the left branch. The predicted salary for these workers is the average salary of workers with less than 10 years of experience. Workers with ten or more years of experience are assigned to the right branch, and are then further subdivided by years of study.

In general, the tree stratifies or segments workers into three regions of predictor space: workers with less than 10 years of experience; workers with ten or more years of experience and without higher education (Study < 12); and workers with ten or more years of experience and with higher education (Study >= 12).

These three regions can be written as

R1 = {X | Experience < 10},
R2 = {X | Experience >= 10, Study < 12} and
R3 = {X | Experience >= 10, Study >= 12}

Figure 1.2. Three tree partition regions

These three regions R1, R2 and R3 are called terminal nodes or leaves of the tree. The points along the tree where the predictor space is split are called internal nodes. In the fictitious example of figure 1.1, the two internal nodes are indicated by Experience < 10 and Study < 12. The segments of the tree that connect the nodes are called branches. One interpretation of the graph in figure 1.1 is that Experience is the most important factor in determining salary, and that workers with fewer years of study earn less than those with more.

The objective is to find the regions R1, …, RJ that minimize the RSS, which is given by:

RSS = Σ (j=1 to J) Σ (i ∈ Rj) (yi − ŷRj)²

where ŷRj is the mean response of the training observations in the jth partition.

It is computationally infeasible to consider every possible partition of the predictor space, so a top-down, greedy approach called recursive binary splitting is used. It is top-down because it starts at the top of the tree, where all observations belong to a single region, and then successively splits the predictor space; each split creates two new branches further down the tree. It is greedy because at each step of the tree-building process, the best split is made at that particular step, without looking ahead to splits that might lead to a better tree later on.

To perform recursive binary splitting, we first select the predictor Xj and the cutpoint s such that splitting the predictor space into the regions {X | Xj < s} and {X | Xj >= s} leads to the greatest possible reduction in RSS. All predictors X1, …, Xp and all possible cutpoints s are considered. More precisely, for each j and s we define the pair of half-planes

R1 (j, s) = {X | Xj < s} and
R2 (j, s) = {X | Xj ≥ s}

and seek the values of j and s that minimize

Σ (i: xi ∈ R1(j,s)) (yi − ŷR1)² + Σ (i: xi ∈ R2(j,s)) (yi − ŷR2)²

where ŷR1 is the mean response of the training observations in R1(j, s), and ŷR2 is the mean response of the training observations in R2(j, s). Thus, we find the values of j and s that minimize the RSS.

The process is then repeated, looking for the best predictor and the best cutpoint in order to split the data further and minimize the RSS within each region. However, instead of splitting the entire predictor space, we split one of the two previously identified regions, resulting in three regions. Again, one of these three regions is split so as to minimize the RSS. The process continues until some stopping criterion is reached, for example, until a maximum of ten, fifty, or a thousand regions has been created.
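As an illustration of a single greedy split, the sketch below scans candidate cutpoints for each predictor and picks the (j, s) pair that minimizes the RSS of the two resulting half-planes. The toy data and the brute-force search are illustrative assumptions; real implementations are far more optimized.

```python
import numpy as np

def best_split(X, y):
    """Find the (feature, cutpoint) pair that minimizes the RSS of the two half-planes."""
    best = (None, None, np.inf)  # (j, s, rss)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            # RSS of each region, using the region mean as the prediction.
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

# Toy data: salary as a function of experience and years of study (illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(200, 2))
y = 30 + 2 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 2, 200)
print(best_split(X, y))
```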

Once the regions R1, R2, …, RJ are created, the response for a test observation is predicted using the mean of the training observations in the region Rj to which that test observation belongs.

The process described so far may produce good results on the training dataset, but it will probably overfit the data, resulting in poor performance on the test dataset, because the resulting tree is too complex. A smaller tree with fewer partitions and regions leads to lower variance and better interpretation, at the cost of a little bias.

An alternative is to grow the tree only as long as the reduction in RSS from each split exceeds some (relatively high) threshold, resulting in smaller trees.

Through cost complexity pruning, also known as weakest link pruning, we can prune the tree back to a suitable size. Instead of considering every possible subtree, we consider a sequence of trees indexed by a non-negative tuning parameter alpha (α). For each value of α there corresponds a subtree T ⊂ T0 such that

Σ (m=1 to |T|) Σ (i: xi ∈ Rm) (yi − ŷRm)² + α|T|

is as small as possible. Here |T| indicates the number of terminal nodes of the tree T, Rm is the subset of the predictor space (the rectangles in figure 1.2) corresponding to the mth terminal node, and ŷRm is the predicted response associated with Rm, namely the mean of the training observations in Rm. The tuning parameter α controls the trade-off between the complexity of a subtree and its fit to the training data. When α = 0, the subtree T is simply T0, because the criterion then only measures the training error. The higher the α, the more branches are pruned and the smaller the subtree becomes, so it is easy to obtain a sequence of subtrees as a function of α. We can therefore select a value of α using a validation set or cross-validation, then return to the full dataset and obtain the subtree corresponding to that α. The regression tree building process can be summarized as:

  1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
  2. Apply cost complexity pruning to the large tree to obtain a sequence of best subtrees as a function of alpha.
  3. Use K-fold cross-validation to choose alpha. That is, divide the training observations into K folds and, for each fold k = 1, …, K:
    3.1. Repeat steps 1 and 2 on all but the kth fold of the training data.
    3.2. Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of alpha.
    3.3. Average the results for each value of alpha, and choose the alpha that minimizes the average error.
  4. Return the subtree from step 2 that corresponds to the chosen value of alpha.
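A minimal sketch of this procedure with scikit-learn, which exposes cost complexity pruning through the ccp_alpha parameter and the cost_complexity_pruning_path method. The dataset, the subsampling of the alpha path, and the number of folds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)  # illustrative regression dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-2: grow a large tree and get the sequence of effective alphas.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas[::10]  # subsample the path to keep the search cheap

# Step 3: K-fold cross-validation over alpha (negated MSE, so higher is better).
scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                          X_train, y_train, cv=5,
                          scoring="neg_mean_squared_error").mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]

# Step 4: refit on the full training set with the chosen alpha.
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
print(best_alpha, final_tree.score(X_test, y_test))
```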

1.2 Classification

Classification trees are very similar to regression trees, but they are used to predict qualitative outcomes. In a classification tree, the predicted value for an observation is the most common class among the training observations in the region to which it belongs.

Just like in regression trees, we use recursive binary splitting to build the tree, but RSS cannot be used as the criterion for making the splits. Alternatives are the classification error rate, the Gini index, or entropy.

Since an observation in a given region is assigned to the most common class of training observations in that region, the classification error rate is the fraction of training observations in that region that do not belong to that most common class:

E = 1 − max_k (p̂mk)

Here p̂mk represents the proportion of training observations in the mth region that belong to the kth class. However, the classification error is not sensitive enough for growing the tree, and in practice the Gini index or entropy are preferable.

The Gini index is a measure of the total variance across the K classes:

G = Σ (k=1 to K) p̂mk (1 − p̂mk)

It is a measure of node purity: a small value indicates that the node contains observations predominantly from a single class.

In information theory, entropy is a measure of impurity or uncertainty in a group of observations. In a classification tree, it drives the choice of data partition through the information gain: the greater the randomness in the data, the greater the entropy and the lower the information gain. It is defined as

D = −Σ (k=1 to K) p̂mk log(p̂mk)

Just like the Gini index, entropy takes a small value if the mth node is pure.

When building a classification tree, the Gini index and entropy are generally used to evaluate the quality of a split, as these two measures are more sensitive to node purity than the classification error rate. Any of the three approaches can be used when pruning the tree, but if prediction accuracy of the final tree is the goal, the classification error rate is preferable.
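A small sketch of these impurity measures, plus how they map onto scikit-learn's criterion parameter. The example class proportions are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(p):
    """Gini index: sum over classes of p_k * (1 - p_k)."""
    p = np.asarray(p)
    return np.sum(p * (1 - p))

def entropy(p):
    """Entropy: -sum over classes of p_k * log(p_k), ignoring empty classes."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A nearly pure node vs. a mixed node (illustrative proportions).
print(gini([0.9, 0.1]), entropy([0.9, 0.1]))   # small values: purer node
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # larger values: impure node

# In scikit-learn the splitting criterion is chosen directly:
clf_gini = DecisionTreeClassifier(criterion="gini")
clf_entropy = DecisionTreeClassifier(criterion="entropy")
```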

2 Ensemble

The methods discussed below are ensemble methods. Ensemble learning is an approach that combines several weak learners, models that on their own yield only mediocre predictions, with the aim of obtaining a single, much more powerful prediction.

2.1 Bagging

Bootstrap aggregation or bagging is a general-purpose procedure for reducing the variance of statistical learning methods.

Given a dataset with n independent observations Z1, …, Zn, each with variance σ², the variance of the mean of the observations is σ²/n. In other words, averaging a set of observations reduces variance. Therefore, one way to increase test accuracy by reducing variance is to train on several subsets of the population, build a separate prediction model on each, and average their predictions. In practice, we generate B training subsets and average the B resulting models to obtain a single statistical learning model with low variance.

This is where the “bootstrap” in bootstrap aggregation comes in. The bagging procedure uses bootstrap sampling, generating different subsets of the training data by selecting observations randomly with replacement. After generating B different subsets, a model is trained on each bth subset, and at the end the predictions are averaged.

Weak learners are trained on each subsample independently and in parallel, and at the end an average or majority vote of the results is taken to obtain a more accurate estimate. In regression problems, the average of the outcomes predicted by the individual learners is taken, a process called soft voting. In classification problems, the class with the majority of votes is chosen, a process called hard voting or majority voting.

Figure 2.1. Simplified bagging process.
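A minimal sketch of bagging with scikit-learn's BaggingClassifier, which draws bootstrap samples and aggregates the base learners' votes. The dataset and hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# B = 100 trees, each fit on a bootstrap sample (bootstrap=True) drawn with replacement.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=100, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))  # hard (majority) voting on the test set
```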

2.2 Random Forests

Random forests are an improvement over bagging that changes the way the decision trees are built, in order to decorrelate them. As with bagging, a number of decision trees are built on bootstrapped subsets of the data. However, each time a split is considered, only a random sample of m predictors out of the full set of p predictors is considered as split candidates. The split is made using only one of these m predictors, a new random sample of m predictors is drawn at each split, and typically m ≈ √p is used, that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors.

By considering only this small number of candidate predictors, we remove the possibility of one very strong predictor being selected at the top of every tree, which would create many similar, highly correlated trees in parallel. Averaging highly correlated trees does not reduce the variance as much as averaging a set of uncorrelated trees. Random forests overcome this problem through the restriction on the number of predictors considered at each split. Using this small subset of predictors is especially useful when we have a large number of correlated predictors.
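A short sketch with scikit-learn's RandomForestRegressor, setting max_features="sqrt" explicitly to reflect the m ≈ √p rule discussed above (the regressor's default is to consider all features). The dataset and values are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)  # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators trees, each split considering only ~sqrt(p) randomly chosen predictors.
rf = RandomForestRegressor(n_estimators=300, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # R^2 on the test set
print(rf.feature_importances_)    # relative importance of each predictor
```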

2.3 Boosting

Boosting is another way to improve predictions from decision trees. Like bagging, boosting is a general-purpose approach that can be applied to various statistical learning methods for regression and classification problems.

The boosting process is similar to bagging, except that the trees are trained sequentially rather than in parallel: each tree is grown using information from the previously grown trees. Boosting does not use bootstrap sampling; instead, each tree is fit on a modified version of the original dataset.

Figure 2.2. Simplified boosting process

In boosting, the current decision tree is fit to the residuals of the model so far, rather than to the outcome Y. The new decision tree is then added to the fitted function, and the residuals are updated. Each tree can be quite small, with just a few terminal nodes, determined by the parameter d of the algorithm. By fitting small trees to the residuals, the fitted function improves slowly in the areas where it does not perform well. The shrinkage parameter lambda (λ) slows the process down even further, allowing more trees of different shapes to attack the residuals. In general, statistical learning methods that learn slowly tend to perform well.

Boosting has three tuning parameters: the number of trees B, the shrinkage parameter λ, and the number of splits d in each tree, which controls the complexity of the boosted ensemble.

Unlike bagging, boosting can overfit if the number of trees B is too large, although this overfitting tends to occur slowly. To avoid it, we can use cross-validation to select B. The shrinkage parameter λ is a small positive number that controls the rate at which boosting learns; typical values are 0.01 or 0.001, and the right choice depends on the problem. A very small λ requires a very large B to achieve good performance.
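A minimal sketch with scikit-learn's GradientBoostingRegressor, mapping the three tuning parameters above onto n_estimators (B), learning_rate (λ), and max_depth (d). The dataset and values are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)  # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# B = n_estimators, lambda = learning_rate, d = max_depth (tree complexity).
boost = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01,
                                  max_depth=2, random_state=0)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```

In practice, n_estimators would itself be chosen by cross-validation, as noted above.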

2.4 Bayesian Additive Regression Trees (BART)

No, it’s not that Bart. (Image: Tenor | Reproduction)

Bayesian additive regression trees, or BART, is another ensemble method that uses decision trees, this time with a Bayesian approach. As in bagging and random forests, each tree is built using randomness, and, similar to boosting, each tree tries to capture a signal not yet accounted for by the current model. BART uses a Bayesian approach to fit the ensemble of trees: each time we randomly perturb a tree in order to fit the residuals, we are drawing a new tree from a posterior distribution.

To use BART we must choose the number of trees K, the number of iterations B, and the number of burn-in iterations L. The predictions from the first iterations are generally discarded because they tend not to be good; this is the so-called burn-in period. Typically, large values are used for B and K and a moderate value for L.

The notation f̂b_k(x) represents the prediction at x of the kth regression tree used in the bth iteration. At the end of each iteration, the K trees of that iteration are summed: f̂b(x) = Σ (k=1 to K) f̂b_k(x).

In the first iteration of BART, all trees are initialized to a single root node, with the predicted value equal to the mean of the response values divided by the total number of trees. In subsequent iterations, BART updates each of the K trees, one at a time. In the bth iteration, to update the kth tree, we subtract from each response value the predictions of all trees except the kth, obtaining a partial residual for the ith observation.

Instead of fitting a fresh tree to this partial residual, BART randomly chooses a perturbation of the kth tree from the previous iteration from a set of possible perturbations, favouring perturbations that improve the fit to the partial residual. BART's output is a collection of prediction models f̂1(x), f̂2(x), …, f̂B(x); to obtain a single prediction we typically average the models obtained after the burn-in period: f̂(x) = (1/(B − L)) Σ (b = L+1 to B) f̂b(x).
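BART itself is not in scikit-learn; dedicated packages exist, but the sketch below only illustrates the bookkeeping described above: a sum of K trees and the partial residual used to update each one. Refitting a small tree to the partial residual stands in for BART's actual posterior perturbation step, so this is a conceptual illustration under stated assumptions, not a Bayesian sampler; the data and all values are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

K, B = 20, 50  # number of trees and iterations (illustrative)
# Initialization: every tree predicts the mean response divided by K.
tree_preds = np.full((K, len(y)), y.mean() / K)

for b in range(B):
    for k in range(K):
        # Partial residual: response minus the predictions of all trees except the kth.
        partial_resid = y - tree_preds.sum(axis=0) + tree_preds[k]
        # Stand-in for BART's random perturbation: refit a small tree to the partial residual.
        tree = DecisionTreeRegressor(max_depth=2, random_state=b * K + k)
        tree_preds[k] = tree.fit(X, partial_resid).predict(X)

y_hat = tree_preds.sum(axis=0)  # sum-of-trees prediction after the final iteration
print(np.mean((y - y_hat) ** 2))
```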

3 Conclusion

In this article I sought to present the theory and intuition behind tree-based methods, without any intention of exhausting the subject. I tried to keep the explanations simple and intuitive and did not delve into every detail, so I recommend exploring each topic further. Each method covered has several implementations available for machine learning, from the basic ones in scikit-learn to dedicated gradient boosting libraries. Understanding the logic of each method makes its use more robust, since you know what the model is doing. In addition, it is important to study the parameters of each implementation and their impact on the model in order to obtain better results.

Follow my profile and sign up here to receive the next posts related to data science, machine learning and statistics by email.

References

1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. Springer Texts in Statistics. https://doi.org/10.1007/978-1-0716-1418-1_1

2. Decision Trees. (2022). Scikit-learn. https://scikit-learn.org/stable/modules/tree.html

3. What is Bagging?. (2021). IBM. https://www.ibm.com/cloud/learn/bagging


Alysson Guimarães

Data Scientist. This account is for translated versions of my Portuguese language articles. https://k3ybladewielder.medium.com/