Introduction
Gradient boosting is one of the most powerful and widely used machine learning algorithms as of this writing. In this post, we will look at gradient boosting and the intuition behind it. The notation follows the excellent series of posts at explained.ai.
Decision Tree Regression
There are many machine learning models that can predict a target variable given a set of input features. Decision trees are one of the simpler models, and form the basis of most implementations of gradient boosting. Before we get into gradient boosting, let's see how decision trees are trained and how they infer target variables.
Training
Decision trees take the training data and split it repeatedly according to the value of a particular feature. At each split ('branch point' in the tree metaphor), the data is divided into two branches: left and right. The feature and threshold for each split are chosen so that they separate the data as well as possible. For regression, the 'best' split is usually the feature and threshold combination that minimizes the variance of the target values within each of the two branches (equivalently, the mean squared error of predicting each branch's mean). Each branch is then split again, and again, until some stopping criterion is met. Once that criterion is reached, no more branches are created, and the data in that branch becomes a leaf.
Inference
Once the model is trained, a new data point is passed into the decision tree. It starts at the root, and gets routed to the left or right branch depending on the learned criterion (feature/threshold combination) at that branch point. Once the data point reaches a leaf, the prediction is the mean target value of the training data that defined that leaf.
Here is my implementation of a decision tree regressor:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.metrics import mean_squared_error as mse


class Branch():
    def __init__(self, X, y, lossfun, hyperparams, depth=1):
        assert len(X) == len(y)
        self.X = X
        self.y = y
        self.N = len(y)
        self.lossfun = lossfun
        self.depth = depth
        self.feat = None
        self.thresh = None
        self.left = None
        self.right = None
        self.leaf = True
        self.hyperparams = hyperparams
        hp = hyperparams

        keep_splitting = True
        if len(y) <= hp['min_samples_split']:
            keep_splitting = False
        # Stop splitting below the maximum depth (the root sits at depth 1)
        if hp['max_depth'] is not None and depth > hp['max_depth']:
            keep_splitting = False

        if keep_splitting:
            self.leaf = False
            self.feat, self.thresh = self.find_split()
            self.make_children()
            # Undo the split if it produced a leaf with too few samples
            if min(self.left.N, self.right.N) < hp['min_samples_leaf']:
                self.left = None
                self.right = None
                self.leaf = True

    def predict(self, X):
        y_pred = np.zeros(len(X))
        if self.leaf:
            # A leaf predicts the mean of the training targets that reached it
            y_pred[:] = np.mean(self.y)
        else:
            # Route each point left or right and let the children predict
            left_mask = X[self.feat].values <= self.thresh
            right_mask = np.logical_not(left_mask)
            y_pred[left_mask] = self.left.predict(X.iloc[left_mask])
            y_pred[right_mask] = self.right.predict(X.iloc[right_mask])
        return y_pred

    def find_split(self):
        y = self.y
        feats = list(self.X.columns)
        loss_best = None
        feat_best = None
        thresh_best = None
        for feat in feats:
            ls = []
            x = self.X[feat].values
            threshs = np.unique(x)
            if len(threshs) == 1:
                # The feature is constant within this branch; it can't be split on
                continue
            for thresh in threshs[:-1]:
                # Predict each side's mean and measure the resulting loss
                y_pred = np.zeros_like(y, dtype=float)
                left_mask = x <= thresh
                right_mask = np.logical_not(left_mask)
                y_pred[left_mask] = y[left_mask].mean()
                y_pred[right_mask] = y[right_mask].mean()
                ls.append(self.lossfun(y, y_pred))
            if loss_best is None or loss_best > np.min(ls):
                thresh_best = threshs[np.argmin(ls)]
                feat_best = feat
                loss_best = np.min(ls)
        return feat_best, thresh_best

    def make_children(self):
        left_mask = self.X[self.feat].values <= self.thresh
        right_mask = np.logical_not(left_mask)
        X_left = self.X.iloc[left_mask, :]
        X_right = self.X.iloc[right_mask, :]
        for X in [X_left, X_right]:
            X.reset_index(drop=True, inplace=True)
        y_left = self.y[left_mask]
        y_right = self.y[right_mask]
        assert len(X_left) == len(y_left)
        assert len(X_right) == len(y_right)
        assert len(y_left) > 0
        assert len(y_right) > 0
        self.left = Branch(X_left, y_left, self.lossfun, self.hyperparams, self.depth + 1)
        self.right = Branch(X_right, y_right, self.lossfun, self.hyperparams, self.depth + 1)

    def __repr__(self):
        return f"N: {len(self.y)}, Feature: {self.feat}, Thresh: {self.thresh}"


class DecisionTree():
    def __init__(self, lossfun, max_depth=None, min_samples_split=2, min_samples_leaf=1):
        self.lossfun = lossfun
        self.hyperparams = {'max_depth': max_depth,
                            'min_samples_split': min_samples_split,
                            'min_samples_leaf': min_samples_leaf}

    def fit(self, X, y):
        self.root = Branch(X, y, self.lossfun, self.hyperparams)

    def predict(self, X):
        return self.root.predict(X)


if __name__ == '__main__':
    # load_boston is available in scikit-learn < 1.2, where the dataset is still bundled
    boston = datasets.load_boston()
    df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
    y = boston['target']
    regressor = DecisionTree(mse)
    regressor.fit(df, y)
    y_pred = regressor.predict(df)
    fig, ax = plt.subplots()
    ax.plot(y, label='y true')
    ax.plot(y_pred, label='y pred')
    ax.set_ylabel('y')
    ax.set_xlabel('Data point index')
    ax.legend()
    plt.show()

In the script above, training and testing are on the same data, so we wouldn't expect the model's performance to generalize to other data sampled from the same distribution. We are overfitting to the training data because we are allowing the tree to be very deep (have lots of branch points). We can prevent this overfitting by reducing the depth of the tree. We can even make 'stumps' - decision trees with only one branch point. These small trees are called weak learners because each one of them has high bias; they will have high error because they can't fit complicated functions. However, if we chain several of these weak learners together, we can make an extremely good model. This is the reasoning behind boosting in general, and gradient boosting in particular.
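For example, here is a minimal sketch of the difference, reusing df, y, and the DecisionTree class from the script above (the variable names and depth choices are illustrative, not part of any library):

# An unrestricted tree can essentially memorize the training data, while a
# 'stump' with a single branch point underfits on its own: a weak learner.
deep_tree = DecisionTree(mse)               # no max_depth: splits until it can't
stump = DecisionTree(mse, max_depth=1)      # a single branch point
deep_tree.fit(df, y)
stump.fit(df, y)
print(mse(y, deep_tree.predict(df)))        # very low training error (overfitting)
print(mse(y, stump.predict(df)))            # much higher training error (high bias)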
Chaining Decision Trees

In gradient boosting, during inference each tree takes the input features and outputs a number, but this number isn't a direct estimate of the target variable. Instead, it is an estimate of the direction we need to move in (the negative gradient of the loss, as we'll see below) in order to correct the prediction formed by adding up all the previous trees' outputs. In other words, at each tree we have a running prediction, and the next tree corrects that prediction by adding a delta which depends on the input features.
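As a minimal sketch of what inference looks like under this scheme (using numpy as imported above; boosted_predict, f0, trees, and learning_rate are illustrative names, not part of any library), the boosted prediction is just a running sum of corrections:

# A base prediction plus a sum of per-tree corrections, each scaled by a
# learning rate. Each tree's output is a delta, not a direct target estimate.
def boosted_predict(X, f0, trees, learning_rate):
    y_pred = np.full(len(X), f0)                 # constant starting prediction
    for tree in trees:                           # each tree nudges the prediction
        y_pred += learning_rate * tree.predict(X)
    return y_pred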
What is the loss function each tree is trying to minimize? Let's formalize the concept of how wrong we are in our prediction after the $m$-th tree. Writing $F_m(\mathbf{x})$ for the combined prediction of the first $m$ trees, the error on a single data point is measured by a loss function $L(y, F_m(\mathbf{x}))$; for regression a common choice is the squared error, $L(y, F_m(\mathbf{x})) = (y - F_m(\mathbf{x}))^2$. The next tree $\Delta_{m+1}$ should be chosen so that the updated model $F_{m+1}(\mathbf{x}) = F_m(\mathbf{x}) + \eta\,\Delta_{m+1}(\mathbf{x})$, where $\eta$ is the learning rate, reduces this loss.
Gradient in prediction space
There is another, more general way to think about the function we are trying to minimize at each step. Let's imagine we have a point in our target space that we are trying to get to, located at $\mathbf{y} = (y_1, \dots, y_N)$, the vector of true target values over all $N$ training points. Our current position in that space is the vector of predictions $\hat{\mathbf{y}} = \big(F_m(\mathbf{x}_1), \dots, F_m(\mathbf{x}_N)\big)$.

Ideally we would like the next tree to cause the prediction to jump exactly to the true $\mathbf{y}$, i.e. to output exactly the residual vector $\mathbf{y} - \hat{\mathbf{y}}$.
We can imagine that our weak learner is only going to allow us to move our prediction $\hat{\mathbf{y}}$ a small step in some direction. The most effective direction for that step is the direction of steepest descent of the loss, which is the negative gradient of the loss with respect to the current predictions, $-\nabla_{\hat{\mathbf{y}}}\, L(\mathbf{y}, \hat{\mathbf{y}})$.
Each tree is searching over its parameters to find the output that best approximates this negative gradient evaluated at the current prediction. The prediction is then updated as $\hat{\mathbf{y}} \leftarrow \hat{\mathbf{y}} + \eta\,\Delta_{m+1}(\mathbf{X})$, with the learning rate $\eta$ controlling the size of the step.
MSE Loss
Let's evaluate the gradient for the MSE loss to see what function each regression tree should attempt to approximate. For a single data point, with $L(y, \hat{y}) = (y - \hat{y})^2$,

$$-\frac{\partial L}{\partial \hat{y}} = -\frac{\partial}{\partial \hat{y}}\,(y - \hat{y})^2 = 2\,(y - \hat{y}).$$

This result states that the function the next tree should approximate is, up to a constant factor, the residual $y - \hat{y}$: the difference between the true target and the prediction accumulated so far. Under the MSE loss, gradient boosting therefore amounts to repeatedly fitting weak learners to the residuals of the current model.
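To make this concrete, here is a minimal sketch of the MSE training loop, reusing df, y, and the DecisionTree class from the script above (the number of rounds, learning rate, and depth are illustrative choices, and f0, trees, and residuals are illustrative names):

# Gradient boosting with the MSE loss: each shallow tree is fit to the
# residuals of the current prediction, i.e. to the negative gradient direction.
f0 = y.mean()                                    # initial prediction: the mean target
y_pred = np.full(len(y), f0)
learning_rate = 0.1
trees = []
for m in range(100):                             # 100 boosting rounds
    residuals = y - y_pred                       # what the next tree should predict
    tree = DecisionTree(mse, max_depth=2)        # a weak learner
    tree.fit(df, residuals)
    y_pred = y_pred + learning_rate * tree.predict(df)
    trees.append(tree)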
Other loss functions
This same procedure works for any differentiable loss function. The output of the previous trees gives us the current prediction $\hat{y}$; we evaluate the negative gradient of the loss at that prediction, $-\partial L(y, \hat{y}) / \partial \hat{y}$, and fit the next tree to those values (often called pseudo-residuals). For the absolute error $L(y, \hat{y}) = |y - \hat{y}|$, for example, the negative gradient is $\operatorname{sign}(y - \hat{y})$, so each tree is fit to the sign of the residual rather than to the residual itself.
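As a simplified sketch, only the quantity the next tree is fit to changes (pseudo_residuals is an illustrative name, and a complete implementation would also adjust the values predicted at the leaves):

# With L = |y - y_pred|, the negative gradient is the sign of the residual,
# so each tree is fit to +1/-1 values instead of to the residuals themselves.
pseudo_residuals = np.sign(y - y_pred)
tree = DecisionTree(mse, max_depth=2)            # the tree itself is still fit by least squares
tree.fit(df, pseudo_residuals)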
This entire procedure is called gradient descent in function space. The term 'function space' is the same as what we've been calling prediction space, and is so called because the space we are moving in is made up of the values the function $F$ takes at each training point, rather than of a fixed set of model parameters.
Conclusion
In this post we saw how gradient boosting works. In another post we will look at how this same framework can be used to perform classification, by changing the loss function.