Ensemble learning is a technique where multiple models are combined to improve accuracy and stability compared to a single model. It helps reduce bias, variance, and overfitting.
Concept: Bagging reduces variance by training multiple models on different random subsets of data and aggregating their predictions.
Example: Random Forest (an ensemble of decision trees).
Concept: Boosting builds models sequentially, where each new model corrects errors made by the previous ones.
- AdaBoost: uses weak learners (like decision stumps) and assigns higher weights to misclassified samples.
- Gradient Boosting: minimizes residual errors using gradient descent.
- XGBoost (and similar libraries): more optimized implementations of gradient boosting.
Concept: Stacking trains multiple different models and then uses a meta-model to learn the best way to combine their outputs.
Concept: Voting combines multiple models’ predictions using either a majority vote (hard voting) or an average of predicted class probabilities (soft voting).
Concept: Blending is similar to stacking but simpler: instead of cross-validated predictions, it uses a single holdout validation set to learn how to combine the base models' predictions.
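As a minimal, illustrative sketch of the voting and stacking ideas above (the base models chosen here are arbitrary, not prescribed by these notes), scikit-learn provides both ensembles out of the box:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
# Load a small example dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
base_models = [('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
               ('nb', GaussianNB())]
# Voting: combine predictions directly (soft voting averages class probabilities)
voting_clf = VotingClassifier(estimators=base_models, voting='soft')
voting_clf.fit(X_train, y_train)
print("Voting accuracy:", voting_clf.score(X_test, y_test))
# Stacking: a meta-model (here logistic regression) learns how to combine the base models' outputs
stacking_clf = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(max_iter=1000))
stacking_clf.fit(X_train, y_train)
print("Stacking accuracy:", stacking_clf.score(X_test, y_test))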
| Method | Reduces Variance | Reduces Bias | Complexity | Example Algorithms |
|---|---|---|---|---|
| Bagging | ✅ | ❌ | Medium | Random Forest |
| Boosting | ✅ | ✅ | High | XGBoost, AdaBoost |
| Stacking | ✅ | ✅ | Very High | Custom-built |
| Voting | ✅ | ❌ | Low | Hard/Soft Voting |
| Blending | ✅ | ✅ | Medium | Weighted averages |
Ensemble learning is a powerful technique to improve model accuracy; choose the method based on your data size, the bias-variance trade-off you need to address, and the computational budget available.
The idea of Additive modelling:
Additive modelling is at the foundation of Boosting algorithms. The idea is simple: form a complex function by adding together a number of simpler terms. In Gradient Boosting, a number of simpler models are added together to give a complex final model.
As we shall see, gradient boosting learns a model by taking a weighted sum of a suitable number of base learners.
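In symbols, after \( M \) stages the additive model is a weighted sum of base learners:
\[ F_M(x) = f_0(x) + \sum_{m=1}^{M} \beta_m f_m(x) \]
where \( f_0 \) is a simple initial model (often a constant) and each coefficient \( \beta_m \) weights the contribution of base learner \( f_m \).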
Boosting: take low variance and high bias models; use additive combining to reduce bias.
- The \( N \) models are trained sequentially, each one taking into account the errors of the previous model: the weights of the observations on which the previous model made the largest errors are increased, so that subsequent models focus on the most difficult observations (a rough code sketch of this reweighting step follows this list).
- Also, the individual models that perform the best on the weighted training samples will become stronger (get a
higher weight) and therefore have a higher impact on the final prediction.
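A rough NumPy sketch of this reweighting step (AdaBoost-style; the specific update rule below is a standard choice and an assumption, not taken from the notes above):
import numpy as np
def update_sample_weights(weights, y_true, y_pred):
    # Weighted error rate of the current weak learner
    miss = (y_true != y_pred)
    err = np.sum(weights[miss]) / np.sum(weights)
    # Learner weight: more accurate learners get a larger say in the final vote
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    # Increase weights of misclassified samples, decrease the rest, then renormalize
    weights = weights * np.exp(alpha * np.where(miss, 1.0, -1.0))
    return weights / np.sum(weights), alpha
# Toy example: 5 samples with uniform initial weights; sample 3 is misclassified
w = np.ones(5) / 5
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1])
w, alpha = update_sample_weights(w, y_true, y_pred)
print("updated weights:", w)   # the misclassified sample now carries more weight
print("learner weight alpha:", alpha)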
Suppose we have the following dataset with one feature \( x \) and a target \( y \):
| \( x \) | \( y \) |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
| 4 | 8 |
The initial model \( \hat{y}_i^{(0)} \) is typically the mean of the target values:
\[ \hat{y}_i^{(0)} = \frac{1}{n} \sum_{i=1}^n y_i = \frac{2 + 4 + 6 + 8}{4} = 5 \]
So, the initial predictions are:
| \( x \) | \( y \) | \( \hat{y}_i^{(0)} \) |
|---|---|---|
| 1 | 2 | 5 |
| 2 | 4 | 5 |
| 3 | 6 | 5 |
| 4 | 8 | 5 |
The residuals \( r_i^{(1)} \) are the differences between the actual values \( y_i \) and the predicted values \( \hat{y}_i^{(0)} \):
\[ r_i^{(1)} = y_i - \hat{y}_i^{(0)} \]
| \( x \) | \( y \) | \( \hat{y}_i^{(0)} \) | \( r_i^{(1)} \) |
|---|---|---|---|
| 1 | 2 | 5 | -3 |
| 2 | 4 | 5 | -1 |
| 3 | 6 | 5 | 1 |
| 4 | 8 | 5 | 3 |
A decision tree \( f_1(x) \) is trained to predict these residuals. Suppose the tree makes the following predictions:
| \( x \) | \( f_1(x) \) |
|---|---|
| 1 | -2 |
| 2 | -2 |
| 3 | 2 |
| 4 | 2 |
We update the predictions using a learning rate \( \eta = 0.1 \):
\[ \hat{y}_i^{(1)} = \hat{y}_i^{(0)} + \eta f_1(x_i) \]
We compute new residuals based on updated predictions:
| \( x \) | \( y \) | \( \hat{y}_i^{(1)} \) | \( r_i^{(2)} \) |
|---|---|---|---|
| 1 | 2 | 4.8 | -2.8 |
| 2 | 4 | 4.8 | -0.8 |
| 3 | 6 | 5.2 | 0.8 |
| 4 | 8 | 5.2 | 2.8 |
We train another tree \( f_2(x) \) to predict the new residuals. Suppose the tree makes these predictions:
| \( x \) | \( f_2(x) \) |
|---|---|
| 1 | -1.8 |
| 2 | -1.8 |
| 3 | 1.8 |
| 4 | 1.8 |
We update the predictions again using \( \eta = 0.1 \):
\[ \hat{y}_i^{(2)} = \hat{y}_i^{(1)} + \eta f_2(x_i) \]
After 2 iterations, the final predictions are:
| \( x \) | \( y \) | \( \hat{y}_i^{(2)} \) |
|---|---|---|
| 1 | 2 | 4.62 |
| 2 | 4 | 4.62 |
| 3 | 6 | 5.38 |
| 4 | 8 | 5.38 |
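For readers who want to check the arithmetic, the steps above can be reproduced with a few lines of NumPy; the tree outputs \( f_1 \) and \( f_2 \) are hard-coded to match the example:
import numpy as np
y = np.array([2, 4, 6, 8], dtype=float)
eta = 0.1
# Step 0: initialize with the mean of y
pred = np.full_like(y, y.mean())              # [5, 5, 5, 5]
# Iteration 1: residuals and the (hard-coded) tree predictions f1
f1 = np.array([-2, -2, 2, 2], dtype=float)
print("r1:", y - pred)                        # [-3, -1, 1, 3]
pred = pred + eta * f1                        # [4.8, 4.8, 5.2, 5.2]
# Iteration 2: new residuals and tree predictions f2
f2 = np.array([-1.8, -1.8, 1.8, 1.8])
print("r2:", y - pred)                        # [-2.8, -0.8, 0.8, 2.8]
pred = pred + eta * f2                        # [4.62, 4.62, 5.38, 5.38]
print("final predictions:", pred)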
GBDT minimizes a loss function \( L(y, \hat{y}) \) by iteratively fitting decision trees to the negative gradient of the loss.
F_0(x) = argmin_c ∑ L(y_i, c) # Initialize with a constant value (e.g., mean of y)
For t = 1 to N (number of trees):
- Compute the residuals (negative gradient of the loss function):
r_i^{(t)} = - ∂L(y_i, F_{t-1}(x_i)) / ∂F_{t-1}(x_i)
- Fit a regression tree f_t(x) to predict the residuals:
f_t(x) = TrainDecisionTree(X, r^{(t)})
- Compute step size γ_t by optimizing:
γ_t = argmin_γ ∑ L(y_i, F_{t-1}(x_i) + γ f_t(x_i))
- Update the model:
F_t(x) = F_{t-1}(x) + η γ_t f_t(x) # η is the learning rate
Final model:
F_N(x) = F_0(x) + η ∑ γ_t f_t(x)
# Gradient Boosting Decision Trees (GBDT) Pseudocode
# Input: Training data (X, y), number of trees (N), learning rate (eta)
# Output: Final model F(x)
# Step 1: Initialize model with a constant value (typically mean of y)
F_0(x) = mean(y)
# Step 2: Iterate through N trees
for t = 1 to N do:
# Compute residuals (negative gradient of loss function)
residuals = -Gradient_Loss(y, F_{t-1}(X))
# Train a new decision tree f_t(X) to predict residuals
f_t(X) = TrainDecisionTree(X, residuals)
# Compute optimal step size γ_t
gamma_t = OptimalStepSize(y, F_{t-1}(X), f_t(X))
# Update the model by adding the scaled predictions of the new tree
F_t(x) = F_{t-1}(x) + eta * gamma_t * f_t(x)
# Return the final model
return F_N(x)
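As a hedged, runnable sketch of this pseudocode for the squared-error case (function and parameter names below are illustrative, not taken from the notes above): for squared-error loss the negative gradient is simply the residual, and the step size \( \gamma_t \) is absorbed into the fitted tree, so it is taken as 1 here.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
def gbdt_fit(X, y, n_trees=100, eta=0.1, max_depth=2):
    # Step 1: initialize with a constant; the mean minimizes squared-error loss
    f0 = np.mean(y)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        # Negative gradient of squared-error loss = ordinary residuals
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        trees.append(tree)
        # Update the ensemble (gamma_t = 1 for squared error, folded into the tree)
        pred = pred + eta * tree.predict(X)
    return f0, trees
def gbdt_predict(X, f0, trees, eta=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + eta * tree.predict(X)
    return pred
# Tiny usage example on the dataset from the worked example above
X = np.array([[1], [2], [3], [4]])
y = np.array([2.0, 4.0, 6.0, 8.0])
f0, trees = gbdt_fit(X, y, n_trees=50, eta=0.1, max_depth=1)
print(gbdt_predict(X, f0, trees, eta=0.1))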
Feature importance in Decision Trees (DT) is computed based on how much each feature contributes to reducing the impurity (e.g., Gini impurity or entropy) in the dataset. Below are the key methods used for computing feature importance.
The most common way to measure feature importance is using Gini Importance (also known as Mean Decrease in Impurity or MDI). It is computed as:
\[ \text{Importance}(X_j) = \sum_{t \in T_j} \frac{N_t}{N} \, \Delta I(t) \]
where \( T_j \) is the set of nodes that split on feature \( X_j \), \( N_t \) is the number of samples reaching node \( t \), \( N \) is the total number of training samples, and \( \Delta I(t) \) is the impurity decrease produced by the split at node \( t \).
Another method is Permutation Feature Importance, which measures how much a model’s accuracy drops when a feature's values are randomly shuffled.
A more advanced and interpretable way to compute feature importance is using SHAP values, which quantify how much each feature contributes to a model's predictions.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import pandas as pd
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
# Train Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X, y)
# Get feature importances
feature_importances = dt.feature_importances_
# Display as DataFrame
df_importance = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
df_importance = df_importance.sort_values(by='Importance', ascending=False)
print(df_importance)
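Permutation Feature Importance, mentioned above, is available in scikit-learn's inspection module. A minimal sketch, reusing dt, X, y and feature_names from the snippet above (ideally the importances would be computed on a held-out set):
from sklearn.inspection import permutation_importance
# Shuffle each feature in turn and measure how much the accuracy drops
result = permutation_importance(dt, X, y, n_repeats=10, random_state=42)
for name, mean_imp in zip(feature_names, result.importances_mean):
    print(f"{name}: {mean_imp:.4f}")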
In Random Forest (RF) Classifiers, the choice of base learners significantly impacts the performance of the ensemble model. The preferred base learners should have low bias and high variance, ensuring that the ensemble benefits from variance reduction through averaging.
Decision Trees (DT) are commonly used as base learners in Random Forest because, when grown deep, they have low bias and high variance and can capture complex non-linear patterns, which is exactly the profile that averaging stabilizes.
To ensure optimal performance, Decision Trees in Random Forest should have a reasonable depth:
Typically, in Random Forest, trees are grown fully or with moderate depth (e.g., a max depth between 5 and 20, depending on the dataset) to maintain high variance while allowing bagging to stabilize predictions.
Random Forest reduces variance by averaging predictions from multiple independent trees. Using high variance base learners ensures diversity among individual trees, making the ensemble more robust.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Train Random Forest with Decision Trees as base learners
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X, y)
# Print feature importances
print("Feature Importances:", rf.feature_importances_)
Bootstrapping is a fundamental concept in Random Forest (RF) that helps in reducing variance and improving model stability. It involves sampling with replacement to create multiple training datasets.
Given a standard training set D of size n, Random Forest generates m new training sets Dᵢ, each of size n, by sampling from D uniformly and with replacement.
Each of the m bootstrap samples is used to train an individual model, usually a Decision Tree. This ensures that each tree sees a slightly different version of the data.
The outputs of these m models are then combined to make the final prediction: by averaging the outputs for regression, or by majority vote for classification.
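A minimal NumPy sketch of drawing a single bootstrap sample (illustrative only; Random Forest does this internally for every tree):
import numpy as np
n = 10
X = np.arange(n).reshape(-1, 1)           # toy feature matrix
y = np.arange(n)                          # toy targets
# Sample n row indices uniformly *with replacement*
rng = np.random.default_rng(42)
idx = rng.integers(0, n, size=n)
X_boot, y_boot = X[idx], y[idx]
# On average only about 63% of the distinct rows appear in each bootstrap sample (for large n)
print("unique rows in this sample:", len(np.unique(idx)))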
Bootstrapping plays a crucial role in stabilizing the Random Forest model: because every tree is trained on a different resample of the data, the trees are decorrelated, and averaging their predictions reduces variance.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Train Random Forest with bootstrapping
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=42)
rf.fit(X, y)
# Print feature importances
print("Feature Importances:", rf.feature_importances_)
Bagging is the simplest way of combining predictions: models are trained independently and in parallel on bootstrap samples and their predictions are weighted equally, while Boosting combines models trained sequentially, each new model weighted by its performance and focused on the errors of the previous ones.
- Bagging aims to decrease variance, not bias.
- Boosting aims to decrease bias, not variance.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
# Train Bagging Classifier
# ('estimator' replaced the deprecated 'base_estimator' argument in scikit-learn 1.2+)
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_clf.fit(X_train, y_train)
# Evaluate the model
print("Bagging Classifier Accuracy:", bagging_clf.score(X_test, y_test))
- Bagging decreases variance.
- Boosting decreases bias.
- Underfitting occurs when a model has high bias and low variance.
- Overfitting occurs when a model has low bias and high variance.
- If data is large → Both Bagging and Boosting work, but Boosting is not as necessary.
- If data is small → Boosting is more prone to overfitting, while Bagging is safer.
- In general, Boosting overfits more than Bagging.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create a small dataset
X, y = make_classification(n_samples=50, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train Boosting Classifier
# ('estimator' replaced the deprecated 'base_estimator' argument in scikit-learn 1.2+)
boosting_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=50, random_state=42)
boosting_clf.fit(X_train, y_train)
# Evaluate the model
print("Boosting Classifier Accuracy on Training:", boosting_clf.score(X_train, y_train))
print("Boosting Classifier Accuracy on Test:", boosting_clf.score(X_test, y_test))
This example demonstrates how Boosting can overfit a small dataset, leading to much higher training accuracy than test accuracy.
- One of the big advantages of bagging is that it can be parallelized.
- Different models are fitted independently from each other.
- Intensive parallelization techniques can be used if required.
- This makes bagging computationally efficient and scalable.
- Boosting uses a sequential modeling technique.
- New models are trained one after another, each one fitted to correct the errors of the ensemble built so far.
- Because each model depends on the previous one, it cannot be parallelized easily.
- As a result, boosting requires more computational power compared to bagging.
- Bagging is highly parallelizable and computationally efficient.
- Boosting is sequential and requires more time and computational resources.
- When scalability is needed, bagging is often the better choice.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create a dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train Bagging Classifier with parallel processing (n_jobs=-1 uses all CPU cores)
# ('estimator' replaced the deprecated 'base_estimator' argument in scikit-learn 1.2+)
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, n_jobs=-1, random_state=42)
bagging_clf.fit(X_train, y_train)
# Evaluate the model
print("Bagging Classifier Accuracy:", bagging_clf.score(X_test, y_test))
The above code demonstrates how bagging leverages parallel computing by setting
n_jobs=-1, utilizing all CPU cores for training.