Ensemble Techniques in Machine Learning

Ensemble learning is a technique where multiple models are combined to improve accuracy and stability compared to a single model. It helps reduce bias, variance, and overfitting.

1. Bagging (Bootstrap Aggregating)

Concept: Bagging reduces variance by training multiple models on different random subsets of data and aggregating their predictions.

  • Each model is trained on a different bootstrapped sample.
  • Final prediction is done via majority voting (classification) or averaging (regression).

Example: Random Forest (an ensemble of decision trees).

2. Boosting

Concept: Boosting builds models sequentially, where each new model corrects errors made by the previous ones.

a) AdaBoost

Uses weak learners (like decision stumps) and assigns higher weights to misclassified samples.

b) Gradient Boosting (GBM)

Minimizes residual errors using gradient descent.

c) XGBoost, LightGBM, CatBoost

More optimized versions of gradient boosting.

3. Stacking (Stacked Generalization)

Concept: Stacking trains multiple different models and then uses a meta-model to learn the best way to combine their outputs.

  • Base models: Decision Trees, SVM, Neural Networks
  • Meta-model: Logistic Regression or another learner (a minimal sketch follows this list)
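
As a minimal illustration, here is a hedged sketch using scikit-learn's StackingClassifier on the Iris dataset; the particular base models and meta-model are arbitrary choices for demonstration:

    from sklearn.ensemble import StackingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Load dataset
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

    # Base models produce out-of-fold predictions; the meta-model learns to combine them
    stack = StackingClassifier(
        estimators=[('dt', DecisionTreeClassifier(random_state=42)),
                    ('svm', SVC(probability=True, random_state=42))],
        final_estimator=LogisticRegression(max_iter=1000),
    )
    stack.fit(X_train, y_train)
    print("Stacking accuracy:", stack.score(X_test, y_test))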

4. Voting Ensemble

Concept: Combines multiple models’ predictions using:

  • Hard Voting: Chooses the most frequent class.
  • Soft Voting: Averages probability scores (see the example after this list).
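
A minimal sketch of both voting modes with scikit-learn's VotingClassifier; the model choices here are illustrative:

    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris

    # Load dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    estimators = [('lr', LogisticRegression(max_iter=1000)),
                  ('dt', DecisionTreeClassifier(random_state=42)),
                  ('svm', SVC(probability=True, random_state=42))]  # probability=True enables soft voting

    hard = VotingClassifier(estimators=estimators, voting='hard').fit(X, y)  # most frequent class wins
    soft = VotingClassifier(estimators=estimators, voting='soft').fit(X, y)  # average of predicted probabilities
    # Training accuracy, for illustration only
    print("Hard voting accuracy:", hard.score(X, y))
    print("Soft voting accuracy:", soft.score(X, y))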

5. Blending

Concept: Similar to stacking, but simpler: instead of out-of-fold predictions from cross-validation, it trains the meta-model on the base models' predictions for a single holdout validation set, as sketched below.
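
Scikit-learn has no dedicated blending class, so here is a hand-rolled minimal sketch (the model choices are illustrative): base models are trained on the training split, and a meta-model is fit on their predicted probabilities for the holdout set.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Split into a training set for the base models and a holdout set for the meta-model
    iris = load_iris()
    X_train, X_hold, y_train, y_hold = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

    base_models = [DecisionTreeClassifier(random_state=42), SVC(probability=True, random_state=42)]
    for model in base_models:
        model.fit(X_train, y_train)

    # Meta-features: each base model's class probabilities on the holdout set
    meta_X = np.hstack([model.predict_proba(X_hold) for model in base_models])
    meta_model = LogisticRegression(max_iter=1000).fit(meta_X, y_hold)

    # In practice the blended model would be evaluated on a third, untouched test set
    print("Blended accuracy on holdout:", meta_model.score(meta_X, y_hold))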

Comparison of Ensemble Methods

Method | Reduces Variance | Reduces Bias | Complexity | Example Algorithms
Bagging | Yes | No | Medium | Random Forest
Boosting | No | Yes | High | XGBoost, AdaBoost
Stacking | Yes | Yes | Very High | Custom-built
Voting | Yes | No | Low | Hard/Soft Voting
Blending | Yes | Yes | Medium | Weighted averages

Conclusion

Ensemble learning is a powerful technique to improve model accuracy. Choose the right method based on your data:

  • Use Bagging if variance is high (e.g., Random Forest).
  • Use Boosting if both bias and variance need reduction (e.g., XGBoost, LightGBM).
  • Use Stacking for combining multiple different models.
  • Use Voting/Blending for simple ensemble combinations.

How Does Boosting Work?


            The idea of additive modelling:
            Additive modelling is at the foundation of boosting algorithms. The idea is simple: form a complex function by adding together
            a bunch of simpler terms. In gradient boosting, a number of simpler models are added together to give a complex final model.
            As we shall see, gradient boosting learns a model by taking a weighted sum of a suitable number of base learners.

            Boosting: take low-variance, high-bias models and use additive combining to reduce bias.
            - The N models are trained sequentially, taking into account the success of the previous model and increasing the
              weights of the data on which that previous model had the highest error, which makes the subsequent models focus on the
              most difficult observations.
            - Also, the individual models that perform best on the weighted training samples become stronger (get a
              higher weight) and therefore have a higher impact on the final prediction.

Mathematical Process of Gradient Boosting Decision Trees (GBDT)

Example Problem

Suppose we have the following dataset with one feature \( x \) and a target \( y \):

\( x \) | \( y \)
1 | 2
2 | 4
3 | 6
4 | 8

Step 1: Initial Model

The initial model \( \hat{y}_i^{(0)} \) is typically the mean of the target values:

\[ \hat{y}_i^{(0)} = \frac{1}{n} \sum_{i=1}^n y_i = \frac{2 + 4 + 6 + 8}{4} = 5 \]

So, the initial predictions are:

\( x \) | \( y \) | \( \hat{y}_i^{(0)} \)
1 | 2 | 5
2 | 4 | 5
3 | 6 | 5
4 | 8 | 5

Step 2: Calculate Residuals (Errors)

The residuals \( r_i^{(1)} \) are the differences between the actual values \( y_i \) and the predicted values \( \hat{y}_i^{(0)} \):

\[ r_i^{(1)} = y_i - \hat{y}_i^{(0)} \]

\( x \) | \( y \) | \( \hat{y}_i^{(0)} \) | \( r_i^{(1)} \)
1 | 2 | 5 | -3
2 | 4 | 5 | -1
3 | 6 | 5 | 1
4 | 8 | 5 | 3

Step 3: Train a New Tree on Residuals

A decision tree is trained to predict these residuals. Suppose the tree makes the following predictions:

\( x \) | \( f_1(x) \)
1 | -2
2 | -2
3 | 2
4 | 2

Step 4: Update Predictions

We update the predictions using a learning rate \( \eta = 0.1 \):

\[ \hat{y}_i^{(1)} = \hat{y}_i^{(0)} + \eta f_1(x_i) \]

For example, for \( x = 1 \): \( \hat{y}_1^{(1)} = 5 + 0.1 \times (-2) = 4.8 \).

Step 5: Calculate New Residuals

We compute new residuals based on updated predictions:

\( x \) | \( y \) | \( \hat{y}_i^{(1)} \) | \( r_i^{(2)} \)
1 | 2 | 4.8 | -2.8
2 | 4 | 4.8 | -0.8
3 | 6 | 5.2 | 0.8
4 | 8 | 5.2 | 2.8

Step 6: Train Another Tree on New Residuals

We train another tree \( f_2(x) \) to predict the new residuals. Suppose the tree makes these predictions:

\( x \) | \( f_2(x) \)
1 | -1.8
2 | -1.8
3 | 1.8
4 | 1.8

Step 7: Update Predictions Again

We update the predictions again using \( \eta = 0.1 \):

\[ \hat{y}_i^{(2)} = \hat{y}_i^{(1)} + \eta f_2(x_i) \]

For example, for \( x = 1 \): \( \hat{y}_1^{(2)} = 4.8 + 0.1 \times (-1.8) = 4.62 \).

Final Model

After 2 iterations, the final predictions are:

\( x \) | \( y \) | \( \hat{y}_i^{(2)} \)
1 | 2 | 4.62
2 | 4 | 4.62
3 | 6 | 5.38
4 | 8 | 5.38

Key Takeaways

  • Each tree improves predictions by focusing on residuals.
  • The learning rate \( \eta \) controls the contribution of each tree.
  • Iterating further reduces errors and improves accuracy; a short code sketch of these two iterations follows below.
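
To make the arithmetic concrete, here is a minimal sketch of the two iterations above in plain Python. The "trees" \( f_1 \) and \( f_2 \) are hard-coded to the predictions assumed in Steps 3 and 6 rather than actually fitted:

    # Reproduce the two GBDT iterations from the worked example
    x = [1, 2, 3, 4]
    y = [2, 4, 6, 8]
    eta = 0.1

    # Step 1: initialize every prediction with the mean of y
    pred = [sum(y) / len(y)] * len(y)          # [5, 5, 5, 5]

    f1 = {1: -2.0, 2: -2.0, 3: 2.0, 4: 2.0}   # tree fitted to the first residuals
    f2 = {1: -1.8, 2: -1.8, 3: 1.8, 4: 1.8}   # tree fitted to the second residuals

    for tree in (f1, f2):
        residuals = [yi - pi for yi, pi in zip(y, pred)]          # r = y - y_hat
        pred = [pi + eta * tree[xi] for pi, xi in zip(pred, x)]   # y_hat += eta * f(x)
        print("residuals:", residuals, "-> predictions:", pred)
    # Final predictions: approximately [4.62, 4.62, 5.38, 5.38]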

Gradient Boosting Decision Trees (GBDT) Optimization

Objective Function

GBDT minimizes a loss function \( L(y, \hat{y}) \) by iteratively fitting decision trees to the negative gradient of the loss.

General Optimization Equation:

            F_0(x) = argmin_c ∑ L(y_i, c)  # Initialize with a constant value (e.g., mean of y)
            
            For t = 1 to N (number of trees):
                - Compute the residuals (negative gradient of the loss function):
                  r_i^{(t)} = - ∂L(y_i, F_{t-1}(x_i)) / ∂F_{t-1}(x_i)
                
                - Fit a regression tree f_t(x) to predict the residuals:
                  f_t(x) = TrainDecisionTree(X, r^{(t)})
            
                - Compute step size γ_t by optimizing:
                  γ_t = argmin_γ ∑ L(y_i, F_{t-1}(x_i) + γ f_t(x_i))
                
                - Update the model:
                  F_t(x) = F_{t-1}(x) + η γ_t f_t(x)  # η is the learning rate
            
            Final model:
                F_N(x) = F_0(x) + η ∑_{t=1}^{N} γ_t f_t(x)
                

GBDT Pseudocode

            # Gradient Boosting Decision Trees (GBDT) Pseudocode
            
            # Input: Training data (X, y), number of trees (N), learning rate (eta)
            # Output: Final model F(x)
            
            # Step 1: Initialize model with a constant value (typically mean of y)
            F_0(x) = mean(y)
            
            # Step 2: Iterate through N trees
            for t = 1 to N do:
                # Compute residuals (negative gradient of loss function)
                residuals = -Gradient_Loss(y, F_{t-1}(X))
                
                # Train a new decision tree f_t(X) to predict residuals
                f_t(X) = TrainDecisionTree(X, residuals)
                
                # Compute optimal step size γ_t
                gamma_t = OptimalStepSize(y, F_{t-1}(X), f_t(X))
                
                # Update the model by adding the scaled predictions of the new tree
                F_t(x) = F_{t-1}(x) + eta * gamma_t * f_t(x)
            
            # Return the final model
            return F_N(x)
                
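
As a runnable counterpart to the pseudocode, here is a hedged sketch with scikit-learn's GradientBoostingRegressor on the toy dataset from the worked example; with two depth-1 trees and \( \eta = 0.1 \), it should closely reproduce the hand-computed predictions:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    X = np.array([[1], [2], [3], [4]])
    y = np.array([2, 4, 6, 8])

    # n_estimators and learning_rate play the roles of N and eta in the pseudocode;
    # for squared-error loss the model is initialized with the mean of y
    gbr = GradientBoostingRegressor(n_estimators=2, learning_rate=0.1, max_depth=1)
    gbr.fit(X, y)
    print(gbr.predict(X))  # expected: approximately [4.62, 4.62, 5.38, 5.38]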

Feature Importance in Decision Trees

Feature importance in Decision Trees (DT) is computed based on how much each feature contributes to reducing the impurity (e.g., Gini impurity or entropy) in the dataset. Below are the key methods used for computing feature importance.

1. Gini Importance (Mean Decrease in Impurity)

The most common way to measure feature importance is using Gini Importance (also known as Mean Decrease in Impurity or MDI). It is computed as:

\[ \text{Importance}(X_j) = \sum_{t \in T_j} \frac{N_t}{N} \, \Delta I(t) \]

  • \( T_j \) = set of all nodes where feature \( X_j \) is used for splitting.
  • \( N_t \) = number of samples in node \( t \).
  • \( N \) = total number of samples.
  • \( \Delta I(t) \) = impurity reduction at node \( t \).

2. Permutation Feature Importance

Another method is Permutation Feature Importance, which measures how much a model’s accuracy drops when a feature's values are randomly shuffled.

  1. Train the tree model normally.
  2. Compute the baseline accuracy or error.
  3. Shuffle the values of a feature and predict again.
  4. Measure the drop in accuracy; larger drops indicate more important features (see the sketch after this list).
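
A minimal sketch of this procedure using scikit-learn's permutation_importance helper; the random forest here is just an example model:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Train a model normally and establish baseline accuracy on held-out data
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

    # Shuffle each feature n_repeats times and record the mean drop in accuracy
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
    for name, importance in zip(iris.feature_names, result.importances_mean):
        print(f"{name}: {importance:.3f}")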

3. SHAP Values (SHapley Additive exPlanations)

A more advanced and interpretable way to compute feature importance is using SHAP values, which quantify how much each feature contributes to a model's predictions.
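
If the third-party shap package is installed (pip install shap; it is not part of scikit-learn), a minimal sketch looks like this; TreeExplainer is its specialized explainer for tree ensembles:

    import shap
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_iris

    iris = load_iris()
    model = RandomForestClassifier(random_state=42).fit(iris.data, iris.target)

    # SHAP values: each feature's additive contribution to each prediction
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(iris.data)
    shap.summary_plot(shap_values, iris.data, feature_names=iris.feature_names)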

Example: Feature Importance in Python (Scikit-Learn)

                
                from sklearn.tree import DecisionTreeClassifier
                from sklearn.datasets import load_iris
                import pandas as pd
            
                # Load dataset
                iris = load_iris()
                X, y = iris.data, iris.target
                feature_names = iris.feature_names
            
                # Train Decision Tree
                dt = DecisionTreeClassifier()
                dt.fit(X, y)
            
                # Get feature importances
                feature_importances = dt.feature_importances_
            
                # Display as DataFrame
                df_importance = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
                df_importance = df_importance.sort_values(by='Importance', ascending=False)
            
                print(df_importance)
                
                

Key Takeaways

  • Impurity Reduction (Gini Importance) – Measures how much a feature reduces impurity.
  • Permutation Importance – Measures drop in accuracy when a feature is shuffled.
  • SHAP Values – Provides a more interpretable way to measure feature contributions.

What Kind of Base Learners Are Preferable in Random Forest Classifiers?

In Random Forest (RF) Classifiers, the choice of base learners significantly impacts the performance of the ensemble model. The preferred base learners should have low bias and high variance, ensuring that the ensemble benefits from variance reduction through averaging.

1. Decision Trees as Base Learners

Decision Trees (DT) are commonly used as base learners in Random Forest because they:

  • Are high variance models – small changes in data lead to different trees.
  • Can capture complex relationships in data.
  • Work well with bootstrap aggregation (bagging), reducing overfitting.

2. Depth of Decision Trees

To ensure optimal performance, Decision Trees in Random Forest should have a reasonable depth:

  • If trees are too deep, they may overfit individual bootstrap samples.
  • If trees are too shallow, they may have high bias and underperform.

Typically, in Random Forest, trees are grown fully or with moderate depth (e.g., a max depth between 5 and 20, depending on the dataset) to maintain high variance while allowing bagging to stabilize predictions.

3. Why High Variance Base Learners?

Random Forest reduces variance by averaging predictions from multiple independent trees. Using high variance base learners ensures diversity among individual trees, making the ensemble more robust.
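
One rough way to see this effect, sketched below with illustrative numbers: compare the spread of cross-validation scores for a single unpruned tree against a forest that averages many such trees.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=20, random_state=42)

    # A single high-variance tree vs. an ensemble that averages 100 such trees
    for name, model in [("single tree", DecisionTreeClassifier(random_state=42)),
                        ("random forest", RandomForestClassifier(n_estimators=100, random_state=42))]:
        scores = cross_val_score(model, X, y, cv=10)
        print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")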

4. Example: Implementing Random Forest with Decision Trees in Python

                    
                    from sklearn.ensemble import RandomForestClassifier
                    from sklearn.datasets import load_iris
                
                    # Load dataset
                    iris = load_iris()
                    X, y = iris.data, iris.target
                
                    # Train Random Forest with Decision Trees as base learners
                    rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
                    rf.fit(X, y)
                
                    # Print feature importances
                    print("Feature Importances:", rf.feature_importances_)
                    
                    

Key Takeaways

  • Low Bias, High Variance Models – Decision Trees are preferred as they provide diversity.
  • Reasonable Tree Depth – Overly deep trees may overfit; overly shallow trees may underperform.
  • Bagging Reduces Variance – Combining multiple high-variance trees stabilizes predictions.

How Does Bootstrapping Work in Random Forest Classification?

Bootstrapping is a fundamental concept in Random Forest (RF) that helps in reducing variance and improving model stability. It involves sampling with replacement to create multiple training datasets.

1. Bootstrapping: Sampling with Replacement

Given a standard training set D of size n, Random Forest generates m new training sets Dᵢ, each of size n, by sampling from D uniformly and with replacement.

  • Since we sample with replacement, some observations may appear multiple times in a given Dᵢ.
  • For large n, a bootstrap sample Dᵢ contains about 63.2% of the unique examples from the original dataset D, while the rest are duplicates: each example is left out with probability (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 36.8%. A quick simulation follows below.
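
A quick numpy check of the 63.2% figure (the sample size is arbitrary):

    import numpy as np

    rng = np.random.default_rng(42)
    n = 10_000
    D = np.arange(n)  # indices of the original training set

    # Draw one bootstrap sample: size n, sampled with replacement
    sample = rng.choice(D, size=n, replace=True)
    unique_fraction = len(np.unique(sample)) / n
    print(f"Unique examples in the bootstrap sample: {unique_fraction:.1%}")  # ~63.2%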

2. Training on Bootstrap Samples

Each of the m bootstrap samples is used to train an individual model, usually a Decision Tree. This ensures that each tree sees a slightly different version of the data.

3. Aggregation (Bagging)

The outputs of these m models are then combined to make the final prediction:

  • For classification: Majority voting is used to determine the final class.
  • For regression: Predictions are averaged to obtain the final output.

4. Impact of Bootstrapping in Random Forest

Bootstrapping plays a crucial role in stabilizing the Random Forest model:

  • Variance Reduction: Since we aggregate multiple diverse models, the overall variance is reduced.
  • Robustness: Even if a portion of the data is changed, the overall prediction remains stable.
  • Low Bias & Reduced Variance: Individual trees may have low bias and high variance, but bagging ensures the final model has low bias and reduced variance.

5. Example: Implementing Bootstrapping in Random Forest

                    
                    from sklearn.ensemble import RandomForestClassifier
                    from sklearn.datasets import load_iris
                
                    # Load dataset
                    iris = load_iris()
                    X, y = iris.data, iris.target
                
                    # Train Random Forest with bootstrapping
                    rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=42)
                    rf.fit(X, y)
                
                    # Print feature importances
                    print("Feature Importances:", rf.feature_importances_)
                    
                    

Key Takeaways

  • Bootstrapping creates multiple training datasets by sampling with replacement.
  • Each tree is trained on a different bootstrap sample.
  • The final prediction is obtained by majority voting (classification) or averaging (regression).
  • Bootstrapping ensures that variance is reduced and model stability is improved.

Why is Bagging Better Than Boosting?

                    Bagging is the simplest way of combining predictions: identically-built models are trained independently
                    and in parallel and weighted equally, while Boosting combines models built sequentially, each weighted
                    by its performance on the training data.
                    
                    - Bagging aims to decrease variance, not bias.
                    - Boosting aims to decrease bias, not variance.
                    

1. Difference in Approach

  • Bagging (Bootstrap Aggregating) trains multiple models independently in parallel on different bootstrap samples and averages their predictions.
  • Boosting trains models sequentially, with each new model correcting the errors of the previous model.

2. Stability and Overfitting

  • Bagging is less prone to overfitting because it reduces variance by averaging multiple models.
  • Boosting can overfit more easily since it aggressively corrects mistakes in training.

3. Computation and Interpretability

  • Bagging is computationally efficient and easy to parallelize.
  • Boosting is harder to parallelize since it builds models sequentially.

4. When to Use Bagging vs. Boosting

  • Use Bagging when you have a high variance model like Decision Trees to improve stability.
  • Use Boosting when you have a high bias model and need better accuracy but can handle the risk of overfitting.

5. Example: Implementing Bagging in Python

                    
                    from sklearn.ensemble import BaggingClassifier
                    from sklearn.tree import DecisionTreeClassifier
                    from sklearn.datasets import load_iris
                    from sklearn.model_selection import train_test_split
                
                    # Load dataset
                    iris = load_iris()
                    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
                
                    # Train Bagging Classifier
                    # Note: `estimator` replaced `base_estimator` in scikit-learn >= 1.2
                    bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
                    bagging_clf.fit(X_train, y_train)
                
                    # Evaluate the model
                    print("Bagging Classifier Accuracy:", bagging_clf.score(X_test, y_test))
                    
                    

Key Takeaways

  • Bagging reduces variance and stabilizes predictions.
  • Boosting reduces bias but may lead to overfitting.
  • Bagging is simpler, more robust, and easier to parallelize.

Boosting vs. Bagging: Overfitting in Different Data Conditions

Understanding Bias and Variance

                    - Bagging decreases variance.
                    - Boosting decreases bias.
                    - Underfitting occurs when a model has high bias and low variance.
                    - Overfitting occurs when a model has low bias and high variance.
                    

1. When the Number of Data Points is Huge

  • Bagging performs well because it stabilizes predictions and reduces variance.
  • Boosting may still work effectively, but with enough data, bias is naturally reduced, making Boosting less necessary.
  • Overfitting risk is lower in both methods due to the large dataset.

2. When the Number of Data Points is Low

  • Boosting is more prone to overfitting because it aggressively corrects mistakes, even when they are due to noise.
  • Bagging helps by reducing variance, but if the dataset is too small, the models may still suffer from high variance.
  • Overfitting risk is higher in Boosting than in Bagging.

3. Key Takeaways

                    - If data is large → Both Bagging and Boosting work, but Boosting is not as necessary.
                    - If data is small → Boosting is more prone to overfitting, while Bagging is safer.
                    - In general, Boosting overfits more than Bagging.
                    

4. Example: Overfitting in Boosting

                    
                    from sklearn.ensemble import AdaBoostClassifier
                    from sklearn.tree import DecisionTreeClassifier
                    from sklearn.datasets import make_classification
                    from sklearn.model_selection import train_test_split
                
                    # Create a small dataset
                    X, y = make_classification(n_samples=50, n_features=10, random_state=42)
                    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
                
                    # Train Boosting Classifier
                    # Note: `estimator` replaced `base_estimator` in scikit-learn >= 1.2
                    boosting_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=50, random_state=42)
                    boosting_clf.fit(X_train, y_train)
                
                    # Evaluate the model
                    print("Boosting Classifier Accuracy on Training:", boosting_clf.score(X_train, y_train))
                    print("Boosting Classifier Accuracy on Test:", boosting_clf.score(X_test, y_test))
                    
                    

This example demonstrates how Boosting can overfit a small dataset, leading to much higher training accuracy than test accuracy.

Parallelization in Bagging vs. Computational Cost in Boosting

1. Parallelization in Bagging

                    - One of the big advantages of bagging is that it can be parallelized.
                    - Different models are fitted independently from each other.
                    - Intensive parallelization techniques can be used if required.
                    - This makes bagging computationally efficient and scalable.
                    

2. Computational Cost in Boosting

                    - Boosting uses a sequential modelling technique.
                    - A new model is trained at each step, each one adjusting to the errors of the previous models.
                    - Because each model depends on the previous one, training cannot be easily parallelized.
                    - As a result, boosting typically requires more training time and computational resources than bagging.
                    

3. Key Takeaways

                    - Bagging is highly parallelizable and computationally efficient.
                    - Boosting is sequential and requires more time and computational resources.
                    - When scalability is needed, bagging is often the better choice.
                    

4. Example: Parallelization in Bagging

                    
                    from sklearn.ensemble import BaggingClassifier
                    from sklearn.tree import DecisionTreeClassifier
                    from sklearn.datasets import make_classification
                    from sklearn.model_selection import train_test_split
                
                    # Create a dataset
                    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
                    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
                
                    # Train Bagging Classifier with parallel processing (n_jobs=-1 uses all CPU cores)
                    # Note: `estimator` replaced `base_estimator` in scikit-learn >= 1.2
                    bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, n_jobs=-1, random_state=42)
                    bagging_clf.fit(X_train, y_train)
                
                    # Evaluate the model
                    print("Bagging Classifier Accuracy:", bagging_clf.score(X_test, y_test))
                    
                    

The above code demonstrates how bagging leverages parallel computing by setting n_jobs=-1, utilizing all CPU cores for training.