Definition: Accuracy is the ratio of correctly predicted instances to the total number of instances.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
A model classifies 100 patients as having or not having a disease, with TP = 50, TN = 30, FP = 10, and FN = 10.
Accuracy = (50 + 30) / (50 + 30 + 10 + 10) = 80 / 100 = 80%
Limitation: In imbalanced datasets (e.g., 95% class A, 5% class B), a model predicting only class A will have high accuracy but fail for class B.
Definition: Precision measures how many of the predicted positive cases are actually correct.
Formula: Precision = TP / (TP + FP)
In a spam detection model with TP = 40 and FP = 20:
Precision = 40 / (40 + 20) = 0.67 (67%)
Use case: When false positives must be minimized, such as spam detection.
Definition: Recall measures how many actual positive cases were correctly predicted.
Formula: Recall = TP / (TP + FN)
In a cancer detection model with TP = 45 and FN = 5:
Recall = 45 / (45 + 5) = 0.90 (90%)
Use case: Important in medical diagnosis where missing a positive case can be critical.
Definition: The F1-score is the harmonic mean of precision and recall.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
For a model with Precision = 0.67 and Recall = 0.90:
F1 = 2 × (0.67 × 0.90) / (0.67 + 0.90) ≈ 0.77 (77%)
Use case: Useful for imbalanced datasets where both false positives and false negatives are important.
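These four formulas are easy to verify in code. Below is a minimal Python sketch that recomputes the worked examples above; the helper functions are illustrative, and the counts are the ones used in the disease, spam, and cancer examples.

```python
# Plain-Python helpers for the four metrics defined above.

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

print(accuracy(50, 30, 10, 10))  # 0.80 (disease example)
print(precision(40, 20))         # ~0.67 (spam example)
print(recall(45, 5))             # 0.90 (cancer example)
print(f1(0.67, 0.90))            # ~0.77
```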
Definition: The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
Formula: ROC-AUC measures the area under this curve, with values closer to 1 being better.
Consider two models evaluated on the same data: if Model A has a higher ROC-AUC than Model B, then Model A is better at distinguishing between positive and negative cases.
Use case: Useful for evaluating classification models with probability scores.
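A minimal sketch, assuming scikit-learn, with invented labels and probability scores:

```python
# ROC-AUC from predicted probabilities; toy values for illustration.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]              # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]  # predicted probabilities for class 1

print(roc_auc_score(y_true, y_score))  # ~0.83; closer to 1.0 is better
```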
Definition: Log Loss measures the difference between predicted probabilities and actual labels.
Formula: Log Loss = - (1/N) Σ (y log(p) + (1-y) log(1-p))
If a model predicts probabilities for five samples that are close to the true labels (e.g., around 0.9 for actual positives and 0.1 for actual negatives), the log loss will be low, indicating good predictions.
Use case: Commonly used in probabilistic models like logistic regression.
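A sketch computing the formula by hand for five hypothetical samples and cross-checking against scikit-learn:

```python
# Log loss from the formula above, compared with scikit-learn's version.
import math
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]       # actual labels for five samples
p = [0.9, 0.1, 0.8, 0.7, 0.2]  # predicted probabilities of class 1

manual = -sum(y * math.log(pi) + (1 - y) * math.log(1 - pi)
              for y, pi in zip(y_true, p)) / len(p)
print(manual)               # ~0.20: low, i.e. good predictions
print(log_loss(y_true, p))  # same value
```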
| Metric | Formula | Example Use Case | Limitations |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | General classification | Misleading for imbalanced datasets |
| Precision | TP / (TP + FP) | Spam filtering | Ignores false negatives |
| Recall | TP / (TP + FN) | Medical diagnosis | Ignores false positives |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Imbalanced datasets | Harder to interpret than accuracy |
| ROC-AUC | Area under the ROC curve | Evaluating probabilistic classifiers | Needs extension (e.g., one-vs-rest averaging) for multi-class problems |
| Log Loss | - (1/N) Σ (y log(p) + (1-y) log(1-p)) | Probabilistic classification | Hard to interpret directly |
Accuracy is not always the best metric for evaluating classification models, especially when dealing with imbalanced datasets. In such cases, other metrics like Precision, Recall, and F1-Score provide a better assessment of model performance.
Consider a dataset of flights labeled Landed Safely (1) or Crashed (0), in which 90% of the flights landed safely.
If a model predicts every flight as "Landed Safely (1)", it would still achieve 90% accuracy, even though it completely fails to detect any crashes. This makes accuracy a misleading metric in such cases.
For highly imbalanced datasets, accuracy is often misleading. Instead, metrics like Recall, Precision, and F1-Score provide a clearer picture of model performance, especially in cases where the minority class is critical.
The inputs required to calculate the average F1 Score are:
Precision measures how many of the predicted positive cases are actually correct.
Formula:
Precision = TP / (TP + FP)
Recall measures how many actual positive cases were correctly predicted.
Formula:
Recall = TP / (TP + FN)
The F1 Score is the harmonic mean of Precision and Recall.
Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Macro F1 Score is the unweighted mean of the F1 Scores of all classes.
Formula:
Macro F1 = (F1₁ + F1₂ + ... + F1ₙ) / n
Use case: Suitable when all classes should be treated equally.
Weighted F1 Score considers the number of instances in each class.
Formula:
Weighted F1 = Σ (Samples in class × F1) / Total Samples
Use case: Useful when classes are imbalanced.
Micro F1 Score computes global TP, FP, and FN across all classes before calculating F1.
Formula:
Micro F1 = 2 × (Σ TP) / (2 × Σ TP + Σ FP + Σ FN)
Use case: Preferred when class distribution is imbalanced and you want to evaluate overall performance.
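As a sketch of how these averages are computed in practice, scikit-learn's f1_score exposes all three modes through its average parameter; the toy labels below are invented for illustration.

```python
# The three F1 averaging modes via scikit-learn.
from sklearn.metrics import f1_score

y_true = ["cat", "cat", "dog", "dog", "rabbit", "rabbit"]
y_pred = ["cat", "dog", "dog", "dog", "rabbit", "cat"]

print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average="weighted"))  # weighted by class support
print(f1_score(y_true, y_pred, average="micro"))     # from global TP/FP/FN
```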
Consider a classification model that predicts three classes: Cat, Dog, Rabbit.
| Class | TP | FP | FN |
|---|---|---|---|
| Cat | 50 | 10 | 5 |
| Dog | 40 | 20 | 15 |
| Rabbit | 30 | 5 | 10 |
First, Precision and Recall per class:
For Cat: Precision = 50 / (50 + 10) = 0.833; Recall = 50 / (50 + 5) = 0.909
For Dog: Precision = 40 / (40 + 20) = 0.667; Recall = 40 / (40 + 15) = 0.727
For Rabbit: Precision = 30 / (30 + 5) = 0.857; Recall = 30 / (30 + 10) = 0.750
Then, F1 per class:
For Cat:
F1 = 2 × (0.833 × 0.909) / (0.833 + 0.909) = 0.869
For Dog:
F1 = 2 × (0.667 × 0.727) / (0.667 + 0.727) = 0.696
For Rabbit:
F1 = 2 × (0.857 × 0.750) / (0.857 + 0.750) = 0.800
Macro F1 = (0.869 + 0.696 + 0.800) / 3 = 0.788
Weighted F1 = (55 × 0.869 + 55 × 0.696 + 40 × 0.800) / (55 + 55 + 40) ≈ 0.787 (class supports are TP + FN: 55, 55, 40)
Micro F1 = 2 × (Σ TP) / (2 × Σ TP + Σ FP + Σ FN)
= 2 × (50 + 40 + 30) / (2 × (50 + 40 + 30) + (10 + 20 + 5) + (5 + 15 + 10))
= 2 × 120 / (2 × 120 + 35 + 30) = 240 / 305 ≈ 0.787
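A minimal sketch reproducing the three averages directly from the TP/FP/FN table above, with no libraries assumed:

```python
# Recompute Macro, Weighted, and Micro F1 from the per-class counts.
counts = {"Cat": (50, 10, 5), "Dog": (40, 20, 15), "Rabbit": (30, 5, 10)}  # (TP, FP, FN)

f1s, supports = {}, {}
for cls, (tp, fp, fn) in counts.items():
    p, r = tp / (tp + fp), tp / (tp + fn)
    f1s[cls] = 2 * p * r / (p + r)
    supports[cls] = tp + fn  # actual samples in the class

macro = sum(f1s.values()) / len(f1s)
weighted = sum(supports[c] * f1s[c] for c in f1s) / sum(supports.values())
tp_sum = sum(c[0] for c in counts.values())
fp_sum = sum(c[1] for c in counts.values())
fn_sum = sum(c[2] for c in counts.values())
micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)

print(round(macro, 3), round(weighted, 3), round(micro, 3))  # 0.788 0.787 0.787
```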
The importance of recall versus precision depends on the specific problem and the consequences of false positives versus false negatives.
Recall is crucial when missing a positive instance (false negative) is more costly than incorrectly flagging some negative instances (false positives).
Precision is critical when false positives (incorrectly classifying negatives as positives) are more problematic than missing some positive instances.
In some scenarios, both false positives and false negatives have significant consequences, making it essential to find a balance between recall and precision.
Choosing between recall and precision depends on the problem at hand. In life-critical systems (e.g., medical diagnoses, disaster warnings), recall is prioritized. In systems where incorrect classifications cause harm (e.g., spam filters, hiring decisions), precision is more important. When both false positives and false negatives matter, a balance between the two is required using metrics like the F1 Score.
Precision is prioritized when false positives are more costly or dangerous than false negatives.
Choose precision over recall when the cost of a false positive is higher than a false negative.
Recall is prioritized when false negatives are more costly or dangerous than false positives.
Choose recall over precision when missing a positive case is riskier than a false alarm.
Cross-validation (CV) is a technique used in machine learning and statistics to evaluate the performance of a model on unseen data. Instead of using a single train-test split, cross-validation splits the dataset multiple times to ensure the model generalizes well.
Cross-validation is an essential technique for evaluating machine learning models, ensuring they generalize well to new data. It prevents overfitting, improves reliability, and helps in model selection. Choosing the right type of cross-validation depends on the dataset and the problem at hand.
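A minimal sketch of k-fold cross-validation with scikit-learn, using the built-in Iris dataset and logistic regression purely as stand-ins:

```python
# 5-fold cross-validation: 5 train/validation splits, one score per fold.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # average generalization estimate
```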
Not all classification algorithms support multi-class classification. Some algorithms, like the Perceptron, Logistic Regression, and Support Vector Machines (SVMs), are designed for binary classification.
To use these binary classifiers for multi-class problems, we split the dataset into multiple binary classification problems. Two common approaches are:
One-vs-Rest (OvR) is a method where the multi-class dataset is split into multiple binary classification problems. Each classifier is trained to distinguish one class from all the others. The final prediction is made by the classifier that is most confident.
Consider a dataset with three classes: red, blue, and green. The OvR approach creates the following binary classification problems: red vs. [blue, green], blue vs. [red, green], and green vs. [red, blue].
One-vs-One (OvO) is a method where the dataset is split into multiple binary classification problems, but instead of comparing one class against all others, it compares every pair of classes individually. The final prediction is made using a voting system among all classifiers.
Consider a dataset with four classes: red, blue, green, and yellow. The OvO approach creates the following binary classification problems: red vs. blue, red vs. green, red vs. yellow, blue vs. green, blue vs. yellow, and green vs. yellow (K(K-1)/2 = 6 classifiers for K = 4).
| Feature | One-Vs-Rest (OvR) | One-Vs-One (OvO) |
|---|---|---|
| Number of Classifiers | K | K(K-1)/2 |
| Training Speed | Faster | Slower (more classifiers) |
| Inference Speed | Faster | Slower |
| Best for Large Datasets? | Yes | No |
| Best for Algorithms like SVM? | No | Yes |
| Accuracy | Good, but may struggle with close decision boundaries. | Higher, since each classifier focuses on two specific classes. |
Both One-Vs-Rest and One-Vs-One are useful techniques for adapting binary classifiers to multi-class problems.
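A sketch of both strategies using scikit-learn's meta-estimators around a binary linear SVM; the Iris dataset (K = 3) is used for illustration, where OvR and OvO happen to need the same number of classifiers.

```python
# Wrapping a binary classifier for multi-class problems.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 3 classes

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # K classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # K(K-1)/2 classifiers

print(len(ovr.estimators_), len(ovo.estimators_))  # 3 3 (they coincide for K = 3)
```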
Average Precision (AP) measures the area under the Precision-Recall (PR) curve. It evaluates how well a classification or object detection model balances precision and recall.
Mean Average Precision (mAP) is the average of AP scores across all categories in a dataset. It is commonly used in object detection and information retrieval tasks.
Mathematically, mAP is calculated as:
mAP = (AP₁ + AP₂ + ... + APₙ) / N
where APᵢ is the Average Precision of the i-th class (or query) and N is the total number of classes (or queries).
mAP is widely used in object detection, information retrieval, and recommendation engines.
mAP provides a robust evaluation metric for tasks where ranking and precision-recall trade-offs matter. It is a key metric in object detection, retrieval systems, and recommendation engines.
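A sketch, assuming scikit-learn: compute AP per category with average_precision_score (the area under the PR curve) and average the results to get mAP. The labels and scores below are invented.

```python
# AP per class and mAP as their mean, in one-vs-rest form.
import numpy as np
from sklearn.metrics import average_precision_score

# Binary relevance labels and predicted scores for N = 2 categories.
y_true = np.array([[1, 0], [0, 1], [1, 0], [1, 1], [0, 1]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.7, 0.1], [0.6, 0.9], [0.2, 0.4]])

aps = [average_precision_score(y_true[:, k], y_score[:, k]) for k in range(2)]
print(aps)                  # AP per category
print(sum(aps) / len(aps))  # mAP = mean of the APs
```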
When dealing with a dataset where the number of positive samples (minority class) is much lower than the number of negative samples (majority class), traditional metrics like accuracy become unreliable.
If the dataset is highly imbalanced, precision is a more suitable metric because it depends only on the predicted positives and is unaffected by the large pool of true negatives:
Precision = TP / (TP + FP)
False Positive Rate (FPR) is defined as:
FPR = FP / (FP + TN)
When the number of true negatives (TN) is very large, FPR remains low even if the model makes many false positives, making it a less reliable metric.
Besides precision, other useful metrics include recall, the F1-Score, and the area under the Precision-Recall curve (PR-AUC).
For datasets with a large number of negative samples, precision is a better metric than FPR because it focuses on the correct identification of positive cases without being affected by the abundance of negative samples.
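A tiny numeric sketch of this effect, with made-up counts: even when half the alarms are false, a large pool of true negatives keeps FPR near zero, while precision correctly reports the problem.

```python
# FPR is insensitive to false positives when TN dominates.
tp, fp, tn = 50, 50, 100_000

precision = tp / (tp + fp)  # 0.5     -> half the alarms are wrong
fpr = fp / (fp + tn)        # ~0.0005 -> looks deceptively good
print(precision, fpr)
```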
Dataset: 9 positive samples, 1 negative sample.
Model Prediction: Predicts all samples as positive.
Here TP = 9, FP = 1, FN = 0, TN = 0, so Precision = 0.90 and Recall = 1.0, but FPR = 1 / 1 = 1.0. Since FPR is very high, the model is not reliable despite high precision and recall. ROC is a better metric here.
Dataset: 9 negative samples, 1 positive sample.
Model Prediction: Predicts all as negative.
Here TP = 0 and FN = 1, so Recall = 0 and Precision is undefined, even though accuracy is 90%. This model fails entirely at detecting the positive class.
Dataset: 8 positive samples, 2 negative samples.
Model Prediction: Predicts 9 as positive, 1 as negative.
Here TP = 8, FP = 1, TN = 1, FN = 0, so Precision = 8/9 ≈ 0.89 and Recall = 1.0, but FPR is high (1 / 2 = 0.5), showing poor performance on the negative class. ROC is better in this case.
Dataset: 8 negative samples, 2 positive samples.
Model Prediction: Predicts 1 as positive, rest as negative.
Here TP = 1, FP = 0, FN = 1, so Precision = 1.0 but Recall is low (0.5): the model misses half the positive cases despite good precision.
Dataset: 9 positive samples, 1 negative sample.
Model Prediction: Predicts 7 as positive, 3 as negative.
Here TP = 7, FP = 0, FN = 2, TN = 1, so Precision = 1.0, Recall = 7/9 ≈ 0.78, and FPR = 0. Both metrics indicate strong performance.
Dataset: 9 negative samples, 1 positive sample.
Model Prediction: Predicts 3 as positive (1 correct), 7 as negative.
Here TP = 1, FP = 2, TN = 7, FN = 0, so FPR is low (2 / 9 ≈ 0.22), but the poor detection is reflected in low precision (1 / 3 ≈ 0.33).
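A sketch that recomputes the metrics for all six scenarios; the (TP, FP, TN, FN) tuples below are derived from the stated datasets and predictions.

```python
# Precision, recall, and FPR for each scenario above.
scenarios = {
    "1: all positive (9 pos / 1 neg)":  (9, 1, 0, 0),
    "2: all negative (9 neg / 1 pos)":  (0, 0, 9, 1),
    "3: 9 predicted pos (8 pos/2 neg)": (8, 1, 1, 0),
    "4: 1 predicted pos (8 neg/2 pos)": (1, 0, 8, 1),
    "5: 7 predicted pos (9 pos/1 neg)": (7, 0, 1, 2),
    "6: 3 predicted pos (9 neg/1 pos)": (1, 2, 7, 0),
}

for name, (tp, fp, tn, fn) in scenarios.items():
    p = tp / (tp + fp) if tp + fp else float("nan")    # undefined with no positive predictions
    r = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    print(f"{name}: precision={p:.2f} recall={r:.2f} FPR={fpr:.2f}")
```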
Regularization is a technique used to prevent overfitting in machine learning models. Overfitting occurs when a model learns the training data too well, including noise and outliers, and performs poorly on unseen data. Regularization introduces additional constraints or penalties to the model's learning process to ensure it generalizes better to new data.
Regularization is a powerful tool to prevent overfitting and improve the generalization of machine learning models. By adding a penalty to the loss function, it balances the trade-off between bias and variance, ensuring the model performs well on both training and unseen data. The choice of regularization technique (L1, L2, Elastic Net, etc.) depends on the specific problem and dataset.
To understand how a model's predictions deviate from the truth, two quantities are used: bias and variance. Bias measures the systematic deviation (error) of the model's predictions from the real value of the function, while variance measures how much the model's predictions fluctuate when it is estimated over different training samples of the dataset.
Therefore, for a model that generalizes well, we must keep bias as low as possible to achieve high accuracy. Additionally, the model should not produce greatly varied results across training samples, so low variance is also recommended for good performance.
The relationship between bias and variance is closely related to overfitting, underfitting, and model capacity. When calculating the generalization error (where bias and variance are crucial elements), an increase in model capacity can lead to an increase in variance and a decrease in bias.
The trade-off is the tension between the error introduced by bias and the error introduced by variance, usually plotted as a function of model capacity.
From such a plot, it can be observed that bias decreases while variance increases as model capacity grows, and the total generalization error is minimized at an intermediate capacity.
Below that optimum the model underfits; above it, the model overfits; the exact fit lies between the two.
Overfitting occurs when a model has low bias and high variance, fitting the training data too well but failing to generalize to new data. This often happens when the model considers too many features, including insignificant ones.
Underfitting occurs when a model has high bias and low variance, failing to capture the underlying patterns in the data.
Regularization is a technique used to prevent overfitting by penalizing complex models. It achieves this by adding a regularization term to the loss function, which shrinks the model's coefficients towards zero. This reduces the impact of insignificant features and stabilizes the model.
Regularization adds a penalty term to the cost function to penalize complex models. This reduces the weights of the model, making it simpler and less prone to overfitting.
L1 regularization is preferred when dealing with high-dimensional data, as it provides sparse solutions by shrinking some coefficients to zero. The regression model using L1 regularization is called Lasso Regression.
The loss function with L1 regularization is:
\[ \text{Loss} = \text{Error}(Y, \hat{Y}) + \lambda \sum_{i=1}^n |w_i| \]
Where \( \lambda \) is the regularization parameter.
L2 regularization is used to handle multicollinearity by shrinking all coefficients proportionally. The regression model using L2 regularization is called Ridge Regression.
The loss function with L2 regularization is:
\[ \text{Loss} = \text{Error}(Y, \hat{Y}) + \lambda \sum_{i=1}^n w_i^2 \]
Key differences between L1 and L2 regularization: L1 can shrink some coefficients exactly to zero, effectively performing feature selection and yielding sparse models, while L2 shrinks all coefficients smoothly toward zero without eliminating any, which makes it more stable in the presence of correlated features.
Regularization is a powerful technique to prevent overfitting by penalizing complex models. L1 regularization is useful for feature selection, while L2 regularization is better for handling multicollinearity. The choice between L1 and L2 depends on the specific problem and dataset.
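A sketch contrasting the two on synthetic data, assuming scikit-learn; only the first two features truly matter, so Lasso should zero out most of the rest. The alpha parameter plays the role of the lambda in the formulas above.

```python
# Lasso (L1) vs Ridge (L2) on the same synthetic regression problem.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # 2 useful features

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.round(lasso.coef_, 3))  # most irrelevant coefficients driven to exactly 0
print(np.round(ridge.coef_, 3))  # all coefficients small but non-zero
```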
L1 Regularization (Lasso):
The loss function with L1 regularization is:
\[ \mathcal{L}_1 = \text{Logistic Loss} + \lambda \sum_{i=1}^d |W_i| \]
The gradient of the L1 penalty is:
\[ \frac{\partial \mathcal{L}_1}{\partial W_i} = \begin{cases} +\lambda & \text{if } W_i > 0 \\ -\lambda & \text{if } W_i < 0 \end{cases} \]
(At \( W_i = 0 \) the penalty is not differentiable; any subgradient in \([-\lambda, \lambda]\) applies, which is what allows weights to settle at exactly zero.)
L2 Regularization (Ridge):
The loss function with L2 regularization is:
\[ \mathcal{L}_2 = \text{Logistic Loss} + \lambda \sum_{i=1}^d W_i^2 \]
The gradient of the L2 penalty is:
\[ \frac{\partial \mathcal{L}_2}{\partial W_i} = 2\lambda W_i \]
L1 Regularization: every update moves each weight toward zero by a constant amount (η × λ), so small weights are driven all the way to exactly zero, producing sparse solutions.
L2 Regularization: every update shrinks each weight by an amount proportional to its current value (2 × η × λ × Wᵢ), so weights decay smoothly but never reach exactly zero.
L1 Constraint (Diamond Shape): the feasible region Σ|Wᵢ| ≤ t has corners on the coordinate axes, and the loss contours typically touch the region at a corner, where some weights are exactly zero.
L2 Constraint (Sphere Shape): the feasible region ΣWᵢ² ≤ t is a sphere with no corners, so the contact point almost never lies on an axis and the weights remain non-zero.
Assume \(W_1 = 0.1\), \(\lambda = 0.1\), and \(\eta = 0.01\):
L1 Regularization: the penalty-only update is W₁ ← W₁ − η × λ × sign(W₁) = 0.1 − 0.01 × 0.1 = 0.099; the step (0.001) stays the same however small W₁ gets, so W₁ eventually reaches zero.
L2 Regularization: the penalty-only update is W₁ ← W₁ − η × 2λW₁ = 0.1 − 0.01 × 2 × 0.1 × 0.1 = 0.0998; the step shrinks along with W₁, so W₁ approaches zero but never reaches it.
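The same arithmetic as a few lines of Python, using the assumed values above:

```python
# One penalty-only gradient step for W1 = 0.1, lambda = 0.1, eta = 0.01.
w, lam, eta = 0.1, 0.1, 0.01

w_l1 = w - eta * lam * (1 if w > 0 else -1)  # L1: subtract a constant amount
w_l2 = w - eta * 2 * lam * w                 # L2: subtract an amount proportional to w

print(round(w_l1, 4))  # 0.099  (moves toward zero by eta*lambda regardless of size)
print(round(w_l2, 4))  # 0.0998 (shrinks proportionally; never exactly zero)
```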
| Aspect | L1 Regularization | L2 Regularization |
|---|---|---|
| Gradient | Constant (\(\pm \lambda\)) | Proportional to \(W_i\) (\(2\lambda W_i\)) |
| Sparsity | Yes (weights reach exactly zero) | No (weights remain non-zero) |
| Use Case | Feature selection, high-dimensional data | Handling multicollinearity |
L1 regularization creates sparsity because its penalty gradient has a constant magnitude (\(\lambda\)) regardless of the weight's size, so even very small weights keep being pushed toward zero until they reach exactly zero.
L2 regularization, in contrast, shrinks weights smoothly but never achieves exact sparsity.
Choose a model that: (1) captures the true underlying patterns in the training data (low bias), and (2) produces stable predictions across different training samples (low variance).
Key Challenge: Balancing these goals is often contradictory.
As model complexity increases, bias decreases but variance increases. The optimal balance minimizes total error.
Let \( Y \) be the target variable and \( X \) be the predictor variable, related by
\[ Y = f(X) + e \]
where \( e \) is random noise with mean zero and variance \( \sigma_e^2 \).
The expected squared error at a point \( x \) is:
\[ \text{Err}(x) = E\left[ (Y - \hat{f}(x))^2 \right] \]
This error can be decomposed into three components:
\[ \text{Err}(x) = \underbrace{\left( E[\hat{f}(x)] - f(x) \right)^2}_{\text{Bias}^2} + \underbrace{E\left[ (\hat{f}(x) - E[\hat{f}(x)])^2 \right]}_{\text{Variance}} + \underbrace{\sigma_e^2}_{\text{Irreducible Error}} \]
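A small simulation sketch of this decomposition: fit polynomials of increasing degree to many independently drawn training sets and estimate bias² and variance of the prediction at a single point. The true function, noise level, and degrees are arbitrary choices made for illustration.

```python
# Empirical bias^2 and variance as model capacity (polynomial degree) grows.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)  # true function f(X)
x0 = 0.5                     # evaluation point x

for degree in (1, 3, 9):
    preds = []
    for _ in range(200):                              # 200 independent training sets
        x = rng.uniform(0, 2, 30)
        y = f(x) + rng.normal(scale=0.3, size=30)     # Y = f(X) + e
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2               # (E[f_hat(x)] - f(x))^2
    var = preds.var()                                 # E[(f_hat(x) - E[f_hat(x)])^2]
    print(f"degree {degree}: bias^2={bias2:.4f} variance={var:.4f}")
```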
One Hot Encoding converts categorical text data into numerical vectors. For a vocabulary of size \( N \), each word is represented as an \( N \)-dimensional vector in which the position corresponding to that word is 1 and every other position is 0.
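A minimal sketch with a hypothetical four-word vocabulary:

```python
# One-hot vectors for a vocabulary of size N = 4.
vocab = ["he", "is", "walter", "white"]

def one_hot(word):
    vec = [0] * len(vocab)      # N-dimensional zero vector
    vec[vocab.index(word)] = 1  # 1 at the word's own index
    return vec

print(one_hot("walter"))  # [0, 0, 1, 0]
```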
TF-IDF quantifies the importance of a word in a document relative to a corpus. It combines:
\[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total terms in } d} \] Example: for the document "He is Walter", each term appears once among 3 terms, so TF = 1/3 ≈ 0.33 for each of "He", "is", and "Walter".
\[ \text{IDF}(t) = \log\left(\frac{\text{Total documents}}{\text{Documents containing } t}\right) \] Example (base-10 log): if the corpus contains 3 documents and a term appears in only 1 of them, IDF = log(3/1) ≈ 0.48.
\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \] Example with smoothing: a common variant adds 1 inside the IDF ratio (or to the final IDF value) so that a term appearing in every document does not end up with a weight of exactly zero.
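A sketch computing TF and IDF by hand over a hypothetical three-document corpus (base-10 log, no smoothing):

```python
# TF-IDF from the definitions above, on a toy corpus.
import math

docs = [["he", "is", "walter"], ["he", "is", "white"], ["walter", "white"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    df = sum(term in doc for doc in docs)  # documents containing the term
    return math.log10(len(docs) / df)

print(tf("walter", docs[0]))                  # 1/3 ~ 0.33
print(idf("walter"))                          # log10(3/2) ~ 0.18
print(tf("walter", docs[0]) * idf("walter"))  # TF-IDF ~ 0.06
```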
| Technique | Use Case | Limitations |
|---|---|---|
| One Hot Encoding | Simple categorical data | High dimensionality for large vocabularies |
| TF-IDF | Text classification, information retrieval | Does not capture semantic meaning |
Word2Vec is a neural network-based method for generating dense vector representations of words. Unlike sparse methods like One Hot Encoding, Word2Vec captures semantic and syntactic relationships between words by mapping them to vectors in a continuous vector space. Words with similar meanings or contexts are positioned closer together in this space.
The objective is to maximize the log-likelihood of observing context words given a target word (Skip-Gram) or vice versa (CBOW). For Skip-Gram:
\[ \text{Maximize } \frac{1}{T} \sum_{t=1}^T \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} | w_t) \]
Where: \( T \) is the number of words in the training corpus, \( c \) is the context window size, \( w_t \) is the target word, and \( p(w_{t+j} \mid w_t) \) is the probability of observing context word \( w_{t+j} \) given \( w_t \).
| Advantages | Limitations |
|---|---|
| Captures semantic relationships | Fails to handle polysemy (e.g., "bank" as river vs. financial) |
| Low-dimensional embeddings | Fixed context window size |
| Works well with small datasets | Cannot handle out-of-vocabulary words |
Word2Vec revolutionized NLP by enabling machines to understand word semantics through vector arithmetic. While newer models like BERT and GPT have emerged, Word2Vec remains foundational for tasks requiring lightweight, interpretable word embeddings.
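A minimal training sketch, assuming gensim 4.x is installed; the three-sentence corpus is far too small to produce meaningful vectors and serves only to show the API shape.

```python
# Training a tiny Skip-Gram Word2Vec model with gensim.
from gensim.models import Word2Vec

sentences = [["king", "rules", "kingdom"],
             ["queen", "rules", "kingdom"],
             ["dog", "chases", "cat"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-Gram

print(model.wv["king"].shape)                 # (50,) dense embedding
print(model.wv.most_similar("king", topn=2))  # nearest words in the vector space
```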
| Word Embedding Technique | Main Characteristics | Use Cases |
|---|---|---|
| TF-IDF | Sparse, frequency-based vectors; captures term importance but no semantic meaning | Text classification, information retrieval |
| Word2Vec | Dense vectors learned from local context windows; captures semantic and syntactic similarity | Word similarity, lightweight NLP pipelines |
| GloVe | Dense vectors learned from global word co-occurrence statistics | Pre-trained embeddings for general NLP tasks |
| BERT | Contextual embeddings from a deep transformer; the same word gets different vectors in different contexts, which handles polysemy | Question answering, sentiment analysis, and other downstream NLP tasks |