Logistic regression is a statistical method used to predict a binary outcome, such as "yes" or "no," based on prior observations in a dataset. It models the probability of a certain class or event existing based on independent variables.
A logistic regression model predicts a dependent variable by analyzing the relationship between it and one or more independent variables.
Logistic regression can be interpreted in terms of geometry, probability, and loss function:
The sigmoid function is a mathematical function that can take any real value and map it to a value between 0 and 1, forming an "S"-shaped curve.
The sigmoid function, also called the logistic function, is defined as:
σ(z) = 1 / (1 + exp(-z))
In logistic regression, a naive optimization problem would be to maximize the sum of the signed distances of the points from the separating hyperplane. However, this approach is very sensitive to outliers: a single point far from the hyperplane can dominate the sum. To mitigate this, we want small signed distances to stay roughly as they are, while large signed distances are scaled down.
To achieve this, we apply the sigmoid function, which squashes the entire real line of signed distances into the limited range (0, 1). This process of compressing values into a fixed range is called squashing.
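As a rough illustration (a minimal Python sketch with made-up signed distances), the sigmoid maps both moderate and extreme signed distances into (0, 1), so a single outlier can no longer dominate the sum:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical signed distances; the last one is an extreme outlier.
signed_distances = np.array([0.5, 1.0, -0.5, 2.0, 100.0])

print(signed_distances.sum())           # dominated by the outlier (103.0)
print(sigmoid(signed_distances).sum())  # each term squashed into (0, 1)
```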
In any classification problem, our goal is to maximize the number of correctly classified points and minimize the number of misclassified points.
For a correctly classified point: y_i W^T x_i > 0
For a misclassified point: y_i W^T x_i < 0
Thus, the optimization problem is to find the W that maximizes the sum of y_i W^T x_i:
W* = argmax ∑ y_i W^T x_i
Logistic regression starts from this same objective:
W* = argmax ∑ y_i W^T x_i
After applying the sigmoid function, the objective becomes:
W* = argmax ∑ 1 / (1 + exp(-y_i W^T x_i))
Applying a monotonically increasing function such as the logarithm does not change the argmax, so the objective becomes:
W* = argmax ∑ log( 1 / (1 + exp(-y_i W^T x_i)) )
Since log(1/a) = -log(a), maximizing this is the same as minimizing its negation:
⇒ W* = argmin ∑ log(1 + exp(-y_i W^T x_i))
Let Z_i = y_i W^T x_i. Then:
W* = argmin ∑ log(1 + exp(-Z_i)), where the sum runs over i = 1, …, n.
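As a quick sketch (assuming labels y_i in {-1, +1} and a hypothetical toy weight vector), this logistic loss can be computed directly:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-Z_i)) with Z_i = y_i * w^T x_i."""
    Z = y * (X @ w)
    return np.sum(np.log1p(np.exp(-Z)))

# Hypothetical toy data with labels in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5]])
y = np.array([1, 1, -1])
w = np.array([0.5, 0.5])

print(logistic_loss(w, X, y))
```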
Each term log(1 + exp(-Z_i)) approaches its minimum value of 0 as Z_i → +∞.
So if the chosen W correctly classifies every training point and the weights are scaled up so that every Z_i → +∞, that W drives the training loss to its minimum and looks like the best W for the training data.
However, this leads to overfitting, as it does not guarantee good performance on test data.
The training data may contain outliers that the model has fitted perfectly.
To prevent overfitting, we introduce regularization, modifying the equation as follows:
W* = argmin ∑ log(1 + exp(-y_i W^T x_i)) + λ W^T W
Where λ is a hyperparameter that controls the strength of regularization; it is typically chosen using cross-validation.
The full optimization problem is therefore:
W* = argmin ∑ log(1 + exp(-y_i W^T x_i)) + λ W^T W
So the optimal W (W*) is the weight vector, a d-dimensional vector with one component per feature.
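A hedged sketch of choosing the regularization strength by cross-validation with scikit-learn; note that sklearn's LogisticRegression uses C = 1/λ, so a smaller C means stronger regularization, and the data here is synthetic, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# C = 1 / lambda: small C -> strong regularization.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

print("best C (i.e. 1/lambda):", search.best_params_["C"])
```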
Geometric intuition:
The weight vector W is normal to a hyperplane that separates data points into different classes.
For logistic regression, a query point x_q is assigned to the positive class if W^T x_q > 0 and to the negative class if W^T x_q < 0; σ(W^T x_q) gives the probability of the positive class.
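A minimal sketch of this geometric decision rule, assuming a learned weight vector w and labels in {-1, +1}:

```python
import numpy as np
from scipy.special import expit  # the sigmoid function

def predict(w, x_q):
    """Classify a query point by which side of the hyperplane it falls on."""
    score = w @ x_q              # signed distance (up to ||w||)
    prob_positive = expit(score) # P(y = +1 | x_q)
    label = 1 if score > 0 else -1
    return label, prob_positive

# Hypothetical weight vector and query point.
w = np.array([1.0, -2.0])
print(predict(w, np.array([3.0, 1.0])))  # falls on the positive side
```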
Interpretation of Weight Vectors:
In logistic regression (LR), feature importance can be read off the weight vector under the assumption that the features are independent: the larger |w_j| is, the more feature j influences the prediction.
However, if the features are collinear, the weight vector cannot be interpreted as feature importance.
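A small sketch (synthetic data, standardized features, and assuming no collinearity) that ranks features by the absolute value of their learned weights:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data; feature names are hypothetical.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
X = StandardScaler().fit_transform(X)   # comparable feature scales are required

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Larger |w_j| -> feature j moves the log-odds more (assuming independence).
importance = np.abs(clf.coef_[0])
for j in np.argsort(importance)[::-1]:
    print(f"feature_{j}: weight = {clf.coef_[0][j]:+.3f}")
```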
Definition:
Multicollinearity occurs when one feature can be (approximately) expressed as a linear combination of other features, i.e., when features are highly correlated with each other.
Impact of Multicollinearity:
When features are multicollinear, the learned weights become unstable: small changes in the training data can change their magnitudes and even their signs, so they no longer reflect feature importance.
How to detect multicollinearity?
A multicollinear feature can be identified by adding a small amount of noise (perturbation) to the feature values and retraining the model: if the learned weights change significantly, the features are collinear.
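A rough sketch of this perturbation check (synthetic data, arbitrary noise scale): train once, add small Gaussian noise to the features, retrain, and compare the weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=500)  # make feature 4 collinear with feature 0

w_original = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
X_noisy = X + 0.01 * rng.normal(size=X.shape)    # small perturbation
w_perturbed = LogisticRegression(max_iter=1000).fit(X_noisy, y).coef_[0]

# Large changes in the weights suggest multicollinearity.
print(np.abs(w_perturbed - w_original))
```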
Conclusion:
Performing a multicollinearity check is essential before interpreting the weight vector as feature importance.
Solving the optimization problem using Stochastic Gradient Descent:
The logistic loss is convex and differentiable, so it can be minimized with stochastic gradient descent: repeatedly pick a training point, compute the gradient of its loss term, and move W a small step in the opposite direction, as sketched below.
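A minimal sketch of SGD for the regularized logistic loss above; labels are assumed to be in {-1, +1}, and the learning rate, λ, and epoch count are arbitrary illustrative values:

```python
import numpy as np

def sgd_logistic(X, y, lam=0.01, lr=0.1, epochs=100, seed=0):
    """Minimize sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w by SGD.
    Assumes y_i in {-1, +1} and X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            z = y[i] * (X[i] @ w)
            # d/dw log(1 + exp(-z)) = -y_i x_i * sigmoid(-z) = -y_i x_i / (1 + exp(z))
            grad = -y[i] * X[i] / (1.0 + np.exp(z)) + 2 * lam * w
            w -= lr * grad
    return w

# Hypothetical toy data with labels in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(sgd_logistic(X, y))
```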
To check for multicollinearity, you can: compute the pairwise correlation matrix of the features, compute the Variance Inflation Factor (VIF) for each feature, or apply the perturbation test described above.
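A short sketch of the first two checks, using a hypothetical feature matrix in which one column is nearly a copy of another:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical feature matrix: x2 is almost a copy of x0, so it is collinear.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=200)
df = pd.DataFrame(X, columns=["x0", "x1", "x2"])

# 1) Pairwise correlation matrix: values near +/-1 indicate collinear pairs.
print(df.corr())

# 2) Variance Inflation Factor: values above roughly 5-10 indicate multicollinearity.
for i, col in enumerate(df.columns):
    print(col, variance_inflation_factor(df.values, i))
```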
Building a better model without losing information:
Instead of simply dropping correlated features (which discards information), correlated features can be combined, for example with dimensionality-reduction techniques such as PCA, or the model can be regularized.
To assess the goodness of fit of a linear regression model, you can use several statistical measures, such as R², adjusted R², the residual standard error (or RMSE), and residual plots.
More details: ResearchGate Post
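A small sketch of computing these goodness-of-fit measures on synthetic data (the data and coefficients are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical data: y depends linearly on X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

n, d = X.shape
r2 = r2_score(y, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - d - 1)  # adjusted R^2 penalizes extra features
rmse = np.sqrt(mean_squared_error(y, y_pred))

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, RMSE = {rmse:.3f}")
```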
Yes, logistic regression can be performed in Microsoft Excel using tools like the Solver add-in (fitting the weights by maximizing the log-likelihood) or third-party statistics add-ins such as XLSTAT.
Tutorial: YouTube Video
Classification is used when the target variable is categorical, while regression is used for continuous variables.
More details: Quora Discussion
Despite being used for classification tasks, logistic regression is still a regression-based approach: it regresses a continuous quantity, the log-odds log(p / (1 - p)) = W^T x, on the features, and then thresholds the resulting probability to obtain a class label.
More details: Stats StackExchange
To reduce the test-time complexity of a logistic regression model, we can use L1 regularization, which drives many weights to exactly zero; at test time only the non-zero weights need to be stored and multiplied, reducing both space and time from O(d) to O(number of non-zero weights).
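A brief sketch of this effect with scikit-learn (synthetic data with many uninformative features; the regularization strength C is an arbitrary illustrative value):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset with many uninformative features.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

# L1 regularization drives most weights to exactly zero (sparse W).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

n_nonzero = np.count_nonzero(clf.coef_)
print(f"non-zero weights: {n_nonzero} out of {clf.coef_.size}")
# At test time, only the non-zero weights participate in W^T x.
```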
The sigmoid function is used in logistic regression because it squashes any real value into (0, 1), so the output can be interpreted as a probability; it is smooth and differentiable, which enables gradient-based optimization; and it dampens the effect of outliers with very large signed distances.
The sigmoid function is mathematically represented as:
σ(z) = 1 / (1 + exp(-z))