Covariance and Correlation

Understanding Covariance and Correlation

Covariance and correlation are two statistical tools used to quantify the relationship between two variables.

CORRELATION

Correlation is the technical term used to describe the joint variability between Random Variables (Random Variables that vary together are said to be correlated). Correlation is said to exist between 2 Random Variables if:

  1. The greater values of one variable generally correspond with the greater values of the other variable, and the same correspondence holds true for the lesser values.
  2. The greater values of one variable generally correspond with the lesser values of the other, and vice versa.

In the first case the correlation is said to be positive, and in the latter case, it is said to be negative.

Correlation does not imply causation. Just because 2 Random Variables are correlated, it does not mean one of them is the cause of the other.

MEASURES OF CORRELATION

1. COVARIANCE

Covariance is a measure of the joint variability of two random variables. It is defined mathematically as:

\[\text{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y}) \]

It is the average of the products of each variable's deviations from its own mean. Covariance can be expressed in Python as shown below:


Python implementation:


                import numpy as np

                def fn_covariance(rv1, rv2):
                    # Convert inputs to NumPy arrays so arithmetic is vectorised
                    rv1 = np.asarray(rv1, dtype=float)
                    rv2 = np.asarray(rv2, dtype=float)
                    # Deviations of each variable from its own mean
                    diff_1 = rv1 - rv1.mean()
                    diff_2 = rv2 - rv2.mean()
                    # Population covariance: mean of the products of deviations
                    cov = (diff_1 * diff_2).mean()
                    return cov
                            

Example output: Covariance = 3.969 (positive correlation between heights & weights).
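As a quick sanity check, the population covariance computed this way should agree with NumPy's built-in `np.cov` when `bias=True` (divide by n rather than n - 1). The height/weight values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical sample data, for illustration only
heights = [160, 165, 170, 175, 180]
weights = [55, 60, 65, 70, 80]

x = np.asarray(heights, dtype=float)
y = np.asarray(weights, dtype=float)

# Population covariance: mean of the products of deviations
manual_cov = ((x - x.mean()) * (y - y.mean())).mean()

# np.cov with bias=True also divides by n (not n - 1)
numpy_cov = np.cov(x, y, bias=True)[0, 1]

print(manual_cov, numpy_cov)  # → 60.0 60.0
```

The `bias=True` argument matters: NumPy's default is the sample covariance (divide by n - 1), which would not match the 1/n formula above.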

2. PEARSON’S CORRELATION COEFFICIENT (PCC)

Normalized covariance:

\[ \text{Corr}(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_x \sigma_y} \]

Values range: [-1, 1]. Python code:


                def fn_PCC(rv1, rv2):
                    # Normalise the covariance by the product of the standard deviations
                    rv1 = np.asarray(rv1, dtype=float)
                    rv2 = np.asarray(rv2, dtype=float)
                    covariance = fn_covariance(rv1, rv2)
                    # Population standard deviations (ddof=0) match the 1/n covariance
                    std_dev1 = rv1.std()
                    std_dev2 = rv2.std()
                    PCC = covariance / (std_dev1 * std_dev2)
                    return PCC
                            

Example result: PCC = 0.925 (strong positive correlation).
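The normalization can be verified against NumPy's `np.corrcoef`, which computes the same Pearson coefficient. The data below is a small made-up example:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Population covariance, then normalise by the standard deviations
cov = ((x - x.mean()) * (y - y.mean())).mean()
pcc = cov / (x.std() * y.std())   # ddof=0 std matches the 1/n covariance

# np.corrcoef returns the same Pearson coefficient
print(pcc, np.corrcoef(x, y)[0, 1])
```

Note that the 1/n factor cancels in the ratio, which is why Pearson's coefficient is the same whether population or sample formulas are used, as long as they are used consistently.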

3. SPEARMAN’S RANKED CORRELATION COEFFICIENT (SRCC)

Uses ranked scores to detect monotonic relationships. Python code:


                from scipy import stats

                def fn_SRCC(rv1, rv2):
                    # Rank the raw scores; method='average' assigns tied values
                    # their average rank, the standard choice for Spearman
                    rank_rv1 = stats.rankdata(rv1, method='average')
                    rank_rv2 = stats.rankdata(rv2, method='average')
                    # Spearman's coefficient is Pearson's coefficient on the ranks
                    SRCC = fn_PCC(rank_rv1, rank_rv2)
                    return SRCC
                            

Example output: SRCC = 0.944 (strong monotonic correlation).
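Because Spearman's coefficient is just Pearson's coefficient applied to ranks, it detects any monotonic relationship, even a nonlinear one. The sketch below (with made-up data) shows a perfectly monotonic cubic relationship yielding SRCC = 1, and cross-checks against SciPy's `spearmanr`:

```python
import numpy as np
from scipy import stats

# Hypothetical monotonic-but-nonlinear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # perfectly monotonic, so SRCC should be 1

# Rank both variables (method='average' handles ties the standard way)
rx = stats.rankdata(x, method='average')
ry = stats.rankdata(y, method='average')

# Pearson's coefficient on the ranks
cov = ((rx - rx.mean()) * (ry - ry.mean())).mean()
srcc = cov / (rx.std() * ry.std())

rho, pval = stats.spearmanr(x, y)
print(srcc, rho)  # both ≈ 1.0
```

Pearson's coefficient on the raw x and y here would be below 1, since the relationship is monotonic but not linear.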

4. ANOVA FOR CATEGORICAL-NUMERIC CORRELATION

Uses F-statistic to compare group variances. Python code:


                def fn_var_between(sample_mean, group_means, group_sizes):
                    # Squared deviations of group means from the grand mean,
                    # weighted by group size (sum of squares between groups)
                    sq_dists = sum((mean - sample_mean) ** 2 * n
                                   for mean, n in zip(group_means, group_sizes))
                    # Divide by the between-groups degrees of freedom (k - 1)
                    var_between = sq_dists / (len(group_means) - 1)
                    return var_between

                def fn_var_within(group_rvs):
                    # Average of the sample variances within each group
                    # (equals the pooled within-group variance for equal group sizes)
                    variances = [np.var(rv, ddof=1) for rv in group_rvs]
                    avg_var = np.mean(variances)
                    return avg_var

                # F-statistic: ratio of between-group to within-group variance
                fstat = fn_var_between(sample_mean, group_means, group_sizes) \
                        / fn_var_within(group_rvs)  # Example: F = 280.87 (significant)
                            
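A self-contained sketch of this computation, using hypothetical equal-size groups and cross-checked against SciPy's `f_oneway`:

```python
import numpy as np
from scipy import stats

# Hypothetical equal-size groups (e.g. scores for three categories)
groups = [np.array([80., 85., 90.]),
          np.array([60., 65., 70.]),
          np.array([75., 80., 85.])]

grand_mean = np.mean(np.concatenate(groups))
group_means = [g.mean() for g in groups]
group_sizes = [len(g) for g in groups]

# Between-group variance (mean square between)
ss_between = sum((m - grand_mean) ** 2 * n
                 for m, n in zip(group_means, group_sizes))
var_between = ss_between / (len(groups) - 1)

# Within-group variance; for equal group sizes the average of the
# sample variances equals the pooled mean square within
var_within = np.mean([np.var(g, ddof=1) for g in groups])

fstat = var_between / var_within
print(fstat, stats.f_oneway(*groups).statistic)  # both ≈ 13.0
```

A large F means the group means differ by much more than the scatter inside each group, i.e. the categorical variable is informative about the numeric one.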

Comparison Table

Measure        Indicates               Range        Scale Dependency
Covariance     Direction only          -∞ to +∞     Yes
Correlation    Strength & direction    -1 to 1      No
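The scale-dependency row can be demonstrated directly: re-expressing a variable in different units rescales the covariance but leaves the correlation unchanged. A small sketch with made-up height/weight data:

```python
import numpy as np

# Hypothetical data: the same relationship in different units
x_m = np.array([1.5, 1.6, 1.7, 1.8, 1.9])   # heights in metres
x_cm = x_m * 100                             # heights in centimetres
y = np.array([50., 55., 62., 70., 80.])      # weights in kg

def cov(a, b):
    # Population covariance
    return ((a - a.mean()) * (b - b.mean())).mean()

# Covariance changes with the unit of measurement...
print(cov(x_m, y), cov(x_cm, y))   # differ by a factor of 100

# ...while correlation is unchanged
print(np.corrcoef(x_m, y)[0, 1], np.corrcoef(x_cm, y)[0, 1])
```

This is why covariance alone cannot express the *strength* of a relationship: its magnitude depends on the units of the variables.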

Example Data

Consider the following dataset of Hours Studied (X) and Exam Scores (Y):

Student    Hours Studied (X)    Exam Score (Y)
1          2                    50
2          3                    60
3          5                    65
4          7                    70
5          9                    85

Step 1: Compute Mean

X̄ = (2 + 3 + 5 + 7 + 9) / 5 = 5.2
Ȳ = (50 + 60 + 65 + 70 + 85) / 5 = 66
    

Step 2: Compute Covariance

Cov(X, Y) = Σ (Xᵢ - X̄) (Yᵢ - Ȳ) / n

Calculated values:
Cov(X, Y) = [(2−5.2)(50−66) + (3−5.2)(60−66) + (5−5.2)(65−66) + (7−5.2)(70−66) + (9−5.2)(85−66)] / 5
          = (51.2 + 13.2 + 0.2 + 7.2 + 72.2) / 5 = 144 / 5 = 28.8
    
Covariance = 28.8 → Positive relationship

Step 3: Compute Standard Deviations

σₓ = sqrt(Σ (Xᵢ - X̄)² / n) = sqrt(32.8 / 5) ≈ 2.56
σᵧ = sqrt(Σ (Yᵢ - Ȳ)² / n) = sqrt(670 / 5) ≈ 11.58
    

Step 4: Compute Correlation (r)

r = Cov(X, Y) / (σₓ * σᵧ)
r = 28.8 / (2.56 * 11.58) ≈ 0.97
    
Correlation (r) = 0.97 → Strong Positive Correlation
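The worked example can be double-checked numerically with a few lines of NumPy, using the same hours/scores data:

```python
import numpy as np

# Hours studied (X) and exam scores (Y) from the worked example
x = np.array([2., 3., 5., 7., 9.])
y = np.array([50., 60., 65., 70., 85.])

# Population covariance and Pearson's r (ddof=0 std matches 1/n covariance)
cov = ((x - x.mean()) * (y - y.mean())).mean()
r = cov / (x.std() * y.std())

print(round(cov, 1), round(r, 2))  # → 28.8 0.97
```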

Final Results

Metric                  Value    Interpretation
Covariance              28.8     Positive relationship
Correlation (PCC, r)    0.97     Strong positive linear relationship

Conclusion: The more hours a student studies, the higher their exam score tends to be. Since r ≈ 1, the relationship is close to perfectly linear.