Covariance and correlation are two statistical tools that are used to determine the relationship between two variables.
Correlation is the technical term used to describe the joint variability between random variables (random variables that vary together are said to be correlated). Correlation is said to exist between two random variables if:

- an increase in one tends to be accompanied by an increase in the other, or
- an increase in one tends to be accompanied by a decrease in the other.

In the first case the correlation is said to be positive, and in the second case it is said to be negative.
Correlation does not imply causation: just because two random variables are correlated does not mean that one causes the other.
Covariance is a measure of the joint variability of two random variables. For samples \(x_i\) and \(y_i\) with means \(\overline{x}\) and \(\overline{y}\), it is defined as:
\[\text{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y}) \]
It is the average of the product of each variable's deviations from its own mean. In Python:

```python
import numpy as np

def fn_covariance(rv1, rv2):
    # accept any iterable; convert to arrays for elementwise arithmetic
    rv1 = np.array([*rv1])
    rv2 = np.array([*rv2])
    diff_1 = rv1 - rv1.mean()  # deviations of rv1 from its mean
    diff_2 = rv2 - rv2.mean()  # deviations of rv2 from its mean
    cov = (diff_1 * diff_2).mean()
    return cov
```
Example output: Covariance = 3.969 (positive correlation between heights & weights).
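The heights-and-weights dataset behind the 3.969 figure is not given in the text, so as a sanity check the same population covariance can be computed on small made-up data and compared against NumPy's built-in `np.cov`:

```python
import numpy as np

# Illustrative heights (cm) and weights (kg) -- made-up values,
# not the dataset behind the 3.969 example above
heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weights = np.array([55.0, 60.0, 65.0, 70.0, 80.0])

# Population covariance, exactly as in the formula above
manual_cov = ((heights - heights.mean()) * (weights - weights.mean())).mean()

# np.cov defaults to the sample estimate (ddof=1); pass ddof=0 to match
numpy_cov = np.cov(heights, weights, ddof=0)[0, 1]

print(manual_cov, numpy_cov)  # both 60.0: a positive covariance
```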
Covariance is scale-dependent, so it is normalized by the two standard deviations to obtain the Pearson Correlation Coefficient (PCC):

\[ \text{Corr}(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_x \sigma_y} \]

Its values always lie in the range [-1, 1]. In Python:

```python
def fn_PCC(rv1, rv2):
    # convert to arrays so .std() is available even for plain lists
    rv1, rv2 = np.array([*rv1]), np.array([*rv2])
    covariance = fn_covariance(rv1, rv2)
    std_dev1 = rv1.std()  # population standard deviation (ddof=0)
    std_dev2 = rv2.std()
    PCC = covariance / (std_dev1 * std_dev2)
    return PCC
```
Example result: PCC = 0.925 (strong positive correlation).
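One useful property: whether the covariance and standard deviations are computed with ddof=0 or ddof=1, the ddof factors cancel in the ratio, so this recipe should agree with NumPy's `np.corrcoef`. A self-contained sketch on made-up data:

```python
import numpy as np

# Made-up illustrative data
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
y = np.array([55.0, 60.0, 65.0, 70.0, 80.0])

# PCC via population covariance and population standard deviations
cov = ((x - x.mean()) * (y - y.mean())).mean()
pcc = cov / (x.std() * y.std())

# np.corrcoef normalizes the sample covariance matrix; the ddof terms cancel
reference = np.corrcoef(x, y)[0, 1]

print(round(pcc, 2), round(reference, 2))  # 0.99 0.99
```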
The Spearman Rank Correlation Coefficient (SRCC) applies the PCC to ranked scores rather than raw values, so it detects monotonic (not necessarily linear) relationships. In Python:

```python
from scipy import stats

def fn_SRCC(rv1, rv2):
    # rank each variable, then take the Pearson correlation of the ranks
    rank_rv1 = stats.rankdata(rv1, method='ordinal')
    rank_rv2 = stats.rankdata(rv2, method='ordinal')
    SRCC = fn_PCC(rank_rv1, rank_rv2)
    return SRCC
```
Example output: SRCC = 0.944 (strong monotonic correlation).
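The difference from Pearson is easiest to see on a nonlinear but perfectly monotonic relationship such as \(y = x^3\) (an illustrative choice): Spearman's coefficient is exactly 1 while Pearson's is not. The sketch below uses `scipy.stats.spearmanr`, which for tie-free data matches the rank-then-PCC recipe above:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # nonlinear, but perfectly monotonic

rho, _ = stats.spearmanr(x, y)   # Spearman: correlation of the ranks
r = np.corrcoef(x, y)[0, 1]      # Pearson: correlation of the raw values

print(rho, round(r, 3))  # rho is 1.0; r is below 1
```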
Analysis of variance (ANOVA) uses the F-statistic, the ratio of the variance between groups to the variance within groups, to test whether group means differ. In Python:

```python
def fn_var_between(sample_mean, group_means, group_sizes):
    # size-weighted squared distances of each group mean from the overall mean
    sq_dists = sum((mean - sample_mean) ** 2 * n
                   for mean, n in zip(group_means, group_sizes))
    var_between = sq_dists / (len(group_means) - 1)
    return var_between

def fn_var_within(group_rvs):
    # average of the per-group sample variances
    # (equals the pooled within-group variance when group sizes are equal)
    variances = [np.var(rv, ddof=1) for rv in group_rvs]
    avg_var = np.mean(variances)
    return avg_var

# F-statistic: ratio of the two variance estimates
fstat = fn_var_between(sample_mean, group_means, group_sizes) / fn_var_within(group_rvs)
# Example output: F = 280.87 (the group means differ significantly)
```
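Assuming equal group sizes (where averaging the group variances equals the pooled within-group variance), the two helpers can be wired into a complete F-statistic and checked against `scipy.stats.f_oneway`. The three groups below are made-up illustrative data, not the ones behind F = 280.87:

```python
import numpy as np
from scipy import stats

# Made-up groups of equal size
groups = [
    [23.0, 25.0, 27.0],
    [30.0, 32.0, 34.0],
    [40.0, 42.0, 44.0],
]

all_values = np.concatenate(groups)
sample_mean = all_values.mean()
group_means = [np.mean(g) for g in groups]
group_sizes = [len(g) for g in groups]

# Between-group variance: size-weighted squared distances from the overall mean
sq_dists = sum((m - sample_mean) ** 2 * n for m, n in zip(group_means, group_sizes))
var_between = sq_dists / (len(groups) - 1)

# Within-group variance: average of the per-group sample variances
var_within = np.mean([np.var(g, ddof=1) for g in groups])

fstat = var_between / var_within
f_ref, p_value = stats.f_oneway(*groups)

print(round(fstat, 2), round(f_ref, 2))  # 54.75 54.75
```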
| Measure | Indicates | Range | Indicates Strength? | Scale-Dependent? |
|---|---|---|---|---|
| Covariance | Direction only | (-∞, +∞) | ❌ No | Yes |
| Correlation | Strength & direction | [-1, 1] | ✅ Yes | No |
Consider the following dataset of Hours Studied (X) and Exam Scores (Y):
| Student | Hours Studied (X) | Exam Score (Y) |
|---|---|---|
| 1 | 2 | 50 |
| 2 | 3 | 60 |
| 3 | 5 | 65 |
| 4 | 7 | 70 |
| 5 | 9 | 85 |
X̄ = (2 + 3 + 5 + 7 + 9) / 5 = 5.2
Ȳ = (50 + 60 + 65 + 70 + 85) / 5 = 66
Cov(X, Y) = Σ (Xᵢ - X̄) (Yᵢ - Ȳ) / n
Calculated values:
Cov(X, Y) = (51.2 + 13.2 + 0.2 + 7.2 + 72.2) / 5 = 28.8
σₓ = sqrt(Σ (Xᵢ - X̄)² / n) = sqrt(32.8 / 5) ≈ 2.56
σᵧ = sqrt(Σ (Yᵢ - Ȳ)² / n) = sqrt(670 / 5) ≈ 11.58
r = Cov(X, Y) / (σₓ * σᵧ)
r = 28.8 / (2.56 * 11.58) ≈ 0.97
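The hand calculation above can be verified in a few lines of NumPy using the same population formulas:

```python
import numpy as np

hours = np.array([2.0, 3.0, 5.0, 7.0, 9.0])        # X from the table above
scores = np.array([50.0, 60.0, 65.0, 70.0, 85.0])  # Y from the table above

cov = ((hours - hours.mean()) * (scores - scores.mean())).mean()
r = cov / (hours.std() * scores.std())

print(round(cov, 1), round(r, 2))  # 28.8 0.97
```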
| Metric | Value | Interpretation |
|---|---|---|
| Covariance | 28.8 | Positive relationship |
| PCC (r) | 0.97 | Strong positive, almost perfectly linear correlation |