Statistics & Machine Learning Hub

Central tendency describes the center, or typical value, of a dataset: a single number around which most of the data points tend to cluster and that summarizes the whole distribution.
There are three main measures of central tendency:
Mean: The average of all values in the dataset.
Median: The middle value when the data is arranged in order.
Mode: The value that appears most frequently in the dataset.

Measures of Central Tendency

Mean (Average)

The mean is calculated by adding all the values in the dataset and dividing by the number of values. It is the most commonly used measure of central tendency.

            Example:
            Dataset: [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
            Mean = (3 + 7 + 8 + 5 + 12 + 19 + 15 + 10 + 4 + 6) / 10 = 89 / 10 = 8.9
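The calculation above can be sketched in a few lines of Python. The helper name `mean` is illustrative and not part of the original text; the standard library's `statistics.mean` should give the same result.

```python
import statistics


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)


data = [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
print(mean(data))             # 8.9
print(statistics.mean(data))  # 8.9
```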

Median

The median is the middle value when the data is ordered. If there is an even number of values, the median is the average of the two middle numbers.

            Example:
            Dataset: [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
            Ordered Dataset: [3, 4, 5, 6, 7, 8, 10, 12, 15, 19]
            Median = (7 + 8) / 2 = 7.5
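As a minimal sketch of the rule above, the function below sorts the data and picks the middle value, averaging the two middle values when the count is even. The name `median` is illustrative.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


data = [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
print(median(data))  # 7.5
```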

Mode

The mode is the value that appears most frequently in a dataset. There can be more than one mode if multiple values have the same frequency.

            Example:
            Dataset: [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
            Mode = No mode (all values appear only once)
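One way to sketch this in Python is to count how often each value occurs and keep every value that reaches the highest count. The helper `modes` and the second dataset are hypothetical, and returning an empty list when nothing repeats is a convention assumed here to match the "no mode" reading above.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: all values unique means "no mode"
    return [value for value, count in counts.items() if count == highest]


data = [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
print(modes(data))                # []
print(modes([2, 4, 4, 6, 6, 9]))  # [4, 6]  (hypothetical dataset with two modes)
```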

Measures of Dispersion

Measures of dispersion describe the spread or variability of a dataset. They give us an understanding of how spread out or clustered the values are in relation to the central tendency.

Range

The range is the difference between the highest and lowest values in the dataset. It provides a simple way to measure the spread but does not account for how the data is distributed within that range.

            Range = Max(X) - Min(X)
            Example:
            Dataset: [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
            Range = 19 - 3 = 16
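The short sketch below computes the range and also illustrates its limitation: a second, hypothetical dataset (`clustered`, not from the original text) has the same range even though nearly all of its values sit at one end.

```python
data = [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
data_range = max(data) - min(data)
print(data_range)  # 16

# Hypothetical dataset: same extremes, very different distribution in between.
clustered = [3, 3, 3, 3, 19]
clustered_range = max(clustered) - min(clustered)
print(clustered_range)  # 16
```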

Variance

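Variance measures the average squared deviation from the mean. It gives an idea of how much the data points vary from the mean. The population variance divides the sum of squared deviations by the number of values n; the sample variance divides by n - 1 instead.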

            Population Variance Example:
            Dataset: [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
            Mean = 8.9
            Variance = ((3-8.9)^2 + (7-8.9)^2 + ... + (6-8.9)^2) / 10 = 236.9 / 10 = 23.69
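To check the arithmetic, the population variance can be computed directly from its definition. This is a minimal sketch; the function name `population_variance` is illustrative, and the standard library's `statistics.pvariance` is used as a cross-check.

```python
import statistics


def population_variance(values):
    """Average squared deviation from the mean, dividing by n (population form)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty dataset is undefined")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


data = [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
print(round(population_variance(data), 2))    # manual calculation
print(round(statistics.pvariance(data), 2))   # standard-library cross-check
```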

Standard Deviation

The standard deviation is the square root of the variance. It provides a more interpretable measure of spread, as it is in the same units as the original data.

            Population Standard Deviation Example:
            Variance = 23.69
            Standard Deviation = √23.69 ≈ 4.87
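A brief sketch of the same step in Python: take the square root of the population variance, then compare with the standard library's `statistics.pstdev`. The variable names are illustrative.

```python
import math
import statistics

data = [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]

mu = sum(data) / len(data)
variance = sum((x - mu) ** 2 for x in data) / len(data)  # population variance
std_dev = math.sqrt(variance)                            # same units as the data

print(round(std_dev, 2))                  # manual calculation
print(round(statistics.pstdev(data), 2))  # standard-library cross-check
```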

Percentiles and Quartiles

Percentiles and quartiles describe the distribution of data by marking the values below which a given share of the ordered dataset falls.

Percentiles

Percentiles are values that divide an ordered dataset into 100 equal parts. The p-th percentile is the value below which approximately p% of the data falls. For example, the 50th percentile is the value below which 50% of the data falls.

Example: We will find the 25th, 50th (median), and 75th percentiles for the dataset [3, 7, 8, 5, 12, 19, 15, 10, 4, 6].

            Dataset: [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
            Ordered Dataset: [3, 4, 5, 6, 7, 8, 10, 12, 15, 19]
            
            - 25th Percentile (P25): Position = (25 / 100) * (10 + 1) = 2.75, so P25 lies three-quarters of the way from the 2nd value (4) to the 3rd value (5): P25 = 4 + 0.75 * (5 - 4) = 4.75
            - 50th Percentile (P50 or Median): Position = (50 / 100) * (10 + 1) = 5.5, so P50 lies halfway between the 5th value (7) and the 6th value (8): P50 = (7 + 8) / 2 = 7.5
            - 75th Percentile (P75): Position = (75 / 100) * (10 + 1) = 8.25, so P75 lies one-quarter of the way from the 8th value (12) to the 9th value (15): P75 = 12 + 0.25 * (15 - 12) = 12.75
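The position rule used above, (p / 100) * (n + 1), can be sketched as a small function. It assumes linear interpolation between the two neighbouring values when the position is fractional, and it clamps to the smallest or largest value when the position falls outside the data; both choices are assumptions of this sketch, and other percentile conventions give slightly different results. The name `percentile` is illustrative.

```python
def percentile(values, p):
    """p-th percentile using position = (p / 100) * (n + 1) with linear interpolation."""
    if not values:
        raise ValueError("percentile of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    position = (p / 100) * (n + 1)  # 1-based position, possibly fractional
    lower = int(position)           # whole part: index of the value just below
    fraction = position - lower     # how far to move towards the next value
    if lower < 1:
        return ordered[0]           # assumed: clamp below the first value
    if lower >= n:
        return ordered[-1]          # assumed: clamp above the last value
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


data = [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]
for p in (25, 50, 75):
    print(p, percentile(data, p))
```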

Quartiles

Quartiles divide the dataset into four equal parts. The three quartiles are:

First Quartile (Q1): The 25th percentile, separating the lowest 25% of the data.

Second Quartile (Q2): The 50th percentile, also known as the median.

Third Quartile (Q3): The 75th percentile, separating the lowest 75% of the data.

For the dataset [3, 7, 8, 5, 12, 19, 15, 10, 4, 6], the quartiles are:

            Ordered Dataset: [3, 4, 5, 6, 7, 8, 10, 12, 15, 19]
            
            Q1 (25th Percentile): P25 = 4.75
            Q2 (50th Percentile / Median): P50 = 7.5
            Q3 (75th Percentile): P75 = 12.75
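Python's standard library can compute the three quartiles in one call. As a hedged sketch: `statistics.quantiles` with `n=4` returns the three cut points, and its default `method="exclusive"` uses the same (n + 1) position rule as the percentile calculation above. Passing `method="inclusive"` applies a different convention and may return different values for the same data.

```python
import statistics

data = [3, 7, 8, 5, 12, 19, 15, 10, 4, 6]

# n=4 splits the data into four groups, giving three cut points: Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)  # default method="exclusive"
print(q1, q2, q3)

# A different convention for comparison; results are not guaranteed to match.
alt_q1, alt_q2, alt_q3 = statistics.quantiles(data, n=4, method="inclusive")
print(alt_q1, alt_q2, alt_q3)
```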