Spearman's Rank Correlation Coefficient
The Spearman rank correlation coefficient is a non-parametric measure of correlation (or monotonic dependency) between two variables. $$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$ where \( d_i \) is the difference between the ranks of the \( i \)-th pair, and \( n \) represents the number of observations.
This method relies on data ranks rather than the raw values of the variables themselves.
The Spearman coefficient, often represented by \( \rho \) or \( r_s \), ranges from -1 to 1:
- \( r_s = 1 \): indicates a perfect positive correlation, where the ranks of both variables move consistently in the same direction.
- \( r_s = -1 \): indicates a perfect negative correlation, showing that as the rank of one variable increases, the other decreases.
- \( r_s = 0 \): suggests no correlation, or no consistent monotonic relationship.
To calculate \( r_s \), the formula uses the difference in ranks between paired values of the two variables.
What is a data rank? A “rank” is simply a value’s position when the data is ordered from smallest to largest (or vice versa) within a variable. The original position in an unsorted list doesn’t matter; only the numerical order does. For instance, consider the variable \( X = \{8, 6, 4, 2, 5, 7\} \). When sorted, we get \( X_s = \{2, 4, 5, 6, 7, 8\} \). Thus, 2 is assigned rank 1, 4 rank 2, 5 rank 3, and so on.
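As a sketch of this ranking step, here is a minimal pure-Python helper (a hypothetical function written for this article, assuming no tied values):

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest value); assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

# The example from the text: X = {8, 6, 4, 2, 5, 7}
print(ranks([8, 6, 4, 2, 5, 7]))  # [6, 4, 2, 1, 3, 5]
```

Each value is replaced by its position in the sorted list, so 2 (the smallest) gets rank 1 and 8 (the largest) gets rank 6.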
This coefficient is particularly useful when data doesn’t meet the assumptions of Pearson’s correlation, such as linearity or normality, since it focuses purely on the rank order of values.
It’s a valuable indicator for gauging non-linear correlations between two statistical variables.
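To illustrate that point, the sketch below (a minimal implementation of the rank-difference formula, assuming no tied values) computes \( r_s \) for a strongly non-linear but monotonic relationship, \( y = x^3 \):

```python
def spearman(x, y):
    """Spearman's r_s via the rank-difference formula; assumes no ties."""
    n = len(x)
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]   # non-linear but perfectly monotonic
print(spearman(x, y))     # 1.0
```

Because the ranks of \( y = x^3 \) coincide exactly with the ranks of \( x \), Spearman reports a perfect correlation of 1, even though the relationship is far from linear.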
A Practical Example
Let’s say we have the following values for variables \( X \) and \( Y \) for five individuals:
Individual | \( X \) | \( Y \) |
---|---|---|
A | 15 | 10 |
B | 20 | 25 |
C | 25 | 30 |
D | 35 | 20 |
E | 30 | 35 |
We assign ranks to the values in \( X \), with the lowest value ranked 1, and so forth:
- For \( X \): \[15 \to 1, \, 20 \to 2, \, 25 \to 3, \, 30 \to 4, \, 35 \to 5\]
- For \( Y \): \[10 \to 1, \, 20 \to 2, \, 25 \to 3, \, 30 \to 4, \, 35 \to 5\]
Here’s the updated table showing the ranks:
Individual | \( X \) | Rank of \( X \) | \( Y \) | Rank of \( Y \) |
---|---|---|---|---|
A | 15 | 1 | 10 | 1 |
B | 20 | 2 | 25 | 3 |
C | 25 | 3 | 30 | 4 |
D | 35 | 5 | 20 | 2 |
E | 30 | 4 | 35 | 5 |
Now, we calculate the difference \( d = \text{Rank of } X - \text{Rank of } Y \) for each individual, then square each difference to get \( d^2 \).
Individual | Rank of \( X \) | Rank of \( Y \) | \( d = \text{Rank of } X - \text{Rank of } Y \) | \( d^2 \) |
---|---|---|---|---|
A | 1 | 1 | 0 | 0 |
B | 2 | 3 | -1 | 1 |
C | 3 | 4 | -1 | 1 |
D | 5 | 2 | 3 | 9 |
E | 4 | 5 | -1 | 1 |
Finally, we sum up the \( d^2 \) values:
$$ \sum d^2 = 0 + 1 + 1 + 9 + 1 = 12 $$
We can now use Spearman’s formula to estimate the correlation between the data:
$$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$
where \( \sum d_i^2 = 12 \) and \( n = 5 \):
$$ r_s = 1 - \frac{6 \cdot 12}{5 \cdot (5^2 - 1)} $$
$$ r_s = 1 - \frac{72}{120} $$
$$ r_s = 1 - 0.6 $$
$$ r_s = 0.4 $$
With a Spearman coefficient \( r_s = 0.4 \), there is a moderate positive correlation between \( X \) and \( Y \).
This means that as \( X \) increases, \( Y \) generally tends to increase as well, though the relationship is not particularly strong.
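The whole worked example above can be reproduced step by step in a few lines of Python (a sketch using a simple ranking helper, assuming no tied values):

```python
# Worked example from the tables above: five individuals A–E.
X = [15, 20, 25, 35, 30]
Y = [10, 25, 30, 20, 35]

# Rank each variable (1 = smallest; no ties here).
rank = lambda v: [sorted(v).index(e) + 1 for e in v]
rX, rY = rank(X), rank(Y)   # [1, 2, 3, 5, 4] and [1, 3, 4, 2, 5]

# Sum of squared rank differences.
d2 = sum((a - b) ** 2 for a, b in zip(rX, rY))   # 0 + 1 + 1 + 9 + 1 = 12

# Spearman's formula.
n = len(X)
r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(r_s)   # 0.4
```

The intermediate values match the tables: the rank differences square and sum to 12, giving \( r_s = 0.4 \).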
Pros and Cons of the Spearman Coefficient
Here’s a summary of the main pros and cons of Spearman’s correlation coefficient:
A] Pros of the Spearman Coefficient
- Measures Monotonic Relationships
  Spearman captures monotonic dependency between variables, indicating whether one variable consistently increases or decreases in tandem with another, without assuming linearity. This makes it an excellent indicator of non-linear correlations.
- Less Sensitive to Outliers
  Since it’s based on ranks, Spearman’s coefficient is less influenced by extreme values than Pearson’s correlation.
- Suitable for Ordinal Data
  Spearman works well with ordinal data, or when assumptions of linearity or normality aren’t met.
- Intuitive Interpretation
  Spearman values are straightforward: 1 and -1 represent perfect monotonic positive or negative correlations, while 0 indicates no monotonic relationship.
B] Cons of the Spearman Coefficient
- Can’t Detect Complex Non-Monotonic Relationships
  Spearman doesn’t identify correlations when the relationship isn’t monotonic, such as quadratic or sinusoidal patterns.
- Rank-Based Limitation
  Using ranks may reduce precision, particularly in small datasets where rank assignments might not fully capture relationships.
- Limited Information on Magnitude and Direction of Variation
  Spearman only shows whether values rise or fall together but doesn’t convey the rate of change. Pearson, by contrast, offers more detail on the magnitude of variation.
- Ignores Data Magnitude
  Since it uses only ranks, Spearman discards the actual sizes of the values, which can be a drawback when relationships are influenced by magnitude.
- Challenges with Tied Ranks
  When many values are tied, the formula requires adjustments, which may smooth the result and lessen accuracy.
- Sensitivity to Rank Changes
  Spearman relies solely on rank order, so small changes in the data can reorder the ranks and noticeably shift its value, especially in small datasets.
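On the tied-ranks point: the standard remedy is to assign each group of tied values the average of the rank positions they occupy. A minimal pure-Python sketch of that convention (a hypothetical helper written for this article, not a library function):

```python
def average_ranks(values):
    """Assign ranks where tied values share the average of the
    rank positions they would jointly occupy."""
    order = sorted(values)
    # For value v: first position is order.index(v), the tie block
    # spans order.count(v) positions; average those 1-based ranks.
    return [(2 * order.index(v) + order.count(v) + 1) / 2 for v in values]

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```

Here the two tied 20s would occupy rank positions 2 and 3, so each receives the average rank 2.5. Note that with many ties the simple \( \sum d_i^2 \) formula becomes an approximation; a common alternative is to compute Pearson's correlation directly on the tie-adjusted ranks.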