Spearman's Rank Correlation Coefficient

The Spearman rank correlation coefficient is a non-parametric measure of correlation (or monotonic dependency) between two variables. $$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$ where \( d_i \) is the difference between the ranks of the \( i \)-th pair, and \( n \) represents the number of observations.

This method relies on data ranks rather than the raw values of the variables themselves.

The Spearman coefficient, often represented by \( \rho \) or \( r_s \), ranges from -1 to 1:

  • \( r_s = 1 \): indicates a perfect positive correlation, where the ranks of both variables move consistently in the same direction.
  • \( r_s = -1 \): indicates a perfect negative correlation, showing that as the rank of one variable increases, the other decreases.
  • \( r_s = 0 \): suggests no correlation, or no consistent monotonic relationship.

To calculate \( r_s \), the formula uses the differences between the ranks of paired values across the two variables.

What is a data rank? A “rank” is simply a value’s position when data is ordered from smallest to largest (or vice versa) within a variable. The original position in an unsorted list doesn’t matter; only the numerical order does. For instance, consider the variable $ X = \{ 8 , 6, 4, 2, 5, 7 \} $. When sorted, we get $ X_s = \{ 2, 4, 5, 6, 7, 8 \} $. Thus, 2 is assigned rank 1, 4 rank 2, 5 rank 3, and so on.
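For readers who prefer code, here is a minimal sketch of the ranking step using SciPy’s rankdata (assuming NumPy and SciPy are available); the values are the same as in the example above:

```python
# Minimal ranking check: rankdata assigns rank 1 to the smallest value, rank 2 to
# the next one, and so on (tied values would receive the average of their ranks).
from scipy.stats import rankdata

X = [8, 6, 4, 2, 5, 7]
print(rankdata(X))  # [6. 4. 2. 1. 3. 5.] -> 2 gets rank 1, 4 gets rank 2, ..., 8 gets rank 6
```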

This coefficient is particularly useful when data doesn’t meet the assumptions of Pearson’s correlation, such as linearity or normality, since it focuses purely on the rank order of values.

It is therefore a valuable indicator for gauging monotonic relationships between two statistical variables, even when those relationships are non-linear.
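As a quick illustration (with made-up data, not taken from these notes), the sketch below compares Pearson and Spearman on a monotonic but non-linear relationship, assuming SciPy is available:

```python
# On data that grow monotonically but not linearly, Pearson drops below 1,
# while Spearman is exactly 1 because the rank order of y follows the rank order of x.
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11)   # 1, 2, ..., 10
y = np.exp(x)          # monotonic increasing, strongly non-linear

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(r_pearson)   # noticeably below 1: the relationship is not linear
print(r_spearman)  # 1.0: the ranks of x and y match perfectly
```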

A Practical Example

Let’s say we have the following values for variables \( X \) and \( Y \) for five individuals:

Individual \( X \) \( Y \)
A 15 10
B 20 25
C 25 30
D 35 20
E 30 35

We assign ranks to the values in \( X \) and \( Y \) separately, with the lowest value in each variable ranked 1, and so forth:

  • For \( X \): \[15 \to 1, \, 20 \to 2, \, 25 \to 3, \, 30 \to 4, \, 35 \to 5\]
  • For \( Y \): \[10 \to 1, \, 20 \to 2, \, 25 \to 3, \, 30 \to 4, \, 35 \to 5\]

Here’s the updated table showing the ranks:

Individual \( X \) Rank of \( X \) \( Y \) Rank of \( Y \)
A 15 1 10 1
B 20 2 25 3
C 25 3 30 4
D 35 5 20 2
E 30 4 35 5

Now, we calculate the difference \( d = \text{Rank of } X - \text{Rank of } Y \) for each individual, then square each difference to get \( d^2 \).

Individual Rank of \( X \) Rank of \( Y \) \( d \) \( d^2 \)
A 1 1 0 0
B 2 3 -1 1
C 3 4 -1 1
D 5 2 3 9
E 4 5 -1 1

Finally, we sum up the \( d^2 \) values:

$$ \sum d^2 = 0 + 1 + 1 + 9 + 1 = 12 $$

We can now use Spearman’s formula to estimate the correlation between the data:

$$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$

Where \( \sum d_i^2 = 12 \) and \( n = 5 \)

$$ r_s = 1 - \frac{6 \cdot 12}{5 \cdot (5^2 - 1)} $$

$$ r_s = 1 - \frac{72}{120} $$

$$ r_s = 1 - 0.6 $$

$$ r_s = 0.4 $$

With a Spearman coefficient \( r_s = 0.4 \), there is a moderate positive correlation between \( X \) and \( Y \).

This means that as \( X \) increases, \( Y \) generally tends to increase as well, though the relationship is not particularly strong.
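As a sanity check, here is a small sketch that recomputes this example in Python, both with the rank-difference formula and with SciPy’s built-in spearmanr (assuming SciPy is available):

```python
# Re-deriving the worked example: both approaches should give r_s = 0.4.
from scipy.stats import rankdata, spearmanr

X = [15, 20, 25, 35, 30]
Y = [10, 25, 30, 20, 35]

# Hand computation with r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
# Note: this simple formula assumes no tied ranks, which holds for this example.
rank_x, rank_y = rankdata(X), rankdata(Y)
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))  # 12, as in the table above
n = len(X)
print(1 - 6 * sum_d2 / (n * (n ** 2 - 1)))  # 0.4

# SciPy's built-in version agrees.
rho, _ = spearmanr(X, Y)
print(rho)  # 0.4
```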


Pros and Cons of the Spearman Coefficient

Here’s a summary of the main pros and cons of Spearman’s correlation coefficient:

A] Pros of the Spearman Coefficient

  • Measures Monotonic Relationships
    Spearman captures monotonic dependency between variables, indicating whether one variable consistently increases or decreases in tandem with another, without assuming linearity. This makes it an excellent indicator of non-linear correlations.
  • Less Sensitive to Outliers
    Since it’s based on ranks, Spearman’s coefficient is less influenced by extreme values than Pearson’s correlation.
  • Suitable for Ordinal Data
    Spearman works well with ordinal data or when assumptions of linear or normal distribution aren’t met.
  • Intuitive Interpretation
    Spearman values are straightforward: 1 and -1 represent perfect monotonic positive or negative correlations, while 0 indicates no monotonic relationship.

B] Cons of the Spearman Coefficient

  • Can’t Detect Complex Non-Monotonic Relationships
    Spearman doesn’t identify dependence when the relationship isn’t monotonic, such as quadratic or sinusoidal patterns (see the sketch after this list).
  • Rank-Based Limitation
    Using ranks may reduce precision, particularly in small datasets where rank assignments might not fully capture relationships.
  • Limited Information on the Magnitude of Variation
    Spearman only shows whether values rise or fall together; it doesn’t convey the rate of change. Pearson, by contrast, offers more detail on the magnitude of variation.
  • Ignores Data Magnitude
    Since it only uses ranks, Spearman discards the actual value sizes, which could be a drawback when relationships are influenced by magnitude.
  • Challenges with Tied Ranks
    When many values are tied, each tied group receives the same average rank and the simple formula above no longer applies exactly; a tie-corrected version (equivalent to computing Pearson’s correlation on the ranks) must be used, and heavy tying reduces the coefficient’s discriminating power.
  • Sensitivity to Rank Changes
    Spearman relies solely on rank order, so minor changes in data ordering can significantly impact its value, especially in small datasets, making it somewhat sensitive to misalignment.
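To make the first limitation concrete (again with made-up data), the sketch below applies Spearman to a U-shaped, non-monotonic relationship:

```python
# y is fully determined by x, but the relationship is U-shaped rather than monotonic,
# so Spearman reports a correlation of zero despite the strong dependence.
import numpy as np
from scipy.stats import spearmanr

x = np.arange(-5, 6)  # -5, -4, ..., 5
y = x ** 2            # quadratic: decreases, then increases

rho, _ = spearmanr(x, y)
print(rho)  # 0.0: the monotonic measure misses the quadratic dependence entirely
```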


Please feel free to point out any errors or typos, or share suggestions to improve these notes. English isn't my first language, so if you notice any mistakes, let me know, and I'll be sure to fix them.
