Covariance

Covariance is a statistical measure that describes how two variables, \( X \) and \( Y \), vary together. For \(n\) pairs of data \((X_i, Y_i)\), covariance \( \text{Cov}(X, Y) \) is calculated as follows: $$ \text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y}) $$ where \( \overline{X} \) and \( \overline{Y} \) are the averages of \( X \) and \( Y \), respectively.
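The formula above translates directly into a few lines of Python. This is a minimal sketch (the function name `covariance` is my own, not from these notes):

```python
def covariance(xs, ys):
    """Population covariance: (1/n) * sum((x_i - mean_x) * (y_i - mean_y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average the products of the deviations from each mean.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
```

Note that this is the population form (dividing by \( n \)); many libraries default to the sample form, which divides by \( n - 1 \).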

Covariance is an unstandardized measure of the linear relationship between the two variables; its sign indicates the direction of that relationship:

  • If covariance is positive, it implies that, on average, when one variable increases, the other also tends to increase.
  • If covariance is negative, it suggests that as one variable increases, the other tends to decrease.

Finally, a covariance of zero indicates no linear relationship between the two variables, although it doesn’t necessarily mean they are independent.

Relationship between Covariance and Correlation. Covariance measures how two variables move together, but its value depends on the units of measurement, so it says nothing about the strength of the relationship in a comparable way. Correlation, on the other hand, measures both the strength and direction of the relationship on a standardized scale. For example, the Pearson coefficient divides covariance by the product of the standard deviations to express correlation on a scale from -1 to 1, making correlation a normalized form of covariance.

A Practical Example

Let’s calculate the covariance between two variables: hours of study \( X \) and test scores \( Y \) among a group of students.

Student   Study Hours (X)   Score (Y)
A                2              65
B                4              70
C                6              80
D                8              85
E               10              90

First, we calculate the averages of \( X \) and \( Y \):

$$ \overline{X} = \frac{2 + 4 + 6 + 8 + 10}{5} = 6 $$

$$ \overline{Y} = \frac{65 + 70 + 80 + 85 + 90}{5} = 78 $$

With these averages, we can now compute the covariance between X and Y:

$$ \text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y}) $$

$$ \text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - 6)(Y_i - 78) $$

Next, we calculate each product of the differences from the mean:

  • Student A: \((2 - 6)(65 - 78) = (-4)(-13) = 52\)
  • Student B: \((4 - 6)(70 - 78) = (-2)(-8) = 16\)
  • Student C: \((6 - 6)(80 - 78) = (0)(2) = 0\)
  • Student D: \((8 - 6)(85 - 78) = (2)(7) = 14\)
  • Student E: \((10 - 6)(90 - 78) = (4)(12) = 48\)

The covariance is then calculated as follows:

$$ \text{Cov}(X, Y) = \frac{52 + 16 + 0 + 14 + 48}{5} $$

$$ \text{Cov}(X, Y) = \frac{130}{5} $$

$$ \text{Cov}(X, Y) = 26 $$

In this case, the covariance between study hours and scores is 26.

This positive value indicates a positive linear relationship between the two variables.

In other words, as study hours increase, scores also tend to rise.
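The step-by-step arithmetic above can be reproduced in a few lines of Python as a check (a sketch; the variable names are my own):

```python
# Worked example: study hours vs. test scores.
hours  = [2, 4, 6, 8, 10]      # X
scores = [65, 70, 80, 85, 90]  # Y

n = len(hours)
mean_x = sum(hours) / n    # 6.0
mean_y = sum(scores) / n   # 78.0

# Products of the deviations from the means: 52, 16, 0, 14, 48
products = [(x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)]
cov = sum(products) / n
print(cov)  # 26.0
```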


Zero Covariance Does Not Imply Independence

Covariance only captures the linear relationship between two variables.

If covariance is zero, there’s no linear relationship, but other forms of non-linear dependence may still be present.

In other words, two variables could have a more complex relationship, such as quadratic or exponential dependence, which covariance would not detect.

Example. A classic example is two variables \( X \) and \( Y = X^2 \), where \( X \) is centered around its mean (i.e., it has a mean of zero). Here, \( X \) and \( Y \) are dependent since knowing \( X \) allows you to determine \( Y \), yet their covariance is zero because the relationship is non-linear.
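This classic example is easy to verify numerically. Below is a minimal sketch with \( X \) taking symmetric values around zero, so its mean is zero:

```python
# X centered around zero, Y = X^2: fully dependent, yet covariance is zero.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]   # [4, 1, 0, 1, 4]

n = len(xs)
mean_x = sum(xs) / n        # 0.0
mean_y = sum(ys) / n        # 2.0

cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
print(cov)  # 0.0
```

The positive and negative products cancel exactly, even though \( Y \) is completely determined by \( X \).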

The Difference Between Covariance and Correlation

Covariance and correlation are related but distinct.

  • Covariance measures the relationship in the variables' original units, so its magnitude depends on their scale.
  • Correlation measures the relationship in relative terms, providing a standardized value that allows direct comparison.

While covariance shows how two variables move together in absolute terms, correlation standardizes this movement, making comparisons possible regardless of the variables' units of measurement.

How is correlation measured?

Covariance gives some insight but isn’t very useful as a standalone measure of correlation.

To measure correlation, we turn to other statistical tools, like the Pearson coefficient.

For example, in the Pearson coefficient, the correlation \( \rho(X, Y) \) between two variables \( X \) and \( Y \) is given by:

$$ \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} $$

where \( \text{Cov}(X, Y) \) is the covariance between \( X \) and \( Y \), and \( \sigma_X \) and \( \sigma_Y \) are the standard deviations of \( X \) and \( Y \).

Unlike covariance, Pearson’s coefficient ranges from -1 to 1.

  • \( \rho(X, Y) = 1 \): perfect positive correlation; the variables move together proportionally.
  • \( \rho(X, Y) = -1 \): perfect negative correlation; the variables move in opposite directions proportionally.
  • \( \rho(X, Y) = 0 \): no linear relationship (although non-linear dependencies may still exist).

As a normalized measure, correlation is unaffected by positive linear transformations of the variables, such as multiplying by a positive constant or adding a constant (a shift).

It’s also especially useful for comparing relationships among variables measured in different units.
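Putting the formula to work on the study-hours example from earlier: a minimal sketch (the helper `pearson` is my own) that also illustrates the invariance to rescaling, e.g. converting hours to minutes:

```python
import math

def pearson(xs, ys):
    """Pearson correlation: Cov(X, Y) / (sigma_X * sigma_Y), population form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

hours  = [2, 4, 6, 8, 10]
scores = [65, 70, 80, 85, 90]

r = pearson(hours, scores)
print(round(r, 3))  # 0.991

# Unlike covariance, correlation is unchanged by positive rescaling:
minutes = [60 * h for h in hours]
print(round(pearson(minutes, scores), 3))  # 0.991
```

The covariance jumps from 26 to 1560 when hours become minutes, but the correlation stays the same, which is exactly why it is the more useful measure for comparison.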


Please feel free to point out any errors or typos, or share suggestions to improve these notes. English isn't my first language, so if you notice any mistakes, let me know, and I'll be sure to fix them.
