Linear Correlation in Contingency Tables

To calculate the linear correlation coefficient \( r \) between two variables \( X \) and \( Y \) whose joint distribution is given as a contingency table, start from a table of this form:

X \ Y    y₁     y₂     ...    yⱼ     ...    Total
x₁       f₁₁    f₁₂    ...    f₁ⱼ    ...    R₁
x₂       f₂₁    f₂₂    ...    f₂ⱼ    ...    R₂
...      ...    ...    ...    ...    ...    ...
xᵢ       fᵢ₁    fᵢ₂    ...    fᵢⱼ    ...    Rᵢ
...      ...    ...    ...    ...    ...    ...
Total    C₁     C₂     ...    Cⱼ     ...    N

Here’s the process:

  1. Calculate the means \( \bar{x} \) and \( \bar{y} \) of variables \( X \) and \( Y \) using marginal frequencies.
  2. Find the deviations of each value from the mean (\( x_i - \bar{x} \) and \( y_j - \bar{y} \)).
  3. For each cell, multiply the product of the two deviations \( (x_i - \bar{x})(y_j - \bar{y}) \) by the joint frequency \( f_{ij} \), then sum these products over all cells.
  4. Compute the squared deviations of \( X \) and \( Y \), weight them by the marginal frequencies \( R_i \) and \( C_j \) respectively, and sum each set of results.
  5. Finally, apply the formula for the linear correlation coefficient: $$ r = \frac{\sum (x_i - \bar{x})(y_j - \bar{y}) f_{ij}}{\sqrt{\sum (x_i - \bar{x})^2 R_i \cdot \sum (y_j - \bar{y})^2 C_j}} $$ where the sum in the numerator runs over every cell, that is, over all rows \( i \) and columns \( j \), while the two sums in the denominator run over the rows and over the columns respectively.

The result \( r \) measures the strength and direction of the linear correlation between \( X \) and \( Y \).
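The five steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `correlation_from_table` and its parameters `xs`, `ys` and `freq` are hypothetical, and the table is assumed to be a list of rows with `freq[i][j]` equal to \( f_{ij} \).

```python
from math import sqrt


def correlation_from_table(xs, ys, freq):
    """Linear correlation coefficient r from a contingency table.

    xs   -- values of X, one per row
    ys   -- values of Y, one per column
    freq -- joint frequencies, freq[i][j] = f_ij
    """
    # Marginal frequencies and grand total
    row_totals = [sum(row) for row in freq]          # R_i
    col_totals = [sum(col) for col in zip(*freq)]    # C_j
    n = sum(row_totals)                              # N

    # Step 1: marginal means
    x_bar = sum(x * r for x, r in zip(xs, row_totals)) / n
    y_bar = sum(y * c for y, c in zip(ys, col_totals)) / n

    # Step 2: deviations from the means
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]

    # Step 3: cross products weighted by the joint frequencies
    s_xy = sum(dx[i] * dy[j] * freq[i][j]
               for i in range(len(xs)) for j in range(len(ys)))

    # Step 4: squared deviations weighted by the marginal frequencies
    s_xx = sum(d * d * r for d, r in zip(dx, row_totals))
    s_yy = sum(d * d * c for d, c in zip(dy, col_totals))

    # Step 5: correlation coefficient
    return s_xy / sqrt(s_xx * s_yy)
```

For a table in which every observation lies on the diagonal, for example `correlation_from_table([1, 2], [1, 2], [[1, 0], [0, 1]])`, the result is 1, a perfect positive correlation. The sketch assumes both variables actually vary; if all observations share the same value of \( X \) or \( Y \), the denominator is zero and the division fails.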

Note: In the contingency table used to represent the joint frequency distribution of two variables, \( X \) and \( Y \):

  • \( x_i \) and \( y_j \) are the values of the variables \( X \) and \( Y \).
  • \( f_{ij} \) represents the joint frequency associated with the pair \( (x_i, y_j) \).
  • \( C_j \) is the column sum, representing marginal frequencies for \( Y \).
  • \( R_i \) is the row sum, representing marginal frequencies for \( X \).
  • \( N \) is the grand total, or the sum of all joint frequencies.

Overall, this structure enables the calculation of statistics like the correlation coefficient using the joint frequencies \( f_{ij} \), marginal frequencies of \( X \) and \( Y \), and the means of their marginal distributions.
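Written out in symbols, the quantities listed in this note are related as follows. These are the usual definitions of marginal frequencies and of the mean of a frequency distribution, expressed in the notation of the table:

$$ R_i = \sum_j f_{ij}, \qquad C_j = \sum_i f_{ij}, \qquad N = \sum_i R_i = \sum_j C_j $$

$$ \bar{x} = \frac{1}{N} \sum_i x_i R_i, \qquad \bar{y} = \frac{1}{N} \sum_j y_j C_j $$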

    A Practical Example

    Let’s go through an example to calculate the correlation coefficient \( r \) using a contingency table.

    Suppose we want to study the correlation between study hours (\( X \)) and final grades (\( Y \)) in a group of students.

    X \ Y               \( Y = 6 \)   \( Y = 7 \)   \( Y = 8 \)   Total (\( R_i \))
    \( X = 4 \)         3             2             1             6
    \( X = 5 \)         1             3             1             5
    \( X = 6 \)         1             1             2             4
    Total (\( C_j \))   5             6             4             15

    This table shows the joint frequencies, that is, the number of students for each combination of study hours and final grade.
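As a quick sanity check, the marginal totals in the table can be recomputed from the joint frequencies. The snippet below is only a sketch; the variable names (`freq`, `row_totals`, `col_totals`, `n`) are illustrative choices, not part of the original notes.

```python
# Joint frequencies f_ij from the table: rows are X = 4, 5, 6 (study hours),
# columns are Y = 6, 7, 8 (final grade).
freq = [
    [3, 2, 1],
    [1, 3, 1],
    [1, 1, 2],
]

row_totals = [sum(row) for row in freq]        # R_i: sum across each row
col_totals = [sum(col) for col in zip(*freq)]  # C_j: sum down each column
n = sum(row_totals)                            # N: grand total

print(row_totals, col_totals, n)  # [6, 5, 4] [5, 6, 4] 15
```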

    Calculate the marginal means of \( X \) and \( Y \):

    $$ \bar{x} = \frac{4 \times 6 + 5 \times 5 + 6 \times 4}{15} = \frac{24 + 25 + 24}{15} = \frac{73}{15} \approx 4.87 $$

    $$ \bar{y} = \frac{6 \times 5 + 7 \times 6 + 8 \times 4}{15} = \frac{30 + 42 + 32}{15} = \frac{104}{15} \approx 6.93 $$
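The two weighted means can be reproduced exactly with Python's `fractions` module. This is a small illustrative check; the variable names are assumptions made for the example.

```python
from fractions import Fraction

xs, row_totals = [4, 5, 6], [6, 5, 4]   # values of X and their marginal frequencies R_i
ys, col_totals = [6, 7, 8], [5, 6, 4]   # values of Y and their marginal frequencies C_j
n = sum(row_totals)                     # N = 15

# Weighted means kept as exact fractions
x_bar = Fraction(sum(x * r for x, r in zip(xs, row_totals)), n)
y_bar = Fraction(sum(y * c for y, c in zip(ys, col_totals)), n)

print(x_bar, round(float(x_bar), 2))  # 73/15 4.87
print(y_bar, round(float(y_bar), 2))  # 104/15 6.93
```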

    Then calculate the deviations from the mean for each value of \( X \) and \( Y \).

    The deviations for variable \( X \) are:

    • \( 4 - \bar{x} = 4 - 4.87 = -0.87 \)
    • \( 5 - \bar{x} = 5 - 4.87 = 0.13 \)
    • \( 6 - \bar{x} = 6 - 4.87 = 1.13 \)

    The deviations for variable \( Y \) are:

    • \( 6 - \bar{y} = 6 - 6.93 = -0.93 \)
    • \( 7 - \bar{y} = 7 - 6.93 = 0.07 \)
    • \( 8 - \bar{y} = 8 - 6.93 = 1.07 \)
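The same deviations can be obtained in a couple of lines of Python. The sketch below keeps the means at full precision and rounds only for display; the names `dx` and `dy` are illustrative.

```python
xs, ys = [4, 5, 6], [6, 7, 8]
x_bar, y_bar = 73 / 15, 104 / 15   # marginal means computed earlier

dx = [x - x_bar for x in xs]       # deviations of X
dy = [y - y_bar for y in ys]       # deviations of Y

print([round(d, 2) for d in dx])  # [-0.87, 0.13, 1.13]
print([round(d, 2) for d in dy])  # [-0.93, 0.07, 1.07]
```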

    For clarity, the deviations of \( Y \) are listed above the columns and the deviations of \( X \) alongside the rows of the table.

    Next, compute the product of deviations for each cell and multiply by the joint frequency \( f_{ij} \):

      
     
    \( x_i - \bar{x} \) \ \( y_j - \bar{y} \)   \( -0.93 \)                     \( 0.07 \)                     \( 1.07 \)
    \( -0.87 \)                                 \( (-0.87)(-0.93) \times 3 \)   \( (-0.87)(0.07) \times 2 \)   \( (-0.87)(1.07) \times 1 \)
    \( 0.13 \)                                  \( (0.13)(-0.93) \times 1 \)    \( (0.13)(0.07) \times 3 \)    \( (0.13)(1.07) \times 1 \)
    \( 1.13 \)                                  \( (1.13)(-0.93) \times 1 \)    \( (1.13)(0.07) \times 1 \)    \( (1.13)(1.07) \times 2 \)

    For instance, in the top-left cell, \( X=4 \) deviates from \( \bar{x} \) by \( -0.87 \) and \( Y=6 \) deviates from \( \bar{y} \) by \( -0.93 \). The joint frequency is \( 3 \), the number of students who studied 4 hours and scored 6. The contribution of this cell is therefore $$ (-0.87)(-0.93) \times 3 \approx 2.43 $$

    At this stage, we evaluate every cell in the table and add a partial total for each row.

    \( x_i - \bar{x} \) \ \( y_j - \bar{y} \)   \( -0.93 \)   \( 0.07 \)   \( 1.07 \)   Row total
    \( -0.87 \)                                 2.43          -0.12        -0.93        1.38
    \( 0.13 \)                                  -0.12         0.03         0.14         0.05
    \( 1.13 \)                                  -1.05         0.08         2.42         1.45

    Now add up the row totals:

    $$ \sum (x_i - \bar{x})(y_j - \bar{y}) f_{ij} = 1.38 + 0.05 + 1.45 = 2.88 $$
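The cross-product sum can also be evaluated without intermediate rounding. The sketch below (variable names are illustrative) carries the means at full precision. It yields a value close to 2.87 rather than 2.88; the small gap comes from rounding the deviations and the cell values to two decimals in the hand calculation.

```python
xs, ys = [4, 5, 6], [6, 7, 8]
freq = [[3, 2, 1], [1, 3, 1], [1, 1, 2]]
x_bar, y_bar = 73 / 15, 104 / 15

# Contribution of each cell: (x_i - x_bar) * (y_j - y_bar) * f_ij
cells = [[(x - x_bar) * (y - y_bar) * f for y, f in zip(ys, row)]
         for x, row in zip(xs, freq)]

row_sums = [sum(row) for row in cells]   # partial total for each row
s_xy = sum(row_sums)                     # sum over all cells

print(round(s_xy, 4))  # 2.8667
```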

    Calculate the squared deviations \( (x_i - \bar{x})^2 \) for each row and multiply by the row’s marginal frequency \( R_i \):

    $$ \sum (x_i - \bar{x})^2 R_i = (-0.87)^2 \times 6 + (0.13)^2 \times 5 + (1.13)^2 \times 4 $$

    $$ \sum (x_i - \bar{x})^2 R_i = 4.54 + 0.08 + 5.11 $$

    $$ \sum (x_i - \bar{x})^2 R_i = 9.73 $$

      
     
    \( x_i - \bar{x} \) \ \( y_j - \bar{y} \)   \( -0.93 \)   \( 0.07 \)   \( 1.07 \)   Row total   \( (x_i - \bar{x})^2 R_i \)
    \( -0.87 \)                                 2.43          -0.12        -0.93        1.38        4.54
    \( 0.13 \)                                  -0.12         0.03         0.14         0.05        0.08
    \( 1.13 \)                                  -1.05         0.08         2.42         1.45        5.11

    Then, calculate the squared deviations \( (y_j - \bar{y})^2 \) for each column and multiply by the column’s marginal frequency \( C_j \):

    $$ \sum (y_j - \bar{y})^2 C_j = (-0.93)^2 \times 5 + (0.07)^2 \times 6 + (1.07)^2 \times 4  $$

    $$ \sum (y_j - \bar{y})^2 C_j = 4.32 + 0.03 + 4.58 $$

    $$ \sum (y_j - \bar{y})^2 C_j = 8.93  $$
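Both weighted sums of squared deviations can be checked in the same way. This is a minimal sketch; `s_xx` and `s_yy` are names chosen here for the two sums in the denominator of the formula.

```python
xs, row_totals = [4, 5, 6], [6, 5, 4]   # values of X with R_i
ys, col_totals = [6, 7, 8], [5, 6, 4]   # values of Y with C_j
x_bar, y_bar = 73 / 15, 104 / 15

# Squared deviations weighted by the marginal frequencies
s_xx = sum((x - x_bar) ** 2 * r for x, r in zip(xs, row_totals))
s_yy = sum((y - y_bar) ** 2 * c for y, c in zip(ys, col_totals))

print(round(s_xx, 2), round(s_yy, 2))  # 9.73 8.93
```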

      
     
    \( x_i - \bar{x} \) \ \( y_j - \bar{y} \)   \( -0.93 \)   \( 0.07 \)   \( 1.07 \)   Row total   \( (x_i - \bar{x})^2 R_i \)
    \( -0.87 \)                                 2.43          -0.12        -0.93        1.38        4.54
    \( 0.13 \)                                  -0.12         0.03         0.14         0.05        0.08
    \( 1.13 \)                                  -1.05         0.08         2.42         1.45        5.11
    \( (y_j - \bar{y})^2 C_j \)                 4.32          0.03         4.58

    Finally, apply the formula:

    $$ r = \frac{\sum (x_i - \bar{x})(y_j - \bar{y}) f_{ij}}{\sqrt{\sum (x_i - \bar{x})^2 R_i \cdot \sum (y_j - \bar{y})^2 C_j}} $$

    $$ r = \frac{2.88}{\sqrt{9.73 \times 8.93}} \approx \frac{2.88}{9.32} \approx 0.31 $$

    The correlation coefficient \( r \approx 0.31 \) suggests a weak to moderate positive linear correlation between study hours and final grades.
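One way to cross-check the result is to expand the table back into the 15 individual observations and compute the correlation coefficient directly from the raw pairs. The sketch below does this with plain Python; it is an illustrative check, and the variable names are assumptions made for the example.

```python
from math import sqrt

xs, ys = [4, 5, 6], [6, 7, 8]
freq = [[3, 2, 1], [1, 3, 1], [1, 1, 2]]

# Repeat each (x, y) pair f_ij times to recover the individual observations.
pairs = [(x, y)
         for x, row in zip(xs, freq)
         for y, f in zip(ys, row)
         for _ in range(f)]

n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n

# Ordinary (unweighted) sums over the raw observations
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
s_yy = sum((y - mean_y) ** 2 for _, y in pairs)

r = s_xy / sqrt(s_xx * s_yy)
print(n, round(r, 2))  # 15 0.31
```

The agreement is expected: weighting each cell by \( f_{ij} \) is the same as counting each observation once.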

