Calculating Chi-Square Using SciPy in Python
This walkthrough calculates the expected frequencies and the chi-square statistic for a dataset using Python.
First, I need to import the necessary libraries: NumPy and SciPy.
import numpy as np
import scipy.stats as stats
The approach differs depending on whether I'm calculating the chi-square for a contingency table as a test of independence or comparing a sample to a theoretical distribution.
Chi-Square Test for Independence (Contingency Table)
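For a contingency table, the expected frequencies are computed first from the row and column totals; the chi-square statistic is then computed from the differences between the observed and expected counts.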
In this example, I store the observed data in a 2x2 contingency table using an array.
observed = np.array([[50, 30], [20, 100]])
The `observed` matrix represents the contingency table.
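I then calculate the expected frequencies and the chi-square statistic using the chi2_contingency() function from scipy.stats. Note that for a 2x2 table (one degree of freedom) this function applies Yates' continuity correction by default; passing correction=False gives the uncorrected Pearson statistic instead.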
chi2, p_value, dof, expected_frequencies = stats.chi2_contingency(observed)
This function returns:
- `chi2`: the chi-square statistic.
- `p_value`: the associated p-value.
- `dof`: the degrees of freedom.
- `expected_frequencies`: the matrix of expected frequencies.
Finally, I display the results.
print("Expected frequencies:")
print(expected_frequencies)
print("Chi-square value:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
Here's the output:
Expected frequencies:
[[28. 52.]
 [42. 78.]]
Chi-square value: 42.33058608058608
P-value: 7.707766001215446e-11
Degrees of freedom: 1
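To make the calculation less of a black box, here is a minimal sketch that reproduces the numbers above by hand. It assumes the same `observed` table; the names `expected_manual`, `chi2_plain`, and `chi2_yates` are mine and purely illustrative. Each expected count is the row total times the column total divided by the grand total, and the Yates-corrected statistic shrinks each absolute difference by 0.5 before squaring.

```python
import numpy as np
from scipy import stats

observed = np.array([[50, 30], [20, 100]])

# Marginal totals of the contingency table
row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()                                # grand total

# Expected count per cell: row total * column total / grand total
expected_manual = row_totals * col_totals / n

# Uncorrected Pearson chi-square statistic
chi2_plain = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Yates-corrected statistic: reduce each |O - E| by 0.5 before squaring
# (equivalent to SciPy's adjustment here because every |O - E| exceeds 0.5)
abs_diff = np.abs(observed - expected_manual)
chi2_yates = ((abs_diff - 0.5) ** 2 / expected_manual).sum()

# Compare with SciPy, with and without the continuity correction
chi2_scipy, _, _, expected_scipy = stats.chi2_contingency(observed)
chi2_nocorr = stats.chi2_contingency(observed, correction=False)[0]

print(expected_manual)
print(chi2_yates, chi2_scipy)    # these two should agree
print(chi2_plain, chi2_nocorr)   # and so should these two
```

The corrected value matches the 42.33 reported above, which suggests the output shown was produced with the default continuity correction.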
Chi-Square Goodness of Fit (Single Sample)
When comparing a single sample against a theoretical distribution, I first calculate the expected frequencies and then compute the chi-square value.
In this example, I store the observed frequencies in a list.
observed = [20, 30, 50]
Next, I store the theoretical probabilities for each category in another list.
theoretical_probabilities = [0.25, 0.25, 0.5]
I calculate the total number of observations and store it in the variable N.
N = sum(observed)
The expected frequencies are determined by multiplying the theoretical probabilities by the total number of observations.
expected_frequencies = [p * N for p in theoretical_probabilities]
I then use the stats.chisquare() function to compute the chi-square statistic.
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected_frequencies)
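This function compares the observed and expected frequencies and returns the chi-square statistic together with its p-value. By default it uses k - 1 degrees of freedom, where k is the number of categories (here, 3 - 1 = 2).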
Finally, I print the results.
print("Expected frequencies:", expected_frequencies)
print("Chi-square value:", chi2)
print("P-value:", p_value)
Here's the output:
Expected frequencies: [25.0, 25.0, 50.0]
Chi-square value: 2.0
P-value: 0.36787944117144245
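As a cross-check, the same result can be sketched without stats.chisquare(). The snippet below assumes the observed and expected values from this example; `chi2_manual`, `dof`, and `p_manual` are illustrative names of my own. It sums (O - E)^2 / E over the categories and reads the p-value from the upper tail of the chi-square distribution with k - 1 degrees of freedom.

```python
from scipy import stats

observed = [20, 30, 50]
expected = [25.0, 25.0, 50.0]

# Sum of (O - E)^2 / E over all categories
chi2_manual = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus one
dof = len(observed) - 1

# p-value: upper-tail probability (survival function) of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof)

print("Chi-square value:", chi2_manual)
print("P-value:", p_manual)
```

The first two categories each contribute 25 / 25 = 1 and the third contributes 0, which is where the statistic of 2.0 comes from.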
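In both scenarios, the expected frequencies and the chi-square statistic show whether the differences between observed and expected data are larger than chance alone would plausibly produce. A very small p-value, as in the contingency table example, indicates a statistically significant difference; a large one, as in the goodness-of-fit example, does not.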
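One way to turn the p-values into a decision is sketched below. The 0.05 threshold and the helper name `is_significant` are assumptions of mine, not part of SciPy; the right significance level depends on the study.

```python
ALPHA = 0.05  # assumed significance level; choose one appropriate to the study


def is_significant(p_value, alpha=ALPHA):
    """Return True when the p-value falls below the chosen significance level."""
    return p_value < alpha


# p-values taken from the two examples in this walkthrough
p_independence = 7.707766001215446e-11
p_goodness_of_fit = 0.36787944117144245

print("Independence test significant:", is_significant(p_independence))
print("Goodness-of-fit test significant:", is_significant(p_goodness_of_fit))
```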