Standard Error in Statistics

The standard error $ e_x $ is the standard deviation of the sampling distribution of the sample mean. It indicates how far we expect the sample mean $ E(X) $ to fall from the true population mean $ \mu $.

The standard error is calculated using the formula:

$$ e_x = \frac{\sigma}{\sqrt{n}} $$

Where:

$ \sigma $ is the standard deviation of the population, which measures how spread out the data are within the population.
$ n $ is the sample size, i.e., the number of observations in the sample.

If the sample mean is $ E(X) $ and the sampling distribution is approximately normal, the true population mean $ \mu $ lies within one standard error of the sample mean in roughly 68% of samples, that is, within the range:

$$ E(X) \pm e_x $$

From the formula $ e_x = \frac{\sigma}{\sqrt{n}} $, we see that as the sample size $ n $ increases, the standard error decreases.

This happens because larger samples tend to provide sample means that are closer to the true population mean.
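The effect of the sample size can be checked with a few lines of Python. This is only a minimal sketch: the function name `standard_error` and the value $ \sigma = 10 $ are illustrative assumptions, not part of the text above.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean when the population sd (sigma) is known."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return sigma / math.sqrt(n)

# Hypothetical sigma = 10; only the sample size changes.
for n in (25, 100, 400):
    print(n, standard_error(10.0, n))
```

Each fourfold increase in $ n $ cuts the printed standard error in half.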

Note: If the sample size is sufficiently large (typically n>30), the sample means are approximately normally distributed even if the population itself is not, so the standard error can be interpreted using the normal distribution. Conversely, if the sample size is small and the population is not normally distributed, conclusions based on the standard error may be less reliable, because the approximation provided by the central limit theorem is weaker in this case.

A practical example

Suppose I want to estimate the average income of workers in a city.

Since I can’t interview every worker, I take a random sample of $n= 50 $ individuals from the population and analyze the data.

$$ n = 50 $$

The average income of the 50 individuals in the sample (the sample mean) is €2,500 per month.

$$ E(X) = 2,500 \ € $$

For this example, we’ll assume that the standard deviation of incomes in the population ( $ \sigma $ ) is known and is €600. This reflects the variability of incomes within the population.

$$ \sigma = 600 \ € $$

Now I want to calculate how close the sample mean of €2,500 is to the true population mean.

To do this, I calculate the standard error of the sample mean:

$$ e_x = \frac{\sigma}{\sqrt{n}} = \frac{600}{\sqrt{50}} \approx \frac{600}{7.07} \approx 84.86 $$

Therefore, the standard error is approximately €84.86.

This means that if I took multiple random samples of 50 people, their sample means would typically vary around the true population mean by about €84.86.
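The same calculation can be reproduced in Python as a quick check. The variable names are my own. Because the code does not round $ \sqrt{50} $ before dividing, it prints about 84.85, a cent away from the hand-calculated figure.

```python
import math

sigma = 600.0  # population standard deviation in euros (from the example)
n = 50         # sample size (from the example)

se = sigma / math.sqrt(n)
print(f"standard error: about {se:.2f} EUR")
```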

Note: The sample means collected from multiple random samples of the same size form the sampling distribution of the mean. Thanks to the central limit theorem, this distribution tends to become normally distributed as the sample size increases, regardless of the original distribution of the population. This allows me to use the properties of the normal distribution to build confidence intervals around the sample mean, even if the population isn’t normally distributed.
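A small simulation can make this idea concrete. The sketch below is built on assumptions that are not in the example: the population is modeled as a shifted exponential distribution (deliberately skewed, with mean 2,500 and standard deviation 600), and the number of repeated samples is arbitrary. It compares the observed spread of the sample means with $ \frac{\sigma}{\sqrt{n}} $.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

N_SAMPLES = 5_000  # number of repeated samples (arbitrary choice)
n = 50             # size of each sample
sigma = 600.0      # population standard deviation

def draw_income() -> float:
    # Hypothetical skewed population: 1,900 plus an exponential
    # variable with mean 600, giving mean 2,500 and sd 600.
    return 1_900.0 + random.expovariate(1.0 / sigma)

# Mean of each repeated sample of size n
sample_means = [
    statistics.fmean([draw_income() for _ in range(n)])
    for _ in range(N_SAMPLES)
]

observed = statistics.stdev(sample_means)  # spread of the sample means
theoretical = sigma / math.sqrt(n)         # standard error from the formula

print(f"observed sd of sample means: {observed:.2f}")
print(f"sigma / sqrt(n):             {theoretical:.2f}")
```

The two printed values should be close to each other, even though the simulated population is far from normal.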

Constructing a confidence interval

Using the standard error, I can construct a confidence interval to estimate the true average income of the population.

For instance, I want to build a 95% confidence interval (CI).

$$ CI_{95} = \bar{x} \pm z \times e_x $$

Where $ z $ is the critical value associated with the confidence level in a standard normal distribution, $ \bar{x} $ is the sample mean (written $ E(X) $ earlier), and $ e_x $ is the standard error.

Note: Critical values (z) are typically found using a standard normal distribution table or statistical software that calculates cumulative probabilities for each z-value. Here are some commonly used critical values:

90% confidence: z = 1.645
95% confidence: z = 1.96
99% confidence: z = 2.576
99.9% confidence: z = 3.291
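If statistical software is preferred over a table, these critical values can be obtained from the inverse cumulative distribution function of the standard normal. The sketch below uses `NormalDist` from Python's standard library. The helper name `z_critical` is my own.

```python
from statistics import NormalDist

def z_critical(confidence: float) -> float:
    """Two-sided critical value of the standard normal distribution."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Leave alpha / 2 of probability in each tail
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99, 0.999):
    print(f"{level:.1%} confidence: z = {z_critical(level):.3f}")
```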

For a 95% confidence level in a standard normal distribution, I should use a critical value of z = 1.96.

$$ CI_{95} = \bar{x} \pm 1.96 \times e_x $$

Given that the sample mean is $ \bar{x} = 2,500 \ € $ and the standard error is $ e_x = 84.86 \ € $:

$$ CI_{95} = 2,500 \pm 1.96 \times 84.86 \approx 2,500 \pm 166.32 $$

$$ CI_{95} = [2,333.68 \, \text{€}, 2,666.32 \, \text{€}] $$

In conclusion, with 95% confidence, I can say that the true average income of the population lies between €2,333.68 and €2,666.32.

This interval accounts for the uncertainty in the sample mean estimate, as reflected by the standard error.
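The whole interval calculation fits in a short function. This is a sketch that assumes $ \sigma $ is known and the normal approximation holds. The function name and its parameters are illustrative. Since the code keeps full precision for both $ z $ and the standard error, the bounds can differ from the hand calculation by about a cent.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sigma, n, confidence=0.95):
    """Normal-based confidence interval for the mean, with sigma known."""
    se = sigma / math.sqrt(n)                         # standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    margin = z * se
    return mean - margin, mean + margin

# Values from the income example
low, high = mean_confidence_interval(2500.0, 600.0, 50)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```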

Note: If I wanted to reduce the standard error and get a more precise estimate, I would need to increase the sample size. For example, to halve the standard error, I would need to quadruple the sample size. If the new sample size were $ 4 \times 50 = 200 $, the standard error would become:

$$ e_x = \frac{600}{\sqrt{200}} \approx \frac{600}{14.14} \approx 42.43 $$

The standard error would now be reduced to €42.43, halving the variability around the estimated mean, which makes the confidence interval narrower and the estimate more precise.
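As a quick numerical check of this note, the two standard errors can be computed side by side. The variable names are my own.

```python
import math

sigma = 600.0  # population standard deviation from the example

se_50 = sigma / math.sqrt(50)    # original sample size
se_200 = sigma / math.sqrt(200)  # quadrupled sample size

print(f"n = 50:  SE is about {se_50:.2f}")
print(f"n = 200: SE is about {se_200:.2f}")
print(f"ratio:   {se_50 / se_200:.2f}")
```

The ratio of the two values is 2, confirming that quadrupling the sample halves the standard error.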

Relationship between standard error and sample size

Since the standard error is inversely proportional to the square root of the sample size $ n $, it’s important to remember this relationship:

If you want to reduce the standard error by a factor of $ q $, you need to increase the sample size by a factor of $ q^2 $:

$$ \frac{1}{q} \times \frac{\sigma}{\sqrt{n}} = \frac{\sigma}{\sqrt{q^2 \times n}} $$

For example, to halve the standard error ( $ q = 2 $ ), you need to quadruple the sample size to $ 4n $. This is because:

$$ \frac{1}{2} \times \frac{\sigma}{\sqrt{n}} = \frac{\sigma}{\sqrt{4n}} $$

Therefore, doubling the precision (halving the error) requires considerably more effort in terms of the number of observations.

In other words, to obtain more accurate estimates, you need to significantly increase the sample size and collect a lot more data.

Example: If I have a sample size of $ n = 100 $ and a standard error $ e_x = 5 $, to reduce the standard error to 2.5 (half its value), I would need to increase the sample size to $ 4 \times 100 = 400 $ observations. This is because the standard error only decreases with the square root of $ n $, so a much larger sample is required to improve the estimate’s precision.
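The relationship can also be inverted to ask how many observations a target standard error requires, since $ n = (\sigma / e_x)^2 $. The helper below is a sketch with a hypothetical name. It assumes $ \sigma $ is known, which in this example can be recovered as $ 5 \times \sqrt{100} = 50 $.

```python
import math

def required_sample_size(sigma: float, target_se: float) -> int:
    """Smallest n for which sigma / sqrt(n) does not exceed target_se."""
    if sigma <= 0 or target_se <= 0:
        raise ValueError("sigma and target_se must be positive")
    return math.ceil((sigma / target_se) ** 2)

# n = 100 with a standard error of 5 implies sigma = 5 * sqrt(100) = 50
sigma = 5.0 * math.sqrt(100)

print(required_sample_size(sigma, 5.0))  # current precision
print(required_sample_size(sigma, 2.5))  # half the standard error
```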

Standard error on a single sample

When working with a single sample, if I know the population standard deviation $ \sigma $, I can calculate the standard error of the sample mean using the classic formula:

$$ \frac{\sigma}{\sqrt{n}} $$

However, in most cases, the population standard deviation $ \sigma $ is unknown. Therefore, I need an alternative method to calculate the standard error.

If $ \sigma $ is unknown, I can calculate the standard deviation from the sample itself.

In this case, however, the standard error tends to be underestimated, because the sample standard deviation is a biased estimator of $ \sigma $.

In particular, when the sample size is small, using the sample standard deviation instead of the true population standard deviation tends to systematically underestimate the population’s standard deviation, and thus, the standard error as well. For example, with n=2, the underestimation is around 25%, and for n=6, it's about 5%.

To correct this issue, I calculate the standard error using the following formula:

$$ e_x = \frac{s}{\sqrt{n}} $$

Where $ s $ is the sample standard deviation:

$$ s = \sqrt{ \frac{1}{n-1} \cdot \sum_{i=1}^n (x_i-\bar{x})^2 } $$

In the sample standard deviation formula, the denominator is $ n-1 $ instead of $ n $. This is known as Bessel’s correction, and it compensates for the bias when using a sample instead of the entire population.

This correction ensures that the standard error calculated from the sample is more accurate.
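In Python, `statistics.stdev` already divides by $ n-1 $, so the standard error of a single sample can be sketched as follows. The function name and the data are made up for illustration.

```python
import math
import statistics

def standard_error_from_sample(data) -> float:
    """Estimate the standard error of the mean from one sample."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are required")
    s = statistics.stdev(data)  # sample sd with Bessel's correction (n - 1)
    return s / math.sqrt(n)

# Hypothetical monthly incomes in euros
incomes = [2300.0, 2450.0, 2500.0, 2620.0, 2710.0, 2380.0]
print(f"{standard_error_from_sample(incomes):.2f}")
```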

Note: In some texts, I’ve seen the correction applied directly to the standard error itself:

$$ e_x = \frac{ \sigma_s}{\sqrt{n-1}} $$

In these cases, the standard deviation of the sample is calculated without Bessel’s correction:

$$ \sigma_s = \sqrt{ \frac{1}{n} \cdot \sum_{i=1}^n (x_i-\bar{x})^2 } $$

This method produces a mathematically equivalent standard error, but it’s not recommended, because the standard deviation calculated this way is underestimated. The equivalence can be shown step by step:

$$ e_x = \frac{s}{\sqrt{n}} $$

$$ e_x = \frac{ \sqrt{ \frac{1}{n-1} \cdot \sum_{i=1}^n (x_i-\bar{x})^2 } }{\sqrt{n}} $$

$$ e_x = \sqrt{ \frac{1}{n} \cdot \frac{1}{n-1} \cdot \sum_{i=1}^n (x_i-\bar{x})^2 } $$

$$ e_x = \sqrt{ \frac{1}{n} \cdot \sum_{i=1}^n (x_i-\bar{x})^2 } \cdot \frac{1}{\sqrt{ n-1 }} $$

$$ e_x = \sigma_s \cdot \frac{1}{\sqrt{ n-1 }} $$

$$ e_x = \frac{ \sigma_s}{\sqrt{ n-1 }} $$

Where $ \sigma_s $ is the standard deviation computed from the sample with the population formula (divisor $ n $) rather than the Bessel-corrected version.
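The equivalence of the two forms can be verified numerically. In the sketch below, `statistics.stdev` divides by $ n-1 $ and `statistics.pstdev` divides by $ n $. The data are invented for the check.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
n = len(data)

# s / sqrt(n), with s computed using Bessel's correction
se_bessel = statistics.stdev(data) / math.sqrt(n)

# sigma_s / sqrt(n - 1), with sigma_s computed using divisor n
se_alt = statistics.pstdev(data) / math.sqrt(n - 1)

print(f"{se_bessel:.6f} {se_alt:.6f}")
```

Both expressions give the same number up to floating-point rounding.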
