Inference About Population Variances

MATH 4720/MSSC 5720 Introduction to Statistics

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Inference About Population Variances

  • One Population Variance

  • Comparing Two Population Variances

Inference for One Population Variance

Why Inference for Population Variances?

  • We would like to know whether \(\sigma_1 = \sigma_2\), so that the correct, or a better, method can be used.

Which test that we have learned requires \(\sigma_1 = \sigma_2\)?

In some situations, we care about variation!

  • the variation in potency of drugs: affects patients’ health

  • the variance of stock prices: the higher the variance, the riskier the investment

Inference for Population Variances

  • The sample variance \(S^2 = \frac{\sum_{i=1}^n(X_i - \overline{X})^2}{n-1}\) is our point estimator of the population variance \(\sigma^2\) (see the quick R check below).

  • Inference for \(\sigma^2\) requires the population to be normal.

❗ The methods can work poorly if normality is violated, even if the sample is large.
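
As a quick check of the formula above (a minimal sketch, with an arbitrary simulated sample), we can verify in R that the built-in var() matches the definition:

set.seed(1)
x <- rnorm(10, mean = 5, sd = 2)  # any small sample works here
n <- length(x)
sum((x - mean(x)) ^ 2) / (n - 1)  # S^2 from the definition
var(x)                            # built-in; returns the same value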

Chi-Squared \(\chi^2\) Distribution

The inference for \(\sigma^2\) involves the \(\chi^2\) distribution.

  • Defined over positive numbers

  • Parameter: degrees of freedom \(df\)

  • Right skewed

  • More symmetric as \(df\) gets larger
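
A quick way to see these properties is to plot a few \(\chi^2\) densities in R (a minimal sketch; the df values here are arbitrary):

curve(dchisq(x, df = 3), from = 0, to = 40, ylab = "density")
curve(dchisq(x, df = 10), add = TRUE, lty = 2)
curve(dchisq(x, df = 20), add = TRUE, lty = 3)
legend("topright", legend = paste("df =", c(3, 10, 20)), lty = 1:3)
## right-skewed for small df; closer to symmetric as df grows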

Upper Tail and Lower Tail of Chi-Square

  • \(\chi^2_{\frac{\alpha}{2},\, df}\) is the quantile with area \(\alpha/2\) to its right.

  • \(\chi^2_{1-\frac{\alpha}{2},\, df}\) is the quantile with area \(\alpha/2\) to its left.

  • In \(N(0, 1)\), \(z_{1-\frac{\alpha}{2}} = -z_{\frac{\alpha}{2}}\), but \(\chi^2_{1-\frac{\alpha}{2},\,df} \ne -\chi^2_{\frac{\alpha}{2},\,df}\) because the \(\chi^2\) distribution is not symmetric.
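
In R, the two quantiles come from qchisq() with the lower.tail argument (here df = 15, matching the supermodel example later):

al <- 0.05
qchisq(al / 2, df = 15, lower.tail = FALSE)  # upper-tail quantile, about 27.49
qchisq(al / 2, df = 15, lower.tail = TRUE)   # lower-tail quantile, about 6.26
## unlike z quantiles, these are not negatives of each other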

Sampling Distribution

  • When a random sample of size \(n\) is from \(\color{red}{N(\mu, \sigma^2)}\), \[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \]

  • The inference method for \(\sigma^2\) introduced here can work poorly if the normality assumption is violated, even for large samples!
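
A small simulation (a sketch with arbitrary \(n\) and \(\sigma^2\)) illustrates this sampling distribution:

set.seed(4720)
n <- 10; sigma2 <- 4
stat <- replicate(10000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  (n - 1) * var(x) / sigma2
})
mean(stat)  # close to df = n - 1 = 9
var(stat)   # close to 2 * df = 18
## hist(stat, freq = FALSE); curve(dchisq(x, df = n - 1), add = TRUE)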

\((1-\alpha)100\%\) Confidence Interval for \(\sigma^2\)

\((1-\alpha)100\%\) CI for \(\sigma^2\) is \[\color{blue}{\left( \frac{(n-1)S^2}{\chi^2_{\frac{\alpha}{2}, \, n-1}}, \frac{(n-1)S^2}{\chi^2_{1-\frac{\alpha}{2}, \, n-1}} \right)}\]

❗ The CI for \(\sigma^2\) cannot be expressed as \((S^2-m, S^2+m)\) anymore!

Example: Supermodel Heights

Listed below are heights (cm) for the simple random sample of 16 female supermodels:

heights <- c(178, 177, 176, 174, 175, 178, 175, 178, 
             178, 177, 180, 176, 180, 178, 180, 176)
  • Assume the supermodels’ heights are normally distributed.

  • Construct a \(95\%\) confidence interval for the population standard deviation \(\sigma\).

  • \(n = 16\), \(s^2 = 3.4\), \(\alpha = 0.05\).

  • \(\chi^2_{\alpha/2, n-1} = \chi^2_{0.025, 15} = 27.49\)

  • \(\chi^2_{1-\alpha/2, n-1} = \chi^2_{0.975, 15} = 6.26\)

  • The \(95\%\) CI for \(\sigma\) is \(\small \left( \sqrt{\frac{(n-1)s^2}{\chi^2_{\frac{\alpha}{2}, \, n-1}}}, \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\frac{\alpha}{2}, \, n-1}}} \right) = \left( \sqrt{\frac{(16-1)(3.4)}{27.49}}, \sqrt{\frac{(16-1)(3.4)}{6.26}}\right) = (1.36, 2.85)\)

Example: Computation in R

n <- 16
s2 <- var(heights)
al <- 0.05

## two chi-square critical values
chi2_right <- qchisq(al / 2, df = n - 1, lower.tail = FALSE)
chi2_left <- qchisq(al / 2, df = n - 1, lower.tail = TRUE)

## two bounds of CI for sigma2
ci_lwr <- (n - 1) * s2 / chi2_right
ci_upr <- (n - 1) * s2 / chi2_left


## two bounds of CI for sigma
sqrt(ci_lwr)
[1] 1.36
sqrt(ci_upr)
[1] 2.85
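
Since the interval relies on normality, it is worth checking that assumption. A quick sketch using the QQ-plot and Shapiro-Wilk test mentioned later in these slides:

qqnorm(heights); qqline(heights)  # points near the line suggest normality
shapiro.test(heights)             # a large p-value gives no strong evidence against normality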

Example Cont’d: Testing

Use \(\alpha = 0.05\) to test the claim that “supermodels have heights with a standard deviation that is less than \(\sigma = 7.5\) cm for the population of women”.

  • Step 1: \(H_0: \sigma = \sigma_0\) vs. \(H_1: \sigma < \sigma_0\). Here \(\sigma_0 = 7.5\) cm
  • Step 2: \(\alpha = 0.05\)
  • Step 3: Under \(H_0\), \(\chi_{test}^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{(16-1)(3.4)}{7.5^2} = 0.91\), which follows the \(\chi^2_{n-1}\) distribution.
  • Step 4-c: This is a left-tailed test. The critical value is \(\chi_{1-\alpha, df}^2 = \chi_{0.95, 15}^2 = 7.26\)
  • Step 5-c: Reject \(H_0\) in favor of \(H_1\) if \(\chi_{test}^2 < \chi_{1-\alpha, df}^2\). Since \(0.91 < 7.26\), we reject \(H_0\).

  • Step 6: There is sufficient evidence to support the claim that supermodels have heights with a SD that is less than the SD for the population of women.

Heights of supermodels vary less than heights of women in the general population.
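
The test can be reproduced in R (a sketch using the summary statistics above):

n <- 16; s2 <- 3.4; sigma0 <- 7.5
(chi2_test <- (n - 1) * s2 / sigma0 ^ 2)  # test statistic, about 0.91
qchisq(0.05, df = n - 1)                  # left-tail critical value, about 7.26
pchisq(chi2_test, df = n - 1)             # left-tail p-value, far below 0.05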

Back to Pooled t-Test

In a pooled t-test, we assume

  • both samples are large or drawn from normal populations.

  • \(\sigma_1 = \sigma_2\)

  • Use a QQ-plot (and normality tests such as Anderson-Darling and Shapiro-Wilk) to check the normality assumption.

  • Here we learn how to check the assumption \(\sigma_1 = \sigma_2\).

Inference for Comparing Two Population Variances

F Distribution

We use the \(F\) distribution for inference about two population variances.

  • Two parameters: \(df_1\), \(df_2\)

  • Right skewed

  • Defined over positive numbers

Upper and Lower Tail of F Distribution

  • We denote \(F_{\alpha, \, df_1, \, df_2}\) as the \(F\) quantile so that \(P(F_{df_1, df_2} > F_{\alpha, \, df_1, \, df_2}) = \alpha\).
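
In R, these quantiles come from qf(). A useful fact is \(F_{1-\alpha,\,df_1,\,df_2} = 1/F_{\alpha,\,df_2,\,df_1}\); a sketch with df1 = df2 = 9, matching the weight loss example later:

al <- 0.05
qf(al / 2, df1 = 9, df2 = 9, lower.tail = FALSE)      # upper-tail quantile, about 4.03
qf(al / 2, df1 = 9, df2 = 9, lower.tail = TRUE)       # lower-tail quantile, about 0.25
1 / qf(al / 2, df1 = 9, df2 = 9, lower.tail = FALSE)  # reciprocal relation (df1 = df2 here)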

Sampling Distribution

  • Suppose independent random samples of sizes \(n_1\) and \(n_2\) are drawn from two normal populations, \(N(\mu_1, \sigma_1^2)\) and \(N(\mu_2, \sigma_2^2)\).

  • The ratio \[\frac{S_1^2/S_2^2}{\sigma_1^2/\sigma_2^2} \sim F_{n_1-1, \, n_2-1}\]
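
A small simulation (arbitrary sample sizes and SDs; a sketch, not part of the original example) illustrates this sampling distribution:

set.seed(5720)
n1 <- 8; n2 <- 12
ratio <- replicate(10000, {
  x <- rnorm(n1, sd = 2)  # sigma_1 = 2
  y <- rnorm(n2, sd = 3)  # sigma_2 = 3
  (var(x) / var(y)) / (2 ^ 2 / 3 ^ 2)
})
quantile(ratio, 0.95)                 # empirical upper 5% point ...
qf(0.95, df1 = n1 - 1, df2 = n2 - 1)  # ... close to the theoretical F quantile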

\((1-\alpha)100\%\) Confidence Interval for \(\sigma_1^2 / \sigma_2^2\)

\((1-\alpha)100\%\) CI for \(\sigma_1^2 / \sigma_2^2\) is \[\color{blue}{\left( \frac{s_1^2/s_2^2}{F_{\alpha/2, \, n_1 - 1, \, n_2 - 1}}, \frac{s_1^2/s_2^2}{F_{1-\alpha/2, \, n_1 - 1, \, n_2 - 1}} \right)}\]

❗ The CI for \(\sigma_1^2 / \sigma_2^2\) cannot be expressed as \(\left(\frac{s_1^2}{s_2^2}-m, \frac{s_1^2}{s_2^2} + m\right)\) anymore!

F test for comparing \(\sigma_1^2\) and \(\sigma_2^2\)

  • Step 1: right-tailed \(\small \begin{align} &H_0: \sigma_1 \le \sigma_2 \\ &H_1: \sigma_1 > \sigma_2 \end{align}\) and two-tailed \(\small \begin{align} &H_0: \sigma_1 = \sigma_2 \\ &H_1: \sigma_1 \ne \sigma_2 \end{align}\)
  • Step 2: Choose a significance level \(\alpha\), e.g., \(\alpha = 0.05\).
  • Step 3: Under \(H_0\), \(\sigma_1 = \sigma_2\), and the test statistic is

\[\small F_{test} = \frac{s_1^2/s_2^2}{\sigma_1^2/\sigma_2^2} = \frac{s_1^2}{s_2^2} \sim F_{n_1-1, \, n_2-1}\]

  • Step 4-c:
    • Right-tailed: \(F_{\alpha, \, n_1-1, \, n_2-1}\).
    • Two-tailed: \(F_{\alpha/2, \, n_1-1, \, n_2-1}\) or \(F_{1-\alpha/2, \, n_1-1, \, n_2-1}\)
  • Step 5-c:
    • Right-tailed: reject \(H_0\) if \(F_{test} \ge F_{\alpha, \, n_1-1, \, n_2-1}\).
    • Two-tailed: reject \(H_0\) if \(F_{test} \ge F_{\alpha/2, \, n_1-1, \, n_2-1}\) or \(F_{test} \le F_{1-\alpha/2, \, n_1-1, \, n_2-1}\)

Back to the Weight Loss Example

A study was conducted to see the effectiveness of a weight loss program.

  • Two groups (Control and Experimental) of 10 subjects were selected.

  • The two populations are normally distributed and have the same SD.

  • The data on weight loss were collected at the end of six months:

    • Control: \(n_1 = 10\), \(\overline{x}_1 = 2.1\, lb\), \(s_1 = 0.5\, lb\)
    • Experimental: \(n_2 = 10\), \(\overline{x}_2 = 4.2\, lb\), \(s_2 = 0.7\, lb\)
  • Assumptions:

    • \(\sigma_1 = \sigma_2\)

    • The weight loss for each group is normally distributed.

Back to the Weight Loss Example: Check if \(\sigma_1 = \sigma_2\)

  • \(n_1 = 10\), \(s_1 = 0.5 \, lb\)

  • \(n_2 = 10\), \(s_2 = 0.7 \, lb\)

  • Step 1: \(\begin{align} &H_0: \sigma_1 = \sigma_2 \\ &H_1: \sigma_1 \ne \sigma_2 \end{align}\)

  • Step 2: \(\alpha = 0.05\)

  • Step 3: \(F_{test} = \frac{s_1^2}{s_2^2} = \frac{0.5^2}{0.7^2} = 0.51\).

  • Step 4-c: Two-tailed test. The critical value is \(F_{0.05/2, \, 10-1, \, 10-1} = 4.03\) or \(F_{1-0.05/2, \, 10-1, \, 10-1} = 0.25\).

  • Step 5-c: Is \(F_{test} > 4.03\) or \(F_{test} < 0.25\)? No.

  • Step 6: The evidence is not sufficient to reject the claim that \(\sigma_1 = \sigma_2\).

Back to the Weight Loss Example: 95% CI for \(\sigma_1^2 / \sigma_2^2\)

  • The 95% CI for \(\sigma_1^2 / \sigma_2^2\) is \[\small \begin{align} &\left( \frac{s_1^2/s_2^2}{F_{\alpha/2, \, df_1, \, df_2}}, \frac{s_1^2/s_2^2}{F_{1-\alpha/2, \, df_1, \, df_2}} \right) \\ &= \left( \frac{0.51}{4.03}, \frac{0.51}{0.25} \right) = \left(0.13, 2.04\right)\end{align}\]
  • We are 95% confident that the ratio \(\sigma_1^2 / \sigma_2^2\) is between 0.13 and 2.04.

Implementing F-test in R

n1 <- 10; n2 <- 10
s1 <- 0.5; s2 <- 0.7
al <- 0.05

## 95% CI for sigma_1^2 / sigma_2^2
f_small <- qf(p = al / 2, 
              df1 = n1 - 1, df2 = n2 - 1, 
              lower.tail = TRUE)
f_big <- qf(p = al / 2, 
            df1 = n1 - 1, df2 = n2 - 1, 
            lower.tail = FALSE)
## lower bound
(s1 ^ 2 / s2 ^ 2) / f_big
[1] 0.127
## upper bound
(s1 ^ 2 / s2 ^ 2) / f_small
[1] 2.05
## Testing sigma_1 = sigma_2
(test_stats <- s1 ^ 2 / s2 ^ 2)
[1] 0.51
(cri_big <- qf(p = al / 2, 
               df1 = n1 - 1, 
               df2 = n2 - 1, 
               lower.tail = FALSE))
[1] 4.03
(cri_small <- qf(p = al / 2, 
                 df1 = n1 - 1, 
                 df2 = n2 - 1, 
                 lower.tail = TRUE))
[1] 0.248
# var.test(x, y, alternative = "two.sided")
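
var.test() needs the raw observations, which the weight loss example does not provide; below is a sketch with simulated data (hypothetical, only to mimic the setting above):

set.seed(2024)
x <- rnorm(10, mean = 2.1, sd = 0.5)  # hypothetical control data
y <- rnorm(10, mean = 4.2, sd = 0.7)  # hypothetical experimental data
var.test(x, y, alternative = "two.sided")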