
MATH 4720/MSSC 5720 Introduction to Statistics
| Subject | Gender |
|---|---|
| 1 | Male |
| 2 | Female |
| \(\vdots\) | \(\vdots\) |
| \(n\) | Female |
One-way frequency/count table
| \(X\) | Count |
|---|---|
| Male | \(y\) |
| Female | \(n-y\) |

The sample male proportion is \(y/n\).
Goal: estimate or test gender ratio or the male population proportion
What probability distribution might be appropriate for the count \(Y\)?
\(binomial(n, \pi)\) could be a good option for the count data with 2 categories.
Fixed number of trials. (Fixed \(n\) subjects)
Each trial results in one of two outcomes. (Either \(M\) or \(F\))
Trials are independent. (If sample subjects randomly)
If the proportion of being in category \(M\) is \(\pi\), the count \(Y\) has \[P(Y = y \mid n, \pi) = \frac{n!}{y!(n-y)!}\pi^{y}(1-\pi)^{n-y}\]
Our job is to do inference about \(\pi\).
| Voter | R | D |
|---|---|---|
| 1 | x | |
| 2 | x | |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| 1000 | x |

Let \(Y\) be the number of voters voted for R.
Assume \(Y \sim binomial(n = 1000, \pi)\).
\(\pi = P(\text{a voter voted for R}) =\) population proportion of all voters voted for R: A unknown parameter to be estimated or tested.
Predict whether or not R won the election.
| Voter | R | D |
|---|---|---|
| 1 | x | |
| 2 | x | |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| 1000 | x |
| Candidate | Count | Proportion |
|---|---|---|
| R | \(y\) | \(\hat{\pi} = y/n\) |
| D | \(n-y\) | \(1-\hat{\pi} = 1- \frac{y}{n}\) |
What are \(H_0\) and \(H_1\)?
Step 0: Assumptions
Step 1: Null and Alternative Hypothesis
Step 2: Set \(\alpha\)
Step 3: Test Statistic
The sampling distribution of \(\hat{\pi}\) is approximately normal with mean \(\pi\) and standard error \(\sqrt{\frac{\pi(1-\pi)}{n}}\) if \(y_i\) are independent and the assumptions are satisfied.
Step 4-c: Critical Value \(z_{\alpha}\) (one-tailed) or \(z_{\alpha/2}\) (two-tailed)
Step 5-c: Draw a Conclusion Using Critical Value Method
Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
In an exit poll of 1,000 voters, 520 voted for R, one of the two candidates.
Step 0: \(n\pi_0 = 1000(1/2) = 500 \ge 5\) and \(n(1-\pi_0) \ge 5\)
Step 1: \(\begin{align} &H_0: \pi \le 1/2 \\ &H_1: \pi > 1/2 \end{align}\)
Step 2: \(\alpha = 0.05\)
Step 3: \(z_{test} = \frac{\hat{\pi} - \pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}} = \frac{\frac{520}{1000} - 0.5}{\sqrt{\frac{0.5(1-0.5)}{1000}}} = 1.26\)
Step 4: \(z_{\alpha} = z_{0.05} = 1.645\)
Step 5: Reject \(H_0\) in favor of \(H_1\) if \(z_{test} > z_{\alpha}\). Since \(z_{test} < z_{\alpha}\), we do not reject \(H_0\).
Step 6: We do not have sufficient evidence to conclude that R won.
We make the same conclusion based on \(p\)-value.
\[ p\text{-value} = P(Z > 1.26) = 0.1 > 0.05\]
Assumptions: \(n\hat{\pi} \ge 5\) and \(n(1-\hat{\pi}) \ge 5\)
A \(100(1 - \alpha)\%\) confidence interval for \(\pi\) is \[\hat{\pi} \pm z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\] where \(\hat{\pi} = y/n\).
👉 No hypothesized value \(\pi_0\) in the confidence interval.
Assumption: \(n\hat{\pi} = 1000(0.52) = 520 \ge 5\) and \(n(1-\hat{\pi}) = 480 \ge 5\).
Estimate the proportion of all voters voted for R using 95% confidence interval:
\[\hat{\pi} \pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}} = 0.52 \pm z_{0.025}\sqrt{\frac{0.52(1-0.52)}{1000}} = (0.49, 0.55).\]
# Use alternative = "two.sided" to get CI
prop.test(x = 520, n = 1000, p = 0.5, alternative = "greater", correct = FALSE)
1-sample proportions test without continuity correction
data: 520 out of 1000, null probability 0.5
X-squared = 2, df = 1, p-value = 0.1
alternative hypothesis: true p is greater than 0.5
95 percent confidence interval:
0.494 1.000
sample estimates:
p
0.52
| Group 1 (M) | Group 2 (F) | |
|---|---|---|
| trials | \(n_1\) | \(n_2\) |
| number of successes | \(Y_1\) | \(Y_2\) |
| distribution | \(binomial(n_1, \pi_1)\) | \(binomial(n_2, \pi_2)\) |
| Voter | Gender | Approve |
|---|---|---|
| 1 | M | Yes |
| 2 | F | Yes |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| 1000 | F | No |

Step 0: Assumptions
Step 1: Null and Alternative Hypothesis
Step 2: Set \(\alpha\)
Step 3: Test Statistic
Step 4-c: Find the Critical Value \(z_{\alpha}\) (one-tailed) or \(z_{\alpha/2}\) (two-tailed)
Step 5-c: Draw a Conclusion Using Critical Value Method
Step 6: Restate the Conclusion in Nontechnical Terms, and Address the Original Claim
A study on 300 students to compare the effectiveness of learning Statistics via online and in-person classes.

| Exam Results | Online Instruction | In-Person Instruction |
|---|---|---|
| Pass | 94 | 113 |
| Fail | 31 | 62 |
| Total | 125 | 175 |
\[\hat{\pi}_1 - \hat{\pi}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1}+\frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}\]
Want to know how much effective is the online program.
Estimate \(\pi_1 - \pi_2\) using a \(95\%\) confidence interval:
\[\hat{\pi}_1 - \hat{\pi}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1}+\frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}\]
\(z_{0.05/2} = 1.96\)
The 95% confidence interval is
\[0.75 - 0.65 \pm 1.96\sqrt{\frac{(0.75)(1-0.75)}{125} + \frac{(0.65)(1-0.65)}{175}}\\ = (0.002, 0.210)\]
| Subject | Variable |
|---|---|
| 1 | \(A_2\) |
| 2 | \(A_4\) |
| 3 | \(A_1\) |
| 4 | \(A_3\) |
| 5 | \(A_k\) |
| \(\vdots\) | \(\vdots\) |
| \(n\) | \(A_3\) |
With the size \(n\), for categories \(A_1, \dots , A_k\), their observed count is \(O_1, \dots, O_k\), and \(\sum_{i=1}^kO_i = n\).
One-way count table:
| \(A_1\) | \(A_2\) | \(\cdots\) | \(A_k\) | Total |
|---|---|---|---|---|
| \(O_1\) | \(O_2\) | \(\cdots\) | \(O_k\) | \(n\) |
Are the selected jurors are racially representative of the population?

| White | Black | Hispanic | Asian | Total | |
|---|---|---|---|---|---|
| Representation in juries | \(O_1 = 205\) | \(O_2 = 26\) | \(O_3 = 25\) | \(O_4 = 19\) | \(n = 275\) |
| Registered voters | \(\pi_1^0 = 0.72\) | \(\pi_2^0 = 0.07\) | \(\pi_3^0 = 0.12\) | \(\pi_4^0 = 0.09\) | \(1.00\) |
| White | Black | Hispanic | Asian | Total | |
|---|---|---|---|---|---|
| Representation in juries | \(O_1 = 205\) | \(O_2 = 26\) | \(O_3 = 25\) | \(O_4 = 19\) | \(n = 275\) |
| Registered voters | \(\pi_1^0 = 0.72\) | \(\pi_2^0 = 0.07\) | \(\pi_3^0 = 0.12\) | \(\pi_4^0 = 0.09\) | \(1.00\) |
| White | Black | Hispanic | Asian | Total | |
|---|---|---|---|---|---|
| Representation in juries | \(O_1/n = 0.75\) | \(O_2/n = 0.09\) | \(O_3/n = 0.09\) | \(O_4/n = 0.07\) | \(1.00\) |
| Registered voters | \(\pi_1^0 = 0.72\) | \(\pi_2^0 = 0.07\) | \(\pi_3^0 = 0.12\) | \(\pi_4^0 = 0.09\) | \(1.00\) |
| White | Black | Hispanic | Asian | Total | |
|---|---|---|---|---|---|
| Representation in juries | \(O_1 = 205\) | \(O_2 = 26\) | \(O_3 = 25\) | \(O_4 = 19\) | \(n = 275\) |
| Registered voters | \(\pi_1^0 = 0.72\) | \(\pi_2^0 = 0.07\) | \(\pi_3^0 = 0.12\) | \(\pi_4^0 = 0.09\) | \(1.00\) |
If the individuals are randomly selected to serve on a jury, about how many of the 275 people would we expect to be white? How about Hispanic?
| White | Black | Hispanic | Asian | |
|---|---|---|---|---|
| Representation in juries | \(O_1 = 205\) | \(O_2 = 26\) | \(O_3 = 25\) | \(O_4 = 19\) |
| Hypothesized expected number | \(275 \times \pi_1^0 = 198\) | \(275 \times \pi_2^0 = 19.25\) | \(275 \times \pi_3^0 = 33\) | \(275 \times \pi_4^0 = 24.75\) |
Are the proportions (counts) of juries close enough to the proportions (counts) of registered voters, so that we are confident saying that the jurors really were randomly sampled from the registered voters?
| White | Black | Hispanic | Asian | Total | |
|---|---|---|---|---|---|
| Observed Count | \(O_1 = 205\) | \(O_2 = 26\) | \(O_3 = 25\) | \(O_4 = 19\) | \(n = 275\) |
| Expected Count | \(E_1 = 198\) | \(E_2 = 19.25\) | \(E_3 = 33\) | \(E_4 = 24.75\) | \(n = 275\) |
| Proportion under \(H_0\) | \(\pi_1^0 = 0.72\) | \(\pi_2^0 = 0.07\) | \(\pi_3^0 = 0.12\) | \(\pi_4^0 = 0.09\) | \(1.00\) |
| White | Black | Hispanic | Asian | Total | |
|---|---|---|---|---|---|
| Observed Count | 205 | 26 | 25 | 19 | 275 |
| Expected Count | 198 | 19.25 | 33 | 24.75 | 275 |
| Population Proportion \((H_0)\) | 0.72 | 0.07 | 0.12 | 0.09 | 1.00 |
Observed Count and Expected Count are similar if no bias on juries.
Test whether the differences are strong enough to provide convincing evidence that the jurors are not a random sample of registered voters.
\(\begin{align} &H_0: \text{No racial bias in who serves on a jury, and } \\ &H_1: \text{There is racial bias in juror selection} \end{align}\)
\(\begin{align} &H_0: \pi_1 = \pi_1^0, \pi_2 = \pi_2^0, \dots, \pi_k = \pi_k^0\\ &H_1: \pi_i \ne \pi_i^0 \text{ for some } i \end{align}\)
| White | Black | Hispanic | Asian | Total | |
|---|---|---|---|---|---|
| Observed Count | \(O_1 = 205\) | \(O_2 = 26\) | \(O_3 = 25\) | \(O_4 = 19\) | \(n = 275\) |
| Expected Count | \(E_1 = 198\) | \(E_2 = 19.25\) | \(E_3 = 33\) | \(E_4 = 24.75\) | \(n = 275\) |
| Proportion under \(H_0\) | \(\pi_1^0 = 0.72\) | \(\pi_2^0 = 0.07\) | \(\pi_3^0 = 0.12\) | \(\pi_4^0 = 0.09\) | \(1.00\) |
Under \(H_0\), \(\chi^2_{test} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \cdots + \frac{(O_k - E_k)^2}{E_k}\), \(E_i = n\pi_i^0, i = 1, \dots, k\)
Reject \(H_0\) if \(\chi^2_{test} > \chi^2_{\alpha, df}\), \(df = k-1\)
Require each \(E_i \ge 5\), \(i = 1, \dots, k\).
| White | Black | Hispanic | Asian | Total | |
|---|---|---|---|---|---|
| Observed Count | \(O_1 = 205\) | \(O_2 = 26\) | \(O_3 = 25\) | \(O_4 = 19\) | \(n = 275\) |
| Expected Count | \(E_1 = 198\) | \(E_2 = 19.25\) | \(E_3 = 33\) | \(E_4 = 24.75\) | \(n = 275\) |
| Proportion under \(H_0\) | \(\pi_1^0 = 0.72\) | \(\pi_2^0 = 0.07\) | \(\pi_3^0 = 0.12\) | \(\pi_4^0 = 0.09\) | \(1.00\) |
Under \(H_0\), \(\chi^2_{test} = \frac{(205 - 198)^2}{198} + \frac{(26 - 19.25)^2}{19.25} + \frac{(25 - 33)^2}{33} + \frac{(19 - 24.75)^2}{24.75} = 5.89\)
\(\chi^2_{0.05, 3} = 7.81\).
Do not reject \(H_0\) in favor of \(H_1\).
The data do not provide convincing evidence of racial bias in the juror selection.
obs <- c(205, 26, 25, 19)
pi_0 <- c(0.72, 0.07, 0.12, 0.09)
## Use chisq.test() function
chisq.test(x = obs, p = pi_0)
Chi-squared test for given probabilities
data: obs
X-squared = 6, df = 3, p-value = 0.1
Does the “Opinion on President’s Job Performance” depends on “Gender”?
Job performance: approve, disapprove, no opinion
Gender: male, female
| Voter | Gender | Performance |
|---|---|---|
| 1 | M | approve |
| 2 | F | disapprove |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| n | F | disapprove |
| Approve | Disapprove | No Opinion | |
|---|---|---|---|
| Male | 18 | 22 | 10 |
| Female | 23 | 20 | 12 |

| Approve | Disapprove | No Opinion | Total | |
|---|---|---|---|---|
| Male | 18 (19.52) | 22 (20) | 10 (10.48) | 50 |
| Female | 23 (21.48) | 20 (22) | 12 (11.52) | 55 |
| Total | 41 | 42 | 22 | 105 |
\[\text{Expected Count}_{\text{row i; col j}} = \frac{\text{(row i total}) \times (\text{col j total})}{\text{table total}}\]
Require every \(E_{ij} \ge 5\) in the contingency table.
\(\begin{align} &H_0: \text{Two variables are independent }\\ &H_1: \text{The two are dependent (associated) } \end{align}\)
\(\chi^2_{test} = \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij} - E_{ij})^2}{E_{ij}}\), where \(r\) is the number of rows and \(c\) is the number of columns in the table.
Reject \(H_0\) if \(\chi^2_{test} > \chi^2_{\alpha, \, df}\), \(df = (r-1)(c-1)\).
| Approve | Disapprove | No Opinion | Total | |
|---|---|---|---|---|
| Male | 18 (19.52) | 22 (20) | 10 (10.48) | 50 |
| Female | 23 (21.48) | 20 (22) | 12 (11.52) | 55 |
| Total | 41 | 42 | 22 | 105 |
\(\begin{align} &H_0: \text{ Opinion does not depend on gender } \\ &H_1: \text{ Opinion and gender are dependent } \end{align}\)
\(\small \chi^2_{test} = \frac{(18 - 19.52)^2}{19.52} + \frac{(22 - 20)^2}{20} + \frac{(10 - 10.48)^2}{10.48} + \frac{(23 - 21.48)^2}{21.48} + \frac{(20 - 22)^2}{22} + \frac{(12 - 11.52)^2}{11.52}= 0.65\)
\(\chi^2_{\alpha, df} =\chi^2_{0.05, (2-1)(3-1)} = 5.991\).
Since \(\chi_{test}^2 < \chi^2_{\alpha, df}\), we do not reject \(H_0\).
We do not conclude that the Opinion on President’s Job Performance depends on Gender.
[,1] [,2] [,3]
[1,] 18 22 10
[2,] 23 20 12
## Using chisq.test() function
(indept_test <- chisq.test(contingency_table))
Pearson's Chi-squared test
data: contingency_table
X-squared = 0.7, df = 2, p-value = 0.7
qchisq(0.05, df = (2 - 1) * (3 - 1), lower.tail = FALSE) ## critical value[1] 5.99
Help Marquette to offer AI-related courses, and better teaching using AI!

