
MATH 4720/MSSC 5720 Introduction to Statistics
Regression models the relationship between one or more numerical/categorical response variables \((Y)\) and one or more numerical/categorical explanatory variables \((X)\).
A regression function \(f(X)\) describes how a response variable \(Y\), on average, changes as an explanatory variable \(X\) changes.
Examples:
college GPA \((Y)\) vs. ACT/SAT score \((X)\)
sales \((Y)\) vs. advertising expenditure \((X)\)
crime rate \((Y)\) vs. median income level \((X)\)

The true relationship between \(X\) and the mean of \(Y\), the regression function \(f(X)\), is unknown.
The collected data \((x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\) are all we know and have.
Goal: estimate \(f(X)\) from our data and use it to predict the value of \(Y\) given any value of \(X\).
Math review: A linear function \(y = f(x) = \beta_0 + \beta_1 x\) represents a straight line.
\(\beta_1\): slope, the amount by which \(y\) changes when \(x\) increases by one unit.
\(\beta_0\): intercept, the value of \(y\) when \(x = 0\).
The linearity assumption: \(\beta_1\) does not change as \(x\) changes.
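A quick numerical illustration of linearity in R (a minimal sketch with arbitrary coefficients \(\beta_0 = 2\) and \(\beta_1 = 3\)):
f <- function(x) 2 + 3 * x   ## the line y = 2 + 3x
f(1) - f(0)                  ## 3: y changes by the slope when x increases by 1
f(101) - f(100)              ## 3 again: the slope is the same at every x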

Real data \((x_i, y_i), i = 1, 2, \dots, n\) do not form a perfect straight line!
\(y_i = \beta_0+\beta_1x_i + \color{red}{\epsilon_i}\)
When we collect our data, at any given level of \(X = x\), \(y\) is assumed to be drawn from a normal distribution (for inference purposes).
Its value varies around, and will not be exactly equal to, its mean \(\mu_y\).
For the \(i\)-th measurement in the target population,
\[Y_i = \beta_0 + \beta_1X_i + \epsilon_i\]
\(Y_i\): the \(i\)-th value of the response (random) variable.
\(X_i\): the \(i\)-th known fixed value of the predictor.
\(\epsilon_i\): the \(i\)-th random error with assumption \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).
\(\beta_0\) and \(\beta_1\) are model coefficients.
\(\beta_0\), \(\beta_1\) and \(\sigma^2\) are fixed unknown parameters to be estimated from the sample data once we collect them.
\(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\)
\[\begin{align*} \mu_{Y_i \mid X_i} = \beta_0 + \beta_1X_i \end{align*}\]
The mean response \(\mu_{Y\mid X}\) has a straight-line relationship with \(X\) given by a population regression line \[\mu_{Y\mid X} = \beta_0 + \beta_1X\]

\(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\) \[\begin{align*} \mathrm{Var}(Y_i \mid X_i) &= \mathrm{Var}(\epsilon_i) = \sigma^2 \end{align*}\] The variance of \(Y\) does not depend on \(X\).

\(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\) \[\begin{align*} Y_i \mid X_i \stackrel{indep}{\sim} N(\beta_0 + \beta_1X_i, \sigma^2) \end{align*}\] For any fixed value of \(X_i = x_i\), the response \(Y_i\) varies with \(N(\mu_{Y_i\mid x_i}, \sigma^2)\).
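To make these three assumptions concrete, here is a small simulation sketch; the coefficients, error SD, and \(x\)-range are arbitrary choices (loosely mimicking the mpg fit shown later), not real data.
## simulate n draws from Y_i = beta0 + beta1 * x_i + eps_i
set.seed(4720)                       ## arbitrary seed for reproducibility
n <- 100
x <- runif(n, min = 1.6, max = 7)    ## fixed predictor values
eps <- rnorm(n, mean = 0, sd = 3.8)  ## iid N(0, sigma^2) errors
y <- 35.7 - 3.5 * x + eps            ## responses scatter around the line
plot(x, y, pch = 19, cex = 0.5)
abline(a = 35.7, b = -3.5, col = "red")  ## the population regression line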

Use sample statistics \(b_0\) and \(b_1\) computed from our sample data to estimate \(\beta_0\) and \(\beta_1\).
\(\hat{y}_{i} = b_0 + b_1 x_{i}\) is called the fitted value of \(y_i\): a point estimate of the mean \(\mu_{y|x_i}\) and of \(y_i\) itself.
Given the sample data \(\{ (x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\},\)
Which sample regression line is the best?
What are the best estimators \(b_0\) and \(b_1\) for \(\beta_0\) and \(\beta_1\)?
\[SS_{res} = e_1^2 + e_2^2 + \dots + e_n^2 = \sum_{i = 1}^n e_i^2,\] where the residual \(e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1x_i)\).
\[\small{\begin{align} SS_{res} &= (y_1 - b_0 - b_1x_1)^2 + (y_2 - b_0 - b_1x_2)^2 + \dots + (y_n - b_0 - b_1x_n)^2\\ &= \sum_{i=1}^n(y_i - b_0 - b_1x_i)^2 \end{align}}\] which is the smallest among all \(\sum_{i=1}^n(y_i - a_0 - a_1x_i)^2\) computed with any other pair of values \((a_0, a_1) \ne (b_0, b_1)\).
\[(b_0, b_1) = \arg \min_{a_0, a_1} \sum_{i=1}^n(y_i - a_0 - a_1x_i)^2\]
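Before solving this minimization analytically, we can check it numerically; a sketch with made-up toy data, comparing a general-purpose optimizer against lm():
ssres <- function(a, x, y) sum((y - a[1] - a[2] * x)^2)  ## SS_res as a function of (a0, a1)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)     ## toy data, roughly on the line y = 2x
optim(par = c(0, 0), fn = ssres, x = x, y = y)$par  ## numerical minimizer
coef(lm(y ~ x))                      ## least squares solution: nearly identical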
Solving the minimization with calculus (MATH 1450/1455 …) gives:
\[\color{red}{b_0 = \overline{y} - b_1\overline{x}}\]
\[\color{red}{b_1 = \frac{\sum_{i=1}^n(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n(x_i - \overline{x})^2} = r \frac{\sqrt{S_{yy}}}{\sqrt{S_{xx}}}},\] where \(S_{xx} = \sum_{i=1}^n(x_i - \overline{x})^2\) and \(S_{yy} = \sum_{i=1}^n(y_i - \overline{y})^2\)
What can we learn from the formula of \(b_0\) and \(b_1\)?
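One way to answer that is to evaluate the formulas directly; a sketch with the same toy data as above, checking the closed forms against lm():
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)         ## so the fitted line passes through (x_bar, y_bar)
c(b0, b1)
cor(x, y) * sd(y) / sd(x)            ## b1 = r * sqrt(Syy) / sqrt(Sxx): same value
coef(lm(y ~ x))                      ## agrees
In particular, the formula for \(b_0\) shows the fitted line always passes through \((\overline{x}, \overline{y})\), and \(b_1\) has the same sign as the correlation \(r\).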
mpg Data

# A tibble: 234 × 11
   manufacturer model        displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>        <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4             1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4             1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4             2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4             2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4             2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4             2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4             3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro     1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro     1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro     2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

hwy vs. Displacement displ
plot(x = mpg$displ, y = mpg$hwy,
     las = 1, pch = 19, col = "navy", cex = 0.5,
     xlab = "Displacement (litres)", ylab = "Highway MPG",
     main = "Highway MPG vs. Engine Displacement (litres)")

## fit the simple linear regression of hwy on displ
reg_fit <- lm(hwy ~ displ, data = mpg)

## all elements in reg_fit
names(reg_fit)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

## use $ to extract an element of a list
reg_fit$coefficients
(Intercept)       displ
      35.70       -3.53
\(\widehat{hwy}_{i} = b_0 + b_1 \times displ_{i} = 35.7 - 3.5 \times displ_{i}\)
\(b_1\): For a one-unit (litre) increase in displacement, we expect the highway MPG to decrease, on average, by 3.5.
Think of \(\sigma^2\) as variance around the line or the mean square (prediction) error.
The estimate of \(\sigma^2\) is the mean square residual \(MS_{res}\):
\[\hat{\sigma}^2 = MS_{res} = \frac{SS_{res}}{n-2} = \frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n-2}\]
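A sketch computing \(\hat{\sigma}\) from this definition for the mpg fit reg_fit:
e <- residuals(reg_fit)              ## e_i = y_i - y_hat_i
sqrt(sum(e^2) / (length(e) - 2))     ## hat sigma = sqrt(SS_res / (n - 2))
summary(reg_fit)$sigma               ## R's 'Residual standard error': same value, 3.84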

(summ_reg_fit <- summary(reg_fit))
Call:
lm(formula = hwy ~ displ, data = mpg)
Residuals:
   Min     1Q Median     3Q    Max
-7.104 -2.165 -0.224  2.059 15.010

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   35.698      0.720    49.5   <2e-16 ***
displ         -3.531      0.195   -18.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.84 on 232 degrees of freedom
Multiple R-squared: 0.587, Adjusted R-squared: 0.585
F-statistic: 329 on 1 and 232 DF, p-value: <2e-16

# lots of fitted information saved in summary(reg_fit)!
names(summ_reg_fit)
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

# residual standard error (sigma_hat)
summ_reg_fit$sigma
[1] 3.84
confint(reg_fit, level = 0.95)
            2.5 % 97.5 %
(Intercept) 34.28  37.12
displ       -3.91  -3.15
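The same interval can be assembled by hand from the coefficient table; a sketch for the displ slope, with \(df = n - 2 = 232\):
est <- summ_reg_fit$coefficients["displ", "Estimate"]
se  <- summ_reg_fit$coefficients["displ", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = 232) * se  ## reproduces confint()'s displ row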
\(H_0: \beta_1 = \beta_1^0 \quad H_1: \beta_1 \ne \beta_1^0\)
Test statistic: Under \(H_0\),
\[t_{test} = \frac{b_1 - \color{red}{\beta_1^0}}{\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}} \sim t_{n-2}\]
summ_reg_fit$coefficients
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)    35.70      0.720    49.6 2.12e-125
displ          -3.53      0.195   -18.2  2.04e-46
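A sketch reproducing the displ row's t value and two-sided p-value from the formula with \(\beta_1^0 = 0\):
b1 <- summ_reg_fit$coefficients["displ", "Estimate"]
se <- summ_reg_fit$coefficients["displ", "Std. Error"]  ## sqrt(hat sigma^2 / Sxx)
t_test <- (b1 - 0) / se            ## about -18.2
2 * pt(-abs(t_test), df = 232)     ## two-sided p-value, df = n - 2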
\(H_0: \beta_1 = 0 \quad H_1: \beta_1 \ne 0\)
Failing to reject \(H_0: \beta_1 = 0\) means the data provide no evidence of a linear relationship between \(Y\) and \(X\).

If we reject \(H_0: \beta_1 = 0\), does it mean \(X\) and \(Y\) are linearly related?
Suppose we only have data on \(Y\), with no information about \(X\) or about the relationship between \(X\) and \(Y\). How do we predict a value of \(Y\)?
Our best guess would be \(\overline{y}\) if the data have no pattern, i.e., \(\hat{y}_i = \overline{y}\).
Treat \(X\) and \(Y\) as uncorrelated.
The (total) deviation from the mean is \((y_i - \overline{y})\).
If \(X\) and \(Y\) are linearly related, fitting a linear regression model helps us predict the value of \(Y\) when the value of \(X\) is provided.
\(\hat{y}_i = b_0 + b_1x_i\) is closer to \(y_i\) than \(\overline{y}\).
The regression model explains some deviation of \(y\).
Total deviation = Deviation explained by regression + Unexplained deviation
\((y_i - \overline{y}) = (\hat{y}_i - \overline{y}) + (y_i - \hat{y}_i)\)
For example, if \(y_i = 19\), \(\hat{y}_i = 13\), and \(\overline{y} = 9\): \((19 - 9) = (13 - 9) + (19 - 13)\)
\(\sum_{i=1}^n(y_i - \overline{y})^2 = \sum_{i=1}^n(\hat{y}_i - \overline{y})^2 + \sum_{i=1}^n(y_i - \hat{y}_i)^2\)
Total SS \((SS_T)\) = Regression SS \((SS_R)\) + Residual SS \((SS_{res})\)
\(df_T = df_R + df_{res}\)
\(\color{blue}{(n-1) = 1 +(n-2)}\)
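A sketch verifying the sum-of-squares decomposition numerically on the mpg fit:
y <- mpg$hwy
y_hat <- fitted(reg_fit)
SS_T   <- sum((y - mean(y))^2)
SS_R   <- sum((y_hat - mean(y))^2)
SS_res <- sum((y - y_hat)^2)
c(SS_T, SS_R + SS_res)             ## the two totals match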
Test statistic: Under \(H_0: \beta_1 = 0\), \[F_{test} = \frac{MS_R}{MS_{res}} = \frac{SS_R/1}{SS_{res}/(n-2)} \sim F_{1, n-2}\]
A larger value of \(F_{test}\) indicates that the regression is significant.
Reject \(H_0\) if \(F_{test} > F_{\alpha, 1, n-2}\).
The ANOVA is designed to test \(H_0\) that all predictors have no value in predicting \(y\).
In SLR, the \(F\)-test of ANOVA gives the same result as a two-sided \(t\)-test of \(H_0: \beta_1=0\).
anova(reg_fit)

Analysis of Variance Table

Response: hwy
           Df Sum Sq Mean Sq F value Pr(>F)
displ       1   4848    4848     329 <2e-16 ***
Residuals 232   3414      15
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
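A sketch checking the \(F = t^2\) equivalence on this fit:
anova(reg_fit)$`F value`[1]                      ## about 329
summ_reg_fit$coefficients["displ", "t value"]^2  ## the same number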
\[R^2 = \frac{SS_R}{SS_T} =\frac{SS_T - SS_{res}}{SS_T} = 1 - \frac{SS_{res}}{SS_T}\]
\(R^2\) is the proportionate reduction of total variation in \(Y\) associated with the use of \(X\).
(a) If \(\hat{y}_i = y_i\) for all \(i\), then \(\small SS_{res} = \sum_{i=1}^n(y_i - \hat{y}_i)^2 = 0\) and \(R^2 = 1\). (b) If \(\hat{y}_i = \overline{y}\) for all \(i\), then \(\small SS_R = \sum_{i=1}^n(\hat{y}_i - \overline{y})^2 = 0\) and \(R^2 = 0\).
summ_reg_fit$r.squared
[1] 0.587
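A sketch computing \(R^2\) from the sums of squares directly:
SS_res <- sum(residuals(reg_fit)^2)
SS_T   <- sum((mpg$hwy - mean(mpg$hwy))^2)
1 - SS_res / SS_T                  ## 0.587, matching summ_reg_fit$r.squared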

\[\hat{\mu}_{y | x_0} = b_0 + b_1 x_0\]
\[\hat{\mu}_{y | x_0} \pm t_{\alpha/2, n-2} \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}\]
Does the length of the CI for \(\mu_{y|x_0}\) stay the same at any location of \(x_0\)?
## CI for the mean response
predict(reg_fit, newdata = data.frame(displ = 5.5),
        interval = "confidence", level = 0.95)
   fit  lwr  upr
1 16.3 15.4 17.2
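A sketch computing the same interval from the formula at \(x_0 = 5.5\):
x <- mpg$displ; n <- length(x); x0 <- 5.5
Sxx <- sum((x - mean(x))^2)
fit <- sum(coef(reg_fit) * c(1, x0))   ## b0 + b1 * x0
me  <- qt(0.975, df = n - 2) * summ_reg_fit$sigma *
  sqrt(1 / n + (x0 - mean(x))^2 / Sxx)
fit + c(-1, 1) * me                    ## reproduces predict()'s confidence interval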
\[\hat{y}_0(x_0) = b_0 + b_1 x_0\]
\[\hat{y}_0 \pm t_{\alpha/2, n-2} \hat{\sigma}\sqrt{{\color{red}{1}}+ \frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}\]
What is the difference between CI for \(\mu_{y | x_0}\) and PI for \(y_0(x_0)\)?
## CI for the mean response
predict(reg_fit, newdata = data.frame(displ = 5.5),
        interval = "confidence", level = 0.95)
   fit  lwr  upr
1 16.3 15.4 17.2

## PI for the new observation
predict(reg_fit, newdata = data.frame(displ = 5.5),
        interval = "predict", level = 0.95)
   fit  lwr  upr
1 16.3 8.67 23.9
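A sketch computing the PI by hand; the only change from the CI is the extra 1 under the square root, which is why the interval is so much wider:
x <- mpg$displ; n <- length(x); x0 <- 5.5
Sxx <- sum((x - mean(x))^2)
fit <- sum(coef(reg_fit) * c(1, x0))   ## b0 + b1 * x0
me_pi <- qt(0.975, df = n - 2) * summ_reg_fit$sigma *
  sqrt(1 + 1 / n + (x0 - mean(x))^2 / Sxx)
fit + c(-1, 1) * me_pi                 ## reproduces predict()'s prediction interval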