
MATH 4720/MSSC 5720 Introduction to Statistics


But wait, then what is DATA SCIENCE?



My ChatGPT says:
Statistics is foundational to Data Science, but Data Science also includes programming, data engineering, machine learning, and business communication.
STAT 5 - Statistics
STAT 7 - Statistical Methods for the Biological, Environmental, and Health Sciences
STAT 17 - Statistical Methods for Business and Economics
STAT 80A - Gambling and Gaming
STAT 80B - The Art of Data Visualization
STAT 108 - Linear Regression
STAT 131 - Introduction to Probability Theory
STAT 132 - Classical and Bayesian Inference
STAT 202 - Linear Models in SAS
STAT 203 - Introduction to Probability Theory
STAT 204 - Introduction to Statistical Data Analysis
STAT 205 - Introduction to Classical Statistical Learning
STAT 205B - Intermediate Classical Inference
STAT 206 - Applied Bayesian Statistics
STAT 206B - Intermediate Bayesian Inference
STAT 207 - Intermediate Bayesian Statistical Modeling
STAT 208 - Linear Statistical Models
STAT 209 - Generalized Linear Models
STAT 221 - Statistical Machine Learning
STAT 222 - Bayesian Nonparametric Methods
STAT 223 - Time Series Analysis
STAT 224 - Bayesian Survival Analysis and Clinical Design
STAT 225 - Multivariate Statistical Methods
STAT 226 - Spatial Statistics
STAT 227 - Statistical Learning and High-Dimensional Data Analysis
STAT 229 - Advanced Bayesian Computation
STAT 243 - Stochastic Processes
STAT 244 - Bayesian Decision Theory
STAT 246 - Probability Theory with Markov Chains
STAT 266A - Data Visualization and Statistical Programming in R
STAT 266B - Advanced Statistical Programming in R
STAT 266C - Introduction to Data Wrangling
Statistics: the collection, organization, analysis, interpretation, and presentation of data.

We spend most of our time on various statistical methods for analyzing data.
Learn useful information
about the population we are interested in (e.g., all Marquette students)
from our sample data (e.g., students in MATH 4720)
through statistical inferential methods, including estimation and testing (e.g., confidence intervals)
Data: A set of objects on which we observe or measure one or more characteristics.
Objects are individuals, observations, subjects or cases in statistical studies.
A characteristic or attribute is called a variable because it varies from one object to another.
| Number | Name | Class | Pos | Ht | Wt | Hometown | High_School | PPG | RPG | APG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Kam Jones | Sr | G | 6'5" | 205 | Memphis, TN | Evangelical Christian School | 19.2 | 4.5 | 5.9 |
| 2 | Chase Ross | Jr | G | 6'5" | 210 | Dallas, TX | Cushing Academy | 10.5 | 3.8 | 2.1 |
| 4 | Stevie Mitchell | Sr | G | 6'3" | 200 | Reading, PA | Wilson HS | 10.7 | 4.1 | 1.6 |
| 5 | Tre Norman | So | G | 6'4" | 210 | Boston, MA | Worcester Academy | 1.9 | 1.5 | 0.5 |
| 7 | Zaide Lowery | So | G | 6'5" | 200 | Springfield, MO | La Lumiere School | 4.1 | 3.0 | 0.2 |
| 8 | Joshua Clark | Fr | F | 7'1" | 225 | Virginia Beach, VA | Clements HS | NA | NA | NA |
| 10 | Damarius Owens | Fr | F | 6'7" | 200 | Rochester, NY | Western Reserve Academy | 2.6 | 1.2 | 0.5 |
| 12 | Ben Gold | Jr | F | 6'11" | 235 | Wellington, NZ | NBA Global Academy | 7.4 | 4.3 | 0.9 |
| 13 | Royce Parham | Fr | F | 6'8" | 230 | Pittsburgh, PA | Western Reserve Academy | 5.1 | 2.2 | 0.4 |
| 21 | Al Amadou | So | F | 6'9" | 210 | Philadelphia, PA | Chestnut Hill Academy | NA | NA | NA |
| 22 | Sean Jones | Jr (RS) | G | 5'10" | 185 | Columbus, OH | Lincoln HS | NA | NA | NA |
| 23 | David Joplin | Sr | F | 6'8" | 225 | Brookfield, WI | Brookfield Central HS | 14.2 | 5.4 | 1.3 |
| 25 | Jack Anderson | Sr | G | 6'4" | 200 | Davie, FL | Western HS | 0.4 | 0.4 | 0.0 |
| 35 | Caedin Hamilton | R-Fr | F | 6'9" | 250 | Santa Maria, CA | St. Joseph HS | 1.5 | 1.2 | 0.6 |
| 40 | Casey O'Malley | Jr | G | 6'3" | 200 | Omaha, NE | Creighton Prep | 0.6 | 0.2 | 0.0 |
| 41 | Jonah Lucas | Jr | G | 6'1" | 180 | West Lafayette, IN | Harrison HS | 0.0 | 0.2 | 0.0 |
| 42 | Luke Jacobson | Fr | F | 6'7" | 215 | San Luis Obispo, CA | Mission Prep | NA | NA | NA |
| 54 | Jake Ciardo | Jr | G | 6'2" | 185 | Germantown, WI | Germantown HS | 0.4 | 0.8 | 0.0 |
| 55 | Cameron Brown | Sr | G | 6'1" | 215 | Plano, TX | John Paul II HS | 0.4 | 0.2 | 0.0 |
Each row corresponds to a unique case or observational unit.
Each column represents a characteristic or variable.
This structure allows new cases to be added as rows or new variables as new columns.
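A minimal Python sketch of this row-and-column structure, using a few values copied from the roster table. Storing each case as a dictionary and using `None` for `NA` are illustrative choices, not something the course prescribes.

```python
# Each dict is one case (row); each key is one variable (column).
roster = [
    {"Number": 1, "Name": "Kam Jones", "Class": "Sr", "Pos": "G", "Wt": 205, "PPG": 19.2},
    {"Number": 2, "Name": "Chase Ross", "Class": "Jr", "Pos": "G", "Wt": 210, "PPG": 10.5},
    {"Number": 8, "Name": "Joshua Clark", "Class": "Fr", "Pos": "F", "Wt": 225, "PPG": None},  # NA in the table
]

# Adding a new case means appending a row.
roster.append({"Number": 23, "Name": "David Joplin", "Class": "Sr", "Pos": "F", "Wt": 225, "PPG": 14.2})

# Adding a new variable means adding one value to every row.
rpg_by_number = {1: 4.5, 2: 3.8, 8: None, 23: 5.4}
for row in roster:
    row["RPG"] = rpg_by_number[row["Number"]]

n_cases = len(roster)
variables = list(roster[0].keys())
print(n_cases, "cases;", len(variables), "variables:", variables)
```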
The first step in conducting a study is to identify questions to be investigated.
A clear research question is helpful in identifying the target population.
Target Population: the collection of all objects we are interested in studying.

Example: all Marquette students who are currently enrolled.
Example: all people with severe heart disease.
Sometimes, it's possible to collect data on all cases we are interested in.
Most of the time, it is too expensive to collect data for every case in a population.
What about the average GPA of all students in Illinois? In the U.S.? In the world?

Sample: A subset of cases selected from a population.
Compute the average GPA of the sample data.
We hope the sample average GPA \(\approx\) the population average GPA.
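To see the idea in action, here is a small simulation sketch. The population of 10,000 GPAs is made up (normal with mean 3.2 and standard deviation 0.5, clipped to [0, 4]); only the logic of comparing a sample average to a population average comes from the notes.

```python
import random
import statistics

random.seed(4720)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 made-up GPAs clipped to the range [0, 4].
population = [min(4.0, max(0.0, random.gauss(3.2, 0.5))) for _ in range(10_000)]

# A sample is a subset of cases selected from the population.
sample = random.sample(population, 100)

population_mean = statistics.mean(population)  # usually unknown in practice
sample_mean = statistics.mean(sample)          # what we can actually compute

print(f"population avg GPA: {population_mean:.3f}")
print(f"sample avg GPA:     {sample_mean:.3f}")
```

In a real study the population average is not available, which is why we need inferential methods to judge how close the sample average is likely to be.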
Is this 4720/5720 class a sample from the target population of Marquette students?
Is this 4720/5720 class a "good" sample of the target population?
The sample is convenient to collect, but it may NOT be representative of the population.
Biased sample: The average GPA of the class may be far from that of all Marquette undergrads.
We always seek to randomly select a sample from a population.
Many statistical methods are based on a randomness assumption.
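The following sketch contrasts a convenience sample with a random sample. Everything about the population is an assumption made for illustration: a hypothetical subgroup of 2,000 students with higher made-up GPAs plays the role of "students who happen to be in one class", and the other 8,000 students have lower made-up GPAs.

```python
import random
import statistics

random.seed(5720)

def made_up_gpa(mean, sd):
    """Draw one hypothetical GPA, clipped to [0, 4]."""
    return min(4.0, max(0.0, random.gauss(mean, sd)))

# Hypothetical population of 10,000 students in two subgroups.
subgroup = [made_up_gpa(3.5, 0.3) for _ in range(2_000)]  # where the class comes from
everyone_else = [made_up_gpa(3.0, 0.5) for _ in range(8_000)]
population = subgroup + everyone_else

convenience_sample = subgroup[:40]               # easy to collect, one subgroup only
random_sample = random.sample(population, 40)    # every student can be selected

pop_mean = statistics.mean(population)
convenience_error = abs(statistics.mean(convenience_sample) - pop_mean)
random_error = abs(statistics.mean(random_sample) - pop_mean)

print(f"convenience sample error: {convenience_error:.3f}")
print(f"random sample error:      {random_error:.3f}")
```

Under these assumptions the convenience sample misses the population average by a wide margin no matter how many students it includes, because the subgroup itself differs from the population.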
Observational or Experimental?
Randomly select 40 males and 40 females to see the difference in blood pressure levels between males and females.
Test the effects of a new drug by randomly dividing patients into 3 groups (high dosage, low dosage, placebo).


What is the confounder that is not in the data, but affects ice cream sales and the number of drownings?
Temperature: as temperature increases, ice cream sales increase and the number of drownings goes up because more people swim.
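A simulation sketch of this confounding story. All numbers are made up: temperature drives both ice cream sales and drownings, and neither variable affects the other. The two are strongly correlated, but the correlation nearly vanishes once temperature is accounted for (here by correlating the residuals from a straight-line fit on temperature).

```python
import random
import statistics

random.seed(1)

# Made-up daily data: temperature is the only common cause.
temperature = [random.uniform(0, 35) for _ in range(365)]
ice_cream = [20 + 3.0 * t + random.gauss(0, 10) for t in temperature]
drownings = [1 + 0.2 * t + random.gauss(0, 1.5) for t in temperature]

def corr(x, y):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """What is left of y after removing a least-squares line in x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

raw_corr = corr(ice_cream, drownings)
adjusted_corr = corr(residuals(ice_cream, temperature), residuals(drownings, temperature))

print(f"correlation ignoring temperature:      {raw_corr:.2f}")
print(f"correlation adjusting for temperature: {adjusted_corr:.2f}")
```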

Making causal conclusions based on experiments is often more reasonable than making the same causal conclusions based on observational data.
Observational studies are generally only sufficient to show associations, not causality.
Random Sample: Each member of a population is equally likely to be selected.
Simple Random Sample (SRS): Every possible sample of size \(n\) has the same chance to be chosen.
Example: To sample 100 students from all, say, 10,000 Marquette students, I would assign each student a unique number (from 1 to 10,000), then randomly select 100 numbers.
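The example above can be sketched with Python's standard `random.sample`, which draws without replacement so that every set of 100 numbers is equally likely. The seed is arbitrary and only makes the draw reproducible.

```python
import random

random.seed(2024)  # arbitrary seed, for reproducibility only

student_ids = range(1, 10_001)          # each student gets a unique number 1..10,000
srs = random.sample(student_ids, 100)   # 100 distinct numbers, drawn without replacement

print(sorted(srs)[:10])  # the ten smallest selected IDs
```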


Stratified Sampling: Subdivide the population into different subgroups (strata) that share the same characteristics, then draw a simple random sample from each subgroup.
Homogeneous within strata; Non-homogeneous between strata
Cluster Sampling: Divide the population into clusters, then randomly select some of those clusters, and then choose all the members from those selected clusters.
Homogeneous between clusters; Non-homogeneous within clusters
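A sketch of the two designs side by side. The population is hypothetical: 1,000 students, with class year playing the role of strata and dorm playing the role of clusters. The function names `stratified_sample` and `cluster_sample` are made up for this illustration.

```python
import random

random.seed(7)

# Hypothetical population: 250 students per class year, 100 students per dorm.
years = ["Fr", "So", "Jr", "Sr"]
students = [{"id": i, "year": years[i % 4], "dorm": (i // 4) % 10} for i in range(1000)]

def stratified_sample(population, key, n_per_stratum):
    """Draw a simple random sample of n_per_stratum units from every stratum."""
    strata = {}
    for unit in population:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, n_per_stratum))
    return sample

def cluster_sample(population, key, n_clusters):
    """Randomly pick n_clusters clusters, then keep ALL members of those clusters."""
    clusters = sorted({unit[key] for unit in population})
    chosen = set(random.sample(clusters, n_clusters))
    return [unit for unit in population if unit[key] in chosen]

by_year = stratified_sample(students, "year", 25)  # 25 students from each class year
by_dorm = cluster_sample(students, "dorm", 2)      # everyone in 2 randomly chosen dorms

print(len(by_year), "students in the stratified sample")
print(len(by_dorm), "students in the cluster sample")
```

Note the difference: the stratified sample touches every stratum, while the cluster sample skips most clusters entirely and takes everyone in the few it selects.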
A categorical variable provides non-numerical information which can be placed in one (and only one) category from two or more categories.
A numerical variable is recorded as a numerical value representing counts or measurements.
A discrete variable takes on a finite or countable number of values.
A continuous variable takes on values anywhere over a particular range without gaps or jumps.
Gender (Male = 0, Female = 1, Trans = 2)
Class (Freshman = 1, Sophomore = 2, Junior = 3, Senior = 4, Graduate = 5)
Country (USA = 100, Canada = 101, UK = 200, Germany = 201, Japan = 300, Korea = 301)
United Airlines boarding groups
The numbers represent categories only; differences between them are meaningless.
We need to know a variable's level of measurement to determine which arithmetic operations, if any, are meaningful.
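A small sketch of why this matters, using the country codes listed above. The six responses are hypothetical. Python will happily average the codes, but the result corresponds to no country; counting categories is the operation that makes sense here.

```python
import statistics
from collections import Counter

# Country codes from the list above: the numbers are labels, not quantities.
country_codes = {"USA": 100, "Canada": 101, "UK": 200, "Germany": 201, "Japan": 300, "Korea": 301}

# Hypothetical responses from six students.
responses = ["USA", "USA", "Japan", "Korea", "USA", "UK"]
coded = [country_codes[c] for c in responses]

mean_code = statistics.mean(coded)          # computable, but meaningless
counts = Counter(responses)                 # frequencies are meaningful
most_common_country = counts.most_common(1)[0][0]

print("mean of codes:", mean_code)
print("counts:", dict(counts))
print("most common category:", most_common_country)
```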
| Grade | Percentage |
|---|---|
| A | [94, 100] |
| A- | [90, 94) |
| B+ | [87, 90) |
| B | [83, 87) |
| B- | [80, 83) |
| C+ | [77, 80) |
| C | [73, 77) |
| C- | [70, 73) |
| D+ | [65, 70) |
| D | [60, 65) |
| F | [0, 60) |
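The grade table can be read as a lookup from a numerical percentage to an ordered category. A minimal sketch, assuming the interval endpoints exactly as listed (left endpoint included, right endpoint excluded, except that 100 belongs to A); the function name `letter_grade` is made up for illustration.

```python
from bisect import bisect_right

# Lower cutoffs of each grade above F, in increasing order, copied from the table.
cutoffs = [60, 65, 70, 73, 77, 80, 83, 87, 90, 94]
grades = ["F", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]

def letter_grade(percentage):
    """Map a percentage in [0, 100] to its letter grade."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # bisect_right counts how many cutoffs are <= percentage.
    return grades[bisect_right(cutoffs, percentage)]

for pct in (59.9, 60, 89.9, 90, 94, 100):
    print(pct, letter_grade(pct))
```

The letter grades are ordered, but the gaps between them are not equal: the intervals have different widths, so arithmetic on the letters themselves is not meaningful.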
Identify the data type of each variable in the Marquette men's basketball player data.
| Number | Name | Class | Pos | Ht | Wt | Hometown | High_School | PPG | RPG | APG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Kam Jones | Sr | G | 6'5" | 205 | Memphis, TN | Evangelical Christian School | 19.2 | 4.5 | 5.9 |
| 2 | Chase Ross | Jr | G | 6'5" | 210 | Dallas, TX | Cushing Academy | 10.5 | 3.8 | 2.1 |
| 4 | Stevie Mitchell | Sr | G | 6'3" | 200 | Reading, PA | Wilson HS | 10.7 | 4.1 | 1.6 |
| 5 | Tre Norman | So | G | 6'4" | 210 | Boston, MA | Worcester Academy | 1.9 | 1.5 | 0.5 |
| 7 | Zaide Lowery | So | G | 6'5" | 200 | Springfield, MO | La Lumiere School | 4.1 | 3.0 | 0.2 |
| 8 | Joshua Clark | Fr | F | 7'1" | 225 | Virginia Beach, VA | Clements HS | NA | NA | NA |
| 10 | Damarius Owens | Fr | F | 6'7" | 200 | Rochester, NY | Western Reserve Academy | 2.6 | 1.2 | 0.5 |
| 12 | Ben Gold | Jr | F | 6'11" | 235 | Wellington, NZ | NBA Global Academy | 7.4 | 4.3 | 0.9 |
| 13 | Royce Parham | Fr | F | 6'8" | 230 | Pittsburgh, PA | Western Reserve Academy | 5.1 | 2.2 | 0.4 |
| 21 | Al Amadou | So | F | 6'9" | 210 | Philadelphia, PA | Chestnut Hill Academy | NA | NA | NA |
| 22 | Sean Jones | Jr (RS) | G | 5'10" | 185 | Columbus, OH | Lincoln HS | NA | NA | NA |
| 23 | David Joplin | Sr | F | 6'8" | 225 | Brookfield, WI | Brookfield Central HS | 14.2 | 5.4 | 1.3 |
| 25 | Jack Anderson | Sr | G | 6'4" | 200 | Davie, FL | Western HS | 0.4 | 0.4 | 0.0 |
| 35 | Caedin Hamilton | R-Fr | F | 6'9" | 250 | Santa Maria, CA | St. Joseph HS | 1.5 | 1.2 | 0.6 |
| 40 | Casey O'Malley | Jr | G | 6'3" | 200 | Omaha, NE | Creighton Prep | 0.6 | 0.2 | 0.0 |
| 41 | Jonah Lucas | Jr | G | 6'1" | 180 | West Lafayette, IN | Harrison HS | 0.0 | 0.2 | 0.0 |
| 42 | Luke Jacobson | Fr | F | 6'7" | 215 | San Luis Obispo, CA | Mission Prep | NA | NA | NA |
| 54 | Jake Ciardo | Jr | G | 6'2" | 185 | Germantown, WI | Germantown HS | 0.4 | 0.8 | 0.0 |
| 55 | Cameron Brown | Sr | G | 6'1" | 215 | Plano, TX | John Paul II HS | 0.4 | 0.2 | 0.0 |