Overview of Statistics and Data 📖

MATH 4720/MSSC 5720 Introduction to Statistics

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

What is Statistics

Statistics as Numeric Records

In ordinary conversation, the word statistics refers to a set or collection of numeric records.

https://slamgoods.com/products/jordan-collectors-issue

https://www.amazon.com/Funny-Statistics-Shirt-Definition/dp/B07JKMCDR2?customId=B07537PB8C&customizationToken=MC_Assembly_1%23B07537PB8C&th=1&psc=1

Statistics as a Discipline

https://en.wikipedia.org/wiki/Statistics

Statistics is a subject dealing with data, or the Science of Data.

A science of data using statistical thinking, methods and models.

🤔 But wait, then what is DATA SCIENCE ❓

Difference between Statistics and Data Science

My ChatGPT says:

Statistics is foundational to Data Science, but Data Science also includes programming, data engineering, machine learning, and business communication.

Data Science Life Cycle

UC Santa Cruz Department of Statistics Courses

STAT 5 – Statistics

STAT 7 – Statistical Methods for the Biological, Environmental, and Health Sciences

STAT 17 – Statistical Methods for Business and Economics

STAT 80A – Gambling and Gaming

STAT 80B – The Art of Data Visualization

STAT 108 – Linear Regression

STAT 131 – Introduction to Probability Theory

STAT 132 – Classical and Bayesian Inference

STAT 202 – Linear Models in SAS

STAT 203 – Introduction to Probability Theory

STAT 204 – Introduction to Statistical Data Analysis

STAT 205 – Introduction to Classical Statistical Learning

STAT 205B – Intermediate Classical Inference

STAT 206 – Applied Bayesian Statistics

STAT 206B – Intermediate Bayesian Inference

STAT 207 – Intermediate Bayesian Statistical Modeling

STAT 208 – Linear Statistical Models

STAT 209 – Generalized Linear Models

STAT 221 – Statistical Machine Learning

STAT 222 – Bayesian Nonparametric Methods

STAT 223 – Time Series Analysis

STAT 224 – Bayesian Survival Analysis and Clinical Design

STAT 225 – Multivariate Statistical Methods

STAT 226 – Spatial Statistics

STAT 227 – Statistical Learning and High-Dimensional Data Analysis

STAT 229 – Advanced Bayesian Computation

STAT 243 – Stochastic Processes

STAT 244 – Bayesian Decision Theory

STAT 246 – Probability Theory with Markov Chains

STAT 266A – Data Visualization and Statistical Programming in R

STAT 266B – Advanced Statistical Programming in R

STAT 266C – Introduction to Data Wrangling

️⬛ Methods and models
🟩 Other data science related topics

Data Science May Now Be a Broader View of Statistics

Collection, organization, analysis, interpretation and presentation of data.

What We Learn In this Course

We Focus On Statistical Inference

We spend most of our time on various statistical methods for analyzing data.
Learn useful information
- about the population we are interested in (e.g. all Marquette students)
- from our sample data (e.g. students in MATH 4720)
- through statistical inferential methods, including estimation and testing (e.g. confidence intervals)

Statistics is a Science of Data, so What is Data?

Data: A set of objects on which we observe or measure one or more characteristics.
Objects are individuals, observations, subjects or cases in statistical studies.
A characteristic or attribute is called a variable because it varies from one object to another.

Number	Name	Class	Pos	Ht	Wt	Hometown	High_School	PPG	RPG	APG
1	Kam Jones	Sr	G	6’5”	205	Memphis, TN	Evangelical Christian School	19.2	4.5	5.9
2	Chase Ross	Jr	G	6’5”	210	Dallas, TX	Cushing Academy	10.5	3.8	2.1
4	Stevie Mitchell	Sr	G	6’3”	200	Reading, PA	Wilson HS	10.7	4.1	1.6
5	Tre Norman	So	G	6’4”	210	Boston, MA	Worcester Academy	1.9	1.5	0.5
7	Zaide Lowery	So	G	6’5”	200	Springfield, MO	La Lumiere School	4.1	3.0	0.2
8	Joshua Clark	Fr	F	7’1”	225	Virginia Beach, VA	Clements HS	NA	NA	NA
10	Damarius Owens	Fr	F	6’7”	200	Rochester, NY	Western Reserve Academy	2.6	1.2	0.5
12	Ben Gold	Jr	F	6’11”	235	Wellington, NZ	NBA Global Academy	7.4	4.3	0.9
13	Royce Parham	Fr	F	6’8”	230	Pittsburgh, PA	Western Reserve Academy	5.1	2.2	0.4
21	Al Amadou	So	F	6’9”	210	Philadelphia, PA	Chestnut Hill Academy	NA	NA	NA
22	Sean Jones	Jr (RS)	G	5’10”	185	Columbus, OH	Lincoln HS	NA	NA	NA
23	David Joplin	Sr	F	6’8”	225	Brookfield, WI	Brookfield Central HS	14.2	5.4	1.3
25	Jack Anderson	Sr	G	6’4”	200	Davie, FL	Western HS	0.4	0.4	0.0
35	Caedin Hamilton	R-Fr	F	6’9”	250	Santa Maria, CA	St. Joseph HS	1.5	1.2	0.6
40	Casey O’Malley	Jr	G	6’3”	200	Omaha, NE	Creighton Prep	0.6	0.2	0.0
41	Jonah Lucas	Jr	G	6’1”	180	West Lafayette, IN	Harrison HS	0.0	0.2	0.0
42	Luke Jacobson	Fr	F	6’7”	215	San Luis Obispo, CA	Mission Prep	NA	NA	NA
54	Jake Ciardo	Jr	G	6’2”	185	Germantown, WI	Germantown HS	0.4	0.8	0.0
55	Cameron Brown	Sr	G	6’1”	215	Plano, TX	John Paul II HS	0.4	0.2	0.0

Data Matrix

Each row corresponds to a unique case or observational unit.
Each column represents a characteristic or variable.
This structure allows new cases to be added as rows or new variables as new columns.
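
As a rough illustration of the row/column idea, the sketch below stores a few roster rows as a list of Python dictionaries. The names `roster` and `rpg` are hypothetical, and only a handful of columns from the table are copied over; `None` stands in for `NA`.

```python
# Each dict is one case (row); each key is one variable (column).
roster = [
    {"Number": 1, "Name": "Kam Jones", "Class": "Sr", "Pos": "G", "Wt": 205, "PPG": 19.2},
    {"Number": 2, "Name": "Chase Ross", "Class": "Jr", "Pos": "G", "Wt": 210, "PPG": 10.5},
    {"Number": 8, "Name": "Joshua Clark", "Class": "Fr", "Pos": "F", "Wt": 225, "PPG": None},
]

# Adding a new case appends a row.
roster.append({"Number": 12, "Name": "Ben Gold", "Class": "Jr", "Pos": "F", "Wt": 235, "PPG": 7.4})

# Adding a new variable adds a column to every row.
rpg = {1: 4.5, 2: 3.8, 8: None, 12: 4.3}
for case in roster:
    case["RPG"] = rpg[case["Number"]]

n_cases = len(roster)
variables = list(roster[0].keys())
print(n_cases, "cases,", len(variables), "variables:", variables)
```

A data-frame library would do the same job with less code; the plain-Python version is used here only to keep the structure visible.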

Population and Sample

Target Population

The first step in conducting a study is to identify questions to be investigated.
A clear research question is helpful in identifying
- what cases should be studied (row)
- what variables are important (column)
Target Population: the complete collection of objects that we are interested in studying.

What is the average GPA of currently enrolled Marquette students?

All Marquette students that are currently enrolled.

Does a new drug reduce mortality in patients with severe heart disease?

All people with severe heart disease.

Sample Data

Sometimes, it’s possible to collect data on all the cases we are interested in.
Most of the time, it is too expensive to collect data for every case in a population.
What about the average GPA of all students in Illinois? the U.S.? the world? 😱 😱 😱

Sample: A subset of cases selected from a population.
Compute the average GPA of the sample data
Hope sample avg GPA \(\approx\) population avg GPA. 🙏
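
The hope that the sample average lands near the population average can be checked on made-up data. The sketch below simulates a hypothetical population of 10,000 GPAs (the mean of 3.2 and spread of 0.5 are assumptions chosen for illustration, not real Marquette figures), draws a random sample of 100, and compares the two averages.

```python
import random
import statistics

random.seed(4720)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated GPAs, clipped to [0, 4].
population = [min(4.0, max(0.0, random.gauss(3.2, 0.5))) for _ in range(10_000)]

# A sample is a subset of cases selected from the population.
sample = random.sample(population, 100)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean GPA: {pop_mean:.3f}")
print(f"sample mean GPA:     {sample_mean:.3f}")
```

In practice the population average is unknown, which is exactly why the sample average is used in its place.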

Good Sample vs. Bad Sample

Is this 4720/5720 class a sample from the target population of Marquette students?

Is this 4720/5720 class a “good” sample of the target population?

The sample is convenient to collect, but it may NOT be representative of the population.
Biased sample: The average GPA of the class may be far from that of all Marquette undergrads.

How and Why a Representative Sample?

We always seek to select a sample at random from the population.
Many statistical methods rely on the assumption of random sampling.

Data Collection

Two Types of Studies to Collect Sample Data

Observational Study: Observe and measure characteristics/variables, and do NOT attempt to modify or intervene with the subjects being studied.
- Sample from 1️⃣ the heart disease and 2️⃣ heart disease-free populations. Then record the fat content of the diets for the two groups.

Experimental Study: Apply some treatment(s) and then observe the responses or effects on the individuals (experimental units).
- Assign volunteers to one of several diets with different levels of dietary fat (treatments). Then compare the treatments with respect to the incidence of heart disease after a period of time.
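
One common way to carry out the assignment step is random assignment, sketched below. The slide does not say how volunteers are assigned, so the random shuffle, the 30 volunteers, and the three diet labels are all assumptions made for illustration.

```python
import random

random.seed(5)

# Hypothetical volunteers and diet labels (treatments).
volunteers = [f"volunteer_{i:02d}" for i in range(1, 31)]
diets = ["low fat", "medium fat", "high fat"]

# Shuffle, then deal volunteers to treatments in turn so that
# chance alone decides who gets which diet.
shuffled = volunteers[:]
random.shuffle(shuffled)
assignment = {diet: shuffled[i::len(diets)] for i, diet in enumerate(diets)}

for diet, group in assignment.items():
    print(diet, len(group))
```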

Observational or Experimental?

Randomly select 40 males and 40 females to see the difference in blood pressure levels between males and females.
Test the effects of a new drug by randomly dividing patients into 3 groups (high dosage, low dosage, placebo).

Limitation of Observational Studies: Confounding

Confounder: A variable that is NOT included in a study but affects the variables in the study.
Suppose past data show that increases in ice cream sales are associated with increases in drownings, and we conclude that eating ice cream causes drownings. 😱 😕 ⁉️

What is the confounder that is not in the data, but affects ice cream sales and the number of drownings?

Temperature: as temperature increases, ice cream sales increase and the number of drownings goes up because more people swim.
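
A small simulation can show how a confounder creates an association on its own. In the sketch below, temperature drives both ice cream sales and drownings, and neither affects the other; every coefficient and noise level is made up, and the drowning values are a stylized index rather than real counts.

```python
import random

random.seed(1)

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Simulated daily temperatures in degrees Fahrenheit.
temperature = [random.uniform(30, 100) for _ in range(365)]
# Both outcomes depend on temperature only, never on each other.
ice_cream_sales = [5 * t + random.gauss(0, 40) for t in temperature]
drownings = [0.1 * t + random.gauss(0, 1.0) for t in temperature]

r = pearson(ice_cream_sales, drownings)
print(f"correlation(ice cream sales, drownings) = {r:.2f}")
```

The correlation comes out clearly positive even though the code contains no link from ice cream to drownings, which is the pattern the slide warns against reading as causal.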

Causal Relationship

Making causal conclusions based on experiments is often more reasonable than making the same causal conclusions based on observational data.
Observational studies are generally only sufficient to show associations, not causality.

Sampling Methods

Simple Random Sample

Random Sample: Each member of a population is equally likely to be selected.
Simple Random Sample (SRS): Every possible sample of sample size \(n\) has the same chance to be chosen.
Example: To sample 100 students from all, say, 10,000 Marquette students, I would assign each student a unique number (from 1 to 10,000), then randomly select 100 of those numbers.
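
The numbering procedure in this example can be sketched with Python's standard `random` module. The ID range 1 to 10,000 follows the example; the variable names and the seed are illustrative.

```python
import random

random.seed(2024)

# Hypothetical ID numbers 1..10,000, one per student.
student_ids = range(1, 10_001)

# random.sample draws without replacement; every subset of size 100
# is equally likely to be chosen, which is what makes this an SRS.
srs = random.sample(student_ids, 100)
print(sorted(srs)[:10])
```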

https://research-methodology.net/sampling-in-primary-data-collection/random-sampling/

Stratified Random Sample

Stratified Sampling: Subdivide the population into different subgroups (strata) that share the same characteristics, then draw a simple random sample from each subgroup.
Homogeneous within strata; Non-homogeneous between strata

Stratified Random Sample Example

Example: Divide Marquette students into groups by colleges, then SRS for each group.
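
A minimal sketch of this two-step procedure follows. The college labels and sizes are placeholders, not real enrollment figures, and sampling 1% of each stratum (proportional allocation) is one possible choice among several.

```python
import random
from collections import defaultdict

random.seed(7)

# Hypothetical population: (student_id, college) pairs; sizes are made up.
college_sizes = {"College A": 3000, "College B": 2000, "College C": 1500, "College D": 500}
students = []
next_id = 1
for college, size in college_sizes.items():
    for _ in range(size):
        students.append((next_id, college))
        next_id += 1

# Step 1: subdivide the population into strata.
strata = defaultdict(list)
for student_id, college in students:
    strata[college].append(student_id)

# Step 2: draw a simple random sample from each stratum
# (here 1% of each stratum, i.e. proportional allocation).
stratified_sample = {
    college: random.sample(ids, len(ids) // 100)
    for college, ids in strata.items()
}
for college, ids in stratified_sample.items():
    print(college, len(ids))
```

Every college is guaranteed to appear in the sample, which a single simple random sample of the same total size would not ensure.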

Cluster Sampling

Cluster Sampling: Divide the population into clusters, then randomly select some of those clusters, and then choose all the members from those selected clusters.
Homogeneous between clusters; Non-homogeneous within clusters

Cluster Sampling Example

Example: Study the drinking habits of 4720 students by dividing the students into 9 groups, then randomly selecting 3 groups and interviewing all of the students in each selected cluster.
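
The same steps can be sketched in code. The class size of 45 and the equal clusters of 5 are assumptions made so the arithmetic is easy to follow; only the 9 clusters and the 3 selected come from the example.

```python
import random

random.seed(3)

# Hypothetical class of 45 students split into 9 clusters of 5.
students = [f"student_{i:02d}" for i in range(1, 46)]
clusters = [students[i:i + 5] for i in range(0, 45, 5)]

# Randomly select 3 of the 9 clusters...
chosen = random.sample(range(len(clusters)), 3)
# ...then include every member of each selected cluster.
cluster_sample = [s for k in chosen for s in clusters[k]]
print(sorted(chosen), len(cluster_sample))
```

Unlike stratified sampling, the randomness here is in which groups are picked, not in which members are picked within a group.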

Data Type

Categorical vs. Numerical Variables

A categorical variable provides non-numerical information which can be placed in one (and only one) category from two or more categories.
- Gender (Male 👨, Female 👩, Trans 🏳️‍🌈)
- Class (Freshman, Sophomore, Junior, Senior, Graduate)
- Country (USA 🇺🇸, Canada 🇨🇦, UK 🇬🇧, Germany 🇩🇪, Japan 🇯🇵, Korea 🇰🇷)
A numerical variable is recorded as a numerical value representing a count or a measurement.
- GPA
- The number of relationships you’ve had
- Height

Numerical Variables can be Discrete or Continuous

A discrete variable takes on a finite or countable number of values.
A continuous variable takes on values anywhere over a particular range without gaps or jumps.
- GPA is continuous because it can be any value between 0 and 4.
- The number of relationships you’ve had is discrete because you can count the number and it is finite.
- Height is continuous because it can be any number within a range.

Categorical Variables are Usually Recorded as Numbers

Gender (Male = 0, Female = 1, Trans = 2)
Class (Freshman = 1, Sophomore = 2, Junior = 3, Senior = 4, Graduate = 5)
Country (USA = 100, Canada = 101, UK = 200, Germany = 201, Japan = 300, Korea = 301)
United Airlines boarding groups
The numbers represent categories only; differences between them are meaningless.
- Canada - USA = 101 - 100 = 1?
- Graduate - Sophomore = 5 - 2 = 3 = Junior?
We need to learn the levels of measurement to know which arithmetic operations, if any, are meaningful.
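
To see why arithmetic on category codes is meaningless, the sketch below averages the country codes listed above for six made-up survey responses, then counts categories instead. The responses are invented for illustration.

```python
import statistics
from collections import Counter

# Country codes from the text; the six responses are made up.
country_code = {"USA": 100, "Canada": 101, "UK": 200, "Germany": 201, "Japan": 300, "Korea": 301}
responses = ["USA", "Japan", "USA", "Korea", "Canada", "USA"]
codes = [country_code[c] for c in responses]

# Python will happily average the codes, but the result is not a country.
mean_code = statistics.mean(codes)
print(mean_code)  # a number with no interpretation

# Counting how often each category occurs is meaningful.
counts = Counter(responses)
print(counts.most_common(1))
```

The software does not know the numbers are labels, so it is up to the analyst to choose operations that fit the variable.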

Levels of Measurement: Nominal and Ordinal for Categorical Variables

Nominal: The data can NOT be ordered in a meaningful or natural way.
- Gender (Male = 0, Female = 1, Trans = 2) is nominal because Male, Female and Trans cannot be ordered.
- Country (USA = 100, Canada = 101, UK = 200, Germany = 201, Japan = 300, Korea = 301) is nominal.

Ordinal: The data can be arranged in some meaningful order, but differences between data values can NOT be determined or are meaningless.
- Class (Freshman = 1, Sophomore = 2, Junior = 3, Senior = 4, Graduate = 5) is ordinal because the classes have a natural order (Sophomore is one class higher than Freshman), but differences between the codes are not meaningful.

Levels of Measurement: Interval and Ratio for Numerical Variables

Interval: The data have meaningful differences between any two values, but they do NOT have a natural zero or starting point. We can meaningfully apply \(\color{red} +\) and \(\color{red} -\), but not \(\color{red} \times\) and \(\color{red} \div\).
- Temperature is interval because \(80^{\circ}\)F is 40 degrees higher than \(40^{\circ}\)F \((80-40=40)\), but \(0^{\circ}\)F does not mean NO heat and \(80^{\circ}\)F is NOT twice as hot as \(40^{\circ}\)F.

Ratio: The data have both meaningful differences and ratios, and there is a natural zero starting point that indicates none of the quantity. The data can do \(\color{red} +\), \(\color{red} -\), \(\color{red} \times\) and \(\color{red} \div\).
- Distance is ratio because \(80\) miles is twice as far as \(40\) miles \((80/40 = 2)\), and \(0\) miles means no distance.
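
One way to check whether a ratio is meaningful is to change units and see if the ratio survives. The sketch below does this for the temperature and distance examples; the conversion to kelvins is an addition for illustration (the Kelvin scale has a true zero) and is not part of the original examples.

```python
def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvins, a scale with a true zero."""
    return (f - 32) * 5 / 9 + 273.15

# Differences are meaningful on an interval scale.
print(80 - 40)

# The naive ratio 80/40 = 2 depends on where the scale puts its zero.
# On the Kelvin scale the same two temperatures give a much smaller ratio.
ratio_kelvin = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)
print(round(ratio_kelvin, 2))

# Distance has a natural zero, so the ratio survives a change of units.
KM_PER_MILE = 1.609344
ratio_miles = 80 / 40
ratio_km = (80 * KM_PER_MILE) / (40 * KM_PER_MILE)
print(ratio_miles, ratio_km)
```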

Converting Numerical to Categorical

You’ve already seen an example.

Grade	Percentage
A	[94, 100]
A-	[90, 94)
B+	[87, 90)
B	[83, 87)
B-	[80, 83)
C+	[77, 80)
C	[73, 77)
C-	[70, 73)
D+	[65, 70)
D	[60, 65)
F	[0, 60)
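
The grade table is a rule for turning a numerical variable (percentage) into an ordinal categorical one (letter grade). A minimal sketch of that rule follows; the function name `letter_grade` and the error handling for out-of-range inputs are illustrative choices.

```python
# Lower cutoffs taken from the grade table above, highest first.
CUTOFFS = [
    (94, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (65, "D+"), (60, "D"), (0, "F"),
]

def letter_grade(percentage):
    """Map a numerical percentage in [0, 100] to a categorical letter grade."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    for lower, grade in CUTOFFS:
        if percentage >= lower:
            return grade

for p in (95.5, 90, 89.9, 59.9):
    print(p, letter_grade(p))
```

The conversion loses information: 90 and 93.9 both become A-, and the original percentages cannot be recovered from the letters.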

Identify data type of each variable in the Marquette men’s basketball player data

Number	Name	Class	Pos	Ht	Wt	Hometown	High_School	PPG	RPG	APG
1	Kam Jones	Sr	G	6’5”	205	Memphis, TN	Evangelical Christian School	19.2	4.5	5.9
2	Chase Ross	Jr	G	6’5”	210	Dallas, TX	Cushing Academy	10.5	3.8	2.1
4	Stevie Mitchell	Sr	G	6’3”	200	Reading, PA	Wilson HS	10.7	4.1	1.6
5	Tre Norman	So	G	6’4”	210	Boston, MA	Worcester Academy	1.9	1.5	0.5
7	Zaide Lowery	So	G	6’5”	200	Springfield, MO	La Lumiere School	4.1	3.0	0.2
8	Joshua Clark	Fr	F	7’1”	225	Virginia Beach, VA	Clements HS	NA	NA	NA
10	Damarius Owens	Fr	F	6’7”	200	Rochester, NY	Western Reserve Academy	2.6	1.2	0.5
12	Ben Gold	Jr	F	6’11”	235	Wellington, NZ	NBA Global Academy	7.4	4.3	0.9
13	Royce Parham	Fr	F	6’8”	230	Pittsburgh, PA	Western Reserve Academy	5.1	2.2	0.4
21	Al Amadou	So	F	6’9”	210	Philadelphia, PA	Chestnut Hill Academy	NA	NA	NA
22	Sean Jones	Jr (RS)	G	5’10”	185	Columbus, OH	Lincoln HS	NA	NA	NA
23	David Joplin	Sr	F	6’8”	225	Brookfield, WI	Brookfield Central HS	14.2	5.4	1.3
25	Jack Anderson	Sr	G	6’4”	200	Davie, FL	Western HS	0.4	0.4	0.0
35	Caedin Hamilton	R-Fr	F	6’9”	250	Santa Maria, CA	St. Joseph HS	1.5	1.2	0.6
40	Casey O’Malley	Jr	G	6’3”	200	Omaha, NE	Creighton Prep	0.6	0.2	0.0
41	Jonah Lucas	Jr	G	6’1”	180	West Lafayette, IN	Harrison HS	0.0	0.2	0.0
42	Luke Jacobson	Fr	F	6’7”	215	San Luis Obispo, CA	Mission Prep	NA	NA	NA
54	Jake Ciardo	Jr	G	6’2”	185	Germantown, WI	Germantown HS	0.4	0.8	0.0
55	Cameron Brown	Sr	G	6’1”	215	Plano, TX	John Paul II HS	0.4	0.2	0.0
