Overview of Statistics and Data 📖

MATH 4720/MSSC 5720 Introduction to Statistics

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

What is Statistics

Statistics as Numeric Records

  • In ordinary conversations, the word statistics is used as a term to indicate a set or collection of numeric records.

https://slamgoods.com/products/jordan-collectors-issue

https://www.amazon.com/Funny-Statistics-Shirt-Definition/dp/B07JKMCDR2?customId=B07537PB8C&customizationToken=MC_Assembly_1%23B07537PB8C&th=1&psc=1

Statistics as a Discipline

https://en.wikipedia.org/wiki/Statistics
  • Statistics is the subject that deals with data: the Science of Data.
  • It is a science of data built on statistical thinking, methods, and models.

🤔 But wait, then what is DATA SCIENCE ❓

Difference between Statistics and Data Science

My ChatGPT says:

Statistics is foundational to Data Science, but Data Science also includes programming, data engineering, machine learning, and business communication.

Data Science Life Cycle

UC Santa Cruz Department of Statistics Courses

STAT 5 – Statistics

STAT 7 – Statistical Methods for the Biological, Environmental, and Health Sciences

STAT 17 – Statistical Methods for Business and Economics

STAT 80A – Gambling and Gaming

STAT 80B – The Art of Data Visualization

STAT 108 – Linear Regression

STAT 131 – Introduction to Probability Theory

STAT 132 – Classical and Bayesian Inference

STAT 202 – Linear Models in SAS

STAT 203 – Introduction to Probability Theory

STAT 204 – Introduction to Statistical Data Analysis

STAT 205 – Introduction to Classical Statistical Learning

STAT 205B – Intermediate Classical Inference

STAT 206 – Applied Bayesian Statistics

STAT 206B – Intermediate Bayesian Inference

STAT 207 – Intermediate Bayesian Statistical Modeling

STAT 208 – Linear Statistical Models

STAT 209 – Generalized Linear Models

STAT 221 – Statistical Machine Learning

STAT 222 – Bayesian Nonparametric Methods

STAT 223 – Time Series Analysis

STAT 224 – Bayesian Survival Analysis and Clinical Design

STAT 225 – Multivariate Statistical Methods

STAT 226 – Spatial Statistics

STAT 227 – Statistical Learning and High-Dimensional Data Analysis

STAT 229 – Advanced Bayesian Computation

STAT 243 – Stochastic Processes

STAT 244 – Bayesian Decision Theory

STAT 246 – Probability Theory with Markov Chains

STAT 266A – Data Visualization and Statistical Programming in R

STAT 266B – Advanced Statistical Programming in R

STAT 266C – Introduction to Data Wrangling

  • ⬛ Methods and models
  • 🟩 Other data science related topics

Data Science May Now Be a Broader View of Statistics

Collection, organization, analysis, interpretation and presentation of data.

What We Learn In this Course

OpenIntro Statistics Contents

We Focus On Statistical Inference

  • We spend most of our time on various statistical methods for analyzing data.

  • Learn useful information

    • about the population we are interested in (e.g. All Marquette students)

    • from our sample data (e.g. Students in MATH 4720)

    • through statistical inferential methods, including estimation and testing (e.g. Confidence intervals)

Statistics is a Science of Data, so What is Data?

  • Data: A set of objects on which we observe or measure one or more characteristics.

  • Objects are individuals, observations, subjects or cases in statistical studies.

  • A characteristic or attribute is called a variable because it varies from one object to another.

Number | Name | Class | Pos | Ht | Wt | Hometown | High_School | PPG | RPG | APG
1 | Kam Jones | Sr | G | 6’5” | 205 | Memphis, TN | Evangelical Christian School | 19.2 | 4.5 | 5.9
2 | Chase Ross | Jr | G | 6’5” | 210 | Dallas, TX | Cushing Academy | 10.5 | 3.8 | 2.1
4 | Stevie Mitchell | Sr | G | 6’3” | 200 | Reading, PA | Wilson HS | 10.7 | 4.1 | 1.6
5 | Tre Norman | So | G | 6’4” | 210 | Boston, MA | Worcester Academy | 1.9 | 1.5 | 0.5
7 | Zaide Lowery | So | G | 6’5” | 200 | Springfield, MO | La Lumiere School | 4.1 | 3.0 | 0.2
8 | Joshua Clark | Fr | F | 7’1” | 225 | Virginia Beach, VA | Clements HS | NA | NA | NA
10 | Damarius Owens | Fr | F | 6’7” | 200 | Rochester, NY | Western Reserve Academy | 2.6 | 1.2 | 0.5
12 | Ben Gold | Jr | F | 6’11” | 235 | Wellington, NZ | NBA Global Academy | 7.4 | 4.3 | 0.9
13 | Royce Parham | Fr | F | 6’8” | 230 | Pittsburgh, PA | Western Reserve Academy | 5.1 | 2.2 | 0.4
21 | Al Amadou | So | F | 6’9” | 210 | Philadelphia, PA | Chestnut Hill Academy | NA | NA | NA
22 | Sean Jones | Jr (RS) | G | 5’10” | 185 | Columbus, OH | Lincoln HS | NA | NA | NA
23 | David Joplin | Sr | F | 6’8” | 225 | Brookfield, WI | Brookfield Central HS | 14.2 | 5.4 | 1.3
25 | Jack Anderson | Sr | G | 6’4” | 200 | Davie, FL | Western HS | 0.4 | 0.4 | 0.0
35 | Caedin Hamilton | R-Fr | F | 6’9” | 250 | Santa Maria, CA | St. Joseph HS | 1.5 | 1.2 | 0.6
40 | Casey O’Malley | Jr | G | 6’3” | 200 | Omaha, NE | Creighton Prep | 0.6 | 0.2 | 0.0
41 | Jonah Lucas | Jr | G | 6’1” | 180 | West Lafayette, IN | Harrison HS | 0.0 | 0.2 | 0.0
42 | Luke Jacobson | Fr | F | 6’7” | 215 | San Luis Obispo, CA | Mission Prep | NA | NA | NA
54 | Jake Ciardo | Jr | G | 6’2” | 185 | Germantown, WI | Germantown HS | 0.4 | 0.8 | 0.0
55 | Cameron Brown | Sr | G | 6’1” | 215 | Plano, TX | John Paul II HS | 0.4 | 0.2 | 0.0

Data Matrix

  • Each row corresponds to a unique case or observational unit.

  • Each column represents a characteristic or variable.

  • This structure allows new cases to be added as rows or new variables as new columns.
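
The row/column idea can be sketched in a few lines of Python. This is only an illustration: it stores a handful of roster rows as dictionaries, and the derived column Ht_in (height in inches) is a hypothetical variable that is not part of the original table.

```python
# A data matrix as a list of rows; each row maps a variable name to a value.
# The rows are copied from the roster table (only some columns are kept).
roster = [
    {"Number": 1, "Name": "Kam Jones", "Class": "Sr", "Ht": "6'5\"", "Wt": 205},
    {"Number": 2, "Name": "Chase Ross", "Class": "Jr", "Ht": "6'5\"", "Wt": 210},
    {"Number": 4, "Name": "Stevie Mitchell", "Class": "Sr", "Ht": "6'3\"", "Wt": 200},
]

# Adding a new case means appending a row.
roster.append({"Number": 5, "Name": "Tre Norman", "Class": "So", "Ht": "6'4\"", "Wt": 210})


def height_in_inches(ht: str) -> int:
    """Convert a height string such as 6'5" to total inches."""
    feet, inches = ht.rstrip('"').split("'")
    return 12 * int(feet) + int(inches)


# Adding a new variable means adding a column to every row.
# Ht_in is a hypothetical derived variable, not a column of the original table.
for row in roster:
    row["Ht_in"] = height_in_inches(row["Ht"])

for row in roster:
    print(row["Number"], row["Name"], row["Ht"], row["Ht_in"])
```

Every row ends up with the same set of variables, which is what keeps the structure a rectangular matrix.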

Population and Sample

Target Population

  • The first step in conducting a study is to identify questions to be investigated.

  • A clear research question is helpful in identifying

    • what cases should be studied (row)
    • what variables are important (column)
  • Target Population: the collection of all objects we are interested in studying.

  • What is the average GPA of currently enrolled Marquette students?

All Marquette students that are currently enrolled.

  • Does a new drug reduce mortality in patients with severe heart disease?

All people with severe heart disease.

Sample Data

  • Sometimes, it’s possible to collect data on all the cases we are interested in.

  • Most of the time, it is too expensive to collect data for every case in a population.

  • What about the average GPA of all students in Illinois? the U.S.? the world? 😱 😱 😱

  • Sample: A subset of cases selected from a population.

  • Compute the average GPA of the sample data

  • Hope that the sample average GPA \(\approx\) the population average GPA. 🙏
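
A minimal simulation of this idea, assuming a made-up population: the 10,000 GPAs below are generated from a normal distribution purely for illustration (the mean 3.2 and spread 0.5 are assumptions, not real Marquette data).

```python
import random
import statistics

rng = random.Random(4720)  # fixed seed so the run is reproducible

# Hypothetical population: 10,000 simulated GPAs, clipped to the range [0, 4].
population = [min(4.0, max(0.0, rng.gauss(3.2, 0.5))) for _ in range(10_000)]

# A sample is a subset of cases; here 100 cases drawn without replacement.
sample = rng.sample(population, k=100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population average GPA: {population_mean:.3f}")
print(f"sample average GPA:     {sample_mean:.3f}")
print(f"difference:             {sample_mean - population_mean:+.3f}")
```

In practice the population average is unknown, which is exactly why we rely on the sample; the simulation only works because we invented the whole population.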

Good Sample vs. Bad Sample

Is this 4720/5720 class a sample from the target population of Marquette students?

Is this 4720/5720 class a “good” sample of the target population?

  • The sample is convenient to collect, but it may NOT be representative of the population.

  • Biased sample: The average GPA of the class may be far from that of all Marquette undergrads.

How and Why a Representative Sample?

  • We always seek to randomly select a sample from a population.

  • Many statistical methods are based on the assumption that the sample was selected at random.
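
The sketch below contrasts a convenience sample with a random sample of the same size. Everything in it is simulated: the population size, the class size of 60, and in particular the assumption that students in the class have a higher average GPA than everyone else are made up to show how bias can arise.

```python
import random
import statistics

rng = random.Random(5720)

# Hypothetical population of (in_class, gpa) pairs.
# ASSUMPTION for illustration only: the 60 students in the class
# have a higher average GPA (3.6) than the other students (3.1).
population = []
for i in range(10_000):
    in_class = i < 60
    center = 3.6 if in_class else 3.1
    gpa = min(4.0, max(0.0, rng.gauss(center, 0.4)))
    population.append((in_class, gpa))

true_mean = statistics.mean(gpa for _, gpa in population)

# Convenience sample: everyone in the class.
convenience = [gpa for in_class, gpa in population if in_class]

# Random sample of the same size drawn from the whole population.
random_sample = [gpa for _, gpa in rng.sample(population, k=60)]

convenience_error = statistics.mean(convenience) - true_mean
random_error = statistics.mean(random_sample) - true_mean

print(f"population average:       {true_mean:.3f}")
print(f"convenience sample error: {convenience_error:+.3f}")
print(f"random sample error:      {random_error:+.3f}")
```

Both samples have 60 students, so the gap between the two errors comes from how the students were selected, not from how many were selected.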

Data Collection

Two Types of Studies to Collect Sample Data

  • Observational Study: Observe and measure characteristics/variables, and do NOT attempt to modify or intervene with the subjects being studied.
    • Sample from 1️⃣ the heart disease and 2️⃣ heart disease-free populations. Then record the fat content of the diets for the two groups.
  • Experimental Study: Apply some treatment(s) and then observe the responses or effects on the individuals (experimental units).
    • Assign volunteers to one of several diets with different levels of dietary fat (treatments). Then compare the treatments with respect to the incidence of heart disease after a period of time.

Observational or Experimental?

  • Randomly select 40 males and 40 females to see the difference in blood pressure levels between males and females.

  • Test the effects of a new drug by randomly dividing patients into 3 groups (high dosage, low dosage, placebo).

Limitation of Observational Studies: Confounding

  • Confounder: A variable that is NOT included in a study but affects the variables in the study.
  • Suppose past data show that increases in ice cream sales are associated with increases in drownings, and we conclude that eating ice cream causes drownings. 😱 😕 ⁉️

What is the confounder that is not in the data, but affects ice cream sales and the number of drownings?

Temperature: as temperature increases, ice cream sales increase and the number of drownings goes up because more people swim.
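
A small simulation can make the confounding visible. In the sketch below all numbers are invented: temperature drives both ice cream sales and drownings, and ice cream has no effect on drownings at all. The two variables are still strongly correlated, and the correlation nearly disappears once the linear effect of temperature is removed from both.

```python
import random

rng = random.Random(1)


def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y, x):
    """What is left of y after subtracting its least-squares line on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Simulated daily data; every coefficient here is made up for illustration.
temperature = [rng.uniform(30, 100) for _ in range(365)]        # degrees F
ice_cream = [5.0 * t + rng.gauss(0, 40) for t in temperature]   # sales
drownings = [0.1 * t + rng.gauss(0, 1.5) for t in temperature]  # incident index

raw_r = pearson(ice_cream, drownings)
adjusted_r = pearson(residuals(ice_cream, temperature), residuals(drownings, temperature))

print(f"correlation, ignoring temperature:      {raw_r:.2f}")
print(f"correlation, after removing temperature: {adjusted_r:.2f}")
```

Removing temperature is only possible here because the simulation recorded it; in a real observational study the confounder may never have been measured.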

Causal Relationship

  • Making causal conclusions based on experiments is often more reasonable than making the same causal conclusions based on observational data.

  • Observational studies are generally only sufficient to show associations, not causality.

Sampling Methods

Simple Random Sample

  • Random Sample: Each member of a population is equally likely to be selected.

  • Simple Random Sample (SRS): Every possible sample of sample size \(n\) has the same chance to be chosen.

  • Example: To sample 100 students from all, say, 10,000 Marquette students, I would assign each student a unique number (from 1 to 10,000), then randomly select 100 of those numbers.

https://research-methodology.net/sampling-in-primary-data-collection/random-sampling/
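
The numbering example above can be sketched with Python's standard library. The population size of 10,000 is the illustrative figure from the slide, and the seed is arbitrary.

```python
import random

rng = random.Random(2024)  # arbitrary seed, only for reproducibility

# Each student gets a unique number from 1 to 10,000.
student_ids = range(1, 10_001)

# random.sample draws without replacement, and every subset of
# 100 distinct numbers is equally likely to be the one chosen.
srs = rng.sample(student_ids, k=100)

print(sorted(srs)[:10])  # first few selected student numbers
```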

Stratified Random Sample

  • Stratified Sampling: Subdivide the population into different subgroups (strata) that share the same characteristics, then draw a simple random sample from each subgroup.

  • Homogeneous within strata; Non-homogeneous between strata

Stratified Random Sample Example

  • Example: Divide Marquette students into groups by college, then draw an SRS from each group.
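
A rough sketch of stratified sampling by college. The college names and enrollment counts are hypothetical, and sampling 1 in every 100 students from each stratum (proportional allocation) is just one possible choice, not something the slide prescribes.

```python
import random

rng = random.Random(7)

# Hypothetical strata: college name -> number of enrolled students.
college_sizes = {
    "Arts & Sciences": 3000,
    "Business": 2500,
    "Engineering": 2000,
    "Nursing": 1500,
    "Communication": 1000,
}

# Give every student a unique ID and group the IDs by college (stratum).
strata = {}
next_id = 1
for college, size in college_sizes.items():
    strata[college] = list(range(next_id, next_id + size))
    next_id += size

# Draw a simple random sample within each stratum:
# 1 student per 100 enrolled (an assumed allocation rule).
stratified_sample = {
    college: rng.sample(ids, k=len(ids) // 100)
    for college, ids in strata.items()
}

for college, chosen in stratified_sample.items():
    print(f"{college}: {len(chosen)} students sampled")
```

Unlike a single SRS from the whole university, this guarantees that every college appears in the sample.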

Cluster Sampling

  • Cluster Sampling: Divide the population into clusters, then randomly select some of those clusters, and then choose all the members from those selected clusters.

  • Homogeneous between clusters; Non-homogeneous within clusters

Cluster Sampling Example

  • Example: Study the drinking habits of 4720 students by dividing the students into 9 groups, then randomly selecting 3 groups and interviewing all of the students in each of those clusters.
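
A sketch of the cluster sampling example, assuming a hypothetical class of 45 students split into 9 clusters of 5. The class size and the way students are assigned to clusters are made up for illustration.

```python
import random

rng = random.Random(9)

# Hypothetical class list of 45 students.
students = [f"student_{i:02d}" for i in range(1, 46)]

# Divide the class into 9 clusters of 5 students each.
clusters = [students[i::9] for i in range(9)]

# Randomly select 3 of the 9 clusters ...
chosen_clusters = rng.sample(range(9), k=3)

# ... and include ALL members of each selected cluster.
cluster_sample = [s for c in chosen_clusters for s in clusters[c]]

print("selected clusters:", sorted(chosen_clusters))
print("students interviewed:", len(cluster_sample))
```

Note the contrast with stratified sampling: here the randomness is in which groups are chosen, not in which members are chosen within a group.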

Data Type

Categorical vs. Numerical Variables

  • A categorical variable records non-numerical information: each case falls into one (and only one) of two or more categories.

    • Gender (Male 👨, Female 👩, Trans 🏳️‍🌈)
    • Class (Freshman, Sophomore, Junior, Senior, Graduate)
    • Country (USA 🇺🇸, Canada 🇨🇦, UK 🇬🇧, Germany 🇩🇪, Japan 🇯🇵, Korea 🇰🇷)
  • A numerical variable is recorded as a numerical value representing counts or measurements.

    • GPA
    • The number of relationships you’ve had
    • Height

Numerical Variables can be Discrete or Continuous

  • A discrete variable takes on a finite or countable number of values.

  • A continuous variable takes on values anywhere over a particular range without gaps or jumps.

    • GPA is continuous because it can be any value between 0 and 4.
    • The number of relationships you’ve had is discrete because it is a count (0, 1, 2, ...) and it is finite.
    • Height is continuous because it can be any number within a range.

Categorical Variables are Usually Recorded as Numbers

  • Gender (Male = 0, Female = 1, Trans = 2)

  • Class (Freshman = 1, Sophomore = 2, Junior = 3, Senior = 4, Graduate = 5)

  • Country (USA = 100, Canada = 101, UK = 200, Germany = 201, Japan = 300, Korea = 301)

  • United Airlines boarding groups

  • The numbers represent categories only; differences between them are meaningless.

    • Canada - USA = 101 - 100 = 1?
    • Graduate - Sophomore = 5 - 2 = 3 = Junior?
  • We need to learn the levels of measurement to know which arithmetic operations, if any, are meaningful.
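
One way to see that arithmetic on category codes is meaningless: recode the same data with a different, equally arbitrary numbering and the "average" changes, while the category counts do not. The eight responses and the second coding below are hypothetical; the first coding is the one from the slide.

```python
from collections import Counter
from statistics import mean

# Hypothetical responses from eight students.
countries = ["USA", "Canada", "USA", "Japan", "Korea", "UK", "USA", "Germany"]

# Coding from the slide, plus a second, equally arbitrary coding (made up).
coding_a = {"USA": 100, "Canada": 101, "UK": 200, "Germany": 201, "Japan": 300, "Korea": 301}
coding_b = {"USA": 6, "Canada": 5, "UK": 4, "Germany": 3, "Japan": 2, "Korea": 1}

# The "average country" depends entirely on the arbitrary codes.
mean_a = mean(coding_a[c] for c in countries)
mean_b = mean(coding_b[c] for c in countries)

# Frequencies are the same no matter how the categories are coded.
counts = Counter(countries)

print("mean under coding A:", mean_a)
print("mean under coding B:", mean_b)
print("counts:", dict(counts))
```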

Levels of Measurement: Nominal and Ordinal for Categorical Variables

  • Nominal: The data can NOT be ordered in a meaningful or natural way.
    • Gender (Male = 0, Female = 1, Trans = 2) is nominal because Male, Female and Trans cannot be ordered.
    • Country (USA = 100, Canada = 101, UK = 200, Germany = 201, Japan = 300, Korea = 301) is nominal.


  • Ordinal: The data can be arranged in some meaningful order, but differences between data values can NOT be determined or are meaningless.
    • Class (Freshman = 1, Sophomore = 2, Junior = 3, Senior = 4, Graduate = 5) is ordinal because the classes have a natural order (Sophomore is higher than Freshman), but differences between the codes are not meaningful.

Levels of Measurement: Interval and Ratio for Numerical Variables

  • Interval: The data have meaningful differences between any two values, but do NOT have a natural zero or starting point. The data can do \(\color{red} +\) and \(\color{red} -\), but can’t reasonably do \(\color{red} \times\) and \(\color{red} \div\).
    • Temperature is interval because \(80^{\circ}\)F is 40 degrees higher than \(40^{\circ}\)F \((80-40=40)\), but \(0^{\circ}\)F does not mean NO heat and \(80^{\circ}\)F is NOT twice as hot as \(40^{\circ}\)F.
  • Ratio: The data have both meaningful differences and ratios, and there is a natural zero starting point that indicates none of the quantity. The data can do \(\color{red} +\), \(\color{red} -\), \(\color{red} \times\) and \(\color{red} \div\).
    • Distance is ratio because \(80\) miles is twice as far as \(40\) miles \((80/40 = 2)\), and \(0\) miles means no distance.
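
A quick check of the interval/ratio distinction, as a sketch: changing units should not change a meaningful ratio. Converting the temperatures to Celsius turns the apparent ratio of 2 into 6, while converting the distances to kilometers leaves the ratio at 2. The conversion formulas are standard; the variable names are just for this example.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Standard Fahrenheit-to-Celsius conversion."""
    return (deg_f - 32) * 5 / 9


KM_PER_MILE = 1.609344

# Interval scale: temperature.
hot_f, cool_f = 80.0, 40.0
hot_c, cool_c = fahrenheit_to_celsius(hot_f), fahrenheit_to_celsius(cool_f)

diff_f = hot_f - cool_f   # a 40-degree gap in Fahrenheit
diff_c = hot_c - cool_c   # the same gap in Celsius, rescaled by 5/9
ratio_f = hot_f / cool_f  # 2 in Fahrenheit ...
ratio_c = hot_c / cool_c  # ... but 6 in Celsius, so "twice as hot" is not meaningful

# Ratio scale: distance.
far_mi, near_mi = 80.0, 40.0
ratio_mi = far_mi / near_mi
ratio_km = (far_mi * KM_PER_MILE) / (near_mi * KM_PER_MILE)  # still 2

print(f"temperature ratio: {ratio_f:.2f} (F) vs {ratio_c:.2f} (C)")
print(f"distance ratio:    {ratio_mi:.2f} (mi) vs {ratio_km:.2f} (km)")
```

The difference survives the unit change (up to the constant factor 5/9), which is why addition and subtraction remain meaningful for interval data.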

Converting Numerical to Categorical

  • You’ve already seen an example.
Grade Percentage
A [94, 100]
A- [90, 94)
B+ [87, 90)
B [83, 87)
B- [80, 83)
C+ [77, 80)
C [73, 77)
C- [70, 73)
D+ [65, 70)
D [60, 65)
F [0, 60)
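
The grading table above can be written as a small function that turns a numerical percentage into a categorical letter grade. This is a sketch: the function name and the error handling are choices made for this example, while the cutoffs are copied from the table.

```python
# Lower bound of each grade interval, from highest to lowest (from the table).
CUTOFFS = [
    (94, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (65, "D+"), (60, "D"),
]


def letter_grade(percentage: float) -> str:
    """Map a percentage in [0, 100] to a letter grade (an ordinal category)."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    for lower_bound, grade in CUTOFFS:
        # Intervals are closed on the left: 90 is an A-, 89.9 is a B+.
        if percentage >= lower_bound:
            return grade
    return "F"  # everything in [0, 60)


for score in (97.5, 90, 89.9, 60, 59.9):
    print(score, letter_grade(score))
```

The conversion loses information: 90 and 93.9 both become A-, and the result is an ordinal variable rather than a numerical one.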

Identify the data type of each variable in the Marquette men’s basketball player data

Number | Name | Class | Pos | Ht | Wt | Hometown | High_School | PPG | RPG | APG
1 | Kam Jones | Sr | G | 6’5” | 205 | Memphis, TN | Evangelical Christian School | 19.2 | 4.5 | 5.9
2 | Chase Ross | Jr | G | 6’5” | 210 | Dallas, TX | Cushing Academy | 10.5 | 3.8 | 2.1
4 | Stevie Mitchell | Sr | G | 6’3” | 200 | Reading, PA | Wilson HS | 10.7 | 4.1 | 1.6
5 | Tre Norman | So | G | 6’4” | 210 | Boston, MA | Worcester Academy | 1.9 | 1.5 | 0.5
7 | Zaide Lowery | So | G | 6’5” | 200 | Springfield, MO | La Lumiere School | 4.1 | 3.0 | 0.2
8 | Joshua Clark | Fr | F | 7’1” | 225 | Virginia Beach, VA | Clements HS | NA | NA | NA
10 | Damarius Owens | Fr | F | 6’7” | 200 | Rochester, NY | Western Reserve Academy | 2.6 | 1.2 | 0.5
12 | Ben Gold | Jr | F | 6’11” | 235 | Wellington, NZ | NBA Global Academy | 7.4 | 4.3 | 0.9
13 | Royce Parham | Fr | F | 6’8” | 230 | Pittsburgh, PA | Western Reserve Academy | 5.1 | 2.2 | 0.4
21 | Al Amadou | So | F | 6’9” | 210 | Philadelphia, PA | Chestnut Hill Academy | NA | NA | NA
22 | Sean Jones | Jr (RS) | G | 5’10” | 185 | Columbus, OH | Lincoln HS | NA | NA | NA
23 | David Joplin | Sr | F | 6’8” | 225 | Brookfield, WI | Brookfield Central HS | 14.2 | 5.4 | 1.3
25 | Jack Anderson | Sr | G | 6’4” | 200 | Davie, FL | Western HS | 0.4 | 0.4 | 0.0
35 | Caedin Hamilton | R-Fr | F | 6’9” | 250 | Santa Maria, CA | St. Joseph HS | 1.5 | 1.2 | 0.6
40 | Casey O’Malley | Jr | G | 6’3” | 200 | Omaha, NE | Creighton Prep | 0.6 | 0.2 | 0.0
41 | Jonah Lucas | Jr | G | 6’1” | 180 | West Lafayette, IN | Harrison HS | 0.0 | 0.2 | 0.0
42 | Luke Jacobson | Fr | F | 6’7” | 215 | San Luis Obispo, CA | Mission Prep | NA | NA | NA
54 | Jake Ciardo | Jr | G | 6’2” | 185 | Germantown, WI | Germantown HS | 0.4 | 0.8 | 0.0
55 | Cameron Brown | Sr | G | 6’1” | 215 | Plano, TX | John Paul II HS | 0.4 | 0.2 | 0.0