You RRR a Beginner: Data Types and Structures 👨‍💻

MATH 4720/MSSC 5720 Introduction to Statistics

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

R is a Calculator - Arithmetic Operators

R is a Calculator - Examples

  • Operations inside parentheses are evaluated first.
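
As a minimal sketch of the calculator idea, the following lines use R's standard arithmetic operators and show how parentheses change the order of evaluation. The numbers are illustrative choices, not taken from the slides.

```r
2 + 3 * 4     # multiplication happens before addition: 14
(2 + 3) * 4   # the parenthesis is evaluated first: 20
2^3           # exponentiation: 8
7 / 2         # division: 3.5
7 %/% 2       # integer division: 3
7 %% 2        # remainder (modulo): 1
```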

R Does Comparisons - Logical Operators

5 <= 5
[1] TRUE
5 <= 4
[1] FALSE
# Is 5 NOT equal to 5? FALSE
5 != 5
[1] FALSE
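
Comparisons return TRUE or FALSE, and those results can be combined with R's logical operators & (and), | (or), and ! (not). The short sketch below uses illustrative numbers that are not from the slides.

```r
(5 > 3) & (5 < 10)   # AND: TRUE only if both sides are TRUE -> TRUE
(5 < 3) | (5 == 5)   # OR: TRUE if at least one side is TRUE -> TRUE
!(5 == 5)            # NOT: flips TRUE to FALSE -> FALSE
```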

Built-in Functions

  • R has lots of built-in functions, especially for mathematics, probability and statistics.

sqrt(144)
[1] 12
exp(1)  ## Euler's number
[1] 2.718282
sin(pi/2)
[1] 1
abs(-7)
[1] 7

Creating Variables

  • A variable stores a value that can be changed as needed.

  • Use the <- operator to assign a value to a variable. (Highly recommended 👍)

x <- 5  ## we create an object, value 5, and call it x, which is a variable.
x  ## type the variable name to see the value stored in the object x
[1] 5
(x <- x + 6)  # We can reassign any value to the variable we created
[1] 11
x == 5  # We can perform any operations on variables
[1] FALSE
log(x) # Variables can also be used in any built-in functions
[1] 2.397895

Object Types

Types of Variables

typeof(5)
[1] "double"
typeof(5L)
[1] "integer"
typeof("I_love_stat")
[1] "character"
typeof(1 > 3)
[1] "logical"

Variable Types in R and in Statistics

  • Type character and logical correspond to categorical variables.

  • Type logical is a special kind of categorical variable that has only two categories (binary).

  • Type double and integer correspond to numerical variables. (One exception is covered later.)
    • Type double is for continuous variables
    • Type integer is for discrete variables.


  • Create a variable age that stores your age. Check what type it is.

  • Create a variable name that stores your name. Check its type.

  • Create a variable is_male that stores whether you are male (true/false). Check its type.


R Data Structures

  • Vector

  • Factor

  • Matrix

  • Data Frame

(Atomic) Vector

  • To create a vector, use c(), short for concatenate or combine.

  • All elements of a vector must be of the same type.

(dbl_vec <- c(1, 2.5, 4.5)) 
[1] 1.0 2.5 4.5
(chr_vec <- c("pretty", "girl"))
[1] "pretty" "girl"  
## check how many elements are in a vector
length(dbl_vec) 
[1] 3
## show a compact description of 
## any R data structure
str(dbl_vec) 
 num [1:3] 1 2.5 4.5
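
Because all elements of a vector must share one type, R silently converts (coerces) mixed inputs to a common type. The sketch below illustrates this; the names mixed and num_lgl are hypothetical and chosen only for this example.

```r
# Mixing a number, a string, and a logical: everything becomes character
mixed <- c(1, "a", TRUE)
mixed           # "1" "a" "TRUE"
typeof(mixed)   # "character"

# Mixing a number and a logical: TRUE is converted to 1
num_lgl <- c(1.5, TRUE)
num_lgl           # 1.5 1.0
typeof(num_lgl)   # "double"
```

Checking typeof() after building a vector is a quick way to notice an unintended conversion.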

Operations on Vectors

  • We can apply to vectors the same operations we apply to a scalar variable (a vector of length 1).
# Create two vectors
v1 <- c(3, 8)
v2 <- c(4, 100) 

## All operations happen element-wise
# Vector addition
v1 + v2
[1]   7 108
# Vector subtraction
v1 - v2
[1]  -1 -92
# Vector multiplication
v1 * v2
[1]  12 800
# Vector division
v1 / v2
[1] 0.75 0.08
sqrt(v2)
[1]  2 10

Recycling of Vectors

  • If we apply arithmetic operations to two vectors of unequal length, the elements of the shorter vector will be recycled to complete the operations.
v1 <- c(3, 8, 4, 5)
# The following 2 operations are the same
v1 * 2
[1]  6 16  8 10
v1 * c(2, 2, 2, 2)
[1]  6 16  8 10
v3 <- c(4, 11)
v1 + v3  ## v3 becomes c(4, 11, 4, 11) when doing the operation
[1]  7 19  8 16
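
Recycling also happens when the longer length is not a multiple of the shorter one, but R then issues a warning. The following sketch shows that case; long_vec and short_vec are hypothetical names used only here.

```r
long_vec <- c(1, 2, 3)
short_vec <- c(10, 20)

# short_vec is recycled to c(10, 20, 10); since 3 is not a multiple of 2,
# R warns: "longer object length is not a multiple of shorter object length"
(res <- long_vec + short_vec)   # 11 22 13
```

The result is still computed, so the warning is worth reading: it often signals that the two vectors were not meant to have different lengths.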

Subsetting Vectors

  • To extract element(s) in a vector, use a pair of brackets [] with element indexing.

  • The indexing starts with 1.

v1
[1] 3 8 4 5
v2
[1]   4 100
## The 3rd element
v1[3]  
[1] 4
v1[c(1, 3)]
[1] 3 4
## extract all except a few elements
## put a negative sign before the vector of 
## indices
v1[-c(2, 3)] 
[1] 3 5
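
Besides positions, the brackets also accept a logical vector, which keeps the elements where the condition is TRUE. This is a small sketch that reuses the values of v1; the name big is hypothetical.

```r
v1 <- c(3, 8, 4, 5)

v1 > 4               # FALSE TRUE FALSE TRUE
(big <- v1[v1 > 4])  # keep only the elements greater than 4: 8 5
```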

Factor

  • A factor is a vector for categorical data whose levels can be ordered in a meaningful way. Create a factor with factor().
## Create a factor from a character vector using function factor()
(fac <- factor(c("med", "high", "low")))
[1] med  high low 
Levels: high low med
  • Its type is integer, not character. 😲 🙄
typeof(fac)  ## The type is integer.
[1] "integer"
str(fac)  ## The integers show the level each element in vector fac belongs to.
 Factor w/ 3 levels "high","low","med": 3 1 2
order_fac <- factor(c("med", "high", "low"), levels = c("low", "med", "high"))
str(order_fac)
 Factor w/ 3 levels "low","med","high": 2 3 1
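
To inspect a factor, levels() returns the labels and as.integer() returns the underlying integer codes. The sketch below also uses the ordered = TRUE argument of factor(), which is not shown in the slides, to make the levels comparable; the name ord is hypothetical.

```r
order_fac <- factor(c("med", "high", "low"), levels = c("low", "med", "high"))
levels(order_fac)       # "low" "med" "high"
as.integer(order_fac)   # 2 3 1

# An ordered factor supports comparisons between levels
ord <- factor(c("med", "high", "low"),
              levels = c("low", "med", "high"), ordered = TRUE)
ord < "high"            # TRUE FALSE TRUE
```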

Matrix

  • A matrix is a two-dimensional analog of a vector with attribute dim.
  • Use the matrix() function to create a matrix.
## Create a 3 by 2 matrix called mat
(mat <- matrix(data = 1:6, nrow = 3, ncol = 2)) 
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
dim(mat); nrow(mat); ncol(mat)
[1] 3 2
[1] 3
[1] 2
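
By default matrix() fills the values column by column, as the output above shows. A minimal sketch of the alternative, filling row by row with the byrow argument, follows; mat_col and mat_row are hypothetical names.

```r
mat_col <- matrix(data = 1:6, nrow = 3, ncol = 2)                # fill by column (default)
mat_row <- matrix(data = 1:6, nrow = 3, ncol = 2, byrow = TRUE)  # fill by row

mat_col[1, ]   # first row is 1 4
mat_row[1, ]   # first row is 1 2
```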

Subsetting a Matrix

  • Use the same indexing approach as vectors on rows and columns.
  • Use a comma , to separate the row and column indices.
  • mat[2, 2] extracts the element in the second row and second column.
mat
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
## all rows and 2nd column
## leave row index blank
## specify 2 in column index
mat[, 2]
[1] 4 5 6
## 2nd row and all columns
mat[2, ] 
[1] 2 5
## The 1st and 3rd rows and the 1st column
mat[c(1, 3), 1] 
[1] 1 3
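
Note that extracting a single row or column returns a plain vector, not a matrix. As a sketch of one way to keep the result two-dimensional, the drop = FALSE argument of the bracket operator can be used; the names col_vec and col_mat are hypothetical.

```r
mat <- matrix(data = 1:6, nrow = 3, ncol = 2)

col_vec <- mat[, 2]                 # a vector: 4 5 6
dim(col_vec)                        # NULL, since a vector has no dim attribute

col_mat <- mat[, 2, drop = FALSE]   # still a matrix with 3 rows and 1 column
dim(col_mat)                        # 3 1
```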

Data Frame: The Most Common Way of Storing Datasets

  • A data frame has type list: it is a list of equal-length vectors arranged in a 2-dimensional structure.

  • More general than matrix: Different columns can have different types.

  • Use data.frame(), which takes named vectors as input "elements".

## data frame w/ a dbl column named age and char columns gen and col.
(df <- data.frame(age = c(19, 21, 40), gen = c("m", "f", "m"), col = c("r","b","g")))
  age gen col
1  19   m   r
2  21   f   b
3  40   m   g
str(df)  ## str() lists each column after a $
'data.frame':   3 obs. of  3 variables:
 $ age: num  19 21 40
 $ gen: chr  "m" "f" "m"
 $ col: chr  "r" "b" "g"

What happens if we create a data frame without column names?
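
One way to explore this question is to pass unnamed vectors to data.frame() and inspect the result. This is a small sketch; df_unnamed is a hypothetical name. R builds the column names from the expressions themselves, which gives valid but hard-to-read names, so they are renamed afterwards with names().

```r
# No column names supplied
df_unnamed <- data.frame(c(19, 21, 40), c("m", "f", "m"))
names(df_unnamed)   # automatically generated, awkward names

# Assign readable names afterwards
names(df_unnamed) <- c("age", "gen")
df_unnamed
```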

Data Structure Comparison

https://environmentalcomputing.net/getting-started-with-r/data-types-structure/

Properties of Data Frames

A data frame has the properties of a matrix.

df
  age gen col
1  19   m   r
2  21   f   b
3  40   m   g
colnames(df)  ## df as a matrix
[1] "age" "gen" "col"
ncol(df) ## df as a matrix
[1] 3
dim(df) ## df as a matrix
[1] 3 3
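
Since a data frame is also a list of columns, list-style tools work on it as well. The sketch below rebuilds the same df and shows a few of them; it is an illustration, not an exhaustive list of properties.

```r
df <- data.frame(age = c(19, 21, 40), gen = c("m", "f", "m"), col = c("r", "b", "g"))

typeof(df)     # "list"
length(df)     # 3, the number of columns (list elements)
df$age         # extract one column by name: 19 21 40
df[["age"]]    # same column, list-style double brackets
mean(df$age)   # a column is an ordinary vector, so functions apply to it
```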

Subsetting a Data Frame

df
  age gen col
1  19   m   r
2  21   f   b
3  40   m   g
## Subset rows
df[c(1, 3), ]
  age gen col
1  19   m   r
3  40   m   g
## Subset columns
df[, c("age", "col")]
  age col
1  19   r
2  21   b
3  40   g
## select the row where age == 21
df[df$age == 21, ]
  age gen col
2  21   f   b
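
Row conditions can be combined with & (and) or | (or), and a column can be selected in the same call. The sketch below rebuilds df so that it runs on its own; older_m and ages_m are hypothetical names.

```r
df <- data.frame(age = c(19, 21, 40), gen = c("m", "f", "m"), col = c("r", "b", "g"))

## rows where age is above 20 AND gen is "m"
(older_m <- df[df$age > 20 & df$gen == "m", ])

## ages of the rows where gen is "m"
(ages_m <- df[df$gen == "m", "age"])   # 19 40
```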


  • Create a vector object called x that has 5 elements 3, 6, 2, 9, 14.
  • Compute the average of elements of x.
  • Subset the mtcars data set by selecting variables mpg and disp.
  • Select the cars (rows) in mtcars that have 4 cylinders.