class: title-slide

<br>
<br>
.right-panel[

# Changing Variables
## Dr. Mine Dogucu

]

---
class: center middle

.font150[Review]

---
class: middle

How many panes have we seen in RStudio, and what is the purpose of each pane?

---
class: middle

Which of the following files is a markdown file?

a. `example.R`

b. `example.md`

c. `example.Rmd`

---
class: middle

Which of the following is a valid order of actions when starting a project using git and GitHub?

a. clone, commit, push

b. push, commit, clone

c. commit, clone, push

d. clone, push, commit

---
class: middle

Which R functions have we learned together?

---
class: middle

What is the formula for variance?

---
class: middle

You are given a data frame called `registrar`. You are interested in two variables: `class_year`, which represents whether someone is a first year, sophomore, junior, or senior, and `gpa`, which represents GPA. How would you find the average GPA for each class year?

---
class: center middle

.font150[Changing Variables]

---
class: middle

```r
glimpse(lapd)
```

```
## Rows: 68,564
## Columns: 3
## $ Year              <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
## $ `Employment Type` <chr> "Full Time", "Full Time", "Full Time", "Full Time", …
## $ `Base Pay`        <dbl> 105764.79, 47662.40, 101287.85, 118086.71, 90321.86,…
```

---
class: middle

`clean_names()`, from the janitor package, converts variable names to tidy style (snake_case).

```r
lapd <- clean_names(lapd)
glimpse(lapd)
```

```
## Rows: 68,564
## Columns: 3
## $ year            <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ employment_type <chr> "Full Time", "Full Time", "Full Time", "Full Time", "F…
## $ base_pay        <dbl> 105764.79, 47662.40, 101287.85, 118086.71, 90321.86, 6…
```

---
class: middle

**Goal**: Create a new variable called `base_pay_k` that represents `base_pay` in thousands of dollars.
---
class: middle

```r
lapd %>% 
  mutate(base_pay_k = base_pay / 1000)
```

```
## # A tibble: 68,564 × 4
##     year employment_type base_pay base_pay_k
##    <dbl> <chr>              <dbl>      <dbl>
##  1  2013 Full Time        105765.      106.
##  2  2013 Full Time         47662.       47.7
##  3  2013 Full Time        101288.      101.
##  4  2013 Full Time        118087.      118.
##  5  2013 Full Time         90322.       90.3
##  6  2013 Full Time         62770.       62.8
##  7  2013 Full Time         93718.       93.7
##  8  2013 Full Time             0         0
##  9  2013 Full Time         51246.       51.2
## 10  2013 Full Time         74227.       74.2
## # … with 68,554 more rows
```

---
class: middle

```r
glimpse(lapd)
```

```
## Rows: 68,564
## Columns: 3
## $ year            <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ employment_type <chr> "Full Time", "Full Time", "Full Time", "Full Time", "F…
## $ base_pay        <dbl> 105764.79, 47662.40, 101287.85, 118086.71, 90321.86, 6…
```

**Goal**: Create a new variable called `base_pay_level` with the levels `Less than 0`, `No Income`, `Less than Median, Greater than 0`, and `Greater than Median`. We will use $62474 as the median (from the previous lecture).

---

```r
lapd %>% 
  mutate(base_pay_level = case_when(
    base_pay < 0 ~ "Less than 0",
    base_pay == 0 ~ "No Income",
    base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
    # a base_pay exactly equal to 62474 matches no condition and becomes NA
    base_pay > 62474 ~ "Greater than Median"))
```

```
## # A tibble: 68,564 × 4
##     year employment_type base_pay base_pay_level
##    <dbl> <chr>              <dbl> <chr>
##  1  2013 Full Time        105765. Greater than Median
##  2  2013 Full Time         47662. Less than Median, Greater than 0
##  3  2013 Full Time        101288. Greater than Median
##  4  2013 Full Time        118087. Greater than Median
##  5  2013 Full Time         90322. Greater than Median
##  6  2013 Full Time         62770. Greater than Median
##  7  2013 Full Time         93718. Greater than Median
##  8  2013 Full Time             0  No Income
##  9  2013 Full Time         51246. Less than Median, Greater than 0
## 10  2013 Full Time         74227. Greater than Median
## # … with 68,554 more rows
```

---
class: middle

## (Some) Variable Types in R

`character`: takes string values (e.g.
a person's name, address)

`integer`: whole numbers (stored as 32-bit integers)

`double`: floating-point numbers (double precision)

`numeric`: integer or double

`factor`: categorical variables with different levels

`logical`: TRUE (1), FALSE (0)

---
class: middle

```r
glimpse(lapd)
```

```
## Rows: 68,564
## Columns: 3
## $ year            <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ employment_type <chr> "Full Time", "Full Time", "Full Time", "Full Time", "F…
## $ base_pay        <dbl> 105764.79, 47662.40, 101287.85, 118086.71, 90321.86, 6…
```

**Goal**: Change `year` and `employment_type` to appropriate variable types.

---
class: middle

```r
lapd %>% 
  mutate(employment_type = as.factor(employment_type),
         year = as.integer(year))
```

```
## # A tibble: 68,564 × 3
##     year employment_type base_pay
##    <int> <fct>              <dbl>
##  1  2013 Full Time        105765.
##  2  2013 Full Time         47662.
##  3  2013 Full Time        101288.
##  4  2013 Full Time        118087.
##  5  2013 Full Time         90322.
##  6  2013 Full Time         62770.
##  7  2013 Full Time         93718.
##  8  2013 Full Time             0
##  9  2013 Full Time         51246.
## 10  2013 Full Time         74227.
## # … with 68,554 more rows
```

---
class: middle

`as.factor()` - converts a vector to a factor

`as.numeric()` - converts a vector to numeric

`as.integer()` - converts a vector to integer

`as.double()` - converts a vector to double

`as.character()` - converts a vector to character

---

In your lecture notes, you can make all the changes from this lecture in one long chain of piped code. That's the beauty of piping!

```r
lapd <- lapd %>% 
  clean_names() %>% 
  mutate(employment_type = as.factor(employment_type),
         year = as.integer(year),
         base_pay_level = case_when(
           base_pay < 0 ~ "Less than 0",
           base_pay == 0 ~ "No Income",
           base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
           # a base_pay exactly equal to 62474 matches no condition and becomes NA
           base_pay > 62474 ~ "Greater than Median"))
```

---
class: middle

## Word of caution

The functions `clean_names()` and `mutate()` both take a data frame as their first argument.
Even though we do not see it, at each step the data frame is piped in from the previous step. When we use these functions without `%>%`, we have to include the data frame explicitly.

.pull-left[

Data frame is used as the first argument

```r
clean_names(lapd)
```

```
## # A tibble: 68,564 × 3
##     year employment_type base_pay
##    <dbl> <chr>              <dbl>
##  1  2013 Full Time        105765.
##  2  2013 Full Time         47662.
##  3  2013 Full Time        101288.
##  4  2013 Full Time        118087.
##  5  2013 Full Time         90322.
##  6  2013 Full Time         62770.
##  7  2013 Full Time         93718.
##  8  2013 Full Time             0
##  9  2013 Full Time         51246.
## 10  2013 Full Time         74227.
## # … with 68,554 more rows
```

]

.pull-right[

Data frame is piped

```r
lapd %>% 
  clean_names()
```

```
## # A tibble: 68,564 × 3
##     year employment_type base_pay
##    <dbl> <chr>              <dbl>
##  1  2013 Full Time        105765.
##  2  2013 Full Time         47662.
##  3  2013 Full Time        101288.
##  4  2013 Full Time        118087.
##  5  2013 Full Time         90322.
##  6  2013 Full Time         62770.
##  7  2013 Full Time         93718.
##  8  2013 Full Time             0
##  9  2013 Full Time         51246.
## 10  2013 Full Time         74227.
## # … with 68,554 more rows
```

]
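
---
class: middle

The same equivalence holds for `mutate()`. The sketch below is a minimal illustration on a small made-up data frame; the name `pay` and its values are assumptions for this example, not the `lapd` data. It passes the data frame explicitly in one call and pipes it in the other, then checks that both calls give the same result.

```r
library(dplyr)

# Hypothetical toy data frame (not the lapd data)
pay <- tibble(
  employee = c("a", "b", "c"),
  base_pay = c(105764.79, 47662.40, 0)
)

# Data frame passed explicitly as the first argument
explicit <- mutate(pay, base_pay_k = base_pay / 1000)

# Data frame piped in: %>% supplies it as the first argument
piped <- pay %>% 
  mutate(base_pay_k = base_pay / 1000)

# Compare the two results
identical(explicit, piped)
```

Either form works. Piping tends to read more easily once several steps are chained together.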