Changing Variables

<br>
<br>
.right-panel[

# Changing Variables
## Dr. Mine Dogucu
]

---

How many panes have we seen in RStudio and what is the purpose of each pane?

---

Which of the following files is a markdown file?

a. `example.R`  
b. `example.md`  
c. `example.Rmd`

---

Which of the following is a valid order of actions when starting a project using git and GitHub?

a. clone, commit, push

b. push, commit, clone

c. commit, clone, push

d. clone, push, commit

---

Which R functions have we learned together?

---

What is the formula for variance?

---

You are given a data frame called `registrar`. You are interested in two variables: `class_year`, which records whether someone is a first year, sophomore, junior, or senior, and `gpa`, which records GPA.

How would you find the average GPA for each class year?

---

```r
glimpse(lapd)
```

```
## Rows: 68,564
## Columns: 3
## $ Year              <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
## $ `Employment Type` <chr> "Full Time", "Full Time", "Full Time", "Full Time", …
## $ `Base Pay`        <dbl> 105764.79, 47662.40, 101287.85, 118086.71, 90321.86,…
```

---

`clean_names()` from the janitor package rewrites variable names in tidy style (lowercase words separated by underscores).

```r
lapd <- clean_names(lapd)
glimpse(lapd)
```

```
## Rows: 68,564
## Columns: 3
## $ year            <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ employment_type <chr> "Full Time", "Full Time", "Full Time", "Full Time", "F…
## $ base_pay        <dbl> 105764.79, 47662.40, 101287.85, 118086.71, 90321.86, 6…
```
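
---

To see the renaming on its own, here is a minimal sketch on a made-up tibble. The object `toy` and its values are hypothetical and only mimic the column names of `lapd`.

```r
library(janitor)
library(tibble)

# Hypothetical toy data with the same untidy column names as lapd
toy <- tibble(
  Year = c(2013, 2014),
  `Employment Type` = c("Full Time", "Part Time"),
  `Base Pay` = c(50000, 20000)
)

# clean_names() returns a copy of the data frame with tidy names
toy_clean <- clean_names(toy)
names(toy_clean)
```

The values are untouched; only the column names change.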

---

**Goal**:

Create a new variable called `base_pay_k` that represents `base_pay` in thousands of dollars.

---

```r
lapd %>% 
  mutate(base_pay_k = base_pay/1000)
```

```
## # A tibble: 68,564 × 4
##     year employment_type base_pay base_pay_k
##    <dbl> <chr>              <dbl>      <dbl>
##  1  2013 Full Time        105765.      106. 
##  2  2013 Full Time         47662.       47.7
##  3  2013 Full Time        101288.      101. 
##  4  2013 Full Time        118087.      118. 
##  5  2013 Full Time         90322.       90.3
##  6  2013 Full Time         62770.       62.8
##  7  2013 Full Time         93718.       93.7
##  8  2013 Full Time             0         0  
##  9  2013 Full Time         51246.       51.2
## 10  2013 Full Time         74227.       74.2
## # … with 68,554 more rows
```
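
---

Note that the code above only prints the result. A small sketch, using a hypothetical tibble `toy` rather than `lapd`, of the difference between printing and saving the new variable:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (values are for illustration only)
toy <- tibble(base_pay = c(105764.79, 47662.40, 0))

# Without assignment, mutate() prints a new tibble and leaves toy as it was
toy %>% 
  mutate(base_pay_k = base_pay / 1000)
names_before <- names(toy)

# Assigning the result keeps the new variable
toy <- toy %>% 
  mutate(base_pay_k = base_pay / 1000)
names_after <- names(toy)

names_before
names_after
```

`names_before` holds only `base_pay`, while `names_after` also includes `base_pay_k`.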

---

```r
glimpse(lapd)
```

**Goal**:

Create a new variable called `base_pay_level` with the levels `Less than 0`, `No Income`, `Less than Median, Greater than 0`, and `Greater than Median`. We will use $62474 as the median (from the previous lecture).

---

```r
lapd %>% 
  mutate(base_pay_level = case_when(
    base_pay < 0 ~ "Less than 0", 
    base_pay == 0 ~ "No Income",
    base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
    base_pay > 62474 ~ "Greater than Median")) 
```

```
## # A tibble: 68,564 × 4
##     year employment_type base_pay base_pay_level                  
##    <dbl> <chr>              <dbl> <chr>                           
##  1  2013 Full Time        105765. Greater than Median             
##  2  2013 Full Time         47662. Less than Median, Greater than 0
##  3  2013 Full Time        101288. Greater than Median             
##  4  2013 Full Time        118087. Greater than Median             
##  5  2013 Full Time         90322. Greater than Median             
##  6  2013 Full Time         62770. Greater than Median             
##  7  2013 Full Time         93718. Greater than Median             
##  8  2013 Full Time             0  No Income                       
##  9  2013 Full Time         51246. Less than Median, Greater than 0
## 10  2013 Full Time         74227. Greater than Median             
## # … with 68,554 more rows
```
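
---

One thing worth checking with `case_when()`: a row that matches none of the conditions gets `NA`. The sketch below uses a hypothetical tibble `toy` with one pay value exactly equal to the median. Whether `lapd` contains such a row is not something these slides show, so treat this as an illustration only.

```r
library(dplyr)
library(tibble)

median_pay <- 62474  # median used in the slides

# Hypothetical toy data: one value per case, plus one exactly at the median
toy <- tibble(base_pay = c(-100, 0, 50000, 62474, 70000))

toy_levels <- toy %>% 
  mutate(base_pay_level = case_when(
    base_pay < 0 ~ "Less than 0",
    base_pay == 0 ~ "No Income",
    base_pay < median_pay & base_pay > 0 ~ "Less than Median, Greater than 0",
    base_pay > median_pay ~ "Greater than Median"))

toy_levels

# Count the rows that matched no condition
n_unmatched <- sum(is.na(toy_levels$base_pay_level))
n_unmatched
```

Here the row with `base_pay` equal to 62474 is the only one left as `NA`, because no condition covers a value equal to the median.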

---

## (Some) Variable Types in R

`character`: takes string values (e.g. a person's name, address)  
`integer`: whole numbers (stored in 32 bits)  
`double`: floating-point numbers (double precision)  
`numeric`: integer or double  
`factor`: categorical variables with different levels  
`logical`: TRUE (1), FALSE (0)
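
---

A quick way to check these types is with `class()`, `typeof()`, and `is.numeric()` from base R. The values below are made up for illustration.

```r
# Made-up example values, one per type
type_chr <- class("Full Time")                         # character
type_int <- typeof(2013L)                              # integer (note the L suffix)
type_dbl <- typeof(2013)                               # double (the default for numbers)
type_fct <- class(factor(c("Full Time", "Part Time"))) # factor
type_lgl <- typeof(TRUE)                               # logical

# Both integer and double count as numeric
both_numeric <- is.numeric(2013L) && is.numeric(2013)

# Logical values behave like 1 and 0 in arithmetic
true_sum <- TRUE + TRUE + FALSE

c(type_chr, type_int, type_dbl, type_fct, type_lgl)
both_numeric
true_sum
```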

---

```r
glimpse(lapd)
```

**Goal**:

Change `year` and `employment_type` to appropriate variable types.

---

```r
lapd %>% 
  mutate(employment_type = as.factor(employment_type),
         year = as.integer(year)) 
```

```
## # A tibble: 68,564 × 3
##     year employment_type base_pay
##    <int> <fct>              <dbl>
##  1  2013 Full Time        105765.
##  2  2013 Full Time         47662.
##  3  2013 Full Time        101288.
##  4  2013 Full Time        118087.
##  5  2013 Full Time         90322.
##  6  2013 Full Time         62770.
##  7  2013 Full Time         93718.
##  8  2013 Full Time             0 
##  9  2013 Full Time         51246.
## 10  2013 Full Time         74227.
## # … with 68,554 more rows
```

---

`as.factor()` - converts a vector to a factor  
`as.numeric()` - converts a vector to numeric  
`as.integer()` - converts a vector to integer  
`as.double()` - converts a vector to double  
`as.character()` - converts a vector to character
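
---

A short sketch of these conversions on made-up vectors (the object names are hypothetical):

```r
# Years stored as text, converted to integer
year_chr <- c("2013", "2014")
year_int <- as.integer(year_chr)

# A number converted to character
year_back <- as.character(2013)

# An integer converted to double
five_dbl <- as.double(5L)

# Text converted to a factor; the levels are the distinct values
emp <- as.factor(c("Full Time", "Part Time", "Full Time"))
emp_levels <- levels(emp)

typeof(year_int)
year_back
typeof(five_dbl)
emp_levels
```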

---

In your lecture notes, you can make all the changes from this lecture in one chain of piped code. That's the beauty of piping!

```r
lapd <- 
  lapd %>% 
  clean_names() %>% 
  mutate(employment_type = as.factor(employment_type),
         year = as.integer(year),
         base_pay_level = case_when(
           base_pay < 0 ~ "Less than 0",
           base_pay == 0 ~ "No Income",
           base_pay < 62474 & base_pay > 0 ~ "Less than Median, Greater than 0",
           base_pay > 62474 ~ "Greater than Median"))
```

---

## Word of caution

The functions `clean_names()` and `mutate()` both take a data frame as their first argument. Even though we do not see it, the pipe passes the data frame from the previous step into each function. 
When we use these functions without `%>%`, we have to supply the data frame explicitly.

```r
clean_names(lapd)
```

```
## # A tibble: 68,564 × 3
##     year employment_type base_pay
##    <dbl> <chr>              <dbl>
##  1  2013 Full Time        105765.
##  2  2013 Full Time         47662.
##  3  2013 Full Time        101288.
##  4  2013 Full Time        118087.
##  5  2013 Full Time         90322.
##  6  2013 Full Time         62770.
##  7  2013 Full Time         93718.
##  8  2013 Full Time             0 
##  9  2013 Full Time         51246.
## 10  2013 Full Time         74227.
## # … with 68,554 more rows
```

```r
lapd %>% 
  clean_names()
```
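
---

The same holds for `mutate()`. A minimal sketch on a hypothetical tibble `toy`, comparing the piped call with the explicit one:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (values are for illustration only)
toy <- tibble(base_pay = c(105764.79, 47662.40))

# Piped: the data frame is passed in as the first argument
piped <- toy %>% 
  mutate(base_pay_k = base_pay / 1000)

# Explicit: the data frame is written as the first argument
explicit <- mutate(toy, base_pay_k = base_pay / 1000)

# Check whether the two results are the same
identical(piped, explicit)
```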