class: title-slide

<br>
<br>

.right-panel[

# Multiple Linear Regression

## Created by Dr. Mine Dogucu and Presented by Dr. Jessica Jaynes

]

---

class: middle

<div align = "center">

| y | Response    | Birth weight | Numeric     |
|---|-------------|--------------|-------------|
| x | Explanatory | Smoke        | Categorical |

</div>

---

## Notation

`\(y_i = \beta_0 +\beta_1x_i + \epsilon_i\)`

`\(\beta_0\)` is the y-intercept

`\(\beta_1\)` is the slope

`\(\epsilon_i\)` is the error/residual

`\(i = 1, 2, \ldots, n\)` is the identifier for each point

---

```r
model_s <- lm(bwt ~ smoke, data = babies)
tidy(model_s)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   123.       0.649    190.   0
## 2 smoke          -8.94     1.03      -8.65 1.55e-17
```

--

`\(\hat {y}_i = b_0 + b_1 x_i\)`

`\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ smoke}_i\)`

`\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)`

---

class: middle

.pull-left[

#### Expected bwt for a baby with a non-smoker mother

`\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)`

`\(\hat {\text{bwt}_i} = 123 + (-8.94\times 0)\)`

`\(\hat {\text{bwt}_i} = 123\)`

`\(E[\text{bwt}_i | \text{smoke}_i = 0] = b_0\)`

]

--

.pull-right[

#### Expected bwt for a baby with a smoker mother

`\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)`

`\(\hat {\text{bwt}_i} = 123 + (-8.94\times 1)\)`

`\(\hat {\text{bwt}_i} = 114.06\)`

`\(E[\text{bwt}_i | \text{smoke}_i = 1] = b_0 + b_1\)`

]

---

```r
confint(model_s)
```

```
##                 2.5 %     97.5 %
## (Intercept) 121.77391 124.320430
## smoke       -10.96413  -6.911199
```

Note that the confidence interval for the "slope" does not contain 0, and all the values in the interval are negative: we are 95% confident that babies of smokers have a lower average birth weight than babies of non-smokers.
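---

The two expected birth weights above can be checked with a few lines of arithmetic. Below is a small sketch using only the rounded coefficients from the `tidy()` output; the commented-out `predict()` call assumes `model_s` has been fitted as above:

```r
# Reproduce the expected birth weights from the rounded coefficients
# b0 = 123 and b1 = -8.94 reported by tidy(model_s).
b0 <- 123
b1 <- -8.94

bwt_nonsmoker <- b0 + b1 * 0   # 123
bwt_smoker    <- b0 + b1 * 1   # 114.06

# With the model fitted, predict() returns the same values
# (up to rounding of the coefficients):
# predict(model_s, newdata = data.frame(smoke = c(0, 1)))
```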
---

#### Data

`babies` in the `openintro` package

```
## Rows: 1,236
## Columns: 8
## $ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
## $ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
## $ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
## $ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, …
```

---

class: middle

<div align = "center">

| y         | Response    | Birth weight | Numeric     |
|-----------|-------------|--------------|-------------|
| `\(x_1\)` | Explanatory | Gestation    | Numeric     |
| `\(x_2\)` | Explanatory | Smoke        | Categorical |

</div>

---

## Notation

`\(y_i = \beta_0 +\beta_1x_{1i} + \beta_2x_{2i} + \epsilon_i\)`

`\(\beta_0\)` is the intercept

`\(\beta_1\)` is the slope for gestation

`\(\beta_2\)` is the slope for smoke

`\(\epsilon_i\)` is the error/residual

`\(i = 1, 2, \ldots, n\)` is the identifier for each point

---

```r
model_gs <- lm(bwt ~ gestation + smoke, data = babies)
tidy(model_gs)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.932    8.15      -0.114 9.09e- 1
## 2 gestation      0.443    0.0290    15.3   3.16e-48
## 3 smoke         -8.09     0.953     -8.49  5.96e-17
```

--

Expected birth weight for a baby who had 280 days of gestation with a smoker mother:

`\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ gestation}_i + b_2 \text{ smoke}_i\)`

`\(\hat {\text{bwt}_i} = -0.932 + (0.443 \times 280) + (-8.09 \times 1)\)`

`\(\hat {\text{bwt}_i} \approx 115.02\)`

---

```r
babies %>% 
  modelr::add_predictions(model_gs) %>% 
  select(bwt, gestation, smoke, pred)
```

```
## # A tibble: 1,236 × 4
##      bwt gestation smoke  pred
##    <int>     <int> <int> <dbl>
##  1   120       284     0  125.
##  2   113       282     0  124.
##  3   128       279     1  115.
##  4   123        NA     0   NA 
##  5   108       282     1  116.
##  6   136       286     0  126.
##  7   138       244     0  107.
##  8   132       245     0  108.
##  9   120       289     0  127.
## 10   143       299     1  123.
## # … with 1,226 more rows
```

---

```r
babies %>% 
  modelr::add_predictions(model_gs) %>% 
  modelr::add_residuals(model_gs) %>% 
  select(bwt, gestation, smoke, pred, resid)
```

```
## # A tibble: 1,236 × 5
##      bwt gestation smoke  pred  resid
##    <int>     <int> <int> <dbl>  <dbl>
##  1   120       284     0  125.  -4.84
##  2   113       282     0  124. -11.0 
##  3   128       279     1  115.  13.5 
##  4   123        NA     0   NA   NA   
##  5   108       282     1  116.  -7.87
##  6   136       286     0  126.  10.3 
##  7   138       244     0  107.  30.9 
##  8   132       245     0  108.  24.4 
##  9   120       289     0  127.  -7.05
## 10   143       299     1  123.  19.6 
## # … with 1,226 more rows
```

---

# Collinearity between explanatory variables

- Two predictor variables are collinear when they are correlated, and collinearity complicates model estimation.
- Predictors are also called explanatory or independent variables, and we want them to be independent of each other.
- We do not want to add predictors that are associated with each other to the model, because these additional variables do not enhance the model.
- Instead, we prefer the simplest best model, i.e., a parsimonious model.
- While it is impossible to prevent collinearity from arising in observational data, experiments are usually designed to prevent correlation among predictors.

---

Great example of simple linear regression in action!

https://www.tandfonline.com/doi/full/10.1080/26939169.2021.1946450
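---

To make the collinearity point concrete, here is a hedged sketch on simulated data (base R only; `x1`, `x2`, and `y` are made up for this illustration, not taken from `babies`): when one predictor nearly duplicates another, the coefficient standard errors inflate, even though each predictor looks fine on its own.

```r
# Simulated illustration of collinearity.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is nearly a copy of x1
y  <- 2 * x1 + rnorm(n)

cor(x1, x2)   # close to 1: x1 and x2 are collinear

# The standard error on x1 is far larger when the collinear x2
# is also in the model:
se_both  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
c(se_both = se_both, se_alone = se_alone)
```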