class: title-slide

<br>
<br>

.right-panel[

# Multiple Linear Regression

## Created by Dr. Mine Dogucu and Presented by Dr. Jessica Jaynes

]

---

class: middle

<div align = "center">

| y | Response    | Birth weight | Numeric     |
|---|-------------|--------------|-------------|
| x | Explanatory | Smoke        | Categorical |

</div>

---

## Notation

`\(y_i = \beta_0 +\beta_1x_i + \epsilon_i\)`

`\(\beta_0\)` is the y-intercept

`\(\beta_1\)` is the slope

`\(\epsilon_i\)` is the error/residual

`\(i = 1, 2, \ldots, n\)` is the identifier for each point

---

```r
model_s <- lm(bwt ~ smoke, data = babies)
tidy(model_s)
```

```
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   123.       0.649    190.   0
## 2 smoke          -8.94     1.03      -8.65 1.55e-17
```

--

`\(\hat {y}_i = b_0 + b_1 x_i\)`

`\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ smoke}_i\)`

`\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)`

---

class: middle

.pull-left[

#### Expected bwt for a baby with a non-smoker mother

`\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)`

`\(\hat {\text{bwt}_i} = 123 + (-8.94\times 0)\)`

`\(\hat {\text{bwt}_i} = 123\)`

`\(E[\text{bwt}_i | \text{smoke}_i = 0] = b_0\)`

]

--

.pull-right[

#### Expected bwt for a baby with a smoker mother

`\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)`

`\(\hat {\text{bwt}_i} = 123 + (-8.94\times 1)\)`

`\(\hat {\text{bwt}_i} = 114.06\)`

`\(E[\text{bwt}_i | \text{smoke}_i = 1] = b_0 + b_1\)`

]

---

```r
confint(model_s)
```

```
##                 2.5 %     97.5 %
## (Intercept) 121.77391 124.320430
## smoke       -10.96413  -6.911199
```

Note that the confidence interval for the "slope" does not contain 0, and all the values in the interval are negative: we are 95% confident that babies of smokers have a lower average birth weight than babies of non-smokers.
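---

The two expected birth weights above can be checked with a few lines of arithmetic. Below is a small sketch using only the rounded coefficients from the `tidy()` output; the commented-out `predict()` call assumes `model_s` has been fitted as above:

```r
# Reproduce the expected birth weights from the rounded coefficients
# b0 = 123 and b1 = -8.94 reported by tidy(model_s).
b0 <- 123
b1 <- -8.94

bwt_nonsmoker <- b0 + b1 * 0   # 123
bwt_smoker    <- b0 + b1 * 1   # 114.06

# With the model fitted, predict() returns the same values
# (up to rounding of the coefficients):
# predict(model_s, newdata = data.frame(smoke = c(0, 1)))
```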
---

#### Data

`babies` in the `openintro` package

```
## Rows: 1,236
## Columns: 8
## $ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
## $ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
## $ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
## $ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
## $ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, …
```

---

class: middle

<div align = "center">

| y         | Response    | Birth weight | Numeric     |
|-----------|-------------|--------------|-------------|
| `\(x_1\)` | Explanatory | Gestation    | Numeric     |
| `\(x_2\)` | Explanatory | Smoke        | Categorical |

</div>

---

## Notation

`\(y_i = \beta_0 +\beta_1x_{1i} + \beta_2x_{2i} + \epsilon_i\)`

`\(\beta_0\)` is the intercept

`\(\beta_1\)` is the slope for gestation

`\(\beta_2\)` is the slope for smoke

`\(\epsilon_i\)` is the error/residual

`\(i = 1, 2, \ldots, n\)` is the identifier for each point

---

```r
model_gs <- lm(bwt ~ gestation + smoke, data = babies)
tidy(model_gs)
```

```
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.932    8.15      -0.114 9.09e- 1
## 2 gestation      0.443    0.0290    15.3   3.16e-48
## 3 smoke         -8.09     0.953     -8.49  5.96e-17
```

--

Expected birth weight for a baby who had 280 days of gestation with a smoker mother:

`\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ gestation}_i + b_2 \text{ smoke}_i\)`

`\(\hat {\text{bwt}_i} = -0.932 + (0.443 \times 280) + (-8.09 \times 1)\)`

`\(\hat {\text{bwt}_i} \approx 115.02\)`

---

```r
babies %>% 
  modelr::add_predictions(model_gs) %>% 
  select(bwt, gestation, smoke, pred)
```

```
## # A tibble: 1,236 × 4
##      bwt gestation smoke  pred
##    <int>     <int> <int> <dbl>
##  1   120       284     0  125.
##  2   113       282     0  124.
##  3   128       279     1  115.
##  4   123        NA     0   NA 
##  5   108       282     1  116.
##  6   136       286     0  126.
##  7   138       244     0  107.
##  8   132       245     0  108.
##  9   120       289     0  127.
## 10   143       299     1  123.
## # … with 1,226 more rows
```

---

```r
babies %>% 
  modelr::add_predictions(model_gs) %>% 
  modelr::add_residuals(model_gs) %>% 
  select(bwt, gestation, smoke, pred, resid)
```

```
## # A tibble: 1,236 × 5
##      bwt gestation smoke  pred  resid
##    <int>     <int> <int> <dbl>  <dbl>
##  1   120       284     0  125.  -4.84
##  2   113       282     0  124. -11.0 
##  3   128       279     1  115.  13.5 
##  4   123        NA     0   NA   NA   
##  5   108       282     1  116.  -7.87
##  6   136       286     0  126.  10.3 
##  7   138       244     0  107.  30.9 
##  8   132       245     0  108.  24.4 
##  9   120       289     0  127.  -7.05
## 10   143       299     1  123.  19.6 
## # … with 1,226 more rows
```

---

# Collinearity between explanatory variables

- Two predictor variables are collinear when they are correlated, and collinearity complicates model estimation.
- Predictors are also called explanatory or independent variables, and we want them to be independent of each other.
- We do not want to add predictors that are associated with each other to the model, because these additional variables do not enhance the model.
- Instead, we prefer the simplest best model, i.e., a parsimonious model.
- While it is impossible to prevent collinearity from arising in observational data, experiments are usually designed to prevent correlation among predictors.

---

Great example of simple linear regression in action!

https://www.tandfonline.com/doi/full/10.1080/26939169.2021.1946450
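---

To make the collinearity point concrete, here is a hedged sketch on simulated data (base R only; `x1`, `x2`, and `y` are made up for this illustration, not taken from `babies`): when one predictor nearly duplicates another, the coefficient standard errors inflate, even though each predictor looks fine on its own.

```r
# Simulated illustration of collinearity.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is nearly a copy of x1
y  <- 2 * x1 + rnorm(n)

cor(x1, x2)   # close to 1: x1 and x2 are collinear

# The standard error on x1 is far larger when the collinear x2
# is also in the model:
se_both  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
c(se_both = se_both, se_alone = se_alone)
```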