class: title-slide <br> <br> .right-panel[ # Simple Linear Regression ## Created by Dr. Mine Dogucu and Presented by Dr. Jessica Jaynes ] --- class: middle ## Goals - Linear Regression with Numeric Predictors - Conditions for Linear Regression --- class: middle # Association Two variables measured on the same cases are _associated_ if knowing the value of one of the variables tells you something that you would not otherwise know about the value of the other variable. --- class: middle # Explanatory and Response Variables Is your purpose simply to explore the nature of the relationship, or do you hope to show that one of the variables can explain variation in the other? __Response variable__: measures an outcome of a study (independent variable) __Explanatory variable__: explains or causes changes in the response variable (dependent variable) --- class: middle # Scatterplot - Shows the relationship between two quantitative variables measured on the same individuals. - The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. - Each individual corresponds to one point on the graph. You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship. An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship. --- class: middle # Measuring Linear Association A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables. Linear relationships are important because a straight line is a simple pattern that is quite common. Our eyes are not always good judges of how strong a relationship is. Therefore, we use a numerical measure to supplement our scatterplot and help us interpret the strength of the linear relationship. The __correlation__ (r) measures the direction and strength of the linear relationship between two quantitative variables. --- class: middle # Properties of Correlation - r is always a number between –1 and 1. - r > 0 indicates a positive association. - r < 0 indicates a negative association. - Values of r near 0 indicate a very weak linear relationship. - The strength of the linear relationship increases as r moves away from 0 toward –1 or 1. - The extreme values r = –1 and r = 1 occur only in the case of a perfect linear relationship. - r has no units and does not change when we change the units of measurement of x, y, or both. - Correlation makes no distinction between explanatory and response variables. --- class: middle # Cautions with Correlation : Correlation requires that both variables be quantitative. Correlation does not describe curved relationships between variables, no matter how strong the relationship is. The correlation r is not resistant; it can be strongly affected by a few outlying observations. Correlation is not a complete summary of two-variable data. --- class: middle #### Data `babies` in `openintro` package ``` ## Rows: 1,236 ## Columns: 8 ## $ case <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1… ## $ bwt <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, … ## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2… ## $ parity <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ age <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, … ## $ height <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, … ## $ weight <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1… ## $ smoke <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, … ``` --- ## Baby Weights .pull-left[ ```r ggplot(babies, aes(x = gestation, y = bwt)) + geom_point() ``` ] .pull-right[ <img src="03a-simple-linreg_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] --- # Inference for Linear Regression When a scatterplot shows a linear relationship, we can use the least-squares line fitted to the data to predict y for a given value of x. If the data are a random sample from a larger population, we need statistical inference to answer questions like these: - Is there really a linear relationship between x and y in the population, or could the pattern we see in the scatterplot plausibly happen just by chance? - What is the slope (rate of change) that relates y to x in the population, including a margin of error for our estimate of the slope? --- ## Baby Weights .pull-left[ ```r ggplot(babies, aes(x = gestation, y = bwt)) + geom_point() + geom_smooth(method = "lm", se = FALSE) ``` `lm` stands for linear model `se` stands for standard error ] .pull-right[ <img src="03a-simple-linreg_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> ] --- class: middle <div align = "center"> | y | Response | Birth weight | Numeric | |---|-------------|-----------------|---------| | x | Explanatory | Gestation | Numeric | --- ## Linear Equations Review .pull-left[ Recall from your previous math classes `\(y = mx + b\)` where `\(m\)` is the slope and `\(b\)` is the y-intercept e.g. `\(y = 2x -1\)` ] -- .pull-right[ ![](03a-simple-linreg_files/figure-html/unnamed-chunk-8-1.png)<!-- --> Notice anything different between baby weights plot and this one? ] --- class: middle .pull-left[ **Math** class `\(y = b + mx\)` `\(b\)` is y-intercept `\(m\)` is slope ] .pull-left[ **Stats** class `\(y_i = \beta_0 +\beta_1x_i + \epsilon_i\)` `\(\beta_0\)` is y-intercept `\(\beta_1\)` is slope `\(\epsilon_i\)` is error/residual `\(i = 1, 2, ...n\)` identifier for each point ] --- class: middle ```r model_g <- lm(bwt ~ gestation, data = babies) ``` `lm` stands for linear model. We are fitting a linear regression model. Note that the variables are entered in y ~ x order. --- class: middle ```r summary(model_g) ``` ``` ## ## Call: ## lm(formula = bwt ~ gestation, data = babies) ## ## Residuals: ## Min 1Q Median 3Q Max ## -49.394 -11.125 0.071 10.106 57.353 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -10.06418 8.32220 -1.209 0.227 ## gestation 0.46426 0.02974 15.609 <0.0000000000000002 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 16.66 on 1221 degrees of freedom ## (13 observations deleted due to missingness) ## Multiple R-squared: 0.1663, Adjusted R-squared: 0.1657 ## F-statistic: 243.6 on 1 and 1221 DF, p-value: < 0.00000000000000022 ``` --- class: middle ```r broom::tidy(model_g) ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) -10.1 8.32 -1.21 2.27e- 1 ## 2 gestation 0.464 0.0297 15.6 3.22e-50 ``` -- `\(\hat {y}_i = b_0 + b_1 x_i\)` `\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ gestation}_i\)` `\(\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i\)` --- class: middle ## Expected bwt for a baby with 300 days of gestation `\(\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i\)` `\(\hat {\text{bwt}} = -10.1 + 0.464 \times 300\)` `\(\hat {\text{bwt}} =\)` 129.1 For a baby with 300 days of gestation the expected birth weight is 129.1 ounces. --- ## Interpretation of estimates .pull-left[ <img src="03a-simple-linreg_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> `\(b_1 = 0.464\)` which means for one unit(day) increase in gestation period the expected increase in birth weight is 0.464 ounces. ] -- .pull-right[ <img src="03a-simple-linreg_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> `\(b_0 = -10.1\)` which means for gestation period of 0 days the expected birth weight is -10.1 ounces!!!!!!!! (does NOT make sense) ] --- class: middle ## Extrapolation - There is no such thing as 0 days of gestation. -- - Birth weight cannot possibly be -10.1 ounces. -- - Extrapolation happens when we use a model outside the range of the x-values that are observed. After all, we cannot really know how the model behaves (e.g. may be non-linear) outside of the scope of what we have observed. --- ## Baby number 148 .pull-left[ ```r babies %>% filter(case == 148) %>% select(bwt, gestation) ``` ``` ## # A tibble: 1 × 2 ## bwt gestation ## <int> <int> ## 1 160 300 ``` ] .pull-right[ ![](03a-simple-linreg_files/figure-html/unnamed-chunk-15-1.png)<!-- --> ] --- class: middle ## Baby #148 .pull-left[ **Expected** `\(\hat y_{148} = b_0 +b_1x_{148}\)` `\(\hat y_{148} = -10.1 + 0.464\times300\)` `\(\hat y_{148}\)` = 129.1 ] .pull-left[ **Observed** `\(y_{148} =\)` 160 ] --- ## Residual for `i = 148` .pull-left[ <img src="03a-simple-linreg_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> ] .pull-right[ `\(y_{148} = 160\)` <hr> `\(\hat y_{148}\)` = 129.1 <hr> `\(e_{148} = y_{148} - \hat y_{148}\)` `\(e_{148} =\)` 30.9 ] --- ## Least Squares Regression The goal is to minimize `$$e_1^2 + e_2^2 + ... + e_n^2$$` -- which can be rewritten as `$$\sum_{i = 1}^n e_i^2$$` --- ## Conditions for Least Squares Regression - Linearity - Normality of Residuals - Constant Variance - Independence --- .pull-left[ .center[**Linear**] ![](03a-simple-linreg_files/figure-html/unnamed-chunk-17-1.png)<!-- --> ] .pull-right[ .center[**Non-linear**] ![](03a-simple-linreg_files/figure-html/unnamed-chunk-18-1.png)<!-- --> ] --- .pull-left[ .center[**Nearly normal**] ![](03a-simple-linreg_files/figure-html/unnamed-chunk-19-1.png)<!-- --> ] .pull-right[ .center[**Not normal**] ![](03a-simple-linreg_files/figure-html/unnamed-chunk-20-1.png)<!-- --> ] --- .pull-left[ .center[**Constant Variance**] ![](03a-simple-linreg_files/figure-html/unnamed-chunk-21-1.png)<!-- --> ] .pull-right[ .center[**Non-constant variance**] ![](03a-simple-linreg_files/figure-html/unnamed-chunk-22-1.png)<!-- --> ] --- class: middle ## Independence Harder to check because we need to know how the data were collected. -- In the description of the dataset it says _[a study]considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area._ -- It is possible that babies born in the same hospital may have similar birth weight. -- Correlated data examples: patients within hospitals, students within schools, people within neighborhoods, time-series data. --- ## Conditions for Least Squares Regression ```r plot(model_g, 1:2) ``` ![](03a-simple-linreg_files/figure-html/unnamed-chunk-23-1.png)<!-- -->![](03a-simple-linreg_files/figure-html/unnamed-chunk-23-2.png)<!-- --> --- class: middle ### Inference: Confidence Interval (theoretical) ```r confint(model_g) ``` ``` ## 2.5 % 97.5 % ## (Intercept) -26.3915884 6.2632199 ## gestation 0.4059083 0.5226169 ``` Note that the 95% confidence interval for the slope does not contain zero and all the values in the interval are positive indicating a significant positive relationship between gestation and birth weight. --- # Coefficient of Determination: Connection between correlation and regression The square of the correlation, `\(R^2\)`, is the fraction of the variation in values of y that is explained by the least-squares regression of y on x. `\(R^2\)` is called the coefficient of determination. --- # Correlation Does Not Imply Causation - Even when direct causation is present, it may not be the whole explanation for a correlation. - Always worry about lurking variables and try to anticipate lurking variables and measure them - Do an experiment in which we change x and keep lurking variables under control.