Simple linear regression model
Assumptions in OLS:
1 - Errors have mean equal to zero.
2 - Errors are independent.
3 - Errors have equal or constant variance (Homoscedasticity or Homogeneity of variance).
4 - Errors follow a normal distribution.
# We want to study the relationship between academic performance api00
# and number of students enrolled in schools,
#scatter plot
plot(api00 ~ enroll, data = dat,
     main = "Scatter plot of school performance vs. number of students enrolled",
     ylab = "api00 (performance score in 2000)",
     xlab = "enroll", xlim = c(0, 1000))
# fit a simple linear regression of api00 on enroll
m1 <- lm(api00 ~ enroll, data = dat)
summary(m1)
Call:
lm(formula = api00 ~ enroll, data = dat)
Residuals:
Min 1Q Median 3Q Max
-285.50 -112.55 -6.70 95.06 389.15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
enroll -0.19987 0.02985 -6.695 7.34e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 135 on 398 degrees of freedom
Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
There are cases where, for a given set of predictors, the outcome can take only positive values or only values in the interval (0,1):
- If the outcome is dichotomous and can take only the values 0 and 1 (Bernoulli distribution), the mean is between 0 and 1.
- If the outcome comes from a count distribution taking values 0, 1, 2, ⋯ (e.g., the Poisson distribution), the mean is always positive (𝜇 > 0).
mean of Y : 0 < 𝜇 < 1
odds function: 0 < 𝜇/(1-𝜇) < ∞
logit function: -∞ < log(𝜇/(1-𝜇)) < ∞
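These three ranges can be verified numerically. The sketch below (plain Python, purely for illustration) defines the logit and its inverse and checks that the logit maps a mean in (0, 1) onto the whole real line, and that the inverse recovers the mean:

```python
import math

def logit(mu):
    """Map a mean mu in (0, 1) to the whole real line: log(mu / (1 - mu))."""
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    """Map any real eta back to a mean in (0, 1)."""
    return math.exp(eta) / (1 + math.exp(eta))

print(logit(0.5))                        # 0.0 (probability 1/2 -> log-odds 0)
print(round(inv_logit(logit(0.9)), 6))   # 0.9 (inverse recovers the mean)
print(logit(0.99), logit(0.01))          # large positive, large negative
```

Note that the odds mu/(1-mu) are always positive, while the logit is unbounded in both directions, which is what lets us model it with a linear predictor.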
likelihood: \(Y_i | \pi_i \sim \text{Bern}(\pi_i)\)
\(𝜇_i = E(Y_i|\pi_i) = \pi_i\)
link function: \(g(\pi_i)\)
\(g(\pi_i)= \beta_0 + \beta_1 X_{i1}\)
logit link function: \(Y_i|\beta_0,\beta_1 \stackrel{ind}{\sim} \text{Bern}(\pi_i) \;\; \text{ with } \;\; \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1}\)
\[\frac{\pi_i}{1-\pi_i} = e^{\beta_0 + \beta_1 X_{i1}} \;\;\;\; \text{ and } \;\;\;\; \pi_i = \frac{e^{\beta_0 + \beta_1 X_{i1}}}{1 + e^{\beta_0 + \beta_1 X_{i1}}}\]
\[\log(\text{odds}) = \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \; . \] When \((X_1,X_2,\ldots,X_p)\) are all 0, \(\beta_0\) is the typical log odds of the event of interest and \(e^{\beta_0}\) is the typical odds.
When controlling for the other predictors \((X_2,\ldots,X_p)\), let \(\text{odds}_x\) represent the typical odds of the event of interest when \(X_1 = x\) and \(\text{odds}_{x+1}\) the typical odds when \(X_1 = x + 1\). Then when \(X_1\) increases by 1, from \(x\) to \(x + 1\), \(\beta_1\) is the typical change in log odds and \(e^{\beta_1}\) is the typical multiplicative change in odds:
\[\beta_1 = \log(\text{odds}_{x+1}) - \log(\text{odds}_x) \;\;\; \text{ and } \;\;\; e^{\beta_1} = \frac{\text{odds}_{x+1}}{\text{odds}_x}\]
\[\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1\]
\[\pi = \frac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}\]
This data set has a binary response (outcome, dependent) variable called admit.
There are three predictor variables: gre, gpa and rank.
We will treat the variables gre and gpa as continuous.
The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
#reading the data from OARC stat website admission data:
mydata <- read.csv("https://stats.oarc.ucla.edu/stat/data/binary.csv")
#summary statistics of variables
summary(mydata)
admit gre gpa rank
Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
Median :0.0000 Median :580.0 Median :3.395 Median :2.000
Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
#standard deviation of each variable
admit gre gpa rank
0.4660867 115.5165364 0.3805668 0.9444602
\[Y = \begin{cases} \text{Yes admit} \\ \text{No admit} \end{cases}\]
\[X_1 = \text{GRE}\]
\[\text{odds} = \frac{\pi}{1-\pi}\]
if prob of admit = \(\frac{2}{3}\) then \(\text{odds of admit } = \frac{2/3}{1-2/3} = 2\)
if prob of admit = \(\frac{1}{3}\) then \(\text{odds of admit } = \frac{1/3}{1-1/3} = \frac{1}{2}\)
if prob of admit = \(\frac{1}{2}\) then \(\text{odds of admit } = \frac{1/2}{1-1/2} = 1\)
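These odds computations are simple arithmetic; a small Python sketch (illustrative only) reproduces them exactly using fractions:

```python
from fractions import Fraction

def odds(p):
    # odds = p / (1 - p)
    return p / (1 - p)

print(odds(Fraction(2, 3)))  # 2
print(odds(Fraction(1, 3)))  # 1/2
print(odds(Fraction(1, 2)))  # 1
```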
#for comparison, a linear probability model: lm(admit ~ gre, data = mydata)
Call:
lm(formula = admit ~ gre, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-0.4755 -0.3415 -0.2522 0.5989 0.8966
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1198407 0.1190510 -1.007 0.314722
gre 0.0007442 0.0001988 3.744 0.000208 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4587 on 398 degrees of freedom
Multiple R-squared: 0.03402, Adjusted R-squared: 0.03159
F-statistic: 14.02 on 1 and 398 DF, p-value: 0.0002081
#to estimate the model coefficients we use glm with
#the option family = binomial(link = "logit")
glm.mod <- glm(admit ~ gre, family = binomial(link = "logit"), data = mydata)
summary(glm.mod)
Call:
glm(formula = admit ~ gre, family = binomial(link = "logit"),
data = mydata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
gre 0.003582 0.000986 3.633 0.00028 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 486.06 on 398 degrees of freedom
AIC: 490.06
Number of Fisher Scoring iterations: 4
#coefficient table
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.901344270 0.6060376103 -4.787400 1.689561e-06
gre 0.003582212 0.0009860051 3.633056 2.800845e-04
#95% confidence intervals for the coefficients
2.5 % 97.5 %
(Intercept) -4.119988259 -1.739756286
gre 0.001679963 0.005552748
#exponentiated coefficients (odds ratios)
(Intercept) gre
0.0549493 1.0035886
The coefficient of GRE (the slope of the model) is interpreted as: for a one-unit increase in GRE, the log-odds of admission is expected to increase by 0.003582 on average.
The intercept is the expected log-odds of admission for someone with a GRE score of zero.
Exponentiating the intercept gives the expected odds for someone with GRE equal to zero: exp(−2.901344) = 0.05494932. Transforming to a probability, p = 0.05494932/(1 + 0.05494932) = 0.052, gives the expected probability of admission for someone with GRE equal to 0.
Exponentiating the slope gives the odds ratio of admission for a one-unit increase in GRE score: exp(0.003582) = 1.003588. We can say the odds of getting admitted increase on average by about 0.3% for each additional GRE point.
What if someone improves their GRE score by 50 points?
The effect on the odds is multiplicative, not additive: exp(50 × 0.003582) ≈ 1.1961, so 50 additional GRE points increase the odds of admission by a factor of about 1.196. In other words, the odds of getting admitted increase by about 20%.
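The interpretations above can be checked numerically. The following Python sketch (illustrative; the coefficient values are copied from the summary output above) reproduces the exponentiated intercept, the implied probability at GRE = 0, and the odds ratios for 1-point and 50-point GRE increases:

```python
import math

b0 = -2.901344   # intercept from the fitted logistic model above
b1 = 0.003582    # gre coefficient

odds0 = math.exp(b0)          # odds of admission when GRE = 0
p0 = odds0 / (1 + odds0)      # probability of admission when GRE = 0
or_1 = math.exp(b1)           # odds ratio per 1 GRE point
or_50 = math.exp(50 * b1)     # odds ratio per 50 GRE points (multiplicative!)

print(round(odds0, 7))  # 0.0549493
print(round(p0, 3))     # 0.052
print(round(or_1, 6))   # 1.003588
print(round(or_50, 4))  # 1.1961
```

Note that or_50 equals or_1 raised to the 50th power, which is why the effect compounds multiplicatively rather than adding up.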
glm.mod.2 <- glm(admit ~ gre + gpa, family = binomial(link = "logit"), data = mydata)
summary(glm.mod.2)
Call:
glm(formula = admit ~ gre + gpa, family = binomial(link = "logit"),
data = mydata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
gre 0.002691 0.001057 2.544 0.0109 *
gpa 0.754687 0.319586 2.361 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 480.34 on 397 degrees of freedom
AIC: 486.34
Number of Fisher Scoring iterations: 4
#coefficient table
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.949378063 1.075092882 -4.603675 4.151004e-06
gre 0.002690684 0.001057491 2.544403 1.094647e-02
gpa 0.754686856 0.319585595 2.361455 1.820340e-02
#exponentiated coefficients (odds ratios)
(Intercept) gre gpa
0.007087816 1.002694307 2.126945379
#95% confidence intervals for the coefficients
2.5 % 97.5 %
(Intercept) -7.1092205752 -2.886726963
gre 0.0006373958 0.004791712
gpa 0.1347509288 1.390131131
Holding GRE score constant, the odds of getting admitted increase on average by a factor of 2.1269 for a one-unit improvement in GPA.
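A quick numeric check (Python, illustrative only) of the GPA odds ratio and its confidence interval, using the estimates printed above; exponentiating the endpoints of the coefficient CI gives a CI for the odds ratio:

```python
import math

# gpa coefficient and its 95% CI from the two-predictor model above
b_gpa = 0.754687
ci_low, ci_high = 0.1347509288, 1.390131131

print(round(math.exp(b_gpa), 4))  # 2.1269 (odds ratio per unit of GPA)
# CI for the odds ratio: exponentiate the coefficient CI endpoints
print(round(math.exp(ci_low), 4), round(math.exp(ci_high), 4))
```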
glm.mod.3 <- glm(admit ~ gre * gpa, family = binomial(link = "logit"), data = mydata)
summary(glm.mod.3)
Call:
glm(formula = admit ~ gre * gpa, family = binomial(link = "logit"),
data = mydata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.818700 5.874816 -2.352 0.0187 *
gre 0.017464 0.009606 1.818 0.0691 .
gpa 3.367918 1.722685 1.955 0.0506 .
gre:gpa -0.004327 0.002787 -1.552 0.1205
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 477.85 on 396 degrees of freedom
AIC: 485.85
Number of Fisher Scoring iterations: 4
#coefficient table
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.818700374 5.874815854 -2.352193 0.01866309
gre 0.017464312 0.009606206 1.818024 0.06906048
gpa 3.367918022 1.722685231 1.955040 0.05057838
gre:gpa -0.004326605 0.002786899 -1.552480 0.12054741
#exponentiated coefficients (odds ratios)
(Intercept) gre gpa gre:gpa
9.968153e-07 1.017618e+00 2.901805e+01 9.956827e-01
#95% confidence intervals for the coefficients
2.5 % 97.5 %
(Intercept) -2.576285e+01 -2.677081148
gre -9.057462e-04 0.036855984
gpa 8.009849e-02 6.850511379
gre:gpa -9.932612e-03 0.001022485
#exp(3.367918022 + 500*(-0.004326605))
[1] 3.335476
The interaction term is a multiplicative factor: for every one-unit increase in GRE, the odds ratio associated with GPA is multiplied by exp(−0.004326605) = 0.9956827 (here, a decrease).
For example, if someone has a GRE score of 500, then for every 1-unit increase in their GPA, their odds of admission change by a factor of 3.335476 (computed above).
Equivalently, we can first calculate 3.367918022 + 500 × (−0.004326605) and then exponentiate.
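Both routes give the same number, as this small Python check (coefficients copied from the interaction model output above) illustrates:

```python
import math

b_gpa = 3.367918022      # gpa main effect from the interaction model above
b_int = -0.004326605     # gre:gpa interaction coefficient

# odds ratio for a 1-unit GPA increase, evaluated at GRE = 500:
# route 1 - add on the log-odds scale, then exponentiate
or_gpa_at_500 = math.exp(b_gpa + 500 * b_int)
# route 2 - multiply exp(b_gpa) by exp(b_int) once per GRE point
same = math.exp(b_gpa) * math.exp(b_int) ** 500

print(round(or_gpa_at_500, 4))  # 3.3355
```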
Count data are observations that take only non-negative integer values.
Theoretically, counts can range from zero to infinity, but in practice they are limited to some lesser, undetermined value.
Examples:
- A count of items or events occurring within a period of time.
- A count of items or events occurring in a given geographical or spatial area.
- A count of the number of people having a particular disease, adjusted by the size of the population.
\[p(y|\mu ) = \frac{e^{-\mu}\mu^y}{y!}\]
\[p(y=0|\mu=2 ) = \frac{e^{-2} 2^0}{0!}=0.1353\] \[p(y=1|\mu=2 ) = \frac{e^{-2} 2^1}{1!}=0.2707\]
\(p(y=2|\mu=2 ) = \frac{e^{-2} 2^2}{2!}=0.2707\)
\(p(y=3|\mu=2 ) = 0.1804\)
\(p(y=4|\mu=2 ) = 0.0902\)
\(p(y=5|\mu=2 ) = 0.0361\)
\(p(y=6|\mu=2 ) = 0.0120\)
\(p(y=7|\mu=2 ) = 0.0034\)
\(p(y=8|\mu=2 ) = 0.0009\)
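The probabilities listed above follow directly from the Poisson pmf; a short Python sketch (illustrative only) reproduces them:

```python
from math import exp, factorial

def poisson_pmf(y, mu):
    # p(y | mu) = exp(-mu) * mu**y / y!
    return exp(-mu) * mu**y / factorial(y)

# probabilities for y = 0..8 with mu = 2, matching the list above:
# 0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.0120, 0.0034, 0.0009
for y in range(9):
    print(y, round(poisson_pmf(y, 2), 4))
```

The probabilities over all non-negative integers sum to 1, which is a useful sanity check on the formula.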
link function: \(g(\mu_i)=ln(\mu_i)\)
\(g(\mu_i)= \beta_0 + \beta_1 X_{i1}\)
\(ln(\mu_i)= ln(E(y_i|x_i ; \beta_0 , \beta_1)) = \beta_0 + \beta_1 X_{i}\)
\[f(y_i|x_i ; \beta_0 , \beta_1) = \frac{e^{-e^{\beta_0 + \beta_1 x_i}}\left(e^{\beta_0 + \beta_1 x_i}\right)^{y_i}}{y_i!}\]
docvis hospvis edlevel age outwork female married kids hhninc educ self
1 1 0 3 54 0 0 1 0 3.050 15.0 0
2 0 0 1 44 1 1 1 0 3.050 9.0 0
3 0 0 1 58 1 1 0 0 1.434 11.0 0
4 7 2 1 64 0 0 0 0 1.500 10.5 0
5 6 0 3 30 1 0 0 0 2.400 13.0 0
6 9 0 3 26 1 0 0 0 1.050 13.0 0
edlevel1 edlevel2 edlevel3 edlevel4
1 0 0 1 0
2 1 0 0 0
3 1 0 0 0
4 1 0 0 0
5 0 0 1 0
6 0 0 1 0
docvis hospvis edlevel age
Min. : 0.000 Min. : 0.0000 Min. :1.00 Min. :25
1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:1.00 1st Qu.:35
Median : 1.000 Median : 0.0000 Median :1.00 Median :44
Mean : 3.163 Mean : 0.1213 Mean :1.38 Mean :44
3rd Qu.: 4.000 3rd Qu.: 0.0000 3rd Qu.:1.00 3rd Qu.:54
Max. :121.000 Max. :17.0000 Max. :4.00 Max. :64
outwork female married kids
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :1.0000 Median :0.0000
Mean :0.3665 Mean :0.4793 Mean :0.7891 Mean :0.4491
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
hhninc educ self edlevel1
Min. : 0.015 Min. : 7.00 Min. :0.00000 Min. :0.0000
1st Qu.: 2.000 1st Qu.:10.00 1st Qu.:0.00000 1st Qu.:1.0000
Median : 2.800 Median :10.50 Median :0.00000 Median :1.0000
Mean : 2.969 Mean :11.09 Mean :0.06118 Mean :0.8136
3rd Qu.: 3.590 3rd Qu.:11.50 3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :25.000 Max. :18.00 Max. :1.00000 Max. :1.0000
edlevel2 edlevel3 edlevel4
Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :0.0000 Median :0.00000
Mean :0.0524 Mean :0.0746 Mean :0.05937
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.00000
?rwm1984
#centering the age predictor
rwm1984$cage <- rwm1984$age - mean(rwm1984$age)
#running glm
pois_m <- glm(docvis ~ outwork + cage, family = poisson(link = "log"), data = rwm1984)
#examine the model output with summary
summary(pois_m)
Call:
glm(formula = docvis ~ outwork + cage, family = poisson(link = "log"),
data = rwm1984)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9380965 0.0127571 73.53 <2e-16 ***
outwork 0.4079314 0.0188447 21.65 <2e-16 ***
cage 0.0220842 0.0008377 26.36 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 25791 on 3873 degrees of freedom
Residual deviance: 24190 on 3871 degrees of freedom
AIC: 31279
Number of Fisher Scoring iterations: 6
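As with the logistic model, exponentiating the Poisson coefficients is informative: with a log link they become incidence rate ratios. A small Python check (coefficient values copied from the summary above; the rate-ratio interpretation is standard for a log link, not stated in the output itself):

```python
import math

# coefficients from the Poisson model summary above
b_outwork = 0.4079314
b_cage = 0.0220842

# exponentiating gives incidence rate ratios (multiplicative effects on
# the expected number of doctor visits)
print(round(math.exp(b_outwork), 4))  # 1.5037
print(round(math.exp(b_cage), 4))     # 1.0223
```

Under this model, being out of work is associated with about 50% more doctor visits, and each additional year of age (centered) with about 2.2% more, holding the other predictor constant.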