Simple linear regression model
Assumptions in OLS:
1 - Errors have mean equal to zero.
2 - Errors are independent.
3 - Errors have equal or constant variance (Homoscedasticity or Homogeneity of variance).
4 - Errors follow a normal distribution.
# We want to study the relationship between academic performance api00
# and number of students enrolled in schools,
#scatter plot
plot(api00 ~ enroll, data = dat,
     main = "Scatter plot of school performance vs. number of students enrolled",
     ylab = "api00 (performance score in 2000)",
     xlab = "enroll", xlim = c(0, 1000))
# fit a simple linear regression of api00 on enroll
m1 <- lm(api00 ~ enroll, data = dat)
summary(m1)
Call:
lm(formula = api00 ~ enroll, data = dat)
Residuals:
Min 1Q Median 3Q Max
-285.50 -112.55 -6.70 95.06 389.15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
enroll -0.19987 0.02985 -6.695 7.34e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 135 on 398 degrees of freedom
Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
There are cases where, for a given set of predictors, the outcome can take only positive values or only values in the interval (0,1):
- If the outcome is dichotomous and can take only the values 0 and 1 (Bernoulli distribution), the mean is between 0 and 1.
- If the outcome comes from a count distribution taking values 0, 1, 2, ⋯ (e.g., the Poisson distribution), the mean is always positive (𝜇 > 0).
mean of Y : 0 < 𝜇 < 1
odds function: 0 < 𝜇/(1-𝜇) < ∞
logit function: -∞ < log(𝜇/(1-𝜇)) < ∞
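These three ranges can be verified numerically. The sketch below (plain Python, purely for illustration) defines the logit and its inverse and checks that the logit maps a mean in (0, 1) onto the whole real line, and that the inverse recovers the mean:

```python
import math

def logit(mu):
    """Map a mean mu in (0, 1) to the whole real line: log(mu / (1 - mu))."""
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    """Map any real eta back to a mean in (0, 1)."""
    return math.exp(eta) / (1 + math.exp(eta))

print(logit(0.5))                        # 0.0 (probability 1/2 -> log-odds 0)
print(round(inv_logit(logit(0.9)), 6))   # 0.9 (inverse recovers the mean)
print(logit(0.99), logit(0.01))          # large positive, large negative
```

Note that the odds mu/(1-mu) are always positive, while the logit is unbounded in both directions, which is what lets us model it with a linear predictor.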
likelihood: \(Y_i | \pi_i \sim \text{Bern}(\pi_i)\)
\(𝜇_i = E(Y_i|\pi_i) = \pi_i\)
link function: \(g(\pi_i)\)
\(g(\pi_i)= \beta_0 + \beta_1 X_{i1}\)
logit link function: \(Y_i|\beta_0,\beta_1 \stackrel{ind}{\sim} \text{Bern}(\pi_i) \;\; \text{ with } \;\; \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{i1}\)
\[\frac{\pi_i}{1-\pi_i} = e^{\beta_0 + \beta_1 X_{i1}} \;\;\;\; \text{ and } \;\;\;\; \pi_i = \frac{e^{\beta_0 + \beta_1 X_{i1}}}{1 + e^{\beta_0 + \beta_1 X_{i1}}}\]
\[\log(\text{odds}) = \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \; . \] When \((X_1,X_2,\ldots,X_p)\) are all 0, \(\beta_0\) is the typical log odds of the event of interest and \(e^{\beta_0}\) is the typical odds.
When controlling for the other predictors \((X_2,\ldots,X_p)\), let \(\text{odds}_x\) represent the typical odds of the event of interest when \(X_1 = x\) and \(\text{odds}_{x+1}\) the typical odds when \(X_1 = x + 1\). Then when \(X_1\) increases by 1, from \(x\) to \(x + 1\), \(\beta_1\) is the typical change in log odds and \(e^{\beta_1}\) is the typical multiplicative change in odds:
\[\beta_1 = \log(\text{odds}_{x+1}) - \log(\text{odds}_x) \;\;\; \text{ and } \;\;\; e^{\beta_1} = \frac{\text{odds}_{x+1}}{\text{odds}_x}\]
\[\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1\]
\[\pi = \frac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}\]
This data set has a binary response (outcome, dependent) variable called admit.
There are three predictor variables: gre, gpa and rank.
We will treat the variables gre and gpa as continuous.
The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
#reading the data from OARC stat website admission data:
mydata <- read.csv("https://stats.oarc.ucla.edu/stat/data/binary.csv")
#summary statistics of variables
summary(mydata)
admit gre gpa rank
Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
Median :0.0000 Median :580.0 Median :3.395 Median :2.000
Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
#standard deviation of each variable
admit gre gpa rank
0.4660867 115.5165364 0.3805668 0.9444602
\[Y = \begin{cases} \text{Yes admit} \\ \text{No admit} \end{cases}\]
\[X_1 = \text{GRE}\]
\[\text{odds} = \frac{\pi}{1-\pi}\]
if prob of admit = \(\frac{2}{3}\) then \(\text{odds of admit } = \frac{2/3}{1-2/3} = 2\)
if prob of admit = \(\frac{1}{3}\) then \(\text{odds of admit } = \frac{1/3}{1-1/3} = \frac{1}{2}\)
if prob of admit = \(\frac{1}{2}\) then \(\text{odds of admit } = \frac{1/2}{1-1/2} = 1\)
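These odds computations are simple arithmetic; a small Python sketch (illustrative only) reproduces them exactly using fractions:

```python
from fractions import Fraction

def odds(p):
    # odds = p / (1 - p)
    return p / (1 - p)

print(odds(Fraction(2, 3)))  # 2
print(odds(Fraction(1, 3)))  # 1/2
print(odds(Fraction(1, 2)))  # 1
```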
#for comparison, a linear probability model: lm(admit ~ gre, data = mydata)
Call:
lm(formula = admit ~ gre, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-0.4755 -0.3415 -0.2522 0.5989 0.8966
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1198407 0.1190510 -1.007 0.314722
gre 0.0007442 0.0001988 3.744 0.000208 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4587 on 398 degrees of freedom
Multiple R-squared: 0.03402, Adjusted R-squared: 0.03159
F-statistic: 14.02 on 1 and 398 DF, p-value: 0.0002081
#to estimate the model coefficients we use glm with
#the option family = binomial(link = "logit")
glm.mod <- glm(admit ~ gre, family = binomial(link = "logit"), data = mydata)
summary(glm.mod)
Call:
glm(formula = admit ~ gre, family = binomial(link = "logit"),
data = mydata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
gre 0.003582 0.000986 3.633 0.00028 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 486.06 on 398 degrees of freedom
AIC: 490.06
Number of Fisher Scoring iterations: 4
#coefficient table
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.901344270 0.6060376103 -4.787400 1.689561e-06
gre 0.003582212 0.0009860051 3.633056 2.800845e-04
#95% confidence intervals for the coefficients
2.5 % 97.5 %
(Intercept) -4.119988259 -1.739756286
gre 0.001679963 0.005552748
#exponentiated coefficients (odds ratios)
(Intercept) gre
0.0549493 1.0035886
The coefficient of GRE (the slope of the model) is interpreted as: for a one-unit increase in GRE, the log-odds of admission is expected to increase by 0.003582 on average.
The intercept is the expected log-odds of admission for someone with a GRE score of zero.
Exponentiating the intercept gives the expected odds for someone with GRE equal to zero: exp(−2.901344) = 0.05494932. Transforming to a probability, p = 0.05494932/(1 + 0.05494932) = 0.052, gives the expected probability of admission for someone with GRE equal to 0.
Exponentiating the slope gives the odds ratio of admission for a one-unit increase in GRE score: exp(0.003582) = 1.003588. We can say the odds of getting admitted increase on average by about 0.3% for each additional GRE point.
What if someone improves their GRE score by 50 points?
The effect on the odds is multiplicative, not additive: exp(50 × 0.003582) ≈ 1.1961, so 50 additional GRE points increase the odds of admission by a factor of about 1.196. In other words, the odds of getting admitted increase by about 20%.
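The interpretations above can be checked numerically. The following Python sketch (illustrative; the coefficient values are copied from the summary output above) reproduces the exponentiated intercept, the implied probability at GRE = 0, and the odds ratios for 1-point and 50-point GRE increases:

```python
import math

b0 = -2.901344   # intercept from the fitted logistic model above
b1 = 0.003582    # gre coefficient

odds0 = math.exp(b0)          # odds of admission when GRE = 0
p0 = odds0 / (1 + odds0)      # probability of admission when GRE = 0
or_1 = math.exp(b1)           # odds ratio per 1 GRE point
or_50 = math.exp(50 * b1)     # odds ratio per 50 GRE points (multiplicative!)

print(round(odds0, 7))  # 0.0549493
print(round(p0, 3))     # 0.052
print(round(or_1, 6))   # 1.003588
print(round(or_50, 4))  # 1.1961
```

Note that or_50 equals or_1 raised to the 50th power, which is why the effect compounds multiplicatively rather than adding up.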
glm.mod.2 <- glm(admit ~ gre + gpa, family = binomial(link = "logit"), data = mydata)
summary(glm.mod.2)
Call:
glm(formula = admit ~ gre + gpa, family = binomial(link = "logit"),
data = mydata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
gre 0.002691 0.001057 2.544 0.0109 *
gpa 0.754687 0.319586 2.361 0.0182 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 480.34 on 397 degrees of freedom
AIC: 486.34
Number of Fisher Scoring iterations: 4
#coefficient table
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.949378063 1.075092882 -4.603675 4.151004e-06
gre 0.002690684 0.001057491 2.544403 1.094647e-02
gpa 0.754686856 0.319585595 2.361455 1.820340e-02
#exponentiated coefficients (odds ratios)
(Intercept) gre gpa
0.007087816 1.002694307 2.126945379
#95% confidence intervals for the coefficients
2.5 % 97.5 %
(Intercept) -7.1092205752 -2.886726963
gre 0.0006373958 0.004791712
gpa 0.1347509288 1.390131131
Holding GRE score constant, the odds of getting admitted increase on average by a factor of 2.1269 for a one-unit improvement in GPA.
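A quick numeric check (Python, illustrative only) of the GPA odds ratio and its confidence interval, using the estimates printed above; exponentiating the endpoints of the coefficient CI gives a CI for the odds ratio:

```python
import math

# gpa coefficient and its 95% CI from the two-predictor model above
b_gpa = 0.754687
ci_low, ci_high = 0.1347509288, 1.390131131

print(round(math.exp(b_gpa), 4))  # 2.1269 (odds ratio per unit of GPA)
# CI for the odds ratio: exponentiate the coefficient CI endpoints
print(round(math.exp(ci_low), 4), round(math.exp(ci_high), 4))
```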
glm.mod.3 <- glm(admit ~ gre * gpa, family = binomial(link = "logit"), data = mydata)
summary(glm.mod.3)
Call:
glm(formula = admit ~ gre * gpa, family = binomial(link = "logit"),
data = mydata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.818700 5.874816 -2.352 0.0187 *
gre 0.017464 0.009606 1.818 0.0691 .
gpa 3.367918 1.722685 1.955 0.0506 .
gre:gpa -0.004327 0.002787 -1.552 0.1205
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 477.85 on 396 degrees of freedom
AIC: 485.85
Number of Fisher Scoring iterations: 4
#coefficient table
Estimate Std. Error z value Pr(>|z|)
(Intercept) -13.818700374 5.874815854 -2.352193 0.01866309
gre 0.017464312 0.009606206 1.818024 0.06906048
gpa 3.367918022 1.722685231 1.955040 0.05057838
gre:gpa -0.004326605 0.002786899 -1.552480 0.12054741
#exponentiated coefficients (odds ratios)
(Intercept) gre gpa gre:gpa
9.968153e-07 1.017618e+00 2.901805e+01 9.956827e-01
#95% confidence intervals for the coefficients
2.5 % 97.5 %
(Intercept) -2.576285e+01 -2.677081148
gre -9.057462e-04 0.036855984
gpa 8.009849e-02 6.850511379
gre:gpa -9.932612e-03 0.001022485
#exp(3.367918022 + 500*(-0.004326605))
[1] 3.335476
The interaction term is a multiplicative factor: for every one-unit increase in GRE, the odds ratio associated with GPA is multiplied by exp(−0.004326605) = 0.9956827 (here, a decrease).
For example, if someone has a GRE score of 500, then for every 1-unit increase in their GPA, their odds of admission change by a factor of 3.335476 (computed above).
Equivalently, we can first calculate 3.367918022 + 500 × (−0.004326605) and then exponentiate.
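Both routes give the same number, as this small Python check (coefficients copied from the interaction model output above) illustrates:

```python
import math

b_gpa = 3.367918022      # gpa main effect from the interaction model above
b_int = -0.004326605     # gre:gpa interaction coefficient

# odds ratio for a 1-unit GPA increase, evaluated at GRE = 500:
# route 1 - add on the log-odds scale, then exponentiate
or_gpa_at_500 = math.exp(b_gpa + 500 * b_int)
# route 2 - multiply exp(b_gpa) by exp(b_int) once per GRE point
same = math.exp(b_gpa) * math.exp(b_int) ** 500

print(round(or_gpa_at_500, 4))  # 3.3355
```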
Count data are observations that take only non-negative integer values.
Theoretically, counts can range from zero to infinity, but in practice they are limited to some lesser, undetermined value.
Examples:
- A count of items or events occurring within a period of time.
- A count of items or events occurring in a given geographical or spatial area.
- A count of the number of people having a particular disease, adjusted by the size of the population.
\[p(y|\mu ) = \frac{e^{-\mu}\mu^y}{y!}\]
\[p(y=0|\mu=2 ) = \frac{e^{-2} 2^0}{0!}=0.1353\] \[p(y=1|\mu=2 ) = \frac{e^{-2} 2^1}{1!}=0.2707\]
\(p(y=2|\mu=2 ) = \frac{e^{-2} 2^2}{2!}=0.2707\)
\(p(y=3|\mu=2 ) = 0.1804\)
\(p(y=4|\mu=2 ) = 0.0902\)
\(p(y=5|\mu=2 ) = 0.0361\)
\(p(y=6|\mu=2 ) = 0.0120\)
\(p(y=7|\mu=2 ) = 0.0034\)
\(p(y=8|\mu=2 ) = 0.0009\)
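The probabilities listed above follow directly from the Poisson pmf; a short Python sketch (illustrative only) reproduces them:

```python
from math import exp, factorial

def poisson_pmf(y, mu):
    # p(y | mu) = exp(-mu) * mu**y / y!
    return exp(-mu) * mu**y / factorial(y)

# probabilities for y = 0..8 with mu = 2, matching the list above:
# 0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.0120, 0.0034, 0.0009
for y in range(9):
    print(y, round(poisson_pmf(y, 2), 4))
```

The probabilities over all non-negative integers sum to 1, which is a useful sanity check on the formula.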
link function: \(g(\mu_i)=ln(\mu_i)\)
\(g(\mu_i)= \beta_0 + \beta_1 X_{i1}\)
\(ln(\mu_i)= ln(E(y_i|x_i ; \beta_0 , \beta_1)) = \beta_0 + \beta_1 X_{i}\)
\[f(y_i|x_i ; \beta_0 , \beta_1) = \frac{e^{-e^{\beta_0 + \beta_1 x_i}}\left(e^{\beta_0 + \beta_1 x_i}\right)^{y_i}}{y_i!}\]
docvis hospvis edlevel age outwork female married kids hhninc educ self
1 1 0 3 54 0 0 1 0 3.050 15.0 0
2 0 0 1 44 1 1 1 0 3.050 9.0 0
3 0 0 1 58 1 1 0 0 1.434 11.0 0
4 7 2 1 64 0 0 0 0 1.500 10.5 0
5 6 0 3 30 1 0 0 0 2.400 13.0 0
6 9 0 3 26 1 0 0 0 1.050 13.0 0
edlevel1 edlevel2 edlevel3 edlevel4
1 0 0 1 0
2 1 0 0 0
3 1 0 0 0
4 1 0 0 0
5 0 0 1 0
6 0 0 1 0
docvis hospvis edlevel age
Min. : 0.000 Min. : 0.0000 Min. :1.00 Min. :25
1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:1.00 1st Qu.:35
Median : 1.000 Median : 0.0000 Median :1.00 Median :44
Mean : 3.163 Mean : 0.1213 Mean :1.38 Mean :44
3rd Qu.: 4.000 3rd Qu.: 0.0000 3rd Qu.:1.00 3rd Qu.:54
Max. :121.000 Max. :17.0000 Max. :4.00 Max. :64
outwork female married kids
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :1.0000 Median :0.0000
Mean :0.3665 Mean :0.4793 Mean :0.7891 Mean :0.4491
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
hhninc educ self edlevel1
Min. : 0.015 Min. : 7.00 Min. :0.00000 Min. :0.0000
1st Qu.: 2.000 1st Qu.:10.00 1st Qu.:0.00000 1st Qu.:1.0000
Median : 2.800 Median :10.50 Median :0.00000 Median :1.0000
Mean : 2.969 Mean :11.09 Mean :0.06118 Mean :0.8136
3rd Qu.: 3.590 3rd Qu.:11.50 3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :25.000 Max. :18.00 Max. :1.00000 Max. :1.0000
edlevel2 edlevel3 edlevel4
Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :0.0000 Median :0.00000
Mean :0.0524 Mean :0.0746 Mean :0.05937
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.00000
?rwm1984
#centering the age predictor
rwm1984$cage <- rwm1984$age - mean(rwm1984$age)
#running glm
pois_m <- glm(docvis ~ outwork + cage, family = poisson(link = "log"), data = rwm1984)
#examine the model output with summary
summary(pois_m)
Call:
glm(formula = docvis ~ outwork + cage, family = poisson(link = "log"),
data = rwm1984)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.9380965 0.0127571 73.53 <2e-16 ***
outwork 0.4079314 0.0188447 21.65 <2e-16 ***
cage 0.0220842 0.0008377 26.36 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 25791 on 3873 degrees of freedom
Residual deviance: 24190 on 3871 degrees of freedom
AIC: 31279
Number of Fisher Scoring iterations: 6
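As with the logistic model, exponentiating the Poisson coefficients is informative: with a log link they become incidence rate ratios. A small Python check (coefficient values copied from the summary above; the rate-ratio interpretation is standard for a log link, not stated in the output itself):

```python
import math

# coefficients from the Poisson model summary above
b_outwork = 0.4079314
b_cage = 0.0220842

# exponentiating gives incidence rate ratios (multiplicative effects on
# the expected number of doctor visits)
print(round(math.exp(b_outwork), 4))  # 1.5037
print(round(math.exp(b_cage), 4))     # 1.0223
```

Under this model, being out of work is associated with about 50% more doctor visits, and each additional year of age (centered) with about 2.2% more, holding the other predictor constant.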