
Regression

Overview

Teaching: 60 min
Exercises: 30 min

1. Regression Introduction

Regression analysis is a conceptually simple method for investigating functional relationships among variables.

Regression Analysis By Example by Samprit Chatterjee, Ali S. Hadi and Bertram Price is available online from the MQ library.

EXAMPLE: A new Real Estate agent in Macquarie, ACT wants to come up with a sale price for a house based on some physical characteristics.

She has some past sales data on houses that sold in 2018. Note that I have de-identified the records by removing the addresses.

library(tidyverse)
library(ggpubr)
library(olsrr)

theme_set(theme_bw())

houses <- read_csv(file.path("..", "data", "Macquarie2018.csv"))
houses
# A tibble: 15 x 8
   Contract_Date  Price Block_Size Bedrooms Bathrooms Ensuites Garages
   <chr>          <dbl>      <dbl>    <dbl>     <dbl> <lgl>      <dbl>
 1 12/10/2018    605000        780       NA        NA NA            NA
 2 14/07/2018    645000        761        4         1 NA             2
 3 22/05/2018    670000        433        3         1 NA             0
 4 15/01/2018    682000       1173       NA        NA NA            NA
 5 27/11/2018    700000        698       NA        NA NA            NA
 6 24/09/2018    710000        971       NA        NA NA            NA
 7 24/01/2018    721000        453        3         2 NA             0
 8 2/06/2018     723000        850        3         1 NA             2
 9 5/12/2018     785000        871        3         1 NA             1
10 2/11/2018     800000        929        3         1 NA            NA
11 28/02/2018    848000       1039       NA        NA NA            NA
12 15/02/2018    860000        941        3         1 NA             1
13 21/09/2018    868000        808        3         1 NA             1
14 8/12/2018     877500       1179        4         2 NA             2
15 9/11/2018     900000        808       NA        NA NA            NA
# ... with 1 more variable: Carports <dbl>

Response (Dependent variable, Outcome variable): Price ($Y$)

Predictors (Explanatory variables, independent variables, covariates, regressors): Block Size ($X_1$), Bedrooms ($X_2$), Bathrooms ($X_3$), Ensuites ($X_4$), Garages ($X_5$), Carports ($X_6$).

Regression Model:

$Y = f(X_1,X_2,X_3,X_4,X_5,X_6) + \varepsilon$

This is used as an approximation to the true relationship between $Y$ and $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, $X_6$.

The function $f(X_1,X_2,X_3,X_4,X_5,X_6)$ describes the relationship between $Y$ and $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, $X_6$.

$\varepsilon$ is assumed to be a random error representing the discrepancy between $Y$ and $f(X_1,X_2,X_3,X_4,X_5,X_6)$.

Linear regression model:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_6 X_6 + \varepsilon$

The $\beta_0$, $\beta_1$, $\beta_2$, $\cdots$, $\beta_6$ are called regression parameters or coefficients; they are unknown constants that need to be estimated from the data.

1.2 STEPS IN REGRESSION ANALYSIS

Statement of the problem

The question(s) to be addressed by the analysis must be carefully thought out.

EXAMPLE: Determine the sale price for some houses in Macquarie.

Selection of potentially relevant variables

Select a set of variables that are thought to explain or predict the response variable. Usually use those suggested by experts in the area of study.

EXAMPLE: The Real Estate Agent might think Block Size strongly influences house sale price.

Data are usually recorded as follows.

Each row corresponds to an observation. Each column corresponds to a variable. For each observation we have one value for the response variable and one value for each of the $p$ predictors.

We can classify the variables as quantitative or qualitative (categorical) and further divide categorical variables as being ordinal or nominal (no ordering in the categories).

EXAMPLE: In the Macquarie Real Estate example, except for Contract Date, all variables can be considered to be quantitative.

Model specification

Need to select the form of the function $f(X_1,X_2,\cdots,X_p)$. Could be specified initially by experts in the field of study based on their knowledge or judgements (objective and / or subjective). Hypothesized model can be rejected or not rejected by the analysis of the collected data.

Only the form of the function $f(X_1,X_2,\cdots,X_p)$ needs to be specified. The function may be classified as linear or nonlinear. Do the regression parameters enter the equation linearly or nonlinearly?

Various Classifications of Regression Analysis

Type of Regression: Conditions
Univariate: Only one quantitative response variable
Multivariate: Two or more quantitative response variables
Simple: Only one predictor variable
Multiple: Two or more predictor variables
Linear: All parameters enter the equation linearly, possibly after transformation of the data
Nonlinear: The relationship between the response and some of the predictors is nonlinear or some of the parameters appear nonlinearly, but no transformation is possible to make the parameters appear linearly
Analysis of Variance: All predictors are qualitative variables
Analysis of Covariance: Some predictors are quantitative variables and others are qualitative variables
Logistic: The response variable is qualitative

(Page 15, Regression Analysis by Example, Third Edition by Samprit Chatterjee, Ali S. Hadi and Bertram Price (2000).)

Choice of fitting model

The Least Squares method is most commonly used. Under certain conditions this method produces estimators with desirable properties. We will use least squares or weighted least squares most of the time.

Model fitting

Estimation of regression parameters or fitting the model to the collected data using the chosen method of fitting. It is usual to denote estimates of the regression parameters $\beta_0,~\beta_1,~\beta_2,\cdots,~\beta_p$ by $\hat{\beta}_0,~\hat{\beta}_1,~\hat{\beta}_2,\cdots,~\hat{\beta}_p$.

If fitting a multiple linear regression to the data the estimated regression becomes

$\hat{Y}=\hat{\beta}_0+\hat{\beta}_1X_1+\hat{\beta}_2X_2+\cdots+\hat{\beta}_pX_p \quad (\text{*})$

Note that $\hat{Y}$ (said Y-hat) is called the fitted value.

The i-th fitted value is

$\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_{i1}+\hat{\beta}_2x_{i2}+\cdots+\hat{\beta}_px_{ip}$

We can use ($\text{*}$) to predict the response variable for any values of the predictor variables not observed in our data; the $\hat{Y}$ obtained in this way is called the predicted value.

Do not predict the response variable for a set of predictor variables outside the range of the data.

Model validation and criticism

The assumptions made about the data to fit a particular model need to be checked. This can only be done after the model has been fitted to the data.

If model assumptions do not hold the analysis produced (we will use the lm function in R) will not be valid. There is no point looking at p-values if the model assumptions do not hold.

You always need to check that the model assumptions hold when fitting a model to the data.

Using the chosen model(s) for the solution to the posed problem

EXAMPLE: Suppose that the new Macquarie Real Estate Agent has the feeling that house price is closely related to block size.

I have the data saved in the file Macquarie2018.csv that you can download from the following location ….

I use the readr package, which is part of the Tidyverse, to get the data into R. Make sure you change your working directory in R to the one where you have saved this data set. Search for "readr r" to find the documentation for this package.

ggplot(houses, aes(x = Block_Size, y = Price)) +
    geom_point()

(Figure: scatter plot of Price against Block_Size)

The agent feels that the relationship looks linear. A simple linear regression is fitted to the data by a first year Statistics student. The simple linear regression model is

$y_i=\beta_0+\beta_1x_i+\varepsilon_i$

with $E(\varepsilon_i)=0$, $Var(\varepsilon_i)=\sigma^2$ and $Cov(\varepsilon_i,\varepsilon_j)=0$ for $i\neq j$. We can also assume that $\varepsilon_i \sim N(0,\sigma^2)$ (i.e., the $\varepsilon_i$ are normally distributed with zero mean and constant variance $\sigma^2$).
This normality assumption is not needed for the least squares estimation but it is necessary for the standard confidence intervals and hypothesis tests.

We fit the model by estimating values for the parameters $\beta_0$, $\beta_1$ and $\sigma^2$.

houses_fit <- lm(Price ~ Block_Size, data = houses)

To get the estimated values of $\beta_0$ and $\beta_1$:

houses_fit

Call:
lm(formula = Price ~ Block_Size, data = houses)

Coefficients:
(Intercept)   Block_Size  
     616624          169  

To get more detailed output from the model fitted to the data:

summary(houses_fit)

Call:
lm(formula = Price ~ Block_Size, data = houses)

Residuals:
    Min      1Q  Median      3Q     Max 
-143435  -53988   21187   58718  146833 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 616624.3    98136.8   6.283 2.82e-05 ***
Block_Size     169.0      112.6   1.501    0.157    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 90700 on 13 degrees of freedom
Multiple R-squared:  0.1476,	Adjusted R-squared:  0.08207 
F-statistic: 2.252 on 1 and 13 DF,  p-value: 0.1574

More importantly, before looking at any p-values, look at the diagnostic plots.

par(mfrow = c(2, 2))
plot(houses_fit)

(Figure: the four default lm diagnostic plots for houses_fit)

For you to be satisfied that the model is adequate and that you can proceed to make inference the Residual vs Fitted values plot should look like a random scatter about zero and the Normal Q-Q plot should look linear. Even if we fit a multiple linear regression we still look at these two plots to assess model adequacy.

Suppose you thought the model was adequate (it is not, but let us proceed for illustration).

The regression equation is

$\hat{Y} = 616624 + 169x$

where $Y=$ Price and $x=$ Block Size.

You see that the Adjusted R-Squared is 0.08207 and that the F Statistic p-value is 0.1574. If the model were adequate and the normality assumption held, you could make the following statements:

The Block Size accounts for 8.2% of the variation in Price. The regression is not significant (p-value=0.157).

1.3 COVARIANCE AND CORRELATION COEFFICIENT

We are interested in studying the relationship between a response variable $Y$ and a single predictor $X$. The covariance between $Y$ and $X$ measures the direction of the linear relationship between $Y$ and $X$ but tells us nothing about the strength of the relationship since it changes if we change the unit of measurement. If $Cov(Y,X)>0$ then there is a positive relationship between $Y$ and $X$ but if $Cov(Y,X)<0$ the relationship is negative.

The correlation coefficient between $Y$ and $X$ is scale invariant so it measures both the direction and strength of the linear relationship between $Y$ and $X$.
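
As a quick illustration (a sketch, assuming the houses data loaded earlier), the covariance changes when we change the unit of measurement of Block_Size, while the correlation coefficient does not:

# Sketch: covariance depends on the units of measurement, correlation does not.
cov(houses$Block_Size, houses$Price)          # block size in square metres
cov(houses$Block_Size / 10000, houses$Price)  # block size in hectares: covariance shrinks
cor(houses$Block_Size, houses$Price)          # unchanged
cor(houses$Block_Size / 10000, houses$Price)  # unchanged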

EXAMPLE: Anscombe (1973) used four data sets to illustrate the importance of investigating the data using scatter plots and not relying totally on the correlation coefficient. The four data sets are given below. Explore the data graphically and obtain the correlation coefficient for each data set. ($r=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}}$)

The file Anscombe.csv contains this data.

library(tidyverse)
library(readr)

anscomb <- read_csv(file.path("..", "data", "Anscomb.csv"))
anscomb
# A tibble: 11 x 8
      x1    y1    x2    y2    x3    y3    x4    y4
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    10  8.04    10  9.14    10  7.46     8  6.58
 2     8  6.95     8  8.14     8  6.77     8  5.76
 3    13  7.58    13  8.74    13 12.7      8  7.71
 4     9  8.81     9  8.77     9  7.11     8  8.84
 5    11  8.33    11  9.26    11  7.81     8  8.47
 6    14  9.96    14  8.1     14  8.84     8  7.04
 7     6  7.24     6  6.13     6  6.08     8  5.25
 8     4  4.26     4  3.1      4  5.39    19 12.5 
 9    12 10.8     12  9.13    12  8.15     8  5.56
10     7  4.82     7  7.26     7  6.42     8  7.91
11     5  5.68     5  4.74     5  5.73     8  6.89

Now we want to see what each data set looks like.

p1 <- ggplot(anscomb, aes(x = x1, y = y1)) + geom_point()
p2 <- ggplot(anscomb, aes(x = x2, y = y2)) + geom_point()
p3 <- ggplot(anscomb, aes(x = x3, y = y3)) + geom_point()
p4 <- ggplot(anscomb, aes(x = x4, y = y4)) + geom_point()

ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

(Figure: scatter plots of the four Anscombe data sets arranged in a 2 x 2 grid)

Base R has the cor function to produce correlations and the cov function to produce covariances. The default is Pearson’s correlation.

cor(anscomb$x1, anscomb$y1)
[1] 0.8164205
cor(anscomb$x2, anscomb$y2)
[1] 0.8162365
cor(anscomb$x3, anscomb$y3)
[1] 0.8162867
cor(anscomb$x4, anscomb$y4)
[1] 0.8165214

To two decimal places, all four data sets have a correlation of 0.82, yet the plots do not all look linear. A high correlation on its own is not a sufficient reason to fit a straight line to the data. Even though Data Set 2 has a high positive correlation, it is obvious from the plot of y2 vs x2 that a smooth nonlinear curve describes the relationship between the two variables better. Looking at the plot of y3 versus x3, it is obvious that if it were not for one outlying point the relationship would be perfectly linear with a positive slope. In the last plot, if it were not for the point sitting on its own there would be no relationship between y4 and x4; such a point is called highly influential because if it were removed the fitted curve would be very different. Only the plot of y1 versus x1 could be considered to be approximately linear.
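
As a further check (a sketch, not part of the original output), fitting a simple linear regression to each pair gives essentially the same fitted line even though the plots differ greatly:

# Sketch: the four Anscombe data sets give almost identical regression coefficients.
coef(lm(y1 ~ x1, data = anscomb))
coef(lm(y2 ~ x2, data = anscomb))
coef(lm(y3 ~ x3, data = anscomb))
coef(lm(y4 ~ x4, data = anscomb))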

1.4 THE SIMPLE LINEAR REGRESSION MODEL

If the scatter plot of the response variable ($Y$) against the predictor variable ($X$) is approximately linear we use it to study the relationship between $Y$ and $X$. The response variable $Y$ is of primary importance.

Model: $y_i=\beta_0+\beta_1x_i+\varepsilon_i$ with $E(\varepsilon_i)=0$, $Var(\varepsilon_i)=\sigma^2$ and $Cov(\varepsilon_i,\varepsilon_j)=0$.

This model represents a linear relationship between the variables $x$ and $y$, with uncorrelated errors with constant variance. The errors are not mistakes but represent the variation in the response that is not accounted for by the explanatory variable. We can also assume that $\varepsilon_i \sim N(0,\sigma^2)$; this assumption is not needed for the least squares estimation but it is necessary for the standard confidence intervals and hypothesis tests. We fit the model by estimating values for the parameters $\beta_0$ (intercept), $\beta_1$ (slope) and $\sigma^2$.

Now

$E(Y \mid X=x)=\beta_0+\beta_1 x$

which means "The expected value of the random variable $Y$, given that $X$ is fixed at the value $x$, is equal to $\beta_0+\beta_1 x$".

1.4.1 Least squares estimation

The least squares estimates of $\beta_0$ and $\beta_1$ are those values $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimise the 'residual sum of squares' function

$S(\beta_0,\beta_1)=\sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i)^2$

The least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are $\hat{\beta}_0=\bar{y}-\hat{\beta}_1 \bar{x}$ and $\hat{\beta}_1=S_{XY}/S_{XX}$, where $S_{XX}=\sum (x_i-\bar{x})^2$, $S_{XY}=\sum (x_i-\bar{x})(y_i-\bar{y})$, and $S_{YY}=\sum (y_i-\bar{y})^2$.

We define the fitted values by $\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_i$, and the residuals by $e_i=y_i-\hat{y}_i$.

We estimate the variance $\sigma^2$ by

$\hat{\sigma}^2=\frac{SSE}{n-2}$

where $SSE=\sum e_i^2=\sum(y_i-\hat{y}_i)^2=S_{YY}-\hat{\beta}_1S_{XY}$.
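
A sketch of these formulas in R for the houses data (assuming the houses tibble and houses_fit from earlier); the manual estimates should agree with lm():

# Sketch: least squares estimates and sigma-hat^2 from the formulas above.
x <- houses$Block_Size
y <- houses$Price
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
beta1_hat <- Sxy / Sxx                       # slope
beta0_hat <- mean(y) - beta1_hat * mean(x)   # intercept
e <- y - (beta0_hat + beta1_hat * x)         # residuals
sigma2_hat <- sum(e^2) / (length(y) - 2)     # estimate of sigma^2
c(beta0_hat, beta1_hat, sigma2_hat)
coef(houses_fit)                             # should agree with the first two values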

1.4.2 Properties of the estimators

$E(\hat{\beta}_1)=\beta_1$,

$Var(\hat{\beta}_1)=\frac{\sigma^2}{S_{XX}}$ (1),

$E(\hat{\beta}_0)=\beta_0$,

$Var(\hat{\beta}_0)=\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{S_{XX}}\right)$ (2),

$E(\hat{y}_i)=\beta_0+\beta_1 x_i$,

$Var(\hat{y}_i)=\sigma^2(\frac{1}{n}+\frac{(x_i-\bar{x})^2}{S_{XX}}).$

The variance of the fitted value $\hat{y}_i$ consists of two parts:

The first reflects the fact that the fit is based on a line obtained from a sample of $n$ values.

The second depends on how far the x-value is from the average of the $x$’s, and reflects the fact that the regression line is most accurate near the average of the $x$’s, and least accurate at the extremes.

Replacing $\sigma^2$ in (2) and (1) by $\hat{\sigma}^2$ we get unbiased estimators of the variances of $\hat{\beta}_0$ and $\hat{\beta}_1$.
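
A sketch of formulas (1) and (2) in R, again for the houses data (assuming houses and houses_fit from earlier):

# Sketch: standard errors of the intercept and slope from (1) and (2).
x <- houses$Block_Size
n <- length(x)
Sxx <- sum((x - mean(x))^2)
sigma2_hat <- sum(residuals(houses_fit)^2) / (n - 2)
se_beta0 <- sqrt(sigma2_hat * (1 / n + mean(x)^2 / Sxx))
se_beta1 <- sqrt(sigma2_hat / Sxx)
c(se_beta0, se_beta1)
summary(houses_fit)$coefficients[, "Std. Error"]  # should match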

1.4.3 Tests of hypotheses

Hypothesis tests concerning the slope

Test $H_0:\beta_1=\beta_1^0$ against $H_1:\beta_1 \neq \beta_1^0$ using the test statistic $t_1=\frac{\hat{\beta}_1-\beta_1^0}{s.e.(\hat{\beta}_1)}$ which has a $t_{n-2}$ distribution if $H_0$ is true.

Rejection region: $|t_1| \geq t_{n-2;\alpha/2}$ where $\alpha=0.05$ if testing at the 5% significance level.

P-value approach: reject $H_0$ if the p-value is less than or equal to $\alpha$ (do not reject $H_0$ if the p-value is greater than $\alpha$).
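
A sketch of the test for the houses example (assuming houses_fit from earlier); this reproduces the t value and two-sided p-value that summary() reports for Block_Size:

# Sketch: t statistic and p-value for H0: beta1 = 0.
est <- summary(houses_fit)$coefficients["Block_Size", ]
t1 <- (est["Estimate"] - 0) / est["Std. Error"]
t1
2 * pt(-abs(t1), df = df.residual(houses_fit))  # two-sided p-value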

EXAMPLE: Regressing Price on Block Size. Done earlier, saved as houses_fit.

summary(houses_fit)

Call:
lm(formula = Price ~ Block_Size, data = houses)

Residuals:
    Min      1Q  Median      3Q     Max 
-143435  -53988   21187   58718  146833 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 616624.3    98136.8   6.283 2.82e-05 ***
Block_Size     169.0      112.6   1.501    0.157    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 90700 on 13 degrees of freedom
Multiple R-squared:  0.1476,	Adjusted R-squared:  0.08207 
F-statistic: 2.252 on 1 and 13 DF,  p-value: 0.1574

If the model assumptions and the normality assumption held (which they do not), we would conclude that, since the p-value associated with the coefficient of Block Size is 0.157, we do not reject the hypothesis $H_0:\beta_1=0$ in favour of $H_1:\beta_1 \neq 0$. In a report you would write something like "Insufficient evidence to support house prices being influenced by block sizes in Macquarie (p=0.157)". Different journals have different requirements for reporting statistical results. You need to make sure you report all the information required by the style you have to use; for example, APA style requires more detail (see https://apastyle.apa.org/).

A Test Using Correlation Coefficient

Test $H_0:\rho=0$ versus $H_1:\rho \neq 0$ using test statistic $t=\frac{r}{\sqrt{(1-r^2)/(n-2)}}$ which has a $t_{n-2}$ distribution when $H_0$ is true.

This is also equivalent to testing $H_0:\beta_1 = 0$ against $H_1:\beta_1 \neq 0$ in a simple linear regression of $Y$ on $X$.
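
A sketch in R (assuming the houses data): cor.test() carries out this test, and its p-value matches the p-value for the slope in the simple linear regression output above.

# Sketch: test of H0: rho = 0 for Price and Block_Size.
cor.test(houses$Price, houses$Block_Size)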

Hypothesis tests concerning the intercept

Test of $H_0:\beta_0=\beta_0^0$ against $H_1:\beta_0\neq \beta_0^0$ is based on $t_0=\frac{\hat{\beta}_0-\beta_0^0}{s.e.(\hat{\beta}_0)}$ which has a $t_{n-2}$ distribution if $H_0$ is true.

EXAMPLE:

Back to regressing Price on Block Size. The previous output had a p-value of 2.82e-05 (0.0000282) so if the model is adequate and normality assumption holds you would conclude that the intercept is significant (testing $H_0:\beta_0=0$ against $H_1:\beta_0\neq 0$).

Confidence Intervals and Prediction Intervals

Sometimes you don’t just want to quote a p-value but provide a confidence interval. This output is available from lm in R. For more details see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/confint.html

confint(houses_fit)
                   2.5 %      97.5 %
(Intercept) 404612.65155 828635.9089
Block_Size     -74.30034    412.2767

This gives a default 95% confidence interval for the intercept and the slope. You would write something like "the 95% confidence interval for Block Size is (-74.3, 412.3)"; since the confidence interval contains zero there is insufficient evidence that Block Size influences Price. Again, only report this information if the model assumptions hold (the residuals versus fitted values plot looks like a random scatter about zero and the normal probability plot looks linear).

To get prediction intervals you need to create a new data frame that sets the Block Size. The example below is only getting a default 95% prediction interval for a block of size 800 m2.

newdata <- data.frame(Block_Size = 800)
predict(houses_fit, newdata, interval = "predict")
       fit      lwr      upr
1 751814.8 549136.4 954493.3

For details on how confidence intervals and prediction intervals are calculated from the formulas, see Chatterjee, Hadi and Price (Regression Analysis by Example), which is available online from the library.

If the model assumptions hold then the predicted price for a block size of 800 m2 is $751,814.8 and the 95% prediction interval is ($549,136.4, $954,493.3).
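
If instead you want a confidence interval for the mean price of all blocks of size 800 m2 (rather than a prediction interval for a single new house), predict() can also do that. A sketch, assuming houses_fit and newdata from above:

# Sketch: 95% confidence interval for the mean response at Block_Size = 800.
predict(houses_fit, newdata, interval = "confidence")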

1.4.6 Analysis of variance (ANOVA)

ANOVA is a technique where the total variability in a set of data is split up into components assigned to various sources of variability:

The total variability is measured by $S_{YY}=\sum (y_i-\bar{y})^2$, also called SST or the 'total sum of squares'.

The SST can be split into SSR or ‘regression sums of squares’ and SSE or ‘residual sums of squares’ as follows:

SST = SSR + SSE

If SSR is large compared to SSE, the regression explains most of the variability in $Y$ and hence is worthwhile.

If SSR is small compared to SSE, the regression is not explaining much of the variability in $Y$.

The comparisons must take into account the degrees of freedom for each component, and so are made using mean squares (sums of squares divided by degrees of freedom), denoted by MS. The test for significance of the regression is carried out by comparing the ratio of the Regression Mean Square (MSR) to the Residual Mean Square (MSE) with the $F_{1,n-2}$ distribution. To carry out the test, we must assume that the errors $\varepsilon_i \sim N(0,\sigma^2)$. The calculations are displayed in an analysis of variance table:

Source       DF     SS     MS                 F
Regression   1      SSR    MSR = SSR/1        MSR/MSE
Residual     n-2    SSE    MSE = SSE/(n-2)
Total        n-1    SST

Formally we have $H_0:$ the regression is not significant, against $H_1:$ the regression is significant. Reject $H_0$ at the $\alpha$ level of significance if $F>F_{1, n-2; \alpha}$.
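
In R, the ANOVA table for a fitted model is produced by anova(), and the critical value can be obtained with qf(). A sketch for the houses example (assuming houses_fit from earlier):

# Sketch: ANOVA table and the 5% critical value F_{1, n-2; 0.05}.
anova(houses_fit)
qf(0.95, df1 = 1, df2 = df.residual(houses_fit))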

Assumptions must be checked before drawing statistical conclusions from the analyses. We will be looking at residual plots in more detail in the next section. Residual plots are one way of checking assumptions.

1.5 Revision of Matrix Algebra

When we extend the simple linear regression to the multiple linear regression the mathematics is simplified if we use matrix algebra.

DEFN: An $m\times n$ matrix has $m$ rows and $n$ columns: $A=\left(\begin{matrix}a_{11}&a_{12}&\cdots&a_{1n}\\a_{21}&a_{22}&\cdots&a_{2n}\\\vdots&\vdots&\ddots&\vdots\\a_{m1}&a_{m2}&\cdots&a_{mn}\end{matrix}\right)$.

If the entries are all real we can write $A\in R^{m\times n}$.

DEFN: The transpose of a matrix $A$, denoted by $A^T$ or $A'$, is obtained by interchanging the rows and columns of $A$: $A^T=\left(\begin{matrix}a_{11}&a_{21}&\cdots&a_{m1}\\a_{12}&a_{22}&\cdots&a_{m2}\\\vdots&\vdots&\ddots&\vdots\\a_{1n}&a_{2n}&\cdots&a_{mn}\end{matrix}\right)$.

DEFN: A square matrix has $m=n$; i.e., number of rows equals number of columns.

Special square matrices:

Diagonal matrix: all entries not on the main diagonal are zero. $D=\left(\begin{matrix}d_1&0&0&\cdots&0\\0&d_2&0&\cdots&0\\0&0&d_3&\ddots&\vdots\\\vdots&\vdots&\ddots&\ddots&0\\0&0&\cdots&0&d_n\end{matrix}\right)$, also written $D=\mathrm{diag}\{d_1,d_2,\cdots,d_n\}$.

Identity matrix: Is a diagonal matrix with all diagonal entries equal to 1. We write $I_n$ for the $n\times n$ identity matrix.

For simple $2 \times 2$ matrices I am going to illustrate matrix addition, subtraction and multiplication which just scales up for larger matrices. Note the entry by entry nature of the operations.

Matrix operations

Addition: $\left(\begin{matrix}a_{11}&a_{12}\\a_{21}&a_{22}\end{matrix}\right)+\left(\begin{matrix}b_{11}&b_{12}\\b_{21}&b_{22}\end{matrix}\right)=\left(\begin{matrix}a_{11}+b_{11}&a_{12}+b_{12}\\a_{21}+b_{21}&a_{22}+b_{22}\end{matrix}\right)$

Note that matrices must all have the same number of rows and columns for addition to be defined.

Subtraction: $\left(\begin{matrix}a_{11}&a_{12}\\a_{21}&a_{22}\end{matrix}\right)-\left(\begin{matrix}b_{11}&b_{12}\\b_{21}&b_{22}\end{matrix}\right)=\left(\begin{matrix}a_{11}-b_{11}&a_{12}-b_{12}\\a_{21}-b_{21}&a_{22}-b_{22}\end{matrix}\right)$

Note that matrices must all have the same number of rows and columns for subtraction to be defined.

Multiplication: $\left(\begin{matrix}a_{11}&a_{12}\\a_{21}&a_{22}\end{matrix}\right) \left(\begin{matrix}b_{11}&b_{12}\\b_{21}&b_{22}\end{matrix}\right) = \left(\begin{matrix} a_{11}b_{11}+a_{12}b_{21} & a_{11}b_{12}+a_{12}b_{22}\\ a_{21}b_{11}+a_{22}b_{21} & a_{21}b_{12}+a_{22}b_{22}\end{matrix}\right)$

Matrix multiplication is only defined if the number of columns of the first matrix equals the number of rows of the second matrix; i.e., for $A_{m\times n}\times B_{p\times q}$ to exist we must have $n=p$. The resulting matrix will have dimension $m\times q$.

Note that $\left(\begin{matrix}a_{11}&a_{12}\\a_{21}&a_{22}\end{matrix}\right)\times\left(\begin{matrix}1&0\\0&1\end{matrix}\right)=\left(\begin{matrix}a_{11}&a_{12}\\a_{21}&a_{22}\end{matrix}\right)$ and in general $A\times I=A$, hence the term identity matrix.

Matrix inversion

For a scalar $k$, the inverse of $k$ is $k^{-1}$ defined as $k\times k^{-1}=1$.

For a square matrix $A$, the inverse of $A$ is $A^{-1}$ where $A\times A^{-1}=I$.
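
These operations are all built into R; a small sketch with illustrative values:

# Sketch: basic matrix operations in R.
A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
B <- matrix(c(5, 6, 7, 8), nrow = 2, byrow = TRUE)
A + B           # addition
A - B           # subtraction
A %*% B         # matrix multiplication
t(A)            # transpose
diag(2)         # 2 x 2 identity matrix
solve(A)        # inverse of A
A %*% solve(A)  # equals the identity (up to rounding error)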

1.6 SIMPLE LINEAR REGRESSION IN MATRIX FORM

Model: $Y_i=\beta_0+\beta_1X_i+\varepsilon_i$, $i=1,\cdots,n$.

Each observation rewritten as:

$y_1=\beta_0+\beta_1x_1+\varepsilon_1$

$y_2=\beta_0+\beta_1x_2+\varepsilon_2$

$\vdots$

$y_n=\beta_0+\beta_1x_n+\varepsilon_n$

Compact form: $Y=X\beta+\varepsilon$ where $y=\left(\begin{matrix}y_1\\y_2\\\vdots\\y_n\end{matrix}\right)$, $X=\left(\begin{matrix}1&x_1\\1&x_2\\\vdots&\vdots\\1&x_n\end{matrix}\right)$, $\beta=\left(\begin{matrix}\beta_0\\\beta_1\end{matrix}\right)$, and $\varepsilon=\left(\begin{matrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{matrix}\right)$.

Least squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$:

$\hat{\beta}=\left(\begin{matrix}\hat{\beta}_0\\\hat{\beta}_1\end{matrix}\right)=(X^TX)^{-1}X^Ty$

Regression through the origin

Models with $\beta_0=0$ can arise in several ways. The most obvious is when $x=0$ must imply that $y=0$. Regression through the origin can be carried out using the same model $Y=X\beta+\varepsilon$ by defining the $X$ matrix without the first column of 1's.
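
A sketch of these formulas in R using the houses data from earlier: the design matrix can be obtained with model.matrix(), the matrix formula reproduces the lm() coefficients, and a model without an intercept (regression through the origin) is fitted by adding - 1 to the formula.

# Sketch: least squares via the matrix formula, and regression through the origin.
X <- model.matrix(~ Block_Size, data = houses)  # column of 1s plus Block_Size
y <- houses$Price
solve(t(X) %*% X) %*% t(X) %*% y                # (X'X)^{-1} X'y
coef(houses_fit)                                # should match
lm(Price ~ Block_Size - 1, data = houses)       # no-intercept model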

1.7 MULTIPLE LINEAR REGRESSION

1.7.1 EXAMPLE

Heat flux: As part of a test of solar thermal energy, the total heat flux from homes is measured. Researchers wish to examine whether total heat flux (HeatFlux) can be predicted by insulation (Insulation), by the position of the focal points in the east (East), south (South), and north (North) directions, and by the time of day (Time). The data are from the book by D.C. Montgomery and E.A. Peck, published by John Wiley and Sons in 1982, titled "Introduction to Linear Regression Analysis". You can find this data in the file heatflux.csv.

library(tidyverse)
library(ggpubr)

heat_flux <- read_csv(file.path("..", "Data", "heatflux.csv"))
Parsed with column specification:
cols(
  HeatFlux = col_double(),
  Insulation = col_double(),
  East = col_double(),
  South = col_double(),
  North = col_double(),
  Time = col_double()
)
heat_flux
# A tibble: 29 x 6
   HeatFlux Insulation  East South North  Time
      <dbl>      <dbl> <dbl> <dbl> <dbl> <dbl>
 1     272.       783.  33.5  40.6  16.7  13.2
 2     264        748.  36.5  36.2  16.5  14.1
 3     239.       684.  34.7  37.3  17.7  15.7
 4     231.       828.  33.1  32.5  17.5  10.5
 5     252.       860.  35.8  33.7  16.4  11  
 6     258.       875.  34.5  34.1  16.3  11.3
 7     264.       909.  34.6  34.8  16.1  12.0
 8     266.       906.  35.4  35.9  15.9  12.6
 9     229.       756   35.8  33.5  16.6  10.7
10     239.       769.  35.7  33.8  16.4  10.8
# ... with 19 more rows

Let's explore the data quickly with some graphics. The response variable is HeatFlux and we have five potential predictors: Insulation, East, South, North, and Time.

pairs(heat_flux)

(Figure: pairwise scatter plot matrix of the heat flux data)

Note that HeatFlux appears to be approximately linearly related to all predictors except Time. We will fit a multiple linear regression of HeatFlux ($Y$) on the predictors Insulation ($X_1$), East ($X_2$), South ($X_3$), North ($X_4$), and Time ($X_5$).

Suggested Model: $Y=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\beta_4X_4+\beta_5 X_5 + \varepsilon$

We will now fit this model in R and look at various outputs which I will intersperse with comments.

heatflux_fit <- lm(HeatFlux ~ Insulation + East + South + North + Time, data = heat_flux)

par(mfrow = c(2, 2))
plot(heatflux_fit)

(Figure: the four default lm diagnostic plots for heatflux_fit)

Looking at the Residuals vs Fitted values plot (ignore the loess smoother line), if it were not for a couple of points you would probably think this looks like a random scatter about zero, so the model is adequate and we can proceed to check the normality assumption. The Normal Q-Q plot looks approximately linear, so we might be prepared to say the normality assumption holds. We are now in a position to make inference. Recall that you cannot test any hypotheses or obtain confidence intervals if the model assumptions do not hold.

summary(heatflux_fit)

Call:
lm(formula = HeatFlux ~ Insulation + East + South + North + Time, 
    data = heat_flux)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.6848  -2.7688   0.6273   3.9166  17.3962 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 325.43612   96.12721   3.385  0.00255 ** 
Insulation    0.06753    0.02899   2.329  0.02900 *  
East          2.55198    1.24824   2.044  0.05252 .  
South         3.80019    1.46114   2.601  0.01598 *  
North       -22.94947    2.70360  -8.488 1.53e-08 ***
Time          2.41748    1.80829   1.337  0.19433    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.039 on 23 degrees of freedom
Multiple R-squared:  0.8988,	Adjusted R-squared:  0.8768 
F-statistic: 40.84 on 5 and 23 DF,  p-value: 1.077e-10

The p-value for the F statistic is very small (much smaller than 0.05), so the regression is significant, which means that at least one of the predictors is needed in the model. Together the five predictors explain 87.7% of the variability in the response HeatFlux.

1.7.2 Multiple linear regression: matrix form

Model: $Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon$

Matrix form: $y=X\beta+\varepsilon$, where $y=\left(\begin{matrix}y_1\\\vdots\\y_n\end{matrix}\right)$, $X=\left(\begin{matrix}1&x_{11}&x_{12}&\cdots&x_{1p}\\\vdots&\vdots&\vdots&&\vdots\\1&x_{n1}&x_{n2}&\cdots&x_{np}\end{matrix}\right)$, $\beta=\left(\begin{matrix}\beta_0\\\beta_1\\\vdots\\\beta_p\end{matrix}\right)$ and $\varepsilon=\left(\begin{matrix}\varepsilon_1\\\vdots\\\varepsilon_n\end{matrix}\right) \sim N(\mathbf{0},\sigma^2\mathbf{I})$.

Matrix $X$ is called the data matrix or model matrix or the design matrix. The data matrix ($X$) displays the predictor data, one column for each predictor variable, together with a column of 1s (if there is an intercept in the model), and it shows the experimental design if the levels of the predictor variables are planned. Each row in the matrix $X$ shows the information on all predictors for one case.

The least squares estimates of the regression coefficients $\beta$ are those values $\hat{\beta}$ that minimise the ‘residual sum of squares’ function $SSE(\beta)=(y-X\beta)^T(y-X\beta)$.

Normal equations: $(X^TX)\hat{\beta}=X^Ty$

Solution: $\hat{\beta}=(X^TX)^{-1}X^Ty$ (provided the inverse exists).

The $(p+1)\times(p+1)$ matrix $(X^TX)$ is a symmetric matrix whose elements consist of sums of squares and sums of cross products of the elements in the columns of $X$. The nature of $(X^TX)$ plays an important role in the properties of the estimators $\hat{\beta}$ and will often be a large factor in the success or failure of the least squares estimation procedure.

Fitted values: $\hat{y}=X\hat{\beta}$

Residuals: $e=y-\hat{y}$.

Residual sum of squares: $SSE=(y-X\hat{\beta})^T(y-X\hat{\beta})=y^Ty-{\hat{\beta}}^T(X^TX)\hat{\beta}$

Unbiased estimate of $\sigma^2$: ${\hat{\sigma}}^2=\frac{1}{n-p-1}(y-X\hat{\beta})^T(y-X\hat{\beta})=\frac{1}{n-p-1}\sum(y_i-\hat{y}_i)^2$

Properties of the least squares estimators

$E(\hat{\beta})=E\left[(X^TX)^{-1}X^Ty\right]=\beta$.

$Var(\hat{\beta})=\sigma^2(X^TX)^{-1}$.

If we assume that $\varepsilon$ has a normal distribution, then $ \hat{\beta}$ also has a normal distribution with mean $\beta$ and variance $\sigma^2(X^TX)^{-1}$.

Here the distribution of the residual mean square is given by $\frac{(n-p-1)\hat{\sigma}^2}{\sigma^2} \sim \chi_{n-p-1}^2$. The estimated variance of $\hat{\beta}$ is given by $\hat{\sigma}^2(X^TX)^{-1}$, and standard errors are obtained as $\hat{\sigma}$ multiplied by the square roots of the diagonal elements of the $(X^TX)^{-1}$ matrix.

Predictions and fitted values

If we observe a vector of predictors $x$ (including a 1) for which we don’t know the response, the predicted response is $\hat{y}=x^T\hat{\beta}$ with standard error of prediction $\mathrm{s.e.pred}=s\sqrt{1+x^T(X^TX)^{-1}x}$. Similarly, the estimated mean response when the predictors take the value $x$ is $\hat{y}=x^T\hat{\beta}$ with standard error of fit $\mathrm{s.e.}(\mathrm{fit})=s\sqrt{x^T(X^{T}X)^{-1}x}$.
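
A sketch in R (assuming heatflux_fit from above): predict() with se.fit = TRUE returns $x^T\hat{\beta}$ together with its standard error of fit for each observation.

# Sketch: fitted values and their standard errors of fit.
pred <- predict(heatflux_fit, se.fit = TRUE)
head(pred$fit)
head(pred$se.fit)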

2.1 HYPOTHESIS TESTS

We wish to test whether the overall regression is significant; i.e., are all the predictors taken together useful in the prediction of $Y$?

Should a particular predictor be in the regression model?

Should a set of predictors be added to the regression model? (This situation arises when a group of predictors belong together.)

In general, we have a null model (reduced model), in which some or all of the predictors are left out of the model (i.e. some or all of the $\beta_i$'s are hypothesised to be zero) and a full model which has all the predictors in the model (i.e. all $\beta_i$'s present). The general form of the test statistic is

$F=\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2}$

where $\hat{\sigma}^2$ is the usual estimate of $\sigma^2$ under the full model and $\hat{\sigma}_0^2$ is the estimate of $\sigma^2$ under the null model (reduced model). If the null model holds, $\hat{\sigma}_0^2$ estimates $\sigma^2$; if the null model does not hold, it estimates something greater than $\sigma^2$. Having more parameters in the model will always reduce the residual variation, so $\sigma^2\le\hat{\sigma}_0^2$. If the null model holds we would expect $F$ to be approximately equal to 1, and if the full model holds we would expect $F$ to be greater than 1. If $F$ is large the full model has reduced the residual variation substantially, so the additional predictors are useful in the prediction of $y$; if $F$ is close to 1 the additional predictors in the full model have not reduced the residual variation by much, so we would use the null model.

2.1.1 Test for overall significance of the regression

If we had $p$ predictors the full model would be

$Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\varepsilon$

If we are interested in testing whether the regression is significant we are interested in testing whether any of the predictors ($X_i$'s) are useful. The null model would be that none of the predictors are useful:

$Y=\beta_0+\varepsilon$

We test the hypothesis $H_0:\beta_1=\beta_2=\cdots=\beta_p=0$ against $H_1:$ at least one $\beta_i\neq0$. In matrix notation, $H_0:\beta^*=\mathbf{0}$, where $\beta^*=(\beta_1,\beta_2,\cdots,\beta_p)^T$.

Here the residual mean square is used to estimate $\sigma^2$ under the full model and the regression mean square is used to estimate $\sigma^2$ under the null model. The test statistic is

$F=\frac{MSR}{MSE}=\frac{SSR/p}{SSE/(n-p-1)}$

which under the null hypothesis has an $F_{p,n-p-1}$ distribution. We reject $H_0$ in favour of $H_1$ if $F$ lies in the upper tail of this distribution.

Computations can be summarised in the ANOVA table:

Source       DF       SS     MS                   F
Regression   p        SSR    MSR = SSR/p          MSR/MSE
Residual     n-p-1    SSE    MSE = SSE/(n-p-1)
Total        n-1      SST

where $SST$, $SSR$ and $SSE$ are defined as before. We have exactly the same decomposition as before that

SST = SSR + SSE

Now, $R^2=\frac{SSR}{SST}$ is the proportion of the variance explained by the regression.

We can show that

$F=\frac{R^2/p}{(1-R^2)/(n-p-1)}$

A small value of $R^2$ will result in a small value of $F$ and we will not reject $H_0$ in favour of $H_1$. A value of $R^2$ close to 1 will result in a large value of $F$ and we will reject $H_0$ in favour of $H_1$.

We already have seen an example of testing whether the regression is significant.

2.1.2 Selecting the “best” model

There is no unique criterion for choosing the “best” model. We want as simple a model as possible that adequately explains what is going on (“principle of parsimony”). The more parameters in the model, the closer the fitted values will be to the observed data and the higher $R^2$ will be but the standard errors of the estimates ${\hat{\beta}}_i$ will increase because we are estimating more parameters on the same amount of information. We trade off between

  1. Few $X$’s (small $p$): lower $R^2$ but more precise $\beta_i$’s and
  2. Many $X$’s (large $p$): higher $R^2$ but less precise $\beta_i$’s.

We try to find the set of predictors which gives an acceptable model fit, or $R^2$. If a predictor does not add to the model's explanation of the variation of $Y$ in a significant way, it is not added to the model, even though it would have increased $R^2$.

Comparing two models

The reduced model is the model with the smallest number of parameters. We want to test $H_0:$ the reduced model is appropriate against $H_1:$ the full model is appropriate. To formally test this we need to fit the reduced model and record from the output the Residual (Error) Sum of Squares and its associated degrees of freedom, which we denote by $SSE_{RM}$ and $DF_{RM}$ respectively. The same information is required from fitting the full model; label the residual sum of squares and its associated degrees of freedom by $SSE_{FM}$ and $DF_{FM}$. The appropriate test statistic is

$T=\frac{(SSE_{RM}-SSE_{FM})/(DF_{RM}-DF_{FM})}{SSE_{FM}/DF_{FM}}$

We reject $H_0$ in favour of $H_1$ at the $100\alpha\%$ significance level if $T>F_{DF_{RM}-DF_{FM},DF_{FM};\alpha}$. This is only valid if the model assumptions hold. We will look at regression diagnostics in detail in the next section.
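
In R this test is carried out directly by anova() applied to two nested lm fits. A sketch with the heat flux data (the reduced model here is only an illustration):

# Sketch: generalised F test comparing a reduced model with the full model.
reduced <- lm(HeatFlux ~ Insulation + North, data = heat_flux)  # illustrative reduced model
full <- lm(HeatFlux ~ Insulation + East + South + North + Time, data = heat_flux)
anova(reduced, full)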

Partial F-tests

Assume that we have $n$ observations. The full model has all $p$ predictors in it. The reduced model has one predictor removed, say the i-th predictor. We use the same test statistic as above and reject $H_0:\beta_i=0$ in favour of $H_1:\beta_i\neq0$ at the $100\alpha\%$ significance level if $T>F_{1,n-p-1;\alpha}$. You can think of the partial F tests as assessing each variable as if it were the last one added to the model.

An equivalent way of testing $H_0:\beta_i=0$ versus $H_1:\beta_i\neq 0$ is using $t=\frac{\hat{\beta}_i}{s.e.(\hat{\beta}_i)}$ which has a $t_{n-p-1}$ distribution if $H_0$ is true. The R function lm gives you the partial F-tests in this way. The disadvantage of partial F-tests is that they are not independent tests.

EXAMPLE: Heat Flux revisited

summary(heatflux_fit)

Call:
lm(formula = HeatFlux ~ Insulation + East + South + North + Time, 
    data = heat_flux)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.6848  -2.7688   0.6273   3.9166  17.3962 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 325.43612   96.12721   3.385  0.00255 ** 
Insulation    0.06753    0.02899   2.329  0.02900 *  
East          2.55198    1.24824   2.044  0.05252 .  
South         3.80019    1.46114   2.601  0.01598 *  
North       -22.94947    2.70360  -8.488 1.53e-08 ***
Time          2.41748    1.80829   1.337  0.19433    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.039 on 23 degrees of freedom
Multiple R-squared:  0.8988,	Adjusted R-squared:  0.8768 
F-statistic: 40.84 on 5 and 23 DF,  p-value: 1.077e-10

The partial F-test p-values are given in the Pr(>|t|) column. You only look at these if the model assumptions hold. Suppose they do; then you would conclude at the 5% significance level that the predictors Insulation, South and North are significant.

Sequential F tests

Assume that we have $n$ observations. Variables are added to the model in a particular order and at each step the most recent predictor entered into the model is tested for significance.

The R functions lm and anova can provide the output. All you need to do is divide each sequential sum of squares by the residual mean square for the full model and compare with $F_{1,n-p-1}$. The advantage of sequential $F$ tests is that they are independent tests. The disadvantage is that the tests may be highly dependent on the order in which the predictors enter the model.

fit5 <- lm(HeatFlux ~ Insulation + East + South + North + Time, data = heat_flux)
fit4 <- lm(HeatFlux ~ Insulation + East + South + North, data = heat_flux)
fit3 <- lm(HeatFlux ~ Insulation + East + South, data = heat_flux)
fit2 <- lm(HeatFlux ~ Insulation + East, data = heat_flux)
fit1 <- lm(HeatFlux ~ Insulation, data = heat_flux)
fit0 <- lm(HeatFlux ~ 1, data = heat_flux)

anova(fit0, fit1, fit2, fit3, fit4, fit5, test = "F")
Analysis of Variance Table

Model 1: HeatFlux ~ 1
Model 2: HeatFlux ~ Insulation
Model 3: HeatFlux ~ Insulation + East
Model 4: HeatFlux ~ Insulation + East + South
Model 5: HeatFlux ~ Insulation + East + South + North
Model 6: HeatFlux ~ Insulation + East + South + North + Time
  Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
1     28 14681.9                                   
2     27  8898.1  1    5783.8 89.4964 2.155e-09 ***
3     26  8086.4  1     811.7 12.5602 0.0017316 ** 
4     25  6904.9  1    1181.5 18.2822 0.0002832 ***
5     24  1601.9  1    5303.0 82.0574 4.772e-09 ***
6     23  1486.4  1     115.5  1.7873 0.1943335    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insulation, East, South and North are significant (at the 1% or even the 0.5% significance level) according to the sequential F test; Time is not.

EXERCISE Try different orderings of the predictors. There are 5 ways you can choose the first predictor, 4 ways for the second predictor, 3 ways of choosing the third predictor, 2 ways of choosing the fourth predictor, and only one way to choose the fifth predictor giving $5 \times 4 \times 3 \times 2 \times 1 = 120$ combinations.
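
For instance, here is a sketch of one alternative ordering (Time entered first); compare its sequential sums of squares with the table above:

# Sketch: sequential (Type I) ANOVA for a different ordering of the predictors.
anova(lm(HeatFlux ~ Time + North + South + East + Insulation, data = heat_flux))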

Using R Square ($R^2$)

When a linear regression model is fitted using function lm in R you can extract $R^2$.

summary(heatflux_fit)$r.squared
[1] 0.8987602

R Square gives the "proportion of variation in the response data that is explained by the model". An acceptable value for $R^2$ depends on the context of the model fitting. R Square ($R^2$) is a dangerous criterion for model comparisons: adding any term to the model will automatically increase it.

Note that Multiple R-squared given earlier is the same as R-squared in the lm function.

Using Adjusted R Square ($R_{ADJ}^2$)

When a linear regression model is fitted using lm in R you can always obtain $R_{ADJ}^2$ (Adjusted R-Squared). This is $R_{ADJ}^2=1-\frac{n-1}{n-p-1}(1-R^2)$, which does not necessarily increase if more terms are added to the model. The model with the largest $R_{ADJ}^2$ is usually chosen.

EXAMPLE For the Heat Flux data using all the predictors the Adjusted R-squared is 0.8768. See earlier output.
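
As with $R^2$, the adjusted R-squared can be extracted directly from the model summary; a sketch:

summary(heatflux_fit)$adj.r.squared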

Using $\hat{\sigma}^2$, the residual mean square (MSE)

We choose the model with the smallest $\hat{\sigma}^2$, or, if a model with fewer terms has a $\hat{\sigma}^2$ close to the smallest, we would choose that simpler model.

Mallows $C_p$ (A Criterion Based Method)

Predicted values obtained from a regression equation based on a subset of variables are generally biased. We use the mean squared error of the predicted values to judge the performance of an equation. The standardized total mean squared error of prediction for the observed data is measured by

$J_p=\frac{1}{\sigma^2}\sum_{i=1}^{n}MSE(\hat{y}_i)$

where MSE($\hat{y}_i$) is the mean squared error of the ith predicted value from an equation with $p$ terms (the number of parameters in the equation) and $\sigma^2$ is the variance of the random errors. This statistic places special emphasis on the observed data, and good subsets will result in small values.

We can estimate the value of this statistic from the data by Mallows' $C_p$, calculated from

$C_p=\frac{SSE_p}{\hat{\sigma}^2}+2p-n$

where $SSE_p$ is the residual sum of squares from the subset model with $p$ terms and $\hat{\sigma}^2$ is from the linear model with the full set of $q$ variables.

Mallows’ $C_p$ has these properties:

  1. it is easily calculated from usual regression summaries,
  2. it measures the difference in fitting errors between the full and the subset models,
  3. it has a random and a fixed component giving a trade off between better fit and more parameters, and
  4. it can be used to compare subset models; although it is not necessarily true that a smaller value means a better subset model, any model with $C_p \le p$ will be a good model.

We can use the leaps package to obtain Mallows' $C_p$. Note that when $p$ equals the number of parameters in the full model, Mallows' $C_p$ always equals $p$.

library(leaps)

x <- heat_flux[, -1]
y <- heat_flux$HeatFlux

leaps(x, y)
$which
      1     2     3     4     5
1 FALSE FALSE FALSE  TRUE FALSE
1  TRUE FALSE FALSE FALSE FALSE
1 FALSE FALSE FALSE FALSE  TRUE
1 FALSE FALSE  TRUE FALSE FALSE
1 FALSE  TRUE FALSE FALSE FALSE
2 FALSE FALSE  TRUE  TRUE FALSE
2 FALSE FALSE FALSE  TRUE  TRUE
2  TRUE FALSE FALSE  TRUE FALSE
2 FALSE  TRUE FALSE  TRUE FALSE
2  TRUE  TRUE FALSE FALSE FALSE
2  TRUE FALSE  TRUE FALSE FALSE
2  TRUE FALSE FALSE FALSE  TRUE
2 FALSE FALSE  TRUE FALSE  TRUE
2 FALSE  TRUE FALSE FALSE  TRUE
2 FALSE  TRUE  TRUE FALSE FALSE
3 FALSE  TRUE  TRUE  TRUE FALSE
3  TRUE FALSE  TRUE  TRUE FALSE
3 FALSE FALSE  TRUE  TRUE  TRUE
3  TRUE FALSE FALSE  TRUE  TRUE
3 FALSE  TRUE FALSE  TRUE  TRUE
3  TRUE  TRUE FALSE  TRUE FALSE
3  TRUE  TRUE  TRUE FALSE FALSE
3  TRUE FALSE  TRUE FALSE  TRUE
3 FALSE  TRUE  TRUE FALSE  TRUE
3  TRUE  TRUE FALSE FALSE  TRUE
4  TRUE  TRUE  TRUE  TRUE FALSE
4  TRUE FALSE  TRUE  TRUE  TRUE
4 FALSE  TRUE  TRUE  TRUE  TRUE
4  TRUE  TRUE FALSE  TRUE  TRUE
4  TRUE  TRUE  TRUE FALSE  TRUE
5  TRUE  TRUE  TRUE  TRUE  TRUE

$label
[1] "(Intercept)" "1"           "2"           "3"           "4"          
[6] "5"          

$size
 [1] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 6

$Cp
 [1]  38.492277 112.687090 174.174842 199.329010 199.803490   9.097518
 [7]  17.846129  37.462451  40.490491 102.126871 107.320539 114.607262
[13] 120.046927 174.732260 196.395794   7.596312   9.717698   9.950942
[19]  10.149946  19.842738  38.941581  85.844637 100.383642 101.484104
[25] 102.842597   5.787266   8.179844   9.424426  10.764344  76.054147
[31]   6.000000

According to Mallows' $C_p$, none of the subset models are any good.

Multicollinearity

When predictors are related to each other, regression modelling can be very confusing. Estimated effects can change in magnitude and even sign. Two predictors are collinear if $c_1x_1+c_2x_2=c_0$ for constants $c_1,c_2,c_0$. The definition of collinearity extends to several predictors. More important, however, is approximate collinearity, when predictors are closely but not exactly related and the equation holds approximately. In some packages (e.g. Minitab) a message is given if you perform a regression where there is too much collinearity. If the collinearity is extreme, a variable will be dropped before the calculations are carried out. If the collinearity is exact, this causes no loss of information, and if the collinearity is approximate, only a small amount of information is lost.

Alternatively, we can ask for variance inflation factors (VIFs) associated with each predictor. These come from the formula for the variance of the parameter estimates $Var(\hat{\beta}_i)=\sigma^2\left(\frac{1}{1-R_i^2}\right)\left(\frac{1}{S_{X_iX_i}}\right)$, with $R_i^2$ the square of the multiple correlation between $X_i$ and the other $X$'s, and $1/(1-R_i^2)$ being the amount by which the variance is increased, or inflated, due to collinearity. If none of the VIFs are greater than 10 then collinearity is not a problem.

EXAMPLE Revisiting Heat Flux example.

library(car)

heatflux_fit <- lm(HeatFlux ~ Insulation + East + South + North + Time, data = heat_flux)
vif(heatflux_fit)
Insulation       East      South      North       Time 
  2.319035   1.355448   3.175970   2.612066   5.370059 

All VIFs are less than 10 so there is no multicollinearity problem.
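
To see where a VIF comes from, here is a sketch computing the VIF for Time by hand: regress Time on the other predictors and take $1/(1-R_i^2)$; this should match the value reported by vif() above.

# Sketch: VIF for Time from its multiple correlation with the other predictors.
r2_time <- summary(lm(Time ~ Insulation + East + South + North, data = heat_flux))$r.squared
1 / (1 - r2_time)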

Multiple correlation coefficient

The multiple correlation coefficient $R_{Y|X_1,\cdots,X_p}$ is a measure of the association between $Y$ and $X_1,\ldots,X_p$ jointly. Its square is what we have called $R^2$. We can interpret $R_{Y|X_1,\cdots,X_p}$ as the correlation between $Y$ and the regression equation involving $X_1,\ldots,X_p$. It is defined as $R_{Y|X_1,\cdots,X_p}=\frac{\sum(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum(y_i-\bar{y})^2 \sum (\hat{y}_i-\bar{\hat{y}})^2}}$ where $\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_{i1}+\hat{\beta}_2x_{i2}+\cdots+\hat{\beta}_px_{ip}$, $i=1,\ldots,n$, is the fitted value or predicted value for $y_i$ and $\bar{\hat{y}}=\frac{1}{n}\sum \hat{y}_i$ is the mean of the fitted values. (It turns out that $\bar{\hat{y}}=\bar{y}$ always.)

Also, $R_{Y|X_1,\cdots,X_p}^2=\frac{SST-SSE}{SST}=\frac{SSR}{SST}$ and it is interpreted as the proportion of total variation (SST) that is explained by the regression involving $X_1,\ldots,X_p$ (SSR).
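
A sketch for the heat flux example: the multiple correlation coefficient is the correlation between the observed response and the fitted values, and its square is $R^2$.

# Sketch: multiple correlation coefficient and R-squared.
R <- cor(heat_flux$HeatFlux, fitted(heatflux_fit))
c(R, R^2)
summary(heatflux_fit)$r.squared  # should equal R^2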

Variable selection

To combat collinearity or to find a parsimonious model we may wish to select only some of the predictor variables available. Assume that we have $n$ cases with values observed on $k$ predictor variables $X_1,X_2,\cdots,X_k$ and a response $Y$. Let $p$ = the number of predictors in a selected subset (including the intercept) and assume all necessary transformations have been carried out.

The full model can be written as $Y=X_1\beta_1+X_2\beta_2+\varepsilon$ (vector notation, with $X_1$ an $n \times p$ matrix) and the subset model $Y=X_1\beta_1+\varepsilon$ is obtained by putting $\beta_2=0$. The subset model is tested against the full model using a generalised F-test, as usual.

Selecting variables on substantive grounds

The most important method for selecting variables is based on a knowledge of the situation and the variables being studied.

EXAMPLE: The variables HT (height) and WT (weight) can be combined into BMI (body mass index = weight per height squared).

EXAMPLE: A number of test scores can be combined into an average.

EXAMPLE: Weisberg (1981) gives an example, the Highway data, from a Master's thesis modelling the automobile accident rate in terms of 13 independent variables. Here there are $2^{13} = 8192$ possible subset models, which can be reduced by considering two points:

  1. Three variables, FAI, PA and MA, should be kept together since they are indicator variables for different types of highways, with a fourth type indicated if each of these variables equals zero and
  2. The variable LEN, length of segment under study, is required by definition. This now gives a total of 512 possible subset models, still a large number but much fewer than originally.

Stepwise methods of variable selection

Often we have to use the data to find a reasonable subset of the predictors for use in a model. A stepwise procedure is a systematic way of examining only a few subsets of each size and choosing a path through all possible models.

Forward selection

We begin with a simple regression model with the best single predictor (largest $R^2$, $F$ or $t$). We then add the predictor that increases $R^2$ the most, or that would have the largest $F$ or $t$ of any of the other variables. We continue in this way, but stop when we reach a subset of predetermined size, when no other variable has an $F$ greater than the $F$-to-enter, or if any variable would cause unacceptable collinearity.

Backward elimination

We begin with the full model. We then remove the variable with the smallest $F$ or $t$, i.e. the variable that would reduce $R^2$ the least. We continue in this way, but stop when we reach a subset of predetermined size, or when all variables remaining in the model have $F$ greater than the $F$-to-remove.

Stepwise

This is a combination of the previous two methods. We start with one variable, as in forward selection, and at each step take one of four options: add a variable, remove a variable, exchange two variables, or stop.

  1. If there are at least two variables in the model, remove a variable if it has $F$ less than the $F$-to-remove.
  2. If there are at least two variables in the model, remove a variable if this would result in a larger $R^2$ than obtained previously with that many variables.
  3. Exchange a variable in the model with one not in the model if this would increase $R^2$.
  4. Add a variable if it has the largest $F$ greater than the $F$-to-enter and the collinearity tolerance is acceptable.
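
Base R has the step() function, which performs stepwise selection using AIC rather than F-to-enter/F-to-remove; a sketch, starting from the full heat flux model fitted earlier:

# Sketch: AIC-based stepwise selection (not identical to the F-based rules above).
step(heatflux_fit, direction = "both", trace = FALSE)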

Remarks

EXAMPLE: Squid data (This data set is from Classical and Modern Regression with Applications by R.H. Myers published in 1990.) An experiment was conducted to study the size of squid eaten by sharks and tuna.

The regressors are characteristics of the beak or mouth of the squid. They are X1 = Beak length in inches, X2 = Wing length in inches, X3 = Beak to notch length, X4 = Notch to wing length, and X5 = Width in inches. The response (Y) is the weight of the squid in pounds. These data are given below.

We first fit the model $Y=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\beta_4X_4+\beta_5X_5+\varepsilon$ to the above data.

Before doing this we should first obtain a matrix scatter plot of the data.

library(tidyverse)
library(ggpubr)

squid <- read_csv(file.path("..", "Data", "Squid.csv"))
squid
# A tibble: 22 x 6
       Y    X1    X2    X3    X4    X5
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  1.95  1.31  1.07  0.44 0.75  0.35 
 2  2.9   1.55  1.49  0.53 0.9   0.47 
 3  0.72  0.99  0.84  0.34 0.570 0.32 
 4  0.81  0.99  0.83  0.34 0.54  0.27 
 5  1.09  1.05  0.9   0.36 0.64  0.3  
 6  1.22  1.09  0.93  0.42 0.61  0.31 
 7  1.02  1.08  0.9   0.4  0.51  0.31 
 8  1.93  1.27  1.08  0.44 0.77  0.34 
 9  0.64  0.99  0.85  0.36 0.56  0.290
10  2.08  1.34  1.13  0.45 0.77  0.37 
# ... with 12 more rows
pairs(squid)

(Figure: pairwise scatter plot matrix of the squid data)

The relationship between $Y$ and each predictor separately is approximately linear with a positive slope. More worrying are the obvious correlations between most of the predictors; there may be multicollinearity.

Let's see what happens when we fit the multiple linear regression.

squid_fit <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = squid)

par(mfrow = c(2, 2))
plot(squid_fit)

(Figure: the four default lm diagnostic plots for squid_fit)

The model is not adequate, the Residual versus Fitted values plot does not look like a random scatter about zero. We cannot proceed to make valid inference.

For the moment, just to illustrate what to do if the model were adequate and the normality assumption held, look at the model output.

summary(squid_fit)

Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5, data = squid)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2610 -0.5373  0.1355  0.5120  0.8611 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -6.5122     0.9336  -6.976 3.13e-06 ***
X1            1.9994     2.5733   0.777  0.44851    
X2           -3.6751     2.7737  -1.325  0.20378    
X3            2.5245     6.3475   0.398  0.69610    
X4            5.1581     3.6603   1.409  0.17791    
X5           14.4012     4.8560   2.966  0.00911 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7035 on 16 degrees of freedom
Multiple R-squared:  0.9633,	Adjusted R-squared:  0.9519 
F-statistic: 84.07 on 5 and 16 DF,  p-value: 6.575e-11

Partial F tests suggest that the “best” model contains only one predictor variable and that is X5. (Since the model assumptions are violated the above inference is dangerous.)

We could also look at the “best” model suggested by sequential F tests. The order in which the parameters are listed can affect the conclusions reached significantly. A full-scale model-building process cannot be based on sequential F tests unless there is an appropriate selection of order of variables based on subject matter expertise.

fit5 <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = squid)
fit4 <- lm(Y ~ X1 + X2 + X3 + X4, data = squid)
fit3 <- lm(Y ~ X1 + X2 + X3, data = squid)
fit2 <- lm(Y ~ X1 + X2, data = squid)
fit1 <- lm(Y ~ X1, data = squid)
fit0 <- lm(Y ~ 1, data = squid)

anova(fit0, fit1, fit2, fit3, fit4, fit5, test = "F")
Analysis of Variance Table

Model 1: Y ~ 1
Model 2: Y ~ X1
Model 3: Y ~ X1 + X2
Model 4: Y ~ X1 + X2 + X3
Model 5: Y ~ X1 + X2 + X3 + X4
Model 6: Y ~ X1 + X2 + X3 + X4 + X5
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1     21 215.925                                    
2     20  16.779  1   199.145 402.4397 9.131e-13 ***
3     19  16.653  1     0.127   0.2560  0.619804    
4     18  12.533  1     4.120   8.3249  0.010765 *  
5     17  12.270  1     0.263   0.5325  0.476114    
6     16   7.918  1     4.352   8.7951  0.009109 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model suggested by the sequential F tests is $Y=\beta_0+\beta_1X_1+\beta_3X_3+\beta_5X_5+\varepsilon$.

fit135 <- lm(Y ~ X1 + X3 + X5, data = squid)

summary(fit135)

Call:
lm(formula = Y ~ X1 + X3 + X5, data = squid)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6386 -0.5084  0.1060  0.5471  0.9666 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.83999    0.87718  -7.798 3.52e-07 ***
X1           3.26593    1.60373   2.036   0.0567 .  
X3           0.07094    5.48254   0.013   0.9898    
X5          13.35925    4.72866   2.825   0.0112 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7142 on 18 degrees of freedom
Multiple R-squared:  0.9575,	Adjusted R-squared:  0.9504 
F-statistic: 135.1 on 3 and 18 DF,  p-value: 1.572e-12

The appropriate test statistic for testing $H_0: Y=\beta_0+\beta_1X_1+\beta_3X_3+\beta_5X_5+\varepsilon$ against $H_1: Y=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\beta_4X_4+\beta_5X_5+\varepsilon$ is $T=\frac{\left(SSE_{RM}-SSE_{FM}\right)/(DF_{RM}-DF_{FM})}{SSE_{FM}/DF_{FM}}$, where RM denotes the reduced model (under $H_0$) and FM the full model (under $H_1$).

We reject $H_0$ in favour of $H_1$ at the 100$\alpha\%$ significance level if $T>F_{DF_{RM}-DF_{FM},DF_{FM};\alpha}$.

anova(fit5)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq  F value    Pr(>F)    
X1         1 199.145 199.145 402.4397 9.131e-13 ***
X2         1   0.127   0.127   0.2560  0.619804    
X3         1   4.120   4.120   8.3249  0.010765 *  
X4         1   0.263   0.263   0.5325  0.476114    
X5         1   4.352   4.352   8.7951  0.009109 ** 
Residuals 16   7.918   0.495                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit135)
Analysis of Variance Table

Response: Y
          Df  Sum Sq Mean Sq  F value    Pr(>F)    
X1         1 199.145 199.145 390.3744 1.188e-13 ***
X3         1   3.525   3.525   6.9103   0.01704 *  
X5         1   4.072   4.072   7.9816   0.01121 *  
Residuals 18   9.183   0.510                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, from the output above, $SSE_{FM}=7.918$, $DF_{FM}=16$, $SSE_{RM}=9.183$ and $DF_{RM}=18$, so our test statistic is $T=\frac{(9.183-7.918)/(18-16)}{7.918/16}\approx1.28$.

We need to compare this with $F_{DF_{RM}-DF_{FM},\,DF_{FM};0.05}=F_{2,16;0.05}$.

qf(0.95, df1 = 2, df2 = 16)
[1] 3.633723

Since $T=1.28 \ngeq 3.6337$ we cannot reject $H_0$ in favour of $H_1$ at the $5\%$ significance level. The reduced model here is plausible.
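The same partial F test can be carried out in a single call by passing the reduced and full models to anova(); a minimal sketch:

# Reduced model first, then the full model; the F statistic and p-value
# correspond to the hand calculation above.
anova(fit135, fit5)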

Remember that because the model assumptions are not met here, the above inference is dangerous.

Forward stepwise regression in R

Illustrated on the squid data set.

model <- lm(Y ~ ., data = squid)
model$call$data <- squid # Should not be needed when running by hand

ols_step_forward_p(model)
Forward Selection Method    
---------------------------

Candidate Terms: 

1. X1 
2. X2 
3. X3 
4. X4 
5. X5 

We are selecting variables based on p value...

Variables Entered: 

- X5 
- X4 
- X2 

No more variables to be added.

Final Model Output 
------------------

                        Model Summary                          
--------------------------------------------------------------
R                       0.981       RMSE                0.678 
R-Squared               0.962       Coef. Var          16.169 
Adj. R-Squared          0.955       MSE                 0.460 
Pred R-Squared          0.930       MAE                 0.524 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares        DF    Mean Square       F         Sig. 
--------------------------------------------------------------------
Regression    207.643         3         69.214     150.44    0.0000 
Residual        8.281        18          0.460                      
Total         215.925        21                                     
--------------------------------------------------------------------

                                  Parameter Estimates                                    
----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
----------------------------------------------------------------------------------------
(Intercept)    -5.980         0.659                 -9.073    0.000    -7.365    -4.596 
         X5    17.109         3.082        0.798     5.551    0.000    10.634    23.584 
         X4     6.879         2.783        0.489     2.472    0.024     1.033    12.726 
         X2    -2.890         2.346       -0.292    -1.232    0.234    -7.820     2.040 
----------------------------------------------------------------------------------------

                           Selection Summary                            
-----------------------------------------------------------------------
        Variable                  Adj.                                     
Step    Entered     R-Square    R-Square     C(p)       AIC       RMSE     
-----------------------------------------------------------------------
   1    X5            0.9455      0.9428    5.7794    54.6672    0.7670    
   2    X4            0.9584      0.9540    2.1456    50.7185    0.6875    
   3    X2            0.9616      0.9553    2.7354    50.9387    0.6783    
-----------------------------------------------------------------------

Final model has $X_5$, $X_4$ and $X_2$.

Backward elimination in R
modelb <- lm(Y ~ ., data = squid)
modelb$call$data <- squid # Should not be needed when running by hand

ols_step_backward_p(modelb)
Backward Elimination Method 
---------------------------

Candidate Terms: 

1 . X1 
2 . X2 
3 . X3 
4 . X4 
5 . X5 

We are eliminating variables based on p value...

Variables Removed: 

- X3 
- X1 

No more variables satisfy the condition of p value = 0.3


Final Model Output 
------------------

                        Model Summary                          
--------------------------------------------------------------
R                       0.981       RMSE                0.678 
R-Squared               0.962       Coef. Var          16.169 
Adj. R-Squared          0.955       MSE                 0.460 
Pred R-Squared          0.930       MAE                 0.524 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares        DF    Mean Square       F         Sig. 
--------------------------------------------------------------------
Regression    207.643         3         69.214     150.44    0.0000 
Residual        8.281        18          0.460                      
Total         215.925        21                                     
--------------------------------------------------------------------

                                  Parameter Estimates                                    
----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
----------------------------------------------------------------------------------------
(Intercept)    -5.980         0.659                 -9.073    0.000    -7.365    -4.596 
         X2    -2.890         2.346       -0.292    -1.232    0.234    -7.820     2.040 
         X4     6.879         2.783        0.489     2.472    0.024     1.033    12.726 
         X5    17.109         3.082        0.798     5.551    0.000    10.634    23.584 
----------------------------------------------------------------------------------------

                          Elimination Summary                           
-----------------------------------------------------------------------
        Variable                  Adj.                                     
Step    Removed     R-Square    R-Square     C(p)       AIC       RMSE     
-----------------------------------------------------------------------
   1    X3             0.963      0.9543    4.1582    52.1665    0.6858    
   2    X1            0.9616      0.9553    2.7354    50.9387    0.6783    
-----------------------------------------------------------------------

Final model has $X_2$, $X_4$ and $X_5$.

Stepwise regression in R
models <- lm(Y ~ ., data = squid)
models$call$data <- squid # Should not be needed when running by hand

ols_step_both_p(models)
Stepwise Selection Method   
---------------------------

Candidate Terms: 

1. X1 
2. X2 
3. X3 
4. X4 
5. X5 

We are selecting variables based on p value...

Variables Entered/Removed: 

- X5 added 
- X4 added 

No more variables to be added/removed.


Final Model Output 
------------------

                        Model Summary                          
--------------------------------------------------------------
R                       0.979       RMSE                0.687 
R-Squared               0.958       Coef. Var          16.387 
Adj. R-Squared          0.954       MSE                 0.473 
Pred R-Squared          0.945       MAE                 0.529 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares        DF    Mean Square       F         Sig. 
--------------------------------------------------------------------
Regression    206.945         2        103.473    218.947    0.0000 
Residual        8.979        19          0.473                      
Total         215.925        21                                     
--------------------------------------------------------------------

                                   Parameter Estimates                                    
-----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta       t        Sig      lower     upper 
-----------------------------------------------------------------------------------------
(Intercept)    -6.335         0.601                 -10.543    0.000    -7.593    -5.077 
         X5    15.016         2.606        0.700      5.763    0.000     9.562    20.470 
         X4     4.154         1.710        0.295      2.429    0.025     0.574     7.734 
-----------------------------------------------------------------------------------------

                            Stepwise Selection Summary                              
-----------------------------------------------------------------------------------
                     Added/                   Adj.                                     
Step    Variable    Removed     R-Square    R-Square     C(p)       AIC       RMSE     
-----------------------------------------------------------------------------------
   1       X5       addition       0.946       0.943    5.7790    54.6672    0.7670    
   2       X4       addition       0.958       0.954    2.1460    50.7185    0.6875    
-----------------------------------------------------------------------------------

Final model has $X_5$ and $X_4$.

Best Subsets Regression in R
modelbs <- lm(Y ~ ., data = squid)
modelbs$call$data <- squid # Should not be needed when running by hand

ols_step_best_subset(modelbs)
   Best Subsets Regression   
-----------------------------
Model Index    Predictors
-----------------------------
     1         X5             
     2         X4 X5          
     3         X2 X4 X5       
     4         X1 X2 X4 X5    
     5         X1 X2 X3 X4 X5 
-----------------------------

                                                 Subsets Regression Summary                                                  
-----------------------------------------------------------------------------------------------------------------------------
                       Adj.        Pred                                                                                       
Model    R-Square    R-Square    R-Square     C(p)       AIC        SBIC        SBC       MSEP      FPE       HSP       APC  
-----------------------------------------------------------------------------------------------------------------------------
  1        0.9455      0.9428      0.9343    5.7794    54.6672     -8.0766    57.9403    0.6475    0.6418    0.0310    0.0654 
  2        0.9584      0.9540      0.9448    2.1456    50.7185    -10.5305    55.0827    0.5490    0.5370    0.0263    0.0547 
  3        0.9616      0.9553      0.9301    2.7354    50.9387     -9.1759    56.3940    0.5659    0.5437    0.0271    0.0554 
  4        0.9630      0.9543      0.9262    4.1582    52.1665     -6.9128    58.7128    0.6147    0.5772    0.0294    0.0588 
  5        0.9633      0.9519        0.91    6.0000    53.9501     -4.2645    61.5874    0.6898    0.6298    0.0330    0.0642 
-----------------------------------------------------------------------------------------------------------------------------
AIC: Akaike Information Criteria 
 SBIC: Sawa's Bayesian Information Criteria 
 SBC: Schwarz Bayesian Criteria 
 MSEP: Estimated error of prediction, assuming multivariate normality 
 FPE: Final Prediction Error 
 HSP: Hocking's Sp 
 APC: Amemiya Prediction Criteria 

Using Mallows’ $C_p$ we would go for Model 3. Note that the full model always has $C_p$ equal to its number of estimated parameters (here 6, including the intercept).

Using Adjusted $R^2$ we would also go for Model 3.

Other packages will give you $\hat{\sigma}^2$ (or $\hat{\sigma}$) for each model; you would go for the model whose $\hat{\sigma}^2$ is smallest, or close to smallest, with the fewest parameters.

2.2 PARTIAL CORRELATION

Partial correlation is the correlation between two variables while controlling for the effects of one or more other variables. The partial correlation between $Y$ and $X$, after controlling for $Z_1,\cdots,Z_p$, is denoted by $r_{YX|Z_1,\cdots,Z_p}$. Its square is defined as $r_{YX|Z_1,\cdots,Z_p}^2=\frac{SSE_{Z_1,\cdots,Z_p}-SSE_{X,Z_1,\cdots,Z_p}}{SSE_{Z_1,\cdots,Z_p}}$, that is, the reduction in the residual sum of squares due to adding $X$ to a model that already contains $Z_1,\cdots,Z_p$, divided by the residual sum of squares of the model containing only $Z_1,\cdots,Z_p$. The quantity $r_{YX|Z_1,\cdots,Z_p}$ is called the $p$-th order partial correlation coefficient.

The R package ppcor can be used to calculate partial correlations; see https://cran.r-project.org/web/packages/ppcor/ppcor.pdf. If the controlled correlation ($r_{YX|Z_1,\cdots,Z_p}$) is essentially the same as the original correlation ($r_{YX}$), the inference is that the control variables have no effect.

Test of the partial correlation coefficient

The sample partial correlation coefficient $r_{YX|Z_1,\cdots,Z_p}$ estimates the population quantity $\rho_{YX|Z_1,\ldots,Z_p}$. We test $H_0:\rho_{YX|Z_1,\ldots,Z_p}^2=0$ versus $H_1:\rho_{YX|Z_1,\ldots,Z_p}^2\neq0$ via the partial F test, with test statistic $T=\frac{\left(SSE_{RM}-SSE_{FM}\right)/(DF_{RM}-DF_{FM})}{SSE_{FM}/DF_{FM}}$, where the reduced model regresses $Y$ on $Z_1,\cdots,Z_p$ and the full model regresses $Y$ on $X,Z_1,\cdots,Z_p$.

EXAMPLE: MCG (For details see http://www.statsci.org/data/oz/afl.html)

We want to investigate the effect of various covariates on football game attendance at the MCG.

After taking the effect of Members into account, what is the effect of Top50 on MCG attendance? Here $Y=$MCG, $X=$Top50 and $Z=$Members.

library(tidyverse)

mcg1 <- read_csv(file.path("..", "Data", "MCG1.csv"))
mcg1
# A tibble: 41 x 8
     MCG Other  Temp Members Top50 Date       Home  Away 
   <dbl> <dbl> <dbl>   <dbl> <dbl> <chr>      <chr> <chr>
 1  8.65 72.9     24    12.6     5 27/03/1993 NM    Bris 
 2 49.9  60.8     21    26.0     7 3/04/1993  Ess   Carl 
 3 24.4  59.8     24    16.9     5 17/04/1993 NM    Melb 
 4 46.6   9.27    22    27.0     8 1/05/1993  Ess   Gee  
 5 29.3  74.8     17    22.9     7 8/05/1993  Rich  StK  
 6 34.4  61.9     16    51.6     8 22/05/1993 Ess   Adel 
 7 17.8  92.1     19    13.0     4 29/05/1993 Rich  Syd  
 8 37.1  23.4     13    29.9     9 12/06/1993 Carl  Gee  
 9 44.1  61.2     14    21.6     6 19/06/1993 Melb  Ess  
10 19.1  54.9     13    21.5     7 26/06/1993 Ess   Rich 
# ... with 31 more rows

Fitting the model without the controlling variable.

mcg_fit1 <- lm(MCG ~ Members, data = mcg1)

summary(mcg_fit1)

Call:
lm(formula = MCG ~ Members, data = mcg1)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.602  -8.161  -0.114   8.453  31.575 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6216     8.0103   1.201 0.236935    
Members       1.2073     0.2873   4.202 0.000149 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.57 on 39 degrees of freedom
Multiple R-squared:  0.3117,	Adjusted R-squared:  0.294 
F-statistic: 17.66 on 1 and 39 DF,  p-value: 0.0001488
anova(mcg_fit1)
Analysis of Variance Table

Response: MCG
          Df Sum Sq Mean Sq F value    Pr(>F)    
Members    1 4279.8  4279.8   17.66 0.0001488 ***
Residuals 39 9451.5   242.3                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fitting the model with the controlling variable.

mcg_fit2 <- lm(MCG ~ Members + Top50, data = mcg1)

summary(mcg_fit2)

Call:
lm(formula = MCG ~ Members + Top50, data = mcg1)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.124  -9.259   0.387   8.682  30.940 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   1.5101    11.1320   0.136  0.89281   
Members       1.0728     0.3143   3.413  0.00154 **
Top50         1.9474     1.8584   1.048  0.30130   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.55 on 38 degrees of freedom
Multiple R-squared:  0.331,	Adjusted R-squared:  0.2958 
F-statistic: 9.401 on 2 and 38 DF,  p-value: 0.0004818
anova(mcg_fit2)
Analysis of Variance Table

Response: MCG
          Df Sum Sq Mean Sq F value    Pr(>F)    
Members    1 4279.8  4279.8 17.7044 0.0001518 ***
Top50      1  265.5   265.5  1.0981 0.3012961    
Residuals 38 9186.0   241.7                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Calculating the partial correlation:

cor(mcg1$MCG, mcg1$Top50)
[1] 0.3548781
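The partial correlation itself can be computed from the residual sums of squares of the two fitted models, using the definition in Section 2.2; a minimal sketch (the ppcor package mentioned earlier is an alternative and is shown as a comment):

# Squared partial correlation: reduction in SSE from adding Top50,
# divided by the SSE of the model containing Members only.
sse_rm <- sum(residuals(mcg_fit1)^2)   # Members only
sse_fm <- sum(residuals(mcg_fit2)^2)   # Members + Top50
sqrt((sse_rm - sse_fm) / sse_rm)       # about 0.17

# Alternatively (requires the ppcor package):
# ppcor::pcor.test(mcg1$MCG, mcg1$Top50, mcg1$Members)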

The (ordinary) correlation between MCG and Top50 is 0.355. After controlling for Members, the correlation between MCG and Top50 shrinks to about 0.17.

3.1 CONFOUNDING

An extraneous variable is an independent variable that is not of direct interest to the study, but does have an influence on the response.

EXAMPLE: We observe the following relationship between characteristics $y$ and $x$ in a sample of male members of a species:

males <- read_csv(file.path("..", "Data", "Males.csv"))

plot(males)

plot of chunk read_males

and the following for the females:

females <- read_csv(file.path("..", "Data", "Females.csv"))

plot(females)

plot of chunk read_females

The plots look very similar, but the ranges on the x-axes and y-axes are different.

Simple linear regressions for the genders separately give:

Males

males_fit <- lm(y ~ x, data = males)

summary(males_fit)

Call:
lm(formula = y ~ x, data = males)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.33527 -0.36645  0.03395  0.36717  1.03367 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.5815     0.3820   6.758 1.57e-08 ***
x            -2.7407     0.2499 -10.967 8.60e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5254 on 49 degrees of freedom
Multiple R-squared:  0.7105,	Adjusted R-squared:  0.7046 
F-statistic: 120.3 on 1 and 49 DF,  p-value: 8.599e-15
anova(males_fit)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
x          1 33.199  33.199  120.28 8.599e-15 ***
Residuals 49 13.525   0.276                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(males)
abline(a = 2.5815, b = -2.7407)

plot of chunk males_fit

Females

females_fit <- lm(y1 ~ ., data = females)

summary(females_fit)

Call:
lm(formula = y1 ~ ., data = females)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.90342 -0.61669  0.05136  0.53750  1.61836 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  15.3054     1.3308  11.501 1.59e-15 ***
x1           -3.0930     0.3789  -8.163 1.08e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7966 on 49 degrees of freedom
Multiple R-squared:  0.5763,	Adjusted R-squared:  0.5676 
F-statistic: 66.64 on 1 and 49 DF,  p-value: 1.077e-10
anova(females_fit)
Analysis of Variance Table

Response: y1
          Df Sum Sq Mean Sq F value    Pr(>F)    
x1         1 42.286  42.286  66.637 1.077e-10 ***
Residuals 49 31.094   0.635                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(females)
abline(a = 15.3054, b = -3.0930)

plot of chunk females_fit

Intercepts ($\hat{\beta}_0$’s) are quite different for the male and female regression equations, but the slopes ($\hat{\beta}_1$’s) are similar.

It is reasonable to conclude that there is a negative linear relationship between $Y$ and $X$, with a slope of about $-3$. (Test $H_0:\beta_1=-3$ against $H_1:\beta_1\neq-3$ for Males and Females separately to see this.)

EXERCISE: Do this for yourself.

Now we regress $Y$ on $X$ with the males and females together:

mf <- read_csv(file.path("..", "Data", "MF.csv"))
mf
# A tibble: 102 x 3
       x       y   Sex
   <dbl>   <dbl> <dbl>
 1  1     0.0573     1
 2  1.02  0.151      1
 3  1.04 -0.235      1
 4  1.06 -0.265      1
 5  1.08 -0.0827     1
 6  1.1  -0.227      1
 7  1.12 -0.544      1
 8  1.14 -0.208      1
 9  1.16 -0.530      1
10  1.18 -0.803      1
# ... with 92 more rows
mf_fit <- lm(y ~ x, data = mf)
summary(mf_fit)

Call:
lm(formula = y ~ x, data = mf)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1277 -1.4482 -0.1652  1.4083  4.0033 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -4.8558     0.4675  -10.39   <2e-16 ***
x             2.5324     0.1726   14.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.817 on 100 degrees of freedom
Multiple R-squared:  0.6828,	Adjusted R-squared:  0.6797 
F-statistic: 215.3 on 1 and 100 DF,  p-value: < 2.2e-16

The overall regression has a slope of $\hat{\beta}_1=2.5324$! This is what has happened:

plot(mf$x, mf$y)
abline(a = -4.8558, b = 2.5324)

plot of chunk mf_plot

Plot of $y$ vs $x$, with the genders indicated by colour:

ggplot(mf, aes(x = x, y = y, colour = Sex)) + 
    geom_point() +
    geom_abline(intercept = -4.8558, slope = 2.5324)

plot of chunk mf_ggplot

Gender is a confounder in this regression. A confounder is a variable in a study that may not be of direct interest, but is associated with both the response and the predictor(s). Confounders must be taken into account, or controlled for; if they are not, incorrect results such as the one shown above may be obtained.

Kleinbaum et al [1] state: In general, confounding exists if meaningfully different interpretations of the relationship of interest result when an extraneous variable is ignored or included in the data analysis.

[1] Kleinbaum, Kupper, Muller and Nizam (1998) Applied Regression Analysis and Other Multivariable Methods, Third Edition, Duxbury Press.

Controlling for a confounder

mf_fit_all <- lm(y ~ ., data = mf)

summary(mf_fit_all)

Call:
lm(formula = y ~ ., data = mf)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8365 -0.4415 -0.0187  0.4624  1.6783 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -8.9972     0.2393  -37.59   <2e-16 ***
x            -2.9168     0.2265  -12.88   <2e-16 ***
Sex          11.8430     0.4722   25.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6734 on 99 degrees of freedom
Multiple R-squared:  0.9569,	Adjusted R-squared:  0.956 
F-statistic:  1098 on 2 and 99 DF,  p-value: < 2.2e-16
anova(mf_fit_all)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
x          1 710.81  710.81 1567.49 < 2.2e-16 ***
Sex        1 285.24  285.24  629.03 < 2.2e-16 ***
Residuals 99  44.89    0.45                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

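The fitted model corresponds to two parallel lines with slope $-2.9168$ and intercepts $-8.9972+11.8430\times\text{Sex}$, i.e. about $2.85$ and $14.69$ for the two Sex codes. A minimal sketch of a plot of the data with these lines, assuming Sex is coded 1 and 2 as in the data above:

ggplot(mf, aes(x = x, y = y, colour = Sex)) +
    geom_point() +
    geom_abline(intercept = -8.9972 + 11.8430 * 1, slope = -2.9168) +  # Sex = 1
    geom_abline(intercept = -8.9972 + 11.8430 * 2, slope = -2.9168)    # Sex = 2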

How do we identify when confounding is occurring?

The coefficient of a predictor is very different when the confounder is added to the model.

EXAMPLE: When the model is $Y_i=\beta_0+\beta_1X_i+\varepsilon_i$ for the example above, we get $\hat{\beta}_1=2.5324$. When gender is added to the model,

$Y_i=\beta_0+\beta_1X_i+\beta_2\,\mathrm{Sex}_i+\varepsilon_i,$

we get $\hat{\beta}_1=-2.9168$. This is strong evidence of confounding. No formal statistical test exists for the equality of the $\beta_1$’s obtained under the two models; we need to use judgement, in conjunction with graphical evidence such as that above. Confounding happens when both the response and the predictor are affected by the same variable, so evidence of the cause of confounding is provided by plots of both $X$ and $Y$ against the confounder.

par(mfrow = c(1, 2))
boxplot(x ~ Sex, data = mf)
boxplot(y ~ Sex, data = mf)

plot of chunk mf_plot_by_sex

Continuous confounders

EXAMPLE: Mass and Physical Measurements for Male Subjects. Michael Larner measured the weight and various physical measurements of 22 male subjects aged 16-30. Subjects were randomly chosen volunteers, all in reasonably good health. Subjects were requested to slightly tense each muscle being measured to ensure measurement consistency. All measurements except mass are in cm.

Larner, M. (1996). Mass and its Relationship to Physical Measurements. MS305 Data Project, Department of Mathematics, University of Queensland.

Say the purpose of the study was to explain men’s weight ($Y$) as a function of their height ($X_1$):

mass <- read_csv(file.path("..", "Data", "mass.csv"))
names(mass) <- c("Mass", "Fore", "Bicep", "Chest", "Neck", "Shoulder", "Waist", "Height",
    "Calf", "Thigh", "Head")
mass
# A tibble: 22 x 11
    Mass  Fore Bicep Chest  Neck Shoulder Waist Height  Calf Thigh  Head
   <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
 1  77    28.5  33.5   100  38.5      114  85     178   37.5  53    58  
 2  85.5  29.5  36.5   107  39        119  90.5   187   40    52    59  
 3  63    25    31      94  36.5      102  80.5   175   33    49    57  
 4  80.5  28.5  34     104  39        114  91.5   183   38    50    60  
 5  79.5  28.5  36.5   107  39        114  92     174   40    53    59  
 6  94    30.5  38     112  39        121 101     180   39.5  57.5  59  
 7  66    26.5  29      93  35        105  76     178.  38.5  50    58.5
 8  69    27    31      95  37        108  84     182.  36    49    60  
 9  65    26.5  29      93  35        112  74     178.  34    47    55.5
10  58    26.5  31      96  35        103  76     168.  35    46    58  
# ... with 12 more rows
mass_fit1 <- lm(Mass ~ Height, data = mass)

summary(mass_fit1)

Call:
lm(formula = Mass ~ Height, data = mass)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2156  -7.7434   0.1785   4.0739  18.7436 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -81.3253    61.4763  -1.323    0.201  
Height        0.8699     0.3443   2.527    0.020 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.781 on 20 degrees of freedom
Multiple R-squared:  0.242,	Adjusted R-squared:  0.2041 
F-statistic: 6.385 on 1 and 20 DF,  p-value: 0.02004
anova(mass_fit1)
Analysis of Variance Table

Response: Mass
          Df  Sum Sq Mean Sq F value  Pr(>F)  
Height     1  610.86  610.86  6.3854 0.02004 *
Residuals 20 1913.29   95.66                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An extraneous variable in the study is $X_2=$ Waist. Let’s add it into the regression:

mass_fit2 <- lm(Mass ~ Height + Waist, data = mass)

summary(mass_fit2)

Call:
lm(formula = Mass ~ Height + Waist, data = mass)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9319 -3.2881  0.6235  3.5401  5.2012 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -89.3517    26.0808  -3.426  0.00283 ** 
Height        0.3439     0.1559   2.206  0.03990 *  
Waist         1.1909     0.1240   9.604    1e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.147 on 19 degrees of freedom
Multiple R-squared:  0.8705,	Adjusted R-squared:  0.8569 
F-statistic: 63.88 on 2 and 19 DF,  p-value: 3.678e-09
anova(mass_fit2)
Analysis of Variance Table

Response: Mass
          Df  Sum Sq Mean Sq F value    Pr(>F)    
Height     1  610.86  610.86  35.514 9.788e-06 ***
Waist      1 1586.49 1586.49  92.237 1.005e-08 ***
Residuals 19  326.80   17.20                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have $\hat{\beta}_1=0.8699$ when Height is the only predictor, but $\hat{\beta}_1=0.3439$ once Waist is added.

The estimate of $\beta_1$ differs substantially depending on whether Waist is included in the model or not. Confounding is occurring because of an important association between the extraneous variable (Waist) and the response variable Mass.

It is often necessary to include in a model variables not of direct interest to the research question (i.e. extraneous variables), simply because omitting them from the analysis would lead to incorrect conclusions concerning the relationship between the variables of interest.

How do we know what extraneous variables, or potential confounders, to measure in a study?

We have to rely on previous knowledge, such as that gained from previous studies, to identify which variables, besides those of direct interest, we should be measuring.

3.2 Interaction

Interaction occurs when the relationship between $Y$ and $X_1$ is different at different levels of a third variable $X_2$.

EXAMPLE: Simulated data set. There is a continuous predictor $x_1$, and a categorical predictor $x_2$ which assumes the values 0 and 1. The variable $y$ increases linearly with $x_1$ when $x_2=0$, and decreases linearly with $x_1$ when $x_2=1$.

confounding <- read_csv(file.path("..", "Data", "confounding.csv"))
names(confounding) <- c("y", "x1", "x2")

ggplot(confounding, aes(x = x1, y = y, colour = x2)) +
    ggtitle("Figure 1: Interaction") +
    geom_point()

plot of chunk read_confounding

An example of no interaction is:

noconfounding <- read_csv(file.path("..", "Data", "noconfounding.csv"))
names(noconfounding) <- c("y", "x1", "x2")

ggplot(noconfounding, aes(x = x1, y = y, colour = x2)) +
    ggtitle("Figure 2: No interaction") +
    geom_point()

plot of chunk read_noconfounding

Here $y$ increases linearly with $x_1$, at the same rate (slope), irrespective of the value of $x_2$.

A regression of $Y$ on $X_1$ and $X_2$ for the data in Figure 1 gives:

conf_fit <- lm(y ~ x1 + x2, data = confounding)

summary(conf_fit)

Call:
lm(formula = y ~ x1 + x2, data = confounding)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.8407  -4.2836   0.0165   4.0220   9.5455 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.4042     1.1830  10.485  < 2e-16 ***
x1           -0.6230     0.1833  -3.399 0.000982 ***
x2           -6.2591     1.0580  -5.916 4.96e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.29 on 97 degrees of freedom
Multiple R-squared:  0.3243,	Adjusted R-squared:  0.3104 
F-statistic: 23.28 on 2 and 97 DF,  p-value: 5.539e-09
anova(conf_fit)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)    
x1         1  323.32  323.32  11.554 0.0009825 ***
x2         1  979.39  979.39  34.999 4.956e-08 ***
Residuals 97 2714.43   27.98                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We are fitting a common slope and allowing different intercepts for $x_2=0$ and $x_2=1$. We can write the fitted model as $\hat{y} = 12.4042 - 0.6230\,x_1 - 6.2591\,x_2$, which gives intercept $12.40$ when $x_2=0$, intercept $12.40-6.26=6.14$ when $x_2=1$, and a common slope of $-0.623$:

ggplot(confounding, aes(x = x1, y = y, colour = x2)) +
    geom_point() +
    geom_abline(intercept = 12.4, slope = -0.623) +
    geom_abline(intercept = 6.14, slope = -0.623)

plot of chunk conf_fit_plot

In order to capture the true relationship, we include an interaction term, which is $x_1\times x_2:$

conf_fit_int <- lm(y ~ x1 * x2, data = confounding)

summary(conf_fit_int)

Call:
lm(formula = y ~ x1 * x2, data = confounding)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1629 -1.3032  0.0645  1.4364  4.5780 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.2929     0.6598   6.507 3.48e-09 ***
x1            0.9992     0.1143   8.743 7.42e-14 ***
x2            9.9634     0.9331  10.678  < 2e-16 ***
x1:x2        -3.2445     0.1616 -20.074  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.332 on 96 degrees of freedom
Multiple R-squared:   0.87,	Adjusted R-squared:  0.8659 
F-statistic: 214.2 on 3 and 96 DF,  p-value: < 2.2e-16
anova(conf_fit_int)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)    
x1         1  323.32  323.32  59.435 1.165e-11 ***
x2         1  979.39  979.39 180.039 < 2.2e-16 ***
x1:x2      1 2192.20 2192.20 402.985 < 2.2e-16 ***
Residuals 96  522.23    5.44                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We now have the following fitted model: $\hat{y} = 4.2929 + 0.9992\,x_1 + 9.9634\,x_2 - 3.2445\,x_1x_2$. For $x_2=0$ this gives intercept $4.29$ and slope $0.999$; for $x_2=1$ it gives intercept $4.29+9.96\approx14.3$ and slope $0.999-3.24\approx-2.25$.

The fitted lines are shown with the data below:

ggplot(confounding, aes(x = x1, y = y, colour = x2)) +
    geom_point() +
    geom_abline(intercept =  4.29, slope =  0.999) +
    geom_abline(intercept = 14.25, slope = -2.241)

plot of chunk conf_fit_int_plot

Fitting a model with an interaction term to the data in Figure 2 gives:

noconf_fit_int <- lm(y ~ x1 * x2, data = noconfounding)

summary(noconf_fit_int)

Call:
lm(formula = y ~ x1 * x2, data = noconfounding)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.8783  -3.0603  -0.4352   3.0126  11.3395 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.1814     1.3786   3.758 0.000294 ***
x1            2.9798     0.2388  12.479  < 2e-16 ***
x2            2.0335     1.9496   1.043 0.299551    
x1:x2        -0.2760     0.3377  -0.817 0.415830    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.873 on 96 degrees of freedom
Multiple R-squared:  0.7476,	Adjusted R-squared:  0.7397 
F-statistic: 94.79 on 3 and 96 DF,  p-value: < 2.2e-16
anova(noconf_fit_int)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq  F value Pr(>F)    
x1         1 6727.3  6727.3 283.2649 <2e-16 ***
x2         1   10.7    10.7   0.4498 0.5041    
x1:x2      1   15.9    15.9   0.6678 0.4158    
Residuals 96 2279.9    23.7                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction term $x_1x_2$ is not significant. Fitting a model without the interaction term gives:

noconf_fit <- lm(y ~ x1 + x2, data = noconfounding)

summary(noconf_fit)

Call:
lm(formula = y ~ x1 + x2, data = noconfounding)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1129  -3.0653  -0.6238   3.2418  11.5741 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.8713     1.0880   5.397 4.81e-07 ***
x1            2.8418     0.1686  16.859  < 2e-16 ***
x2            0.6537     0.9730   0.672    0.503    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.865 on 97 degrees of freedom
Multiple R-squared:  0.7459,	Adjusted R-squared:  0.7406 
F-statistic: 142.3 on 2 and 97 DF,  p-value: < 2.2e-16
anova(noconf_fit)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq  F value Pr(>F)    
x1         1 6727.3  6727.3 284.2383 <2e-16 ***
x2         1   10.7    10.7   0.4513 0.5033    
Residuals 97 2295.8    23.7                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Some comments on fitting a model with interaction

In general, an $m$’th order interaction involves $m+1$ predictors.

EXAMPLE: If we have the second-order $X_1X_2X_3$ interaction, we would also include the first-order interactions $X_1X_2$, $X_1X_3$ and $X_2X_3$ as well as the main effects $X_1$, $X_2$ and $X_3$.
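In R's formula notation the * operator expands to all of these terms automatically; a minimal sketch (the data frame dat and its variables are hypothetical):

# Y ~ X1 * X2 * X3 expands to the main effects X1, X2 and X3, the first-order
# interactions X1:X2, X1:X3 and X2:X3, and the second-order interaction X1:X2:X3.
# dat is a hypothetical data frame containing Y, X1, X2 and X3.
fit_int <- lm(Y ~ X1 * X2 * X3, data = dat)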

Some other diagnostics

par(mfrow = c(2, 2))
plot(noconf_fit)

plot of chunk noconf_fit_plot

The Residuals vs Leverage ($h_{ii}$) plot can be used to see if you have any points with high leverage. To actually identify the points use hatvalues(fit) where fit is whatever name you gave to the model fitted to the data.

Leverage is a measure of how far away the independent variable values of an observation are from those of the other observations
https://en.wikipedia.org/wiki/Leverage_(statistics)

One rule of thumb is to flag observation $i$ as a high leverage point when $h_{ii}>\frac{2(p+1)}{n}$, where $p$ is the number of predictors and $n$ the number of observations.

We have $n=100$ and $p=2$, so high leverage points will have $h_{ii}>\frac{2(2+1)}{100}=0.06.$ There are no high leverage points under the fitted model.
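A minimal sketch of this check in R, using the model fitted above:

# Hat (leverage) values for noconf_fit; report any observations above the cut-off.
h <- hatvalues(noconf_fit)
which(h > 2 * (2 + 1) / 100)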

Influential points are identified using Cook’s distance ($D_i$). An influential observation is one whose removal would give quite different regression parameter estimates. Such points are said to have high influence.

According to Cook and Weisberg, we flag a case as influential if $D_i\geq F_{p+1,n-p-1;0.50}$. For this example all Cook’s distances are close to zero, so it appears that there are no influential points.

The following can be used to obtain Cook’s Distances.

library(car)

cooks.distance(mass_fit2)
           1            2            3            4            5 
1.491363e-02 2.714110e-02 2.203842e-02 8.236742e-03 1.047932e-03 
           6            7            8            9           10 
1.119823e-02 4.391635e-02 3.386806e-02 1.075475e-01 7.001717e-03 
          11           12           13           14           15 
9.366549e-02 4.016974e-03 1.165020e-02 1.624058e-01 1.044987e-01 
          16           17           18           19           20 
1.840585e-05 1.209214e-01 1.045333e-01 6.708925e-02 5.233727e-02 
          21           22 
2.204762e-02 1.522339e-02 

To get the cut-off $F_{p+1,n-p-1;0.50}$, use qf(.5, df1 = p+1, df2 = n-p-1) in R, substituting the numerical values of df1 and df2.
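For example, for mass_fit2 (where $n=22$ and $p=2$), a minimal sketch is:

# Cut-off for Cook's distance and the observations (if any) that exceed it.
n <- nrow(mass)
p <- 2
cutoff <- qf(0.5, df1 = p + 1, df2 = n - p - 1)
which(cooks.distance(mass_fit2) >= cutoff)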

4. Categorical Predictors

To fit models with categorical predictors that have more than two values (levels), we need to tell the lm command in R to treat them as factors.

Example: BMD Data set

We have a sample of n=122 post-menopausal women, who took part in a clinical trial of hormone replacement therapy (HRT). The following variables were measured at the start of the trial:

BMD: Bone mineral density at the spine (g/cm²)

BMI: Body Mass Index (kg/m²)

AGE: Years

CALCIUM: Daily calcium intake (mg)

WTKG: Weight (kg)

HTCM: Height (cm)

MENOPYRS: Number of years since menopause

SMKCODE: Smoking status (1=none, 2= 1-10 cigarettes/day, 3= >10 cigarettes/day)

PARITY: Number of children

ALCOHOL: Alcohol intake (1= none, 2= ≤1 drink/day, 3= 2-3 drinks/day, 4= ≥4 drinks/day)

TRTPRV: Previous HRT (0=no, 1=yes)

AGEMENOP: Age at menopause

BMI_CAT: BMI category (1=underweight, 2=normal, 3=overweight, 4=obese, 5=very obese)

OSTEOFAM: Family history of osteoporosis (0=no, 1=yes)

The study was undertaken to assess the effect of HRT on bone mineral density; at the start of the trial, it is of interest to explain the relationship of BMD with the covariates.

The predictor TRTPRV takes on the value 0 if the subject had not previously taken hormone replacement therapy (HRT), and 1 if she had.

The regression of BMD against BMI and TRTPRV gives:

bmd <- read_csv(file.path("..", "data", "BMD.csv"))
names(bmd) <- c("BMD", "BMI", "AGE", "CALCIUM", "WTKG", "HTCM", "MENOPYRS", "SMKCODE", "PARITY",
    "ALCOHOL", "TRTPRV", "AGEMENOP", "BMI_CAT", "OSTEOFAM")
bmd_fit <- lm(BMD ~ BMI + TRTPRV, data = bmd)

summary(bmd_fit)

Call:
lm(formula = BMD ~ BMI + TRTPRV, data = bmd)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32088 -0.08494  0.00583  0.09239  0.38783 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.945730   0.076010  12.442   <2e-16 ***
BMI         0.005948   0.002867   2.075   0.0402 *  
TRTPRV      0.032468   0.028704   1.131   0.2603    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1446 on 119 degrees of freedom
Multiple R-squared:  0.04411,	Adjusted R-squared:  0.02805 
F-statistic: 2.746 on 2 and 119 DF,  p-value: 0.06826

TRTPRV is not significant (p=0.260), but let’s consider what the regression is telling us.

If TRTPRV = 0: $B\hat{M}D_i = 0.94573 + 0.005948\,BMI_i$

If TRTPRV = 1: $B\hat{M}D_i = 0.94573 + 0.005948\,BMI_i + 0.032468 = 0.97820 + 0.005948\,BMI_i$

In other words, having TRTPRV = 1 results in an increase in the predicted value of BMD of 0.03247 g/cm². Note that the values of 0 and 1 assigned to TRTPRV are arbitrary, and any other coding would produce the same fitted values.
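The same comparison can be made with predict(); a minimal sketch, using a hypothetical BMI value of 25 kg/m² chosen only for illustration:

# Predicted BMD with and without previous HRT at the same BMI;
# the two predictions differ by the TRTPRV coefficient, about 0.0325 g/cm^2.
predict(bmd_fit, newdata = data.frame(BMI = 25, TRTPRV = c(0, 1)))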

TRTPRV is a categorical predictor, or factor, with two levels. What happens when there are more than two levels? Consider the predictor SMKCODE:

If we regress BMD on SMKCODE we get:

bmd_smoke_fit <- lm(BMD ~ SMKCODE, data = bmd)

summary(bmd_smoke_fit)

Call:
lm(formula = BMD ~ SMKCODE, data = bmd)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32688 -0.09233  0.00366  0.10306  0.41163 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.18137    0.02970  39.782  < 2e-16 ***
SMKCODE     -0.05710    0.02117  -2.697  0.00801 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.143 on 120 degrees of freedom
Multiple R-squared:  0.05714,	Adjusted R-squared:  0.04929 
F-statistic: 7.273 on 1 and 120 DF,  p-value: 0.008007

SMKCODE=1: $BMD_i= 1.18-0.0571\times 1 =1.1229$

SMKCODE=2: $BMD_i=1.18-0.0571\times 2 = 1.0658$

SMKCODE=3: $BMD_i = 1.18-0.0571 \times 3=1.0087$

This means that there is a predicted decrease in BMD of 0.0571 g/cm² when SMKCODE changes from 1 to 2, and from 2 to 3. This doesn’t make much sense: why should there be an equal difference in mean BMD between non-smokers and moderate smokers, as between moderate and heavy smokers?

If we had used a different coding for SMKCODE, we would have obtained a totally different answer. This makes no sense at all. What we need is a model which recognises that the levels of a categorical variable are not to be taken literally as numerical values, but rather as indicators of the different states that the variable can be in. For this we need the concept of an indicator variable.

Indicator variables

Going back to the BMD smoking variable, we define three indicator variables $S_1$, $S_2$ and $S_3$, where $S_j=1$ if the subject is in smoking group $j$ and $S_j=0$ otherwise.

The values of $S_1$, $S_2$ and $S_3$ will then be as follows:

SMKCODE=1: $S_1=1$, $S_2=0$, $S_3=0$

SMKCODE=2: $S_1=0$, $S_2=1$, $S_3=0$

SMKCODE=3: $S_1=0$, $S_2=0$, $S_3=1$

We can then include $S_1$, $S_2$ and $S_3$ in the model. The problem with this approach is collinearity: for the $i$-th subject we will always have $S_{i1}+S_{i2}+S_{i3}=1$,
which means there is perfect collinearity between the indicator variables. Another way to think of this is that we are giving the model redundant information: for example, if $S_{i1}=0$ and $S_{i2}=0$ then we know that $S_{i3}=1$. We only need two of the indicators to convey all of the information about SMKCODE, so we leave one of them out. The level of the categorical variable corresponding to the indicator variable we leave out is called the referent category.

bmd_smoke_fact_fit <- lm(BMD ~ factor(SMKCODE), data = bmd)

summary(bmd_smoke_fact_fit)

Call:
lm(formula = BMD ~ factor(SMKCODE), data = bmd)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.33018 -0.08980  0.00036  0.09505  0.40832 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1.12758    0.01412  79.835  < 2e-16 ***
factor(SMKCODE)2 -0.12711    0.04706  -2.701  0.00792 ** 
factor(SMKCODE)3 -0.08718    0.04507  -1.934  0.05544 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1419 on 119 degrees of freedom
Multiple R-squared:  0.07856,	Adjusted R-squared:  0.06307 
F-statistic: 5.073 on 2 and 119 DF,  p-value: 0.007689

The referent category is SMKCODE = 1 (non-smoker), corresponding to $S_1$.
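If a different referent category were wanted, one option is relevel(); a minimal sketch (the object name is arbitrary):

# Refit with heavy smokers (SMKCODE = 3) as the referent category.
fit_ref3 <- lm(BMD ~ relevel(factor(SMKCODE), ref = "3"), data = bmd)
summary(fit_ref3)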

Another way of looking at the problem

We have performed an analysis to determine whether Smoking (a categorical variable or factor) is a significant predictor of BMD. An appropriate visual display is the box plot:

boxplot(BMD ~ SMKCODE, data = bmd)

plot of chunk bmd_smoke_boxplot

in which we can see the lower BMDs of smokers. This suggests another method of analysis: one-way analysis of variance. The model is

$Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}, \qquad j=1,2,3,\; i=1,\ldots,n_j,$

where $Y_{ij} =$ BMD of i-th subject in smoking group j

$\mu =$ Overall mean BMD

$\alpha_j=$ Effect of smoking group j

$n_j=$ Number of subjects in smoking group j

anova(bmd_smoke_fact_fit)
Analysis of Variance Table

Response: BMD
                 Df  Sum Sq  Mean Sq F value   Pr(>F)   
factor(SMKCODE)   2 0.20441 0.102204  5.0727 0.007689 **
Residuals       119 2.39759 0.020148                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.1 ANALYSIS OF COVARIANCE

A regression based on a single categorical predictor is equivalent to a one-way ANOVA; a regression based on $m$ categorical predictors would be equivalent to an $m$-way ANOVA. Once we include categorical predictors in the regression framework, by using indicator variables, there is nothing stopping us from also including one or more covariates (continuous predictors) in the model. Say we have one covariate ($X$) and one categorical predictor ($S$), which has $k$ levels. Assuming that the last ($k$-th) level of $S$ is the referent category, a possible model is

$Y_i=\beta_0+\beta_1X_i+\gamma_1S_{i1}+\gamma_2S_{i2}+\cdots+\gamma_{k-1}S_{i,k-1}+\varepsilon_i, \qquad (1)$

$\varepsilon_i \sim N(0,\sigma^2)$ independently, $i=1,\ldots,n$.

What (1) is assuming is that the slope of the relationship between $y$ and $x$ is $\beta_1$, irrespective of the value of $S$. This may be depicted, for $k=3$, as:

x <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- 13.4 + 0.623 * x
example1 <- data.frame(x, y)

ggplot(data = example1, aes(x = x, y = y)) +
    geom_point() +
    geom_abline(intercept = 12.4, slope = 0.623, col = "blue") +
    geom_abline(intercept = 13.4, slope = 0.623) +
    geom_abline(intercept = 14.4, slope = 0.623, col = "red")

plot of chunk ancova_example1

We say here that there is no interaction between X and S.

Another possible scenario is:

x <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- 13.4 + 1.623 * x
example2 <- data.frame(x, y)

ggplot(data = example2, aes(x = x, y = y)) +
    geom_point() +
    geom_abline(intercept = 12.4, slope = 0.623, col = "blue") +
    geom_abline(intercept = 13.4, slope = 1.623) + 
    geom_abline(intercept = 14.4, slope = 0.23, col = "red")

plot of chunk ancova_example2

The slope of the relationship between $y$ and $x$ depends on the value of $S$.

This is formulated by adding product (interaction) terms to model (1):

$Y_i=\beta_0+\beta_1X_i+\gamma_1S_{i1}+\cdots+\gamma_{k-1}S_{i,k-1}+\delta_1X_iS_{i1}+\cdots+\delta_{k-1}X_iS_{i,k-1}+\varepsilon_i.$

Rewrite this as

$Y_i=\beta_0+\gamma_1S_{i1}+\cdots+\gamma_{k-1}S_{i,k-1}+\left(\beta_1+\delta_1S_{i1}+\cdots+\delta_{k-1}S_{i,k-1}\right)X_i+\varepsilon_i.$

We have differential slopes for different levels of $S$. For the referent category $k$, the slope is $\beta_1$.

No interaction hypothesis: $H_0:\delta_1=\delta_2=\cdots=\delta_{k-1}=0$.

No effect of $S$ hypothesis, assuming no interaction: $H_0:\gamma_1=\gamma_2=\cdots=\gamma_{k-1}=0$ in model (1).

BMD example

Body mass index (BMI) is an important predictor of bone mineral density (BMD). It is a simple matter to include BMI in the model. The following is the model with no interaction, i.e. the same slope between BMD and BMI irrespective of the subject’s smoking status:

$BMD_i=\beta_0+\beta_1\,BMI_i+\gamma_2S_{i2}+\gamma_3S_{i3}+\varepsilon_i,$

where

$BMI_i$ = body mass index of i-th subject and other definitions are as for model (1). The results are

bmd_bmi_smoke_fit <- lm(BMD ~ BMI + factor(SMKCODE), data = bmd)

summary(bmd_bmi_smoke_fit)

Call:
lm(formula = BMD ~ BMI + factor(SMKCODE), data = bmd)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.31668 -0.08025  0.00546  0.09665  0.39288 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.986226   0.074111  13.307   <2e-16 ***
BMI               0.005428   0.002795   1.942   0.0545 .  
factor(SMKCODE)2 -0.126571   0.046518  -2.721   0.0075 ** 
factor(SMKCODE)3 -0.078738   0.044763  -1.759   0.0812 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1403 on 118 degrees of freedom
Multiple R-squared:  0.1071,	Adjusted R-squared:  0.0844 
F-statistic: 4.718 on 3 and 118 DF,  p-value: 0.003806
anova(bmd_bmi_smoke_fit)
Analysis of Variance Table

Response: BMD
                 Df  Sum Sq  Mean Sq F value   Pr(>F)   
BMI               1 0.08804 0.088044  4.4717 0.036566 * 
factor(SMKCODE)   2 0.19062 0.095311  4.8408 0.009538 **
Residuals       118 2.32333 0.019689                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The fitted model is $B\hat{M}D_i = 0.98623 + 0.005428\,BMI_i - 0.12657\,S_{i2} - 0.07874\,S_{i3}$.

Interpretation of parameters:

The predicted effect on BMD of an increase of 1 kg/m² in BMI is an increase of 0.005 g/cm².

The predicted effect on BMD of a person being a moderate smoker, compared with a non-smoker, is a decrease of 0.127 g/cm².

The predicted effect on BMD of a person being a heavy smoker, compared with a non-smoker, is a decrease of 0.079 g/cm².

The model with interaction is $BMD_i=\beta_0+\beta_1\,BMI_i+\gamma_2S_{i2}+\gamma_3S_{i3}+\delta_2\,BMI_i\,S_{i2}+\delta_3\,BMI_i\,S_{i3}+\varepsilon_i$.

bmd_bmi_smoke_fit_int <- lm(BMD ~ BMI * factor(SMKCODE), data = bmd)

summary(bmd_bmi_smoke_fit_int)

Call:
lm(formula = BMD ~ BMI * factor(SMKCODE), data = bmd)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.31624 -0.08058  0.00551  0.09204  0.39303 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           0.981566   0.079424  12.359   <2e-16 ***
BMI                   0.005607   0.003002   1.868   0.0643 .  
factor(SMKCODE)2     -0.251757   0.265309  -0.949   0.3446    
factor(SMKCODE)3      0.259077   0.343809   0.754   0.4526    
BMI:factor(SMKCODE)2  0.004827   0.010065   0.480   0.6324    
BMI:factor(SMKCODE)3 -0.013786   0.013880  -0.993   0.3227    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1407 on 116 degrees of freedom
Multiple R-squared:  0.1169,	Adjusted R-squared:  0.0788 
F-statistic:  3.07 on 5 and 116 DF,  p-value: 0.01222
anova(bmd_bmi_smoke_fit_int)
Analysis of Variance Table

Response: BMD
                     Df  Sum Sq  Mean Sq F value   Pr(>F)   
BMI                   1 0.08804 0.088044  4.4445 0.037166 * 
factor(SMKCODE)       2 0.19062 0.095311  4.8114 0.009831 **
BMI:factor(SMKCODE)   2 0.02542 0.012710  0.6416 0.528309   
Residuals           116 2.29791 0.019810                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model fitted: $B\hat{M}D_i = 0.98157 + 0.005607\,BMI_i - 0.25176\,S_{i2} + 0.25908\,S_{i3} + 0.004827\,BMI_i\,S_{i2} - 0.013786\,BMI_i\,S_{i3}$.

Alternatively:

SMKCODE=1: $B\hat{M}D_i=0.98157+0.005607\cdot BMI_i$

SMKCODE=2: $B\hat{M}D_i=0.98157+0.005607\cdot BMI_i-0.2518+0.00483\cdot BMI_i$

$=0.72977+0.010437\cdot BMI_i$

SMKCODE=3: $B\hat{M}D_i=0.98157+0.005607\cdot BMI_i+0.2591-0.01379\cdot BMI_i$

$=1.2407-0.008183\cdot BMI_i$
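These three lines can be visualised by letting ggplot fit a separate simple regression in each smoking group, which reproduces the fitted lines of the interaction model; a minimal sketch:

# One ordinary least squares line per smoking group.
ggplot(bmd, aes(x = BMI, y = BMD, colour = factor(SMKCODE))) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)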

Interpretation of parameters:

Interaction makes interpretation more complicated.

The predicted effect on BMD of an increase of 1 kg/m² in BMI is an increase of 0.005607 g/cm² for non-smokers, an increase of 0.010437 g/cm² for moderate smokers, and a decrease of 0.008183 g/cm² for heavy smokers.

The predicted effect on BMD of a person being a moderate smoker compared with a non-smoker: $-0.25180+0.00483\cdot BMI_i$

The predicted effect on BMD of a person being a heavy smoker compared with a non-smoker: $0.2591-0.01379\cdot BMI_i$

Test for interaction

Always test for interaction before the main effects. If interaction is present then we need to retain the main effects in the model. We want to test the no-interaction hypothesis $H_0:\delta_2=\delta_3=0$ against $H_1:$ at least one of $\delta_2,\delta_3$ is non-zero.

We can do this by looking at the ANOVA of the model we fitted with interaction.

anova(bmd_bmi_smoke_fit_int)
Analysis of Variance Table

Response: BMD
                     Df  Sum Sq  Mean Sq F value   Pr(>F)   
BMI                   1 0.08804 0.088044  4.4445 0.037166 * 
factor(SMKCODE)       2 0.19062 0.095311  4.8114 0.009831 **
BMI:factor(SMKCODE)   2 0.02542 0.012710  0.6416 0.528309   
Residuals           116 2.29791 0.019810                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value associated with the interaction term is $0.528$, which is not significant. We can only say this if the model assumptions hold:

par(mfrow = c(2, 2))
plot(bmd_bmi_smoke_fit_int)

plot of chunk bmd_bmi_smoke_fit_int_plot

We might be prepared to say that the model assumptions hold: the Residuals versus Fitted values plot does look like a random scatter about zero and the Normal Q-Q plot looks linear.

We conclude that there is no evidence to confirm a BMI-smoking interaction effect on bone mineral density.

Test for categorical predictor (no interaction present)

In the presence of interaction, we need the main effects to be present in the model and so do not need to check for their significance.

In the absence of interaction, we test for the overall significance of the categorical predictor, again via a multiple partial F test. (R gives you this when you obtain the ANOVA of the model fitted without an interaction term, which was fitted above.)

The full model is the no-interaction model $BMD_i=\beta_0+\beta_1\,BMI_i+\gamma_2S_{i2}+\gamma_3S_{i3}+\varepsilon_i$.

We test $H_0:\gamma_2=\gamma_3=0$ against $H_1:$ at least one of $\gamma_2,\gamma_3$ is non-zero.

We can do this by looking at the ANOVA of the model we fitted with no interaction earlier.

anova(bmd_bmi_smoke_fit)
Analysis of Variance Table

Response: BMD
                 Df  Sum Sq  Mean Sq F value   Pr(>F)   
BMI               1 0.08804 0.088044  4.4717 0.036566 * 
factor(SMKCODE)   2 0.19062 0.095311  4.8408 0.009538 **
Residuals       118 2.32333 0.019689                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value associated with factor(SMKCODE) is $0.0095$, which is significant. We strongly reject $H_0$ in favour of $H_1$ and conclude that Smoking is a significant predictor of BMD, after correcting (or controlling) for Body Mass Index. Again, we can only say this if the model assumptions hold:

par(mfrow = c(2, 2))
plot(bmd_bmi_smoke_fit)

plot of chunk bmd_bmi_smoke_fit_plot

Tests for individual coefficients

summary(bmd_bmi_smoke_fit)

Call:
lm(formula = BMD ~ BMI + factor(SMKCODE), data = bmd)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.31668 -0.08025  0.00546  0.09665  0.39288 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.986226   0.074111  13.307   <2e-16 ***
BMI               0.005428   0.002795   1.942   0.0545 .  
factor(SMKCODE)2 -0.126571   0.046518  -2.721   0.0075 ** 
factor(SMKCODE)3 -0.078738   0.044763  -1.759   0.0812 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1403 on 118 degrees of freedom
Multiple R-squared:  0.1071,	Adjusted R-squared:  0.0844 
F-statistic: 4.718 on 3 and 118 DF,  p-value: 0.003806

We test the significance of each $\gamma_j$, i.e. the comparison of smoking group $j$ with the referent category.

For $H_0:\gamma_2=0$ we have $\gamma_2$ significant (p=0.0075) and for $H_0:\gamma_3=0$, we have $\gamma_3$ not significant (p=0.081).

We conclude that moderate smokers have mean BMD levels significantly lower than non-smokers, but that heavy smokers’ mean BMD levels are not significantly different from non-smokers. [The numbers of heavy smokers in the sample was small, making the detection of a significant difference unlikely, even if there was a significant difference in the population.]

Caution: Here we have performed two hypothesis tests, both at a 5% level of significance. Is the overall significance of our conclusion still 5%? If not, is it greater than or less than 5%?

Referent category revisited

A sensible choice for the referent category is one that has a reasonable number of observations (so that comparisons with it are well estimated) and that provides a natural, meaningful baseline for comparison.

In the BMD example, we can look at the frequencies of each SMKCODE category:
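One way to obtain these counts is with base R's table(); a minimal sketch:

# Number of subjects in each smoking category.
table(bmd$SMKCODE)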

There are few moderate and heavy smokers relative to non-smokers, and a comparison with reference to non-smokers makes sense.

Transformation of variables will be covered in the last Exercise for this lesson.

Key Points