This post is a summary of one chapter of the course notes on Computational Statistics taught at ETH Zurich by Bühlmann and Mächler (2014). All those who are interested in a mathematically rigorous presentation of some topics of computational statistics are invited to have a look at the course notes. The link to those course notes is included in the References section of this post.
I can also recommend a completely informal introduction to the topic of least squares by Rafael Irizarry. In his video he shows how to teach least squares to a fifth grader.
We start with a practical example of a multiple linear regression analysis in R. Those who are interested in a more formal introduction are advised to skip to the next section.
For our first analysis, we are using the Chatterjee–Price Attitude data-set. This data-set is part of the R system and is described in the help file available via
?attitude
The data are survey results of clerical employees of a company. The data-set consists of 30 observations on 7 variables. The data is organised such that variables are stored in columns and observations are stored in rows.
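As a quick sanity check of this structure, we can look at the dimensions and the first few rows of the data-set:

dim(attitude)   # 30 rows (observations) and 7 columns (variables)
head(attitude)  # the first six observations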
Let us assume that we focus on one particular variable, namely the variable called rating. We are interested in finding relationships between rating and any of the other variables.
Before getting into any model fitting, it is always a good idea to first get an overview of the data with a pairs plot. The pairs plot is composed of a matrix of scatter-plots for all combinations of variables in the data-set.
require(stats); require(graphics)
pairs(attitude, main = "attitude data")
Simple summary statistics are obtained by
summary(attitude)
##      rating       complaints     privileges      learning   
##  Min.   :40.0   Min.   :37.0   Min.   :30.0   Min.   :34.0
##  1st Qu.:58.8   1st Qu.:58.5   1st Qu.:45.0   1st Qu.:47.0
##  Median :65.5   Median :65.0   Median :51.5   Median :56.5
##  Mean   :64.6   Mean   :66.6   Mean   :53.1   Mean   :56.4
##  3rd Qu.:71.8   3rd Qu.:77.0   3rd Qu.:62.5   3rd Qu.:66.8
##  Max.   :85.0   Max.   :90.0   Max.   :83.0   Max.   :75.0
##      raises        critical       advance    
##  Min.   :43.0   Min.   :49.0   Min.   :25.0
##  1st Qu.:58.2   1st Qu.:69.2   1st Qu.:35.0
##  Median :63.5   Median :77.5   Median :41.0
##  Mean   :64.6   Mean   :74.8   Mean   :42.9
##  3rd Qu.:71.0   3rd Qu.:80.0   3rd Qu.:47.8
##  Max.   :88.0   Max.   :92.0   Max.   :72.0
As a first step, we are interested in how the 6 variables complaints, privileges, learning, raises, critical and advance are related to the observed overall rating.
fm1 <- lm(rating ~ ., data = attitude)
The call to the function lm() fits the multiple linear regression model using the variable rating as the response and all other variables as predictors. Since we are assigning the result to a variable, nothing is returned. But the resulting linear model object has a summary method which shows us the most important results.
summary(fm1)
##
## Call:
## lm(formula = rating ~ ., data = attitude)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -10.942  -4.356   0.316   5.543  11.599
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  10.7871    11.5893    0.93   0.3616
## complaints    0.6132     0.1610    3.81   0.0009 ***
## privileges   -0.0731     0.1357   -0.54   0.5956
## learning      0.3203     0.1685    1.90   0.0699 .
## raises        0.0817     0.2215    0.37   0.7155
## critical      0.0384     0.1470    0.26   0.7963
## advance      -0.2171     0.1782   -1.22   0.2356
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.07 on 23 degrees of freedom
## Multiple R-squared:  0.733, Adjusted R-squared:  0.663
## F-statistic: 10.5 on 6 and 23 DF,  p-value: 1.24e-05
The above summary of the linear model object contains the following details: the call that created the model, a five-number summary of the residuals, the table of estimated coefficients together with their standard errors, t values and p-values, and finally the residual standard error, the multiple and adjusted R-squared and the overall F-statistic.
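If individual parts of this summary are needed for further computations, they can also be extracted programmatically. A small sketch using standard accessor functions:

coef(summary(fm1))          # coefficient table: estimates, standard errors, t values, p-values
summary(fm1)$sigma          # residual standard error
summary(fm1)$r.squared      # multiple R-squared
summary(fm1)$adj.r.squared  # adjusted R-squared
confint(fm1)                # confidence intervals for the coefficients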
Besides just looking at the numbers given by the modeling output, it is always a good idea to also look at some diagnostic plots. The residuals \(r_i\), i.e. the differences between the observed values \(y_i\) and the fitted values \(\hat{y}_i\), are useful approximations of the unobservable error terms. The plot below shows four diagrams which are helpful in checking the appropriateness of the chosen linear model.
opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
plot(fm1, las = 1)  # default diagnostic plots for the fitted model
par(opar)           # restore the previous plotting parameters
The top left diagram shows a plot of the residuals versus the fitted values; it is also known as the Tukey-Anscombe plot. Ideally there should be no relationship between the two quantities and the red line should be approximately horizontal. For our example data-set we can see that there is no obvious pattern between residuals and fitted values. The top right diagram is a so-called QQ-plot, which draws the empirically observed quantiles of the residuals against the theoretically expected quantiles under a normal distribution for the error term. In an ideal situation the points all lie on a line with slope \(= 1\). The diagrams in the bottom row are the scale-location plot, which checks for equal variance of the residuals, and the residuals vs leverage plot, which helps to spot influential observations.
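If only one of these diagnostics is of interest, the plot method for lm objects accepts a which argument to select individual panels; for example:

plot(fm1, which = 1)  # residuals vs fitted values (Tukey-Anscombe plot) only
plot(fm1, which = 2)  # normal QQ-plot only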
When looking at the summary of the first model fit fm1, we can see that, based on the t-test statistic and its accompanying p-value, the variable complaints stands out from the other variables. Hence we can refine our model and fit a smaller model which uses only complaints as a predictor for the response.
summary(fm2 <- lm(rating ~ complaints, data = attitude))
##
## Call:
## lm(formula = rating ~ complaints, data = attitude)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -12.880  -5.990   0.178   6.298   9.629
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  14.3763     6.6200    2.17    0.039 *
## complaints    0.7546     0.0975    7.74    2e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.99 on 28 degrees of freedom
## Multiple R-squared:  0.681, Adjusted R-squared:  0.67
## F-statistic: 59.9 on 1 and 28 DF,  p-value: 1.99e-08
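Whether dropping the other five predictors loses much explanatory power can be checked with a partial F-test comparing the two nested models. A brief sketch (output not shown here):

anova(fm2, fm1)  # partial F-test: does the full model fit significantly better than the reduced one?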
Multiple linear regression is defined by the following quote from Bühlmann and Mächler (2014).
Given a single response variable: up to some random errors it is a linear function of several predictors (or covariables). The linear function involves unknown parameters. The goal is to estimate these parameters, to study their relevance and to estimate the error variance.
What the above quote means is that for a set of objects or individuals we have observed or measured some quantity, which is termed the response variable, the y-values or the dependent variable. As an example, in a clinical study we might have measured blood pressure or cholesterol levels for a given set of patients. The response variable can only be observed or measured up to some random and non-systematic errors.
In addition to the response variable, we have, for the same set of objects or individuals, some further characteristics that describe or classify them. For our patients in the above example these might be height and weight or some dietary quantities. These quantities are termed predictors, covariables or independent variables and are assumed to be known exactly, without any errors.
The goal of multiple linear regression is to relate the predictors to the response variable using the mathematical tool of linear functions. What follows is a more mathematical description of what has been described so far. For each observation \(i\), the response is modelled as a linear function of the predictor values plus an error term,
\[y_i = \mathbf{x}_i^T \mathbf{\beta} + \epsilon_i \text{ with } (i = 1, ..., n)\]
Usually it is assumed that \(\epsilon_1, ..., \epsilon_n\) are independent and identically distributed (iid) with \(\mathbb{E}[\epsilon_i] = 0\) and \(Var(\epsilon_i) = \sigma^2\). In matrix notation, the model can be written more compactly as
\[\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\]
where \(\mathbf{X}\) is a matrix with \(n\) rows \(\mathbf{x}_i^T\) and \(p\) columns \(\mathbf{x}^{(j)}\). It is assumed that \(n>p\) and that matrix \(\mathbf{X}\) has full rank \(p\), i.e. all \(p\) vectors \(\mathbf{x}^{(1)}, ..., \mathbf{x}^{(p)}\) are linearly independent.
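For the attitude example, the design matrix \(\mathbf{X}\) (including a column of ones for the intercept) can be inspected directly from the fitted model; this small sketch only serves to illustrate the notation:

X <- model.matrix(fm1)  # n x p design matrix: intercept column plus the 6 predictors
dim(X)                  # n = 30 rows and p = 7 columns
qr(X)$rank              # should equal p = 7 if X has full column rank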
Given the linear model \(\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\), we are looking for good estimates \(\hat{\mathbf{\beta}}\) for the unknown parameter vector \(\mathbf{\beta}\). One possibility of finding such estimates \(\hat{\mathbf{\beta}}\) is through the method of least squares which aims at finding \(\hat{\mathbf{\beta}}\) such that the errors are as small as possible.
More formally, the least squares estimator is defined as
\[\hat{\mathbf{\beta}} = \text{argmin}_{\beta}\lVert\mathbf{y} - \mathbf{X}\mathbf{\beta}\rVert^2\]
where \(\lVert\cdot\rVert\) denotes the Euclidean norm on the \(n\)-dimensional space \(\mathbb{R}^n\), so that \(\lVert\mathbf{y} - \mathbf{X}\mathbf{\beta}\rVert^2 = \sum_{i=1}^n (y_i - \mathbf{x}_i^T\mathbf{\beta})^2\) is the sum of squared errors. The \(\text{argmin}_\beta\) operator returns the value of \(\mathbf{\beta}\) that minimizes this expression. Assuming \(\mathbf{X}\) has full column rank \(p\), the minimum can be computed explicitly. Taking the derivative of the norm with respect to \(\mathbf{\beta}\) (a \(p\)-dimensional gradient vector), inserting \(\hat{\mathbf{\beta}}\) for \(\mathbf{\beta}\) and setting the derivative to zero yields
\[ (-2)\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\mathbf{\beta}}) = \mathbf{0}\]
This can be rearranged into the so-called normal equations
\[\mathbf{X}^T\mathbf{X}\hat{\mathbf{\beta}} = \mathbf{X}^T\mathbf{y}\]
Although the normal equations can be solved explicitly for the unknown \(\hat{\mathbf{\beta}}\), resulting in
\[\hat{\mathbf{\beta}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y} \text{ ,}\]
this formula is only useful for theoretical purposes. For numerical computations, it is much more stable to use the QR decomposition instead of inverting \(\left(\mathbf{X}^T\mathbf{X}\right)\).
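To illustrate this numerically for the attitude data, the least squares estimate can be computed via the normal equations and via the QR decomposition and then compared with the coefficients returned by lm(); in practice one would of course simply rely on lm(), which itself uses a QR-based solver:

X <- model.matrix(fm1)                        # design matrix as above
y <- attitude$rating                          # response vector
beta_normal <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations directly
beta_qr <- qr.solve(X, y)                     # least squares solution via the QR decomposition
cbind(beta_normal, beta_qr, coef(fm1))        # all three agree up to numerical precision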
Using the residuals \(r_i = y_i - \mathbf{x}_i^T\hat{\mathbf{\beta}}\) as estimates for the errors \(\epsilon_i\), a plausible estimator for \(\sigma^2\) is
\[\hat{\sigma}^2 = \frac{1}{n-p}\sum_{i=1}^n r_i^2\]
The perhaps unexpected factor \(\frac{1}{n-p}\) leads to \(\mathbb{E}\left[\hat{\sigma}^2\right] = \sigma^2\), which means the estimator is unbiased.
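For the full model fm1 this estimator can be reproduced by hand and compared with the residual standard error of about 7.07 on 23 degrees of freedom reported in the summary above:

r <- residuals(fm1)               # residuals r_i = y_i - x_i^T beta-hat
n <- nrow(attitude)               # n = 30 observations
p <- length(coef(fm1))            # p = 7 estimated parameters
sigma2_hat <- sum(r^2) / (n - p)  # unbiased estimate of the error variance
sqrt(sigma2_hat)                  # matches summary(fm1)$sigma, approximately 7.07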
Bühlmann, Peter, and Martin Mächler. 2014. “Computational Statistics.” http://stat.ethz.ch/education/semesters/ss2014/CompStat.