We use the data from excercise 3, problem 1:
s_data_file <- "https://charlotte-ngs.github.io/gelasmss2021/data/ex03p01_data.csv"
tbl_geno_data <- readr::read_csv(file = s_data_file)
Parsed with column specification:
cols(
Animal = col_double(),
`SNP G` = col_double(),
`SNP H` = col_double(),
Observation = col_double()
)
tbl_geno_data
For the data set in tbl_geno_data
, marker effects for loci G and H can be estimated.
In our given dataset in tbl_geno_data
, the first problem of \(p>>N\) does not exist. The first step is to create a data set where this problem exists.
First, we reduce the number of observations to 20
# old style base-R version
tbl_first_problem <- tbl_geno_data[1:n_nr_obs,]
tbl_first_problem
Alternatively, we can use features of the R-Package dplyr
# more modern style
library(dplyr)
tbl_first_problem <- tbl_geno_data %>% filter(Animal <= n_nr_obs)
tbl_first_problem
Secondly, the number of parameters (for our example, the number of loci) must be increased.
n_nr_loci_add <- 10
vec_genotypes <- c(-1,0,1)
for (idx in 1:n_nr_loci_add){
vec_cur_geno <- sample(vec_genotypes, size = n_nr_obs, replace = TRUE)
tbl_first_problem <- bind_cols(tbl_first_problem, tibble::tibble(cur_geno = vec_cur_geno))
names(tbl_first_problem)[ncol(tbl_first_problem)] <- paste0('SNP_', idx)
}
tbl_first_problem
The dataset in tbl_first_problem
has the problem of \(p>>N\), because p is 12 and N is 5.
Trying to use least squares to estimate marker effects,
lm_fit_lsq <- lm(formula = Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2 + SNP_3 + SNP_4 + SNP_5 + SNP_6 + SNP_7 + SNP_8 + SNP_9 + SNP_10, data = tbl_first_problem)
summary(lm_fit_lsq)
Call:
lm(formula = Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2 +
SNP_3 + SNP_4 + SNP_5 + SNP_6 + SNP_7 + SNP_8 + SNP_9 + SNP_10,
data = tbl_first_problem)
Residuals:
ALL 5 residuals are 0: no residual degrees of freedom!
Coefficients: (8 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 483.5 NA NA NA
`SNP G` 52.5 NA NA NA
`SNP H` 20.0 NA NA NA
SNP_1 -1.5 NA NA NA
SNP_2 NA NA NA NA
SNP_3 NA NA NA NA
SNP_4 -24.5 NA NA NA
SNP_5 NA NA NA NA
SNP_6 NA NA NA NA
SNP_7 NA NA NA NA
SNP_8 NA NA NA NA
SNP_9 NA NA NA NA
SNP_10 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 4 and 0 DF, p-value: NA
This result is a clear sign of too many parameter.
We do not know which of the SNPs are important. To show the effect of the second problem, we use a different data set
tbl_second_problem <- tbl_geno_data
n_nr_obs <- nrow(tbl_second_problem)
n_nr_loci_add <- 2
vec_genotypes <- c(-1,0,1)
for (idx in 1:n_nr_loci_add){
vec_cur_geno <- sample(vec_genotypes, size = n_nr_obs, replace = TRUE)
tbl_second_problem <- bind_cols(tbl_second_problem, tibble::tibble(cur_geno = vec_cur_geno))
names(tbl_second_problem)[ncol(tbl_second_problem)] <- paste0('SNP_', idx)
}
tbl_second_problem
lm_fit_sp <- lm(Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2, data = tbl_second_problem)
summary(lm_fit_sp)
Call:
lm(formula = Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2,
data = tbl_second_problem)
Residuals:
Min 1Q Median 3Q Max
-15.436 -8.237 -3.458 4.734 24.229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 497.3748 3.2096 154.962 < 2e-16 ***
`SNP G` 23.5993 4.1922 5.629 4.8e-05 ***
`SNP H` 9.7414 4.6932 2.076 0.0555 .
SNP_1 2.7128 3.7597 0.722 0.4817
SNP_2 0.8252 4.2340 0.195 0.8481
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.37 on 15 degrees of freedom
Multiple R-squared: 0.7026, Adjusted R-squared: 0.6232
F-statistic: 8.858 on 4 and 15 DF, p-value: 0.0007042
Because, we do not have the first problem of \(p>>N\), we can use model selection. In R this can be done using
MASS::stepAIC(lm_fit_sp)
Start: AIC=107.98
Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2
Df Sum of Sq RSS AIC
- SNP_2 1 6.8 2689.8 106.03
- SNP_1 1 93.1 2776.2 106.66
<none> 2683.1 107.98
- `SNP H` 1 770.6 3453.7 111.03
- `SNP G` 1 5668.4 8351.4 128.69
Step: AIC=106.03
Observation ~ `SNP G` + `SNP H` + SNP_1
Df Sum of Sq RSS AIC
- SNP_1 1 97.5 2787.4 104.74
<none> 2689.8 106.03
- `SNP H` 1 775.6 3465.5 109.10
- `SNP G` 1 6073.7 8763.5 127.65
Step: AIC=104.74
Observation ~ `SNP G` + `SNP H`
Df Sum of Sq RSS AIC
<none> 2787.4 104.74
- `SNP H` 1 679.6 3466.9 107.11
- `SNP G` 1 5981.1 8768.5 125.66
Call:
lm(formula = Observation ~ `SNP G` + `SNP H`, data = tbl_second_problem)
Coefficients:
(Intercept) `SNP G` `SNP H`
497.146 23.318 8.403
Do the model selection procedure according to the forward selection process as described and also do the backward elminiation and compare the results.
lm_fs_m0 <- lm(Observation ~ 1, data = tbl_second_problem)
summary(lm_fs_m0)
Call:
lm(formula = Observation ~ 1, data = tbl_second_problem)
Residuals:
Min 1Q Median 3Q Max
-45.65 -16.90 2.85 17.60 36.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 502.650 4.872 103.2 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.79 on 19 degrees of freedom
Sum of the squared residuals
(vec_residuals <- residuals(lm_fs_m0))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
7.35 25.35 2.35 36.35 27.35 -13.65 -16.65 -17.65 -24.65 -23.65 17.35 18.35 -29.65 -45.65 -5.65 13.35 21.35 -0.65
19 20
5.35 3.35
crossprod(vec_residuals)
[,1]
[1,] 9020.55