We use the data from excercise 3, problem 1:
s_data_file <- "https://charlotte-ngs.github.io/gelasmss2021/data/ex03p01_data.csv"
tbl_geno_data <- readr::read_csv(file = s_data_file)Parsed with column specification:
cols(
  Animal = col_double(),
  `SNP G` = col_double(),
  `SNP H` = col_double(),
  Observation = col_double()
)tbl_geno_dataFor the data set in tbl_geno_data, marker effects for loci G and H can be estimated.
In our given dataset in tbl_geno_data, the first problem of \(p>>N\) does not exist. The first step is to create a data set where this problem exists.
First, we reduce the number of observations to 20
# old style base-R version
tbl_first_problem <- tbl_geno_data[1:n_nr_obs,]
tbl_first_problemAlternatively, we can use features of the R-Package dplyr
# more modern style
library(dplyr)
tbl_first_problem <- tbl_geno_data %>% filter(Animal <= n_nr_obs)
tbl_first_problemSecondly, the number of parameters (for our example, the number of loci) must be increased.
n_nr_loci_add <- 10
vec_genotypes <- c(-1,0,1)
for (idx in 1:n_nr_loci_add){
  vec_cur_geno <- sample(vec_genotypes, size = n_nr_obs, replace = TRUE)
  tbl_first_problem <- bind_cols(tbl_first_problem, tibble::tibble(cur_geno = vec_cur_geno))
  names(tbl_first_problem)[ncol(tbl_first_problem)] <- paste0('SNP_', idx)
}
tbl_first_problemThe dataset in tbl_first_problem has the problem of \(p>>N\), because p is 12 and N is 5.
Trying to use least squares to estimate marker effects,
lm_fit_lsq <- lm(formula = Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2 + SNP_3 + SNP_4 + SNP_5 + SNP_6 + SNP_7 + SNP_8 + SNP_9 + SNP_10, data = tbl_first_problem)
summary(lm_fit_lsq)
Call:
lm(formula = Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2 + 
    SNP_3 + SNP_4 + SNP_5 + SNP_6 + SNP_7 + SNP_8 + SNP_9 + SNP_10, 
    data = tbl_first_problem)
Residuals:
ALL 5 residuals are 0: no residual degrees of freedom!
Coefficients: (8 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    483.5         NA      NA       NA
`SNP G`         52.5         NA      NA       NA
`SNP H`         20.0         NA      NA       NA
SNP_1           -1.5         NA      NA       NA
SNP_2             NA         NA      NA       NA
SNP_3             NA         NA      NA       NA
SNP_4          -24.5         NA      NA       NA
SNP_5             NA         NA      NA       NA
SNP_6             NA         NA      NA       NA
SNP_7             NA         NA      NA       NA
SNP_8             NA         NA      NA       NA
SNP_9             NA         NA      NA       NA
SNP_10            NA         NA      NA       NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 4 and 0 DF,  p-value: NAThis result is a clear sign of too many parameter.
We do not know which of the SNPs are important. To show the effect of the second problem, we use a different data set
tbl_second_problem <- tbl_geno_data
n_nr_obs <- nrow(tbl_second_problem)
n_nr_loci_add <- 2
vec_genotypes <- c(-1,0,1)
for (idx in 1:n_nr_loci_add){
  vec_cur_geno <- sample(vec_genotypes, size = n_nr_obs, replace = TRUE)
  tbl_second_problem <- bind_cols(tbl_second_problem, tibble::tibble(cur_geno = vec_cur_geno))
  names(tbl_second_problem)[ncol(tbl_second_problem)] <- paste0('SNP_', idx)
}
tbl_second_problemlm_fit_sp <- lm(Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2, data = tbl_second_problem)
summary(lm_fit_sp)
Call:
lm(formula = Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2, 
    data = tbl_second_problem)
Residuals:
    Min      1Q  Median      3Q     Max 
-15.436  -8.237  -3.458   4.734  24.229 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 497.3748     3.2096 154.962  < 2e-16 ***
`SNP G`      23.5993     4.1922   5.629  4.8e-05 ***
`SNP H`       9.7414     4.6932   2.076   0.0555 .  
SNP_1         2.7128     3.7597   0.722   0.4817    
SNP_2         0.8252     4.2340   0.195   0.8481    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.37 on 15 degrees of freedom
Multiple R-squared:  0.7026,    Adjusted R-squared:  0.6232 
F-statistic: 8.858 on 4 and 15 DF,  p-value: 0.0007042Because, we do not have the first problem of \(p>>N\), we can use model selection. In R this can be done using
MASS::stepAIC(lm_fit_sp)Start:  AIC=107.98
Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2
          Df Sum of Sq    RSS    AIC
- SNP_2    1       6.8 2689.8 106.03
- SNP_1    1      93.1 2776.2 106.66
<none>                 2683.1 107.98
- `SNP H`  1     770.6 3453.7 111.03
- `SNP G`  1    5668.4 8351.4 128.69
Step:  AIC=106.03
Observation ~ `SNP G` + `SNP H` + SNP_1
          Df Sum of Sq    RSS    AIC
- SNP_1    1      97.5 2787.4 104.74
<none>                 2689.8 106.03
- `SNP H`  1     775.6 3465.5 109.10
- `SNP G`  1    6073.7 8763.5 127.65
Step:  AIC=104.74
Observation ~ `SNP G` + `SNP H`
          Df Sum of Sq    RSS    AIC
<none>                 2787.4 104.74
- `SNP H`  1     679.6 3466.9 107.11
- `SNP G`  1    5981.1 8768.5 125.66
Call:
lm(formula = Observation ~ `SNP G` + `SNP H`, data = tbl_second_problem)
Coefficients:
(Intercept)      `SNP G`      `SNP H`  
    497.146       23.318        8.403  Do the model selection procedure according to the forward selection process as described and also do the backward elminiation and compare the results.
lm_fs_m0 <- lm(Observation ~ 1, data = tbl_second_problem)
summary(lm_fs_m0)
Call:
lm(formula = Observation ~ 1, data = tbl_second_problem)
Residuals:
   Min     1Q Median     3Q    Max 
-45.65 -16.90   2.85  17.60  36.35 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  502.650      4.872   103.2   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.79 on 19 degrees of freedomSum of the squared residuals
(vec_residuals <- residuals(lm_fs_m0))     1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18 
  7.35  25.35   2.35  36.35  27.35 -13.65 -16.65 -17.65 -24.65 -23.65  17.35  18.35 -29.65 -45.65  -5.65  13.35  21.35  -0.65 
    19     20 
  5.35   3.35 crossprod(vec_residuals)        [,1]
[1,] 9020.55