Data

We use the data from excercise 3, problem 1:

s_data_file <- "https://charlotte-ngs.github.io/gelasmss2021/data/ex03p01_data.csv"
tbl_geno_data <- readr::read_csv(file = s_data_file)
Parsed with column specification:
cols(
  Animal = col_double(),
  `SNP G` = col_double(),
  `SNP H` = col_double(),
  Observation = col_double()
)
tbl_geno_data

For the data set in tbl_geno_data, marker effects for loci G and H can be estimated.

Two Problems

  1. The number of parameter (marker effects) is larger than the number of observations (\(p>>N\))
  2. Not all SNP are relevant and we have to find those which are relevant.

First Problem

In our given dataset in tbl_geno_data, the first problem of \(p>>N\) does not exist. The first step is to create a data set where this problem exists.

First, we reduce the number of observations to 20

# old style base-R version
tbl_first_problem <- tbl_geno_data[1:n_nr_obs,]
tbl_first_problem

Alternatively, we can use features of the R-Package dplyr

# more modern style
library(dplyr)
tbl_first_problem <- tbl_geno_data %>% filter(Animal <= n_nr_obs)
tbl_first_problem

Secondly, the number of parameters (for our example, the number of loci) must be increased.

n_nr_loci_add <- 10
vec_genotypes <- c(-1,0,1)
for (idx in 1:n_nr_loci_add){
  vec_cur_geno <- sample(vec_genotypes, size = n_nr_obs, replace = TRUE)
  tbl_first_problem <- bind_cols(tbl_first_problem, tibble::tibble(cur_geno = vec_cur_geno))
  names(tbl_first_problem)[ncol(tbl_first_problem)] <- paste0('SNP_', idx)
}
tbl_first_problem

The dataset in tbl_first_problem has the problem of \(p>>N\), because p is 12 and N is 5.

Trying to use least squares to estimate marker effects,

lm_fit_lsq <- lm(formula = Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2 + SNP_3 + SNP_4 + SNP_5 + SNP_6 + SNP_7 + SNP_8 + SNP_9 + SNP_10, data = tbl_first_problem)
summary(lm_fit_lsq)

Call:
lm(formula = Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2 + 
    SNP_3 + SNP_4 + SNP_5 + SNP_6 + SNP_7 + SNP_8 + SNP_9 + SNP_10, 
    data = tbl_first_problem)

Residuals:
ALL 5 residuals are 0: no residual degrees of freedom!

Coefficients: (8 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    483.5         NA      NA       NA
`SNP G`         52.5         NA      NA       NA
`SNP H`         20.0         NA      NA       NA
SNP_1           -1.5         NA      NA       NA
SNP_2             NA         NA      NA       NA
SNP_3             NA         NA      NA       NA
SNP_4          -24.5         NA      NA       NA
SNP_5             NA         NA      NA       NA
SNP_6             NA         NA      NA       NA
SNP_7             NA         NA      NA       NA
SNP_8             NA         NA      NA       NA
SNP_9             NA         NA      NA       NA
SNP_10            NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 4 and 0 DF,  p-value: NA

This result is a clear sign of too many parameter.

Second Problem

We do not know which of the SNPs are important. To show the effect of the second problem, we use a different data set

tbl_second_problem <- tbl_geno_data
n_nr_obs <- nrow(tbl_second_problem)
n_nr_loci_add <- 2
vec_genotypes <- c(-1,0,1)
for (idx in 1:n_nr_loci_add){
  vec_cur_geno <- sample(vec_genotypes, size = n_nr_obs, replace = TRUE)
  tbl_second_problem <- bind_cols(tbl_second_problem, tibble::tibble(cur_geno = vec_cur_geno))
  names(tbl_second_problem)[ncol(tbl_second_problem)] <- paste0('SNP_', idx)
}
tbl_second_problem
lm_fit_sp <- lm(Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2, data = tbl_second_problem)
summary(lm_fit_sp)

Call:
lm(formula = Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2, 
    data = tbl_second_problem)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.436  -8.237  -3.458   4.734  24.229 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 497.3748     3.2096 154.962  < 2e-16 ***
`SNP G`      23.5993     4.1922   5.629  4.8e-05 ***
`SNP H`       9.7414     4.6932   2.076   0.0555 .  
SNP_1         2.7128     3.7597   0.722   0.4817    
SNP_2         0.8252     4.2340   0.195   0.8481    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.37 on 15 degrees of freedom
Multiple R-squared:  0.7026,    Adjusted R-squared:  0.6232 
F-statistic: 8.858 on 4 and 15 DF,  p-value: 0.0007042

Because, we do not have the first problem of \(p>>N\), we can use model selection. In R this can be done using

MASS::stepAIC(lm_fit_sp)
Start:  AIC=107.98
Observation ~ `SNP G` + `SNP H` + SNP_1 + SNP_2

          Df Sum of Sq    RSS    AIC
- SNP_2    1       6.8 2689.8 106.03
- SNP_1    1      93.1 2776.2 106.66
<none>                 2683.1 107.98
- `SNP H`  1     770.6 3453.7 111.03
- `SNP G`  1    5668.4 8351.4 128.69

Step:  AIC=106.03
Observation ~ `SNP G` + `SNP H` + SNP_1

          Df Sum of Sq    RSS    AIC
- SNP_1    1      97.5 2787.4 104.74
<none>                 2689.8 106.03
- `SNP H`  1     775.6 3465.5 109.10
- `SNP G`  1    6073.7 8763.5 127.65

Step:  AIC=104.74
Observation ~ `SNP G` + `SNP H`

          Df Sum of Sq    RSS    AIC
<none>                 2787.4 104.74
- `SNP H`  1     679.6 3466.9 107.11
- `SNP G`  1    5981.1 8768.5 125.66

Call:
lm(formula = Observation ~ `SNP G` + `SNP H`, data = tbl_second_problem)

Coefficients:
(Intercept)      `SNP G`      `SNP H`  
    497.146       23.318        8.403  

Exercise

Do the model selection procedure according to the forward selection process as described and also do the backward elminiation and compare the results.

Forward Selection

  • Step 1: start with empty model
lm_fs_m0 <- lm(Observation ~ 1, data = tbl_second_problem)
summary(lm_fs_m0)

Call:
lm(formula = Observation ~ 1, data = tbl_second_problem)

Residuals:
   Min     1Q Median     3Q    Max 
-45.65 -16.90   2.85  17.60  36.35 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  502.650      4.872   103.2   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.79 on 19 degrees of freedom

Sum of the squared residuals

(vec_residuals <- residuals(lm_fs_m0))
     1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18 
  7.35  25.35   2.35  36.35  27.35 -13.65 -16.65 -17.65 -24.65 -23.65  17.35  18.35 -29.65 -45.65  -5.65  13.35  21.35  -0.65 
    19     20 
  5.35   3.35 
crossprod(vec_residuals)
        [,1]
[1,] 9020.55
LS0tCnRpdGxlOiAiVHdvIFByb2JsZW1zIG9mIEZMRU0gd2l0aCBHZW5vbWljIERhdGEiCmRhdGU6ICIyMDIxLTAzLTE1IgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgoKIyMgRGF0YQpXZSB1c2UgdGhlIGRhdGEgZnJvbSBleGNlcmNpc2UgMywgcHJvYmxlbSAxOgoKYGBge3J9CnNfZGF0YV9maWxlIDwtICJodHRwczovL2NoYXJsb3R0ZS1uZ3MuZ2l0aHViLmlvL2dlbGFzbXNzMjAyMS9kYXRhL2V4MDNwMDFfZGF0YS5jc3YiCnRibF9nZW5vX2RhdGEgPC0gcmVhZHI6OnJlYWRfY3N2KGZpbGUgPSBzX2RhdGFfZmlsZSkKdGJsX2dlbm9fZGF0YQpgYGAKCkZvciB0aGUgZGF0YSBzZXQgaW4gYHRibF9nZW5vX2RhdGFgLCBtYXJrZXIgZWZmZWN0cyBmb3IgbG9jaSBHIGFuZCBIIGNhbiBiZSBlc3RpbWF0ZWQuCgoKIyMgVHdvIFByb2JsZW1zCgoxLiBUaGUgbnVtYmVyIG9mIHBhcmFtZXRlciAobWFya2VyIGVmZmVjdHMpIGlzIGxhcmdlciB0aGFuIHRoZSBudW1iZXIgb2Ygb2JzZXJ2YXRpb25zICgkcD4+TiQpCjIuIE5vdCBhbGwgU05QIGFyZSByZWxldmFudCBhbmQgd2UgaGF2ZSB0byBmaW5kIHRob3NlIHdoaWNoIGFyZSByZWxldmFudC4KCgojIyBGaXJzdCBQcm9ibGVtCkluIG91ciBnaXZlbiBkYXRhc2V0IGluIGB0YmxfZ2Vub19kYXRhYCwgdGhlIGZpcnN0IHByb2JsZW0gb2YgJHA+Pk4kIGRvZXMgbm90IGV4aXN0LiBUaGUgZmlyc3Qgc3RlcCBpcyB0byBjcmVhdGUgYSBkYXRhIHNldCB3aGVyZSB0aGlzIHByb2JsZW0gZXhpc3RzLgoKYGBge3IsIGVjaG89RkFMU0V9Cm5fbnJfb2JzIDwtIDUKYGBgCgpGaXJzdCwgd2UgcmVkdWNlIHRoZSBudW1iZXIgb2Ygb2JzZXJ2YXRpb25zIHRvIGByIG5fbnJfb2JzYAoKYGBge3J9CiMgb2xkIHN0eWxlIGJhc2UtUiB2ZXJzaW9uCnRibF9maXJzdF9wcm9ibGVtIDwtIHRibF9nZW5vX2RhdGFbMTpuX25yX29icyxdCnRibF9maXJzdF9wcm9ibGVtCmBgYAoKQWx0ZXJuYXRpdmVseSwgd2UgY2FuIHVzZSBmZWF0dXJlcyBvZiB0aGUgUi1QYWNrYWdlIGBkcGx5cmAKCmBgYHtyfQojIG1vcmUgbW9kZXJuIHN0eWxlCmxpYnJhcnkoZHBseXIpCnRibF9maXJzdF9wcm9ibGVtIDwtIHRibF9nZW5vX2RhdGEgJT4lIGZpbHRlcihBbmltYWwgPD0gbl9ucl9vYnMpCnRibF9maXJzdF9wcm9ibGVtCmBgYAoKU2Vjb25kbHksIHRoZSBudW1iZXIgb2YgcGFyYW1ldGVycyAoZm9yIG91ciBleGFtcGxlLCB0aGUgbnVtYmVyIG9mIGxvY2kpIG11c3QgYmUgaW5jcmVhc2VkLgoKYGBge3J9Cm5fbnJfbG9jaV9hZGQgPC0gMTAKdmVjX2dlbm90eXBlcyA8LSBjKC0xLDAsMSkKZm9yIChpZHggaW4gMTpuX25yX2xvY2lfYWRkKXsKICB2ZWNfY3VyX2dlbm8gPC0gc2FtcGxlKHZlY19nZW5vdHlwZXMsIHNpemUgPSBuX25yX29icywgcmVwbGFjZSA9IFRSVUUpCiAgdGJsX2ZpcnN0X3Byb2JsZW0gPC0gYmluZF9jb2xzKHRibF9maXJzdF9wcm9ibGVtLCB0aWJibGU6OnRpYmJsZShjdXJfZ2VubyA9IHZlY19jdXJfZ2VubykpCiAgbmFtZXModGJsX2ZpcnN0X3Byb2JsZW0pW25jb2wodGJsX2ZpcnN0X3Byb2JsZW0pXSA8LSBwYXN0ZTAoJ1NOUF8nLCBpZHgpCn0KdGJsX2ZpcnN0X3Byb2JsZW0KYGBgCgoKVGhlIGRhdGFzZXQgaW4gYHRibF9maXJzdF9wcm9ibGVtYCBoYXMgdGhlIHByb2JsZW0gb2YgJHA+Pk4kLCBiZWNhdXNlIHAgaXMgMTIgYW5kIE4gaXMgNS4KClRyeWluZyB0byB1c2UgbGVhc3Qgc3F1YXJlcyB0byBlc3RpbWF0ZSBtYXJrZXIgZWZmZWN0cywgCgpgYGB7cn0KbG1fZml0X2xzcSA8LSBsbShmb3JtdWxhID0gT2JzZXJ2YXRpb24gfiBgU05QIEdgICsgYFNOUCBIYCArIFNOUF8xICsgU05QXzIgKyBTTlBfMyArIFNOUF80ICsgU05QXzUgKyBTTlBfNiArIFNOUF83ICsgU05QXzggKyBTTlBfOSArIFNOUF8xMCwgZGF0YSA9IHRibF9maXJzdF9wcm9ibGVtKQpzdW1tYXJ5KGxtX2ZpdF9sc3EpCmBgYAoKVGhpcyByZXN1bHQgaXMgYSBjbGVhciBzaWduIG9mIHRvbyBtYW55IHBhcmFtZXRlci4KCiMjIFNlY29uZCBQcm9ibGVtCldlIGRvIG5vdCBrbm93IHdoaWNoIG9mIHRoZSBTTlBzIGFyZSBpbXBvcnRhbnQuIFRvIHNob3cgdGhlIGVmZmVjdCBvZiB0aGUgc2Vjb25kIHByb2JsZW0sIHdlIHVzZSBhIGRpZmZlcmVudCBkYXRhIHNldAoKYGBge3J9CnRibF9zZWNvbmRfcHJvYmxlbSA8LSB0YmxfZ2Vub19kYXRhCm5fbnJfb2JzIDwtIG5yb3codGJsX3NlY29uZF9wcm9ibGVtKQpuX25yX2xvY2lfYWRkIDwtIDIKdmVjX2dlbm90eXBlcyA8LSBjKC0xLDAsMSkKZm9yIChpZHggaW4gMTpuX25yX2xvY2lfYWRkKXsKICB2ZWNfY3VyX2dlbm8gPC0gc2FtcGxlKHZlY19nZW5vdHlwZXMsIHNpemUgPSBuX25yX29icywgcmVwbGFjZSA9IFRSVUUpCiAgdGJsX3NlY29uZF9wcm9ibGVtIDwtIGJpbmRfY29scyh0Ymxfc2Vjb25kX3Byb2JsZW0sIHRpYmJsZTo6dGliYmxlKGN1cl9nZW5vID0gdmVjX2N1cl9nZW5vKSkKICBuYW1lcyh0Ymxfc2Vjb25kX3Byb2JsZW0pW25jb2wodGJsX3NlY29uZF9wcm9ibGVtKV0gPC0gcGFzdGUwKCdTTlBfJywgaWR4KQp9CnRibF9zZWNvbmRfcHJvYmxlbQpgYGAKCgpgYGB7cn0KbG1fZml0X3NwIDwtIGxtKE9ic2VydmF0aW9uIH4gYFNOUCBHYCArIGBTTlAgSGAgKyBTTlBfMSArIFNOUF8yLCBkYXRhID0gdGJsX3NlY29uZF9wcm9ibGVtKQpzdW1tYXJ5KGxtX2ZpdF9zcCkKYGBgCgpCZWNhdXNlLCB3ZSBkbyBub3QgaGF2ZSB0aGUgZmlyc3QgcHJvYmxlbSBvZiAkcD4+TiQsIHdlIGNhbiB1c2UgbW9kZWwgc2VsZWN0aW9uLiBJbiBSIHRoaXMgY2FuIGJlIGRvbmUgdXNpbmcgCgpgYGB7cn0KTUFTUzo6c3RlcEFJQyhsbV9maXRfc3ApCmBgYAoKCiMjIEV4ZXJjaXNlCkRvIHRoZSBtb2RlbCBzZWxlY3Rpb24gcHJvY2VkdXJlIGFjY29yZGluZyB0byB0aGUgZm9yd2FyZCBzZWxlY3Rpb24gcHJvY2VzcyBhcyBkZXNjcmliZWQgYW5kIGFsc28gZG8gdGhlIGJhY2t3YXJkIGVsbWluaWF0aW9uIGFuZCBjb21wYXJlIHRoZSByZXN1bHRzLgoKIyMjIEZvcndhcmQgU2VsZWN0aW9uCgoqIFN0ZXAgMTogc3RhcnQgd2l0aCBlbXB0eSBtb2RlbAoKYGBge3J9CmxtX2ZzX20wIDwtIGxtKE9ic2VydmF0aW9uIH4gMSwgZGF0YSA9IHRibF9zZWNvbmRfcHJvYmxlbSkKc3VtbWFyeShsbV9mc19tMCkKYGBgCgpTdW0gb2YgdGhlIHNxdWFyZWQgcmVzaWR1YWxzCgpgYGB7cn0KKHZlY19yZXNpZHVhbHMgPC0gcmVzaWR1YWxzKGxtX2ZzX20wKSkKY3Jvc3Nwcm9kKHZlY19yZXNpZHVhbHMpCmBgYAoKCgoKCgo=