This notebook shows the difference between the two data types ‘strings’ and ‘factors’ in R.
The following dataset is used. It is based on the data used in exercise 5.
# read tibble from file
s_wwg_path <- 'https://charlotte-ngs.github.io/lbgfs2020/misc/weaningweightbeef.csv'
tbl_beef_data <- readr::read_csv(file = s_wwg_path)
Parsed with column specification:
cols(
  Animal = col_double(),
  Sire = col_double(),
  Dam = col_double(),
  Herd = col_double(),
  `Weaning Weight` = col_double()
)
tbl_beef_data
l_sex <- list(abbrev = c('F', 'M'), name = c('female', 'male'))
We add a column ‘Sex’ to the data, where F stands for a female animal and M for a male animal. The values are assigned at random with sample().
tbl_beef_data$Sex <- sample(l_sex$abbrev, size = nrow(tbl_beef_data), replace = TRUE)
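tbl_beef_data
Based on the above output, we see that the column ‘Sex’ is of type character, hence a ‘string’. If this data is written to a file and read back with read.csv2(), the type of the column depends on the option ‘stringsAsFactors’.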
s_ex_beef_data <- 'extended_beef_data.csv'
# the argument `path` was renamed to `file` in readr 1.4.0
readr::write_csv2(tbl_beef_data, file = s_ex_beef_data)
df_ex_beef_str <- read.csv2(file = s_ex_beef_data)
df_ex_beef_str
Setting the option ‘stringsAsFactors’ to TRUE converts all character columns into factors when the file is read.
df_ex_beef_fct <- read.csv2(file = s_ex_beef_data, stringsAsFactors = TRUE)
df_ex_beef_fct
Both data frames are now used in a linear model.
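Before fitting, it can help to confirm how the two ways of reading the file differ in their column types. The following is a minimal sketch, not part of the original notebook: the data in csv_text and the names df_str and df_fct are hypothetical and only mimic the layout of the file written above.

```r
# Hypothetical mini data set in the semicolon-separated format that
# write_csv2() produces; the values are made up for illustration only.
csv_text <- "Animal;Herd;Sex\n1;1;F\n2;1;M\n3;2;F\n4;2;M"

# Character columns stay character (the default since R 4.0.0)
df_str <- read.csv2(text = csv_text, stringsAsFactors = FALSE)

# Character columns are converted to factors while reading
df_fct <- read.csv2(text = csv_text, stringsAsFactors = TRUE)

class(df_str$Sex)    # "character"
class(df_fct$Sex)    # "factor"
levels(df_fct$Sex)   # "F" "M"

# Numeric columns such as Herd are not affected by the option
class(df_fct$Herd)   # "integer"
```

The option only touches character columns. A numeric column such as Herd has to be converted explicitly with as.factor() if it should be treated as a factor.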
lm_str <- lm(Weaning.Weight ~ Sex + Herd, data = df_ex_beef_str)
summary(lm_str)
Call:
lm(formula = Weaning.Weight ~ Sex + Herd, data = df_ex_beef_str)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.46301 -0.14199 -0.01330  0.09358  0.64699 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.63972    0.35089   7.523 4.36e-06 ***
SexM        -0.12864    0.20312  -0.633    0.538    
Herd        -0.05807    0.19667  -0.295    0.772    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3368 on 13 degrees of freedom
Multiple R-squared:  0.03003,   Adjusted R-squared:  -0.1192 
F-statistic: 0.2012 on 2 and 13 DF,  p-value: 0.8202
lm_fct <- lm(Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct)
summary(lm_fct)
Call:
lm(formula = Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.46301 -0.14199 -0.01330  0.09358  0.64699 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.63972    0.35089   7.523 4.36e-06 ***
SexM        -0.12864    0.20312  -0.633    0.538    
Herd        -0.05807    0.19667  -0.295    0.772    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3368 on 13 degrees of freedom
Multiple R-squared:  0.03003,   Adjusted R-squared:  -0.1192 
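F-statistic: 0.2012 on 2 and 13 DF,  p-value: 0.8202
Add an additional herd
The same model is fitted to the data frame with the additional herd, df_ex_beef_fct_hd. The column Herd is still numeric at this point.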
lm_fct_hd <- lm(Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct_hd)
summary(lm_fct_hd)
Call:
lm(formula = Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct_hd)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.45414 -0.13178 -0.01757  0.07147  0.65586 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.550655   0.325482   7.837  2.8e-06 ***
SexM        -0.101690   0.216483  -0.470    0.646    
Herd        -0.004828   0.153718  -0.031    0.975    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3379 on 13 degrees of freedom
Multiple R-squared:  0.0236,    Adjusted R-squared:  -0.1266 
F-statistic: 0.1571 on 2 and 13 DF,  p-value: 0.8562
Changing the data type of Herd into a factor
df_ex_beef_fct_hd$Herd <- as.factor(df_ex_beef_fct_hd$Herd)
df_ex_beef_fct_hd
lm_fct_hd <- lm(Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct_hd)
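summary(lm_fct_hd)
For every predictor column of a numeric type (integer or double), the lm() function in R fits a regression, that is, a single slope. For every predictor column of type factor, lm() treats the variable as a fixed effect and estimates one effect per level relative to the reference level. Character columns are converted to factors internally by lm(), which is why lm_str and lm_fct give identical results.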
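The difference between a regression and a fixed effect can be seen in the names of the estimated coefficients. The following is a minimal sketch on invented data: the data frame df_demo, its column Weight and all values are hypothetical and are not taken from the beef dataset.

```r
# Hypothetical data with three herds; all values are invented.
df_demo <- data.frame(
  Weight = c(2.6, 2.4, 2.9, 2.5, 3.1, 2.8, 2.7, 3.0, 2.3),
  Sex    = c("F", "M", "F", "M", "F", "M", "F", "M", "F"),
  Herd   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Herd is numeric: lm() fits one slope for Herd
lm_num <- lm(Weight ~ Sex + Herd, data = df_demo)
names(coef(lm_num))   # "(Intercept)" "SexM" "Herd"

# Herd is a factor: lm() fits one effect per non-reference level
df_demo_fct <- df_demo
df_demo_fct$Herd <- as.factor(df_demo_fct$Herd)
lm_fac <- lm(Weight ~ Sex + Herd, data = df_demo_fct)
names(coef(lm_fac))   # "(Intercept)" "SexM" "Herd2" "Herd3"

# Sex as character or as factor leads to the same estimates,
# because lm() converts character predictors to factors
df_demo_sexfct <- df_demo
df_demo_sexfct$Sex <- as.factor(df_demo_sexfct$Sex)
lm_sexfct <- lm(Weight ~ Sex + Herd, data = df_demo_sexfct)
all.equal(coef(lm_num), coef(lm_sexfct))
```

With only two herds the numeric and the factor version of Herd both produce a single coefficient, so the distinction only becomes visible once a third herd is present.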