This notebook shows the difference between the two data types ‘string’ and ‘factor’ in R.
The following data set is used. It is based on the data used in exercise 5.
### # read tibble from file
s_wwg_path <- 'https://charlotte-ngs.github.io/lbgfs2020/misc/weaningweightbeef.csv'
tbl_beef_data <- readr::read_csv(file = s_wwg_path)
Parsed with column specification:
cols(
Animal = col_double(),
Sire = col_double(),
Dam = col_double(),
Herd = col_double(),
`Weaning Weight` = col_double()
)
tbl_beef_data
l_sex <- list(abbrev = c('F', 'M'), name = c('female', 'male'))
We add a column ‘Sex’ to the data, where F stands for a female animal and M for a male animal. The values are drawn at random with sample().
tbl_beef_data$Sex <- sample(l_sex$abbrev, size = nrow(tbl_beef_data), replace = TRUE)
tbl_beef_data
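Based on the above output, we see that the column ‘Sex’ is a character, hence a ‘string’. The data is now written to a file and read again with read.csv2(). Whether ‘Sex’ is read as a string or as a factor depends on the option ‘stringsAsFactors’, whose default is FALSE since R 4.0.0.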
s_ex_beef_data <- 'extended_beef_data.csv'
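### # write the extended data to a file; the argument name 'path' is deprecated since readr 1.4.0, so the file is passed by position
readr::write_csv2(tbl_beef_data, s_ex_beef_data)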
df_ex_beef_str <- read.csv2(file = s_ex_beef_data)
df_ex_beef_str
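How a column is stored can be checked directly instead of reading it off the printed output. The following is a minimal sketch on a small made-up data frame (df_toy and fct_sex are hypothetical names, not part of the beef data). It shows that a string column only holds text, whereas a factor stores a fixed set of levels together with integer codes.

```r
# Hypothetical toy data, not the beef data set used in this notebook
df_toy <- data.frame(Animal = 1:4,
                     Sex    = c('F', 'M', 'M', 'F'),
                     stringsAsFactors = FALSE)

# A string column is of class character
class(df_toy$Sex)      # "character"

# Converting the strings to a factor yields levels and integer codes
fct_sex <- factor(df_toy$Sex)
class(fct_sex)         # "factor"
levels(fct_sex)        # "F" "M"
as.integer(fct_sex)    # 1 2 2 1
```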
Setting the option ‘stringsAsFactors’ to TRUE turns the character column ‘Sex’ into a factor when the file is read.
df_ex_beef_fct <- read.csv2(file = s_ex_beef_data, stringsAsFactors = TRUE)
df_ex_beef_fct
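The effect of ‘stringsAsFactors’ can be verified by comparing the column classes after reading. The sketch below assumes a small made-up data frame and a temporary file (df_toy and s_tmp are hypothetical), and it uses write.csv2() from base R so that it runs without readr. The option is set explicitly in both calls, so the result does not depend on the R version.

```r
# Hypothetical toy data written to a temporary csv2 file
df_toy <- data.frame(Animal = 1:4,
                     Sex    = c('F', 'M', 'M', 'F'),
                     stringsAsFactors = FALSE)
s_tmp <- tempfile(fileext = '.csv')
write.csv2(df_toy, file = s_tmp, row.names = FALSE)

# Read the same file twice, once with each setting of the option
df_str <- read.csv2(file = s_tmp, stringsAsFactors = FALSE)
df_fct <- read.csv2(file = s_tmp, stringsAsFactors = TRUE)

# Compare how the columns are stored
sapply(df_str, class)   # Animal: "integer", Sex: "character"
sapply(df_fct, class)   # Animal: "integer", Sex: "factor"

# Remove the temporary file
unlink(s_tmp)
```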
Both data frames are now used in a linear model that fits the weaning weight with ‘Sex’ and ‘Herd’ as predictors.
lm_str <- lm(Weaning.Weight ~ Sex + Herd, data = df_ex_beef_str)
summary(lm_str)
Call:
lm(formula = Weaning.Weight ~ Sex + Herd, data = df_ex_beef_str)
Residuals:
     Min       1Q   Median       3Q      Max
-0.46301 -0.14199 -0.01330  0.09358  0.64699
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.63972    0.35089   7.523 4.36e-06 ***
SexM        -0.12864    0.20312  -0.633    0.538
Herd        -0.05807    0.19667  -0.295    0.772
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3368 on 13 degrees of freedom
Multiple R-squared: 0.03003, Adjusted R-squared: -0.1192
F-statistic: 0.2012 on 2 and 13 DF, p-value: 0.8202
lm_fct <- lm(Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct)
summary(lm_fct)
Call:
lm(formula = Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct)
Residuals:
     Min       1Q   Median       3Q      Max
-0.46301 -0.14199 -0.01330  0.09358  0.64699
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.63972    0.35089   7.523 4.36e-06 ***
SexM        -0.12864    0.20312  -0.633    0.538
Herd        -0.05807    0.19667  -0.295    0.772
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3368 on 13 degrees of freedom
Multiple R-squared: 0.03003, Adjusted R-squared: -0.1192
F-statistic: 0.2012 on 2 and 13 DF, p-value: 0.8202
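The two summaries above are identical because lm() converts character predictors to factors when it builds the model frame. This can be checked without comparing the printed output by eye. The following minimal sketch uses made-up weights (df_str, df_fct, lm_s and lm_f are hypothetical and unrelated to the beef data).

```r
# Hypothetical toy data with 'Sex' stored as a string
df_str <- data.frame(Weight = c(2.5, 2.9, 3.1, 2.4, 2.8, 3.0),
                     Sex    = c('F', 'M', 'M', 'F', 'F', 'M'),
                     stringsAsFactors = FALSE)

# The same data with 'Sex' stored as a factor
df_fct <- df_str
df_fct$Sex <- factor(df_fct$Sex)

# Fit the same model to both versions
lm_s <- lm(Weight ~ Sex, data = df_str)
lm_f <- lm(Weight ~ Sex, data = df_fct)

# The estimated coefficients agree
all.equal(coef(lm_s), coef(lm_f))   # TRUE
names(coef(lm_s))                   # "(Intercept)" "SexM"
```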
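An additional herd is added to the data, which gives the data frame df_ex_beef_fct_hd.
The same model is fitted again, with ‘Herd’ still stored as a number.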
lm_fct_hd <- lm(Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct_hd)
summary(lm_fct_hd)
Call:
lm(formula = Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct_hd)
Residuals:
     Min       1Q   Median       3Q      Max
-0.45414 -0.13178 -0.01757  0.07147  0.65586
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.550655   0.325482   7.837  2.8e-06 ***
SexM        -0.101690   0.216483  -0.470    0.646
Herd        -0.004828   0.153718  -0.031    0.975
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3379 on 13 degrees of freedom
Multiple R-squared: 0.0236, Adjusted R-squared: -0.1266
F-statistic: 0.1571 on 2 and 13 DF, p-value: 0.8562
Changing the data type of the column ‘Herd’ into a factor
df_ex_beef_fct_hd$Herd <- as.factor(df_ex_beef_fct_hd$Herd)
df_ex_beef_fct_hd
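After a conversion with as.factor(), the original numbers survive only as level labels, while the factor itself stores integer codes. The sketch below uses made-up herd numbers (vec_herd and fct_herd are hypothetical) to show the difference, and how the original values can be recovered if they are needed again.

```r
# Hypothetical herd numbers, chosen so that codes and labels differ
vec_herd <- c(10, 20, 30, 10)
fct_herd <- as.factor(vec_herd)

# The numbers become level labels
levels(fct_herd)                      # "10" "20" "30"

# The internal codes are positions in the levels, not the herd numbers
as.integer(fct_herd)                  # 1 2 3 1

# Going through the labels recovers the original numbers
as.numeric(as.character(fct_herd))    # 10 20 30 10
```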
lm_fct_hd <- lm(Weaning.Weight ~ Sex + Herd, data = df_ex_beef_fct_hd)
summary(lm_fct_hd)
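For every column of a numeric type (integer or double), the lm() function in R fits a regression on the numeric values, which yields a single slope. For every column of type factor, the lm() function uses the variable as a fixed effect: by default, one level serves as the reference and each remaining level gets its own coefficient. Character columns such as ‘Sex’ are converted to factors by lm(), which is why lm_str and lm_fct give identical results.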
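The difference between a regression and a fixed effect shows up in the number of estimated coefficients. The following minimal sketch assumes made-up weights for three herds (df_num, df_fac, lm_num and lm_fac are hypothetical and will not reproduce the numbers shown in this notebook).

```r
# Hypothetical toy data: three herds with three animals each
df_num <- data.frame(Weight = c(2.5, 2.9, 3.1, 2.4, 2.8, 3.0, 2.7, 2.6, 3.2),
                     Herd   = rep(1:3, times = 3))

# The same data with 'Herd' converted to a factor
df_fac <- df_num
df_fac$Herd <- as.factor(df_fac$Herd)

# Numeric herd: one regression slope
lm_num <- lm(Weight ~ Herd, data = df_num)
names(coef(lm_num))   # "(Intercept)" "Herd"

# Factor herd: herd 1 is the reference, herds 2 and 3 get their own effect
lm_fac <- lm(Weight ~ Herd, data = df_fac)
names(coef(lm_fac))   # "(Intercept)" "Herd2" "Herd3"
```

With a numeric ‘Herd’, the model assumes that the expected weight changes by the same amount from herd 1 to herd 2 as from herd 2 to herd 3, which is rarely meaningful for herd numbers that serve only as labels.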