Disclaimer

This document summarizes the content of chapter 1 of (Gondro, van der Werf, and Hayes 2013). The main objective of this summary is to transform the text and the programming code interspersed in it into a document based on the paradigm of Literate Programming.

Chapter 1 - R for Genome Wide Association Studies

Abstract

The authors state that R has become the de facto language of choice for statisticians and that it is the most widely used environment for analysing high-throughput genomic data. While these statements are certainly not wrong, one has to keep in mind that most statistical analyses are probably still done in MS Excel. Despite that fact, I would never recommend MS Excel to anyone for statistical analysis. More generally, I would not recommend a product only because a large number of people use it. From my point of view, R is widely used because it has a very active and supportive community, and because problems are solved much quicker than in any of the commercially available products.

Section 1 - Introduction

The introduction starts with the same statements that were already mentioned at the beginning of the abstract. Do not get me wrong, I am not criticizing the authors for saying anything wrong, but I do not find those arguments very convincing. Part of the problem is that no R code snippets are shown to support the argument.

Let me add two points in favor of R that I would use to convince people to use R. Readers who are not interested in these arguments, or who are already convinced of R, can safely skip the remainder of this subsection and jump to the summary of section 2 of chapter 1 of (Gondro, van der Werf, and Hayes 2013).

My two favorite arguments in favor of R are

  1. Fast Prototyping - which means that ideas are quickly converted into R-code and
  2. Literate Programming - R provides a lot of infrastructure to implement the paradigm of literate programming.

For all those who want to know more about the above two points, please read the post entitled Why R.

After this short digression, let us get back to (Gondro, van der Werf, and Hayes 2013).

Section 2 - Reading Data into R

It is true that it takes time to read genomic data of a certain size into R, but this is true for any system. Using the same dataset as in (Gondro, van der Werf, and Hayes 2013), we obtain the following timings.

sSnpDataFn <- "../data/square.txt"
(ptReadTable  <- system.time(myData <- read.table(file = sSnpDataFn, sep = "\t")))
##    user  system elapsed 
##  56.876   0.982  58.204
dim(myData)
## [1]  1000 50002

What about MS Excel?

As a very small side note, when looking at the dimension of the data.frame myData, we can see that it has 50002 columns. According to Excel specifications and limits, the number of columns in a single worksheet is still limited to 16,384, which makes it impossible to read this dataset into a single worksheet of MS Excel. I clearly prefer waiting 58.204 seconds for the data to be read over not being able to read the data at all.

Back on track to where we were before that side note, we still want to read that SNP dataset into R, and we would like to do it a little faster than what we just saw with a bare-bones read.table() command. One way of doing this is to tell the read.table() function what data type to expect in which column. Data types are specified using the argument colClasses, which is set to a vector of strings where each vector element contains the data type of the corresponding data column.

(ptReadTableColClasses <- system.time(myData <- read.table(file = sSnpDataFn, sep = "\t", 
                                                           colClasses = c("character", "factor", 
                                                                          rep("numeric",50001)))))
##    user  system elapsed 
##  52.369   0.563  53.074

Oops, why is this not working?

When we compare the above statement to the original one in (Gondro, van der Werf, and Hayes 2013), we can see that there is a difference. The vector assigned to the colClasses argument in the book is missing the first element. The statement from the book, when copy-pasted to the R prompt, gives the following error:

myData <- read.table(file = sSnpDataFn, sep = "\t", colClasses = c("factor", rep("numeric",50001)))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got 'A'

The reason for the error can be found in the help page for read.table(), which says for the argument colClasses that row names must also be specified in the vector of column classes. Therefore, when "character" is added as the first element of the colClasses vector, the error disappears.

Bummer - Literate Programming Rocks!

This is really unbelievable; the authors of the first chapter in (Gondro, van der Werf, and Hayes 2013) could not have given me a bigger gift with this error. I really feel blessed to be able to present such a fantastic show-case in favor of literate programming. Had the authors used a literate programming approach, such as we are doing here, they could have spared themselves the embarrassing and painful experience of showing an incorrect statement already in the second code snippet they present. For a reader of the book who really wants to use the examples, it is very frustrating to hit an error already in the second statement.

In the text after the code snippet producing the error, there is yet another problem. The authors mention that performance could still be marginally improved by specifying the number of rows in the table beforehand, in our case using ncols=50000. But this does not work. First, we do not have 50000 rows but only 1000 rows. Second, the function read.table() does not have an argument called ncols. Such an argument would make no sense at all, because the number of columns is already determined by the colClasses vector. Probably what the authors wanted to say is to specify the number of rows using the nrows argument, as shown below.

Adding more options concerning the comment character and the number of rows, we can hope to still get a little faster.

(ptReadTableNRows <- system.time(myData <- read.table(file = sSnpDataFn, sep = "\t", 
                                                           colClasses = c("character", "factor", 
                                                                          rep("numeric",50001)),
                                                           comment.char = "",
                                                           nrows = 1000)))
##    user  system elapsed 
##  50.698   0.433  51.304

Comparing read.table results

So far we have seen three different versions of how to read a dataset into R using the read.table() function. Now let us compare the timings that we have measured so far.

Read.table version    Total time (in s)    Improvement over minimal (in %)
minimal               58.204               -
colClasses            53.074               9
nRows                 51.304               12

Comparing the timings in the above table, we observe much smaller differences than reported in (Gondro, van der Werf, and Hayes 2013). It is difficult to find a reason for that, because such timings are hard to compare across machines and across different versions of R. Information that tells the reader how the results in this document were produced is included in the section Session Info, whose content is just the output of the R function sessionInfo().

Alternatives to read.table

Using the function read.table() we were able to read all data, phenotypes and genotypes, from the same file. In principle we did not even have to know the dimension of the dataset or which data types were contained in which columns; read.table() figures all of that out by itself.
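
As a quick check, we can inspect the classes that the first few columns of the data.frame myData ended up with (a minimal sketch of my own; the column selection 1:5 is arbitrary and not taken from the book).

sapply(myData[, 1:5], class)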

Now let us assume that genotypes are stored in a separate file and that they are all encoded by numerical values. Furthermore, we know the dimensions of the genotype matrix, corresponding to the number of SNPs times the number of samples. Then it is possible to read the genotypes using the function scan(). Because scan() is a low-level function implemented in C, it should be faster than read.table(), and combining scan() with matrix() is probably the fastest way to read data in tabular form into R.

(ptMatrixScan <- system.time(myGeno <- matrix(scan(file = "../data/genotypes.txt", 
                                                   what = integer(), 
                                                   sep = "\t"), 
                                              nrow = 50000, 
                                              ncol = 1000, 
                                              byrow = TRUE)))
##    user  system elapsed 
##  18.749   0.498  19.317

This reads all genotypes from the file and stores them in a matrix of dimension 50000 \(\times\) 1000. Reading the genotypes with scan() takes 19.317 seconds. Adding this result to the above table allows us to compare all methods that transform the input SNP genotypes into tabular form. The comparison is not completely fair, because read.table() also reads the phenotypes, which matrix(scan()) does not.

Input version    Total time (in s)    Improvement over minimal (in %)
minimal          58.204               -
colClasses       53.074               9
nRows            51.304               12
matrix-scan      19.317               67

Looking at the above table, it is clear that only matrix(scan()) leads to a significant improvement of the time it takes to import our example dataset. But specifying all the options, like the number of rows or the column classes, has another important advantage: it allows us to check the consistency between what we think should be in the dataset and what R really finds in the data file specified as input.
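
As a small illustration of such a consistency check (my own addition, not taken from the book), we can verify after reading that the resulting objects really have the dimensions we expect; if the file content deviates from our expectation, the check fails with an error.

# stop with an error if the objects do not have the expected dimensions
stopifnot(identical(dim(myData), c(1000L, 50002L)))
stopifnot(identical(dim(myGeno), c(50000L, 1000L)))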

In case we do not need the data in tabular form, we can use functions like readLines() or readChar(). The former reads the content into a character vector, where each line of the input file is stored as one vector element. readChar() reads a given number of characters and returns a character vector with one element per requested item. These functions are faster than scan(), but they are only useful if we can work with the data as a vector of characters. First reading the data using readLines() or readChar() and then parsing it with high-level R functions is probably not faster than using scan() in the first place. Because the results we get from read.table() and matrix(scan()) on the one side and from readLines() and readChar() on the other side are very different, a direct comparison of the timing results does not make a whole lot of sense.
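
To illustrate readChar(), the following sketch (my own addition; the use of file.info() to determine the number of characters is an assumption that works for plain ASCII files) pulls the complete file content into a single string in one go.

sGenoFn <- "../data/genotypes.txt"
# number of bytes in the file, used as the number of characters to read
nFileSize <- file.info(sGenoFn)$size
sGenoRaw <- readChar(con = sGenoFn, nchars = nFileSize)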

The function readLines() can be very useful if we have a really big dataset that does not fit into memory. In that case we can use the following construct, which reads the content line by line and stores the current line in curLine.

conInput <- file(description = "../data/genotypes.txt", open = "r")
nrLine <- 0
sumFirstElement <- 0
# read one line at a time until the end of the file is reached
while (length(curLine <- readLines(con = conInput, n = 1)) > 0) {
  # split the current line at the tab separator into its elements
  vecCurElement <- unlist(strsplit(curLine, split = "\t", perl = TRUE))
  # accumulate the first element of each line and count the lines
  sumFirstElement <- sumFirstElement + as.numeric(vecCurElement[1])
  nrLine <- nrLine + 1
}
close(conInput)
cat("Number of lines read: ", nrLine, "\n")
## Number of lines read:  50000
cat("Sum over first elements: ", sumFirstElement)
## Sum over first elements:  50057

The above code snippet is not doing anything terribly clever; it just illustrates how one can use readLines() to read a file line by line. Once a line is read from the file, it is stored in the one-element character vector curLine, which is then parsed into its elements by splitting the line at the separator. After the split, the above code simply sums over the first element of each line and counts the number of lines read.

With the advent of big data, the above construct becomes more and more important, because nowadays we often do not have the complete dataset to be analysed in one single file. Most likely we just have a stream of data that is sent to us from some server, and we have to process that stream. For example, suppose we want to analyse certain properties of SNPs from a large SNP database, or we want to link certain characteristics of SNPs to properties found in sequence databases such as GenBank, but we have neither the time to wait for the SNP database and the complete GenBank database to be downloaded nor the money to store all this data. So we ask these database servers to send us a stream of interesting data, process this data on the fly and store just the results.
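
The line-by-line construct from above carries over directly to such a scenario, because readLines() works on any connection, not just on files. The following sketch uses a purely hypothetical URL for illustration; only the counting of records is shown, the actual processing would go inside the loop body.

# hypothetical URL, used only for illustration
conUrl <- url(description = "http://example.org/snp_stream.txt", open = "r")
nrRecords <- 0
while (length(curLine <- readLines(con = conUrl, n = 1)) > 0) {
  # process the current record here and keep only the results
  nrRecords <- nrRecords + 1
}
close(conUrl)
cat("Number of records processed: ", nrRecords, "\n")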

Conclusions of Section 2

As we have shown with our timings in the section above, reading data into R does require some time. But still, reading a dataset on the order of \(10^7\) data points requires less than one minute. This seems quite acceptable to me.

Whenever we have to read the exact same dataset more than once, the fastest approach is to read the dataset from the file once, save the R object in binary format and from then on load it again using the load() command. For the above data.frame myData this is done as follows.

sDataBinFn <- "../data/mydata.bin"
save(myData, file = sDataBinFn)
(ptBinLoad <- system.time(load(file = sDataBinFn)))
##    user  system elapsed 
##   1.128   0.058   1.190

Comparing the above timing to the minimal version using the read.table() function shows that loading a data object from a binary file is about 49 times faster than the minimal version (58.204 / 1.190 ≈ 48.9). This makes clear that reading the exact same dataset from a text file more than once is a waste of time.

There is a clear trade-off between the time it takes to read a certain dataset from a file and the amount of information we have to know beforehand about this dataset. Whenever we need the data in tabular form, such as a data.frame or a matrix, it takes longer to read and convert the data than when we just read it into a vector of strings.

The function readLines() allows us to read a file line by line and thereby offers the possibility of processing streams of data, as commonly encountered in big data projects.

Section 3 - Loops and Vectorization

It is true, R is neither FORTRAN nor C. That means that programming practices which are common in FORTRAN and C normally do not translate very well into R, where translation means a purely syntactic conversion of FORTRAN or C code into R statements. The fact that the R kernel is written in C and that many compiled libraries available in R are written in FORTRAN does not change the fact that programming in R requires different concepts than programming in FORTRAN and C. This fundamental difference means that problems require a different way of thinking when they are to be addressed in R. One important thing to remember is that the basic data object in R is a vector; hence many R functions are optimized to work well with vectors as input arguments.
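
As a small illustration of this way of thinking (my own example, not taken from the book), the frequency of one allele in a vector of genotypes coded as 0, 1 and 2 can be computed either with an explicit loop or with a single vectorized expression.

geno <- c(0, 1, 2, 1, 0, 2, 2, 1)
# loop-style computation of the allele frequency
sumAllele <- 0
for (i in seq_along(geno)) sumAllele <- sumAllele + geno[i]
freqLoop <- sumAllele / (2 * length(geno))
# vectorized computation giving the same result in one expression
freqVec <- sum(geno) / (2 * length(geno))
c(freqLoop, freqVec)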

In (Gondro, van der Werf, and Hayes 2013) the authors mention the term reallocation of memory. Afterwards an example is given with a loop that grows a matrix dynamically inside the loop body. This is indeed very inefficient, especially when the final size of the matrix is known beforehand. What is meant here by memory reallocation is that whenever a data structure such as a matrix is grown dynamically, the system has to allocate more memory for that growing data structure. Besides the resource requirements due to the repeated memory allocation, the allocated memory also gets fragmented across the whole memory space. Fragmentation happens because the system cannot guarantee that a newly allocated chunk of memory for a growing data object can be placed directly adjacent to an already existing memory block previously reserved for that same object. The bottom line is: while growing data objects dynamically is very convenient for the programmer, it is very inefficient in its use of resources. Hence, whenever possible, growing data objects dynamically should be avoided. What is described here about the inefficiencies of dynamically growing data objects in R is equally true for other programming languages.
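
The following sketch illustrates the point with a small example of my own (not the example from the book): the same matrix is once grown row by row inside a loop and once pre-allocated to its final size and filled in place.

nRow <- 1000
nCol <- 100
# growing the matrix row by row forces repeated memory allocation
ptGrow <- system.time({
  matGrow <- NULL
  for (i in 1:nRow) matGrow <- rbind(matGrow, rnorm(nCol))
})
# allocating the full matrix once and filling it in place
ptPrealloc <- system.time({
  matPre <- matrix(0, nrow = nRow, ncol = nCol)
  for (i in 1:nRow) matPre[i, ] <- rnorm(nCol)
})
ptGrow["elapsed"]
ptPrealloc["elapsed"]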

To be continued ...

Session Info

sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.4     evaluate_0.5.5   formatR_1.0      htmltools_0.2.4 
## [5] knitr_1.6        rmarkdown_0.3.10 stringr_0.6.2    tools_3.1.1     
## [9] yaml_2.1.13

References

Gondro, Cedric, Julius van der Werf, and Ben Hayes. 2013. “Genome-Wide Association Studies and Genomic Prediction.” http://www.springer.com/life+sciences/systems+biology+and+bioinformatics/book/978-1-62703-446-3.