Simlate Genomic Data

This notebook is used to simulate a small genomic dataset that is analysed in the course GELASMSS2019. The material shown here is based on the JuliaCourse held in december 2018 in Munich.

In a first step, packages Distributions and Random must be loaded. After that, we are setting a seed.

In [17]:
using Distributions, Random
Random.seed!(2345);

Initialize The Sampler

In a first step we take the code from day2/dataSimulation and modify it to a much smaller example.

In [18]:
using XSim
chrLength = 1.0
numChr    = 1
numLoci   = 100
mutRate   = 0.0
locusInt  = chrLength/numLoci
mapPos   = collect(0:locusInt:(chrLength-0.0001))
geneFreq = fill(0.5,numLoci)
XSim.build_genome(numChr,chrLength,numLoci,geneFreq,mapPos,mutRate)
In [19]:
### #checking the result
XSim.G
Out[19]:
XSim.GenomeInfo(XSim.ChromosomeInfo[], 0, 0.0, 0.0, Int64[], Float64[])

Random Mating In Finite Populations

We start by generating a small founder population.

In [20]:
### # specify the number of sires and dams
popSizeFounderSire=2
popSizeFounderDam=3
### # sample the founder population
sires = sampleFounders(popSizeFounderSire)
dams = sampleFounders(popSizeFounderDam)
Sampling 2 animals into base population.
Sampling 3 animals into base population.
Out[20]:
XSim.Cohort(XSim.Animal[Animal(XSim.Chromosome[Chromosome([1, 0, 1, 0, 1, 0, 0, 1, 0, 1  …  0, 1, 1, 1, 0, 1, 0, 1, 0, 0], [5], [0.0])], XSim.Chromosome[Chromosome([1, 0, 1, 0, 1, 1, 1, 0, 0, 0  …  0, 1, 1, 1, 1, 1, 1, 0, 0, 1], [6], [0.0])], Float64[], 3, 0, 0, -9999.0, -9999.0, -9999.0), Animal(XSim.Chromosome[Chromosome([0, 0, 1, 0, 0, 0, 0, 0, 1, 1  …  1, 1, 1, 0, 1, 1, 0, 0, 1, 0], [7], [0.0])], XSim.Chromosome[Chromosome([1, 0, 1, 1, 0, 1, 1, 1, 0, 0  …  0, 0, 1, 1, 0, 1, 0, 1, 1, 1], [8], [0.0])], Float64[], 4, 0, 0, -9999.0, -9999.0, -9999.0), Animal(XSim.Chromosome[Chromosome([0, 0, 0, 0, 0, 1, 1, 1, 0, 0  …  0, 1, 1, 1, 0, 0, 1, 0, 1, 0], [9], [0.0])], XSim.Chromosome[Chromosome([1, 1, 0, 0, 1, 0, 1, 1, 0, 1  …  1, 0, 0, 0, 0, 1, 0, 1, 1, 0], [10], [0.0])], Float64[], 5, 0, 0, -9999.0, -9999.0, -9999.0)], Array{Int64}(0,0))

The function sampleFounders() uses the information from the initialized genome stored in XSim.G. The returned results are assigned to sires and dams which are cohorts.

Check

We do an intermediate check of the sampled genotypes.

In [21]:
animalFounders = concatCohorts(sires,dams)
MFounders = getOurGenotypes(animalFounders)
Out[21]:
5×100 Array{Int64,2}:
 2  1  0  2  2  1  1  2  1  1  1  0  1  …  1  2  1  1  1  0  2  1  1  2  0  1
 1  1  1  1  0  0  1  1  0  0  1  2  0     1  1  0  0  1  1  0  1  1  2  0  2
 2  0  2  0  2  1  1  1  0  1  2  1  1     2  1  0  2  2  2  1  2  1  1  0  1
 1  0  2  1  0  1  1  1  1  1  1  1  2     0  1  1  1  2  1  1  2  0  1  2  1
 1  1  0  0  1  1  2  2  0  1  0  2  0     2  0  1  1  1  1  0  1  1  1  2  0

The function concatCohorts() is used to combine two cohorts. The function getOurGenotypes() extracts the genotypes of all animals in the specified cohort.

Random Mating

The founder cohorts are used to generate a first generation via randomly mating the sires and dams from the founder cohort.

In [22]:
ngen,popSize = 1,5
sires1,dams1,gen1 = sampleRan(popSize, ngen, sires, dams);
Generation     2: sampling     2 males and     2 females

We use sires1 and dams1 to produce a second generation

In [23]:
ngen, popSize = 1,11
sires2,dams2,gen2 = sampleRan(popSize, ngen, sires1, dams1);
Generation     2: sampling     6 males and     6 females

All sampled animals are combined into a single cohort called animals.

In [24]:
animals=concatCohorts(animalFounders,sires1,dams1,sires2,dams2);

We are extracting the genotypes of all animals into an array

In [25]:
M = getOurGenotypes(animals);

Randomly Sample The QTL Positions

We fix the number of QTL and initialize an index vector with the same length as the number of loci to all FALSE. Then later a random sample in this vector will be set to TRUE. These positions then represent the QTL

In [26]:
nQTL   = 5;
selQTL = fill(false,numLoci);

Using the sample() function to determine the QTL positions

In [27]:
selQTL[sample(1:numLoci, nQTL, replace=false)] .= true;

All positions that are not QTL are set to be markers

In [28]:
selMrk =.!selQTL;

Genotypes of markers and QTL are separated into two different matrices

In [29]:
Q = M[:,selQTL]
Out[29]:
21×5 Array{Int64,2}:
 0  0  0  2  1
 1  2  2  2  0
 2  1  1  1  2
 2  2  1  1  1
 1  1  1  0  1
 1  1  1  2  1
 1  0  1  2  1
 2  2  1  1  0
 1  1  0  1  1
 2  1  0  2  0
 1  1  0  1  0
 2  0  0  2  2
 2  1  1  1  0
 0  0  0  1  0
 2  2  1  1  0
 1  0  0  1  1
 1  1  1  2  1
 2  2  2  2  0
 2  1  0  1  0
 1  2  1  1  0
 1  1  2  2  1
In [30]:
X = M[:,selMrk]
Out[30]:
21×95 Array{Int64,2}:
 2  1  0  2  2  1  1  2  1  1  1  0  1  …  1  1  2  1  1  0  2  1  1  2  0  1
 1  1  1  1  0  0  1  1  0  0  1  2  0     0  1  1  0  1  1  0  1  1  2  0  2
 2  0  2  0  2  1  1  1  0  1  2  1  1     0  2  1  0  2  2  1  2  1  1  0  1
 1  0  2  1  0  1  1  1  1  1  1  1  2     0  0  1  1  2  1  1  2  0  1  2  1
 1  1  0  0  1  1  2  2  0  1  0  2  0     1  2  0  1  1  1  0  1  1  1  2  0
 1  1  1  0  1  0  1  2  0  1  1  1  1  …  0  2  1  0  1  1  0  2  1  2  0  1
 2  0  1  1  2  1  0  2  0  2  1  0  2     1  1  2  0  1  1  2  1  2  1  0  2
 1  0  1  1  0  1  1  1  0  0  1  2  0     1  1  0  1  1  1  0  1  0  2  1  1
 1  0  0  1  1  1  2  2  1  0  1  1  0     1  2  1  2  0  0  1  1  1  2  1  1
 1  0  1  0  1  1  1  2  0  1  1  1  1     1  0  1  0  1  1  1  0  1  2  0  2
 2  0  2  1  1  0  0  1  0  1  1  1  1  …  2  1  1  1  0  0  1  1  1  2  1  1
 1  0  1  0  1  1  1  2  0  1  1  1  1     0  2  1  1  1  1  1  1  1  2  0  1
 2  0  2  1  1  0  0  1  0  1  2  1  1     2  1  1  1  0  0  1  1  1  2  1  1
 2  0  0  2  2  1  1  2  1  1  1  0  1     2  1  1  1  0  0  1  1  1  2  1  1
 2  0  2  1  1  0  0  1  0  1  2  1  1     1  2  1  1  0  0  0  2  1  2  1  1
 1  0  1  0  1  1  1  2  1  1  2  0  1  …  1  2  0  1  1  1  0  2  0  2  1  0
 1  1  1  1  0  0  1  1  0  0  1  2  0     0  1  0  0  2  2  0  1  0  2  0  1
 1  0  1  0  1  1  1  2  0  1  1  1  1     0  1  1  0  1  1  0  1  1  2  0  2
 1  0  0  1  1  2  1  2  0  1  0  1  1     2  1  1  1  0  0  1  1  1  2  0  2
 0  1  0  0  0  1  2  2  0  0  0  2  0     1  2  1  1  0  0  0  2  1  2  1  1
 2  0  1  2  1  1  0  1  0  1  1  1  1  …  0  1  1  0  2  2  1  1  1  1  0  2

Simulation of breeding values and phenotypic values

We start by setting a fixed number of significant QTL. In our case, this corrsponds to the number of columns of the matrix Q. These QTL get an associated effect which is then used to generate the breeding values called a. Then we add an intercept and a random error term.

In [36]:
nSigQTL = size(Q,2)
nObs = size(Q,1)
alphaSd = 1
alpha = rand(Normal(0,alphaSd),nSigQTL)
a = Q*alpha
# scaling breeding values to have variance 25.0
v = var(a)
genVar = 25.0
a *= sqrt(genVar/v)
va = var(a)
# formatted printing
println("genetic variance = $va")
genetic variance = 24.999999999999993

Computation of mean and sd require the package Statistics.

In [37]:
using Statistics
In [39]:
resVar = 75.0
resStd = sqrt(resVar)
e = rand(Normal(0,resStd),nObs)
intercept = 100
y = intercept .+ a + e
yMean = Statistics.mean(y)
yVar = Statistics.var(y)
println("phenotypic mean     = $yMean")
println("phenotypic variance = $yVar")
phenotypic mean     = 104.25030250893921
phenotypic variance = 58.26290252707984

Generated phenotypic values are assigned to the animals cohort. Starting with a single element.

In [51]:
animals.animalCohort[1].phenVal = y[1]
Out[51]:
100.42996466522034
In [52]:
animals.animalCohort[1].phenVal
Out[52]:
100.42996466522034

In a loop over the vector y of phenotypic observations, we assign them to the cohort animals.

In [58]:
for i in 1:nObs
    println("Assigning obs: ", i, " to ", y[i])
    animals.animalCohort[i].phenVal = y[i]
end
Assigning obs: 1 to 100.42996466522034
Assigning obs: 2 to 103.39573424417216
Assigning obs: 3 to 114.45820264928625
Assigning obs: 4 to 100.06804153951788
Assigning obs: 5 to 104.14371728574699
Assigning obs: 6 to 117.5240684780068
Assigning obs: 7 to 97.74377793862861
Assigning obs: 8 to 111.92639828882956
Assigning obs: 9 to 103.48594116967874
Assigning obs: 10 to 97.91400260635113
Assigning obs: 11 to 104.65102258283915
Assigning obs: 12 to 115.71445320910016
Assigning obs: 13 to 86.89981574814107
Assigning obs: 14 to 101.09720251088896
Assigning obs: 15 to 102.79456473016224
Assigning obs: 16 to 112.1820780186409
Assigning obs: 17 to 109.29515540354501
Assigning obs: 18 to 105.27092406050166
Assigning obs: 19 to 91.74425015266642
Assigning obs: 20 to 101.13193185566477
Assigning obs: 21 to 107.38510555013445

Checking back whether assignment worked with

In [62]:
P = getOurPhenVals(animals, resVar)
Out[62]:
21-element Array{Float64,1}:
 100.42996466522034
 103.39573424417216
 114.45820264928625
 100.06804153951788
 104.14371728574699
 117.5240684780068 
  97.74377793862861
 111.92639828882956
 103.48594116967874
  97.91400260635113
 104.65102258283915
 115.71445320910016
  86.89981574814107
 101.09720251088896
 102.79456473016224
 112.1820780186409 
 109.29515540354501
 105.27092406050166
  91.74425015266642
 101.13193185566477
 107.38510555013445

Writing The Data

Now that we have generated the data, we must write them to files. The data consist of

  • marker and QTL genotypes
  • phenotypic observations
  • pedigree information

Before the data is written, we first delete any old files from previous runs. Otherwise new data gets appended to the old files.

In [66]:
outFile = "data_ex04"
# delete old files first
run(`\rm -f $outFile.ped`)
run(`\rm -f $outFile.phe`)
run(`\rm -f $outFile.brc`)
run(`\rm -f $outFile.gen`)
# write new output    
outputPedigree(animals, outFile)

Convert This Notebook

In [4]:
;ipython nbconvert SimulateDataEx04.ipynb
[TerminalIPythonApp] WARNING | Subcommand `ipython nbconvert` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter nbconvert` in the future
[NbConvertApp] Converting notebook SimulateDataEx04.ipynb to html
[NbConvertApp] Writing 308799 bytes to SimulateDataEx04.html
In [ ]: