原文網(wǎng)址:https://kbroman.org/qtl2/assets/vignettes/user_guide.html#performing_a_permutation_test
# 一下內(nèi)容是我粘貼原網(wǎng)址的內(nèi)容
R/qtl2 user guide
R/qtl2?(aka qtl2) is a reimplementation of the QTL analysis software?R/qtl, to better handle high-dimensional data and complex cross designs.
The software was previously split into multiple packages (qtl2geno, qtl2scan, qtl2plot, and qtl2db), but as of version 0.7, these have been combined into a single package?qtl2. The multiple packages proved awkward and confusing. The?qtl2convert?package (for converting data among the R/qtl2,?DOQTL, and?R/qtl?formats) will remain a separate package.
R/qtl2?contains?a lot?of functions. To identify functions of interest, see the?organized list of R/qtl2 functions, with brief descriptions.
Installation
R/qtl2 is not yet available on?CRAN, but it can be installed from a mini-CRAN at?rqtl.org. Make sure you have the latest version of?R (3.4.3).
install.packages("qtl2", repos="https://rqtl.org/qtl2cran")
Alternatively, you can install R/qtl2 from its source on?GitHub. (But note that compiling the C++ code can be rather slow.)
On?Windows, you’ll need?Rtools.
On?Mac OS X, you’ll need the?command-line developer tools.
You then need to install the?devtools?package, plus a set of package dependencies:?yaml,?jsonlite,?data.table,?RcppEigen,?RSQLite, and?qtl. (Additional, secondary dependencies will also be installed.)
install.packages(c("devtools", "yaml", "jsonlite", "data.table", "RcppEigen", "RSQLite", "qtl"))
Finally, install R/qtl2 using?devtools::install_github().
library(devtools)
install_github("rqtl/qtl2")
Data file format
The input data file formats for?R/qtl?cannot handle complex crosses, and so for R/qtl2, we have defined a new format for the data files. We’ll describe it here briefly; for details, see the separate?vignette on the input file format.
QTL mapping data consists of a set of tables of data: marker genotypes, phenotypes, marker maps, etc. In the new format, these different tables are in separate comma-delimited (CSV) files. In each file, the first column is a set of IDs for the rows, and the first row is a set of IDs for the columns. For example, the phenotype data file will have individual IDs in the first column and phenotype names in the first row.
A few important changes in the tabular data:
We will use not just the genetic marker map, but also a physical map (if available).
Previously, phenotypes and covariates were combined. In the new format, we separate numeric phenotypes from the often non-numeric covariates.
We define a table of “phenotype covariates.” These are?metadata?describing the phenotypes. For example, in the case of a phenotype measured over time, one column in the phenotype covariate data could be the time of measurement. For gene expression data, we would have columns representing chromosome and physical position of genes, as well as gene IDs.
In addition to the set of CSV files with the primary data, we need a separate “control” file with various control parameters (or metadata), including the names of all of the other data files and the genotype codes used in the genotype data file. The control file is in a specific format using either?YAML?or?JSON; these are human-readable text files for representing relatively complex data.
A big advantage of this control file scheme is that it greatly simplifies the function for reading in the data. That function,?read_cross2(), has a?single?argument: the name (with path) of the control file. So you can read in data like this:
library(qtl2)grav2 <- read_cross2("~/my_data/grav2.yaml")
The large number of files is a bit cumbersome, so we’ve made it possible to use a?zip file?containing all of the data files, and to read that zip file directly. There’s even a function for creating the zip file:
zip_datafiles("~/my_data/grav2.yaml")
This?zip_datafiles()?function will read the control file to identify all of the relevant data files and then zip them up into a file with the same name and location, but with the extension?.zip?rather than?.yaml?or?.json.
To read the data back in, we use the same?read_cross2()?function, providing the name (and path) of the zip file rather than the control file.
grav2 <- read_cross2("~/my_data/grav2.zip")
This can even be done with remote files.
grav2 <- read_cross2("https://kbroman.org/qtl2/assets/sampledata/grav2/grav2.zip")
Of course, the other advantage of the zip file is that it is?compressed?and so smaller than the combined set of CSV files.
The control file may be confusing for some users. To assist in its construction, there’s a function?write_control_file()?that takes the large set of control parameters as input and then writes the YAML or JSON control file in the appropriate format.
Sample data sets
The R/qtl2 web site includes?sample data files?in the new format. Zipped versions of these datasets are included with the?qtl2package and can be loaded into R using the?read_cross2()?function.
In the?qtl2 package source, the sample zip files are located in?qtl2/inst/extdata. In the installed version of the package, they are in?qtl2/extdata, within whatever directory your R packages were installed. The R function?system.file()?can be used to construct the path to these files.
For example, one of the sample data sets concerns a gravitropism phenotype in a set of Arabidopsis recombinant inbred lines (RIL), from?Moore et al. (2013) Genetics 195:1077-1086. The data are in?qtl2/extdata/grav2.zip, which can be loaded as follows:
library(qtl2)grav2 <- read_cross2( system.file("extdata","grav2.zip", package="qtl2") )
Additional sample data sets, including data on Diversity Outbred (DO) mice, are available at?https://github.com/rqtl/qtl2data.
Calculating genotype probabilities
The first basic task in QTL analysis is to calculate conditional genotype probabilities, given the observed marker data, at each putative QTL position. This is accomplished with the?calc_genoprob()?function. Unlike the corresponding function in?R/qtl,?calc.genoprob(), the result is not inserted back into the input cross object, but is returned as a list of three-dimensional arrays (one per chromosome). Each 3d array of probabilities is arranged as individuals × genotypes × positions.
If we wish to perform QTL calculations at positions between markers (so called “pseudomarkers”), we first need to insert such positions into the genetic map with the function?insert_pseudomarkers(). Unlike?R/qtl, the map is kept separate from the genotype probabilities.
We’ll use the?iron dataset?from?Grant et al. (2006) Hepatology 44:174-185?(an intercross) as an example. We first load the data:
library(qtl2)iron <- read_cross2( system.file("extdata","iron.zip", package="qtl2") )
We then use?insert_pseudomarkers()?to insert pseudomarkers into the genetic map, which we grab from the?iron?object as?iron$gmap:
map <- insert_pseudomarkers(iron$gmap, step=1)
And next we use?calc_genoprob()?to calculate the QTL genotype probabilities.
pr <- calc_genoprob(iron, map, error_prob=0.002)
To speed up the calculations with large datasets on a multi-core machine, you can use the argument?cores. With?cores=0, the number of available cores will be detected via?parallel::detectCores(). Otherwise, specify the number of cores as a positive integer.
pr <- calc_genoprob(iron, map, error_prob=0.002, cores=4)
The genome scan functions (see below) use genotype probabilities as well as a matrix of phenotypes. If you wished to perform a genome scan via an additive allele model, you would first convert the genotype probabilities to allele probabilities, using the function?genoprob_to_alleleprob().
apr <- genoprob_to_alleleprob(pr)
Calculating a kinship matrix
If you wish to perform a genome scan by a linear mixed model, accounting for the relationships among individuals (in other words, including a random polygenic effect), you’ll need to calculate a kinship matrix for the individuals. This is accomplished with the?calc_kinship()?function. It takes the genotype probabilities as input.
kinship <- calc_kinship(pr)
By default, the genotype probabilities are converted to allele probabilities, and the kinship matrix is calculated as the proportion of shared alleles. To use genotype probabilities instead, use?use_allele_probs=FALSE. Further, by default we calculate the kinship matrix using both the autosomes and the X chromosome. To omit the X chromosome, use?omit_x=TRUE.
In calculating the kinship matrix, one may wish to eliminate the effect of varying marker density across the genome, and only use the probabilities along the grid of pseudomarkers (defined by the?step?argument to?insert_pseudomarkers()). To do so, we need to first use?calc_grid()?to determine the grid of pseudomarkers, and then?probs_to_grid()?to omit probabilities for positions that are not on the grid.
grid <- calc_grid(iron$gmap, step=1)pr_grid <- probs_to_grid(pr, grid)kinship_grid <- calc_kinship(pr_grid)
If, for your linear mixed model genome scan, you wish to use the “l(fā)eave one chromosome out” (LOCO) method (scan each chromosome using a kinship matrix that is calculated using data from all other chromosomes), use?type="loco"?in the call to?calc_kinship().
kinship_loco <- calc_kinship(pr,"loco")
On a multi-core machine, you can get some speed-up via the?cores?argument, as with?calc_genoprob().
kinship_loco <- calc_kinship(pr,"loco", cores=4)
Special covariates for the X chromosome
In a QTL scan of the X chromosome, special covariates (such as sex) may need to be included under the null hypothesis of no QTL, to avoid spurious evidence of linkage. (See?Broman et al. (2006) Genetics 174:2151-2158.)
The particular X chromosome covariates depend on the cross, and can be obtained with the function?get_x_covar().
Xcovar <- get_x_covar(iron)
Performing a genome scan
To perform a genome scan by Haley-Knott regression (Haley and Knott 1992), use the function?scan1().
scan1()?takes as input the genotype probabilities, a matrix of phenotypes, optional additive and interactive covariates, and the special X chromosome covariates. Another option is to provide a vector of weights.
out <- scan1(pr, iron$pheno, Xcovar=Xcovar)
On a multi-core machine, you can get some speed-up via the?cores?argument, as with?calc_genoprob()?and?calc_kinship().
out <- scan1(pr, iron$pheno, Xcovar=Xcovar, cores=4)
The output of?scan1()?is a matrix of LOD scores, positions × phenotypes.
The function?plot_scan1()?can be used to plot the LOD curves. You can just write?plot(), as there’s an S3 method?plot.scan1()and the output of?scan1()?has class?"scan1". Use the argument?lodcolumn?to indicate which column to plot. Unlike the?plot.scanone()?function in?R/qtl, the?plot_scan1()?function will only plot one set of LOD curves at a time, and you need to provide the marker/pseudomarker map (created by?insert_pseudomarkers()).
par(mar=c(5.1,4.1,1.1,1.1))ymx <- maxlod(out)# overall maximum LOD scoreplot(out, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))plot(out, map, lodcolumn=2, col="violetred", add=TRUE)legend("topleft", lwd=2, col=c("slateblue","violetred"), colnames(out), bg="gray90")
Finding LOD peaks
The function?find_peaks()?can be used to identify a set of LOD peaks that exceed some threshold. It can also provide LOD support or Bayes credible intervals, by using the arguments?drop?(the amount to drop in the LOD support intervals) or?prob?(the nominal coverage for the Bayes credible intervals).
You need to provide both the?scan1()?output as well as the marker/pseudomarker map.
find_peaks(out, map, threshold=4, drop=1.5)
##? lodindex lodcolumn chr? pos? ? lod ci_lo ci_hi
## 1? ? ? ? 1? ? liver? 2 56.8? 4.9576? 48.1? 73.2
## 2? ? ? ? 1? ? liver? 7 50.1? 4.0508? 13.1? 53.6
## 3? ? ? ? 1? ? liver? 16 28.6? 7.6816? 6.6? 40.4
## 4? ? ? ? 2? ? spleen? 8 13.6? 4.3029? 0.0? 32.7
## 5? ? ? ? 2? ? spleen? 9 56.6 12.0632? 53.6? 61.2
The?find_peaks()?function can also pick out multiple peaks on a chromosome: each peak must exceed the chosen threshold, and the argument?peakdrop?indicates the amount that the LOD curve must drop below the lowest of two adjacent peaks. Use this feature with caution.
find_peaks(out, map, threshold=4, peakdrop=1.8, drop=1.5)
##? lodindex lodcolumn chr? pos? ? lod ci_lo ci_hi
## 1? ? ? ? 1? ? liver? 2 56.8? 4.9576? 48.1? 73.2
## 2? ? ? ? 1? ? liver? 7 25.1? 4.0400? 13.1? 28.4
## 3? ? ? ? 1? ? liver? 7 50.1? 4.0508? 31.7? 53.6
## 4? ? ? ? 1? ? liver? 16 28.6? 7.6816? 6.6? 40.4
## 5? ? ? ? 2? ? spleen? 8 13.6? 4.3029? 0.0? 32.7
## 6? ? ? ? 2? ? spleen? 9 56.6 12.0632? 53.6? 61.2
The functions?lod_int()?and?bayes_int()?can be used to derive the LOD support or Bayes credible intervals for QTL, for a specific chromosome and LOD score column. For example, to obtain the Bayes interval for the locus on chromosome 9 for the second phenotype (“spleen”):
bayes_int(out, map, lodcolumn=2, chr=9, prob=0.95)
##? ci_lo? pos ci_hi
## 1? 53.6 56.6? 61.2
Both?lod_int()?and?bayes_int()?take a?peakdrop?argument, if you wish to try to identify multiple peaks on a chromosome. Again, use this feature with caution.
lod_int(out, map, lodcolumn=1, chr=7, peakdrop=1.8, drop=1.5)
##? ci_lo? pos ci_hi
## 1? 13.1 25.1? 28.4
## 2? 31.7 50.1? 53.6
Each row is a different peak; the columns are the lower interval endpoint, the estimated QTL position, and the upper interval endpoint.
Performing a genome scan with a linear mixed model
To perform a genome scan using a linear mixed model, accounting for relationships among individuals using a random polygenic effect, you also use the function?scan1(); you just need to provide the argument?kinship, a kinship matrix (or, for the LOCO method, a list of kinship matrices).
out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar)
Again, on a multi-core machine, you can get some speed-up using the?cores?argument.
out_pg <- scan1(pr, iron$pheno, kinship, Xcovar=Xcovar, cores=4)
For the LOCO (leave one chromosome out) method, provide the list of kinship matrices as obtained from?calc_kinship()?with?method="loco".
out_pg_loco <- scan1(pr, iron$pheno, kinship_loco, Xcovar=Xcovar)
To plot the results, we again use?plot_scan1(), or just type?plot().
Here is a plot of the LOD scores, by Haley-Knott regression and the linear mixed model using either the standard kinship matrix or the LOCO method.
color <- c("slateblue","violetred","green3")par(mar=c(4.1,4.1,1.6,1.1))ymx <- max(maxlod(out), maxlod(out_pg), maxlod(out_pg_loco))for(iin1:2) {? ? plot(out, map, lodcolumn=i, col=color[1], main=colnames(iron$pheno)[i],? ? ? ? ? ? ? ylim=c(0, ymx*1.02))? ? plot(out_pg, map, lodcolumn=i, col=color[2], add=TRUE)? ? plot(out_pg_loco, map, lodcolumn=i, col=color[3], add=TRUE, lty=2)? ? legend("topleft", lwd=2, col=color, c("H-K","LMM","LOCO"), bg="gray90", lty=c(1,1,2))}
For the liver phenotype (top panel), the three methods give quite different results. The linear mixed model with an overall kinship matrix gives much lower LOD scores than the other two methods. On chromosomes with some evidence of a QTL, the LOCO method gives higher LOD scores than Haley-Knott, except on chromosome 16 where it gives lower LOD scores.
For the spleen phenotype (bottom panel), the linear mixed model with an overall kinship matrix again gives much lower LOD scores than the other two methods. However, in this case Haley-Knott regression and the LOCO method give quite similar results.
Performing a genome scan with binary traits
The genome scans above were performed assuming that the residual variation followed a normal distribution. This will often provide reasonable results even if the residuals are not normal, but an important special case is that of a binary trait, with values 0 and 1, which is best treated differently. The?scan1()?function can perform a genome scan with binary traits by logistic regression, using the argument?model="binary". (The default value for the?model?argument is?"normal".) At present, we?can notaccount for relationships among individuals in this analysis.
Let’s first turn our two phenotypes into binary traits by thresholding at the median. One would generally?not?do this in practice; this is just for illustration.
bin_pheno <- apply(iron$pheno,2,function(a) as.numeric(a > median(a)))rownames(bin_pheno) <- rownames(iron$pheno)
We now perform the genome scan as before, including?model="binary"?to indicate that the phenotypes are binary traits with values 0 and 1.
out_bin <- scan1(pr, bin_pheno, Xcovar=Xcovar, model="binary")
Here is a plot of the two sets of LOD curves.
par(mar=c(5.1,4.1,1.1,1.1))ymx <- maxlod(out_bin)plot(out_bin, map, lodcolumn=1, col="slateblue", ylim=c(0, ymx*1.02))plot(out_bin, map, lodcolumn=2, col="violetred", add=TRUE)legend("topleft", lwd=2, col=c("slateblue","violetred"), colnames(out_bin), bg="gray90")
We can use?find_peaks?as before.
find_peaks(out_bin, map, threshold=3.5, drop=1.5)
##? lodindex lodcolumn chr? pos? ? lod ci_lo ci_hi
## 1? ? ? ? 1? ? liver? 2 56.8 3.6303? 48.1? 73.2
## 2? ? ? ? 1? ? liver? 8 38.0 3.5077? 17.3? 69.9
## 3? ? ? ? 1? ? liver? 15 49.2 3.7789? 16.4? 49.2
## 4? ? ? ? 1? ? liver? 16 27.6 7.5069? 6.6? 40.4
## 5? ? ? ? 2? ? spleen? 9 57.6 9.2528? 53.6? 61.2
Performing a permutation test
To perform a permutation test to establish the statistical significance of the results of a genome scan, use the function?scan1perm(). (In?R/qtl, a single function,?scanone(), was used for both performing a genome scan and for getting permutation-based significance thresholds, but in?R/qtl2, we’ve decided to make two separate functions).
The?scan1perm()?function takes the same arguments as?scan1(), plus additional arguments to control the permutations:
n_perm?is the number of permutation replicates.
perm_Xsp?controls whether to perform autosome/X chromosome specific permutations (with?perm_Xsp=TRUE) or not (the default is to not).
perm_strata?is a vector that defines the strata for a stratified permutation test.
chr_lengths?is a vector of chromosome lengths, used in the case that?perm_Xsp=TRUE.
As with?scan1(), you may provide a kinship matrix (or vector of kinship matrices, for the “l(fā)eave one chromosome out” (LOCO) approach), in order to fit linear mixed models to account for relationships among individuals (in other words, including a random polygenic effect). If?kinship?is unspecified, the function performs ordinary Haley-Knott regression.
To perform a permutation test with the?iron?data, we do the following:
operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000)
Note the need to specify special covariates for the X chromosome (via?Xcovar), to be included under the null hypothesis of no QTL. And note that when these are provided, the default is to perform a stratified permutation test, using strata defined by the rows in?Xcovar. In general, when the X chromosome is considered, one will wish to stratify at least by sex.
Also note that, as with?scan1(), you can speed up the calculations on a multi-core machine by specifying the argument?cores. With?cores=0, the number of available cores will be detected via?parallel::detectCores(). Otherwise, specify the number of cores as a positive integer. For large datasets, be mindful of the amount of memory that will be needed; you may need to use fewer than the maximum number of cores, to avoid going beyond the available memory.
operm <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000, cores=0)
To get estimated significance thresholds, use the function?summary().
summary(operm)
## LOD thresholds (1000 permutations)
##? ? ? liver spleen
## 0.05? 3.46? 3.46
The default is to return the 5% significance thresholds. Thresholds for other (or for multiple) significance levels can be obtained via the?alpha?argument.
summary(operm, alpha=c(0.2,0.05))
## LOD thresholds (1000 permutations)
##? ? ? liver spleen
## 0.2? 2.63? 2.64
## 0.05? 3.46? 3.46
To obtain autosome/X chromosome-specific significance thresholds, specify?perm_Xsp=TRUE. In this case, you need to provide chromosome lengths, which may be obtained with the function?chr_lengths().
operm2 <- scan1perm(pr, iron$pheno, Xcovar=Xcovar, n_perm=1000,? ? ? ? ? ? ? ? ? ? perm_Xsp=TRUE, chr_lengths=chr_lengths(map))
Separate permutations are performed for the autosomes and X chromosome, and considerably more permutation replicates are needed for the X chromosome. The computations take about twice as much time. See?Broman et al. (2006) Genetics 174:2151-2158.
The significance thresholds are again derived via?summary():
summary(operm2, alpha=c(0.2,0.05))
## Autosome LOD thresholds (1000 permutations)
##? ? ? liver spleen
## 0.2? 2.65? 2.54
## 0.05? 3.42? 3.22
##
## X chromosome LOD thresholds (28243 permutations)
##? ? ? liver spleen
## 0.2? ? 3.1? 4.02
## 0.05? 3.9? 5.18
Permutations for a genome scan with a linear mixed model-based are performed by specifying the?kinship?argument. We can use the “l(fā)eave one chromosome out” (LOCO) method by providing?kinship_loco, the list of kinship matrices calculated above with?calc_kinship().
operm3 <- scan1perm(pr, iron$pheno, kinship_loco, Xcovar=Xcovar, n_perm=1000,? ? ? ? ? ? ? ? ? ? perm_Xsp=TRUE, chr_lengths=chr_lengths(map))
Here are the estimated significance thresholds:
summary(operm3, alpha=c(0.2,0.05))
## Autosome LOD thresholds (1000 permutations)
##? ? ? liver spleen
## 0.2? 2.64? 2.62
## 0.05? 3.29? 3.29
##
## X chromosome LOD thresholds (28243 permutations)
##? ? ? liver spleen
## 0.2? 3.14? 4.37
## 0.05? 3.82? 5.50
As with?scan1, we can use?scan1perm?with binary traits, using the argument?model="binary". Again, this can’t be used with a kinship matrix, but all of the other arguments can be applied.
operm_bin <- scan1perm(pr, bin_pheno, Xcovar=Xcovar, model="binary",? ? ? ? ? ? ? ? ? ? ? n_perm=1000, perm_Xsp=TRUE, chr_lengths=chr_lengths(map))
Here are the estimated 5% and 20% significance thresholds.
summary(operm_bin, alpha=c(0.2,0.05))
## Autosome LOD thresholds (1000 permutations)
##? ? ? liver spleen
## 0.2? 2.60? 2.63
## 0.05? 3.33? 3.41
##
## X chromosome LOD thresholds (28243 permutations)
##? ? ? liver spleen
## 0.2? 3.16? 3.06
## 0.05? 3.86? 3.77
Estimated QTL effects
The?scan1()?function returns only LOD scores. To obtain estimated QTL effects, use the function?scan1coef(). This function takes a single phenotype and the genotype probabilities for a single chromosome and returns a matrix with the estimated coefficients at each putative QTL location along the chromosome.
For example, to get the estimated effects on chromosome 2 for the liver phenotype, we’d do the following:
c2eff <- scan1coef(pr[,"2"], iron$pheno[,"liver"])
The result is a matrix, 39 positions × 4 genotypes. To plot the effects, use the function?plot_coef(). There is again an S3 method function?plot.scan1coef(), so one can just type?plot(), but in either case you need to provide the map for the relevant chromosome. Use the argument?columns?to indicate which coefficient columns to plot.
par(mar=c(4.1,4.1,1.1,2.6), las=1)col <- c("slateblue","violetred","green3")plot(c2eff, map["2"], columns=1:3, col=col)last_coef <- unclass(c2eff)[nrow(c2eff),]# pull out last coefficientsfor(iinseq(along=last_coef))? ? axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
The default is to provide phenotype averages for each genotype group. If instead you want additive and dominance effects, you can provide a square matrix of?contrasts, as follows:
c2effB <- scan1coef(pr[,"2"], iron$pheno[,"liver"],? ? ? ? ? ? ? ? ? ? contrasts=cbind(mu=c(1,1,1), a=c(-1,0,1), d=c(0,1,0)))
The result will then contain the estimates of?mu,?a, and?d. Here’s a plot of the additive and dominance effects, which are in the second and third columns.
par(mar=c(4.1,4.1,1.1,2.6), las=1)plot(c2effB, map["2"], columns=2:3, col=col)last_coef <- unclass(c2effB)[nrow(c2effB),2:3]# last two coefficientsfor(iinseq(along=last_coef))? ? axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
If you provide a kinship matrix to?scan1coef(), it fits a linear mixed model (LMM) to account for a residual polygenic effect. Here let’s use the kinship matrix from the LOCO method.
c2eff_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])
Here’s a plot of the estimates.
par(mar=c(4.1,4.1,1.1,2.6), las=1)col <- c("slateblue","violetred","green3")plot(c2eff_pg, map["2"], columns=1:3, col=col, ylab="Phenotype average")last_coef <- unclass(c2eff_pg)[nrow(c2eff_pg),]for(iinseq(along=last_coef))? ? axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
You can also get estimated additive and dominance effects, using a matrix of contrasts.
c2effB_pg <- scan1coef(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]],? ? ? ? ? ? ? ? ? ? ? contrasts=cbind(mu=c(1,1,1), a=c(-1,0,1), d=c(-0.5,1, -0.5)))
Here’s a plot of the results.
par(mar=c(4.1,4.1,1.1,2.6), las=1)plot(c2effB_pg, map["2"], columns=2:3, col=col)last_coef <- unclass(c2effB_pg)[nrow(c2effB_pg),2:3]for(iinseq(along=last_coef))? ? axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
Another option for estimating the QTL effects is to treat them as random effects and calculate Best Linear Unbiased Predictors (BLUPs). This is particularly valuable for multi-parent populations such as the Collaborative Cross and Diversity Outbred mice, where the large number of possible genotypes at a QTL lead to considerable variability in the effect estimates. To calculate BLUPs, use?scan1blup(); it takes the same arguments as?scan1coef(), including the option of a kinship matrix to account for a residual polygenic effect.
c2blup <- scan1blup(pr[,"2"], iron$pheno[,"liver"], kinship_loco[["2"]])
Here is a plot of the BLUPs (as dashed curves) alongside the standard estimates. Note that the BLUPs are centered at 0, while the coefficient estimates are centered at the phenotype average.
par(mar=c(4.1,4.1,1.1,2.6), las=1)col <- c("slateblue","violetred","green3")ylim <- range(c(c2blup, c2eff))+c(-1,1)plot(c2eff, map["2"], columns=1:3, col=col, ylab="Phenotype average", ylim=ylim,? ? xlab="Chr 2 position")plot(c2blup, map["2"], columns=1:3, col=col, add=TRUE, lty=2)last_coef <- unclass(c2eff)[nrow(c2eff),]for(iinseq(along=last_coef))? ? axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
The?scan1coef?function can also provide estimated QTL effects for binary traits, with?model="binary". (However,?scan1blup?has not yet been implemented for binary traits.)
c2eff_bin <- scan1coef(pr[,"2"], bin_pheno[,"liver"], model="binary")
Here’s a plot of the effects. They’re a bit tricky to interpret, as their basically log odds ratios.
par(mar=c(4.1,4.1,1.1,2.6), las=1)col <- c("slateblue","violetred","green3")plot(c2eff_bin, map["2"], columns=1:3, col=col)last_coef <- unclass(c2eff_bin)[nrow(c2eff_bin),]# pull out last coefficientsfor(iinseq(along=last_coef))? ? axis(side=4, at=last_coef[i], names(last_coef)[i], tick=FALSE, col.axis=col[i])
Finally, to plot the raw phenotypes against the genotypes at a single putative QTL position, you can use the function?plot_pxg(). This takes a vector of genotypes as produced by the?maxmarg()?function, which picks the most likely genotype from a set of genotype probabilities, provided it is greater than some specified value (the argument?minprob). Note that the “marg” in “maxmarg” stands for “marginal”, as this function is selecting the genotype at each position that has maximum marginal probability.
For example, we could get inferred genotypes at the chr 2 QTL for the liver phenotype (at 28.6 cM) as follows:
g <- maxmarg(pr, map, chr=2, pos=28.6, return_char=TRUE)
We use?return_char=TRUE?to have?maxmarg()?return a vector of character strings with the genotype labels.
We then plot the liver phenotype against these genotypes as follows:
par(mar=c(4.1,4.1,0.6,0.6))plot_pxg(g, iron$pheno[,"liver"], ylab="Liver phenotype")
We can use?swap_axes=TRUE?to have the phenotype on the x-axis. And we can use?SEmult=2?to include mean ± 2 SE intervals.
par(mar=c(4.1,4.1,0.6,0.6))plot_pxg(g, iron$pheno[,"liver"], SEmult=2, swap_axes=TRUE, xlab="Liver phenotype")
SNP association
For multi-parent crosses, it can be useful to collapse the genotype or allele probabilities according to the founder genotypes of the various SNPs in the region of a QTL.
QTL analysis in Diversity Outbred mice
To illustrate this sort of SNP association analysis, we’ll consider some Diversity Outbred mouse data. The Diversity Outcross (DO) mice are an advanced intercross population derived from the same eight founder strains as the Collaborative Cross (CC). See?Svenson et al. (2012),?Churchill et al. (2012), and?Gatti et al. (2014).
We’ll consider a subset of the data from?Recla et al. (2014), available as part of the?qtl2data github repository. (The full data are in?DO_Recla; the directory?DOex?contains a reduced set, with just three chromosomes, one phenotype (OF_immobile_pct, percent immobile in the open field test), and a reduced set of markers.
You can download the data from a single zip file, as follows:
file <-"https://raw.githubusercontent.com/rqtl/qtl2data/master/DOex/DOex.zip"DOex <- read_cross2(file)
Let’s quickly whip through a basic analysis.
We first calculate genotype probabilities and convert them to allele probabilities. We’ll just use marker locations and not insert any pseudomarkers.
pr <- calc_genoprob(DOex, error_prob=0.002)apr <- genoprob_to_alleleprob(pr)
We calculate kinship matrices (using the “LOCO” method, though with the caveat that here we are only considering genotypes on three chromosomes).
k <- calc_kinship(apr,"loco")
We create a numeric covariate for sex; be sure to include the individual IDs as names.
sex <- (DOex$covar$Sex =="male")*1names(sex) <- rownames(DOex$covar)
Note that you can create the vector and assign the names in one step using the basic R function?setNames().
sex <- setNames( (DOex$covar$Sex =="male")*1, rownames(DOex$covar) )
We perform a genome scan with a linear mixed model (adjusting for a residual polygenic effect), with sex as an additive covariate.
out <- scan1(apr, DOex$pheno, k, sex)
Here’s a plot of the results.
par(mar=c(4.1,4.1,0.6,0.6))plot(out, DOex$gmap)
There’s a strong peak on chromosome 2. Let’s look at the QTL effects. We estimate them with?scan1coef(). We need to subset the allele probabilities and the list of kinship matrices.
coef_c2 <- scan1coef(apr[,"2"], DOex$pheno, k[["2"]], sex)
For the DO, with 8 QTL alleles, we can use the function?plot_coefCC, which plots the 8 allele effects in the “official” Collaborative Cross (CC) colors. (Well, actually?slightly?modified colors, because I think the official colors are kind of ugly.) The strong locus seems to be mostly due to the NZO allele. Note that qtl2 package includes the vector of colors for the CC as the object?CCcolors; there’s also a?CCorigcolors?object with the?official?colors.
par(mar=c(4.1,4.1,0.6,0.6))plot_coefCC(coef_c2, DOex$gmap["2"], bgcolor="gray95", legend="bottomleft")
If you provide?plot_coefCC()?with the genome scan output, it will display the LOD curve below the coefficient estimates.
par(mar=c(4.1,4.1,0.6,0.6))plot_coefCC(coef_c2, DOex$gmap["2"], scan1_output=out, bgcolor="gray95", legend="bottomleft")
Connecting to SNP and gene databases
To perform SNP association analysis in the region of a QTL, we’ll need to grab data on all of the SNPs in the region, including their genotypes in the eight founder strains for the CC and DO populations. As a related task, we’ll also want to identify the genes in the region.
We want R/qtl2 to permit different ways of storing this information. For the CC and DO populations, we’ve prepared SQLite database files for the variants in the CC founders and for the MGI mouse gene annotations. But others might wish to use a different kind of database, or may wish to query an online database.
We provide a template for how to use R/qtl2 to connect to SNP and gene databases, with the functions?create_variant_query_func()?and?create_gene_query_func(). Each returns a?function?for querying variant and gene databases, respectively. The query functions that are returned take just three arguments (chr,?start, end?end), and themselves return a data frame of variants on the one hand and genes on the other.
If you are analyzing the CC lines or a DO population, you can download the SQLite databases that we have prepared and made available on?Figshare:
cc_variants.sqlite?doi:10.6084/m9.figshare.5280229.v3, variants in the Collaborative Cross founders (3 GB)
mouse_genes.sqlite?doi:10.6084/m9.figshare.5280238.v6?full set of mouse gene annotations (530 MB)
mouse_genes_mgi.sqlite?doi:10.6084/m9.figshare.5286019.v7?just the MGI mouse gene annotations (12 MB)
For example, to download?cc_variants.sqlite?and?mouse_genes_mgi.sqlite, we’d go to the figshare sites above and determine the URLs for downloading the files, and then do the following:
download.file("https://ndownloader.figshare.com/files/18533342","cc_variants.sqlite")download.file("https://ndownloader.figshare.com/files/17609252","mouse_genes_mgi.sqlite")
Store them locally on your computer, wherever it’s convenient. For example, I’ve placed them in?~/Data/CCdb/.
To create a function for querying the CC variants, call?create_variant_query_func()?with the path to the?cc_variants.sqlite?file:
query_variants <- create_variant_query_func("~/Data/CCdb/cc_variants.sqlite")
To grab the variants in the interval 97-98 Mbp on chromosome 2, you’d then do the following:
variants_2_97.5 <- query_variants(2,97,98)
Similarly, to create a function for querying the MGI mouse gene annotations, you call?create_gene_query_func()?with the path to the?mouse_genes_mgi.sqlite?file:
query_genes <- create_gene_query_func("~/Data/CCdb/mouse_genes_mgi.sqlite")
To grab the genes overlapping the interval 97-98 Mbp on chromosome 2, you’d then do the following:
genes_2_97.5 <- query_genes(2,97,98)
The way we have set this up is a bit complicated, but it allows greatest flexibility on the part of the user. And for our own work, we like to have a local SQLite database, for rapid queries of SNPs and genes.
SNP associations
Okay, now finally we get to the SNP associations. We have a large peak on chromosome 2, and we want to look at individual SNPs in the region of the locus.
Well, actually, we first need to find the location of the inferred QTL. The peak LOD score on chromosome 2 occurs at 52.4 cM. But to find nearby SNPs, we really want to know the Mbp position. The calculations were only performed at the marker positions, and so can use?max(), giving both the?scan1()?output and the physical map, and then pull out the position from the results.
peak_Mbp <- max(out, DOex$pmap)$pos
The marker is at 96.8 Mbp. We’ll focus on a 2 Mbp interval centered at 96.8 Mbp.
We can pull out the variants in the 2 Mbp interval centered at 96.8 on chr 2 using the query function we defined above:
variants <- query_variants(2, peak_Mbp -1, peak_Mbp +1)
There are 27355 variants in the interval, including 27108 SNPs, 127 indels, and 120 structural variants. We’re treating all of them as biallelic markers (all but the major allele in the eight founder strains binned into a single allele). In the following, we’re going to just say “SNP” rather than “variant”, even though the variants include indels and structural variants.
After identifying the variants in the interval of interest, we use our genotype probabilities and the founder SNP genotypes to infer the SNP genotypes for the DO mice. That is, at each SNP, we want to collapse the eight founder allele probabilities to two SNP allele probabilities, using the SNP genotypes of the founders.
We do this assuming that the genotype probabilities were calculated sufficiently densely that they can be assumed to be constant in intervals. With this assumption, we then:
Find the interval for each SNP.
Reduce the SNPs to a “distinct” set: if two SNPs have the same strain distribution pattern (SDP; the pattern of alleles in the eight founders) and are in the same interval, by our assumption their allele probabilities will be the same.
Take the average of the allele probabilities at the two endpoints of each interval.
Collapse the 8 allele probabilities to two according to each observed SDP in the interval.
We further create a look-up table relating the full set of SNPs to the reduced set (one of each observed SDP in each interval).
All of these steps are combined into a single function?scan1snps(), which takes the genotype probabilities, a physical map of those locations, the phenotype, the kinship matrix, covariates, the query function for grabbing the SNPs in a given interval, and the chromosome, start, and end positions for the interval. (If?insert_pseudomarkers()?was used to insert pseudomarkers, you will need to use?interp_map()?to get interpolated Mbp positions for the pseudomarkers.)
out_snps <- scan1snps(pr, DOex$pmap, DOex$pheno, k[["2"]], sex, query_func=query_variants,? ? ? ? ? ? ? ? ? ? ? chr=2, start=peak_Mbp-1, end=peak_Mbp+1, keep_all_snps=TRUE)
The output is a list with two components:?lod?is a matrix of LOD scores (with a single column, since we’re using just one phenotype), and?snpinfo?is a data frame with SNP information. With the argument?keep_all_snps=TRUE, the?snpinfo?data frame contains information about?all?of the variants in the region with an?index?column indicating the equivalence classes.
The function?plot_snpasso()?can be used to plot the results, with points at each of the SNPs. The default is to plot?all?SNPs. In this case, there are 27355 variants in the region, but only 167 distinct ones.
par(mar=c(4.1,4.1,0.6,0.6))plot_snpasso(out_snps$lod, out_snps$snpinfo)
We can actually just type?plot()?rather than?plot_snpasso(), because with the?snpinfo?table in place of a genetic map,?plot()detects that a SNP association plot should be created.
par(mar=c(4.1,4.1,0.6,0.6))plot(out_snps$lod, out_snps$snpinfo)
We can use our?query_genes()?function to identify the genes in the region, and?plot_genes()?to plot their locations. But also?plot_snpasso()?can take the gene locations with the argument?genes?and then display them below the SNP association results. Here, we are also highlighting the top SNPs in the SNP association plot using the?drop_hilit?argument. SNPs with LOD score within?drop_hilit?of the maximum are shown in pink.
genes <- query_genes(2, peak_Mbp -1, peak_Mbp +1)par(mar=c(4.1,4.1,0.6,0.6))plot(out_snps$lod, out_snps$snpinfo, drop_hilit=1.5, genes=genes)
To get a table of the SNPs with the largest LOD scores, use the function?top_snps(). This will show all SNPs with LOD score within some amount (the default is 1.5) of the maximum SNP LOD score. We’re going to display just a subset of the columns.
top <- top_snps(out_snps$lod, out_snps$snpinfo)print(top[,c(1,8:15,20)], row.names=FALSE)
##? ? ? ? ? ? ? ? ? snp_id A_J C57BL_6J 129S1_SvImJ NOD_ShiLtJ NZO_HlLtJ CAST_EiJ PWK_PhJ WSB_EiJ? ? lod
##? ? ? ? ? ? ? rs33297153? 1? ? ? ? 1? ? ? ? ? 2? ? ? ? ? 1? ? ? ? 2? ? ? ? 3? ? ? 1? ? ? 1 8.2151
##? ? ? ? ? ? rs226751615? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs230653109? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs250407660? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs248216267? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs237564299? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs255011323? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs212448521? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? ? rs27379206? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? ? rs27379194? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs244484646? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs215855380? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs236942088? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs212414861? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs229578122? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs254318131? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs217679969? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs238404461? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs262749925? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs231282689? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs260286709? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? ? rs27396282? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs263841533? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs231205920? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs242885221? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 7.9496
##? ? ? ? ? ? rs586746690? 1? ? ? ? 1? ? ? ? ? 2? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 8.6351
##? ? ? ? ? ? ? rs49002164? 1? ? ? ? 1? ? ? ? ? 2? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 8.6351
##? SV_2_96945406_96945414? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 3? ? ? 1? ? ? 1 8.2809
##? ? ? ? ? ? rs244595995? 1? ? ? ? 1? ? ? ? ? 2? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 8.6351
##? SV_2_96993116_96993118? 1? ? ? ? 1? ? ? ? ? 1? ? ? ? ? 1? ? ? ? 2? ? ? ? 2? ? ? 1? ? ? 1 8.2809
The top SNPs all have NZO and CAST with a common allele, different from the other 6 founders. The next-best SNPs have NZO, CAST, and 129 with a common allele, and then there’s a group of SNPs where NZO has a unique allele.
The?scan1snps()?function can also be used to perform a genome-wide SNP association scan, by providing a variant query function but leaving?chr,?start, and?end?unspecified. In this case it’s maybe best to use?keep_all_snps=FALSE?(the default) and only save the index SNPs.
out_gwas <- scan1snps(pr, DOex$pmap, DOex$pheno, k, sex, query_func=query_variants, cores=0)
We can make a Manhattan plot of the results as follows. We use?altcol?to define a color for alternate chromosomes and?gap=0?to have no gap between chromosomes.
par(mar=c(4.1,4.1,0.6,0.6))plot(out_gwas$lod, out_gwas$snpinfo, altcol="green4", gap=0)
Note that while there are LOD scores on the X chromosome that are as large as those on chromosome 2, we’re allowing a genotype × sex interaction on the X chromosome and so the test has 3 degrees of freedom (versus 2 degrees of freedom on the autosomes) and so the LOD scores will naturally be larger. If we’d used the allele probabilities (apr?above, calculated from?genoprob_to_alleleprob()) rather than the genotype probabilities, we would be performing a test with 1 degree of freedom on both the X chromosome and the autosomes.