If you use this method, please cite Li & Ralph 2019:
Local PCA Shows How the Effect of Population Structure Differs Along the Genome, Han Li and Peter Ralph, Genetics January 1, 2019 vol. 211 no. 1 289-304.
@article {li2019local,
author = {Li, Han and Ralph, Peter},
title = {Local PCA Shows How the Effect of Population Structure Differs Along the Genome},
volume = {211}, number = {1}, pages = {289--304}, year = {2019},
doi = {10.1534/genetics.118.301747}, publisher = {Genetics}, journal = {Genetics}
}
Note: a prototype python implementation by Joseph Guhlin is available at github:jghulin/lostruct-py.
To install the package, make sure you have devtools
(by doing install.packages("devtools")
),
and then running
install.packages("data.table")
devtools::install_github("petrelharp/local_pca/lostruct")
library(lostruct)
Note: the library is called lostruct
.
The example scripts in the directories above mostly work without the R package. To start using the code on your own data, have a look at these files:
-
A quick example : in four lines of code, reads in chromosome 22 from a TPED, and does local PCA.
-
Setting up data : after documenting where the data are from, does local PCA on a small subset of the whole dataset, to establish how the functions work.
-
Drop in your own data and make a report with the scripts provided in the
templated/
directory, modified from the Medicago analysis.
- To use the functions to read in windows from a VCF or BCF file, you will need bcftools.
- To compile the example report, you probably want templater.
The three steps are:
- Compute the local PCA coordinates - done with
eigen_windows()
. - Compute the distance matrix between windows - done with
pc_dist()
on the output ofeigen_windows()
. - Visualize - whatever you want; MDS is implemented in R with
cmdscale()
.
Data formats:
The function eigen_windows()
basically wants
your data in a numeric matrix,
with one row per variant and one column per sample
(so that x[i,j]
is the number of alleles that sample j
has at site i
).
If your data are already in this form, then you can use it directly.
We also have two methods to get data in from standard formats,
tped
and vcf
. Neither are extensively tested:
double-check what you are getting out of them.
-
TPED: the
read_tped()
function will read in a tped file and output a numeric matrix like the above. For instance:snps <- read_tped("mydata.tped")
-
VCF, in memory: the
read_vcf()
function does the same. For instance:snps <- read_vcf("mydata.vcf")
-
BCF, not in memory:
eigen_windows()
instead of a matrix can take a function that when called returns the submatrix corresponding to the appropriate window. (see documentation) Since we only need one window in memory at a time, this reduces the memory footprint. We usebcftools
to extract the windows, so you need bcftools, and your vcf file must be converted to bcf (or bgzipped) and indexed. To do this, for instance, run:bcftools convert -O b mydata.vcf > mydata.bcf bcftools index mydata.bcf
Once you have this, the function
vcf_windower()
will create the window extractor function. For instance,snps <- vcf_windower("my_data.bcf",size=1e3,type='snp') snps(10)
will return the 10th window of 1,000 SNPs in the file
my_data.vcf
.
In any case, the next step is:
pcs <- eigen_windows(snps,k=2)
pcdist <- pc_dist(pcs,npc=2)
which gives you pcs
, a matrix whose rows give the first two eigenvalues and eigenvectors for each window,
and pcdist
, the pairwise distance matrix between those windows.
Also included in this repository is code we used to analyze the datasets in the paper (before the R package was written). The general order to see the code in each directory is
- recode : turn bases into numbers
- PCA : find local PCs
- distance : compute distance matrix between windows from local PCs
- MDS : visualize the result
There are standalone examples for each of the three datasets studied:
POPRES (Homo sapiens, SNP chip data from a few worldwide populations)
Chromosome 1 is the example given. See also popres_example.R for an example of some steps using the package.
- POPRES_SNPdata_recode12.R : recodes the TPED as numeric
- POPRES_cov.R : computes covariance matrix for the entire chromsome 1
- POPRES_PCA_win100.R : computes local PCs
- POPRES_jackknife_var.R : estimates SE of local PCs
- POPRES_distance.R : computes distance matrix from local PCs
- POPRES_MDS.R : finds and plots MDS visualization of distance matrix
DPGP (Drosophila melanogaster population genome project)
Chromosome 3L is the example given .
- DPGP_recode_and_cov.R : recodes data as numeric, removes individuals with more than 8% missing data, sites with more than 20% missing data, and computes whole-chromosome covariance matrix
- DPGP_PCA_plot.R : plots PCs for entire 3L
- DPGP_PCA_win103.R : computes local PCs along 3L in windows of 1000 SNPs
- DPGP_var_between_win.R : computes variance of PCs between windows
- DPGP_jackknife_var.R : does jackknife estimate of SE for PCs on windows of 1000 SNPs
- DPGP_distance.R : computes distance matrix from local PCs
- DPGP_MDS_1d.R : computes and plots MDS plots from distance matrix
- DPGP_get_extreme_points.R : identifies extreme points (with interaction)
- DPGP_combine_extremes_and_get_cov.R : combines each of three sets of extreme windows and computes covariances for each
Medicago (Medicago truncatula hapmap)
For Medicago, it calculates the pairwise distance for all 8 chromosome together and then apply MDS and use subset of the whole MDS result for each chromosome.
- Medicago_VCF_recode.py : recodes VCF file as numeric
- Medicago_recode_and_cov.R : computes covariance matrix for (entire) chromosome 1
- Medicago_PCA_win104.R : computes local PCs for chromosome 1
- Medicago_distance_all_chr.R : computes a distance matrix from PC information
- Medicago_MDS.R : computes and plots MDS plots from the distance matrix
This method works through the genome doing something (PCA on the covariance matrix)
one window at a time. Because of this, it can be frustratingly slow to first load
the entire dataset into memory. There are several methods implemented here to avoid this;
for instance, vcf_windower()
which is used to compute PCs for the medicago data.
The interface is via a function that takes an integer, n
,
and returns a data frame of the genomic data in the n
th window.