This pipeline requires:
- Python 3.6.8 or higher
- PLINK 1.90
This repository attempts to impute ethnicity from a set of binary PLINK files. It uses the 1000 Genomes Phase 3 Data and its Associated Data.
Note that the 1000 Genomes data was quality controlled (QC) before being used in this repository, which removed multiallelic variants. The data can be requested from Aditya Ambassi. Otherwise, please download the 1000 Genomes variant data and run a QC routine to remove undesired and multiallelic variants.
The pipeline is composed of four main steps:
- Merge: common Single Nucleotide Polymorphisms (SNPs) on chromosome 2 are extracted, and the input data is merged with the 1000 Genomes variant data. In the process, multiallelic and flipped variants are removed.
- Principal Component Analysis: PCA is computed to find directions of maximum variability.
- Ethnicity Computation: The mean Principal Components (PCs) of each 1000 Genomes ethnicity are computed. Then the Euclidean distance from each subject's PCs to each 1000 Genomes ethnicity mean is computed, and the closest mean is taken as the subject's ethnicity.
- Plot: PCs of the input subjects and the 1000 Genomes subjects are plotted and colored by ethnicity.
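The Ethnicity Computation step can be illustrated with a minimal nearest-centroid sketch. This is not the repository's code: the function names, the toy PC values, and the use of two PCs are assumptions made for illustration, and only the standard library is used.

```python
import math
from collections import defaultdict


def population_centroids(ref_pcs, ref_labels):
    """Return the mean PC vector of each reference population."""
    grouped = defaultdict(list)
    for pcs, label in zip(ref_pcs, ref_labels):
        grouped[label].append(pcs)
    # Average each PC (column) across the subjects of a population.
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def euclidean(a, b):
    """Euclidean distance between two PC vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_nearest_population(sample_pcs, centroids):
    """Assign each subject the label of the closest population mean."""
    return [
        min(centroids, key=lambda label: euclidean(pcs, centroids[label]))
        for pcs in sample_pcs
    ]


# Toy reference panel (hypothetical values, two PCs per subject).
ref_pcs = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
ref_labels = ["AFR", "AFR", "EUR", "EUR"]

# Toy input subjects projected onto the same PCs.
sample_pcs = [[0.1, 0.3], [4.8, 5.1]]

centroids = population_centroids(ref_pcs, ref_labels)
assigned = assign_nearest_population(sample_pcs, centroids)
print(assigned)
```

Because every subject is assigned to the closest mean, a subject who lies far from all reference populations still receives a label. Inspecting the plot from the last step helps spot such cases.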
To run the pipeline, please follow these steps:
- Copy the input data to the Data folder.
- Copy the 1000 Genomes binary PLINK files to the Resources folder.
- Fill in the settings.json file according to the Settings section.
- Run: python impute_ancestry.py
Please fill in the following items in settings.json:
- Resources:
  - CHR2_1000Genome: Relative or full path to the binary PLINK files of 1000 Genomes (prefix only, without extension).
- Data:
  - prefix: Prefix/name of the input data.
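As a rough illustration, the settings could be laid out as below. The nesting of the keys is inferred from the list above and may differ from the actual file, and both values (`Resources/1000G_chr2`, `my_cohort`) are hypothetical placeholders. The helper `missing_plink_files` is not part of the pipeline. It only checks that a prefix points at a complete binary PLINK fileset (.bed, .bim, .fam).

```python
import json
import os

settings = {
    "Resources": {
        # Hypothetical PLINK prefix: no .bed/.bim/.fam extension.
        "CHR2_1000Genome": "Resources/1000G_chr2",
    },
    "Data": {
        # Hypothetical name of the input dataset.
        "prefix": "my_cohort",
    },
}


def missing_plink_files(prefix):
    """List the files of a binary PLINK fileset that do not exist."""
    candidates = [prefix + ext for ext in (".bed", ".bim", ".fam")]
    return [path for path in candidates if not os.path.isfile(path)]


# Text that could be saved as settings.json.
settings_text = json.dumps(settings, indent=4)
print(settings_text)

# Report any file missing for the configured reference prefix.
print(missing_plink_files(settings["Resources"]["CHR2_1000Genome"]))
```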
This function parses the 1000 Genomes superpopulations from the Associated Data. It does not need to be run, since the superpopulations are already provided with the pipeline in the Resources folder.
The pipeline does not currently support flipping variants. Multiallelic and flipped variants are removed.
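To make the terms concrete, the sketch below shows one common way to classify a SNP by comparing its allele pair in the input data against the reference data. It is an illustration under stated assumptions, not the pipeline's implementation: the function name, the dictionary layout, and the rsIDs are hypothetical.

```python
# Complementary bases on the opposite DNA strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_variant(input_alleles, reference_alleles):
    """Compare the allele pair of one SNP across two datasets.

    Returns "match" if both datasets report the same two alleles,
    "flipped" if they agree only after complementing the strand,
    and "mismatch" otherwise (merging would yield 3+ alleles).
    """
    input_set = set(input_alleles)
    reference_set = set(reference_alleles)
    if input_set == reference_set:
        return "match"
    # Non-ACGT codes map to None and therefore never match.
    flipped_set = {COMPLEMENT.get(allele) for allele in input_set}
    if flipped_set == reference_set:
        return "flipped"
    return "mismatch"


# Hypothetical allele pairs keyed by variant ID.
input_variants = {"rs1": ("A", "G"), "rs2": ("A", "G"), "rs3": ("A", "G")}
reference_variants = {"rs1": ("G", "A"), "rs2": ("T", "C"), "rs3": ("A", "C")}

status = {
    snp: classify_variant(alleles, reference_variants[snp])
    for snp, alleles in input_variants.items()
    if snp in reference_variants
}
# Keep only variants that agree without any strand correction.
kept = [snp for snp, result in status.items() if result == "match"]
print(status)
print(kept)
```

Note that A/T and C/G SNPs are strand-ambiguous: the pair looks identical on both strands, so a comparison like this one reports them as matching even when the strands differ.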