Ancestry Imputation

Requirements

This pipeline requires:

Python 3.6.8 or higher
PLINK 1.90

Introduction

This repository imputes the ethnicity of the subjects in a set of binary PLINK files. It uses the 1000 Genomes Phase 3 data and its associated data.

Note that the 1000 Genomes data was quality controlled (QC) before being used in this repository, removing multiallelic variants. The QC'd data can be requested from Aditya Ambassi; otherwise, download the 1000 Genomes variant data and run a QC routine to remove undesired and multiallelic variants.

Pipeline

The pipeline consists of four main steps:

Merge: common Single Nucleotide Polymorphisms (SNPs) on chromosome 2 are parsed, and the input data is merged with the 1000 Genomes variant data. In the process, multiallelic and flipped variants are removed.
Principal Component Analysis: PCA is computed to find the directions of maximum variability.
Ethnicity Computation: the mean Principal Components (PCs) of each 1000 Genomes ethnicity are computed. Then, for each subject, the Euclidean distance between the subject's PCs and each 1000 Genomes ethnicity mean is computed; the closest mean is taken as the subject's ethnicity.
Plot: the PCs of the input subjects and the 1000 Genomes subjects are plotted and colored by ethnicity.
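
The Ethnicity Computation step can be sketched as a nearest-centroid assignment in PC space. This is a minimal illustration, not the code in impute_ancestry.py: the function name `assign_ancestry`, its arguments, and the toy data are assumptions made for the example, and it relies on NumPy.

```python
import numpy as np


def assign_ancestry(ref_pcs, ref_labels, subject_pcs):
    """Label each subject with the reference group whose mean PCs are closest."""
    ref_pcs = np.asarray(ref_pcs, dtype=float)
    subject_pcs = np.asarray(subject_pcs, dtype=float)
    ref_labels = np.asarray(ref_labels)

    # One centroid (mean PC vector) per reference group.
    groups = np.unique(ref_labels)
    centroids = np.vstack([ref_pcs[ref_labels == g].mean(axis=0) for g in groups])

    # Euclidean distance from every subject to every centroid,
    # giving an array of shape (n_subjects, n_groups).
    dists = np.linalg.norm(subject_pcs[:, None, :] - centroids[None, :, :], axis=2)

    # The closest centroid gives the imputed label.
    return groups[dists.argmin(axis=1)]


# Toy data for illustration only (two PCs, two reference groups).
ref_pcs = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
ref_labels = ["AFR", "AFR", "EUR", "EUR"]
subjects = [[0.1, 0.3], [4.8, 5.1]]
print(assign_ancestry(ref_pcs, ref_labels, subjects))
```

Because every subject is assigned to its closest mean, a subject far from all reference groups still receives a label; inspecting the PCA plot is a reasonable sanity check in that case.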

Utilization

To run the pipeline, follow these steps:

Copy the input data to the Data folder.
Copy the 1000 Genomes binary PLINK files to the Resources folder.
Fill in the settings.json file according to the Settings section.
Run python impute_ancestry.py

Settings

Please fill in the following items in settings.json:

Resources:
- CHR2_1000Genome: Relative or full path to the binary PLINK files of 1000 Genomes (without extension, only the prefix).
Data:
- prefix: Prefix/name of the input data.
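
As a rough guide, the items above can be checked before running the pipeline. The sketch below assumes settings.json is a nested object with `Resources` and `Data` sections, which is inferred from the list above and not confirmed by the repository; the helper `check_settings` and both example prefixes are hypothetical.

```python
import json

# Assumed layout: section name -> keys expected inside that section.
REQUIRED_KEYS = {"Resources": ["CHR2_1000Genome"], "Data": ["prefix"]}


def check_settings(settings):
    """Return the missing 'Section.key' entries; an empty list means all are set."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = settings.get(section) or {}
        for key in keys:
            if not block.get(key):
                missing.append(f"{section}.{key}")
    return missing


example = {
    # Hypothetical PLINK prefix (no .bed/.bim/.fam extension).
    "Resources": {"CHR2_1000Genome": "Resources/1000G_chr2"},
    # Hypothetical prefix of the input data in the Data folder.
    "Data": {"prefix": "my_cohort"},
}

print(json.dumps(example, indent=2))
print(check_settings(example))
```

To check a real file, load it with `json.load` and pass the resulting dictionary to `check_settings`.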

Utils

`parse_superpopulations.R`

This script parses the 1000 Genomes superpopulations from the associated data. It does not need to be run, because the superpopulations are already provided with the pipeline in the Resources folder.

Warnings and Future Work

The pipeline does not currently support flipping variants. Multiallelic and flipped variants are removed rather than corrected.
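
To illustrate what removing flipped and mismatching variants means, the sketch below compares the allele pairs of shared SNPs in two datasets and keeps only those that agree. It is not the repository's implementation: the function names, the dictionary representation of the .bim allele columns, and the toy SNPs are assumptions for the example.

```python
# Base complements used to detect a SNP reported on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_variant(ref_alleles, target_alleles):
    """Compare the allele pairs of one SNP in two datasets.

    Returns "match", "flipped" (same SNP on the opposite strand) or
    "mismatch" (merging would yield more than two alleles).
    """
    ref = {a.upper() for a in ref_alleles}
    target = {a.upper() for a in target_alleles}
    if ref == target:
        return "match"
    # Complement each target allele; unknown codes are left unchanged.
    flipped = {COMPLEMENT.get(a, a) for a in target}
    if flipped == ref:
        return "flipped"
    return "mismatch"


def variants_to_keep(ref_bim, target_bim):
    """Keep SNP IDs present in both datasets whose alleles match."""
    shared = ref_bim.keys() & target_bim.keys()
    return sorted(
        snp for snp in shared
        if classify_variant(ref_bim[snp], target_bim[snp]) == "match"
    )


# Toy allele tables (SNP ID -> allele pair), for illustration only.
ref_bim = {"rs1": ("A", "G"), "rs2": ("C", "T"), "rs3": ("A", "C"), "rs4": ("G", "T")}
target_bim = {"rs1": ("G", "A"), "rs2": ("G", "A"), "rs3": ("A", "T"), "rs5": ("C", "T")}
print(variants_to_keep(ref_bim, target_bim))
```

Note that A/T and C/G SNPs are their own complements, so a strand flip at those sites cannot be detected by comparing alleles alone.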
