Spacing pipeline

The scripts provided in this repository are used to compute and characterize the spacing relationships of transcription factors.

Here is the overview of the method:

Dependencies

Quick Usage

identify_motif.py finds motifs given a peak file, a FASTA file of the peak sequences, and a motif file. The recommended parameters below keep only motifs that pass a false positive rate below 0.1% (--cutoff) and lie less than 50 bp from the peak center (-d 50):

python identify_motif.py ../ENCODE_processed_files/CTCF_idr.fa CTCF --motif_path ../motifs/ --cutoff -d 50
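To make the two filters concrete, here is a minimal Python sketch of the idea, not the code in identify_motif.py. The `MotifHit` record, its fields, and the `score_cutoff` value are hypothetical; in particular, the sketch assumes the score threshold corresponding to the 0.1% false positive rate has already been computed elsewhere.

```python
from typing import NamedTuple


class MotifHit(NamedTuple):
    """Hypothetical record for one motif match inside a peak."""
    peak_id: str
    offset: int    # motif position minus peak center, in bp
    score: float   # motif match score


def filter_hits(hits, score_cutoff, max_dist=50):
    """Keep hits scoring at or above score_cutoff and lying less than
    max_dist bp from the peak center (mirrors the intent of -d 50)."""
    return [
        hit for hit in hits
        if hit.score >= score_cutoff and abs(hit.offset) < max_dist
    ]


hits = [
    MotifHit("p1", 10, 8.0),    # kept
    MotifHit("p2", -60, 9.0),   # too far from the peak center
    MotifHit("p3", 49, 3.0),    # score below the cutoff
    MotifHit("p4", -49, 7.5),   # kept
    MotifHit("p5", 50, 9.0),    # exactly 50 bp away, so not "< 50 bp"
]
kept = filter_hits(hits, score_cutoff=7.0, max_dist=50)
print([hit.peak_id for hit in kept])
```

The distance test is strict so that it matches the "<50 bp" wording above; whether the script itself treats the boundary the same way is not stated here.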

To identify motifs and, at the same time, separate peaks into those in repetitive and nonrepetitive DNA regions, first download the repeat annotations and then run identify_motif.py with the --repeat option:

wget https://homer.ucsd.edu/zeyang/hg38_repeats.tar.gz
tar -zxvf hg38_repeats.tar.gz
python identify_motif.py ../ENCODE_processed_files/CTCF_idr.fa CTCF --motif_path ../motifs/ --cutoff -d 50 --repeat hg38_repeats/hg38_repeats_merged.nodup.all.txt
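The split into repetitive and nonrepetitive peaks can be pictured as an interval-overlap test. The sketch below is an illustration only: it assumes peaks and repeats are available as `(chrom, start, end)` tuples with half-open coordinates and that any overlap with a repeat counts as "repetitive". The actual file parsing and overlap rule used by the script may differ.

```python
from collections import defaultdict


def split_by_repeats(peaks, repeats):
    """Partition peaks into (overlapping a repeat, not overlapping any repeat).

    peaks, repeats: iterables of (chrom, start, end) with half-open coordinates.
    This layout is an assumption, not the repository's file format.
    """
    repeats_by_chrom = defaultdict(list)
    for chrom, start, end in repeats:
        repeats_by_chrom[chrom].append((start, end))

    in_repeats, outside_repeats = [], []
    for peak in peaks:
        chrom, start, end = peak
        overlaps = any(
            start < r_end and r_start < end
            for r_start, r_end in repeats_by_chrom.get(chrom, ())
        )
        (in_repeats if overlaps else outside_repeats).append(peak)
    return in_repeats, outside_repeats


peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
repeats = [("chr1", 150, 160), ("chr2", 200, 300)]
in_repeats, outside_repeats = split_by_repeats(peaks, repeats)
print(in_repeats)
print(outside_repeats)
```

A linear scan per peak keeps the sketch short; for genome-scale annotations, sorted intervals with binary search or an interval tree would be the usual choice.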

characterize_spacing.py takes the two processed files that identify_motif.py produces for a pair of transcription factors and outputs their spacing relationships. Basic usage:

python characterize_spacing.py ../ENCODE_processed_files/ GATA1 TAL1 --motif_path ../motifs/
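As a rough picture of what a spacing relationship is, the sketch below tallies signed distances between the motif positions of two factors. It is an assumption-laden illustration, not the method implemented in characterize_spacing.py: the `(chrom, position)` input layout, the `max_span` window, and the function name are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict


def spacing_histogram(motifs_a, motifs_b, max_span=100):
    """Count signed distances (B position minus A position) for every pair of
    motifs on the same chromosome that lie within max_span bp of each other."""
    b_by_chrom = defaultdict(list)
    for chrom, pos in motifs_b:
        b_by_chrom[chrom].append(pos)
    for positions in b_by_chrom.values():
        positions.sort()

    counts = Counter()
    for chrom, pos in motifs_a:
        positions = b_by_chrom.get(chrom, [])
        lo = bisect_left(positions, pos - max_span)
        hi = bisect_right(positions, pos + max_span)
        for other in positions[lo:hi]:
            counts[other - pos] += 1
    return counts


motifs_a = [("chr1", 1000), ("chr1", 5000), ("chr2", 300)]
motifs_b = [("chr1", 1012), ("chr1", 5012), ("chr1", 9000), ("chr2", 290)]
counts = spacing_histogram(motifs_a, motifs_b)
print(counts.most_common())
```

A spacing that recurs far more often than its neighbors (here, +12 bp) is the kind of signal a constrained spacing relationship would produce; judging whether such a peak is significant requires a statistical test that this sketch does not attempt.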

Citation

If you use our findings or scripts, please cite our paper: https://doi.org/10.7554/eLife.70878.

Data

motifs/ folder stores the PWM files in the JASPAR format used in the paper.

ENCODE_processed_files/ folder includes the processed data of this paper based on ENCODE ChIP-seq data:

  • _idr.tsv -- ChIP-seq peaks in HOMER peak file format after running IDR
  • _idr.fa -- sequences of ChIP-seq peaks in _idr.tsv
  • _idr_cutoff.tsv -- ChIP-seq peaks that have been identified to have valid motifs
  • _idr_cutoff_inmask.tsv -- Peaks in _idr_cutoff.tsv that fall into repetitive regions
  • _idr_cutoff_masked.tsv -- Peaks in _idr_cutoff.tsv that fall into nonrepetitive regions
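For scripting against these files, the suffixes above can be collected into a small lookup. This is a convenience sketch; the `processed_files` helper and its labels are hypothetical, and the only assumption taken from this README is that each file name is the factor name followed by the suffix (as in CTCF_idr.fa).

```python
from pathlib import Path

# Suffixes as listed above; the dictionary keys are illustrative labels.
SUFFIXES = {
    "peaks": "_idr.tsv",
    "sequences": "_idr.fa",
    "motif_peaks": "_idr_cutoff.tsv",
    "repeat_peaks": "_idr_cutoff_inmask.tsv",
    "nonrepeat_peaks": "_idr_cutoff_masked.tsv",
}


def processed_files(data_dir, tf):
    """Return the expected processed-file paths for one transcription factor."""
    base = Path(data_dir)
    return {label: base / f"{tf}{suffix}" for label, suffix in SUFFIXES.items()}


files = processed_files("../ENCODE_processed_files", "CTCF")
for label, path in files.items():
    print(f"{label}: {path.name}")
```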

Contact

If you encounter a problem when using the scripts, you can

  1. post an issue in the Issues section
  2. or email Zeyang Shen at [email protected]

License

This project is licensed under the GNU GPL v3.

Contributors

The scripts were developed primarily by Zeyang Shen and Rick Zhenzhi Li. Supervision for the project was provided by Christopher K. Glass.