-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Database creation for Killifish #53
Comments
For the motif to TF table you could take the fly/mouse/human motif to TF tables we provide and match them to kilifish homologs For your gene regions you can get 10kb upstream and downstream of each TSS (so 20kb regions) and/or another set of 500bp upstream and 100 bp downstream of the TSS You can use # Get Ensembl gene annotation name for Killifish.
$ pycistopic tss gene_annotation_list -f killifish
nfurzeri_gene_ensembl Turquoise killifish genes (Nfu_20140520)
$ Get TSS for each gene from Ensembl.
$ pycistopic tss get_tss -n nfurzeri_gene_ensembl -o nfurzeri_gene_ensembl.tss.bed
- Get TSS annotation from Ensembl BioMart with the following settings:
- biomart_name: "nfurzeri_gene_ensembl"
- biomart_host: "http://www.ensembl.org"
- transcript_type: ['protein_coding']
- use_cache: True
- Writing TSS annotation BED file to "nfurzeri_gene_ensembl.tss.bed".
# Sort BED file with TSS.
$ sort -k 1,1 -k2,2n -k3,3n nfurzeri_gene_ensembl.tss.bed > nfurzeri_gene_ensembl.tss.sorted.bed
# Get chromosome name to size mapping.
# For most standard species the following would work, but not for Killifish apparently.
# Get killifish NCBI accession ID.
$ pycistopic tss get_ncbi_acc -s killifish
accession assembly_name
GCF_027789165.1 UI_Nfuz_MZM_1.0
GCF_001465895.1 Nfu_20140520
$ pycistopic tss get_ncbi_chrom_sizes_and_alias_mapping --ncbi GCF_027789165.1 --chrom-sizes-alias nfurzeri.chrom_sizes_and_alias.tsv
Traceback (most recent call last):
File "/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/pycistopic_3.11/bin/pycistopic", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/cli/pycistopic.py", line 26, in main
args.func(args)
File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/cli/subcommand/tss.py", line 491, in run_tss_get_ncbi_chrom_sizes_and_alias_mapping
get_chrom_sizes_and_alias_mapping_from_ncbi(
File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/cli/subcommand/tss.py", line 398, in get_chrom_sizes_and_alias_mapping_from_ncbi
ga.get_chrom_sizes_and_alias_mapping_from_ncbi(
File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/gene_annotation.py", line 475, in get_chrom_sizes_and_alias_mapping_from_ncbi
chrom_sizes_and_alias_df_pl = pl.from_dicts(reports).select(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/pycistopic_3.11/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8873, in select
return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/pycistopic_3.11/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2034, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: ucsc_style_name
# For Killifish you can use get_chromosome_sizes: https://gist.github.com/andrewyatz/a3687b573364f65904e2
./get_chromosome_sizes.py 'nothobranchius_furzeri' > nothobranchius_furzeri.chrom_sizes.tsv
# Create BED file with regions 20kb up and downstream of TSS.
$ bedtools slop -l 20000 -r 20000 -s -i nfurzeri_gene_ensembl.tss.sorted.bed -g nothobranchius_furzeri.chrom_sizes.tsv > nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed
# Create BED file with regions 500bp upstream and 100bp downstream of TSS.
$ bedtools slop -l 500 -r 100 -s -i nfurzeri_gene_ensembl.tss.sorted.bed -g nothobranchius_furzeri.chrom_sizes.tsv > nfurzeri_gene_ensembl.tss.sorted.500bp_up_and_100bp_down.bed |
# Set Ensembl gene ID as gene column.
❯ awk -F '\t' -v 'OFS=\t' '{print $1, $2, $3, $8}' nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed > nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed
# Set Ensembl gene ID as gene column.
❯ awk -F '\t' -v 'OFS=\t' '{print $1, $2, $3, $8}' nfurzeri_gene_ensembl.tss.sorted.500bp_up_and_100bp_down.bed > nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.500bp_up_and_100bp_down.bed
❯ head nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed
==> nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed <==
LN600519.1 357430 397431 . - protein_coding ENSNFUG00015017460
LN600519.1 552219 592220 wdr43 . - protein_coding ENSNFUG00015017465
LN600519.1 557140 597141 TRMT61B . + protein_coding ENSNFUG00015017478
LN600519.1 562524 602525 spdya . - protein_coding ENSNFUG00015017485
LN600519.1 563086 603087 spdya . - protein_coding ENSNFUG00015017485
LN600519.1 563258 603259 spdya . - protein_coding ENSNFUG00015017485
LN600519.1 563784 603785 . + protein_coding ENSNFUG00015017559
LN600519.1 611523 651524 . - protein_coding ENSNFUG00015017563
LN600519.1 694463 734464 . + protein_coding ENSNFUG00015017570
LN600519.1 789733 829734 LRFN2 . - protein_coding ENSNFUG00015017567
==> nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed <==
LN600519.1 357430 397431 ENSNFUG00015017460
LN600519.1 552219 592220 ENSNFUG00015017465
LN600519.1 557140 597141 ENSNFUG00015017478
LN600519.1 562524 602525 ENSNFUG00015017485
LN600519.1 563086 603087 ENSNFUG00015017485
LN600519.1 563258 603259 ENSNFUG00015017485
LN600519.1 563784 603785 ENSNFUG00015017559
LN600519.1 611523 651524 ENSNFUG00015017563
LN600519.1 694463 734464 ENSNFUG00015017570
LN600519.1 789733 829734 ENSNFUG00015017567
Files from previous post attached: |
Hi, thank you! But where is this BED files needed? To form the create_cisTarget_databases I need fasta file and motif file right? The tutorial is quite confusing. Can you please just tell what to do after this? |
You need the BED file to create a FASTA file with the sequences for your gene regions. # Add "#number" to gene IDs" (only needed due to internal reasons by create_cisTarget_databases).
awk -F '\t' -v OFS='\t' '{ $4 = $4 "#" NR }' nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed > nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.for_create_cisTarget_databases.bed
# Create FASTA file for gene regions with BEDTools:
bedtools getfasta \
-nameOnly \
-fi killifish_genome.fa \
-fo kilifish_gene_regions_20kb_up_and_down.fa \
-bed nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.for_create_cisTarget_databases.bed
# FASTA file with sequences per region IDs / gene IDs.
fasta_filename='./kilifish_gene_regions_20kb_up_and_down.fa '
# Directory with motifs in Cluster-Buster format.
motifs_dir='./v10nr_clust_public/singletons/'
# File with motif IDs (base name of motif file in ${motifs_dir}).
motifs_list_filename='./v10nr_clust_public/motif_list.txt'
# cisTarget motif database output prefix.
db_prefix='./nfurzeri_gene_ensembl_20kb_up_and_down'
nbr_threads=22
"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
-f "${fasta_filename}" \
-M "${motifs_dir}" \
-m "${motifs_list_filename}" \
-g '#[0-9]+$' \
-o "${db_prefix}" \
-t "${nbr_threads}" Cluster-Buster motif files from our public collection can be found here: # Download public motif collection.
wget https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/v10nr_clust_public.zip
# Extract public motif collection.
unzip v10nr_clust_public.zip
# Create motif list file.
ls -1 /tmp/v10nr_clust_public/singletons/ > v10nr_clust_public/motif_list.txt |
Hello,
I am working on TBI in killifish using bioinformatics approach.
I want to investigate killifish clusters and it’s subclusters as it is not very well studied. I want to discover the cell types and clusters and to do so I was planning to work on Gene Regulatory analysis using SCENIC as it has an option to make custom database. Since you and your other colleagues have made the software, I was hoping if you could help me in creating the database for killifish as well. For Killifish we have Ensemble genome (https://www.ensembl.org/Nothobranchius_furzeri/Info/Index ) and Transcription factor list (from the TFDB database) but we do not have any Motif information. Can you let me know if this information is sufficient to proceed? I had a look already into the codes on GitHub, but I still wanted to ask and confirm and seems a bit tough myself to proceed so if you could help me or let me know how it’s done, would be a great help in my research and to the killifish community as well!
The text was updated successfully, but these errors were encountered: