Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database creation for Killifish #53

Open
juhi-ku opened this issue Sep 11, 2024 · 4 comments
Open

Database creation for Killifish #53

juhi-ku opened this issue Sep 11, 2024 · 4 comments

Comments

@juhi-ku
Copy link

juhi-ku commented Sep 11, 2024

Hello,

I am working on TBI in killifish using bioinformatics approach.
I want to investigate killifish clusters and it’s subclusters as it is not very well studied. I want to discover the cell types and clusters and to do so I was planning to work on Gene Regulatory analysis using SCENIC as it has an option to make custom database. Since you and your other colleagues have made the software, I was hoping if you could help me in creating the database for killifish as well. For Killifish we have Ensemble genome (https://www.ensembl.org/Nothobranchius_furzeri/Info/Index ) and Transcription factor list (from the TFDB database) but we do not have any Motif information. Can you let me know if this information is sufficient to proceed? I had a look already into the codes on GitHub, but I still wanted to ask and confirm and seems a bit tough myself to proceed so if you could help me or let me know how it’s done, would be a great help in my research and to the killifish community as well!

@ghuls
Copy link
Member

ghuls commented Sep 18, 2024

For the motif to TF table you could take the fly/mouse/human motif to TF tables we provide and match them to kilifish homologs
(change the gene_name column with the kilifish homologs).

For your gene regions you can get 10kb upstream and downstream of each TSS (so 20kb regions) and/or another set of 500bp upstream and 100 bp downstream of the TSS
for another database (depending on the species you might need to tweak those values a bit e.g 5kb up/down might work better for yours species).

You can use pycistopic tss to get some of the needed files relatively easily:

# Get Ensembl gene annotation name for Killifish.
$ pycistopic tss gene_annotation_list -f killifish
nfurzeri_gene_ensembl	Turquoise killifish genes (Nfu_20140520)

$ Get TSS for each gene from Ensembl.
$ pycistopic tss get_tss -n nfurzeri_gene_ensembl -o nfurzeri_gene_ensembl.tss.bed
- Get TSS annotation from Ensembl BioMart with the following settings:
  - biomart_name: "nfurzeri_gene_ensembl"
  - biomart_host: "http://www.ensembl.org"
  - transcript_type: ['protein_coding']
  - use_cache: True
- Writing TSS annotation BED file to "nfurzeri_gene_ensembl.tss.bed".

# Sort BED file with TSS.
$ sort -k 1,1 -k2,2n -k3,3n nfurzeri_gene_ensembl.tss.bed > nfurzeri_gene_ensembl.tss.sorted.bed

# Get chromosome name to size mapping.

# For most standard species the following would work, but not for Killifish apparently.

# Get killifish NCBI accession ID.
$ pycistopic tss get_ncbi_acc -s killifish
accession	assembly_name
GCF_027789165.1	UI_Nfuz_MZM_1.0
GCF_001465895.1	Nfu_20140520

$ pycistopic tss get_ncbi_chrom_sizes_and_alias_mapping --ncbi GCF_027789165.1 --chrom-sizes-alias nfurzeri.chrom_sizes_and_alias.tsv
Traceback (most recent call last):
  File "/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/pycistopic_3.11/bin/pycistopic", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/cli/pycistopic.py", line 26, in main
    args.func(args)
  File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/cli/subcommand/tss.py", line 491, in run_tss_get_ncbi_chrom_sizes_and_alias_mapping
    get_chrom_sizes_and_alias_mapping_from_ncbi(
  File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/cli/subcommand/tss.py", line 398, in get_chrom_sizes_and_alias_mapping_from_ncbi
    ga.get_chrom_sizes_and_alias_mapping_from_ncbi(
  File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/gene_annotation.py", line 475, in get_chrom_sizes_and_alias_mapping_from_ncbi
    chrom_sizes_and_alias_df_pl = pl.from_dicts(reports).select(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/pycistopic_3.11/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8873, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/pycistopic_3.11/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2034, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: ucsc_style_name

# For Killifish you can use get_chromosome_sizes: https://gist.github.com/andrewyatz/a3687b573364f65904e2
./get_chromosome_sizes.py 'nothobranchius_furzeri' > nothobranchius_furzeri.chrom_sizes.tsv


# Create BED file with regions 20kb up and downstream of TSS.
$ bedtools slop -l 20000 -r 20000 -s -i nfurzeri_gene_ensembl.tss.sorted.bed -g nothobranchius_furzeri.chrom_sizes.tsv > nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed

# Create BED file with regions 500bp upstream and 100bp downstream of TSS.
$ bedtools slop -l 500 -r 100 -s -i nfurzeri_gene_ensembl.tss.sorted.bed -g nothobranchius_furzeri.chrom_sizes.tsv > nfurzeri_gene_ensembl.tss.sorted.500bp_up_and_100bp_down.bed

@ghuls
Copy link
Member

ghuls commented Sep 18, 2024

# Set Ensembl gene ID as gene column. 
❯ awk -F '\t' -v 'OFS=\t' '{print $1, $2, $3, $8}' nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed > nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed


# Set Ensembl gene ID as gene column. 
❯ awk -F '\t' -v 'OFS=\t' '{print $1, $2, $3, $8}' nfurzeri_gene_ensembl.tss.sorted.500bp_up_and_100bp_down.bed > nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.500bp_up_and_100bp_down.bed

❯ head nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed
==> nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed <==
LN600519.1	357430	397431		.	-	protein_coding	ENSNFUG00015017460
LN600519.1	552219	592220	wdr43	.	-	protein_coding	ENSNFUG00015017465
LN600519.1	557140	597141	TRMT61B	.	+	protein_coding	ENSNFUG00015017478
LN600519.1	562524	602525	spdya	.	-	protein_coding	ENSNFUG00015017485
LN600519.1	563086	603087	spdya	.	-	protein_coding	ENSNFUG00015017485
LN600519.1	563258	603259	spdya	.	-	protein_coding	ENSNFUG00015017485
LN600519.1	563784	603785		.	+	protein_coding	ENSNFUG00015017559
LN600519.1	611523	651524		.	-	protein_coding	ENSNFUG00015017563
LN600519.1	694463	734464		.	+	protein_coding	ENSNFUG00015017570
LN600519.1	789733	829734	LRFN2	.	-	protein_coding	ENSNFUG00015017567

==> nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed <==
LN600519.1	357430	397431	ENSNFUG00015017460
LN600519.1	552219	592220	ENSNFUG00015017465
LN600519.1	557140	597141	ENSNFUG00015017478
LN600519.1	562524	602525	ENSNFUG00015017485
LN600519.1	563086	603087	ENSNFUG00015017485
LN600519.1	563258	603259	ENSNFUG00015017485
LN600519.1	563784	603785	ENSNFUG00015017559
LN600519.1	611523	651524	ENSNFUG00015017563
LN600519.1	694463	734464	ENSNFUG00015017570
LN600519.1	789733	829734	ENSNFUG00015017567

Files from previous post attached:
nfurzeri_gene_ensembl.tss.sorted.zip

@juhi-ku
Copy link
Author

juhi-ku commented Sep 24, 2024

Hi, thank you! But where is this BED files needed? To form the create_cisTarget_databases I need fasta file and motif file right? The tutorial is quite confusing. Can you please just tell what to do after this?

@ghuls
Copy link
Member

ghuls commented Sep 30, 2024

You need the BED file to create a FASTA file with the sequences for your gene regions.

# Add "#number" to gene IDs" (only needed due to internal reasons by create_cisTarget_databases).
awk -F '\t' -v OFS='\t' '{ $4 = $4 "#" NR }' nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed > nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.for_create_cisTarget_databases.bed

# Create FASTA file for gene regions with BEDTools:
bedtools getfasta \
    -nameOnly \
    -fi killifish_genome.fa \
    -fo kilifish_gene_regions_20kb_up_and_down.fa \
    -bed nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.for_create_cisTarget_databases.bed



# FASTA file with sequences per region IDs / gene IDs.
fasta_filename='./kilifish_gene_regions_20kb_up_and_down.fa '
# Directory with motifs in Cluster-Buster format.
motifs_dir='./v10nr_clust_public/singletons/'
# File with motif IDs (base name of motif file in ${motifs_dir}).
motifs_list_filename='./v10nr_clust_public/motif_list.txt'
# cisTarget motif database output prefix.
db_prefix='./nfurzeri_gene_ensembl_20kb_up_and_down'

nbr_threads=22


"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -g '#[0-9]+$' \
    -o "${db_prefix}" \
    -t "${nbr_threads}"

Cluster-Buster motif files from our public collection can be found here:
https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/v10nr_clust_public.zip

# Download public motif collection.
wget https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/v10nr_clust_public.zip

# Extract public motif collection.
unzip v10nr_clust_public.zip

# Create motif list file.
ls -1 /tmp/v10nr_clust_public/singletons/ > v10nr_clust_public/motif_list.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants