PanPhlAn download pangenome 3_0

Léonard Dubois edited this page Oct 9, 2020 · 6 revisions

Pangenomes are built for species for which at least two reference genomes are available. The files are available on this DropBox. They can also be downloaded with the panphlan_download_pangenome.py script.

Example:

panphlan_download_pangenome.py -i Eubacterium_rectale

Input

  • -i INPUT_NAME the name of a species

Output

The tar.bz2 archive is downloaded, if available, and extracted at the location given by the --output argument. If none is provided, the pangenome folder is created in the current directory.
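
When several pangenomes are needed, the download can be scripted. The following is a minimal sketch, not part of PanPhlAn: it assumes panphlan_download_pangenome.py is executable and on the PATH, and the helper names `build_download_command` and `download_all` are hypothetical. It only uses the `-i`, `-o` and `-v` options listed in the help output.

```python
import subprocess


def build_download_command(species, output_dir=None, verbose=False):
    """Build the argument list for panphlan_download_pangenome.py.

    Assumption: the script is on the PATH; adjust the first element otherwise.
    """
    cmd = ["panphlan_download_pangenome.py", "-i", species]
    if output_dir is not None:
        cmd += ["-o", output_dir]  # where the pangenome folder is created
    if verbose:
        cmd.append("-v")  # show progress information
    return cmd


def download_all(species_names, output_dir=None, verbose=False):
    """Run the download script once per species, stopping at the first failure."""
    for species in species_names:
        subprocess.run(
            build_download_command(species, output_dir, verbose),
            check=True,  # raise CalledProcessError if the script exits non-zero
        )
```

For instance, `download_all(["Eubacterium_rectale"], output_dir="pangenomes")` would run the same command as the example above, with `-o pangenomes` added.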

Output content

The retrieved folder contains the pangenome contigs in a multi-FASTA .fna file, the bowtie2 indexes, an annotation .tsv file mapping gene families (UniRef) to GO, KO, KEGG, Pfam, eggNOG, etc., and a pangenome .tsv file containing all the information needed to map the genes to the sequences.
The columns of this last file are, in order: UniRef90 cluster ID, gene ID, genome ID, contig ID, start position, stop position.
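
To illustrate the column layout, here is a minimal sketch that reads the pangenome .tsv file into named records. It assumes the file is tab-separated with exactly the six columns above and no header row; `GeneRecord`, `parse_pangenome` and `genes_per_family` are hypothetical names, not part of PanPhlAn.

```python
import csv
from collections import defaultdict
from typing import NamedTuple


class GeneRecord(NamedTuple):
    gene_family: str  # UniRef90 cluster ID
    gene_id: str
    genome_id: str
    contig_id: str
    start: int
    stop: int


def parse_pangenome(lines):
    """Parse tab-separated pangenome lines (assumed to have no header row)."""
    records = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 6:
            continue  # skip blank or truncated lines
        family, gene, genome, contig, start, stop = row[:6]
        records.append(GeneRecord(family, gene, genome, contig, int(start), int(stop)))
    return records


def genes_per_family(records):
    """Group gene IDs by their UniRef90 gene family."""
    families = defaultdict(list)
    for rec in records:
        families[rec.gene_family].append(rec.gene_id)
    return dict(families)
```

`parse_pangenome` accepts any iterable of lines, so an open file handle can be passed directly, for example `with open(path) as handle: records = parse_pangenome(handle)`.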

Duplicated sam header error

Some pangenomes in the database may contain duplicated sequences, which raise an error during the mapping step: [E::sam_hrecs_update_hashes] Duplicate entry "XXX" in sam header.

In these cases, check for the duplication by comparing wc -l [species_name]_pangenome_contigs.fna with sort [species_name]_pangenome_contigs.fna | uniq | wc -l; the second count should be roughly half of the first.
Then cut the .fna file in half (for example with head -n [half of the lines in the fna file] [species_name]_pangenome_contigs.fna > new_contigs.fna) and regenerate the indexes using bowtie2-build.
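
As an alternative to cutting the file by line count, duplicated contigs can be dropped by name. This is a different technique from the `head` approach above, sketched under the assumption that duplicated records share the same contig name (the first word after `>`); the function names are hypothetical.

```python
def iter_fasta(lines):
    """Yield (header, sequence_lines) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def deduplicate_contigs(lines):
    """Keep the first record of each contig name.

    Returns (kept_lines, duplicate_names). Assumption: two records with the
    same name are true duplicates, so later copies can be discarded.
    """
    seen = set()
    kept, duplicates = [], []
    for header, seq in iter_fasta(lines):
        parts = header[1:].split()
        name = parts[0] if parts else ""
        if name in seen:
            duplicates.append(name)
            continue
        seen.add(name)
        kept.append(header)
        kept.extend(seq)
    return kept, duplicates
```

The kept lines can then be written to new_contigs.fna, one per line, before regenerating the indexes with bowtie2-build as described above.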

Help -h

usage: panphlan_download_pangenome.py [-h] [-i INPUT_NAME] [-o OUTPUT] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_NAME, --input_name INPUT_NAME
  -o OUTPUT, --output OUTPUT
  -v, --verbose         Show progress information