Preparation and Analysis

eddy-elisee edited this page Sep 12, 2024 · 2 revisions

Get identity percentage between targets and reference(s)

When modelling is carried out by MODELLER (ASMC default), the percentage of identity between the target sequences and the reference structure(s) is calculated. This information is used to determine the reference(s) that will be used to model and align the target. Active sites are extracted on the basis of the alignment between the targets and their respective reference.
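As an illustration of the idea above, here is a minimal sketch of how a percentage of identity could be computed from a pairwise alignment and used to pick the closest reference. The exact formula ASMC uses is not stated on this page; the gap handling below (gapped columns ignored) and the function names `percent_identity` and `closest_reference` are assumptions made for the example.

```python
from __future__ import annotations


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences carry a residue (assumption).
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def closest_reference(alignments: dict[str, tuple[str, str]]) -> str:
    """Return the id of the reference with the highest identity to the target.

    `alignments` maps a reference id to (aligned_target, aligned_reference).
    """
    return max(alignments, key=lambda ref_id: percent_identity(*alignments[ref_id]))
```

Other definitions of identity (for example, dividing by the full alignment length) give different values, so the numbers in identity_targets_refs.tsv should be taken as the authority.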

To extract and cluster the active sites from an MSA with multiple reference sequences, the user must first generate the file identity_targets_refs.tsv using the following subcommand:

usage: asmc identity [-h] (-s  | -m ) (-r  | -R )

options:
  -h, --help       show this help message and exit
  -s , --seqs      multi fasta file
  -m , --models    file containing all PDB paths
  -r , --ref-str   file containing the reference structure paths
  -R , --ref-seq   file containing the reference sequences id

The identity percentage can be calculated from either target sequences, if the user has run ASMC on a set of sequences, or target 3D structures, if the user has run ASMC on pre-built 3D models (MODELLER, AlphaFold...).

From a set of sequence targets

The user must provide a set of homologous protein sequences in a multi-FASTA file (-s) and a file containing the reference sequence IDs, passed with the --ref-seq option.

asmc identity -s sequences.fasta --ref-seq ref_seq.txt

From a set of structure targets

The user must provide a file listing the paths of the homologous protein structures (-m) and a file listing the reference structure paths, passed with the --ref-str option.

asmc identity -m models.txt --ref-str ref_str.txt
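The file given to -m is described in the help as a "file containing all PDB paths". A minimal sketch for building such a file is shown below, assuming one path per line and models stored with a .pdb extension in a single directory; both points are assumptions, and `write_path_list` is a hypothetical helper, not part of ASMC.

```python
from pathlib import Path


def write_path_list(model_dir: str, out_file: str, pattern: str = "*.pdb") -> int:
    """Write the absolute path of every matching file, one per line.

    Returns the number of paths written.
    """
    paths = sorted(Path(model_dir).glob(pattern))
    lines = [str(p.resolve()) for p in paths]
    # One path per line is an assumption about the expected file layout.
    Path(out_file).write_text("\n".join(lines) + ("\n" if lines else ""))
    return len(lines)
```

For example, `write_path_list("models", "models.txt")` would list every .pdb file found in a directory named models.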

Extract amino acid at a queried position

The subcommand extract extracts the lines of groups_x_min_y.tsv that contain a specific amino acid or residue type at a queried position.

usage: asmc extract [-h] -f  -p  -a  [-g]

options:
  -h, --help        show this help message and exit
  -f , --file       tsv file from run_asmc.py
  -p , --position   position where to find the specified amino acid type, e.g: 5
  -a , --aa-type    amino acid type to search, must be either 1-letter amino acid, 'aromatic', 'acidic', 'basic', 'polar' or 'hydrophobic'
  -g , --group      group id, if not used, search in all groups

The position numbering corresponds to the position in the active site sequences within groups_x_min_y.tsv. For example, if the user is looking for a tyrosine (Y) at position 5, the command line is as follows:

asmc extract -f groups_x_min_y.tsv -p 5 -a Y

The matching lines are written to stdout.
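The filtering logic can be sketched in Python as follows. This is not ASMC's implementation: it assumes a header-less, tab-separated file with three columns (ID, active site sequence, group ID), 1-based positions, and illustrative residue classes that may differ from the ones ASMC uses.

```python
import csv
import io

# Illustrative residue classes (assumption); ASMC's own definitions may differ.
RESIDUE_CLASSES = {
    "aromatic": set("FWY"),
    "acidic": set("DE"),
    "basic": set("KRH"),
    "polar": set("STNQCY"),
    "hydrophobic": set("AVLIMFWP"),
}


def residue_matches(residue: str, aa_type: str) -> bool:
    """True if `residue` is the given 1-letter code or belongs to the named class."""
    allowed = RESIDUE_CLASSES.get(aa_type, {aa_type.upper()})
    return residue.upper() in allowed


def extract_rows(tsv_text, position, aa_type, group=None):
    """Yield (id, sequence, group) rows whose residue at `position` matches.

    `position` is 1-based (assumption); `group=None` searches all groups.
    """
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip malformed or empty lines
        seq_id, sequence, group_id = row[0], row[1], row[2]
        if group is not None and group_id != str(group):
            continue
        if not 1 <= position <= len(sequence):
            continue
        if residue_matches(sequence[position - 1], aa_type):
            yield seq_id, sequence, group_id
```

Calling `list(extract_rows(text, 5, "Y"))` on the file contents mirrors the command shown above under these assumptions.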

Compare active sites from different methods

The subcommand compare compares the active sites found in two group files (groups_x_min_y.tsv), each obtained with a different method.

usage: asmc compare [-h] -f1  -f2  -id

options:
  -h, --help  show this help message and exit
  -f1         Group file 1
  -f2         Group file 2
  -id         identity_targets_refs.tsv

The user must provide one TSV group file per clustering method (MSA, structure, pairwise) with -f1 and -f2, and the file identity_targets_refs.tsv with the -id option.

asmc compare -f1 groups_x_min_y.tsv -f2 groups_a_min_b.tsv -id identity_targets_refs.tsv

The output file is named active_site_checking.tsv.

Retrieve unique active sites and obtain some statistics

The subcommand unique returns the unique active sites per group and some statistics.

usage: asmc unique [-h] -f

Returns the unique active sites per group and some statistics

options:
  -h, --help    show this help message and exit
  -f , --file   tsv group file with all active sites from asmc run

The user must pass the file groups_x_min_y.tsv to the subcommand:

asmc unique -f groups_x_min_y.tsv

The output files are unique_sequences.tsv and groups_stats.tsv.
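To make the idea concrete, the sketch below counts the distinct active site sequences in each group and derives two simple statistics. The columns of the input rows (ID, active site sequence, group ID) and the statistics chosen here are assumptions; the actual contents of unique_sequences.tsv and groups_stats.tsv are defined by ASMC.

```python
from collections import defaultdict


def unique_sites_per_group(rows):
    """Map each group id to {active site sequence: number of occurrences}."""
    counts = defaultdict(dict)
    for _seq_id, sequence, group_id in rows:
        counts[group_id][sequence] = counts[group_id].get(sequence, 0) + 1
    return {group_id: dict(sites) for group_id, sites in counts.items()}


def group_stats(unique_sites):
    """Per group: total number of members and number of unique active sites."""
    return {
        group_id: {"members": sum(sites.values()), "unique": len(sites)}
        for group_id, sites in unique_sites.items()
    }
```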

Visualisation with PyMOL

The subcommand pymol returns the path of the script to load into the PyMOL console. The script defines PyMOL commands that show the superposed active site residues. To visualise the active sites:

  • Open PyMOL and set a directory containing all the ASMC outputs as the working directory
  • Use the command run <path>/ASMC/asmc/zoom_active_site.py to load the functions
  • Use the command target ID, where ID is the ID of a built model, to load the target model and its reference structure
  • Use the command active_site to zoom in on the two active sites

The last command displays the list of corresponding positions in the PyMOL console, e.g.:

Ref - Target
189 SER - 94 VAL
190 THR - 95 SER
191 GLY - 96 SER
192 ILE - 97 ILE
193 CYS - 98 CYS
197 SER - 102 ALA
200 LEU - Gap
202 PHE - 103 ALA
235 THR - 131 ASP
268 PRO - 163 PRO
271 GLN - 166 GLN
272 TYR - 167 TYR
275 TYR - Gap
278 GLU - 170 SER
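If this listing is copied out of the console, it can be parsed for further analysis. The sketch below relies only on the layout shown above ("position residue - position residue", with Gap marking a missing counterpart); `parse_correspondence` and `conserved_positions` are hypothetical helpers, not part of ASMC.

```python
def _parse_side(part):
    """Return (position, residue), or (None, None) for a gap."""
    part = part.strip()
    if part.lower() == "gap":
        return None, None
    position, residue = part.split()
    return int(position), residue


def parse_correspondence(text):
    """Parse 'Ref - Target' lines into (ref_pos, ref_res, target_pos, target_res)."""
    pairs = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("Ref"):
            continue  # skip blank lines and the header line
        ref_part, target_part = line.split(" - ", 1)
        pairs.append(_parse_side(ref_part) + _parse_side(target_part))
    return pairs


def conserved_positions(pairs):
    """Keep the pairs where reference and target carry the same residue."""
    return [p for p in pairs if p[1] is not None and p[1] == p[3]]
```

Applied to the listing above, `conserved_positions` would keep pairs such as 192 ILE - 97 ILE and drop substitutions and gaps.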

Format TSV output to XLSX

The subcommand to_xlsx converts a TSV file into an XLSX file. In the new file, each position in the sequence is in its own column. The colors follow the WebLogo 3 "Chemistry (AA)" color scheme, summarized below:

Chemical properties   1-letter amino acid codes   Color
Polar                 G, S, T, Y, C               green
Neutral               Q, N                        purple
Basic                 K, R, H                     blue
Acidic                D, E                        red
Hydrophobic           A, V, L, I, P, W, F, M      black

usage: asmc to_xlsx [-h] [-o] -f

options:
  -h, --help     show this help message and exit
  -f, --file     Group tsv file
  -o, --outdir   output name (default: <input_name>.xlsx)

The user must provide a TSV file and run the following command:

asmc to_xlsx -f groups_x_min_y.tsv 

The output is an XLSX file, which can be opened with a spreadsheet program.
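The color scheme from the table above can be expressed as a simple lookup. The sketch below splits an active site sequence into one cell per position and attaches the color name for each residue. It only illustrates the mapping; how ASMC writes the XLSX file is not shown, and `sequence_to_cells` is a hypothetical helper.

```python
# Residues grouped by color, taken from the table above.
CHEMISTRY_COLORS = {
    "green": "GSTYC",      # polar
    "purple": "QN",        # neutral
    "blue": "KRH",         # basic
    "red": "DE",           # acidic
    "black": "AVLIPWFM",   # hydrophobic
}

# Invert the table: one entry per amino acid.
AA_COLOR = {
    aa: color for color, residues in CHEMISTRY_COLORS.items() for aa in residues
}


def sequence_to_cells(sequence):
    """Return one (residue, color) pair per position; unknown symbols get None."""
    return [(aa, AA_COLOR.get(aa.upper())) for aa in sequence]
```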