CASP Data Download and Preprocessing

Overview of notebooks

Main notebooks:

DownloadCaspData: download raw data from the CASP website
DsspSecondaryStructure: compute DSSP features
MultipleSequenceAlignment: compute MSA features
CadScore: compute ground-truth CAD scores
LddtScore: compute ground-truth LDDT scores
TmScore: compute ground-truth GDT-TS and TM scores
Preprocessing: compute graph features and repack all input features and ground-truth scores for training
Training: an example training script (the actual one is in src/graphqa/train.py)
QaMetrics: compute QA metrics by comparing predicted and ground-truth scores

Other notebooks and files:

GraphConnectivityAndSeparationEncoding: examples of graph connectivity by distance and separation
PositionalEncoding: simple implementation of positional encoding
ProteinMetrics: difference between all-models and per-model Pearson correlation of local scores
RankingMetrics: recall@x and normalized discounted cumulative gain (NDCG)
Zscore: example z-score calculation
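
The distinction that ProteinMetrics illustrates can be sketched in a few lines. This is a minimal example on toy values, not the notebook's code: the `pearson` helper and the `decoys` dictionary are hypothetical, and it assumes that "all-models" means pooling the residues of every decoy into one correlation, while "per-model" means computing one correlation per decoy and averaging.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Toy local scores keyed by decoy name: (predicted, true), one value per residue.
# These numbers are made up for illustration.
decoys = {
    "decoy_a": ([0.1, 0.2, 0.3], [0.2, 0.4, 0.6]),
    "decoy_b": ([0.9, 0.8, 0.7], [0.6, 0.4, 0.2]),
}

# All-models: pool the residues of every decoy, then correlate once.
all_pred = [p for pred, _ in decoys.values() for p in pred]
all_true = [t for _, true in decoys.values() for t in true]
r_all = pearson(all_pred, all_true)

# Per-model: correlate within each decoy, then average over decoys.
r_per = sum(pearson(pred, true) for pred, true in decoys.values()) / len(decoys)

print(f"all-models: {r_all:.3f}, per-model: {r_per:.3f}")
```

In this toy case each decoy is ranked perfectly on its own, yet the pooled correlation is much lower because the predictions of the two decoys sit on different scales.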

DownloadCaspData is the main notebook for downloading the raw protein data from the CASP website. This notebook should be run first if one wishes to reproduce the experiments in the paper.

The other notebooks give an easy-to-follow overview of the entire pre- and post-processing pipeline, but may not exactly reflect the steps used in the experiments. The actual pre- and post-processing code, rewritten for efficiency and ease of use, can be found in src/graphqa/data.

External software

We use the following external tools for pre-processing:

Jackhmmer (v3.3) for multiple-sequence alignments
Dockerized DSSP (697deab) for computing DSSP features
Voronota (v1.21.2744) for computing CAD scores
Dockerized OpenStructure (v2.1.0) for computing LDDT scores
TMscore (v2019/11/25) for computing TM scores
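
Before running the notebooks, it can help to check that the tools are reachable. The snippet below is a small sketch, not part of the repository: the `missing_tools` helper is hypothetical, and the executable names in `REQUIRED_TOOLS` are assumptions that may differ from the local installation (DSSP and OpenStructure are expected to run through Docker).

```python
import shutil

# Assumed executable names; adjust to match the local installation.
REQUIRED_TOOLS = ["jackhmmer", "voronota", "TMscore", "docker"]

def missing_tools(names=REQUIRED_TOOLS):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that an executable exists on PATH; it does not verify the versions listed above.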

Folder structure

For each protein dataset, the following folder structure is used:

CASP13
├── sequences.fasta                    Primary sequences
├── alignments.pkl                     Multiple-sequence alignment features
├── QA_groups.csv                      QA group names and ids
├── decoy_name_mapping.pkl             Mapping from decoy filenames to (target, decoy) pairs, 
│                                      e.g. 'T0953s1TS368_3' -> ('T0953s1', 'BAKER-ROSETTASERVER_TS3')
├── decoys                             Decoy files (raw structures, dssp outputs, ground-truth scores)
│   ├── T0949.cad.npz
│   ├── T0949.lddt.npz
│   ├── T0949.tmscore.npz
│   ├── T0949
│   │   ├── 3D-JIGSAW_SL1_TS1.dssp
│   │   ├── 3D-JIGSAW_SL1_TS1.pdb
│   │   ├── ...
│   │   ├── Zhou-SPOT-3D_TS5.dssp
│   │   └── Zhou-SPOT-3D_TS5.pdb
│   ├── ...
│   ├── T1022s2.cad.npz
│   ├── T1022s2.lddt.npz
│   ├── T1022s2.tmscore.npz
│   └── T1022s2
│       ├── 3D-JIGSAW_SL1_TS1.dssp
│       ├── 3D-JIGSAW_SL1_TS1.pdb
│       ├── ...
│       ├── Zhou-SPOT-3D_TS5.dssp
│       └── Zhou-SPOT-3D_TS5.pdb
├── native                             Native structures
│   ├── T0949.pdb
│   ├── ...
│   └── T1022s2.pdb
├── processed                          Decoys processed for training (precomputed graphs)
│   ├── T0949.pth
│   ├── ...
│   └── T1022s2.pth
├── QA_official                        Official local QA scores from CASP
│   ├── T0949
│   │   ├── T0949QA014_1.lga
│   │   ├── T0949QA014_2.lga
│   │   ├── ...
│   │   ├── T0949QA471_1.lga
│   │   └── T0949QA471_2.lga
│   ├── ...
│   └── T1022s2
│       ├── T1022s2QA014_1.lga
│       ├── T1022s2QA014_2.lga
│       ├── ...
│       ├── T1022s2QA471_1.lga
│       └── T1022s2QA471_2.lga
└── QA_predictions                     QA predictions made by other groups (CASP QA format)
    ├── T0949
    │   ├── T0949QA014_1
    │   ├── T0949QA014_2
    │   ├── ...
    │   ├── T0949QA471_1
    │   └── T0949QA471_2
    ├── ...
    └── T1022s2
        ├── T1022s2QA014_1
        ├── T1022s2QA014_2
        ├── ...
        ├── T1022s2QA471_1
        └── T1022s2QA471_2
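
The layout above can be navigated with a few path helpers. The following is a sketch under stated assumptions, not code from src/graphqa: the function names are hypothetical, and the decoy filename pattern (target id, `TS`, three-digit group id, underscore, model index) is inferred from the single example 'T0953s1TS368_3' in the tree. Turning the group id into a group name still requires decoy_name_mapping.pkl.

```python
import re
from pathlib import Path

def list_targets(dataset_dir):
    """Return sorted target ids, one per file in <dataset_dir>/native/*.pdb."""
    return sorted(p.stem for p in Path(dataset_dir, "native").glob("*.pdb"))

def target_files(dataset_dir, target):
    """Map a target id to the paths that the folder structure implies for it."""
    root = Path(dataset_dir)
    decoys = root / "decoys"
    return {
        "native": root / "native" / f"{target}.pdb",
        "decoy_dir": decoys / target,
        "cad": decoys / f"{target}.cad.npz",
        "lddt": decoys / f"{target}.lddt.npz",
        "tmscore": decoys / f"{target}.tmscore.npz",
        "processed": root / "processed" / f"{target}.pth",
    }

# Assumed pattern: <target>TS<3-digit group id>_<model index>
DECOY_NAME = re.compile(
    r"^(?P<target>T\d{4}(?:s\d+)?)TS(?P<group>\d{3})_(?P<model>\d+)$"
)

def split_decoy_name(name):
    """Split a raw decoy filename into (target, group id, model index)."""
    match = DECOY_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized decoy name: {name!r}")
    return match["target"], int(match["group"]), int(match["model"])
```

For example, `target_files("CASP13", "T0949")["lddt"]` points at CASP13/decoys/T0949.lddt.npz, and `split_decoy_name("T0953s1TS368_3")` yields the target id, group id 368 and model index 3.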

Notes

All notebooks are saved as .ipynb and .py (percent script) and kept in sync through Jupytext.
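
For readers unfamiliar with the percent format, the sketch below shows what such a .py file looks like: `# %%` starts a code cell and `# %% [markdown]` starts a markdown cell. The content is a made-up z-score example on toy values, not the actual Zscore notebook, and the use of the population standard deviation is an assumption.

```python
# %% [markdown]
# # Zscore
# Example z-score calculation on toy values.

# %%
from statistics import mean, pstdev

scores = [0.2, 0.5, 0.8]  # toy values, not taken from CASP

# %%
# Assumption: population standard deviation; the notebook may use another variant.
mu, sigma = mean(scores), pstdev(scores)
zscores = [(s - mu) / sigma for s in scores]
print(zscores)
```

Because the cell markers are ordinary comments, the file also runs as a regular Python script.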
