Main notebooks:
- DownloadCaspData: download raw data from the CASP website
- DsspSecondaryStructure: compute DSSP features
- MultipleSequenceAlignment: compute MSA features
- CadScore: compute ground-truth CAD scores
- LddtScore: compute ground-truth LDDT scores
- TmScore: compute ground-truth GDT-TS and TM scores
- Preprocessing: compute graph features and repack all input features and ground-truth scores for training
- Training: an example training script (the actual one is in src/graphqa/train.py)
- QaMetrics: compute QA metrics by comparing predicted and ground-truth scores
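As a rough illustration of what comparing predicted and ground-truth scores can look like, the sketch below computes two commonly used global QA metrics for the decoys of a single target: the Pearson correlation and the first-rank loss. The function names and the example scores are hypothetical and are not taken from the notebooks; the actual QaMetrics notebook may compute different or additional metrics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def first_rank_loss(predicted, true):
    """True score of the best decoy minus the true score of the decoy ranked first by the predictor."""
    top = max(range(len(predicted)), key=lambda i: predicted[i])
    return max(true) - true[top]


# Hypothetical global scores for four decoys of one target (illustrative values only)
true_scores = [0.80, 0.55, 0.30, 0.65]
pred_scores = [0.70, 0.60, 0.20, 0.75]

r = pearson(pred_scores, true_scores)
loss = first_rank_loss(pred_scores, true_scores)
print(f"Pearson r = {r:.3f}, first-rank loss = {loss:.2f}")
```

In this toy example the predictor ranks the fourth decoy first although the first decoy is the best one, so the first-rank loss is the gap between their true scores.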
Other notebooks and files:
- GraphConnectivityAndSeparationEncoding: examples of graph connectivity by distance and separation
- PositionalEncoding: simple implementation of positional encoding
- ProteinMetrics: difference between all-models and per-model Pearson correlation of local scores
- RankingMetrics: recall@x and normalized discounted cumulative gain (NDCG)
- Zscore: example z-score calculation
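For readers unfamiliar with the ranking metrics listed above, here is a minimal sketch of recall@k and NDCG (normalized discounted cumulative gain) for the decoys of one target. It assumes that the true score of a decoy is used directly as its relevance and that k does not exceed the number of decoys; the function names and example values are hypothetical, and the RankingMetrics notebook may define these metrics differently.

```python
import math


def ranking(scores):
    """Indices of the decoys sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def recall_at_k(predicted, true, k):
    """Fraction of the true top-k decoys that also appear in the predicted top-k."""
    top_pred = set(ranking(predicted)[:k])
    top_true = set(ranking(true)[:k])
    return len(top_pred & top_true) / k


def dcg(relevances):
    """Discounted cumulative gain with a log2 discount on the rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(predicted, true, k):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    gains = [true[i] for i in ranking(predicted)[:k]]
    ideal = sorted(true, reverse=True)[:k]
    return dcg(gains) / dcg(ideal)


# Hypothetical scores for four decoys of one target (illustrative values only)
true_scores = [0.80, 0.55, 0.30, 0.65]
pred_scores = [0.70, 0.60, 0.20, 0.75]

rec1 = recall_at_k(pred_scores, true_scores, k=1)
rec2 = recall_at_k(pred_scores, true_scores, k=2)
ndcg2 = ndcg_at_k(pred_scores, true_scores, k=2)
```

Here the predictor swaps the two best decoys: the true top-2 set is still recovered, but the best decoy is not ranked first, which NDCG penalizes through its rank discount.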
DownloadCaspData is the main notebook for downloading the raw protein data from the CASP website. This notebook should be run first if one wishes to reproduce the experiments in the paper.
All other notebooks provide an easy-to-follow overview of the entire pre- and post-processing pipeline,
but might not exactly reflect the steps used in the experiments.
The actual pre- and post-processing code, rewritten for efficiency and ease of use, can be found in src/graphqa/data.
We use the following external tools for pre-processing:
- Jackhmmer (v3.3) for multiple-sequence alignments
- Dockerized DSSP (697deab) for computing DSSP secondary-structure features
- Voronota (v1.21.2744) for computing CAD scores
- Dockerized OpenStructure (v2.1.0) for computing LDDT scores
- TMscore (v2019/11/25) for computing TM scores
For each protein dataset, the following folder structure is used:
CASP13
├── sequences.fasta             Primary sequences
├── alignments.pkl              Multiple-sequence alignment features
├── QA_groups.csv               QA group names and ids
├── decoy_name_mapping.pkl      Mapping from decoy filenames to (target, decoy) pairs,
│                               e.g. 'T0953s1TS368_3' -> ('T0953s1', 'BAKER-ROSETTASERVER_TS3')
├── decoys                      Decoy files (raw structures, DSSP outputs, ground-truth scores)
│   ├── T0949.cad.npz
│   ├── T0949.lddt.npz
│   ├── T0949.tmscore.npz
│   ├── T0949
│   │   ├── 3D-JIGSAW_SL1_TS1.dssp
│   │   ├── 3D-JIGSAW_SL1_TS1.pdb
│   │   ├── ...
│   │   ├── Zhou-SPOT-3D_TS5.dssp
│   │   └── Zhou-SPOT-3D_TS5.pdb
│   ├── ...
│   ├── T1022s2.cad.npz
│   ├── T1022s2.lddt.npz
│   ├── T1022s2.tmscore.npz
│   └── T1022s2
│       ├── 3D-JIGSAW_SL1_TS1.dssp
│       ├── 3D-JIGSAW_SL1_TS1.pdb
│       ├── ...
│       ├── Zhou-SPOT-3D_TS5.dssp
│       └── Zhou-SPOT-3D_TS5.pdb
├── native                      Native structures
│   ├── T0949.pdb
│   ├── ...
│   └── T1022s2.pdb
├── processed                   Decoys processed for training (precomputed graphs)
│   ├── T0949.pth
│   ├── ...
│   └── T1022s2.pth
├── QA_official                 Official local QA scores from CASP
│   ├── T0949
│   │   ├── T0949QA014_1.lga
│   │   ├── T0949QA014_2.lga
│   │   ├── ...
│   │   ├── T0949QA471_1.lga
│   │   └── T0949QA471_2.lga
│   ├── ...
│   └── T1022s2
│       ├── T1022s2QA014_1.lga
│       ├── T1022s2QA014_2.lga
│       ├── ...
│       ├── T1022s2QA471_1.lga
│       └── T1022s2QA471_2.lga
└── QA_predictions              QA predictions made by other groups (CASP QA format)
    ├── T0949
    │   ├── T0949QA014_1
    │   ├── T0949QA014_2
    │   ├── ...
    │   ├── T0949QA471_1
    │   └── T0949QA471_2
    ├── ...
    └── T1022s2
        ├── T1022s2QA014_1
        ├── T1022s2QA014_2
        ├── ...
        ├── T1022s2QA471_1
        └── T1022s2QA471_2
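To make the layout above concrete, the following sketch walks the decoys folder of a dataset and collects the decoy names available for each target. The helper list_decoys is hypothetical and not part of the repository; the demo builds a tiny throwaway directory that mimics the structure, so the snippet runs without the real data.

```python
import tempfile
from pathlib import Path


def list_decoys(dataset_dir):
    """Map each target id to the sorted decoy names found under decoys/<target>/*.pdb."""
    decoys_dir = Path(dataset_dir) / "decoys"
    targets = {}
    # Only sub-directories are targets; the *.npz score files next to them are skipped
    for target_dir in sorted(p for p in decoys_dir.iterdir() if p.is_dir()):
        targets[target_dir.name] = sorted(p.stem for p in target_dir.glob("*.pdb"))
    return targets


# Build a miniature dataset that follows the layout described above (empty placeholder files)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "CASP13"
    target_dir = root / "decoys" / "T0949"
    target_dir.mkdir(parents=True)
    for name in ("3D-JIGSAW_SL1_TS1", "Zhou-SPOT-3D_TS5"):
        (target_dir / f"{name}.pdb").touch()
        (target_dir / f"{name}.dssp").touch()
    (root / "decoys" / "T0949.cad.npz").touch()

    demo = list_decoys(root)

print(demo)
```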
All notebooks are saved as .ipynb and .py (percent script) and kept in sync through Jupytext.
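For reference, a percent script is a plain Python file in which `# %%` comments mark cell boundaries and `# %% [markdown]` introduces a Markdown cell stored as comments. The sketch below shows what such a file can look like; the cell contents, the zscore helper, and the use of the population standard deviation are illustrative assumptions and are not copied from the Zscore notebook.

```python
# %% [markdown]
# # Z-score example
# Markdown cells are stored as commented lines.

# %%
import statistics


def zscore(values):
    """Standardize values using the population standard deviation (an assumption)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# %%
# Hypothetical scores, for illustration only
scores = [0.2, 0.4, 0.6]
z = zscore(scores)
print(z)
```

With a notebook already paired to its script, running `jupytext --sync notebook.ipynb` updates whichever of the two files is out of date; pairing can be set up with `jupytext --set-formats ipynb,py:percent notebook.ipynb`.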