...
The glycosylation prediction tool is implemented as an Argo workflow and an API. The Argo workflow can be found at https://github.com/CyrusBiotechnology/glyco-predictor
Engine API usage
The best way to run the Glycosylation prediction tool is via the Cyrus Engine API. Given a protein sequence in a fasta file, the API is run like this:
...
The API takes approximately an hour to run and generates a single report: a CSV file listing N-linked glycosylation predictions.
Tool Implementation Details
Architecture and Features
DeepNGlyPred (DNGP) architecture parameters were optimized via grid search with three-fold cross-validation on an independent training set. The final DNN parameters selected by Pakhrin et al. are summarized in Table 1. The optimal features for classification were a combination of structural predictions from NetSurfP (accessible surface area (ASA), relative surface accessibility (RSA), secondary structure (SS), disorder, and torsion angles), gapped dipeptide rates, and PSSM data from PSI-BLAST.
...
To best replicate the structural features encoded in the DNGP models, NetSurfP-3.0 had to be developed internally, since pre-trained models are not readily available. An internal NetSurfP-3.0 model was implemented for workflow use from publicly available training data. Note, however, that the original DNGP models were trained on data generated by NetSurfP-2.0. The main difference between the two versions lies in the underlying architecture: NetSurfP-3.0 uses an ESM model to improve runtime performance. Other performance metrics were reported to show no significant differences.
Benchmarks
DNGP was originally evaluated on two datasets: data adapted from N-GlyDE and data from N-GlycositeAtlas. The N-GlyDE data had 2050 experimentally verified N-linked glycosylation sites from 832 human glycoproteins; 1030 sites in these proteins that were not verified as glycosylated were treated as negative sites. N-GlycositeAtlas is a database for N-linked glycosylation covering 7204 glycoproteins; after filtering and sampling, the dataset consisted of 9450 positive and 9450 negative sites. Negative sites for the N-GlycositeAtlas dataset were selected from glycoproteins localized in the nucleus and mitochondria, on the assumption that these proteins do not undergo N-linked glycosylation. (Note: the team has acknowledged this caveat and decided to proceed with caution.)
...
The Cyrus glyco-predictor was evaluated using both models and datasets as benchmarks, along with experimental results from internal programs, to assess feasibility of use. Overall, the N-GlycositeAtlas model (window size 41) performed better in predictions than the N-GlyDE model. Performance comparisons focused primarily on overall recall and precision.
Performance
Table 2 summarizes the accuracy, precision, recall, and specificity of the Cyrus glyco-predictor tool on different benchmarks. The N-GlyDE dataset consisted of 167 positive glycosylation sites and 280 negative sites, none of which were included in any of the model training sets. Glycoproteomic analysis of ACE2 constructs S19 and v2.4 provided mass spectrometry results for evaluating glycosylation of 10 and 8 N-X-[S/T] sites, respectively (S19 contains two designed glycosites at positions 660 and 718). Both datasets were used as additional validation sets, along with internal data for GALA glycosylation.
...
Table 2. Performance of Cyrus glyco-predictor tool on different benchmarks.
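The four metrics reported in Table 2 all derive from a binary confusion matrix. The following is a minimal sketch of how they can be computed from per-site labels and predictions; the function names and the 0/1 label encoding are illustrative assumptions, not taken from the glyco-predictor code base.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # glycosylated sites predicted as glycosylated
    fp: int  # non-glycosylated sites predicted as glycosylated
    tn: int  # non-glycosylated sites predicted as non-glycosylated
    fn: int  # glycosylated sites predicted as non-glycosylated


def confusion(y_true, y_pred):
    """Count outcomes for paired 0/1 labels (assumed encoding: 1 = glycosylated)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, tn, fn)


def safe_div(numerator, denominator):
    """Return NaN instead of raising when a class is absent from the benchmark."""
    return numerator / denominator if denominator else float("nan")


def metrics(c):
    """Accuracy, precision, recall, and specificity from a confusion matrix."""
    return {
        "accuracy": safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": safe_div(c.tp, c.tp + c.fp),
        "recall": safe_div(c.tp, c.tp + c.fn),
        "specificity": safe_div(c.tn, c.tn + c.fp),
    }
```

On small benchmarks such as the ACE2 constructs, a single misclassified site shifts these values noticeably, so the raw counts are worth reporting alongside the ratios.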
Future Improvements
The primary areas for potential improvement involve changes to the input features used for training. As mentioned, structure-based features predicted by NetSurfP-3.0 could be benchmarked against Rosetta-calculated features. Doing so would give greater control over the desired parameters and introduce more fine-grained features into the training data. Currently, NetSurfP-3.0 is used to predict secondary structure (Q3), torsion angles, accessible surface area, and disorder. RMSD, B-factor, or alternative flexibility metrics could be introduced as additional features. Additionally, the window size for the flanking residues surrounding the sequon asparagine could be re-evaluated and screened to determine whether there is room for further optimization.
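To make the window-size parameter concrete, the sketch below locates candidate sequons and extracts a fixed-size window centered on the asparagine. It is an illustration only: the N-X-[S/T] pattern with X not proline is the commonly used sequon definition and is assumed here, and the `-` padding character and the function names are hypothetical choices rather than the conventions used by DNGP or the glyco-predictor workflow.

```python
import re

# Zero-width lookahead so that overlapping sequons (e.g. "NNTS") are all found.
# Assumption: the sequon is N-X-[S/T] with X being any residue except proline.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(seq):
    """Return 0-based positions of the asparagine in each candidate sequon."""
    return [m.start() for m in SEQUON.finditer(seq.upper())]


def extract_window(seq, center, window_size=41, pad="-"):
    """Return `window_size` residues centered on `center`, padded at the termini."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the asparagine is centered")
    half = window_size // 2
    left = max(0, center - half)
    right = min(len(seq), center + half + 1)
    left_pad = pad * (half - (center - left))
    right_pad = pad * (half - (right - 1 - center))
    return left_pad + seq[left:right] + right_pad


def sequon_windows(seq, window_size=41):
    """Map each sequon position to its flanking-residue window."""
    return {pos: extract_window(seq, pos, window_size) for pos in find_sequons(seq)}
```

Screening alternative window sizes then amounts to calling `sequon_windows` with different odd values of `window_size` and regenerating the downstream features for each.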
Expansion of the model domain from human to mammalian N-linked glycosylation may help with generalizing glycosylation design evaluation for a variety of expression systems and for assessing viability for animal models downstream.
Argo Workflow Usage
Note |
---|
You should probably use the Cyrus Engine API described above |
Argo Submission
Code Block |
---|
argo submit workflow.yaml -p fasta="gcspath/example.fasta" |
1. Preparation
Upload the FASTA file to the default Argo bucket on GCS in a relevant project.
2. Feature Extraction
Input = fasta
Output = CSV of DNGP features
...
Gapped Dipeptide Rate calculations
Data formatting
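As an illustration of the gapped dipeptide step, the sketch below computes the relative frequency of residue pairs separated by a fixed number of intervening positions. This page does not give the exact definition of "gapped dipeptide rate" used by DNGP, so the plain relative-frequency normalization and the function name here are assumptions.

```python
from collections import Counter


def gapped_dipeptide_freqs(seq, gap):
    """Relative frequency of residue pairs (a, b) with `gap` residues between them.

    gap=0 gives ordinary adjacent dipeptides; gap=1 pairs residues i and i+2.
    Keys of the returned dict are (a, b) tuples.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    step = gap + 1
    pairs = Counter((seq[i], seq[i + step]) for i in range(len(seq) - step))
    total = sum(pairs.values())
    if total == 0:
        return {}  # sequence too short to contain any pair at this gap
    return {pair: count / total for pair, count in pairs.items()}
```

Applied to a sequon window, one such dictionary per gap value could be flattened into columns of the feature CSV.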
3. Prediction
Input = DNGP feature CSV
Output = report of N-linked glycosylation predictions
Links
Pakhrin SC, Aoki-Kinoshita KF, Caragea D, KC DB. DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction. Molecules. 2021; 26(23):7314. https://doi.org/10.3390/molecules26237314
DeepNGlyPred GitHub Repository