To improve the success of, and confidence in, designed glycosylation sites, Cyrus has implemented glyco-predictor, an in-house version of DeepNGlyPred (DNGP), a deep neural network (DNN) tool for sequence-based prediction of human N-linked glycosylation. This report outlines the key architecture parameters and features, the benchmark datasets, and the overall performance evaluation, and gives an overview of the workflow and how to use the tool.
The glycosylation prediction tool is implemented as an Argo workflow and an API. The Argo workflow can be found at https://github.com/CyrusBiotechnology/glyco-predictor
Engine API usage
...
Quickstart
Run predictions on given FASTA sequence
Code Block |
---|
cyrus engine submit glyco-predictor input.fasta |
The API takes approximately an hour to run and generates a single report, a CSV file listing the N-linked glycosylation predictions.
Tool Implementation Details
Architecture and Features
DNGP architecture parameters were optimized via grid search with three-fold cross-validation on an independent training set. Final DNN parameters selected by Pakhrin et al. are summarized in Table 1. Optimal features for classification were a combination of structural predictions (ASA, RSA, SS, disorder, torsion) via NetSurfP, gapped dipeptide rates, and PSSM data from PSI-BLAST.
Parameter Name | Parameter Used |
---|---|
Number of layers | 4 |
Number of neurons in three layers | 150 |
Number of neurons in output layer | 2 |
Activation function | sigmoid |
Activation function at output layer | softmax |
Optimizer | Adam |
Learning rate | 0.001 |
Objective / loss function | binary_crossentropy |
Model checkpoint | val_accuracy |
Reduce learning rate on plateau | Factor = 0.001 |
Early stopping | patience = 5 |
Dropout | 0.3 |
Batch_size | 256 |
Epochs | 400 |
Table 1. Selected DNN architectural parameters from Pakhrin et al.
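As a rough illustration of the architecture in Table 1, the sketch below runs a forward pass through three sigmoid hidden layers of 150 units followed by a two-unit softmax output. It reads "4 layers" as three hidden layers plus the output layer, uses random weights rather than the trained DNGP weights, and takes a placeholder input size (`INPUT_DIM`), since the feature vector length is not stated in this report. Dropout is left out because it only applies during training.

```python
import math
import random

# Values taken from Table 1. INPUT_DIM is a placeholder (assumption): the
# length of the DNGP feature vector is not given in this report.
HIDDEN_LAYERS = 3
HIDDEN_UNITS = 150
OUTPUT_UNITS = 2
INPUT_DIM = 8


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    # Subtract the maximum for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def build_network(input_dim, rng):
    sizes = [input_dim] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [OUTPUT_UNITS]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(network, x):
    """Sigmoid on every hidden layer, softmax on the output layer."""
    for layer in network[:-1]:
        x = sigmoid(dense(x, layer))
    return softmax(dense(x, network[-1]))


def count_parameters(network):
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)


rng = random.Random(0)
network = build_network(INPUT_DIM, rng)
probs = forward(network, [rng.random() for _ in range(INPUT_DIM)])
print(len(network), count_parameters(network), probs)
```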
Inputs
--fasta-file
(str) Input FASTA file containing the sequence of interest
Outputs
dngp-report.csv
CSV file containing the positions of glycosylation motifs within the input sequence and the predicted glycosylation propensity for each.
Notes
Feature selection
The best features for predicting glycosylation were structure-based features predicted by NetSurfP-3.0. The optimal window size for NetSurfP-3.0 predictions depended on the training set (N-GlycositeAtlas = 41, N-GlyDE = 25). Window size is defined as the number of residues in the window centered on the asparagine of the glycosylation motif (N-X-[S/T]), counting both the flanking residues and the asparagine itself. This dependence suggests potential for refinement via alternative structural prediction tools (e.g., Rosetta) and further optimization of window size with additional data.
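The window definition above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the function names, the `-` padding character at the termini, and the exclusion of proline at the X position (the usual sequon convention, not stated in this report) are all assumptions.

```python
import re

# Lookahead so that overlapping sequons (e.g. "NNSS") are all reported.
# Excluding proline at X is an assumption; the report only writes N-X-[S/T].
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 0-based indices of the asparagine in each N-X-[S/T] motif."""
    return [m.start() for m in SEQUON.finditer(sequence.upper())]


def extract_window(sequence, center, window_size, pad="-"):
    """Return window_size residues centered on `center`, padded at the termini."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the asparagine is central")
    half = window_size // 2
    residues = []
    for i in range(center - half, center + half + 1):
        residues.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(residues)


sequence = "MNGTANPSNASC"  # toy sequence, not a real protein
sites = find_sequons(sequence)
# Keys are 1-based asparagine positions; a window of 5 keeps the toy output short.
windows = {pos + 1: extract_window(sequence, pos, 5) for pos in sites}
print(windows)
```

For the two DNGP models, `window_size` would be 41 (N-GlycositeAtlas) or 25 (N-GlyDE) instead of the 5 used in this toy example.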
To best replicate the structural features encoded in the DNGP models, internal development of NetSurfP-3.0 was necessary, since pre-trained models are not readily available. Using publicly available training data, an internal NetSurfP-3.0 model was implemented for workflow use. Note, however, that the original DNGP models were trained on data generated by NetSurfP-2.0. The main difference between the two versions lies in the underlying architecture: NetSurfP-3.0 uses an ESM model to improve runtime performance. Other performance metrics were reported to show no significant differences.
...
DNGP was originally evaluated on two different datasets: data adapted from N-GlyDE and data from N-GlycositeAtlas. The N-GlyDE data had 2050 experimentally verified N-linked glycosylation sites from 832 human glycoproteins; 1030 sites in these proteins that were not experimentally verified were treated as negative sites. N-GlycositeAtlas is a database of N-linked glycosylation covering 7204 glycoproteins, which yielded 9450 positive and 9450 negative sites after filtering and sampling. Negative sites for the N-GlycositeAtlas dataset were selected from glycoproteins localized to the nucleus and mitochondria, on the assumption that these proteins do not undergo N-linked glycosylation (Note: the team has acknowledged this caveat and decided to proceed with caution).
The two datasets used for training resulted in two different models of DNGP, essentially differentiated by training set (N-GlyDE vs N-GlycositeAtlas) and window-size (25 vs 41). Both models were used to evaluate internal benchmark sets and replication.
...
Performance
Table 2 summarizes the accuracy, precision, recall, and specificity of the Cyrus glyco-predictor tool on different benchmarks. The N-GlyDE dataset consisted of 167 positive glycosylation sites and 280 negative sites, none of which were included in any of the model training sets. Benchmarks 1 and 2 are internal mass spectrometry results from glycoproteomic analysis of the ACE2 constructs S19 and v2.4, used to evaluate glycosylation at 10 and 8 N-X-[S/T] sites, respectively (S19 contains two designed glycosites at positions 660 and 718). Both datasets were used as additional validation sets, along with internal data for GALA glycosylation.
Averaged over the N-GlyDE dataset and Benchmarks 1 and 2, glyco-predictor achieved 74.8% accuracy, 75.6% precision, 74.5% recall, and 80.0% specificity. Although there may be room to improve the model, these benchmarks reflect acceptable performance for predicting human N-linked glycosylation.
Benchmark | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) |
---|---|---|---|---|
N-GlyDE dataset | 79.4 | 67.0 | 88.6 | 73.9 |
Benchmark 1 | 70 | 60 | 75 | 66 |
Benchmark 2 | 75 | 100 | 60 | 100 |
GALA | 67.5 | | 64 | 93 |
Average | 74.8 | 75.6 | 74.5 | 80.0 |
Table 2. Performance of Cyrus glyco-predictor tool on different benchmarks.
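For reference, the four metrics in the table above follow the standard definitions based on confusion-matrix counts, sketched below. The counts passed to `classification_metrics` are hypothetical: they were chosen only because they reproduce the Benchmark 2 row for an 8-site benchmark, and are not reported values. The second half averages the three rows that report all four metrics, which agrees with the Average row to within rounding.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and specificity as percentages."""
    def pct(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "accuracy": pct(tp + tn, tp + fp + tn + fn),
        "precision": pct(tp, tp + fp),
        "recall": pct(tp, tp + fn),
        "specificity": pct(tn, tn + fp),
    }


# Hypothetical counts (assumption) consistent with an 8-site benchmark.
example = classification_metrics(tp=3, fp=0, tn=3, fn=2)

# Rows of the benchmark table that report all four metrics, in the order
# accuracy, precision, recall, specificity.
rows = {
    "N-GlyDE dataset": (79.4, 67.0, 88.6, 73.9),
    "Benchmark 1": (70, 60, 75, 66),
    "Benchmark 2": (75, 100, 60, 100),
}
averages = [sum(column) / len(rows) for column in zip(*rows.values())]

print(example)
print([round(value, 2) for value in averages])
```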
Future Improvements
Primary areas for potential improvement involve modifying the input features used for training. As mentioned, structure-based features predicted by NetSurfP-3.0 could be benchmarked against Rosetta-calculated features. Doing so would enable greater control over the desired parameters and introduce more fine-grained features to the training data. Currently, NetSurfP-3.0 is used to predict secondary structure (q3), torsion angles, accessible surface area, and disorder. RMSD, B-factor, or alternative metrics for flexibility could be introduced as additional features. Additionally, the window size for the flanking residues surrounding the sequon asparagine could be re-evaluated and screened to determine whether there is room for further optimization.
Expansion of the model domain from human to mammalian N-linked glycosylation may help with generalizing glycosylation design evaluation for a variety of expression systems and for assessing viability for animal models downstream.
Argo Workflow Usage
Note |
---|
You should probably use the cyrus Engine API described above |
Argo Submission
Code Block |
---|
argo submit workflow.yaml -p fasta="gcspath/example.fasta" |
1. Preparation
Upload the FASTA file to the default Argo bucket on GCS in the relevant project.
2. Feature Extraction
Input = fasta
Output = CSV of DNGP features
Steps:
PSSM generation with PSI-BLAST
Secondary structure prediction with NetSurfP-3.0
Gapped Dipeptide Rate calculations
Data formatting
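To give a feel for the gapped dipeptide step listed above, the sketch below computes the frequency of residue pairs separated by a fixed gap within a sequence window. This is one common definition of gapped dipeptide composition, used here as an assumption; the exact definition, gap values, and normalization used by the workflow are not given in this report, and the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gapped_dipeptide_rates(window, gap):
    """Frequency of residue pairs (a, b) separated by `gap` residues.

    Pairs containing a non-standard character (e.g. padding) are skipped.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    step = gap + 1
    pairs = [
        (window[i], window[i + step])
        for i in range(len(window) - step)
        if window[i] in AMINO_ACIDS and window[i + step] in AMINO_ACIDS
    ]
    counts = Counter(pairs)
    total = len(pairs)
    return {pair: n / total for pair, n in counts.items()} if total else {}


# Toy window (assumption), with one residue between the paired positions.
rates = gapped_dipeptide_rates("PSNASNAT", gap=1)
print(rates)
```

In a feature CSV, each (residue, residue, gap) combination would typically become one column, but that layout is likewise an assumption here.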
3. Prediction
Input = DNGP feature CSV
Output = report of N-linked glycosylation predictions
Links
...
References
Pakhrin et al. DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction. Molecules. 2021; 26(23):7314. https://doi.org/10.3390/molecules26237314