To improve the success of and confidence in designed glycosylation sites, Cyrus has implemented glyco-predictor, an in-house version of DeepNGlyPred (DNGP), a deep neural network (DNN) tool for sequence-based prediction of human N-linked glycosylation. The following report outlines key architecture parameters and features, benchmark datasets, and overall performance, and gives an overview of the workflow and how to use the tool.
Architecture and Features
DNGP architecture parameters were optimized via grid search with three-fold cross-validation on an independent training set. Final DNN parameters selected by Pakhrin et al. are summarized in Table 1. Optimal features for classification were a combination of structural predictions (ASA, RSA, SS, disorder, torsion) via NetSurfP, gapped dipeptide rates, and PSSM data from PSI-BLAST.
Parameter Name | Parameter Used |
Number of layers | 4 |
Number of neurons in three hidden layers | 150 |
Number of neurons in output layer | 2 |
Activation Function | sigmoid |
Activation Function at output layer | softmax |
Optimizer | Adam |
Learning rate | 0.001 |
Objective / loss function | binary_crossentropy |
Model checkpoint | val_accuracy |
Reduce learning rate on plateau | Factor = 0.001 |
Early stopping | patience = 5 |
Dropout | 0.3 |
Batch_size | 256 |
Epochs | 400 |
Table 1. Selected DNN architectural parameters from Pakhrin et al.
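The layer layout in Table 1 can be illustrated with a minimal pure-Python forward pass: three sigmoid hidden layers of 150 units followed by a 2-unit softmax output. This is a sketch of the shape of the network only, not the DNGP or glyco-predictor implementation. The weights are random, the input width (`n_features`) is a hypothetical value chosen by the caller, and dropout is omitted because it applies only during training.

```python
import math
import random

HIDDEN_UNITS = 150  # Table 1: neurons in each of the three hidden layers
OUTPUT_UNITS = 2    # Table 1: neurons in the output layer


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights; illustrative only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine transform: one dot product plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def build_network(n_features, seed=0):
    """Four layers in total: three hidden layers and one output layer."""
    rng = random.Random(seed)
    sizes = [n_features, HIDDEN_UNITS, HIDDEN_UNITS, HIDDEN_UNITS, OUTPUT_UNITS]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict_proba(network, features):
    """Sigmoid on the hidden layers, softmax on the output layer."""
    x = features
    for layer in network[:-1]:
        x = sigmoid(dense(x, layer))
    return softmax(dense(x, network[-1]))
```

Calling `predict_proba(build_network(10), [0.5] * 10)` returns two class probabilities that sum to 1. Which output index corresponds to the glycosylated class depends on how the labels were encoded during training and is not specified here.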
The best-performing features for predicting glycosylation were the structure-based features computed by NetSurfP-3.0. The optimal window size for these predictions depended on the training set (N-GlycositeAtlas = 41, N-GlyDE = 25). Window size is defined as the total number of residues in the window, that is, the central asparagine of the glycosylation motif (N-X-[S/T]) plus its flanking residues. This dependence on the training set suggests potential for refinement via alternative structural prediction tools (e.g. Rosetta) and for further optimization of window size with additional data.
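As an illustration of the window definition, the sketch below locates N-X-[S/T] motifs and cuts a window centered on the asparagine. It assumes a symmetric window, so a window size of 41 would mean 20 residues on each side and 25 would mean 12. Excluding proline at the X position and padding the termini with `-` are assumptions made for this example; the function names are hypothetical and not part of DNGP or glyco-predictor.

```python
import re

# N-X-[S/T] sequon. Excluding proline at X is a common convention and an
# assumption here; the report only writes the motif as N-X-[S/T].
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return 0-based indices of the asparagine in each N-X-[S/T] motif."""
    return [m.start() for m in SEQUON.finditer(sequence.upper())]


def extract_window(sequence, center, window_size, pad="-"):
    """Return window_size residues centered on index `center`.

    Positions that fall beyond either terminus are filled with `pad`
    (an assumed padding scheme).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the asparagine is central")
    flank = (window_size - 1) // 2
    padded = pad * flank + sequence.upper() + pad * flank
    # Index `center` in the original sequence sits at `center + flank` in
    # the padded string, so the window starts at `center`.
    return padded[center:center + window_size]


seq = "MKNGSAANPTLLNAT"
for site in find_sequons(seq):
    print(site + 1, extract_window(seq, site, window_size=7))
```

In this toy sequence the asparagines at positions 3 and 13 are reported, while the one in `NPT` is skipped because of the proline rule.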
To replicate the structural features encoded in the DNGP models as closely as possible, NetSurfP-3.0 had to be developed internally, since pre-trained models are not readily available. Using publicly available training data, an internal NetSurfP-3.0 model was implemented for use in the workflow; note, however, that the original DNGP models were trained on features generated by NetSurfP-2.0. The main difference between the versions is the underlying architecture, which was changed for speed (NetSurfP-3.0 uses an ESM model to improve runtime performance); other performance metrics were reported to show no significant differences.
Benchmarks
DNGP was originally evaluated on two datasets: one adapted from N-GlyDE and one from N-GlycositeAtlas. The N-GlyDE data comprised 2050 experimentally verified N-linked glycosylation sites from 832 human glycoproteins; 1030 candidate sites in these proteins that were not experimentally verified were treated as negatives. N-GlycositeAtlas is a database of N-linked glycosylation covering 7204 glycoproteins; after filtering and sampling, it yielded 9450 positive and 9450 negative sites. Negative sites for the N-GlycositeAtlas dataset were selected from glycoproteins localized to the nucleus and mitochondria, on the assumption that these proteins do not undergo N-linked glycosylation (Note: the team has acknowledged this caveat and decided to proceed with caution).
The two datasets used for training resulted in two different models of DNGP, essentially differentiated by training set (N-GlyDE vs N-GlycositeAtlas) and window-size (25 vs 41). Both models were used to evaluate internal benchmark sets and replication.
Cyrus glyco-predictor used both models and datasets as benchmarks, along with experimental results from internal programs, to evaluate feasibility of use. Overall, the N-GlycositeAtlas model (window size 41) outperformed the N-GlyDE model (window size 25). Performance comparisons focused primarily on overall recall and precision.
Performance
Table 2 summarizes the accuracy, precision, recall, and specificity of the Cyrus glyco-predictor tool on each benchmark. The N-GlyDE benchmark set consisted of 167 positive glycosylation sites and 280 negative sites, none of which were included in any model training set. Glycoproteomic analysis of ACE2 constructs S19 and v2.4 provided mass spectrometry results for evaluating glycosylation at 10 and 8 N-X-[S/T] sites, respectively (S19 contains two designed glycosites, at positions 660 and 718). Both ACE2 datasets were used as additional validation sets, along with internal data for GALA glycosylation.
Averaged across the four benchmarks, glyco-predictor achieved 72.9% accuracy, 81.1% precision, 71.9% recall, and 83.2% specificity. Although the model could likely be improved, these benchmarks reflect acceptable performance for predicting human N-linked glycosylation.
Benchmark | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) |
N-GlyDE dataset | 79.4 | 67.0 | 88.6 | 73.9 |
ACE2 (S19) | 70 | 60 | 75 | 66 |
ACE2 (v2.4) | 75 | 100 | 60 | 100 |
GALA | 67.5 | 97.9 | 64 | 93 |
Average | 72.9 | 81.1 | 71.9 | 83.2 |
Table 2. Performance of Cyrus glyco-predictor tool on different benchmarks.
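For reference, the four metrics in Table 2 follow from a confusion matrix as sketched below. The N-GlyDE counts used in the example (148 true positives, 73 false positives, 207 true negatives, 19 false negatives) are back-calculated from the reported percentages and the 167/280 positive/negative split. They are an assumption for illustration, not counts taken from the original evaluation.

```python
def _pct(numerator, denominator):
    """Percentage, or NaN when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else float("nan")


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and specificity as percentages."""
    return {
        "accuracy": _pct(tp + tn, tp + fp + tn + fn),
        "precision": _pct(tp, tp + fp),    # predicted positives that are real
        "recall": _pct(tp, tp + fn),       # real positives that were found
        "specificity": _pct(tn, tn + fp),  # real negatives that were rejected
    }


# Counts back-calculated from Table 2 (assumption, see the note above).
n_glyde = classification_metrics(tp=148, fp=73, tn=207, fn=19)
print({name: round(value, 1) for name, value in n_glyde.items()})
```

With these counts the rounded values match the N-GlyDE row of Table 2 (79.4, 67.0, 88.6, 73.9).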
Future Improvements
Primary areas for potential improvement involve modifying the input features used for training. As mentioned, the structure-based features predicted by NetSurfP-3.0 could be benchmarked against Rosetta-calculated features. Doing so would give greater control over the desired parameters and introduce more fine-grained features into the training data. Currently, NetSurfP-3.0 is used to predict secondary structure (Q3), torsion angles, accessible surface area, and disorder. RMSD, B-factor, or alternative flexibility metrics could be introduced as additional features. Additionally, the window size for the flanking residues surrounding the sequon asparagine could be re-screened to determine whether there is room for further optimization.
Expansion of the model domain from human to mammalian N-linked glycosylation may help with generalizing glycosylation design evaluation for a variety of expression systems and for assessing viability for animal models downstream.
Workflow and Quickstart
Argo Submission
argo submit workflow.yaml \
  -p fasta="/gcspath/example.fasta" \
  -p bucket-name=cyrus-playground
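For scripted submissions, the same command can be assembled in Python. The sketch below only builds the argument list from the two parameters shown above. The helper name `build_argo_command` is hypothetical, and actually running the command requires the argo CLI and cluster access, which are not assumed here.

```python
import shlex


def build_argo_command(fasta, bucket_name, workflow="workflow.yaml"):
    """Return the argo submit invocation as an argument list."""
    return [
        "argo", "submit", workflow,
        "-p", f"fasta={fasta}",
        "-p", f"bucket-name={bucket_name}",
    ]


cmd = build_argo_command("/gcspath/example.fasta", "cyrus-playground")
print(shlex.join(cmd))

# To submit for real (requires the argo CLI and cluster access):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when a path contains spaces or special characters.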
1. Preparation
Upload the FASTA file to a GCS bucket in the relevant project.
2. Feature Extraction
Input = fasta
Output = CSV of DNGP features
Steps:
PSSM generation with PSI-BLAST
Secondary structure prediction with NetSurfP-3.0
Gapped Dipeptide Rate calculations
Data formatting
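As an illustration of the gapped dipeptide step, the sketch below computes one common form of the feature: for a given gap g, the frequency of each residue pair separated by g intervening positions. The exact definition and normalization used by DNGP and by the workflow may differ; the function name is hypothetical.

```python
from collections import Counter


def gapped_dipeptide_frequencies(sequence, gap):
    """Frequency of residue pairs (a, b) with `gap` residues between them.

    gap=0 gives ordinary adjacent dipeptides. Frequencies are normalized
    by the number of pairs observed in the sequence.
    """
    seq = sequence.upper()
    step = gap + 1
    pairs = [seq[i] + seq[i + step] for i in range(len(seq) - step)]
    if not pairs:
        return {}  # sequence too short for this gap
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}


print(gapped_dipeptide_frequencies("ACAC", gap=1))
```

In practice such a function would typically be applied to the window around each sequon rather than the whole protein, for several gap values, with the results concatenated into one feature vector. That arrangement is an assumption, not a description of the workflow.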
3. Prediction
Input = DNGP feature CSV
Output = report of N-linked glycosylation predictions
Links
Pakhrin SC, Aoki-Kinoshita KF, Caragea D, KC DB. DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction. Molecules. 2021; 26(23):7314. https://doi.org/10.3390/molecules26237314
DeepNGlyPred GitHub Repository