To improve the success of, and confidence in, designed glycosylation sites, Cyrus has implemented glyco-predictor, an in-house version of DeepNGlyPred (DNGP), a deep neural network (DNN) tool for sequence-based prediction of human N-linked glycosylation.
Quickstart
Run predictions on a given FASTA sequence
cyrus engine submit glyco-predictor input.fasta
The API takes approximately an hour to run and generates a single CSV report listing the N-linked glycosylation predictions.
Inputs
--fasta-file
(str) Input FASTA file containing the sequence of interest
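Because a run takes about an hour, it can be worth checking the input file before submitting it. The following is a minimal sketch of such a check, not part of glyco-predictor itself: the helper names `read_fasta` and `invalid_residues` are hypothetical, and restricting the alphabet to the 20 standard amino acids is an assumption about what the tool accepts.

```python
# Hypothetical pre-submission check; not part of the glyco-predictor workflow.
# Assumption: only the 20 standard amino acid letters are expected in the input.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence data found before a '>' header line")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted list of characters outside the assumed alphabet."""
    return sorted(set(sequence) - VALID_RESIDUES)
```

A typical use would be to read `input.fasta`, pass its text to `read_fasta`, and confirm that `invalid_residues` returns an empty list for each sequence before submitting.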
Outputs
dngp-report.csv
CSV file containing the positions of glycosylation motifs within the input sequence and the predicted glycosylation propensity of each.
Notes
Feature selection
The most predictive features for glycosylation were structure-based features predicted by NetSurfP-3.0. The optimal window size for NetSurfP-3.0 predictions depended on the training set (N-GlycositeAtlas = 41, N-GlyDE = 25). Window size is defined as the total number of residues in a window centered on the asparagine of the glycosylation motif (N-X-[S/T]), counting the asparagine itself and its flanking residues. A window size of 41, for example, covers the central asparagine plus 20 residues on each side.
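The window definition above can be illustrated with a short sketch that locates N-X-[S/T] motifs and extracts the residue window around each central asparagine. This is an illustration under stated assumptions rather than the glyco-predictor implementation: the function name `sequon_windows` is hypothetical, excluding proline at the X position follows the conventional sequon definition and may not match the tool, and padding with `-` near the termini is an arbitrary choice.

```python
import re

# Conventional N-linked sequon: N-X-[S/T] where X is not proline.
# Assumption: whether glyco-predictor applies the proline exclusion is not
# stated in this document. The lookahead allows overlapping motifs.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_windows(sequence, window_size=41, pad="-"):
    """Return (1-based position, window) for each sequon asparagine.

    window_size counts the central asparagine plus its flanking residues,
    so it must be odd. The pad character for windows that run past either
    terminus is an assumption made for this sketch.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the asparagine is central")
    flank = window_size // 2
    seq = sequence.upper()
    padded = pad * flank + seq + pad * flank
    windows = []
    for match in SEQUON.finditer(seq):
        i = match.start()  # 0-based index of the asparagine in seq
        # In the padded string the asparagine sits at i + flank, so the
        # window centered on it spans padded[i : i + window_size].
        windows.append((i + 1, padded[i:i + window_size]))
    return windows
```

Calling `sequon_windows(seq, window_size=25)` would give windows matching the N-GlyDE setting, and the default of 41 matches the N-GlycositeAtlas setting.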
To best replicate the structural features encoded in the DNGP models, NetSurfP-3.0 had to be developed internally, since pre-trained models are not readily available. An internal NetSurfP-3.0 model was trained on publicly available data for use in the workflow; however, it is important to note that the original DNGP models were trained on data generated by NetSurfP-2.0. The main difference between the two versions lies in the underlying architecture, which was changed for speed (NetSurfP-3.0 uses an ESM model to improve runtime performance). Other performance metrics were reported to show no significant differences.
Performance
Table 1 summarizes the accuracy, precision, recall, and specificity of the Cyrus glyco-predictor tool on different benchmarks. The N-GlyDE dataset consists of 167 positive glycosylation sites and 280 negative sites, none of which were included in any of the model training sets. Benchmarks 1 and 2 are internal mass spectrometry-based glycoproteomic analyses of two proteins, covering 10 and 8 N-X-[S/T] sites, respectively.
Averaged across the three benchmarks, glyco-predictor achieved 74.8% accuracy, 75.6% precision, 74.5% recall, and 80.0% specificity. Although the model could likely be improved further, these benchmarks reflect acceptable performance for predicting human N-linked glycosylation.
Benchmark | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) |
N-GlyDE dataset | 79.4 | 67.0 | 88.6 | 73.9 |
Benchmark 1 | 70 | 60 | 75 | 66 |
Benchmark 2 | 75 | 100 | 60 | 100 |
Average | 74.8 | 75.6 | 74.5 | 80.0 |
Table 1. Performance of Cyrus glyco-predictor tool on different benchmarks.
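For reference, the four metrics in Table 1 follow the standard confusion-matrix definitions, sketched below. The function name `classification_metrics` is hypothetical, and the counts in the usage note are inferred from the reported percentages for illustration; the source reports only the percentages, not the underlying counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics as percentages.

    tp/fp/tn/fn are the counts of true positives, false positives,
    true negatives, and false negatives. Each denominator is assumed
    to be non-zero.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": 100 * (tp + tn) / total,   # correct calls over all sites
        "precision": 100 * tp / (tp + fp),     # predicted positives that are real
        "recall": 100 * tp / (tp + fn),        # real positives that were found
        "specificity": 100 * tn / (tn + fp),   # real negatives that were found
    }
```

As an illustration, `classification_metrics(tp=3, fp=0, tn=3, fn=2)` describes 8 sites and yields 75% accuracy, 100% precision, 60% recall, and 100% specificity, which is consistent with the Benchmark 2 row.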
References
DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction
DeepNGlyPred Github Repository