Glycosylation Prediction

To improve the success of, and confidence in, designed glycosylation sites, Cyrus has implemented glyco-predictor, an in-house version of DeepNGlyPred (DNGP), a deep neural network (DNN) tool for sequence-based prediction of human N-linked glycosylation.

 

Quickstart

Run predictions on a given FASTA sequence:

cyrus engine submit glyco-predictor input.fasta

The API takes approximately one hour to run and generates a single CSV report listing the N-linked glycosylation predictions.

Inputs

  • --fasta-file (str)

    • Input FASTA file containing the sequence of interest
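Because a run takes about an hour, it can be worth checking the input file locally before submitting. The following is a minimal sketch of such a check, not part of glyco-predictor itself: `parse_fasta` and `nonstandard_positions` are hypothetical helper names, and whether the tool accepts multi-record files or non-standard residue codes is not stated here, so treat the check as advisory.

```python
from typing import Dict, Iterable, List

# The 20 standard amino acids (one-letter codes).
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            # Sequences may be wrapped over several lines; join them.
            records[header] += line.upper()
    return records


def nonstandard_positions(sequence: str) -> List[int]:
    """Return 1-based positions of characters outside the 20 standard residues."""
    return [i for i, aa in enumerate(sequence, start=1) if aa not in STANDARD_RESIDUES]


# Illustrative input only; a real check would read the lines of input.fasta.
example = """>demo_protein
MKNVSAAXLT
NGTQ
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), nonstandard_positions(seq))
```

To check a real file, pass an open file handle instead of `example`, for instance `parse_fasta(open("input.fasta"))`.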

Outputs

  • dngp-report.csv

    • CSV file listing the positions of glycosylation motifs within the input sequence and the predicted glycosylation propensity.
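The column layout of dngp-report.csv is not documented here, so a safe first step is to load the report generically and inspect its header. The sketch below uses only the Python standard library; `load_report` is a hypothetical helper, and the column names in the in-memory sample (`position`, `motif`, `score`) are placeholders chosen for illustration, not the real report schema.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def load_report(handle: TextIO) -> Tuple[List[str], List[Dict[str, str]]]:
    """Read a CSV report and return (column names, rows as dictionaries)."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    # fieldnames is None for an empty file, so fall back to an empty list.
    return list(reader.fieldnames or []), rows


# Hypothetical stand-in for dngp-report.csv; the real columns may differ.
sample = io.StringIO("position,motif,score\n3,NVS,0.91\n11,NGT,0.12\n")
columns, rows = load_report(sample)
print("columns:", columns)
print("number of rows:", len(rows))

# For a real report, open the file instead of using the sample:
# with open("dngp-report.csv", newline="") as fh:
#     columns, rows = load_report(fh)
```

Printing the column names first avoids hard-coding assumptions about the report before its actual header has been seen.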

Notes

Feature selection

The most predictive features for glycosylation were structure-based features computed by NetSurfP-3.0. The optimal window size for NetSurfP-3.0 predictions depended on the training set (N-GlycositeAtlas: 41; N-GlyDE: 25). Window size is defined as the total number of residues in the window centered on the asparagine of the glycosylation motif (N-X-[S/T]), counting both the flanking residues and the central asparagine itself.
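The window definition above can be illustrated with a short sketch that locates N-X-[S/T] motifs and cuts a centered window around each asparagine. This is an illustration under stated assumptions, not the DNGP or glyco-predictor implementation: `sequon_windows` is a hypothetical helper, excluding proline at the X position follows the conventional sequon definition rather than anything stated here, and padding windows that run past the termini with `-` is an arbitrary choice. In the actual workflow the window would index per-residue NetSurfP-3.0 features rather than raw sequence characters.

```python
import re
from typing import List, Tuple

# Assumption: X is any residue except proline (conventional sequon definition).
# The lookahead lets overlapping motifs such as "NNTS" all be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_windows(sequence: str, window: int = 41, pad: str = "-") -> List[Tuple[int, str]]:
    """Return (1-based asparagine position, window string) for each motif.

    `window` counts the central asparagine plus the flanking residues,
    so window=41 means 20 residues on each side of the asparagine.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    flank = window // 2
    seq = sequence.upper()
    # Assumption: pad both termini so every window has the same length.
    padded = pad * flank + seq + pad * flank
    results = []
    for match in SEQUON.finditer(seq):
        center = match.start()  # 0-based index of the asparagine in seq
        # In `padded` the asparagine sits at center + flank, so the
        # window starts at index `center`.
        results.append((center + 1, padded[center:center + window]))
    return results


# Small window so the result is easy to read by eye.
for position, win in sequon_windows("MKNVSAALLTNGTQ", window=5):
    print(position, win)
```

With `window=5` the two motifs in the example sequence yield `MKNVS` (asparagine at position 3) and `LTNGT` (asparagine at position 11); with the default `window=41` each window would span 20 residues on either side.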

To best replicate the structural features encoded in the DNGP models, NetSurfP-3.0 had to be developed internally because pre-trained models are not readily available. An internal NetSurfP-3.0 model was implemented for workflow use with publicly available training data; note, however, that the original DNGP models were trained on data generated with NetSurfP-2.0. The main difference between the two versions lies in the underlying architecture, which was changed for speed (NetSurfP-3.0 uses an ESM model to improve runtime performance); other performance metrics were reported to show no significant differences.

Performance

Table 1 summarizes the accuracy, precision, recall, and specificity of the Cyrus glyco-predictor tool on different benchmarks. The N-GlyDE dataset consisted of 167 positive glycosylation sites and 280 negative sites, none of which were included in any of the model training sets. Benchmarks 1 and 2 are internal results from mass spectrometry-based glycoproteomic analysis of two proteins, used to evaluate glycosylation at 10 and 8 N-X-[S/T] sites, respectively.

Averaged across the three benchmarks, glyco-predictor achieved 74.8% accuracy, 75.6% precision, 74.5% recall, and 80.0% specificity. Although there may be room to improve the model, these benchmarks reflect acceptable performance for predicting human N-linked glycosylation.

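| Benchmark       | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) |
|-----------------|--------------|---------------|------------|-----------------|
| N-GlyDE dataset | 79.4         | 67.0          | 88.6       | 73.9            |
| Benchmark 1     | 70           | 60            | 75         | 66              |
| Benchmark 2     | 75           | 100           | 60         | 100             |
| Average         | 74.8         | 75.6          | 74.5       | 80.0            |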

Table 1. Performance of Cyrus glyco-predictor tool on different benchmarks.
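For reference, the four metrics in Table 1 follow the standard confusion-matrix definitions. The sketch below shows how they are computed from counts of true/false positives and negatives; `ConfusionCounts` and `metrics` are hypothetical names, and the counts used are made-up values for a 10-site example, not the raw data behind any row of Table 1.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConfusionCounts:
    tp: int  # glycosylated sites predicted as glycosylated
    fp: int  # non-glycosylated sites predicted as glycosylated
    tn: int  # non-glycosylated sites predicted as non-glycosylated
    fn: int  # glycosylated sites predicted as non-glycosylated


def _pct(numerator: int, denominator: int) -> float:
    """Percentage, or NaN when the denominator is zero (metric undefined)."""
    return 100.0 * numerator / denominator if denominator else float("nan")


def metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and specificity in percent."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": _pct(c.tp + c.tn, total),
        "precision": _pct(c.tp, c.tp + c.fp),
        "recall": _pct(c.tp, c.tp + c.fn),
        "specificity": _pct(c.tn, c.tn + c.fp),
    }


# Hypothetical counts for a protein with 10 candidate sites.
example = metrics(ConfusionCounts(tp=4, fp=1, tn=3, fn=2))
for name, value in example.items():
    print(f"{name}: {value:.1f}%")
```

With only 8 to 10 sites per protein, as in Benchmarks 1 and 2, a single misclassified site shifts these percentages by ten points or more, so the small-benchmark rows are best read as coarse indicators.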

References

DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction

DeepNGlyPred Github Repository 

NetSurfP-3.0 Github Repository