To improve the success of, and confidence in, designed glycosylation sites, Cyrus has implemented glyco-predictor, an in-house version of DeepNGlyPred (DNGP), a deep neural network (DNN) tool for sequence-based prediction of human N-linked glycosylation. This report outlines key architecture parameters and features, benchmark datasets, and overall performance, and gives an overview of the workflow and how to use the tool.

The glycosylation prediction tool is implemented as an Argo workflow and an API. The Argo workflow can be found at https://github.com/CyrusBiotechnology/glyco-predictor

Engine API usage

...

Quickstart

Run predictions on a given FASTA sequence:

Code Block
cyrus engine submit glyco-predictor input.fasta

The API takes approximately an hour to run and generates a single report: a CSV file listing the N-linked glycosylation predictions.

Tool Implementation Details

Architecture and Features

DNGP architecture parameters were optimized via grid search with three-fold cross-validation on an independent training set. Final DNN parameters selected by Pakhrin et al. are summarized in Table 1. Optimal features for classification were a combination of structural predictions (ASA, RSA, SS, disorder, torsion) via NetSurfP, gapped dipeptide rates, and PSSM data from PSI-BLAST. 

...

| Parameter Name | Parameter Used |
|---|---|
| Number of layers | 4 |
| Number of neurons in three layers | 150 |
| Number of neurons in output layer | 2 |
| Activation function | sigmoid |
| Activation function at output layer | softmax |
| Optimizer | Adam |
| Learning rate | 0.001 |
| Objective / loss function | binary_crossentropy |
| Model checkpoint | val_accuracy |
| Reduce learning rate on plateau | Factor = 0.001 |
| Early stopping | patience = 5 |
| Dropout | 0.3 |
| Batch_size | 256 |
| Epochs | 400 |

Table 1. Selected DNN architectural parameters from Pakhrin et al.
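To make the shape of the network in Table 1 concrete, the following is a minimal pure-Python sketch of a forward pass through three sigmoid hidden layers of 150 units and a 2-unit softmax output. It reads "Number of layers = 4" as three hidden layers plus the output layer, which is an interpretation of the table. `INPUT_DIM` is a hypothetical placeholder, since the feature-vector length is not stated in this report, and the weights are random rather than trained. Dropout is omitted because it only applies during training.

```python
import math
import random

# Values copied from Table 1.
HIDDEN_LAYERS = 3   # assumed: "Number of layers = 4" = 3 hidden + 1 output
HIDDEN_UNITS = 150
OUTPUT_UNITS = 2
INPUT_DIM = 64      # hypothetical; the real feature length is not given here


def layer_sizes(input_dim):
    """Return [input, hidden..., output] widths for the Table 1 network."""
    return [input_dim] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [OUTPUT_UNITS]


def count_parameters(input_dim):
    """Count weights plus biases over all dense layers."""
    sizes = layer_sizes(input_dim)
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_weights(input_dim, seed=0):
    """Random (untrained) weights standing in for a trained model."""
    rng = random.Random(seed)
    sizes = layer_sizes(input_dim)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    largest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers, features):
    """Apply sigmoid hidden layers, then a softmax output layer."""
    x = list(features)
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        x = softmax(z) if i == last else [sigmoid(v) for v in z]
    return x


layers = init_weights(INPUT_DIM)
probs = forward(layers, [0.5] * INPUT_DIM)
print(count_parameters(INPUT_DIM), probs)
```

The two output values are class probabilities that sum to 1; which index corresponds to the glycosylated class depends on how labels were encoded during training and is not specified here.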

Inputs

  • --fasta-file (str)

    • Input FASTA file of sequence of interest

Outputs

  • dngp-report.csv

    • CSV file containing positions of glycosylation motifs within input sequence and prediction of glycosylation propensity.
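As a rough illustration of how the report might be consumed downstream, the sketch below filters report rows by a score threshold. The column names (`position`, `motif`, `score`), the example values, and the 0.5 cutoff are all hypothetical and are not the tool's documented schema; pass the actual score column name from dngp-report.csv when using it.

```python
import csv
import io


def sites_above_threshold(csv_text, score_column, threshold=0.5):
    """Return the report rows whose score is at or above `threshold`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row[score_column]) >= threshold]


# Hypothetical report content: column names and values are made up
# for illustration only.
example_report = """position,motif,score
12,NGT,0.91
57,NLS,0.34
103,NAT,0.66
"""

likely_sites = sites_above_threshold(example_report, score_column="score")
for row in likely_sites:
    print(row["position"], row["motif"], row["score"])
```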

Notes

Feature selection

The best features for predicting glycosylation were structure-based, as determined by NetSurfP-3.0. The optimal window size for NetSurfP-3.0 predictions depended on the training set (N-GlycositeAtlas = 41, N-GlyDE = 25). Window size is defined as the number of flanking residues surrounding (and including) the central asparagine of the glycosylation motif (N-X-[S/T]). Because the optimum shifts with the training data, there is potential for refinement via alternative structural prediction tools (e.g., Rosetta) and for further optimization of window size as additional data become available.
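The motif and window definitions above can be sketched in a few lines of Python. This is an illustration only, not the workflow's code: X is taken to be any residue, as written in N-X-[S/T] (excluding proline at X is a common convention but is not stated in this report), and padding the termini with `-` is an assumption.

```python
import re

# Zero-width lookahead so that overlapping motifs are all reported.
SEQUON = re.compile(r"(?=N[A-Z][ST])")


def find_sequons(sequence):
    """Return 0-based indices of the asparagine in every N-X-[S/T] motif."""
    return [m.start() for m in SEQUON.finditer(sequence.upper())]


def window_around(sequence, center, window_size, pad="-"):
    """Return `window_size` residues centered on `center`, padded at the termini."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the asparagine is central")
    flank = window_size // 2
    padded = pad * flank + sequence.upper() + pad * flank
    # Left-padding by `flank` shifts every index by `flank`, so the window
    # that is centered on `center` starts at `center` in the padded string.
    return padded[center:center + window_size]


seq = "MKNGTLLANVSAQ"
for pos in find_sequons(seq):
    # Report 1-based positions alongside the 25-residue (N-GlyDE) window.
    print(pos + 1, window_around(seq, pos, 25))
```

With a window size of 41, each window holds 20 residues on either side of the asparagine; with 25, it holds 12.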

To best replicate the structural features encoded in the DNGP models, NetSurfP-3.0 had to be developed internally, since pre-trained models are not readily available. An internal NetSurfP-3.0 model was trained on publicly available data for use in the workflow; however, it is important to note that the original DNGP models were trained on data generated by NetSurfP-2.0. The main difference between the versions lies in the underlying architecture, which was changed for speed (NetSurfP-3.0 uses an ESM model to improve runtime performance). Other performance metrics were reported to show no significant differences.

...

DNGP was originally evaluated on two different datasets: data adapted from N-GlyDE and data from N-GlycositeAtlas. N-GlyDE had 2050 experimentally verified N-linked glycosylation sites from 832 human glycoproteins; 1030 sites in these proteins that were not verified were treated as negative sites. N-GlycositeAtlas is a database of N-linked glycosylation made up of 7204 glycoproteins, which yielded 9450 positive and 9450 negative sites after filtering and sampling. Negative sites for the N-GlycositeAtlas dataset were selected based on localization of glycoproteins in the nucleus and mitochondria, on the assumption that these proteins do not undergo N-linked glycosylation (Note: the team has acknowledged this caveat and decided to proceed with caution).

The two datasets used for training resulted in two different models of DNGP, essentially differentiated by training set (N-GlyDE vs N-GlycositeAtlas) and window-size (25 vs 41). Both models were used to evaluate internal benchmark sets and replication. 

...

Performance

Table 2 summarizes the accuracy, precision, recall, and specificity of the Cyrus glyco-predictor tool on different benchmarks. The N-GlyDE dataset consisted of 167 positive glycosylation sites and 280 negative sites, none of which were included in any of the model training sets. Glycoproteomic analysis of ACE2 constructs S19 and v2.4 provided mass spectrometry results for evaluating glycosylation of 10 and 8 N-X-[S/T] sites, respectively (S19 contains two designed glycosites at positions 660 and 718). Both datasets were used as additional validation sets, along with internal data for GALA glycosylation.

On average, glyco-predictor achieved 72% accuracy, 81% precision, 71% recall, and 83% specificity. While there may be room to improve this model, these benchmarks reflect acceptable performance for predicting human N-linked glycosylation.

| Benchmark | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) |
|---|---|---|---|---|
| N-GlyDE dataset | 79.4 | 67.0 | 88.6 | 73.9 |
| ACE2 (S19) | 70 | 60 | 75 | 66 |
| ACE2 (v2.4) | 75 | 100 | 60 | 100 |
| GALA | 67.5 | 97.9 | 64 | 93 |
| Average | 72 | 81 | 71 | 83 |

Table 2. Performance of the Cyrus glyco-predictor tool on different benchmarks.
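For reference, the four metrics in the performance table follow the standard confusion-matrix definitions. The sketch below states them in Python and also averages the per-benchmark rows. The confusion counts in `example` are hypothetical, chosen only to exercise the formulas, and treating the Average row as an unweighted mean over the four benchmarks is an assumption. Under that assumption the means come out within one point of the Average row.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and specificity as percentages."""
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fp + tn + fn),
        "precision": 100.0 * tp / (tp + fp),
        "recall": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
    }


def unweighted_mean(rows):
    """Average each metric, giving every benchmark equal weight."""
    return {key: sum(row[key] for row in rows) / len(rows) for key in rows[0]}


# Hypothetical confusion counts (not taken from any benchmark).
example = classification_metrics(tp=6, fp=2, tn=8, fn=4)
print(example)

# Per-benchmark values copied from the performance table above.
benchmarks = [
    {"accuracy": 79.4, "precision": 67.0, "recall": 88.6, "specificity": 73.9},   # N-GlyDE
    {"accuracy": 70.0, "precision": 60.0, "recall": 75.0, "specificity": 66.0},   # ACE2 (S19)
    {"accuracy": 75.0, "precision": 100.0, "recall": 60.0, "specificity": 100.0}, # ACE2 (v2.4)
    {"accuracy": 67.5, "precision": 97.9, "recall": 64.0, "specificity": 93.0},   # GALA
]
means = unweighted_mean(benchmarks)
print({key: round(value, 1) for key, value in means.items()})
```

Note that an unweighted mean gives the 8-site ACE2 (v2.4) benchmark the same influence as the 447-site N-GlyDE set; a site-weighted average would be dominated by N-GlyDE.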

Future Improvements

The primary areas for potential improvement involve changes to the input features used for training. As mentioned, structure-based features predicted by NetSurfP-3.0 could be benchmarked against Rosetta-calculated features. Doing so would enable greater control over the desired parameters and introduce more fine-grained features into the training data. Currently, NetSurfP-3.0 is used to predict secondary structure (q3), torsion angles, accessible surface area, and disorder. RMSD, B-factor, or alternative measures of flexibility could be introduced as additional features. Additionally, the window size for the flanking residues surrounding the sequon asparagine could be re-evaluated and screened to determine whether there is room for further optimization.

Expansion of the model domain from human to mammalian N-linked glycosylation may help with generalizing glycosylation design evaluation for a variety of expression systems and for assessing viability for animal models downstream. 

Argo Workflow Usage

Note

In most cases, use the Cyrus Engine API described above instead.

Argo Submission

Code Block
argo submit workflow.yaml -p fasta="gcspath/example.fasta"

1. Preparation

Upload the FASTA file to the default Argo bucket on GCS in a relevant project.

2. Feature Extraction

Input = fasta

Output = CSV of DNGP features

Steps:

PSSM generation with BLAST

Secondary structure prediction with NetSurfP-3.0

Gapped Dipeptide Rate calculations

Data formatting

3. Prediction

Input = DNGP feature CSV

Output = report of N-linked glycosylation predictions
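The gapped dipeptide step listed under Feature Extraction can be illustrated as follows. This is a generic definition (the fraction of residue pairs (a, b) in which b follows a after `gap` skipped residues) and is an assumption; the exact formulation, normalization, and gap values used by DNGP and by the workflow may differ.

```python
from collections import Counter


def gapped_dipeptide_rates(window, gap):
    """Rate of each residue pair (a, b) where b follows a after `gap` skipped residues."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    pairs = [(window[i], window[i + gap + 1]) for i in range(len(window) - gap - 1)]
    if not pairs:
        return {}  # window too short for this gap
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}


# Toy window; with gap=1 the pairs are (N,T), (G,N), (T,G), (N,T).
rates = gapped_dipeptide_rates("NGTNGT", gap=1)
for pair, rate in sorted(rates.items()):
    print(pair, rate)
```

In practice this would be applied to each fixed-size window around a candidate asparagine, for one or more gap values.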

...

References

Pakhrin et al. DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction. Molecules. 2021; 26(23):7314. https://doi.org/10.3390/molecules26237314

DeepNGlyPred GitHub Repository

NetSurfP-3.0 GitHub Repository