To improve the success of, and confidence in, designed glycosylation sites, Cyrus has implemented glyco-predictor, an in-house version of DeepNGlyPred (DNGP), a deep neural network (DNN) tool for sequence-based prediction of human N-linked glycosylation. This report outlines the key architecture parameters and features, the benchmark datasets, and the overall performance evaluation, and gives an overview of the workflow and how to use the tool.

The glycosylation prediction tool is implemented as an Argo workflow and an API. The Argo workflow can be found at https://github.com/CyrusBiotechnology/glyco-predictor

Engine API usage

The best way to run the glycosylation prediction tool is via the Cyrus Engine API. Given a protein sequence in a FASTA file, the API is run like this:

cyrus engine submit glyco-predictor input.fasta

The API takes approximately an hour to run and generates a single report: a CSV file listing the N-linked glycosylation predictions.
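
Because a run takes about an hour, it can be worth checking the input locally before submitting. The sketch below is illustrative only and is not part of the tool: it parses FASTA text, reports any letters outside the 20 standard amino acids, and builds the submit command shown above as an argument list. The helper names (read_fasta, invalid_residues, build_submit_command) are hypothetical, and whether the API accepts multi-record files or non-standard residues is not stated in this report.

```python
# Hypothetical pre-submission check; none of these helpers belong to the tool.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines may be wrapped, so append to the current record.
            records[header] += line.upper()
    return records


def invalid_residues(sequence):
    """Return the sorted letters that are not standard amino acids."""
    return sorted(set(sequence) - STANDARD_AA)


def build_submit_command(fasta_path):
    """Build the submit command from this report as an argument list."""
    return ["cyrus", "engine", "submit", "glyco-predictor", str(fasta_path)]
```

The argument list can be passed to subprocess.run once the check passes; the command itself is taken verbatim from the usage line above.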

Tool Implementation Details

Architecture and Features

DNGP architecture parameters were optimized via grid search with three-fold cross-validation on an independent training set. Final DNN parameters selected by Pakhrin et al. are summarized in Table 1. Optimal features for classification were a combination of structural predictions (ASA, RSA, SS, disorder, torsion) via NetSurfP, gapped dipeptide rates, and PSSM data from PSI-BLAST.

Parameter Name	Parameter Used
Number of layers	4
Number of neurons in three layers	150
Number of neurons in output layer	2
Activation Function	sigmoid
Activation Function at output layer	softmax
Optimizer	Adam
Learning rate	0.001
Objective / loss function	binary_crossentropy
Model checkpoint	val_accuracy
Reduce learning rate on plateau	Factor = 0.001
Early stopping	patience = 5
Dropout	0.3
Batch_size	256
Epochs	400

Table 1. Selected DNN architectural parameters from Pakhrin et al.
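
To make the layer layout in Table 1 concrete, the following is a minimal forward-pass sketch in plain Python. It assumes the four layers are three sigmoid layers of 150 neurons followed by a two-neuron softmax output; that reading of the table is an assumption. The input feature count (N_FEATURES) and the random weights are hypothetical placeholders. Dropout and the training settings (Adam, learning-rate reduction, early stopping) apply only during training and are omitted here.

```python
import math
import random

HIDDEN_UNITS = 150  # Table 1: neurons in each of the three layers
OUTPUT_UNITS = 2    # Table 1: neurons in the output layer
N_FEATURES = 8      # hypothetical; the real feature count is not given here


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Create placeholder weights (n_out rows of n_in) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, layer):
    """Compute the pre-activation output of one fully connected layer."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def build_network(n_features, seed=0):
    rng = random.Random(seed)
    sizes = [n_features, HIDDEN_UNITS, HIDDEN_UNITS, HIDDEN_UNITS, OUTPUT_UNITS]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(network, features):
    """Return two class probabilities for one feature vector."""
    activations = list(features)
    for layer in network[:-1]:
        activations = [sigmoid(z) for z in dense(activations, layer)]
    return softmax(dense(activations, network[-1]))
```

Calling forward(build_network(N_FEATURES), [0.5] * N_FEATURES) returns two probabilities that sum to 1. With untrained placeholder weights the values carry no meaning; the sketch only shows the shapes and activations.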

The most predictive features were the structure-based ones computed by NetSurfP-3.0. The optimal window size for these predictions depended on the training set (N-GlycositeAtlas = 41, N-GlyDE = 25). Window size is defined as the total number of residues in the window: the central asparagine of the glycosylation motif (N-X-[S/T]) plus its flanking residues. This dependence suggests potential for refinement via alternative structural prediction tools (e.g., Rosetta) and for further optimization of window size with additional data.
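
The window definition above can be sketched as follows. This is an illustration, not the DNGP code: it assumes the standard sequon definition in which X is any residue except proline, and it pads windows that run past a terminus with "-". How DNGP treats termini is not stated in this report, and the function names are hypothetical.

```python
import re

# Lookahead so that overlapping motifs (for example "NNSS") are all found.
# Assumption: X may be any residue except proline.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 0-based indices of asparagines that start an N-X-[S/T] motif."""
    return [match.start() for match in SEQUON.finditer(sequence)]


def extract_window(sequence, center, window_size, pad="-"):
    """Return window_size residues centered on sequence[center].

    The window includes the central residue, so window_size must be odd
    (41 means 20 flanking residues on each side). Positions beyond either
    terminus are filled with the pad character (an assumption).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    flank = window_size // 2
    left = sequence[max(0, center - flank):center]
    right = sequence[center + 1:center + 1 + flank]
    return (pad * (flank - len(left)) + left + sequence[center]
            + right + pad * (flank - len(right)))
```

For example, extract_window("NGT", 0, 5) pads the missing N-terminal flank and returns "--NGT".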

To replicate the structural features encoded in the DNGP models as closely as possible, NetSurfP-3.0 had to be developed internally because pre-trained models are not readily available. Using publicly available training data, an internal NetSurfP-3.0 model was implemented for workflow use; note, however, that the original DNGP models were trained on data generated by NetSurfP-2.0. The main difference between the versions lies in the underlying architecture, which was changed for speed (NetSurfP-3.0 uses an ESM model to improve runtime performance); other performance metrics were reported to show no significant differences.

Benchmarks

DNGP was originally evaluated on two different datasets: data adapted from N-GlyDE and data from N-GlycositeAtlas. N-GlyDE had 2050 experimentally verified N-linked glycosylation sites from 832 human glycoproteins; 1030 sites in these proteins that were not experimentally verified were treated as negative sites. N-GlycositeAtlas is a database of N-linked glycosylation covering 7204 glycoproteins, which yielded 9450 positive and 9450 negative sites after filtering and sampling. Negative sites for the N-GlycositeAtlas dataset were selected from glycoproteins localized to the nucleus and mitochondria, on the assumption that these proteins do not undergo N-linked glycosylation (Note: the team has acknowledged this caveat and decided to proceed with caution).

The two training datasets resulted in two different DNGP models, differentiated by training set (N-GlyDE vs N-GlycositeAtlas) and window size (25 vs 41). Both models were used for replication and to evaluate the internal benchmark sets.

Cyrus glyco-predictor used both models and datasets as benchmarks, along with experimental results from internal programs, to evaluate feasibility of use. Overall, the N-GlycositeAtlas model (window size 41) outperformed the N-GlyDE model. Performance comparisons focused primarily on overall recall and precision.

Performance

Table 2 summarizes the accuracy, precision, recall, and specificity of the Cyrus glyco-predictor tool on different benchmarks. The N-GlyDE dataset consisted of 167 positive glycosylation sites and 280 negative sites, none of which were included in any of the model training sets. Glycoproteomic analysis of ACE2 constructs S19 and v2.4 provided mass spectrometry results for evaluating glycosylation of 10 and 8 N-X-[S/T] sites, respectively (S19 contains two designed glycosites at positions 660 and 718). Both ACE2 datasets were used as additional validation sets, along with internal data for GALA glycosylation.

On average, glyco-predictor achieved 72.9% accuracy, 81.1% precision, 71.9% recall, and 83.2% specificity. Although the model could likely be improved, these benchmarks reflect acceptable performance for predicting human N-linked glycosylation.

Benchmark	Accuracy (%)	Precision (%)	Recall (%)	Specificity (%)
N-GlyDE dataset	79.4	67.0	88.6	73.9
ACE2 (S19)	70	60	75	66
ACE2 (v2.4)	75	100	60	100
GALA	67.5	97.9	64	93
Average	72.9	81.1	71.9	83.2

Table 2. Performance of Cyrus glyco-predictor tool on different benchmarks.
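
For reference, the four metrics in Table 2 follow the standard confusion-matrix definitions, sketched below. The function name is hypothetical, and the counts in the usage note are made up for illustration; they are not taken from the benchmark data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, and specificity as percentages.

    tp/fp/tn/fn are the counts of true positives, false positives,
    true negatives, and false negatives. A metric whose denominator
    is zero is returned as NaN.
    """
    def pct(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "accuracy": pct(tp + tn, tp + fp + tn + fn),
        "precision": pct(tp, tp + fp),
        "recall": pct(tp, tp + fn),
        "specificity": pct(tn, tn + fp),
    }
```

With the hypothetical counts tp=3, fp=0, tn=3, fn=2, the function gives 75% accuracy, 100% precision, 60% recall, and 100% specificity. On benchmarks with only a handful of sites, a single changed prediction moves these values by many percentage points.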

Future Improvements

The primary areas for potential improvement involve changes to the input features used for training. As mentioned, the structure-based features predicted by NetSurfP-3.0 could be benchmarked against Rosetta-calculated features. Doing so would give greater control over the desired parameters and introduce more fine-grained features into the training data. Currently, NetSurfP-3.0 is used to predict secondary structure (q3), torsion angles, accessible surface area, and disorder. RMSD, B-factor, or alternative flexibility metrics could be introduced as additional features. Additionally, the window size for the flanking residues surrounding the sequon asparagine could be re-evaluated and screened to determine whether there is room for further optimization.

Expansion of the model domain from human to mammalian N-linked glycosylation may help with generalizing glycosylation design evaluation for a variety of expression systems and for assessing viability for animal models downstream.

Argo Workflow Usage

You should probably use the Cyrus Engine API described above.

Argo Submission

argo submit workflow.yaml -p fasta="gcspath/example.fasta"

1. Preparation

Upload the FASTA file to the default Argo bucket on GCS in a relevant project.

2. Feature Extraction

Input = FASTA file

Output = CSV of DNGP features

Steps:

PSSM generation with PSI-BLAST

Secondary structure prediction with NetSurfP-3.0

Gapped Dipeptide Rate calculations

Data formatting
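
As an illustration of the gapped dipeptide step, the sketch below uses one common formulation: for a gap g, count every residue pair separated by g intervening positions in the window, then divide by the number of such pairs. This formulation is an assumption; the exact definition used by DNGP is not given in this report, and the function name is hypothetical.

```python
from collections import Counter


def gapped_dipeptide_rates(window, gap):
    """Return the relative frequency of each residue pair at a given gap.

    A pair is (window[i], window[i + gap + 1]), so gap=0 gives ordinary
    adjacent dipeptides. Returns an empty dict if the window is too short.
    """
    pairs = [window[i] + window[i + gap + 1]
             for i in range(len(window) - gap - 1)]
    if not pairs:
        return {}
    counts = Counter(pairs)
    total = len(pairs)
    return {pair: count / total for pair, count in counts.items()}
```

For the window "ANAS" with gap=1, the pairs are "AA" and "NS", each with a rate of 0.5.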

3. Prediction

Input = DNGP feature CSV

Output = report of N-linked glycosylation predictions

Links

Pakhrin SC, Aoki-Kinoshita KF, Caragea D, KC DB. DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction. Molecules. 2021; 26(23):7314. https://doi.org/10.3390/molecules26237314

DeepNGlyPred Github Repository

NetSurfP-3.0 Github Repository

Cyrus Glycosylation Predictor

Glycosylation Prediction API