Note
Please note that NetMHCPan and NetMHCIIPan modes are only available for users with a valid DTU Health Tech license agreement.

The epitope scan API runs the Rosetta MHC II epitope prediction algorithm, as well as the NetMHCPan and NetMHCIIPan prediction algorithms (provided that the user has a license to use those tools), on an input protein structure or sequence. This API uses a machine learning model to predict epitopes based entirely on the sequence of the protein. Structure input is provided as a convenience, but the API produces identical results regardless of whether a sequence or structure is used as input.

Currently, API outputs and some inputs are specific to the algorithm chosen for prediction.

Further background information and advice on interpreting the results of the epitope scan tool can be found in the Cyrus Bench Documentation.

Supported Alleles

The epitope scan predicts the immunogenicity of the protein with respect to the following alleles:

Note
The Rosetta and NetMHCIIPan models use slightly different names for some alleles.

Rosetta Epitope Scan Allele List (MHC Class II):

File: rosetta_alleles.txt

NetMHCIIPan Allele List (MHC Class II):

File: mhcii.txt

NetMHCPan Allele List (MHC Class I):

File: mhci.txt

Default Allele Lists

If you do not specify an allele list when using either Rosetta Epitope Scan or NetMHCIIPan, a default set of alleles will be used. The default allele lists are below. There is no default allele list for NetMHCPan.

Rosetta Default Allele List:

File: rosetta_default_epitopes.txt

NetMHCIIPan Default Allele List:

File: netmhciipan_default_epitopes.txt

Inputs

You must specify one of a PDB file, a sequence, or a FASTA file. If a native sequence is supplied, epitope-scan is also run on it, and its raw results plus a delta file are added to the outputs.

Input PDB file – a PDB file
- CLI argument: --pdb-file input.pdb
- Python submit() argument: pdb_path="input.pdb"
- Do not include nonprotein residues.
- Do not include multimodel (NMR-sourced) PDBs.
Sequence – a protein sequence
- CLI argument: --sequence NLYIQWLKDGGPSSGRPPPS
- Python submit() argument: sequence="NLYIQWLKDGGPSSGRPPPS"
Native sequence – a native protein sequence
- CLI argument: --native-sequence NLYIQWLKDGGPSSGRPPPS
- Python submit() argument: native_sequence="NLYIQWLKDGGPSSGRPPPS"
FASTA – a FASTA file or multiple FASTA files
- CLI argument: --fasta-file sequence.fasta
  - --fasta-file *.fasta
  - --fasta-file fastas.zip
- Python argument: fasta_file="input.fasta"

Note
When using FASTA file input, you must strictly follow the guidelines described here: https://services.healthtech.dtu.dk/examples/example.fasta.html
- All sequences must have a header with a name for the sequence.
- There must be no space between the > character starting the header and the sequence name.
- The file must contain no blank lines.

Quickstart

Command Line Examples

Run Rosetta epitope scan on an input sequence:

Code Block
cyrus engine submit epitope-scan NLYIQWLKDGGPSSGRPPPS --allele-list-file alleles.txt

Run epitope scan on an input PDB using NetMHCPan:

Code Block
cyrus engine submit epitope-scan input.pdb --mode=netmhcpan --allele-list-file alleles.txt

Run NetMHCIIPan on an input PDB using only the alleles in the specified file. Alleles should be listed one per line:

Code Block
cyrus engine submit epitope-scan input.pdb --mode=netmhciipan --allele-list-file alleles.txt

(Requires an up-to-date cyrus engine. To update: brew update && brew upgrade cyrusbiotechnology/tap/engine)

Example human allele file using Rosetta allele names:

File: alleles.txt

HLA-DRB10101
HLA-DRB10301
HLA-DRB10401
HLA-DRB10701
HLA-DRB10802
HLA-DRB10901
HLA-DRB11101
HLA-DRB11302
HLA-DRB11501
HLA-DRB30101
HLA-DRB40101
HLA-DRB50101
HLA-DQA10501-DQB10301
HLA-DQA10301-DQB10302

Run NetMHCIIPan on an input FASTA with default allele list:

Code Block
cyrus engine submit epitope-scan --fasta-file input.fasta --mode=netmhciipan

Python Examples

Note
When using the Python library, the alleles you are interested in must be specified explicitly, chosen from the lists in the introduction of this document. This behavior differs from the command line client, which defaults to searching for all alleles.

...

Code Block

from engine.epitope_scan.client import EpitopeScanClient

# Submit a PDB file and name the alleles to scan against explicitly.
client = EpitopeScanClient()
job_id = client.submit(pdb_path="input.pdb", mhc_list=["H-2-IAb", "HLA-DRB10101"])
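
Because the Python library needs the alleles listed explicitly, it can be convenient to reuse the same one-per-line allele file that the command line client accepts. The following is a minimal sketch of that idea, not part of the Cyrus library: read_allele_file and submit_with_allele_file are hypothetical helper names, and the only library call assumed is the submit(pdb_path=..., mhc_list=...) form shown above.

```python
from pathlib import Path
from typing import List


def read_allele_file(path: str) -> List[str]:
    """Return allele names from a text file that lists one allele per line."""
    lines = Path(path).read_text().splitlines()
    # Ignore blank lines and strip surrounding whitespace from each name.
    return [line.strip() for line in lines if line.strip()]


def submit_with_allele_file(client, pdb_path: str, allele_file: str):
    """Hypothetical helper: submit a PDB with alleles read from a file.

    `client` is expected to behave like EpitopeScanClient, i.e. to offer
    submit(pdb_path=..., mhc_list=...) as in the example above.
    """
    alleles = read_allele_file(allele_file)
    return client.submit(pdb_path=pdb_path, mhc_list=alleles)
```

With the library installed, usage would presumably look like submit_with_allele_file(EpitopeScanClient(), "input.pdb", "alleles.txt").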

Inputs

You must specify one of a PDB file, a sequence, or a FASTA file.

--allele-list-file
- File containing a list of MHC Class I or II allele names, 1 per line
- See Supported Alleles for correct naming conventions/syntax for each API mode
- See Default Allele Lists for mode specific default alleles
--pdb-file (str)
- Input PDB file – a PDB file
- CLI argument: --pdb-file input.pdb
- Python submit() argument: pdb_path="input.pdb"
- Do not include nonprotein residues.
- Do not include multimodel (NMR-sourced) PDBs.
--sequence (str)
- Sequence – a protein sequence
- CLI argument: --sequence NLYIQWLKDGGPSSGRPPPS
- Python submit() argument: sequence="NLYIQWLKDGGPSSGRPPPS"
--fasta-file (stringArray)
- A FASTA file with one or more sequences, or multiple FASTA files
- CLI argument: --fasta-file sequence.fasta
  - --fasta-file *.fasta
  - --fasta-file fastas.zip
- Python argument: fasta_file="input.fasta"

Note

When using FASTA file input, you must strictly follow the guidelines described here: https://services.healthtech.dtu.dk/examples/example.fasta.html

- All sequences must have a header with a name for the sequence.
- There must be no space between the > character starting the header and the sequence name.
- The file must contain no blank lines.
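
The three FASTA rules above can be checked locally before submitting a job. The sketch below is only an illustration of those rules: check_fasta is a hypothetical helper, not part of the Cyrus tooling, and it reads the "no space" rule as "no whitespace directly after the > character", which is an assumption.

```python
from typing import List


def check_fasta(text: str) -> List[str]:
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    # splitlines() does not report a single trailing newline as a blank line.
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        problems.append("line 1: file must start with a '>' header")
    for number, line in enumerate(lines, start=1):
        if line.strip() == "":
            problems.append(f"line {number}: blank line")
        elif line.startswith(">"):
            name = line[1:]
            if name.strip() == "":
                problems.append(f"line {number}: header has no sequence name")
            elif name[0].isspace():
                problems.append(f"line {number}: space after '>'")
    return problems


# A well-formed two-sequence file yields no problems.
print(check_fasta(">PEPTIDE\nPEPTIDEPEPTIDE\n>PRTEIN\nPRTEINPRTEIN\n"))
```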

Options

--mode
- Prediction model to use, must be one of rosetta, netmhcpan or netmhciipan
- default = rosetta
--native-sequence
- A native protein sequence
- If supplied, epitope-scan will be run on it as well, and the raw results as well as a delta file will be added to the outputs.
- CLI argument: --native-sequence NLYIQWLKDGGPSSGRPPPS
- Python submit() argument: native_sequence="NLYIQWLKDGGPSSGRPPPS"
--weak-binder-threshold
- The percentile cutoff for distinguishing weak and strong binders in NetMHCPan and NetMHCIIPan
- default = 5

Outputs

Rosetta Mode Epitope Scan Outputs

The API with --mode=rosetta returns a CSV file (rosetta_epitope_scan.csv) with the following fields:

begin_seqpos – The start of the sequence window involved in the prediction
epitope_seq – The sequence of the epitope involved in the prediction
allele – The MHC allele for which binding affinity is being predicted
IC50_nM – The predicted IC50 in nanomolar (nM)
rank_percentage – Primary epitope score metric. The epitope is in the top n% of binders measured against a random background. A lower number means a more likely epitope for that allele.
score – The raw score of the prediction model; lower is better. Used to calculate the primary normalized score metric, rank_percentage
genome_sequence – Whether the epitope occurs in the human reference genome
known – Whether the sequence exists in the IEDB as a known T-cell activating epitope
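
As a rough illustration of working with these fields, the sketch below finds, for each allele, the row with the lowest rank_percentage (the most likely epitope for that allele). It assumes the CSV has a header row using exactly the field names listed above; best_epitope_per_allele is a hypothetical helper, and the example rows are made up for illustration.

```python
import csv
import io
from typing import Dict, TextIO


def best_epitope_per_allele(handle: TextIO) -> Dict[str, dict]:
    """Map each allele to its row with the lowest rank_percentage."""
    best: Dict[str, dict] = {}
    for row in csv.DictReader(handle):
        allele = row["allele"]
        rank = float(row["rank_percentage"])
        if allele not in best or rank < float(best[allele]["rank_percentage"]):
            best[allele] = row
    return best


# Made-up rows using the documented column names (values are illustrative only).
example = io.StringIO(
    "begin_seqpos,epitope_seq,allele,IC50_nM,rank_percentage,score,genome_sequence,known\n"
    "1,NLYIQWLKD,HLA-DRB10101,120.0,4.0,-1.2,False,False\n"
    "2,LYIQWLKDG,HLA-DRB10101,900.0,35.0,0.3,False,False\n"
    "1,NLYIQWLKD,HLA-DRB10301,50.0,1.5,-2.0,False,False\n"
)
for allele, row in best_epitope_per_allele(example).items():
    print(allele, row["epitope_seq"], row["rank_percentage"])
```

For a real run, pass an open file handle for rosetta_epitope_scan.csv instead of the in-memory example.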

NetMHCIIPan Mode Epitope Scan Outputs

The API with --mode=netmhciipan returns a TSV file (NetMHCIIPan_results.tsv) with the following fields:

Pos - starting position in sequence of peptide window
Peptide - 15mer peptide
ID - sequence ID
Weighted_NB - binding score weighted by allele population weights (for predicted binding alleles)
For every allele (<allele>) provided in --allele-list-file or in default allele sets:
- <allele>-Core - predicted core binding register
- <allele>-Score - Eluted ligand prediction score
- <allele>-Rank - percentile rank of eluted ligand prediction score
- <allele>-Score_BA - predicted binding affinity in log-scale
- <allele>-nM - predicted binding affinity (IC50, in nanomolar)
- <allele>-Rank_BA - percentile rank of predicted affinity compared to a set of 100,000 random natural peptides
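
Because every allele contributes its own set of columns, the TSV is wide, and it is often easier to analyze as one record per peptide and allele. The sketch below shows one way to reshape it, assuming the column names follow the <allele>-<field> pattern described above. to_long_format is a hypothetical helper and the example row is made up.

```python
import csv
import io
from typing import Dict, List, TextIO

SHARED_COLUMNS = {"Pos", "Peptide", "ID", "Weighted_NB"}


def to_long_format(handle: TextIO) -> List[Dict[str, str]]:
    """Return one record per (peptide, allele) from the wide TSV layout."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        per_allele: Dict[str, Dict[str, str]] = {}
        for column, value in row.items():
            if column is None or column in SHARED_COLUMNS or "-" not in column:
                continue
            # Allele names can contain hyphens, so split on the last one only.
            allele, field = column.rsplit("-", 1)
            per_allele.setdefault(allele, {})[field] = value
        for allele, fields in per_allele.items():
            record = {"Pos": row["Pos"], "Peptide": row["Peptide"], "allele": allele}
            record.update(fields)
            records.append(record)
    return records


# Made-up single-row example with two alleles (values are illustrative only).
example = io.StringIO(
    "Pos\tPeptide\tID\tWeighted_NB\tDRB1_0101-Rank\tHLA-DQA10501-DQB10301-Rank\n"
    "1\tNLYIQWLKDGGPSSG\tseq1\t0.0\t3.2\t47.0\n"
)
for record in to_long_format(example):
    print(record)
```

Splitting on the last hyphen matters because names such as HLA-DQA10501-DQB10301 contain hyphens themselves, while the field suffixes (Core, Score, Rank, Score_BA, nM, Rank_BA) do not.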

Outputs with `--native-sequence`

delta_results.csv
- If --native-sequence is provided, the API will return the results for both the input sequence and the native sequence provided, as well as a file named delta_results.csv.
- Contains the scores of the design input minus the scores of the native input.
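
The "design minus native" arithmetic can be illustrated with a small sketch. The exact columns and row alignment of delta_results.csv are not described here, so the following is an assumption for illustration only: scores are keyed by (begin_seqpos, allele), and the two sequences are assumed to have the same length so that positions correspond. score_deltas is a hypothetical helper and the numbers are made up.

```python
from typing import Dict, Tuple

Key = Tuple[int, str]  # (begin_seqpos, allele)


def score_deltas(design: Dict[Key, float], native: Dict[Key, float]) -> Dict[Key, float]:
    """Return design minus native for every key present in both inputs."""
    return {key: design[key] - native[key] for key in design if key in native}


# Illustrative rank_percentage values only; not real predictions.
design = {(1, "HLA-DRB10101"): 12.0, (2, "HLA-DRB10101"): 40.0}
native = {(1, "HLA-DRB10101"): 4.0, (2, "HLA-DRB10101"): 40.0}
deltas = score_deltas(design, native)
print(deltas)
```

Since a lower rank_percentage means a more likely epitope, a positive delta on that metric would indicate that the design window is a less likely epitope than the corresponding native window.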

Notes

NetMHCIIPan Eluted Ligand (EL) vs. Binding Affinity (BA)

NetMHCIIPan has two modes: EL and BA. By default, NetMHCIIPan only runs in EL mode, but the Cyrus API has activated the flag to output results from both modes.
EL (eluted ligand) data is the result of a peptide being naturally processed and eluted from the MHC complex (so no binding IC50 is available, but the data does tell you that the peptide bound)
BA (binding affinity) data is the measured IC50
EL contains a lot of data for self proteins
BA contains a lot of data for non-self proteins (bacteria, viruses)
BA predicts if a peptide will bind whereas EL tells us whether a peptide is likely to be naturally processed (and indirectly that it bound)
EL can gain information from varying lengths of peptides - which is good for MHCII
“Pan” means the network is universal (so no need to train individual networks)
NetMHCIIPan 4.1: “The output of the model is a prediction score for the likelihood of a peptide to be naturally presented by [an] MHC II receptor of choice. The output also includes %rank score, which normalizes prediction score by comparing to prediction of a set of random peptides. Optionally, the model also outputs BA prediction and %rank scores.”
Consider:
- EL by nature will “miss” epitopes (as with MAPPs, a peptide may bind but not be detected)
- BA will overpredict (not all binders will be immunogenic)
Key Points
1. BA models are trained on binding affinity data and report predicted binding
2. EL models are trained on eluted ligands (bound, presented by MHC, eluted) and report whether a peptide is likely to be naturally processed
3. EL will under-predict; BA will over-predict
4. EL is best used for "key epitopes" (but may still miss some)
5. BA is best used for optimal coverage of possible epitopes

Interpreting epitope predictions

To replicate CAD-style "best practices", API users could:

1. For each epitope, calculate n_hits, the number of alleles where rank_percentage < 10
2. Sort the epitopes by n_hits
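
The two steps above can be sketched against the Rosetta mode CSV as follows. This is a minimal illustration, not an official implementation: rank_by_n_hits is a hypothetical helper, an epitope is identified by its (begin_seqpos, epitope_seq) pair, and sorting from most to fewest hits is an assumption, since the direction is not stated. The example rows are made up.

```python
import csv
import io
from collections import Counter
from typing import List, TextIO, Tuple

Epitope = Tuple[int, str]  # (begin_seqpos, epitope_seq)


def rank_by_n_hits(handle: TextIO, cutoff: float = 10.0) -> List[Tuple[Epitope, int]]:
    """Count alleles with rank_percentage < cutoff per epitope, most hits first."""
    hits: Counter = Counter()
    for row in csv.DictReader(handle):
        key = (int(row["begin_seqpos"]), row["epitope_seq"])
        # Adding 0 keeps epitopes with no hits in the result.
        hits[key] += 1 if float(row["rank_percentage"]) < cutoff else 0
    # Sort by descending n_hits, then by sequence position for stable ties.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Made-up rows with only the columns this sketch needs.
example = io.StringIO(
    "begin_seqpos,epitope_seq,allele,rank_percentage\n"
    "1,NLYIQWLKD,HLA-DRB10101,4.0\n"
    "1,NLYIQWLKD,HLA-DRB10301,8.5\n"
    "2,LYIQWLKDG,HLA-DRB10101,35.0\n"
    "2,LYIQWLKDG,HLA-DRB10301,9.0\n"
    "3,YIQWLKDGG,HLA-DRB10101,50.0\n"
)
for epitope, n_hits in rank_by_n_hits(example):
    print(epitope, n_hits)
```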

Running NetMHCIIpan locally

NetMHCIIpan is installed on basil. Run it there with no arguments to get help:

Code Block
/home/indigo/exe/NetMHCIIpan-4.1/netMHCIIpan-4.1/netMHCIIpan

For each run, you can run against multiple query sequences using a multi-fasta, but only one allele at a time:

Code Block
/home/indigo/exe/NetMHCIIpan-4.1/netMHCIIpan-4.1/netMHCIIpan -f test.fasta -a DRB1_0701

…where test.fasta looks like this:

Code Block
cat test.fasta
>PEPTIDE
PEPTIDEPEPTIDEPEPTIDEPEPTIDE
>PRTEIN
PRTEINPRTEIPRTEINNPRTEINPRTEIN

Here is a list of alleles that matches the older Cyrus Human14 allele set:

Code Block
DRB1_0101
DRB1_0301
DRB1_0401
DRB1_0701
DRB1_0802
DRB1_0901
DRB1_1101
DRB1_1302
DRB1_1501
DRB3_0101
DRB4_0101
DRB5_0101
HLA-DQA10501-DQB10301
HLA-DQA10301-DQB10302

So let’s say the above list was saved as a file called “human14.netmhc2alleles.list”. You could run for multiple sequences across multiple alleles like this:

...