
Tolerance Identification

Owned by Ben Baker

Last updated: Nov 09, 2023 by Aaron Aguhob

The Tolerance Identification API finds the top N closest N-mers (by BLOSUM62 score) in the human proteome for each N-mer of a given protein sequence.

1 Quickstart
2 Inputs
3 Options
4 Outputs
5 Notes
- 5.1 Input proteome

Quickstart

Get the top 10 closest 9-mers:

cyrus engine submit tolerance-identification NLYIQWLKDGGPSSGRPPPS --top-n 10

Get the top 5 closest 9-mers and 15-mers:

cyrus engine submit tolerance-identification NLYIQWLKDGGPSSGRPPPS --top-n 5 --nmer-sizes 9,15

Inputs

--sequence (str)
- Input protein sequence to compare against

Options

--top-n (int)
- Collect the top N matches
- default = 20
--nmer-sizes (str)
- N-mer size(s) to run on, as a comma-separated string (e.g. 9,10,11,12)
- default = 9
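
To illustrate how --nmer-sizes relates to the query sequence, the sketch below parses a comma-separated size string and lists every window of each size. It is not the service's code: the function names parse_nmer_sizes and enumerate_nmers are hypothetical, and it assumes resnum refers to the 1-indexed start of each window.

```python
from typing import Iterator, List, Tuple


def parse_nmer_sizes(value: str) -> List[int]:
    """Turn a comma-separated string such as "9,15" into [9, 15]."""
    return [int(part) for part in value.split(",") if part.strip()]


def enumerate_nmers(sequence: str, sizes: List[int]) -> Iterator[Tuple[int, int, str]]:
    """Yield (nmer_size, resnum, nmer) for every window of each size.

    Assumption: resnum is the 1-indexed position of the window's first residue.
    """
    for size in sizes:
        # A sequence shorter than `size` yields no windows for that size.
        for start in range(len(sequence) - size + 1):
            yield size, start + 1, sequence[start:start + size]


if __name__ == "__main__":
    query = "NLYIQWLKDGGPSSGRPPPS"
    for size, resnum, nmer in enumerate_nmers(query, parse_nmer_sizes("9,15")):
        print(size, resnum, nmer)
```

For the 20-residue Quickstart sequence this gives 12 windows of size 9 and 6 windows of size 15, which is the number of query positions one would expect per N-mer size under the assumption above.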

Outputs

out.csv
- CSV file containing the following columns
  - nmer_size - size of this nmer
  - resnum - residue number (1 indexed) of the nmer position in the query sequence
  - query_seq - query sequence
  - matchrank - rank (0 = best, N = worst) among the top-N closest (by BLOSUM62) N-mers to the query
  - matchscore - BLOSUM62 score of the result against the query sequence
  - matchseq - the matching N-mer found in the human proteome
  - matchscore/max_score - matchscore divided by the score of a 100% match (the normalized BLOSUM62 score)
out.json
- The same data as out.csv, in JSON format
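
As a hedged example of working with out.csv, the sketch below keeps the best match (matchrank 0) for each query position and sorts those rows by the normalized score, so the most human-like windows come first. The function name most_human_like is hypothetical, and the code assumes a comma-delimited file whose header row uses exactly the column names listed above.

```python
import csv
from typing import Dict, Iterable, List


def most_human_like(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return the rank-0 rows, highest normalized score first.

    `lines` is any iterable of CSV lines, e.g. an open file handle.
    Assumption: the header matches the column names documented for out.csv.
    """
    reader = csv.DictReader(lines)
    best = [row for row in reader if int(row["matchrank"]) == 0]
    best.sort(key=lambda row: float(row["matchscore/max_score"]), reverse=True)
    return best


if __name__ == "__main__":
    with open("out.csv", newline="") as handle:
        for row in most_human_like(handle):
            print(row["nmer_size"], row["resnum"], row["matchseq"],
                  row["matchscore/max_score"])
```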

Notes

Running this protocol takes between 4 and 5 GB of memory per CPU.

Input proteome

The input proteome file was taken from https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.abinitio.fa.gz
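
The following is a minimal sketch of how a peptide FASTA file like this one could be scanned for the top-N closest N-mers. It is an illustration under stated assumptions, not the service's implementation: read_fasta, top_n_matches, and identity_score are hypothetical names, and identity_score (a count of identical positions) is a stand-in for a real BLOSUM62 lookup.

```python
import heapq
from typing import Callable, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield each sequence from FASTA-formatted lines, dropping the headers."""
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def identity_score(a: str, b: str) -> int:
    """Placeholder score: number of identical positions (NOT BLOSUM62)."""
    return sum(x == y for x, y in zip(a, b))


def top_n_matches(
    query_nmer: str,
    proteome_seqs: Iterable[str],
    score: Callable[[str, str], int] = identity_score,
    top_n: int = 20,
) -> List[Tuple[int, int, str]]:
    """Return (matchrank, matchscore, matchseq) for the top_n best N-mers."""
    k = len(query_nmer)
    # Collect the distinct k-length windows across all proteome sequences.
    candidates = {seq[i:i + k] for seq in proteome_seqs
                  for i in range(len(seq) - k + 1)}
    # Tuples compare by score first, then by sequence, so ties are deterministic.
    scored = ((score(query_nmer, cand), cand) for cand in candidates)
    best = heapq.nlargest(top_n, scored)
    return [(rank, s, cand) for rank, (s, cand) in enumerate(best)]
```

To read the gzipped file directly, the sequences can be streamed with gzip.open(path, "rt") and passed to read_fasta. Holding every distinct window in a set is simple but memory-hungry for a whole proteome, so a production version would likely score windows as they stream by.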