Topiary

Predict which peptides from protein sequences will be presented by MHC molecules, making them potential T-cell epitopes. Used in cancer immunotherapy research to find mutant peptides (neoantigens) that the immune system could target.

Installation

pip install topiary

Topiary 4.9.0 and later require mhctools>=3.7.0.

For variant annotation and gene lookups, download Ensembl reference data:

pyensembl install --release 110 --species human

Predicting MHC binding

The simplest use case: score a handful of peptides against one or more HLA alleles.

from topiary import TopiaryPredictor
from mhctools import NetMHCpan

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
)

df = predictor.predict_from_named_peptides({
    "peptide_1": "YLQLVFGIEV",
    "peptide_2": "LLFNILGGWV",
    "peptide_3": "GILGFVFTL",
})

The result is a DataFrame with one row per peptide-allele-kind combination. Each row has score (normalized, higher = better), value (raw prediction, e.g. IC50 nM), and percentile_rank (lower = better).

To scan full-length proteins with a sliding window instead:

df = predictor.predict_from_named_sequences({
    "BRAF_V600E": "MAALSGGGGG...LATEKSRWSG",
})

Multiple models can run together. The DataFrame will have rows for each model's prediction kinds:

from mhctools import NetMHCpan, MHCflurry

predictor = TopiaryPredictor(
    models=[NetMHCpan, MHCflurry],
    alleles=["HLA-A*02:01", "HLA-B*07:02"],
)

Filtering and ranking

Topiary has an expression language for filtering and ranking predictions. It works identically as Python objects and as CLI strings.

Filters

Keep only peptides that pass a threshold:

from topiary import TopiaryPredictor, Affinity, Presentation
from mhctools import NetMHCpan

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter_by=Affinity <= 500,                       # IC50 nM cutoff
)

# Or with presentation score:
predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter_by=(Affinity <= 500) | (Presentation.rank <= 2.0),  # keep if either passes
)

Prediction kinds and fields

Accessor	Aliases	What it measures
`Affinity`	`ba`, `aff`, `ic50`	Binding affinity (IC50 nM)
`Presentation`	`el`	Eluted ligand / presentation score
`Stability`		pMHC complex stability
`Processing`		Antigen processing / cleavage

Each kind has three fields:

.value — raw prediction (e.g. IC50 in nM). Default when no field is specified, so Affinity <= 500 means Affinity.value <= 500.
.rank — percentile rank (lower = better).
.score — normalized score (higher = better).

Ranking with transforms

Sort surviving peptides with sort_by. Transforms normalize heterogeneous scores to a common scale:

from topiary import TopiaryPredictor, Affinity, Presentation, mean
from mhctools import NetMHCpan, MHCflurry

predictor = TopiaryPredictor(
    models=[NetMHCpan, MHCflurry],
    alleles=["HLA-A*02:01"],
    filter_by=Affinity <= 500,
    sort_by=mean(
        Affinity["netmhcpan"].logistic(350, 150),
        Affinity["mhcflurry"].logistic(350, 150),
    ),
)

When multiple models produce the same kind, qualify with brackets: Affinity["netmhcpan"].

Available transforms:

Transform	What it does
`.descending_cdf(mean, std)`	Lower input -> higher output (for IC50, rank)
`.ascending_cdf(mean, std)`	Higher input -> higher output (for scores)
`.logistic(midpoint, width)`	Sigmoid normalization
`.clip(lo, hi)`	Clamp to range
`.hinge()`	`max(0, x)`
`.log()` / `.log2()` / `.log10()` / `.log1p()`	Logarithms
`.sqrt()` / `.exp()`	Square root, exponential
`abs(...)`	Absolute value

Aggregations: mean(), geomean(), minimum(), maximum(), median()

String expressions

Every filter and ranking expression has a string form for use at the CLI or in configuration:

Python	String
`Affinity <= 500`	`affinity <= 500` / `ba <= 500`
`Affinity.rank <= 2`	`affinity.rank <= 2`
`Presentation.score >= 0.5`	`el.score >= 0.5`
`(Affinity <= 500) \| (Presentation.rank <= 2)`	`affinity <= 500 \| el.rank <= 2`
`Affinity["netmhcpan"] <= 500`	`netmhcpan_ba <= 500`
`0.5 * Affinity.score + 0.5 * Presentation.score`	`0.5 * affinity.score + 0.5 * presentation.score`
`Affinity.logistic(350, 150)`	`affinity.logistic(350, 150)`
`mean(Affinity.score, Presentation.score)`	`mean(affinity.score, presentation.score)`
`Column("gene_tpm") >= 5`	`gene_tpm >= 5`
`Column("gene_tpm").log()`	`gene_tpm.log()`

Column references

Any DataFrame column can be used directly in expressions — expression data, peptide properties, custom annotations. Unknown identifiers are automatically treated as column references:

from topiary import Affinity, Column

# As a filter — bare name or Column() both work
filter_by = Column("gene_tpm") >= 5.0

# As a sorting signal
sort_by = 0.5 * Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")

# Transforms work on columns too
sort_by = Column("gene_tpm").log1p()

In CLI strings, bare names work directly:

--filter-by "ba <= 500 & gene_tpm >= 5"
--sort-by "affinity.logistic(350, 150) + 0.1 * gene_tpm.log1p()"

The explicit column(name) syntax also works: gene_tpm >= 5.

Misspelled column names get a helpful error at evaluation time: Column 'hydrophobicty' not found. Did you mean: ['hydrophobicity']?

Scope prefixes

The wt. prefix reads predictions for the wildtype peptide at the same position (variant workflows only). Use it for differential binding:

from topiary import Affinity, wt

sort_by = Affinity.score - wt.Affinity.score  # mutant advantage

shuffled. and self. prefixes work the same way for shuffled-decoy and self-proteome contexts.

Peptide-level expressions

Len() reads the peptide length. Count("C") counts amino acid occurrences:

from topiary import Len, Count, wt

sort_by = Count("C") - wt.Count("C")   # gained/lost cysteines vs wildtype

Expression data

Load gene-level, transcript-level, or variant-level quantification to annotate predictions. Expression values become DataFrame columns that you can reference in ranking and filtering.

How it works

Each --*-expression flag takes a spec string: [name:]file[:id_col[:val_col]].

The file is auto-detected (Salmon .sf, Kallisto abundance.tsv, RSEM .genes.results / .isoforms.results, StringTie .gtf, Cufflinks .fpkm_tracking) or treated as generic TSV/CSV.
The loaded columns are joined onto the prediction DataFrame by gene_id, transcript_id, or variant respectively.
Value columns are prefixed with the name (default: gene, transcript, or variant) and lowercased. For example, Salmon's TPM column loaded with --gene-expression becomes gene_tpm.

Python API

import pandas as pd
from topiary import TopiaryPredictor, Affinity, Column
from topiary.rna import load_expression_from_spec
from mhctools import NetMHCpan
from varcode import load_vcf

# Load expression data
gene_name, gene_id_col, gene_df = load_expression_from_spec(
    "salmon_quant.sf", default_name="gene"
)
# gene_df has columns: ["Name", "TPM"]
# gene_name = "gene", gene_id_col = "Name"

expression_data = {
    "gene": [(gene_name, gene_id_col, gene_df)],
    "transcript": [],
    "variant": [],
}

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter_by=(Affinity <= 500) & (Column("gene_tpm") >= 5.0),
)

variants = load_vcf("somatic.vcf")
df = predictor.predict_from_variants(
    variants,
    expression_data=expression_data,
)
# df now has a "gene_tpm" column from the Salmon file

Column naming

The name prefix and the original column name are combined as {prefix}_{column}, both lowercased:

Flag	File	Original column	Result column
`--gene-expression quant.sf`	Salmon	`TPM`	`gene_tpm`
`--gene-expression quant.sf`	Salmon	`NumReads`	`gene_numreads`
`--transcript-expression abundance.tsv`	Kallisto	`tpm`	`transcript_tpm`
`--variant-expression isovar.tsv`	generic	`num_alt_reads`	`variant_num_alt_reads`
`--gene-expression mygene:quant.sf`	Salmon	`TPM`	`mygene_tpm`

The join keys are implicit: gene-level data joins on gene_id, transcript-level on transcript_id, variant-level on variant. These columns are present in variant-pipeline output.

Transcript selection

When transcript-level expression data is provided, Topiary uses it to select the highest-expressed transcript per variant (instead of the default priority-based selection). This matches the behavior of the legacy --rna-transcript-fpkm-tracking-file flag.

Auto-detected formats

Extension / pattern	Format	Default ID column	Default value column
`.sf`	Salmon	`Name`	`TPM`
`abundance*.tsv` (with `target_id` + `tpm` header)	Kallisto	`target_id`	`tpm`
`.genes.results`	RSEM (gene)	`gene_id`	`TPM`
`.isoforms.results`	RSEM (transcript)	`transcript_id`	`TPM`
`.gtf`	StringTie	`reference_id`	`TPM` (or `FPKM`)
`.fpkm_tracking`	Cufflinks	`tracking_id`	`FPKM`
anything else	generic	first column	all numeric columns

Override defaults with the full spec: name:file:id_col:val_col.

CLI usage

Score peptides

topiary \
  --peptide-csv peptides.csv \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --output-csv results.csv

The CSV needs a peptide column (and optionally name). Each peptide is scored as-is.

Scan proteins

topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01,HLA-B*07:02 \
  --ic50-cutoff 500 \
  --output-csv results.csv

Filter and rank

# Filter: keep if affinity OR presentation passes
topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --filter-by "ba <= 500 | el.rank <= 2" \
  --output-csv results.csv

# Rank: composite score from two models
topiary \
  --fasta antigens.fasta \
  --mhc-predictor netmhcpan mhcflurry \
  --mhc-alleles HLA-A*02:01 \
  --filter-by "ba <= 500" \
  --sort-by "mean(affinity['netmhcpan'].logistic(350, 150), affinity['mhcflurry'].logistic(350, 150))" \
  --output-csv results.csv

Neoantigen discovery from somatic variants

topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01,HLA-B*07:02 \
  --filter-by "ba <= 500 | el.rank <= 2" \
  --sort-by "0.6 * affinity.descending_cdf(500, 200) + 0.4 * presentation.score" \
  --only-novel-epitopes \
  --output-csv epitopes.csv

Add expression data

# Gene-level expression from Salmon (auto-detected)
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --gene-expression salmon_quant.sf \
  --filter-by "ba <= 500 & gene_tpm >= 5" \
  --only-novel-epitopes \
  --output-csv results.csv

# Transcript-level from Kallisto
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --transcript-expression kallisto/abundance.tsv \
  --filter-by "ba <= 500" \
  --sort-by "affinity.descending_cdf(500, 200) + 0.1 * transcript_tpm.log1p()" \
  --output-csv results.csv

# Variant-level read support from isovar
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --variant-expression isovar_output.tsv \
  --filter-by "ba <= 500 & variant_num_alt_reads >= 3" \
  --output-csv results.csv

# Combine all three levels
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --gene-expression salmon_quant.sf \
  --transcript-expression kallisto/abundance.tsv \
  --variant-expression isovar_output.tsv \
  --filter-by "ba <= 500 & gene_tpm >= 5 & variant_num_alt_reads >= 3" \
  --output-csv results.csv

# Override auto-detection with explicit spec
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --gene-expression mygene:custom_quant.tsv:ensembl_id:tpm_value \
  --output-csv results.csv

Expression flags are only valid with the variant pipeline (--vcf, --maf, --variant). They are rejected with direct sequence inputs (--fasta, --peptide-csv, etc.) because those outputs lack the gene_id / transcript_id / variant columns needed for joining.

Gene and transcript lookups

Pull protein sequences from Ensembl:

topiary --gene-names BRAF TP53 EGFR --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --gene-ids ENSG00000157764 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --transcript-ids ENST00000288602 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --ensembl-proteome --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --cta --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01  # cancer-testis antigens

Restrict to protein regions

topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --regions spike:319-541 nucleocapsid:0-50 \
  --output-csv results.csv

Format: NAME:START-END (0-indexed, half-open).

Exclusion filtering

Remove peptides found in reference proteomes — for tumor-specific or pathogen-specific peptide selection:

--exclude-ensembl                    # human Ensembl proteome
--exclude-non-cta                    # non-CTA proteins (requires pirlygenes)
--exclude-tissues heart_muscle lung  # genes expressed in these tissues
--exclude-fasta reference.fasta      # custom reference sequences
--exclude-mode substring             # "substring" (default) or "exact"

MHC prediction models

Specify one or more predictors with --mhc-predictor and alleles with --mhc-alleles:

--mhc-predictor netmhcpan mhcflurry --mhc-alleles HLA-A*02:01,HLA-B*07:02

All predictors come from mhctools.

With mhctools 3.7.0+, upstream predictor parsing supports multiple predictors in one CLI invocation, so commands like --mhc-predictor netmhcpan42 bigmhc-el are supported directly. Topiary keeps its higher-level --filter-by / --sort-by DSL on top of that lower-level predictor interface. Topiary's ranking/filtering DSL is also compatible with the simplified mhctools 3.7.0+ kind constants API. NetChop and Pepsickle behavior follows the upstream changes as well: improved NetChop error handling and Pepsickle's epitope-focused model selection.

CLI name	Predicts
`netmhcpan`	affinity + presentation (auto-detects installed version)
`netmhcpan4` / `netmhcpan41` / `netmhcpan42`	specific NetMHCpan 4.x versions
`netmhcpan4-ba` / `netmhcpan4-el`	single-mode (binding affinity or eluted ligand)
`mhcflurry`	affinity + presentation + processing
`mixmhcpred`	presentation
`netmhciipan` / `netmhciipan4` / `netmhciipan43`	MHC class II
`bigmhc` / `bigmhc-el` / `bigmhc-im`	presentation / immunogenicity
`netmhcstabpan`	pMHC stability
`pepsickle` / `netchop`	proteasomal cleavage
`netmhcpan-iedb` / `netmhccons-iedb` / `smm-iedb`	IEDB web API (no local install)
`random`	random predictions (for testing)

Peptide lengths: --mhc-epitope-lengths 8,9,10,11 (defaults come from the predictor).

Output

--output-csv results.csv
--output-html results.html
--output-csv-sep "\t"
--subset-output-columns peptide allele affinity
--rename-output-column value ic50

All predictions: source_sequence_name, peptide, peptide_offset, peptide_length, allele, kind, score, value, affinity, percentile_rank, prediction_method_name

Variant predictions add: variant, gene, gene_id, transcript_id, transcript_name, effect, effect_type, contains_mutant_residues, mutation_start_in_peptide, mutation_end_in_peptide

Expression data adds: columns named {prefix}_{column} as described in Column naming, e.g. gene_tpm, transcript_tpm, variant_num_alt_reads.

Peptide properties

Compute amino acid properties and use them in ranking:

from topiary import TopiaryPredictor, Affinity, Column
from topiary.properties import add_peptide_properties

df = predictor.predict_from_named_sequences(seqs)
df = add_peptide_properties(df, groups=["manufacturability"])

# Properties become ranking signals
score = Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")
# CLI: --sort-by "affinity.logistic(350, 150) - 0.1 * cysteine_count"

Named groups:

"core" — charge, hydrophobicity, aromaticity, molecular_weight
"manufacturability" — core + cysteine_count, instability_index, max_7mer_hydrophobicity, difficult_nterm/cterm, asp_pro_bonds
"immunogenicity" — core + tcr_charge, tcr_aromaticity, tcr_hydrophobicity (TCR-facing positions for MHC-I)

Protein sources (Python API)

from topiary.sources import (
    ensembl_proteome,
    sequences_from_gene_names,
    cta_sequences,
    non_cta_sequences,
    tissue_expressed_sequences,
    available_tissues,
)

# All return dict[name -> amino_acid_sequence]
seqs = sequences_from_gene_names(["BRAF", "TP53", "EGFR"])
cta = cta_sequences()                          # cancer-testis antigens
heart = tissue_expressed_sequences(["heart_muscle"])
print(available_tissues())                      # list tissue names

Name		Name	Last commit message	Last commit date
Latest commit History 274 Commits
.github/workflows		.github/workflows
docs		docs
tests		tests
topiary		topiary
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
RELEASING.md		RELEASING.md
deploy.sh		deploy.sh
develop.sh		develop.sh
lint.sh		lint.sh
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
test.sh		test.sh

Folders and files

Latest commit

History

Repository files navigation

Topiary

Installation

Predicting MHC binding

Filtering and ranking

Filters

Prediction kinds and fields

Ranking with transforms

String expressions

Column references

Scope prefixes

Peptide-level expressions

Expression data

How it works

Python API

Column naming

Transcript selection

Auto-detected formats

CLI usage

Score peptides

Scan proteins

Filter and rank

Neoantigen discovery from somatic variants

Add expression data

Gene and transcript lookups

Restrict to protein regions

Exclusion filtering

MHC prediction models

Output

Peptide properties

Protein sources (Python API)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages