About the Project

Rethinking HLA compatibility with protein language models

CAPA replaces the coarse categorical HLA match/mismatch score with continuous, biologically meaningful distances derived from ESM-2, a 650M-parameter protein language model.

The Problem

Haematopoietic stem cell transplantation (HSCT) outcome depends critically on HLA compatibility between donor and recipient. The standard approach reduces this to a count of per-locus matches and mismatches, treating every mismatch as equally dangerous regardless of how much the underlying protein sequences differ.

A mismatch between two structurally similar alleles and a mismatch between two completely divergent alleles are treated identically. This discards most of the immunologically relevant information.
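To make the information loss concrete, here is a minimal sketch of a categorical mismatch count. The function name, the typings, and the choice to count donor alleles with no identical counterpart in the recipient are illustrative assumptions, not CAPA's code or any clinical scoring scheme.

```python
from collections import Counter

def mismatch_count(donor: dict, recipient: dict) -> int:
    """Count donor alleles that have no identical counterpart in the recipient."""
    total = 0
    for locus, donor_alleles in donor.items():
        # Multiset difference: alleles the donor carries that the recipient lacks.
        unmatched = Counter(donor_alleles) - Counter(recipient[locus])
        total += sum(unmatched.values())
    return total

# Hypothetical typings, chosen only for illustration.
recipient = {"A": ("A*02:01", "A*01:01"), "B": ("B*07:02", "B*08:01")}
# Mismatched allele comes from the same allele group (A*02) as the recipient's.
donor_same_group = {"A": ("A*02:06", "A*01:01"), "B": ("B*07:02", "B*08:01")}
# Mismatched allele comes from a different allele group (A*68).
donor_other_group = {"A": ("A*68:01", "A*01:01"), "B": ("B*07:02", "B*08:01")}

score_same_group = mismatch_count(donor_same_group, recipient)
score_other_group = mismatch_count(donor_other_group, recipient)
print(score_same_group, score_other_group)
```

Both donors receive the same score even though the mismatched alleles differ in how closely they relate to the recipient's allele, which is the distinction a categorical count cannot express.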

Our Approach

CAPA encodes every HLA allele as a 1 280-dimensional vector using ESM-2, a large protein language model pre-trained on 250 million protein sequences. Alleles with similar binding-groove conformations cluster together in embedding space; immunologically distant alleles are far apart.
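As a rough illustration of how a continuous distance could be read off such embeddings, the sketch below averages per-residue vectors into one vector per allele and compares alleles by cosine distance. Mean pooling, cosine distance, and the 4-dimensional toy vectors (standing in for real 1 280-dimensional ESM-2 outputs) are assumptions made for this example; the page does not state which pooling or metric CAPA uses.

```python
import math

def mean_pool(residue_vectors):
    """Average per-residue vectors into a single fixed-length vector."""
    n = len(residue_vectors)
    return [sum(column) / n for column in zip(*residue_vectors)]

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Toy stand-ins for allele embeddings (not real ESM-2 outputs).
reference = mean_pool([[1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 0.0, 0.0]])
similar_allele = [1.0, 0.9, 0.1, 0.0]
divergent_allele = [0.0, 0.1, 1.0, 1.0]

d_similar = cosine_distance(reference, similar_allele)
d_divergent = cosine_distance(reference, divergent_allele)
print(d_similar < d_divergent)
```

Unlike a match/mismatch flag, the two distances differ, so a downstream model can weight a near-identical mismatch differently from a divergent one.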

A cross-attention network then models the interaction between donor and recipient allele sets, learning which locus pairs are most predictive. The resulting representation is fed into a DeepHit survival head that jointly models GvHD, relapse, and transplant-related mortality as competing risks.
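The cross-attention step can be sketched as single-head scaled dot-product attention in which donor vectors act as queries and recipient vectors act as keys and values. This is a simplified stand-in: it omits the learned projections, multiple heads, and training that a real network would have, and every name and number here is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query attends over all keys; returns (outputs, attention weights)."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Toy 2-dimensional allele vectors (hypothetical values).
donor = [[1.0, 0.0], [0.0, 1.0]]
recipient = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

outputs, weights = cross_attention(donor, recipient, recipient)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one donor allele attends to each recipient allele, which is the kind of quantity that attention-based explanations inspect.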

Data

Primary validation uses the UCI Bone Marrow Transplant Children Dataset — 187 paediatric patients with allogeneic HSCT, donor/recipient HLA typing at antigen and allele resolution, and time-to-event outcomes for GvHD, relapse, and cause of death. HLA protein sequences are sourced from the IPD-IMGT/HLA database.

Technical details

Stack & references

Model: ESM-2 (facebook/esm2_t33_650M_UR50D) via HuggingFace

Framework: PyTorch 2.x

Survival: DeepHit (joint distribution over event types and times)

Data: UCI Bone Marrow Transplant Dataset (187 paediatric patients)

HLA ref.: IPD-IMGT/HLA database (full protein sequences)

Explainability: SHAP (KernelExplainer) + cross-attention weights

Frontend: Next.js 14, Tailwind CSS, shadcn/ui

License: MIT
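The DeepHit entry in the stack list refers to a joint distribution over event types and times. As a hedged illustration of how such an output is usually read, the sketch below turns a toy joint probability table into per-event cumulative incidence curves. The probabilities, the four time bins, and the event keys are invented for the example and are not CAPA predictions.

```python
from itertools import accumulate

def cumulative_incidence(joint_pmf):
    """Map P(event k occurs in time bin t) to P(event k has occurred by bin t)."""
    return {event: list(accumulate(probs)) for event, probs in joint_pmf.items()}

# Hypothetical joint distribution for one patient: three competing events
# across four time bins. All entries together sum to 1.
joint_pmf = {
    "gvhd":    [0.10, 0.15, 0.05, 0.05],
    "relapse": [0.05, 0.10, 0.10, 0.05],
    "trm":     [0.05, 0.10, 0.10, 0.10],
}

cif = cumulative_incidence(joint_pmf)
for event, curve in cif.items():
    print(event, [round(p, 2) for p in curve])
```

Because the events compete, the curves share one probability budget: mass assigned to one event at a given time is unavailable to the others, which is what separates this from fitting three independent survival models.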

Transparency

Limitations

We are committed to honest reporting of what CAPA can and cannot do.

Research prototype — not for clinical use

Training cohort of 187 patients is small — deep model is proof-of-concept only.
Paediatric HSCT only (UCI BMT). Adult cohort generalisability is unknown.
HLA sequences limited to IPD-IMGT/HLA alleles; novel or incomplete typings may fail.
Not validated for clinical decision-making. Research use only.

Contributors

People

Huanxuan Li (Shawn)

Lead developer · Architecture, training pipeline, web interface

Interested in contributing?

CAPA is open source. Pull requests, issues, and feedback are welcome.
