Rethinking HLA compatibility with protein language models
CAPA replaces the coarse categorical HLA match/mismatch score with continuous, biologically meaningful distances derived from ESM-2 — a 650 M-parameter protein language model.
The Problem
Haematopoietic stem cell transplantation (HSCT) outcome depends critically on HLA compatibility between donor and recipient. The standard approach encodes this as a count of matched and mismatched loci, treating all mismatches as equally dangerous regardless of the underlying protein sequence difference.
A mismatch between two structurally similar alleles is scored exactly the same as a mismatch between two completely divergent ones. This discards most of the immunologically relevant information.
Our Approach
CAPA encodes every HLA allele as a 1 280-dimensional vector using ESM-2, a large protein language model pre-trained on tens of millions of protein sequences from UniRef. Alleles with similar binding-groove conformations cluster together in embedding space; immunologically distant alleles are far apart.
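As a rough illustration of why a continuous distance carries more information than a mismatch count, the sketch below scores two hypothetical donors against one recipient. Everything in it is an assumption made for illustration: the allele and locus names are placeholders, the 4-dimensional vectors are invented stand-ins for real ESM-2 embeddings, and cosine distance is just one plausible metric (the distance CAPA actually uses is not stated here).

```python
import math

def mismatch_count(donor, recipient):
    """Categorical score: number of loci whose allele names differ."""
    return sum(1 for locus in donor if donor[locus] != recipient[locus])

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Invented toy vectors standing in for ESM-2 embeddings (not real model output).
toy_embeddings = {
    "allele_A": [1.0, 0.0, 0.0, 0.0],
    "allele_B": [0.9, 0.1, 0.0, 0.0],  # deliberately close to allele_A
    "allele_C": [0.0, 0.0, 1.0, 0.0],  # deliberately far from allele_A
}

recipient = {"locus_1": "allele_A"}
donor_near = {"locus_1": "allele_B"}
donor_far = {"locus_1": "allele_C"}

# The categorical score cannot tell the two donors apart.
count_near = mismatch_count(donor_near, recipient)
count_far = mismatch_count(donor_far, recipient)

# The embedding distance separates them.
dist_near = cosine_distance(toy_embeddings["allele_B"], toy_embeddings["allele_A"])
dist_far = cosine_distance(toy_embeddings["allele_C"], toy_embeddings["allele_A"])

print(count_near, count_far)
print(round(dist_near, 4), round(dist_far, 4))
```

Both donors receive the same mismatch count, while the distance for the near donor is close to zero and the distance for the far donor is 1.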
A cross-attention network then models the interaction between donor and recipient allele sets, learning which locus pairs are most predictive. The resulting representation is fed into a DeepHit survival head that jointly models GvHD, relapse, and transplant-related mortality as competing risks.
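The two ideas in this paragraph can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not CAPA's implementation: the attention step omits the learned query, key, and value projections a real network would have, the vectors and logits are made-up numbers, and the head follows the common DeepHit-style layout of one softmax over every (event, time interval) pair plus one extra "no event" slot.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a flat list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

def cumulative_incidence(pmf):
    """Running sum of per-interval probabilities for each competing risk."""
    cif = {}
    for event, probs in pmf.items():
        running, curve = 0.0, []
        for p in probs:
            running += p
            curve.append(running)
        cif[event] = curve
    return cif

# Toy donor and recipient allele vectors (invented, 2-dimensional).
donor = [[1.0, 0.0], [0.0, 1.0]]
recipient = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Donor alleles query the recipient allele set.
attended, weights = cross_attention(donor, recipient, recipient)

# Made-up logits: 3 events x 2 time intervals, plus one "no event" slot.
logits = [0.2, -0.1, 0.5, 0.0, -0.3, 0.1, 1.0]
probs = softmax(logits)
pmf = {"gvhd": probs[0:2], "relapse": probs[2:4], "trm": probs[4:6]}
p_no_event = probs[6]

cif = cumulative_incidence(pmf)
print(weights)
print(cif)
```

Because a single softmax spans all events and intervals, the final cumulative incidences of the three risks plus the "no event" probability sum to 1, which is what makes the risks compete rather than being modelled independently.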
Data
Primary validation uses the UCI Bone Marrow Transplant Children Dataset — 187 paediatric patients with allogeneic HSCT, donor/recipient HLA typing at antigen and allele resolution, and time-to-event outcomes for GvHD, relapse, and cause of death. HLA protein sequences are sourced from the IPD-IMGT/HLA database.
Limitations
We are committed to honest reporting of what CAPA can and cannot do.
Research prototype — not for clinical use
- The training cohort of 187 patients is small, so the deep model is a proof of concept only.
- Paediatric HSCT only (UCI BMT). Generalisability to adult cohorts is unknown.
- HLA sequences are limited to alleles in IPD-IMGT/HLA; novel or incomplete typings may fail.
- Not validated for clinical decision-making. Research use only.
People
Huanxuan Li (Shawn)
Lead developer · Architecture, training pipeline, web interface
Interested in contributing?
CAPA is open source. Pull requests, issues, and feedback are welcome.