CAPA
Open Source · MIT · Proof-of-Concept

Predicting Alloimmunity with Protein Language Models

CAPA replaces coarse HLA match/mismatch scores with continuous ESM-2 embeddings and predicts GvHD, relapse, and transplant-related mortality as competing risks using cross-attention and DeepHit.

How it works

From allele strings to risk curves

Three stages transform raw HLA typing into calibrated, interpretable competing-risk predictions.

01
allele → sequence

HLA Input

Donor and recipient HLA alleles at five loci (A, B, C, DRB1, DQB1) are looked up in the IPD-IMGT/HLA database to retrieve their full protein sequences.
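As a rough illustration of the allele → sequence step, the sketch below parses an allele string and looks its sequence up in an in-memory table. The names `parse_allele`, `lookup_sequence` and the table layout are hypothetical; how CAPA actually queries IPD-IMGT/HLA is not described here, and keying on two-field resolution is an assumption.

```python
import re
from typing import NamedTuple

# The five loci named in the text above.
SUPPORTED_LOCI = {"A", "B", "C", "DRB1", "DQB1"}

# Accepts names such as "HLA-A*02:01" or "DRB1*15:01:01" (two to four fields,
# optional "HLA-" prefix, optional single-letter expression suffix).
ALLELE_RE = re.compile(r"^(?:HLA-)?([A-Z]+[0-9]*)\*(\d{2,3}(?::\d{2,3}){1,3})[A-Z]?$")


class Allele(NamedTuple):
    locus: str
    fields: tuple


def parse_allele(name: str) -> Allele:
    """Split an allele name into its locus and colon-separated fields."""
    match = ALLELE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognisable HLA allele name: {name!r}")
    locus = match.group(1)
    if locus not in SUPPORTED_LOCI:
        raise ValueError(f"unsupported locus: {locus}")
    return Allele(locus, tuple(match.group(2).split(":")))


def lookup_sequence(allele: Allele, table: dict) -> str:
    """Fetch a protein sequence from a {'A*02:01': 'SEQ...'} style table.

    Assumption: the table is keyed at two-field resolution.
    """
    key = f"{allele.locus}*{':'.join(allele.fields[:2])}"
    return table[key]


# Toy table; the value is a placeholder, not a real HLA sequence.
toy_table = {"A*02:01": "PLACEHOLDER"}
print(lookup_sequence(parse_allele("HLA-A*02:01"), toy_table))
```

In practice the table would be built from a local copy of the database's protein sequence files rather than typed in by hand.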

02
sequence → 1 280-dim

ESM-2 Embedding

Each amino-acid sequence is encoded by frozen ESM-2 (650 M parameters) into a 1 280-dim vector. Structural similarity is preserved — immunologically similar alleles cluster together.
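A protein language model produces one vector per residue, so some pooling step is needed to obtain a single vector per allele. The pooling CAPA uses is not stated here; the sketch below assumes mean pooling, a common choice, and uses tiny 3-dim toy vectors in place of real 1 280-dim ESM-2 outputs. Cosine similarity is shown as one way to compare the pooled vectors.

```python
import math


def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-size vector (assumed pooling)."""
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(vec[d] for vec in residue_vectors) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for per-residue model outputs (values are made up).
allele_a = mean_pool([[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]])  # pooled: [1.0, 1.0, 0.0]
allele_b = mean_pool([[2.0, 2.0, 0.0]])
allele_c = mean_pool([[0.0, 0.0, 3.0]])

print(cosine_similarity(allele_a, allele_b))  # close to 1: same direction
print(cosine_similarity(allele_a, allele_c))  # 0.0: orthogonal
```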

03
interaction → CIF curves

Risk Prediction

A cross-attention network models donor–recipient allele interactions. DeepHit jointly outputs cumulative incidence curves for GvHD, relapse, and TRM as competing events.
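To make the cross-attention idea concrete, here is a minimal single-head, scaled dot-product sketch. It is not CAPA's network: there are no learned projections, the choice of recipient alleles as queries and donor alleles as keys and values is an assumption, and the 2-dim vectors are toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention of each query over all keys.

    Returns (outputs, weights); weights[i][j] is how strongly query i
    attends to key j, which is what an attention heatmap would display.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy embeddings: two recipient alleles (queries), two donor alleles (keys/values).
recipient = [[1.0, 0.0], [0.0, 1.0]]
donor = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = cross_attention(recipient, donor, donor)
for row in weights:
    print([round(x, 3) for x in row])
```

Each row of `weights` sums to 1, so it can be read as a distribution over donor alleles for a given recipient allele.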

Pipeline: Input (HLA-A*02:01, HLA-B*07:02, HLA-DRB1*15:01) → ESM-2 · 650M (1 280-dim embedding per allele) → Cross-Attention (Donor × Recipient interaction, 128-dim) → DeepHit output (GvHD CIF, Relapse CIF, TRM CIF)

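
The step from network outputs to cumulative incidence (CIF) curves can be sketched as follows. This follows the published DeepHit formulation as we understand it: one softmax over every (event, time-bin) pair, with the CIF for an event being the running sum of its probabilities over time. The number of time bins and the logit values are made up, and CAPA's own output head may differ.

```python
import math


def joint_pmf(logits):
    """Softmax over all (event, time-bin) pairs; logits[k][t] -> pmf[k][t]."""
    flat = [x for row in logits for x in row]
    top = max(flat)
    total = sum(math.exp(x - top) for x in flat)
    return [[math.exp(x - top) / total for x in row] for row in logits]


def cumulative_incidence(pmf):
    """CIF for each event: cumulative sum of that event's probabilities over time."""
    curves = []
    for row in pmf:
        running, curve = 0.0, []
        for p in row:
            running += p
            curve.append(running)
        curves.append(curve)
    return curves


# Toy logits: 3 competing events (GvHD, relapse, TRM) x 4 time bins.
logits = [
    [0.2, 0.1, 0.0, -0.3],
    [1.0, 0.5, 0.2, 0.0],
    [-0.5, -0.2, 0.0, 0.1],
]
cif = cumulative_incidence(joint_pmf(logits))
for name, curve in zip(["GvHD", "Relapse", "TRM"], cif):
    print(name, [round(v, 3) for v in curve])
```

Because the probabilities are shared across events, the final values of the three curves sum to 1, which is what makes the risks competing rather than independent.
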
Key results

Benchmarking against traditional baselines

Evaluated on the UCI Bone Marrow Transplant dataset (n = 187) using time-dependent C-index and Brier score.

0.84

C-index, relapse

Fine–Gray · 95% CI 0.69–1.00

0.75

C-index, relapse

Cox-PH · 95% CI 0.53–1.00

187

Patients (UCI BMT)

Paediatric HSCT cohort

3

Competing risks

GvHD · Relapse · TRM

Time-dependent C-index — held-out test set

UCI BMT · n = 29 test patients
Model | GvHD | Relapse | TRM
Cox-PH (cause-specific) | n/a | 0.75 | 0.65
Fine–Gray (best) | n/a | 0.84 | 0.66
DeepHit (tabular HLA) | n/a | 0.67 | 0.41

GvHD not evaluable — only 2 events in the test set. Fine–Gray is the best-performing baseline.
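
For readers unfamiliar with the metric, the sketch below shows one way a cause-specific, time-dependent C-index can be computed from predicted CIF curves. It follows the pairwise definition used in the DeepHit literature, with two simplifications that are our assumptions: no censoring weights are applied, and tied predictions count as half-concordant. It is not the evaluation code behind the numbers above.

```python
def td_concordance(times, events, cifs, cause):
    """Time-dependent concordance for one cause.

    times[i]  : index of the time bin at which patient i had an event or was censored
    events[i] : 0 for censored, otherwise the cause label
    cifs[i][t]: predicted CIF of `cause` for patient i at time bin t

    A pair (i, j) is comparable when i had the event of interest and j was
    still event-free and under follow-up at that time. It is concordant when
    i's predicted CIF at that time is higher than j's.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if events[i] != cause:
            continue
        t_i = times[i]
        for j in range(len(times)):
            if times[j] <= t_i:
                continue
            comparable += 1
            if cifs[i][t_i] > cifs[j][t_i]:
                concordant += 1.0
            elif cifs[i][t_i] == cifs[j][t_i]:
                concordant += 0.5  # assumed tie handling
    if comparable == 0:
        return float("nan")  # no comparable pairs: metric undefined
    return concordant / comparable


# Toy data: three patients, three time bins, cause 1 is the event of interest.
times = [0, 1, 2]
events = [1, 0, 1]
cifs = [
    [0.5, 0.6, 0.7],
    [0.2, 0.3, 0.3],
    [0.1, 0.2, 0.6],
]
print(td_concordance(times, events, cifs, cause=1))
```

With few events, the number of comparable pairs is small and the estimate becomes unstable, which is consistent with the wide confidence intervals reported for a 29-patient test set.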

About the project

A new lens on HLA compatibility

Outcomes after haematopoietic stem cell transplantation (HSCT) depend critically on HLA compatibility. The standard approach reduces compatibility to a count of matched and mismatched alleles, which discards how similar two mismatched alleles actually are.
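
A small sketch of what a mismatch count loses: the hypothetical `mismatch_count` below counts recipient alleles that have no identical allele in the donor at the same locus. This is a simplified stand-in for clinical matching rules, and the typings are illustrative only.

```python
from collections import Counter


def mismatch_count(donor, recipient):
    """Count recipient alleles with no identical donor allele at the same locus.

    donor / recipient: dict mapping a locus name to a list of allele names.
    """
    total = 0
    for locus, recipient_alleles in recipient.items():
        unmatched = Counter(recipient_alleles) - Counter(donor.get(locus, []))
        total += sum(unmatched.values())
    return total


recipient = {"A": ["A*02:01", "A*01:01"]}
# Donor 1 carries an allele from the same allele group (A*02) as the recipient.
donor_1 = {"A": ["A*02:06", "A*01:01"]}
# Donor 2 carries an allele from a different allele group (A*24).
donor_2 = {"A": ["A*24:02", "A*01:01"]}

print(mismatch_count(donor_1, recipient))  # 1
print(mismatch_count(donor_2, recipient))  # 1
```

Both donors score one mismatch, so the count alone cannot tell them apart even though the mismatched alleles belong to different allele groups.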

CAPA was built to change that. By encoding every allele with ESM-2 — a protein language model trained on 250 M sequences — and learning donor–recipient interaction through cross-attention, we get embeddings that reflect structural and functional similarity rather than mere categorical identity.

This is an open-source proof-of-concept, validated on 187 paediatric HSCT patients. We acknowledge the small cohort limitation and encourage replication on larger datasets.

ESM-2 Embeddings

1 280-dim per allele, frozen 650M model

Cross-Attention

Interpretable donor × recipient interaction

DeepHit Survival

Joint competing-risks CIF output

Open Source

MIT licensed, fully reproducible

Try it on your own data

Enter donor and recipient HLA strings and receive competing-risk curves, attention heatmaps, and SHAP feature attribution in seconds.