AttOmics: Attention-based architecture for diagnosis and prognosis from Omics data
The increasing availability of high-throughput omics data allows for considering a new medicine centered on individual patients.
Precision medicine relies on exploiting these high-throughput data with machine-learning models, especially the ones based on deep-learning approaches, to improve diagnosis.
Due to the high-dimensional small-sample nature of omics data, current deep-learning models end up with many parameters and have to be fitted with a limited training set.
Furthermore, these models treat interactions between molecular entities inside an omics profile as identical for all patients rather than patient-specific.
In this article, we propose AttOmics, a new deep-learning architecture based on the self-attention mechanism.
First, we decompose each omics profile into a set of groups, where each group contains related features.
Then, by applying the self-attention mechanism to the set of groups, we can capture the different interactions specific to a patient.
The results of different experiments carried out in this paper show that our model can accurately predict the phenotype of a patient with fewer parameters than deep neural networks.
Visualizing the attention maps can provide new insights into the essential groups for a particular phenotype.
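To make the idea concrete, here is a minimal NumPy sketch of single-head self-attention applied to the group embeddings of one patient. It is an illustration under stated assumptions, not the AttOmics implementation: the function name `group_self_attention`, the random projection matrices, and the dimensions are all hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def group_self_attention(groups, w_q, w_k, w_v):
    """groups: (n_group, d) array holding one embedding per feature group."""
    q, k, v = groups @ w_q, groups @ w_k, groups @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # (n_group, n_group) attention map
    return attn @ v, attn


rng = np.random.default_rng(0)
n_group, d = 4, 8  # hypothetical sizes
groups = rng.normal(size=(n_group, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, attn = group_self_attention(groups, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Because `attn` is computed from the patient's own group embeddings, each patient gets a different attention map; row i shows how strongly group i attends to every other group.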
Installation
- Install miniconda
- Clone this repository:
git clone https://forge.ibisc.univ-evry.fr/abeaude/AttOmics.git
- Navigate to the AttOmics folder:
cd AttOmics
- Create a conda environment:
conda env create -f environment.yml
- Activate the newly created environment:
conda activate attomics
Data Format
The omics file contains the expression matrix for all patients. Each row represents a patient and each column represents a feature.
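As a small illustration of this layout, the sketch below builds a toy expression matrix with NumPy. The sizes, random values, and variable names are made up for the example; only the orientation (patients as rows, features as columns) comes from the description above.

```python
import numpy as np

n_patients, n_features = 5, 3  # hypothetical sizes
rng = np.random.default_rng(0)

# Expression matrix: one row per patient, one column per feature.
X = rng.random((n_patients, n_features)).astype(np.float32)
# One label per patient, aligned with the rows of X.
Y = rng.integers(0, 2, size=n_patients)

profile = X[0]     # omics profile of the first patient
feature = X[:, 0]  # values of the first feature across all patients
```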
Here is an example of a PyTorch dataset that can be used with AttOmics:
import torch
from torch.utils.data import Dataset


class OmicsDataset(Dataset):
    """Wrap an omics matrix and its labels (plus optional events)."""

    def __init__(self, omics, label, event=None):
        self.omics = omics  # rows are patients, columns are features
        self.label = label
        self.event = event

    def __len__(self):
        return self.label.shape[0]

    def __getitem__(self, index):
        if torch.is_tensor(index):
            index = index.tolist()
        sample = {"x": self.omics[index], "label": self.label[index]}
        if self.event is not None:
            sample.update({"event": self.event[index]})
        # Convert every field of the sample to a tensor
        sample = {k: torch.as_tensor(v) for k, v in sample.items()}
        return sample
import zipfile

import numpy as np
from requests import get
from torch.utils.data import DataLoader

# download and extract the example data
url = "https://cirrus.universite-paris-saclay.fr/s/MwrixnSRYykEPNW/download"
with get(url, stream=True) as r:
    r.raise_for_status()
    with open("/tmp/data.zip", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
with zipfile.ZipFile("/tmp/data.zip", "r") as zfile:
    zfile.extractall("./")

X = np.load("AttOmics_DATA/rnaseq_train.npy")
Y = np.load("AttOmics_DATA/label_train.npy")
# create dataset
dataset = OmicsDataset(omics=X, label=Y)
# create dataloader
train_loader = DataLoader(dataset, batch_size=256, shuffle=True, drop_last=True)
You can repeat this for the other data splits.
Supported grouping strategies
You can add support for new grouping strategies. The function must have the following signature:
- Input arguments
  - in_features: int
  - proj_size: int
  - n_group: int
  - train_data: DataFrame = None
  - **kwargs
- Output
  - idx_in: List[Tensor]. Each element i of the list represents the features in group i.
  - group_name: List[str]. Names of the different groups.
  - grp_proj_dim: List[List[int]]. The dimensions used to encode each group in the gFCN module.
If you add support for a new grouping strategy, please update the GeneGroupCreation dictionary to register your method.
GeneGroupCreation.update({"my_new_method": new_method_fun})
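As an illustration of the signature described above, here is a sketch of a grouping function that splits the features into contiguous, roughly equal-sized groups. Several points are assumptions rather than documented behavior: the function name `contiguous_groups`, the choice to return the three outputs as a tuple in the listed order, and the use of a single encoding dimension `[proj_size]` per group. The `GeneGroupCreation` dictionary below is a local stand-in so the example runs on its own; in practice you would import the real one from AttOmics.

```python
from typing import List, Tuple

import torch
from torch import Tensor


def contiguous_groups(
    in_features: int,
    proj_size: int,
    n_group: int,
    train_data=None,
    **kwargs,
) -> Tuple[List[Tensor], List[str], List[List[int]]]:
    # Split feature indices 0..in_features-1 into n_group contiguous chunks;
    # torch.tensor_split handles sizes that do not divide evenly.
    idx_in = list(torch.tensor_split(torch.arange(in_features), n_group))
    group_name = [f"group_{i}" for i in range(n_group)]
    # Assumption: each group is encoded with one layer of size proj_size.
    grp_proj_dim = [[proj_size] for _ in range(n_group)]
    return idx_in, group_name, grp_proj_dim


# Stand-in registry (hypothetical); use the real GeneGroupCreation instead.
GeneGroupCreation = {}
GeneGroupCreation.update({"contiguous": contiguous_groups})

idx_in, group_name, grp_proj_dim = GeneGroupCreation["contiguous"](
    in_features=10, proj_size=4, n_group=3
)
```

This strategy ignores `train_data`; a data-driven strategy (for example, clustering correlated features) would use it to decide which features belong together.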
Create your model
model = AttOmics(
    n_group=10,
    n_layers=1,
    num_heads=1,
    attention_norm="layer_norm",
    grouping_method="random",
    head_norm="layer_norm",
    sa_residual_connection=True,
    head_residual_connection=False,
    head_dropout=0.0,
    head_batch_norm=False,
    reuse_grp=True,
    constant_group_size=False,
    head_input_dim=500,
    head_hidden_ratio=[0.5],
    input_dim=X.shape[1],  # a dict of dimension
    num_classes=n_class,
    label_type="cancer_type",
    class_weights=class_weights,
    train_data=X,
    optimizer_init=optimizer,
    scheduler_init=lr_scheduler,
)
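The call above uses `n_class` and `class_weights` without defining them. One common way to derive them from an integer-encoded label vector is inverse-frequency ("balanced") weighting, sketched below with NumPy. This is an assumption about what the model expects, not something this README specifies: the expected type of `class_weights` (array, tensor, or list) should be checked against the AttOmics source, and the label vector `Y` here is toy data.

```python
import numpy as np

# Toy integer-encoded labels (hypothetical); in practice use the loaded Y.
Y = np.array([0, 0, 0, 1, 1, 2])

classes, counts = np.unique(Y, return_counts=True)
n_class = len(classes)

# Balanced weighting: rare classes receive proportionally larger weights.
class_weights = Y.shape[0] / (n_class * counts)
print(n_class, class_weights)
```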
Training a model
We use pytorch_lightning to train our models. To train a model, first set up a Trainer.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import MLFlowLogger

trainer = Trainer(
    gpus=[0],
    logger=MLFlowLogger(experiment_name="AttOmics", save_dir="./logs"),
)
Fit the model on the training set:
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
Now you can evaluate your model on the test set:
trainer.test(model, test_dataloader=test_loader)
Authors
AttOmics was developed by:
- Aurélien Beaude
- Milad R. Vahid
- Franck Augé
- Farida Zehraoui
- Blaise Hanczar
License
AttOmics is licensed under the GNU GPL, version 3 or (at your option) any later version. AttOmics is Copyright (2023-) by the authors.