Gene Expression Generation with Diffusion Models
(Repo under construction...)
Pipeline overview for generating the L1000 landmark genes and reconstructing the full transcriptome.
PCA visualization of the generation process by our diffusion model on GTEx data (colors highlight the different tissue types).
Requirements
Be careful! Before installing these libraries, make sure that you have created an environment dedicated to this project and that your NVIDIA driver is compatible with the PyTorch version installed by these requirements. Otherwise, adapt the PyTorch version to your setup.
To install the required Python libraries:
pip install -r requirements.txt
RNA-seq datasets
The first dataset is the Genotype-Tissue Expression project (GTEx Analysis V8 release):
- About GTEx
- GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz to retrieve data
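As a rough sketch of how the GTEx TPM file could be read (this is not the repository's preprocessing code): a GCT file starts with a version line and a dimensions line, followed by a tab-separated table whose first two columns are `Name` and `Description`. The helper name `load_gct` and the choice of pandas are assumptions made for illustration.

```python
import pandas as pd


def load_gct(path_or_buffer):
    """Read a GCT expression table into a (samples x genes) DataFrame.

    Hypothetical helper for illustration, not part of the repository.
    """
    # Skip the two GCT header lines: the version line ("#1.2")
    # and the "n_genes<TAB>n_samples" dimensions line.
    df = pd.read_csv(path_or_buffer, sep="\t", skiprows=2, index_col="Name")
    # Drop the gene-symbol column so that only numeric TPM values remain.
    df = df.drop(columns=["Description"])
    # Transpose so that each row is a sample and each column is a gene.
    return df.T
```

Calling `load_gct("GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz")` should work directly, since pandas infers gzip compression from the file extension. The full matrix is large, so loading it may need a fair amount of memory.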
The second dataset is the Cancer Genome Atlas (TCGA):
- About TCGA
- R package to retrieve data
The preprocessing of the data is handled in the data folder.
Deep generative models
The baseline generative models are the following:
- Variational Autoencoder (VAE) (Kingma and Welling, 2014)
- Wasserstein GAN with Gradient Penalty (WGAN-GP) (Arjovsky et al., 2017)
The diffusion models investigated are the following:
- Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020)
- Denoising Diffusion Implicit Model (DDIM) (Song et al., 2021)
Scripts for the different models can be found in the folder ./src/generation.
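For readers unfamiliar with diffusion models, the forward (noising) process of DDPM has a closed form, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, and the network is trained to predict eps. Below is a minimal NumPy sketch of that single step, using the linear beta schedule from Ho et al. (2020). The function names, the number of steps and the toy array shapes are illustrative assumptions; the scripts in ./src/generation may be organised differently.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (default values as in Ho et al., 2020)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) for a batch of timesteps t.

    Returns the noised samples and the noise that was added
    (the usual regression target for the denoising network).
    """
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t][:, None]  # shape (batch, 1), broadcast over genes
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps


betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)

rng = np.random.default_rng(0)
# Toy batch standing in for standardized expression profiles;
# 978 is the usual number of L1000 landmark genes (assumed here).
x0 = rng.standard_normal((4, 978))
t = rng.integers(0, len(betas), size=4)  # one random timestep per sample
x_t, eps = q_sample(x0, t, alpha_bar, rng)
```

DDIM uses the same training objective and differs in how samples are drawn at generation time, so this forward step applies to both models.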
Metrics
To assess the quality of the generated expression data, we evaluate it with both supervised and unsupervised metrics.
Scripts for these metrics can be found in the metrics folder.
Supervised performance indicators
- Reverse validation: the performance (accuracy) of a classifier trained only on generated data
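Reverse validation can be sketched as follows: fit a classifier on generated samples only, then report its accuracy on held-out real samples. The nearest-centroid classifier and the toy data below are stand-ins chosen to keep the example small; they are assumptions, not the classifier or data used in the metrics folder.

```python
import numpy as np


def fit_centroids(x, y):
    """Compute one mean vector per class label."""
    classes = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(x, classes, centroids):
    """Assign each sample to the class of its closest centroid."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[d.argmin(axis=1)]


def reverse_validation_accuracy(x_gen, y_gen, x_real, y_real):
    """Train on generated data only, evaluate on real data."""
    classes, centroids = fit_centroids(x_gen, y_gen)
    return float((predict(x_real, classes, centroids) == y_real).mean())


rng = np.random.default_rng(0)
means = np.array([0.0, 5.0])  # one mean per toy "tissue"


def sample(n):
    """Toy labelled data: two well-separated Gaussian classes."""
    y = rng.integers(0, 2, size=n)
    x = rng.standard_normal((n, 10)) + means[y][:, None]
    return x, y


x_gen, y_gen = sample(200)    # stands in for generated samples
x_real, y_real = sample(100)  # stands in for real test samples
acc = reverse_validation_accuracy(x_gen, y_gen, x_real, y_real)
```

If the generated data captures the class structure of the real data, this accuracy should approach the baseline obtained by training on real data.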
Unsupervised performance indicators
- Correlation score (Vinas et al., 2022)
- Precision and Recall (Kynkäänniemi et al., 2019)
- Frechet Distance (FD) (Heusel et al., 2018)
- Adversarial accuracy (AA) (Yale et al., 2020)
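As an illustration of one of these metrics, the adversarial accuracy can be read as the accuracy of a leave-one-out nearest-neighbour classifier that tries to tell real from generated samples. A value near 0.5 means the two sets are hard to distinguish, a value near 1 means they are easy to separate, and a value near 0 suggests the generator is copying real samples. The sketch below follows that reading of Yale et al. (2020) with Euclidean distances; the function names are hypothetical and the implementation in the metrics folder may differ.

```python
import numpy as np


def _nn_dist(a, b, exclude_self=False):
    """Distance from each row of a to its nearest neighbour in b."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
    if exclude_self:
        # When a and b are the same set, ignore the zero self-distance.
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)


def adversarial_accuracy(real, fake):
    """Nearest-neighbour adversarial accuracy between two sample sets."""
    d_rr = _nn_dist(real, real, exclude_self=True)
    d_ff = _nn_dist(fake, fake, exclude_self=True)
    d_rf = _nn_dist(real, fake)
    d_fr = _nn_dist(fake, real)
    # A sample is "correctly classified" when its nearest neighbour
    # from the other set is farther away than the one from its own set.
    return float(0.5 * ((d_rf > d_rr).mean() + (d_fr > d_ff).mean()))


rng = np.random.default_rng(0)
real = rng.standard_normal((50, 5))
aa_separated = adversarial_accuracy(real, real + 100.0)  # easy to tell apart
aa_copied = adversarial_accuracy(real, real.copy())      # exact copies
```

The pairwise distance matrix is built in memory, so for large sample sets a chunked or tree-based nearest-neighbour search would be preferable.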
Results
Test classification accuracy using only landmark genes (orange) or the full transcriptome (blue). The baseline accuracy obtained with true data is displayed on the rightmost panel. We observe that DDIM and DDPM achieve higher accuracy than the VAE and WGAN-GP in reduced dimensions (L1000 genes).
Reproducibility (how to use the code)
To train one of the generative models (VAE, WGAN-GP or DDIM) and reconstruct the full transcriptome from the landmark genes (L1000), using either regression or an MLP, please refer to the commands in the following bash scripts:
bash run_main.sh
bash run_main_vae.sh
bash run_main_gan.sh
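To give an idea of the regression-based reconstruction step mentioned above, here is a minimal sketch that fits an ordinary least-squares map from landmark genes to the remaining genes and applies it to new profiles. The function names, the toy dimensions and the use of plain least squares are assumptions for illustration; the actual scripts may use a different regression variant or an MLP.

```python
import numpy as np


def _add_intercept(x):
    """Append a column of ones so the model can learn a bias term."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


def fit_linear_reconstruction(x_landmark, y_full):
    """Least-squares weights mapping landmark genes to target genes."""
    w, *_ = np.linalg.lstsq(_add_intercept(x_landmark), y_full, rcond=None)
    return w


def reconstruct(x_landmark, w):
    """Predict target-gene expression from landmark-gene expression."""
    return _add_intercept(x_landmark) @ w


rng = np.random.default_rng(0)
n_landmark, n_target = 5, 12  # toy sizes, far smaller than real data
true_a = rng.standard_normal((n_landmark, n_target))
true_b = rng.standard_normal(n_target)

# Toy "real" training data with an exactly linear relationship.
x_train = rng.standard_normal((200, n_landmark))
y_train = x_train @ true_a + true_b
w = fit_linear_reconstruction(x_train, y_train)

# Apply the fitted map to new (e.g. generated) landmark profiles.
x_new = rng.standard_normal((3, n_landmark))
y_new = reconstruct(x_new, w)
```

In practice the map would be fitted on real training samples and then applied to generated landmark profiles to obtain full transcriptomes.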