
rna-diffusion

This repo holds the scripts for the data generation pipeline with diffusion models. Generation is performed on landmark genes and reconstruction on target genes.

9b02626b add tcga preprocessing · by Alice LACAN

Gene Expression Generation with Diffusion Models


(Repo under construction...)

Pipeline overview for generating the L1000 landmark genes and reconstructing the full transcriptome.


PCA visualization of the generation process by our diffusion model on GTEx data (colors highlight the different tissue types).

Requirements

Be careful! Before installing these libraries, make sure that you have created an environment dedicated to this project and that your NVIDIA driver is compatible with the PyTorch version required by this project. If it is not, change the PyTorch version to one that matches your driver.

To install the required Python libraries:

pip install -r requirements.txt
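After installation, a quick check can confirm that the installed PyTorch build matches the local GPU setup. The snippet below is a minimal sketch and not part of the repository; the helper name `describe_torch_install` is hypothetical, and it only relies on standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`).

```python
def describe_torch_install():
    """Return a small dict describing the local PyTorch / CUDA setup.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the current environment.
        return {"installed": False}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and a compatible driver are found.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_install())
```

If `cuda_available` is `False` on a machine with a GPU, the driver and the CUDA build of PyTorch most likely do not match, which is the situation the warning above describes.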

RNA-seq datasets

The first dataset is the Genotype-Tissue Expression project (GTEx Analysis V8 release):

The second dataset is the Cancer Genome Atlas (TCGA):

The preprocessing of the data is located in the data folder.

Deep generative models

The baseline generative models are the following:

The diffusion models investigated are the following:

The scripts for the different models can be found in the ./src/generation folder.

Metrics

To assess the quality of the generated expression data, we evaluated it in both a supervised and an unsupervised manner. The scripts for these metrics can be found in the metrics folder.

Supervised performance indicators

  • Reverse validation: the performance (accuracy) of a classifier trained only on generated data
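As an illustration of reverse validation, the sketch below trains a classifier on generated samples only and scores it on held-out real samples. It is not the repository's implementation: the nearest-centroid classifier, the toy Gaussian data, the evaluation on real data, and all function names are assumptions made for this example.

```python
import numpy as np


def fit_centroids(X, y):
    """Compute one mean expression profile (centroid) per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(X, classes, centroids):
    """Assign each sample to the class with the closest centroid."""
    # Squared Euclidean distance from every sample to every centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]


def reverse_validation_accuracy(X_gen, y_gen, X_real, y_real):
    """Train on generated data only, then measure accuracy on real data."""
    classes, centroids = fit_centroids(X_gen, y_gen)
    return float((predict(X_real, classes, centroids) == y_real).mean())


def make_toy_data(rng, n_per_class, n_genes=20):
    """Two well-separated toy 'tissue types' (stand-in for expression data)."""
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(n_per_class, n_genes)),
        rng.normal(3.0, 1.0, size=(n_per_class, n_genes)),
    ])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


rng = np.random.default_rng(0)
X_gen, y_gen = make_toy_data(rng, n_per_class=100)   # plays the role of generated data
X_real, y_real = make_toy_data(rng, n_per_class=50)  # plays the role of real test data
accuracy = reverse_validation_accuracy(X_gen, y_gen, X_real, y_real)
print(f"reverse validation accuracy: {accuracy:.3f}")
```

A high score suggests that the generated samples carry the class-discriminative signal of the real data; any classifier can be substituted for the nearest-centroid rule used here.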

Unsupervised performance indicators

Results

Test classification accuracy using only landmark genes (orange) or the full transcriptome (blue). The baseline accuracy obtained with true data is displayed on the rightmost panel. We observe that DDIM and DDPM achieve higher accuracy than the VAE and WGAN-GP in the reduced-dimension setting (L1000 landmark genes).

Reproducibility (how to use the code)

To train one of the generative models (VAE, WGAN-GP and DDIM) and to reconstruct the full transcriptome from the landmark genes (L1000), with either regression or an MLP, please refer to the commands in the following bash scripts:

bash run_main.sh
bash run_main_vae.sh
bash run_main_gan.sh
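
To make the reconstruction step more concrete, the following sketch fits an ordinary least-squares regression from landmark genes to target genes and then concatenates both into a full profile. It is only an illustration under stated assumptions: the synthetic data, the gene counts, and the function names are hypothetical, and the repository's own regression or MLP code may differ.

```python
import numpy as np


def add_intercept(X):
    """Append a column of ones so the regression learns a per-gene offset."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_linear_reconstruction(X_landmark, Y_target):
    """Least-squares weights mapping landmark genes to target genes."""
    W, *_ = np.linalg.lstsq(add_intercept(X_landmark), Y_target, rcond=None)
    return W


def reconstruct(X_landmark, W):
    """Predict target gene expression from landmark gene expression."""
    return add_intercept(X_landmark) @ W


# Synthetic stand-in data (assumption): 10 landmark and 30 target genes.
rng = np.random.default_rng(0)
n_samples, n_landmark, n_target = 200, 10, 30
X = rng.normal(size=(n_samples, n_landmark))
true_W = rng.normal(size=(n_landmark, n_target))
Y = X @ true_W + 0.5 + 0.01 * rng.normal(size=(n_samples, n_target))

# Fit on the first 150 samples, reconstruct the remaining 50.
W = fit_linear_reconstruction(X[:150], Y[:150])
Y_hat = reconstruct(X[150:], W)
mse = float(np.mean((Y_hat - Y[150:]) ** 2))

# Full profile = landmark genes followed by reconstructed target genes.
full_profiles = np.hstack([X[150:], Y_hat])
print(f"held-out MSE: {mse:.6f}, full profile shape: {full_profiles.shape}")
```

In the pipeline described above, the landmark matrix passed to `reconstruct` would hold generated L1000 profiles rather than held-out real samples.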