
GAN-based Data Augmentation for Transcriptomics

Publication in Bioinformatics: https://doi.org/10.1093/bioinformatics/btad239

Citation

If you use our work, please use the following citation:

Alice Lacan, Michèle Sebag, Blaise Hanczar, GAN-based data augmentation for transcriptomics: survey and comparative assessment, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i111–i120, https://doi.org/10.1093/bioinformatics/btad239

Requirements

Install the required Python libraries:

pip install -r requirements.txt

Data

Data comes from The Cancer Genome Atlas (TCGA).

To preprocess the TCGA data, go to the data folder.

Reproducibility: How to run the code?

1. TCGA Data

The gene expression and clinical data are retrieved from The Cancer Genome Atlas (TCGA), using the RTCGA package. To retrieve the raw data (release date of 2015-11-01), run the following script:

rtcga_rnaseq.R

CAREFUL: All downloaded cohort files should be placed in the tcga_files folder. To build the train and test datasets, run the following script:

python main_preprocessing.py

[ALTERNATIVE]: Unzip the ./data/tcga_files/train_test_csv.zip containing both the train and test processed dataframes.
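For orientation, here is a minimal sketch of the kind of preprocessing such a pipeline typically performs (log transform of counts, shuffled train/test split). It uses a small synthetic matrix rather than the actual TCGA cohort files, and the variable names are illustrative, not taken from main_preprocessing.py:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an RNA-seq count matrix: 100 samples x 50 genes.
counts = rng.poisson(lam=20.0, size=(100, 50)).astype(float)
labels = rng.integers(0, 2, size=100)  # e.g. cancer yes/no

# Common transcriptomics transform: log2(1 + counts).
expr = np.log2(1.0 + counts)

# Shuffled 80/20 train/test split, analogous to the repo's train/test CSVs.
idx = rng.permutation(len(expr))
split = int(0.8 * len(expr))
train_idx, test_idx = idx[:split], idx[split:]
X_train, y_train = expr[train_idx], labels[train_idx]
X_test, y_test = expr[test_idx], labels[test_idx]

print(X_train.shape, X_test.shape)  # (80, 50) (20, 50)
```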

2. WGAN-GP

Run the following script to train and test the WGAN-GP generator on TCGA data:

python main.py -model 'wgan' -path_df_results './results/results_wgan.csv' -gpu_device 'cuda:0'
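The defining ingredient of WGAN-GP is the gradient penalty, which pushes the critic's gradient norm toward 1 on interpolates between real and generated batches. Below is a standard PyTorch sketch of that term (not the repo's exact implementation; the toy critic and tensor sizes are illustrative):

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP penalty: mean of (||grad critic(x_hat)||_2 - 1)^2
    over random interpolates x_hat between real and fake samples."""
    alpha = torch.rand(real.size(0), 1)  # per-sample mixing weight
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    out = critic(x_hat)
    grads = torch.autograd.grad(
        outputs=out, inputs=x_hat,
        grad_outputs=torch.ones_like(out),
        create_graph=True,
    )[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Toy critic over 10 "genes" to show the call shape; the penalty is
# added to the critic loss, weighted by a lambda hyper-parameter.
critic = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
real = torch.randn(8, 10)
fake = torch.randn(8, 10)
gp = gradient_penalty(critic, real, fake)
```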

3. AttGAN

Run the following script to train and test the AttGAN generator on TCGA data:

python main.py -model 'attgan' -path_df_results './results/results_attgan.csv' -gpu_device 'cuda:0'

Models

We performed a comparative survey of several GAN architectures. Scripts of the different GANs can be found in the models folder.

Metrics

To assess our generated expression data quality, we evaluated the data in a supervised and unsupervised manner. Scripts of these metrics can be found in the metrics folder.

Supervised performance indicators

  • Reverse validation: the performance (accuracy) of a classifier trained only on generated data
  • Data Augmentation: the performance (accuracy) of a classifier trained on n true samples and k generated samples
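The reverse-validation idea above can be sketched in a few lines: fit a classifier only on generated samples, then score it on held-out real samples. This toy version uses synthetic Gaussians as stand-ins for real and GAN-generated expression data (the classifier and data are illustrative, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shift):
    # Two Gaussian classes in 20 dimensions, separated along the first axis.
    X = rng.normal(size=(n, 20))
    y = rng.integers(0, 2, size=n)
    X[:, 0] += shift * (2 * y - 1)
    return X, y

X_real, y_real = make_data(200, shift=2.0)  # stand-in for true test data
X_gen, y_gen = make_data(200, shift=2.0)    # stand-in for GAN samples

# Reverse validation: train ONLY on generated data, evaluate on real data.
clf = LogisticRegression().fit(X_gen, y_gen)
acc = clf.score(X_real, y_real)
```

High accuracy here suggests the generated samples preserve the class structure of the real data; for the data-augmentation indicator, one would instead concatenate n true and k generated samples before fitting.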

Unsupervised performance indicators

Results

Our experiments show a significant gain in classification accuracy from training a Multilayer Perceptron (MLP) on a reduced amount of true data (varying from 50 to 3000 samples) to training an MLP with additional augmented data. All the baseline results correspond to the best classifier accuracy obtained after a grid search for binary cancer yes/no classification and tissue classification for a given number of true training samples. The results with data augmentation are obtained after training an MLP with a fixed architecture and hyper-parameters, i.e. without any model tuning. Results presented in our paper can be found in the results folder.

Here are the test accuracy results when training an MLP with a given number of true samples and 8000 additional samples generated by the corresponding generative model (either our best GAN, WGAN-GP or AttGAN). We observe a significant gain in the low-sample regime, with binary accuracy jumping from ~94% to ~98% and the tissue type accuracy jumping from ~70% to ~93% with only 50 training examples.

Here is a UMAP representation of real (left) and fake (right) data generated by our best AttGAN. Colors highlight different tissue types.