GAN-based Data Augmentation for Transcriptomics
Publication in Bioinformatics: https://doi.org/10.1093/bioinformatics/btad239
Citation
If you use our work, please use the following citation:
Alice Lacan, Michèle Sebag, Blaise Hanczar, GAN-based data augmentation for transcriptomics: survey and comparative assessment, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i111–i120, https://doi.org/10.1093/bioinformatics/btad239
Requirements
Install the required Python libraries:
pip install -r requirements.txt
Data
Data comes from The Cancer Genome Atlas (TCGA):
- About TCGA
- R package to retrieve data
To preprocess the TCGA data, go to the data folder.
Reproducibility: How to run the code?
1. TCGA Data
The gene expression and clinical data are retrieved from The Cancer Genome Atlas (TCGA), using the RTCGA package.
To retrieve the raw data (release date 2015-11-01), run the following script:
rtcga_rnaseq.R
CAREFUL: All downloaded cohort files should be placed in the tcga_files folder.
To build the train and test datasets, run the following script:
python main_preprocessing.py
[ALTERNATIVE]: Unzip ./data/tcga_files/train_test_csv.zip, which contains both the processed train and test dataframes.
2. WGAN-GP
Run the following script to train and test the WGAN-GP generator on TCGA data:
python main.py -model 'wgan' -path_df_results './results/results_wgan.csv' -gpu_device 'cuda:0'
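For reference, the defining feature of WGAN-GP is a gradient penalty computed on random interpolates between real and fake batches (Gulrajani et al., 2017). A minimal PyTorch sketch of that penalty term follows; the critic, feature dimension, and batches below are illustrative toys, not the repository's actual architecture:

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake):
    # Random interpolation points between real and fake samples
    eps = torch.rand(real.size(0), 1)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    # Gradient of critic scores w.r.t. the interpolates
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    # Penalise deviation of the gradient norm from 1
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Toy critic and batches (16 "genes", batch of 8)
critic = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
real = torch.randn(8, 16)
fake = torch.randn(8, 16)
gp = gradient_penalty(critic, real, fake)
```

In training, this penalty is added (scaled by a coefficient, commonly 10) to the critic's Wasserstein loss.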
3. AttGAN
Run the following script to train and test the AttGAN generator on TCGA data:
python main.py -model 'attgan' -path_df_results './results/results_attgan.csv' -gpu_device 'cuda:0'
Models
We performed a comparative assessment of:
- Generative Adversarial Networks (GANs) (Ian J. Goodfellow et al., 2014)
- Wasserstein GANs with Gradient Penalty (WGAN-GP) (I. Gulrajani et al., 2017)
- Attention + WGAN-GP (AttGAN), one of our contributions
Scripts of the different GANs can be found in the models folder.
Metrics
To assess the quality of the generated expression data, we evaluated it in both a supervised and an unsupervised manner.
Scripts of these metrics can be found in the metrics folder.
Supervised performance indicators
- Reverse validation: the performance (accuracy) on real test data of a classifier trained only on generated data
- Data Augmentation: the performance (accuracy) of a classifier trained on n true samples and k generated samples
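The reverse-validation idea — train on generated data only, then evaluate on held-out real data — can be sketched with toy stand-in data and a nearest-centroid classifier (both the data and the classifier here are illustrations; the paper uses an MLP on real TCGA profiles):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n_per_class, shift, rng):
    """Toy two-class 'expression' data: class 1 has shifted means."""
    X = np.vstack([rng.normal(0, 1, (n_per_class, 20)),
                   rng.normal(shift, 1, (n_per_class, 20))])
    y = np.repeat([0, 1], n_per_class)
    return X, y

X_gen, y_gen = sample(200, 2.0, rng)    # stand-in for generated data
X_real, y_real = sample(200, 2.0, rng)  # stand-in for held-out real data

# Train a nearest-centroid classifier on generated data only...
centroids = np.stack([X_gen[y_gen == c].mean(axis=0) for c in (0, 1)])

# ...and score it on real data: this is the reverse-validation accuracy.
dists = np.linalg.norm(X_real[:, None, :] - centroids[None, :, :], axis=2)
acc = (dists.argmin(axis=1) == y_real).mean()
```

A high accuracy indicates the generator captured class-discriminative structure; the data-augmentation indicator instead mixes n true and k generated samples into the training set.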
Unsupervised performance indicators
- Correlation score (Vinas et al., 2022)
- Precision and Recall (Kynkäänniemi et al., 2019)
- Fréchet Distance (FD) (Heusel et al., 2017)
- Adversarial accuracy (AA) (Yale et al., 2020)
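Of these, the Fréchet Distance has a closed form under a Gaussian assumption on the two sample sets — the same formula used by FID (Heusel et al., 2017): the squared mean difference plus a covariance term. A self-contained NumPy/SciPy sketch on toy data (real expression matrices would be substituted for the random arrays):

```python
import numpy as np
from scipy import linalg

def frechet_distance(X, Y):
    """FD between two sample sets under a Gaussian assumption."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    covmean = linalg.sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # drop tiny imaginary numerical residue
        covmean = covmean.real
    diff = mu_x - mu_y
    return diff @ diff + np.trace(cov_x + cov_y - 2 * covmean)

rng = np.random.default_rng(0)
real = rng.normal(0, 1, (500, 10))
fake_good = rng.normal(0, 1, (500, 10))  # matches the real distribution
fake_bad = rng.normal(3, 1, (500, 10))   # mean-shifted distribution
```

As expected, `frechet_distance(real, fake_good)` is far smaller than `frechet_distance(real, fake_bad)`: a lower FD means the generated distribution is closer to the real one.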
Results
Our experiments show a significant gain in classification accuracy when a Multilayer Perceptron (MLP) is trained on a reduced amount of true data (from 50 to 3000 samples) plus augmented data, compared with training on the true data alone.
All baseline results correspond to the best classifier accuracy obtained after a grid search, for both binary cancer yes/no classification and tissue classification, given a number of true training samples. The results with data augmentation are obtained by training an MLP with a fixed architecture and hyper-parameters, i.e. without any model tuning. The results presented in our paper can be found in the results folder.
Here are the test accuracy results when training an MLP with a given number of true samples plus 8000 additional samples generated by the corresponding generative model (our best GAN, WGAN-GP, or AttGAN). We observe a significant gain in the low-sample regime, with binary accuracy jumping from ~94% to ~98% and tissue-type accuracy from ~70% to ~93% with only 50 true training examples.
Here is a UMAP representation of real (left) and fake (right) data generated by our best AttGAN. Colors highlight different tissue types.