a2e8b265 Add license · by Constance Creux

MMnc: Multi-modal representation for non-coding RNA class prediction and annotation

Datasets

Three datasets are available in the folder data:

  • dataset1 is based on (Fiannaca et al., 2017), but some sequences have been removed from the test set due to data leakage (in the original dataset, they were also present in the training set).
  • dataset2 is based on (Lima et al., 2023).
  • dataset3 is a novel dataset in which four classes of ncRNAs are represented (lncRNAs, miRNAs, snoRNAs and snRNAs). Data is collected from Xena and Ensembl.
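The leakage fix described for dataset1 amounts to dropping test entries whose sequence also occurs in the training set. The snippet below is a minimal sketch of that idea, not the procedure actually used to build the dataset: the function name remove_leaked is hypothetical, and treating sequences as equal after upper-casing and mapping T to U is an assumption.

```python
def remove_leaked(train_seqs, test_seqs):
    """Return the test entries whose sequence does not occur in the training set.

    Both arguments map an ncRNA identifier to its sequence.
    """
    def normalise(seq):
        # Assumption: compare case-insensitively and treat T and U as the same base.
        return seq.upper().replace("T", "U")

    seen = {normalise(seq) for seq in train_seqs.values()}
    return {
        name: seq
        for name, seq in test_seqs.items()
        if normalise(seq) not in seen
    }


train = {"a": "ACGU"}
test = {"x": "acgt", "y": "GGCC"}
clean_test = remove_leaked(train, test)  # "x" duplicates "a" and is dropped
```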

Each dataset is split into a training and a test set.

Files ending with _labels.csv give the label associated with each ncRNA. Files ending with _sequence.fasta contain the sequences in FASTA format. Files ending with _structure.dot contain secondary structures in dot-bracket notation, predicted using MXfold2. Finally, for dataset3 only, expression data is available in dictionary format: keys are ncRNA identifiers and values are lists of their normalized expression levels across the 19,131 conditions in the dataset.
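To make the two text formats concrete, here is a small sketch that parses FASTA records and converts a dot-bracket string into base-pair indices. It is an illustration only and is independent of the loaders in this repository; the helper names read_fasta and dot_bracket_pairs are hypothetical, and the exact layout of the _structure.dot files is not assumed here.

```python
def read_fasta(lines):
    """Parse an iterable of FASTA lines into {identifier: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-separated token of the header as identifier.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            # Sequences may span several lines; concatenate them.
            records[name] += line
    return records


def dot_bracket_pairs(structure):
    """Return sorted 0-based (i, j) pairs for matching '(' and ')' characters."""
    stack = []
    pairs = []
    for i, char in enumerate(structure):
        if char == "(":
            stack.append(i)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        # Any other character (such as '.') is treated as unpaired.
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


sequences = read_fasta([">r1 example", "ACG", "U", ">r2", "GG"])
pairs = dot_bracket_pairs("((..))")
```

With an actual file, read_fasta can be given an open file handle directly, since it only iterates over lines.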

Source code

The folder src contains model files: Sequence.py, Structure.py, and Expression.py are modality encoders. MMnc.py is the global model, which calls the necessary modality encoders. MMDataset.py is used to create multi-modal datasets.
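As a rough illustration of what building a multi-modal dataset involves, the sketch below joins per-modality tables on the ncRNA identifier. It does not reproduce MMDataset.py: the function align_modalities is hypothetical, and representing a missing modality as None is an assumption about how absent data might be handled.

```python
def align_modalities(labels, **modalities):
    """Join labels and per-modality tables on the ncRNA identifier.

    labels maps identifier -> class label; each keyword argument maps
    identifier -> that modality's data. Identifiers absent from a modality
    get None for it (an assumption made for this sketch).
    """
    samples = []
    for identifier in sorted(labels):
        inputs = {
            modality: table.get(identifier)
            for modality, table in modalities.items()
        }
        samples.append(
            {"id": identifier, "label": labels[identifier], "inputs": inputs}
        )
    return samples


labels = {"r1": "miRNA", "r2": "snoRNA"}
samples = align_modalities(
    labels,
    sequence={"r1": "ACGU", "r2": "GG"},
    expression={"r1": [0.1, 0.2]},  # r2 has no expression data
)
```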

The folder utils contains several utility functions for data preparation, model training and prediction, and visualization.

The file main.py can be run to train and evaluate the MMnc model, as detailed below.

Usage

The code has been tested on a Linux machine with GPU acceleration, using Python 3.10. The main dependencies are torch, torch_geometric, and transformers. The conda environment can be reproduced with conda env create -f environment.yml, followed by conda activate mmnc.

To train and evaluate the model, run main.py with the chosen dataset and modalities, for example: python main.py --dataset dataset3 --modalities sequence structure expression

Several other arguments can be passed to the model; their descriptions can be obtained with python main.py -h
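For readers unfamiliar with multi-value flags, the following sketch shows how a command line of this shape could be parsed with argparse. Only the flag names --dataset and --modalities come from the example command; the allowed values, the default, and the help strings are assumptions, and the parser in main.py may be defined differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the two flags used in the example command."""
    parser = argparse.ArgumentParser(description="Train and evaluate a model.")
    parser.add_argument(
        "--dataset",
        required=True,
        choices=["dataset1", "dataset2", "dataset3"],  # assumed allowed values
        help="name of the dataset folder to use",
    )
    parser.add_argument(
        "--modalities",
        nargs="+",  # one or more space-separated values
        choices=["sequence", "structure", "expression"],  # assumed allowed values
        default=["sequence"],  # assumed default
        help="modalities to feed to the model",
    )
    return parser


# Parse the example command's arguments from an explicit list.
args = build_parser().parse_args(
    ["--dataset", "dataset3", "--modalities", "sequence", "structure", "expression"]
)
```

Because nargs="+" collects every value after the flag, args.modalities is a list, so a model can loop over it to select which encoders to build.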