Louis BECQUEY

Updated Docker and documentation

results_*
results/
build_BiORSEO_docker_image_ubuntu18.sh
deploy_BiORSEO_docker_image_linux.sh
INSTALL.md
Readme.md
benchmark_results/
*.gz
*.pickle
log_of_the_run.sh
\ No newline at end of file
Changelog
=========================
### Biorseo 2.1 (Nov 2021)
This is an official, tested, release of Biorseo 2:
- replacing Nupack's dynamic programming scheme, which supports simple pseudoknots, with ViennaRNA's window-based scheme, which does not support pseudoknots or long-distance contacts but allows testing much longer sequences,
- supporting RINs with no issues,
- supporting custom modules in JSON format (to be detected in sequences by regular expressions), thanks to Nathalie Bernard,
- no longer running Jar3d or BayesPairing for you. This greatly simplifies code management (the pipeline is replaced by the C++ tool alone). Jar3d is getting older, does not support very complex modules, and is biased because it takes loops (not the whole sequence) as input. Therefore, you have to give Biorseo the answer as input! BayesPairing 2.0 is itself evolving into a tool that places modules in secondary structures while taking energies (and now comparative information) into account, so it makes no sense to include it *within* Biorseo. The approaches should be compared and benchmarked instead. You can still use the outputs of these tools as input for Biorseo if you like,
- introducing the MFE criterion (thanks to Nathalie Bernard),
- introducing the Biokop-mode,
- with a much simpler and lighter installation process.
Biorseo 2.1 is available as a Docker container and as a Git branch called "biorseo2".
It is the last version supported by Louis Becquey.
### Biorseo 2.0
This was an unofficial, unsupported and unpublished version after the internship of Lénaic Durand at IBISC.
This version
- replaces Nupack 3.2 with ViennaRNA to compute the pairing probabilities, thanks to Lénaic,
- introduces early support for BayesPairing 2.0, which was still unofficial too at the time,
- supports CaRNAval RINs,
- but has issues with the constraints to assert RIN basepairs are respected.
Results from this version are published in [Louis Becquey's thesis](https://tel.archives-ouvertes.fr/tel-03440181).
### Biorseo 1.2 (2019) and Biorseo 1.5 (2020)
These versions fixed numerical issues and brought other technical improvements.
Biorseo 1.2 is still available as a Docker image, and version 1.5 is available as a Git branch called 'biorseo1'.
### Biorseo 1.0 (2018)
This was the first version published for the paper [*Becquey et al. 2020*.](https://doi.org/10.1093/bioinformatics/btz962)
\ No newline at end of file
......@@ -8,7 +8,7 @@
FROM ubuntu:focal
# compiled biorseo
COPY . /biorseo/
COPY ./bin /workdir/
# Install runtime dependencies
RUN apt-get update -yq && \
......@@ -16,4 +16,5 @@ RUN apt-get update -yq && \
apt-get install -y libboost-program-options-dev libboost-filesystem-dev && \
rm -rf /var/lib/apt/lists/*
WORKDIR /biorseo
\ No newline at end of file
WORKDIR /workdir
ENTRYPOINT ["/workdir/biorseo"]
......
Option 1 : Installation using docker image (Windows, Mac, Linux)
==================================
* Clone this git repository : `git clone https://github.com/persalteas/biorseo.git` , or download the .zip archive from a BiORSEO release and extract it.
* Move into the repository ( `cd biorseo` )
### Install Docker:
* See the official instructions for your OS here : https://docs.docker.com/install/
......@@ -13,60 +11,40 @@ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubun
sudo apt update && sudo apt install docker-ce docker-ce-cli containerd.io
```
### Create a separate folder
You will need a folder that is mounted in the Docker container (shared between your filesystem and the container's filesystem). For example, create a `data` folder. This is where you put your FASTA input and your modules, and where Biorseo will write its output.
```
mkdir data
```
### Download and install the RNA motifs data files:
* Move your JSON-formatted or CSV-formatted files containing motifs in the folder.
* If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/modules/DESC`.
* Move your JSON-formatted or CSV-formatted files containing motifs into a `./data/JSON` or `./data/CSV` folder.
* If you use Rna3Dmotifs (easy, but outdated), you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc files from the `Non_Redundant_DESC` folder into `./data/DESC`. Alternatively, you can run Rna3Dmotifs' `catalog` program to build your own DESC module collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). In that case, you also need to move the final DESC files into `./data/DESC`.
* If you use CaRNAval (which is supposed to be a long-distance contact module dataset, not a SSE module dataset), use the script [scripts/Install_CaRNAval_RINs.py](scripts/Install_CaRNAval_RINs.py) :
`python3 Install_CaRNAval_RINs.py`. This will create a `/../data/modules/RIN/` folder (because the script is supposed to be run from the repo's `scripts` subfolder)
### Download the docker image from Docker Hub
`docker pull persalteas/biorseo:latest`
```
docker pull persalteas/biorseo:latest
```
### Run the docker image
Use the following command to run the docker image:
```
$ docker run
-v `pwd`/data/modules:/modules
-v `pwd`/data/fasta:/biorseo/data/fasta
-v `pwd`/results:/biorseo/results
persalteas/biorseo
yourexamplejobcommandhere
$ docker run -v `pwd`/data:/workdir/data persalteas/biorseo [optionshere]
```
You can replace \`pwd\` by the full path of the biorseo/ root folder. Here we launch the biorseo image with 4 volumes : A first to give BiORSEO access to the module files, a second to give it access to your input file(s), a third for your trained BayesPairing, and a last for it to output the result files of your job. Considering you place your input file 'MyFastaFile.fa' into the `data/fasta` folder, an example job command can be ` ./biorseo.py -i /biorseo/data/fasta/myFastaFile.fa --rna3dmotifs --func B`, so the full run command would be
You can replace \`pwd\`/data by the full path to your data folder. Assuming you place your input file 'MyFastaFile.fa' into the `data/fasta` folder, an example job command can be :
```
$ docker run -v `pwd`/data/modules:/modules -v `pwd`/data/fasta:/biorseo/data/fasta -v `pwd`/results:/biorseo/results persalteas/biorseo ./bin/biorseo -s /biorseo/data/fasta/applications.fa --descfolder /biorseo/data/modules/DESC --func B -v
$ docker run -v `pwd`/data:/workdir/data persalteas/biorseo -s data/fasta/MyFastaFile.fa --descfolder data/DESC --func B -v -o data/MyOutput.biorseo
```
Note that the paths to the input and output files are paths *inside the Docker container*, and those paths are mounted to folders of the host machine with -v options.
Note that the paths to the input and output files are paths *inside the Docker container*, and those paths are mounted to the data folder of the host machine with the -v option.
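The path mapping described above can be sketched in Python. This is a minimal illustration, not part of Biorseo: the helper name `build_docker_command` and its parameters are hypothetical, and only the image name, the mount point `/workdir/data` and the options `-s`, `--descfolder`, `--func`, `-v` and `-o` come from the example command above.

```python
import shlex
from pathlib import Path, PurePosixPath


def build_docker_command(host_data_dir, fasta_rel, desc_rel, output_rel, func="B"):
    """Return the argv list for a containerized Biorseo run.

    host_data_dir is a folder on the host; the three *_rel arguments are
    paths relative to that shared folder. (Hypothetical helper.)
    """
    # Docker needs an absolute host path on the left-hand side of -v.
    host_data = Path(host_data_dir).resolve()
    # Inside the container, the working directory is /workdir, so the
    # mounted folder is reachable through the relative path "data".
    container_data = PurePosixPath("data")
    return [
        "docker", "run",
        "-v", f"{host_data}:/workdir/data",
        "persalteas/biorseo",
        "-s", str(container_data / fasta_rel),
        "--descfolder", str(container_data / desc_rel),
        "--func", func,
        "-v",
        "-o", str(container_data / output_rel),
    ]


cmd = build_docker_command("data", "fasta/MyFastaFile.fa", "DESC", "MyOutput.biorseo")
# Print a copy-pastable command line (shlex.join requires Python 3.8+).
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where Docker and the image are available. Building the arguments as a list avoids shell quoting problems when the data folder path contains spaces.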
Option 2 : Compile and Install from source (without docker, Linux only)
==================================
### CLONING
* Clone this git repository : `git clone https://forge.ibisc.univ-evry.fr/lbecquey/biorseo.git` (from the IBISC forge) or `git clone https://github.com/persalteas/biorseo.git` (from my personal GitHub, only while i am the current developer !) and `cd biorseo`.
* Create folders for the modules you will use: `mkdir -p data/modules/`. If you plan to use several module sources, add subdirectories :
```bash
mkdir -p data/modules/BGSU
mkdir -p data/modules/RIN
mkdir -p data/modules/DESC
mkdir -p data/modules/JSON
```
### THE RNA 3D MOTIF ATLAS DATA
Get the latest version of the HL and IL module models from the [BGSU website](http://rna.bgsu.edu/data/jar3d/models/) and extract the Zip files. Put the HL and IL folders from inside the Zip files into `./data/modules/BGSU`. Note that only the latest Zip is required.
### CARNAVAL DATA
You first need to have the `unzip` command installed on your machine and the `networkx` package installed for Python 3. Then just run the script `Install_CaRNAval_RINs.py`, this will create files into `./data/modules/RIN/Subfiles` :
```bash
cd scripts
python3 Install_CaRNAval_RINs.py
```
If you do not have the unzip command, download and extract manually the [CaRNAval dataset](http://carnaval.lri.fr/carnaval_dataset.zip) and place the files `RIN.py` and `CaRNAval_1_as_dictionnary.nxpickled` in the folder `data/modules/RIN/`, and run the python script.
### RNA3DMOTIFS DATA (DEPRECATED)
If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/modules/DESC`.
* Clone this git repository : `git clone https://forge.ibisc.univ-evry.fr/lbecquey/biorseo.git` (from the IBISC forge) or `git clone https://github.com/persalteas/biorseo.git` (from my personal GitHub, only while I (Louis Becquey) am the current developer !) and `cd biorseo`.
### DEPENDENCIES
- Make sure you have Python 3.7+ and a C++ compiler (tested with GCC and clang) installed on your distribution. Use a recent compiler, because the code uses the C++17 standard. For example, compilation will not work with Ubuntu 16's GCC 5.4.
......
Install supported module sources
==================================
Create folders for the modules you will use: `mkdir -p data/modules/`. If you plan to use several module sources, add subdirectories :
```bash
mkdir -p data/modules/BGSU
mkdir -p data/modules/RIN
mkdir -p data/modules/DESC
mkdir -p data/modules/JSON
mkdir -p data/modules/CSV
```
## CUSTOM JSON- OR CSV-FORMATTED MODULES
Just add your JSON-formatted modules to `data/modules/JSON/mydatabase.json`, according to the following format :
```
{
"1": {
"sequence": "ACUAGCG&GGCUA&GU",
"struct2d": "((((((.&.))))&))"
},
...
}
```
You can use `'&'` to indicate sequence discontinuity, which leads to several components in the module.
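Before handing a module file to Biorseo, it may help to check it with a short script. The sketch below is only an illustration: the function name `check_module` is hypothetical, and the rules it enforces (equal lengths, `'&'` at the same positions in both strings, balanced parentheses) are assumptions inferred from the example above, not a documented specification.

```python
import json


def check_module(entry):
    """Return a list of problems found in one module entry (empty if none).

    The checks are assumptions inferred from the example format.
    """
    problems = []
    seq = entry.get("sequence", "")
    struct = entry.get("struct2d", "")
    if len(seq) != len(struct):
        problems.append("sequence and struct2d differ in length")
    else:
        seq_breaks = [i for i, c in enumerate(seq) if c == "&"]
        struct_breaks = [i for i, c in enumerate(struct) if c == "&"]
        if seq_breaks != struct_breaks:
            problems.append("'&' separators are not at the same positions")
    # Every ')' must close an earlier '(' and no '(' may stay open.
    depth = 0
    for c in struct:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        problems.append("unbalanced parentheses in struct2d")
    return problems


# The module from the example above, as it would appear in the JSON file.
text = '{"1": {"sequence": "ACUAGCG&GGCUA&GU", "struct2d": "((((((.&.))))&))"}}'
modules = json.loads(text)
for name, entry in modules.items():
    components = entry["sequence"].split("&")
    print(name, len(components), "components,", check_module(entry) or "no problem found")
```

To check a real file, replace `json.loads(text)` with `json.load(open(path))`, where `path` points to your own module database.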
You can also add CSV-formatted insertion sites (for example, obtained with Jar3d or BayesPairing) to `data/modules/CSV`, following one of these formats:
### The "BayesPairing" format:
Here k-loops can have any number of components k; you have to specify the start and end coordinates of each component. The file should include the header.
```
Motif,Score,Start1,End1,Start2,End2...
motif1name,-19,29,38
motif2name,-28,71,80,90,96
...
```
Entries with few components must not be padded with trailing commas (do not write `motif1name,-19,29,38,,`).
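To make the variable-width layout concrete, here is a minimal Python sketch that reads such a file. It is an illustration only, not the parser used by Biorseo: the function name `read_bayespairing_sites` is hypothetical, and treating the score as a float is an assumption.

```python
import csv
import io


def read_bayespairing_sites(handle):
    """Parse insertion sites in the "BayesPairing" CSV layout.

    Returns a list of (motif name, score, [(start, end), ...]) tuples.
    (Hypothetical helper.)
    """
    reader = csv.reader(handle)
    next(reader)  # skip the mandatory header line
    sites = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        name, score, *coords = row
        # Each component needs a start and an end; padding commas show up
        # as empty fields and are rejected.
        if not coords or len(coords) % 2 or "" in coords:
            raise ValueError(f"malformed entry: {','.join(row)}")
        positions = [int(c) for c in coords]
        components = list(zip(positions[0::2], positions[1::2]))
        sites.append((name, float(score), components))
    return sites


example = (
    "Motif,Score,Start1,End1,Start2,End2\n"
    "motif1name,-19,29,38\n"
    "motif2name,-28,71,80,90,96\n"
)
sites = read_bayespairing_sites(io.StringIO(example))
for name, score, components in sites:
    print(name, score, components)
```

An entry padded with commas, such as `motif1name,-19,29,38,,`, raises a `ValueError` in this sketch, which mirrors the rule stated above.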
### The Jar3d format
Here the modules may only be 1-loops or 2-loops (HL or IL). There is a fixed number of columns per line, and undefined values are indicated with a dash `'-'`.
```
Motif,Rotation,Score,Start1,End1,Start2,End2
IL_43115.1,True,66,30,32,55,57
HL_35894.1,False,63,42,47,-,-
...
```
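The fixed-width layout with dashes can be read in a similar way. The sketch below is a hedged illustration: the function name `read_jar3d_sites` and the dictionary keys are hypothetical, the column names come from the header shown above, and interpreting `Rotation` as a boolean is an assumption.

```python
import csv
import io


def read_jar3d_sites(handle):
    """Parse insertion sites in the Jar3d CSV layout (HL or IL only).

    Undefined coordinates are written as '-' and are skipped.
    (Hypothetical helper.)
    """
    sites = []
    for row in csv.DictReader(handle):
        components = []
        for k in (1, 2):
            start, end = row[f"Start{k}"], row[f"End{k}"]
            if start != "-" and end != "-":
                components.append((int(start), int(end)))
        sites.append({
            "motif": row["Motif"],
            "rotation": row["Rotation"] == "True",
            "score": float(row["Score"]),
            "components": components,
        })
    return sites


example = (
    "Motif,Rotation,Score,Start1,End1,Start2,End2\n"
    "IL_43115.1,True,66,30,32,55,57\n"
    "HL_35894.1,False,63,42,47,-,-\n"
)
jar3d_sites = read_jar3d_sites(io.StringIO(example))
for site in jar3d_sites:
    print(site["motif"], site["components"])
```

With the two example lines, the internal loop yields two components and the hairpin loop yields one, since its second pair of coordinates is undefined.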
## CARNAVAL DATA (*Reinharz et al, 2018*)
You first need to have the `unzip` command installed on your machine and the `networkx` package installed for Python 3. Then just run the script `Install_CaRNAval_RINs.py`.
If you have cloned the Git repository, just run :
```bash
cd scripts
python3 Install_CaRNAval_RINs.py
```
This will create files in `./data/modules/RIN/Subfiles`.
If not, or if you do not have the unzip command, manually download and extract the [CaRNAval dataset](http://carnaval.lri.fr/carnaval_dataset.zip), place the files `RIN.py` and `CaRNAval_1_as_dictionnary.nxpickled` in the folder `data/modules/RIN/`, and then run the Python script.
*Note : CaRNAval is supposed to be a dataset of long-distance contact modules, not of SSE modules. It was supported mostly for testing, and you will not get the best performance from it, because its modules are not supposed to be loops.*
## THE RNA 3D MOTIF ATLAS DATA (*Petrov et al, 2013*, previously supported)
Source : see http://rna.bgsu.edu/rna3dhub/motifs/.
Get the latest version of the HL and IL module models from the [BGSU website](http://rna.bgsu.edu/data/jar3d/models/) and extract the Zip files. Put the HL and IL folders from inside the Zip files into `./data/modules/BGSU`. Note that only the latest Zip is required.
*Note : In Biorseo V1.0, you could use these modules directly because Biorseo was running Jar3d or BayesPairing for you. This is not the case anymore. You need to run these tools separately and save their results as a CSV file; see above for how to format the CSV file.*
## RNA3DMOTIFS DATA (from the work of *Djelloul & Denise, 2008*, considered outdated)
If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc files from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Alternatively, you can run Rna3Dmotifs' `catalog` program to build your own DESC module collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). In that case, you also need to move the final DESC files into `./data/modules/DESC`.
\ No newline at end of file
......@@ -22,7 +22,7 @@ while step < len(seq)+50:
fasta.close()
# run biorseo on it, with default options
cmd = ["./bin/biorseo", "-d", "./data/modules/DESC", "-s", "./ZDFS33.fa", "-v"]
cmd = ["./bin/biorseo", "-d", "./data/modules/DESC", "-s", "data/fasta/ZDFS33.fa", "-v"]
old_time = time.time()
output = subprocess.check_output(cmd, stderr=subprocess.DEVNULL).decode("utf-8").split("\n")[-5:]
run_time = time.time() - old_time
......
#!/bin/bash
######################################################## RNA modules ##############################################################
cd ../
# Rna3Dmotifs data
mkdir -p data/modules/DESC
wget https://github.com/McGill-CSB/RNAMoIP/raw/master/CATALOGUE.tgz
tar -xvzf CATALOGUE.tgz
mv No_Redondance_DESC/*.desc data/modules/DESC/
rm -r No_Redondance_VIEW3D No_Redondance_DESC CATALOGUE.tgz
# The RNA 3D Motif Atlas
mkdir -p data/modules/BGSU
wget http://rna.bgsu.edu/data/jar3d/models/HL/HL_3.2_models.zip
unzip HL_3.2_models.zip
mv HL data/modules/BGSU
rm HL_3.2_models.zip
wget http://rna.bgsu.edu/data/jar3d/models/IL/IL_3.2_models.zip
unzip IL_3.2_models.zip
mv IL data/modules/BGSU
rm IL_3.2_models.zip
# Install BayesPairing
sudo -H pip3 install --upgrade pip
sudo -H pip3 install networkx numpy regex wrapt biopython
git clone http://jwgitlab.cs.mcgill.ca/sarrazin/rnabayespairing.git BayesPairing
cd BayesPairing
sudo -H pip3 install .
# Train Bayes Pairing (it has been installed on the image and the source has been deleted, we train the models now, and will remount it as volume at run time)
cd bayespairing/src
python3 parse_sequences.py -d rna3dmotif -seq ACACGGGGUAAGAGCUGAACGCAUCUAAGCUCGAAACCCACUUGGAAAAGAGACACCGCCGAGGUCCCGCGUACAAGACGCGGUCGAUAGACUCGGGGUGUGCGCGUCGAGGUAACGAGACGUUAAGCCCACGAGCACUAACAGACCAAAGCCAUCAU -ss ".................................................................((...............)xxxx(...................................................)xxx).............."
python3 parse_sequences.py -d 3dmotifatlas -seq ACACGGGGUAAGAGCUGAACGCAUCUAAGCUCGAAACCCACUUGGAAAAGAGACACCGCCGAGGUCCCGCGUACAAGACGCGGUCGAUAGACUCGGGGUGUGCGCGUCGAGGUAACGAGACGUUAAGCCCACGAGCACUAACAGACCAAAGCCAUCAU -ss ".................................................................((...............)xxxx(...................................................)xxx).............."
cd ../../..
######################################################## Run it ##############################################################
# docker run -v `pwd`/data/modules:/modules -v `pwd`/BayesPairing/bayespairing:/byp -v `pwd`/results:/biorseo/results biorseo ./biorseo.py -i /biorseo/data/fasta/applications.fa --rna3dmotifs --patternmatch --func B
\ No newline at end of file