Louis BECQUEY

Updated Docker and documentation

1 -results_*
2 -results/
3 -build_BiORSEO_docker_image_ubuntu18.sh
4 -deploy_BiORSEO_docker_image_linux.sh
5 -INSTALL.md
6 -Readme.md
7 -benchmark_results/
8 -*.gz
9 -*.pickle
10 -log_of_the_run.sh
...\ No newline at end of file ...\ No newline at end of file
1 +Changelog
2 +=========================
3 +
4 +### Biorseo 2.1 (Nov 2021)
5 +This is an official, tested release of Biorseo 2:
6 +- replacing Nupack's dynamic programming scheme, which supports simple pseudoknots, with ViennaRNA's window-based scheme, which does not support pseudoknots or long-distance contacts but allows testing much longer sequences,
7 +- supporting RINs with no issues,
8 +- supporting custom modules in JSON format (to be detected in sequences by regular expressions), thanks to Nathalie Bernard,
9 +- not running Jar3d or BayesPairing for you anymore. This greatly simplifies code management (the pipeline is replaced by the C++ tool only). Jar3d is getting older, does not support very complex modules, and is biased because it takes loops as input (not the whole sequence). Therefore, you have to give Biorseo the answer as input! BayesPairing 2.0 is itself evolving into a tool that places modules in secondary structures, taking energies (and now comparative information) into account, so it makes no sense to include it *within* Biorseo. The approaches should be compared and benchmarked instead. You can still use the outputs of these tools as input for Biorseo if you like.
10 +- introducing the MFE criterion (thanks to Nathalie Bernard),
11 +- introducing the Biokop-mode,
12 +- with a much simpler and lighter installation process.
13 +
14 +Biorseo 2.1 is available as a Docker container and as a Git branch called "biorseo2".
15 +It is the last version supported by Louis Becquey.
16 +
17 +### Biorseo 2.0
18 +This was an unofficial, unsupported and unpublished version after the internship of Lénaic Durand at IBISC.
19 +This version
20 +- replaces Nupack 3.2 with ViennaRNA to compute the pairing probabilities, thanks to Lénaic,
21 +- introduces early support for BayesPairing 2.0, which was still unofficial too at the time,
22 +- supports CaRNAval RINs,
23 +- but has issues with the constraints to assert RIN basepairs are respected.
24 +
25 +Results from this version are published in [Louis Becquey's thesis](https://tel.archives-ouvertes.fr/tel-03440181).
26 +
27 +### Biorseo 1.2 (2019) and Biorseo 1.5 (2020)
28 +These versions fixed numerical issues and brought other technical improvements.
29 +Biorseo 1.2 is still available as a Docker image, and Biorseo 1.5 is available as a Git branch called 'biorseo1'.
30 +
31 +### Biorseo 1.0 (2018)
32 +This was the first version published for the paper [*Becquey et al. 2020*.](https://doi.org/10.1093/bioinformatics/btz962)
...\ No newline at end of file ...\ No newline at end of file
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
8 FROM ubuntu:focal 8 FROM ubuntu:focal
9 9
10 # compiled biorseo 10 # compiled biorseo
11 -COPY . /biorseo/ 11 +COPY ./bin /workdir/
12 12
13 # Install runtime dependencies 13 # Install runtime dependencies
14 RUN apt-get update -yq && \ 14 RUN apt-get update -yq && \
...@@ -16,4 +16,5 @@ RUN apt-get update -yq && \ ...@@ -16,4 +16,5 @@ RUN apt-get update -yq && \
16 apt-get install -y libboost-program-options-dev libboost-filesystem-dev && \ 16 apt-get install -y libboost-program-options-dev libboost-filesystem-dev && \
17 rm -rf /var/lib/apt/lists/* 17 rm -rf /var/lib/apt/lists/*
18 18
19 -WORKDIR /biorseo
...\ No newline at end of file ...\ No newline at end of file
19 +WORKDIR /workdir
20 +ENTRYPOINT ["/workdir/biorseo"]
......
1 Option 1 : Installation using docker image (Windows, Mac, Linux) 1 Option 1 : Installation using docker image (Windows, Mac, Linux)
2 ================================== 2 ==================================
3 -* Clone this git repository : `git clone https://github.com/persalteas/biorseo.git` , or download the .zip archive from a BiORSEO release and extract it.
4 -* Move into the repository ( `cd biorseo` )
5 3
6 ### Install Docker: 4 ### Install Docker:
7 * See the official instructions depending on your OS here : https://docs.docker.com/install/ 5 * See the official instructions depending on your OS here : https://docs.docker.com/install/
...@@ -13,60 +11,40 @@ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubun ...@@ -13,60 +11,40 @@ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubun
13 sudo apt update && sudo apt install docker-ce docker-ce-cli containerd.io 11 sudo apt update && sudo apt install docker-ce docker-ce-cli containerd.io
14 ``` 12 ```
15 13
14 +### Create a separate folder
15 +You will need a folder that is mounted in the Docker container (shared between your filesystem and the container's one). For example, create a `data` folder. Put your FASTA input and your modules there; Biorseo will also write its output there.
16 +```
17 +mkdir data
18 +```
19 +
16 ### Download and install the RNA motifs data files: 20 ### Download and install the RNA motifs data files:
17 -* Move your JSON-formatted or CSV-formatted files containing motifs in the folder. 21 +* Move your JSON-formatted or CSV-formatted files containing motifs into a `./data/JSON` or `./data/CSV` folder.
18 -* If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/modules/DESC`. 22 +* If you use Rna3Dmotifs (easy, but outdated), you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc files from the `Non_Redundant_DESC` folder into `./data/DESC`. Otherwise, you can also run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). In that case, you also need to move the final DESC files into `./data/DESC`.
23 +* If you use CaRNAval (which is supposed to be a long-distance contact module dataset, not a SSE module dataset), use the script [scripts/Install_CaRNAval_RINs.py](scripts/Install_CaRNAval_RINs.py) :
24 +`python3 Install_CaRNAval_RINs.py`. This will create a `/../data/modules/RIN/` folder (because the script is supposed to be run from the repo's `scripts` subfolder)
19 25
20 ### Download the docker image from Docker Hub 26 ### Download the docker image from Docker Hub
21 -`docker pull persalteas/biorseo:latest` 27 +```
28 +docker pull persalteas/biorseo:latest
29 +```
22 30
23 ### Run the docker image 31 ### Run the docker image
24 Use the following command to run the docker image: 32 Use the following command to run the docker image:
25 ``` 33 ```
26 -$ docker run 34 +$ docker run -v `pwd`/data:/workdir/data persalteas/biorseo [optionshere]
27 --v `pwd`/data/modules:/modules
28 --v `pwd`/data/fasta:/biorseo/data/fasta
29 --v `pwd`/results:/biorseo/results
30 -persalteas/biorseo
31 -yourexamplejobcommandhere
32 ``` 35 ```
33 -You can replace \`pwd\` by the full path of the biorseo/ root folder. Here we launch the biorseo image with 4 volumes : A first to give BiORSEO access to the module files, a second to give it access to your input file(s), a third for your trained BayesPairing, and a last for it to output the result files of your job. Considering you place your input file 'MyFastaFile.fa' into the `data/fasta` folder, an example job command can be ` ./biorseo.py -i /biorseo/data/fasta/myFastaFile.fa --rna3dmotifs --func B`, so the full run command would be 36 +You can replace \`pwd\`/data with the full path to your data folder. Assuming you place your input file 'MyFastaFile.fa' in the `data/fasta` folder, an example job command is :
34 ``` 37 ```
35 -$ docker run -v `pwd`/data/modules:/modules -v `pwd`/data/fasta:/biorseo/data/fasta -v `pwd`/results:/biorseo/results persalteas/biorseo ./bin/biorseo -s /biorseo/data/fasta/applications.fa --descfolder /biorseo/data/modules/DESC --func B -v 38 +$ docker run -v `pwd`/data:/workdir/data persalteas/biorseo -s data/fasta/MyFastaFile.fa --descfolder data/DESC --func B -v -o data/MyOutput.biorseo
36 ``` 39 ```
37 40
38 -Note that the paths to the input and output files are paths *inside the Docker container*, and those paths are mounted to folders of the host machine with -v options. 41 +Note that the paths to the input and output files are paths *inside the Docker container*, and those paths are mounted to the data folder of the host machine with the -v option.
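
The mapping between the host `data` folder and `/workdir/data` inside the container can be sketched in Python. This is only an illustration under assumptions: the helper name `docker_command` is hypothetical, and the flags passed to Biorseo are simply those of the example command above.

```python
from pathlib import Path


def docker_command(data_dir, biorseo_args, image="persalteas/biorseo"):
    """Build the argv for `docker run`, mounting data_dir on /workdir/data.

    Hypothetical helper, not part of Biorseo.
    """
    # Docker bind mounts need an absolute host path, hence resolve().
    host_dir = Path(data_dir).resolve()
    return ["docker", "run", "-v", f"{host_dir}:/workdir/data", image, *biorseo_args]


# Paths given to Biorseo are relative to /workdir, i.e. paths inside the container.
cmd = docker_command(
    "data",
    ["-s", "data/fasta/MyFastaFile.fa", "--descfolder", "data/DESC",
     "--func", "B", "-v", "-o", "data/MyOutput.biorseo"],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the job.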
39 42
40 Option 2 : Compile and Install from source (without docker, Linux only) 43 Option 2 : Compile and Install from source (without docker, Linux only)
41 ================================== 44 ==================================
42 45
43 ### CLONING 46 ### CLONING
44 -* Clone this git repository : `git clone https://forge.ibisc.univ-evry.fr/lbecquey/biorseo.git` (from the IBISC forge) or `git clone https://github.com/persalteas/biorseo.git` (from my personal GitHub, only while i am the current developer !) and `cd biorseo`. 47 +* Clone this git repository : `git clone https://forge.ibisc.univ-evry.fr/lbecquey/biorseo.git` (from the IBISC forge) or `git clone https://github.com/persalteas/biorseo.git` (from my personal GitHub, only while I (Louis Becquey) am the current developer !) and `cd biorseo`.
45 -
46 -* Create folders for the modules you will use: `mkdir -p data/modules/`. If you plan to use several module sources, add subdirectories :
47 -```bash
48 -mkdir -p data/modules/BGSU
49 -mkdir -p data/modules/RIN
50 -mkdir -p data/modules/DESC
51 -mkdir -p data/modules/JSON
52 -```
53 -
54 -### THE RNA 3D MOTIF ATLAS DATA
55 -
56 -Get the latest version of the HL and IL module models from the [BGSU website](http://rna.bgsu.edu/data/jar3d/models/) and extract the Zip files. Put the HL and IL folders from inside the Zip files into `./data/modules/BGSU`. Note that only the latest Zip is required.
57 -
58 -### CARNAVAL DATA
59 -
60 -You first need to have the `unzip` command installed on your machine and the `networkx` package installed for Python 3. Then just run the script `Install_CaRNAval_RINs.py`, this will create files into `./data/modules/RIN/Subfiles` :
61 -```bash
62 -cd scripts
63 -python3 Install_CaRNAval_RINs.py
64 -```
65 -If you do not have the unzip command, download and extract manually the [CaRNAval dataset](http://carnaval.lri.fr/carnaval_dataset.zip) and place the files `RIN.py` and `CaRNAval_1_as_dictionnary.nxpickled` in the folder `data/modules/RIN/`, and run the python script.
66 -
67 -### RNA3DMOTIFS DATA (DEPRECATED)
68 -
69 -If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/modules/DESC`.
70 48
71 ### DEPENDENCIES 49 ### DEPENDENCIES
72 - Make sure you have Python 3.7+ and a C++ compiler (tested with GCC and clang) installed on your distribution. Use a recent one, we use the 2017 C++ standard. The compilation will not work with Ubuntu 16's GCC 5.4 for example. 50 - Make sure you have Python 3.7+ and a C++ compiler (tested with GCC and clang) installed on your distribution. Use a recent one, we use the 2017 C++ standard. The compilation will not work with Ubuntu 16's GCC 5.4 for example.
......
1 +Install supported module sources
2 +==================================
3 +Create folders for the modules you will use: `mkdir -p data/modules/`. If you plan to use several module sources, add subdirectories :
4 +```bash
5 +mkdir -p data/modules/BGSU
6 +mkdir -p data/modules/RIN
7 +mkdir -p data/modules/DESC
8 +mkdir -p data/modules/JSON
9 +mkdir -p data/modules/CSV
10 +```
11 +
12 +## CUSTOM JSON- OR CSV-FORMATTED MODULES
13 +Just add your JSON-formatted modules to `data/modules/JSON/mydatabase.json`, according to the following format :
14 +```
15 +{
16 + "1": {
17 + "sequence": "ACUAGCG&GGCUA&GU",
18 + "struct2d": "((((((.&.))))&))"
19 + },
20 + ...
21 +}
22 +```
23 +You can use `'&'` to indicate sequence discontinuity, which leads to several components in the module.
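
As a minimal sketch of how such a file can be read, the snippet below splits each module into its components on `'&'`. The function name `split_components` is hypothetical, and the check that each sequence component has a structure component of the same length is an assumption inferred from the example above, not a documented rule.

```python
import json

# Example content in the JSON module format shown above.
JSON_EXAMPLE = """
{
  "1": {
    "sequence": "ACUAGCG&GGCUA&GU",
    "struct2d": "((((((.&.))))&))"
  }
}
"""


def split_components(modules):
    """Return {module_id: [(sequence_part, structure_part), ...]}."""
    result = {}
    for module_id, module in modules.items():
        seq_parts = module["sequence"].split("&")
        struct_parts = module["struct2d"].split("&")
        # Assumption: components must align one-to-one and have equal lengths.
        if [len(s) for s in seq_parts] != [len(s) for s in struct_parts]:
            raise ValueError(f"module {module_id}: sequence and struct2d do not align")
        result[module_id] = list(zip(seq_parts, struct_parts))
    return result


components = split_components(json.loads(JSON_EXAMPLE))
for seq, struct in components["1"]:
    print(seq, struct)
```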
24 +
25 +You can also add CSV-formatted insertion sites (for example, obtained with Jar3d or BayesPairing) to `data/modules/CSV`, following one of these formats:
26 +
27 +### The "BayesPairing" format:
28 +Here k-loops can have any number of components k; you have to specify the start and end coordinates of each. The file should include the header.
29 +```
30 +Motif,Score,Start1,End1,Start2,End2...
31 +motif1name,-19,29,38
32 +motif2name,-28,71,80,90,96
33 +...
34 +```
35 +Entries with few components must not be padded with trailing commas (do not write `motif1name,-19,29,38,,`).
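
A minimal Python sketch of a reader for this variable-width format follows. The function name `parse_bayespairing` is hypothetical, and converting the score to a float is an assumption (the example only shows integer values).

```python
import csv
import io

BYP_EXAMPLE = """Motif,Score,Start1,End1,Start2,End2...
motif1name,-19,29,38
motif2name,-28,71,80,90,96
"""


def parse_bayespairing(text):
    """Return a list of (motif, score, [(start, end), ...]) tuples."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    sites = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        name, score, coords = row[0], float(row[1]), row[2:]
        # Reject padded entries (empty fields) and unpaired coordinates.
        if not coords or len(coords) % 2 != 0 or "" in coords:
            raise ValueError(f"bad coordinates for {name}: {coords}")
        positions = [int(c) for c in coords]
        sites.append((name, score, list(zip(positions[0::2], positions[1::2]))))
    return sites


byp_sites = parse_bayespairing(BYP_EXAMPLE)
```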
36 +
37 +### The Jar3d format
38 +Here the modules may only be 1-loops or 2-loops (HL or IL). There is a fixed number of columns per line, and undefined values are indicated with a dash `'-'`.
39 +```
40 +Motif,Rotation,Score,Start1,End1,Start2,End2
41 +IL_43115.1,True,66,30,32,55,57
42 +HL_35894.1,False,63,42,47,-,-
43 +...
44 +```
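
The fixed-width format with dashes can be read along the same lines. Again this is only a sketch: `parse_jar3d` is a hypothetical name, and treating `Rotation` as a boolean and `Score` as a float are assumptions based on the two example rows.

```python
import csv
import io

JAR3D_EXAMPLE = """Motif,Rotation,Score,Start1,End1,Start2,End2
IL_43115.1,True,66,30,32,55,57
HL_35894.1,False,63,42,47,-,-
"""


def parse_jar3d(text):
    """Return one dict per insertion site, skipping undefined ('-') components."""
    sites = []
    for row in csv.DictReader(io.StringIO(text)):
        components = []
        for k in (1, 2):  # only 1-loops (HL) and 2-loops (IL) are possible
            start, end = row[f"Start{k}"], row[f"End{k}"]
            if start == "-" and end == "-":
                continue  # undefined component
            components.append((int(start), int(end)))
        sites.append({
            "motif": row["Motif"],
            "rotation": row["Rotation"] == "True",
            "score": float(row["Score"]),
            "components": components,
        })
    return sites


jar3d_sites = parse_jar3d(JAR3D_EXAMPLE)
```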
45 +
46 +## CARNAVAL DATA (*Reinhartz et al, 2018*)
47 +
48 +You first need to have the `unzip` command installed on your machine and the `networkx` package installed for Python 3. Then just run the script `Install_CaRNAval_RINs.py`.
49 +
50 +If you have cloned the Git repository, just run :
51 +```bash
52 +cd scripts
53 +python3 Install_CaRNAval_RINs.py
54 +```
55 +This will create files into `./data/modules/RIN/Subfiles`.
56 +
57 +If not, or if you do not have the unzip command, manually download and extract the [CaRNAval dataset](http://carnaval.lri.fr/carnaval_dataset.zip), place the files `RIN.py` and `CaRNAval_1_as_dictionnary.nxpickled` in the folder `data/modules/RIN/`, and run the Python script.
58 +
59 +*Note : CaRNAval is supposed to be a dataset of long-distance contact modules, not of SSE modules. It was supported mostly for testing, and you will not get the best performance from it, because its modules are not supposed to be loops.*
60 +
61 +## THE RNA 3D MOTIF ATLAS DATA (*Petrov et al, 2013*, previously supported)
62 +Source : see http://rna.bgsu.edu/rna3dhub/motifs/.
63 +
64 +Get the latest version of the HL and IL module models from the [BGSU website](http://rna.bgsu.edu/data/jar3d/models/) and extract the Zip files. Put the HL and IL folders from inside the Zip files into `./data/modules/BGSU`. Note that only the latest Zip is required.
65 +
66 +*Note : In Biorseo V1.0, you could use these modules directly because Biorseo was running Jar3d or BayesPairing for you. This is not the case anymore. You need to run these tools separately and get their results as a CSV file; see above for how to format the CSV file.*
67 +
68 +## RNA3DMOTIFS DATA (from the work of *Djelloul & Denise, 2008*, considered outdated)
69 +
70 +If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc files from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you can also run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). In that case, you also need to move the final DESC files into `./data/modules/DESC`.
...\ No newline at end of file ...\ No newline at end of file
...@@ -22,7 +22,7 @@ while step < len(seq)+50: ...@@ -22,7 +22,7 @@ while step < len(seq)+50:
22 fasta.close() 22 fasta.close()
23 23
24 # run biorseo on it, with default options 24 # run biorseo on it, with default options
25 - cmd = ["./bin/biorseo", "-d", "./data/modules/DESC", "-s", "./ZDFS33.fa", "-v"] 25 + cmd = ["./bin/biorseo", "-d", "./data/modules/DESC", "-s", "data/fasta/ZDFS33.fa", "-v"]
26 old_time = time.time() 26 old_time = time.time()
27 output = subprocess.check_output(cmd, stderr=subprocess.DEVNULL).decode("utf-8").split("\n")[-5:] 27 output = subprocess.check_output(cmd, stderr=subprocess.DEVNULL).decode("utf-8").split("\n")[-5:]
28 run_time = time.time() - old_time 28 run_time = time.time() - old_time
......
1 -
2 -#!/bin/bash
3 -######################################################## RNA modules ##############################################################
4 -
5 -cd ../
6 -
7 -# Rna3Dmotifs data
8 -mkdir -p data/modules/DESC
9 -wget https://github.com/McGill-CSB/RNAMoIP/raw/master/CATALOGUE.tgz
10 -tar -xvzf CATALOGUE.tgz
11 -mv No_Redondance_DESC/*.desc data/modules/DESC/
12 -rm -r No_Redondance_VIEW3D No_Redondance_DESC CATALOGUE.tgz
13 -
14 -# The RNA 3D Motif Atlas
15 -mkdir -p data/modules/BGSU
16 -wget http://rna.bgsu.edu/data/jar3d/models/HL/HL_3.2_models.zip
17 -unzip HL_3.2_models.zip
18 -mv HL data/modules/BGSU
19 -rm HL_3.2_models.zip
20 -wget http://rna.bgsu.edu/data/jar3d/models/IL/IL_3.2_models.zip
21 -unzip IL_3.2_models.zip
22 -mv IL data/modules/BGSU
23 -rm IL_3.2_models.zip
24 -
25 -# Install BayesPairing
26 -sudo -H pip3 install --upgrade pip
27 -sudo -H pip3 install networkx numpy regex wrapt biopython
28 -git clone http://jwgitlab.cs.mcgill.ca/sarrazin/rnabayespairing.git BayesPairing
29 -cd BayesPairing
30 -sudo -H pip3 install .
31 -
32 -# Train Bayes Pairing (it has been installed on the image and the source has been deleted, we train the models now, and will remount it as volume at run time)
33 -cd bayespairing/src
34 -python3 parse_sequences.py -d rna3dmotif -seq ACACGGGGUAAGAGCUGAACGCAUCUAAGCUCGAAACCCACUUGGAAAAGAGACACCGCCGAGGUCCCGCGUACAAGACGCGGUCGAUAGACUCGGGGUGUGCGCGUCGAGGUAACGAGACGUUAAGCCCACGAGCACUAACAGACCAAAGCCAUCAU -ss ".................................................................((...............)xxxx(...................................................)xxx).............."
35 -python3 parse_sequences.py -d 3dmotifatlas -seq ACACGGGGUAAGAGCUGAACGCAUCUAAGCUCGAAACCCACUUGGAAAAGAGACACCGCCGAGGUCCCGCGUACAAGACGCGGUCGAUAGACUCGGGGUGUGCGCGUCGAGGUAACGAGACGUUAAGCCCACGAGCACUAACAGACCAAAGCCAUCAU -ss ".................................................................((...............)xxxx(...................................................)xxx).............."
36 -cd ../../..
37 -
38 -######################################################## Run it ##############################################################
39 -
40 -# docker run -v `pwd`/data/modules:/modules -v `pwd`/BayesPairing/bayespairing:/byp -v `pwd`/results:/biorseo/results biorseo ./biorseo.py -i /biorseo/data/fasta/applications.fa --rna3dmotifs --patternmatch --func B
...\ No newline at end of file ...\ No newline at end of file