Updated Docker and documentation

Louis BECQUEY
Commit eda1ab32bdb323184a282215e53fa9cc261776df (eda1ab32), 1 parent b01c7f77
Showing 8 changed files with 124 additions and 93 deletions
.dockerignore
CHANGELOG.md
Dockerfile
INSTALL.md
Readme.md
SOURCES.md
scripts/benchmark_on_seq_length.py
scripts/deploy_BiORSEO_docker_image_linux.sh
--- a/.dockerignore deleted 100644 → 0
+++ b/.dockerignore deleted 100644 → 0
- results_*
- results/
- build_BiORSEO_docker_image_ubuntu18.sh
- deploy_BiORSEO_docker_image_linux.sh
- INSTALL.md
- Readme.md
- benchmark_results/
- *.gz
- *.pickle
- log_of_the_run.sh
\ No newline at end of file
--- a/CHANGELOG.md 0 → 100644
+++ b/CHANGELOG.md 0 → 100644
+ Changelog
+ =========================
+ 
+ ### Biorseo 2.1 (Nov 2021)
+ This is an official, tested release of Biorseo 2:
+ - replacing Nupack's dynamic programming scheme, which supports simple pseudoknots, with ViennaRNA's window-based scheme, which does not support pseudoknots or long-distance contacts but allows testing much longer sequences,
+ - supporting RINs with no issues,
+ - supporting custom modules in JSON format (to be detected in sequences by regular expressions), thanks to Nathalie Bernard,
+ - not running Jar3d or BayesPairing for you anymore. This greatly simplifies code management (the pipeline is replaced by the C++ tool only). Jar3d is getting older, does not support very complex modules, and is biased because it takes loops as input (not the whole sequence). Therefore, you have to give Biorseo the answer as input! BayesPairing 2.0 is itself evolving into a tool that places modules in secondary structures while taking energies (and now comparative information) into account, so it makes no sense to include it *within* Biorseo. The approaches should be compared and benchmarked instead. You can still use the outputs of these tools as input for Biorseo if you like.
+ - introducing the MFE criterion (thanks to Nathalie Bernard),
+ - introducing the Biokop-mode,
+ - with a much simpler and lighter installation process.
+ 
+ Biorseo 2.1 is available as a Docker container and as a Git branch called "biorseo2".
+ It is the last version supported by Louis Becquey.
+ 
+ ### Biorseo 2.0
+ This was an unofficial, unsupported and unpublished version after the internship of Lénaic Durand at IBISC.
+ This version 
+ - replaces Nupack 3.2 with ViennaRNA to compute the pairing probabilities, thanks to Lénaic,
+ - introduces early support for BayesPairing 2.0, which was still unofficial too at the time,
+ - supports CaRNAval RINs,
+ - but has issues with the constraints that ensure RIN base pairs are respected.
+ 
+ Results from this version are published in [Louis Becquey's thesis](https://tel.archives-ouvertes.fr/tel-03440181).
+ 
+ ### Biorseo 1.2 (2019) and Biorseo 1.5 (2020)
+ These versions fixed numerical issues and brought other technical improvements.
+ Biorseo 1.2 is still available as a Docker image, and Biorseo 1.5 is available as a Git branch called 'biorseo1'.
+ 
+ ### Biorseo 1.0 (2018)
+ This was the first version published for the paper [*Becquey et al. 2020*.](https://doi.org/10.1093/bioinformatics/btz962)
\ No newline at end of file
--- a/Dockerfile
+++ b/Dockerfile
@@ -8,7 +8,7 @@
 FROM ubuntu:focal
 
 # compiled biorseo
- COPY . /biorseo/
+ COPY ./bin /workdir/
 
 # Install runtime dependencies
 RUN apt-get update -yq && \
@@ -16,4 +16,5 @@ RUN apt-get update -yq && \
     apt-get install -y libboost-program-options-dev libboost-filesystem-dev && \
     rm -rf /var/lib/apt/lists/*
 
- WORKDIR /biorseo
\ No newline at end of file
+ WORKDIR /workdir
+ ENTRYPOINT ["/workdir/biorseo"]
--- a/INSTALL.md
+++ b/INSTALL.md
 Option 1 : Installation using docker image (Windows, Mac, Linux)
 ==================================
- * Clone this git repository : `git clone https://github.com/persalteas/biorseo.git` , or download the .zip archive from a BiORSEO release and extract it.
- * Move into the repository ( `cd biorseo` )
 
 ### Install Docker:
 * See the official instructions depending on your OS here : https://docs.docker.com/install/
@@ -13,60 +11,40 @@ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubun
 sudo apt update && sudo apt install docker-ce docker-ce-cli containerd.io
 ```
 
+ ### Create a separate folder
+ You will need a folder that is mounted in the Docker container (shared between your filesystem and the container's one). For example, create a `data` folder. This is where you put your FASTA input and your modules, and where Biorseo should write its output.
+ ```
+ mkdir data 
+ ```
+ 
 ### Download and install the RNA motifs data files:
- * Move your JSON-formatted or CSV-formatted files containing motifs in the folder.
- * If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/modules/DESC`.
+ * Move your JSON-formatted or CSV-formatted files containing motifs in a `./data/JSON` or `./data/CSV` folder.
+ * If you use Rna3Dmotifs (easy, but outdated), you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc files from the `Non_Redundant_DESC` folder into `./data/DESC`. Alternatively, you can run Rna3Dmotifs' `catalog` program to build your own DESC module collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)), then move the resulting DESC files into `./data/DESC`.
+ * If you use CaRNAval (which is supposed to be a long-distance contact module dataset, not a SSE module dataset), use the script [scripts/Install_CaRNAval_RINs.py](scripts/Install_CaRNAval_RINs.py) : 
+ `python3 Install_CaRNAval_RINs.py`. This will create a `/../data/modules/RIN/` folder (because the script is supposed to be run from the repo's `scripts` subfolder)
 
 ### Download the docker image from Docker Hub
- `docker pull persalteas/biorseo:latest`
+ ```
+ docker pull persalteas/biorseo:latest
+ ```
 
 ### Run the docker image
 Use the following command to run the docker image:
 ```
- $ docker run 
- -v `pwd`/data/modules:/modules 
- -v `pwd`/data/fasta:/biorseo/data/fasta
- -v `pwd`/results:/biorseo/results 
- persalteas/biorseo 
- yourexamplejobcommandhere
+ $ docker run -v `pwd`/data:/workdir/data persalteas/biorseo [optionshere]
 ```
- You can replace \`pwd\` by the full path of the biorseo/ root folder. Here we launch the biorseo image with 4 volumes : A first to give BiORSEO access to the module files, a second to give it access to your input file(s), a third for your trained BayesPairing, and a last for it to output the result files of your job. Considering you place your input file 'MyFastaFile.fa' into the `data/fasta` folder, an example job command can be ` ./biorseo.py -i /biorseo/data/fasta/myFastaFile.fa  --rna3dmotifs --func B`, so the full run command would be 
+ You can replace \`pwd\`/data with the full path to your data folder. Assuming you place your input file 'MyFastaFile.fa' into the `data/fasta` folder, an example job command can be:
 ```
- $ docker run -v `pwd`/data/modules:/modules -v `pwd`/data/fasta:/biorseo/data/fasta -v `pwd`/results:/biorseo/results persalteas/biorseo ./bin/biorseo -s /biorseo/data/fasta/applications.fa --descfolder /biorseo/data/modules/DESC --func B -v
+ $ docker run -v `pwd`/data:/workdir/data persalteas/biorseo -s data/fasta/MyFastaFile.fa --descfolder data/DESC --func B -v -o data/MyOutput.biorseo
 ```
 
- Note that the paths to the input and output files are paths *inside the Docker container*, and those paths are mounted to folders of the host machine with -v options.
+ Note that the paths to the input and output files are paths *inside the Docker container*, and those paths are mounted to the data folder of the host machine with the -v option.
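The mapping between the host `data` folder and `/workdir/data` can be sketched as a small Python helper that only assembles the `docker run` argument list. The helper name `biorseo_docker_argv` is hypothetical and not part of Biorseo; the image name, mount point and Biorseo options are the ones shown in the command above.

```python
import os


def biorseo_docker_argv(host_data_dir, biorseo_args, image="persalteas/biorseo"):
    """Build the argv list for `docker run` (hypothetical helper, not shipped with Biorseo)."""
    # An absolute host path is used so that -v describes a bind mount of that folder.
    mount = f"{os.path.abspath(host_data_dir)}:/workdir/data"
    # Everything after the image name is handed to the image's ENTRYPOINT (biorseo).
    return ["docker", "run", "-v", mount, image, *biorseo_args]


# Paths given to biorseo are relative to /workdir inside the container.
argv = biorseo_docker_argv(
    "data",
    ["-s", "data/fasta/MyFastaFile.fa", "--descfolder", "data/DESC",
     "--func", "B", "-v", "-o", "data/MyOutput.biorseo"],
)
print(" ".join(argv))
```

The list can be passed to `subprocess.run(argv, check=True)` on a machine where Docker and the image are available.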
 
 Option 2 : Compile and Install from source (without docker, Linux only)
 ==================================
 
 ### CLONING
- * Clone this git repository : `git clone https://forge.ibisc.univ-evry.fr/lbecquey/biorseo.git` (from the IBISC forge) or `git clone https://github.com/persalteas/biorseo.git` (from my personal GitHub, only while i am the current developer !) and `cd biorseo`.
- 
- * Create folders for the modules you will use: `mkdir -p data/modules/`. If you plan to use several module sources, add subdirectories :
- ```bash
- mkdir -p data/modules/BGSU
- mkdir -p data/modules/RIN
- mkdir -p data/modules/DESC
- mkdir -p data/modules/JSON
- ```
- 
- ### THE RNA 3D MOTIF ATLAS DATA
- 
- Get the latest version of the HL and IL module models from the [BGSU website](http://rna.bgsu.edu/data/jar3d/models/) and extract the Zip files. Put the HL and IL folders from inside the Zip files into `./data/modules/BGSU`. Note that only the latest Zip is required.
- 
- ### CARNAVAL DATA
- 
- You first need to have the `unzip` command installed on your machine and the `networkx` package installed for Python 3. Then just run the script `Install_CaRNAval_RINs.py`, this will create files into `./data/modules/RIN/Subfiles` :
- ```bash
- cd scripts
- python3 Install_CaRNAval_RINs.py
- ```
- If you do not have the unzip command, download and extract manually the [CaRNAval dataset](http://carnaval.lri.fr/carnaval_dataset.zip) and place the files `RIN.py` and `CaRNAval_1_as_dictionnary.nxpickled` in the folder `data/modules/RIN/`, and run the python script.
- 
- ### RNA3DMOTIFS DATA (DEPRECATED)
- 
- If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/modules/DESC`.
+ * Clone this Git repository : `git clone https://forge.ibisc.univ-evry.fr/lbecquey/biorseo.git` (from the IBISC forge) or `git clone https://github.com/persalteas/biorseo.git` (from my personal GitHub, only while I (Louis Becquey) am the current developer!) and `cd biorseo`.
 
 ### DEPENDENCIES
 - Make sure you have Python 3.7+ and a C++ compiler (tested with GCC and clang) installed on your distribution. Use a recent one, we use the 2017 C++ standard. The compilation will not work with Ubuntu 16's GCC 5.4 for example.
--- a/Readme.md
+++ b/Readme.md
--- a/SOURCES.md 0 → 100644
+++ b/SOURCES.md 0 → 100644
+ Install supported module sources
+ ==================================
+ Create folders for the modules you will use: `mkdir -p data/modules/`. If you plan to use several module sources, add subdirectories :
+ ```bash
+ mkdir -p data/modules/BGSU
+ mkdir -p data/modules/RIN
+ mkdir -p data/modules/DESC
+ mkdir -p data/modules/JSON
+ mkdir -p data/modules/CSV
+ ```
+ 
+ ## CUSTOM JSON- OR CSV-FORMATTED MODULES
+ Just add your JSON-formatted modules to `data/modules/JSON/mydatabase.json`, according to the following format :
+ ```
+ {
+     "1": {
+         "sequence": "ACUAGCG&GGCUA&GU",
+         "struct2d": "((((((.&.))))&))"
+     },
+     ...
+ }
+ ```
+ You can use `'&'` to indicate sequence discontinuity, which leads to several components in the module.
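As an illustration of the `'&'` convention, here is a minimal Python sketch that reads a module file in the format above and splits each module into its components. The function name `load_modules` and the length check are assumptions made for this example; this is not how Biorseo itself parses the file.

```python
import json


def load_modules(json_text):
    """Return {module_id: [(sequence_part, structure_part), ...]} (illustrative only)."""
    modules = {}
    for module_id, entry in json.loads(json_text).items():
        seq_parts = entry["sequence"].split("&")
        struct_parts = entry["struct2d"].split("&")
        # Each sequence component should line up with a structure component of equal length.
        if [len(p) for p in seq_parts] != [len(p) for p in struct_parts]:
            raise ValueError(f"module {module_id}: sequence and struct2d do not match")
        modules[module_id] = list(zip(seq_parts, struct_parts))
    return modules


EXAMPLE = '{"1": {"sequence": "ACUAGCG&GGCUA&GU", "struct2d": "((((((.&.))))&))"}}'
components = load_modules(EXAMPLE)["1"]
print(len(components), "components:", components)
```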
+ 
+ You can also add CSV-formatted insertion sites (for example, obtained with Jar3d or BayesPairing) to `data/modules/CSV`, following one of these formats:
+ 
+ ### The "BayesPairing" format:
+ Here k-loops can have any number of components k, and you have to specify the start and end coordinates of each one. The file should include the header.
+ ```
+ Motif,Score,Start1,End1,Start2,End2...
+ motif1name,-19,29,38
+ motif2name,-28,71,80,90,96
+ ...
+ ```
+ Entries with few components must not be padded with trailing commas (do not write `motif1name,-19,29,38,,`).
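A minimal Python sketch of a reader for this variable-width format, assuming the header line is present and that every component contributes exactly one start and one end column. The function name `parse_bayespairing_csv` is hypothetical and the code is not taken from Biorseo.

```python
def parse_bayespairing_csv(text):
    """Return a list of (motif, score, [(start, end), ...]) tuples (illustrative only)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    sites = []
    for line in lines[1:]:  # lines[0] is the header
        fields = line.split(",")
        name, score, coords = fields[0], float(fields[1]), fields[2:]
        # Reject padding commas and unpaired coordinates.
        if not coords or len(coords) % 2 != 0 or any(c == "" for c in coords):
            raise ValueError(f"malformed entry: {line!r}")
        positions = [int(c) for c in coords]
        sites.append((name, score, list(zip(positions[0::2], positions[1::2]))))
    return sites


EXAMPLE = """Motif,Score,Start1,End1,Start2,End2
motif1name,-19,29,38
motif2name,-28,71,80,90,96
"""
sites = parse_bayespairing_csv(EXAMPLE)
print(sites)
```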
+ 
+ ### The Jar3d format
+ Here the modules may only be 1-loops or 2-loops (HL or IL). There is a fixed number of columns per line, and undefined values are indicated with a dash `'-'`.
+ ```
+ Motif,Rotation,Score,Start1,End1,Start2,End2
+ IL_43115.1,True,66,30,32,55,57
+ HL_35894.1,False,63,42,47,-,-
+ ...
+ ```
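The fixed-width Jar3d layout can be read with Python's standard `csv` module. The sketch below assumes the header shown above and treats a dash as "no second component"; the function name `parse_jar3d_csv` is hypothetical and the code is not taken from Biorseo.

```python
import csv
import io


def parse_jar3d_csv(text):
    """Return one dict per insertion site (illustrative only)."""
    sites = []
    for row in csv.DictReader(io.StringIO(text)):
        components = []
        for k in ("1", "2"):
            start, end = row["Start" + k], row["End" + k]
            # A dash marks an undefined value, e.g. the second component of a hairpin loop.
            if start != "-" and end != "-":
                components.append((int(start), int(end)))
        sites.append({
            "motif": row["Motif"],
            "rotation": row["Rotation"] == "True",
            "score": float(row["Score"]),
            "components": components,
        })
    return sites


EXAMPLE = """Motif,Rotation,Score,Start1,End1,Start2,End2
IL_43115.1,True,66,30,32,55,57
HL_35894.1,False,63,42,47,-,-
"""
jar3d_sites = parse_jar3d_csv(EXAMPLE)
print(jar3d_sites)
```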
+ 
+ ## CARNAVAL DATA (*Reinharz et al, 2018*)
+ 
+ You first need to have the `unzip` command installed on your machine and the `networkx` package installed for Python 3. Then just run the script `Install_CaRNAval_RINs.py`.
+ 
+ If you have cloned the Git repository, just run :
+ ```bash
+ cd scripts
+ python3 Install_CaRNAval_RINs.py
+ ```
+ This will create files into `./data/modules/RIN/Subfiles`.
+ 
+ If not, or if you do not have the unzip command, manually download and extract the [CaRNAval dataset](http://carnaval.lri.fr/carnaval_dataset.zip), place the files `RIN.py` and `CaRNAval_1_as_dictionnary.nxpickled` in the folder `data/modules/RIN/`, and run the Python script.
+ 
+ *Note : CaRNAval is meant to be a dataset of long-distance contact modules, not of SSE modules. It was supported mostly for testing; since its modules are not loops, do not expect the best performance from it.*
+ 
+ ## THE RNA 3D MOTIF ATLAS DATA (*Petrov et al, 2013*, previously supported)
+ Source : see http://rna.bgsu.edu/rna3dhub/motifs/.
+ 
+ Get the latest version of the HL and IL module models from the [BGSU website](http://rna.bgsu.edu/data/jar3d/models/) and extract the Zip files. Put the HL and IL folders from inside the Zip files into `./data/modules/BGSU`. Note that only the latest Zip is required.
+ 
+ *Note : In Biorseo V1.0, you could use these modules directly because Biorseo was running Jar3d or BayesPairing for you. This is no longer the case: you need to run these tools separately and provide their results as a CSV file (see above for the CSV formats).*
+ 
+ ## RNA3DMOTIFS DATA (from the work of *Djelloul & Denise, 2008*, considered outdated)
+ 
+ If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc files from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Alternatively, you can run Rna3Dmotifs' `catalog` program to build your own DESC module collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)), then move the resulting DESC files into `./data/modules/DESC`.
\ No newline at end of file
--- a/scripts/benchmark_on_seq_length.py 100644 → 100755
+++ b/scripts/benchmark_on_seq_length.py 100644 → 100755
@@ -22,7 +22,7 @@ while step < len(seq)+50:
 	fasta.close()
 
 	# run biorseo on it, with default options
- 	cmd = ["./bin/biorseo", "-d", "./data/modules/DESC", "-s", "./ZDFS33.fa", "-v"]
+ 	cmd = ["./bin/biorseo", "-d", "./data/modules/DESC", "-s", "data/fasta/ZDFS33.fa", "-v"]
 	old_time = time.time()
 	output = subprocess.check_output(cmd, stderr=subprocess.DEVNULL).decode("utf-8").split("\n")[-5:]
 	run_time = time.time() - old_time
--- a/scripts/deploy_BiORSEO_docker_image_linux.sh deleted 100755 → 0
+++ b/scripts/deploy_BiORSEO_docker_image_linux.sh deleted 100755 → 0
- 
- #!/bin/bash
- ######################################################## RNA modules ##############################################################
- 
- cd ../
- 
- # Rna3Dmotifs data
- mkdir -p data/modules/DESC
- wget https://github.com/McGill-CSB/RNAMoIP/raw/master/CATALOGUE.tgz
- tar -xvzf CATALOGUE.tgz 
- mv No_Redondance_DESC/*.desc data/modules/DESC/
- rm -r No_Redondance_VIEW3D No_Redondance_DESC CATALOGUE.tgz
- 
- # The RNA 3D Motif Atlas
- mkdir -p data/modules/BGSU
- wget http://rna.bgsu.edu/data/jar3d/models/HL/HL_3.2_models.zip
- unzip HL_3.2_models.zip
- mv HL data/modules/BGSU
- rm HL_3.2_models.zip
- wget http://rna.bgsu.edu/data/jar3d/models/IL/IL_3.2_models.zip
- unzip IL_3.2_models.zip
- mv IL data/modules/BGSU
- rm IL_3.2_models.zip
- 
- # Install BayesPairing
- sudo -H pip3 install --upgrade pip
- sudo -H pip3 install networkx numpy regex wrapt biopython
- git clone http://jwgitlab.cs.mcgill.ca/sarrazin/rnabayespairing.git BayesPairing
- cd BayesPairing
- sudo -H pip3 install .
- 
- # Train Bayes Pairing (it has been installed on the image and the source has been deleted, we train the models now, and will remount it as volume at run time)
- cd bayespairing/src
- python3 parse_sequences.py -d rna3dmotif -seq ACACGGGGUAAGAGCUGAACGCAUCUAAGCUCGAAACCCACUUGGAAAAGAGACACCGCCGAGGUCCCGCGUACAAGACGCGGUCGAUAGACUCGGGGUGUGCGCGUCGAGGUAACGAGACGUUAAGCCCACGAGCACUAACAGACCAAAGCCAUCAU -ss ".................................................................((...............)xxxx(...................................................)xxx).............."
- python3 parse_sequences.py -d 3dmotifatlas -seq ACACGGGGUAAGAGCUGAACGCAUCUAAGCUCGAAACCCACUUGGAAAAGAGACACCGCCGAGGUCCCGCGUACAAGACGCGGUCGAUAGACUCGGGGUGUGCGCGUCGAGGUAACGAGACGUUAAGCCCACGAGCACUAACAGACCAAAGCCAUCAU -ss ".................................................................((...............)xxxx(...................................................)xxx).............."
- cd ../../..
- 
- ######################################################## Run it ##############################################################
- 
- # docker run -v `pwd`/data/modules:/modules -v `pwd`/BayesPairing/bayespairing:/byp -v `pwd`/results:/biorseo/results biorseo ./biorseo.py -i /biorseo/data/fasta/applications.fa --rna3dmotifs --patternmatch --func B
\ No newline at end of file