Louis BECQUEY

Updated Docker and documentation

results_*
results/
build_BiORSEO_docker_image_ubuntu18.sh
deploy_BiORSEO_docker_image_linux.sh
INSTALL.md
Readme.md
benchmark_results/
*.gz
*.pickle
log_of_the_run.sh
\ No newline at end of file
Changelog
=========================
### Biorseo 2.1 (Nov 2021)
This is an official, tested, release of Biorseo 2:
- replacing Nupack's dynamic programming scheme supporting simple pseudoknots by ViennaRNA's window-based scheme, which does not support pseudoknots or long-distance contacts but allows to test much longer sequences,
- supporting RINs with no issues,
- supporting custom modules in JSON format (to be detected in sequences by regular expressions), thanks to Nathalie Bernard
- not running Jar3d or BayesPairing for you anymore. This simplifies a lot the code management (replacing a pipeline by the C++ tool only). Jar3d is getting older, does not support very complex modules, and is biaised because it takes as input loops (not the whole sequence). Therefore, you have to give biorseo the answer as input ! BayesPairing 2.0 is evolving itself into a module-placement tool in secondary structures taking eneries into account (and now comparative information), it is a non sense to include it *within* Biorseo. Approaches should be compared and benchmarked instead. But, you can still use the ouputs of this tools as input for biorseo if you like.
- introducing the MFE criterion (thanks to Nathalie Bernard),
- introducing the Biokop-mode,
- with a much simpler and lighter installation process.
Biorseo 2.1 is availbale as a docker container and as a git branch called "biorseo2".
It is the last version supported by Louis Becquey.
### Biorseo 2.0
This was an unofficial, unsupported and unpublished version after the internship of Lénaic Durand at IBISC.
This version
- replaces Nupack 3.2 with ViennaRNA to compute the pairing probabilities, thanks to Lénaic,
- introduces early support for BayesPairing 2.0, which was still unofficial too at the time,
- supports CaRNAval RINs,
- but has issues with the constraints to assert RIN basepairs are respected.
Results from this version are published in [Louis Becquey's thesis](https://tel.archives-ouvertes.fr/tel-03440181).
### Biorseo 1.2 (2019) and Biorseo 1.5 (2020)
These brought some improvements, fixing numerical issues, and other technical improvements.
Biorseo 1.2 is still available as a docker, and the 1.5 is available as a Git branch called 'biorseo1'.
### Biorseo 1.0 (2018)
This was the first version published for the paper [*Becquey et al. 2020*.](https://doi.org/10.1093/bioinformatics/btz962)
\ No newline at end of file
......@@ -8,7 +8,7 @@
FROM ubuntu:focal
# compiled biorseo
COPY . /biorseo/
COPY ./bin /workdir/
# Install runtime dependencies
RUN apt-get update -yq && \
......@@ -16,4 +16,5 @@ RUN apt-get update -yq && \
apt-get install -y libboost-program-options-dev libboost-filesystem-dev && \
rm -rf /var/lib/apt/lists/*
WORKDIR /biorseo
\ No newline at end of file
WORKDIR /workdir
ENTRYPOINT ["/workdir/biorseo"]
......
Option 1 : Installation using docker image (Windows, Mac, Linux)
==================================
* Clone this git repository : `git clone https://github.com/persalteas/biorseo.git` , or download the .zip archive from a BiORSEO release and extract it.
* Move into the repository ( `cd biorseo` )
### Install Docker:
* See the officiel instructions depending on your OS here : https://docs.docker.com/install/
......@@ -13,60 +11,40 @@ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubun
sudo apt update && sudo apt install docker-ce docker-ce-cli containerd.io
```
### Create a separated folder
You will need a folder that is mounted in the Docker container (shared between your filesystem and the container's one). For example, create a `data` folder. It should be the place where you place your FASTA input, your modules, and where Biorseo should place the output.
```
mkdir data
```
### Download and install the RNA motifs data files:
* Move your JSON-formatted or CSV-formatted files containing motifs in the folder.
* If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/modules/DESC`.
* Move your JSON-formatted or CSV-formatted files containing motifs in a `./data/JSON` or `./data/CSV` folder.
* If you use Rna3Dmotifs (easy, but outdated), you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/DESC`.
* If you use CaRNAval (which is supposed to be a long-distance contact module dataset, not a SSE module dataset), use the script [scripts/Install_CaRNAval_RINs.py](scripts/Install_CaRNAval_RINs.py) :
`python3 Install_CaRNAval_RINs.py`. This will create a `/../data/modules/RIN/` folder (because the script is supposed to be run from the repo's `scripts` subfolder)
### Download the docker image from Docker Hub
`docker pull persalteas/biorseo:latest`
```
docker pull persalteas/biorseo:latest
```
### Run the docker image
Use the following command to run the docker image:
```
$ docker run
-v `pwd`/data/modules:/modules
-v `pwd`/data/fasta:/biorseo/data/fasta
-v `pwd`/results:/biorseo/results
persalteas/biorseo
yourexamplejobcommandhere
$ docker run -v `pwd`/data:/workdir/data persalteas/biorseo [optionshere]
```
You can replace \`pwd\` by the full path of the biorseo/ root folder. Here we launch the biorseo image with 4 volumes : A first to give BiORSEO access to the module files, a second to give it access to your input file(s), a third for your trained BayesPairing, and a last for it to output the result files of your job. Considering you place your input file 'MyFastaFile.fa' into the `data/fasta` folder, an example job command can be ` ./biorseo.py -i /biorseo/data/fasta/myFastaFile.fa --rna3dmotifs --func B`, so the full run command would be
You can replace \`pwd\`/data by the full path to your data folder. Assuming you place your input file 'MyFastaFile.fa' into the `data/fasta` folder, an example job command can be :
```
$ docker run -v `pwd`/data/modules:/modules -v `pwd`/data/fasta:/biorseo/data/fasta -v `pwd`/results:/biorseo/results persalteas/biorseo ./bin/biorseo -s /biorseo/data/fasta/applications.fa --descfolder /biorseo/data/modules/DESC --func B -v
$ docker run -v `pwd`/data:/workdir/data persalteas/biorseo -s data/fasta/MyFastaFile.fa --descfolder data/DESC --func B -v -o data/MyOutput.biorseo
```
Note that the paths to the input and output files are paths *inside the Docker container*, and those paths are mounted to folders of the host machine with -v options.
Note that the paths to the input and output files are paths *inside the Docker container*, and those paths are mounted to the data folder of the host machine with the -v option.
Option 2 : Compile and Install from source (without docker, Linux only)
==================================
### CLONING
* Clone this git repository : `git clone https://forge.ibisc.univ-evry.fr/lbecquey/biorseo.git` (from the IBISC forge) or `git clone https://github.com/persalteas/biorseo.git` (from my personal GitHub, only while i am the current developer !) and `cd biorseo`.
* Create folders for the modules you will use: `mkdir -p data/modules/`. If you plan to use several module sources, add subdirectories :
```bash
mkdir -p data/modules/BGSU
mkdir -p data/modules/RIN
mkdir -p data/modules/DESC
mkdir -p data/modules/JSON
```
### THE RNA 3D MOTIF ATLAS DATA
Get the latest version of the HL and IL module models from the [BGSU website](http://rna.bgsu.edu/data/jar3d/models/) and extract the Zip files. Put the HL and IL folders from inside the Zip files into `./data/modules/BGSU`. Note that only the latest Zip is required.
### CARNAVAL DATA
You first need to have the `unzip` command installed on your machine and the `networkx` package installed for Python 3. Then just run the script `Install_CaRNAval_RINs.py`, this will create files into `./data/modules/RIN/Subfiles` :
```bash
cd scripts
python3 Install_CaRNAval_RINs.py
```
If you do not have the unzip command, download and extract manually the [CaRNAval dataset](http://carnaval.lri.fr/carnaval_dataset.zip) and place the files `RIN.py` and `CaRNAval_1_as_dictionnary.nxpickled` in the folder `data/modules/RIN/`, and run the python script.
### RNA3DMOTIFS DATA (DEPRECATED)
If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/modules/DESC`.
* Clone this git repository : `git clone https://forge.ibisc.univ-evry.fr/lbecquey/biorseo.git` (from the IBISC forge) or `git clone https://github.com/persalteas/biorseo.git` (from my personal GitHub, only while i (Louis Becquey) am the current developer !) and `cd biorseo`.
### DEPENDENCIES
- Make sure you have Python 3.7+ and a C++ compiler (tested with GCC and clang) installed on your distribution. Use a recent one, we use the 2017 C++ standard. The compilation will not work with Ubuntu 16's GCC 5.4 for example.
......
......@@ -4,134 +4,348 @@ Biorseo (Bi-Objective RNA Structure Efficient Optimizer)
This tool predicts the secondary structure of a RNA sequence with pieces of 3D information (non-canonical contacts) at some places,
by identifying zones that can fold like known modules from data like the RNA 3D Motif Atlas or Rna3Dmotifs.
Contact : louis.becquey@univ-evry.fr
Contact : louis.becquey@univ-evry.fr (deprecated), fariza.tahi@univ-evry.fr (head of team).
1/ How it works
===================================
INPUT:
- An RNA sequence (with 16 GB of RAM you can go up to ~230 bases)
- An RNA sequence (you can go up to ~230 bases, also depends on your hardware, patience, and module average size)
THEN
- **Pattern-matching step** : Find all possible occurrences of known RNAmodules in the query sequence, by finding subsequences of the query that score well with the probabilistic models of the modules (like JAR3D, or BayesPairing)
- **Constraints definition step** : Define constraints on the secondary structure imposed by modules if they would be included (in this case, the loop closing base-pairs are mandatory)
- **Solve a bi-objective IP problem** : Find a secondary structure that satisfies as much as possible both the expected accuracy of the structure and a criterion taking into account module inclusions, by solving a bi-objective integer linear programming problem, using the previous constraints defined in the previous step.
- **Pattern-matching step** : Find all possible occurrences of known RNA modules in the query sequence, by finding subsequences of the query that match the modules. Our approach is the regular-expression based approach from RNA-MoIP (*Reinharz et al, 2012*), we deal the same way with short components and wildcards. This step can also be delegated to external tools (like Jar3d or BayesPairing) which will use probabilistic models to score the modules insertion sites.
- **Constraints definition step** : Define constraints on the secondary structure imposed by modules if they would be included (example, the loop closing base-pairs for modules that are secondary-structure elements).
- **Solve a bi-objective IP problem** : Find a secondary structure that satisfies as much as possible both an energy-based criterion (e.g. minimize energy, or maximize expected accracy), and a criterion taking into account module inclusions, by solving a bi-objective integer linear programming problem, using the previous constraints defined in the previous step.
OUTPUT:
- A set of secondary structures from the Pareto front,
- The list of known modules inserted inplace in the corresponding structures
- A set of positions of the nucleotides in contact with the protein represented by asterisks (only if the motifs_28-05-2021.json library is used!)
- The list of known modules inserted inplace in the corresponding structures,
- The scores of the secondary structures on the used objective functions.
2/ The different models
==================================
MODULE SOURCES
### MODULE SOURCES
Biorseo can be used with two modules datasets (yet):
* Rna3Dmotifs (from the work of *Djelloul & Denise, 2008*)
* The RNA 3D Motif Atlas of BGSU's RNA lab (*Petrov et al, 2013*, see http://rna.bgsu.edu/rna3dhub/motifs/)
* CaRNAval 1.0 (*Reinhartz et al, 2018*)
* /data/modules/ISAURE/motifs_28-05-2021.json a library of motifs from RNA linked to a protein from Isaure Chauvot de Beauchêne of LORIA laboratory
(contact:isaure.chauvot-de-beauchene@loria.fr)
See supported module sources in file [SOURCES.md](SOURCES.md).
PATTERN MATCHING STEP
- Use **simple pattern matching**. Rna3Dmotifs modules are available with sequence information. We use regular expressions to find those known loops in your query. This is the approach of RNA-MoIP (*Reinharz et al, 2012*), we deal the same way with short components and wildcards.
### OBJECTIVE FUNCTIONS TO OPTIMIZE
- Use **JAR3D**. The RNA 3D Motif Atlas modules can be scored against a given loop sequence by an hybrid SCFG/MRF method (*Zirbel et al, 2015*). This first requires to identify potential loops, which is achieved by a run of RNAsubopt first.
First, the energy-based objective functions :
* **MEA** : Try to maximize the expected accuracy of the secondary structure. This should lead to the MEA secondary structure estimator. Note that in practice, Biorseo maximizes the sum of observed basepairs' probabilities, therefore the scores returned are not interpretable as expected accuracies.
* **MFE** : Try to minimize the free energy of the secondary structure. The free energy model is very simple and considers sequence-dependent contributions for each stack of basepairs. This should lead to the MFE secondary structure estimator. Note that in practice, Biorseo maximizes the opposite of the energy, so the given score can be seen as the opposite of energy (which should be negative).
- Use **Bayesian networks with BayesPairing**. To accurately model probability dependancies between nucleotides, one can use BayesPairing to build bayesian networks of the modules (the RNA 3D Motif Atlas and Rna3Dmotifs are both supported). Then, sequences are sampled with the Bayesian network of a module, and we use regular expressions to find them in your query.
OBJECTIVE FUNCTIONS FOR THE MODULE INSERTION CRITERIA
These approaches are the IPknot (*Sato et al. 2011*) and Biokop (*Legendre et al. 2018*) approaches.
Then, the objective functions for the insertion of modules :
* **Function A** : weights a module by its squared number of nucleotides (like RNA-MoIP).
* **Function B** : weights a module by its number of components (strands) and penalizes it by the log^(_2) of its nucleotide size.
* **Function C** : weights a module by its insertion site score (JAR3D or BayesPairing score).
* **Function D** : weights a module by its number of components (strands) and insertion site score (JAR3D or BayesPairing score), and penalizes it by the log^(_2) of its nucleotide size.
* **Function E** : weights a module by its nucleotides in contact with a protein, number of occurences and number of nucleotides in the module.
* **Function F** : weights a module by its nucleotides in contact with a protein, number of occurences and number of nucleotides along the entire length of the RNA.
3/ Installation
==================================
Check the file [INSTALL.md](INSTALL.md) for installation instructions.
4/ Recommended uses
3/ Installation
==================================
- If **you know you have no pseudoknot**:
* Benchmarks show Biorseo does not perform better than simpler tools like RNAsubopt alone. Please use RNAsubopt (ViennaRNA package) or Fold (RNAstructure package).
You can install Biorseo using Docker, or compile it from source. Check the file [INSTALL.md](INSTALL.md) for installation instructions.
- If you **might expect a pseudoknot, or don't know**:
* The most promising method is the use of direct pattern matching with Rna3Dmotifs and function A. But this method is sometimes subject to combinatorial explosion issues. If you have a long RNA or a large number of loops, don't use it. Example:
`./biorseo.py -i PDB_00304.fa -O resultsFolder/ --rna3dmotifs --patternmatch --func A --MEA`
* The use of the RNA 3D Motif Atlas placed by JAR3D and scored with function A is not subject to combinatorial issues, but performs a bit worse. It also returns less solutions. Example:
`./biorseo.py -i PDB_00304.fa -O resultsFolder/ --3dmotifatlas --jar3d --func A --MEA
5/ List of Options
4/ How to run
==================================
The command to run is different depending on your installation method, see [INSTALL.md](INSTALL.md). But both methods allow you to define options to get the desired behavior.
```
Usage: You must provide:
1) a FASTA input file with -i,
2) a module type with --rna3dmotifs, --carnaval, --3dmotifatlas or --contacts
3) one module placement method in { --patternmatch, --jar3d, --bayespairing }
4) one scoring function with --func A, B, C, D, E ou F
5) one estimator betwenn --MEA or --MFE
If you are not using the Docker image:
6) --modules-path, --biorseo-dir and (--jar3d-exec or --bypdir)
1) a FASTA input file with -s,
2) a module type with --rna3dmotifs, --carnaval, --json or --pre-placed,
3) one module-based scoring function with --func A, B, C, or D
4) one energy-based scoring function with --mfe or --mea,
5) how to display results: in console (-v), or in a result file (-o).
Options:
-h [ --help ] Print this help message
--version Print the program version
-i [ --seq=… ] FASTA file with the query RNA sequence
--rna3dmotifs Use DESC modules from Djelloul & Denise, 2008
--carnaval Use RIN modules from Reinharz & al, 2018
--3dmotifatlas Use the HL and IL loops from BGSU's 3D Motif Atlas (updated)
--contacts Use the library of motifs, created from RNA sequences linked to proteins provided by I. Chauvot de Beauchene of LORIA laboratory
-p [ --patternmatch ] Use regular expressions to place modules in the sequence (requires --rna3dmotifs or --carnaval)
-j [ --jar3d ] Use JAR3D to place modules in the sequence (requires --3dmotifatlas)
-b [ --bayespairing ] Use BayesPairing2 to place modules in the sequence (requires --rna3dmotifs or --3dmotifatlas)
-o [ --output=… ] File to summarize the results
-O [ --outputf=… ] Folder where to output result and temp files
-f [ --func=… ] (A, B, C, D, E or F default is B) Objective function to score module insertions:
(A) insert big modules (B) insert light, high-order modules
(C) insert modules which score well with the sequence
(D) insert light, high-order modules which score well with the sequence.
C and D cannot be used with --patternmatch.
(E) and (F) insert modules with a lot of nucleotides and a lot of nucleotides in contact with a proteine, and a huge number of occurences.
(E) maximize the number of contact nucleotide inside the module, while (F) maximize the number of contact nucleotide along the entire length of the RNA.
--MEA Use Maximum Expected Accuracy for the second objective
--MFE Use Minimum Free Energy based on the formula of (*Legendre et al., 2018*) for the second objective
-c [ --first-objective=… ] (default 1) Objective to solve in the mono-objective portions of the algorithm.
(1) is the module objective given by --func, (2) is the expected accuracy of the structure.
-l [ --limit=… ] (default 500) Number of solutions in the Pareto set from which
we give up the computation, before eliminating secondary structure doublons.
-t [ --theta=… ] (default 0.001) Pairing-probability threshold to consider or not the possibility of pairing
-n [ --disable-pseudoknots ] Add constraints to explicitly forbid the formation of pseudoknots
-v [ --verbose ] Print what is happening to stdout
--modules-path=… Path to the modules data.
The folder should contain modules in the DESC format as output by Djelloul & Denise's
'catalog' program for use with --rna3dmotifs, or the IL/ and HL/ folders
from a release of the RNA 3D Motif Atlas for use with --3dmotifatlas, or the
data/modules/RIN/Subfiles/ folder for use with --carnaval.
Consider placing these files on a fast I/O device (NVMe SSD, ...)
--jar3d-exec=… Path to the jar3d executable.
Default is /jar3d_2014-12-11.jar, you should use this option if you run
BiORSEO from outside the docker image.
--bypdir=… Path to the BayesParing src folder.
Default is /byp/src, you should use this option if you run
BiORSEO from outside the docker image.
--biorseo-dir=… Path to the BiORSEO root directory.
Default is /biorseo, you should use this option if you run
BiORSEO from outside the docker image. Use the FULL path.
Examples:
biorseo.py -i myRNA.fa -O myResultsFolder/ --rna3dmotifs --patternmatch --func B --MEA
biorseo.py -i myRNA.fa -O myResultsFolder/ --3dmotifatlas --jar3d --func B -l 800 --MEA
biorseo.py -i myRNA.fa -v --3dmotifatlas --bayespairing --func D --MEA
The allowed module/placement-method/function combinations are:
--patternmatch --bayespairing --jar3d
--rna3dmotifs A. B. A. B. C. D.
--3dmotifatlas A. B. C. D. A. B. C. D.
--carnaval A. B.
--contacts E. F.
-h [ --help ] Print the help message
--version Print the program version
-s [ --seq ] arg Fasta file containing the RNA sequence
-d [ --descfolder ] arg A folder containing modules in .desc
format, as produced by Djelloul & Denise's
catalog program (deprecated)
-r [ --rinfolder ] arg A folder containing CaRNAval's RINs in .txt
format, as produced by script
transform_caRNAval_pickle.py
-j [ --jsonfolder ] arg A folder containing a custom motif library
in .json format
-x [ --pre-placed ] arg A CSV file providing motif insertion sites
obtained with another tool.
-f [ --function ] arg (=B) (A, B, C, or D) Objective function to score
module insertions:
(A) insert big modules
(B) light, high-order modules
(C) well-scored modules
(D) light, high-order, well-scored
modules
C and D require position scores
provided by --pre-placed.
-E [ --mfe ] Minimize stacking energies
(leads to MFE extimator)
-A [ --mea ] (default) Maximize expected accuracy
(leads to MEA estimator)
-c [ --first-objective ] arg (=2) (1 or 2) Objective to solve in the
mono-objective portions of the algorithm.
(1) is the module objective,
given by --function
(2) is energy-based objective,
either MFE or MEA
-o [ --output ] arg A file to summarize the computation results
-t [ --theta ] arg (=0.001) Pairing probability threshold to consider
or not the possibility of pairing
-n [ --disable-pseudoknots ] Add constraints forbidding the formation of
pseudoknots
-l [ --limit ] arg (=500) Intermediate number of solutions in the
Pareto set above which we give up the
calculation.
```
## Run in Biokop-mode :
If you run Biorseo with both options `--mfe` and `--mea`, the biobjective optimisation problem will be defined without modules, comparing the two energy-based criterions together. This should be equivalent to run the Biokop tool (*Legendre et al. 2018*) with only one Pareto-set (option `biokop -n 1`).
5/ Example output and interpretation
================================
Let's consider an example input FASTA file (data/fasta/example.fa):
```
>PDB_00376
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
```
By running `docker run -v `pwd`/data:/workdir/data persalteas/biorseo -s data/fasta/example.fa --descfolder data/modules/DESC --func B -v -o data/MyOutput.biorseo`, i get many information about what is happening:
### BASEPAIR PROBABILITIES
```
Summary of basepair probabilities:
=== -log10(p(i,j)) for each pair (i,j) of nucleotides: ===
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
4 3 5 3 3 4 530 43 G
45 34 5 34 44 5303 33 G
54 43 4 44 443 3033 33 G
4 3 5 3 4 443 0334 3 G
3 5 1343 4 4 4 0 43 U
0 3 4 0 A
55 50 3243 3 5 3 5 4 5 5 0 5 U
3 03 34 55 3 5 44 4 5 1 C
0 0 1 55 53 23 5 5 5443 4 G
0 40 51 5 52 5 5 4 C
04 1 5 25 5 5 5 C
3 45 5 4 5 A
3 5 54 5 4 A
3 2 55 4 5 4 5 5 5 G
42 3 5 54 3 5 5 C
4 43 5 5 5 3 45 5 5 4 G
2 35 5 4 535 45 5 5 4 G
3 3 5 5545 4 5 4 U
4 5 A
5 3 4 5 A
44 5 4 10 44 4 5 54 1 G
5 5 4 03 445 4 5 4 54 1 G
5 10 4 4 1 C
5 3 4 5 A
40 33 54 5 C
4 03 3 4 4 5 C
4 40 4 5 2 4 5 G
4 305 43 5 2 4 5 G
5 20 3 5 2 A
32 533 5 2 U
1 345 5 2 U
53 5 2 C
3 5 5 4 U
3 5 33 G
5 33 A
5 43 U
5 433 U
5 4 34 C
55 3 4 C
3 4 5 52 5 G
34 45 1 523 5 G
4 1 4 A
53155 4 2134 4 G
31455 2 134 4 G
13 5 1 4 U
1 0 C
51 4450 5 54 G
1 4 0 A
4 3045 3 G
30455 3 G
31 3 U
1 5 U
3 C
55 3 G
3 A
A
U
C
C
U
C
G
U
A
C
C
C
C
A
G
C
C
A
green basepairs are kept as decision variables.
```
This triangular matrix gives you the probability that two nucleotides are canonically paired. For example, a 5 means there is a 10^-5 probability that the two corresponding bases are paired. For this reason, Biorseo does not consider all possible basepairs, but only the most probable ones, highlighted in green. You can set the threshold by using the `--theta` parameter in the command line options, default is 0.001.
*Note : these probabilities are computed using ViennaRNA's [vrna_pfl_fold](https://www.tbi.univie.ac.at/RNA/ViennaRNA/doc/html/group__part__func__window.html) routine, which considers windows of size 100 and only allows basepairs within these windows of size 100. #TODO: allow user to customize window size.*
### PREPARING ILP DECISION VARIABLES AND CONSTRAINTS
Then the program defines the variables :
```
Defining problem decision variables...
> Legal basepairs : 0-67 1-66 2-65 3-64 4-18 4-63 5-17 5-62 6-16 6-19 6-61 7-15 7-61 8-14 8-17 8-22 8-44 9-13 9-16 9-21 9-43 10-15 10-20 10-42 13-22 14-21 16-22 20-44 20-45 20-71 21-44 21-70 22-42 22-43 22-69 24-40 25-39 26-38 26-58 27-37 27-57 28-35 28-36 28-56 29-34 29-55 30-34 30-54 31-53 39-67 40-60 40-66 41-59 42-58 42-64 42-65 43-57 43-62 43-64 44-54 44-63 45-53 45-61 46-52 46-60 47-51 47-59 48-58 49-57 50-55 51-55
> The possible stacks of two base pairs (i,j) and (i+1,j-1) : 0-67 1-66 2-65 3-64 4-18 4-63 5-17 5-62 6-16 7-15 8-14 8-17 8-22 8-44 9-16 9-21 9-43 13-22 20-45 20-71 21-44 21-70 24-40 25-39 26-38 26-58 27-37 27-57 28-35 28-56 29-55 30-54 39-67 40-60 41-59 42-58 42-65 43-64 44-54 45-53 45-61 46-52 46-60 47-59 48-58
```
This summarizes the possible basepairs and stacks, depending on the previous probability matrix.
```
> Looking for insertion sites...
> Ignoring motif "1Z7F.A.2", hairpin (terminal) loops must be at least of size 3 !
> Ignoring motif "3L26.C.2", hairpin (terminal) loops must be at least of size 3 !
...
```
Depending on you module data source, you may then get messages about your modules, informing you if they could be placed in the sequence or not, sometimes because of errors. The results are then summarized :
```
> 262 candidate motifs on 4695 (53 ignored motifs),
50 insertion sites kept after applying probability threshold of 0.001
> Allowed candidate insertion sites:
> 3CUL.D.6 3CUL.D.6 ( 48-58 ) 1 components: ........... basepairs:
> 3IZF.A.2 3IZF.A.2 ( 24-26 38-40 ) 2 components: ... ... basepairs:
...
```
*Note : for modules that are loops (DESC or BGSU's ones), the loop closing basepairs ARE considered even if they do not display here.*
This means, we had 4695 modules in the dataset, 53 of them were invalid, and we could potentially place 262 of them in the input sequence (not at the same time !). The optimisation program will now decide which of them we keep.
The program then defines constraints for the integer-linear-program.
```
> 71 + 45 + 93 (yuv + xuv + Cpxi) decision variables are used.
> ensuring there are at most 1 pairing by nucleotide...
...
> ensuring that the stacks are correct...
...
> forbidding lonely basepairs...
...
> forbidding basepairs inside included motif's components...
...
> forbidding component overlap...
...
> ensuring that motives cannot be partially included...
...
> forcing basepairs imposed by a module insertion...
...
A total of 492 constraints are used.
```
### SOLVING
The program starts by solving the two objectives independantly :
```
Solving...
> Solution status: objective values (3.349, 15.0427)
> Solution status: objective values (0, 19.1801)
Best solution according to objective 1 :(((((((.((...)).))...((.((.((.......)).))..))((.((.......)).)).)))))..... + 1VOY.B.63 + 1XJR.A.3 + 1YI2.0.158 + 2LA5.A.1 + 2RD2.B.3 3.3489961 15.0426987
Best solution according to objective 2 :((((((((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))))))))..... 0.0000000 19.1801369
```
Then, it will limit iteratively the criterion space (approach similar to the epsilon-constraint method) using constraints, and find new solutions to add to the Pareto front, or sometimes discard some dominated ones. This can last several hundreds or thousand iterations.
```
...
Solving objective function 2, on top of 1.027879251: Obj1 being in [1.027889251, 3.34900609441]...
> Solution status: objective values (1.48611998869, 18.2887755166), not dominated.
> adding structure to Pareto set : ((((.(((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 1.4861200 18.2887755
...
```
### RESULTS
Finally, you get the Pareto set:
```
---------------------------------------------------------------
Whole Pareto Set:
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
((((((((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))))))))..... + 1XJR.A.3 0.7124144 19.1801369
--- ---- 1XJR.A.3 ( 22-24 40-43 )
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
((((((((((...)))....(((.(((((.......)))))..)))((((.......)))))))))))..... + 1XJR.A.3 + 2RD2.B.3 1.0278793 19.1770336
--- ---- 1XJR.A.3 ( 22-24 40-43 )
--------- 2RD2.B.3 ( 28-36 )
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
((((.(((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 1.4861200 18.2887755
--- --- 1VOY.B.61 ( 3-5 62-64 )
--- ---- 1XJR.A.3 ( 22-24 40-43 )
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
((((.(((((...)))....(((.(((((.......)))))..)))((((.......)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3 1.8015849 18.2856722
--- --- 1VOY.B.61 ( 3-5 62-64 )
--- ---- 1XJR.A.3 ( 22-24 40-43 )
--------- 2RD2.B.3 ( 28-36 )
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
((((.(((((...)))....(((.(((((.......)))))..)))(((.........))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3 + 3CUL.D.6 2.0906497 17.3138502
--- --- 1VOY.B.61 ( 3-5 62-64 )
--- ---- 1XJR.A.3 ( 22-24 40-43 )
--------- 2RD2.B.3 ( 28-36 )
----------- 3CUL.D.6 ( 48-58 )
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
((((.(((((...)))....(((.((.(([[[....)).))..)))((((...]]].)))))).))))..... + 1FFK.0.86 + 1VOY.B.61 + 1XJR.A.3 2.2598256 17.2909202
--- --- 1FFK.0.86 ( 25-27 37-39 )
--- --- 1VOY.B.61 ( 3-5 62-64 )
--- ---- 1XJR.A.3 ( 22-24 40-43 )
and several more...
```
There can be up to tens of solutions in here, depending mostly on your sequence length and average module size. For each solution, you get a secondary structure, and the corresponding insertion sites of some modules that match the sequence (they may have several components, which means several sequence portions are concerned by the same module). You get the insertion coordinates for each module. Also note the two criterion scores of the solution next to the secondary structure and the list of its modules.
The solutions are summarized in the result file, without the insertion sites locations :
```
$ cat data/MyOutput.biorseo
PDB_00376
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
((((((((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))))))))..... + 1XJR.A.3 0.7124144 19.1801369
((((((((((...)))....(((.(((((.......)))))..)))((((.......)))))))))))..... + 1XJR.A.3 + 2RD2.B.3 1.0278793 19.1770336
((((.(((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 1.4861200 18.2887755
((((.(((((...)))....(((.(((((.......)))))..)))((((.......)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3 1.8015849 18.2856722
((((.(((((...)))....(((.(((((.......)))))..)))(((.........))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3 + 3CUL.D.6 2.0906497 17.3138502
((((.(((((...)))....(((.((.(([[[....)).))..)))((((...]]].)))))).))))..... + 1FFK.0.86 + 1VOY.B.61 + 1XJR.A.3 2.2598256 17.2909202
((((.(((((...)))....(((.((.((.......)).))..)))((((.......)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2LA5.A.1 + 2RD2.B.3 2.5752905 17.2878169
((((.(((((...)))....(((.((.((.......)).))..)))(((.........))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2LA5.A.1 + 2RD2.B.3 + 3CUL.D.6 2.8643553 16.3159949
(((.((((((...))))....((.((.(([[[....)).))..))((.((...]]].)).)))).)))..... + 1FFK.0.76 + 1XJR.A.3 + 2JLT.B.1 + 2LA5.A.1 3.0335312 15.1786201
(((.((((((...))))....((.((.((.......)).))..))((.((.......)).)))).)))..... + 1XJR.A.3 + 2JLT.B.1 + 2LA5.A.1 + 2RD2.B.3 + 430D.A.1 3.3489961 15.1755168
(((.((((((...))))....((.((.(([[[....)).))..))((.((...]]].)).)))).)))..... + 1N36.A.99 + 1XJR.A.3 + 1YI2.0.158 + 2LA5.A.1 3.0335312 15.1786201
((((.(((((...)))....(((.((.((.......)).))..)))(((.........))))).))))..... + 1SER.T.3 + 1VOY.B.61 + 1XJR.A.3 + 2LA5.A.1 + 2RD2.B.3 2.8643553 16.3159949
((((.(((((...)))....(((.((.((.......)).))..)))((((.......)))))).))))..... + 1FFK.0.86 + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3 2.5752905 17.2878169
((((.(((((...)))....(((.((.(([[[....)).))..)))((((...]]].)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2LA5.A.1 2.2598256 17.2909202
((((.(((((...)))....(((.(((((.......)))))..)))(((.........))))).))))..... + 1SER.T.3 + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3 2.0906497 17.3138502
```
You still get the secondary structure, the name of the modules, and the two scores.
*Note : on linux/mac, if you ran Biorseo using Docker, the produced file MyResult.biorseo is owned by root. You may need to change permissions or ownership to access it with a regular user account.*
......
Install supported module sources
==================================
Create folders for the modules you will use: `mkdir -p data/modules/`. If you plan to use several module sources, add subdirectories :
```bash
mkdir -p data/modules/BGSU
mkdir -p data/modules/RIN
mkdir -p data/modules/DESC
mkdir -p data/modules/JSON
mkdir -p data/modules/CSV
```
## CUSTOM JSON- OR CSV-FORMATTED MODULES
Just add you JSON-formatted modules to `data/modules/JSON/mydatabase.json`, according to the following format :
```
{
"1": {
"sequence": "ACUAGCG&GGCUA&GU",
"struct2d": "((((((.&.))))&))"
},
...
}
```
You can use `'&'` to indicate sequence discontinuity, which leads to several components in the module.
You can also use CSV-formatted insertion sites (for example, obtained with Jar3d or BayesPairing) to `data/modules/CSV`, following one of these formats:
### The "BayesPairing" format:
Here k-loops can have any number of components k, you have to precise the start and end coordinates of each. The file should include the header.
```
Motif,Score,Start1,End1,Start2,End2...
motif1name,-19,29,38
motif2name,-28,71,80,90,96
...
```
Entries may not accumulate useless commas if they have a low number of components (don't `motif1name,-19,29,38,,`)
### The Jar3d format
Here the modules may only be 1-loops or 2-loops (HL or IL). There is a fixed number of columns per line, and undefined values are indicated with a dash `'-'`.
```
Motif,Rotation,Score,Start1,End1,Start2,End2
IL_43115.1,True,66,30,32,55,57
HL_35894.1,False,63,42,47,-,-
...
```
## CARNAVAL DATA (*Reinhartz et al, 2018*)
You first need to have the `unzip` command installed on your machine and the `networkx` package installed for Python 3. Then just run the script `Install_CaRNAval_RINs.py`.
If you have cloned the Git repository, just run :
```bash
cd scripts
python3 Install_CaRNAval_RINs.py
```
This will create files into `./data/modules/RIN/Subfiles`.
If not, or if you do not have the unzip command, download and extract manually the [CaRNAval dataset](http://carnaval.lri.fr/carnaval_dataset.zip) and place the files `RIN.py` and `CaRNAval_1_as_dictionnary.nxpickled` in the folder `data/modules/RIN/`, and run the python script.
*Note : CaRNAval is supposed to be a long-distance contact module dataset, not a SSE module dataset. It was supported for testing mostly, but you will not get the best performance from using it, it's not supposed to be loops.*
## THE RNA 3D MOTIF ATLAS DATA (*Petrov et al, 2013*, previously supported)
Source : see http://rna.bgsu.edu/rna3dhub/motifs/.
Get the latest version of the HL and IL module models from the [BGSU website](http://rna.bgsu.edu/data/jar3d/models/) and extract the Zip files. Put the HL and IL folders from inside the Zip files into `./data/modules/BGSU`. Note that only the latest Zip is required.
*Note : In Biorseo V1.0, you could use this modules directly because Biorseo was running Jar3d or BayesPairing for you. This is not the case anymore. You need to run these tools separately and get their results as a CSV file, see above how to format the CSV file.*
## RNA3DMOTIFS DATA (from the work of *Djelloul & Denise, 2008*, considered outdated)
If you use Rna3Dmotifs, you need to get RNA-MoIP's .DESC dataset: download it from [GitHub](https://github.com/McGill-CSB/RNAMoIP/blob/master/CATALOGUE.tgz). Put all the .desc from the `Non_Redundant_DESC` folder into `./data/modules/DESC`. Otherwise, you also can run Rna3Dmotifs' `catalog` program to get your own DESC modules collection from updated 3D data (download [Rna3Dmotifs](https://rna3dmotif.lri.fr/Rna3Dmotif.tgz)). You also need to move the final DESC files into `./data/modules/DESC`.
\ No newline at end of file
......@@ -22,7 +22,7 @@ while step < len(seq)+50:
fasta.close()
# run biorseo on it, with default options
cmd = ["./bin/biorseo", "-d", "./data/modules/DESC", "-s", "./ZDFS33.fa", "-v"]
cmd = ["./bin/biorseo", "-d", "./data/modules/DESC", "-s", "data/fasta/ZDFS33.fa", "-v"]
old_time = time.time()
output = subprocess.check_output(cmd, stderr=subprocess.DEVNULL).decode("utf-8").split("\n")[-5:]
run_time = time.time() - old_time
......
#!/bin/bash
######################################################## RNA modules ##############################################################
cd ../
# Rna3Dmotifs data
mkdir -p data/modules/DESC
wget https://github.com/McGill-CSB/RNAMoIP/raw/master/CATALOGUE.tgz
tar -xvzf CATALOGUE.tgz
mv No_Redondance_DESC/*.desc data/modules/DESC/
rm -r No_Redondance_VIEW3D No_Redondance_DESC CATALOGUE.tgz
# The RNA 3D Motif Atlas
mkdir -p data/modules/BGSU
wget http://rna.bgsu.edu/data/jar3d/models/HL/HL_3.2_models.zip
unzip HL_3.2_models.zip
mv HL data/modules/BGSU
rm HL_3.2_models.zip
wget http://rna.bgsu.edu/data/jar3d/models/IL/IL_3.2_models.zip
unzip IL_3.2_models.zip
mv IL data/modules/BGSU
rm IL_3.2_models.zip
# Install BayesPairing
sudo -H pip3 install --upgrade pip
sudo -H pip3 install networkx numpy regex wrapt biopython
git clone http://jwgitlab.cs.mcgill.ca/sarrazin/rnabayespairing.git BayesPairing
cd BayesPairing
sudo -H pip3 install .
# Train Bayes Pairing (it has been installed on the image and the source has been deleted, we train the models now, and will remount it as volume at run time)
cd bayespairing/src
python3 parse_sequences.py -d rna3dmotif -seq ACACGGGGUAAGAGCUGAACGCAUCUAAGCUCGAAACCCACUUGGAAAAGAGACACCGCCGAGGUCCCGCGUACAAGACGCGGUCGAUAGACUCGGGGUGUGCGCGUCGAGGUAACGAGACGUUAAGCCCACGAGCACUAACAGACCAAAGCCAUCAU -ss ".................................................................((...............)xxxx(...................................................)xxx).............."
python3 parse_sequences.py -d 3dmotifatlas -seq ACACGGGGUAAGAGCUGAACGCAUCUAAGCUCGAAACCCACUUGGAAAAGAGACACCGCCGAGGUCCCGCGUACAAGACGCGGUCGAUAGACUCGGGGUGUGCGCGUCGAGGUAACGAGACGUUAAGCCCACGAGCACUAACAGACCAAAGCCAUCAU -ss ".................................................................((...............)xxxx(...................................................)xxx).............."
cd ../../..
######################################################## Run it ##############################################################
# docker run -v `pwd`/data/modules:/modules -v `pwd`/BayesPairing/bayespairing:/byp -v `pwd`/results:/biorseo/results biorseo ./biorseo.py -i /biorseo/data/fasta/applications.fa --rna3dmotifs --patternmatch --func B
\ No newline at end of file