B

biorseo

A C++ software interface to RNA module databases, and a biobjective algorithm to include known modules in sequences to assist the energy criteria in predicting secondary structures.

Biorseo (Bi-Objective RNA Structure Efficient Optimizer)

This tool predicts the secondary structure of a RNA sequence with pieces of 3D information (non-canonical contacts) at some places, by identifying zones that can fold like known modules from data like the RNA 3D Motif Atlas or Rna3Dmotifs.

Contact : louis.becquey@univ-evry.fr (deprecated), fariza.tahi@univ-evry.fr (head of team).

1/ How it works

INPUT:

  • An RNA sequence (you can go up to ~230 bases, also depends on your hardware, patience, and module average size)

THEN

  • Pattern-matching step : Find all possible occurrences of known RNA modules in the query sequence, by finding subsequences of the query that match the modules. Our approach is the regular-expression based approach from RNA-MoIP (Reinharz et al, 2012), we deal the same way with short components and wildcards. This step can also be delegated to external tools (like Jar3d or BayesPairing) which will use probabilistic models to score the modules insertion sites.
  • Constraints definition step : Define constraints on the secondary structure imposed by modules if they would be included (example, the loop closing base-pairs for modules that are secondary-structure elements).
  • Solve a bi-objective IP problem : Find a secondary structure that satisfies as much as possible both an energy-based criterion (e.g. minimize energy, or maximize expected accracy), and a criterion taking into account module inclusions, by solving a bi-objective integer linear programming problem, using the previous constraints defined in the previous step.

OUTPUT:

  • A set of secondary structures from the Pareto front,
  • The list of known modules inserted inplace in the corresponding structures,
  • The scores of the secondary structures on the used objective functions.

2/ The different models

MODULE SOURCES

See supported module sources in file SOURCES.md.

OBJECTIVE FUNCTIONS TO OPTIMIZE

First, the energy-based objective functions :

  • MEA : Try to maximize the expected accuracy of the secondary structure. This should lead to the MEA secondary structure estimator. Note that in practice, Biorseo maximizes the sum of observed basepairs' probabilities, therefore the scores returned are not interpretable as expected accuracies.
  • MFE : Try to minimize the free energy of the secondary structure. The free energy model is very simple and considers sequence-dependent contributions for each stack of basepairs. This should lead to the MFE secondary structure estimator. Note that in practice, Biorseo maximizes the opposite of the energy, so the given score can be seen as the opposite of energy (which should be negative).

These approaches are the IPknot (Sato et al. 2011) and Biokop (Legendre et al. 2018) approaches.

Then, the objective functions for the insertion of modules :

  • Function A : weights a module by its squared number of nucleotides (like RNA-MoIP).
  • Function B : weights a module by its number of components (strands) and penalizes it by the log_2 of its nucleotide size.
  • Function C : weights a module by its insertion site score (JAR3D or BayesPairing score).
  • Function D : weights a module by its number of components (strands) and insertion site score (JAR3D or BayesPairing score), and penalizes it by the log_2 of its nucleotide size.

3/ Installation

You can install Biorseo using Docker, or compile it from source. Check the file INSTALL.md for installation instructions.

4/ How to run

The command to run is different depending on your installation method, see INSTALL.md. But both methods allow you to define options to get the desired behavior.

Usage:  You must provide:
        1) a FASTA input file with -s,
        2) a module type with --rna3dmotifs, --carnaval, --json or --pre-placed,
        3) one module-based scoring function with --func A, B, C, or D
        4) one energy-based scoring function with --mfe or --mea,
        5) how to display results: in console (-v), or in a result file (-o).

Options:
  -h [ --help ]                     Print the help message
  --version                         Print the program version
  -s [ --seq ] arg                  Fasta file containing the RNA sequence
  -d [ --descfolder ] arg           A folder containing modules in .desc 
                                    format, as produced by Djelloul & Denise's 
                                    catalog program (deprecated)
  -r [ --rinfolder ] arg            A folder containing CaRNAval's RINs in .txt
                                    format, as produced by script 
                                    transform_caRNAval_pickle.py
  -j [ --jsonfolder ] arg           A folder containing a custom motif library 
                                    in .json format
  -x [ --pre-placed ] arg           A CSV file providing motif insertion sites 
                                    obtained with another tool.
  -f [ --function ] arg (=B)        (A, B, C, or D) Objective function to score
                                    module insertions:
                                      (A) insert big modules
                                      (B) light, high-order modules
                                      (C) well-scored modules
                                      (D) light, high-order, well-scored
                                        modules
                                      C and D require position scores
                                      provided by --pre-placed.

  -E [ --mfe ]                      Minimize stacking energies
                                      (leads to MFE extimator)
  -A [ --mea ]                      (default) Maximize expected accuracy
                                      (leads to MEA estimator)
  -c [ --first-objective ] arg (=2) (1 or 2) Objective to solve in the 
                                    mono-objective portions of the algorithm.
                                      (1) is the module objective,
                                      given by --function
                                      (2) is energy-based objective,
                                      either MFE or MEA
  -o [ --output ] arg               A file to summarize the computation results
  -t [ --theta ] arg (=0.001)       Pairing probability threshold to consider 
                                    or not the possibility of pairing
  -n [ --disable-pseudoknots ]      Add constraints forbidding the formation of
                                    pseudoknots
  -l [ --limit ] arg (=500)         Intermediate number of solutions in the 
                                    Pareto set above which we give up the 
                                    calculation.

Run in Biokop-mode :

If you run Biorseo with both options --mfe and --mea, the biobjective optimisation problem will be defined without modules, comparing the two energy-based criterions together. This should be equivalent to run the Biokop tool (Legendre et al. 2018) with only one Pareto-set (option biokop -n 1).

5/ Example output and interpretation

Let's consider an example input FASTA file (data/fasta/example.fa):

>PDB_00376
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA

By running docker run -vpwd/data:/workdir/data persalteas/biorseo -s data/fasta/example.fa --descfolder data/modules/DESC --func B -v -o data/MyOutput.biorseo, i get many information about what is happening:

BASEPAIR PROBABILITIES

Summary of basepair probabilities:

        === -log10(p(i,j)) for each pair (i,j) of nucleotides: ===

        GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
               4  3   5       3  3            4                          530  43 G
              45 34           5 34           44                         5303  33 G
              54 43           4 44          443                         3033  33 G
               4 3    5       3 4          443                          0334  3  G
                3    5    1343 4          4                          4 0    43   U
                         0                 3        4                 0          A
                   55  50 3243 3    5     3      5 4   5      5      0 5         U
                     3 03   34    55     3      5 44  4      5       1           C
                      0  0    1 55     53           23    5 5    5443 4          G
                     0 40   51     5              52     5   5       4           C
                       04   1     5               25    5    5       5           C
                         3           45                    5    4  5             A
                         3           5                    54    5  4             A
                         3    2 55                        4 5     4 5 5       5  G
                            42    3            5        54   3       5       5   C
                              4 43            5      5    5 3    45      5 5   4 G
                              2 35           5       4    535   45      5 5   4  G
                             3 3                   5  5545     4       5     4   U
                                                    4                 5          A
                                        5           3              4  5          A
                                44     5      4     10     44     4 5     54   1 G
                                 5    5      4      03    445    4 5  4  54   1  G
                                                5 10     4           4       1   C
                                           5        3     4           5          A
                                               40 33    54           5           C
                                         4     03 3   4 4    5                   C
                                        4    40      4      5     2        4   5 G
                                       4    305     43     5     2        4   5  G
                                        5  20       3     5     2                A
                                         32      533     5     2                 U
                                          1      345    5     2                  U
                                               53 5          2                   C
                                               3 5    5      4                   U
                                              3      5     33                    G
                                                    5     33                     A
                                                   5    43                       U
                                                  5    433                       U
                                                   5  4 34                       C
                                                  55  3 4                        C
                                                     3            4 5     52   5 G
                                                    34           45 1    523  5  G
                                                                4  1  4          A
                                                                53155 4 2134   4 G
                                                                31455 2 134   4  G
                                                              13     5 1     4   U
                                                             1       0           C
                                                           51    4450 5   54     G
                                                           1    4  0             A
                                                            4    3045    3       G
                                                                30455   3        G
                                                              31       3         U
                                                               1       5         U
                                                                     3           C
                                                                 55 3            G
                                                                   3             A
                                                                                 A
                                                                                 U
                                                                                 C
                                                                                 C
                                                                                 U
                                                                                 C
                                                                                 G
                                                                                 U
                                                                                 A
                                                                                 C
                                                                                 C
                                                                                 C
                                                                                 C
                                                                                 A
                                                                                 G
                                                                                 C
                                                                                 C
                                                                                 A

        green basepairs are kept as decision variables.

This triangular matrix gives you the probability that two nucleotides are canonically paired. For example, a 5 means there is a 10-5 probability that the two corresponding bases are paired. For this reason, Biorseo does not consider all possible basepairs, but only the most probable ones, highlighted in green. You can set the threshold by using the --theta parameter in the command line options, default is 0.001. Note : these probabilities are computed using ViennaRNA's vrna_pfl_fold routine, which considers windows of size 100 and only allows basepairs within these windows of size 100. #TODO: allow user to customize window size.

PREPARING ILP DECISION VARIABLES AND CONSTRAINTS

Then the program defines the variables :

Defining problem decision variables...
 > Legal basepairs : 0-67 1-66 2-65 3-64 4-18 4-63 5-17 5-62 6-16 6-19 6-61 7-15 7-61 8-14 8-17 8-22 8-44 9-13 9-16 9-21 9-43 10-15 10-20 10-42 13-22 14-21 16-22 20-44 20-45 20-71 21-44 21-70 22-42 22-43 22-69 24-40 25-39 26-38 26-58 27-37 27-57 28-35 28-36 28-56 29-34 29-55 30-34 30-54 31-53 39-67 40-60 40-66 41-59 42-58 42-64 42-65 43-57 43-62 43-64 44-54 44-63 45-53 45-61 46-52 46-60 47-51 47-59 48-58 49-57 50-55 51-55 
 > The possible stacks of two base pairs (i,j) and (i+1,j-1) : 0-67 1-66 2-65 3-64 4-18 4-63 5-17 5-62 6-16 7-15 8-14 8-17 8-22 8-44 9-16 9-21 9-43 13-22 20-45 20-71 21-44 21-70 24-40 25-39 26-38 26-58 27-37 27-57 28-35 28-56 29-55 30-54 39-67 40-60 41-59 42-58 42-65 43-64 44-54 45-53 45-61 46-52 46-60 47-59 48-58 

This summarizes the possible basepairs and stacks, depending on the previous probability matrix.

 > Looking for insertion sites...
 > Ignoring motif "1Z7F.A.2", hairpin (terminal) loops must be at least of size 3 !
 > Ignoring motif "3L26.C.2", hairpin (terminal) loops must be at least of size 3 !
 ...

Depending on you module data source, you may then get messages about your modules, informing you if they could be placed in the sequence or not, sometimes because of errors. The results are then summarized :

 > 262 candidate motifs on 4695 (53 ignored motifs), 
          50 insertion sites kept after applying probability threshold of 0.001
        > Allowed candidate insertion sites:
                > 3CUL.D.6      3CUL.D.6 ( 48-58 )         1 components: ........... basepairs:
                > 3IZF.A.2      3IZF.A.2 ( 24-26 38-40 )   2 components: ... ... basepairs:
                ...

Note : for modules that are loops (DESC or BGSU's ones), the loop closing basepairs ARE considered even if they do not display here.

This means, we had 4695 modules in the dataset, 53 of them were invalid, and we could potentially place 262 of them in the input sequence (not at the same time !). The optimisation program will now decide which of them we keep.

The program then defines constraints for the integer-linear-program.

        > 71 + 45 + 93 (yuv + xuv + Cpxi) decision variables are used.
        > ensuring there are at most 1 pairing by nucleotide...
            ...
        > ensuring that the stacks are correct...
            ...
        > forbidding lonely basepairs...
            ...
        > forbidding basepairs inside included motif's components...
            ...
        > forbidding component overlap...
            ...
        > ensuring that motives cannot be partially included...
            ...
        > forcing basepairs imposed by a module insertion...
            ...
    A total of 492 constraints are used.

SOLVING

The program starts by solving the two objectives independantly :

Solving...
        > Solution status: objective values (3.349, 15.0427)
        > Solution status: objective values (0, 19.1801)
Best solution according to objective 1 :(((((((.((...)).))...((.((.((.......)).))..))((.((.......)).)).)))))..... + 1VOY.B.63 + 1XJR.A.3 + 1YI2.0.158 + 2LA5.A.1 + 2RD2.B.3     3.3489961  15.0426987
Best solution according to objective 2 :((((((((((...)))....(((.((((([[[....)))))..)))((((...]]].))))))))))).....       0.0000000       19.1801369

Then, it will limit iteratively the criterion space (approach similar to the epsilon-constraint method) using constraints, and find new solutions to add to the Pareto front, or sometimes discard some dominated ones. This can last several hundreds or thousand iterations.

...

Solving objective function 2, on top of 1.027879251: Obj1  being in [1.027889251, 3.34900609441]...
        > Solution status: objective values (1.48611998869, 18.2887755166), not dominated.
        > adding structure to Pareto set :      ((((.(((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))).))))..... + 1VOY.B.61 + 1XJR.A.3        1.4861200       18.2887755

...

RESULTS

Finally, you get the Pareto set:

---------------------------------------------------------------
Whole Pareto Set:

        GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
        ((((((((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))))))))..... + 1XJR.A.3    0.7124144       19.1801369
                              ---               ----                                    1XJR.A.3 ( 22-24 40-43 )


        GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
        ((((((((((...)))....(((.(((((.......)))))..)))((((.......)))))))))))..... + 1XJR.A.3 + 2RD2.B.3 1.0278793       19.1770336
                              ---               ----                                    1XJR.A.3 ( 22-24 40-43 )
                                    ---------                                           2RD2.B.3 ( 28-36 )


        GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
        ((((.(((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))).))))..... + 1VOY.B.61 + 1XJR.A.3        1.4861200       18.2887755
           ---                                                        ---               1VOY.B.61 ( 3-5 62-64 )
                              ---               ----                                    1XJR.A.3 ( 22-24 40-43 )


        GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
        ((((.(((((...)))....(((.(((((.......)))))..)))((((.......)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3     1.8015849       18.2856722
           ---                                                        ---               1VOY.B.61 ( 3-5 62-64 )
                              ---               ----                                    1XJR.A.3 ( 22-24 40-43 )
                                    ---------                                           2RD2.B.3 ( 28-36 )


        GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
        ((((.(((((...)))....(((.(((((.......)))))..)))(((.........))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3 + 3CUL.D.6  2.0906497       17.3138502
           ---                                                        ---               1VOY.B.61 ( 3-5 62-64 )
                              ---               ----                                    1XJR.A.3 ( 22-24 40-43 )
                                    ---------                                           2RD2.B.3 ( 28-36 )
                                                        -----------                     3CUL.D.6 ( 48-58 )


        GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
        ((((.(((((...)))....(((.((.(([[[....)).))..)))((((...]]].)))))).))))..... + 1FFK.0.86 + 1VOY.B.61 + 1XJR.A.3    2.2598256       17.2909202
                                 ---         ---                                        1FFK.0.86 ( 25-27 37-39 )
           ---                                                        ---               1VOY.B.61 ( 3-5 62-64 )
                              ---               ----                                    1XJR.A.3 ( 22-24 40-43 )


and several more...

There can be up to tens of solutions in here, depending mostly on your sequence length and average module size. For each solution, you get a secondary structure, and the corresponding insertion sites of some modules that match the sequence (they may have several components, which means several sequence portions are concerned by the same module). You get the insertion coordinates for each module. Also note the two criterion scores of the solution next to the secondary structure and the list of its modules.

The solutions are summarized in the result file, without the insertion sites locations :

$ cat data/MyOutput.biorseo

PDB_00376
GGGGUAUCGCCAAGCGGUAAGGCACCGGAUUCUGAUUCCGGAGGUCGAGGUUCGAAUCCUCGUACCCCAGCCA
((((((((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))))))))..... + 1XJR.A.3    0.7124144       19.1801369
((((((((((...)))....(((.(((((.......)))))..)))((((.......)))))))))))..... + 1XJR.A.3 + 2RD2.B.3 1.0278793       19.1770336
((((.(((((...)))....(((.((((([[[....)))))..)))((((...]]].)))))).))))..... + 1VOY.B.61 + 1XJR.A.3        1.4861200       18.2887755
((((.(((((...)))....(((.(((((.......)))))..)))((((.......)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3     1.8015849       18.2856722
((((.(((((...)))....(((.(((((.......)))))..)))(((.........))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3 + 3CUL.D.6  2.0906497       17.3138502
((((.(((((...)))....(((.((.(([[[....)).))..)))((((...]]].)))))).))))..... + 1FFK.0.86 + 1VOY.B.61 + 1XJR.A.3    2.2598256       17.2909202
((((.(((((...)))....(((.((.((.......)).))..)))((((.......)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2LA5.A.1 + 2RD2.B.3  2.5752905       17.2878169
((((.(((((...)))....(((.((.((.......)).))..)))(((.........))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2LA5.A.1 + 2RD2.B.3 + 3CUL.D.6       2.8643553       16.3159949
(((.((((((...))))....((.((.(([[[....)).))..))((.((...]]].)).)))).)))..... + 1FFK.0.76 + 1XJR.A.3 + 2JLT.B.1 + 2LA5.A.1  3.0335312       15.1786201
(((.((((((...))))....((.((.((.......)).))..))((.((.......)).)))).)))..... + 1XJR.A.3 + 2JLT.B.1 + 2LA5.A.1 + 2RD2.B.3 + 430D.A.1        3.3489961       15.1755168
(((.((((((...))))....((.((.(([[[....)).))..))((.((...]]].)).)))).)))..... + 1N36.A.99 + 1XJR.A.3 + 1YI2.0.158 + 2LA5.A.1        3.0335312       15.1786201
((((.(((((...)))....(((.((.((.......)).))..)))(((.........))))).))))..... + 1SER.T.3 + 1VOY.B.61 + 1XJR.A.3 + 2LA5.A.1 + 2RD2.B.3       2.8643553       16.3159949
((((.(((((...)))....(((.((.((.......)).))..)))((((.......)))))).))))..... + 1FFK.0.86 + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3 2.5752905       17.2878169
((((.(((((...)))....(((.((.(([[[....)).))..)))((((...]]].)))))).))))..... + 1VOY.B.61 + 1XJR.A.3 + 2LA5.A.1     2.2598256       17.2909202
((((.(((((...)))....(((.(((((.......)))))..)))(((.........))))).))))..... + 1SER.T.3 + 1VOY.B.61 + 1XJR.A.3 + 2RD2.B.3  2.0906497       17.3138502

You still get the secondary structure, the name of the modules, and the two scores. Note : on linux/mac, if you ran Biorseo using Docker, the produced file MyResult.biorseo is owned by root. You may need to change permissions or ownership to access it with a regular user account.