Dockerization

Louis BECQUEY
Commit 533345abd5cd41c6623248c23e57b9427d43cf0b 533345ab 1 parent ce2cba25
Showing 5 changed files with 145 additions and 30 deletions
.dockerignore
Dockerfile
README.md
build_docker_image.sh
statistics.py
--- a/.dockerignore 0 → 100644
+++ b/.dockerignore 0 → 100644
+nohup.out
+log_of_the_run.sh
+results/
+logs/
+data/
+esl*
+.vscode/
+__pycache__/
+.git/
+errors.txt
+known_issues.txt
+known_issues_reasons.txt
+kill_rnanet.sh
+Dockerfile
+LICENSE
+README.md
+automate.sh
+build_docker_image.sh
\ No newline at end of file
--- a/Dockerfile 0 → 100644
+++ b/Dockerfile 0 → 100644
+FROM alpine:latest
+COPY . /RNANet
+WORKDIR /
+RUN apk update && apk add --no-cache \
+        curl \
+        freetype-dev \
+        gcc g++ \
+        linux-headers \
+        lapack-dev \
+        make \
+        musl-dev \
+        openblas-dev \
+        python3 python3-dev py3-pip py3-six py3-wheel \
+        py3-matplotlib py3-requests py3-scipy py3-setproctitle py3-sqlalchemy py3-tqdm \
+        sqlite \
+    \
+    && python3 -m pip install biopython==1.76 pandas psutil pymysql && \
+    \
+    wget -q -O /etc/apk/keys/sgerrand.rsa.pub https://alpine-pkgs.sgerrand.com/sgerrand.rsa.pub && \
+    wget https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.32-r0/glibc-2.32-r0.apk && \
+    apk add glibc-2.32-r0.apk && \
+    rm glibc-2.32-r0.apk && \
+    \
+    mkdir /3D && mkdir /sequences && \
+    \
+    mv /RNANet/x3dna-dssr /usr/local/bin/x3dna-dssr && chmod +x /usr/local/bin/x3dna-dssr && \
+    \
+    curl -SL http://eddylab.org/infernal/infernal-1.1.3.tar.gz | tar xz  && cd infernal-1.1.3 && \
+    ./configure && make -j 16 && make install && cd easel && make install && cd / && \
+    \
+    curl -SL https://github.com/epruesse/SINA/releases/download/v1.7.1/sina-1.7.1-linux.tar.gz | tar xz && mv sina-1.7.1-linux /sina && \
+    ln -s /sina/bin/sina /usr/local/bin/sina && \
+    \
+    rm -rf /infernal-1.1.3 && \
+    \
+    apk del openblas-dev gcc g++ gfortran binutils \
+        curl \
+        linux-headers \
+        make \
+        musl-dev \
+        py3-pip py3-wheel \
+        freetype-dev zlib-dev
+VOLUME ["/3D", "/sequences", "/runDir"]
+WORKDIR /runDir
+ENTRYPOINT ["/RNANet/RNAnet.py", "--3d-folder", "/3D", "--seq-folder", "/sequences" ]
\ No newline at end of file
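
Note on the Dockerfile above: it declares three volumes (/3D, /sequences, /runDir), and its ENTRYPOINT already passes --3d-folder and --seq-folder. A minimal sketch of how the matching `docker run` command line could be assembled from Python is given below. The function name `build_docker_run_command` and its parameters are hypothetical, and the image name `rnanet` is taken from the README instructions in this commit.

```python
from pathlib import Path


def build_docker_run_command(dir_3d, dir_sequences, dir_run,
                             image="rnanet", extra_options=()):
    """Return the argv list of a `docker run` call binding three host folders."""
    # Mount points declared by the VOLUME instruction of the Dockerfile.
    mounts = [(dir_3d, "/3D"), (dir_sequences, "/sequences"), (dir_run, "/runDir")]
    command = ["docker", "run"]
    for host_dir, container_dir in mounts:
        # Bind mounts passed with -v need an absolute host path.
        command += ["-v", f"{Path(host_dir).resolve()}:{container_dir}"]
    command.append(image)
    # --3d-folder and --seq-folder are not added: the ENTRYPOINT sets them.
    command += list(extra_options)
    return command


if __name__ == "__main__":
    import shlex

    # Print the command instead of running it, so Docker is not required here.
    print(shlex.join(build_docker_run_command(
        "data/3D", "data/sequences", "runs", extra_options=["--no-logs"])))
```

The list can be handed to `subprocess.run(command, check=True)` once the image is available locally.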
--- a/README.md
+++ b/README.md
@@ -11,8 +11,8 @@ Contents:
 * [Output files](#output-files)
 * [How to run](#how-to-run)
     * [Required computational resources](#required-computational-resources)
-    * [Dependencies](#dependencies)
+    * [Using Docker](#using-docker)
-    * [Command line](#command-line)
+    * [Using classical command line installation](#using-classical-command-line-installation)
     * [Post-computation task: estimate quality](#post-computation-task:-estimate-quality)
 * [How to further filter the dataset](#how-to-further-filter-the-dataset)
     * [Filter on 3D structure resolution](#filter-on-3D-structure-resolution)
@@ -63,7 +63,7 @@ Other folders are created and not deleted, which you might want to conserve to a
 * `path-to-3D-folder-you-passed-in-option/RNAcifs/` contains mmCIF structures directly downloaded from the PDB, which contain RNA chains,
 * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. You may find additional information into them which is not properly supported by RNANet yet.
-# How to run
+# How to run (on Linux x86-64 only)
 ## Required computational resources
 - CPU: no requirements. The program is optimized for multi-core CPUs, you might want to use Intel Xeons, AMD Ryzens, etc.
@@ -77,17 +77,18 @@ Measured the 23rd of June 2020 on a 16-core AMD Ryzen 7 3700X CPU @3.60GHz, plus
 Update runs are much quicker, around 3 hours. It depends mostly on what RNA families are concerned by the update.
-## Dependencies
+## Using Docker
-You need to install:
+
-- DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). 
+* Step 1 : Download the [Docker image archive](#soon). Open a terminal and move to the directory where you saved it.
-- Infernal, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign`, `esl-alimanip` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.You don't need the whole X3DNA suite of tools, just DSSR is fine. Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
+* Step 2 : Extract the archive to a Docker image named *rnanet* in your local installation
-- SINA, follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH.
+```
-- Python >= 3.8, (Unfortunately, python3.6 is no longer supported, because of changes in the multiprocessing and Threading packages. Untested with Python 3.7.\*)
+$ docker image import rnanet_v1.2_docker.tar rnanet
-- The following Python packages: `python3.8 -m pip install numpy matplotlib pandas biopython psutil pymysql requests sqlalchemy sqlite3 tqdm`
+```
+* Step 3 : Run the container, giving it 3 folders to mount as volumes: a first to store the 3D data, a second to store the sequence data and alignments, and a third to output the results, data and logs:
+```
+$ docker run -v path/to/3D/data/folder:/3D -v path/to/sequence/data/folder:/sequences -v path/to/experiment/results/folder:/runDir rnanet [ - other options ]
+```
-## Command line
-Run `./RNANet.py --3d-folder path/to/3D/data/folder --seq-folder path/to/sequence/data/folder [ - other options ]`. 
-It requires solid hardware to run. It takes around around 12 to 15 hours the first time, and 1 to 3h then, tested on a server with 32 cores and 48GB of RAM.
 The detailed list of options is below:
 ```
@@ -121,18 +122,43 @@ The detailed list of options is below:
 --archive                       Create a tar.gz archive of the datapoints text files, and update the link to the latest archive
 --no-logs                       Do not save per-chain logs of the numbering modifications
 ```
+With Docker, do not pass the `--3d-folder` and `--seq-folder` options: the container already sets them to the folders you mount with the `-v` options.
+
+## Using classical command line installation
+
+You need to install the dependencies:
+- DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/).  Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
+- Infernal, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign`, `esl-alimanip`, `esl-alipid` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.
+- SINA, follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH.
+- Sqlite 3, available under the name *sqlite* in every distro's package manager,
+- Python >= 3.8, (Unfortunately, python3.6 is no longer supported, because of changes in the multiprocessing and Threading packages. Untested with Python 3.7.\*)
+- The following Python packages: `python3.8 -m pip install biopython==1.76 matplotlib pandas psutil pymysql requests scipy setproctitle sqlalchemy tqdm`. Note that Biopython versions 1.77 or later do not work (yet) since they removed the alphabet system.
+
+Then, run it from the command line, preferably using nohup if your shell will be interrupted:
+```
+ ./RNANet.py --3d-folder path/to/3D/data/folder --seq-folder path/to/sequence/data/folder [ - other options ]
+```
+See the list of possible options just above in the [Using Docker](#using-docker) section. Expect hours (maybe days) of computation.
 Typical usage:
 ```
-nohup bash -c 'time ~/Projects/RNANet/RNAnet.py --3d-folder ~/Data/RNA/3D/ --seq-folder ~/Data/RNA/sequences -s' &
+nohup bash -c 'time ~/Projects/RNANet/RNAnet.py --3d-folder ~/Data/RNA/3D/ --seq-folder ~/Data/RNA/sequences --no-logs -s' &
 ```
 ## Post-computation task: estimate quality
-The file statistics.py is supposed to give a summary on the produced dataset. See the results/ folder. It can be run automatically after RNANet if you pass the `-s` option.
+If you did not request an automatic run of the statistics over the produced dataset with the `-s` option, you can run them later using the file statistics.py.
+```
+python3.8 statistics.py --3d-folder path/to/3D/data/folder --seq-folder path/to/sequence/data/folder -r 20.0
+```
+/!\ Beware: unless you set one with option `-r`, no resolution threshold is applied and all the data in RNANet.db is used.
+
+If you have run RNANet twice, once with option `--no-homology` and once without, you unlock new statistics over unmapped chains. You can then also use option `--wadley` to reproduce the results of Wadley et al. (2007) automatically.
 # How to further filter the dataset
 You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package.
+*Note: you cannot install the sqlite3 package through pip. Install it using your OS' package manager, search for 'sqlite'.*
+
 ## Filter on 3D structure resolution
 We need to import sqlite3 and pandas packages first.
@@ -157,13 +183,16 @@ with sqlite3.connect("results/RNANet.db) as connection:
 Step 2 : Then, we define a template string, containing the SQL request we use to get all information of one RNA chain, with brackets { } at the place we will insert every chain_id. 
 You can remove fields you are not interested in.
 ```
-req = """SELECT index_chain, old_nt_resnum, position, nt_name, nt_code, nt_align_code, is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, dbn, paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta, chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base,
+req = """SELECT index_chain, old_nt_resnum, nt_position, nt_name, nt_code, nt_align_code, 
-v0, v1, v2, v3, v4, amlitude, phase_angle, puckering 
+                is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, dbn,
-FROM 
+                paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta,
-(SELECT chain_id, rfam_acc from chain WHERE chain_id = {})
+                chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base,
-NATURAL JOIN re_mapping
+                v0, v1, v2, v3, v4, amplitude, phase_angle, puckering 
-NATURAL JOIN nucleotide
+                FROM 
-NATURAL JOIN align_column;"""
+                (SELECT chain_id, rfam_acc from chain WHERE chain_id = {})
+                NATURAL JOIN re_mapping
+                NATURAL JOIN nucleotide
+                NATURAL JOIN align_column;"""
 ```
 Step 3 : Finally, we iterate over this list of chains and save their information in CSV files:
@@ -199,12 +228,13 @@ If you want just one example of each RNA 3D chain, use in Step 1:
 ```
 with sqlite3.connect("results/RNANet.db") as connection:
-    chain_list = pd.read_sql("""SELECT UNIQUE chain_id, structure_id, chain_name
+    chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name
                                 FROM chain JOIN structure
                                 ON chain.structure_id = structure.pdb_id
                                 ORDER BY structure_id ASC;""",
                             con=connection)
 ```
+Then proceed to steps 2 and 3.
 # More about the database structure
 To help you design your own requests, here follows a description of the database tables and fields.
@@ -231,13 +261,12 @@ To help you design your own requests, here follows a description of the database
 * `chain_id`: A unique identifier
 * `structure_id`: The `pdb_id` where the chain comes from
 * `chain_name`: The chain label, extracted from the 3D file
+* `eq_class`: The BGSU equivalence class label containing this chain
+* `rfam_acc`: The family which the chain is mapped to (if not mapped, value is *unmappd*)
 * `pdb_start`: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)
 * `pdb_end`: Position in the chain where the mapping to Rfam ends (absolute position, not residue number)
-* `pdb_start`: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)
-* `pdb_start`: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)
 * `reversed`: Whether the mapping numbering order differs from the residue numbering order in the mmCIF file (eg 4c9d, chains C and D)
-* `issue`: Wether an issue occurred with this structure while downloading, extracting, annotating or parsing the annotation. Chains with issues are removed from the dataset (Only one known to date: 1gsg, chain T, which is too short)
+* `issue`: Whether an issue occurred with this structure while downloading, extracting, annotating or parsing the annotation. See the file known_issues_reasons.txt for more information about why your chain is marked as an issue.
-* `rfam_acc`: The family which the chain is mapped to
 * `inferred`: Whether the mapping has been inferred using the redundancy list (value is 1) or just known from Rfam-PDB mappings (value is 0)
 * `chain_freq_A`, `chain_freq_C`, `chain_freq_G`, `chain_freq_U`, `chain_freq_other`: Nucleotide frequencies in the chain
 * `pair_count_cWW`, `pair_count_cWH`, ... `pair_count_tSS`: Counts of the non-canonical base-pair types in the chain (intra-chain counts only)
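
Note on the `chain` fields documented above: the `rfam_acc` and `issue` columns can be combined to keep only the chains that are mapped to a family and not flagged. The sketch below runs on a toy in-memory table, not on results/RNANet.db. The rows, the `eq_class` labels and the function name `mapped_chains` are made up for illustration, and storing `issue` as 0/1 is an assumption (the README only states this convention for `inferred`).

```python
import sqlite3


def mapped_chains(connection):
    """Return the chains mapped to an Rfam family and not flagged with an issue."""
    query = """SELECT chain_id, structure_id, chain_name, rfam_acc
               FROM chain
               WHERE rfam_acc != 'unmappd' AND issue = 0
               ORDER BY structure_id ASC, chain_name ASC;"""
    return connection.execute(query).fetchall()


# Toy table with a subset of the documented columns (hypothetical rows).
connection = sqlite3.connect(":memory:")
connection.execute("""CREATE TABLE chain (chain_id INTEGER PRIMARY KEY,
                                          structure_id TEXT, chain_name TEXT,
                                          eq_class TEXT, rfam_acc TEXT,
                                          issue INTEGER);""")
connection.executemany(
    "INSERT INTO chain VALUES (?, ?, ?, ?, ?, ?);",
    [(1, "1abc", "A", "class_1", "RF00001", 0),   # mapped, no issue: kept
     (2, "1abc", "B", "class_2", "unmappd", 0),   # not mapped: dropped
     (3, "2xyz", "A", "class_3", "RF00005", 1)])  # flagged: dropped

selected = mapped_chains(connection)
print(selected)
```

Against a real run, the same function would take a connection opened on results/RNANet.db, provided the assumptions above hold for that database.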
--- a/build_docker_image.sh 0 → 100755
+++ b/build_docker_image.sh 0 → 100755
+#!/bin/bash
+
+# echo "WARNING: The purpose of this file is to document how the docker image was built.";
+# echo "You cannot execute it directly, because of licensing reasons. Please get your own";
+# echo "DSSR 2.0 executable at http://innovation.columbia.edu/technologies/CU20391";
+# echo "and place it in this folder.";
+# exit 0;
+
+THISDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
+
+####################################################### Dependencies ##############################################################
+
+# The $THISDIR folder is supposed to contain the x3dna-dssr executable
+cp `which x3dna-dssr` $THISDIR
+
+######################################################## Build Docker image ######################################################
+# Execute the Dockerfile and build the image
+docker build -t persalteas/rnanet .
+
+############################################################## Cleaning ##########################################################
+rm x3dna-dssr
+
+# to run, use something like:
+# docker run -v /home/persalteas/Data/RNA/3D/:/3D -v /home/persalteas/Data/RNA/sequences/:/sequences -v /home/persalteas/labo/:/runDir persalteas/rnanet [ additional options here ]
+# Without additional options, this runs a standard pass with known issues support, log output, and no statistics. The default resolution threshold is 4.0 Angstroms.
\ No newline at end of file
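
Note on the build script above: it copies the DSSR binary into the build context, builds the image, then deletes the copy. The same staging pattern can be sketched in Python with a context manager, which removes the copy even if the build step raises. The names `find_executable` and `staged_copy` are hypothetical, and the demonstration uses a placeholder file in temporary folders rather than the real binary or a real `docker build`.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


def find_executable(name="x3dna-dssr"):
    """Locate an executable in PATH, failing early with a clear message."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name} was not found in PATH")
    return path


@contextmanager
def staged_copy(source, build_dir):
    """Copy `source` into `build_dir` for the duration of the block."""
    staged = Path(build_dir) / Path(source).name
    shutil.copy2(source, staged)
    try:
        yield staged
    finally:
        # Runs whether the body succeeded or raised.
        staged.unlink()


# Demonstration with a placeholder file standing in for the DSSR binary.
with tempfile.TemporaryDirectory() as src_dir, \
        tempfile.TemporaryDirectory() as build_dir:
    binary = Path(src_dir) / "x3dna-dssr"
    binary.write_text("placeholder")
    with staged_copy(binary, build_dir) as staged:
        present_during_build = staged.exists()  # the build would run here
    removed_after_build = not staged.exists()
```

In a real build, the body of the `with staged_copy(find_executable(), ".")` block would call `subprocess.run(["docker", "build", "-t", "persalteas/rnanet", "."], check=True)`.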
--- a/statistics.py
+++ b/statistics.py
@@ -329,9 +329,7 @@ def parallel_stats_pairs(f):
         with sqlite3.connect(runDir + "/results/RNANet.db") as conn:
             # Get comma separated lists of basepairs per nucleotide
             interactions = pd.DataFrame(
-                            sql_ask_database(conn, 
+                            sql_ask_database(conn, f"SELECT nt_code as nt1, index_chain, paired, pair_type_LW FROM nucleotide WHERE chain_id='{cid}';"), 
-                                            f"SELECT nt_code as nt1, index_chain, paired, pair_type_LW FROM (SELECT chain_id FROM chain WHERE chain_id='{cid}') NATURAL JOIN nucleotide;",
-                                            warn_every=0), 
                             columns = ["nt1", "index_chain", "paired", "pair_type_LW"]
                            )
         # expand the comma-separated lists in real lists
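
Note on the statistics.py change above: the subquery plus NATURAL JOIN is replaced by a direct filter on `nucleotide.chain_id`. The sketch below checks on a toy in-memory database that both forms return the same rows when the chain exists in the `chain` table. The table contents are made up, and the `?` placeholders and the ORDER BY clause are additions for the example, not part of the commit. If a `chain_id` were present in `nucleotide` but missing from `chain`, the two queries would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY);
    CREATE TABLE nucleotide (chain_id INTEGER, index_chain INTEGER,
                             nt_code TEXT, paired TEXT, pair_type_LW TEXT);
    INSERT INTO chain VALUES (1), (2);
    INSERT INTO nucleotide VALUES (1, 1, 'G', '2', 'cWW'),
                                  (1, 2, 'C', '1', 'cWW'),
                                  (2, 1, 'A', '0', '');
""")

# Form removed by the commit: select the chain first, then join.
old_query = """SELECT nt_code AS nt1, index_chain, paired, pair_type_LW
               FROM (SELECT chain_id FROM chain WHERE chain_id = ?)
               NATURAL JOIN nucleotide
               ORDER BY index_chain;"""

# Form added by the commit: filter the nucleotide table directly.
new_query = """SELECT nt_code AS nt1, index_chain, paired, pair_type_LW
               FROM nucleotide
               WHERE chain_id = ?
               ORDER BY index_chain;"""

cid = 1
old_rows = conn.execute(old_query, (cid,)).fetchall()
new_rows = conn.execute(new_query, (cid,)).fetchall()
print(old_rows == new_rows)
```

Binding `cid` as a parameter also avoids building the SQL text with an f-string, which may be worth considering if chain identifiers ever come from user input.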