Dockerization

Louis BECQUEY
Commit 533345abd5cd41c6623248c23e57b9427d43cf0b (533345ab), 1 parent: ce2cba25
Showing 5 changed files with 145 additions and 30 deletions
.dockerignore
Dockerfile
README.md
build_docker_image.sh
statistics.py
--- a/.dockerignore 0 → 100644
+++ b/.dockerignore 0 → 100644
+ nohup.out
+ log_of_the_run.sh
+ results/
+ logs/
+ data/
+ esl*
+ .vscode/
+ __pycache__/
+ .git/
+ errors.txt
+ known_issues.txt
+ known_issues_reasons.txt
+ kill_rnanet.sh
+ Dockerfile
+ LICENSE
+ README.md
+ automate.sh
+ build_docker_image.sh
\ No newline at end of file
--- a/Dockerfile 0 → 100644
+++ b/Dockerfile 0 → 100644
+ FROM alpine:latest
+ COPY . /RNANet
+ WORKDIR /
+ RUN apk update && apk add --no-cache \
+         curl \
+         freetype-dev \
+         gcc g++ \
+         linux-headers \
+         lapack-dev \
+         make \
+         musl-dev \
+         openblas-dev \
+         python3 python3-dev py3-pip py3-six py3-wheel \
+         py3-matplotlib py3-requests py3-scipy py3-setproctitle py3-sqlalchemy py3-tqdm \
+         sqlite \
+     \
+     && python3 -m pip install biopython==1.76 pandas psutil pymysql && \
+     \
+     wget -q -O /etc/apk/keys/sgerrand.rsa.pub https://alpine-pkgs.sgerrand.com/sgerrand.rsa.pub && \
+     wget https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.32-r0/glibc-2.32-r0.apk && \
+     apk add glibc-2.32-r0.apk && \
+     rm glibc-2.32-r0.apk && \
+     \
+     mkdir /3D && mkdir /sequences && \
+     \
+     mv /RNANet/x3dna-dssr /usr/local/bin/x3dna-dssr && chmod +x /usr/local/bin/x3dna-dssr && \
+     \
+     curl -SL http://eddylab.org/infernal/infernal-1.1.3.tar.gz | tar xz  && cd infernal-1.1.3 && \
+     ./configure && make -j 16 && make install && cd easel && make install && cd / && \
+     \
+     curl -SL https://github.com/epruesse/SINA/releases/download/v1.7.1/sina-1.7.1-linux.tar.gz | tar xz && mv sina-1.7.1-linux /sina && \
+     ln -s /sina/bin/sina /usr/local/bin/sina && \
+     \
+     rm -rf /infernal-1.1.3 && \
+     \
+     apk del openblas-dev gcc g++ gfortran binutils \
+         curl \
+         linux-headers \
+         make \
+         musl-dev \
+         py3-pip py3-wheel \
+         freetype-dev zlib-dev
+ VOLUME ["/3D", "/sequences", "/runDir"]
+ WORKDIR /runDir
+ ENTRYPOINT ["/RNANet/RNAnet.py", "--3d-folder", "/3D", "--seq-folder", "/sequences" ]
\ No newline at end of file
--- a/README.md
+++ b/README.md
@@ -11,8 +11,8 @@ Contents:
 * [Output files](#output-files)
 * [How to run](#how-to-run)
     * [Required computational resources](#required-computational-resources)
-     * [Dependencies](#dependencies)
-     * [Command line](#command-line)
+     * [Using Docker](#using-docker)
+     * [Using classical command line installation](#using-classical-command-line-installation)
     * [Post-computation task: estimate quality](#post-computation-task:-estimate-quality)
 * [How to further filter the dataset](#how-to-further-filter-the-dataset)
     * [Filter on 3D structure resolution](#filter-on-3D-structure-resolution)
@@ -63,7 +63,7 @@ Other folders are created and not deleted, which you might want to conserve to a
 * `path-to-3D-folder-you-passed-in-option/RNAcifs/` contains mmCIF structures directly downloaded from the PDB, which contain RNA chains,
 * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. You may find additional information in them that RNANet does not properly support yet.
 
- # How to run
+ # How to run (on Linux x86-64 only)
 
 ## Required computational resources
 - CPU: no requirements. The program is optimized for multi-core CPUs, you might want to use Intel Xeons, AMD Ryzens, etc.
@@ -77,17 +77,18 @@ Measured the 23rd of June 2020 on a 16-core AMD Ryzen 7 3700X CPU @3.60GHz, plus
 
 Update runs are much quicker, around 3 hours. It depends mostly on what RNA families are concerned by the update.
 
- ## Dependencies
- You need to install:
- - DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). 
- - Infernal, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign`, `esl-alimanip` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.You don't need the whole X3DNA suite of tools, just DSSR is fine. Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
- - SINA, follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH.
- - Python >= 3.8, (Unfortunately, python3.6 is no longer supported, because of changes in the multiprocessing and Threading packages. Untested with Python 3.7.\*)
- - The following Python packages: `python3.8 -m pip install numpy matplotlib pandas biopython psutil pymysql requests sqlalchemy sqlite3 tqdm`
+ ## Using Docker
+ 
+ * Step 1 : Download the [Docker container](#soon). Open a terminal and move to the appropriate directory.
+ * Step 2 : Extract the archive to a Docker image named *rnanet* in your local installation
+ ```
+ $ docker image import rnanet_v1.2_docker.tar rnanet
+ ```
+ * Step 3 : Run the container, giving it 3 folders to mount as volumes: a first to store the 3D data, a second to store the sequence data and alignments, and a third to output the results, data and logs:
+ ```
+ $ docker run -v path/to/3D/data/folder:/3D -v path/to/sequence/data/folder:/sequences -v path/to/experiment/results/folder:/runDir rnanet [ - other options ]
+ ```
 
- ## Command line
- Run `./RNANet.py --3d-folder path/to/3D/data/folder --seq-folder path/to/sequence/data/folder [ - other options ]`. 
- It requires solid hardware to run. It takes around around 12 to 15 hours the first time, and 1 to 3h then, tested on a server with 32 cores and 48GB of RAM.
 The detailed list of options is below:
 
 ```
@@ -121,18 +122,43 @@ The detailed list of options is below:
 --archive                       Create a tar.gz archive of the datapoints text files, and update the link to the latest archive
 --no-logs                       Do not save per-chain logs of the numbering modifications
 ```
+ You do not need to pass the `--3d-folder` and `--seq-folder` options: the image's entrypoint already sets them to `/3D` and `/sequences`, the mount points you bind with the `-v` options when running Docker.
+ 
+ ## Using classical command line installation
+ 
+ You need to install the dependencies:
+ - DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/).  Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
+ - Infernal, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign`, `esl-alimanip`, `esl-alipid` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.
+ - SINA, follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH.
+ - SQLite 3, available under the name *sqlite* in most distributions' package managers.
+ - Python >= 3.8, (Unfortunately, python3.6 is no longer supported, because of changes in the multiprocessing and Threading packages. Untested with Python 3.7.\*)
+ - The following Python packages: `python3.8 -m pip install biopython==1.76 matplotlib pandas psutil pymysql requests scipy setproctitle sqlalchemy tqdm`. Note that Biopython versions 1.77 or later do not work (yet) since they removed the alphabet system.
+ 
+ Then, run it from the command line, preferably with `nohup` if your shell session may be interrupted:
+ ```
+  ./RNANet.py --3d-folder path/to/3D/data/folder --seq-folder path/to/sequence/data/folder [ - other options ]
+ ```
+ See the list of possible options just above in the [Using Docker](#using-docker) section. Expect hours (maybe days) of computation.
 
 Typical usage:
 ```
- nohup bash -c 'time ~/Projects/RNANet/RNAnet.py --3d-folder ~/Data/RNA/3D/ --seq-folder ~/Data/RNA/sequences -s' &
+ nohup bash -c 'time ~/Projects/RNANet/RNAnet.py --3d-folder ~/Data/RNA/3D/ --seq-folder ~/Data/RNA/sequences --no-logs -s' &
 ```
 
 ## Post-computation task: estimate quality
- The file statistics.py is supposed to give a summary on the produced dataset. See the results/ folder. It can be run automatically after RNANet if you pass the `-s` option.
+ If you did not ask for an automatic run of the statistics over the produced dataset with the `-s` option, you can run them later using the file statistics.py. 
+ ```
+ python3.8 statistics.py --3d-folder path/to/3D/data/folder --seq-folder path/to/sequence/data/folder -r 20.0
+ ```
+ /!\ Beware: if no threshold is specified with option `-r`, no resolution threshold is applied and all the data in RNANet.db is used.
+ 
+ If you have run RNANet twice, once with option `--no-homology` and once without, you unlock new statistics over unmapped chains. You can then also use option `--wadley` to reproduce the results of Wadley et al. (2007) automatically.
 
 # How to further filter the dataset
 You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package.
 
+ *Note: you cannot install the sqlite3 package through pip. Install it using your OS' package manager, search for 'sqlite'.*
+ 
 ## Filter on 3D structure resolution
 
 We need to import sqlite3 and pandas packages first.
@@ -157,13 +183,16 @@ with sqlite3.connect("results/RNANet.db") as connection:
 Step 2 : Then, we define a template string, containing the SQL request we use to get all information of one RNA chain, with brackets { } at the place we will insert every chain_id. 
 You can remove fields you are not interested in.
 ```
- req = """SELECT index_chain, old_nt_resnum, position, nt_name, nt_code, nt_align_code, is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, dbn, paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta, chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base,
- v0, v1, v2, v3, v4, amlitude, phase_angle, puckering 
- FROM 
- (SELECT chain_id, rfam_acc from chain WHERE chain_id = {})
- NATURAL JOIN re_mapping
- NATURAL JOIN nucleotide
- NATURAL JOIN align_column;"""
+ req = """SELECT index_chain, old_nt_resnum, nt_position, nt_name, nt_code, nt_align_code, 
+                 is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, dbn,
+                 paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta,
+                 chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base,
+                 v0, v1, v2, v3, v4, amplitude, phase_angle, puckering 
+                 FROM 
+                 (SELECT chain_id, rfam_acc from chain WHERE chain_id = {})
+                 NATURAL JOIN re_mapping
+                 NATURAL JOIN nucleotide
+                 NATURAL JOIN align_column;"""
 ```
 
 Step 3 : Finally, we iterate over this list of chains and save their information in CSV files:
@@ -199,12 +228,13 @@ If you want just one example of each RNA 3D chain, use in Step 1:
 
 ```
 with sqlite3.connect("results/RNANet.db") as connection:
-     chain_list = pd.read_sql("""SELECT UNIQUE chain_id, structure_id, chain_name
+     chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name
                                 FROM chain JOIN structure
                                 ON chain.structure_id = structure.pdb_id
                                 ORDER BY structure_id ASC;""",
                             con=connection)
 ```
+ Then proceed to steps 2 and 3.
 
 # More about the database structure
 To help you design your own requests, here follows a description of the database tables and fields.
@@ -231,13 +261,12 @@ To help you design your own requests, here follows a description of the database
 * `chain_id`: A unique identifier
 * `structure_id`: The `pdb_id` where the chain comes from
 * `chain_name`: The chain label, extracted from the 3D file
+ * `eq_class`: The BGSU equivalence class label containing this chain
+ * `rfam_acc`: The family which the chain is mapped to (if not mapped, value is *unmappd*)
 * `pdb_start`: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)
 * `pdb_end`: Position in the chain where the mapping to Rfam ends (absolute position, not residue number)
- * `pdb_start`: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)
- * `pdb_start`: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)
 * `reversed`: Whether the mapping numbering order differs from the residue numbering order in the mmCIF file (e.g. 4c9d, chains C and D)
- * `issue`: Wether an issue occurred with this structure while downloading, extracting, annotating or parsing the annotation. Chains with issues are removed from the dataset (Only one known to date: 1gsg, chain T, which is too short)
- * `rfam_acc`: The family which the chain is mapped to
+ * `issue`: Whether an issue occurred with this structure while downloading, extracting, annotating or parsing the annotation. See the file known_issues_reasons.txt for more information about why your chain is marked as an issue.
 * `inferred`: Whether the mapping has been inferred using the redundancy list (value is 1) or just known from Rfam-PDB mappings (value is 0)
 * `chain_freq_A`, `chain_freq_C`, `chain_freq_G`, `chain_freq_U`, `chain_freq_other`: Nucleotide frequencies in the chain
 * `pair_count_cWW`, `pair_count_cWH`, ... `pair_count_tSS`: Counts of the non-canonical base-pair types in the chain (intra-chain counts only)
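
The README's filtering recipe (select chains by joining `chain` and `structure`, then query each chain) can be tried without a full RNANet run. The sketch below is only an illustration under stated assumptions: it builds a toy in-memory database holding just the `chain` and `structure` columns named in the README, and it assumes that the `structure` table has a `resolution` column, which this diff does not show. The function names are hypothetical, and the threshold is passed as a bound parameter instead of being formatted into the SQL string.

```python
import sqlite3


def make_toy_database():
    """Build a tiny in-memory stand-in for results/RNANet.db (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, resolution REAL);
        CREATE TABLE chain (chain_id INTEGER PRIMARY KEY,
                            structure_id TEXT, chain_name TEXT);
        INSERT INTO structure VALUES ('1abc', 2.5), ('2xyz', 6.0);
        INSERT INTO chain VALUES (1, '1abc', 'A'), (2, '1abc', 'B'), (3, '2xyz', 'A');
    """)
    return conn


def chains_below_resolution(connection, max_resolution):
    """Return (chain_id, structure_id, chain_name) for sufficiently resolved structures."""
    # Assumption: structure.resolution exists and is expressed in Angstroms.
    query = """SELECT DISTINCT chain_id, structure_id, chain_name
               FROM chain JOIN structure
               ON chain.structure_id = structure.pdb_id
               WHERE structure.resolution <= ?
               ORDER BY structure_id ASC, chain_id ASC;"""
    # The "?" placeholder lets sqlite3 bind the value safely.
    return connection.execute(query, (max_resolution,)).fetchall()


if __name__ == "__main__":
    toy = make_toy_database()
    print(chains_below_resolution(toy, 4.0))  # → [(1, '1abc', 'A'), (2, '1abc', 'B')]
```

Against a real database, replacing `make_toy_database()` with `sqlite3.connect("results/RNANet.db")` should be enough, provided the column assumption holds.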
--- a/build_docker_image.sh 0 → 100755
+++ b/build_docker_image.sh 0 → 100755
+ #!/bin/bash
+ 
+ # echo "WARNING: The purpose of this file is to document how the docker image was built.";
+ # echo "You cannot execute it directly, because of licensing reasons. Please get your own";
+ # echo "DSSR 2.0 executable at http://innovation.columbia.edu/technologies/CU20391";
+ # echo "and place it in this folder.";
+ # exit 0;
+ 
+ THISDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
+ 
+ ####################################################### Dependencies ##############################################################
+ 
+ # The $THISDIR folder is supposed to contain the x3dna-dssr executable
+ cp "$(which x3dna-dssr)" "$THISDIR"
+ 
+ ######################################################## Build Docker image ######################################################
+ # Execute the Dockerfile and build the image
+ docker build -t persalteas/rnanet "$THISDIR"
+ 
+ ############################################################## Cleaning ##########################################################
+ rm "$THISDIR/x3dna-dssr"
+ 
+ # to run, use something like:
+ # docker run -v /home/persalteas/Data/RNA/3D/:/3D -v /home/persalteas/Data/RNA/sequences/:/sequences -v /home/persalteas/labo/:/runDir persalteas/rnanet [ additional options here ]
+ # Without additional options, this runs a standard pass with known issues support, log output, and no statistics. The default resolution threshold is 4.0 Angstroms.
\ No newline at end of file
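
The `docker run` invocation documented above takes three bind mounts, which are easy to get wrong when typed by hand. As a hedged sketch (the helper name `build_docker_run_args` is hypothetical and not part of this commit), the argument list can be assembled in Python. The host folders are resolved to absolute paths because Docker interprets a relative source in `-v` as a named volume and not as a host folder. The image name `rnanet` follows the README; the build script tags it `persalteas/rnanet` instead.

```python
import shlex
from pathlib import Path


def build_docker_run_args(dir_3d, dir_seq, dir_run, image="rnanet", extra_options=()):
    """Return the argv list for `docker run` with the three RNANet bind mounts."""
    mounts = [(dir_3d, "/3D"), (dir_seq, "/sequences"), (dir_run, "/runDir")]
    args = ["docker", "run"]
    for host_dir, container_dir in mounts:
        # Expand "~" and make the path absolute before handing it to docker.
        host_path = Path(host_dir).expanduser().resolve()
        args += ["-v", f"{host_path}:{container_dir}"]
    args.append(image)
    # Anything after the image name is forwarded to the entrypoint script.
    args.extend(extra_options)
    return args


if __name__ == "__main__":
    argv = build_docker_run_args("~/Data/RNA/3D", "~/Data/RNA/sequences", ".",
                                 extra_options=["--no-logs", "-s"])
    # Print the command for inspection; pass argv to subprocess.run to launch it.
    print(shlex.join(argv))
```

Building a list and not a single string avoids shell quoting problems if a folder name contains spaces.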
--- a/statistics.py
+++ b/statistics.py
@@ -329,9 +329,7 @@ def parallel_stats_pairs(f):
         with sqlite3.connect(runDir + "/results/RNANet.db") as conn:
             # Get comma separated lists of basepairs per nucleotide
             interactions = pd.DataFrame(
-                             sql_ask_database(conn, 
-                                             f"SELECT nt_code as nt1, index_chain, paired, pair_type_LW FROM (SELECT chain_id FROM chain WHERE chain_id='{cid}') NATURAL JOIN nucleotide;",
-                                             warn_every=0), 
+                             sql_ask_database(conn, f"SELECT nt_code as nt1, index_chain, paired, pair_type_LW FROM nucleotide WHERE chain_id='{cid}';"), 
                             columns = ["nt1", "index_chain", "paired", "pair_type_LW"]
                            )
         # expand the comma-separated lists in real lists
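
The simplified query in this hunk still formats the chain identifier into the SQL text with an f-string. A possible alternative, shown here only as a sketch with plain `sqlite3` (the project's own `sql_ask_database` helper is not used because its full signature is not visible in this diff), is to bind the identifier as a parameter and then expand the comma-separated fields. The toy table, its values, and both function names are hypothetical; the `ORDER BY` clause is added only to make the output deterministic.

```python
import sqlite3


def split_list(value):
    """Turn a comma-separated string into a list; None or '' gives an empty list."""
    return value.split(",") if value else []


def fetch_interactions(conn, chain_id):
    """Return (nt1, index_chain, paired, pair_type_LW) tuples with expanded lists."""
    rows = conn.execute(
        "SELECT nt_code AS nt1, index_chain, paired, pair_type_LW "
        "FROM nucleotide WHERE chain_id = ? ORDER BY index_chain;",
        (chain_id,),  # bound parameter instead of f-string interpolation
    ).fetchall()
    return [(nt1, idx, split_list(paired), split_list(lw))
            for nt1, idx, paired, lw in rows]


def make_toy_nucleotide_table():
    """In-memory table with hypothetical values, for demonstration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nucleotide (chain_id INTEGER, index_chain INTEGER,
                                 nt_code TEXT, paired TEXT, pair_type_LW TEXT);
        INSERT INTO nucleotide VALUES
            (7, 1, 'G', '12', 'cWW'),
            (7, 2, 'A', '9,15', 'tHS,cSW'),
            (7, 3, 'U', NULL, NULL),
            (8, 1, 'C', '4', 'cWW');
    """)
    return conn


if __name__ == "__main__":
    for row in fetch_interactions(make_toy_nucleotide_table(), 7):
        print(row)
```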