Louis BECQUEY

beta 1.5 pre-commit for docker building

......@@ -18,9 +18,7 @@ Dockerfile
LICENSE
CHANGELOG
*.md
scripts/automate.sh
scripts/kill_rnanet.sh
scripts/build_docker_image.sh
scripts/*.sh
scripts/*.tar
scripts/measure.py
scripts/recompute_some_chains.py
......
# execution outputs:
nohup.out
log_of_the_run.sh
latest_run.log
# results
results/*
......
......@@ -25,13 +25,13 @@ RUN apk update && apk add --no-cache \
\
mv /RNANet/scripts/x3dna-dssr /usr/local/bin/x3dna-dssr && chmod +x /usr/local/bin/x3dna-dssr && \
\
curl -SL http://eddylab.org/infernal/infernal-1.1.3.tar.gz | tar xz && cd infernal-1.1.3 && \
curl -SL http://eddylab.org/infernal/infernal-1.1.4.tar.gz | tar xz && cd infernal-1.1.4 && \
./configure && make -j 16 && make install && cd easel && make install && cd / && \
\
curl -SL https://github.com/epruesse/SINA/releases/download/v1.7.1/sina-1.7.1-linux.tar.gz | tar xz && mv sina-1.7.1-linux /sina && \
ln -s /sina/bin/sina /usr/local/bin/sina && \
\
rm -rf /infernal-1.1.3 && \
rm -rf /infernal-1.1.4 && \
\
apk del openblas-dev gcc g++ gfortran binutils \
curl \
......
......@@ -10,16 +10,16 @@
# Required computational resources
- CPU: no requirements. The program is optimized for multi-core CPUs, you might want to use Intel Xeons, AMD Ryzens, etc.
- GPU: not required
- RAM: 16 GB with a large swap partition is okay. 32 GB is recommended (usage peaks at ~27 GB)
- RAM: 16 GB with a large swap partition is okay. 32 GB is recommended (usage peaks at ~27 GB, but this number depends on your number of CPU cores)
- Storage: to date, it takes 60 GB for the 3D data (36 GB if you don't use the --extract option), 11 GB for the sequence data, and 7 GB for the outputs (5.6 GB database, 1 GB archive of CSV files). You need to add a few more for the dependencies. Pick a 100 GB partition and you are good to go. The computation is much faster if you use a fast storage device (e.g. an SSD instead of a hard drive, or even better, an NVMe SSD) because of constant I/O with the SQLite database.
- Network : We query the Rfam public MySQL server on port 4497. Make sure your network allows this communication (there should not be any issue on private networks, but your company/university may close ports by default). You will get an error message if the port is not open. Around 30 GB of data is downloaded.
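The storage figures above can be checked before a run. The following is a minimal sketch, not part of RNANet: the function names and the dictionary are hypothetical, and the sizes are simply the ones quoted in the list above.

```python
import shutil

# Sizes quoted in the requirements above, in GB (hypothetical helper, not part of RNANet).
REQUIRED_GB = {"3D data": 60, "sequence data": 11, "outputs": 7}


def required_total_gb(requirements=REQUIRED_GB):
    """Sum the documented storage needs, dependencies excluded."""
    return sum(requirements.values())


def has_enough_space(path, needed_gb=100):
    """Return True if the filesystem holding `path` has at least `needed_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * 10**9


if __name__ == "__main__":
    print(f"Documented data needs: {required_total_gb()} GB (plus dependencies)")
    print("100 GB available here:", has_enough_space("."))
```

The data alone adds up to 78 GB, which is why a 100 GB partition leaves a comfortable margin for the dependencies.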
# Method 1 : Installation using Docker
* Step 1 : Download the [Docker container](https://entrepot.ibisc.univ-evry.fr/d/1aff90a9ef214a19b848/files/?p=/rnanet_v1.3_docker.tar&dl=1). Open a terminal and move to the appropriate directory.
* Step 1 : Download the [Docker container](https://entrepot.ibisc.univ-evry.fr/d/1aff90a9ef214a19b848/files/?p=/rnanet_v1.5b_docker.tar&dl=1). Open a terminal and move to the appropriate directory.
* Step 2 : Extract the archive to a Docker image named *rnanet* in your local installation
```
$ docker load -i rnanet_v1.3_docker.tar
$ docker load -i rnanet_v1.5b_docker.tar
```
* Step 3 : Run the container, giving it 3 folders to mount as volumes: a first to store the 3D data, a second to store the sequence data and alignments, and a third to output the results, data and logs:
```
......@@ -36,7 +36,7 @@ nohup bash -c 'time docker run --rm -v /path/to/3D/data/folder:/3D -v /path/to/s
You need to install the dependencies:
- DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
- Infernal, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign`, `esl-alimanip`, `esl-alipid` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.
- Infernal, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign`, `cmfetch`, `cmbuild`, `esl-alimanip`, `esl-alipid` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.
- SINA, follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH.
- Sqlite 3, available under the name *sqlite* in every distro's package manager,
- Python >= 3.8, (Unfortunately, python3.6 is no longer supported, because of changes in the multiprocessing and Threading packages. Untested with Python 3.7.\*)
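To verify that the binaries listed above are reachable through your $PATH before launching a run, a small check like the one below may help. This is a sketch under assumptions: `REQUIRED_BINARIES` and `missing_binaries` are hypothetical names, not something RNANet.py provides.

```python
import shutil

# Binary names taken from the dependency list above (hypothetical helper, not part of RNANet).
REQUIRED_BINARIES = [
    "x3dna-dssr",
    "cmalign", "cmfetch", "cmbuild",
    "esl-alimanip", "esl-alipid", "esl-reformat",
    "sina",
]


def missing_binaries(names=REQUIRED_BINARIES):
    """Return the names that cannot be found in $PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Not found in $PATH:", ", ".join(missing))
    else:
        print("All required binaries were found.")
```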
......@@ -112,13 +112,14 @@ The most useful options in that list are
* Computation of sequence identity matrices
* Statistics over the sequence lengths, nucleotide frequencies, and basepair types by RNA family
* Overall database content statistics
* Detailed analysis of the eta-theta pseudotorsion angles (use `--stats-opts "--wadley"` after `-s`) or 3D distance matrices and their averages per family (use `--stats-opts "--distance-matrices"`)
* Detailed analysis of the eta-theta pseudotorsion angles (use `--stats-opts="--wadley"` after `-s`) or 3D distance matrices and their averages per family (use `--stats-opts="--distance-matrices"`)
* `--redundant`, to yield all the available data and not only the BGSU NR-List representatives
# Computation time
To give you an estimate, our last full run took exactly 12h, excluding the time to download the MMCIF files containing RNA (around 25 GB to download) and the time to compute statistics.
Measured on the 23rd of June 2020 on a 16-core AMD Ryzen 7 3700X CPU @3.60GHz, with 32 GB of RAM and a 7200rpm hard drive. Total CPU time spent: 135 hours (user+kernel modes), corresponding to 12h of wall-clock time on the 16-core CPU.
Another recent full run, including the MMCIF downloads and computation of heavy statistics (`--wadley --distance-matrices`), lasted 13h (real time) on a 60-core Xeon E7-4850v4@2.10GHz with 120 GB of RAM. The user+kernel time was about 300h.
Update runs are much quicker, around 3 hours. The duration depends mostly on which RNA families are affected by the update.
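The ratio of CPU time to wall-clock time gives a rough idea of how many cores were kept busy on average. The sketch below only redoes that arithmetic with the figures quoted above; the function name is hypothetical, and the two runs are not directly comparable since the second one includes downloads and heavy statistics.

```python
def average_busy_cores(cpu_hours, wall_hours):
    """Average number of cores kept busy: CPU time divided by wall-clock time."""
    return cpu_hours / wall_hours


# (machine, CPU hours, wall-clock hours, core count), as reported above.
runs = [
    ("Ryzen 7 3700X", 135, 12, 16),
    ("Xeon E7-4850v4", 300, 13, 60),
]

for name, cpu_h, wall_h, n_cores in runs:
    busy = average_busy_cores(cpu_h, wall_h)
    print(f"{name}: about {busy:.1f} cores busy on average ({busy / n_cores:.0%} of {n_cores})")
```

Roughly 11 cores were busy on average in the first run and about 23 in the second, so adding cores beyond a few dozen does not shorten the run proportionally.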
......@@ -135,9 +136,11 @@ By default, this computes:
* Statistics over the sequence lengths, nucleotide frequencies, and basepair types by RNA family
* Overall database content statistics
If you have run RNANet once with option `--extract`, additionally, you can compute more by passing the options:
* With option `--distance-matrices` to compute pairwise residue distances within the chain for every chain, and compute average and standard deviations by RNA families. This is supposed to capture the average shape of an RNA family. The distance matrices are the size of the family's covariance model (match states). Unresolved nucleotides or deletions to the covariance model are NaNs.
If you have run RNANet once with options `--no-homology` and `--extract`, you unlock new statistics over unmapped chains.
* You will be allowed to use option `--wadley` to reproduce the Wadley et al. (2007) results automatically. These are clustering results of the pseudotorsion angles of the backbone.
* (experimental) You will be allowed to use option `--distance-matrices` to compute pairwise residue distances within the chain for every chain, and compute average and standard deviations by RNA families. This is supposed to capture the average shape of an RNA family.
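To illustrate what the `--distance-matrices` statistics represent, here is a minimal pure-Python sketch of the idea, not the code used in statistics.py. It assumes one 3D point per aligned position (which atom is used is not specified here), marks unresolved positions with `None`, and uses a population standard deviation. All names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise distances for one chain; positions given as None yield NaN."""
    n = len(coords)
    matrix = [[math.nan] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if coords[i] is not None and coords[j] is not None:
                matrix[i][j] = math.dist(coords[i], coords[j])
    return matrix


def family_mean_std(matrices):
    """Element-wise mean and standard deviation over chains, ignoring NaNs."""
    n = len(matrices[0])
    mean = [[math.nan] * n for _ in range(n)]
    std = [[math.nan] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            values = [m[i][j] for m in matrices if not math.isnan(m[i][j])]
            if values:
                mu = sum(values) / len(values)
                mean[i][j] = mu
                std[i][j] = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mean, std


# Two toy chains aligned to a 3-position model; the first has an unresolved position.
chain_a = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), None]
chain_b = [(0.0, 0.0, 0.0), (6.0, 8.0, 0.0), (0.0, 0.0, 1.0)]
mean, std = family_mean_std([distance_matrix(chain_a), distance_matrix(chain_b)])
```

Because every matrix has the size of the family's covariance model, the matrices of different chains can be averaged position by position, and the NaNs simply drop out of the average.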
# Output files
......
......@@ -969,6 +969,7 @@ class Pipeline:
self.REUSE_ALL = False
self.REDUNDANT = False
self.ALIGNOPTS = None
self.STATSOPTS = None
self.USESINA = False
self.SELECT_ONLY = None
self.ARCHIVE = False
......@@ -1102,6 +1103,8 @@ class Pipeline:
self.REUSE_ALL = True
elif opt == "cmalign-opts":
self.ALIGNOPTS = arg
elif opt == "stats-opts":
self.STATSOPTS = arg.split(" ")  # split the option string into a list of arguments
elif opt == "--all":
self.REUSE_ALL = True
self.USE_KNOWN_ISSUES = False
......@@ -1545,9 +1548,12 @@ class Pipeline:
# Run statistics files
subprocess.run([python_executable, fileDir+"/scripts/regression.py", runDir + "/results/RNANet.db"])
subprocess.run([python_executable, fileDir+"/statistics.py", "--3d-folder", path_to_3D_data,
if self.STATSOPTS is None:
subprocess.run([python_executable, fileDir+"/statistics.py", "--3d-folder", path_to_3D_data,
"--seq-folder", path_to_seq_data, "-r", str(self.CRYSTAL_RES)])
else:
subprocess.run([python_executable, fileDir+"/statistics.py", "--3d-folder", path_to_3D_data,
"--seq-folder", path_to_seq_data, "-r", str(self.CRYSTAL_RES)] + self.STATSOPTS)
# Save additional information
with sqlite3.connect(runDir+"/results/RNANet.db") as conn:
conn.execute('pragma journal_mode=wal')
......
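The option string passed through `stats-opts` has to become a list of separate arguments before it is appended to the `statistics.py` command. A common pitfall is reversing receiver and argument: `" ".split(arg)` splits the one-character string `" "` and returns `[" "]`, whereas `arg.split(" ")` tokenizes the options. The sketch below is a hedged illustration, not the RNANet implementation: the function names are hypothetical, `shlex.split` is swapped in to handle quoting, and `sys.executable` stands in for the `python_executable` variable.

```python
import shlex
import sys


def parse_stats_opts(arg):
    """Split an option string such as '--wadley --distance-matrices' into a list."""
    return shlex.split(arg) if arg else []


def build_statistics_command(script, path_to_3D_data, path_to_seq_data, resolution, stats_opts=None):
    """Build the argument list once, then append the optional extra options."""
    cmd = [sys.executable, script,
           "--3d-folder", path_to_3D_data,
           "--seq-folder", path_to_seq_data,
           "-r", str(resolution)]
    return cmd + (stats_opts or [])


opts = parse_stats_opts("--wadley --distance-matrices")
cmd = build_statistics_command("statistics.py", "/path/to/3D/", "/path/to/sequences/", 20.0, opts)
# The resulting list can be passed directly to subprocess.run(cmd).
```

Appending `stats_opts or []` makes the `None` case and the non-empty case share a single call, which avoids duplicating the command in two branches.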
6ydp_1_AA_1176-2737
6ydw_1_AA_1176-2737
2z9q_1_A_1-72
1ml5_1_b_5-121
1ml5_1_a_1-2914
......@@ -9,6 +11,9 @@
1qza_1_B_1-73
1ls2_1_B_1-73
1gsg_1_T_1-72
7d1a_1_A_805-902
7d0g_1_A_805-913
7d0f_1_A_817-913
3jcr_1_H_1-115
1vy7_1_AY_1-73
1vy7_1_CY_1-73
......@@ -18,15 +23,21 @@
4v48_1_A9_3-118
4v47_1_A9_3-118
2ob7_1_A_10-319
1x1l_1_A_1-132
1zc8_1_Z_1-93
2ob7_1_D_1-132
4v42_1_BB_5-121
1x1l_1_A_1-130
1zc8_1_Z_1-91
2ob7_1_D_1-130
4v42_1_BA_1-2914
4v42_1_BB_5-121
1r2x_1_C_1-58
1r2w_1_C_1-58
1eg0_1_L_1-56
5zzm_1_N_1-2904
3dg2_1_A_1-1542
3dg0_1_A_1-1542
4v48_1_BA_1-1543
4v47_1_BA_1-1542
3dg4_1_A_1-1542
3dg5_1_A_1-1542
5zzm_1_N_1-2903
2rdo_1_B_1-2904
3dg2_1_B_1-2904
3dg0_1_B_1-2904
......@@ -34,21 +45,17 @@
4v47_1_A0_1-2904
3dg4_1_B_1-2904
3dg5_1_B_1-2904
3dg2_1_A_1-1542
3dg0_1_A_1-1542
4v48_1_BA_1-1543
4v47_1_BA_1-1542
3dg4_1_A_1-1542
3dg5_1_A_1-1542
1eg0_1_O_1-73
1zc8_1_A_1-59
1mvr_1_D_1-61
4adx_1_9_1-123
1zn1_1_B_1-59
1jgq_1_A_2-1520
4v42_1_AA_2-1520
1jgo_1_A_2-1520
1jgp_1_A_2-1520
1mvr_1_D_1-59
4c9d_1_D_29-1
4c9d_1_C_29-1
4adx_1_9_1-121
1zn1_1_B_1-59
1emi_1_B_1-108
3iy9_1_A_498-1027
3ep2_1_B_1-50
......@@ -61,7 +68,7 @@
3cw1_1_V_1-138
3cw1_1_v_1-138
2iy3_1_B_9-105
3jcr_1_N_1-107
3jcr_1_N_1-106
2vaz_1_A_64-177
2ftc_1_R_81-1466
3jcr_1_M_1-141
......@@ -70,9 +77,10 @@
3iy8_1_A_1-540
4v5z_1_BY_2-113
4v5z_1_BZ_1-70
4v5z_1_B1_2-125
4adx_1_0_1-2925
1mvr_1_B_3-96
4v5z_1_B1_2-123
1mvr_1_B_1-96
4adx_1_0_1-2923
3eq4_1_Y_1-69
6uz7_1_8_2140-2827
7a5p_1_2_259-449
6uz7_1_8_2140-2825
4v5z_1_AA_1-1563
......
6ydp_1_AA_1176-2737
Could not find nucleotides of chain AA in annotation 6ydp.json. Either there is a problem with 6ydp mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
6ydw_1_AA_1176-2737
Could not find nucleotides of chain AA in annotation 6ydw.json. Either there is a problem with 6ydw mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
2z9q_1_A_1-72
DSSR warning 2z9q.json: no nucleotides found. Ignoring 2z9q_1_A_1-72.
......@@ -31,6 +37,15 @@ DSSR warning 1ls2.json: no nucleotides found. Ignoring 1ls2_1_B_1-73.
1gsg_1_T_1-72
DSSR warning 1gsg.json: no nucleotides found. Ignoring 1gsg_1_T_1-72.
7d1a_1_A_805-902
Could not find nucleotides of chain A in annotation 7d1a.json. Either there is a problem with 7d1a mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
7d0g_1_A_805-913
Could not find nucleotides of chain A in annotation 7d0g.json. Either there is a problem with 7d0g mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
7d0f_1_A_817-913
Could not find nucleotides of chain A in annotation 7d0f.json. Either there is a problem with 7d0f mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
3jcr_1_H_1-115
DSSR warning 3jcr.json: no nucleotides found. Ignoring 3jcr_1_H_1-115.
......@@ -58,21 +73,21 @@ DSSR warning 4v47.json: no nucleotides found. Ignoring 4v47_1_A9_3-118.
2ob7_1_A_10-319
DSSR warning 2ob7.json: no nucleotides found. Ignoring 2ob7_1_A_10-319.
1x1l_1_A_1-132
DSSR warning 1x1l.json: no nucleotides found. Ignoring 1x1l_1_A_1-132.
1zc8_1_Z_1-93
DSSR warning 1zc8.json: no nucleotides found. Ignoring 1zc8_1_Z_1-93.
1x1l_1_A_1-130
DSSR warning 1x1l.json: no nucleotides found. Ignoring 1x1l_1_A_1-130.
2ob7_1_D_1-132
DSSR warning 2ob7.json: no nucleotides found. Ignoring 2ob7_1_D_1-132.
1zc8_1_Z_1-91
DSSR warning 1zc8.json: no nucleotides found. Ignoring 1zc8_1_Z_1-91.
4v42_1_BB_5-121
Could not find nucleotides of chain BB in annotation 4v42.json. Either there is a problem with 4v42 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
2ob7_1_D_1-130
DSSR warning 2ob7.json: no nucleotides found. Ignoring 2ob7_1_D_1-130.
4v42_1_BA_1-2914
Could not find nucleotides of chain BA in annotation 4v42.json. Either there is a problem with 4v42 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
4v42_1_BB_5-121
Could not find nucleotides of chain BB in annotation 4v42.json. Either there is a problem with 4v42 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
1r2x_1_C_1-58
DSSR warning 1r2x.json: no nucleotides found. Ignoring 1r2x_1_C_1-58.
......@@ -82,8 +97,26 @@ DSSR warning 1r2w.json: no nucleotides found. Ignoring 1r2w_1_C_1-58.
1eg0_1_L_1-56
DSSR warning 1eg0.json: no nucleotides found. Ignoring 1eg0_1_L_1-56.
5zzm_1_N_1-2904
DSSR warning 5zzm.json: no nucleotides found. Ignoring 5zzm_1_N_1-2904.
3dg2_1_A_1-1542
DSSR warning 3dg2.json: no nucleotides found. Ignoring 3dg2_1_A_1-1542.
3dg0_1_A_1-1542
DSSR warning 3dg0.json: no nucleotides found. Ignoring 3dg0_1_A_1-1542.
4v48_1_BA_1-1543
DSSR warning 4v48.json: no nucleotides found. Ignoring 4v48_1_BA_1-1543.
4v47_1_BA_1-1542
DSSR warning 4v47.json: no nucleotides found. Ignoring 4v47_1_BA_1-1542.
3dg4_1_A_1-1542
DSSR warning 3dg4.json: no nucleotides found. Ignoring 3dg4_1_A_1-1542.
3dg5_1_A_1-1542
DSSR warning 3dg5.json: no nucleotides found. Ignoring 3dg5_1_A_1-1542.
5zzm_1_N_1-2903
DSSR warning 5zzm.json: no nucleotides found. Ignoring 5zzm_1_N_1-2903.
2rdo_1_B_1-2904
DSSR warning 2rdo.json: no nucleotides found. Ignoring 2rdo_1_B_1-2904.
......@@ -106,39 +139,12 @@ DSSR warning 3dg4.json: no nucleotides found. Ignoring 3dg4_1_B_1-2904.
3dg5_1_B_1-2904
DSSR warning 3dg5.json: no nucleotides found. Ignoring 3dg5_1_B_1-2904.
3dg2_1_A_1-1542
DSSR warning 3dg2.json: no nucleotides found. Ignoring 3dg2_1_A_1-1542.
3dg0_1_A_1-1542
DSSR warning 3dg0.json: no nucleotides found. Ignoring 3dg0_1_A_1-1542.
4v48_1_BA_1-1543
DSSR warning 4v48.json: no nucleotides found. Ignoring 4v48_1_BA_1-1543.
4v47_1_BA_1-1542
DSSR warning 4v47.json: no nucleotides found. Ignoring 4v47_1_BA_1-1542.
3dg4_1_A_1-1542
DSSR warning 3dg4.json: no nucleotides found. Ignoring 3dg4_1_A_1-1542.
3dg5_1_A_1-1542
DSSR warning 3dg5.json: no nucleotides found. Ignoring 3dg5_1_A_1-1542.
1eg0_1_O_1-73
DSSR warning 1eg0.json: no nucleotides found. Ignoring 1eg0_1_O_1-73.
1zc8_1_A_1-59
DSSR warning 1zc8.json: no nucleotides found. Ignoring 1zc8_1_A_1-59.
1mvr_1_D_1-61
DSSR warning 1mvr.json: no nucleotides found. Ignoring 1mvr_1_D_1-61.
4adx_1_9_1-123
DSSR warning 4adx.json: no nucleotides found. Ignoring 4adx_1_9_1-123.
1zn1_1_B_1-59
DSSR warning 1zn1.json: no nucleotides found. Ignoring 1zn1_1_B_1-59.
1jgq_1_A_2-1520
Could not find nucleotides of chain A in annotation 1jgq.json. Either there is a problem with 1jgq mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
......@@ -151,6 +157,21 @@ Could not find nucleotides of chain A in annotation 1jgo.json. Either there is a
1jgp_1_A_2-1520
Could not find nucleotides of chain A in annotation 1jgp.json. Either there is a problem with 1jgp mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
1mvr_1_D_1-59
DSSR warning 1mvr.json: no nucleotides found. Ignoring 1mvr_1_D_1-59.
4c9d_1_D_29-1
Mapping is reversed, this case is not supported (yet).
4c9d_1_C_29-1
Mapping is reversed, this case is not supported (yet).
4adx_1_9_1-121
DSSR warning 4adx.json: no nucleotides found. Ignoring 4adx_1_9_1-121.
1zn1_1_B_1-59
DSSR warning 1zn1.json: no nucleotides found. Ignoring 1zn1_1_B_1-59.
1emi_1_B_1-108
DSSR warning 1emi.json: no nucleotides found. Ignoring 1emi_1_B_1-108.
......@@ -187,8 +208,8 @@ DSSR warning 3cw1.json: no nucleotides found. Ignoring 3cw1_1_v_1-138.
2iy3_1_B_9-105
DSSR warning 2iy3.json: no nucleotides found. Ignoring 2iy3_1_B_9-105.
3jcr_1_N_1-107
DSSR warning 3jcr.json: no nucleotides found. Ignoring 3jcr_1_N_1-107.
3jcr_1_N_1-106
DSSR warning 3jcr.json: no nucleotides found. Ignoring 3jcr_1_N_1-106.
2vaz_1_A_64-177
DSSR warning 2vaz.json: no nucleotides found. Ignoring 2vaz_1_A_64-177.
......@@ -214,19 +235,22 @@ DSSR warning 4v5z.json: no nucleotides found. Ignoring 4v5z_1_BY_2-113.
4v5z_1_BZ_1-70
DSSR warning 4v5z.json: no nucleotides found. Ignoring 4v5z_1_BZ_1-70.
4v5z_1_B1_2-125
DSSR warning 4v5z.json: no nucleotides found. Ignoring 4v5z_1_B1_2-125.
4v5z_1_B1_2-123
DSSR warning 4v5z.json: no nucleotides found. Ignoring 4v5z_1_B1_2-123.
4adx_1_0_1-2925
DSSR warning 4adx.json: no nucleotides found. Ignoring 4adx_1_0_1-2925.
1mvr_1_B_1-96
DSSR warning 1mvr.json: no nucleotides found. Ignoring 1mvr_1_B_1-96.
1mvr_1_B_3-96
DSSR warning 1mvr.json: no nucleotides found. Ignoring 1mvr_1_B_3-96.
4adx_1_0_1-2923
DSSR warning 4adx.json: no nucleotides found. Ignoring 4adx_1_0_1-2923.
3eq4_1_Y_1-69
DSSR warning 3eq4.json: no nucleotides found. Ignoring 3eq4_1_Y_1-69.
6uz7_1_8_2140-2827
7a5p_1_2_259-449
Could not find nucleotides of chain 2 in annotation 7a5p.json. Either there is a problem with 7a5p mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
6uz7_1_8_2140-2825
Could not find nucleotides of chain 8 in annotation 6uz7.json. Either there is a problem with 6uz7 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
4v5z_1_AA_1-1563
......
......@@ -4,7 +4,7 @@ cd /home/lbecquey/Projects/RNANet
rm -rf latest_run.log errors.txt
# Run RNANet
bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --extract -s --archive' > latest_run.log 2>&1
bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ --sina -r 20.0 --extract -s --archive' > latest_run.log 2>&1
echo 'Compressing RNANet.db.gz...' >> latest_run.log
touch results/RNANet.db # update last modification date
gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db # compress it
......