Louis BECQUEY

v pre-1.6b

1 ############################################################################################
2 +v 1.6 beta, August 2021
3 +
4 +Aglaé Tabot joins the development team. Khodor Hannoush leaves.
5 +
6 +FEATURE CHANGES
7 + - Distinct options --cmalign-opts and --cmalign-rrna-opts allow adapting the parameters for LSU and SSU families.
8 + The LSU and SSU are now aligned with Infernal options '--cpu 10 --mxsize 8192 --mxtau 0.1', which is slow,
9 + requires up to 100 GB of RAM, and yields a suboptimal alignment (tau=0.1 is quite bad), but is homogeneous with the other families.
10 + - The LSU and SSU therefore have defined cm_coords fields, so distance matrices can be computed.
11 + - We now provide for download the renumbered (standardised) 3D MMCIF files, the nucleotides being numbered by their "index_chain" in the database.
12 + - We now provide for download the sequences of the 3D chains aligned by Rfam family (without Rfam sequences, which have been removed).
13 + - statistics.py now computes histograms and a density estimation with Gaussian mixture models for a large set of geometric parameters,
14 + measured on the unmapped data at a given resolution threshold. The parameters include:
15 + * All atom bonded distances and torsion angles
16 + * Distances, flat angles and torsion angles in the Pyle/VFold model
17 + * Distances, flat angles and torsion angles in the HiRE-RNA model
18 + * Sequence-dependent geometric parameters of the basepairs for all non-canonical basepairs in the HiRE-RNA model.
19 + The data is saved as JSON files of parameters, and numerous figures are produced to illustrate the distributions.
20 + The number of Gaussians to use in the GMMs is hard-coded in geometric_stats.py after our first estimation. If you do not want to trust this estimation,
21 + you can ignore it with option --rescan-nmodes. An exploration of the number of Gaussians from 1 to 8 will be performed, and the best GMM will be kept.
22 +
23 +BUG CORRECTIONS
24 + - New code file geometric_stats.py
25 + - New automation script that starts from scratch
26 + - Many small fixes
27 + - Performance tweaks
28 +
29 +TECHNICAL CHANGES
30 + - Switched to DSSR Pro.
31 + - Switched to esl-alimerge instead of cmalign --merge to merge alignments.
32 +
33 +############################################################################################
34 v 1.5 beta, April 2021
35
36 FEATURE CHANGES
......
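The --rescan-nmodes behaviour described in the changelog above (explore 1 to 8 Gaussians, keep the best GMM) can be sketched as follows. This is a minimal illustration assuming scikit-learn and model selection by BIC; the function name `best_gmm` and the exact selection criterion used by geometric_stats.py are assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_gmm(values, max_modes=8, seed=0):
    """Fit GMMs with 1..max_modes components and keep the lowest-BIC model."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    best, best_bic = None, np.inf
    for n in range(1, max_modes + 1):
        gmm = GaussianMixture(n_components=n, random_state=seed).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best

# A clearly bimodal sample (e.g. two distance populations) should get >= 2 components
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(1.5, 0.05, 500), rng.normal(5.8, 0.10, 500)])
model = best_gmm(sample)
print(model.n_components)
```

The BIC penalty keeps the component count low when extra Gaussians do not improve the fit, which is why hard-coding the estimated counts and only re-scanning on demand is a reasonable speed trade-off.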
1 MIT License
2
3 -Copyright (c) 2019 Louis Becquey
3 +Copyright (c) 2019-2021 IBISC, Université Paris Saclay
4
5 Permission is hereby granted, free of charge, to any person obtaining a copy
6 of this software and associated documentation files (the "Software"), to deal
......
...@@ -35,7 +35,7 @@ nohup bash -c 'time docker run --rm -v /path/to/3D/data/folder:/3D -v /path/to/s
35 # Method 2 : Classical command line installation (Linux only)
36
37 You need to install the dependencies:
38 -- DSSR 1.9.9 or newer, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
38 +- DSSR 1.9.9 or newer, you need to register to ask for a DSSR (academic) license [on that page](http://innovation.columbia.edu/technologies/CU20391). Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
39 - Infernal 1.1.4 or newer, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign`, `cmfetch`, `cmbuild`, `esl-alimanip`, `esl-alimerge`, `esl-alipid` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.
40 - SINA (if you plan to use it), follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH.
41 - Sqlite 3, available under the name *sqlite* in every distro's package manager,
......
This diff could not be displayed because it is too large.
This diff could not be displayed because it is too large.
...@@ -5,7 +5,7 @@ rm -rf latest_run.log errors.txt
5
6 # Run RNANet
7 bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --no-homology --redundant --extract' > latest_run.log 2>&1
8 -bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --redundant --sina --extract -s --stats-opts="--wadley --distance-matrices" --archive' > latest_run.log 2>&1
8 +bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --redundant --extract -s --stats-opts="-r 20.0 --wadley --hire-rna --distance-matrices" --archive' >> latest_run.log 2>&1
9 echo 'Compressing RNANet.db.gz...' >> latest_run.log
10 touch results/RNANet.db # update last modification date
11 gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db # compress it
......
1 +# This script is meant to be run periodically as a cron job.
2 +# This one uses the --from-scratch argument, so everything is recomputed! /!\
3 +# Run it once or twice a year; otherwise, the faster update runs should be enough.
4 +
5 +cd /home/lbecquey/Projects/RNANet
6 +rm -rf latest_run.log errors.txt
7 +
8 +# Run RNANet
9 +bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ --from-scratch --ignore-issues -r 20.0 --no-homology --redundant --extract' > latest_run.log 2>&1
10 +bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ --from-scratch --ignore-issues -r 20.0 --redundant --extract -s --stats-opts="-r 20.0 --wadley --hire-rna --distance-matrices" --archive' >> latest_run.log 2>&1
11 +echo 'Compressing RNANet.db.gz...' >> latest_run.log
12 +touch results/RNANet.db # update last modification date
13 +gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db # compress it
14 +rm -f results/RNANet.db-wal results/RNANet.db-shm # SQLite temporary files
15 +
16 +# Save the latest results
17 +export DATE=`date +%Y%m%d`
18 +echo "Creating new release in ./archive/ folder ($DATE)..." >> latest_run.log
19 +cp /home/lbecquey/Projects/RNANet/results/summary.csv /home/lbecquey/Projects/RNANet/archive/summary_latest.csv
20 +cp /home/lbecquey/Projects/RNANet/results/summary.csv "/home/lbecquey/Projects/RNANet/archive/summary_$DATE.csv"
21 +cp /home/lbecquey/Projects/RNANet/results/families.csv /home/lbecquey/Projects/RNANet/archive/families_latest.csv
22 +cp /home/lbecquey/Projects/RNANet/results/families.csv "/home/lbecquey/Projects/RNANet/archive/families_$DATE.csv"
23 +cp /home/lbecquey/Projects/RNANet/results/frequencies.csv /home/lbecquey/Projects/RNANet/archive/frequencies_latest.csv
24 +cp /home/lbecquey/Projects/RNANet/results/pair_types.csv /home/lbecquey/Projects/RNANet/archive/pair_types_latest.csv
25 +mv /home/lbecquey/Projects/RNANet/results/RNANet.db.gz /home/lbecquey/Projects/RNANet/archive/
26 +
27 +# Init Seafile synchronization between RNANet library and ./archive/ folder (just the first time !)
28 +# seaf-cli sync -l 8e082c6e-b9ed-4b2f-9279-de2177134c57 -s https://entrepot.ibisc.univ-evry.fr -u l****.b*****y@univ-evry.fr -p ****************** -d archive/
29 +
30 +# Sync in Seafile
31 +seaf-cli start >> latest_run.log 2>&1
32 +echo 'Waiting 15m for SeaFile synchronization...' >> latest_run.log
33 +sleep 15m
34 +echo `seaf-cli status` >> latest_run.log
35 +seaf-cli stop >> latest_run.log 2>&1
36 +echo 'We are '`date`', update completed.' >> latest_run.log
37 +
...@@ -36,6 +36,6 @@ for fam in families:
36
37 # Now re-run RNANet normally.
38 command = ["python3.8", "./RNAnet.py", "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "-r", "20.0",
39 -           "--redundant", "--sina", "--extract", "-s", "--stats-opts=\"--wadley --distance-matrices\""]
39 +           "--redundant", "--extract", "-s", "--stats-opts=\"-r 20.0 --wadley --hire-rna --distance-matrices\""]
40 print(' '.join(command))
41 subprocess.run(command)
\ No newline at end of file
......
...@@ -3,8 +3,9 @@ import subprocess, os, sys
3
4 # Put a list of problematic chains here, they will be properly deleted and recomputed
5 problems = [
6 -    "1k73_1_A",
7 -    "1k73_1_B"
6 +    "7nhm_1_A_1-2923",
7 +    "4wfa_1_X_1-2923",
8 +    "4wce_1_X_1-2923"
9 ]
10
11 # provide the path to your data folders, the RNANet.db file, and the RNANet.py file as arguments to this script
...@@ -22,6 +23,7 @@ for p in problems:
23
24     # Remove the datapoints files and 3D files
25     subprocess.run(["rm", '-f', path_to_3D_data + f"/rna_mapped_to_Rfam/{p}.cif"])
26 +    subprocess.run(["rm", '-f', path_to_3D_data + f"/rna_only/{p}.cif"])
27     files = [ f for f in os.listdir(path_to_3D_data + "/datapoints") if p in f ]
28     for f in files:
29         subprocess.run(["rm", '-f', path_to_3D_data + f"/datapoints/{f}"])
...@@ -38,14 +40,14 @@ for p in problems:
40         print(' '.join(command))
41         subprocess.run(command)
42
41 -        command = ["python3.8", path_to_RNANet, "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "-r", "20.0", "--extract", "--only", p]
43 +        command = ["python3.8", path_to_RNANet, "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "--redundant", "-r", "20.0", "--extract", "--only", p]
44     else:
45         # Delete the chain from the database, and the associated nucleotides and re_mappings, using foreign keys
46         command = ["sqlite3", path_to_db, f"PRAGMA foreign_keys=ON; delete from chain where structure_id=\"{structure}\" and chain_name=\"{chain}\" and rfam_acc is null;"]
47         print(' '.join(command))
48         subprocess.run(command)
49
48 -        command = ["python3.8", path_to_RNANet, "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "-r", "20.0", "--no-homology", "--extract", "--only", p]
50 +        command = ["python3.8", path_to_RNANet, "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "--redundant", "-r", "20.0", "--no-homology", "--extract", "--only", p]
51
52     # Re-run RNANet
53     os.chdir(os.path.dirname(os.path.realpath(path_to_db)) + '/../')
......
...@@ -7,7 +7,7 @@
7 # Run this file if you want the base counts, pair-type counts, identity percents, etc
8 # in the database.
9
10 -import getopt, glob, json, os, sqlite3, shlex, subprocess, sys, warnings
10 +import getopt, json, os, sqlite3, shlex, subprocess, sys, warnings
11 import numpy as np
12 import pandas as pd
13 import scipy.stats as st
...@@ -27,7 +27,6 @@ from tqdm import tqdm
27 from collections import Counter
28 from setproctitle import setproctitle
29 from RNAnet import Job, read_cpu_number, sql_ask_database, sql_execute, warn, notify, init_with_tqdm, trace_unhandled_exceptions
30 -from geometric_stats import *
30
31 np.set_printoptions(threshold=sys.maxsize, linewidth=np.inf, precision=8)
32 path_to_3D_data = "tobedefinedbyoptions"
...@@ -38,6 +37,8 @@ res_thr = 20.0 # default: all structures
37 LSU_set = ("RF00002", "RF02540", "RF02541", "RF02543", "RF02546") # From Rfam CLAN 00112
38 SSU_set = ("RF00177", "RF02542", "RF02545", "RF01959", "RF01960") # From Rfam CLAN 00111
39
40 +from geometric_stats import * # after definition of the variables
41 +
42 @trace_unhandled_exceptions
43 def reproduce_wadley_results(carbon=4, show=False, sd_range=(1,4), res=2.0):
44     """
...@@ -934,7 +935,7 @@ def par_distance_matrix(filelist, f, label, cm_coords, consider_all_atoms, s):
935         coordinates = nt_3d_centers(filename, consider_all_atoms)
936         if not len(coordinates):
937             # there are no nucleotides in the file, or no C1' atoms, for example.
937 -            warn("No C1' atoms in " + filename)
938 +            warn("No C1' atoms in " + filename.split('/')[-1] + ", ignoring")
939             return None, None, None
940     except FileNotFoundError:
941         return None, None, None
...@@ -951,8 +952,8 @@ def par_distance_matrix(filelist, f, label, cm_coords, consider_all_atoms, s):
952             try:
953                 coordinates_with_gaps.append(coordinates[i - nb_gap])
954             except IndexError as e:
954 -                warn(f"(unknown) : {s.seq} at position {i}, we get {e}.", error=True)
955 -                exit(0)
955 +                warn(f"{filename.split('/')[-1]} : {s.seq} at position {i}, we get {e}.", error=True)
956 +                return None, None, None
957
958     # Build the pairwise distances
959     d = np.zeros((len(s.seq), len(s.seq)), dtype=np.float32)
...@@ -1099,7 +1100,7 @@ def get_avg_std_distance_matrix(f, consider_all_atoms, multithread=False):
1100     std = np.divide(std, counts, where=counts>0, out=np.full_like(std, np.NaN))
1101     mask = np.invert(np.isnan(std))
1102     value = std[mask] - np.power(avg[mask], 2)
1102 -    if ((value[value<0] < -1e-2).any()):
1103 +    if ((value[value < 0] < -1e-2).any()):
1104         warn("Erasing very negative variance value !")
1105     value[value<0] = 0.0 # floating point problems !
1106     std[mask] = np.sqrt(value)
...@@ -1127,8 +1128,48 @@ def get_avg_std_distance_matrix(f, consider_all_atoms, multithread=False):
1128     if not multithread:
1129         idxQueue.put(thr_idx) # replace the thread index in the queue
1130         setproctitle(f"RNANet statistics.py Worker {thr_idx+1} finished")
1131 +    else:
1132 +        # basically, for the rRNAs:
1133 +        # we delete the individual csv files of each chain, together they weigh hundreds of gigabytes
1134 +        warn(f"Removing {f} ({label}) individual distance matrices, they weigh too much. Keeping the averages and standard deviations.")
1135 +        for csv in glob.glob(runDir + '/results/distance_matrices/' + f + '_' + label + "/*-" + f + ".csv"):
1136 +            try:
1137 +                os.remove(csv)
1138 +            except FileNotFoundError:
1139 +                pass
1140     return 0
1141
1142 +@trace_unhandled_exceptions
1143 +def measure_from_structure(f):
1144 +    """
1145 +    Do geometric measures required on a given filename
1146 +    """
1147 +
1148 +    name = f.split('.')[0]
1149 +
1150 +    global idxQueue
1151 +    thr_idx = idxQueue.get()
1152 +    setproctitle(f"RNANet statistics.py Worker {thr_idx+1} measure_from_structure({f})")
1153 +
1154 +    # Open the structure
1155 +    with warnings.catch_warnings():
1156 +        # Ignore the PDB problems. This mostly warns that some chain is discontinuous.
1157 +        warnings.simplefilter('ignore', Bio.PDB.PDBExceptions.PDBConstructionWarning)
1158 +        warnings.simplefilter('ignore', Bio.PDB.PDBExceptions.BiopythonWarning)
1159 +        parser = MMCIFParser()
1160 +        s = parser.get_structure(f, os.path.abspath(path_to_3D_data + "rna_only/" + f))
1161 +
1162 +    # pyle_measures(name, s, thr_idx)
1163 +    measures_aa(name, s, thr_idx)
1164 +    if DO_HIRE_RNA_MEASURES:
1165 +        measures_hrna(name, s, thr_idx)
1166 +        measures_hrna_basepairs(name, s, path_to_3D_data, thr_idx)
1167 +    if DO_WADLEY_ANALYSIS:
1168 +        measures_pyle(name, s, thr_idx)
1169 +
1170 +    idxQueue.put(thr_idx) # replace the thread index in the queue
1171 +    setproctitle(f"RNANet statistics.py Worker {thr_idx+1} finished")
1172 +
1173 def family_order(f):
1174     # sort the RNA families so that the plots are readable
1175
...@@ -1154,7 +1195,11 @@ def nt_3d_centers(cif_file, consider_all_atoms):
1195     try:
1196         structure = MMCIFParser().get_structure(cif_file, cif_file)
1197     except Exception as e:
1157 -        warn(f"{cif_file} : {e}", error=True)
1198 +        warn(f"{cif_file.split('/')[-1]} : {e}", error=True)
1199 +        with open(runDir + "/errors.txt", "a") as f:
1200 +            f.write(f"Exception in nt_3d_centers({cif_file.split('/')[-1]})\n")
1201 +            f.write(str(e))
1202 +            f.write("\n\n")
1203         return result
1204     for model in structure:
1205         for chain in model:
...@@ -1205,6 +1250,7 @@ def log_to_pbar(pbar):
1250         pbar.update(1)
1251     return update
1252
1253 +@trace_unhandled_exceptions
1254 def process_jobs(joblist):
1255     """
1256     Starts a Pool to run the Job() objects in joblist.
...@@ -1302,7 +1348,7 @@ if __name__ == "__main__":
1348         os.makedirs(runDir + "/results/figures/GMM/HiRE-RNA/angles/", exist_ok=True)
1349         os.makedirs(runDir + "/results/figures/GMM/HiRE-RNA/torsions/", exist_ok=True)
1350         os.makedirs(runDir + "/results/figures/GMM/HiRE-RNA/basepairs/", exist_ok=True)
1305 -    elif opt == "rescan-nmodes":
1351 +    elif opt == "--rescan-nmodes":
1352         RESCAN_GMM_COMP_NUM = True
1353
1354     # Load mappings. famlist will contain only families with structures at this resolution threshold.
...@@ -1388,7 +1434,6 @@ if __name__ == "__main__":
1434         if path.isfile(path_to_3D_data + "datapoints/" + f.split('.')[0]):
1435             joblist.append(Job(function=measure_from_structure, args=(f,), how_many_in_parallel=nworkers)) # All-atom distances
1436
1391 -
1437     process_jobs(joblist)
1438
1439     # Now process the memory-heavy tasks family by family
...@@ -1412,24 +1457,23 @@ if __name__ == "__main__":
1457     general_stats()
1458     os.makedirs(runDir+"/results/figures/GMM/", exist_ok=True)
1459     os.makedirs(runDir+"/results/geometry/json/", exist_ok=True)
1415 -    concat_dataframes(runDir + '/results/geometry/all-atoms/distances/', 'dist_atoms.csv')
1460 +    concat_dataframes(runDir + '/results/geometry/all-atoms/distances/', 'dist_atoms.csv', nworkers)
1461     if DO_HIRE_RNA_MEASURES:
1417 -        concat_dataframes(runDir + '/results/geometry/HiRE-RNA/distances/', 'distances_HiRERNA.csv')
1418 -        concat_dataframes(runDir + '/results/geometry/HiRE-RNA/angles/', 'angles_HiRERNA.csv')
1419 -        concat_dataframes(runDir + '/results/geometry/HiRE-RNA/torsions/', 'torsions_HiRERNA.csv')
1420 -        concat_dataframes(runDir + '/results/geometry/HiRE-RNA/basepairs/', 'basepairs_HiRERNA.csv')
1462 +        concat_dataframes(runDir + '/results/geometry/HiRE-RNA/distances/', 'distances_HiRERNA.csv', nworkers)
1463 +        concat_dataframes(runDir + '/results/geometry/HiRE-RNA/angles/', 'angles_HiRERNA.csv', nworkers)
1464 +        concat_dataframes(runDir + '/results/geometry/HiRE-RNA/torsions/', 'torsions_HiRERNA.csv', nworkers)
1465 +        concat_dataframes(runDir + '/results/geometry/HiRE-RNA/basepairs/', 'basepairs_HiRERNA.csv', nworkers)
1466     if DO_WADLEY_ANALYSIS:
1422 -        concat_dataframes(runDir + '/results/geometry/Pyle/distances/', 'distances_pyle.csv')
1423 -        concat_dataframes(runDir + '/results/geometry/Pyle/angles/', 'flat_angles_pyle.csv')
1467 +        concat_dataframes(runDir + '/results/geometry/Pyle/distances/', 'distances_pyle.csv', nworkers)
1468 +        concat_dataframes(runDir + '/results/geometry/Pyle/angles/', 'flat_angles_pyle.csv', nworkers)
1469     joblist = []
1425 -    joblist.append(Job(function=gmm_aa_dists, args=(RESCAN_GMM_COMP_NUM)))
1426 -    joblist.append(Job(function=gmm_aa_torsions, args=(RESCAN_GMM_COMP_NUM)))
1470 +    joblist.append(Job(function=gmm_aa_dists, args=(RESCAN_GMM_COMP_NUM,)))
1471 +    joblist.append(Job(function=gmm_aa_torsions, args=(RESCAN_GMM_COMP_NUM, res_thr)))
1472     if DO_HIRE_RNA_MEASURES:
1428 -        joblist.append(Job(function=gmm_hrna, args=(RESCAN_GMM_COMP_NUM)))
1429 -        joblist.append(Job(function=gmm_hrna_basepairs, args=(RESCAN_GMM_COMP_NUM)))
1473 +        joblist.append(Job(function=gmm_hrna, args=(RESCAN_GMM_COMP_NUM,)))
1474 +        joblist.append(Job(function=gmm_hrna_basepairs, args=(RESCAN_GMM_COMP_NUM,)))
1475     if DO_WADLEY_ANALYSIS:
1431 -        joblist.append(Job(function=gmm_pyle, args=(RESCAN_GMM_COMP_NUM)))
1432 -    if len(joblist):
1433 -        process_jobs(joblist)
1476 +        joblist.append(Job(function=gmm_pyle, args=(RESCAN_GMM_COMP_NUM, res_thr)))
1477 +    process_jobs(joblist)
1478     merge_jsons()
1479
......
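Several of the statistics.py fixes above replace `args=(RESCAN_GMM_COMP_NUM)` with `args=(RESCAN_GMM_COMP_NUM,)`. The trailing comma is significant in Python: without it, the parentheses only group the expression and no tuple is created, so a Job-style wrapper that unpacks `args` with `*` receives a bare boolean and fails. A minimal illustration (`run` and `gmm_task` are hypothetical stand-ins for the Job machinery, not the project's code):

```python
flag = True
not_a_tuple = (flag)    # parentheses only group: still a bool
a_tuple = (flag,)       # the trailing comma makes a 1-element tuple

def run(function, args):
    # mimics a Job-style wrapper that unpacks its argument tuple
    return function(*args)

def gmm_task(rescan):
    return f"rescan={rescan}"

print(run(gmm_task, a_tuple))    # prints rescan=True
# run(gmm_task, not_a_tuple) would raise a TypeError: a bool cannot be unpacked with *
```

This is also why the already-correct `args=(f,)` on the measure_from_structure jobs keeps its comma.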