Louis BECQUEY

v pre-1.6b

############################################################################################
v 1.6 beta, August 2021
Aglaé Tabot joins the development team. Khodor Hannoush leaves.
FEATURE CHANGES
- Distinct options --cmalign-opts and --cmalign-rrna-opts allow to adapt the parameters for LSU and SSU families.
The LSU and SSU are now aligned with Infernal options '--cpu 10 --mxsize 8192 --mxtau 0.1', which is slow,
requires up to 100 GB of RAM, and yields a suboptimal alignment (tau=0.1 is quite bad), but is homogenous with the other families.
- The LSU and SSU therefore have defined cm_coords fields, and therefore distance matrices can be computed.
- We now provide for download the renumbered (standardised) 3D MMCIF files, the nucleotides being numbered by their "index_chain" in the database.
- We now provide for download the sequences of the 3D chains aligned by Rfam family (without Rfam sequences, which have been removed).
- statistics.py now computes histograms and a density estimation with Gaussian mixture models for a large set of geometric parameters,
measured on the unmapped data at a given resolution threshold. The parameters include:
* All atom bonded distances and torsion angles
* Distances, flat angles and torsion angles in the Pyle/VFold model
* Distances, flat angles and torsion anfles in the HiRE-RNA model
* Sequence-dependant geometric parameters of the basepairs for all non-canonical basepairs in the HiRE-RNA model.
The data is saved as JSON files of parameters, and numerous figures are produced to illustrate the distributions.
The number of gaussians to use in the GMMs are hard-coded in geometric_stats.py after our first estimation. If you do not want to trust this estimation,
you can ignore it with option --rescan-nmodes. An exploration of the number of Gaussians from 1 to 8 will be performed, and the best GMM will be kept.
BUG CORRECTIONS
- New code file geometric_stats.py
- New automation script that starts from scratch
- Many small fixes
- Performance tweaks
TECHNICAL CHANGES
- Switched to DSSR Pro.
- Switched to esl-alimerge instead of cmalign --merge to merge alignments.
############################################################################################
v 1.5 beta, April 2021
FEATURE CHANGES
......
MIT License
Copyright (c) 2019 Louis Becquey
Copyright (c) 2019-2021 IBISC, Université Paris Saclay
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......
This diff is collapsed. Click to expand it.
......@@ -35,7 +35,7 @@ nohup bash -c 'time docker run --rm -v /path/to/3D/data/folder:/3D -v /path/to/s
# Method 2 : Classical command line installation (Linux only)
You need to install the dependencies:
- DSSR 1.9.9 or newer, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
- DSSR 1.9.9 or newer, you need to register to ask for a DSSR (academic) license [on that page](http://innovation.columbia.edu/technologies/CU20391). Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
- Infernal 1.1.4 or newer, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign`, `cmfetch`, `cmbuild`, `esl-alimanip`, `esl-alimerge`, `esl-alipid` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.
- SINA (if you plan to use it), follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH.
- Sqlite 3, available under the name *sqlite* in every distro's package manager,
......
This diff could not be displayed because it is too large.
This diff is collapsed. Click to expand it.
This diff could not be displayed because it is too large.
......@@ -5,7 +5,7 @@ rm -rf latest_run.log errors.txt
# Run RNANet
bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --no-homology --redundant --extract' > latest_run.log 2>&1
bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --redundant --sina --extract -s --stats-opts="--wadley --distance-matrices" --archive' > latest_run.log 2>&1
bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --redundant --extract -s --stats-opts="-r 20.0 --wadley --hire-rna --distance-matrices" --archive' >> latest_run.log 2>&1
echo 'Compressing RNANet.db.gz...' >> latest_run.log
touch results/RNANet.db # update last modification date
gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db # compress it
......
# This is a script supposed to be run periodically as a cron job
# This one uses argument --from-scratch, so all is recomputed ! /!\
# run it one or twice a year, otherwise, the faster update runs should be enough.
cd /home/lbecquey/Projects/RNANet
rm -rf latest_run.log errors.txt
# Run RNANet
bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ --from-scratch --ignore-issues -r 20.0 --no-homology --redundant --extract' > latest_run.log 2>&1
bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ --from-scratch --ignore-issues -r 20.0 --redundant --extract -s --stats-opts="-r 20.0 --wadley --hire-rna --distance-matrices" --archive' >> latest_run.log 2>&1
echo 'Compressing RNANet.db.gz...' >> latest_run.log
touch results/RNANet.db # update last modification date
gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db # compress it
rm -f results/RNANet.db-wal results/RNANet.db-shm # SQLite temporary files
# Save the latest results
export DATE=`date +%Y%m%d`
echo "Creating new release in ./archive/ folder ($DATE)..." >> latest_run.log
cp /home/lbecquey/Projects/RNANet/results/summary.csv /home/lbecquey/Projects/RNANet/archive/summary_latest.csv
cp /home/lbecquey/Projects/RNANet/results/summary.csv "/home/lbecquey/Projects/RNANet/archive/summary_$DATE.csv"
cp /home/lbecquey/Projects/RNANet/results/families.csv /home/lbecquey/Projects/RNANet/archive/families_latest.csv
cp /home/lbecquey/Projects/RNANet/results/families.csv "/home/lbecquey/Projects/RNANet/archive/families_$DATE.csv"
cp /home/lbecquey/Projects/RNANet/results/frequencies.csv /home/lbecquey/Projects/RNANet/archive/frequencies_latest.csv
cp /home/lbecquey/Projects/RNANet/results/pair_types.csv /home/lbecquey/Projects/RNANet/archive/pair_types_latest.csv
mv /home/lbecquey/Projects/RNANet/results/RNANet.db.gz /home/lbecquey/Projects/RNANet/archive/
# Init Seafile synchronization between RNANet library and ./archive/ folder (just the first time !)
# seaf-cli sync -l 8e082c6e-b9ed-4b2f-9279-de2177134c57 -s https://entrepot.ibisc.univ-evry.fr -u l****.b*****y@univ-evry.fr -p ****************** -d archive/
# Sync in Seafile
seaf-cli start >> latest_run.log 2>&1
echo 'Waiting 10m for SeaFile synchronization...' >> latest_run.log
sleep 15m
echo `seaf-cli status` >> latest_run.log
seaf-cli stop >> latest_run.log 2>&1
echo 'We are '`date`', update completed.' >> latest_run.log
......@@ -36,6 +36,6 @@ for fam in families:
# Now re run RNANet normally.
command = ["python3.8", "./RNAnet.py", "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "-r", "20.0",
"--redundant", "--sina", "--extract", "-s", "--stats-opts=\"--wadley --distance-matrices\""]
"--redundant", "--extract", "-s", "--stats-opts=\"-r 20.0 --wadley --hire-rna --distance-matrices\""]
print(' '.join(command))
subprocess.run(command)
\ No newline at end of file
......
......@@ -3,8 +3,9 @@ import subprocess, os, sys
# Put a list of problematic chains here, they will be properly deleted and recomputed
problems = [
"1k73_1_A",
"1k73_1_B"
"7nhm_1_A_1-2923"
"4wfa_1_X_1-2923"
"4wce_1_X_1-2923"
]
# provide the path to your data folders, the RNANet.db file, and the RNANet.py file as arguments to this script
......@@ -22,6 +23,7 @@ for p in problems:
# Remove the datapoints files and 3D files
subprocess.run(["rm", '-f', path_to_3D_data + f"/rna_mapped_to_Rfam/{p}.cif"])
subprocess.run(["rm", '-f', path_to_3D_data + f"/rna_only/{p}.cif"])
files = [ f for f in os.listdir(path_to_3D_data + "/datapoints") if p in f ]
for f in files:
subprocess.run(["rm", '-f', path_to_3D_data + f"/datapoints/{f}"])
......@@ -38,14 +40,14 @@ for p in problems:
print(' '.join(command))
subprocess.run(command)
command = ["python3.8", path_to_RNANet, "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "-r", "20.0", "--extract", "--only", p]
command = ["python3.8", path_to_RNANet, "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "--redundant", "-r", "20.0", "--extract", "--only", p]
else:
# Delete the chain from the database, and the associated nucleotides and re_mappings, using foreign keys
command = ["sqlite3", path_to_db, f"PRAGMA foreign_keys=ON; delete from chain where structure_id=\"{structure}\" and chain_name=\"{chain}\" and rfam_acc is null;"]
print(' '.join(command))
subprocess.run(command)
command = ["python3.8", path_to_RNANet, "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "-r", "20.0", "--no-homology", "--extract", "--only", p]
command = ["python3.8", path_to_RNANet, "--3d-folder", path_to_3D_data, "--seq-folder", path_to_seq_data, "--redundant", "-r", "20.0", "--no-homology", "--extract", "--only", p]
# Re-run RNANet
os.chdir(os.path.dirname(os.path.realpath(path_to_db)) + '/../')
......
......@@ -7,7 +7,7 @@
# Run this file if you want the base counts, pair-type counts, identity percents, etc
# in the database.
import getopt, glob, json, os, sqlite3, shlex, subprocess, sys, warnings
import getopt, json, os, sqlite3, shlex, subprocess, sys, warnings
import numpy as np
import pandas as pd
import scipy.stats as st
......@@ -27,7 +27,6 @@ from tqdm import tqdm
from collections import Counter
from setproctitle import setproctitle
from RNAnet import Job, read_cpu_number, sql_ask_database, sql_execute, warn, notify, init_with_tqdm, trace_unhandled_exceptions
from geometric_stats import *
np.set_printoptions(threshold=sys.maxsize, linewidth=np.inf, precision=8)
path_to_3D_data = "tobedefinedbyoptions"
......@@ -38,6 +37,8 @@ res_thr = 20.0 # default: all structures
LSU_set = ("RF00002", "RF02540", "RF02541", "RF02543", "RF02546") # From Rfam CLAN 00112
SSU_set = ("RF00177", "RF02542", "RF02545", "RF01959", "RF01960") # From Rfam CLAN 00111
from geometric_stats import * # after definition of the variables
@trace_unhandled_exceptions
def reproduce_wadley_results(carbon=4, show=False, sd_range=(1,4), res=2.0):
"""
......@@ -934,7 +935,7 @@ def par_distance_matrix(filelist, f, label, cm_coords, consider_all_atoms, s):
coordinates = nt_3d_centers(filename, consider_all_atoms)
if not len(coordinates):
# there is not nucleotides in the file, or no C1' atoms for example.
warn("No C1' atoms in " + filename)
warn("No C1' atoms in " + filename.split('/')[-1] + ", ignoring")
return None, None, None
except FileNotFoundError:
return None, None, None
......@@ -951,8 +952,8 @@ def par_distance_matrix(filelist, f, label, cm_coords, consider_all_atoms, s):
try:
coordinates_with_gaps.append(coordinates[i - nb_gap])
except IndexError as e:
warn(f"{filename} : {s.seq} at position {i}, we get {e}.", error=True)
exit(0)
warn(f"{filename.split('/')[-1]} : {s.seq} at position {i}, we get {e}.", error=True)
return None, None, None
# Build the pairwise distances
d = np.zeros((len(s.seq), len(s.seq)), dtype=np.float32)
......@@ -1099,7 +1100,7 @@ def get_avg_std_distance_matrix(f, consider_all_atoms, multithread=False):
std = np.divide(std, counts, where=counts>0, out=np.full_like(std, np.NaN))
mask = np.invert(np.isnan(std))
value = std[mask] - np.power(avg[mask], 2)
if ((value[value<0] < -1e-2).any()):
if ((value[value < 0] < -1e-2).any()):
warn("Erasing very negative variance value !")
value[value<0] = 0.0 # floating point problems !
std[mask] = np.sqrt(value)
......@@ -1127,8 +1128,48 @@ def get_avg_std_distance_matrix(f, consider_all_atoms, multithread=False):
if not multithread:
idxQueue.put(thr_idx) # replace the thread index in the queue
setproctitle(f"RNANet statistics.py Worker {thr_idx+1} finished")
else:
# basically, for the rRNAs
# we delete the unique csv files for each chain, they wheight hundreds of gigabytes together
warn(f"Removing {f} ({label}) individual distance matrices, they weight too much. keeping the averages and standard deviations.")
for csv in glob.glob(runDir + '/results/distance_matrices/' + f + '_'+ label + "/*-" + f + ".csv"):
try:
os.remove(csv)
except FileNotFoundError:
pass
return 0
@trace_unhandled_exceptions
def measure_from_structure(f):
"""
Do geometric measures required on a given filename
"""
name = f.split('.')[0]
global idxQueue
thr_idx = idxQueue.get()
setproctitle(f"RNANet statistics.py Worker {thr_idx+1} measure_from_structure({f})")
# Open the structure
with warnings.catch_warnings():
# Ignore the PDB problems. This mostly warns that some chain is discontinuous.
warnings.simplefilter('ignore', Bio.PDB.PDBExceptions.PDBConstructionWarning)
warnings.simplefilter('ignore', Bio.PDB.PDBExceptions.BiopythonWarning)
parser = MMCIFParser()
s = parser.get_structure(f, os.path.abspath(path_to_3D_data+ "rna_only/" + f))
#pyle_measures(name, s, thr_idx)
measures_aa(name, s, thr_idx)
if DO_HIRE_RNA_MEASURES:
measures_hrna(name, s, thr_idx)
measures_hrna_basepairs(name, s, path_to_3D_data, thr_idx)
if DO_WADLEY_ANALYSIS:
measures_pyle(name, s, thr_idx)
idxQueue.put(thr_idx) # replace the thread index in the queue
setproctitle(f"RNANet statistics.py Worker {thr_idx+1} finished")
def family_order(f):
# sort the RNA families so that the plots are readable
......@@ -1154,7 +1195,11 @@ def nt_3d_centers(cif_file, consider_all_atoms):
try:
structure = MMCIFParser().get_structure(cif_file, cif_file)
except Exception as e:
warn(f"{cif_file} : {e}", error=True)
warn(f"{cif_file.split('/')[-1]} : {e}", error=True)
with open(runDir + "/errors.txt", "a") as f:
f.write(f"Exception in nt_3d_centers({cif_file.split('/')[-1]})\n")
f.write(str(e))
f.write("\n\n")
return result
for model in structure:
for chain in model:
......@@ -1205,6 +1250,7 @@ def log_to_pbar(pbar):
pbar.update(1)
return update
@trace_unhandled_exceptions
def process_jobs(joblist):
"""
Starts a Pool to run the Job() objects in joblist.
......@@ -1302,7 +1348,7 @@ if __name__ == "__main__":
os.makedirs(runDir + "/results/figures/GMM/HiRE-RNA/angles/", exist_ok=True)
os.makedirs(runDir + "/results/figures/GMM/HiRE-RNA/torsions/", exist_ok=True)
os.makedirs(runDir + "/results/figures/GMM/HiRE-RNA/basepairs/", exist_ok=True)
elif opt == "rescan-nmodes":
elif opt == "--rescan-nmodes":
RESCAN_GMM_COMP_NUM = True
# Load mappings. famlist will contain only families with structures at this resolution threshold.
......@@ -1388,7 +1434,6 @@ if __name__ == "__main__":
if path.isfile(path_to_3D_data + "datapoints/" + f.split('.')[0]):
joblist.append(Job(function=measure_from_structure, args=(f,), how_many_in_parallel=nworkers)) # All-atom distances
process_jobs(joblist)
# Now process the memory-heavy tasks family by family
......@@ -1412,24 +1457,23 @@ if __name__ == "__main__":
general_stats()
os.makedirs(runDir+"/results/figures/GMM/", exist_ok=True)
os.makedirs(runDir+"/results/geometry/json/", exist_ok=True)
concat_dataframes(runDir + '/results/geometry/all-atoms/distances/', 'dist_atoms.csv')
concat_dataframes(runDir + '/results/geometry/all-atoms/distances/', 'dist_atoms.csv', nworkers)
if DO_HIRE_RNA_MEASURES:
concat_dataframes(runDir + '/results/geometry/HiRE-RNA/distances/', 'distances_HiRERNA.csv')
concat_dataframes(runDir + '/results/geometry/HiRE-RNA/angles/', 'angles_HiRERNA.csv')
concat_dataframes(runDir + '/results/geometry/HiRE-RNA/torsions/', 'torsions_HiRERNA.csv')
concat_dataframes(runDir + '/results/geometry/HiRE-RNA/basepairs/', 'basepairs_HiRERNA.csv')
concat_dataframes(runDir + '/results/geometry/HiRE-RNA/distances/', 'distances_HiRERNA.csv', nworkers)
concat_dataframes(runDir + '/results/geometry/HiRE-RNA/angles/', 'angles_HiRERNA.csv', nworkers)
concat_dataframes(runDir + '/results/geometry/HiRE-RNA/torsions/', 'torsions_HiRERNA.csv', nworkers)
concat_dataframes(runDir + '/results/geometry/HiRE-RNA/basepairs/', 'basepairs_HiRERNA.csv', nworkers)
if DO_WADLEY_ANALYSIS:
concat_dataframes(runDir + '/results/geometry/Pyle/distances/', 'distances_pyle.csv')
concat_dataframes(runDir + '/results/geometry/Pyle/angles/', 'flat_angles_pyle.csv')
concat_dataframes(runDir + '/results/geometry/Pyle/distances/', 'distances_pyle.csv', nworkers)
concat_dataframes(runDir + '/results/geometry/Pyle/angles/', 'flat_angles_pyle.csv', nworkers)
joblist = []
joblist.append(Job(function=gmm_aa_dists, args=(RESCAN_GMM_COMP_NUM)))
joblist.append(Job(function=gmm_aa_torsions, args=(RESCAN_GMM_COMP_NUM)))
joblist.append(Job(function=gmm_aa_dists, args=(RESCAN_GMM_COMP_NUM,)))
joblist.append(Job(function=gmm_aa_torsions, args=(RESCAN_GMM_COMP_NUM, res_thr)))
if DO_HIRE_RNA_MEASURES:
joblist.append(Job(function=gmm_hrna, args=(RESCAN_GMM_COMP_NUM)))
joblist.append(Job(function=gmm_hrna_basepairs, args=(RESCAN_GMM_COMP_NUM)))
joblist.append(Job(function=gmm_hrna, args=(RESCAN_GMM_COMP_NUM,)))
joblist.append(Job(function=gmm_hrna_basepairs, args=(RESCAN_GMM_COMP_NUM,)))
if DO_WADLEY_ANALYSIS:
joblist.append(Job(function=gmm_pyle, args=(RESCAN_GMM_COMP_NUM)))
if len(joblist):
process_jobs(joblist)
joblist.append(Job(function=gmm_pyle, args=(RESCAN_GMM_COMP_NUM, res_thr)))
process_jobs(joblist)
merge_jsons()
......