January 2021 update

Louis BECQUEY
Commit d0371fa5125978b09d3ec35bc244b190c1d8ac7a, 1 parent 3765dbe7
Showing 5 changed files with 45 additions and 4 deletions
CHANGELOG
README.md
RNAnet.py
scripts/automate.sh
statistics.py
--- a/CHANGELOG 0 → 100644
+++ b/CHANGELOG 0 → 100644
+ v 1.1 beta, January 2021
+ 
+ The first uses of RNANet by people outside the development team happened this December.
+ Their feedback helped identify issues and useful information to add.
+ 
+ FEATURE CHANGES
+     - Sequence alignments of the 3D structures mapped to a family are now provided. 
+     - Full alignments with Rfam sequences are not provided, but you can ask us for the files.
+     - Two new fields in table 'family': ali_len and ali_filtered_len.
+     They are the lengths of the multiple sequence alignment with and without the Rfam sequences.
+ 
+ TECHNICAL CHANGES
+     - SQLite connections are now all in WAL mode by default (previously, only the writers used WAL mode).
+ 
+ BUG CORRECTIONS
+     - When an alignment file is updated in a newer run of RNANet, all the remappings are now recomputed
+     for this family. Previously, the remappings were computed only for the newly added sequences,
+     while the alignment actually changed even for chains added in past runs.
+     - Changed the ownership and permissions of files produced by the Docker container. 
+     They were previously owned by root and the user could not get access to them.
+     - Modified nucleotides were not always correctly transformed to N in the alignments (and nucleotide.nt_align_code fields).
+     Now, the alignments and nt_align_code only contain "ACGUN-" characters,
+     where 'N' means 'other' and '-' means 'nothing'.
+ 
+ COMING SOON
+     - Automated annotation of detected Recurrent Interaction Networks (RINs), see http://carnaval.lri.fr/ .
+     - Possibly, automated detection of hairpin loops (HLs) and internal loops (ILs) from the 3D Motif Atlas (BGSU), although their own website already does the job.
+     - A field estimating the quality of the sequence alignment in table family.
+     - Possibly, more metrics about the alignments coming from Infernal.
\ No newline at end of file
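
Note on the "ACGUN-" bug correction above: the changelog does not show how the conversion is done, so the following is only a minimal sketch of one way to get that guarantee. The function name `normalize_alignment_row` is hypothetical, and treating `.` as a gap and upper-casing lowercase residues are assumptions, not behaviour confirmed by this commit.

```python
CANONICAL = frozenset("ACGU")
GAPS = frozenset("-.")  # assumption: '.' is also treated as a gap symbol


def normalize_alignment_row(row):
    """Map every symbol of an aligned sequence onto the alphabet 'ACGUN-'.

    Canonical nucleotides are kept (upper-cased), gap symbols become '-',
    and anything else (e.g. a modified nucleotide code) becomes 'N'.
    """
    out = []
    for ch in row:
        upper = ch.upper()
        if upper in CANONICAL:
            out.append(upper)
        elif ch in GAPS:
            out.append("-")
        else:
            out.append("N")  # 'other': present in the chain, but not A/C/G/U
    return "".join(out)
```

With this convention, 'N' marks a position where a residue exists but is not one of the four canonical bases, while '-' marks a position with no residue at all.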
--- a/README.md
+++ b/README.md
@@ -249,7 +249,9 @@ To help you design your own requests, here follows a description of the database
 * `nb_homologs`: The number of hits known to be homologous downloaded from Rfam to compute nucleotide frequencies
 * `nb_3d_chains`: The number of 3D RNA chains mapped to the family (from Rfam-PDB mappings, or inferred using the redundancy list)
 * `nb_total_homol`: Sum of the two previous fields, the number of sequences in the multiple sequence alignment, used to compute nucleotide frequencies
- * `max_len`: The longest RNA sequence among the homologs (in bases)
+ * `max_len`: The longest RNA sequence among the homologs (in bases, unaligned)
+ * `ali_len`: The length of the family's multiple sequence alignment, Rfam sequences included (in alignment columns)
+ * `ali_filtered_len`: The alignment length after filtering the alignment to keep only the RNANet chains (which have a 3D structure) and removing the gap-only columns
 * `comput_time`: Time required to compute the family's multiple sequence alignment in seconds,
 * `comput_peak_mem`: RAM (or swap) required to compute the family's multiple sequence alignment in megabytes,
 * `idty_percent`: Average identity percentage over pairs of the 3D chains' sequences from the family
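
To make the difference between `ali_len` and `ali_filtered_len` concrete, here is a small sketch in plain Python. It is not the project's code (the commit relies on `esl-reformat --mingap` and Biopython for this); the function name `alignment_lengths` and the toy sequence names are hypothetical.

```python
def alignment_lengths(alignment, chains_3d):
    """Return (ali_len, ali_filtered_len) for a toy alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    chains_3d: names of the sequences that have a 3D structure.
    """
    rows = list(alignment.values())
    ali_len = len(rows[0]) if rows else 0
    if any(len(r) != ali_len for r in rows):
        raise ValueError("all aligned sequences must have the same length")

    # Keep only the 3D chains, then count the columns that are not gap-only.
    kept = [alignment[name] for name in chains_3d if name in alignment]
    ali_filtered_len = sum(
        1 for column in zip(*kept) if any(c != "-" for c in column)
    )
    return ali_len, ali_filtered_len


example = {
    "chainA": "AC--GU",   # has a 3D structure
    "chainB": "A---GU",   # has a 3D structure
    "rfam1":  "ACGUGU",   # Rfam homolog without structure
}
print(alignment_lengths(example, ["chainA", "chainB"]))  # (6, 4)
```

Columns 3 and 4 contain residues only in the Rfam homolog, so they become gap-only once the homolog is removed, and the filtered length drops from 6 to 4.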
--- a/RNAnet.py
+++ b/RNAnet.py
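
The RNAnet.py hunks are not displayed in this view. According to the CHANGELOG, this commit makes all SQLite connections use WAL mode. As an illustration only, a minimal sketch of opening a connection in WAL mode with Python's standard `sqlite3` module could look like this; `connect_wal` and `demo_wal_mode` are hypothetical names, not functions from RNANet.

```python
import os
import sqlite3
import tempfile


def connect_wal(db_path, timeout=10.0):
    """Open a SQLite connection and switch it to write-ahead logging (WAL).

    Returns the connection and the journal mode reported by SQLite.
    """
    conn = sqlite3.connect(db_path, timeout=timeout)
    # The pragma returns the journal mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    return conn, mode


def demo_wal_mode():
    """Create a throwaway database and check that WAL mode is kept on reopen."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "example.db")
        conn, mode = connect_wal(db_path)
        conn.execute("CREATE TABLE family (rfam_acc TEXT PRIMARY KEY);")
        conn.commit()
        conn.close()

        # WAL is recorded in the database file, so a later plain connection
        # reports it without setting the pragma again.
        conn2 = sqlite3.connect(db_path)
        persisted = conn2.execute("PRAGMA journal_mode;").fetchone()[0]
        conn2.close()
    return mode, persisted
```

WAL lets readers proceed while a writer is active, which is presumably why it is now applied to reader connections too. Note that in-memory databases cannot use WAL, hence the temporary file in the demo.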
--- a/scripts/automate.sh
+++ b/scripts/automate.sh
@@ -4,7 +4,7 @@ cd /home/lbecquey/Projects/RNANet
 rm -rf latest_run.log errors.txt
 
 # Run RNANet
- bash -c 'time ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --extract -s --archive' > latest_run.log 2>&1
+ bash -c 'time python3.8 ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --extract -s --archive' > latest_run.log 2>&1
 echo 'Compressing RNANet.db.gz...' >> latest_run.log
 touch results/RNANet.db                                         # update last modification date
 gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db        # compress it
--- a/statistics.py
+++ b/statistics.py
@@ -417,7 +417,10 @@ def parallel_stats_pairs(f):
 def to_id_matrix(f):
     """
     Extracts sequences of 3D chains from the family alignments to a distinct STK file,
-     then runs esl-alipid on it to get an identity matrix
+     then runs esl-alipid on it to get an identity matrix.
+ 
+     Side effect: also produces the 3D-only family alignment as a separate file.
+     So, we use this function to update 'ali_filtered_len' in the family table.
     """
     if path.isfile("data/"+f+".npy"):
         return 0
@@ -442,7 +445,14 @@ def to_id_matrix(f):
     subprocess.run(["esl-reformat", "--informat", "stockholm", "--mingap",              #
                     "-o", path_to_seq_data+f"/realigned/{f}_3d_only.stk",               # This run just deletes columns of gaps
                     "stockholm",  path_to_seq_data+f"/realigned/{f}_3d_only_tmp.stk"])  #
-     subprocess.run(["rm", "-f", f + "_3d_only_tmp.stk"])
+     subprocess.run(["rm", "-f", f + "_3d_only_tmp.stk", f + "_3d_only.stk"])
+     subprocess.run(["esl-reformat", "-o", path_to_seq_data+f"/realigned/{f}_3d_only.afa", "afa", path_to_seq_data+f"/realigned/{f}_3d_only.stk"])
+ 
+     # Side task, outside this function's main purpose: update the database with the length of the filtered alignment
+     align = AlignIO.read(path_to_seq_data+f"/realigned/{f}_3d_only.afa", "fasta")
+     with sqlite3.connect(runDir + "/results/RNANet.db") as conn:
+         sql_execute(conn, """UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;""", many=True, data=(align.get_alignment_length(), f))
+     del align
 
     # Prepare the job
     process = subprocess.Popen(shlex.split(f"esl-alipid --rna --noheader --informat stockholm {path_to_seq_data}realigned/{f}_3d_only.stk"),
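
Review note on the database update added to `to_id_matrix`: the project helper `sql_execute` is not shown in this diff, so its exact behaviour is unknown here. As a hedged illustration, the same update written directly against the standard `sqlite3` module could look like the sketch below. The function names and the two-column `family` table are hypothetical simplifications of the real schema.

```python
import sqlite3


def set_filtered_length(conn, rfam_acc, length):
    """Store the filtered alignment length of one family.

    Returns the number of rows updated (0 if the family is unknown).
    """
    cur = conn.execute(
        "UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;",
        (length, rfam_acc),  # one tuple of parameters for one statement
    )
    conn.commit()
    return cur.rowcount


def demo_update():
    """Run the update on an in-memory table (hypothetical minimal schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE family (rfam_acc TEXT PRIMARY KEY, ali_filtered_len INTEGER);"
    )
    conn.execute("INSERT INTO family (rfam_acc) VALUES ('RF00001');")
    updated = set_filtered_length(conn, "RF00001", 120)
    stored = conn.execute(
        "SELECT ali_filtered_len FROM family WHERE rfam_acc = 'RF00001';"
    ).fetchone()[0]
    conn.close()
    return updated, stored
```

With plain `sqlite3`, `Connection.execute` takes a single tuple of parameters, whereas `Connection.executemany` expects a sequence of such tuples, for example `[(120, "RF00001")]`. Whether `sql_execute(..., many=True, ...)` follows the same convention depends on its implementation.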