Louis BECQUEY

January 2021 update

+v 1.1 beta, January 2021
+
+The first uses of RNANet by people outside the development team took place this December.
+Their feedback helped identify issues and useful information to add.
+
+FEATURE CHANGES
+ - Sequence alignments of the 3D structures mapped to a family are now provided.
+ - Full alignments with Rfam sequences are not provided, but you can ask us for the files.
+ - Two new fields in table 'family': ali_len and ali_filtered_len.
+   They are the MSA lengths of the alignment with and without the Rfam sequences.
+
+TECHNICAL CHANGES
+ - SQLite connections are now all in WAL mode by default (previously, only the writers used WAL mode).
+
+BUG CORRECTIONS
+ - When an alignment file is updated in a newer run of RNANet, all the re_mappings are now re-computed
+   for this family. Previously, the re_mappings were computed only for the newly added sequences,
+   while the alignment actually changed even for chains added in past runs.
+ - Changed the ownership and permissions of files produced by the Docker container.
+   They were previously owned by root and the user could not access them.
+ - Modified nucleotides were not always correctly transformed to N in the alignments (and nucleotide.nt_align_code fields).
+   Now, the alignments and nt_align_code only contain "ACGUN-" chars,
+   where 'N' means 'other' and '-' means 'nothing'.
+
+COMING SOON
+ - Automated annotation of detected Recurrent Interaction Networks (RINs), see http://carnaval.lri.fr/ .
+ - Possibly, automated detection of hairpin loops (HLs) and internal loops (ILs) from the 3D Motif Atlas (BGSU), although their own website already does the job.
+ - A field estimating the quality of the sequence alignment in table family.
+ - Possibly, more metrics about the alignments coming from Infernal.
\ No newline at end of file
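
The WAL change listed under TECHNICAL CHANGES can be illustrated with Python's standard `sqlite3` module. This is a minimal sketch, not RNANet's actual connection code: the helper name `open_wal` and the throwaway database path are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_wal(db_path):
    """Open a SQLite connection and switch it to write-ahead logging (WAL).

    Hypothetical helper for illustration; RNANet's own connection code may differ.
    """
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode actually in effect ("wal" on success).
    mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    return conn, mode


# Usage on a throwaway database file (WAL needs a file, not ":memory:").
tmp_dir = tempfile.mkdtemp()
conn, mode = open_wal(os.path.join(tmp_dir, "example.db"))
print(mode)
conn.close()
```

In WAL mode, readers do not block the writer and the writer does not block readers, which is the usual reason to enable it on every connection.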
@@ -249,7 +249,9 @@ To help you design your own requests, here follows a description of the database
 * `nb_homologs`: The number of hits known to be homologous downloaded from Rfam to compute nucleotide frequencies
 * `nb_3d_chains`: The number of 3D RNA chains mapped to the family (from Rfam-PDB mappings, or inferred using the redundancy list)
 * `nb_total_homol`: Sum of the two previous fields, the number of sequences in the multiple sequence alignment, used to compute nucleotide frequencies
-* `max_len`: The longest RNA sequence among the homologs (in bases)
+* `max_len`: The longest RNA sequence among the homologs (in bases, unaligned)
+* `ali_len`: The aligned sequences length (in bases, aligned)
+* `ali_filtered_len`: The aligned sequences length when we filter the alignment to keep only the RNANet chains (which have a 3D structure) and remove the gap-only columns.
 * `comput_time`: Time required to compute the family's multiple sequence alignment in seconds,
 * `comput_peak_mem`: RAM (or swap) required to compute the family's multiple sequence alignment in megabytes,
 * `idty_percent`: Average identity percentage over pairs of the 3D chains' sequences from the family
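
To make the difference between `ali_len` and `ali_filtered_len` concrete, here is a small sketch in plain Python. It assumes the alignment is held as a dict of equal-length gapped strings and that gaps are written `-` or `.`; the function name `filtered_length`, the sequence identifiers and the toy data are all hypothetical. RNANet itself delegates the removal of gap-only columns to `esl-reformat --mingap`.

```python
def filtered_length(alignment, keep_ids, gap_chars="-."):
    """Return (ali_len, ali_filtered_len) for a toy alignment.

    alignment: dict mapping sequence id -> gapped sequence (all the same length).
    keep_ids:  ids of the sequences to keep (here, chains with a 3D structure).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    ali_len = lengths.pop()

    kept = [alignment[i] for i in keep_ids]
    # Count the columns where at least one kept sequence has a non-gap character.
    ali_filtered_len = sum(
        1 for col in zip(*kept) if any(c not in gap_chars for c in col)
    )
    return ali_len, ali_filtered_len


toy = {
    "rfam_hit_1": "ACGU-ACGU",
    "rfam_hit_2": "AC-UUACGU",
    "chain_A":    "AC---ACG-",
    "chain_B":    "A----ACGU",
}
print(filtered_length(toy, ["chain_A", "chain_B"]))  # prints (9, 6)
```

Columns 3 to 5 are gaps in both kept chains, so they disappear once the Rfam hits are removed: the full alignment has 9 columns, the filtered one has 6.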
@@ -4,7 +4,7 @@ cd /home/lbecquey/Projects/RNANet
 rm -rf latest_run.log errors.txt
 
 # Run RNANet
-bash -c 'time ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --extract -s --archive' > latest_run.log 2>&1
+bash -c 'time python3.8 /RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --extract -s --archive' > latest_run.log 2>&1
 echo 'Compressing RNANet.db.gz...' >> latest_run.log
 touch results/RNANet.db # update last modification date
 gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db # compress it
@@ -417,7 +417,10 @@ def parallel_stats_pairs(f):
 def to_id_matrix(f):
     """
     Extracts sequences of 3D chains from the family alignments to a distinct STK file,
-    then runs esl-alipid on it to get an identity matrix
+    then runs esl-alipid on it to get an identity matrix.
+
+    Side effect: also produces the 3D_only family alignment as a separate file.
+    So, we use this function to update 'ali_filtered_len' in the family table.
     """
     if path.isfile("data/"+f+".npy"):
         return 0
@@ -442,7 +445,14 @@ def to_id_matrix(f):
     subprocess.run(["esl-reformat", "--informat", "stockholm", "--mingap", #
                     "-o", path_to_seq_data+f"/realigned/{f}_3d_only.stk", # This run just deletes columns of gaps
                     "stockholm", path_to_seq_data+f"/realigned/{f}_3d_only_tmp.stk"]) #
-    subprocess.run(["rm", "-f", f + "_3d_only_tmp.stk"])
+    subprocess.run(["rm", "-f", f + "_3d_only_tmp.stk", f + "_3d_only.stk"])
+    subprocess.run(["esl-reformat", "-o", path_to_seq_data+f"/realigned/{f}_3d_only.afa", "afa", path_to_seq_data+f"/realigned/{f}_3d_only.stk"])
+
+    # Out-of-scope task : update the database with the length of the filtered alignment:
+    align = AlignIO.read(path_to_seq_data+f"/realigned/{f}_3d_only.afa", "fasta")
+    with sqlite3.connect(runDir + "/results/RNANet.db") as conn:
+        sql_execute(conn, """UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;""", many=True, data=(align.get_alignment_length(), f))
+    del align
 
     # Prepare the job
     process = subprocess.Popen(shlex.split(f"esl-alipid --rna --noheader --informat stockholm {path_to_seq_data}realigned/{f}_3d_only.stk"),
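
The database update added in the last hunk can be reproduced in isolation with the standard library. The sketch below is not the project's code: it reads the alignment length from aligned-FASTA text with a hand-written parser in place of `AlignIO.read`, and calls `sqlite3` directly in place of the project's `sql_execute` helper, whose signature is not shown in this diff. The function names, the two-column toy table and the sample accession are assumptions.

```python
import sqlite3


def alignment_length(afa_text):
    """Return the number of columns of an aligned-FASTA string."""
    lengths, current = [], None
    for line in afa_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths or len(set(lengths)) != 1:
        raise ValueError("expected at least one sequence, all of equal length")
    return lengths[0]


def update_filtered_len(conn, rfam_acc, length):
    """Store the filtered alignment length for one family (one row, one tuple)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;",
            (length, rfam_acc),
        )


# Toy usage with an in-memory database holding only the two columns we need.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE family (rfam_acc TEXT PRIMARY KEY, ali_filtered_len INTEGER);")
conn.execute("INSERT INTO family (rfam_acc) VALUES ('RF00001');")
afa = ">chain_A\nAC-ACG\n>chain_B\nA--ACG\n"
update_filtered_len(conn, "RF00001", alignment_length(afa))
print(conn.execute("SELECT ali_filtered_len FROM family;").fetchone()[0])  # prints 6
```

Since a single row is updated, one parameter tuple passed to `execute` is enough; `executemany` would instead expect a sequence of such tuples.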