Showing 5 changed files with 45 additions and 4 deletions
CHANGELOG
0 → 100644
1 | +v 1.1 beta, January 2021 | ||
2 | + | ||
3 | +The first uses of RNANet by people from outside the development team happened this past December. | ||
4 | +Their feedback allowed us to identify issues and useful information to add. | ||
5 | + | ||
6 | +FEATURE CHANGES | ||
7 | + - Sequence alignments of the 3D structures mapped to a family are now provided. | ||
8 | + - Full alignments with Rfam sequences are not provided, but you can ask us for the files. | ||
9 | + - Two new fields in table 'family': ali_len and ali_filtered_len. | ||
10 | + They are the MSA lengths of the alignment with and without the Rfam sequences. | ||
11 | + | ||
12 | +TECHNICAL CHANGES | ||
13 | + - SQLite connections are now all in WAL mode by default (previously, only the writers used WAL mode). | ||
14 | + | ||
15 | +BUG CORRECTIONS | ||
16 | + - When an alignment file is updated in a newer run of RNANet, all the remappings are now re-computed | ||
17 | + for this family. Previously, the remappings were computed only for the newly added sequences, | ||
18 | + while the alignment actually changed even for chains added in past runs. | ||
19 | + - Changed the ownership and permissions of files produced by the Docker container. | ||
20 | + They were previously owned by root and the user could not get access to them. | ||
21 | + - Modified nucleotides were not always correctly transformed to N in the alignments (and nucleotide.nt_align_code fields). | ||
22 | + Now, the alignments and nt_align_code only contain "ACGUN-" chars. | ||
23 | + Now, 'N' means 'other', while '-' means 'nothing'. | ||
24 | + | ||
25 | +COMING SOON | ||
26 | + - Automated annotation of detected Recurrent Interaction Networks (RINs), see http://carnaval.lri.fr/ . | ||
27 | + - Possibly, automated detection of HLs and ILs from the 3D Motif Atlas (BGSU). Maybe. Their own website already does the job. | ||
28 | + - A field estimating the quality of the sequence alignment in table family. | ||
29 | + - Possibly, more metrics about the alignments coming from Infernal. | ||
... | \ No newline at end of file | ... | \ No newline at end of file |
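Note on the `nt_align_code` bug correction above: the rule that alignments only contain "ACGUN-" characters can be sketched as follows. This is a minimal illustration, not the code from this commit; the function name `normalize_aligned` is hypothetical, and the sketch assumes that input is upper-cased first and that `-` is the only gap character.

```python
# Hypothetical helper, not taken from RNANet: map every character that is not
# a canonical nucleotide or a gap to 'N' ("other"), keeping '-' ("nothing").
ALLOWED = frozenset("ACGU-")

def normalize_aligned(seq):
    """Return seq restricted to the alphabet "ACGUN-".

    Assumption: lower-case letters are upper-cased before the check, so a
    modified or unknown residue becomes 'N' and a gap stays '-'.
    """
    return "".join(c if c in ALLOWED else "N" for c in seq.upper())

example = normalize_aligned("AC5g-U")
print(example)
```

Under these assumptions a DNA `T` would also become `N`; whether the project treats it that way is not stated in the changelog.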
... | @@ -249,7 +249,9 @@ To help you design your own requests, here follows a description of the database | ... | @@ -249,7 +249,9 @@ To help you design your own requests, here follows a description of the database |
249 | * `nb_homologs`: The number of hits known to be homologous downloaded from Rfam to compute nucleotide frequencies | 249 | * `nb_homologs`: The number of hits known to be homologous downloaded from Rfam to compute nucleotide frequencies |
250 | * `nb_3d_chains`: The number of 3D RNA chains mapped to the family (from Rfam-PDB mappings, or inferred using the redundancy list) | 250 | * `nb_3d_chains`: The number of 3D RNA chains mapped to the family (from Rfam-PDB mappings, or inferred using the redundancy list) |
251 | * `nb_total_homol`: Sum of the two previous fields, the number of sequences in the multiple sequence alignment, used to compute nucleotide frequencies | 251 | * `nb_total_homol`: Sum of the two previous fields, the number of sequences in the multiple sequence alignment, used to compute nucleotide frequencies |
252 | -* `max_len`: The longest RNA sequence among the homologs (in bases) | 252 | +* `max_len`: The longest RNA sequence among the homologs (in bases, unaligned) |
253 | +* `ali_len`: The aligned sequences length (in bases, aligned) | ||
254 | +* `ali_filtered_len`: The aligned sequences length when we filter the alignment to keep only the RNANet chains (which have a 3D structure) and remove the gap-only columns. | ||
253 | * `comput_time`: Time required to compute the family's multiple sequence alignment in seconds, | 255 | * `comput_time`: Time required to compute the family's multiple sequence alignment in seconds, |
254 | * `comput_peak_mem`: RAM (or swap) required to compute the family's multiple sequence alignment in megabytes, | 256 | * `comput_peak_mem`: RAM (or swap) required to compute the family's multiple sequence alignment in megabytes, |
255 | * `idty_percent`: Average identity percentage over pairs of the 3D chains' sequences from the family | 257 | * `idty_percent`: Average identity percentage over pairs of the 3D chains' sequences from the family | ... | ... |
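To make the difference between `ali_len` and `ali_filtered_len` concrete, here is a small self-contained sketch. It is an illustration under stated assumptions, not the project's implementation (the commit itself relies on `esl-reformat --mingap`): the sequence names, the `chain_3d` prefix and the helper `filtered_alignment_length` are all hypothetical.

```python
def filtered_alignment_length(aligned_seqs, gap_chars="-."):
    """Count the alignment columns that hold at least one non-gap character."""
    if not aligned_seqs:
        return 0
    width = len(aligned_seqs[0])
    if any(len(s) != width for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*rows) iterates over the columns of the alignment.
    return sum(
        1 for column in zip(*aligned_seqs)
        if any(c not in gap_chars for c in column)
    )

# Toy alignment: one Rfam hit and two chains with a 3D structure (made-up data).
full_alignment = {
    "rfam_hit_1": "AC-GU-A",
    "chain_3d_a": "A--GU--",
    "chain_3d_b": "A--G---",
}

# ali_len: number of columns of the full alignment.
ali_len = len(full_alignment["rfam_hit_1"])

# ali_filtered_len: keep only the 3D chains, then drop the gap-only columns.
kept = [s for name, s in full_alignment.items() if name.startswith("chain_3d")]
ali_filtered_len = filtered_alignment_length(kept)

print(ali_len, ali_filtered_len)
```

In this toy case the full alignment has 7 columns, and only 3 of them remain once the Rfam hit is removed and the columns that became gap-only are dropped.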
... | @@ -4,7 +4,7 @@ cd /home/lbecquey/Projects/RNANet | ... | @@ -4,7 +4,7 @@ cd /home/lbecquey/Projects/RNANet |
4 | rm -rf latest_run.log errors.txt | 4 | rm -rf latest_run.log errors.txt |
5 | 5 | ||
6 | # Run RNANet | 6 | # Run RNANet |
7 | -bash -c 'time ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --extract -s --archive' > latest_run.log 2>&1 | 7 | +bash -c 'time python3.8 /RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --extract -s --archive' > latest_run.log 2>&1 |
8 | echo 'Compressing RNANet.db.gz...' >> latest_run.log | 8 | echo 'Compressing RNANet.db.gz...' >> latest_run.log |
9 | touch results/RNANet.db # update last modification date | 9 | touch results/RNANet.db # update last modification date |
10 | gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db # compress it | 10 | gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db # compress it | ... | ... |
... | @@ -417,7 +417,10 @@ def parallel_stats_pairs(f): | ... | @@ -417,7 +417,10 @@ def parallel_stats_pairs(f): |
417 | def to_id_matrix(f): | 417 | def to_id_matrix(f): |
418 | """ | 418 | """ |
419 | Extracts sequences of 3D chains from the family alignments to a distinct STK file, | 419 | Extracts sequences of 3D chains from the family alignments to a distinct STK file, |
420 | - then runs esl-alipid on it to get an identity matrix | 420 | + then runs esl-alipid on it to get an identity matrix. |
421 | + | ||
422 | + Side-effect : also produces the 3D_only family alignment as a separate file. | ||
423 | + So, we use this function to update 'ali_filtered_len' in the family table. | ||
421 | """ | 424 | """ |
422 | if path.isfile("data/"+f+".npy"): | 425 | if path.isfile("data/"+f+".npy"): |
423 | return 0 | 426 | return 0 |
... | @@ -442,7 +445,14 @@ def to_id_matrix(f): | ... | @@ -442,7 +445,14 @@ def to_id_matrix(f): |
442 | subprocess.run(["esl-reformat", "--informat", "stockholm", "--mingap", # | 445 | subprocess.run(["esl-reformat", "--informat", "stockholm", "--mingap", # |
443 | "-o", path_to_seq_data+f"/realigned/{f}_3d_only.stk", # This run just deletes columns of gaps | 446 | "-o", path_to_seq_data+f"/realigned/{f}_3d_only.stk", # This run just deletes columns of gaps |
444 | "stockholm", path_to_seq_data+f"/realigned/{f}_3d_only_tmp.stk"]) # | 447 | "stockholm", path_to_seq_data+f"/realigned/{f}_3d_only_tmp.stk"]) # |
445 | - subprocess.run(["rm", "-f", f + "_3d_only_tmp.stk"]) | 448 | + subprocess.run(["rm", "-f", f + "_3d_only_tmp.stk", f + "_3d_only.stk"]) |
449 | + subprocess.run(["esl-reformat", "-o", path_to_seq_data+f"/realigned/{f}_3d_only.afa", "afa", path_to_seq_data+f"/realigned/{f}_3d_only.stk"]) | ||
450 | + | ||
451 | + # Out-of-scope task : update the database with the length of the filtered alignment: | ||
452 | + align = AlignIO.read(path_to_seq_data+f"/realigned/{f}_3d_only.afa", "fasta") | ||
453 | + with sqlite3.connect(runDir + "/results/RNANet.db") as conn: | ||
454 | + sql_execute(conn, """UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;""", many=True, data=(align.get_alignment_length(), f)) | ||
455 | + del align | ||
446 | 456 | ||
447 | # Prepare the job | 457 | # Prepare the job |
448 | process = subprocess.Popen(shlex.split(f"esl-alipid --rna --noheader --informat stockholm {path_to_seq_data}realigned/{f}_3d_only.stk"), | 458 | process = subprocess.Popen(shlex.split(f"esl-alipid --rna --noheader --informat stockholm {path_to_seq_data}realigned/{f}_3d_only.stk"), | ... | ... |
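Note on the database update in the hunk above: the same statement can be sketched with the standard `sqlite3` module alone. This is a hedged illustration, not the project's code: `sql_execute` is a helper whose body is not shown in this diff, the reduced `family` schema below is an assumption, and `update_filtered_length` is a hypothetical name.

```python
import sqlite3

def update_filtered_length(conn, rfam_acc, ali_filtered_len):
    """Store the filtered alignment length of one family; return the row count."""
    cur = conn.execute(
        "UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;",
        (ali_filtered_len, rfam_acc),  # one parameter tuple for one statement
    )
    conn.commit()
    return cur.rowcount

# In-memory database with a reduced, assumed version of the 'family' table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE family ("
    "rfam_acc TEXT PRIMARY KEY, ali_len INTEGER, ali_filtered_len INTEGER);"
)
conn.execute("INSERT INTO family (rfam_acc, ali_len) VALUES ('RF00001', 120);")

updated = update_filtered_length(conn, "RF00001", 87)
stored = conn.execute(
    "SELECT ali_filtered_len FROM family WHERE rfam_acc = ?;", ("RF00001",)
).fetchone()[0]
print(updated, stored)
conn.close()
```

One point worth checking in review: `sqlite3`'s `executemany` expects a sequence of parameter tuples, such as `[(length, rfam_acc)]`. If `sql_execute(..., many=True, ...)` forwards `data` to `executemany`, a flat `(length, rfam_acc)` tuple would not be accepted; whether it does so cannot be confirmed from this diff.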