Louis BECQUEY

January 2021 update

v 1.1 beta, January 2021
The first uses of RNANet by people outside the development team happened this December.
Their feedback helped identify issues and useful information to add.
FEATURE CHANGES
- Sequence alignments of the 3D structures mapped to a family are now provided.
- Full alignments with Rfam sequences are not provided, but you can ask us for the files.
- Two new fields in table 'family': ali_len and ali_filtered_len.
They are the lengths of the multiple sequence alignment with and without the Rfam sequences, respectively.
TECHNICAL CHANGES
- SQLite connections are now all in WAL mode by default (previously, only the writers used WAL mode)
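
As an illustration of the WAL change, here is a minimal sketch of opening a SQLite connection in WAL mode with Python's standard `sqlite3` module. The helper name `connect_wal` and the throwaway database path are assumptions made for this example; this is not RNANet's own connection code.

```python
import os
import sqlite3
import tempfile

def connect_wal(db_path):
    """Open a SQLite connection and ask for write-ahead logging (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    # The PRAGMA returns the journal mode that is actually in effect afterwards
    mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    return conn, mode

# Demonstration on a throwaway on-disk database (WAL does not apply to in-memory databases)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn, mode = connect_wal(demo_path)
print(mode)
conn.close()
```

The journal mode is stored in the database file, so once one connection has switched it to WAL, later connections to the same file also use WAL.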
BUG CORRECTIONS
- When an alignment file is updated in a newer run of RNANet, all the remappings are now recomputed
for this family. Previously, the remappings were computed only for the newly added sequences,
even though the alignment also changes for chains added in past runs.
- Changed the ownership and permissions of files produced by the Docker container.
They were previously owned by root and the user could not access them.
- Modified nucleotides were not always correctly transformed to N in the alignments (and nucleotide.nt_align_code fields).
Now, the alignments and nt_align_code only contain "ACGUN-" characters,
where 'N' means 'other' and '-' means 'nothing'.
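
The "ACGUN-" convention can be sketched as follows. This is only an illustration under stated assumptions, not RNANet's actual implementation: the function name `normalize_aligned_seq` is hypothetical, lowercase letters are uppercased, '.' is treated as a gap, and every other character becomes 'N'.

```python
def normalize_aligned_seq(seq):
    """Map an aligned sequence onto the alphabet "ACGUN-" (hypothetical helper)."""
    canonical = set("ACGU")
    out = []
    for ch in seq.upper():
        if ch in canonical:
            out.append(ch)    # standard nucleotide, kept as-is
        elif ch in "-.":
            out.append("-")   # gap: 'nothing' at this position
        else:
            out.append("N")   # anything else (e.g. a modified nucleotide): 'other'
    return "".join(out)

print(normalize_aligned_seq("acgu-.PXT"))
```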
COMING SOON
- Automated annotation of detected Recurrent Interaction Networks (RINs), see http://carnaval.lri.fr/ .
- Possibly, automated detection of hairpin loops (HLs) and internal loops (ILs) from the RNA 3D Motif Atlas (BGSU), although their own website already does the job.
- A field estimating the quality of the sequence alignment in table family.
- Possibly, more metrics about the alignments coming from Infernal.
\ No newline at end of file
......@@ -249,7 +249,9 @@ To help you design your own requests, here follows a description of the database
* `nb_homologs`: The number of hits known to be homologous downloaded from Rfam to compute nucleotide frequencies
* `nb_3d_chains`: The number of 3D RNA chains mapped to the family (from Rfam-PDB mappings, or inferred using the redundancy list)
* `nb_total_homol`: Sum of the two previous fields, the number of sequences in the multiple sequence alignment, used to compute nucleotide frequencies
* `max_len`: The longest RNA sequence among the homologs (in bases)
* `max_len`: The longest RNA sequence among the homologs (in bases, unaligned)
* `ali_len`: The length of the aligned sequences (in bases, aligned)
* `ali_filtered_len`: The aligned sequences length when we filter the alignment to keep only the RNANet chains (which have a 3D structure) and remove the gap-only columns.
* `comput_time`: Time required to compute the family's multiple sequence alignment in seconds,
* `comput_peak_mem`: RAM (or swap) required to compute the family's multiple sequence alignment in megabytes,
* `idty_percent`: Average identity percentage over pairs of the 3D chains' sequences from the family
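
To show how the two new length fields might be used in a request, here is a small sketch that runs against an in-memory table. The column names come from the description above, but the reduced table layout, the accessions `RF_DEMO1` and `RF_DEMO2`, and all values are made-up placeholders, not real RNANet data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of the 'family' table (only a few of its columns)
conn.execute("""CREATE TABLE family (
    rfam_acc TEXT PRIMARY KEY,
    nb_3d_chains INTEGER,
    max_len INTEGER,
    ali_len INTEGER,
    ali_filtered_len INTEGER);""")
conn.executemany("INSERT INTO family VALUES (?, ?, ?, ?, ?);",
                 [("RF_DEMO1", 12, 120, 300, 150),   # placeholder values
                  ("RF_DEMO2", 3, 80, 200, 90)])

# Fraction of alignment columns that remain once only the 3D chains are kept
rows = conn.execute("""SELECT rfam_acc, ali_len, ali_filtered_len,
                              1.0 * ali_filtered_len / ali_len AS kept_fraction
                       FROM family
                       ORDER BY kept_fraction DESC;""").fetchall()
for row in rows:
    print(row)
conn.close()
```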
......
......@@ -4,7 +4,7 @@ cd /home/lbecquey/Projects/RNANet
rm -rf latest_run.log errors.txt
# Run RNANet
bash -c 'time ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --extract -s --archive' > latest_run.log 2>&1
bash -c 'time python3.8 /RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ -r 20.0 --extract -s --archive' > latest_run.log 2>&1
echo 'Compressing RNANet.db.gz...' >> latest_run.log
touch results/RNANet.db # update last modification date
gzip -k /home/lbecquey/Projects/RNANet/results/RNANet.db # compress it
......
......@@ -417,7 +417,10 @@ def parallel_stats_pairs(f):
def to_id_matrix(f):
"""
Extracts sequences of 3D chains from the family alignments to a distinct STK file,
then runs esl-alipid on it to get an identity matrix
then runs esl-alipid on it to get an identity matrix.
Side effect: also produces the 3D_only family alignment as a separate file.
So, we use this function to update 'ali_filtered_len' in the family table.
"""
if path.isfile("data/"+f+".npy"):
return 0
......@@ -442,7 +445,14 @@ def to_id_matrix(f):
subprocess.run(["esl-reformat", "--informat", "stockholm", "--mingap", #
"-o", path_to_seq_data+f"/realigned/{f}_3d_only.stk", # This run just deletes columns of gaps
"stockholm", path_to_seq_data+f"/realigned/{f}_3d_only_tmp.stk"]) #
subprocess.run(["rm", "-f", f + "_3d_only_tmp.stk"])
subprocess.run(["rm", "-f", f + "_3d_only_tmp.stk", f + "_3d_only.stk"])
subprocess.run(["esl-reformat", "-o", path_to_seq_data+f"/realigned/{f}_3d_only.afa", "afa", path_to_seq_data+f"/realigned/{f}_3d_only.stk"])
# Out-of-scope task : update the database with the length of the filtered alignment:
align = AlignIO.read(path_to_seq_data+f"/realigned/{f}_3d_only.afa", "fasta")
with sqlite3.connect(runDir + "/results/RNANet.db") as conn:
sql_execute(conn, """UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;""", many=True, data=(align.get_alignment_length(), f))
del align
# Prepare the job
process = subprocess.Popen(shlex.split(f"esl-alipid --rna --noheader --informat stockholm {path_to_seq_data}realigned/{f}_3d_only.stk"),
......
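
The database update in the hunk above can be sketched in isolation as follows. This is a self-contained illustration, not the project's code: a small hand-written parser stands in for `AlignIO.read(...).get_alignment_length()`, plain `sqlite3` calls stand in for the project's `sql_execute` helper (whose definition is not shown here), and the function name `aligned_fasta_length`, the accession `RF_DEMO`, and the sequences are hypothetical.

```python
import io
import sqlite3

def aligned_fasta_length(handle):
    """Return the number of columns of an aligned FASTA, checking that all rows agree."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)   # close the previous record
            current = 0
        elif line and current is not None:
            current += len(line)          # sequences may span several lines
    if current is not None:
        lengths.append(current)
    if not lengths or len(set(lengths)) != 1:
        raise ValueError("empty file or rows of different lengths")
    return lengths[0]

# Made-up two-chain alignment, each sequence wrapped over two lines
afa = io.StringIO(">chain_a\nACGU-N\nAC\n>chain_b\nA-GUUN\n-C\n")
ali_filtered_len = aligned_fasta_length(afa)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE family (rfam_acc TEXT PRIMARY KEY, ali_filtered_len INTEGER);")
conn.execute("INSERT INTO family (rfam_acc) VALUES (?);", ("RF_DEMO",))
# One row to update: a single parameter tuple passed to execute()
conn.execute("UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;",
             (ali_filtered_len, "RF_DEMO"))
stored = conn.execute("SELECT ali_filtered_len FROM family WHERE rfam_acc = ?;",
                      ("RF_DEMO",)).fetchone()[0]
print(stored)
conn.close()
```

With the standard `sqlite3` module, `executemany` expects a sequence of parameter tuples, so a single-row update is passed to `execute` in this sketch.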