Aglaé Tabot joins the development team. Khodor Hannoush leaves.
- Distinct options --cmalign-opts and --cmalign-rrna-opts allow to adapt the parameters for LSU and SSU families. The LSU and SSU are now aligned with Infernal options '--cpu 10 --mxsize 8192 --mxtau 0.1', which is slow, requires up to 100 GB of RAM, and yields a suboptimal alignment (tau=0.1 is quite bad), but is homogenous with the other families.
- The LSU and SSU therefore have defined cm_coords fields, and therefore distance matrices can be computed.
- Distances matrices are computed on all availables molecules of the family by default, but you can use statistics.py --non-redundant to only select the equivalence class representatives at a given resolution into account (new option). For storage reasons, rRNAs are always run in this mode (but this might change in the future : space required is 'only' ~300 GB).
- We now provide for download the renumbered (standardized) 3D MMCIF files, the nucleotides being numbered by their "index_chain" in the database.
- We now provide for download the sequences of the 3D chains aligned by Rfam family (without Rfam sequences, which have been removed).
- statistics.py now computes histograms and a density estimation with Gaussian mixture models for a large set of geometric parameters, measured on the unmapped data at a given resolution threshold.
The parameters include:
- All atom bonded distances and torsion angles
- Distances, flat angles and torsion angles in the Pyle/VFold model
- Distances, flat angles and torsion anfles in the HiRE-RNA model
- Sequence-dependant geometric parameters of the basepairs for all non-canonical basepairs in the HiRE-RNA model.
The data is saved as JSON files of parameters, and numerous figures are produced to illustrate the distributions. The number of gaussians to use in the GMMs are hard-coded in geometric_stats.py after our first estimation. If you do not want to trust this estimation, you can ignore it with option --rescan-nmodes. An exploration of the number of Gaussians from 1 to 8 will be performed, and the best GMM will be kept.
- Many small fixes, leading to the support of many previously "known issues"
- Performance tweaks
- New code file geometric_stats.py
- New automation script that starts from scratch
- Switched to DSSR Pro.
- Switched to esl-alimerge instead of cmalign --merge to merge alignments.
- Tested successfully with Python 3.9.6 + BioPython 1.79. However, the production server still runs with Python 3.8.1 + BioPython 1.78.
- New option
--stats-opts="..."allows to pass options to the automatic run of statistics.py (when
- Removed support for 3'->5' Rfam hits, they are now completely ignored. They concern the opposite strand, which is unresolved in 3D.
- Removed PyDCA, which is outdated and introduces dependencies conflicts. Code may be adapted later.
- A new column in align_column table,
index_small_ali, gives the index of the nucléotide in the "3d only" alignment.
- 3D distance matrices are now computed only for match positions (match to the covariance model)
- Corrected a bug which skipped angle conversions from degrees (DSSR) to radians if nucleotides where renumbered.
Khodor Hannoush joins the development of RNANet.
- SINA is now used only if you pass the option
--sina, Infernal is used by default even for rRNAs.
- A new option
--cmalign-opts="..."allows to customize your cmalign runs, e.g. with
--cyk. The default is no option.
- RNANet makes use of PyDCA to compute DCA-related features on the alignments (descriptions to come in the Database.md)
- statistics.py now fully supports the computation of 3D distance matrices, with average and standard deviation by RNA family
- Now RNANet considers only the equivalence class representative structure by default. To consider all members of an equivalence class (like before), use the
- cmalign is not run with
--cykanymore by default, and now requires huge amounts of RAM if launched with the default options.
- Moving to a 60-core/128GB server for our internal runs.
- New option
The first uses of RNAnet by people from outside the development team happened last December. A few feedback allowed to identify issues and useful information to add.
- Sequence alignments of the 3D structures mapped to a family are now provided.
- Full alignments with Rfam sequences are not provided, but you can ask us for the files.
- Two new fields in table 'family': ali_length and ali_filtered_length. They are the MSA lengths of the alignment with and without the Rfam sequences.
- Gap replacement by consensus (--fill-gaps) has been removed. Now, the gap percentage and consensus are saved in the align_column table and the datapoints in CSV format, in separate columns. Consensus is one of ACGUN-, the gap being chosen if >75% of the sequences are gaps at this position. Otherwise, A/C/G/U is chosen if >50% of the non-gap positions are A/C/G/U. Otherwise, N is the consensus.
- SQLite connections are now all in WAL mode by default (previously, only the writers used WAL mode, but this is useless)
- Moved to Python3.9 for internal testing.
- Latest version of BioPython is now supported (1.78)
- When an alignment file is updated in a newer run of RNANet, all the re_mappings are now re-computed for this family. Previously, the remappings were computed only for the newly added sequences, while the alignment actually changed even for chains added in past runs.
- Changed the ownership and permissions of files produced by the Docker container. They were previously owned by root and the user could not get access to them.
- Modified nucleotides were not always correctly transformed to N in the alignments (and nucleotide.nt_align_code fields). Now, the alignments and nt_align_code (and consensus) only contain "ACGUN-" chars. Now, 'N' means 'other' or 'any', while '-' means 'nothing' or 'unknown'.