Louis BECQUEY

Doc update

......@@ -18,15 +18,13 @@ Contents:
Additional relevant references:
The "ProteinNet" philosophy which inspired this work:
* AlQuraishi, M. (2019b). **ProteinNet: A standardized data set for machine learning of protein structure.** *BMC Bioinformatics*, 20(1), 311
If you use our annotations by DSSR, you might want to cite:
* Lu, X.-J. et al. (2015). **DSSR: An integrated software tool for dissecting the spatial structure of RNA.** *Nucleic Acids Research*, 43(21), e142.
If you use our multiple sequence alignments and homology data, you might want to cite:
* Pruesse, E. et al. (2012). **SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes.** *Bioinformatics*, 28(14), 1823–1829.
* Nawrocki, E. P. and Eddy, S. R. (2013). **Infernal 1.1: 100-fold faster RNA homology searches.** *Bioinformatics*, 29(22), 2933–2935.
# What is RNANet ?
......@@ -39,7 +37,8 @@ Most interestingly, nucleotides have been renumered in a standardized way, and t
## Methodology
We use the Rfam mappings between 3D structures and known Rfam families, using the sequences that are known to belong to an Rfam family (hits provided in RF0XXXX.fasta files from Rfam).
Future versions might compute a real MSA-based clustering directly with Rfamseq ncRNA sequences, like ProteinNet does with protein sequences, but this requires a tool similar to jackHMMER in the Infernal software suite, which is not available yet.
If you are interested in such approaches, you may check tools like RNAlien.
This script prepares the dataset from available public data in PDB, RNA 3D Hub, Rfam and SILVA.
......@@ -48,15 +47,16 @@ This script prepares the dataset from available public data in PDB, RNA 3D Hub,
The script follows these steps:
To gather structures:
* Gets a list of 3D structures containing RNA from BGSU's non-redundant list (but keeps the redundant structures /!\\),
* Gets a list of 3D structures containing RNA from BGSU's non-redundant list (redundancy can be kept or eliminated, see command line option `--redundant`),
* Asks Rfam for mappings of these structures onto Rfam families (~50% of structures have a direct mapping, some more are inferred using the redundancy list)
* Downloads the corresponding 3D structures (mmCIFs)
* If desired, extracts the right chain portions that map onto an Rfam family to a separate mmCIF file
* Standardizes the residue numbering from 1 to N, including missing residues (gaps)
* If desired, extracts the renumbered chain portions that map onto an Rfam family to a separate mmCIF file
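The renumbering step can be illustrated with a minimal Python sketch. It is not RNANet's actual code: the function name `renumber` and its input format (a dictionary of resolved residues keyed by their original number) are assumptions made for this example, and real structures raise extra difficulties (insertion codes, non-monotonic numbering) that are ignored here.

```python
def renumber(resolved, start, end):
    """Renumber a chain portion from 1 to N, keeping gaps.

    resolved: dict {original_residue_number: nucleotide} for the residues
              that are actually present in the 3D structure (hypothetical format).
    start, end: original numbers of the first and last residue of the portion.
    Returns a list of (new_index, original_number, nucleotide_or_gap).
    """
    rows = []
    for new_index, old_number in enumerate(range(start, end + 1), start=1):
        # Residues missing from the structure are kept as gaps ('-')
        rows.append((new_index, old_number, resolved.get(old_number, "-")))
    return rows


# Toy chain portion 5-9 in which residue 7 is not resolved
chain = {5: "G", 6: "C", 8: "A", 9: "U"}
rows = renumber(chain, 5, 9)
for row in rows:
    print(row)
```

The missing residue keeps its own index, so positions stay comparable between chains of the same family.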
To compute homology information:
* Extract the sequence for every 3D chain
* Extracts the sequence of every 3D chain
* Downloads Rfamseq ncRNA sequence hits for the concerned Rfam families (or ARB databases of SSU or LSU sequences from SILVA for rRNAs)
* Realigns Rfamseq hits and sequences from the 3D structures together to obtain a multiple sequence alignment for each Rfam family (using `cmalign --cyk`, except for ribosomal LSU and SSU, where SINA is used)
* Realigns Rfamseq hits and sequences from the 3D structures together to obtain a multiple sequence alignment for each Rfam family (using `cmalign`, but SINA can be used for ribosomal LSU and SSU)
* Computes nucleotide frequencies at every position for each alignment
* Maps each nucleotide of a 3D chain to its position in the corresponding family sequence alignment
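The last two steps of this list can be sketched in a few lines of Python. This is only an illustration on a toy alignment, not the code used by RNANet: the function names, the `ACGU-` alphabet and the choice to count gaps as a symbol are assumptions.

```python
from collections import Counter


def column_frequencies(alignment, alphabet="ACGU-"):
    """Return, for each alignment column, the frequency of each symbol."""
    n_seqs = len(alignment)
    freqs = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        freqs.append({symbol: counts[symbol] / n_seqs for symbol in alphabet})
    return freqs


def map_to_columns(aligned_seq):
    """Map each nucleotide of an aligned sequence to its 1-based column."""
    return [col for col, symbol in enumerate(aligned_seq, start=1) if symbol != "-"]


# Toy alignment: the first row plays the role of a chain from a 3D structure
alignment = ["GC-AU", "GCGAU", "GU-AU", "AC-AU"]
freqs = column_frequencies(alignment)
positions = map_to_columns(alignment[0])
print(freqs[0])
print(positions)
```

Each nucleotide of the 3D chain can then be annotated with the frequencies of the column it maps to.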
......@@ -65,6 +65,15 @@ To compute 3D annotations:
Finally, export this data from the SQLite database into flat CSV files.
Statistical analysis of the structures:
* Computes statistics about the amount of data from various resolutions and experimental methods (by RNA family)
* Computes basic statistics about the frequency of (modified) nucleotides by chain and by family,
* Computes basic statistics about the frequencies of non-canonical interactions,
* Computes density estimations (using Gaussian mixtures) for various geometrical parameters like distances and torsion angles for different representations: all-atom, the Pyle/VFold model, and the HiRE-RNA model,
* Computes pairwise residue distance matrices for each chain, and their average and standard deviation by RNA family
* Computes sequence identity matrices for each RNA family (based on the alignments)
* Saves covariance models (Infernal .cm files) for each RNA family
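As an illustration of one of these statistics, here is a minimal Python sketch of a sequence identity matrix computed from an alignment. The definition of identity used here (identical positions divided by the number of columns where neither sequence has a gap) is an assumption; the statistics script may use a different convention.

```python
def pairwise_identity(seq_a, seq_b):
    """Fraction of identical symbols over columns where neither sequence has a gap."""
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # columns with a gap in either sequence are skipped
        compared += 1
        if a == b:
            identical += 1
    return identical / compared if compared else 0.0


def identity_matrix(alignment):
    """Square matrix of pairwise identities between aligned sequences."""
    return [[pairwise_identity(a, b) for b in alignment] for a in alignment]


# Toy alignment of three sequences
alignment = ["GC-AU", "GCGAU", "GU-AU"]
matrix = identity_matrix(alignment)
for row in matrix:
    print(row)
```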
## Data provided
We provide a couple of resources to exploit this dataset. You can download them from [EvryRNA](https://evryrna.ibisc.univ-evry.fr/evryrna/rnanet/rnanet_home).
......
......@@ -7,12 +7,22 @@
* [Post-computation tasks](#post-computation-tasks-estimate-quality)
* [Output files](#output-files)
# Required computational resources
- CPU: no requirements. The program is optimized for multi-core CPUs, you might want to use Intel Xeons, AMD Ryzens, etc.
- GPU: not required
- RAM: 16 GB with a large swap partition is okay. 32 GB is recommended (usage peaks at ~27 GB, but this number depends on your number of CPU cores)
- Storage: to date, it takes 60 GB for the 3D data (36 GB if you don't use the --extract option), 11 GB for the sequence data, and 7 GB for the outputs (5.6 GB database, 1 GB archive of CSV files). You need to add a few more for the dependencies. Pick a 100 GB partition and you are good to go. The computation is much faster if you use a fast storage device (e.g. an SSD instead of a hard drive, or even better, an NVMe SSD) because of constant I/O with the SQLite database.
- Network: We query the Rfam public MySQL server on port 4497. Make sure your network allows this communication (there should not be any issue on private networks, but your company or university may close ports by default). You will get an error message if the port is not open. Around 30 GB of data is downloaded.
# Required hardware resources
- **CPU**: The program is optimized for multi-core CPUs: the more cores you have, the faster the computation. Make sure you have enough RAM to keep up.
- **GPU**: not required.
- **RAM**: This depends on the usage.
    - In regular mode, the first computation of alignments requires about 100 GB. If you do not have that much, you can:
        - either use SINA (--sina) instead of Infernal to align the rRNAs. However, no information related to covariance models will be available for them (distance matrices, 3D-only alignments...)
        - or customize options --cmalign-opts and --cmalign-rrna-opts with the cmalign arguments --cpu (number of cores to use) and --mxsize (max memory to allocate per core), so that it fits your machine. In very hard cases, also increase the parameter --maxtau from 0.05 to 0.1, but this reduces the quality of the alignments.
    - In regular "update" mode, when the alignments already exist, less RAM is required: 64 GB should be fine. If not, use the same options as the first time for your update runs.
    - In 'no homology' mode, just for annotation of the structures without mapping to families, each core can peak at ~3 GB (but not all at the same time if you are lucky). Use option --maxcores to reduce the number of cores if you do not have enough RAM. 32 GB is fine in most cases.
- **Storage**: to date, it takes 60 GB for the 3D data (36 GB if you don't use the --extract option), 11 GB for the sequence data, and 7 GB for the outputs (5.6 GB database, 1 GB archive of CSV files). You need to add a few more for the dependencies. If you compute geometry statistics and parameter distributions, you need to count 80 GB more (permanent) and 100 GB more (temporary, deleted at the end of the run). So, pick a 500 GB partition and you are good to go. The computation is much faster if you use a fast storage device (e.g. an SSD instead of a hard drive, or even better, an NVMe M.2 drive) because of constant I/O with the SQLite database.
- **Network**: We query the Rfam public MySQL server on port 4497. Make sure your network allows this communication (there should not be any issue on private networks, but your university may close ports by default). You will get an error message if the port is not open. Around 30 GB of data is downloaded.
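To check beforehand that outgoing connections on port 4497 are allowed, a short Python test such as the following sketch can be used. The helper name `port_is_reachable` is hypothetical, and the hostname shown in the comment is an assumption (check the Rfam documentation for the current address of the public MySQL server); only the port number comes from this document.

```python
import socket


def port_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False


# Example call (hostname is an assumption, not taken from this document):
# port_is_reachable("mysql-rfam-public.ebi.ac.uk", 4497)
```

A `False` result from a machine that otherwise has Internet access suggests that the port is filtered by your network.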
For example, the IBISC-EvryRNA server has:
* Intel Xeon E7-4850 v4 (60 cores, 2.10GHz)
* 112 GB of RAM
* 250 GB of hard-disk storage
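If your machine has less RAM than the first alignment run needs, the cmalign arguments --cpu and --mxsize can be derived from your hardware. The following Python sketch is only a rule of thumb, not something RNANet does: the helper `cmalign_opts` and the `reserve_gb` margin are hypothetical, and it assumes that --mxsize is expressed in megabytes and applies per core, as described above. Values that are too small may make cmalign fail on large families.

```python
def cmalign_opts(ram_gb, cores, reserve_gb=8):
    """Build a cmalign option string that splits the usable RAM between cores.

    ram_gb:     total RAM of the machine, in GB
    cores:      number of cores to give to cmalign (--cpu)
    reserve_gb: RAM kept for the system and the rest of the pipeline
                (arbitrary margin chosen for this example)
    """
    if cores < 1 or ram_gb <= reserve_gb:
        raise ValueError("need at least one core and more RAM than the reserve")
    usable_mb = (ram_gb - reserve_gb) * 1024
    mxsize = usable_mb // cores  # memory budget per core, in MB
    return f"--cpu {cores} --mxsize {mxsize}"


# Example with 112 GB of RAM, restricting cmalign to 16 cores
opts = cmalign_opts(112, 16)
print(f'--cmalign-opts="{opts}" --cmalign-rrna-opts="{opts}"')
```

The printed string can be appended to the RNANet command line.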
# Method 1 : Installation using Docker
......@@ -57,50 +67,52 @@ nohup bash -c 'time ~/Projects/RNANet/RNAnet.py --3d-folder ~/Data/RNA/3D/ --seq
The detailed list of options is below:
```
-h [ --help ] Print this help message
--version Print the program version
Select what to do:
--------------------------------------------------------------------------------------------------------------
-f [ --full-inference ] Infer new mappings even if Rfam already provides some. Yields more copies of
chains mapped to different families.
-s Run statistics computations after completion
--stats-opts=… Pass additional command line options to the statistics.py script, e.g. "--wadley --distance-matrices"
--extract Extract the portions of 3D RNA chains to individual mmCIF files.
--keep-hetatm=False (True | False) Keep ions, waters and ligands in produced mmCIF files.
Does not affect the descriptors.
--no-homology Do not try to compute PSSMs and do not align sequences.
Allows to yield more 3D data (consider chains without a Rfam mapping).
Select how to do it:
--------------------------------------------------------------------------------------------------------------
--3d-folder=… Path to a folder to store the 3D data files. Subfolders will contain:
RNAcifs/ Full structures containing RNA, in mmCIF format
rna_mapped_to_Rfam/ Extracted 'pure' portions of RNA chains mapped to families
rna_only/ Extracted 'pure' RNA chains, not truncated
datapoints/ Final results in CSV file format.
--seq-folder=… Path to a folder to store the sequence and alignment files. Subfolders will be:
rfam_sequences/fasta/ Compressed hits to Rfam families
realigned/ Sequences, covariance models, and alignments by family
--sina Align large subunit LSU and small subunit SSU ribosomal RNA using SINA instead of Infernal,
the other RNA families will be aligned using infernal.
--maxcores=… Limit the number of cores to use in parallel portions to reduce the simultaneous
need of RAM. Should be a number between 1 and your number of CPUs. Note that portions
of the pipeline already limit themselves to 50% or 70% of that number by default.
--cmalign-opts=… A string of additional options to pass to cmalign aligner, e.g. "--nonbanded --mxsize 2048"
--cmalign-rrna-opts=… Like cmalign-opts, but applied for rRNA (large families, memory-heavy jobs).
--archive Create tar.gz archives of the datapoints text files and the alignments,
and update the link to the latest archive.
--no-logs Do not save per-chain logs of the numbering modifications.
Select which data we are interested in:
--------------------------------------------------------------------------------------------------------------
-r 4.0 [ --resolution=4.0 ] Maximum 3D structure resolution to consider a RNA chain.
--all Process chains even if they already are in the database.
--redundant Process all members of the equivalence classes not only the representative.
--only Ask to process a specific chains only (could be 4v49, 4v49_1_AA, or 4v49_1_AA_5-1523).
--ignore-issues Do not ignore already known issues and attempt to compute them.
--update-homologous Re-download Rfam and SILVA databases, realign all families, and recompute all CSV files.
--from-scratch Delete database, local 3D and sequence files, and known issues, and recompute.
```
Options --3d-folder and --seq-folder are mandatory for command-line installations, but should not be used for installations with Docker. In the Docker container, they are set by default to the paths you provide with the -v options.
......