RNANet
Contents:
- What is RNANet?
- Install and run RNANet
- How to further filter the dataset
- Database tables documentation
- FAQ
- Troubleshooting
- Contact
Cite us
- Louis Becquey, Eric Angel, and Fariza Tahi (2020). RNANet: an automatically built dual-source dataset integrating homologous sequences and RNA structures. Bioinformatics, btaa944.
Additional relevant references:
The "ProteinNet" philosophy which inspired this work:
- AlQuraishi, M. (2019b). ProteinNet: A standardized data set for machine learning of protein structure. BMC Bioinformatics, 20(1), 311
If you use our annotations by DSSR, you might want to cite:
- Lu, X.-J. et al. (2015). DSSR: An integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Research, 43(21), e142.
If you use our multiple sequence alignments and homology data, you might want to cite:
- Pruesse, E. et al. (2012). SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics, 28(14), 1823–1829.
- Nawrocki, E. P. and Eddy, S. R. (2013). Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 29(22), 2933–2935.
What is RNANet?
RNANet is a multiscale dataset of non-coding RNA structures, including sequences, secondary structures, non-canonical interactions, 3D geometrical descriptors, and sequence homology.
It is available in machine-learning ready formats like CSV files (one per RNA chain) or as a SQL database.
Most interestingly, nucleotides have been renumbered in a standardized way, and the 3D chains have been re-aligned with homologous sequences and covariance models from the Rfam database.
Methodology
We use the Rfam mappings between 3D structures and known Rfam families, using the sequences that are known to belong to an Rfam family (hits provided in RF0XXXX.fasta files from Rfam). Future versions might compute a real MSA-based clustering directly with Rfamseq ncRNA sequences, like ProteinNet does with protein sequences, but this requires a tool similar to jackHMMER in the Infernal software suite, which is not available yet.
This script prepares the dataset from available public data in PDB, RNA 3D Hub, Rfam and SILVA.
Pipeline
The script follows these steps:
To gather structures:
- Gets a list of 3D structures containing RNA from BGSU's non-redundant list (but keeps the redundant structures /!\),
- Asks Rfam for mappings of these structures onto Rfam families (~50% of structures have a direct mapping, some more are inferred using the redundancy list)
- Downloads the corresponding 3D structures (mmCIFs)
- If desired, extracts the right chain portions that map onto an Rfam family to a separate mmCIF file
To compute homology information:
- Extracts the sequence of every 3D chain
- Downloads Rfamseq ncRNA sequence hits for the concerned Rfam families (or ARB databases of SSU or LSU sequences from SILVA for rRNAs)
- Realigns Rfamseq hits and sequences from the 3D structures together to obtain a multiple sequence alignment for each Rfam family (using cmalign --cyk, except for ribosomal LSU and SSU, where SINA is used)
- Computes nucleotide frequencies at every position for each alignment
- Maps each nucleotide of a 3D chain to its position in the corresponding family sequence alignment
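The last two steps can be illustrated on a toy alignment. This is a minimal sketch, not RNANet's actual code: the function names `column_frequencies` and `map_to_columns` and the toy sequences are hypothetical, and it assumes that gap characters (`-` and `.`) are excluded from the frequency counts, which may differ from what the pipeline does.

```python
from collections import Counter

GAPS = "-."

def column_frequencies(alignment):
    """For each column, return the frequency of A, C, G, U and 'other'
    among non-gap characters (assumption: gaps are ignored)."""
    n_cols = len(alignment[0])
    freqs = []
    for col in range(n_cols):
        counts = Counter()
        for seq in alignment:
            c = seq[col].upper()
            if c in GAPS:
                continue
            counts[c if c in "ACGU" else "other"] += 1
        total = sum(counts.values())
        freqs.append({k: (counts[k] / total if total else 0.0)
                      for k in ("A", "C", "G", "U", "other")})
    return freqs

def map_to_columns(aligned_seq):
    """Map the 1-based position of each residue in the ungapped chain
    to its 1-based column in the alignment."""
    mapping = {}
    position = 0
    for column, c in enumerate(aligned_seq, start=1):
        if c not in GAPS:
            position += 1
            mapping[position] = column
    return mapping

# Hypothetical toy alignment: the first row plays the role of a 3D chain.
toy_alignment = ["AC-GU", "ACGGU", "A--GC"]
freqs = column_frequencies(toy_alignment)
mapping = map_to_columns(toy_alignment[0])
```

Here the third residue of the first sequence lands in column 4, because column 3 is a gap for that sequence.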
To compute 3D annotations:
- Runs DSSR on every RNA structure to get a variety of descriptors per position, describing secondary and tertiary structure. Base-pair type annotations include intra-chain and inter-chain interactions.
Finally, the script exports this data from the SQLite database into flat CSV files.
Data provided
We provide several resources to exploit this dataset. You can download them from EvryRNA.
- A series of tables in the SQLite3 database, see the database documentation and examples of useful queries,
- One CSV file per RNA chain, summarizing all the relevant information about it,
- Filtered alignment files in FASTA format containing only the sequences with a 3D structure available in RNANet, but which have been aligned using all the homologous sequences of this family from Rfam or SILVA,
- Additional statistics files about nucleotide frequencies, modified bases, basepair types within each chain or by RNA family.
For now, we do not provide the set of cleaned 3D structures or the full alignments with Rfam sequences as public downloads. If you need them, recompute them or ask us.
Updates
RNANet is updated monthly to take into account new structures proposed in the BGSU non-redundant lists. The monthly runs realign previous alignments with the new sequences using esl-alimerge from Infernal.
It is updated yearly from scratch to take into account new Rfam sequences or updates in the covariance models, and updates in the PDB 3D files.
For now, the SILVA releases used are fixed (LSU and SSU releases 138.1) and not automatically updated. SILVA authors, if you read this: please provide a "latest" download link to ease automatic retrieval of the latest version.
See what's new in the latest version of RNANet in the CHANGELOG.
How to further filter the dataset
You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package.
Note: the sqlite3 module is part of the Python standard library and cannot be installed through pip. If it is missing, install SQLite using your OS package manager (search for 'sqlite').
Filter on 3D structure resolution
We need to import the sqlite3 and pandas packages first.
import sqlite3
import pandas as pd
Step 1: We first get a list of chains that are below our favorite resolution threshold (here 4.0 Ångströms):
with sqlite3.connect("results/RNANet.db") as connection:
    chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
                                FROM chain JOIN structure
                                ON chain.structure_id = structure.pdb_id
                                WHERE resolution < 4.0
                                ORDER BY structure_id ASC;""",
                             con=connection)
Step 2: Then, we define a template string containing the SQL request we use to get all the information about one RNA chain, with braces { } where each chain_id will be inserted. You can remove the fields you are not interested in.
req = """SELECT index_chain, old_nt_resnum, nt_position, nt_name, nt_code, nt_align_code,
is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, dbn,
paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta,
chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base,
v0, v1, v2, v3, v4, amplitude, phase_angle, puckering
FROM
(SELECT chain_id, rfam_acc from chain WHERE chain_id = {})
NATURAL JOIN re_mapping
NATURAL JOIN nucleotide
NATURAL JOIN align_column;"""
Step 3: Finally, we iterate over this list of chains and save their information in CSV files:
with sqlite3.connect("results/RNANet.db") as connection:
    # itertuples() yields one named tuple per row, so columns are accessible as attributes
    for chain in chain_list.itertuples():
        df = pd.read_sql(req.format(chain.chain_id), connection)
        filename = chain.structure_id + '-' + chain.chain_name + '.csv'
        df.to_csv(filename, float_format="%.2f", index=False)
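To try the Step 1 logic without the full database, the query can be run against a small in-memory stand-in. The sketch below uses only the standard library (sqlite3 and csv) rather than pandas, and passes the threshold as a bound parameter (`?`) instead of writing it into the SQL text. The toy tables contain only the columns used in Step 1; their column types, the sample rows, and the helper names `chains_below_resolution` and `rows_to_csv` are assumptions made for illustration, not the real RNANet schema or code.

```python
import csv
import io
import sqlite3

# Toy stand-in for results/RNANet.db (hypothetical data and column types).
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, resolution REAL, date TEXT);
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, structure_id TEXT, chain_name TEXT);
    INSERT INTO structure VALUES ('1abc', 2.5, '2015-03-01'), ('2xyz', 6.0, '2019-07-15');
    INSERT INTO chain VALUES (1, '1abc', 'A'), (2, '2xyz', 'B');
""")

def chains_below_resolution(con, threshold):
    """Same query as Step 1, with the threshold bound as a parameter."""
    cursor = con.execute(
        """SELECT chain_id, structure_id, chain_name
           FROM chain JOIN structure
           ON chain.structure_id = structure.pdb_id
           WHERE resolution < ?
           ORDER BY structure_id ASC;""",
        (threshold,),
    )
    return cursor.fetchall()

def rows_to_csv(header, rows):
    """Serialize rows to CSV text (written to a string here, not a file)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

selected = chains_below_resolution(connection, 4.0)
csv_text = rows_to_csv(["chain_id", "structure_id", "chain_name"], selected)
connection.close()
```

Only the chain from the 2.5 Å structure is selected. With pandas, the same bound parameter can be passed through the `params` argument of `pd.read_sql`.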
Filter on 3D structure publication date
You might want to get only the dataset you would have had in a past year, to compare yourself with the competitors of an RNA-Puzzles problem for example. We simply modify Step 1 above:
with sqlite3.connect("results/RNANet.db") as connection:
    chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
                                FROM chain JOIN structure
                                ON chain.structure_id = structure.pdb_id
                                WHERE date < '2018-06-01'
                                ORDER BY structure_id ASC;""",
                             con=connection)
Then proceed to steps 2 and 3.
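The date filter is a plain text comparison, which gives the expected chronological order only if dates are stored as ISO-formatted text (YYYY-MM-DD). That storage format is an assumption here, and worth checking on your copy of the database. A minimal sketch on a hypothetical toy table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE structure (pdb_id TEXT, date TEXT)")
con.executemany(
    "INSERT INTO structure VALUES (?, ?)",
    [("1abc", "2015-03-01"), ("2xyz", "2019-07-15"), ("3def", "2018-05-31")],
)

# ISO dates sort lexicographically in chronological order,
# so a string comparison is enough to apply the cutoff.
cutoff = "2018-06-01"
before = [row[0] for row in con.execute(
    "SELECT pdb_id FROM structure WHERE date < ? ORDER BY pdb_id", (cutoff,)
)]
con.close()
```

Here `before` contains the two structures released before the cutoff, including the one dated the day before it.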
Filter to avoid chain redundancy when several mappings are available
Some chains can be mapped to two (or more) RNA families, and exist several times in the database. If you want just one example of each RNA 3D chain, use in Step 1:
with sqlite3.connect("results/RNANet.db") as connection:
    chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name
                                FROM chain JOIN structure
                                ON chain.structure_id = structure.pdb_id
                                ORDER BY structure_id ASC;""",
                             con=connection)
Then proceed to steps 2 and 3.
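Depending on how chain_id is assigned in your copy of the database, a chain mapped to two families may carry a different chain_id for each mapping, in which case the rows differ and DISTINCT keeps both. If you observe this, one option is to deduplicate the Step 1 result on (structure_id, chain_name) in Python. The sketch below is hypothetical (the function name `one_row_per_chain` and the sample rows are invented for illustration) and keeps the first mapping encountered for each chain:

```python
def one_row_per_chain(rows):
    """Keep the first (chain_id, structure_id, chain_name) row seen
    for each (structure_id, chain_name) pair."""
    seen = set()
    unique = []
    for chain_id, structure_id, chain_name in rows:
        key = (structure_id, chain_name)
        if key not in seen:
            seen.add(key)
            unique.append((chain_id, structure_id, chain_name))
    return unique

# Hypothetical rows: chain A of 1abc appears twice, once per family mapping.
rows = [(1, "1abc", "A"), (2, "1abc", "A"), (3, "2xyz", "B")]
deduplicated = one_row_per_chain(rows)
```

On a pandas DataFrame, `chain_list.drop_duplicates(subset=["structure_id", "chain_name"])` has the same effect.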
Troubleshooting
Check if your problem is listed in the known issues.
Warnings and errors
If you ran RNANet and got an error or a warning that you do not fully understand, check the Error documentation.
Not enough memory
If you run out of memory (job killed), you may want to reduce the number of jobs run in parallel. Use the --maxcores option with a small number to ask RNANet to limit the concurrency and the simultaneous need for a lot of RAM. The computation time will increase accordingly. If the blocking part is a cmalign alignment, use --cmalign-opts="--cyk --nonbanded --notrunc --small" to reduce the memory requirements of the alignment.
Not enough memory/too slow (developer trick)
If --maxcores is not enough and you have identified the step that fails, you can try to edit the Python code. Look for the "coeff_ncores" argument of some function calls. This is the coefficient applied to --maxcores for different steps of the pipeline. You can change it according to your needs to reduce or increase concurrency (to use less memory, or compute faster, respectively).
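To reason about the effect of such a coefficient, here is a minimal sketch of how a per-step worker count could be derived from it. This is an assumption about the general idea, not RNANet's actual implementation: the function `effective_workers`, its rounding, and its floor of one worker are hypothetical.

```python
def effective_workers(maxcores, coeff_ncores):
    """Hypothetical rule: scale maxcores by the coefficient, round down,
    and never go below one worker."""
    return max(1, int(maxcores * coeff_ncores))

# With --maxcores 8, a coefficient of 0.5 would give 4 parallel jobs for that step.
workers_half = effective_workers(8, 0.5)
# A small coefficient on a small machine still leaves one worker.
workers_min = effective_workers(2, 0.25)
```

Under this rule, lowering the coefficient for a memory-hungry step reduces its concurrency without slowing down the other steps.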
Contact
RNANet is still in beta, which means we are truly open to (and enjoy) all the feedback we can get from interested users.
Please send all your questions, feature requests, bug reports or angry reacts to louis.becquey(a)univ-evry.fr