Database.md 7.19 KB

Raw Blame History Permalink



More about the database structure

To help you design your own SQL requests, we provide a description of the database tables and fields.


Table family, for Rfam families and their properties


rfam_acc: The family codename, from Rfam's numbering (Rfam accession number)

description: What RNAs fit in this family

nb_homologs: The number of hits known to be homologous downloaded from Rfam to compute nucleotide frequencies

nb_3d_chains: The number of 3D RNA chains mapped to the family (from Rfam-PDB mappings, or inferred using the redundancy list)

nb_total_homol: Sum of the two previous fields, the number of sequences in the multiple sequence alignment, used to compute nucleotide frequencies

max_len: The longest RNA sequence among the homologs (in bases, unaligned)

ali_len: The aligned sequences length (in bases, aligned)

ali_filtered_len: The aligned sequences length when we filter the alignment to keep only the RNANet chains (which have a 3D structure) and some gap-only columns.

comput_time: Time required to compute the family's multiple sequence alignment in seconds,

comput_peak_mem: RAM (or swap) required to compute the family's multiple sequence alignment in megabytes,

idty_percent: Average identity percentage over pairs of the 3D chains' sequences from the family


Table structure, for 3D structures of the PDB


pdb_id: The 4-char PDB identifier

pdb_model: The model used in the PDB file

date: The first submission date of the 3D structure to a public database

exp_method: A string to know wether the structure as been obtained by X-ray crystallography ('X-RAY DIFFRACTION'), electron microscopy ('ELECTRON MICROSCOPY'), or NMR (not seen yet)

resolution: Resolution of the structure, in Angströms


Table chain, for the datapoints: one chain mapped to one Rfam family


chain_id: A unique identifier

structure_id: The pdb_id where the chain comes from

chain_name: The chain label, extracted from the 3D file

eq_class: The BGSU equivalence class label containing this chain

rfam_acc: The family which the chain is mapped to (if not mapped, value is unmappd)

pdb_start: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)

pdb_end: Position in the chain where the mapping to Rfam ends (absolute position, not residue number)

reversed: Wether the mapping numbering order differs from the residue numbering order in the mmCIF file (eg 4c9d, chains C and D)

issue: Wether an issue occurred with this structure while downloading, extracting, annotating or parsing the annotation. See the file known_issues_reasons.txt for more information about why your chain is marked as an issue.

inferred: Wether the mapping has been inferred using the redundancy list (value is 1) or just known from Rfam-PDB mappings (value is 0)

chain_freq_A, chain_freq_C, chain_freq_G, chain_freq_U, chain_freq_other: Nucleotide frequencies in the chain

pair_count_cWW, pair_count_cWH, ... pair_count_tSS: Counts of the non-canonical base-pair types in the chain (intra-chain counts only)


Table nucleotide, for individual nucleotide descriptors


nt_id: A unique identifier

chain_id: The chain the nucleotide belongs to

index_chain: its absolute position within the portion of chain mapped to Rfam, from 1 to X. This is completely uncorrelated to any gene start or 3D chain residue numbers.

nt_position: relative position within the portion of chain mapped to RFam, from 0 to 1

old_nt_resnum: The residue number in the 3D mmCIF file (it's a string actually, some contain a letter like '37A')

nt_name: The residue type. This includes modified nucleotide names (e.g. 5MC for 5-methylcytosine)

nt_code: One-letter name. Lowercase "acgu" letters are used for modified "ACGU" bases.

nt_align_code: One-letter name used for sequence alignment. Contains "ACGUN-" only first, and then, gaps may be replaced by the most common letter at this position (default)

is_A, is_C, is_G, is_U, is_other: One-hot encoding of the nucleotide base

dbn: character used at this position if we look at the dot-bracket encoding of the secondary structure. Includes inter-chain (RNA complexes) contacts.

paired: empty, or comma separated list of index_chain values referring to nucleotides the base is interacting with. Up to 3 values. Inter-chain interactions are marked paired to '0'.

nb_interact: number of interactions with other nucleotides. Up to 3 values. Includes inter-chain interactions.

pair_type_LW: The Leontis-Westhof nomenclature codes of the interactions. The first letter concerns cis/trans orientation, the second this base's side interacting, and the third the other base's side.

pair_type_DSSR: Same but using the DSSR nomenclature (Hoogsteen edge approximately corresponds to Major-groove and Sugar edge to minor-groove)

alpha, beta, gamma, delta, epsilon, zeta: The 6 torsion angles of the RNA backabone for this nucleotide

epsilon_zeta: Difference between epsilon and zeta angles

bb_type: conformation of the backbone (BI, BII or ..)

chi: torsion angle between the sugar and base (O-C1'-N-C4)

glyco_bond: syn or anti configuration of the sugar-base bond

v0, v1, v2, v3, v4: 5 torsion angles of the ribose cycle

form: if the nucleotide is involved in a stem, the stem type (A, B or Z)

ssZp: Z-coordinate of the 3’ phosphorus atom with reference to the5’ base plane

Dp: Perpendicular distance of the 3’ P atom to the glycosidic bond

eta, theta: Pseudotorsions of the backbone, using phosphorus and carbon 4'

eta_prime, theta_prime: Pseudotorsions of the backbone, using phosphorus and carbon 1'

eta_base, theta_base: Pseudotorsions of the backbone, using phosphorus and the base center

phase_angle: Conformation of the ribose cycle

amplitude: Amplitude of the sugar puckering

puckering: Conformation of the ribose cycle (10 classes depending on the phase_angle value)


Table align_column, for positions in multiple sequence alignments


column_id: A unique identifier

rfam_acc: The family's MSA the column belongs to

index_ali: Position of the column in the alignment (starts at 1)

freq_A, freq_C, freq_G, freq_U, freq_other: Nucleotide frequencies in the alignment at this position

gap_percent: The frequencies of gaps at this position in the alignment (between 0.0 and 1.0)

consensus: A consensus character (ACGUN or '-') summarizing the column, if we can. If >75% of the sequences are gaps at this position, the gap is picked as consensus. Otherwise, A/C/G/U is chosen if >50% of the non-gap positions are A/C/G/U. Otherwise, N is the consensus.


There always is an entry, for each family (rfam_acc), with index_ali = 0; gap_percent = 1.0; and nucleotide frequencies set to 0.0. This entry is used when the nucleotide frequencies cannot be determined because of local alignment issues.


Table re_mapping, to map a nucleotide to an alignment column


remapping_id: A unique identifier

chain_id: The chain which is mapped to an alignment

index_chain: The absolute position of the nucleotide in the chain (from 1 to X)

index_ali The position of that nucleotide in its family alignment