@@ -30,14 +30,12 @@ To help you design your own SQL requests, we provide a description of the databa
*`rfam_acc`: The family which the chain is mapped to (if not mapped, value is *unmappd*)
*`pdb_start`: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)
*`pdb_end`: Position in the chain where the mapping to Rfam ends (absolute position, not residue number)
*`reversed`: Whether the mapping numbering order differs from the residue numbering order in the mmCIF file (e.g. 4c9d, chains C and D)
*`issue`: Whether an issue occurred with this structure while downloading, extracting, annotating or parsing the annotation. See the file known_issues_reasons.txt for more information about why your chain is marked as an issue.
*`inferred`: Whether the mapping has been inferred using the redundancy list (value is 1) or just known from Rfam-PDB mappings (value is 0)
*`chain_freq_A`, `chain_freq_C`, `chain_freq_G`, `chain_freq_U`, `chain_freq_other`: Nucleotide frequencies in the chain
*`pair_count_cWW`, `pair_count_cWH`, ... `pair_count_tSS`: Counts of the non-canonical base-pair types in the chain (intra-chain counts only)
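
As an illustration of how these descriptors can be combined in a request, here is a minimal sketch using Python's built-in `sqlite3` module on a toy in-memory database. The table name `chain` and the `chain_id` column are assumptions, the rows are made up, and only a subset of the columns described above is reproduced. It counts, per family, the chains that are mapped and not flagged as an issue, and how many of those mappings were inferred.

```python
import sqlite3

# Toy in-memory database. The table name `chain` and the `chain_id` column
# are assumptions; the other columns follow the descriptions above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE chain (
        chain_id  INTEGER PRIMARY KEY,
        rfam_acc  TEXT,
        pdb_start INTEGER,
        pdb_end   INTEGER,
        reversed  INTEGER,
        issue     INTEGER,
        inferred  INTEGER
    )
""")
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO chain VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "RF00005", 1, 76, 0, 0, 0),       # direct Rfam-PDB mapping
        (2, "RF00005", 1, 75, 0, 0, 1),       # mapping inferred from redundancy
        (3, "RF00001", 3, 120, 0, 1, 0),      # flagged as an issue
        (4, "unmappd", None, None, 0, 0, 0),  # not mapped to any family
    ],
)

def usable_mappings(conn):
    """Per family: number of usable chains and number of inferred mappings."""
    return conn.execute("""
        SELECT rfam_acc, COUNT(*) AS n_chains, SUM(inferred) AS n_inferred
        FROM chain
        WHERE issue = 0 AND rfam_acc != 'unmappd'
        GROUP BY rfam_acc
        ORDER BY rfam_acc
    """).fetchall()

summary = usable_mappings(conn)
print(summary)
```

On a real database, the same `SELECT` statement could be run after replacing `":memory:"` with the path to the database file, provided the table and column names match.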
## Table `nucleotide`, for individual nucleotide descriptors
*`nt_id`: A unique identifier
*`chain_id`: The chain the nucleotide belongs to
*`index_chain`: Absolute position within the portion of chain mapped to Rfam, from 1 to X. This is completely unrelated to any gene start or 3D chain residue numbers.
*`nt_position`: Relative position within the portion of chain mapped to Rfam, from 0 to 1
...
...
@@ -51,7 +49,7 @@ To help you design your own SQL requests, we provide a description of the databa
*`nb_interact`: number of interactions with other nucleotides. Up to 3 values. Includes inter-chain interactions.
*`pair_type_LW`: The Leontis-Westhof nomenclature codes of the interactions. The first letter concerns cis/trans orientation, the second this base's side interacting, and the third the other base's side.
*`pair_type_DSSR`: Same but using the DSSR nomenclature (Hoogsteen edge approximately corresponds to Major-groove and Sugar edge to minor-groove)
*`alpha`, `beta`, `gamma`, `delta`, `epsilon`, `zeta`: The 6 torsion angles of the RNA backabone for this nucleotide
*`alpha`, `beta`, `gamma`, `delta`, `epsilon`, `zeta`: The 6 torsion angles of the RNA backbone for this nucleotide, between 0 and 2pi
*`epsilon_zeta`: Difference between epsilon and zeta angles
*`bb_type`: conformation of the backbone (BI, BII or ..)
*`chi`: torsion angle between the sugar and base (O-C1'-N-C4)
...
...
@@ -69,7 +67,8 @@ To help you design your own SQL requests, we provide a description of the databa
## Table `align_column`, for positions in multiple sequence alignments
*`rfam_acc`: The family's MSA the column belongs to
*`index_ali`: Position of the column in the alignment (starts at 1)
*`index_ali`: Position of the column in the wide alignment with Rfam sequences (starts at 1)
*`index_small_ali`: Position of the column in the small alignment with only 3D chains (starts at 1)
*`cm_coord`: Position of the column in the Rfam covariance model of the family (starts at 1). The value is NULL in portions that are insertions compared to the model.
*`freq_A`, `freq_C`, `freq_G`, `freq_U`, `freq_other`: Nucleotide frequencies in the alignment at this position
*`gap_percent`: The frequency of gaps at this position in the alignment (between 0.0 and 1.0)
...
...
@@ -79,7 +78,6 @@ To help you design your own SQL requests, we provide a description of the databa
For each family (rfam_acc), there is always an entry with index_ali = 0, gap_percent = 1.0, and nucleotide frequencies set to 0.0. This entry is used when the nucleotide frequencies cannot be determined because of local alignment issues.
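
Because of this placeholder entry, aggregate statistics over `align_column` should usually exclude `index_ali = 0`. The following is a minimal sketch with Python's built-in `sqlite3` module on a toy in-memory table. The rows are made up and only a few of the columns described above are reproduced.

```python
import sqlite3

# Toy in-memory table with a subset of the align_column descriptors.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE align_column (
        rfam_acc    TEXT,
        index_ali   INTEGER,
        cm_coord    INTEGER,
        gap_percent REAL
    )
""")
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO align_column VALUES (?, ?, ?, ?)",
    [
        ("RF00005", 0, None, 1.0),   # placeholder entry (index_ali = 0)
        ("RF00005", 1, 1, 0.0),
        ("RF00005", 2, None, 0.5),   # insertion compared to the model
        ("RF00005", 3, 2, 0.25),
    ],
)

def mean_gap(conn, family):
    """Average gap frequency over real alignment columns of one family."""
    row = conn.execute(
        "SELECT AVG(gap_percent) FROM align_column "
        "WHERE rfam_acc = ? AND index_ali > 0",
        (family,),
    ).fetchone()
    return row[0]

def model_columns(conn, family):
    """Number of real columns that match a covariance model position."""
    row = conn.execute(
        "SELECT COUNT(*) FROM align_column "
        "WHERE rfam_acc = ? AND index_ali > 0 AND cm_coord IS NOT NULL",
        (family,),
    ).fetchone()
    return row[0]

mean_real = mean_gap(conn, "RF00005")
n_model = model_columns(conn, "RF00005")
print(mean_real, n_model)
```

Without the `index_ali > 0` condition, the placeholder's `gap_percent` of 1.0 would be included in the average and bias it upwards.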
## Table `re_mapping`, to map a nucleotide to an alignment column
*`remapping_id`: A unique identifier
*`chain_id`: The chain which is mapped to an alignment
*`index_chain`: The absolute position of the nucleotide in the chain (from 1 to X)
*`index_ali`: The position of that nucleotide in its family alignment
-r 4.0 [ --resolution=4.0 ] Maximum 3D structure resolution to consider a RNA chain.
--all Build chains even if they already are in the database.
--only Ask to process a specific chain label only
--ignore-issues Do not ignore already known issues and attempt to compute them
--update-homologous Re-download Rfam and SILVA databases, realign all families, and recompute all CSV files
--from-scratch Delete database, local 3D and sequence files, and known issues, and recompute.
-r 4.0 [ --resolution=4.0 ] Maximum 3D structure resolution to consider a RNA chain.
--all Process chains even if they already are in the database.
--redundant Process all members of the equivalence classes not only the representative.
--only Ask to process specific chains only (e.g. 4v49, 4v49_1_AA, or 4v49_1_AA_5-1523).
--ignore-issues Do not ignore already known issues and attempt to compute them.
--update-homologous Re-download Rfam and SILVA databases, realign all families, and recompute all CSV files.
--from-scratch Delete database, local 3D and sequence files, and known issues, and recompute.
```
Options --3d-folder and --seq-folder are mandatory for command-line installations, but should not be used for installations with Docker. In the Docker container, they are set by default to the paths you provide with the -v options.
The most useful options in that list are
*` --extract`, to actually produce some re-numbered 3D mmCIF files of the RNA chains individually,
*` --no-homology`, to ignore the family mapping and sequence alignment parts and only focus on 3D data download and annotation. This would yield more data since many RNAs are not mapped to any Rfam family.
*` --no-homology`, to ignore the family mapping and sequence alignment parts and only focus on 3D data download and annotation. This would yield more data since many RNAs are not mapped to any Rfam family,
*` -s`, to run the "statistics" which are a few useful post-computation tasks such as:
* Computation of sequence identity matrices
* Statistics over the sequence lengths, nucleotide frequencies, and basepair types by RNA family
* Overall database content statistics
* Detailed analysis of the eta-theta pseudotorsion angles (use `--stats-opts "--wadley"` after `-s`) or 3D distance matrices and their averages per family (use `--stats-opts "--distance-matrices"`)
*` --redundant`, to yield all the available data and not only the BGSU NR-List representatives