Louis BECQUEY

Solved issue: angles were still in degrees

@@ -30,14 +30,12 @@ To help you design your own SQL requests, we provide a description of the databa
 * `rfam_acc`: The family which the chain is mapped to (if not mapped, value is *unmappd*)
 * `pdb_start`: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)
 * `pdb_end`: Position in the chain where the mapping to Rfam ends (absolute position, not residue number)
-* `reversed`: Wether the mapping numbering order differs from the residue numbering order in the mmCIF file (eg 4c9d, chains C and D)
 * `issue`: Wether an issue occurred with this structure while downloading, extracting, annotating or parsing the annotation. See the file known_issues_reasons.txt for more information about why your chain is marked as an issue.
 * `inferred`: Wether the mapping has been inferred using the redundancy list (value is 1) or just known from Rfam-PDB mappings (value is 0)
 * `chain_freq_A`, `chain_freq_C`, `chain_freq_G`, `chain_freq_U`, `chain_freq_other`: Nucleotide frequencies in the chain
 * `pair_count_cWW`, `pair_count_cWH`, ... `pair_count_tSS`: Counts of the non-canonical base-pair types in the chain (intra-chain counts only)
 
 ## Table `nucleotide`, for individual nucleotide descriptors
-* `nt_id`: A unique identifier
 * `chain_id`: The chain the nucleotide belongs to
 * `index_chain`: its absolute position within the portion of chain mapped to Rfam, from 1 to X. This is completely uncorrelated to any gene start or 3D chain residue numbers.
 * `nt_position`: relative position within the portion of chain mapped to RFam, from 0 to 1
@@ -51,7 +49,7 @@ To help you design your own SQL requests, we provide a description of the databa
 * `nb_interact`: number of interactions with other nucleotides. Up to 3 values. Includes inter-chain interactions.
 * `pair_type_LW`: The Leontis-Westhof nomenclature codes of the interactions. The first letter concerns cis/trans orientation, the second this base's side interacting, and the third the other base's side.
 * `pair_type_DSSR`: Same but using the DSSR nomenclature (Hoogsteen edge approximately corresponds to Major-groove and Sugar edge to minor-groove)
-* `alpha`, `beta`, `gamma`, `delta`, `epsilon`, `zeta`: The 6 torsion angles of the RNA backabone for this nucleotide
+* `alpha`, `beta`, `gamma`, `delta`, `epsilon`, `zeta`: The 6 torsion angles of the RNA backbone for this nucleotide, between 0 and 2pi
 * `epsilon_zeta`: Difference between epsilon and zeta angles
 * `bb_type`: conformation of the backbone (BI, BII or ..)
 * `chi`: torsion angle between the sugar and base (O-C1'-N-C4)
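Review note: the changed line above documents the six backbone torsions as values between 0 and 2pi, i.e. in radians. For readers who still need degrees, here is a minimal sketch of the conversion. The function names are hypothetical and not part of RNANet, and the `None` handling assumes that missing angles come back from the database as NULL.

```python
import math

def torsion_to_degrees(angle_rad):
    """Convert a torsion stored in radians (0 to 2*pi) to degrees (0 to 360)."""
    if angle_rad is None:  # assumption: a missing angle is stored as NULL
        return None
    return math.degrees(angle_rad)

def torsion_to_signed_degrees(angle_rad):
    """Map a torsion in radians (0 to 2*pi) to the signed range (-180, 180] degrees."""
    if angle_rad is None:
        return None
    deg = math.degrees(angle_rad) % 360.0
    return deg - 360.0 if deg > 180.0 else deg
```

The signed form is the convention often used when plotting torsions; which one to use depends on the downstream tool.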
@@ -69,7 +67,8 @@ To help you design your own SQL requests, we provide a description of the databa
 
 ## Table `align_column`, for positions in multiple sequence alignments
 * `rfam_acc`: The family's MSA the column belongs to
-* `index_ali`: Position of the column in the alignment (starts at 1)
+* `index_ali`: Position of the column in the wide alignment with Rfam sequences (starts at 1)
+* `index_small_ali`: Position of the column in the small alignment with only 3D chains (starts at 1)
 * `cm_coord`: Position of the column in the Rfam covariance model of the family (starts at 1). The value is NULL in portions that are insertions compared to the model.
 * `freq_A`, `freq_C`, `freq_G`, `freq_U`, `freq_other`: Nucleotide frequencies in the alignment at this position
 * `gap_percent`: The frequencies of gaps at this position in the alignment (between 0.0 and 1.0)
@@ -79,7 +78,6 @@ To help you design your own SQL requests, we provide a description of the databa
 There always is an entry, for each family (rfam_acc), with index_ali = 0; gap_percent = 1.0; and nucleotide frequencies set to 0.0. This entry is used when the nucleotide frequencies cannot be determined because of local alignment issues.
 
 ## Table `re_mapping`, to map a nucleotide to an alignment column
-* `remapping_id`: A unique identifier
 * `chain_id`: The chain which is mapped to an alignment
 * `index_chain`: The absolute position of the nucleotide in the chain (from 1 to X)
 * `index_ali` The position of that nucleotide in its family alignment
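Review note: with `nt_id` and `remapping_id` removed, a nucleotide is presumably identified by the pair (`chain_id`, `index_chain`). The sketch below shows how a nucleotide could then be joined to its alignment column. It builds a toy in-memory SQLite database whose column names come from the documentation above. The name of the chain table (`chain`), its key column, the column types and all inserted values are assumptions made for illustration; this is not the actual RNANet schema.

```python
import sqlite3

# Toy database: only the columns needed for the join are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, rfam_acc TEXT);  -- table name and key are assumptions
CREATE TABLE nucleotide (chain_id INTEGER, index_chain INTEGER, alpha REAL);
CREATE TABLE re_mapping (chain_id INTEGER, index_chain INTEGER, index_ali INTEGER);
CREATE TABLE align_column (rfam_acc TEXT, index_ali INTEGER, gap_percent REAL);

-- Illustrative rows (invented values)
INSERT INTO chain VALUES (1, 'RF00001');
INSERT INTO nucleotide VALUES (1, 1, 1.2), (1, 2, 5.0);
INSERT INTO re_mapping VALUES (1, 1, 4), (1, 2, 0);
-- index_ali = 0 is the placeholder column with gap_percent = 1.0
INSERT INTO align_column VALUES ('RF00001', 0, 1.0), ('RF00001', 4, 0.25);
""")

# Join each nucleotide to its alignment column through re_mapping,
# using (chain_id, index_chain) as the compound key.
query = """
SELECT n.index_chain, n.alpha, a.index_ali, a.gap_percent
FROM nucleotide AS n
JOIN chain AS c ON c.chain_id = n.chain_id
JOIN re_mapping AS r ON r.chain_id = n.chain_id AND r.index_chain = n.index_chain
JOIN align_column AS a ON a.rfam_acc = c.rfam_acc AND a.index_ali = r.index_ali
WHERE n.chain_id = ?
ORDER BY n.index_chain
"""
rows = conn.execute(query, (1,)).fetchall()
for row in rows:
    print(row)
```

In this toy data the second nucleotide maps to the `index_ali = 0` placeholder entry, so a real query may want to filter that column out.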
......
@@ -31,5 +31,5 @@ We first remove the nucleotides whose number is outside the family mapping (if a
 
 * **What are the versions of the dependencies you use ?**
 
-`cmalign` is v1.1.3, `sina` is v1.6.0, `x3dna-dssr` is v1.9.9, Biopython is v1.78.
+`cmalign` is v1.1.4, `sina` is v1.6.0, `x3dna-dssr` is v1.9.9, Biopython is v1.78.
 
\ No newline at end of file
......
@@ -57,55 +57,63 @@ nohup bash -c 'time ~/Projects/RNANet/RNAnet.py --3d-folder ~/Data/RNA/3D/ --seq
 The detailed list of options is below:
 
 ```
--h [ --help ] Print this help message
---version Print the program version
+-h [ --help ] Print this help message
+--version Print the program version
 
 Select what to do:
 --------------------------------------------------------------------------------------------------------------
--f [ --full-inference ] Infer new mappings even if Rfam already provides some. Yields more copies of
- chains mapped to different families.
--s Run statistics computations after completion
---extract Extract the portions of 3D RNA chains to individual mmCIF files.
---keep-hetatm=False (True | False) Keep ions, waters and ligands in produced mmCIF files.
- Does not affect the descriptors.
---no-homology Do not try to compute PSSMs and do not align sequences.
- Allows to yield more 3D data (consider chains without a Rfam mapping).
+-f [ --full-inference ] Infer new mappings even if Rfam already provides some. Yields more copies of
+ chains mapped to different families.
+-s Run statistics computations after completion
+--stats-opts=… Pass additional command line options to the statistics.py script, e.g. "--wadley --distance-matrices"
+--extract Extract the portions of 3D RNA chains to individual mmCIF files.
+--keep-hetatm=False (True | False) Keep ions, waters and ligands in produced mmCIF files.
+ Does not affect the descriptors.
+--no-homology Do not try to compute PSSMs and do not align sequences.
+ Allows to yield more 3D data (consider chains without a Rfam mapping).
 
 Select how to do it:
 --------------------------------------------------------------------------------------------------------------
---3d-folder=… Path to a folder to store the 3D data files. Subfolders will contain:
- RNAcifs/ Full structures containing RNA, in mmCIF format
- rna_mapped_to_Rfam/ Extracted 'pure' RNA chains
- datapoints/ Final results in CSV file format.
---seq-folder=… Path to a folder to store the sequence and alignment files. Subfolders will be:
- rfam_sequences/fasta/ Compressed hits to Rfam families
- realigned/ Sequences, covariance models, and alignments by family
---maxcores=… Limit the number of cores to use in parallel portions to reduce the simultaneous
- need of RAM. Should be a number between 1 and your number of CPUs. Note that portions
- of the pipeline already limit themselves to 50% or 70% of that number by default.
---archive Create tar.gz archives of the datapoints text files and the alignments,
- and update the link to the latest archive.
---no-logs Do not save per-chain logs of the numbering modifications
+--3d-folder=… Path to a folder to store the 3D data files. Subfolders will contain:
+ RNAcifs/ Full structures containing RNA, in mmCIF format
+ rna_mapped_to_Rfam/ Extracted 'pure' portions of RNA chains mapped to families
+ rna_only/ Extracted 'pure' RNA chains, not truncated
+ datapoints/ Final results in CSV file format.
+--seq-folder=… Path to a folder to store the sequence and alignment files. Subfolders will be:
+ rfam_sequences/fasta/ Compressed hits to Rfam families
+ realigned/ Sequences, covariance models, and alignments by family
+--sina Align large subunit LSU and small subunit SSU ribosomal RNA using SINA instead of Infernal,
+ the other RNA families will be aligned using infernal.
+--maxcores=… Limit the number of cores to use in parallel portions to reduce the simultaneous
+ need of RAM. Should be a number between 1 and your number of CPUs. Note that portions
+ of the pipeline already limit themselves to 50% or 70% of that number by default.
+--cmalign-opts=… A string of additional options to pass to cmalign aligner, e.g. "--nonbanded --mxsize 2048"
+--archive Create tar.gz archives of the datapoints text files and the alignments,
+ and update the link to the latest archive.
+--no-logs Do not save per-chain logs of the numbering modifications.
 
 Select which data we are interested in:
 --------------------------------------------------------------------------------------------------------------
--r 4.0 [ --resolution=4.0 ] Maximum 3D structure resolution to consider a RNA chain.
---all Build chains even if they already are in the database.
---only Ask to process a specific chain label only
---ignore-issues Do not ignore already known issues and attempt to compute them
---update-homologous Re-download Rfam and SILVA databases, realign all families, and recompute all CSV files
---from-scratch Delete database, local 3D and sequence files, and known issues, and recompute.
+-r 4.0 [ --resolution=4.0 ] Maximum 3D structure resolution to consider a RNA chain.
+--all Process chains even if they already are in the database.
+--redundant Process all members of the equivalence classes not only the representative.
+--only Ask to process a specific chains only (e.g. 4v49, 4v49_1_AA, or 4v49_1_AA_5-1523).
+--ignore-issues Do not ignore already known issues and attempt to compute them.
+--update-homologous Re-download Rfam and SILVA databases, realign all families, and recompute all CSV files.
+--from-scratch Delete database, local 3D and sequence files, and known issues, and recompute.
 
 ```
 Options --3d-folder and --seq-folder are mandatory for command-line installations, but should not be used for installations with Docker. In the Docker container, they are set by default to the paths you provide with the -v options.
 
 The most useful options in that list are
 * ` --extract`, to actually produce some re-numbered 3D mmCIF files of the RNA chains individually,
-* ` --no-homology`, to ignore the family mapping and sequence alignment parts and only focus on 3D data download and annotation. This would yield more data since many RNAs are not mapped to any Rfam family.
+* ` --no-homology`, to ignore the family mapping and sequence alignment parts and only focus on 3D data download and annotation. This would yield more data since many RNAs are not mapped to any Rfam family,
 * ` -s`, to run the "statistics" which are a few useful post-computation tasks such as:
 * Computation of sequence identity matrices
 * Statistics over the sequence lengths, nucleotide frequencies, and basepair types by RNA family
 * Overall database content statistics
+ * Detailed analysis of the eta-theta pseudotorsion angles (use `--stats-opts "--wadley"` after `-s`) or 3D distance matrices and their averages per family (use `--stats-opts "--distance-matrices"`)
+* ` --redundant`, to yield all the available data and not only the BGSU NR-List respresentatives
 
 # Computation time
 
......
@@ -1190,7 +1190,7 @@ if __name__ == "__main__":
 
 if opt == "-h" or opt == "--help":
 print( "RNANet statistics, a script to build a multiscale RNA dataset from public data\n"
- "Developped by Louis Becquey (louis.becquey@univ-evry.fr), 2020/2021")
+ "Developped by Louis Becquey an Khodor Hannoush, 2020/2021")
 print()
 print("Options:")
 print("-h [ --help ]\t\t\tPrint this help message")
@@ -1206,7 +1206,7 @@ if __name__ == "__main__":
 
 sys.exit()
 elif opt == '--version':
- print("RNANet statistics 1.4 beta")
+ print("RNANet statistics 1.5 beta")
 sys.exit()
 elif opt == "-r" or opt == "--resolution":
 assert float(arg) > 0.0 and float(arg) <= 20.0
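Review note: the context line above validates `--resolution` with an `assert`, which Python skips when run with `-O` and which reports a bare `AssertionError` on bad input. A possible alternative is sketched below, only as a suggestion. `parse_resolution` is a hypothetical helper, not an existing function in the script, and the 20.0 upper bound is copied from the assertion shown in the diff.

```python
def parse_resolution(arg, max_resolution=20.0):
    """Parse a resolution option value and check that 0 < value <= max_resolution."""
    try:
        value = float(arg)
    except ValueError:
        raise ValueError(f"resolution must be a number, got {arg!r}") from None
    # The chained comparison is also False for NaN, so NaN is rejected here.
    if not (0.0 < value <= max_resolution):
        raise ValueError(
            f"resolution must be in (0, {max_resolution}], got {value}"
        )
    return value
```

The option loop could then call `parse_resolution(arg)` and print the error message before exiting, rather than relying on the assertion.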
......