Louis BECQUEY

update Readme

Showing 1 changed file with 36 additions and 30 deletions
...@@ -34,7 +34,7 @@ Finally, export this data from the SQLite database into flat CSV files. ...@@ -34,7 +34,7 @@ Finally, export this data from the SQLite database into flat CSV files.
34 34
35 * `results/RNANet.db` is a SQLite database file containing several tables with all the information, which you can query yourself with your custom requests, 35 * `results/RNANet.db` is a SQLite database file containing several tables with all the information, which you can query yourself with your custom requests,
36 * `3D-folder-you-passed-in-option/datapoints/*` are flat text CSV files, one per RNA chain mapped to an RNA family, gathering the per-position nucleotide descriptors, 36 * `3D-folder-you-passed-in-option/datapoints/*` are flat text CSV files, one per RNA chain mapped to an RNA family, gathering the per-position nucleotide descriptors,
37 -* `results/RNANET_datapoints_latest.tar.gz` is a compressed archive of the above CSV files 37 +* `results/RNANET_datapoints_latest.tar.gz` is a compressed archive of the above CSV files (only if you passed the --archive option)
38 * `path-to-3D-folder-you-passed-in-option/rna_mapped_to_Rfam` If you used the --extract option, this folder contains one mmCIF file per RNA chain mapped to one RNA family, without other chains, proteins (nor ions and ligands by default) 38 * `path-to-3D-folder-you-passed-in-option/rna_mapped_to_Rfam` If you used the --extract option, this folder contains one mmCIF file per RNA chain mapped to one RNA family, without other chains, proteins (nor ions and ligands by default)
39 * `results/summary_latest.csv` summarizes information about the RNA chains 39 * `results/summary_latest.csv` summarizes information about the RNA chains
40 * `results/families_latest.csv` summarizes information about the RNA families 40 * `results/families_latest.csv` summarizes information about the RNA families
...@@ -54,7 +54,7 @@ Other folders are created and not deleted, which you might want to conserve to a ...@@ -54,7 +54,7 @@ Other folders are created and not deleted, which you might want to conserve to a
54 - CPU: no requirements. The program is optimized for multi-core CPUs, you might want to use Intel Xeons, AMD Ryzens, etc. 54 - CPU: no requirements. The program is optimized for multi-core CPUs, you might want to use Intel Xeons, AMD Ryzens, etc.
55 - GPU: not required 55 - GPU: not required
56 - RAM: 16 GB with a large swap partition is okay. 32 GB is recommended (usage peaks at ~27 GB) 56 - RAM: 16 GB with a large swap partition is okay. 32 GB is recommended (usage peaks at ~27 GB)
57 -- Storage: to date, it takes 60 GB for the 3D data (36 GB if you don't use the --extract option), 11 GB for the sequence data, and 7GB for the outputs (5.6 GB database, 1 GB archive of CSV files). You need to add a few more for the dependencies. Go for a 100GB partition and you are good to go. The computation speed is really decreased if you use a fast storage device (e.g. SSD instead of hard drive, or even better, a NVMe SSD) because of permanent I/O with the SQlite database. 57 +- Storage: to date, it takes 60 GB for the 3D data (36 GB if you don't use the --extract option), 11 GB for the sequence data, and 7 GB for the outputs (5.6 GB database, 1 GB archive of CSV files). You need to add a few more GB for the dependencies. Pick a 100 GB partition and you are good to go. The computation is much faster if you use a fast storage device (e.g. an SSD instead of a hard drive, or even better, an NVMe SSD) because of constant I/O with the SQLite database.
58 - Network: We query the Rfam public MySQL server on port 4497. Make sure your network allows this communication (there should not be any issue on private networks, but your company or university may close ports by default). You will get an error message if the port is not open. Around 30 GB of data is downloaded. 58 - Network: We query the Rfam public MySQL server on port 4497. Make sure your network allows this communication (there should not be any issue on private networks, but your company or university may close ports by default). You will get an error message if the port is not open. Around 30 GB of data is downloaded.
59 59
60 To give you an estimation, our last full run took exactly 12h, excluding the time to download the MMCIF files containing RNA (around 25GB to download) and the time to compute statistics. 60 To give you an estimation, our last full run took exactly 12h, excluding the time to download the MMCIF files containing RNA (around 25GB to download) and the time to compute statistics.
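
The storage and network requirements above can be checked before launching a run. The following is a minimal sketch, not part of RNANet: the function names are hypothetical, the 100 GB threshold is the partition size suggested above, and the hostname of the Rfam server is deliberately left as a parameter because it is not given here.

```python
import shutil
import socket

SUGGESTED_GB = 100  # partition size suggested in the hardware requirements


def free_gigabytes(path="."):
    """Return the free space, in GB (10^9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def port_is_open(host, port=4497, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example use (the host name must come from the Rfam documentation):
#   print(free_gigabytes("/path/to/data") >= SUGGESTED_GB)
#   print(port_is_open("<rfam-mysql-host>"))
```

A `False` result from `port_is_open` only says that the connection failed from this machine; it does not tell whether a firewall or the server itself is the cause.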
...@@ -65,7 +65,7 @@ Update runs are much quicker, around 3 hours. It depends mostly on what RNA fami ...@@ -65,7 +65,7 @@ Update runs are much quicker, around 3 hours. It depends mostly on what RNA fami
65 ## Dependencies 65 ## Dependencies
66 You need to install: 66 You need to install:
67 - DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). 67 - DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/).
68 -- Infernal, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.You don't need the whole X3DNA suite of tools, just DSSR is fine. Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it. 68 +- Infernal, to download at [Eddylab](http://eddylab.org/infernal/), several options are available depending on your preferences. Make sure to have the `cmalign`, `esl-alimanip` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them. For DSSR, you don't need the whole X3DNA suite of tools, just DSSR is fine. Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
69 - SINA, follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH. 69 - SINA, follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH.
70 - Python >= 3.8, (Unfortunately, python3.6 is no longer supported, because of changes in the multiprocessing and Threading packages. Untested with Python 3.7.\*) 70 - Python >= 3.8, (Unfortunately, python3.6 is no longer supported, because of changes in the multiprocessing and Threading packages. Untested with Python 3.7.\*)
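71 - The following Python packages: `python3.8 -m pip install numpy matplotlib pandas biopython psutil pymysql requests sqlalchemy tqdm` (`sqlite3` is part of the Python standard library and needs no separate install) 71 - The following Python packages: `python3.8 -m pip install numpy matplotlib pandas biopython psutil pymysql requests sqlalchemy tqdm` (`sqlite3` is part of the Python standard library and needs no separate install)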
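
Since RNANet.py looks for the external tools in `$PATH`, a quick check can save a failed run. This is a minimal sketch using only the standard library; the list simply repeats the binary names mentioned above, and `missing_binaries` is a hypothetical helper, not part of RNANet.

```python
import shutil

# Binaries that the dependency list above asks to have in $PATH.
REQUIRED_BINARIES = ["x3dna-dssr", "cmalign", "esl-alimanip", "esl-reformat", "sina"]


def missing_binaries(names=REQUIRED_BINARIES):
    """Return the names from `names` that cannot be found in $PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_binaries()
if missing:
    print("Not found in $PATH:", ", ".join(missing))
else:
    print("All required binaries were found.")
```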
...@@ -76,36 +76,42 @@ It requires solid hardware to run. It takes around 15 hours the first time, and ...@@ -76,36 +76,42 @@ It requires solid hardware to run. It takes around 15 hours the first time, and
76 The detailed list of options is below: 76 The detailed list of options is below:
77 77
78 ``` 78 ```
79 --h [ --help ] Print this help message 79 +-h [ --help ] Print this help message
80 ---version Print the program version 80 +--version Print the program version
81 - 81 +
82 --r 4.0 [ --resolution=4.0 ] Maximum 3D structure resolution to consider a RNA chain. 82 +-r 4.0 [ --resolution=4.0 ] Maximum 3D structure resolution to consider a RNA chain.
83 --s Run statistics computations after completion 83 +-s Run statistics computations after completion
84 ---extract Extract the portions of 3D RNA chains to individual mmCIF files. 84 +--extract Extract the portions of 3D RNA chains to individual mmCIF files.
85 ---keep-hetatm=False (True | False) Keep ions, waters and ligands in produced mmCIF files. 85 +--keep-hetatm=False (True | False) Keep ions, waters and ligands in produced mmCIF files.
86 - Does not affect the descriptors. 86 + Does not affect the descriptors.
87 ---fill-gaps=True (True | False) Replace gaps in nt_align_code field due to unresolved residues 87 +--fill-gaps=True (True | False) Replace gaps in nt_align_code field due to unresolved residues
88 - by the most common nucleotide at this position in the alignment. 88 + by the most common nucleotide at this position in the alignment.
89 ---3d-folder=… Path to a folder to store the 3D data files. Subfolders will contain: 89 +--3d-folder=… Path to a folder to store the 3D data files. Subfolders will contain:
90 - RNAcifs/ Full structures containing RNA, in mmCIF format 90 + RNAcifs/ Full structures containing RNA, in mmCIF format
91 - rna_mapped_to_Rfam/ or rnaonly/ Extracted 'pure' RNA chains 91 + rna_mapped_to_Rfam/ Extracted 'pure' RNA chains
92 - datapoints/ Final results in CSV file format. 92 + datapoints/ Final results in CSV file format.
93 ---seq-folder=… Path to a folder to store the sequence and alignment files. 93 +--seq-folder=… Path to a folder to store the sequence and alignment files.
94 - rfam_sequences/fasta/ Compressed hits to Rfam families 94 + rfam_sequences/fasta/ Compressed hits to Rfam families
95 - realigned/ Sequences, covariance models, and alignments by family 95 + realigned/ Sequences, covariance models, and alignments by family
96 ---no-homology Do not try to compute PSSMs and do not align sequences. 96 +--no-homology Do not try to compute PSSMs and do not align sequences.
97 - Allows to yield more 3D data (consider chains without a Rfam mapping). 97 + Allows to yield more 3D data (consider chains without a Rfam mapping).
98 - 98 +
99 ---ignore-issues Do not ignore already known issues and attempt to compute them 99 +--all Build chains even if they already are in the database.
100 ---update-homologous Re-download Rfam sequences and SILVA arb databases, and realign all families 100 +--only Ask to process a specific chain label only
101 ---from-scratch Delete database, local 3D and sequence files, and known issues, and recompute. 101 +--ignore-issues Do not ignore already known issues and attempt to compute them
102 +--update-homologous Re-download Rfam and SILVA databases, realign all families, and recompute all CSV files
103 +--from-scratch Delete database, local 3D and sequence files, and known issues, and recompute.
104 +--archive Create a tar.gz archive of the datapoints text files, and update the link to the latest archive
105 +```
106 +
107 +Typical usage:
108 +```
109 +nohup bash -c 'time ~/Projects/RNANet/RNANet.py --3d-folder ~/Data/RNA/3D/ --seq-folder ~/Data/RNA/sequences -s --archive' &
102 ``` 110 ```
103 111
104 ## Post-computation task: estimate quality 112 ## Post-computation task: estimate quality
105 The file statistics.py is meant to summarize the produced dataset (see the results/ folder). It runs automatically after RNANet if you pass the `-s` option. 113 The file statistics.py is meant to summarize the produced dataset (see the results/ folder). It runs automatically after RNANet if you pass the `-s` option.
106 114
107 -
108 -
109 # How to further filter the dataset 115 # How to further filter the dataset
110 You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package. 116 You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package.
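
Before writing a request, it can help to see which tables and columns the database actually contains. The sketch below is an illustration, not part of RNANet: `list_tables` and `list_columns` are hypothetical helpers, and the default path is the `results/RNANet.db` file mentioned above. It relies only on SQLite's own `sqlite_master` table and `PRAGMA table_info`.

```python
import sqlite3


def list_tables(db_path="results/RNANet.db"):
    """Return the names of the tables stored in the SQLite file at `db_path`."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


def list_columns(db_path, table):
    """Return the column names of `table`, in declaration order."""
    connection = sqlite3.connect(db_path)
    try:
        # PRAGMA arguments cannot be bound as parameters, so the name is quoted inline.
        rows = connection.execute(f'PRAGMA table_info("{table}");').fetchall()
    finally:
        connection.close()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in rows]
```

Note that `sqlite3.connect` creates an empty file if the path does not exist, so an empty list of tables usually means the path is wrong.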
111 117
...@@ -133,7 +139,7 @@ with sqlite3.connect("results/RNANet.db") as connection: ...@@ -133,7 +139,7 @@ with sqlite3.connect("results/RNANet.db") as connection:
133 Step 2 : Then, we define a template string, containing the SQL request we use to get all information of one RNA chain, with brackets { } at the place we will insert every chain_id. 139 Step 2 : Then, we define a template string, containing the SQL request we use to get all information of one RNA chain, with brackets { } at the place we will insert every chain_id.
134 You can remove fields you are not interested in. 140 You can remove fields you are not interested in.
135 ``` 141 ```
136 -req = """SELECT index_chain, nt_resnum, position, nt_name, nt_code, nt_align_code, is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, dbn, paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta, chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base, 142 +req = """SELECT index_chain, old_nt_resnum, position, nt_name, nt_code, nt_align_code, is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, dbn, paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta, chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base,
137 v0, v1, v2, v3, v4, amlitude, phase_angle, puckering 143 v0, v1, v2, v3, v4, amlitude, phase_angle, puckering
138 FROM 144 FROM
139 (SELECT chain_id, rfam_acc from chain WHERE chain_id = {}) 145 (SELECT chain_id, rfam_acc from chain WHERE chain_id = {})
...@@ -223,7 +229,7 @@ To help you design your own requests, here follows a description of the database ...@@ -223,7 +229,7 @@ To help you design your own requests, here follows a description of the database
223 * `chain_id`: The chain the nucleotide belongs to 229 * `chain_id`: The chain the nucleotide belongs to
224 * `index_chain`: its absolute position within the portion of chain mapped to Rfam, from 1 to X. This is completely uncorrelated to any gene start or 3D chain residue numbers. 230 * `index_chain`: its absolute position within the portion of chain mapped to Rfam, from 1 to X. This is completely uncorrelated to any gene start or 3D chain residue numbers.
225 * `nt_position`: relative position within the portion of chain mapped to RFam, from 0 to 1 231 * `nt_position`: relative position within the portion of chain mapped to RFam, from 0 to 1
226 -* `nt_resnum`: The residue number in the 3D mmCIF file 232 +* `old_nt_resnum`: The residue number in the 3D mmCIF file (it's a string actually, some contain a letter like '37A')
227 * `nt_name`: The residue type. This includes modified nucleotide names (e.g. 5MC for 5-methylcytosine) 233 * `nt_name`: The residue type. This includes modified nucleotide names (e.g. 5MC for 5-methylcytosine)
228 * `nt_code`: One-letter name. Lowercase "acgu" letters are used for modified "ACGU" bases. 234 * `nt_code`: One-letter name. Lowercase "acgu" letters are used for modified "ACGU" bases.
229 * `nt_align_code`: One-letter name used for sequence alignment. Initially contains only "ACGUN-" characters; gaps may then be replaced by the most common letter at this position (default) 235 * `nt_align_code`: One-letter name used for sequence alignment. Initially contains only "ACGUN-" characters; gaps may then be replaced by the most common letter at this position (default)
......
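
Because `old_nt_resnum` is stored as a string (values like '37A' occur), sorting or comparing it directly gives text order rather than residue order. The following is a minimal sketch of one way to handle it, assuming every value is an optional minus sign, digits, and at most one trailing letter; that format is an assumption based on the '37A' example, and `split_resnum` is a hypothetical helper.

```python
import re

# Assumed format: optional '-', digits, then at most one letter (e.g. '37A').
_RESNUM = re.compile(r"(-?\d+)([A-Za-z]?)")


def split_resnum(old_nt_resnum):
    """Split a residue label such as '37A' into (37, 'A'); '37' gives (37, '')."""
    match = _RESNUM.fullmatch(str(old_nt_resnum).strip())
    if match is None:
        raise ValueError(f"Unexpected residue number format: {old_nt_resnum!r}")
    return int(match.group(1)), match.group(2)


# Sorting with the helper as key puts '37A' right after '37' and before '100'.
labels = ["37A", "4", "37", "100"]
print(sorted(labels, key=split_resnum))
```

Values that do not fit the assumed pattern raise a `ValueError` instead of being silently misread, which makes unexpected labels easy to spot.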