Showing 4 changed files with 128 additions and 50 deletions
... | @@ -49,6 +49,19 @@ Other folders are created and not deleted, which you might want to conserve to a | ... | @@ -49,6 +49,19 @@ Other folders are created and not deleted, which you might want to conserve to a |
49 | * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. You may find additional information into them which is not properly supported by RNANet yet. | 49 | * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. You may find additional information into them which is not properly supported by RNANet yet. |
50 | 50 | ||
51 | # How to run | 51 | # How to run |
52 | + | ||
53 | +## Required computational resources | ||
54 | +- CPU: no particular requirements, but the program is optimized for multi-core CPUs (e.g. Intel Xeon, AMD Ryzen). | ||
55 | +- GPU: not required | ||
56 | +- RAM: 16 GB with a large swap partition is okay. 32 GB is recommended (usage peaks at ~27 GB) | ||
57 | +- Storage: to date, it takes 60 GB for the 3D data (36 GB if you don't use the --extract option), 11 GB for the sequence data, and 7 GB for the outputs (5.6 GB database, 1 GB archive of CSV files). Add a few GB for the dependencies; a 100 GB partition is enough. Computation is much faster on a fast storage device (e.g. an SSD instead of a hard drive, or even better, an NVMe SSD) because of constant I/O with the SQLite database. | ||
58 | +- Network: we query the Rfam public MySQL server on port 4497. Make sure your network allows outgoing connections on that port (there should be no issue on private networks, but your company or university may block ports by default). You will get an error message if the port is not reachable. Around 30 GB of data is downloaded. | ||
59 | + | ||
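To check in advance that the Rfam port is reachable from your network, a quick TCP probe can be run. This is a minimal sketch: the hostname `mysql-rfam-public.ebi.ac.uk` is the Rfam-documented public endpoint, but verify it against the current Rfam documentation before relying on it.

```python
import socket

def rfam_port_open(host="mysql-rfam-public.ebi.ac.uk", port=4497, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False, ask your network administrator to open outgoing connections on port 4497 before launching a full run.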
60 | +To give you an estimate, our last full run took exactly 12h, excluding the time to download the mmCIF files containing RNA (around 25 GB) and the time to compute statistics. | ||
61 | +Measured on June 23rd, 2020 on an AMD Ryzen 7 3700X CPU (16 threads) @ 3.60 GHz, with 32 GB RAM and a 7200 rpm hard drive. Total CPU time spent: 135 hours (user+kernel modes), corresponding to 12h of wall-clock time on the 16-thread CPU. | ||
62 | + | ||
63 | +Update runs are much quicker, around 3 hours. The duration depends mostly on which RNA families are affected by the update. | ||
64 | + | ||
52 | ## Dependencies | 65 | ## Dependencies |
53 | You need to install: | 66 | You need to install: |
54 | - DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). | 67 | - DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). |
... | @@ -91,6 +104,8 @@ The detailed list of options is below: | ... | @@ -91,6 +104,8 @@ The detailed list of options is below: |
91 | ## Post-computation task: estimate quality | 104 | ## Post-computation task: estimate quality |
92 | The file statistics.py is supposed to give a summary on the produced dataset. See the results/ folder. It can be run automatically after RNANet if you pass the `-s` option. | 105 | The file statistics.py is supposed to give a summary on the produced dataset. See the results/ folder. It can be run automatically after RNANet if you pass the `-s` option. |
93 | 106 | ||
107 | + | ||
108 | + | ||
94 | # How to further filter the dataset | 109 | # How to further filter the dataset |
95 | You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package. | 110 | You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package. |
96 | 111 | ||
... | @@ -108,6 +123,7 @@ Step 1 : We first get a list of chains that are below our favorite resolution th | ... | @@ -108,6 +123,7 @@ Step 1 : We first get a list of chains that are below our favorite resolution th |
108 | with sqlite3.connect("results/RNANet.db") as connection: | 123 | with sqlite3.connect("results/RNANet.db") as connection: |
109 | chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name | 124 | chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name |
110 | FROM chain JOIN structure | 125 | FROM chain JOIN structure |
126 | + ON chain.structure_id = structure.pdb_id | ||
111 | WHERE resolution < 4.0 | 127 | WHERE resolution < 4.0 |
112 | ORDER BY structure_id ASC;""", | 128 | ORDER BY structure_id ASC;""", |
113 | con=connection) | 129 | con=connection) |
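The `ON` clause added in this hunk matters: in SQLite, a `JOIN` with no constraint degenerates into a cross product, silently pairing every chain with every structure. A toy check of this behavior, using made-up rows but the same column names as the query above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE structure (pdb_id TEXT, resolution REAL);
    CREATE TABLE chain (chain_id INTEGER, structure_id TEXT, chain_name TEXT);
    INSERT INTO structure VALUES ('1ffk', 2.4), ('4v9q', 3.7);
    INSERT INTO chain VALUES (1, '1ffk', '0'), (2, '4v9q', 'DV'), (3, '4v9q', 'DW');
""")

# Without ON: every chain is paired with every structure (3 x 2 = 6 rows).
cross = conn.execute("SELECT COUNT(*) FROM chain JOIN structure").fetchone()[0]

# With ON: each chain is matched to its own structure only (3 rows).
matched = conn.execute(
    "SELECT COUNT(*) FROM chain JOIN structure ON chain.structure_id = structure.pdb_id"
).fetchone()[0]
print(cross, matched)  # 6 3
```

Without the constraint, the `resolution < 4.0` filter would keep a chain as soon as *any* structure in the database had a good enough resolution, which is not the intended query.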
... | @@ -146,6 +162,7 @@ We will simply modify the Step 1 above: | ... | @@ -146,6 +162,7 @@ We will simply modify the Step 1 above: |
146 | with sqlite3.connect("results/RNANet.db") as connection: | 162 | with sqlite3.connect("results/RNANet.db") as connection: |
147 | chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name | 163 | chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name |
148 | FROM chain JOIN structure | 164 | FROM chain JOIN structure |
165 | + ON chain.structure_id = structure.pdb_id | ||
149 | WHERE date < "2018-06-01" | 166 | WHERE date < "2018-06-01" |
150 | ORDER BY structure_id ASC;""", | 167 | ORDER BY structure_id ASC;""", |
151 | con=connection) | 168 | con=connection) |
... | @@ -160,6 +177,7 @@ If you want just one example of each RNA 3D chain, use in Step 1: | ... | @@ -160,6 +177,7 @@ If you want just one example of each RNA 3D chain, use in Step 1: |
160 | with sqlite3.connect("results/RNANet.db") as connection: | 177 | with sqlite3.connect("results/RNANet.db") as connection: |
161 | chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name | 178 | chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name |
162 | FROM chain JOIN structure | 179 | FROM chain JOIN structure |
180 | + ON chain.structure_id = structure.pdb_id | ||
163 | ORDER BY structure_id ASC;""", | 181 | ORDER BY structure_id ASC;""", |
164 | con=connection) | 182 | con=connection) |
165 | ``` | 183 | ``` |
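Chains that failed a pipeline step are flagged in the database rather than deleted: as shown further down in this diff, the pipeline runs `UPDATE chain SET issue = 1` for problematic chains. Assuming that schema, a useful variant of Step 1 is to keep only clean chains. This sketch uses a toy in-memory table with made-up rows so it can be run standalone:

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # stand-in for results/RNANet.db
connection.executescript("""
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, structure_id TEXT, chain_name TEXT, issue INTEGER);
    INSERT INTO chain VALUES (1, '1ffk', '0', 0), (2, '4v9q', 'DV', 1);
""")

# Keep only chains that were computed without errors (issue = 0)
chain_list = connection.execute(
    "SELECT chain_id, structure_id, chain_name FROM chain WHERE issue = 0;"
).fetchall()
print(chain_list)  # [(1, '1ffk', '0')]
```

The same `WHERE issue = 0` condition can be combined with the resolution or date filters shown above.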
... | @@ -6,14 +6,15 @@ from Bio import AlignIO, SeqIO | ... | @@ -6,14 +6,15 @@ from Bio import AlignIO, SeqIO |
6 | from Bio.PDB import MMCIFParser | 6 | from Bio.PDB import MMCIFParser |
7 | from Bio.PDB.mmcifio import MMCIFIO | 7 | from Bio.PDB.mmcifio import MMCIFIO |
8 | from Bio.PDB.MMCIF2Dict import MMCIF2Dict | 8 | from Bio.PDB.MMCIF2Dict import MMCIF2Dict |
9 | -from Bio.PDB.PDBExceptions import PDBConstructionWarning | 9 | +from Bio.PDB.PDBExceptions import PDBConstructionWarning, BiopythonWarning |
10 | +from Bio.PDB.Dice import ChainSelector | ||
10 | from Bio._py3k import urlretrieve as _urlretrieve | 11 | from Bio._py3k import urlretrieve as _urlretrieve |
11 | from Bio._py3k import urlcleanup as _urlcleanup | 12 | from Bio._py3k import urlcleanup as _urlcleanup |
12 | from Bio.Alphabet import generic_rna | 13 | from Bio.Alphabet import generic_rna |
13 | from Bio.Seq import Seq | 14 | from Bio.Seq import Seq |
14 | from Bio.SeqRecord import SeqRecord | 15 | from Bio.SeqRecord import SeqRecord |
15 | from Bio.Align import MultipleSeqAlignment, AlignInfo | 16 | from Bio.Align import MultipleSeqAlignment, AlignInfo |
16 | -from collections import OrderedDict | 17 | +from collections import OrderedDict, defaultdict |
17 | from functools import partial, wraps | 18 | from functools import partial, wraps |
18 | from os import path, makedirs | 19 | from os import path, makedirs |
19 | from multiprocessing import Pool, Manager, set_start_method | 20 | from multiprocessing import Pool, Manager, set_start_method |
... | @@ -40,6 +41,7 @@ errsymb = '\U0000274C' | ... | @@ -40,6 +41,7 @@ errsymb = '\U0000274C' |
40 | 41 | ||
41 | LSU_set = {"RF00002", "RF02540", "RF02541", "RF02543", "RF02546"} # From Rfam CLAN 00112 | 42 | LSU_set = {"RF00002", "RF02540", "RF02541", "RF02543", "RF02546"} # From Rfam CLAN 00112 |
42 | SSU_set = {"RF00177", "RF02542", "RF02545", "RF01959", "RF01960"} # From Rfam CLAN 00111 | 43 | SSU_set = {"RF00177", "RF02542", "RF02545", "RF01959", "RF01960"} # From Rfam CLAN 00111 |
44 | +no_nts_set = set() | ||
43 | 45 | ||
44 | class NtPortionSelector(object): | 46 | class NtPortionSelector(object): |
45 | """Class passed to MMCIFIO to select some chain portions in an MMCIF file. | 47 | """Class passed to MMCIFIO to select some chain portions in an MMCIF file. |
... | @@ -170,29 +172,40 @@ class Chain: | ... | @@ -170,29 +172,40 @@ class Chain: |
170 | with warnings.catch_warnings(): | 172 | with warnings.catch_warnings(): |
171 | # Ignore the PDB problems. This mostly warns that some chain is discontinuous. | 173 | # Ignore the PDB problems. This mostly warns that some chain is discontinuous. |
172 | warnings.simplefilter('ignore', PDBConstructionWarning) | 174 | warnings.simplefilter('ignore', PDBConstructionWarning) |
175 | + warnings.simplefilter('ignore', BiopythonWarning) | ||
173 | 176 | ||
174 | # Load the whole mmCIF into a Biopython structure object: | 177 | # Load the whole mmCIF into a Biopython structure object: |
175 | mmcif_parser = MMCIFParser() | 178 | mmcif_parser = MMCIFParser() |
176 | s = mmcif_parser.get_structure(self.pdb_id, path_to_3D_data + "RNAcifs/"+self.pdb_id+".cif") | 179 | s = mmcif_parser.get_structure(self.pdb_id, path_to_3D_data + "RNAcifs/"+self.pdb_id+".cif") |
180 | + | ||
177 | 181 | ||
178 | # Extract the desired chain | 182 | # Extract the desired chain |
179 | c = s[model_idx][self.pdb_chain_id] | 183 | c = s[model_idx][self.pdb_chain_id] |
180 | 184 | ||
181 | if (self.pdb_end - self.pdb_start): | 185 | if (self.pdb_end - self.pdb_start): |
182 | - # Pay attention to residue numbering | 186 | + # # Pay attention to residue numbering |
183 | - first_number = c.child_list[0].get_id()[1] # the chain's first residue is numbered 'first_number' | 187 | + # first_number = c.child_list[0].get_id()[1] # the chain's first residue is numbered 'first_number' |
188 | + # if self.pdb_start < self.pdb_end: | ||
189 | + # start = self.pdb_start + first_number - 1 # shift our start_position by 'first_number' | ||
190 | + # end = self.pdb_end + first_number - 1 # same for the end position | ||
191 | + # else: | ||
192 | + # self.reversed = True # the 3D chain is numbered backwards compared to the Rfam family | ||
193 | + # end = self.pdb_start + first_number - 1 | ||
194 | + # start = self.pdb_end + first_number - 1 | ||
195 | + | ||
184 | if self.pdb_start < self.pdb_end: | 196 | if self.pdb_start < self.pdb_end: |
185 | - start = self.pdb_start + first_number - 1 # shift our start_position by 'first_number' | 197 | + start = self.pdb_start # use the Rfam mapping coordinates directly |
186 | - end = self.pdb_end + first_number - 1 # same for the end position | 198 | + end = self.pdb_end # same for the end position |
187 | else: | 199 | else: |
188 | - self.reversed = True # the 3D chain is numbered backwards compared to the Rfam family | 200 | + self.reversed = True # the 3D chain is numbered backwards compared to the Rfam family |
189 | - end = self.pdb_start + first_number - 1 | 201 | + end = self.pdb_start |
190 | - start = self.pdb_end + first_number - 1 | 202 | + start = self.pdb_end |
191 | else: | 203 | else: |
192 | start = c.child_list[0].get_id()[1] # the chain's first residue is numbered 'first_number' | 204 | start = c.child_list[0].get_id()[1] # the chain's first residue is numbered 'first_number' |
193 | end = c.child_list[-1].get_id()[1] # the chain's last residue number | 205 | end = c.child_list[-1].get_id()[1] # the chain's last residue number |
194 | 206 | ||
195 | # Define a selection | 207 | # Define a selection |
208 | + # sel = ChainSelector(self.pdb_chain_id, start, end, model_id = model_idx) | ||
196 | sel = NtPortionSelector(model_idx, self.pdb_chain_id, start, end, khetatm) | 209 | sel = NtPortionSelector(model_idx, self.pdb_chain_id, start, end, khetatm) |
197 | 210 | ||
198 | # Save that selection on the mmCIF object s to file | 211 | # Save that selection on the mmCIF object s to file |
... | @@ -205,7 +218,9 @@ class Chain: | ... | @@ -205,7 +218,9 @@ class Chain: |
205 | def extract_3D_data(self): | 218 | def extract_3D_data(self): |
206 | """ Maps DSSR annotations to the chain. """ | 219 | """ Maps DSSR annotations to the chain. """ |
207 | 220 | ||
221 | + ############################################ | ||
208 | # Load the mmCIF annotations from file | 222 | # Load the mmCIF annotations from file |
223 | + ############################################ | ||
209 | try: | 224 | try: |
210 | with open(path_to_3D_data + "annotations/" + self.pdb_id + ".json", 'r') as json_file: | 225 | with open(path_to_3D_data + "annotations/" + self.pdb_id + ".json", 'r') as json_file: |
211 | json_object = json.load(json_file) | 226 | json_object = json.load(json_file) |
... | @@ -219,39 +234,53 @@ class Chain: | ... | @@ -219,39 +234,53 @@ class Chain: |
219 | # Print eventual warnings given by DSSR, and abort if there are some | 234 | # Print eventual warnings given by DSSR, and abort if there are some |
220 | if "warning" in json_object.keys(): | 235 | if "warning" in json_object.keys(): |
221 | warn(f"found DSSR warning in annotation {self.pdb_id}.json: {json_object['warning']}. Ignoring {self.chain_label}.") | 236 | warn(f"found DSSR warning in annotation {self.pdb_id}.json: {json_object['warning']}. Ignoring {self.chain_label}.") |
237 | + if "no nucleotides" in json_object['warning']: | ||
238 | + no_nts_set.add(self.pdb_id) | ||
222 | self.delete_me = True | 239 | self.delete_me = True |
223 | self.error_messages = f"DSSR warning {self.pdb_id}.json: {json_object['warning']}. Ignoring {self.chain_label}." | 240 | self.error_messages = f"DSSR warning {self.pdb_id}.json: {json_object['warning']}. Ignoring {self.chain_label}." |
224 | return 1 | 241 | return 1 |
225 | 242 | ||
243 | + ############################################ | ||
244 | + # Create the data-frame | ||
245 | + ############################################ | ||
226 | try: | 246 | try: |
227 | - # Prepare a data structure (Pandas DataFrame) for the nucleotides | 247 | + # Create the Pandas DataFrame for the nucleotides of the right chain |
228 | nts = json_object["nts"] # sub-json-object | 248 | nts = json_object["nts"] # sub-json-object |
229 | df = pd.DataFrame(nts) # conversion to dataframe | 249 | df = pd.DataFrame(nts) # conversion to dataframe |
230 | df = df[ df.chain_name == self.pdb_chain_id ] # keeping only this chain's nucleotides | 250 | df = df[ df.chain_name == self.pdb_chain_id ] # keeping only this chain's nucleotides |
251 | + | ||
252 | + # Assert nucleotides of the chain are found | ||
231 | if df.empty: | 253 | if df.empty: |
232 | warn(f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. Ignoring chain {self.chain_label}.", error=True) | 254 | warn(f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. Ignoring chain {self.chain_label}.", error=True) |
255 | + no_nts_set.add(self.pdb_id) | ||
233 | self.delete_me = True | 256 | self.delete_me = True |
234 | self.error_messages = f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. We expect a problem with {self.pdb_id} mmCIF download. Delete it and retry." | 257 | self.error_messages = f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. We expect a problem with {self.pdb_id} mmCIF download. Delete it and retry." |
235 | return 1 | 258 | return 1 |
236 | 259 | ||
237 | - # remove low pertinence or undocumented descriptors | 260 | + # Remove low pertinence or undocumented descriptors, convert angles values |
238 | cols_we_keep = ["index_chain", "nt_resnum", "nt_name", "nt_code", "nt_id", "dbn", "alpha", "beta", "gamma", "delta", "epsilon", "zeta", | 261 | cols_we_keep = ["index_chain", "nt_resnum", "nt_name", "nt_code", "nt_id", "dbn", "alpha", "beta", "gamma", "delta", "epsilon", "zeta", |
239 | "epsilon_zeta", "bb_type", "chi", "glyco_bond", "form", "ssZp", "Dp", "eta", "theta", "eta_prime", "theta_prime", "eta_base", "theta_base", | 262 | "epsilon_zeta", "bb_type", "chi", "glyco_bond", "form", "ssZp", "Dp", "eta", "theta", "eta_prime", "theta_prime", "eta_base", "theta_base", |
240 | "v0", "v1", "v2", "v3", "v4", "amplitude", "phase_angle", "puckering" ] | 263 | "v0", "v1", "v2", "v3", "v4", "amplitude", "phase_angle", "puckering" ] |
241 | df = df[cols_we_keep] | 264 | df = df[cols_we_keep] |
242 | - | 265 | + df.loc[:,['alpha', 'beta','gamma','delta','epsilon','zeta','epsilon_zeta','chi','v0', 'v1', 'v2', 'v3', 'v4', # Conversion to radians |
243 | - # Convert angles to radians | ||
244 | - df.loc[:,['alpha', 'beta','gamma','delta','epsilon','zeta','epsilon_zeta','chi','v0', 'v1', 'v2', 'v3', 'v4', | ||
245 | 'eta','theta','eta_prime','theta_prime','eta_base','theta_base', 'phase_angle']] *= np.pi/180.0 | 266 | 'eta','theta','eta_prime','theta_prime','eta_base','theta_base', 'phase_angle']] *= np.pi/180.0 |
246 | - # mapping [-pi, pi] into [0, 2pi] | 267 | + df.loc[:,['alpha', 'beta','gamma','delta','epsilon','zeta','epsilon_zeta','chi','v0', 'v1', 'v2', 'v3', 'v4', # mapping [-pi, pi] into [0, 2pi] |
247 | - df.loc[:,['alpha', 'beta','gamma','delta','epsilon','zeta','epsilon_zeta','chi','v0', 'v1', 'v2', 'v3', 'v4', | ||
248 | 'eta','theta','eta_prime','theta_prime','eta_base','theta_base', 'phase_angle']] %= (2.0*np.pi) | 268 | 'eta','theta','eta_prime','theta_prime','eta_base','theta_base', 'phase_angle']] %= (2.0*np.pi) |
269 | + | ||
249 | except KeyError as e: | 270 | except KeyError as e: |
250 | warn(f"Error while parsing DSSR {self.pdb_id}.json output:{e}", error=True) | 271 | warn(f"Error while parsing DSSR {self.pdb_id}.json output:{e}", error=True) |
251 | self.delete_me = True | 272 | self.delete_me = True |
252 | self.error_messages = f"Error while parsing DSSR's json output:\n{e}" | 273 | self.error_messages = f"Error while parsing DSSR's json output:\n{e}" |
253 | return 1 | 274 | return 1 |
254 | 275 | ||
276 | + # Remove nucleotides of the chain that are outside the Rfam mapping, if any | ||
277 | + if self.pdb_start and self.pdb_end: | ||
278 | + df = df.drop(df[(df.nt_resnum < self.pdb_start) | (df.nt_resnum > self.pdb_end)].index) | ||
279 | + | ||
280 | + ############################################# | ||
281 | + # Solve some common issues and drop ligands | ||
282 | + ############################################# | ||
283 | + | ||
255 | # Shift numbering when duplicate residue numbers are found. | 284 | # Shift numbering when duplicate residue numbers are found. |
256 | # Example: 4v9q-DV contains 17 and 17A which are both read 17 by DSSR. | 285 | # Example: 4v9q-DV contains 17 and 17A which are both read 17 by DSSR. |
257 | while True in df.duplicated(['nt_resnum']).values: | 286 | while True in df.duplicated(['nt_resnum']).values: |
... | @@ -266,11 +295,12 @@ class Chain: | ... | @@ -266,11 +295,12 @@ class Chain: |
266 | or (len(df.index_chain) >= 2 and df.iloc[[-1]].nt_resnum.iloc[0] > 50 + df.iloc[[-2]].nt_resnum.iloc[0])): | 295 | or (len(df.index_chain) >= 2 and df.iloc[[-1]].nt_resnum.iloc[0] > 50 + df.iloc[[-2]].nt_resnum.iloc[0])): |
267 | df = df.head(-1) | 296 | df = df.head(-1) |
268 | 297 | ||
269 | - # Assert some nucleotides exist | 298 | + # Assert some nucleotides still exist |
270 | try: | 299 | try: |
271 | l = df.iloc[-1,1] - df.iloc[0,1] + 1 # length of chain from nt_resnum point of view | 300 | l = df.iloc[-1,1] - df.iloc[0,1] + 1 # length of chain from nt_resnum point of view |
272 | except IndexError: | 301 | except IndexError: |
273 | warn(f"Could not find real nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. Ignoring chain {self.chain_label}.", error=True) | 302 | warn(f"Could not find real nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. Ignoring chain {self.chain_label}.", error=True) |
303 | + no_nts_set.add(self.pdb_id) | ||
274 | self.delete_me = True | 304 | self.delete_me = True |
275 | self.error_messages = f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. We expect a problem with {self.pdb_id} mmCIF download. Delete it and retry." | 305 | self.error_messages = f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. We expect a problem with {self.pdb_id} mmCIF download. Delete it and retry." |
276 | return 1 | 306 | return 1 |
... | @@ -281,14 +311,15 @@ class Chain: | ... | @@ -281,14 +311,15 @@ class Chain: |
281 | df.iloc[:, 0] -= st | 311 | df.iloc[:, 0] -= st |
282 | df = df.drop(df[df.index_chain < 0].index) # drop eventual ones with index_chain < the first residue (usually, ligands) | 312 | df = df.drop(df[df.index_chain < 0].index) # drop eventual ones with index_chain < the first residue (usually, ligands) |
283 | 313 | ||
284 | - # Add a sequence column just for the alignments | 314 | + # Re-Assert some nucleotides still exist |
285 | - df['nt_align_code'] = [ str(x).upper() | 315 | + try: |
286 | - .replace('NAN', '-') # Unresolved nucleotides are gaps | 316 | + l = df.iloc[-1,1] - df.iloc[0,1] + 1 # length of chain from nt_resnum point of view |
287 | - .replace('?', '-') # Unidentified residues, let's delete them | 317 | + except IndexError: |
288 | - .replace('T', 'U') # 5MU are modified to t, which gives T | 318 | + warn(f"Could not find real nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. Ignoring chain {self.chain_label}.", error=True) |
289 | - .replace('P', 'U') # Pseudo-uridines, but it is not really right to change them to U, see DSSR paper, Fig 2 | 319 | + no_nts_set.add(self.pdb_id) |
290 | - for x in df['nt_code'] ] | 320 | + self.delete_me = True |
291 | - | 321 | + self.error_messages = f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. We expect a problem with {self.pdb_id} mmCIF download. Delete it and retry." |
322 | + return 1 | ||
292 | 323 | ||
293 | # Add eventual missing rows because of unsolved residues in the chain. | 324 | # Add eventual missing rows because of unsolved residues in the chain. |
294 | # Sometimes, the 3D structure is REALLY shorter than the family it's mapped to, | 325 | # Sometimes, the 3D structure is REALLY shorter than the family it's mapped to, |
... | @@ -304,7 +335,7 @@ class Chain: | ... | @@ -304,7 +335,7 @@ class Chain: |
304 | # portion solved in 3D 1 |--------------|79 85|------------| 156 | 335 | # portion solved in 3D 1 |--------------|79 85|------------| 156 |
305 | # Rfam mapping 3 |------------------------------------------ ... -------| 3353 (yes larger, 'cause it could be inferred) | 336 | # Rfam mapping 3 |------------------------------------------ ... -------| 3353 (yes larger, 'cause it could be inferred) |
306 | # nt resnum 3 |--------------------------------| 156 | 337 | # nt resnum 3 |--------------------------------| 156 |
307 | - # index_chain 1 |-------------|77 83|------------| 149 | 338 | + # index_chain 1 |-------------|77 83|------------| 149 (before correction) |
308 | # expected data point 1 |--------------------------------| 154 | 339 | # expected data point 1 |--------------------------------| 154 |
309 | # | 340 | # |
310 | 341 | ||
... | @@ -314,15 +345,26 @@ class Chain: | ... | @@ -314,15 +345,26 @@ class Chain: |
314 | for i in sorted(diff): | 345 | for i in sorted(diff): |
315 | # Add a row at position i | 346 | # Add a row at position i |
316 | df = pd.concat([ df.iloc[:i], | 347 | df = pd.concat([ df.iloc[:i], |
317 | - pd.DataFrame({"index_chain": i+1, "nt_resnum": i+resnum_start, | 348 | + pd.DataFrame({"index_chain": i+1, "nt_resnum": i+resnum_start, "nt_code":'-', "nt_name":'-'}, index=[i]), |
318 | - "nt_code":'-', "nt_name":'-', 'nt_align_code':'-'}, index=[i]), | 349 | + df.iloc[i:] ]) |
319 | - df.iloc[i:] | ||
320 | - ]) | ||
321 | # Increase the index_chain of all following lines | 350 | # Increase the index_chain of all following lines |
322 | df.iloc[i+1:, 0] += 1 | 351 | df.iloc[i+1:, 0] += 1 |
323 | df = df.reset_index(drop=True) | 352 | df = df.reset_index(drop=True) |
324 | self.full_length = len(df.index_chain) | 353 | self.full_length = len(df.index_chain) |
325 | 354 | ||
355 | + # Add a sequence column just for the alignments | ||
356 | + df['nt_align_code'] = [ str(x).upper() | ||
357 | + .replace('NAN', '-') # Unresolved nucleotides are gaps | ||
358 | + .replace('?', '-') # Unidentified residues, let's delete them | ||
359 | + .replace('T', 'U') # 5MU are modified to t, which gives T | ||
360 | + .replace('P', 'U') # Pseudo-uridines, but it is not really right to change them to U, see DSSR paper, Fig 2 | ||
361 | + for x in df['nt_code'] ] | ||
362 | + | ||
363 | + | ||
364 | + ####################################### | ||
365 | + # Compute new features | ||
366 | + ####################################### | ||
367 | + | ||
326 | # One-hot encoding sequence | 368 | # One-hot encoding sequence |
327 | df["is_A"] = [ 1 if x=="A" else 0 for x in df["nt_code"] ] | 369 | df["is_A"] = [ 1 if x=="A" else 0 for x in df["nt_code"] ] |
328 | df["is_C"] = [ 1 if x=="C" else 0 for x in df["nt_code"] ] | 370 | df["is_C"] = [ 1 if x=="C" else 0 for x in df["nt_code"] ] |
... | @@ -415,7 +457,13 @@ class Chain: | ... | @@ -415,7 +457,13 @@ class Chain: |
415 | newpairs.append(v) | 457 | newpairs.append(v) |
416 | df['paired'] = newpairs | 458 | df['paired'] = newpairs |
417 | 459 | ||
418 | - # Saving to database | 460 | + self.seq = "".join(df.nt_code) |
461 | + self.seq_to_align = "".join(df.nt_align_code) | ||
462 | + self.length = len([ x for x in self.seq_to_align if x != "-" ]) | ||
463 | + | ||
464 | + #################################### | ||
465 | + # Save everything to database | ||
466 | + #################################### | ||
419 | with sqlite3.connect(runDir+"/results/RNANet.db", timeout=10.0) as conn: | 467 | with sqlite3.connect(runDir+"/results/RNANet.db", timeout=10.0) as conn: |
420 | # Register the chain in table chain | 468 | # Register the chain in table chain |
421 | if self.pdb_start is not None: | 469 | if self.pdb_start is not None: |
... | @@ -444,11 +492,7 @@ class Chain: | ... | @@ -444,11 +492,7 @@ class Chain: |
444 | ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);""", | 492 | ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);""", |
445 | many=True, data=list(df.to_records(index=False)), warn_every=10) | 493 | many=True, data=list(df.to_records(index=False)), warn_every=10) |
446 | 494 | ||
447 | - # Now load data from the database | 495 | + |
448 | - self.seq = "".join(df.nt_code) | ||
449 | - self.seq_to_align = "".join(df.nt_align_code) | ||
450 | - self.length = len([ x for x in self.seq_to_align if x != "-" ]) | ||
451 | - | ||
452 | # Remove too short chains | 496 | # Remove too short chains |
453 | if self.length < 5: | 497 | if self.length < 5: |
454 | warn(f"{self.chain_label} sequence is too short, let's ignore it.\t", error=True) | 498 | warn(f"{self.chain_label} sequence is too short, let's ignore it.\t", error=True) |
... | @@ -811,7 +855,7 @@ class Pipeline: | ... | @@ -811,7 +855,7 @@ class Pipeline: |
811 | global path_to_seq_data | 855 | global path_to_seq_data |
812 | 856 | ||
813 | try: | 857 | try: |
814 | - opts, args = getopt.getopt( sys.argv[1:], "r:hs", | 858 | + opts, _ = getopt.getopt( sys.argv[1:], "r:hs", |
815 | [ "help", "resolution=", "keep-hetatm=", "from-scratch", | 859 | [ "help", "resolution=", "keep-hetatm=", "from-scratch", |
816 | "fill-gaps=", "3d-folder=", "seq-folder=", | 860 | "fill-gaps=", "3d-folder=", "seq-folder=", |
817 | "no-homology", "ignore-issues", "extract", | 861 | "no-homology", "ignore-issues", "extract", |
... | @@ -822,7 +866,7 @@ class Pipeline: | ... | @@ -822,7 +866,7 @@ class Pipeline: |
822 | 866 | ||
823 | for opt, arg in opts: | 867 | for opt, arg in opts: |
824 | 868 | ||
825 | - if opt in ["--from-scratch", "--update-mmcifs", "--update-homolgous"] and "tobedefinedbyoptions" in [path_to_3D_data, path_to_seq_data]: | 869 | + if opt in ["--from-scratch", "--update-mmcifs", "--update-homologous"] and "tobedefinedbyoptions" in [path_to_3D_data, path_to_seq_data]: |
826 | print("Please provide --3d-folder and --seq-folder first, so that we know what to delete and update.") | 870 | print("Please provide --3d-folder and --seq-folder first, so that we know what to delete and update.") |
827 | exit() | 871 | exit() |
828 | 872 | ||
... | @@ -890,7 +934,7 @@ class Pipeline: | ... | @@ -890,7 +934,7 @@ class Pipeline: |
890 | warn("Deleting previous database and recomputing from scratch.") | 934 | warn("Deleting previous database and recomputing from scratch.") |
891 | subprocess.run(["rm", "-rf", | 935 | subprocess.run(["rm", "-rf", |
892 | path_to_3D_data + "annotations", | 936 | path_to_3D_data + "annotations", |
893 | - path_to_3D_data + "RNAcifs", | 937 | + # path_to_3D_data + "RNAcifs", # DEBUG : keep the cifs ! |
894 | path_to_3D_data + "rna_mapped_to_Rfam", | 938 | path_to_3D_data + "rna_mapped_to_Rfam", |
895 | path_to_3D_data + "rnaonly", | 939 | path_to_3D_data + "rnaonly", |
896 | path_to_seq_data + "realigned", | 940 | path_to_seq_data + "realigned", |
... | @@ -899,9 +943,6 @@ class Pipeline: | ... | @@ -899,9 +943,6 @@ class Pipeline: |
899 | runDir + "/known_issues_reasons.txt", | 943 | runDir + "/known_issues_reasons.txt", |
900 | runDir + "/results/RNANet.db"]) | 944 | runDir + "/results/RNANet.db"]) |
901 | elif opt == "--update-homologous": | 945 | elif opt == "--update-homologous": |
902 | - if "tobedefinedbyoptions" == path_to_seq_data: | ||
903 | - warn("Please provide --seq-folder before --update-homologous in the list of options.", error=True) | ||
904 | - exit(1) | ||
905 | warn("Deleting previous sequence files and recomputing alignments.") | 946 | warn("Deleting previous sequence files and recomputing alignments.") |
906 | subprocess.run(["rm", "-rf", | 947 | subprocess.run(["rm", "-rf", |
907 | path_to_seq_data + "realigned", | 948 | path_to_seq_data + "realigned", |
... | @@ -942,8 +983,9 @@ class Pipeline: | ... | @@ -942,8 +983,9 @@ class Pipeline: |
942 | print("> Building list of structures...", flush=True) | 983 | print("> Building list of structures...", flush=True) |
943 | p = Pool(initializer=init_worker, initargs=(tqdm.get_lock(),), processes=ncores) | 984 | p = Pool(initializer=init_worker, initargs=(tqdm.get_lock(),), processes=ncores) |
944 | try: | 985 | try: |
986 | + | ||
945 | pbar = tqdm(full_structures_list, maxinterval=1.0, miniters=1, bar_format="{percentage:3.0f}%|{bar}|") | 987 | pbar = tqdm(full_structures_list, maxinterval=1.0, miniters=1, bar_format="{percentage:3.0f}%|{bar}|") |
946 | - for i, newchains in enumerate(p.imap_unordered(partial(work_infer_mappings, not self.REUSE_ALL, allmappings), full_structures_list)): | 988 | + for newchains in p.imap_unordered(partial(work_infer_mappings, not self.REUSE_ALL, allmappings), full_structures_list): |
947 | self.update += newchains | 989 | self.update += newchains |
948 | pbar.update(1) # Everytime the iteration finishes, update the global progress bar | 990 | pbar.update(1) # Everytime the iteration finishes, update the global progress bar |
949 | 991 | ||
... | @@ -1053,7 +1095,8 @@ class Pipeline: | ... | @@ -1053,7 +1095,8 @@ class Pipeline: |
1053 | warn(f"Adding {c[1].chain_label} to known issues.") | 1095 | warn(f"Adding {c[1].chain_label} to known issues.") |
1054 | ki.write(c[1].chain_label + '\n') | 1096 | ki.write(c[1].chain_label + '\n') |
1055 | kir.write(c[1].chain_label + '\n' + c[1].error_messages + '\n\n') | 1097 | kir.write(c[1].chain_label + '\n' + c[1].error_messages + '\n\n') |
1056 | - sql_execute(conn, f"UPDATE chain SET issue = 1 WHERE chain_id = ?;", data=(c[1].db_chain_id,)) | 1098 | + with sqlite3.connect(runDir+"/results/RNANet.db") as conn: |
1099 | + sql_execute(conn, "UPDATE chain SET issue = 1 WHERE chain_id = ?;", data=(c[1].db_chain_id,)) | ||
1057 | ki.close() | 1100 | ki.close() |
1058 | kir.close() | 1101 | kir.close() |
1059 | 1102 | ||
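The hunk above wraps the `UPDATE` in a `with sqlite3.connect(...)` block. A minimal sketch of what that context manager buys, using an in-memory database and a made-up minimal `chain` schema in place of `results/RNANet.db`: the transaction is committed when the block succeeds and rolled back when it raises (note the connection itself is not closed).

```python
import sqlite3

# In-memory stand-in for results/RNANet.db, with a minimal hypothetical schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, issue INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO chain (chain_id) VALUES (?)", [(1,), (2,)])

# As a context manager, the connection commits the transaction on success...
with conn:
    conn.execute("UPDATE chain SET issue = 1 WHERE chain_id = ?;", (1,))

# ...and rolls it back if the block raises (the connection stays open either way).
try:
    with conn:
        conn.execute("UPDATE chain SET issue = 1 WHERE chain_id = ?;", (2,))
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

rows = list(conn.execute("SELECT chain_id, issue FROM chain ORDER BY chain_id"))
print(rows)  # chain 1 committed, chain 2 rolled back
```

Opening a fresh connection inside the worker, as the diff does, also avoids sharing one SQLite connection across processes, which sqlite3 does not support.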
... | @@ -1194,7 +1237,7 @@ class Pipeline: | ... | @@ -1194,7 +1237,7 @@ class Pipeline: |
1194 | p = Pool(initializer=init_worker, initargs=(tqdm.get_lock(),), processes=3) | 1237 | p = Pool(initializer=init_worker, initargs=(tqdm.get_lock(),), processes=3) |
1195 | try: | 1238 | try: |
1196 | pbar = tqdm(total=len(self.loaded_chains), desc="Saving chains to CSV", position=0, leave=True) | 1239 | pbar = tqdm(total=len(self.loaded_chains), desc="Saving chains to CSV", position=0, leave=True) |
1197 | - for i, _ in enumerate(p.imap_unordered(work_save, self.loaded_chains)): | 1240 | + for _ in p.imap_unordered(work_save, self.loaded_chains): |
1198 | pbar.update(1) | 1241 | pbar.update(1) |
1199 | pbar.close() | 1242 | pbar.close() |
1200 | p.close() | 1243 | p.close() |
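The loops above consume `Pool.imap_unordered` purely to tick a progress bar, so the `enumerate` index is dead weight. A minimal sketch of the pattern with a toy `work` function (a thread pool is used here only so the snippet runs without an `if __name__ == "__main__"` guard; `multiprocessing.Pool` exposes the same API):

```python
from multiprocessing.pool import ThreadPool  # same interface as multiprocessing.Pool

def work(x):
    # Hypothetical stand-in for work_save / work_infer_mappings
    return x * x

results = []
with ThreadPool(processes=4) as p:
    # imap_unordered yields each result as soon as its task finishes,
    # in completion order -- convenient for updating a tqdm bar per item.
    for r in p.imap_unordered(work, range(10)):
        results.append(r)  # in the pipeline: pbar.update(1)

print(sorted(results))  # completion order is not guaranteed, so sort for display
```

Because arrival order is arbitrary, any per-item bookkeeping (like `self.update += newchains` above) must not depend on input order.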
... | @@ -2158,6 +2201,8 @@ if __name__ == "__main__": | ... | @@ -2158,6 +2201,8 @@ if __name__ == "__main__": |
2158 | pp.dl_and_annotate(retry=True, coeff_ncores=0.3) # | 2201 | pp.dl_and_annotate(retry=True, coeff_ncores=0.3) # |
2159 | pp.build_chains(retry=True, coeff_ncores=1.0) # Use half the cores to reduce required amount of memory | 2202 | pp.build_chains(retry=True, coeff_ncores=1.0) # Use half the cores to reduce required amount of memory |
2160 | print(f"> Loaded {len(pp.loaded_chains)} RNA chains ({len(pp.update) - len(pp.loaded_chains)} errors).") | 2203 | print(f"> Loaded {len(pp.loaded_chains)} RNA chains ({len(pp.update) - len(pp.loaded_chains)} errors).") |
2204 | + if len(no_nts_set): | ||
2205 | + print(f"Among errors, {len(no_nts_set)} structures seem to contain RNA chains without defined nucleotides:", no_nts_set, flush=True) | ||
2161 | pp.checkpoint_save_chains() | 2206 | pp.checkpoint_save_chains() |
2162 | 2207 | ||
2163 | if not pp.HOMOLOGY: | 2208 | if not pp.HOMOLOGY: |
... | @@ -2165,7 +2210,7 @@ if __name__ == "__main__": | ... | @@ -2165,7 +2210,7 @@ if __name__ == "__main__": |
2165 | for c in pp.loaded_chains: | 2210 | for c in pp.loaded_chains: |
2166 | work_save(c, homology=False) | 2211 | work_save(c, homology=False) |
2167 | print("Completed.") | 2212 | print("Completed.") |
2168 | - exit() | 2213 | + exit(0) |
2169 | 2214 | ||
2170 | # At this point, structure, chain and nucleotide tables of the database are up to date. | 2215 | # At this point, structure, chain and nucleotide tables of the database are up to date. |
2171 | # (Modulo some statistics computed by statistics.py) | 2216 | # (Modulo some statistics computed by statistics.py) | ... | ... |
... | @@ -168,6 +168,8 @@ def stats_len(): | ... | @@ -168,6 +168,8 @@ def stats_len(): |
168 | lengths = [] | 168 | lengths = [] |
169 | conn = sqlite3.connect("results/RNANet.db") | 169 | conn = sqlite3.connect("results/RNANet.db") |
170 | for i,f in enumerate(fam_list): | 170 | for i,f in enumerate(fam_list): |
171 | + | ||
172 | + # Define a color for that family in the plot | ||
171 | if f in LSU_set: | 173 | if f in LSU_set: |
172 | cols.append("red") # LSU | 174 | cols.append("red") # LSU |
173 | elif f in SSU_set: | 175 | elif f in SSU_set: |
... | @@ -178,11 +180,15 @@ def stats_len(): | ... | @@ -178,11 +180,15 @@ def stats_len(): |
178 | cols.append("orange") | 180 | cols.append("orange") |
179 | else: | 181 | else: |
180 | cols.append("grey") | 182 | cols.append("grey") |
183 | + | ||
184 | + # Get the lengths of chains | ||
181 | l = [ x[0] for x in sql_ask_database(conn, f"SELECT COUNT(index_chain) FROM (SELECT chain_id FROM chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY chain_id;") ] | 185 | l = [ x[0] for x in sql_ask_database(conn, f"SELECT COUNT(index_chain) FROM (SELECT chain_id FROM chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY chain_id;") ] |
182 | lengths.append(l) | 186 | lengths.append(l) |
187 | + | ||
183 | notify(f"[{i+1}/{len(fam_list)}] Computed {f} chain lengths") | 188 | notify(f"[{i+1}/{len(fam_list)}] Computed {f} chain lengths") |
184 | conn.close() | 189 | conn.close() |
185 | 190 | ||
191 | + # Plot the figure | ||
186 | fig = plt.figure(figsize=(10,3)) | 192 | fig = plt.figure(figsize=(10,3)) |
187 | ax = fig.gca() | 193 | ax = fig.gca() |
188 | ax.hist(lengths, bins=100, stacked=True, log=True, color=cols, label=fam_list) | 194 | ax.hist(lengths, bins=100, stacked=True, log=True, color=cols, label=fam_list) |
... | @@ -191,6 +197,8 @@ def stats_len(): | ... | @@ -191,6 +197,8 @@ def stats_len(): |
191 | ax.set_xlim(left=-150) | 197 | ax.set_xlim(left=-150) |
192 | ax.tick_params(axis='both', which='both', labelsize=8) | 198 | ax.tick_params(axis='both', which='both', labelsize=8) |
193 | fig.tight_layout() | 199 | fig.tight_layout() |
200 | + | ||
201 | + # Draw the legend | ||
194 | fig.subplots_adjust(right=0.78) | 202 | fig.subplots_adjust(right=0.78) |
195 | filtered_handles = [mpatches.Patch(color='red'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), | 203 | filtered_handles = [mpatches.Patch(color='red'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), |
196 | mpatches.Patch(color='blue'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), | 204 | mpatches.Patch(color='blue'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), |
... | @@ -204,6 +212,8 @@ def stats_len(): | ... | @@ -204,6 +212,8 @@ def stats_len(): |
204 | 'Other'] | 212 | 'Other'] |
205 | ax.legend(filtered_handles, filtered_labels, loc='right', | 213 | ax.legend(filtered_handles, filtered_labels, loc='right', |
206 | ncol=1, fontsize='small', bbox_to_anchor=(1.3, 0.5)) | 214 | ncol=1, fontsize='small', bbox_to_anchor=(1.3, 0.5)) |
215 | + | ||
216 | + # Save the figure | ||
207 | fig.savefig("results/figures/lengths.png") | 217 | fig.savefig("results/figures/lengths.png") |
208 | notify("Computed sequence length statistics and saved the figure.") | 218 | notify("Computed sequence length statistics and saved the figure.") |
209 | 219 | ||
... | @@ -224,10 +234,12 @@ def stats_freq(): | ... | @@ -224,10 +234,12 @@ def stats_freq(): |
224 | 234 | ||
225 | Outputs results/frequencies.csv | 235 | Outputs results/frequencies.csv |
226 | REQUIRES tables chain, nucleotide up to date.""" | 236 | REQUIRES tables chain, nucleotide up to date.""" |
237 | + # Initialize a Counter object for each family | ||
227 | freqs = {} | 238 | freqs = {} |
228 | for f in fam_list: | 239 | for f in fam_list: |
229 | freqs[f] = Counter() | 240 | freqs[f] = Counter() |
230 | 241 | ||
242 | + # List all nt_names occurring within an RNA family and store the counts in the Counter ||
231 | conn = sqlite3.connect("results/RNANet.db") | 243 | conn = sqlite3.connect("results/RNANet.db") |
232 | for i,f in enumerate(fam_list): | 244 | for i,f in enumerate(fam_list): |
233 | counts = dict(sql_ask_database(conn, f"SELECT nt_name, COUNT(nt_name) FROM (SELECT chain_id from chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY nt_name;")) | 245 | counts = dict(sql_ask_database(conn, f"SELECT nt_name, COUNT(nt_name) FROM (SELECT chain_id from chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY nt_name;")) |
... | @@ -235,6 +247,7 @@ def stats_freq(): | ... | @@ -235,6 +247,7 @@ def stats_freq(): |
235 | notify(f"[{i+1}/{len(fam_list)}] Computed {f} nucleotide frequencies.") | 247 | notify(f"[{i+1}/{len(fam_list)}] Computed {f} nucleotide frequencies.") |
236 | conn.close() | 248 | conn.close() |
237 | 249 | ||
250 | + # Create a pandas DataFrame, and save it to CSV. | ||
238 | df = pd.DataFrame() | 251 | df = pd.DataFrame() |
239 | for f in fam_list: | 252 | for f in fam_list: |
240 | tot = sum(freqs[f].values()) | 253 | tot = sum(freqs[f].values()) |
... | @@ -347,8 +360,8 @@ def stats_pairs(): | ... | @@ -347,8 +360,8 @@ def stats_pairs(): |
347 | fam_pbar = tqdm(total=len(fam_list), desc="Pair-types in families", position=0, leave=True) | 360 | fam_pbar = tqdm(total=len(fam_list), desc="Pair-types in families", position=0, leave=True) |
348 | results = [] | 361 | results = [] |
349 | allpairs = [] | 362 | allpairs = [] |
350 | - for i, _ in enumerate(p.imap_unordered(parallel_stats_pairs, fam_list)): | 363 | + for newp_famdf in p.imap_unordered(parallel_stats_pairs, fam_list): |
351 | - newpairs, fam_df = _ | 364 | + newpairs, fam_df = newp_famdf |
352 | fam_pbar.update(1) | 365 | fam_pbar.update(1) |
353 | results.append(fam_df) | 366 | results.append(fam_df) |
354 | allpairs.append(newpairs) | 367 | allpairs.append(newpairs) |
... | @@ -432,13 +445,14 @@ def seq_idty(): | ... | @@ -432,13 +445,14 @@ def seq_idty(): |
432 | Creates temporary results files in data/*.npy | 445 | Creates temporary results files in data/*.npy |
433 | REQUIRES tables chain, family up to date.""" | 446 | REQUIRES tables chain, family up to date.""" |
434 | 447 | ||
448 | + # List the families for which we will compute sequence identity matrices | ||
435 | conn = sqlite3.connect("results/RNANet.db") | 449 | conn = sqlite3.connect("results/RNANet.db") |
436 | famlist = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains > 1 ORDER BY rfam_acc ASC;") ] | 450 | famlist = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains > 1 ORDER BY rfam_acc ASC;") ] |
437 | ignored = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains < 2 ORDER BY rfam_acc ASC;") ] | 451 | ignored = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains < 2 ORDER BY rfam_acc ASC;") ] |
438 | if len(ignored): | 452 | if len(ignored): |
439 | print("Idty matrices: Ignoring families with only one chain:", " ".join(ignored)+'\n') | 453 | print("Idty matrices: Ignoring families with only one chain:", " ".join(ignored)+'\n') |
440 | 454 | ||
441 | - # compute distance matrices | 455 | + # Compute distance matrices (skipped if data/RF0****.npy already exists) |
442 | p = Pool(processes=8) | 456 | p = Pool(processes=8) |
443 | p.map(to_dist_matrix, famlist) | 457 | p.map(to_dist_matrix, famlist) |
444 | p.close() | 458 | p.close() |
... | @@ -502,7 +516,7 @@ def per_chain_stats(): | ... | @@ -502,7 +516,7 @@ def per_chain_stats(): |
502 | 516 | ||
503 | # Set the values | 517 | # Set the values |
504 | sql_execute(conn, "UPDATE chain SET chain_freq_A = ?, chain_freq_C = ?, chain_freq_G = ?, chain_freq_U = ?, chain_freq_other = ? WHERE chain_id= ?;", | 518 | sql_execute(conn, "UPDATE chain SET chain_freq_A = ?, chain_freq_C = ?, chain_freq_G = ?, chain_freq_U = ?, chain_freq_other = ? WHERE chain_id= ?;", |
505 | - many=True, data=list(df.to_records(index=False)), warn_every=10) | 519 | + many=True, data=list(df.to_records(index=False)), warn_every=10) |
506 | notify("Updated the database with per-chain base frequencies") | 520 | notify("Updated the database with per-chain base frequencies") |
507 | 521 | ||
508 | if __name__ == "__main__": | 522 | if __name__ == "__main__": | ... | ... |
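The `many=True` call in the hunk above presumably dispatches to sqlite3's `executemany`, binding one parameter tuple per chain in the same order as the placeholders in the `UPDATE`. A self-contained sketch against an in-memory database (table layout and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE chain (chain_id INTEGER PRIMARY KEY,
                chain_freq_A REAL, chain_freq_C REAL, chain_freq_G REAL,
                chain_freq_U REAL, chain_freq_other REAL)""")
conn.executemany("INSERT INTO chain (chain_id) VALUES (?)", [(1,), (2,)])

# One parameter tuple per chain, in the same order as the ? placeholders:
# the WHERE key comes last, matching the UPDATE statement.
data = [(0.25, 0.25, 0.25, 0.25, 0.0, 1),
        (0.30, 0.20, 0.30, 0.20, 0.0, 2)]
conn.executemany("""UPDATE chain SET chain_freq_A = ?, chain_freq_C = ?,
                    chain_freq_G = ?, chain_freq_U = ?, chain_freq_other = ?
                    WHERE chain_id = ?;""", data)
conn.commit()

rows = list(conn.execute("SELECT chain_id, chain_freq_A FROM chain ORDER BY chain_id"))
print(rows)
```

This is also why the diff feeds `list(df.to_records(index=False))` rather than the DataFrame itself: `executemany` wants an iterable of flat parameter sequences.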