Louis BECQUEY

Better cut dataframes (cut at Rfam mapping)

@@ -7,6 +7,7 @@ results/
 
 # temporary results files
 data/
+esl*
 
 # environment stuff
 .vscode/
......
@@ -49,6 +49,19 @@ Other folders are created and not deleted, which you might want to conserve to a
 * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. You may find additional information in them which is not yet properly supported by RNANet.
 
 # How to run
+
+## Required computational resources
+- CPU: no strict requirement. The program is optimized for multi-core CPUs; any recent Intel Xeon, AMD Ryzen or similar processor is fine.
+- GPU: not required.
+- RAM: 16 GB with a large swap partition is okay. 32 GB is recommended (usage peaks at ~27 GB).
+- Storage: to date, the 3D data takes 60 GB (36 GB if you don't use the --extract option), the sequence data 11 GB, and the outputs 7 GB (5.6 GB database, 1 GB archive of CSV files). Add a few more GB for the dependencies. Go for a 100 GB partition and you are good to go. Because of permanent I/O with the SQLite database, computation is noticeably faster on a fast storage device (e.g. an SSD instead of a hard drive, or even better, an NVMe SSD).
+- Network: we query the Rfam public MySQL server on port 4497. Make sure your network allows this communication (there should not be any issue on private networks, but your company/university may close ports by default). You will get an error message if the port is not open; a quick way to test it is sketched below. Around 30 GB of data is downloaded.
+
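+To check beforehand that your network allows connections to the Rfam server, here is a minimal test. The hostname below is Rfam's documented public MySQL server; this snippet is only a sketch, adapt it if needed.
+```python
+import socket
+
+# Try to open a TCP connection to Rfam's public MySQL server on port 4497
+try:
+    with socket.create_connection(("mysql-rfam-public.ebi.ac.uk", 4497), timeout=10):
+        print("Port 4497 is reachable, RNANet should be able to query Rfam.")
+except OSError as e:
+    print(f"Cannot reach the Rfam public MySQL server: {e}")
+```
+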
+To give you an estimate, our last full run took exactly 12 h, excluding the time to download the mmCIF files containing RNA (around 25 GB to download) and the time to compute statistics.
+Measured on June 23rd, 2020, on an AMD Ryzen 7 3700X CPU (8 cores, 16 threads) @ 3.60 GHz, with 32 GB of RAM and a 7200 rpm hard drive. Total CPU time spent: 135 hours (user+kernel modes), corresponding to 12 h of wall-clock time on the 16 threads.
+
+Update runs are much quicker, around 3 hours. Their duration depends mostly on which RNA families are affected by the update.
+
 ## Dependencies
 You need to install:
 - DSSR: you need to register on the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/).
@@ -91,6 +104,8 @@ The detailed list of options is below:
 ## Post-computation task: estimate quality
 The file statistics.py gives a summary of the produced dataset. See the results/ folder. It can be run automatically after RNANet if you pass the `-s` option.
 
+
+
 # How to further filter the dataset
 You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package.
 
@@ -108,6 +123,7 @@ Step 1: We first get a list of chains that are below our favorite resolution th
 with sqlite3.connect("results/RNANet.db") as connection:
     chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
                                 FROM chain JOIN structure
+                                ON chain.structure_id = structure.pdb_id
                                 WHERE resolution < 4.0
                                 ORDER BY structure_id ASC;""",
                              con=connection)
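
As a side note, once this `chain_list` is built, the corresponding nucleotide-level data can be fetched with one more query. A minimal sketch, reusing table and column names that appear elsewhere in this repository (`nucleotide`, `chain_id`, `index_chain`, `nt_resnum`, `nt_code`):

```python
import sqlite3
import pandas as pd

with sqlite3.connect("results/RNANet.db") as connection:
    # Nucleotides of all chains solved at a resolution better than 4.0 Angstroms
    nts = pd.read_sql("""SELECT chain_id, index_chain, nt_resnum, nt_code
                         FROM nucleotide
                         WHERE chain_id IN (SELECT chain_id FROM chain
                                            JOIN structure ON chain.structure_id = structure.pdb_id
                                            WHERE resolution < 4.0);""",
                      con=connection)
```
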
@@ -146,6 +162,7 @@ We will simply modify Step 1 above:
 with sqlite3.connect("results/RNANet.db") as connection:
     chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
                                 FROM chain JOIN structure
+                                ON chain.structure_id = structure.pdb_id
                                 WHERE date < '2018-06-01'
                                 ORDER BY structure_id ASC;""",
                              con=connection)
@@ -160,6 +177,7 @@ If you want just one example of each RNA 3D chain, use in Step 1:
 with sqlite3.connect("results/RNANet.db") as connection:
     chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name
                                 FROM chain JOIN structure
+                                ON chain.structure_id = structure.pdb_id
                                 ORDER BY structure_id ASC;""",
                              con=connection)
 ```
......
@@ -6,14 +6,15 @@ from Bio import AlignIO, SeqIO
 from Bio.PDB import MMCIFParser
 from Bio.PDB.mmcifio import MMCIFIO
 from Bio.PDB.MMCIF2Dict import MMCIF2Dict
-from Bio.PDB.PDBExceptions import PDBConstructionWarning
+from Bio.PDB.PDBExceptions import PDBConstructionWarning, BiopythonWarning
+from Bio.PDB.Dice import ChainSelector
 from Bio._py3k import urlretrieve as _urlretrieve
 from Bio._py3k import urlcleanup as _urlcleanup
 from Bio.Alphabet import generic_rna
 from Bio.Seq import Seq
 from Bio.SeqRecord import SeqRecord
 from Bio.Align import MultipleSeqAlignment, AlignInfo
-from collections import OrderedDict
+from collections import OrderedDict, defaultdict
 from functools import partial, wraps
 from os import path, makedirs
 from multiprocessing import Pool, Manager, set_start_method
@@ -40,6 +41,7 @@ errsymb = '\U0000274C'
 
 LSU_set = {"RF00002", "RF02540", "RF02541", "RF02543", "RF02546"}   # From Rfam CLAN 00112
 SSU_set = {"RF00177", "RF02542", "RF02545", "RF01959", "RF01960"}   # From Rfam CLAN 00111
+no_nts_set = set()
 
 class NtPortionSelector(object):
     """Class passed to MMCIFIO to select some chain portions in an MMCIF file.
@@ -170,29 +172,40 @@ class Chain:
         with warnings.catch_warnings():
             # Ignore the PDB problems. This mostly warns that some chain is discontinuous.
             warnings.simplefilter('ignore', PDBConstructionWarning)
+            warnings.simplefilter('ignore', BiopythonWarning)
 
             # Load the whole mmCIF into a Biopython structure object:
             mmcif_parser = MMCIFParser()
             s = mmcif_parser.get_structure(self.pdb_id, path_to_3D_data + "RNAcifs/"+self.pdb_id+".cif")
 
             # Extract the desired chain
             c = s[model_idx][self.pdb_chain_id]
 
             if (self.pdb_end - self.pdb_start):
-                # Pay attention to residue numbering
-                first_number = c.child_list[0].get_id()[1]      # the chain's first residue is numbered 'first_number'
                 if self.pdb_start < self.pdb_end:
-                    start = self.pdb_start + first_number - 1   # shift our start_position by 'first_number'
-                    end = self.pdb_end + first_number - 1       # same for the end position
+                    start = self.pdb_start                      # the Rfam mapping is already expressed in residue numbers
+                    end = self.pdb_end
                 else:
                     self.reversed = True                        # the 3D chain is numbered backwards compared to the Rfam family
-                    end = self.pdb_start + first_number - 1
-                    start = self.pdb_end + first_number - 1
+                    end = self.pdb_start
+                    start = self.pdb_end
             else:
                 start = c.child_list[0].get_id()[1]             # the chain's first residue number
                 end = c.child_list[-1].get_id()[1]              # the chain's last residue number
 
             # Define a selection
+            # sel = ChainSelector(self.pdb_chain_id, start, end, model_id=model_idx)    # alternative selector from Bio.PDB.Dice
             sel = NtPortionSelector(model_idx, self.pdb_chain_id, start, end, khetatm)
 
             # Save that selection on the mmCIF object s to file
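
For readers unfamiliar with this Biopython pattern: the structure is parsed once, then written back through a selector object that decides which models/chains/residues/atoms to keep. A minimal standalone sketch, with a hypothetical PDB id and file paths (NtPortionSelector is the class defined earlier in this file):

```python
from Bio.PDB import MMCIFParser
from Bio.PDB.mmcifio import MMCIFIO

# Hypothetical example: keep residues 1 to 50 of chain "A" in model 0
parser = MMCIFParser()
s = parser.get_structure("1abc", "RNAcifs/1abc.cif")    # hypothetical id and path
io = MMCIFIO()
io.set_structure(s)
io.save("rna_mapped_to_Rfam/1abc_A_1-50.cif",
        NtPortionSelector(0, "A", 1, 50, False))        # last argument: whether to keep HETATM records
```
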
@@ -205,7 +218,9 @@ class Chain:
     def extract_3D_data(self):
         """ Maps DSSR annotations to the chain. """
 
+        ############################################
         # Load the mmCIF annotations from file
+        ############################################
         try:
             with open(path_to_3D_data + "annotations/" + self.pdb_id + ".json", 'r') as json_file:
                 json_object = json.load(json_file)
@@ -219,39 +234,53 @@ class Chain:
         # Print any warnings given by DSSR, and abort if there are any
         if "warning" in json_object.keys():
             warn(f"found DSSR warning in annotation {self.pdb_id}.json: {json_object['warning']}. Ignoring {self.chain_label}.")
+            if "no nucleotides" in json_object['warning']:
+                no_nts_set.add(self.pdb_id)
             self.delete_me = True
             self.error_messages = f"DSSR warning {self.pdb_id}.json: {json_object['warning']}. Ignoring {self.chain_label}."
             return 1
 
+        ############################################
+        # Create the data-frame
+        ############################################
         try:
-            # Prepare a data structure (Pandas DataFrame) for the nucleotides
+            # Create the Pandas DataFrame for the nucleotides of the right chain
             nts = json_object["nts"]                        # sub-json-object
             df = pd.DataFrame(nts)                          # conversion to dataframe
             df = df[ df.chain_name == self.pdb_chain_id ]   # keeping only this chain's nucleotides
+
+            # Assert that nucleotides of the chain were found
             if df.empty:
                 warn(f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. Ignoring chain {self.chain_label}.", error=True)
+                no_nts_set.add(self.pdb_id)
                 self.delete_me = True
                 self.error_messages = f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. We expect a problem with {self.pdb_id} mmCIF download. Delete it and retry."
                 return 1
 
-            # remove low pertinence or undocumented descriptors
+            # Remove low-pertinence or undocumented descriptors, and convert angle values
             cols_we_keep = ["index_chain", "nt_resnum", "nt_name", "nt_code", "nt_id", "dbn", "alpha", "beta", "gamma", "delta", "epsilon", "zeta",
                             "epsilon_zeta", "bb_type", "chi", "glyco_bond", "form", "ssZp", "Dp", "eta", "theta", "eta_prime", "theta_prime", "eta_base", "theta_base",
                             "v0", "v1", "v2", "v3", "v4", "amplitude", "phase_angle", "puckering" ]
             df = df[cols_we_keep]
-
-            # Convert angles to radians
-            df.loc[:,['alpha', 'beta','gamma','delta','epsilon','zeta','epsilon_zeta','chi','v0', 'v1', 'v2', 'v3', 'v4',
+            df.loc[:,['alpha', 'beta','gamma','delta','epsilon','zeta','epsilon_zeta','chi','v0', 'v1', 'v2', 'v3', 'v4',   # Conversion to radians
                       'eta','theta','eta_prime','theta_prime','eta_base','theta_base', 'phase_angle']] *= np.pi/180.0
-            # mapping [-pi, pi] into [0, 2pi]
-            df.loc[:,['alpha', 'beta','gamma','delta','epsilon','zeta','epsilon_zeta','chi','v0', 'v1', 'v2', 'v3', 'v4',
+            df.loc[:,['alpha', 'beta','gamma','delta','epsilon','zeta','epsilon_zeta','chi','v0', 'v1', 'v2', 'v3', 'v4',   # mapping [-pi, pi] into [0, 2pi]
                       'eta','theta','eta_prime','theta_prime','eta_base','theta_base', 'phase_angle']] %= (2.0*np.pi)
+
         except KeyError as e:
             warn(f"Error while parsing DSSR {self.pdb_id}.json output:{e}", error=True)
             self.delete_me = True
             self.error_messages = f"Error while parsing DSSR's json output:\n{e}"
             return 1
 
+        # Remove nucleotides of the chain that are outside the Rfam mapping, if any
+        if self.pdb_start and self.pdb_end:
+            df = df.drop(df[ (df.nt_resnum < self.pdb_start) | (df.nt_resnum > self.pdb_end) ].index)
+
+        #############################################
+        # Solve some common issues and drop ligands
+        #############################################
+
         # Shift numbering when duplicate residue numbers are found.
         # Example: 4v9q-DV contains 17 and 17A which are both read 17 by DSSR.
         while True in df.duplicated(['nt_resnum']).values:
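
To illustrate the Rfam-mapping cut introduced above: rows whose nt_resnum falls outside [pdb_start, pdb_end] are dropped before any renumbering. A toy example with made-up values:

```python
import pandas as pd

# Four nucleotides, and a hypothetical Rfam mapping covering residues 6 to 7
df = pd.DataFrame({"index_chain": [1, 2, 3, 4],
                   "nt_resnum":   [5, 6, 7, 8]})
pdb_start, pdb_end = 6, 7

# Same idiom as in extract_3D_data() above
df = df.drop(df[(df.nt_resnum < pdb_start) | (df.nt_resnum > pdb_end)].index)
print(df)    # only the rows with nt_resnum 6 and 7 remain
```
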
@@ -266,11 +295,12 @@ class Chain:
                or (len(df.index_chain) >= 2 and df.iloc[[-1]].nt_resnum.iloc[0] > 50 + df.iloc[[-2]].nt_resnum.iloc[0])):
             df = df.head(-1)
 
-        # Assert some nucleotides exist
+        # Assert some nucleotides still exist
         try:
             l = df.iloc[-1,1] - df.iloc[0,1] + 1    # length of chain from nt_resnum point of view
         except IndexError:
             warn(f"Could not find real nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. Ignoring chain {self.chain_label}.", error=True)
+            no_nts_set.add(self.pdb_id)
             self.delete_me = True
             self.error_messages = f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. We expect a problem with {self.pdb_id} mmCIF download. Delete it and retry."
             return 1
@@ -281,14 +311,15 @@ class Chain:
         df.iloc[:, 0] -= st
         df = df.drop(df[df.index_chain < 0].index)    # drop any with index_chain < the first residue (usually ligands)
 
-        # Add a sequence column just for the alignments
-        df['nt_align_code'] = [ str(x).upper()
-                                  .replace('NAN', '-')    # Unresolved nucleotides are gaps
-                                  .replace('?', '-')      # Unidentified residues, let's delete them
-                                  .replace('T', 'U')      # 5MU are modified to t, which gives T
-                                  .replace('P', 'U')      # Pseudo-uridines, but it is not really right to change them to U, see DSSR paper, Fig 2
-                                  for x in df['nt_code'] ]
-
+        # Re-assert that some nucleotides still exist after the cut
+        try:
+            l = df.iloc[-1,1] - df.iloc[0,1] + 1    # length of chain from nt_resnum point of view
+        except IndexError:
+            warn(f"Could not find real nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. Ignoring chain {self.chain_label}.", error=True)
+            no_nts_set.add(self.pdb_id)
+            self.delete_me = True
+            self.error_messages = f"Could not find nucleotides of chain {self.pdb_chain_id} in annotation {self.pdb_id}.json. We expect a problem with {self.pdb_id} mmCIF download. Delete it and retry."
+            return 1
 
         # Add any missing rows caused by unsolved residues in the chain.
         # Sometimes, the 3D structure is REALLY shorter than the family it's mapped to,
@@ -304,7 +335,7 @@ class Chain:
         # portion solved in 3D   1 |--------------|79 85|------------| 156
         # Rfam mapping           3 |------------------------------------------ ... -------| 3353 (yes, larger, because it could be inferred)
         # nt resnum              3 |--------------------------------| 156
-        # index_chain            1 |-------------|77 83|------------| 149
+        # index_chain            1 |-------------|77 83|------------| 149 (before correction)
         # expected data point    1 |--------------------------------| 154
         #
 
@@ -314,15 +345,26 @@ class Chain:
         for i in sorted(diff):
             # Add a row at position i
             df = pd.concat([ df.iloc[:i],
-                             pd.DataFrame({"index_chain": i+1, "nt_resnum": i+resnum_start,
-                                           "nt_code":'-', "nt_name":'-', 'nt_align_code':'-'}, index=[i]),
-                             df.iloc[i:]
-                           ])
+                             pd.DataFrame({"index_chain": i+1, "nt_resnum": i+resnum_start, "nt_code":'-', "nt_name":'-'}, index=[i]),
+                             df.iloc[i:] ])
             # Increase the index_chain of all following lines
             df.iloc[i+1:, 0] += 1
         df = df.reset_index(drop=True)
         self.full_length = len(df.index_chain)
 
+        # Add a sequence column just for the alignments
+        df['nt_align_code'] = [ str(x).upper()
+                                  .replace('NAN', '-')    # Unresolved nucleotides are gaps
+                                  .replace('?', '-')      # Unidentified residues, let's delete them
+                                  .replace('T', 'U')      # 5MU are modified to t, which gives T
+                                  .replace('P', 'U')      # Pseudo-uridines, but it is not really right to change them to U, see DSSR paper, Fig 2
+                                  for x in df['nt_code'] ]
+
+        #######################################
+        # Compute new features
+        #######################################
+
         # One-hot encoding sequence
         df["is_A"] = [ 1 if x=="A" else 0 for x in df["nt_code"] ]
         df["is_C"] = [ 1 if x=="C" else 0 for x in df["nt_code"] ]
@@ -415,7 +457,13 @@ class Chain:
                 newpairs.append(v)
         df['paired'] = newpairs
 
-        # Saving to database
+        self.seq = "".join(df.nt_code)
+        self.seq_to_align = "".join(df.nt_align_code)
+        self.length = len([ x for x in self.seq_to_align if x != "-" ])
+
+        ####################################
+        # Save everything to database
+        ####################################
         with sqlite3.connect(runDir + "/results/RNANet.db", timeout=10.0) as conn:
             # Register the chain in table chain
             if self.pdb_start is not None:
@@ -444,11 +492,7 @@ class Chain:
                           ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);""",
                         many=True, data=list(df.to_records(index=False)), warn_every=10)
 
-        # Now load data from the database
-        self.seq = "".join(df.nt_code)
-        self.seq_to_align = "".join(df.nt_align_code)
-        self.length = len([ x for x in self.seq_to_align if x != "-" ])
-
         # Remove too short chains
         if self.length < 5:
             warn(f"{self.chain_label} sequence is too short, let's ignore it.\t", error=True)
@@ -811,7 +855,7 @@ class Pipeline:
         global path_to_seq_data
 
         try:
-            opts, args = getopt.getopt( sys.argv[1:], "r:hs",
+            opts, _ = getopt.getopt( sys.argv[1:], "r:hs",
                 [ "help", "resolution=", "keep-hetatm=", "from-scratch",
                   "fill-gaps=", "3d-folder=", "seq-folder=",
                   "no-homology", "ignore-issues", "extract",
@@ -822,7 +866,7 @@ class Pipeline:
 
         for opt, arg in opts:
 
-            if opt in ["--from-scratch", "--update-mmcifs", "--update-homolgous"] and "tobedefinedbyoptions" in [path_to_3D_data, path_to_seq_data]:
+            if opt in ["--from-scratch", "--update-mmcifs", "--update-homologous"] and "tobedefinedbyoptions" in [path_to_3D_data, path_to_seq_data]:
                 print("Please provide --3d-folder and --seq-folder first, so that we know what to delete and update.")
                 exit()
@@ -890,7 +934,7 @@ class Pipeline:
             warn("Deleting previous database and recomputing from scratch.")
             subprocess.run(["rm", "-rf",
                             path_to_3D_data + "annotations",
-                            path_to_3D_data + "RNAcifs",
+                            # path_to_3D_data + "RNAcifs",    # DEBUG: keep the cifs!
                             path_to_3D_data + "rna_mapped_to_Rfam",
                             path_to_3D_data + "rnaonly",
                             path_to_seq_data + "realigned",
@@ -899,9 +943,6 @@ class Pipeline:
                             runDir + "/known_issues_reasons.txt",
                             runDir + "/results/RNANet.db"])
         elif opt == "--update-homologous":
-            if "tobedefinedbyoptions" == path_to_seq_data:
-                warn("Please provide --seq-folder before --update-homologous in the list of options.", error=True)
-                exit(1)
             warn("Deleting previous sequence files and recomputing alignments.")
             subprocess.run(["rm", "-rf",
                             path_to_seq_data + "realigned",
@@ -942,8 +983,9 @@ class Pipeline:
         print("> Building list of structures...", flush=True)
         p = Pool(initializer=init_worker, initargs=(tqdm.get_lock(),), processes=ncores)
         try:
             pbar = tqdm(full_structures_list, maxinterval=1.0, miniters=1, bar_format="{percentage:3.0f}%|{bar}|")
-            for i, newchains in enumerate(p.imap_unordered(partial(work_infer_mappings, not self.REUSE_ALL, allmappings), full_structures_list)):
+            for newchains in p.imap_unordered(partial(work_infer_mappings, not self.REUSE_ALL, allmappings), full_structures_list):
                 self.update += newchains
                 pbar.update(1)    # Every time an iteration finishes, update the global progress bar
 
@@ -1053,7 +1095,8 @@ class Pipeline:
                     warn(f"Adding {c[1].chain_label} to known issues.")
                     ki.write(c[1].chain_label + '\n')
                     kir.write(c[1].chain_label + '\n' + c[1].error_messages + '\n\n')
-                    sql_execute(conn, f"UPDATE chain SET issue = 1 WHERE chain_id = ?;", data=(c[1].db_chain_id,))
+                    with sqlite3.connect(runDir + "/results/RNANet.db") as conn:
+                        sql_execute(conn, "UPDATE chain SET issue = 1 WHERE chain_id = ?;", data=(c[1].db_chain_id,))
             ki.close()
             kir.close()
 
@@ -1194,7 +1237,7 @@ class Pipeline:
         p = Pool(initializer=init_worker, initargs=(tqdm.get_lock(),), processes=3)
         try:
             pbar = tqdm(total=len(self.loaded_chains), desc="Saving chains to CSV", position=0, leave=True)
-            for i, _ in enumerate(p.imap_unordered(work_save, self.loaded_chains)):
+            for _ in p.imap_unordered(work_save, self.loaded_chains):
                 pbar.update(1)
             pbar.close()
             p.close()
@@ -2158,6 +2201,8 @@ if __name__ == "__main__":
         pp.dl_and_annotate(retry=True, coeff_ncores=0.3)
         pp.build_chains(retry=True, coeff_ncores=1.0)    # Use half the cores to reduce the required amount of memory
         print(f"> Loaded {len(pp.loaded_chains)} RNA chains ({len(pp.update) - len(pp.loaded_chains)} errors).")
+        if len(no_nts_set):
+            print(f"Among errors, {len(no_nts_set)} structures seem to contain RNA chains without defined nucleotides:", no_nts_set, flush=True)
         pp.checkpoint_save_chains()
 
         if not pp.HOMOLOGY:
@@ -2165,7 +2210,7 @@ if __name__ == "__main__":
         for c in pp.loaded_chains:
             work_save(c, homology=False)
         print("Completed.")
-        exit()
+        exit(0)
 
     # At this point, structure, chain and nucleotide tables of the database are up to date.
     # (Modulo some statistics computed by statistics.py)
......
@@ -168,6 +168,8 @@ def stats_len():
     lengths = []
     conn = sqlite3.connect("results/RNANet.db")
     for i,f in enumerate(fam_list):
+
+        # Define a color for that family in the plot
         if f in LSU_set:
             cols.append("red")    # LSU
         elif f in SSU_set:
@@ -178,11 +180,15 @@ def stats_len():
             cols.append("orange")
         else:
             cols.append("grey")
+
+        # Get the lengths of the chains
         l = [ x[0] for x in sql_ask_database(conn, f"SELECT COUNT(index_chain) FROM (SELECT chain_id FROM chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY chain_id;") ]
         lengths.append(l)
+
         notify(f"[{i+1}/{len(fam_list)}] Computed {f} chain lengths")
     conn.close()
 
+    # Plot the figure
     fig = plt.figure(figsize=(10,3))
     ax = fig.gca()
     ax.hist(lengths, bins=100, stacked=True, log=True, color=cols, label=fam_list)
@@ -191,6 +197,8 @@ def stats_len():
     ax.set_xlim(left=-150)
     ax.tick_params(axis='both', which='both', labelsize=8)
     fig.tight_layout()
+
+    # Draw the legend
     fig.subplots_adjust(right=0.78)
     filtered_handles = [mpatches.Patch(color='red'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), mpatches.Patch(color='white'),
                         mpatches.Patch(color='blue'), mpatches.Patch(color='white'), mpatches.Patch(color='white'),
@@ -204,6 +212,8 @@ def stats_len():
                         'Other']
     ax.legend(filtered_handles, filtered_labels, loc='right',
               ncol=1, fontsize='small', bbox_to_anchor=(1.3, 0.5))
+
+    # Save the figure
     fig.savefig("results/figures/lengths.png")
     notify("Computed sequence length statistics and saved the figure.")
 
@@ -224,10 +234,12 @@ def stats_freq():
 
     Outputs results/frequencies.csv
     REQUIRES tables chain, nucleotide up to date."""
+    # Initialize a Counter object for each family
     freqs = {}
     for f in fam_list:
         freqs[f] = Counter()
 
+    # List all nt_names occurring within an RNA family and store the counts in the Counter
     conn = sqlite3.connect("results/RNANet.db")
     for i,f in enumerate(fam_list):
         counts = dict(sql_ask_database(conn, f"SELECT nt_name, COUNT(nt_name) FROM (SELECT chain_id from chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY nt_name;"))
@@ -235,6 +247,7 @@ def stats_freq():
         notify(f"[{i+1}/{len(fam_list)}] Computed {f} nucleotide frequencies.")
     conn.close()
 
+    # Create a pandas DataFrame, and save it to CSV
     df = pd.DataFrame()
     for f in fam_list:
         tot = sum(freqs[f].values())
@@ -347,8 +360,8 @@ def stats_pairs():
         fam_pbar = tqdm(total=len(fam_list), desc="Pair-types in families", position=0, leave=True)
         results = []
         allpairs = []
-        for i, _ in enumerate(p.imap_unordered(parallel_stats_pairs, fam_list)):
-            newpairs, fam_df = _
+        for newpairs, fam_df in p.imap_unordered(parallel_stats_pairs, fam_list):
            fam_pbar.update(1)
            results.append(fam_df)
            allpairs.append(newpairs)
@@ -432,13 +445,14 @@ def seq_idty():
     Creates temporary results files in data/*.npy
     REQUIRES tables chain, family up to date."""
 
+    # List the families for which we will compute sequence identity matrices
     conn = sqlite3.connect("results/RNANet.db")
     famlist = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains > 1 ORDER BY rfam_acc ASC;") ]
     ignored = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains < 2 ORDER BY rfam_acc ASC;") ]
     if len(ignored):
         print("Idty matrices: Ignoring families with only one chain:", " ".join(ignored)+'\n')
 
-    # compute distance matrices
+    # Compute distance matrices (or skip if data/RF0****.npy already exists)
     p = Pool(processes=8)
     p.map(to_dist_matrix, famlist)
     p.close()
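
The comment above refers to a simple file-based cache. A minimal sketch of the idea; the helper name and the compute callback are hypothetical, only the data/*.npy location comes from this file:

```python
import os
import numpy as np

def dist_matrix_cached(fam, compute_matrix):
    """Recompute a family's distance matrix only if no cached copy exists."""
    cache_file = f"data/{fam}.npy"
    if os.path.isfile(cache_file):
        return np.load(cache_file)    # reuse the temporary result file
    m = compute_matrix(fam)           # e.g. whatever to_dist_matrix computes
    np.save(cache_file, m)
    return m
```
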
@@ -502,7 +516,7 @@ def per_chain_stats():
 
         # Set the values
         sql_execute(conn, "UPDATE chain SET chain_freq_A = ?, chain_freq_C = ?, chain_freq_G = ?, chain_freq_U = ?, chain_freq_other = ? WHERE chain_id= ?;",
-                    many=True, data=list(df.to_records(index=False)), warn_every=10)
+                     many=True, data=list(df.to_records(index=False)), warn_every=10)
     notify("Updated the database with per-chain base frequencies")
 
 if __name__ == "__main__":
......