Louis BECQUEY

Better cut dataframes (cut at Rfam mapping)

...@@ -7,6 +7,7 @@ results/ ...@@ -7,6 +7,7 @@ results/
7 7
8 # temporary results files 8 # temporary results files
9 data/ 9 data/
10 +esl*
10 11
11 # environment stuff 12 # environment stuff
12 .vscode/ 13 .vscode/
......
...@@ -49,6 +49,19 @@ Other folders are created and not deleted, which you might want to conserve to a ...@@ -49,6 +49,19 @@ Other folders are created and not deleted, which you might want to conserve to a
49 * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. They may contain additional information that is not properly supported by RNANet yet. 49 * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. They may contain additional information that is not properly supported by RNANet yet.
50 50
51 # How to run 51 # How to run
52 +
53 +## Required computational resources
54 +- CPU: no strict requirement. The program is optimized for multi-core CPUs, so you might want to use Intel Xeons, AMD Ryzens, etc.
55 +- GPU: not required
56 +- RAM: 16 GB with a large swap partition is okay. 32 GB is recommended (usage peaks at ~27 GB)
57 +- Storage: to date, it takes 60 GB for the 3D data (36 GB if you don't use the --extract option), 11 GB for the sequence data, and 7 GB for the outputs (5.6 GB database, 1 GB archive of CSV files). You need to add a few more GB for the dependencies. A 100 GB partition is enough. The computation is much faster if you use a fast storage device (e.g. an SSD instead of a hard drive, or even better, an NVMe SSD) because of the constant I/O with the SQLite database.
58 +- Network: We query the Rfam public MySQL server on port 4497. Make sure your network allows this communication (there should not be any issue on private networks, but your company or university may close ports by default). You will get an error message if the port is not open. Around 30 GB of data is downloaded.
59 +
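Before starting a long run, you can check that the port is reachable from your machine. The following is a minimal sketch using only the Python standard library. The host name `mysql-rfam-public.ebi.ac.uk` is an assumption taken from the public Rfam documentation and is not stated in this README; the helper `port_is_open` is hypothetical and not part of RNANet.

```python
import socket

# Assumption: host name of the public Rfam MySQL server (check the Rfam documentation).
RFAM_HOST = "mysql-rfam-public.ebi.ac.uk"
RFAM_PORT = 4497  # port quoted in the requirements above


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example use (performs a real network connection):
#     print(port_is_open(RFAM_HOST, RFAM_PORT))
```

A `False` result only tells you that the TCP connection failed; it does not distinguish a closed outgoing port from a temporary server outage.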
60 +To give you an estimate, our last full run took exactly 12h, excluding the time to download the mmCIF files containing RNA (around 25 GB to download) and the time to compute statistics.
61 +Measured on the 23rd of June 2020 on a 16-core AMD Ryzen 7 3700X CPU @3.60GHz, with 32 GB of RAM and a 7200 rpm hard drive. Total CPU time spent: 135 hours (user+kernel modes), corresponding to 12h of elapsed time with the 16-core CPU.
62 +
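As a rough reading of these figures, the ratio of CPU time to elapsed time gives the average number of cores kept busy during the run. The short calculation below uses only the numbers quoted above; the variable names are illustrative.

```python
cpu_hours = 135.0   # total CPU time (user + kernel modes), from the measurement above
wall_hours = 12.0   # elapsed time of the full run
n_cores = 16        # core count quoted above

avg_busy_cores = cpu_hours / wall_hours   # average number of cores in use
utilization = avg_busy_cores / n_cores    # fraction of the CPU used on average

print(f"{avg_busy_cores:.2f} cores busy on average ({utilization:.0%} of {n_cores})")
```

So on average about 11 of the 16 cores were busy, i.e. roughly 70% of the CPU.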
63 +Update runs are much quicker, around 3 hours. The duration depends mostly on which RNA families are affected by the update.
64 +
52 ## Dependencies 65 ## Dependencies
53 You need to install: 66 You need to install:
54 - DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). 67 - DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/).
...@@ -91,6 +104,8 @@ The detailed list of options is below: ...@@ -91,6 +104,8 @@ The detailed list of options is below:
91 ## Post-computation task: estimate quality 104 ## Post-computation task: estimate quality
92 The file statistics.py is meant to give a summary of the produced dataset. See the results/ folder. It can be run automatically after RNANet if you pass the `-s` option. 105 The file statistics.py is meant to give a summary of the produced dataset. See the results/ folder. It can be run automatically after RNANet if you pass the `-s` option.
93 106
107 +
108 +
94 # How to further filter the dataset 109 # How to further filter the dataset
95 You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package. 110 You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package.
96 111
...@@ -108,6 +123,7 @@ Step 1 : We first get a list of chains that are below our favorite resolution th ...@@ -108,6 +123,7 @@ Step 1 : We first get a list of chains that are below our favorite resolution th
108 with sqlite3.connect("results/RNANet.db") as connection: 123 with sqlite3.connect("results/RNANet.db") as connection:
109 chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name 124 chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
110 FROM chain JOIN structure 125 FROM chain JOIN structure
126 + ON chain.structure_id = structure.pdb_id
111 WHERE resolution < 4.0 127 WHERE resolution < 4.0
112 ORDER BY structure_id ASC;""", 128 ORDER BY structure_id ASC;""",
113 con=connection) 129 con=connection)
...@@ -146,6 +162,7 @@ We will simply modify the Step 1 above: ...@@ -146,6 +162,7 @@ We will simply modify the Step 1 above:
146 with sqlite3.connect("results/RNANet.db") as connection: 162 with sqlite3.connect("results/RNANet.db") as connection:
147 chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name 163 chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
148 FROM chain JOIN structure 164 FROM chain JOIN structure
165 + ON chain.structure_id = structure.pdb_id
149 WHERE date < '2018-06-01' 166 WHERE date < '2018-06-01'
150 ORDER BY structure_id ASC;""", 167 ORDER BY structure_id ASC;""",
151 con=connection) 168 con=connection)
...@@ -160,6 +177,7 @@ If you want just one example of each RNA 3D chain, use in Step 1: ...@@ -160,6 +177,7 @@ If you want just one example of each RNA 3D chain, use in Step 1:
160 with sqlite3.connect("results/RNANet.db") as connection: 177 with sqlite3.connect("results/RNANet.db") as connection:
161 chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name 178 chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name
162 FROM chain JOIN structure 179 FROM chain JOIN structure
180 + ON chain.structure_id = structure.pdb_id
163 ORDER BY structure_id ASC;""", 181 ORDER BY structure_id ASC;""",
164 con=connection) 182 con=connection)
165 ``` 183 ```
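The `ON chain.structure_id = structure.pdb_id` clause added to these queries matters: in SQLite, a `JOIN` without a join condition pairs every chain with every structure, so a filter such as `resolution < 4.0` lets a chain through as soon as any structure in the database passes it. The sketch below illustrates this on an in-memory database. The two-table schema and the rows are made up for the illustration and only contain the columns used in the queries above; they are not the real RNANet schema.

```python
import sqlite3

# Hypothetical miniature of the chain/structure tables (made-up rows).
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, resolution REAL);
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, structure_id TEXT, chain_name TEXT);
    INSERT INTO structure VALUES ('1abc', 2.5), ('2xyz', 6.0);
    INSERT INTO chain VALUES (1, '1abc', 'A'), (2, '2xyz', 'B');
""")

select = "SELECT chain_id, structure_id, chain_name FROM chain JOIN structure"
where = "WHERE resolution < 4.0 ORDER BY structure_id ASC;"

# Without ON: every chain is paired with every structure (cross join).
without_on = connection.execute(f"{select} {where}").fetchall()

# With ON: each chain is paired only with its own structure.
with_on = connection.execute(
    f"{select} ON chain.structure_id = structure.pdb_id {where}"
).fetchall()
connection.close()

print(without_on)  # chain 2 is returned although its structure has resolution 6.0
print(with_on)     # only chain 1 remains
```

With the join condition, only the chain whose own structure is below the resolution threshold is returned.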
......
...@@ -168,6 +168,8 @@ def stats_len(): ...@@ -168,6 +168,8 @@ def stats_len():
168 lengths = [] 168 lengths = []
169 conn = sqlite3.connect("results/RNANet.db") 169 conn = sqlite3.connect("results/RNANet.db")
170 for i,f in enumerate(fam_list): 170 for i,f in enumerate(fam_list):
171 +
172 + # Define a color for that family in the plot
171 if f in LSU_set: 173 if f in LSU_set:
172 cols.append("red") # LSU 174 cols.append("red") # LSU
173 elif f in SSU_set: 175 elif f in SSU_set:
...@@ -178,11 +180,15 @@ def stats_len(): ...@@ -178,11 +180,15 @@ def stats_len():
178 cols.append("orange") 180 cols.append("orange")
179 else: 181 else:
180 cols.append("grey") 182 cols.append("grey")
183 +
184 + # Get the lengths of chains
181 l = [ x[0] for x in sql_ask_database(conn, f"SELECT COUNT(index_chain) FROM (SELECT chain_id FROM chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY chain_id;") ] 185 l = [ x[0] for x in sql_ask_database(conn, f"SELECT COUNT(index_chain) FROM (SELECT chain_id FROM chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY chain_id;") ]
182 lengths.append(l) 186 lengths.append(l)
187 +
183 notify(f"[{i+1}/{len(fam_list)}] Computed {f} chains lengths") 188 notify(f"[{i+1}/{len(fam_list)}] Computed {f} chains lengths")
184 conn.close() 189 conn.close()
185 190
191 + # Plot the figure
186 fig = plt.figure(figsize=(10,3)) 192 fig = plt.figure(figsize=(10,3))
187 ax = fig.gca() 193 ax = fig.gca()
188 ax.hist(lengths, bins=100, stacked=True, log=True, color=cols, label=fam_list) 194 ax.hist(lengths, bins=100, stacked=True, log=True, color=cols, label=fam_list)
...@@ -191,6 +197,8 @@ def stats_len(): ...@@ -191,6 +197,8 @@ def stats_len():
191 ax.set_xlim(left=-150) 197 ax.set_xlim(left=-150)
192 ax.tick_params(axis='both', which='both', labelsize=8) 198 ax.tick_params(axis='both', which='both', labelsize=8)
193 fig.tight_layout() 199 fig.tight_layout()
200 +
201 + # Draw the legend
194 fig.subplots_adjust(right=0.78) 202 fig.subplots_adjust(right=0.78)
195 filtered_handles = [mpatches.Patch(color='red'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), 203 filtered_handles = [mpatches.Patch(color='red'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), mpatches.Patch(color='white'),
196 mpatches.Patch(color='blue'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), 204 mpatches.Patch(color='blue'), mpatches.Patch(color='white'), mpatches.Patch(color='white'),
...@@ -204,6 +212,8 @@ def stats_len(): ...@@ -204,6 +212,8 @@ def stats_len():
204 'Other'] 212 'Other']
205 ax.legend(filtered_handles, filtered_labels, loc='right', 213 ax.legend(filtered_handles, filtered_labels, loc='right',
206 ncol=1, fontsize='small', bbox_to_anchor=(1.3, 0.5)) 214 ncol=1, fontsize='small', bbox_to_anchor=(1.3, 0.5))
215 +
216 + # Save the figure
207 fig.savefig("results/figures/lengths.png") 217 fig.savefig("results/figures/lengths.png")
208 notify("Computed sequence length statistics and saved the figure.") 218 notify("Computed sequence length statistics and saved the figure.")
209 219
...@@ -224,10 +234,12 @@ def stats_freq(): ...@@ -224,10 +234,12 @@ def stats_freq():
224 234
225 Outputs results/frequencies.csv 235 Outputs results/frequencies.csv
226 REQUIRES tables chain, nucleotide up to date.""" 236 REQUIRES tables chain, nucleotide up to date."""
237 + # Initialize a Counter object for each family
227 freqs = {} 238 freqs = {}
228 for f in fam_list: 239 for f in fam_list:
229 freqs[f] = Counter() 240 freqs[f] = Counter()
230 241
242 + # List all nt_names occurring within an RNA family and store the counts in the Counter
231 conn = sqlite3.connect("results/RNANet.db") 243 conn = sqlite3.connect("results/RNANet.db")
232 for i,f in enumerate(fam_list): 244 for i,f in enumerate(fam_list):
233 counts = dict(sql_ask_database(conn, f"SELECT nt_name, COUNT(nt_name) FROM (SELECT chain_id from chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY nt_name;")) 245 counts = dict(sql_ask_database(conn, f"SELECT nt_name, COUNT(nt_name) FROM (SELECT chain_id from chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY nt_name;"))
...@@ -235,6 +247,7 @@ def stats_freq(): ...@@ -235,6 +247,7 @@ def stats_freq():
235 notify(f"[{i+1}/{len(fam_list)}] Computed {f} nucleotide frequencies.") 247 notify(f"[{i+1}/{len(fam_list)}] Computed {f} nucleotide frequencies.")
236 conn.close() 248 conn.close()
237 249
250 + # Create a pandas DataFrame, and save it to CSV.
238 df = pd.DataFrame() 251 df = pd.DataFrame()
239 for f in fam_list: 252 for f in fam_list:
240 tot = sum(freqs[f].values()) 253 tot = sum(freqs[f].values())
...@@ -347,8 +360,8 @@ def stats_pairs(): ...@@ -347,8 +360,8 @@ def stats_pairs():
347 fam_pbar = tqdm(total=len(fam_list), desc="Pair-types in families", position=0, leave=True) 360 fam_pbar = tqdm(total=len(fam_list), desc="Pair-types in families", position=0, leave=True)
348 results = [] 361 results = []
349 allpairs = [] 362 allpairs = []
350 - for i, _ in enumerate(p.imap_unordered(parallel_stats_pairs, fam_list)): 363 + for _, newp_famdf in enumerate(p.imap_unordered(parallel_stats_pairs, fam_list)):
351 - newpairs, fam_df = _ 364 + newpairs, fam_df = newp_famdf
352 fam_pbar.update(1) 365 fam_pbar.update(1)
353 results.append(fam_df) 366 results.append(fam_df)
354 allpairs.append(newpairs) 367 allpairs.append(newpairs)
...@@ -432,13 +445,14 @@ def seq_idty(): ...@@ -432,13 +445,14 @@ def seq_idty():
432 Creates temporary results files in data/*.npy 445 Creates temporary results files in data/*.npy
433 REQUIRES tables chain, family up to date.""" 446 REQUIRES tables chain, family up to date."""
434 447
448 + # List the families for which we will compute sequence identity matrices
435 conn = sqlite3.connect("results/RNANet.db") 449 conn = sqlite3.connect("results/RNANet.db")
436 famlist = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains > 1 ORDER BY rfam_acc ASC;") ] 450 famlist = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains > 1 ORDER BY rfam_acc ASC;") ]
437 ignored = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains < 2 ORDER BY rfam_acc ASC;") ] 451 ignored = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains < 2 ORDER BY rfam_acc ASC;") ]
438 if len(ignored): 452 if len(ignored):
439 print("Idty matrices: Ignoring families with only one chain:", " ".join(ignored)+'\n') 453 print("Idty matrices: Ignoring families with only one chain:", " ".join(ignored)+'\n')
440 454
441 - # compute distance matrices 455 + # compute distance matrices (or ignore if data/RF0****.npy exists)
442 p = Pool(processes=8) 456 p = Pool(processes=8)
443 p.map(to_dist_matrix, famlist) 457 p.map(to_dist_matrix, famlist)
444 p.close() 458 p.close()
......