Better cut dataframes (cut at Rfam mapping)

Louis BECQUEY
Commit c21f55fc9d38da93a83baa819981dbb141cc8783, 1 parent 98c2907a
Showing 4 changed files with 36 additions and 3 deletions
.gitignore
README.md
RNAnet.py
statistics.py
--- a/.gitignore
+++ b/.gitignore
@@ -7,6 +7,7 @@ results/
 # temporary results files
 data/
+esl*
 # environment stuff
 .vscode/
--- a/README.md
+++ b/README.md
@@ -49,6 +49,19 @@ Other folders are created and not deleted, which you might want to conserve to a
 * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. You may find additional information in them that is not properly supported by RNANet yet.
 # How to run
+
+## Required computational resources
+- CPU: no requirements. The program is optimized for multi-core CPUs, so you might want to use Intel Xeons, AMD Ryzens, etc.
+- GPU: not required
+- RAM: 16 GB with a large swap partition is okay. 32 GB is recommended (usage peaks at ~27 GB)
+- Storage: to date, it takes 60 GB for the 3D data (36 GB if you don't use the --extract option), 11 GB for the sequence data, and 7 GB for the outputs (5.6 GB database, 1 GB archive of CSV files). You need to add a few more GB for the dependencies. Go for a 100 GB partition and you are good to go. The computation is much faster if you use a fast storage device (e.g. an SSD instead of a hard drive, or even better, an NVMe SSD) because of constant I/O with the SQLite database.
+- Network: We query the Rfam public MySQL server on port 4497. Make sure your network allows this communication (there should not be any issue on private networks, but your company or university may close ports by default). You will get an error message if the port is not open. Around 30 GB of data is downloaded.
+
+To give you an estimate, our last full run took exactly 12 h, excluding the time to download the mmCIF files containing RNA (around 25 GB to download) and the time to compute statistics.
+Measured on the 23rd of June 2020 on a 16-core AMD Ryzen 7 3700X CPU @3.60GHz, with 32 GB of RAM and a 7200 rpm hard drive. Total CPU time spent: 135 hours (user+kernel modes), corresponding to 12 h of actual elapsed time on the 16-core CPU.
+
+Update runs are much quicker, around 3 hours. The duration depends mostly on which RNA families are affected by the update.
+
 ## Dependencies
 You need to install:
 - DSSR, you need to register to the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). 
@@ -91,6 +104,8 @@ The detailed list of options is below:
 ## Post-computation task: estimate quality
 The file statistics.py is meant to give a summary of the produced dataset. See the results/ folder. It can be run automatically after RNANet if you pass the `-s` option.
+
+
 # How to further filter the dataset
 You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package.
@@ -108,6 +123,7 @@ Step 1 : We first get a list of chains that are below our favorite resolution th
 with sqlite3.connect("results/RNANet.db") as connection:
     chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
                                 FROM chain JOIN structure 
+                                ON chain.structure_id = structure.pdb_id
                                 WHERE resolution < 4.0 
                                 ORDER BY structure_id ASC;""",
                             con=connection)
@@ -146,6 +162,7 @@ We will simply modify the Step 1 above:
 with sqlite3.connect("results/RNANet.db") as connection:
     chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
                                 FROM chain JOIN structure 
+                                ON chain.structure_id = structure.pdb_id
                                 WHERE date < "2018-06-01" 
                                 ORDER BY structure_id ASC;""",
                             con=connection)
@@ -160,6 +177,7 @@ If you want just one example of each RNA 3D chain, use in Step 1:
 with sqlite3.connect("results/RNANet.db") as connection:
     chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name
                                 FROM chain JOIN structure
+                                ON chain.structure_id = structure.pdb_id
                                 ORDER BY structure_id ASC;""",
                             con=connection)
 ```
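
The `ON chain.structure_id = structure.pdb_id` condition added to the README queries matters because, in SQLite, a `JOIN` without a constraint behaves as a cross join. The sketch below illustrates this on a toy in-memory database. The two-table schema is reduced to the columns used in the queries, and the identifiers `AAAA` and `BBBB` are made-up placeholders, not real entries of RNANet.db.

```python
import sqlite3

# Toy schema: only the columns used by the README queries (an assumption,
# the real RNANet.db tables contain more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, resolution REAL);
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY,
                        structure_id TEXT, chain_name TEXT);
    INSERT INTO structure VALUES ('AAAA', 2.5), ('BBBB', 6.0);
    INSERT INTO chain VALUES (1, 'AAAA', 'A'), (2, 'AAAA', 'B'), (3, 'BBBB', 'A');
""")

# Without ON: every chain is paired with every structure, so the chain of
# the low-resolution structure BBBB passes the filter through AAAA's row.
without_on = conn.execute("""SELECT chain_id, structure_id, chain_name
                             FROM chain JOIN structure
                             WHERE resolution < 4.0
                             ORDER BY structure_id ASC, chain_id ASC;""").fetchall()

# With ON: each chain is only matched with its own structure.
with_on = conn.execute("""SELECT chain_id, structure_id, chain_name
                          FROM chain JOIN structure
                          ON chain.structure_id = structure.pdb_id
                          WHERE resolution < 4.0
                          ORDER BY structure_id ASC, chain_id ASC;""").fetchall()
conn.close()

print(without_on)
print(with_on)
```

On this toy data, the unconstrained join returns all three chains, while the constrained join keeps only the two chains that belong to the structure below 4.0 Å.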
--- a/RNAnet.py
+++ b/RNAnet.py
--- a/statistics.py
+++ b/statistics.py
@@ -168,6 +168,8 @@ def stats_len():
     lengths = []
     conn = sqlite3.connect("results/RNANet.db")
     for i,f in enumerate(fam_list):
+
+        # Define a color for that family in the plot
         if f in LSU_set:
             cols.append("red") # LSU
         elif f in SSU_set:
@@ -178,11 +180,15 @@ def stats_len():
             cols.append("orange")
         else:
             cols.append("grey")
+
+        # Get the lengths of chains
         l = [ x[0] for x in sql_ask_database(conn, f"SELECT COUNT(index_chain) FROM (SELECT chain_id FROM chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY chain_id;") ]
         lengths.append(l)
+
         notify(f"[{i+1}/{len(fam_list)}] Computed {f} chains lengths")
     conn.close()
+    # Plot the figure
     fig = plt.figure(figsize=(10,3))
     ax = fig.gca()
     ax.hist(lengths, bins=100, stacked=True, log=True, color=cols, label=fam_list)
@@ -191,6 +197,8 @@ def stats_len():
     ax.set_xlim(left=-150)
     ax.tick_params(axis='both', which='both', labelsize=8)
     fig.tight_layout()
+
+    # Draw the legend
     fig.subplots_adjust(right=0.78)
     filtered_handles = [mpatches.Patch(color='red'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), mpatches.Patch(color='white'),
                         mpatches.Patch(color='blue'), mpatches.Patch(color='white'), mpatches.Patch(color='white'),
@@ -204,6 +212,8 @@ def stats_len():
                        'Other']
     ax.legend(filtered_handles, filtered_labels, loc='right', 
                 ncol=1, fontsize='small', bbox_to_anchor=(1.3, 0.5))
+
+    # Save the figure
     fig.savefig("results/figures/lengths.png")
     notify("Computed sequence length statistics and saved the figure.")
@@ -224,10 +234,12 @@ def stats_freq():
     Outputs results/frequencies.csv
     REQUIRES tables chain, nucleotide up to date."""
+    # Initialize a Counter object for each family
     freqs = {}
     for f in fam_list:
         freqs[f] = Counter()
+    # List all nt_names happening within a RNA family and store the counts in the Counter
     conn = sqlite3.connect("results/RNANet.db")
     for i,f in enumerate(fam_list):
         counts = dict(sql_ask_database(conn, f"SELECT nt_name, COUNT(nt_name) FROM (SELECT chain_id from chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY nt_name;"))
@@ -235,6 +247,7 @@ def stats_freq():
         notify(f"[{i+1}/{len(fam_list)}] Computed {f} nucleotide frequencies.")
     conn.close()
+    # Create a pandas DataFrame, and save it to CSV.
     df = pd.DataFrame()
     for f in fam_list:
         tot = sum(freqs[f].values())
@@ -347,8 +360,8 @@ def stats_pairs():
             fam_pbar = tqdm(total=len(fam_list), desc="Pair-types in families", position=0, leave=True) 
             results = []
             allpairs = []
-            for i, _ in enumerate(p.imap_unordered(parallel_stats_pairs, fam_list)):
+            for _, newp_famdf in enumerate(p.imap_unordered(parallel_stats_pairs, fam_list)):
-                newpairs, fam_df = _
+                newpairs, fam_df = newp_famdf
                 fam_pbar.update(1)
                 results.append(fam_df)
                 allpairs.append(newpairs)
@@ -432,13 +445,14 @@ def seq_idty():
     Creates temporary results files in data/*.npy
     REQUIRES tables chain, family up to date."""
+    # List the families for which we will compute sequence identity matrices
     conn = sqlite3.connect("results/RNANet.db")
     famlist = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains > 1 ORDER BY rfam_acc ASC;") ]
     ignored = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains < 2 ORDER BY rfam_acc ASC;") ]
     if len(ignored):
         print("Idty matrices: Ignoring families with only one chain:", " ".join(ignored)+'\n')
-    # compute distance matrices
+    # compute distance matrices (or ignore if data/RF0****.npy exists)
     p = Pool(processes=8)
     p.map(to_dist_matrix, famlist)
     p.close()
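
The updated comment in `seq_idty()` says that a distance matrix is ignored when `data/RF0****.npy` already exists. A minimal sketch of that skip-if-cached pattern follows. It is not the actual body of `to_dist_matrix`, which this diff does not show: `compute_if_missing` and `fake_compute` are hypothetical helpers, the output is a placeholder byte string rather than a real NumPy array, and the family names are only sample labels.

```python
import tempfile
from pathlib import Path


def compute_if_missing(family, out_dir, compute):
    """Store compute(family) in out_dir/<family>.npy unless the file exists.

    Returns True if the computation ran, False if the cached file was reused.
    """
    target = Path(out_dir) / f"{family}.npy"
    if target.exists():
        return False  # a previous run already produced this file
    target.write_bytes(compute(family))
    return True


calls = []  # records which families were actually computed


def fake_compute(family):
    # Hypothetical stand-in for the expensive distance-matrix computation.
    calls.append(family)
    return b"placeholder"


families = ("RF00001", "RF00005")
with tempfile.TemporaryDirectory() as tmp:
    first_run = [compute_if_missing(f, tmp, fake_compute) for f in families]
    second_run = [compute_if_missing(f, tmp, fake_compute) for f in families]

print(first_run, second_run, calls)
```

A function of this shape can be mapped over the family list in the same way as `to_dist_matrix`, so that an interrupted or repeated run only recomputes the missing files.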