Pipeline class

Louis BECQUEY
Commit 56fd6817c33048c79887909c056c80db7137af04 56fd6817 1 parent cd84ecd7
Showing 17 changed files with 471 additions and 355 deletions
README.md
RNAnet.py
automate.sh
known_issues.txt
known_issues_reasons.txt
regression.py
results/distances.png
results/figures/distances.png
results/figures/lengths.png
results/figures/pairings.png
results/frequencies.csv
results/lengths.png
results/mappings_list.csv
results/pairings.csv
results/pairings.png
results/realign_jobs_performance.png
statistics.py
--- a/README.md
+++ b/README.md
@@ -2,25 +2,19 @@
 Building a dataset following the ProteinNet philosophy, but for RNA.
 
 We use the Rfam mappings between 3D structures and known Rfam families, using the sequences that are known to belong to an Rfam family (hits provided in RF0XXXX.fasta files from Rfam).
- 
 Future versions might compute a real MSA-based clustering directly with Rfamseq ncRNA sequences, like ProteinNet does with protein sequences, but this requires a tool similar to jackHMMER in the Infernal software suite, which is not available yet.
 
 This script prepares the dataset from available public data in PDB and Rfam.
- It requires solid hardware to run. (Tested on a server with 32 cores and 48GB of RAM.)
 
- # Dependencies
- You need to install Infernal, DSSR, and SINA before running this.
- I moved to python3.8.1. Unfortunately, python3.6 is no longer supported, because of changes in the multiprocessing and Threading packages. Untested with Python 3.7.*.
 
- Packages numpy, pandas, matplotlib, requests, psutil, biopython, sqlalchemy and tqdm are required.
- `python3.8 -m pip install numpy matplotlib pandas biopython psutil pymysql requests sqlalchemy tqdm`
+ **Please cite**: *Coming soon, expect it summer 2020*
 
 # What it does
 The script follows these steps:
 * Gets a list of 3D structures containing RNA from BGSU's non-redundant list (but keeps the redundant structures /!\\),
- * Asks Rfam for mappings of these structures onto Rfam families (~ a half of structures have a mapping)
+ * Asks Rfam for mappings of these structures onto Rfam families (about half of the structures have a direct mapping, some more are inferred using the redundancy list)
 * Downloads the corresponding 3D structures (mmCIFs)
- * Extracts the right chain portions that map onto an Rfam family
+ * If desired, extracts the right chain portions that map onto an Rfam family
 
 Now, compute the features:
 
@@ -32,13 +26,224 @@ Now, compute the features:
 
 Then, compute the labels:
 
- * Run DSSR on every chain to get a variety of descriptors per position, describing secondary and tertiary structure
- * This also permits to identify missing residues and compute a mask for every chain.
+ * Run DSSR on every RNA structure to get a variety of descriptors per position, describing secondary and tertiary structure. Base-pair type annotations include intra-chain and inter-chain interactions.
+ 
+ Finally, export this data from the SQLite database into flat CSV files.
+ 
+ # Output files
+ 
+ * `results/RNANet.db` is a SQLite database file containing several tables with all the information, which you can query yourself with your custom requests,
+ * `3D-folder-you-passed-in-option/datapoints/*` are flat text CSV files, one per RNA chain mapped to one RNA family, gathering the per-position nucleotide descriptors,
+ * `results/RNANET_datapoints_latest.tar.gz` is a compressed archive of the above CSV files
+ * `path-to-3D-folder-you-passed-in-option/rna_mapped_to_Rfam`: if you used the `--extract` option, this folder contains one mmCIF file per RNA chain mapped to one RNA family, without other chains or proteins (and, by default, without ions and ligands)
+ * `results/summary_latest.csv` summarizes information about the RNA chains
+ * `results/families_latest.csv` summarizes information about the RNA families
+ 
+ If you launch successive executions of RNANet, the previous tar.gz archive and the two summary CSV files are stored in the `results/archive/` folder.
+ 
+ Other folders are created and not deleted. You might want to keep them to avoid re-computations in later runs:
+ 
+ * `path-to-sequence-folder-you-passed-in-option/rfam_sequences/fasta/` contains compressed FASTA files of the homologous sequences used, by Rfam family.
+ * `path-to-sequence-folder-you-passed-in-option/realigned/` contains family covariance models (\*.cm), unaligned lists of sequences (\*.fa), and multiple sequence alignments in both Stockholm and Aligned-FASTA formats (\*.stk and \*.afa). It also contains the SINA homologous sequence databases LSU.arb and SSU.arb, and their index files (\*.sidx).
+ * `path-to-3D-folder-you-passed-in-option/RNAcifs/` contains mmCIF structures directly downloaded from the PDB, which contain RNA chains,
+ * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. You may find additional information in them which is not properly supported by RNANet yet.
+ 
+ # How to run
+ ## Dependencies
+ You need to install:
+ - DSSR: you need to register on the X3DNA forum [here](http://forum.x3dna.org/site-announcements/download-instructions/) and then download the DSSR binary [on that page](http://forum.x3dna.org/downloads/3dna-download/). You don't need the whole X3DNA suite of tools, just DSSR is fine. Make sure to have the `x3dna-dssr` binary in your $PATH variable so that RNANet.py finds it.
+ - Infernal, available from [Eddylab](http://eddylab.org/infernal/); several installation options are available depending on your preferences. Make sure to have the `cmalign` and `esl-reformat` binaries in your $PATH variable, so that RNANet.py can find them.
+ - SINA, follow [these instructions](https://sina.readthedocs.io/en/latest/install.html) for example. Make sure to have the `sina` binary in your $PATH.
+ - Python >= 3.8. (Unfortunately, Python 3.6 is no longer supported, because of changes in the multiprocessing and threading packages. Untested with Python 3.7.\*)
+ - The following Python packages: `python3.8 -m pip install numpy matplotlib pandas biopython psutil pymysql requests sqlalchemy tqdm` (`sqlite3` is part of the Python standard library and does not need to be installed with pip)
+ 
+ ## Command line
+ Run `./RNANet.py --3d-folder path/to/3D/data/folder --seq-folder path/to/sequence/data/folder [ - other options ]`. 
+ It requires solid hardware to run. On a server with 32 cores and 48GB of RAM, it takes around 15 hours the first time and about 9 hours on subsequent runs.
+ The detailed list of options is below:
+ 
+ ```
+ -h [ --help ]			Print this help message
+ --version			    Print the program version
+ 
+ -r 4.0 [ --resolution=4.0 ]	Maximum 3D structure resolution to consider a RNA chain.
+ -s				        Run statistics computations after completion
+ --extract			    Extract the portions of 3D RNA chains to individual mmCIF files.
+ --keep-hetatm=False		(True | False) Keep ions, waters and ligands in produced mmCIF files. 
+ 				        Does not affect the descriptors.
+ --fill-gaps=True		(True | False) Replace gaps in nt_align_code field due to unresolved residues
+ 				        by the most common nucleotide at this position in the alignment.
+ --3d-folder=…			Path to a folder to store the 3D data files. Subfolders will contain:
+ 					        RNAcifs/		Full structures containing RNA, in mmCIF format
+ 					        rna_mapped_to_Rfam/ or rnaonly/	Extracted 'pure' RNA chains
+ 					        datapoints/		Final results in CSV file format.
+ --seq-folder=…			Path to a folder to store the sequence and alignment files.
+ 					        rfam_sequences/fasta/	Compressed hits to Rfam families
+ 					        realigned/		Sequences, covariance models, and alignments by family
+ --no-homology			Do not try to compute PSSMs and do not align sequences.
+ 				        Allows to yield more 3D data (consider chains without a Rfam mapping).
+ 
+ --ignore-issues			Do not ignore already known issues and attempt to compute them
+ --update-homologous		Re-download Rfam sequences and SILVA arb databases, and realign all families
+ --from-scratch			Delete database, local 3D and sequence files, and known issues, and recompute.
+ ```
+ 
+ ## Post-computation task: estimate quality
+ The file statistics.py is supposed to give a summary on the produced dataset. See the results/ folder. It can be run automatically after RNANet if you pass the `-s` option.
+ 
+ # How to further filter the dataset
+ You may want to build your own sub-dataset by querying the results/RNANet.db file. Here are quick examples using Python3 and its sqlite3 package.
+ 
+ ## Filter on 3D structure resolution
+ 
+ We need to import sqlite3 and pandas packages first.
+ 
+ ```
+ import sqlite3
+ import pandas as pd
+ ```
+ 
+ Step 1 : We first get a list of chains that are below our favorite resolution threshold (here 4.0 Angströms):
+ ```
+ with sqlite3.connect("results/RNANet.db") as connection:
+     chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
+                                 FROM chain JOIN structure ON chain.structure_id = structure.pdb_id
+                                 WHERE resolution < 4.0 
+                                 ORDER BY structure_id ASC;""",
+                             con=connection)
+ 
+ ```
+ 
+ Step 2 : Then, we define a template string, containing the SQL request we use to get all information of one RNA chain, with brackets { } at the place we will insert every chain_id. 
+ You can remove fields you are not interested in.
+ ```
+ req = """SELECT index_chain, nt_resnum, nt_position, nt_name, nt_code, nt_align_code, is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, dbn, paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta, chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base,
+ v0, v1, v2, v3, v4, amplitude, phase_angle, puckering 
+ FROM 
+ (SELECT chain_id, rfam_acc from chain WHERE chain_id = {})
+ NATURAL JOIN re_mapping
+ NATURAL JOIN nucleotide
+ NATURAL JOIN align_column;"""
+ ```
+ 
+ Step 3 : Finally, we iterate over this list of chains and save their information in CSV files:
+ 
+ ```
+ with sqlite3.connect("results/RNANet.db") as connection:
+     for chain in chain_list.itertuples():
+         df = pd.read_sql(req.format(chain.chain_id), connection)
+         filename = chain.structure_id + '-' + chain.chain_name + '.csv'
+         df.to_csv(filename, float_format="%.2f", index=False)
+ 
+ ```
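
The three steps above can also be sketched with the standard library only. The example below is a minimal illustration, not RNANet's own code: it builds a tiny in-memory database whose `structure` and `chain` tables contain only a few of the columns described in this README, with made-up rows, and it passes the resolution threshold as a bound parameter. The helper name `chains_below_resolution` is hypothetical.

```python
import sqlite3

def chains_below_resolution(connection, max_resolution):
    """Return (chain_id, structure_id, chain_name) rows for structures under the threshold."""
    query = """SELECT chain_id, structure_id, chain_name
               FROM chain JOIN structure ON chain.structure_id = structure.pdb_id
               WHERE resolution < ?
               ORDER BY structure_id ASC, chain_id ASC;"""
    return connection.execute(query, (max_resolution,)).fetchall()

# Toy database standing in for results/RNANet.db (assumed, reduced schema; invented rows)
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, resolution REAL);
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, structure_id TEXT, chain_name TEXT);
    INSERT INTO structure VALUES ('1abc', 2.5), ('2xyz', 6.0);
    INSERT INTO chain VALUES (1, '1abc', 'A'), (2, '1abc', 'B'), (3, '2xyz', 'A');
""")

selected = chains_below_resolution(connection, 4.0)
# Same file naming scheme as in Step 3: <structure_id>-<chain_name>.csv
filenames = [f"{structure_id}-{chain_name}.csv" for _, structure_id, chain_name in selected]
print(filenames)
```

Binding the threshold with `?` avoids editing the SQL string each time you change the cutoff.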
+ 
+ ## Filter on 3D structure publication date
+ 
+ You might want to get only the dataset you would have had in a past year, to compare yourself with the competitors of an RNA-Puzzles problem for example.
+ Simply modify Step 1 above:
+ 
+ ```
+ with sqlite3.connect("results/RNANet.db") as connection:
+     chain_list = pd.read_sql("""SELECT chain_id, structure_id, chain_name
+                                 FROM chain JOIN structure ON chain.structure_id = structure.pdb_id
+                                 WHERE date < "2018-06-01" 
+                                 ORDER BY structure_id ASC;""",
+                             con=connection)
+ ```
+ Then proceed to steps 2 and 3.
+ 
+ ## Filter to avoid chain redundancy when several mappings are available
+ Some chains can be mapped to two (or more) RNA families, and therefore appear several times in the database.
+ If you want just one example of each RNA 3D chain, use in Step 1:
+ 
+ ```
+ with sqlite3.connect("results/RNANet.db") as connection:
+     chain_list = pd.read_sql("""SELECT DISTINCT chain_id, structure_id, chain_name
+                                 FROM chain JOIN structure ON chain.structure_id = structure.pdb_id
+                                 ORDER BY structure_id ASC;""",
+                             con=connection)
+ ```
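
If `chain_id` is unique per datapoint (one chain mapped to one family), as the table description further down suggests, another way to keep a single row per 3D chain is to group on `structure_id` and `chain_name`. The sketch below illustrates this on a made-up in-memory table. Keeping the smallest `chain_id` of each group is an arbitrary choice for the example, not something RNANet prescribes.

```python
import sqlite3

# Toy `chain` table (assumed, reduced schema; invented rows):
# chain A of structure 1abc is mapped to two families and so appears twice.
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, structure_id TEXT,
                        chain_name TEXT, rfam_acc TEXT);
    INSERT INTO chain VALUES (1, '1abc', 'A', 'RF00001'),
                             (2, '1abc', 'A', 'RF00002'),
                             (3, '2xyz', 'B', 'RF00005');
""")

# One row per (structure_id, chain_name), whatever the number of family mappings
rows = connection.execute("""SELECT MIN(chain_id), structure_id, chain_name
                             FROM chain
                             GROUP BY structure_id, chain_name
                             ORDER BY structure_id ASC;""").fetchall()
print(rows)
```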
+ 
+ # More about the database structure
+ To help you design your own requests, here follows a description of the database tables and fields.
+ 
+ ## Table `family`, for Rfam families and their properties
+ * `rfam_acc`: The family codename, from Rfam's numbering (Rfam accession number)
+ * `description`: What RNAs fit in this family
+ * `nb_homologs`: The number of hits known to be homologous downloaded from Rfam to compute nucleotide frequencies
+ * `nb_3d_chains`: The number of 3D RNA chains mapped to the family (from Rfam-PDB mappings, or inferred using the redundancy list)
+ * `nb_total_homol`: Sum of the two previous fields, the number of sequences in the multiple sequence alignment, used to compute nucleotide frequencies
+ * `max_len`: The longest RNA sequence among the homologs (in bases)
+ * `comput_time`: Time required to compute the family's multiple sequence alignment in seconds,
+ * `comput_peak_mem`: RAM (or swap) required to compute the family's multiple sequence alignment in megabytes,
+ * `idty_percent`: Average identity percentage over pairs of the 3D chains' sequences from the family
+ 
+ ## Table `structure`, for 3D structures of the PDB
+ * `pdb_id`: The 4-char PDB identifier
+ * `pdb_model`: The model used in the PDB file
+ * `date`: The first submission date of the 3D structure to a public database
+ * `exp_method`: A string to know whether the structure has been obtained by X-ray crystallography ('X-RAY DIFFRACTION'), electron microscopy ('ELECTRON MICROSCOPY'), or NMR (not seen yet)
+ * `resolution`: Resolution of the structure, in Angströms
+ 
+ ## Table `chain`, for the datapoints: one chain mapped to one Rfam family
+ * `chain_id`: A unique identifier
+ * `structure_id`: The `pdb_id` where the chain comes from
+ * `chain_name`: The chain label, extracted from the 3D file
+ * `pdb_start`: Position in the chain where the mapping to Rfam begins (absolute position, not residue number)
+ * `pdb_end`: Position in the chain where the mapping to Rfam ends (absolute position, not residue number)
+ * `reversed`: Whether the mapping numbering order differs from the residue numbering order in the mmCIF file (e.g. 4c9d, chains C and D)
+ * `issue`: Whether an issue occurred with this structure while downloading, extracting, annotating or parsing the annotation. Chains with issues are removed from the dataset (Only one known to date: 1gsg, chain T, which is too short)
+ * `rfam_acc`: The family which the chain is mapped to
+ * `inferred`: Whether the mapping has been inferred using the redundancy list (value is 1) or just known from Rfam-PDB mappings (value is 0)
+ * `chain_freq_A`, `chain_freq_C`, `chain_freq_G`, `chain_freq_U`, `chain_freq_other`: Nucleotide frequencies in the chain
+ * `pair_count_cWW`, `pair_count_cWH`, ... `pair_count_tSS`: Counts of the non-canonical base-pair types in the chain (intra-chain counts only)
+ 
+ ## Table `nucleotide`, for individual nucleotide descriptors
+ * `nt_id`: A unique identifier
+ * `chain_id`: The chain the nucleotide belongs to
+ * `index_chain`: its absolute position within the portion of chain mapped to Rfam, from 1 to X. This is completely uncorrelated to any gene start or 3D chain residue numbers.
+ * `nt_position`: relative position within the portion of chain mapped to Rfam, from 0 to 1
+ * `nt_resnum`: The residue number in the 3D mmCIF file
+ * `nt_name`: The residue type. This includes modified nucleotide names (e.g. 5MC for 5-methylcytosine)
+ * `nt_code`: One-letter name. Lowercase "acgu" letters are used for modified "ACGU" bases.
+ * `nt_align_code`: One-letter name used for sequence alignment. Contains "ACGUN-" only first, and then, gaps may be replaced by the most common letter at this position (default)
+ * `is_A`, `is_C`, `is_G`, `is_U`, `is_other`: One-hot encoding of the nucleotide base
+ * `dbn`: character used at this position if we look at the dot-bracket encoding of the secondary structure. Includes inter-chain (RNA complexes) contacts.
+ * `paired`: empty, or comma separated list of `index_chain` values referring to nucleotides the base is interacting with. Up to 3 values. Inter-chain interactions are marked paired to '0'.
+ * `nb_interact`: number of interactions with other nucleotides. Up to 3 values. Includes inter-chain interactions.
+ * `pair_type_LW`: The Leontis-Westhof nomenclature codes of the interactions. The first letter concerns cis/trans orientation, the second this base's side interacting, and the third the other base's side.
+ * `pair_type_DSSR`: Same but using the DSSR nomenclature (Hoogsteen edge approximately corresponds to Major-groove and Sugar edge to minor-groove)
+ * `alpha`, `beta`, `gamma`, `delta`, `epsilon`, `zeta`: The 6 torsion angles of the RNA backbone for this nucleotide
+ * `epsilon_zeta`: Difference between epsilon and zeta angles
+ * `bb_type`: conformation of the backbone (BI, BII or ..)
+ * `chi`: torsion angle between the sugar and base (O-C1'-N-C4)
+ * `glyco_bond`: syn or anti configuration of the sugar-base bond
+ * `v0`, `v1`, `v2`, `v3`, `v4`: 5 torsion angles of the ribose cycle
+ * `form`: if the nucleotide is involved in a stem, the stem type (A, B or Z)
+ * `ssZp`: Z-coordinate of the 3’ phosphorus atom with reference to the 5’ base plane
+ * `Dp`: Perpendicular distance of the 3’ P atom to the glycosidic bond
+ * `eta`, `theta`: Pseudotorsions of the backbone, using phosphorus and carbon 4'
+ * `eta_prime`, `theta_prime`: Pseudotorsions of the backbone, using phosphorus and carbon 1'
+ * `eta_base`, `theta_base`: Pseudotorsions of the backbone, using phosphorus and the base center
+ * `phase_angle`: Conformation of the ribose cycle
+ * `amplitude`: Amplitude of the sugar puckering
+ * `puckering`: Conformation of the ribose cycle (10 classes depending on the phase_angle value)
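
As an illustration of how the `paired` field described above can be consumed, here is a small sketch that splits it into intra-chain partners and inter-chain contacts. It relies only on what the description states (empty, or a comma-separated list of `index_chain` values, with '0' marking an inter-chain interaction). The function name `parse_paired` is hypothetical.

```python
def parse_paired(paired):
    """Split a `paired` value into (intra-chain partner indices, number of inter-chain contacts)."""
    if paired is None or paired.strip() == "":
        return [], 0
    values = [int(v) for v in paired.split(",")]
    intra_chain = [v for v in values if v != 0]       # index_chain values of the partners
    inter_chain = sum(1 for v in values if v == 0)    # '0' marks a partner in another chain
    return intra_chain, inter_chain

print(parse_paired("12,0,45"))
```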
+ 
+ ## Table `align_column`, for positions in multiple sequence alignments
+ * `column_id`: A unique identifier
+ * `rfam_acc`: The family's MSA the column belongs to
+ * `index_ali`: Position of the column in the alignment (starts at 1)
+ * `freq_A`, `freq_C`, `freq_G`, `freq_U`, `freq_other`: Nucleotide frequencies in the alignment at this position
 
- Finally, store this data into files.
+ There always is an entry, for each family (rfam_acc), with index_ali = zero and nucleotide frequencies set to freq_other = 1.0. This entry is used when the nucleotide frequencies cannot be determined because of local alignment issues.
 
- # Dataset quality
- The file statistics.py is supposed to give a summary on the produced dataset. See the results/ folder.
+ ## Table `re_mapping`, to map a nucleotide to an alignment column
+ * `remapping_id`: A unique identifier
+ * `chain_id`: The chain which is mapped to an alignment
+ * `index_chain`: The absolute position of the nucleotide in the chain (from 1 to X)
+ * `index_ali`: The position of that nucleotide in its family alignment
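
To see how `nucleotide`, `re_mapping` and `align_column` fit together, here is a small sketch on a made-up in-memory database. It writes the join conditions explicitly instead of relying on `NATURAL JOIN`. Only a few of the documented columns are created and all rows are invented, so treat it as an illustration of the relationships rather than the real schema.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, rfam_acc TEXT);
    CREATE TABLE nucleotide (nt_id INTEGER PRIMARY KEY, chain_id INTEGER,
                             index_chain INTEGER, nt_code TEXT);
    CREATE TABLE re_mapping (remapping_id INTEGER PRIMARY KEY, chain_id INTEGER,
                             index_chain INTEGER, index_ali INTEGER);
    CREATE TABLE align_column (column_id INTEGER PRIMARY KEY, rfam_acc TEXT,
                               index_ali INTEGER, freq_A REAL, freq_other REAL);
    INSERT INTO chain VALUES (1, 'RF00001');
    INSERT INTO nucleotide VALUES (1, 1, 1, 'G'), (2, 1, 2, 'A');
    -- The second nucleotide is mapped to the special column index_ali = 0
    INSERT INTO re_mapping VALUES (1, 1, 1, 5), (2, 1, 2, 0);
    INSERT INTO align_column VALUES (1, 'RF00001', 0, 0.0, 1.0),
                                    (2, 'RF00001', 5, 0.25, 0.0);
""")

rows = connection.execute("""
    SELECT n.index_chain, n.nt_code, a.index_ali, a.freq_A, a.freq_other
    FROM chain c
    JOIN nucleotide n ON n.chain_id = c.chain_id
    JOIN re_mapping r ON r.chain_id = n.chain_id AND r.index_chain = n.index_chain
    JOIN align_column a ON a.rfam_acc = c.rfam_acc AND a.index_ali = r.index_ali
    WHERE c.chain_id = ?
    ORDER BY n.index_chain;""", (1,)).fetchall()
print(rows)
```

The nucleotide mapped to `index_ali = 0` picks up the fallback entry with `freq_other = 1.0`.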
 
 # Contact
 louis.becquey@univ-evry.fr
--- a/RNAnet.py
+++ b/RNAnet.py
--- a/automate.sh 0 → 100644
+++ b/automate.sh 0 → 100644
+ # This is a script supposed to be run periodically as a cron job
+ 
+ # Run RNANet
+ cd /home/lbecquey/Projects/RNANet;
+ rm -f stdout.txt stderr.txt errors.txt;
+ time ./RNAnet.py --3d-folder /home/lbecquey/Data/RNA/3D/ --seq-folder /home/lbecquey/Data/RNA/sequences/ > stdout.txt 2> stderr.txt;
+ 
+ # Sync in Seafile
+ seaf-cli start;
+ 
+ seaf-cli stop;
+ 
--- a/known_issues.txt 0 → 100644
+++ b/known_issues.txt 0 → 100644
+ 1gsg_1_T_1-72
--- a/known_issues_reasons.txt 0 → 100644
+++ b/known_issues_reasons.txt 0 → 100644
+ 1gsg_1_T_1-72
+ DSSR warning for 1gsg_1_T_1-72: no nucleotides found
+ 
--- a/regression.py
+++ b/regression.py
 #!/usr/bin/python3.8
 # This file is supposed to propose regression models on the computation time and mem usage of the re-alignment jobs.
- # Light jobs are monitored by the Monitor class in RNAnet.py, and the measures are saved in jobstats.csv.
+ # Jobs are monitored by the Monitor class in RNAnet.py, and the measures are saved in jobstats.csv.
 # This was done to guess the amount of memory required to re-align the large ribosomal subunit families RF02541 and RF02543.
- # INFO: Our home hardware was a 32-core VM with 50GB RAM + 8GB Swap.
+ # INFO: Our home hardware was a 32-core VM with 50GB RAM
+ 
+ # The conclusion of this was to move to SINA for ribosomal subunits. 
+ # However, this was before we used cmalign with --small, which is after all required for RF00005, RF00382 and RF01852 
+ # (we do not understand why the last two, which are very small families, require that much memory). 
+ # Feedback would be appreciated on whether it is better to 
+ #   - Use a specialised database (SILVA): better alignments (we guess?), but two kinds of jobs
+ #   - Use cmalign --small everywhere (homogeneity)
+ # Moreover, --small requires --nonbanded --cyk, which means the output alignment is the optimally scored one. 
+ # To date, we trust Infernal as the best tool to realign RNA. Is it?
+ 
+ # Contact: louis.becquey@univ-evry.fr (PhD student), fariza.tahi@univ-evry.fr (PI)
+ 
+ # Running this file is not required to compute the dataset.
 
 import matplotlib.pyplot as plt
 import pandas as pd
 import numpy as np
- import scipy, os
- from sklearn.linear_model import LinearRegression
+ import scipy, os, sqlite3
+ # from sklearn.linear_model import LinearRegression
 from mpl_toolkits.mplot3d import Axes3D
+ pd.set_option('display.max_rows', None)
+ 
+ LSU_set = ["RF00002", "RF02540", "RF02541", "RF02543", "RF02546"]   # From Rfam CLAN 00112
+ SSU_set = ["RF00177", "RF02542",  "RF02545", "RF01959", "RF01960"]  # From Rfam CLAN 00111
+ 
+ with sqlite3.connect("results/RNANet.db") as conn:
+     df = pd.read_sql("SELECT rfam_acc, max_len, nb_total_homol, comput_time, comput_peak_mem FROM family;", conn)
 
- jobstats = pd.read_csv("data/jobstats.csv", sep=",")
- families = pd.read_csv("data/statistics.csv", sep=",")
- 
- computed_families = []
- comptimes = []
- maxmem = []
- nchains = []
- maxlengths = []
- 
- for index, fam in jobstats.iterrows():
-     if fam["max_mem"] != -1 and fam["comp_time"] != -1:
-         rfam_acc = fam["label"].split(' ')[1]
-         computed_families.append(rfam_acc)
-         comptimes.append(fam["comp_time"])
-         maxmem.append(fam["max_mem"])
-         nchains.append(
-             families.loc[families["rfam_acc"] == rfam_acc, "total_seqs"].values[0])
-         maxlengths.append(
-             families.loc[families["rfam_acc"] == rfam_acc, "maxlength"].values[0])
- 
- comptimes = [x/3600 for x in comptimes]  # compte en heures
- maxlengths = [x/1000 for x in maxlengths]  # compte en kB
- maxmem = [x/1024/1024 for x in maxmem]  # compte en MB
- 
- summary = pd.DataFrame({"family": computed_families, "n_chains": nchains,
-                         "max_length(kB)": maxlengths, "comp_time(h)": comptimes, "max_mem(MB)": maxmem})
- summary.sort_values("max_length(kB)", inplace=True)
- summary.to_csv("results/summary.csv")
+ to_remove = [ f for f in df.rfam_acc if f in LSU_set+SSU_set ]
+ df = df.set_index('rfam_acc').drop(to_remove)
+ print(df)
 
 # ========================================================
 # Plot the data
@@ -47,39 +42,39 @@ summary.to_csv("results/summary.csv")
 fig = plt.figure(figsize=(12,8), dpi=100)
 
 plt.subplot(231)
- plt.scatter(summary.n_chains, summary["max_mem(MB)"])
+ plt.scatter(df.nb_total_homol, df.comput_peak_mem)
 plt.xlabel("Number of sequences")
 plt.ylabel("Peak memory (MB)")
 
 plt.subplot(232)
- plt.scatter(summary["max_length(kB)"], summary["max_mem(MB)"])
- plt.xlabel("Maximum length of sequences (kB)")
+ plt.scatter(df.max_len, df.comput_peak_mem)
+ plt.xlabel("Maximum length of sequences ")
 plt.ylabel("Peak memory (MB)")
 
 ax = fig.add_subplot(233, projection='3d')
- ax.scatter(summary.n_chains, summary["max_length(kB)"], summary["max_mem(MB)"])
+ ax.scatter(df.nb_total_homol, df.max_len, df.comput_peak_mem)
 ax.set_xlabel("Number of sequences")
- ax.set_ylabel("Maximum length of sequences (kB)")
+ ax.set_ylabel("Maximum length of sequences ")
 ax.set_zlabel("Peak memory (MB)")
 
 plt.subplot(234)
- plt.scatter(summary.n_chains, summary["comp_time(h)"])
+ plt.scatter(df.nb_total_homol, df.comput_time)
 plt.xlabel("Number of sequences")
- plt.ylabel("Computation time (h)")
+ plt.ylabel("Computation time (s)")
 
 plt.subplot(235)
- plt.scatter(summary["max_length(kB)"], summary["comp_time(h)"])
- plt.xlabel("Maximum length of sequences (kB)")
- plt.ylabel("Computation time (h)")
+ plt.scatter(df.max_len, df.comput_time)
+ plt.xlabel("Maximum length of sequences ")
+ plt.ylabel("Computation time (s)")
 
 ax = fig.add_subplot(236, projection='3d')
- ax.scatter(summary.n_chains, summary["max_length(kB)"], summary["comp_time(h)"])
+ ax.scatter(df.nb_total_homol, df.max_len, df.comput_time)
 ax.set_xlabel("Number of sequences")
- ax.set_ylabel("Maximum length of sequences (kB)")
- ax.set_zlabel("Computation time (h)")
+ ax.set_ylabel("Maximum length of sequences ")
+ ax.set_zlabel("Computation time (s)")
 
 plt.subplots_adjust(wspace=0.4)
- plt.savefig("results/realign_jobs_performance.png")
+ plt.savefig("results/cmalign_jobs_performance.png")
 
 # # ========================================================
 # # Linear Regression of max_mem as function of max_length
@@ -87,20 +82,20 @@ plt.savefig("results/realign_jobs_performance.png")
 
 # # With scikit-learn
 # model = LinearRegression(normalize=True, n_jobs=-1)
- # model.fit(summary["max_length(kB)"].values.reshape(-1, 1), summary["max_mem(MB)"])
+ # model.fit(df.max_len.values.reshape(-1, 1), df.comput_peak_mem)
 # b0 = model.intercept_
 # b1 = model.coef_[0]
 # print(f"peak_mem = {b0:.0f} + {b1:.0f} * max_length")
 
 # # with scipy
 # coeffs = scipy.optimize.curve_fit(  lambda t, B0, B1: B0+np.exp(B1*t), 
- #                                     summary["max_length(kB)"].values, 
- #                                     summary["max_mem(MB)"].values
+ #                                     df.max_len.values, 
+ #                                     df.comput_peak_mem.values
 #                                  )[0]
 # print(f"peak_mem = {coeffs[0]:.0f} + e^({coeffs[1]:.0f} * max_length)")
 # coeffs_log = scipy.optimize.curve_fit(  lambda t, B0, B1: B0+B1*np.log(t),
- #                                         summary["max_length(kB)"].values, 
- #                                         summary["max_mem(MB)"].values,
+ #                                         df.max_len.values, 
+ #                                         df.comput_peak_mem.values,
 #                                         p0=(400, 12000)
 #                                      )[0]
 # print(f"peak_mem = {coeffs_log[0]:.0f} + {coeffs_log[1]:.0f} * log(max_length)")
@@ -108,8 +103,8 @@ plt.savefig("results/realign_jobs_performance.png")
 # # Re-plot
 # x = np.linspace(0, 10, 1000)
 # plt.figure()
- # plt.scatter(summary["max_length(kB)"], summary["max_mem(MB)"])
- # plt.xlabel("Maximum length of sequences (kB)")
+ # plt.scatter(df.max_len, df.comput_peak_mem)
+ # plt.xlabel("Maximum length of sequences ")
 # plt.ylabel("Peak memory (MB)")
 # plt.plot(x, b0 + b1*x, "-r", label="linear fit")
 # plt.plot(x, coeffs[0] + np.exp(coeffs[1]*x), "-g", label="expo fit")
@@ -123,7 +118,7 @@ plt.savefig("results/realign_jobs_performance.png")
 
 # # With scikit-learn
 # model = LinearRegression(normalize=True, n_jobs=-1)
- # model.fit(summary.n_chains.values.reshape(-1, 1), summary["comp_time(h)"])
+ # model.fit(df.nb_total_homol.values.reshape(-1, 1), df.comput_time)
 # b0 = model.intercept_
 # b1 = model.coef_[0]
 # print(f"comp_time = {b0:.3f} + {b1:.3f} * n_chains")
@@ -131,9 +126,9 @@ plt.savefig("results/realign_jobs_performance.png")
 # # Re-plot
 # x = np.linspace(0, 500000, 1000)
 # plt.figure()
- # plt.scatter(summary.n_chains, summary["comp_time(h)"])
+ # plt.scatter(df.nb_total_homol, df.comput_time)
 # plt.xlabel("Number of sequences")
- # plt.ylabel("Computation time (h)")
+ # plt.ylabel("Computation time (s)")
 # plt.plot(x, b0 + b1*x, "-r", label="linear fit")
 # plt.legend()
 # plt.savefig("results/regression/comp_time_linear_model.png")
--- a/results/distances.png deleted 100644 → 0
+++ b/results/distances.png deleted 100644 → 0
--- a/results/figures/distances.png deleted 100644 → 0
+++ b/results/figures/distances.png deleted 100644 → 0
--- a/results/figures/lengths.png deleted 100644 → 0
+++ b/results/figures/lengths.png deleted 100644 → 0
--- a/results/figures/pairings.png deleted 100644 → 0
+++ b/results/figures/pairings.png deleted 100644 → 0
--- a/results/frequencies.csv deleted 100644 → 0
+++ b/results/frequencies.csv deleted 100644 → 0
- ,G,C,A,U,-,A2M,OMU,OMG,OMC,7MG,PSU,5MU,4SU,MIA,H2U,U8U,T6A,DJF,6MZ,CM0,5MC,2MG,1MA,YYG,M2G,2MA,QUO,G7M,4OC,YG,AET,2MU,12A,70U,6IA,1MG,GTP,574,I,RSP,RIA,3AU,AG9,ANZ,1RN,N79,365,UBD,9QV,CCC,IU,MA6,UR3,A3P,A23,23G,N,GDP,CBV,4AC,M7A,E3C,B8Q,B8N,C4J,M1Y,JMH,3TD,B9B,E7G,B9H,P7G,I4U,B8H,P4U,B8W,P5P,Y5P,B8T,B8K,E6G,BGH,MHG
- RF00001,33.99%,29.98%,20.01%,16.01%,0.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00002,26.80%,23.51%,27.36%,21.86%,0.43%,0.01%,0.02%,<.01%,<.01%,<.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00004,18.12%,16.77%,23.33%,25.90%,15.82%,0,0,0,0,0,0.06%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00005,31.37%,27.32%,19.93%,17.61%,1.23%,0,<.01%,0.03%,0.07%,0.18%,0.73%,0.41%,0.33%,0.15%,0.20%,0.02%,0.02%,<.01%,0.02%,0.02%,0.14%,0.02%,0.02%,<.01%,0.02%,<.01%,0.02%,0.02%,0.01%,0.01%,<.01%,<.01%,<.01%,<.01%,<.01%,0.02%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00008,31.25%,26.35%,24.16%,18.24%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00009,31.11%,26.48%,20.69%,21.71%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00010,35.64%,29.65%,17.52%,11.12%,6.07%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00011,21.41%,15.95%,17.10%,11.65%,33.89%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00013,25.23%,24.32%,21.62%,19.82%,9.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00015,18.15%,14.11%,19.30%,23.34%,25.10%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00017,32.85%,24.43%,19.37%,14.49%,8.73%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.13%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00020,16.76%,19.36%,20.57%,30.63%,12.69%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00023,31.22%,22.68%,11.46%,16.10%,16.59%,0,0,0,0,0,0.98%,0.98%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00026,18.40%,16.77%,25.32%,26.02%,13.45%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.02%,0.02%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00028,27.66%,20.61%,28.66%,22.05%,1.02%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00029,32.73%,21.82%,26.91%,18.55%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00032,17.00%,40.32%,22.92%,19.76%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00037,23.33%,20.00%,23.33%,31.67%,1.67%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00050,28.57%,16.07%,27.68%,23.21%,2.68%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.89%,0,0,0,0,0,0,0,0,0,0,0,0,0.89%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00059,31.16%,23.60%,22.54%,20.11%,2.17%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.21%,0,0,0,0,0,0,0,0,0,0,0,0,0.21%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00061,26.51%,23.06%,15.52%,14.87%,20.04%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00080,28.00%,17.41%,31.06%,22.00%,1.53%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00100,26.23%,24.59%,9.84%,21.31%,18.03%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00162,31.84%,23.64%,29.81%,14.68%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.03%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00164,34.88%,23.26%,25.58%,16.28%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00167,23.89%,22.76%,26.40%,26.79%,0.06%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.03%,0.06%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00168,39.10%,26.42%,19.10%,15.37%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00169,36.50%,31.99%,22.19%,9.32%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00174,35.38%,25.15%,19.59%,12.28%,7.60%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00177,33.24%,24.72%,22.68%,16.89%,2.29%,0,0,<.01%,0,0.01%,0.02%,<.01%,<.01%,0,<.01%,0,0,0,0,0,0.05%,0.02%,0,0,0.01%,0,0,<.01%,0.01%,0,0,0,0,0,<.01%,<.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.03%,0.01%,<.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00233,28.21%,29.49%,21.79%,20.51%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00234,29.82%,20.65%,23.80%,24.45%,0.65%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.49%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.14%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00250,17.65%,29.41%,35.29%,17.65%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00379,28.42%,24.27%,24.32%,19.37%,3.57%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.05%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00380,24.26%,21.94%,27.64%,24.47%,1.69%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00382,36.40%,26.00%,20.97%,16.63%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00390,13.04%,17.39%,30.43%,39.13%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00442,30.34%,21.35%,28.09%,19.10%,1.12%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00458,18.53%,16.06%,28.60%,30.36%,6.44%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00488,18.10%,13.22%,18.91%,26.63%,22.54%,0,0,0,0,0,0.06%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.53%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00504,30.89%,21.55%,30.64%,14.69%,0.54%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.08%,0,0,0,0,0,0,0,0.60%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF00505,28.33%,28.33%,11.67%,26.67%,5.00%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01051,28.75%,25.58%,26.41%,13.07%,6.19%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01357,32.00%,24.00%,20.00%,16.00%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8.00%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01510,21.88%,24.22%,28.12%,25.78%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01689,25.81%,22.04%,31.72%,20.43%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01725,37.91%,29.67%,25.27%,7.14%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01734,31.37%,31.37%,21.57%,15.69%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01739,32.79%,27.87%,24.59%,14.75%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01750,32.13%,23.91%,23.19%,15.22%,5.56%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01763,38.27%,29.64%,18.76%,6.94%,2.25%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.13%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01786,27.03%,17.57%,27.03%,27.03%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.35%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01807,27.17%,23.91%,26.63%,22.28%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01826,19.23%,15.38%,32.69%,23.08%,7.69%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.92%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01831,33.06%,19.76%,25.41%,21.77%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01846,24.93%,22.72%,15.88%,19.09%,17.37%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01852,28.27%,21.04%,27.74%,22.33%,0.55%,0,0,0,0,0,0.02%,0.02%,<.01%,0,0.02%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01854,33.22%,28.24%,20.27%,18.27%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01857,37.74%,29.53%,18.80%,13.93%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01960,24.97%,19.86%,24.33%,24.49%,6.22%,0.02%,<.01%,0.01%,0.01%,<.01%,0.02%,<.01%,0,0,0,0,0,0,<.01%,0,<.01%,<.01%,0,0,0,0,0,0,<.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,<.01%,<.01%,0,0,0,0.01%,0,0,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF01998,32.10%,21.36%,27.78%,17.72%,1.05%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02001,26.78%,17.17%,32.96%,21.51%,1.58%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02012,29.11%,22.15%,23.42%,24.89%,0.42%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02253,20.69%,24.14%,27.59%,27.59%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02348,21.52%,16.46%,36.71%,25.32%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02519,23.53%,14.71%,29.41%,29.41%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.94%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02540,29.03%,23.70%,24.55%,17.33%,5.28%,0,0.02%,0.02%,0,0,0.02%,0,0,0,0,0,0,0,0,0,0,0,0.02%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.02%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02541,33.08%,24.10%,23.09%,16.25%,3.38%,0,<.01%,<.01%,<.01%,<.01%,0.04%,0.01%,0,0,<.01%,0,0,0,<.01%,0,0.01%,<.01%,0,0,0,<.01%,0,<.01%,<.01%,0,0,<.01%,0,0,0,<.01%,<.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,<.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02543,25.03%,18.64%,20.96%,18.86%,16.44%,0.01%,<.01%,0.02%,<.01%,<.01%,<.01%,<.01%,0,0,<.01%,0,0,0,<.01%,0,<.01%,<.01%,<.01%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,<.01%,0,0,0,0,0,0,0,<.01%,0,<.01%,0,0,0,<.01%,0,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%,<.01%
- RF02545,9.88%,4.94%,35.83%,38.95%,10.39%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02546,2.40%,1.07%,16.80%,11.73%,68.00%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02553,32.50%,22.50%,20.00%,23.75%,1.25%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02680,28.71%,29.70%,19.80%,18.81%,1.98%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.99%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02683,31.40%,24.42%,29.07%,13.95%,1.16%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
- RF02796,33.33%,36.84%,17.54%,12.28%,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
--- a/results/lengths.png deleted 100644 → 0
+++ b/results/lengths.png deleted 100644 → 0
--- a/results/mappings_list.csv deleted 100644 → 0
+++ b/results/mappings_list.csv deleted 100644 → 0
--- a/results/pairings.csv deleted 100644 → 0
+++ b/results/pairings.csv deleted 100644 → 0
- ,cWW,tSH,tWH,tHS,other,tWW,tSS,tHW,cSH,cSW,cSS,tSW,cWH,cWS,tWS,tHH,cHW,cHH,cHS
- RF00001,61.87%,4.31%,3.21%,1.98%,3.33%,0.42%,0.97%,2.64%,5.30%,5.61%,0.11%,4.14%,0.61%,3.04%,0.93%,0.53%,0.89%,<.01%,0.10%
- RF00002,62.36%,5.36%,2.71%,6.11%,1.72%,2.25%,1.23%,2.54%,1.87%,4.10%,0.63%,1.50%,1.14%,0.68%,0.57%,3.20%,1.38%,0.59%,0.05%
- RF00004,85.28%,3.30%,5.23%,0.96%,0.69%,0.14%,0 %,0 %,0.28%,0.28%,0 %,0.69%,0.55%,0 %,0 %,0 %,0.28%,0.28%,2.06%
- RF00005,70.47%,0.91%,6.92%,0.09%,1.74%,3.56%,0.08%,3.29%,0.53%,0.52%,0.22%,1.75%,1.24%,2.00%,2.31%,1.71%,0.65%,0.48%,1.53%
- RF00008,64.74%,4.62%,8.09%,2.89%,1.16%,0 %,0 %,0 %,1.16%,5.20%,0 %,1.16%,0.58%,4.05%,4.62%,1.73%,0 %,0 %,0 %
- RF00009,81.68%,0.58%,2.53%,0.58%,0.97%,0 %,0.39%,1.36%,1.17%,2.73%,0.97%,2.34%,0.58%,0.78%,0.78%,0 %,1.36%,0.39%,0.78%
- RF00010,69.24%,2.58%,4.60%,0.37%,3.31%,0.55%,1.29%,0.92%,2.03%,2.76%,2.39%,2.76%,0.18%,1.84%,1.66%,0.55%,2.21%,0 %,0.74%
- RF00011,64.71%,4.50%,4.50%,1.04%,3.46%,2.08%,2.42%,2.77%,3.11%,1.04%,1.38%,2.08%,2.08%,1.04%,1.04%,1.04%,1.73%,0 %,0 %
- RF00013,89.66%,3.45%,0 %,0 %,3.45%,3.45%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF00015,86.76%,4.18%,0.70%,3.48%,0.70%,0 %,0 %,0 %,0.70%,0.35%,0 %,1.74%,0.35%,0 %,0 %,0.35%,0 %,0.70%,0 %
- RF00017,75.15%,2.90%,3.05%,0.76%,3.35%,2.74%,0.46%,1.68%,1.07%,0.30%,2.13%,2.59%,1.68%,0.30%,0 %,0 %,0.91%,0.91%,0 %
- RF00020,88.26%,0.73%,2.39%,0.37%,0.55%,0.73%,0 %,0 %,0.73%,1.10%,1.28%,1.10%,0.37%,1.28%,0 %,0 %,0.73%,0 %,0.37%
- RF00023,73.83%,1.87%,12.15%,0.93%,1.87%,0.93%,0 %,0.93%,0 %,1.87%,0 %,0 %,0 %,1.87%,3.74%,0 %,0 %,0 %,0 %
- RF00026,81.41%,3.66%,6.15%,1.17%,0.44%,1.17%,0 %,0 %,0.29%,0.44%,0.15%,1.02%,0.29%,0.29%,0.44%,0.15%,0.15%,0.29%,2.49%
- RF00028,65.73%,2.86%,2.64%,3.83%,2.16%,1.62%,2.91%,2.05%,3.12%,1.29%,1.94%,0.38%,1.67%,0.54%,1.45%,0.22%,4.58%,0.86%,0.16%
- RF00029,80.70%,6.14%,0 %,0 %,0 %,3.51%,0 %,3.51%,0 %,0.88%,0 %,0 %,0.88%,0.88%,0 %,0 %,0.88%,0 %,2.63%
- RF00032,100.00%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF00037,100.00%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF00050,68.39%,3.87%,7.74%,3.87%,2.26%,0.32%,5.48%,0 %,0 %,0 %,5.81%,0 %,0 %,0.32%,0 %,0 %,1.94%,0 %,0 %
- RF00059,60.28%,1.50%,4.97%,3.70%,2.54%,1.85%,5.31%,0 %,0 %,0 %,7.16%,4.97%,4.50%,0.35%,0.12%,1.85%,0.23%,0.69%,0 %
- RF00061,77.86%,3.05%,2.29%,2.29%,0 %,2.29%,0 %,1.53%,2.29%,0 %,0 %,0.76%,0.76%,2.29%,0 %,1.53%,2.29%,0 %,0.76%
- RF00080,84.19%,6.45%,0 %,0 %,2.26%,0 %,1.94%,0 %,4.19%,0 %,0 %,0.65%,0 %,0 %,0 %,0 %,0 %,0 %,0.32%
- RF00100,65.22%,0 %,4.35%,0 %,5.07%,0.72%,0 %,8.70%,0 %,0 %,0 %,2.90%,13.04%,0 %,0 %,0 %,0 %,0 %,0 %
- RF00162,73.74%,6.90%,0.07%,2.15%,0.96%,0 %,0.59%,0 %,2.52%,2.82%,4.15%,2.37%,0.07%,0.45%,3.04%,0 %,0 %,0.15%,0 %
- RF00164,76.19%,4.76%,0 %,0 %,0 %,0 %,0 %,0 %,4.76%,4.76%,9.52%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF00167,67.80%,0 %,7.77%,0.23%,2.51%,0 %,0 %,2.63%,2.22%,3.10%,2.63%,2.98%,0 %,5.14%,2.63%,0.29%,0.06%,0 %,0 %
- RF00168,76.92%,4.74%,1.95%,2.41%,0.45%,1.20%,1.20%,2.41%,3.23%,1.20%,0.68%,1.43%,0.98%,0 %,0 %,1.20%,0 %,0 %,0 %
- RF00169,70.92%,9.56%,3.19%,0.80%,4.78%,0 %,0.40%,9.16%,0 %,0 %,0 %,0 %,0.80%,0 %,0.40%,0 %,0 %,0 %,0 %
- RF00174,71.01%,2.90%,5.07%,4.35%,2.90%,0.72%,1.45%,2.17%,0 %,2.17%,2.90%,1.45%,0.72%,2.17%,0 %,0 %,0 %,0 %,0 %
- RF00177,63.05%,3.95%,4.48%,2.84%,3.20%,2.13%,2.18%,2.57%,2.50%,2.24%,2.00%,1.72%,2.02%,1.58%,1.44%,0.78%,0.70%,0.34%,0.29%
- RF00233,72.06%,1.47%,7.35%,2.94%,0 %,2.94%,0 %,0 %,4.41%,0 %,2.94%,1.47%,2.94%,0 %,0 %,0 %,1.47%,0 %,0 %
- RF00234,73.03%,1.96%,0.68%,0.64%,1.28%,1.96%,2.42%,5.29%,2.92%,0.59%,0.41%,7.07%,1.32%,0 %,0.23%,0 %,0.18%,0 %,0 %
- RF00250,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF00379,71.10%,6.46%,1.46%,7.07%,1.10%,0.12%,3.29%,0.24%,2.93%,1.46%,1.95%,1.59%,0.61%,0 %,0 %,0.12%,0.49%,0 %,0 %
- RF00380,64.46%,5.37%,1.24%,2.07%,6.20%,3.31%,2.89%,4.96%,2.48%,1.24%,2.07%,0 %,0 %,1.24%,1.24%,0 %,1.24%,0 %,0 %
- RF00382,50.00%,0 %,0 %,0 %,20.59%,0 %,0 %,0 %,0 %,0 %,0 %,2.94%,20.59%,0 %,0 %,0 %,0 %,5.88%,0 %
- RF00390,55.17%,0 %,0 %,0 %,6.90%,0 %,0 %,0 %,13.79%,6.90%,0 %,0 %,17.24%,0 %,0 %,0 %,0 %,0 %,0 %
- RF00442,56.52%,6.52%,6.52%,2.17%,8.70%,2.17%,2.17%,2.17%,0 %,4.35%,2.17%,0 %,4.35%,0 %,0 %,2.17%,0 %,0 %,0 %
- RF00458,70.22%,3.37%,5.06%,0 %,5.34%,1.97%,0 %,1.40%,1.97%,1.97%,0.28%,0.28%,2.81%,1.97%,0.84%,0.84%,0.56%,0.84%,0.28%
- RF00488,91.95%,0.20%,0 %,0.20%,0.80%,1.41%,0.10%,0.50%,0.91%,1.21%,0.10%,0.30%,0.70%,0.70%,0 %,0 %,0.30%,0.50%,0.10%
- RF00504,72.66%,3.88%,2.59%,7.77%,3.02%,0 %,2.45%,0.29%,2.59%,0 %,1.58%,0 %,0 %,0 %,0.14%,0.14%,2.88%,0 %,0 %
- RF00505,100.00%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF01051,64.48%,5.37%,0 %,2.84%,4.93%,0 %,2.84%,4.18%,4.33%,2.09%,1.49%,1.94%,0.60%,3.43%,0.60%,0.60%,0 %,0.15%,0.15%
- RF01357,80.00%,10.00%,0 %,0 %,0 %,0 %,0 %,0 %,10.00%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF01510,85.62%,0 %,0 %,0 %,1.09%,0 %,0 %,0 %,3.27%,0.22%,0 %,0 %,0 %,6.32%,3.49%,0 %,0 %,0 %,0 %
- RF01689,75.95%,3.80%,5.06%,0 %,1.27%,5.06%,0 %,0.63%,1.27%,0 %,1.27%,3.16%,0 %,0 %,2.53%,0 %,0 %,0 %,0 %
- RF01725,71.25%,7.50%,0 %,0 %,1.25%,0 %,5.00%,0 %,5.00%,0 %,5.00%,2.50%,0 %,0 %,2.50%,0 %,0 %,0 %,0 %
- RF01734,75.76%,8.08%,0 %,0 %,0 %,5.05%,3.03%,5.05%,0 %,0 %,1.01%,2.02%,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF01739,61.06%,3.54%,4.42%,3.54%,7.96%,3.54%,0 %,0 %,3.54%,1.77%,0 %,0 %,3.54%,0 %,0 %,3.54%,3.54%,0 %,0 %
- RF01750,79.22%,4.55%,0 %,3.90%,1.30%,0 %,0 %,1.30%,0 %,0 %,3.90%,0 %,1.30%,0 %,0 %,0 %,4.55%,0 %,0 %
- RF01763,42.70%,0.28%,5.23%,0 %,12.67%,3.58%,0 %,0 %,2.20%,0 %,3.03%,2.75%,20.94%,6.61%,0 %,0 %,0 %,0 %,0 %
- RF01786,76.39%,2.78%,5.56%,2.78%,1.39%,0 %,0 %,2.78%,5.56%,0 %,0 %,0 %,0 %,0 %,2.78%,0 %,0 %,0 %,0 %
- RF01807,74.12%,3.53%,2.35%,0 %,2.35%,4.71%,2.35%,1.18%,0 %,1.18%,0 %,1.18%,2.35%,1.18%,0 %,1.18%,0 %,0 %,2.35%
- RF01826,50.00%,0 %,8.33%,4.17%,4.17%,4.17%,4.17%,0 %,0 %,0 %,4.17%,0 %,20.83%,0 %,0 %,0 %,0 %,0 %,0 %
- RF01831,78.61%,1.19%,2.97%,1.98%,1.19%,0 %,3.56%,3.96%,1.78%,2.38%,0 %,0 %,0 %,0 %,2.38%,0 %,0 %,0 %,0 %
- RF01846,86.57%,3.14%,0.43%,1.71%,1.00%,0.57%,0.29%,1.43%,0.29%,1.14%,0 %,1.00%,0.43%,0.57%,0.29%,0.29%,0.86%,0 %,0 %
- RF01852,71.41%,0.42%,1.47%,0.10%,4.63%,1.18%,0.06%,4.89%,4.63%,2.20%,0.03%,0.45%,6.65%,0.22%,0.64%,0 %,0.77%,0.06%,0.19%
- RF01854,68.87%,5.96%,4.64%,3.97%,3.97%,1.99%,2.65%,2.65%,0 %,0 %,1.99%,0 %,1.32%,0 %,0.66%,0 %,1.32%,0 %,0 %
- RF01857,71.35%,4.21%,2.81%,0 %,3.93%,2.25%,2.53%,5.34%,0 %,0.56%,1.97%,1.69%,0.56%,1.12%,1.69%,0 %,0 %,0 %,0 %
- RF01960,66.53%,3.35%,3.47%,2.51%,3.10%,2.23%,1.24%,2.17%,1.66%,2.49%,1.75%,1.64%,2.30%,1.38%,1.71%,0.42%,1.34%,0.49%,0.22%
- RF01998,56.65%,4.92%,4.37%,6.74%,3.10%,0.91%,7.10%,4.01%,2.73%,1.09%,0 %,0.36%,3.64%,0.36%,0 %,3.46%,0.55%,0 %,0 %
- RF02001,74.15%,5.56%,0.28%,5.07%,0.83%,0.07%,4.86%,3.47%,0.14%,0 %,0.07%,0.90%,0.63%,0.35%,0.49%,0 %,2.78%,0 %,0.35%
- RF02012,76.03%,5.48%,0 %,4.11%,1.37%,0.68%,0 %,0 %,2.74%,0 %,0 %,0 %,1.37%,2.05%,0 %,0 %,4.11%,1.37%,0.68%
- RF02253,100.00%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF02348,80.00%,5.00%,0 %,3.33%,0 %,0 %,0 %,1.67%,1.67%,3.33%,0 %,0 %,0 %,0 %,0 %,0 %,5.00%,0 %,0 %
- RF02519,66.67%,0 %,0 %,0 %,16.67%,0 %,8.33%,0 %,8.33%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- RF02540,60.17%,5.14%,3.83%,3.92%,2.79%,2.53%,3.11%,1.90%,2.22%,1.96%,2.38%,2.25%,1.45%,1.79%,1.50%,1.94%,0.55%,0.28%,0.31%
- RF02541,62.00%,4.13%,3.68%,3.79%,2.68%,2.55%,2.84%,2.12%,2.25%,1.87%,2.18%,1.89%,1.71%,1.78%,1.53%,1.61%,0.65%,0.35%,0.38%
- RF02543,66.82%,3.48%,2.88%,3.00%,2.51%,2.52%,1.61%,2.09%,1.74%,2.13%,1.88%,1.84%,1.95%,1.51%,1.25%,1.41%,0.74%,0.36%,0.26%
- RF02545,65.43%,0.82%,4.12%,2.88%,1.23%,3.70%,1.65%,1.65%,2.47%,2.47%,1.23%,1.23%,0.82%,2.47%,3.70%,2.47%,0.82%,0.82%,0 %
- RF02546,82.61%,0 %,8.70%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,0 %,4.35%,0 %,0 %,0 %,0 %,4.35%
- RF02553,73.68%,2.63%,7.89%,0 %,0 %,2.63%,0 %,0 %,2.63%,0 %,0 %,5.26%,0 %,2.63%,0 %,2.63%,0 %,0 %,0 %
- RF02680,88.89%,0 %,2.78%,0 %,2.78%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,5.56%,0 %,0 %,0 %,0 %,0 %,0 %
- RF02683,80.56%,2.78%,0 %,5.56%,2.78%,0 %,0 %,0 %,0 %,0 %,0 %,0 %,2.78%,2.78%,0 %,2.78%,0 %,0 %,0 %
- RF02796,78.69%,4.92%,0 %,4.92%,4.92%,0 %,0 %,0 %,4.92%,0 %,0 %,1.64%,0 %,0 %,0 %,0 %,0 %,0 %,0 %
- TOTAL,63.42%,3.93%,3.83%,3.23%,2.83%,2.35%,2.28%,2.28%,2.26%,2.13%,1.96%,1.88%,1.82%,1.68%,1.46%,1.25%,0.73%,0.35%,0.33%
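The deleted CSV reports above store formatted strings rather than numbers: plain percentages such as `61.87%`, the floor marker `<.01%`, and zeros written as `0` or `0 %`. A minimal sketch of turning such cells back into floats follows. The helper `parse_percentage` is hypothetical (it is not part of this commit), and reading `<.01%` as its upper bound 0.01 is an assumption.

```python
import csv
import io

def parse_percentage(cell: str) -> float:
    """Convert one formatted cell to a float percentage (hypothetical helper)."""
    text = cell.replace("%", "").strip()
    if text.startswith("<"):
        # "<.01" only gives an upper bound; returning that bound is an assumption.
        return float(text[1:])
    return float(text) if text else 0.0

# Two rows copied from the deleted results/pairings.csv, truncated to four columns.
sample = io.StringIO(
    ",cWW,tSH,tWH,tHS\n"
    "RF00001,61.87%,4.31%,3.21%,1.98%\n"
    "RF00004,85.28%,3.30%,5.23%,0.96%\n"
)
reader = csv.reader(sample)
header = next(reader)[1:]  # drop the empty corner cell
table = {row[0]: dict(zip(header, map(parse_percentage, row[1:]))) for row in reader}
```

Keeping the parsing in one small function makes the choice for `<.01%` easy to change, for instance to 0.0 or to half the bound.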
--- a/results/pairings.png deleted 100644 → 0
+++ b/results/pairings.png deleted 100644 → 0
--- a/results/realign_jobs_performance.png deleted 100644 → 0
+++ b/results/realign_jobs_performance.png deleted 100644 → 0
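The statistics.py changes that follow replace per-file DataFrame loading with SQL queries against `results/RNANet.db`. A minimal sketch of the chain-length query pattern is given here, run against an in-memory database with a toy schema. Only the table names `chain` and `nucleotide` and the columns `chain_id`, `rfam_acc`, and `nt_id` come from the diff; the sample rows and the helper `chain_lengths` are hypothetical. The sketch also binds the family accession as a `?` parameter instead of formatting it into the SQL string, which is a substitution made for this example and not what the commit does.

```python
import sqlite3

def chain_lengths(conn, family):
    """Return the nucleotide count of each chain mapped to one Rfam family."""
    query = (
        "SELECT COUNT(nt_id) FROM "
        "(SELECT chain_id FROM chain WHERE rfam_acc = ?) "
        "NATURAL JOIN nucleotide GROUP BY chain_id;"
    )
    return [row[0] for row in conn.execute(query, (family,))]

# Toy schema and rows, assumed for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, rfam_acc TEXT);
    CREATE TABLE nucleotide (nt_id INTEGER PRIMARY KEY, chain_id INTEGER);
    INSERT INTO chain VALUES (1, 'RF00005'), (2, 'RF00005'), (3, 'RF00001');
    INSERT INTO nucleotide (chain_id) VALUES (1), (1), (1), (2), (2), (3);
""")
lengths = sorted(chain_lengths(conn, "RF00005"))
conn.close()
```

The subquery selects the chains of one family, and the `NATURAL JOIN` matches them to nucleotides on the shared `chain_id` column, so each group holds the residues of a single chain.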
--- a/statistics.py
+++ b/statistics.py
 #!/usr/bin/python3.8
- import os, pickle, sys
+ 
+ # This file computes additional statistics over the produced dataset.
+ # Run this file if you want the base counts, pair-type counts, identity percents, etc
+ # in the database.
+ # This should be run from the folder where the file is (to access the database with path "results/RNANet.db")
+ 
+ import os, pickle, sqlite3, sys
 import numpy as np
 import pandas as pd
 import threading as th
@@ -12,31 +18,24 @@ from scipy.spatial.distance import squareform
 from mpl_toolkits.mplot3d import axes3d
 from Bio.Phylo.TreeConstruction import DistanceCalculator
 from Bio import AlignIO, SeqIO
- from tqdm import tqdm
 from functools import partial
 from multiprocessing import Pool
 from os import path
+ from tqdm import tqdm
 from collections import Counter
- from RNAnet import read_cpu_number, sql_ask_database
- 
- 
- path_to_3D_data = "/nhome/siniac/lbecquey/Data/RNA/3D/"
- path_to_seq_data = "/nhome/siniac/lbecquey/Data/RNA/sequences/"
+ from RNAnet import read_cpu_number, sql_ask_database, sql_execute, warn, notify, init_worker
 
+ # This sets the paths
+ path_to_3D_data = "/home/lbecquey/Data/RNA/3D/"
+ path_to_seq_data = "/home/lbecquey/Data/RNA/sequences/"
 if len(sys.argv) > 1:
     path_to_3D_data = path.abspath(sys.argv[1])
     path_to_seq_data = path.abspath(sys.argv[2])
 
- class DataPoint():
-     def __init__(self, path_to_textfile):
-         self.df = pd.read_csv(path_to_textfile, sep=',', header=0, engine="c", index_col=0)
-         self.family = path_to_textfile.split('.')[-1]
-         self.chain_label = path_to_textfile.split('.')[-2].split('/')[-1]
- 
- def load_rna_frome_file(path_to_textfile):
-     return DataPoint(path_to_textfile)
+ LSU_set = ("RF00002", "RF02540", "RF02541", "RF02543", "RF02546")   # From Rfam CLAN 00112
+ SSU_set = ("RF00177", "RF02542",  "RF02545", "RF01959", "RF01960")  # From Rfam CLAN 00111
 
- def reproduce_wadley_results(points, show=False, carbon=4, sd_range=(1,4)):
+ def reproduce_wadley_results(show=False, carbon=4, sd_range=(1,4)):
     """
     Plot the joint distribution of pseudotorsion angles, in a Ramachandran-style graph.
     See Wadley & Pyle (2007)
@@ -53,7 +52,6 @@ def reproduce_wadley_results(points, show=False, carbon=4, sd_range=(1,4)):
                      and values above avg + sd_range[1] * stdev to avg + sd_range[1] * stdev.
                      This removes noise and cuts too high peaks, to clearly see the clusters.
     """
-     worker_nbr = 1 + (carbon==1)
 
     if carbon == 4:
         angle = "eta"
@@ -66,17 +64,16 @@ def reproduce_wadley_results(points, show=False, carbon=4, sd_range=(1,4)):
     else:
         exit("You overestimate my capabilities !")
 
+     
     if not path.isfile(f"data/wadley_kernel_{angle}.npz"):
-         c2_endo_etas = []
-         c3_endo_etas = []
-         c2_endo_thetas = []
-         c3_endo_thetas = []
-         for p in tqdm(points, desc="Loading eta/thetas", position=worker_nbr, leave=False):
-             df = p.df.loc[(p.df[angle].isna()==False) & (p.df["th"+angle].isna()==False), ["form","puckering", angle,"th"+angle]]
-             c2_endo_etas   += list(df.loc[ (df.puckering=="C2'-endo"), angle ].values)
-             c3_endo_etas   += list(df.loc[ (df.form=='.') & (df.puckering=="C3'-endo"), angle ].values)
-             c2_endo_thetas += list(df.loc[ (df.puckering=="C2'-endo"), "th"+angle ].values)
-             c3_endo_thetas += list(df.loc[ (df.form=='.') & (df.puckering=="C3'-endo"), "th"+angle ].values)
+         conn = sqlite3.connect("results/RNANet.db")
+         df = pd.read_sql(f"""SELECT {angle}, th{angle} FROM nucleotide WHERE puckering="C2'-endo" AND {angle} IS NOT NULL AND th{angle} IS NOT NULL;""", conn)
+         c2_endo_etas = df[angle].values.tolist()
+         c2_endo_thetas = df["th"+angle].values.tolist()
+         df = pd.read_sql(f"""SELECT {angle}, th{angle} FROM nucleotide WHERE form = '.' AND puckering="C3'-endo" AND {angle} IS NOT NULL AND th{angle} IS NOT NULL;""", conn)
+         c3_endo_etas = df[angle].values.tolist()
+         c3_endo_thetas = df["th"+angle].values.tolist()
+         conn.close()
 
         xx, yy = np.mgrid[0:2*np.pi:100j, 0:2*np.pi:100j]
         positions = np.vstack([xx.ravel(), yy.ravel()])
@@ -103,12 +100,13 @@ def reproduce_wadley_results(points, show=False, carbon=4, sd_range=(1,4)):
         f_c2 = f["kernel_c2"]
         xx, yy = np.mgrid[0:2*np.pi:100j, 0:2*np.pi:100j]
 
-     # print(f"[{worker_nbr}]\tKernel computed (or loaded from file).")
+     notify(f"Kernel computed for {angle}/th{angle} (or loaded from file).")
 
     # exact counts:
     hist_c2, xedges, yedges = np.histogram2d(c2_endo_etas, c2_endo_thetas, bins=int(2*np.pi/0.1), range=[[0, 2*np.pi], [0, 2*np.pi]])
     hist_c3, xedges, yedges = np.histogram2d(c3_endo_etas, c3_endo_thetas, bins=int(2*np.pi/0.1), range=[[0, 2*np.pi], [0, 2*np.pi]])
-     color_values = cm.jet(hist_c3.ravel()/hist_c3.max())
+     cmap = cm.get_cmap("jet")
+     color_values = cmap(hist_c3.ravel()/hist_c3.max())
 
     for x, y, hist, f, l in zip( (c3_endo_etas, c2_endo_etas), 
                                  (c3_endo_thetas, c2_endo_thetas), 
@@ -137,7 +135,7 @@ def reproduce_wadley_results(points, show=False, carbon=4, sd_range=(1,4)):
         # Smoothed joint distribution
         fig = plt.figure()
         ax = fig.add_subplot(111, projection='3d')
-         ax.plot_surface(xx, yy, f_cut, cmap=cm.coolwarm, linewidth=0, antialiased=True)
+         ax.plot_surface(xx, yy, f_cut, cmap=cm.get_cmap("coolwarm"), linewidth=0, antialiased=True)
         ax.set_xlabel(xlabel)
         ax.set_ylabel(ylabel)
         fig.savefig(f"results/figures/wadley_plots/wadley_distrib_{angle}_{l}.png")
@@ -148,7 +146,7 @@ def reproduce_wadley_results(points, show=False, carbon=4, sd_range=(1,4)):
         fig = plt.figure(figsize=(5,5))
         ax = fig.gca()
         ax.scatter(x, y, s=1, alpha=0.1)
-         ax.contourf(xx, yy, f_cut, alpha=0.5, cmap=cm.coolwarm, levels=levels, extend="max")
+         ax.contourf(xx, yy, f_cut, alpha=0.5, cmap=cm.get_cmap("coolwarm"), levels=levels, extend="max")
 
         ax.set_xlabel(xlabel)
         ax.set_ylabel(ylabel)
@@ -157,31 +155,34 @@ def reproduce_wadley_results(points, show=False, carbon=4, sd_range=(1,4)):
             fig.show()
     # print(f"[{worker_nbr}]\tComputed joint distribution of angles (C{carbon}) and saved the figures.")
 
- def stats_len(mappings_list, points):
+ def stats_len():
+     """Plots statistics on chain lengths in RNA families.
+     
+     REQUIRES tables chain, nucleotide up to date.
+     """
+ 
     cols = []
     lengths = []
-     for f in tqdm(sorted(mappings_list.keys()), desc="Chain length by family", position=3, leave=False):
-         if f in ["RF02540","RF02541","RF02543"]:
+     conn = sqlite3.connect("results/RNANet.db")
+     for i,f in enumerate(fam_list):
+         if f in LSU_set:
             cols.append("red") # LSU
-         elif f in ["RF00177","RF01960","RF01959","RF02542"]:
+         elif f in SSU_set:
             cols.append("blue") # SSU
         elif f in ["RF00001"]:
             cols.append("green")
-         elif f in ["RF00002"]:
-             cols.append("purple")
         elif f in ["RF00005"]:
             cols.append("orange")
         else:
             cols.append("grey")
-         l = []
-         for r in points:
-             if r.family != f: continue
-             l.append(len(r.df['nt_code']))
+         l = [ x[0] for x in sql_ask_database(conn, f"SELECT COUNT(nt_id) FROM (SELECT chain_id FROM chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY chain_id;") ]
         lengths.append(l)
+         notify(f"[{i+1}/{len(fam_list)}] Computed {f} chains lengths")
+     conn.close()
 
     fig = plt.figure(figsize=(10,3))
     ax = fig.gca()
-     ax.hist(lengths, bins=100, stacked=True, log=True, color=cols, label=sorted(mappings_list.keys()))
+     ax.hist(lengths, bins=100, stacked=True, log=True, color=cols, label=fam_list)
     ax.set_xlabel("Sequence length (nucleotides)", fontsize=8)
     ax.set_ylabel("Number of 3D chains", fontsize=8)
     ax.set_xlim(left=-150)
@@ -190,15 +191,18 @@ def stats_len(mappings_list, points):
     fig.subplots_adjust(right=0.78)
     filtered_handles = [mpatches.Patch(color='red'), mpatches.Patch(color='white'), mpatches.Patch(color='white'), mpatches.Patch(color='white'),
                         mpatches.Patch(color='blue'), mpatches.Patch(color='white'), mpatches.Patch(color='white'),
-                         mpatches.Patch(color='green'), mpatches.Patch(color='purple'),
-                         mpatches.Patch(color='orange'), mpatches.Patch(color='grey')]
-     filtered_labels = ['Large Ribosomal Subunits', '(RF02540,', 'RF02541', 'RF02543)',
-                         'Small Ribosomal Subunits','(RF01960,', 'RF00177)',
-                        '5S rRNA (RF00001)', '5.8S rRNA (RF00002)', 'tRNA (RF00005)', 'Other']
+                         mpatches.Patch(color='green'), mpatches.Patch(color='white'),
+                         mpatches.Patch(color='orange'), mpatches.Patch(color='white'),
+                         mpatches.Patch(color='grey')]
+     filtered_labels = ['Large Ribosomal Subunits', '(RF00002, RF02540,', 'RF02541, RF02543,', 'RF02546)',
+                         'Small Ribosomal Subunits','(RF01960, RF00177,', 'RF02545)',
+                        '5S rRNA', '(RF00001)', 
+                        'tRNA', '(RF00005)', 
+                        'Other']
     ax.legend(filtered_handles, filtered_labels, loc='right', 
-                 ncol=1, fontsize='small', bbox_to_anchor=(1.3, 0.55))
+                 ncol=1, fontsize='small', bbox_to_anchor=(1.3, 0.5))
     fig.savefig("results/figures/lengths.png")
-     # print("[3]\tComputed sequence length statistics and saved the figure.")
+     notify("Computed sequence length statistics and saved the figure.")
 
 def format_percentage(tot, x):
         if not tot:
@@ -212,63 +216,100 @@ def format_percentage(tot, x):
             x = "<.01"
         return x + '%'
 
- def stats_freq(mappings_list, points):
+ def stats_freq():
+     """Computes base frequencies in all RNA families.
+ 
+     Outputs results/frequencies.csv
+     REQUIRES tables chain, nucleotide up to date."""
     freqs = {}
-     for f in mappings_list.keys():
+     for f in fam_list:
         freqs[f] = Counter()
 
-     for r in tqdm(points, desc="Nucleotide frequencies", position=4, leave=False):
-         freqs[r.family].update(dict(r.df['nt_name'].value_counts()))
+     conn = sqlite3.connect("results/RNANet.db")
+     for i,f in enumerate(fam_list):
+         counts = dict(sql_ask_database(conn, f"SELECT nt_name, COUNT(nt_name) FROM (SELECT chain_id from chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide GROUP BY nt_name;"))
+         freqs[f].update(counts)
+         notify(f"[{i+1}/{len(fam_list)}] Computed {f} nucleotide frequencies.")
+     conn.close()
     
     df = pd.DataFrame()
-     for f in sorted(mappings_list.keys()):
+     for f in fam_list:
         tot = sum(freqs[f].values())
         df = pd.concat([ df, pd.DataFrame([[ format_percentage(tot, x) for x in freqs[f].values() ]], columns=list(freqs[f]), index=[f]) ])
     df = df.fillna(0)
-     df.to_csv("results/frequencies.csv")
+     df.to_csv("results/frequencies.csv")    
+     notify("Saved nucleotide frequencies to CSV file.")
+ 
+ def parallel_stats_pairs(f):
+     """Counts occurrences of intra-chain base-pair types in one RNA family
+ 
+     REQUIRES tables chain, nucleotide up-to-date.""" 
+ 
+     with sqlite3.connect("results/RNANet.db") as conn:
+         # Get comma separated lists of basepairs per nucleotide
+         interactions = pd.read_sql(f"SELECT paired, pair_type_LW FROM (SELECT chain_id FROM chain WHERE rfam_acc='{f}') NATURAL JOIN nucleotide WHERE nb_interact>0;", conn)
+ 
+     # expand the comma-separated lists into real lists
+     expanded_list = pd.concat([ pd.DataFrame({ 'paired':row['paired'].split(','), 'pair_type_LW':row['pair_type_LW'].split(',') }) 
+                                 for _, row in interactions.iterrows() ]).reset_index(drop=True)
+     # keep only intra-chain interactions
+     expanded_list = expanded_list[ expanded_list.paired != '0' ].pair_type_LW
+ 
+     # Count each pair type
+     vcnts = expanded_list.value_counts()
+ 
+     # Add these new counts to the family's counter
+     cnt = Counter()
+     cnt.update(dict(vcnts))
+ 
+     # Create an output DataFrame
+     return pd.DataFrame([[ x for x in cnt.values() ]], columns=list(cnt), index=[f])
 
-     # print("[4]\tComputed nucleotide statistics and saved CSV file.")
+ def stats_pairs():
+     """Counts occurrences of intra-chain base-pair types in RNA families
 
- def stats_pairs(mappings_list, points):
+     Creates a temporary results file in data/pair_counts.csv, and a results file in results/pairings.csv.
+     REQUIRES tables chain, nucleotide up-to-date.""" 
 
     def line_format(family_data):
         return family_data.apply(partial(format_percentage, sum(family_data)))
 
-     # Create a Counter() object by family
-     freqs = {}
-     for f in mappings_list.keys():
-         freqs[f] = Counter()
- 
-     # Iterate over data points
     if not path.isfile("data/pair_counts.csv"):
-         for r in tqdm(points, desc="Leontis-Westhof basepair stats", position=5, leave=False):
-             # Skip if linear piece of RNA
-             if r.df.pair_type_LW.isna().all():
-                 continue 
- 
-             # Count each pair type within the molecule
-             vcnts = pd.concat(
-                                 [   pd.Series(row['pair_type_LW'].split(',')) 
-                                     for _, row in r.df.dropna(subset=["pair_type_LW"]).iterrows() ]
-                             ).reset_index(drop=True).value_counts()
- 
-             # Add these new counts to the family's counter
-             freqs[r.family].update(dict(vcnts))
-         
-         # Create the output dataframe
-         df = pd.DataFrame()
-         for f in sorted(mappings_list.keys()):
-             df = pd.concat([ df, pd.DataFrame([[ x for x in freqs[f].values() ]], columns=list(freqs[f]), index=[f]) ])
-         df = df.fillna(0)
+         p = Pool(initializer=init_worker, initargs=(tqdm.get_lock(),), processes=read_cpu_number(), maxtasksperchild=5)
+         try:
+             fam_pbar = tqdm(total=len(fam_list), desc="Pair-types in families", position=0, leave=True) 
+             results = []
+             for i, fam_df in enumerate(p.imap_unordered(parallel_stats_pairs, fam_list)):
+                 fam_pbar.update(1)
+                 results.append(fam_df)
+             fam_pbar.close()
+             p.close()
+             p.join()
+         except KeyboardInterrupt:
+             warn("KeyboardInterrupt, terminating workers.", error=True)
+             fam_pbar.close()
+             p.terminate()
+             p.join()
+             exit(1)
+ 
+         df = pd.concat(results).fillna(0)
         df.to_csv("data/pair_counts.csv")
     else:
         df = pd.read_csv("data/pair_counts.csv", index_col=0)
 
- 
+     print(df)
     # Remove not very well defined pair types (not in the 12 LW types)
     col_list = [ x for x in df.columns if '.' in x ]
     df['other'] = df[col_list].sum(axis=1)
     df.drop(col_list, axis=1, inplace=True)
+     print(df)
+ 
+     # drop duplicate types
+     # The twelve Leontis-Westhof types are
+     # cWW cWH cWS cHH cHS cSS (do not count cHW cSW and cSH, they are the same as their opposites)
+     # tWW tWH tWS tHH tHS tSS (do not count tHW tSW and tSH, they are the same as their opposites)
+     df = df.drop([ "cHW", "tHW", "cSW", "tSW", "cSH", "tSH"], axis=1)  # keep one orientation of each mirrored type
+     df[ ["cWW", "tWW", "cHH", "tHH", "cSS", "tSS", "other"] ] /= 2.0   # halve the columns counted from both partners
 
     # Compute total row
     total_series = df.sum(numeric_only=True).rename("TOTAL")
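Note on the hunk above: the deduplication of Leontis-Westhof types described in the comments can be sketched on a toy table as follows. The counts and the `mirrored` / `symmetric` names are invented for illustration, and the sketch assumes every base pair was recorded once from each of its two nucleotides, which is what the comments imply.

```python
import pandas as pd

# Hypothetical per-family counts; each base pair is assumed to have been
# recorded once from each of its two nucleotides.
df = pd.DataFrame(
    {"cWW": [10, 4], "cWH": [3, 1], "cHW": [3, 1], "tSS": [2, 0], "other": [4, 2]},
    index=["RF00001", "RF00005"],
    dtype=float,
)

# Mirrored labels: keep one orientation only (labels absent from df are ignored).
mirrored = ["cHW", "tHW", "cSW", "tSW", "cSH", "tSH"]
df = df.drop(columns=mirrored, errors="ignore")

# Symmetric labels (and 'other') were seen from both partners: halve those columns.
symmetric = [c for c in ["cWW", "tWW", "cHH", "tHH", "cSS", "tSS", "other"]
             if c in df.columns]
df[symmetric] = df[symmetric] / 2.0

print(df)
```

Selecting with `df[symmetric]` operates on columns, whereas `df.loc[[...]]` with a single list selects rows by index label, which here would be the family accessions.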
@@ -291,10 +332,11 @@ def stats_pairs(mappings_list, points):
     plt.subplots_adjust(bottom=0.2, right=0.99)
     plt.savefig("results/figures/pairings.png")
 
-     # print("[5]\tComputed nucleotide statistics and saved CSV and PNG file.")
+     notify("Computed base-pair statistics and saved CSV and PNG files.")
 
 def to_dist_matrix(f):
     if path.isfile("data/"+f+".npy"):
+         notify(f"Computed {f} distance matrix", "loaded from file")
         return 0
 
     dm = DistanceCalculator('identity')
@@ -305,23 +347,24 @@ def to_dist_matrix(f):
     l = len(idty)
     np.save("data/"+f+".npy", np.array([ idty[i] + [0]*(l-1-i) if i<l-1 else idty[i]  for i in range(l) ]))
     del idty
+     notify(f"Computed {f} distance matrix")
     return 0
 
- def seq_idty(mappings_list):
-     famlist = sorted([ x for x in mappings_list.keys() if len(mappings_list[x]) > 1 ])
-     ignored = []
-     for x in mappings_list.keys():
-         if len(mappings_list[x]) == 1:
-             ignored.append(x)
+ def seq_idty():
+     """Computes identity matrices for each of the RNA families.
+     
+     Creates temporary results files in data/*.npy
+     REQUIRES tables chain, family up to date."""
+ 
+     conn = sqlite3.connect("results/RNANet.db")
+     famlist = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains > 1 ORDER BY rfam_acc ASC;") ]
+     ignored = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from (SELECT rfam_acc, COUNT(chain_id) as n_chains FROM family NATURAL JOIN chain GROUP BY rfam_acc) WHERE n_chains < 2 ORDER BY rfam_acc ASC;") ]
     if len(ignored):
         print("Idty matrices: Ignoring families with only one chain:", " ".join(ignored)+'\n')
 
     # compute distance matrices
     p = Pool(processes=8)
-     pbar = tqdm(total=len(famlist), desc="Families idty matrices", position=0, leave=False)
-     for i, _ in enumerate(p.imap_unordered(to_dist_matrix, famlist)):
-         pbar.update(1)
-     pbar.close()
+     p.map(to_dist_matrix, famlist)
     p.close()
     p.join()
 
@@ -333,9 +376,20 @@ def seq_idty(mappings_list):
         else:
             fam_arrays.append([])
 
+     # Update database with identity percentages
+     conn = sqlite3.connect("results/RNANet.db")
+     for f, D in zip(famlist, fam_arrays):
+         if not len(D): continue
+         a = 1.0 - np.average(D + D.T) # Get symmetric matrix instead of lower triangle + convert from distance matrix to identity matrix
+         conn.execute("UPDATE family SET idty_percent = ? WHERE rfam_acc = ?;", (float(a), f))
+     conn.commit()
+     conn.close()
+ 
+     # Plots plots plots
     fig, axs = plt.subplots(5,13, figsize=(15,9))
     axs = axs.ravel()
     [axi.set_axis_off() for axi in axs]
+     im = "" # Just to declare the variable, it will be set in the loop
     for f, D, ax in zip(famlist, fam_arrays, axs):
         if not len(D): continue
         if D.shape[0] > 2:  # Cluster only if there is more than 2 sequences to organize
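Note on the `idty_percent` computation in the hunk above: the following minimal sketch uses a made-up 3x3 matrix to show what `1.0 - np.average(D + D.T)` evaluates to when `D` is the zero-padded lower triangle saved by `to_dist_matrix`. The off-diagonal variant is only a possible alternative, not something the commit does.

```python
import numpy as np

# Hypothetical lower-triangular distance matrix for three sequences
# (zeros on and above the diagonal), in the layout saved to data/*.npy.
D = np.array([[0.0, 0.0, 0.0],
              [0.2, 0.0, 0.0],
              [0.4, 0.6, 0.0]])

full = D + D.T                       # symmetric distance matrix
idty_all = 1.0 - np.average(full)    # as in the diff: diagonal entries included

# Alternative (an assumption, not the author's choice): average over distinct
# pairs only, so that self-comparisons do not raise the mean identity.
n = full.shape[0]
off_diag = full[~np.eye(n, dtype=bool)]
idty_pairs = 1.0 - off_diag.mean()

print(idty_all, idty_pairs)
```

Because the diagonal contributes zero distance, the first value is always at least as large as the second; the gap shrinks as the number of chains in the family grows.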
@@ -356,55 +410,54 @@ def seq_idty(mappings_list):
     fig.subplots_adjust(wspace=0.1, hspace=0.3)
     fig.colorbar(im, ax=axs[-1], shrink=0.8)
     fig.savefig(f"results/figures/distances.png")
-     # print("[6]\tComputed identity matrices and saved the figure.")
+     notify("Computed all identity matrices and saved the figure.")
+ 
+ def per_chain_stats():
+     """Computes per-chain frequencies and base-pair type counts.
+ 
+     REQUIRES tables chain, nucleotide up to date. """
+ 
+     with sqlite3.connect("results/RNANet.db") as conn:
+         # Compute per-chain nucleotide frequencies
+         df = pd.read_sql("SELECT SUM(is_A) as A, SUM(is_C) AS C, SUM(is_G) AS G, SUM(is_U) AS U, SUM(is_other) AS O, chain_id FROM nucleotide GROUP BY chain_id;", conn)
+         df["total"] = pd.Series(df.A + df.C + df.G + df.U + df.O, dtype=np.float64)
+         df[['A','C','G','U','O']] = df[['A','C','G','U','O']].div(df.total, axis=0)
+         df = df.drop("total", axis=1)
+ 
+         # Set the values
+         sql_execute(conn, "UPDATE chain SET chain_freq_A = ?, chain_freq_C = ?, chain_freq_G = ?, chain_freq_U = ?, chain_freq_other = ? WHERE chain_id= ?;",
+                        many=True, data=list(df.to_records(index=False)), warn_every=10)
+     notify("Updated the database with per-chain base frequencies")
 
 if __name__ == "__main__":
 
-     #################################################################
-     #               LOAD ALL FILES
-     #################################################################
     os.makedirs("results/figures/wadley_plots/", exist_ok=True)
 
     print("Loading mappings list...")
-     mappings_list = pd.read_csv("results/mappings_list.csv", sep=',', index_col=0).to_dict(orient='list')
-     for k in mappings_list.keys():
-         mappings_list[k] = [ x for x in mappings_list[k] if str(x) != 'nan' ]
- 
-     print("Loading datapoints from file...")
-     if path.isfile("data/rnapoints.dat"):
-         with open("data/rnapoints.dat", 'rb') as f:
-             rna_points = pickle.load(f)
-     else:
-         rna_points = []
-         filelist = [path_to_3D_data+"/datapoints/"+f for f in os.listdir(path_to_3D_data+"/datapoints") ]
-         p = Pool(initializer=tqdm.set_lock, initargs=(tqdm.get_lock(),), processes=read_cpu_number())
-         pbar = tqdm(total=len(filelist), desc="RNA files", position=0, leave=False)
-         for i, rna in enumerate(p.imap_unordered(load_rna_frome_file, filelist)):
-             rna_points.append(rna)
-             pbar.update(1)
-         pbar.close()
-         p.close()
-         p.join()
-         with open("data/rnapoints.dat", "wb") as f:
-             pickle.dump(rna_points, f)
-     npoints = len(rna_points)
-     print(npoints, "RNA files loaded.")
- 
-     #################################################################
-     #               Define threads for the tasks
-     #################################################################
+     conn = sqlite3.connect("results/RNANet.db")
+     fam_list = [ x[0] for x in sql_ask_database(conn, "SELECT rfam_acc from family ORDER BY rfam_acc ASC;") ]
+     mappings_list = {}
+     for k in fam_list:
+         mappings_list[k] = [ x[0] for x in sql_ask_database(conn, f"SELECT chain_id from chain WHERE rfam_acc='{k}';") ]
+     conn.close()
+     
+     stats_pairs()
+ 
+     # Define threads for the tasks
     threads = [
-         # th.Thread(target=reproduce_wadley_results, args=[rna_points], kwargs={'carbon': 1}),
-         # th.Thread(target=reproduce_wadley_results, args=[rna_points], kwargs={'carbon': 4}),
-         th.Thread(target=partial(stats_len, mappings_list), args=[rna_points]),
-         # th.Thread(target=partial(stats_freq, mappings_list), args=[rna_points]),
-         # th.Thread(target=partial(stats_pairs, mappings_list), args=[rna_points]),
-         # th.Thread(target=seq_idty, args=[mappings_list])
+         # th.Thread(target=reproduce_wadley_results, kwargs={'carbon': 1}),
+         # th.Thread(target=reproduce_wadley_results, kwargs={'carbon': 4}),
+         # th.Thread(target=stats_len),
+         # th.Thread(target=stats_freq),
+         # th.Thread(target=seq_idty),
+         th.Thread(target=per_chain_stats)
     ]
- 
+     
+     # Start the threads
     for t in threads:
         t.start()
 
+     # Wait for the threads to complete
     for t in threads:
         t.join()
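Note on the SQL in this diff: several queries interpolate `rfam_acc` into the statement with an f-string. A possible alternative is to bind the value with sqlite3's `?` placeholder, sketched below on an in-memory database. The table contents and the `family_nt_counts` helper are hypothetical, and the schema is reduced to the columns used by the `stats_freq` query.

```python
import sqlite3
from collections import Counter

# In-memory stand-in for results/RNANet.db with only the columns used here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chain (chain_id INTEGER PRIMARY KEY, rfam_acc TEXT);
    CREATE TABLE nucleotide (chain_id INTEGER, nt_name TEXT);
    INSERT INTO chain VALUES (1, 'RF00001'), (2, 'RF00005');
    INSERT INTO nucleotide VALUES (1, 'A'), (1, 'A'), (1, 'G'), (2, 'U');
""")

def family_nt_counts(conn, rfam_acc):
    """Count nucleotide names in one family, binding rfam_acc with a '?' placeholder."""
    query = ("SELECT nt_name, COUNT(nt_name) "
             "FROM (SELECT chain_id FROM chain WHERE rfam_acc = ?) "
             "NATURAL JOIN nucleotide GROUP BY nt_name;")
    return Counter(dict(conn.execute(query, (rfam_acc,)).fetchall()))

counts = family_nt_counts(conn, "RF00001")
conn.close()
print(counts)
```

The accessions come from the database itself, so the f-string form is unlikely to cause harm in practice; binding mainly removes the need to quote values by hand.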