Louis BECQUEY

Added CM coordinates and consensus 2D structure to align_column

...@@ -68,12 +68,13 @@ To help you design your own SQL requests, we provide a description of the databa ...@@ -68,12 +68,13 @@ To help you design your own SQL requests, we provide a description of the databa
68 * `puckering`: Conformation of the ribose cycle (10 classes depending on the phase_angle value) 68 * `puckering`: Conformation of the ribose cycle (10 classes depending on the phase_angle value)
69 69
70 ## Table `align_column`, for positions in multiple sequence alignments 70 ## Table `align_column`, for positions in multiple sequence alignments
71 -* `column_id`: A unique identifier
72 * `rfam_acc`: The family's MSA the column belongs to 71 * `rfam_acc`: The family's MSA the column belongs to
73 * `index_ali`: Position of the column in the alignment (starts at 1) 72 * `index_ali`: Position of the column in the alignment (starts at 1)
73 +* `cm_coord`: Position of the column in the Rfam covariance model of the family (starts at 1). The value is NULL in portions that are insertions compared to the model.
74 * `freq_A`, `freq_C`, `freq_G`, `freq_U`, `freq_other`: Nucleotide frequencies in the alignment at this position 74 * `freq_A`, `freq_C`, `freq_G`, `freq_U`, `freq_other`: Nucleotide frequencies in the alignment at this position
75 * `gap_percent`: The frequencies of gaps at this position in the alignment (between 0.0 and 1.0) 75 * `gap_percent`: The frequencies of gaps at this position in the alignment (between 0.0 and 1.0)
76 * `consensus`: A consensus character (ACGUN or '-') summarizing the column, if we can. If >75% of the sequences are gaps at this position, the gap is picked as consensus. Otherwise, A/C/G/U is chosen if >50% of the non-gap positions are A/C/G/U. Otherwise, N is the consensus. 76 * `consensus`: A consensus character (ACGUN or '-') summarizing the column, if we can. If >75% of the sequences are gaps at this position, the gap is picked as consensus. Otherwise, A/C/G/U is chosen if >50% of the non-gap positions are A/C/G/U. Otherwise, N is the consensus.
77 +* `cons_sec_struct`: A consensus secondary structure for the RNAs of the family, obtained from the Infernal alignment. The structure is well-nested (no pseudoknots) and the possible symbols are '.' (unpaired) and '(' or ')' (paired). The field is NULL in portions that are insertions compared to the Rfam model of the family, meaning that there is no consensus on the structure.
77 78
78 For each family (rfam_acc), there is always an entry with index_ali = 0, gap_percent = 1.0, and nucleotide frequencies set to 0.0. This entry is used when the nucleotide frequencies cannot be determined because of local alignment issues. 79 For each family (rfam_acc), there is always an entry with index_ali = 0, gap_percent = 1.0, and nucleotide frequencies set to 0.0. This entry is used when the nucleotide frequencies cannot be determined because of local alignment issues.
79 80
......
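As an illustration of the two new columns, the consensus sequence and the consensus secondary structure of a family can be reconstructed over CM coordinates with a short query. The sketch below is illustrative only (not part of this commit); the database path and the family accession RF00005 are placeholders to adapt.

```python
import sqlite3

# Sketch: rebuild the consensus sequence and consensus 2D structure of one family,
# ordered by position in the Rfam covariance model. Insertion columns are skipped
# because their cm_coord is NULL. Path and accession are placeholders.
with sqlite3.connect("results/RNANet.db") as conn:
    rows = conn.execute("""SELECT cm_coord, consensus, cons_sec_struct FROM align_column
                           WHERE rfam_acc = ? AND cm_coord IS NOT NULL
                           ORDER BY cm_coord;""", ("RF00005",)).fetchall()

print("".join(r[1] for r in rows))  # consensus sequence (A/C/G/U/N or '-')
print("".join(r[2] for r in rows))  # consensus structure ('.', '(' and ')')
```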
...@@ -17,12 +17,12 @@ ...@@ -17,12 +17,12 @@
17 * [DONE] Get filtered versions of the sequence alignments containing the 3D chains, publicly available for download 17 * [DONE] Get filtered versions of the sequence alignments containing the 3D chains, publicly available for download
18 * [DONE] Get a consensus residue for each alignment column 18 * [DONE] Get a consensus residue for each alignment column
19 * [DONE] Get an option to limit the number of cores 19 * [DONE] Get an option to limit the number of cores
20 +* [DONE] Move to SILVA LSU release 138.1
20 * [UPCOMING] Automated annotation of detected Recurrent Interaction Networks (RINs), see http://carnaval.lri.fr/ . 21 * [UPCOMING] Automated annotation of detected Recurrent Interaction Networks (RINs), see http://carnaval.lri.fr/ .
21 * [UPCOMING] Possibly, automated detection of HLs and ILs from the 3D Motif Atlas (BGSU). Maybe. Their own website already does the job. 22 * [UPCOMING] Possibly, automated detection of HLs and ILs from the 3D Motif Atlas (BGSU). Maybe. Their own website already does the job.
22 * [UPCOMING] Weight sequences in alignment to give more importance to rarer sequences 23 * [UPCOMING] Weight sequences in alignment to give more importance to rarer sequences
23 * [UPCOMING] Give both gap_percent and insertion_gap_percent 24 * [UPCOMING] Give both gap_percent and insertion_gap_percent
24 * A field estimating the quality of the sequence alignment in table family. 25 * A field estimating the quality of the sequence alignment in table family.
25 * Possibly, more metrics about the alignments coming from Infernal. 26 * Possibly, more metrics about the alignments coming from Infernal.
26 -* Move to SILVA LSU release 138
27 * Run cmscan ourselves from the NDB instead of using Rfam-PDB mappings ? (Iff this actually makes a real difference, untested yet) 27 * Run cmscan ourselves from the NDB instead of using Rfam-PDB mappings ? (Iff this actually makes a real difference, untested yet)
28 * Use and save Infernal alignment bounds and truncation information 28 * Use and save Infernal alignment bounds and truncation information
......
...@@ -862,10 +862,10 @@ class Downloader: ...@@ -862,10 +862,10 @@ class Downloader:
862 try: 862 try:
863 print(f"Downloading {unit} from SILVA...", end='', flush=True) 863 print(f"Downloading {unit} from SILVA...", end='', flush=True)
864 if unit == "LSU": 864 if unit == "LSU":
865 - subprocess.run(["wget", "-nv", "http://www.arb-silva.de/fileadmin/arb_web_db/release_132/ARB_files/SILVA_132_LSURef_07_12_17_opt.arb.gz", 865 + subprocess.run(["wget", "-nv", "https://www.arb-silva.de/fileadmin/arb_web_db/release_138_1/ARB_files/SILVA_138.1_LSURef_opt.arb.gz",
866 "-O", path_to_seq_data + "realigned/LSU.arb.gz"]) 866 "-O", path_to_seq_data + "realigned/LSU.arb.gz"])
867 else: 867 else:
868 - subprocess.run(["wget", "-nv", "http://www.arb-silva.de/fileadmin/silva_databases/release_138/ARB_files/SILVA_138_SSURef_05_01_20_opt.arb.gz", 868 + subprocess.run(["wget", "-nv", "https://www.arb-silva.de/fileadmin/arb_web_db/release_138_1/ARB_files/SILVA_138.1_SSURef_opt.arb.gz",
869 "-O", path_to_seq_data + "realigned/SSU.arb.gz"]) 869 "-O", path_to_seq_data + "realigned/SSU.arb.gz"])
870 except: 870 except:
871 warn(f"Error downloading the {unit} database from SILVA", error=True) 871 warn(f"Error downloading the {unit} database from SILVA", error=True)
...@@ -1379,11 +1379,11 @@ class Pipeline: ...@@ -1379,11 +1379,11 @@ class Pipeline:
1379 for r in results: 1379 for r in results:
1380 align = AlignIO.read(path_to_seq_data + "realigned/" + r[0] + "++.afa", "fasta") 1380 align = AlignIO.read(path_to_seq_data + "realigned/" + r[0] + "++.afa", "fasta")
1381 nb_3d_chains = len([1 for r in align if '[' in r.id]) 1381 nb_3d_chains = len([1 for r in align if '[' in r.id])
1382 - if r[0] in SSU_set: # SSU v138 is used 1382 + if r[0] in SSU_set: # SSU v138.1 is used
1383 - nb_homologs = 2225272 # source: https://www.arb-silva.de/documentation/release-138/ 1383 + nb_homologs = 2224740 # source: https://www.arb-silva.de/documentation/release-138.1/
1384 nb_total_homol = nb_homologs + nb_3d_chains 1384 nb_total_homol = nb_homologs + nb_3d_chains
1385 - elif r[0] in LSU_set: # LSU v132 is used 1385 + elif r[0] in LSU_set: # LSU v138.1 is used
1386 - nb_homologs = 198843 # source: https://www.arb-silva.de/documentation/release-132/ 1386 + nb_homologs = 227331 # source: https://www.arb-silva.de/documentation/release-138.1/
1387 nb_total_homol = nb_homologs + nb_3d_chains 1387 nb_total_homol = nb_homologs + nb_3d_chains
1388 else: 1388 else:
1389 nb_total_homol = len(align) 1389 nb_total_homol = len(align)
...@@ -1433,6 +1433,32 @@ class Pipeline: ...@@ -1433,6 +1433,32 @@ class Pipeline:
1433 p.join() 1433 p.join()
1434 exit(1) 1434 exit(1)
1435 1435
1436 + def extractCMs(self):
1437 + """
1438 + Extracts Rfam CMs of the families for which we have 3D structures and compresses
1439 + them for later use with cmscan.
1440 + """
1441 +
1442 + # Retrieve the list of families and write them to a "keys file"
1443 + allfams = []
1444 + with sqlite3.connect(runDir+"/results/RNANet.db", timeout=10.0) as conn:
1445 + conn.execute('pragma journal_mode=wal')
1446 + allfams = sql_ask_database(conn, "SELECT DISTINCT rfam_acc FROM chain ORDER BY rfam_acc ASC;")
1447 + allfams = [ r[0]+"\n" for r in allfams ]
1448 + if not len(allfams):
1449 + return
1450 + with open(runDir + "/results/available_families.txt", "w") as f:
1451 + f.writelines(allfams)
1452 +
1453 + # Extract the families from Rfam.cm
1454 + os.makedirs(runDir + "/results/cm/", exist_ok=True)
1455 + subprocess.run(["cmfetch", "-f", "-o", runDir + "/results/cm/RNANet.cm", path_to_seq_data + "Rfam.cm", runDir + "/results/available_families.txt"])
1456 + os.remove(runDir + "/results/available_families.txt")
1457 +
1458 + # Compress the cm database for use with cmscan
1459 + subprocess.run(["rm", "-f", runDir + "/results/cm/RNANet.cm.i1p", runDir + "/results/cm/RNANet.cm.i1i", runDir + "/results/cm/RNANet.cm.i1m", runDir + "/results/cm/RNANet.cm.i1f"])
1460 + subprocess.run(["cmpress", runDir + "/results/cm/RNANet.cm"])
1461 +
1436 def output_results(self): 1462 def output_results(self):
1437 """Produces CSV files, archive them, and additional metadata files 1463 """Produces CSV files, archive them, and additional metadata files
1438 1464
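The pressed database written by extractCMs() is intended for later use with Infernal's cmscan. A minimal usage sketch, not part of this commit (the query FASTA file and run directory are placeholders):

```python
import subprocess

runDir = "."  # placeholder for the pipeline's run directory

# Illustrative only: scan placeholder query sequences against the pressed
# covariance-model database produced by extractCMs().
subprocess.run(["cmscan", "--tblout", runDir + "/results/cm/hits.tbl",
                runDir + "/results/cm/RNANet.cm", "my_queries.fa"])
```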
...@@ -1665,8 +1691,8 @@ def sql_define_tables(conn): ...@@ -1665,8 +1691,8 @@ def sql_define_tables(conn):
1665 pair_count_tSS SMALLINT, 1691 pair_count_tSS SMALLINT,
1666 pair_count_other SMALLINT, 1692 pair_count_other SMALLINT,
1667 UNIQUE (structure_id, chain_name, rfam_acc), 1693 UNIQUE (structure_id, chain_name, rfam_acc),
1668 - FOREIGN KEY(structure_id) REFERENCES structure(pdb_id), 1694 + FOREIGN KEY(structure_id) REFERENCES structure(pdb_id) ON DELETE CASCADE,
1669 - FOREIGN KEY(rfam_acc) REFERENCES family(rfam_acc) 1695 + FOREIGN KEY(rfam_acc) REFERENCES family(rfam_acc) ON DELETE CASCADE
1670 ); 1696 );
1671 CREATE TABLE IF NOT EXISTS nucleotide ( 1697 CREATE TABLE IF NOT EXISTS nucleotide (
1672 chain_id INT, 1698 chain_id INT,
...@@ -1721,6 +1747,7 @@ def sql_define_tables(conn): ...@@ -1721,6 +1747,7 @@ def sql_define_tables(conn):
1721 CREATE TABLE IF NOT EXISTS align_column ( 1747 CREATE TABLE IF NOT EXISTS align_column (
1722 rfam_acc CHAR(7) NOT NULL, 1748 rfam_acc CHAR(7) NOT NULL,
1723 index_ali INT NOT NULL, 1749 index_ali INT NOT NULL,
1750 + cm_coord INT,
1724 freq_A REAL, 1751 freq_A REAL,
1725 freq_C REAL, 1752 freq_C REAL,
1726 freq_G REAL, 1753 freq_G REAL,
...@@ -1728,12 +1755,17 @@ def sql_define_tables(conn): ...@@ -1728,12 +1755,17 @@ def sql_define_tables(conn):
1728 freq_other REAL, 1755 freq_other REAL,
1729 gap_percent REAL, 1756 gap_percent REAL,
1730 consensus CHAR(1), 1757 consensus CHAR(1),
1758 + cons_sec_struct CHAR(1),
1731 PRIMARY KEY (rfam_acc, index_ali), 1759 PRIMARY KEY (rfam_acc, index_ali),
1732 - FOREIGN KEY(rfam_acc) REFERENCES family(rfam_acc) 1760 + FOREIGN KEY(rfam_acc) REFERENCES family(rfam_acc) ON DELETE CASCADE
1733 ); 1761 );
1734 """) 1762 """)
1735 conn.commit() 1763 conn.commit()
1736 1764
1765 + # Prepare the WAL files while we're in single thread, otherwise it sometimes fails
1766 + # at the first access in WAL mode
1767 + conn.execute("pragma journal_mode=wal")
1768 +
1737 @trace_unhandled_exceptions 1769 @trace_unhandled_exceptions
1738 def sql_ask_database(conn, sql, warn_every=10): 1770 def sql_ask_database(conn, sql, warn_every=10):
1739 """ 1771 """
...@@ -1966,12 +1998,12 @@ def work_infer_mappings(update_only, allmappings, fullinference, codelist) -> li ...@@ -1966,12 +1998,12 @@ def work_infer_mappings(update_only, allmappings, fullinference, codelist) -> li
1966 sel_5_to_3 = (inferred_mappings['pdb_start'] < inferred_mappings['pdb_end']) 1998 sel_5_to_3 = (inferred_mappings['pdb_start'] < inferred_mappings['pdb_end'])
1967 thisfam_5_3 = (inferred_mappings['rfam_acc'] == rfam) & sel_5_to_3 1999 thisfam_5_3 = (inferred_mappings['rfam_acc'] == rfam) & sel_5_to_3
1968 thisfam_3_5 = (inferred_mappings['rfam_acc'] == rfam) & (sel_5_to_3 == False) 2000 thisfam_3_5 = (inferred_mappings['rfam_acc'] == rfam) & (sel_5_to_3 == False)
1969 - print() 2001 + # print()
1970 - warn(f"Found mappings to {rfam} in both directions on the same interval, keeping only the 5'->3' one.") 2002 + # warn(f"Found mappings to {rfam} in both directions on the same interval, keeping only the 5'->3' one.")
1971 else: 2003 else:
1972 warn(f"There are mappings for {rfam} in both directions:", error=True) 2004 warn(f"There are mappings for {rfam} in both directions:", error=True)
1973 print(inferred_mappings) 2005 print(inferred_mappings)
1974 - exit(1) 2006 + # exit(1)
1975 2007
1976 # Compute consensus for chains in 5' -> 3' sense 2008 # Compute consensus for chains in 5' -> 3' sense
1977 if len(inferred_mappings[thisfam_5_3]): 2009 if len(inferred_mappings[thisfam_5_3]):
...@@ -2297,8 +2329,10 @@ def work_realign(rfam_acc): ...@@ -2297,8 +2329,10 @@ def work_realign(rfam_acc):
2297 2329
2298 # Align the new sequences 2330 # Align the new sequences
2299 with open(new_ali_path, 'w') as o: 2331 with open(new_ali_path, 'w') as o:
2300 - p1 = subprocess.run(["cmalign", path_to_seq_data + f"realigned/{rfam_acc}.cm", 2332 + p1 = subprocess.run(["cmalign", "--nonbanded", "--ifile", path_to_seq_data + f"realigned/{rfam_acc}.ins",
2301 - path_to_seq_data + f"realigned/{rfam_acc}_new.fa"], 2333 + "--sfile", path_to_seq_data + f"realigned/{rfam_acc}.tsv",
2334 + path_to_seq_data + f"realigned/{rfam_acc}.cm",
2335 + path_to_seq_data + f"realigned/{rfam_acc}_new.fa"],
2302 stdout=o, stderr=subprocess.PIPE) 2336 stdout=o, stderr=subprocess.PIPE)
2303 notify("Aligned new sequences together") 2337 notify("Aligned new sequences together")
2304 2338
...@@ -2337,8 +2371,8 @@ def work_realign(rfam_acc): ...@@ -2337,8 +2371,8 @@ def work_realign(rfam_acc):
2337 # Alignment does not exist yet. We need to compute it from scratch. 2371 # Alignment does not exist yet. We need to compute it from scratch.
2338 print(f"\t> Aligning {rfam_acc} sequences together (cmalign) ...", end='', flush=True) 2372 print(f"\t> Aligning {rfam_acc} sequences together (cmalign) ...", end='', flush=True)
2339 2373
2340 - p = subprocess.run(["cmalign", "--small", "--cyk", "--noprob", "--nonbanded", "--notrunc", 2374 + p = subprocess.run(["cmalign", "--nonbanded", '-o', path_to_seq_data + f"realigned/{rfam_acc}++.stk",
2341 - '-o', path_to_seq_data + f"realigned/{rfam_acc}++.stk", 2375 + "--ifile", path_to_seq_data + f"realigned/{rfam_acc}.ins", "--sfile", path_to_seq_data + f"realigned/{rfam_acc}.tsv",
2342 path_to_seq_data + f"realigned/{rfam_acc}.cm", 2376 path_to_seq_data + f"realigned/{rfam_acc}.cm",
2343 path_to_seq_data + f"realigned/{rfam_acc}++.fa"], 2377 path_to_seq_data + f"realigned/{rfam_acc}++.fa"],
2344 stdout=subprocess.DEVNULL, stderr=subprocess.PIPE) 2378 stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
...@@ -2532,6 +2566,30 @@ def work_pssm_remap(f): ...@@ -2532,6 +2566,30 @@ def work_pssm_remap(f):
2532 2566
2533 setproctitle(f"RNAnet.py work_pssm_remap({f}) saving") 2567 setproctitle(f"RNAnet.py work_pssm_remap({f}) saving")
2534 2568
2569 + # Get back the information of match/insertion states from the STK file
2570 + alignstk = AlignIO.read(path_to_seq_data + "realigned/" + f + "++.stk", "stockholm")
2571 + consensus_2d = alignstk.column_annotations["secondary_structure"]
2572 + del alignstk
2573 + cm_coord = 1
2574 + cm_coords = []
2575 + cm_2d = []
2576 + for x in consensus_2d:
2577 + if x == ".":
2578 + cm_coords.append(None)
2579 + cm_2d.append(None)
2580 + else:
2581 + cm_coords.append(cm_coord)
2582 + if x in "[(<{":
2583 + cm_2d.append("(")
2584 + elif x in "])>}":
2585 + cm_2d.append(")")
2586 + elif x in ",_-:":
2587 + cm_2d.append(".")
2588 + else:
2589 + warn("Unsupported WUSS secondary structure symbol : "+x)
2590 + cm_2d.append(".")
2591 + cm_coord += 1
2592 +
2535 # Save the re_mappings 2593 # Save the re_mappings
2536 conn = sqlite3.connect(runDir + '/results/RNANet.db', timeout=20.0) 2594 conn = sqlite3.connect(runDir + '/results/RNANet.db', timeout=20.0)
2537 conn.execute('pragma journal_mode=wal') # Allow multiple other readers to ask things while we execute this writing query 2595 conn.execute('pragma journal_mode=wal') # Allow multiple other readers to ask things while we execute this writing query
...@@ -2550,20 +2608,20 @@ def work_pssm_remap(f): ...@@ -2550,20 +2608,20 @@ def work_pssm_remap(f):
2550 conn.commit() 2608 conn.commit()
2551 2609
2552 # Save the useful columns in the database 2610 # Save the useful columns in the database
2553 - data = [(f, j) + tuple(pssm_info[:,j-1]) + (consensus[j-1],) for j in sorted(columns_to_save)] 2611 + data = [(f, j, cm_coords[j-1]) + tuple(pssm_info[:,j-1]) + (consensus[j-1], cm_2d[j-1]) for j in sorted(columns_to_save)]
2554 - sql_execute(conn, """INSERT INTO align_column (rfam_acc, index_ali, freq_A, freq_C, freq_G, freq_U, freq_other, gap_percent, consensus) 2612 + sql_execute(conn, """INSERT INTO align_column (rfam_acc, index_ali, cm_coord, freq_A, freq_C, freq_G, freq_U, freq_other, gap_percent, consensus, cons_sec_struct)
2555 - VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?) ON CONFLICT(rfam_acc, index_ali) DO 2613 + VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) ON CONFLICT(rfam_acc, index_ali) DO
2556 - UPDATE SET freq_A=excluded.freq_A, freq_C=excluded.freq_C, freq_G=excluded.freq_G, freq_U=excluded.freq_U, 2614 + UPDATE SET cm_coord=excluded.cm_coord, freq_A=excluded.freq_A, freq_C=excluded.freq_C, freq_G=excluded.freq_G, freq_U=excluded.freq_U,
2557 - freq_other=excluded.freq_other, gap_percent=excluded.gap_percent, consensus=excluded.consensus;""", many=True, data=data) 2615 + freq_other=excluded.freq_other, gap_percent=excluded.gap_percent, consensus=excluded.consensus, cons_sec_struct=excluded.cons_sec_struct;""", many=True, data=data)
2558 # Add an unknown values column, with index_ali 0 (for nucleotides unsolved in 3D giving a gap '-' but found facing letter in the alignment) 2616 # Add an unknown values column, with index_ali 0 (for nucleotides unsolved in 3D giving a gap '-' but found facing letter in the alignment)
2559 - sql_execute(conn, f"""INSERT OR IGNORE INTO align_column (rfam_acc, index_ali, freq_A, freq_C, freq_G, freq_U, freq_other, gap_percent, consensus) 2617 + sql_execute(conn, f"""INSERT OR IGNORE INTO align_column (rfam_acc, index_ali, cm_coord, freq_A, freq_C, freq_G, freq_U, freq_other, gap_percent, consensus, cons_sec_struct)
2560 - VALUES (?, 0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, '-');""", data=(f,)) 2618 + VALUES (?, 0, NULL, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, '-', NULL);""", data=(f,))
2561 # Save the number of "used columns" to table family ( = the length of the alignment if it was composed only of the RNANet chains) 2619 # Save the number of "used columns" to table family ( = the length of the alignment if it was composed only of the RNANet chains)
2562 sql_execute(conn, f"UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;", data=(len(columns_to_save), f)) 2620 sql_execute(conn, f"UPDATE family SET ali_filtered_len = ? WHERE rfam_acc = ?;", data=(len(columns_to_save), f))
2563 conn.close() 2621 conn.close()
2564 2622
2565 ########################################################################################## 2623 ##########################################################################################
2566 - # Saving the filtered alignement with only the saved positinos 2624 + # Saving the filtered alignment with only the saved positions
2567 ########################################################################################## 2625 ##########################################################################################
2568 2626
2569 setproctitle(f"RNAnet.py work_pssm_remap({f}) filtering alignment") 2627 setproctitle(f"RNAnet.py work_pssm_remap({f}) filtering alignment")
...@@ -2600,8 +2658,8 @@ def work_save(c, homology=True): ...@@ -2600,8 +2658,8 @@ def work_save(c, homology=True):
2600 conn.execute('pragma journal_mode=wal') 2658 conn.execute('pragma journal_mode=wal')
2601 if homology: 2659 if homology:
2602 df = pd.read_sql_query(f""" 2660 df = pd.read_sql_query(f"""
2603 - SELECT index_chain, old_nt_resnum, nt_position, nt_name, nt_code, nt_align_code, 2661 + SELECT index_chain, old_nt_resnum, nt_position, nt_name, nt_code, nt_align_code, cm_coord,
2604 - is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, gap_percent, consensus, dbn, 2662 + is_A, is_C, is_G, is_U, is_other, freq_A, freq_C, freq_G, freq_U, freq_other, gap_percent, consensus, cons_sec_struct, dbn,
2605 paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta, 2663 paired, nb_interact, pair_type_LW, pair_type_DSSR, alpha, beta, gamma, delta, epsilon, zeta, epsilon_zeta,
2606 chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base, 2664 chi, bb_type, glyco_bond, form, ssZp, Dp, eta, theta, eta_prime, theta_prime, eta_base, theta_base,
2607 v0, v1, v2, v3, v4, amplitude, phase_angle, puckering FROM 2665 v0, v1, v2, v3, v4, amplitude, phase_angle, puckering FROM
...@@ -2682,6 +2740,7 @@ if __name__ == "__main__": ...@@ -2682,6 +2740,7 @@ if __name__ == "__main__":
2682 work_save(c, homology=False) 2740 work_save(c, homology=False)
2683 print("Completed.") 2741 print("Completed.")
2684 exit(0) 2742 exit(0)
2743 +
2685 2744
2686 # At this point, structure, chain and nucleotide tables of the database are up to date. 2745 # At this point, structure, chain and nucleotide tables of the database are up to date.
2687 # (Modulo some statistics computed by statistics.py) 2746 # (Modulo some statistics computed by statistics.py)
...@@ -2716,6 +2775,7 @@ if __name__ == "__main__": ...@@ -2716,6 +2775,7 @@ if __name__ == "__main__":
2716 idxQueue = thr_idx_mgr.Queue() 2775 idxQueue = thr_idx_mgr.Queue()
2717 2776
2718 pp.remap() 2777 pp.remap()
2778 + pp.extractCMs()
2719 2779
2720 # At this point, the align_column and re_mapping tables are up-to-date. 2780 # At this point, the align_column and re_mapping tables are up-to-date.
2721 2781
......
1 +2z9q_1_A_1-72
2 +1ml5_1_b_5-121
3 +1ml5_1_a_1-2914
4 +3ep2_1_Y_1-72
5 +3eq3_1_Y_1-72
6 +4v48_1_A6_1-73
7 +1ml5_1_A_2-1520
8 +1qzb_1_B_1-73
9 +1qza_1_B_1-73
10 +1ls2_1_B_1-73
11 +1gsg_1_T_1-72
12 +3jcr_1_H_1-115
13 +1vy7_1_AY_1-73
14 +1vy7_1_CY_1-73
15 +4w2h_1_CY_1-73
16 +5zzm_1_M_3-118
17 +2rdo_1_A_3-118
18 +4v48_1_A9_3-118
19 +4v47_1_A9_3-118
20 +2ob7_1_A_10-319
21 +1x1l_1_A_1-132
22 +1zc8_1_Z_1-93
23 +2ob7_1_D_1-132
24 +4v42_1_BB_5-121
25 +4v42_1_BA_1-2914
26 +1r2x_1_C_1-58
27 +1r2w_1_C_1-58
28 +1eg0_1_L_1-56
29 +5zzm_1_N_1-2904
30 +2rdo_1_B_1-2904
31 +3dg2_1_B_1-2904
32 +3dg0_1_B_1-2904
33 +4v48_1_A0_1-2904
34 +4v47_1_A0_1-2904
35 +3dg4_1_B_1-2904
36 +3dg5_1_B_1-2904
37 +3dg2_1_A_1-1542
38 +3dg0_1_A_1-1542
39 +4v48_1_BA_1-1543
40 +4v47_1_BA_1-1542
41 +3dg4_1_A_1-1542
42 +3dg5_1_A_1-1542
43 +1eg0_1_O_1-73
44 +1zc8_1_A_1-59
45 +1mvr_1_D_1-61
46 +4adx_1_9_1-123
47 +1zn1_1_B_1-59
48 +1jgq_1_A_2-1520
49 +4v42_1_AA_2-1520
50 +1jgo_1_A_2-1520
51 +1jgp_1_A_2-1520
52 +1emi_1_B_1-108
53 +3iy9_1_A_498-1027
54 +3ep2_1_B_1-50
55 +3eq3_1_B_1-50
56 +3eq4_1_B_1-50
57 +3pgw_1_R_1-164
58 +3pgw_1_N_1-164
59 +3cw1_1_x_1-138
60 +3cw1_1_w_1-138
61 +3cw1_1_V_1-138
62 +3cw1_1_v_1-138
63 +2iy3_1_B_9-105
64 +3jcr_1_N_1-107
65 +2vaz_1_A_64-177
66 +2ftc_1_R_81-1466
67 +3jcr_1_M_1-141
68 +4v5z_1_B0_1-2902
69 +5g2x_1_A_595-692
70 +3iy8_1_A_1-540
71 +4v5z_1_BY_2-113
72 +4v5z_1_BZ_1-70
73 +4v5z_1_B1_2-125
74 +4adx_1_0_1-2925
75 +1mvr_1_B_3-96
76 +3eq4_1_Y_1-69
77 +6uz7_1_8_2140-2827
78 +4v5z_1_AA_1-1563
1 +2z9q_1_A_1-72
2 +DSSR warning 2z9q.json: no nucleotides found. Ignoring 2z9q_1_A_1-72.
3 +
4 +1ml5_1_b_5-121
5 +Could not find nucleotides of chain b in annotation 1ml5.json. Either there is a problem with 1ml5 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
6 +
7 +1ml5_1_a_1-2914
8 +Could not find nucleotides of chain a in annotation 1ml5.json. Either there is a problem with 1ml5 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
9 +
10 +3ep2_1_Y_1-72
11 +DSSR warning 3ep2.json: no nucleotides found. Ignoring 3ep2_1_Y_1-72.
12 +
13 +3eq3_1_Y_1-72
14 +DSSR warning 3eq3.json: no nucleotides found. Ignoring 3eq3_1_Y_1-72.
15 +
16 +4v48_1_A6_1-73
17 +DSSR warning 4v48.json: no nucleotides found. Ignoring 4v48_1_A6_1-73.
18 +
19 +1ml5_1_A_2-1520
20 +Could not find nucleotides of chain A in annotation 1ml5.json. Either there is a problem with 1ml5 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
21 +
22 +1qzb_1_B_1-73
23 +DSSR warning 1qzb.json: no nucleotides found. Ignoring 1qzb_1_B_1-73.
24 +
25 +1qza_1_B_1-73
26 +DSSR warning 1qza.json: no nucleotides found. Ignoring 1qza_1_B_1-73.
27 +
28 +1ls2_1_B_1-73
29 +DSSR warning 1ls2.json: no nucleotides found. Ignoring 1ls2_1_B_1-73.
30 +
31 +1gsg_1_T_1-72
32 +DSSR warning 1gsg.json: no nucleotides found. Ignoring 1gsg_1_T_1-72.
33 +
34 +3jcr_1_H_1-115
35 +DSSR warning 3jcr.json: no nucleotides found. Ignoring 3jcr_1_H_1-115.
36 +
37 +1vy7_1_AY_1-73
38 +Sequence is too short. (< 5 resolved nts)
39 +
40 +1vy7_1_CY_1-73
41 +Sequence is too short. (< 5 resolved nts)
42 +
43 +4w2h_1_CY_1-73
44 +Sequence is too short. (< 5 resolved nts)
45 +
46 +5zzm_1_M_3-118
47 +DSSR warning 5zzm.json: no nucleotides found. Ignoring 5zzm_1_M_3-118.
48 +
49 +2rdo_1_A_3-118
50 +DSSR warning 2rdo.json: no nucleotides found. Ignoring 2rdo_1_A_3-118.
51 +
52 +4v48_1_A9_3-118
53 +DSSR warning 4v48.json: no nucleotides found. Ignoring 4v48_1_A9_3-118.
54 +
55 +4v47_1_A9_3-118
56 +DSSR warning 4v47.json: no nucleotides found. Ignoring 4v47_1_A9_3-118.
57 +
58 +2ob7_1_A_10-319
59 +DSSR warning 2ob7.json: no nucleotides found. Ignoring 2ob7_1_A_10-319.
60 +
61 +1x1l_1_A_1-132
62 +DSSR warning 1x1l.json: no nucleotides found. Ignoring 1x1l_1_A_1-132.
63 +
64 +1zc8_1_Z_1-93
65 +DSSR warning 1zc8.json: no nucleotides found. Ignoring 1zc8_1_Z_1-93.
66 +
67 +2ob7_1_D_1-132
68 +DSSR warning 2ob7.json: no nucleotides found. Ignoring 2ob7_1_D_1-132.
69 +
70 +4v42_1_BB_5-121
71 +Could not find nucleotides of chain BB in annotation 4v42.json. Either there is a problem with 4v42 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
72 +
73 +4v42_1_BA_1-2914
74 +Could not find nucleotides of chain BA in annotation 4v42.json. Either there is a problem with 4v42 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
75 +
76 +1r2x_1_C_1-58
77 +DSSR warning 1r2x.json: no nucleotides found. Ignoring 1r2x_1_C_1-58.
78 +
79 +1r2w_1_C_1-58
80 +DSSR warning 1r2w.json: no nucleotides found. Ignoring 1r2w_1_C_1-58.
81 +
82 +1eg0_1_L_1-56
83 +DSSR warning 1eg0.json: no nucleotides found. Ignoring 1eg0_1_L_1-56.
84 +
85 +5zzm_1_N_1-2904
86 +DSSR warning 5zzm.json: no nucleotides found. Ignoring 5zzm_1_N_1-2904.
87 +
88 +2rdo_1_B_1-2904
89 +DSSR warning 2rdo.json: no nucleotides found. Ignoring 2rdo_1_B_1-2904.
90 +
91 +3dg2_1_B_1-2904
92 +DSSR warning 3dg2.json: no nucleotides found. Ignoring 3dg2_1_B_1-2904.
93 +
94 +3dg0_1_B_1-2904
95 +DSSR warning 3dg0.json: no nucleotides found. Ignoring 3dg0_1_B_1-2904.
96 +
97 +4v48_1_A0_1-2904
98 +DSSR warning 4v48.json: no nucleotides found. Ignoring 4v48_1_A0_1-2904.
99 +
100 +4v47_1_A0_1-2904
101 +DSSR warning 4v47.json: no nucleotides found. Ignoring 4v47_1_A0_1-2904.
102 +
103 +3dg4_1_B_1-2904
104 +DSSR warning 3dg4.json: no nucleotides found. Ignoring 3dg4_1_B_1-2904.
105 +
106 +3dg5_1_B_1-2904
107 +DSSR warning 3dg5.json: no nucleotides found. Ignoring 3dg5_1_B_1-2904.
108 +
109 +3dg2_1_A_1-1542
110 +DSSR warning 3dg2.json: no nucleotides found. Ignoring 3dg2_1_A_1-1542.
111 +
112 +3dg0_1_A_1-1542
113 +DSSR warning 3dg0.json: no nucleotides found. Ignoring 3dg0_1_A_1-1542.
114 +
115 +4v48_1_BA_1-1543
116 +DSSR warning 4v48.json: no nucleotides found. Ignoring 4v48_1_BA_1-1543.
117 +
118 +4v47_1_BA_1-1542
119 +DSSR warning 4v47.json: no nucleotides found. Ignoring 4v47_1_BA_1-1542.
120 +
121 +3dg4_1_A_1-1542
122 +DSSR warning 3dg4.json: no nucleotides found. Ignoring 3dg4_1_A_1-1542.
123 +
124 +3dg5_1_A_1-1542
125 +DSSR warning 3dg5.json: no nucleotides found. Ignoring 3dg5_1_A_1-1542.
126 +
127 +1eg0_1_O_1-73
128 +DSSR warning 1eg0.json: no nucleotides found. Ignoring 1eg0_1_O_1-73.
129 +
130 +1zc8_1_A_1-59
131 +DSSR warning 1zc8.json: no nucleotides found. Ignoring 1zc8_1_A_1-59.
132 +
133 +1mvr_1_D_1-61
134 +DSSR warning 1mvr.json: no nucleotides found. Ignoring 1mvr_1_D_1-61.
135 +
136 +4adx_1_9_1-123
137 +DSSR warning 4adx.json: no nucleotides found. Ignoring 4adx_1_9_1-123.
138 +
139 +1zn1_1_B_1-59
140 +DSSR warning 1zn1.json: no nucleotides found. Ignoring 1zn1_1_B_1-59.
141 +
142 +1jgq_1_A_2-1520
143 +Could not find nucleotides of chain A in annotation 1jgq.json. Either there is a problem with 1jgq mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
144 +
145 +4v42_1_AA_2-1520
146 +Could not find nucleotides of chain AA in annotation 4v42.json. Either there is a problem with 4v42 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
147 +
148 +1jgo_1_A_2-1520
149 +Could not find nucleotides of chain A in annotation 1jgo.json. Either there is a problem with 1jgo mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
150 +
151 +1jgp_1_A_2-1520
152 +Could not find nucleotides of chain A in annotation 1jgp.json. Either there is a problem with 1jgp mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
153 +
154 +1emi_1_B_1-108
155 +DSSR warning 1emi.json: no nucleotides found. Ignoring 1emi_1_B_1-108.
156 +
157 +3iy9_1_A_498-1027
158 +DSSR warning 3iy9.json: no nucleotides found. Ignoring 3iy9_1_A_498-1027.
159 +
160 +3ep2_1_B_1-50
161 +DSSR warning 3ep2.json: no nucleotides found. Ignoring 3ep2_1_B_1-50.
162 +
163 +3eq3_1_B_1-50
164 +DSSR warning 3eq3.json: no nucleotides found. Ignoring 3eq3_1_B_1-50.
165 +
166 +3eq4_1_B_1-50
167 +DSSR warning 3eq4.json: no nucleotides found. Ignoring 3eq4_1_B_1-50.
168 +
169 +3pgw_1_R_1-164
170 +DSSR warning 3pgw.json: no nucleotides found. Ignoring 3pgw_1_R_1-164.
171 +
172 +3pgw_1_N_1-164
173 +DSSR warning 3pgw.json: no nucleotides found. Ignoring 3pgw_1_N_1-164.
174 +
175 +3cw1_1_x_1-138
176 +DSSR warning 3cw1.json: no nucleotides found. Ignoring 3cw1_1_x_1-138.
177 +
178 +3cw1_1_w_1-138
179 +DSSR warning 3cw1.json: no nucleotides found. Ignoring 3cw1_1_w_1-138.
180 +
181 +3cw1_1_V_1-138
182 +DSSR warning 3cw1.json: no nucleotides found. Ignoring 3cw1_1_V_1-138.
183 +
184 +3cw1_1_v_1-138
185 +DSSR warning 3cw1.json: no nucleotides found. Ignoring 3cw1_1_v_1-138.
186 +
187 +2iy3_1_B_9-105
188 +DSSR warning 2iy3.json: no nucleotides found. Ignoring 2iy3_1_B_9-105.
189 +
190 +3jcr_1_N_1-107
191 +DSSR warning 3jcr.json: no nucleotides found. Ignoring 3jcr_1_N_1-107.
192 +
193 +2vaz_1_A_64-177
194 +DSSR warning 2vaz.json: no nucleotides found. Ignoring 2vaz_1_A_64-177.
195 +
196 +2ftc_1_R_81-1466
197 +DSSR warning 2ftc.json: no nucleotides found. Ignoring 2ftc_1_R_81-1466.
198 +
199 +3jcr_1_M_1-141
200 +DSSR warning 3jcr.json: no nucleotides found. Ignoring 3jcr_1_M_1-141.
201 +
202 +4v5z_1_B0_1-2902
203 +DSSR warning 4v5z.json: no nucleotides found. Ignoring 4v5z_1_B0_1-2902.
204 +
205 +5g2x_1_A_595-692
206 +Sequence is too short. (< 5 resolved nts)
207 +
208 +3iy8_1_A_1-540
209 +DSSR warning 3iy8.json: no nucleotides found. Ignoring 3iy8_1_A_1-540.
210 +
211 +4v5z_1_BY_2-113
212 +DSSR warning 4v5z.json: no nucleotides found. Ignoring 4v5z_1_BY_2-113.
213 +
214 +4v5z_1_BZ_1-70
215 +DSSR warning 4v5z.json: no nucleotides found. Ignoring 4v5z_1_BZ_1-70.
216 +
217 +4v5z_1_B1_2-125
218 +DSSR warning 4v5z.json: no nucleotides found. Ignoring 4v5z_1_B1_2-125.
219 +
220 +4adx_1_0_1-2925
221 +DSSR warning 4adx.json: no nucleotides found. Ignoring 4adx_1_0_1-2925.
222 +
223 +1mvr_1_B_3-96
224 +DSSR warning 1mvr.json: no nucleotides found. Ignoring 1mvr_1_B_3-96.
225 +
226 +3eq4_1_Y_1-69
227 +DSSR warning 3eq4.json: no nucleotides found. Ignoring 3eq4_1_Y_1-69.
228 +
229 +6uz7_1_8_2140-2827
230 +Could not find nucleotides of chain 8 in annotation 6uz7.json. Either there is a problem with 6uz7 mmCIF download, or the bases are not resolved in the structure. Delete it and retry.
231 +
232 +4v5z_1_AA_1-1563
233 +DSSR warning 4v5z.json: no nucleotides found. Ignoring 4v5z_1_AA_1-1563.
234 +
...@@ -958,6 +958,8 @@ def par_distance_matrix(filelist, f, label, consider_all_atoms, s): ...@@ -958,6 +958,8 @@ def par_distance_matrix(filelist, f, label, consider_all_atoms, s):
958 958
959 @trace_unhandled_exceptions 959 @trace_unhandled_exceptions
960 def get_avg_std_distance_matrix(f, consider_all_atoms, multithread=False): 960 def get_avg_std_distance_matrix(f, consider_all_atoms, multithread=False):
961 + np.seterr(divide='ignore') # ignore division by zero issues
962 +
961 if consider_all_atoms: 963 if consider_all_atoms:
962 label = "base" 964 label = "base"
963 else: 965 else:
...@@ -1024,44 +1026,38 @@ def get_avg_std_distance_matrix(f, consider_all_atoms, multithread=False): ...@@ -1024,44 +1026,38 @@ def get_avg_std_distance_matrix(f, consider_all_atoms, multithread=False):
1024 p.join() 1026 p.join()
1025 exit(1) 1027 exit(1)
1026 1028
1027 - with warnings.catch_warnings(record=True) as w: 1029 + # Calculation of the average matrix
1028 - warnings.simplefilter("error") 1030 + avg = np.divide(avg, counts, where=counts>0, out=np.full_like(avg, np.NaN)) # Ultrafancy way to take avg/counts or NaN if counts is 0
1029 - 1031 +
1030 - # Calculation of the average matrix 1032 + np.savetxt(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '_average.csv' , avg, delimiter=",", fmt="%.3f")
1031 - try: 1033 +
1032 - avg = avg/counts 1034 + fig, ax = plt.subplots()
1033 - assert (counts >0).all() 1035 + im = ax.imshow(avg)
1034 - except RuntimeWarning as e: 1036 + cbar = ax.figure.colorbar(im, ax=ax)
1035 - # there will be division by 0 if count is 0 at some position = gap only column 1037 + cbar.ax.set_ylabel("Angströms", rotation=-90, va="bottom")
1036 - pass 1038 + ax.set_title(f"Average distance between {f} residues (Angströms)")
1037 - np.savetxt(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '_average.csv' , avg, delimiter=",", fmt="%.3f") 1039 + fig.tight_layout()
1038 - 1040 + fig.savefig(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '_average.png', dpi=300)
1039 - fig, ax = plt.subplots() 1041 + plt.close()
1040 - im = ax.imshow(avg)
1041 - cbar = ax.figure.colorbar(im, ax=ax)
1042 - cbar.ax.set_ylabel("Angströms", rotation=-90, va="bottom")
1043 - ax.set_title(f"Average distance between {f} residues (Angströms)")
1044 - fig.tight_layout()
1045 - fig.savefig(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '_average.png', dpi=300)
1046 - plt.close()
1047 1042
1048 - # Calculation of the standard deviation matrix by the Huygens theorem 1043 + # Calculation of the standard deviation matrix by the Huygens theorem
1049 - try: 1044 + std = np.divide(std, counts, where=counts>0, out=np.full_like(std, np.NaN))
1050 - std = np.sqrt(std/counts - np.power(avg/counts, 2)) 1045 + mask = np.invert(np.isnan(std))
1051 - assert (counts >0).all() 1046 + value = std[mask] - np.power(avg[mask], 2)
1052 - except RuntimeWarning as e: 1047 + if ((value[value<0] < -1e-2).any()):
1053 - # there will be division by 0 if count is 0 at some position = gap only column 1048 + warn("Erasing very negative variance value !")
1054 - pass 1049 + value[value<0] = 0.0 # floating point problems !
1055 - np.savetxt(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '_stdev.csv' , std, delimiter=",", fmt="%.3f") 1050 + std[mask] = np.sqrt(value)
1056 - 1051 + np.savetxt(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '_stdev.csv' , std, delimiter=",", fmt="%.3f")
1057 - fig, ax = plt.subplots() 1052 +
1058 - im = ax.imshow(std) 1053 + fig, ax = plt.subplots()
1059 - cbar = ax.figure.colorbar(im, ax=ax) 1054 + im = ax.imshow(std)
1060 - cbar.ax.set_ylabel("Angströms", rotation=-90, va="bottom") 1055 + cbar = ax.figure.colorbar(im, ax=ax)
1061 - ax.set_title(f"Standard deviation of distances between {f} residues (Angströms)") 1056 + cbar.ax.set_ylabel("Angströms", rotation=-90, va="bottom")
1062 - fig.tight_layout() 1057 + ax.set_title(f"Standard deviation of distances between {f} residues (Angströms)")
1063 - fig.savefig(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '_std.png', dpi=300) 1058 + fig.tight_layout()
1064 - plt.close() 1059 + fig.savefig(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '_std.png', dpi=300)
1060 + plt.close()
1065 1061
1066 # Save log 1062 # Save log
1067 with open(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '.log', 'a') as logfile: 1063 with open(runDir + '/results/distance_matrices/' + f + '_'+ label + '/' + f + '.log', 'a') as logfile:
......
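For reference, the average and standard-deviation matrices above are accumulated with the König-Huygens identity Var(X) = E[X²] − E[X]². A toy sketch of the same scheme, not part of this commit (the observation matrices are made up):

```python
import numpy as np

# Toy illustration: sum the observed values and their squares per cell, then derive
# the mean and standard deviation, leaving NaN where nothing was observed.
observations = [np.array([[1.0, 2.0], [3.0, np.nan]]),
                np.array([[2.0, 2.0], [5.0, np.nan]])]
sums = np.zeros((2, 2))
squares = np.zeros((2, 2))
counts = np.zeros((2, 2))
for m in observations:
    seen = ~np.isnan(m)
    sums[seen] += m[seen]
    squares[seen] += m[seen] ** 2
    counts[seen] += 1

avg = np.divide(sums, counts, where=counts > 0, out=np.full_like(sums, np.nan))
var = np.divide(squares, counts, where=counts > 0, out=np.full_like(squares, np.nan)) - avg ** 2
var[var < 0] = 0.0  # clamp tiny negative values caused by floating-point error
std = np.sqrt(var)
print(avg)  # means: 1.5, 2.0 / 4.0, NaN
print(std)  # standard deviations: 0.5, 0.0 / 1.0, NaN
```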