Louis BECQUEY

Splitted Aglaé's code in a separate stats file

......@@ -7,6 +7,15 @@ In `cmalign` alignments, - means a nucleotide is missing compared to the covaria
In the final filtered alignment that we provide for download, the same rule applies, but on top of that, some '.' are replaced by '-' when a gap in the 3D structure (a missing, unresolved nucleotide) is mapped to an insertion gap.
* **What are the cmalign options for ?**
According to Infernal's user guide, Infernal uses an HMM banding technique by default to accelerate alignment. It also handles 3'- or 5'-truncated sequences so that they are aligned correctly (and we have some).
First, one can choose an algorithm, between `--optacc` (maximizing posterior probabilities, the default) and `--cyk` (maximizing likelihood).
Then, the use of bands allows faster and more memory-efficient computation, at the price of losing the guarantee of finding the optimal alignment. Bands can be disabled using the `--nonbanded` option. A better idea is to control the probability mass considered negligible during HMM band calculation with the `--tau` parameter. Higher values of tau yield greater speedups and lower memory usage, but a greater chance of missing the optimal alignment. In practice, the algorithm explores several tau values (multiplying by a factor of 2.0 each time, starting from the original `--tau` value) until the DP matrix size falls below the threshold given by `--mxsize` (default 1028 Mb) or the value of `--maxtau` is reached (in which case the program fails). One can disable this exploration with the `--fixedtau` option. The default value of `--tau` is 1e-7 and the default `--maxtau` is 0.05. As a rule of thumb, you may choose a value of `--mxsize` by dividing your available RAM by the number of cores used with cmalign. If necessary, you may use fewer cores than you have with the `--cpu` option.
Finally, if using `--cyk --nonbanded --notrunc --noprob`, one can use the `--small` option to align using the divide-and-conquer CYK algorithm from Eddy 2002, which requires very little memory but a lot of time. The major drawback is that it requires `--notrunc` and `--noprob`, so we give up on the correct alignment of truncated sequences and on the computation of posterior probabilities.
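As an illustration of the rule of thumb above, here is a minimal Python sketch. It only mirrors what this answer describes: it is not Infernal's code, and the helper names (`mxsize_per_core`, `tau_schedule`, `cmalign_command`) and the file names are hypothetical. The real stopping criterion depends on the DP matrix size, which this sketch does not compute; it only lists the candidate tau values.

```python
def mxsize_per_core(available_ram_mb: float, cores: int) -> float:
    """Rule of thumb from the text: available RAM divided by the cores used."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return available_ram_mb / cores


def tau_schedule(tau: float = 1e-7, maxtau: float = 0.05, factor: float = 2.0):
    """Candidate tau values: start at tau, multiply by factor, stop past maxtau."""
    values = []
    current = tau
    while current <= maxtau:
        values.append(current)
        current *= factor
    return values


def cmalign_command(cm_file, fasta_file, out_file, available_ram_mb, cores):
    """Build (but do not run) a cmalign argument list with a per-core --mxsize."""
    mxsize = mxsize_per_core(available_ram_mb, cores)
    return [
        "cmalign",
        "--cpu", str(cores),
        "--mxsize", f"{mxsize:.0f}",
        "--tau", "1e-7",
        "--maxtau", "0.05",
        "-o", out_file,
        cm_file,
        fasta_file,
    ]


if __name__ == "__main__":
    # Hypothetical machine: 16000 Mb of RAM shared between 8 cores.
    print(" ".join(cmalign_command("family.cm", "family.fa", "family.stk", 16000, 8)))
    print(len(tau_schedule()), "candidate tau values between 1e-7 and 0.05")
```

The argument list can be passed to `subprocess.run` once the paths point to real files.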
* **Why are there some gap-only columns in the alignment ?**
These columns are not completely gap-only: they contain at least one dash gap '-'. This means that an actual, physical nucleotide, which should exist in the 3D structure, should be located there. The previous and following nucleotides are **not** contiguous in space in 3D.
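To check this on a downloaded alignment, one could look for columns that contain no nucleotide letter but at least one '-'. The sketch below is only an illustration and assumes the alignment rows are already loaded as equal-length strings; the function name is hypothetical and this is not the code used to build the alignments.

```python
def gap_columns_with_dash(rows):
    """Return 0-based indices of columns made only of '.' and '-' with at least one '-'."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all alignment rows must have the same length")
    hits = []
    for col in range(width):
        chars = {row[col] for row in rows}
        # No nucleotide letter in this column, and at least one dash gap.
        if chars <= {".", "-"} and "-" in chars:
            hits.append(col)
    return hits


if __name__ == "__main__":
    toy_alignment = ["AC-.G", "A..-G", "AU..G"]
    print(gap_columns_with_dash(toy_alignment))
```

A column made only of '.' characters is not reported, since it holds no dash gap.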
......@@ -31,5 +40,5 @@ We first remove the nucleotides whose number is outside the family mapping (if a
* **What are the versions of the dependencies you use ?**
`cmalign` is v1.1.4, `sina` is v1.6.0, `x3dna-dssr` is v1.9.9, Biopython is v1.78.
`cmalign` is v1.1.4, `sina` is v1.6.0, `x3dna-dssr` is v2.3.2-2021jun29, Biopython is v1.78.
\ No newline at end of file
......
......@@ -6,23 +6,16 @@
* Some chains are not correctly renamed A in the produced separate files (e.g. 1d4r-B)
## Alignment issues
* [SOLVED] Filtered alignments are shorter than the number of alignment columns saved to the SQL table `align_column`
* Chain names appear three times in the FASTA header (e.g. 1d4r[1]-B 1d4r[1]-B 1d4r[1]-B)
## Technical running issues
* [SOLVED] Files produced by Docker containers are owned by root and require root permissions to be read
* [SOLVED] SQLite WAL files are not deleted properly
# Known feature requests
* [DONE] Get filtered versions of the sequence alignments containing the 3D chains, publicly available for download
* [DONE] Get a consensus residue for each alignment column
* [DONE] Get an option to limit the number of cores
* [DONE] Move to SILVA LSU release 138.1
* [UPCOMING] Automated annotation of detected Recurrent Interaction Networks (RINs), see http://carnaval.lri.fr/ .
* [UPCOMING] Possibly, automated detection of HLs and ILs from the 3D Motif Atlas (BGSU). Maybe. Their own website already does the job.
* [UPCOMING] Weight sequences in alignment to give more importance to rarer sequences
* [UPCOMING] Give both gap_percent and insertion_gap_percent
* Automated annotation of detected Recurrent Interaction Networks (RINs), see http://carnaval.lri.fr/ .
* Possibly, automated detection of HLs and ILs from the 3D Motif Atlas (BGSU). Maybe. Their own website already does the job.
* Weight sequences in alignment to give more importance to rarer sequences
* Give both gap_percent and insertion_gap_percent
* A field estimating the quality of the sequence alignment in table family.
* Possibly, more metrics about the alignments coming from Infernal.
* Run cmscan ourselves from the NDB instead of using Rfam-PDB mappings ? (Iff this actually makes a real difference, untested yet)
* Use and save Infernal alignment bounds and truncation information
* Save if a chain is a representative in BGSU list
* Annotate unstructured regions (on a nucleotide basis)
......
6ydp_1_AA_1176-2737
6ydw_1_AA_1176-2737
2z9q_1_A_1-72
1ml5_1_b_5-121
1ml5_1_a_1-2914
3ep2_1_Y_1-72
3eq3_1_Y_1-72
4v48_1_A6_1-73
1ml5_1_A_2-1520
1ml5_1_b_5-121
1ml5_1_a_1-2914
1qzb_1_B_1-73
1qza_1_B_1-73
1ls2_1_B_1-73
1ml5_1_A_2-1520
1gsg_1_T_1-72
7d1a_1_A_805-902
7d0g_1_A_805-913
......@@ -22,15 +22,12 @@
2rdo_1_A_3-118
4v48_1_A9_3-118
4v47_1_A9_3-118
4v42_1_BA_1-2914
4v42_1_BB_5-121
2ob7_1_A_10-319
1x1l_1_A_1-130
1zc8_1_Z_1-91
2ob7_1_D_1-130
4v42_1_BA_1-2914
4v42_1_BB_5-121
1r2x_1_C_1-58
1r2w_1_C_1-58
1eg0_1_L_1-56
3dg2_1_A_1-1542
3dg0_1_A_1-1542
4v48_1_BA_1-1543
......@@ -46,11 +43,14 @@
3dg4_1_B_1-2904
3dg5_1_B_1-2904
1eg0_1_O_1-73
1zc8_1_A_1-59
1r2x_1_C_1-58
1r2w_1_C_1-58
1eg0_1_L_1-56
1jgq_1_A_2-1520
4v42_1_AA_2-1520
1jgo_1_A_2-1520
1jgp_1_A_2-1520
1zc8_1_A_1-59
1mvr_1_D_1-59
4c9d_1_D_29-1
4c9d_1_C_29-1
......@@ -61,12 +61,6 @@
3ep2_1_B_1-50
3eq3_1_B_1-50
3eq4_1_B_1-50
3pgw_1_R_1-164
3pgw_1_N_1-164
3cw1_1_x_1-138
3cw1_1_w_1-138
3cw1_1_V_1-138
3cw1_1_v_1-138
2iy3_1_B_9-105
3jcr_1_N_1-106
2vaz_1_A_64-177
......@@ -78,6 +72,12 @@
4v5z_1_BY_2-113
4v5z_1_BZ_1-70
4v5z_1_B1_2-123
3pgw_1_R_1-164
3pgw_1_N_1-164
3cw1_1_x_1-138
3cw1_1_w_1-138
3cw1_1_V_1-138
3cw1_1_v_1-138
1mvr_1_B_1-96
4adx_1_0_1-2923
3eq4_1_Y_1-69
......@@ -295,7 +295,12 @@
6ucq_1_2Y
4w2e_1_X
6ucq_1_2X
7n1p_1_DT
7n2u_1_DT
6yss_1_W
7n30_1_DT
7n31_1_DT
7n2c_1_DT
5afi_1_Y
5uq8_1_Z
5wdt_1_Y
......@@ -333,6 +338,22 @@
4v4j_1_X
4v4i_1_X
4v42_1_BB
4jrc_1_B
4jrc_1_A
6lkq_1_S
5h5u_1_H
7d6z_1_F
5lze_1_Y
5lze_1_V
5lze_1_X
3jcj_1_G
6o7k_1_G
3dg2_1_A
3dg0_1_A
4v48_1_BA
4v47_1_BA
3dg4_1_A
3dg5_1_A
6d30_1_C
6j7z_1_C
3er9_1_D
......@@ -437,25 +458,22 @@
6doc_1_B
6doe_1_B
6n6g_1_D
6lkq_1_S
5h5u_1_H
7d6z_1_F
5lze_1_Y
5lze_1_V
5lze_1_X
3jcj_1_G
6o7k_1_G
3dg2_1_A
3dg0_1_A
4v48_1_BA
4v47_1_BA
3dg4_1_A
3dg5_1_A
4b3r_1_W
4b3t_1_W
4b3s_1_W
7b5k_1_X
5o2r_1_X
5kcs_1_1X
7n1p_1_PT
7n2u_1_PT
7n30_1_PT
7n31_1_PT
7n2c_1_PT
6yl5_1_I
6yl5_1_E
6yl5_1_A
6yl5_1_K
6yl5_1_G
6zvk_1_E2
6zvk_1_H2
7a01_1_E2
......@@ -526,6 +544,7 @@
6w6l_1_V
6olf_1_V
3erc_1_G
4qjd_1_D
6of1_1_1W
6cae_1_1Y
6o97_1_1W
......@@ -557,7 +576,9 @@
4v48_1_A6
2z9q_1_A
4hot_1_X
5ns4_1_C
6d2z_1_C
7eh0_1_I
4tu0_1_F
4tu0_1_G
6r9o_1_B
......@@ -578,20 +599,23 @@
6sv4_1_NC
6i7o_1_NB
1ml5_1_A
7nsq_1_V
6swa_1_Q
6swa_1_R
3j6x_1_IR
3j6y_1_IR
6ole_1_T
6om0_1_T
6oli_1_T
6om7_1_T
6olf_1_T
6w6l_1_T
6tnu_1_M
5mc6_1_M
7nrc_1_SM
6tb3_1_N
7b7d_1_SM
7b7d_1_SN
6tnu_1_N
7nrc_1_SN
7nrd_1_SN
6zot_1_C
2uxb_1_X
......@@ -602,6 +626,9 @@
1eg0_1_M
3eq4_1_D
5o1y_1_B
4kzy_1_I
4kzz_1_I
4kzx_1_I
3jcr_1_H
6dzi_1_H
5zeu_1_A
......@@ -705,7 +732,6 @@
6ip6_1_ZZ
6uu3_1_333
6uu1_1_333
1pn8_1_D
3er8_1_H
3er8_1_G
3er8_1_F
......@@ -744,9 +770,8 @@
4wtl_1_T
4wtl_1_P
1xnq_1_W
1x18_1_C
1x18_1_B
1x18_1_D
7n2v_1_DT
4peh_1_Z
1vq6_1_4
4am3_1_D
4am3_1_H
......@@ -758,12 +783,45 @@
4wtj_1_T
4wtj_1_P
4xbf_1_D
5w1h_1_B
6n6d_1_D
6n6k_1_C
6n6k_1_D
3rtj_1_D
6ty9_1_M
6tz1_1_N
6q1h_1_D
6q1h_1_H
6p7p_1_F
6p7p_1_E
6p7p_1_D
6vm6_1_J
6vm6_1_G
6wan_1_K
6wan_1_H
6wan_1_G
6wan_1_L
6wan_1_I
6ywo_1_F
6wan_1_J
4oau_1_A
6ywo_1_E
6ywo_1_K
6vm6_1_I
6vm6_1_H
6ywo_1_I
2a1r_1_C
6m6v_1_F
6m6v_1_E
2a1r_1_D
3gpq_1_E
3gpq_1_F
6o79_1_C
6vm6_1_K
6m6v_1_G
6hyu_1_D
1laj_1_R
6ybv_1_K
6sce_1_B
6xl1_1_C
6scf_1_I
......@@ -809,11 +867,12 @@
1y1y_1_P
5zuu_1_I
5zuu_1_G
7am2_1_R1
4peh_1_W
4peh_1_V
4peh_1_X
4peh_1_Y
4peh_1_Z
7d8c_1_C
6mkn_1_W
7kl3_1_B
4cxg_1_C
......@@ -826,14 +885,7 @@
4eya_1_F
4eya_1_Q
4eya_1_R
1qzc_1_B
1t1o_1_B
1mvr_1_C
1t1m_1_B
1t1o_1_C
1t1m_1_A
1t1o_1_A
2r1g_1_B
4ht9_1_E
6z1p_1_AB
6z1p_1_AA
......@@ -844,11 +896,9 @@
5uk4_1_W
5uk4_1_U
5f6c_1_E
7nwh_1_HH
4rcj_1_B
1xnr_1_W
2agn_1_A
2agn_1_C
2agn_1_B
6e0o_1_C
6o75_1_D
6o75_1_C
......@@ -866,8 +916,7 @@
1ibm_1_Z
4dr5_1_V
4d61_1_J
1trj_1_B
1trj_1_C
7nwg_1_Q3
5tbw_1_SR
6hhq_1_SR
6zvi_1_H
......@@ -883,6 +932,8 @@
5k8h_1_A
5z4a_1_B
3jbu_1_V
4ts2_1_Y
4ts0_1_Y
1h2c_1_R
1h2d_1_S
1h2d_1_R
......@@ -909,6 +960,7 @@
6ppn_1_I
5flx_1_Z
6eri_1_AX
7k5l_1_R
7d80_1_Y
1zc8_1_A
1zc8_1_C
......@@ -916,6 +968,7 @@
1zc8_1_G
1zc8_1_I
1zc8_1_H
6bfb_1_Y
1zc8_1_J
7du2_1_R
4v8z_1_CX
......@@ -951,6 +1004,8 @@
4x9e_1_H
6z1p_1_BB
6z1p_1_BA
3p22_1_C
3p22_1_G
2uxd_1_X
6ywe_1_BB
3ol9_1_D
......@@ -973,8 +1028,6 @@
3ol7_1_H
3ol8_1_L
3ol8_1_P
1qzc_1_C
1qzc_1_A
6yrq_1_E
6yrq_1_H
6yrq_1_G
......@@ -1054,6 +1107,7 @@
3iy9_1_A
4wtk_1_T
4wtk_1_P
6wlj_3_A
1vqn_1_4
4oav_1_C
4oav_1_A
......@@ -1070,18 +1124,13 @@
3eq3_1_B
3eq4_1_B
4i67_1_B
3pgw_1_R
3pgw_1_N
3cw1_1_X
3cw1_1_W
3cw1_1_V
7b0y_1_A
4jf2_1_A
6k32_1_T
6k32_1_P
5mmj_1_A
5x8r_1_A
2agn_1_E
2agn_1_D
3fu2_1_B
3fu2_1_A
4v5z_1_BD
6yw5_1_AA
6ywe_1_AA
......@@ -1117,6 +1166,17 @@
3p6y_1_Q
3p6y_1_W
5dto_1_B
6yml_1_A
6ymm_1_A
6ymi_1_M
6ymi_1_F
6ymi_1_A
6ylb_1_F
6ymi_1_C
6ymj_1_C
6ylb_1_C
6ymj_1_I
6ymj_1_O
4cxh_1_X
1uvj_1_F
1uvj_1_D
......@@ -1153,6 +1213,12 @@
4v4f_1_B4
4v4f_1_A6
4v4f_1_B2
7m4y_1_V
7m4x_1_V
6v3a_1_V
6v39_1_V
6ck5_1_A
6ck5_1_B
5it9_1_I
7jqc_1_I
5zsb_1_C
......@@ -1162,6 +1228,8 @@
1cwp_1_D
3jcr_1_N
6gfw_1_R
3j6x_1_IR
3j6y_1_IR
2vaz_1_A
6zm6_1_X
6zm5_1_X
......@@ -1177,11 +1245,11 @@
5uh6_1_I
6l74_1_I
5uh9_1_I
4v5z_1_BS
2ftc_1_R
7a5j_1_X
6sag_1_R
4udv_1_R
2r1g_1_E
5zsc_1_D
5zsc_1_C
6woy_1_I
......@@ -1209,7 +1277,7 @@
3m85_1_X
3m85_1_Z
3m85_1_Y
1e8s_1_C
5u34_1_B
5wnp_1_B
5wnv_1_B
5yts_1_B
......@@ -1232,8 +1300,11 @@
6ij2_1_E
3u2e_1_D
3u2e_1_C
7eh1_1_I
5uef_1_C
5uef_1_D
7eh2_1_R
7eh2_1_I
4x4u_1_H
4afy_1_D
6oy5_1_I
......@@ -1244,13 +1315,15 @@
6s0m_1_C
6ymw_1_C
7a5g_1_J
1m5k_1_B
1m5o_1_E
1m5v_1_B
6gx6_1_B
4k4s_1_D
4k4s_1_H
4k4t_1_H
4k4t_1_D
1zn1_1_C
1zn0_1_C
1xpu_1_G
1xpu_1_L
1xpr_1_L
......@@ -1274,7 +1347,9 @@
6gc5_1_F
6gc5_1_H
6gc5_1_G
4rne_1_C
1n1h_1_B
7n2v_1_PT
4ohz_1_B
6t83_1_6B
4gv6_1_C
......@@ -1290,6 +1365,9 @@
4v5z_1_BC
5y88_1_X
4v5z_1_BB
5y85_1_D
5y85_1_B
5y87_1_D
3j0o_1_H
3j0l_1_H
3j0p_1_H
......@@ -1351,11 +1429,11 @@
4e6b_1_A
4e6b_1_B
6a6l_1_D
4v5z_1_BS
4v8t_1_1
1uvi_1_D
1uvi_1_F
1uvi_1_E
3gs5_1_A
4m7d_1_P
4k4u_1_D
4k4u_1_H
......@@ -1376,8 +1454,8 @@
6ip5_1_2M
6ip6_1_2M
6qcs_1_M
7b5k_1_Z
486d_1_G
2r1g_1_C
486d_1_F
4v5z_1_B0
4nia_1_O
......@@ -1391,11 +1469,11 @@
4oq9_1_F
4oq9_1_L
6r9q_1_B
7m4u_1_A
6v3a_1_SN1
6v3b_1_SN1
6v39_1_SN1
6v3e_1_SN1
1pn7_1_C
1mj1_1_Q
1mj1_1_R
4dr6_1_V
......@@ -1437,14 +1515,25 @@
6ow3_1_I
6ovy_1_I
6oy6_1_I
4bbl_1_Y
4bbl_1_Z
4qvd_1_H
5gxi_1_B
3iy8_1_A
6tnu_1_M
5mc6_1_M
7n06_1_G
7n06_1_H
7n06_1_I
7n06_1_J
7n06_1_K
7n06_1_L
7n33_1_G
7n33_1_H
7n33_1_I
7n33_1_J
7n33_1_K
7n33_1_L
5mc6_1_N
2qwy_1_C
2qwy_1_A
2qwy_1_B
4eya_1_O
4eya_1_P
4eya_1_C
......@@ -1453,8 +1542,6 @@
6htq_1_W
6htq_1_U
6uu6_1_333
6v3a_1_V
6v39_1_V
5a0v_1_F
3avt_1_T
6d1v_1_C
......@@ -1497,6 +1584,7 @@
6o78_1_E
6xa1_1_BV
6ha8_1_X
3bnp_1_B
1m8w_1_E
1m8w_1_F
5udi_1_B
......@@ -1520,16 +1608,29 @@
6een_1_H
4wti_1_T
4wti_1_P
6dlr_1_A
6dlt_1_A
6dls_1_A
6dlq_1_A
6dnr_1_A
5l3p_1_Y
4hor_1_X
3rzo_1_R
5wlh_1_B
2f4v_1_Z
5ml7_1_B
1qln_1_R
3pgw_1_R
3pgw_1_N
3cw1_1_X
3cw1_1_W
3cw1_1_V
7b0y_1_A
6ogy_1_M
6ogy_1_N
6uej_1_B
7kga_1_A
6ywy_1_BB
1x18_1_A
5ytx_1_B
4g0a_1_H
6r9p_1_B
......@@ -1572,12 +1673,8 @@
5mre_1_AA
5mrf_1_AA
7jhy_1_Z
2r1g_1_A
2r1g_1_D
2r1g_1_F
3eq4_1_Y
4wkr_1_C
2r1g_1_X
4v99_1_EC
4v99_1_AC
4v99_1_BH
......@@ -1641,44 +1738,21 @@
6rcl_1_C
5jju_1_C
4ejt_1_G
1et4_1_A
1et4_1_C
1et4_1_B
1et4_1_D
1et4_1_E
1ddy_1_C
1ddy_1_A
1ddy_1_E
6lkq_1_W
6r47_1_A
3qsu_1_P
3qsu_1_R
2xs7_1_B
1n38_1_B
4qvc_1_G
6q1h_1_D
6q1h_1_H
6p7p_1_F
6p7p_1_E
6p7p_1_D
6vm6_1_J
6vm6_1_G
6wan_1_K
6wan_1_H
6wan_1_G
6wan_1_L
6wan_1_I
6ywo_1_F
6wan_1_J
4oau_1_A
6ywo_1_E
6ywo_1_K
6vm6_1_I
6vm6_1_H
6ywo_1_I
2a1r_1_C
6m6v_1_F
6m6v_1_E
2a1r_1_D
3gpq_1_E
3gpq_1_F
6o79_1_C
6vm6_1_K
6m6v_1_G
6hyu_1_D
1laj_1_R
6ybv_1_K
6mpf_1_W
6spc_1_A
6spe_1_A
......@@ -1687,14 +1761,12 @@
6fti_1_V
6ftj_1_V
6ftg_1_V
3npn_1_A
4g0a_1_G
4g0a_1_F
4g0a_1_E
2b2d_1_S
5hkc_1_C
4kzy_1_I
4kzz_1_I
4kzx_1_I
1rmv_1_B
4qu7_1_X
4qu7_1_V
......@@ -1710,25 +1782,3 @@
6pmi_1_3
6pmj_1_3
5hjz_1_C
7nrc_1_SM
7nrc_1_SN
7am2_1_R1
7k5l_1_R
7b5k_1_X
7d8c_1_C
7m4y_1_V
7m4x_1_V
7b5k_1_Z
7m4u_1_A
7n06_1_G
7n06_1_H
7n06_1_I
7n06_1_J
7n06_1_K
7n06_1_L
7n33_1_G
7n33_1_H
7n33_1_I
7n33_1_J
7n33_1_K
7n33_1_L
......