Louis BECQUEY

Revision 1 for Bioinformatics completed

@@ -12,4 +12,5 @@ esl*
 
 # environment stuff
 .vscode/
-*.pyc
\ No newline at end of file
+*.pyc
+__pycache__/
\ No newline at end of file
......
@@ -94,6 +94,8 @@ The detailed list of options is below:
 -h [ --help ] Print this help message
 --version Print the program version
 
+-f [ --full-inference ] Infer new 3D->family mappings even if Rfam already provides some. Yields more copies of chains
+ mapped to different families.
 -r 4.0 [ --resolution=4.0 ] Maximum 3D structure resolution to consider a RNA chain.
 -s Run statistics computations after completion
 --extract Extract the portions of 3D RNA chains to individual mmCIF files.
@@ -105,7 +107,7 @@ The detailed list of options is below:
 RNAcifs/ Full structures containing RNA, in mmCIF format
 rna_mapped_to_Rfam/ Extracted 'pure' RNA chains
 datapoints/ Final results in CSV file format.
---seq-folder=… Path to a folder to store the sequence and alignment files.
+--seq-folder=… Path to a folder to store the sequence and alignment files. Subfolders will be:
 rfam_sequences/fasta/ Compressed hits to Rfam families
 realigned/ Sequences, covariance models, and alignments by family
 --no-homology Do not try to compute PSSMs and do not align sequences.
@@ -117,11 +119,12 @@ The detailed list of options is below:
 --update-homologous Re-download Rfam and SILVA databases, realign all families, and recompute all CSV files
 --from-scratch Delete database, local 3D and sequence files, and known issues, and recompute.
 --archive Create a tar.gz archive of the datapoints text files, and update the link to the latest archive
+--no-logs Do not save per-chain logs of the numbering modifications
 ```
 
 Typical usage:
 ```
-nohup bash -c 'time ~/Projects/RNANet/RNAnet.py --3d-folder ~/Data/RNA/3D/ --seq-folder ~/Data/RNA/sequences -s --archive' &
+nohup bash -c 'time ~/Projects/RNANet/RNAnet.py --3d-folder ~/Data/RNA/3D/ --seq-folder ~/Data/RNA/sequences -s' &
 ```
 
 ## Post-computation task: estimate quality
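Note on the README hunks above: the two options added to the help text, `-f` / `--full-inference` and `--no-logs`, are plain boolean switches. The sketch below shows one way such switches could be parsed. It assumes Python's standard `argparse` module; the helper name `parse_flags` and the resulting attribute names are hypothetical and are not taken from `RNAnet.py`, whose actual parsing code is not visible in this diff.

```python
import argparse

def parse_flags(argv):
    """Parse a few of the options listed in the README (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="RNAnet.py")
    # Boolean switch: defaults to False, becomes True when the flag is present.
    parser.add_argument("-f", "--full-inference", action="store_true",
                        help="Infer new 3D->family mappings even if Rfam already provides some")
    # Boolean switch without a short form.
    parser.add_argument("--no-logs", action="store_true",
                        help="Do not save per-chain logs of the numbering modifications")
    # Option with a value, using the default shown in the README.
    parser.add_argument("-r", "--resolution", type=float, default=4.0,
                        help="Maximum 3D structure resolution to consider a RNA chain")
    return parser.parse_args(argv)

args = parse_flags(["-f", "--no-logs"])
# argparse turns "--full-inference" into the attribute name "full_inference".
print(args.full_inference, args.no_logs, args.resolution)
```

With both switches given and no `-r`, the three values are `True`, `True` and the default `4.0`.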
......
@@ -11,7 +11,7 @@
 # - Use a specialised database (SILVA) : better alignments (we guess?), but two kind of jobs
 # - Use cmalign --small everywhere (homogeneity)
 # Moreover, --small requires --nonbanded --cyk, which means the output alignement is the optimally scored one.
-# To date, we trust Infernal as the best tool to realign RNA. Is it ?
+# To date, we trust Infernal as the best tool to realign ncRNA. Is it ?
 
 # Contact: louis.becquey@univ-evry.fr (PhD student), fariza.tahi@univ-evry.fr (PI)
 
@@ -28,7 +28,7 @@ pd.set_option('display.max_rows', None)
 LSU_set = ["RF00002", "RF02540", "RF02541", "RF02543", "RF02546"] # From Rfam CLAN 00112
 SSU_set = ["RF00177", "RF02542", "RF02545", "RF01959", "RF01960"] # From Rfam CLAN 00111
 
-with sqlite3.connect("results/RNANet.db") as conn:
+with sqlite3.connect(os.getcwd()+"/results/RNANet.db") as conn:
     df = pd.read_sql("SELECT rfam_acc, max_len, nb_total_homol, comput_time, comput_peak_mem FROM family;", conn)
 
 to_remove = [ f for f in df.rfam_acc if f in LSU_set+SSU_set ]
@@ -74,7 +74,7 @@ ax.set_ylabel("Maximum length of sequences ")
 ax.set_zlabel("Computation time (s)")
 
 plt.subplots_adjust(wspace=0.4)
-plt.savefig("results/cmalign_jobs_performance.png")
+plt.savefig(os.getcwd()+"/results/cmalign_jobs_performance.png")
 
 # # ========================================================
 # # Linear Regression of max_mem as function of max_length
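Note on the two path changes in this file: prefixing with `os.getcwd()` makes the path absolute, but it still resolves against the directory the script is launched from, exactly as the relative path did. The sketch below builds the same path with `os.path.join` and optionally accepts another base directory. The helper `results_path` and its `base_dir` parameter are hypothetical, not part of the repository, and the demo runs in a temporary directory so that no real `results/` folder is touched.

```python
import os
import sqlite3
import tempfile

def results_path(filename, base_dir=None):
    """Return <base_dir>/results/<filename> (hypothetical helper).

    base_dir defaults to the current working directory, which matches
    the os.getcwd()+"/results/..." expressions in the diff.
    """
    if base_dir is None:
        base_dir = os.getcwd()
    return os.path.join(base_dir, "results", filename)

# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "results"))
    db_path = results_path("RNANet.db", base_dir=tmp)
    conn = sqlite3.connect(db_path)  # creates the file if it does not exist
    conn.execute("CREATE TABLE family (rfam_acc TEXT)")
    conn.commit()
    conn.close()  # closed explicitly, see the note below
    print(os.path.isabs(db_path), os.path.exists(db_path))
```

Separately from the path question, `with sqlite3.connect(...) as conn:` only commits or rolls back the transaction on exit and does not close the connection, which is why the sketch calls `conn.close()` itself.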
......