Louis BECQUEY

Merge branch 'master' of https://github.com/persalteas/RNANet

@@ -27,7 +27,7 @@ Contents:
 # What it does
 The script follows these steps:
 * Gets a list of 3D structures containing RNA from BGSU's non-redundant list (but keeps the redundant structures /!\\),
-* Asks Rfam for mappings of these structures onto Rfam families (~ a half of structures have a direct mapping, some more are inferred using the redundancy list)
+* Asks Rfam for mappings of these structures onto Rfam families (~50% of structures have a direct mapping, some more are inferred using the redundancy list)
 * Downloads the corresponding 3D structures (mmCIFs)
 * If desired, extracts the right chain portions that map onto an Rfam family
 
@@ -35,7 +35,7 @@ Now, compute the features:
 
 * Extract the sequence for every 3D chain
 * Downloads Rfamseq ncRNA sequence hits for the concerned Rfam families
-* Realigns Rfamseq hits and sequences from the 3D structures together to obtain a multiple sequence alignment for each Rfam family (using cmalign, except for ribosomal LSU and SSU, where SINA is used)
+* Realigns Rfamseq hits and sequences from the 3D structures together to obtain a multiple sequence alignment for each Rfam family (using `cmalign --cyk`, except for ribosomal LSU and SSU, where SINA is used)
 * Computes nucleotide frequencies at every position for each alignment
 * For each aligned 3D chain, get the nucleotide frequencies in the corresponding RNA family for each residue
 
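The last two steps of this hunk (per-position nucleotide frequencies, then a lookup for each residue of an aligned 3D chain) can be sketched as follows. This is a minimal illustration, not RNANet's actual code: the function names `column_frequencies` and `chain_frequencies` are hypothetical, and it assumes a plain `ACGU` alphabet in which gaps (`-`, `.`) and any other symbols are simply ignored.

```python
from collections import Counter

ALPHABET = "ACGU"  # assumption: only these four symbols are counted


def column_frequencies(alignment):
    """Return one {base: relative frequency} dict per alignment column."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    frequencies = []
    for column in zip(*alignment):
        # Upper-case first so lower-case residues are counted too.
        counts = Counter(b.upper() for b in column if b.upper() in ALPHABET)
        total = sum(counts.values())
        frequencies.append(
            {b: (counts[b] / total if total else 0.0) for b in ALPHABET}
        )
    return frequencies


def chain_frequencies(aligned_chain, frequencies):
    """Keep the frequency rows of columns where the chain has a residue, not a gap."""
    return [freq for base, freq in zip(aligned_chain, frequencies) if base not in "-."]
```

For example, `chain_frequencies(aligned_chain, column_frequencies(alignment))` yields one frequency dict per residue of the 3D chain, in chain order.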
@@ -49,12 +49,10 @@ Finally, export this data from the SQLite database into flat CSV files.
 
 * `results/RNANet.db` is a SQLite database file containing several tables with all the information, which you can query yourself with your custom requests,
 * `3D-folder-you-passed-in-option/datapoints/*` are flat text CSV files, one for one RNA chain mapped to one RNA family, gathering the per-position nucleotide descriptors,
-* `results/RNANET_datapoints_latest.tar.gz` is a compressed archive of the above CSV files (only if you passed the --archive option)
-* `path-to-3D-folder-you-passed-in-option/rna_mapped_to_Rfam` If you used the --extract option, this folder contains one mmCIF file per RNA chain mapped to one RNA family, without other chains, proteins (nor ions and ligands by default)
-* `results/summary_latest.csv` summarizes information about the RNA chains
-* `results/families_latest.csv` summarizes information about the RNA families
-
-If you launch successive executions of RNANet, the previous tar.gz archive and the two summary CSV files are stored in the `results/archive/` folder.
+* `archive/RNANET_datapoints_{DATE}.tar.gz` is a compressed archive of the above CSV files (only if you passed the --archive option)
+* `path-to-3D-folder-you-passed-in-option/rna_mapped_to_Rfam` If you used the `--extract` option, this folder contains one mmCIF file per RNA chain mapped to one RNA family, without other chains, proteins (nor ions and ligands by default). If you used both `--extract` and `--no-homology`, this folder is called `rnaonly`.
+* `results/summary.csv` summarizes information about the RNA chains
+* `results/families.csv` summarizes information about the RNA families
 
 Other folders are created and not deleted, which you might want to conserve to avoid re-computations in later runs:
 
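Since this hunk says `results/RNANet.db` can be queried with custom requests, here is a small sketch of a first exploratory query using Python's standard `sqlite3` module. The diff does not show the database schema, so the sketch assumes nothing about table names: it discovers them from `sqlite_master` and counts their rows. The helper name `table_row_counts` is hypothetical.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Return {table name: row count} for every user table in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Names come from the database itself; quote them as identifiers.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```

Calling `table_row_counts("results/RNANet.db")` after a run would list the tables that are actually present, which is a reasonable starting point before writing more specific queries.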
@@ -63,7 +61,8 @@ Other folders are created and not deleted, which you might want to conserve to a
 * `path-to-3D-folder-you-passed-in-option/RNAcifs/` contains mmCIF structures directly downloaded from the PDB, which contain RNA chains,
 * `path-to-3D-folder-you-passed-in-option/annotations/` contains the raw JSON annotation files of the previous mmCIF structures. You may find additional information into them which is not properly supported by RNANet yet.
 
-# How to run (on Linux x86-64 only)
+# How to run
+RNANet is available on Linux (x86-64) only. It could theoretically work on Mac using command line installation (*untested*).
 
 ## Required computational resources
 - CPU: no requirements. The program is optimized for multi-core CPUs, you might want to use Intel Xeons, AMD Ryzens, etc.
......