Louis BECQUEY
Committed by GitHub

Updated dependencies in Readme

Showing 1 changed file with 13 additions and 10 deletions
 # RNANet
 Building a dataset following the ProteinNet philosophy, but for RNA.

-In the early versions, we only use the Rfam mappings between 3D structures and known Rfam families, using the sequences that are known to belong to an Rfam family (hits provided in RF0XXXX.fasta files from Rfam).
+We use the Rfam mappings between 3D structures and known Rfam families, using the sequences that are known to belong to an Rfam family (hits provided in RF0XXXX.fasta files from Rfam).

-Future versions might compute a real MSA-based clusering directly with Rfamseq ncRNA sequences, like ProteinNet does with protein sequences.
+Future versions might compute a real MSA-based clusering directly with Rfamseq ncRNA sequences, like ProteinNet does with protein sequences, but this requires a tool similar to jackHMMER in the Infernal software suite, which is not available yet.

 This script prepares the dataset from available public data in PDB and Rfam.
-It requires solid hardware to run. (Tested on a server with 24 cores and 80 GB of RAM, which is just enough.)
+It requires solid hardware to run. (Tested on a server with 24 cores and 48GB of RAM.)

 # Dependencies
-You need to install Infernal and X3DNA + DSSR before running this.
+You need to install Infernal, DSSR, and SINA before running this.
 I moved to python3.8.1. Unfortunately, python3.6 is no longer supported, because of changes in the multiprocessing and Threading packages. Untested with Python 3.7.*.

-Packages numpy, pandas, gzip, requests, psutil, biopython, and sqlalchemy are required.
-`python3.8 -m pip install numpy pandas pymysql requests psutil biopython sqlalchemy`
+Packages numpy, pandas, matplotlib, requests, psutil, biopython, and sqlalchemy are required.
+`python3.8 -m pip install numpy pandas matplotlib pymysql requests psutil biopython sqlalchemy tqdm`

-Before use, please set the two variables `path_to_3D_data` and `path_to_seq_data` (between lines 20 and 30 of RNAnet.py) to two folders where you want to store RNA 3D structures and RNA sequences. A few gigabytes will be produced.
+Before use, please set the two variables `path_to_3D_data` and `path_to_seq_data` (around line 30 of RNAnet.py) to two folders where you want to store RNA 3D structures and RNA sequences. A few gigabytes will be produced.

 # What it does
 The script follows these steps:
@@ -28,16 +28,19 @@ Now, compute the features:

 * Extract the sequence for every 3D chain
 * Downloads Rfamseq ncRNA sequence hits for the concerned Rfam families
-* Realigns Rfamseq hits and sequences from the 3D structures together to obtain a multiple sequence alignment for each Rfam family (using cmalign)
+* Realigns Rfamseq hits and sequences from the 3D structures together to obtain a multiple sequence alignment for each Rfam family (using cmalign, except for ribosomal LSU and SSU, where SINA is used)
 * Computes nucleotide frequencies at every position for each alignment
 * For each aligned 3D chain, get the nucleotide frequencies in the corresponding RNA family for each residue

 Then, compute the labels:

-* Run DSSR `analyze -t` on every chain to get eta' and theta' pseudotorsions
+* Run DSSR on every chain to get eta' and theta' pseudotorsions
 * This also permits to identify missing residues and compute a mask for every chain.

-Finally, store this data into tensorflow-2.0-ready files.
+Finally, store this data into files.
+
+# Dataset quality
+The file statistics.py is supposed to give a summary on the produced dataset. See the results/ folder.

 # Contact
 louis.becquey@univ-evry.fr
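
The README step "Computes nucleotide frequencies at every position for each alignment" can be illustrated with a minimal sketch. This is not the code from RNAnet.py: the function name `column_frequencies`, the gap symbols, and the toy alignment are assumptions made for illustration only. It takes a list of equal-length aligned sequences and returns, for each column, the fraction of A, C, G, U and gaps.

```python
from collections import Counter

ALPHABET = "ACGU"  # assumed alphabet; ambiguous symbols such as N are not reported


def column_frequencies(alignment):
    """Return one dict per alignment column with the relative frequency
    of A, C, G, U and of gaps ('-' or '.')."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    frequencies = []
    for column in zip(*alignment):
        # Normalise case and treat DNA-style T as U.
        counts = Counter(c.upper().replace("T", "U") for c in column)
        total = len(column)  # every sequence counts, including gaps and ambiguous symbols
        row = {nt: counts.get(nt, 0) / total for nt in ALPHABET}
        row["-"] = (counts.get("-", 0) + counts.get(".", 0)) / total
        frequencies.append(row)
    return frequencies


# Toy alignment (hypothetical data, not taken from Rfam).
msa = ["ACG-U", "ACGAU", "GCG-U", "AC-AU"]
freqs = column_frequencies(msa)
print(freqs[0])
```

For an aligned 3D chain, the per-residue features would then be the rows of `freqs` at the columns where that chain has a nucleotide rather than a gap.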