Victoria BOURGEAIS

clean scripts and additional information in README

files/
.ipynb_checkpoints/
scripts/__pycache__/
......@@ -10,56 +10,44 @@ GraphGONet is a self-explaining neural network integrating the Gene Ontology int
## Get started
The code is implemented in Python using the [PyTorch](https://pytorch.org/) framework v1.7.1 (see [requirements.txt](https://forge.ibisc.univ-evry.fr/vbourgeais/GraphGONet/blob/master/requirements.txt) for more details)
The code is implemented in Python (3.6.7) using the [PyTorch](https://pytorch.org/) framework v1.7.1 (see [requirements.txt](https://forge.ibisc.univ-evry.fr/vbourgeais/GraphGONet/blob/master/requirements.txt) for more details about the additional packages used).
### Dataset
## Dataset
The full microarray dataset can be downloaded from the ArrayExpress database under the ID [E-MTAB-3732](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3732/). The pre-processed training and test sets are available here:
[training set](https://entrepot.ibisc.univ-evry.fr/f/5b57ab5a69de4f6ab26b/?dl=1)
[test set](https://entrepot.ibisc.univ-evry.fr/f/057f1ffa0e6c4aab9bee/?dl=1)
<!-- Additional files for NN architecture: [filesforNNarch](https://entrepot.ibisc.univ-evry.fr/f/6f1c513798df41999b5d/?dl=1) -->
The full microarray dataset can be downloaded from the ArrayExpress database under the ID [E-MTAB-3732](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3732/).
The TCGA dataset can be downloaded from the [GDC portal](https://portal.gdc.cancer.gov/).
<!--
Here, you can find the pre-processed training and test sets:
[training set](https://entrepot.ibisc.univ-evry.fr/f/5b57ab5a69de4f6ab26b/?dl=1)
[test set](https://entrepot.ibisc.univ-evry.fr/f/057f1ffa0e6c4aab9bee/?dl=1)
Additional files for NN architecture: [filesforNNarch](https://entrepot.ibisc.univ-evry.fr/f/6f1c513798df41999b5d/?dl=1)
-->
The pre-processed training, validation, and test sets, together with the additional files needed to build the NN architecture and test the network, are available here: https://entrepot.ibisc.univ-evry.fr/d/d4764174275347f09862/
### Usage
## Usage
Example on the TCGA dataset:
<!--
There are 3 functions (flag *processing*): one is dedicated to training the model (*train*), another to evaluating the model on the test set (*evaluate*), and the last one to predicting the outcomes of the samples from the test set (*predict*).
-->
#### 1) Train
### Train
On the microarray dataset:
<!-- On the microarray dataset:
```bash
python GraphGONet.py --n_inputs=36834 --n_nodes=10663 --n_nodes_annotated=8249 --n_classes=1 --mask="top" --selection_ratio=0.01 --n_epochs=50 --es --patience=5 --class_weight
python3 GraphGONet.py --save --n_inputs=36834 --n_nodes=10663 --n_nodes_annotated=8249 --n_classes=1 --mask="top" --selection_ratio=0.001 --n_epochs=50 --es --patience=5 --class_weight
```
-->
On the TCGA dataset:
```bash
python GraphGONet.py --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --mask="top" --selection_ratio=0.01 --n_epochs=50 --es --patience=5 --class_weight
python3 GraphGONet.py --save --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --mask="top" --selection_ratio=0.001 --n_epochs=50 --es --patience=5 --class_weight
```
<!--
#### 2) Evaluate
### 2) Evaluate
```bash
python DeepGONet.py --type_training="LGO" --alpha=1e-2 --EPOCHS=600 --is_training=False --restore=True --processing="evaluate"
```
#### 3) Predict
### 3) Predict
```bash
......@@ -70,33 +58,35 @@ python DeepGONet.py --type_training="LGO" --alpha=1e-2 --EPOCHS=600 --is_trainin
The outcomes are saved into a numpy array.
-->
#### Help
### Comparison with random selection
Details about all the command-line flags can be displayed with the following command:
```bash
python GraphGONet.py --save --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --mask="random" --selection_ratio=0.001 --n_epochs=50 --es --patience=5 --class_weight
```
### Comparison with no selection
```bash
python GraphGONet.py --help
python GraphGONet.py --save --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --n_epochs=50 --es --patience=5 --class_weight
```
For most of the flags, the default values can be used. *log_dir* and *save_dir* can be set to your own directories. Only the flags shown in the command lines above have to be adjusted to achieve the desired objective.
### Comparison with random selection
### Train the model with a small number of training samples
On the microarray dataset:
```bash
python GraphGONet.py --n_inputs=36834 --n_nodes=10663 --n_nodes_annotated=8249 --n_classes=1 --mask="random" --selection_ratio=0.01 --n_epochs=50 --es --patience=5 --class_weight
python GraphGONet.py --save --n_samples=50 --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --mask="top" --selection_ratio=0.001 --n_epochs=50 --es --patience=5 --class_weight
```
### Comparison with no selection
### Help
Details about all the command-line flags can be displayed with the following command:
On the microarray dataset:
```bash
python GraphGONet.py --n_inputs=36834 --n_nodes=10663 --n_nodes_annotated=8249 --n_classes=1 --n_epochs=50 --es --patience=5 --class_weight
python GraphGONet.py --help
```
<!--
For most of the flags, the default values can be used. *dir_data*, *dir_files*, and *dir_log* can be set to your own directories. Only the flags shown in the command lines above have to be adjusted to reproduce the results from the paper. If you have enough GPU memory, you can switch to the entire GO graph (argument *type_graph* set to "entire"). The graph can be reconstructed by following the notebooks Build_GONet_graph_part{1,2,3}.ipynb located in the notebooks directory. You should then change the values of the arguments *n_nodes* and *n_nodes_annotated* in the command line accordingly.
### Interpretation tool
Please see the notebook entitled *Interpretation_tool.ipynb* to perform the biological interpretation of the results.
-->
\ No newline at end of file
Please see the notebook entitled *Interpretation_tool.ipynb* (located in the notebooks directory) to perform the biological interpretation of the results.
\ No newline at end of file
......
Python==3.6.7
captum==0.3.1
goatools==1.0.15
jupyterlab==3.0.16
matplotlib==3.3.4
numpy==1.19.5
networkx==2.5
obonet==0.2.6
pandas==1.1.5
rpy2==3.4.3
scikit-learn==0.24.2
seaborn==0.11.1
sklearn-pandas==2.2.0
......
......@@ -5,11 +5,10 @@ import torch.nn as nn
import torch_geometric
import networkx as nx
from torchvision import transforms
from base_model import Net, DAGConv
from base_model import Net
import torch.nn.functional as F
from captum.attr import LayerGradientXActivation
import matplotlib.pyplot as plt
import seaborn
import numpy as np
......@@ -64,9 +63,9 @@ def train(args):
print("Processing the GO layers...")
start = time.time()
connection_matrix = pd.read_csv(os.path.abspath(os.path.join(args.dir_files,"matrix_truncated.csv")),index_col=0)
graph = nx.read_gpickle(os.path.join(args.dir_files,"gobp-truncated")) #read the GO graph which will be converted into the hidden layers of the network
graph = from_networkx(graph, dim_inital_node_embedding=args.dim_init_emb,label=args.n_classes)
connection_matrix = pd.read_csv(os.path.abspath(os.path.join(args.dir_files,"matrix_connection_{}.csv".format(args.type_graph))),index_col=0)
graph = nx.read_gpickle(os.path.join(args.dir_files,"gobp-{}-converted".format(args.type_graph))) #read the GO graph which will be converted into the hidden layers of the network
graph = from_networkx(graph, dim_inital_node_embedding=args.dim_init,label=args.n_classes)
n_samples = trainset.X.shape[0]
data_list = [graph.clone() for i in np.arange(n_samples)] #the same network architecture is used across the patients from the same dataset
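Review note on this hunk: the two input file names are now derived from `--type_graph` instead of being hard-coded. The sketch below shows how the paths resolve, assuming only the two format strings visible in the hunk. The helper name `graph_file_paths` is hypothetical and is not part of the repository.

```python
import os


def graph_file_paths(dir_files, type_graph):
    """Return (connection matrix path, pickled graph path) for a graph type.

    Hypothetical helper: it only mirrors the two format strings in the hunk.
    """
    matrix_path = os.path.abspath(
        os.path.join(dir_files, "matrix_connection_{}.csv".format(type_graph))
    )
    graph_path = os.path.join(dir_files, "gobp-{}-converted".format(type_graph))
    return matrix_path, graph_path


# "files" and "truncated" are the argparse defaults of --dir_files and --type_graph.
matrix_path, graph_path = graph_file_paths("files", "truncated")
print(matrix_path)
print(graph_path)
```

With this naming scheme, switching to `--type_graph=entire` presumably requires files named `matrix_connection_entire.csv` and `gobp-entire-converted` to exist in `dir_files`.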
......@@ -90,13 +89,13 @@ def train(args):
# Launch the model
print("Launching the learning")
device = torch.device(args.device)
model = Net(n_genes=args.n_inputs,n_nodes=args.n_nodes,n_nodes_annot=args.n_nodes_annotated,n_nodes_emb=args.dim_init_emb,n_classes=args.n_classes,n_prop1=args.n_prop1,adj_mat_fc1=connection_matrix.values,conv=args.conv,aggr=args.aggr,dropout_ratio=args.dropout_ratio,mask=args.mask,ratio=args.selection_ratio,readout=args.readout).to(device)
model = Net(n_genes=args.n_inputs,n_nodes=args.n_nodes,n_nodes_annot=args.n_nodes_annotated,n_nodes_emb=args.dim_init,n_classes=args.n_classes,
n_prop1=args.n_prop1,adj_mat_fc1=connection_matrix.values,mask=args.mask,ratio=args.selection_ratio).to(device)
print(model)
print("(model mem allocation) - Memory available : {:.2e}".format(torch.cuda.memory_reserved(0)-torch.cuda.memory_allocated(0)))
if args.optimizer=="adam":#specify the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate,weight_decay=weight_decay)
elif args.optimizer=="rmsprop":
optimizer = torch.optim.RMSprop(model.parameters(), lr=args.lr)
elif args.optimizer=="momentum":
......@@ -293,20 +292,20 @@ def main():
# -- Configuration of the environment --
parser.add_argument('--dir_log', type=str, default="log", help="dir_log")
parser.add_argument('--dir_files', type=str, default='files', help='repository for all the files needed for the training and the evaluation')
parser.add_argument('--dir_data', type=str, default='data/E-MTAB-3732', help='repository of the dataset')
parser.add_argument('--dir_data', type=str, default='data', help='repository of the dataset')
parser.add_argument('--file_extension', type=int, default=None, help="option to save different models with the same setting")
parser.add_argument('--save', action='store_true', help="Do you need to save the model?")
parser.add_argument('--restore', action='store_true', help="Do you want to restore a previous model?")
parser.add_argument('--processing', type=str, default="train_and_evaluate", help="What to do with the model? {train,train_and_evaluate,evaluate,predict}")
# -- Architecture of the neural network --
parser.add_argument('--type_graph', type=str, default="truncated", help='type of GO graph considered (truncated,entire)')
parser.add_argument('--n_samples', type=int, default=None, help="number of samples to use")
parser.add_argument('--n_inputs', type=int, default=36834, help="number of features")
parser.add_argument('--n_nodes', type=int, default=10663, help="number of nodes of GO graph")
parser.add_argument('--n_nodes_annotated', type=int, default=8249, help="number of nodes annotated with the genes")
parser.add_argument('--dim_init_emb', type=int, default=1, help="initial dimension of the nodes embedding")
parser.add_argument('--n_prop1', type=int, default=1, help="number of neurons in the GO layers")
parser.add_argument('--n_layers', type=int, default=1, help="number of layers")
parser.add_argument('--dim_init', type=int, default=1, help="initial dimension")
parser.add_argument('--n_prop1', type=int, default=1, help="dimension after propagation")
parser.add_argument('--n_classes', type=int, default=1, help="number of classes")
# -- Learning and Hyperparameters --
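Review note on this hunk: the renamed and added architecture flags can be exercised in isolation. The sketch below rebuilds only a subset of the parser, copying flag names, types, and defaults from the new lines of the hunk. The function `build_parser` is a hypothetical wrapper, and the argument list mirrors part of the README's TCGA command rather than the full set of flags.

```python
import argparse


def build_parser():
    """Subset of the GraphGONet flags, copied from the hunk (hypothetical wrapper)."""
    parser = argparse.ArgumentParser(description="Subset of the GraphGONet flags")
    parser.add_argument('--save', action='store_true', help="Do you need to save the model?")
    parser.add_argument('--type_graph', type=str, default="truncated", help='type of GO graph considered (truncated,entire)')
    parser.add_argument('--n_samples', type=int, default=None, help="number of samples to use")
    parser.add_argument('--n_inputs', type=int, default=36834, help="number of features")
    parser.add_argument('--n_nodes', type=int, default=10663, help="number of nodes of GO graph")
    parser.add_argument('--n_nodes_annotated', type=int, default=8249, help="number of nodes annotated with the genes")
    parser.add_argument('--dim_init', type=int, default=1, help="initial dimension")
    parser.add_argument('--n_prop1', type=int, default=1, help="dimension after propagation")
    parser.add_argument('--n_classes', type=int, default=1, help="number of classes")
    return parser


# Architecture-related part of the TCGA training command from the README.
args = build_parser().parse_args([
    "--save",
    "--n_inputs=18427",
    "--n_nodes=10636",
    "--n_nodes_annotated=8288",
    "--n_classes=12",
])
print(vars(args))
```

Note that the defaults still correspond to the microarray dataset, so the TCGA values have to be passed explicitly, as the README commands do.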
......@@ -331,9 +330,9 @@ def main():
os.mkdir(args.dir_log)
if args.mask:
args.dir_save=os.path.join(args.dir_log,'GraphGONet_POOL={}_SELECTRATIO={}'.format(args.mask,args.selection_ratio))
args.dir_save=os.path.join(args.dir_log,'GraphGONet_MASK={}_SELECTRATIO={}'.format(args.mask,args.selection_ratio))
else:
args.dir_save=os.path.join(args.dir_log,'GraphGONet_POOL={}'.format(args.mask))
args.dir_save=os.path.join(args.dir_log,'GraphGONet_MASK={}'.format(args.mask))
if args.n_samples:
args.dir_save+="_N_SAMPLES={}".format(args.n_samples)
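Review note on this hunk: the output directory name now uses `MASK` instead of `POOL`. The sketch below reproduces the naming logic as a standalone function, assuming the lines shown here are all of it. The function name `run_dir_name` is hypothetical; in the script the result is stored in `args.dir_save`.

```python
import os


def run_dir_name(dir_log, mask=None, selection_ratio=None, n_samples=None):
    """Compose the save directory as in the hunk (hypothetical standalone version)."""
    if mask:
        name = 'GraphGONet_MASK={}_SELECTRATIO={}'.format(mask, selection_ratio)
    else:
        # Without a mask, the selection ratio is not part of the name.
        name = 'GraphGONet_MASK={}'.format(mask)
    dir_save = os.path.join(dir_log, name)
    if n_samples:
        dir_save += "_N_SAMPLES={}".format(n_samples)
    return dir_save


print(run_dir_name("log", mask="top", selection_ratio=0.001))
print(run_dir_name("log", mask="top", selection_ratio=0.001, n_samples=50))
print(run_dir_name("log"))
```

Because the prefix changed, models saved under the earlier `POOL=` directories would not be found under the new names unless the directories are renamed.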
......
......@@ -20,7 +20,7 @@ def torch_clear_gpu_mem():
gc.collect()
torch.cuda.empty_cache()
#Dataset
#DatasetLoader based on the tutorial "Creating Your Own Datasets" - pytorch-geometric
class GeneExpressionDataset(Dataset):
"""Gene expression dataset."""
......@@ -52,7 +52,6 @@ class GeneExpressionDataset(Dataset):
sss = StratifiedShuffleSplit(n_splits=1,train_size=n_samples,test_size=self.X.shape[0]-n_samples,random_state=42) #keeping the proportion of the original classes
for train_index, test_index in sss.split(self.X , self.y):
self.X, self.y = self.X[train_index,:], self.y[train_index]
#self.X = self.X[np.random.randint(low=0, high=self.X.shape[0], size=n_samples),:]
if class_weights:
self.class_weight = torch.tensor(class_weight.compute_class_weight('balanced',
......@@ -78,7 +77,8 @@ class ToTensor(object):
def __call__(self, data):
return torch.from_numpy(data)
#inspired by the function with the same name from pytorch-geometric
def from_networkx(G,label,dim_inital_node_embedding=1,random=False):
r"""Converts a :obj:`networkx.Graph` or :obj:`networkx.DiGraph` to a
:class:`torch_geometric.data.Data` instance.
......