clean scripts and additional information in README

Victoria BOURGEAIS
Commit 2abe471bc11ff08386058768476acdc7d4efedd7 2abe471b 1 parent c983c9bd
Showing 6 changed files with 49 additions and 56 deletions
.gitignore
README.md
requirements.txt
GraphGONet.py → scripts/GraphGONet.py
base_model.py → scripts/base_model.py
utils.py → scripts/utils.py
--- a/.gitignore 0 → 100644
+++ b/.gitignore 0 → 100644
+ files/
+ .ipynb_checkpoints/
+ scripts/__pycache__/
--- a/README.md
+++ b/README.md
@@ -10,56 +10,44 @@ GraphGONet is a self-explaining neural network integrating the Gene Ontology int
 
 ## Get started
 
- The code is implemented in Python using the [PyTorch](https://pytorch.org/) framework v1.7.1 (see [requirements.txt](https://forge.ibisc.univ-evry.fr/vbourgeais/GraphGONet/blob/master/requirements.txt) for more details)
+ The code is implemented in Python (3.6.7) using the [PyTorch](https://pytorch.org/) framework v1.7.1 (see [requirements.txt](https://forge.ibisc.univ-evry.fr/vbourgeais/GraphGONet/blob/master/requirements.txt) for more details about the additional packages used).
 
- ### Dataset
+ ## Dataset
 
- The full microarray dataset can be downloaded on ArrayExpress database under the id [E-MTAB-3732](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3732/). Here, you can find the pre-processed training and test sets:
- 
- [training set](https://entrepot.ibisc.univ-evry.fr/f/5b57ab5a69de4f6ab26b/?dl=1)
- 
- [test set](https://entrepot.ibisc.univ-evry.fr/f/057f1ffa0e6c4aab9bee/?dl=1) 
- 
- <!-- Additional files for NN architecture: [filesforNNarch](https://entrepot.ibisc.univ-evry.fr/f/6f1c513798df41999b5d/?dl=1) -->
+ The full microarray dataset can be downloaded from the ArrayExpress database under the ID [E-MTAB-3732](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3732/).
 
 The TCGA dataset can be downloaded from the [GDC portal](https://portal.gdc.cancer.gov/).
- <!--
- Here, you can find the pre-processed training and test sets:
- 
- [training set](https://entrepot.ibisc.univ-evry.fr/f/5b57ab5a69de4f6ab26b/?dl=1)
- 
- [test set](https://entrepot.ibisc.univ-evry.fr/f/057f1ffa0e6c4aab9bee/?dl=1) 
 
- Additional files for NN architecture: [filesforNNarch](https://entrepot.ibisc.univ-evry.fr/f/6f1c513798df41999b5d/?dl=1)
- --> 
+ The pre-processed training, validation, and test sets, together with the additional files needed to build the NN architecture and test the network, are available here: https://entrepot.ibisc.univ-evry.fr/d/d4764174275347f09862/
 
- ### Usage
+ ## Usage
 
+ Example on the TCGA dataset:
 <!--
 There are 3 functions (flag *processing*): one dedicated to training the model (*train*), another to evaluating the model on the test set (*evaluate*), and the last to predicting the outcomes of the samples from the test set (*predict*).
 -->
 
- #### 1) Train
+ ### Train
 
- On the microarray dataset:
+ <!-- On the microarray dataset:
 ```bash
- python GraphGONet.py --n_inputs=36834 --n_nodes=10663 --n_nodes_annotated=8249 --n_classes=1 --mask="top" --selection_ratio=0.01 --n_epochs=50 --es --patience=5 --class_weight 
+ python3 GraphGONet.py --save --n_inputs=36834 --n_nodes=10663 --n_nodes_annotated=8249 --n_classes=1 --mask="top" --selection_ratio=0.001 --n_epochs=50 --es --patience=5 --class_weight 
 ```
+ -->
 
- On TCGA dataset:
 ```bash
- python GraphGONet.py --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --mask="top" --selection_ratio=0.01 --n_epochs=50 --es --patience=5 --class_weight 
+ python3 GraphGONet.py --save --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --mask="top" --selection_ratio=0.001 --n_epochs=50 --es --patience=5 --class_weight 
 ```
 
 <!--
- #### 2) Evaluate
+ ### 2) Evaluate
 
 
 ```bash
 python DeepGONet.py --type_training="LGO" --alpha=1e-2 --EPOCHS=600 --is_training=False --restore=True --processing="evaluate"
 ```
 
- #### 3) Predict
+ ### 3) Predict
 
 
 ```bash
@@ -70,33 +58,35 @@ python DeepGONet.py --type_training="LGO" --alpha=1e-2 --EPOCHS=600 --is_trainin
 The outcomes are saved into a numpy array.
 -->
 
- #### Help
+ ### Comparison with random selection
 
- All the details about the command line flags can be provided by the following command:
+ ```bash
+ python GraphGONet.py --save --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --mask="random" --selection_ratio=0.001 --n_epochs=50 --es --patience=5 --class_weight 
+ ```
 
+ ### Comparison with no selection
 
 ```bash
- python GraphGONet.py --help
+ python GraphGONet.py --save --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --n_epochs=50 --es --patience=5 --class_weight 
 ```
 
- For most of the flags, the default values can be employed. *log_dir* and *save_dir* can be modified to your own repositories. Only the flags in the command lines displayed have to be adjusted to achieve the desired objective.
- 
- ### Comparison with random selection
+ ### Train the model with a small number of training samples
 
- On the microarray dataset:
 ```bash
- python GraphGONet.py --n_inputs=36834 --n_nodes=10663 --n_nodes_annotated=8249 --n_classes=1 --mask="random" --selection_ratio=0.01 --n_epochs=50 --es --patience=5 --class_weight 
+ python GraphGONet.py --save --n_samples=50 --n_inputs=18427 --n_nodes=10636 --n_nodes_annotated=8288 --n_classes=12 --mask="top" --selection_ratio=0.001 --n_epochs=50 --es --patience=5 --class_weight 
 ```
 
- ### Comparison with no selection
+ ### Help
+ 
+ All the details about the command-line flags can be displayed with the following command:
 
- On the microarray dataset:
 ```bash
- python GraphGONet.py --n_inputs=36834 --n_nodes=10663 --n_nodes_annotated=8249 --n_classes=1 --n_epochs=50 --es --patience=5 --class_weight 
+ python GraphGONet.py --help
 ```
 
- <!--
+ For most of the flags, the default values can be used. *dir_data*, *dir_files*, and *dir_log* can be set to your own directories. Only the flags shown in the command lines above have to be adjusted to reproduce the results from the paper. If you have enough GPU memory, you can switch to the entire GO graph (argument *type_graph* set to "entire"). The graph can be reconstructed by following the notebooks Build_GONet_graph_part{1,2,3}.ipynb located in the notebooks directory. You should then change the values of the arguments *n_nodes* and *n_nodes_annotated* in the command line accordingly.
+ 
+ 
 ###  Interpretation tool
 
- Please see the notebook entitled *Interpretation_tool.ipynb* to perform the biological interpretation of the results.
- -->
\ No newline at end of file
+ Please see the notebook entitled *Interpretation_tool.ipynb* (located in the notebooks directory) to perform the biological interpretation of the results.
\ No newline at end of file
--- a/requirements.txt
+++ b/requirements.txt
- Python==3.6.7
 captum==0.3.1
+ goatools==1.0.15
 jupyterlab==3.0.16
 matplotlib==3.3.4
 numpy==1.19.5
 networkx==2.5
 obonet==0.2.6
 pandas==1.1.5
+ rpy2==3.4.3
 scikit-learn==0.24.2
 seaborn==0.11.1
 sklearn-pandas==2.2.0
--- a/GraphGONet.py → scripts/GraphGONet.py
+++ b/GraphGONet.py → scripts/GraphGONet.py
@@ -5,11 +5,10 @@ import torch.nn as nn
 import torch_geometric
 import networkx as nx
 from torchvision import transforms
- from base_model import Net, DAGConv
+ from base_model import Net
 import torch.nn.functional as F
 from captum.attr import LayerGradientXActivation
 
- 
 import matplotlib.pyplot as plt
 import seaborn
 import numpy as np
@@ -64,9 +63,9 @@ def train(args):
 	print("Processing the GO layers...")
 	start = time.time()
 
- 	connection_matrix = pd.read_csv(os.path.abspath(os.path.join(args.dir_files,"matrix_truncated.csv")),index_col=0)
- 	graph = nx.read_gpickle(os.path.join(args.dir_files,"gobp-truncated")) #read the GO graph wich will be converted into the hidden layers of the network
- 	graph = from_networkx(graph, dim_inital_node_embedding=args.dim_init_emb,label=args.n_classes)
+ 	connection_matrix = pd.read_csv(os.path.abspath(os.path.join(args.dir_files,"matrix_connection_{}.csv".format(args.type_graph))),index_col=0)
+ 	graph = nx.read_gpickle(os.path.join(args.dir_files,"gobp-{}-converted".format(args.type_graph))) #read the GO graph which will be converted into the hidden layers of the network
+ 	graph = from_networkx(graph, dim_inital_node_embedding=args.dim_init,label=args.n_classes)
 
 	n_samples = trainset.X.shape[0]
 	data_list = [graph.clone() for i in np.arange(n_samples)] #the same network architecture is used across the patients from the same dataset
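
The two files read in this hunk now depend on `--type_graph`. As a minimal sketch of how the paths are resolved, using only the format strings visible in the diff (the helper name `graph_file_paths` is hypothetical and not part of the repository):

```python
import os

def graph_file_paths(dir_files, type_graph="truncated"):
    """Return (connection matrix path, pickled graph path) for a graph type.

    Hypothetical helper: only the file name patterns come from the diff.
    """
    # The connection matrix path is made absolute, as in train().
    matrix = os.path.abspath(
        os.path.join(dir_files, "matrix_connection_{}.csv".format(type_graph))
    )
    # The pickled GO graph path is left relative, as in train().
    graph = os.path.join(dir_files, "gobp-{}-converted".format(type_graph))
    return matrix, graph

matrix_path, graph_path = graph_file_paths("files", "entire")
print(os.path.basename(matrix_path))
print(graph_path)
```

With the default `type_graph="truncated"`, the expected files would be `matrix_connection_truncated.csv` and `gobp-truncated-converted`, so files saved under the previous names (`matrix_truncated.csv`, `gobp-truncated`) would presumably need to be renamed or regenerated.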
@@ -90,13 +89,13 @@ def train(args):
 	# Launch the model
 	print("Launching the learning")
 	device = torch.device(args.device)
- 	model = Net(n_genes=args.n_inputs,n_nodes=args.n_nodes,n_nodes_annot=args.n_nodes_annotated,n_nodes_emb=args.dim_init_emb,n_classes=args.n_classes,n_prop1=args.n_prop1,adj_mat_fc1=connection_matrix.values,conv=args.conv,aggr=args.aggr,dropout_ratio=args.dropout_ratio,mask=args.mask,ratio=args.selection_ratio,readout=args.readout).to(device)
+ 	model = Net(n_genes=args.n_inputs,n_nodes=args.n_nodes,n_nodes_annot=args.n_nodes_annotated,n_nodes_emb=args.dim_init,n_classes=args.n_classes,
+                n_prop1=args.n_prop1,adj_mat_fc1=connection_matrix.values,mask=args.mask,ratio=args.selection_ratio).to(device)
 	print(model)
 	print("(model mem allocation) - Memory available : {:.2e}".format(torch.cuda.memory_reserved(0)-torch.cuda.memory_allocated(0)))
 
 	if args.optimizer=="adam":#specify the optimizer
 		optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
- 		# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate,weight_decay=weight_decay)
 	elif args.optimizer=="rmsprop":
 		optimizer = torch.optim.RMSprop(model.parameters(), lr=args.lr)
 	elif args.optimizer=="momentum":
@@ -293,20 +292,20 @@ def main():
 	# -- Configuration of the environment --
 	parser.add_argument('--dir_log', type=str, default="log", help="dir_log")
 	parser.add_argument('--dir_files', type=str, default='files', help='repository for all the files needed for the training and the evaluation')
- 	parser.add_argument('--dir_data', type=str, default='data/E-MTAB-3732', help='repository of the dataset')
+ 	parser.add_argument('--dir_data', type=str, default='data', help='repository of the dataset')
 	parser.add_argument('--file_extension', type=int, default=None, help="option to save different models with the same setting")    
 	parser.add_argument('--save', action='store_true', help="Do you need to save the model?")
 	parser.add_argument('--restore', action='store_true', help="Do you want to restore a previous model?")
 	parser.add_argument('--processing', type=str, default="train_and_evaluate", help="What to do with the model? {train,train_and_evaluate,evaluate,predict}")
 
 	# -- Architecture of the neural network --
+ 	parser.add_argument('--type_graph', type=str, default="truncated", help='type of GO graph considered (truncated,entire)')
 	parser.add_argument('--n_samples', type=int, default=None, help="number of samples to use")
 	parser.add_argument('--n_inputs', type=int, default=36834, help="number of features")
 	parser.add_argument('--n_nodes', type=int, default=10663, help="number of nodes of GO graph")
 	parser.add_argument('--n_nodes_annotated', type=int, default=8249, help="number of nodes annotated with the genes")
- 	parser.add_argument('--dim_init_emb', type=int, default=1, help="initial dimension of the nodes embedding")
- 	parser.add_argument('--n_prop1', type=int, default=1, help="number of neurons in the GO layers")
- 	parser.add_argument('--n_layers', type=int, default=1, help="number of layers")
+ 	parser.add_argument('--dim_init', type=int, default=1, help="initial dimension")
+ 	parser.add_argument('--n_prop1', type=int, default=1, help="dimension after propagation")
 	parser.add_argument('--n_classes', type=int, default=1, help="number of classes")
 
 	# -- Learning and Hyperparameters --
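
The new `--type_graph` flag accepts any string, so a typo would only surface later as a missing-file error. One possible hardening, which is not part of this commit, would be to restrict the values with argparse's `choices`. A minimal sketch containing only that flag (`build_parser` is a hypothetical name):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # Flag name, default and help text are taken from the diff;
    # `choices` is an assumption added for illustration.
    parser.add_argument('--type_graph', type=str, default="truncated",
                        choices=("truncated", "entire"),
                        help='type of GO graph considered (truncated,entire)')
    return parser

args = build_parser().parse_args(["--type_graph=entire"])
print(args.type_graph)
```

With this variant, an unsupported value such as `--type_graph=full` makes argparse exit with a usage error at parse time.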
@@ -331,9 +330,9 @@ def main():
 		os.mkdir(args.dir_log)
     
 	if args.mask:
- 		args.dir_save=os.path.join(args.dir_log,'GraphGONet_POOL={}_SELECTRATIO={}'.format(args.mask,args.selection_ratio))
+ 		args.dir_save=os.path.join(args.dir_log,'GraphGONet_MASK={}_SELECTRATIO={}'.format(args.mask,args.selection_ratio))
 	else:
- 		args.dir_save=os.path.join(args.dir_log,'GraphGONet_POOL={}'.format(args.mask))
+ 		args.dir_save=os.path.join(args.dir_log,'GraphGONet_MASK={}'.format(args.mask))
 
 	if args.n_samples:
 		args.dir_save+="_N_SAMPLES={}".format(args.n_samples)
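
The save directory name is derived from the flags, and this hunk renames its `POOL` prefix to `MASK`. A minimal sketch of the naming logic as it appears in the diff, wrapped in a hypothetical function `build_dir_save` and assuming an unset `--mask` is `None`:

```python
import os

def build_dir_save(dir_log, mask=None, selection_ratio=None, n_samples=None):
    """Rebuild the save directory name from the flags (hypothetical wrapper)."""
    if mask:
        # A mask is in use: record both the mask type and the selection ratio.
        name = 'GraphGONet_MASK={}_SELECTRATIO={}'.format(mask, selection_ratio)
    else:
        # No selection: only the (empty) mask value is recorded.
        name = 'GraphGONet_MASK={}'.format(mask)
    dir_save = os.path.join(dir_log, name)
    if n_samples:
        # Training on a subset appends the number of samples.
        dir_save += "_N_SAMPLES={}".format(n_samples)
    return dir_save

print(build_dir_save("log", mask="top", selection_ratio=0.001, n_samples=50))
```

Since the name changes with this commit, models saved earlier under a `GraphGONet_POOL=...` directory would presumably not be found by `--restore` without renaming that directory.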
--- a/base_model.py → scripts/base_model.py
+++ b/base_model.py → scripts/base_model.py
--- a/utils.py → scripts/utils.py
+++ b/utils.py → scripts/utils.py
@@ -20,7 +20,7 @@ def torch_clear_gpu_mem():
     gc.collect()
     torch.cuda.empty_cache()
 
- #Dataset
+ #DatasetLoader based on the tutorial "Creating Your Own Datasets" - pytorch-geometric 
 class GeneExpressionDataset(Dataset):
     """Gene expression dataset."""
 
@@ -52,7 +52,6 @@ class GeneExpressionDataset(Dataset):
             sss = StratifiedShuffleSplit(n_splits=1,train_size=n_samples,test_size=self.X.shape[0]-n_samples,random_state=42) #keeping the proportion of the original classes
             for train_index, test_index in sss.split(self.X , self.y):
                 self.X, self.y = self.X[train_index,:], self.y[train_index]
-             #self.X = self.X[np.random.randint(low=0, high=self.X.shape[0], size=n_samples),:]
         
         if class_weights:
             self.class_weight = torch.tensor(class_weight.compute_class_weight('balanced',
@@ -78,7 +77,8 @@ class ToTensor(object):
 
     def __call__(self, data):
         return torch.from_numpy(data)
- 
+     
+ #inspired by the function with the same name from pytorch-geometric
 def from_networkx(G,label,dim_inital_node_embedding=1,random=False):
     r"""Converts a :obj:`networkx.Graph` or :obj:`networkx.DiGraph` to a
     :class:`torch_geometric.data.Data` instance.