@@ -0,0 +1,65 @@ | |||
Node labels: [symbol] | |||
Node attributes: [chem, charge, x, y] | |||
Edge labels: [valence] | |||
Node labels were converted to integer values using this map: | |||
Component 0: | |||
0 C | |||
1 O | |||
2 N | |||
3 Cl | |||
4 F | |||
5 S | |||
6 Se | |||
7 P | |||
8 Na | |||
9 I | |||
10 Co | |||
11 Br | |||
12 Li | |||
13 Si | |||
14 Mg | |||
15 Cu | |||
16 As | |||
17 B | |||
18 Pt | |||
19 Ru | |||
20 K | |||
21 Pd | |||
22 Au | |||
23 Te | |||
24 W | |||
25 Rh | |||
26 Zn | |||
27 Bi | |||
28 Pb | |||
29 Ge | |||
30 Sb | |||
31 Sn | |||
32 Ga | |||
33 Hg | |||
34 Ho | |||
35 Tl | |||
36 Ni | |||
37 Tb | |||
Edge labels were converted to integer values using this map: | |||
Component 0: | |||
0 1 | |||
1 2 | |||
2 3 | |||
Class labels were converted to integer values using this map: | |||
0 a | |||
1 i | |||
@@ -0,0 +1,75 @@ | |||
README for dataset DD | |||
=== Usage === | |||
This folder contains the following comma separated text files | |||
(replace DS by the name of the dataset): | |||
n = total number of nodes | |||
m = total number of edges | |||
N = number of graphs | |||
(1) DS_A.txt (m lines) | |||
sparse (block diagonal) adjacency matrix for all graphs, | |||
each line corresponds to an entry (row, col) of the matrix, i.e. a (node_id, node_id) pair
(2) DS_graph_indicator.txt (n lines) | |||
column vector of graph identifiers for all nodes of all graphs, | |||
the value in the i-th line is the graph_id of the node with node_id i | |||
(3) DS_graph_labels.txt (N lines) | |||
class labels for all graphs in the dataset, | |||
the value in the i-th line is the class label of the graph with graph_id i | |||
(4) DS_node_labels.txt (n lines) | |||
column vector of node labels, | |||
the value in the i-th line corresponds to the node with node_id i | |||
There are OPTIONAL files if the respective information is available: | |||
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
(6) DS_edge_attributes.txt (m lines; same size as DS_A.txt) | |||
attributes for the edges in DS_A.txt | |||
(7) DS_node_attributes.txt (n lines) | |||
matrix of node attributes, | |||
the comma separated values in the i-th line are the attribute vector of the node with node_id i
(8) DS_graph_attributes.txt (N lines) | |||
regression values for all graphs in the dataset, | |||
the value in the i-th line is the attribute of the graph with graph_id i | |||
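For orientation, the required files (1)-(4) above can be assembled into networkx graphs roughly as follows. This is only a minimal loading sketch, not part of the dataset tooling (this repository itself uses pygraph.utils.graphfiles.loadDataset); load_tu_dataset and its prefix argument are hypothetical:
import numpy as np
import networkx as nx
def load_tu_dataset(prefix):
    # prefix is e.g. 'DD/DD'; node ids in the files are 1-based.
    edges = np.loadtxt(prefix + '_A.txt', delimiter=',', dtype=int)
    graph_ind = np.loadtxt(prefix + '_graph_indicator.txt', dtype=int)
    node_labels = np.loadtxt(prefix + '_node_labels.txt', dtype=int)
    graph_labels = np.loadtxt(prefix + '_graph_labels.txt', dtype=int)
    graphs = [nx.Graph() for _ in range(int(graph_ind.max()))]
    for node_id, gid in enumerate(graph_ind, start=1):
        graphs[gid - 1].add_node(node_id, label=int(node_labels[node_id - 1]))
    for row, col in edges:
        # symmetric (row, col) / (col, row) entries collapse to one undirected edge.
        graphs[graph_ind[row - 1] - 1].add_edge(int(row), int(col))
    return graphs, graph_labels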
=== Description === | |||
D&D is a dataset of 1178 protein structures (Dobson and Doig, 2003). Each protein is | |||
represented by a graph, in which the nodes are amino acids and two nodes are connected | |||
by an edge if they are less than 6 Angstroms apart. The prediction task is to classify | |||
the protein structures into enzymes and non-enzymes. | |||
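To make the construction concrete, a contact graph of this kind could be built from residue coordinates roughly as follows; this is an illustrative sketch only (coords and residues are hypothetical inputs, and the exact distance definition used by Dobson and Doig is not restated here):
import numpy as np
import networkx as nx
def contact_graph(coords, residues, cutoff=6.0):
    # coords: (n, 3) array of residue positions; residues: n amino-acid names.
    g = nx.Graph()
    for i, name in enumerate(residues):
        g.add_node(i, residue=name)
    # pairwise Euclidean distances; connect residues closer than the cutoff.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    for i, j in zip(*np.nonzero(np.triu(dist < cutoff, k=1))):
        g.add_edge(int(i), int(j))
    return g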
=== Previous Use of the Dataset === | |||
Neumann, M., Garnett, R., Bauckhage, C., Kersting, K.: Propagation Kernels: Efficient Graph
Kernels from Propagated Information. Machine Learning 102(2), 209-245 (2016).
Neumann, M., Patricia, N., Garnett, R., Kersting, K.: Efficient Graph Kernels by
Randomization. In: P.A. Flach, T. De Bie, N. Cristianini (eds.) ECML/PKDD, Lecture Notes in
Computer Science, vol. 7523, pp. 378-393. Springer (2012).
Shervashidze, N., Schweitzer, P., van Leeuwen, E., Mehlhorn, K., Borgwardt, K.: | |||
Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12, 2539-2561 (2011) | |||
=== References === | |||
P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without | |||
alignments. J. Mol. Biol., 330(4):771–783, Jul 2003. | |||
@@ -0,0 +1,70 @@ | |||
README for dataset NCI1 | |||
=== Usage === | |||
This folder contains the following comma separated text files | |||
(replace DS by the name of the dataset): | |||
n = total number of nodes | |||
m = total number of edges | |||
N = number of graphs | |||
(1) DS_A.txt (m lines) | |||
sparse (block diagonal) adjacency matrix for all graphs, | |||
each line corresponds to an entry (row, col) of the matrix, i.e. a (node_id, node_id) pair
(2) DS_graph_indicator.txt (n lines) | |||
column vector of graph identifiers for all nodes of all graphs, | |||
the value in the i-th line is the graph_id of the node with node_id i | |||
(3) DS_graph_labels.txt (N lines) | |||
class labels for all graphs in the dataset, | |||
the value in the i-th line is the class label of the graph with graph_id i | |||
(4) DS_node_labels.txt (n lines) | |||
column vector of node labels, | |||
the value in the i-th line corresponds to the node with node_id i | |||
There are OPTIONAL files if the respective information is available: | |||
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
(6) DS_edge_attributes.txt (m lines; same size as DS_A.txt) | |||
attributes for the edges in DS_A.txt | |||
(7) DS_node_attributes.txt (n lines) | |||
matrix of node attributes, | |||
the comma separated values in the i-th line are the attribute vector of the node with node_id i
(8) DS_graph_attributes.txt (N lines) | |||
regression values for all graphs in the dataset, | |||
the value in the i-th line is the attribute of the graph with graph_id i | |||
=== Description === | |||
NCI1 and NCI109 represent two balanced subsets of datasets of chemical compounds screened | |||
for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively
(Wale and Karypis (2006) and http://pubchem.ncbi.nlm.nih.gov). | |||
=== Previous Use of the Dataset === | |||
Neumann, M., Garnett, R., Bauckhage, C., Kersting, K.: Propagation Kernels: Efficient Graph
Kernels from Propagated Information. Machine Learning 102(2), 209-245 (2016).
Neumann, M., Patricia, N., Garnett, R., Kersting, K.: Efficient Graph Kernels by
Randomization. In: P.A. Flach, T. De Bie, N. Cristianini (eds.) ECML/PKDD, Lecture Notes in
Computer Science, vol. 7523, pp. 378-393. Springer (2012).
Shervashidze, N., Schweitzer, P., van Leeuwen, E., Mehlhorn, K., Borgwardt, K.: | |||
Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12, 2539-2561 (2011) | |||
=== References === | |||
N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and | |||
classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006. | |||
@@ -0,0 +1,70 @@ | |||
README for dataset NCI109 | |||
=== Usage === | |||
This folder contains the following comma separated text files | |||
(replace DS by the name of the dataset): | |||
n = total number of nodes | |||
m = total number of edges | |||
N = number of graphs | |||
(1) DS_A.txt (m lines) | |||
sparse (block diagonal) adjacency matrix for all graphs, | |||
each line corresponds to an entry (row, col) of the matrix, i.e. a (node_id, node_id) pair
(2) DS_graph_indicator.txt (n lines) | |||
column vector of graph identifiers for all nodes of all graphs, | |||
the value in the i-th line is the graph_id of the node with node_id i | |||
(3) DS_graph_labels.txt (N lines) | |||
class labels for all graphs in the dataset, | |||
the value in the i-th line is the class label of the graph with graph_id i | |||
(4) DS_node_labels.txt (n lines) | |||
column vector of node labels, | |||
the value in the i-th line corresponds to the node with node_id i | |||
There are OPTIONAL files if the respective information is available: | |||
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
(6) DS_edge_attributes.txt (m lines; same size as DS_A.txt) | |||
attributes for the edges in DS_A.txt | |||
(7) DS_node_attributes.txt (n lines) | |||
matrix of node attributes, | |||
the comma separated values in the i-th line are the attribute vector of the node with node_id i
(8) DS_graph_attributes.txt (N lines) | |||
regression values for all graphs in the dataset, | |||
the value in the i-th line is the attribute of the graph with graph_id i | |||
=== Description === | |||
NCI1 and NCI109 represent two balanced subsets of datasets of chemical compounds screened | |||
for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively
(Wale and Karypis (2006) and http://pubchem.ncbi.nlm.nih.gov). | |||
=== Previous Use of the Dataset === | |||
Neumann, M., Garnett, R., Bauckhage, C., Kersting, K.: Propagation Kernels: Efficient Graph
Kernels from Propagated Information. Machine Learning 102(2), 209-245 (2016).
Neumann, M., Patricia, N., Garnett, R., Kersting, K.: Efficient Graph Kernels by
Randomization. In: P.A. Flach, T. De Bie, N. Cristianini (eds.) ECML/PKDD, Lecture Notes in
Computer Science, vol. 7523, pp. 378-393. Springer (2012).
Shervashidze, N., Schweitzer, P., van Leeuwen, E., Mehlhorn, K., Borgwardt, K.: | |||
Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12, 2539-2561 (2011) | |||
=== References === | |||
N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and | |||
classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006. | |||
@@ -12,21 +12,21 @@ import multiprocessing | |||
from pygraph.kernels.commonWalkKernel import commonwalkkernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
@@ -12,22 +12,22 @@ import multiprocessing | |||
from pygraph.kernels.marginalizedKernel import marginalizedkernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
# {'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -17,22 +17,23 @@ import numpy as np | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -8,14 +8,14 @@ from pygraph.utils.kernels import deltakernel, gaussiankernel, kernelproduct | |||
# datasets | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
@@ -14,22 +14,22 @@ from pygraph.kernels.structuralspKernel import structuralspkernel | |||
from pygraph.utils.kernels import deltakernel, gaussiankernel, kernelproduct | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -14,22 +14,22 @@ from pygraph.kernels.treeletKernel import treeletkernel | |||
from pygraph.utils.kernels import gaussiankernel, polynomialkernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -12,21 +12,21 @@ import multiprocessing | |||
from pygraph.kernels.untilHPathKernel import untilhpathkernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
# {'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
@@ -14,22 +14,22 @@ from pygraph.kernels.weisfeilerLehmanKernel import weisfeilerlehmankernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -277,7 +277,8 @@ def gk_iam_nearest(Gn, alpha, idx_gi, Kmatrix, k, r_max): | |||
# return dhat, ghat_list | |||
def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, gkernel): | |||
def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, | |||
gkernel, c_ei=1, c_er=1, c_es=1, epsilon=0.001): | |||
"""This function constructs graph pre-image by the iterative pre-image | |||
framework in reference [1], algorithm 1, where the step of generating new | |||
graphs randomly is replaced by the IAM algorithm in reference [2]. | |||
@@ -312,37 +313,44 @@ def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, g | |||
return 0, g0hat_list | |||
dhat = dis_gs[0] # the nearest distance | |||
ghat_list = [g.copy() for g in g0hat_list] | |||
for g in ghat_list: | |||
draw_Letter_graph(g) | |||
# for g in ghat_list: | |||
# draw_Letter_graph(g) | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
Gk = [Gn_init[ig].copy() for ig in sort_idx[0:k]] # the k nearest neighbors | |||
for gi in Gk: | |||
# nx.draw_networkx(gi) | |||
# plt.show() | |||
draw_Letter_graph(g) | |||
print(gi.nodes(data=True)) | |||
print(gi.edges(data=True)) | |||
# for gi in Gk: | |||
## nx.draw_networkx(gi) | |||
## plt.show() | |||
# draw_Letter_graph(g) | |||
# print(gi.nodes(data=True)) | |||
# print(gi.edges(data=True)) | |||
Gs_nearest = Gk.copy() | |||
# gihat_list = [] | |||
# i = 1 | |||
r = 1 | |||
while r < r_max: | |||
print('r =', r) | |||
# found = False | |||
r = 0 | |||
itr = 0 | |||
# cur_sod = dhat | |||
# old_sod = cur_sod * 2 | |||
sod_list = [dhat] | |||
found = False | |||
nb_updated = 0 | |||
while r < r_max:# and not found: # @todo: if not found?# and np.abs(old_sod - cur_sod) > epsilon: | |||
print('\nr =', r) | |||
print('itr for gk =', itr, '\n') | |||
found = False | |||
# Gs_nearest = Gk + gihat_list | |||
# g_tmp = iam(Gs_nearest) | |||
g_tmp_list = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
Gn_median, Gs_nearest, c_ei=1, c_er=1, c_es=1) | |||
for g in g_tmp_list: | |||
g_tmp_list, _ = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
Gn_median, Gs_nearest, c_ei=c_ei, c_er=c_er, c_es=c_es) | |||
# for g in g_tmp_list: | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
draw_Letter_graph(g) | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
# draw_Letter_graph(g) | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
# compute distance between phi and the new generated graphs. | |||
knew = compute_kernel(g_tmp_list + Gn_median, gkernel, False) | |||
@@ -358,6 +366,7 @@ def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, g | |||
# k_g1_list[1] + alpha[1] * alpha[1] * k_list[1]) | |||
# find the new k nearest graphs. | |||
dnew_best = min(dnew_list) | |||
dis_gs = dnew_list + dis_gs # add the new nearest distances. | |||
Gs_nearest = [g.copy() for g in g_tmp_list] + Gs_nearest # add the corresponding graphs. | |||
sort_idx = np.argsort(dis_gs) | |||
@@ -367,21 +376,34 @@ def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, g | |||
print(dis_gs[-1]) | |||
Gs_nearest = [Gs_nearest[idx] for idx in sort_idx[0:k]] | |||
nb_best = len(np.argwhere(dis_gs == dis_gs[0]).flatten().tolist()) | |||
if len([i for i in sort_idx[0:nb_best] if i < len(dnew_list)]) > 0: | |||
print('I have smaller or equal distance!') | |||
if dnew_best < dhat and np.abs(dnew_best - dhat) > epsilon: | |||
print('I have smaller distance!') | |||
print(str(dhat) + '->' + str(dis_gs[0])) | |||
dhat = dis_gs[0] | |||
idx_best_list = np.argwhere(dnew_list == dhat).flatten().tolist() | |||
ghat_list = [g_tmp_list[idx].copy() for idx in idx_best_list] | |||
for g in ghat_list: | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
draw_Letter_graph(g) | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
r = 0 | |||
else: | |||
# for g in ghat_list: | |||
## nx.draw_networkx(g) | |||
## plt.show() | |||
# draw_Letter_graph(g) | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
r = 0 | |||
found = True | |||
nb_updated += 1 | |||
elif np.abs(dnew_best - dhat) < epsilon: | |||
print('I have almost equal distance!') | |||
print(str(dhat) + '->' + str(dnew_best)) | |||
if not found: | |||
r += 1 | |||
# old_sod = cur_sod | |||
# cur_sod = dnew_best | |||
sod_list.append(dhat) | |||
itr += 1 | |||
print('\nthe graph is updated', nb_updated, 'times.') | |||
print('sods in kernel space:', sod_list, '\n') | |||
return dhat, ghat_list | |||
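# Control-flow summary of the loop above: each round asks IAM for candidate
# medians built from the current k nearest neighbors; a candidate that is
# more than epsilon closer than dhat updates ghat_list, resets r to 0 and
# increments nb_updated; a candidate within epsilon of dhat is only
# reported; r advances by 1 on rounds without an update, so the search
# stops after r_max consecutive non-improving rounds.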
@@ -9,6 +9,7 @@ Iterative alternate minimizations using GED. | |||
import numpy as np | |||
import random | |||
import networkx as nx | |||
from tqdm import tqdm | |||
import sys | |||
#from Cython_GedLib_2 import librariesImport, script | |||
@@ -181,13 +182,27 @@ def GED(g1, g2, lib='gedlib'): | |||
return dis, pi_forward, pi_backward | |||
def median_distance(Gn, Gn_median, measure='ged', verbose=False): | |||
dis_list = [] | |||
pi_forward_list = [] | |||
for idx, G in tqdm(enumerate(Gn), desc='computing median distances', | |||
file=sys.stdout) if verbose else enumerate(Gn): | |||
dis_sum = 0 | |||
pi_forward_list.append([]) | |||
for G_p in Gn_median: | |||
dis_tmp, pi_tmp_forward, pi_tmp_backward = GED(G, G_p) | |||
pi_forward_list[idx].append(pi_tmp_forward) | |||
dis_sum += dis_tmp | |||
dis_list.append(dis_sum) | |||
return dis_list, pi_forward_list | |||
# --------------------------- These are tests --------------------------------# | |||
def test_iam_with_more_graphs_as_init(Gn, G_candidate, c_ei=3, c_er=3, c_es=1, | |||
node_label='atom', edge_label='bond_type'): | |||
"""See my name, then you know what I do. | |||
""" | |||
from tqdm import tqdm | |||
# Gn = Gn[0:10] | |||
Gn = [nx.convert_node_labels_to_integers(g) for g in Gn] | |||
@@ -321,7 +336,7 @@ def test_iam_with_more_graphs_as_init(Gn, G_candidate, c_ei=3, c_er=3, c_es=1, | |||
def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
Gn_median, Gn_candidate, c_ei=3, c_er=3, c_es=1, node_label='atom', | |||
edge_label='bond_type', connected=True): | |||
edge_label='bond_type', connected=False): | |||
"""See my name, then you know what I do. | |||
""" | |||
from tqdm import tqdm | |||
@@ -330,8 +345,11 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
node_ir = np.inf # corresponding to the node remove and insertion. | |||
label_r = 'thanksdanny' # the label for node remove. # @todo: make this label unrepeatable. | |||
ds_attrs = get_dataset_attributes(Gn_median + Gn_candidate, | |||
attr_names=['edge_labeled', 'node_attr_dim'], | |||
attr_names=['edge_labeled', 'node_attr_dim', 'edge_attr_dim'], | |||
edge_label=edge_label) | |||
ite_max = 50 | |||
epsilon = 0.001 | |||
def generate_graph(G, pi_p_forward, label_set): | |||
@@ -460,13 +478,15 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
g_tmp.remove_edge(nd1, nd2) | |||
# do not change anything when equal. | |||
# find the best graph generated in this iteration and update pi_p. | |||
# # find the best graph generated in this iteration and update pi_p. | |||
# @todo: should we update all graphs generated or just the best ones? | |||
dis_list, pi_forward_list = median_distance(G_new_list, Gn_median) | |||
# @todo: should we remove the identical and connectivity check? | |||
# Don't know which is faster. | |||
G_new_list, idx_list = remove_duplicates(G_new_list) | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
if ds_attrs['node_attr_dim'] == 0 and ds_attrs['edge_attr_dim'] == 0: | |||
G_new_list, idx_list = remove_duplicates(G_new_list) | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
dis_list = [dis_list[idx] for idx in idx_list] | |||
# if connected == True: | |||
# G_new_list, idx_list = remove_disconnected(G_new_list) | |||
# pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
@@ -482,25 +502,10 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
return G_new_list, pi_forward_list | |||
return G_new_list, pi_forward_list, dis_list | |||
def median_distance(Gn, Gn_median, measure='ged', verbose=False): | |||
dis_list = [] | |||
pi_forward_list = [] | |||
for idx, G in tqdm(enumerate(Gn), desc='computing median distances', | |||
file=sys.stdout) if verbose else enumerate(Gn): | |||
dis_sum = 0 | |||
pi_forward_list.append([]) | |||
for G_p in Gn_median: | |||
dis_tmp, pi_tmp_forward, pi_tmp_backward = GED(G, G_p) | |||
pi_forward_list[idx].append(pi_tmp_forward) | |||
dis_sum += dis_tmp | |||
dis_list.append(dis_sum) | |||
return dis_list, pi_forward_list | |||
def best_median_graphs(Gn_candidate, dis_all, pi_all_forward): | |||
def best_median_graphs(Gn_candidate, pi_all_forward, dis_all): | |||
idx_min_list = np.argwhere(dis_all == np.min(dis_all)).flatten().tolist() | |||
dis_min = dis_all[idx_min_list[0]] | |||
pi_forward_min_list = [pi_all_forward[idx] for idx in idx_min_list] | |||
@@ -508,25 +513,45 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
return G_min_list, pi_forward_min_list, dis_min | |||
def iteration_proc(G, pi_p_forward): | |||
def iteration_proc(G, pi_p_forward, cur_sod): | |||
G_list = [G] | |||
pi_forward_list = [pi_p_forward] | |||
old_sod = cur_sod * 2 | |||
sod_list = [cur_sod] | |||
# iterations. | |||
for itr in range(0, 5): # @todo: the convergence condition? | |||
# print('itr is', itr) | |||
itr = 0 | |||
while itr < ite_max and np.abs(old_sod - cur_sod) > epsilon: | |||
# for itr in range(0, 5): # the convergence condition? | |||
print('itr is', itr) | |||
G_new_list = [] | |||
pi_forward_new_list = [] | |||
dis_new_list = [] | |||
for idx, G in enumerate(G_list): | |||
label_set = get_node_labels(Gn_median + [G], node_label) | |||
G_tmp_list, pi_forward_tmp_list = generate_graph( | |||
G_tmp_list, pi_forward_tmp_list, dis_tmp_list = generate_graph( | |||
G, pi_forward_list[idx], label_set) | |||
G_new_list += G_tmp_list | |||
pi_forward_new_list += pi_forward_tmp_list | |||
dis_new_list += dis_tmp_list | |||
G_list = G_new_list[:] | |||
pi_forward_list = pi_forward_new_list[:] | |||
dis_list = dis_new_list[:] | |||
old_sod = cur_sod | |||
cur_sod = np.min(dis_list) | |||
sod_list.append(cur_sod) | |||
itr += 1 | |||
G_list, idx_list = remove_duplicates(G_list) | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
# @todo: do we return all graphs or the best ones? | |||
# get the best ones of the generated graphs. | |||
G_list, pi_forward_list, dis_min = best_median_graphs( | |||
G_list, pi_forward_list, dis_list) | |||
if ds_attrs['node_attr_dim'] == 0 and ds_attrs['edge_attr_dim'] == 0: | |||
G_list, idx_list = remove_duplicates(G_list) | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
# dis_list = [dis_list[idx] for idx in idx_list] | |||
# import matplotlib.pyplot as plt | |||
# for g in G_list: | |||
@@ -535,7 +560,9 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
return G_list, pi_forward_list # do we return all graphs or the best ones? | |||
print('\nsods:', sod_list, '\n') | |||
return G_list, pi_forward_list, dis_min | |||
def remove_duplicates(Gn): | |||
@@ -570,28 +597,37 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
# phase 1: initilize. | |||
# compute set-median. | |||
dis_min = np.inf | |||
dis_all, pi_all_forward = median_distance(Gn_candidate, Gn_median) | |||
dis_list, pi_forward_all = median_distance(Gn_candidate, Gn_median) | |||
# find all smallest distances. | |||
idx_min_list = np.argwhere(dis_all == np.min(dis_all)).flatten().tolist() | |||
dis_min = dis_all[idx_min_list[0]] | |||
idx_min_list = np.argwhere(dis_list == np.min(dis_list)).flatten().tolist() | |||
dis_min = dis_list[idx_min_list[0]] | |||
# phase 2: iteration. | |||
G_list = [] | |||
for idx_min in idx_min_list[::-1]: | |||
dis_list = [] | |||
pi_forward_list = [] | |||
for idx_min in idx_min_list: | |||
# print('idx_min is', idx_min) | |||
G = Gn_candidate[idx_min].copy() | |||
# list of edit operations. | |||
pi_p_forward = pi_all_forward[idx_min] | |||
pi_p_forward = pi_forward_all[idx_min] | |||
# pi_p_backward = pi_all_backward[idx_min] | |||
Gi_list, pi_i_forward_list = iteration_proc(G, pi_p_forward) | |||
Gi_list, pi_i_forward_list, dis_i_min = iteration_proc(G, pi_p_forward, dis_min) | |||
G_list += Gi_list | |||
dis_list.append(dis_i_min) | |||
pi_forward_list += pi_i_forward_list | |||
G_list, _ = remove_duplicates(G_list) | |||
if ds_attrs['node_attr_dim'] == 0 and ds_attrs['edge_attr_dim'] == 0: | |||
G_list, idx_list = remove_duplicates(G_list) | |||
dis_list = [dis_list[idx] for idx in idx_list] | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
if connected == True: | |||
G_list_con, _ = remove_disconnected(G_list) | |||
# if there are no connected graphs at all, keep the disconnected ones.
if len(G_list_con) > 0: # @todo: ?????????????????????????? | |||
G_list = G_list_con | |||
G_list_con, idx_list = remove_disconnected(G_list) | |||
# if there are no connected graphs at all, keep the disconnected ones.
if len(G_list_con) > 0: # @todo: ?????????????????????????? | |||
G_list = G_list_con | |||
dis_list = [dis_list[idx] for idx in idx_list] | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
# import matplotlib.pyplot as plt | |||
# for g in G_list: | |||
@@ -601,15 +637,15 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
# print(g.edges(data=True)) | |||
# get the best median graphs | |||
dis_all, pi_all_forward = median_distance(G_list, Gn_median) | |||
# dis_list, pi_forward_list = median_distance(G_list, Gn_median) | |||
G_min_list, pi_forward_min_list, dis_min = best_median_graphs( | |||
G_list, dis_all, pi_all_forward) | |||
G_list, pi_forward_list, dis_list) | |||
# for g in G_min_list: | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
return G_min_list | |||
return G_min_list, dis_min | |||
if __name__ == '__main__': | |||
@@ -0,0 +1,218 @@ | |||
import sys | |||
sys.path.insert(0, "../") | |||
#import pathlib | |||
import numpy as np | |||
import networkx as nx | |||
import time | |||
#import librariesImport | |||
#import script | |||
#sys.path.insert(0, "/home/bgauzere/dev/optim-graphes/") | |||
#import pygraph | |||
from pygraph.utils.graphfiles import loadDataset | |||
def replace_graph_in_env(script, graph, old_id, label='median'): | |||
""" | |||
Replace a graph in the script environment.
If old_id is -1, add a new graph to the environment.
""" | |||
if(old_id > -1): | |||
script.PyClearGraph(old_id) | |||
new_id = script.PyAddGraph(label) | |||
for i in graph.nodes(): | |||
script.PyAddNode(new_id, str(i), graph.node[i]) # !! strings are required by gedlib
for e in graph.edges: | |||
script.PyAddEdge(new_id, str(e[0]),str(e[1]), {}) | |||
script.PyInitEnv() | |||
script.PySetMethod("IPFP", "") | |||
script.PyInitMethod() | |||
return new_id | |||
# Draw the current median
def draw_Letter_graph(graph, savepath=''): | |||
import numpy as np | |||
import networkx as nx | |||
import matplotlib.pyplot as plt | |||
plt.figure() | |||
pos = {} | |||
for n in graph.nodes: | |||
pos[n] = np.array([float(graph.node[n]['attributes'][0]), | |||
float(graph.node[n]['attributes'][1])]) | |||
nx.draw_networkx(graph, pos) | |||
if savepath != '': | |||
plt.savefig(savepath + str(time.time()) + '.eps', format='eps', dpi=300) | |||
plt.show() | |||
plt.clf() | |||
#compute new mappings | |||
def update_mappings(script,median_id,listID): | |||
med_distances = {} | |||
med_mappings = {} | |||
sod = 0 | |||
for i in range(0,len(listID)): | |||
script.PyRunMethod(median_id,listID[i]) | |||
med_distances[i] = script.PyGetUpperBound(median_id,listID[i]) | |||
med_mappings[i] = script.PyGetForwardMap(median_id,listID[i]) | |||
sod += med_distances[i] | |||
return med_distances, med_mappings, sod | |||
def calcul_Sij(all_mappings, all_graphs,i,j): | |||
s_ij = 0 | |||
for k in range(0,len(all_mappings)): | |||
cur_graph = all_graphs[k] | |||
cur_mapping = all_mappings[k] | |||
size_graph = cur_graph.order() | |||
if ((cur_mapping[i] < size_graph) and | |||
(cur_mapping[j] < size_graph) and | |||
(cur_graph.has_edge(cur_mapping[i], cur_mapping[j]) == True)): | |||
s_ij += 1 | |||
return s_ij | |||
# def update_median_nodes_L1(median,listIdSet,median_id,dataset, mappings): | |||
# from scipy.stats.mstats import gmean | |||
# for i in median.nodes(): | |||
# for k in listIdSet: | |||
# vectors = [] #np.zeros((len(listIdSet),2)) | |||
# if(k != median_id): | |||
# phi_i = mappings[k][i] | |||
# if(phi_i < dataset[k].order()): | |||
# vectors.append([float(dataset[k].node[phi_i]['x']),float(dataset[k].node[phi_i]['y'])]) | |||
# new_labels = gmean(vectors) | |||
# median.node[i]['x'] = str(new_labels[0]) | |||
# median.node[i]['y'] = str(new_labels[1]) | |||
# return median | |||
def update_median_nodes(median,dataset,mappings): | |||
#update node attributes | |||
for i in median.nodes(): | |||
nb_sub=0 | |||
mean_label = {'x' : 0, 'y' : 0} | |||
for k in range(0,len(mappings)): | |||
phi_i = mappings[k][i] | |||
if ( phi_i < dataset[k].order() ): | |||
nb_sub += 1 | |||
mean_label['x'] += 0.75*float(dataset[k].node[phi_i]['x']) | |||
mean_label['y'] += 0.75*float(dataset[k].node[phi_i]['y']) | |||
median.node[i]['x'] = str((1/0.75)*(mean_label['x']/nb_sub)) | |||
median.node[i]['y'] = str((1/0.75)*(mean_label['y']/nb_sub)) | |||
return median | |||
def update_median_edges(dataset, mappings, median, cei=0.425,cer=0.425): | |||
# for Letter-high, c_ei = c_er = 1.7, alpha = 0.75
size_dataset = len(dataset) | |||
ratio_cei_cer = cer/(cei + cer) | |||
threshold = size_dataset*ratio_cei_cer | |||
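# note: with the default cei = cer = 0.425, ratio_cei_cer is 0.5 and the
# threshold is half the dataset size, so an edge (i, j) is kept exactly
# when a majority of the mapped graphs contain it.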
order_graph_median = median.order() | |||
for i in range(0,order_graph_median): | |||
for j in range(i+1,order_graph_median): | |||
s_ij = calcul_Sij(mappings,dataset,i,j) | |||
if(s_ij > threshold): | |||
median.add_edge(i,j) | |||
else: | |||
if(median.has_edge(i,j)): | |||
median.remove_edge(i,j) | |||
return median | |||
def compute_median(script, listID, dataset,verbose=False): | |||
"""Compute a graph median of a dataset according to an environment | |||
Parameters | |||
script : a gedlib-initialized environment
listID (list): a list of ID in script: encodes the dataset | |||
dataset (list): corresponding graphs in networkX format. We assume that graph | |||
listID[i] corresponds to dataset[i] | |||
Returns:
The median graph (networkX), its SOD, the list of per-iteration SODs, and the set-median
""" | |||
print(len(listID)) | |||
median_set_index, median_set_sod = compute_median_set(script, listID) | |||
print(median_set_index) | |||
print(median_set_sod) | |||
sods = [] | |||
# Add the median to the environment
set_median = dataset[median_set_index].copy() | |||
median = dataset[median_set_index].copy() | |||
cur_med_id = replace_graph_in_env(script,median,-1) | |||
med_distances, med_mappings, cur_sod = update_mappings(script,cur_med_id,listID) | |||
sods.append(cur_sod) | |||
if(verbose): | |||
print(cur_sod) | |||
ite_max = 50 | |||
old_sod = cur_sod * 2 | |||
ite = 0 | |||
epsilon = 0.001 | |||
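# iterate until the SOD change between consecutive rounds drops below
# epsilon, or ite_max iterations are reached.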
while((ite < ite_max) and (np.abs(old_sod - cur_sod) > epsilon )): | |||
median = update_median_nodes(median,dataset, med_mappings) | |||
median = update_median_edges(dataset,med_mappings,median) | |||
cur_med_id = replace_graph_in_env(script,median,cur_med_id) | |||
med_distances, med_mappings, cur_sod = update_mappings(script,cur_med_id,listID) | |||
sods.append(cur_sod) | |||
if(verbose): | |||
print(cur_sod) | |||
ite += 1 | |||
return median, cur_sod, sods, set_median | |||
# draw_Letter_graph(median)
def compute_median_set(script,listID): | |||
'Return the index of the set median in listID, together with its SOD.'
# Compute the median set
N=len(listID) | |||
map_id_to_index = {} | |||
map_index_to_id = {} | |||
for i in range(0,len(listID)): | |||
map_id_to_index[listID[i]] = i | |||
map_index_to_id[i] = listID[i] | |||
distances = np.zeros((N,N)) | |||
for i in listID: | |||
for j in listID: | |||
script.PyRunMethod(i,j) | |||
distances[map_id_to_index[i],map_id_to_index[j]] = script.PyGetUpperBound(i,j) | |||
median_set_index = np.argmin(np.sum(distances,0)) | |||
sod = np.min(np.sum(distances,0)) | |||
return median_set_index, sod | |||
#if __name__ == "__main__": | |||
# # Load the dataset
# script.PyLoadGXLGraph('/home/bgauzere/dev/gedlib/data/datasets/Letter/HIGH/', '/home/bgauzere/dev/gedlib/data/collections/Letter_Z.xml') | |||
# script.PySetEditCost("LETTER") | |||
# script.PyInitEnv() | |||
# script.PySetMethod("IPFP", "") | |||
# script.PyInitMethod() | |||
# | |||
# dataset,my_y = pygraph.utils.graphfiles.loadDataset("/home/bgauzere/dev/gedlib/data/datasets/Letter/HIGH/Letter_Z.cxl") | |||
# | |||
# listID = script.PyGetAllGraphIds() | |||
# median, sod = compute_median(script,listID,dataset,verbose=True) | |||
# | |||
# print(sod) | |||
# draw_Letter_graph(median) | |||
if __name__ == '__main__': | |||
# test draw_Letter_graph | |||
ds = {'name': 'Letter-high', 'dataset': '../datasets/Letter-high/Letter-high_A.txt', | |||
'extra_params': {}} # node nsymb | |||
Gn, y_all = loadDataset(ds['dataset'], extra_params=ds['extra_params']) | |||
print(y_all) | |||
for g in Gn: | |||
draw_Letter_graph(g) |
@@ -0,0 +1,423 @@ | |||
#!/usr/bin/env python3 | |||
# -*- coding: utf-8 -*- | |||
""" | |||
Created on Thu Jul 4 12:20:16 2019 | |||
@author: ljia | |||
""" | |||
import numpy as np | |||
import networkx as nx | |||
import matplotlib.pyplot as plt | |||
import time | |||
from tqdm import tqdm | |||
import sys | |||
sys.path.insert(0, "../") | |||
from pygraph.utils.graphfiles import loadDataset | |||
from median import draw_Letter_graph | |||
# --------------------------- These are tests --------------------------------# | |||
def test_who_is_the_closest_in_kernel_space(Gn): | |||
idx_gi = [0, 6] | |||
g1 = Gn[idx_gi[0]] | |||
g2 = Gn[idx_gi[1]] | |||
# create the "median" graph. | |||
gnew = g2.copy() | |||
gnew.remove_node(0) | |||
nx.draw_networkx(gnew) | |||
plt.show() | |||
print(gnew.nodes(data=True)) | |||
Gn = [gnew] + Gn | |||
# compute gram matrix | |||
Kmatrix = compute_kernel(Gn, 'untilhpathkernel', True) | |||
# the distance matrix | |||
dmatrix = gram2distances(Kmatrix) | |||
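# gram2distances presumably applies the kernel-induced metric
# d(g, g')^2 = k(g, g) + k(g', g') - 2 * k(g, g').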
print(np.sort(dmatrix[idx_gi[0] + 1])) | |||
print(np.argsort(dmatrix[idx_gi[0] + 1])) | |||
print(np.sort(dmatrix[idx_gi[1] + 1])) | |||
print(np.argsort(dmatrix[idx_gi[1] + 1])) | |||
# for all g in Gn, compute (d(g1, g) + d(g2, g)) / 2 | |||
dis_median = [(dmatrix[i, idx_gi[0] + 1] + dmatrix[i, idx_gi[1] + 1]) / 2 for i in range(len(Gn))] | |||
print(np.sort(dis_median)) | |||
print(np.argsort(dis_median)) | |||
return | |||
def test_who_is_the_closest_in_GED_space(Gn): | |||
from iam import GED | |||
idx_gi = [0, 6] | |||
g1 = Gn[idx_gi[0]] | |||
g2 = Gn[idx_gi[1]] | |||
# create the "median" graph. | |||
gnew = g2.copy() | |||
gnew.remove_node(0) | |||
nx.draw_networkx(gnew) | |||
plt.show() | |||
print(gnew.nodes(data=True)) | |||
Gn = [gnew] + Gn | |||
# compute GEDs | |||
ged_matrix = np.zeros((len(Gn), len(Gn))) | |||
for i1 in tqdm(range(len(Gn)), desc='computing GEDs', file=sys.stdout): | |||
for i2 in range(len(Gn)): | |||
dis, _, _ = GED(Gn[i1], Gn[i2], lib='gedlib') | |||
ged_matrix[i1, i2] = dis | |||
print(np.sort(ged_matrix[idx_gi[0] + 1])) | |||
print(np.argsort(ged_matrix[idx_gi[0] + 1])) | |||
print(np.sort(ged_matrix[idx_gi[1] + 1])) | |||
print(np.argsort(ged_matrix[idx_gi[1] + 1])) | |||
# for all g in Gn, compute (GED(g1, g) + GED(g2, g)) / 2 | |||
dis_median = [(ged_matrix[i, idx_gi[0] + 1] + ged_matrix[i, idx_gi[1] + 1]) / 2 for i in range(len(Gn))] | |||
print(np.sort(dis_median)) | |||
print(np.argsort(dis_median)) | |||
return | |||
def test_will_IAM_give_the_median_graph_we_wanted(Gn): | |||
idx_gi = [0, 6] | |||
g1 = Gn[idx_gi[0]].copy() | |||
g2 = Gn[idx_gi[1]].copy() | |||
# del Gn[idx_gi[0]] | |||
# del Gn[idx_gi[1] - 1] | |||
g_median = test_iam_with_more_graphs_as_init([g1, g2], [g1, g2], c_ei=1, c_er=1, c_es=1) | |||
# g_median = test_iam_with_more_graphs_as_init(Gn, Gn, c_ei=1, c_er=1, c_es=1) | |||
nx.draw_networkx(g_median) | |||
plt.show() | |||
print(g_median.nodes(data=True)) | |||
print(g_median.edges(data=True)) | |||
def test_new_IAM_allGraph_deleteNodes(Gn): | |||
idx_gi = [0, 6] | |||
# g1 = Gn[idx_gi[0]].copy() | |||
# g2 = Gn[idx_gi[1]].copy() | |||
# g1 = nx.Graph(name='haha') | |||
# g1.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'}), (2, {'atom': 'C'})]) | |||
# g1.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'})]) | |||
# g2 = nx.Graph(name='hahaha') | |||
# g2.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'}), (2, {'atom': 'C'}), | |||
# (3, {'atom': 'O'}), (4, {'atom': 'C'})]) | |||
# g2.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'}), | |||
# (2, 3, {'bond_type': '1'}), (3, 4, {'bond_type': '1'})]) | |||
g1 = nx.Graph(name='haha') | |||
g1.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'C'}), (2, {'atom': 'C'}), | |||
(3, {'atom': 'S'}), (4, {'atom': 'S'})]) | |||
g1.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'}), | |||
(2, 3, {'bond_type': '1'}), (2, 4, {'bond_type': '1'})]) | |||
g2 = nx.Graph(name='hahaha') | |||
g2.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'C'}), (2, {'atom': 'C'}), | |||
(3, {'atom': 'O'}), (4, {'atom': 'O'})]) | |||
g2.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'}), | |||
(2, 3, {'bond_type': '1'}), (2, 4, {'bond_type': '1'})]) | |||
# g2 = g1.copy() | |||
# g2.add_nodes_from([(3, {'atom': 'O'})]) | |||
# g2.add_nodes_from([(4, {'atom': 'C'})]) | |||
# g2.add_edges_from([(1, 3, {'bond_type': '1'})]) | |||
# g2.add_edges_from([(3, 4, {'bond_type': '1'})]) | |||
# del Gn[idx_gi[0]] | |||
# del Gn[idx_gi[1] - 1] | |||
nx.draw_networkx(g1) | |||
plt.show() | |||
print(g1.nodes(data=True)) | |||
print(g1.edges(data=True)) | |||
nx.draw_networkx(g2) | |||
plt.show() | |||
print(g2.nodes(data=True)) | |||
print(g2.edges(data=True)) | |||
g_median = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations([g1, g2], [g1, g2], c_ei=1, c_er=1, c_es=1) | |||
# g_median = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations(Gn, Gn, c_ei=1, c_er=1, c_es=1) | |||
nx.draw_networkx(g_median) | |||
plt.show() | |||
print(g_median.nodes(data=True)) | |||
print(g_median.edges(data=True)) | |||
def test_the_simple_two(Gn, gkernel): | |||
from gk_iam import gk_iam_nearest_multi, compute_kernel | |||
lmbda = 0.03 # termination probability
r_max = 10 # recursions | |||
l = 500 | |||
alpha_range = np.linspace(0.5, 0.5, 1) | |||
k = 2 # k nearest neighbors | |||
# randomly select two molecules | |||
np.random.seed(1) | |||
idx_gi = [0, 6] # np.random.randint(0, len(Gn), 2) | |||
g1 = Gn[idx_gi[0]] | |||
g2 = Gn[idx_gi[1]] | |||
Gn_mix = [g.copy() for g in Gn] | |||
Gn_mix.append(g1.copy()) | |||
Gn_mix.append(g2.copy()) | |||
# g_tmp = iam([g1, g2]) | |||
# nx.draw_networkx(g_tmp) | |||
# plt.show() | |||
# compute | |||
# k_list = [] # kernel between each graph and itself. | |||
# k_g1_list = [] # kernel between each graph and g1 | |||
# k_g2_list = [] # kernel between each graph and g2 | |||
# for ig, g in tqdm(enumerate(Gn), desc='computing self kernels', file=sys.stdout): | |||
# ktemp = compute_kernel([g, g1, g2], 'marginalizedkernel', False) | |||
# k_list.append(ktemp[0][0, 0]) | |||
# k_g1_list.append(ktemp[0][0, 1]) | |||
# k_g2_list.append(ktemp[0][0, 2]) | |||
km = compute_kernel(Gn_mix, gkernel, True) | |||
# k_list = np.diag(km) # kernel between each graph and itself. | |||
# k_g1_list = km[idx_gi[0]] # kernel between each graph and g1 | |||
# k_g2_list = km[idx_gi[1]] # kernel between each graph and g2 | |||
g_best = [] | |||
dis_best = [] | |||
# for each alpha | |||
for alpha in alpha_range: | |||
print('alpha =', alpha) | |||
dhat, ghat_list = gk_iam_nearest_multi(Gn, [g1, g2], [alpha, 1 - alpha], | |||
range(len(Gn), len(Gn) + 2), km, | |||
k, r_max, gkernel)
dis_best.append(dhat) | |||
g_best.append(ghat_list) | |||
for idx, item in enumerate(alpha_range): | |||
print('when alpha is', item, 'the shortest distance is', dis_best[idx]) | |||
print('the corresponding pre-images are') | |||
for g in g_best[idx]: | |||
nx.draw_networkx(g) | |||
plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
def test_remove_bests(Gn, gkernel): | |||
from gk_iam import gk_iam_nearest_multi, compute_kernel | |||
lmbda = 0.03 # termination probability
r_max = 10 # recursions | |||
l = 500 | |||
alpha_range = np.linspace(0.5, 0.5, 1) | |||
k = 20 # k nearest neighbors | |||
# randomly select two molecules | |||
np.random.seed(1) | |||
idx_gi = [0, 6] # np.random.randint(0, len(Gn), 2) | |||
g1 = Gn[idx_gi[0]] | |||
g2 = Gn[idx_gi[1]] | |||
# remove the best 2 graphs. | |||
del Gn[idx_gi[0]] | |||
del Gn[idx_gi[1] - 1] | |||
# del Gn[8] | |||
Gn_mix = [g.copy() for g in Gn] | |||
Gn_mix.append(g1.copy()) | |||
Gn_mix.append(g2.copy()) | |||
# compute | |||
km = compute_kernel(Gn_mix, gkernel, True) | |||
g_best = [] | |||
dis_best = [] | |||
# for each alpha | |||
for alpha in alpha_range: | |||
print('alpha =', alpha) | |||
dhat, ghat_list = gk_iam_nearest_multi(Gn, [g1, g2], [alpha, 1 - alpha], | |||
range(len(Gn), len(Gn) + 2), km, | |||
k, r_max, gkernel) | |||
dis_best.append(dhat) | |||
g_best.append(ghat_list) | |||
for idx, item in enumerate(alpha_range): | |||
print('when alpha is', item, 'the shortest distance is', dis_best[idx]) | |||
print('the corresponding pre-images are') | |||
for g in g_best[idx]: | |||
draw_Letter_graph(g) | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
def test_gkiam_letter_h(): | |||
from gk_iam import gk_iam_nearest_multi, compute_kernel | |||
from iam import median_distance | |||
ds = {'name': 'Letter-high', 'dataset': '../datasets/Letter-high/Letter-high_A.txt', | |||
'extra_params': {}} # node nsymb | |||
# ds = {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt', | |||
# 'extra_params': {}} # node nsymb | |||
Gn, y_all = loadDataset(ds['dataset'], extra_params=ds['extra_params']) | |||
gkernel = 'structuralspkernel' | |||
    lmbda = 0.03 # termination probability
r_max = 3 # recursions | |||
# alpha_range = np.linspace(0.5, 0.5, 1) | |||
k = 10 # k nearest neighbors | |||
# classify graphs according to letters. | |||
idx_dict = get_same_item_indices(y_all) | |||
time_list = [] | |||
sod_list = [] | |||
sod_min_list = [] | |||
for letter in idx_dict: | |||
print('\n-------------------------------------------------------\n') | |||
Gn_let = [Gn[i].copy() for i in idx_dict[letter]] | |||
Gn_mix = Gn_let + [g.copy() for g in Gn_let] | |||
alpha_range = np.linspace(1 / len(Gn_let), 1 / len(Gn_let), 1) | |||
# compute | |||
time0 = time.time() | |||
km = compute_kernel(Gn_mix, gkernel, True) | |||
g_best = [] | |||
dis_best = [] | |||
# for each alpha | |||
for alpha in alpha_range: | |||
print('alpha =', alpha) | |||
dhat, ghat_list = gk_iam_nearest_multi(Gn_let, Gn_let, [alpha] * len(Gn_let), | |||
range(len(Gn_let), len(Gn_mix)), km, | |||
k, r_max, gkernel, c_ei=1.7, | |||
c_er=1.7, c_es=1.7) | |||
dis_best.append(dhat) | |||
g_best.append(ghat_list) | |||
time_list.append(time.time() - time0) | |||
# show best graphs and save them to file. | |||
for idx, item in enumerate(alpha_range): | |||
print('when alpha is', item, 'the shortest distance is', dis_best[idx]) | |||
print('the corresponding pre-images are') | |||
for g in g_best[idx]: | |||
draw_Letter_graph(g, savepath='results/gk_iam/') | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
# compute the corresponding sod in graph space. (alpha range not considered.) | |||
sod_tmp, _ = median_distance(g_best[0], Gn_let) | |||
sod_list.append(sod_tmp) | |||
sod_min_list.append(np.min(sod_tmp)) | |||
print('\nsods in graph space: ', sod_list) | |||
print('\nsmallest sod in graph space for each letter: ', sod_min_list) | |||
print('\ntimes:', time_list) | |||
def get_same_item_indices(ls): | |||
"""Get the indices of the same items in a list. Return a dict keyed by items. | |||
""" | |||
idx_dict = {} | |||
for idx, item in enumerate(ls): | |||
if item in idx_dict: | |||
idx_dict[item].append(idx) | |||
else: | |||
idx_dict[item] = [idx] | |||
return idx_dict | |||
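# A minimal doctest-style sketch of get_same_item_indices (labels purely
# illustrative):
#   >>> get_same_item_indices(['a', 'i', 'a', 'i', 'i'])
#   {'a': [0, 2], 'i': [1, 3, 4]}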
#def compute_letter_median_by_average(Gn): | |||
# return g_median | |||
def test_iam_letter_h(): | |||
from iam import test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations | |||
from gk_iam import dis_gstar, compute_kernel | |||
ds = {'name': 'Letter-high', 'dataset': '../datasets/Letter-high/Letter-high_A.txt', | |||
'extra_params': {}} # node nsymb | |||
# ds = {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt', | |||
# 'extra_params': {}} # node nsymb | |||
Gn, y_all = loadDataset(ds['dataset'], extra_params=ds['extra_params']) | |||
    lmbda = 0.03 # termination probability
# alpha_range = np.linspace(0.5, 0.5, 1) | |||
# classify graphs according to letters. | |||
idx_dict = get_same_item_indices(y_all) | |||
time_list = [] | |||
sod_list = [] | |||
sod_min_list = [] | |||
for letter in idx_dict: | |||
Gn_let = [Gn[i].copy() for i in idx_dict[letter]] | |||
alpha_range = np.linspace(1 / len(Gn_let), 1 / len(Gn_let), 1) | |||
# compute | |||
g_best = [] | |||
dis_best = [] | |||
time0 = time.time() | |||
# for each alpha | |||
for alpha in alpha_range: | |||
print('alpha =', alpha) | |||
ghat_list, dhat = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
Gn_let, Gn_let, c_ei=1.7, c_er=1.7, c_es=1.7) | |||
dis_best.append(dhat) | |||
g_best.append(ghat_list) | |||
time_list.append(time.time() - time0) | |||
# show best graphs and save them to file. | |||
for idx, item in enumerate(alpha_range): | |||
print('when alpha is', item, 'the shortest distance is', dis_best[idx]) | |||
print('the corresponding pre-images are') | |||
for g in g_best[idx]: | |||
draw_Letter_graph(g, savepath='results/iam/') | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
# compute the corresponding sod in kernel space. (alpha range not considered.) | |||
gkernel = 'structuralspkernel' | |||
sod_tmp = [] | |||
Gn_mix = g_best[0] + Gn_let | |||
km = compute_kernel(Gn_mix, gkernel, True) | |||
for ig, g in tqdm(enumerate(g_best[0]), desc='computing kernel sod', file=sys.stdout): | |||
dtemp = dis_gstar(ig, range(len(g_best[0]), len(Gn_mix)), | |||
[alpha_range[0]] * len(Gn_let), km, withterm3=False) | |||
sod_tmp.append(dtemp) | |||
sod_list.append(sod_tmp) | |||
sod_min_list.append(np.min(sod_tmp)) | |||
print('\nsods in kernel space: ', sod_list) | |||
print('\nsmallest sod in kernel space for each letter: ', sod_min_list) | |||
print('\ntimes:', time_list) | |||
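# A minimal sketch (not this repo's dis_gstar) of the squared kernel-space
# distance that dis_gstar above presumably evaluates, following the standard
# pre-image formulation; the withterm3 behavior is an assumption:
#     d(g, g_bar)^2 = k(g, g) - 2 * sum_i alpha_i * k(g, g_i)
#                     + sum_{i,j} alpha_i * alpha_j * k(g_i, g_j),
# where the last term is constant over candidates and may be dropped when
# only the ranking of candidates matters.
def dis_gstar_sketch(idx, idx_range, alphas, km, withterm3=True):
    alphas = np.asarray(alphas)
    idx_range = list(idx_range)
    term1 = km[idx, idx]  # k(g, g)
    term2 = 2 * alphas @ km[idx, idx_range]  # 2 * sum_i alpha_i * k(g, g_i)
    term3 = alphas @ km[np.ix_(idx_range, idx_range)] @ alphas if withterm3 else 0
    return np.sqrt(max(term1 - term2 + term3, 0))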
if __name__ == '__main__': | |||
# ds = {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt', | |||
# 'extra_params': {}} # node/edge symb | |||
ds = {'name': 'Letter-high', 'dataset': '../datasets/Letter-high/Letter-high_A.txt', | |||
'extra_params': {}} # node nsymb | |||
# ds = {'name': 'Acyclic', 'dataset': '../datasets/monoterpenoides/trainset_9.ds', | |||
# 'extra_params': {}} | |||
# ds = {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'extra_params': {}} # node symb | |||
Gn, y_all = loadDataset(ds['dataset'], extra_params=ds['extra_params']) | |||
# Gn = Gn[0:20] | |||
# import networkx.algorithms.isomorphism as iso | |||
# G1 = nx.MultiDiGraph() | |||
# G2 = nx.MultiDiGraph() | |||
# G1.add_nodes_from([1,2,3], fill='red') | |||
# G2.add_nodes_from([10,20,30,40], fill='red') | |||
# nx.add_path(G1, [1,2,3,4], weight=3, linewidth=2.5) | |||
# nx.add_path(G2, [10,20,30,40], weight=3) | |||
# nm = iso.categorical_node_match('fill', 'red') | |||
# print(nx.is_isomorphic(G1, G2, node_match=nm)) | |||
# | |||
# test_new_IAM_allGraph_deleteNodes(Gn) | |||
# test_will_IAM_give_the_median_graph_we_wanted(Gn) | |||
# test_who_is_the_closest_in_GED_space(Gn) | |||
# test_who_is_the_closest_in_kernel_space(Gn) | |||
# test_the_simple_two(Gn, 'untilhpathkernel') | |||
# test_remove_bests(Gn, 'untilhpathkernel') | |||
test_gkiam_letter_h() | |||
# test_iam_letter_h() |
@@ -23,7 +23,7 @@ from pygraph.utils.parallel import parallel_gm | |||
def commonwalkkernel(*args, | |||
node_label='atom', | |||
edge_label='bond_type', | |||
n=None, | |||
# n=None, | |||
weight=1, | |||
compute_method=None, | |||
n_jobs=None, | |||
@@ -35,26 +35,28 @@ def commonwalkkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as symbolic label. The default node label is 'atom'. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
n : integer | |||
Longest length of walks. Only useful when applying the 'brute' method. | |||
Edge attribute used as symbolic label. The default edge label is 'bond_type'. | |||
# n : integer | |||
# Longest length of walks. Only useful when applying the 'brute' method. | |||
    weight: integer
        Weight coefficient of different lengths of walks, which represents beta
        in the 'exp' method and gamma in the 'geo' method.
compute_method : string | |||
        Method used to compute the walk kernel. The following choices are
        available:
'exp' : exponential serial method applied on the direct product graph, | |||
as shown in reference [1]. The time complexity is O(n^6) for graphs | |||
with n vertices. | |||
'geo' : geometric serial method applied on the direct product graph, as | |||
shown in reference [1]. The time complexity is O(n^6) for graphs with n | |||
vertices. | |||
'brute' : brute force, simply search for all walks and compare them. | |||
        'exp': method based on the exponential series applied on the direct
        product graph, as shown in reference [1]. The time complexity is O(n^6)
        for graphs with n vertices.
        'geo': method based on the geometric series applied on the direct
        product graph, as shown in reference [1]. The time complexity is O(n^6)
        for graphs with n vertices.
# 'brute': brute force, simply search for all walks and compare them. | |||
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
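For orientation, a hedged call sketch using only the parameters documented
above; the graph list Gn, the weight value and the return shape are
illustrative assumptions:

out = commonwalkkernel(Gn, node_label='atom', edge_label='bond_type',
                       weight=0.01, compute_method='exp', n_jobs=1)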
@@ -44,17 +44,20 @@ def marginalizedkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as symbolic label. The default node label is 'atom'. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
Edge attribute used as symbolic label. The default edge label is 'bond_type'. | |||
    p_quit : float
the termination probability in the random walks generating step | |||
The termination probability in the random walks generating step. | |||
n_iteration : integer | |||
time of iterations to calculate R_inf | |||
        Number of iterations to calculate R_inf.
remove_totters : boolean | |||
whether to remove totters. The default value is True. | |||
Whether to remove totterings by method introduced in [2]. The default | |||
value is False. | |||
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
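A hedged call sketch for marginalizedkernel with the parameters documented
above; all values are illustrative, and remove_totters=False matches the new
default:

out = marginalizedkernel(Gn, node_label='atom', edge_label='bond_type',
                         p_quit=0.03, n_iteration=20, remove_totters=False,
                         n_jobs=1)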
@@ -41,15 +41,62 @@ def randomwalkkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Two graphs between which the kernel is calculated. | |||
compute_method : string | |||
        Method used to compute the kernel. The following choices are
        available:
'sylvester' - Sylvester equation method. | |||
'conjugate' - conjugate gradient method. | |||
'fp' - fixed-point iterations. | |||
'spectral' - spectral decomposition. | |||
weight : float | |||
A constant weight set for random walks of length h. | |||
p : None | |||
Initial probability distribution on the unlabeled direct product graph | |||
of two graphs. It is set to be uniform over all vertices in the direct | |||
product graph. | |||
q : None | |||
Stopping probability distribution on the unlabeled direct product graph | |||
of two graphs. It is set to be uniform over all vertices in the direct | |||
product graph. | |||
    edge_weight : string
        Edge attribute name corresponding to the edge weight.
    node_kernels : dict
        A dictionary of kernel functions for nodes, including 3 items: 'symb'
        for symbolic node labels, 'nsymb' for non-symbolic node labels, 'mix'
        for both labels. The first 2 functions take two node labels as
        parameters, and the 'mix' function takes 4 parameters, a symbolic and a
        non-symbolic label for each of the two nodes. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when nodes are unlabeled. This argument
        applies only to the conjugate gradient method and fixed-point iterations.
    edge_kernels : dict
        A dictionary of kernel functions for edges, including 3 items: 'symb'
        for symbolic edge labels, 'nsymb' for non-symbolic edge labels, 'mix'
        for both labels. The first 2 functions take two edge labels as
        parameters, and the 'mix' function takes 4 parameters, a symbolic and a
        non-symbolic label for each of the two edges. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when edges are unlabeled. This argument
        applies only to the conjugate gradient method and fixed-point iterations.
    node_label : string
        Node attribute used as label. The default node label is atom. This
        argument applies only to the conjugate gradient method and fixed-point
        iterations.
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
h : integer | |||
Longest length of walks. | |||
method : string | |||
Method used to compute the random walk kernel. Available methods are 'sylvester', 'conjugate', 'fp', 'spectral' and 'kron'. | |||
        Edge attribute used as label. The default edge label is bond_type. This
        argument applies only to the conjugate gradient method and fixed-point
        iterations.
    sub_kernel : string
        Method used to compute the walk kernel. The following choices are
        available:
        'exp' : method based on the exponential series.
        'geo' : method based on the geometric series.
    n_jobs : int
Number of jobs for parallelization. | |||
Return | |||
------ | |||
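The node_kernels and edge_kernels dictionaries described above share one
fixed shape. A minimal sketch, assuming the helpers deltakernel,
gaussiankernel and kernelproduct exist in pygraph.utils.kernels (an
assumption based on this repo's layout):

import functools
from pygraph.utils.kernels import deltakernel, gaussiankernel, kernelproduct

# 'mix' multiplies a symbolic (delta) and a non-symbolic (gaussian) kernel.
mix_kernel = functools.partial(kernelproduct, deltakernel, gaussiankernel)
sub_kernels = {'symb': deltakernel, 'nsymb': gaussiankernel, 'mix': mix_kernel}
out = randomwalkkernel(Gn, compute_method='conjugate', weight=0.01,
                       node_kernels=sub_kernels, edge_kernels=sub_kernels,
                       node_label='atom', edge_label='bond_type', n_jobs=1)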
@@ -168,7 +215,7 @@ def _sylvester_equation(Gn, lmda, p, q, eweight, n_jobs, verbose=True): | |||
    if q is None:
# don't normalize adjacency matrices if q is a uniform vector. Note | |||
# A_wave_list accually contains the transposes of the adjacency matrices. | |||
# A_wave_list actually contains the transposes of the adjacency matrices. | |||
A_wave_list = [ | |||
nx.adjacency_matrix(G, eweight).todense().transpose() for G in | |||
(tqdm(Gn, desc='compute adjacency matrices', file=sys.stdout) if | |||
@@ -259,7 +306,7 @@ def _conjugate_gradient(Gn, lmda, p, q, ds_attrs, node_kernels, edge_kernels, | |||
# # this is faster from unlabeled graphs. @todo: why? | |||
# if q == None: | |||
# # don't normalize adjacency matrices if q is a uniform vector. Note | |||
# # A_wave_list accually contains the transposes of the adjacency matrices. | |||
# # A_wave_list actually contains the transposes of the adjacency matrices. | |||
# A_wave_list = [ | |||
# nx.adjacency_matrix(G, eweight).todense().transpose() for G in | |||
# tqdm(Gn, desc='compute adjacency matrices', file=sys.stdout) | |||
@@ -376,7 +423,7 @@ def _fixed_point(Gn, lmda, p, q, ds_attrs, node_kernels, edge_kernels, | |||
# # this is faster from unlabeled graphs. @todo: why? | |||
# if q == None: | |||
# # don't normalize adjacency matrices if q is a uniform vector. Note | |||
# # A_wave_list accually contains the transposes of the adjacency matrices. | |||
# # A_wave_list actually contains the transposes of the adjacency matrices. | |||
# A_wave_list = [ | |||
# nx.adjacency_matrix(G, eweight).todense().transpose() for G in | |||
# tqdm(Gn, desc='compute adjacency matrices', file=sys.stdout) | |||
@@ -481,7 +528,7 @@ def _spectral_decomposition(Gn, weight, p, q, sub_kernel, eweight, n_jobs, verbo | |||
for G in (tqdm(Gn, desc='spectral decompose', file=sys.stdout) if | |||
verbose else Gn): | |||
# don't normalize adjacency matrices if q is a uniform vector. Note | |||
# A accually is the transpose of the adjacency matrix. | |||
# A actually is the transpose of the adjacency matrix. | |||
A = nx.adjacency_matrix(G, eweight).todense().transpose() | |||
ew, ev = np.linalg.eig(A) | |||
D_list.append(ew) | |||
@@ -33,12 +33,12 @@ def spkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as label. The default node label is atom. | |||
edge_weight : string | |||
Edge attribute name corresponding to the edge weight. | |||
node_kernels: dict | |||
node_kernels : dict | |||
A dictionary of kernel functions for nodes, including 3 items: 'symb' | |||
for symbolic node labels, 'nsymb' for non-symbolic node labels, 'mix' | |||
for both labels. The first 2 functions take two node labels as | |||
@@ -46,6 +46,8 @@ def spkernel(*args, | |||
        non-symbolic label for each of the two nodes. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when nodes are unlabeled.
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
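spkernel consumes the same node_kernels shape; a hedged sketch reusing the
sub_kernels dict from the randomwalkkernel sketch above (edge_weight=None is
assumed to mean unweighted edges):

out = spkernel(Gn, node_label='atom', edge_weight=None,
               node_kernels=sub_kernels, n_jobs=1)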
@@ -42,14 +42,15 @@ def structuralspkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as label. The default node label is atom. | |||
edge_weight : string | |||
Edge attribute name corresponding to the edge weight. | |||
Edge attribute name corresponding to the edge weight. Applied for the | |||
computation of the shortest paths. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
node_kernels: dict | |||
Edge attribute used as label. The default edge label is bond_type. | |||
node_kernels : dict | |||
A dictionary of kernel functions for nodes, including 3 items: 'symb' | |||
for symbolic node labels, 'nsymb' for non-symbolic node labels, 'mix' | |||
for both labels. The first 2 functions take two node labels as | |||
@@ -57,7 +58,7 @@ def structuralspkernel(*args, | |||
        non-symbolic label for each of the two nodes. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when nodes are unlabeled.
edge_kernels: dict | |||
edge_kernels : dict | |||
A dictionary of kernel functions for edges, including 3 items: 'symb' | |||
for symbolic edge labels, 'nsymb' for non-symbolic edge labels, 'mix' | |||
for both labels. The first 2 functions take two edge labels as | |||
@@ -65,6 +66,13 @@ def structuralspkernel(*args, | |||
        non-symbolic label for each of the two edges. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when edges are unlabeled.
    compute_method : string
        Computation method to store the shortest paths and compute the graph
        kernel. The following choices are available:
        'trie': store paths as tries.
        'naive': store paths in lists.
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
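structuralspkernel is the kernel the pre-image experiments above select via
gkernel = 'structuralspkernel'; a hedged direct-call sketch, again reusing
the sub_kernels dict sketched earlier:

out = structuralspkernel(Gn, node_label='atom', edge_weight=None,
                         edge_label='bond_type', node_kernels=sub_kernels,
                         edge_kernels=sub_kernels, compute_method='trie',
                         n_jobs=1)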
@@ -40,11 +40,19 @@ def treeletkernel(*args, | |||
        The sub-kernel between two real-valued vectors. Each vector counts the
        numbers of isomorphic treelets in a graph.
node_label : string | |||
        Node attribute used as label. The default node label is atom.
edge_label : string | |||
Edge attribute used as label. The default edge label is bond_type. | |||
labeled : boolean | |||
Whether the graphs are labeled. The default is True. | |||
    parallel : string/None
        Which parallelization method is applied to compute the kernel. The
        following choices are available:
        'imap_unordered': use Python's multiprocessing.Pool.imap_unordered
        method.
        None: no parallelization is applied.
    n_jobs : int
        Number of jobs for parallelization. The default is to use all
        computational cores. This argument is only valid when one of the
        parallelization methods is applied.
Return | |||
------ | |||
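A hedged sketch for treeletkernel; sub_kernel compares two treelet-count
vectors, and gaussiankernel from pygraph.utils.kernels with an illustrative
gamma is assumed to be one valid choice:

import functools
from pygraph.utils.kernels import gaussiankernel

out = treeletkernel(Gn, sub_kernel=functools.partial(gaussiankernel, gamma=1e-6),
                    node_label='atom', edge_label='bond_type',
                    parallel='imap_unordered', n_jobs=1)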
@@ -26,7 +26,7 @@ def untilhpathkernel(*args, | |||
node_label='atom', | |||
edge_label='bond_type', | |||
depth=10, | |||
k_func='tanimoto', | |||
k_func='MinMax', | |||
compute_method='trie', | |||
n_jobs=None, | |||
verbose=True): | |||
@@ -38,7 +38,7 @@ def untilhpathkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
Node attribute used as label. The default node label is atom. | |||
edge_label : string | |||
@@ -47,9 +47,17 @@ def untilhpathkernel(*args, | |||
Depth of search. Longest length of paths. | |||
    k_func : string
A kernel function applied using different notions of fingerprint | |||
similarity. | |||
compute_method: string | |||
Computation method, 'trie' or 'naive'. | |||
        similarity, defining the type of feature map and normalization method
        applied for the graph kernel. The following choices are available:
        'MinMax': use the MinMax kernel and counting feature map.
        'tanimoto': use the Tanimoto kernel and binary feature map.
    compute_method : string
        Computation method to store paths and compute the graph kernel. The
        following choices are available:
        'trie': store paths as tries.
        'naive': store paths in lists.
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
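A hedged call sketch for untilhpathkernel matching the new default
k_func='MinMax'; the depth value is illustrative:

out = untilhpathkernel(Gn, node_label='atom', edge_label='bond_type',
                       depth=3, k_func='MinMax', compute_method='trie',
                       n_jobs=1)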
@@ -38,15 +38,28 @@ def weisfeilerlehmankernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as label. The default node label is atom. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
Edge attribute used as label. The default edge label is bond_type. | |||
height : int | |||
subtree height | |||
Subtree height. | |||
base_kernel : string | |||
base kernel used in each iteration of WL kernel. The default base kernel is subtree kernel. For user-defined kernel, base_kernel is the name of the base kernel function used in each iteration of WL kernel. This function returns a Numpy matrix, each element of which is the user-defined Weisfeiler-Lehman kernel between 2 praphs. | |||
        Base kernel used in each iteration of WL kernel. Only the default
        'subtree' kernel can be applied for now.
# The default base | |||
# kernel is subtree kernel. For user-defined kernel, base_kernel is the | |||
# name of the base kernel function used in each iteration of WL kernel. | |||
# This function returns a Numpy matrix, each element of which is the | |||
# user-defined Weisfeiler-Lehman kernel between two graphs.
    parallel : None
        Which parallelization method is applied to compute the kernel. No
        parallelization can be applied for now.
    n_jobs : int
        Number of jobs for parallelization. The default is to use all
        computational cores. This argument is only valid when one of the
        parallelization methods is applied, and can be ignored for now.
Return | |||
------ | |||
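A hedged call sketch for weisfeilerlehmankernel with the only supported base
kernel; the height value is illustrative:

out = weisfeilerlehmankernel(Gn, node_label='atom', edge_label='bond_type',
                             height=2, base_kernel='subtree', n_jobs=1)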