@@ -0,0 +1,65 @@ | |||
Node labels: [symbol] | |||
Node attributes: [chem, charge, x, y] | |||
Edge labels: [valence] | |||
Node labels were converted to integer values using this map: | |||
Component 0: | |||
0 C | |||
1 O | |||
2 N | |||
3 Cl | |||
4 F | |||
5 S | |||
6 Se | |||
7 P | |||
8 Na | |||
9 I | |||
10 Co | |||
11 Br | |||
12 Li | |||
13 Si | |||
14 Mg | |||
15 Cu | |||
16 As | |||
17 B | |||
18 Pt | |||
19 Ru | |||
20 K | |||
21 Pd | |||
22 Au | |||
23 Te | |||
24 W | |||
25 Rh | |||
26 Zn | |||
27 Bi | |||
28 Pb | |||
29 Ge | |||
30 Sb | |||
31 Sn | |||
32 Ga | |||
33 Hg | |||
34 Ho | |||
35 Tl | |||
36 Ni | |||
37 Tb | |||
Edge labels were converted to integer values using this map: | |||
Component 0: | |||
0 1 | |||
1 2 | |||
2 3 | |||
Class labels were converted to integer values using this map: | |||
0 a | |||
1 i | |||
@@ -0,0 +1,75 @@ | |||
README for dataset DD | |||
=== Usage === | |||
This folder contains the following comma separated text files | |||
(replace DS by the name of the dataset): | |||
n = total number of nodes | |||
m = total number of edges | |||
N = number of graphs | |||
(1) DS_A.txt (m lines) | |||
sparse (block diagonal) adjacency matrix for all graphs, | |||
each line corresponds to an entry (row, col) of the matrix, i.e. a (node_id, node_id) pair
(2) DS_graph_indicator.txt (n lines) | |||
column vector of graph identifiers for all nodes of all graphs, | |||
the value in the i-th line is the graph_id of the node with node_id i | |||
(3) DS_graph_labels.txt (N lines) | |||
class labels for all graphs in the dataset, | |||
the value in the i-th line is the class label of the graph with graph_id i | |||
(4) DS_node_labels.txt (n lines) | |||
column vector of node labels, | |||
the value in the i-th line corresponds to the node with node_id i | |||
There are OPTIONAL files if the respective information is available: | |||
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
(6) DS_edge_attributes.txt (m lines; same size as DS_A.txt) | |||
attributes for the edges in DS_A.txt | |||
(7) DS_node_attributes.txt (n lines) | |||
matrix of node attributes, | |||
the comma separated values in the i-th line are the attribute vector of the node with node_id i
(8) DS_graph_attributes.txt (N lines) | |||
regression values for all graphs in the dataset, | |||
the value in the i-th line is the attribute of the graph with graph_id i | |||
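For orientation, the required files (1)-(4) above can be assembled into networkx graphs roughly as follows. This is only a minimal loading sketch, not part of the dataset tooling (this repository itself uses pygraph.utils.graphfiles.loadDataset); load_tu_dataset and its prefix argument are hypothetical:
import numpy as np
import networkx as nx
def load_tu_dataset(prefix):
    # prefix is e.g. 'DD/DD'; node ids in the files are 1-based.
    edges = np.loadtxt(prefix + '_A.txt', delimiter=',', dtype=int)
    graph_ind = np.loadtxt(prefix + '_graph_indicator.txt', dtype=int)
    node_labels = np.loadtxt(prefix + '_node_labels.txt', dtype=int)
    graph_labels = np.loadtxt(prefix + '_graph_labels.txt', dtype=int)
    graphs = [nx.Graph() for _ in range(int(graph_ind.max()))]
    for node_id, gid in enumerate(graph_ind, start=1):
        graphs[gid - 1].add_node(node_id, label=int(node_labels[node_id - 1]))
    for row, col in edges:
        # symmetric (row, col) / (col, row) entries collapse to one undirected edge.
        graphs[graph_ind[row - 1] - 1].add_edge(int(row), int(col))
    return graphs, graph_labels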
=== Description === | |||
D&D is a dataset of 1178 protein structures (Dobson and Doig, 2003). Each protein is | |||
represented by a graph, in which the nodes are amino acids and two nodes are connected | |||
by an edge if they are less than 6 Angstroms apart. The prediction task is to classify | |||
the protein structures into enzymes and non-enzymes. | |||
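To make the construction concrete, a contact graph of this kind could be built from residue coordinates roughly as follows; this is an illustrative sketch only (coords and residues are hypothetical inputs, and the exact distance definition used by Dobson and Doig is not restated here):
import numpy as np
import networkx as nx
def contact_graph(coords, residues, cutoff=6.0):
    # coords: (n, 3) array of residue positions; residues: n amino-acid names.
    g = nx.Graph()
    for i, name in enumerate(residues):
        g.add_node(i, residue=name)
    # pairwise Euclidean distances; connect residues closer than the cutoff.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    for i, j in zip(*np.nonzero(np.triu(dist < cutoff, k=1))):
        g.add_edge(int(i), int(j))
    return g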
=== Previous Use of the Dataset === | |||
Neumann, M., Garnett, R., Bauckhage, C., Kersting, K.: Propagation Kernels: Efficient Graph
Kernels from Propagated Information. Machine Learning 102(2), 209-245 (2016).
Neumann, M., Patricia, N., Garnett, R., Kersting, K.: Efficient Graph Kernels by
Randomization. In: P.A. Flach, T. De Bie, N. Cristianini (eds.) ECML/PKDD, Lecture Notes in
Computer Science, vol. 7523, pp. 378-393. Springer (2012).
Shervashidze, N., Schweitzer, P., van Leeuwen, E., Mehlhorn, K., Borgwardt, K.: | |||
Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12, 2539-2561 (2011) | |||
=== References === | |||
P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without | |||
alignments. J. Mol. Biol., 330(4):771–783, Jul 2003. | |||
@@ -0,0 +1,70 @@ | |||
README for dataset NCI1 | |||
=== Usage === | |||
This folder contains the following comma separated text files | |||
(replace DS by the name of the dataset): | |||
n = total number of nodes | |||
m = total number of edges | |||
N = number of graphs | |||
(1) DS_A.txt (m lines) | |||
sparse (block diagonal) adjacency matrix for all graphs, | |||
each line corresponds to an entry (row, col) of the matrix, i.e. a (node_id, node_id) pair
(2) DS_graph_indicator.txt (n lines) | |||
column vector of graph identifiers for all nodes of all graphs, | |||
the value in the i-th line is the graph_id of the node with node_id i | |||
(3) DS_graph_labels.txt (N lines) | |||
class labels for all graphs in the dataset, | |||
the value in the i-th line is the class label of the graph with graph_id i | |||
(4) DS_node_labels.txt (n lines) | |||
column vector of node labels, | |||
the value in the i-th line corresponds to the node with node_id i | |||
There are OPTIONAL files if the respective information is available: | |||
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
(6) DS_edge_attributes.txt (m lines; same size as DS_A.txt) | |||
attributes for the edges in DS_A.txt | |||
(7) DS_node_attributes.txt (n lines) | |||
matrix of node attributes, | |||
the comma separated values in the i-th line are the attribute vector of the node with node_id i
(8) DS_graph_attributes.txt (N lines) | |||
regression values for all graphs in the dataset, | |||
the value in the i-th line is the attribute of the graph with graph_id i | |||
=== Description === | |||
NCI1 and NCI109 represent two balanced subsets of datasets of chemical compounds screened | |||
for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively
(Wale and Karypis (2006) and http://pubchem.ncbi.nlm.nih.gov). | |||
=== Previous Use of the Dataset === | |||
Neumann, M., Garnett, R., Bauckhage, C., Kersting, K.: Propagation Kernels: Efficient Graph
Kernels from Propagated Information. Machine Learning 102(2), 209-245 (2016).
Neumann, M., Patricia, N., Garnett, R., Kersting, K.: Efficient Graph Kernels by
Randomization. In: P.A. Flach, T. De Bie, N. Cristianini (eds.) ECML/PKDD, Lecture Notes in
Computer Science, vol. 7523, pp. 378-393. Springer (2012).
Shervashidze, N., Schweitzer, P., van Leeuwen, E., Mehlhorn, K., Borgwardt, K.: | |||
Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12, 2539-2561 (2011) | |||
=== References === | |||
N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and | |||
classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006. | |||
@@ -0,0 +1,70 @@ | |||
README for dataset NCI109 | |||
=== Usage === | |||
This folder contains the following comma separated text files | |||
(replace DS by the name of the dataset): | |||
n = total number of nodes | |||
m = total number of edges | |||
N = number of graphs | |||
(1) DS_A.txt (m lines) | |||
sparse (block diagonal) adjacency matrix for all graphs, | |||
each line corresponds to an entry (row, col) of the matrix, i.e. a (node_id, node_id) pair
(2) DS_graph_indicator.txt (n lines) | |||
column vector of graph identifiers for all nodes of all graphs, | |||
the value in the i-th line is the graph_id of the node with node_id i | |||
(3) DS_graph_labels.txt (N lines) | |||
class labels for all graphs in the dataset, | |||
the value in the i-th line is the class label of the graph with graph_id i | |||
(4) DS_node_labels.txt (n lines) | |||
column vector of node labels, | |||
the value in the i-th line corresponds to the node with node_id i | |||
There are OPTIONAL files if the respective information is available: | |||
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
(6) DS_edge_attributes.txt (m lines; same size as DS_A.txt) | |||
attributes for the edges in DS_A.txt | |||
(7) DS_node_attributes.txt (n lines) | |||
matrix of node attributes, | |||
the comma separated values in the i-th line are the attribute vector of the node with node_id i
(8) DS_graph_attributes.txt (N lines) | |||
regression values for all graphs in the dataset, | |||
the value in the i-th line is the attribute of the graph with graph_id i | |||
=== Description === | |||
NCI1 and NCI109 represent two balanced subsets of datasets of chemical compounds screened | |||
for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively
(Wale and Karypis (2006) and http://pubchem.ncbi.nlm.nih.gov). | |||
=== Previous Use of the Dataset === | |||
Neumann, M., Garnett, R., Bauckhage, C., Kersting, K.: Propagation Kernels: Efficient Graph
Kernels from Propagated Information. Machine Learning 102(2), 209-245 (2016).
Neumann, M., Patricia, N., Garnett, R., Kersting, K.: Efficient Graph Kernels by
Randomization. In: P.A. Flach, T. De Bie, N. Cristianini (eds.) ECML/PKDD, Lecture Notes in
Computer Science, vol. 7523, pp. 378-393. Springer (2012).
Shervashidze, N., Schweitzer, P., van Leeuwen, E., Mehlhorn, K., Borgwardt, K.: | |||
Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12, 2539-2561 (2011) | |||
=== References === | |||
N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and | |||
classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006. | |||
@@ -12,21 +12,21 @@ import multiprocessing | |||
from pygraph.kernels.commonWalkKernel import commonwalkkernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
@@ -12,22 +12,22 @@ import multiprocessing | |||
from pygraph.kernels.marginalizedKernel import marginalizedkernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
# {'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -17,22 +17,23 @@ import numpy as np | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -8,14 +8,14 @@ from pygraph.utils.kernels import deltakernel, gaussiankernel, kernelproduct | |||
# datasets | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
@@ -14,22 +14,22 @@ from pygraph.kernels.structuralspKernel import structuralspkernel | |||
from pygraph.utils.kernels import deltakernel, gaussiankernel, kernelproduct | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -14,22 +14,22 @@ from pygraph.kernels.treeletKernel import treeletkernel | |||
from pygraph.utils.kernels import gaussiankernel, polynomialkernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -12,21 +12,21 @@ import multiprocessing | |||
from pygraph.kernels.untilHPathKernel import untilhpathkernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
# {'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# node nsymb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
@@ -14,22 +14,22 @@ from pygraph.kernels.weisfeilerLehmanKernel import weisfeilerlehmankernel | |||
dslist = [ | |||
# {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'task': 'regression'}, # node symb | |||
# {'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
# 'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# # contains single node graph, node symb | |||
# {'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
# {'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
# {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
{'name': 'Alkane', 'dataset': '../datasets/Alkane/dataset.ds', 'task': 'regression', | |||
'dataset_y': '../datasets/Alkane/dataset_boiling_point_names.txt'}, | |||
# contains single node graph, node symb | |||
{'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
'task': 'regression'}, # node symb | |||
{'name': 'MAO', 'dataset': '../datasets/MAO/dataset.ds'}, # node/edge symb | |||
{'name': 'PAH', 'dataset': '../datasets/PAH/dataset.ds'}, # unlabeled | |||
{'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt'}, # node/edge symb | |||
# {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt'}, | |||
# # node nsymb | |||
# {'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# # node symb/nsymb | |||
# {'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
# {'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
# {'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
{'name': 'AIDS', 'dataset': '../datasets/AIDS/AIDS_A.txt'}, # node symb/nsymb, edge symb | |||
{'name': 'ENZYMES', 'dataset': '../datasets/ENZYMES_txt/ENZYMES_A_sparse.txt'}, | |||
# node symb/nsymb | |||
{'name': 'NCI1', 'dataset': '../datasets/NCI1/NCI1_A.txt'}, # node symb | |||
{'name': 'NCI109', 'dataset': '../datasets/NCI109/NCI109_A.txt'}, # node symb | |||
{'name': 'D&D', 'dataset': '../datasets/DD/DD_A.txt'}, # node symb | |||
# | |||
# {'name': 'Mutagenicity', 'dataset': '../datasets/Mutagenicity/Mutagenicity_A.txt'}, | |||
# # node/edge symb | |||
@@ -277,7 +277,8 @@ def gk_iam_nearest(Gn, alpha, idx_gi, Kmatrix, k, r_max): | |||
# return dhat, ghat_list | |||
def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, gkernel): | |||
def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, | |||
gkernel, c_ei=1, c_er=1, c_es=1, epsilon=0.001): | |||
"""This function constructs graph pre-image by the iterative pre-image | |||
framework in reference [1], algorithm 1, where the step of generating new | |||
graphs randomly is replaced by the IAM algorithm in reference [2]. | |||
@@ -312,37 +313,44 @@ def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, g | |||
return 0, g0hat_list | |||
dhat = dis_gs[0] # the nearest distance | |||
ghat_list = [g.copy() for g in g0hat_list] | |||
for g in ghat_list: | |||
draw_Letter_graph(g) | |||
# for g in ghat_list: | |||
# draw_Letter_graph(g) | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
Gk = [Gn_init[ig].copy() for ig in sort_idx[0:k]] # the k nearest neighbors | |||
for gi in Gk: | |||
# nx.draw_networkx(gi) | |||
# plt.show() | |||
draw_Letter_graph(g) | |||
print(gi.nodes(data=True)) | |||
print(gi.edges(data=True)) | |||
# for gi in Gk: | |||
## nx.draw_networkx(gi) | |||
## plt.show() | |||
# draw_Letter_graph(g) | |||
# print(gi.nodes(data=True)) | |||
# print(gi.edges(data=True)) | |||
Gs_nearest = Gk.copy() | |||
# gihat_list = [] | |||
# i = 1 | |||
r = 1 | |||
while r < r_max: | |||
print('r =', r) | |||
# found = False | |||
r = 0 | |||
itr = 0 | |||
# cur_sod = dhat | |||
# old_sod = cur_sod * 2 | |||
sod_list = [dhat] | |||
found = False | |||
nb_updated = 0 | |||
while r < r_max:# and not found: # @todo: if not found?# and np.abs(old_sod - cur_sod) > epsilon: | |||
print('\nr =', r) | |||
print('itr for gk =', itr, '\n') | |||
found = False | |||
# Gs_nearest = Gk + gihat_list | |||
# g_tmp = iam(Gs_nearest) | |||
g_tmp_list = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
Gn_median, Gs_nearest, c_ei=1, c_er=1, c_es=1) | |||
for g in g_tmp_list: | |||
g_tmp_list, _ = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
Gn_median, Gs_nearest, c_ei=c_ei, c_er=c_er, c_es=c_es) | |||
# for g in g_tmp_list: | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
draw_Letter_graph(g) | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
# draw_Letter_graph(g) | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
# compute distance between phi and the new generated graphs. | |||
knew = compute_kernel(g_tmp_list + Gn_median, gkernel, False) | |||
@@ -358,6 +366,7 @@ def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, g | |||
# k_g1_list[1] + alpha[1] * alpha[1] * k_list[1]) | |||
# find the new k nearest graphs. | |||
dnew_best = min(dnew_list) | |||
dis_gs = dnew_list + dis_gs # add the new nearest distances. | |||
Gs_nearest = [g.copy() for g in g_tmp_list] + Gs_nearest # add the corresponding graphs. | |||
sort_idx = np.argsort(dis_gs) | |||
@@ -367,21 +376,34 @@ def gk_iam_nearest_multi(Gn_init, Gn_median, alpha, idx_gi, Kmatrix, k, r_max, g | |||
print(dis_gs[-1]) | |||
Gs_nearest = [Gs_nearest[idx] for idx in sort_idx[0:k]] | |||
nb_best = len(np.argwhere(dis_gs == dis_gs[0]).flatten().tolist()) | |||
if len([i for i in sort_idx[0:nb_best] if i < len(dnew_list)]) > 0: | |||
print('I have smaller or equal distance!') | |||
if dnew_best < dhat and np.abs(dnew_best - dhat) > epsilon: | |||
print('I have smaller distance!') | |||
print(str(dhat) + '->' + str(dis_gs[0])) | |||
dhat = dis_gs[0] | |||
idx_best_list = np.argwhere(dnew_list == dhat).flatten().tolist() | |||
ghat_list = [g_tmp_list[idx].copy() for idx in idx_best_list] | |||
for g in ghat_list: | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
draw_Letter_graph(g) | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
r = 0 | |||
else: | |||
# for g in ghat_list: | |||
## nx.draw_networkx(g) | |||
## plt.show() | |||
# draw_Letter_graph(g) | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
r = 0 | |||
found = True | |||
nb_updated += 1 | |||
elif np.abs(dnew_best - dhat) < epsilon: | |||
print('I have almost equal distance!') | |||
print(str(dhat) + '->' + str(dnew_best)) | |||
if not found: | |||
r += 1 | |||
# old_sod = cur_sod | |||
# cur_sod = dnew_best | |||
sod_list.append(dhat) | |||
itr += 1 | |||
print('\nthe graph is updated', nb_updated, 'times.') | |||
print('sods in kernel space:', sod_list, '\n') | |||
return dhat, ghat_list | |||
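# Control-flow summary of the loop above: each round asks IAM for candidate
# medians built from the current k nearest neighbors; a candidate that is
# more than epsilon closer than dhat updates ghat_list, resets r to 0 and
# increments nb_updated; a candidate within epsilon of dhat is only
# reported; r advances by 1 on rounds without an update, so the search
# stops after r_max consecutive non-improving rounds.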
@@ -9,6 +9,7 @@ Iterative alternate minimizations using GED. | |||
import numpy as np | |||
import random | |||
import networkx as nx | |||
from tqdm import tqdm | |||
import sys | |||
#from Cython_GedLib_2 import librariesImport, script | |||
@@ -181,13 +182,27 @@ def GED(g1, g2, lib='gedlib'): | |||
return dis, pi_forward, pi_backward | |||
def median_distance(Gn, Gn_median, measure='ged', verbose=False): | |||
dis_list = [] | |||
pi_forward_list = [] | |||
for idx, G in tqdm(enumerate(Gn), desc='computing median distances', | |||
file=sys.stdout) if verbose else enumerate(Gn): | |||
dis_sum = 0 | |||
pi_forward_list.append([]) | |||
for G_p in Gn_median: | |||
dis_tmp, pi_tmp_forward, pi_tmp_backward = GED(G, G_p) | |||
pi_forward_list[idx].append(pi_tmp_forward) | |||
dis_sum += dis_tmp | |||
dis_list.append(dis_sum) | |||
return dis_list, pi_forward_list | |||
# --------------------------- These are tests --------------------------------# | |||
def test_iam_with_more_graphs_as_init(Gn, G_candidate, c_ei=3, c_er=3, c_es=1, | |||
node_label='atom', edge_label='bond_type'): | |||
"""See my name, then you know what I do. | |||
""" | |||
from tqdm import tqdm | |||
# Gn = Gn[0:10] | |||
Gn = [nx.convert_node_labels_to_integers(g) for g in Gn] | |||
@@ -321,7 +336,7 @@ def test_iam_with_more_graphs_as_init(Gn, G_candidate, c_ei=3, c_er=3, c_es=1, | |||
def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
Gn_median, Gn_candidate, c_ei=3, c_er=3, c_es=1, node_label='atom', | |||
edge_label='bond_type', connected=True): | |||
edge_label='bond_type', connected=False): | |||
"""See my name, then you know what I do. | |||
""" | |||
from tqdm import tqdm | |||
@@ -330,8 +345,11 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
node_ir = np.inf # corresponding to the node remove and insertion. | |||
label_r = 'thanksdanny' # the label for node remove. # @todo: make this label unrepeatable. | |||
ds_attrs = get_dataset_attributes(Gn_median + Gn_candidate, | |||
attr_names=['edge_labeled', 'node_attr_dim'], | |||
attr_names=['edge_labeled', 'node_attr_dim', 'edge_attr_dim'], | |||
edge_label=edge_label) | |||
ite_max = 50 | |||
epsilon = 0.001 | |||
def generate_graph(G, pi_p_forward, label_set): | |||
@@ -460,13 +478,15 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
g_tmp.remove_edge(nd1, nd2) | |||
# do not change anything when equal. | |||
# find the best graph generated in this iteration and update pi_p. | |||
# # find the best graph generated in this iteration and update pi_p. | |||
# @todo: should we update all graphs generated or just the best ones? | |||
dis_list, pi_forward_list = median_distance(G_new_list, Gn_median) | |||
# @todo: should we remove the identical and connectivity check? | |||
# Don't know which is faster. | |||
G_new_list, idx_list = remove_duplicates(G_new_list) | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
if ds_attrs['node_attr_dim'] == 0 and ds_attrs['edge_attr_dim'] == 0: | |||
G_new_list, idx_list = remove_duplicates(G_new_list) | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
dis_list = [dis_list[idx] for idx in idx_list] | |||
# if connected == True: | |||
# G_new_list, idx_list = remove_disconnected(G_new_list) | |||
# pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
@@ -482,25 +502,10 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
return G_new_list, pi_forward_list | |||
return G_new_list, pi_forward_list, dis_list | |||
def median_distance(Gn, Gn_median, measure='ged', verbose=False): | |||
dis_list = [] | |||
pi_forward_list = [] | |||
for idx, G in tqdm(enumerate(Gn), desc='computing median distances', | |||
file=sys.stdout) if verbose else enumerate(Gn): | |||
dis_sum = 0 | |||
pi_forward_list.append([]) | |||
for G_p in Gn_median: | |||
dis_tmp, pi_tmp_forward, pi_tmp_backward = GED(G, G_p) | |||
pi_forward_list[idx].append(pi_tmp_forward) | |||
dis_sum += dis_tmp | |||
dis_list.append(dis_sum) | |||
return dis_list, pi_forward_list | |||
def best_median_graphs(Gn_candidate, dis_all, pi_all_forward): | |||
def best_median_graphs(Gn_candidate, pi_all_forward, dis_all): | |||
idx_min_list = np.argwhere(dis_all == np.min(dis_all)).flatten().tolist() | |||
dis_min = dis_all[idx_min_list[0]] | |||
pi_forward_min_list = [pi_all_forward[idx] for idx in idx_min_list] | |||
@@ -508,25 +513,45 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
return G_min_list, pi_forward_min_list, dis_min | |||
def iteration_proc(G, pi_p_forward): | |||
def iteration_proc(G, pi_p_forward, cur_sod): | |||
G_list = [G] | |||
pi_forward_list = [pi_p_forward] | |||
old_sod = cur_sod * 2 | |||
sod_list = [cur_sod] | |||
# iterations. | |||
for itr in range(0, 5): # @todo: the convergence condition? | |||
# print('itr is', itr) | |||
itr = 0 | |||
while itr < ite_max and np.abs(old_sod - cur_sod) > epsilon: | |||
# for itr in range(0, 5): # the convergence condition? | |||
print('itr is', itr) | |||
G_new_list = [] | |||
pi_forward_new_list = [] | |||
dis_new_list = [] | |||
for idx, G in enumerate(G_list): | |||
label_set = get_node_labels(Gn_median + [G], node_label) | |||
G_tmp_list, pi_forward_tmp_list = generate_graph( | |||
G_tmp_list, pi_forward_tmp_list, dis_tmp_list = generate_graph( | |||
G, pi_forward_list[idx], label_set) | |||
G_new_list += G_tmp_list | |||
pi_forward_new_list += pi_forward_tmp_list | |||
dis_new_list += dis_tmp_list | |||
G_list = G_new_list[:] | |||
pi_forward_list = pi_forward_new_list[:] | |||
dis_list = dis_new_list[:] | |||
old_sod = cur_sod | |||
cur_sod = np.min(dis_list) | |||
sod_list.append(cur_sod) | |||
itr += 1 | |||
G_list, idx_list = remove_duplicates(G_list) | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
# @todo: do we return all graphs or the best ones? | |||
# get the best ones of the generated graphs. | |||
G_list, pi_forward_list, dis_min = best_median_graphs( | |||
G_list, pi_forward_list, dis_list) | |||
if ds_attrs['node_attr_dim'] == 0 and ds_attrs['edge_attr_dim'] == 0: | |||
G_list, idx_list = remove_duplicates(G_list) | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
# dis_list = [dis_list[idx] for idx in idx_list] | |||
# import matplotlib.pyplot as plt | |||
# for g in G_list: | |||
@@ -535,7 +560,9 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
return G_list, pi_forward_list # do we return all graphs or the best ones? | |||
print('\nsods:', sod_list, '\n') | |||
return G_list, pi_forward_list, dis_min | |||
def remove_duplicates(Gn): | |||
@@ -570,28 +597,37 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
# phase 1: initilize. | |||
# compute set-median. | |||
dis_min = np.inf | |||
dis_all, pi_all_forward = median_distance(Gn_candidate, Gn_median) | |||
dis_list, pi_forward_all = median_distance(Gn_candidate, Gn_median) | |||
# find all smallest distances. | |||
idx_min_list = np.argwhere(dis_all == np.min(dis_all)).flatten().tolist() | |||
dis_min = dis_all[idx_min_list[0]] | |||
idx_min_list = np.argwhere(dis_list == np.min(dis_list)).flatten().tolist() | |||
dis_min = dis_list[idx_min_list[0]] | |||
# phase 2: iteration. | |||
G_list = [] | |||
for idx_min in idx_min_list[::-1]: | |||
dis_list = [] | |||
pi_forward_list = [] | |||
for idx_min in idx_min_list: | |||
# print('idx_min is', idx_min) | |||
G = Gn_candidate[idx_min].copy() | |||
# list of edit operations. | |||
pi_p_forward = pi_all_forward[idx_min] | |||
pi_p_forward = pi_forward_all[idx_min] | |||
# pi_p_backward = pi_all_backward[idx_min] | |||
Gi_list, pi_i_forward_list = iteration_proc(G, pi_p_forward) | |||
Gi_list, pi_i_forward_list, dis_i_min = iteration_proc(G, pi_p_forward, dis_min) | |||
G_list += Gi_list | |||
dis_list.append(dis_i_min) | |||
pi_forward_list += pi_i_forward_list | |||
G_list, _ = remove_duplicates(G_list) | |||
if ds_attrs['node_attr_dim'] == 0 and ds_attrs['edge_attr_dim'] == 0: | |||
G_list, idx_list = remove_duplicates(G_list) | |||
dis_list = [dis_list[idx] for idx in idx_list] | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
if connected == True: | |||
G_list_con, _ = remove_disconnected(G_list) | |||
# if there are no connected graphs at all, keep the disconnected ones.
if len(G_list_con) > 0: # @todo: ?????????????????????????? | |||
G_list = G_list_con | |||
G_list_con, idx_list = remove_disconnected(G_list) | |||
# if there are no connected graphs at all, keep the disconnected ones.
if len(G_list_con) > 0: # @todo: ?????????????????????????? | |||
G_list = G_list_con | |||
dis_list = [dis_list[idx] for idx in idx_list] | |||
pi_forward_list = [pi_forward_list[idx] for idx in idx_list] | |||
# import matplotlib.pyplot as plt | |||
# for g in G_list: | |||
@@ -601,15 +637,15 @@ def test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
# print(g.edges(data=True)) | |||
# get the best median graphs | |||
dis_all, pi_all_forward = median_distance(G_list, Gn_median) | |||
# dis_list, pi_forward_list = median_distance(G_list, Gn_median) | |||
G_min_list, pi_forward_min_list, dis_min = best_median_graphs( | |||
G_list, dis_all, pi_all_forward) | |||
G_list, pi_forward_list, dis_list) | |||
# for g in G_min_list: | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
# print(g.nodes(data=True)) | |||
# print(g.edges(data=True)) | |||
return G_min_list | |||
return G_min_list, dis_min | |||
if __name__ == '__main__': | |||
@@ -0,0 +1,218 @@ | |||
import sys | |||
sys.path.insert(0, "../") | |||
#import pathlib | |||
import numpy as np | |||
import networkx as nx | |||
import time | |||
#import librariesImport | |||
#import script | |||
#sys.path.insert(0, "/home/bgauzere/dev/optim-graphes/") | |||
#import pygraph | |||
from pygraph.utils.graphfiles import loadDataset | |||
def replace_graph_in_env(script, graph, old_id, label='median'): | |||
""" | |||
Replace a graph in the script environment.
If old_id is -1, add a new graph to the environment.
""" | |||
if(old_id > -1): | |||
script.PyClearGraph(old_id) | |||
new_id = script.PyAddGraph(label) | |||
for i in graph.nodes(): | |||
script.PyAddNode(new_id, str(i), graph.node[i]) # !! strings are required by gedlib
for e in graph.edges: | |||
script.PyAddEdge(new_id, str(e[0]),str(e[1]), {}) | |||
script.PyInitEnv() | |||
script.PySetMethod("IPFP", "") | |||
script.PyInitMethod() | |||
return new_id | |||
# Draw the current median
def draw_Letter_graph(graph, savepath=''): | |||
import numpy as np | |||
import networkx as nx | |||
import matplotlib.pyplot as plt | |||
plt.figure() | |||
pos = {} | |||
for n in graph.nodes: | |||
pos[n] = np.array([float(graph.node[n]['attributes'][0]), | |||
float(graph.node[n]['attributes'][1])]) | |||
nx.draw_networkx(graph, pos) | |||
if savepath != '': | |||
plt.savefig(savepath + str(time.time()) + '.eps', format='eps', dpi=300) | |||
plt.show() | |||
plt.clf() | |||
#compute new mappings | |||
def update_mappings(script,median_id,listID): | |||
med_distances = {} | |||
med_mappings = {} | |||
sod = 0 | |||
for i in range(0,len(listID)): | |||
script.PyRunMethod(median_id,listID[i]) | |||
med_distances[i] = script.PyGetUpperBound(median_id,listID[i]) | |||
med_mappings[i] = script.PyGetForwardMap(median_id,listID[i]) | |||
sod += med_distances[i] | |||
return med_distances, med_mappings, sod | |||
def calcul_Sij(all_mappings, all_graphs,i,j): | |||
s_ij = 0 | |||
for k in range(0,len(all_mappings)): | |||
cur_graph = all_graphs[k] | |||
cur_mapping = all_mappings[k] | |||
size_graph = cur_graph.order() | |||
if ((cur_mapping[i] < size_graph) and | |||
(cur_mapping[j] < size_graph) and | |||
(cur_graph.has_edge(cur_mapping[i], cur_mapping[j]) == True)): | |||
s_ij += 1 | |||
return s_ij | |||
# def update_median_nodes_L1(median,listIdSet,median_id,dataset, mappings): | |||
# from scipy.stats.mstats import gmean | |||
# for i in median.nodes(): | |||
# for k in listIdSet: | |||
# vectors = [] #np.zeros((len(listIdSet),2)) | |||
# if(k != median_id): | |||
# phi_i = mappings[k][i] | |||
# if(phi_i < dataset[k].order()): | |||
# vectors.append([float(dataset[k].node[phi_i]['x']),float(dataset[k].node[phi_i]['y'])]) | |||
# new_labels = gmean(vectors) | |||
# median.node[i]['x'] = str(new_labels[0]) | |||
# median.node[i]['y'] = str(new_labels[1]) | |||
# return median | |||
def update_median_nodes(median,dataset,mappings): | |||
#update node attributes | |||
for i in median.nodes(): | |||
nb_sub=0 | |||
mean_label = {'x' : 0, 'y' : 0} | |||
for k in range(0,len(mappings)): | |||
phi_i = mappings[k][i] | |||
if ( phi_i < dataset[k].order() ): | |||
nb_sub += 1 | |||
mean_label['x'] += 0.75*float(dataset[k].node[phi_i]['x']) | |||
mean_label['y'] += 0.75*float(dataset[k].node[phi_i]['y']) | |||
median.node[i]['x'] = str((1/0.75)*(mean_label['x']/nb_sub)) | |||
median.node[i]['y'] = str((1/0.75)*(mean_label['y']/nb_sub)) | |||
return median | |||
def update_median_edges(dataset, mappings, median, cei=0.425,cer=0.425): | |||
# for Letter-high, c_ei = c_er = 1.7, alpha = 0.75
size_dataset = len(dataset) | |||
ratio_cei_cer = cer/(cei + cer) | |||
threshold = size_dataset*ratio_cei_cer | |||
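# note: with the default cei = cer = 0.425, ratio_cei_cer is 0.5 and the
# threshold is half the dataset size, so an edge (i, j) is kept exactly
# when a majority of the mapped graphs contain it.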
order_graph_median = median.order() | |||
for i in range(0,order_graph_median): | |||
for j in range(i+1,order_graph_median): | |||
s_ij = calcul_Sij(mappings,dataset,i,j) | |||
if(s_ij > threshold): | |||
median.add_edge(i,j) | |||
else: | |||
if(median.has_edge(i,j)): | |||
median.remove_edge(i,j) | |||
return median | |||
def compute_median(script, listID, dataset,verbose=False): | |||
"""Compute a graph median of a dataset according to an environment | |||
Parameters | |||
script : a gedlib-initialized environment
listID (list): a list of ID in script: encodes the dataset | |||
dataset (list): corresponding graphs in networkX format. We assume that graph | |||
listID[i] corresponds to dataset[i] | |||
Returns:
The median graph (networkX), its SOD, the list of per-iteration SODs, and the set-median
""" | |||
print(len(listID)) | |||
median_set_index, median_set_sod = compute_median_set(script, listID) | |||
print(median_set_index) | |||
print(median_set_sod) | |||
sods = [] | |||
# Add the median to the environment
set_median = dataset[median_set_index].copy() | |||
median = dataset[median_set_index].copy() | |||
cur_med_id = replace_graph_in_env(script,median,-1) | |||
med_distances, med_mappings, cur_sod = update_mappings(script,cur_med_id,listID) | |||
sods.append(cur_sod) | |||
if(verbose): | |||
print(cur_sod) | |||
ite_max = 50 | |||
old_sod = cur_sod * 2 | |||
ite = 0 | |||
epsilon = 0.001 | |||
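# iterate until the SOD change between consecutive rounds drops below
# epsilon, or ite_max iterations are reached.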
while((ite < ite_max) and (np.abs(old_sod - cur_sod) > epsilon )): | |||
median = update_median_nodes(median,dataset, med_mappings) | |||
median = update_median_edges(dataset,med_mappings,median) | |||
cur_med_id = replace_graph_in_env(script,median,cur_med_id) | |||
med_distances, med_mappings, cur_sod = update_mappings(script,cur_med_id,listID) | |||
sods.append(cur_sod) | |||
if(verbose): | |||
print(cur_sod) | |||
ite += 1 | |||
return median, cur_sod, sods, set_median | |||
# draw_Letter_graph(median)
def compute_median_set(script,listID): | |||
'Return the index of the set median in listID, together with its SOD.'
# Compute the median set
N=len(listID) | |||
map_id_to_index = {} | |||
map_index_to_id = {} | |||
for i in range(0,len(listID)): | |||
map_id_to_index[listID[i]] = i | |||
map_index_to_id[i] = listID[i] | |||
distances = np.zeros((N,N)) | |||
for i in listID: | |||
for j in listID: | |||
script.PyRunMethod(i,j) | |||
distances[map_id_to_index[i],map_id_to_index[j]] = script.PyGetUpperBound(i,j) | |||
median_set_index = np.argmin(np.sum(distances,0)) | |||
sod = np.min(np.sum(distances,0)) | |||
return median_set_index, sod | |||
#if __name__ == "__main__": | |||
# # Load the dataset
# script.PyLoadGXLGraph('/home/bgauzere/dev/gedlib/data/datasets/Letter/HIGH/', '/home/bgauzere/dev/gedlib/data/collections/Letter_Z.xml') | |||
# script.PySetEditCost("LETTER") | |||
# script.PyInitEnv() | |||
# script.PySetMethod("IPFP", "") | |||
# script.PyInitMethod() | |||
# | |||
# dataset,my_y = pygraph.utils.graphfiles.loadDataset("/home/bgauzere/dev/gedlib/data/datasets/Letter/HIGH/Letter_Z.cxl") | |||
# | |||
# listID = script.PyGetAllGraphIds() | |||
# median, sod = compute_median(script,listID,dataset,verbose=True) | |||
# | |||
# print(sod) | |||
# draw_Letter_graph(median) | |||
if __name__ == '__main__': | |||
# test draw_Letter_graph | |||
ds = {'name': 'Letter-high', 'dataset': '../datasets/Letter-high/Letter-high_A.txt', | |||
'extra_params': {}} # node nsymb | |||
Gn, y_all = loadDataset(ds['dataset'], extra_params=ds['extra_params']) | |||
print(y_all) | |||
for g in Gn: | |||
draw_Letter_graph(g) |
@@ -0,0 +1,423 @@ | |||
#!/usr/bin/env python3 | |||
# -*- coding: utf-8 -*- | |||
""" | |||
Created on Thu Jul 4 12:20:16 2019 | |||
@author: ljia | |||
""" | |||
import numpy as np | |||
import networkx as nx | |||
import matplotlib.pyplot as plt | |||
import time | |||
from tqdm import tqdm | |||
import sys | |||
sys.path.insert(0, "../") | |||
from pygraph.utils.graphfiles import loadDataset | |||
from median import draw_Letter_graph | |||
# --------------------------- These are tests --------------------------------# | |||
def test_who_is_the_closest_in_kernel_space(Gn): | |||
idx_gi = [0, 6] | |||
g1 = Gn[idx_gi[0]] | |||
g2 = Gn[idx_gi[1]] | |||
# create the "median" graph. | |||
gnew = g2.copy() | |||
gnew.remove_node(0) | |||
nx.draw_networkx(gnew) | |||
plt.show() | |||
print(gnew.nodes(data=True)) | |||
Gn = [gnew] + Gn | |||
# compute gram matrix | |||
Kmatrix = compute_kernel(Gn, 'untilhpathkernel', True) | |||
# the distance matrix | |||
dmatrix = gram2distances(Kmatrix) | |||
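# gram2distances presumably applies the kernel-induced metric
# d(g, g')^2 = k(g, g) + k(g', g') - 2 * k(g, g').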
print(np.sort(dmatrix[idx_gi[0] + 1])) | |||
print(np.argsort(dmatrix[idx_gi[0] + 1])) | |||
print(np.sort(dmatrix[idx_gi[1] + 1])) | |||
print(np.argsort(dmatrix[idx_gi[1] + 1])) | |||
# for all g in Gn, compute (d(g1, g) + d(g2, g)) / 2 | |||
dis_median = [(dmatrix[i, idx_gi[0] + 1] + dmatrix[i, idx_gi[1] + 1]) / 2 for i in range(len(Gn))] | |||
print(np.sort(dis_median)) | |||
print(np.argsort(dis_median)) | |||
return | |||
def test_who_is_the_closest_in_GED_space(Gn): | |||
from iam import GED | |||
idx_gi = [0, 6] | |||
g1 = Gn[idx_gi[0]] | |||
g2 = Gn[idx_gi[1]] | |||
# create the "median" graph. | |||
gnew = g2.copy() | |||
gnew.remove_node(0) | |||
nx.draw_networkx(gnew) | |||
plt.show() | |||
print(gnew.nodes(data=True)) | |||
Gn = [gnew] + Gn | |||
# compute GEDs | |||
ged_matrix = np.zeros((len(Gn), len(Gn))) | |||
for i1 in tqdm(range(len(Gn)), desc='computing GEDs', file=sys.stdout): | |||
for i2 in range(len(Gn)): | |||
dis, _, _ = GED(Gn[i1], Gn[i2], lib='gedlib') | |||
ged_matrix[i1, i2] = dis | |||
print(np.sort(ged_matrix[idx_gi[0] + 1])) | |||
print(np.argsort(ged_matrix[idx_gi[0] + 1])) | |||
print(np.sort(ged_matrix[idx_gi[1] + 1])) | |||
print(np.argsort(ged_matrix[idx_gi[1] + 1])) | |||
# for all g in Gn, compute (GED(g1, g) + GED(g2, g)) / 2 | |||
dis_median = [(ged_matrix[i, idx_gi[0] + 1] + ged_matrix[i, idx_gi[1] + 1]) / 2 for i in range(len(Gn))] | |||
print(np.sort(dis_median)) | |||
print(np.argsort(dis_median)) | |||
return | |||
def test_will_IAM_give_the_median_graph_we_wanted(Gn): | |||
idx_gi = [0, 6] | |||
g1 = Gn[idx_gi[0]].copy() | |||
g2 = Gn[idx_gi[1]].copy() | |||
# del Gn[idx_gi[0]] | |||
# del Gn[idx_gi[1] - 1] | |||
g_median = test_iam_with_more_graphs_as_init([g1, g2], [g1, g2], c_ei=1, c_er=1, c_es=1) | |||
# g_median = test_iam_with_more_graphs_as_init(Gn, Gn, c_ei=1, c_er=1, c_es=1) | |||
nx.draw_networkx(g_median) | |||
plt.show() | |||
print(g_median.nodes(data=True)) | |||
print(g_median.edges(data=True)) | |||
def test_new_IAM_allGraph_deleteNodes(Gn): | |||
idx_gi = [0, 6] | |||
# g1 = Gn[idx_gi[0]].copy() | |||
# g2 = Gn[idx_gi[1]].copy() | |||
# g1 = nx.Graph(name='haha') | |||
# g1.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'}), (2, {'atom': 'C'})]) | |||
# g1.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'})]) | |||
# g2 = nx.Graph(name='hahaha') | |||
# g2.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'}), (2, {'atom': 'C'}), | |||
# (3, {'atom': 'O'}), (4, {'atom': 'C'})]) | |||
# g2.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'}), | |||
# (2, 3, {'bond_type': '1'}), (3, 4, {'bond_type': '1'})]) | |||
g1 = nx.Graph(name='haha') | |||
g1.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'C'}), (2, {'atom': 'C'}), | |||
(3, {'atom': 'S'}), (4, {'atom': 'S'})]) | |||
g1.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'}), | |||
(2, 3, {'bond_type': '1'}), (2, 4, {'bond_type': '1'})]) | |||
g2 = nx.Graph(name='hahaha') | |||
g2.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'C'}), (2, {'atom': 'C'}), | |||
(3, {'atom': 'O'}), (4, {'atom': 'O'})]) | |||
g2.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'}), | |||
(2, 3, {'bond_type': '1'}), (2, 4, {'bond_type': '1'})]) | |||
# g2 = g1.copy() | |||
# g2.add_nodes_from([(3, {'atom': 'O'})]) | |||
# g2.add_nodes_from([(4, {'atom': 'C'})]) | |||
# g2.add_edges_from([(1, 3, {'bond_type': '1'})]) | |||
# g2.add_edges_from([(3, 4, {'bond_type': '1'})]) | |||
# del Gn[idx_gi[0]] | |||
# del Gn[idx_gi[1] - 1] | |||
nx.draw_networkx(g1) | |||
plt.show() | |||
print(g1.nodes(data=True)) | |||
print(g1.edges(data=True)) | |||
nx.draw_networkx(g2) | |||
plt.show() | |||
print(g2.nodes(data=True)) | |||
print(g2.edges(data=True)) | |||
g_median = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations([g1, g2], [g1, g2], c_ei=1, c_er=1, c_es=1) | |||
# g_median = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations(Gn, Gn, c_ei=1, c_er=1, c_es=1) | |||
nx.draw_networkx(g_median) | |||
plt.show() | |||
print(g_median.nodes(data=True)) | |||
print(g_median.edges(data=True)) | |||
def test_the_simple_two(Gn, gkernel): | |||
from gk_iam import gk_iam_nearest_multi, compute_kernel | |||
lmbda = 0.03 # termination probability
r_max = 10 # recursions | |||
l = 500 | |||
alpha_range = np.linspace(0.5, 0.5, 1) | |||
k = 2 # k nearest neighbors | |||
# randomly select two molecules | |||
np.random.seed(1) | |||
idx_gi = [0, 6] # np.random.randint(0, len(Gn), 2) | |||
g1 = Gn[idx_gi[0]] | |||
g2 = Gn[idx_gi[1]] | |||
Gn_mix = [g.copy() for g in Gn] | |||
Gn_mix.append(g1.copy()) | |||
Gn_mix.append(g2.copy()) | |||
# g_tmp = iam([g1, g2]) | |||
# nx.draw_networkx(g_tmp) | |||
# plt.show() | |||
# compute | |||
# k_list = [] # kernel between each graph and itself. | |||
# k_g1_list = [] # kernel between each graph and g1 | |||
# k_g2_list = [] # kernel between each graph and g2 | |||
# for ig, g in tqdm(enumerate(Gn), desc='computing self kernels', file=sys.stdout): | |||
# ktemp = compute_kernel([g, g1, g2], 'marginalizedkernel', False) | |||
# k_list.append(ktemp[0][0, 0]) | |||
# k_g1_list.append(ktemp[0][0, 1]) | |||
# k_g2_list.append(ktemp[0][0, 2]) | |||
km = compute_kernel(Gn_mix, gkernel, True) | |||
# k_list = np.diag(km) # kernel between each graph and itself. | |||
# k_g1_list = km[idx_gi[0]] # kernel between each graph and g1 | |||
# k_g2_list = km[idx_gi[1]] # kernel between each graph and g2 | |||
g_best = [] | |||
dis_best = [] | |||
# for each alpha | |||
for alpha in alpha_range: | |||
print('alpha =', alpha) | |||
dhat, ghat_list = gk_iam_nearest_multi(Gn, [g1, g2], [alpha, 1 - alpha], | |||
range(len(Gn), len(Gn) + 2), km, | |||
k, r_max, gkernel)
dis_best.append(dhat) | |||
g_best.append(ghat_list) | |||
for idx, item in enumerate(alpha_range): | |||
print('when alpha is', item, 'the shortest distance is', dis_best[idx]) | |||
print('the corresponding pre-images are') | |||
for g in g_best[idx]: | |||
nx.draw_networkx(g) | |||
plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
def test_remove_bests(Gn, gkernel): | |||
from gk_iam import gk_iam_nearest_multi, compute_kernel | |||
lmbda = 0.03 # termination probability
r_max = 10 # recursions | |||
l = 500 | |||
alpha_range = np.linspace(0.5, 0.5, 1) | |||
k = 20 # k nearest neighbors | |||
# randomly select two molecules | |||
np.random.seed(1) | |||
idx_gi = [0, 6] # np.random.randint(0, len(Gn), 2) | |||
g1 = Gn[idx_gi[0]] | |||
g2 = Gn[idx_gi[1]] | |||
# remove the best 2 graphs. | |||
del Gn[idx_gi[0]] | |||
del Gn[idx_gi[1] - 1] | |||
# del Gn[8] | |||
Gn_mix = [g.copy() for g in Gn] | |||
Gn_mix.append(g1.copy()) | |||
Gn_mix.append(g2.copy()) | |||
# compute | |||
km = compute_kernel(Gn_mix, gkernel, True) | |||
g_best = [] | |||
dis_best = [] | |||
# for each alpha | |||
for alpha in alpha_range: | |||
print('alpha =', alpha) | |||
dhat, ghat_list = gk_iam_nearest_multi(Gn, [g1, g2], [alpha, 1 - alpha], | |||
range(len(Gn), len(Gn) + 2), km, | |||
k, r_max, gkernel) | |||
dis_best.append(dhat) | |||
g_best.append(ghat_list) | |||
for idx, item in enumerate(alpha_range): | |||
print('when alpha is', item, 'the shortest distance is', dis_best[idx]) | |||
print('the corresponding pre-images are') | |||
for g in g_best[idx]: | |||
draw_Letter_graph(g) | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
def test_gkiam_letter_h(): | |||
from gk_iam import gk_iam_nearest_multi, compute_kernel | |||
from iam import median_distance | |||
ds = {'name': 'Letter-high', 'dataset': '../datasets/Letter-high/Letter-high_A.txt', | |||
'extra_params': {}} # node nsymb | |||
# ds = {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt', | |||
# 'extra_params': {}} # node nsymb | |||
Gn, y_all = loadDataset(ds['dataset'], extra_params=ds['extra_params']) | |||
gkernel = 'structuralspkernel' | |||
    lmbda = 0.03 # termination probability
r_max = 3 # recursions | |||
# alpha_range = np.linspace(0.5, 0.5, 1) | |||
k = 10 # k nearest neighbors | |||
# classify graphs according to letters. | |||
idx_dict = get_same_item_indices(y_all) | |||
time_list = [] | |||
sod_list = [] | |||
sod_min_list = [] | |||
for letter in idx_dict: | |||
print('\n-------------------------------------------------------\n') | |||
Gn_let = [Gn[i].copy() for i in idx_dict[letter]] | |||
Gn_mix = Gn_let + [g.copy() for g in Gn_let] | |||
alpha_range = np.linspace(1 / len(Gn_let), 1 / len(Gn_let), 1) | |||
# compute | |||
time0 = time.time() | |||
km = compute_kernel(Gn_mix, gkernel, True) | |||
g_best = [] | |||
dis_best = [] | |||
# for each alpha | |||
for alpha in alpha_range: | |||
print('alpha =', alpha) | |||
dhat, ghat_list = gk_iam_nearest_multi(Gn_let, Gn_let, [alpha] * len(Gn_let), | |||
range(len(Gn_let), len(Gn_mix)), km, | |||
k, r_max, gkernel, c_ei=1.7, | |||
c_er=1.7, c_es=1.7) | |||
dis_best.append(dhat) | |||
g_best.append(ghat_list) | |||
time_list.append(time.time() - time0) | |||
# show best graphs and save them to file. | |||
for idx, item in enumerate(alpha_range): | |||
print('when alpha is', item, 'the shortest distance is', dis_best[idx]) | |||
print('the corresponding pre-images are') | |||
for g in g_best[idx]: | |||
draw_Letter_graph(g, savepath='results/gk_iam/') | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
# compute the corresponding sod in graph space. (alpha range not considered.) | |||
sod_tmp, _ = median_distance(g_best[0], Gn_let) | |||
sod_list.append(sod_tmp) | |||
sod_min_list.append(np.min(sod_tmp)) | |||
print('\nsods in graph space: ', sod_list) | |||
print('\nsmallest sod in graph space for each letter: ', sod_min_list) | |||
print('\ntimes:', time_list) | |||
def get_same_item_indices(ls): | |||
"""Get the indices of the same items in a list. Return a dict keyed by items. | |||
""" | |||
idx_dict = {} | |||
for idx, item in enumerate(ls): | |||
if item in idx_dict: | |||
idx_dict[item].append(idx) | |||
else: | |||
idx_dict[item] = [idx] | |||
return idx_dict | |||
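# A minimal doctest-style sketch of get_same_item_indices (labels purely
# illustrative):
#   >>> get_same_item_indices(['a', 'i', 'a', 'i', 'i'])
#   {'a': [0, 2], 'i': [1, 3, 4]}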
#def compute_letter_median_by_average(Gn): | |||
# return g_median | |||
def test_iam_letter_h(): | |||
from iam import test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations | |||
from gk_iam import dis_gstar, compute_kernel | |||
ds = {'name': 'Letter-high', 'dataset': '../datasets/Letter-high/Letter-high_A.txt', | |||
'extra_params': {}} # node nsymb | |||
# ds = {'name': 'Letter-med', 'dataset': '../datasets/Letter-med/Letter-med_A.txt', | |||
# 'extra_params': {}} # node nsymb | |||
Gn, y_all = loadDataset(ds['dataset'], extra_params=ds['extra_params']) | |||
    lmbda = 0.03 # termination probability
# alpha_range = np.linspace(0.5, 0.5, 1) | |||
# classify graphs according to letters. | |||
idx_dict = get_same_item_indices(y_all) | |||
time_list = [] | |||
sod_list = [] | |||
sod_min_list = [] | |||
for letter in idx_dict: | |||
Gn_let = [Gn[i].copy() for i in idx_dict[letter]] | |||
alpha_range = np.linspace(1 / len(Gn_let), 1 / len(Gn_let), 1) | |||
# compute | |||
g_best = [] | |||
dis_best = [] | |||
time0 = time.time() | |||
# for each alpha | |||
for alpha in alpha_range: | |||
print('alpha =', alpha) | |||
ghat_list, dhat = test_iam_moreGraphsAsInit_tryAllPossibleBestGraphs_deleteNodesInIterations( | |||
Gn_let, Gn_let, c_ei=1.7, c_er=1.7, c_es=1.7) | |||
dis_best.append(dhat) | |||
g_best.append(ghat_list) | |||
time_list.append(time.time() - time0) | |||
# show best graphs and save them to file. | |||
for idx, item in enumerate(alpha_range): | |||
print('when alpha is', item, 'the shortest distance is', dis_best[idx]) | |||
print('the corresponding pre-images are') | |||
for g in g_best[idx]: | |||
draw_Letter_graph(g, savepath='results/iam/') | |||
# nx.draw_networkx(g) | |||
# plt.show() | |||
print(g.nodes(data=True)) | |||
print(g.edges(data=True)) | |||
# compute the corresponding sod in kernel space. (alpha range not considered.) | |||
gkernel = 'structuralspkernel' | |||
sod_tmp = [] | |||
Gn_mix = g_best[0] + Gn_let | |||
km = compute_kernel(Gn_mix, gkernel, True) | |||
for ig, g in tqdm(enumerate(g_best[0]), desc='computing kernel sod', file=sys.stdout): | |||
dtemp = dis_gstar(ig, range(len(g_best[0]), len(Gn_mix)), | |||
[alpha_range[0]] * len(Gn_let), km, withterm3=False) | |||
sod_tmp.append(dtemp) | |||
sod_list.append(sod_tmp) | |||
sod_min_list.append(np.min(sod_tmp)) | |||
print('\nsods in kernel space: ', sod_list) | |||
print('\nsmallest sod in kernel space for each letter: ', sod_min_list) | |||
print('\ntimes:', time_list) | |||
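# A minimal sketch (not this repo's dis_gstar) of the squared kernel-space
# distance that dis_gstar above presumably evaluates, following the standard
# pre-image formulation; the withterm3 behavior is an assumption:
#     d(g, g_bar)^2 = k(g, g) - 2 * sum_i alpha_i * k(g, g_i)
#                     + sum_{i,j} alpha_i * alpha_j * k(g_i, g_j),
# where the last term is constant over candidates and may be dropped when
# only the ranking of candidates matters.
def dis_gstar_sketch(idx, idx_range, alphas, km, withterm3=True):
    alphas = np.asarray(alphas)
    idx_range = list(idx_range)
    term1 = km[idx, idx]  # k(g, g)
    term2 = 2 * alphas @ km[idx, idx_range]  # 2 * sum_i alpha_i * k(g, g_i)
    term3 = alphas @ km[np.ix_(idx_range, idx_range)] @ alphas if withterm3 else 0
    return np.sqrt(max(term1 - term2 + term3, 0))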
if __name__ == '__main__': | |||
# ds = {'name': 'MUTAG', 'dataset': '../datasets/MUTAG/MUTAG_A.txt', | |||
# 'extra_params': {}} # node/edge symb | |||
ds = {'name': 'Letter-high', 'dataset': '../datasets/Letter-high/Letter-high_A.txt', | |||
'extra_params': {}} # node nsymb | |||
# ds = {'name': 'Acyclic', 'dataset': '../datasets/monoterpenoides/trainset_9.ds', | |||
# 'extra_params': {}} | |||
# ds = {'name': 'Acyclic', 'dataset': '../datasets/acyclic/dataset_bps.ds', | |||
# 'extra_params': {}} # node symb | |||
Gn, y_all = loadDataset(ds['dataset'], extra_params=ds['extra_params']) | |||
# Gn = Gn[0:20] | |||
# import networkx.algorithms.isomorphism as iso | |||
# G1 = nx.MultiDiGraph() | |||
# G2 = nx.MultiDiGraph() | |||
# G1.add_nodes_from([1,2,3], fill='red') | |||
# G2.add_nodes_from([10,20,30,40], fill='red') | |||
# nx.add_path(G1, [1,2,3,4], weight=3, linewidth=2.5) | |||
# nx.add_path(G2, [10,20,30,40], weight=3) | |||
# nm = iso.categorical_node_match('fill', 'red') | |||
# print(nx.is_isomorphic(G1, G2, node_match=nm)) | |||
# | |||
# test_new_IAM_allGraph_deleteNodes(Gn) | |||
# test_will_IAM_give_the_median_graph_we_wanted(Gn) | |||
# test_who_is_the_closest_in_GED_space(Gn) | |||
# test_who_is_the_closest_in_kernel_space(Gn) | |||
# test_the_simple_two(Gn, 'untilhpathkernel') | |||
# test_remove_bests(Gn, 'untilhpathkernel') | |||
test_gkiam_letter_h() | |||
# test_iam_letter_h() |
@@ -23,7 +23,7 @@ from pygraph.utils.parallel import parallel_gm | |||
def commonwalkkernel(*args, | |||
node_label='atom', | |||
edge_label='bond_type', | |||
n=None, | |||
# n=None, | |||
weight=1, | |||
compute_method=None, | |||
n_jobs=None, | |||
@@ -35,26 +35,28 @@ def commonwalkkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as symbolic label. The default node label is 'atom'. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
n : integer | |||
Longest length of walks. Only useful when applying the 'brute' method. | |||
Edge attribute used as symbolic label. The default edge label is 'bond_type'. | |||
# n : integer | |||
# Longest length of walks. Only useful when applying the 'brute' method. | |||
    weight: integer
        Weight coefficient of different lengths of walks, which represents beta
        in the 'exp' method and gamma in the 'geo' method.
compute_method : string | |||
        Method used to compute the walk kernel. The following choices are
        available:
'exp' : exponential serial method applied on the direct product graph, | |||
as shown in reference [1]. The time complexity is O(n^6) for graphs | |||
with n vertices. | |||
'geo' : geometric serial method applied on the direct product graph, as | |||
shown in reference [1]. The time complexity is O(n^6) for graphs with n | |||
vertices. | |||
'brute' : brute force, simply search for all walks and compare them. | |||
        'exp': method based on the exponential series applied on the direct
        product graph, as shown in reference [1]. The time complexity is O(n^6)
        for graphs with n vertices.
        'geo': method based on the geometric series applied on the direct
        product graph, as shown in reference [1]. The time complexity is O(n^6)
        for graphs with n vertices.
# 'brute': brute force, simply search for all walks and compare them. | |||
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
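For orientation, a hedged call sketch using only the parameters documented
above; the graph list Gn, the weight value and the return shape are
illustrative assumptions:

out = commonwalkkernel(Gn, node_label='atom', edge_label='bond_type',
                       weight=0.01, compute_method='exp', n_jobs=1)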
@@ -44,17 +44,20 @@ def marginalizedkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as symbolic label. The default node label is 'atom'. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
Edge attribute used as symbolic label. The default edge label is 'bond_type'. | |||
    p_quit : float
the termination probability in the random walks generating step | |||
The termination probability in the random walks generating step. | |||
n_iteration : integer | |||
time of iterations to calculate R_inf | |||
        Number of iterations to calculate R_inf.
remove_totters : boolean | |||
whether to remove totters. The default value is True. | |||
Whether to remove totterings by method introduced in [2]. The default | |||
value is False. | |||
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
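A hedged call sketch for marginalizedkernel with the parameters documented
above; all values are illustrative, and remove_totters=False matches the new
default:

out = marginalizedkernel(Gn, node_label='atom', edge_label='bond_type',
                         p_quit=0.03, n_iteration=20, remove_totters=False,
                         n_jobs=1)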
@@ -41,15 +41,62 @@ def randomwalkkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Two graphs between which the kernel is calculated. | |||
compute_method : string | |||
        Method used to compute the kernel. The following choices are
        available:
'sylvester' - Sylvester equation method. | |||
'conjugate' - conjugate gradient method. | |||
'fp' - fixed-point iterations. | |||
'spectral' - spectral decomposition. | |||
weight : float | |||
A constant weight set for random walks of length h. | |||
p : None | |||
Initial probability distribution on the unlabeled direct product graph | |||
of two graphs. It is set to be uniform over all vertices in the direct | |||
product graph. | |||
q : None | |||
Stopping probability distribution on the unlabeled direct product graph | |||
of two graphs. It is set to be uniform over all vertices in the direct | |||
product graph. | |||
    edge_weight : string
        Edge attribute name corresponding to the edge weight.
    node_kernels : dict
        A dictionary of kernel functions for nodes, including 3 items: 'symb'
        for symbolic node labels, 'nsymb' for non-symbolic node labels, 'mix'
        for both labels. The first 2 functions take two node labels as
        parameters, and the 'mix' function takes 4 parameters, a symbolic and a
        non-symbolic label for each of the two nodes. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when nodes are unlabeled. This argument
        applies only to the conjugate gradient method and fixed-point iterations.
    edge_kernels : dict
        A dictionary of kernel functions for edges, including 3 items: 'symb'
        for symbolic edge labels, 'nsymb' for non-symbolic edge labels, 'mix'
        for both labels. The first 2 functions take two edge labels as
        parameters, and the 'mix' function takes 4 parameters, a symbolic and a
        non-symbolic label for each of the two edges. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when edges are unlabeled. This argument
        applies only to the conjugate gradient method and fixed-point iterations.
    node_label : string
        Node attribute used as label. The default node label is atom. This
        argument applies only to the conjugate gradient method and fixed-point
        iterations.
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
h : integer | |||
Longest length of walks. | |||
method : string | |||
Method used to compute the random walk kernel. Available methods are 'sylvester', 'conjugate', 'fp', 'spectral' and 'kron'. | |||
        Edge attribute used as label. The default edge label is bond_type. This
        argument applies only to the conjugate gradient method and fixed-point
        iterations.
    sub_kernel : string
        Method used to compute the walk kernel. The following choices are
        available:
        'exp' : method based on the exponential series.
        'geo' : method based on the geometric series.
    n_jobs : int
Number of jobs for parallelization. | |||
Return | |||
------ | |||
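The node_kernels and edge_kernels dictionaries described above share one
fixed shape. A minimal sketch, assuming the helpers deltakernel,
gaussiankernel and kernelproduct exist in pygraph.utils.kernels (an
assumption based on this repo's layout):

import functools
from pygraph.utils.kernels import deltakernel, gaussiankernel, kernelproduct

# 'mix' multiplies a symbolic (delta) and a non-symbolic (gaussian) kernel.
mix_kernel = functools.partial(kernelproduct, deltakernel, gaussiankernel)
sub_kernels = {'symb': deltakernel, 'nsymb': gaussiankernel, 'mix': mix_kernel}
out = randomwalkkernel(Gn, compute_method='conjugate', weight=0.01,
                       node_kernels=sub_kernels, edge_kernels=sub_kernels,
                       node_label='atom', edge_label='bond_type', n_jobs=1)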
@@ -168,7 +215,7 @@ def _sylvester_equation(Gn, lmda, p, q, eweight, n_jobs, verbose=True): | |||
    if q is None:
# don't normalize adjacency matrices if q is a uniform vector. Note | |||
# A_wave_list accually contains the transposes of the adjacency matrices. | |||
# A_wave_list actually contains the transposes of the adjacency matrices. | |||
A_wave_list = [ | |||
nx.adjacency_matrix(G, eweight).todense().transpose() for G in | |||
(tqdm(Gn, desc='compute adjacency matrices', file=sys.stdout) if | |||
@@ -259,7 +306,7 @@ def _conjugate_gradient(Gn, lmda, p, q, ds_attrs, node_kernels, edge_kernels, | |||
# # this is faster from unlabeled graphs. @todo: why? | |||
# if q == None: | |||
# # don't normalize adjacency matrices if q is a uniform vector. Note | |||
# # A_wave_list accually contains the transposes of the adjacency matrices. | |||
# # A_wave_list actually contains the transposes of the adjacency matrices. | |||
# A_wave_list = [ | |||
# nx.adjacency_matrix(G, eweight).todense().transpose() for G in | |||
# tqdm(Gn, desc='compute adjacency matrices', file=sys.stdout) | |||
@@ -376,7 +423,7 @@ def _fixed_point(Gn, lmda, p, q, ds_attrs, node_kernels, edge_kernels, | |||
# # this is faster from unlabeled graphs. @todo: why? | |||
# if q == None: | |||
# # don't normalize adjacency matrices if q is a uniform vector. Note | |||
# # A_wave_list accually contains the transposes of the adjacency matrices. | |||
# # A_wave_list actually contains the transposes of the adjacency matrices. | |||
# A_wave_list = [ | |||
# nx.adjacency_matrix(G, eweight).todense().transpose() for G in | |||
# tqdm(Gn, desc='compute adjacency matrices', file=sys.stdout) | |||
@@ -481,7 +528,7 @@ def _spectral_decomposition(Gn, weight, p, q, sub_kernel, eweight, n_jobs, verbo | |||
for G in (tqdm(Gn, desc='spectral decompose', file=sys.stdout) if | |||
verbose else Gn): | |||
# don't normalize adjacency matrices if q is a uniform vector. Note | |||
# A accually is the transpose of the adjacency matrix. | |||
# A actually is the transpose of the adjacency matrix. | |||
A = nx.adjacency_matrix(G, eweight).todense().transpose() | |||
ew, ev = np.linalg.eig(A) | |||
D_list.append(ew) | |||
@@ -33,12 +33,12 @@ def spkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as label. The default node label is atom. | |||
edge_weight : string | |||
Edge attribute name corresponding to the edge weight. | |||
node_kernels: dict | |||
node_kernels : dict | |||
A dictionary of kernel functions for nodes, including 3 items: 'symb' | |||
for symbolic node labels, 'nsymb' for non-symbolic node labels, 'mix' | |||
for both labels. The first 2 functions take two node labels as | |||
@@ -46,6 +46,8 @@ def spkernel(*args, | |||
        non-symbolic label for each of the two nodes. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when nodes are unlabeled.
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
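spkernel consumes the same node_kernels shape; a hedged sketch reusing the
sub_kernels dict from the randomwalkkernel sketch above (edge_weight=None is
assumed to mean unweighted edges):

out = spkernel(Gn, node_label='atom', edge_weight=None,
               node_kernels=sub_kernels, n_jobs=1)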
@@ -42,14 +42,15 @@ def structuralspkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as label. The default node label is atom. | |||
edge_weight : string | |||
Edge attribute name corresponding to the edge weight. | |||
Edge attribute name corresponding to the edge weight. Applied for the | |||
computation of the shortest paths. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
node_kernels: dict | |||
Edge attribute used as label. The default edge label is bond_type. | |||
node_kernels : dict | |||
A dictionary of kernel functions for nodes, including 3 items: 'symb' | |||
for symbolic node labels, 'nsymb' for non-symbolic node labels, 'mix' | |||
for both labels. The first 2 functions take two node labels as | |||
@@ -57,7 +58,7 @@ def structuralspkernel(*args, | |||
        non-symbolic label for each of the two nodes. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when nodes are unlabeled.
edge_kernels: dict | |||
edge_kernels : dict | |||
A dictionary of kernel functions for edges, including 3 items: 'symb' | |||
for symbolic edge labels, 'nsymb' for non-symbolic edge labels, 'mix' | |||
for both labels. The first 2 functions take two edge labels as | |||
@@ -65,6 +66,13 @@ def structuralspkernel(*args, | |||
        non-symbolic label for each of the two edges. Each label is in the form
        of a 2-D array (n_samples, n_features). Each function returns a number
        as the kernel value. Ignored when edges are unlabeled.
    compute_method : string
        Computation method to store the shortest paths and compute the graph
        kernel. The following choices are available:
        'trie': store paths as tries.
        'naive': store paths in lists.
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
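structuralspkernel is the kernel the pre-image experiments above select via
gkernel = 'structuralspkernel'; a hedged direct-call sketch, again reusing
the sub_kernels dict sketched earlier:

out = structuralspkernel(Gn, node_label='atom', edge_weight=None,
                         edge_label='bond_type', node_kernels=sub_kernels,
                         edge_kernels=sub_kernels, compute_method='trie',
                         n_jobs=1)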
@@ -40,11 +40,19 @@ def treeletkernel(*args, | |||
        The sub-kernel between two real-valued vectors. Each vector counts the
        numbers of isomorphic treelets in a graph.
node_label : string | |||
        Node attribute used as label. The default node label is atom.
edge_label : string | |||
Edge attribute used as label. The default edge label is bond_type. | |||
labeled : boolean | |||
Whether the graphs are labeled. The default is True. | |||
    parallel : string/None
        Which parallelization method is applied to compute the kernel. The
        following choices are available:
        'imap_unordered': use Python's multiprocessing.Pool.imap_unordered
        method.
        None: no parallelization is applied.
    n_jobs : int
        Number of jobs for parallelization. The default is to use all
        computational cores. This argument is only valid when one of the
        parallelization methods is applied.
Return | |||
------ | |||
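A hedged sketch for treeletkernel; sub_kernel compares two treelet-count
vectors, and gaussiankernel from pygraph.utils.kernels with an illustrative
gamma is assumed to be one valid choice:

import functools
from pygraph.utils.kernels import gaussiankernel

out = treeletkernel(Gn, sub_kernel=functools.partial(gaussiankernel, gamma=1e-6),
                    node_label='atom', edge_label='bond_type',
                    parallel='imap_unordered', n_jobs=1)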
@@ -26,7 +26,7 @@ def untilhpathkernel(*args, | |||
node_label='atom', | |||
edge_label='bond_type', | |||
depth=10, | |||
k_func='tanimoto', | |||
k_func='MinMax', | |||
compute_method='trie', | |||
n_jobs=None, | |||
verbose=True): | |||
@@ -38,7 +38,7 @@ def untilhpathkernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
Node attribute used as label. The default node label is atom. | |||
edge_label : string | |||
@@ -47,9 +47,17 @@ def untilhpathkernel(*args, | |||
Depth of search. Longest length of paths. | |||
    k_func : string
A kernel function applied using different notions of fingerprint | |||
similarity. | |||
compute_method: string | |||
Computation method, 'trie' or 'naive'. | |||
        similarity, defining the type of feature map and normalization method
        applied for the graph kernel. The following choices are available:
        'MinMax': use the MinMax kernel and counting feature map.
        'tanimoto': use the Tanimoto kernel and binary feature map.
    compute_method : string
        Computation method to store paths and compute the graph kernel. The
        following choices are available:
        'trie': store paths as tries.
        'naive': store paths in lists.
n_jobs : int | |||
Number of jobs for parallelization. | |||
Return | |||
------ | |||
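A hedged call sketch for untilhpathkernel matching the new default
k_func='MinMax'; the depth value is illustrative:

out = untilhpathkernel(Gn, node_label='atom', edge_label='bond_type',
                       depth=3, k_func='MinMax', compute_method='trie',
                       n_jobs=1)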
@@ -38,15 +38,28 @@ def weisfeilerlehmankernel(*args, | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
Two graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
Node attribute used as label. The default node label is atom. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
Edge attribute used as label. The default edge label is bond_type. | |||
height : int | |||
subtree height | |||
Subtree height. | |||
base_kernel : string | |||
base kernel used in each iteration of WL kernel. The default base kernel is subtree kernel. For user-defined kernel, base_kernel is the name of the base kernel function used in each iteration of WL kernel. This function returns a Numpy matrix, each element of which is the user-defined Weisfeiler-Lehman kernel between 2 praphs. | |||
        Base kernel used in each iteration of WL kernel. Only the default
        'subtree' kernel can be applied for now.
# The default base | |||
# kernel is subtree kernel. For user-defined kernel, base_kernel is the | |||
# name of the base kernel function used in each iteration of WL kernel. | |||
# This function returns a Numpy matrix, each element of which is the | |||
# user-defined Weisfeiler-Lehman kernel between two graphs.
    parallel : None
        Which parallelization method is applied to compute the kernel. No
        parallelization can be applied for now.
    n_jobs : int
        Number of jobs for parallelization. The default is to use all
        computational cores. This argument is only valid when one of the
        parallelization methods is applied, and can be ignored for now.
Return | |||
------ | |||
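A hedged call sketch for weisfeilerlehmankernel with the only supported base
kernel; the height value is illustrative:

out = weisfeilerlehmankernel(Gn, node_label='atom', edge_label='bond_type',
                             height=2, base_kernel='subtree', n_jobs=1)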