modified:   README.md
new file:   notebooks/run_cyclicpatternkernel.ipynb
modified:   notebooks/run_marginalizedkernel_acyclic.ipynb
modified:   notebooks/run_pathkernel_acyclic.ipynb
modified:   notebooks/run_spkernel_acyclic.ipynb
modified:   notebooks/run_treeletkernel_acyclic.ipynb
new file:   notebooks/run_treepatternkernel.ipynb
modified:   notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
new file:   pygraph/kernels/cyclicPatternKernel.py
modified:   pygraph/kernels/deltaKernel.py
modified:   pygraph/kernels/pathKernel.py
modified:   pygraph/kernels/results.md
modified:   pygraph/kernels/spKernel.py
new file:   pygraph/kernels/treePatternKernel.py
modified:   pygraph/kernels/treeletKernel.py
modified:   pygraph/kernels/untildPathKernel.py
modified:   pygraph/kernels/weisfeilerLehmanKernel.py
modified:   pygraph/utils/graphfiles.py
modified:   pygraph/utils/utils.py
new file:   run_cyclic.py
@@ -11,26 +11,30 @@ A python package for graph kernels.
* tabulate - 0.8.2
## Results with minimal test RMSE for each kernel on dataset Acyclic
All kernels are tested on the dataset Acyclic, which consists of 185 molecules (graphs).
All kernels except the Cyclic pattern kernel are tested on the dataset Acyclic, which consists of 185 molecules (graphs). (The Cyclic pattern kernel is tested on the datasets MAO and PAH.)
The criteria used for prediction are SVM for classification and kernel ridge regression for regression.
For prediction we randomly split the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 such splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the optimal parameters for the test set and finally report the corresponding performance. The final results are the averages of the performances over the test sets.
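For reference, here is a minimal sketch of this protocol for the regression case, assuming a precomputed kernel matrix `K` and target vector `y` (illustrative only; this is not the package's API):

```python
# Minimal sketch of the evaluation protocol described above: kernel ridge
# regression on a precomputed kernel matrix, over 10 random 90/10 splits.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

def evaluate(K, y, alpha=1.0):
    rmses = []
    for train, test in ShuffleSplit(n_splits=10, test_size=0.1).split(y):
        model = KernelRidge(alpha=alpha, kernel='precomputed')
        model.fit(K[np.ix_(train, train)], y[train])
        y_pred = model.predict(K[np.ix_(test, train)])
        rmses.append(np.sqrt(mean_squared_error(y[test], y_pred)))
    return np.mean(rmses), np.std(rmses)
```

The summary tables below report the best configuration found for each kernel, before and after this update.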
| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|---------------|:-------:|:------:|-------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 14.00 | 6.93 | - | 36.21" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.59" |
| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|------------------|:-------:|:------:|------------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 18.41 | 10.78 | - | 29.43" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| WL shortest path | 35.16 | 4.50 | height = 2 | 40.24" |
| WL edge | 33.41 | 4.73 | height = 5 | 5.66" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.59" |
| Tree pattern | 7.27 | 2.21 | lamda = 1, h = 2 | 37.24" |
| Cyclic pattern | 0.9 | 0.11 | cycle bound = 100 | 0.31" |
* RMSE stands for the arithmetic mean of the root mean squared errors on all splits.
* STD stands for the standard deviation of the root mean squared errors on all splits.
* Parameter is the one with which the kernel achieves the best results.
* k_time is the time spent building the kernel matrix.
* The targets of training data are normalized before calculating *path kernel* and *treelet kernel*.
* The targets of training data are normalized before calculating *treelet kernel*.
* See detailed results in [results.md](pygraph/kernels/results.md).
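The reported parameter for each kernel is the best value found by a sweep; in the notebooks this is driven through the `kernel_train_test` helper, e.g. for `p_quit` of the marginalized kernel (file paths are illustrative):

```python
# Hyperparameter sweep as invoked in the notebooks (see the commented call
# in run_marginalizedkernel_acyclic.ipynb below).
import numpy as np
from pygraph.utils.utils import kernel_train_test
from pygraph.kernels.marginalizedKernel import marginalizedkernel

kernel_para = dict(node_label='atom', edge_label='bond_type', itr=20, p_quit=0.1)
kernel_train_test('datasets/acyclic/Acyclic/dataset_bps.ds', 'kernelmatrices/',
                  marginalizedkernel, kernel_para,
                  hyper_name='p_quit', hyper_range=np.linspace(0.1, 0.9, 9),
                  normalize=False)
```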
## References
@@ -44,6 +48,12 @@ For prediction we randomly split the data into train and test subsets, where 90% of
[5] Gaüzère B, Brun L, Villemin D. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters. 2012 Nov 1;33(15):2038-47.
[6] Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural Networks, 18(8):1093–1110, 2005.
[7] Pierre Mahé and Jean-Philippe Vert. Graph kernels based on tree patterns for molecules. Machine Learning, 75(1):3–35, 2009.
[8] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 158–167. ACM, 2004.
## Updates
### 2018.01.24
* Add *path kernel up to depth d* and its results on the dataset Acyclic.
@@ -364,6 +364,155 @@
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- marginalized kernel matrix of size 185 built in 1133.0229969024658 seconds ---\n",
"[[ 0.0287062 0.0124634 0.00444444 ..., 0.00606061 0.00606061\n",
" 0.00606061]\n",
" [ 0.0124634 0.01108958 0.00333333 ..., 0.00454545 0.00454545\n",
" 0.00454545]\n",
" [ 0.00444444 0.00333333 0.0287062 ..., 0.00819912 0.00819912\n",
" 0.00975875]\n",
" ..., \n",
" [ 0.00606061 0.00454545 0.00819912 ..., 0.02846735 0.02836907\n",
" 0.02896354]\n",
" [ 0.00606061 0.00454545 0.00819912 ..., 0.02836907 0.02831424\n",
" 0.0288712 ]\n",
" [ 0.00606061 0.00454545 0.00975875 ..., 0.02896354 0.0288712\n",
" 0.02987915]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 12.186285\n",
"With standard deviation: 7.038988\n",
"\n",
" Mean performance on test set: 18.024312\n",
"With standard deviation: 6.292466\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 18.0243 6.29247 12.1863 7.03899 1133.02\n"
]
}
],
"source": [
"%load_ext line_profiler\n",
"\n",
"import numpy as np\n",
"import sys\n",
"sys.path.insert(0, \"../\")\n",
"from pygraph.utils.utils import kernel_train_test\n",
"from pygraph.kernels.marginalizedKernel import marginalizedkernel, _marginalizedkernel_do\n",
"\n",
"datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'\n",
"kernel_file_path = 'kernelmatrices_weisfeilerlehman_subtree_acyclic/'\n",
"\n",
"kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', itr = 20, p_quit = 0.1)\n",
"\n",
"# kernel_train_test(datafile, kernel_file_path, marginalizedkernel, kernel_para, \\\n",
"# hyper_name = 'p_quit', hyper_range = np.linspace(0.1, 0.9, 9), normalize = False)\n",
"\n",
"%lprun -f _marginalizedkernel_do \\\n",
" kernel_train_test(datafile, kernel_file_path, marginalizedkernel, kernel_para, \\\n",
" normalize = False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Timer unit: 1e-06 s\n",
"\n",
"Total time: 828.879 s\n",
"File: ../pygraph/kernels/marginalizedKernel.py\n",
"Function: _marginalizedkernel_do at line 67\n",
"\n",
"Line # Hits Time Per Hit % Time Line Contents\n",
"==============================================================\n",
" 67 def _marginalizedkernel_do(G1, G2, node_label, edge_label, p_quit, itr):\n",
" 68 \"\"\"Calculate marginalized graph kernel between 2 graphs.\n",
" 69 \n",
" 70 Parameters\n",
" 71 ----------\n",
" 72 G1, G2 : NetworkX graphs\n",
" 73 2 graphs between which the kernel is calculated.\n",
" 74 node_label : string\n",
" 75 node attribute used as label.\n",
" 76 edge_label : string\n",
" 77 edge attribute used as label.\n",
" 78 p_quit : integer\n",
" 79 the termination probability in the random walks generating step.\n",
" 80 itr : integer\n",
" 81 time of iterations to calculate R_inf.\n",
" 82 \n",
" 83 Return\n",
" 84 ------\n",
" 85 kernel : float\n",
" 86 Marginalized Kernel between 2 graphs.\n",
" 87 \"\"\"\n",
" 88 # init parameters\n",
" 89 17205 12886.0 0.7 0.0 kernel = 0\n",
" 90 17205 52542.0 3.1 0.0 num_nodes_G1 = nx.number_of_nodes(G1)\n",
" 91 17205 28240.0 1.6 0.0 num_nodes_G2 = nx.number_of_nodes(G2)\n",
" 92 17205 15595.0 0.9 0.0 p_init_G1 = 1 / num_nodes_G1 # the initial probability distribution in the random walks generating step (uniform distribution over |G|)\n",
" 93 17205 11587.0 0.7 0.0 p_init_G2 = 1 / num_nodes_G2\n",
" 94 \n",
" 95 17205 11663.0 0.7 0.0 q = p_quit * p_quit\n",
" 96 17205 10728.0 0.6 0.0 r1 = q\n",
" 97 \n",
" 98 # initial R_inf\n",
" 99 17205 38412.0 2.2 0.0 R_inf = np.zeros([num_nodes_G1, num_nodes_G2]) # matrix to save all the R_inf for all pairs of nodes\n",
" 100 \n",
" 101 # calculate R_inf with a simple interative method\n",
" 102 344100 329235.0 1.0 0.0 for i in range(1, itr):\n",
" 103 326895 900354.0 2.8 0.1 R_inf_new = np.zeros([num_nodes_G1, num_nodes_G2])\n",
" 104 326895 2287346.0 7.0 0.3 R_inf_new.fill(r1)\n",
" 105 \n",
" 106 # calculate R_inf for each pair of nodes\n",
" 107 2653464 3667117.0 1.4 0.4 for node1 in G1.nodes(data = True):\n",
" 108 2326569 7522840.0 3.2 0.9 neighbor_n1 = G1[node1[0]]\n",
" 109 2326569 3492118.0 1.5 0.4 p_trans_n1 = (1 - p_quit) / len(neighbor_n1) # the transition probability distribution in the random walks generating step (uniform distribution over the vertices adjacent to the current vertex)\n",
" 110 24024379 27775021.0 1.2 3.4 for node2 in G2.nodes(data = True):\n",
" 111 21697810 69471941.0 3.2 8.4 neighbor_n2 = G2[node2[0]]\n",
" 112 21697810 32446626.0 1.5 3.9 p_trans_n2 = (1 - p_quit) / len(neighbor_n2) \n",
" 113 \n",
" 114 59095092 52545370.0 0.9 6.3 for neighbor1 in neighbor_n1:\n",
" 115 104193150 92513935.0 0.9 11.2 for neighbor2 in neighbor_n2:\n",
" 116 \n",
" 117 t = p_trans_n1 * p_trans_n2 * \\\n",
" 118 66795868 285324518.0 4.3 34.4 deltakernel(G1.node[neighbor1][node_label] == G2.node[neighbor2][node_label]) * \\\n",
" 119 66795868 137934393.0 2.1 16.6 deltakernel(neighbor_n1[neighbor1][edge_label] == neighbor_n2[neighbor2][edge_label])\n",
" 120 66795868 106834143.0 1.6 12.9 R_inf_new[node1[0]][node2[0]] += t * R_inf[neighbor1][neighbor2] # ref [1] equation (8)\n",
" 121 \n",
" 122 326895 1123677.0 3.4 0.1 R_inf[:] = R_inf_new\n",
" 123 \n",
" 124 # add elements of R_inf up and calculate kernel\n",
" 125 139656 330283.0 2.4 0.0 for node1 in G1.nodes(data = True):\n",
" 126 1264441 1435263.0 1.1 0.2 for node2 in G2.nodes(data = True): \n",
" 127 1141990 1377134.0 1.2 0.2 s = p_init_G1 * p_init_G2 * deltakernel(node1[1][node_label] == node2[1][node_label])\n",
" 128 1141990 1375456.0 1.2 0.2 kernel += s * R_inf[node1[0]][node2[0]] # ref [1] equation (6)\n",
" 129 \n",
" 130 17205 10801.0 0.6 0.0 return kernel"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@@ -15,13 +15,11 @@
" --- This is a regression problem ---\n",
"\n",
"\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- mean average path kernel matrix of size 185 built in 132.2242877483368 seconds ---\n",
" --- mean average path kernel matrix of size 185 built in 29.430902242660522 seconds ---\n",
"[[ 0.55555556 0.22222222 0. ..., 0. 0. 0. ]\n",
" [ 0.22222222 0.27777778 0. ..., 0. 0. 0. ]\n",
" [ 0. 0. 0.55555556 ..., 0.03030303 0.03030303\n",
@@ -36,16 +34,16 @@
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 3.761907\n",
"With standard deviation: 0.702594\n",
" Mean performance on train set: 3.619948\n",
"With standard deviation: 0.512351\n",
"\n",
" Mean performance on test set: 14.001515\n",
"With standard deviation: 6.936023\n",
" Mean performance on test set: 18.418852\n",
"With standard deviation: 10.781119\n",
"\n",
"\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 14.0015 6.93602 3.76191 0.702594 132.224\n"
" 18.4189 10.7811 3.61995 0.512351 29.4309\n"
]
}
],
@@ -62,10 +60,10 @@
"\n",
"kernel_para = dict(node_label = 'atom', edge_label = 'bond_type')\n",
"\n",
"kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = True)\n",
"kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = False)\n",
"\n",
"# %lprun -f _pathkernel_do \\\n",
"# kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = True)"
"# kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = False)"
]
},
{
@@ -84,7 +82,7 @@
"# without y normalization\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 18.4189 10.7811 3.61995 0.512351 37.0017"
" 18.4189 10.7811 3.61995 0.512351 29.4309"
]
},
{
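The switch from `normalize = True` to `normalize = False` is what moves the path kernel's test RMSE from 14.00 to 18.42 in the cells above; with normalization enabled, the training targets are standardized before fitting. A hypothetical sketch of what such a flag does (names are illustrative, not the helper's actual code):

```python
# Hypothetical sketch of target normalization as toggled by the normalize
# flag: standardize training targets before fitting, then map predictions
# back to the original scale.
import numpy as np

def fit_predict_normalized(model, K_train, y_train, K_test):
    y_mean, y_std = np.mean(y_train), np.std(y_train)
    model.fit(K_train, (y_train - y_mean) / y_std)
    return model.predict(K_test) * y_std + y_mean
```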
@@ -2,44 +2,42 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The line_profiler extension is already loaded. To reload it, use:\n",
" %reload_ext line_profiler\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"--- shortest path kernel matrix of size 185 built in 14.576777696609497 seconds ---\n",
"[[ 3. 1. 3. ..., 1. 1. 1.]\n",
" [ 1. 6. 1. ..., 0. 0. 3.]\n",
" [ 3. 1. 3. ..., 1. 1. 1.]\n",
" ..., \n",
" [ 1. 0. 1. ..., 55. 21. 7.]\n",
" [ 1. 0. 1. ..., 21. 55. 7.]\n",
" [ 1. 3. 1. ..., 7. 7. 55.]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
"--- shortest path kernel matrix of size 185 built in 13.3865065574646 seconds ---\n",
"[[ 3. 1. 3. ... 1. 1. 1.]\n",
" [ 1. 6. 1. ... 0. 0. 3.]\n",
" [ 3. 1. 3. ... 1. 1. 1.]\n",
" ...\n",
" [ 1. 0. 1. ... 55. 21. 7.]\n",
" [ 1. 0. 1. ... 21. 55. 7.]\n",
" [ 1. 3. 1. ... 7. 7. 55.]]\n",
"\n",
" Starting calculate accuracy/rmse...\n",
"calculate performance: 94%|█████████▎| 936/1000 [00:01<00:00, 757.54it/s]\n",
" Mean performance on train set: 28.360361\n",
"With standard deviation: 1.357183\n",
"\n",
" Mean performance on test set: 35.191954\n",
"With standard deviation: 4.495767\n",
"calculate performance: 100%|██████████| 1000/1000 [00:01<00:00, 771.22it/s]\n",
"\n",
"\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 35.192 4.49577 28.3604 1.35718 14.5768\n"
" 35.192 4.49577 28.3604 1.35718 13.3865\n"
]
}
],
@@ -2,15 +2,13 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The line_profiler extension is already loaded. To reload it, use:\n",
" %reload_ext line_profiler\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
@@ -19,68 +17,34 @@
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.48417091369628906 seconds ---\n",
"[[ 4.00000000e+00 2.60653066e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 2.60653066e+00 6.00000000e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 1.00000000e+00 1.00000000e+00 4.00000000e+00 ..., 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ..., \n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.47543811798095703 seconds ---\n",
"[[4.00000000e+00 2.60653066e+00 1.00000000e+00 ... 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [2.60653066e+00 6.00000000e+00 1.00000000e+00 ... 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [1.00000000e+00 1.00000000e+00 4.00000000e+00 ... 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ...\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Starting calculate accuracy/rmse...\n",
"calculate performance: 98%|█████████▊| 983/1000 [00:01<00:00, 796.45it/s]\n",
" Mean performance on train set: 2.688029\n",
"With standard deviation: 1.541623\n",
"\n",
" Mean performance on test set: 10.099738\n",
"With standard deviation: 5.035844\n",
"calculate performance: 100%|██████████| 1000/1000 [00:01<00:00, 745.11it/s]\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.0997 5.03584 2.68803 1.54162 0.484171\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.5003015995025635 seconds ---\n",
"[[ 4.00000000e+00 2.60653066e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 2.60653066e+00 6.00000000e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 1.00000000e+00 1.00000000e+00 4.00000000e+00 ..., 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ..., \n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 2.908869\n",
"With standard deviation: 1.267900\n",
"\n",
" Mean performance on test set: 8.307902\n",
"With standard deviation: 3.378376\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.3079 3.37838 2.90887 1.2679 0.500302\n"
" 10.0997 5.03584 2.68803 1.54162 0.475438\n"
]
}
],
@@ -99,8 +63,6 @@
"\n",
"kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = False)\n",
"\n",
"kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True)\n",
"\n",
"# %lprun -f treeletkernel \\\n",
"# kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = False)"
]
@@ -121,14 +83,58 @@
"# without y normalization\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.0997 5.03584 2.68803 1.54162 0.484171"
" 10.0997 5.03584 2.68803 1.54162 0.484171\n",
"\n",
" \n",
"\n",
"# G0 -> WL subtree h = 0\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 13.9223 2.88611 13.373 0.653301 0.186731\n",
"\n",
"# G0 U G1 U G6 U G8 U G13 -> WL subtree h = 1\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.97706 2.90771 6.7343 1.17505 0.223171\n",
" \n",
"# all patterns \\ { G3 U G4 U G5 U G10 } -> WL subtree h = 2 \n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 7.31274 1.96289 3.73909 0.406267 0.294902\n",
"\n",
"# all patterns \\ { G4 U G5 } -> WL subtree h = 3\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.39977 2.78309 3.8606 1.58686 0.348912\n",
"\n",
"# all patterns \\ { G5 } \n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 9.47647 4.22113 3.18029 1.5669 0.423638\n",
" \n",
" \n",
" \n",
"# G0, -> WL subtree h = 0\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 13.9223 2.88611 13.373 0.653301 0.186731 \n",
" \n",
"# G0 U G1 U G2 U G6 U G8 U G13 -> WL subtree h = 1\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.62431 2.54327 5.63422 0.255002 0.290797\n",
" \n",
"# all patterns \\ { G5 U G10 } -> WL subtree h = 2\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.1294 3.50275 3.69664 1.55116 0.418498"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
"scrolled": true
},
"outputs": [
{
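The tables in the cell above probe which treelet patterns carry the WL-subtree signal by recomputing the treelet kernel from restricted pattern sets (G0 alone, G0 ∪ G1 ∪ G6 ∪ G8 ∪ G13, and so on). A hedged sketch of such a restriction, assuming per-graph pattern counts keyed by canonical treelet names (the Gaussian on count differences follows the treelet kernel of reference [5]; all names are illustrative):

```python
# Sketch of the pattern-subset experiment tabulated above: keep only a chosen
# family of treelet patterns before comparing two graphs.
import numpy as np

def restricted_treelet_kernel(counts1, counts2, keep):
    common = set(counts1) & set(counts2) & set(keep)
    return sum(np.exp(-(counts1[k] - counts2[k]) ** 2 / 2) for k in common)
```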
@@ -0,0 +1,147 @@
"""
@author: linlin <jajupmochi@gmail.com>
@references:
[1] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 158–167. ACM, 2004.
[2] Hopcroft, J.; Tarjan, R. (1973). "Efficient algorithms for graph manipulation". Communications of the ACM 16: 372–378. doi:10.1145/362248.362272.
[3] Finding all the elementary circuits of a directed graph. D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975. http://dx.doi.org/10.1137/0204007
"""
import sys
import pathlib
sys.path.insert(0, "../")
import time
import networkx as nx
import numpy as np
from tqdm import tqdm
def cyclicpatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, cycle_bound = None):
"""Calculate cyclic pattern graph kernels between graphs.
Parameters
----------
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
/
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
cycle_bound : integer
Upper bound on the number of simple cycles allowed per graph; if it is exceeded, the pattern set of that graph is treated as empty. The default is None (no bound).
Return
------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the cyclic pattern kernel between 2 graphs.
"""
Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
Kmatrix = np.zeros((len(Gn), len(Gn)))
start_time = time.time()
# get all cyclic and tree patterns of all graphs before calculating kernels, to save time; this may consume a lot of memory for large datasets.
all_patterns = [ get_patterns(Gn[i], node_label = node_label, edge_label = edge_label, labeled = labeled, cycle_bound = cycle_bound)
for i in tqdm(range(0, len(Gn)), desc = 'retrieve patterns', file=sys.stdout) ]
for i in tqdm(range(0, len(Gn)), desc = 'calculate kernels', file=sys.stdout):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _cyclicpatternkernel_do(all_patterns[i], all_patterns[j])
Kmatrix[j][i] = Kmatrix[i][j]
run_time = time.time() - start_time
print("\n --- kernel matrix of cyclic pattern kernel of size %d built in %s seconds ---" % (len(Gn), run_time))
return Kmatrix, run_time
def _cyclicpatternkernel_do(patterns1, patterns2):
"""Calculate the cyclic pattern kernel between 2 graphs given their pattern sets.
Parameters
----------
patterns1, patterns2 : list
Lists of canonical patterns of 2 graphs, where each pattern is represented by a string consisting of the node and edge labels along it.
Return
------
kernel : float
Cyclic pattern kernel between 2 graphs, i.e. the number of patterns common to both graphs.
"""
return len(set(patterns1) & set(patterns2))
def get_patterns(G, node_label = 'atom', edge_label = 'bond_type', labeled = True, cycle_bound = None):
"""Find all cyclic and tree patterns in a graph.
Parameters
----------
G : NetworkX graphs
The graph in which patterns are searched.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
cycle_bound : integer
Upper bound on the number of simple cycles; if it is exceeded, an empty pattern list is returned.
Return
------
patterns : list
List of patterns retrieved, where each cyclic pattern is represented by the canonical string of node and edge labels along the cycle.
"""
number_simplecycles = 0
bridges = nx.Graph()
patterns = []
bicomponents = nx.biconnected_component_subgraphs(G) # all biconnected components of G. This function uses the algorithm of reference [2], which may differ slightly from the one used in paper [1].
for subgraph in bicomponents:
if nx.number_of_edges(subgraph) > 1:
simple_cycles = list(nx.simple_cycles(subgraph.to_directed())) # all simple cycles in this biconnected component. This function uses the algorithm of reference [3], which has time complexity O((n+e)(N+1)) for n nodes, e edges and N simple cycles, and might be slower than the algorithm applied in paper [1].
if cycle_bound is not None and len(simple_cycles) > cycle_bound - number_simplecycles: # in paper [1], where another algorithm (subroutine RT) is applied, this condition becomes len(simple_cycles) == cycle_bound - number_simplecycles + 1; to be checked.
return []
else:
# calculate the canonical representation of each simple cycle
all_canonkeys = []
for cycle in simple_cycles:
canonlist = [ G.node[node][node_label] + G[node][cycle[cycle.index(node) + 1]][edge_label] for node in cycle[:-1] ]
canonkey = ''.join(canonlist)
canonkey = canonkey if canonkey < canonkey[::-1] else canonkey[::-1]
for i in range(1, len(cycle[:-1])):
canonlist = [ G.node[node][node_label] + G[node][cycle[cycle.index(node) + 1]][edge_label] for node in cycle[i:-1] + cycle[:i] ]
canonkey_t = ''.join(canonlist)
canonkey_t = canonkey_t if canonkey_t < canonkey_t[::-1] else canonkey_t[::-1]
canonkey = canonkey if canonkey < canonkey_t else canonkey_t
all_canonkeys.append(canonkey)
patterns = list(set(patterns) | set(all_canonkeys))
number_simplecycles += len(simple_cycles)
else:
bridges.add_edges_from(subgraph.edges(data=True))
# calculate the canonical representation of each connected component in the bridge set
components = list(nx.connected_component_subgraphs(bridges)) # all connected components of the bridge set
tree_patterns = []
for tree in components:
break # extraction of tree patterns (the pi(B) part of paper [1]) is not implemented yet
# patterns += pi(bridges)
return patterns
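To make the canonicalization above concrete: each simple cycle is encoded as the string of node and edge labels along it, and the key is minimized over all rotations and over character-wise reversal (mirroring `canonkey[::-1]` above), so that any traversal of the same labeled cycle maps to one key. A toy rerun of that logic with made-up label pieces:

```python
# Toy rerun of the canonical-key logic of get_patterns: minimize the label
# string over all rotations and over string reversal.
def canonical_cycle_key(pieces):
    best = None
    for i in range(len(pieces)):
        s = ''.join(pieces[i:] + pieces[:i])
        cand = min(s, s[::-1])
        best = cand if best is None or cand < best else best
    return best

print(canonical_cycle_key(['C1', 'N2', 'O1']))  # -> '1C1O2N'
```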
@@ -1,18 +1,18 @@
def deltakernel(condition):
"""Return 1 if condition holds, 0 otherwise.
Parameters
----------
condition : Boolean
A condition, according to which the kernel is set to 1 or 0.
Return
------
kernel : integer
Delta kernel.
References
----------
[1] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning, Washington, DC, United States, 2003.
"""
return (1 if condition else 0)
return condition # equivalent to (1 if condition else 0); bool is a subclass of int, so True/False already behave as 1/0 in arithmetic
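The rewrite is safe because `bool` is a subclass of `int` in Python, so the returned condition already acts as 0/1 in the callers' arithmetic:

```python
>>> isinstance(True, int)
True
>>> 0.5 * True + 0.5 * False
0.5
```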
@@ -1,3 +1,8 @@
"""
@author: linlin
@references: Suard F, Rakotomamonjy A, Bensrhair A. Kernel on Bag of Paths For Measuring Similarity of Shapes. In ESANN 2007 Apr 25 (pp. 355-360).
"""
import sys
import pathlib
sys.path.insert(0, "../")
@@ -27,10 +32,6 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
------
Kmatrix/kernel : Numpy matrix/float
Kernel matrix, each element of which is the path kernel between 2 graphs. / Path kernel between 2 graphs.
References
----------
[1] Suard F, Rakotomamonjy A, Bensrhair A. Kernel on Bag of Paths For Measuring Similarity of Shapes. In ESANN 2007 Apr 25 (pp. 355-360).
"""
some_graph = args[0][0] if len(args) == 1 else args[0] # only edge attributes of type int or float can be used as edge weight to calculate the shortest paths.
some_weight = list(nx.get_edge_attributes(some_graph, edge_label).values())[0]
@@ -42,9 +43,11 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
start_time = time.time()
splist = [ get_shortest_paths(Gn[i], weight) for i in range(0, len(Gn)) ]
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _pathkernel_do(Gn[i], Gn[j], node_label, edge_label, weight = weight)
Kmatrix[i][j] = _pathkernel_do(Gn[i], Gn[j], splist[i], splist[j], node_label, edge_label)
Kmatrix[j][i] = Kmatrix[i][j]
run_time = time.time() - start_time
@@ -55,7 +58,10 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
else: # for only 2 graphs
start_time = time.time()
kernel = _pathkernel_do(args[0], args[1], node_label, edge_label, weight = weight)
sp1 = get_shortest_paths(args[0], weight)
sp2 = get_shortest_paths(args[1], weight)
kernel = _pathkernel_do(args[0], args[1], sp1, sp2, node_label, edge_label)
run_time = time.time() - start_time
print("\n --- mean average path kernel built in %s seconds ---" % (run_time))
@@ -63,19 +69,19 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
return kernel, run_time
def _pathkernel_do(G1, G2, node_label = 'atom', edge_label = 'bond_type', weight = None):
def _pathkernel_do(G1, G2, sp1, sp2, node_label = 'atom', edge_label = 'bond_type'):
"""Calculate mean average path kernel between 2 graphs.
Parameters
----------
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
sp1, sp2 : list of list
List of shortest paths of 2 graphs, where each path is represented by a list of nodes.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
weight : string/None
edge attribute used as weight to calculate the shortest path. The default is None.
Return
------
@@ -83,30 +89,62 @@ def _pathkernel_do(G1, G2, node_label = 'atom', edge_label = 'bond_type', weight
Path Kernel between 2 graphs.
"""
# calculate shortest paths for both graphs
sp1 = []
num_nodes = G1.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp1.append(nx.shortest_path(G1, node1, node2, weight = weight))
sp2 = []
num_nodes = G2.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp2.append(nx.shortest_path(G2, node1, node2, weight = weight))
# calculate kernel
kernel = 0
for path1 in sp1:
for path2 in sp2:
if len(path1) == len(path2):
kernel_path = deltakernel(G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label])
kernel_path = (G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label])
if kernel_path:
for i in range(1, len(path1)):
# kernel = 1 if all corresponding nodes and edges in the 2 paths have same labels, otherwise 0
kernel_path *= deltakernel(G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) * deltakernel(G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
kernel_path *= (G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) \
* (G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
if kernel_path == 0:
break
kernel += kernel_path # add up kernels of all paths
# kernel = 0
# for path1 in sp1:
# for path2 in sp2:
# if len(path1) == len(path2):
# if (G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label]):
# for i in range(1, len(path1)):
# # kernel = 1 if all corresponding nodes and edges in the 2 paths have same labels, otherwise 0
# # kernel_path *= (G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) \
# # * (G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
# # if kernel_path == 0:
# # break
# # kernel += kernel_path # add up kernels of all paths
# if (G1[path1[i - 1]][path1[i]][edge_label] != G2[path2[i - 1]][path2[i]][edge_label]) or \
# (G1.node[path1[i]][node_label] != G2.node[path2[i]][node_label]):
# break
# else:
# kernel += 1
kernel = kernel / (len(sp1) * len(sp2)) # calculate mean average
return kernel
def get_shortest_paths(G, weight):
"""Get all shortest paths of a graph.
Parameters
----------
G : NetworkX graphs
The graph whose paths are calculated.
weight : string/None
edge attribute used as weight to calculate the shortest path.
Return
------
sp : list of list
List of shortest paths of the graph, where each path is represented by a list of nodes.
"""
sp = []
num_nodes = G.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp.append(nx.shortest_path(G, node1, node2, weight = weight))
return sp
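A hedged usage sketch of the refactored functions on toy graphs (shortest paths are now computed once per graph and passed in, instead of being recomputed for every pair; labels are made up, and the `G.node` accessor used by the module assumes networkx < 2.4):

```python
# Usage sketch for the refactored path kernel (illustrative only).
import networkx as nx

G1 = nx.Graph()
G1.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'}), (2, {'atom': 'C'})])
G1.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'})])
G2 = G1.copy()

sp1 = get_shortest_paths(G1, None)
sp2 = get_shortest_paths(G2, None)
kernel = _pathkernel_do(G1, G2, sp1, sp2, node_label='atom', edge_label='bond_type')
print(kernel)
```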
@@ -1,20 +1,26 @@ | |||
# Results with minimal test RMSE for each kernel on dataset Acyclic
All kernels except for the Cyclic pattern kernel are tested on dataset Acyclic, which consists of 185 molecules (graphs). (The Cyclic pattern kernel is tested on datasets MAO and PAH.)
The criteria used for prediction are SVM for classification and kernel ridge regression for regression.
For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the optimal parameters for the test set and finally report the corresponding performance. The final results correspond to the average of the performances on the test sets.
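A condensed sketch of this protocol for a single split (illustrative names only; the actual implementation lives in `pygraph/utils/utils.py`):
```
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

def evaluate_one_split(Kmatrix, y, train_idx, test_idx, alpha_grid):
    # restrict the precomputed kernel matrix to the current split
    K_train = Kmatrix[np.ix_(train_idx, train_idx)]
    K_test = Kmatrix[np.ix_(test_idx, train_idx)]
    rmse_test = []
    for alpha in alpha_grid:
        model = KernelRidge(kernel = 'precomputed', alpha = alpha)
        model.fit(K_train, y[train_idx])
        y_pred = model.predict(K_test)
        rmse_test.append(np.sqrt(mean_squared_error(y[test_idx], y_pred)))
    # the optimal parameter is chosen on the test set, as described above
    return min(rmse_test)
```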
All the results were run under Python 3.5.2 on a 64-bit machine with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz, 32GB of memory, and Ubuntu 16.04.3 LTS.
## Summary | |||
| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time | | |||
|------------------|:-------:|:------:|------------------:|-------:| | |||
| Shortest path | 35.19 | 4.50 | - | 14.58" | | |||
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" | | |||
| Path | 14.00 | 6.94 | - | 37.58" | | |||
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" | | |||
| WL shortest path | 35.16 | 4.50 | height = 2 | 40.24" | | |||
| WL edge | 33.41 | 4.73 | height = 5 | 5.66" | | |||
| Treelet | 8.31 | 3.38 | - | 0.50" | | |||
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.52" | | |||
| Tree pattern | 7.27 | 2.21 | lamda = 1, h = 2 | 37.24" | | |||
| Cyclic pattern | 0.9 | 0.11 | cycle bound = 100 | 0.31" | | |||
* RMSE stands for arithmetic mean of the root mean squared errors on all splits. | |||
* STD stands for standard deviation of the root mean squared errors on all splits. | |||
@@ -76,6 +82,42 @@ The table below shows the results of the WL subtree under different subtree heig | |||
10 17.1864 4.05672 0.691516 0.564621 5.00918 | |||
``` | |||
### Weisfeiler-Lehman shortest path kernel | |||
The table below shows the results of the WL shortest path kernel under different heights.
``` | |||
height rmse_test std_test rmse_train std_train k_time | |||
-------- ----------- ---------- ------------ ----------- -------- | |||
0 35.192 4.49577 28.3604 1.35718 13.5041 | |||
1 35.1808 4.50045 27.9335 1.44836 26.8292 | |||
2 35.1632 4.50205 28.1113 1.50891 40.2356 | |||
3 35.1946 4.49801 28.3903 1.36571 54.6704 | |||
4 35.1753 4.50111 27.9746 1.46222 67.1522 | |||
5 35.1997 4.5071 28.0184 1.45564 80.0881 | |||
6 35.1645 4.49849 28.3731 1.60057 92.1925 | |||
7 35.1771 4.5009 27.9604 1.45742 105.812 | |||
8 35.1968 4.50526 28.1991 1.5149 119.022 | |||
9 35.1956 4.50197 28.2665 1.30769 131.228 | |||
10 35.1676 4.49723 28.4163 1.61596 144.964 | |||
``` | |||
### Weisfeiler-Lehman edge kernel | |||
The table below shows the results of the WL edge kernel under different heights.
``` | |||
height rmse_test std_test rmse_train std_train k_time | |||
-------- ----------- ---------- ------------ ----------- --------- | |||
0 33.4077 4.73272 29.9975 0.90234 0.853002 | |||
1 33.4235 4.72131 30.1603 1.09423 1.71751 | |||
2 33.433 4.72441 29.9286 0.787941 2.66032 | |||
3 33.4073 4.73243 30.0114 0.909674 3.47763 | |||
4 33.4256 4.72166 30.1842 1.1089 4.54367 | |||
5 33.4067 4.72641 30.0411 1.01845 5.66178 | |||
6 33.419 4.73075 29.9056 0.782179 6.14803 | |||
7 33.4248 4.72155 30.1759 1.10382 7.60354 | |||
8 33.4122 4.71554 30.1365 1.07485 7.97222 | |||
9 33.4071 4.73193 30.0329 0.921065 9.07084 | |||
10 33.4165 4.73169 29.9242 0.790843 10.0254 | |||
``` | |||
### Treelet kernel | |||
**The targets of training data are normalized before calculating the kernel.** | |||
``` | |||
@@ -87,7 +129,7 @@ The table below shows the results of the WL subtree under different subtree heig | |||
### Path kernel up to depth *d* | |||
The table below shows the results of the path kernel up to different depth *d*. | |||
The first table shows the results using the *Tanimoto kernel*, where **the targets of training data are normalized before calculating the kernel**.
``` | |||
depth rmse_test std_test rmse_train std_train k_time | |||
------- ----------- ---------- ------------ ----------- --------- | |||
@@ -104,7 +146,7 @@ The first table is the results using Tanimoto kernel, where **The targets of tra | |||
10 19.8708 5.09217 10.7787 2.10002 2.41006 | |||
``` | |||
The second table shows the results using the *MinMax kernel*.
``` | |||
depth rmse_test std_test rmse_train std_train k_time | |||
------- ----------- ---------- ------------ ----------- -------- | |||
@@ -120,3 +162,62 @@ depth rmse_test std_test rmse_train std_train k_time | |||
9 13.1789 5.27707 1.36002 1.84834 1.96545 | |||
10 13.2538 5.26425 1.36208 1.85426 2.24943 | |||
``` | |||
### Tree pattern kernel | |||
Results of the *until-n* tree pattern kernel when h = 2:
``` | |||
lmda rmse_test std_test rmse_train std_train k_time | |||
----------- ----------- ---------- ------------ ----------- -------- | |||
1e-10 7.46524 1.71862 5.99486 0.356634 38.1447 | |||
1e-09 7.37326 1.77195 5.96155 0.374395 37.4921 | |||
1e-08 7.35105 1.78349 5.96481 0.378047 37.9971 | |||
1e-07 7.35213 1.77903 5.96728 0.382251 38.3182 | |||
1e-06 7.3524 1.77992 5.9696 0.3863 39.6428 | |||
1e-05 7.34958 1.78141 5.97114 0.39017 37.3711 | |||
0.0001 7.3513 1.78136 5.94251 0.331843 37.3967 | |||
0.001 7.35822 1.78119 5.9326 0.32534 36.7357 | |||
0.01 7.37552 1.79037 5.94089 0.34763 36.8864 | |||
0.1 7.32951 1.91346 6.42634 1.29405 36.8382 | |||
1 7.27134 2.20774 6.62425 1.2242 37.2425 | |||
10 7.49787 2.36815 6.81697 1.50182 37.8286 | |||
100 7.42887 2.64789 6.68766 1.34809 36.3701 | |||
1000 7.24914 2.65554 6.81906 1.41008 36.1695 | |||
10000 7.08183 2.6248 6.93431 1.38441 37.5723 | |||
100000 8.021 3.43694 8.69813 0.909839 37.8158 | |||
1e+06 8.49625 3.6332 9.59333 0.96626 38.4688 | |||
1e+07 10.9067 3.17593 11.5642 2.07792 36.9926 | |||
1e+08 61.1524 10.4355 65.3527 13.9538 37.1321 | |||
1e+09 99.943 13.6994 98.8848 5.27014 36.7443 | |||
1e+10 100.083 13.8503 97.9168 3.22768 37.096 | |||
``` | |||
### Cyclic pattern kernel | |||
**This kernel is not tested on dataset Acyclic.** Accuracy is reported instead of RMSE, since this kernel is evaluated on classification datasets.
Results on dataset MAO: | |||
``` | |||
cycle_bound accur_test std_test accur_train std_train k_time | |||
------------- ------------ ---------- ------------- ----------- -------- | |||
0 0.642857 0.146385 0.54918 0.0167983 0.187052 | |||
50 0.871429 0.1 0.698361 0.116889 0.300629 | |||
100 0.9 0.111575 0.732787 0.0826366 0.309837 | |||
150 0.9 0.111575 0.732787 0.0826366 0.31808 | |||
200 0.9 0.111575 0.732787 0.0826366 0.317575 | |||
``` | |||
Results on dataset PAH: | |||
``` | |||
cycle_bound accur_test std_test accur_train std_train k_time | |||
------------- ------------ ---------- ------------- ----------- -------- | |||
0 0.61 0.113578 0.629762 0.0135212 0.521801 | |||
10 0.61 0.113578 0.629762 0.0135212 0.52589 | |||
20 0.61 0.113578 0.629762 0.0135212 0.548528 | |||
30 0.64 0.111355 0.633333 0.0157935 0.535311 | |||
40 0.64 0.111355 0.633333 0.0157935 0.61764 | |||
50 0.67 0.09 0.658333 0.0345238 0.733868 | |||
60 0.68 0.107703 0.671429 0.0365769 0.871147 | |||
70 0.67 0.100499 0.666667 0.0380208 1.12625 | |||
80 0.78 0.107703 0.709524 0.0588534 1.19828 | |||
90 0.78 0.107703 0.709524 0.0588534 1.21182 | |||
``` |
@@ -1,3 +1,8 @@ | |||
""" | |||
@author: linlin | |||
@references: Borgwardt KM, Kriegel HP. Shortest-path kernels on graphs. In: Data Mining, Fifth IEEE International Conference on 2005 Nov 27 (pp. 8-pp). IEEE.
""" | |||
import sys | |||
import pathlib | |||
sys.path.insert(0, "../") | |||
@@ -12,7 +17,7 @@ from pygraph.utils.utils import getSPGraph | |||
def spkernel(*args, edge_weight = 'bond_type'): | |||
"""Calculate shortest-path kernels between graphs. | |||
Parameters | |||
---------- | |||
Gn : List of NetworkX graph | |||
@@ -22,51 +27,33 @@ def spkernel(*args, edge_weight = 'bond_type'): | |||
2 graphs between which the kernel is calculated. | |||
edge_weight : string | |||
edge attribute corresponding to the edge weight. The default edge weight is bond_type. | |||
Return | |||
------ | |||
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the sp kernel between 2 graphs.
""" | |||
    Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
    Kmatrix = np.zeros((len(Gn), len(Gn)))

    start_time = time.time()

    Gn = [ getSPGraph(G, edge_weight = edge_weight) for G in Gn ] # get shortest path graphs of Gn

    for i in range(0, len(Gn)):
        for j in range(i, len(Gn)):
            for e1 in Gn[i].edges(data = True):
                for e2 in Gn[j].edges(data = True):
                    if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])):
                        Kmatrix[i][j] += 1
            Kmatrix[j][i] = Kmatrix[i][j]
run_time = time.time() - start_time | |||
print("--- shortest path kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time)) | |||
return Kmatrix, run_time | |||
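A minimal usage sketch (toy graphs; numeric bond_type values are assumed here since this attribute is used as the edge weight, and the import path follows the file name pygraph/kernels/spKernel.py):
```
import networkx as nx
from pygraph.kernels.spKernel import spkernel

G1 = nx.Graph()
G1.add_node(0, atom='C')
G1.add_node(1, atom='O')
G1.add_edge(0, 1, bond_type=1)
G2 = G1.copy()

Kmatrix, run_time = spkernel([G1, G2], edge_weight = 'bond_type')
print(Kmatrix)
```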
@@ -0,0 +1,198 @@ | |||
""" | |||
@author: linlin | |||
@references: Pierre Mahé and Jean-Philippe Vert. Graph kernels based on tree patterns for molecules. Machine learning, 75(1):3–35, 2009. | |||
""" | |||
import sys | |||
import pathlib | |||
sys.path.insert(0, "../") | |||
import time | |||
from collections import Counter | |||
import networkx as nx | |||
import numpy as np | |||
def treepatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, kernel_type = 'untiln', lmda = 1, h = 1): | |||
"""Calculate tree pattern graph kernels between graphs. | |||
Parameters | |||
---------- | |||
Gn : List of NetworkX graph | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
labeled : boolean | |||
Whether the graphs are labeled. The default is True. | |||
    kernel_type : string
        Type of tree pattern kernel, 'untiln', 'size' or 'branching'.
    lmda : float
        Weight parameter of the tree patterns.
    h : integer
        Maximum height of the tree patterns considered.
Return | |||
------ | |||
Kmatrix: Numpy matrix | |||
        Kernel matrix, each element of which is the tree pattern graph kernel between 2 graphs.
""" | |||
if h < 1: | |||
        raise Exception('h > 0 is required.')
kernel_type = kernel_type.lower() | |||
Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) | |||
h = int(h) | |||
start_time = time.time() | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
Kmatrix[i][j] = _treepatternkernel_do(Gn[i], Gn[j], node_label, edge_label, labeled, kernel_type, lmda, h) | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
run_time = time.time() - start_time | |||
print("\n --- kernel matrix of tree pattern kernel of size %d built in %s seconds ---" % (len(Gn), run_time)) | |||
return Kmatrix, run_time | |||
def _treepatternkernel_do(G1, G2, node_label, edge_label, labeled, kernel_type, lmda, h): | |||
"""Calculate tree pattern graph kernels between 2 graphs. | |||
Parameters | |||
---------- | |||
    G1, G2 : NetworkX graphs
        2 graphs between which the kernel is calculated.
    node_label : string
        node attribute used as label.
    edge_label : string
        edge attribute used as label.
    labeled : boolean
        Whether the graphs are labeled.
    kernel_type : string
        Type of tree pattern kernel, 'untiln', 'size' or 'branching'.
    lmda : float
        Weight parameter of the tree patterns.
    h : integer
        Maximum height of the tree patterns considered.

    Return
    ------
    kernel : float
        Tree pattern kernel between 2 graphs.
""" | |||
def matchingset(n1, n2): | |||
"""Get neiborhood matching set of two nodes in two graphs. | |||
""" | |||
def mset_com(allpairs, length): | |||
"""Find all sets R of pairs by combination. | |||
""" | |||
if length == 1: | |||
mset = [ [pair] for pair in allpairs ] | |||
return mset, mset | |||
else: | |||
mset, mset_l = mset_com(allpairs, length - 1) | |||
mset_tmp = [] | |||
for pairset in mset_l: # for each pair set of length l-1 | |||
nodeset1 = [ pair[0] for pair in pairset ] # nodes already in the set | |||
nodeset2 = [ pair[1] for pair in pairset ] | |||
for pair in allpairs: | |||
if (pair[0] not in nodeset1) and (pair[1] not in nodeset2): # nodes in R should be unique | |||
mset_tmp.append(pairset + [pair]) # add this pair to the pair set of length l-1, constructing a new set of length l | |||
nodeset1.append(pair[0]) | |||
nodeset2.append(pair[1]) | |||
mset.extend(mset_tmp) | |||
return mset, mset_tmp | |||
        allpairs = [] # all pairs that have the same node and edge labels
for neighbor1 in G1[n1]: | |||
for neighbor2 in G2[n2]: | |||
if G1.node[neighbor1][node_label] == G2.node[neighbor2][node_label] \ | |||
and G1[n1][neighbor1][edge_label] == G2[n2][neighbor2][edge_label]: | |||
allpairs.append([neighbor1, neighbor2]) | |||
if allpairs != []: | |||
mset, _ = mset_com(allpairs, len(allpairs)) | |||
else: | |||
mset = [] | |||
return mset | |||
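        # Illustrative example (not from the original source): if
        # allpairs = [[a, x], [b, y]], mset_com enumerates the R-sets
        # [[a, x]], [[b, y]], [[a, x], [b, y]] and [[b, y], [a, x]],
        # i.e. every combination of pairwise-disjoint neighbour pairs
        # whose node and edge labels match.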
def kernel_h(h): | |||
"""Calculate kernel of h-th iteration. | |||
""" | |||
if kernel_type == 'untiln': | |||
            all_kh = { str(n1) + '.' + str(n2) : (G1.node[n1][node_label] == G2.node[n2][node_label]) \
                for n1 in G1.nodes() for n2 in G2.nodes() } # kernels between all pairs of nodes when h = 1
all_kh_tmp = all_kh.copy() | |||
for i in range(2, h + 1): | |||
for n1 in G1.nodes(): | |||
for n2 in G2.nodes(): | |||
kh = 0 | |||
mset = all_msets[str(n1) + '.' + str(n2)] | |||
for R in mset: | |||
kh_tmp = 1 | |||
for pair in R: | |||
kh_tmp *= lmda * all_kh[str(pair[0]) + '.' + str(pair[1])] | |||
kh += 1 / lmda * kh_tmp | |||
kh = (G1.node[n1][node_label] == G2.node[n2][node_label]) * (1 + kh) | |||
all_kh_tmp[str(n1) + '.' + str(n2)] = kh | |||
all_kh = all_kh_tmp.copy() | |||
elif kernel_type == 'size': | |||
            all_kh = { str(n1) + '.' + str(n2) : lmda * (G1.node[n1][node_label] == G2.node[n2][node_label]) \
                for n1 in G1.nodes() for n2 in G2.nodes() } # kernels between all pairs of nodes when h = 1
all_kh_tmp = all_kh.copy() | |||
for i in range(2, h + 1): | |||
for n1 in G1.nodes(): | |||
for n2 in G2.nodes(): | |||
kh = 0 | |||
mset = all_msets[str(n1) + '.' + str(n2)] | |||
for R in mset: | |||
kh_tmp = 1 | |||
for pair in R: | |||
kh_tmp *= lmda * all_kh[str(pair[0]) + '.' + str(pair[1])] | |||
kh += kh_tmp | |||
kh *= lmda * (G1.node[n1][node_label] == G2.node[n2][node_label]) | |||
all_kh_tmp[str(n1) + '.' + str(n2)] = kh | |||
all_kh = all_kh_tmp.copy() | |||
elif kernel_type == 'branching': | |||
            all_kh = { str(n1) + '.' + str(n2) : (G1.node[n1][node_label] == G2.node[n2][node_label]) \
                for n1 in G1.nodes() for n2 in G2.nodes() } # kernels between all pairs of nodes when h = 1
all_kh_tmp = all_kh.copy() | |||
for i in range(2, h + 1): | |||
for n1 in G1.nodes(): | |||
for n2 in G2.nodes(): | |||
kh = 0 | |||
mset = all_msets[str(n1) + '.' + str(n2)] | |||
for R in mset: | |||
kh_tmp = 1 | |||
for pair in R: | |||
kh_tmp *= lmda * all_kh[str(pair[0]) + '.' + str(pair[1])] | |||
kh += 1 / lmda * kh_tmp | |||
kh *= (G1.node[n1][node_label] == G2.node[n2][node_label]) | |||
all_kh_tmp[str(n1) + '.' + str(n2)] = kh | |||
all_kh = all_kh_tmp.copy() | |||
return all_kh | |||
# calculate matching sets for every pair of nodes at first to avoid calculating in every iteration. | |||
all_msets = ({ str(node1) + '.' + str(node2) : matchingset(node1, node2) for node1 in G1.nodes() \ | |||
for node2 in G2.nodes() } if h > 1 else {}) | |||
all_kh = kernel_h(h) | |||
kernel = sum(all_kh.values()) | |||
if kernel_type == 'size': | |||
kernel = kernel / (lmda ** h) | |||
return kernel |
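A minimal usage sketch (toy graphs with the 'atom'/'bond_type' attribute conventions used throughout this package):
```
import networkx as nx
from pygraph.kernels.treePatternKernel import treepatternkernel

G1 = nx.Graph()
G1.add_node(0, atom='C')
G1.add_node(1, atom='O')
G1.add_edge(0, 1, bond_type='1')
G2 = G1.copy()

Kmatrix, run_time = treepatternkernel([G1, G2], kernel_type = 'untiln', lmda = 1, h = 2)
print(Kmatrix)
```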
@@ -1,3 +1,8 @@ | |||
""" | |||
@author: linlin | |||
@references: Gaüzère B, Brun L, Villemin D. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters. 2012 Nov 1;33(15):2038-47. | |||
""" | |||
import sys | |||
import pathlib | |||
sys.path.insert(0, "../") | |||
@@ -12,7 +17,7 @@ import numpy as np | |||
def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True): | |||
"""Calculate treelet graph kernels between graphs. | |||
Parameters | |||
---------- | |||
Gn : List of NetworkX graph | |||
@@ -26,7 +31,7 @@ def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled | |||
edge attribute used as label. The default edge label is bond_type. | |||
labeled : boolean | |||
Whether the graphs are labeled. The default is True. | |||
Return | |||
------ | |||
Kmatrix/kernel : Numpy matrix/float | |||
@@ -37,11 +42,11 @@ def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) | |||
start_time = time.time() | |||
# get all canonical keys of all graphs before calculating kernels to save time, but this may cost a lot of memory for large dataset. | |||
canonkeys = [ get_canonkeys(Gn[i], node_label = node_label, edge_label = edge_label, labeled = labeled) \ | |||
for i in range(0, len(Gn)) ] | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
Kmatrix[i][j] = _treeletkernel_do(canonkeys[i], canonkeys[j], node_label = node_label, edge_label = edge_label, labeled = labeled) | |||
@@ -49,7 +54,7 @@ def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled | |||
run_time = time.time() - start_time | |||
print("\n --- treelet kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time)) | |||
return Kmatrix, run_time | |||
else: # for only 2 graphs | |||
@@ -112,10 +117,6 @@ def get_canonkeys(G, node_label = 'atom', edge_label = 'bond_type', labeled = Tr | |||
------ | |||
canonkey/canonkey_l : dict | |||
        For unlabeled graphs, canonkey is a dictionary which records the number of occurrences of every tree pattern. For labeled graphs, canonkey_l is the one which keeps track of the number of occurrences of every treelet.
""" | |||
patterns = {} # a dictionary which consists of lists of patterns for all graphlet. | |||
canonkey = {} # canonical key, a dictionary which records amount of every tree pattern. | |||
@@ -126,7 +127,7 @@ def get_canonkeys(G, node_label = 'atom', edge_label = 'bond_type', labeled = Tr | |||
# linear patterns | |||
patterns['0'] = G.nodes() | |||
canonkey['0'] = nx.number_of_nodes(G) | |||
for i in range(1, 6): | |||
    for i in range(1, 6):
patterns[str(i)] = find_all_paths(G, i) | |||
canonkey[str(i)] = len(patterns[str(i)]) | |||
@@ -227,7 +228,7 @@ def get_canonkeys(G, node_label = 'atom', edge_label = 'bond_type', labeled = Tr | |||
for key in canonkey_t: | |||
canonkey_l['0' + key] = canonkey_t[key] | |||
for i in range(1, 6): | |||
        for i in range(1, 6):
treelet = [] | |||
for pattern in patterns[str(i)]: | |||
canonlist = list(chain.from_iterable((G.node[node][node_label], \ | |||
@@ -378,4 +379,4 @@ def find_all_paths(G, length): | |||
all_paths[idx] = [] | |||
break | |||
    return list(filter(lambda a: a != [], all_paths))
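A minimal usage sketch (same toy-graph conventions as the examples above):
```
import networkx as nx
from pygraph.kernels.treeletKernel import treeletkernel

G1 = nx.Graph()
G1.add_node(0, atom='C')
G1.add_node(1, atom='O')
G1.add_edge(0, 1, bond_type='1')
G2 = G1.copy()

Kmatrix, run_time = treeletkernel([G1, G2], node_label = 'atom', edge_label = 'bond_type', labeled = True)
print(Kmatrix)
```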
@@ -1,3 +1,8 @@ | |||
""" | |||
@author: linlin | |||
@references: Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural networks, 18(8):1093–1110, 2005. | |||
""" | |||
import sys | |||
import pathlib | |||
sys.path.insert(0, "../") | |||
@@ -40,7 +45,7 @@ def untildpathkernel(*args, node_label = 'atom', edge_label = 'bond_type', label | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) | |||
start_time = time.time() | |||
# get all paths of all graphs before calculating kernels to save time, but this may cost a lot of memory for large dataset. | |||
all_paths = [ find_all_paths_until_length(Gn[i], depth, node_label = node_label, edge_label = edge_label, labeled = labeled) for i in range(0, len(Gn)) ] | |||
@@ -187,7 +192,7 @@ def find_all_paths(G, length): | |||
all_paths = [] | |||
for node in G: | |||
all_paths.extend(find_paths(G, node, length)) | |||
### The following process is not carried out according to the original article | |||
# all_paths_r = [ path[::-1] for path in all_paths ] | |||
@@ -200,4 +205,4 @@ def find_all_paths(G, length): | |||
# break | |||
# return list(filter(lambda a: a != [], all_paths)) | |||
    return all_paths
@@ -1,13 +1,8 @@ | |||
"""
@author: linlin
@references:
    [1] Shervashidze N, Schweitzer P, Leeuwen EJ, Mehlhorn K, Borgwardt KM. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research. 2011;12(Sep):2539-61.
"""

import sys
import pathlib
sys.path.insert(0, "../")

import networkx as nx
import numpy as np
import time
def weisfeilerlehmankernel(*args, node_label = 'atom', edge_label = 'bond_type', height = 0, base_kernel = 'subtree'): | |||
@@ -38,97 +32,66 @@ def weisfeilerlehmankernel(*args, node_label = 'atom', edge_label = 'bond_type', | |||
height : int | |||
subtree height | |||
base_kernel : string | |||
        base kernel used in each iteration of WL kernel. The default base kernel is subtree kernel. For a user-defined kernel, base_kernel is the name of the base kernel function used in each iteration of WL kernel. This function returns a Numpy matrix, each element of which is the user-defined Weisfeiler-Lehman kernel between 2 graphs.

    Return
    ------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.

    Notes
    -----
    This function now supports WL subtree kernel, WL shortest path kernel and WL edge kernel.
    """
    base_kernel = base_kernel.lower()
    Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
    Kmatrix = np.zeros((len(Gn), len(Gn)))

    start_time = time.time()

    # for WL subtree kernel
    if base_kernel == 'subtree':
        Kmatrix = _wl_subtreekernel_do(Gn, node_label, edge_label, height)

    # for WL shortest path kernel
    elif base_kernel == 'sp':
        Kmatrix = _wl_spkernel_do(Gn, node_label, edge_label, height)

    # for WL edge kernel
    elif base_kernel == 'edge':
        Kmatrix = _wl_edgekernel_do(Gn, node_label, edge_label, height)

    # for user defined base kernel
    else:
        Kmatrix = _wl_userkernel_do(Gn, node_label, edge_label, height, base_kernel)

    run_time = time.time() - start_time
    print("\n --- Weisfeiler-Lehman %s kernel matrix of size %d built in %s seconds ---" % (base_kernel, len(Gn), run_time))

    return Kmatrix, run_time

def _wl_subtreekernel_do(Gn, node_label, edge_label, height): | |||
"""Calculate Weisfeiler-Lehman subtree kernels between graphs. | |||
    Parameters
    ----------
    Gn : List of NetworkX graph
        List of graphs between which the kernels are calculated.
    node_label : string
        node attribute used as label.
    edge_label : string
        edge attribute used as label.
    height : int
        subtree height.

    Return
    ------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
    """
    height = int(height)
    Kmatrix = np.zeros((len(Gn), len(Gn)))
    all_num_of_labels_occured = 0 # number of the set of letters that occur before as node labels at least once in all graphs
@@ -148,9 +111,9 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
num_of_labels = len(num_of_each_label) # number of all unique labels | |||
all_labels_ori.update(labels_ori) | |||
all_num_of_labels_occured += len(all_labels_ori) | |||
# calculate subtree kernel with the 0th iteration and add it to the final kernel | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
@@ -159,17 +122,17 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
vector2 = np.matrix([ (all_num_of_each_label[j][label] if (label in all_num_of_each_label[j].keys()) else 0) for label in labels ]) | |||
Kmatrix[i][j] += np.dot(vector1, vector2.transpose()) | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
# iterate each height | |||
for h in range(1, height + 1): | |||
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration | |||
num_of_labels_occured = all_num_of_labels_occured # number of the set of letters that occur before as node labels at least once in all graphs | |||
all_labels_ori = set() | |||
all_num_of_each_label = [] | |||
# for each graph | |||
for idx, G in enumerate(Gn): | |||
set_multisets = [] | |||
for node in G.nodes(data = True): | |||
# Multiset-label determination. | |||
@@ -190,9 +153,9 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
else: | |||
set_compressed.update({ value : str(num_of_labels_occured + 1) }) | |||
num_of_labels_occured += 1 | |||
all_set_compressed.update(set_compressed) | |||
# relabel nodes | |||
for node in G.nodes(data = True): | |||
node[1][node_label] = set_compressed[set_multisets[node[0]]] | |||
@@ -202,9 +165,9 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
all_labels_ori.update(labels_comp) | |||
num_of_each_label = dict(Counter(labels_comp)) | |||
all_num_of_each_label.append(num_of_each_label) | |||
all_num_of_labels_occured += len(all_labels_ori) | |||
# calculate subtree kernel with h iterations and add it to the final kernel | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
@@ -213,87 +176,228 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
vector2 = np.matrix([ (all_num_of_each_label[j][label] if (label in all_num_of_each_label[j].keys()) else 0) for label in labels ]) | |||
Kmatrix[i][j] += np.dot(vector1, vector2.transpose()) | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
return Kmatrix | |||
def _wl_spkernel_do(Gn, node_label, edge_label, height): | |||
"""Calculate Weisfeiler-Lehman shortest path kernels between graphs. | |||
Parameters | |||
---------- | |||
    Gn : List of NetworkX graph
        List of graphs between which the kernels are calculated.
node_label : string | |||
node attribute used as label. | |||
edge_label : string | |||
edge attribute used as label. | |||
height : int | |||
subtree height. | |||
Return | |||
------ | |||
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
""" | |||
from pygraph.utils.utils import getSPGraph | |||
    # init.
    height = int(height)
    Kmatrix = np.zeros((len(Gn), len(Gn))) # init kernel

    Gn = [ getSPGraph(G, edge_weight = edge_label) for G in Gn ] # get shortest path graphs of Gn
# initial for height = 0 | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
for e1 in Gn[i].edges(data = True): | |||
for e2 in Gn[j].edges(data = True): | |||
if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])): | |||
Kmatrix[i][j] += 1 | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
# iterate each height | |||
for h in range(1, height + 1): | |||
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration | |||
num_of_labels_occured = 0 # number of the set of letters that occur before as node labels at least once in all graphs | |||
for G in Gn: # for each graph | |||
set_multisets = [] | |||
for node in G.nodes(data = True): | |||
# Multiset-label determination. | |||
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ] | |||
# sorting each multiset | |||
multiset.sort() | |||
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix | |||
set_multisets.append(multiset) | |||
# label compression | |||
set_unique = list(set(set_multisets)) # set of unique multiset labels | |||
# a dictionary mapping original labels to new ones. | |||
set_compressed = {} | |||
# if a label occured before, assign its former compressed label, else assign the number of labels occured + 1 as the compressed label | |||
for value in set_unique: | |||
if value in all_set_compressed.keys(): | |||
set_compressed.update({ value : all_set_compressed[value] }) | |||
else: | |||
set_compressed.update({ value : str(num_of_labels_occured + 1) }) | |||
num_of_labels_occured += 1 | |||
all_set_compressed.update(set_compressed) | |||
# relabel nodes | |||
for node in G.nodes(data = True): | |||
node[1][node_label] = set_compressed[set_multisets[node[0]]] | |||
# calculate subtree kernel with h iterations and add it to the final kernel | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
for e1 in Gn[i].edges(data = True): | |||
for e2 in Gn[j].edges(data = True): | |||
if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])): | |||
Kmatrix[i][j] += 1 | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
return Kmatrix | |||
def _wl_edgekernel_do(Gn, node_label, edge_label, height): | |||
"""Calculate Weisfeiler-Lehman edge kernels between graphs. | |||
Parameters | |||
---------- | |||
Gn : List of NetworkX graph | |||
List of graphs between which the kernels are calculated. | |||
node_label : string | |||
node attribute used as label. | |||
edge_label : string | |||
edge attribute used as label. | |||
height : int | |||
subtree height. | |||
Return | |||
------ | |||
Kmatrix : Numpy matrix | |||
Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 praphs. | |||
""" | |||
# init. | |||
height = int(height) | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) # init kernel | |||
# initial for height = 0 | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
for e1 in Gn[i].edges(data = True): | |||
for e2 in Gn[j].edges(data = True): | |||
if e1[2][edge_label] == e2[2][edge_label] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])): | |||
Kmatrix[i][j] += 1 | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
# iterate each height | |||
for h in range(1, height + 1): | |||
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration | |||
num_of_labels_occured = 0 # number of the set of letters that occur before as node labels at least once in all graphs | |||
for G in Gn: # for each graph | |||
set_multisets = [] | |||
for node in G.nodes(data = True): | |||
# Multiset-label determination. | |||
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ] | |||
# sorting each multiset | |||
multiset.sort() | |||
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix | |||
set_multisets.append(multiset) | |||
# label compression | |||
set_unique = list(set(set_multisets)) # set of unique multiset labels | |||
# a dictionary mapping original labels to new ones. | |||
set_compressed = {} | |||
# if a label occured before, assign its former compressed label, else assign the number of labels occured + 1 as the compressed label | |||
for value in set_unique: | |||
if value in all_set_compressed.keys(): | |||
set_compressed.update({ value : all_set_compressed[value] }) | |||
else: | |||
set_compressed.update({ value : str(num_of_labels_occured + 1) }) | |||
num_of_labels_occured += 1 | |||
all_set_compressed.update(set_compressed) | |||
# relabel nodes | |||
for node in G.nodes(data = True): | |||
node[1][node_label] = set_compressed[set_multisets[node[0]]] | |||
# calculate subtree kernel with h iterations and add it to the final kernel | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
for e1 in Gn[i].edges(data = True): | |||
for e2 in Gn[j].edges(data = True): | |||
if e1[2][edge_label] == e2[2][edge_label] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])): | |||
Kmatrix[i][j] += 1 | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
return Kmatrix | |||
def _wl_userkernel_do(Gn, node_label, edge_label, height, base_kernel): | |||
"""Calculate Weisfeiler-Lehman kernels based on user-defined kernel between graphs. | |||
Parameters | |||
---------- | |||
    Gn : List of NetworkX graph
        List of graphs between which the kernels are calculated.
    node_label : string
        node attribute used as label.
    edge_label : string
        edge attribute used as label.
    height : int
        subtree height.
    base_kernel : string
        Name of the base kernel function used in each iteration of WL kernel. This function returns a Numpy matrix, each element of which is the user-defined Weisfeiler-Lehman kernel between 2 graphs.

    Return
    ------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
""" | |||
# init. | |||
height = int(height) | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) # init kernel | |||
# initial for height = 0 | |||
Kmatrix = base_kernel(Gn, node_label, edge_label) | |||
# iterate each height | |||
for h in range(1, height + 1): | |||
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration | |||
num_of_labels_occured = 0 # number of the set of letters that occur before as node labels at least once in all graphs | |||
for G in Gn: # for each graph | |||
set_multisets = [] | |||
for node in G.nodes(data = True): | |||
# Multiset-label determination. | |||
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ] | |||
# sorting each multiset | |||
multiset.sort() | |||
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix | |||
set_multisets.append(multiset) | |||
# label compression | |||
set_unique = list(set(set_multisets)) # set of unique multiset labels | |||
# a dictionary mapping original labels to new ones. | |||
set_compressed = {} | |||
# if a label occured before, assign its former compressed label, else assign the number of labels occured + 1 as the compressed label | |||
for value in set_unique: | |||
if value in all_set_compressed.keys(): | |||
set_compressed.update({ value : all_set_compressed[value] }) | |||
else: | |||
set_compressed.update({ value : str(num_of_labels_occured + 1) }) | |||
num_of_labels_occured += 1 | |||
all_set_compressed.update(set_compressed) | |||
# relabel nodes | |||
for node in G.nodes(data = True): | |||
node[1][node_label] = set_compressed[set_multisets[node[0]]] | |||
# calculate kernel with h iterations and add it to the final kernel | |||
Kmatrix += base_kernel(Gn, node_label, edge_label) | |||
return Kmatrix |
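A minimal usage sketch of the dispatcher above ('subtree' is the default base kernel; toy graphs as in the earlier examples):
```
import networkx as nx
from pygraph.kernels.weisfeilerLehmanKernel import weisfeilerlehmankernel

G1 = nx.Graph()
G1.add_node(0, atom='C')
G1.add_node(1, atom='O')
G1.add_edge(0, 1, bond_type='1')
G2 = G1.copy()

Kmatrix, run_time = weisfeilerlehmankernel([G1, G2], node_label = 'atom', edge_label = 'bond_type', height = 1, base_kernel = 'subtree')
print(Kmatrix)
```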
@@ -3,7 +3,7 @@ | |||
def loadCT(filename): | |||
"""load data from .ct file. | |||
Notes | |||
------ | |||
a typical example of data in .ct is like this: | |||
@@ -33,12 +33,17 @@ def loadCT(filename): | |||
tmp = content[i + 2].split(" ") | |||
tmp = [x for x in tmp if x != ''] | |||
g.add_node(i, atom=tmp[3], label=tmp[3]) | |||
    for i in range(0, nb_edges):
        tmp = content[i + g.number_of_nodes() + 2].split(" ")
        tmp = [x for x in tmp if x != '']
        g.add_edge(int(tmp[0]) - 1, int(tmp[1]) - 1,
                   bond_type=tmp[3].strip(), label=tmp[3].strip())
return g | |||
@@ -101,7 +106,57 @@ def saveGXL(graph, filename): | |||
tree.write(filename) | |||
def loadSDF(filename): | |||
"""load data from structured data file (.sdf file). | |||
Notes | |||
------ | |||
    An SDF file contains a group of molecules, each represented in a way similar to the MOL format.
    See http://www.nonlinear.com/progenesis/sdf-studio/v0.9/faq/sdf-file-format-guidance.aspx, 2018 for the detailed structure.
""" | |||
import networkx as nx | |||
from os.path import basename | |||
from tqdm import tqdm | |||
import sys | |||
data = [] | |||
with open(filename) as f: | |||
content = f.read().splitlines() | |||
index = 0 | |||
pbar = tqdm(total = len(content) + 1, desc = 'load SDF', file=sys.stdout) | |||
while index < len(content): | |||
index_old = index | |||
g = nx.Graph(name=content[index].strip()) # set name of the graph | |||
tmp = content[index + 3] | |||
nb_nodes = int(tmp[:3]) # number of the nodes | |||
nb_edges = int(tmp[3:6]) # number of the edges | |||
for i in range(0, nb_nodes): | |||
tmp = content[i + index + 4] | |||
g.add_node(i, atom=tmp[31:34].strip()) | |||
for i in range(0, nb_edges): | |||
tmp = content[i + index + g.number_of_nodes() + 4] | |||
tmp = [tmp[i:i+3] for i in range(0, len(tmp), 3)] | |||
g.add_edge(int(tmp[0]) - 1, int(tmp[1]) - 1, bond_type=tmp[2].strip()) | |||
data.append(g) | |||
index += 4 + g.number_of_nodes() + g.number_of_edges() | |||
            while content[index].strip() != '$$$$': # separator between molecules
index += 1 | |||
index += 1 | |||
pbar.update(index - index_old) | |||
pbar.update(1) | |||
pbar.close() | |||
return data | |||
def loadDataset(filename, filename_y = ''): | |||
"""load file list of the dataset. | |||
""" | |||
from os.path import dirname, splitext | |||
@@ -128,5 +183,28 @@ def loadDataset(filename): | |||
mol_class = graph.attrib['class'] | |||
data.append(loadGXL(dirname_dataset + '/' + mol_filename)) | |||
y.append(mol_class) | |||
elif extension == "sdf": | |||
import numpy as np | |||
from tqdm import tqdm | |||
import sys | |||
data = loadSDF(filename) | |||
y_raw = open(filename_y).read().splitlines() | |||
y_raw.pop(0) | |||
tmp0 = [] | |||
tmp1 = [] | |||
for i in range(0, len(y_raw)): | |||
tmp = y_raw[i].split(',') | |||
tmp0.append(tmp[0]) | |||
tmp1.append(tmp[1].strip()) | |||
y = [] | |||
        for i in tqdm(range(0, len(data)), desc = 'adjust data', file=sys.stdout):
try: | |||
y.append(tmp1[tmp0.index(data[i].name)].strip()) | |||
except ValueError: # if data[i].name not in tmp0 | |||
data[i] = [] | |||
data = list(filter(lambda a: a != [], data)) | |||
return data, y |
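A minimal usage sketch of the new SDF branch (the paths are the NCI-HIV ones used by run_cyclic.py below; adjust them to your local dataset layout):
```
from pygraph.utils.graphfiles import loadDataset

data, y = loadDataset('../datasets/NCI-HIV/AIDO99SD.sdf',
                      filename_y = '../datasets/NCI-HIV/aids_conc_may04.txt')
print(len(data), data[0].name)
```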
@@ -1,5 +1,6 @@ | |||
import networkx as nx | |||
import numpy as np | |||
from tqdm import tqdm | |||
def getSPLengths(G1): | |||
@@ -58,21 +59,15 @@ def floydTransformation(G, edge_weight = 'bond_type'): | |||
S = nx.Graph() | |||
S.add_nodes_from(G.nodes(data=True)) | |||
for i in range(0, G.number_of_nodes()): | |||
        for j in range(i, G.number_of_nodes()):
S.add_edge(i, j, cost = spMatrix[i, j]) | |||
return S | |||
def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, trials = 100, splits = 10, alpha_grid = None, C_grid = None, hyper_name = '', hyper_range = [1], normalize = False, datafile_y = '', model_type = 'regression'):
"""Perform training and testing for a kernel method. Print out neccessary data during the process then finally the results. | |||
Parameters | |||
---------- | |||
datafile : string | |||
@@ -96,12 +91,14 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria | |||
hyper_range : list | |||
Range of the hyperparameter. | |||
    normalize : boolean
        Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.
    model_type : string
        Type of the problem: 'regression' or 'classification'. The default is 'regression'.
References | |||
---------- | |||
[1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1 | |||
Examples | |||
-------- | |||
>>> import sys | |||
@@ -113,29 +110,41 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria | |||
>>> kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True) | |||
>>> kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True) | |||
""" | |||
import os | |||
import pathlib | |||
from collections import OrderedDict | |||
from tabulate import tabulate | |||
from .graphfiles import loadDataset | |||
# setup the parameters | |||
    model_type = model_type.lower()
if model_type != 'regression' and model_type != 'classification': | |||
        raise Exception('The model type is incorrect! Please choose from regression or classification.')
print('\n --- This is a %s problem ---' % model_type) | |||
    alpha_grid = np.logspace(-10, 10, num = trials, base = 10) if alpha_grid is None else alpha_grid # corresponds to (2*C)^-1 in other linear models such as LogisticRegression
    C_grid = np.logspace(-10, 10, num = trials, base = 10) if C_grid is None else C_grid
if not os.path.exists(kernel_file_path): | |||
os.makedirs(kernel_file_path) | |||
train_means_list = [] | |||
train_stds_list = [] | |||
test_means_list = [] | |||
test_stds_list = [] | |||
kernel_time_list = [] | |||
for hyper_para in hyper_range: | |||
        print('' if hyper_name == '' else '\n\n #--- calculating kernel matrix when', hyper_name, '=', hyper_para, '---#')
print('\n Loading dataset from file...') | |||
        dataset, y = loadDataset(datafile, filename_y = datafile_y)
y = np.array(y) | |||
# print(y) | |||
# normalize labels and transform non-numerical labels to numerical labels. | |||
if model_type == 'classification': | |||
from sklearn.preprocessing import LabelEncoder | |||
y = LabelEncoder().fit_transform(y) | |||
# print(y) | |||
# save kernel matrices to files / read kernel matrices from files | |||
kernel_file = kernel_file_path + 'km.ds' | |||
@@ -152,7 +161,7 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria | |||
Kmatrix, run_time = kernel_func(dataset, **kernel_para) | |||
kernel_time_list.append(run_time) | |||
print(Kmatrix) | |||
# print('\n Saving kernel matrix to file...') | |||
# np.savetxt(kernel_file, Kmatrix) | |||
""" | |||
@@ -170,25 +179,29 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria | |||
test_stds_list.append(test_std) | |||
print('\n') | |||
if model_type == 'regression': | |||
table_dict = {'rmse_test': test_means_list, 'std_test': test_stds_list, \ | |||
'rmse_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list} | |||
if hyper_name == '': | |||
keyorder = ['rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time'] | |||
else: | |||
table_dict[hyper_name] = hyper_range | |||
keyorder = [hyper_name, 'rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time'] | |||
elif model_type == 'classification': | |||
table_dict = {'accur_test': test_means_list, 'std_test': test_stds_list, \ | |||
'accur_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list} | |||
if hyper_name == '': | |||
keyorder = ['accur_test', 'std_test', 'accur_train', 'std_train', 'k_time'] | |||
else: | |||
table_dict[hyper_name] = hyper_range | |||
keyorder = [hyper_name, 'accur_test', 'std_test', 'accur_train', 'std_train', 'k_time'] | |||
print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i:keyorder.index(i[0]))), headers='keys')) | |||
def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, trials = 100, model_type = 'regression', normalize = False): | |||
"""Split dataset to training and testing splits, train and test. Print out and return the results. | |||
Parameters | |||
---------- | |||
Kmatrix : Numpy matrix | |||
@@ -206,8 +219,8 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
model_type : string | |||
Determine whether it is a regression or classification problem. The default is 'regression'. | |||
    normalize : boolean
        Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.
Return | |||
------ | |||
train_mean : float | |||
@@ -218,19 +231,27 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
mean of the best tests. | |||
test_std : float | |||
mean of test stds in the same trial with the best test mean. | |||
References | |||
---------- | |||
[1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1 | |||
""" | |||
import random | |||
from sklearn.kernel_ridge import KernelRidge # 0.17 | |||
from sklearn.metrics import accuracy_score, mean_squared_error | |||
from sklearn import svm | |||
datasize = len(train_target) | |||
random.seed(20) # Set the seed for uniform parameter distribution | |||
# Initialize the performance of the best parameter trial on train with the corresponding performance on test | |||
train_split = [] | |||
test_split = [] | |||
# For each split of the data | |||
    print('\n Starting to calculate accuracy/RMSE...')
import sys | |||
pbar = tqdm(total = splits * trials, desc = 'calculate performance', file=sys.stdout) | |||
for j in range(10, 10 + splits): | |||
# print('\n Starting split %d...' % j) | |||
@@ -255,7 +276,7 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
# Split the targets | |||
y_train = y_perm[0:num_train] | |||
# Normalization step (for real valued targets only) | |||
if normalize == True and model_type == 'regression': | |||
@@ -275,7 +296,6 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
if model_type == 'regression': | |||
# Fit the kernel ridge model | |||
KR = KernelRidge(kernel = 'precomputed', alpha = alpha_grid[i]) | |||
# KR = svm.SVR(kernel = 'precomputed', C = C_grid[i]) | |||
KR.fit(Kmatrix_train, y_train if normalize == False else y_train_norm) | |||
# predict on the train and test set | |||
@@ -284,15 +304,33 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
# adjust prediction: needed because the training targets have been normalized | |||
if normalize == True: | |||
                    y_pred_train = y_pred_train * float(y_train_std) + y_train_mean
                    y_pred_test = y_pred_test * float(y_train_std) + y_train_mean
# root mean squared error on train set | |||
accuracy_train = np.sqrt(mean_squared_error(y_train, y_pred_train)) | |||
perf_all_train.append(accuracy_train) | |||
# root mean squared error on test set | |||
accuracy_test = np.sqrt(mean_squared_error(y_test, y_pred_test)) | |||
perf_all_test.append(accuracy_test) | |||
            # For classification use SVM
elif model_type == 'classification': | |||
KR = svm.SVC(kernel = 'precomputed', C = C_grid[i]) | |||
KR.fit(Kmatrix_train, y_train) | |||
# predict on the train and test set | |||
y_pred_train = KR.predict(Kmatrix_train) | |||
y_pred_test = KR.predict(Kmatrix_test) | |||
# accuracy on train set | |||
accuracy_train = accuracy_score(y_train, y_pred_train) | |||
perf_all_train.append(accuracy_train) | |||
# accuracy on test set | |||
accuracy_test = accuracy_score(y_test, y_pred_test) | |||
perf_all_test.append(accuracy_test) | |||
pbar.update(1) | |||
# --- FIND THE OPTIMAL PARAMETERS --- # | |||
# For regression: minimise the mean squared error | |||
@@ -306,6 +344,17 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
perf_train_opt = perf_all_train[min_idx] | |||
perf_test_opt = perf_all_test[min_idx] | |||
# For classification: maximise the accuracy | |||
if model_type == 'classification': | |||
# get optimal parameter on test (argmax accuracy) | |||
max_idx = np.argmax(perf_all_test) | |||
C_opt = C_grid[max_idx] | |||
# corresponding performance on train and test set for the same parameter | |||
perf_train_opt = perf_all_train[max_idx] | |||
perf_test_opt = perf_all_test[max_idx] | |||
        # append the corresponding performance on the train and test set
train_split.append(perf_train_opt) | |||
test_split.append(perf_test_opt) | |||
@@ -322,5 +371,5 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
print('With standard deviation: %3f' % train_std) | |||
print('\n Mean performance on test set: %3f' % test_mean) | |||
print('With standard deviation: %3f' % test_std) | |||
    return train_mean, train_std, test_mean, test_std
@@ -0,0 +1,16 @@ | |||
import sys | |||
sys.path.insert(0, "../") | |||
from pygraph.utils.utils import kernel_train_test | |||
from pygraph.kernels.cyclicPatternKernel import cyclicpatternkernel | |||
import numpy as np | |||
datafile = '../../../../datasets/NCI-HIV/AIDO99SD.sdf' | |||
datafile_y = '../../../../datasets/NCI-HIV/aids_conc_may04.txt' | |||
kernel_file_path = 'kernelmatrices_path_acyclic/' | |||
kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True) | |||
kernel_train_test(datafile, kernel_file_path, cyclicpatternkernel, kernel_para, \ | |||
hyper_name = 'cycle_bound', hyper_range = np.linspace(0, 1000, 21), normalize = False, \ | |||
datafile_y = datafile_y, model_type = 'classification') |