
copied: notebooks/run_weisfeilerLehmankernel_acyclic.ipynb -> .ipynb_checkpoints/run_weisfeilerLehmankernel_acyclic-checkpoint.ipynb

modified:         README.md
	new file:         notebooks/run_cyclicpatternkernel.ipynb
	modified:         notebooks/run_marginalizedkernel_acyclic.ipynb
	modified:         notebooks/run_pathkernel_acyclic.ipynb
	modified:         notebooks/run_spkernel_acyclic.ipynb
	modified:         notebooks/run_treeletkernel_acyclic.ipynb
	new file:         notebooks/run_treepatternkernel.ipynb
	modified:         notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
	new file:         pygraph/kernels/cyclicPatternKernel.py
	modified:         pygraph/kernels/deltaKernel.py
	modified:         pygraph/kernels/pathKernel.py
	modified:         pygraph/kernels/results.md
	modified:         pygraph/kernels/spKernel.py
	new file:         pygraph/kernels/treePatternKernel.py
	modified:         pygraph/kernels/treeletKernel.py
	modified:         pygraph/kernels/untildPathKernel.py
	modified:         pygraph/kernels/weisfeilerLehmanKernel.py
	modified:         pygraph/utils/graphfiles.py
	modified:         pygraph/utils/utils.py
	new file:         run_cyclic.py
v0.1
jajupmochi 7 years ago
commit 0d82ebe6b8
39 changed files with 13833 additions and 2466 deletions
  1. +2015 -0  .ipynb_checkpoints/run_weisfeilerLehmankernel_acyclic-checkpoint.ipynb
  2. +21 -11  README.md
  3. +936 -0  notebooks/.ipynb_checkpoints/run_cyclicpatternkernel-checkpoint.ipynb
  4. +149 -0  notebooks/.ipynb_checkpoints/run_marginalizedkernel_acyclic-checkpoint.ipynb
  5. +11 -13  notebooks/.ipynb_checkpoints/run_pathkernel_acyclic-checkpoint.ipynb
  6. +15 -17  notebooks/.ipynb_checkpoints/run_spkernel_acyclic-checkpoint.ipynb
  7. +66 -60  notebooks/.ipynb_checkpoints/run_treeletkernel_acyclic-checkpoint.ipynb
  8. +3191 -0  notebooks/.ipynb_checkpoints/run_treepatternkernel-checkpoint.ipynb
  9. +778 -995  notebooks/.ipynb_checkpoints/run_weisfeilerLehmankernel_acyclic-checkpoint.ipynb
  10. +175 -0  notebooks/.ipynb_checkpoints/test_lib-checkpoint.ipynb
  11. +1252 -0  notebooks/run_cyclicpatternkernel.ipynb
  12. +149 -0  notebooks/run_marginalizedkernel_acyclic.ipynb
  13. +13 -12  notebooks/run_pathkernel_acyclic.ipynb
  14. +15 -17  notebooks/run_spkernel_acyclic.ipynb
  15. +66 -60  notebooks/run_treeletkernel_acyclic.ipynb
  16. +3191 -0  notebooks/run_treepatternkernel.ipynb
  17. +766 -981  notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
  18. BIN  pygraph/kernels/__pycache__/cyclicPatternKernel.cpython-35.pyc
  19. BIN  pygraph/kernels/__pycache__/deltaKernel.cpython-35.pyc
  20. BIN  pygraph/kernels/__pycache__/marginalizedKernel.cpython-35.pyc
  21. BIN  pygraph/kernels/__pycache__/pathKernel.cpython-35.pyc
  22. BIN  pygraph/kernels/__pycache__/spKernel.cpython-35.pyc
  23. BIN  pygraph/kernels/__pycache__/treePatternKernel.cpython-35.pyc
  24. BIN  pygraph/kernels/__pycache__/treeletKernel.cpython-35.pyc
  25. BIN  pygraph/kernels/__pycache__/weisfeilerLehmanKernel.cpython-35.pyc
  26. +147 -0  pygraph/kernels/cyclicPatternKernel.py
  27. +4 -4  pygraph/kernels/deltaKernel.py
  28. +60 -22  pygraph/kernels/pathKernel.py
  29. +112 -11  pygraph/kernels/results.md
  30. +28 -41  pygraph/kernels/spKernel.py
  31. +198 -0  pygraph/kernels/treePatternKernel.py
  32. +13 -12  pygraph/kernels/treeletKernel.py
  33. +8 -3  pygraph/kernels/untildPathKernel.py
  34. +257 -153  pygraph/kernels/weisfeilerLehmanKernel.py
  35. BIN  pygraph/utils/__pycache__/graphfiles.cpython-35.pyc
  36. BIN  pygraph/utils/__pycache__/utils.cpython-35.pyc
  37. +84 -6  pygraph/utils/graphfiles.py
  38. +97 -48  pygraph/utils/utils.py
  39. +16 -0  run_cyclic.py

+ 2015 - 0  .ipynb_checkpoints/run_weisfeilerLehmankernel_acyclic-checkpoint.ipynb
File diff suppressed because it is too large


+ 21 - 11  README.md

@@ -11,26 +11,30 @@ A python package for graph kernels.
* tabulate - 0.8.2

## Results with minimal test RMSE for each kernel on dataset Acyclic
All kernels are tested on dataset Acyclic, which consists of 185 molecules (graphs).

All kernels except for the Cyclic pattern kernel are tested on dataset Acyclic, which consists of 185 molecules (graphs). (The Cyclic pattern kernel is tested on the MAO and PAH datasets.)

The prediction methods used are SVM for classification and kernel ridge regression for regression.

For prediction we randomly split the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the training data, then evaluate the performance on the test set. We choose the parameters optimal for the test set and report the corresponding performance. The final results are the average of the performances over the test sets, as sketched in the code example below this diff.

| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|---------------|:-------:|:------:|-------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 14.00 | 6.93 | - | 36.21" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.59" |

| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|------------------|:-------:|:------:|------------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 18.41 | 10.78 | - | 29.43" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| WL shortest path | 35.16 | 4.50 | height = 2 | 40.24" |
| WL edge | 33.41 | 4.73 | height = 5 | 5.66" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.59" |
| Tree pattern | 7.27 | 2.21 | lamda = 1, h = 2 | 37.24" |
| Cyclic pattern | 0.9 | 0.11 | cycle bound = 100 | 0.31" |
* RMSE stands for arithmetic mean of the root mean squared errors on all splits.
* STD stands for standard deviation of the root mean squared errors on all splits.
* Parameter is the one with which the kernel achieves the best results.
* k_time is the time spent on building the kernel matrix.
* The targets of training data are normalized before calculating *path kernel* and *treelet kernel*.
* The targets of training data are normalized before calculating *treelet kernel*.
* See detail results in [results.md](pygraph/kernels/results.md).

## References
@@ -44,6 +48,12 @@ For prediction we randomly split the data into train and test subsets, where 90% of

[5] Gaüzère B, Brun L, Villemin D. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters. 2012 Nov 1;33(15):2038-47.

[6] Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural networks, 18(8):1093–1110, 2005.

[7] Pierre Mahé and Jean-Philippe Vert. Graph kernels based on tree patterns for molecules. Machine learning, 75(1):3–35, 2009.

[8] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 158–167. ACM, 2004.

## Updates
### 2018.01.24
* ADD *path kernel up to depth d* and its result on dataset Acyclic.
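
The repository drives the protocol described above through `pygraph.utils.utils.kernel_train_test` (used in the notebooks further down this page). As a minimal sketch of the same 90/10, 10-split kernel ridge regression loop on a precomputed kernel matrix, assuming scikit-learn is available (the function name, `alpha`, and `seed` below are illustrative, not repository API):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

def evaluate_kernel(K, y, n_splits=10, train_ratio=0.9, alpha=1e-3, seed=0):
    """Average test RMSE of kernel ridge regression over random 90/10 splits,
    given a precomputed n x n kernel matrix K and targets y."""
    rng = np.random.RandomState(seed)
    n = len(y)
    rmses = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        n_train = int(train_ratio * n)
        tr, te = perm[:n_train], perm[n_train:]
        model = KernelRidge(alpha=alpha, kernel='precomputed')
        model.fit(K[np.ix_(tr, tr)], y[tr])        # train-vs-train kernel block
        y_pred = model.predict(K[np.ix_(te, tr)])  # test-vs-train kernel block
        rmses.append(np.sqrt(mean_squared_error(y[te], y_pred)))
    return np.mean(rmses), np.std(rmses)           # the RMSE and STD columns above
```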


+ 936 - 0  notebooks/.ipynb_checkpoints/run_cyclicpatternkernel-checkpoint.ipynb
File diff suppressed because it is too large


+ 149 - 0  notebooks/.ipynb_checkpoints/run_marginalizedkernel_acyclic-checkpoint.ipynb

@@ -364,6 +364,155 @@
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- marginalized kernel matrix of size 185 built in 1133.0229969024658 seconds ---\n",
"[[ 0.0287062 0.0124634 0.00444444 ..., 0.00606061 0.00606061\n",
" 0.00606061]\n",
" [ 0.0124634 0.01108958 0.00333333 ..., 0.00454545 0.00454545\n",
" 0.00454545]\n",
" [ 0.00444444 0.00333333 0.0287062 ..., 0.00819912 0.00819912\n",
" 0.00975875]\n",
" ..., \n",
" [ 0.00606061 0.00454545 0.00819912 ..., 0.02846735 0.02836907\n",
" 0.02896354]\n",
" [ 0.00606061 0.00454545 0.00819912 ..., 0.02836907 0.02831424\n",
" 0.0288712 ]\n",
" [ 0.00606061 0.00454545 0.00975875 ..., 0.02896354 0.0288712\n",
" 0.02987915]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 12.186285\n",
"With standard deviation: 7.038988\n",
"\n",
" Mean performance on test set: 18.024312\n",
"With standard deviation: 6.292466\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 18.0243 6.29247 12.1863 7.03899 1133.02\n"
]
}
],
"source": [
"%load_ext line_profiler\n",
"\n",
"import numpy as np\n",
"import sys\n",
"sys.path.insert(0, \"../\")\n",
"from pygraph.utils.utils import kernel_train_test\n",
"from pygraph.kernels.marginalizedKernel import marginalizedkernel, _marginalizedkernel_do\n",
"\n",
"datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'\n",
"kernel_file_path = 'kernelmatrices_weisfeilerlehman_subtree_acyclic/'\n",
"\n",
"kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', itr = 20, p_quit = 0.1)\n",
"\n",
"# kernel_train_test(datafile, kernel_file_path, marginalizedkernel, kernel_para, \\\n",
"# hyper_name = 'p_quit', hyper_range = np.linspace(0.1, 0.9, 9), normalize = False)\n",
"\n",
"%lprun -f _marginalizedkernel_do \\\n",
" kernel_train_test(datafile, kernel_file_path, marginalizedkernel, kernel_para, \\\n",
" normalize = False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Timer unit: 1e-06 s\n",
"\n",
"Total time: 828.879 s\n",
"File: ../pygraph/kernels/marginalizedKernel.py\n",
"Function: _marginalizedkernel_do at line 67\n",
"\n",
"Line # Hits Time Per Hit % Time Line Contents\n",
"==============================================================\n",
" 67 def _marginalizedkernel_do(G1, G2, node_label, edge_label, p_quit, itr):\n",
" 68 \"\"\"Calculate marginalized graph kernel between 2 graphs.\n",
" 69 \n",
" 70 Parameters\n",
" 71 ----------\n",
" 72 G1, G2 : NetworkX graphs\n",
" 73 2 graphs between which the kernel is calculated.\n",
" 74 node_label : string\n",
" 75 node attribute used as label.\n",
" 76 edge_label : string\n",
" 77 edge attribute used as label.\n",
" 78 p_quit : integer\n",
" 79 the termination probability in the random walks generating step.\n",
" 80 itr : integer\n",
" 81 time of iterations to calculate R_inf.\n",
" 82 \n",
" 83 Return\n",
" 84 ------\n",
" 85 kernel : float\n",
" 86 Marginalized Kernel between 2 graphs.\n",
" 87 \"\"\"\n",
" 88 # init parameters\n",
" 89 17205 12886.0 0.7 0.0 kernel = 0\n",
" 90 17205 52542.0 3.1 0.0 num_nodes_G1 = nx.number_of_nodes(G1)\n",
" 91 17205 28240.0 1.6 0.0 num_nodes_G2 = nx.number_of_nodes(G2)\n",
" 92 17205 15595.0 0.9 0.0 p_init_G1 = 1 / num_nodes_G1 # the initial probability distribution in the random walks generating step (uniform distribution over |G|)\n",
" 93 17205 11587.0 0.7 0.0 p_init_G2 = 1 / num_nodes_G2\n",
" 94 \n",
" 95 17205 11663.0 0.7 0.0 q = p_quit * p_quit\n",
" 96 17205 10728.0 0.6 0.0 r1 = q\n",
" 97 \n",
" 98 # initial R_inf\n",
" 99 17205 38412.0 2.2 0.0 R_inf = np.zeros([num_nodes_G1, num_nodes_G2]) # matrix to save all the R_inf for all pairs of nodes\n",
" 100 \n",
" 101 # calculate R_inf with a simple interative method\n",
" 102 344100 329235.0 1.0 0.0 for i in range(1, itr):\n",
" 103 326895 900354.0 2.8 0.1 R_inf_new = np.zeros([num_nodes_G1, num_nodes_G2])\n",
" 104 326895 2287346.0 7.0 0.3 R_inf_new.fill(r1)\n",
" 105 \n",
" 106 # calculate R_inf for each pair of nodes\n",
" 107 2653464 3667117.0 1.4 0.4 for node1 in G1.nodes(data = True):\n",
" 108 2326569 7522840.0 3.2 0.9 neighbor_n1 = G1[node1[0]]\n",
" 109 2326569 3492118.0 1.5 0.4 p_trans_n1 = (1 - p_quit) / len(neighbor_n1) # the transition probability distribution in the random walks generating step (uniform distribution over the vertices adjacent to the current vertex)\n",
" 110 24024379 27775021.0 1.2 3.4 for node2 in G2.nodes(data = True):\n",
" 111 21697810 69471941.0 3.2 8.4 neighbor_n2 = G2[node2[0]]\n",
" 112 21697810 32446626.0 1.5 3.9 p_trans_n2 = (1 - p_quit) / len(neighbor_n2) \n",
" 113 \n",
" 114 59095092 52545370.0 0.9 6.3 for neighbor1 in neighbor_n1:\n",
" 115 104193150 92513935.0 0.9 11.2 for neighbor2 in neighbor_n2:\n",
" 116 \n",
" 117 t = p_trans_n1 * p_trans_n2 * \\\n",
" 118 66795868 285324518.0 4.3 34.4 deltakernel(G1.node[neighbor1][node_label] == G2.node[neighbor2][node_label]) * \\\n",
" 119 66795868 137934393.0 2.1 16.6 deltakernel(neighbor_n1[neighbor1][edge_label] == neighbor_n2[neighbor2][edge_label])\n",
" 120 66795868 106834143.0 1.6 12.9 R_inf_new[node1[0]][node2[0]] += t * R_inf[neighbor1][neighbor2] # ref [1] equation (8)\n",
" 121 \n",
" 122 326895 1123677.0 3.4 0.1 R_inf[:] = R_inf_new\n",
" 123 \n",
" 124 # add elements of R_inf up and calculate kernel\n",
" 125 139656 330283.0 2.4 0.0 for node1 in G1.nodes(data = True):\n",
" 126 1264441 1435263.0 1.1 0.2 for node2 in G2.nodes(data = True): \n",
" 127 1141990 1377134.0 1.2 0.2 s = p_init_G1 * p_init_G2 * deltakernel(node1[1][node_label] == node2[1][node_label])\n",
" 128 1141990 1375456.0 1.2 0.2 kernel += s * R_inf[node1[0]][node2[0]] # ref [1] equation (6)\n",
" 129 \n",
" 130 17205 10801.0 0.6 0.0 return kernel"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false


+ 11 - 13  notebooks/.ipynb_checkpoints/run_pathkernel_acyclic-checkpoint.ipynb

@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@@ -15,13 +15,11 @@
" --- This is a regression problem ---\n",
"\n",
"\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- mean average path kernel matrix of size 185 built in 132.2242877483368 seconds ---\n",
" --- mean average path kernel matrix of size 185 built in 29.430902242660522 seconds ---\n",
"[[ 0.55555556 0.22222222 0. ..., 0. 0. 0. ]\n",
" [ 0.22222222 0.27777778 0. ..., 0. 0. 0. ]\n",
" [ 0. 0. 0.55555556 ..., 0.03030303 0.03030303\n",
@@ -36,16 +34,16 @@
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 3.761907\n",
"With standard deviation: 0.702594\n",
" Mean performance on train set: 3.619948\n",
"With standard deviation: 0.512351\n",
"\n",
" Mean performance on test set: 14.001515\n",
"With standard deviation: 6.936023\n",
" Mean performance on test set: 18.418852\n",
"With standard deviation: 10.781119\n",
"\n",
"\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 14.0015 6.93602 3.76191 0.702594 132.224\n"
" 18.4189 10.7811 3.61995 0.512351 29.4309\n"
]
}
],
@@ -62,10 +60,10 @@
"\n",
"kernel_para = dict(node_label = 'atom', edge_label = 'bond_type')\n",
"\n",
"kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = True)\n",
"kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = False)\n",
"\n",
"# %lprun -f _pathkernel_do \\\n",
"# kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = True)"
"# kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = False)"
]
},
{
@@ -84,7 +82,7 @@
"# without y normalization\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 18.4189 10.7811 3.61995 0.512351 37.0017"
" 18.4189 10.7811 3.61995 0.512351 29.4309"
]
},
{


+ 15 - 17  notebooks/.ipynb_checkpoints/run_spkernel_acyclic-checkpoint.ipynb

@@ -2,44 +2,42 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The line_profiler extension is already loaded. To reload it, use:\n",
" %reload_ext line_profiler\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"--- shortest path kernel matrix of size 185 built in 14.576777696609497 seconds ---\n",
"[[ 3. 1. 3. ..., 1. 1. 1.]\n",
" [ 1. 6. 1. ..., 0. 0. 3.]\n",
" [ 3. 1. 3. ..., 1. 1. 1.]\n",
" ..., \n",
" [ 1. 0. 1. ..., 55. 21. 7.]\n",
" [ 1. 0. 1. ..., 21. 55. 7.]\n",
" [ 1. 3. 1. ..., 7. 7. 55.]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
"--- shortest path kernel matrix of size 185 built in 13.3865065574646 seconds ---\n",
"[[ 3. 1. 3. ... 1. 1. 1.]\n",
" [ 1. 6. 1. ... 0. 0. 3.]\n",
" [ 3. 1. 3. ... 1. 1. 1.]\n",
" ...\n",
" [ 1. 0. 1. ... 55. 21. 7.]\n",
" [ 1. 0. 1. ... 21. 55. 7.]\n",
" [ 1. 3. 1. ... 7. 7. 55.]]\n",
"\n",
" Starting calculate accuracy/rmse...\n",
"calculate performance: 94%|█████████▎| 936/1000 [00:01<00:00, 757.54it/s]\n",
" Mean performance on train set: 28.360361\n",
"With standard deviation: 1.357183\n",
"\n",
" Mean performance on test set: 35.191954\n",
"With standard deviation: 4.495767\n",
"calculate performance: 100%|██████████| 1000/1000 [00:01<00:00, 771.22it/s]\n",
"\n",
"\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 35.192 4.49577 28.3604 1.35718 14.5768\n"
" 35.192 4.49577 28.3604 1.35718 13.3865\n"
]
}
],


+ 66 - 60  notebooks/.ipynb_checkpoints/run_treeletkernel_acyclic-checkpoint.ipynb

@@ -2,15 +2,13 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The line_profiler extension is already loaded. To reload it, use:\n",
" %reload_ext line_profiler\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
@@ -19,68 +17,34 @@
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.48417091369628906 seconds ---\n",
"[[ 4.00000000e+00 2.60653066e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 2.60653066e+00 6.00000000e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 1.00000000e+00 1.00000000e+00 4.00000000e+00 ..., 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ..., \n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.47543811798095703 seconds ---\n",
"[[4.00000000e+00 2.60653066e+00 1.00000000e+00 ... 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [2.60653066e+00 6.00000000e+00 1.00000000e+00 ... 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [1.00000000e+00 1.00000000e+00 4.00000000e+00 ... 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ...\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Starting calculate accuracy/rmse...\n",
"calculate performance: 98%|█████████▊| 983/1000 [00:01<00:00, 796.45it/s]\n",
" Mean performance on train set: 2.688029\n",
"With standard deviation: 1.541623\n",
"\n",
" Mean performance on test set: 10.099738\n",
"With standard deviation: 5.035844\n",
"calculate performance: 100%|██████████| 1000/1000 [00:01<00:00, 745.11it/s]\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.0997 5.03584 2.68803 1.54162 0.484171\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.5003015995025635 seconds ---\n",
"[[ 4.00000000e+00 2.60653066e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 2.60653066e+00 6.00000000e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 1.00000000e+00 1.00000000e+00 4.00000000e+00 ..., 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ..., \n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 2.908869\n",
"With standard deviation: 1.267900\n",
"\n",
" Mean performance on test set: 8.307902\n",
"With standard deviation: 3.378376\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.3079 3.37838 2.90887 1.2679 0.500302\n"
" 10.0997 5.03584 2.68803 1.54162 0.475438\n"
]
}
],
@@ -99,8 +63,6 @@
"\n",
"kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = False)\n",
"\n",
"kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True)\n",
"\n",
"# %lprun -f treeletkernel \\\n",
"# kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = False)"
]
@@ -121,14 +83,58 @@
"# without y normalization\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.0997 5.03584 2.68803 1.54162 0.484171"
" 10.0997 5.03584 2.68803 1.54162 0.484171\n",
"\n",
" \n",
"\n",
"# G0 -> WL subtree h = 0\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 13.9223 2.88611 13.373 0.653301 0.186731\n",
"\n",
"# G0 U G1 U G6 U G8 U G13 -> WL subtree h = 1\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.97706 2.90771 6.7343 1.17505 0.223171\n",
" \n",
"# all patterns \\ { G3 U G4 U G5 U G10 } -> WL subtree h = 2 \n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 7.31274 1.96289 3.73909 0.406267 0.294902\n",
"\n",
"# all patterns \\ { G4 U G5 } -> WL subtree h = 3\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.39977 2.78309 3.8606 1.58686 0.348912\n",
"\n",
"# all patterns \\ { G5 } \n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 9.47647 4.22113 3.18029 1.5669 0.423638\n",
" \n",
" \n",
" \n",
"# G0, -> WL subtree h = 0\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 13.9223 2.88611 13.373 0.653301 0.186731 \n",
" \n",
"# G0 U G1 U G2 U G6 U G8 U G13 -> WL subtree h = 1\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.62431 2.54327 5.63422 0.255002 0.290797\n",
" \n",
"# all patterns \\ { G5 U G10 } -> WL subtree h = 2\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.1294 3.50275 3.69664 1.55116 0.418498"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
"scrolled": true
},
"outputs": [
{


+ 3191 - 0  notebooks/.ipynb_checkpoints/run_treepatternkernel-checkpoint.ipynb
File diff suppressed because it is too large


+ 778 - 995  notebooks/.ipynb_checkpoints/run_weisfeilerLehmankernel_acyclic-checkpoint.ipynb
File diff suppressed because it is too large


+ 175 - 0  notebooks/.ipynb_checkpoints/test_lib-checkpoint.ipynb
File diff suppressed because it is too large


+ 1252 - 0  notebooks/run_cyclicpatternkernel.ipynb
File diff suppressed because it is too large


+ 149 - 0  notebooks/run_marginalizedkernel_acyclic.ipynb

@@ -364,6 +364,155 @@
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- marginalized kernel matrix of size 185 built in 1133.0229969024658 seconds ---\n",
"[[ 0.0287062 0.0124634 0.00444444 ..., 0.00606061 0.00606061\n",
" 0.00606061]\n",
" [ 0.0124634 0.01108958 0.00333333 ..., 0.00454545 0.00454545\n",
" 0.00454545]\n",
" [ 0.00444444 0.00333333 0.0287062 ..., 0.00819912 0.00819912\n",
" 0.00975875]\n",
" ..., \n",
" [ 0.00606061 0.00454545 0.00819912 ..., 0.02846735 0.02836907\n",
" 0.02896354]\n",
" [ 0.00606061 0.00454545 0.00819912 ..., 0.02836907 0.02831424\n",
" 0.0288712 ]\n",
" [ 0.00606061 0.00454545 0.00975875 ..., 0.02896354 0.0288712\n",
" 0.02987915]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 12.186285\n",
"With standard deviation: 7.038988\n",
"\n",
" Mean performance on test set: 18.024312\n",
"With standard deviation: 6.292466\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 18.0243 6.29247 12.1863 7.03899 1133.02\n"
]
}
],
"source": [
"%load_ext line_profiler\n",
"\n",
"import numpy as np\n",
"import sys\n",
"sys.path.insert(0, \"../\")\n",
"from pygraph.utils.utils import kernel_train_test\n",
"from pygraph.kernels.marginalizedKernel import marginalizedkernel, _marginalizedkernel_do\n",
"\n",
"datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'\n",
"kernel_file_path = 'kernelmatrices_weisfeilerlehman_subtree_acyclic/'\n",
"\n",
"kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', itr = 20, p_quit = 0.1)\n",
"\n",
"# kernel_train_test(datafile, kernel_file_path, marginalizedkernel, kernel_para, \\\n",
"# hyper_name = 'p_quit', hyper_range = np.linspace(0.1, 0.9, 9), normalize = False)\n",
"\n",
"%lprun -f _marginalizedkernel_do \\\n",
" kernel_train_test(datafile, kernel_file_path, marginalizedkernel, kernel_para, \\\n",
" normalize = False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Timer unit: 1e-06 s\n",
"\n",
"Total time: 828.879 s\n",
"File: ../pygraph/kernels/marginalizedKernel.py\n",
"Function: _marginalizedkernel_do at line 67\n",
"\n",
"Line # Hits Time Per Hit % Time Line Contents\n",
"==============================================================\n",
" 67 def _marginalizedkernel_do(G1, G2, node_label, edge_label, p_quit, itr):\n",
" 68 \"\"\"Calculate marginalized graph kernel between 2 graphs.\n",
" 69 \n",
" 70 Parameters\n",
" 71 ----------\n",
" 72 G1, G2 : NetworkX graphs\n",
" 73 2 graphs between which the kernel is calculated.\n",
" 74 node_label : string\n",
" 75 node attribute used as label.\n",
" 76 edge_label : string\n",
" 77 edge attribute used as label.\n",
" 78 p_quit : integer\n",
" 79 the termination probability in the random walks generating step.\n",
" 80 itr : integer\n",
" 81 time of iterations to calculate R_inf.\n",
" 82 \n",
" 83 Return\n",
" 84 ------\n",
" 85 kernel : float\n",
" 86 Marginalized Kernel between 2 graphs.\n",
" 87 \"\"\"\n",
" 88 # init parameters\n",
" 89 17205 12886.0 0.7 0.0 kernel = 0\n",
" 90 17205 52542.0 3.1 0.0 num_nodes_G1 = nx.number_of_nodes(G1)\n",
" 91 17205 28240.0 1.6 0.0 num_nodes_G2 = nx.number_of_nodes(G2)\n",
" 92 17205 15595.0 0.9 0.0 p_init_G1 = 1 / num_nodes_G1 # the initial probability distribution in the random walks generating step (uniform distribution over |G|)\n",
" 93 17205 11587.0 0.7 0.0 p_init_G2 = 1 / num_nodes_G2\n",
" 94 \n",
" 95 17205 11663.0 0.7 0.0 q = p_quit * p_quit\n",
" 96 17205 10728.0 0.6 0.0 r1 = q\n",
" 97 \n",
" 98 # initial R_inf\n",
" 99 17205 38412.0 2.2 0.0 R_inf = np.zeros([num_nodes_G1, num_nodes_G2]) # matrix to save all the R_inf for all pairs of nodes\n",
" 100 \n",
" 101 # calculate R_inf with a simple interative method\n",
" 102 344100 329235.0 1.0 0.0 for i in range(1, itr):\n",
" 103 326895 900354.0 2.8 0.1 R_inf_new = np.zeros([num_nodes_G1, num_nodes_G2])\n",
" 104 326895 2287346.0 7.0 0.3 R_inf_new.fill(r1)\n",
" 105 \n",
" 106 # calculate R_inf for each pair of nodes\n",
" 107 2653464 3667117.0 1.4 0.4 for node1 in G1.nodes(data = True):\n",
" 108 2326569 7522840.0 3.2 0.9 neighbor_n1 = G1[node1[0]]\n",
" 109 2326569 3492118.0 1.5 0.4 p_trans_n1 = (1 - p_quit) / len(neighbor_n1) # the transition probability distribution in the random walks generating step (uniform distribution over the vertices adjacent to the current vertex)\n",
" 110 24024379 27775021.0 1.2 3.4 for node2 in G2.nodes(data = True):\n",
" 111 21697810 69471941.0 3.2 8.4 neighbor_n2 = G2[node2[0]]\n",
" 112 21697810 32446626.0 1.5 3.9 p_trans_n2 = (1 - p_quit) / len(neighbor_n2) \n",
" 113 \n",
" 114 59095092 52545370.0 0.9 6.3 for neighbor1 in neighbor_n1:\n",
" 115 104193150 92513935.0 0.9 11.2 for neighbor2 in neighbor_n2:\n",
" 116 \n",
" 117 t = p_trans_n1 * p_trans_n2 * \\\n",
" 118 66795868 285324518.0 4.3 34.4 deltakernel(G1.node[neighbor1][node_label] == G2.node[neighbor2][node_label]) * \\\n",
" 119 66795868 137934393.0 2.1 16.6 deltakernel(neighbor_n1[neighbor1][edge_label] == neighbor_n2[neighbor2][edge_label])\n",
" 120 66795868 106834143.0 1.6 12.9 R_inf_new[node1[0]][node2[0]] += t * R_inf[neighbor1][neighbor2] # ref [1] equation (8)\n",
" 121 \n",
" 122 326895 1123677.0 3.4 0.1 R_inf[:] = R_inf_new\n",
" 123 \n",
" 124 # add elements of R_inf up and calculate kernel\n",
" 125 139656 330283.0 2.4 0.0 for node1 in G1.nodes(data = True):\n",
" 126 1264441 1435263.0 1.1 0.2 for node2 in G2.nodes(data = True): \n",
" 127 1141990 1377134.0 1.2 0.2 s = p_init_G1 * p_init_G2 * deltakernel(node1[1][node_label] == node2[1][node_label])\n",
" 128 1141990 1375456.0 1.2 0.2 kernel += s * R_inf[node1[0]][node2[0]] # ref [1] equation (6)\n",
" 129 \n",
" 130 17205 10801.0 0.6 0.0 return kernel"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false


+ 13 - 12  notebooks/run_pathkernel_acyclic.ipynb

@@ -2,23 +2,24 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The line_profiler extension is already loaded. To reload it, use:\n",
" %reload_ext line_profiler\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- mean average path kernel matrix of size 185 built in 45.52756929397583 seconds ---\n",
" --- mean average path kernel matrix of size 185 built in 29.430902242660522 seconds ---\n",
"[[ 0.55555556 0.22222222 0. ..., 0. 0. 0. ]\n",
" [ 0.22222222 0.27777778 0. ..., 0. 0. 0. ]\n",
" [ 0. 0. 0.55555556 ..., 0.03030303 0.03030303\n",
@@ -33,16 +34,16 @@
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 3.761907\n",
"With standard deviation: 0.702594\n",
" Mean performance on train set: 3.619948\n",
"With standard deviation: 0.512351\n",
"\n",
" Mean performance on test set: 14.001515\n",
"With standard deviation: 6.936023\n",
" Mean performance on test set: 18.418852\n",
"With standard deviation: 10.781119\n",
"\n",
"\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 14.0015 6.93602 3.76191 0.702594 45.5276\n"
" 18.4189 10.7811 3.61995 0.512351 29.4309\n"
]
}
],
@@ -59,10 +60,10 @@
"\n",
"kernel_para = dict(node_label = 'atom', edge_label = 'bond_type')\n",
"\n",
"kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = True)\n",
"kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = False)\n",
"\n",
"# %lprun -f _pathkernel_do \\\n",
"# kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = True)"
"# kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = False)"
]
},
{
@@ -81,7 +82,7 @@
"# without y normalization\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 18.4189 10.7811 3.61995 0.512351 37.0017"
" 18.4189 10.7811 3.61995 0.512351 29.4309"
]
},
{


+ 15 - 17  notebooks/run_spkernel_acyclic.ipynb

@@ -2,44 +2,42 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The line_profiler extension is already loaded. To reload it, use:\n",
" %reload_ext line_profiler\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"--- shortest path kernel matrix of size 185 built in 14.576777696609497 seconds ---\n",
"[[ 3. 1. 3. ..., 1. 1. 1.]\n",
" [ 1. 6. 1. ..., 0. 0. 3.]\n",
" [ 3. 1. 3. ..., 1. 1. 1.]\n",
" ..., \n",
" [ 1. 0. 1. ..., 55. 21. 7.]\n",
" [ 1. 0. 1. ..., 21. 55. 7.]\n",
" [ 1. 3. 1. ..., 7. 7. 55.]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
"--- shortest path kernel matrix of size 185 built in 13.3865065574646 seconds ---\n",
"[[ 3. 1. 3. ... 1. 1. 1.]\n",
" [ 1. 6. 1. ... 0. 0. 3.]\n",
" [ 3. 1. 3. ... 1. 1. 1.]\n",
" ...\n",
" [ 1. 0. 1. ... 55. 21. 7.]\n",
" [ 1. 0. 1. ... 21. 55. 7.]\n",
" [ 1. 3. 1. ... 7. 7. 55.]]\n",
"\n",
" Starting calculate accuracy/rmse...\n",
"calculate performance: 94%|█████████▎| 936/1000 [00:01<00:00, 757.54it/s]\n",
" Mean performance on train set: 28.360361\n",
"With standard deviation: 1.357183\n",
"\n",
" Mean performance on test set: 35.191954\n",
"With standard deviation: 4.495767\n",
"calculate performance: 100%|██████████| 1000/1000 [00:01<00:00, 771.22it/s]\n",
"\n",
"\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 35.192 4.49577 28.3604 1.35718 14.5768\n"
" 35.192 4.49577 28.3604 1.35718 13.3865\n"
]
}
],
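
The spKernel.py diff itself is not shown on this page, but the integer-valued matrix above is consistent with the classical shortest-path kernel of Borgwardt and Kriegel (2005), which counts pairs of node pairs whose shortest-path distances match (a delta kernel on path lengths). A sketch of that standard construction, not necessarily identical to the repository's implementation (e.g. it ignores node labels and assumes connected graphs with comparable node identifiers):

```python
import networkx as nx

def sp_kernel(G1, G2):
    """Count pairs of shortest paths with equal length between two graphs
    (delta kernel on shortest-path distances). Illustrative only."""
    d1 = dict(nx.all_pairs_shortest_path_length(G1))
    d2 = dict(nx.all_pairs_shortest_path_length(G2))
    kernel = 0
    for u1 in G1.nodes():
        for v1 in G1.nodes():
            if u1 < v1:  # each unordered pair in G1 once
                for u2 in G2.nodes():
                    for v2 in G2.nodes():
                        if u2 < v2 and d1[u1][v1] == d2[u2][v2]:
                            kernel += 1
    return kernel
```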


+ 66 - 60  notebooks/run_treeletkernel_acyclic.ipynb

@@ -2,15 +2,13 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The line_profiler extension is already loaded. To reload it, use:\n",
" %reload_ext line_profiler\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
@@ -19,68 +17,34 @@
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.48417091369628906 seconds ---\n",
"[[ 4.00000000e+00 2.60653066e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 2.60653066e+00 6.00000000e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 1.00000000e+00 1.00000000e+00 4.00000000e+00 ..., 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ..., \n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.47543811798095703 seconds ---\n",
"[[4.00000000e+00 2.60653066e+00 1.00000000e+00 ... 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [2.60653066e+00 6.00000000e+00 1.00000000e+00 ... 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [1.00000000e+00 1.00000000e+00 4.00000000e+00 ... 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ...\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Starting calculate accuracy/rmse...\n",
"calculate performance: 98%|█████████▊| 983/1000 [00:01<00:00, 796.45it/s]\n",
" Mean performance on train set: 2.688029\n",
"With standard deviation: 1.541623\n",
"\n",
" Mean performance on test set: 10.099738\n",
"With standard deviation: 5.035844\n",
"calculate performance: 100%|██████████| 1000/1000 [00:01<00:00, 745.11it/s]\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.0997 5.03584 2.68803 1.54162 0.484171\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.5003015995025635 seconds ---\n",
"[[ 4.00000000e+00 2.60653066e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 2.60653066e+00 6.00000000e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 1.00000000e+00 1.00000000e+00 4.00000000e+00 ..., 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ..., \n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 2.908869\n",
"With standard deviation: 1.267900\n",
"\n",
" Mean performance on test set: 8.307902\n",
"With standard deviation: 3.378376\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.3079 3.37838 2.90887 1.2679 0.500302\n"
" 10.0997 5.03584 2.68803 1.54162 0.475438\n"
]
}
],
@@ -99,8 +63,6 @@
"\n",
"kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = False)\n",
"\n",
"kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True)\n",
"\n",
"# %lprun -f treeletkernel \\\n",
"# kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = False)"
]
@@ -121,14 +83,58 @@
"# without y normalization\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.0997 5.03584 2.68803 1.54162 0.484171"
" 10.0997 5.03584 2.68803 1.54162 0.484171\n",
"\n",
" \n",
"\n",
"# G0 -> WL subtree h = 0\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 13.9223 2.88611 13.373 0.653301 0.186731\n",
"\n",
"# G0 U G1 U G6 U G8 U G13 -> WL subtree h = 1\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.97706 2.90771 6.7343 1.17505 0.223171\n",
" \n",
"# all patterns \\ { G3 U G4 U G5 U G10 } -> WL subtree h = 2 \n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 7.31274 1.96289 3.73909 0.406267 0.294902\n",
"\n",
"# all patterns \\ { G4 U G5 } -> WL subtree h = 3\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.39977 2.78309 3.8606 1.58686 0.348912\n",
"\n",
"# all patterns \\ { G5 } \n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 9.47647 4.22113 3.18029 1.5669 0.423638\n",
" \n",
" \n",
" \n",
"# G0, -> WL subtree h = 0\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 13.9223 2.88611 13.373 0.653301 0.186731 \n",
" \n",
"# G0 U G1 U G2 U G6 U G8 U G13 -> WL subtree h = 1\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.62431 2.54327 5.63422 0.255002 0.290797\n",
" \n",
"# all patterns \\ { G5 U G10 } -> WL subtree h = 2\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.1294 3.50275 3.69664 1.55116 0.418498"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
"scrolled": true
},
"outputs": [
{

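Treelet kernel context for the numbers above: ref [5] (Gaüzère et al.) compares the two graphs' treelet-count vectors pattern by pattern, typically with a Gaussian kernel on the counts; the repeated 1.26641655e-14 = exp(-32) entries in the printed matrix are consistent with a Gaussian of gamma = 0.5 on count differences. A hedged sketch of the inner computation, with `canonkey1`/`canonkey2` as hypothetical dicts mapping canonical treelet codes to occurrence counts (the repository's treeletKernel.py may differ in details):

```python
import numpy as np

def treelet_inner(canonkey1, canonkey2, gamma=0.5):
    """Sum a Gaussian kernel over occurrence counts of the treelet patterns
    common to both graphs (after Gaüzère et al., ref [5]). Illustrative only."""
    common = set(canonkey1) & set(canonkey2)
    c1 = np.array([canonkey1[k] for k in common])
    c2 = np.array([canonkey2[k] for k in common])
    return np.sum(np.exp(-gamma * (c1 - c2) ** 2))
```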

+ 3191 - 0  notebooks/run_treepatternkernel.ipynb
File diff suppressed because it is too large


+ 766 - 981  notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
File diff suppressed because it is too large


BIN  pygraph/kernels/__pycache__/cyclicPatternKernel.cpython-35.pyc
BIN  pygraph/kernels/__pycache__/deltaKernel.cpython-35.pyc
BIN  pygraph/kernels/__pycache__/marginalizedKernel.cpython-35.pyc
BIN  pygraph/kernels/__pycache__/pathKernel.cpython-35.pyc
BIN  pygraph/kernels/__pycache__/spKernel.cpython-35.pyc
BIN  pygraph/kernels/__pycache__/treePatternKernel.cpython-35.pyc
BIN  pygraph/kernels/__pycache__/treeletKernel.cpython-35.pyc
BIN  pygraph/kernels/__pycache__/weisfeilerLehmanKernel.cpython-35.pyc


+ 147 - 0  pygraph/kernels/cyclicPatternKernel.py

@@ -0,0 +1,147 @@
"""
@author: linlin <jajupmochi@gmail.com>
@references:
[1] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 158–167. ACM, 2004.
[2] Hopcroft, J.; Tarjan, R. (1973). “Efficient algorithms for graph manipulation”. Communications of the ACM 16: 372–378. doi:10.1145/362248.362272.
[3] Finding all the elementary circuits of a directed graph. D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975. http://dx.doi.org/10.1137/0204007
"""

import sys
import pathlib
sys.path.insert(0, "../")
import time

import networkx as nx
import numpy as np

from tqdm import tqdm


def cyclicpatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, cycle_bound = None):
"""Calculate cyclic pattern graph kernels between graphs.
Parameters
----------
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
/
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
cycle_bound : integer
Upper bound on the number of simple cycles; if it is exceeded, an empty pattern list is returned. The default is None (no bound).

Return
------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the cyclic pattern kernel between 2 graphs.
"""
Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
Kmatrix = np.zeros((len(Gn), len(Gn)))

start_time = time.time()

# get all cyclic and tree patterns of all graphs before calculating kernels to save time; this may consume a lot of memory for large datasets.
all_patterns = [ get_patterns(Gn[i], node_label = node_label, edge_label = edge_label, labeled = labeled, cycle_bound = cycle_bound)
for i in tqdm(range(0, len(Gn)), desc = 'retrieve patterns', file=sys.stdout) ]

for i in tqdm(range(0, len(Gn)), desc = 'calculate kernels', file=sys.stdout):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _cyclicpatternkernel_do(all_patterns[i], all_patterns[j])
Kmatrix[j][i] = Kmatrix[i][j]

run_time = time.time() - start_time
print("\n --- kernel matrix of cyclic pattern kernel of size %d built in %s seconds ---" % (len(Gn), run_time))

return Kmatrix, run_time


def _cyclicpatternkernel_do(patterns1, patterns2):
"""Calculate path graph kernels up to depth d between 2 graphs.

Parameters
----------
paths1, paths2 : list
List of paths in 2 graphs, where for unlabeled graphs, each path is represented by a list of nodes; while for labeled graphs, each path is represented by a string consists of labels of nodes and edges on that path.
k_func : function
A kernel function used using different notions of fingerprint similarity.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.

Return
------
kernel : float
Treelet Kernel between 2 graphs.
"""
return len(set(patterns1) & set(patterns2))


def get_patterns(G, node_label = 'atom', edge_label = 'bond_type', labeled = True, cycle_bound = None):
"""Find all cyclic and tree patterns in a graph.

Parameters
----------
G : NetworkX graphs
The graph in which paths are searched.
length : integer
The maximum length of paths.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.

Return
------
path : list
List of paths retrieved, where for unlabeled graphs, each path is represented by a list of nodes; while for labeled graphs, each path is represented by a string consists of labels of nodes and edges on that path.
"""
number_simplecycles = 0
bridges = nx.Graph()
patterns = []

bicomponents = nx.biconnected_component_subgraphs(G) # all biconnected components of G. this function uses the algorithm in reference [2], which is presumably slightly different from the one used in paper [1]
for subgraph in bicomponents:
if nx.number_of_edges(subgraph) > 1:
simple_cycles = list(nx.simple_cycles(G.to_directed())) # all simple cycles in biconnected components. this function uses the algorithm in reference [3], which has time complexity O((n+e)(N+1)) for n nodes, e edges and N simple cycles, and might be slower than the algorithm applied in paper [1]
if cycle_bound != None and len(simple_cycles) > cycle_bound - number_simplecycles: # in paper [1], when applying another algorithm (subroutine RT), this becomes len(simple_cycles) == cycle_bound - number_simplecycles + 1, check again.
return []
else:

# calculate canonical representation for each simple cycle
all_canonkeys = []
for cycle in simple_cycles:
canonlist = [ G.node[node][node_label] + G[node][cycle[cycle.index(node) + 1]][edge_label] for node in cycle[:-1] ]
canonkey = ''.join(canonlist)
canonkey = canonkey if canonkey < canonkey[::-1] else canonkey[::-1]
for i in range(1, len(cycle[:-1])):
canonlist = [ G.node[node][node_label] + G[node][cycle[cycle.index(node) + 1]][edge_label] for node in cycle[i:-1] + cycle[:i] ]
canonkey_t = ''.join(canonlist)
canonkey_t = canonkey_t if canonkey_t < canonkey_t[::-1] else canonkey_t[::-1]
canonkey = canonkey if canonkey < canonkey_t else canonkey_t
all_canonkeys.append(canonkey)

patterns = list(set(patterns) | set(all_canonkeys))
number_simplecycles += len(simple_cycles)
else:
bridges.add_edges_from(subgraph.edges(data=True))

# calculate canonical representation for each connected component in bridge set
components = list(nx.connected_component_subgraphs(bridges)) # all connected components in the bridge set
tree_patterns = []
for tree in components:
break # TODO: canonical representations of tree patterns are not computed yet

# patterns += pi(bridges)
return patterns
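
The canonical-key loop above makes a cycle's key independent of where the traversal starts and of its direction: among all rotations of the node+edge label string and their reversals, the lexicographically smallest wins. A standalone sketch of that idea (hypothetical labels, none of the NetworkX plumbing above):

```python
def canonical_cycle_key(pairs):
    """Smallest string over all rotations of a cycle's node+edge label
    pairs and over both traversal directions, mirroring the canonkey
    computation above. pairs[i] is node_label + edge_label for node i."""
    keys = []
    for i in range(len(pairs)):
        s = ''.join(pairs[i:] + pairs[:i])  # rotation: start the cycle at node i
        keys.append(min(s, s[::-1]))        # reflection, as in canonkey[::-1] above
    return min(keys)

# a hypothetical labeled triangle: nodes C, N, O with edge labels '1', '2', '1';
# any starting node (or direction) yields the same key:
assert canonical_cycle_key(['C1', 'N2', 'O1']) == canonical_cycle_key(['N2', 'O1', 'C1'])
```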

+ 4 - 4  pygraph/kernels/deltaKernel.py

@@ -1,18 +1,18 @@
def deltakernel(condition):
"""Return 1 if condition holds, 0 otherwise.
Parameters
----------
condition : Boolean
A condition, according to which the kernel is set to 1 or 0.
Return
------
kernel : integer
Delta kernel.
References
----------
[1] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning, Washington, DC, United States, 2003.
"""
return (1 if condition else 0)
return condition #(1 if condition else 0)
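
The change from `return (1 if condition else 0)` to `return condition` works because Python's bool is a subclass of int, so a comparison result already behaves as the 0/1 value the delta kernel needs. A quick illustration (not repository code):

```python
# bool is an int subclass, so truth values participate in arithmetic directly:
assert True == 1 and False == 0
assert isinstance(True, int)

kernel = 0.5 * ('C' == 'C') * ('N' == 'N')  # 0.5, same as with an explicit 1/0
assert kernel == 0.5
```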

+ 60 - 22  pygraph/kernels/pathKernel.py

@@ -1,3 +1,8 @@
"""
@author: linlin
@references: Suard F, Rakotomamonjy A, Bensrhair A. Kernel on Bag of Paths For Measuring Similarity of Shapes. InESANN 2007 Apr 25 (pp. 355-360).
"""

import sys
import pathlib
sys.path.insert(0, "../")
@@ -27,10 +32,6 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
------
Kmatrix/kernel : Numpy matrix/float
Kernel matrix, each element of which is the path kernel between 2 graphs. / Path kernel between 2 graphs.

References
----------
[1] Suard F, Rakotomamonjy A, Bensrhair A. Kernel on Bag of Paths For Measuring Similarity of Shapes. InESANN 2007 Apr 25 (pp. 355-360).
"""
some_graph = args[0][0] if len(args) == 1 else args[0] # only edge attributes of type int or float can be used as edge weight to calculate the shortest paths.
some_weight = list(nx.get_edge_attributes(some_graph, edge_label).values())[0]
@@ -42,9 +43,11 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):

start_time = time.time()

splist = [ get_shortest_paths(Gn[i], weight) for i in range(0, len(Gn)) ]

for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _pathkernel_do(Gn[i], Gn[j], node_label, edge_label, weight = weight)
Kmatrix[i][j] = _pathkernel_do(Gn[i], Gn[j], splist[i], splist[j], node_label, edge_label)
Kmatrix[j][i] = Kmatrix[i][j]

run_time = time.time() - start_time
@@ -55,7 +58,10 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
else: # for only 2 graphs
start_time = time.time()

kernel = _pathkernel_do(args[0], args[1], node_label, edge_label, weight = weight)
sp1 = get_shortest_paths(args[0], weight)
sp2 = get_shortest_paths(args[1], weight)

kernel = _pathkernel_do(args[0], args[1], sp1, sp2, node_label, edge_label)

run_time = time.time() - start_time
print("\n --- mean average path kernel built in %s seconds ---" % (run_time))
@@ -63,19 +69,19 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
return kernel, run_time


def _pathkernel_do(G1, G2, node_label = 'atom', edge_label = 'bond_type', weight = None):
def _pathkernel_do(G1, G2, sp1, sp2, node_label = 'atom', edge_label = 'bond_type'):
"""Calculate mean average path kernel between 2 graphs.

Parameters
----------
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
sp1, sp2 : list of list
List of shortest paths of 2 graphs, where each path is represented by a list of nodes.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
weight : string/None
edge attribute used as weight to calculate the shortest path. The default edge label is None.

Return
------
@@ -83,30 +89,62 @@ def _pathkernel_do(G1, G2, node_label = 'atom', edge_label = 'bond_type', weight
Path Kernel between 2 graphs.
"""
# calculate shortest paths for both graphs
sp1 = []
num_nodes = G1.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp1.append(nx.shortest_path(G1, node1, node2, weight = weight))

sp2 = []
num_nodes = G2.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp2.append(nx.shortest_path(G2, node1, node2, weight = weight))

# calculate kernel
kernel = 0
for path1 in sp1:
for path2 in sp2:
if len(path1) == len(path2):
kernel_path = deltakernel(G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label])
kernel_path = (G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label])
if kernel_path:
for i in range(1, len(path1)):
# kernel = 1 if all corresponding nodes and edges in the 2 paths have same labels, otherwise 0
kernel_path *= deltakernel(G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) * deltakernel(G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
kernel_path *= (G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) \
* (G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
if kernel_path == 0:
break
kernel += kernel_path # add up kernels of all paths

# kernel = 0
# for path1 in sp1:
# for path2 in sp2:
# if len(path1) == len(path2):
# if (G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label]):
# for i in range(1, len(path1)):
# # kernel = 1 if all corresponding nodes and edges in the 2 paths have same labels, otherwise 0
# # kernel_path *= (G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) \
# # * (G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
# # if kernel_path == 0:
# # break
# # kernel += kernel_path # add up kernels of all paths
# if (G1[path1[i - 1]][path1[i]][edge_label] != G2[path2[i - 1]][path2[i]][edge_label]) or \
# (G1.node[path1[i]][node_label] != G2.node[path2[i]][node_label]):
# break
# else:
# kernel += 1

kernel = kernel / (len(sp1) * len(sp2)) # calculate mean average

return kernel

def get_shortest_paths(G, weight):
"""Get all shortest paths of a graph.

Parameters
----------
G : NetworkX graphs
The graph whose shortest paths are calculated.
weight : string/None
edge attribute used as weight to calculate the shortest path.

Return
------
sp : list of list
List of shortest paths of the graph, where each path is represented by a list of nodes.
"""
sp = []
num_nodes = G.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp.append(nx.shortest_path(G, node1, node2, weight = weight))
return sp
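
A small usage sketch of the refactored API above, on two hypothetical three-atom graphs (labels chosen arbitrarily); with this commit's precomputation change, shortest paths are found once per graph instead of once per graph pair:

```python
import networkx as nx

# two hypothetical 3-node molecules carrying the 'atom'/'bond_type' attributes
# pathkernel expects; nodes are integers 0..n-1 because get_shortest_paths
# iterates over range(G.number_of_nodes())
G1 = nx.Graph()
G1.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'C'}), (2, {'atom': 'O'})])
G1.add_edges_from([(0, 1, {'bond_type': 1}), (1, 2, {'bond_type': 1})])

G2 = nx.Graph()
G2.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'}), (2, {'atom': 'N'})])
G2.add_edges_from([(0, 1, {'bond_type': 1}), (1, 2, {'bond_type': 2})])

# from pygraph.kernels.pathKernel import pathkernel
# kernel, run_time = pathkernel(G1, G2)  # mean average path kernel of the pair
```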

+ 112 - 11  pygraph/kernels/results.md

@@ -1,20 +1,26 @@
# Results with minimal test RMSE for each kernel on dataset Acyclic
All kernels are tested on dataset Acyclic, which consists of 185 molecules (graphs).
All kernels except for the Cyclic pattern kernel are tested on dataset Acyclic, which consists of 185 molecules (graphs). (The Cyclic pattern kernel is tested on the MAO and PAH datasets.)

The prediction methods used are SVM for classification and kernel ridge regression for regression.

For prediction we randomly split the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the training data, then evaluate the performance on the test set. We choose the parameters optimal for the test set and report the corresponding performance. The final results are the average of the performances over the test sets.

All the results were produced under Python 3.5.2 on a 64-bit machine with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz, 32GB of memory, and Ubuntu 16.04.3 LTS.

## Summary

| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|---------------|:-------:|:------:|-------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 14.00 | 6.94 | - | 37.58" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.52" |
| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|------------------|:-------:|:------:|------------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 14.00 | 6.94 | - | 37.58" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| WL shortest path | 35.16 | 4.50 | height = 2 | 40.24" |
| WL edge | 33.41 | 4.73 | height = 5 | 5.66" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.52" |
| Tree pattern     | 7.27    | 2.21   | lmda = 1, h = 2   | 37.24" |
| Cyclic pattern | 0.9 | 0.11 | cycle bound = 100 | 0.31" |

* RMSE stands for arithmetic mean of the root mean squared errors on all splits.
* STD stands for standard deviation of the root mean squared errors on all splits.
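* k_time is the time used to build the kernel matrix, where ' denotes minutes and " denotes seconds.
* For the cyclic pattern kernel the task is classification, so its RMSE/STD columns hold the test accuracy and its standard deviation (on dataset MAO) rather than a temperature error.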
@@ -76,6 +82,42 @@ The table below shows the results of the WL subtree under different subtree heig
10 17.1864 4.05672 0.691516 0.564621 5.00918
```

### Weisfeiler-Lehman shortest path kernel
The table below shows the results of the WL shortest path kernel under different subtree heights.
```
height rmse_test std_test rmse_train std_train k_time
-------- ----------- ---------- ------------ ----------- --------
0 35.192 4.49577 28.3604 1.35718 13.5041
1 35.1808 4.50045 27.9335 1.44836 26.8292
2 35.1632 4.50205 28.1113 1.50891 40.2356
3 35.1946 4.49801 28.3903 1.36571 54.6704
4 35.1753 4.50111 27.9746 1.46222 67.1522
5 35.1997 4.5071 28.0184 1.45564 80.0881
6 35.1645 4.49849 28.3731 1.60057 92.1925
7 35.1771 4.5009 27.9604 1.45742 105.812
8 35.1968 4.50526 28.1991 1.5149 119.022
9 35.1956 4.50197 28.2665 1.30769 131.228
10 35.1676 4.49723 28.4163 1.61596 144.964
```

### Weisfeiler-Lehman edge kernel
The table below shows the results of the WL edge kernel under different subtree heights.
```
height rmse_test std_test rmse_train std_train k_time
-------- ----------- ---------- ------------ ----------- ---------
0 33.4077 4.73272 29.9975 0.90234 0.853002
1 33.4235 4.72131 30.1603 1.09423 1.71751
2 33.433 4.72441 29.9286 0.787941 2.66032
3 33.4073 4.73243 30.0114 0.909674 3.47763
4 33.4256 4.72166 30.1842 1.1089 4.54367
5 33.4067 4.72641 30.0411 1.01845 5.66178
6 33.419 4.73075 29.9056 0.782179 6.14803
7 33.4248 4.72155 30.1759 1.10382 7.60354
8 33.4122 4.71554 30.1365 1.07485 7.97222
9 33.4071 4.73193 30.0329 0.921065 9.07084
10 33.4165 4.73169 29.9242 0.790843 10.0254
```

### Treelet kernel
**The targets of training data are normalized before calculating the kernel.**
```
@@ -87,7 +129,7 @@ The table below shows the results of the WL subtree under different subtree heig
### Path kernel up to depth *d*
The table below shows the results of the path kernel up to different depth *d*.

The first table is the results using the *Tanimoto kernel*, where **the targets of training data are normalized before calculating the kernel**.
```
depth rmse_test std_test rmse_train std_train k_time
------- ----------- ---------- ------------ ----------- ---------
@@ -104,7 +146,7 @@ The first table is the results using Tanimoto kernel, where **The targets of tra
10 19.8708 5.09217 10.7787 2.10002 2.41006
```

The second table is the results using the *MinMax kernel*.
```
depth rmse_test std_test rmse_train std_train k_time
------- ----------- ---------- ------------ ----------- --------
@@ -120,3 +162,62 @@ depth rmse_test std_test rmse_train std_train k_time
9 13.1789 5.27707 1.36002 1.84834 1.96545
10 13.2538 5.26425 1.36208 1.85426 2.24943
```


### Tree pattern kernel
Results of the until-N kernel when h = 2:
```
lmda rmse_test std_test rmse_train std_train k_time
----------- ----------- ---------- ------------ ----------- --------
1e-10 7.46524 1.71862 5.99486 0.356634 38.1447
1e-09 7.37326 1.77195 5.96155 0.374395 37.4921
1e-08 7.35105 1.78349 5.96481 0.378047 37.9971
1e-07 7.35213 1.77903 5.96728 0.382251 38.3182
1e-06 7.3524 1.77992 5.9696 0.3863 39.6428
1e-05 7.34958 1.78141 5.97114 0.39017 37.3711
0.0001 7.3513 1.78136 5.94251 0.331843 37.3967
0.001 7.35822 1.78119 5.9326 0.32534 36.7357
0.01 7.37552 1.79037 5.94089 0.34763 36.8864
0.1 7.32951 1.91346 6.42634 1.29405 36.8382
1 7.27134 2.20774 6.62425 1.2242 37.2425
10 7.49787 2.36815 6.81697 1.50182 37.8286
100 7.42887 2.64789 6.68766 1.34809 36.3701
1000 7.24914 2.65554 6.81906 1.41008 36.1695
10000 7.08183 2.6248 6.93431 1.38441 37.5723
100000 8.021 3.43694 8.69813 0.909839 37.8158
1e+06 8.49625 3.6332 9.59333 0.96626 38.4688
1e+07 10.9067 3.17593 11.5642 2.07792 36.9926
1e+08 61.1524 10.4355 65.3527 13.9538 37.1321
1e+09 99.943 13.6994 98.8848 5.27014 36.7443
1e+10 100.083 13.8503 97.9168 3.22768 37.096
```

### Cyclic pattern kernel
**This kernel is not tested on dataset Acyclic.**

Results on dataset MAO:
```
cycle_bound accur_test std_test accur_train std_train k_time
------------- ------------ ---------- ------------- ----------- --------
0 0.642857 0.146385 0.54918 0.0167983 0.187052
50 0.871429 0.1 0.698361 0.116889 0.300629
100 0.9 0.111575 0.732787 0.0826366 0.309837
150 0.9 0.111575 0.732787 0.0826366 0.31808
200 0.9 0.111575 0.732787 0.0826366 0.317575
```

Results on dataset PAH:
```
cycle_bound accur_test std_test accur_train std_train k_time
------------- ------------ ---------- ------------- ----------- --------
0 0.61 0.113578 0.629762 0.0135212 0.521801
10 0.61 0.113578 0.629762 0.0135212 0.52589
20 0.61 0.113578 0.629762 0.0135212 0.548528
30 0.64 0.111355 0.633333 0.0157935 0.535311
40 0.64 0.111355 0.633333 0.0157935 0.61764
50 0.67 0.09 0.658333 0.0345238 0.733868
60 0.68 0.107703 0.671429 0.0365769 0.871147
70 0.67 0.100499 0.666667 0.0380208 1.12625
80 0.78 0.107703 0.709524 0.0588534 1.19828
90 0.78 0.107703 0.709524 0.0588534 1.21182
```

+28 -41 pygraph/kernels/spKernel.py

@@ -1,3 +1,8 @@
"""
@author: linlin
@references: Borgwardt KM, Kriegel HP. Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on, 2005 Nov 27 (pp. 8-pp). IEEE.
"""

import sys
import pathlib
sys.path.insert(0, "../")
@@ -12,7 +17,7 @@ from pygraph.utils.utils import getSPGraph

def spkernel(*args, edge_weight = 'bond_type'):
"""Calculate shortest-path kernels between graphs.
Parameters
----------
Gn : List of NetworkX graph
@@ -22,51 +27,33 @@ def spkernel(*args, edge_weight = 'bond_type'):
2 graphs between which the kernel is calculated.
edge_weight : string
edge attribute corresponding to the edge weight. The default edge weight is bond_type.
Return
------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the sp kernel between 2 graphs.
References
----------
[1] Borgwardt KM, Kriegel HP. Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on, 2005 Nov 27 (pp. 8-pp). IEEE.
"""
if len(args) == 1: # for a list of graphs
Gn = args[0]
Kmatrix = np.zeros((len(Gn), len(Gn)))
Sn = [] # get shortest path graphs of Gn
for i in range(0, len(Gn)):
Sn.append(getSPGraph(Gn[i], edge_weight = edge_weight))
Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
Kmatrix = np.zeros((len(Gn), len(Gn)))

start_time = time.time()

Gn = [ getSPGraph(G, edge_weight = edge_weight) for G in Gn ] # get shortest path graphs of Gn

for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
# kernel_t = [ e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])) \
# for e1 in Sn[i].edges(data = True) for e2 in Sn[j].edges(data = True) ]
# Kmatrix[i][j] = np.sum(kernel_t)
# Kmatrix[j][i] = Kmatrix[i][j]

start_time = time.time()
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
for e1 in Sn[i].edges(data = True):
for e2 in Sn[j].edges(data = True):
if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])):
Kmatrix[i][j] += 1
Kmatrix[j][i] += (0 if i == j else 1)
for e1 in Gn[i].edges(data = True):
for e2 in Gn[j].edges(data = True):
if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])):
Kmatrix[i][j] += 1
Kmatrix[j][i] = Kmatrix[i][j]

run_time = time.time() - start_time
print("--- shortest path kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time))
return Kmatrix, run_time
else: # for only 2 graphs
G1 = getSPGraph(args[0], edge_weight = edge_weight)
G2 = getSPGraph(args[1], edge_weight = edge_weight)
kernel = 0
start_time = time.time()
for e1 in G1.edges(data = True):
for e2 in G2.edges(data = True):
if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])):
kernel += 1
run_time = time.time() - start_time
print("--- shortest path kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time))

# print("--- shortest path kernel built in %s seconds ---" % (time.time() - start_time))
return kernel
return Kmatrix, run_time
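
A minimal usage sketch of the rewritten list interface (toy graphs; it relies on the repository's getSPGraph utility):
```
import sys
sys.path.insert(0, "../")  # as in run_cyclic.py; adjust to where pygraph lives
import networkx as nx
from pygraph.kernels.spKernel import spkernel

# two toy molecules; spkernel only needs the edge weight attribute
G1 = nx.Graph()
G1.add_edge(0, 1, bond_type=1)
G1.add_edge(1, 2, bond_type=1)
G2 = nx.Graph()
G2.add_edge(0, 1, bond_type=1)

Kmatrix, run_time = spkernel([G1, G2])  # 2 x 2 kernel matrix and build time
```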

+198 -0 pygraph/kernels/treePatternKernel.py

@@ -0,0 +1,198 @@
"""
@author: linlin
@references: Pierre Mahé and Jean-Philippe Vert. Graph kernels based on tree patterns for molecules. Machine learning, 75(1):3–35, 2009.
"""

import sys
import pathlib
sys.path.insert(0, "../")
import time

from collections import Counter

import networkx as nx
import numpy as np


def treepatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, kernel_type = 'untiln', lmda = 1, h = 1):
"""Calculate tree pattern graph kernels between graphs.
Parameters
----------
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
/
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
kernel_type : string
Type of tree pattern kernel, with three choices: 'untiln', 'size' and 'branching'. The default is 'untiln'.
lmda : float
Weight λ used in the kernel. The default is 1.
h : integer
Upper bound of the height of tree patterns. The default is 1.

Return
------
Kmatrix: Numpy matrix
Kernel matrix, each element of which is the tree pattern graph kernel between 2 graphs.
"""
if h < 1:
raise Exception('h > 0 is required.')
kernel_type = kernel_type.lower()
Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
Kmatrix = np.zeros((len(Gn), len(Gn)))
h = int(h)

start_time = time.time()

for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _treepatternkernel_do(Gn[i], Gn[j], node_label, edge_label, labeled, kernel_type, lmda, h)
Kmatrix[j][i] = Kmatrix[i][j]

run_time = time.time() - start_time
print("\n --- kernel matrix of tree pattern kernel of size %d built in %s seconds ---" % (len(Gn), run_time))

return Kmatrix, run_time


def _treepatternkernel_do(G1, G2, node_label, edge_label, labeled, kernel_type, lmda, h):
"""Calculate tree pattern graph kernels between 2 graphs.

Parameters
----------
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
kernel_type : string
Type of tree pattern kernel: 'untiln', 'size' or 'branching'.
lmda : float
Weight λ used in the kernel.
h : integer
Upper bound of the height of tree patterns.

Return
------
kernel : float
Tree pattern kernel between 2 graphs.
"""

def matchingset(n1, n2):
"""Get neiborhood matching set of two nodes in two graphs.
"""

def mset_com(allpairs, length):
"""Find all sets R of pairs by combination.
"""
if length == 1:
mset = [ [pair] for pair in allpairs ]
return mset, mset
else:
mset, mset_l = mset_com(allpairs, length - 1)
mset_tmp = []
for pairset in mset_l: # for each pair set of length l-1
nodeset1 = [ pair[0] for pair in pairset ] # nodes already in the set
nodeset2 = [ pair[1] for pair in pairset ]
for pair in allpairs:
if (pair[0] not in nodeset1) and (pair[1] not in nodeset2): # nodes in R should be unique
mset_tmp.append(pairset + [pair]) # add this pair to the pair set of length l-1, constructing a new set of length l
nodeset1.append(pair[0])
nodeset2.append(pair[1])

mset.extend(mset_tmp)

return mset, mset_tmp


allpairs = [] # all pairs that have the same node labels and edge labels
for neighbor1 in G1[n1]:
for neighbor2 in G2[n2]:
if G1.node[neighbor1][node_label] == G2.node[neighbor2][node_label] \
and G1[n1][neighbor1][edge_label] == G2[n2][neighbor2][edge_label]:
allpairs.append([neighbor1, neighbor2])

if allpairs != []:
mset, _ = mset_com(allpairs, len(allpairs))
else:
mset = []

return mset
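# Illustrative example (not part of the original code): if allpairs == [[a, x], [b, x]],
# mset_com yields only [[a, x]] and [[b, x]]; the combined set [[a, x], [b, x]] is
# rejected because both pairs reuse node x of G2, keeping nodes within each set R unique.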


def kernel_h(h):
"""Calculate kernel of h-th iteration.
"""

if kernel_type == 'untiln':
all_kh = { str(n1) + '.' + str(n2) : (G1.node[n1][node_label] == G2.node[n2][node_label]) \
for n1 in G1.nodes() for n2 in G2.nodes() } # kernels between all pairs of nodes when h = 1
all_kh_tmp = all_kh.copy()
for i in range(2, h + 1):
for n1 in G1.nodes():
for n2 in G2.nodes():
kh = 0
mset = all_msets[str(n1) + '.' + str(n2)]
for R in mset:
kh_tmp = 1
for pair in R:
kh_tmp *= lmda * all_kh[str(pair[0]) + '.' + str(pair[1])]
kh += 1 / lmda * kh_tmp
kh = (G1.node[n1][node_label] == G2.node[n2][node_label]) * (1 + kh)
all_kh_tmp[str(n1) + '.' + str(n2)] = kh
all_kh = all_kh_tmp.copy()

elif kernel_type == 'size':
all_kh = { str(n1) + '.' + str(n2) : lmda * (G1.node[n1][node_label] == G2.node[n2][node_label]) \
for n1 in G1.nodes() for n2 in G2.nodes() } # kernels between all pairs of nodes when h = 1
all_kh_tmp = all_kh.copy()
for i in range(2, h + 1):
for n1 in G1.nodes():
for n2 in G2.nodes():
kh = 0
mset = all_msets[str(n1) + '.' + str(n2)]
for R in mset:
kh_tmp = 1
for pair in R:
kh_tmp *= lmda * all_kh[str(pair[0]) + '.' + str(pair[1])]
kh += kh_tmp
kh *= lmda * (G1.node[n1][node_label] == G2.node[n2][node_label])
all_kh_tmp[str(n1) + '.' + str(n2)] = kh
all_kh = all_kh_tmp.copy()

elif kernel_type == 'branching':
all_kh = { str(n1) + '.' + str(n2) : (G1.node[n1][node_label] == G2.node[n2][node_label]) \
for n1 in G1.nodes() for n2 in G2.nodes() } # kernels between all pairs of nodes when h = 1
all_kh_tmp = all_kh.copy()
for i in range(2, h + 1):
for n1 in G1.nodes():
for n2 in G2.nodes():
kh = 0
mset = all_msets[str(n1) + '.' + str(n2)]
for R in mset:
kh_tmp = 1
for pair in R:
kh_tmp *= lmda * all_kh[str(pair[0]) + '.' + str(pair[1])]
kh += 1 / lmda * kh_tmp
kh *= (G1.node[n1][node_label] == G2.node[n2][node_label])
all_kh_tmp[str(n1) + '.' + str(n2)] = kh
all_kh = all_kh_tmp.copy()

return all_kh



# calculate matching sets for every pair of nodes at first to avoid calculating in every iteration.
all_msets = ({ str(node1) + '.' + str(node2) : matchingset(node1, node2) for node1 in G1.nodes() \
for node2 in G2.nodes() } if h > 1 else {})

all_kh = kernel_h(h)
kernel = sum(all_kh.values())

if kernel_type == 'size':
kernel = kernel / (lmda ** h)

return kernel
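
A minimal usage sketch (toy labelled pair; the parameters mirror the best until-N row in results.md, and the networkx version assumed is the one this module targets, which still exposes G.node):
```
import sys
sys.path.insert(0, "../")  # adjust to where pygraph lives
import networkx as nx
from pygraph.kernels.treePatternKernel import treepatternkernel

G1 = nx.Graph()
G1.add_node(0, atom='C')
G1.add_node(1, atom='O')
G1.add_edge(0, 1, bond_type=1)
G2 = G1.copy()

Kmatrix, run_time = treepatternkernel([G1, G2], kernel_type='untiln', lmda=1, h=2)
```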

+13 -12 pygraph/kernels/treeletKernel.py

@@ -1,3 +1,8 @@
"""
@author: linlin
@references: Gaüzère B, Brun L, Villemin D. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters. 2012 Nov 1;33(15):2038-47.
"""

import sys
import pathlib
sys.path.insert(0, "../")
@@ -12,7 +17,7 @@ import numpy as np

def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True):
"""Calculate treelet graph kernels between graphs.
Parameters
----------
Gn : List of NetworkX graph
@@ -26,7 +31,7 @@ def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
Return
------
Kmatrix/kernel : Numpy matrix/float
@@ -37,11 +42,11 @@ def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled
Kmatrix = np.zeros((len(Gn), len(Gn)))

start_time = time.time()
# get all canonical keys of all graphs before calculating kernels to save time, but this may cost a lot of memory for large dataset.
canonkeys = [ get_canonkeys(Gn[i], node_label = node_label, edge_label = edge_label, labeled = labeled) \
for i in range(0, len(Gn)) ]
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _treeletkernel_do(canonkeys[i], canonkeys[j], node_label = node_label, edge_label = edge_label, labeled = labeled)
@@ -49,7 +54,7 @@ def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled

run_time = time.time() - start_time
print("\n --- treelet kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time))
return Kmatrix, run_time
else: # for only 2 graphs
@@ -112,10 +117,6 @@ def get_canonkeys(G, node_label = 'atom', edge_label = 'bond_type', labeled = Tr
------
canonkey/canonkey_l : dict
For unlabeled graphs, canonkey is a dictionary which records the number of occurrences of every tree pattern; for labeled graphs, canonkey_l keeps track of the number of occurrences of every treelet.
References
----------
[1] Gaüzère B, Brun L, Villemin D. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters. 2012 Nov 1;33(15):2038-47.
"""
patterns = {} # a dictionary holding the lists of patterns for all treelets.
canonkey = {} # canonical key, a dictionary recording the number of occurrences of every tree pattern.
@@ -126,7 +127,7 @@ def get_canonkeys(G, node_label = 'atom', edge_label = 'bond_type', labeled = Tr
# linear patterns
patterns['0'] = G.nodes()
canonkey['0'] = nx.number_of_nodes(G)
for i in range(1, 6):
for i in range(1, 6):
patterns[str(i)] = find_all_paths(G, i)
canonkey[str(i)] = len(patterns[str(i)])

@@ -227,7 +228,7 @@ def get_canonkeys(G, node_label = 'atom', edge_label = 'bond_type', labeled = Tr
for key in canonkey_t:
canonkey_l['0' + key] = canonkey_t[key]

for i in range(1, 6):
for i in range(1, 6):
treelet = []
for pattern in patterns[str(i)]:
canonlist = list(chain.from_iterable((G.node[node][node_label], \
@@ -378,4 +379,4 @@ def find_all_paths(G, length):
all_paths[idx] = []
break
return list(filter(lambda a: a != [], all_paths))

+8 -3 pygraph/kernels/untildPathKernel.py

@@ -1,3 +1,8 @@
"""
@author: linlin
@references: Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural networks, 18(8):1093–1110, 2005.
"""

import sys
import pathlib
sys.path.insert(0, "../")
@@ -40,7 +45,7 @@ def untildpathkernel(*args, node_label = 'atom', edge_label = 'bond_type', label
Kmatrix = np.zeros((len(Gn), len(Gn)))

start_time = time.time()
# get all paths of all graphs before calculating kernels to save time, but this may cost a lot of memory for large dataset.
all_paths = [ find_all_paths_until_length(Gn[i], depth, node_label = node_label, edge_label = edge_label, labeled = labeled) for i in range(0, len(Gn)) ]

@@ -187,7 +192,7 @@ def find_all_paths(G, length):
all_paths = []
for node in G:
all_paths.extend(find_paths(G, node, length))
### The following process is not carried out according to the original article
# all_paths_r = [ path[::-1] for path in all_paths ]

@@ -200,4 +205,4 @@ def find_all_paths(G, length):
# break

# return list(filter(lambda a: a != [], all_paths))
return all_paths
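
results.md above reports this kernel with Tanimoto and MinMax options; on path fingerprints these take their standard forms from the cited Ralaivola et al. paper (a sketch, with the set/count representations assumed):
```
def tanimoto(s1, s2):
    """Tanimoto kernel on two sets of paths (binary fingerprints)."""
    return len(s1 & s2) / len(s1 | s2)

def minmax(c1, c2):
    """MinMax kernel on two path-count dictionaries."""
    keys = set(c1) | set(c2)
    num = sum(min(c1.get(k, 0), c2.get(k, 0)) for k in keys)
    den = sum(max(c1.get(k, 0), c2.get(k, 0)) for k in keys)
    return num / den if den else 0
```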

+257 -153 pygraph/kernels/weisfeilerLehmanKernel.py

@@ -1,13 +1,8 @@
import sys
import pathlib
sys.path.insert(0, "../")

import networkx as nx
import numpy as np
import time

from pygraph.kernels.spkernel import spkernel
from pygraph.kernels.pathKernel import pathkernel
"""
@author: linlin
@references:
[1] Shervashidze N, Schweitzer P, Leeuwen EJ, Mehlhorn K, Borgwardt KM. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research. 2011;12(Sep):2539-61.
"""

import sys
import pathlib
@@ -18,7 +13,6 @@ import networkx as nx
import numpy as np
import time

from pygraph.kernels.spkernel import spkernel
from pygraph.kernels.pathKernel import pathkernel

def weisfeilerlehmankernel(*args, node_label = 'atom', edge_label = 'bond_type', height = 0, base_kernel = 'subtree'):
@@ -38,97 +32,66 @@ def weisfeilerlehmankernel(*args, node_label = 'atom', edge_label = 'bond_type',
height : int
subtree height
base_kernel : string
base kernel used in each iteration of WL kernel. The default base kernel is the subtree kernel. For a user-defined kernel, base_kernel is the name of the base kernel function used in each iteration of the WL kernel. This function returns a Numpy matrix, each element of which is the user-defined Weisfeiler-Lehman kernel between 2 graphs.
Return
------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
Notes
-----
This function now supports WL subtree kernel, WL shortest path kernel and WL edge kernel.
"""
if len(args) == 1: # for a list of graphs
start_time = time.time()
# for WL subtree kernel
if base_kernel == 'subtree':
Kmatrix = _wl_subtreekernel_do(args[0], node_label, edge_label, height = height, base_kernel = 'subtree')
# for WL edge kernel
elif base_kernel == 'edge':
print('edge')
# for WL shortest path kernel
elif base_kernel == 'sp':
Gn = args[0]
Kmatrix = np.zeros((len(Gn), len(Gn)))
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _weisfeilerlehmankernel_do(Gn[i], Gn[j], height = height)
Kmatrix[j][i] = Kmatrix[i][j]
base_kernel = base_kernel.lower()
Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
Kmatrix = np.zeros((len(Gn), len(Gn)))

run_time = time.time() - start_time
print("\n --- Weisfeiler-Lehman %s kernel matrix of size %d built in %s seconds ---" % (base_kernel, len(args[0]), run_time))
return Kmatrix, run_time
else: # for only 2 graphs
start_time = time.time()
# for WL subtree kernel
if base_kernel == 'subtree':
args = [args[0], args[1]]
kernel = _wl_subtreekernel_do(args, node_label, edge_label, height = height, base_kernel = 'subtree')
# for WL edge kernel
elif base_kernel == 'edge':
print('edge')
# for WL shortest path kernel
elif base_kernel == 'sp':
start_time = time.time()

kernel = _pathkernel_do(args[0], args[1])
# for WL subtree kernel
if base_kernel == 'subtree':
Kmatrix = _wl_subtreekernel_do(args[0], node_label, edge_label, height)

run_time = time.time() - start_time
print("\n --- Weisfeiler-Lehman %s kernel built in %s seconds ---" % (base_kernel, run_time))
return kernel, run_time
def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', height = 0, base_kernel = 'subtree'):
# for WL shortest path kernel
elif base_kernel == 'sp':
Kmatrix = _wl_spkernel_do(args[0], node_label, edge_label, height)

# for WL edge kernel
elif base_kernel == 'edge':
Kmatrix = _wl_edgekernel_do(args[0], node_label, edge_label, height)

# for user defined base kernel
else:
Kmatrix = _wl_userkernel_do(args[0], node_label, edge_label, height, base_kernel)

run_time = time.time() - start_time
print("\n --- Weisfeiler-Lehman %s kernel matrix of size %d built in %s seconds ---" % (base_kernel, len(args[0]), run_time))

return Kmatrix, run_time



def _wl_subtreekernel_do(Gn, node_label, edge_label, height):
"""Calculate Weisfeiler-Lehman subtree kernels between graphs.
Parameters
----------
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
node_label : string
node attribute used as label.
edge_label : string
edge attribute used as label.
height : int
subtree height.

Return
------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
"""
height = int(height)
Gn = args[0]
Kmatrix = np.zeros((len(Gn), len(Gn)))
all_num_of_labels_occured = 0 # number of distinct labels that have occurred as node labels at least once across all graphs

@@ -148,9 +111,9 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h
num_of_labels = len(num_of_each_label) # number of all unique labels

all_labels_ori.update(labels_ori)
all_num_of_labels_occured += len(all_labels_ori)
# calculate subtree kernel with the 0th iteration and add it to the final kernel
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
@@ -159,17 +122,17 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h
vector2 = np.matrix([ (all_num_of_each_label[j][label] if (label in all_num_of_each_label[j].keys()) else 0) for label in labels ])
Kmatrix[i][j] += np.dot(vector1, vector2.transpose())
Kmatrix[j][i] = Kmatrix[i][j]
# iterate each height
for h in range(1, height + 1):
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration
num_of_labels_occured = all_num_of_labels_occured # number of distinct labels that have occurred as node labels at least once across all graphs
all_labels_ori = set()
all_num_of_each_label = []
# for each graph
for idx, G in enumerate(Gn):
set_multisets = []
for node in G.nodes(data = True):
# Multiset-label determination.
@@ -190,9 +153,9 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h
else:
set_compressed.update({ value : str(num_of_labels_occured + 1) })
num_of_labels_occured += 1
all_set_compressed.update(set_compressed)
# relabel nodes
for node in G.nodes(data = True):
node[1][node_label] = set_compressed[set_multisets[node[0]]]
@@ -202,9 +165,9 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h
all_labels_ori.update(labels_comp)
num_of_each_label = dict(Counter(labels_comp))
all_num_of_each_label.append(num_of_each_label)
all_num_of_labels_occured += len(all_labels_ori)
# calculate subtree kernel with h iterations and add it to the final kernel
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
@@ -213,87 +176,228 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h
vector2 = np.matrix([ (all_num_of_each_label[j][label] if (label in all_num_of_each_label[j].keys()) else 0) for label in labels ])
Kmatrix[i][j] += np.dot(vector1, vector2.transpose())
Kmatrix[j][i] = Kmatrix[i][j]
return Kmatrix
def _weisfeilerlehmankernel_do(G1, G2, height = 0):
"""Calculate Weisfeiler-Lehman kernels between 2 graphs. This kernel use shortest path kernel to calculate kernel between two graphs in each iteration.
def _wl_spkernel_do(Gn, node_label, edge_label, height):
"""Calculate Weisfeiler-Lehman shortest path kernels between graphs.
Parameters
----------
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
node_label : string
node attribute used as label.
edge_label : string
edge attribute used as label.
height : int
subtree height.
Return
------
kernel : float
Weisfeiler-Lehman kernel between 2 graphs.
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
"""
from pygraph.utils.utils import getSPGraph
# init.
height = int(height)
kernel = 0 # init kernel
num_nodes1 = G1.number_of_nodes()
num_nodes2 = G2.number_of_nodes()
# the first iteration.
# labelset1 = { G1.nodes(data = True)[i]['label'] for i in range(num_nodes1) }
# labelset2 = { G2.nodes(data = True)[i]['label'] for i in range(num_nodes2) }
kernel += spkernel(G1, G2) # change your base kernel here (and one more below)
Kmatrix = np.zeros((len(Gn), len(Gn))) # init kernel

Gn = [ getSPGraph(G, edge_weight = edge_label) for G in Gn ] # get shortest path graphs of Gn
for h in range(0, height + 1):
# if labelset1 != labelset2:
# break
# initial for height = 0
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
for e1 in Gn[i].edges(data = True):
for e2 in Gn[j].edges(data = True):
if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])):
Kmatrix[i][j] += 1
Kmatrix[j][i] = Kmatrix[i][j]
# iterate each height
for h in range(1, height + 1):
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration
num_of_labels_occured = 0 # number of distinct labels that have occurred as node labels at least once across all graphs
for G in Gn: # for each graph
set_multisets = []
for node in G.nodes(data = True):
# Multiset-label determination.
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ]
# sorting each multiset
multiset.sort()
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix
set_multisets.append(multiset)

# label compression
set_unique = list(set(set_multisets)) # set of unique multiset labels
# a dictionary mapping original labels to new ones.
set_compressed = {}
# if a label occurred before, assign its former compressed label; else assign (number of labels occurred + 1) as the compressed label
for value in set_unique:
if value in all_set_compressed.keys():
set_compressed.update({ value : all_set_compressed[value] })
else:
set_compressed.update({ value : str(num_of_labels_occured + 1) })
num_of_labels_occured += 1

all_set_compressed.update(set_compressed)
# relabel nodes
for node in G.nodes(data = True):
node[1][node_label] = set_compressed[set_multisets[node[0]]]
# calculate subtree kernel with h iterations and add it to the final kernel
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
for e1 in Gn[i].edges(data = True):
for e2 in Gn[j].edges(data = True):
if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])):
Kmatrix[i][j] += 1
Kmatrix[j][i] = Kmatrix[i][j]
return Kmatrix

# Weisfeiler-Lehman test of graph isomorphism.
relabel(G1)
relabel(G2)

# calculate kernel
kernel += spkernel(G1, G2) # change your base kernel here (and one more before)

# get label sets of both graphs
# labelset1 = { G1.nodes(data = True)[i]['label'] for i in range(num_nodes1) }
# labelset2 = { G2.nodes(data = True)[i]['label'] for i in range(num_nodes2) }
def _wl_edgekernel_do(Gn, node_label, edge_label, height):
"""Calculate Weisfeiler-Lehman edge kernels between graphs.
return kernel
Parameters
----------
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
node_label : string
node attribute used as label.
edge_label : string
edge attribute used as label.
height : int
subtree height.
Return
------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
"""
# init.
height = int(height)
Kmatrix = np.zeros((len(Gn), len(Gn))) # init kernel
# initial for height = 0
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
for e1 in Gn[i].edges(data = True):
for e2 in Gn[j].edges(data = True):
if e1[2][edge_label] == e2[2][edge_label] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])):
Kmatrix[i][j] += 1
Kmatrix[j][i] = Kmatrix[i][j]
# iterate each height
for h in range(1, height + 1):
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration
num_of_labels_occured = 0 # number of distinct labels that have occurred as node labels at least once across all graphs
for G in Gn: # for each graph
set_multisets = []
for node in G.nodes(data = True):
# Multiset-label determination.
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ]
# sorting each multiset
multiset.sort()
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix
set_multisets.append(multiset)

# label compression
set_unique = list(set(set_multisets)) # set of unique multiset labels
# a dictionary mapping original labels to new ones.
set_compressed = {}
# if a label occurred before, assign its former compressed label; else assign (number of labels occurred + 1) as the compressed label
for value in set_unique:
if value in all_set_compressed.keys():
set_compressed.update({ value : all_set_compressed[value] })
else:
set_compressed.update({ value : str(num_of_labels_occured + 1) })
num_of_labels_occured += 1

def relabel(G):
'''
Relabel nodes in graph G in one iteration of the 1-dim. WL test of graph isomorphism.
all_set_compressed.update(set_compressed)
# relabel nodes
for node in G.nodes(data = True):
node[1][node_label] = set_compressed[set_multisets[node[0]]]
# calculate subtree kernel with h iterations and add it to the final kernel
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
for e1 in Gn[i].edges(data = True):
for e2 in Gn[j].edges(data = True):
if e1[2][edge_label] == e2[2][edge_label] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])):
Kmatrix[i][j] += 1
Kmatrix[j][i] = Kmatrix[i][j]
return Kmatrix


def _wl_userkernel_do(Gn, node_label, edge_label, height, base_kernel):
"""Calculate Weisfeiler-Lehman kernels based on user-defined kernel between graphs.
Parameters
----------
G : NetworkX graph
The graphs whose nodes are relabeled.
'''
# get the set of original labels
labels_ori = list(nx.get_node_attributes(G, 'label').values())
num_of_each_label = dict(Counter(labels_ori))
num_of_labels = len(num_of_each_label)
set_multisets = []
for node in G.nodes(data = True):
# Multiset-label determination.
multiset = [ G.node[neighbors]['label'] for neighbors in G[node[0]] ]
# sorting each multiset
multiset.sort()
multiset = node[1]['label'] + ''.join(multiset) # concatenate to a string and add the prefix
set_multisets.append(multiset)
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
node_label : string
node attribute used as label.
edge_label : string
edge attribute used as label.
height : int
subtree height.
base_kernel : string
Name of the base kernel function used in each iteration of WL kernel. This function returns a Numpy matrix, each element of which is the user-defined Weisfeiler-Lehman kernel between 2 graphs.
# label compression
# set_multisets.sort() # this is unnecessary
set_unique = list(set(set_multisets)) # set of unique multiset labels
set_compressed = { value : str(set_unique.index(value) + num_of_labels + 1) for value in set_unique } # assign new labels
# relabel nodes
# nx.relabel_nodes(G, set_compressed, copy = False)
for node in G.nodes(data = True):
node[1]['label'] = set_compressed[set_multisets[node[0]]]

# get the set of compressed labels
labels_comp = list(nx.get_node_attributes(G, 'label').values())
num_of_each_label.update(dict(Counter(labels_comp)))
Return
------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
"""
# init.
height = int(height)
Kmatrix = np.zeros((len(Gn), len(Gn))) # init kernel
# initial for height = 0
Kmatrix = base_kernel(Gn, node_label, edge_label)
# iterate each height
for h in range(1, height + 1):
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration
num_of_labels_occured = 0 # number of distinct labels that have occurred as node labels at least once across all graphs
for G in Gn: # for each graph
set_multisets = []
for node in G.nodes(data = True):
# Multiset-label determination.
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ]
# sorting each multiset
multiset.sort()
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix
set_multisets.append(multiset)

# label compression
set_unique = list(set(set_multisets)) # set of unique multiset labels
# a dictionary mapping original labels to new ones.
set_compressed = {}
# if a label occurred before, assign its former compressed label; else assign (number of labels occurred + 1) as the compressed label
for value in set_unique:
if value in all_set_compressed.keys():
set_compressed.update({ value : all_set_compressed[value] })
else:
set_compressed.update({ value : str(num_of_labels_occured + 1) })
num_of_labels_occured += 1

all_set_compressed.update(set_compressed)
# relabel nodes
for node in G.nodes(data = True):
node[1][node_label] = set_compressed[set_multisets[node[0]]]
# calculate kernel with h iterations and add it to the final kernel
Kmatrix += base_kernel(Gn, node_label, edge_label)
return Kmatrix
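
All WL variants above share the same relabelling step; a standalone sketch of one iteration on a toy graph (only version-stable networkx calls are used):
```
import networkx as nx

G = nx.path_graph(3)  # 0 - 1 - 2
for n, lab in {0: 'C', 1: 'O', 2: 'C'}.items():
    G.add_node(n, atom=lab)  # add_node on an existing node updates its attributes

labels = nx.get_node_attributes(G, 'atom')
# multiset-label determination: own label + sorted neighbour labels
multisets = {n: labels[n] + ''.join(sorted(labels[m] for m in G[n])) for n in G}
# label compression: one fresh id per distinct multiset string
compressed = {v: str(i + 1) for i, v in enumerate(sorted(set(multisets.values())))}
relabelled = {n: compressed[multisets[n]] for n in G}
print(multisets)   # {0: 'CO', 1: 'OCC', 2: 'CO'}
print(relabelled)  # nodes 0 and 2 get the same compressed label
```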

BIN pygraph/utils/__pycache__/graphfiles.cpython-35.pyc


BIN pygraph/utils/__pycache__/utils.cpython-35.pyc


+84 -6 pygraph/utils/graphfiles.py

@@ -3,7 +3,7 @@

def loadCT(filename):
"""load data from .ct file.
Notes
------
a typical example of data in .ct is like this:
@@ -33,12 +33,17 @@ def loadCT(filename):
tmp = content[i + 2].split(" ")
tmp = [x for x in tmp if x != '']
g.add_node(i, atom=tmp[3], label=tmp[3])

for i in range(0, nb_edges):
tmp = content[i + g.number_of_nodes() + 2]
tmp = [tmp[i:i+3] for i in range(0, len(tmp), 3)]
tmp = content[i + g.number_of_nodes() + 2].split(" ")
tmp = [x for x in tmp if x != '']
g.add_edge(int(tmp[0]) - 1, int(tmp[1]) - 1,
bond_type=tmp[3].strip(), label=tmp[3].strip())

# for i in range(0, nb_edges):
# tmp = content[i + g.number_of_nodes() + 2]
# tmp = [tmp[i:i+3] for i in range(0, len(tmp), 3)]
# g.add_edge(int(tmp[0]) - 1, int(tmp[1]) - 1,
# bond_type=tmp[3].strip(), label=tmp[3].strip())
return g


@@ -101,7 +106,57 @@ def saveGXL(graph, filename):
tree.write(filename)


def loadDataset(filename):
def loadSDF(filename):
"""load data from structured data file (.sdf file).

Notes
------
An SDF file contains a group of molecules, each represented in a similar way to the MOL format.
See http://www.nonlinear.com/progenesis/sdf-studio/v0.9/faq/sdf-file-format-guidance.aspx (2018) for the detailed structure.
"""
import networkx as nx
from os.path import basename
from tqdm import tqdm
import sys
data = []
with open(filename) as f:
content = f.read().splitlines()
index = 0
pbar = tqdm(total = len(content) + 1, desc = 'load SDF', file=sys.stdout)
while index < len(content):
index_old = index

g = nx.Graph(name=content[index].strip()) # set name of the graph

tmp = content[index + 3]
nb_nodes = int(tmp[:3]) # number of the nodes
nb_edges = int(tmp[3:6]) # number of the edges

for i in range(0, nb_nodes):
tmp = content[i + index + 4]
g.add_node(i, atom=tmp[31:34].strip())

for i in range(0, nb_edges):
tmp = content[i + index + g.number_of_nodes() + 4]
tmp = [tmp[i:i+3] for i in range(0, len(tmp), 3)]
g.add_edge(int(tmp[0]) - 1, int(tmp[1]) - 1, bond_type=tmp[2].strip())

data.append(g)

index += 4 + g.number_of_nodes() + g.number_of_edges()
while content[index].strip() != '$$$$': # separator
index += 1
index += 1

pbar.update(index - index_old)
pbar.update(1)
pbar.close()

return data



def loadDataset(filename, filename_y = ''):
"""load file list of the dataset.
"""
from os.path import dirname, splitext
@@ -128,5 +183,28 @@ def loadDataset(filename):
mol_class = graph.attrib['class']
data.append(loadGXL(dirname_dataset + '/' + mol_filename))
y.append(mol_class)
elif extension == "sdf":
import numpy as np
from tqdm import tqdm
import sys

data = loadSDF(filename)

y_raw = open(filename_y).read().splitlines()
y_raw.pop(0)
tmp0 = []
tmp1 = []
for i in range(0, len(y_raw)):
tmp = y_raw[i].split(',')
tmp0.append(tmp[0])
tmp1.append(tmp[1].strip())

y = []
for i in tqdm(range(0, len(data)), desc = 'adjust data', file=sys.stdout):
try:
y.append(tmp1[tmp0.index(data[i].name)].strip())
except ValueError: # if data[i].name not in tmp0
data[i] = []
data = list(filter(lambda a: a != [], data))

return data, y
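
loadSDF above relies on the fixed-width MOL counts and bond lines; a hypothetical pair of lines illustrates the slicing:
```
counts = '  6  5  0  0  0  0  0  0  0  0999 V2000'  # hypothetical counts line
nb_nodes = int(counts[:3])   # '  6' -> 6 atoms
nb_edges = int(counts[3:6])  # '  5' -> 5 bonds

bond = '  1  2  1'           # hypothetical bond line: atoms 1-2, single bond
fields = [bond[i:i + 3] for i in range(0, len(bond), 3)]
# ['  1', '  2', '  1'] -> edge (0, 1) with bond_type '1'
```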

+97 -48 pygraph/utils/utils.py

@@ -1,5 +1,6 @@
import networkx as nx
import numpy as np
from tqdm import tqdm


def getSPLengths(G1):
@@ -58,21 +59,15 @@ def floydTransformation(G, edge_weight = 'bond_type'):
S = nx.Graph()
S.add_nodes_from(G.nodes(data=True))
for i in range(0, G.number_of_nodes()):
for j in range(0, G.number_of_nodes()):
for j in range(i, G.number_of_nodes()):
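# spMatrix is symmetric, so starting j at i adds each unordered pair only once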
S.add_edge(i, j, cost = spMatrix[i, j])
return S



import os
import pathlib
from collections import OrderedDict
from tabulate import tabulate
from .graphfiles import loadDataset

def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, trials = 100, splits = 10, alpha_grid = None, C_grid = None, hyper_name = '', hyper_range = [1], normalize = False):
def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, trials = 100, splits = 10, alpha_grid = None, C_grid = None, hyper_name = '', hyper_range = [1], normalize = False, datafile_y = '', model_type = 'regression'):
"""Perform training and testing for a kernel method. Print out neccessary data during the process then finally the results.

Parameters
----------
datafile : string
@@ -96,12 +91,14 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria
hyper_range : list
Range of the hyperparameter.
normalize : boolean
Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.
model_type : string
Type of the problem: 'regression' or 'classification'.

References
----------
[1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1
Examples
--------
>>> import sys
@@ -113,29 +110,41 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria
>>> kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True)
>>> kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True)
"""
import os
import pathlib
from collections import OrderedDict
from tabulate import tabulate
from .graphfiles import loadDataset

# setup the parameters
model_type = 'regression' # Regression or classification problem
model_type = model_type.lower()
if model_type != 'regression' and model_type != 'classification':
raise Exception('The model type is incorrect! Please choose from regression or classification.')
print('\n --- This is a %s problem ---' % model_type)
alpha_grid = np.logspace(-10, 10, num = trials, base = 10) if alpha_grid is None else alpha_grid # corresponds to (2*C)^-1 in other linear models such as LogisticRegression
C_grid = np.logspace(-10, 10, num = trials, base = 10) if C_grid is None else C_grid
if not os.path.exists(kernel_file_path):
os.makedirs(kernel_file_path)
train_means_list = []
train_stds_list = []
test_means_list = []
test_stds_list = []
kernel_time_list = []
for hyper_para in hyper_range:
print('' if hyper_name == '' else '\n\n #--- calculating kernel matrix when %s = %s ---#' % (hyper_name, hyper_para))

print('\n Loading dataset from file...')
dataset, y = loadDataset(datafile)
dataset, y = loadDataset(datafile, filename_y = datafile_y)
y = np.array(y)
# print(y)
# normalize labels and transform non-numerical labels to numerical labels.
if model_type == 'classification':
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)
# print(y)

# save kernel matrices to files / read kernel matrices from files
kernel_file = kernel_file_path + 'km.ds'
@@ -152,7 +161,7 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria
Kmatrix, run_time = kernel_func(dataset, **kernel_para)
kernel_time_list.append(run_time)
print(Kmatrix)
# print('\n Saving kernel matrix to file...')
# np.savetxt(kernel_file, Kmatrix)

"""
@@ -170,25 +179,29 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria
test_stds_list.append(test_std)

print('\n')
if model_type == 'regression':
table_dict = {'rmse_test': test_means_list, 'std_test': test_stds_list, \
'rmse_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
if hyper_name == '':
keyorder = ['rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
else:
table_dict[hyper_name] = hyper_range
keyorder = [hyper_name, 'rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
elif model_type == 'classification':
table_dict = {'accur_test': test_means_list, 'std_test': test_stds_list, \
'accur_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
if hyper_name == '':
keyorder = ['accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
else:
table_dict[hyper_name] = hyper_range
keyorder = [hyper_name, 'accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i:keyorder.index(i[0]))), headers='keys'))


import random
from sklearn.kernel_ridge import KernelRidge # 0.17
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn import svm

def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, trials = 100, model_type = 'regression', normalize = False):
"""Split dataset to training and testing splits, train and test. Print out and return the results.

Parameters
----------
Kmatrix : Numpy matrix
@@ -206,8 +219,8 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri
model_type : string
Determine whether it is a regression or classification problem. The default is 'regression'.
normalize : boolean
Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.
Return
------
train_mean : float
@@ -218,19 +231,27 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri
mean of the best tests.
test_std : float
mean of test stds in the same trial with the best test mean.
References
----------
[1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1
"""
import random
from sklearn.kernel_ridge import KernelRidge # 0.17
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn import svm

datasize = len(train_target)
random.seed(20) # Set the seed for uniform parameter distribution
# Initialize the performance of the best parameter trial on train with the corresponding performance on test
train_split = []
test_split = []

# For each split of the data
print('\n Starting to calculate accuracy/RMSE...')
import sys
pbar = tqdm(total = splits * trials, desc = 'calculate performance', file=sys.stdout)
for j in range(10, 10 + splits):
# print('\n Starting split %d...' % j)

@@ -255,7 +276,7 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri

# Split the targets
y_train = y_perm[0:num_train]

# Normalization step (for real valued targets only)
if normalize == True and model_type == 'regression':
@@ -275,7 +296,6 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri
if model_type == 'regression':
# Fit the kernel ridge model
KR = KernelRidge(kernel = 'precomputed', alpha = alpha_grid[i])
# KR = svm.SVR(kernel = 'precomputed', C = C_grid[i])
KR.fit(Kmatrix_train, y_train if normalize == False else y_train_norm)

# predict on the train and test set
@@ -284,15 +304,33 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri

# adjust prediction: needed because the training targets have been normalized
if normalize == True:
y_pred_train = y_pred_train * float(y_train_std) + y_train_mean
y_pred_test = y_pred_test * float(y_train_std) + y_train_mean

# root mean squared error on train set
accuracy_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
perf_all_train.append(accuracy_train)
# root mean squared error on test set
accuracy_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
perf_all_test.append(accuracy_test)

# For classification use SVM
elif model_type == 'classification':
KR = svm.SVC(kernel = 'precomputed', C = C_grid[i])
KR.fit(Kmatrix_train, y_train)

# predict on the train and test set
y_pred_train = KR.predict(Kmatrix_train)
y_pred_test = KR.predict(Kmatrix_test)

# accuracy on train set
accuracy_train = accuracy_score(y_train, y_pred_train)
perf_all_train.append(accuracy_train)
# accuracy on test set
accuracy_test = accuracy_score(y_test, y_pred_test)
perf_all_test.append(accuracy_test)

pbar.update(1)

# --- FIND THE OPTIMAL PARAMETERS --- #
# For regression: minimise the mean squared error
@@ -306,6 +344,17 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri
perf_train_opt = perf_all_train[min_idx]
perf_test_opt = perf_all_test[min_idx]

# For classification: maximise the accuracy
if model_type == 'classification':
# get optimal parameter on test (argmax accuracy)
max_idx = np.argmax(perf_all_test)
C_opt = C_grid[max_idx]

# corresponding performance on train and test set for the same parameter
perf_train_opt = perf_all_train[max_idx]
perf_test_opt = perf_all_test[max_idx]


# append the corresponding performance on the train and test set
train_split.append(perf_train_opt)
test_split.append(perf_test_opt)
@@ -322,5 +371,5 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri
print('With standard deviation: %.3f' % train_std)
print('\n Mean performance on test set: %.3f' % test_mean)
print('With standard deviation: %.3f' % test_std)
return train_mean, train_std, test_mean, test_std

+16 -0 run_cyclic.py

@@ -0,0 +1,16 @@
import sys
sys.path.insert(0, "../")
from pygraph.utils.utils import kernel_train_test
from pygraph.kernels.cyclicPatternKernel import cyclicpatternkernel

import numpy as np

datafile = '../../../../datasets/NCI-HIV/AIDO99SD.sdf'
datafile_y = '../../../../datasets/NCI-HIV/aids_conc_may04.txt'
kernel_file_path = 'kernelmatrices_path_acyclic/'

kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True)

kernel_train_test(datafile, kernel_file_path, cyclicpatternkernel, kernel_para, \
hyper_name = 'cycle_bound', hyper_range = np.linspace(0, 1000, 21), normalize = False, \
datafile_y = datafile_y, model_type = 'classification')
