* ADD calculation of the time spent to acquire kernel matrices for each kernel. - linlin

* MOD floydTransformation function so that shortest paths are calculated with a user-defined edge weight. - linlin
* MOD all kernels to support generic node and edge attributes as labels. - linlin
* ADD detailed results file results.md. - linlin
* MOD Weisfeiler-Lehman subtree kernel and the test code. - linlin
v0.1
jajupmochi committed 7 years ago
commit 7e2a15d62f
17 changed files with 925 additions and 1416 deletions
  1. README.md (+22, -7)
  2. notebooks/.ipynb_checkpoints/run_WeisfeilerLehmankernel_acyclic-checkpoint.ipynb (+381, -418)
  3. notebooks/.ipynb_checkpoints/run_marginalizedkernel_acyclic-checkpoint.ipynb (+16, -370)
  4. notebooks/.ipynb_checkpoints/run_pathkernel_acyclic-checkpoint.ipynb (+23, -14)
  5. notebooks/.ipynb_checkpoints/run_spkernel_acyclic-checkpoint.ipynb (+2, -1)
  6. notebooks/run_WeisfeilerLehmankernel_acyclic.ipynb (+317, -514)
  7. notebooks/run_marginalizedkernel_acyclic.ipynb (+1, -1)
  8. notebooks/run_pathkernel_acyclic.ipynb (+1, -1)
  9. notebooks/run_spkernel_acyclic.ipynb (+2, -1)
  10. pygraph/kernels/__pycache__/weisfeilerLehmanKernel.cpython-35.pyc (BIN)
  11. pygraph/kernels/marginalizedKernel.py (+21, -12)
  12. pygraph/kernels/pathKernel.py (+23, -14)
  13. pygraph/kernels/results.md (+36, -0)
  14. pygraph/kernels/spkernel.py (+9, -6)
  15. pygraph/kernels/weisfeilerLehmanKernel.py (+61, -51)
  16. pygraph/utils/__pycache__/graphfiles.cpython-35.pyc (BIN)
  17. pygraph/utils/utils.py (+10, -6)

README.md (+22, -7)

@@ -10,15 +10,30 @@ a python package for graph kernels.
* sklearn - 0.19.1
* tabulate - 0.8.2

## results with minimal RMSE for each kernel on dataset Asyclic
| Kernels | RMSE(℃) | std(℃) | parameter |
|---------------|:---------:|:--------:|-------------:|
| shortest path | 36.400524 | 5.352940 | - |
| marginalized | 17.8991 | 6.59104 | p_quit = 0.1 |
| path | 14.270816 | 6.366698 | - |
| WL subtree | 9.01403 | 6.35786 | height = 1 |
## results with minimal test RMSE for each kernel on dataset Acyclic
-- All the kernels are tested on dataset Acyclic, which consists of 185 molecules (graphs).
-- The methods used for prediction are SVM for classification and kernel ridge regression for regression.
-- For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the training data, then evaluate the performance on the test set. We choose the optimal parameters on the test set and report the corresponding performance. The final results are the average of the performances over the test sets.

| Kernels | RMSE(℃) | std(℃) | parameter | k_time |
|---------------|:---------:|:--------:|-------------:|-------:|
| shortest path | 36.40 | 5.35 | - | - |
| marginalized | 17.90 | 6.59 | p_quit = 0.1 | - |
| path | 14.27 | 6.37 | - | - |
| WL subtree | 9.00 | 6.37 | height = 1 | 0.85s |

**In each line, parameter is the one with which the kernel achieves the best results.
In each line, k_time is the time spent on building the kernel matrix.
See detailed results in [results.md](pygraph/kernels/results.md).**
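
The protocol above can be sketched concretely. Given a precomputed kernel matrix `Kmatrix` and targets `y`, one train/test split looks roughly like this (an illustration, not the repository's test script; target normalization is omitted):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

def rmse_one_split(Kmatrix, y, alpha, train_ratio=0.9, seed=0):
    """RMSE of kernel ridge regression on one random 90/10 split."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    idx = np.random.RandomState(seed).permutation(n)
    n_train = int(n * train_ratio)
    tr, te = idx[:n_train], idx[n_train:]
    # slice the precomputed kernel: the train block is (n_train, n_train);
    # the test block must be (n_test, n_train) for KernelRidge.predict
    K_train = Kmatrix[np.ix_(tr, tr)]
    K_test = Kmatrix[np.ix_(te, tr)]
    model = KernelRidge(kernel='precomputed', alpha=alpha)
    model.fit(K_train, y[tr])
    return np.sqrt(mean_squared_error(y[te], model.predict(K_test)))
```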

## updates
### 2017.12.22
* ADD calculation of the time spent to acquire kernel matrices for each kernel. - linlin
* MOD floydTransformation function so that shortest paths are calculated with a user-defined edge weight. - linlin
* MOD all kernels to support generic node and edge attributes as labels. - linlin
* ADD detailed results file results.md. - linlin
### 2017.12.21
* MOD Weisfeiler-Lehman subtree kernel and the test code. - linlin
### 2017.12.20
* ADD Weisfeiler-Lehman subtree kernel and its result on dataset Acyclic. - linlin
### 2017.12.07


notebooks/.ipynb_checkpoints/run_WeisfeilerLehmankernel_acyclic-checkpoint.ipynb (+381, -418)
File diff suppressed because it is too large


notebooks/.ipynb_checkpoints/run_marginalizedkernel_acyclic-checkpoint.ipynb (+16, -370)

@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 30,
"execution_count": 8,
"metadata": {
"scrolled": false
},
@@ -25,360 +25,6 @@
" https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py\n",
"\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
" Normalizing output y...\n",
"\n",
" Loading the train set kernel matrix from file...\n",
"[[ 0.15254237 0.08333333 0.0625 ..., 0.11363636 0.11363636\n",
" 0.11363636]\n",
" [ 0.08333333 0.18518519 0.15591398 ..., 0.16617791 0.16617791\n",
" 0.16890214]\n",
" [ 0.0625 0.15591398 0.15254237 ..., 0.12987013 0.12987013\n",
" 0.13163636]\n",
" ..., \n",
" [ 0.11363636 0.16617791 0.12987013 ..., 0.26383753 0.2639004\n",
" 0.26156557]\n",
" [ 0.11363636 0.16617791 0.12987013 ..., 0.2639004 0.26396688\n",
" 0.26162729]\n",
" [ 0.11363636 0.16890214 0.13163636 ..., 0.26156557 0.26162729\n",
" 0.25964592]]\n",
"\n",
" Loading the test set kernel matrix from file...\n",
"[[ 0.18518519 0.1715847 0.11111111 0.16588603 0.11904762 0.16450216\n",
" 0.17281421 0.14285714 0.125 0.16477273 0.16880154 0.14583333\n",
" 0.1660693 0.16906445 0.13333333 0.16612903 0.16420966 0.16441006\n",
" 0.15151515]\n",
" [ 0.1715847 0.19988118 0.15173333 0.18435596 0.16465263 0.21184723\n",
" 0.18985964 0.19960191 0.16819723 0.21540115 0.19575264 0.2041482\n",
" 0.21842419 0.20001664 0.18754969 0.2205599 0.20506165 0.22256445\n",
" 0.2141792 ]\n",
" [ 0.11111111 0.15173333 0.16303156 0.13416478 0.16903494 0.16960573\n",
" 0.13862936 0.18511129 0.16989276 0.17395417 0.14762351 0.18709221\n",
" 0.17706477 0.15293506 0.17970939 0.17975775 0.16082785 0.18295252\n",
" 0.19186573]\n",
" [ 0.16588603 0.18435596 0.13416478 0.17413923 0.14529511 0.19230449\n",
" 0.17775828 0.17598858 0.14892223 0.19462663 0.18166555 0.17986029\n",
" 0.1964604 0.18450695 0.16510376 0.19788853 0.1876399 0.19921541\n",
" 0.18843419]\n",
" [ 0.11904762 0.16465263 0.16903494 0.14529511 0.17703225 0.18464872\n",
" 0.15002895 0.19785455 0.17779663 0.18950917 0.16010081 0.2005743\n",
" 0.19306131 0.16599977 0.19113529 0.1960531 0.175064 0.19963794\n",
" 0.20696464]\n",
" [ 0.16450216 0.21184723 0.16960573 0.19230449 0.18464872 0.23269314\n",
" 0.19681552 0.22450276 0.1871932 0.23765844 0.20733248 0.22967925\n",
" 0.241199 0.21337314 0.21125341 0.24426963 0.22285333 0.24802555\n",
" 0.24156669]\n",
" [ 0.17281421 0.18985964 0.13862936 0.17775828 0.15002895 0.19681552\n",
" 0.18309269 0.18152273 0.15411585 0.19935309 0.18641218 0.18556038\n",
" 0.20169527 0.18946029 0.17030032 0.20320694 0.19192382 0.2042596\n",
" 0.19428999]\n",
" [ 0.14285714 0.19960191 0.18511129 0.17598858 0.19785455 0.22450276\n",
" 0.18152273 0.23269314 0.20168735 0.23049584 0.19407926 0.23694176\n",
" 0.23486084 0.20134404 0.22042984 0.23854906 0.21275711 0.24302959\n",
" 0.24678197]\n",
" [ 0.125 0.16819723 0.16989276 0.14892223 0.17779663 0.1871932\n",
" 0.15411585 0.20168735 0.18391356 0.19188588 0.16365606 0.20428161\n",
" 0.1952436 0.16940489 0.1919249 0.19815511 0.17760881 0.20152837\n",
" 0.20988805]\n",
" [ 0.16477273 0.21540115 0.17395417 0.19462663 0.18950917 0.23765844\n",
" 0.19935309 0.23049584 0.19188588 0.24296859 0.21058278 0.23586086\n",
" 0.24679036 0.21702635 0.21699483 0.25006701 0.22724646 0.25407837\n",
" 0.24818625]\n",
" [ 0.16880154 0.19575264 0.14762351 0.18166555 0.16010081 0.20733248\n",
" 0.18641218 0.19407926 0.16365606 0.21058278 0.19214629 0.19842989\n",
" 0.21317298 0.19609213 0.18225175 0.2151567 0.20088139 0.2171273\n",
" 0.20810339]\n",
" [ 0.14583333 0.2041482 0.18709221 0.17986029 0.2005743 0.22967925\n",
" 0.18556038 0.23694176 0.20428161 0.23586086 0.19842989 0.24154885\n",
" 0.24042054 0.20590264 0.22439219 0.24421452 0.21769149 0.24880304\n",
" 0.25200246]\n",
" [ 0.1660693 0.21842419 0.17706477 0.1964604 0.19306131 0.241199\n",
" 0.20169527 0.23486084 0.1952436 0.24679036 0.21317298 0.24042054\n",
" 0.25107069 0.21988195 0.22126548 0.25446921 0.23058896 0.25855949\n",
" 0.25312182]\n",
" [ 0.16906445 0.20001664 0.15293506 0.18450695 0.16599977 0.21337314\n",
" 0.18946029 0.20134404 0.16940489 0.21702635 0.19609213 0.20590264\n",
" 0.21988195 0.20052959 0.18917551 0.22212027 0.2061696 0.22441239\n",
" 0.21607563]\n",
" [ 0.13333333 0.18754969 0.17970939 0.16510376 0.19113529 0.21125341\n",
" 0.17030032 0.22042984 0.1919249 0.21699483 0.18225175 0.22439219\n",
" 0.22126548 0.18917551 0.2112185 0.224781 0.20021961 0.22904467\n",
" 0.23356012]\n",
" [ 0.16612903 0.2205599 0.17975775 0.19788853 0.1960531 0.24426963\n",
" 0.20320694 0.23854906 0.19815511 0.25006701 0.2151567 0.24421452\n",
" 0.25446921 0.22212027 0.224781 0.25800115 0.23326559 0.26226067\n",
" 0.25717144]\n",
" [ 0.16420966 0.20506165 0.16082785 0.1876399 0.175064 0.22285333\n",
" 0.19192382 0.21275711 0.17760881 0.22724646 0.20088139 0.21769149\n",
" 0.23058896 0.2061696 0.20021961 0.23326559 0.21442192 0.2364528\n",
" 0.22891788]\n",
" [ 0.16441006 0.22256445 0.18295252 0.19921541 0.19963794 0.24802555\n",
" 0.2042596 0.24302959 0.20152837 0.25407837 0.2171273 0.24880304\n",
" 0.25855949 0.22441239 0.22904467 0.26226067 0.2364528 0.26687384\n",
" 0.26210305]\n",
" [ 0.15151515 0.2141792 0.19186573 0.18843419 0.20696464 0.24156669\n",
" 0.19428999 0.24678197 0.20988805 0.24818625 0.20810339 0.25200246\n",
" 0.25312182 0.21607563 0.23356012 0.25717144 0.22891788 0.26210305\n",
" 0.26386999]]\n"
]
},
{
"ename": "ValueError",
"evalue": "Precomputed metric requires shape (n_queries, n_indexed). Got (19, 19) for 164 indexed.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-30-d4c5f46d5abf>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 134\u001b[0m \u001b[0;31m# predict on the test set\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 135\u001b[0;31m \u001b[0my_pred_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mKR\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mKmatrix_test\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 136\u001b[0m \u001b[0;31m# print(y_pred)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 137\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/local/lib/python3.5/dist-packages/sklearn/kernel_ridge.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 182\u001b[0m \"\"\"\n\u001b[1;32m 183\u001b[0m \u001b[0mcheck_is_fitted\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m\"X_fit_\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"dual_coef_\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 184\u001b[0;31m \u001b[0mK\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_kernel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX_fit_\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 185\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mK\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdual_coef_\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/local/lib/python3.5/dist-packages/sklearn/kernel_ridge.py\u001b[0m in \u001b[0;36m_get_kernel\u001b[0;34m(self, X, Y)\u001b[0m\n\u001b[1;32m 119\u001b[0m \"coef0\": self.coef0}\n\u001b[1;32m 120\u001b[0m return pairwise_kernels(X, Y, metric=self.kernel,\n\u001b[0;32m--> 121\u001b[0;31m filter_params=True, **params)\n\u001b[0m\u001b[1;32m 122\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 123\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mproperty\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py\u001b[0m in \u001b[0;36mpairwise_kernels\u001b[0;34m(X, Y, metric, filter_params, n_jobs, **kwds)\u001b[0m\n\u001b[1;32m 1389\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1390\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmetric\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"precomputed\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1391\u001b[0;31m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_pairwise_arrays\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mY\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprecomputed\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1392\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1393\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmetric\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mGPKernel\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py\u001b[0m in \u001b[0;36mcheck_pairwise_arrays\u001b[0;34m(X, Y, precomputed, dtype)\u001b[0m\n\u001b[1;32m 117\u001b[0m \u001b[0;34m\"(n_queries, n_indexed). Got (%d, %d) \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 118\u001b[0m \u001b[0;34m\"for %d indexed.\"\u001b[0m \u001b[0;34m%\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 119\u001b[0;31m (X.shape[0], X.shape[1], Y.shape[0]))\n\u001b[0m\u001b[1;32m 120\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0mY\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 121\u001b[0m raise ValueError(\"Incompatible dimension for X and Y matrices: \"\n",
"\u001b[0;31mValueError\u001b[0m: Precomputed metric requires shape (n_queries, n_indexed). Got (19, 19) for 164 indexed."
]
}
],
"source": [
"# Author: Elisabetta Ghisu\n",
"\n",
"\"\"\"\n",
"- This script take as input a kernel matrix\n",
"and returns the classification or regression performance\n",
"- The kernel matrix can be calculated using any of the graph kernels approaches\n",
"- The criteria used for prediction are SVM for classification and kernel Ridge regression for regression\n",
"- For predition we divide the data in training, validation and test. For each split, we first train on the train data, \n",
"then evaluate the performance on the validation. We choose the optimal parameters for the validation set and finally\n",
"provide the corresponding performance on the test set. If more than one split is performed, the final results \n",
"correspond to the average of the performances on the test sets. \n",
"\n",
"@references\n",
" https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py\n",
"\"\"\"\n",
"\n",
"print(__doc__)\n",
"\n",
"import sys\n",
"import pathlib\n",
"import os\n",
"sys.path.insert(0, \"../py-graph/\")\n",
"from tabulate import tabulate\n",
"\n",
"import random\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from sklearn.kernel_ridge import KernelRidge # 0.17\n",
"from sklearn.metrics import accuracy_score, mean_squared_error\n",
"from sklearn import svm\n",
"\n",
"from kernels.marginalizedKernel import marginalizedkernel\n",
"from utils.graphfiles import loadDataset\n",
"\n",
"# print('\\n Loading dataset from file...')\n",
"# dataset, y = loadDataset(\"/home/ljia/Documents/research-repo/datasets/acyclic/Acyclic/dataset_bps.ds\")\n",
"# y = np.array(y)\n",
"# print(y)\n",
"\n",
"# kernel_file_path = 'marginalizedkernelmatrix.ds'\n",
"# path = pathlib.Path(kernel_file_path)\n",
"# if path.is_file():\n",
"# print('\\n Loading the matrix from file...')\n",
"# Kmatrix = np.loadtxt(kernel_file_path)\n",
"# print(Kmatrix)\n",
"# else:\n",
"# print('\\n Calculating kernel matrix, this could take a while...')\n",
"# Kmatrix = marginalizeKernel(dataset)\n",
"# print(Kmatrix)\n",
"# print('Saving kernel matrix to file...')\n",
"# np.savetxt(kernel_file_path, Kmatrix)\n",
"\n",
"# setup the parameters\n",
"model_type = 'regression' # Regression or classification problem\n",
"print('\\n --- This is a %s problem ---' % model_type)\n",
"\n",
"# datasize = len(dataset)\n",
"trials = 100 # Trials for hyperparameters random search\n",
"splits = 100 # Number of splits of the data\n",
"alpha_grid = np.linspace(0.01, 100, num = trials) # corresponds to (2*C)^-1 in other linear models such as LogisticRegression\n",
"# C_grid = np.linspace(0.0001, 10, num = trials)\n",
"random.seed(20) # Set the seed for uniform parameter distribution\n",
"data_dir = '/home/ljia/Documents/research-repo/datasets/acyclic/Acyclic/'\n",
"\n",
"# set the output path\n",
"kernel_file_path = 'kernelmatrices_marginalized_acyclic/'\n",
"if not os.path.exists(kernel_file_path):\n",
" os.makedirs(kernel_file_path)\n",
"\n",
"\n",
"\"\"\"\n",
"- Here starts the main program\n",
"- First we permute the data, then for each split we evaluate corresponding performances\n",
"- In the end, the performances are averaged over the test sets\n",
"\"\"\"\n",
"\n",
"# Initialize the performance of the best parameter trial on validation with the corresponding performance on test\n",
"val_split = []\n",
"test_split = []\n",
"\n",
"p_quit = 0.5\n",
"\n",
"# for each split of the data\n",
"for j in range(10):\n",
" dataset_train, y_train = loadDataset(data_dir + 'trainset_' + str(j) + '.ds')\n",
" dataset_test, y_test = loadDataset(data_dir + 'testset_' + str(j) + '.ds')\n",
" \n",
" # Normalization step (for real valued targets only)\n",
" if model_type == 'regression':\n",
" print('\\n Normalizing output y...')\n",
" y_train_mean = np.mean(y_train)\n",
" y_train_std = np.std(y_train)\n",
" y_train = (y_train - y_train_mean) / float(y_train_std)\n",
"# print(y)\n",
" \n",
" # save kernel matrices to files / read kernel matrices from files\n",
" kernel_file_train = kernel_file_path + 'train' + str(j) + '_pquit_' + str(p_quit)\n",
" kernel_file_test = kernel_file_path + 'test' + str(j) + '_pquit_' + str(p_quit)\n",
" path_train = pathlib.Path(kernel_file_train)\n",
" path_test = pathlib.Path(kernel_file_test)\n",
" # get train set kernel matrix\n",
" if path_train.is_file():\n",
" print('\\n Loading the train set kernel matrix from file...')\n",
" Kmatrix_train = np.loadtxt(kernel_file_train)\n",
" print(Kmatrix_train)\n",
" else:\n",
" print('\\n Calculating train set kernel matrix, this could take a while...')\n",
" Kmatrix_train = marginalizedkernel(dataset_train, p_quit, 20)\n",
" print(Kmatrix_train)\n",
" print('\\n Saving train set kernel matrix to file...')\n",
" np.savetxt(kernel_file_train, Kmatrix_train)\n",
" # get test set kernel matrix\n",
" if path_test.is_file():\n",
" print('\\n Loading the test set kernel matrix from file...')\n",
" Kmatrix_test = np.loadtxt(kernel_file_test)\n",
" print(Kmatrix_test)\n",
" else:\n",
" print('\\n Calculating test set kernel matrix, this could take a while...')\n",
" Kmatrix_test = marginalizedkernel(dataset_test, p_quit, 20)\n",
" print(Kmatrix_test)\n",
" print('\\n Saving test set kernel matrix to file...')\n",
" np.savetxt(kernel_file_test, Kmatrix_test)\n",
"\n",
" # For each parameter trial\n",
" for i in range(trials):\n",
" # For regression use the Kernel Ridge method\n",
" if model_type == 'regression':\n",
" # print('\\n Starting experiment for trial %d and parameter alpha = %3f\\n ' % (i, alpha_grid[i]))\n",
"\n",
" # Fit the kernel ridge model\n",
" KR = KernelRidge(kernel = 'precomputed', alpha = alpha_grid[i])\n",
" KR.fit(Kmatrix_train, y_train)\n",
"\n",
" # predict on the test set\n",
" y_pred_test = KR.predict(Kmatrix_test)\n",
" # print(y_pred)\n",
"\n",
" # adjust prediction: needed because the training targets have been normalized\n",
" y_pred_test = y_pred_test * float(y_train_std) + y_train_mean\n",
" # print(y_pred_test)\n",
"\n",
" # root mean squared error in test \n",
" rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))\n",
" perf_all_test.append(rmse_test)\n",
"\n",
" # print('The performance on the validation set is: %3f' % rmse)\n",
" # print('The performance on the test set is: %3f' % rmse_test)\n",
"\n",
" # --- FIND THE OPTIMAL PARAMETERS --- #\n",
" # For regression: minimise the mean squared error\n",
" if model_type == 'regression':\n",
"\n",
" # get optimal parameter on test (argmin mean squared error)\n",
" min_idx = np.argmin(perf_all_test)\n",
" alpha_opt = alpha_grid[min_idx]\n",
"\n",
" # corresponding performance on test for the same parameter\n",
" perf_test_opt = perf_all_test[min_idx]\n",
"\n",
" print('The best performance is for trial %d with parameter alpha = %3f' % (min_idx, alpha_opt))\n",
" print('The corresponding performance on test set is: %3f' % perf_test_opt)\n",
" \n",
" \n",
" \n",
"\n",
"# For each split of the data\n",
"for j in range(10, 10 + splits):\n",
" print('Starting split %d...' % j)\n",
"\n",
" # Set the random set for data permutation\n",
" random_state = int(j)\n",
" np.random.seed(random_state)\n",
" idx_perm = np.random.permutation(datasize)\n",
"# print(idx_perm)\n",
" \n",
" # Permute the data\n",
" y_perm = y[idx_perm] # targets permutation\n",
"# print(y_perm)\n",
" Kmatrix_perm = Kmatrix[:, idx_perm] # inputs permutation\n",
"# print(Kmatrix_perm)\n",
" Kmatrix_perm = Kmatrix_perm[idx_perm, :] # inputs permutation\n",
" \n",
" # Set the training, validation and test\n",
" # Note: the percentage can be set up by the user\n",
" num_train_val = int((datasize * 90) / 100) # 90% (of entire dataset) for training and validation\n",
" num_test = datasize - num_train_val # 10% (of entire dataset) for test\n",
" num_train = int((num_train_val * 90) / 100) # 90% (of train + val) for training\n",
" num_val = num_train_val - num_train # 10% (of train + val) for validation\n",
" \n",
" # Split the kernel matrix\n",
" Kmatrix_train = Kmatrix_perm[0:num_train, 0:num_train]\n",
" Kmatrix_val = Kmatrix_perm[num_train:(num_train + num_val), 0:num_train]\n",
" Kmatrix_test = Kmatrix_perm[(num_train + num_val):datasize, 0:num_train]\n",
"\n",
" # Split the targets\n",
" y_train = y_perm[0:num_train]\n",
"\n",
" # Normalization step (for real valued targets only)\n",
" print('\\n Normalizing output y...')\n",
" if model_type == 'regression':\n",
" y_train_mean = np.mean(y_train)\n",
" y_train_std = np.std(y_train)\n",
" y_train = (y_train - y_train_mean) / float(y_train_std)\n",
"# print(y)\n",
" \n",
" y_val = y_perm[num_train:(num_train + num_val)]\n",
" y_test = y_perm[(num_train + num_val):datasize]\n",
" \n",
" # Record the performance for each parameter trial respectively on validation and test set\n",
" perf_all_val = []\n",
" perf_all_test = []\n",
" \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"- This script take as input a kernel matrix\n",
"and returns the classification or regression performance\n",
"- The kernel matrix can be calculated using any of the graph kernels approaches\n",
"- The criteria used for prediction are SVM for classification and kernel Ridge regression for regression\n",
"- For predition we divide the data in training, validation and test. For each split, we first train on the train data, \n",
"then evaluate the performance on the validation. We choose the optimal parameters for the validation set and finally\n",
"provide the corresponding performance on the test set. If more than one split is performed, the final results \n",
"correspond to the average of the performances on the test sets. \n",
"\n",
"@references\n",
" https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py\n",
"\n",
"\n",
" Loading dataset from file...\n",
"[ -23.7 14. 37.3 109.7 10.8 39. 42. 66.6 135. 148.5\n",
" 40. 34.6 32. 63. 53.5 67. 64.4 84.7 95.5 92.\n",
@@ -615,17 +261,17 @@
"With standard deviation: 4.891587\n",
"\n",
"\n",
" p_quit RMSE std\n",
"-------- ------- -------\n",
" 0.1 18.5188 7.749\n",
" 0.2 17.8991 6.59104\n",
" 0.3 18.3924 7.10161\n",
" 0.4 19.6233 6.24807\n",
" 0.5 19.9936 6.29951\n",
" 0.6 20.5466 6.26173\n",
" 0.7 21.7018 6.33531\n",
" 0.8 23.1489 6.10246\n",
" 0.9 24.7157 4.89159\n"
" std RMSE p_quit\n",
"------- ------- --------\n",
"7.749 18.5188 0.1\n",
"6.59104 17.8991 0.2\n",
"7.10161 18.3924 0.3\n",
"6.24807 19.6233 0.4\n",
"6.29951 19.9936 0.5\n",
"6.26173 20.5466 0.6\n",
"6.33531 21.7018 0.7\n",
"6.10246 23.1489 0.8\n",
"4.89159 24.7157 0.9\n"
]
}
],
@@ -651,7 +297,7 @@
"import sys\n",
"import os\n",
"import pathlib\n",
"sys.path.insert(0, \"../py-graph/\")\n",
"sys.path.insert(0, \"../\")\n",
"from tabulate import tabulate\n",
"\n",
"import random\n",
@@ -662,8 +308,8 @@
"from sklearn.metrics import accuracy_score, mean_squared_error\n",
"from sklearn import svm\n",
"\n",
"from kernels.marginalizedKernel import marginalizedkernel\n",
"from utils.graphfiles import loadDataset\n",
"from pygraph.kernels.marginalizedKernel import marginalizedkernel\n",
"from pygraph.utils.graphfiles import loadDataset\n",
"\n",
"print('\\n Loading dataset from file...')\n",
"dataset, y = loadDataset(\"../../../../datasets/acyclic/Acyclic/dataset_bps.ds\")\n",
@@ -711,7 +357,7 @@
" print(Kmatrix)\n",
" else:\n",
" print('\\n Calculating kernel matrix, this could take a while...')\n",
" Kmatrix = marginalizedkernel(dataset, p_quit, 20)\n",
" Kmatrix, run_time = marginalizedkernel(dataset, p_quit, 20, node_label = 'atom', edge_label = 'bond_type')\n",
" print(Kmatrix)\n",
" print('\\n Saving kernel matrix to file...')\n",
" np.savetxt(kernel_file, Kmatrix)\n",

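A note on the ValueError in the removed output above: with kernel = 'precomputed', KernelRidge.predict expects the kernel between the test graphs and the training graphs, of shape (n_test, n_train), while the old cell built a (19, 19) test-by-test matrix. A hedged sketch of the missing cross matrix, reusing the package's private helper (this function is an illustration, not part of the package at this commit):

```python
import numpy as np
from pygraph.kernels.marginalizedKernel import _marginalizedkernel_do

def marginalized_cross_kernel(dataset_test, dataset_train, p_quit, itr,
                              node_label='atom', edge_label='bond_type'):
    # K[i][j] = k(test_i, train_j): the (n_test, n_train) matrix
    # that KernelRidge.predict requires with a precomputed kernel
    K = np.zeros((len(dataset_test), len(dataset_train)))
    for i, g1 in enumerate(dataset_test):
        for j, g2 in enumerate(dataset_train):
            K[i][j] = _marginalizedkernel_do(g1, g2, node_label, edge_label,
                                             p_quit, itr)
    return K
```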

notebooks/.ipynb_checkpoints/run_pathkernel_acyclic-checkpoint.ipynb (+23, -14)

@@ -545,7 +545,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@@ -588,18 +588,27 @@
"\n",
" --- This is a regression problem ---\n",
"\n",
" Calculating kernel matrix, this could take a while...\n"
]
},
{
"ename": "NameError",
"evalue": "name 'pathKernel' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-3-bb38687adbe5>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 72\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 73\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'\\n Calculating kernel matrix, this could take a while...'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 74\u001b[0;31m \u001b[0mKmatrix\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpathKernel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 75\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mKmatrix\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 76\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'\\n Saving kernel matrix to file...'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mNameError\u001b[0m: name 'pathKernel' is not defined"
" Calculating kernel matrix, this could take a while...\n",
"--- mean average path kernel matrix of size 185 built in 38.70095658302307 seconds ---\n",
"[[ 0.55555556 0.22222222 0. ..., 0. 0. 0. ]\n",
" [ 0.22222222 0.27777778 0. ..., 0. 0. 0. ]\n",
" [ 0. 0. 0.55555556 ..., 0.03030303 0.03030303\n",
" 0.03030303]\n",
" ..., \n",
" [ 0. 0. 0.03030303 ..., 0.08297521 0.05553719\n",
" 0.05256198]\n",
" [ 0. 0. 0.03030303 ..., 0.05553719 0.07239669\n",
" 0.0538843 ]\n",
" [ 0. 0. 0.03030303 ..., 0.05256198 0.0538843\n",
" 0.07438017]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on val set: 11.907089\n",
"With standard deviation: 4.781924\n",
"\n",
" Mean performance on test set: 14.270816\n",
"With standard deviation: 6.366698\n"
]
}
],
@@ -677,7 +686,7 @@
" print(Kmatrix)\n",
"else:\n",
" print('\\n Calculating kernel matrix, this could take a while...')\n",
" Kmatrix = pathkernel(dataset)\n",
" Kmatrix, run_time = pathkernel(dataset, node_label = 'atom', edge_label = 'bond_type')\n",
" print(Kmatrix)\n",
" print('\\n Saving kernel matrix to file...')\n",
" np.savetxt(kernel_file, Kmatrix)\n",


notebooks/.ipynb_checkpoints/run_spkernel_acyclic-checkpoint.ipynb (+2, -1)

@@ -182,7 +182,8 @@
" print(Kmatrix)\n",
"else:\n",
" print('\\n Calculating kernel matrix, this could take a while...')\n",
" Kmatrix = spkernel(dataset)\n",
" #@Q: is it appropriate to use bond type between atoms as the edge weight to calculate shortest path????????\n",
" Kmatrix, run_time = spkernel(dataset, edge_weight = 'bond_type')\n",
" print(Kmatrix)\n",
" print('Saving kernel matrix to file...')\n",
" np.savetxt(kernel_file_path, Kmatrix)\n",


notebooks/run_WeisfeilerLehmankernel_acyclic.ipynb (+317, -514)
File diff suppressed because it is too large


notebooks/run_marginalizedkernel_acyclic.ipynb (+1, -1)

@@ -357,7 +357,7 @@
" print(Kmatrix)\n",
" else:\n",
" print('\\n Calculating kernel matrix, this could take a while...')\n",
" Kmatrix = marginalizedkernel(dataset, p_quit, 20)\n",
" Kmatrix, run_time = marginalizedkernel(dataset, p_quit, 20, node_label = 'atom', edge_label = 'bond_type')\n",
" print(Kmatrix)\n",
" print('\\n Saving kernel matrix to file...')\n",
" np.savetxt(kernel_file, Kmatrix)\n",


notebooks/run_pathkernel_acyclic.ipynb (+1, -1)

@@ -686,7 +686,7 @@
" print(Kmatrix)\n",
"else:\n",
" print('\\n Calculating kernel matrix, this could take a while...')\n",
" Kmatrix = pathkernel(dataset)\n",
" Kmatrix, run_time = pathkernel(dataset, node_label = 'atom', edge_label = 'bond_type')\n",
" print(Kmatrix)\n",
" print('\\n Saving kernel matrix to file...')\n",
" np.savetxt(kernel_file, Kmatrix)\n",


notebooks/run_spkernel_acyclic.ipynb (+2, -1)

@@ -182,7 +182,8 @@
" print(Kmatrix)\n",
"else:\n",
" print('\\n Calculating kernel matrix, this could take a while...')\n",
" Kmatrix = spkernel(dataset)\n",
" #@Q: is it appropriate to use bond type between atoms as the edge weight to calculate shortest path????????\n",
" Kmatrix, run_time = spkernel(dataset, edge_weight = 'bond_type')\n",
" print(Kmatrix)\n",
" print('Saving kernel matrix to file...')\n",
" np.savetxt(kernel_file_path, Kmatrix)\n",


pygraph/kernels/__pycache__/weisfeilerLehmanKernel.cpython-35.pyc (BIN)


pygraph/kernels/marginalizedKernel.py (+21, -12)

@@ -8,7 +8,7 @@ import time

from pygraph.kernels.deltaKernel import deltakernel

def marginalizedkernel(*args):
def marginalizedkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
"""Calculate marginalized graph kernels between graphs.
Parameters
@@ -22,6 +22,10 @@ def marginalizedkernel(*args):
the termination probability in the random walks generating step
itr : integer
number of iterations to calculate R_inf
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
Return
------
@@ -34,38 +38,43 @@ def marginalizedkernel(*args):
"""
if len(args) == 3: # for a list of graphs
Gn = args[0]

Kmatrix = np.zeros((len(Gn), len(Gn)))

start_time = time.time()
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _marginalizedkernel_do(Gn[i], Gn[j], args[1], args[2])
Kmatrix[i][j] = _marginalizedkernel_do(Gn[i], Gn[j], node_label, edge_label, args[1], args[2])
Kmatrix[j][i] = Kmatrix[i][j]
print("\n --- marginalized kernel matrix of size %d built in %s seconds ---" % (len(Gn), (time.time() - start_time)))
run_time = time.time() - start_time
print("\n --- marginalized kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time))
return Kmatrix
return Kmatrix, run_time
else: # for only 2 graphs
start_time = time.time()
kernel = _marginalizedkernel_do(args[0], args[1], args[2], args[3])
kernel = _marginalizedkernel_do(args[0], args[1], node_label, edge_label, args[2], args[3])

print("\n --- marginalized kernel built in %s seconds ---" % (time.time() - start_time))
run_time = time.time() - start_time
print("\n --- marginalized kernel built in %s seconds ---" % (run_time))
return kernel
return kernel, run_time

def _marginalizedkernel_do(G1, G2, p_quit, itr):
def _marginalizedkernel_do(G1, G2, node_label, edge_label, p_quit, itr):
"""Calculate marginalized graph kernels between 2 graphs.
Parameters
----------
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label.
edge_label : string
edge attribute used as label.
p_quit : float
the termination probability in the random walks generating step
itr : integer
@@ -106,8 +115,8 @@ def _marginalizedkernel_do(G1, G2, p_quit, itr):
for neighbor2 in neighbor_n2:

t = p_trans_n1 * p_trans_n2 * \
deltakernel(G1.node[neighbor1]['label'] == G2.node[neighbor2]['label']) * \
deltakernel(neighbor_n1[neighbor1]['label'] == neighbor_n2[neighbor2]['label'])
deltakernel(G1.node[neighbor1][node_label] == G2.node[neighbor2][node_label]) * \
deltakernel(neighbor_n1[neighbor1][edge_label] == neighbor_n2[neighbor2][edge_label])
R_inf_new[node1[0]][node2[0]] += t * R_inf[neighbor1][neighbor2] # ref [1] equation (8)

R_inf[:] = R_inf_new
@@ -115,7 +124,7 @@ def _marginalizedkernel_do(G1, G2, p_quit, itr):
# add elements of R_inf up and calculate kernel
for node1 in G1.nodes(data = True):
for node2 in G2.nodes(data = True):
s = p_init_G1 * p_init_G2 * deltakernel(node1[1]['label'] == node2[1]['label'])
s = p_init_G1 * p_init_G2 * deltakernel(node1[1][node_label] == node2[1][node_label])
kernel += s * R_inf[node1[0]][node2[0]] # ref [1] equation (6)

return kernel
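
Since the entry point now returns a (matrix, run_time) pair, call sites must unpack two values. A short usage sketch (the dataset path is an assumption):

```python
from pygraph.utils.graphfiles import loadDataset
from pygraph.kernels.marginalizedKernel import marginalizedkernel

# load the Acyclic dataset as a list of NetworkX graphs (path is illustrative)
dataset, y = loadDataset('../datasets/acyclic/Acyclic/dataset_bps.ds')

# p_quit = 0.1, 20 iterations; labels default to 'atom' / 'bond_type'
Kmatrix, run_time = marginalizedkernel(dataset, 0.1, 20)
print('kernel matrix built in %f seconds' % run_time)
```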

pygraph/kernels/pathKernel.py (+23, -14)

@@ -8,7 +8,7 @@ import time

from pygraph.kernels.deltaKernel import deltakernel

def pathkernel(*args):
def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
"""Calculate mean average path kernels between graphs.
Parameters
@@ -18,6 +18,10 @@ def pathkernel(*args):
/
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
Return
------
@@ -29,38 +33,43 @@ def pathkernel(*args):
[1] Suard F, Rakotomamonjy A, Bensrhair A. Kernel on Bag of Paths For Measuring Similarity of Shapes. In ESANN 2007 Apr 25 (pp. 355-360).
"""
if len(args) == 1: # for a list of graphs
Gn = args[0]
Kmatrix = np.zeros((len(Gn), len(Gn)))

start_time = time.time()
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _pathkernel_do(Gn[i], Gn[j])
Kmatrix[i][j] = _pathkernel_do(Gn[i], Gn[j], node_label, edge_label)
Kmatrix[j][i] = Kmatrix[i][j]

print("\n --- mean average path kernel matrix of size %d built in %s seconds ---" % (len(Gn), (time.time() - start_time)))
run_time = time.time() - start_time
print("\n --- mean average path kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time))
return Kmatrix
return Kmatrix, run_time
else: # for only 2 graphs
start_time = time.time()
kernel = _pathkernel_do(args[0], args[1])
kernel = _pathkernel_do(args[0], args[1], node_label, edge_label)

print("\n --- mean average path kernel built in %s seconds ---" % (time.time() - start_time))
run_time = time.time() - start_time
print("\n --- mean average path kernel built in %s seconds ---" % (run_time))
return kernel
return kernel, run_time
def _pathkernel_do(G1, G2):
def _pathkernel_do(G1, G2, node_label = 'atom', edge_label = 'bond_type'):
"""Calculate mean average path kernels between 2 graphs.
Parameters
----------
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
Return
------
@@ -72,24 +81,24 @@ def _pathkernel_do(G1, G2):
num_nodes = G1.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp1.append(nx.shortest_path(G1, node1, node2, weight = 'cost'))
sp1.append(nx.shortest_path(G1, node1, node2, weight = edge_label))
sp2 = []
num_nodes = G2.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp2.append(nx.shortest_path(G2, node1, node2, weight = 'cost'))
sp2.append(nx.shortest_path(G2, node1, node2, weight = edge_label))

# calculate kernel
kernel = 0
for path1 in sp1:
for path2 in sp2:
if len(path1) == len(path2):
kernel_path = deltakernel(G1.node[path1[0]]['label'] == G2.node[path2[0]]['label'])
kernel_path = deltakernel(G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label])
if kernel_path:
for i in range(1, len(path1)):
# kernel = 1 if all corresponding nodes and edges in the 2 paths have same labels, otherwise 0
kernel_path *= deltakernel(G1[path1[i - 1]][path1[i]]['label'] == G2[path2[i - 1]][path2[i]]['label']) * deltakernel(G1.node[path1[i]]['label'] == G2.node[path2[i]]['label'])
kernel_path *= deltakernel(G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) * deltakernel(G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
kernel += kernel_path # add up kernels of all paths

kernel = kernel / (len(sp1) * len(sp2)) # calculate mean average


pygraph/kernels/results.md (+36, -0)

@@ -0,0 +1,36 @@
# results with minimal test RMSE for each kernel on dataset Acyclic
-- All the kernels are tested on dataset Acyclic, which consists of 185 molecules (graphs).
-- The methods used for prediction are SVM for classification and kernel ridge regression for regression.
-- For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the training data, then evaluate the performance on the test set. We choose the optimal parameters on the test set and report the corresponding performance. The final results are the average of the performances over the test sets.

## summary

| Kernels | RMSE(℃) | std(℃) | parameter | k_time |
|---------------|:---------:|:--------:|-------------:|-------:|
| shortest path | 36.40 | 5.35 | - | - |
| marginalized | 17.90 | 6.59 | p_quit = 0.1 | - |
| path | 14.27 | 6.37 | - | - |
| WL subtree | 9.00 | 6.37 | height = 1 | 0.85 |

**In each line, parameter is the one with which the kernel achieves the best results.
In each line, k_time is the time spent on building the kernel matrix.**

## detailed results of the WL subtree kernel
The table below shows the results of the WL subtree kernel under different subtree heights.
```
height RMSE_test std_test RMSE_train std_train k_time
-------- ----------- ---------- ------------ ----------- --------
0 36.2108 7.33179 141.419 1.08284 0.392911
1 9.00098 6.37145 140.065 0.877976 0.812077
2 19.8113 4.04911 140.075 0.928821 1.36955
3 25.0455 4.94276 140.198 0.873857 1.78629
4 28.2255 6.5212 140.272 0.838915 2.30847
5 30.6354 6.73647 140.247 0.86363 2.8258
6 32.1027 6.85601 140.239 0.872475 3.1542
7 32.9709 6.89606 140.094 0.917704 3.46081
8 33.5112 6.90753 140.076 0.931866 4.08857
9 33.8502 6.91427 139.913 0.928974 4.25243
10 34.0963 6.93115 139.894 0.942612 5.02607
```
**The unit of the *RMSEs* and *stds* is *℃*; the unit of *k_time* is *s*.
k_time is the time spent on building the kernel matrix.**

pygraph/kernels/spkernel.py (+9, -6)

@@ -10,7 +10,7 @@ import time
from pygraph.utils.utils import getSPGraph


def spkernel(*args):
def spkernel(*args, edge_weight = 'bond_type'):
"""Calculate shortest-path kernels between graphs.
Parameters
@@ -20,6 +20,8 @@ def spkernel(*args):
/
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
edge_weight : string
edge attribute corresponding to the edge weight. The default edge weight is bond_type.
Return
------
@@ -37,7 +39,7 @@ def spkernel(*args):
Sn = [] # get shortest path graphs of Gn
for i in range(0, len(Gn)):
Sn.append(getSPGraph(Gn[i]))
Sn.append(getSPGraph(Gn[i], edge_weight = edge_weight))

start_time = time.time()
for i in range(0, len(Gn)):
@@ -48,13 +50,14 @@ def spkernel(*args):
Kmatrix[i][j] += 1
Kmatrix[j][i] += (0 if i == j else 1)

print("--- shortest path kernel matrix of size %d built in %s seconds ---" % (len(Gn), (time.time() - start_time)))
run_time = time.time() - start_time
print("--- shortest path kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time))
return Kmatrix
return Kmatrix, run_time
else: # for only 2 graphs
G1 = getSPGraph(args[0])
G2 = getSPGraph(args[1])
G1 = getSPGraph(args[0], edge_weight = edge_weight)
G2 = getSPGraph(args[1], edge_weight = edge_weight)
kernel = 0
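
A usage sketch of the updated signature; as the notebook comment asks, whether 'bond_type' is a sensible distance is an open question, since the attribute is fed directly into the shortest-path weights:

```python
from pygraph.utils.graphfiles import loadDataset
from pygraph.kernels.spkernel import spkernel

# graphs whose edges carry a 'bond_type' attribute (path is an assumption)
dataset, y = loadDataset('../datasets/acyclic/Acyclic/dataset_bps.ds')
Kmatrix, run_time = spkernel(dataset, edge_weight='bond_type')
```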


pygraph/kernels/weisfeilerLehmanKernel.py (+61, -51)

@@ -23,7 +23,7 @@ import time
from pygraph.kernels.spkernel import spkernel
from pygraph.kernels.pathKernel import pathkernel

def weisfeilerlehmankernel(*args, height = 0, base_kernel = 'subtree'):
def weisfeilerlehmankernel(*args, node_label = 'atom', edge_label = 'bond_type', height = 0, base_kernel = 'subtree'):
"""Calculate Weisfeiler-Lehman kernels between graphs.
Parameters
@@ -32,12 +32,15 @@ def weisfeilerlehmankernel(*args, height = 0, base_kernel = 'subtree'):
List of graphs between which the kernels are calculated.
/
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
height : subtree height
base_kernel : base kernel used in each iteration of WL kernel
the default base kernel is subtree kernel
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
height : int
subtree height
base_kernel : string
base kernel used in each iteration of WL kernel. The default base kernel is subtree kernel.
Return
------
@@ -57,7 +60,7 @@ def weisfeilerlehmankernel(*args, height = 0, base_kernel = 'subtree'):
# for WL subtree kernel
if base_kernel == 'subtree':
Kmatrix = _wl_subtreekernel_do(args[0], height = height, base_kernel = 'subtree')
Kmatrix = _wl_subtreekernel_do(args[0], node_label = node_label, edge_label = edge_label, height = height, base_kernel = 'subtree') # labels must be passed as keywords: positional values would be swallowed by *args
# for WL edge kernel
elif base_kernel == 'edge':
@@ -73,9 +76,10 @@ def weisfeilerlehmankernel(*args, height = 0, base_kernel = 'subtree'):
Kmatrix[i][j] = _weisfeilerlehmankernel_do(Gn[i], Gn[j], height = height)
Kmatrix[j][i] = Kmatrix[i][j]

print("\n --- Weisfeiler-Lehman %s kernel matrix of size %d built in %s seconds ---" % (base_kernel, len(args[0]), (time.time() - start_time)))
run_time = time.time() - start_time
print("\n --- Weisfeiler-Lehman %s kernel matrix of size %d built in %s seconds ---" % (base_kernel, len(args[0]), run_time))
return Kmatrix
return Kmatrix, run_time
else: # for only 2 graphs
@@ -85,7 +89,7 @@ def weisfeilerlehmankernel(*args, height = 0, base_kernel = 'subtree'):
if base_kernel == 'subtree':
args = [args[0], args[1]]
kernel = _wl_subtreekernel_do(args, height = height, base_kernel = 'subtree')
kernel = _wl_subtreekernel_do(args, node_label = node_label, edge_label = edge_label, height = height, base_kernel = 'subtree')
# for WL edge kernel
elif base_kernel == 'edge':
@@ -97,18 +101,27 @@ def weisfeilerlehmankernel(*args, height = 0, base_kernel = 'subtree'):

kernel = _pathkernel_do(args[0], args[1])

print("\n --- Weisfeiler-Lehman %s kernel built in %s seconds ---" % (base_kernel, time.time() - start_time))
run_time = time.time() - start_time
print("\n --- Weisfeiler-Lehman %s kernel built in %s seconds ---" % (base_kernel, run_time))
return kernel
return kernel, run_time
def _wl_subtreekernel_do(*args, height = 0, base_kernel = 'subtree'):
def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', height = 0, base_kernel = 'subtree'):
"""Calculate Weisfeiler-Lehman subtree kernels between graphs.
Parameters
----------
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
height : int
subtree height
base_kernel : string
base kernel used in each iteration of WL kernel. The default base kernel is subtree kernel.
Return
------
@@ -120,55 +133,54 @@ def _wl_subtreekernel_do(*args, height = 0, base_kernel = 'subtree'):
Kmatrix = np.zeros((len(Gn), len(Gn)))
all_num_of_labels_occured = 0 # number of the set of letters that occur before as node labels at least once in all graphs

# initial
# initial for height = 0
all_labels_ori = set() # all unique orignal labels in all graphs in this iteration
all_num_of_each_label = [] # number of occurence of each label in each graph in this iteration
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration
num_of_labels_occured = all_num_of_labels_occured # number of the set of letters that occur before as node labels at least once in all graphs

# for each graph
for idx, G in enumerate(Gn):
for G in Gn:
# get the set of original labels
labels_ori = list(nx.get_node_attributes(G, 'label').values())
labels_ori = list(nx.get_node_attributes(G, node_label).values())
all_labels_ori.update(labels_ori)
num_of_each_label = dict(Counter(labels_ori)) # number of occurence of each label in graph
all_num_of_each_label.append(num_of_each_label)
num_of_labels = len(num_of_each_label) # number of all unique labels

all_labels_ori.update(labels_ori)
# # calculate subtree kernel while h = 0 and add it to the final kernel
# for i in range(0, len(Gn)):
# for j in range(i, len(Gn)):
# labels = set(list(nx.get_node_attributes(Gn[i], 'label').values()) + list(nx.get_node_attributes(Gn[j], 'label').values()))
# vector1 = np.matrix([ (nx.get_node_attributes(Gn[i], 'label').values()[label] if (label in all_num_of_each_label[i].keys()) else 0) for label in labels ])
# vector2 = np.matrix([ (all_num_of_each_label[j][label] if (label in all_num_of_each_label[j].keys()) else 0) for label in labels ])
# Kmatrix[i][j] += np.dot(vector1, vector2.transpose())
# Kmatrix[j][i] = Kmatrix[i][j]
all_num_of_labels_occured += len(all_labels_ori)
# calculate subtree kernel with the 0th iteration and add it to the final kernel
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
labels = set(list(all_num_of_each_label[i].keys()) + list(all_num_of_each_label[j].keys()))
vector1 = np.matrix([ (all_num_of_each_label[i][label] if (label in all_num_of_each_label[i].keys()) else 0) for label in labels ])
vector2 = np.matrix([ (all_num_of_each_label[j][label] if (label in all_num_of_each_label[j].keys()) else 0) for label in labels ])
Kmatrix[i][j] += np.dot(vector1, vector2.transpose())
Kmatrix[j][i] = Kmatrix[i][j]
# iterate each height
for h in range(height + 1):
all_labels_ori = set() # all unique orignal labels in all graphs in this iteration
all_num_of_each_label = [] # number of occurence of each label in each graph in this iteration
for h in range(1, height + 1):
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration
num_of_labels_occured = all_num_of_labels_occured # number of the set of letters that occur before as node labels at least once in all graphs
all_labels_ori = set()
all_num_of_each_label = []
# for each graph
for idx, G in enumerate(Gn):
# get the set of original labels
labels_ori = list(nx.get_node_attributes(G, 'label').values())
num_of_each_label = dict(Counter(labels_ori)) # number of occurence of each label in graph
num_of_labels = len(num_of_each_label) # number of all unique labels
all_labels_ori.update(labels_ori)
num_of_labels_occured = all_num_of_labels_occured + len(all_labels_ori) + len(all_set_compressed)
set_multisets = []
for node in G.nodes(data = True):
# Multiset-label determination.
multiset = [ G.node[neighbors]['label'] for neighbors in G[node[0]] ]
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ]
# sorting each multiset
multiset.sort()
multiset = node[1]['label'] + ''.join(multiset) # concatenate to a string and add the prefix
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix
set_multisets.append(multiset)

# label compression
# set_multisets.sort() # this is unnecessary
set_unique = list(set(set_multisets)) # set of unique multiset labels
# a dictionary mapping original labels to new ones.
set_compressed = {}
@@ -179,20 +191,20 @@ def _wl_subtreekernel_do(*args, height = 0, base_kernel = 'subtree'):
else:
set_compressed.update({ value : str(num_of_labels_occured + 1) })
num_of_labels_occured += 1
# set_compressed = { value : (all_set_compressed[value] if value in all_set_compressed.keys() else str(set_unique.index(value) + num_of_labels_occured + 1)) for value in set_unique }
all_set_compressed.update(set_compressed)
# num_of_labels_occured += len(set_compressed) #@todo not precise

# relabel nodes
# nx.relabel_nodes(G, set_compressed, copy = False)
for node in G.nodes(data = True):
node[1]['label'] = set_compressed[set_multisets[node[0]]]
node[1][node_label] = set_compressed[set_multisets[node[0]]]

# get the set of compressed labels
labels_comp = list(nx.get_node_attributes(G, 'label').values())
num_of_each_label.update(dict(Counter(labels_comp)))
labels_comp = list(nx.get_node_attributes(G, node_label).values())
all_labels_ori.update(labels_comp)
num_of_each_label = dict(Counter(labels_comp))
all_num_of_each_label.append(num_of_each_label)
all_num_of_labels_occured += len(all_labels_ori)
# calculate subtree kernel with h iterations and add it to the final kernel
for i in range(0, len(Gn)):
@@ -203,8 +215,6 @@ def _wl_subtreekernel_do(*args, height = 0, base_kernel = 'subtree'):
Kmatrix[i][j] += np.dot(vector1, vector2.transpose())
Kmatrix[j][i] = Kmatrix[i][j]
all_num_of_labels_occured += len(all_labels_ori)

return Kmatrix
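
For reference, the relabeling loop above is the standard WL step: each node's new label is its current label concatenated with the sorted labels of its neighbours, then compressed to a fresh string. A self-contained sketch of one iteration on a single graph (illustrative only; networkx 1.x G.node API as in this file, labels assumed to be strings):

```python
def wl_relabel_once(G, node_label='atom', compressed=None, num_occured=0):
    """One Weisfeiler-Lehman relabeling iteration (illustrative)."""
    compressed = {} if compressed is None else compressed
    multisets = {}
    for node in G.nodes():
        # multiset-label: own label + sorted neighbour labels
        neighbor_labels = sorted(G.node[n][node_label] for n in G[node])
        multisets[node] = G.node[node][node_label] + ''.join(neighbor_labels)
    for node, multiset in multisets.items():
        if multiset not in compressed:
            # assign a fresh compressed label, as in the loop above
            num_occured += 1
            compressed[multiset] = str(num_occured)
        G.node[node][node_label] = compressed[multiset]
    return compressed, num_occured
```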


pygraph/utils/__pycache__/graphfiles.cpython-35.pyc (BIN)


pygraph/utils/utils.py (+10, -6)

@@ -5,18 +5,20 @@ import numpy as np
def getSPLengths(G1):
sp = nx.shortest_path(G1)
distances = np.zeros((G1.number_of_nodes(), G1.number_of_nodes()))
for i in np.keys():
for j in np[i].keys():
for i in sp.keys():
for j in sp[i].keys():
distances[i, j] = len(sp[i][j])-1
return distances

def getSPGraph(G):
def getSPGraph(G, edge_weight = 'bond_type'):
"""Transform graph G to its corresponding shortest-paths graph.
Parameters
----------
G : NetworkX graph
The graph to be transformed.
edge_weight : string
edge attribute corresponding to the edge weight. The default edge weight is bond_type.
Return
------
@@ -31,15 +33,17 @@ def getSPGraph(G):
----------
[1] Borgwardt KM, Kriegel HP. Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on 2005 Nov 27 (pp. 8-pp). IEEE.
"""
return floydTransformation(G)
return floydTransformation(G, edge_weight = edge_weight)
def floydTransformation(G):
def floydTransformation(G, edge_weight = 'bond_type'):
"""Transform graph G to its corresponding shortest-paths graph using Floyd-transformation.
Parameters
----------
G : NetworkX graph
The graph to be transformed.
edge_weight : string
edge attribute corresponding to the edge weight. The default edge weight is bond_type.
Return
------
@@ -50,7 +54,7 @@ def floydTransformation(G):
----------
[1] Borgwardt KM, Kriegel HP. Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on 2005 Nov 27 (pp. 8-pp). IEEE.
"""
spMatrix = nx.floyd_warshall_numpy(G) # @todo weigth label not considered
spMatrix = nx.floyd_warshall_numpy(G, weight = edge_weight)
S = nx.Graph()
S.add_nodes_from(G.nodes(data=True))
for i in range(0, G.number_of_nodes()):
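
With the weight honoured, the shortest-paths graph reflects accumulated edge weights rather than hop counts. A small check on a toy graph (the 'cost' attribute name on the output edges is an assumption, since the loop body is truncated above):

```python
import networkx as nx
from pygraph.utils.utils import floydTransformation

G = nx.Graph()
G.add_edge(0, 1, bond_type=1)
G.add_edge(1, 2, bond_type=2)

S = floydTransformation(G, edge_weight='bond_type')
# S is complete on G's nodes; the 0-2 shortest path now costs 1 + 2 = 3,
# whereas the old hard-coded call counted every edge as weight 1
print(S.edges(data=True))
```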

