Browse Source

modified:   README.md

	modified:   notebooks/run_cyclicpatternkernel.ipynb
	modified:   notebooks/run_marginalizedkernel_acyclic.ipynb
	modified:   notebooks/run_pathkernel_acyclic.ipynb
	modified:   notebooks/run_spkernel_acyclic.ipynb
	modified:   notebooks/run_treeletkernel_acyclic.ipynb
	modified:   notebooks/run_treepatternkernel.ipynb
	modified:   notebooks/run_untildpathkernel_acyclic.ipynb
	new file:   notebooks/run_untilnwalkkernel.ipynb
	modified:   notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
	modified:   pygraph/kernels/treePatternKernel.py
	modified:   pygraph/kernels/untildPathKernel.py
	new file:   pygraph/kernels/untilnWalkKernel.py
	new file:   pygraph/utils/model_selection_precomputed.py
	modified:   pygraph/utils/utils.py
v0.1
jajupmochi 7 years ago
parent
commit
338f8ba326
30 changed files with 37615 additions and 54054 deletions
  1. +16 -17       README.md
  2. +1789 -420    notebooks/.ipynb_checkpoints/run_cyclicpatternkernel-checkpoint.ipynb
  3. +116 -0       notebooks/.ipynb_checkpoints/run_spkernel_acyclic-checkpoint.ipynb
  4. +1132 -0      notebooks/.ipynb_checkpoints/run_untildpathkernel_acyclic-checkpoint.ipynb
  5. +12297 -0     notebooks/.ipynb_checkpoints/run_untilnwalkkernel-checkpoint.ipynb
  6. +2236 -0      notebooks/.ipynb_checkpoints/test_modelselection-checkpoint.ipynb
  7. +1271 -0      notebooks/.ipynb_checkpoints/test_scikit_ksvm-checkpoint.ipynb
  8. +1460 -407    notebooks/run_cyclicpatternkernel.ipynb
  9. +688 -0       notebooks/run_marginalizedkernel_acyclic.ipynb
  10. +136 -0      notebooks/run_pathkernel_acyclic.ipynb
  11. +116 -0      notebooks/run_spkernel_acyclic.ipynb
  12. +123 -0      notebooks/run_treeletkernel_acyclic.ipynb
  13. +9296 -7     notebooks/run_treepatternkernel.ipynb
  14. +1132 -0     notebooks/run_untildpathkernel_acyclic.ipynb
  15. +383 -0      notebooks/run_untilnwalkkernel.ipynb
  16. +2513 -199   notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
  17. +0 -52672    notebooks/test_marginalizedkernel.ipynb
  18. +1931 -0     notebooks/test_modelselection.ipynb
  19. +262 -0      notebooks/test_scikit_ksvm.ipynb
  20. BIN          pygraph/kernels/__pycache__/treePatternKernel.cpython-35.pyc
  21. BIN          pygraph/kernels/__pycache__/untildPathKernel.cpython-35.pyc
  22. BIN          pygraph/kernels/__pycache__/untilnWalkKernel.cpython-35.pyc
  23. +12 -4       pygraph/kernels/treePatternKernel.py
  24. +2 -4        pygraph/kernels/untildPathKernel.py
  25. +182 -0      pygraph/kernels/untilnWalkKernel.py
  26. BIN          pygraph/utils/__pycache__/model_selection_precomputed.cpython-35.pyc
  27. BIN          pygraph/utils/__pycache__/utils.cpython-35.pyc
  28. +213 -0      pygraph/utils/model_selection_precomputed.py
  29. +309 -308    pygraph/utils/utils.py
  30. +0 -16       run_cyclic.py

+ 16
- 17
README.md View File

@@ -16,25 +16,24 @@ All kernels except for Cyclic pattern kernel are tested on dataset Acyclic, whic

The criteria used for prediction are SVM for classification and kernel Ridge regression for regression.

For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the optimal parameters on the test set and finally report the corresponding performance. The final results correspond to the average of the performances on the test sets.
| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|------------------|:-------:|:------:|------------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 18.41 | 10.78 | - | 29.43" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| WL shortest path | 35.16 | 4.50 | height = 2 | 40.24" |
| WL edge | 33.41 | 4.73 | height = 5 | 5.66" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.59" |
| Tree pattern | 7.27 | 2.21 | lamda = 1, h = 2 | 37.24" |
| Cyclic pattern | 0.9 | 0.11 | cycle bound = 100 | 0.31" |
~~For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the optimal parameters on the test set and finally report the corresponding performance. The final results correspond to the average of the performances on the test sets.~~
| Kernels | train_perf | valid_perf | test_perf | Parameters | gram_matrix_time |
|------------------|-----------:|-----------:|-----------:|------------------------------------------------------:|-----------------:|
| Shortest path | 28.65±0.59 | 36.09±0.97 | 36.45±6.63 | 'alpha': '3.55e+01' | 12.67" |
| Marginalized | 12.42±0.28 | 18.60±2.02 | 16.51±5.12 | 'p_quit': 0.3, 'alpha': '3.16e-06' | 430.42" |
| Path | 11.19±0.73 | 23.66±1.74 | 25.04±9.60 | 'alpha': '2.57e-03' | 21.84" |
| WL subtree | 6.00±0.27 | 7.59±0.71 | 7.92±2.92 | 'height': 1.0, 'alpha': '1.26e-01' | 0.84" |
| WL shortest path | 28.32±0.63 | 35.99±0.98 | 37.92±5.60 | 'height': 2.0, 'alpha': '1.00e+02' | 39.79" |
| WL edge | 30.10±0.57 | 35.13±0.78 | 37.70±6.92 | 'height': 4.0, 'alpha': '3.98e+01' | 4.35" |
| Treelet | 7.38±0.37 | 14.21±0.80 | 15.26±3.65 | 'alpha': '1.58e+00' | 0.49" |
| Path up to d | 5.48±0.23 | 10.00±0.83 | 10.73±5.67 | 'depth': 2.0, 'k_func': 'MinMax', 'alpha': '7.94e-02' | 0.57" |
| Tree pattern | | | | | |
| Cyclic pattern | 0.62±0.02 | 0.62±0.02 | 0.57±0.17 | 'cycle_bound': 125.0, 'C': '1.78e-01' | 0.33" |
* RMSE stands for the arithmetic mean of the root mean squared errors over all splits.
* STD stands for the standard deviation of the root mean squared errors over all splits.
* Parameter is the one with which the kernel achieves the best results.
* k_time is the time spent on building the kernel matrix.
* The targets of training data are normalized before calculating *treelet kernel*.
* Parameters are the ones with which the kernel achieves the best results.
* gram_matrix_time is the time spent on building the gram matrix.
* See detailed results in [results.md](pygraph/kernels/results.md).
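
As an illustration of the evaluation protocol described above, here is a minimal, simplified sketch (not part of the repository): it assumes a precomputed gram matrix `Kmatrix` (NumPy array) and a target array `y`, uses kernel ridge regression on random 90/10 splits, and averages the RMSE over the splits. The function name, parameter names and split count are illustrative only.

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

def evaluate_precomputed(Kmatrix, y, alpha, splits=10, train_ratio=0.9):
    # average RMSE over random 90/10 splits of a precomputed gram matrix
    rmses = []
    n = len(y)
    for seed in range(splits):
        perm = np.random.RandomState(seed).permutation(n)
        n_train = int(n * train_ratio)
        train, test = perm[:n_train], perm[n_train:]
        model = KernelRidge(kernel='precomputed', alpha=alpha)
        # rows and columns of the gram matrix are selected by the split indices
        model.fit(Kmatrix[np.ix_(train, train)], y[train])
        y_pred = model.predict(Kmatrix[np.ix_(test, train)])
        rmses.append(np.sqrt(mean_squared_error(y[test], y_pred)))
    return np.mean(rmses), np.std(rmses)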

## References


+ 1789
- 420
notebooks/.ipynb_checkpoints/run_cyclicpatternkernel-checkpoint.ipynb
File diff suppressed because it is too large
View File


+ 116
- 0
notebooks/.ipynb_checkpoints/run_spkernel_acyclic-checkpoint.ipynb
File diff suppressed because it is too large
View File


+ 1132
- 0
notebooks/.ipynb_checkpoints/run_untildpathkernel_acyclic-checkpoint.ipynb
File diff suppressed because it is too large
View File


+ 12297
- 0
notebooks/.ipynb_checkpoints/run_untilnwalkkernel-checkpoint.ipynb
File diff suppressed because it is too large
View File


+ 2236
- 0
notebooks/.ipynb_checkpoints/test_modelselection-checkpoint.ipynb
File diff suppressed because it is too large
View File


+ 1271
- 0
notebooks/.ipynb_checkpoints/test_scikit_ksvm-checkpoint.ipynb
File diff suppressed because it is too large
View File


+ 1460
- 407
notebooks/run_cyclicpatternkernel.ipynb
File diff suppressed because it is too large
View File


+ 688
- 0
notebooks/run_marginalizedkernel_acyclic.ipynb
File diff suppressed because it is too large
View File


+ 136
- 0
notebooks/run_pathkernel_acyclic.ipynb
File diff suppressed because it is too large
View File


+ 116
- 0
notebooks/run_spkernel_acyclic.ipynb
File diff suppressed because it is too large
View File


+ 123
- 0
notebooks/run_treeletkernel_acyclic.ipynb
File diff suppressed because it is too large
View File


+ 9296
- 7
notebooks/run_treepatternkernel.ipynb
File diff suppressed because it is too large
View File


+ 1132
- 0
notebooks/run_untildpathkernel_acyclic.ipynb
File diff suppressed because it is too large
View File


+ 383
- 0
notebooks/run_untilnwalkkernel.ipynb
File diff suppressed because it is too large
View File


+ 2513
- 199
notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
File diff suppressed because it is too large
View File


+ 0
- 52672
notebooks/test_marginalizedkernel.ipynb
File diff suppressed because it is too large
View File


+ 1931
- 0
notebooks/test_modelselection.ipynb
File diff suppressed because it is too large
View File


+ 262
- 0
notebooks/test_scikit_ksvm.ipynb View File

@@ -2,6 +2,268 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Automatically created module for IPython interactive environment\n",
"# Tuning hyper-parameters for precision\n",
"\n",
"Best parameters set found on development set:\n",
"\n",
"{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"\n",
"Grid scores on development set:\n",
"\n",
"0.986 (+/-0.016) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"0.959 (+/-0.029) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
"0.988 (+/-0.017) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"0.982 (+/-0.026) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
"0.988 (+/-0.017) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"0.982 (+/-0.025) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
"0.988 (+/-0.017) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"0.982 (+/-0.025) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
"0.975 (+/-0.014) for {'C': 1, 'kernel': 'linear'}\n",
"0.975 (+/-0.014) for {'C': 10, 'kernel': 'linear'}\n",
"0.975 (+/-0.014) for {'C': 100, 'kernel': 'linear'}\n",
"0.975 (+/-0.014) for {'C': 1000, 'kernel': 'linear'}\n",
"\n",
"Detailed classification report:\n",
"\n",
"The model is trained on the full development set.\n",
"The scores are computed on the full evaluation set.\n",
"\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 89\n",
" 1 0.97 1.00 0.98 90\n",
" 2 0.99 0.98 0.98 92\n",
" 3 1.00 0.99 0.99 93\n",
" 4 1.00 1.00 1.00 76\n",
" 5 0.99 0.98 0.99 108\n",
" 6 0.99 1.00 0.99 89\n",
" 7 0.99 1.00 0.99 78\n",
" 8 1.00 0.98 0.99 92\n",
" 9 0.99 0.99 0.99 92\n",
"\n",
"avg / total 0.99 0.99 0.99 899\n",
"\n",
"\n",
"# Tuning hyper-parameters for recall\n",
"\n",
"Best parameters set found on development set:\n",
"\n",
"{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"\n",
"Grid scores on development set:\n",
"\n",
"0.986 (+/-0.019) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"0.957 (+/-0.029) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
"0.987 (+/-0.019) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"0.981 (+/-0.028) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
"0.987 (+/-0.019) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"0.981 (+/-0.026) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
"0.987 (+/-0.019) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}\n",
"0.981 (+/-0.026) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
"0.972 (+/-0.012) for {'C': 1, 'kernel': 'linear'}\n",
"0.972 (+/-0.012) for {'C': 10, 'kernel': 'linear'}\n",
"0.972 (+/-0.012) for {'C': 100, 'kernel': 'linear'}\n",
"0.972 (+/-0.012) for {'C': 1000, 'kernel': 'linear'}\n",
"\n",
"Detailed classification report:\n",
"\n",
"The model is trained on the full development set.\n",
"The scores are computed on the full evaluation set.\n",
"\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 89\n",
" 1 0.97 1.00 0.98 90\n",
" 2 0.99 0.98 0.98 92\n",
" 3 1.00 0.99 0.99 93\n",
" 4 1.00 1.00 1.00 76\n",
" 5 0.99 0.98 0.99 108\n",
" 6 0.99 1.00 0.99 89\n",
" 7 0.99 1.00 0.99 78\n",
" 8 1.00 0.98 0.99 92\n",
" 9 0.99 0.99 0.99 92\n",
"\n",
"avg / total 0.99 0.99 0.99 899\n",
"\n",
"\n"
]
}
],
"source": [
"# Parameter estimation using grid search with cross-validation\n",
"from __future__ import print_function\n",
"\n",
"from sklearn import datasets\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.metrics import classification_report\n",
"from sklearn.svm import SVC\n",
"\n",
"print(__doc__)\n",
"\n",
"# Loading the Digits dataset\n",
"digits = datasets.load_digits()\n",
"\n",
"# To apply an classifier on this data, we need to flatten the image, to\n",
"# turn the data in a (samples, feature) matrix:\n",
"n_samples = len(digits.images)\n",
"X = digits.images.reshape((n_samples, -1))\n",
"y = digits.target\n",
"\n",
"# Split the dataset in two equal parts\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y, test_size=0.5, random_state=0)\n",
"\n",
"# Set the parameters by cross-validation\n",
"tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],\n",
" 'C': [1, 10, 100, 1000]},\n",
" {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]\n",
"\n",
"scores = ['precision', 'recall']\n",
"\n",
"for score in scores:\n",
" print(\"# Tuning hyper-parameters for %s\" % score)\n",
" print()\n",
"\n",
" clf = GridSearchCV(SVC(), tuned_parameters, cv=5,\n",
" scoring='%s_macro' % score)\n",
" clf.fit(X_train, y_train)\n",
"\n",
" print(\"Best parameters set found on development set:\")\n",
" print()\n",
" print(clf.best_params_)\n",
" print()\n",
" print(\"Grid scores on development set:\")\n",
" print()\n",
" means = clf.cv_results_['mean_test_score']\n",
" stds = clf.cv_results_['std_test_score']\n",
" for mean, std, params in zip(means, stds, clf.cv_results_['params']):\n",
" print(\"%0.3f (+/-%0.03f) for %r\"\n",
" % (mean, std * 2, params))\n",
" print()\n",
"\n",
" print(\"Detailed classification report:\")\n",
" print()\n",
" print(\"The model is trained on the full development set.\")\n",
" print(\"The scores are computed on the full evaluation set.\")\n",
" print()\n",
" y_true, y_pred = y_test, clf.predict(X_test)\n",
" print(classification_report(y_true, y_pred))\n",
" print()\n",
"\n",
"# Note the problem is too easy: the hyperparameter plateau is too flat and the\n",
"# output model is the same for precision and recall with ties in quality."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cv=None, error_score='raise',\n",
" estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
" decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n",
" max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
" tol=0.001, verbose=False),\n",
" fit_params=None, iid=True, n_jobs=1,\n",
" param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},\n",
" pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
" scoring=None, verbose=0)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn import svm, datasets\n",
"from sklearn.model_selection import GridSearchCV\n",
"iris = datasets.load_iris()\n",
"parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}\n",
"svc = svm.SVC()\n",
"clf = GridSearchCV(svc, parameters)\n",
"clf.fit(iris.data, iris.target)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['mean_fit_time',\n",
" 'mean_score_time',\n",
" 'mean_test_score',\n",
" 'mean_train_score',\n",
" 'param_C',\n",
" 'param_kernel',\n",
" 'params',\n",
" 'rank_test_score',\n",
" 'split0_test_score',\n",
" 'split0_train_score',\n",
" 'split1_test_score',\n",
" 'split1_train_score',\n",
" 'split2_test_score',\n",
" 'split2_train_score',\n",
" 'std_fit_time',\n",
" 'std_score_time',\n",
" 'std_test_score',\n",
" 'std_train_score']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted(clf.cv_results_.keys())"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict_values([array([0.98 , 0.97333333, 0.97333333, 0.98 ]), array([1. , 0.98039216, 1. , 0.98039216]), array([0.00030899, 0.00021172, 0.00019932, 0.00017134]), array([1, 3, 3, 1], dtype=int32), array([0.01617914, 0.00902067, 0.03715363, 0.01592466]), masked_array(data=['linear', 'rbf', 'linear', 'rbf'],\n",
" mask=[False, False, False, False],\n",
" fill_value='?',\n",
" dtype=object), array([1., 1., 1., 1.]), array([0.98999802, 0.98336304, 0.97999604, 0.97999604]), array([6.43618303e-05, 6.20771049e-05, 7.16528819e-05, 9.16456815e-06]), array([0.97979798, 0.96969697, 0.95959596, 0.95959596]), [{'kernel': 'linear', 'C': 1}, {'kernel': 'rbf', 'C': 1}, {'kernel': 'linear', 'C': 10}, {'kernel': 'rbf', 'C': 10}], array([0.00036526, 0.00039411, 0.0002923 , 0.00032218]), array([0.00824863, 0.01254825, 0.01649726, 0.01649726]), array([0.97916667, 0.97916667, 1. , 1. ]), array([0.99019608, 0.98039216, 0.98039216, 0.98039216]), array([5.54407363e-05, 3.25514857e-05, 7.09833681e-05, 3.70551530e-06]), masked_array(data=[1, 1, 10, 10],\n",
" mask=[False, False, False, False],\n",
" fill_value='?',\n",
" dtype=object), array([0.96078431, 0.96078431, 0.92156863, 0.96078431])])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf.cv_results_.values()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],


BIN
pygraph/kernels/__pycache__/treePatternKernel.cpython-35.pyc View File


BIN
pygraph/kernels/__pycache__/untildPathKernel.cpython-35.pyc View File


BIN
pygraph/kernels/__pycache__/untilnWalkKernel.cpython-35.pyc View File


+ 12
- 4
pygraph/kernels/treePatternKernel.py View File

@@ -29,10 +29,12 @@ def treepatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labe
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
depth : integer
Depth of search. Longest length of paths.
k_func : function
A kernel function using different notions of fingerprint similarity.
kernel_type : string
Type of tree pattern kernel, could be 'untiln', 'size' or 'branching'.
lmda : float
Weight to decide whether linear patterns or tree patterns of increasing complexity are favored.
h : integer
The upper bound of the height of tree patterns.

Return
------
@@ -74,6 +76,12 @@ def _treepatternkernel_do(G1, G2, node_label, edge_label, labeled, kernel_type,
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
kernel_type : string
Type of tree pattern kernel, could be 'untiln', 'size' or 'branching'.
lmda : float
Weight to decide whether linear patterns or tree patterns of increasing complexity are favored.
h : integer
The upper bound of the height of tree patterns.

Return
------
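
A hypothetical usage sketch for the parameters documented above. The keyword names follow this docstring; the dataset path, loader and the (Kmatrix, run_time) return convention are assumptions, chosen to match the other kernels and examples in this commit.

import sys
sys.path.insert(0, "../")
from pygraph.utils.graphfiles import loadDataset
from pygraph.kernels.treePatternKernel import treepatternkernel

# assumed dataset path, as used in the examples elsewhere in this commit
dataset, y = loadDataset('../../../../datasets/acyclic/Acyclic/dataset_bps.ds')

# 'untiln' tree pattern kernel with weight lmda = 1 and height bound h = 2 (illustrative values)
Kmatrix, run_time = treepatternkernel(dataset, node_label='atom', edge_label='bond_type',
                                      labeled=True, kernel_type='untiln', lmda=1, h=2)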


+ 2
- 4
pygraph/kernels/untildPathKernel.py View File

@@ -8,8 +8,6 @@ import pathlib
sys.path.insert(0, "../")
import time

from collections import Counter

import networkx as nx
import numpy as np

@@ -36,8 +34,8 @@ def untildpathkernel(*args, node_label = 'atom', edge_label = 'bond_type', label

Return
------
Kmatrix/kernel : Numpy matrix/float
Kernel matrix, each element of which is the path kernel up to d between 2 graphs. / Path kernel up to d between 2 graphs.
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the path kernel up to d between 2 graphs.
"""
depth = int(depth)
if len(args) == 1: # for a list of graphs


+ 182
- 0
pygraph/kernels/untilnWalkKernel.py View File

@@ -0,0 +1,182 @@
"""
@author: linlin
@references: Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. Learning Theory and Kernel Machines, pages 129–143, 2003.
"""

import sys
import pathlib
sys.path.insert(0, "../")
import time

from collections import Counter

import networkx as nx
import numpy as np


def untilnwalkkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, n = 10):
"""Calculate common walk graph kernels up to depth d between graphs.
Parameters
----------
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
/
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
n : integer
Longest length of walks.

Return
------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the walk kernel up to n between 2 graphs.
"""
Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
Kmatrix = np.zeros((len(Gn), len(Gn)))
n = int(n)

start_time = time.time()

# get all walks of all graphs before calculating kernels to save time; this may cost a lot of memory for large datasets.
all_walks = [ find_all_walks_until_length(Gn[i], n, node_label = node_label, edge_label = edge_label, labeled = labeled) for i in range(0, len(Gn)) ]

for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _untilnwalkkernel_do(all_walks[i], all_walks[j], node_label = node_label, edge_label = edge_label, labeled = labeled)
Kmatrix[j][i] = Kmatrix[i][j]

run_time = time.time() - start_time
print("\n --- kernel matrix of walk kernel up to %d of size %d built in %s seconds ---" % (n, len(Gn), run_time))

return Kmatrix, run_time


def _untilnwalkkernel_do(walks1, walks2, node_label = 'atom', edge_label = 'bond_type', labeled = True):
"""Calculate walk graph kernels up to n between 2 graphs.

Parameters
----------
walks1, walks2 : list
List of walks in 2 graphs, where for unlabeled graphs, each walk is represented by a list of nodes; while for labeled graphs, each walk is represented by a string consisting of the labels of nodes and edges on that walk.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.

Return
------
kernel : float
Common walk kernel up to length n between 2 graphs.
"""
counts_walks1 = dict(Counter(walks1))
counts_walks2 = dict(Counter(walks2))
all_walks = list(set(walks1 + walks2))

vector1 = [ (counts_walks1[walk] if walk in walks1 else 0) for walk in all_walks ]
vector2 = [ (counts_walks2[walk] if walk in walks2 else 0) for walk in all_walks ]
kernel = np.dot(vector1, vector2)

return kernel

# this method finds walks repetitively; it could be made faster.
def find_all_walks_until_length(G, length, node_label = 'atom', edge_label = 'bond_type', labeled = True):
"""Find all walks with a certain maximum length in a graph. A recursive depth first search is applied.

Parameters
----------
G : NetworkX graphs
The graph in which walks are searched.
length : integer
The maximum length of walks.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.

Return
------
walk : list
List of walks retrieved, where for unlabeled graphs, each walk is represented by a list of nodes; while for labeled graphs, each walk is represented by a string consisting of the labels of nodes and edges on that walk.
"""
all_walks = []
for i in range(0, length + 1):
new_walks = find_all_walks(G, i)
if new_walks == []:
break
all_walks.extend(new_walks)

if labeled == True: # convert paths to strings
walk_strs = []
for walk in all_walks:
strlist = [ G.node[walk[i]][node_label] + G[walk[i]][walk[i + 1]][edge_label] for i in range(len(walk) - 1) ]
walk_strs.append(''.join(strlist) + G.node[walk[-1]][node_label])

return walk_strs

return all_walks


def find_walks(G, source_node, length):
"""Find all walks with a certain length those start from a source node. A recursive depth first search is applied.

Parameters
----------
G : NetworkX graphs
The graph in which walks are searched.
source_node : integer
The index of the node from which all walks start.
length : integer
The length of walks.

Return
------
walk : list of list
List of walks retrieved, where each walk is represented by a list of nodes.
"""
return [[source_node]] if length == 0 else \
[ [source_node] + walk for neighbor in G[source_node] \
for walk in find_walks(G, neighbor, length - 1) ]


def find_all_walks(G, length):
"""Find all walks with a certain length in a graph. A recursive depth first search is applied.

Parameters
----------
G : NetworkX graphs
The graph in which walks are searched.
length : integer
The length of walks.

Return
------
walk : list of list
List of walks retrieved, where each walk is represented by a list of nodes.
"""
all_walks = []
for node in G:
all_walks.extend(find_walks(G, node, length))

### The following process is not carried out according to the original article
# all_paths_r = [ path[::-1] for path in all_paths ]


# # For each path, two presentation are retrieved from its two extremities. Remove one of them.
# for idx, path in enumerate(all_paths[:-1]):
# for path2 in all_paths_r[idx+1::]:
# if path == path2:
# all_paths[idx] = []
# break

# return list(filter(lambda a: a != [], all_paths))
return all_walks
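
A small usage sketch of the new kernel on two toy labeled graphs; the graphs and parameter values are illustrative, with labels chosen to match the 'atom'/'bond_type' defaults above. Note that the module relies on the older `G.node` accessor, so it targets the NetworkX versions current at the time of this commit.

import networkx as nx
from pygraph.kernels.untilnWalkKernel import untilnwalkkernel

# two toy molecules with 'atom' node labels and 'bond_type' edge labels
g1 = nx.Graph()
g1.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'}), (2, {'atom': 'C'})])
g1.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'})])

g2 = nx.Graph()
g2.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'})])
g2.add_edge(0, 1, bond_type='1')

# 2x2 kernel matrix counting common labeled walks up to length 3
Kmatrix, run_time = untilnwalkkernel(g1, g2, node_label='atom', edge_label='bond_type', labeled=True, n=3)
print(Kmatrix)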

BIN
pygraph/utils/__pycache__/model_selection_precomputed.cpython-35.pyc View File


BIN
pygraph/utils/__pycache__/utils.cpython-35.pyc View File


+ 213
- 0
pygraph/utils/model_selection_precomputed.py View File

@@ -0,0 +1,213 @@
def model_selection_for_precomputed_kernel(datafile, estimator, param_grid_precomputed, param_grid, model_type, NUM_TRIALS = 30, datafile_y = ''):
"""Perform model selection, fitting and testing for precomputed kernels using nested cv. Print out neccessary data during the process then finally the results.

Parameters
----------
datafile : string
Path of dataset file.
estimator : function
kernel function used to estimate. This function needs to return a gram matrix.
param_grid_precomputed : dictionary
Dictionary with names (string) of parameters used to calculate gram matrices as keys and lists of parameter settings to try as values. This enables searching over any sequence of parameter settings.
param_grid : dictionary
Dictionary with names (string) of parameters used as penalties as keys and lists of parameter settings to try as values. This enables searching over any sequence of parameter settings.
model_type : string
Type of the problem, either regression or classification.
NUM_TRIALS : integer
Number of random trials of outer cv loop. The default is 30.
datafile_y : string
Path of file storing y data. This parameter is optional depending on the given dataset file.

Examples
--------
>>> import numpy as np
>>> import sys
>>> sys.path.insert(0, "../")
>>> from pygraph.utils.model_selection_precomputed import model_selection_for_precomputed_kernel
>>> from pygraph.kernels.weisfeilerLehmanKernel import weisfeilerlehmankernel
>>>
>>> datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'
>>> estimator = weisfeilerlehmankernel
>>> param_grid_precomputed = {'height': [0,1,2,3,4,5,6,7,8,9,10], 'base_kernel': ['subtree']}
>>> param_grid = {"alpha": np.logspace(-2, 2, num = 10, base = 10)}
>>>
>>> model_selection_for_precomputed_kernel(datafile, estimator, param_grid_precomputed, param_grid, 'regression')
"""
import numpy as np
from matplotlib import pyplot as plt

from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import KFold, train_test_split, ParameterGrid

import sys
sys.path.insert(0, "../")
from pygraph.utils.graphfiles import loadDataset

from tqdm import tqdm

# setup the model type
model_type = model_type.lower()
if model_type != 'regression' and model_type != 'classification':
raise Exception('The model type is incorrect! Please choose from regression or classification.')
print()
print('--- This is a %s problem ---' % model_type)

# Load the dataset
print()
print('1. Loading dataset from file...')
dataset, y = loadDataset(datafile, filename_y = datafile_y)

# Grid of parameters with a discrete number of values for each.
param_list_precomputed = list(ParameterGrid(param_grid_precomputed))
param_list = list(ParameterGrid(param_grid))

# Arrays to store scores
train_pref = np.zeros((NUM_TRIALS, len(param_list_precomputed), len(param_list)))
val_pref = np.zeros((NUM_TRIALS, len(param_list_precomputed), len(param_list)))
test_pref = np.zeros((NUM_TRIALS, len(param_list_precomputed), len(param_list)))

gram_matrices = [] # a list to store gram matrices for all param_grid_precomputed
gram_matrix_time = [] # a list to store time to calculate gram matrices

# calculate all gram matrices
print()
print('2. Calculating gram matrices. This could take a while...')
for params_out in param_list_precomputed:
Kmatrix, current_run_time = estimator(dataset, **params_out)
print()
print('gram matrix with parameters', params_out, 'is: ')
print(Kmatrix)
plt.matshow(Kmatrix)
plt.show()
# plt.savefig('../../notebooks/gram_matrix_figs/{}_{}'.format(estimator.__name__, params_out))
gram_matrices.append(Kmatrix)
gram_matrix_time.append(current_run_time)

print()
print('3. Fitting and predicting using nested cross validation. This could really take a while...')
# Loop for each trial
pbar = tqdm(total = NUM_TRIALS * len(param_list_precomputed) * len(param_list),
desc = 'calculate performance', file=sys.stdout)
for trial in range(NUM_TRIALS): # Test set level
# loop for each outer param tuple
for index_out, params_out in enumerate(param_list_precomputed):
# split gram matrix and y to app and test sets.
X_app, X_test, y_app, y_test = train_test_split(gram_matrices[index_out], y, test_size=0.1)
split_index_app = [y.index(y_i) for y_i in y_app if y_i in y]
split_index_test = [y.index(y_i) for y_i in y_test if y_i in y]
X_app = X_app[:,split_index_app]
X_test = X_test[:,split_index_app]
y_app = np.array(y_app)
y_test = np.array(y_test)

# loop for each inner param tuple
for index_in, params_in in enumerate(param_list):
inner_cv = KFold(n_splits=10, shuffle=True, random_state=trial)
current_train_perf = []
current_valid_perf = []
current_test_perf = []

# For regression use the Kernel Ridge method
if model_type == 'regression':
KR = KernelRidge(kernel = 'precomputed', **params_in)
# loop for each split on validation set level
for train_index, valid_index in inner_cv.split(X_app): # validation set level
KR.fit(X_app[train_index,:][:,train_index], y_app[train_index])

# predict on the train, validation and test set
y_pred_train = KR.predict(X_app[train_index,:][:,train_index])
y_pred_valid = KR.predict(X_app[valid_index,:][:,train_index])
y_pred_test = KR.predict(X_test[:,train_index])

# root mean squared errors
current_train_perf.append(np.sqrt(mean_squared_error(y_app[train_index], y_pred_train)))
current_valid_perf.append(np.sqrt(mean_squared_error(y_app[valid_index], y_pred_valid)))
current_test_perf.append(np.sqrt(mean_squared_error(y_test, y_pred_test)))
# For classification use SVM
else:
KR = SVC(kernel = 'precomputed', **params_in)
# loop for each split on validation set level
for train_index, valid_index in inner_cv.split(X_app): # validation set level
KR.fit(X_app[train_index,:][:,train_index], y_app[train_index])

# predict on the train, validation and test set
y_pred_train = KR.predict(X_app[train_index,:][:,train_index])
y_pred_valid = KR.predict(X_app[valid_index,:][:,train_index])
y_pred_test = KR.predict(X_test[:,train_index])

# accuracies
current_train_perf.append(accuracy_score(y_app[train_index], y_pred_train))
current_valid_perf.append(accuracy_score(y_app[valid_index], y_pred_valid))
current_test_perf.append(accuracy_score(y_test, y_pred_test))

# average performance on inner splits
train_pref[trial][index_out][index_in] = np.mean(current_train_perf)
val_pref[trial][index_out][index_in] = np.mean(current_valid_perf)
test_pref[trial][index_out][index_in] = np.mean(current_test_perf)
pbar.update(1)
pbar.clear()

print()
print('4. Getting final performances...')
# averages and confidences of performances on outer trials for each combination of parameters
average_train_scores = np.mean(train_pref, axis=0)
average_val_scores = np.mean(val_pref, axis=0)
average_perf_scores = np.mean(test_pref, axis=0)
std_train_scores = np.std(train_pref, axis=0, ddof=1) # sample std is used here
std_val_scores = np.std(val_pref, axis=0, ddof=1)
std_perf_scores = np.std(test_pref, axis=0, ddof=1)

if model_type == 'regression':
best_val_perf = np.amin(average_val_scores)
else:
best_val_perf = np.amax(average_val_scores)
print()
best_params_index = np.where(average_val_scores == best_val_perf)
best_params_out = [param_list_precomputed[i] for i in best_params_index[0]]
best_params_in = [param_list[i] for i in best_params_index[1]]
# print('best_params_index: ', best_params_index)
print('best_params_out: ', best_params_out)
print('best_params_in: ', best_params_in)
print('best_val_perf: ', best_val_perf)

# below: only one performance is reported; multiple best parameter combinations might exist
best_val_std = std_val_scores[best_params_index[0][0]][best_params_index[1][0]]
print('best_val_std: ', best_val_std)

final_performance = average_perf_scores[best_params_index[0][0]][best_params_index[1][0]]
final_confidence = std_perf_scores[best_params_index[0][0]][best_params_index[1][0]]
print('final_performance: ', final_performance)
print('final_confidence: ', final_confidence)
train_performance = average_train_scores[best_params_index[0][0]][best_params_index[1][0]]
train_std = std_train_scores[best_params_index[0][0]][best_params_index[1][0]]
print('train_performance: ', train_performance)
print('train_std: ', train_std)

best_gram_matrix_time = gram_matrix_time[best_params_index[0][0]]
print('time to calculate gram matrix: ', best_gram_matrix_time, 's')

# print out as table.
from collections import OrderedDict
from tabulate import tabulate
table_dict = {}
if model_type == 'regression':
for param_in in param_list:
param_in['alpha'] = '{:.2e}'.format(param_in['alpha'])
else:
for param_in in param_list:
param_in['C'] = '{:.2e}'.format(param_in['C'])
table_dict['params'] = [ {**param_out, **param_in} for param_in in param_list for param_out in param_list_precomputed ]
table_dict['gram_matrix_time'] = [ '{:.2f}'.format(gram_matrix_time[index_out])
for param_in in param_list for index_out, _ in enumerate(param_list_precomputed) ]
table_dict['valid_perf'] = [ '{:.2f}±{:.2f}'.format(average_val_scores[index_out][index_in], std_val_scores[index_out][index_in])
for index_in, _ in enumerate(param_list) for index_out, _ in enumerate(param_list_precomputed) ]
table_dict['test_perf'] = [ '{:.2f}±{:.2f}'.format(average_perf_scores[index_out][index_in], std_perf_scores[index_out][index_in])
for index_in, _ in enumerate(param_list) for index_out, _ in enumerate(param_list_precomputed) ]
table_dict['train_perf'] = [ '{:.2f}±{:.2f}'.format(average_train_scores[index_out][index_in], std_train_scores[index_out][index_in])
for index_in, _ in enumerate(param_list) for index_out, _ in enumerate(param_list_precomputed) ]
keyorder = ['params', 'train_perf', 'valid_perf', 'test_perf', 'gram_matrix_time']
print()
print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i:keyorder.index(i[0]))), headers='keys'))
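
The key pattern in the nested CV above is that, with a precomputed kernel, a sample's "features" are its kernel values against the training samples, so the train/validation blocks are obtained by slicing the gram matrix on both axes. A minimal self-contained illustration with toy data (not part of the module):

import numpy as np
from sklearn.kernel_ridge import KernelRidge

# toy data: a valid gram matrix built from random features
rng = np.random.RandomState(0)
X = rng.rand(6, 3)
K = X @ X.T                      # precomputed kernel matrix (6 x 6)
y = rng.rand(6)

train_index = np.array([0, 1, 2, 3])
valid_index = np.array([4, 5])

model = KernelRidge(kernel='precomputed', alpha=1.0)
# rows: samples to fit on; columns: kernel values against the training samples
model.fit(K[train_index, :][:, train_index], y[train_index])
# validation rows, but still only the training columns
y_pred_valid = model.predict(K[valid_index, :][:, train_index])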

+ 309
- 308
pygraph/utils/utils.py View File

@@ -65,311 +65,312 @@ def floydTransformation(G, edge_weight = 'bond_type'):



def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, trials = 100, splits = 10, alpha_grid = None, C_grid = None, hyper_name = '', hyper_range = [1], normalize = False, datafile_y = '', model_type = 'regression'):
"""Perform training and testing for a kernel method. Print out neccessary data during the process then finally the results.

Parameters
----------
datafile : string
Path of dataset file.
kernel_file_path : string
Path of the directory to save results.
kernel_func : function
kernel function to use in the process.
kernel_para : dictionary
Keyword arguments passed to kernel_func.
trials: integer
Number of trials for hyperparameter random search, where hyperparameter stands for penalty parameter for now. The default is 100.
splits: integer
Number of splits of the dataset, i.e. the number of times the training and testing procedure is run. The final means and stds are averaged over all splits. The default is 10.
alpha_grid : ndarray
Penalty parameter in kernel ridge regression. Corresponds to (2*C)^-1 in other linear models such as LogisticRegression.
C_grid : ndarray
Penalty parameter C of the error term in kernel SVM.
hyper_name : string
Name of the hyperparameter.
hyper_range : list
Range of the hyperparameter.
normalize : boolean
Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.
model_type : string
Type of the problem, regression or classification.

References
----------
[1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1

Examples
--------
>>> import sys
>>> sys.path.insert(0, "../")
>>> from pygraph.utils.utils import kernel_train_test
>>> from pygraph.kernels.treeletKernel import treeletkernel
>>> datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'
>>> kernel_file_path = 'kernelmatrices_path_acyclic/'
>>> kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True)
>>> kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True)
"""
import os
import pathlib
from collections import OrderedDict
from tabulate import tabulate
from .graphfiles import loadDataset

# setup the parameters
model_type = model_type.lower()
if model_type != 'regression' and model_type != 'classification':
raise Exception('The model type is incorrect! Please choose from regression or classification.')
print('\n --- This is a %s problem ---' % model_type)

alpha_grid = np.logspace(-10, 10, num = trials, base = 10) if alpha_grid == None else alpha_grid # corresponds to (2*C)^-1 in other linear models such as LogisticRegression
C_grid = np.logspace(-10, 10, num = trials, base = 10) if C_grid == None else C_grid

if not os.path.exists(kernel_file_path):
os.makedirs(kernel_file_path)

train_means_list = []
train_stds_list = []
test_means_list = []
test_stds_list = []
kernel_time_list = []

for hyper_para in hyper_range:
print('' if hyper_name == '' else '\n\n #--- calculating kernel matrix when', hyper_name, '=', hyper_para, '---#')

print('\n Loading dataset from file...')
dataset, y = loadDataset(datafile, filename_y = datafile_y)
y = np.array(y)
# normalize labels and transform non-numerical labels to numerical labels.
if model_type == 'classification':
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)
# print(y)

# save kernel matrices to files / read kernel matrices from files
kernel_file = kernel_file_path + 'km.ds'
path = pathlib.Path(kernel_file)
# get train set kernel matrix
if path.is_file():
print('\n Loading the kernel matrix from file...')
Kmatrix = np.loadtxt(kernel_file)
print(Kmatrix)
else:
print('\n Calculating kernel matrix, this could take a while...')
if hyper_name != '':
kernel_para[hyper_name] = hyper_para
Kmatrix, run_time = kernel_func(dataset, **kernel_para)
kernel_time_list.append(run_time)
print(Kmatrix)
# print('\n Saving kernel matrix to file...')
# np.savetxt(kernel_file, Kmatrix)

"""
- Here starts the main program
- First we permute the data, then for each split we evaluate corresponding performances
- In the end, the performances are averaged over the test sets
"""

train_mean, train_std, test_mean, test_std = \
split_train_test(Kmatrix, y, alpha_grid, C_grid, splits, trials, model_type, normalize = normalize)

train_means_list.append(train_mean)
train_stds_list.append(train_std)
test_means_list.append(test_mean)
test_stds_list.append(test_std)

print('\n')
if model_type == 'regression':
table_dict = {'rmse_test': test_means_list, 'std_test': test_stds_list, \
'rmse_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
if hyper_name == '':
keyorder = ['rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
else:
table_dict[hyper_name] = hyper_range
keyorder = [hyper_name, 'rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
elif model_type == 'classification':
table_dict = {'accur_test': test_means_list, 'std_test': test_stds_list, \
'accur_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
if hyper_name == '':
keyorder = ['accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
else:
table_dict[hyper_name] = hyper_range
keyorder = [hyper_name, 'accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i:keyorder.index(i[0]))), headers='keys'))



def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, trials = 100, model_type = 'regression', normalize = False):
"""Split dataset to training and testing splits, train and test. Print out and return the results.

Parameters
----------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the kernel between 2 graphs.
train_target : ndarray
train target.
alpha_grid : ndarray
Penalty parameter in kernel ridge regression. Corresponds to (2*C)^-1 in other linear models such as LogisticRegression.
C_grid : ndarray
Penalty parameter C of the error term in kernel SVM.
splits : integer
Number of splits of the dataset, i.e. the number of times the training and testing procedure is run. The final means and stds are averaged over all splits. The default is 10.
trials : integer
Number of trials for hyperparameters random search. The final means and stds are the ones in the same trial with the best test mean. The default is 100.
model_type : string
Determine whether it is a regression or classification problem. The default is 'regression'.
normalize : boolean
Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.

Return
------
train_mean : float
mean of train accuracies in the same trial with the best test mean.
train_std : float
mean of train stds in the same trial with the best test mean.
test_mean : float
mean of the best tests.
test_std : float
mean of test stds in the same trial with the best test mean.

References
----------
[1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1
"""
import random
from sklearn.kernel_ridge import KernelRidge # 0.17
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn import svm

datasize = len(train_target)
random.seed(20) # Set the seed for uniform parameter distribution

# Initialize the performance of the best parameter trial on train with the corresponding performance on test
train_split = []
test_split = []

# For each split of the data
print('\n Starting to calculate accuracy/rmse...')
import sys
pbar = tqdm(total = splits * trials, desc = 'calculate performance', file=sys.stdout)
for j in range(10, 10 + splits):
# print('\n Starting split %d...' % j)

# Set the random set for data permutation
random_state = int(j)
np.random.seed(random_state)
idx_perm = np.random.permutation(datasize)

# Permute the data
y_perm = train_target[idx_perm] # targets permutation
Kmatrix_perm = Kmatrix[:, idx_perm] # inputs permutation
Kmatrix_perm = Kmatrix_perm[idx_perm, :] # inputs permutation

# Set the training, test
# Note: the percentage can be set up by the user
num_train = int((datasize * 90) / 100) # 90% (of entire dataset) for training
num_test = datasize - num_train # 10% (of entire dataset) for test

# Split the kernel matrix
Kmatrix_train = Kmatrix_perm[0:num_train, 0:num_train]
Kmatrix_test = Kmatrix_perm[num_train:datasize, 0:num_train]

# Split the targets
y_train = y_perm[0:num_train]


# Normalization step (for real valued targets only)
if normalize == True and model_type == 'regression':
y_train_mean = np.mean(y_train)
y_train_std = np.std(y_train)
y_train_norm = (y_train - y_train_mean) / float(y_train_std)

y_test = y_perm[num_train:datasize]

# Record the performance for each parameter trial respectively on train and test set
perf_all_train = []
perf_all_test = []

# For each parameter trial
for i in range(trials):
# For regression use the Kernel Ridge method
if model_type == 'regression':
# Fit the kernel ridge model
KR = KernelRidge(kernel = 'precomputed', alpha = alpha_grid[i])
KR.fit(Kmatrix_train, y_train if normalize == False else y_train_norm)

# predict on the train and test set
y_pred_train = KR.predict(Kmatrix_train)
y_pred_test = KR.predict(Kmatrix_test)

# adjust prediction: needed because the training targets have been normalized
if normalize == True:
y_pred_train = y_pred_train * float(y_train_std) + y_train_mean
y_pred_test = y_pred_test * float(y_train_std) + y_train_mean

# root mean squared error on train set
accuracy_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
perf_all_train.append(accuracy_train)
# root mean squared error on test set
accuracy_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
perf_all_test.append(accuracy_test)

# For classification use SVM
elif model_type == 'classification':
KR = svm.SVC(kernel = 'precomputed', C = C_grid[i])
KR.fit(Kmatrix_train, y_train)

# predict on the train and test set
y_pred_train = KR.predict(Kmatrix_train)
y_pred_test = KR.predict(Kmatrix_test)

# accuracy on train set
accuracy_train = accuracy_score(y_train, y_pred_train)
perf_all_train.append(accuracy_train)
# accuracy on test set
accuracy_test = accuracy_score(y_test, y_pred_test)
perf_all_test.append(accuracy_test)

pbar.update(1)

# --- FIND THE OPTIMAL PARAMETERS --- #
# For regression: minimise the mean squared error
if model_type == 'regression':

# get optimal parameter on test (argmin mean squared error)
min_idx = np.argmin(perf_all_test)
alpha_opt = alpha_grid[min_idx]

# corresponding performance on train and test set for the same parameter
perf_train_opt = perf_all_train[min_idx]
perf_test_opt = perf_all_test[min_idx]

# For classification: maximise the accuracy
if model_type == 'classification':
# get optimal parameter on test (argmax accuracy)
max_idx = np.argmax(perf_all_test)
C_opt = C_grid[max_idx]

# corresponding performance on train and test set for the same parameter
perf_train_opt = perf_all_train[max_idx]
perf_test_opt = perf_all_test[max_idx]


# append the corresponding performance on the train and test set
train_split.append(perf_train_opt)
test_split.append(perf_test_opt)

# average the results
# mean of the train and test performances over the splits
train_mean = np.mean(np.asarray(train_split))
test_mean = np.mean(np.asarray(test_split))
# std deviation of the train and test over the splits
train_std = np.std(np.asarray(train_split))
test_std = np.std(np.asarray(test_split))

print('\n Mean performance on train set: %3f' % train_mean)
print('With standard deviation: %3f' % train_std)
print('\n Mean performance on test set: %3f' % test_mean)
print('With standard deviation: %3f' % test_std)

return train_mean, train_std, test_mean, test_std
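
For clarity, the normalization step used above for regression standardizes the training targets before fitting and maps predictions back to the original scale afterwards. A tiny stand-alone illustration of that adjustment, with toy numbers and a hypothetical model output (not part of the module):

import numpy as np

y_train = np.array([100.0, 150.0, 200.0, 250.0])
y_train_mean = np.mean(y_train)
y_train_std = np.std(y_train)

# targets are standardized before fitting the kernel ridge model ...
y_train_norm = (y_train - y_train_mean) / float(y_train_std)

# ... so predictions made in the normalized space are mapped back afterwards
y_pred_norm = np.array([-1.2, 0.1, 0.9, 1.3])   # hypothetical model output
y_pred = y_pred_norm * float(y_train_std) + y_train_mean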
# def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, trials = 100, splits = 10, alpha_grid = None, C_grid = None, hyper_name = '', hyper_range = [1], normalize = False, datafile_y = '', model_type = 'regression'):
# """Perform training and testing for a kernel method. Print out neccessary data during the process then finally the results.

# Parameters
# ----------
# datafile : string
# Path of dataset file.
# kernel_file_path : string
# Path of the directory to save results.
# kernel_func : function
# kernel function to use in the process.
# kernel_para : dictionary
# Keyword arguments passed to kernel_func.
# trials: integer
# Number of trials for hyperparameter random search, where hyperparameter stands for penalty parameter for now. The default is 100.
# splits: integer
# Number of splits of dataset. Times of training and testing procedure processed. The final means and stds are the average of the results of all the splits. The default is 10.
# alpha_grid : ndarray
# Penalty parameter in kernel ridge regression. Corresponds to (2*C)^-1 in other linear models such as LogisticRegression.
# C_grid : ndarray
# Penalty parameter C of the error term in kernel SVM.
# hyper_name : string
# Name of the hyperparameter.
# hyper_range : list
# Range of the hyperparameter.
# normalize : string
# Determine whether or not that normalization is performed. Only works when model_type == 'regression'. The default is False.
# model_type : string
# Type of the problem, regression or classification.

# References
# ----------
# [1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1

# Examples
# --------
# >>> import sys
# >>> sys.path.insert(0, "../")
# >>> from pygraph.utils.utils import kernel_train_test
# >>> from pygraph.kernels.treeletKernel import treeletkernel
# >>> datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'
# >>> kernel_file_path = 'kernelmatrices_path_acyclic/'
# >>> kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True)
# >>> kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True)
# """
# import os
# import pathlib
# from collections import OrderedDict
# from tabulate import tabulate
# from .graphfiles import loadDataset

# # setup the parameters
# model_type = model_type.lower()
# if model_type != 'regression' and model_type != 'classification':
# raise Exception('The model type is incorrect! Please choose from regression or classification.')
# print('\n --- This is a %s problem ---' % model_type)

# alpha_grid = np.logspace(-10, 10, num = trials, base = 10) if alpha_grid == None else alpha_grid # corresponds to (2*C)^-1 in other linear models such as LogisticRegression
# C_grid = np.logspace(-10, 10, num = trials, base = 10) if C_grid == None else C_grid

# if not os.path.exists(kernel_file_path):
# os.makedirs(kernel_file_path)

# train_means_list = []
# train_stds_list = []
# test_means_list = []
# test_stds_list = []
# kernel_time_list = []

# for hyper_para in hyper_range:
# print('' if hyper_name == '' else '\n\n #--- calculating kernel matrix when', hyper_name, '=', hyper_para, '---#')

# print('\n Loading dataset from file...')
# dataset, y = loadDataset(datafile, filename_y = datafile_y)
# y = np.array(y)
# # normalize labels and transform non-numerical labels to numerical labels.
# if model_type == 'classification':
# from sklearn.preprocessing import LabelEncoder
# y = LabelEncoder().fit_transform(y)
# # print(y)

# # save kernel matrices to files / read kernel matrices from files
# kernel_file = kernel_file_path + 'km.ds'
# path = pathlib.Path(kernel_file)
# # get train set kernel matrix
# if path.is_file():
# print('\n Loading the kernel matrix from file...')
# Kmatrix = np.loadtxt(kernel_file)
# print(Kmatrix)
# else:
# print('\n Calculating kernel matrix, this could take a while...')
# if hyper_name != '':
# kernel_para[hyper_name] = hyper_para
# Kmatrix, run_time = kernel_func(dataset, **kernel_para)
# kernel_time_list.append(run_time)
# import matplotlib.pyplot as plt
# plt.matshow(Kmatrix)
# # print('\n Saving kernel matrix to file...')
# # np.savetxt(kernel_file, Kmatrix)

# """
# - Here starts the main program
# - First we permute the data, then for each split we evaluate corresponding performances
# - In the end, the performances are averaged over the test sets
# """

# train_mean, train_std, test_mean, test_std = \
# split_train_test(Kmatrix, y, alpha_grid, C_grid, splits, trials, model_type, normalize = normalize)

# train_means_list.append(train_mean)
# train_stds_list.append(train_std)
# test_means_list.append(test_mean)
# test_stds_list.append(test_std)

# print('\n')
# if model_type == 'regression':
# table_dict = {'rmse_test': test_means_list, 'std_test': test_stds_list, \
# 'rmse_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
# if hyper_name == '':
# keyorder = ['rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
# else:
# table_dict[hyper_name] = hyper_range
# keyorder = [hyper_name, 'rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
# elif model_type == 'classification':
# table_dict = {'accur_test': test_means_list, 'std_test': test_stds_list, \
# 'accur_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
# if hyper_name == '':
# keyorder = ['accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
# else:
# table_dict[hyper_name] = hyper_range
# keyorder = [hyper_name, 'accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
# print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i:keyorder.index(i[0]))), headers='keys'))



# def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, trials = 100, model_type = 'regression', normalize = False):
# """Split dataset to training and testing splits, train and test. Print out and return the results.

# Parameters
# ----------
# Kmatrix : Numpy matrix
# Kernel matrix, each element of which is the kernel between 2 graphs.
# train_target : ndarray
# train target.
# alpha_grid : ndarray
# Penalty parameter in kernel ridge regression. Corresponds to (2*C)^-1 in other linear models such as LogisticRegression.
# C_grid : ndarray
# Penalty parameter C of the error term in kernel SVM.
# splits : integer
# Number of splits of dataset. Times of training and testing procedure processed. The final means and stds are the average of the results of all the splits. The default is 10.
# trials : integer
# Number of trials for hyperparameters random search. The final means and stds are the ones in the same trial with the best test mean. The default is 100.
# model_type : string
# Determine whether it is a regression or classification problem. The default is 'regression'.
# normalize : string
# Determine whether or not that normalization is performed. Only works when model_type == 'regression'. The default is False.

# Return
# ------
# train_mean : float
# mean of train accuracies in the same trial with the best test mean.
# train_std : float
# mean of train stds in the same trial with the best test mean.
# test_mean : float
# mean of the best tests.
# test_std : float
# mean of test stds in the same trial with the best test mean.

# References
# ----------
# [1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1
# """
# import random
# from sklearn.kernel_ridge import KernelRidge # 0.17
# from sklearn.metrics import accuracy_score, mean_squared_error
# from sklearn import svm

# datasize = len(train_target)
# random.seed(20) # Set the seed for uniform parameter distribution

# # Initialize the performance of the best parameter trial on train with the corresponding performance on test
# train_split = []
# test_split = []

# # For each split of the data
# print('\n Starting calculate accuracy/rmse...')
# import sys
# pbar = tqdm(total = splits * trials, desc = 'calculate performance', file=sys.stdout)
# for j in range(10, 10 + splits):
# # print('\n Starting split %d...' % j)

# # Set the random set for data permutation
# random_state = int(j)
# np.random.seed(random_state)
# idx_perm = np.random.permutation(datasize)

# # Permute the data
# y_perm = train_target[idx_perm] # targets permutation
# Kmatrix_perm = Kmatrix[:, idx_perm] # inputs permutation
# Kmatrix_perm = Kmatrix_perm[idx_perm, :] # inputs permutation

# # Set the training, test
# # Note: the percentage can be set up by the user
# num_train = int((datasize * 90) / 100) # 90% (of entire dataset) for training
# num_test = datasize - num_train # 10% (of entire dataset) for test

# # Split the kernel matrix
# Kmatrix_train = Kmatrix_perm[0:num_train, 0:num_train]
# Kmatrix_test = Kmatrix_perm[num_train:datasize, 0:num_train]

# # Split the targets
# y_train = y_perm[0:num_train]


# # Normalization step (for real valued targets only)
# if normalize == True and model_type == 'regression':
# y_train_mean = np.mean(y_train)
# y_train_std = np.std(y_train)
# y_train_norm = (y_train - y_train_mean) / float(y_train_std)

# y_test = y_perm[num_train:datasize]

# # Record the performance for each parameter trial respectively on train and test set
# perf_all_train = []
# perf_all_test = []

# # For each parameter trial
# for i in range(trials):
# # For regression use the Kernel Ridge method
# if model_type == 'regression':
# # Fit the kernel ridge model
# KR = KernelRidge(kernel = 'precomputed', alpha = alpha_grid[i])
# KR.fit(Kmatrix_train, y_train if normalize == False else y_train_norm)

# # predict on the train and test set
# y_pred_train = KR.predict(Kmatrix_train)
# y_pred_test = KR.predict(Kmatrix_test)

# # adjust prediction: needed because the training targets have been normalized
# if normalize == True:
# y_pred_train = y_pred_train * float(y_train_std) + y_train_mean
# y_pred_test = y_pred_test * float(y_train_std) + y_train_mean

# # root mean squared error on train set
# accuracy_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
# perf_all_train.append(accuracy_train)
# # root mean squared error on test set
# accuracy_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
# perf_all_test.append(accuracy_test)

# # For classification use SVM
# elif model_type == 'classification':
# KR = svm.SVC(kernel = 'precomputed', C = C_grid[i])
# KR.fit(Kmatrix_train, y_train)

# # predict on the train and test set
# y_pred_train = KR.predict(Kmatrix_train)
# y_pred_test = KR.predict(Kmatrix_test)

# # accuracy on train set
# accuracy_train = accuracy_score(y_train, y_pred_train)
# perf_all_train.append(accuracy_train)
# # accuracy on test set
# accuracy_test = accuracy_score(y_test, y_pred_test)
# perf_all_test.append(accuracy_test)

# pbar.update(1)

# # --- FIND THE OPTIMAL PARAMETERS --- #
# # For regression: minimise the mean squared error
# if model_type == 'regression':

# # get optimal parameter on test (argmin mean squared error)
# min_idx = np.argmin(perf_all_train)
# alpha_opt = alpha_grid[min_idx]

# # corresponding performance on train and test set for the same parameter
# perf_train_opt = perf_all_train[min_idx]
# perf_test_opt = perf_all_test[min_idx]

# # For classification: maximise the accuracy
# if model_type == 'classification':
# # get optimal parameter on test (argmax accuracy)
# max_idx = np.argmax(perf_all_train)
# C_opt = C_grid[max_idx]

# # corresponding performance on train and test set for the same parameter
# perf_train_opt = perf_all_train[max_idx]
# perf_test_opt = perf_all_test[max_idx]


# # append the corresponding performance on the train and test set
# train_split.append(perf_train_opt)
# test_split.append(perf_test_opt)

# # average the results
# # mean of the train and test performances over the splits
# train_mean = np.mean(np.asarray(train_split))
# test_mean = np.mean(np.asarray(test_split))
# # std deviation of the train and test over the splits
# train_std = np.std(np.asarray(train_split))
# test_std = np.std(np.asarray(test_split))

# print('\n Mean performance on train set: %3f' % train_mean)
# print('With standard deviation: %3f' % train_std)
# print('\n Mean performance on test set: %3f' % test_mean)
# print('With standard deviation: %3f' % test_std)

# return train_mean, train_std, test_mean, test_std

+ 0
- 16
run_cyclic.py View File

@@ -1,16 +0,0 @@
import sys
sys.path.insert(0, "../")
from pygraph.utils.utils import kernel_train_test
from pygraph.kernels.cyclicPatternKernel import cyclicpatternkernel

import numpy as np

datafile = '../../../../datasets/NCI-HIV/AIDO99SD.sdf'
datafile_y = '../../../../datasets/NCI-HIV/aids_conc_may04.txt'
kernel_file_path = 'kernelmatrices_path_acyclic/'

kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True)

kernel_train_test(datafile, kernel_file_path, cyclicpatternkernel, kernel_para, \
hyper_name = 'cycle_bound', hyper_range = np.linspace(0, 1000, 21), normalize = False, \
datafile_y = datafile_y, model_type = 'classification')
