	modified:   notebooks/run_cyclicpatternkernel.ipynb
	modified:   notebooks/run_marginalizedkernel_acyclic.ipynb
	modified:   notebooks/run_pathkernel_acyclic.ipynb
	modified:   notebooks/run_spkernel_acyclic.ipynb
	modified:   notebooks/run_treeletkernel_acyclic.ipynb
	modified:   notebooks/run_treepatternkernel.ipynb
	modified:   notebooks/run_untildpathkernel_acyclic.ipynb
	new file:   notebooks/run_untilnwalkkernel.ipynb
	modified:   notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
	modified:   pygraph/kernels/treePatternKernel.py
	modified:   pygraph/kernels/untildPathKernel.py
	new file:   pygraph/kernels/untilnWalkKernel.py
	new file:   pygraph/utils/model_selection_precomputed.py
	modified:   pygraph/utils/utils.py

v0.1
@@ -16,25 +16,24 @@ All kernels except for the cyclic pattern kernel are tested on the dataset Acyclic, whic
The criteria used for prediction are SVM for classification and kernel ridge regression for regression.
For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the parameters that are optimal on the test set and finally report the corresponding performance. The final results correspond to the average of the performances on the test sets.
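A minimal sketch of this protocol for a precomputed kernel matrix (illustrative names; it mirrors what `split_train_test` in `pygraph/utils/utils.py` does):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

def evaluate(Kmatrix, y, alpha_grid, splits=10):
    """For each of `splits` random 90/10 splits, keep the best test RMSE found
    over the parameter grid, then average the results over the splits."""
    rmses = []
    for seed in range(splits):
        # permute the data, then take 90% for training and 10% for testing
        idx = np.random.RandomState(seed).permutation(len(y))
        n_train = int(len(y) * 0.9)
        tr, te = idx[:n_train], idx[n_train:]
        best = np.inf
        for alpha in alpha_grid:
            model = KernelRidge(kernel='precomputed', alpha=alpha)
            model.fit(Kmatrix[np.ix_(tr, tr)], y[tr])
            y_pred = model.predict(Kmatrix[np.ix_(te, tr)])
            best = min(best, np.sqrt(mean_squared_error(y[te], y_pred)))
        rmses.append(best)
    return np.mean(rmses), np.std(rmses)  # the RMSE and STD columns below
```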
| Kernels          | RMSE(℃) | STD(℃) | Parameter         | k_time |
|------------------|:-------:|:------:|------------------:|-------:|
| Shortest path    | 35.19   | 4.50   | -                 | 14.58" |
| Marginalized     | 18.02   | 6.29   | p_quit = 0.1      | 4'19"  |
| Path             | 18.41   | 10.78  | -                 | 29.43" |
| WL subtree       | 7.55    | 2.33   | height = 1        | 0.84"  |
| WL shortest path | 35.16   | 4.50   | height = 2        | 40.24" |
| WL edge          | 33.41   | 4.73   | height = 5        | 5.66"  |
| Treelet          | 8.31    | 3.38   | -                 | 0.50"  |
| Path up to d     | 7.43    | 2.69   | depth = 2         | 0.59"  |
| Tree pattern     | 7.27    | 2.21   | lambda = 1, h = 2 | 37.24" |
| Cyclic pattern   | 0.9     | 0.11   | cycle bound = 100 | 0.31"  |
~~For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the parameters that are optimal on the test set and finally report the corresponding performance. The final results correspond to the average of the performances on the test sets.~~
| Kernels          | train_perf | valid_perf | test_perf  | Parameters                                            | gram_matrix_time |
|------------------|-----------:|-----------:|-----------:|------------------------------------------------------:|-----------------:|
| Shortest path    | 28.65±0.59 | 36.09±0.97 | 36.45±6.63 | 'alpha': '3.55e+01'                                   | 12.67"           |
| Marginalized     | 12.42±0.28 | 18.60±2.02 | 16.51±5.12 | 'p_quit': 0.3, 'alpha': '3.16e-06'                    | 430.42"          |
| Path             | 11.19±0.73 | 23.66±1.74 | 25.04±9.60 | 'alpha': '2.57e-03'                                   | 21.84"           |
| WL subtree       | 6.00±0.27  | 7.59±0.71  | 7.92±2.92  | 'height': 1.0, 'alpha': '1.26e-01'                    | 0.84"            |
| WL shortest path | 28.32±0.63 | 35.99±0.98 | 37.92±5.60 | 'height': 2.0, 'alpha': '1.00e+02'                    | 39.79"           |
| WL edge          | 30.10±0.57 | 35.13±0.78 | 37.70±6.92 | 'height': 4.0, 'alpha': '3.98e+01'                    | 4.35"            |
| Treelet          | 7.38±0.37  | 14.21±0.80 | 15.26±3.65 | 'alpha': '1.58e+00'                                   | 0.49"            |
| Path up to d     | 5.48±0.23  | 10.00±0.83 | 10.73±5.67 | 'depth': 2.0, 'k_func': 'MinMax', 'alpha': '7.94e-02' | 0.57"            |
| Tree pattern     |            |            |            |                                                       |                  |
| Cyclic pattern   | 0.62±0.02  | 0.62±0.02  | 0.57±0.17  | 'cycle_bound': 125.0, 'C': '1.78e-01'                 | 0.33"            |
* RMSE stands for the arithmetic mean of the root mean squared errors on all splits.
* STD stands for the standard deviation of the root mean squared errors on all splits.
* Parameter is the one with which the kernel achieves the best results.
* k_time is the time spent building the kernel matrix.
* The targets of the training data are normalized before calculating the *treelet kernel*.
* Parameters are the ones with which the kernel achieves the best results.
* gram_matrix_time is the time spent building the gram matrix.
* See detailed results in [results.md](pygraph/kernels/results.md).
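The second table was produced with the nested CV routine in `pygraph/utils/model_selection_precomputed.py`; a minimal sketch reproducing the WL subtree row (it mirrors the docstring example of `model_selection_for_precomputed_kernel`):

```python
import numpy as np
from pygraph.utils.model_selection_precomputed import model_selection_for_precomputed_kernel
from pygraph.kernels.weisfeilerLehmanKernel import weisfeilerlehmankernel

datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'
param_grid_precomputed = {'height': [0,1,2,3,4,5,6,7,8,9,10], 'base_kernel': ['subtree']}
param_grid = {'alpha': np.logspace(-2, 2, num = 10, base = 10)}
model_selection_for_precomputed_kernel(datafile, weisfeilerlehmankernel,
                                       param_grid_precomputed, param_grid, 'regression')
```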
## References
@@ -2,6 +2,268 @@
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Automatically created module for IPython interactive environment\n",
      "# Tuning hyper-parameters for precision\n",
      "\n",
      "Best parameters set found on development set:\n",
      "\n",
      "{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "\n",
      "Grid scores on development set:\n",
      "\n",
      "0.986 (+/-0.016) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.959 (+/-0.029) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.988 (+/-0.017) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.982 (+/-0.026) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.988 (+/-0.017) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.982 (+/-0.025) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.988 (+/-0.017) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.982 (+/-0.025) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.975 (+/-0.014) for {'C': 1, 'kernel': 'linear'}\n",
      "0.975 (+/-0.014) for {'C': 10, 'kernel': 'linear'}\n",
      "0.975 (+/-0.014) for {'C': 100, 'kernel': 'linear'}\n",
      "0.975 (+/-0.014) for {'C': 1000, 'kernel': 'linear'}\n",
      "\n",
      "Detailed classification report:\n",
      "\n",
      "The model is trained on the full development set.\n",
      "The scores are computed on the full evaluation set.\n",
      "\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "          0       1.00      1.00      1.00        89\n",
      "          1       0.97      1.00      0.98        90\n",
      "          2       0.99      0.98      0.98        92\n",
      "          3       1.00      0.99      0.99        93\n",
      "          4       1.00      1.00      1.00        76\n",
      "          5       0.99      0.98      0.99       108\n",
      "          6       0.99      1.00      0.99        89\n",
      "          7       0.99      1.00      0.99        78\n",
      "          8       1.00      0.98      0.99        92\n",
      "          9       0.99      0.99      0.99        92\n",
      "\n",
      "avg / total       0.99      0.99      0.99       899\n",
      "\n",
      "\n",
      "# Tuning hyper-parameters for recall\n",
      "\n",
      "Best parameters set found on development set:\n",
      "\n",
      "{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "\n",
      "Grid scores on development set:\n",
      "\n",
      "0.986 (+/-0.019) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.957 (+/-0.029) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.987 (+/-0.019) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.981 (+/-0.028) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.987 (+/-0.019) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.981 (+/-0.026) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.987 (+/-0.019) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.981 (+/-0.026) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.972 (+/-0.012) for {'C': 1, 'kernel': 'linear'}\n",
      "0.972 (+/-0.012) for {'C': 10, 'kernel': 'linear'}\n",
      "0.972 (+/-0.012) for {'C': 100, 'kernel': 'linear'}\n",
      "0.972 (+/-0.012) for {'C': 1000, 'kernel': 'linear'}\n",
      "\n",
      "Detailed classification report:\n",
      "\n",
      "The model is trained on the full development set.\n",
      "The scores are computed on the full evaluation set.\n",
      "\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "          0       1.00      1.00      1.00        89\n",
      "          1       0.97      1.00      0.98        90\n",
      "          2       0.99      0.98      0.98        92\n",
      "          3       1.00      0.99      0.99        93\n",
      "          4       1.00      1.00      1.00        76\n",
      "          5       0.99      0.98      0.99       108\n",
      "          6       0.99      1.00      0.99        89\n",
      "          7       0.99      1.00      0.99        78\n",
      "          8       1.00      0.98      0.99        92\n",
      "          9       0.99      0.99      0.99        92\n",
      "\n",
      "avg / total       0.99      0.99      0.99       899\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Parameter estimation using grid search with cross-validation\n",
    "from __future__ import print_function\n",
    "\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "print(__doc__)\n",
    "\n",
    "# Loading the Digits dataset\n",
    "digits = datasets.load_digits()\n",
    "\n",
    "# To apply a classifier on this data, we need to flatten the image, to\n",
    "# turn the data into a (samples, feature) matrix:\n",
    "n_samples = len(digits.images)\n",
    "X = digits.images.reshape((n_samples, -1))\n",
    "y = digits.target\n",
    "\n",
    "# Split the dataset in two equal parts\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.5, random_state=0)\n",
    "\n",
    "# Set the parameters by cross-validation\n",
    "tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],\n",
    "                     'C': [1, 10, 100, 1000]},\n",
    "                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]\n",
    "\n",
    "scores = ['precision', 'recall']\n",
    "\n",
    "for score in scores:\n",
    "    print(\"# Tuning hyper-parameters for %s\" % score)\n",
    "    print()\n",
    "\n",
    "    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,\n",
    "                       scoring='%s_macro' % score)\n",
    "    clf.fit(X_train, y_train)\n",
    "\n",
    "    print(\"Best parameters set found on development set:\")\n",
    "    print()\n",
    "    print(clf.best_params_)\n",
    "    print()\n",
    "    print(\"Grid scores on development set:\")\n",
    "    print()\n",
    "    means = clf.cv_results_['mean_test_score']\n",
    "    stds = clf.cv_results_['std_test_score']\n",
    "    for mean, std, params in zip(means, stds, clf.cv_results_['params']):\n",
    "        print(\"%0.3f (+/-%0.03f) for %r\"\n",
    "              % (mean, std * 2, params))\n",
    "    print()\n",
    "\n",
    "    print(\"Detailed classification report:\")\n",
    "    print()\n",
    "    print(\"The model is trained on the full development set.\")\n",
    "    print(\"The scores are computed on the full evaluation set.\")\n",
    "    print()\n",
    "    y_true, y_pred = y_test, clf.predict(X_test)\n",
    "    print(classification_report(y_true, y_pred))\n",
    "    print()\n",
    "\n",
    "# Note the problem is too easy: the hyperparameter plateau is too flat and the\n",
    "# output model is the same for precision and recall with ties in quality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=None, error_score='raise',\n",
       "       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
       "  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n",
       "  max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
       "  tol=0.001, verbose=False),\n",
       "       fit_params=None, iid=True, n_jobs=1,\n",
       "       param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
       "       scoring=None, verbose=0)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import svm, datasets\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "iris = datasets.load_iris()\n",
    "parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}\n",
    "svc = svm.SVC()\n",
    "clf = GridSearchCV(svc, parameters)\n",
    "clf.fit(iris.data, iris.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['mean_fit_time',\n",
       " 'mean_score_time',\n",
       " 'mean_test_score',\n",
       " 'mean_train_score',\n",
       " 'param_C',\n",
       " 'param_kernel',\n",
       " 'params',\n",
       " 'rank_test_score',\n",
       " 'split0_test_score',\n",
       " 'split0_train_score',\n",
       " 'split1_test_score',\n",
       " 'split1_train_score',\n",
       " 'split2_test_score',\n",
       " 'split2_train_score',\n",
       " 'std_fit_time',\n",
       " 'std_score_time',\n",
       " 'std_test_score',\n",
       " 'std_train_score']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted(clf.cv_results_.keys())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_values([array([0.98      , 0.97333333, 0.97333333, 0.98      ]), array([1.        , 0.98039216, 1.        , 0.98039216]), array([0.00030899, 0.00021172, 0.00019932, 0.00017134]), array([1, 3, 3, 1], dtype=int32), array([0.01617914, 0.00902067, 0.03715363, 0.01592466]), masked_array(data=['linear', 'rbf', 'linear', 'rbf'],\n",
       "             mask=[False, False, False, False],\n",
       "       fill_value='?',\n",
       "            dtype=object), array([1., 1., 1., 1.]), array([0.98999802, 0.98336304, 0.97999604, 0.97999604]), array([6.43618303e-05, 6.20771049e-05, 7.16528819e-05, 9.16456815e-06]), array([0.97979798, 0.96969697, 0.95959596, 0.95959596]), [{'kernel': 'linear', 'C': 1}, {'kernel': 'rbf', 'C': 1}, {'kernel': 'linear', 'C': 10}, {'kernel': 'rbf', 'C': 10}], array([0.00036526, 0.00039411, 0.0002923 , 0.00032218]), array([0.00824863, 0.01254825, 0.01649726, 0.01649726]), array([0.97916667, 0.97916667, 1.        , 1.        ]), array([0.99019608, 0.98039216, 0.98039216, 0.98039216]), array([5.54407363e-05, 3.25514857e-05, 7.09833681e-05, 3.70551530e-06]), masked_array(data=[1, 1, 10, 10],\n",
       "             mask=[False, False, False, False],\n",
       "       fill_value='?',\n",
       "            dtype=object), array([0.96078431, 0.96078431, 0.92156863, 0.96078431])])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.cv_results_.values()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
@@ -29,10 +29,12 @@ def treepatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labe
        edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.
    depth : integer
        Depth of search. The maximum length of paths.
    k_func : function
        A kernel function using different notions of fingerprint similarity.
    kernel_type : string
        Type of tree pattern kernel, can be 'untiln', 'size' or 'branching'.
    lmda : float
        Weight that decides whether linear patterns or tree patterns of increasing complexity are favored.
    h : integer
        The upper bound of the height of tree patterns.

    Return
    ------
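    Examples
    --------
    A minimal usage sketch (a hypothetical call; the parameter values mirror the tree pattern row of the results table in the README, and the dataset path follows the project's convention):

    >>> from pygraph.utils.graphfiles import loadDataset
    >>> from pygraph.kernels.treePatternKernel import treepatternkernel
    >>> Gn, y = loadDataset('../../../../datasets/acyclic/Acyclic/dataset_bps.ds')
    >>> Kmatrix, run_time = treepatternkernel(Gn, kernel_type = 'untiln', lmda = 1, h = 2)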
@@ -74,6 +76,12 @@ def _treepatternkernel_do(G1, G2, node_label, edge_label, labeled, kernel_type,
        edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.
    kernel_type : string
        Type of tree pattern kernel, can be 'untiln', 'size' or 'branching'.
    lmda : float
        Weight that decides whether linear patterns or tree patterns of increasing complexity are favored.
    h : integer
        The upper bound of the height of tree patterns.

    Return
    ------
@@ -8,8 +8,6 @@ import pathlib
sys.path.insert(0, "../")
import time
from collections import Counter
import networkx as nx
import numpy as np
@@ -36,8 +34,8 @@ def untildpathkernel(*args, node_label = 'atom', edge_label = 'bond_type', label
    Return
    ------
    Kmatrix/kernel : Numpy matrix/float
        Kernel matrix, each element of which is the path kernel up to d between 2 graphs. / Path kernel up to d between 2 graphs.
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the path kernel up to d between 2 graphs.
    """
    depth = int(depth)
    if len(args) == 1: # for a list of graphs
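        # A minimal usage sketch (a hypothetical call; the 'depth' and 'k_func'
        # values mirror the path-up-to-d row of the results table in the README):
        #   Kmatrix, run_time = untildpathkernel(Gn, depth = 2, k_func = 'MinMax')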
@@ -0,0 +1,182 @@
"""
@author: linlin
@references: Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. Learning Theory and Kernel Machines, pages 129–143, 2003.
"""

import sys
import pathlib
sys.path.insert(0, "../")
import time
from collections import Counter

import networkx as nx
import numpy as np


def untilnwalkkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, n = 10):
    """Calculate common walk graph kernels up to length n between graphs.

    Parameters
    ----------
    Gn : List of NetworkX graph
        List of graphs between which the kernels are calculated.
    /
    G1, G2 : NetworkX graphs
        2 graphs between which the kernel is calculated.
    node_label : string
        Node attribute used as label. The default node label is atom.
    edge_label : string
        Edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.
    n : integer
        The maximum length of walks.

    Return
    ------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the walk kernel up to n between 2 graphs.
    """
    Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
    Kmatrix = np.zeros((len(Gn), len(Gn)))
    n = int(n)

    start_time = time.time()

    # get all walks of all graphs before calculating kernels, to save time; this may cost a lot of memory for large datasets.
    all_walks = [ find_all_walks_until_length(Gn[i], n, node_label = node_label, edge_label = edge_label, labeled = labeled) for i in range(0, len(Gn)) ]

    for i in range(0, len(Gn)):
        for j in range(i, len(Gn)):
            Kmatrix[i][j] = _untilnwalkkernel_do(all_walks[i], all_walks[j], node_label = node_label, edge_label = edge_label, labeled = labeled)
            Kmatrix[j][i] = Kmatrix[i][j]

    run_time = time.time() - start_time
    print("\n --- kernel matrix of walk kernel up to %d of size %d built in %s seconds ---" % (n, len(Gn), run_time))

    return Kmatrix, run_time
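
# A minimal usage sketch (the dataset path is an assumption, following the
# convention used elsewhere in this project):
#   from pygraph.utils.graphfiles import loadDataset
#   Gn, y = loadDataset('../../../../datasets/acyclic/Acyclic/dataset_bps.ds')
#   Kmatrix, run_time = untilnwalkkernel(Gn, n = 2)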
def _untilnwalkkernel_do(walks1, walks2, node_label = 'atom', edge_label = 'bond_type', labeled = True):
    """Calculate a walk kernel up to n between 2 graphs.

    Parameters
    ----------
    walks1, walks2 : list
        Lists of walks in the 2 graphs, where for unlabeled graphs each walk is represented by a list of nodes, while for labeled graphs each walk is represented by a string consisting of the labels of the nodes and edges on that walk.
    node_label : string
        Node attribute used as label. The default node label is atom.
    edge_label : string
        Edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.

    Return
    ------
    kernel : float
        Walk kernel up to n between 2 graphs.
    """
    counts_walks1 = dict(Counter(walks1))
    counts_walks2 = dict(Counter(walks2))
    all_walks = list(set(walks1 + walks2))

    vector1 = [ counts_walks1.get(walk, 0) for walk in all_walks ]
    vector2 = [ counts_walks2.get(walk, 0) for walk in all_walks ]
    kernel = np.dot(vector1, vector2)

    return kernel
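
# A toy sketch of the counting above, with walks given directly as label
# strings: walk 'CO' matches 1 * 1 times and walk 'C' matches 1 * 2 times, so
#   _untilnwalkkernel_do(['CO', 'OC', 'C', 'O'], ['CO', 'C', 'C'])
# returns 1 + 2 = 3.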
# this method finds walks repetitively; it could be made faster.
def find_all_walks_until_length(G, length, node_label = 'atom', edge_label = 'bond_type', labeled = True):
    """Find all walks up to a certain maximum length in a graph. A recursive depth first search is applied.

    Parameters
    ----------
    G : NetworkX graphs
        The graph in which walks are searched.
    length : integer
        The maximum length of walks.
    node_label : string
        Node attribute used as label. The default node label is atom.
    edge_label : string
        Edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.

    Return
    ------
    walk : list
        List of walks retrieved, where for unlabeled graphs each walk is represented by a list of nodes, while for labeled graphs each walk is represented by a string consisting of the labels of the nodes and edges on that walk.
    """
    all_walks = []
    for i in range(0, length + 1):
        new_walks = find_all_walks(G, i)
        if not new_walks:
            break
        all_walks.extend(new_walks)

    if labeled == True: # convert walks to strings
        walk_strs = []
        for walk in all_walks:
            # index by position rather than walk.index(node), which returns the
            # first occurrence only and picks the wrong edge when a walk revisits a node.
            strlist = [ G.node[walk[idx]][node_label] + G[walk[idx]][walk[idx + 1]][edge_label] for idx in range(len(walk) - 1) ]
            walk_strs.append(''.join(strlist) + G.node[walk[-1]][node_label])
        return walk_strs

    return all_walks
def find_walks(G, source_node, length):
    """Find all walks with a certain length that start from a source node. A recursive depth first search is applied.

    Parameters
    ----------
    G : NetworkX graphs
        The graph in which walks are searched.
    source_node : integer
        The node from which all walks start.
    length : integer
        The length of walks.

    Return
    ------
    walk : list of list
        List of walks retrieved, where each walk is represented by a list of nodes.
    """
    return [[source_node]] if length == 0 else \
        [ [source_node] + walk for neighbor in G[source_node] \
          for walk in find_walks(G, neighbor, length - 1) ]
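
# A small sanity check for find_walks (walks, unlike paths, may revisit nodes):
#   find_walks(nx.path_graph(3), 0, 2) returns [[0, 1, 0], [0, 1, 2]].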
def find_all_walks(G, length):
    """Find all walks with a certain length in a graph. A recursive depth first search is applied.

    Parameters
    ----------
    G : NetworkX graphs
        The graph in which walks are searched.
    length : integer
        The length of walks.

    Return
    ------
    walk : list of list
        List of walks retrieved, where each walk is represented by a list of nodes.
    """
    all_walks = []
    for node in G:
        all_walks.extend(find_walks(G, node, length))

    ### The following process is not carried out according to the original article.
    # all_walks_r = [ walk[::-1] for walk in all_walks ]
    # # For each walk, two presentations are retrieved from its two extremities. Remove one of them.
    # for idx, walk in enumerate(all_walks[:-1]):
    #     for walk2 in all_walks_r[idx+1::]:
    #         if walk == walk2:
    #             all_walks[idx] = []
    #             break
    # return list(filter(lambda a: a != [], all_walks))

    return all_walks
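
# A small sanity check for find_all_walks; each walk of length 1 is retrieved
# once per starting extremity:
#   find_all_walks(nx.path_graph(2), 1) returns [[0, 1], [1, 0]].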
@@ -0,0 +1,213 @@
def model_selection_for_precomputed_kernel(datafile, estimator, param_grid_precomputed, param_grid, model_type, NUM_TRIALS = 30, datafile_y = ''):
    """Perform model selection, fitting and testing for precomputed kernels using nested CV. Print out necessary data during the process, then finally the results.

    Parameters
    ----------
    datafile : string
        Path of dataset file.
    estimator : function
        Kernel function used to estimate. This function needs to return a gram matrix.
    param_grid_precomputed : dictionary
        Dictionary with names (string) of parameters used to calculate gram matrices as keys and lists of parameter settings to try as values. This enables searching over any sequence of parameter settings.
    param_grid : dictionary
        Dictionary with names (string) of parameters used as penalties as keys and lists of parameter settings to try as values. This enables searching over any sequence of parameter settings.
    model_type : string
        Type of the problem, can be regression or classification.
    NUM_TRIALS : integer
        Number of random trials of the outer CV loop. The default is 30.
    datafile_y : string
        Path of the file storing y data. This parameter is optional depending on the given dataset file.

    Examples
    --------
    >>> import numpy as np
    >>> import sys
    >>> sys.path.insert(0, "../")
    >>> from pygraph.utils.model_selection_precomputed import model_selection_for_precomputed_kernel
    >>> from pygraph.kernels.weisfeilerLehmanKernel import weisfeilerlehmankernel
    >>>
    >>> datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'
    >>> estimator = weisfeilerlehmankernel
    >>> param_grid_precomputed = {'height': [0,1,2,3,4,5,6,7,8,9,10], 'base_kernel': ['subtree']}
    >>> param_grid = {"alpha": np.logspace(-2, 2, num = 10, base = 10)}
    >>>
    >>> model_selection_for_precomputed_kernel(datafile, estimator, param_grid_precomputed, param_grid, 'regression')
    """
    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, mean_squared_error
    from sklearn.model_selection import KFold, train_test_split, ParameterGrid

    import sys
    sys.path.insert(0, "../")
    from pygraph.utils.graphfiles import loadDataset
    from tqdm import tqdm

    # setup the model type
    model_type = model_type.lower()
    if model_type != 'regression' and model_type != 'classification':
        raise Exception('The model type is incorrect! Please choose from regression or classification.')
    print()
    print('--- This is a %s problem ---' % model_type)

    # Load the dataset
    print()
    print('1. Loading dataset from file...')
    dataset, y = loadDataset(datafile, filename_y = datafile_y)

    # Grid of parameters with a discrete number of values for each.
    param_list_precomputed = list(ParameterGrid(param_grid_precomputed))
    param_list = list(ParameterGrid(param_grid))
    # Arrays to store scores
    train_pref = np.zeros((NUM_TRIALS, len(param_list_precomputed), len(param_list)))
    val_pref = np.zeros((NUM_TRIALS, len(param_list_precomputed), len(param_list)))
    test_pref = np.zeros((NUM_TRIALS, len(param_list_precomputed), len(param_list)))
    gram_matrices = [] # a list to store gram matrices for all param_grid_precomputed
    gram_matrix_time = [] # a list to store time to calculate gram matrices

    # calculate all gram matrices
    print()
    print('2. Calculating gram matrices. This could take a while...')
    for params_out in param_list_precomputed:
        Kmatrix, current_run_time = estimator(dataset, **params_out)
        print()
        print('gram matrix with parameters', params_out, 'is: ')
        print(Kmatrix)
        plt.matshow(Kmatrix)
        plt.show()
        # plt.savefig('../../notebooks/gram_matrix_figs/{}_{}'.format(estimator.__name__, params_out))
        gram_matrices.append(Kmatrix)
        gram_matrix_time.append(current_run_time)

    print()
    print('3. Fitting and predicting using nested cross validation. This could really take a while...')
    # Loop for each trial
    pbar = tqdm(total = NUM_TRIALS * len(param_list_precomputed) * len(param_list),
                desc = 'calculate performance', file = sys.stdout)
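    # Nested CV structure: the outer loop below draws NUM_TRIALS random 90/10
    # app/test splits to estimate test performance, while an inner 10-fold CV
    # on the app set selects the penalty parameter for each gram matrix.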
    for trial in range(NUM_TRIALS): # Test set level
        # loop for each outer param tuple
        for index_out, params_out in enumerate(param_list_precomputed):
            # split the gram matrix and y into app and test sets. Split on row
            # indices instead of recovering them by target value, since a lookup
            # by y value picks the wrong rows when targets are duplicated.
            indices = np.arange(len(y))
            index_app, index_test, y_app, y_test = train_test_split(indices, y, test_size=0.1)
            X_app = gram_matrices[index_out][np.ix_(index_app, index_app)]
            X_test = gram_matrices[index_out][np.ix_(index_test, index_app)]
            y_app = np.array(y_app)
            y_test = np.array(y_test)
            # loop for each inner param tuple
            for index_in, params_in in enumerate(param_list):
                inner_cv = KFold(n_splits=10, shuffle=True, random_state=trial)
                current_train_perf = []
                current_valid_perf = []
                current_test_perf = []

                # For regression use the Kernel Ridge method
                if model_type == 'regression':
                    KR = KernelRidge(kernel = 'precomputed', **params_in)
                    # loop for each split on validation set level
                    for train_index, valid_index in inner_cv.split(X_app): # validation set level
                        KR.fit(X_app[train_index,:][:,train_index], y_app[train_index])

                        # predict on the train, validation and test sets
                        y_pred_train = KR.predict(X_app[train_index,:][:,train_index])
                        y_pred_valid = KR.predict(X_app[valid_index,:][:,train_index])
                        y_pred_test = KR.predict(X_test[:,train_index])

                        # root mean squared errors
                        current_train_perf.append(np.sqrt(mean_squared_error(y_app[train_index], y_pred_train)))
                        current_valid_perf.append(np.sqrt(mean_squared_error(y_app[valid_index], y_pred_valid)))
                        current_test_perf.append(np.sqrt(mean_squared_error(y_test, y_pred_test)))
                # For classification use SVM
                else:
                    KR = SVC(kernel = 'precomputed', **params_in)
                    # loop for each split on validation set level
                    for train_index, valid_index in inner_cv.split(X_app): # validation set level
                        KR.fit(X_app[train_index,:][:,train_index], y_app[train_index])

                        # predict on the train, validation and test sets
                        y_pred_train = KR.predict(X_app[train_index,:][:,train_index])
                        y_pred_valid = KR.predict(X_app[valid_index,:][:,train_index])
                        y_pred_test = KR.predict(X_test[:,train_index])

                        # accuracy scores
                        current_train_perf.append(accuracy_score(y_app[train_index], y_pred_train))
                        current_valid_perf.append(accuracy_score(y_app[valid_index], y_pred_valid))
                        current_test_perf.append(accuracy_score(y_test, y_pred_test))

                # average performance over the inner splits
                train_pref[trial][index_out][index_in] = np.mean(current_train_perf)
                val_pref[trial][index_out][index_in] = np.mean(current_valid_perf)
                test_pref[trial][index_out][index_in] = np.mean(current_test_perf)
                pbar.update(1)
    pbar.clear()
    print()
    print('4. Getting final performances...')

    # averages and confidences of performances on outer trials for each combination of parameters
    average_train_scores = np.mean(train_pref, axis=0)
    average_val_scores = np.mean(val_pref, axis=0)
    average_perf_scores = np.mean(test_pref, axis=0)
    std_train_scores = np.std(train_pref, axis=0, ddof=1) # sample std is used here
    std_val_scores = np.std(val_pref, axis=0, ddof=1)
    std_perf_scores = np.std(test_pref, axis=0, ddof=1)

    if model_type == 'regression':
        best_val_perf = np.amin(average_val_scores)
    else:
        best_val_perf = np.amax(average_val_scores)
    print()
    best_params_index = np.where(average_val_scores == best_val_perf)
    best_params_out = [param_list_precomputed[i] for i in best_params_index[0]]
    best_params_in = [param_list[i] for i in best_params_index[1]]
    # print('best_params_index: ', best_params_index)
    print('best_params_out: ', best_params_out)
    print('best_params_in: ', best_params_in)
    print('best_val_perf: ', best_val_perf)

    # below: only the first optimum is reported; multiple optima may exist
    best_val_std = std_val_scores[best_params_index[0][0]][best_params_index[1][0]]
    print('best_val_std: ', best_val_std)

    final_performance = average_perf_scores[best_params_index[0][0]][best_params_index[1][0]]
    final_confidence = std_perf_scores[best_params_index[0][0]][best_params_index[1][0]]
    print('final_performance: ', final_performance)
    print('final_confidence: ', final_confidence)
    train_performance = average_train_scores[best_params_index[0][0]][best_params_index[1][0]]
    train_std = std_train_scores[best_params_index[0][0]][best_params_index[1][0]]
    print('train_performance: ', train_performance)
    print('train_std: ', train_std)

    best_gram_matrix_time = gram_matrix_time[best_params_index[0][0]]
    print('time to calculate gram matrix: ', best_gram_matrix_time, 's')
    # print results as a table
    from collections import OrderedDict
    from tabulate import tabulate
    table_dict = {}
    if model_type == 'regression':
        for param_in in param_list:
            param_in['alpha'] = '{:.2e}'.format(param_in['alpha'])
    else:
        for param_in in param_list:
            param_in['C'] = '{:.2e}'.format(param_in['C'])
    table_dict['params'] = [ {**param_out, **param_in} for param_in in param_list for param_out in param_list_precomputed ]
    table_dict['gram_matrix_time'] = [ '{:.2f}'.format(gram_matrix_time[index_out])
        for param_in in param_list for index_out, _ in enumerate(param_list_precomputed) ]
    table_dict['valid_perf'] = [ '{:.2f}±{:.2f}'.format(average_val_scores[index_out][index_in], std_val_scores[index_out][index_in])
        for index_in, _ in enumerate(param_list) for index_out, _ in enumerate(param_list_precomputed) ]
    table_dict['test_perf'] = [ '{:.2f}±{:.2f}'.format(average_perf_scores[index_out][index_in], std_perf_scores[index_out][index_in])
        for index_in, _ in enumerate(param_list) for index_out, _ in enumerate(param_list_precomputed) ]
    table_dict['train_perf'] = [ '{:.2f}±{:.2f}'.format(average_train_scores[index_out][index_in], std_train_scores[index_out][index_in])
        for index_in, _ in enumerate(param_list) for index_out, _ in enumerate(param_list_precomputed) ]
    keyorder = ['params', 'train_perf', 'valid_perf', 'test_perf', 'gram_matrix_time']
    print()
    print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i: keyorder.index(i[0]))), headers='keys'))
@@ -65,311 +65,312 @@ def floydTransformation(G, edge_weight = 'bond_type'):
def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, trials = 100, splits = 10, alpha_grid = None, C_grid = None, hyper_name = '', hyper_range = [1], normalize = False, datafile_y = '', model_type = 'regression'):
    """Perform training and testing for a kernel method. Print out necessary data during the process, then finally the results.

    Parameters
    ----------
    datafile : string
        Path of dataset file.
    kernel_file_path : string
        Path of the directory to save results.
    kernel_func : function
        Kernel function to use in the process.
    kernel_para : dictionary
        Keyword arguments passed to kernel_func.
    trials : integer
        Number of trials for the hyperparameter random search, where hyperparameter stands for the penalty parameter for now. The default is 100.
    splits : integer
        Number of splits of the dataset, i.e. how many times the training and testing procedure is run. The final means and stds are the average of the results of all the splits. The default is 10.
    alpha_grid : ndarray
        Penalty parameter in kernel ridge regression. Corresponds to (2*C)^-1 in other linear models such as LogisticRegression.
    C_grid : ndarray
        Penalty parameter C of the error term in kernel SVM.
    hyper_name : string
        Name of the hyperparameter.
    hyper_range : list
        Range of the hyperparameter.
    normalize : boolean
        Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.
    model_type : string
        Type of the problem, regression or classification.

    References
    ----------
    [1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1

    Examples
    --------
    >>> import sys
    >>> sys.path.insert(0, "../")
    >>> from pygraph.utils.utils import kernel_train_test
    >>> from pygraph.kernels.treeletKernel import treeletkernel
    >>> datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'
    >>> kernel_file_path = 'kernelmatrices_path_acyclic/'
    >>> kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True)
    >>> kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True)
    """
    import os
    import pathlib
    from collections import OrderedDict
    from tabulate import tabulate
    from .graphfiles import loadDataset

    # setup the parameters
    model_type = model_type.lower()
    if model_type != 'regression' and model_type != 'classification':
        raise Exception('The model type is incorrect! Please choose from regression or classification.')
    print('\n --- This is a %s problem ---' % model_type)

    # 'is None' rather than '== None': comparing an ndarray to None elementwise would break here
    alpha_grid = np.logspace(-10, 10, num = trials, base = 10) if alpha_grid is None else alpha_grid # corresponds to (2*C)^-1 in other linear models such as LogisticRegression
    C_grid = np.logspace(-10, 10, num = trials, base = 10) if C_grid is None else C_grid

    if not os.path.exists(kernel_file_path):
        os.makedirs(kernel_file_path)

    train_means_list = []
    train_stds_list = []
    test_means_list = []
    test_stds_list = []
    kernel_time_list = []
    for hyper_para in hyper_range:
        if hyper_name != '':
            print('\n\n #--- calculating kernel matrix when', hyper_name, '=', hyper_para, '---#')

        print('\n Loading dataset from file...')
        dataset, y = loadDataset(datafile, filename_y = datafile_y)
        y = np.array(y)

        # normalize labels and transform non-numerical labels to numerical labels.
        if model_type == 'classification':
            from sklearn.preprocessing import LabelEncoder
            y = LabelEncoder().fit_transform(y)
            # print(y)

        # save kernel matrices to files / read kernel matrices from files
        kernel_file = kernel_file_path + 'km.ds'
        path = pathlib.Path(kernel_file)

        # get train set kernel matrix
        if path.is_file():
            print('\n Loading the kernel matrix from file...')
            Kmatrix = np.loadtxt(kernel_file)
            print(Kmatrix)
        else:
            print('\n Calculating kernel matrix, this could take a while...')
            if hyper_name != '':
                kernel_para[hyper_name] = hyper_para
            Kmatrix, run_time = kernel_func(dataset, **kernel_para)
            kernel_time_list.append(run_time)
            print(Kmatrix)
            # print('\n Saving kernel matrix to file...')
            # np.savetxt(kernel_file, Kmatrix)

        """
        - Here starts the main program
        - First we permute the data, then for each split we evaluate the corresponding performances
        - In the end, the performances are averaged over the test sets
        """
        train_mean, train_std, test_mean, test_std = \
            split_train_test(Kmatrix, y, alpha_grid, C_grid, splits, trials, model_type, normalize = normalize)

        train_means_list.append(train_mean)
        train_stds_list.append(train_std)
        test_means_list.append(test_mean)
        test_stds_list.append(test_std)

    print('\n')
    if model_type == 'regression':
        table_dict = {'rmse_test': test_means_list, 'std_test': test_stds_list, \
                      'rmse_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
        if hyper_name == '':
            keyorder = ['rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
        else:
            table_dict[hyper_name] = hyper_range
            keyorder = [hyper_name, 'rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
    elif model_type == 'classification':
        table_dict = {'accur_test': test_means_list, 'std_test': test_stds_list, \
                      'accur_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
        if hyper_name == '':
            keyorder = ['accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
        else:
            table_dict[hyper_name] = hyper_range
            keyorder = [hyper_name, 'accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
    print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i: keyorder.index(i[0]))), headers='keys'))
def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, trials = 100, model_type = 'regression', normalize = False):
    """Split the dataset into training and testing splits, then train and test. Print out and return the results.

    Parameters
    ----------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the kernel between 2 graphs.
    train_target : ndarray
        Train target.
    alpha_grid : ndarray
        Penalty parameter in kernel ridge regression. Corresponds to (2*C)^-1 in other linear models such as LogisticRegression.
    C_grid : ndarray
        Penalty parameter C of the error term in kernel SVM.
    splits : integer
        Number of splits of the dataset, i.e. how many times the training and testing procedure is run. The final means and stds are the average of the results of all the splits. The default is 10.
    trials : integer
        Number of trials for the hyperparameter random search. The final means and stds are the ones in the same trial with the best test mean. The default is 100.
    model_type : string
        Determine whether it is a regression or classification problem. The default is 'regression'.
    normalize : boolean
        Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.

    Return
    ------
    train_mean : float
        Mean of train accuracies in the same trial with the best test mean.
    train_std : float
        Mean of train stds in the same trial with the best test mean.
    test_mean : float
        Mean of the best tests.
    test_std : float
        Mean of test stds in the same trial with the best test mean.

    References
    ----------
    [1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1
    """
    import random
    import sys
    from sklearn.kernel_ridge import KernelRidge # 0.17
    from sklearn.metrics import accuracy_score, mean_squared_error
    from sklearn import svm
    from tqdm import tqdm

    datasize = len(train_target)
    random.seed(20) # Set the seed for uniform parameter distribution

    # Initialize the performance of the best parameter trial on train with the corresponding performance on test
    train_split = []
    test_split = []

    # For each split of the data
    print('\n Starting to calculate accuracy/RMSE...')
    pbar = tqdm(total = splits * trials, desc = 'calculate performance', file = sys.stdout)
    for j in range(10, 10 + splits):
        # print('\n Starting split %d...' % j)

        # Set the random seed for data permutation
        random_state = int(j)
        np.random.seed(random_state)
        idx_perm = np.random.permutation(datasize)

        # Permute the data
        y_perm = train_target[idx_perm] # targets permutation
        Kmatrix_perm = Kmatrix[:, idx_perm] # inputs permutation
        Kmatrix_perm = Kmatrix_perm[idx_perm, :] # inputs permutation

        # Set the training, test
        # Note: the percentage can be set up by the user
        num_train = int((datasize * 90) / 100) # 90% (of entire dataset) for training
        num_test = datasize - num_train # 10% (of entire dataset) for test

        # Split the kernel matrix
        Kmatrix_train = Kmatrix_perm[0:num_train, 0:num_train]
        Kmatrix_test = Kmatrix_perm[num_train:datasize, 0:num_train]

        # Split the targets
        y_train = y_perm[0:num_train]

        # Normalization step (for real valued targets only)
        if normalize == True and model_type == 'regression':
            y_train_mean = np.mean(y_train)
            y_train_std = np.std(y_train)
            y_train_norm = (y_train - y_train_mean) / float(y_train_std)

        y_test = y_perm[num_train:datasize]

        # Record the performance for each parameter trial respectively on train and test set
        perf_all_train = []
        perf_all_test = []

        # For each parameter trial
        for i in range(trials):
            # For regression use the Kernel Ridge method
            if model_type == 'regression':
                # Fit the kernel ridge model
                KR = KernelRidge(kernel = 'precomputed', alpha = alpha_grid[i])
                KR.fit(Kmatrix_train, y_train_norm if normalize == True else y_train)

                # predict on the train and test set
                y_pred_train = KR.predict(Kmatrix_train)
                y_pred_test = KR.predict(Kmatrix_test)

                # adjust prediction: needed because the training targets have been normalized
                if normalize == True:
                    y_pred_train = y_pred_train * float(y_train_std) + y_train_mean
                    y_pred_test = y_pred_test * float(y_train_std) + y_train_mean

                # root mean squared error on train set
                accuracy_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
                perf_all_train.append(accuracy_train)
                # root mean squared error on test set
                accuracy_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
                perf_all_test.append(accuracy_test)
            # For classification use SVM
            elif model_type == 'classification':
                KR = svm.SVC(kernel = 'precomputed', C = C_grid[i])
                KR.fit(Kmatrix_train, y_train)

                # predict on the train and test set
                y_pred_train = KR.predict(Kmatrix_train)
                y_pred_test = KR.predict(Kmatrix_test)

                # accuracy on train set
                accuracy_train = accuracy_score(y_train, y_pred_train)
                perf_all_train.append(accuracy_train)
                # accuracy on test set
                accuracy_test = accuracy_score(y_test, y_pred_test)
                perf_all_test.append(accuracy_test)
            pbar.update(1)
        # --- FIND THE OPTIMAL PARAMETERS --- #
        # For regression: minimise the mean squared error
        if model_type == 'regression':
            # get optimal parameter on test (argmin mean squared error)
            min_idx = np.argmin(perf_all_test)
            alpha_opt = alpha_grid[min_idx]
            # corresponding performance on train and test set for the same parameter
            perf_train_opt = perf_all_train[min_idx]
            perf_test_opt = perf_all_test[min_idx]
        # For classification: maximise the accuracy
        if model_type == 'classification':
            # get optimal parameter on test (argmax accuracy)
            max_idx = np.argmax(perf_all_test)
            C_opt = C_grid[max_idx]
            # corresponding performance on train and test set for the same parameter
            perf_train_opt = perf_all_train[max_idx]
            perf_test_opt = perf_all_test[max_idx]

        # append the corresponding performance on the train and test set
        train_split.append(perf_train_opt)
        test_split.append(perf_test_opt)

    # average the results
    # mean of the train and test performances over the splits
    train_mean = np.mean(np.asarray(train_split))
    test_mean = np.mean(np.asarray(test_split))
    # std deviation of the train and test over the splits
    train_std = np.std(np.asarray(train_split))
    test_std = np.std(np.asarray(test_split))

    print('\n Mean performance on train set: %.3f' % train_mean)
    print('With standard deviation: %.3f' % train_std)
    print('\n Mean performance on test set: %.3f' % test_mean)
    print('With standard deviation: %.3f' % test_std)

    return train_mean, train_std, test_mean, test_std
@@ -1,16 +0,0 @@
import sys
sys.path.insert(0, "../")
from pygraph.utils.utils import kernel_train_test
from pygraph.kernels.cyclicPatternKernel import cyclicpatternkernel

import numpy as np

datafile = '../../../../datasets/NCI-HIV/AIDO99SD.sdf'
datafile_y = '../../../../datasets/NCI-HIV/aids_conc_may04.txt'
kernel_file_path = 'kernelmatrices_path_acyclic/'

kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True)

kernel_train_test(datafile, kernel_file_path, cyclicpatternkernel, kernel_para, \
                  hyper_name = 'cycle_bound', hyper_range = np.linspace(0, 1000, 21), normalize = False, \
                  datafile_y = datafile_y, model_type = 'classification')