	modified:   notebooks/run_cyclicpatternkernel.ipynb
	modified:   notebooks/run_marginalizedkernel_acyclic.ipynb
	modified:   notebooks/run_pathkernel_acyclic.ipynb
	modified:   notebooks/run_spkernel_acyclic.ipynb
	modified:   notebooks/run_treeletkernel_acyclic.ipynb
	modified:   notebooks/run_treepatternkernel.ipynb
	modified:   notebooks/run_untildpathkernel_acyclic.ipynb
	new file:   notebooks/run_untilnwalkkernel.ipynb
	modified:   notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
	modified:   pygraph/kernels/treePatternKernel.py
	modified:   pygraph/kernels/untildPathKernel.py
	new file:   pygraph/kernels/untilnWalkKernel.py
	new file:   pygraph/utils/model_selection_precomputed.py
	modified:   pygraph/utils/utils.py

v0.1
@@ -16,25 +16,24 @@ All kernels except for the cyclic pattern kernel are tested on the dataset Acyclic, whic
The criteria used for prediction are SVM for classification and kernel ridge regression for regression.
For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the parameters that are optimal on the test set and finally report the corresponding performance. The final results correspond to the average of the performances on the test sets.
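A minimal sketch of this protocol for a precomputed kernel matrix (illustrative names; it mirrors what `split_train_test` in `pygraph/utils/utils.py` does):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

def evaluate(Kmatrix, y, alpha_grid, splits=10):
    """For each of `splits` random 90/10 splits, keep the best test RMSE found
    over the parameter grid, then average the results over the splits."""
    rmses = []
    for seed in range(splits):
        # permute the data, then take 90% for training and 10% for testing
        idx = np.random.RandomState(seed).permutation(len(y))
        n_train = int(len(y) * 0.9)
        tr, te = idx[:n_train], idx[n_train:]
        best = np.inf
        for alpha in alpha_grid:
            model = KernelRidge(kernel='precomputed', alpha=alpha)
            model.fit(Kmatrix[np.ix_(tr, tr)], y[tr])
            y_pred = model.predict(Kmatrix[np.ix_(te, tr)])
            best = min(best, np.sqrt(mean_squared_error(y[te], y_pred)))
        rmses.append(best)
    return np.mean(rmses), np.std(rmses)  # the RMSE and STD columns below
```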
| Kernels          | RMSE(℃) | STD(℃) | Parameter         | k_time |
|------------------|:-------:|:------:|------------------:|-------:|
| Shortest path    | 35.19   | 4.50   | -                 | 14.58" |
| Marginalized     | 18.02   | 6.29   | p_quit = 0.1      | 4'19"  |
| Path             | 18.41   | 10.78  | -                 | 29.43" |
| WL subtree       | 7.55    | 2.33   | height = 1        | 0.84"  |
| WL shortest path | 35.16   | 4.50   | height = 2        | 40.24" |
| WL edge          | 33.41   | 4.73   | height = 5        | 5.66"  |
| Treelet          | 8.31    | 3.38   | -                 | 0.50"  |
| Path up to d     | 7.43    | 2.69   | depth = 2         | 0.59"  |
| Tree pattern     | 7.27    | 2.21   | lambda = 1, h = 2 | 37.24" |
| Cyclic pattern   | 0.9     | 0.11   | cycle bound = 100 | 0.31"  |
~~For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the parameters that are optimal on the test set and finally report the corresponding performance. The final results correspond to the average of the performances on the test sets.~~
| Kernels          | train_perf | valid_perf | test_perf  | Parameters                                            | gram_matrix_time |
|------------------|-----------:|-----------:|-----------:|------------------------------------------------------:|-----------------:|
| Shortest path    | 28.65±0.59 | 36.09±0.97 | 36.45±6.63 | 'alpha': '3.55e+01'                                   | 12.67"           |
| Marginalized     | 12.42±0.28 | 18.60±2.02 | 16.51±5.12 | 'p_quit': 0.3, 'alpha': '3.16e-06'                    | 430.42"          |
| Path             | 11.19±0.73 | 23.66±1.74 | 25.04±9.60 | 'alpha': '2.57e-03'                                   | 21.84"           |
| WL subtree       | 6.00±0.27  | 7.59±0.71  | 7.92±2.92  | 'height': 1.0, 'alpha': '1.26e-01'                    | 0.84"            |
| WL shortest path | 28.32±0.63 | 35.99±0.98 | 37.92±5.60 | 'height': 2.0, 'alpha': '1.00e+02'                    | 39.79"           |
| WL edge          | 30.10±0.57 | 35.13±0.78 | 37.70±6.92 | 'height': 4.0, 'alpha': '3.98e+01'                    | 4.35"            |
| Treelet          | 7.38±0.37  | 14.21±0.80 | 15.26±3.65 | 'alpha': '1.58e+00'                                   | 0.49"            |
| Path up to d     | 5.48±0.23  | 10.00±0.83 | 10.73±5.67 | 'depth': 2.0, 'k_func': 'MinMax', 'alpha': '7.94e-02' | 0.57"            |
| Tree pattern     |            |            |            |                                                       |                  |
| Cyclic pattern   | 0.62±0.02  | 0.62±0.02  | 0.57±0.17  | 'cycle_bound': 125.0, 'C': '1.78e-01'                 | 0.33"            |
* RMSE stands for the arithmetic mean of the root mean squared errors on all splits.
* STD stands for the standard deviation of the root mean squared errors on all splits.
* Parameter is the one with which the kernel achieves the best results.
* k_time is the time spent building the kernel matrix.
* The targets of the training data are normalized before calculating the *treelet kernel*.
* Parameters are the ones with which the kernel achieves the best results.
* gram_matrix_time is the time spent building the gram matrix.
* See detailed results in [results.md](pygraph/kernels/results.md).
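The second table was produced with the nested CV routine in `pygraph/utils/model_selection_precomputed.py`; a minimal sketch reproducing the WL subtree row (it mirrors the docstring example of `model_selection_for_precomputed_kernel`):

```python
import numpy as np
from pygraph.utils.model_selection_precomputed import model_selection_for_precomputed_kernel
from pygraph.kernels.weisfeilerLehmanKernel import weisfeilerlehmankernel

datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'
param_grid_precomputed = {'height': [0,1,2,3,4,5,6,7,8,9,10], 'base_kernel': ['subtree']}
param_grid = {'alpha': np.logspace(-2, 2, num = 10, base = 10)}
model_selection_for_precomputed_kernel(datafile, weisfeilerlehmankernel,
                                       param_grid_precomputed, param_grid, 'regression')
```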
## References
@@ -2,6 +2,268 @@
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Automatically created module for IPython interactive environment\n",
      "# Tuning hyper-parameters for precision\n",
      "\n",
      "Best parameters set found on development set:\n",
      "\n",
      "{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "\n",
      "Grid scores on development set:\n",
      "\n",
      "0.986 (+/-0.016) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.959 (+/-0.029) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.988 (+/-0.017) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.982 (+/-0.026) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.988 (+/-0.017) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.982 (+/-0.025) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.988 (+/-0.017) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.982 (+/-0.025) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.975 (+/-0.014) for {'C': 1, 'kernel': 'linear'}\n",
      "0.975 (+/-0.014) for {'C': 10, 'kernel': 'linear'}\n",
      "0.975 (+/-0.014) for {'C': 100, 'kernel': 'linear'}\n",
      "0.975 (+/-0.014) for {'C': 1000, 'kernel': 'linear'}\n",
      "\n",
      "Detailed classification report:\n",
      "\n",
      "The model is trained on the full development set.\n",
      "The scores are computed on the full evaluation set.\n",
      "\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "          0       1.00      1.00      1.00        89\n",
      "          1       0.97      1.00      0.98        90\n",
      "          2       0.99      0.98      0.98        92\n",
      "          3       1.00      0.99      0.99        93\n",
      "          4       1.00      1.00      1.00        76\n",
      "          5       0.99      0.98      0.99       108\n",
      "          6       0.99      1.00      0.99        89\n",
      "          7       0.99      1.00      0.99        78\n",
      "          8       1.00      0.98      0.99        92\n",
      "          9       0.99      0.99      0.99        92\n",
      "\n",
      "avg / total       0.99      0.99      0.99       899\n",
      "\n",
      "\n",
      "# Tuning hyper-parameters for recall\n",
      "\n",
      "Best parameters set found on development set:\n",
      "\n",
      "{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "\n",
      "Grid scores on development set:\n",
      "\n",
      "0.986 (+/-0.019) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.957 (+/-0.029) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.987 (+/-0.019) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.981 (+/-0.028) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.987 (+/-0.019) for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.981 (+/-0.026) for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.987 (+/-0.019) for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}\n",
      "0.981 (+/-0.026) for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}\n",
      "0.972 (+/-0.012) for {'C': 1, 'kernel': 'linear'}\n",
      "0.972 (+/-0.012) for {'C': 10, 'kernel': 'linear'}\n",
      "0.972 (+/-0.012) for {'C': 100, 'kernel': 'linear'}\n",
      "0.972 (+/-0.012) for {'C': 1000, 'kernel': 'linear'}\n",
      "\n",
      "Detailed classification report:\n",
      "\n",
      "The model is trained on the full development set.\n",
      "The scores are computed on the full evaluation set.\n",
      "\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "          0       1.00      1.00      1.00        89\n",
      "          1       0.97      1.00      0.98        90\n",
      "          2       0.99      0.98      0.98        92\n",
      "          3       1.00      0.99      0.99        93\n",
      "          4       1.00      1.00      1.00        76\n",
      "          5       0.99      0.98      0.99       108\n",
      "          6       0.99      1.00      0.99        89\n",
      "          7       0.99      1.00      0.99        78\n",
      "          8       1.00      0.98      0.99        92\n",
      "          9       0.99      0.99      0.99        92\n",
      "\n",
      "avg / total       0.99      0.99      0.99       899\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Parameter estimation using grid search with cross-validation\n",
    "from __future__ import print_function\n",
    "\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "print(__doc__)\n",
    "\n",
    "# Loading the Digits dataset\n",
    "digits = datasets.load_digits()\n",
    "\n",
    "# To apply a classifier on this data, we need to flatten the image, to\n",
    "# turn the data into a (samples, feature) matrix:\n",
    "n_samples = len(digits.images)\n",
    "X = digits.images.reshape((n_samples, -1))\n",
    "y = digits.target\n",
    "\n",
    "# Split the dataset in two equal parts\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.5, random_state=0)\n",
    "\n",
    "# Set the parameters by cross-validation\n",
    "tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],\n",
    "                     'C': [1, 10, 100, 1000]},\n",
    "                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]\n",
    "\n",
    "scores = ['precision', 'recall']\n",
    "\n",
    "for score in scores:\n",
    "    print(\"# Tuning hyper-parameters for %s\" % score)\n",
    "    print()\n",
    "\n",
    "    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,\n",
    "                       scoring='%s_macro' % score)\n",
    "    clf.fit(X_train, y_train)\n",
    "\n",
    "    print(\"Best parameters set found on development set:\")\n",
    "    print()\n",
    "    print(clf.best_params_)\n",
    "    print()\n",
    "    print(\"Grid scores on development set:\")\n",
    "    print()\n",
    "    means = clf.cv_results_['mean_test_score']\n",
    "    stds = clf.cv_results_['std_test_score']\n",
    "    for mean, std, params in zip(means, stds, clf.cv_results_['params']):\n",
    "        print(\"%0.3f (+/-%0.03f) for %r\"\n",
    "              % (mean, std * 2, params))\n",
    "    print()\n",
    "\n",
    "    print(\"Detailed classification report:\")\n",
    "    print()\n",
    "    print(\"The model is trained on the full development set.\")\n",
    "    print(\"The scores are computed on the full evaluation set.\")\n",
    "    print()\n",
    "    y_true, y_pred = y_test, clf.predict(X_test)\n",
    "    print(classification_report(y_true, y_pred))\n",
    "    print()\n",
    "\n",
    "# Note the problem is too easy: the hyperparameter plateau is too flat and the\n",
    "# output model is the same for precision and recall with ties in quality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=None, error_score='raise',\n",
       "       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
       "  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',\n",
       "  max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
       "  tol=0.001, verbose=False),\n",
       "       fit_params=None, iid=True, n_jobs=1,\n",
       "       param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
       "       scoring=None, verbose=0)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import svm, datasets\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "iris = datasets.load_iris()\n",
    "parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}\n",
    "svc = svm.SVC()\n",
    "clf = GridSearchCV(svc, parameters)\n",
    "clf.fit(iris.data, iris.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['mean_fit_time',\n",
       " 'mean_score_time',\n",
       " 'mean_test_score',\n",
       " 'mean_train_score',\n",
       " 'param_C',\n",
       " 'param_kernel',\n",
       " 'params',\n",
       " 'rank_test_score',\n",
       " 'split0_test_score',\n",
       " 'split0_train_score',\n",
       " 'split1_test_score',\n",
       " 'split1_train_score',\n",
       " 'split2_test_score',\n",
       " 'split2_train_score',\n",
       " 'std_fit_time',\n",
       " 'std_score_time',\n",
       " 'std_test_score',\n",
       " 'std_train_score']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted(clf.cv_results_.keys())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_values([array([0.98      , 0.97333333, 0.97333333, 0.98      ]), array([1.        , 0.98039216, 1.        , 0.98039216]), array([0.00030899, 0.00021172, 0.00019932, 0.00017134]), array([1, 3, 3, 1], dtype=int32), array([0.01617914, 0.00902067, 0.03715363, 0.01592466]), masked_array(data=['linear', 'rbf', 'linear', 'rbf'],\n",
       "             mask=[False, False, False, False],\n",
       "       fill_value='?',\n",
       "            dtype=object), array([1., 1., 1., 1.]), array([0.98999802, 0.98336304, 0.97999604, 0.97999604]), array([6.43618303e-05, 6.20771049e-05, 7.16528819e-05, 9.16456815e-06]), array([0.97979798, 0.96969697, 0.95959596, 0.95959596]), [{'kernel': 'linear', 'C': 1}, {'kernel': 'rbf', 'C': 1}, {'kernel': 'linear', 'C': 10}, {'kernel': 'rbf', 'C': 10}], array([0.00036526, 0.00039411, 0.0002923 , 0.00032218]), array([0.00824863, 0.01254825, 0.01649726, 0.01649726]), array([0.97916667, 0.97916667, 1.        , 1.        ]), array([0.99019608, 0.98039216, 0.98039216, 0.98039216]), array([5.54407363e-05, 3.25514857e-05, 7.09833681e-05, 3.70551530e-06]), masked_array(data=[1, 1, 10, 10],\n",
       "             mask=[False, False, False, False],\n",
       "       fill_value='?',\n",
       "            dtype=object), array([0.96078431, 0.96078431, 0.92156863, 0.96078431])])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.cv_results_.values()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
@@ -29,10 +29,12 @@ def treepatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labe
        edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.
    depth : integer
        Depth of search. The maximum length of paths.
    k_func : function
        A kernel function using different notions of fingerprint similarity.
    kernel_type : string
        Type of tree pattern kernel, can be 'untiln', 'size' or 'branching'.
    lmda : float
        Weight that decides whether linear patterns or tree patterns of increasing complexity are favored.
    h : integer
        The upper bound of the height of tree patterns.

    Return
    ------
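    Examples
    --------
    A minimal usage sketch (a hypothetical call; the parameter values mirror the tree pattern row of the results table in the README, and the dataset path follows the project's convention):

    >>> from pygraph.utils.graphfiles import loadDataset
    >>> from pygraph.kernels.treePatternKernel import treepatternkernel
    >>> Gn, y = loadDataset('../../../../datasets/acyclic/Acyclic/dataset_bps.ds')
    >>> Kmatrix, run_time = treepatternkernel(Gn, kernel_type = 'untiln', lmda = 1, h = 2)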
@@ -74,6 +76,12 @@ def _treepatternkernel_do(G1, G2, node_label, edge_label, labeled, kernel_type,
        edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.
    kernel_type : string
        Type of tree pattern kernel, can be 'untiln', 'size' or 'branching'.
    lmda : float
        Weight that decides whether linear patterns or tree patterns of increasing complexity are favored.
    h : integer
        The upper bound of the height of tree patterns.

    Return
    ------
@@ -8,8 +8,6 @@ import pathlib
sys.path.insert(0, "../")
import time
from collections import Counter
import networkx as nx
import numpy as np
@@ -36,8 +34,8 @@ def untildpathkernel(*args, node_label = 'atom', edge_label = 'bond_type', label
    Return
    ------
    Kmatrix/kernel : Numpy matrix/float
        Kernel matrix, each element of which is the path kernel up to d between 2 graphs. / Path kernel up to d between 2 graphs.
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the path kernel up to d between 2 graphs.
    """
    depth = int(depth)
    if len(args) == 1: # for a list of graphs
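        # A minimal usage sketch (a hypothetical call; the 'depth' and 'k_func'
        # values mirror the path-up-to-d row of the results table in the README):
        #   Kmatrix, run_time = untildpathkernel(Gn, depth = 2, k_func = 'MinMax')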
@@ -0,0 +1,182 @@
"""
@author: linlin
@references: Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. Learning Theory and Kernel Machines, pages 129–143, 2003.
"""

import sys
import pathlib
sys.path.insert(0, "../")
import time
from collections import Counter

import networkx as nx
import numpy as np


def untilnwalkkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, n = 10):
    """Calculate common walk graph kernels up to length n between graphs.

    Parameters
    ----------
    Gn : List of NetworkX graph
        List of graphs between which the kernels are calculated.
    /
    G1, G2 : NetworkX graphs
        2 graphs between which the kernel is calculated.
    node_label : string
        Node attribute used as label. The default node label is atom.
    edge_label : string
        Edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.
    n : integer
        The maximum length of walks.

    Return
    ------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the walk kernel up to n between 2 graphs.
    """
    Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
    Kmatrix = np.zeros((len(Gn), len(Gn)))
    n = int(n)

    start_time = time.time()

    # get all walks of all graphs before calculating kernels, to save time; this may cost a lot of memory for large datasets.
    all_walks = [ find_all_walks_until_length(Gn[i], n, node_label = node_label, edge_label = edge_label, labeled = labeled) for i in range(0, len(Gn)) ]

    for i in range(0, len(Gn)):
        for j in range(i, len(Gn)):
            Kmatrix[i][j] = _untilnwalkkernel_do(all_walks[i], all_walks[j], node_label = node_label, edge_label = edge_label, labeled = labeled)
            Kmatrix[j][i] = Kmatrix[i][j]

    run_time = time.time() - start_time
    print("\n --- kernel matrix of walk kernel up to %d of size %d built in %s seconds ---" % (n, len(Gn), run_time))

    return Kmatrix, run_time
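
# A minimal usage sketch (the dataset path is an assumption, following the
# convention used elsewhere in this project):
#   from pygraph.utils.graphfiles import loadDataset
#   Gn, y = loadDataset('../../../../datasets/acyclic/Acyclic/dataset_bps.ds')
#   Kmatrix, run_time = untilnwalkkernel(Gn, n = 2)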
def _untilnwalkkernel_do(walks1, walks2, node_label = 'atom', edge_label = 'bond_type', labeled = True):
    """Calculate a walk kernel up to n between 2 graphs.

    Parameters
    ----------
    walks1, walks2 : list
        Lists of walks in the 2 graphs, where for unlabeled graphs each walk is represented by a list of nodes, while for labeled graphs each walk is represented by a string consisting of the labels of the nodes and edges on that walk.
    node_label : string
        Node attribute used as label. The default node label is atom.
    edge_label : string
        Edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.

    Return
    ------
    kernel : float
        Walk kernel up to n between 2 graphs.
    """
    counts_walks1 = dict(Counter(walks1))
    counts_walks2 = dict(Counter(walks2))
    all_walks = list(set(walks1 + walks2))

    vector1 = [ counts_walks1.get(walk, 0) for walk in all_walks ]
    vector2 = [ counts_walks2.get(walk, 0) for walk in all_walks ]
    kernel = np.dot(vector1, vector2)

    return kernel
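
# A toy sketch of the counting above, with walks given directly as label
# strings: walk 'CO' matches 1 * 1 times and walk 'C' matches 1 * 2 times, so
#   _untilnwalkkernel_do(['CO', 'OC', 'C', 'O'], ['CO', 'C', 'C'])
# returns 1 + 2 = 3.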
# this method finds walks repetitively; it could be made faster.
def find_all_walks_until_length(G, length, node_label = 'atom', edge_label = 'bond_type', labeled = True):
    """Find all walks up to a certain maximum length in a graph. A recursive depth first search is applied.

    Parameters
    ----------
    G : NetworkX graphs
        The graph in which walks are searched.
    length : integer
        The maximum length of walks.
    node_label : string
        Node attribute used as label. The default node label is atom.
    edge_label : string
        Edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.

    Return
    ------
    walk : list
        List of walks retrieved, where for unlabeled graphs each walk is represented by a list of nodes, while for labeled graphs each walk is represented by a string consisting of the labels of the nodes and edges on that walk.
    """
    all_walks = []
    for i in range(0, length + 1):
        new_walks = find_all_walks(G, i)
        if not new_walks:
            break
        all_walks.extend(new_walks)

    if labeled == True: # convert walks to strings
        walk_strs = []
        for walk in all_walks:
            # index by position rather than walk.index(node), which returns the
            # first occurrence only and picks the wrong edge when a walk revisits a node.
            strlist = [ G.node[walk[idx]][node_label] + G[walk[idx]][walk[idx + 1]][edge_label] for idx in range(len(walk) - 1) ]
            walk_strs.append(''.join(strlist) + G.node[walk[-1]][node_label])
        return walk_strs

    return all_walks
def find_walks(G, source_node, length):
    """Find all walks with a certain length that start from a source node. A recursive depth first search is applied.

    Parameters
    ----------
    G : NetworkX graphs
        The graph in which walks are searched.
    source_node : integer
        The node from which all walks start.
    length : integer
        The length of walks.

    Return
    ------
    walk : list of list
        List of walks retrieved, where each walk is represented by a list of nodes.
    """
    return [[source_node]] if length == 0 else \
        [ [source_node] + walk for neighbor in G[source_node] \
          for walk in find_walks(G, neighbor, length - 1) ]
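
# A small sanity check for find_walks (walks, unlike paths, may revisit nodes):
#   find_walks(nx.path_graph(3), 0, 2) returns [[0, 1, 0], [0, 1, 2]].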
def find_all_walks(G, length):
    """Find all walks with a certain length in a graph. A recursive depth first search is applied.

    Parameters
    ----------
    G : NetworkX graphs
        The graph in which walks are searched.
    length : integer
        The length of walks.

    Return
    ------
    walk : list of list
        List of walks retrieved, where each walk is represented by a list of nodes.
    """
    all_walks = []
    for node in G:
        all_walks.extend(find_walks(G, node, length))

    ### The following process is not carried out according to the original article.
    # all_walks_r = [ walk[::-1] for walk in all_walks ]
    # # For each walk, two presentations are retrieved from its two extremities. Remove one of them.
    # for idx, walk in enumerate(all_walks[:-1]):
    #     for walk2 in all_walks_r[idx+1::]:
    #         if walk == walk2:
    #             all_walks[idx] = []
    #             break
    # return list(filter(lambda a: a != [], all_walks))

    return all_walks
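
# A small sanity check for find_all_walks; each walk of length 1 is retrieved
# once per starting extremity:
#   find_all_walks(nx.path_graph(2), 1) returns [[0, 1], [1, 0]].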
@@ -0,0 +1,213 @@
def model_selection_for_precomputed_kernel(datafile, estimator, param_grid_precomputed, param_grid, model_type, NUM_TRIALS = 30, datafile_y = ''):
    """Perform model selection, fitting and testing for precomputed kernels using nested CV. Print out necessary data during the process, then finally the results.

    Parameters
    ----------
    datafile : string
        Path of dataset file.
    estimator : function
        Kernel function used to estimate. This function needs to return a gram matrix.
    param_grid_precomputed : dictionary
        Dictionary with names (string) of parameters used to calculate gram matrices as keys and lists of parameter settings to try as values. This enables searching over any sequence of parameter settings.
    param_grid : dictionary
        Dictionary with names (string) of parameters used as penalties as keys and lists of parameter settings to try as values. This enables searching over any sequence of parameter settings.
    model_type : string
        Type of the problem, can be regression or classification.
    NUM_TRIALS : integer
        Number of random trials of the outer CV loop. The default is 30.
    datafile_y : string
        Path of the file storing y data. This parameter is optional depending on the given dataset file.

    Examples
    --------
    >>> import numpy as np
    >>> import sys
    >>> sys.path.insert(0, "../")
    >>> from pygraph.utils.model_selection_precomputed import model_selection_for_precomputed_kernel
    >>> from pygraph.kernels.weisfeilerLehmanKernel import weisfeilerlehmankernel
    >>>
    >>> datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'
    >>> estimator = weisfeilerlehmankernel
    >>> param_grid_precomputed = {'height': [0,1,2,3,4,5,6,7,8,9,10], 'base_kernel': ['subtree']}
    >>> param_grid = {"alpha": np.logspace(-2, 2, num = 10, base = 10)}
    >>>
    >>> model_selection_for_precomputed_kernel(datafile, estimator, param_grid_precomputed, param_grid, 'regression')
    """
    import numpy as np
    from matplotlib import pyplot as plt
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, mean_squared_error
    from sklearn.model_selection import KFold, train_test_split, ParameterGrid

    import sys
    sys.path.insert(0, "../")
    from pygraph.utils.graphfiles import loadDataset
    from tqdm import tqdm

    # setup the model type
    model_type = model_type.lower()
    if model_type != 'regression' and model_type != 'classification':
        raise Exception('The model type is incorrect! Please choose from regression or classification.')
    print()
    print('--- This is a %s problem ---' % model_type)

    # Load the dataset
    print()
    print('1. Loading dataset from file...')
    dataset, y = loadDataset(datafile, filename_y = datafile_y)

    # Grid of parameters with a discrete number of values for each.
    param_list_precomputed = list(ParameterGrid(param_grid_precomputed))
    param_list = list(ParameterGrid(param_grid))
    # Arrays to store scores
    train_pref = np.zeros((NUM_TRIALS, len(param_list_precomputed), len(param_list)))
    val_pref = np.zeros((NUM_TRIALS, len(param_list_precomputed), len(param_list)))
    test_pref = np.zeros((NUM_TRIALS, len(param_list_precomputed), len(param_list)))
    gram_matrices = [] # a list to store gram matrices for all param_grid_precomputed
    gram_matrix_time = [] # a list to store time to calculate gram matrices

    # calculate all gram matrices
    print()
    print('2. Calculating gram matrices. This could take a while...')
    for params_out in param_list_precomputed:
        Kmatrix, current_run_time = estimator(dataset, **params_out)
        print()
        print('gram matrix with parameters', params_out, 'is: ')
        print(Kmatrix)
        plt.matshow(Kmatrix)
        plt.show()
        # plt.savefig('../../notebooks/gram_matrix_figs/{}_{}'.format(estimator.__name__, params_out))
        gram_matrices.append(Kmatrix)
        gram_matrix_time.append(current_run_time)

    print()
    print('3. Fitting and predicting using nested cross validation. This could really take a while...')
    # Loop for each trial
    pbar = tqdm(total = NUM_TRIALS * len(param_list_precomputed) * len(param_list),
                desc = 'calculate performance', file = sys.stdout)
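    # Nested CV structure: the outer loop below draws NUM_TRIALS random 90/10
    # app/test splits to estimate test performance, while an inner 10-fold CV
    # on the app set selects the penalty parameter for each gram matrix.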
    for trial in range(NUM_TRIALS): # Test set level
        # loop for each outer param tuple
        for index_out, params_out in enumerate(param_list_precomputed):
            # split the gram matrix and y into app and test sets. Split on row
            # indices instead of recovering them by target value, since a lookup
            # by y value picks the wrong rows when targets are duplicated.
            indices = np.arange(len(y))
            index_app, index_test, y_app, y_test = train_test_split(indices, y, test_size=0.1)
            X_app = gram_matrices[index_out][np.ix_(index_app, index_app)]
            X_test = gram_matrices[index_out][np.ix_(index_test, index_app)]
            y_app = np.array(y_app)
            y_test = np.array(y_test)
            # loop for each inner param tuple
            for index_in, params_in in enumerate(param_list):
                inner_cv = KFold(n_splits=10, shuffle=True, random_state=trial)
                current_train_perf = []
                current_valid_perf = []
                current_test_perf = []

                # For regression use the Kernel Ridge method
                if model_type == 'regression':
                    KR = KernelRidge(kernel = 'precomputed', **params_in)
                    # loop for each split on validation set level
                    for train_index, valid_index in inner_cv.split(X_app): # validation set level
                        KR.fit(X_app[train_index,:][:,train_index], y_app[train_index])

                        # predict on the train, validation and test sets
                        y_pred_train = KR.predict(X_app[train_index,:][:,train_index])
                        y_pred_valid = KR.predict(X_app[valid_index,:][:,train_index])
                        y_pred_test = KR.predict(X_test[:,train_index])

                        # root mean squared errors
                        current_train_perf.append(np.sqrt(mean_squared_error(y_app[train_index], y_pred_train)))
                        current_valid_perf.append(np.sqrt(mean_squared_error(y_app[valid_index], y_pred_valid)))
                        current_test_perf.append(np.sqrt(mean_squared_error(y_test, y_pred_test)))
                # For classification use SVM
                else:
                    KR = SVC(kernel = 'precomputed', **params_in)
                    # loop for each split on validation set level
                    for train_index, valid_index in inner_cv.split(X_app): # validation set level
                        KR.fit(X_app[train_index,:][:,train_index], y_app[train_index])

                        # predict on the train, validation and test sets
                        y_pred_train = KR.predict(X_app[train_index,:][:,train_index])
                        y_pred_valid = KR.predict(X_app[valid_index,:][:,train_index])
                        y_pred_test = KR.predict(X_test[:,train_index])

                        # accuracy scores
                        current_train_perf.append(accuracy_score(y_app[train_index], y_pred_train))
                        current_valid_perf.append(accuracy_score(y_app[valid_index], y_pred_valid))
                        current_test_perf.append(accuracy_score(y_test, y_pred_test))

                # average performance over the inner splits
                train_pref[trial][index_out][index_in] = np.mean(current_train_perf)
                val_pref[trial][index_out][index_in] = np.mean(current_valid_perf)
                test_pref[trial][index_out][index_in] = np.mean(current_test_perf)
                pbar.update(1)
    pbar.clear()
    print()
    print('4. Getting final performances...')

    # averages and confidences of performances on outer trials for each combination of parameters
    average_train_scores = np.mean(train_pref, axis=0)
    average_val_scores = np.mean(val_pref, axis=0)
    average_perf_scores = np.mean(test_pref, axis=0)
    std_train_scores = np.std(train_pref, axis=0, ddof=1) # sample std is used here
    std_val_scores = np.std(val_pref, axis=0, ddof=1)
    std_perf_scores = np.std(test_pref, axis=0, ddof=1)

    if model_type == 'regression':
        best_val_perf = np.amin(average_val_scores)
    else:
        best_val_perf = np.amax(average_val_scores)
    print()
    best_params_index = np.where(average_val_scores == best_val_perf)
    best_params_out = [param_list_precomputed[i] for i in best_params_index[0]]
    best_params_in = [param_list[i] for i in best_params_index[1]]
    # print('best_params_index: ', best_params_index)
    print('best_params_out: ', best_params_out)
    print('best_params_in: ', best_params_in)
    print('best_val_perf: ', best_val_perf)

    # below: only the first optimum is reported; multiple optima may exist
    best_val_std = std_val_scores[best_params_index[0][0]][best_params_index[1][0]]
    print('best_val_std: ', best_val_std)

    final_performance = average_perf_scores[best_params_index[0][0]][best_params_index[1][0]]
    final_confidence = std_perf_scores[best_params_index[0][0]][best_params_index[1][0]]
    print('final_performance: ', final_performance)
    print('final_confidence: ', final_confidence)
    train_performance = average_train_scores[best_params_index[0][0]][best_params_index[1][0]]
    train_std = std_train_scores[best_params_index[0][0]][best_params_index[1][0]]
    print('train_performance: ', train_performance)
    print('train_std: ', train_std)

    best_gram_matrix_time = gram_matrix_time[best_params_index[0][0]]
    print('time to calculate gram matrix: ', best_gram_matrix_time, 's')
    # print results as a table
    from collections import OrderedDict
    from tabulate import tabulate
    table_dict = {}
    if model_type == 'regression':
        for param_in in param_list:
            param_in['alpha'] = '{:.2e}'.format(param_in['alpha'])
    else:
        for param_in in param_list:
            param_in['C'] = '{:.2e}'.format(param_in['C'])
    table_dict['params'] = [ {**param_out, **param_in} for param_in in param_list for param_out in param_list_precomputed ]
    table_dict['gram_matrix_time'] = [ '{:.2f}'.format(gram_matrix_time[index_out])
        for param_in in param_list for index_out, _ in enumerate(param_list_precomputed) ]
    table_dict['valid_perf'] = [ '{:.2f}±{:.2f}'.format(average_val_scores[index_out][index_in], std_val_scores[index_out][index_in])
        for index_in, _ in enumerate(param_list) for index_out, _ in enumerate(param_list_precomputed) ]
    table_dict['test_perf'] = [ '{:.2f}±{:.2f}'.format(average_perf_scores[index_out][index_in], std_perf_scores[index_out][index_in])
        for index_in, _ in enumerate(param_list) for index_out, _ in enumerate(param_list_precomputed) ]
    table_dict['train_perf'] = [ '{:.2f}±{:.2f}'.format(average_train_scores[index_out][index_in], std_train_scores[index_out][index_in])
        for index_in, _ in enumerate(param_list) for index_out, _ in enumerate(param_list_precomputed) ]
    keyorder = ['params', 'train_perf', 'valid_perf', 'test_perf', 'gram_matrix_time']
    print()
    print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i: keyorder.index(i[0]))), headers='keys'))
@@ -65,311 +65,312 @@ def floydTransformation(G, edge_weight = 'bond_type'):
def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, trials = 100, splits = 10, alpha_grid = None, C_grid = None, hyper_name = '', hyper_range = [1], normalize = False, datafile_y = '', model_type = 'regression'):
    """Perform training and testing for a kernel method. Print out necessary data during the process, then finally the results.

    Parameters
    ----------
    datafile : string
        Path of dataset file.
    kernel_file_path : string
        Path of the directory to save results.
    kernel_func : function
        Kernel function to use in the process.
    kernel_para : dictionary
        Keyword arguments passed to kernel_func.
    trials : integer
        Number of trials for the hyperparameter random search, where hyperparameter stands for the penalty parameter for now. The default is 100.
    splits : integer
        Number of splits of the dataset, i.e. how many times the training and testing procedure is run. The final means and stds are the average of the results of all the splits. The default is 10.
    alpha_grid : ndarray
        Penalty parameter in kernel ridge regression. Corresponds to (2*C)^-1 in other linear models such as LogisticRegression.
    C_grid : ndarray
        Penalty parameter C of the error term in kernel SVM.
    hyper_name : string
        Name of the hyperparameter.
    hyper_range : list
        Range of the hyperparameter.
    normalize : boolean
        Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.
    model_type : string
        Type of the problem, regression or classification.

    References
    ----------
    [1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1

    Examples
    --------
    >>> import sys
    >>> sys.path.insert(0, "../")
    >>> from pygraph.utils.utils import kernel_train_test
    >>> from pygraph.kernels.treeletKernel import treeletkernel
    >>> datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'
    >>> kernel_file_path = 'kernelmatrices_path_acyclic/'
    >>> kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True)
    >>> kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True)
    """
    import os
    import pathlib
    from collections import OrderedDict
    from tabulate import tabulate
    from .graphfiles import loadDataset

    # setup the parameters
    model_type = model_type.lower()
    if model_type != 'regression' and model_type != 'classification':
        raise Exception('The model type is incorrect! Please choose from regression or classification.')
    print('\n --- This is a %s problem ---' % model_type)

    # 'is None' rather than '== None': comparing an ndarray to None elementwise would break here
    alpha_grid = np.logspace(-10, 10, num = trials, base = 10) if alpha_grid is None else alpha_grid # corresponds to (2*C)^-1 in other linear models such as LogisticRegression
    C_grid = np.logspace(-10, 10, num = trials, base = 10) if C_grid is None else C_grid

    if not os.path.exists(kernel_file_path):
        os.makedirs(kernel_file_path)

    train_means_list = []
    train_stds_list = []
    test_means_list = []
    test_stds_list = []
    kernel_time_list = []
    for hyper_para in hyper_range:
        if hyper_name != '':
            print('\n\n #--- calculating kernel matrix when', hyper_name, '=', hyper_para, '---#')

        print('\n Loading dataset from file...')
        dataset, y = loadDataset(datafile, filename_y = datafile_y)
        y = np.array(y)

        # normalize labels and transform non-numerical labels to numerical labels.
        if model_type == 'classification':
            from sklearn.preprocessing import LabelEncoder
            y = LabelEncoder().fit_transform(y)
            # print(y)

        # save kernel matrices to files / read kernel matrices from files
        kernel_file = kernel_file_path + 'km.ds'
        path = pathlib.Path(kernel_file)

        # get train set kernel matrix
        if path.is_file():
            print('\n Loading the kernel matrix from file...')
            Kmatrix = np.loadtxt(kernel_file)
            print(Kmatrix)
        else:
            print('\n Calculating kernel matrix, this could take a while...')
            if hyper_name != '':
                kernel_para[hyper_name] = hyper_para
            Kmatrix, run_time = kernel_func(dataset, **kernel_para)
            kernel_time_list.append(run_time)
            print(Kmatrix)
            # print('\n Saving kernel matrix to file...')
            # np.savetxt(kernel_file, Kmatrix)

        """
        - Here starts the main program
        - First we permute the data, then for each split we evaluate the corresponding performances
        - In the end, the performances are averaged over the test sets
        """
        train_mean, train_std, test_mean, test_std = \
            split_train_test(Kmatrix, y, alpha_grid, C_grid, splits, trials, model_type, normalize = normalize)

        train_means_list.append(train_mean)
        train_stds_list.append(train_std)
        test_means_list.append(test_mean)
        test_stds_list.append(test_std)

    print('\n')
    if model_type == 'regression':
        table_dict = {'rmse_test': test_means_list, 'std_test': test_stds_list, \
                      'rmse_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
        if hyper_name == '':
            keyorder = ['rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
        else:
            table_dict[hyper_name] = hyper_range
            keyorder = [hyper_name, 'rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time']
    elif model_type == 'classification':
        table_dict = {'accur_test': test_means_list, 'std_test': test_stds_list, \
                      'accur_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list}
        if hyper_name == '':
            keyorder = ['accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
        else:
            table_dict[hyper_name] = hyper_range
            keyorder = [hyper_name, 'accur_test', 'std_test', 'accur_train', 'std_train', 'k_time']
    print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i: keyorder.index(i[0]))), headers='keys'))
def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, trials = 100, model_type = 'regression', normalize = False):
    """Split the dataset into training and testing splits, then train and test. Print out and return the results.

    Parameters
    ----------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the kernel between 2 graphs.
    train_target : ndarray
        Train target.
    alpha_grid : ndarray
        Penalty parameter in kernel ridge regression. Corresponds to (2*C)^-1 in other linear models such as LogisticRegression.
    C_grid : ndarray
        Penalty parameter C of the error term in kernel SVM.
    splits : integer
        Number of splits of the dataset, i.e. how many times the training and testing procedure is run. The final means and stds are the average of the results of all the splits. The default is 10.
    trials : integer
        Number of trials for the hyperparameter random search. The final means and stds are the ones in the same trial with the best test mean. The default is 100.
    model_type : string
        Determine whether it is a regression or classification problem. The default is 'regression'.
    normalize : boolean
        Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.

    Return
    ------
    train_mean : float
        Mean of train accuracies in the same trial with the best test mean.
    train_std : float
        Mean of train stds in the same trial with the best test mean.
    test_mean : float
        Mean of the best tests.
    test_std : float
        Mean of test stds in the same trial with the best test mean.

    References
    ----------
    [1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1
    """
    import random
    import sys
    from sklearn.kernel_ridge import KernelRidge # 0.17
    from sklearn.metrics import accuracy_score, mean_squared_error
    from sklearn import svm
    from tqdm import tqdm

    datasize = len(train_target)
    random.seed(20) # Set the seed for uniform parameter distribution

    # Initialize the performance of the best parameter trial on train with the corresponding performance on test
    train_split = []
    test_split = []

    # For each split of the data
    print('\n Starting to calculate accuracy/RMSE...')
    pbar = tqdm(total = splits * trials, desc = 'calculate performance', file = sys.stdout)
    for j in range(10, 10 + splits):
        # print('\n Starting split %d...' % j)

        # Set the random seed for data permutation
        random_state = int(j)
        np.random.seed(random_state)
        idx_perm = np.random.permutation(datasize)

        # Permute the data
        y_perm = train_target[idx_perm] # targets permutation
        Kmatrix_perm = Kmatrix[:, idx_perm] # inputs permutation
        Kmatrix_perm = Kmatrix_perm[idx_perm, :] # inputs permutation

        # Set the training, test
        # Note: the percentage can be set up by the user
        num_train = int((datasize * 90) / 100) # 90% (of entire dataset) for training
        num_test = datasize - num_train # 10% (of entire dataset) for test

        # Split the kernel matrix
        Kmatrix_train = Kmatrix_perm[0:num_train, 0:num_train]
        Kmatrix_test = Kmatrix_perm[num_train:datasize, 0:num_train]

        # Split the targets
        y_train = y_perm[0:num_train]

        # Normalization step (for real valued targets only)
        if normalize == True and model_type == 'regression':
            y_train_mean = np.mean(y_train)
            y_train_std = np.std(y_train)
            y_train_norm = (y_train - y_train_mean) / float(y_train_std)

        y_test = y_perm[num_train:datasize]

        # Record the performance for each parameter trial respectively on train and test set
        perf_all_train = []
        perf_all_test = []

        # For each parameter trial
        for i in range(trials):
            # For regression use the Kernel Ridge method
            if model_type == 'regression':
                # Fit the kernel ridge model
                KR = KernelRidge(kernel = 'precomputed', alpha = alpha_grid[i])
                KR.fit(Kmatrix_train, y_train_norm if normalize == True else y_train)

                # predict on the train and test set
                y_pred_train = KR.predict(Kmatrix_train)
                y_pred_test = KR.predict(Kmatrix_test)

                # adjust prediction: needed because the training targets have been normalized
                if normalize == True:
                    y_pred_train = y_pred_train * float(y_train_std) + y_train_mean
                    y_pred_test = y_pred_test * float(y_train_std) + y_train_mean

                # root mean squared error on train set
                accuracy_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
                perf_all_train.append(accuracy_train)
                # root mean squared error on test set
                accuracy_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
                perf_all_test.append(accuracy_test)
            # For classification use SVM
            elif model_type == 'classification':
                KR = svm.SVC(kernel = 'precomputed', C = C_grid[i])
                KR.fit(Kmatrix_train, y_train)

                # predict on the train and test set
                y_pred_train = KR.predict(Kmatrix_train)
                y_pred_test = KR.predict(Kmatrix_test)

                # accuracy on train set
                accuracy_train = accuracy_score(y_train, y_pred_train)
                perf_all_train.append(accuracy_train)
                # accuracy on test set
                accuracy_test = accuracy_score(y_test, y_pred_test)
                perf_all_test.append(accuracy_test)
            pbar.update(1)
        # --- FIND THE OPTIMAL PARAMETERS --- #
        # For regression: minimise the mean squared error
        if model_type == 'regression':
            # get optimal parameter on test (argmin mean squared error)
            min_idx = np.argmin(perf_all_test)
            alpha_opt = alpha_grid[min_idx]
            # corresponding performance on train and test set for the same parameter
            perf_train_opt = perf_all_train[min_idx]
            perf_test_opt = perf_all_test[min_idx]
        # For classification: maximise the accuracy
        if model_type == 'classification':
            # get optimal parameter on test (argmax accuracy)
            max_idx = np.argmax(perf_all_test)
            C_opt = C_grid[max_idx]
            # corresponding performance on train and test set for the same parameter
            perf_train_opt = perf_all_train[max_idx]
            perf_test_opt = perf_all_test[max_idx]

        # append the corresponding performance on the train and test set
        train_split.append(perf_train_opt)
        test_split.append(perf_test_opt)

    # average the results
    # mean of the train and test performances over the splits
    train_mean = np.mean(np.asarray(train_split))
    test_mean = np.mean(np.asarray(test_split))
    # std deviation of the train and test over the splits
    train_std = np.std(np.asarray(train_split))
    test_std = np.std(np.asarray(test_split))

    print('\n Mean performance on train set: %.3f' % train_mean)
    print('With standard deviation: %.3f' % train_std)
    print('\n Mean performance on test set: %.3f' % test_mean)
    print('With standard deviation: %.3f' % test_std)

    return train_mean, train_std, test_mean, test_std
@@ -1,16 +0,0 @@
import sys
sys.path.insert(0, "../")
from pygraph.utils.utils import kernel_train_test
from pygraph.kernels.cyclicPatternKernel import cyclicpatternkernel

import numpy as np

datafile = '../../../../datasets/NCI-HIV/AIDO99SD.sdf'
datafile_y = '../../../../datasets/NCI-HIV/aids_conc_may04.txt'
kernel_file_path = 'kernelmatrices_path_acyclic/'

kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True)

kernel_train_test(datafile, kernel_file_path, cyclicpatternkernel, kernel_para, \
                  hyper_name = 'cycle_bound', hyper_range = np.linspace(0, 1000, 21), normalize = False, \
                  datafile_y = datafile_y, model_type = 'classification')