modified:   README.md
new file:   notebooks/run_cyclicpatternkernel.ipynb
modified:   notebooks/run_marginalizedkernel_acyclic.ipynb
modified:   notebooks/run_pathkernel_acyclic.ipynb
modified:   notebooks/run_spkernel_acyclic.ipynb
modified:   notebooks/run_treeletkernel_acyclic.ipynb
new file:   notebooks/run_treepatternkernel.ipynb
modified:   notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
new file:   pygraph/kernels/cyclicPatternKernel.py
modified:   pygraph/kernels/deltaKernel.py
modified:   pygraph/kernels/pathKernel.py
modified:   pygraph/kernels/results.md
modified:   pygraph/kernels/spKernel.py
new file:   pygraph/kernels/treePatternKernel.py
modified:   pygraph/kernels/treeletKernel.py
modified:   pygraph/kernels/untildPathKernel.py
modified:   pygraph/kernels/weisfeilerLehmanKernel.py
modified:   pygraph/utils/graphfiles.py
modified:   pygraph/utils/utils.py
new file:   run_cyclic.py
@@ -11,26 +11,30 @@ A python package for graph kernels.
* tabulate - 0.8.2
## Results with minimal test RMSE for each kernel on dataset Acyclic
All kernels are tested on the dataset Acyclic, which consists of 185 molecules (graphs).
All kernels except the Cyclic pattern kernel are tested on the dataset Acyclic, which consists of 185 molecules (graphs). (The Cyclic pattern kernel is tested on the datasets MAO and PAH.)
The criteria used for prediction are SVM for classification and kernel ridge regression for regression.
For prediction we randomly split the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 such splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the optimal parameters for the test set and finally report the corresponding performance. The final results are the averages of the performances over the test sets.
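For reference, here is a minimal sketch of this protocol for the regression case, assuming a precomputed kernel matrix `K` and target vector `y` (illustrative only; this is not the package's API):

```python
# Minimal sketch of the evaluation protocol described above: kernel ridge
# regression on a precomputed kernel matrix, over 10 random 90/10 splits.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

def evaluate(K, y, alpha=1.0):
    rmses = []
    for train, test in ShuffleSplit(n_splits=10, test_size=0.1).split(y):
        model = KernelRidge(alpha=alpha, kernel='precomputed')
        model.fit(K[np.ix_(train, train)], y[train])
        y_pred = model.predict(K[np.ix_(test, train)])
        rmses.append(np.sqrt(mean_squared_error(y[test], y_pred)))
    return np.mean(rmses), np.std(rmses)
```

The summary tables below report the best configuration found for each kernel, before and after this update.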
| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|---------------|:-------:|:------:|-------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 14.00 | 6.93 | - | 36.21" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.59" |
| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|------------------|:-------:|:------:|------------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 18.41 | 10.78 | - | 29.43" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| WL shortest path | 35.16 | 4.50 | height = 2 | 40.24" |
| WL edge | 33.41 | 4.73 | height = 5 | 5.66" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.59" |
| Tree pattern | 7.27 | 2.21 | lamda = 1, h = 2 | 37.24" |
| Cyclic pattern | 0.9 | 0.11 | cycle bound = 100 | 0.31" |
* RMSE stands for the arithmetic mean of the root mean squared errors on all splits.
* STD stands for the standard deviation of the root mean squared errors on all splits.
* Parameter is the one with which the kernel achieves the best results.
* k_time is the time spent building the kernel matrix.
* The targets of training data are normalized before calculating *path kernel* and *treelet kernel*.
* The targets of training data are normalized before calculating *treelet kernel*.
* See detailed results in [results.md](pygraph/kernels/results.md).
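The reported parameter for each kernel is the best value found by a sweep; in the notebooks this is driven through the `kernel_train_test` helper, e.g. for `p_quit` of the marginalized kernel (file paths are illustrative):

```python
# Hyperparameter sweep as invoked in the notebooks (see the commented call
# in run_marginalizedkernel_acyclic.ipynb below).
import numpy as np
from pygraph.utils.utils import kernel_train_test
from pygraph.kernels.marginalizedKernel import marginalizedkernel

kernel_para = dict(node_label='atom', edge_label='bond_type', itr=20, p_quit=0.1)
kernel_train_test('datasets/acyclic/Acyclic/dataset_bps.ds', 'kernelmatrices/',
                  marginalizedkernel, kernel_para,
                  hyper_name='p_quit', hyper_range=np.linspace(0.1, 0.9, 9),
                  normalize=False)
```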
## References
@@ -44,6 +48,12 @@ For prediction we randomly split the data into train and test subsets, where 90% of
[5] Gaüzère B, Brun L, Villemin D. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters. 2012 Nov 1;33(15):2038-47.
[6] Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural Networks, 18(8):1093–1110, 2005.
[7] Pierre Mahé and Jean-Philippe Vert. Graph kernels based on tree patterns for molecules. Machine Learning, 75(1):3–35, 2009.
[8] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 158–167. ACM, 2004.
## Updates
### 2018.01.24
* Add *path kernel up to depth d* and its results on the dataset Acyclic.
@@ -364,6 +364,155 @@
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- marginalized kernel matrix of size 185 built in 1133.0229969024658 seconds ---\n",
"[[ 0.0287062 0.0124634 0.00444444 ..., 0.00606061 0.00606061\n",
" 0.00606061]\n",
" [ 0.0124634 0.01108958 0.00333333 ..., 0.00454545 0.00454545\n",
" 0.00454545]\n",
" [ 0.00444444 0.00333333 0.0287062 ..., 0.00819912 0.00819912\n",
" 0.00975875]\n",
" ..., \n",
" [ 0.00606061 0.00454545 0.00819912 ..., 0.02846735 0.02836907\n",
" 0.02896354]\n",
" [ 0.00606061 0.00454545 0.00819912 ..., 0.02836907 0.02831424\n",
" 0.0288712 ]\n",
" [ 0.00606061 0.00454545 0.00975875 ..., 0.02896354 0.0288712\n",
" 0.02987915]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 12.186285\n",
"With standard deviation: 7.038988\n",
"\n",
" Mean performance on test set: 18.024312\n",
"With standard deviation: 6.292466\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 18.0243 6.29247 12.1863 7.03899 1133.02\n"
]
}
],
"source": [
"%load_ext line_profiler\n",
"\n",
"import numpy as np\n",
"import sys\n",
"sys.path.insert(0, \"../\")\n",
"from pygraph.utils.utils import kernel_train_test\n",
"from pygraph.kernels.marginalizedKernel import marginalizedkernel, _marginalizedkernel_do\n",
"\n",
"datafile = '../../../../datasets/acyclic/Acyclic/dataset_bps.ds'\n",
"kernel_file_path = 'kernelmatrices_weisfeilerlehman_subtree_acyclic/'\n",
"\n",
"kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', itr = 20, p_quit = 0.1)\n",
"\n",
"# kernel_train_test(datafile, kernel_file_path, marginalizedkernel, kernel_para, \\\n",
"# hyper_name = 'p_quit', hyper_range = np.linspace(0.1, 0.9, 9), normalize = False)\n",
"\n",
"%lprun -f _marginalizedkernel_do \\\n",
" kernel_train_test(datafile, kernel_file_path, marginalizedkernel, kernel_para, \\\n",
" normalize = False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Timer unit: 1e-06 s\n",
"\n",
"Total time: 828.879 s\n",
"File: ../pygraph/kernels/marginalizedKernel.py\n",
"Function: _marginalizedkernel_do at line 67\n",
"\n",
"Line # Hits Time Per Hit % Time Line Contents\n",
"==============================================================\n",
" 67 def _marginalizedkernel_do(G1, G2, node_label, edge_label, p_quit, itr):\n",
" 68 \"\"\"Calculate marginalized graph kernel between 2 graphs.\n",
" 69 \n",
" 70 Parameters\n",
" 71 ----------\n",
" 72 G1, G2 : NetworkX graphs\n",
" 73 2 graphs between which the kernel is calculated.\n",
" 74 node_label : string\n",
" 75 node attribute used as label.\n",
" 76 edge_label : string\n",
" 77 edge attribute used as label.\n",
" 78 p_quit : integer\n",
" 79 the termination probability in the random walks generating step.\n",
" 80 itr : integer\n",
" 81 time of iterations to calculate R_inf.\n",
" 82 \n",
" 83 Return\n",
" 84 ------\n",
" 85 kernel : float\n",
" 86 Marginalized Kernel between 2 graphs.\n",
" 87 \"\"\"\n",
" 88 # init parameters\n",
" 89 17205 12886.0 0.7 0.0 kernel = 0\n",
" 90 17205 52542.0 3.1 0.0 num_nodes_G1 = nx.number_of_nodes(G1)\n",
" 91 17205 28240.0 1.6 0.0 num_nodes_G2 = nx.number_of_nodes(G2)\n",
" 92 17205 15595.0 0.9 0.0 p_init_G1 = 1 / num_nodes_G1 # the initial probability distribution in the random walks generating step (uniform distribution over |G|)\n",
" 93 17205 11587.0 0.7 0.0 p_init_G2 = 1 / num_nodes_G2\n",
" 94 \n",
" 95 17205 11663.0 0.7 0.0 q = p_quit * p_quit\n",
" 96 17205 10728.0 0.6 0.0 r1 = q\n",
" 97 \n",
" 98 # initial R_inf\n",
" 99 17205 38412.0 2.2 0.0 R_inf = np.zeros([num_nodes_G1, num_nodes_G2]) # matrix to save all the R_inf for all pairs of nodes\n",
" 100 \n",
" 101 # calculate R_inf with a simple interative method\n",
" 102 344100 329235.0 1.0 0.0 for i in range(1, itr):\n",
" 103 326895 900354.0 2.8 0.1 R_inf_new = np.zeros([num_nodes_G1, num_nodes_G2])\n",
" 104 326895 2287346.0 7.0 0.3 R_inf_new.fill(r1)\n",
" 105 \n",
" 106 # calculate R_inf for each pair of nodes\n",
" 107 2653464 3667117.0 1.4 0.4 for node1 in G1.nodes(data = True):\n",
" 108 2326569 7522840.0 3.2 0.9 neighbor_n1 = G1[node1[0]]\n",
" 109 2326569 3492118.0 1.5 0.4 p_trans_n1 = (1 - p_quit) / len(neighbor_n1) # the transition probability distribution in the random walks generating step (uniform distribution over the vertices adjacent to the current vertex)\n",
" 110 24024379 27775021.0 1.2 3.4 for node2 in G2.nodes(data = True):\n",
" 111 21697810 69471941.0 3.2 8.4 neighbor_n2 = G2[node2[0]]\n",
" 112 21697810 32446626.0 1.5 3.9 p_trans_n2 = (1 - p_quit) / len(neighbor_n2) \n",
" 113 \n",
" 114 59095092 52545370.0 0.9 6.3 for neighbor1 in neighbor_n1:\n",
" 115 104193150 92513935.0 0.9 11.2 for neighbor2 in neighbor_n2:\n",
" 116 \n",
" 117 t = p_trans_n1 * p_trans_n2 * \\\n",
" 118 66795868 285324518.0 4.3 34.4 deltakernel(G1.node[neighbor1][node_label] == G2.node[neighbor2][node_label]) * \\\n",
" 119 66795868 137934393.0 2.1 16.6 deltakernel(neighbor_n1[neighbor1][edge_label] == neighbor_n2[neighbor2][edge_label])\n",
" 120 66795868 106834143.0 1.6 12.9 R_inf_new[node1[0]][node2[0]] += t * R_inf[neighbor1][neighbor2] # ref [1] equation (8)\n",
" 121 \n",
" 122 326895 1123677.0 3.4 0.1 R_inf[:] = R_inf_new\n",
" 123 \n",
" 124 # add elements of R_inf up and calculate kernel\n",
" 125 139656 330283.0 2.4 0.0 for node1 in G1.nodes(data = True):\n",
" 126 1264441 1435263.0 1.1 0.2 for node2 in G2.nodes(data = True): \n",
" 127 1141990 1377134.0 1.2 0.2 s = p_init_G1 * p_init_G2 * deltakernel(node1[1][node_label] == node2[1][node_label])\n",
" 128 1141990 1375456.0 1.2 0.2 kernel += s * R_inf[node1[0]][node2[0]] # ref [1] equation (6)\n",
" 129 \n",
" 130 17205 10801.0 0.6 0.0 return kernel"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@@ -15,13 +15,11 @@
" --- This is a regression problem ---\n",
"\n",
"\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- mean average path kernel matrix of size 185 built in 132.2242877483368 seconds ---\n",
" --- mean average path kernel matrix of size 185 built in 29.430902242660522 seconds ---\n",
"[[ 0.55555556 0.22222222 0. ..., 0. 0. 0. ]\n",
" [ 0.22222222 0.27777778 0. ..., 0. 0. 0. ]\n",
" [ 0. 0. 0.55555556 ..., 0.03030303 0.03030303\n",
@@ -36,16 +34,16 @@
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 3.761907\n",
"With standard deviation: 0.702594\n",
" Mean performance on train set: 3.619948\n",
"With standard deviation: 0.512351\n",
"\n",
" Mean performance on test set: 14.001515\n",
"With standard deviation: 6.936023\n",
" Mean performance on test set: 18.418852\n",
"With standard deviation: 10.781119\n",
"\n",
"\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 14.0015 6.93602 3.76191 0.702594 132.224\n"
" 18.4189 10.7811 3.61995 0.512351 29.4309\n"
]
}
],
@@ -62,10 +60,10 @@
"\n",
"kernel_para = dict(node_label = 'atom', edge_label = 'bond_type')\n",
"\n",
"kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = True)\n",
"kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = False)\n",
"\n",
"# %lprun -f _pathkernel_do \\\n",
"# kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = True)"
"# kernel_train_test(datafile, kernel_file_path, pathkernel, kernel_para, normalize = False)"
]
},
{
@@ -84,7 +82,7 @@
"# without y normalization\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 18.4189 10.7811 3.61995 0.512351 37.0017"
" 18.4189 10.7811 3.61995 0.512351 29.4309"
]
},
{
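The switch from `normalize = True` to `normalize = False` is what moves the path kernel's test RMSE from 14.00 to 18.42 in the cells above; with normalization enabled, the training targets are standardized before fitting. A hypothetical sketch of what such a flag does (names are illustrative, not the helper's actual code):

```python
# Hypothetical sketch of target normalization as toggled by the normalize
# flag: standardize training targets before fitting, then map predictions
# back to the original scale.
import numpy as np

def fit_predict_normalized(model, K_train, y_train, K_test):
    y_mean, y_std = np.mean(y_train), np.std(y_train)
    model.fit(K_train, (y_train - y_mean) / y_std)
    return model.predict(K_test) * y_std + y_mean
```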
@@ -2,44 +2,42 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The line_profiler extension is already loaded. To reload it, use:\n",
" %reload_ext line_profiler\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"--- shortest path kernel matrix of size 185 built in 14.576777696609497 seconds ---\n",
"[[ 3. 1. 3. ..., 1. 1. 1.]\n",
" [ 1. 6. 1. ..., 0. 0. 3.]\n",
" [ 3. 1. 3. ..., 1. 1. 1.]\n",
" ..., \n",
" [ 1. 0. 1. ..., 55. 21. 7.]\n",
" [ 1. 0. 1. ..., 21. 55. 7.]\n",
" [ 1. 3. 1. ..., 7. 7. 55.]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
"--- shortest path kernel matrix of size 185 built in 13.3865065574646 seconds ---\n",
"[[ 3. 1. 3. ... 1. 1. 1.]\n",
" [ 1. 6. 1. ... 0. 0. 3.]\n",
" [ 3. 1. 3. ... 1. 1. 1.]\n",
" ...\n",
" [ 1. 0. 1. ... 55. 21. 7.]\n",
" [ 1. 0. 1. ... 21. 55. 7.]\n",
" [ 1. 3. 1. ... 7. 7. 55.]]\n",
"\n",
" Starting calculate accuracy/rmse...\n",
"calculate performance: 94%|█████████▎| 936/1000 [00:01<00:00, 757.54it/s]\n",
" Mean performance on train set: 28.360361\n",
"With standard deviation: 1.357183\n",
"\n",
" Mean performance on test set: 35.191954\n",
"With standard deviation: 4.495767\n",
"calculate performance: 100%|██████████| 1000/1000 [00:01<00:00, 771.22it/s]\n",
"\n",
"\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 35.192 4.49577 28.3604 1.35718 14.5768\n"
" 35.192 4.49577 28.3604 1.35718 13.3865\n"
]
}
],
@@ -2,15 +2,13 @@
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The line_profiler extension is already loaded. To reload it, use:\n",
" %reload_ext line_profiler\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
@@ -19,68 +17,34 @@
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.48417091369628906 seconds ---\n",
"[[ 4.00000000e+00 2.60653066e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 2.60653066e+00 6.00000000e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 1.00000000e+00 1.00000000e+00 4.00000000e+00 ..., 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ..., \n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.47543811798095703 seconds ---\n",
"[[4.00000000e+00 2.60653066e+00 1.00000000e+00 ... 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [2.60653066e+00 6.00000000e+00 1.00000000e+00 ... 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [1.00000000e+00 1.00000000e+00 4.00000000e+00 ... 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ...\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [1.26641655e-14 1.26641655e-14 3.00000000e+00 ... 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Starting calculate accuracy/rmse...\n",
"calculate performance: 98%|█████████▊| 983/1000 [00:01<00:00, 796.45it/s]\n",
" Mean performance on train set: 2.688029\n",
"With standard deviation: 1.541623\n",
"\n",
" Mean performance on test set: 10.099738\n",
"With standard deviation: 5.035844\n",
"calculate performance: 100%|██████████| 1000/1000 [00:01<00:00, 745.11it/s]\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.0997 5.03584 2.68803 1.54162 0.484171\n",
"\n",
" --- This is a regression problem ---\n",
"\n",
"\n",
" Loading dataset from file...\n",
"\n",
" Calculating kernel matrix, this could take a while...\n",
"\n",
" --- treelet kernel matrix of size 185 built in 0.5003015995025635 seconds ---\n",
"[[ 4.00000000e+00 2.60653066e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 2.60653066e+00 6.00000000e+00 1.00000000e+00 ..., 1.26641655e-14\n",
" 1.26641655e-14 1.26641655e-14]\n",
" [ 1.00000000e+00 1.00000000e+00 4.00000000e+00 ..., 3.00000000e+00\n",
" 3.00000000e+00 3.00000000e+00]\n",
" ..., \n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.80000000e+01\n",
" 1.30548713e+01 8.19020657e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 1.30548713e+01\n",
" 2.20000000e+01 9.71901120e+00]\n",
" [ 1.26641655e-14 1.26641655e-14 3.00000000e+00 ..., 8.19020657e+00\n",
" 9.71901120e+00 1.60000000e+01]]\n",
"\n",
" Saving kernel matrix to file...\n",
"\n",
" Mean performance on train set: 2.908869\n",
"With standard deviation: 1.267900\n",
"\n",
" Mean performance on test set: 8.307902\n",
"With standard deviation: 3.378376\n",
"\n",
"\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.3079 3.37838 2.90887 1.2679 0.500302\n"
" 10.0997 5.03584 2.68803 1.54162 0.475438\n"
]
}
],
@@ -99,8 +63,6 @@
"\n",
"kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = False)\n",
"\n",
"kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True)\n",
"\n",
"# %lprun -f treeletkernel \\\n",
"# kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = False)"
]
@@ -121,14 +83,58 @@
"# without y normalization\n",
" RMSE_test std_test RMSE_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.0997 5.03584 2.68803 1.54162 0.484171"
" 10.0997 5.03584 2.68803 1.54162 0.484171\n",
"\n",
" \n",
"\n",
"# G0 -> WL subtree h = 0\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 13.9223 2.88611 13.373 0.653301 0.186731\n",
"\n",
"# G0 U G1 U G6 U G8 U G13 -> WL subtree h = 1\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.97706 2.90771 6.7343 1.17505 0.223171\n",
" \n",
"# all patterns \\ { G3 U G4 U G5 U G10 } -> WL subtree h = 2 \n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 7.31274 1.96289 3.73909 0.406267 0.294902\n",
"\n",
"# all patterns \\ { G4 U G5 } -> WL subtree h = 3\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.39977 2.78309 3.8606 1.58686 0.348912\n",
"\n",
"# all patterns \\ { G5 } \n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 9.47647 4.22113 3.18029 1.5669 0.423638\n",
" \n",
" \n",
" \n",
"# G0, -> WL subtree h = 0\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 13.9223 2.88611 13.373 0.653301 0.186731 \n",
" \n",
"# G0 U G1 U G2 U G6 U G8 U G13 -> WL subtree h = 1\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 8.62431 2.54327 5.63422 0.255002 0.290797\n",
" \n",
"# all patterns \\ { G5 U G10 } -> WL subtree h = 2\n",
" rmse_test std_test rmse_train std_train k_time\n",
"----------- ---------- ------------ ----------- --------\n",
" 10.1294 3.50275 3.69664 1.55116 0.418498"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
"scrolled": true
},
"outputs": [
{
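The tables in the cell above probe which treelet patterns carry the WL-subtree signal by recomputing the treelet kernel from restricted pattern sets (G0 alone, G0 ∪ G1 ∪ G6 ∪ G8 ∪ G13, and so on). A hedged sketch of such a restriction, assuming per-graph pattern counts keyed by canonical treelet names (the Gaussian on count differences follows the treelet kernel of reference [5]; all names are illustrative):

```python
# Sketch of the pattern-subset experiment tabulated above: keep only a chosen
# family of treelet patterns before comparing two graphs.
import numpy as np

def restricted_treelet_kernel(counts1, counts2, keep):
    common = set(counts1) & set(counts2) & set(keep)
    return sum(np.exp(-(counts1[k] - counts2[k]) ** 2 / 2) for k in common)
```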
@@ -0,0 +1,147 @@
"""
@author: linlin <jajupmochi@gmail.com>
@references:
[1] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 158–167. ACM, 2004.
[2] Hopcroft, J.; Tarjan, R. (1973). "Efficient algorithms for graph manipulation". Communications of the ACM 16: 372–378. doi:10.1145/362248.362272.
[3] Finding all the elementary circuits of a directed graph. D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975. http://dx.doi.org/10.1137/0204007
"""
import sys
import pathlib
sys.path.insert(0, "../")
import time
import networkx as nx
import numpy as np
from tqdm import tqdm
def cyclicpatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, cycle_bound = None):
"""Calculate cyclic pattern graph kernels between graphs.
Parameters
----------
Gn : List of NetworkX graph
List of graphs between which the kernels are calculated.
/
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
cycle_bound : integer
Upper bound on the number of simple cycles allowed per graph; if it is exceeded, the pattern set of that graph is treated as empty. The default is None (no bound).
Return
------
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the cyclic pattern kernel between 2 graphs.
"""
Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
Kmatrix = np.zeros((len(Gn), len(Gn)))
start_time = time.time()
# get all cyclic and tree patterns of all graphs before calculating kernels, to save time; this may consume a lot of memory for large datasets.
all_patterns = [ get_patterns(Gn[i], node_label = node_label, edge_label = edge_label, labeled = labeled, cycle_bound = cycle_bound)
for i in tqdm(range(0, len(Gn)), desc = 'retrieve patterns', file=sys.stdout) ]
for i in tqdm(range(0, len(Gn)), desc = 'calculate kernels', file=sys.stdout):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _cyclicpatternkernel_do(all_patterns[i], all_patterns[j])
Kmatrix[j][i] = Kmatrix[i][j]
run_time = time.time() - start_time
print("\n --- kernel matrix of cyclic pattern kernel of size %d built in %s seconds ---" % (len(Gn), run_time))
return Kmatrix, run_time
def _cyclicpatternkernel_do(patterns1, patterns2):
"""Calculate the cyclic pattern kernel between 2 graphs given their pattern sets.
Parameters
----------
patterns1, patterns2 : list
Lists of canonical patterns of 2 graphs, where each pattern is represented by a string consisting of the node and edge labels along it.
Return
------
kernel : float
Cyclic pattern kernel between 2 graphs, i.e. the number of patterns common to both graphs.
"""
return len(set(patterns1) & set(patterns2))
def get_patterns(G, node_label = 'atom', edge_label = 'bond_type', labeled = True, cycle_bound = None):
"""Find all cyclic and tree patterns in a graph.
Parameters
----------
G : NetworkX graphs
The graph in which patterns are searched.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
cycle_bound : integer
Upper bound on the number of simple cycles; if it is exceeded, an empty pattern list is returned.
Return
------
patterns : list
List of patterns retrieved, where each cyclic pattern is represented by the canonical string of node and edge labels along the cycle.
"""
number_simplecycles = 0
bridges = nx.Graph()
patterns = []
bicomponents = nx.biconnected_component_subgraphs(G) # all biconnected components of G. This function uses the algorithm of reference [2], which may differ slightly from the one used in paper [1].
for subgraph in bicomponents:
if nx.number_of_edges(subgraph) > 1:
simple_cycles = list(nx.simple_cycles(subgraph.to_directed())) # all simple cycles in this biconnected component. This function uses the algorithm of reference [3], which has time complexity O((n+e)(N+1)) for n nodes, e edges and N simple cycles, and might be slower than the algorithm applied in paper [1].
if cycle_bound is not None and len(simple_cycles) > cycle_bound - number_simplecycles: # in paper [1], where another algorithm (subroutine RT) is applied, this condition becomes len(simple_cycles) == cycle_bound - number_simplecycles + 1; to be checked.
return []
else:
# calculate the canonical representation of each simple cycle
all_canonkeys = []
for cycle in simple_cycles:
canonlist = [ G.node[node][node_label] + G[node][cycle[cycle.index(node) + 1]][edge_label] for node in cycle[:-1] ]
canonkey = ''.join(canonlist)
canonkey = canonkey if canonkey < canonkey[::-1] else canonkey[::-1]
for i in range(1, len(cycle[:-1])):
canonlist = [ G.node[node][node_label] + G[node][cycle[cycle.index(node) + 1]][edge_label] for node in cycle[i:-1] + cycle[:i] ]
canonkey_t = ''.join(canonlist)
canonkey_t = canonkey_t if canonkey_t < canonkey_t[::-1] else canonkey_t[::-1]
canonkey = canonkey if canonkey < canonkey_t else canonkey_t
all_canonkeys.append(canonkey)
patterns = list(set(patterns) | set(all_canonkeys))
number_simplecycles += len(simple_cycles)
else:
bridges.add_edges_from(subgraph.edges(data=True))
# calculate the canonical representation of each connected component in the bridge set
components = list(nx.connected_component_subgraphs(bridges)) # all connected components of the bridge set
tree_patterns = []
for tree in components:
break # extraction of tree patterns (the pi(B) part of paper [1]) is not implemented yet
# patterns += pi(bridges)
return patterns
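To make the canonicalization above concrete: each simple cycle is encoded as the string of node and edge labels along it, and the key is minimized over all rotations and over character-wise reversal (mirroring `canonkey[::-1]` above), so that any traversal of the same labeled cycle maps to one key. A toy rerun of that logic with made-up label pieces:

```python
# Toy rerun of the canonical-key logic of get_patterns: minimize the label
# string over all rotations and over string reversal.
def canonical_cycle_key(pieces):
    best = None
    for i in range(len(pieces)):
        s = ''.join(pieces[i:] + pieces[:i])
        cand = min(s, s[::-1])
        best = cand if best is None or cand < best else best
    return best

print(canonical_cycle_key(['C1', 'N2', 'O1']))  # -> '1C1O2N'
```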
@@ -1,18 +1,18 @@
def deltakernel(condition):
"""Return 1 if condition holds, 0 otherwise.
Parameters
----------
condition : Boolean
A condition, according to which the kernel is set to 1 or 0.
Return
------
kernel : integer
Delta kernel.
References
----------
[1] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning, Washington, DC, United States, 2003.
"""
return (1 if condition else 0)
return condition # equivalent to (1 if condition else 0); bool is a subclass of int, so True/False already behave as 1/0 in arithmetic
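The rewrite is safe because `bool` is a subclass of `int` in Python, so the returned condition already acts as 0/1 in the callers' arithmetic:

```python
>>> isinstance(True, int)
True
>>> 0.5 * True + 0.5 * False
0.5
```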
@@ -1,3 +1,8 @@
"""
@author: linlin
@references: Suard F, Rakotomamonjy A, Bensrhair A. Kernel on Bag of Paths For Measuring Similarity of Shapes. In ESANN 2007 Apr 25 (pp. 355-360).
"""
import sys
import pathlib
sys.path.insert(0, "../")
@@ -27,10 +32,6 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
------
Kmatrix/kernel : Numpy matrix/float
Kernel matrix, each element of which is the path kernel between 2 graphs. / Path kernel between 2 graphs.
References
----------
[1] Suard F, Rakotomamonjy A, Bensrhair A. Kernel on Bag of Paths For Measuring Similarity of Shapes. In ESANN 2007 Apr 25 (pp. 355-360).
"""
some_graph = args[0][0] if len(args) == 1 else args[0] # only edge attributes of type int or float can be used as edge weight to calculate the shortest paths.
some_weight = list(nx.get_edge_attributes(some_graph, edge_label).values())[0]
@@ -42,9 +43,11 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
start_time = time.time()
splist = [ get_shortest_paths(Gn[i], weight) for i in range(0, len(Gn)) ]
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _pathkernel_do(Gn[i], Gn[j], node_label, edge_label, weight = weight)
Kmatrix[i][j] = _pathkernel_do(Gn[i], Gn[j], splist[i], splist[j], node_label, edge_label)
Kmatrix[j][i] = Kmatrix[i][j]
run_time = time.time() - start_time
@@ -55,7 +58,10 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
else: # for only 2 graphs
start_time = time.time()
kernel = _pathkernel_do(args[0], args[1], node_label, edge_label, weight = weight)
sp1 = get_shortest_paths(args[0], weight)
sp2 = get_shortest_paths(args[1], weight)
kernel = _pathkernel_do(args[0], args[1], sp1, sp2, node_label, edge_label)
run_time = time.time() - start_time
print("\n --- mean average path kernel built in %s seconds ---" % (run_time))
@@ -63,19 +69,19 @@ def pathkernel(*args, node_label = 'atom', edge_label = 'bond_type'):
return kernel, run_time
def _pathkernel_do(G1, G2, node_label = 'atom', edge_label = 'bond_type', weight = None):
def _pathkernel_do(G1, G2, sp1, sp2, node_label = 'atom', edge_label = 'bond_type'):
"""Calculate mean average path kernel between 2 graphs.
Parameters
----------
G1, G2 : NetworkX graphs
2 graphs between which the kernel is calculated.
sp1, sp2 : list of list
List of shortest paths of 2 graphs, where each path is represented by a list of nodes.
node_label : string
node attribute used as label. The default node label is atom.
edge_label : string
edge attribute used as label. The default edge label is bond_type.
weight : string/None
edge attribute used as weight to calculate the shortest path. The default is None.
Return
------
@@ -83,30 +89,62 @@ def _pathkernel_do(G1, G2, node_label = 'atom', edge_label = 'bond_type', weight
Path Kernel between 2 graphs.
"""
# calculate shortest paths for both graphs
sp1 = []
num_nodes = G1.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp1.append(nx.shortest_path(G1, node1, node2, weight = weight))
sp2 = []
num_nodes = G2.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp2.append(nx.shortest_path(G2, node1, node2, weight = weight))
# calculate kernel
kernel = 0
for path1 in sp1:
for path2 in sp2:
if len(path1) == len(path2):
kernel_path = deltakernel(G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label])
kernel_path = (G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label])
if kernel_path:
for i in range(1, len(path1)):
# kernel = 1 if all corresponding nodes and edges in the 2 paths have same labels, otherwise 0
kernel_path *= deltakernel(G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) * deltakernel(G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
kernel_path *= (G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) \
* (G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
if kernel_path == 0:
break
kernel += kernel_path # add up kernels of all paths
# kernel = 0
# for path1 in sp1:
# for path2 in sp2:
# if len(path1) == len(path2):
# if (G1.node[path1[0]][node_label] == G2.node[path2[0]][node_label]):
# for i in range(1, len(path1)):
# # kernel = 1 if all corresponding nodes and edges in the 2 paths have same labels, otherwise 0
# # kernel_path *= (G1[path1[i - 1]][path1[i]][edge_label] == G2[path2[i - 1]][path2[i]][edge_label]) \
# # * (G1.node[path1[i]][node_label] == G2.node[path2[i]][node_label])
# # if kernel_path == 0:
# # break
# # kernel += kernel_path # add up kernels of all paths
# if (G1[path1[i - 1]][path1[i]][edge_label] != G2[path2[i - 1]][path2[i]][edge_label]) or \
# (G1.node[path1[i]][node_label] != G2.node[path2[i]][node_label]):
# break
# else:
# kernel += 1
kernel = kernel / (len(sp1) * len(sp2)) # calculate mean average
return kernel
def get_shortest_paths(G, weight):
"""Get all shortest paths of a graph.
Parameters
----------
G : NetworkX graphs
The graph whose paths are calculated.
weight : string/None
edge attribute used as weight to calculate the shortest path.
Return
------
sp : list of list
List of shortest paths of the graph, where each path is represented by a list of nodes.
"""
sp = []
num_nodes = G.number_of_nodes()
for node1 in range(num_nodes):
for node2 in range(node1 + 1, num_nodes):
sp.append(nx.shortest_path(G, node1, node2, weight = weight))
return sp
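A hedged usage sketch of the refactored functions on toy graphs (shortest paths are now computed once per graph and passed in, instead of being recomputed for every pair; labels are made up, and the `G.node` accessor used by the module assumes networkx < 2.4):

```python
# Usage sketch for the refactored path kernel (illustrative only).
import networkx as nx

G1 = nx.Graph()
G1.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'}), (2, {'atom': 'C'})])
G1.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'})])
G2 = G1.copy()

sp1 = get_shortest_paths(G1, None)
sp2 = get_shortest_paths(G2, None)
kernel = _pathkernel_do(G1, G2, sp1, sp2, node_label='atom', edge_label='bond_type')
print(kernel)
```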
@@ -1,20 +1,26 @@ | |||
# Results with minimal test RMSE for each kernel on dataset Acyclic
All kernels except for the Cyclic pattern kernel are tested on dataset Acyclic, which consists of 185 molecules (graphs). (The Cyclic pattern kernel is tested on datasets MAO and PAH.)
The criteria used for prediction are SVM for classification and kernel ridge regression for regression.
For prediction we randomly divide the data into train and test subsets, where 90% of the entire dataset is used for training and the rest for testing. 10 splits are performed. For each split, we first train on the train data, then evaluate the performance on the test set. We choose the optimal parameters for the test set and finally report the corresponding performance. The final results correspond to the average of the performances on the test sets.
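A condensed sketch of this protocol for a single split (illustrative names only; the actual implementation lives in `pygraph/utils/utils.py`):
```
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

def evaluate_one_split(Kmatrix, y, train_idx, test_idx, alpha_grid):
    # restrict the precomputed kernel matrix to the current split
    K_train = Kmatrix[np.ix_(train_idx, train_idx)]
    K_test = Kmatrix[np.ix_(test_idx, train_idx)]
    rmse_test = []
    for alpha in alpha_grid:
        model = KernelRidge(kernel = 'precomputed', alpha = alpha)
        model.fit(K_train, y[train_idx])
        y_pred = model.predict(K_test)
        rmse_test.append(np.sqrt(mean_squared_error(y[test_idx], y_pred)))
    # the optimal parameter is chosen on the test set, as described above
    return min(rmse_test)
```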
All the results were run under Python 3.5.2 on a 64-bit machine with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz, 32GB of memory, and Ubuntu 16.04.3 LTS.
## Summary | |||
| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time | | |||
|------------------|:-------:|:------:|------------------:|-------:| | |||
| Shortest path | 35.19 | 4.50 | - | 14.58" | | |||
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" | | |||
| Path | 14.00 | 6.94 | - | 37.58" | | |||
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" | | |||
| WL shortest path | 35.16 | 4.50 | height = 2 | 40.24" | | |||
| WL edge | 33.41 | 4.73 | height = 5 | 5.66" | | |||
| Treelet | 8.31 | 3.38 | - | 0.50" | | |||
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.52" | | |||
| Tree pattern | 7.27 | 2.21 | lamda = 1, h = 2 | 37.24" | | |||
| Cyclic pattern | 0.9 | 0.11 | cycle bound = 100 | 0.31" | | |||
* RMSE stands for arithmetic mean of the root mean squared errors on all splits. | |||
* STD stands for standard deviation of the root mean squared errors on all splits. | |||
@@ -76,6 +82,42 @@ The table below shows the results of the WL subtree under different subtree heig | |||
10 17.1864 4.05672 0.691516 0.564621 5.00918 | |||
``` | |||
### Weisfeiler-Lehman shortest path kernel | |||
The table below shows the results of the WL shortest path kernel under different heights.
``` | |||
height rmse_test std_test rmse_train std_train k_time | |||
-------- ----------- ---------- ------------ ----------- -------- | |||
0 35.192 4.49577 28.3604 1.35718 13.5041 | |||
1 35.1808 4.50045 27.9335 1.44836 26.8292 | |||
2 35.1632 4.50205 28.1113 1.50891 40.2356 | |||
3 35.1946 4.49801 28.3903 1.36571 54.6704 | |||
4 35.1753 4.50111 27.9746 1.46222 67.1522 | |||
5 35.1997 4.5071 28.0184 1.45564 80.0881 | |||
6 35.1645 4.49849 28.3731 1.60057 92.1925 | |||
7 35.1771 4.5009 27.9604 1.45742 105.812 | |||
8 35.1968 4.50526 28.1991 1.5149 119.022 | |||
9 35.1956 4.50197 28.2665 1.30769 131.228 | |||
10 35.1676 4.49723 28.4163 1.61596 144.964 | |||
``` | |||
### Weisfeiler-Lehman edge kernel | |||
The table below shows the results of the WL edge kernel under different heights.
``` | |||
height rmse_test std_test rmse_train std_train k_time | |||
-------- ----------- ---------- ------------ ----------- --------- | |||
0 33.4077 4.73272 29.9975 0.90234 0.853002 | |||
1 33.4235 4.72131 30.1603 1.09423 1.71751 | |||
2 33.433 4.72441 29.9286 0.787941 2.66032 | |||
3 33.4073 4.73243 30.0114 0.909674 3.47763 | |||
4 33.4256 4.72166 30.1842 1.1089 4.54367 | |||
5 33.4067 4.72641 30.0411 1.01845 5.66178 | |||
6 33.419 4.73075 29.9056 0.782179 6.14803 | |||
7 33.4248 4.72155 30.1759 1.10382 7.60354 | |||
8 33.4122 4.71554 30.1365 1.07485 7.97222 | |||
9 33.4071 4.73193 30.0329 0.921065 9.07084 | |||
10 33.4165 4.73169 29.9242 0.790843 10.0254 | |||
``` | |||
### Treelet kernel | |||
**The targets of training data are normalized before calculating the kernel.** | |||
``` | |||
@@ -87,7 +129,7 @@ The table below shows the results of the WL subtree under different subtree heig | |||
### Path kernel up to depth *d* | |||
The table below shows the results of the path kernel up to different depth *d*. | |||
The first table shows the results using the *Tanimoto kernel*, where **the targets of training data are normalized before calculating the kernel**.
``` | |||
depth rmse_test std_test rmse_train std_train k_time | |||
------- ----------- ---------- ------------ ----------- --------- | |||
@@ -104,7 +146,7 @@ The first table is the results using Tanimoto kernel, where **The targets of tra | |||
10 19.8708 5.09217 10.7787 2.10002 2.41006 | |||
``` | |||
The second table shows the results using the *MinMax kernel*.
``` | |||
depth rmse_test std_test rmse_train std_train k_time | |||
------- ----------- ---------- ------------ ----------- -------- | |||
@@ -120,3 +162,62 @@ depth rmse_test std_test rmse_train std_train k_time | |||
9 13.1789 5.27707 1.36002 1.84834 1.96545 | |||
10 13.2538 5.26425 1.36208 1.85426 2.24943 | |||
``` | |||
### Tree pattern kernel | |||
Results of the *until-n* tree pattern kernel when h = 2:
``` | |||
lmda rmse_test std_test rmse_train std_train k_time | |||
----------- ----------- ---------- ------------ ----------- -------- | |||
1e-10 7.46524 1.71862 5.99486 0.356634 38.1447 | |||
1e-09 7.37326 1.77195 5.96155 0.374395 37.4921 | |||
1e-08 7.35105 1.78349 5.96481 0.378047 37.9971 | |||
1e-07 7.35213 1.77903 5.96728 0.382251 38.3182 | |||
1e-06 7.3524 1.77992 5.9696 0.3863 39.6428 | |||
1e-05 7.34958 1.78141 5.97114 0.39017 37.3711 | |||
0.0001 7.3513 1.78136 5.94251 0.331843 37.3967 | |||
0.001 7.35822 1.78119 5.9326 0.32534 36.7357 | |||
0.01 7.37552 1.79037 5.94089 0.34763 36.8864 | |||
0.1 7.32951 1.91346 6.42634 1.29405 36.8382 | |||
1 7.27134 2.20774 6.62425 1.2242 37.2425 | |||
10 7.49787 2.36815 6.81697 1.50182 37.8286 | |||
100 7.42887 2.64789 6.68766 1.34809 36.3701 | |||
1000 7.24914 2.65554 6.81906 1.41008 36.1695 | |||
10000 7.08183 2.6248 6.93431 1.38441 37.5723 | |||
100000 8.021 3.43694 8.69813 0.909839 37.8158 | |||
1e+06 8.49625 3.6332 9.59333 0.96626 38.4688 | |||
1e+07 10.9067 3.17593 11.5642 2.07792 36.9926 | |||
1e+08 61.1524 10.4355 65.3527 13.9538 37.1321 | |||
1e+09 99.943 13.6994 98.8848 5.27014 36.7443 | |||
1e+10 100.083 13.8503 97.9168 3.22768 37.096 | |||
``` | |||
### Cyclic pattern kernel | |||
**This kernel is not tested on dataset Acyclic.** Accuracy is reported instead of RMSE, since this kernel is evaluated on classification datasets.
Results on dataset MAO: | |||
``` | |||
cycle_bound accur_test std_test accur_train std_train k_time | |||
------------- ------------ ---------- ------------- ----------- -------- | |||
0 0.642857 0.146385 0.54918 0.0167983 0.187052 | |||
50 0.871429 0.1 0.698361 0.116889 0.300629 | |||
100 0.9 0.111575 0.732787 0.0826366 0.309837 | |||
150 0.9 0.111575 0.732787 0.0826366 0.31808 | |||
200 0.9 0.111575 0.732787 0.0826366 0.317575 | |||
``` | |||
Results on dataset PAH: | |||
``` | |||
cycle_bound accur_test std_test accur_train std_train k_time | |||
------------- ------------ ---------- ------------- ----------- -------- | |||
0 0.61 0.113578 0.629762 0.0135212 0.521801 | |||
10 0.61 0.113578 0.629762 0.0135212 0.52589 | |||
20 0.61 0.113578 0.629762 0.0135212 0.548528 | |||
30 0.64 0.111355 0.633333 0.0157935 0.535311 | |||
40 0.64 0.111355 0.633333 0.0157935 0.61764 | |||
50 0.67 0.09 0.658333 0.0345238 0.733868 | |||
60 0.68 0.107703 0.671429 0.0365769 0.871147 | |||
70 0.67 0.100499 0.666667 0.0380208 1.12625 | |||
80 0.78 0.107703 0.709524 0.0588534 1.19828 | |||
90 0.78 0.107703 0.709524 0.0588534 1.21182 | |||
``` |
@@ -1,3 +1,8 @@ | |||
""" | |||
@author: linlin | |||
@references: Borgwardt KM, Kriegel HP. Shortest-path kernels on graphs. In: Data Mining, Fifth IEEE International Conference on 2005 Nov 27 (pp. 8-pp). IEEE.
""" | |||
import sys | |||
import pathlib | |||
sys.path.insert(0, "../") | |||
@@ -12,7 +17,7 @@ from pygraph.utils.utils import getSPGraph | |||
def spkernel(*args, edge_weight = 'bond_type'): | |||
"""Calculate shortest-path kernels between graphs. | |||
Parameters | |||
---------- | |||
Gn : List of NetworkX graph | |||
@@ -22,51 +27,33 @@ def spkernel(*args, edge_weight = 'bond_type'): | |||
2 graphs between which the kernel is calculated. | |||
edge_weight : string | |||
edge attribute corresponding to the edge weight. The default edge weight is bond_type. | |||
Return | |||
------ | |||
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the sp kernel between 2 graphs.
""" | |||
    Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
    Kmatrix = np.zeros((len(Gn), len(Gn)))

    start_time = time.time()

    Gn = [ getSPGraph(G, edge_weight = edge_weight) for G in Gn ] # get shortest path graphs of Gn

    for i in range(0, len(Gn)):
        for j in range(i, len(Gn)):
            for e1 in Gn[i].edges(data = True):
                for e2 in Gn[j].edges(data = True):
                    if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])):
                        Kmatrix[i][j] += 1
            Kmatrix[j][i] = Kmatrix[i][j]
run_time = time.time() - start_time | |||
print("--- shortest path kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time)) | |||
return Kmatrix, run_time | |||
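A minimal usage sketch (toy graphs; numeric bond_type values are assumed here since this attribute is used as the edge weight, and the import path follows the file name pygraph/kernels/spKernel.py):
```
import networkx as nx
from pygraph.kernels.spKernel import spkernel

G1 = nx.Graph()
G1.add_node(0, atom='C')
G1.add_node(1, atom='O')
G1.add_edge(0, 1, bond_type=1)
G2 = G1.copy()

Kmatrix, run_time = spkernel([G1, G2], edge_weight = 'bond_type')
print(Kmatrix)
```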
@@ -0,0 +1,198 @@ | |||
""" | |||
@author: linlin | |||
@references: Pierre Mahé and Jean-Philippe Vert. Graph kernels based on tree patterns for molecules. Machine learning, 75(1):3–35, 2009. | |||
""" | |||
import sys | |||
import pathlib | |||
sys.path.insert(0, "../") | |||
import time | |||
from collections import Counter | |||
import networkx as nx | |||
import numpy as np | |||
def treepatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, kernel_type = 'untiln', lmda = 1, h = 1): | |||
"""Calculate tree pattern graph kernels between graphs. | |||
Parameters | |||
---------- | |||
Gn : List of NetworkX graph | |||
List of graphs between which the kernels are calculated. | |||
/ | |||
G1, G2 : NetworkX graphs | |||
2 graphs between which the kernel is calculated. | |||
node_label : string | |||
node attribute used as label. The default node label is atom. | |||
edge_label : string | |||
edge attribute used as label. The default edge label is bond_type. | |||
labeled : boolean | |||
Whether the graphs are labeled. The default is True. | |||
    kernel_type : string
        Type of tree pattern kernel, 'untiln', 'size' or 'branching'.
    lmda : float
        Weight parameter of the tree patterns.
    h : integer
        Maximum height of the tree patterns considered.
Return | |||
------ | |||
Kmatrix: Numpy matrix | |||
        Kernel matrix, each element of which is the tree pattern graph kernel between 2 graphs.
""" | |||
if h < 1: | |||
        raise Exception('h > 0 is required.')
kernel_type = kernel_type.lower() | |||
Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) | |||
h = int(h) | |||
start_time = time.time() | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
Kmatrix[i][j] = _treepatternkernel_do(Gn[i], Gn[j], node_label, edge_label, labeled, kernel_type, lmda, h) | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
run_time = time.time() - start_time | |||
print("\n --- kernel matrix of tree pattern kernel of size %d built in %s seconds ---" % (len(Gn), run_time)) | |||
return Kmatrix, run_time | |||
def _treepatternkernel_do(G1, G2, node_label, edge_label, labeled, kernel_type, lmda, h): | |||
"""Calculate tree pattern graph kernels between 2 graphs. | |||
Parameters | |||
---------- | |||
    G1, G2 : NetworkX graphs
        2 graphs between which the kernel is calculated.
    node_label : string
        node attribute used as label.
    edge_label : string
        edge attribute used as label.
    labeled : boolean
        Whether the graphs are labeled.
    kernel_type : string
        Type of tree pattern kernel, 'untiln', 'size' or 'branching'.
    lmda : float
        Weight parameter of the tree patterns.
    h : integer
        Maximum height of the tree patterns considered.

    Return
    ------
    kernel : float
        Tree pattern kernel between 2 graphs.
""" | |||
def matchingset(n1, n2): | |||
"""Get neiborhood matching set of two nodes in two graphs. | |||
""" | |||
def mset_com(allpairs, length): | |||
"""Find all sets R of pairs by combination. | |||
""" | |||
if length == 1: | |||
mset = [ [pair] for pair in allpairs ] | |||
return mset, mset | |||
else: | |||
mset, mset_l = mset_com(allpairs, length - 1) | |||
mset_tmp = [] | |||
for pairset in mset_l: # for each pair set of length l-1 | |||
nodeset1 = [ pair[0] for pair in pairset ] # nodes already in the set | |||
nodeset2 = [ pair[1] for pair in pairset ] | |||
for pair in allpairs: | |||
if (pair[0] not in nodeset1) and (pair[1] not in nodeset2): # nodes in R should be unique | |||
mset_tmp.append(pairset + [pair]) # add this pair to the pair set of length l-1, constructing a new set of length l | |||
nodeset1.append(pair[0]) | |||
nodeset2.append(pair[1]) | |||
mset.extend(mset_tmp) | |||
return mset, mset_tmp | |||
        allpairs = [] # all pairs that have the same node and edge labels
for neighbor1 in G1[n1]: | |||
for neighbor2 in G2[n2]: | |||
if G1.node[neighbor1][node_label] == G2.node[neighbor2][node_label] \ | |||
and G1[n1][neighbor1][edge_label] == G2[n2][neighbor2][edge_label]: | |||
allpairs.append([neighbor1, neighbor2]) | |||
if allpairs != []: | |||
mset, _ = mset_com(allpairs, len(allpairs)) | |||
else: | |||
mset = [] | |||
return mset | |||
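        # Illustrative example (not from the original source): if
        # allpairs = [[a, x], [b, y]], mset_com enumerates the R-sets
        # [[a, x]], [[b, y]], [[a, x], [b, y]] and [[b, y], [a, x]],
        # i.e. every combination of pairwise-disjoint neighbour pairs
        # whose node and edge labels match.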
def kernel_h(h): | |||
"""Calculate kernel of h-th iteration. | |||
""" | |||
if kernel_type == 'untiln': | |||
            all_kh = { str(n1) + '.' + str(n2) : (G1.node[n1][node_label] == G2.node[n2][node_label]) \
                for n1 in G1.nodes() for n2 in G2.nodes() } # kernels between all pairs of nodes when h = 1
all_kh_tmp = all_kh.copy() | |||
for i in range(2, h + 1): | |||
for n1 in G1.nodes(): | |||
for n2 in G2.nodes(): | |||
kh = 0 | |||
mset = all_msets[str(n1) + '.' + str(n2)] | |||
for R in mset: | |||
kh_tmp = 1 | |||
for pair in R: | |||
kh_tmp *= lmda * all_kh[str(pair[0]) + '.' + str(pair[1])] | |||
kh += 1 / lmda * kh_tmp | |||
kh = (G1.node[n1][node_label] == G2.node[n2][node_label]) * (1 + kh) | |||
all_kh_tmp[str(n1) + '.' + str(n2)] = kh | |||
all_kh = all_kh_tmp.copy() | |||
elif kernel_type == 'size': | |||
            all_kh = { str(n1) + '.' + str(n2) : lmda * (G1.node[n1][node_label] == G2.node[n2][node_label]) \
                for n1 in G1.nodes() for n2 in G2.nodes() } # kernels between all pairs of nodes when h = 1
all_kh_tmp = all_kh.copy() | |||
for i in range(2, h + 1): | |||
for n1 in G1.nodes(): | |||
for n2 in G2.nodes(): | |||
kh = 0 | |||
mset = all_msets[str(n1) + '.' + str(n2)] | |||
for R in mset: | |||
kh_tmp = 1 | |||
for pair in R: | |||
kh_tmp *= lmda * all_kh[str(pair[0]) + '.' + str(pair[1])] | |||
kh += kh_tmp | |||
kh *= lmda * (G1.node[n1][node_label] == G2.node[n2][node_label]) | |||
all_kh_tmp[str(n1) + '.' + str(n2)] = kh | |||
all_kh = all_kh_tmp.copy() | |||
elif kernel_type == 'branching': | |||
            all_kh = { str(n1) + '.' + str(n2) : (G1.node[n1][node_label] == G2.node[n2][node_label]) \
                for n1 in G1.nodes() for n2 in G2.nodes() } # kernels between all pairs of nodes when h = 1
all_kh_tmp = all_kh.copy() | |||
for i in range(2, h + 1): | |||
for n1 in G1.nodes(): | |||
for n2 in G2.nodes(): | |||
kh = 0 | |||
mset = all_msets[str(n1) + '.' + str(n2)] | |||
for R in mset: | |||
kh_tmp = 1 | |||
for pair in R: | |||
kh_tmp *= lmda * all_kh[str(pair[0]) + '.' + str(pair[1])] | |||
kh += 1 / lmda * kh_tmp | |||
kh *= (G1.node[n1][node_label] == G2.node[n2][node_label]) | |||
all_kh_tmp[str(n1) + '.' + str(n2)] = kh | |||
all_kh = all_kh_tmp.copy() | |||
return all_kh | |||
# calculate matching sets for every pair of nodes at first to avoid calculating in every iteration. | |||
all_msets = ({ str(node1) + '.' + str(node2) : matchingset(node1, node2) for node1 in G1.nodes() \ | |||
for node2 in G2.nodes() } if h > 1 else {}) | |||
all_kh = kernel_h(h) | |||
kernel = sum(all_kh.values()) | |||
if kernel_type == 'size': | |||
kernel = kernel / (lmda ** h) | |||
return kernel |
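A minimal usage sketch (toy graphs with the 'atom'/'bond_type' attribute conventions used throughout this package):
```
import networkx as nx
from pygraph.kernels.treePatternKernel import treepatternkernel

G1 = nx.Graph()
G1.add_node(0, atom='C')
G1.add_node(1, atom='O')
G1.add_edge(0, 1, bond_type='1')
G2 = G1.copy()

Kmatrix, run_time = treepatternkernel([G1, G2], kernel_type = 'untiln', lmda = 1, h = 2)
print(Kmatrix)
```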
@@ -1,3 +1,8 @@ | |||
""" | |||
@author: linlin | |||
@references: Gaüzère B, Brun L, Villemin D. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters. 2012 Nov 1;33(15):2038-47. | |||
""" | |||
import sys | |||
import pathlib | |||
sys.path.insert(0, "../") | |||
@@ -12,7 +17,7 @@ import numpy as np | |||
def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True): | |||
"""Calculate treelet graph kernels between graphs. | |||
Parameters | |||
---------- | |||
Gn : List of NetworkX graph | |||
@@ -26,7 +31,7 @@ def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled | |||
edge attribute used as label. The default edge label is bond_type. | |||
labeled : boolean | |||
Whether the graphs are labeled. The default is True. | |||
Return | |||
------ | |||
Kmatrix/kernel : Numpy matrix/float | |||
@@ -37,11 +42,11 @@ def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) | |||
start_time = time.time() | |||
# get all canonical keys of all graphs before calculating kernels to save time, but this may cost a lot of memory for large dataset. | |||
canonkeys = [ get_canonkeys(Gn[i], node_label = node_label, edge_label = edge_label, labeled = labeled) \ | |||
for i in range(0, len(Gn)) ] | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
Kmatrix[i][j] = _treeletkernel_do(canonkeys[i], canonkeys[j], node_label = node_label, edge_label = edge_label, labeled = labeled) | |||
@@ -49,7 +54,7 @@ def treeletkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled | |||
run_time = time.time() - start_time | |||
print("\n --- treelet kernel matrix of size %d built in %s seconds ---" % (len(Gn), run_time)) | |||
return Kmatrix, run_time | |||
else: # for only 2 graphs | |||
@@ -112,10 +117,6 @@ def get_canonkeys(G, node_label = 'atom', edge_label = 'bond_type', labeled = Tr | |||
------ | |||
canonkey/canonkey_l : dict | |||
        For unlabeled graphs, canonkey is a dictionary which records the number of occurrences of every tree pattern. For labeled graphs, canonkey_l is the one which keeps track of the number of occurrences of every treelet.
""" | |||
patterns = {} # a dictionary which consists of lists of patterns for all graphlet. | |||
canonkey = {} # canonical key, a dictionary which records amount of every tree pattern. | |||
@@ -126,7 +127,7 @@ def get_canonkeys(G, node_label = 'atom', edge_label = 'bond_type', labeled = Tr | |||
# linear patterns | |||
patterns['0'] = G.nodes() | |||
canonkey['0'] = nx.number_of_nodes(G) | |||
for i in range(1, 6): | |||
    for i in range(1, 6):
patterns[str(i)] = find_all_paths(G, i) | |||
canonkey[str(i)] = len(patterns[str(i)]) | |||
@@ -227,7 +228,7 @@ def get_canonkeys(G, node_label = 'atom', edge_label = 'bond_type', labeled = Tr | |||
for key in canonkey_t: | |||
canonkey_l['0' + key] = canonkey_t[key] | |||
for i in range(1, 6): | |||
        for i in range(1, 6):
treelet = [] | |||
for pattern in patterns[str(i)]: | |||
canonlist = list(chain.from_iterable((G.node[node][node_label], \ | |||
@@ -378,4 +379,4 @@ def find_all_paths(G, length): | |||
all_paths[idx] = [] | |||
break | |||
    return list(filter(lambda a: a != [], all_paths))
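A minimal usage sketch (same toy-graph conventions as the examples above):
```
import networkx as nx
from pygraph.kernels.treeletKernel import treeletkernel

G1 = nx.Graph()
G1.add_node(0, atom='C')
G1.add_node(1, atom='O')
G1.add_edge(0, 1, bond_type='1')
G2 = G1.copy()

Kmatrix, run_time = treeletkernel([G1, G2], node_label = 'atom', edge_label = 'bond_type', labeled = True)
print(Kmatrix)
```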
@@ -1,3 +1,8 @@ | |||
""" | |||
@author: linlin | |||
@references: Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural networks, 18(8):1093–1110, 2005. | |||
""" | |||
import sys | |||
import pathlib | |||
sys.path.insert(0, "../") | |||
@@ -40,7 +45,7 @@ def untildpathkernel(*args, node_label = 'atom', edge_label = 'bond_type', label | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) | |||
start_time = time.time() | |||
# get all paths of all graphs before calculating kernels to save time, but this may cost a lot of memory for large dataset. | |||
all_paths = [ find_all_paths_until_length(Gn[i], depth, node_label = node_label, edge_label = edge_label, labeled = labeled) for i in range(0, len(Gn)) ] | |||
@@ -187,7 +192,7 @@ def find_all_paths(G, length): | |||
all_paths = [] | |||
for node in G: | |||
all_paths.extend(find_paths(G, node, length)) | |||
### The following process is not carried out according to the original article | |||
# all_paths_r = [ path[::-1] for path in all_paths ] | |||
@@ -200,4 +205,4 @@ def find_all_paths(G, length): | |||
# break | |||
# return list(filter(lambda a: a != [], all_paths)) | |||
    return all_paths
@@ -1,13 +1,8 @@ | |||
"""
@author: linlin
@references:
    [1] Shervashidze N, Schweitzer P, Leeuwen EJ, Mehlhorn K, Borgwardt KM. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research. 2011;12(Sep):2539-61.
"""

import sys
import pathlib
sys.path.insert(0, "../")

import networkx as nx
import numpy as np
import time
def weisfeilerlehmankernel(*args, node_label = 'atom', edge_label = 'bond_type', height = 0, base_kernel = 'subtree'): | |||
@@ -38,97 +32,66 @@ def weisfeilerlehmankernel(*args, node_label = 'atom', edge_label = 'bond_type', | |||
height : int | |||
subtree height | |||
base_kernel : string | |||
        base kernel used in each iteration of WL kernel. The default base kernel is subtree kernel. For a user-defined kernel, base_kernel is the name of the base kernel function used in each iteration of WL kernel. This function returns a Numpy matrix, each element of which is the user-defined Weisfeiler-Lehman kernel between 2 graphs.

    Return
    ------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.

    Notes
    -----
    This function now supports WL subtree kernel, WL shortest path kernel and WL edge kernel.
    """
    base_kernel = base_kernel.lower()
    Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
    Kmatrix = np.zeros((len(Gn), len(Gn)))

    start_time = time.time()

    # for WL subtree kernel
    if base_kernel == 'subtree':
        Kmatrix = _wl_subtreekernel_do(Gn, node_label, edge_label, height)

    # for WL shortest path kernel
    elif base_kernel == 'sp':
        Kmatrix = _wl_spkernel_do(Gn, node_label, edge_label, height)

    # for WL edge kernel
    elif base_kernel == 'edge':
        Kmatrix = _wl_edgekernel_do(Gn, node_label, edge_label, height)

    # for user defined base kernel
    else:
        Kmatrix = _wl_userkernel_do(Gn, node_label, edge_label, height, base_kernel)

    run_time = time.time() - start_time
    print("\n --- Weisfeiler-Lehman %s kernel matrix of size %d built in %s seconds ---" % (base_kernel, len(Gn), run_time))

    return Kmatrix, run_time

def _wl_subtreekernel_do(Gn, node_label, edge_label, height): | |||
"""Calculate Weisfeiler-Lehman subtree kernels between graphs. | |||
    Parameters
    ----------
    Gn : List of NetworkX graph
        List of graphs between which the kernels are calculated.
    node_label : string
        node attribute used as label.
    edge_label : string
        edge attribute used as label.
    height : int
        subtree height.

    Return
    ------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
    """
    height = int(height)
    Kmatrix = np.zeros((len(Gn), len(Gn)))
    all_num_of_labels_occured = 0 # number of the set of letters that occur before as node labels at least once in all graphs
@@ -148,9 +111,9 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
num_of_labels = len(num_of_each_label) # number of all unique labels | |||
all_labels_ori.update(labels_ori) | |||
all_num_of_labels_occured += len(all_labels_ori) | |||
# calculate subtree kernel with the 0th iteration and add it to the final kernel | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
@@ -159,17 +122,17 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
vector2 = np.matrix([ (all_num_of_each_label[j][label] if (label in all_num_of_each_label[j].keys()) else 0) for label in labels ]) | |||
Kmatrix[i][j] += np.dot(vector1, vector2.transpose()) | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
# iterate each height | |||
for h in range(1, height + 1): | |||
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration | |||
num_of_labels_occured = all_num_of_labels_occured # number of the set of letters that occur before as node labels at least once in all graphs | |||
all_labels_ori = set() | |||
all_num_of_each_label = [] | |||
# for each graph | |||
for idx, G in enumerate(Gn): | |||
set_multisets = [] | |||
for node in G.nodes(data = True): | |||
# Multiset-label determination. | |||
@@ -190,9 +153,9 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
else: | |||
set_compressed.update({ value : str(num_of_labels_occured + 1) }) | |||
num_of_labels_occured += 1 | |||
all_set_compressed.update(set_compressed) | |||
# relabel nodes | |||
for node in G.nodes(data = True): | |||
node[1][node_label] = set_compressed[set_multisets[node[0]]] | |||
@@ -202,9 +165,9 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
all_labels_ori.update(labels_comp) | |||
num_of_each_label = dict(Counter(labels_comp)) | |||
all_num_of_each_label.append(num_of_each_label) | |||
all_num_of_labels_occured += len(all_labels_ori) | |||
# calculate subtree kernel with h iterations and add it to the final kernel | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
@@ -213,87 +176,228 @@ def _wl_subtreekernel_do(*args, node_label = 'atom', edge_label = 'bond_type', h | |||
vector2 = np.matrix([ (all_num_of_each_label[j][label] if (label in all_num_of_each_label[j].keys()) else 0) for label in labels ]) | |||
Kmatrix[i][j] += np.dot(vector1, vector2.transpose()) | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
return Kmatrix | |||
def _wl_spkernel_do(Gn, node_label, edge_label, height): | |||
"""Calculate Weisfeiler-Lehman shortest path kernels between graphs. | |||
Parameters | |||
---------- | |||
    Gn : List of NetworkX graph
        List of graphs between which the kernels are calculated.
node_label : string | |||
node attribute used as label. | |||
edge_label : string | |||
edge attribute used as label. | |||
height : int | |||
subtree height. | |||
Return | |||
------ | |||
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
""" | |||
from pygraph.utils.utils import getSPGraph | |||
    # init.
    height = int(height)
    Kmatrix = np.zeros((len(Gn), len(Gn))) # init kernel

    Gn = [ getSPGraph(G, edge_weight = edge_label) for G in Gn ] # get shortest path graphs of Gn
# initial for height = 0 | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
for e1 in Gn[i].edges(data = True): | |||
for e2 in Gn[j].edges(data = True): | |||
if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])): | |||
Kmatrix[i][j] += 1 | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
# iterate each height | |||
for h in range(1, height + 1): | |||
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration | |||
num_of_labels_occured = 0 # number of the set of letters that occur before as node labels at least once in all graphs | |||
for G in Gn: # for each graph | |||
set_multisets = [] | |||
for node in G.nodes(data = True): | |||
# Multiset-label determination. | |||
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ] | |||
# sorting each multiset | |||
multiset.sort() | |||
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix | |||
set_multisets.append(multiset) | |||
# label compression | |||
set_unique = list(set(set_multisets)) # set of unique multiset labels | |||
# a dictionary mapping original labels to new ones. | |||
set_compressed = {} | |||
# if a label occured before, assign its former compressed label, else assign the number of labels occured + 1 as the compressed label | |||
for value in set_unique: | |||
if value in all_set_compressed.keys(): | |||
set_compressed.update({ value : all_set_compressed[value] }) | |||
else: | |||
set_compressed.update({ value : str(num_of_labels_occured + 1) }) | |||
num_of_labels_occured += 1 | |||
all_set_compressed.update(set_compressed) | |||
# relabel nodes | |||
for node in G.nodes(data = True): | |||
node[1][node_label] = set_compressed[set_multisets[node[0]]] | |||
# calculate subtree kernel with h iterations and add it to the final kernel | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
for e1 in Gn[i].edges(data = True): | |||
for e2 in Gn[j].edges(data = True): | |||
if e1[2]['cost'] != 0 and e1[2]['cost'] == e2[2]['cost'] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])): | |||
Kmatrix[i][j] += 1 | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
return Kmatrix | |||
def _wl_edgekernel_do(Gn, node_label, edge_label, height): | |||
"""Calculate Weisfeiler-Lehman edge kernels between graphs. | |||
Parameters | |||
---------- | |||
Gn : List of NetworkX graph | |||
List of graphs between which the kernels are calculated. | |||
node_label : string | |||
node attribute used as label. | |||
edge_label : string | |||
edge attribute used as label. | |||
height : int | |||
subtree height. | |||
Return | |||
------ | |||
Kmatrix : Numpy matrix | |||
Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 praphs. | |||
""" | |||
# init. | |||
height = int(height) | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) # init kernel | |||
# initial for height = 0 | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
for e1 in Gn[i].edges(data = True): | |||
for e2 in Gn[j].edges(data = True): | |||
if e1[2][edge_label] == e2[2][edge_label] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])): | |||
Kmatrix[i][j] += 1 | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
# iterate each height | |||
for h in range(1, height + 1): | |||
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration | |||
num_of_labels_occured = 0 # number of the set of letters that occur before as node labels at least once in all graphs | |||
for G in Gn: # for each graph | |||
set_multisets = [] | |||
for node in G.nodes(data = True): | |||
# Multiset-label determination. | |||
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ] | |||
# sorting each multiset | |||
multiset.sort() | |||
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix | |||
set_multisets.append(multiset) | |||
# label compression | |||
set_unique = list(set(set_multisets)) # set of unique multiset labels | |||
# a dictionary mapping original labels to new ones. | |||
set_compressed = {} | |||
# if a label occured before, assign its former compressed label, else assign the number of labels occured + 1 as the compressed label | |||
for value in set_unique: | |||
if value in all_set_compressed.keys(): | |||
set_compressed.update({ value : all_set_compressed[value] }) | |||
else: | |||
set_compressed.update({ value : str(num_of_labels_occured + 1) }) | |||
num_of_labels_occured += 1 | |||
all_set_compressed.update(set_compressed) | |||
# relabel nodes | |||
for node in G.nodes(data = True): | |||
node[1][node_label] = set_compressed[set_multisets[node[0]]] | |||
# calculate subtree kernel with h iterations and add it to the final kernel | |||
for i in range(0, len(Gn)): | |||
for j in range(i, len(Gn)): | |||
for e1 in Gn[i].edges(data = True): | |||
for e2 in Gn[j].edges(data = True): | |||
if e1[2][edge_label] == e2[2][edge_label] and ((e1[0] == e2[0] and e1[1] == e2[1]) or (e1[0] == e2[1] and e1[1] == e2[0])): | |||
Kmatrix[i][j] += 1 | |||
Kmatrix[j][i] = Kmatrix[i][j] | |||
return Kmatrix | |||
def _wl_userkernel_do(Gn, node_label, edge_label, height, base_kernel): | |||
"""Calculate Weisfeiler-Lehman kernels based on user-defined kernel between graphs. | |||
Parameters | |||
---------- | |||
    Gn : List of NetworkX graph
        List of graphs between which the kernels are calculated.
    node_label : string
        node attribute used as label.
    edge_label : string
        edge attribute used as label.
    height : int
        subtree height.
    base_kernel : string
        Name of the base kernel function used in each iteration of WL kernel. This function returns a Numpy matrix, each element of which is the user-defined Weisfeiler-Lehman kernel between 2 graphs.

    Return
    ------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the Weisfeiler-Lehman kernel between 2 graphs.
""" | |||
# init. | |||
height = int(height) | |||
Kmatrix = np.zeros((len(Gn), len(Gn))) # init kernel | |||
# initial for height = 0 | |||
Kmatrix = base_kernel(Gn, node_label, edge_label) | |||
# iterate each height | |||
for h in range(1, height + 1): | |||
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration | |||
num_of_labels_occured = 0 # number of the set of letters that occur before as node labels at least once in all graphs | |||
for G in Gn: # for each graph | |||
set_multisets = [] | |||
for node in G.nodes(data = True): | |||
# Multiset-label determination. | |||
multiset = [ G.node[neighbors][node_label] for neighbors in G[node[0]] ] | |||
# sorting each multiset | |||
multiset.sort() | |||
multiset = node[1][node_label] + ''.join(multiset) # concatenate to a string and add the prefix | |||
set_multisets.append(multiset) | |||
# label compression | |||
set_unique = list(set(set_multisets)) # set of unique multiset labels | |||
# a dictionary mapping original labels to new ones. | |||
set_compressed = {} | |||
# if a label occured before, assign its former compressed label, else assign the number of labels occured + 1 as the compressed label | |||
for value in set_unique: | |||
if value in all_set_compressed.keys(): | |||
set_compressed.update({ value : all_set_compressed[value] }) | |||
else: | |||
set_compressed.update({ value : str(num_of_labels_occured + 1) }) | |||
num_of_labels_occured += 1 | |||
all_set_compressed.update(set_compressed) | |||
# relabel nodes | |||
for node in G.nodes(data = True): | |||
node[1][node_label] = set_compressed[set_multisets[node[0]]] | |||
# calculate kernel with h iterations and add it to the final kernel | |||
Kmatrix += base_kernel(Gn, node_label, edge_label) | |||
return Kmatrix |
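A minimal usage sketch of the dispatcher above ('subtree' is the default base kernel; toy graphs as in the earlier examples):
```
import networkx as nx
from pygraph.kernels.weisfeilerLehmanKernel import weisfeilerlehmankernel

G1 = nx.Graph()
G1.add_node(0, atom='C')
G1.add_node(1, atom='O')
G1.add_edge(0, 1, bond_type='1')
G2 = G1.copy()

Kmatrix, run_time = weisfeilerlehmankernel([G1, G2], node_label = 'atom', edge_label = 'bond_type', height = 1, base_kernel = 'subtree')
print(Kmatrix)
```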
@@ -3,7 +3,7 @@ | |||
def loadCT(filename): | |||
"""load data from .ct file. | |||
Notes | |||
------ | |||
a typical example of data in .ct is like this: | |||
@@ -33,12 +33,17 @@ def loadCT(filename): | |||
tmp = content[i + 2].split(" ") | |||
tmp = [x for x in tmp if x != ''] | |||
g.add_node(i, atom=tmp[3], label=tmp[3]) | |||
    for i in range(0, nb_edges):
        tmp = content[i + g.number_of_nodes() + 2].split(" ")
        tmp = [x for x in tmp if x != '']
        g.add_edge(int(tmp[0]) - 1, int(tmp[1]) - 1,
                   bond_type=tmp[3].strip(), label=tmp[3].strip())
return g | |||
@@ -101,7 +106,57 @@ def saveGXL(graph, filename): | |||
tree.write(filename) | |||
def loadSDF(filename): | |||
"""load data from structured data file (.sdf file). | |||
Notes | |||
------ | |||
    An SDF file contains a group of molecules, each represented in a way similar to the MOL format.
    See http://www.nonlinear.com/progenesis/sdf-studio/v0.9/faq/sdf-file-format-guidance.aspx, 2018 for the detailed structure.
""" | |||
import networkx as nx | |||
from os.path import basename | |||
from tqdm import tqdm | |||
import sys | |||
data = [] | |||
with open(filename) as f: | |||
content = f.read().splitlines() | |||
index = 0 | |||
pbar = tqdm(total = len(content) + 1, desc = 'load SDF', file=sys.stdout) | |||
while index < len(content): | |||
index_old = index | |||
g = nx.Graph(name=content[index].strip()) # set name of the graph | |||
tmp = content[index + 3] | |||
nb_nodes = int(tmp[:3]) # number of the nodes | |||
nb_edges = int(tmp[3:6]) # number of the edges | |||
for i in range(0, nb_nodes): | |||
tmp = content[i + index + 4] | |||
g.add_node(i, atom=tmp[31:34].strip()) | |||
for i in range(0, nb_edges): | |||
tmp = content[i + index + g.number_of_nodes() + 4] | |||
tmp = [tmp[i:i+3] for i in range(0, len(tmp), 3)] | |||
g.add_edge(int(tmp[0]) - 1, int(tmp[1]) - 1, bond_type=tmp[2].strip()) | |||
data.append(g) | |||
index += 4 + g.number_of_nodes() + g.number_of_edges() | |||
            while content[index].strip() != '$$$$': # separator between molecules
index += 1 | |||
index += 1 | |||
pbar.update(index - index_old) | |||
pbar.update(1) | |||
pbar.close() | |||
return data | |||
def loadDataset(filename, filename_y = ''): | |||
"""load file list of the dataset. | |||
""" | |||
from os.path import dirname, splitext | |||
@@ -128,5 +183,28 @@ def loadDataset(filename): | |||
mol_class = graph.attrib['class'] | |||
data.append(loadGXL(dirname_dataset + '/' + mol_filename)) | |||
y.append(mol_class) | |||
elif extension == "sdf": | |||
import numpy as np | |||
from tqdm import tqdm | |||
import sys | |||
data = loadSDF(filename) | |||
y_raw = open(filename_y).read().splitlines() | |||
y_raw.pop(0) | |||
tmp0 = [] | |||
tmp1 = [] | |||
for i in range(0, len(y_raw)): | |||
tmp = y_raw[i].split(',') | |||
tmp0.append(tmp[0]) | |||
tmp1.append(tmp[1].strip()) | |||
y = [] | |||
        for i in tqdm(range(0, len(data)), desc = 'adjust data', file=sys.stdout):
try: | |||
y.append(tmp1[tmp0.index(data[i].name)].strip()) | |||
except ValueError: # if data[i].name not in tmp0 | |||
data[i] = [] | |||
data = list(filter(lambda a: a != [], data)) | |||
return data, y |
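A minimal usage sketch of the new SDF branch (the paths are the NCI-HIV ones used by run_cyclic.py below; adjust them to your local dataset layout):
```
from pygraph.utils.graphfiles import loadDataset

data, y = loadDataset('../datasets/NCI-HIV/AIDO99SD.sdf',
                      filename_y = '../datasets/NCI-HIV/aids_conc_may04.txt')
print(len(data), data[0].name)
```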
@@ -1,5 +1,6 @@ | |||
import networkx as nx | |||
import numpy as np | |||
from tqdm import tqdm | |||
def getSPLengths(G1): | |||
@@ -58,21 +59,15 @@ def floydTransformation(G, edge_weight = 'bond_type'): | |||
S = nx.Graph() | |||
S.add_nodes_from(G.nodes(data=True)) | |||
for i in range(0, G.number_of_nodes()): | |||
        for j in range(i, G.number_of_nodes()):
S.add_edge(i, j, cost = spMatrix[i, j]) | |||
return S | |||
def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, trials = 100, splits = 10, alpha_grid = None, C_grid = None, hyper_name = '', hyper_range = [1], normalize = False, datafile_y = '', model_type = 'regression'):
"""Perform training and testing for a kernel method. Print out neccessary data during the process then finally the results. | |||
Parameters | |||
---------- | |||
datafile : string | |||
@@ -96,12 +91,14 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria | |||
hyper_range : list | |||
Range of the hyperparameter. | |||
    normalize : boolean
        Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.
    model_type : string
        Type of the problem: 'regression' or 'classification'. The default is 'regression'.
References | |||
---------- | |||
[1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1 | |||
Examples | |||
-------- | |||
>>> import sys | |||
@@ -113,29 +110,41 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria | |||
>>> kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True) | |||
>>> kernel_train_test(datafile, kernel_file_path, treeletkernel, kernel_para, normalize = True) | |||
""" | |||
import os | |||
import pathlib | |||
from collections import OrderedDict | |||
from tabulate import tabulate | |||
from .graphfiles import loadDataset | |||
# setup the parameters | |||
    model_type = model_type.lower()
if model_type != 'regression' and model_type != 'classification': | |||
        raise Exception('The model type is incorrect! Please choose from regression or classification.')
print('\n --- This is a %s problem ---' % model_type) | |||
    alpha_grid = np.logspace(-10, 10, num = trials, base = 10) if alpha_grid is None else alpha_grid # corresponds to (2*C)^-1 in other linear models such as LogisticRegression
    C_grid = np.logspace(-10, 10, num = trials, base = 10) if C_grid is None else C_grid
if not os.path.exists(kernel_file_path): | |||
os.makedirs(kernel_file_path) | |||
train_means_list = [] | |||
train_stds_list = [] | |||
test_means_list = [] | |||
test_stds_list = [] | |||
kernel_time_list = [] | |||
for hyper_para in hyper_range: | |||
        print('' if hyper_name == '' else '\n\n #--- calculating kernel matrix when', hyper_name, '=', hyper_para, '---#')
print('\n Loading dataset from file...') | |||
        dataset, y = loadDataset(datafile, filename_y = datafile_y)
y = np.array(y) | |||
# print(y) | |||
# normalize labels and transform non-numerical labels to numerical labels. | |||
if model_type == 'classification': | |||
from sklearn.preprocessing import LabelEncoder | |||
y = LabelEncoder().fit_transform(y) | |||
# print(y) | |||
# save kernel matrices to files / read kernel matrices from files | |||
kernel_file = kernel_file_path + 'km.ds' | |||
@@ -152,7 +161,7 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria | |||
Kmatrix, run_time = kernel_func(dataset, **kernel_para) | |||
kernel_time_list.append(run_time) | |||
print(Kmatrix) | |||
# print('\n Saving kernel matrix to file...') | |||
# np.savetxt(kernel_file, Kmatrix) | |||
""" | |||
@@ -170,25 +179,29 @@ def kernel_train_test(datafile, kernel_file_path, kernel_func, kernel_para, tria | |||
test_stds_list.append(test_std) | |||
print('\n') | |||
if model_type == 'regression': | |||
table_dict = {'rmse_test': test_means_list, 'std_test': test_stds_list, \ | |||
'rmse_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list} | |||
if hyper_name == '': | |||
keyorder = ['rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time'] | |||
else: | |||
table_dict[hyper_name] = hyper_range | |||
keyorder = [hyper_name, 'rmse_test', 'std_test', 'rmse_train', 'std_train', 'k_time'] | |||
elif model_type == 'classification': | |||
table_dict = {'accur_test': test_means_list, 'std_test': test_stds_list, \ | |||
'accur_train': train_means_list, 'std_train': train_stds_list, 'k_time': kernel_time_list} | |||
if hyper_name == '': | |||
keyorder = ['accur_test', 'std_test', 'accur_train', 'std_train', 'k_time'] | |||
else: | |||
table_dict[hyper_name] = hyper_range | |||
keyorder = [hyper_name, 'accur_test', 'std_test', 'accur_train', 'std_train', 'k_time'] | |||
print(tabulate(OrderedDict(sorted(table_dict.items(), key = lambda i:keyorder.index(i[0]))), headers='keys')) | |||
def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, trials = 100, model_type = 'regression', normalize = False): | |||
"""Split dataset to training and testing splits, train and test. Print out and return the results. | |||
Parameters | |||
---------- | |||
Kmatrix : Numpy matrix | |||
@@ -206,8 +219,8 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
model_type : string | |||
Determine whether it is a regression or classification problem. The default is 'regression'. | |||
    normalize : boolean
        Determine whether normalization is performed. Only works when model_type == 'regression'. The default is False.
Return | |||
------ | |||
train_mean : float | |||
@@ -218,19 +231,27 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
mean of the best tests. | |||
test_std : float | |||
mean of test stds in the same trial with the best test mean. | |||
References | |||
---------- | |||
[1] Elisabetta Ghisu, https://github.com/eghisu/GraphKernels/blob/master/GraphKernelsCollection/python_scripts/compute_perf_gk.py, 2018.1 | |||
""" | |||
import random | |||
from sklearn.kernel_ridge import KernelRidge # 0.17 | |||
from sklearn.metrics import accuracy_score, mean_squared_error | |||
from sklearn import svm | |||
datasize = len(train_target) | |||
random.seed(20) # Set the seed for uniform parameter distribution | |||
# Initialize the performance of the best parameter trial on train with the corresponding performance on test | |||
train_split = [] | |||
test_split = [] | |||
# For each split of the data | |||
    print('\n Starting to calculate accuracy/RMSE...')
import sys | |||
pbar = tqdm(total = splits * trials, desc = 'calculate performance', file=sys.stdout) | |||
for j in range(10, 10 + splits): | |||
# print('\n Starting split %d...' % j) | |||
@@ -255,7 +276,7 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
# Split the targets | |||
y_train = y_perm[0:num_train] | |||
# Normalization step (for real valued targets only) | |||
if normalize == True and model_type == 'regression': | |||
@@ -275,7 +296,6 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
if model_type == 'regression': | |||
# Fit the kernel ridge model | |||
KR = KernelRidge(kernel = 'precomputed', alpha = alpha_grid[i]) | |||
# KR = svm.SVR(kernel = 'precomputed', C = C_grid[i]) | |||
KR.fit(Kmatrix_train, y_train if normalize == False else y_train_norm) | |||
# predict on the train and test set | |||
@@ -284,15 +304,33 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
# adjust prediction: needed because the training targets have been normalized | |||
if normalize == True: | |||
                    y_pred_train = y_pred_train * float(y_train_std) + y_train_mean
                    y_pred_test = y_pred_test * float(y_train_std) + y_train_mean
# root mean squared error on train set | |||
accuracy_train = np.sqrt(mean_squared_error(y_train, y_pred_train)) | |||
perf_all_train.append(accuracy_train) | |||
# root mean squared error on test set | |||
accuracy_test = np.sqrt(mean_squared_error(y_test, y_pred_test)) | |||
perf_all_test.append(accuracy_test) | |||
            # For classification use SVM
elif model_type == 'classification': | |||
KR = svm.SVC(kernel = 'precomputed', C = C_grid[i]) | |||
KR.fit(Kmatrix_train, y_train) | |||
# predict on the train and test set | |||
y_pred_train = KR.predict(Kmatrix_train) | |||
y_pred_test = KR.predict(Kmatrix_test) | |||
# accuracy on train set | |||
accuracy_train = accuracy_score(y_train, y_pred_train) | |||
perf_all_train.append(accuracy_train) | |||
# accuracy on test set | |||
accuracy_test = accuracy_score(y_test, y_pred_test) | |||
perf_all_test.append(accuracy_test) | |||
pbar.update(1) | |||
# --- FIND THE OPTIMAL PARAMETERS --- # | |||
# For regression: minimise the mean squared error | |||
@@ -306,6 +344,17 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
perf_train_opt = perf_all_train[min_idx] | |||
perf_test_opt = perf_all_test[min_idx] | |||
# For classification: maximise the accuracy | |||
if model_type == 'classification': | |||
# get optimal parameter on test (argmax accuracy) | |||
max_idx = np.argmax(perf_all_test) | |||
C_opt = C_grid[max_idx] | |||
# corresponding performance on train and test set for the same parameter | |||
perf_train_opt = perf_all_train[max_idx] | |||
perf_test_opt = perf_all_test[max_idx] | |||
        # append the corresponding performance on the train and test set
train_split.append(perf_train_opt) | |||
test_split.append(perf_test_opt) | |||
@@ -322,5 +371,5 @@ def split_train_test(Kmatrix, train_target, alpha_grid, C_grid, splits = 10, tri | |||
print('With standard deviation: %3f' % train_std) | |||
print('\n Mean performance on test set: %3f' % test_mean) | |||
print('With standard deviation: %3f' % test_std) | |||
    return train_mean, train_std, test_mean, test_std
@@ -0,0 +1,16 @@ | |||
import sys | |||
sys.path.insert(0, "../") | |||
from pygraph.utils.utils import kernel_train_test | |||
from pygraph.kernels.cyclicPatternKernel import cyclicpatternkernel | |||
import numpy as np | |||
datafile = '../../../../datasets/NCI-HIV/AIDO99SD.sdf' | |||
datafile_y = '../../../../datasets/NCI-HIV/aids_conc_may04.txt' | |||
kernel_file_path = 'kernelmatrices_path_acyclic/' | |||
kernel_para = dict(node_label = 'atom', edge_label = 'bond_type', labeled = True) | |||
kernel_train_test(datafile, kernel_file_path, cyclicpatternkernel, kernel_para, \ | |||
hyper_name = 'cycle_bound', hyper_range = np.linspace(0, 1000, 21), normalize = False, \ | |||
datafile_y = datafile_y, model_type = 'classification') |