=== Introduction ===

This folder contains 6 data sets of undirected labeled graphs in MATLAB format for graph
classification: MUTAG, PTC, NCI1, NCI109, ENZYMES, and D&D.

=== Usage ===

For each data set X, the MATLAB command

    load X

loads into memory a struct array X containing the graphs, and a column vector lx containing
a class label for each graph.

X(i).am is the adjacency matrix of the i-th graph,
X(i).al is the adjacency list of the i-th graph,
X(i).nl.values is a column vector of node labels for the i-th graph,
X(i).el (not always available) contains edge labels for the i-th graph.

Example:
typing "load MUTAG" in MATLAB loads a 188-element array of graph structures, called MUTAG,
and a column vector of 188 numbers, each of which indicates the class that the
corresponding graph belongs to.
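
The relationship between the am and al fields can be illustrated outside MATLAB. The
following is a minimal Python sketch, not part of the data sets: the 4-node graph, its
node labels, the dictionary layout, and the helper names adjacency_list_from_matrix and
is_undirected are all made up for illustration, and node indices are 0-based here
(MATLAB indices are 1-based).

```python
def adjacency_list_from_matrix(am):
    """For each node, list the indices (0-based) of its neighbours."""
    n = len(am)
    return [[j for j in range(n) if am[i][j] != 0] for i in range(n)]


def is_undirected(am):
    """An undirected graph has a symmetric adjacency matrix."""
    n = len(am)
    return all(am[i][j] == am[j][i] for i in range(n) for j in range(n))


# Hypothetical 4-node star graph; not taken from any of the data sets.
am = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

# One graph record, loosely mirroring the fields described above.
graph = {
    "am": am,
    "al": adjacency_list_from_matrix(am),
    "nl": {"values": [3, 1, 3, 3]},  # made-up node labels, one per node
}
```

Both fields describe the same edges, so either can be derived from the other; the
matrix form suits linear-algebra code, while the list form is more compact for sparse
graphs.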

=== Description ===

MUTAG (Debnath et al., 1991) is a data set of 188 mutagenic aromatic and heteroaromatic
nitro compounds labeled according to whether or not they have a mutagenic effect on the
Gram-negative bacterium Salmonella typhimurium.

PTC (Toivonen et al., 2003) contains 344 chemical compounds tested for carcinogenicity
in mice and rats. The classification task is to predict the carcinogenicity of compounds.

NCI1 and NCI109 represent two balanced subsets of data sets of chemical compounds screened
for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively
(Wale and Karypis, 2006; http://pubchem.ncbi.nlm.nih.gov).
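
"Balanced" here means that the classes are of roughly equal size. A minimal Python
sketch of such a check on a label vector follows; the labels are made up and the helper
name class_counts is hypothetical, so this shows only the idea, not the actual contents
of NCI1 or NCI109.

```python
from collections import Counter


def class_counts(labels):
    """Count how many graphs carry each class label."""
    return dict(Counter(labels))


# Made-up label vector with one entry per graph.
labels = [1, 0, 1, 0, 0, 1]
counts = class_counts(labels)

# The data set is perfectly balanced when all class sizes are equal.
balanced = len(set(counts.values())) == 1
```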

ENZYMES is a data set of protein tertiary structures obtained from Borgwardt et al.
(2005), consisting of 600 enzymes from the BRENDA enzyme database (Schomburg et al., 2004).
In this case the task is to correctly assign each enzyme to one of the 6 EC top-level
classes.

D&D is a data set of 1178 protein structures (Dobson and Doig, 2003). Each protein is
represented by a graph, in which the nodes are amino acids and two nodes are connected
by an edge if they are less than 6 Angstroms apart. The prediction task is to classify
the protein structures into enzymes and non-enzymes.
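
The D&D edge rule can be sketched in a few lines of Python. This is an illustration
under assumptions, not the procedure used to build the data set: it assumes each amino
acid is summarized by a single 3D point, the coordinates are made up, and the function
name contact_graph is hypothetical.

```python
import math


def contact_graph(coords, cutoff=6.0):
    """Adjacency matrix linking points that are strictly less than `cutoff` apart."""
    n = len(coords)
    am = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                am[i][j] = am[j][i] = 1  # undirected: keep the matrix symmetric
    return am


# Made-up positions in Angstroms, one point per amino acid (an assumption).
coords = [
    (0.0, 0.0, 0.0),
    (3.0, 4.0, 0.0),   # 5 Angstroms from the first point: connected
    (10.0, 0.0, 0.0),  # farther than 6 Angstroms from every other point
    (0.0, 0.0, 6.0),   # exactly 6 Angstroms from the first point: not connected
]
am = contact_graph(coords)
```

Because the rule is "less than 6 Angstroms", a pair at exactly the cutoff distance is
left unconnected in this sketch.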

=== References ===

K. M. Borgwardt, C. S. Ong, S. Schoenauer, S. V. N. Vishwanathan, A. J. Smola, and H. P.
Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56,
2005.

A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch.
Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds.
Correlation with molecular orbital energies and hydrophobicity. J Med Chem, 34:786–797,
1991.

P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without
alignments. J Mol Biol, 330(4):771–783, 2003.

I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. BRENDA,
the enzyme database: updates and major new developments. Nucleic Acids Research, 32D:431–433,
2004.

H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma. Statistical evaluation of
the predictive toxicology challenge 2000-2001. Bioinformatics, 19(10):1183–1193, 2003.

N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and
classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006.

A Python package for graph kernels, graph edit distances and graph pre-image problem.