- This directory includes some useful tools:
-
- 1. Subset selection tools
- 2. Parameter selection tools
- 3. LIBSVM format checking tools
-
- Part I: Subset selection tools
-
- Introduction
- ============
-
- Training on large data sets is time-consuming, so it is often useful
- to work on a smaller subset first. The Python script subset.py randomly
- selects a specified number of samples. For classification data, it
- offers stratified selection, which keeps the class distribution of the
- subset the same as that of the full data set.
-
- Usage: subset.py [options] dataset number [output1] [output2]
-
- This script selects a subset of the given data set.
-
- options:
- -s method : method of selection (default 0)
-     0 -- stratified selection (classification only)
-     1 -- random selection
-
- output1 : the subset (optional)
- output2 : the rest of data (optional)
-
- If output1 is omitted, the subset will be printed on the screen.
-
- Example
- =======
-
- > python subset.py heart_scale 100 file1 file2
-
- This randomly selects 100 samples from heart_scale (using the default
- stratified selection) and stores them in file1. All remaining
- instances are stored in file2.
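
The stratified selection described above can be sketched as follows. This is only an illustration of the idea, not the code in subset.py: the function name `stratified_subset` and the rounding rule for per-class counts are assumptions.

```python
import random
from collections import defaultdict

def stratified_subset(labels, n, seed=0):
    """Return sorted indices of about n samples, keeping class proportions.

    Hypothetical helper, not part of subset.py. Per-class counts are
    rounded, so the result may differ from n by a few samples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    total = len(labels)
    chosen = []
    for y, idx in by_class.items():
        # Take a share of n proportional to the size of this class.
        k = min(len(idx), round(n * len(idx) / total))
        chosen.extend(rng.sample(idx, k))
    return sorted(chosen)

labels = [1] * 60 + [-1] * 40          # toy labels: 60% / 40%
subset = stratified_subset(labels, 10)
picked = set(subset)
rest = [i for i in range(len(labels)) if i not in picked]
```

Here `subset` plays the role of output1 and `rest` the role of output2.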
-
-
- Part II: Parameter selection tools
-
- Introduction
- ============
-
- grid.py is a parameter selection tool for C-SVM classification using
- the RBF (radial basis function) kernel. It uses cross validation (CV)
- to estimate the accuracy of each parameter combination in the
- specified range and helps you decide the best parameters for your
- problem.
-
- grid.py directly executes the libsvm binaries (so no Python binding is
- needed) for cross validation and then draws a contour plot of the CV
- accuracy using gnuplot. You must have libsvm and gnuplot installed
- before using it. gnuplot is available at http://www.gnuplot.info/
-
- On Mac OS X, the precompiled gnuplot binary needs the AquaTerm library,
- which must therefore be installed as well. In addition, this version of
- gnuplot does not support png, so you need to change "set term png
- transparent small" to use another image format. For example, you may
- use "set term pbm small color".
-
- Usage: grid.py [grid_options] [svm_options] dataset
-
- grid_options :
- -log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
-     begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
-     "null"         -- do not grid with c
- -log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
-     begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
-     "null"         -- do not grid with g
- -v n : n-fold cross validation (default 5)
- -svmtrain pathname : set svm executable path and name
- -gnuplot {pathname | "null"} :
-     pathname -- set gnuplot executable path and name
-     "null"   -- do not plot
- -out {pathname | "null"} : (default dataset.out)
-     pathname -- set output file path and name
-     "null"   -- do not output file
- -png pathname : set graphic output file path and name (default dataset.png)
- -resume [pathname] : resume the grid task using an existing output file (default pathname is dataset.out)
-     Use this option only if some parameters have been checked for the SAME data.
-
- svm_options : additional options for svm-train
-
- The program conducts v-fold cross validation using parameter C (and gamma)
- = 2^begin, 2^(begin+step), ..., 2^end.
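
As a rough illustration of how a begin,end,step triple expands into parameter values, consider the sketch below. The helper name `expand_range` is an assumption, and the order in which grid.py actually tries the values may differ.

```python
def expand_range(begin, end, step):
    """Return [begin, begin+step, ...], stopping once end is passed."""
    if step == 0:
        raise ValueError("step must be nonzero")
    values = []
    x = begin
    # A positive step counts up to end; a negative step counts down to it.
    while (step > 0 and x <= end) or (step < 0 and x >= end):
        values.append(x)
        x += step
    return values

log2c = expand_range(-5, 15, 2)    # default -log2c range
log2g = expand_range(3, -15, -2)   # default -log2g range
c_values = [2.0 ** e for e in log2c]
g_values = [2.0 ** e for e in log2g]
```

With the defaults, this yields 11 values of C and 10 values of gamma, so a full grid covers 110 parameter combinations.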
-
- You can specify where the libsvm executable and gnuplot are using the
- -svmtrain and -gnuplot parameters.
-
- For Windows users, please use pgnuplot.exe. If you are using gnuplot
- 3.7.1, please upgrade to version 3.7.3 or higher, because version 3.7.1
- has a bug. If you use Cygwin on Windows, please use gnuplot-x11.
-
- If the task is terminated accidentally or you would like to change the
- range of parameters, you can apply '-resume' to save time by re-using
- previous results. You may specify the output file of a previous run
- or use the default (i.e., dataset.out) without giving a name. Please
- note that the same settings must be used in both runs. For example,
- you cannot use '-v 10' earlier and resume the task with '-v 5'.
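
The time saving from '-resume' comes from skipping parameter pairs that already have a result. A minimal sketch of that idea follows; the helper name `pending_pairs` is hypothetical, the rates are made-up placeholders, and this is not how grid.py is necessarily implemented.

```python
def pending_pairs(log2c_values, log2g_values, done):
    """Return the (log2c, log2g) pairs that still need cross validation.

    `done` maps already-evaluated pairs to their CV rate.
    """
    return [(c, g) for c in log2c_values for g in log2g_values
            if (c, g) not in done]

# Placeholder results standing in for a previous run (not real data).
done = {(-1, -1): 80.0, (0, 0): 82.5}
todo = pending_pairs([-1, 0, 1], [-1, 0], done)
```

Only the four pairs in `todo` would need new cross-validation runs; the two in `done` are reused.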
-
- The value of some options can be "null". For example, `-log2c -1,0,1
- -log2g "null"' means that C=2^-1,2^0,2^1 and g=LIBSVM's default gamma
- value. That is, you do not conduct parameter selection on gamma.
-
- Example
- =======
-
- > python grid.py -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 300 heart_scale
-
- Users (in particular MS Windows users) may need to specify the path of
- executable files. You can either change the paths at the beginning of
- grid.py or specify them on the command line. For example,
-
- > grid.py -log2c -5,5,1 -svmtrain "c:\Program Files\libsvm\windows\svm-train.exe" -gnuplot c:\tmp\gnuplot\binary\pgnuplot.exe -v 10 heart_scale
-
- Output: two files
- dataset.png: the CV accuracy contour plot generated by gnuplot
- dataset.out: the CV accuracy at each (log2(C),log2(gamma))
-
- The following example saves running time by loading the output file of a previous run.
-
- > python grid.py -log2c -7,7,1 -log2g -5,2,1 -v 5 -resume heart_scale.out heart_scale
-
- Parallel grid search
- ====================
-
- You can conduct a parallel grid search by dispatching jobs to a
- cluster of computers that share the same file system. First, add the
- machine names to grid.py:
-
-     ssh_workers = ["linux1", "linux5", "linux5"]
-
- Then set up ssh so that authentication works without asking for a
- password.
-
- The same machine (e.g., linux5 here) can be listed more than once if
- it has multiple CPUs or more RAM. If the local machine is the most
- powerful, you can also increase nr_local_worker. For example:
-
-     nr_local_worker = 2
-
- Example:
-
- > python grid.py heart_scale
- [local] -1 -1 78.8889 (best c=0.5, g=0.5, rate=78.8889)
- [linux5] -1 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
- [linux5] 5 -1 77.037 (best c=0.5, g=0.0078125, rate=83.3333)
- [linux1] 5 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
- .
- .
- .
-
- If -log2c, -log2g, or -v is not specified, default values are used.
-
- If your system uses telnet instead of ssh, list the computer names
- in telnet_workers instead.
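
The dispatching scheme can be pictured as a shared job queue with one thread per worker slot, so that a machine listed twice takes jobs twice as fast. The sketch below illustrates this under stated assumptions: `run_grid` and `fake_evaluate` are hypothetical names, no ssh connection is made, and the score function is a placeholder for a real cross-validation run.

```python
import queue
import threading

def run_grid(jobs, workers, evaluate):
    """Evaluate every job using one thread per worker slot.

    jobs     : list of (log2c, log2g) pairs
    workers  : worker names; a repeated name gets one slot per occurrence
    evaluate : callable(worker, log2c, log2g) -> CV rate (stand-in here)
    """
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)
    results = []
    lock = threading.Lock()

    def worker_loop(name):
        # Keep taking jobs until the queue is empty.
        while True:
            try:
                c, g = job_queue.get_nowait()
            except queue.Empty:
                return
            rate = evaluate(name, c, g)
            with lock:
                results.append((name, c, g, rate))

    threads = [threading.Thread(target=worker_loop, args=(w,)) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def fake_evaluate(worker, c, g):
    # Placeholder score; a real worker would run cross validation.
    return 50.0 - abs(c) - abs(g)

ssh_workers = ["linux1", "linux5", "linux5"]
nr_local_worker = 2
workers = ["local"] * nr_local_worker + ssh_workers
jobs = [(c, g) for c in (-1, 0, 1) for g in (-1, 0, 1)]
results = run_grid(jobs, workers, fake_evaluate)
best = max(results, key=lambda r: r[3])
```

Because jobs are pulled from a queue rather than assigned up front, faster workers naturally end up processing more parameter pairs.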
-
- Calling grid in Python
- ======================
-
- In addition to using grid.py as a command-line tool, you can use it as a
- Python module.
-
- >>> rate, param = find_parameters(dataset, options)
-
- You need to specify `dataset' and `options' (default ''). See the following example.
-
- > python
-
- >>> from grid import *
- >>> rate, param = find_parameters('../heart_scale', '-log2c -1,1,1 -log2g -1,1,1')
- [local] 0.0 0.0 rate=74.8148 (best c=1.0, g=1.0, rate=74.8148)
- [local] 0.0 -1.0 rate=77.037 (best c=1.0, g=0.5, rate=77.037)
- .
- .
- [local] -1.0 -1.0 rate=78.8889 (best c=0.5, g=0.5, rate=78.8889)
- .
- .
- >>> rate
- 78.8889
- >>> param
- {'c': 0.5, 'g': 0.5}
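
Once the best parameters are known, a natural next step is to train on the whole data set with them. The sketch below builds an svm-train command line from a dictionary shaped like `param` above; the helper name `train_command` and the executable name "svm-train" on the search path are assumptions.

```python
def train_command(param, dataset, svmtrain="svm-train"):
    """Build an svm-train argument list from a parameter dictionary."""
    cmd = [svmtrain]
    if "c" in param:
        cmd += ["-c", str(param["c"])]   # cost parameter C
    if "g" in param:
        cmd += ["-g", str(param["g"])]   # RBF kernel gamma
    cmd.append(dataset)
    return cmd

param = {'c': 0.5, 'g': 0.5}
cmd = train_command(param, "heart_scale")
```

The resulting list can be passed to subprocess.run(cmd, check=True) if svm-train is installed.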
-
-
- Part III: LIBSVM format checking tools
-
- Introduction
- ============
-
- `svm-train' conducts only a simple check of the input data. For a
- more detailed check, we provide the Python script `checkdata.py'.
-
- Usage: checkdata.py dataset
-
- Exit status (returned value): 1 if there are errors, 0 otherwise.
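
Because the result is reported through the exit status, the check is easy to use from another script. The following sketch shows one possible way; the helper name `data_is_valid` is hypothetical, and the two calls at the end use stand-in commands so that the example runs without checkdata.py being present.

```python
import subprocess
import sys

def data_is_valid(command):
    """Run a checker command and report whether it exited with status 0."""
    return subprocess.run(command).returncode == 0

# With checkdata.py in the current directory, one might call:
#   data_is_valid([sys.executable, "checkdata.py", "heart_scale"])
# The two calls below use stand-in commands that simply exit with 0 or 1.
ok = data_is_valid([sys.executable, "-c", "import sys; sys.exit(0)"])
bad = data_is_valid([sys.executable, "-c", "import sys; sys.exit(1)"])
```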
-
- This tool was written by Rong-En Fan at National Taiwan University.
-
- Example
- =======
-
- > cat bad_data
- 1 3:1 2:4
- > python checkdata.py bad_data
- line 1: feature indices must be in an ascending order, previous/current features 3:1 2:4
- Found 1 lines with error.
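
The error reported above concerns the order of feature indices within a line. A minimal sketch of that single check is given below; it is not the code of checkdata.py, which performs further checks, and the function name `has_ascending_indices` is an assumption.

```python
def has_ascending_indices(line):
    """Return True if the feature indices on one data line strictly increase.

    Assumes the line has the form 'label index:value index:value ...'.
    """
    tokens = line.split()
    previous = None
    for token in tokens[1:]:            # tokens[0] is the label
        index = int(token.split(":", 1)[0])
        if previous is not None and index <= previous:
            return False
        previous = index
    return True

good = has_ascending_indices("1 2:4 3:1")
bad = has_ascending_indices("1 3:1 2:4")   # the line from bad_data
```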
-
-