## Wednesday, April 27, 2011

### SVM basics

Linear classification:
Given data $x$ with labels $y$ ($x \to f() \to y$), you want to separate the classes with $f(x, w, b) = \text{sign}(w \cdot x + b)$. There are many solutions $(w, b)$; which one is best? The one with the maximum margin (margin width $M$).

Support vectors are the data points that the margin pushes up against (they lie on the border of the margin). The maximum-margin formulation implies that only the support vectors matter; the other training examples are ignorable.
$w \cdot x^+ + b = +1$
$w \cdot x^- + b = -1$
$w \cdot (x^+ - x^-) = 2$
The margin width is $M = \frac{(x^+ - x^-) \cdot w}{|w|} = \frac{2}{|w|}$.
Maximizing $M = \frac{2}{|w|}$ is the same as minimizing $\frac{1}{2}w^Tw$ (a dot product).
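As a quick numeric check of the margin-width formula, a minimal sketch (the vector $w = (3, 4)$ is just an illustrative value):

```python
import math

def margin_width(w):
    """Margin width M = 2 / |w| for a maximum-margin separator."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

# For w = (3, 4), |w| = 5, so M = 2/5 = 0.4.
print(margin_width([3.0, 4.0]))  # 0.4
```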
To solve for $w$ and $b$:
Minimize $\Phi(w) = \frac{1}{2}w^Tw$ subject to $y_i(w \cdot x_i + b) \ge 1$.
This is a quadratic optimization problem, and solving it involves computing the inner products $x_i^Tx_j$ between all pairs of training points.
The solution has the form $w = \sum_i \alpha_i y_i x_i$, with $b = y_k - w^Tx_k$ for any $x_k$ whose $\alpha_k \ne 0$.
Each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector. The classifying function is $f(x) = \text{sign}\left(\sum_i \alpha_i y_i\, x_i^Tx + b\right)$, which involves an inner product between the test point $x$ and each support vector $x_i$.
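A minimal sketch of this classifying function in Python, assuming the quadratic program has already been solved; the $\alpha_i$ values, support vectors, and $b$ below are hand-picked for a toy 1-D problem, not produced by an optimizer:

```python
def dot(u, v):
    """Inner product between two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_classify(x, support, b):
    """f(x) = sign(sum_i alpha_i * y_i * <x_i, x> + b).

    `support` is a list of (alpha_i, y_i, x_i) triples, one per support
    vector; all other training points have alpha_i = 0 and can be dropped.
    """
    s = sum(alpha * y * dot(x_i, x) for alpha, y, x_i in support) + b
    return 1 if s >= 0 else -1

# Toy 1-D problem: support vectors at x=+1 (class +1) and x=-1 (class -1).
# alpha = 0.5 for both gives w = sum_i alpha_i y_i x_i = 1 and b = 0.
support = [(0.5, +1, [1.0]), (0.5, -1, [-1.0])]
print(svm_classify([2.0], support, b=0.0))   # 1
print(svm_classify([-2.0], support, b=0.0))  # -1
```

Note that only the support vectors enter the sum, which is why the other training examples are ignorable at prediction time.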

Soft margin: minimize $\frac{1}{2}w^Tw + C\sum_k \varepsilon_k$ subject to $y_k(w \cdot x_k + b) \ge 1 - \varepsilon_k$ and $\varepsilon_k \ge 0$, where the slack variables $\varepsilon_k$ let some points violate the margin and $C$ controls the trade-off.
A linear classifier is a separating hyperplane, and the support vectors (the most important training points) define that hyperplane.

Non-linear SVM: mapping data to a higher-dimensional feature space where the training data is separable.
A linear classifier relies on dot products between vectors; a non-linear one relies on a kernel function. A kernel function is a function that corresponds to an inner product in some feature space. Kernel function examples:
Linear: $K(x_i, x_j) = x_i^Tx_j$
Polynomial: $K(x_i, x_j) =(1+x_i^Tx_j)^p$
Gaussian (radial-basis function): $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, where a common heuristic sets $\sigma$ to the distance between the closest points with different classifications.
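The three kernels above can be sketched directly; the default parameter values $p = 2$ and $\sigma = 1$ below are illustrative choices, not prescribed by the tutorial:

```python
import math

def linear_kernel(x_i, x_j):
    """K(x_i, x_j) = x_i^T x_j."""
    return sum(a * b for a, b in zip(x_i, x_j))

def polynomial_kernel(x_i, x_j, p=2):
    """K(x_i, x_j) = (1 + x_i^T x_j)^p."""
    return (1 + linear_kernel(x_i, x_j)) ** p

def gaussian_kernel(x_i, x_j, sigma=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(linear_kernel([1.0, 2.0], [3.0, 4.0]))         # 11.0
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))     # 144.0
print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))       # 1.0
```

Swapping any of these in for `dot` in the classifying function gives a non-linear SVM without ever materializing the feature space.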
A non-linear SVM locates a separating hyperplane in the feature space and classifies points in that space. It never needs to represent the feature space explicitly; it only needs to define a kernel function, which plays the role of the dot product in that space.

Weaknesses of SVM:
It is sensitive to noise.
It only considers two classes (binary classification). So how do we use it to classify multiple categories? With $m$ categories, train $m$ SVMs: SVM 1 learns "output == 1" vs "output != 1", ..., SVM $m$ learns "output == m" vs "output != m". At prediction time, run the new input through each SVM and pick the class with the best score.
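This one-vs-rest scheme can be sketched as follows, assuming each of the $m$ trained SVMs exposes a real-valued decision score such as $w \cdot x + b$; the `scorers` mapping and the toy linear decision functions are hypothetical stand-ins for trained models:

```python
def one_vs_rest_predict(x, scorers):
    """Predict the class whose binary SVM gives the largest decision score.

    `scorers` maps each class label k to the decision function of the SVM
    trained on "output == k" vs "output != k" (e.g. x -> w_k . x + b_k).
    """
    return max(scorers, key=lambda k: scorers[k](x))

# Hypothetical linear decision functions for two classes in 1-D.
scorers = {
    1: lambda x: x[0] - 0.5,    # SVM 1: "class 1" vs rest
    2: lambda x: -x[0] - 0.5,   # SVM 2: "class 2" vs rest
}
print(one_vs_rest_predict([3.0], scorers))   # 1
print(one_vs_rest_predict([-3.0], scorers))  # 2
```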
-- most content is from: http://www.cs.cmu.edu/~awm/tutorials