Neural networks and support vector machines
Perceptron. Inputs x_1, x_2, x_3, ..., x_D with weights w_1, w_2, w_3, ..., w_D. Output: sgn(w · x + b). The bias can be incorporated as a component of the weight vector by always including a feature with value set to 1.
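As a rough illustration (not from the slides), a minimal NumPy sketch of this decision rule with the bias folded into the weight vector:

import numpy as np

# Minimal sketch: perceptron prediction with the bias folded into the
# weight vector via a constant-1 feature appended to the input.
def predict(w, x):
    x_aug = np.append(x, 1.0)           # x with the constant feature appended
    return np.sign(np.dot(w, x_aug))    # sgn(w . x + b), where b = w[-1]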
Loose inspiration: Human neurons
Linear separability
Perceptron training algorithm. Initialize weights. Cycle through training examples in multiple passes (epochs). For each training example: if classified correctly, do nothing; if classified incorrectly, update the weights.
Perceptron update rule. For each training instance x with label y: classify with current weights: ŷ = sgn(w · x). Update weights: w ← w + α(y − ŷ)x, where α is a learning rate that should decay as a function of epoch t, e.g., 1000/(1000 + t). What happens if ŷ is correct? Nothing, since y − ŷ = 0. Otherwise, consider what happens to individual weights: w_i ← w_i + α(y − ŷ)x_i. If y = 1 and ŷ = −1, w_i will be increased if x_i is positive or decreased if x_i is negative, so w · x will get bigger. If y = −1 and ŷ = 1, w_i will be decreased if x_i is positive or increased if x_i is negative, so w · x will get smaller.
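A minimal training-loop sketch of the rule above, assuming X is an N×D array with the bias feature already appended and y holds ±1 labels; the epoch count is an illustrative choice:

import numpy as np

def train_perceptron(X, y, epochs=100):
    # X: (N, D) array with the bias feature appended; y: +1/-1 labels.
    N, D = X.shape
    w = np.zeros(D)                              # all-zeros initialization
    for t in range(epochs):
        alpha = 1000.0 / (1000.0 + t)            # decaying learning rate
        for i in np.random.permutation(N):       # random order each epoch
            y_hat = np.sign(np.dot(w, X[i]))
            if y_hat != y[i]:                    # update only on mistakes
                w += alpha * (y[i] - y_hat) * X[i]
    return w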
Convergence of perceptron update rule Linearly separable data: converges to a perfect solution Non-separable data: converges to a minimum-error solution assuming learning rate decays as O(1/t) and examples are presented in random sequence
Implementation details. Bias (add a feature dimension with value fixed to 1) vs. no bias. Initialization of weights: all zeros vs. random. Learning rate decay function. Number of epochs (passes through the training data). Order of cycling through training examples (random).
Multi-class perceptrons. One-vs-others framework: keep a weight vector w_c for each class c. Decision rule: c = argmax_c w_c · x. Update rule: suppose an example from class c gets misclassified as c'. Update for c: w_c ← w_c + αx. Update for c': w_{c'} ← w_{c'} − αx.
Multi-class perceptrons. One-vs-others framework: keep a weight vector w_c for each class c. Decision rule: c = argmax_c w_c · x. Softmax: P(c | x) = exp(w_c · x) / Σ_{k=1}^{C} exp(w_k · x). (Figure: inputs feeding one perceptron / weight vector w_c per class.)
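A small sketch of the decision rule and softmax above; W is a hypothetical C×D matrix holding one weight vector per class:

import numpy as np

def predict_class(W, x):
    scores = W @ x                         # w_c . x for every class c
    return np.argmax(scores)               # argmax_c w_c . x

def softmax_probs(W, x):
    scores = W @ x
    scores = scores - scores.max()         # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # P(c | x)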
Visualizing linear classifiers. Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/
Differentiable perceptron. Inputs x_1, x_2, x_3, ..., x_d with weights w_1, w_2, w_3, ..., w_d. Output: σ(w · x + b). Sigmoid function: σ(t) = 1 / (1 + e^{−t}).
Update rule for differentiable perceptron. Define the total classification error or loss on the training set: E(w) = Σ_{j=1}^{N} (y_j − f_w(x_j))², where f_w(x) = σ(w · x). Update weights by gradient descent: w ← w − α ∂E/∂w.
Update rule for differentiable perceptron. Define the total classification error or loss on the training set: E(w) = Σ_{j=1}^{N} (y_j − f_w(x_j))², where f_w(x) = σ(w · x). Gradient: ∂E/∂w = −2 Σ_{j=1}^{N} (y_j − f_w(x_j)) σ'(w · x_j) x_j = −2 Σ_{j=1}^{N} (y_j − f_w(x_j)) σ(w · x_j)(1 − σ(w · x_j)) x_j. Update weights by gradient descent: w ← w − α ∂E/∂w. For a single training point, the update is: w ← w + α (y − f_w(x)) σ(w · x)(1 − σ(w · x)) x.
Update rule for differentiable perceptron. For a single training point, the update is: w ← w + α (y − f(x)) σ(w · x)(1 − σ(w · x)) x. This is called stochastic gradient descent. Compare with the update rule for the non-differentiable perceptron: w ← w + α(y − f(x))x.
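A minimal sketch of this stochastic update for a single sigmoid unit, assuming x already includes the bias feature and alpha is a fixed illustrative learning rate:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_step(w, x, y, alpha=0.1):
    f = sigmoid(np.dot(w, x))                       # current prediction f(x)
    return w + alpha * (y - f) * f * (1.0 - f) * x  # the update rule above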
Network with a hidden layer. One set of weights maps the input to the hidden units and another maps the hidden units to the output. The network can learn nonlinear functions (provided each perceptron is nonlinear and differentiable): f(x) = σ(Σ_k w_k σ(v_k · x)), writing v_k for the weights of hidden unit k and w_k for the output-layer weights.
Network with a hidden layer. Hidden layer size and network capacity. Source: http://cs231n.github.io/neural-networks-1/
Training of multi-layer networks. Find network weights to minimize the error between true and estimated labels of the training examples: E(w) = Σ_{j=1}^{N} (y_j − f_w(x_j))². Update weights by gradient descent: w ← w − α ∂E/∂w. This requires perceptrons with a differentiable nonlinearity. Sigmoid: g(t) = 1 / (1 + e^{−t}). Rectified linear unit (ReLU): g(t) = max(0, t).
Training of multi-layer networks. Find network weights to minimize the error between true and estimated labels of the training examples: E(w) = Σ_{j=1}^{N} (y_j − f_w(x_j))². Update weights by gradient descent: w ← w − α ∂E/∂w. Back-propagation: gradients are computed in the direction from the output to the input layers and combined using the chain rule. Stochastic gradient descent: compute the weight update w.r.t. one training example (or a small batch of examples) at a time, and cycle through the training examples in random order over multiple epochs.
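A rough NumPy sketch of these ideas for a one-hidden-layer network trained with stochastic gradient descent on the squared error; the names V (hidden weights) and w (output weights), the layer size, and the learning rate are illustrative choices, not the slides' notation:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train(X, y, hidden=8, alpha=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    V = rng.normal(scale=0.1, size=(hidden, D))   # input -> hidden weights
    w = rng.normal(scale=0.1, size=hidden)        # hidden -> output weights
    for _ in range(epochs):
        for i in rng.permutation(N):
            h = sigmoid(V @ X[i])                 # hidden activations
            f = sigmoid(np.dot(w, h))             # network output
            # Chain rule applied from output to input (back-propagation):
            delta_out = -2 * (y[i] - f) * f * (1 - f)
            grad_w = delta_out * h
            grad_V = np.outer(delta_out * w * h * (1 - h), X[i])
            w -= alpha * grad_w
            V -= alpha * grad_V
    return V, w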
Regularization. It is common to add a penalty on weight magnitudes to the objective function: E(f) = Σ_{i=1}^{N} (y_i − f(x_i))² + (λ/2) ‖w‖². This encourages the network to use all of its inputs a little rather than a few inputs a lot. Source: http://cs231n.github.io/neural-networks-1/
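Since the gradient of (λ/2)‖w‖² is simply λw, the penalty just adds a weight-shrinking term to each descent step ("weight decay"); a one-line sketch, with grad_data standing in for the data-loss gradient:

def regularized_step(w, grad_data, alpha=0.1, lam=1e-3):
    # lam * w is the gradient of the penalty (lam/2) * ||w||^2
    return w - alpha * (grad_data + lam * w)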
Multi-Layer Network Demo: http://playground.tensorflow.org/
Neural networks: Pros and cons. Pros: flexible and general function approximation framework; can build extremely powerful models by adding more layers. Cons: hard to analyze theoretically (e.g., training is prone to local optima); a huge amount of training data and computing power is required to get good performance; the space of implementation choices is huge (network architectures, parameters).
Support vector machines When the data is linearly separable, there may be more than one separator (hyperplane) Which separator is best?
Support vector machines. Find the hyperplane that maximizes the margin between the positive and negative examples. (Figure: support vectors and margin.) C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
SVM parameter learning. Separable data: min_{w,b} (1/2)‖w‖² subject to y_i(w · x_i + b) ≥ 1 (maximize the margin; classify the training data correctly). Non-separable data: min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{n} max(0, 1 − y_i(w · x_i + b)) (maximize the margin; minimize classification mistakes).
SVM parameter learning. min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{n} max(0, 1 − y_i(w · x_i + b)). (Figure: separating hyperplane with the margin and the +1 / 0 / −1 level sets.) Demo: http://cs.stanford.edu/people/karpathy/svms/demo
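One simple (though not the only) way to minimize this objective is subgradient descent on the hinge loss; a hedged NumPy sketch, with the learning rate and epoch count as illustrative choices:

import numpy as np

def train_linear_svm(X, y, C=1.0, alpha=1e-3, epochs=200):
    # Minimizes (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                   # examples violating the margin
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b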
Nonlinear SVMs. General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x).
Nonlinear SVMs. Linearly separable dataset in 1D. Non-separable dataset in 1D. We can map the data to a higher-dimensional space where it becomes separable. Slide credit: Andrew Moore
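The slide's axes suggest the lifting x → (x, x²); a tiny illustrative example of that map:

import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])   # 1D inputs
lifted = np.column_stack([x, x ** 2])        # each point becomes (x, x^2)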
The kernel trick. General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable. The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(x, y) = φ(x) · φ(y) (to be valid, the kernel function must satisfy Mercer's condition).
The kernel trick. Linear SVM decision function: w · x + b = Σ_i α_i y_i (x_i · x) + b, where w is the learned weight vector and the x_i are the support vectors. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
The kernel trick. Linear SVM decision function: w · x + b = Σ_i α_i y_i (x_i · x) + b. Kernel SVM decision function: Σ_i α_i y_i φ(x_i) · φ(x) + b = Σ_i α_i y_i K(x_i, x) + b. This gives a nonlinear decision boundary in the original feature space. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
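A sketch of evaluating the kernel decision function, where support_vectors, alphas, labels, and b are placeholders for quantities produced by training:

def decision(x, support_vectors, alphas, labels, b, kernel):
    # sum_i alpha_i * y_i * K(x_i, x) + b
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alphas, labels, support_vectors)) + b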
Polynomial kernel: K(x, y) = (c + x · y)^d
Gaussian kernel. Also known as the radial basis function (RBF) kernel: K(x, y) = exp(−(1/σ²) ‖x − y‖²). (Plot: K(x, y) as a function of ‖x − y‖.)
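The RBF kernel written out, plus a hedged scikit-learn usage example (sklearn's gamma corresponds to 1/σ² in this parameterization); the toy data is purely illustrative:

import numpy as np
from sklearn import svm

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)   # exp(-||x-y||^2 / sigma^2)

# Hypothetical toy data: two classes of 2D points.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_train = np.array([1, 1, -1, -1])
clf = svm.SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_train, y_train)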
Gaussian kernel. (Figure: decision boundary with support vectors, SVs.)
SVMs: Pros and cons. Pros: the kernel-based framework is very powerful and flexible; training is convex optimization, so the globally optimal solution can be found; amenable to theoretical analysis; SVMs work very well in practice, even with very small training sample sizes. Cons: no direct multi-class SVM, must combine two-class SVMs (e.g., with one-vs-others); computation and memory costs (especially for nonlinear SVMs).
Neural networks vs. SVMs (a.k.a. deep vs. shallow learning)