Cheng Soon Ong & Christian Walder, Research Group and College of Engineering and Computer Science, Canberra, February – June 2018.
Outline: Overview, Introduction, Linear Algebra, Probability, Linear Regression 1, Linear Regression 2, Linear Classification 1, Linear Classification 2, Kernel Methods, Sparse Kernel Methods, Mixture Models and EM 1, Mixture Models and EM 2, Neural Networks 1, Neural Networks 2, Principal Component Analysis, Autoencoders, Graphical Models 1, Graphical Models 2, Graphical Models 3, Sampling, Sequential Data 1, Sequential Data 2.
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part X: Sparse Kernel Methods
Sparse Kernel Methods - Motivation
Nonlinear kernels extended our toolbox of methods considerably: wherever an inner product is used in an algorithm, we can replace it with a kernel.
Kernels act as a kind of similarity measure and can be defined over graphs, sets, strings, and documents.
But the kernel matrix is a square matrix with dimensions equal to the number of data points N, and in order to calculate it the kernel function must be evaluated for all pairs of training inputs.
Goal: learning algorithms where, for prediction, the kernel is evaluated only at a subset of the training data points.
Maximum Margin Classifiers
Return to two-class classification with a linear model
  y(x) = w^T φ(x) + b,
where φ(x) is some fixed feature space mapping and b is the bias.
Training data are N input vectors x_1, ..., x_N with corresponding targets t_1, ..., t_N, where t_n ∈ {−1, +1}.
The class of a new point is predicted as sign(y(x)).
Assume there exists a linear decision boundary; that is, there exist w and b such that
  t_n y(x_n) > 0,  n = 1, ..., N.
The Multiple Separator Problem
There may exist many solutions w and b for which the linear classifier perfectly separates the two classes!
Maximum Margin Classifiers
There may exist many solutions w and b for which the linear classifier perfectly separates the two classes.
The perceptron algorithm can find a solution, but which one it finds depends on the initial choice of parameters.
What is the decision boundary which results in the best generalisation (smallest generalisation error)?
Maximum Margin Classifiers
The margin is the smallest distance between the decision boundary and any of the samples. (We will see later why y = ±1 on the margin.)
[Figure: decision boundary y = 0 with margin boundaries y = −1 and y = +1; the margin is indicated.]
Maximum Margin Classifiers
Support Vector Machines (SVMs) choose the decision boundary which maximises the margin.
[Figure: maximum-margin decision boundary y = 0 with margin boundaries y = −1 and y = +1.]
Reminder - Linear Classification
For a linear model with weights w and bias b, the decision boundary is w^T z + b = 0.
Any x has a projection x_proj on the boundary, and x − x_proj = λ w, since w is orthogonal to the decision boundary.
But we also have
  w^T x − w^T x_proj = w^T x + b = λ w^T w.
[Figure: a point x and its projection x_proj onto the hyperplane w^T z + b = 0.]
Reminder - Linear Classification
So
  ‖x − x_proj‖ = λ ‖w‖ = (w^T x + b) / ‖w‖.
The distance of x from the decision boundary is therefore
  (w^T x + b) / ‖w‖ = y(x) / ‖w‖.
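As a quick numerical illustration of this formula (the numbers and variable names below are made up, not from the slides), the signed distance can be checked directly:

```python
# Sketch: signed distance of a point from the hyperplane w^T x + b = 0.
import numpy as np

w = np.array([3.0, 4.0])   # assumed weight vector, so ||w|| = 5
b = -5.0                   # assumed bias
x = np.array([2.0, 3.0])   # an arbitrary query point

y_x = w @ x + b                        # y(x) = w^T x + b = 13
distance = y_x / np.linalg.norm(w)     # signed distance = 13 / 5 = 2.6
print(distance)
```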
Maximum Margin Classifiers
Support Vector Machines choose the decision boundary which maximises the smallest distance to samples in both classes.
We calculated the distance of a point x from the hyperplane y(x) = 0 as y(x) / ‖w‖.
Perfect classification means t_n y(x_n) > 0 for all n.
Thus, the distance of x_n from the decision boundary is
  t_n y(x_n) / ‖w‖ = t_n (w^T φ(x_n) + b) / ‖w‖.
Maximum Margin Classifiers
Support Vector Machines choose the decision boundary which maximises the smallest distance to samples in both classes.
For the maximum margin solution, solve
  arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }.
How can we solve this?
Maximum Margin Classifiers
Maximum margin solution
  arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }
or
  arg max_{w,b,γ} { γ / ‖w‖ }  s.t.  t_n (w^T φ(x_n) + b) ≥ γ,  n = 1, ..., N.
By rescaling any (w, b) by 1/γ, we don't change the answer! We can therefore assume that γ = 1.
The canonical representation for the decision hyperplane is
  t_n (w^T φ(x_n) + b) ≥ 1,  n = 1, ..., N.
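The rescaling argument can be sanity-checked numerically; the snippet below (made-up data, illustrative names) confirms that the margin objective is unchanged when (w, b) is multiplied by a constant, which is what allows fixing γ = 1:

```python
# Sketch: the objective (1/||w||) min_n t_n (w^T x_n + b) is invariant to rescaling (w, b).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
w, b = np.array([1.0, -2.0]), 0.3
t = np.sign(X @ w + b)                 # labels chosen so that (w, b) separates the data

def margin_objective(w, b):
    return np.min(t * (X @ w + b)) / np.linalg.norm(w)

print(margin_objective(w, b), margin_objective(10 * w, 10 * b))   # identical values
```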
Maximum Margin Classifiers
Maximum margin solution
  arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }.
Transformed once:
  arg max_{w,b} { 1/‖w‖ }  s.t.  t_n (w^T φ(x_n) + b) ≥ 1.
Transformed again:
  arg min_{w,b} { (1/2) ‖w‖² }  s.t.  t_n (w^T φ(x_n) + b) ≥ 1.
Lagrange Multipliers
Lagrange Multipliers
Find the stationary points of a function f(x_1, x_2) subject to one or more constraints on the variables x_1 and x_2, written in the form g(x_1, x_2) = 0.
Direct approach:
1. Solve g(x_1, x_2) = 0 for one of the variables to get x_2 = h(x_1).
2. Insert the result into f(x_1, x_2) to get a function of one variable, f(x_1, h(x_1)).
3. Find the stationary point(s) x_1⋆ of f(x_1, h(x_1)), with corresponding value x_2⋆ = h(x_1⋆).
Problems: finding x_2 = h(x_1) may be hard, and the symmetry between the variables x_1 and x_2 is lost.
Lagrange Multipliers
To solve
  max_x f(x)  s.t.  g(x) = 0,
introduce the Lagrangian function
  L(x, λ) = f(x) + λ g(x),
from which we get the stationarity conditions
  ∇_x L(x, λ) = ∇f(x) + λ ∇g(x) = 0
and the constraint itself
  ∂L(x, λ)/∂λ = g(x) = 0.
These are D equations resulting from ∇_x L(x, λ) and one equation from ∂L(x, λ)/∂λ, together determining x⋆ and λ.
Lagrange Multipliers - Example
Given f(x_1, x_2) = 1 − x_1² − x_2², subject to the constraint g(x_1, x_2) = x_1 + x_2 − 1 = 0.
Define the Lagrangian function
  L(x, λ) = 1 − x_1² − x_2² + λ (x_1 + x_2 − 1).
A stationary solution with respect to x_1, x_2, and λ must satisfy
  −2 x_1 + λ = 0
  −2 x_2 + λ = 0
  x_1 + x_2 − 1 = 0.
Therefore (x_1⋆, x_2⋆) = (1/2, 1/2) and λ = 1.
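The stationarity conditions of this example can also be verified symbolically, for instance with sympy (assumed available; a sketch, not part of the slides):

```python
# Sketch: solve the stationarity conditions of L(x, lambda) symbolically.
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda')
f = 1 - x1**2 - x2**2
g = x1 + x2 - 1
L = f + lam * g

# Stationarity in x1 and x2 plus the constraint g = 0
sol = sp.solve([sp.diff(L, x1), sp.diff(L, x2), g], [x1, x2, lam], dict=True)
print(sol)   # [{x1: 1/2, x2: 1/2, lambda: 1}]
```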
Lagrange Multipliers - Example
Given f(x_1, x_2) = 1 − x_1² − x_2², subject to the constraint g(x_1, x_2) = x_1 + x_2 − 1 = 0.
Lagrangian: L(x, λ) = 1 − x_1² − x_2² + λ (x_1 + x_2 − 1); stationary point (x_1⋆, x_2⋆) = (1/2, 1/2).
[Figure: the constraint curve g(x_1, x_2) = 0 in the (x_1, x_2) plane, with the stationary point (x_1⋆, x_2⋆) marked.]
Lagrange Multipliers for Inequality Constraints
To solve
  max_x f(x)  s.t.  g(x) ≥ 0,
define the Lagrangian
  L(x, λ) = f(x) + λ g(x).
Solve for x and λ subject to the constraints (Karush-Kuhn-Tucker or KKT conditions)
  g(x) ≥ 0
  λ ≥ 0
  λ g(x) = 0.
Lagrange Multipliers for the General Case
Maximise f(x) subject to the constraints g_j(x) = 0 for j = 1, ..., J, and h_k(x) ≥ 0 for k = 1, ..., K.
Define the Lagrange multipliers {λ_j} and {μ_k}, and the Lagrangian
  L(x, {λ_j}, {μ_k}) = f(x) + Σ_{j=1}^J λ_j g_j(x) + Σ_{k=1}^K μ_k h_k(x).
Solve for x, {λ_j}, and {μ_k} subject to the constraints (Karush-Kuhn-Tucker or KKT conditions)
  h_k(x) ≥ 0
  μ_k ≥ 0
  μ_k h_k(x) = 0
for k = 1, ..., K.
For minimisation of f(x), change the sign in front of the Lagrange multipliers.
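As a quick check of these conditions on a made-up one-dimensional problem: maximise f(x) = 1 − x² subject to h(x) = x − 1 ≥ 0. The Lagrangian is L(x, μ) = 1 − x² + μ(x − 1), and stationarity gives μ = 2x. If μ = 0, the unconstrained maximiser x = 0 violates x − 1 ≥ 0, so the constraint must be active: x⋆ = 1 with μ = 2 ≥ 0, and all three KKT conditions hold.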
Maximum Margin Classifiers
Maximum Margin Classifiers
Quadratic Programming (QP) problem: minimise a quadratic function subject to linear constraints.
  arg min_{w,b} { (1/2) ‖w‖² }  s.t.  t_n (w^T φ(x_n) + b) ≥ 1.
Introduce Lagrange multipliers {a_n ≥ 0} to get the Lagrangian
  L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }.
(Note the negative sign in front of the multipliers, since this is a minimisation problem.)
Maximum Margin Classifiers
Lagrangian
  L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }.
Setting the derivatives with respect to w and b to zero gives
  w = Σ_{n=1}^N a_n t_n φ(x_n)      0 = Σ_{n=1}^N a_n t_n.
Dual representation (again a QP problem): maximise
  L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)
subject to
  Σ_{n=1}^N a_n t_n = 0
  a_n ≥ 0,  n = 1, ..., N.
Maximum Margin Classifiers
Dual representation (again a QP problem): maximise
  L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)
subject to
  Σ_{n=1}^N a_n t_n = 0
  a_n ≥ 0,  n = 1, ..., N.
Vector form: maximise
  L(a) = 1^T a − (1/2) a^T Q a
subject to
  a^T t = 0
  a_n ≥ 0,  n = 1, ..., N,
where Q_nm = t_n t_m k(x_n, x_m).
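As an illustration of the dual in vector form, here is a minimal sketch that builds Q and hands the dual to a generic constrained optimiser (scipy assumed available; the function names, the Gaussian kernel, and the solver choice are assumptions, and a real implementation would use a dedicated QP or SMO solver):

```python
# Sketch: hard-margin dual, maximise 1^T a - 0.5 a^T Q a  s.t.  a^T t = 0, a_n >= 0.
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, Y, gamma=0.45):
    # k(x, x') = exp(-gamma ||x - x'||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_dual(X, t, kernel=rbf_kernel):
    N = len(t)
    Q = (t[:, None] * t[None, :]) * kernel(X, X)     # Q_nm = t_n t_m k(x_n, x_m)

    fun = lambda a: 0.5 * a @ Q @ a - a.sum()        # minimise the negative dual
    jac = lambda a: Q @ a - np.ones(N)
    cons = [{'type': 'eq', 'fun': lambda a: a @ t, 'jac': lambda a: t}]
    bounds = [(0.0, None)] * N                       # a_n >= 0 (hard margin)

    res = minimize(fun, np.zeros(N), jac=jac, bounds=bounds,
                   constraints=cons, method='SLSQP')
    return res.x                                     # the Lagrange multipliers a
```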
Maximum Margin Classifiers - Prediction
Predict via
  y(x) = w^T φ(x) + b = Σ_{n=1}^N a_n t_n k(x, x_n) + b.
The Karush-Kuhn-Tucker (KKT) conditions are
  a_n ≥ 0
  t_n y(x_n) − 1 ≥ 0
  a_n { t_n y(x_n) − 1 } = 0.
Therefore, either a_n = 0 or t_n y(x_n) − 1 = 0. If a_n = 0, the point makes no contribution to the prediction!
After training, use only the set S of points for which a_n > 0 (and therefore t_n y(x_n) = 1) holds. S contains only the support vectors.
Maximum Margin Classifiers - Support Vectors
S contains only the support vectors.
[Figure: decision and margin boundaries for a two-class problem using Gaussian kernel functions; the support vectors are highlighted.]
Maximum Margin Classifiers - Support Vectors
Now for the bias b. Observe that for the support vectors, t_n y(x_n) = 1. Therefore
  t_n ( Σ_{m∈S} a_m t_m k(x_n, x_m) + b ) = 1,
and we can use any support vector to calculate b.
Numerically more stable: multiply by t_n, observe t_n² = 1, and average over all support vectors:
  b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) ).
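Continuing the earlier sketch (helper names are illustrative, not from the slides), the support set S, the bias b, and a prediction function that uses only the support vectors could look like this:

```python
# Sketch: bias and prediction from the support vectors only.
import numpy as np

def support_vector_predictor(X, t, a, kernel, tol=1e-6):
    S = a > tol                                  # support vectors: a_n > 0
    coef = a[S] * t[S]                           # a_m t_m for m in S
    K_SS = kernel(X[S], X[S])
    # b = (1/N_S) sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )
    b = np.mean(t[S] - K_SS @ coef)

    def predict(X_new):
        # y(x) = sum_{m in S} a_m t_m k(x, x_m) + b, evaluated at the support vectors only
        return np.sign(kernel(X_new, X[S]) @ coef + b)

    return predict, b
```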
SVMs: Summary
Maximise the dual objective
  L(a) = 1^T a − (1/2) a^T Q a
subject to
  a^T t = 0
  a_n ≥ 0,  n = 1, ..., N.
Make predictions via
  y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b.
The prediction depends only on the data points with a_n > 0.
Non-Separable Case
Non-Separable Data
What if the training data in the feature space are not linearly separable?
Can we find a separator that makes the fewest possible errors?
Non-Separable Data
What if the training data in the feature space are not linearly separable?
Allow some data points to be on the wrong side of the decision boundary, but apply a penalty which increases with the distance from the decision boundary. Assume the penalty increases linearly with that distance.
Introduce a slack variable ξ_n ≥ 0 for each data point n:
  ξ_n = 0 : the data point is correctly classified and on the margin boundary or beyond
  0 < ξ_n < 1 : the data point is inside the margin, but on the correct side of the decision boundary
  ξ_n = 1 : the data point is on the decision boundary
  ξ_n > 1 : the data point is misclassified.
Non-Separable Data
The constraints of the separable classification,
  t_n y(x_n) ≥ 1,  n = 1, ..., N,
now change to
  t_n y(x_n) ≥ 1 − ξ_n,  n = 1, ..., N.
In sum, we have the constraints
  ξ_n ≥ 0
  ξ_n ≥ 1 − t_n y(x_n).
Non-Separable Data
Now find
  min_{w,ξ} C Σ_{n=1}^N ξ_n + (1/2) ‖w‖²  s.t.  ξ_n ≥ 0,  ξ_n ≥ 1 − t_n y(x_n),
where C controls the trade-off between the slack variable penalty and the margin.
This minimises the total slack associated with all data points.
Any misclassified point contributes ξ_n > 1; therefore Σ_n ξ_n is an upper bound on the number of misclassified points.
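The bound mentioned above is easy to see numerically: each misclassified point has t_n y(x_n) < 0 and hence ξ_n > 1. A small made-up check:

```python
# Sketch: slack variables xi_n = max(0, 1 - t_n y(x_n)) and the misclassification bound.
import numpy as np

y_scores = np.array([2.3, 0.4, -0.2, -1.5])   # assumed values of y(x_n)
t = np.array([+1, +1, +1, -1])                # assumed labels

xi = np.maximum(0.0, 1.0 - t * y_scores)      # slack for each point
n_errors = np.sum(t * y_scores < 0)           # number of misclassified points
print(xi, xi.sum() >= n_errors)               # sum of slacks upper-bounds the error count
```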
Non-Separable Data
The Lagrangian with the Lagrange multipliers a = (a_1, ..., a_N)^T and μ = (μ_1, ..., μ_N)^T is now
  L(w, b, ξ, a, μ) = (1/2) ‖w‖² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n { t_n y(x_n) − 1 + ξ_n } − Σ_{n=1}^N μ_n ξ_n.
The KKT conditions for n = 1, ..., N are
  a_n ≥ 0
  t_n y(x_n) − 1 + ξ_n ≥ 0
  a_n ( t_n y(x_n) − 1 + ξ_n ) = 0
  μ_n ≥ 0
  ξ_n ≥ 0
  μ_n ξ_n = 0.
Non-Separable Data
The dual Lagrangian after eliminating w, b, ξ, and μ is then: maximise
  L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)
subject to
  Σ_{n=1}^N a_n t_n = 0
  0 ≤ a_n ≤ C,  n = 1, ..., N.
The only change from the separable case is the box constraint on a_n via the parameter C.
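In the dual-solver sketch given earlier, this means the only code change is the box constraint on a; a soft-margin variant (again illustrative, here with a precomputed kernel matrix) might look like this:

```python
# Sketch: soft-margin dual; identical to the hard-margin case except 0 <= a_n <= C.
import numpy as np
from scipy.optimize import minimize

def fit_dual_soft(K, t, C=1.0):
    # K: precomputed kernel matrix, t: labels in {-1, +1}, C: slack penalty
    N = len(t)
    Q = (t[:, None] * t[None, :]) * K
    fun = lambda a: 0.5 * a @ Q @ a - a.sum()
    jac = lambda a: Q @ a - np.ones(N)
    cons = [{'type': 'eq', 'fun': lambda a: a @ t, 'jac': lambda a: t}]
    bounds = [(0.0, C)] * N                   # the box constraint is the only change
    res = minimize(fun, np.zeros(N), jac=jac, bounds=bounds,
                   constraints=cons, method='SLSQP')
    return res.x
```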
Non-Separable Data
Equivalent formulation by Schölkopf et al. (2000), the ν-SVM: maximise
  L(a) = −(1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)
subject to
  Σ_{n=1}^N a_n t_n = 0
  0 ≤ a_n ≤ 1/N,  n = 1, ..., N
  Σ_{n=1}^N a_n ≥ ν,
where ν is both
1. an upper bound on the fraction of margin errors, and
2. a lower bound on the fraction of support vectors.
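For completeness, the ν-formulation is available off the shelf, e.g. in scikit-learn (assumed installed); the made-up example below mirrors the setting of the next slide (Gaussian kernel, γ = 0.45) and checks the support-vector bound:

```python
# Sketch: nu-SVM via scikit-learn on noisy, non-separable synthetic data.
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = np.sign(X[:, 0] + 0.5 * rng.normal(size=100))     # noisy labels -> not separable

clf = NuSVC(nu=0.5, kernel='rbf', gamma=0.45).fit(X, t)
frac_sv = len(clf.support_) / len(X)
print(frac_sv, frac_sv >= 0.5)    # nu is a lower bound on the fraction of support vectors
```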
Non-Separable Data
[Figure: the ν-SVM algorithm using Gaussian kernels exp(−γ ‖x − x′‖²) with γ = 0.45, applied to a non-separable data set in two dimensions. Support vectors are indicated by circles.]
Support Vector Machines - Limitations
Outputs are decisions, not posterior probabilities.
The extension to classification with more than two classes is problematic.
There is a complexity parameter C (or ν) which must be found (e.g. via cross-validation).
Predictions are expressed as linear combinations of kernel functions that are centred on the training points.
The kernel matrix is required to be positive definite.
Optimization View
Recall from decision theory that we want to minimise the expected loss; for the misclassification (0/1) loss this is the probability of a mistake,
  p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx.
For finite data, we minimise the negative log likelihood, appropriately regularised.
More generally, we minimise the regularised empirical risk (defined by a loss function):
  R_emp(w, X, t) + Ω(w) = Σ_{n=1}^N l(x_n, t_n; w) + Ω(w),
where the sum is the loss per example and Ω(w) is the regulariser.
Logistic Regression Loss Function?
Consider the labels t_n ∈ {−1, +1}. Recalling the definition of the sigmoid σ(·),
  p(t_n = +1 | x_n) = σ(w^T φ(x_n)) = 1 / (1 + exp(−w^T φ(x_n)))
  p(t_n = −1 | x_n) = 1 − σ(w^T φ(x_n)) = 1 / (1 + exp(+w^T φ(x_n))),
which can be compactly written as
  p(t_n | x_n) = σ(t_n w^T φ(x_n)) = 1 / (1 + exp(−t_n w^T φ(x_n))).
Assuming independence, the negative log likelihood is
  Σ_{n=1}^N log( 1 + exp(−t_n w^T φ(x_n)) ).
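Written as code (a sketch with assumed names), this negative log likelihood is just a sum of softplus terms of the negative margins:

```python
# Sketch: logistic (log) loss for labels t_n in {-1, +1}.
import numpy as np

def log_loss(w, Phi, t):
    # sum_n log(1 + exp(-t_n w^T phi(x_n))), computed stably via logaddexp
    margins = t * (Phi @ w)
    return np.logaddexp(0.0, -margins).sum()
```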
SVM Loss Function?
Soft-margin SVM:
  min_{w,ξ} C Σ_{n=1}^N ξ_n + (1/2) ‖w‖²  s.t.  ξ_n ≥ 0,  ξ_n ≥ 1 − t_n y(x_n).
Observe the equivalent unconstrained loss: for fixed y,
  min_ξ { ξ : ξ ≥ 0, ξ ≥ 1 − y } = max{0, 1 − y}.
We denote the hinge loss by [1 − y]_+ = max{0, 1 − y}.
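The corresponding hinge loss is equally short in code (again a sketch with assumed names):

```python
# Sketch: hinge loss [1 - t_n w^T phi(x_n)]_+ summed over the data.
import numpy as np

def hinge_loss(w, Phi, t):
    margins = t * (Phi @ w)
    return np.maximum(0.0, 1.0 - margins).sum()
```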
Linear Regression (squared loss):
  (1/2) Σ_{n=1}^N { t_n − w^T φ(x_n) }² + (λ/2) Σ_{d=1}^D w_d² = (1/2) (t − Φw)^T (t − Φw) + (λ/2) w^T w.
Logistic Regression (log loss), the regularised negative log likelihood:
  Σ_{n=1}^N log( 1 + exp(−t_n w^T φ(x_n)) ) + (λ/2) w^T w.
Support Vector Machine (hinge loss):
  C Σ_{n=1}^N [ 1 − t_n w^T φ(x_n) ]_+ + (1/2) w^T w.
[Figure: E(z) plotted against z = t y(x) for z ∈ [−2, 2]; hinge loss in blue, cross-entropy loss in red.]
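A figure like this can be reproduced with a few lines of matplotlib (assumed available); following Bishop, the cross-entropy curve is rescaled by 1/ln 2 so that it passes through the point (0, 1) like the hinge loss:

```python
# Sketch: hinge loss vs. rescaled cross-entropy loss as functions of z = t y(x).
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-2, 2, 400)
hinge = np.maximum(0.0, 1.0 - z)
xent = np.log(1.0 + np.exp(-z)) / np.log(2.0)   # log loss rescaled by 1/ln 2

plt.plot(z, hinge, 'b', label='hinge loss')
plt.plot(z, xent, 'r', label='cross-entropy (rescaled)')
plt.xlabel('z = t y(x)')
plt.ylabel('E(z)')
plt.legend()
plt.show()
```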