Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science
Canberra, February - June 2018

Outline: Overview, Introduction, Linear Algebra, Probability, Linear Regression 1, Linear Regression 2, Linear Classification 1, Linear Classification 2, Kernel Methods, Sparse Kernel Methods, Mixture Models and EM 1, Mixture Models and EM 2, Neural Networks 1, Neural Networks 2, Principal Component Analysis, Autoencoders, Graphical Models 1, Graphical Models 2, Graphical Models 3, Sampling, Sequential Data 1, Sequential Data 2.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part XI: Kernel Methods

Original Input versus Feature Space

Used direct input x until now. All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x). Example: use two Gaussian basis functions centred at the green crosses in the input space.

[Figure: data in the original input space (x1, x2) and in the feature space (φ1, φ2) defined by the two Gaussian basis functions.]

Original Input versus Feature Space

Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space. Classes which are NOT linearly separable in the input space can become linearly separable in the feature space. BUT: if classes overlap in the input space, they will also overlap in the feature space. Nonlinear features φ(x) cannot remove the overlap, but they may increase it!

[Figure: decision boundary shown in the input space (x1, x2) and in the feature space (φ1, φ2).]

Where are we?

Basis function models (regression, classification). Flexible basis function models (neural networks) after the semester break.

Where are we going?

Why not use all of the training data to make predictions for the test inputs? Basic ideas:
Continuity: targets mostly don't change abruptly.
Similarity: each training pair (input, target) tells us something about the possible targets in the neighbourhood of that input.
Kernels formalise these ideas.

How are we going there?

Kernels for density estimation: nonparametric density estimation. Kernels for classification: basis functions and the kernel trick, constructing kernels.

Warning: the term "kernel" is also used for the set of all vectors that a matrix A maps to zero (the null space). This is a different concept; don't get confused!


Density Estimation

Suppose we observe data points $\{x_n\}_{n=1}^N$, e.g. just N real numbers. Suppose we believe these are drawn independently from some distribution p(x), e.g. p(x) is Gaussian with unknown mean and variance. Density estimation problem: estimate p(x) from the data.

Nonparametric Density Estimation - Histogram

Partition the space into bins of width $\Delta_i$. Count the number $n_i$ of samples falling into bin $i$. Normalise:
$$p_i = \frac{n_i}{N \Delta_i}$$

[Figure: histograms of 50 data points generated from the distribution shown by the green curve, for common bin widths Δ = 0.04, 0.08, 0.25.]
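A minimal numpy sketch of this histogram estimator on synthetic one-dimensional data; the data, bin widths, and function names here are illustrative assumptions rather than the lecture's own code:

```python
# Histogram density estimation: p_i = n_i / (N * Delta) on equal-width bins.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=0.15, size=50)   # 50 one-dimensional samples

def histogram_density(x, query, bin_width):
    """Estimate p(query) from equal-width bins of width `bin_width`."""
    edges = np.arange(x.min(), x.max() + bin_width, bin_width)
    counts, _ = np.histogram(x, bins=edges)
    densities = counts / (len(x) * bin_width)        # n_i / (N * Delta)
    idx = np.clip(np.searchsorted(edges, query, side="right") - 1,
                  0, len(densities) - 1)
    return densities[idx]

for delta in (0.04, 0.08, 0.25):
    print(delta, histogram_density(x, 0.5, delta))
```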

Nonparametric Density Estimation - Histogram

Advantages: the data can be discarded after calculating the $p_i$; the algorithm can be applied to sequentially arriving data.
Disadvantages: dependency on the bin width $\Delta_i$; discontinuities due to the bin edges; exponential scaling with the dimensionality D of the data (we need $M^D$ bins for D dimensions with M bins per dimension).

Nonparametric Density Estimation - Refined

Draw data from some unknown probability distribution p(x) in a D-dimensional space. Consider a small region R containing x. The probability mass associated with this region is
$$P = \int_R p(x)\,dx.$$
Given a data set of N observations drawn from p(x), the total number K of points found inside R is distributed according to the binomial distribution
$$\mathrm{Bin}(K \mid N, P) = \frac{N!}{K!\,(N-K)!}\, P^K (1-P)^{N-K}.$$
Expectation: $\mathbb{E}[K/N] = P$. Variance: $\mathrm{var}[K/N] = P(1-P)/N$.

Nonparametric Density Estimation - Refined

Expectation: $\mathbb{E}[K/N] = P$. Variance: $\mathrm{var}[K/N] = P(1-P)/N$. For large N the distribution is sharply peaked, and therefore $K \approx NP$. Assuming also that the region has volume V and is small enough for p(x) to be roughly constant over it, then $P \approx p(x)V$.

Combining two contradictory assumptions, namely that region R is small enough for p(x) to be roughly constant, yet large enough that enough points K fall into it to give a sharply peaked binomial distribution, we obtain
$$p(x) \approx \frac{K}{NV}.$$

Nonparametric Density Estimation - Refined

Two ways to exploit $p(x) \approx K/(NV)$:
1. Fix K and determine the volume V from the data: K-nearest-neighbours density estimation.
2. Fix V and determine K from the data: kernel density estimation.

Nonparametric Density Estimation - Nearest Neighbour

Fix K and find an appropriate value for V. Consider a small sphere around x and allow its radius to grow until it contains exactly K data points. Calculate the density by $p(x) \approx \frac{K}{NV}$.

[Figure: nearest-neighbour density model for K = 1, 5, 30.]
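A corresponding sketch of K-nearest-neighbour density estimation in one dimension, again on assumed synthetic data:

```python
# K-NN density estimation: p(x) ~= K / (N * V), with V grown to contain K points.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=0.15, size=50)

def knn_density(x, query, K):
    """Length of the smallest interval around `query` holding exactly K samples."""
    distances = np.sort(np.abs(x - query))
    radius = distances[K - 1]          # distance to the K-th nearest sample
    volume = 2.0 * radius              # "volume" of a 1-D ball of that radius
    return K / (len(x) * volume)

for K in (1, 5, 30):
    print(K, knn_density(x, 0.5, K))
```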

Nonparametric Density Estimation - Parzen Estimator

Define the region R to be a small hypercube around x. Define the Parzen window (kernel function)
$$k(u) = \begin{cases} 1, & |u_i| \le 1/2,\ i = 1,\dots,D \\ 0, & \text{otherwise.} \end{cases}$$
The total number of data points inside the hypercube centred at x with side length h is
$$K = \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right).$$
The density estimate for p(x) is then
$$p(x) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{x - x_n}{h}\right).$$
Interpretation: a sum over N cubes, each centred at one of the $x_n$.

Nonparametric Density Estimation - Parzen Estimator

Remaining problem: discontinuities because of the hypercube (a point is either in or out). Choose a smoother kernel function instead (and normalise it correctly). A common choice is the Gaussian kernel:
$$p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}.$$
Any other kernel function k(u) can be chosen, provided it obeys
$$k(u) \ge 0, \qquad \int k(u)\,du = 1.$$

Nonparametric Density Estimation - Parzen Estimator

Gaussian kernel:
$$p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}.$$
The bandwidth h controls the trade-off between sensitivity to noise and over-smoothing.

[Figure: kernel density model with Gaussian kernel for h = 0.005, 0.07, 0.2.]
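And the Gaussian (Parzen) version, evaluated for the three bandwidths mentioned above; the data and query point are illustrative assumptions:

```python
# Gaussian kernel density estimation: p(x) = (1/N) sum_n N(x | x_n, h^2), D = 1.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=0.15, size=50)   # N samples, one-dimensional

def gaussian_kde(x, query, h):
    sq_dist = (query - x) ** 2
    return np.mean(np.exp(-sq_dist / (2 * h**2)) / np.sqrt(2 * np.pi * h**2))

for h in (0.005, 0.07, 0.2):
    print(h, gaussian_kde(x, 0.5, h))
```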


The Role of Training Data

Parametric methods: learn the model parameters w from the training data t, then discard the training data.
Nonparametric methods: use the training data directly for prediction, e.g. k-nearest neighbours uses the k closest points from the training set for classification.
Kernel methods: base the prediction on a linear combination of kernel functions evaluated at the training data.

Features

A feature is a measurable property of a phenomenon being observed, or any derived property thereof.
Raw features: the original data.
Derived features: mappings of the original features to some other space (possibly high- or infinite-dimensional, e.g. basis functions).
Feature selection: which features matter for the problem at hand? (Redundant features; problem dependent.)
Feature extraction: can we combine the important features into a smaller set of new features? (Compact representation versus the ability to explain to a human.)

Very simple example - XOR

 x1   x2   y = x1 XOR x2
 -1   -1         1
 -1    1        -1
  1   -1        -1
  1    1         1

[Figure: the four points plotted in the (x1, x2) plane.]

Not linearly separable (why?). Raw features: {(-1, -1), (-1, 1), (1, -1), (1, 1)}.

Very simple example - XOR

 x1   x2   x_new = x1 x2   y = x1 XOR x2
 -1   -1          1               1
 -1    1         -1              -1
  1   -1         -1              -1
  1    1          1               1

[Figure: the data plotted against the new feature x_new = x1 x2.]

Feature extraction: x_new = x1 x2. The data is now separable! All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
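A tiny sketch of this feature extraction in plain numpy (variable names are illustrative): the single product feature already separates XOR with a trivial sign rule.

```python
# XOR becomes linearly separable after the product feature x_new = x1 * x2.
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([1, -1, -1, 1])          # targets as in the table above

x_new = X[:, 0] * X[:, 1]             # derived feature
predictions = np.sign(x_new)          # a linear rule on the new feature
print(predictions)                    # matches y: [ 1 -1 -1  1]
```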

Kernel methods in one slide

Consider a labelled training set $\{x_i, t_i\}_{i=1}^N$. On a new point x, we predict
$$y(x) = \sum_{i=1}^{N} \alpha_i\, K(x, x_i)\, t_i,$$
where the $\{\alpha_i\}_{i=1}^N$ are weights to be determined and $K(\cdot,\cdot)$ is a kernel function. The kernel function measures the similarity between any two examples. The prediction is a weighted average of the training targets, with weights that depend on the similarity of x to each training example.

Dual Representation - Intuition

Suppose we perform linear regression with a feature matrix Φ and target vector t, where
$$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N) \end{pmatrix} = \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_N)^T \end{pmatrix}.$$
Recall that the optimal (regularised) w is
$$w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t.$$
Thus the prediction for the feature vector of a new point x is
$$y(x) = \phi(x)^T w = \phi(x)^T (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t.$$

Dual Representation - Intuition

Prediction with the optimal (regularised) w:
$$y(x) = \phi(x)^T w = \phi(x)^T (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t.$$
Suppose that M is very large. Then the inverse of the M×M matrix above is expensive to compute. Consider however the following trick:
$$\phi(x)^T (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t = \phi(x)^T \Phi^T (\lambda I + \Phi \Phi^T)^{-1} t.$$

Dual Representation - Intuition

We have thus written the prediction as
$$y(x) = \phi(x)^T \Phi^T (\lambda I + \Phi \Phi^T)^{-1} t.$$
Now our prediction is determined by an N×N rather than an M×M matrix. $\Phi\Phi^T$ is known as the kernel matrix of the training data; intuitively, it measures the similarities between the training instances. Why? Because the inner product between two points is a measure of similarity: for vectors of fixed length, $\langle u, v \rangle$ is maximised when u and v point in the same direction.
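The matrix identity used in this trick can be checked numerically; the sketch below uses arbitrary random sizes and is only a sanity check, not part of the lecture:

```python
# Verify phi(x)^T (lam*I_M + Phi^T Phi)^{-1} Phi^T t
#      == phi(x)^T Phi^T (lam*I_N + Phi Phi^T)^{-1} t
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 8, 5, 0.1
Phi = rng.normal(size=(N, M))       # design matrix (N points, M basis functions)
t = rng.normal(size=N)              # targets
phi_x = rng.normal(size=M)          # features of a new point x

lhs = phi_x @ np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
rhs = phi_x @ Phi.T @ np.linalg.solve(lam * np.eye(N) + Phi @ Phi.T, t)
print(np.allclose(lhs, rhs))        # True
```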

Consider a linear regression model with regularised sum-of-squares error
$$J(w) = \frac{1}{2} \sum_{n=1}^{N} \left(w^T \phi(x_n) - t_n\right)^2 + \frac{\lambda}{2} w^T w,$$
where λ ≥ 0. We can also write this in the more compact form
$$J(w) = \frac{1}{2} (t - \Phi w)^T (t - \Phi w) + \frac{\lambda}{2} w^T w,$$
with the target vector $t = (t_1, \dots, t_N)^T$ and the design matrix
$$\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N) \end{pmatrix}.$$

Critical points for J(w)

The critical points of
$$J(w) = \frac{1}{2}(t - \Phi w)^T (t - \Phi w) + \frac{\lambda}{2} w^T w$$
satisfy
$$w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t$$
$$(\Phi^T \Phi + \lambda I)\, w = \Phi^T t$$
$$\lambda w = \Phi^T (t - \Phi w)$$
$$w = \Phi^T a = \sum_{n=1}^{N} \phi(x_n)\, a_n,$$
where $a = (a_1, \dots, a_N)^T$ with components
$$a_n = -\frac{1}{\lambda} \left\{ w^T \phi(x_n) - t_n \right\}.$$

Now express J(w) as a function of the new variable a instead of w, via the relation $w = \Phi^T a$:
$$J(a) = \frac{1}{2} a^T \Phi \Phi^T \Phi \Phi^T a - a^T \Phi \Phi^T t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T \Phi \Phi^T a,$$
where again $t = (t_1, \dots, t_N)^T$. This is known as the dual representation. Define the N×N Gram matrix $K = \Phi \Phi^T$ with elements
$$K_{nm} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m).$$
Express J(a) now as
$$J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a.$$

Critical Points of J(a)

Let's calculate the critical points of
$$J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a.$$
The directional derivative
$$DJ(a)(\xi) = \xi^T K K a - \xi^T K t + \lambda\, \xi^T K a$$
should be zero in all possible directions ξ. Therefore
$$K (K a - t + \lambda a) = 0,$$
and so
$$a = (K + \lambda I_N)^{-1} t.$$
The second directional derivative (using $K = \Phi \Phi^T$) is
$$D^2 J(a)(\xi, \xi) = \xi^T K K \xi + \lambda\, \xi^T K \xi = \|K \xi\|^2 + \lambda\, \|\Phi^T \xi\|^2 \ge 0,$$
so $a = (K + \lambda I_N)^{-1} t$ minimises J(a).

Prediction for the Linear Regression Model

Inserting the minimiser a of the error J(a) into the prediction model for linear regression, we get
$$y(x) = w^T \phi(x) = a^T \Phi \phi(x) = (\Phi \phi(x))^T a = k(x)^T (K + \lambda I_N)^{-1} t,$$
where we defined the vector k(x) with elements
$$k_n(x) = k(x_n, x) = \phi(x_n)^T \phi(x).$$
The prediction y(x) can be expressed entirely in terms of the kernel function k(x, x') evaluated at the training and test data. Looks familiar? See Bayesian Linear Regression.
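Putting the pieces together, here is a minimal kernel ridge regression sketch following the formulas above, with an assumed Gaussian kernel and synthetic 1-D data (all names and values are illustrative):

```python
# Kernel ridge regression: a = (K + lam*I)^-1 t, y(x) = k(x)^T a.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(30, 1))
t_train = np.sin(2 * np.pi * X_train[:, 0]) + 0.1 * rng.normal(size=30)

def gaussian_kernel(A, B, sigma=0.2):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

lam = 1e-3
K = gaussian_kernel(X_train, X_train)                    # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X_train)), t_train)

X_test = np.linspace(0, 1, 5).reshape(-1, 1)
y_pred = gaussian_kernel(X_test, X_train) @ a            # k(x)^T a per test point
print(y_pred)
```

Note that only the N×N system (K + λI)a = t is solved; the basis functions never appear explicitly.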

The Kernel Function

The kernel function is defined over two points x and x' of the input space:
$$k(x, x') = \phi(x)^T \phi(x').$$
k(x, x') is symmetric. It is an inner product of two vectors of basis functions, $k(x, x') = \langle \phi(x), \phi(x') \rangle$. For prediction, the kernel function will be evaluated at the training data points (see the next slides).

Dual Representation

What have we gained by the dual representation? We now need to invert an N×N matrix, where N is the number of data points, which can be large! In the parameter-space formulation we only needed to invert an M×M matrix, where M is the number of basis functions. But: a kernel corresponds to an inner product of basis functions, so we can use a large number of basis functions, even infinitely many. We can construct new valid kernels directly from given ones (whatever the corresponding basis functions of the new kernel might be). And, as a kernel defines a kind of similarity between two points of the input space, we can define kernels over graphs, sets, strings, and text documents.


Kernels from Basis Functions

1. Choose a set of basis functions $\{\phi_1, \dots, \phi_M\}$.
2. Obtain a new kernel as an inner product between the vectors of basis functions evaluated at x and x':
$$k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M} \phi_i(x)\, \phi_i(x').$$
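For instance, with a polynomial basis this recipe reads as follows (a sketch; the basis and values are assumptions):

```python
# Building a kernel from explicit basis functions: k(x, x') = phi(x)^T phi(x').
import numpy as np

def phi(x, M=4):
    """Map a scalar x to the basis vector (1, x, x^2, ..., x^(M-1))."""
    return np.array([x**i for i in range(M)], dtype=float)

def k(x, x_prime):
    return phi(x) @ phi(x_prime)

print(k(0.5, -0.2))
```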

Kernels from Basis Functions

[Figure: polynomial basis functions, and the corresponding kernel k(x, x') as a function of x for x' = 0.5 (red cross).]

Kernels from Basis Functions

[Figure: Gaussian basis functions, and the corresponding kernel k(x, x') as a function of x for x' = 0.0 (red cross).]

Kernels from Basis Functions

[Figure: logistic sigmoid basis functions, and the corresponding kernel k(x, x') as a function of x for x' = 0.0 (red cross).]

Kernels by Guessing a Kernel Function

1. Choose a mapping from two points of the input space to a real number which is symmetric in its arguments, e.g.
$$k(x, z) = (x^T z)^2 = k(z, x).$$
2. Try to write it as an inner product of a vector-valued function evaluated at the arguments x and z, e.g.
$$k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T = \phi(x)^T \phi(z),$$
with the feature mapping $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T$.
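A quick numerical check of this factorisation (illustrative values):

```python
# Check that (x^T z)^2 equals phi(x)^T phi(z) with phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(v):
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])
print((x @ z) ** 2, phi(x) @ phi(z))   # the two numbers agree
```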

New Kernels From Theory

A necessary and sufficient condition for k(x, x') to be a valid kernel is that the kernel matrix K, whose elements are k(x_n, x_m), is positive semidefinite for all possible choices of the set {x_n}. Previously we constructed K = ΦΦ^T, which is automatically positive semidefinite (why?). If we can explicitly construct the kernel via basis functions, we are good. Even if we cannot find the basis functions easily, we may still be able to deduce that k(x, x') is a valid kernel.

New Kernels From Other Kernels

Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following kernels are also valid:
$k(x, x') = c\, k_1(x, x')$  (c > 0 a constant)
$k(x, x') = f(x)\, k_1(x, x')\, f(x')$  (f(·) any function)
$k(x, x') = q(k_1(x, x'))$  (q(·) a polynomial with nonnegative coefficients)
$k(x, x') = \exp(k_1(x, x'))$
$k(x, x') = k_1(x, x') + k_2(x, x')$
$k(x, x') = k_1(x, x')\, k_2(x, x')$
$k(x, x') = k_3(\phi(x), \phi(x'))$  (φ(x) any function to $\mathbb{R}^M$, $k_3(\cdot,\cdot)$ a valid kernel on $\mathbb{R}^M$)
$k(x, x') = x^T A x'$  (A symmetric and positive semidefinite)
$k(x, x') = k_a(x_a, x_a') + k_b(x_b, x_b')$  (with $x = (x_a, x_b)$)
$k(x, x') = k_a(x_a, x_a')\, k_b(x_b, x_b')$
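Two of these closure rules (the sum and the elementwise product of kernels) can be checked empirically by looking at the eigenvalues of the resulting Gram matrices; the kernels and data below are illustrative choices:

```python
# Sums and elementwise products of valid Gram matrices stay positive semidefinite.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

def gram(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

k1 = lambda a, b: float(np.exp(-np.sum((a - b) ** 2) / 2.0))   # Gaussian kernel
k2 = lambda a, b: float(a @ b)                                  # linear kernel

for K in (gram(k1, X) + gram(k2, X), gram(k1, X) * gram(k2, X)):
    eigvals = np.linalg.eigvalsh(K)
    print(eigvals.min() >= -1e-10)    # True: no significantly negative eigenvalues
```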

New Kernels From Other Kernels

Further examples of kernels:
$k(x, x') = (x^T x')^M$  (only terms of degree M)
$k(x, x') = (x^T x' + c)^M$  (all terms up to degree M)
$k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$  (Gaussian kernel)
$k(x, x') = \tanh\left(a\, x^T x' + b\right)$  (sigmoidal kernel)

Generally, we call
$k(x, x') = x^T x'$ the linear kernel,
$k(x, x') = k(x - x')$ a stationary kernel,
$k(x, x') = k(\|x - x'\|)$ a homogeneous kernel.

Kernels over Graphs, Sets, Strings, Texts

We only need an appropriate similarity measure k(x, x') which is a kernel. Example: given a set A, consider the set of all subsets of A, the power set P(A). For two subsets $A_1, A_2 \in P(A)$, denote the number of elements in the intersection of $A_1$ and $A_2$ by $|A_1 \cap A_2|$. Then it can be shown that
$$k(A_1, A_2) = 2^{|A_1 \cap A_2|}$$
corresponds to an inner product in a feature space, and therefore $k(A_1, A_2)$ is a valid kernel function.
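This set kernel is straightforward to write down (toy sets for illustration):

```python
# The subset kernel k(A1, A2) = 2^{|A1 intersect A2|} on subsets of a finite set.
def subset_kernel(A1: set, A2: set) -> int:
    return 2 ** len(A1 & A2)

print(subset_kernel({"a", "b", "c"}, {"b", "c", "d"}))   # 2^2 = 4
print(subset_kernel({"a"}, {"b"}))                       # 2^0 = 1
```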

Kernels from Probabilistic Generative Models

Given p(x), we can define a kernel $k(x, x') = p(x)\, p(x')$, which means two inputs x and x' are similar if they both have high probability. We can include a weighting function p(i) and extend the kernel to
$$k(x, x') = \sum_i p(x \mid i)\, p(x' \mid i)\, p(i),$$
and for a continuous variable z,
$$k(x, x') = \int p(x \mid z)\, p(x' \mid z)\, p(z)\, dz.$$
Example: a hidden Markov model with sequences of length L.

Kernel Methods: Summary

Pick a suitable kernel function k(x, x'), e.g. by computing the inner product of some basis functions. Make predictions by suitably combining k(x, x_n) for each training example x_n; implicitly, this is a linear model in some high-dimensional space. For linear regression, we go from
$$y(x) = \phi(x)^T (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$$
to
$$y(x) = k(x)^T (K + \lambda I_N)^{-1} t,$$
and we can plug in a suitable kernel function to implicitly perform a nonlinear transformation.

Kernel Methods: Summary

Working with a nonlinear kernel, we are implicitly performing a nonlinear transformation of our data.

[Figure: result with the linear kernel $k(x, x') = x^T x'$.]

Kernel Methods: Summary

Working with a nonlinear kernel, we are implicitly performing a nonlinear transformation of our data.

[Figure: result with the nonlinear kernel $k(x, x') = (x^T x')^2$.]