Cheng Soon Ong & Christian Walder. Canberra, February to June 2018

1 Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science, Canberra, February to June 2018
Outline: Overview, Introduction, Linear Algebra, Probability, Linear Regression 1, Linear Regression 2, Linear Classification 1, Linear Classification 2, Kernel Methods, Sparse Kernel Methods, Mixture Models and EM 1, Mixture Models and EM 2, Neural Networks 1, Neural Networks 2, Principal Component Analysis, Autoencoders, Graphical Models 1, Graphical Models 2, Graphical Models 3, Sampling, Sequential Data 1, Sequential Data 2.
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

2 Part XI: Kernel Methods

3 Original Input versus Feature Space
Used direct input x until now.
All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
Example: use two Gaussian basis functions centered at the green crosses in the input space.
(Figure: the input space (x1, x2) and the corresponding feature space (φ1, φ2).)

4 Original Input versus Feature Space
Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.
Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.
BUT: if classes overlap in the input space, they will also overlap in the feature space. Nonlinear features φ(x) cannot remove the overlap, but they may increase it!
(Figure: the same data in the input space (x1, x2) and in the feature space (φ1, φ2).)

5 Where are we?
Basis function models (regression, classification)
Kernel methods
Flexible basis function models (neural networks): after the semester break

6 Where are we going?
Why not use all training data to make predictions for the test inputs?
Basic ideas:
Continuity: mostly, targets don't change abruptly.
Similarity: each training pair (input, target) tells us something about the possible targets in the neighbourhood of that input.
Kernels formalise these ideas.

7 How are we going there?
Kernels for density estimation
Nonparametric density estimation
Kernels for classification
Basis functions and the kernel trick
Constructing kernels
Warning: the term "kernel" is also used for the set of all vectors that a matrix A maps to zero. That is a different concept; don't get confused!

9 Density Estimation
Suppose we observe data points {x_n}, n = 1, ..., N, e.g. just N real numbers.
Suppose we believe these are drawn independently from some distribution p(x), e.g. p(x) is Gaussian with unknown mean and variance.
Density estimation problem: estimate p(x) from the data.

10 Nonparametric Density Estimation - Histogram
Partition the space into bins of width Δ_i.
Count the number n_i of samples falling into each bin i.
Normalise: p_i = n_i / (N Δ_i).
(Figure: histograms of 50 data points generated from the distribution shown by the green curve, for varying common bin width Δ.)
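Not from the slides: a minimal NumPy sketch of this histogram estimator p_i = n_i / (N Δ_i). The synthetic data, range, and bin width are arbitrary choices for illustration.

```python
import numpy as np

def histogram_density(data, bin_width, low, high):
    """Histogram density estimate: p_i = n_i / (N * bin_width) per bin."""
    edges = np.arange(low, high + bin_width, bin_width)
    counts, edges = np.histogram(data, bins=edges)
    densities = counts / (len(data) * bin_width)
    return densities, edges

# Illustration on synthetic data (a mixture of two Gaussians, chosen here
# only as a stand-in for the unknown p(x) in the slides).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.3, 0.1, 25), rng.normal(0.7, 0.1, 25)])
p, edges = histogram_density(data, bin_width=0.08, low=0.0, high=1.0)
print(p)               # estimated density per bin
print(p.sum() * 0.08)  # sums (approximately) to 1
```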

11 Nonparametric Density Estimation - Histogram
Advantages:
Data can be discarded after calculating the p_i.
The algorithm can be applied to sequentially arriving data.
Disadvantages:
Dependency on the bin width Δ_i.
Discontinuities due to the bin edges.
Exponential scaling with the dimensionality D of the data: we need M^D bins for D dimensions with M bins per dimension.

12 Nonparametric Density Estimation - Refined
Draw data from some unknown probability distribution p(x) in a D-dimensional space.
Consider a small region R containing x. The probability mass associated with this region is
P = ∫_R p(x) dx.
Given a data set of N observations drawn from p(x), the total number K of points found inside R is distributed according to the binomial distribution
Bin(K | N, P) = N! / (K! (N − K)!) P^K (1 − P)^(N−K).
Expectation of K: E[K/N] = P. Variance of K: var[K/N] = P(1 − P)/N.

13 Nonparametric Density Estimation - Refined
Expectation of K: E[K/N] = P. Variance of K: var[K/N] = P(1 − P)/N.
For large N the distribution will be sharply peaked, and therefore K ≈ N P.
Assuming also that the region has volume V and is small enough for p(x) to be roughly constant over it, then P ≈ p(x) V.
Combining the two contradictory assumptions:
Region R is small enough for p(x) to be roughly constant over it.
Region R is large enough for enough points K to fall into it, so that the binomial distribution is sharply peaked.
p(x) ≈ K / (N V)

14 Nonparametric Density Estimation - Refined
Two ways to exploit p(x) ≈ K / (N V):
1. Fix K and determine the volume V from the data: K-nearest-neighbours density estimation.
2. Fix V and determine K from the data: kernel density estimation.

15 Nonparametric Density Estimation - Nearest Neighbour
Fix K and find an appropriate value for V.
Consider a small sphere around x and allow its radius to increase until it contains exactly K data points.
Estimate the density by p(x) ≈ K / (N V).
(Figure: nearest-neighbour density model for different values of K.)
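A rough sketch (my own, not course code) of the one-dimensional K-nearest-neighbour estimate, where the "volume" is twice the distance to the K-th nearest training point; the standard-normal toy data are an assumption. Note that the resulting estimate is not a proper density (it does not integrate to 1).

```python
import numpy as np

def knn_density_1d(x, data, K):
    """K-nearest-neighbour density estimate p(x) ~ K / (N * V),
    where V = 2 * (distance to the K-th nearest training point)."""
    dists = np.sort(np.abs(data - x))
    V = 2.0 * dists[K - 1]
    return K / (len(data) * V)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 200)
for K in (1, 5, 30):       # small K: noisy, large K: smoother
    print(K, knn_density_1d(0.0, data, K))
```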

16 Nonparametric Density Estimation - Parzen Estimator
Define the region R to be a small hypercube around x.
Define the Parzen window (kernel function)
k(u) = 1 if |u_i| ≤ 1/2 for all i = 1, ..., D, and 0 otherwise.
The total number of data points inside the hypercube centred at x with side length h is
K = Σ_{n=1}^N k((x − x_n)/h).
Density estimate for p(x):
p(x) ≈ K / (N V) = (1/N) Σ_{n=1}^N (1/h^D) k((x − x_n)/h).
Interpret this as a sum over N cubes, centred at each of the x_n.
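A minimal sketch, assuming the training points come as a NumPy array of shape (N, D), of this hypercube Parzen window; the sample data and the value of h are illustrative only.

```python
import numpy as np

def parzen_hypercube_density(x, data, h):
    """Parzen-window estimate with a hypercube kernel of side length h:
    k(u) = 1 if |u_i| <= 1/2 for all i, else 0;  p(x) ~ K / (N * h^D)."""
    x = np.atleast_1d(x)
    data = np.atleast_2d(data)          # shape (N, D)
    N, D = data.shape
    u = (x - data) / h                  # scaled offsets to each training point
    inside = np.all(np.abs(u) <= 0.5, axis=1)
    K = inside.sum()                    # points falling inside the cube at x
    return K / (N * h ** D)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(500, 2))   # N = 500 points in D = 2
print(parzen_hypercube_density(np.zeros(2), data, h=0.5))
```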

17 Nonparametric Density Estimation - Parzen Estimator
Remaining problem: discontinuities, because each point is either inside or outside the hypercube.
Choose a smoother kernel function (and normalise correctly). Common choice: the Gaussian kernel
p(x) = (1/N) Σ_{n=1}^N 1/(2π h²)^(D/2) exp(−‖x − x_n‖² / (2h²)).
Any other kernel function k(u) can be chosen, provided
k(u) ≥ 0 and ∫ k(u) du = 1.

18 Nonparametric Density Estimation - Parzen Estimator
Gaussian kernel:
p(x) = (1/N) Σ_{n=1}^N 1/(2π h²)^(D/2) exp(−‖x − x_n‖² / (2h²)).
h controls the trade-off between sensitivity to noise and over-smoothing.
(Figure: kernel density model with Gaussian kernel for different h, e.g. h = 0.07.)
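A possible implementation sketch of this Gaussian kernel density estimate; the toy data and the bandwidth values (echoing the h = 0.07 mentioned above, plus two arbitrary extremes) are assumptions.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate with a Gaussian kernel of bandwidth h:
    p(x) = (1/N) * sum_n N(x | x_n, h^2 I)."""
    x = np.atleast_1d(x)
    data = np.atleast_2d(data)                       # shape (N, D)
    N, D = data.shape
    sq_dist = np.sum((x - data) ** 2, axis=1)
    norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)
    return np.mean(np.exp(-sq_dist / (2.0 * h ** 2)) / norm)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(200, 1))
for h in (0.005, 0.07, 0.2):   # small h: noisy, large h: over-smoothed
    print(h, gaussian_kde(np.array([0.0]), data, h))
```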

20 The Role of Training Data
Parametric methods: learn the model parameter w from the training data t, then discard the training data t.
Nonparametric methods: use the training data directly for prediction, e.g. k-nearest neighbours uses the k closest points from the training set for classification.
Kernel methods: base the prediction on a linear combination of kernel functions evaluated at the training data.

21 Features
A feature is a measurable property of a phenomenon being observed, or any derived property thereof.
Raw features: the original data.
Derived features: mappings of the original features to some other space (possibly high- or infinite-dimensional, e.g. basis functions).
Feature selection: which features matter for the problem at hand? (redundant features; problem dependent)
Feature extraction: can we combine the important features into a smaller set of new features? (compact representation versus the ability to explain to a human)

22 Very simple example - XOR
x1   x2   y = x1 XOR x2
−1   −1   −1
−1    1    1
 1   −1    1
 1    1   −1
Not linearly separable (why?).
Raw features: {(−1, −1), (−1, 1), (1, −1), (1, 1)}.

23 Very simple example - XOR
x1   x2   x_new = x1 x2   y = x1 XOR x2
−1   −1    1              −1
−1    1   −1               1
 1   −1   −1               1
 1    1    1              −1
Feature extraction: x_new = x1 x2. The data is now separable!
All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
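A tiny numerical check (not part of the slides) that the derived feature x_new = x1*x2 makes XOR linearly separable; the weight w = -1 used below is one obvious choice, not something prescribed by the course.

```python
import numpy as np

# XOR with +/-1 encoding: y is not a linear function of (x1, x2),
# but it IS linear in the single derived feature x_new = x1*x2.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)

x_new = X[:, 0] * X[:, 1]             # feature extraction
# A linear classifier on x_new: predict sign(w * x_new) with w = -1.
predictions = np.sign(-1.0 * x_new)
print(predictions)                    # [-1.  1.  1. -1.]
print(np.all(predictions == y))       # True: separable in the new feature
```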

24 Kernel methods in one slide
Consider a labelled training set {(x_i, t_i)}, i = 1, ..., N.
On a new point x, we will predict
y(x) = Σ_{i=1}^N α_i K(x, x_i) t_i,
where the {α_i} are weights to be determined and K(·,·) is a kernel function.
The kernel function measures the similarity between any two examples.
The prediction is a weighted average of the training targets; the weights depend on the similarity of x to each training example.

25 Kernel Methods - Intuition
Suppose we perform linear regression with a feature matrix Φ and target vector t, where Φ has rows φ(x_n)^T, i.e.
Φ = [ φ(x_1)^T ; φ(x_2)^T ; ... ; φ(x_N)^T ],   Φ_{nj} = φ_j(x_n),  j = 0, ..., M−1.
Recall that the optimal (regularised) w is
w = (λI + Φ^T Φ)^{-1} Φ^T t.
Thus, the prediction for the feature vector of a new point x is
y(x) = φ(x)^T w = φ(x)^T (λI + Φ^T Φ)^{-1} Φ^T t.

26 Kernel Methods - Intuition
Prediction with the optimal (regularised) w:
y(x) = φ(x)^T w = φ(x)^T (λI + Φ^T Φ)^{-1} Φ^T t.
Suppose that M is very large. Then the inverse of the M×M matrix above will be expensive to compute.
Consider however the following trick:
φ(x)^T (λI + Φ^T Φ)^{-1} Φ^T t = φ(x)^T Φ^T (λI + ΦΦ^T)^{-1} t.

27 Kernel Methods - Intuition
We have thus written the prediction as
y(x) = φ(x)^T Φ^T (λI + ΦΦ^T)^{-1} t.
Now our prediction is determined by an N×N rather than an M×M matrix.
ΦΦ^T is known as the kernel matrix of the training data. Intuitively, it measures the similarities between the training instances.
Why? Because the inner product ⟨u, v⟩ = u^T v between two points is a measure of similarity: for fixed lengths it is largest when u and v point in the same direction.
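A small numerical check, under arbitrary sizes N, M and regulariser λ, that the primal and dual forms of this prediction agree; the random design matrix is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 200                      # few data points, many basis functions
Phi = rng.normal(size=(N, M))      # design matrix (rows: phi(x_n)^T)
t = rng.normal(size=N)
lam = 0.1

# Primal form: invert an M x M matrix.
w_primal = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
y_primal = Phi @ w_primal          # predictions on the training inputs

# Dual form: invert an N x N matrix instead.
K = Phi @ Phi.T                    # kernel (Gram) matrix
a = np.linalg.solve(lam * np.eye(N) + K, t)
y_dual = K @ a                     # same predictions via the kernel matrix

print(np.allclose(y_primal, y_dual))   # True: the two forms agree
```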

28 Consider a linear regression model with regularised sum-of-squares error
J(w) = (1/2) Σ_{n=1}^N (w^T φ(x_n) − t_n)² + (λ/2) w^T w,
where λ ≥ 0. We can also write this in the more compact form
J(w) = (1/2) (t − Φw)^T (t − Φw) + (λ/2) w^T w,
with the target vector t = (t_1, ..., t_N)^T and the design matrix Φ whose n-th row is φ(x_n)^T = (φ_0(x_n), φ_1(x_n), ..., φ_{M−1}(x_n)).

29 Critical points of J(w)
The critical points of
J(w) = (1/2) (t − Φw)^T (t − Φw) + (λ/2) w^T w
satisfy
w = (Φ^T Φ + λI)^{-1} Φ^T t
⇔ (Φ^T Φ + λI) w = Φ^T t
⇔ λ w = Φ^T (t − Φw)
⇔ w = Φ^T a = Σ_{n=1}^N a_n φ(x_n),
where a = (a_1, ..., a_N)^T with components
a_n = −(1/λ) (w^T φ(x_n) − t_n).

30 Now express J(w) as a function of the new variable a instead of w, via the relation w = Φ^T a:
J(a) = (1/2) a^T ΦΦ^T ΦΦ^T a − a^T ΦΦ^T t + (1/2) t^T t + (λ/2) a^T ΦΦ^T a,
where again t = (t_1, ..., t_N)^T. This is known as the dual representation.
Define the N×N Gram matrix K = ΦΦ^T with elements
K_{nm} = φ(x_n)^T φ(x_m) = k(x_n, x_m).
Express J(a) now as
J(a) = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a.

31 Critical Points of J(a)
Let's calculate the critical points of
J(a) = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a.
The directional derivative
DJ(a)(ξ) = ξ^T K K a − ξ^T K t + λ ξ^T K a
should be zero in all possible directions ξ. Therefore
K (K a − t + λ a) = 0,
and so
a = (K + λ I_N)^{-1} t.
The second directional derivative (using K = ΦΦ^T) is
D²J(a)(ξ, ξ) = ξ^T K K ξ + λ ξ^T K ξ = ‖K ξ‖² + λ ‖Φ^T ξ‖² ≥ 0,
so a = (K + λ I_N)^{-1} t minimises J(a).

32 Prediction for the Linear Regression Model
Inserting the a which minimises the error J(a) into the prediction model for linear regression, we get the prediction
y(x) = w^T φ(x) = a^T Φ φ(x) = (Φ φ(x))^T a = k(x)^T (K + λ I_N)^{-1} t,
where we defined the vector k(x) with elements
k_n(x) = k(x_n, x) = φ(x_n)^T φ(x).
The prediction y(x) can be expressed entirely in terms of the kernel function k(x, x′) evaluated at the training and test data.
Looks familiar? See Bayesian Linear Regression.
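An illustrative sketch of this prediction formula, here with a Gaussian kernel; the toy sinusoid data, the bandwidth σ and the value of λ are assumptions, not values from the course.

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.3):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

# Toy regression problem (noisy sinusoid), just to exercise the formula
# y(x) = k(x)^T (K + lambda I)^{-1} t.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)
lam = 0.1

K = rbf_kernel(X, X)                               # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), t)

X_test = np.linspace(0, 1, 5).reshape(-1, 1)
y_test = rbf_kernel(X_test, X) @ alpha             # k(x)^T (K + lam I)^{-1} t
print(np.round(y_test, 3))
```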

33 The Kernel Function
The kernel function is defined over two points, x and x′, of the input space.
k(x, x′) is symmetric: k(x, x′) = k(x′, x).
k(x, x′) = φ(x)^T φ(x′): it is an inner product of two vectors of basis functions, k(x, x′) = ⟨φ(x), φ(x′)⟩.
For prediction, the kernel function will be evaluated at the training data points (see the next slides).

34 Dual Representation
What have we gained by the dual representation?
We now need to invert an N×N matrix, where N is the number of data points. This can be large!
In the parameter space formulation, we only needed to invert an M×M matrix, where M was the number of basis functions.
But: a kernel corresponds to an inner product of basis functions, so we can use a large number of basis functions, even infinitely many.
We can construct new valid kernels directly from given ones (whatever the corresponding basis functions of the new kernel might be).
As a kernel defines a kind of similarity between two points of the input space, we can define kernels over graphs, sets, strings, and text documents.

36 Kernels from Basis Functions
1. Choose a set of basis functions {φ_1, ..., φ_M}.
2. Obtain a new kernel as an inner product between the vectors of basis functions evaluated at x and x′:
k(x, x′) = φ(x)^T φ(x′) = Σ_{i=1}^M φ_i(x) φ_i(x′).
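A minimal sketch of this recipe with a hand-picked set of Gaussian basis functions; the centres, width, and evaluation points are arbitrary choices for illustration.

```python
import numpy as np

# A small set of Gaussian basis functions on the real line (centres and
# width chosen arbitrarily for this example).
centres = np.linspace(-1, 1, 11)
s = 0.2

def phi(x):
    """Vector of basis-function values (phi_1(x), ..., phi_M(x))."""
    return np.exp(-(x - centres) ** 2 / (2 * s ** 2))

def k(x, x_prime):
    """Kernel defined as the inner product of the basis-function vectors."""
    return phi(x) @ phi(x_prime)

print(k(0.2, 0.3), k(0.3, 0.2))   # symmetric by construction
```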

37 Kernels from Basis Functions
Polynomial basis functions.
(Figure: the basis functions, and the corresponding kernel k(x, x′) as a function of x for x′ = 0.5 (red cross).)

38 Kernels from Basis Functions
Gaussian basis functions.
(Figure: the basis functions, and the corresponding kernel k(x, x′) as a function of x for x′ = 0.0 (red cross).)

39 Kernels from Basis Functions
Logistic sigmoid basis functions.
(Figure: the basis functions, and the corresponding kernel k(x, x′) as a function of x for x′ = 0.0 (red cross).)

40 Kernels by Guessing a Kernel Function
1. Choose a mapping from two points of the input space to a real number which is symmetric in its arguments, e.g.
k(x, z) = (x^T z)² = k(z, x).
2. Try to write this as an inner product of a vector-valued function evaluated at the arguments x and z, e.g.
k(x, z) = (x^T z)² = (x_1 z_1 + x_2 z_2)²
        = x_1² z_1² + 2 x_1 z_1 x_2 z_2 + x_2² z_2²
        = (x_1², √2 x_1 x_2, x_2²) (z_1², √2 z_1 z_2, z_2²)^T
        = φ(x)^T φ(z),
with the feature mapping φ(x) = (x_1², √2 x_1 x_2, x_2²)^T.
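A quick numerical confirmation (not from the slides) that the explicit feature map above reproduces k(x, z) = (x^T z)² in two dimensions; the random test points are arbitrary.

```python
import numpy as np

def phi(x):
    """Feature map for the 2-D quadratic kernel k(x, z) = (x^T z)^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose((x @ z) ** 2, phi(x) @ phi(z)))   # True
```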

41 New Kernels From Theory
A necessary and sufficient condition for k(x, x′) to be a valid kernel is that the kernel matrix K, whose elements are k(x_n, x_m), is positive semidefinite for all possible choices of the set {x_n}.
Previously we constructed K = ΦΦ^T, which is automatically positive semidefinite (why?).
If we can explicitly construct the kernel via basis functions, we are done.
Even if we cannot find the basis functions easily, we may still be able to deduce that k(x, x′) is a valid kernel.
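A small helper (my own sketch) that tests positive semidefiniteness via eigenvalues; the random features and the counterexample matrix are illustrative only.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """A symmetric matrix is positive semidefinite iff all eigenvalues >= 0."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 5))
K = Phi @ Phi.T                    # Gram matrix built from explicit features
print(is_psd(K))                   # True: x^T K x = ||Phi^T x||^2 >= 0

# A symmetric matrix that is NOT a valid kernel matrix:
bad = np.array([[1.0, 2.0], [2.0, 1.0]])
print(is_psd(bad))                 # False (eigenvalues 3 and -1)
```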

42 New Kernels From Other Kernels
Given valid kernels k_1(x, x′) and k_2(x, x′), the following kernels are also valid:
k(x, x′) = c k_1(x, x′), where c > 0 is a constant
k(x, x′) = f(x) k_1(x, x′) f(x′), where f(·) is any function
k(x, x′) = q(k_1(x, x′)), where q(·) is a polynomial with nonnegative coefficients
k(x, x′) = exp(k_1(x, x′))
k(x, x′) = k_1(x, x′) + k_2(x, x′)
k(x, x′) = k_1(x, x′) k_2(x, x′)
k(x, x′) = k_3(φ(x), φ(x′)), where φ(x) is any function to R^M and k_3(·,·) is a valid kernel in R^M
k(x, x′) = x^T A x′, where A = A^T is positive semidefinite
k(x, x′) = k_a(x_a, x_a′) + k_b(x_b, x_b′), where x = (x_a, x_b)
k(x, x′) = k_a(x_a, x_a′) k_b(x_b, x_b′)

43 New Kernels From Other Kernels
Further examples of kernels:
k(x, x′) = (x^T x′)^M : only terms of degree M
k(x, x′) = (x^T x′ + c)^M : all terms up to degree M
k(x, x′) = exp(−‖x − x′‖² / 2σ²) : Gaussian kernel
k(x, x′) = tanh(a x^T x′ + b) : sigmoidal kernel
Generally, we call
k(x, x′) = x^T x′ a linear kernel,
k(x, x′) = k(x − x′) a stationary kernel,
k(x, x′) = k(‖x − x′‖) a homogeneous kernel.
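Simple NumPy versions of these standard kernel functions, with arbitrary default parameters (c, M, σ, a, b), for illustration only.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, M=3):
    """(x^T z + c)^M: all monomials up to degree M (c = 0 gives only degree M)."""
    return (x @ z + c) ** M

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoidal_kernel(x, z, a=1.0, b=0.0):
    return np.tanh(a * (x @ z) + b)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoidal_kernel):
    print(k.__name__, k(x, z))
```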

44 Kernels over Graphs, Sets, Strings, Texts
We only need an appropriate similarity measure k(x, x′) which is a kernel.
Example: given a set A, consider the set of all subsets of A, called the power set P(A). For two subsets A_1, A_2 ∈ P(A), denote the number of elements in the intersection of A_1 and A_2 by |A_1 ∩ A_2|. Then it can be shown that
k(A_1, A_2) = 2^|A_1 ∩ A_2|
corresponds to an inner product in a feature space; therefore k(A_1, A_2) is a valid kernel function.
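A sketch of this subset kernel; the small collection of subsets is an arbitrary example, and the eigenvalue check simply confirms that the resulting Gram matrix is positive semidefinite.

```python
import numpy as np

def subset_kernel(A1, A2):
    """k(A1, A2) = 2^{|A1 intersect A2|}, a valid kernel on subsets of a set."""
    return 2.0 ** len(set(A1) & set(A2))

subsets = [set(), {1}, {1, 2}, {2, 3}, {1, 2, 3}]
K = np.array([[subset_kernel(a, b) for b in subsets] for a in subsets])
print(K)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # Gram matrix is PSD
```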

45 Kernels from Probabilistic Generative Models
Given p(x), we can define a kernel k(x, x′) = p(x) p(x′), which means two inputs x and x′ are similar if they both have high probability.
Include a weighting function p(i) and extend the kernel to
k(x, x′) = Σ_i p(x | i) p(x′ | i) p(i).
For a continuous latent variable z,
k(x, x′) = ∫ p(x | z) p(x′ | z) p(z) dz.
Example: a hidden Markov model, where x and x′ are observed sequences of length L and z is the hidden state sequence.

46 Kernels: Summary
Pick a suitable kernel function k(x, x′), e.g. by computing the inner product of some basis functions.
Make predictions by suitably combining k(x, x_n) for each training example x_n; implicitly, this is a linear model in some high-dimensional space.
For linear regression, we go from
y(x) = φ(x)^T (λI + Φ^T Φ)^{-1} Φ^T t
to
y(x) = k(x)^T (K + λ I_N)^{-1} t,
and can plug in a suitable kernel function to implicitly perform a nonlinear transformation.

47 Kernels: Summary
Working with a nonlinear kernel, we are implicitly performing a nonlinear transformation of our data.
(Figure: with the linear kernel k(x, x′) = x^T x′.)

48 Kernels: Summary
Working with a nonlinear kernel, we are implicitly performing a nonlinear transformation of our data.
(Figure: with the nonlinear kernel k(x, x′) = (x^T x′)².)
