Kernel Learning via Random Fourier Representations


Kernel Learning via Random Fourier Representations
L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang
Module 5: Machine Learning

Outline
1. Background Theory: Recap of Kernels; Fourier Features; Learning on Distributions
2. Optimisation Methods
3. Experiments

Recap of Kernels
Data $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathcal{X}$, a non-empty set, and $y_i \in \mathbb{R}$.
For instance, we may want a function $f : \mathcal{X} \to \mathbb{R}$ to model the relationship between $x_i$ and $y_i$.
A kernel is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$ for some Hilbert space $\mathcal{H}$ and some feature map $\phi : \mathcal{X} \to \mathcal{H}$.
Moore-Aronszajn theorem: every positive semi-definite function is the reproducing kernel of a (unique) reproducing kernel Hilbert space (RKHS).
So we can start from a positive semi-definite function, e.g. $k(x, y) = \exp(-\tfrac{1}{2}\|x - y\|^2)$.
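As a concrete illustration (a sketch, not from the slides), the Gaussian kernel above can be evaluated on a data set and its positive semi-definiteness checked numerically:

```python
import numpy as np

def gaussian_kernel(X, Y, lengthscale=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * lengthscale^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # 50 points in R^3
K = gaussian_kernel(X, X)

# Positive semi-definiteness: all eigenvalues >= 0 up to numerical error.
print(np.linalg.eigvalsh(K).min() > -1e-8)
```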

Building the RKHS
We construct the RKHS within the space of functions $f : \mathcal{X} \to \mathbb{R}$.
Take functions of the form $k(\cdot, x)$ and their linear combinations:
$$f(\cdot) = \sum_{i=1}^n a_i\, k(\cdot, x_i).$$
Define an inner product: for $g(\cdot) = \sum_{j=1}^m b_j\, k(\cdot, y_j)$,
$$\langle f, g \rangle_{\mathcal{H}_k} := \sum_{i=1}^n \sum_{j=1}^m a_i b_j\, k(x_i, y_j).$$
Then $\langle k(\cdot, x), g \rangle_{\mathcal{H}_k} = g(x)$ (the reproducing property), and in particular $k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}_k}$.

Representer Theorem
Theorem. Let $L$ be a general loss function and $\Omega$ an increasing function. For the regularised risk
$$\operatorname*{argmin}_{f \in \mathcal{H}_k}\; L\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) + \Omega\big(\|f\|^2_{\mathcal{H}_k}\big)$$
there exists a minimiser $f^* \in \mathcal{H}_k$ of the form
$$f^*(\cdot) = \sum_{i=1}^n a_i\, k(\cdot, x_i)$$
for some $(a_1, \ldots, a_n) \in \mathbb{R}^n$. If $\Omega$ is strictly increasing, then every minimiser of the regularised risk admits such a representation.
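For example (a sketch under stated assumptions, not from the slides): with the squared loss and $\Omega(t) = \lambda t$, the minimiser is kernel ridge regression, and the representer coefficients solve a linear system in the Gram matrix.

```python
import numpy as np

def kernel_ridge_fit(K, y, lam=1e-2):
    """Coefficients a of the representer form f(.) = sum_i a_i k(., x_i),
    minimising (1/n) sum_i (y_i - f(x_i))^2 + lam * ||f||^2_H."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(a, K_test_train):
    """f(x*) = sum_i a_i k(x*, x_i), with K_test_train[t, i] = k(x*_t, x_i)."""
    return K_test_train @ a
```

The Gram matrix K could come, for instance, from the gaussian_kernel sketch above.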

Kernel Methods
Kernel methods allow us to work in a high-dimensional, possibly infinite-dimensional, feature space without ever representing it explicitly. However, this comes at a substantial computational cost, especially when the data set is large. This motivates random Fourier features, which approximate the kernel through a lower-dimensional explicit feature map.

Random Fourier Features
Theorem (Bochner's theorem). A continuous, translation-invariant kernel $k(x, y) = k(x - y)$ is the Fourier transform of a non-negative measure:
$$k(x - y) = \int_{\mathbb{R}^d} e^{i \omega^T (x - y)}\, p(\omega)\, d\omega = \mathbb{E}_{\omega \sim p}\big[\zeta_\omega(x)\, \overline{\zeta_\omega(y)}\big],$$
where $\zeta_\omega(x) := e^{i \omega^T x}$.
Idea: sample frequencies $\omega_1, \ldots, \omega_m$ independently from $p$ and use the Monte Carlo estimator
$$\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^m \big[\cos(\omega_j^T x) \cos(\omega_j^T y) + \sin(\omega_j^T x) \sin(\omega_j^T y)\big].$$

Random Fourier Features
$$\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^m \big[\cos(\omega_j^T x) \cos(\omega_j^T y) + \sin(\omega_j^T x) \sin(\omega_j^T y)\big]$$
In particular, we now have an explicit feature map
$$\phi(x) := \frac{1}{\sqrt{m}} \big[\cos(\omega_1^T x), \sin(\omega_1^T x), \ldots, \cos(\omega_m^T x), \sin(\omega_m^T x)\big],$$
so that $\hat{k}(x, y) = \langle \phi(x), \phi(y) \rangle$.
A kernel has to be pre-specified for RFF, and choosing a good kernel is in general an open and challenging question.
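A minimal numpy sketch (illustrative, not the authors' code) of this feature map for the Gaussian kernel, whose spectral measure $p$ is a standard Gaussian:

```python
import numpy as np

def rff_features(X, omegas):
    """Map X of shape (n, d) to the 2m-dimensional random Fourier feature space."""
    proj = X @ omegas.T                            # proj[i, j] = omega_j^T x_i
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

rng = np.random.default_rng(0)
d, m = 3, 500
omegas = rng.normal(size=(m, d))                   # p = N(0, I) corresponds to exp(-||x - y||^2 / 2)

X = rng.normal(size=(50, d))
Phi = rff_features(X, omegas)
K_hat = Phi @ Phi.T                                # Monte Carlo estimate of the kernel matrix

K_exact = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(axis=-1))
print(np.abs(K_hat - K_exact).max())               # shrinks as m grows
```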

Neural Network Approach
Attempt to learn the kernel in a data-dependent fashion, using a one-hidden-layer neural network whose hidden units have cosine and sine activations:
$$\hat{y} = \frac{1}{\sqrt{m}} \sum_{j=1}^m \beta_j^{\cos} \cos(\omega_j^T x) + \frac{1}{\sqrt{m}} \sum_{j=1}^m \beta_j^{\sin} \sin(\omega_j^T x). \qquad (1)$$
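Viewed through the feature map above, equation (1) is a linear readout of (now trainable) Fourier features. A sketch of the forward pass (the helper names are ours):

```python
import numpy as np

def forward(X, Omega, beta_cos, beta_sin):
    """Eq. (1): hidden layer of cos/sin units at frequencies Omega, linear output."""
    proj = X @ Omega.T                              # (n, m) projections omega_j^T x_i
    m = Omega.shape[0]
    return (np.cos(proj) @ beta_cos + np.sin(proj) @ beta_sin) / np.sqrt(m)
```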

Neural Network: Picture
[Figure: diagram of the one-hidden-layer network; not reproduced in the transcription.]

Learning on Distributions
Now, for each label $y_i$, rather than a single observation we observe a bag of samples $x_{i,1}, \ldots, x_{i,N} \overset{\mathrm{iid}}{\sim} \mathsf{x}_i$, where $\mathsf{x}_i$ is a distribution on $\mathbb{R}^d$. Let $\hat{\mathsf{x}}_i$ denote the empirical distribution associated with the $i$th bag.
Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a kernel.
Use $k$ to embed $\hat{\mathsf{x}}_i$:
$$\hat{\mathsf{x}}_i \mapsto \mu_{\hat{\mathsf{x}}_i} = \frac{1}{N} \sum_{j=1}^N k(\cdot, x_{i,j}) \in \mathcal{H}_k.$$
We can approximate $k$ using RFF as before, using frequencies $\omega_1, \ldots, \omega_m$.
We now consider another kernel $K$, initially defined on the space of mean-embedded distributions and restricted to the $\mu_{\hat{\mathsf{x}}_i}$.
Since we have approximated $k$ using RFF, this space is now finite-dimensional (parameterised by the frequencies), so we can approximate $K$ using RFF as well.
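A sketch of the two-stage construction (illustrative, with our own helper names): each bag is mapped to the average of its inner RFF features, and an outer RFF map approximates $K$ on those embeddings.

```python
import numpy as np

def rff(X, omegas):
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

def embed_bags(bags, omegas_inner, omegas_outer):
    """bags: list of (N_i, d) arrays; returns one feature vector per bag."""
    # Stage 1: approximate mean embedding mu_i = (1/N) sum_j phi(x_{i,j}).
    mean_emb = np.stack([rff(bag, omegas_inner).mean(axis=0) for bag in bags])
    # Stage 2: RFF for the outer kernel K on the finite-dimensional embeddings.
    return rff(mean_emb, omegas_outer)

rng = np.random.default_rng(0)
d, m1, m2 = 16, 50, 50
omegas_inner = rng.normal(size=(m1, d))
omegas_outer = rng.normal(size=(m2, 2 * m1))   # acts on the 2*m1-dimensional embeddings

bags = [rng.normal(size=(100, d)) for _ in range(5)]
features = embed_bags(bags, omegas_inner, omegas_outer)   # shape (5, 2*m2)
```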

Optimisation Problem
The problem becomes minimising the following objective.
Objective:
$$T = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \|\beta\|^2 + \mu \|\Omega\|^2, \qquad (2)$$
where the forward propagation is
$$\hat{y}_i = S\!\left(\frac{1}{\sqrt{m}} \sum_{j=1}^m \beta_j^{\cos} \cos(\omega_j^T x_i) + \frac{1}{\sqrt{m}} \sum_{j=1}^m \beta_j^{\sin} \sin(\omega_j^T x_i)\right) \qquad (3)$$
and $S$ is an activation function.
Back-propagation is the learning of the parameters $\beta$ and $\Omega$ (the matrix of frequencies $\omega_1, \ldots, \omega_m$).
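A sketch of objective (2), assuming the identity for the activation $S$ and reusing the forward pass sketched after equation (1):

```python
def objective(X, y, Omega, beta_cos, beta_sin, lam=0.05, mu=0.05):
    """T of eq. (2): mean squared error plus L2 penalties on beta and Omega."""
    residual = y - forward(X, Omega, beta_cos, beta_sin)   # forward() from the earlier sketch
    return ((residual ** 2).mean()
            + lam * (beta_cos @ beta_cos + beta_sin @ beta_sin)
            + mu * (Omega ** 2).sum())
```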

Gradient Descent
Given the forward propagation, update the parameters according to the gradient evaluated at the current parameters.
Gradient Descent:
$$\beta \leftarrow \beta - \epsilon\, \nabla_\beta T, \qquad (4)$$
$$\Omega \leftarrow \Omega - \epsilon\, \nabla_\Omega T, \qquad (5)$$
where $\epsilon$ is a user-defined step size.
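A sketch of updates (4) and (5), with the gradients of (2) written out for an identity activation (our assumption; the slides leave $S$ general):

```python
import numpy as np

def gradients(X, y, Omega, beta_cos, beta_sin, lam, mu):
    """Analytic gradients of T with respect to beta_cos, beta_sin and Omega."""
    n, m = X.shape[0], Omega.shape[0]
    proj = X @ Omega.T                               # (n, m)
    C, S = np.cos(proj), np.sin(proj)
    r = y - (C @ beta_cos + S @ beta_sin) / np.sqrt(m)   # residuals y_i - y_hat_i
    g_bc = -2.0 / (n * np.sqrt(m)) * (C.T @ r) + 2 * lam * beta_cos
    g_bs = -2.0 / (n * np.sqrt(m)) * (S.T @ r) + 2 * lam * beta_sin
    # Chain rule through proj[i, j] = omega_j^T x_i.
    G = -2.0 / (n * np.sqrt(m)) * r[:, None] * (-S * beta_cos + C * beta_sin)
    g_Omega = G.T @ X + 2 * mu * Omega
    return g_bc, g_bs, g_Omega

def gradient_descent(X, y, Omega, beta_cos, beta_sin, eps=0.1, lam=0.05, mu=0.05, iters=500):
    for _ in range(iters):
        g_bc, g_bs, g_Om = gradients(X, y, Omega, beta_cos, beta_sin, lam, mu)
        beta_cos, beta_sin = beta_cos - eps * g_bc, beta_sin - eps * g_bs
        Omega = Omega - eps * g_Om
    return Omega, beta_cos, beta_sin
```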

Stochastic Gradient Descent
Similar to gradient descent, but only one data point per iteration is used to evaluate the gradient. Let
$$T_i = (y_i - \hat{y}_i)^2 + \lambda \|\beta\|^2 + \mu \|\Omega\|^2. \qquad (6)$$
Stochastic Gradient Descent: on the $i$th iteration,
$$\beta \leftarrow \beta - \epsilon\, \nabla_\beta T_{\mathrm{mod}(i-1,\,n)+1}, \qquad (7)$$
$$\Omega \leftarrow \Omega - \epsilon\, \nabla_\Omega T_{\mathrm{mod}(i-1,\,n)+1}. \qquad (8)$$
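The same updates with a single point per iteration, cycling through the data as in (7) and (8); this sketch reuses the gradients helper from the gradient-descent sketch above.

```python
def sgd(X, y, Omega, beta_cos, beta_sin, eps=0.1, lam=0.05, mu=0.05, iters=5000):
    n = X.shape[0]
    for i in range(iters):
        j = i % n                                    # mod(i - 1, n) + 1 in 1-based indexing
        g_bc, g_bs, g_Om = gradients(X[j:j + 1], y[j:j + 1], Omega, beta_cos, beta_sin, lam, mu)
        beta_cos, beta_sin = beta_cos - eps * g_bc, beta_sin - eps * g_bs
        Omega = Omega - eps * g_Om
    return Omega, beta_cos, beta_sin
```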

Quasi-Newton
Similar to gradient descent, but the step varies according to the rate of change of the gradient:
$$\theta \leftarrow \theta - \big[\nabla_\theta \nabla_\theta^T\, T\big]^{-1} \nabla_\theta T. \qquad (9)$$
The Hessian $\nabla_\theta \nabla_\theta^T\, T$ is approximated rather than computed exactly.
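In practice the Hessian approximation is usually delegated to an off-the-shelf quasi-Newton routine such as (L-)BFGS; a sketch with scipy (our choice, the slides do not name an implementation), reusing the objective sketched after (3):

```python
import numpy as np
from scipy.optimize import minimize

def fit_quasi_newton(X, y, m, lam=0.05, mu=0.05, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta0 = np.concatenate([rng.normal(size=m * d), np.zeros(2 * m)])

    def unpack(theta):
        return theta[:m * d].reshape(m, d), theta[m * d:m * d + m], theta[m * d + m:]

    def loss(theta):
        Omega, beta_cos, beta_sin = unpack(theta)
        return objective(X, y, Omega, beta_cos, beta_sin, lam, mu)

    res = minimize(loss, theta0, method="L-BFGS-B")   # builds a BFGS-style Hessian approximation
    return unpack(res.x)
```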

Timings
[Figure: time taken to optimise (seconds) against m, for gradient descent, stochastic gradient descent and quasi-Newton.]

Training Error
[Figure: training error against m, for gradient descent, stochastic gradient descent and quasi-Newton.]

Test Error
[Figure: test error against m, for gradient descent, stochastic gradient descent and quasi-Newton.]

Classification Experiment
Adult Dataset [3]: $D = \{x_i, y_i\}_{i=1}^N$, $N = 48842$, $x_i \in \mathbb{R}^{108}$, $y_i \in \{0, 1\}$.
A) Two-layer neural network, quadratic loss: $\lambda = 1.5$, $\epsilon = 0.1$, $M = 100$. Result: 15.5% misclassifications.
B) Random Fourier features, ridge regression: $\lambda$ and $\sigma^2$ (values lost in the transcription), $M = 1000$. Result: 15.4% misclassifications.
Parameter selection: cross-validation? In practice, experimentation, with a risk of overfitting.
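A sketch of baseline B (random Fourier features followed by ridge regression on the binary labels; illustrative, with our own parameter defaults where the slide's values are missing):

```python
import numpy as np

def rff_ridge_classifier(X_train, y_train, X_test, m=1000, sigma2=1.0, lam=1.0, seed=0):
    """Ridge regression on RFF features, thresholded at 0.5 for {0, 1} labels."""
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    # Gaussian kernel exp(-||x - y||^2 / (2 * sigma2)) has spectral measure N(0, I / sigma2).
    omegas = rng.normal(scale=1.0 / np.sqrt(sigma2), size=(m, d))

    def feats(X):
        proj = X @ omegas.T
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

    Phi = feats(X_train)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(2 * m), Phi.T @ y_train)
    return (feats(X_test) @ w > 0.5).astype(int)
```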


Comparison of Frequencies
[Figure: comparison of frequencies; not reproduced in the transcription.]

PCA
[Figure: PCA biplot (variables V1-V108); PC2 explains 3.4% of the variance.]

Loss
[Figure: loss; not reproduced in the transcription.]

Regression Experiment
Aerosol Dataset [4]: bags $B_k = \{(x_{i,j}, y_i)_{j=1}^{B}\}_{i=1}^{N}$ with bag size 100, $k \in \{1, \ldots, K = 800\}$, $x_i \in \mathbb{R}^{16}$, $y_i \in \mathbb{R}$.
A) Three-layer neural network with mean pooling, quadratic loss: $M_1 = 50$, $M_2 = 50$, $\lambda = 0.05$, $\epsilon = 0.1$. Result: RMSE (value lost in the transcription).
B) Random Fourier features with mean pooling, ridge regression: $\lambda = 0.05$, $M_1 = 50$, $M_2 = 50$, $\sigma_1^2 = \sigma_2^2 = 0.5$. Result: 0.09 RMSE.


Comparison of Frequencies
[Figure: comparison of frequencies; not reproduced in the transcription.]

PCA
[Figure: PCA biplots; left panel, PC1 explains 14.5% and PC2 11.6% of the variance; right panel, PC1 explains 5.4% and PC2 4.9%.]

Loss
[Figure: loss; not reproduced in the transcription.]

Conclusions
Competitive methods: both approaches give very similar and accurate results.
Neural network: higher computational cost; potential to outperform in predictive power.
RFF: fast; requires the kernel to be specified in advance; simpler structure and better interpretability.

For Further Reading
Hofmann, T., Schölkopf, B. and Smola, A. J. (2008) Kernel Methods in Machine Learning, Annals of Statistics, Vol. 36, No. 3.
Gretton, A. (2015) Advanced Topics in Machine Learning, course notes, available at coursefiles/rkhscourse.html.
Rahimi, A. and Recht, B. (2007) Random Features for Large-Scale Kernel Machines, Advances in Neural Information Processing Systems (NIPS).
Szabó, Z., Gretton, A., Póczos, B. and Sriperumbudur, B. K. (2015) Two-stage Sampled Learning Theory on Distributions, International Conference on Artificial Intelligence and Statistics (AISTATS).
