Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Tolstikhin Ilya

Abstract

We will introduce the notion of reproducing kernels and the associated Reproducing Kernel Hilbert Spaces (RKHS). We will consider a couple of easy examples to get some intuition. Next we will motivate the importance of RKHS for machine learning by considering the representer theorem, which we will also prove. Finally, we will consider several scenarios where the representer theorem actually becomes very useful. Blue colour will be used to highlight parts appearing in the upcoming homework assignments.

1 Reproducing kernels and RKHS

Consider any input space $\mathcal{X}$. We will call a function $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a kernel or a reproducing kernel if it is symmetric, $k(x, y) = k(y, x)$ for all $x, y \in \mathcal{X}$, and positive definite, which means

$$\forall N \in \mathbb{N},\ \forall \alpha_1, \dots, \alpha_N \in \mathbb{R},\ \forall x_1, \dots, x_N \in \mathcal{X}, \qquad \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(x_i, x_j) \ge 0.$$

It can be shown that $k$ defines a unique Hilbert space $\mathcal{H}_k$ of real-valued functions on $\mathcal{X}$, such that:

1. the functions $k(x, \cdot) \colon \mathcal{X} \to \mathbb{R}$ belong to $\mathcal{H}_k$ for all $x \in \mathcal{X}$;
2. $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k}$ for any $f \in \mathcal{H}_k$ and $x \in \mathcal{X}$ (the reproducing property).

Throughout this lecture we will write $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ to denote the inner product of $\mathcal{H}_k$ and $\|\cdot\|_{\mathcal{H}_k}$ the norm induced by $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$. The space $\mathcal{H}_k$ is commonly known as a Reproducing Kernel Hilbert Space (RKHS). Notice that, because $\mathcal{H}_k$ is a vector space, all the functions of the form

$$\sum_{i} \alpha_i k(x_i, \cdot)$$

also belong to $\mathcal{H}_k$ for any finite sequence of real coefficients $\alpha_1, \alpha_2, \dots$ and points $x_1, x_2, \dots$ from $\mathcal{X}$.

(A Hilbert space is a vector space with an inner product, which is complete with respect to the norm induced by the inner product.)
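
As a quick numerical aside (my own sketch, not part of the notes): positive definiteness as defined above can be sanity-checked on any finite set of points, since it amounts to the Gram matrix $(k(x_i, x_j))_{i,j}$ being symmetric positive semidefinite. The helper names, the tolerance, and the use of NumPy below are choices made only for this illustration.

```python
# Minimal sanity check (illustration only): build the Gram matrix of a
# candidate kernel on a few sample points and verify that it is symmetric
# with nonnegative eigenvalues, i.e. positive semidefinite.
import numpy as np

def gram_matrix(k, points):
    """Gram matrix K with K[i, j] = k(points[i], points[j])."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(points[i], points[j])
    return K

def looks_like_kernel(k, points, tol=1e-10):
    """Check symmetry and positive semidefiniteness on the given sample."""
    K = gram_matrix(k, points)
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol)
    return symmetric and psd

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = list(rng.standard_normal((5, 3)))      # five points in R^3
    linear = lambda x, y: float(x @ y)           # the linear kernel <x, y>
    print(looks_like_kernel(linear, pts))        # expected output: True
```

Of course, passing this check on one finite sample is only a necessary condition; positive definiteness has to hold for every finite collection of points.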

Feature map

Another way to look at this construction is to say that all the points $x$ of the input space $\mathcal{X}$ are being mapped to the elements $k(x, \cdot)$ of the Hilbert space $\mathcal{H}_k$. Moreover, for any two points $x, x' \in \mathcal{X}$ the inner product between their images is equal to

$$\langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}_k} = k(x, x').$$

This observation leads to very useful implications. It turns out that, no matter what the input space $\mathcal{X}$ is ($\mathbb{R}^d$, a set of strings, a set of graphs, png pictures, ...), once we come up with a kernel function $k$ defined over $\mathcal{X}$ we simultaneously get a way to embed the whole of $\mathcal{X}$ into a Hilbert space. This embedding is very useful, since the RKHS has a very nice geometry: it is a vector space with an inner product, which means we can add its elements to each other and compute distances between them, something which was not necessarily possible for elements of $\mathcal{X}$ (think of a set of graphs).

Next we consider two simple examples of kernels $k$ and the corresponding RKHS $\mathcal{H}_k$:

1. Linear kernel. Consider $\mathcal{X} = \mathbb{R}^d$ and define $k(x, y) := \langle x, y \rangle_{\mathbb{R}^d}$. First of all, let's check that this is indeed a kernel. It is obviously symmetric. Also note that

$$\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(x_i, x_j) = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \langle x_i, x_j \rangle_{\mathbb{R}^d} = \Big\| \sum_{i=1}^{N} \alpha_i x_i \Big\|_{\mathbb{R}^d}^2 \ge 0.$$

Thus, $k$ is indeed a kernel. It is now easy to see that all the homogeneous linear functions of the form

$$f(x) = \langle w, x \rangle_{\mathbb{R}^d}, \quad w \in \mathbb{R}^d \tag{1}$$

belong to the RKHS $\mathcal{H}_k$, as well as all their finite linear combinations. Actually, it can be shown that $\mathcal{H}_k$ does not contain anything but the functions of the form (1). In this case it is obvious that $\mathcal{H}_k$ is of a finite dimensionality $d$. The inner product in $\mathcal{H}_k$ between its two elements $\langle w, \cdot \rangle_{\mathbb{R}^d}$ and $\langle v, \cdot \rangle_{\mathbb{R}^d}$ (which are two linear functions) is defined by $\big\langle \langle w, \cdot \rangle_{\mathbb{R}^d}, \langle v, \cdot \rangle_{\mathbb{R}^d} \big\rangle_{\mathcal{H}_k} = \langle w, v \rangle_{\mathbb{R}^d}$.

2. Polynomial kernel of second degree. Consider $\mathcal{X} = \mathbb{R}^2$ and $k(x, y) := (\langle x, y \rangle_{\mathbb{R}^2} + 1)^2$. Expanding the brackets we see:

$$k(x, y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 y_1 + 2 x_2 y_2 + 1.$$

First we need to check that it is indeed a kernel. It is symmetric. To check the positive definiteness note that if we define a mapping $\psi \colon \mathcal{X} \to \mathbb{R}^6$ by

$$\psi(x) = \big(x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ 1\big)$$

we may write

$$k(x, y) = \langle \psi(x), \psi(y) \rangle_{\mathbb{R}^6}, \quad x, y \in \mathcal{X}.$$

In other words, we showed that $k$ can be expressed as a linear kernel after mapping $\mathcal{X}$ into $\mathbb{R}^6$ using $\psi$. We already showed in the previous example that the linear kernel is indeed positive definite. Interestingly, notice that the image of $\psi$ is only a subset of $\mathbb{R}^6$, i.e. there are points $z \in \mathbb{R}^6$ such that $z$ cannot be expressed as $\psi(x)$ for any $x \in \mathcal{X}$.

Let us show that $\mathcal{H}_k$ contains all the polynomials of degree up to 2, i.e. functions of the form:

$$f(x) = v_1 x_1^2 + v_2 x_2^2 + v_3 x_1 x_2 + v_4 x_1 + v_5 x_2 + v_6, \quad x \in \mathcal{X},\ v \in \mathbb{R}^6. \tag{2}$$
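
The expansion of the second-degree polynomial kernel can also be checked numerically. The short sketch below (my own, not from the notes) verifies on random points that $k(x, y) = (\langle x, y \rangle + 1)^2$ agrees with $\langle \psi(x), \psi(y) \rangle_{\mathbb{R}^6}$ for the feature map $\psi$ given above; the function names and the number of test pairs are arbitrary choices.

```python
# Illustration only: check on random points that the second-degree polynomial
# kernel on R^2 coincides with the linear kernel applied to the explicit
# feature map psi into R^6 described above.
import numpy as np

def poly2_kernel(x, y):
    """k(x, y) = (<x, y> + 1)^2 on R^2."""
    return (float(x @ y) + 1.0) ** 2

def psi(x):
    """psi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)."""
    s = np.sqrt(2.0)
    return np.array([x[0] ** 2, x[1] ** 2, s * x[0] * x[1], s * x[0], s * x[1], 1.0])

rng = np.random.default_rng(1)
for _ in range(100):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    assert np.isclose(poly2_kernel(x, y), psi(x) @ psi(y))
print("k(x, y) equals <psi(x), psi(y)> on all random test pairs")
```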

First, we know that all the functions of the form $k(x, \cdot)$ belong to $\mathcal{H}_k$ for sure, i.e. all the functions of the form

$$f(x) = w_1^2 x_1^2 + w_2^2 x_2^2 + 2 w_1 w_2 x_1 x_2 + 2 w_1 x_1 + 2 w_2 x_2 + 1, \quad x, w \in \mathcal{X}. \tag{3}$$

These are polynomials with monomials of order up to two. However, we see that the coefficients of the monomials are interdependent, and they are all defined by setting only two coefficients $w_1$ and $w_2$. This is quite different from (2), where we are free to choose any coefficients of the monomials. However, recall that an RKHS is a vector space, thus it contains all the linear combinations of its elements. Now, do we get all the functions of the form (2) if we take all the linear combinations of the functions of the form (3)? It turns out that if we take the linear span of the vectors of the form

$$\big\{(w_1^2,\ w_2^2,\ w_1 w_2,\ w_1,\ w_2,\ 1) \colon w_1, w_2 \in \mathbb{R}\big\} \subset \mathbb{R}^6$$

we will get the whole of $\mathbb{R}^6$ (HW). This shows that $\mathcal{H}_k$ indeed contains all the polynomials of degree up to 2. It can also be shown that no other functions are contained in $\mathcal{H}_k$.

The two examples above showed that an RKHS can be of a finite dimension, which may or may not be larger than the dimensionality of $\mathcal{X}$. At this point it is important to say that an RKHS can actually even be infinite dimensional. This is the case, for instance, for the so-called Gaussian kernel $k(x, y) = e^{-(x - y)^2 / \sigma^2}$.

2 Representer theorem

Why are RKHS and kernels so important for machine learning? In all the previous lectures we studied problems of binary classification and also briefly mentioned regression problems. But what type of predictors did we actually see? It turns out that the main focus was on linear predictors. These functions (classifiers) are a good start, but of course they are not too flexible. We also saw an example of nonlinear methods, such as KNN. Note, however, that KNN can't be considered as a learning algorithm which chooses a predictor $\hat{h}$ from a fixed set of predictors $\mathcal{H}$. Finally, we saw the AdaBoost algorithm, which outputs a complex composition of base classifiers. This composition is of course not a linear classifier (even if the base classifiers were linear).

Kernels and RKHS provide a very convenient way to define classes $\mathcal{H}$ consisting of nonlinear functions. As we saw, it is enough to specify one kernel function $k$ to implicitly get the whole RKHS $\mathcal{H}_k$. Now, assume we would like to choose our predictors from $\mathcal{H}_k$. How do we do that? The next result shows that often this problem can be solved quite efficiently.

Theorem 1 (Representer theorem). Assume $k$ is a kernel defined over any $\mathcal{X}$ and $\mathcal{H}_k$ is the corresponding RKHS. Take any $n$ points $X_1, \dots, X_n \in \mathcal{X}$. Consider the following optimization problem:

$$\min_{f \in \mathcal{H}_k}\ \sum_{i=1}^{n} l_i\big(f(X_i)\big) + Q\big(\|f\|_{\mathcal{H}_k}\big), \tag{4}$$

where $l_i \colon \mathbb{R} \to \mathbb{R}$, $i = 1, \dots, n$, are any functions and $Q \colon \mathbb{R}_+ \to \mathbb{R}$ is nondecreasing. Then there exist $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ such that

$$f = \sum_{i=1}^{n} \alpha_i k(X_i, \cdot)$$

solves (4).

Proof. Assume there is $f^*$ solving (4). Because $\mathcal{H}_k$ is a Hilbert space we may write

$$f^* = \sum_{i=1}^{n} \beta_i k(X_i, \cdot) + u,$$

where $u \in \mathcal{H}_k$ and $\langle u, k(X_i, \cdot) \rangle_{\mathcal{H}_k} = 0$ for all $i = 1, \dots, n$. We used the fact that any vector (function) in a Hilbert space can be uniquely expressed as a sum of its orthogonal projection onto a linear subspace (here, the span of $k(X_1, \cdot), \dots, k(X_n, \cdot)$) and a complement, which is orthogonal to that subspace. It is also easy to check that

$$\|f^*\|_{\mathcal{H}_k}^2 = \Big\| \sum_{i=1}^{n} \beta_i k(X_i, \cdot) \Big\|_{\mathcal{H}_k}^2 + \|u\|_{\mathcal{H}_k}^2$$

and thus $\|f^*\|_{\mathcal{H}_k} \ge \|f_X\|_{\mathcal{H}_k}$, where we denoted

$$f_X := \sum_{i=1}^{n} \beta_i k(X_i, \cdot).$$

Because $Q$ is nondecreasing we conclude that $Q(\|f^*\|_{\mathcal{H}_k}) \ge Q(\|f_X\|_{\mathcal{H}_k})$. Now note that because of the reproducing property

$$l_i\big(f^*(X_i)\big) = l_i\big(\langle f^*, k(X_i, \cdot) \rangle_{\mathcal{H}_k}\big) = l_i\big(\langle f_X + u, k(X_i, \cdot) \rangle_{\mathcal{H}_k}\big) = l_i\big(\langle f_X, k(X_i, \cdot) \rangle_{\mathcal{H}_k}\big) = l_i\big(f_X(X_i)\big).$$

In other words we showed that

$$\sum_{i=1}^{n} l_i\big(f^*(X_i)\big) = \sum_{i=1}^{n} l_i\big(f_X(X_i)\big).$$

Thus, the value of the objective functional (4) at $f_X$ is not larger than at $f^*$, which shows that $f_X$ also solves the optimization problem.

In order to motivate the representer theorem we will first consider two concrete examples of Problem (4).

Binary classification

Can we use the real-valued functions from $\mathcal{H}_k$ for binary classification with $\mathcal{Y} = \{-1, +1\}$? Of course! We just need to take the sign of $f$, which gives us a binary-valued function. Consider a training sample $S = \{(X_i, Y_i)\}_{i=1}^{n}$ with $X_i \in \mathcal{X}$ for any input space $\mathcal{X}$ and $Y_i \in \mathcal{Y}$. Take any kernel $k$ on $\mathcal{X}$. Finally, set $l_i(z) := \mathbf{1}\{Y_i z \le 0\}$. In this case

$$\sum_{i=1}^{n} l_i\big(f(X_i)\big) = \sum_{i=1}^{n} \mathbf{1}\{Y_i f(X_i) \le 0\}$$

is just the empirical binary loss associated with the classifier $\operatorname{sgn} f(x)$. Setting $Q(z) = 0$ we see that (4) corresponds to the empirical risk minimization of the binary loss over $\mathcal{H}_k$.
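
To make the parametrization promised by the representer theorem concrete, here is a small sketch (my own illustration, not from the notes) that evaluates a classifier of the form $\operatorname{sgn}\big(\sum_j \alpha_j k(X_j, \cdot)\big)$ and the empirical binary loss from this example. The Gaussian kernel, the toy data, and all function names are assumptions made only for this illustration.

```python
# Illustration only: with the representer theorem the candidate classifiers
# take the form sgn(sum_j alpha_j k(X_j, x)), so prediction and the empirical
# binary loss above only need the coefficients alpha and kernel evaluations.
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """The Gaussian kernel mentioned above, one possible choice of k."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-(d @ d) / sigma ** 2))

def f_value(alpha, X_train, x, k=gaussian_kernel):
    """Evaluate f(x) = sum_j alpha_j k(X_j, x)."""
    return sum(a * k(Xj, x) for a, Xj in zip(alpha, X_train))

def empirical_binary_loss(alpha, X_train, Y_train, k=gaussian_kernel):
    """Number of training points with Y_i * f(X_i) <= 0 (the loss from this section)."""
    return sum(int(Yi * f_value(alpha, X_train, Xi, k) <= 0)
               for Xi, Yi in zip(X_train, Y_train))

if __name__ == "__main__":
    X_train = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
    Y_train = [-1, +1]
    alpha = [-1.0, 1.0]        # one coefficient per training point
    print(empirical_binary_loss(alpha, X_train, Y_train))   # 0 on this toy data
```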

Squared loss regression

We may also use elements of $\mathcal{H}_k$ for predicting real-valued outputs. Set $\mathcal{Y} = \mathbb{R}$ and $l_i(z) = (Y_i - z)^2$. In this case

$$\sum_{i=1}^{n} l_i\big(f(X_i)\big) = \sum_{i=1}^{n} \big(Y_i - f(X_i)\big)^2$$

is just the empirical squared loss and thus, setting $Q(z) = 0$, we get the empirical squared loss minimization over $\mathcal{H}_k$.

What is the importance of Theorem 1? A surprising message is the following. Originally, (4) is an optimization with respect to elements of $\mathcal{H}_k$, which are high-dimensional objects and potentially even infinite-dimensional. In other words, solving (4) requires choosing $m$ real numbers if $\mathcal{H}_k$ is $m$-dimensional (with $m$ potentially huge) or choosing a function which cannot be described by any finite number of parameters if $\mathcal{H}_k$ is infinite-dimensional. Still, Theorem 1 tells us that in any case this problem may be reduced to choosing only $n$ real-valued parameters. This gives a huge boost in efficiency if $\dim(\mathcal{H}_k) \gg n$, and especially if $\mathcal{H}_k$ is infinite-dimensional.

Using the representer theorem and the reproducing property we may restate Problem (4) in the following form:

$$\min_{\alpha_1, \dots, \alpha_n \in \mathbb{R}}\ \sum_{i=1}^{n} l_i\Big(\sum_{j=1}^{n} \alpha_j k(X_i, X_j)\Big) + Q\Big(\Big\| \sum_{j=1}^{n} \alpha_j k(X_j, \cdot) \Big\|_{\mathcal{H}_k}\Big) = \min_{\alpha_1, \dots, \alpha_n \in \mathbb{R}}\ \sum_{i=1}^{n} l_i\Big(\sum_{j=1}^{n} \alpha_j k(X_i, X_j)\Big) + Q\bigg(\Big(\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(X_i, X_j)\Big)^{1/2}\bigg).$$

We see that this optimization problem depends on the points $X_i$ and the kernel $k$ only through the kernel matrix $K_X \in \mathbb{R}^{n \times n}$ with $(i, j)$-th element being $k(X_i, X_j)$.
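
As a closing illustration (a sketch under my own assumptions, not part of the notes): for the squared loss with $Q \equiv 0$, the reduced problem above becomes the ordinary least-squares problem $\min_{\alpha} \|y - K_X \alpha\|^2$ in the $n$ coefficients $\alpha$, regardless of how large (or infinite) $\dim(\mathcal{H}_k)$ is. The Gaussian kernel, the bandwidth $\sigma$, and the toy data below are choices made only for this example.

```python
# Illustration only: squared loss with Q = 0, written in terms of the kernel
# matrix K_X. Fitting reduces to solving the least-squares problem K_X alpha ~ y,
# and prediction only needs kernel evaluations against the training points.
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K_X with (i, j)-th entry exp(-||X_i - X_j||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

def fit_alpha(X, y, sigma=1.0):
    """Least-squares solution of K_X alpha ~ y (pseudo-inverse handles rank deficiency)."""
    K = gaussian_kernel_matrix(X, sigma)
    alpha, *_ = np.linalg.lstsq(K, y, rcond=None)
    return alpha

def predict(alpha, X_train, x_new, sigma=1.0):
    """f(x) = sum_j alpha_j k(X_j, x) for the fitted coefficients alpha."""
    sq_dists = np.sum((X_train - x_new) ** 2, axis=1)
    return alpha @ np.exp(-sq_dists / sigma ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.uniform(-2.0, 2.0, size=(30, 1))               # toy 1-d inputs
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)    # noisy targets
    alpha = fit_alpha(X, y)
    print(predict(alpha, X, np.array([0.5])))               # roughly sin(0.5)
```

Note that the fit touches the data only through the $n \times n$ kernel matrix $K_X$, exactly as observed above.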