CR2: Statistical Learning & Applications
Examples of Kernels and Unsupervised Learning
Lecturer: Julien Mairal
Scribes: Rémi De Joannis de Verclos & Karthik Srikanta

1 Kernel Inventory

Linear kernel

The linear kernel is defined as K(x, y) = x^T y, where K : R^p x R^p -> R. Let us show that the corresponding reproducing kernel Hilbert space is H = {f_w : y -> w^T y | w in R^p}, where f_w is the function mapping y to w^T y, with the inner product <f_{w_1}, f_{w_2}>_H = w_1^T w_2. We check the following conditions:

(1) for all x in R^p, the function K_x : y -> K(x, y) belongs to H;

(2) for all f_w in H and all x in R^p,
    f_w(x) = <f_w, K_x>_H = w^T x;

(3) K is positive semidefinite, i.e., for every m and every (x_1, ..., x_m) in (R^p)^m, the matrix defined by [K]_{i,j} = K(x_i, x_j) satisfies alpha^T K alpha >= 0 for all alpha in R^m.

Polynomial kernel of degree 2

The polynomial kernel is given by K(x, y) = (x^T y)^2. It is positive semidefinite since

sum_{i,j} alpha_i alpha_j K(x_i, x_j) = sum_{i,j} alpha_i alpha_j (x_i^T x_j)^2
  = sum_{i,j} alpha_i alpha_j Trace(x_i x_i^T x_j x_j^T)
  = sum_{i,j} alpha_i alpha_j <x_i x_i^T, x_j x_j^T>_F
  = < sum_i alpha_i x_i x_i^T, sum_j alpha_j x_j x_j^T >_F
  = Trace(Z^T Z)  where Z = sum_i alpha_i x_i x_i^T
  = ||Z||_F^2 >= 0.
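As a quick numerical sanity check of this argument (this snippet is not part of the original notes; it assumes numpy is available), one can build the Gram matrix of the degree-2 polynomial kernel on random points and verify that its eigenvalues are non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))          # 10 points in R^5

# Gram matrix of the degree-2 polynomial kernel K(x, y) = (x^T y)^2
K = (X @ X.T) ** 2

# A positive semidefinite matrix has only non-negative eigenvalues
# (up to numerical error).
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())
assert eigvals.min() >= -1e-8
```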
It is important here to define the Frobenius norm: ||Z||_F = sqrt( sum_{i,j} Z_{i,j}^2 ). Also note that <Z_1, Z_2>_F = Trace(Z_1^T Z_2).

We would now like to find H such that conditions (1) and (2) above are satisfied. To satisfy (1), we need f_x : y -> K(x, y) to belong to H for all x in R^p, and

f_x(y) = (x^T y)^2 = <x x^T, y y^T>_F = y^T (x x^T) y.

Because H is a Hilbert space, it must also contain, for all beta in R^n and (x_1, ..., x_n) in (R^p)^n, the function y -> y^T ( sum_{i=1}^n beta_i x_i x_i^T ) y. A good candidate for H is thus

H = { y -> y^T A y | A symmetric, A in R^{p x p} }   with   <f_A, f_B>_H = Trace(A B).

This is easy to check: f_A(x) = x^T A x = <A, x x^T>_F and K_x = f_{x x^T}, which implies f_A(x) = <f_A, K_x>_H.

Gaussian kernel

The Gaussian kernel is given by K(x, y) = exp( -||x - y||^2 / omega^2 ) = k(x - y). The corresponding RKHS is more involved. We will admit that if we define

<f, g>_H = integral of ( \hat{f}(w) \overline{\hat{g}(w)} ) / ( (2 pi)^p \hat{k}(w) ) dw,

then H = { f | f is integrable, continuous and <f, f>_H < +infinity }.

The min kernel

Let K(x, y) = min(x, y) for x, y in [0, 1]. Let us show that K is positive semidefinite: for all alpha in R^n and (x_1, ..., x_n) in [0, 1]^n we have

sum_{i,j} alpha_i alpha_j min(x_i, x_j) = sum_{i,j} alpha_i alpha_j integral_0^1 1_{t <= x_i} 1_{t <= x_j} dt
  = integral_0^1 ( sum_{i=1}^n alpha_i 1_{t <= x_i} ) ( sum_{j=1}^n alpha_j 1_{t <= x_j} ) dt
  = integral_0^1 Z(t)^2 dt >= 0,   where Z(t) = sum_{i=1}^n alpha_i 1_{t <= x_i}.

It is thus appealing to believe that H = { f | f in L^2[0, 1] } with <f, g> = integral_0^1 f(t) g(t) dt. However, this is a mistake, since <f, f> = 0 does not imply that f = 0. In fact,

H = { f : [0, 1] -> R | f continuous, differentiable almost everywhere, and f(0) = 0 }.

Now we take <f, g>_H = integral_0^1 f'(t) g'(t) dt, so that <f, f>_H = 0 implies f' = 0, which implies f = 0. We can now check the remaining conditions. First, K_x : y -> min(x, y) belongs to H.
Second, for all f in H and x in [0, 1],

<f, K_x>_H = integral_0^1 f'(t) 1_{t <= x} dt = integral_0^x f'(t) dt = f(x) - f(0) = f(x),

so the reproducing property holds.

[Figure 1: Graph of K_x. The function y -> min(x, y) increases linearly with slope 1 up to y = x and is constant equal to x afterwards.]

Histogram kernel

The histogram kernel is given by K(h_1, h_2) = sum_{j=1}^k min(h_1(j), h_2(j)), where h_1, h_2 in [0, 1]^k. It is notably used in computer vision for comparing histograms of visual words.

Spectrum kernel

We now motivate the spectrum kernel through biology. The human genome, which carries the protein-coding genes, consists of 3,000,000,000 bases drawn from the building blocks A, T, C, G. Consider a sequence u in A^k of size k (where in this case A = {A, T, C, G}). Let phi_u(x) be the number of occurrences of u in x. We define the spectrum kernel as

K(x, x') = sum_{u in A^k} phi_u(x) phi_u(x') = <phi(x), phi(x')>,   where phi(x) in R^{|A|^k}.

We are interested in computing K(x, x') efficiently. We have

K(x, x') = sum_{i=1}^{|x|-k+1} sum_{j=1}^{|x'|-k+1} 1_{ x[i, i+k-1] = x'[j, j+k-1] }.

The computation can be done in O((|x| + |x'|) k) by using a trie. We demonstrate this with an example. Let x = ACGTTTACGA and x' = AGTTTACG.

[Figure: the tries built from the length-3 substrings of x and x', with the number of occurrences of each substring stored at the leaves.]
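Before the trie-based procedure of Algorithm 1 below, here is a naive sketch (not from the original notes; plain Python, explicit k-mer counting) that computes the same quantity and already runs in time linear in the sequence lengths for a fixed k:

```python
from collections import Counter

def spectrum_kernel(x: str, xp: str, k: int) -> int:
    """K(x, x') = sum over all k-mers u of phi_u(x) * phi_u(x')."""
    # Count every length-k substring (k-mer) of each sequence.
    counts_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_xp = Counter(xp[j:j + k] for j in range(len(xp) - k + 1))
    # Only k-mers present in both sequences contribute to the sum.
    return sum(c * counts_xp[u] for u, c in counts_x.items() if u in counts_xp)

print(spectrum_kernel("ACGTTTACGA", "AGTTTACG", k=3))
```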
Algorithm 1 ProcessNode(v)
Input: trie representations of the sequences x and x'.
Output: K(x, x').
1: if v is a leaf then
2:   K(x, x') <- K(x, x') + i(v_x) * i(v_{x'}), where i(v_x) and i(v_{x'}) are the occurrence counts stored at the leaf v in the tries of x and x'
3: else
4:   for all children common to v in the tries of x and x' do
5:     ProcessNode(child)
6:   end for
7: end if
8: The recursion is started at the root: ProcessNode(r).

Exercises
1. Show that if K_1 and K_2 are positive semidefinite then
   (a) K_1 + K_2 is positive semidefinite;
   (b) alpha K_1 + beta K_2 is positive semidefinite, where alpha, beta > 0.
2. Show that if (K_n)_{n>0} -> K pointwise, each K_n being positive semidefinite, then K is positive semidefinite.
3. Show that K(x, y) = min(x, y) is positive semidefinite for all x, y in [0, +infinity[.

Walk kernel

We say that G' = (V', E') is a subgraph of G = (V, E) if V' is a subset of V, E' is a subset of E, and the labels of V' are those of the corresponding vertices of G. Let phi_U(G) be the number of times U appears as a subgraph of G. The subgraph kernel is defined as

K(G, G') = sum_U phi_U(G) phi_U(G').

Unfortunately, computing this kernel is NP-hard. We therefore turn to the walk kernel, and first recall some definitions. A walk is an alternating sequence of vertices and connecting edges; less formally, a walk is any route through a graph from vertex to vertex along edges. A walk can end on the same vertex on which it began or on a different vertex, and it can travel over any edge and any vertex any number of times. A path is a walk that does not include any vertex twice, except that its first vertex may be the same as its last. Now let

W_k(G) = { walks in G of size k },
S_k = { sequences of labels of size k } = A^k.

For s in S_k, phi_s(G) = sum_{w in W_k(G)} 1_{labels(w) = s}. This gives us

K(G_1, G_2) = sum_{s in S_k} phi_s(G_1) phi_s(G_2).

Computing the walk kernel seems hard at first sight, but it can be done efficiently by using the product graph, defined as follows. Given graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2), we define

G_1 x G_2 = ( { (v_1, v_2) in V_1 x V_2 with the same labels },
              { ((v_1, v_2), (v_1', v_2')) such that (v_1, v_1') in E_1 and (v_2, v_2') in E_2 } ).

There is a bijection between walks in G_1 x G_2 and pairs of walks in G_1 and G_2 with the same labels. More formally,
K(G_1, G_2) = sum_{w_1 in W_k(G_1)} sum_{w_2 in W_k(G_2)} 1_{ l(w_1) = l(w_2) }
            = sum_{w in W_k(G_1 x G_2)} 1
            = |W_k(G_1 x G_2)|.

[Figure 2: Example of graph product. Two labelled graphs G_1 and G_2 and their product graph G_1 x G_2, whose vertices are the pairs of identically labelled vertices of G_1 and G_2.]

Exercise
1. Let A be the adjacency matrix of G, A in R^{|V| x |V|}. Prove that the number of walks of size k starting at i and ending at j is [A^k]_{i,j}.
2. Show that K(G_1, G_2) can be computed in polynomial time.
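The exercise suggests the efficient implementation: build the product graph, take powers of its adjacency matrix, and sum the entries. A minimal sketch (not from the notes; it assumes numpy and counts pairs of label-matching walks with exactly k edges):

```python
import numpy as np

def product_graph_adjacency(A1, labels1, A2, labels2):
    """Adjacency matrix of G1 x G2: vertices are pairs of identically
    labelled vertices, edges are pairs of edges."""
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    A = np.zeros((len(pairs), len(pairs)))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            A[a, b] = A1[i, k] * A2[j, l]   # edge present in both graphs
    return A

def walk_kernel(A1, labels1, A2, labels2, k):
    """Number of pairs of label-matching walks with k edges:
    the sum of the entries of the k-th power of the product-graph adjacency."""
    A = product_graph_adjacency(A1, labels1, A2, labels2)
    return np.linalg.matrix_power(A, k).sum()

# Toy example with two small labelled graphs.
A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
A2 = np.array([[0, 1], [1, 0]])
print(walk_kernel(A1, ["a", "b", "c"], A2, ["a", "b"], k=2))
```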
2 Unsupervised Learning

Unsupervised learning consists of discovering the underlying structure of data without labels. It is useful for many tasks, such as removing noise from the data (preprocessing), interpreting and visualizing the data, compression, etc.

Principal component analysis

We have a centered data set X = [x_1, ..., x_n] in R^{p x n}. The projection of x onto the direction w in R^p is (w^T x / ||w||_2^2) w, and the empirical variance captured by w is

Var(w) = (1/n) sum_{i=1}^n (w^T x_i)^2 / ||w||_2^2.

We provide below the PCA (Principal Component Analysis) algorithm:

Algorithm 2 PCA
1: Suppose we have (w_1, ..., w_{i-1}).
2: Construct w_i in argmax_{ ||w||_2 = 1, w orthogonal to w_1, ..., w_{i-1} } Var(w).

Lemma: w_1, ..., w_i are the successive eigenvectors of X X^T, ordered by decreasing eigenvalue.

Proof: We have

w_1 = argmax_{||w||_2 = 1} (1/n) sum_{i=1}^n w^T x_i x_i^T w = argmax_{||w||_2 = 1} w^T X X^T w.

X X^T is symmetric, so it can be diagonalized as X X^T = U S U^T, where U is orthogonal and S is diagonal. Then

w_1 = argmax_{||w||_2 = 1} w^T U S U^T w = argmax_{||w||_2 = 1} (U^T w)^T S (U^T w) = U ( argmax_{||z||_2 = 1} z^T S z ),   where z = U^T w.

For w_i, i > 1, we have

w_i = argmax_{ ||w||_2 = 1, w orthogonal to u_1, ..., u_{i-1} } w^T U S U^T w
    = argmax_{ ||w||_2 = 1, w orthogonal to u_1, ..., u_{i-1} } (U^T w)^T S (U^T w)
    = U ( argmax_{ ||z||_2 = 1, z orthogonal to e_1, ..., e_{i-1} } z^T S z ).

As a consequence, PCA can be computed with the SVD.

Theorem (Eckart-Young theorem): PCA (SVD) provides the best low-rank approximation:

min_{ X' in R^{p x n}, rank(X') <= k } ||X - X'||^2,

where ||.|| denotes the Frobenius norm.

Proof: Any matrix X' of rank k can be written X' = sum_{i=1}^k s_i' u_i' v_i'^T = U_k S_k V_k^T, where U_k in R^{p x k} with U_k^T U_k = I, V_k in R^{n x k} with V_k^T V_k = I, and S_k is a k x k diagonal matrix. Thus,

min_{ X' in R^{p x n}, rank(X') <= k } ||X - X'||^2 = min_{ U^T U = I, V^T V = I, S diagonal, rank(U S V^T) <= k } ||X - U S V^T||^2.
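As an aside before the lemma used to finish this proof, here is a small numerical illustration (not from the notes; it assumes numpy) of the statement above that PCA can be computed with the SVD: the left singular vectors of the centered data matrix coincide, up to sign, with the eigenvectors of X X^T.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))        # p = 5 features, n = 100 points
X = X - X.mean(axis=1, keepdims=True)    # center the data

# Eigenvectors of X X^T, reordered by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Left singular vectors of X (sorted by decreasing singular value)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Same directions up to sign: |<u_i, w_i>| should be 1 for every component
print(np.abs(np.sum(U * eigvecs, axis=0)))
```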
Lemma: Let U in R^{p x m} and V in R^{n x m}, where m = min(n, p). If U^T U = I and V^T V = I, then the matrices Z_{ij} = U_i V_j^T (with U_i and V_j the i-th and j-th columns of U and V) form an orthonormal family of R^{p x n} for the Frobenius inner product.

Proof: It suffices to check the scalar products:

<Z_{ij}, Z_{kl}> = Trace(Z_{ij}^T Z_{kl}) = Trace(V_j U_i^T U_k V_l^T) = (U_i^T U_k)(V_l^T V_j) = 1_{ i = k and j = l },

since U and V have orthonormal columns.

We have X = sum_{i=1}^m s_i Z_{ii}, and thanks to the lemma we can write X' in terms of the Z_{ij}: X' = sum_{i,j} s'_{ij} Z_{ij}. Under the rank constraint, ||X - X'||^2 is minimal iff sum_{i,j} (s_{ij} - s'_{ij})^2 is minimal, i.e., iff

sum_{i=1}^m (s_i - s'_{ii})^2 + sum_{i != j} (s'_{ij})^2

is minimal (writing s_{ii} = s_i and s_{ij} = 0 for i != j). The first term is minimized by taking s'_{ii} = s_i for the k largest values of s_i, in accordance with the rank constraint, and the second term should be set to 0.

Kernel PCA

If we have a feature map phi : x -> phi(x), the variance captured by a function f in H is

Var(f) = (1/n) sum_{i=1}^n f(x_i)^2 / ||f||_H^2,

assuming the phi(x_i) are centered in the feature space. A few questions remain:

1. What does it mean to have the phi(x_i) centered?
2. How do we solve
   (1)   argmax_{ f in H, f orthogonal to f_1, ..., f_{i-1} } Var(f)?

For question 1: having the phi(x_i) centered means implicitly replacing phi(x_i) by phi(x_i) - m, where m = (1/n) sum_{i=1}^n phi(x_i) is the average of the phi(x_i). If our kernel is K(x_i, x_j) = <phi(x_i), phi(x_j)>_H, the new kernel becomes

K_c(x_i, x_j) = <phi(x_i) - m, phi(x_j) - m>
  = <phi(x_i), phi(x_j)> - (1/n) sum_{l=1}^n <phi(x_l), phi(x_i)> - (1/n) sum_{l=1}^n <phi(x_l), phi(x_j)> + (1/n^2) sum_{l,k} <phi(x_l), phi(x_k)>.
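A small numerical sketch of this centering (not in the original notes; it assumes numpy and applies the element-wise formula above, the equivalent matrix expression being given just below):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
K = (X @ X.T + 1.0) ** 2            # some kernel matrix, here a polynomial one
n = K.shape[0]

# Element-wise centering:
# K_c[i, j] = K[i, j] - mean_l K[l, i] - mean_l K[l, j] + mean_{l, k} K[l, k]
col_mean = K.mean(axis=0)
Kc = K - col_mean[:, None] - col_mean[None, :] + K.mean()

# Sanity check: the centered Gram matrix has zero row and column sums.
print(np.allclose(Kc.sum(axis=0), 0.0))
```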
In terms of matrices, this is written K_c = (I - 1/n) K (I - 1/n), where 1 denotes the n x n matrix with all coefficients equal to 1.

For question 2: by the representer theorem, any solution of (1) has the form f : x -> sum_{i=1}^n alpha_i K(x_i, x). What we need are the alpha's. We have ||f||_H^2 = alpha^T K alpha, f(x_i) = [K alpha]_i, and <f, f_l>_H = 0 iff alpha^T K alpha_l = 0. We therefore want to find

max_{alpha in R^n} (1/n) ||K alpha||_2^2 / (alpha^T K alpha)   s.t.   alpha^T K alpha_l = 0 for l < i,

i.e.,

max_alpha (alpha^T K^2 alpha) / (alpha^T K alpha)   with   alpha^T K alpha_l = 0 for l < i.

K is symmetric, so K = U S^2 U^T, where U is orthogonal and S is diagonal. We want

max_alpha (alpha^T U S^4 U^T alpha) / (alpha^T U S^2 U^T alpha) = max_{beta in R^n} (beta^T S^2 beta) / ||beta||_2^2,   where beta = S U^T alpha and beta_l = S U^T alpha_l.

If the eigenvalues in S are ordered decreasingly, the successive solutions are beta = e_l, from which alpha is recovered through beta = S U^T alpha.

Recipe:
1. Center the kernel matrix.
2. Compute the eigenvalue decomposition K_c = U S^2 U^T.
3. alpha_l = s_l^{-1} U_l, the l-th eigenvector rescaled by the inverse of the corresponding singular value.
4. f_l(x_i) = [K_c alpha_l]_i, the coordinate of the point x_i on the l-th kernel principal direction.

K-means clustering

Let X = [x_1, ..., x_n] in R^{p x n} be data points. We would like to form clusters around centroids C = [C_1, ..., C_k] in R^{p x k}. Our goals are to:
1. learn C;
2. assign each data point to a centroid C_j.
[Figure: data points in the plane (axes x and y).]

A popular way to do this is the K-means algorithm:

Algorithm 3 K-means
1: for i = 1, ..., n do
2:   assign l_i in argmin_{j = 1, ..., k} ||x_i - C_j||_2^2
3: end for
4: for j = 1, ..., k do
5:   re-estimate the centroid C_j <- (1/n_j) sum_{i : l_i = j} x_i, where n_j is the number of points assigned to cluster j
6: end for
The two steps are alternated until the assignments stabilize.

The algorithm can be interpreted as follows: we are trying to find C (and labels l) solving

min_{C, l} sum_{i=1}^n ||x_i - C_{l_i}||_2^2.

Given some fixed labels, setting the gradient with respect to C_j to zero,

grad_{C_j} f = 2 sum_{i : l_i = j} (C_j - x_i) = 0,

recovers exactly the centroid update above. Note that K-means is not an optimal algorithm: min_{C, l} sum_{i=1}^n ||x_i - C_{l_i}||_2^2 is a non-convex optimization problem with several local minima.

Kernel K-means

Kernel K-means is defined by the problem

min_{C, l} sum_{i=1}^n ||phi(x_i) - C_{l_i}||_H^2,

where the centroids now live in the feature space H. Given some labels, the re-estimation step is

C_j <- (1/n_j) sum_{i : l_i = j} phi(x_i).
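Going back to the plain K-means iterations of Algorithm 3, a minimal sketch might look as follows (not from the notes; it assumes numpy, and the kernelized assignment step is derived next):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """X: (p, n) data matrix as in the notes. Returns centroids C (p, k) and labels."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    C = X[:, rng.choice(n, size=k, replace=False)]      # initialize with random points
    for _ in range(n_iter):
        # Assignment step: nearest centroid in squared Euclidean distance
        d2 = ((X[:, :, None] - C[:, None, :]) ** 2).sum(axis=0)   # shape (n, k)
        labels = d2.argmin(axis=1)
        # Re-estimation step: each centroid is the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                C[:, j] = X[:, labels == j].mean(axis=1)
    return C, labels

X = np.hstack([np.random.randn(2, 50), np.random.randn(2, 50) + 5.0])
C, labels = kmeans(X, k=2)
```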
The assignment step can also be expressed with kernel evaluations only:

l_i in argmin_j || phi(x_i) - (1/n_j) sum_{l in S_j} phi(x_l) ||_H^2
    = argmin_j ( ||phi(x_i)||^2 + ||C_j||^2 - (2/n_j) sum_{l in S_j} K(x_i, x_l) ),

where S_j = {i : l_i = j}; the term ||C_j||^2 likewise expands into kernel evaluations.

Mixture of Gaussians

The Gaussian distribution is given by

N(x | mu, Sigma) = ( 1 / ( (2 pi)^{p/2} |Sigma|^{1/2} ) ) exp( -(1/2) (x - mu)^T Sigma^{-1} (x - mu) ).

Assume that the data is generated by the following procedure:
1. Random class assignment C: P(C = C_j) = pi_j, where sum_{j=1}^k pi_j = 1 and pi_j >= 0.
2. Given the class C_j, draw x ~ N(x | mu_j, Sigma_j).

The parameters are pi_1, ..., pi_k and (mu_1, Sigma_1), ..., (mu_k, Sigma_k). If we observe x_1, ..., x_n, can we infer the parameters theta = (pi, mu, Sigma)? The maximum likelihood estimator is

theta_hat = argmax_theta P_theta(x_1, ..., x_n) = argmax_theta prod_{i=1}^n P_theta(x_i).

One algorithm to get an approximate solution is called EM, for "Expectation-Maximization", which iteratively increases the likelihood.

Algorithm 4 Expectation-Maximization
1: Define L(theta) = log P_theta(x_1, ..., x_n) = sum_{i=1}^n log P_theta(x_i).
2: repeat
3:   E-step: find q given theta fixed (auxiliary soft assignments).
4:   M-step: maximize some function l(theta, q) with q fixed.
5: until convergence

Explanations: at the E-step,

q_{ij} = pi_j N(x_i | mu_j, Sigma_j) / sum_{l=1}^k pi_l N(x_i | mu_l, Sigma_l).

Notice that sum_{j=1}^k q_{ij} = 1 for every i. This can be interpreted as a soft assignment of every data point to the classes.

At the M-step we have:

pi_j <- (1/n) sum_{i=1}^n q_{ij},
mu_j <- sum_{i=1}^n q_{ij} x_i / sum_{i=1}^n q_{ij},
Sigma_j <- sum_{i=1}^n q_{ij} (x_i - mu_j)(x_i - mu_j)^T / sum_{i=1}^n q_{ij}.
The key ingredient here is the use of Jensen's inequality, but we will not have the time to present the details in this course.
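To make the E- and M-steps above concrete, here is a minimal EM sketch for a mixture of Gaussians (not from the notes; it assumes numpy and scipy, uses full covariances, and omits safeguards against degenerate clusters):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """X: (n, p) data. Returns mixture weights pi, means mu, covariances sigma."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]           # (k, p)
    sigma = np.stack([np.eye(p) for _ in range(k)])        # (k, p, p)
    for _ in range(n_iter):
        # E-step: q_ij = pi_j N(x_i | mu_j, Sigma_j) / sum_l pi_l N(x_i | mu_l, Sigma_l)
        q = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                      for j in range(k)], axis=1)          # (n, k)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters with the soft assignments
        nj = q.sum(axis=0)                                 # (k,)
        pi = nj / n
        mu = (q.T @ X) / nj[:, None]
        for j in range(k):
            d = X - mu[j]
            sigma[j] = (q[:, j, None] * d).T @ d / nj[j]
    return pi, mu, sigma

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
pi, mu, sigma = em_gmm(X, k=2)
```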