Data Mining Techniques


1 Data Mining Techniques CS Section 2 - Spring 2017 Lecture 4 Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams, Percy Liang)

2 Kernel Regression

3 Basis Function Regression. Linear regression: y(x) = w^⊤x. Basis function regression: y(x) = w^⊤φ(x). For N samples: y = Φw with Φ := φ(X). Polynomial regression is the special case φ(x) = (1, x, x², …)^⊤.

4 Basis Function Regression (figure: polynomial fit with M = 3 to data points (x, t))

5 The Kernel Trick. Define a kernel function k(x, x′) := φ(x)^⊤φ(x′); k can be cheaper to evaluate than φ!

6 Kernel Ridge Regression. MAP / expected value for the weights (requires inversion of a D×D matrix): E[w | y] = A⁻¹Φ^⊤y, where A := Φ^⊤Φ + λI and Φ := φ(X). Alternate representation (requires inversion of an N×N matrix): A⁻¹Φ^⊤ = Φ^⊤(K + λI)⁻¹, where K := ΦΦ^⊤. Predictive posterior (using the kernel function): E[f(x*) | y] = φ(x*)^⊤E[w | y] = φ(x*)^⊤Φ^⊤(K + λI)⁻¹y.

7 Kernel Ridge Regression. As on the previous slide, with the predictive posterior written in terms of the kernel function: E[f(x*) | y] = φ(x*)^⊤Φ^⊤(K + λI)⁻¹y = Σ_{n,m} k(x*, xₙ) (K + λI)⁻¹_{nm} yₘ.

8 Kernel Ridge Regression. RKHS view: f* = argmin_{f ∈ H} Σ_{i=1}^n (yᵢ − ⟨f, φ(xᵢ)⟩_H)² + λ‖f‖²_H. (Figure: fits for λ = 0.1, λ = 10, and λ = 1e−07 at fixed kernel bandwidth σ.) Closed-form solution: f*(x) = Σᵢ αᵢ k(xᵢ, x) with α = (K + λI)⁻¹y.
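
As a concrete illustration of the closed-form solution above, here is a minimal numpy sketch of kernel ridge regression with a Gaussian kernel; the function names, the bandwidth sigma, and the regularization strength lam are illustrative choices, not taken from the slides.

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.6):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def fit_kernel_ridge(X, y, lam=0.1, sigma=0.6):
    """Dual weights alpha = (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, X_test, sigma=0.6):
    """Predictive mean f(x*) = sum_n k(x*, x_n) alpha_n."""
    return rbf_kernel(X_test, X_train, sigma) @ alpha

# tiny synthetic 1-d example
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(20)
alpha = fit_kernel_ridge(X, y)
f_star = predict_kernel_ridge(X, alpha, X)
```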

9 Gaussian Processes (a.k.a. Kernel Ridge Regression with Variance Estimates). Predictive distribution: p(y* | x*, X, y) = N( k(x*, X)^⊤[K + σ²_noise I]⁻¹y, k(x*, x*) + σ²_noise − k(x*, X)^⊤[K + σ²_noise I]⁻¹k(x*, X) ). (Figures: function draws, output f(x) vs. input x.) Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13.
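
A sketch of how this predictive mean and variance can be computed with numpy; the squared-exponential kernel, the length scale ell, and the noise level sigma_noise are assumed hyperparameters, not values from the slide.

```python
import numpy as np

def sq_exp(A, B, ell=0.6):
    """Squared-exponential kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * ell ** 2))

def gp_predict(X, y, X_star, sigma_noise=0.1, ell=0.6):
    """GP posterior predictive mean and (marginal) variance at test inputs X_star."""
    K = sq_exp(X, X, ell)                              # N x N
    K_s = sq_exp(X, X_star, ell)                       # N x M
    K_ss = sq_exp(X_star, X_star, ell)                 # M x M
    A = K + sigma_noise ** 2 * np.eye(len(X))          # K + sigma_noise^2 I
    mean = K_s.T @ np.linalg.solve(A, y)
    cov = K_ss + sigma_noise ** 2 * np.eye(len(X_star)) - K_s.T @ np.linalg.solve(A, K_s)
    return mean, np.diag(cov)
```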

10 Choosing Kernel Hyperparameters. (Figure: the mean posterior predictive function plotted for 3 different length scales, labelled too long, about right, and too short; function value y vs. input x.) Kernel: k(x, x′) = v² exp(−(x − x′)²/(2ℓ²)) + σ²_noise δ_{xx′}. Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13.

11 Intermezzo: Kernels Borrowing from: Arthur Gretton (Gatsby, UCL)

12 Hilbert Spaces. Definition (Inner product). Let H be a vector space over ℝ. A function ⟨·,·⟩_H : H × H → ℝ is an inner product on H if it is (1) linear: ⟨α₁f₁ + α₂f₂, g⟩_H = α₁⟨f₁, g⟩_H + α₂⟨f₂, g⟩_H; (2) symmetric: ⟨f, g⟩_H = ⟨g, f⟩_H; and (3) ⟨f, f⟩_H ≥ 0, with ⟨f, f⟩_H = 0 if and only if f = 0. Norm induced by the inner product: ‖f‖_H := √⟨f, f⟩_H.

13 Example: Fourier Bases

16 Example: Fourier Bases Fourier modes define a vector space

17 Kernels. Definition: Let X be a non-empty set. A function k : X × X → ℝ is a kernel if there exists an ℝ-Hilbert space H and a map φ : X → H such that for all x, x′ ∈ X, k(x, x′) := ⟨φ(x), φ(x′)⟩_H. Almost no conditions on X (e.g. X itself doesn't need an inner product, e.g. documents). A single kernel can correspond to several possible feature maps. A trivial example for X := ℝ: φ₁(x) = x and φ₂(x) = (x/√2, x/√2).

18 Sums, Transformations, Products. Theorem (Sums of kernels are kernels): Given α > 0 and k, k₁, k₂ all kernels on X, then αk and k₁ + k₂ are kernels on X. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why?). Theorem (Mappings between spaces): Let X and X̃ be sets, define a map A : X → X̃ and a kernel k on X̃. Then k(A(x), A(x′)) is a kernel on X. Example: k(x, x′) = x²(x′)². Theorem (Products of kernels are kernels): Given k₁ on X₁ and k₂ on X₂, then k₁ × k₂ is a kernel on X₁ × X₂. If X₁ = X₂ = X, then k := k₁k₂ is a kernel on X.
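
These closure rules map directly onto code: new kernel functions can be assembled from existing ones. A small sketch; the helper names are illustrative.

```python
import numpy as np

def linear_kernel(x, y):
    """Linear kernel k(x, y) = <x, y>."""
    return float(x @ y)

def sum_kernel(k1, k2):
    """Sum of two kernels on the same space is a kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)

def product_kernel(k1, k2):
    """Product of two kernels is a kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)

def mapped_kernel(k, A):
    """k(A(x), A(y)) is a kernel for any map A, e.g. A(x) = x**2."""
    return lambda x, y: k(A(x), A(y))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
k_quad = product_kernel(linear_kernel, linear_kernel)   # k(x, y) = <x, y>^2
print(k_quad(x, y), sum_kernel(linear_kernel, k_quad)(x, y))
```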

19 Polynomial Kernels. Theorem (Polynomial kernels): Let x, x′ ∈ ℝᵈ for d ≥ 1, let m ≥ 1 be an integer and c > 0 a positive real. Then k(x, x′) := (⟨x, x′⟩ + c)^m is a valid kernel. To prove: expand into a sum (with non-negative scalars) of kernels ⟨x, x′⟩ raised to integer powers. These individual terms are valid kernels by the product rule.

20 Infinite Sequences. Definition: The space ℓ² (square summable sequences) comprises all sequences a := (aᵢ)_{i≥1} for which ‖a‖²_{ℓ²} = Σ_{i=1}^∞ aᵢ² < ∞. Definition: Given a sequence of functions (φᵢ(x))_{i≥1} in ℓ², where φᵢ : X → ℝ is the i-th coordinate of φ(x), then k(x, x′) := Σ_{i=1}^∞ φᵢ(x)φᵢ(x′) is a kernel.

21 Infinite Sequences. Why square summable? By Cauchy-Schwarz, |Σ_{i=1}^∞ φᵢ(x)φᵢ(x′)| ≤ ‖φ(x)‖_{ℓ²}‖φ(x′)‖_{ℓ²}, so the series defining the inner product converges for all x, x′ ∈ X.

22 Taylor Series Kernels. Definition (Taylor series kernel): For r ∈ (0, ∞], with aₙ ≥ 0 for all n ≥ 0, let f(z) = Σ_{n=0}^∞ aₙzⁿ for |z| < r, z ∈ ℝ. Define X to be the √r-ball in ℝᵈ, so ‖x‖ < √r; then k(x, x′) = f(⟨x, x′⟩) = Σ_{n=0}^∞ aₙ⟨x, x′⟩ⁿ is a kernel. Example (Exponential kernel): k(x, x′) := exp(⟨x, x′⟩).

23 Gaussian Kernel (also known as the Radial Basis Function (RBF) kernel). Example (Gaussian kernel): The Gaussian kernel on ℝᵈ is defined as k(x, x′) := exp(−‖x − x′‖²/(2σ²)). Proof: an exercise! Use the product rule, the mapping rule, and the exponential kernel.

24 Gaussian Kernel (also known as the Radial Basis Function (RBF) kernel): k(x, x′) := exp(−‖x − x′‖²/(2σ²)). Proof: an exercise! Use the product rule, the mapping rule, and the exponential kernel. Variants: Squared Exponential (SE), with a single length scale, and Automatic Relevance Determination (ARD), with one length scale per input dimension.
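
A sketch of the SE kernel and its ARD variant in numpy; the parameterization (bandwidth sigma and per-dimension length scales ell) is one common convention, assumed here rather than taken from the slide.

```python
import numpy as np

def se_kernel(x, x_prime, sigma=1.0):
    """Isotropic squared exponential: exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma ** 2))

def se_ard_kernel(x, x_prime, ell):
    """ARD variant: one length scale ell[j] per input dimension."""
    return np.exp(-0.5 * np.sum(((x - x_prime) / ell) ** 2))

x, x_prime = np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 0.0])
print(se_kernel(x, x_prime), se_ard_kernel(x, x_prime, ell=np.array([1.0, 1.0, 10.0])))
```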

25 Products of Kernels. Base kernels: squared-exp (SE) σ_f² exp(−(x − x′)²/(2ℓ²)); periodic (Per) σ_f² exp(−(2/ℓ²) sin²(π(x − x′)/p)); linear (Lin) σ_f²(x − c)(x′ − c). (Figures: draws from these kernels and from products such as Lin × Lin, SE × Per, Lin × SE, and Lin × Per, each of which is again a kernel with combined structure.) Source: David Duvenaud (PhD thesis).

26 Positive Definiteness. Definition (Positive definite functions): A symmetric function k : X × X → ℝ is positive definite if for all n ≥ 1, all (a₁, …, aₙ) ∈ ℝⁿ and all (x₁, …, xₙ) ∈ Xⁿ, Σ_{i=1}^n Σ_{j=1}^n aᵢaⱼk(xᵢ, xⱼ) ≥ 0. The function k(·,·) is strictly positive definite if, for mutually distinct xᵢ, equality holds only when all the aᵢ are zero.
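
This definition can be sanity-checked numerically (not proved) by building a kernel matrix on sampled points and inspecting its eigenvalues; a sketch using an RBF kernel, with illustrative names and tolerances:

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """Numerically check that a symmetric kernel matrix is positive semi-definite."""
    eigvals = np.linalg.eigvalsh((K + K.T) / 2)   # symmetrize for numerical safety
    return bool(np.all(eigvals >= -tol))

X = np.random.randn(50, 3)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)                       # RBF kernel matrix
print(is_psd(K))                                  # expected: True (up to tolerance)
```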

27 Mercer's Theorem. Theorem: Let H be a Hilbert space, X a non-empty set and φ : X → H. Then ⟨φ(x), φ(y)⟩_H =: k(x, y) is positive definite. Proof: Σ_{i=1}^n Σ_{j=1}^n aᵢaⱼk(xᵢ, xⱼ) = Σ_{i=1}^n Σ_{j=1}^n ⟨aᵢφ(xᵢ), aⱼφ(xⱼ)⟩_H = ‖Σ_{i=1}^n aᵢφ(xᵢ)‖²_H ≥ 0. The reverse also holds: a positive definite k(x, x′) is an inner product in a unique H (Moore-Aronszajn: coming later!).

28 DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford)

29 Linear Dimensionality Reduction. Idea: project a high-dimensional vector onto a lower-dimensional space, z = U^⊤x (in the figure, x is high-dimensional and z ∈ ℝ¹⁰).

30 Problem Setup. Given n data points in d dimensions: x₁, …, xₙ ∈ ℝᵈ, collected as X = (x₁ ⋯ xₙ) ∈ ℝ^{d×n}. Note: this is the transpose of the X used in regression!

31 Problem Setup. Want to reduce the dimensionality from d to k: choose k directions u₁, …, u_k and collect them as U = (u₁ ⋯ u_k) ∈ ℝ^{d×k}.

32 Problem Setup. For each direction u_j, compute the similarity z_j = u_j^⊤x.

33 Problem Setup. Project x down to z = (z₁, …, z_k)^⊤ = U^⊤x. How do we choose U?

34 Principal Component Analysis: z = U^⊤x. Optimize two equivalent objectives: 1. minimize the reconstruction error; 2. maximize the projected variance.

35-38 PCA Objective 1: Reconstruction Error. U serves two functions. Encode: z = U^⊤x, with z_j = u_j^⊤x. Decode: x̃ = Uz = Σ_{j=1}^k z_ju_j. Want the reconstruction error ‖x − x̃‖ to be small. Objective: minimize the total squared reconstruction error, min_{U ∈ ℝ^{d×k}} Σ_{i=1}^n ‖xᵢ − UU^⊤xᵢ‖².

39-43 PCA Objective 2: Projected Variance. Empirical distribution: uniform over x₁, …, xₙ. Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(xᵢ). Variance (think sum of squares if centered): var̂[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(xᵢ)². Assume the data is centered: Ê[x] = 0 (what's Ê[U^⊤x]?). Objective: maximize the variance of the projected data, max_{U ∈ ℝ^{d×k}, U^⊤U = I} Ê[‖U^⊤x‖²].

44-46 Equivalence of the Two Objectives. Key intuition: variance of the data (fixed) = captured variance (want large) + reconstruction error (want small). Pythagorean decomposition: x = UU^⊤x + (I − UU^⊤)x, so ‖x‖² = ‖UU^⊤x‖² + ‖(I − UU^⊤)x‖². Take expectations; note that the rotation U doesn't affect length: Ê[‖x‖²] = Ê[‖U^⊤x‖²] + Ê[‖x − UU^⊤x‖²]. Hence minimizing the reconstruction error ⟺ maximizing the captured variance.

47-48 Changes of Basis. Data: X = (x₁ ⋯ xₙ) ∈ ℝ^{d×n}. Orthonormal basis: U = (u₁ ⋯ u_d) ∈ ℝ^{d×d}. Change of basis: z = (z₁, …, z_d)^⊤ = U^⊤x, with z_j = u_j^⊤x. Inverse change of basis: x = Uz.

49 Principal Component Analysis. Data X = (x₁ ⋯ xₙ) ∈ ℝ^{d×n}; orthonormal basis U ∈ ℝ^{d×d} given by the eigenvectors of the covariance, via the eigendecomposition C = UΛU^⊤. Claim: the eigenvectors of a symmetric matrix are orthogonal.

50 Principal Component Analysis (figure illustrating PCA projections on n points, from Stack Exchange)

51 Principal Component Analysis. Data X ∈ ℝ^{d×n}; orthonormal basis U from the eigenvectors of the covariance, C = UΛU^⊤. Idea: take the top-k eigenvectors to maximize the variance.

52 Principal Component Analysis. Data X ∈ ℝ^{d×n}; truncated basis U = (u₁ ⋯ u_k) ∈ ℝ^{d×k} formed from the top-k eigenvectors of the covariance (truncated decomposition).
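
A minimal numpy sketch of this recipe (center, form the covariance, keep the top-k eigenvectors); function and variable names are illustrative.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance. X is d x n (columns are points)."""
    X_c = X - X.mean(axis=1, keepdims=True)           # center the data
    C = X_c @ X_c.T / X.shape[1]                      # d x d covariance
    eigvals, eigvecs = np.linalg.eigh(C)              # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]             # indices of the top-k
    U = eigvecs[:, order]                             # d x k truncated basis
    Z = U.T @ X_c                                     # k x n projected data
    return U, Z, eigvals[order]
```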

53 Principal Component Analysis Top 2 components Bottom 2 components Data: three varieties of wheat: Kama, Rosa, Canadian Attributes: Area, Perimeter, Compactness, Length of Kernel, Width of Kernel, Asymmetry Coefficient, Length of Groove

54 PCA: Complexity. Data X ∈ ℝ^{d×n}, truncated basis U ∈ ℝ^{d×k}. Using the eigenvalue decomposition: computation of the covariance C is O(nd²); the eigenvalue decomposition is O(d³); total complexity O(nd² + d³).

55 PCA: Complexity. Data X ∈ ℝ^{d×n}, truncated basis U ∈ ℝ^{d×k}. Using the singular value decomposition: full decomposition O(min{nd², n²d}); rank-k decomposition O(kdn log n) (with the power method).

56 Singular Value Decomposition Idea: Decompose a d x d matrix M into 1. Change of basis V (unitary matrix) 2. A scaling Σ (diagonal matrix) 3. Change of basis U (unitary matrix)

57 Singular Value Decomposition. Idea: decompose the d × n matrix X into (1) an n × n basis V (unitary matrix), (2) a d × n matrix Σ (diagonal scaling/projection), and (3) a d × d basis U (unitary matrix): X = UΣV^⊤, with U^⊤U = I and V^⊤V = I.
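
Equivalently, the principal directions are the leading left singular vectors of the centered data matrix; a numpy sketch with illustrative names:

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the centered data matrix. X is d x n (columns are points)."""
    X_c = X - X.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
    U_k = U[:, :k]                           # d x k principal directions
    Z = U_k.T @ X_c                          # k x n projected data
    return U_k, Z, S[:k] ** 2 / X.shape[1]   # directions, codes, explained variances
```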

58-62 Eigen-faces [Turk & Pentland 1991]. d = number of pixels; each xᵢ ∈ ℝᵈ is a face image, with x_{ji} = intensity of the j-th pixel in image i. Approximate the d × n matrix X by the product of a d × k basis U and a k × n code matrix Z = (z₁ ⋯ zₙ). Idea: zᵢ is a more meaningful representation of the i-th face than xᵢ, and can be used for nearest-neighbor classification. Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k. Why no time savings for a linear classifier?

63 Aside: How many components? The magnitude of the eigenvalues indicates the fraction of variance captured. (Figure: eigenvalues λᵢ on a face image dataset.) Eigenvalues typically drop off sharply, so you don't need that many. Of course, variance isn't everything...
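
One common way to act on this observation is to keep the smallest number of components that captures a target fraction of the total variance; a sketch (the 95% threshold is an assumed example, not from the slide):

```python
import numpy as np

def choose_k(eigvals, target=0.95):
    """Smallest k whose top-k eigenvalues capture `target` fraction of the variance."""
    eigvals = np.sort(eigvals)[::-1]
    frac = np.cumsum(eigvals) / eigvals.sum()
    return int(min(np.searchsorted(frac, target) + 1, len(eigvals)))
```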

64-67 Latent Semantic Analysis [Deerwester 1990]. d = number of words in the vocabulary; each xᵢ ∈ ℝᵈ is a vector of word counts, with x_{ji} = frequency of word j in document i (e.g. stocks: 2, 0; chairman: 4, 1; the: …; wins: 0, 2; game: 1, 3). Approximate the d × n count matrix X by a d × k basis U times a k × n code matrix Z = (z₁ ⋯ zₙ). How to measure similarity between two documents? z₁^⊤z₂ is probably better than x₁^⊤x₂. Applications: information retrieval. Note: no computational savings here; the original x is already sparse.
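
A sketch of this similarity computation via a truncated SVD of the term-document matrix; the function name and the use of Z = U_k^⊤X are illustrative choices.

```python
import numpy as np

def lsa_similarity(X, k):
    """X is d x n (word counts per document). Returns the n x n matrix of Z^T Z similarities."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    Z = np.diag(S[:k]) @ Vt[:k, :]      # k x n document codes, equal to U_k^T X
    return Z.T @ Z
```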

68 PCA Summary. Intuition: capture the variance of the data, or minimize the reconstruction error. Algorithm: eigendecomposition of the covariance matrix, or SVD. Impact: reduce storage (from O(nd) to O(nk)) and reduce time complexity. Advantages: simple, fast. Applications: eigen-faces, eigen-documents, network anomaly detection, etc.

69-70 Probabilistic Interpretation. Generative model [Tipping and Bishop, 1999]: for each data point i = 1, …, n, draw the latent vector zᵢ ~ N(0, I_{k×k}) and create the data point xᵢ ~ N(Uzᵢ, σ²I_{d×d}). PCA finds the U that maximizes the likelihood of the data, max_U p(X | U). Advantages: handles missing data (important for collaborative filtering); extends to factor analysis, which allows non-isotropic noise (replace σ²I_{d×d} with an arbitrary diagonal matrix).
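
A sketch of sampling from this generative model; U, sigma, and n are assumed inputs, and the function name is illustrative.

```python
import numpy as np

def sample_ppca(U, sigma, n):
    """Draw n points from z ~ N(0, I_k), x ~ N(U z, sigma^2 I_d)."""
    d, k = U.shape
    Z = np.random.randn(k, n)                   # latent vectors (one per column)
    X = U @ Z + sigma * np.random.randn(d, n)   # observed data
    return X, Z
```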

71-72 Limitations of Linearity. (Figures: a dataset where PCA is effective and one where it is ineffective.) The problem is that the PCA subspace is linear: S = {x = Uz : z ∈ ℝᵏ}; in this example, S = {(x₁, x₂) : x₂ = (u₂/u₁)x₁}.

73-76 Nonlinear PCA. (Figures: broken vs. desired solution.) We want the desired solution S = {(x₁, x₂) : x₂ = (u₂/u₁)x₁²}. We can get this as S = {φ(x) = Uz} with φ(x) = (x₁², x₂)^⊤. Linear dimensionality reduction in φ(x) space is nonlinear dimensionality reduction in x space (more generally, φ may include terms such as sin(·)). Idea: use kernels.

77-79 Kernel PCA. Representer theorem: XX^⊤u = λu implies u = Xα = Σ_{i=1}^n αᵢxᵢ. Kernel function: k(x₁, x₂) such that K, the kernel matrix formed by K_{ij} = k(xᵢ, xⱼ), is positive semi-definite. Then max_{‖u‖=1} u^⊤XX^⊤u = max_{α: α^⊤X^⊤Xα = 1} α^⊤(X^⊤X)(X^⊤X)α = max_{α: α^⊤Kα = 1} α^⊤K²α.

80 Kernel PCA. Direct method: the kernel PCA objective max_{α^⊤Kα = 1} α^⊤K²α leads to the kernel PCA eigenvalue problem Kα = λ′α. Modular method (if you don't want to think about kernels): find vectors x′₁, …, x′ₙ such that x′ᵢ^⊤x′ⱼ = K_{ij} = φ(xᵢ)^⊤φ(xⱼ); the key is to use any vectors that preserve inner products. One possibility is the Cholesky decomposition K = X′^⊤X′ with X′ = (x′₁ ⋯ x′ₙ).
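
A sketch of the direct method in numpy, starting from a precomputed kernel matrix. Centering the kernel matrix is standard practice (not spelled out on the slide), and the normalization by √λ is one common convention; names are illustrative.

```python
import numpy as np

def kernel_pca(K, k):
    """Kernel PCA from an n x n kernel matrix K; returns n x k projected coordinates."""
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center in feature space
    eigvals, eigvecs = np.linalg.eigh(K_c)
    order = np.argsort(eigvals)[::-1][:k]
    alphas, lambdas = eigvecs[:, order], eigvals[order]   # dual coefficients
    # projection of point i onto component j: sum_m alphas[m, j] K_c[i, m] / sqrt(lambda_j)
    return K_c @ alphas / np.sqrt(np.maximum(lambdas, 1e-12))
```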

81 Kernel PCA

82 Canonical Correlation Analysis (CCA)

83-86 Motivation for CCA [Hotelling 1936]. Often, each data point consists of two views. Image retrieval: for each image we have x, the pixels (or other visual features), and y, the text around the image. Time series: x is the signal at time t, y is the signal at time t + 1. Two-view learning: divide the features into two sets, x, the features of a word/object, etc., and y, the features of the context in which it appears. Goal: reduce the dimensionality of the two views jointly.

87-88 CCA Example. Setup: input data (x₁, y₁), …, (xₙ, yₙ) (matrices X, Y). Goal: find a pair of projections (u, v). (Figure: independent vs. joint dimensionality reduction; x and y are paired by brightness.)

89 CCA Definition. Definitions: variance var̂(u^⊤x) = u^⊤XX^⊤u; covariance cov̂(u^⊤x, v^⊤y) = u^⊤XY^⊤v; correlation corr̂(u^⊤x, v^⊤y) = cov̂(u^⊤x, v^⊤y) / (√var̂(u^⊤x) √var̂(v^⊤y)). Objective: maximize the correlation between the projected views, max_{u,v} corr̂(u^⊤x, v^⊤y). Properties: focuses on how the variables are related, not on how much they vary; invariant to any rotation and scaling of the data.

90-92 From PCA to CCA. PCA on the views separately (no covariance term): max_{u,v} u^⊤XX^⊤u / (u^⊤u) + v^⊤YY^⊤v / (v^⊤v). PCA on the concatenation (X^⊤, Y^⊤)^⊤ (includes a covariance term): max_{u,v} (u^⊤XX^⊤u + 2u^⊤XY^⊤v + v^⊤YY^⊤v) / (u^⊤u + v^⊤v). Maximum covariance (drop the variance terms): max_{u,v} u^⊤XY^⊤v / (√(u^⊤u) √(v^⊤v)). Maximum correlation (CCA, divide out the variance terms): max_{u,v} u^⊤XY^⊤v / (√(u^⊤XX^⊤u) √(v^⊤YY^⊤v)).

93-95 Importance of Regularization. Extreme examples of degeneracy: if x = Ay, then any (u, v) with u = Av is optimal (correlation 1); if x and y are independent, then any (u, v) is optimal (correlation 0). Problem: if X or Y has rank n, then any v, with u chosen so that X^⊤u = Y^⊤v, is optimal (correlation 1), so CCA is meaningless! Solution: regularization (interpolate between maximum covariance and maximum correlation): max_{u,v} u^⊤XY^⊤v / (√(u^⊤(XX^⊤ + λI)u) √(v^⊤(YY^⊤ + λI)v)).
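
The regularized objective can be solved as a symmetric generalized eigenvalue problem; a sketch in which the block construction, the use of scipy.linalg.eigh for the generalized problem, and the reg parameter are implementation choices, and X, Y are assumed to be centered.

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, reg=1e-3):
    """Regularized CCA. X is dx x n, Y is dy x n, columns are paired (centered) samples.
    Returns the leading directions (u, v) and the canonical correlation."""
    dx, dy = X.shape[0], Y.shape[0]
    Cxx = X @ X.T + reg * np.eye(dx)
    Cyy = Y @ Y.T + reg * np.eye(dy)
    Cxy = X @ Y.T
    A = np.block([[np.zeros((dx, dx)), Cxy],
                  [Cxy.T, np.zeros((dy, dy))]])
    B = np.block([[Cxx, np.zeros((dx, dy))],
                  [np.zeros((dy, dx)), Cyy]])
    vals, vecs = eigh(A, B)             # symmetric generalized eigenproblem A w = rho B w
    w = vecs[:, -1]                     # eigenvector for the largest eigenvalue
    return w[:dx], w[dx:], vals[-1]
```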
