Lehel Csató. Faculty of Mathematics and Informatics, Babeş Bolyai University, Cluj-Napoca. March 2008


1 Faculty of Mathematics and Informatics, Babeş Bolyai University, Cluj-Napoca (Matematika és Informatika Tanszék, Babeş Bolyai Tudományegyetem, Kolozsvár). March 2008

2 Table of Contents 1 2 methods 3 Methods 4

3 Outline 1 2 methods 3 Methods 4

4 Machine learning. Historical background / motivation: huge amounts of data should be processed automatically; mathematics provides general solutions, i.e. solutions not tailored to a given problem; there is a need for a discipline that uses mathematical machinery to solve practical problems.

5 Definitions for Machine learning. A collection of methods (from statistics and probability theory) to solve problems met in practice: regression, i.e. noise filtering for non-linear regression and/or non-Gaussian noise; classification: binary, multiclass, partially labelled; clustering, inversion problems, density estimation, novelty detection. Generally, we need to model the data.

6 Real world: there is a function $y = f(x)$; the observations are pairs $(x_1,y_1), (x_2,y_2), \dots, (x_N,y_N)$. Observation process: a corrupted datum is collected for a sample $x_n$: $t_n = y_n + \epsilon$ (additive noise) or $t_n = h(y_n, \epsilon)$ with $h$ a distortion function. Problem: find the function $y = f(x)$.

7 Inference: the data set $(x_1,y_1),\dots,(x_N,y_N)$ is collected. Assume a function class $\mathcal{F}$: polynomial, Fourier expansion, wavelet. The observation process encodes the noise. Find the optimal function $f^*(x)$ from the class.

8 II We have the data set $D = \{(x_1, y_1),\dots,(x_N, y_N)\}$. Consider a function class:
(1) $\mathcal{F} = \{\, w^T x + b \mid w \in \mathbb{R}^d,\ b \in \mathbb{R} \,\}$
(2) $\mathcal{F} = \{\, a_0 + \sum_{k=1}^{K} a_k \sin(2\pi k x) + \sum_{k=1}^{K} b_k \cos(2\pi k x) \mid a, b \in \mathbb{R}^K,\ a_0 \in \mathbb{R} \,\}$
Assume an observation process: $y_n = f(x_n) + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$.

9 III
1. The data set: $D = \{(x_1, y_1),\dots,(x_N, y_N)\}$.
2. Assume a function class $\mathcal{F}$ (polynomial, etc.): $\mathcal{F} = \{\, f(x, \theta) \mid \theta \in \mathbb{R}^p \,\}$.
3. Assume an observation process and define a loss function $L(y_n, f(x_n, \theta))$. For Gaussian noise: $L(y_n, f(x_n, \theta)) = (y_n - f(x_n, \theta))^2$.

10 Outline 1 2 methods 3 Methods 4

11 Parameter estimation. Estimating parameters means finding the optimal value of $\theta$: $\theta^* = \arg\min_{\theta \in \Omega} L(D, \theta)$, where $\Omega$ is the domain of the parameters and $L(D, \theta)$ is a loss function for the data set. Example: $L(D, \theta) = \sum_{n=1}^{N} L(y_n, f(x_n, \theta))$.

12 $L(D,\theta)$ plays the role of the (negative log-)likelihood function. Maximum likelihood estimation of the model: $\theta^* = \arg\min_{\theta\in\Omega} L(D,\theta)$. Example, quadratic regression: $L(D,\theta) = \sum_{n=1}^{N}(y_n - f(x_n,\theta))^2$ (by factorisation of the likelihood). Drawback: it can produce a perfect fit to the data, i.e. over-fitting.

13–16 Example of an ML estimate (plots of height h in cm versus weight w in kg). We want to fit a model to the data: use a linear model, $h = \theta_0 + \theta_1 w$; use a log-linear model, $h = \theta_0 + \theta_1\log(w)$; use higher-order polynomials, e.g. $h = \theta_0 + \theta_1 w + \theta_2 w^2 + \theta_3 w^3$.

17 M.L. for linear models I. Assume a linear model for the x → y relation,
$f(x_n\mid\theta) = \sum_{l=1}^{d}\theta_l\, x_l$ with $x = [1, x, x^2, \log(x), \dots]^T$,
and a quadratic loss for $D = \{(x_1, y_1),\dots,(x_N, y_N)\}$:
$E_2(D\mid f) = \sum_{n=1}^{N}\big(y_n - f(x_n\mid\theta)\big)^2$

18 M.L. for linear models II. Minimisation:
$\sum_{n=1}^{N}\big(y_n - f(x_n\mid\theta)\big)^2 = (y - X\theta)^T(y - X\theta) = \theta^T X^T X\theta - 2\theta^T X^T y + y^T y$
Solution:
$0 = 2X^T X\theta - 2X^T y \quad\Rightarrow\quad \theta^* = (X^T X)^{-1} X^T y$
where $y = [y_1,\dots,y_N]^T$ and $X = [x_1,\dots,x_N]^T$ are the transformed data.
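To make the normal-equation solution above concrete, here is a minimal numpy sketch; the synthetic data and the choice of transformed inputs [1, x, log(x)] are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Illustrative data: y = 2 + 0.5*x with additive Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 10.0, size=50)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, size=50)

# Design matrix with the transformed inputs [1, x, log(x)].
X = np.column_stack([np.ones_like(x), x, np.log(x)])

# Maximum-likelihood (least-squares) solution: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print("ML estimate:", theta)
```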

19 M.L. for linear models III. Generalised linear models: use a set of functions $\Phi = [\phi_1(\cdot),\dots,\phi_M(\cdot)]$; project the inputs into the space spanned by $\mathrm{Im}(\Phi)$; have a parameter vector of length M, $\theta = [\theta_1,\dots,\theta_M]^T$. The model is $\{\, \sum_m \theta_m\phi_m(x) \mid \theta_m\in\mathbb{R} \,\}$. The optimal parameter vector is $\theta^* = (\Phi^T\Phi)^{-1}\Phi^T y$.
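A sketch of the generalised linear model fit with a basis-function matrix Φ; the small Fourier basis and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative data from a smooth nonlinear function.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, size=60))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=60)

def phi(x, K=3):
    """Feature map Phi = [phi_1(x), ..., phi_M(x)]: here a small Fourier basis."""
    cols = [np.ones_like(x)]
    for k in range(1, K + 1):
        cols += [np.sin(2 * np.pi * k * x), np.cos(2 * np.pi * k * x)]
    return np.column_stack(cols)

Phi = phi(x)
# Optimal parameters: theta = (Phi^T Phi)^{-1} Phi^T y.
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print("fitted coefficients:", np.round(theta, 3))
```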

20 Summary. There are many candidate model families: the degree of the polynomial specifies a model family; so does the rank of a Fourier expansion; a mixture of {log, sin, cos, ...} is also a family. Selecting the best family is a difficult modelling problem. In maximum likelihood there is no control over how good a family is when processing a given data set, and we need a smaller number of parameters than data points.

21–22 Maximum a posteriori I. The generalised linear model is powerful: it can be extremely complex; with no complexity control we face the overfitting problem. Aim: include knowledge in the inference process; our beliefs are reflected by the choice of the candidate functions. Goal: specify prior knowledge using probabilities; use probability theory for consistent estimation; encode the observation noise in the model.

23 Maximum a posteriori, data/noise. Probabilistic data description: how likely is it that θ generated the data?
$y = f(x)$: $\quad y - f(x) \sim \delta_0$
$y = f(x) + \epsilon$: $\quad y - f(x) \sim N_\epsilon$
Gaussian noise: $y - f(x) \sim N(0,\sigma^2)$, i.e.
$P(y\mid f(x)) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big[-\frac{(y-f(x))^2}{2\sigma^2}\Big]$

24 Maximum a posteriori, prior. William of Ockham's principle: "Entities should not be multiplied beyond necessity." Also known as (wiki...): the principle of simplicity, KISS, "When you hear hoofbeats, think horses, not zebras." Simple models have a small number of parameters ($L_0$ norm, $L_2$ norm). Probabilistic representation:
$p_0(\theta) \propto \exp\Big[-\frac{\|\theta\|_2^2}{2\sigma_0^2}\Big]$

25–27 M.A.P. Inference. Probabilities are assigned to D via the log-likelihood function:
$P(y_n\mid x_n,\theta,\mathcal{F}) \propto \exp\big[-L(y_n, f(x_n,\theta))\big]$
θ is given a prior probability: $p_0(\theta)\propto\exp\Big[-\frac{\|\theta\|^2}{2\sigma_0^2}\Big]$.
A posteriori probability:
$p(\theta\mid D,\mathcal{F}) = \frac{P(D\mid\theta)\,p_0(\theta)}{p(D\mid\mathcal{F})}$
where $p(D\mid\mathcal{F})$ is the probability of the data for a given family.

28 M.A.P. Inference II. M.A.P. estimation finds the θ with the largest posterior probability:
$\theta_{MAP} = \arg\max_{\theta\in\Omega} p(\theta\mid D,\mathcal{F})$
Example: with the loss $L(y_n, f(x_n,\theta))$ and a Gaussian prior:
$\theta_{MAP} = \arg\max_{\theta\in\Omega}\Big[K - \frac{1}{2}\sum_n L(y_n, f(x_n,\theta)) - \frac{\|\theta\|^2}{2\sigma_0^2}\Big]$
For $\sigma_0^2 = \infty$ this is the maximum likelihood estimate (after a change of sign, max becomes min).

29 M.A.P. Example I (plot: order-6 polynomial fits for different noise deviations, together with the true function and the training data).

30–37 M.A.P. Linear models I (plots of h(cm) versus w(kg)). Aim: test different levels of flexibility, with p = 10. Prior widths: $\sigma_0^2 = 10^6, 10^5, 10^4, 10^3, 10^2, 10^1, 10^0$.

38 M.A.P. Linear models II.
$\theta_{MAP} = \arg\max_{\theta\in\Omega}\Big[K - \frac{1}{2}\sum_n E_2(y_n, f(x_n,\theta)) - \frac{\|\theta\|^2}{2\sigma_0^2}\Big]$
Transform into vector notation:
$\theta_{MAP} = \arg\max_{\theta\in\Omega}\Big[K - \frac{1}{2}(y - X\theta)^T(y - X\theta) - \frac{\theta^T\theta}{2\sigma_0^2}\Big]$
Solve for θ by differentiation:
$X^T(y - X\theta) - \frac{1}{\sigma_0^2} I_d\,\theta = 0 \quad\Rightarrow\quad \theta_{MAP} = \Big(X^T X + \frac{1}{\sigma_0^2} I_d\Big)^{-1} X^T y$
Again M.L. for $\sigma_0^2 = \infty$.
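A small numpy sketch of the closed-form M.A.P. estimate above, assuming an illustrative degree-9 polynomial feature matrix (p = 10, as on the earlier slides); shrinking the prior width σ₀² regularises the fit, and σ₀² → ∞ recovers the M.L. solution.

```python
import numpy as np

# Illustrative over-parameterised polynomial fit (p = 10 features).
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=30)
y = x**3 - 0.5 * x + rng.normal(0.0, 0.05, size=30)
X = np.column_stack([x**k for k in range(10)])   # [1, x, ..., x^9]

def theta_map(X, y, sigma_02):
    """MAP estimate (X^T X + I/sigma_0^2)^{-1} X^T y, as on the slide."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(d) / sigma_02, X.T @ y)

# Smaller prior widths shrink the parameter vector more strongly.
for s02 in [1e6, 1e2, 1e0]:
    print(f"sigma_0^2 = {s02:>9}: ||theta|| =",
          round(np.linalg.norm(theta_map(X, y, s02)), 3))
```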

39 M.A.P. Summary. Maximum a posteriori models: allow for the inclusion of prior knowledge; may protect against overfitting; can measure the fitness of the family to the data, a procedure called M.L. type II.

40 M.A.P. application: M.L. II. Idea: instead of computing the most probable value of θ, we can measure the fit of the model $\mathcal{F}$ to the data D.
$P(D\mid\mathcal{F}) = \sum_{\theta_l\in\Omega} p(D, \theta_l\mid\mathcal{F}) = \sum_{\theta_l\in\Omega} p(D\mid\theta_l,\mathcal{F})\, p_0(\theta_l\mid\mathcal{F})$
For Gaussian noise and a polynomial of order K:
$\log p(D\mid\mathcal{F}) = \log\Big(\int_{\Omega_\theta} d\theta\; p(D\mid\theta,\mathcal{F})\, p_0(\theta\mid\mathcal{F})\Big) = \log N(y\mid 0, \Sigma_X)$
$\log p(D\mid\mathcal{F}) = -\frac{1}{2}\Big(N\log(2\pi) + \log|\Sigma_X| + y^T\Sigma_X^{-1}y\Big)$
where $\Sigma_X = I_N\sigma_n^2 + X\Sigma_0 X^T$ with $X = [x^0, x^1,\dots,x^K]$ and $\Sigma_0 = \mathrm{diag}(\sigma_0^2,\sigma_1^2,\dots,\sigma_K^2)$.
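The log evidence above can be evaluated directly. The following sketch assumes an isotropic prior Σ₀ = σ₀²I and synthetic cubic data; both choices are illustrative assumptions.

```python
import numpy as np

def log_evidence(x, y, K, sigma_n2=0.1**2, prior_var=1.0):
    """log p(D | polynomial family of order K) for the Bayesian linear model:
    Sigma_X = sigma_n^2 I + X Sigma_0 X^T, with X = [x^0, x^1, ..., x^K]."""
    X = np.column_stack([x**k for k in range(K + 1)])
    Sigma_0 = prior_var * np.eye(K + 1)          # diag(sigma_0^2, ..., sigma_K^2)
    Sigma_X = sigma_n2 * np.eye(len(y)) + X @ Sigma_0 @ X.T
    _, logdet = np.linalg.slogdet(Sigma_X)
    quad = y @ np.linalg.solve(Sigma_X, y)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + quad)

# Illustrative data from a cubic; the evidence should favour modest K.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=40)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0.0, 0.1, size=40)
for K in range(1, 9):
    print(f"K = {K}: log p(D|F) = {log_evidence(x, y, K):.1f}")
```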

41–48 M.A.P. M.L. II for polynomials (plots of $\log p(D\mid k)$ versus $\log(\sigma_0)$). Aim: test different models. Polynomial families of order $k = 10, 9, \dots, 1$.

49 Bayesian estimation, Intro. M.L. and M.A.P. estimates provide single solutions; point estimates lack an assessment of uncertainty. Better solution: for a query x, the system output is probabilistic, $x \to p(y\mid x,\mathcal{F})$. Tool: go beyond the M.A.P. solution and use the a posteriori distribution of the parameters.

50 Bayesian estimation II. We again use Bayes' rule:
$p(\theta\mid D,\mathcal{F}) = \frac{P(D\mid\theta)\,p_0(\theta)}{p(D\mid\mathcal{F})}$ with $p(D\mid\mathcal{F}) = \int_{\Omega} d\theta\; P(D\mid\theta)\,p_0(\theta)$
and exploit the whole posterior distribution of the parameters. A posteriori parameter estimates: we operate with $p_{post}(\theta) \stackrel{def}{=} p(\theta\mid D,\mathcal{F})$ and use the total probability rule
$p(y\mid D,\mathcal{F}) = \sum_{\theta_l\in\Omega} p(y\mid\theta_l,\mathcal{F})\, p_{post}(\theta_l)$
in assessing the system output.

51 Bayesian estimation, Example I. Given the data $D = \{(x_1,y_1),\dots,(x_N,y_N)\}$, estimate the linear fit:
$y = \theta_0 + \sum_{i=1}^{d}\theta_i x_i = [\theta_0,\theta_1,\dots,\theta_d]^T\,[1, x_1,\dots,x_d] \stackrel{def}{=} \theta^T x$
Gaussian distributions for the noise and the prior:
$\epsilon = y_n - \theta^T x_n \sim N(0,\sigma_n^2), \qquad \theta \sim N(0,\Sigma_0)$

52 Bayesian estimation, Example II. Goal: compute the posterior distribution $p_{post}(\theta)$.
$p_{post}(\theta) \propto p_0(\theta)\, p(D\mid\theta,\mathcal{F}) = p_0(\theta\mid\Sigma_0)\prod_{n=1}^{N} P(y_n\mid\theta^T x_n)$
$-2\log p_{post}(\theta) = K_{post} + \frac{1}{\sigma_n^2}(y - X\theta)^T(y - X\theta) + \theta^T\Sigma_0^{-1}\theta$
$= \theta^T\Big(\frac{1}{\sigma_n^2}X^TX + \Sigma_0^{-1}\Big)\theta - \frac{2}{\sigma_n^2}\theta^T X^T y + K_{post}$
$= (\theta-\mu_{post})^T\,\Sigma_{post}^{-1}\,(\theta-\mu_{post}) + K'_{post}$
and by identification
$\Sigma_{post} = \Big(\frac{1}{\sigma_n^2}X^TX + \Sigma_0^{-1}\Big)^{-1}$ and $\mu_{post} = \Sigma_{post}\,\frac{X^Ty}{\sigma_n^2}$

53 Bayesian estimation, Example III. Bayesian linear model: the posterior distribution of the parameters is a Gaussian with parameters
$\Sigma_{post} = \Big(\frac{1}{\sigma_n^2}X^TX + \Sigma_0^{-1}\Big)^{-1}$ and $\mu_{post} = \Sigma_{post}\,\frac{X^Ty}{\sigma_n^2}$
Point estimates are recovered by keeping less: M.L. if we take $\Sigma_0 \to \infty$ and consider only $\mu_{post}$; M.A.P. if we approximate the distribution with a single value at its maximum, i.e. $\mu_{post}$.
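A minimal sketch of the posterior computation for the Bayesian linear model above; the 1-D data, the noise variance σₙ² and the prior covariance Σ₀ are illustrative assumptions.

```python
import numpy as np

def posterior(X, y, sigma_n2, Sigma_0):
    """Posterior N(mu_post, Sigma_post) of the Bayesian linear model:
    Sigma_post = (X^T X / sigma_n^2 + Sigma_0^{-1})^{-1},
    mu_post    = Sigma_post X^T y / sigma_n^2."""
    A = X.T @ X / sigma_n2 + np.linalg.inv(Sigma_0)
    Sigma_post = np.linalg.inv(A)
    mu_post = Sigma_post @ X.T @ y / sigma_n2
    return mu_post, Sigma_post

# Illustrative 1-D linear fit with an intercept column.
rng = np.random.default_rng(4)
x = rng.uniform(0, 5, size=25)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=25)
X = np.column_stack([np.ones_like(x), x])
mu, Sigma = posterior(X, y, sigma_n2=0.25, Sigma_0=10.0 * np.eye(2))
print("posterior mean:", mu)
print("posterior covariance:\n", Sigma)
```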

54 Bayesian estimation, Example IV. Prediction for new values $x^*$: use the likelihood $P(y^*\mid x^*,\theta,\mathcal{F})$, the posterior for θ, and Bayes' rule. The steps:
$p(y^*\mid x^*, D, \mathcal{F}) = \int_{\Omega_\theta} d\theta\; p(y^*\mid x^*,\theta,\mathcal{F})\, p_{post}(\theta\mid D,\mathcal{F})$
$= \int_{\Omega_\theta} d\theta\; \exp\Big[-\frac{1}{2}\Big(K + \frac{(y^*-\theta^Tx^*)^2}{\sigma_n^2} + (\theta-\mu_{post})^T\Sigma_{post}^{-1}(\theta-\mu_{post})\Big)\Big]$
$= \int_{\Omega_\theta} d\theta\; \exp\Big[-\frac{1}{2}\Big(K + \frac{y^{*2}}{\sigma_n^2} - a^TC^{-1}a + Q(\theta)\Big)\Big]$
where
$a = \frac{x^*\,y^*}{\sigma_n^2} + \Sigma_{post}^{-1}\mu_{post}$ and $C = \frac{x^*x^{*T}}{\sigma_n^2} + \Sigma_{post}^{-1}$

55 Bayesian estimation, Example V. Integrating out the quadratic in θ gives the predictive distribution at $x^*$:
$p(y^*\mid x^*, D, \mathcal{F}) \propto \exp\Big[-\frac{1}{2}\Big(K + \frac{(y^* - x^{*T}\mu_{post})^2}{\sigma_n^2 + x^{*T}\Sigma_{post}\,x^*}\Big)\Big] = N\big(y^*\mid x^{*T}\mu_{post},\ \sigma_n^2 + x^{*T}\Sigma_{post}\,x^*\big)$
With the predictive distribution we: measure the variance of the prediction for each point, $\sigma_*^2 = \sigma_n^2 + x^{*T}\Sigma_{post}\,x^*$; sample from the parameters and plot the candidate predictors.
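A short sketch of the predictive mean and variance formulas above, on the same kind of assumed toy data as before; the query point x* is arbitrary.

```python
import numpy as np

def predictive(x_star, X, y, sigma_n2, Sigma_0):
    """Predictive mean x*^T mu_post and variance sigma_n^2 + x*^T Sigma_post x*."""
    Sigma_post = np.linalg.inv(X.T @ X / sigma_n2 + np.linalg.inv(Sigma_0))
    mu_post = Sigma_post @ X.T @ y / sigma_n2
    mean = x_star @ mu_post
    var = sigma_n2 + x_star @ Sigma_post @ x_star
    return mean, var

rng = np.random.default_rng(5)
x = rng.uniform(0, 5, size=25)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=25)
X = np.column_stack([np.ones_like(x), x])
m, v = predictive(np.array([1.0, 3.0]), X, y, 0.25, 10.0 * np.eye(2))
print(f"prediction at x* = 3: mean {m:.2f}, std {np.sqrt(v):.2f}")
```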

56 Bayesian example, error bars (plot: order-6 polynomial fit with noise variance σ²). The errors are the symmetric thin lines around the prediction.

57 Bayesian example Predictive samples Third order polynomials are used to approximate the data.

58 Bayesian estimation, Problems. When computing $p_{post}(\theta\mid D,\mathcal{F})$ we assumed that the posterior can be represented analytically. In general this is not the case. Approximations are needed for the posterior distribution and for the predictive distribution. In Bayesian modelling, an important issue is how we approximate the posterior distribution.

59–60 Bayesian estimation, Summary. Complete specification of the model: can include prior beliefs about the model. Accurate predictions. Computational cost: using the models for prediction can be difficult and expensive in time and memory. Bayesian models are flexible and accurate if priors about the model are used, but they can be expensive.

61 Outline 1 2 methods 3 Methods 4

62 Setting. Data can be unlabelled, i.e. no values y are associated with an input x. We want to extract information from $D = \{x_1,\dots,x_N\}$. We assume that the data, although high-dimensional, spans a manifold of a much smaller dimension. The task is to find the subspace corresponding to the data span.

63 Models in unsupervised learning. The model of the data is again important: PCA; ICA; mixture models (illustrated with three scatter plots in the $(x_1, x_2)$ plane).

64 The PCA model I. Simple data structure: a spherical cluster that is translated, scaled, and rotated (scatter plot in the $(x_1, x_2)$ plane). We aim to find the principal directions of the data spread. Principal direction: the direction u along which the data preserves most of its variance.

65 The PCA model II. Principal direction:
$u^* = \arg\max_{\|u\|=1}\ \frac{1}{2N}\sum_{n=1}^{N}\big(u^Tx_n - u^T\bar{x}\big)^2$
where we pre-process so that $\bar{x} = 0$. Replacing the empirical covariance with $\Sigma_x$:
$u^* = \arg\max_{\|u\|=1}\ \frac{1}{2N}\sum_{n=1}^{N}\big(u^Tx_n - u^T\bar{x}\big)^2 = \arg\max_{u,\lambda}\ \frac{1}{2}u^T\Sigma_x u - \lambda(\|u\|^2-1)$
with λ the Lagrange multiplier. Differentiating w.r.t. u: $\Sigma_x u - \lambda u = 0$.

66 The PCA model III. The optimum solution must obey $\Sigma_x u = \lambda u$, the eigendecomposition of the covariance matrix: $(\lambda^*, u^*)$ is an eigenvalue–eigenvector pair of the system. If we substitute back, the value of the expression is λ, so the optimal solution is obtained when $\lambda = \lambda_{max}$. Principal direction: the eigenvector $u_{max}$ corresponding to the largest eigenvalue of the system.
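A compact numpy illustration of the principal direction as the leading eigenvector of the empirical covariance; the rotated, scaled 2-D cluster is an assumed example.

```python
import numpy as np

# Illustrative 2-D data: a spherical cluster scaled and then rotated.
rng = np.random.default_rng(6)
angle = np.pi / 6
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5]) @ R.T

Xc = X - X.mean(axis=0)            # pre-process so that the mean is 0
Sigma_x = Xc.T @ Xc / len(Xc)      # empirical covariance
lam, U = np.linalg.eigh(Sigma_x)   # eigendecomposition (ascending eigenvalues)
u_max = U[:, -1]                   # eigenvector of the largest eigenvalue
print("principal direction:", u_max, " variance along it:", lam[-1])
```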

67 The PCA model, data mining I. How is this used in data mining? Assume that the data is: jointly Gaussian, $x \sim N(m_x, \Sigma_x)$; high-dimensional; and that only a few (e.g. 2) directions are relevant.

68 The PCA model, data mining II. How is this used in data mining? Subtract the mean. Compute the eigendecomposition. Select the K eigenvectors corresponding to the K largest eigenvalues. Compute the K projections $z_{nl} = x_n^T u_l$, i.e. with the matrix $P \stackrel{def}{=} [u_1,\dots,u_K]$ (columns are the selected eigenvectors): $Z = XP$, and $z_n$ is used as a compact representation of $x_n$.

69 The PCA model, data mining III. Reconstruction: $\hat{x}_n = \sum_{l=1}^{K} z_{nl}\,u_l$, or in matrix notation $\hat{X} = ZP^T$. PCA projection analysis:
$E_{PCA} = \frac{1}{N}\sum_{n=1}^{N}\|x_n - \hat{x}_n\|^2 = \frac{1}{N}\,\mathrm{tr}\big[(X-\hat{X})^T(X-\hat{X})\big]$
$= \mathrm{tr}\big[\Sigma_x - P\Sigma_z P^T\big]$
$= \mathrm{tr}\big[U\big(\mathrm{diag}(\lambda_1,\dots,\lambda_d) - \mathrm{diag}(\lambda_1,\dots,\lambda_K,0,\dots)\big)U^T\big]$
$= \mathrm{tr}\big[U^TU\,\mathrm{diag}(0,\dots,0,\lambda_{K+1},\dots,\lambda_d)\big]$
$= \sum_{l=1}^{d-K}\lambda_{K+l}$
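The reconstruction-error identity can be checked numerically. In this sketch the correlated 5-D data set is an assumption; the mean squared reconstruction error is compared with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))   # correlated 5-D data
Xc = X - X.mean(axis=0)

Sigma_x = Xc.T @ Xc / len(Xc)
lam, U = np.linalg.eigh(Sigma_x)
lam, U = lam[::-1], U[:, ::-1]     # sort eigenvalues in decreasing order

K = 2
P = U[:, :K]                       # columns u_1, ..., u_K
Z = Xc @ P                         # projections z_n
X_hat = Z @ P.T                    # reconstruction

E_pca = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print("reconstruction error:        ", E_pca)
print("sum of discarded eigenvalues:", lam[K:].sum())   # the two should match
```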

70 The PCA model, data mining IV. PCA reconstruction error: the error made using the PCA directions is $E_{PCA} = \sum_{l=1}^{d-K}\lambda_{K+l}$. PCA properties: the PCA system is orthonormal, $u_l^T u_r = \delta_{lr}$; reconstruction is fast; the spherical assumption is critical.

71 PCA application, USPS I. The USPS digits data set is a testbed for several models.

72 PCA application, USPS II. USPS characteristics: handwritten data, centred and scaled; the items are 16 × 16 grayscale images (256 pixels). We plot the normalised cumulative sum $k_r = \sum_{l=1}^{r}\lambda_l \,/\, \sum_{l=1}^{d}\lambda_l$. Conclusion for the USPS set: the normalised $\lambda_1 = 0.24$, i.e. $u_1$ accounts for 24% of the variance of the data; at r = 10 more than 70% of the variance is explained; at r = 50 more than 98%, i.e. 50 numbers instead of 256.

73 PCA application USPS III Visualisation application: Visualisation along the first two eigendirections.

74 PCA application USPS IV Visualisation application: Detail.

75 The ICA model I. Start from PCA: $x = Pz$ is a generative model for the data. We assumed that: z are i.i.d. Gaussian random variables, $z \sim N(0, \mathrm{diag}(\lambda_l))$, i.e. Gaussian sources; the components of x are not independent. In most real data the sources are not Gaussian, but the sources are independent. We exploit that!

76 The ICA model II. The model assumption is $x = As$, where s are independent sources and A is a linear mixing matrix. We look for a matrix B that recovers the sources: $\hat{s} \stackrel{def}{=} Bx = B(As) = (BA)s$, i.e. (BA) is a permutation and scaling, but retains independence.

77 The ICA model III. In practice: $\hat{s} \stackrel{def}{=} Bx$ with $\hat{s} = [s_1,\dots,s_K]$ all independent sources. Independence test: the KL divergence between the joint distribution and the product of the marginals,
$B^* = \arg\min_{B\in SO_d} KL\big(p(s_1,s_2)\,\|\,p(s_1)p(s_2)\big)$
where $SO_d$ is the group of matrices with $|B| = 1$. In ICA we therefore look for the matrix B that minimises
$\int_{\Omega} ds\; p(s)\log p(s_1,\dots,s_d) - \sum_l \int_{\Omega_l} ds_l\; p(s_l)\log p(s_l)$

78 The KL divergence, Detour. Kullback–Leibler divergence:
$KL(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$
It is zero if and only if p = q; it is not a measure of distance (but close to it!); it is efficient when exponential families are used. Short proof that $KL(p\|q) \ge 0$:
$0 = \log 1 = \log\sum_x q(x) = \log\sum_x p(x)\frac{q(x)}{p(x)} \ge \sum_x p(x)\log\frac{q(x)}{p(x)} = -KL(p\|q)$
hence $KL(p\|q) \ge 0$.
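A tiny numpy sketch of the discrete KL divergence and its basic properties (non-negative, asymmetric, zero iff p = q); the two distributions are arbitrary examples.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence sum_x p(x) log(p(x)/q(x)).
    Assumes strictly positive probabilities."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # non-negative, and not symmetric (so not a distance)
print(kl(p, p))             # zero iff p == q
```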

79 ICA Application, Data: separation of source signals (plot of the observed mixtures m1 to m4).

80 ICA Application, Results: results of the separation (plot of the recovered sources), obtained with the FastICA package.
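As an illustration of the kind of separation shown on the slide, the sketch below uses scikit-learn's FastICA (the slide credits the FastICA package); the square-wave and sawtooth sources and the mixing matrix A are assumptions, not the lecture's signals.

```python
import numpy as np
from sklearn.decomposition import FastICA   # the slide's separation used the FastICA package

# Two illustrative non-Gaussian, independent sources.
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sign(np.sin(3 * t)),     # square wave
                     np.mod(2 * t, 2) - 1])      # sawtooth
A = np.array([[1.0, 0.5],                        # assumed mixing matrix
              [0.4, 1.0]])
X = S @ A.T                                      # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)   # recovered sources, up to permutation and scaling
print("correlation of recovered vs. true sources:\n",
      np.corrcoef(S_hat.T, S.T)[:2, 2:])
```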

81 Applications of ICA. Cocktail party problem: separating noisy and multiple sources from multiple observations. Fetus ECG: separation of the ECG signal of a fetus from its mother's ECG. MEG recordings: separation of MEG sources. Financial data: finding hidden factors in financial data. Noise reduction: noise reduction in natural images. Interference removal: interference removal from CDMA (Code-Division Multiple Access) communication systems.

82 The mixture model, Introduction. The data structure is more complex: there is more than a single source for the data. The mixture model:
$P(x\mid\Theta) = \sum_{k=1}^{K}\pi_k\, p_k(x\mid\mu_k,\Sigma_k) \qquad (1)$
where $\pi_1,\dots,\pi_K$ are the mixing components and $p_k(x\mid\mu_k,\Sigma_k)$ is the density of a component. The components are usually called clusters.

83 The mixture model, Data generation. The generation process reflects the assumptions about the model. The data generation: first we select the component, then we sample from that component's density function. When modelling the data we do not know: which point belongs to which cluster; what the parameters of each density function are.

84 The mixture model, Example I. The Old Faithful geyser in Yellowstone National Park is characterised by intense eruptions and differing times between them. Rule: the duration is 1.5 to 5 minutes, and the length of the eruption helps determine the interval. If an eruption lasts less than 2 minutes the interval will be around 55 minutes; if the eruption lasts 4.5 minutes the interval may be around 88 minutes.

85 The mixture model, Example II (scatter plot: interval between eruptions versus duration). The longer the duration, the longer the interval. The linear relation $I = \theta_0 + \theta_1 d$ is not the best. There are only very few eruptions lasting 3 minutes.

86 The mixture model I. Assumptions: we know the family of the individual density functions; these density functions are parametrised with a few parameters; the densities are easily identifiable, i.e. if we knew which data point belongs to which cluster, the density function would be easily identifiable. Gaussian densities, which are often used, fulfil both conditions.

87 The mixture model II. The Gaussian mixture model: $p(x) = \pi_1 N_1(x\mid\mu_1,\Sigma_1) + \pi_2 N_2(x\mid\mu_2,\Sigma_2)$. For known densities (centres and ellipses):
$p(k\mid x_n) = \frac{N_k(x_n\mid\mu_k,\Sigma_k)\,p(k)}{\sum_l N_l(x_n\mid\mu_l,\Sigma_l)\,p(l)}$
i.e. we know the probability that a datum comes from cluster k (shades from red to green). For D this gives a table with rows $x_1,\dots,x_N$ and columns $p(1\mid x_n), p(2\mid x_n)$, i.e. the entries $\gamma_{n1},\gamma_{n2}$; $\gamma_{nl}$ is the responsibility of cluster l for $x_n$.

88–89 The mixture model III. When the $\gamma_{nl}$ are known, the parameters are computed using the data weighted by their responsibilities:
$(\mu_k,\Sigma_k) = \arg\max_{\mu,\Sigma}\ \prod_{n=1}^{N}\big(N_k(x_n\mid\mu,\Sigma)\big)^{\gamma_{nk}}$
for all k. This means: $(\mu_k,\Sigma_k) = \arg\max_{\mu,\Sigma}\ \sum_n\gamma_{nk}\log N(x_n\mid\mu,\Sigma)$.
When making inference we have to find both the responsibility vectors and the parameters of the mixture. Given data D: make an initial guess $(\mu_1,\Sigma_1),\dots,(\mu_K,\Sigma_K)$; re-estimate the responsibilities $\gamma_{n1},\dots,\gamma_{nK}$ for each $x_n$; re-estimate the parameters $(\mu_1,\Sigma_1),\dots,(\mu_K,\Sigma_K)$.

90 The mixture model, Summary. The responsibilities γ are the additional latent variables needed to help the computation. In the mixture model we want: to fit the model to the data; to know which submodel gets a particular datum. This is achieved by maximising the log-likelihood function.

91 The EM algorithm I.
$(\pi^*,\Theta^*) = \arg\max\ \sum_n\log\Big[\sum_l \pi_l N_l(x_n\mid\mu_l,\Sigma_l)\Big]$
$\Theta = [\mu_1,\Sigma_1,\dots,\mu_K,\Sigma_K]$ is the vector of parameters; $\pi = [\pi_1,\dots,\pi_K]$ are the shares of the factors. Problem with the optimisation: the parameters are not separable due to the sum within the logarithm. Solution: use an approximation.

92 The EM algorithm II.
$\log P(D\mid\pi,\Theta) = \sum_n\log\Big[\sum_l \pi_l N_l(x_n\mid\mu_l,\Sigma_l)\Big] = \sum_n\log\Big[\sum_l p(x_n, l)\Big]$
Use Jensen's inequality:
$\log\Big(\sum_l p(x_n,l\mid\theta_l)\Big) = \log\Big(\sum_l q_n(l)\,\frac{p(x_n,l\mid\theta_l)}{q_n(l)}\Big) \ge \sum_l q_n(l)\log\Big(\frac{p(x_n,l)}{q_n(l)}\Big)$
for any distribution $[q_n(1),\dots,q_n(L)]$.

93 Jensen's Inequality, Detour (figure: a concave f(z) and the chord between $z_1$ and $z_2$, with $\gamma_2 = 0.75$). For any concave f(z), any $z_1$ and $z_2$, and any $\gamma_1,\gamma_2 > 0$ such that $\gamma_1 + \gamma_2 = 1$:
$f(\gamma_1 z_1 + \gamma_2 z_2) \ge \gamma_1 f(z_1) + \gamma_2 f(z_2)$
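A quick numerical check of Jensen's inequality for the concave function f(z) = log z (the function used in the EM bound); the points and weights are arbitrary.

```python
import numpy as np

# Numerical check of Jensen's inequality for the concave f(z) = log(z).
rng = np.random.default_rng(8)
z1, z2 = rng.uniform(0.5, 5.0, size=2)
g1 = 0.25
g2 = 1.0 - g1                       # gamma_2 = 0.75, as in the slide's figure
lhs = np.log(g1 * z1 + g2 * z2)     # f(gamma_1 z_1 + gamma_2 z_2)
rhs = g1 * np.log(z1) + g2 * np.log(z2)
assert lhs >= rhs
print(f"log of the mixture {lhs:.4f} >= mixture of the logs {rhs:.4f}")
```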

94 The EM algorithm III.
$\log\Big(\sum_l p(x_n,l\mid\theta_l)\Big) \ge \sum_l q_n(l)\log\Big(\frac{p(x_n,l)}{q_n(l)}\Big)$
for any distribution $q_n(\cdot)$. Replacing with the right-hand side, we have:
$\log P(D\mid\pi,\Theta) \ge \sum_n\sum_l q_n(l)\log\frac{p(x_n,l\mid\theta_l)}{q_n(l)} = \mathcal{L}$
and therefore the optimisation w.r.t. the cluster parameters separates:
$0 = \sum_n q_n(l)\,\frac{\partial\log p_l(x_n\mid\theta_l)}{\partial\theta_l}$
For distributions from the exponential family this optimisation is easy.

95 The EM algorithm IV. Any set of distributions $q_1(\cdot),\dots,q_N(\cdot)$ provides a lower bound to the log-likelihood. We should choose the distributions so that they are the closest to the current parameter set; assume the parameters have the value $\theta^0$. We want to minimise the difference:
$\log P(x_n\mid\theta^0) - \mathcal{L}_n = \sum_l q_n(l)\log P(x_n\mid\theta^0) - \sum_l q_n(l)\log\frac{p(x_n,l\mid\theta^0)}{q_n(l)} = \sum_l q_n(l)\log\frac{P(x_n\mid\theta^0)\,q_n(l)}{p(x_n,l\mid\theta^0)}$
and observe that by setting $q_n(l) = \dfrac{p(x_n,l\mid\theta^0)}{P(x_n\mid\theta^0)}$ we have $\sum_l q_n(l)\log 1 = 0$.

96 The EM algorithm V. The EM algorithm: Init: initialise the model parameters. E step: compute the responsibilities $\gamma_{nl} = q_n(l)$. M step: for each l optimise $0 = \sum_n q_n(l)\,\partial\log p_l(x_n\mid\theta_l)/\partial\theta_l$. Repeat: go to the E step.
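A compact numpy sketch of the E and M steps described above for a Gaussian mixture; the two-cluster synthetic data (a rough stand-in for the Old Faithful example) and the initialisation are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at the rows of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture: the E step computes responsibilities,
    the M step re-estimates responsibility-weighted means, covariances and weights."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]            # initial guess for the centres
    Sigma = np.array([np.cov(X.T) for _ in range(K)])
    for _ in range(n_iter):
        # E step: gamma_nk proportional to pi_k N_k(x_n | mu_k, Sigma_k)
        dens = np.column_stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters with responsibility-weighted data
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma

# Illustrative two-cluster data (duration, interval), roughly Old-Faithful shaped.
rng = np.random.default_rng(9)
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 5.0], size=(100, 2)),
               rng.normal([4.5, 85.0], [0.4, 6.0], size=(150, 2))])
pi, mu, Sigma = em_gmm(X, K=2)
print("mixing weights:", np.round(pi, 2))
print("cluster means:\n", np.round(mu, 1))
```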

97–100 EM application, Old Faithful (plots of the interval between eruptions versus duration, showing successive iterations of the EM fit).

101 J. M. Bernardo and A. F. Smith. Bayesian Theory. John Wiley & Sons, 1994. C. M. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, New York, N.Y., 2006. T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2001.
