Gaussian mean-shift algorithms


1 Gaussian mean-shift algorithms Miguel Á. Carreira-Perpiñán Dept. of Computer Science & Electrical Engineering, OGI/OHSU

2 Outline: Gaussian mean-shift (GMS) as an EM algorithm; Acceleration strategies for GMS image segmentation; Gaussian blurring mean-shift (GBMS).

3 Gaussian mixtures, kernel density estimates. Given a dataset $X = \{x_n\}_{n=1}^N \subset \mathbb{R}^D$, define a Gaussian kernel density estimate with bandwidth $\sigma$: $p(x) = \frac{1}{N}\sum_{n=1}^N K\left(\left\|\frac{x-x_n}{\sigma}\right\|^2\right)$ with $K(t) \propto e^{-t/2}$. Other kernels: Epanechnikov, $K(t) \propto 1-t$ for $t \in [0,1)$ and $0$ otherwise (finite support); Student's $t$, $K(t) \propto \left(1+\frac{t}{\alpha}\right)^{-\frac{\alpha+D}{2}}$. A useful way to represent the density of a data set, extremely popular in machine learning and statistics. Objective here: find the modes of $p(x)$.
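As a quick illustration, here is a minimal NumPy sketch (not code from the talk) that evaluates such an isotropic Gaussian kde at a single point; the dataset X and the bandwidth are assumed to be supplied by the caller.

import numpy as np

def gaussian_kde(x, X, sigma):
    """Evaluate the isotropic Gaussian kernel density estimate p(x) at one point x.
    X holds one data point x_n per row; sigma is the bandwidth."""
    N, D = X.shape
    d2 = np.sum((x - X) ** 2, axis=1)                 # ||x - x_n||^2 for all n
    norm = (2 * np.pi * sigma**2) ** (D / 2)          # Gaussian normalisation constant
    return np.exp(-0.5 * d2 / sigma**2).sum() / (N * norm)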

4 Why find modes? Nonparametric clustering. [Figure: a density $p(x)$ with three clusters $C_1$, $C_2$, $C_3$.] Modes ↔ clusters: assign each point $x$ to a mode. Nonparametric: no model for the clusters' shape, no predetermined number of clusters.

5 Why find modes? Multivalued mappings. [Figure: joint density $p(x,y)$ and the conditional $p(x \mid y = y_0)$, which has several modes.] Given a density model for $(x, y)$, represent a multivalued mapping $y \to x$ by the modes of the conditional distribution $p(x \mid y)$: map $y_0$ to the modes of $p(x \mid y = y_0)$ (three values in the example). If $p(x, y)$ is a Gaussian mixture, then $p(x \mid y)$ is a Gaussian mixture too.

6 Mode-finding algorithms. Idea: start a hill-climbing algorithm from every centroid; this typically finds all modes in practice (Carreira-Perpiñán 00). Hill-climbing algorithms: gradient ascent, Newton's method, etc.; and the fixed-point iteration known as the mean-shift algorithm. It is based on ideas by Fukunaga & Hostetler 75 (also Cheng 95, Carreira-Perpiñán 00, Comaniciu & Meer 02, etc.).

7 The mean-shift algorithm. It is obtained by reorganising the stationary-point equation $\nabla p(x) = 0$ as a fixed-point iteration $x^{(\tau+1)} = f(x^{(\tau)})$. For example, for an isotropic kde $p(x) = \frac{1}{N}\sum_{n=1}^N K\left(\left\|\frac{x-x_n}{\sigma}\right\|^2\right)$ with Gaussian kernel (so $K'(t) = -\frac{1}{2}K(t)$):
$\nabla p(x) = \frac{1}{N}\sum_{n=1}^N K'\left(\left\|\frac{x-x_n}{\sigma}\right\|^2\right)\frac{2}{\sigma^2}(x-x_n) = -\frac{1}{N\sigma^2}\left[\sum_{n=1}^N K\left(\left\|\frac{x-x_n}{\sigma}\right\|^2\right) x - \sum_{n=1}^N K\left(\left\|\frac{x-x_n}{\sigma}\right\|^2\right) x_n\right] = 0$
$\Longrightarrow\quad x = \sum_{n=1}^N \frac{K\left(\left\|\frac{x-x_n}{\sigma}\right\|^2\right)}{\sum_{n'=1}^N K\left(\left\|\frac{x-x_{n'}}{\sigma}\right\|^2\right)}\, x_n.$

8 The mean-shift algorithm (cont.)
Gaussian, full covariance: $f(x) = \left(\sum_{n=1}^N p(n|x)\,\Sigma_n^{-1}\right)^{-1} \sum_{n=1}^N p(n|x)\,\Sigma_n^{-1} x_n$
Gaussian, isotropic: $f(x) = \sum_{n=1}^N p(n|x)\, x_n$ with $p(n|x) = \dfrac{e^{-\frac{1}{2}\|(x-x_n)/\sigma\|^2}}{\sum_{n'=1}^N e^{-\frac{1}{2}\|(x-x_{n'})/\sigma\|^2}}$
Other kernel, isotropic: $f(x) = \dfrac{\sum_{n=1}^N K(\|(x-x_n)/\sigma\|^2)\, x_n}{\sum_{n'=1}^N K(\|(x-x_{n'})/\sigma\|^2)}$
Iterating $f$ (i.e., $f \circ f \circ \cdots$) maps points in $\mathbb{R}^D$ to stationary points of $p$. It is a particularly simple algorithm, with no step size.
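A minimal NumPy sketch of the Gaussian, isotropic map $f$ (an illustration, not the talk's code); x is a query point and X holds one data point per row:

import numpy as np

def gms_f(x, X, sigma):
    """One Gaussian, isotropic mean-shift step: f(x) = sum_n p(n|x) x_n."""
    d2 = np.sum((x - X) ** 2, axis=1)         # ||x - x_n||^2
    w = np.exp(-0.5 * d2 / sigma**2)          # unnormalised posteriors p(n|x)
    return (w / w.sum()) @ X                  # weighted mean of the data points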

9 Gaussian mean-shift (GMS) as a clustering algorithm. $x_n$ and $x_m$ are in the same cluster if they converge to the same mode ⇒ nonparametric clustering, able to deal with complex cluster shapes; $\sigma$ determines the number of clusters. Popular in computer vision (segmentation, tracking) (Comaniciu & Meer). Segmentation example: one data point $x = (i, j, I)$ per pixel, where $(i, j)$ = location and $I$ = intensity or colour.

10 Pseudocode: GMS clustering
for n ∈ {1, ..., N}                                                          For each data point
    x ← x_n                                                                  Starting point
    repeat                                                                   Iteration loop
        ∀n: p(n|x) ← exp(−½‖(x−x_n)/σ‖²) / Σ_{n'} exp(−½‖(x−x_{n'})/σ‖²)     Posterior probabilities (E step)
        x ← Σ_{n=1}^N p(n|x) x_n                                             Update x (M step)
    until x's update < tol
    z_n ← x                                                                  Mode
end
connected-components({z_n}_{n=1}^N, ε)                                       Clusters
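A compact NumPy sketch of this procedure (my illustration, not the author's implementation); the final connected-components step is replaced by a crude mode-merging tolerance, which is an assumption:

import numpy as np

def gms_cluster(X, sigma, tol=1e-3, max_iter=500):
    """Run GMS from every data point and label points by the mode they reach."""
    merge_tol = sigma                          # crude mode-merging threshold (assumption)
    modes, labels = [], np.empty(len(X), dtype=int)
    for i, x0 in enumerate(X):
        x = x0.copy()
        for _ in range(max_iter):              # mean-shift iteration (E step + M step)
            w = np.exp(-0.5 * np.sum((x - X) ** 2, axis=1) / sigma**2)
            x_new = (w / w.sum()) @ X
            done = np.linalg.norm(x_new - x) < tol
            x = x_new
            if done:
                break
        for k, m in enumerate(modes):          # assign to a nearby known mode, else create one
            if np.linalg.norm(m - x) < merge_tol:
                labels[i] = k
                break
        else:
            modes.append(x)
            labels[i] = len(modes) - 1
    return labels, np.array(modes)

For example, labels, modes = gms_cluster(X, sigma=0.5) on an N×D array X returns one label per point and the modes found.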

11 GMS is an EM algorithm. Define the following artificial maximum-likelihood problem (Carreira-Perpiñán & Williams 03). Model: a Gaussian mixture that moves as a rigid body ($\{\pi_n, x_n, \Sigma_n\}_{n=1}^N$ fixed; the only parameter is $x$):
$q(y \mid x) = \sum_{n=1}^N \pi_n\, |2\pi\Sigma_n|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(y-(x_n-x))^T \Sigma_n^{-1}(y-(x_n-x))}.$
Data: just one point at the origin, $y = 0$. Then the resulting expectation-maximisation (EM) algorithm that maximises the log-likelihood wrt the parameter $x$ (the translation of the GM) is formally identical to the GMS step. E step: update the posterior probabilities $p(n \mid x)$. M step: update the parameter $x$.
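A worked version of the equivalence for the isotropic case ($\pi_n = 1/N$, $\Sigma_n = \sigma^2 I$); the intermediate steps here are spelled out beyond the slide and follow from the definitions above:
\[
q(0 \mid x) \;=\; \sum_{n=1}^N \frac{1}{N}\,\mathcal{N}\!\bigl(0;\, x_n - x,\ \sigma^2 I\bigr)
\;=\; \sum_{n=1}^N \frac{1}{N}\,\mathcal{N}\!\bigl(x;\, x_n,\ \sigma^2 I\bigr) \;=\; p(x),
\]
so maximising the likelihood of the single datum $y = 0$ over the translation $x$ maximises the kde $p(x)$. The E step gives the responsibilities
\[
p(n \mid x^{(\tau)}) = \frac{e^{-\frac{1}{2}\|(x^{(\tau)}-x_n)/\sigma\|^2}}{\sum_{n'=1}^N e^{-\frac{1}{2}\|(x^{(\tau)}-x_{n'})/\sigma\|^2}},
\]
and the M step maximises $Q(x \mid x^{(\tau)}) = \mathrm{const} - \frac{1}{2\sigma^2}\sum_{n=1}^N p(n \mid x^{(\tau)})\,\|x - x_n\|^2$, whose unique maximiser is
\[
x^{(\tau+1)} = \sum_{n=1}^N p(n \mid x^{(\tau)})\, x_n,
\]
i.e., exactly one GMS step.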

12 GMS is an EM algorithm (cont.) The M step maximises the following lower bound on $\log p$:
$\varrho(x) = \log p(x^{(\tau)}) - Q(x^{(\tau)} \mid x^{(\tau)}) + Q(x \mid x^{(\tau)})$, with $\varrho(x^{(\tau)}) = \log p(x^{(\tau)})$ and $\varrho(x) \le \log p(x)$ for all $x \in \mathbb{R}^D$,
where $Q(x \mid x^{(\tau)}) = \mathrm{constant} - \frac{1}{2\sigma^2}\sum_{n=1}^N p(n \mid x^{(\tau)})\,\|x - x_n\|^2$ is the expected complete-data log-likelihood from the E step.

13 Non-Gaussian mean-shift is a GEM algorithm. Consider a mixture of non-Gaussian kernels. An analogous derivation as a maximum-likelihood problem applies, but now the M step has no closed-form solution for $x$. However, the M step can be solved iteratively with a certain fixed-point iteration. Taking just one step of this fixed-point iteration provides an inexact M step and so a generalised EM (GEM) algorithm, which is formally identical to the mean-shift algorithm (Carreira-Perpiñán 06).

14 Convergence properties of mean-shift. From the properties of (G)EM algorithms (Dempster et al. 77) we expect, generally speaking: convergence to a mode from almost any starting point (it can also converge to saddle points and minima); monotonic increase of $p$ at every step (Jensen's inequality); convergence of linear order. However, particular mean-shift algorithms (for particular kernels) may show different properties. We analyse in detail GMS (isotropic case: $\Sigma_n = \sigma^2 I$).

15 Convergence properties of GMS. Taylor expansion of the fixed-point mapping $f$ around a mode $x^*$:
$x^{(\tau+1)} = f(x^{(\tau)}) = x^* + J(x^*)(x^{(\tau)} - x^*) + O(\|x^{(\tau)} - x^*\|^2)$, so $x^{(\tau+1)} - x^* \approx J(x^*)(x^{(\tau)} - x^*)$,
where the Jacobian of $f$ is $J(x) = \frac{1}{\sigma^2}\sum_{m=1}^M p(m \mid x)\,(\mu_m - f(x))(\mu_m - f(x))^T$ (for the kde, $M = N$ and $\mu_m = x_m$). At a mode, $J(x^*) = \frac{1}{\sigma^2} \times$ (local covariance at $x^*$). Thus the convergence is linear with rate
$r = \lim_{\tau\to\infty} \frac{\|x^{(\tau+1)} - x^*\|}{\|x^{(\tau)} - x^*\|} = \lambda_{\max}(J(x^*)) < 1.$
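A small numerical check of the linear rate (my illustration; the 1-D two-cluster dataset, bandwidth and starting point below are assumptions, not the talk's data):

import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.3, 100), rng.normal(2, 0.3, 100)])[:, None]
sigma = 0.5

def gms_step(x, X, sigma):
    """One isotropic GMS step and the posterior probabilities p(n|x)."""
    w = np.exp(-0.5 * np.sum((x - X) ** 2, axis=1) / sigma**2)
    p = w / w.sum()
    return p @ X, p

x, path = np.array([1.0]), [np.array([1.0])]
for _ in range(300):                            # run GMS long enough to get an accurate mode x*
    x, _ = gms_step(x, X, sigma)
    path.append(x)
x_star = path[-1]

emp = [np.linalg.norm(path[t + 1] - x_star) / np.linalg.norm(path[t] - x_star)
       for t in range(5, 10)]                   # empirical contraction in the linear regime

_, p_star = gms_step(x_star, X, sigma)
d = X - x_star                                  # at the mode, f(x*) = x*
J = (d * p_star[:, None]).T @ d / sigma**2      # J(x*) = (1/sigma^2) * local covariance
print("empirical:", np.round(emp, 3), " theory r =", round(float(np.linalg.eigvalsh(J).max()), 3))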

16 Convergence properties of GMS (cont.) Special cases depending on $\sigma$: $\sigma \to 0$ or $\sigma \to \infty$ (practically uninteresting): $r \to 0$, superlinear convergence. $\sigma$ at which modes merge: $r = 1$, sublinear (awfully slow). Intermediate $\sigma$ (practically useful): $r$ close to $1$ (slow). [Plots: convergence rate $r$ as a function of $x$ and of $\sigma$ for several bandwidth values.]

17 Convergence properties of GMS (cont.) The GMS iterates follow the local principal component at the mode and lie in the interior of the convex hull of the data points. Smooth path: the angle between consecutive steps lies in $(-\frac{\pi}{2}, \frac{\pi}{2})$ (Comaniciu & Meer 02). The GMS step size is not optimal along the GMS search direction.

18 Number of modes of a Gaussian mixture. In principle, one would expect that $N$ points should produce at most $N$ modes (additional modes being an artifact of the kernel). But: in 1D, a Gaussian mixture with $N$ components has at most $N$ modes (Carreira-Perpiñán & Williams 03), while a mixture of non-Gaussian kernels can have more than $N$ modes (examples: a sigmoid-based kernel, the Epanechnikov kernel).

19 Number of modes of a Gaussian mixture (cont.) In dimension $D \ge 2$, even a mixture of isotropic Gaussians can have more than $N$ modes (Carreira-Perpiñán & Williams 03). However, in practice the number of modes is much smaller than $N$ because components coalesce. This may be a reason why, in practice, GMS produces better segmentations than other kernels.

20 Convergence domains of GMS In general, the convergence domains (geometric locus of points that converge to each mode) are nonconvex and can be disconnected and take quite fancy shapes.

21 Convergence domains of GMS (cont.) The boundary between some domains can be fractal. These peculiar domains are probably undesirable for clustering, but the effect is small (seemingly confined to the cluster boundaries).

22 Gaussian mean-shift as an EM algorithm: summary
GMS is an EM algorithm; non-Gaussian mean-shift is a GEM algorithm.
GMS converges to a mode from almost any starting point.
Convergence is linear (occasionally superlinear or sublinear), slow in practice.
The iterates approach a mode along its local principal component.
Gaussian mixtures and kernel density estimates can have more modes than components (but this seems rare).
The convergence domains for GMS can be fractal.

23 Acceleration strategies for GMS image segmentation (outline: Gaussian mean-shift (GMS) as an EM algorithm; Acceleration strategies for GMS image segmentation; Gaussian blurring mean-shift (GBMS)).

24 Computational bottlenecks of GMS. GMS is slow: $O(kN^2D)$. Computational bottlenecks: B1: large average number of iterations $k$ (linear convergence). B2: large cost per iteration, $2ND$ multiplications (E step: $ND$ to obtain $p(n \mid x^{(\tau)})$; M step: $ND$ to obtain $x^{(\tau+1)}$). Acceleration techniques must address B1 and/or B2. We propose four strategies: 1: spatial discretisation; 2: spatial neighbourhood; 3: sparse EM; 4: EM-Newton.

25 Evaluation of strategies. Goal: to achieve the same segmentation as GMS. Visual evaluation of the segmentation is not enough; we compute the segmentation error wrt the GMS segmentation (= number of pixels misclustered, as a % of the whole image). We report running times in normalised iterations (= 1 iteration of GMS) to ensure independence from implementation details. No pre- or postprocessing of clusters (e.g. removal of small clusters). Why segmentation errors? The accelerated scheme may not converge to a mode of $p(x)$, and $x_n$ may converge to a different mode than with exact GMS.

26 Strategy 1: spatial discretisation. Many different pixels converging to the same mode travel almost identical paths. We discretise the spatial domain by subdividing every pixel $(i, j)$ into $n \times n$ cells; points projecting to the same cell share the same fate. This works because paths in $D$ dimensions are well approximated by their 2D projection on the spatial domain. We stop iterating once we hit an already visited cell ⇒ massively reduces the total number of iterations. We start first with a set of pixels uniformly distributed over the image (this finds all modes quickly). Converges to a mode; segmentation error ↓ by increasing $n$. Addresses B1. Parameter: $n = 1, 2, 3, \dots$ (discretisation level). A small code sketch follows.
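A hedged NumPy sketch of the cell-caching idea (my reconstruction, not the paper's implementation); it assumes X holds one row per pixel whose first two coordinates are the spatial position (i, j), and it replaces the mode bookkeeping by a crude merging tolerance:

import numpy as np

def gms_step(x, X, sigma):
    w = np.exp(-0.5 * np.sum((x - X) ** 2, axis=1) / sigma**2)
    return (w / w.sum()) @ X

def gms_discretised(X, sigma, n=2, tol=1e-3, max_iter=500):
    """GMS clustering with spatial discretisation: cells of side 1/n pixel share their fate."""
    cell_label = {}                                  # visited spatial cell -> cluster label
    modes, labels = [], np.full(len(X), -1)
    for i, x0 in enumerate(X):
        x, path_cells, label = x0.copy(), [], None
        for _ in range(max_iter):
            cell = tuple(np.floor(x[:2] * n).astype(int))
            if cell in cell_label:                   # hit an already visited cell: adopt its fate
                label = cell_label[cell]
                break
            path_cells.append(cell)
            x_new = gms_step(x, X, sigma)
            converged = np.linalg.norm(x_new - x) < tol
            x = x_new
            if converged:
                break
        if label is None:                            # converged to a mode: match or create it
            for k, m in enumerate(modes):
                if np.linalg.norm(m - x) < sigma:    # merging tolerance (assumption)
                    label = k
                    break
            if label is None:
                modes.append(x)
                label = len(modes) - 1
        for c in path_cells:                         # every cell on this path shares the fate
            cell_label[c] = label
        labels[i] = label
    return labels, modes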

27 Strategy 2: spatial neighbourhood. Approximates the E and M steps with a subset of the data points (rather than all $N$) consisting of a neighbourhood $\mathcal{N}(x^{(\tau)})$ in the spatial domain (not the range domain). Finding spatial neighbours is free (unlike finding neighbours in the full space).
$p(n \mid x^{(\tau)}) = \dfrac{\exp\left(-\frac{1}{2}\|(x^{(\tau)} - x_n)/\sigma\|^2\right)}{\sum_{n' \in \mathcal{N}(x^{(\tau)})} \exp\left(-\frac{1}{2}\|(x^{(\tau)} - x_{n'})/\sigma\|^2\right)}, \qquad x^{(\tau+1)} = \sum_{n \in \mathcal{N}(x^{(\tau)})} p(n \mid x^{(\tau)})\, x_n.$
Does not converge to a mode; segmentation error ↓ by increasing $e$. Addresses B2. Parameter: $e \in (0, 1]$ (fraction of the data set used as neighbours).

28 Strategy 3: sparse EM. Sparse EM (Neal & Hinton 98): coordinate ascent on the space of $(x, \tilde p)$, where $\tilde p$ are posterior probabilities; this maximises the free energy $F(\tilde p, x) = \log p(x) - D\left(\tilde p \,\|\, p(n \mid x)\right)$ and also $p(x)$. Run partial E steps frequently: update $p(n \mid x)$ only for $n \in S$, where $S$ = plausible set (nearest neighbours), kept constant over the partial E steps (FAST). Run full E steps infrequently: update all $p(n \mid x)$ and $S$ (SLOW). We choose $S$ containing as many neighbours as necessary to account for a total probability $\sum_{n \in S} p(n \mid x) \ge 1 - \epsilon$. Thus the fraction of data used, $e$, varies after each full step; using a fixed $e$ produced worse results. Converges to a mode no matter how $S$ is chosen; computational savings if few full steps are needed. Segmentation error ↓ by decreasing $\epsilon$. Addresses B2. Parameter: $\epsilon \in [0, 1)$ (probability mass not in $S$).

29 Strategy 4: EM-Newton. Start with EM steps, which quickly increase $p$. Switch to Newton steps when EM slows down (reverting to EM if the Newton step is bad). Specifically, try a Newton step when $\|x^{(\tau)} - x^{(\tau-1)}\| < \theta$. Computing the Newton step $x_N$ yields the EM step $x_{EM}$ for free:
Gradient: $g(x) = \frac{p(x)}{\sigma^2}\sum_{n=1}^N p(n \mid x)(x_n - x) = \frac{p(x)}{\sigma^2}(x_{EM} - x)$
Hessian: $H(x) = \frac{p(x)}{\sigma^2}\left(-I + \frac{1}{\sigma^2}\sum_{n=1}^N p(n \mid x)(x_n - x)(x_n - x)^T\right)$
Newton step: $x_N = x - H^{-1}(x)\,g(x)$.
Converges to a mode with quadratic order; segmentation error ↓ by decreasing $\theta$. Addresses B1. Parameter: $\theta > 0$ (minimum EM step length), relative to $\sigma$.
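A NumPy sketch of these quantities and of one EM-Newton step (my illustration; the acceptance test for a "bad" Newton step, a decrease of the density, is an assumption):

import numpy as np

def kde_quantities(x, X, sigma):
    """Density p(x), gradient g(x), Hessian H(x) and EM step x_EM of an isotropic Gaussian kde."""
    N, D = X.shape
    diff = X - x                                              # rows x_n - x
    w = np.exp(-0.5 * np.sum(diff**2, axis=1) / sigma**2)
    p_x = w.sum() / (N * (2 * np.pi * sigma**2) ** (D / 2))
    p_n = w / w.sum()                                         # posteriors p(n|x)
    x_em = p_n @ X
    g = p_x / sigma**2 * (x_em - x)
    S = (diff * p_n[:, None]).T @ diff                        # sum_n p(n|x)(x_n-x)(x_n-x)^T
    H = p_x / sigma**2 * (S / sigma**2 - np.eye(D))
    return p_x, g, H, x_em

def em_newton_step(x, X, sigma, theta=0.1):
    """One EM-Newton step: EM while steps are long, Newton once they fall below theta."""
    p_x, g, H, x_em = kde_quantities(x, X, sigma)
    if np.linalg.norm(x_em - x) >= theta:
        return x_em                                           # plain EM (mean-shift) step
    x_newton = x - np.linalg.solve(H, g)
    if kde_quantities(x_newton, X, sigma)[0] > p_x:           # accept only if p increases
        return x_newton
    return x_em                                               # bad Newton step: revert to EM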

30 Computational cost per iteration relative to exact GMS:
Strategy 1 (spatial discretisation): 1
Strategy 2 (spatial neighbourhood): $e$
Strategy 3 (sparse EM): 2 if full step, $e$ if partial step
Strategy 4 (EM-Newton): 1 if EM step, $1 + \frac{D+1}{4}$ if Newton step, $\frac{D+1}{4}$ if EM step after a failed Newton step
Here $e \in (0, 1]$ is the fraction of the data set used (neighbours for strategy 2, plausible set for strategy 3). Using iterations normalised wrt GMS allows direct comparison of the numbers of iterations in the experiments.

31 Experimental results with image segmentation. Dataset: $x_n = (i_n, j_n, I_n)$ (greyscale) or $x_n = (i_n, j_n, L^*_n, u^*_n, v^*_n)$ (colour), where $(i, j)$ is the pixel's spatial position. We prescale $I$ or $(L^*, u^*, v^*)$ so that the same $\sigma$ applies to all dimensions (in pixels). Best segmentations are obtained for large bandwidths, $\sigma$ of the order of (image side)/5. We study all strategies with different images over a range of $\sigma$; convergence tolerance $10^{-3}$ pixels. [Plots: number of clusters and total number of iterations as a function of $\sigma$.]

32 Experimental results with image segmentation (cont.) [Figure: segmentations, segmentation errors (all below about 1%) and iteration counts/histograms for GMS and each accelerated method under its optimal parameter value at a fixed $\sigma$.]

33 Experimental results with image segmentation (cont.) Optimal parameter value for each method (= the one that minimises the iterations ratio subject to achieving an error < 3%) as a function of $\sigma$. [Plots: optimal $n$, $e$, $\epsilon$, $\theta$ vs. $\sigma$.]

34 Experimental results with image segmentation (cont.) Clustering error (percent) and number of iterations for each method under its optimal parameter value, as a function of $\sigma$. [Plots: error and iterations vs. $\sigma$ for GMS and the four strategies.]

35 Experimental results with image segmentation (cont.) Clustering error (percent) and computational cost for each method as a function of its parameter. [Plots: error and iteration ratio wrt GMS vs. each method's parameter ($n$, $e$, $\epsilon$, $\theta$) for several values of $\sigma$.]

36 Experimental results with image segmentation (cont.) Plot of all iterates for all starting pixels: most cells in strategy 1 are empty (the proportion of cells visited is a small fraction of the $n^2 N$ cells). [Plots: iterates over the spatial domain; proportion of visited cells and average number of iterations per pixel for strategy 1 as a function of $N$, for several discretisation levels $n$.]

37 Acceleration strategies: summary. With parameters set optimally, all methods can obtain nearly the same segmentation as GMS with significant speedups. Near-optimal parameter values can easily be set in advance for most strategies, and somewhat more heuristically for the spatial neighbourhood. Strategy 1 (spatial discretisation): best strategy, largest speedup (an average number of iterations per pixel of only k = 2-4), with a very small error controlled by the discretisation level. Strategy 4 (EM-Newton): second best, with roughly 1.5-6x speedup. Strategies 2 and 3 (neighbourhood methods): up to about 3x speedup; sparse EM attains the lowest (near-zero) error, while the spatial neighbourhood can result in unacceptably large errors for a suboptimal setting of its parameter (the neighbourhood size). The neighbourhood methods are less effective because GMS needs very large neighbourhoods (comparable to the data set size for the larger bandwidths).

38 Acceleration strategies: summary (cont.) Other methods: approximate algorithms for neighbour search (e.g. kd-trees): less effective because GMS needs large neighbourhoods. Other optimisation algorithms (e.g. quasi-Newton): need to ensure the first steps are EM. Fast Gauss transform: can be combined with our methods. Extensions: adaptive and non-isotropic bandwidths: all 4 methods. Non-image data: the sparse-EM and EM-Newton strategies are readily applicable; is it possible to extend the spatial strategies to higher dimensions?

39 Gaussian blurring mean-shift (GBMS) (outline: Gaussian mean-shift (GMS) as an EM algorithm; Acceleration strategies for GMS image segmentation; Gaussian blurring mean-shift (GBMS)).

40 Gaussian blurring mean-shift (GBMS). This is the algorithm that Fukunaga & Hostetler really proposed; it has gone largely unnoticed. Same iterative scheme, but now the data points move at each iteration. Result: a sequence of progressively shrunk datasets $X^{(0)}, X^{(1)}, \dots$ converging to a single, all-points-coincident cluster. [Figure: dataset snapshots at successive iterations $\tau = 0, 1, 2, \dots, 11$.]

41 Pseudocode: GBMS clustering
repeat                                                                           Iteration loop
    for m ∈ {1, ..., N}                                                          For each data point
        ∀n: p(n|x_m) ← exp(−½‖(x_m−x_n)/σ‖²) / Σ_{n'} exp(−½‖(x_m−x_{n'})/σ‖²)
        y_m ← Σ_{n=1}^N p(n|x_m) x_n                                             One GMS step
    end
    ∀m: x_m ← y_m                                                                Update whole dataset
until stop                                                                       Stopping criterion
connected-components({x_n}_{n=1}^N, ε)                                           Clusters
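A minimal NumPy sketch of the GBMS loop (my illustration, not the talk's code); it uses a naive stop (small mean update), and the proper entropy-based stopping criterion appears a couple of slides below:

import numpy as np

def gbms_step(X, sigma):
    """One GBMS iteration: every point takes one GMS step over the current dataset."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # pairwise ||x_m - x_n||^2
    P = np.exp(-0.5 * sq / sigma**2)
    P /= P.sum(axis=1, keepdims=True)                           # P[m, n] = p(n | x_m)
    return P @ X                                                # y_m = sum_n p(n|x_m) x_n

def gbms(X, sigma, max_iter=50, tol=1e-4):
    """Blur the dataset until the points (nearly) stop moving; then read off clusters
    with a connected-components pass (not included here)."""
    X = X.copy()
    for _ in range(max_iter):
        X_new = gbms_step(X, sigma)
        if np.mean(np.linalg.norm(X_new - X, axis=1)) < tol:
            return X_new
        X = X_new
    return X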

42 Stopping criterion for GBMS. GBMS converges to an all-points-coincident cluster (Cheng 95); proof: the diameter of the data set decreases at least geometrically. However, GBMS shows two phases. Phase 1: points merge into clusters of coincident points (a few iterations); we want to stop here. Phase 2: the clusters keep approaching and merging (a few to hundreds of iterations); this slowly erases the clustering structure. We need a stopping criterion (as opposed to a convergence criterion) to stop just after phase 1. Simply checking $\|X^{(\tau)} - X^{(\tau-1)}\| < \text{tol}$ does not work because the points are always moving. Instead, consider the histogram of updates $\{e_n^{(\tau)}\}_{n=1}^N$ with $e_n^{(\tau)} = \|x_n^{(\tau)} - x_n^{(\tau-1)}\|$. Though the histograms change as the points move, in phase 2 the entropy $H$ does not change (the histogram bins change their position but not their values).

43 Stopping criterion for GBMS (cont.) [Figure: histograms of the updates $e_n^{(\tau)}$ at successive iterations $\tau = 0, \dots, 20$.]

44 Stopping criterion for GBMS (cont.) Stopping criterion: stop as soon as $\bigl(\,|H(e^{(\tau+1)}) - H(e^{(\tau)})| < 10^{-8}\,\bigr)$ (the entropy of the update histogram stops changing) OR $\bigl(\frac{1}{N}\sum_{n=1}^N e_n^{(\tau+1)} < \text{tol}\bigr)$ (the average update becomes small enough). [Plots: average update and update entropy as a function of $\tau$.]
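A small sketch of this test (my illustration; the histogram binning and the thresholds are assumptions consistent with the slide):

import numpy as np

def update_entropy(e, bins=50):
    """Entropy of the histogram of the per-point updates e_n."""
    h, _ = np.histogram(e, bins=bins)
    p = h[h > 0] / h.sum()
    return -np.sum(p * np.log(p))

def gbms_should_stop(e_new, H_prev, tol=1e-4, dH=1e-8):
    """Stop when the update-histogram entropy stops changing or the updates are tiny."""
    H_new = update_entropy(e_new)
    return (abs(H_new - H_prev) < dH) or (e_new.mean() < tol), H_new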

45 Convergence rate of GBMS. GBMS shrinks a Gaussian cluster towards its mean with cubic convergence rate. Proof: work with a continuous pdf. The kde $p(x) = \frac{1}{N}\sum_{n=1}^N K\left(\left\|\frac{x-x_n}{\sigma}\right\|^2\right)$ becomes $p(x) = \int_{\mathbb{R}^D} q(y)\,K(x - y)\,dy$. New (blurred) dataset: $\tilde x = \int_{\mathbb{R}^D} p(y \mid x)\, y\, dy = E\{y \mid x\}$, $x \in \mathbb{R}^D$, with $p(y \mid x) = \frac{K(x - y)\,q(y)}{p(x)}$. Considering $D = 1$ w.l.o.g. (the dimensions decouple): if $q(x) = \mathcal{N}(x; 0, s^2)$ then $p(x) = \mathcal{N}(x; 0, s^2 + \sigma^2)$ and $\tilde q(\tilde x) = \mathcal{N}(\tilde x; 0, (rs)^2)$ with $r = \frac{1}{1 + (\sigma/s)^2} \in (0, 1)$.

46 Convergence rate of GBMS (cont.) Recurrence relation for the standard deviation $s^{(\tau)}$:
$s^{(\tau+1)} = \dfrac{s^{(\tau)}}{1 + (\sigma/s^{(\tau)})^2}$ for $s^{(0)} > 0$,
which converges to $0$ with cubic order, i.e., $\lim_{\tau\to\infty} \frac{s^{(\tau+1)}}{(s^{(\tau)})^3} < \infty$. Effectively, $\sigma$ increases at every step (relative to the shrinking cluster). [Figure: a Gaussian cluster over iterations $\tau = 0, \dots, 5$.] The cluster shrinks without changing orientation or mean; the principal direction shrinks more slowly relative to the other directions.
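A tiny numeric illustration of the cubic shrinkage (the values sigma = 1 and s(0) = 1 are assumptions):

# Iterate the recurrence s <- s / (1 + (sigma/s)^2) and watch s_next / s^3 approach a constant.
sigma, s = 1.0, 1.0
for tau in range(6):
    s_next = s / (1.0 + (sigma / s) ** 2)
    print(f"tau={tau}  s={s:.3e}  s_next/s^3={s_next / s**3:.3f}")
    s = s_next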

47 Convergence rate of GBMS (cont.) In summary, a Gaussian distribution remains Gaussian with the same mean, but each principal axis decreases cubically. This explains the practical behaviour shown by GBMS: 1. clusters collapse extremely fast (clustering); 2. after a few iterations only the local principal component survives ⇒ temporary linearly-shaped clusters (denoising). [Figure: dataset at iterations $\tau = 1, 2, 3$.]

48 Convergence rate of GBMS (cont.) Number of GBMS iterations $\tau$ necessary to bring $s^{(\tau)}$ below a small threshold, as a function of the bandwidth $\sigma$, for $s^{(0)} = 1$. [Plot: $\tau$ vs. $\sigma$.] Note that GMS converges with linear order ($p = 1$), thus requiring many iterations to converge to a mode.

49 Connection with spectral clustering
GBMS pseudocode (inner loop):
for m ∈ {1, ..., N}                                                          For each data point
    ∀n: p(n|x_m) ← exp(−½‖(x_m−x_n)/σ‖²) / Σ_{n'} exp(−½‖(x_m−x_{n'})/σ‖²)
    y_m ← Σ_{n=1}^N p(n|x_m) x_n                                             One GMS step
end
∀m: x_m ← y_m                                                                Update whole dataset
Same pseudocode in matrix form:
W = (exp(−½‖(x_m−x_n)/σ‖²))_{nm}                                             Gaussian affinity matrix
D = diag(Σ_{n=1}^N w_{nm})                                                   Degree (normalising) matrix
X ← X W D⁻¹                                                                  Update whole dataset
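A NumPy sketch of the matrix form, run on a small hypothetical two-cluster dataset to also peek at the leading eigenvalues of P (an illustration of the spectral connection discussed next, not the talk's code):

import numpy as np

def random_walk_matrix(X, sigma):
    """P = W D^{-1}: Gaussian affinities with columns normalised to sum to 1 (X is D x N)."""
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # pairwise squared distances
    W = np.exp(-0.5 * sq / sigma**2)
    return W / W.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0, 0.2, (2, 30)), rng.normal(3, 0.2, (2, 30))])  # two clusters (assumed data)
sigma = 0.5
for tau in range(5):
    P = random_walk_matrix(X, sigma)
    ev = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]            # leading eigenvalues of P
    print(f"tau={tau}  leading eigenvalues: {np.round(ev[:4], 3)}")
    X = X @ P                                                   # one GBMS step in matrix form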

50 Connection with spectral clustering (cont.) GBMS can be written as repeated products $X \leftarrow XP$ with the random-walk matrix $P = WD^{-1}$, closely related to the normalised graph Laplacian ($W$: Gaussian affinities, $D$: degree matrix, $P$: posterior probabilities $p(n \mid x_m)$). In phase 1, $P$ changes quickly as the points cluster. In phase 2, $P$ is almost constant (and perfectly blocky), so GBMS implicitly extracts its leading eigenvectors (power method), like spectral clustering. Thus GBMS is much faster than computing eigenvectors (about 5 matrix-vector products are enough). Actually, since $P$ is a positive matrix, it has a single leading eigenvector (Perron-Frobenius theorem) with constant components, so eventually all points collapse.

51 Connection with spectral clustering (cont.) [Figure: the leading 7 eigenvectors $u_1, \dots, u_7 \in \mathbb{R}^N$ and the leading eigenvalues $\mu_1, \mu_2, \dots \in (0, 1]$ of $P = (p(n \mid x_m))_{nm}$ at several iterations $\tau$.]

52 Accelerated GBMS algorithm. At each iteration, replace the clusters already formed with a single point whose mass equals the number of points in the cluster. The algorithm is now an alternation between GMS blurring steps and connected-component reduction steps. Equivalent to the original GBMS but faster: the effective number of points $N^{(\tau)}$ (and thus the computation) decreases very quickly. Computational cost ($k_1$: average number of iterations per point for GMS; $k_2$: number of iterations for GBMS and accelerated GBMS): GMS $\propto N^2 D k_1$; GBMS $\propto N^2 D k_2$; accelerated GBMS $\propto D \sum_{\tau=1}^{k_2} (N^{(\tau)})^2$.

53 Experiments with image segmentation. [Figure: original images and the GMS and GBMS segmentations. Table: number of iterations per image for GMS, GBMS and accelerated GBMS at the respective bandwidths.]

54 Experiments with image segmentation (cont.) Effective number of points $N^{(\tau)}$ in accelerated GBMS. [Plot: $N^{(\tau)}$ vs. $\tau$ for the cameraman and hand images.] Accelerated GBMS is 2-4x faster than GBMS and 5-6x faster than GMS.

55 Experiments with image segmentation (cont.) [Plots: number of clusters $C$, number of iterations and speedup as a function of $\sigma$ for GMS, GBMS and accelerated GBMS.]

56 Experiments with image segmentation (cont.) Segmentations of similar quality to those of GMS, but faster. Computational cost $O(kN^2)$. Possible further accelerations: fast Gauss transform; sparse affinities $W$ (truncated Gaussian kernel, proximity graph such as $k$-nearest-neighbours, etc.). The bandwidth $\sigma$ is set by the user; good values: for GMS, $\sigma$ of the order of (image side)/5; for GBMS, somewhat smaller. One can also try values of $\sigma$ on a scaled-down version of the image (fast) and then scale up $\sigma$.

57 GBMS: summary
We have turned a neglected algorithm originally due to Fukunaga & Hostetler into a practical algorithm by introducing a reliable stopping criterion and an acceleration.
Fast convergence (cubic order for Gaussian clusters).
Connection with spectral clustering (GBMS interleaves refining the affinities with power iterations).
Excellent segmentation results, faster than GMS and spectral clustering; the only parameter is $\sigma$ (which controls the granularity).
Very simple to implement. We hope to see it more widely applied.
