A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring


Brian Clarke
5 years ago

1 Lecture 23
- Nonlinear least squares
- Notes Modeling2015.pdf on the course web site (will be posted tonight)
- Chapter 11 in Gregory (Nonlinear model fitting)
- Chapter 29 of MacKay (Monte Carlo Methods)
- Chapter 12 in Gregory (MCMC)
- An Introduction to MCMC for Machine Learning (Andrieu et al. 2003, Machine Learning, 50, 5)
- Genetic Algorithms: Principles of Natural Selection Applied to Computation (Stephanie Forrest, Science 1993, 261, 872)
- Assignment 4 posted

2 Clustering Algorithms
A very general problem is to find groupings or clusterings of objects in some space. Clusters are useful for classification, description, discovery, and for learning algorithms. New objects may be identified because they are outliers from clusters defined by prior data.
Simple clusters = islands of points.
Hierarchical clustering: clusters within clusters.
K-means Algorithm: Called K-means because membership in one of K clusters is identified by proximity of individual data points to the cluster mean. The treatment here is an amalgam of Chapter 8 of Introduction to Data Mining, Tan, Steinbach, and Kumar (available on the web at kumar/dmbook/index.php) and Chapters 20 and 21 of Information Theory, Inference, and Learning Algorithms, MacKay (also available on the web).
Let there be N objects at locations $\{x_j, j = 1, \ldots, N\}$ in an L-dimensional space.
1. Specify the number of clusters K and initialize their mean locations, $\{m_k, k = 1, \ldots, K\}$, as random vectors in the space.
2. Identify each object with a cluster by calculating the Euclidean distance $d_{jk} = \| x_j - m_k \|$ and finding the nearest cluster.

3 3. The number of objects associated with each cluster is $n(k)$.
4. Recalculate the mean location of each cluster, $m_k = \frac{1}{n(k)} \sum_{j:\, x_j \in k} x_j$. If a cluster is found to have no associated objects, its mean does not change.
5. Iterate steps 2 through 4 until there is no change in cluster locations and membership.
The K-means algorithm always converges, but it need not converge to the correct grouping.
Issues:
1. All points are treated equally in calculating the mean cluster location, even points that are on the periphery of the cluster.
2. The algorithm does not incorporate any prior shape information. It can be fooled by filamentary shapes.
The first issue can be dealt with by calculating a weighted mean $m_k$ that weights more strongly the points that are nearest the previous mean. Doing so requires imposing a length scale on the cluster, such as a Gaussian function with some size in L-space. The second issue, elongated clusters, can be dealt with by using elliptical Gaussian functions to generate the weighted cluster means, with the variance on each axis updated along with cluster membership. Examples are given in Chapter 21 of MacKay.
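The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the course's own code: the function name `kmeans`, the test data, and the choice to initialize the means at randomly chosen data points (instead of random vectors in the space) are all assumptions made here.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Basic K-means for data x of shape (N, L); returns (means, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1 (assumed variant): start the means at k distinct data points.
    means = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = np.full(len(x), -1)
    for _ in range(n_iter):
        # Step 2: Euclidean distance d_jk from every point j to every mean k.
        d = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # Step 5: membership is unchanged, so the means are too.
        labels = new_labels
        # Steps 3-4: recompute each mean; an empty cluster keeps its old mean.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
    return means, labels

# Two well-separated blobs of 20 points each (made-up test data).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(20, 2))
data = np.vstack([blob_a, blob_b])
means, labels = kmeans(data, k=2)
```

With clusters this well separated the result should not depend on the seed; for overlapping or elongated clusters, different seeds can give different stable solutions, which is the failure mode listed under Issues.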

4 [Figures reproduced from MacKay, Section 20.1, K-means clustering, p. 287. Copyright Cambridge University Press.]
Figure: K-means algorithm applied to a data set of 40 points. K = 2 means evolve to stable locations after three iterations. (Panels: Data, followed by alternating Assignment and Update steps.)
Figure: K-means algorithm applied to a data set of 40 points. Two separate runs (Run 1, Run 2), both with K = 4 means, reach different solutions. Each frame shows a successive assignment step.
Exercise [4, p.291]: See if you can prove that K-means always converges. [Hint: find a physical analogy and an associated Lyapunov function.]

5 [Figures and text reproduced from MacKay, An Example Inference Task: Clustering. Copyright Cambridge University Press.]
Figure: K-means algorithm for a case with two dissimilar clusters. (a) The "little 'n' large" data. (b) A stable set of assignments and means. Note that four points belonging to the broad cluster have been incorrectly assigned to the narrower cluster. (Points assigned to the right-hand cluster are shown by plus signs.)
Figure: Two elongated clusters, and the stable solution found by the K-means algorithm. (Panels (a), (b).)
[...] function is provided as part of the problem definition; but I'm assuming we are interested in data-modelling rather than vector quantization.] How do we choose K? Having found multiple alternative clusterings for a given K, how can we choose among them?
Cases where K-means might be viewed as failing. Further questions arise when we look for cases where the algorithm behaves badly (compared with what the man in the street would call "clustering"). Figure 20.5a shows a set of 75 data points generated from a mixture of two Gaussians. The right-hand Gaussian has less weight (only one fifth of the data [...]

6 Nonlinear Least Squares
- Summary of linear least squares
- Features of nonlinear least squares
- Tackling the cost-function landscape

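7 Summary of Linear Least Squares
Unweighted least squares, equal uncorrelated errors:
Cost function: $Q(\theta) = \epsilon^\dagger I \epsilon = \sum_j \epsilon_j^2$.
Parameter vector that minimizes Q: $\hat\theta = (X^\dagger X)^{-1} X^\dagger y$.
Covariance matrix for the parameters: $P \equiv \langle (\hat\theta - \theta)(\hat\theta - \theta)^\dagger \rangle = \sigma^2 (X^\dagger X)^{-1}$.
Cost function dependence on $\theta = \hat\theta + \delta\theta$:
$$Q(\theta) = (y - X\theta)^\dagger (y - X\theta) = (y - X\hat\theta)^\dagger (y - X\hat\theta) + \underbrace{\delta\theta^\dagger X^\dagger X\, \delta\theta}_{\text{quadratic form} \,\ge\, 0}$$
The cost function hypersurface is quadratic and has only one minimum.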

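8 Weighted least squares: arbitrary covariance matrix V for $\epsilon$:
Cost function: $Q(\theta) = \epsilon^\dagger V^{-1} \epsilon$.
Parameter vector that minimizes Q: $\hat\theta = (X^\dagger V^{-1} X)^{-1} X^\dagger V^{-1} y$.
Covariance matrix for the parameters: $P \equiv \langle (\hat\theta - \theta)(\hat\theta - \theta)^\dagger \rangle = (X^\dagger V^{-1} X)^{-1}$.
Cost function dependence on $\theta = \hat\theta + \delta\theta$:
$$Q(\theta) = (y - X\hat\theta)^\dagger V^{-1} (y - X\hat\theta) + \delta\theta^\dagger X^\dagger V^{-1} X\, \delta\theta$$
For a linear model this hypersurface is again quadratic with a single minimum. For nonlinear models the cost function hypersurface generally has many local minima, whereas we want the global minimum.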

9 Error Ellipses and Confidence Intervals
Confidence intervals for weighted least squares, for an arbitrary covariance matrix V for $\epsilon$:
Cost function: $Q(\theta) = \epsilon^\dagger V^{-1} \epsilon$.
Parameter vector that minimizes Q: $\hat\theta = (X^\dagger V^{-1} X)^{-1} X^\dagger V^{-1} y$.
Covariance matrix for the parameters: $P \equiv \langle (\hat\theta - \theta)(\hat\theta - \theta)^\dagger \rangle = (X^\dagger V^{-1} X)^{-1}$.
What are our goals?
1. Given a fit to data, what are the errors on the parameters?
2. Do we know the data errors a priori or not? If not, we need an estimate for the errors.
3. Model comparison: we want to compare models to identify the best one.
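For the first goal, the weighted least squares formulas translate directly into NumPy. The sketch below is an illustration under stated assumptions: real-valued data (so the dagger is a plain transpose), a symmetric V, and a made-up straight-line model with noise that grows with x. The helper name `weighted_least_squares` is hypothetical.

```python
import numpy as np

def weighted_least_squares(X, y, V):
    """Return (theta_hat, P) for y = X theta + eps with cov(eps) = V (symmetric)."""
    Vinv_X = np.linalg.solve(V, X)      # V^{-1} X without forming V^{-1}
    fisher = X.T @ Vinv_X               # X^T V^{-1} X
    P = np.linalg.inv(fisher)           # parameter covariance matrix
    theta_hat = P @ (Vinv_X.T @ y)      # (X^T V^{-1} X)^{-1} X^T V^{-1} y
    return theta_hat, P

# Made-up example: straight line y = 1 + 2x with heteroscedastic noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
theta_true = np.array([1.0, 2.0])
sigma = 0.1 + 0.2 * x                   # per-sample noise rms (assumed known)
V = np.diag(sigma ** 2)
y = X @ theta_true + rng.normal(0.0, sigma)

theta_hat, P = weighted_least_squares(X, y, V)
theta_err = np.sqrt(np.diag(P))         # 1-sigma errors on each parameter
```

The diagonal of P gives the individual parameter variances; the off-diagonal term sets the tilt of the error ellipse for the intercept and slope.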

10 Nonlinear Least Squares
So far we have considered linear models of the form $y = X\theta + \epsilon$. But often we want to fit models $f(x, \theta)$ that are nonlinear in the parameters, such as
$$y = f(x, \theta) + \epsilon,$$
where this vector equation has n elements and $\theta$ is a k-vector. We cannot solve for the best fit to the data in the same way as for the linear model, but the underlying principle is the same: minimize the sum of squares. Thus, we minimize the quadratic form
$$Q(\theta) = \epsilon^\dagger V^{-1} \epsilon, \qquad \epsilon = y - f(x, \theta).$$

11 The problem is to find the minimum Q in a k-dimensional space where Q is a nonmonotonic function of the parameters. Recall that Q is parabolic for the linear model, so in that case finding any minimum of Q is the same as finding the global minimum. With nonlinear functions there may be an arbitrary number of local minima, which can trap algorithms that find only the minimum nearest their starting point. This multiplicity of minima is the bane of the nonlinear LS problem.

12 Cost Function Example: Nonlinear Function
Consider a standard signal + noise data set
$$y(x) = f(x; \theta) + n(x),$$
where n is IID with variance $\sigma_n^2$ and the signal is a Gaussian function of x with an additive baseline,
$$f(x; \theta) = a + b\, e^{-(x - c)^2 / 2d^2},$$
which has two linear parameters (a and b) and two nonlinear parameters (c and d). We can generate a realization of y(x) using a set of parameters $\theta_{\rm true} = (a, b, c, d)$ and a realization of the noise n(x). Then we can evaluate the cost function vs. $\theta$ over ranges of values of a, b, c, d:
$$C(\theta) = \frac{[y(x) - \hat y(x)]^\dagger [y(x) - \hat y(x)]}{(N - 4)\, \sigma_n^2},$$
where for this example we have divided by the presumed-known noise variance and by the number of degrees of freedom when we specify four parameter values.
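The cost function for this example can be evaluated with a short script. This is a sketch under assumptions: the x grid, the number of samples, the random seed, and the definition of S/N as b divided by the noise rms are choices made here for illustration, not values taken from the slides.

```python
import numpy as np

def gaussian_model(x, a, b, c, d):
    """Gaussian of amplitude b, location c, scale d on a constant baseline a."""
    return a + b * np.exp(-(x - c) ** 2 / (2.0 * d ** 2))

def reduced_cost(theta, x, y, sigma_n):
    """C(theta): sum of squared residuals / ((N - 4) * sigma_n^2)."""
    resid = y - gaussian_model(x, *theta)
    dof = len(x) - len(theta)            # N - 4 for four parameters
    return resid @ resid / (dof * sigma_n ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 10.0, 200)         # assumed sampling grid
theta_true = (0.10, 1.00, 2.30, 0.73)    # a, b, c, d as in the figure titles
sigma_n = 0.2                            # S/N = b / sigma_n = 5 (assumed definition)
y = gaussian_model(x, *theta_true) + rng.normal(0.0, sigma_n, x.size)

c_true = reduced_cost(theta_true, x, y, sigma_n)
c_off = reduced_cost((0.10, 1.00, 6.00, 0.73), x, y, sigma_n)  # wrong location c
```

At the true parameters C should be close to 1, since it is a chi-square per degree of freedom; moving c far from the true location raises it well above 1.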

13 [Figure, four panels: cost function versus c (Gaussian location parameter) and d (Gaussian scale parameter) for Gaussian shape a, b, c, d = 0.10, 1.00, 2.30, 0.73 at several S/N values. Most S/N labels were lost in extraction (one reads S/N = 5); the surviving panel labels give Q min = 0.90 and Q min = 1.23.]

14 [Figure, four panels: cost function versus c (Gaussian location parameter) and d (Gaussian scale parameter) for Gaussian shape a, b, c, d = 0.10, 1.00, 2.30, 0.73 at several S/N values. Most S/N labels were lost in extraction (one reads S/N = 0.05); the surviving panel labels give Q min = 1.13 and Q min = 0.96.]

15 The various strategies for minimizing Q include:
1. A grid search in $\theta$ space (a brute force approach). There is a dynamic-range problem: enough hyper-volume must be searched that the global minimum is included, but with sufficiently fine resolution that it is not missed. The total number of operations grows rapidly with k, the number of parameters. Let $\Delta\theta_i$ be the total range searched for the parameter $\theta_i$ (one element of the vector $\theta$), with a grid sample interval $\delta\theta_i$. The total number of grid points is
$$N = \prod_{i=1}^{k} \frac{\Delta\theta_i}{\delta\theta_i}.$$
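To make the growth with k concrete, the grid-point count and a brute-force search can be sketched with the standard library alone. The function names and the toy cost function below are hypothetical, chosen only for illustration.

```python
import itertools
import math

def grid_size(ranges, steps):
    """Total number of grid points: product over parameters of range / step."""
    return math.prod(round(r / s) for r, s in zip(ranges, steps))

def grid_search(cost, axes):
    """Evaluate cost at every point of the Cartesian grid; return (point, cost)."""
    best = min(itertools.product(*axes), key=cost)
    return best, cost(best)

# Four parameters, each searched over a range of 10 with a step of 0.1.
n_points = grid_size([10.0] * 4, [0.1] * 4)

# Toy two-parameter cost with its minimum at (1, -2), searched on an integer grid.
def toy_cost(p):
    return (p[0] - 1) ** 2 + (p[1] + 2) ** 2

best_point, best_cost = grid_search(toy_cost, [range(-5, 6), range(-5, 6)])
```

Here `n_points` is 100 to the fourth power, i.e. 10^8 cost-function evaluations for only four parameters, which is why a pure grid search is usually limited to a coarse pilot survey.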

16 2. A ravine search: Use the gradient of Q, $dQ/d\theta$, to find the bottom of a particular valley in $\theta$-space. Choose a step length in the direction of the negative gradient, move to a new position, evaluate Q and see if a minimum has been found. If not, iterate. This method clearly finds only the minimum that is nearest to the starting point of the search. This may not be the global minimum unless the starting point has been chosen wisely (or luckily). A hybrid approach would combine the ravine search with another, pilot search that has identified the rough location of the global minimum.
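The dependence on the starting point is easy to demonstrate in one dimension. The sketch below uses a made-up double-well cost function and a fixed step length, both of which are assumptions for illustration; practical codes usually adapt the step.

```python
def cost(theta):
    """Made-up double-well cost: global minimum near -1.30, local one near +1.13."""
    return theta ** 4 - 3.0 * theta ** 2 + theta

def gradient(theta):
    """Analytic derivative dQ/dtheta of the cost above."""
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0

def ravine_search(theta0, step=0.01, tol=1e-10, max_iter=10000):
    """Fixed-step descent along the negative gradient from theta0."""
    theta = theta0
    for _ in range(max_iter):
        move = -step * gradient(theta)
        theta += move
        if abs(move) < tol:   # negligible move: at the bottom of this valley
            break
    return theta

theta_left = ravine_search(-2.0)    # ends in the deeper (global) well
theta_right = ravine_search(2.0)    # ends in the shallower (local) well
```

Both runs converge, but only the one started at negative theta reaches the global minimum.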

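17 3. Parabolic Extrapolation of Q: Near a minimum, Q may be approximated as a parabolic surface, so expressing it as such leads to a determination of the minimum. In vector form this is
$$Q = Q_{\min} + (\nabla Q) \cdot \delta\theta + \tfrac{1}{2}\, \delta\theta \cdot (\nabla\nabla Q) \cdot \delta\theta.$$
This can also be written in the form
$$Q = Q_{\min} + \sum_k \left. \frac{\partial Q}{\partial \theta_k} \right|_{\min} \delta\theta_k + \frac{1}{2} \sum_j \sum_k \left. \frac{\partial^2 Q}{\partial \theta_j\, \partial \theta_k} \right|_{\min} \delta\theta_j\, \delta\theta_k.$$
Minimizing with respect to the increments $\delta\theta$, we obtain
$$\frac{\partial Q}{\partial\, \delta\theta} = 0 = \nabla Q + (\nabla\nabla Q) \cdot \delta\theta.$$
This is a k-vector equation that yields corrections $\delta\theta$ to initial guesses $\theta_0$, which yield
$$Q_0 = (y - \hat y(\theta_0))^\dagger V^{-1} (y - \hat y(\theta_0)).$$

18 The Fisher information matrix is related to the quadratic term above:
$$F_{jk} = \frac{1}{2} \left. \frac{\partial^2 Q}{\partial \theta_j\, \partial \theta_k} \right|_{\min}$$
and is the inverse of the parameter covariance matrix (in the quadratic approximation).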
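As a consistency check on the factor of 1/2, the linear model can be worked through explicitly. This short derivation assumes real-valued data and a symmetric V, and uses the weighted least squares cost function from the linear summary:

```latex
\begin{aligned}
Q(\theta) &= (y - X\theta)^{\dagger} V^{-1} (y - X\theta), \\
\frac{\partial^2 Q}{\partial\theta_j\,\partial\theta_k}
  &= 2\,\bigl(X^{\dagger} V^{-1} X\bigr)_{jk}, \\
F_{jk} &= \frac{1}{2}\,\frac{\partial^2 Q}{\partial\theta_j\,\partial\theta_k}
  = \bigl(X^{\dagger} V^{-1} X\bigr)_{jk}
  = \bigl(P^{-1}\bigr)_{jk}.
\end{aligned}
```

So for a linear model the Fisher matrix is exactly the inverse of the parameter covariance matrix, and the quadratic approximation is not an approximation at all.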

19 4. Linearization of the fitting function, $f(\theta)$: Linearize $f(\theta)$ according to
$$f(\theta) \approx f(\theta_0) + \nabla f(\theta_0) \cdot (\theta - \theta_0).$$
Then the model for the data becomes
$$y \approx f(\theta_0) + \nabla f(\theta_0) \cdot (\theta - \theta_0) + \epsilon,$$
where $\theta_0$ is an initial guess for the parameters. Note that $f(\theta)$ is implicitly a function of some independent variable(s), as with linear LS. Since the model is linear near the initial guess, one can solve for $\delta\theta = \theta - \theta_0$ using the linear LS formalism. Specifically,
$$\delta\theta = (X^\dagger V^{-1} X)^{-1} X^\dagger V^{-1}\, \delta y, \qquad \delta y = y - f(\theta_0),$$
where X is now the $n \times k$ matrix of values $\nabla f(\theta_0)$ for the k-dimensional gradient, evaluated at n values of some independent variable (e.g. time, spatial coordinate, frequency, etc.).

20 Note that, like methods 2 and 3, linearization of the fitting function will find only the minimum that is closest to the initial guess for the parameters.
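Iterating the linearized solution is commonly called the Gauss-Newton method. The sketch below applies it to a Gaussian-plus-baseline model, assuming white noise (V equal to sigma squared times the identity), an analytic gradient, a fixed number of iterations, and an initial guess already near the truth; the data, seed, and starting values are invented for illustration.

```python
import numpy as np

def model(x, theta):
    a, b, c, d = theta
    return a + b * np.exp(-(x - c) ** 2 / (2.0 * d ** 2))

def jacobian(x, theta):
    """n x k matrix X: partial derivatives of the model w.r.t. (a, b, c, d)."""
    a, b, c, d = theta
    g = np.exp(-(x - c) ** 2 / (2.0 * d ** 2))
    return np.column_stack([
        np.ones_like(x),                 # df/da
        g,                               # df/db
        b * g * (x - c) / d ** 2,        # df/dc
        b * g * (x - c) ** 2 / d ** 3,   # df/dd
    ])

def gauss_newton(x, y, theta0, sigma, n_iter=20):
    """Repeat the linear LS solve for delta-theta, assuming V = sigma^2 I."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        X = jacobian(x, theta)
        dy = y - model(x, theta)
        fisher = X.T @ X / sigma ** 2                   # X^T V^{-1} X
        delta = np.linalg.solve(fisher, X.T @ dy / sigma ** 2)
        theta = theta + delta
    X = jacobian(x, theta)
    P = np.linalg.inv(X.T @ X / sigma ** 2)             # parameter covariance
    return theta, P

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 10.0, 200)
theta_true = np.array([0.10, 1.00, 2.30, 0.73])
sigma = 0.05
y = model(x, theta_true) + rng.normal(0.0, sigma, x.size)

theta_fit, P = gauss_newton(x, y, theta0=(0.0, 0.8, 2.1, 0.8), sigma=sigma)
```

There is no step control here, so a poor initial guess (for example, c several widths away from the true location) can send the iteration to a different minimum or make it diverge.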

21 Optimization Methods
We have seen a number of instances where we have wanted to maximize or minimize a function. For least-squares problems, the cases of interest are:
1. Linear models: Q is convex (quadratic), so there is a single minimum, found through a single iteration of the standard LS solution.
2. Nonlinear models: Q is generally complicated, with many local minima.
(a) Ravine searches, parabolic extrapolation, and linearization of the fitting function are all iterative methods for finding the local minimum near a starting point. There is no guarantee that the global minimum will be found with these methods.
(b) Grid search: can find the global minimum, but at the great cost of evaluating functions at a large number of locations in $\theta$-space. Also, with too-coarse sampling, the global minimum can be missed with this method as well.

22 (c) Hill-climbing method: Essentially the same as (a).
(d) Downhill simplex: This method searches the parameter space, or domain, using a geometrical construct called a simplex, a non-coplanar object with k + 1 vertices in the k-space. No derivatives need to be computed; the method simply changes the shape of the simplex and moves it through the k-space according to the values of Q encountered at the vertices. It can get stuck in false minima, however, so multiple trials with different starting points should be used.

23 (e) Simulated annealing: Allow trial values of parameters to jump around the domain (i.e. $\theta$-space) according to a temperature-like parameter and application of the Metropolis algorithm. This provides the opportunity for exploring the entire domain and not getting stuck in a local minimum. The temperature is lowered slowly, as in the annealing of metals, where the lattice finds a nice minimum-energy solution for itself. This method has a high probability of at least finding the neighborhood of the global minimum. Finding the exact minimum through the annealing process is slow. Hybridizing annealing with a method from (a) can find the minimum more quickly.
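A minimal annealing loop looks like the following. The double-well cost function, the geometric cooling schedule, the Gaussian proposal width, and the start and end temperatures are all assumptions chosen for illustration.

```python
import math
import random

def cost(theta):
    """Made-up double-well cost: global minimum near -1.30, local one near +1.13."""
    return theta ** 4 - 3.0 * theta ** 2 + theta

def anneal(theta0, t_start=2.0, t_end=1e-3, n_steps=20000, step=0.5, seed=0):
    """Metropolis random walk with a slowly decreasing temperature."""
    rng = random.Random(seed)
    theta, q = theta0, cost(theta0)
    best_theta, best_q = theta, q
    for i in range(n_steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (i / (n_steps - 1))
        trial = theta + rng.gauss(0.0, step)
        q_trial = cost(trial)
        # Metropolis rule: always accept downhill moves; accept uphill
        # moves with probability exp(-(Q_trial - Q) / T).
        if q_trial <= q or rng.random() < math.exp(-(q_trial - q) / temp):
            theta, q = trial, q_trial
            if q < best_q:
                best_theta, best_q = theta, q
    return best_theta, best_q

# Start in the shallower (local) well; annealing can still cross the barrier.
best_theta, best_q = anneal(theta0=1.13)
```

The returned point is only near the minimum; a few iterations of one of the local methods in (a), started from `best_theta`, would refine it.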

24 (f) Genetic Algorithms (GAs): Search the domain through genetic-like operations. Let the parameter vector be associated with chromosomes made up of genes that each represent a specific parameter. The chromosomes are subject to genetic manipulation between generations (iterations). The main genetic processes are:
i. selection according to fitness (defined in terms of a better value of the quantity being optimized, i.e. Q in least squares, the likelihood function in ML);
ii. recombination or crossover, where selected pairs of chromosomes (parameter vectors) interchange genes (bits);
iii. mutation, where genes (bits) are randomly flipped according to some probability. This helps keep organisms from getting stuck in local minima.
GAs can search the entire domain efficiently because successful substrings (bit sequences) in the chromosomes ("schema") grow exponentially according to their fitness relative to the mean fitness.

25 Thus, the genetic approach explores the domain more efficiently than a purely random search of the domain (e.g. Monte Carlo selection of parameter values) or a deterministic grid search, because the genetic approach includes memory.
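The three genetic processes can be sketched for a single parameter encoded as a 16-bit chromosome. Everything specific here is an assumption for illustration: the encoding, the tournament form of selection, single-point crossover, the mutation probability, the use of elitism (carrying the best chromosome forward unchanged), and the double-well cost function.

```python
import random

N_BITS = 16
LO, HI = -2.0, 2.0   # assumed search range for the parameter

def decode(bits):
    """Map a chromosome (list of 0/1 genes) to a parameter value in [LO, HI]."""
    as_int = int("".join(map(str, bits)), 2)
    return LO + (HI - LO) * as_int / (2 ** N_BITS - 1)

def cost(theta):
    """Made-up double-well cost: global minimum near -1.30, local one near +1.13."""
    return theta ** 4 - 3.0 * theta ** 2 + theta

def fitness_key(chromosome):
    return cost(decode(chromosome))   # lower cost = fitter

def tournament(pop, rng):
    """Selection: pick two chromosomes at random and keep the fitter one."""
    a, b = rng.sample(pop, 2)
    return a if fitness_key(a) < fitness_key(b) else b

def evolve(pop_size=40, n_gen=60, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_BITS)] for _ in range(pop_size)]
    for _ in range(n_gen):
        children = [min(pop, key=fitness_key)[:]]     # elitism (assumed)
        while len(children) < pop_size:
            mom, dad = tournament(pop, rng), tournament(pop, rng)
            cut = rng.randrange(1, N_BITS)            # single-point crossover
            child = mom[:cut] + dad[cut:]
            # Mutation: flip each bit independently with probability p_mut.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
    best = min(pop, key=fitness_key)
    return decode(best), fitness_key(best)

best_theta, best_q = evolve()
```

Because the best chromosome is always carried forward, the best cost never gets worse from one generation to the next; the resolution of the answer is limited by the 16-bit encoding.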

26 Markov Processes
- Markov processes are used for modeling as well as in statistical inference problems.
- Markov processes are generally n-th order:
  - The current state of a system may depend on n previous states
  - Most applications consider 1st-order processes
- Hidden Markov processes:
  - A physical system may involve transitions between discrete states, but observables may reflect those states only indirectly (e.g. measurement noise, other physics, etc.)

27 Markov Chains and Markov Processes
Definitions: A Markov process has future samples determined only by the present state and by a transition probability from the present state to a future state. A Markov chain is one that has a countable number of states. Transitions between states are described by an $n \times n$ stochastic matrix Q with elements $q_{ij}$ comprising the probabilities for changing in a single time step from state $s_i$ to state $s_j$, with $i, j = 1, \ldots, n$. The state probability vector P has elements comprising the ensemble probability of finding the system in each state.
States $= \{s_1, s_2, \ldots, s_n\}$. E.g. for a three-state system:
$$Q = \begin{pmatrix} q_{11} & q_{12} & q_{13} \\ q_{21} & q_{22} & q_{23} \\ q_{31} & q_{32} & q_{33} \end{pmatrix}.$$
Normalization across a row is $\sum_j q_{ij} = 1$, since the system must be in some state at any time. In a single time step the probability for staying in the i-th state is the metastability $q_{ii}$, and the probability for residing in that state for a time T is proportional to $q_{ii}^{T}$.

28 Two-state Markov Processes

29 The probability density function (PDF) for the duration of a given state is therefore a geometric distribution,
$$f_T(T) = T_i^{-1} \left(1 - T_i^{-1}\right)^{T-1}, \qquad T = 1, 2, \ldots, \qquad (1)$$
with mean and rms values
$$T_i = (1 - q_{ii})^{-1}, \qquad \sigma_{T_i} / T_i = \sqrt{q_{ii}}. \qquad (2)$$
Asymptotic behavior as the number of steps $\to \infty$: The transition matrix after t steps is $Q^t$. Under the reasonable assumptions that all elements of Q are non-negative and that all states are accessible in a finite number of steps, $Q^t$ converges to a steady-state form $Q^\infty$ as $t \to \infty$ that has identical rows. Each row of $Q^\infty$ is equal to the state probability vector P, the elements of which are the probabilities that a given time sample is in a particular state. P also equals the normalized left eigenvector of Q that has unity eigenvalue, i.e. $PQ = P$ (e.g. Papoulis). For P to exist, the determinant $\det(Q - I) = 0$ (where I is the identity matrix), but this is automatically satisfied for a stochastic matrix corresponding to a stationary process. Convergence of $Q^t$ to a matrix with identical rows implies that the transition probabilities tend to those appropriate for an i.i.d. process when the time step t is much larger than the mean lifetimes $T_i$ of any of the states.
For a two-state system P has elements $p_1 = (1 - q_{22}) / (2 - q_{11} - q_{22})$ and $p_2 = 1 - p_1$.
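These statements are easy to verify numerically for a two-state chain. The transition probabilities below are made-up values chosen for illustration.

```python
import numpy as np

# Two-state transition matrix (made-up values); each row sums to 1.
Q = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Q^t for large t: both rows converge to the state probability vector P.
Q_inf = np.linalg.matrix_power(Q, 200)

# P as the left eigenvector of Q with unit eigenvalue, i.e. the right
# eigenvector of Q transposed, normalized so its elements sum to 1.
eigvals, eigvecs = np.linalg.eig(Q.T)
P = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
P = P / P.sum()

# Closed-form two-state result and mean dwell times T_i = 1 / (1 - q_ii).
p1 = (1.0 - Q[1, 1]) / (2.0 - Q[0, 0] - Q[1, 1])
mean_dwell = 1.0 / (1.0 - np.diag(Q))
```

For these values P is (0.75, 0.25) and the mean dwell times are 10 and 10/3 time steps; the state with the larger metastability is occupied for longer stretches and more often overall.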

30 Utility of Markov processes:
1. Modeling: Many processes in the lab and in nature are consistent with being Markov chains. The key elements are a set of discrete states and transitions that are random but governed by a transition matrix.
2. Sampling: A Markov chain can define a trajectory in the relevant space which can be used to randomly but efficiently sample the space. The key aspect of Markov Chain Monte Carlo is that the trajectory conforms statistically to the asymptotic form of the transition matrix.
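The sampling use can be illustrated with a random-walk Metropolis chain, in which each new sample depends only on the current one. The target (a standard normal known only up to normalization), the uniform proposal, and its width are assumptions chosen for illustration.

```python
import math
import random

def log_target(x):
    """Unnormalized log density of the assumed target, a standard normal."""
    return -0.5 * x * x

def metropolis(n_samples, step=2.5, x0=0.0, seed=0):
    """First-order Markov chain whose asymptotic distribution is the target."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        trial = x + rng.uniform(-step, step)       # symmetric proposal
        log_ratio = log_target(trial) - log_target(x)
        # Accept with probability min(1, target(trial) / target(x)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = trial
        samples.append(x)                          # a rejected move repeats x
    return samples

samples = metropolis(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance should approach 0 and 1. Successive samples are correlated, so the effective number of independent samples is smaller than the chain length.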
