More on Unsupervised Learning


Two types of unsupervised problems are finding association rules for items that occur together in observations (market basket analysis) and finding groups of observations that are similar to one another. For finding groups of similar observations, the two main approaches are tree-based methods (divisive or agglomerative) and methods based on distances to prototype points.

Association Rules: Review

A problem related to clustering is market basket analysis, in which we seek to determine which items in a basket (or cluster) are likely to occur together. If A and B are items (or, more generally, subsets of the possible values taken on by a given variable), we consider rules of the form $A \Rightarrow B$. We call A the antecedent and B the consequent. The rule states that if A is present, B is likely to be present.

Association Rules

The support of the rule is the fraction of all baskets in which A and B are both present. We denote this as $T(A \Rightarrow B)$. The confidence of the rule, or predictability of the association, is the conditional fraction
$$C(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A)}.$$
Finally, the lift of the rule is the conditional fraction
$$L(A \Rightarrow B) = \frac{C(A \Rightarrow B)}{T(B)} = \frac{T(A \Rightarrow B)}{T(A)\,T(B)}.$$
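
To make the definitions concrete, here is a minimal R sketch (my own, not from the notes) that computes the three quantities for a single rule from a 0/1 basket-by-item matrix; the synthetic matrix and the item names A and B are made up for illustration.

  # synthetic baskets: 100 baskets, 5 items, 0/1 indicators
  baskets <- matrix(rbinom(500, 1, 0.4), ncol = 5,
                    dimnames = list(NULL, c("A", "B", "C", "D", "E")))

  support    <- mean(baskets[, "A"] == 1 & baskets[, "B"] == 1)  # T(A => B)
  confidence <- support / mean(baskets[, "A"] == 1)              # C(A => B)
  lift       <- confidence / mean(baskets[, "B"] == 1)           # L(A => B)
  c(support = support, confidence = confidence, lift = lift)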

K-Means Clustering

The clustering depends on the variability of the variables. It may be necessary to scale the variables in order for the clustering to be sensible, because the larger a variable's variance, the more impact it has on the clustering.

K-means in R: the basic function is kmeans. The first argument is the data and the second is the number of means:

  kmeans(mydata, k)
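
For example, a short sketch of a call with scaling (the data frame mydata and the choice of 3 clusters are placeholders):

  x <- scale(mydata)                          # scale so no variable dominates the distances
  fit <- kmeans(x, centers = 3, nstart = 25)  # nstart: multiple random starts
  fit$cluster                                 # cluster assignment for each observation
  fit$centers                                 # cluster means (on the scaled variables)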

Choosing the Number of Clusters

A major issue is how many clusters should be formed. This question must generally be addressed in an ad hoc manner. A number of statistics have been proposed for deciding how many clusters to use. The Calinski-Harabasz index,
$$\frac{b/(k-1)}{w/(n-k)},$$
where b is the between-groups sum-of-squares,
$$b = \sum_{g=1}^{k}\sum_{j=1}^{m} \left(\bar{x}_{j(g)} - \bar{x}_j\right)^2,$$
and w is the pooled within-groups sum-of-squares, can be used as a stopping rule. The objective is to maximize it.
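
A short R sketch (my own) of using the index to compare several values of k; x is assumed to be a scaled numeric data matrix, and the between- and within-groups sums of squares are taken from the components that kmeans reports (note that kmeans's betweenss weights each group by its size, a common variant of b above).

  ch_index <- function(x, k) {
    fit <- kmeans(x, centers = k, nstart = 25)
    n <- nrow(x)
    b <- fit$betweenss      # between-groups sum of squares (size-weighted)
    w <- fit$tot.withinss   # pooled within-groups sum of squares
    (b / (k - 1)) / (w / (n - k))
  }

  ks <- 2:8
  ch <- sapply(ks, function(k) ch_index(x, k))
  ks[which.max(ch)]         # candidate number of clusters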

Variations on K-Means

K-means, as we have described it, is appropriate for continuous data in which the natural measure of distance is Euclidean. There are many variations on K-means clustering. One, of course, is just to use different measures of similarity. This also affects the representative point of a cluster; it may be something other than the mean. Sometimes the representative point is constrained to be a data point, called a medoid; see Algorithm 14.2 in HTF.

K-Medoids in R

The basic function is kmedoids in the package clue. The first argument is a dissimilarity object for the data and the second is the number of clusters:

  kmedoids(datadist, k)
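
A brief sketch of a call (assuming, as the call above suggests, that kmedoids accepts a dissimilarity object and the number of clusters; check the package documentation):

  library(clue)                     # assumed to provide kmedoids()
  datadist <- dist(scale(mydata))   # pairwise Euclidean dissimilarities
  fit <- kmedoids(datadist, k = 3)  # k-medoids with 3 clusters
  fit

The function pam in the cluster package is another widely used k-medoids implementation.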

Model-Based Hierarchical Clustering

In the general clustering problem, we may assume that the data come from several distributions, and our problem is to identify the distribution from which each observation arose. Without further restrictions, this problem is ill-posed; no solution is any better than any other. We may, however, impose the constraint that the distributions be of a particular type. We may then formulate the problem as one of fitting the observed data to a mixture of distributions of the given type. The problem posed in this way is similar to the problem of density estimation using parametric mixtures. The R package mclust performs model-based clustering.
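
A minimal sketch of its use, assuming the package's main fitting function Mclust and its argument G for the candidate numbers of mixture components (check the package documentation):

  library(mclust)                  # model-based clustering via Gaussian mixtures
  fit <- Mclust(mydata, G = 1:5)   # G: numbers of components to consider
  summary(fit)                     # chosen model and number of clusters
  head(fit$classification)         # hard cluster assignments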

Clusters of Models

We consider data with covariates; that is, in addition to the variable of interest y, there is an associated variable x. (Either may be vector-valued.) We consider models of the form $y = f(x, \theta) + \epsilon$, and we denote the systematic component of the model in the $g$th group as $f_g(x, \theta_g)$. This notation allows retention of the original labels of the dataset. In some early work, this problem was called clusterwise regression. It is also called latent class regression or regression clustering.

Approaches

There are essentially two ways of approaching the problem. They arise from slightly different considerations of why clusters are clusters. These are based on combining the notion of similar relationships among the variables either with the property of closeness or density of the elements, or else with the property of a common probability distribution.

Clustering Based on a Probability Distribution Model

If the number of clusters is fixed to be, say, k, and if the data in each cluster are considered to be a random sample from a given family of probability distributions, we can formulate the clustering problem as a maximum likelihood problem. For a mixture of k distributions, if the PDF of the $j$th distribution is $p_j(x; \theta_j)$, the PDF of the mixture is
$$p(x; \theta) = \sum_{j=1}^{k} \pi_j \, p_j(x; \theta_j),$$
where $\pi_j \ge 0$ and $\sum_{j=1}^{k} \pi_j = 1$. The k-vector $\pi$ contains the unconditional probabilities of the components for a random variable from the mixture.

EM Methods

We define k 0-1 dummy variables to indicate the group to which an observation belongs. These dummy variables are not observed, and this leads to the classic EM formulation of the mixture problem. The complete data are $C = (Y, U, x)$, where Y is an observed random variable, U is an unobserved random variable, and x is a (vector-valued) observed covariate.

EM Methods

The E-step yields conditional expectations of the dummy variables. For each observation, the conditional expectation of a given dummy variable can be interpreted as the provisional probability that the observation is from the population represented by that dummy variable. The M-step yields an optimal fit of the model in each group, using the group inclusion probabilities as weights. If it is not practical to use a weighted fit in the M-step, we can instead define a classification likelihood, as in Fraley and Raftery (2002); maximizing the classification likelihood results in each observation being assigned to exactly one group. We could also use the conditional expectations of the dummy variables as probabilities for a random assignment of each observation to a single group.

Issues with an EM Method

The EM method is based on a rather strong model assumption: a likelihood must be formulated. As we said earlier, however, the EM approach can be used even as we change the objective function in the maximization. Two other problems are endemic in EM methods: slow convergence, and convergence to local optima.

Regression Clustering

In regression clustering, we assume a model of the form $y = f_g(x, \theta_g) + \epsilon_g$ for observations y and x in the $g$th group. Usually, of course, we assume linear models of the form $y = x^T \beta_g + \epsilon_g$, with $\epsilon_g \sim N(0, \sigma_g^2)$ and the observations mutually independent. The distribution of the error term allows us to formulate a likelihood, and this provides the quantities necessary for the EM method.
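
As a concrete illustration of the EM iteration for this model, here is a rough R sketch of my own (not from the notes) for a mixture of linear regressions with a single covariate. It uses weighted lm fits for the M-step and normal densities for the E-step; empty or degenerate components are not handled.

  em_regclust <- function(y, x, k = 2, iters = 50) {
    n <- length(y)
    # random initial responsibilities (group inclusion probabilities)
    u <- matrix(runif(n * k), n, k); u <- u / rowSums(u)
    pi_g <- rep(1 / k, k); beta <- vector("list", k); sigma <- rep(sd(y), k)
    for (it in 1:iters) {
      # M-step: weighted least-squares fit and variance estimate in each group
      for (g in 1:k) {
        fit <- lm(y ~ x, weights = u[, g])
        beta[[g]] <- coef(fit)
        sigma[g] <- sqrt(sum(u[, g] * residuals(fit)^2) / sum(u[, g]))
        pi_g[g] <- mean(u[, g])
      }
      # E-step: responsibilities from the normal likelihoods of each group
      dens <- sapply(1:k, function(g)
        pi_g[g] * dnorm(y, beta[[g]][1] + beta[[g]][2] * x, sigma[g]))
      u <- dens / rowSums(dens)
    }
    list(beta = beta, sigma = sigma, pi = pi_g, posterior = u)
  }

  # usage: fit <- em_regclust(y, x, k = 2); apply(fit$posterior, 1, which.max)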

EM Methods

While an EM method is relatively easy to program, the R package flexmix, developed by Friedrich Leisch (2004), provides a simple interface for an EM method for various kinds of regression models. The package allows models of different forms for each group. It uses the classes and methods of R and so is very flexible. The M-step is viewed as a fitting step, and the structure of the package makes it a relatively simple matter to use a different fitting method, such as constrained or penalized regression.
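
A minimal sketch of a flexmix call for a two-component mixture of linear regressions (the data frame df and its columns y and x are placeholders; see the package documentation for the exact interface):

  library(flexmix)
  fit <- flexmix(y ~ x, data = df, k = 2)   # k: number of components
  summary(fit)
  parameters(fit)       # fitted coefficients in each component
  head(clusters(fit))   # hard cluster assignments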

Variable Selection within the Groups

As a practical matter, it is generally convenient to fit a model of the same form and with the same covariates within each group. A slight modification is to select different covariates within each group. Under the usual setup, with models of the form $y = x^T \beta_g + \epsilon_g$, this has no effect on the formulation of the likelihood, but it does introduce the additional step of variable selection in the M-step. One way of approaching this is to substitute a lasso fit for the usual least-squares (i.e., maximum likelihood) fit. The R package lasso2, developed by Lokhorst, Venables, and Turlach (2006), provides a penalty parameter for lasso fitting that drives insignificant coefficients to zero. Alternatively, a lars approach coupled with use of the L-curve could be used to determine a fit.

Other Variations on the M-Step

Rather than viewing the M-step as part of a procedure to maximize a likelihood, we can view it as a step to fit a model using any reasonable criterion. This, of course, changes the basic approach to finding groups in data based on regression models so that it is no longer based on maximum likelihood estimation. Even while using an EM method, the approach is now based on a heuristic notion of good fits of individual models and on clustering the observations according to best fits.

Clustering Based on Closeness

Elements within a cluster are close to each other. If we define a dissimilarity of a given element from some overall summary of a given cluster, the clustering problem is to minimize the dissimilarities within clusters. In some cases it is useful to consider fuzzy cluster membership, but in the following we will assume that each observation is in exactly one cluster. We denote the dissimilarity of the observation $y_i$ from the other members of the $g$th cluster as $d_g(y_i)$, and we define $d_g(y_i) = 0$ if $y_i$ is not in the $g$th cluster.

Clustering Based on Closeness

A given clustering is effectively a partitioning of the dataset. We denote the partition by P, which is a collection of disjoint sets of indices whose union is the set of all indices in the sample, $P = \{P_1, \ldots, P_k\}$. The sum of the discrepancies,
$$f(P) = \sum_{g=1}^{k}\sum_{i=1}^{n} d_g(y_i),$$
is a function of the clustering. For a fixed number of clusters, k, this is the objective function to be minimized with respect to the partition P. This is the basic idea of k-means clustering.

Clustering Based on Closeness

Singleton clusters need special consideration, as does the number of clusters, k. Depending on how we define the discrepancies, and how we measure the discrepancy attributable to a singleton cluster, we could incorporate the choice of k into the objective function.

Clusters of Models

For y in the $g$th group, the discrepancy is a function of the observed y and its predicted or fitted value,
$$d_g(y_i) = h_g(y_i, f_g(x_i, \theta_g)),$$
where $h_g(y_i, \cdot) = 0$ if $y_i$ is not in the $g$th cluster. In many cases, $h_g(y_i, f_g(x_i, \theta_g)) = h_g(y_i - f_g(x_i, \theta_g))$; that is, the discrepancy is a function of the difference between the observed y and its fitted value.

Measures of Dissimilarity

The measure of dissimilarity is a measure of the distance of a given observation from the center of the group of which it is a member. There are two aspects to measures of dissimilarity:

  - the type of center: mean, median, harmonic mean;
  - the type of distance measure.

The distance measure is usually a norm, most often an $L_p$ norm: $L_1$, $L_2$, or $L_\infty$.

K-Means Type Methods

In K-means clustering, the objective function is
$$f(P) = \sum_{g=1}^{k} \sum_{i \in P_g} \| y_i - \bar{y}_g \|^2,$$
where $\bar{y}_g$ is the mean of the observations in the $g$th group. In K-models clustering, $\bar{y}_g$ is replaced by $f_g(x, \hat{\theta}_g)$.

The data shown are similar to some astronomical data from the Sloan Digital Sky Survey (SDSS). The data are two measures of absolute brightness of a large sample of celestial objects. An astronomer had asked for our help in analyzing the data. (The data in the figure are not the original data; that dataset was massive, and would not show well in a simple plot.) The astronomer wanted to fit a regression of one measure on the other.

(Figure: plot of the two brightness measures.)

We could fit some model to the data, of course, but the question is what kind of model. Four possibilities are:

  - a straight line;
  - a curved line (polynomial? exponential?);
  - segmented straight lines;
  - overlapping functions.


Objectives

As in any data analysis, we must identify and focus on the objective. If the objective is prediction of one variable given another, some kind of single model would be desirable. Taking a broader view of the problem, however, we see that there is something more fundamental going on. It is clear that if we are to have any kind of effective regression model, we need another independent variable. We might ask whether there are groups of different types of objects, as suggested by the different models for different subsets of the data. We could perhaps cluster the data based on model fits. Then, if we really want a single regression model, a cluster-identifier variable could allow us to have one.

Clusters

We can take a purely data-driven approach to defining clusters. From this standpoint, clusters are clusters because

  - the elements within a cluster are closer to one another, or they are dense;
  - the elements within a cluster follow a common distribution; or
  - the variables (attributes) in all elements of a cluster have similar relationships among each other.

In an extension of the data-driven approach, we may identify clusters based on some relationship among the variables. The relationship is expressed as a model, perhaps a linear regression model. In this sense, the clusters are conceptual clusters: the clusters are clusters because a common model fits their elements.


Issues in Clustering

Although we may define a clustering problem in terms of a finite mixture distribution, clustering problems are often not built on a probability model. The clustering problem is usually defined in terms of an objective function to minimize, or in terms of the algorithm that solves the problem. In most mixture problems we have an issue of identifiability: the meanings of the group labels cannot be determined from the data, so any solution can be unique only up to permutations of the labels. Another type of identifiability problem arises if the groups are not distinct (or, in practice, sufficiently distinct); this is similar to an over-parameterized model.

Clusters of Models

In regression modeling, we treat one variable as special and treat the other variables as covariates; that is, in addition to the variable of interest y, there is an associated variable x, which is the vector of all other relevant variables. (The variable of interest may also be vector-valued, of course.) The regression models have the general form $y = f(x, \theta) + \epsilon$. To allow the models to be different in different clusters, we denote the systematic component of the model in the $g$th group as $f_g(x, \theta_g)$. This notation allows retention of the original labels of the dataset.

Approaches

There are essentially two ways of approaching the problem. They arise from slightly different considerations of why clusters are clusters. These are based on combining the notion of similar relationships among the variables either with the property of a common probability distribution, or else with the property of closeness or density of the elements. If we assume a common probability distribution for the random component of the models, we can write a likelihood, conditional on knowing the class of each observation. From the standpoint of clusters defined by closeness, we have an objective function that involves norms of residuals.

Clustering Based on a Probability Distribution Model

If the number of clusters is fixed to be k, say, and if the data in each cluster are considered to be a random sample from a given family of probability distributions, we can formulate the clustering problem as a maximum likelihood problem. For a mixture of k distributions, if the PDF of the $j$th distribution is $p_j(x; \theta_j)$, the PDF of the mixture is
$$p(x; \theta) = \sum_{j=1}^{k} \pi_j \, p_j(x; \theta_j),$$
where $\pi_j \ge 0$ and $\sum_{j=1}^{k} \pi_j = 1$. The k-vector $\pi$ contains the unconditional probabilities of the components for a random variable from the mixture.

EM Methods

If we consider each observation to have an additional variable that is not observed, we are led to the classic EM formulation of the mixture problem. We define k 0-1 dummy variables to indicate the group to which an observation belongs. These dummy variables are the missing data in the EM formulation of the mixture problem. The complete data in each observation are $C = (Y, U, x)$, where Y is an observed random variable, U is an unobserved random variable, and x is a (vector-valued) observed covariate.

The E-step yields conditional expectations of the dummy variables. For each observation, the conditional expectation of a given dummy variable can be interpreted as the provisional probability that the observation is from the population represented by that dummy variable. The M-step yields an optimal fit of the model in each group, using the group inclusion probabilities as weights.

Classification Variables

The conditional expectations of the 0-1 dummy variables can be viewed as probabilities that each observation is in the group represented by the dummy variable. There are two possible ways of treating these classification variables. One way is to use the values as weights in fitting the model at each step; this usually results in less variability across the EM steps. The other way, used when a weighted fit is not practical in the M-step, or at the conclusion of the EM computations, is to assign each observation to a single group. If each observation is assigned to the group corresponding to the dummy variable with the largest associated conditional expectation, we can view this as maximizing a classification likelihood (see Fraley and Raftery, 2002).

We could also use the conditional expectations of the dummy variables as probabilities for a random assignment of each observation to a single group if a weighted fit is not practical.

Fuzzy Membership

Interpreting the conditional expected values of the classification variables as probabilities naturally leads to the idea of fuzzy group membership. In the case of only two groups, we may separate the observations into three sets: two sets corresponding to the two groups, and one set that is not classified. This would be based on some threshold value $\alpha > 0.5$. If the conditional expected value of a classification variable is greater than $\alpha$, the observation is put in the cluster corresponding to that variable; otherwise, the observation is not put in either cluster.
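
A small R sketch of this rule (my own illustration), applied to a vector of posterior probabilities for group 1 in a two-group problem; the probabilities and the threshold are made up:

  p1 <- c(0.95, 0.60, 0.10, 0.48, 0.85)   # posterior probability of group 1
  alpha <- 0.8                            # threshold, alpha > 0.5
  membership <- ifelse(p1 > alpha, 1,
                       ifelse(1 - p1 > alpha, 2, NA))  # NA: left unclassified
  membership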


In the case of more than two clusters, the interpretation of the classification variables can be extended to represent likely membership in some given cluster, in two given clusters, or in some combination of any number of clusters. If the likely cluster membership is dispersed among more than two or three clusters, however, it is probably best just to leave that observation unclustered. There are other situations, such as with outliers from all models, in which it may be best to leave an observation unclustered.

Issues with an EM Method

The EM method is based on a rather strong model assumption, so that a likelihood can be formulated. We can take a more heuristic approach, however, and merely view the M-step as model fitting using any reasonable objective function. Instead of maximizing an identified likelihood, we could perform a model fit by minimizing some norm of the residuals, whether or not this corresponds to maximization of a likelihood.

There are other problems that often occur in the use of EM methods. A common one is that the method may be very slow to converge. Another major problem in applications such as mixtures is that there are local optima. This particular problem has nothing to do with EM per se, but rather with any method we may use to solve the problem. Whenever local optima may be present, there are two standard ways of addressing the problem: one is to use multiple starting points, and the other is to allow an iteration to go in a suboptimal direction. The only one of these approaches that would be applicable in model-based clustering is the use of multiple starting points. We did not explore this approach in the present research.

Regression Clustering

In regression clustering, we assume a model of the form $y = f_g(x, \theta_g) + \epsilon_g$ for observations y and x in the $g$th group. Usually we assume linear models of the form $y = x^T \beta_g + \epsilon_g$, with $\epsilon_g \sim N(0, \sigma_g^2)$ and the observations mutually independent. The distribution of the error term allows us to formulate a likelihood, and this provides the quantities necessary for the EM method.

EM Methods

While an EM method is relatively easy to program, the R package flexmix, developed by Leisch (2004), provides a simple interface for an EM method for various kinds of regression models. In our experience with the EM methods as implemented in this package, we rarely had problems with slow convergence in the clustering applications. We also did not find that they were particularly sensitive to the starting values (see Li and Gentle, 2007). The M-step is viewed as a fitting step, and the structure of the package makes it a relatively simple matter to use a different fitting method, such as constrained or penalized regression.

Models with Many Covariates

Models with many covariates are more interesting. In such cases, however, it is likely that different sets of covariates are appropriate for different groups. Use of all covariates would lead to overparameterized models, and hence to fits with larger variance. While this may still result in an effective clustering, it would seriously degrade the performance of any classification scheme based on the fits.

Variable Selection within the Groups

As a practical matter, it is generally convenient to fit a model of the same form and with the same covariates within each group. A slight modification is to select different covariates within each group. Under the usual setup, with models of the form $y = x^T \beta_g + \epsilon_g$, this has no effect on the formulation of the likelihood, but it does introduce the additional step of variable selection in the M-step. Although models with different sets of independent variables can be incorporated in the likelihood, the additional step of variable selection can present problems of computational convergence, as well as major analytic problems. For variable selection in regression clustering, we need a procedure that is automatic.

Penalized Likelihood for Variable Selection within the Groups

A lasso fit for variable selection can be inserted naturally in the M-step of the EM method; that is, instead of the usual least squares fit, which corresponds to maximum likelihood in the case of a known model with normally distributed errors, we minimize
$$\sum_i u_{ig} \left(y_i - x_i^T b_g\right)^2 + \lambda \| b_g \|_1.$$
We could interpret this as maximizing a penalized likelihood. Rather than viewing the M-step as part of a procedure to maximize a likelihood, we can view it as a step to fit a model using any reasonable criterion. This, of course, changes the basic approach to finding groups in data based on regression models so that it is no longer based on maximum likelihood estimation, but the same upper-level computational methods can be used.
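
As one way to carry out this penalized M-step in practice, here is a sketch of my own using the glmnet package rather than lasso2; the objects X, y, u, and lambda are placeholders. The group-inclusion probabilities are passed as observation weights.

  library(glmnet)

  # Weighted lasso fit for group g: minimizes (up to glmnet's internal scaling)
  #   sum_i u[i, g] * (y[i] - x_i' b_g)^2 + lambda * ||b_g||_1
  lasso_mstep <- function(X, y, w, lambda) {
    fit <- glmnet(X, y, family = "gaussian", weights = w,
                  lambda = lambda, standardize = TRUE)
    coef(fit)   # sparse coefficient vector; some entries exactly zero
  }

  # usage (placeholders): b_g <- lasso_mstep(X, y, u[, g], lambda = 0.1)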

Alternatively, a lars approach coupled with use of the L-curve could be used to determine a fit; either way, the lasso fit often yields models with some fitted coefficients exactly 0. The use of the lasso of course biases the estimators of the selected variables downward. The overall statistical properties of the variable selection procedure are not fully understood; lasso fitting seems useful within the EM iterations, however. At the end, the variables selected within the individual groups can be fitted by regular, that is, nonpenalized, least squares. Even while using an EM method, the approach is now based on a heuristic notion of good fits of individual models and on clustering the observations according to best fits.

Clustering Based on Closeness

The idea of forming clusters based on model fits leads us to the general idea of clustering based on closeness to a model center. Elements within a cluster are close to each other. If we define a dissimilarity of a given element from some overall summary of a given cluster, the clustering problem is to minimize the dissimilarities within clusters. In some cases it is useful to consider fuzzy cluster membership, but in the following we will assume that each observation is in exactly one cluster. We denote the dissimilarity of the observation $y_i$ from the other members of the $g$th cluster as $d_g(y_i)$, and we define $d_g(y_i) = 0$ if $y_i$ is not in the $g$th cluster.

A given clustering is effectively a partitioning of the dataset. We denote the partition by P, which is a collection of disjoint sets of indices whose union is the set of all indices in the sample, $P = \{P_1, \ldots, P_k\}$. The sum of the discrepancies,
$$f(P) = \sum_{g=1}^{k}\sum_{i=1}^{n} d_g(y_i),$$
is a function of the clustering. For a fixed number of clusters, k, this is the objective function to be minimized with respect to the partition P. This of course is the basic idea of k-means clustering.

In any kind of clustering method, singleton clusters need special consideration. Such clusters may be more properly considered as outliers, and their numbers do not contribute to the total count of the number of clusters, k. The number of clusters is itself an important characteristic of the problem. In some cases our knowledge of the application may lead to a known number of clusters, or at least it may lead to an appropriate choice of k. Depending on how we define the discrepancies, and how we measure the discrepancy attributable to a singleton cluster, we could incorporate the choice of k into the objective function.

Clusters of Models

For y in the $g$th group, the discrepancy is a function of the observed y and its predicted or fitted value,
$$d_g(y_i) = h_g(y_i, f_g(x_i, \theta_g)),$$
where $h_g(y_i, \cdot) = 0$ if $y_i$ is not in the $g$th cluster. In many cases, $h_g(y_i, f_g(x_i, \theta_g)) = h_g(y_i - f_g(x_i, \theta_g))$; that is, the discrepancy is a function of the difference between the observed y and its fitted value.

Measures of Dissimilarity

The measure of dissimilarity is a measure of the distance of a given observation to the center of the group of which it is a member. There are two aspects to measures of dissimilarity:

  - the type of center (mean, median, harmonic mean); this is the $f_g(x_i, \theta_g)$ above;
  - the type of distance measure; this is the $h_g(y_i - f_g(x_i, \theta_g))$ above.

The type of center, for example whether it is based on a least squares criterion such as a mean or on a least absolute values criterion such as a median, affects the robustness of the clustering procedure. Zhang and Hsu (1999) showed that if harmonic means are used instead of means in k-means clustering, the clusters are less sensitive to the starting values. Zhang (2003) used a harmonic average for the regression clustering problem; that is, instead of using the within-groups residual norms, he used a harmonic mean of the within-groups residuals. The insensitivity of a harmonic average to outlying values may cause problems when the groups are not tightly clustered within the model predictions. Nevertheless, this approach seems promising, but more studies under different configurations are needed.

The type of distance measure is usually a norm of the coordinate differences of a given observation and the center. Most often this is an $L_p$ norm: $L_1$, $L_2$, or $L_\infty$. It may seem natural that the distance of an observation to the center be based on the same type of measure as the measure used to define the center, but this is not necessary.

K-Means Type Methods

In k-means clustering, the objective function is
$$f(P) = \sum_{g=1}^{k} \sum_{i \in P_g} \| y_i - \bar{y}_g \|^2,$$
where $\bar{y}_g$ is the mean of the observations in the $g$th group. In K-models clustering, $\bar{y}_g$ is replaced by $f_g(x, \hat{\theta}_g)$. When the model predictions are used as the centers, the method is the same as the substitution method used in a univariate k-means clustering algorithm.
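
A rough sketch (my own, not the algorithm referred to below) of a simple alternating scheme for K-models clustering with straight-line models: assign each observation to the model with the smallest squared residual, then refit each model on its assigned observations.

  kmodels <- function(y, x, k = 2, iters = 20) {
    n <- length(y)
    grp <- sample(1:k, n, replace = TRUE)   # random initial assignment
    for (it in 1:iters) {
      # fit a straight-line model within each current group
      # (no handling of empty groups; this is only a sketch)
      coefs <- lapply(1:k, function(g) coef(lm(y[grp == g] ~ x[grp == g])))
      # squared residual of every observation under every group's fitted line
      res2 <- sapply(1:k, function(g) (y - (coefs[[g]][1] + coefs[[g]][2] * x))^2)
      grp <- apply(res2, 1, which.min)      # reassign each point to its best model
    }
    list(group = grp, coef = coefs)
  }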

K-means clustering is a combinatorial problem, and the methods are computationally complex. The most efficient methods currently are based on simulated annealing with substitution rules. These methods can allow the iterations to escape from local optima. Because of the local optima, however, any algorithm for k-means clustering is likely to be sensitive to the starting values. As in any combinatorial optimization problem, the performance depends on the method of choosing a new trial point and on the cooling schedule. We are currently investigating these steps in a simulated annealing method for regression clustering, but we do not have any useful results yet.

K-Models Clustering Following Clustering of the Covariates

When the covariates have clusters among themselves, a simple clustering method applied only to them may yield good starting values for either an EM method or a k-means method for the regression clustering problem. There may be other types of prior information about the group membership of the individual observations. Any such information, either from group assignments based on clustering of the covariates or from prior assumptions, can be used in the computation of the expected values of the classification variables. Clearly, clustering of the covariates has limited effectiveness. We will be trying to characterize distributional patterns in order to tell when preliminary clustering of the covariates is useful in the regression clustering problem.
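
A short sketch of this initialization idea (my own; whether flexmix accepts initial group assignments through a cluster argument is an assumption to check against its documentation):

  # cluster the covariate alone to get starting group assignments
  start <- kmeans(scale(df$x), centers = 2, nstart = 25)$cluster

  # use the assignments as starting values for the mixture-of-regressions EM
  # (the 'cluster' argument is assumed here; consult help(flexmix))
  library(flexmix)
  fit <- flexmix(y ~ x, data = df, k = 2, cluster = start)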


More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

More information

Latent Variable Models and Expectation Maximization

Latent Variable Models and Expectation Maximization Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information