9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering


Allyson Davis

Modeling data
Nicole Beckage

Types of learning
Supervised: we know the inputs and the targets. The goal is to learn a model that, given input data, accurately predicts the target data.
Unsupervised: we know only the inputs and want to make generalizations.

Supervised learning: Classification
Map inputs $x$ to outputs $y$, where $y \in \{1, \dots, C\}$ and $C$ is the number of classes.
Binary classification is $C = 2$. Multinomial classification is $C > 2$.
Multi-label classification: classes are not mutually exclusive.
Probabilistic interpretation: instead of returning a class assignment, return the probability (certainty) of each class label.

Supervised learning: Regression
Like classification, except we now have continuous response variables. The inputs $x$ and outputs $y$ are real-valued; estimate a function $g(x) = y$ such that $y \approx f(x) + \varepsilon$.

Unsupervised learning: Clustering
Estimate which cluster each data point belongs to by looking for patterns in the input data.
Let $K$ denote the number of clusters. We need to infer the distribution over the number of clusters, $\Pr(K \mid D)$.
How many clusters? Usually we assume $K^* = \arg\max_K \Pr(K \mid D)$; then we need to estimate the cluster of each data point:
$$z_i^* = \arg\max_k \Pr(z_i = k \mid x_i, D, K^*)$$

Unsupervised learning: Latent Factors
In unsupervised learning we often use high-dimensional data (e.g. images, text).
We often consider dimensionality reduction as a means to capture the essence of the data.
What features are meaningful for distinguishing among images or documents?
Can we discover a low-dimensional space capable of explaining the data nearly as well?
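The cluster-assignment rule above, picking the most probable cluster for each point, can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which clustering model is used, so the sketch assumes a one-dimensional Gaussian mixture with $K = 2$ and hand-picked (hypothetical) weights, means, and standard deviations rather than fitted ones.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, stds):
    """Pr(z = k | x) for each cluster k, via Bayes' theorem."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

def assign_cluster(x, weights, means, stds):
    """Index of the cluster with the highest posterior probability."""
    probs = responsibilities(x, weights, means, stds)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical mixture parameters (assumed, not taken from the slides).
weights = [0.5, 0.5]
means = [0.0, 5.0]
stds = [1.0, 1.0]

points = [-0.3, 0.8, 4.1, 6.2]
assignments = [assign_cluster(x, weights, means, stds) for x in points]
print(assignments)
```

In practice the mixture parameters would be estimated from the data; they are fixed here to keep the assignment step visible.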

Other types of unsupervised learning
Discovering graph/relational structure: graphical models, network analysis.
Matrix completion: image imputation (fill in holes/occlusions of images), collaborative filtering (movie prediction example), market basket analysis (collaborative filtering with no missing data).

predictions: an individual's job, main object in an image, event. models: logistic regression, neural network classification, SVM.
data: topic of a newspaper article, similarity. models: k-means, topic modeling.
predictions: income, number of papers published. models: linear regression, regression trees.
data: NLP, dynamical systems. models: kernel methods, process models, Bayesian non-parametrics.

Discriminative vs. Generative Models

Discriminative models
Given a supervised learning problem, categorize or predict the outcome.
Model the dependence of our unobserved variable $y$ on our observed input $X$.
Model the observations from a conditional probability distribution (e.g. you know $X$; now what is your expectation for $y$?).
Here we are explicitly modeling $p(y \mid X)$.
Sometimes called descriptive models.

Generative models
Model how the data was generated in order to categorize a signal: which category is most likely to have generated this signal?
Specify a joint probability distribution over observations and label sequences.
Model observations from the joint probability density function. This is a full probabilistic model of all variables.
Model $p(X, y)$ directly, then use Bayes' theorem to compute $p(y \mid X)$.

Bayes' Theorem
Describes the probability of an event based on prior knowledge of the conditions that might be related to the event. It is a means of allowing new evidence to update beliefs:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Generalizes to
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)}$$

An example
Consider a dataset $D$ of four $(x, y)$ pairs. Our discriminative model estimates $p(y \mid x)$, so the resulting model says, given a value of $x$, what we think $y$ will be.
[Table: estimated $p(y \mid x)$, one row per value of $x$ and one column per value of $y$]
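To make the difference between the two quantities concrete, here is a minimal sketch that estimates both the joint $p(x, y)$ and the conditional $p(y \mid x)$ from counts. The four data points are hypothetical, chosen only for illustration; they are an assumption and not the values from the slide's example. The conditional is obtained from the joint by dividing by the marginal $p(x)$, which is the same step a generative model takes when it applies Bayes' theorem.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical dataset of (x, y) pairs (assumed for illustration only).
data = [(1, 0), (1, 0), (2, 0), (2, 1)]

def joint_distribution(pairs):
    """Empirical joint p(x, y): what a generative model estimates."""
    counts = Counter(pairs)
    n = len(pairs)
    return {pair: Fraction(c, n) for pair, c in counts.items()}

def conditional_distribution(pairs):
    """Empirical conditional p(y | x): what a discriminative model estimates."""
    joint = joint_distribution(pairs)
    marginal_x = {}
    for (x, _y), p in joint.items():
        marginal_x[x] = marginal_x.get(x, Fraction(0)) + p
    # p(y | x) = p(x, y) / p(x)
    return {(x, y): p / marginal_x[x] for (x, y), p in joint.items()}

joint = joint_distribution(data)
conditional = conditional_distribution(data)

for (x, y), p in sorted(joint.items()):
    print(f"p(x={x}, y={y}) = {p}   p(y={y} | x={x}) = {conditional[(x, y)]}")
```

Pairs that never occur in the data are simply absent from both dictionaries, which corresponds to an estimated probability of zero.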

An example
Consider a dataset $D$ of four $(x, y)$ pairs. Our generative model estimates $p(x, y)$, so the resulting model gives the likelihood that we will see a particular point $(x, y)$.
[Table: estimated $p(x, y)$, one row per value of $x$ and one column per value of $y$]

Tradeoff between descriptive and generative
Descriptive (discriminative) is usually better in big-data situations: it is learning exactly what we are trying to predict.
Generative is better in small-data situations, or where we know more about the relationship between the input and output.
Generally speaking: "discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster."

Example ML algorithms
Descriptive models: logistic regression, linear regression, support vector machines, boosting (hopefully), neural networks.
Generative models: Naïve Bayes, Gaussian mixture models, hidden Markov models, linear discriminant analysis, restricted Boltzmann machines.

Does it matter?
Generative models are the focus for small data sets.
Discriminative models fit easily into CTF.

What's the big deal?
Discriminative models usually rely on frequentist statistics. Generative models usually rely on Bayesian statistics. But they are both based in statistics, so they are both correct, right?

At war: Frequentist vs Bayesian
Frequentist: probabilities represent long-run frequencies of events.
Bayesian: probability is used to quantify our uncertainty about something.
Why the war? Quantification of uncertainty, and rare events (e.g. what is the probability that the polar ice cap will melt by a given year?).

Frequentists in a nutshell
The goal is to estimate the parameters given observations.
Maximize the likelihood of the observations (the maximum likelihood estimate, MLE). This maximizes agreement between the model and the specific observations.
Well defined and analytically tractable for many well-known distributions.
The likelihood is the joint density function over the observations. For independent events it is a product: each event has some probability, and we want to find the parameters that maximize the product of those probabilities.
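Maximizing the product of per-event probabilities can be sketched for the simplest case, a sequence of coin flips. The Bernoulli model, the flip data, and the grid search are assumptions made for illustration; the slides do not name a specific distribution. The sketch works with the log-likelihood, which has the same maximizer as the product and avoids numerical underflow.

```python
import math

def bernoulli_log_likelihood(theta, flips):
    """Log of the product of per-flip probabilities under parameter theta."""
    return sum(math.log(theta) if f == 1 else math.log(1.0 - theta) for f in flips)

def mle_closed_form(flips):
    """For a Bernoulli model the MLE is the observed fraction of ones."""
    return sum(flips) / len(flips)

def mle_grid_search(flips, steps=999):
    """Numerical check: try theta on an evenly spaced grid inside (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda theta: bernoulli_log_likelihood(theta, flips))

# Hypothetical observations: 1 = heads, 0 = tails (7 heads out of 10).
flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

print(mle_closed_form(flips))
print(mle_grid_search(flips))
```

The closed-form answer and the grid search agree here because the Bernoulli log-likelihood has a single peak; for models without a closed form, a numerical optimizer would take the place of the grid.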

Frequentists in a nutshell
The likelihood function is easy to write down and easy for statisticians to study. We get lots of nice properties, such as:
Consistency: the MLE converges in probability to the true parameter value.
Asymptotic normality: as the sample size $n$ increases, the distribution of the MLE approaches a Gaussian (normal).
Efficiency: no consistent estimator has a lower asymptotic error.
The central limit theorem is a big help in this space.
Most statistical tests (e.g. t-test, F-test, chi-squared) are frequentist tests.

An example: the advantage of Bayesian statistics
I give you a coin from a country that you know nothing about. I ask you how often you think side A will come up compared to side B. In frequentist land, with no observations, you have no way of giving me any estimate. But what would you guess?

Another example
You observe the number 6. How probable is it that the next number could be 6? How probable is it that the next number could be 4? The experiment asked for all numbers [,].

Another example
You observe the numbers 6, 8,, 64. How probable is it that the next number could be 6? How probable is it that the next number could be 4? The experiment asked for all numbers [,].
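One way to make the coin example concrete is to compare the maximum likelihood estimate with a Bayesian estimate that starts from a prior. The sketch below assumes a Beta prior on the probability of side A, with a uniform Beta(1, 1) as the default. That choice is an assumption made here for illustration and is not stated in the slides. With no flips at all, the frequentist estimate is undefined, while the Bayesian estimate falls back on the prior.

```python
def mle_estimate(side_a, side_b):
    """Fraction of flips showing side A; undefined (None) with no data."""
    total = side_a + side_b
    return side_a / total if total > 0 else None

def beta_posterior_mean(side_a, side_b, prior_a=1.0, prior_b=1.0):
    """Posterior mean of Pr(side A) under an assumed Beta(prior_a, prior_b) prior."""
    return (prior_a + side_a) / (prior_a + prior_b + side_a + side_b)

# No observations yet: only the Bayesian estimate exists.
print(mle_estimate(0, 0), beta_posterior_mean(0, 0))

# Three flips, all side A: the MLE jumps to certainty, the posterior mean does not.
print(mle_estimate(3, 0), beta_posterior_mean(3, 0))
```

A stronger prior, such as larger values of the hypothetical `prior_a` and `prior_b` parameters, would pull the estimate toward one half for longer before the data dominates.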

Examples: prior, likelihood, posterior
Why can we estimate the next numbers with so little data? We have a prior belief about what values a number might take.
Why does our estimate change so much when we see a few additional numbers? We presume the data is telling us something informative about the underlying hypothesis (likelihood).
How can we talk about our estimates in terms of priors and likelihood? Posteriors integrate both into the estimate.

Prior
Belief about the world: how likely do we think a particular hypothesis is? For example, if I tell you that the first observation is an ACT score, your guess of what value the next observation might take would be different than if I told you the first observation was the age of a house cat. Formally, the prior is denoted $p(h)$.
[Figure: prior, likelihood, and posterior over the hypotheses for data = 6. Hypotheses: even; odd; squares; multiples of 3 through 10; ends in 1 through 9; powers of 2 through 10; all; and two variants of a powers hypothesis.]

Likelihood
We want the data to make sense under our hypothesis. For example, if my hypothesis were odd numbers, observing 6 would violate that assumption. More specifically, for $N$ observations that are all consistent with $h$, the likelihood is
$$p(D \mid h) = \left[\frac{1}{\mathrm{size}(h)}\right]^N = \left[\frac{1}{|h|}\right]^N$$
and it is zero otherwise. This prefers the smallest hypothesis set that accounts for the data (Occam's razor).
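The likelihood above, which favors small hypothesis sets, can be sketched directly. Everything concrete in this example is an assumption made for illustration: the number range 1 to 100, the handful of hypotheses, and the observed values are hypothetical and are not taken from the slides.

```python
def likelihood(data, hypothesis):
    """p(D | h) = (1 / |h|)^N if every observation lies in h, else 0."""
    if not all(x in hypothesis for x in data):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(data)

# Hypothetical number range and hypothesis sets (assumed for illustration).
numbers = range(1, 101)
hypotheses = {
    "even": {n for n in numbers if n % 2 == 0},
    "odd": {n for n in numbers if n % 2 == 1},
    "squares": {n * n for n in range(1, 11)},
    "powers of 2": {2 ** k for k in range(1, 7)},
    "powers of 4": {4 ** k for k in range(1, 4)},
}

for data in ([16], [16, 8, 2, 64]):
    print("data =", data)
    for name, members in hypotheses.items():
        print(f"  {name:12s} {likelihood(data, members):.6g}")
```

With a single observation several hypotheses remain plausible. After a few more observations, the large "even" set is penalized heavily relative to the small "powers of 2" set, even though both are consistent with the data.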

Likelihood and prior
[Figure: prior and likelihood over the hypotheses for data = 6]

Posterior
Remember, our goal in ML is to estimate $p(y \mid x)$, but our ability to estimate it accurately is constrained by our choice of hypotheses. So we can talk about estimating the probability of a specific hypothesis given the data and our beliefs. Formally,
$$p(h \mid D) = \frac{p(D \mid h)\,p(h)}{\sum_{h' \in H} p(D, h')}$$
Note that the denominator of the fraction is just for normalization.
[Figure: posterior over the hypotheses for data = 6]

All together
[Figure: prior, likelihood, and posterior over the hypotheses for data = 6]

Bayesian estimation
What is the best hypothesis to choose? One that maximizes the posterior; this is the mode of the posterior. We define the maximum a posteriori (MAP) estimate over hypotheses as
$$\hat{h}^{MAP} = \arg\max_h \, p(D \mid h)\,p(h) = \arg\max_h \left[ \log p(D \mid h) + \log p(h) \right]$$
The second form follows because the logarithm is monotonic, so it does not change which hypothesis attains the maximum.
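The posterior and the MAP estimate can be sketched together. As before, the number range, the hypotheses, and the observations are hypothetical, and the uniform prior is an assumption made here for simplicity; the slides use their own prior over a larger hypothesis space.

```python
def posterior(data, hypotheses, prior):
    """p(h | D) for every hypothesis, using the (1 / |h|)^N likelihood."""
    unnormalized = {}
    for name, members in hypotheses.items():
        consistent = all(x in members for x in data)
        lik = (1.0 / len(members)) ** len(data) if consistent else 0.0
        unnormalized[name] = lik * prior[name]
    # Normalizing constant; assumes at least one hypothesis fits the data.
    evidence = sum(unnormalized.values())
    return {name: value / evidence for name, value in unnormalized.items()}

def map_estimate(post):
    """Hypothesis with the highest posterior probability (posterior mode)."""
    return max(post, key=post.get)

# Hypothetical hypothesis space over 1..100 with an assumed uniform prior.
numbers = range(1, 101)
hypotheses = {
    "even": {n for n in numbers if n % 2 == 0},
    "odd": {n for n in numbers if n % 2 == 1},
    "squares": {n * n for n in range(1, 11)},
    "powers of 2": {2 ** k for k in range(1, 7)},
    "powers of 4": {4 ** k for k in range(1, 4)},
}
prior = {name: 1.0 / len(hypotheses) for name in hypotheses}

post_one = posterior([16], hypotheses, prior)
post_four = posterior([16, 8, 2, 64], hypotheses, prior)
print(map_estimate(post_one), map_estimate(post_four))
```

Under a uniform prior the MAP estimate coincides with the maximum likelihood hypothesis; a non-uniform prior, for instance one that down-weights unusual hypotheses, would shift the result when data is scarce.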


STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this