Scribe to lecture Tuesday March 16, 2004

Scribe outline:
- Message
- Confidence intervals
- Central limit theorem
- EM algorithm
- Bayesian versus classical statistics

Note: There is no scribe for the beginning of the lecture Thursday March 19. The missing part of the lecture focuses on the EM algorithm and clustering.

Message

Left to do are Support Vector Machines, reinforcement learning and inductive logic programming. We don't have time to do both SVM and ILP, so we have to choose between:

1. Skip support vector machines. (12 votes)
2. SVM-light: spend one class on it. (1 vote)
3. Spend two classes on SVM. (4 votes)

Confidence intervals

Our goal is to estimate a parameter value, e.g. the mean µ, given the variance σ. By identifying a constant c you can find an interval such that the probability of getting a data point d within the interval (µ − c, µ + c) is, for example, 95%:

P(µ − c < d < µ + c) = 95%.

Used the other way around, you can find an interval to which µ belongs with 95% confidence.

Procedure:
1. Observe a data point d.
2. Output: µ ∈ (d − c, d + c) with confidence 95%.

This works because the two events are the same: µ − c < d means µ < d + c, and µ + c > d means µ > d − c, so together d − c < µ < d + c.

So what does this mean? No matter what the true value of µ is, the probability that µ lies in the output interval is 95%. For example, P(−0.5 < µ < 0.5) = 95%. The data point d is a random variable. Note that µ isn't a random variable; it is fixed but unknown. This is why it is called a confidence interval and not a probability interval.

To see the difference between the unknown parameter µ and the random variable d, think of the following. Assign all people in the class to a drug experiment. You would have a high probability of getting as many women as men in a group (not bad for a computing class!). This corresponds to the mean µ. But this is different from looking at an actual group and seeing whether there are as many men as women. That would correspond to a sample d.
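To make the coverage claim concrete, here is a small Python sketch (not from the lecture): it simulates the procedure for a normal distribution with known σ, where c = 1.96σ is the usual 95% constant; the particular values of µ and σ are made up.

```python
import random

# Sketch: check that the interval (d - c, d + c) covers the fixed but unknown
# mean mu about 95% of the time, no matter what mu is.
# Here d ~ N(mu, sigma) and c = 1.96 * sigma (the 95% constant for a normal).
mu, sigma = 0.3, 1.0        # arbitrary "true" values for the simulation
c = 1.96 * sigma
trials = 100_000

covered = 0
for _ in range(trials):
    d = random.gauss(mu, sigma)   # observe one data point
    if d - c < mu < d + c:        # did the output interval contain mu?
        covered += 1

print(f"coverage: {covered / trials:.3f}")   # roughly 0.95
```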

Confidence interval for a uniform distribution

Take the parameter θ to be fixed but unknown. Let D be a uniform distribution over the interval (θ − 0.5, θ + 0.5). (In a uniform distribution every point has the same probability of being selected. The graph is box shaped, compared to the bell-shaped normal distribution where the probability is higher near the mean.) Let d be a sample from the interval. The probability of getting a value for d within an interval of length 0.8 centered around θ is 80%, since (θ − 0.4, θ + 0.4) is 80% of the total interval (θ − 0.5, θ + 0.5):

P(d ∈ (θ − 0.4, θ + 0.4)) = 0.8.

For this distribution, use c = c_0.8 = 0.4 for an 80% confidence interval.

You can also use this knowledge to go the other way around and draw conclusions about θ based on the value of the sample d you get. If d = 0.7 this lets you conclude that θ ∈ (0.7 − 0.4, 0.7 + 0.4) = (0.3, 1.1) with 80% confidence, or θ ∈ (0.7 − 0.25, 0.7 + 0.25) = (0.45, 0.95) with 50% confidence (c_0.5 = 0.25).

Now we take two random samples, d1 = 0.3 and d2 = 0.9. Looking at the data points we can reach some conclusions about the value of θ. The first sample gives us θ ∈ (0.3 − 0.5, 0.3 + 0.5) = (−0.2, 0.8) with 100% confidence and the second gives us θ ∈ (0.9 − 0.5, 0.9 + 0.5) = (0.4, 1.4) with 100% confidence. Together we have that θ ∈ (0.4, 0.8) with 100% confidence.

Another way of doing this, without having to look at the data points, is to use the procedure

P(max(d1, d2) − c_a < θ < min(d1, d2) + c_a) ≥ a:

P(max(d1, d2) − c_0.8 < θ < min(d1, d2) + c_0.8) = P(0.9 − 0.4 < θ < 0.3 + 0.4) = P(0.5 < θ < 0.7) ≥ 80%
P(max(d1, d2) − c_1.0 < θ < min(d1, d2) + c_1.0) = P(0.9 − 0.5 < θ < 0.3 + 0.5) = P(0.4 < θ < 0.8) = 100%
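As a small illustrative sketch (not part of the lecture notes), the intervals from the two-sample example can be computed directly; the helper function interval below is hypothetical.

```python
# Sketch: intervals for theta when samples come from Uniform(theta - 0.5, theta + 0.5),
# using the two observed values from the example above (d1 = 0.3, d2 = 0.9).

def interval(samples, c):
    """Interval (max(d) - c, min(d) + c) output by the procedure for constant c."""
    return (max(samples) - c, min(samples) + c)

d = [0.3, 0.9]
print([round(x, 2) for x in interval(d, 0.5)])   # c_1.0 = 0.5 -> [0.4, 0.8] (100% confidence)
print([round(x, 2) for x in interval(d, 0.4)])   # c_0.8 = 0.4 -> [0.5, 0.7]
```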

Confidence intervals versus posterior

A confidence interval is not the same as a posterior probability. From a Bayesian point of view, given the observation Y = y, we may be able to give a much higher probability that the desired mean µ lies in the confidence interval. For all sorts of priors P(Θ = θ) you can compute P(d1, d2 | Θ = θ), where d1 and d2 are the values of two random samples. Using Bayes' formula you can calculate the posterior P(Θ = θ | d1, d2) and derive an interval from it. In this way Bayesian probability is close to classical confidence intervals.

Confidence interval for other distributions

Confidence intervals can be used to evaluate how good a learned hypothesis is. You can do this by letting the hypothesis classify n random instances. The number of errors committed, r, will follow a binomial distribution. If p is the probability that the hypothesis misclassifies an instance, then E[x] = np and σ_D = √(np(1 − p)), where x = x1 + … + xn counts the errors. The actual number of misclassifications on a test set with n instances is r. Use r/n as an estimator of p and r as an estimator of the mean E[x]. Dividing by the number of samples gives the standard deviation of the error-rate estimate: σ_s = σ_D/n = √(p(1 − p)/n).

To get a confidence interval you approximate the binomial distribution by a normal distribution (using the central limit theorem) with µ = sample mean and σ = sample standard deviation. Now it is possible to calculate the confidence interval using N(µ, σ).

Central Limit Theorem

Let X1, X2, …, Xn be n random samples from a distribution and let x = X1 + … + Xn be their sum, so that E[x] = n·E[X1] and Var(x) = n·Var(X1). Let the number of samples n → ∞. The Central Limit Theorem then states that, no matter which distribution you start with, the normal distribution N(E[x], Var(x)) will be an arbitrarily good approximation of the distribution of x.

Depending on the original distribution, the number of samples needed before the normal distribution is a good approximation varies. For a binomial distribution n = 30 is usually enough. This can cause problems when you have large samples: you are expected to get a very precise result, which isn't always the case.
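A hedged sketch of the error-rate confidence interval just described: the counts n and r below are made-up values, and 1.96 is the usual 95% constant for the normal approximation.

```python
from math import sqrt

# Sketch: 95% confidence interval for the true error rate p of a hypothesis,
# based on r misclassifications out of n test instances (hypothetical numbers).
# The binomial distribution is approximated by a normal via the central limit theorem.
n = 200   # number of test instances (assumed)
r = 24    # number of misclassifications observed (assumed)

p_hat = r / n                            # estimate of p
sigma_s = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard deviation of r/n
c = 1.96 * sigma_s                       # 95% constant for a normal distribution

print(f"error rate = {p_hat:.3f} +/- {c:.3f}")
print(f"95% interval: ({p_hat - c:.3f}, {p_hat + c:.3f})")
```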

The EM algorithm

The EM algorithm (section 6.12 in Mitchell's Machine Learning) is used to find a maximum likelihood hypothesis: you want to find the parameter setting that makes the observed data as likely as possible. It can be used in many settings where we wish to estimate some set of parameters that describe an underlying probability distribution, and the procedure can deal with unobserved variables. Another advantage is that it is guaranteed to converge to a local maximum of the likelihood if there is one.

You could say that the general idea behind EM is that you pretend to know the parameters of the model and then infer the probability that each data point belongs to each component. After that we refit the components to the data, where each component is fitted to the entire data set with each point weighted by the probability that it belongs to that component.

General statement

For m independent instances let X = {x1, …, xm} denote the observed data and Z = {z1, …, zm} the unobserved data. The full data X ∪ Z is denoted by Y. Our current estimate of the parameters we want to estimate is denoted θ^n. Two main steps make up the EM algorithm: estimation and maximisation. In the estimation step the current hypothesis θ^n together with the data point x is used to estimate the unobserved variables. Next, this estimate together with the data point x is used to perform a maximisation over new hypotheses θ. The best one, θ^(n+1), is chosen to replace the old one.

The EM formula:

θ^(n+1) = argmax_θ Σ_z P(Z = z | x, θ^n) ln P(x, Z = z | θ),

where P(Z = z_ij | x_i, θ^n) = constant · P(x_i | Z = z_ij, θ^n) · P(Z = z_ij | θ^n) (Bayes' theorem).

When Q (the sum above, viewed as a function of θ) is a continuous function, the EM algorithm will converge to a stationary point of the likelihood function P(y | θ). A disadvantage of the EM algorithm is that you can get stuck in a local maximum that is not the global one. The more dimensions you have, the more likely you are to end up in a local maximum and have a dimensional collapse. In the example below fixed and equal variances are assumed; however, the algorithm can deal with different or unknown variances. Theoretically you could run it for an unknown number of distributions, but it helps a lot to specify the number of distributions. If you don't, there is a big risk of overfitting.

Example of the use of the EM algorithm

To illustrate the use of the EM algorithm we look at an example where a sample X is drawn from one of two Gaussian distributions with unknown means and the same variance, N1(µ1, σ) and N2(µ2, σ). Our goal is to find estimates for µ1 and µ2. (In Mitchell's Machine Learning it is assumed that the probability of getting a sample from distribution one is ½, i.e. that P(z_i1 | θ^n) = P(z_i2 | θ^n); this assumption isn't necessary. If you assume that the samples are drawn at random you can skip the term P(Z = z | x, θ^n).)

Using the convention above, X is the observed variable and Z denotes the hidden variables; in this case z_i1 = 1 and z_i2 = 0 if x_i is from distribution one. θ is the set of parameters to be estimated, and our current hypothesis is θ = <µ1, µ2>.

Estimation:

P(Z = z_ij | x_i, θ^n) ∝ P(x_i | Z = z_ij, θ^n) · P(Z = z_ij | θ^n), where
P(x_i, Z = z_ij | θ) = 1/√(2πσ²) · e^(−1/(2σ²) Σ_j z_ij (x_i − µ_j)²)

Maximisation:

θ^(n+1) = argmax_θ Σ_z P(Z = z | x, θ^n) ln P(x, Z = z | θ)

If we now assume that we know that the sample is drawn from distribution one, this expression reduces to the probability of getting a specific sample from an ordinary normal distribution:

P(x_1 | Z = z_11, θ) = P(X = x_1) for X ~ N1(µ1, σ) = 1/√(2πσ²) · e^(−(x_1 − µ_1)²/(2σ²))
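A minimal Python sketch of this two-Gaussian example (not from the lecture); it assumes equal mixing weights and a known common σ, and the data set and starting means are invented for illustration.

```python
import math
import random

# Sketch: EM for a mixture of two Gaussians N(mu1, sigma) and N(mu2, sigma)
# with known, equal sigma and equal mixing weights (1/2 each).
# Only the means mu1 and mu2 are estimated.

def em_two_gaussians(xs, sigma, mu1, mu2, iterations=50):
    for _ in range(iterations):
        # Estimation step: P(point i came from component 1 | x_i, current means)
        w1 = []
        for x in xs:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            w1.append(p1 / (p1 + p2))
        # Maximisation step: refit each mean to all points, weighted by membership
        mu1 = sum(w * x for w, x in zip(w1, xs)) / sum(w1)
        mu2 = sum((1 - w) * x for w, x in zip(w1, xs)) / sum(1 - w for w in w1)
    return mu1, mu2

# Made-up data: half the points around 0, half around 4, sigma = 1
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)] + \
       [random.gauss(4.0, 1.0) for _ in range(100)]

print(em_two_gaussians(data, sigma=1.0, mu1=-1.0, mu2=1.0))   # roughly (0, 4)
```

Each iteration pretends the current means are correct, infers soft memberships, and then refits each mean with every point weighted by its membership, exactly as described above.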

Bayesian versus classical statistics

Pro Bayesian statistics:
- Unified
- Intuitively correct
- Incremental, online
- No maximum likelihood maximisation
- Distribution free

Pro classical statistics:
- Avoids subjective priors (OK with base rates, but no wild guesses)
- Less computationally expensive than Bayesian statistics
- Much knowledge about the distribution

Lina Björnheden