# PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

Save this PDF as:

Size: px
Start display at page:

Download "PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS"

## Transcription

1 PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

2 Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting

3 Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution

4 Binary Variables (2) N coin flips: Binomial Distribution

5 Binomial Distribution

6 Parameter Estimation (1) ML for Bernoulli Given:

7 Parameter Estimation (2) Example: Prediction: all future tosses will land heads up Overfitting to D

8 Beta Distribution Distribution over.

9 Bayesian Bernoulli The Beta distribution provides the conjugate prior for the Bernoulli distribution.

10 Beta Distribution

11 Prior Likelihood = Posterior

12 Properties of the Posterior As the size of the data set, N, increase

13 Prediction under the Posterior What is the probability that the next coin toss will land heads up?

14 Multinomial Variables 1-of-K coding scheme:

15 ML Parameter estimation Given: Ensure, use a Lagrange multiplier,.

16 The Multinomial Distribution

17 The Dirichlet Distribution Conjugate prior for the multinomial distribution.

18 Bayesian Multinomial (1)

19 Bayesian Multinomial (2)

20 The Gaussian Distribution

21 Central Limit Theorem The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. Example: N uniform [0,1] random variables.

22 Geometry of the Multivariate Gaussian

23 Moments of the Multivariate Gaussian (1) thanks to anti-symmetry of z

24 Moments of the Multivariate Gaussian (2)

25 Partitioned Gaussian Distributions

26 Partitioned Conditionals and Marginals

27 Partitioned Conditionals and Marginals

28 Bayes Theorem for Gaussian Variables Given we have where

29 Maximum Likelihood for the Gaussian (1), the log likeli- Given i.i.d. data hood function is given by Sufficient statistics

30 Maximum Likelihood for the Gaussian (2) Set the derivative of the log likelihood function to zero, and solve to obtain Similarly

31 Maximum Likelihood for the Gaussian (3) Under the true distribution Hence define

32 Sequential Estimation Contribution of the N th data point, x N correction given x N correction weight old estimate

33 The Robbins-Monro Algorithm (1) Consider µ and z governed by p(z,µ) and define the regression function Seek µ? such that f(µ? ) = 0.

34 The Robbins-Monro Algorithm (2) Assume we are given samples from p(z,µ), one at the time.

35 The Robbins-Monro Algorithm (3) Successive estimates of µ? are then given by Conditions on a N for convergence :

36 Robbins-Monro for Maximum Likelihood (1) Regarding as a regression function, finding its root is equivalent to finding the maximum likelihood solution µ ML. Thus

37 Robbins-Monro for Maximum Likelihood (2) Example: estimate the mean of a Gaussian. The distribution of z is Gaussian with mean ¹ { ¹ ML. For the Robbins-Monro update equation, a N = ¾ 2 =N.

38 Bayesian Inference for the Gaussian (1) Assume ¾ 2 is known. Given i.i.d. data, the likelihood function for ¹ is given by This has a Gaussian shape as a function of ¹ (but it is not a distribution over ¹).

39 Bayesian Inference for the Gaussian (2) Combined with a Gaussian prior over ¹, this gives the posterior Completing the square over ¹, we see that

40 Bayesian Inference for the Gaussian (3) where Note:

41 Bayesian Inference for the Gaussian (4) Example: for N = 0, 1, 2 and 10.

42 Bayesian Inference for the Gaussian (5) Sequential Estimation The posterior obtained after observing N { 1 data points becomes the prior when we observe the N th data point.

43 Bayesian Inference for the Gaussian (6) Now assume ¹ is known. The likelihood function for = 1/¾ 2 is given by This has a Gamma shape as a function of.

44 Bayesian Inference for the Gaussian (7) The Gamma distribution

45 Bayesian Inference for the Gaussian (8) Now we combine a Gamma prior,, with the likelihood function for to obtain which we recognize as with

46 Bayesian Inference for the Gaussian (9) If both ¹ and are unknown, the joint likelihood function is given by We need a prior with the same functional dependence on ¹ and.

47 Bayesian Inference for the Gaussian (10) The Gaussian-gamma distribution Quadratic in ¹. Linear in. Gamma distribution over. Independent of ¹.

48 Bayesian Inference for the Gaussian (11) The Gaussian-gamma distribution

49 Bayesian Inference for the Gaussian (12) Multivariate conjugate priors ¹ unknown, known: p(¹) Gaussian. unknown, ¹ known: p( ) Wishart, and ¹ unknown: p(¹, ) Gaussian- Wishart,

50 Student s t-distribution where Infinite mixture of Gaussians.

51 Student s t-distribution

52 Student s t-distribution Robustness to outliers: Gaussian vs t-distribution.

53 Student s t-distribution The D-variate case: where. Properties:

54 Periodic variables Examples: calendar time, direction, We require

55 von Mises Distribution (1) This requirement is satisfied by where is the 0 th order modified Bessel function of the 1 st kind.

56 von Mises Distribution (4)

57 Maximum Likelihood for von Mises Given a data set, is given by, the log likelihood function Maximizing with respect to µ 0 we directly obtain Similarly, maximizing with respect to m we get which can be solved numerically for m ML.

58 Mixtures of Gaussians (1) Old Faithful data set Single Gaussian Mixture of two Gaussians

59 Mixtures of Gaussians (2) Combine simple models into a complex model: Component Mixing coefficient K=3

60 Mixtures of Gaussians (3)

61 Mixtures of Gaussians (4) Determining parameters ¹,, and ¼ using maximum log likelihood Log of a sum; no closed form maximum. Solution: use standard, iterative, numeric optimization methods or the expectation maximization algorithm (Chapter 9).

62 The Exponential Family (1) where is the natural parameter and so g( ) can be interpreted as a normalization coefficient.

63 The Exponential Family (2.1) The Bernoulli Distribution Comparing with the general form we see that and so Logistic sigmoid

64 The Exponential Family (2.2) The Bernoulli distribution can hence be written as where

65 The Exponential Family (3.1) The Multinomial Distribution where,, and NOTE: The k parameters are not independent since the corresponding ¹ k must satisfy

66 The Exponential Family (3.2) Let. This leads to and Here the k parameters are independent. Note that and Softmax

67 The Exponential Family (3.3) The Multinomial distribution can then be written as where

68 The Exponential Family (4) The Gaussian Distribution where

69 ML for the Exponential Family (1) From the definition of g( ) we get Thus

70 ML for the Exponential Family (2) Give a data set, function is given by, the likelihood Thus we have Sufficient statistic

71 Conjugate priors For any member of the exponential family, there exists a prior Combining with the likelihood function, we get Prior corresponds to º pseudo-observations with value Â.

72 Noninformative Priors (1) With little or no information available a-priori, we might choose a non-informative prior. discrete, K-nomial : 2[a,b] real and bounded: real and unbounded: improper! A constant prior may no longer be constant after a change of variable; consider p( ) constant and = 2:

73 Noninformative Priors (2) Translation invariant priors. Consider For a corresponding prior over ¹, we have for any A and B. Thus p(¹) = p(¹ { c) and p(¹) must be constant.

74 Noninformative Priors (3) Example: The mean of a Gaussian, ¹ ; the conjugate prior is also a Gaussian, As, this will become constant over ¹.

75 Noninformative Priors (4) Scale invariant priors. Consider and make the change of variable For a corresponding prior over ¾, we have for any A and B. Thus p(¾) / 1/¾ and so this prior is improper too. Note that this corresponds to p(ln ¾) being constant.

76 Noninformative Priors (5) Example: For the variance of a Gaussian, ¾ 2, we have If = 1/¾ 2 and p(¾) / 1/¾, then p( ) / 1/. We know that the conjugate distribution for is the Gamma distribution, A noninformative prior is obtained when a 0 = 0 and b 0 = 0.

77 Nonparametric Methods (1) Parametric distribution models are restricted to specific forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model. Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.

78 Nonparametric Methods (2) Histogram methods partition the data space into distinct bins with widths i and count the number of observations, n i, in each bin. Often, the same width is used for all bins, i =. acts as a smoothing parameter. In a D-dimensional space, using M bins in each dimension will require M D bins!

79 Nonparametric Methods (3) Assume observations drawn from a density p(x) and consider a small region R containing x such that If the volume of R, V, is sufficiently small, p(x) is approximately constant over R and The probability that K out of N observations lie inside R is Bin(KjN,P ) and if N is large Thus V small, yet K>0, therefore N large?

80 Nonparametric Methods (4) Kernel Density Estimation: fix V, estimate K from the data. Let R be a hypercube centred on x and define the kernel function (Parzen window) It follows that and hence

81 Nonparametric Methods (5) To avoid discontinuities in p(x), use a smooth kernel, e.g. a Gaussian Any kernel such that will work. h acts as a smoother.

82 Nonparametric Methods (6) Nearest Neighbour Density Estimation: fix K, estimate V from the data. Consider a hypersphere centred on x and let it grow to a volume, V?, that includes K of the given N data points. Then K acts as a smoother.

83 Nonparametric Methods (7) Nonparametric models (not histograms) requires storing and computing with the entire data set. Parametric models, once fitted, are much more efficient in terms of storage and computation.

84 K-Nearest-Neighbours for Classification (1) Given a data set with N k data points from class C k and, we have and correspondingly Since, Bayes theorem gives

85 K-Nearest-Neighbours for Classification (2) K = 3 K = 1

86 K-Nearest-Neighbours for Classification (3) K acts as a smother For, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

### Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

### INTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP

INTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP Personal Healthcare Revolution Electronic health records (CFH) Personal genomics (DeCode, Navigenics, 23andMe) X-prize: first \$10k human genome technology

### Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

### Bayesian Learning (II)

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

### Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation

Machine Learning CMPT 726 Simon Fraser University Binomial Parameter Estimation Outline Maximum Likelihood Estimation Smoothed Frequencies, Laplace Correction. Bayesian Approach. Conjugate Prior. Uniform

### Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

### Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

### Naïve Bayes classification

Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

### Density Estimation. Seungjin Choi

Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

### The Bayes classifier

The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal

### Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

### Probability Theory for Machine Learning. Chris Cremer September 2015

Probability Theory for Machine Learning Chris Cremer September 2015 Outline Motivation Probability Definitions and Rules Probability Distributions MLE for Gaussian Parameter Estimation MLE and Least Squares

### Bayesian Methods: Naïve Bayes

Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

### Generative Models for Discrete Data

Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers

### Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

### Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

### Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

### STA 414/2104: Machine Learning

STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! h0p://www.cs.toronto.edu/~rsalakhu/ Lecture 2 Linear Least Squares From

### Machine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014

Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,

### NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

### MLE/MAP + Naïve Bayes

10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes Matt Gormley Lecture 19 March 20, 2018 1 Midterm Exam Reminders

### p L yi z n m x N n xi

y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

### Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming

### Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)

Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation

### PROBABILITY DISTRIBUTIONS Nonparametric Methods

120 2. PROBABILITY DISTRIBUTIONS An example of a scale parameter would be the standard deviation cr of a Gaussian distribution, after we have taken account of the location parameter J.L, because (2.240)

### Lecture 2: Simple Classifiers

CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 2: Simple Classifiers Slides based on Rich Zemel s All lecture slides will be available on the course website: www.cs.toronto.edu/~jessebett/csc412

### Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

### Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

### Why Bayesian? Rigorous approach to address statistical estimation problems. The Bayesian philosophy is mature and powerful.

Why Bayesian? Rigorous approach to address statistical estimation problems. The Bayesian philosophy is mature and powerful. Even if you aren t Bayesian, you can define an uninformative prior and everything

### COMS 4771 Regression. Nakul Verma

COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the

### CPSC 540: Machine Learning

CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

### Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM

Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures

### Bayesian linear regression

Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

### COMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma

COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released

### Language as a Stochastic Process

CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any

### Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression

CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer

### A Brief Review of Probability, Bayesian Statistics, and Information Theory

A Brief Review of Probability, Bayesian Statistics, and Information Theory Brendan Frey Electrical and Computer Engineering University of Toronto frey@psi.toronto.edu http://www.psi.toronto.edu A system

### CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2017. Tom M. Mitchell. All rights reserved. *DRAFT OF September 16, 2017* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is

### Bayesian Inference: Posterior Intervals

Bayesian Inference: Posterior Intervals Simple values like the posterior mean E[θ X] and posterior variance var[θ X] can be useful in learning about θ. Quantiles of π(θ X) (especially the posterior median)

### Unsupervised Learning

Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

### Bayesian Analysis for Natural Language Processing Lecture 2

Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion

### BAYESIAN DECISION THEORY. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

BAYESIAN DECISION THEORY Credits 2 Some of these slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Simon Prince, University College London Sergios Theodoridis, University of Athens

### Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

### Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation

Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor

### Basics on Probability. Jingrui He 09/11/2007

Basics on Probability Jingrui He 09/11/2007 Coin Flips You flip a coin Head with probability 0.5 You flip 100 coins How many heads would you expect Coin Flips cont. You flip a coin Head with probability

### Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP

Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP Predic?ve Distribu?on (1) Predict t for new values of x by integra?ng over w: where The Evidence Approxima?on (1) The

### Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

### 6.867 Machine Learning

6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

### Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

### Introduc)on to Bayesian methods (con)nued) - Lecture 16

Introduc)on to Bayesian methods (con)nued) - Lecture 16 David Sontag New York University Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate Outline of lectures Review of

### Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

### L11: Pattern recognition principles

L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

### A Process over all Stationary Covariance Kernels

A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that

### Machine Learning 4771

Machine Learning 4771 Instructor: Tony Jebara Topic 11 Maximum Likelihood as Bayesian Inference Maximum A Posteriori Bayesian Gaussian Estimation Why Maximum Likelihood? So far, assumed max (log) likelihood

### Exponential Families

Exponential Families David M. Blei 1 Introduction We discuss the exponential family, a very flexible family of distributions. Most distributions that you have heard of are in the exponential family. Bernoulli,

### Machine Learning Lecture 5

Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

### Exam 2 Practice Questions, 18.05, Spring 2014

Exam 2 Practice Questions, 18.05, Spring 2014 Note: This is a set of practice problems for exam 2. The actual exam will be much shorter. Within each section we ve arranged the problems roughly in order

### A Very Brief Summary of Bayesian Inference, and Examples

A Very Brief Summary of Bayesian Inference, and Examples Trinity Term 009 Prof Gesine Reinert Our starting point are data x = x 1, x,, x n, which we view as realisations of random variables X 1, X,, X

### Machine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber

Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to

### Introduction to Machine Learning Midterm Exam Solutions

10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

### The classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know

The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is

### Gentle Introduction to Infinite Gaussian Mixture Modeling

Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for

### Other Noninformative Priors

Other Noninformative Priors Other methods for noninformative priors include Bernardo s reference prior, which seeks a prior that will maximize the discrepancy between the prior and the posterior and minimize

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

### SVMs, Duality and the Kernel Trick

SVMs, Duality and the Kernel Trick Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 26 th, 2007 2005-2007 Carlos Guestrin 1 SVMs reminder 2005-2007 Carlos Guestrin 2 Today

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Kernel Density Estimation, Factor Analysis Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 2: 2 late days to hand it in tonight. Assignment 3: Due Feburary

### Probabilistic Graphical Models

Parameter Estimation December 14, 2015 Overview 1 Motivation 2 3 4 What did we have so far? 1 Representations: how do we model the problem? (directed/undirected). 2 Inference: given a model and partially

### CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

### Lecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1

Lecture 2 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

### Mixture Models and EM

Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering

### Introduction: exponential family, conjugacy, and sufficiency (9/2/13)

STA56: Probabilistic machine learning Introduction: exponential family, conjugacy, and sufficiency 9/2/3 Lecturer: Barbara Engelhardt Scribes: Melissa Dalis, Abhinandan Nath, Abhishek Dubey, Xin Zhou Review

### Using Probability to do Statistics.

Al Nosedal. University of Toronto. November 5, 2015 Milk and honey and hemoglobin Animal experiments suggested that honey in a diet might raise hemoglobin level. A researcher designed a study involving

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

### Some Probability and Statistics

Some Probability and Statistics David M. Blei COS424 Princeton University February 13, 2012 Card problem There are three cards Red/Red Red/Black Black/Black I go through the following process. Close my

### Naïve Bayes. Vibhav Gogate The University of Texas at Dallas

Naïve Bayes Vibhav Gogate The University of Texas at Dallas Supervised Learning of Classifiers Find f Given: Training set {(x i, y i ) i = 1 n} Find: A good approximation to f : X Y Examples: what are

### Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

### 7 Gaussian Discriminant Analysis (including QDA and LDA)

36 Jonathan Richard Shewchuk 7 Gaussian Discriminant Analysis (including QDA and LDA) GAUSSIAN DISCRIMINANT ANALYSIS Fundamental assumption: each class comes from normal distribution (Gaussian). X N(µ,

### PATTERN CLASSIFICATION

PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

### Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics The candidates for the research course in Statistics will have to take two shortanswer type tests

### Expectation-Maximization (EM) algorithm

I529: Machine Learning in Bioinformatics (Spring 2017) Expectation-Maximization (EM) algorithm Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2017 Contents Introduce

### Some Probability and Statistics

Some Probability and Statistics David M. Blei COS424 Princeton University February 12, 2007 D. Blei ProbStat 01 1 / 42 Who wants to scribe? D. Blei ProbStat 01 2 / 42 Random variable Probability is about

### More Spectral Clustering and an Introduction to Conjugacy

CS8B/Stat4B: Advanced Topics in Learning & Decision Making More Spectral Clustering and an Introduction to Conjugacy Lecturer: Michael I. Jordan Scribe: Marco Barreno Monday, April 5, 004. Back to spectral

### Machine Learning 2017

Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

### Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

### Introduction to Machine Learning

Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574

### Overview of Course. Nevin L. Zhang (HKUST) Bayesian Networks Fall / 58

Overview of Course So far, we have studied The concept of Bayesian network Independence and Separation in Bayesian networks Inference in Bayesian networks The rest of the course: Data analysis using Bayesian

### Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

### Bayesian Inference for Normal Mean

Al Nosedal. University of Toronto. November 18, 2015 Likelihood of Single Observation The conditional observation distribution of y µ is Normal with mean µ and variance σ 2, which is known. Its density

### A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I kevin small & byron wallace today a review of probability random variables, maximum likelihood, etc. crucial for clinical

### Linear Dynamical Systems

Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### Bayesian Inference in a Normal Population

Bayesian Inference in a Normal Population September 17, 2008 Gill Chapter 3. Sections 1-4, 7-8 Bayesian Inference in a Normal Population p.1/18 Normal Model IID observations Y = (Y 1,Y 2,...Y n ) Y i N(µ,σ

### Testing Statistical Hypotheses

E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions

### COM336: Neural Computing

COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

### Machine Learning Lecture 2

Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen

### Linear Models for Classification

Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

### Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,