Probability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014


(All of the slides in this course have been adapted from previous versions by Charles Sutton, Amos Storkey, David Barber.)

Outline

- What is probability?
- Random Variables (discrete and continuous)
- Expectation
- Joint Distributions
- Marginal Probability
- Conditional Probability
- Chain Rule
- Bayes Rule
- Independence
- Conditional Independence
- Some Probability Distributions (for reference)

Reading: Murphy secs 2.1-2.4

What is probability?

- Quantification of uncertainty
- Frequentist interpretation: long-run frequencies of events
  Example: the probability of a particular coin landing heads up is 0.43
- Bayesian interpretation: quantify our degrees of belief about something
  Example: the probability of it raining tomorrow is 0.3; it is not possible to repeat "tomorrow" many times
- The basic rules of probability are the same, no matter which interpretation is adopted

Random Variables

- A random variable (RV) X denotes a quantity that is subject to variation due to chance
- It may denote the result of an experiment (e.g. flipping a coin) or the measurement of a real-world fluctuating quantity (e.g. temperature)
- Use capital letters to denote random variables and lower-case letters to denote values that they take, e.g. $p(X = x)$
- An RV may be discrete or continuous
- A discrete variable takes on values from a finite or countably infinite set
- Probability mass function $p(X = x)$ for discrete random variables

Examples

- Colour of a car: blue, green, red
- Number of children in a family: 0, 1, 2, 3, 4, 5, 6, > 6
- Toss two coins, let $X = (\text{number of heads})^2$. X can take on the values 0, 1 and 4
- Example: $p(\text{Colour} = \text{red}) = 0.3$
- $\sum_x p(x) = 1$

Continuous RVs

- Continuous RVs take on values that vary continuously within one or more real intervals
- Probability density function (pdf) $p(x)$ for a continuous random variable X, so that
  $p(a \le X \le b) = \int_a^b p(x)\,dx$
  $p(x \le X \le x + \delta x) \approx p(x)\,\delta x$
- $\int p(x)\,dx = 1$ (but values of $p(x)$ can be greater than 1)
- Examples (coming soon): Gaussian, Gamma, Exponential, Beta

Expectation

- Consider a function $f(x)$ mapping from $x$ onto numerical values
- $E[f(x)] = \sum_x f(x)\,p(x)$ for discrete variables, and $E[f(x)] = \int f(x)\,p(x)\,dx$ for continuous variables
- For $f(x) = x$ we obtain the mean, $\mu_x$
- For $f(x) = (x - \mu_x)^2$ we obtain the variance
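As a quick illustration (a sketch in Python, assuming NumPy is available; the function names are ours, not from the slides), the mean and variance of the two-coins variable from the previous slide follow directly from the definition:

```python
import numpy as np

# pmf of X = (number of heads)^2 for two fair coin tosses (previous slide)
values = np.array([0.0, 1.0, 4.0])
probs = np.array([0.25, 0.50, 0.25])

def expect(f, values, probs):
    """E[f(X)] = sum_x f(x) p(x) for a discrete random variable."""
    return np.sum(f(values) * probs)

mean = expect(lambda x: x, values, probs)               # E[X] = 1.5
var = expect(lambda x: (x - mean) ** 2, values, probs)  # E[(X - mu)^2] = 2.25
print(mean, var)
```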

Joint distributions

- Properties of several random variables are important for modelling complex problems
- $p(X_1 = x_1, X_2 = x_2, \ldots, X_D = x_D)$; the comma is read as "and"
- Example joint distribution over Grade and Intelligence (from Koller and Friedman, 2009):

                Intelligence = low   Intelligence = high
    Grade = A         0.07                 0.18
    Grade = B         0.28                 0.09
    Grade = C         0.35                 0.03

Marginal Probability

- The sum rule: $p(x) = \sum_y p(x, y)$
- What is $p(\text{Grade} = A)$?
- Replace the sum by an integral for continuous RVs
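A one-line check of the sum rule on the Grade/Intelligence table above (a Python/NumPy sketch; the array layout is ours):

```python
import numpy as np

# Joint p(Grade, Intelligence); rows: Grade in {A, B, C},
# columns: Intelligence in {low, high}
joint = np.array([[0.07, 0.18],
                  [0.28, 0.09],
                  [0.35, 0.03]])

# Sum rule: marginalise Intelligence out by summing over columns
p_grade = joint.sum(axis=1)
print(p_grade[0])  # p(Grade = A) = 0.07 + 0.18 = 0.25
```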

Conditional Probability

- Let X and Y be two disjoint groups of variables, such that $p(Y = y) > 0$. Then the conditional probability distribution (CPD) of X given Y = y is given by
  $p(X = x \mid Y = y) = p(x \mid y) = \frac{p(x, y)}{p(y)}$
- Product rule: $p(X, Y) = p(X)\,p(Y \mid X) = p(Y)\,p(X \mid Y)$
- Example: in the grades example, what is $p(\text{Intelligence} = \text{high} \mid \text{Grade} = A)$?
- $\sum_x p(X = x \mid Y = y) = 1$ for all y
- Can we say anything about $\sum_y p(X = x \mid Y = y)$?
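Continuing the sketch (same assumed array as before), conditioning is just division by the marginal:

```python
import numpy as np

joint = np.array([[0.07, 0.18],   # Grade = A
                  [0.28, 0.09],   # Grade = B
                  [0.35, 0.03]])  # Grade = C; columns: Intelligence = low, high

# p(Intelligence | Grade = A) = p(Grade = A, Intelligence) / p(Grade = A)
p_intel_given_A = joint[0] / joint[0].sum()
print(p_intel_given_A[1])     # p(Intelligence = high | Grade = A) = 0.18 / 0.25 = 0.72
print(p_intel_given_A.sum())  # a CPD sums to 1 over x for each fixed y

# ...but summing a CPD over y need not give 1:
p_high_given_grade = joint[:, 1] / joint.sum(axis=1)
print(p_high_given_grade.sum())  # approx 1.04, not 1
```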

Chain Rule

- The chain rule is derived by repeated application of the product rule:
  $p(X_1, \ldots, X_D) = p(X_1, \ldots, X_{D-1})\,p(X_D \mid X_1, \ldots, X_{D-1})$
  $= p(X_1, \ldots, X_{D-2})\,p(X_{D-1} \mid X_1, \ldots, X_{D-2})\,p(X_D \mid X_1, \ldots, X_{D-1})$
  $= \ldots = p(X_1) \prod_{i=2}^{D} p(X_i \mid X_1, \ldots, X_{i-1})$
- Exercise: give six decompositions of $p(x, y, z)$ using the chain rule (a small enumeration sketch follows)
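For the exercise, a tiny Python sketch (illustrative only) that enumerates the six orderings and prints the corresponding decompositions:

```python
from itertools import permutations

# One chain-rule decomposition of p(x, y, z) per ordering of the variables
for order in permutations("xyz"):
    factors = [f"p({v}|{','.join(order[:i])})" if i > 0 else f"p({v})"
               for i, v in enumerate(order)]
    print("p(x,y,z) =", " ".join(factors))
```

Running it prints, e.g., `p(x,y,z) = p(x) p(y|x) p(z|x,y)` and the five other orderings.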

Bayes Rule

- From the product rule,
  $p(X \mid Y) = \frac{p(Y \mid X)\,p(X)}{p(Y)}$
- From the sum rule, the denominator is
  $p(Y) = \sum_X p(Y \mid X)\,p(X)$

Probabilistic Inference using Bayes Rule

- Tuberculosis (TB) and a skin test (Test):
  $p(TB = \text{yes}) = 0.001$ (for subjects who get tested)
  $p(Test = \text{yes} \mid TB = \text{yes}) = 0.95$
  $p(Test = \text{no} \mid TB = \text{no}) = 0.95$
- A person gets a positive test result. What is $p(TB = \text{yes} \mid Test = \text{yes})$?
  $p(TB = \text{yes} \mid Test = \text{yes}) = \frac{p(Test = \text{yes} \mid TB = \text{yes})\,p(TB = \text{yes})}{p(Test = \text{yes})} = \frac{0.95 \times 0.001}{0.95 \times 0.001 + 0.05 \times 0.999} \approx 0.0187$
- NB: these are fictitious numbers
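The same computation as a Python sketch (variable names are ours; the numbers are the fictitious ones from the slide):

```python
p_tb = 0.001                # prior p(TB = yes)
p_pos_given_tb = 0.95       # p(Test = yes | TB = yes)
p_neg_given_no_tb = 0.95    # p(Test = no  | TB = no)

# Sum rule for the evidence p(Test = yes)
p_pos = p_pos_given_tb * p_tb + (1 - p_neg_given_no_tb) * (1 - p_tb)

# Bayes rule for the posterior
posterior = p_pos_given_tb * p_tb / p_pos
print(posterior)  # approx 0.0187: a positive test still leaves TB unlikely
```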

Independence

- Let X and Y be two disjoint groups of variables. Then X is said to be independent of Y if and only if
  $p(X \mid Y) = p(X)$
  for all possible values x and y of X and Y; otherwise X is said to be dependent on Y
- Using the definition of conditional probability, we get an equivalent expression for the independence condition:
  $p(X, Y) = p(X)\,p(Y)$
- X independent of Y $\Leftrightarrow$ Y independent of X
- Independence of a set of variables: $X_1, \ldots, X_D$ are independent iff
  $p(X_1, \ldots, X_D) = \prod_{i=1}^{D} p(X_i)$
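To check independence numerically on the Grade/Intelligence table (a Python/NumPy sketch, same assumed array as earlier):

```python
import numpy as np

joint = np.array([[0.07, 0.18],
                  [0.28, 0.09],
                  [0.35, 0.03]])  # p(Grade, Intelligence) from earlier

p_grade = joint.sum(axis=1, keepdims=True)  # marginal over Intelligence
p_intel = joint.sum(axis=0, keepdims=True)  # marginal over Grade

# Independence would require p(Grade, Intelligence) = p(Grade) p(Intelligence)
print(np.allclose(joint, p_grade * p_intel))  # False: the variables are dependent
```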

Conditional Independence

- Let X, Y and Z be three disjoint groups of variables. X is said to be conditionally independent of Y given Z iff
  $p(x \mid y, z) = p(x \mid z)$
  for all possible values of x, y and z
- Equivalently, $p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$ [show this; a sketch follows]
- Notation: $I(X, Y \mid Z)$
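One way to show the equivalence (a sketch using only the definitions above, assuming $p(z) > 0$):

$p(x, y \mid z) = \frac{p(x, y, z)}{p(z)} = \frac{p(x \mid y, z)\,p(y, z)}{p(z)} = p(x \mid y, z)\,p(y \mid z) = p(x \mid z)\,p(y \mid z)$

where the last step substitutes the conditional independence assumption $p(x \mid y, z) = p(x \mid z)$.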

Bernoulli Distribution

- X is a random variable that takes either the value 0 or the value 1
- Let $p(X = 1 \mid p) = p$, and so $p(X = 0 \mid p) = 1 - p$. Then X has a Bernoulli distribution
- [Figure: bar plot of a Bernoulli pmf]

Categorical Distribution

- X is a random variable that takes one of the values $1, 2, \ldots, D$
- Let $p(X = i \mid \mathbf{p}) = p_i$, with $\sum_{i=1}^{D} p_i = 1$. Then X has a categorical (aka multinoulli) distribution (see Murphy 2012, p. 35)
- [Figure: bar plot of a categorical pmf over four values]

Binomial Distribution

- The binomial distribution gives the total number of 1s in n independent Bernoulli trials
- X is a random variable that takes one of the values $0, 1, 2, \ldots, n$
- Let $p(X = r \mid p) = \binom{n}{r} p^r (1 - p)^{n - r}$. Then X is binomially distributed
- [Figure: bar plot of a binomial pmf]
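A minimal pmf sketch in Python (assumes Python 3.8+ for `math.comb`; names are illustrative):

```python
from math import comb

def binom_pmf(r, n, p):
    """p(X = r | p) = C(n, r) p^r (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 4, 0.5
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]
print(pmf)       # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(pmf))  # 1.0 -- a pmf must sum to one
```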

Multinomial Distribution

- The multinomial distribution gives the total count for each outcome in n independent trials, each with D possible outcomes
- X is a random vector of length D taking values $\mathbf{x}$ with each $x_i$ a non-negative integer and $\sum_{i=1}^{D} x_i = n$
- Let
  $p(X = \mathbf{x} \mid \mathbf{p}) = \frac{n!}{x_1! \cdots x_D!}\, p_1^{x_1} \cdots p_D^{x_D}$
  Then X is multinomially distributed

Poisson Distribution

- The Poisson distribution is obtained from the binomial distribution in the limit $n \to \infty$ with $np = \lambda$ held fixed
- X is a random variable taking non-negative integer values $0, 1, 2, \ldots$
- Let
  $p(X = x \mid \lambda) = \frac{\lambda^x \exp(-\lambda)}{x!}$
  Then X is Poisson distributed
- [Figure: bar plot of a Poisson pmf]
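The limit can be seen numerically (a Python sketch with illustrative values):

```python
from math import comb, exp, factorial

lam, x = 3.0, 2  # rate, and the value at which we evaluate the pmf

# Binomial(n, p = lam/n) pmf at x approaches the Poisson(lam) pmf as n grows
for n in (10, 100, 10_000):
    p = lam / n
    print(n, comb(n, x) * p**x * (1 - p)**(n - x))

print("Poisson:", lam**x * exp(-lam) / factorial(x))  # approx 0.224
```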

Uniform Distribution

- X is a random variable taking values $x \in [a, b]$
- Let $p(X = x) = 1/(b - a)$. Then X is uniformly distributed
- Note: one cannot have a uniform distribution on an unbounded region
- [Figure: plot of a uniform pdf]

Gaussian Distribution

- X is a random variable taking values $x \in \mathbb{R}$ (real values)
- Let
  $p(X = x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
  Then X is Gaussian distributed with mean $\mu$ and variance $\sigma^2$
- [Figure: plot of a Gaussian pdf]

Gamma Distribution

- The Gamma distribution has a rate parameter $\beta > 0$ (or a scale parameter $1/\beta$) and a shape parameter $\alpha > 0$
- X is a random variable taking values $x \in \mathbb{R}^+$ (non-negative real values)
- Let
  $p(X = x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} \exp(-\beta x)$
  Then X is Gamma distributed. Note that $\Gamma(\cdot)$ is the Gamma function
- [Figure: plot of a Gamma pdf]

Exponential Distribution

- The exponential distribution is a Gamma distribution with $\alpha = 1$. It is often used for arrival times
- X is a random variable taking values $x \in \mathbb{R}^+$
- Let $p(X = x \mid \lambda) = \lambda \exp(-\lambda x)$. Then X is exponentially distributed
- [Figure: plot of an exponential pdf]
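The relationship to the Gamma can be checked directly from the densities (a pure-Python sketch; `math.gamma` is the Gamma function):

```python
from math import gamma, exp

def gamma_pdf(x, alpha, beta):
    """Gamma density: beta^alpha / Gamma(alpha) * x^(alpha - 1) * exp(-beta x)."""
    return beta**alpha / gamma(alpha) * x**(alpha - 1) * exp(-beta * x)

def exp_pdf(x, lam):
    return lam * exp(-lam * x)

# The exponential is the alpha = 1 special case of the Gamma
lam = 0.7
for x in (0.5, 1.0, 3.0):
    print(gamma_pdf(x, alpha=1.0, beta=lam), exp_pdf(x, lam))  # identical pairs
```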

Laplace Distribution

- The Laplace distribution is obtained as the difference of two independent, identically exponentially distributed variables
- X is a random variable taking values $x \in \mathbb{R}$
- Let $p(X = x \mid \lambda) = (\lambda/2) \exp(-\lambda |x|)$. Then X is Laplace distributed
- [Figure: plot of a Laplace pdf]

Beta Distribution

- X is a random variable taking values $x \in [0, 1]$
- Let
  $p(X = x \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, x^{a - 1} (1 - x)^{b - 1}$
  Then X is $\beta(a, b)$ distributed
- [Figures: Beta pdfs for $a = b = 0.5$ and for $a = 2, b = 3$]
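A quick numerical sanity check of the density (a Python sketch; the midpoint grid is a crude quadrature, fine for a = 2, b = 3 where the density is bounded):

```python
from math import gamma

def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

# Midpoint-rule check that the a = 2, b = 3 density integrates to 1,
# and that its mean is a / (a + b) = 0.4
n = 100_000
xs = [(i + 0.5) / n for i in range(n)]
total = sum(beta_pdf(x, 2, 3) for x in xs) / n
mean = sum(x * beta_pdf(x, 2, 3) for x in xs) / n
print(total, mean)  # approx 1.0 and 0.4
```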

The Kronecker Delta

- Think of a discrete distribution with all its probability mass on one value j, so that $p(X = i) = 1$ iff (if and only if) $i = j$
- We can write this using the Kronecker delta: $p(X = i) = \delta_{ij}$, where $\delta_{ij} = 1$ iff $i = j$ and is zero otherwise

The Dirac Delta

- Think of a real-valued distribution with all its probability density on one value: there is an infinite density peak at a single point (let's call this point a)
- We can write this using the Dirac delta: $p(X = x) = \delta(x - a)$, which has the properties
  $\delta(x - a) = 0$ if $x \ne a$, $\delta(x - a) = \infty$ if $x = a$,
  $\int \delta(x - a)\,dx = 1$ and $\int f(x)\,\delta(x - a)\,dx = f(a)$
- You could think of it as a Gaussian distribution in the limit of zero variance

Other Distributions

- The chi-squared distribution with k degrees of freedom is a Gamma distribution with $\alpha = k/2$ and $\beta = 1/2$
- Dirichlet distribution: will be used on this course
- Weibull distribution (a generalisation of the exponential)
- Geometric distribution
- Negative binomial distribution
- Wishart distribution (a distribution over matrices)
- Wikipedia and MathWorld give good summaries of these distributions

Things you must never (ever) forget

- Probabilities must be between 0 and 1 (though probability densities can be greater than 1)
- Distributions must sum (or integrate) to 1

Summary

- Joint distributions
- Conditional probability
- Sum and product rules
- Standard probability distributions

Reading: Murphy secs 2.1-2.4