Introduction to Machine Learning

1 Introduction to Machine Learning
Introduction to Probabilistic Methods
Varun Chandola
Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA
CSE 474/574

2 Outline
- Introduction to Probability
- Random Variables
- Bayes Rule
- More About Conditional Independence
- Continuous Random Variables
- Different Types of Distributions
- Handling Multivariate Distributions
- Transformations of Random Variables
- Information Theory - Introduction
Chandola@UB CSE 474/574

3 What is Probability? [3, 1]
- The probability that a coin will land heads is 50%. (1)
- What does this mean?
(1) Dr. Persi Diaconis showed that a coin is 51% likely to land facing the same way up as it started.

6 Frequentist Interpretation
- Number of times an event will be observed in $n$ trials
- What if the event can only occur once?
  - My winning next month's Powerball.
  - Polar ice caps melting by a given year.

7 Bayesian Interpretation
- Uncertainty of the event
- Used for making decisions
  - Should I put in an offer for a sports car?

8 What is a Random Variable ($X$)?
- Can take any value from $\mathcal{X}$
- Discrete random variable: $\mathcal{X}$ is finite or countably infinite
  - Categorical?
- Continuous random variable: $\mathcal{X}$ is uncountably infinite
- $P(X = x)$ or $P(x)$ is the probability of $X$ taking value $x$ (an event)
- What is a distribution?
Examples
1. Coin toss ($\mathcal{X} = \{\text{heads}, \text{tails}\}$)
2. Six-sided die ($\mathcal{X} = \{1, 2, 3, 4, 5, 6\}$)

9 Notation, Notation, Notation
- $X$: random variable ($\mathbf{X}$ if multivariate)
- $x$: a specific value taken by the random variable ($\mathbf{x}$ if multivariate)
- $P(X = x)$ or $P(x)$ is the probability of the event $X = x$
- $p(x)$ is either the probability mass function (discrete) or the probability density function (continuous) for the random variable $X$ at $x$
  - Probability mass (or density) at $x$

10 Basic Rules - Quick Review
- For two events $A$ and $B$:
  $$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$$
- Joint probability (also known as the product rule):
  $$P(A, B) = P(A \wedge B) = P(A \mid B)\,P(B)$$
- Conditional probability:
  $$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

11 Chain Rule of Probability
- Given $D$ random variables $\{X_1, X_2, \ldots, X_D\}$:
  $$P(X_1, X_2, \ldots, X_D) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1, X_2) \cdots P(X_D \mid X_1, X_2, \ldots, X_{D-1})$$

12 Marginal Distribution
- Given $P(A, B)$, what is $P(A)$?
- Sum $P(A, B)$ over all values of $B$ (the sum rule):
  $$P(A) = \sum_b P(A, B = b) = \sum_b P(A \mid B = b)\,P(B = b)$$
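As a small illustration of the sum rule, the sketch below marginalizes a joint table over B. The joint probabilities are made-up numbers chosen only so that they sum to 1, and the name `marginal_a` is hypothetical.

```python
# Hypothetical joint distribution P(A, B) with A in {0, 1} and B in {0, 1, 2}.
# Keys are (a, b) pairs; the numbers are illustrative and sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.15, (1, 2): 0.15,
}


def marginal_a(joint):
    """Sum rule: P(A = a) is the sum of P(A = a, B = b) over all b."""
    p_a = {}
    for (a, _b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
    return p_a


p_a = marginal_a(joint)
print(p_a)
```

The same loop with the roles of `a` and `b` swapped would give the marginal of B.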

13 Bayes Rule or Bayes Theorem
- Computing $P(X = x \mid Y = y)$ with Bayes Theorem:
  $$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{P(X = x)\,P(Y = y \mid X = x)}{\sum_{x'} P(X = x')\,P(Y = y \mid X = x')}$$

14 Example: Medical Diagnosis
- Random event 1: a test is positive or negative ($X$)
- Random event 2: a person has cancer ($Y$), yes or no
- What we know:
  1. The test has an accuracy of 80%
  2. Fraction of times the test is positive when the person has cancer: $P(X = 1 \mid Y = 1) = 0.8$
  3. Prior probability of having cancer is 0.4%: $P(Y = 1) = 0.004$
- Question: if I test positive, does it mean that I have an 80% chance of having cancer?

15 Base Rate Fallacy
- The 80% answer ignores the prior information
- What we need is $P(Y = 1 \mid X = 1)$
- More information: the false positive (alarm) rate for the test is $P(X = 1 \mid Y = 0) = 0.1$
  $$P(Y = 1 \mid X = 1) = \frac{P(X = 1 \mid Y = 1)\,P(Y = 1)}{P(X = 1 \mid Y = 1)\,P(Y = 1) + P(X = 1 \mid Y = 0)\,P(Y = 0)}$$
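Plugging in the numbers from the medical diagnosis example makes the fallacy concrete. The sketch below assumes that the stated 80% accuracy means $P(X = 1 \mid Y = 1) = 0.8$ and that the 0.4% prior means $P(Y = 1) = 0.004$; the function name `posterior_positive` is hypothetical.

```python
def posterior_positive(sensitivity, false_positive_rate, prior):
    """P(Y = 1 | X = 1) via Bayes rule for a binary test X and condition Y."""
    numerator = sensitivity * prior                      # P(X=1 | Y=1) P(Y=1)
    evidence = numerator + false_positive_rate * (1.0 - prior)  # P(X=1)
    return numerator / evidence


# Assumed values: P(X=1 | Y=1) = 0.8, P(X=1 | Y=0) = 0.1, P(Y=1) = 0.004.
p = posterior_positive(sensitivity=0.8, false_positive_rate=0.1, prior=0.004)
print(f"P(Y=1 | X=1) = {p:.4f}")
```

Under these assumptions the posterior is about 3%, far below 80%, because the disease is rare and false positives from the much larger healthy group dominate.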

16 Classification Using Bayes Rule
- Given an input example $X$, find the true class: $P(Y = c \mid X)$
- $Y$ is the random variable denoting the true class
- Assume the class-conditional probability $P(X \mid Y = c)$ is known
- Applying Bayes Rule:
  $$P(Y = c \mid X) = \frac{P(Y = c)\,P(X \mid Y = c)}{\sum_{c'} P(Y = c')\,P(X \mid Y = c')}$$
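A minimal sketch of this classification rule, assuming the priors and the class-conditional likelihoods of one observed input are already available as dictionaries. The class names and all numbers are invented for illustration, and `class_posterior` is a hypothetical helper.

```python
def class_posterior(priors, likelihoods):
    """Return P(Y = c | x) for every class c.

    priors[c] is P(Y = c); likelihoods[c] is P(x | Y = c) for the observed x.
    """
    unnormalized = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(unnormalized.values())  # denominator of Bayes rule
    return {c: value / evidence for c, value in unnormalized.items()}


# Illustrative (made-up) numbers for a two-class problem.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.05, "ham": 0.01}

post = class_posterior(priors, likelihoods)
predicted = max(post, key=post.get)  # class with the highest posterior
print(post, predicted)
```

Picking the class with the largest posterior is one common decision rule; the slides only state the posterior itself.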

17 Independence
- One random variable does not depend on another
- $A \perp B \iff P(A, B) = P(A)\,P(B)$
- The joint is written as a product of marginals
Conditional Independence
- $A \perp B \mid C \iff P(A, B \mid C) = P(A \mid C)\,P(B \mid C)$

20 More About Conditional Independence
- Alice and Bob live in the same town but far away from each other
- Alice drives to work and Bob takes the bus
- Event $A$: Alice comes late to work
- Event $B$: Bob comes late to work
- Event $C$: a snowstorm has hit the town
- $P(A \mid C)$: probability that Alice comes late to work given there is a snowstorm
- Now if I know that Bob has also come late to work, will it change the probability that Alice comes late to work?
- What if I do not observe $C$? Will $B$ have any impact on the probability of $A$ happening?
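The two questions can be checked numerically. The sketch below builds a joint distribution in which A and B are conditionally independent given C by construction; every probability in it is an invented number, and `prob` and `bern` are hypothetical helpers.

```python
from itertools import product

# Made-up probabilities: snowstorm prior and lateness given snowstorm or not.
p_c = {1: 0.2, 0: 0.8}            # P(C = c)
p_a_given_c = {1: 0.7, 0: 0.1}    # P(A = 1 | C = c)
p_b_given_c = {1: 0.6, 0: 0.05}   # P(B = 1 | C = c)


def bern(p_one, value):
    """Probability of a binary value given the probability of a 1."""
    return p_one if value == 1 else 1.0 - p_one


# Joint P(A, B, C) = P(C) P(A | C) P(B | C): A and B independent given C.
joint = {
    (a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
    for a, b, c in product((0, 1), repeat=3)
}


def prob(**fixed):
    """Sum the joint over all entries matching the fixed values of a, b, c."""
    total = 0.0
    for (a, b, c), p in joint.items():
        values = {"a": a, "b": b, "c": c}
        if all(values[name] == v for name, v in fixed.items()):
            total += p
    return total


p_a_given_snow = prob(a=1, c=1) / prob(c=1)
p_a_given_snow_and_b = prob(a=1, b=1, c=1) / prob(b=1, c=1)
p_a = prob(a=1)
p_a_given_b = prob(a=1, b=1) / prob(b=1)

print(p_a_given_snow, p_a_given_snow_and_b)  # equal: B adds nothing once C is known
print(p_a, p_a_given_b)                      # differ: B is informative when C is unobserved
```

With the snowstorm observed, learning that Bob is late leaves the probability for Alice unchanged. Without observing it, Bob being late raises the probability of a snowstorm and therefore of Alice being late.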

22 Continuous Random Variables
- $X$ is continuous
- Can take any value
- How does one define probability?
  - $\sum_x P(X = x) = 1$
- Probability that $X$ lies in an interval $[a, b]$?
  - $P(a < X \le b) = P(X \le b) - P(X \le a)$
  - $F(q) = P(X \le q)$ is the cumulative distribution function
  - $P(a < X \le b) = F(b) - F(a)$

23 Probability Density
Probability Density Function
  $$p(x) = \frac{\partial}{\partial x} F(x)$$
  $$P(a < X \le b) = \int_a^b p(x)\,dx$$
- Can $p(x)$ be greater than 1?
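One way to answer the last question is with a uniform distribution on a short interval. The sketch below uses Uniform(0, 0.5) as an assumed example and a simple midpoint rule for the integrals; `uniform_pdf` and `integrate` are hypothetical helpers.

```python
def uniform_pdf(x, low, high):
    """Density of a Uniform(low, high) random variable at x."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


def pdf(x):
    return uniform_pdf(x, 0.0, 0.5)


density = pdf(0.25)                       # density value inside the support
total = integrate(pdf, 0.0, 0.5)          # total probability mass
prob_interval = integrate(pdf, 0.1, 0.2)  # P(0.1 < X <= 0.2)

print(density, total, prob_interval)
```

The density equals 2 everywhere on the support, yet it still integrates to 1: a density is not a probability, only its integral over an interval is.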

24 Expectation
- Expected value of a random variable: $E[X]$
- What is most likely to happen in terms of $X$?
- For discrete variables:
  $$E[X] \triangleq \sum_{x \in \mathcal{X}} x\,P(X = x)$$
- For continuous variables:
  $$E[X] \triangleq \int_{\mathcal{X}} x\,p(x)\,dx$$
- Mean of $X$ ($\mu$)

25 Expectation of Functions of a Random Variable
- Let $g(X)$ be a function of $X$
- If $X$ is discrete:
  $$E[g(X)] \triangleq \sum_{x \in \mathcal{X}} g(x)\,P(X = x)$$
- If $X$ is continuous:
  $$E[g(X)] \triangleq \int_{\mathcal{X}} g(x)\,p(x)\,dx$$
Properties
- $E[c] = c$, where $c$ is a constant
- If $X \le Y$, then $E[X] \le E[Y]$
- $E[X + Y] = E[X] + E[Y]$
- $E[aX] = a\,E[X]$
- $\mathrm{var}[X] = E[(X - \mu)^2] = E[X^2] - \mu^2$
- $\mathrm{cov}[X, Y] = E[XY] - E[X]\,E[Y]$
- Jensen's inequality: if $\varphi(X)$ is convex, then $\varphi(E[X]) \le E[\varphi(X)]$
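Several of these properties can be checked exactly on a fair six-sided die. The sketch below uses exact fractions; the helper name `expectation` is hypothetical.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete random variable with the given pmf."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expectation(die)                                  # E[X]
second_moment = expectation(die, lambda x: x * x)        # E[X^2]
variance = expectation(die, lambda x: (x - mean) ** 2)   # E[(X - mu)^2]
scaled = expectation(die, lambda x: 2 * x + 3)           # E[2X + 3]

print(mean, second_moment, variance, scaled)
```

Here `variance` equals `second_moment - mean**2`, `scaled` equals `2 * mean + 3`, and the convex function $\varphi(x) = x^2$ satisfies Jensen's inequality because `mean**2` is smaller than `second_moment`.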

26 What is a Probability Distribution?
Discrete
- Binomial, Bernoulli
- Multinomial, Multinoulli
- Poisson
- Empirical
Continuous
- Gaussian (Normal)
- Degenerate pdf
- Laplace
- Gamma
- Beta
- Pareto

27 Binomial Distribution
- $X$ = number of heads observed in $n$ coin tosses
- Parameters: $n$, $\theta$
- $X \sim \mathrm{Bin}(n, \theta)$
- Probability mass function (pmf):
  $$\mathrm{Bin}(k \mid n, \theta) \triangleq \binom{n}{k}\,\theta^k (1 - \theta)^{n - k}$$
Bernoulli Distribution
- Binomial distribution with $n = 1$
- Only one parameter ($\theta$)
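The pmf translates directly into code. This is a minimal sketch using only the standard library; `binomial_pmf` is a hypothetical name.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """Bin(k | n, theta): probability of k heads in n tosses with P(heads) = theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


p_5_of_10 = binomial_pmf(5, 10, 0.5)                        # 5 heads in 10 fair tosses
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))    # pmf sums to 1
bernoulli_heads = binomial_pmf(1, 1, 0.3)                   # n = 1 is the Bernoulli case

print(p_5_of_10, total, bernoulli_heads)
```

Setting `n = 1` recovers the Bernoulli distribution, whose only parameter is `theta`.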

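28 Multinomial Distribution
- Simulates a $K$-sided die
- Random variable $\mathbf{x} = (x_1, x_2, \ldots, x_K)$
- Parameters: $n$, $\boldsymbol{\theta}$
- $\boldsymbol{\theta} \in \mathbb{R}^K$
- $\theta_j$: probability that the $j$-th side shows up
  $$\mathrm{Mu}(\mathbf{x} \mid n, \boldsymbol{\theta}) \triangleq \binom{n}{x_1, x_2, \ldots, x_K} \prod_{j=1}^{K} \theta_j^{x_j}$$
Multinoulli Distribution
- Multinomial distribution with $n = 1$
- $\mathbf{x}$ is a vector of 0s and 1s with only one bit set to 1
- Only one parameter ($\boldsymbol{\theta}$)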
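A direct implementation of the multinomial pmf, as a sketch. It assumes `counts` holds the number of times each side showed up, so that the counts sum to $n$; `multinomial_pmf` is a hypothetical name, and `math.prod` requires Python 3.8 or later.

```python
from math import factorial, prod


def multinomial_pmf(counts, theta):
    """Mu(x | n, theta) where n = sum(counts) and theta[j] = P(side j)."""
    n = sum(counts)
    coeff = factorial(n)
    for x in counts:
        coeff //= factorial(x)  # multinomial coefficient n! / (x_1! ... x_K!)
    return coeff * prod(t**x for t, x in zip(theta, counts))


# Three rolls of a fair 3-sided die, each side showing once.
p_each_once = multinomial_pmf((1, 1, 1), (1 / 3, 1 / 3, 1 / 3))

# Multinoulli case (n = 1): a one-hot vector picks out a single theta_j.
p_one_hot = multinomial_pmf((0, 1, 0), (0.2, 0.5, 0.3))

print(p_each_once, p_one_hot)
```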

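29 Gaussian (Normal) Distribution
  $$\mathcal{N}(x \mid \mu, \sigma^2) \triangleq \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$$
Parameters:
1. $\mu = E[X]$
2. $\sigma^2 = \mathrm{var}[X] = E[(X - \mu)^2]$
- $X \sim \mathcal{N}(\mu, \sigma^2) \Rightarrow p(X = x) = \mathcal{N}(x \mid \mu, \sigma^2)$
- $X \sim \mathcal{N}(0, 1) \Rightarrow X$ is a standard normal random variable
- Cumulative distribution function:
  $$\Phi(x; \mu, \sigma^2) \triangleq \int_{-\infty}^{x} \mathcal{N}(z \mid \mu, \sigma^2)\,dz$$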
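Both the density and the cumulative distribution function can be evaluated with the standard library. The sketch below relies on the standard identity $\Phi(x; \mu, \sigma^2) = \tfrac{1}{2}\left[1 + \mathrm{erf}\!\left(\tfrac{x - \mu}{\sqrt{2\sigma^2}}\right)\right]$, which the slides do not state; the function names are hypothetical.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density N(x | mu, sigma2); sigma2 is the variance, not the standard deviation."""
    return exp(-((x - mu) ** 2) / (2.0 * sigma2)) / sqrt(2.0 * pi * sigma2)


def normal_cdf(x, mu=0.0, sigma2=1.0):
    """Cumulative distribution function, written in terms of the error function."""
    return 0.5 * (1.0 + erf((x - mu) / sqrt(2.0 * sigma2)))


peak = normal_pdf(0.0)              # standard normal density at its mean
half = normal_cdf(0.0)              # half the mass lies below the mean
within_one_sd = normal_cdf(1.0) - normal_cdf(-1.0)

print(peak, half, within_one_sd)
```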

30 Joint Probability Distributions
- Multiple related random variables
- $p(x_1, x_2, \ldots, x_D)$ for $D > 1$ variables $(X_1, X_2, \ldots, X_D)$
- Discrete random variables?
- Continuous random variables?
- What do we measure?
Covariance
- How does $X$ vary with respect to $Y$?
- For a linear relationship:
  $$\mathrm{cov}[X, Y] \triangleq E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$$

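31 Covariance and Correlation
- $\mathbf{x}$ is a $d$-dimensional random vector
  $$\mathrm{cov}[\mathbf{X}] \triangleq E\left[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^\top\right] = \begin{pmatrix} \mathrm{var}[X_1] & \mathrm{cov}[X_1, X_2] & \cdots & \mathrm{cov}[X_1, X_d] \\ \mathrm{cov}[X_2, X_1] & \mathrm{var}[X_2] & \cdots & \mathrm{cov}[X_2, X_d] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}[X_d, X_1] & \mathrm{cov}[X_d, X_2] & \cdots & \mathrm{var}[X_d] \end{pmatrix}$$
- Covariance magnitudes can range from 0 to $\infty$
- Normalized covariance $\Rightarrow$ correlation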

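33 Correlation
- Pearson correlation coefficient:
  $$\mathrm{corr}[X, Y] \triangleq \frac{\mathrm{cov}[X, Y]}{\sqrt{\mathrm{var}[X]\,\mathrm{var}[Y]}}$$
- What is $\mathrm{corr}[X, X]$?
- $-1 \le \mathrm{corr}[X, Y] \le 1$
- When is $\mathrm{corr}[X, Y] = 1$?
  - $Y = aX + b$ (with $a > 0$)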
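The sketch below computes the Pearson coefficient on a small made-up sample and checks it on exactly linear data. It uses population (divide by $N$) moments, which is an assumption; the normalization cancels in the ratio. The helper names are hypothetical.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: average of (x - mean_x)(y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_up = [2.0 * x + 1.0 for x in xs]      # Y = aX + b with a > 0
ys_down = [-3.0 * x + 4.0 for x in xs]   # Y = aX + b with a < 0

corr_self = correlation(xs, xs)
corr_up = correlation(xs, ys_up)
corr_down = correlation(xs, ys_down)

print(corr_self, corr_up, corr_down)
```

A positive slope gives correlation 1 and a negative slope gives -1, regardless of the size of `a` or the offset `b`.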

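34 Multivariate Gaussian Distribution
- Most widely used joint probability distribution
  $$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \triangleq \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$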

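35 Linear Transformations of Random Variables
- What is the distribution of $f(\mathbf{X})$ (where $\mathbf{X} \sim p()$)?
- Linear transformation: $Y = \mathbf{a}^\top \mathbf{X} + b$
  - $E[Y]$?
  - $\mathrm{var}[Y]$?
- $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$
  - $E[\mathbf{Y}]$?
  - $\mathrm{cov}[\mathbf{Y}]$?
- The Matrix Cookbook [2], available on Piazza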
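For reference, the standard answers to these questions follow from linearity of expectation. The block below writes $\boldsymbol{\mu} = E[\mathbf{X}]$ and $\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{X}]$, symbols that the slide itself does not introduce.

```latex
% Scalar-valued linear transformation Y = a^T X + b
\begin{align*}
E[Y] &= \mathbf{a}^\top \boldsymbol{\mu} + b, &
\mathrm{var}[Y] &= \mathbf{a}^\top \boldsymbol{\Sigma}\, \mathbf{a}
\end{align*}

% Vector-valued linear transformation Y = A X + b
\begin{align*}
E[\mathbf{Y}] &= \mathbf{A} \boldsymbol{\mu} + \mathbf{b}, &
\mathrm{cov}[\mathbf{Y}] &= \mathbf{A} \boldsymbol{\Sigma} \mathbf{A}^\top
\end{align*}
```

The offset $b$ shifts the mean but never affects the variance or covariance.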

36 General Transformations
- $f()$ is not linear
- Example: $X$ is discrete
  $$Y = f(X) = \begin{cases} 1 & \text{if } X \text{ is even} \\ 0 & \text{if } X \text{ is odd} \end{cases}$$

37 General Transformations for Continuous Variables
- For continuous variables, work with the cdf (shown here for a monotonically increasing $f$):
  $$F_Y(y) \triangleq P(Y \le y) = P(f(X) \le y) = P(X \le f^{-1}(y)) = F_X(f^{-1}(y))$$
- For the pdf:
  $$p_Y(y) \triangleq \frac{d}{dy} F_Y(y) = \frac{d}{dy} F_X(f^{-1}(y)) = \frac{dx}{dy}\,\frac{d}{dx} F_X(x) = \frac{dx}{dy}\,p_X(x), \quad x = f^{-1}(y)$$
Example
- Let $X$ be $\mathrm{Uniform}(-1, 1)$
- Let $Y = X^2$
- $p_Y(y) = \frac{1}{2}\,y^{-\frac{1}{2}}$
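Because $Y = X^2$ is not monotonic on $(-1, 1)$, the example is easiest to verify through the cdf directly. The derivation below is a sketch of that route, using $p_X(x) = \tfrac{1}{2}$ on $(-1, 1)$.

```latex
\begin{align*}
F_Y(y) &= P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y})
        = \int_{-\sqrt{y}}^{\sqrt{y}} \tfrac{1}{2}\, dx = \sqrt{y},
        \qquad 0 < y \le 1 \\
p_Y(y) &= \frac{d}{dy} F_Y(y) = \frac{d}{dy}\, y^{\frac{1}{2}}
        = \tfrac{1}{2}\, y^{-\frac{1}{2}}
\end{align*}
```

Note that this density exceeds 1 for $y < \tfrac{1}{4}$ and is unbounded near $0$, yet it integrates to 1 over $(0, 1]$.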

38 Monte Carlo Approximation
- Generate $N$ samples from the distribution for $X$
- For each sample $x_i$, $i \in [1, N]$, compute $f(x_i)$
- Use the empirical distribution as an approximation of the true distribution
- Approximate expectation:
  $$E[f(X)] = \int f(x)\,p(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$
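These steps can be sketched in a few lines. The example assumes $X \sim \mathrm{Uniform}(-1, 1)$ and $f(x) = x^2$, for which the exact expectation is $1/3$; the seed, the sample size, and the name `monte_carlo_expectation` are arbitrary choices.

```python
import random


def monte_carlo_expectation(f, sampler, n_samples):
    """Approximate E[f(X)] by averaging f over samples drawn with sampler()."""
    return sum(f(sampler()) for _ in range(n_samples)) / n_samples


rng = random.Random(0)  # fixed seed so the run is repeatable

estimate = monte_carlo_expectation(
    f=lambda x: x * x,
    sampler=lambda: rng.uniform(-1.0, 1.0),
    n_samples=100_000,
)
exact = 1.0 / 3.0  # integral of x^2 * (1/2) over (-1, 1)

print(estimate, exact)
```

The error of the estimate shrinks roughly as $1/\sqrt{N}$, so quadrupling the number of samples about halves it.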

39 Introduction to Information Theory
- Quantifying the uncertainty of a random variable
- Entropy: $H(X)$ or $H(p)$
  $$H(X) \triangleq -\sum_{k=1}^{K} p(X = k) \log_2 p(X = k)$$
- Variable with maximum entropy?
- Lowest entropy?
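A short sketch of the entropy formula that also hints at the two questions. It follows the usual convention that terms with $p = 0$ contribute nothing; the example distributions are made up.

```python
from math import log2


def entropy(p):
    """H(p) in bits for a discrete distribution given as a list of probabilities."""
    # Terms with zero probability are skipped (0 * log 0 is taken as 0).
    return sum(-pk * log2(pk) for pk in p if pk > 0)


h_uniform = entropy([0.25, 0.25, 0.25, 0.25])   # all K = 4 outcomes equally likely
h_skewed = entropy([0.7, 0.1, 0.1, 0.1])        # one outcome dominates
h_certain = entropy([1.0, 0.0, 0.0, 0.0])       # outcome is known in advance

print(h_uniform, h_skewed, h_certain)
```

For $K$ outcomes the uniform distribution attains the maximum, $\log_2 K$ bits, and a distribution that puts all mass on one outcome has zero entropy.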

40 Comparing Two Distributions
- Kullback-Leibler divergence (or KL divergence, or relative entropy):
  $$\mathrm{KL}(p \,\|\, q) \triangleq \sum_{k=1}^{K} p(k) \log \frac{p(k)}{q(k)} = \sum_k p(k) \log p(k) - \sum_k p(k) \log q(k) = -H(p) + H(p, q)$$
- $H(p, q)$ is the cross-entropy
- Is the KL divergence symmetric?
- Important fact: $H(p, q) \ge H(p)$
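The identity and the symmetry question can both be checked numerically. The sketch below uses base-2 logarithms to match the entropy definition (the slide leaves the base of the logarithm unspecified) and two made-up distributions.

```python
from math import log2


def entropy(p):
    """H(p) in bits."""
    return sum(-pk * log2(pk) for pk in p if pk > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_k p(k) log2 q(k); assumes q(k) > 0 wherever p(k) > 0."""
    return sum(-pk * log2(qk) for pk, qk in zip(p, q) if pk > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum_k p(k) log2(p(k) / q(k)); same assumption on q."""
    return sum(pk * log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
identity_gap = kl_pq - (cross_entropy(p, q) - entropy(p))

print(kl_pq, kl_qp, identity_gap)
```

The two directions give different values, so the KL divergence is not symmetric. Since it is never negative, the identity $\mathrm{KL}(p \,\|\, q) = H(p, q) - H(p)$ implies $H(p, q) \ge H(p)$.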

41 Mutual Information
- What does learning about one variable $X$ tell us about another, $Y$?
  - Correlation?
- Mutual information:
  $$I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$
- $I(X; Y) = I(Y; X)$
- $I(X; Y) \ge 0$, with equality iff $X \perp Y$
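A minimal sketch of the mutual information formula on two made-up joint tables, one independent and one fully dependent. Base-2 logarithms are an assumption (the slide does not fix the base), and `mutual_information` is a hypothetical name.

```python
from math import log2


def mutual_information(joint):
    """I(X; Y) in bits, where joint[(x, y)] = p(x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():   # marginals via the sum rule
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(
        p * log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Two independent fair bits: the joint factorizes, so I(X; Y) = 0.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

# Y is an exact copy of X: knowing one reveals the other completely.
copied = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information(independent)
mi_copied = mutual_information(copied)

print(mi_independent, mi_copied)
```

In the second table the mutual information equals the entropy of a fair bit, 1 bit, which is the most that one binary variable can tell us about another.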

42 References
[1] E. Jaynes and G. Bretthorst. Probability Theory: The Logic of Science. Cambridge University Press, Cambridge.
[2] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook.
[3] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference (Springer Texts in Statistics). Springer.
