Statistical Machine Learning Lecture 1: Motivation

1 1 / 65 Statistical Machine Learning Lecture 1: Motivation Melih Kandemir Özyeğin University, İstanbul, Turkey

2 2 / 65 What is this course about?
- Using the science of statistics to build machine learning models
- Training such models
- Inference
- Applications will be mainly on deep neural nets. What a surprise =)
CS 458/558 Statistical Machine Learning: Statistical, Probabilistic, Bayesian. Probabilistic? Bayesian?

3 3 / 65 What is it NOT about?
- Introduction to machine learning
- Introduction to deep learning
- Introduction to probability and statistics
- Advanced probability and statistics
- Advanced Bayesian theory

4 4 / 65 Textbook: None. The field is evolving at lightspeed! A brand-new textbook is already outdated, and a fairly recent one is already archaic! Read scientific articles, book sections, and course slides.

5 5 / 65 Primadonnas

6 6 / 65 Grading protocol for 458
- Four programming assignments (10% each): implement a model, test it on data, write a one-page report. Python, TensorFlow.
- Midterm exam (20%) - Open-book!
- Final exam (40%) - Open-book!
Open-book means that during the exam you may keep:
- Lecture slides - Yes!
- Textbooks - Yes!
- Any written text of your own - Yes!
- Article printouts - Yes!
- Electronic devices - No!
No free lunch: no free points for memorizing material.
Cheap lunch granted: learn from examples, apply to similar cases.

7 7 / 65 Grading protocol for 558
- Three programming assignments (10% each): implement a toy model, test it on data, write a one-page report. Python, TensorFlow.
- Midterm exam (10%) - Open-book!
- Final exam (30%) - Open-book!
- Project (30%): implement the main idea of a scientific paper (simplifications might be allowed), test it on data, write a four-page report. Python, TensorFlow.

8 8 / 65 A data scientist is like a medic
- HAS TO learn new tools every couple of years
- HAS TO be one step ahead of the crowd
- HAS TO understand the sources of diseases, not only the prescriptions of the tools (Med ≠ Pharmacy)
Hence, NO MATTER IF THE POSITION IS ACADEMIC OR NOT, HAS TO FOLLOW THE LITERATURE VERY CLOSELY!
predictive-analytics/

9 9 / 65 The grand slams of machine learning 1 - International Conference on Machine Learning (ICML)

10 10 / 65 The grand slams of machine learning 2 - Neural Information Processing Systems (NIPS)

11 11 / 65 The grand slams of machine learning 3 - Uncertainty in Artificial Intelligence (UAI)

12 12 / 65 The grand slams of machine learning 4 - Artificial Intelligence and Statistics (AISTATS)

13 13 / 65 The grand slams of machine learning
Quality: ICML ≈ NIPS ≈ UAI ≈ AISTATS >> others
Scale: ICML ≈ NIPS > UAI ≈ AISTATS

14 14 / 65 Why probabilistic machine learning? Learn a lot from few cases, just like the human brain!

15 15 / 65 Why probabilistic machine learning? Charming results! This is the paper that made neural networks shake the world: A. Krizhevsky et al., NIPS.

16 16 / 65 Why probabilistic machine learning? ...with unbelievably good results on the very challenging ILSVRC data set,

17 17 / 65 Why probabilistic machine learning? ...using a method called Dropout, introduced by yet another paper: N. Srivastava, Journal of Machine Learning Research,

18 18 / 65 Why probabilistic machine learning? ...which builds on the uncertainty of a neuron being active or inactive.
Mystery: why such a simple trick works that well.
NOT a mystery: why accounting for uncertainty yields more robust predictions. Probability Theory!

19 19 / 65 Why probabilistic machine learning? The more advanced the treatment of uncertainty, the better the predictions (D. Kingma et al., NIPS, 2015).

20 20 / 65 Definitions
Sample space ($\Omega$): The collection of all possible outcomes of a random experiment.
Event ($E$): A question about the experiment with a yes/no answer. A subset of the sample space.
Probability measure: A function that assigns a number $P(A)$ to each event $A$.

21 21 / 65 Axioms of probability
Axiom 1: The probability of an event is a non-negative real number: $P(E) \in \mathbb{R}$, $P(E) \geq 0$, $\forall E \subseteq \Omega$.
Axiom 2: The probability of the entire sample space is 1: $P(\Omega) = 1$.
Axiom 3: $P(E_1 \cup E_2) = P(E_1) + P(E_2)$, where $E_1 \cap E_2 = \emptyset$.

22 22 / 65 Consequences
Sum rule: $P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$
$P(\emptyset) = 0$
All of set theory is applicable. Most of Boolean algebra is applicable.
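The axioms and the sum rule can be checked on a small finite sample space. The sketch below assumes a fair six-sided die with a uniform measure; the events `even` and `at_least_five` and the helper `prob` are illustrative names, not part of the lecture.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
omega = frozenset(range(1, 7))


def prob(event):
    """Uniform probability measure on omega: P(E) = |E| / |Omega|."""
    event = frozenset(event)
    assert event <= omega, "an event must be a subset of the sample space"
    return Fraction(len(event), len(omega))


even = {2, 4, 6}
at_least_five = {5, 6}

# Sum rule: P(E1 u E2) = P(E1) + P(E2) - P(E1 n E2)
lhs = prob(even | at_least_five)
rhs = prob(even) + prob(at_least_five) - prob(even & at_least_five)

print(lhs, rhs)          # both sides of the sum rule
print(prob(omega))       # Axiom 2
print(prob(set()))       # probability of the empty event
```

Exact fractions are used instead of floats so that the two sides of the sum rule can be compared with `==`.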

23 23 / 65 Conditional probability
Kolmogorov's definition: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, a.k.a. the product rule. De Finetti introduces this formulation as an axiom.
Consider the following example [figure not recovered].
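As one concrete illustration of the definition (not the example from the original slide, which did not survive extraction), the sketch below assumes two fair dice and computes the probability that the sum is 8 given that the first die is even. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: ordered outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(E) under the uniform measure; E is a predicate on outcomes."""
    return Fraction(sum(1 for o in omega if event(o)), len(omega))


def sum_is_8(o):
    return o[0] + o[1] == 8


def first_is_even(o):
    return o[0] % 2 == 0


p_a = prob(sum_is_8)
p_b = prob(first_is_even)
p_a_and_b = prob(lambda o: sum_is_8(o) and first_is_even(o))

# Kolmogorov's definition: P(A | B) = P(A n B) / P(B)
p_a_given_b = p_a_and_b / p_b

print(p_a, p_a_given_b)
```

Conditioning on the first die being even changes the probability of the sum being 8, so the two events are not independent under this measure.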

24 24 / 65 Definitions (2)
Probability density function (PDF): $\Pr[a \leq x \leq b] = \int_a^b p(x)\,dx$
Cumulative distribution function (CDF): $F_x(x) = \int_{-\infty}^{x} p(t)\,dt$
PDF-CDF relationship: $\Pr[a \leq x \leq b] = F_x(b) - F_x(a)$

25 25 / 65 Definitions (3)
Expected value: $E_{p(x)}[x] = \int x\,p(x)\,dx$
Variance: $\mathrm{Var}_{p(x)}[x] = E_{p(x)}[(x - E_{p(x)}[x])^2] = E_{p(x)}[x^2] - (E_{p(x)}[x])^2$
Standard deviation: $\sigma(x) = \sqrt{\mathrm{Var}_{p(x)}[x]}$
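The two expressions for the variance can be compared numerically. The sketch below swaps the integral for a sum over a discrete distribution (a fair die, chosen only for illustration); `pmf` and `expect` are hypothetical names.

```python
from fractions import Fraction

# Assumed discrete distribution: a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expect(g):
    """E[g(x)] as a probability-weighted sum (discrete analogue of the integral)."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expect(lambda x: x)
var_by_definition = expect(lambda x: (x - mean) ** 2)   # E[(x - E[x])^2]
var_by_shortcut = expect(lambda x: x ** 2) - mean ** 2  # E[x^2] - (E[x])^2
std = float(var_by_definition) ** 0.5

print(mean, var_by_definition, var_by_shortcut, std)
```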

26 26 / 65 Definitions (4)
Joint PDF: $\Pr[a \leq x \leq b,\; c \leq y \leq d] = \int_a^b \int_c^d p_{xy}(x, y)\,dy\,dx$
Covariance: $\mathrm{cov}(x, y) = E[(x - E[x])(y - E[y])]$
Marginal probability (sum rule): $p(x) = \int p(x, y)\,dy$

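27 27 / 65 Normal distribution
PDF: $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
CDF: $\frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x-\mu}{\sqrt{2\sigma^2}}\right)\right]$, where $\mathrm{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^{x} e^{-t^2}\,dt$.
Mean: $\mu$
Variance: $\sigma^2$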
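The PDF-CDF relationship from the earlier definitions can be verified for the normal distribution. The sketch below implements the two formulas with the standard library and compares a numerical integral of the PDF with the CDF difference; the midpoint-rule helper `integrate` and the chosen interval are assumptions made for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(x | mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) expressed through the error function."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * sigma ** 2)))


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


mu, sigma = 0.0, 1.0
a, b = -1.0, 1.0

by_integration = integrate(lambda x: normal_pdf(x, mu, sigma), a, b)
by_cdf = normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

print(by_integration, by_cdf)  # probability mass within one standard deviation
```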

28 28 / 65 Normal distribution (2) [Figure: PDF and CDF of the normal distribution; axis label: Std. Dev]

29 29 / 65 Multivariate normal distribution
PDF: $\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-\frac{D}{2}}\, |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$
CDF: N/A.
Mean: $\mu$
Covariance: $\Sigma$

30 30 / 65 Multivariate normal distribution (2) [Figure]

31 31 / 65 Why normal distribution? Central limit theorem
Let $x_1, x_2, \ldots, x_N$ be $N$ independent and identically distributed random variables with $E[x_n] = \mu$ and $\mathrm{Var}[x_n] = \sigma^2 < \infty$. Then, as $N$ approaches infinity, the random variables $\sqrt{N}(\hat{\mu}_N - \mu)$ converge in distribution to $\mathcal{N}(0, \sigma^2)$, where $\hat{\mu}_N = (x_1 + x_2 + \cdots + x_N)/N$ is the sample mean of the first $N$ random variables.
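A small simulation can make the theorem tangible. The sketch below assumes Uniform(0, 1) variables (mean 1/2, variance 1/12) and checks that the scaled error of the sample mean has roughly zero mean and variance close to 1/12. The sample sizes and the seed are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

mu, sigma2 = 0.5, 1.0 / 12.0  # mean and variance of Uniform(0, 1)
n, trials = 50, 4000


def scaled_error():
    """Draw n uniforms and return sqrt(n) * (sample mean - mu)."""
    sample_mean = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(n) * (sample_mean - mu)


draws = [scaled_error() for _ in range(trials)]
empirical_mean = statistics.mean(draws)
empirical_var = statistics.variance(draws)

print(empirical_mean, empirical_var, sigma2)
```

The simulation only checks the first two moments; a histogram of `draws` would be needed to see the bell shape itself.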

32 32 / 65 Independence and Conditional Independence
Independence: $P(E_1 \cap E_2) = P(E_1)\,P(E_2)$
Conditional independence: $P(E_1 \cap E_2 \mid E_3) = P(E_1 \mid E_3)\,P(E_2 \mid E_3)$

33 33 / 65 Independent and identically distributed (i.i.d.) random variables
Let $X = \{x_1, x_2, \ldots, x_N\}$ be a set of $N$ random variables corresponding to $N$ observations of an experiment. They are defined to be independent and identically distributed (i.i.d.) random variables if:
- All random variables $x_i$ have the same probability distribution.
- The observation events are mutually independent.
Hence, the likelihood of an i.i.d. data set can be written as $P(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)$.
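The factorized likelihood is usually evaluated as a sum of logs. The sketch below assumes a Bernoulli model with a made-up data set of coin tosses and compares the product form with the log-sum form; `bernoulli`, `likelihood`, and `log_likelihood` are illustrative names.

```python
import math

# Assumed i.i.d. data: 1 = heads, 0 = tails.
data = [1, 0, 1, 1, 0, 1, 1, 1]


def bernoulli(x, theta):
    """p(x | theta) for a single toss."""
    return theta if x == 1 else 1.0 - theta


def likelihood(theta):
    """P(X | theta) as a product over the observations."""
    return math.prod(bernoulli(x, theta) for x in data)


def log_likelihood(theta):
    """log P(X | theta) as a sum, which is numerically safer for large N."""
    return sum(math.log(bernoulli(x, theta)) for x in data)


# Grid search for the parameter value that maximizes the likelihood.
grid = [i / 100 for i in range(1, 100)]
best_theta = max(grid, key=log_likelihood)

print(best_theta, sum(data) / len(data))
```

For this model the maximizer coincides with the fraction of heads in the data.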

34 34 / 65 Exchangeability
The random variables $(x_1, x_2, \ldots, x_N)$ are exchangeable if, for any permutation $\pi$, the following equality holds: $p(x_1, x_2, \ldots, x_N) = p(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(N)})$.

35 35 / 65 What is probability?

36 36 / 65 What is probability? Is probability an objective or a subjective measure?

37 37 / 65 What if probability is objective?
$p(e) = \lim_{n \to +\infty} \frac{n_e}{n}$
$n_e$: Number of times the event of interest occurs
$n$: Number of trials
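The limit can only ever be approximated with finitely many trials. The sketch below assumes a simulated fair die and tracks the relative frequency of rolling a six as the number of trials grows; the seed and trial counts are arbitrary.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable


def relative_frequency(n):
    """n_e / n for the event 'the die shows a six' over n simulated rolls."""
    hits = sum(1 for _ in range(n) if rng.randint(1, 6) == 6)
    return hits / n


freqs = {n: relative_frequency(n) for n in (10, 1_000, 100_000)}

for n, f in freqs.items():
    print(n, f)
```

Note that the simulation already assumes the die is fair, which is exactly the kind of assumption the frequentist definition cannot justify from data alone.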

38 38 / 65 Can probability really be purely objective?
- How shall we handle $+\infty$? The sample set is limited.
- How do we know that our sample set is not biased?
- How do we know that the die is fair?
- Is making fairness or bias assumptions not itself a subjective guess?
"Then why not quantify subjectivity?" asks a Bayesian, like de Finetti:
"The classical view, based on physical considerations of symmetry, in which one should be obliged to give the same probability to such symmetric cases. But which symmetry? And, in any case, why? The original sentence becomes meaningful if reversed: the symmetry is probabilistically significant, in someone's opinion, if it leads him to assign the probabilities to such events." (de Finetti, 1970/74, Preface, xi-xii)

39 39 / 65 Thomas Bayes the legend (1701-1761)
$p(H \mid X) = \frac{p(X \mid H)\,p(H)}{p(X)}$
$H$: Hypothesis
$X$: Measurement

40 40 / 65 Bayes' Theorem
$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{p(x)}$
$x \in \mathcal{X}$ is an observable in the sample space $\mathcal{X}$.
$\theta$ is a set of model parameters. It is an index to a frequentist, and a random variable to a Bayesian.
$p(x \mid \theta)$: likelihood (how do the model parameters describe the data?)
$p(\theta)$: prior (what is our prior belief about the model parameters?)
$p(x)$: evidence (what is the likelihood of the data regardless of the model parameters?)
$p(\theta \mid x)$: posterior (how are the model parameters distributed after the observations are taken into account?)
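The four ingredients of Bayes' theorem can be computed by hand when the parameter takes only a few values. The sketch below assumes a coin whose heads probability is one of three candidate values with a uniform prior, and updates on a single observed head. The candidate values and names are illustrative assumptions.

```python
from fractions import Fraction

# Assumed discrete parameter space for the heads probability theta.
thetas = [Fraction(1, 5), Fraction(1, 2), Fraction(4, 5)]
prior = {t: Fraction(1, 3) for t in thetas}  # uniform prior p(theta)


def likelihood(x, theta):
    """p(x | theta) for one toss, with x = 1 for heads and x = 0 for tails."""
    return theta if x == 1 else 1 - theta


x = 1  # observed outcome: heads

# Evidence p(x): the likelihood averaged over the prior.
evidence = sum(likelihood(x, t) * prior[t] for t in thetas)

# Posterior p(theta | x) = p(x | theta) p(theta) / p(x).
posterior = {t: likelihood(x, t) * prior[t] / evidence for t in thetas}

for t in thetas:
    print(t, prior[t], posterior[t])
```

After one head, the posterior shifts mass toward the larger values of `theta`, while still summing to one.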

41 41 / 65 Prior? What does it really mean? Who do you expect to win the tennis game and why?

42 42 / 65 What does it mean to be Bayesian in machine learning?

43 43 / 65 Motivation 1: De Finetti's Theorem
A sequence of random variables $(x_1, x_2, \ldots, x_N)$ is infinitely exchangeable iff, for any $N$,
$p(x_1, x_2, \ldots, x_N) = \int \prod_{i=1}^{N} p(x_i \mid \theta)\,P(d\theta)$
Here, $P(d\theta) = p(\theta)\,d\theta$ if $\theta$ has a density.
Implications:
- Exchangeability can be checked from the right-hand side.
- There must exist a parameter $\theta$!
- There must exist a likelihood $p(x \mid \theta)$!
- There must exist a distribution $P$ on $\theta$!

44 44 / 65 Motivation 2: Statistical Decision Theory
Loss function: $l(\theta, \delta(x))$, where $\delta(x)$ is a decision based on data $x$. It determines the penalty for predicting $\delta(x)$ if $\theta$ is the true parameter.
E.g. squared loss: $l(\theta, \delta(x)) = (\theta - \delta(x))^2$. However, $\delta(x)$ does not have to be an estimate of $\theta$.

45 45 / 65 Frequentist risk
$R(\theta, \delta) = E_X[l(\theta, \delta(x))] = \int_{x \in \mathcal{X}} l(\theta, \delta(x))\,p(x \mid \theta)\,dx$
for a fixed $\theta$ and different $x \in \mathcal{X}$.
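The frequentist risk averages the loss over data sets while the parameter stays fixed, which suggests a Monte Carlo estimate. The sketch below assumes Gaussian data with known noise level, the sample mean as the decision rule, and squared loss. For this setup the exact risk is $\sigma^2 / n$, which the simulation should approach. All numeric settings are arbitrary.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

theta, sigma = 2.0, 1.0      # fixed true parameter and known noise level
n, trials = 10, 20_000       # data set size and number of simulated data sets


def delta(xs):
    """Decision rule: estimate theta by the sample mean."""
    return sum(xs) / len(xs)


def squared_loss(true_theta, decision):
    return (true_theta - decision) ** 2


losses = []
for _ in range(trials):
    xs = [rng.gauss(theta, sigma) for _ in range(n)]  # one data set x ~ p(x | theta)
    losses.append(squared_loss(theta, delta(xs)))

risk_estimate = sum(losses) / trials
exact_risk = sigma ** 2 / n

print(risk_estimate, exact_risk)
```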

46 46 / 65 How to decide which decision rule is best
1. Admissibility: Never dominated everywhere by another decision rule. Not practical, since one decision rule rarely dominates another in real cases.
courses/260-spring10/lectures/lecture2.pdf

47 47 / 65 How to decide which decision rule is best
2. Restricted classes of procedures: For instance, we can restrict ourselves to the unbiased case (i.e. $E_\theta[\hat{\theta}] = \theta$). However, many good procedures are biased. Moreover, some unbiased procedures are inadmissible.

48 48 / 65 How to decide which decision rule is best
3. Minimax: Choose the decision rule with the lowest worst-case (maximum) risk.
courses/260-spring10/lectures/lecture2.pdf

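49 49 / 65 Bayesian decision theory
Posterior risk: $\rho(\pi, \delta(x)) = \int l(\theta, \delta(x))\,p(\theta \mid x)\,d\theta$, where $p(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta)$.
The Bayes action $\delta^*(x)$ for any fixed $x$ is the decision $\delta(x)$ that minimizes the posterior risk.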

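50 50 / 65 Bayesian decision theory (2)
For example, let us calculate the posterior risk for $l(\theta, \delta(x)) = (\theta - \delta(x))^2$:
$\rho = \int (\theta - \delta(x))^2\,p(\theta \mid x)\,d\theta = \delta(x)^2 - 2\delta(x) \int \theta\,p(\theta \mid x)\,d\theta + \int \theta^2\,p(\theta \mid x)\,d\theta$
Setting the derivative to zero gives the Bayes action:
$\frac{\partial \rho}{\partial \delta(x)} = 2\delta(x) - 2\int \theta\,p(\theta \mid x)\,d\theta = 0 \;\Rightarrow\; \delta^*(x) = \int \theta\,p(\theta \mid x)\,d\theta$,
which turns out to be the posterior mean! For $l(\theta, \delta(x)) = |\theta - \delta(x)|$, the optimal decision is to choose the posterior median.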
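Both claims on this slide can be checked by brute force. The sketch below assumes a small discrete posterior over five parameter values (made up for illustration) and searches a grid of candidate decisions for the minimizer of the posterior risk under squared and absolute loss.

```python
# Assumed discrete posterior p(theta | x) over a handful of parameter values.
thetas = [0.0, 1.0, 2.0, 3.0, 10.0]
posterior = [0.1, 0.2, 0.4, 0.2, 0.1]


def squared_loss(theta, decision):
    return (theta - decision) ** 2


def absolute_loss(theta, decision):
    return abs(theta - decision)


def posterior_risk(decision, loss):
    """Expected loss of a decision under the posterior."""
    return sum(p * loss(t, decision) for t, p in zip(thetas, posterior))


# Candidate decisions on a grid from 0.00 to 10.00.
candidates = [i / 100 for i in range(0, 1001)]

best_squared = min(candidates, key=lambda d: posterior_risk(d, squared_loss))
best_absolute = min(candidates, key=lambda d: posterior_risk(d, absolute_loss))

posterior_mean = sum(t * p for t, p in zip(thetas, posterior))

print(best_squared, posterior_mean)  # squared loss picks the posterior mean
print(best_absolute)                 # absolute loss picks the posterior median
```

The outlying value 10.0 pulls the posterior mean upward but leaves the median untouched, which is why the two losses disagree here.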

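51 51 / 65 Wrap up
Frequentist risk: $R(\theta, \delta) = E_X[l(\theta, \delta(x))] = \int_{x \in \mathcal{X}} l(\theta, \delta(x))\,p(x \mid \theta)\,dx$
Bayesian risk: $\rho(\pi, \delta(x)) = E_{p(\theta \mid x)}[l(\theta, \delta(x))] = \int l(\theta, \delta(x))\,p(\theta \mid x)\,d\theta$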

52 52 / 65 Motivation 3: Posterior predictive distribution
Given a posterior $p(\theta \mid x)$ and a new observation $x^*$, the posterior predictive distribution is
$p(x^* \mid x) = \int p(x^* \mid \theta)\,p(\theta \mid x)\,d\theta = E_{p(\theta \mid x)}[p(x^* \mid \theta)]$
This distribution takes into account all possible values of $\theta$, each weighted by its posterior probability. This virtue is called model averaging and exists only in Bayesian models!
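Model averaging can be contrasted with plugging in a single parameter estimate. The sketch below assumes a coin with a uniform prior on its heads probability and a made-up data set of three heads and one tail, and evaluates the integrals with a simple midpoint rule. The helper names are hypothetical.

```python
def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


heads, tails = 3, 1  # assumed observed data x


def prior(theta):
    return 1.0  # uniform prior density on [0, 1]


def likelihood(theta):
    return theta ** heads * (1.0 - theta) ** tails


evidence = integrate(lambda t: likelihood(t) * prior(t), 0.0, 1.0)


def posterior(theta):
    return likelihood(theta) * prior(theta) / evidence


# Posterior predictive probability that the next toss is heads:
# the likelihood of heads (theta) averaged over the posterior.
predictive_heads = integrate(lambda t: t * posterior(t), 0.0, 1.0)

# Plug-in alternative: use the maximum likelihood estimate only.
plug_in_heads = heads / (heads + tails)

print(predictive_heads, plug_in_heads)
```

The averaged prediction is less extreme than the plug-in one because it keeps the uncertainty about `theta` that four tosses cannot remove.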

53 53 / 65 The model selection problem We are given two hypotheses that claim to explain a certain data set. Both give similar prediction quality. Which one should we choose? Which one explains the data better?

54 54 / 65 Motivation 4: Bayes quantifies model selection
Hypothesis 1 ($H_1$): Likelihood: $p_{H_1}(x \mid \theta_1)$, Prior: $p_{H_1}(\theta_1)$
Hypothesis 2 ($H_2$): Likelihood: $p_{H_2}(x \mid \theta_2)$, Prior: $p_{H_2}(\theta_2)$
We can alternatively treat the hypothesis as a random variable $H \in \{1, 2\}$ that determines the type of the distribution $p(\cdot)$:
$p_{H_1}(x \mid \theta_1) = p(x \mid \theta_1, H = 1)$
$p_{H_2}(x \mid \theta_2) = p(x \mid \theta_2, H = 2)$
Let us also place a prior on the hypothesis variable. Unless we have a good reason to do otherwise, we are agnostic about the two hypotheses: $P(H = 1) = P(H = 2)$.

55 55 / 65 Motivation 4: Bayes quantifies model selection
Now let us take into account all possible model parameter realizations for both hypotheses (i.e. calculate the evidence):
$p(x \mid H = 1) = \int p(x \mid \theta_1, H = 1)\,p(\theta_1 \mid H = 1)\,d\theta_1$
$p(x \mid H = 2) = \int p(x \mid \theta_2, H = 2)\,p(\theta_2 \mid H = 2)\,d\theta_2$
This operation is called MARGINALIZING OUT!
Nuisance variable: A variable that is not of interest for the current analysis.
Rule of thumb: Marginalize out nuisance variables as much as you can!

56 56 / 65 Motivation 4: Bayes quantifies model selection
Now apply Bayes' theorem to calculate the posterior on the hypotheses:
$P(H \mid x) = \frac{p(x \mid H)\,P(H)}{p(x)}$
Choose the hypothesis with the higher posterior probability: compare $P(H = 1 \mid x)$ and $P(H = 2 \mid x)$.
- Since $p(x)$ does not depend on $H$, its magnitude has no effect on the comparison.
- Since we chose a uniform prior on the hypotheses ($P(H = 1) = P(H = 2)$), the magnitude of $P(H)$ also has no effect.
Hence, it suffices to calculate $p(x \mid H = 1) / p(x \mid H = 2)$. This metric is called the Bayes factor [Kass and Raftery, 1995]. Choose $H_1$ if the Bayes factor is greater than 1, and $H_2$ otherwise.
The model evidence serves as a quantitative metric for model selection in the Bayesian setting.
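A Bayes factor can be computed exactly for a simple coin example. The sketch below assumes two hypotheses that are not from the lecture: under H1 the coin is exactly fair (all prior mass on $\theta = 1/2$), and under H2 the heads probability has a uniform prior on $[0, 1]$. For a specific sequence with $k$ heads in $n$ tosses, the H2 evidence is the Beta integral $k!\,(n-k)!\,/\,(n+1)!$.

```python
from fractions import Fraction
from math import factorial

n, k = 10, 9  # assumed data: 9 heads in a sequence of 10 tosses

# Evidence under H1: theta is fixed at 1/2, so no integration is needed.
evidence_h1 = Fraction(1, 2) ** n

# Evidence under H2: integral of theta^k (1 - theta)^(n - k) over a uniform prior.
evidence_h2 = Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

bayes_factor = evidence_h1 / evidence_h2
chosen = "H1" if bayes_factor > 1 else "H2"

print(evidence_h1, evidence_h2, bayes_factor, chosen)
```

With such a lopsided sequence the fair-coin hypothesis has the smaller evidence, so the comparison favors H2.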

57 57 / 65 Supervised learning
Given a set of observations $x_1, x_2, \ldots, x_N$ and the corresponding outcomes (labels) $y_1, y_2, \ldots, y_N$, learn a function $y = f(x)$.
A naive solution is linear regression: $y = w^T x$
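A least-squares fit is one common way to learn the weights of a linear model. The sketch below assumes one-dimensional inputs augmented with a constant feature, so that $w = (w_0, w_1)$, and solves the 2x2 normal equations by hand on a tiny made-up data set. The lecture itself does not prescribe this fitting method.

```python
# Assumed toy data generated by y = 1 + 2x without noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

n = len(xs)
sum_x = sum(xs)
sum_xx = sum(x * x for x in xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Normal equations for features (1, x):
#   [n      sum_x ] [w0]   [sum_y ]
#   [sum_x  sum_xx] [w1] = [sum_xy]
det = n * sum_xx - sum_x * sum_x
w0 = (sum_xx * sum_y - sum_x * sum_xy) / det
w1 = (n * sum_xy - sum_x * sum_y) / det


def f(x):
    """Learned function y = w^T (1, x)."""
    return w0 + w1 * x


print(w0, w1, f(4.0))
```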

58 58 / 65 Types of supervised learning
Classification: $y \in \{a, b, c, \ldots, k\}$
Regression: $y \in \mathbb{R}$
Semi-supervised learning: A (large) subset of the training set does not have labels.
Active learning: The model asks for the labels of the most important observations.
Structured output learning: $y$ is a structure (e.g. a graph).

59 59 / 65 Unsupervised learning
Given a set of observations $x_1, x_2, \ldots, x_N$, learn a model that does X. A common choice of X is to infer chunks of data, called clusters. This problem is called clustering.

60 60 / 65 Discriminative versus Generative models
Joint model: $p(x, y)$.
Generative model: $p(y \mid x) = \frac{p(y)\,p(x \mid y)}{p(x)}$.
Discriminative model: deals directly with $p(y \mid x)$.

61 61 / 65 Parametric and nonparametric models
Parametric model: The structure of the training data is stored in a predetermined set of parameters. These parameters are sufficient for prediction, so there is no need to store the training data.
Nonparametric model: The number of parameters in the model grows with the training data size. The training data also has to be stored for prediction.

62 62 / 65 Take home 1: The Bayesian data analysis pipeline
1. Given i.i.d. data $X = \{x_1, \ldots, x_N\}$.
2. How do you parameterize its generation process? Design a likelihood density $p(x_n \mid \theta)$.
3. What is your prior belief about the model parameters? Design a prior density $p(\theta)$.
4. Do learning. Infer the posterior: $p(\theta \mid X) = \prod_{n=1}^{N} p(x_n \mid \theta)\,p(\theta) / p(X)$.
5. Do prediction. The likelihood of a new observation $x^*$ is $p(x^* \mid X) = \int p(x^* \mid \theta)\,p(\theta \mid X)\,d\theta$.
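The five steps can be traced end to end in one of the few cases where the posterior is available in closed form. The sketch below assumes a Gaussian likelihood with known noise variance and a Gaussian prior on the mean, for which the posterior and the posterior predictive are again Gaussian. The data and prior settings are made up for illustration.

```python
# Step 1: assumed i.i.d. data.
data = [1.0, 2.0, 3.0]

# Step 2: likelihood p(x_n | theta) = N(x_n | theta, noise_var), noise_var known.
noise_var = 1.0

# Step 3: prior p(theta) = N(theta | prior_mean, prior_var).
prior_mean, prior_var = 0.0, 1.0

# Step 4: posterior p(theta | X) is Gaussian; precisions (inverse variances) add up.
post_precision = 1.0 / prior_var + len(data) / noise_var
post_var = 1.0 / post_precision
post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)

# Step 5: posterior predictive p(x* | X) is Gaussian with the posterior mean
# and with the parameter uncertainty added to the observation noise.
pred_mean = post_mean
pred_var = post_var + noise_var

print(post_mean, post_var)
print(pred_mean, pred_var)
```

The posterior mean sits between the prior mean and the sample mean, and the predictive variance is larger than the noise variance because the parameter is still uncertain.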

63 63 / 65 Wait... all is well, but...
How can we calculate the posterior $p(\theta \mid X) = \prod_{n=1}^{N} p(x_n \mid \theta)\,p(\theta) / p(X)$, especially $p(X)$?
In many cases you cannot. You can only approximate it. And this is what this course is all about!

64 64 / 65 Take home 2
Read Bishop, Sections 1.2.3, 1.2.4; Michael I. Jordan's lecture notes: courses/260-spring10/lectures/lecture2.pdf

65 65 / 65 Take home 3: Assignment 1 =)
Implement the variational inference scheme for Bayesian linear regression using the TensorFlow API under Python and run it on the UCI Boston Housing data set.
Deadline: , 23:59:59 İstanbul Time.
- Submit a Python script that trains the model on a randomly chosen 90% of the data set and predicts on the rest, repeats this procedure 10 times, and reports the root mean square error (RMSE) averaged across the 10 trials.
- Submit a half-page single-column report detailing your comments on the outcome (e.g. How far did we go to solve the problem? What kind of end products can we develop based on this model?)
Hint: See Bishop, Section 10.3.
