Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr


1 Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr

2 Remember: Normal Distribution. Distribution over x. Density function with parameters µ (mean) and σ² (variance). [Figure: density of the normal distribution]

3 Remember: Multivariate Normal Distribution. Distribution over vectors x ∈ ℝ^D. Density function with parameters µ ∈ ℝ^D (mean vector) and Σ ∈ ℝ^{D×D} (covariance matrix): N(x | µ, Σ) = 1/((2π)^{D/2} |Σ|^{1/2}) exp(−½ (x − µ)^T Σ^{−1} (x − µ)). [Figure: example for D = 2 — density and a sample from the distribution]
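As an illustration of this density formula, here is a minimal numpy sketch; the mean vector, covariance matrix, and evaluation point are made-up example values, not taken from the slides.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Density of a D-dimensional normal distribution N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Example for D = 2; mean, covariance, and query point are illustrative values.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(mvn_density(np.array([0.2, -0.1]), mu, Sigma))
```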

4 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 4

5 Statistics & Machine Learning Machine learning: tightly related to inductive statistics. Two areas in Statistics: Descriptive Statistics: description and examination of the properties of data. Mean values Variances Difference between Populations Inductive Statistics: What conclusions can be drawn from data about the underlying reality? Explanations for observations Model building Relationships and patterns in the data 5

6 Frequentist vs. Bayesian Probabilities. Frequentist probabilities describe the possibility of an occurrence of an intrinsically stochastic event (e.g., a coin toss). They are defined as limits of relative frequencies of possible outcomes in a repeatable experiment: if one throws a fair coin 1000 times, it will land on heads about 500 times; in 1 gram of Potassium-40, around 260,000 nuclei decay per second.

7 Frequentist vs. Bayesian Probabilities. Bayesian (subjective) probabilities: here, the reason for uncertainty is attributed to a lack of information. How likely is it that suspect X killed the victim? New information (e.g., fingerprints) can change these subjective probabilities. The Bayesian view is more important in machine learning. Frequentist and Bayesian perspectives are mathematically equivalent; in Bayesian statistics, probabilities are just used to model different things (lack of information).

8 Bayesian Statistics. Thomas Bayes: "An Essay towards solving a Problem in the Doctrine of Chances", published in 1763. The work of Bayes laid the foundations for inductive statistics. Bayesian probabilities offer an important perspective on uncertainty and probability.

9 Bayesian Probability in Machine Learning Model building: find an explanation for observations. What is the most likely model? Trade-off between Prior knowledge (a priori distribution over models), Evidence (data, observations). Bayesian Perspective: Evidence (data) changes the subjective probability for models (explanation), A posteriori model probability, MAP hypothesis. 9

10 Bayesian Model of Learning. Nature conducts a random experiment and determines θ; a system with parameter θ generates observations y = f_θ(X). Bayesian inference inverts this process: nature randomly determines θ according to P(θ); the system with parameter θ produces observations y for input X. Given these assumptions about how y is generated, and given the observed values of y for input matrix X: what is the most likely true value of θ? What is the most likely value y* for a new input x*?

11 Bayesian Model of Learning. The system with parameter θ produces observations y for input X; nature randomly determines θ according to P(θ). Bayes' equation: P(θ | X, y) = P(y | X, θ) P(θ) / P(y | X). Here P(y | X, θ) is the likelihood of observing y given X when the model parameter is θ; P(θ) is the a priori ("prior") probability of nature choosing θ; P(y | X) is the probability of observing y given X, independent of θ; and P(θ | X, y) is the a posteriori ("posterior") probability that θ is the correct parameter given the observations y, X.

12 Bayesian Model of Learning. Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ) (the likelihood). Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) (the a posteriori, "posterior", distribution).

13 Bayesian Model of Learning. Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ). Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ). Posterior ∝ likelihood × prior.

14 Bayesian Model of Learning. Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ). Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ). Most likely value y* for a new input x* (Bayes-optimal decision): y* = arg max_y P(y | x*, y, X).

15 Bayesian Model of Learning. Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ). Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ). Most likely value y* for a new input x* (Bayes-optimal decision): y* = arg max_y P(y | x*, y, X), using the predictive distribution P(y | x*, y, X) = ∫ P(y, θ | x*, y, X) dθ = ∫ P(y | x*, θ) P(θ | y, X) dθ (Bayesian model averaging). Often computationally infeasible, but has a closed-form solution in some cases.

16 Linear Regression Models. Training data: input matrix X with rows x_i^T = (x_i1, ..., x_im) for i = 1, ..., n, and target vector y = (y_1, ..., y_n)^T. Model f_θ: X → Y with f_θ(x) = x^T θ.

17 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 17

18 Probabilistic Linear Regression. Assumption 1: Nature generates the parameter θ of a linear function f_θ(x) = x^T θ according to p(θ). [Figure: linear function f_θ(x_i) = x_i^T θ evaluated at an input x_i]

19 Probabilistic Linear Regression. Assumption 1: Nature generates the parameter θ of a linear function f_θ(x) = x^T θ according to p(θ). Assumption 2: Given inputs X, nature generates outputs y: y_i = f_θ(x_i) + ε_i with ε_i ~ N(ε | 0, σ²), hence p(y_i | x_i, θ) = N(y_i | x_i^T θ, σ²) and p(y | X, θ) = N(y | Xθ, σ² I). In reality, we have y, X and want to make inferences about θ.
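A small sketch of this data-generation process under the two assumptions; the dimensions, prior standard deviation σ_p, and noise level σ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3              # number of instances and attributes (illustrative)
sigma_p, sigma = 1.0, 0.3

# Assumption 1: nature draws theta from the prior p(theta) = N(theta | 0, sigma_p^2 I).
theta = rng.normal(0.0, sigma_p, size=m)

# Assumption 2: given inputs X, outputs are y_i = x_i^T theta + eps_i with eps_i ~ N(0, sigma^2).
X = rng.normal(size=(n, m))
y = X @ theta + rng.normal(0.0, sigma, size=n)
```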

20 Maximum Likelihood for Linear Regression. Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1}^n N(y_i | x_i^T θ, σ²), because training instances are independent.

21 Maximum Likelihood for Linear Regression. Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1}^n N(y_i | x_i^T θ, σ²) = arg max_θ ∏_{i=1}^n (1/√(2πσ²)) exp(−(1/(2σ²)) (y_i − x_i^T θ)²).

22 Maximum Likelihood for Linear Regression. Maximum-likelihood (ML) model: θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1}^n N(y_i | x_i^T θ, σ²) = arg max_θ ∏_{i=1}^n (1/√(2πσ²)) exp(−(1/(2σ²)) (y_i − x_i^T θ)²). The log is a monotonic transformation: arg max_θ P(y | X, θ) = arg max_θ log P(y | X, θ). Terms that are constant in θ can also be dropped.

23 Maximum Likelihood for Linear Regression. Maximum-likelihood (ML) model (using log e^x = x): θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1}^n N(y_i | x_i^T θ, σ²) = arg max_θ ∏_{i=1}^n (1/√(2πσ²)) exp(−(1/(2σ²)) (y_i − x_i^T θ)²) = arg min_θ ∑_{i=1}^n (y_i − x_i^T θ)². The log is a monotonic transformation: arg max_θ P(y | X, θ) = arg max_θ log P(y | X, θ); terms that are constant in θ can be dropped. This is unregularized linear regression with squared loss.

24 Maximum Likelihood for Linear Regression. Maximum-likelihood (ML) model: θ_ML = arg min_θ ∑_{i=1}^n (y_i − x_i^T θ)². Known as the least-squares method in statistics. Setting the derivative to zero gives the closed-form solution θ_ML = (X^T X)^{−1} X^T y. Inversion of X^T X is numerically unstable.
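A minimal numpy sketch of this closed-form ML solution on synthetic data (all values illustrative); np.linalg.lstsq solves the same least-squares problem without forming the explicit inverse, which mitigates the numerical instability mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # illustrative design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Closed form: theta_ML = (X^T X)^{-1} X^T y.
theta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# lstsq solves the same least-squares problem without forming the explicit inverse,
# which is numerically more stable.
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
```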

25 Maximum Likelihood for Linear Regression. The maximum-likelihood model is based only on the data; it is independent of any prior knowledge or domain assumptions. Calculating the maximum-likelihood model is numerically unstable. Regularized least squares works much better in practice.

26 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 26

27 MAP for Linear Regression. Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X).

28 MAP for Linear Regression. Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ) = arg max_θ [log P(y | X, θ) + log P(θ)] = arg max_θ [∑_{i=1}^n log P(y_i | x_i, θ) + log P(θ)] = arg max_θ [∑_{i=1}^n log N(y_i | x_i^T θ, σ²) + log P(θ)], because training instances are independent.

29 MAP for Linear Regression: Prior. Nature generates the parameter θ of the linear model f_θ(x) = x^T θ according to p(θ). For convenience, assume p(θ) = N(θ | 0, σ_p² I) = (1/(2πσ_p²)^{m/2}) exp(−(1/(2σ_p²)) θ^T θ). The prior variance σ_p² controls the strength of the prior. [Figure: prior density p(θ) centered at 0]

30 MAP for Linear Regression. Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ [∑_{i=1}^n log N(y_i | x_i^T θ, σ²) + log P(θ)] = arg max_θ [∑_{i=1}^n (log(1/√(2πσ²)) − (1/(2σ²)) (y_i − x_i^T θ)²) + log(1/(2πσ_p²)^{m/2}) − (1/(2σ_p²)) θ^T θ]. All terms that are constant in θ can be dropped.

31 MAP for Linear Regression. Maximum-a-posteriori (MAP) model: θ_MAP = arg max_θ [∑_{i=1}^n log N(y_i | x_i^T θ, σ²) + log P(θ)] = arg max_θ [∑_{i=1}^n (log(1/√(2πσ²)) − (1/(2σ²)) (y_i − x_i^T θ)²) + log(1/(2πσ_p²)^{m/2}) − (1/(2σ_p²)) θ^T θ] = arg min_θ [(1/(2σ²)) ∑_{i=1}^n (y_i − x_i^T θ)² + (1/(2σ_p²)) θ^T θ] = arg min_θ [∑_{i=1}^n (y_i − x_i^T θ)² + (σ²/σ_p²) θ^T θ]. With λ = σ²/σ_p², this is ℓ2-regularized linear regression with squared loss (ridge regression).

32 MAP for Linear Regression. Maximum-a-posteriori (MAP) model: θ_MAP = arg min_θ [∑_{i=1}^n (y_i − x_i^T θ)² + (σ²/σ_p²) θ^T θ], i.e., the same optimization criterion as ridge regression with λ = σ²/σ_p². Analytic solution (see lecture on ridge regression): θ_MAP = (X^T X + (σ²/σ_p²) I)^{−1} X^T y.
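A short sketch of this MAP/ridge solution, again on synthetic data with assumed values for σ and σ_p.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # illustrative data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

sigma, sigma_p = 0.1, 1.0                          # assumed noise / prior standard deviations
lam = sigma**2 / sigma_p**2                        # ridge parameter lambda = sigma^2 / sigma_p^2

# theta_MAP = (X^T X + lambda I)^{-1} X^T y, solved as a linear system.
m = X.shape[1]
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
```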

33 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 33

34 Posterior for Linear Regression. Posterior distribution of θ given y, X: P(θ | y, X) = P(y | X, θ) P(θ) / P(y | X) = (1/Z) P(y | X, θ) P(θ).

35 Posterior for Linear Regression. Posterior distribution of θ given y, X: P(θ | y, X) = P(y | X, θ) P(θ) / P(y | X) = (1/Z) P(y | X, θ) P(θ) = (1/Z) N(y | Xθ, σ² I) N(θ | 0, σ_p² I). The normal distribution is conjugate to itself; therefore the product of the Gaussian likelihood and the Gaussian prior is again a normal distribution in θ.

36 Posterior for Linear Regression. Posterior distribution of θ given y, X: P(θ | y, X) = P(y | X, θ) P(θ) / P(y | X) = (1/Z) P(y | X, θ) P(θ) = (1/Z) N(y | Xθ, σ² I) N(θ | 0, σ_p² I) = N(θ | μ̄, A^{−1}), with μ̄ = (X^T X + (σ²/σ_p²) I)^{−1} X^T y and A = σ^{−2} X^T X + σ_p^{−2} I. The mean value of the posterior is θ_MAP.
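A sketch that computes the posterior mean μ̄ and covariance A^{−1} from these formulas; the data and the values of σ and σ_p are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # illustrative data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
sigma, sigma_p = 0.1, 1.0                          # assumed noise / prior standard deviations

m = X.shape[1]
# Posterior precision A = sigma^{-2} X^T X + sigma_p^{-2} I; the posterior covariance is A^{-1}.
A = X.T @ X / sigma**2 + np.eye(m) / sigma_p**2
posterior_cov = np.linalg.inv(A)
# Posterior mean (= theta_MAP) = (X^T X + (sigma^2/sigma_p^2) I)^{-1} X^T y,
# equivalently A^{-1} X^T y / sigma^2.
posterior_mean = posterior_cov @ X.T @ y / sigma**2
```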

37 Example: MAP Solution for Regression. Training data: instances x_1, x_2, x_3 with targets y_1 = 2, y_2 = 3, y_3 = 4. Matrix notation (adding a constant attribute x_0 = 1) gives the design matrix X and the target vector y = (2, 3, 4)^T.

38 Example: MAP Solution for Regression. Choose the standard deviation of the prior σ_p = 1 and a noise parameter σ. Compute θ_MAP = (X^T X + (σ²/σ_p²) I)^{−1} X^T y.

39 Example: MAP Solution for Regression. Predictions of the model on the training data: ŷ = X θ_MAP.

40 Posterior and Regularized Loss Function. MAP model: θ_MAP = arg max_θ P(θ | y, X) = arg max_θ P(y | X, θ) P(θ) / P(y | X) = arg max_θ P(y | X, θ) P(θ) = arg max_θ [log P(y | X, θ) + log P(θ)] = arg max_θ [∑_{i=1}^n log P(y_i | x_i, θ) + log P(θ)] = arg min_θ [∑_{i=1}^n (y_i − x_i^T θ)² + (σ²/σ_p²) θ^T θ]. The likelihood corresponds to the loss function, the prior to the regularizer.

41 Sequential Learning. Training examples arrive sequentially. Each training example (x_i, y_i) changes the prior p_{i−1}(θ) into the posterior p_{i−1}(θ | y_i, x_i), which becomes the new prior p_i(θ): P(θ | y, X) = (1/Z) P_0(θ) P(y | X, θ) = (1/Z) P_0(θ) ∏_{i=1}^n P(y_i | x_i, θ) = (1/Z) P_0(θ) P(y_1 | x_1, θ) P(y_2 | x_2, θ) P(y_3 | x_3, θ) ⋯ P(y_n | x_n, θ), where P_1(θ) ∝ P_0(θ) P(y_1 | x_1, θ), P_2(θ) ∝ P_1(θ) P(y_2 | x_2, θ), and so on.
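Because prior and likelihood are both Gaussian here, the sequential update can be carried out in closed form by accumulating the posterior precision and a scaled sufficient statistic; the following sketch illustrates this with made-up data and assumed σ, σ_p.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, sigma_p = 0.2, 1.0                          # assumed noise / prior standard deviations
m = 2
A = np.eye(m) / sigma_p**2                         # precision of the prior P_0 = N(0, sigma_p^2 I)
b = np.zeros(m)

theta_true = np.array([0.5, -0.3])                 # "nature's" parameter (illustrative)
for i in range(20):
    x_i = rng.normal(size=m)
    y_i = x_i @ theta_true + rng.normal(0.0, sigma)
    # Multiplying the current Gaussian prior by the Gaussian likelihood of (x_i, y_i)
    # yields a Gaussian posterior, which becomes the prior for the next example.
    A += np.outer(x_i, x_i) / sigma**2
    b += y_i * x_i / sigma**2
    posterior_mean = np.linalg.solve(A, b)         # mean of P_i(theta)
```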

42 Example: Sequential Learning [from Chris Bishop, Pattern Recognition and Machine Learning]. f_θ(x) = x^T θ (one-dimensional regression). Sequential update: start with the prior P_0(θ). [Figure: prior P_0(θ) and functions sampled from P_0(θ)]

43 Example: Sequential Learning. f_θ(x) = x^T θ (one-dimensional regression). Sequential update: combine the prior P_0(θ) with the likelihood P(y_1 | x_1, θ) of the first instance (x_1, y_1). [Figure: likelihood P(y_1 | x_1, θ), prior P_0(θ), and the instance (x_1, y_1)]

44 Example: Sequential Learning. f_θ(x) = x^T θ (one-dimensional regression). Sequential update: P_1(θ) ∝ P_0(θ) P(y_1 | x_1, θ). [Figure: likelihood P(y_1 | x_1, θ), posterior P_1(θ), and functions sampled from P_1(θ)]

45 Example: Sequential Learning. f_θ(x) = x^T θ (one-dimensional regression). Sequential update: P_2(θ) ∝ P_1(θ) P(y_2 | x_2, θ). [Figure: likelihood P(y_2 | x_2, θ), posterior P_2(θ), and functions sampled from P_2(θ)]

46 Example: Sequential Learning. f_θ(x) = x^T θ (one-dimensional regression). Sequential update: P_n(θ) ∝ P_{n−1}(θ) P(y_n | x_n, θ). [Figure: likelihood P(y_n | x_n, θ), posterior P_n(θ), and functions sampled from P_n(θ)]

47 Learning and Prediction. So far, we have always separated learning from prediction. Learning: θ̂ = arg min_θ R(y, X, θ) + Ω(θ) (regularized empirical risk). Prediction: y* = f_θ̂(x*). For instance, in MAP linear regression, learning is θ_MAP = arg max_θ P(θ | y, X), and prediction is y* = θ_MAP^T x*.

48 Learning and Prediction. So far, we have always separated learning from prediction, and there are good reasons to do this: learning can require processing massive amounts of training data, which can take a long time, while predictions may have to be made in real time. However, sometimes, when relatively few data are available and an accurate prediction is worth waiting for, one can directly search for the best possible prediction: y* = arg max_y P(y | x*, y, X), the most likely y for the new input x* given the training data y, X.

49 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 49

50 Bayes-Optimal Prediction. Bayes-optimal decision: the most likely value y* for a new input x*, y* = arg max_y P(y | x*, y, X). Predictive distribution for a new input x* given the training data: P(y | x*, y, X) = ∫ P(y, θ | x*, y, X) dθ (sum rule) = ∫ P(y | θ, x*, y, X) P(θ | x*, y, X) dθ (product rule) = ∫ P(y | θ, x*) P(θ | y, X) dθ (independence assumptions from the data generation process); this is Bayesian model averaging. The Bayes-optimal decision is made by a weighted integral over all values of the model parameters. In general, there is no single model in the model space that always makes the Bayes-optimal decision.

51 Bayes-Optimal Prediction. The Bayes-optimal decision is made by a weighted integral over all model parameters: P(y | x*, y, X) = ∫ P(y | x*, θ) P(θ | y, X) dθ. The prediction of the MAP model is only the prediction made by the single most likely model θ_MAP: predictive distribution P(y | x*, θ_MAP), most likely prediction f_MAP(x*) = arg max_y P(y | x*, θ_MAP). The MAP model θ_MAP is an approximation of this weighted integral by its element with the highest weight.

52 Bayes-Optimal Prediction. The Bayes-optimal decision is made by a weighted integral over all model parameters: P(y | x*, y, X) = ∫ P(y | x*, θ) P(θ | y, X) dθ. Integration over the space of all model parameters is not generally possible. In some cases, there is a closed-form solution; in other cases, approximate numerical integration may be possible.

53 Predictive Distribution for Linear Regression. Predictive distribution for linear regression: P(y | x*, y, X) = ∫ P(y | x*, θ) P(θ | y, X) dθ = ∫ N(y | θ^T x*, σ²) N(θ | μ̄, A^{−1}) dθ = N(y | μ̄^T x*, σ² + x*^T A^{−1} x*), with μ̄ = (X^T X + (σ²/σ_p²) I)^{−1} X^T y and A = σ^{−2} X^T X + σ_p^{−2} I. Bayes-optimal prediction: y* = arg max_y P(y | x*, y, X) = μ̄^T x*.

54 Predictive Distribution: Confidence Band. Bayesian regression not only yields the prediction y* = μ̄^T x*, but a distribution over y, N(y | μ̄^T x, σ² + x^T A^{−1} x), and therefore a confidence band (e.g., a 95% confidence band around the predicted mean). [Figure: predictive mean with confidence band]
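A sketch that evaluates the predictive mean, the predictive variance σ² + x*^T A^{−1} x*, and an approximate 95% band for a new input; all data and hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # illustrative data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
sigma, sigma_p = 0.1, 1.0                          # assumed noise / prior standard deviations
m = X.shape[1]

A = X.T @ X / sigma**2 + np.eye(m) / sigma_p**2                     # posterior precision
mean = np.linalg.solve(X.T @ X + (sigma**2 / sigma_p**2) * np.eye(m), X.T @ y)

x_star = rng.normal(size=m)                                         # new test input
pred_mean = mean @ x_star                                           # Bayes-optimal prediction
pred_var = sigma**2 + x_star @ np.linalg.solve(A, x_star)           # sigma^2 + x*^T A^{-1} x*
band = 1.96 * np.sqrt(pred_var)                                     # approximate 95% band
lower, upper = pred_mean - band, pred_mean + band
```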

55 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 55

56 Linear Classification. Training data: input matrix X with rows x_i^T = (x_i1, ..., x_im) and label vector y = (y_1, ..., y_n)^T. Decision function: f_θ(x) = x^T θ. Predictive distribution: P(y | x, θ) = σ(x^T θ). Linear classifier: y(x) = arg max_y P(y | x, θ).

57 Linear Classification: Predictive Distribution. For binary classification, y ∈ {−1, +1}. Predictive distribution given the parameters θ of the linear model: P(y = +1 | x, θ) = σ(x^T θ) = 1/(1 + e^{−x^T θ}), P(y = −1 | x, θ) = 1 − P(y = +1 | x, θ). The sigmoid function maps (−∞, +∞) to [0, 1]. [Figure: p(y = +1 | x, θ) as a function of x^T θ]

58 Logistic Regression. For binary classification, y ∈ {−1, +1}. Predictive distribution given the parameters θ of the linear model: P(y = +1 | x, θ) = σ(x^T θ) = 1/(1 + e^{−x^T θ}), P(y = −1 | x, θ) = e^{−x^T θ}/(1 + e^{−x^T θ}) = 1/(1 + e^{x^T θ}). Written jointly for both classes: P(y | x, θ) = σ(y x^T θ) = 1/(1 + e^{−y x^T θ}). Classification function: y(x) = arg max_y P(y | x, θ). Called logistic regression even though it is a classification model.

59 Logistic Regression. Decision boundary: P(y = +1 | x, θ) = P(y = −1 | x, θ) = 1/2 ⟺ σ(x^T θ) = 1/2 ⟺ 1 + e^{−x^T θ} = 2 ⟺ e^{−x^T θ} = 1 ⟺ x^T θ = 0. The decision boundary is a hyperplane in input space. [Figure: hyperplane x^T θ = 0 and p(y = +1 | x, θ)]

60 Logistic Regression. For multi-class classification: θ = (θ_1, ..., θ_k). Generalize the sigmoid function to the softmax function: P(y | x, θ) = e^{x^T θ_y} / ∑_{y'} e^{x^T θ_{y'}}. The normalizer ensures that ∑_y P(y | x, θ) = 1. Classification function: y(x) = arg max_y P(y | x, θ). Called multi-class logistic regression even though it is a classification model.
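A minimal sketch of the softmax predictive distribution and the resulting classification function; the parameter matrix and test input are made-up examples.

```python
import numpy as np

def predictive_multiclass(x, Theta):
    """P(y | x, theta) via the softmax; row y of Theta is the parameter vector theta_y."""
    scores = Theta @ x
    scores = scores - scores.max()       # numerical stabilization; does not change the result
    p = np.exp(scores)
    return p / p.sum()

# Illustrative example with k = 3 classes and 2 attributes.
Theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, -1.0]])
x = np.array([0.5, -0.2])
p = predictive_multiclass(x, Theta)      # probabilities sum to 1
y_hat = int(np.argmax(p))                # classification function arg max_y P(y | x, theta)
```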

61 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 61

62 Logistic Regression: ML Model. Maximum-likelihood model: θ_ML = arg max_θ P(y | X, θ) = arg max_θ ∏_{i=1}^n 1/(1 + e^{−y_i x_i^T θ}) = arg min_θ −∑_{i=1}^n log(1/(1 + e^{−y_i x_i^T θ})) = arg min_θ ∑_{i=1}^n log(1 + e^{−y_i x_i^T θ}). There is no analytic solution; numeric optimization, for instance, using (stochastic) gradient descent.

63 Logistic Regression: ML Model. Maximum-likelihood model: θ_ML = arg min_θ ∑_{i=1}^n log(1 + e^{−y_i x_i^T θ}). Gradient: ∇_θ ∑_{i=1}^n log(1 + e^{−y_i x_i^T θ}) = ∑_{i=1}^n [1/(1 + e^{−y_i x_i^T θ})] e^{−y_i x_i^T θ} (−y_i x_i) = −∑_{i=1}^n y_i x_i e^{−y_i x_i^T θ}/(1 + e^{−y_i x_i^T θ}) = −∑_{i=1}^n y_i x_i (1 − σ(y_i x_i^T θ)).
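A sketch of plain gradient descent on this objective, using the gradient derived above (averaged over the training set rather than summed, which only rescales the step size); the data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                 # illustrative data
y = np.where(X @ np.array([2.0, -1.0]) + 0.2 * rng.normal(size=100) > 0, 1, -1)

theta = np.zeros(2)
eta = 1.0                                                     # learning rate (assumed)
for _ in range(500):
    margins = y * (X @ theta)
    # Average gradient of log(1 + exp(-y_i x_i^T theta)):
    # -(1/n) sum_i y_i x_i (1 - sigmoid(y_i x_i^T theta))
    grad = -((y * (1.0 - sigmoid(margins)))[:, None] * X).mean(axis=0)
    theta = theta - eta * grad
```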

64 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 64

65 Logistic Regression: MAP Model. Maximum-a-posteriori model with prior P(θ) = N(θ | 0, σ_p² I): θ_MAP = arg max_θ P(y | X, θ) P(θ) = arg max_θ [∏_{i=1}^n 1/(1 + e^{−y_i x_i^T θ})] N(θ | 0, σ_p² I) = arg min_θ [−∑_{i=1}^n log(1/(1 + e^{−y_i x_i^T θ})) − log N(θ | 0, σ_p² I)] = arg min_θ [∑_{i=1}^n log(1 + e^{−y_i x_i^T θ}) + (1/(2σ_p²)) θ^T θ]. There is no analytic solution; numeric optimization, for instance, using (stochastic) gradient descent.

66 Logistic Regression: MAP Model. Maximum-a-posteriori model with prior P(θ) = N(θ | 0, σ_p² I): θ_MAP = arg min_θ [∑_{i=1}^n log(1 + e^{−y_i x_i^T θ}) + (1/(2σ_p²)) θ^T θ]. Gradient: ∇_θ = −∑_{i=1}^n y_i x_i (1 − σ(y_i x_i^T θ)) + (1/σ_p²) θ.

67 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 67

68 Bayes-Optimal Prediction for Classification. Predictive distribution given the data: P(y | x*, y, X) = ∫ P(y | θ, x*) P(θ | y, X) dθ = ∫ [1/(1 + e^{−y x*^T θ})] P(θ | y, X) dθ. There is no closed-form solution for logistic regression. It is possible to approximate the integral by sampling from the posterior. Standard approximation: use only the MAP model instead of integrating over the model space.

69 Overview Basic concepts of Bayesian learning Linear regression: Maximum-likelihood model, Maximum-a-posteriori model, Posterior distribution over models, Bayesian prediction, predictive distribution, Linear classification (logistic regression): Predictive distribution, Maximum-likelihood model, Maximum-a-posteriori model, Bayesian Prediction. Nonlinear models: Gaussian processes. 69

70 Nonlinear Regression. Limitation of the models discussed so far: only a linear dependency between x and f_θ(x). Now: nonlinear models. [Figure: linear model vs. nonlinear model]

71 Feature Mappings and Kernels. Use a mapping φ to embed instances x ∈ X in a higher-dimensional feature space. A linear model in the higher-dimensional space corresponds to a nonlinear model in the input space X. Representer theorem: the model f_θ(x) = θ^T φ(x) has a representation f_α(x) = ∑_{i=1}^n α_i φ(x_i)^T φ(x) = ∑_{i=1}^n α_i k(x_i, x). The feature mapping φ(x) does not have to be computed; only the kernel function k(x_i, x) is evaluated. The feature mapping φ(x) can therefore be high- or even infinite-dimensional.

72 Generalized Linear Regression (Finite-Dimensional Case). Assumption 1: Nature generates the parameter θ of a linear function f_θ(x) = φ(x)^T θ according to p(θ) = N(θ | 0, σ_p² I). Assumption 2: Inputs are X with feature representation Φ; row i of Φ contains the row vector φ(x_i)^T. Nature generates outputs y: y_i = f_θ(x_i) + ε_i with ε_i ~ N(ε | 0, σ²), hence p(y_i | x_i, θ) = N(y_i | φ(x_i)^T θ, σ²) and p(y | X, θ) = N(y | Φθ, σ² I).

73 Generalized Linear Regression. Generalized linear model: f_θ(x) = φ(x)^T θ, y = Φθ. The parameter θ is governed by the normal distribution N(θ | 0, σ_p² I); therefore the output vector y is also normally distributed. Mean value: E[y] = E[Φθ] = Φ E[θ] = 0. Covariance: E[y y^T] = Φ E[θ θ^T] Φ^T = σ_p² Φ Φ^T = σ_p² K, with K_ij = φ(x_i)^T φ(x_j) = k(x_i, x_j). Remember: Φ Φ^T is the matrix of pairwise inner products φ(x_i)^T φ(x_j), i.e., the kernel matrix K.

74 Generalized Linear Regression (General Case). Data generation assumptions: Given inputs X, nature generates noise-free target values y′ ~ N(y′ | 0, σ_p² K). Then, nature generates observations y_i = y′_i + ε_i with noise ε_i ~ N(ε | 0, σ²). Bayesian inference: determine the predictive distribution P(y | x*, y, X) for a new test instance.

75 Reminder: Linear Regression. Bayes-optimal prediction: y* = arg max_y P(y | x*, y, X) = μ̄^T x*, with μ̄ = (X^T X + (σ²/σ_p²) I)^{−1} X^T y. The number of parameters θ_j equals the number of attributes in x.

76 Generalized Linear Regression. The mean value of the predictive distribution P(y | x*, y, X) has the form y* = ∑_{i=1}^n α_i k(x_i, x*), with α = (Φ Φ^T + (σ²/σ_p²) I)^{−1} y = (K + (σ²/σ_p²) I)^{−1} y. The number of parameters α_i equals the number of training instances.
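A sketch of this kernelized prediction with an RBF kernel (the kernel choice and its bandwidth are assumptions, as are σ and σ_p), trained on the sin(2πx) toy data used in the following example.

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2); gamma is an assumed bandwidth."""
    d = A[:, None, :] - B[None, :, :]
    return np.exp(-gamma * (d ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(25, 1))                       # inputs in [0, 1]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=25)   # y = sin(2*pi*x) + noise

sigma, sigma_p = 0.1, 1.0                                     # assumed noise / prior std
K = rbf_kernel(X, X)
# alpha = (K + (sigma^2 / sigma_p^2) I)^{-1} y
alpha = np.linalg.solve(K + (sigma**2 / sigma_p**2) * np.eye(len(X)), y)

X_star = np.linspace(0.0, 1.0, 5).reshape(-1, 1)              # new test inputs
y_star = rbf_kernel(X_star, X) @ alpha                        # y* = sum_i alpha_i k(x_i, x*)
```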

77 Example: Nonlinear Regression. Example for nonlinear regression: generate nonlinear data by y = sin(2πx) + ε with ε ~ N(ε | 0, σ²) and x ∈ [0, 1]. Nonlinear kernel: k(x, x′) = exp(−‖x − x′‖²). How do the predictive distribution and the posterior over models look?

78 Predictive Distribution. [Figure: predictive distribution (mean and confidence band) for N = 1, 2, 4, and 25 data points; the true function is y = sin(2πx)]

79 Models Sampled from the Posterior. [Figure: functions sampled from the posterior for N = 1, 2, 4, and 25 data points]

80 Summary. Linear regression: maximum-likelihood model arg max_θ P(y | X, θ); maximum-a-posteriori model arg max_θ P(θ | y, X); posterior distribution over models P(θ | y, X); Bayesian prediction, predictive distribution arg max_y P(y | x*, y, X). Linear classification (logistic regression): predictive distribution P(y | x, θ); maximum-likelihood model arg max_θ P(y | X, θ); maximum-a-posteriori model arg max_θ P(θ | y, X); Bayesian prediction arg max_y P(y | x*, y, X). Nonlinear models: Gaussian processes.
