Probabilistic Graphical Models & Applications


1 Probabilistic Graphical Models & Applications: Learning of Graphical Models. Bjoern Andres and Bernt Schiele, Max Planck Institute for Informatics. The slides of today's lecture are authored by, and shown with kind permission of, Christoph H. Lampert of IST Austria.

2 Outline: Conditional Random Fields; Maximum Likelihood for CRFs; Numerical Training; Latent Variables.
Supervised Learning Problem
Given training examples (x^1, y^1), ..., (x^N, y^N) ∈ X × Y:
- X: a collection of interacting random variables, e.g. image pixels
- x ∈ X: one joint assignment, e.g. a specific image, out of all possible joint assignments
- Y: a collection of interacting random variables, e.g. part locations
- y ∈ Y: one joint assignment, e.g. a specific pose, out of all possible joint assignments
(Images: HumanEva dataset)
Goal: be able to make predictions for new inputs, i.e. learn a function f: X → Y.

3 Supervised Learning Problem
[Figure: factor graph over body parts, with variables Y_top, Y_head, Y_torso, Y_rarm, Y_larm, Y_rhnd, Y_lhnd, Y_rleg, Y_lleg, Y_rfoot, Y_lfoot connected to the image X through factors such as F^(1)_top and F^(2)_top,head]
Step 1: define a proper graph structure over X and Y.
Step 2: define a proper parameterization of p(y|x; θ), e.g. p(y|x; θ) = (1/Z) exp( Σ_{i=1}^d θ_i φ_i(x, y) ).
Step 3: learn the parameters θ from training data, e.g. by maximum likelihood (today).
Step 4: for a new x ∈ X, make a prediction, e.g. y* = argmax_{y∈Y} p(y|x; θ*) (next lecture).

4 Parameterization
Goal: define feature functions φ_i: X × Y → R for i = 1, ..., d, or jointly φ: X × Y → R^d.
Main step: for each factor F ∈ F we define some φ_F(x_F, y_F): X_F × Y_F → R^{d_F}, where (x_F, y_F) are those random variables of (x, y) that appear in F.
Example: pose estimation (X: image pixels, Y: part locations)
- φ_head(x, y_i) = the pixels of the image x at the location specified by y_i
- φ_head-torso(y_i, y_j) = the distance between the locations y_i and y_j
Example: image segmentation (X: image pixels, Y: per-pixel foreground/background flag)
- unary terms: φ_i(x_i, y_i) = ( x_i · ⟦y_i = 0⟧, (1 − x_i) · ⟦y_i = 1⟧ )
- pairwise terms: φ_ij(y_i, y_j) = ⟦y_i ≠ y_j⟧
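To make the segmentation feature functions concrete, here is a minimal Python sketch, assuming binary labels y_i ∈ {0, 1} and grayscale pixel values x_i ∈ [0, 1]; the function names are illustrative, not from the lecture:

```python
import numpy as np

def phi_unary(x_i, y_i):
    # unary feature phi_i(x_i, y_i) = ( x_i * [y_i == 0], (1 - x_i) * [y_i == 1] )
    return np.array([x_i * (y_i == 0), (1.0 - x_i) * (y_i == 1)])

def phi_pairwise(y_i, y_j):
    # pairwise feature phi_ij(y_i, y_j) = [y_i != y_j] (indicator of a label change)
    return np.array([float(y_i != y_j)])

# a bright pixel (x_i = 0.9) activates the first feature when labelled 0
# and the second feature when labelled 1
print(phi_unary(0.9, 0))   # [0.9 0. ]
print(phi_unary(0.9, 1))   # [0.  0.1]
print(phi_pairwise(0, 1))  # [1.]
```

The learned weights θ then decide how strongly each of these features rewards or penalizes a labelling.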

5 Parameterization
Goal: define feature functions φ_i: X × Y → R for i = 1, ..., d, or jointly φ: X × Y → R^d.
Two common options (combinations of them also work):
1) combine factors by stacking their vectors:
   φ(x, y) = ( φ_F(x_F, y_F) )_{F∈F},  p(y|x; θ) = (1/Z(x; θ)) exp⟨θ, φ(x, y)⟩ with ⟨θ, φ(x, y)⟩ = Σ_{F∈F} ⟨θ_F, φ_F(x_F, y_F)⟩
2) combine factors by summing their vectors:
   φ(x, y) = Σ_{F∈F} φ_F(x_F, y_F),  p(y|x; θ) = (1/Z(x; θ)) exp⟨θ, φ(x, y)⟩ with ⟨θ, φ(x, y)⟩ = Σ_{F∈F} ⟨θ, φ_F(x_F, y_F)⟩
Result: a log-linear model with parameter vector θ (sometimes w, for weight vector), i.e. a Conditional Random Field.

6-7 Maximum Likelihood Parameter Estimation
Maximize the likelihood of the outputs y^1, ..., y^N for the inputs x^1, ..., x^N:
θ* = argmax_{θ∈R^D} p(y^1, ..., y^N | x^1, ..., x^N; θ)
   (i.i.d.) = argmax_{θ∈R^D} Π_{n=1}^N p(y^n | x^n; θ)
   (take −log) = argmin_{θ∈R^D} −Σ_{n=1}^N log p(y^n | x^n; θ)   (the negative conditional log-likelihood of D)
   = argmin_{θ∈R^D} Σ_{n=1}^N [ −log e^{⟨θ, φ(x^n, y^n)⟩} + log Z(x^n; θ) ]
   = argmin_{θ∈R^D} Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]   (the second term is the log-partition function)

8-9 MAP Estimation of θ
Treat θ as a random variable; maximize the posterior p(θ|D):
p(θ|D) (Bayes) = p(x^1, y^1, ..., x^N, y^N | θ) p(θ) / p(D)
       (i.i.d.) = (p(θ) / p(D)) Π_{n=1}^N p(y^n | x^n; θ) p(x^n)
p(θ): prior belief on θ (cannot be estimated from data).
θ* = argmax_{θ∈R^D} p(θ|D) = argmin_{θ∈R^D} [ −log p(θ|D) ]
   = argmin_{θ∈R^D} [ −log p(θ) − Σ_{n=1}^N log p(y^n | x^n; θ) − Σ_{n=1}^N log p(x^n) ]   (the last term is independent of θ)
   = argmin_{θ∈R^D} [ −log p(θ) − Σ_{n=1}^N log p(y^n | x^n; θ) ]

10 θ* = argmin_{θ∈R^D} [ −log p(θ) − Σ_{n=1}^N log p(y^n | x^n; θ) ]
Choices for p(θ):
- p(θ) = const. for all θ ∈ R^D (uniform; on R^D not really a distribution):
  θ* = argmin_{θ∈R^D} [ −Σ_{n=1}^N log p(y^n | x^n; θ) + const. ]   (the negative conditional log-likelihood)
- p(θ) := const. · e^{−(λ/2) ‖θ‖²} (Gaussian):
  θ* = argmin_{θ∈R^D} [ (λ/2) ‖θ‖² − Σ_{n=1}^N log p(y^n | x^n; θ) + const. ]   (the regularized negative conditional log-likelihood)

11 Probabilistic Models for Structured Prediction: Summary
Negative (regularized) conditional log-likelihood (of D):
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
(λ → 0 makes it unregularized.)
Probabilistic parameter estimation or training means solving θ* = argmin_{θ∈R^D} L(θ).
This is the same optimization problem as for multi-class logistic regression.
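For intuition, on a toy problem where Y is small enough to enumerate, L(θ) can be evaluated exactly by brute force. A minimal sketch (the joint feature map and data below are made up for illustration, not from the lecture):

```python
import numpy as np
from itertools import product

def neg_cond_log_likelihood(theta, data, phi, labels, num_vars, lam=0.0):
    """L(theta) = lam/2 ||theta||^2
                  + sum_n [ -<theta, phi(x^n, y^n)> + log sum_y exp<theta, phi(x^n, y)> ],
    with the log-partition function computed by enumerating Y = labels^num_vars."""
    L = 0.5 * lam * np.dot(theta, theta)
    for x, y in data:
        scores = np.array([np.dot(theta, phi(x, yy))
                           for yy in product(labels, repeat=num_vars)])
        L += -np.dot(theta, phi(x, y)) + np.logaddexp.reduce(scores)
    return L

# toy joint feature map for two binary output variables:
# two unary-style features and one label-change indicator
phi = lambda x, y: np.array([x[0] * y[0], x[1] * y[1], float(y[0] != y[1])])
data = [((0.9, 0.8), (1, 1)), ((0.1, 0.2), (0, 0))]
print(neg_cond_log_likelihood(np.zeros(3), data, phi, labels=(0, 1), num_vars=2))
```

This enumeration is only feasible for tiny Y; the later slides address what to do when Y is exponentially large.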

12 Negative Conditional Log-Likelihood (Toy Example)
[Figure: plots of the negative conditional log-likelihood for a toy example]

13 Steepest Descent Minimization
Goal: minimize L(θ). Input: tolerance ε > 0.
1: θ_cur ← 0
2: repeat
3:   v ← ∇_θ L(θ_cur)
4:   η ← argmin_{η∈R} L(θ_cur − ηv)
5:   θ_cur ← θ_cur − ηv
6: until ‖v‖ < ε
Output: θ_cur
Alternatives: L-BFGS (second-order descent without an explicit Hessian), conjugate gradient.
We always need (at least) the gradient of L.
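A minimal steepest-descent loop matching the pseudocode above, with a backtracking line search standing in for the exact argmin over η (a sketch; the quadratic test function at the bottom is only a stand-in for L(θ)):

```python
import numpy as np

def steepest_descent(L, grad_L, theta0, eps=1e-6, max_iter=1000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        v = grad_L(theta)                     # step 3: v <- grad L(theta_cur)
        if np.linalg.norm(v) < eps:           # step 6: stop when ||v|| < eps
            break
        eta = 1.0                             # step 4: approximate line search (backtracking)
        while L(theta - eta * v) > L(theta) - 1e-4 * eta * np.dot(v, v):
            eta *= 0.5
            if eta < 1e-12:
                break
        theta = theta - eta * v               # step 5: theta_cur <- theta_cur - eta * v
    return theta

# stand-in objective: a smooth convex function with minimum at (1, 1, 1)
L = lambda th: float(np.sum((th - 1.0) ** 2))
grad_L = lambda th: 2.0 * (th - 1.0)
print(steepest_descent(L, grad_L, np.zeros(3)))   # approximately [1. 1. 1.]
```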

14-21 Steepest Descent Minimization: minimize L(θ)
[Figures: a sequence of plots of the negative log-likelihood over w (σ² = 0.10 and other values of σ²), showing successive steepest-descent steps approaching the minimum]

22 L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} ( e^{⟨θ, φ(x^n, y)⟩} / Σ_{ȳ∈Y} e^{⟨θ, φ(x^n, ȳ)⟩} ) φ(x^n, y) ]
         = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} p(y|x^n; θ) φ(x^n, y) ]
         = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
∇²_θ L(θ) = λ Id_{D×D} + Σ_{n=1}^N Cov_{y∼p(y|x^n;θ)}[ φ(x^n, y) ]
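The same brute-force enumeration gives the gradient on a toy model: build p(y|x^n; θ) over all of Y, take the expected feature vector, and subtract it from the observed features. A sketch with an illustrative feature map:

```python
import numpy as np
from itertools import product

def gradient(theta, data, phi, labels, num_vars, lam=0.0):
    """grad L(theta) = lam*theta - sum_n [ phi(x^n, y^n) - E_{y ~ p(y|x^n;theta)} phi(x^n, y) ]."""
    g = lam * theta
    for x, y_obs in data:
        ys = list(product(labels, repeat=num_vars))
        feats = np.array([phi(x, y) for y in ys])            # |Y| x D matrix
        scores = feats @ theta
        p = np.exp(scores - np.logaddexp.reduce(scores))     # p(y | x; theta) for every y
        expected = p @ feats                                  # E_y[ phi(x, y) ]
        g -= phi(x, y_obs) - expected
    return g

phi = lambda x, y: np.array([x[0] * y[0], x[1] * y[1], float(y[0] != y[1])])
data = [((0.9, 0.8), (1, 1))]
print(gradient(np.zeros(3), data, phi, labels=(0, 1), num_vars=2))
```

A finite-difference check against the brute-force L(θ) from the earlier sketch is an easy way to validate such a gradient.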

23 L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
L is continuous (not discrete) and C∞-differentiable on all of R^D.
[Figure: slice through the objective value for θ_x ∈ [−3, 5], θ_y = 0]

24 ∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
For λ = 0: if E_{y∼p(y|x^n;θ)} φ(x^n, y) = φ(x^n, y^n) for all n, then ∇_θ L(θ) = 0, i.e. θ is a critical point of L (local minimum/maximum/saddle point).
Interpretation: we want the model distribution to match the empirical one: E_{y∼p(y|x;θ)} φ(x, y) = φ(x, y_obs).
E.g. image segmentation:
- φ_unary: the correct amount of foreground vs. background
- φ_pairwise: the correct amount of fg/bg transitions (smoothness)

25 L(θ) = λid D D + N E y p(y x n ;θ) { φ(x n, y)φ(x n, y) } positive definite Hessian matrix L(θ) is convex θ L(θ) = 0 implies global minimum slice through objective value (θ x [ 3, 5], θ y = 0) 17 / 44

26-27 Milestone I: Probabilistic Training (Conditional Random Fields)
- p(y|x; θ) is log-linear in θ ∈ R^D.
- Training: minimize the (regularized) negative conditional log-likelihood L(θ).
- L(θ) is differentiable and convex; gradient descent will find the global optimum with ∇_θ L(θ) = 0.
- Same structure as multi-class logistic regression (which corresponds to no structure and Y = {1, ..., K}).
For logistic regression, this is where the textbook ends. We're done.
For conditional random fields, we're not in safe waters yet!

28-29 Solving the Training Optimization Problem Numerically
Task: compute v = ∇_θ L(θ_cur) and evaluate L(θ_cur + ηv), with
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} p(y|x^n; θ) φ(x^n, y) ]
Problem: Y is typically very (exponentially) large:
- binary image segmentation: |Y| = 2^(number of pixels)
- ranking N images: |Y| = N!, e.g. for N = 1000 this is astronomically large.
We must use the structure in Y, or we're lost.

30-31 Solving the Training Optimization Problem Numerically
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Z(x^n; θ) ]
Computing the gradient naively: O(K^M · N · D). Line search naively: O(K^M · N · D) per evaluation of L.
- N: number of samples (10s to 1,000,000s)
- D: dimension of the feature space
- M: number of output variables
- K: number of possible labels of each output variable (2 to 1000s)

32 Solving the Training Optimization Problem Numerically
For a graphical model with factors F, the features decompose (stacked version):
φ(x, y) = ( φ_F(x, y_F) )_{F∈F}
E_{y∼p(y|x;θ)} φ(x, y) = E_{y∼p(y|x;θ)} ( φ_F(x, y_F) )_{F∈F} = ( E_{y_F∼p(y_F|x;θ)} φ_F(x, y_F) )_{F∈F}
E_{y_F∼p(y_F|x;θ)} φ_F(x, y_F) = Σ_{y_F∈Y_F} p(y_F|x; θ) φ_F(x, y_F)   (only K^{|F|} terms; the p(y_F|x; θ) are the factor marginals)
The factor marginals μ_F = p(y_F|x; θ) are much smaller than the complete joint distribution p(y|x; θ); compute them by probabilistic inference.

33 Solving the Training Optimization Problem Numerically
For a graphical model with factors F, the features decompose (summed version):
φ(x, y) = Σ_{F∈F} φ_F(x, y_F)
E_{y∼p(y|x;θ)} φ(x, y) = Σ_{F∈F} E_{y∼p(y|x;θ)} φ_F(x, y_F) = Σ_{F∈F} E_{y_F∼p(y_F|x;θ)} φ_F(x, y_F)
Again, we need only the factor marginals μ_F = p(y_F|x; θ).
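Given factor marginals μ_F(y_F) = p(y_F | x; θ) from some inference routine (not shown here), the expectation needed for the gradient reduces to a small sum per factor. A sketch for the summed feature convention, with illustrative inputs:

```python
import numpy as np
from itertools import product

def expected_features(factors, x, labels):
    """E_{y ~ p(y|x;theta)}[phi(x, y)] = sum_F sum_{y_F} mu_F(y_F) * phi_F(x, y_F).
    `factors` is a list of (phi_F, mu_F, arity) tuples; mu_F maps a factor
    configuration y_F to its marginal probability (assumed to be given)."""
    total = None
    for phi_F, mu_F, arity in factors:
        for y_F in product(labels, repeat=arity):
            term = mu_F(y_F) * phi_F(x, y_F)
            total = term if total is None else total + term
    return total

# toy example: one pairwise factor over two binary variables with uniform marginals
phi_F = lambda x, y_F: np.array([float(y_F[0] != y_F[1])])
mu_F = lambda y_F: 0.25
print(expected_features([(phi_F, mu_F, 2)], x=None, labels=(0, 1)))   # [0.5]
```

Each inner sum has only K^{|F|} terms, which is exactly why the per-factor decomposition makes the gradient tractable.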

34-35 Solving the Training Optimization Problem Numerically
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
Computing the gradient, if marginal inference is possible, drops from O(K^M · N · D) to O(M · K^{|F_max|} · N · D), with
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
- N: number of samples (10s to 1,000,000s)
- D: dimension of the feature space
- M: number of output variables
- K: number of possible labels of each output variable

36 Solving the Training Optimization Problem Numerically
What if the training set D is too large (e.g. millions of examples)?
Stochastic Gradient Descent (SGD): minimize L(θ), but without ever computing L(θ) or ∇L(θ) exactly.
In each gradient descent step:
- pick a random subset D' ⊂ D, often just 1-3 elements
- compute the approximate gradient
  ∇̃L(θ) = λθ − (|D| / |D'|) Σ_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
- update the parameters in the negative direction of the approximate gradient
- a line search would still be too slow; use a fixed step-size rule η_t instead
(more: Bottou & Bousquet, "The Tradeoffs of Large Scale Learning", NIPS 2008)

37-42 Stochastic Gradient Descent (SGD)
Important property of SGD: v = ∇̃L(θ) is random; call its distribution q.
v is an unbiased estimate of the true gradient: E_{v∼q}[v] = ∇L(θ), so on average the parameter updates point in the right direction.
[Figures: plots of the negative log-likelihood (σ² = 0.10) illustrating noisy SGD update directions]

43 Stochastic Gradient Descent (SGD)
SGD converges to argmin_θ L(θ) if η_t is chosen appropriately: Σ_t η_t = ∞ and Σ_t η_t² < ∞, e.g. η_t ∝ 1/t.
SGD needs more iterations, but each one is much faster.
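A minimal SGD loop with the η_t ∝ 1/t rule. It is written against an abstract per-example gradient contribution, so the expectation inside could come from enumeration on a toy model or from marginal inference in practice; all names here are illustrative:

```python
import numpy as np

def sgd(per_example_grad, data, dim, lam=0.01, eta0=1.0, steps=1000, batch_size=2, seed=0):
    """SGD with decaying step size eta_t = eta0 / t (so sum_t eta_t = inf, sum_t eta_t^2 < inf).
    `per_example_grad(theta, x, y)` should return
    -( phi(x, y) - E_{y' ~ p(y'|x;theta)}[phi(x, y')] ), the unregularized contribution of one example."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    N = len(data)
    for t in range(1, steps + 1):
        batch = [data[i] for i in rng.integers(0, N, size=batch_size)]
        # approximate gradient: regularizer plus the rescaled mini-batch sum
        g = lam * theta + (N / batch_size) * sum(
            per_example_grad(theta, x, y) for x, y in batch)
        theta -= (eta0 / t) * g
    return theta
```

With batch_size = len(data) and a fixed step size this reduces to ordinary batch gradient descent.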

44 Solving the Training Optimization Problem Numerically
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
Computing the gradient (if marginal inference is possible): O(M · K^{|F_max|} · N · D), with
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
- N: number of samples
- D: dimension of the feature space: φ_{i,j}: 1 to 10s, φ_i: 10s to 10,000s
- M: number of output variables
- K: number of possible labels of each output variable

45 Solving the Training Optimization Problem Numerically
Typical feature functions in image segmentation:
- φ_i(y_i, x) ∈ R^1000: local image features, e.g. bag-of-words; ⟨θ_i, φ_i(y_i, x)⟩ acts as a local classifier (like logistic regression)
- φ_{ij}(y_i, y_j) = ⟦y_i = y_j⟧ ∈ R^1 for (i, j) ∈ E: test for same label; ⟨θ_{ij}, φ_{ij}(y_i, y_j)⟩ is a penalizer for label changes (if θ_{ij} > 0)
Combined: argmax_y p(y|x; θ) is a smoothed version of the local cues.

46 Solving the Training Optimization Problem Numerically
Typical feature functions in pose estimation:
- φ_i(y_i, x) ∈ R^1000: local image representation, e.g. HoG; ⟨θ_i, φ_i(y_i, x)⟩ is a local confidence map
- φ_{i,j}(y_i, y_j) = good_fit(y_i, y_j) ∈ R^1: test for geometric fit; ⟨θ_{ij}, φ_{ij}(y_i, y_j)⟩ is a penalizer for unrealistic poses
Together: argmax_y p(y|x) is a sanitized version of the local cues.
[Figures: original image, local confidence, local + geometry]

47-48 Solving the Training Optimization Problem Numerically
Idea: split learning of the unary potentials into two parts: the local classifiers and their importance.
Two-Stage Training:
- pre-train local classifiers f^{y_i}(x) ≈ log p(y_i|x)
- use φ_i(y_i, x) := f^{y_i}(x) ∈ R^K (low-dimensional)
- keep φ_{ij}(y_i, y_j) as before
- perform CRF learning with φ_i and φ_{ij}
Advantages: a lower-dimensional feature space during inference (faster); f^{y_i}(x) can be any classifier, e.g. a deep network, a random forest, ...
Disadvantage: if the local classifiers are bad, CRF training cannot fix that.
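A sketch of the two-stage idea: pre-train any local classifier on per-site features, then use its per-class log-probabilities as the low-dimensional unary features φ_i(y_i, x). Scikit-learn logistic regression is used here as one possible choice of classifier, the data is random toy data, and the subsequent CRF learning step is not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1: pre-train a local classifier f^{y_i}(x) ~ log p(y_i | x)
rng = np.random.default_rng(0)
X_local = rng.normal(size=(500, 20))           # 500 sites, 20 raw features each
y_local = (X_local[:, 0] > 0).astype(int)      # toy binary labels
clf = LogisticRegression(max_iter=1000).fit(X_local, y_local)

# Stage 2: use the classifier outputs as low-dimensional unary features.
# One reading of phi_i(y_i, x) := f^{y_i}(x) in R^K: a K-dim vector holding the
# log-probability of label y_i in position y_i (K = number of classes).
def phi_unary(y_i, x_i):
    logp = clf.predict_log_proba(x_i.reshape(1, -1))[0]
    feat = np.zeros_like(logp)
    feat[y_i] = logp[y_i]
    return feat

print(phi_unary(1, X_local[0]))   # only this K-dim phi_i and phi_ij enter CRF training
```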

49 Solving the Training Optimization Problem Numerically
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
Computing the gradient: O(K^M · N · D), with
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
(N: number of samples; D: dimension of the feature space; M: number of output variables; K: number of possible labels of each output variable)
What if exact marginal inference is not possible?

50 Training with Approximate Inference
Problem: exact marginal inference might be computationally too costly.
Idea: just use approximate marginals, e.g. from loopy BP.
Possibility 1: everything works fine.
Possibility 2: nothing works anymore: the approximate gradient might not be a descent direction (the objective does not decrease along its negative); then gradient descent might terminate too early (line search chooses η = 0), not converge at all, or converge to a bad solution, ...
In general, there is no way to tell which of the possibilities occurs.

51-52 Training with an Approximate Likelihood
Question: how to train if (approximate) maximum likelihood learning does not work?
Idea: optimize a simpler quantity instead of L.
Pseudolikelihood [Besag, 1987]: pretend p(y) ≈ p_PL(y), where
p_PL(y) = Π_{i∈V} p(y_i | y_{V∖{i}}) = Π_{i∈V} p(y_i | y_{N(i)})

53-54 Training with an Approximate Likelihood: Pseudolikelihood (PL)
p(y|x; θ) ≈ p_PL(y|x; θ) = Π_{i∈V} p(y_i | y_{N(i)}, x; θ)
For training data {(x^1, y^1), ..., (x^N, y^N)}:
L_PL(θ) = −Σ_{n=1}^N log p_PL(y^n | x^n; θ) = −Σ_{n=1}^N Σ_{i∈V} log p(y_i^n | y_{N(i)}^n, x^n; θ)
        = Σ_{n=1}^N Σ_{i∈V} [ −⟨θ, φ(y^n, x^n)⟩ + log Σ_{z∈Y_i} e^{⟨θ, φ(y_1^n, ..., y_{i−1}^n, z, y_{i+1}^n, ..., y_{|V|}^n, x^n)⟩} ]
The partition functions sum only over one variable at a time, so they are tractable.
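The pseudolikelihood objective is straightforward to write down with a generic joint feature map φ(y, x): for each variable i, all other variables are clamped to their observed values and the local partition function sums only over the |Y_i| values of y_i. A minimal sketch (the feature map and data are illustrative):

```python
import numpy as np

def neg_pseudo_log_likelihood(theta, x, y_obs, phi, labels, lam=0.0):
    """L_PL contribution of one example:
    sum_i [ -<theta, phi(y, x)> + log sum_z exp<theta, phi(y_1,...,y_{i-1}, z, y_{i+1},..., x)> ]."""
    L = 0.5 * lam * np.dot(theta, theta)
    y_obs = list(y_obs)
    for i in range(len(y_obs)):
        scores = []
        for z in labels:                       # sum over one variable only -> tractable
            y_mod = list(y_obs)
            y_mod[i] = z
            scores.append(np.dot(theta, phi(y_mod, x)))
        L += -np.dot(theta, phi(y_obs, x)) + np.logaddexp.reduce(np.array(scores))
    return L

# toy 3-variable chain: one "label count" feature and one label-change count
phi = lambda y, x: np.array([float(sum(y)),
                             sum(float(y[k] != y[k + 1]) for k in range(len(y) - 1))])
print(neg_pseudo_log_likelihood(np.zeros(2), x=None, y_obs=[0, 1, 1], phi=phi, labels=(0, 1)))
```

Factors that do not involve y_i appear in both the numerator and the local partition sum, so each conditional p(y_i | y_{N(i)}, x; θ) is computed correctly even with the full joint feature map.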

55 Training with an Approximate Likelihood: Piecewise Training (PW)
Idea: optimize a simpler quantity instead of L.
Piecewise training [Sutton, McCallum, 2005]: pretend
p(y|x) ≈ Π_{F∈F} p_F(y_F|x), with p_F(y_F|x) = (1/Z_F(x; θ)) e^{⟨θ_F, φ_F(y_F, x)⟩}

56-58 Training with an Approximate Likelihood: Piecewise Training (PW)
p(y|x) ≈ Π_{F∈F} p_F(y_F|x; θ_F), with p_F(y_F|x) ∝ e^{⟨θ_F, φ_F(y_F, x)⟩}
For training data {(x^1, y^1), ..., (x^N, y^N)}:
L_PW(θ) = −Σ_{n=1}^N log p_PW(y^n | x^n; θ) = −Σ_{n=1}^N Σ_{F∈F} log p_F(y_F^n | x^n)
        = Σ_{F∈F} Σ_{n=1}^N [ −⟨θ_F, φ_F(y_F^n, x^n)⟩ + log Σ_{ȳ_F∈Y_F} e^{⟨θ_F, φ_F(ȳ_F, x^n)⟩} ]
The partition functions sum only over the variables of one factor at a time, which is usually tractable.
The optimization decomposes into a sum over the θ_F, so it is easy to parallelize.
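Piecewise training optimizes an independent log-linear objective per factor, so each θ_F can be fit separately (and in parallel). A minimal sketch of the per-factor objective, with an illustrative pairwise factor:

```python
import numpy as np
from itertools import product

def neg_piecewise_log_likelihood(theta_F, phi_F, data_F, labels, arity):
    """Per-factor objective:
    sum_n [ -<theta_F, phi_F(y_F^n, x^n)> + log sum_{y_F} exp<theta_F, phi_F(y_F, x^n)> ]."""
    L = 0.0
    for x, y_F_obs in data_F:
        scores = np.array([np.dot(theta_F, phi_F(y_F, x))
                           for y_F in product(labels, repeat=arity)])
        L += -np.dot(theta_F, phi_F(y_F_obs, x)) + np.logaddexp.reduce(scores)
    return L

# toy pairwise factor with a single label-change feature
phi_F = lambda y_F, x: np.array([float(y_F[0] != y_F[1])])
data_F = [(None, (0, 0)), (None, (0, 1))]
print(neg_piecewise_log_likelihood(np.array([0.0]), phi_F, data_F, labels=(0, 1), arity=2))
```

The full L_PW(θ) is just the sum of such terms over all factors, which is why the optimization decomposes over the θ_F.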

59 Solving the Training Optimization Problem Numerically
CRF training is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use:
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} p(y|x^n; θ) φ(x^n, y) ] ∈ R^D
A lot of research goes into accelerating CRF training:
- problem: Y too large; solutions: exploit structure ((loopy) belief propagation), fast sampling (contrastive divergence), surrogate L (pseudo-likelihood, piecewise training)
- problem: N too large; solution: mini-batches (stochastic gradient descent)
- problem: D too large; solution: pretrained φ_unary (two-stage training)
