Probabilistic Graphical Models & Applications
|
|
- Horace Reynolds
- 5 years ago
- Views:
Transcription
1 Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with kind permission of Christoph H. Lampert of IST Austria
2 Conditional Random Fields Maximum Likelihood for CRFs Numerical Training Latent Variables Supervised Learning Problem I Given training examples (x 1, y 1 ),..., (x N, y N ) X Y I I I I X : collection of interacting random variables, e.g. image pixels x X : one joint assignment, e.g. a specific image, out of all possible joint assignments Y : collection of interacting random variables, e.g. part locations y Y: one joint assignment, e.g. a specific pose, out of all possible joint assignments Images: HumanEva dataset Goal: be able to make predictions for new inputs, i.e. learn a function f : X Y. 3 / 44
3 Supervised Learning Problem X F (1) top Y top Y head F (2) top,head Step 1: define a proper graph structure of X and Y Y Yrarm torso Y larm Y rhnd Y lhnd... Y rleg... Y lleg Y rfoot Ylfoot Step 2: define a proper parameterization of p(y x; θ) p(y x; θ) = 1 Z e d i=1 θ i φ i (x,y) Step 3: learn parameters θ from training data Step 4: for new x X, make prediction e.g. maximum likelihood ( today) e.g. y = argmax p(y x; θ ) y Y ( next lecture) 4 / 44
4 Parameterization Goal: Define feature functions, φ i : X Y R d, for i = 1,..., d, or just φ : X Y R d Main step: For each factor F F we define some φ F (x F, y F ) : X F Y F R d F, where (x F, y F ) are those random variables of (x, y) that appear in F. Example: pose estimation (X : image pixels, Y : part locations) φ head (x, y i ) = the pixels in the image x at the location specified by y i φ head-torso (y i, y j ) = the distance between locations y i and y j Example: image segmentation (X : image pixels, Y : per-pixel foreground/background flag) ( ) xi y unary terms φ i (x i, y i ) = i = 0 (1 x i ) y i = 1 pairwise terms φ ij (y i, y j ) = y i y j 5 / 44
5 Parameterization Goal: Define feature functions, φ i : X Y R d, for i = 1,..., d, or just φ : X Y R d Two common options (combinations of them also work): 1) combine factors by stacking their vectors φ(x, y) = ( φ F (x F, y F ) ) F F p(y x; θ) = 1 Z(x; θ) e θ,φ(x,y) with θ, φ(x, y) = θ F, φ F (x F, y F ) F F 2) combine factors by summing their vectors φ(x, y) = F F φ F (x F, y F ) p(y x; θ) = 1 Z(x; θ) e θ,φ(x,y) with θ, φ(x, y) = θ, φ F (x F, y F ) F F Result: log-linear model with parameter vector θ (sometimes: w, for weight vector) Conditional Random Field 6 / 44
6 Maximum Likelihood Parameter Estimation Maximize likelihood of outputs y 1,..., y N for inputs x 1,..., x N θ = argmax θ R D log( ) = argmin θ R D p(y 1,..., y N x 1,..., x N ; θ) i.i.d. = argmax θ R D N log p(y n x n ; θ) }{{} negative conditional log-likelihood (of D) N p(y n x n ; θ) 7 / 44
7 Maximum Likelihood Parameter Estimation Maximize likelihood of outputs y 1,..., y N for inputs x 1,..., x N θ = argmax θ R D log( ) = argmin θ R D = argmin θ R D = argmin θ R D p(y 1,..., y N x 1,..., x N ; θ) i.i.d. = argmax θ R D N log p(y n x n ; θ) }{{} negative conditional log-likelihood (of D) N [ log e θ,φ(x n,y) log Z(x n ; θ) ] N [ θ, φ(x n, y n ) + log y Y e θ,φ(xn,y) } {{ } log-partition function ] N p(y n x n ; θ) 7 / 44
8 MAP Estimation of θ Treat θ as random variable; maximize posterior p(θ D) 8 / 44
9 MAP Estimation of θ Treat θ as random variable; maximize posterior p(θ D) p(θ D) Bayes = p(x 1, y 1,..., x N, y N θ)p(θ) i.i.d. = p(θ) p(d) p(θ): prior belief on θ (cannot be estimated from data). N p(y n x n ; θ) p(y n x n ) θ = argmax θ R D = argmin θ R D [ ] p(θ D) = argmin log p(θ D) θ R D [ N log p(θ) [ = argmin log p(θ) θ R D N ] log p(y n x n ; θ) + log p(y n x n ) }{{} indep. of θ ] log p(y n x n ; θ) 8 / 44
10 θ [ = argmin log p(θ) θ R D N log p(y n x n ; θ) ] Choices for p(θ): p(θ) : const. θ R D (uniform; in R D not really a distribution) } {{ } θ [ = argmin N log p(y n x n ; θ) + const. ] negative conditional log-likelihood p(θ) := const. e λ 2 θ 2 (Gaussian) θ [ λ N = argmin θ R D 2 θ 2 log p(y n x n ; θ) + const. ] }{{} regularized negative conditional log-likelihood 9 / 44
11 Probabilistic Models for Structured Prediction - Summary Negative (Regularized) Conditional Log-Likelihood (of D) L(θ) = λ 2 θ 2 + (λ 0 makes it unregularized) N [ θ, φ(x n, y n ) + log e θ,φ(xn,y) ] Probabilistic parameter estimation or training means solving θ = argmin L(θ). θ R D y Y Same optimization problem as for multi-class logistic regression. 10 / 44
12 Negative Conditional Log-Likelihood (Toy Example) / 44
13 Steepest Descent Minimization minimize L(θ) input tolerance ɛ > 0 1: θ cur 0 2: repeat 3: v w L(θ cur ) 4: η argmin η R L(θ cur ηv) 5: θ cur θ cur ηv 6: until v < ɛ output θ cur Alternatives: L-BFGS (second-order descent without explicit Hessian) Conjugate Gradient We always need (at least) the gradient of L. 12 / 44
14 Steepest Descent Minimization minimize L(θ) 3 negative log likelihood σ 2 = / 44
15 Steepest Descent Minimization minimize L(θ) 3 negative log likelihood σ 2 =0.10 w / 44
16 Steepest Descent Minimization minimize L(θ) 3 negative log likelihood σ 2 =0.10 w / 44
17 Steepest Descent Minimization minimize L(θ) 3 negative log likelihood σ 2 =0.10 w / 44
18 Steepest Descent Minimization minimize L(θ) 3 negative log likelihood σ 2 = w / 44
19 Steepest Descent Minimization minimize L(θ) 3 negative log likelihood σ 2 = w / 44
20 Steepest Descent Minimization minimize L(θ) 3 negative log likelihood σ 2 = w / 44
21 Steepest Descent Minimization minimize L(θ) 3 negative log likelihood σ 2 = w / 44
22 L(θ) = λ 2 θ 2 + θ L(θ) = λθ + = λθ + = λθ + N [ θ, φ(x n, y n ) + log e θ,φ(xn,y) ] y Y N [ φ(x n, y n y Y ) e θ,φ(xn,y) φ(x n, y) ] ȳ Y e θ,φ(xn,ȳ) N [ φ(x n, y n ) p(y x n ; θ)φ(x n, y) ] y Y N [ φ(x n, y n ) E y p(y x n ;θ)φ(x n, y) ] L(θ) = λid D D + N Ey p(y x n ;θ){ φ(x n, y)φ(x n, y) } 14 / 44
23 L(θ) = λ N [ 2 θ 2 + θ, φ(x n, y n ) + log e θ,φ(xn,y) ] y Y continuous (not discrete), C -differentiable on all R D slice through objective value (θ x [ 3, 5], θ y = 0) 15 / 44
24 N [ θ L(θ) = λθ + φ(x n, y n ) E y p(y x n ;θ)φ(x n, y) ] For λ = 0: E y p(y x n ;θ)φ(x n, y) = φ(x n, y n ) θ L(θ) = 0, critical point of L (local minimum/maximum/saddle point). Interpretation: We want the model distribution to match the empirical one: E y p(y x;θ) φ(x, y)! = φ(x, y obs ) E.g. image segmentation φ unary : correct amount of foreground vs. background φ pairwise : correct amount of fg/bg transitions smoothness 16 / 44
25 L(θ) = λid D D + N E y p(y x n ;θ) { φ(x n, y)φ(x n, y) } positive definite Hessian matrix L(θ) is convex θ L(θ) = 0 implies global minimum slice through objective value (θ x [ 3, 5], θ y = 0) 17 / 44
26 Milestone I: Probabilistic Training (Conditional Random Fields) p(y x; θ) log-linear in θ R D. Training: minimize (regularized) negative conditional log-likelihood, L(θ) L(θ) is differentiable and convex, gradient descent will find global optimum with θ L(θ) = 0 Same structure as multi-class logistic regression ( ˆ= no structure and Y = {1,..., K}). 18 / 44
27 Milestone I: Probabilistic Training (Conditional Random Fields) p(y x; θ) log-linear in θ R D. Training: minimize (regularized) negative conditional log-likelihood, L(θ) L(θ) is differentiable and convex, gradient descent will find global optimum with θ L(θ) = 0 Same structure as multi-class logistic regression ( ˆ= no structure and Y = {1,..., K}). For logistic regression: this is where the textbook ends. We re done. For conditional random fields: we re not in safe waters, yet! 18 / 44
28 Solving the Training Optimization Problem Numerically Task: Compute v = θ L(θ cur ), evaluate L(θ cur + ηv): L(θ) = λ N [ 2 θ 2 + θ, φ(x n, y n ) + log e θ,φ(xn,y) ] y Y θ L(θ) = λ N 2 θ + [ φ(x n, y n ) p(y x n ; θ)φ(x n, y) ] y Y 19 / 44
29 Solving the Training Optimization Problem Numerically Task: Compute v = θ L(θ cur ), evaluate L(θ cur + ηv): L(θ) = λ N [ 2 θ 2 + θ, φ(x n, y n ) + log e θ,φ(xn,y) ] y Y θ L(θ) = λ N 2 θ + [ φ(x n, y n ) p(y x n ; θ)φ(x n, y) ] y Y Problem: Y typically is very (exponentially) large: binary image segmentation: Y = ranking N images: Y = N!, e.g. N = 1000: Y We must use the structure in Y, or we re lost. 19 / 44
30 Solving the Training Optimization Problem Numerically N [ θ L(θ) = λθ + φ(x n, y n ) E y p(y x n ;θ)φ(x n, y) ] Computing the Gradient (naive): O(K M ND) L(θ) = λ N [ 2 θ 2 + θ, φ(x n, y n ) + log Z(x n ; θ) ] Line Search (naive): O(K M ND) per evaluation of L N: number of samples D: dimension of feature space M: number of output variables K: number of possible labels of each output variables 20 / 44
31 Solving the Training Optimization Problem Numerically N [ θ L(θ) = λθ + φ(x n, y n ) E y p(y x n ;θ)φ(x n, y) ] Computing the Gradient (naive): O(K M ND) L(θ) = λ N [ 2 θ 2 + θ, φ(x n, y n ) + log Z(x n ; θ) ] Line Search (naive): O(K M ND) per evaluation of L N: number of samples D: dimension of feature space M: number of output variables 10s to 1,000,000s K: number of possible labels of each output variables 2 to 1000s 20 / 44
32 Solving the Training Optimization Problem Numerically For a graphical model with factors F, the features decompose: φ(x, y) = ( ) φ F (x, y F ) F F ( ) E y p(y x;θ) φ(x, y) = E y p(y x;θ) φ F (x, y F ) F F = ( E yf p(y F x;θ)φ F (x, y F ) ) F F E yf p(y F x;θ)φ F (x, y F ) = y F Y F }{{} K F terms p(y F x; θ) φ }{{} F (x, y F ) factor marginals Factor marginals µ F = p(y F x; θ) are much smaller than complete joint distribution p(y x; θ), compute them by probabilistic inference. 21 / 44
33 Solving the Training Optimization Problem Numerically For a graphical model with factors F, the features decompose: φ(x, y) = F F φ F (x, y F ) E y p(y x;θ) φ(x, y) = E y p(y x;θ) φ F (x, y F ) F F = E y p(y x;θ) φ F (x, y F ) F F = E yf p(y F x;θ)φ F (x, y F ) F F Again, we need only the factor marginals µ F = p(y F x; θ) 22 / 44
34 Solving the Training Optimization Problem Numerically N [ θ L(θ) = λθ + φ(x n, y n ) E y p(y x n ;θ)φ(x n, y) ] Computing the Gradient (if marginal inference is possible): O(K M ND), O(MK Fmax ND): L(θ) = λ N [ 2 θ 2 + θ, φ(x n, y n ) + log e θ,φ(xn,y) ] y Y N: number of samples D: dimension of feature space M: number of output variables K: number of possible labels of each output variables 23 / 44
35 Solving the Training Optimization Problem Numerically N [ θ L(θ) = λθ + φ(x n, y n ) E y p(y x n ;θ)φ(x n, y) ] Computing the Gradient (if marginal inference is possible): O(K M ND), O(MK Fmax ND): L(θ) = λ N [ 2 θ 2 + θ, φ(x n, y n ) + log e θ,φ(xn,y) ] y Y N: number of samples 10s to 1,000,000s D: dimension of feature space M: number of output variables K: number of possible labels of each output variables 23 / 44
36 Solving the Training Optimization Problem Numerically What, if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD) Minimize L(θ), but without ever computing L(θ) or L(θ) exactly In each gradient descent step: Pick random subset D D, often just 1 3 elements Compute approximate gradient L(θ) = λθ + D [ φ(x n, y n ) E y p(y x n ;θ)φ(x n, y) ] D (x n,y n ) D Update parameter in negative direction of approximate gradient Line search would still be too slow use fixed stepsize rule, η t, instead (new parameter) more: see [Bottou, Bousquet: The Tradeoffs of Large Scale Learning, NIPS 2008] also: 24 / 44
37 Stochastic Gradient Descent (SGD) Important property of SGD: 3 negative log likelihood σ 2 =0.10 v = L(θ) is random, let s call it s distribution q / 44
38 Stochastic Gradient Descent (SGD) Important property of SGD: v = L(θ) is random, let s call it s distribution q 25 / 44
39 Stochastic Gradient Descent (SGD) Important property of SGD: v = L(θ) is random, let s call it s distribution q 25 / 44
40 Stochastic Gradient Descent (SGD) Important property of SGD: v = L(θ) is random, let s call it s distribution q 25 / 44
41 Stochastic Gradient Descent (SGD) Important property of SGD: v = L(θ) is random, let s call it s distribution q v is unbiased estimate of true gradient: E v q [v] = L(θ) on average, parameters updates point in the right direction 25 / 44
42 Stochastic Gradient Descent (SGD) Important property of SGD: v = L(θ) is random, let s call it s distribution q v is unbiased estimate of true gradient: E v q [v] = L(θ) on average, parameters updates point in the right direction 25 / 44
43 Stochastic Gradient Descent (SGD) Important property of SGD: v = L(θ) is random, let s call it s distribution q v is unbiased estimate of true gradient: E v q [v] = L(θ) on average, parameters updates point in the right direction SGD converges to argmin θ L(θ) η t must be chosen appropriately: t η t =, t η2 t <, e.g. η t 1 t SGD needs more iterations, but each one is much faster 25 / 44
44 Solving the Training Optimization Problem Numerically N [ θ L(θ) = λθ + φ(x n, y n ) E y p(y x n ;θ)φ(x n, y) ] Computing the Gradient (if marginal inference is possible):o(mk Fmax ND): L(θ) = λ N [ 2 θ 2 + θ, φ(x n, y n ) + log e θ,φ(xn,y) ] y Y N: number of samples D: dimension of feature space: φ i,j 1 10s, φ i : 10s to 10000s M: number of output variables K: number of possible labels of each output variables 26 / 44
45 Solving the Training Optimization Problem Numerically Typical feature functions in image segmentation: φ i (y i, x) R 1000 : local image features, e.g. bag-of-words θ i, φ i (y i, x) : local classifier (like logistic-regression) φ pw (y i, y j ) = (i,j) E y i = y j R 1 : test for same label θ ij, φ ij (y i, y j ) : penalizer for label changes (if θ ij > 0) combined: argmax y p(y x; θ) is smoothed version of local cues 27 / 44
46 Solving the Training Optimization Problem Numerically Typical feature functions in pose estimation: φ i (y i, x) R 1000 : local image representation, e.g. HoG θ i, φ i (y i, x) : local confidence map φ i,j (y i, y j ) = good fit(y i, y j ) R 1 : test for geometric fit θ ij, φ ij (y i, y j ) : penalizer for unrealistic poses together: argmax y p(y x) is sanitized version of local cues original local confidence local + geometry 28 / 44
47 Solving the Training Optimization Problem Numerically Idea: split learning of unary potentials into two parts: local classifiers, their importance. Two-Stage Training pre-train f y i (x) ˆ= log p(y i x) use φ i (y i, x) := f y i (x) R K (low-dimensional) keep φ ij (y i, y j ) are before perform CRF learning with φ i and φ ij 29 / 44
48 Solving the Training Optimization Problem Numerically Idea: split learning of unary potentials into two parts: local classifiers, their importance. Two-Stage Training pre-train f y i (x) ˆ= log p(y i x) use φ i (y i, x) := f y i (x) R K (low-dimensional) keep φ ij (y i, y j ) are before perform CRF learning with φ i and φ ij Advantage: lower dimensional feature space during inference faster f y i (x) can be any classifiers, e.g. deep network, random forest,... Disadvantage: if local classifiers are bad, CRF training cannot fix that. 29 / 44
49 Solving the Training Optimization Problem Numerically N [ θ L(θ) = λθ + φ(x n, y n ) E y p(y x n ;θ)φ(x n, y) ] Computing the Gradient: O(K M ND) L(θ) = λ N [ 2 θ 2 + θ, φ(x n, y n ) + log e θ,φ(xn,y) ] y Y N: number of samples D: dimension of feature space M: number of output variables K: number of possible labels of each output variables What, if exact marginal inference is not possible? 30 / 44
50 Training with Approximate Inference Problem: exact marginal inference might be computationally too costly Idea: just use approximate marginals, e.g. from loopy BP Possibility 1: everything works fine Possibility 2: nothing works anymore approximate gradient might not be a descent direction (objective does not decrease along its negative) then, gradient descent might terminate too early (line search chooses η = 0), or not convergence at all, or convergence to a bad solution,... In general: no way to tell which of the possibilities. 31 / 44
51 Training with Approximate Likelihood Question: how to train if (approximate) maximum likelihood learning does not work Idea: optimize a simpler quantity instead of L 32 / 44
52 Training with Approximate Likelihood Question: how to train if (approximate) maximum likelihood learning does not work Idea: optimize a simpler quantity instead of L Pseudolikelihood [Besag, 1987] Pretend: p(y) p PL (y) for p PL (y) = i V p(y i y V \{i} ) = i V p(y i y N(i) ) 32 / 44
53 Training with Approximate Likelihood Pseudolikelihood (PL) p(y x; θ) p PL (y x; θ) = i V p(y i y N(i), x; θ) For training data {(x 1, y 1 ),..., (x N, y N )}: N L PL (θ) = log p PL (y n x n ; θ) = = N log p(yi n yn(i) n, x; θ) i V N i V [ θ, φ(y n, x n ) log z Yi e θ,φ(y n 1,...,y n i 1,z,y n i+1,...,y n V,xn ) ] 33 / 44
54 Training with Approximate Likelihood Pseudolikelihood (PL) p(y x; θ) p PL (y x; θ) = i V p(y i y N(i), x; θ) For training data {(x 1, y 1 ),..., (x N, y N )}: N L PL (θ) = log p PL (y n x n ; θ) = = N log p(yi n yn(i) n, x; θ) i V N i V [ θ, φ(y n, x n ) log z Yi e θ,φ(y n 1,...,y n i 1,z,y n i+1,...,y n V,xn ) ] Partition functions sum only over one variable at a time tractable 33 / 44
55 Training with Approximate Likelihood Piecewise Training (PW) Idea: optimize a simpler quantity instead of L Piecewise Training [Sutton, McCallum, 2005] p(y x) F F p F (y F x) for p F (y F x) = 1 Z F (x; θ) e θ F,φ F (y F,x) 34 / 44
56 Training with Approximate Likelihood Piecewise Training (PW) p(y x) F F p F (y F x; θ F ) for p F (y F x) e θ F,φ F (y F,x) For training data {(x 1, y 1 ),..., (x N, y N )}: N N L PW (θ) = log p PW (yf n x n ; θ) = log p F (yf n x) F F N = [ θ F, φ F (yf n, x n ) log F F ȳ F Y F e θf,φf (ȳf,xn) ] 35 / 44
57 Training with Approximate Likelihood Piecewise Training (PW) p(y x) F F p F (y F x; θ F ) for p F (y F x) e θ F,φ F (y F,x) For training data {(x 1, y 1 ),..., (x N, y N )}: N N L PW (θ) = log p PW (yf n x n ; θ) = log p F (yf n x) F F N = [ θ F, φ F (yf n, x n ) log F F ȳ F Y F e θf,φf (ȳf,xn) Partition functions sum over F variables at a time usually tractable ] 35 / 44
58 Training with Approximate Likelihood Piecewise Training (PW) p(y x) F F p F (y F x; θ F ) for p F (y F x) e θ F,φ F (y F,x) For training data {(x 1, y 1 ),..., (x N, y N )}: N N L PW (θ) = log p PW (yf n x n ; θ) = log p F (yf n x) = F F F F N [ θ F, φ F (yf n, x n ) log ȳ F Y F e θf,φf (ȳf,xn) ] Partition functions sum over F variables at a time usually tractable Optimization decomposes into a sum over the θ F easy to parallelize 35 / 44
59 Solving the Training Optimization Problem Numerically CRF training methods is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use: θ L(θ) = λθ N [ φ(x n, y n ) A lot of research on accelerating CRF training: y Y p(y x n ; θ) φ(x n, y) ] R D problem solution method(s) Y too large exploit structure (loopy) belief propagation fast sampling contrastive divergence surrogate L pseudo-likelihood, piecewise training N too large mini-batches stochastic gradient descent D too large pretrained φ unary two-stage training 36 / 44
Part 4: Conditional Random Fields
Part 4: Conditional Random Fields Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 1 / 39 Problem (Probabilistic Learning) Let d(y x) be the (unknown) true conditional distribution.
More informationLearning with Structured Inputs and Outputs
Learning with Structured Inputs and Outputs Christoph H. Lampert IST Austria (Institute of Science and Technology Austria), Vienna ENS/INRIA Summer School, Paris, July 2013 Slides: http://www.ist.ac.at/~chl/
More informationLearning with Structured Inputs and Outputs
Learning with Structured Inputs and Outputs Christoph Lampert Institute of Science and Technology Austria Vienna (Klosterneuburg), Austria INRIA Summer School 2010 Learning with Structured Inputs and Outputs
More informationIntroduction To Graphical Models
Peter Gehler Introduction to Graphical Models Introduction To Graphical Models Peter V. Gehler Max Planck Institute for Intelligent Systems, Tübingen, Germany ENS/INRIA Summer School, Paris, July 2013
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationSequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015
Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 3 Stochastic Gradients, Bayesian Inference, and Occam s Razor https://people.orie.cornell.edu/andrew/orie6741 Cornell University August
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationLinear discriminant functions
Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative
More informationConditional Random Field
Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions
More informationCS229 Supplemental Lecture notes
CS229 Supplemental Lecture notes John Duchi Binary classification In binary classification problems, the target y can take on at only two values. In this set of notes, we show how to model this problem
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationLearning MN Parameters with Alternative Objective Functions. Sargur Srihari
Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient
More informationNaïve Bayes. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824
Naïve Bayes Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019 Administrative HW 1 out today. Please start early! Office hours Chen: Wed 4pm-5pm Shih-Yang: Fri 3pm-4pm Location: Whittemore 266
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationLearning with Structured Inputs and Outputs
Learning with Structured Inputs and Outputs Christoph H. Lampert IST Austria (Institute of Science and Technology Austria), Vienna Microsoft Machine Learning and Intelligence School 2015 July 29-August
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationLearning MN Parameters with Approximation. Sargur Srihari
Learning MN Parameters with Approximation Sargur srihari@cedar.buffalo.edu 1 Topics Iterative exact learning of MN parameters Difficulty with exact methods Approximate methods Approximate Inference Belief
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationLearning Parameters of Undirected Models. Sargur Srihari
Learning Parameters of Undirected Models Sargur srihari@cedar.buffalo.edu 1 Topics Difficulties due to Global Normalization Likelihood Function Maximum Likelihood Parameter Estimation Simple and Conjugate
More informationStatistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields
Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationProbabilistic Graphical Models in Computer Vision (IN2329)
Probabilistic Graphical Models in Computer Vision (IN2329) Csaba Domokos Summer Semester 2017 11. Parameter learning............................................................................................
More informationMachine Learning. Lecture 2: Linear regression. Feng Li. https://funglee.github.io
Machine Learning Lecture 2: Linear regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2017 Supervised Learning Regression: Predict
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationMarkov Random Fields for Computer Vision (Part 1)
Markov Random Fields for Computer Vision (Part 1) Machine Learning Summer School (MLSS 2011) Stephen Gould stephen.gould@anu.edu.au Australian National University 13 17 June, 2011 Stephen Gould 1/23 Pixel
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationLecture 3 - Linear and Logistic Regression
3 - Linear and Logistic Regression-1 Machine Learning Course Lecture 3 - Linear and Logistic Regression Lecturer: Haim Permuter Scribe: Ziv Aharoni Throughout this lecture we talk about how to use regression
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationLearning Parameters of Undirected Models. Sargur Srihari
Learning Parameters of Undirected Models Sargur srihari@cedar.buffalo.edu 1 Topics Log-linear Parameterization Likelihood Function Maximum Likelihood Parameter Estimation Simple and Conjugate Gradient
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationGWAS V: Gaussian processes
GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011
More information14 : Theory of Variational Inference: Inner and Outer Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2014 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Yu-Hsin Kuo, Amos Ng 1 Introduction Last lecture
More informationBayesian Deep Learning
Bayesian Deep Learning Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp June 06, 2018 Mohammad Emtiyaz Khan 2018 1 What will you learn? Why is Bayesian inference
More informationCSE446: Clustering and EM Spring 2017
CSE446: Clustering and EM Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin, Dan Klein, and Luke Zettlemoyer Clustering systems: Unsupervised learning Clustering Detect patterns in unlabeled
More informationLogistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 8 Feb. 12, 2018 1 10-601 Introduction
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationStein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu and Dilin Wang NIPS 2016 Discussion by Yunchen Pu March 17, 2017 March 17, 2017 1 / 8 Introduction Let x R d
More informationVariational Inference via Stochastic Backpropagation
Variational Inference via Stochastic Backpropagation Kai Fan February 27, 2016 Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary Outline Preliminaries Stochastic Backpropagation
More informationGWAS IV: Bayesian linear (variance component) models
GWAS IV: Bayesian linear (variance component) models Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS IV: Bayesian
More informationLecture 6: Graphical Models
Lecture 6: Graphical Models Kai-Wei Chang CS @ Uniersity of Virginia kw@kwchang.net Some slides are adapted from Viek Skirmar s course on Structured Prediction 1 So far We discussed sequence labeling tasks:
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationExpectation Propagation for Approximate Bayesian Inference
Expectation Propagation for Approximate Bayesian Inference José Miguel Hernández Lobato Universidad Autónoma de Madrid, Computer Science Department February 5, 2007 1/ 24 Bayesian Inference Inference Given
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationMachine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015
Machine Learning Classification, Discriminative learning Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationLecture 3: Multiclass Classification
Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationCSci 8980: Advanced Topics in Graphical Models Gaussian Processes
CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian
More informationLearning Bayesian network : Given structure and completely observed data
Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationLogistic Regression Introduction to Machine Learning. Matt Gormley Lecture 9 Sep. 26, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 9 Sep. 26, 2018 1 Reminders Homework 3:
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming
More informationSVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels
SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic
More informationVariational Inference. Sargur Srihari
Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms
More information