Learning Parameters of Undirected Models. Sargur Srihari


1 Learning Parameters of Undirected Models Sargur Srihari

2 Topics Difficulties due to Global Normalization Likelihood Function Maximum Likelihood Parameter Estimation Simple and Conjugate Gradient Ascent Conditionally Trained Models Learning with Missing Data EM with Gradient Ascent Maximum Entropy and Maximum Likelihood Parameter Priors and Regularization 2

3 Local vs Global Normalization. BN: local normalization within each CPD, P(X_1,..,X_n) = ∏_{i=1}^n P(X_i | Pa_i). MN: global normalization (partition function), P(X_1,..,X_n) = (1/Z) ∏_{i=1}^K φ_i(D_i), with Z = Σ_{X_1,..,X_n} ∏_i φ_i(D_i). The global factor Z couples all parameters, preventing decomposition. This has significant computational ramifications: M.L. parameter estimation has no closed-form solution and needs iterative methods.

4 Issues in Parameter Estimation. Simple ML parameter estimation (even with complete data) cannot be solved in closed form; we need iterative methods such as gradient ascent. Good news: the likelihood function is concave, so these methods converge to the global optimum. Bad news: each step of the iterative algorithm requires inference, so even simple parameter estimation is expensive or intractable, and Bayesian estimation is practically infeasible.

5 Discriminative Training Common use of MNs is in settings such as image segmentation where we have a particular inference task in mind Train the network discriminatively to get good performance for our particular inference task 5

6 Likelihood Function Basis for all discussion of learning How likelihood function can be optimized to find maximum likelihood parameter estimates Begin with form of likelihood function for Markov networks, its properties and their computational implications 6

7 Example of Coupled Likelihood Function. Simple Markov network A—B—C with two potentials φ_1(A,B), φ_2(B,C). Gibbs: P(A,B,C) = (1/Z) φ_1(A,B) φ_2(B,C). The log-likelihood of an instance (a,b,c) is ln P(a,b,c) = ln φ_1(a,b) + ln φ_2(b,c) − ln Z. The log-likelihood of a data set D with M instances (summing first over the instances m, then grouping by the values of the variables) is
l(θ : D) = Σ_{m=1}^M [ ln φ_1(a[m],b[m]) + ln φ_2(b[m],c[m]) − ln Z(θ) ]
         = Σ_{a,b} M[a,b] ln φ_1(a,b) + Σ_{b,c} M[b,c] ln φ_2(b,c) − M ln Z(θ)
where Z(θ) = Σ_{a,b,c} φ_1(a,b) φ_2(b,c) and the parameter θ consists of all entries of the factors φ_1 and φ_2. The first term involves only φ_1 and the second only φ_2, but the third, ln Z(θ), couples the two potentials in the likelihood function: when we change φ_1, Z(θ) changes, possibly changing the value of φ_2 that maximizes −ln Z(θ).
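
The following is a minimal Python sketch (not from the slides) of this coupled likelihood for the chain A—B—C. The potential tables and the counts M[a,b], M[b,c] are hypothetical, and brute-force summation stands in for computing Z(θ); changing any entry of φ_1 changes the partition term and hence which φ_2 maximizes the likelihood.

```python
# Coupled log-likelihood of the chain A - B - C (hypothetical tables and counts).
import numpy as np

phi1 = np.array([[1.0, 1.0], [1.0, 2.0]])   # phi1[a, b]
phi2 = np.array([[1.0, 3.0], [1.0, 1.0]])   # phi2[b, c]

def log_partition(phi1, phi2):
    # Z(theta) = sum_{a,b,c} phi1(a,b) * phi2(b,c): this term couples the potentials
    Z = sum(phi1[a, b] * phi2[b, c]
            for a in range(2) for b in range(2) for c in range(2))
    return np.log(Z)

def log_likelihood(counts_ab, counts_bc, phi1, phi2, M):
    # l(theta : D) = sum_{a,b} M[a,b] ln phi1(a,b)
    #              + sum_{b,c} M[b,c] ln phi2(b,c) - M ln Z(theta)
    return (np.sum(counts_ab * np.log(phi1))
            + np.sum(counts_bc * np.log(phi2))
            - M * log_partition(phi1, phi2))

counts_ab = np.array([[20, 10], [30, 40]])  # hypothetical M[a,b], summing to M = 100
counts_bc = np.array([[25, 40], [15, 20]])  # hypothetical M[b,c], also summing to 100
print(log_likelihood(counts_ab, counts_bc, phi1, phi2, M=100))
```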

8 Illustration of Coupled Likelihood. For the network A—B—C, the log-likelihood surface is plotted with respect to two parameters only, f_1(a1,b1) and f_2(b0,c1), with all other parameters set to 1 (with binary variables there would be 8 parameters in all). The data set has M = 100 with M[a1,b1] = 40 and M[b0,c1] = 40. When φ_1 changes, the optimal φ_2 also changes. In this case the problem is avoidable: the network is equivalent to the BN A→B→C, and we can estimate parameters as φ_1(A,B) = P(A)P(B|A), φ_2(B,C) = P(C|B). In general, however, we cannot convert learned BN parameters into an equivalent MN; the optimal likelihood achievable by the two representations is not the same.

9 Form of Likelihood Function. Instead of the Gibbs parameterization, use the log-linear framework. The joint distribution of n variables X_1,..,X_n with k features F = { f_i(D_i) }, i=1,..,k, where D_i is a sub-graph and f_i maps D_i to R, is
P(X_1,..,X_n; θ) = (1/Z(θ)) exp( Σ_{i=1}^k θ_i f_i(D_i) )
The parameters θ_i are weights we put on the features. If we have a sample ξ then its features are f_i(ξ(D_i)), with the shorthand f_i(ξ). The representation is general: it can capture Markov networks with both global structure and local structure.

10 Parameters θ, Factors φ, Binary Features f. Variables A, B, C, D arranged in a loop A—B—C—D—A, with Val(A) = {a0,a1}, Val(B) = {b0,b1}. Define a feature for every entry in every table: the f_i(D_i) are sixteen indicator functions defined over the clusters AB, BC, CD, DA, e.g.
f_{a0,b0}(A,B) = I{A = a0} I{B = b0}, which is 1 if a = a0 and b = b0, and 0 otherwise.
In the representation P(X_1,..,X_n; θ) = (1/Z(θ)) exp( Σ_{i=1}^k θ_i f_i(D_i) ), the parameter attached to each feature is the log of the corresponding potential entry, e.g. θ_{a0,b0} = ln φ_1(a0,b0); the table for φ_1(A,B) lists an entry φ_1(a,b) for each of the four assignments (a0,b0), (a0,b1), (a1,b0), (a1,b1). Parameters θ are thus the weights put on the features.
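
A small sketch (with an arbitrary, made-up table φ_1) verifying the correspondence stated on this slide: indicator features with weights θ_{a,b} = ln φ_1(a,b) reproduce the table factor inside the log-linear form.

```python
# Indicator features with weights theta_{a,b} = ln phi1(a,b) recover the table factor.
import numpy as np

phi1 = np.array([[0.5, 2.0], [1.5, 3.0]])   # arbitrary positive table phi1[a, b]
theta = np.log(phi1)                         # theta_{a,b} = ln phi1(a,b)

def indicator_features(a, b):
    # one binary feature per table entry: f_{a0,b0}, f_{a0,b1}, f_{a1,b0}, f_{a1,b1}
    f = np.zeros((2, 2))
    f[a, b] = 1.0
    return f

for a in range(2):
    for b in range(2):
        log_linear = np.exp(np.sum(theta * indicator_features(a, b)))
        assert np.isclose(log_linear, phi1[a, b])   # exp(theta . f) equals phi1(a, b)
print("log-linear form matches the table parameterization")
```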

11 Log Likelihood and Sufficient Statistics. Joint probability distribution: P(X_1,..,X_n; θ) = (1/Z(θ)) exp( Σ_{i=1}^k θ_i f_i(D_i) ), where θ = {θ_1,..,θ_k} are the table entries and the f_i are features over instances of D_i. Let D be a data set of M samples ξ[m], m = 1,..,M. The log-likelihood is the log of the product of the probabilities of the M independent instances:
l(θ : D) = Σ_i θ_i Σ_m f_i(ξ[m]) − M ln Z(θ)
Dividing by the number of samples,
(1/M) l(θ : D) = Σ_i θ_i E_D[ f_i(d_i) ] − ln Z(θ)
where E_D[f_i(d_i)] is the average of f_i in the data set. These averages are the sufficient statistics: the likelihood depends on the data only through them.

12 Properties of Log-Likelihood. The log-likelihood is a sum of two functions: l(θ : D) = Σ_i θ_i Σ_m f_i(ξ[m]) − M ln Z(θ). The first term is linear in the parameters θ, so increasing the parameters increases it; but the likelihood is upper-bounded (probability 1), and the second term, −M ln Z(θ), balances the first, where ln Z(θ) = ln Σ_ξ exp( Σ_i θ_i f_i(ξ) ). The partition function Z(θ) is convex, as seen next, so its negative is concave. The sum of a linear function and a concave function is concave, so there are no local optima and we can use gradient ascent.

13 Proof that Z(θ) is Convex. A function f(x) is convex if for every 0 ≤ α ≤ 1, f(α x + (1−α) y) ≤ α f(x) + (1−α) f(y). The function is bowl-like: every interpolation between the images of two points is larger than the image of their interpolation. One way to show that a function is convex is to show that its Hessian (the matrix of the function's second derivatives) is positive semi-definite. The Hessian of ln Z(θ) is computed on the next slide.

14 Hessian of ln Z(θ). Given a set of features F, with ln Z(θ) = ln Σ_ξ exp( Σ_i θ_i f_i(ξ) ):
∂/∂θ_i ln Z(θ) = E_θ[ f_i ]
∂²/(∂θ_i ∂θ_j) ln Z(θ) = Cov_θ[ f_i; f_j ]
where E_θ[f_i] is shorthand for E_{P(χ;θ)}[f_i]. Proof, taking partial derivatives w.r.t. θ_i and θ_j:
∂/∂θ_i ln Z(θ) = (1/Z(θ)) Σ_ξ ∂/∂θ_i exp( Σ_j θ_j f_j(ξ) ) = (1/Z(θ)) Σ_ξ f_i(ξ) exp( Σ_j θ_j f_j(ξ) ) = E_θ[ f_i ]
∂²/(∂θ_i ∂θ_j) ln Z(θ) = ∂/∂θ_j E_θ[ f_i ] = E_θ[ f_i f_j ] − E_θ[ f_i ] E_θ[ f_j ] = Cov_θ[ f_i; f_j ]
Since the covariance matrix of the features is positive semidefinite, ln Z(θ) is convex and −ln Z(θ) is a concave function of θ. Corollary: the log-likelihood function is concave.
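
A brute-force numerical check of these identities for a hypothetical three-feature model over two binary variables: the finite-difference derivative of ln Z(θ) matches E_θ[f_i], and the feature covariance matrix is positive semidefinite.

```python
# Numerical check: d/d(theta_i) ln Z(theta) = E_theta[f_i]; Cov_theta[f_i; f_j] is PSD.
import itertools
import numpy as np

def features(xi):                      # hypothetical features over xi = (a, b)
    a, b = xi
    return np.array([float(a), float(b), float(a == b)])

assignments = list(itertools.product([0, 1], repeat=2))
F = np.array([features(xi) for xi in assignments])
theta = np.array([0.3, -0.2, 0.5])

def log_Z(theta):
    return np.log(np.sum(np.exp(F @ theta)))

def moments(theta):
    p = np.exp(F @ theta); p /= p.sum()           # P(xi; theta) over all assignments
    mean = p @ F                                  # E_theta[f]
    cov = (F - mean).T @ np.diag(p) @ (F - mean)  # Cov_theta[f_i; f_j]
    return mean, cov

mean, cov = moments(theta)
eps = 1e-5
fd_grad = (log_Z(theta + eps * np.eye(3)[0]) - log_Z(theta - eps * np.eye(3)[0])) / (2 * eps)
print(fd_grad, mean[0])                              # nearly identical
print(np.all(np.linalg.eigvalsh(cov) >= -1e-12))     # covariance is positive semidefinite
```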

15 Non-unique Solution. Since ln Z(θ) is convex, −ln Z(θ) is concave, which implies that the log-likelihood is unimodal and has no local optima. However, this does not imply uniqueness of the global optimum: multiple parameterizations can result in the same distribution. A feature for every entry in the table is always redundant, e.g. f_{a0,b0} = 1 − f_{a0,b1} − f_{a1,b0} − f_{a1,b1}, giving a continuum of parameterizations.

16 Maximum (Conditional) Likelihood Parameter Estimation. Task: estimate the parameters of a Markov network with a fixed structure given a fully observable data set D. The simplest variant of the problem is maximum likelihood parameter estimation. The log-likelihood given features F = { f_i, i=1,..,k } is
l(θ : D) = Σ_{i=1}^k θ_i Σ_m f_i(ξ[m]) − M ln Z(θ)

17 Gradient of Log-likelihood. The gradient of the log-likelihood is zero at its maximum points. The log-likelihood is l(θ : D) = Σ_i θ_i Σ_m f_i(ξ[m]) − M ln Z(θ), and the gradient of the average log-likelihood is
(1/M) ∂/∂θ_i l(θ : D) = E_D[ f_i(χ) ] − E_θ[ f_i ]
The first term is the average value of f_i in the data D; the second is its expected value under the distribution P_θ. This provides a precise characterization of the maximum-likelihood parameters θ̂: E_D[ f_i(χ) ] = E_θ̂[ f_i ], i.e., the expected value of each feature relative to P_θ̂ matches its empirical expectation in D.

18 Need for Iterative Method. Although the function is concave, there is no analytical form for the maximum. Since there is no closed-form solution, we use iterative methods, e.g. gradient ascent, as shown next.

19 Simple Gradient Descent
Procedure Gradient-Descent ( θ_1 // initial starting point, f // function to be minimized, δ // convergence threshold )
1  t ← 1
2  do
3    θ_{t+1} ← θ_t − η ∇f(θ_t)
4    t ← t + 1
5  while ||θ_t − θ_{t−1}|| > δ
6  return θ_t
Intuition: the Taylor expansion of f(θ) in the neighborhood of θ_t is f(θ) ≈ f(θ_t) + (θ − θ_t)^T ∇f(θ_t). Let θ = θ_{t+1} = θ_t + h, so f(θ_{t+1}) ≈ f(θ_t) + h^T ∇f(θ_t). Substituting h = η ∇f(θ_t) increases f by about η ||∇f(θ_t)||² (which is positive), while h = −η ∇f(θ_t) decreases it by the same amount. In other words, the slope ∇f(θ_t) points in the direction of steepest ascent; taking a step η in the opposite direction decreases the value of f.
One-dimensional example: let f(θ) = θ², which has its minimum at θ = 0, the point we want to find by gradient descent. We have f′(θ) = 2θ and the update θ_{t+1} = θ_t − η f′(θ_t). If θ_t > 0 then θ_{t+1} < θ_t; if θ_t < 0 then f′(θ_t) = 2θ_t is negative, thus θ_{t+1} > θ_t.
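
A runnable sketch of this procedure in Python; the names and the stopping rule mirror the pseudocode above, and the 1-D objective is the slide's f(θ) = θ².

```python
# Runnable version of the Gradient-Descent procedure on f(theta) = theta^2.
import numpy as np

def gradient_descent(theta0, grad_f, eta=0.1, delta=1e-8, max_iter=10_000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = theta - eta * grad_f(theta)          # step against the gradient
        if np.linalg.norm(theta_new - theta) <= delta:   # ||theta_t - theta_{t-1}|| <= delta
            return theta_new
        theta = theta_new
    return theta

# f(theta) = theta^2 has gradient 2*theta and its minimum at theta = 0
print(gradient_descent(theta0=[3.0], grad_f=lambda th: 2 * th))   # approx [0.]
```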

20 Difficulties with Simple Gradient Descent Performance depends on choice of learning rate η Large learning rate Overshoots Small learning rate Extremely slow Solution Start with large η and settle on optimal value Need a schedule for shrinking η 20

21 Improvements to Gradient Descent (procedure Gradient-Descent repeated from the previous slide). Line search: rather than a fixed η, adaptively choose η at each step. Define a line in the direction of the gradient, g(η) = θ_t + η ∇f(θ_t) (for descent, step along −∇f(θ_t)). Find three points η_1 < η_2 < η_3 such that f(g(η_2)) is smaller than at both of the others; the three points then bracket the optimum along the line, and a new candidate point, e.g. η = (η_1 + η_2)/2, can replace one of them. Brent's method combines this bracketing with a quadratic fit: in the figure, the solid line is f(g(η)), three points bracket its optimum, and the dashed line shows the quadratic fit.
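
A sketch of line search along the gradient direction, choosing the step η at each iteration with Brent's method as implemented in scipy.optimize.minimize_scalar. The quadratic objective here is illustrative only, not from the slides.

```python
# Line search: at each step choose eta by minimizing f along the gradient direction.
import numpy as np
from scipy.optimize import minimize_scalar

def f(theta):
    x, y = theta
    return x**2 + 10 * y**2

def grad_f(theta):
    x, y = theta
    return np.array([2 * x, 20 * y])

theta = np.array([4.0, 1.0])
for _ in range(20):
    g = grad_f(theta)
    # g(eta) = theta - eta * grad f(theta); Brent brackets and refines the best eta
    best = minimize_scalar(lambda eta: f(theta - eta * g), bracket=(0.0, 1.0))
    theta = theta - best.x * g
print(theta)          # close to the minimizer (0, 0)
```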

22 Conjugate Gradient Ascent. In simple gradient ascent, two consecutive gradients are orthogonal: ∇f_obj(θ_{t+1}) is orthogonal to ∇f_obj(θ_t). Progress therefore zig-zags, and progress toward the maximum slows down. Example objectives: quadratic f_obj(x,y) = −(x² + 10y²) and exponential f_obj(x,y) = exp[−(x² + 10y²)]. Solution: remember past directions and bias the new direction h_t as a combination of the current gradient g_t and the past direction h_{t−1}. This gives faster convergence than simple gradient ascent.

23 Computing the Gradient. The gradient of the log-likelihood is E_D[ f_i(χ) ] − E_θ[ f_i ]: the difference between the feature's empirical count in the data D and its expected count relative to our current parameterization. Computing the empirical count E_D[ f_i(χ) ] is easy; e.g., for the feature f_{a0,b0}(A,B) = I{A = a0} I{B = b0}, it is the empirical frequency in D of the event (a0, b0).

24 Computing the Expected Count. We need to compute probabilities of the form P_{θ_t}(a, b), since the expectation is a probability-weighted average. Computing these probabilities requires running inference over the network, so the computational cost of parameter estimation is very high and simple gradient ascent is not efficient. Much faster convergence is obtained using second-order methods based on the Hessian.

25 A Standard Iterative Solution. Initialize θ. Then repeat: run inference (compute Z(θ) and the expectations); compute the gradient of l, E_D[ f_i(χ) ] − E_θ[ f_i ]; update θ, θ_{t+1} ← θ_t + η ∇f_obj(θ_t). If the optimum is not reached, repeat; otherwise stop.
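
A compact sketch of this loop for a hypothetical toy log-linear Markov network over three binary variables: brute-force enumeration of all assignments plays the role of "run inference", and both the data set and the features are made up.

```python
# Iterative MLE for a toy log-linear MN over (A, B, C); enumeration = "run inference".
import itertools
import numpy as np

assignments = list(itertools.product([0, 1], repeat=3))

def features(xi):                                  # hypothetical feature set
    a, b, c = xi
    return np.array([float(a == b), float(b == c), float(a), float(c)])

F = np.array([features(xi) for xi in assignments])
data = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 0, 0),       # hypothetical data set D
        (1, 1, 0), (1, 1, 1), (1, 1, 1), (0, 1, 0)]
empirical = np.mean([features(xi) for xi in data], axis=0)   # E_D[f_i]

theta = np.zeros(4)
eta = 0.5
for step in range(2000):
    p = np.exp(F @ theta); p /= p.sum()        # inference under current parameters
    expected = p @ F                           # E_theta[f_i]
    grad = empirical - expected                # gradient of (1/M) l(theta : D)
    theta += eta * grad                        # gradient ascent update
    if np.linalg.norm(grad) < 1e-6:            # optimum reached?
        break
print(theta)   # near the optimum, E_theta[f] matches E_D[f] (moment matching)
```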

26 Iterative methods for MRF parameters Gradient ascent over parameter space Good news: likelihood function is concave Guaranteed to converge to global optimum Bad news: each step needs inference Simple parameter estimation is intractable Bayesian parameter estimation even harder Integration done using MCMC 26

27 Newton's Method. Newton's method finds zeroes of a function using derivatives, and is more efficient than simple gradient descent. Since we are solving for a zero of the derivative of l(θ : D), we need the second derivative (the Hessian). Quasi-Newton methods use an approximation to the Hessian.

28 Hessian of the Log-likelihood. The likelihood has the form l(θ : D) = Σ_i θ_i Σ_m f_i(ξ[m]) − M ln Z(θ), and its Hessian is
∂²/(∂θ_i ∂θ_j) l(θ : D) = −M Cov_θ[ f_i; f_j ]
This requires the joint expectation of two features, which is often computationally infeasible. L-BFGS (a quasi-Newton algorithm) uses line search to avoid computing the Hessian.
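
A sketch of fitting a toy log-linear model with L-BFGS via scipy.optimize.minimize, which needs only the objective and its gradient (here the negative average log-likelihood and its moment-difference gradient). The feature set and the counts are hypothetical.

```python
# Fitting a toy log-linear model with L-BFGS; only objective and gradient are supplied.
import itertools
import numpy as np
from scipy.optimize import minimize

assignments = list(itertools.product([0, 1], repeat=2))
F = np.array([[float(a), float(b), float(a == b)] for a, b in assignments])
counts = np.array([4.0, 1.0, 2.0, 5.0])     # hypothetical counts of (0,0),(0,1),(1,0),(1,1)
M = counts.sum()
empirical = counts @ F / M                  # E_D[f_i]

def neg_avg_ll(theta):
    logw = F @ theta
    logZ = np.log(np.sum(np.exp(logw)))
    p = np.exp(logw - logZ)
    value = -(empirical @ theta - logZ)     # -(1/M) l(theta : D)
    grad = -(empirical - p @ F)             # -(E_D[f] - E_theta[f])
    return value, grad

res = minimize(neg_avg_ll, x0=np.zeros(3), jac=True, method='L-BFGS-B')
print(res.x, res.fun)
```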

29 Conditionally Trained Models. Often we want to perform an inference task where we have a known set of variables, or features, X, and want to query a pre-determined set of variables Y. In this setting we prefer discriminative training: train the network as a Conditional Random Field (CRF) that encodes a conditional distribution P(Y | X).

30 CRF Training. Train the network as a CRF that encodes a conditional distribution P(Y | X). The training set consists of pairs D = {(y[m], x[m])}, m = 1,..,M (example: word category, word). The objective function is the conditional log-likelihood, i.e. the joint probability of the observed pairs:
l_{Y|X}(θ : D) = ln P( y[1,..,M] | x[1,..,M], θ ) = Σ_{m=1}^M ln P( y[m] | x[m], θ )
which in the large-sample limit corresponds to maximizing E_{(x,y)~P*}[ ln P̃(y | x) ]. It is concave since each term is concave.

31 Gradient of Conditional Likelihood. A reduced MN (the network conditioned on x) is itself an MN, so we use the log-linear representation with features f_i and parameters θ. Analogous to the gradient for the full MN, (1/M) ∂/∂θ_i l(θ : D) = E_D[ f_i(χ) ] − E_θ[ f_i ], we can write the gradient for the reduced MN:
∂/∂θ_i l_{Y|X}(θ : D) = Σ_{m=1}^M [ f_i( y[m], x[m] ) − E_θ[ f_i | x[m] ] ]
The first term is the empirical count conditioned on x[m]; the second is based on running inference on each data case.
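
A sketch of this conditional gradient for a deliberately tiny CRF with a single binary label Y and a single binary observation X; the two features and the data pairs are hypothetical. Note that inference is run per data case, conditioned on x[m].

```python
# Conditional gradient for a one-label toy CRF: inference is run per case, given x[m].
import numpy as np

def features(y, x):
    # hypothetical features: agreement with the observation, and a bias on y = 1
    return np.array([float(y == x), float(y == 1)])

data = [(1, 1), (0, 1), (1, 1), (0, 0), (1, 0)]      # pairs (y[m], x[m])
theta = np.zeros(2)
eta = 0.5

for _ in range(500):
    grad = np.zeros(2)
    for y_m, x_m in data:
        # reduced network over Y given x[m]: enumerate y in {0, 1}
        w = np.array([np.exp(theta @ features(y, x_m)) for y in (0, 1)])
        p = w / w.sum()
        expected = p[0] * features(0, x_m) + p[1] * features(1, x_m)   # E_theta[f | x[m]]
        grad += features(y_m, x_m) - expected
    theta += eta * grad / len(data)
print(theta)
```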

32 Conditional Training is Simpler. For a sequence X_1,..,X_5 with labels Y_1,..,Y_5, the full MN encodes
P̃(X,Y) = ∏_{i=1}^4 φ_i(Y_i, Y_{i+1}) ∏_{i=1}^5 φ_i(Y_i, X_1, X_2, X_3, X_4, X_5)
while conditioning on X gives
P̃(Y | X) = ∏_{i=1}^4 φ_i(Y_i, Y_{i+1} | X_1, X_2, X_3, X_4, X_5)
Edges disappear in a reduced Markov network: after conditioning on X, the remaining edges form a simple chain over Y. The next slide compares different sequential models (CRF, HMM and MEMM) with respect to parameter learning complexity.

33 Learning Models for Sequence Labeling. Given a sequence of observations X = {X_1,..,X_k}, we need a joint label Y = {Y_1,..,Y_k}. Model trade-offs: expressive power vs. learnability.
CRF: a discriminative model that directly obtains P(Y | X) = (1/Z(X)) P̃(Y, X), with P̃(Y, X) = ∏_{i=1}^{k−1} φ_i(Y_i, Y_{i+1}) ∏_{i=1}^k φ_i(Y_i, X_i) and Z(X) = Σ_Y P̃(Y, X); note that Z(X) is the marginal of the un-normalized measure.
MEMM: a conditional directed model, P(Y | X) = ∏_{i=1}^k P(Y_i | X_i, Y_{i−1}), which does not model the distribution over X. By d-separation, Y_1 ⟂ X_2 if Y_2 is not given; more generally, Y_i ⟂ X_j | X_{−j} for j > i. It has the label-bias problem: a later observation has no effect on the posterior probability of the current state. For example, in activity recognition in a video sequence, frames are labeled running/walking; earlier frames may be blurry but later ones clearer.
HMM: a generative model that needs the joint probability P(X,Y) = ∏_{i=1}^k P(X_i | Y_i) P(Y_i | Y_{i−1}), from which P(Y | X) = P(X,Y)/P(X).
CRF learning requires gradient-based methods, which are expensive and difficult with large data sets; MEMM and HMM are more easily learned: as pure directed models, their parameters are computed in closed form using maximum likelihood.

34 Parameter Estimation with Missing Data Missing data examples Some data fields omitted or not collected Some hidden variables Learning problem has difficulties Parameters may not be identifiable Coupling between different parameters Likelihood is not concave (has local maxima) Available methods 1. Gradient Ascent (assume missing data is random) 2. Expectation-Maximization 34

35 Gradient Ascent with Missing Data. Assume data is missing at random in the data set D. In the m-th instance, let o[m] be the observed entries and h[m] the random variables that are missing. The log-likelihood has the form
(1/M) ln P(D | θ) = (1/M) Σ_{m=1}^M ln Σ_{h[m]} P( o[m], h[m] | θ )
and the gradient for feature f_i has the form
(1/M) ∂/∂θ_i l(θ : D) = (1/M) Σ_{m=1}^M E_{h[m] ~ P(h[m] | o[m], θ)}[ f_i ] − E_θ[ f_i ]
which is a difference between two expectations of f_i: an expectation over the data and the hidden variables, and an expectation under the current distribution P(χ | θ).
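
A sketch of this gradient for a hypothetical two-variable model in which the second variable is sometimes missing: each incomplete instance gets its own inference step, conditioned on its observed entry o[m].

```python
# Missing-data gradient: each incomplete instance needs its own inference step.
import itertools
import numpy as np

assignments = list(itertools.product([0, 1], repeat=2))
def features(a, b):
    return np.array([float(a), float(b), float(a == b)])
F = np.array([features(a, b) for a, b in assignments])

# o[m]: observed entries; None marks the hidden value h[m] (here, variable B)
data = [(1, 1), (0, 0), (1, 0), (0, 1), (0, None), (1, None)]
theta = np.zeros(3)
eta = 0.5

for _ in range(1000):
    p = np.exp(F @ theta); p /= p.sum()
    model_expect = p @ F                                     # E_theta[f_i]
    data_expect = np.zeros(3)
    for a, b in data:
        if b is None:
            # per-instance inference: P(B | A = a, theta), then expected features
            w = np.array([np.exp(theta @ features(a, bb)) for bb in (0, 1)])
            w /= w.sum()
            data_expect += w[0] * features(a, 0) + w[1] * features(a, 1)
        else:
            data_expect += features(a, b)
    theta += eta * (data_expect / len(data) - model_expect)
print(theta)
```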

36 Comparison with the Full Data Case. With full data, the gradient of the log-likelihood is (1/M) ∂/∂θ_i l(θ : D) = E_D[ f_i(χ) ] − E_θ[ f_i ]. For the second term we need inference over the current distribution P(χ | θ); the first term is an aggregate over the data D. With missing data we have
(1/M) ∂/∂θ_i l(θ : D) = (1/M) Σ_{m=1}^M E_{h[m] ~ P(h[m] | o[m], θ)}[ f_i ] − E_θ[ f_i ]
so we have to run inference separately for every instance m, conditioning on o[m]. The cost is much higher than with full data.

37 Use of Expectation Maximization As for any probabilistic model, an alternative method of parameter estimation with missing data is EM E step is to use current parameters to estimate missing values M step is to re-estimate the parameters For BN it has significant advantages M-step has closed-form solution For MN, the answer is not clear-cut M-step requires running inference multiple times As seen next 37

38 EM for MN Parameter Learning. E-step: use the current parameters θ^(t) to compute expected sufficient statistics, i.e., expected feature counts. At iteration t the expected sufficient statistic for feature f_i is
M_{θ^(t)}[ f_i ] = (1/M) Σ_{m=1}^M E_{h[m] ~ P(h[m] | o[m], θ^(t))}[ f_i ]
M-step: done using maximum likelihood parameter estimation (using gradient ascent!), which requires running inference multiple times, once for each iteration of the gradient ascent procedure. At step k of this inner-loop optimization, the gradient has the form
M_{θ^(t)}[ f_i ] − E_{θ^(t,k)}[ f_i ]
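
A compact EM sketch for the same kind of toy model (hypothetical features and data): the E-step fixes the expected feature counts under θ^(t), and the M-step runs an inner gradient-ascent loop, calling inference at every inner step, to match those counts.

```python
# EM sketch: E-step fixes expected feature counts; M-step matches them by gradient ascent.
import itertools
import numpy as np

assignments = list(itertools.product([0, 1], repeat=2))
def features(a, b):
    return np.array([float(a), float(b), float(a == b)])
F = np.array([features(a, b) for a, b in assignments])
data = [(1, 1), (0, 0), (1, 0), (0, 1), (0, None), (1, None)]   # None = hidden B

def model_expect(theta):
    p = np.exp(F @ theta); p /= p.sum()
    return p @ F                                      # E_theta[f], needs inference

def e_step(theta):
    # expected sufficient statistics M_theta(t)[f_i], averaging over hidden values
    total = np.zeros(3)
    for a, b in data:
        if b is None:
            w = np.array([np.exp(theta @ features(a, bb)) for bb in (0, 1)])
            w /= w.sum()
            total += w[0] * features(a, 0) + w[1] * features(a, 1)
        else:
            total += features(a, b)
    return total / len(data)

theta = np.zeros(3)
for t in range(50):                       # outer EM iterations
    target = e_step(theta)                # E-step with current theta_t
    for k in range(200):                  # M-step: gradient ascent, inference each step
        theta = theta + 0.5 * (target - model_expect(theta))
print(theta)
```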

39 Maximum Entropy and Maximum Likelihood. Return to basic maximum likelihood estimation; an alternative formulation provides insight. Consider the log-linear model P(X_1,..,X_n; θ) = (1/Z(θ)) exp( Σ_{i=1}^k θ_i f_i(D_i) ) and relate it to finding the distribution of maximum entropy subject to certain constraints.

40 Motivation of Max Entropy Suppose we have some summary statistics of an empirical distribution Ex 1: published in a census report Marginal distributions of single variables or certain pairs, or other events of interest Ex 2: average final grade of students in class, correlation of final grades with homework scores But no access to full data set Find a typical distribution that satisfies these constraints 40

41 Maximum Entropy Formulation Goal is to select a distribution that: 1. Satisfies given constraints 2. Has no additional structure or information We formulate the following problem: Find Q(χ) Maximizing H Q (χ) Subject to E Q [f i ]=E D [f i ], i=1,.., k These are called expectation constraints 41

42 Solution to Max Entropy Estimation. Somewhat surprisingly, the distribution Q*, the maximum entropy distribution satisfying the expectation constraints E_Q[f_i] = E_D[f_i], is a Gibbs distribution: Q* = P_θ̂, where P_θ̂(χ) = (1/Z(θ̂)) exp( Σ_i θ̂_i f_i(χ) ) and θ̂ is the maximum likelihood parameterization relative to D.

43 Duality of Max Ent and Max Likelihood. The two problems, (1) maximizing entropy subject to expectation constraints and (2) maximizing likelihood given structural constraints on the distribution, are convex duals of each other. For the maximum likelihood parameters θ̂, H_{P_θ̂}(χ) = −(1/M) l(θ̂ : D).

44 Parameter Priors and Regularization. Maximum likelihood is prone to over-fitting, so we can introduce a prior distribution P(θ). Due to non-decomposability, a fully Bayesian approach is infeasible; instead we perform MAP estimation: find parameters that maximize P(θ) P(D | θ), where ln P(D | θ) expressed in log-space is l(θ : D) = Σ_{i=1}^k θ_i Σ_m f_i(ξ[m]) − M ln Z(θ), which in turn is derived from the joint distribution P(X_1,..,X_n; θ) = (1/Z(θ)) exp( Σ_{i=1}^k θ_i f_i(D_i) ).

45 Choice of Prior P(θ) Only a few priors used in practice L 2 regularization, quadratic penalty on weights Laplacian (L 1 regularization) Both priors penalize parameters whose magnitude (positive or negative) is large 45

46 Gaussian Prior and L2-Regularization. Most common is a Gaussian prior on the log-linear parameters θ,
P(θ | σ²) = ∏_{i=1}^k (1/√(2πσ²)) exp( −θ_i² / (2σ²) )
for some choice of hyper-parameter σ². Converting the MAP objective P(θ) P(D | θ) to log-space gives ln P(θ) + l(θ : D), whose first term,
ln P(θ) = −(1/(2σ²)) Σ_{i=1}^k θ_i²  (up to an additive constant),
is called an L2-regularization term.
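
A short sketch of what this means in practice: the MAP gradient is the likelihood gradient plus the gradient of ln P(θ), i.e. minus θ_i/σ². The numbers below are illustrative only.

```python
# L2 (Gaussian-prior) penalty: MAP gradient = likelihood gradient - theta / sigma^2.
import numpy as np

sigma2 = 1.0
theta = np.array([0.5, -2.0, 0.1])
likelihood_grad = np.array([0.3, 0.4, -0.2])     # stand-in for E_D[f] - E_theta[f]

log_prior = -np.sum(theta**2) / (2 * sigma2)     # ln P(theta) up to an additive constant
map_grad = likelihood_grad - theta / sigma2      # gradient of ln P(theta) + l(theta : D)
print(log_prior, map_grad)
```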

47 Laplacian Prior and L1-Regularization. A different prior used is the Laplacian,
P_Laplacian(θ) = ∏_{i=1}^k (1/(2β)) exp( −|θ_i| / β )
Taking the log we obtain ln P(θ) = −(1/β) Σ_{i=1}^k |θ_i| (up to an additive constant); the penalty (1/β) Σ_i |θ_i|, which we wish to minimize, is generally called L1-regularization. Both forms of regularization penalize parameters whose magnitude is large. (Figure: Laplacian distribution with β = 1 and Gaussian distribution with σ² = 1.)

48 Why Prefer Low-Magnitude Parameters? Properties of the prior: it pulls the distribution towards an uninformed one and smooths fluctuations in the data. A distribution is smooth if the probabilities assigned to different assignments are not radically different. Consider two assignments ξ and ξ′ (an assignment is an instance of the variables X_1,..,X_n); we consider the ratio of their probabilities next.

49 Smoothness Resulting from Small θ. Given two assignments ξ and ξ′, their relative probability is
P(ξ) / P(ξ′) = ( P̃(ξ)/Z(θ) ) / ( P̃(ξ′)/Z(θ) ) = P̃(ξ) / P̃(ξ′)
where the un-normalized probabilities are P̃(ξ) = exp( Σ_{i=1}^k θ_i f_i(ξ) ). In log-space, the log-probability ratio is
ln( P(ξ) / P(ξ′) ) = Σ_{i=1}^k θ_i f_i(ξ) − Σ_{i=1}^k θ_i f_i(ξ′) = Σ_{i=1}^k θ_i ( f_i(ξ) − f_i(ξ′) )
When the θ_i have small magnitude, this log-ratio is also bounded, i.e., the probabilities are similar. This results in a smooth distribution.

50 Comparison of L1 and L2 Regularization. In both L1 and L2 we penalize the magnitude of the parameters. In the Gaussian case (L2) the penalty grows quadratically with the parameters, so an increase in θ_i from 3 to 3.1 is penalized more than an increase from 0 to 0.1; this leads to many small but nonzero parameters. In the Laplacian case (L1) the penalty grows linearly in |θ_i|, which drives many parameters to exactly zero; the result is a sparser model with fewer edges that is more tractable.

51 Efficiency of Optimization. Both the L1 and L2 regularization terms (the log-priors) are concave. Because the log-likelihood is also concave, the resulting log-posterior is concave and can be optimized using gradient methods. The introduction of the penalty terms also eliminates multiple equivalent optima.

52 Choice of Hyper-parameters. The regularization parameters, σ² in the L2 case and β in the L1 case, encode our belief that the model weights should be close to zero. The larger these parameters are, the broader our parameter prior. The choice of prior has an effect on the learned model; the standard method of selecting these parameters is cross-validation.
