Learning Parameters of Undirected Models. Sargur Srihari
1 Learning Parameters of Undirected Models. Sargur Srihari
2 Topics
- Difficulties due to Global Normalization
- Likelihood Function
- Maximum Likelihood Parameter Estimation
- Simple and Conjugate Gradient Ascent
- Conditionally Trained Models
- Learning with Missing Data
- EM with Gradient Ascent
- Maximum Entropy and Maximum Likelihood
- Parameter Priors and Regularization
3 Local vs Global Normalization
- BN: local normalization within each CPD: $\ln p(x) = \sum_{i=1}^{n} \ln p(x_i \mid pa_i)$
- MN: global normalization (partition function): $P(X_1,\dots,X_n) = \frac{1}{Z}\prod_{i=1}^{K}\phi_i(D_i)$, where $Z = \sum_{X_1,\dots,X_n}\prod_{i=1}^{K}\phi_i(D_i)$
- The global factor Z couples all parameters, preventing decomposition
- Significant computational ramifications: M.L. parameter estimation has no closed-form solution, so iterative methods are needed
4 Issues in Parameter Estimation
- Simple ML parameter estimation (even with complete data) cannot be solved in closed form
- Need iterative methods such as gradient ascent
- Good news: the likelihood function is concave, so these methods converge to the global optimum
- Bad news: each step of the iterative algorithm requires inference, so even simple parameter estimation is expensive or intractable
- Bayesian estimation is practically infeasible
5 Discriminative Training
- A common use of MNs is in settings such as image segmentation, where we have a particular inference task in mind
- In such settings we train the network discriminatively to get good performance on that particular inference task
6 Likelihood Function
- The likelihood function is the basis for all discussion of learning
- We show how it can be optimized to find maximum likelihood parameter estimates
- We begin with the form of the likelihood function for Markov networks, its properties, and their computational implications
7 Example of Coupled Likelihood Function
- Simple Markov network A - B - C with two potentials $\phi_1(A,B)$ and $\phi_2(B,C)$
- Gibbs distribution: $P(A,B,C) = \frac{1}{Z}\phi_1(A,B)\,\phi_2(B,C)$, with $Z = \sum_{a,b,c}\phi_1(a,b)\,\phi_2(b,c)$
- Log-likelihood of instance (a,b,c): $\ln P(a,b,c) = \ln\phi_1(a,b) + \ln\phi_2(b,c) - \ln Z$
- Parameter θ consists of all values of the factors $\phi_1$ and $\phi_2$
- Log-likelihood of data set D with M instances:
$$\ell(\theta : D) = \sum_{m=1}^{M}\Big(\ln\phi_1(a[m],b[m]) + \ln\phi_2(b[m],c[m]) - \ln Z(\theta)\Big) = \sum_{a,b} M[a,b]\ln\phi_1(a,b) + \sum_{b,c} M[b,c]\ln\phi_2(b,c) - M\ln Z(\theta)$$
- The first form sums over the M instances; the second sums over the different values of A and B (and of B and C), where M[a,b] is the number of instances with A=a, B=b
- The first term involves only $\phi_1$ and the second only $\phi_2$, but the third involves $\ln Z(\theta) = \ln\sum_{a,b,c}\phi_1(a,b)\,\phi_2(b,c)$, which couples the two potentials in the likelihood function
- When we change one potential $\phi_1$, Z(θ) changes, possibly changing the value of $\phi_2$ that maximizes $-\ln Z(\theta)$
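The coupling through ln Z can be checked numerically. The following is a minimal sketch for binary A, B, C; the function names, the toy data set, and the potential values are illustrative assumptions, not part of the slides.

```python
import itertools
import math

def log_partition(phi1, phi2):
    """ln Z = ln of the sum over (a, b, c) of phi1[a][b] * phi2[b][c]."""
    return math.log(sum(phi1[a][b] * phi2[b][c]
                        for a, b, c in itertools.product((0, 1), repeat=3)))

def log_likelihood(phi1, phi2, data):
    """Sum over instances of ln phi1(a,b) + ln phi2(b,c) - ln Z."""
    ln_z = log_partition(phi1, phi2)
    return sum(math.log(phi1[a][b]) + math.log(phi2[b][c]) - ln_z
               for a, b, c in data)

ones = [[1.0, 1.0], [1.0, 1.0]]
phi1_changed = [[1.0, 1.0], [1.0, 3.0]]     # raise only phi1(a1, b1)
data = [(1, 1, 0), (0, 0, 1), (1, 1, 1)]    # hypothetical toy data set

print(log_partition(ones, ones))            # ln 8
print(log_partition(phi1_changed, ones))    # ln 12: Z moved although phi2 is unchanged
print(log_likelihood(phi1_changed, ones, data))
```

Changing a single entry of phi1 changes ln Z, and ln Z enters every instance's log-likelihood, including the part that depends on phi2.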
8 Illustration of Coupled Likelihood
- Figure: log-likelihood surface for the network A - B - C, plotted with respect to two parameters only, $f_1(a^1,b^1)$ and $f_2(b^0,c^1)$; all other parameters are set to 1. The data set has M=100 with $M[a^1,b^1]=40$ and $M[b^0,c^1]=40$
- With binary variables the two factors have 8 entries in total
- The surface shows that when $\phi_1$ changes, the optimal $\phi_2$ also changes
- In this case the problem is avoidable: the network is equivalent to the BN A→B→C, so we can estimate the BN parameters and set $\phi_1(A,B)=P(A)P(B \mid A)$, $\phi_2(B,C)=P(C \mid B)$
- In general, we cannot convert learned BN parameters into an equivalent MN: the optimal likelihood achievable by the two representations is not the same
9 Form of Likelihood Function
- Instead of the Gibbs parameterization, use the log-linear framework
- Joint distribution of n variables $X_1,\dots,X_n$ with k features $F = \{f_i(D_i)\}_{i=1,\dots,k}$, where $D_i$ is a sub-graph and $f_i$ maps $D_i$ to R:
$$P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\Big\{\sum_{i=1}^{k}\theta_i f_i(D_i)\Big\}$$
- Parameters $\theta_i$ are weights we put on features
- If we have a sample ξ then its features are $f_i(\xi\langle D_i\rangle)$, which has the shorthand $f_i(\xi)$
- The representation is general: it can capture Markov networks with global structure and local structure
10 Parameters θ, Factors φ, Binary Features f
- Variables A, B, C, D; figure (a) shows the Markov network over A, B, C, D
- A feature for every entry in every table: the $f_i(D_i)$ are sixteen indicator functions defined over the clusters AB, BC, CD, DA
- Val(A)=$\{a^0,a^1\}$, Val(B)=$\{b^0,b^1\}$
- $f_{a^0,b^0}(A,B) = I\{A=a^0\}\,I\{B=b^0\}$, that is, $f_{a^0,b^0}=1$ if $a=a^0$ and $b=b^0$, and 0 otherwise, etc.
- Factor table for $\phi_1$:
    A    B    φ1(A,B)
    a0   b0   φ1(a0,b0)
    a0   b1   φ1(a0,b1)
    a1   b0   φ1(a1,b0)
    a1   b1   φ1(a1,b1)
- Joint distribution: $P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\{\sum_{i=1}^{k}\theta_i f_i(D_i)\}$
- With this representation $\theta_{a^0,b^0} = \ln\phi_1(a^0,b^0)$
- The parameters θ are thus log-potentials, which act as weights put on the features
11 Log Likelihood and Sufficient Statistics
- Joint probability distribution: $P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\{\sum_{i=1}^{k}\theta_i f_i(D_i)\}$
- $\theta=\{\theta_1,\dots,\theta_k\}$ are table entries; $f_i$ are features over instances of $D_i$
- Let D be a data set of M samples ξ[m], m=1,..,M
- Log-likelihood (log of the product of the probabilities of M independent instances):
$$\ell(\theta : D) = \sum_i \theta_i\Big(\sum_m f_i(\xi[m])\Big) - M\ln Z(\theta)$$
- The inner sums $\sum_m f_i(\xi[m])$ are the sufficient statistics (the likelihood depends on the data only through them)
- Dividing by the number of samples:
$$\frac{1}{M}\ell(\theta : D) = \sum_i \theta_i E_D[f_i(d_i)] - \ln Z(\theta)$$
- $E_D[f_i(d_i)]$ is the average of $f_i$ in the data set
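As a small check that the likelihood depends on the data only through the feature sums, the sketch below computes the log-likelihood both instance by instance and from the sufficient statistics. The two agreement features over binary (A, B, C), the parameter values, and the data are hypothetical choices made for illustration.

```python
import itertools
import numpy as np

def features(x):
    """Hypothetical features: indicators of A == B and of B == C."""
    a, b, c = x
    return np.array([float(a == b), float(b == c)])

def log_z(theta):
    """ln Z(theta) by brute-force enumeration of all 8 assignments."""
    scores = [theta @ features(x) for x in itertools.product((0, 1), repeat=3)]
    return np.log(np.sum(np.exp(scores)))

def log_lik_per_instance(theta, data):
    return sum(theta @ features(x) - log_z(theta) for x in data)

def log_lik_from_statistics(theta, data):
    counts = np.sum([features(x) for x in data], axis=0)  # sufficient statistics
    return theta @ counts - len(data) * log_z(theta)

theta = np.array([0.5, -0.3])
data = [(0, 0, 1), (1, 1, 1), (0, 1, 1)]
print(log_lik_per_instance(theta, data), log_lik_from_statistics(theta, data))
```

Both functions return the same value, so the data can be discarded once the feature counts are stored.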
12 Properties of Log-Likelihood
- The log-likelihood is a sum of two functions:
$$\ell(\theta : D) = \underbrace{\sum_i \theta_i\Big(\sum_m f_i(\xi[m])\Big)}_{\text{first term}}\; \underbrace{-\,M\ln Z(\theta)}_{\text{second term}}$$
- The first term is linear in the parameters θ: increasing the parameters increases this term, but the likelihood has an upper bound of probability 1
- The second term balances the first term: $\ln Z(\theta) = \ln\sum_{\xi}\exp\{\sum_i \theta_i f_i(\xi)\}$
- The log-partition function $\ln Z(\theta)$ is convex, as seen next, so its negative is concave
- The sum of a linear function and a concave function is concave, so there are no local optima and we can use gradient ascent
13 Proof that ln Z(θ) is Convex
- A function $f(\vec{x})$ is convex if for every $0 \le \alpha \le 1$: $f(\alpha\vec{x} + (1-\alpha)\vec{y}) \le \alpha f(\vec{x}) + (1-\alpha) f(\vec{y})$
- The function is bowl-like: every interpolation between the images of two points is at least as large as the image of their interpolation
- One way to show that a function is convex is to show that its Hessian (the matrix of the function's second derivatives) is positive semidefinite
- The Hessian of $\ln Z(\theta)$ is computed next
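14 Hessian of ln Z(θ)
- Given a set of features F with $\ln Z(\theta) = \ln\sum_{\xi}\exp\{\sum_i \theta_i f_i(\xi)\}$:
$$\frac{\partial}{\partial\theta_i}\ln Z(\theta) = E_\theta[f_i], \qquad \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln Z(\theta) = \mathrm{Cov}_\theta[f_i; f_j]$$
  where $E_\theta[f_i]$ is shorthand for $E_{P(\chi;\theta)}[f_i]$
- Proof: partial derivatives with respect to $\theta_i$ and $\theta_j$
$$\frac{\partial}{\partial\theta_i}\ln Z(\theta) = \frac{1}{Z(\theta)}\sum_{\xi}\frac{\partial}{\partial\theta_i}\exp\Big\{\sum_j \theta_j f_j(\xi)\Big\} = \frac{1}{Z(\theta)}\sum_{\xi} f_i(\xi)\exp\Big\{\sum_j \theta_j f_j(\xi)\Big\} = E_\theta[f_i]$$
$$\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln Z(\theta) = \frac{\partial}{\partial\theta_j}\Big[\frac{1}{Z(\theta)}\frac{\partial}{\partial\theta_i}\sum_{\xi}\exp\Big\{\sum_k \theta_k f_k(\xi)\Big\}\Big] = E_\theta[f_i f_j] - E_\theta[f_i]\,E_\theta[f_j] = \mathrm{Cov}_\theta[f_i; f_j]$$
- Since the covariance matrix of the features is positive semidefinite, $\ln Z(\theta)$ is convex and $-\ln Z(\theta)$ is a concave function of θ
- Corollary: the log-likelihood function is concave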
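The two identities can be sanity-checked numerically on a model small enough to enumerate. This is a sketch under assumptions: the two agreement features over three binary variables and the parameter values are hypothetical, and the finite-difference helper is only a check, not part of any learning procedure.

```python
import itertools
import numpy as np

# All assignments of three binary variables and a hypothetical 8 x 2 feature matrix.
X = list(itertools.product((0, 1), repeat=3))
F = np.array([[float(a == b), float(b == c)] for a, b, c in X])

def log_z(theta):
    return np.log(np.exp(F @ theta).sum())

def model_moments(theta):
    """Return E_theta[f] and Cov_theta[f] by exact enumeration."""
    p = np.exp(F @ theta - log_z(theta))
    mean = p @ F
    cov = (F * p[:, None]).T @ F - np.outer(mean, mean)
    return mean, cov

def finite_diff_grad(theta, eps=1e-5):
    """Central-difference approximation of the gradient of ln Z."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (log_z(theta + e) - log_z(theta - e)) / (2 * eps)
    return g

theta = np.array([0.7, -0.4])
mean, cov = model_moments(theta)
print(finite_diff_grad(theta), mean)     # the two vectors should agree
print(np.linalg.eigvalsh(cov))           # eigenvalues should be non-negative
```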
15 Non-unique Solution
- Since $\ln Z(\theta)$ is convex, $-\ln Z(\theta)$ is concave
- This implies that the log-likelihood is unimodal: it has no local optima
- However, it does not imply uniqueness of the global optimum: multiple parameterizations can result in the same distribution
- A feature for every entry in the table is always redundant, e.g., $f_{a^0,b^0} = 1 - f_{a^0,b^1} - f_{a^1,b^0} - f_{a^1,b^1}$
- Hence there is a continuum of equivalent parameterizations
16 Maximum (Conditional) Likelihood Parameter Estimation
- Task: estimate the parameters of a Markov network with a fixed structure, given a fully observable data set D
- The simplest variant of the problem is maximum likelihood parameter estimation
- Log-likelihood given features $F=\{f_i,\ i=1,\dots,k\}$ is
$$\ell(\theta : D) = \sum_{i=1}^{k}\theta_i\Big(\sum_m f_i(\xi[m])\Big) - M\ln Z(\theta)$$
17 Gradient of Log-likelihood
- The gradient of the log-likelihood is zero at its maximum points
- Log-likelihood: $\ell(\theta : D) = \sum_{i=1}^{k}\theta_i\big(\sum_m f_i(\xi[m])\big) - M\ln Z(\theta)$
- Gradient of the average log-likelihood:
$$\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$$
- The first term is the average value of $f_i$ in data D; the second term is its expected value under the model distribution
- This provides a precise characterization of the m.l. parameters $\hat\theta$: $E_D[f_i(\chi)] = E_{\hat\theta}[f_i]$
- That is, the expected value of each feature relative to $P_{\hat\theta}$ matches its empirical expectation in D
18 Need for Iterative Method
- Although the function is concave, there is no analytical form for its maximum
- Since there is no closed-form solution, we use iterative methods, e.g., gradient ascent, as shown next
19 Simple Gradient Descent
Procedure Gradient-Descent (
    θ^1,   // Initial starting point
    f,     // Function to be minimized
    δ      // Convergence threshold
)
    t ← 1
    do
        θ^{t+1} ← θ^t − η ∇f(θ^t)
        t ← t + 1
    while ‖θ^t − θ^{t−1}‖ > δ
    return θ^t
Intuition
- Taylor's expansion of the function f(θ) in the neighborhood of $\theta^t$ is $f(\theta) \approx f(\theta^t) + (\theta - \theta^t)^T \nabla f(\theta^t)$
- Let $\theta = \theta^{t+1} = \theta^t + h$; thus $f(\theta^{t+1}) \approx f(\theta^t) + h^T \nabla f(\theta^t)$
- The derivative of $f(\theta^{t+1})$ with respect to h is $\nabla f(\theta^t)$
- Among steps of a given length, $h \propto \nabla f(\theta^t)$ gives the largest increase (the linear term is then positive) and $h \propto -\nabla f(\theta^t)$ gives the largest decrease
- Alternatively: the slope $\nabla f(\theta^t)$ points in the direction of steepest ascent. If we take a step η in the opposite direction, we decrease the value of f
One-dimensional example
- Let $f(\theta)=\theta^2$. This function has its minimum at θ=0, which we want to determine using gradient descent
- We have $f'(\theta)=2\theta$. For gradient descent, we update by $-\eta f'(\theta)$
- If $\theta^t > 0$ then $f'(\theta^t)=2\theta^t$ is positive, thus $\theta^{t+1} < \theta^t$
- If $\theta^t < 0$ then $f'(\theta^t)=2\theta^t$ is negative, thus $\theta^{t+1} > \theta^t$
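A runnable version of the Gradient-Descent procedure, applied to the one-dimensional example f(θ)=θ², might look as follows. The default learning rate and the `max_iter` safeguard are assumptions added for the sketch; the slide's procedure has no iteration cap.

```python
def gradient_descent(theta, grad, eta=0.1, delta=1e-8, max_iter=10_000):
    """Scalar gradient descent: stop when the update is no larger than delta."""
    for _ in range(max_iter):
        new_theta = theta - eta * grad(theta)   # step against the gradient
        if abs(new_theta - theta) <= delta:
            return new_theta
        theta = new_theta
    return theta

# f(theta) = theta^2 has gradient 2 * theta and its minimum at theta = 0.
from_right = gradient_descent(5.0, lambda t: 2.0 * t)
from_left = gradient_descent(-5.0, lambda t: 2.0 * t)
print(from_right, from_left)
```

Starting on either side of zero, each update moves θ toward 0, matching the sign argument in the example.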
20 Difficulties with Simple Gradient Descent
- Performance depends on the choice of learning rate η
- Large learning rate: overshoots
- Small learning rate: extremely slow
- Solution: start with a large η and settle on an optimal value; this needs a schedule for shrinking η
21 Improvements to Gradient Descent
- The Gradient-Descent procedure of slide 19 repeats $\theta^{t+1} \leftarrow \theta^t - \eta\nabla f(\theta^t)$ until $\|\theta^t - \theta^{t-1}\| \le \delta$ and returns $\theta^t$
Line Search
- Adaptively choose η at each step
- Define a line in the direction of the gradient: $g(\eta) = \theta^t + \eta\nabla f(\theta^t)$ (for descent, search along $-\nabla f(\theta^t)$)
- Find three points $\eta_1 < \eta_2 < \eta_3$ so that $f(g(\eta_2))$ is smaller than at both others when minimizing (larger when maximizing), so that the three points bracket the optimum
- Candidate new point: $\eta' = (\eta_1 + \eta_2)/2$
- Brent's method (figure): the solid line is $f(g(\eta))$, the three points bracket its maximum, and the dashed line shows the quadratic fit
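The quadratic fit used inside Brent's method can be illustrated with a single parabolic interpolation step through three bracketing points. This is only that one step, written for the minimization case, not a full implementation of Brent's method; the function name and the test objective are hypothetical.

```python
def parabolic_step(f, e1, e2, e3):
    """Abscissa of the vertex of the parabola through (e_k, f(e_k)), k = 1, 2, 3."""
    f1, f2, f3 = f(e1), f(e2), f(e3)
    num = (e2 - e1) ** 2 * (f2 - f3) - (e2 - e3) ** 2 * (f2 - f1)
    den = (e2 - e1) * (f2 - f3) - (e2 - e3) * (f2 - f1)
    return e2 - 0.5 * num / den

def line_objective(eta):
    """Hypothetical objective along the search line, minimized at eta = 0.3."""
    return (eta - 0.3) ** 2 + 1.0

# eta = 0.0, 0.5, 1.0 bracket the minimum: the middle value is the smallest.
eta_star = parabolic_step(line_objective, 0.0, 0.5, 1.0)
print(eta_star)
```

Because the objective here is itself quadratic, one step lands on the minimizer; for a general objective the step is repeated with an updated bracket.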
22 Conjugate Gradient Ascent
- In simple gradient ascent with line search, two consecutive gradients are orthogonal: $\nabla f_{obj}(\theta^{t+1})$ is orthogonal to $\nabla f_{obj}(\theta^t)$
- Progress is zig-zag, so progress toward the maximum slows down
- Figure: examples with a quadratic, $f_{obj}(x,y) = -(x^2 + 10y^2)$, and an exponential, $f_{obj}(x,y) = \exp[-(x^2 + 10y^2)]$
- Solution: remember past directions and bias the new direction $h_t$ as a combination of the current gradient $g_t$ and the past direction $h_{t-1}$
- Faster convergence than simple gradient ascent
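The zig-zag behaviour and the benefit of combining directions can be reproduced on the quadratic example $f_{obj}(x,y) = -(x^2 + 10y^2)$. The sketch below assumes exact line search (available in closed form for a quadratic) and uses the Fletcher-Reeves rule for mixing directions; the slide does not say which mixing rule is intended, so that choice is an assumption.

```python
import numpy as np

A = np.diag([2.0, 20.0])   # f_obj(z) = -0.5 * z^T A z = -(x^2 + 10 y^2)

def grad(z):
    return -A @ z

def exact_step(z, h):
    """Step size maximizing f_obj along direction h (closed form for a quadratic)."""
    return (grad(z) @ h) / (h @ A @ h)

def steepest_ascent(z, steps):
    grads = []
    for _ in range(steps):
        g = grad(z)
        grads.append(g)
        z = z + exact_step(z, g) * g
    return z, grads

def conjugate_gradient(z, steps):
    g = grad(z)
    h = g.copy()
    for _ in range(steps):
        z = z + exact_step(z, h) * h
        g_new = grad(z)
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves coefficient (assumed rule)
        h = g_new + beta * h               # mix current gradient with past direction
        g = g_new
    return z

z0 = np.array([1.0, 1.0])
z_sa, grads = steepest_ascent(z0, 2)
z_cg = conjugate_gradient(z0, 2)
print(grads[0] @ grads[1])   # approximately 0: consecutive gradients are orthogonal
print(z_sa, z_cg)            # conjugate gradient is at the maximum after two steps
```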
23 Computing the Gradient
- The gradient of the (average) log-likelihood is $E_D[f_i(\chi)] - E_\theta[f_i]$
- It is the difference between the feature's empirical count in data D and its expected count relative to our current parameterization
- Computing the empirical count $E_D[f_i(\chi)]$ is easy. Ex: for the feature $f_{a^0,b^0}(a,b) = I\{a=a^0\}\,I\{b=b^0\}$, it is the empirical frequency in D of the event $a^0,b^0$
- Figure (a): Markov network over A, B, C, D
24 Computing the Expected Count
- We need to compute the different probabilities of the form $P_{\theta^t}(a,b)$, since the expectation is a probability-weighted average
- Computing these probabilities requires running inference over the network
- Thus the computational cost of parameter estimation is very high
- Gradient ascent is not efficient; convergence is much faster with second-order methods based on the Hessian
25 A Standard Iterative Solution
1. Initialize θ
2. Run inference (compute Z(θ))
3. Compute the gradient of ℓ: $E_D[f_i(\chi)] - E_\theta[f_i]$
4. Update θ: $\theta^{t+1} \leftarrow \theta^t + \eta\nabla f_{obj}(\theta^t)$
5. Optimum reached? If no, return to step 2; if yes, stop
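The loop above can be sketched end to end for a model small enough that inference is brute-force enumeration. The feature set, the empirical averages in `emp`, the learning rate, and the stopping rule are all hypothetical choices for illustration; in a realistic network the call to `expected_features` would be replaced by an inference algorithm.

```python
import itertools
import numpy as np

# All assignments (a, b, c) of three binary variables and two hypothetical
# agreement features, f1 = [a == b] and f2 = [b == c].
X = np.array(list(itertools.product((0, 1), repeat=3)))
F = np.stack([X[:, 0] == X[:, 1], X[:, 1] == X[:, 2]], axis=1).astype(float)

def expected_features(theta):
    """Inference step: E_theta[f] by exact enumeration."""
    scores = F @ theta
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return p @ F

def fit(emp, eta=0.5, tol=1e-8, max_iter=10_000):
    """Gradient ascent on the average log-likelihood."""
    theta = np.zeros(F.shape[1])                # 1. initialize
    for _ in range(max_iter):
        g = emp - expected_features(theta)      # 2-3. inference, then gradient
        if np.linalg.norm(g) < tol:             # 5. optimum reached?
            break
        theta = theta + eta * g                 # 4. update
    return theta

emp = np.array([0.8, 0.3])    # hypothetical empirical averages E_D[f]
theta_hat = fit(emp)
print(theta_hat, expected_features(theta_hat))
```

At convergence the model's expected features match the empirical ones, which is the moment-matching characterization of the maximum likelihood parameters.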
26 Iterative Methods for MRF Parameters
- Gradient ascent over the parameter space
- Good news: the likelihood function is concave, so the method is guaranteed to converge to the global optimum
- Bad news: each step needs inference, so even simple parameter estimation is intractable
- Bayesian parameter estimation is even harder: integration is done using MCMC
27 Newton's Method
- Newton's method finds zeroes of a function using derivatives
- More efficient than simple gradient descent
- Quasi-Newton methods use an approximation to the Hessian
- Since we are solving for a zero of the derivative of ℓ(θ : D), we need its second derivative (the Hessian)
- Figure: Newton's method
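To make the idea concrete, here is a sketch of Newton's method on the simplest possible log-linear model: one binary variable X with a single feature f(x)=x, so that ln Z(θ) = ln(1 + e^θ) and E_θ[f] is the logistic function of θ. This model and the target value `emp` are assumptions chosen so that the first and second derivatives are available in closed form.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def newton_maximize(emp, theta=0.0, tol=1e-12, max_iter=50):
    """Maximize theta * emp - ln(1 + e^theta) by Newton steps on its derivative."""
    for it in range(1, max_iter + 1):
        p = sigmoid(theta)
        grad = emp - p             # first derivative: E_D[f] - E_theta[f]
        hess = -p * (1.0 - p)      # second derivative: -Var_theta[f], negative
        step = grad / hess
        theta -= step              # Newton update: theta - grad / hess
        if abs(step) < tol:
            return theta, it
    return theta, max_iter

theta_hat, iterations = newton_maximize(emp=0.8)
print(theta_hat, iterations)       # theta_hat is close to ln 4
```

The second derivative is the negative variance of the feature, in line with the Hessian of the log-likelihood being the negative (scaled) covariance of the features.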
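28 Hessian of the Log-likelihood
- The log-likelihood has the form $\ell(\theta : D) = \sum_{i=1}^{k}\theta_i\big(\sum_m f_i(\xi[m])\big) - M\ln Z(\theta)$
- Hessian: $\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ell(\theta : D) = -M\,\mathrm{Cov}_\theta[f_i; f_j]$
- It requires the joint expectation of two features, which is often computationally infeasible
- L-BFGS (a quasi-Newton algorithm) builds an approximation from successive gradients and uses line search, which avoids computing the Hessian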
29 Conditionally Trained Models
- Often we want to perform an inference task where we have a known set of variables, or features, X
- We want to query a pre-determined set of variables Y
- In that case we prefer discriminative training
- Train the network as a Conditional Random Field (CRF) that encodes a conditional distribution P(Y | X)
30 CRF Training
- Train the network as a CRF that encodes a conditional distribution P(Y | X)
- The training set consists of pairs D = {(y[m], x[m])}, m=1,..,M. Example: (word category, word)
- The objective function is the conditional log-likelihood, the log of the joint conditional probability of the observed pairs:
$$\ell_{Y|X}(\theta : D) = \ln P(y[1,\dots,M] \mid x[1,\dots,M], \theta) = \sum_{m=1}^{M}\ln P(y[m] \mid x[m], \theta)$$
- It is the empirical counterpart of the expected conditional log-likelihood $E_{(x,y)\sim P^*}[\log\tilde{P}(y \mid x)]$
- It is concave since each term is concave
31 Gradient of Conditional Likelihood
- A reduced MN is itself an MN. We use a log-linear representation with features $f_i$ and parameters θ
- Analogous to the gradient for the full MN, $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$, we can write the gradient for the reduced MN:
$$\frac{\partial}{\partial\theta_i}\ell_{Y|X}(\theta : D) = \sum_{m=1}^{M}\Big(f_i(y[m], x[m]) - E_\theta[f_i \mid x[m]]\Big)$$
- The first term is the empirical count; the second, conditioned on x[m], is based on running inference on each data case
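The conditional gradient can be checked on the smallest possible CRF: a single binary label y with a scalar observation x. The feature function `feats`, the data, and the parameter values are hypothetical; with this choice the model reduces to logistic regression.

```python
import numpy as np

def feats(y, x):
    """Hypothetical features f(y, x) = y * [1, x] for binary y and scalar x."""
    return y * np.array([1.0, x])

def cond_log_lik(theta, data):
    """Sum over data cases of ln P(y | x, theta)."""
    total = 0.0
    for y, x in data:
        scores = np.array([theta @ feats(v, x) for v in (0, 1)])
        total += scores[y] - np.log(np.exp(scores).sum())
    return total

def cond_gradient(theta, data):
    """Sum over data cases of f(y[m], x[m]) - E_theta[f | x[m]]."""
    g = np.zeros_like(theta)
    for y, x in data:
        scores = np.array([theta @ feats(v, x) for v in (0, 1)])
        p = np.exp(scores - np.log(np.exp(scores).sum()))    # P(. | x): per-case inference
        expected = sum(p[v] * feats(v, x) for v in (0, 1))
        g += feats(y, x) - expected
    return g

theta = np.array([0.2, -0.5])
data = [(1, 0.5), (0, -1.0), (1, 2.0)]
print(cond_gradient(theta, data))
```

Note that the expectation is recomputed for every data case, because the normalization depends on x[m].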
32 Conditional Training is Simpler
- Figure: chain Y1 - Y2 - Y3 - Y4 - Y5 with observed variables X1,..,X5
- The full MN encodes $\tilde{P}(X,Y) = \prod_{i=1}^{4}\phi_i(Y_i, Y_{i+1})\,\prod_{i=1}^{5}\phi_i(Y_i, X_1, X_2, X_3, X_4, X_5)$
- Edges disappear in a reduced Markov network: after conditioning on X, the remaining edges form a simple chain, $\tilde{P}(Y \mid X) = \prod_{i=1}^{4}\phi_i(Y_i, Y_{i+1} \mid X_1, X_2, X_3, X_4, X_5)$
- The next slide compares different sequential models (CRF, HMM and MEMM) with respect to parameter learning complexity
33 Learning Models for Sequence Labeling
- Given: sequence of observations $X=\{X_1,\dots,X_k\}$. Need: a joint label $Y=\{Y_1,\dots,Y_k\}$
- Model trade-offs: expressive power, learnability
- Figure: graphs of the CRF, MEMM and HMM over X1,..,X5 and Y1,..,Y5
CRF is a discriminative model
- Directly obtains P(Y | X): $P(Y \mid X) = \frac{1}{Z(X)}\tilde{P}(Y,X)$, with $\tilde{P}(Y,X) = \prod_{i=1}^{k-1}\phi_i(Y_i, Y_{i+1})\,\prod_{i=1}^{k}\phi_i(Y_i, X_i)$ and $Z(X) = \sum_{Y}\tilde{P}(Y,X)$
- Note: Z(X) is a marginal of the un-normalized measure
- Requires gradient-based learning, which is expensive and difficult with large data sets
MEMM is a conditional directed model
- $P(Y \mid X) = \prod_{i=1}^{k}P(Y_i \mid X_i, Y_{i-1})$; does not model a distribution over X
- $Y_1 \perp X_2$ if $Y_2$ is not given, by d-separation. More generally, $Y_i \perp X_j \mid X_{-j}$ for $j > i$
- Has the label bias problem: a later observation has no effect on the posterior probability of the current state. Example: in activity recognition in a video sequence, frames are labeled as running/walking; earlier frames may be blurry but later ones clearer
HMM is a generative model
- Needs the joint probability $P(X,Y) = \prod_{i=1}^{k}P(X_i \mid Y_i)\,P(Y_i \mid Y_{i-1})$, from which $P(Y \mid X) = P(X,Y)/P(X)$
MEMM and HMM are more easily learned
- They are pure directed models: parameters are computed in closed form using maximum likelihood
34 Parameter Estimation with Missing Data
- Missing data examples: some data fields omitted or not collected; some hidden variables
- The learning problem has difficulties: parameters may not be identifiable; coupling between different parameters; the likelihood is not concave (has local maxima)
- Available methods:
  1. Gradient ascent (assume data is missing at random)
  2. Expectation-Maximization
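35 Gradient Ascent with Missing Data
- Assume data is missing at random in data D
- In the m-th instance, let o[m] be the observed entries and h[m] the random variables that are missing
- The log-likelihood has the form
$$\frac{1}{M}\ln P(D \mid \theta) = \frac{1}{M}\sum_{m=1}^{M}\ln\Big(\sum_{h[m]}P(o[m], h[m] \mid \theta)\Big)$$
  where writing P as the un-normalized measure divided by Z(θ) contributes a term $-\ln Z(\theta)$
- The gradient for feature $f_i$ has the form
$$\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = \frac{1}{M}\sum_{m=1}^{M}E_{h[m]\sim P(h[m] \mid o[m],\theta)}[f_i] - E_\theta[f_i]$$
- This is a difference between two expectations of $f_i$: the expectation over the data and hidden variables, and the expectation in the current distribution P(χ | θ)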
36 Comparison with Full Data Case
- With full data, the gradient of the average log-likelihood is $\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$
- The first term is an aggregate over data D; for the second term we need inference over the current distribution P(χ | θ)
- With missing data we have
$$\frac{\partial}{\partial\theta_i}\frac{1}{M}\ell(\theta : D) = \frac{1}{M}\sum_{m=1}^{M}E_{h[m]\sim P(h[m] \mid o[m],\theta)}[f_i] - E_\theta[f_i]$$
- We have to run inference separately for every instance m, conditioning on o[m]
- The cost is much higher than with full data
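The missing-data gradient can be sketched for a two-variable model with one observed variable o and one hidden variable h, where everything is computed by enumeration. The features, observations, and parameter values are hypothetical and chosen only to keep the example small.

```python
import itertools
import numpy as np

STATES = list(itertools.product((0, 1), repeat=2))    # all (o, h) pairs

def feats(o, h):
    """Hypothetical features: indicator of o == h, and the value of h."""
    return np.array([float(o == h), float(h)])

def joint(theta):
    """P(o, h | theta) for every state, by enumeration."""
    w = np.array([np.exp(theta @ feats(o, h)) for o, h in STATES])
    return dict(zip(STATES, w / w.sum()))

def avg_marginal_log_lik(theta, obs):
    p = joint(theta)
    return np.mean([np.log(p[(o, 0)] + p[(o, 1)]) for o in obs])

def gradient(theta, obs):
    p = joint(theta)
    model = sum(p[s] * feats(*s) for s in STATES)     # E_theta[f]
    clamped = np.zeros_like(theta)
    for o in obs:                                     # one inference run per instance
        post = np.array([p[(o, 0)], p[(o, 1)]])
        post /= post.sum()                            # P(h | o, theta)
        clamped += sum(post[h] * feats(o, h) for h in (0, 1))
    return clamped / len(obs) - model

theta = np.array([0.4, -0.7])
obs = [0, 1, 1, 0, 1]
print(gradient(theta, obs))
```

The per-instance loop is the extra cost relative to the full-data case, where the first term is a fixed average computed once from the data.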
37 Use of Expectation Maximization
- As for any probabilistic model, an alternative method of parameter estimation with missing data is EM
- The E-step uses the current parameters to estimate the missing values
- The M-step re-estimates the parameters
- For BNs, EM has significant advantages: the M-step has a closed-form solution
- For MNs, the answer is not clear-cut: the M-step requires running inference multiple times, as seen next
38 EM for MN Parameter Learning
- E-step: use the current parameters $\theta^{(t)}$ to compute the expected sufficient statistics, i.e., the expected feature counts. At iteration t the expected sufficient statistic for feature $f_i$ is
$$\bar{M}_{\theta^{(t)}}[f_i] = \frac{1}{M}\sum_{m}E_{h[m]\sim P(h[m] \mid o[m],\theta^{(t)})}[f_i]$$
- M-step: done using maximum likelihood parameter estimation (using gradient ascent!)
- It requires running inference multiple times, once for each iteration of the gradient ascent procedure
- At step k of this inner-loop optimization, we have a gradient of the form $\bar{M}_{\theta^{(t)}}[f_i] - E_{\theta^{(t,k)}}[f_i]$
39 Maximum Entropy and Maximum Likelihood
- Return to basic maximum likelihood estimation; an alternative formulation provides insight
- Consider the log-linear model $P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\{\sum_{i=1}^{k}\theta_i f_i(D_i)\}$
- Relate it to finding the distribution of maximum entropy subject to certain constraints
40 Motivation of Max Entropy
- Suppose we have some summary statistics of an empirical distribution
- Ex 1: statistics published in a census report, such as marginal distributions of single variables or certain pairs, or other events of interest
- Ex 2: average final grade of students in a class, correlation of final grades with homework scores
- But we have no access to the full data set
- Find a typical distribution that satisfies these constraints
41 Maximum Entropy Formulation
- The goal is to select a distribution that:
  1. Satisfies the given constraints
  2. Has no additional structure or information
- We formulate the following problem: find Q(χ) maximizing $H_Q(\chi)$ subject to $E_Q[f_i] = E_D[f_i]$, i=1,..,k
- These are called expectation constraints
42 Solution to Max Entropy Estimation
- Somewhat surprisingly, the distribution Q*, the maximum entropy distribution satisfying the expectation constraints $E_Q[f_i] = E_D[f_i]$, is a Gibbs distribution
- $Q^* = P_{\hat\theta}$ where $P_{\hat\theta}(\chi) = \frac{1}{Z(\hat\theta)}\exp\{\sum_i \hat\theta_i f_i(\chi)\}$
- Here $\hat\theta$ is the m.l.e. parameterization relative to D
43 Duality of Max Ent and Max Likelihood
- The two problems
  1. Maximizing entropy subject to expectation constraints
  2. Maximizing likelihood given structural constraints on the distribution
  are convex duals of each other
- For the maximum likelihood parameters $\hat\theta$: $H_{P_{\hat\theta}}(\chi) = -\frac{1}{M}\ell(\hat\theta : D)$
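The entropy identity follows in two lines from the log-linear form and the moment-matching condition $E_{\hat\theta}[f_i] = E_D[f_i]$ that characterizes the maximum likelihood parameters. A short derivation, using only quantities defined on the earlier slides:

```latex
\begin{aligned}
H_{P_{\hat\theta}}(\chi)
  &= -E_{\hat\theta}\big[\ln P_{\hat\theta}(\chi)\big]
   = -\sum_i \hat\theta_i\, E_{\hat\theta}[f_i] + \ln Z(\hat\theta) \\
  &= -\Big(\sum_i \hat\theta_i\, E_D[f_i] - \ln Z(\hat\theta)\Big)
   = -\frac{1}{M}\,\ell(\hat\theta : D)
\end{aligned}
```

The step between the two lines is where moment matching is used; for parameters other than $\hat\theta$ the identity does not hold in general.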
44 Parameter Priors and Regularization
- Maximum likelihood is prone to over-fitting
- We can introduce a prior distribution P(θ)
- Due to non-decomposability, a fully Bayesian approach is infeasible
- Instead, we perform MAP estimation: find the parameters that maximize P(θ)P(D | θ)
- Here ln P(D | θ) is expressed in log-space as $\ell(\theta : D) = \sum_{i=1}^{k}\theta_i\big(\sum_m f_i(\xi[m])\big) - M\ln Z(\theta)$
- This in turn is derived from the joint distribution $P(X_1,\dots,X_n;\theta) = \frac{1}{Z(\theta)}\exp\{\sum_{i=1}^{k}\theta_i f_i(D_i)\}$
45 Choice of Prior P(θ)
- Only a few priors are used in practice:
  - Gaussian (L2 regularization, a quadratic penalty on weights)
  - Laplacian (L1 regularization)
- Both priors penalize parameters whose magnitude (positive or negative) is large
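46 Gaussian Prior and L2-Regularization
- Most common is a Gaussian prior on the log-linear parameters θ, for some choice of hyper-parameter $\sigma^2$:
$$P(\theta \mid \sigma^2) = \prod_{i=1}^{k}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{\theta_i^2}{2\sigma^2}\Big\}$$
- Converting the MAP objective P(θ)P(D | θ) to log-space gives $\ln P(\theta) + \ell(\theta : D)$
- Its first term, up to a constant, is $\ln P(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{k}\theta_i^2$, which is called an L2-regularization term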
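For gradient-based optimization of the MAP objective, the Gaussian prior simply adds a linear shrinkage term to the likelihood gradient. A worked form, combining the log-prior above with the gradient of the log-likelihood derived on the earlier slides:

```latex
\frac{\partial}{\partial\theta_i}\Big[\ell(\theta : D) + \ln P(\theta \mid \sigma^2)\Big]
  = M\big(E_D[f_i] - E_\theta[f_i]\big) - \frac{\theta_i}{\sigma^2}
```

At the MAP solution the model expectation therefore no longer matches the empirical one exactly; the gap is $\theta_i / (M\sigma^2)$, which shrinks as M or $\sigma^2$ grows.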
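47 Laplacian Prior and L1-Regularization
- A different prior used is the Laplacian: $P_{Laplacian}(\theta) = \frac{1}{2\beta}\exp\Big\{-\frac{|\theta|}{\beta}\Big\}$
- Taking the log we obtain, up to a constant, $\ln P(\theta) = -\frac{1}{\beta}\sum_{i=1}^{k}|\theta_i|$, i.e., a penalty $\frac{1}{\beta}\sum_{i=1}^{k}|\theta_i|$ which we wish to minimize
- This is generally called L1-regularization
- Both forms of regularization penalize parameters whose magnitude is large
- Figure: Laplacian distribution (β=1) and Gaussian distribution ($\sigma^2$=1)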
48 Why Prefer Low Magnitude Parameters?
- Properties of the prior: to pull the distribution towards an uninformed one, and to smooth fluctuations in the data
- A distribution is smooth if the probabilities assigned to different assignments are not radically different
- Consider two assignments ξ and ξ' (an assignment is an instance of the variables $X_1,\dots,X_n$)
- We consider the ratio of their probabilities next
49 Smoothness Resulting from Small θ
- Given two assignments ξ and ξ', their relative probability is
$$\frac{P(\xi)}{P(\xi')} = \frac{\tilde{P}(\xi)/Z_\theta}{\tilde{P}(\xi')/Z_\theta} = \frac{\tilde{P}(\xi)}{\tilde{P}(\xi')}$$
  where the un-normalized probabilities are $\tilde{P}(\xi) = \exp\{\sum_{i=1}^{k}\theta_i f_i(\xi)\}$
- In log-space, the log-probability ratio is
$$\ln\frac{P(\xi)}{P(\xi')} = \sum_{i=1}^{k}\theta_i f_i(\xi) - \sum_{i=1}^{k}\theta_i f_i(\xi') = \sum_{i=1}^{k}\theta_i\big(f_i(\xi) - f_i(\xi')\big)$$
- When the $\theta_i$ have small magnitude, this log-ratio is also bounded, i.e., the probabilities are similar
- This results in a smooth distribution
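The bound can be made explicit. Assuming the features take values in [0, 1], as the indicator features used earlier do (for other features the bound scales with their range), the triangle inequality gives:

```latex
\left|\ln\frac{P(\xi)}{P(\xi')}\right|
  = \left|\sum_{i=1}^{k}\theta_i\big(f_i(\xi) - f_i(\xi')\big)\right|
  \le \sum_{i=1}^{k}|\theta_i|\,\big|f_i(\xi) - f_i(\xi')\big|
  \le \sum_{i=1}^{k}|\theta_i|
```

So if, for example, the absolute values of the parameters sum to at most 1, no assignment is more than a factor of e more probable than any other.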
50 Comparison of L1 and L2 Regularization
- In both L1 and L2 we penalize the magnitude of the parameters
- In the Gaussian case (L2), the penalty grows quadratically with the parameters: an increase in $\theta_i$ from 3 to 3.1 is penalized more than an increase from 0 to 0.1. This leads to many small parameters
- In the Laplacian case (L1), the penalty grows linearly. This tends to drive parameters to exactly zero, which results in fewer edges and a more tractable model
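The arithmetic behind the 3-to-3.1 versus 0-to-0.1 comparison can be spelled out directly. The sketch below uses the penalty forms from the two prior slides with σ²=1 and β=1; the function names are illustrative.

```python
def l2_penalty(theta, sigma2=1.0):
    """Negative Gaussian log-prior, up to a constant: theta^2 / (2 sigma^2)."""
    return theta ** 2 / (2.0 * sigma2)

def l1_penalty(theta, beta=1.0):
    """Negative Laplacian log-prior, up to a constant: |theta| / beta."""
    return abs(theta) / beta

l2_far = l2_penalty(3.1) - l2_penalty(3.0)    # about 0.305
l2_near = l2_penalty(0.1) - l2_penalty(0.0)   # about 0.005
l1_far = l1_penalty(3.1) - l1_penalty(3.0)    # about 0.1
l1_near = l1_penalty(0.1) - l1_penalty(0.0)   # about 0.1
print(l2_far, l2_near, l1_far, l1_near)
```

Under L2 the same increase of 0.1 costs roughly sixty times more at θ=3 than at θ=0, so there is little pressure to push small parameters all the way to zero; under L1 the cost is the same everywhere.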
51 Efficiency of Optimization
- Both the L1 and L2 regularization terms are concave
- Because the log-likelihood is also concave, the resulting log-posterior is also concave
- It can be optimized using gradient-based methods
- Introduction of the penalty terms eliminates multiple equivalent optima
52 Choice of Hyper-parameters
- The regularization parameters, $\sigma^2$ in the L2 case and β in the L1 case, encode our beliefs that model weights should be close to zero
- The larger these parameters are, the broader our parameter prior
- The choice of prior has an effect on the learned model
- The standard method of selecting these parameters is cross-validation
More informationClustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning
Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades
More informationOptimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.
Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may
More informationLearning Bayesian network : Given structure and completely observed data
Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationMAP Examples. Sargur Srihari
MAP Examples Sargur srihari@cedar.buffalo.edu 1 Potts Model CRF for OCR Topics Image segmentation based on energy minimization 2 Examples of MAP Many interesting examples of MAP inference are instances
More informationBasic Sampling Methods
Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution
More informationLecture 21: Spectral Learning for Graphical Models
10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation
More informationUndirected Graphical Models: Markov Random Fields
Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationCRF for human beings
CRF for human beings Arne Skjærholt LNS seminar CRF for human beings LNS seminar 1 / 29 Let G = (V, E) be a graph such that Y = (Y v ) v V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional
More informationStatistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields
Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationLinear and logistic regression
Linear and logistic regression Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Linear and logistic regression 1/22 Outline 1 Linear regression 2 Logistic regression 3 Fisher discriminant analysis
More informationLatent Variable View of EM. Sargur Srihari
Latent Variable View of EM Sargur srihari@cedar.buffalo.edu 1 Examples of latent variables 1. Mixture Model Joint distribution is p(x,z) We don t have values for z 2. Hidden Markov Model A single time
More informationMultivariate Gaussians. Sargur Srihari
Multivariate Gaussians Sargur srihari@cedar.buffalo.edu 1 Topics 1. Multivariate Gaussian: Basic Parameterization 2. Covariance and Information Form 3. Operations on Gaussians 4. Independencies in Gaussians
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationBayesian Learning in Undirected Graphical Models
Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More information10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging
10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationPart 4: Conditional Random Fields
Part 4: Conditional Random Fields Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 1 / 39 Problem (Probabilistic Learning) Let d(y x) be the (unknown) true conditional distribution.
More informationMixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate
Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means
More informationExpectation Maximization Algorithm
Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters
More informationLecture 13: Structured Prediction
Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page
More informationRegularization in Neural Networks
Regularization in Neural Networks Sargur Srihari 1 Topics in Neural Network Regularization What is regularization? Methods 1. Determining optimal number of hidden units 2. Use of regularizer in error function
More informationCSE446: Clustering and EM Spring 2017
CSE446: Clustering and EM Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin, Dan Klein, and Luke Zettlemoyer Clustering systems: Unsupervised learning Clustering Detect patterns in unlabeled
More informationNeed for Sampling in Machine Learning. Sargur Srihari
Need for Sampling in Machine Learning Sargur srihari@cedar.buffalo.edu 1 Rationale for Sampling 1. ML methods model data with probability distributions E.g., p(x,y; θ) 2. Models are used to answer queries,
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationMixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate
Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationExpectation maximization tutorial
Expectation maximization tutorial Octavian Ganea November 18, 2016 1/1 Today Expectation - maximization algorithm Topic modelling 2/1 ML & MAP Observed data: X = {x 1, x 2... x N } 3/1 ML & MAP Observed
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network
More informationCS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning
CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationChapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)
HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter
More informationQualifying Exam in Machine Learning
Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationCS Lecture 13. More Maximum Likelihood
CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood
More informationProbabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April
Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference
More informationVariational Inference via Stochastic Backpropagation
Variational Inference via Stochastic Backpropagation Kai Fan February 27, 2016 Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary Outline Preliminaries Stochastic Backpropagation
More informationUsing Graphs to Describe Model Structure. Sargur N. Srihari
Using Graphs to Describe Model Structure Sargur N. srihari@cedar.buffalo.edu 1 Topics in Structured PGMs for Deep Learning 0. Overview 1. Challenge of Unstructured Modeling 2. Using graphs to describe
More information