
Learning Parameters of Undirected Models
Sargur Srihari, srihari@cedar.buffalo.edu

Topics
- Log-linear Parameterization
- Likelihood Function
- Maximum Likelihood Parameter Estimation
- Simple and Conjugate Gradient Ascent
- Conditionally Trained Models
- Learning with Missing Data: EM with Gradient Ascent
- Maximum Entropy and Maximum Likelihood
- Parameter Priors and Regularization

Markov Network Parameters
[Figure: a diamond network over A, B, C, D]
A Markov network is parameterized by potentials, or equivalently by energy functions:
$\phi(d) = \exp(-\epsilon(d)), \qquad \epsilon(d) = -\ln \phi(d)$

Log-linear MN Parameterization
- Log-linear form (potentials converted to energies by taking logs): the energy is a weighted sum of features
- Good both for hand-coded models and for learning
$P(X_1,\ldots,X_n) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(D_i) \right\}$
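A minimal sketch (not from the slides) of this parameterization in Python: three binary variables, two illustrative features, and $Z(\theta)$ computed by brute-force enumeration, which is feasible only for tiny networks.

```python
import itertools
import numpy as np

def features(x):
    """Two hypothetical features: f_1 fires when X0 == X1, f_2 when X1 == X2."""
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

def log_linear_prob(x, theta):
    """P(x; theta) = exp(sum_i theta_i f_i(x)) / Z(theta), Z by enumeration."""
    assignments = list(itertools.product([0, 1], repeat=3))
    Z = sum(np.exp(theta @ features(xi)) for xi in assignments)
    return np.exp(theta @ features(x)) / Z

theta = np.array([1.0, 0.5])
print(log_linear_prob((0, 0, 0), theta))
```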

Difference Between BN and MN Parameter Estimation
MN learning differs from BN learning because:
- BN: local normalization of CPDs, so the log-likelihood decomposes:
$\ln P(x) = \sum_{i=1}^{n} \ln P(x_i \mid \mathrm{pa}_i)$
- MN: global normalization via the partition function:
$P(X_1,\ldots,X_n) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(D_i) \right\}, \qquad Z(\theta) = \sum_{X_1,\ldots,X_n} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(D_i) \right\}$
- The global factor $Z(\theta)$ couples all the parameters, preventing decomposition
- This has significant computational ramifications: maximum likelihood parameter estimation has no closed-form solution, so iterative methods are needed

Example of a Coupled Likelihood Function
- Simple Markov network: a chain A - B - C with two potentials $\phi_1(A,B)$ and $\phi_2(B,C)$
- Log-likelihood of an instance $(a,b,c)$: $\ln P(a,b,c) = \ln\phi_1(a,b) + \ln\phi_2(b,c) - \ln Z$
- Log-likelihood of a data set $D$ with $M$ samples:
$\ell(\theta : D) = \sum_{m=1}^{M} \left( \ln\phi_1(a[m],b[m]) + \ln\phi_2(b[m],c[m]) - \ln Z(\theta) \right) = \sum_{a,b} M[a,b]\,\ln\phi_1(a,b) + \sum_{b,c} M[b,c]\,\ln\phi_2(b,c) - M \ln Z(\theta)$
- The first term involves only $\phi_1$ and the second only $\phi_2$, but the third term is the log partition function
$Z(\theta) = \sum_{a,b,c} \phi_1(a,b)\,\phi_2(b,c)$
which couples the two potentials in the likelihood function.

Illustration of Coupled Likelihood
[Figure: log-likelihood surface over the weights of features $f_1(a^1,b^1)$ and $f_2(b^0,c^1)$]
- The log-likelihood surface shows that when $\phi_1$ changes, the optimal $\phi_2$ also changes
- For this chain we can convert the MN to the BN $A \rightarrow B \rightarrow C$ and estimate the parameters as $\phi_1(A,B) = P(A)P(B \mid A)$, $\phi_2(B,C) = P(C \mid B)$
- In general, however, we cannot convert learned BN parameters into an equivalent MN: the optimal likelihood achievable by the two representations is not the same

Log-linear Representation
- Instead of the Gibbs parameterization, use the log-linear framework
- Joint distribution of $n$ variables $X_1,\ldots,X_n$ with $k$ features $F = \{ f_i(D_i) \}_{i=1,\ldots,k}$, where $D_i$ is the scope of feature $i$ (a sub-graph) and $f_i$ maps assignments of $D_i$ to $\mathbb{R}$:
$P(X_1,\ldots,X_n;\theta) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(D_i) \right\}$
- The parameters $\theta_i$ are weights we put on the features
- For a sample $\xi$, its feature values are $f_i(\xi(D_i))$, abbreviated $f_i(\xi)$
- The representation is general: it can capture Markov networks with both global and local structure

Example of Binary Features
[Figure: diamond network over A, B, C, D]
- Variables A, B, C, D, with a feature for every entry in every table
- The $f_i(D_i)$ are sixteen indicator functions defined over the clusters AB, BC, CD, DA, e.g.
$f_{a^0,b^0}(A,B) = I\{A = a^0\}\, I\{B = b^0\}$
with $Val(A) = \{a^0, a^1\}$ and $Val(B) = \{b^0, b^1\}$; that is, $f_{a^0,b^0} = 1$ if $a = a^0, b = b^0$ and $0$ otherwise, etc.
$P(X_1,\ldots,X_n;\theta) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(D_i) \right\}$
- With this representation $\theta_{a^0,b^0} = \ln \phi_1(a^0,b^0)$: the parameters $\theta$ are the log-potentials, the weights placed on the features

  A  | B  | $\phi_1(A,B)$
  a0 | b0 | $\phi_1(a^0,b^0)$
  a0 | b1 | $\phi_1(a^0,b^1)$
  a1 | b0 | $\phi_1(a^1,b^0)$
  a1 | b1 | $\phi_1(a^1,b^1)$
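A small sketch (not from the slides) of the table-to-feature correspondence: each table entry $\phi_1(a,b)$ becomes the weight $\theta_{a,b} = \ln \phi_1(a,b)$ on the indicator feature $I\{A=a, B=b\}$. The table values are illustrative.

```python
import numpy as np

phi1 = np.array([[2.0, 0.5],   # rows indexed by a in {a0, a1}
                 [1.0, 4.0]])  # cols indexed by b in {b0, b1}
theta = np.log(phi1)           # theta_{a,b} = ln phi_1(a, b)

def indicator(a, b, A, B):
    """The feature f_{a,b}(A, B) = I{A = a} I{B = b}."""
    return float(A == a and B == b)

# The exponentiated log-linear score reproduces the table entry:
A, B = 1, 1
score = sum(theta[a, b] * indicator(a, b, A, B)
            for a in (0, 1) for b in (0, 1))
print(np.exp(score), phi1[A, B])   # both 4.0
```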

Features in Text Analysis
- Named-entity recognition with BIO labels: Beginning, Inside, Other
- Two types of factors: $\phi_t^1(Y_t, Y_{t+1})$ and $\phi_t^2(Y_t, X_1,\ldots,X_T)$
- Factors are derived from a number of feature functions, e.g. $f_t(Y_t, X_t) = I(Y_t = \text{B-organization}, X_t = \text{"Times"})$
- Other features: whether the word is capitalized, whether it appears in a list of person names, whether it appears in an atlas of locations, whether it ends with "ton" or is exactly "York", whether the following word is "Times", whether the sentence has more than two sports-related words
- Hundreds of thousands of features in all
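A sketch of what such hand-coded binary feature functions might look like in Python, in the spirit of the examples above; the names and the tiny lexicon are illustrative only.

```python
PERSON_NAMES = {"john", "mary"}          # hypothetical lexicon

def f_capitalized(y_t, x_t):
    """Fires when the word is capitalized."""
    return float(x_t[0].isupper())

def f_b_org_times(y_t, x_t):
    """I(Y_t = B-organization, X_t = 'Times')."""
    return float(y_t == "B-organization" and x_t == "Times")

def f_in_person_list(y_t, x_t):
    """Fires when the word appears in a list of person names."""
    return float(x_t.lower() in PERSON_NAMES)

def f_ends_with_ton(y_t, x_t):
    """Fires when the word ends with 'ton'."""
    return float(x_t.endswith("ton"))

# Evaluate all features on one (label, word) pair:
fs = [f_capitalized, f_b_org_times, f_in_person_list, f_ends_with_ton]
print([f("B-organization", "Times") for f in fs])
```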

Log-Likelihood and Sufficient Statistics
- Joint probability distribution:
$P(X_1,\ldots,X_n;\theta) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(D_i) \right\}$
where $\theta = \{\theta_1,\ldots,\theta_k\}$ are the parameters and the $f_i$ are features over instantiations of the $D_i$
- Let $D$ be a data set of $M$ samples $\xi[m]$, $m = 1,\ldots,M$
- The log-likelihood is the sum of the log-probabilities of the samples in the data set:
$\ell(\theta : D) = \sum_i \theta_i \left( \sum_m f_i(\xi[m]) \right) - M \ln Z(\theta)$
$\frac{1}{M}\,\ell(\theta : D) = \sum_i \theta_i\, E_D[f_i(d_i)] - \ln Z(\theta)$
where $E_D[f_i(d_i)]$ is the average of $f_i$ in the data set. These averages are the sufficient statistics: the likelihood depends on the data only through them.

Properties of the Log-Likelihood
The log-likelihood is a sum of two terms:
$\ell(\theta : D) = \sum_i \theta_i \left( \sum_m f_i(\xi[m]) \right) - M \ln Z(\theta)$
- The first term is linear in the parameters: increasing the parameters increases it
- The second term balances the first, since the likelihood is upper-bounded by probability 1; here
$\ln Z(\theta) = \ln \sum_{\xi} \exp\left\{ \sum_i \theta_i f_i(\xi) \right\}$
- $\ln Z(\theta)$ is convex, as shown next, so its negative is concave
- The sum of a linear function and a concave function is concave, so there are no local optima and we can use gradient ascent

Proof that ln Z(θ) is Convex
- A convex function is bowl-shaped: $f(\vec{x})$ is convex if for every $0 \le \alpha \le 1$,
$f(\alpha \vec{x} + (1-\alpha)\vec{y}) \le \alpha f(\vec{x}) + (1-\alpha) f(\vec{y})$
- A function is convex if its Hessian is positive semi-definite
- The Hessian (second derivative) of $\ln Z(\theta) = \ln \sum_{\xi} \exp\left\{ \sum_i \theta_i f_i(\xi) \right\}$ is computed from the partial derivatives w.r.t. $\theta_i$ and $\theta_j$:
$\frac{\partial}{\partial \theta_i} \ln Z(\theta) = \frac{1}{Z(\theta)} \sum_{\xi} f_i(\xi) \exp\left\{ \sum_j \theta_j f_j(\xi) \right\} = E_\theta[f_i]$
$\frac{\partial^2}{\partial \theta_i \,\partial \theta_j} \ln Z(\theta) = \mathrm{Cov}_\theta[f_i; f_j]$
- Since the covariance matrix of the features is positive semi-definite, $\ln Z(\theta)$ is convex, and $-\ln Z(\theta)$ is a concave function of $\theta$
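A quick numerical sanity check of the first identity above, on the earlier toy model: the finite-difference gradient of $\ln Z(\theta)$ should agree with the expected feature vector $E_\theta[f]$. The model and features are the illustrative ones from the earlier sketch.

```python
import itertools
import numpy as np

assignments = list(itertools.product([0, 1], repeat=3))

def feats(x):
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

def log_Z(theta):
    return np.log(sum(np.exp(theta @ feats(x)) for x in assignments))

def expected_features(theta):
    """E_theta[f]: probability-weighted average of the feature vectors."""
    probs = np.array([np.exp(theta @ feats(x) - log_Z(theta)) for x in assignments])
    F = np.array([feats(x) for x in assignments])
    return probs @ F

theta = np.array([0.7, -0.3])
eps = 1e-6
numeric_grad = np.array([
    (log_Z(theta + eps * e) - log_Z(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)])
print(numeric_grad, expected_features(theta))   # the two should agree
```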

Non-unique Solution
- Since $\ln Z(\theta)$ is convex, $-\ln Z(\theta)$ is concave
- This implies the log-likelihood is unimodal: it has no local optima
- However, it does not imply uniqueness of the global optimum: multiple parameterizations can result in the same distribution
- A feature for every entry in the table is always redundant, e.g. $f_{a^0,b^0} = 1 - f_{a^0,b^1} - f_{a^1,b^0} - f_{a^1,b^1}$, giving a continuum of optimal parameterizations

Maximum Likelihood Parameter Estimation
- Task: estimate the parameters of a Markov network with a fixed structure, given a fully observable data set $D$
- The simplest variant of the problem is maximum likelihood parameter estimation
- The log-likelihood given features $F = \{ f_i \}_{i=1,\ldots,k}$ is
$\ell(\theta : D) = \sum_{i=1}^{k} \theta_i \left( \sum_m f_i(\xi[m]) \right) - M \ln Z(\theta)$

Gradient of the Log-Likelihood
- The gradient of the log-likelihood is zero at its maximum points
- The gradient of the average log-likelihood is
$\frac{1}{M} \frac{\partial}{\partial \theta_i} \ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$
- The first term is the average value of $f_i$ in the data $D$; the second is its expected value under the model distribution
- This provides a precise characterization of the maximum likelihood parameters $\hat\theta$:
$E_D[f_i(\chi)] = E_{\hat\theta}[f_i]$
i.e., the expected value of each feature relative to $P_{\hat\theta}$ matches its empirical expectation in $D$

Need for an Iterative Method
- Although the function is concave, there is no analytical form for the maximum
- Since there is no closed-form solution, we use iterative methods, e.g. gradient ascent, as shown next

Simple Gradient Ascent Procedure

  Gradient-Ascent(θ^1,        // initial starting point
                  f_obj,      // function to be optimized
                  δ)          // convergence threshold
  1  t ← 1
  2  do
  3      θ^(t+1) ← θ^t + η ∇f_obj(θ^t)
  4      t ← t + 1
  5  while ||θ^t − θ^(t−1)|| > δ
  6  return θ^t

Intuition: using Taylor's expansion in a neighborhood of $\theta_0$, the function can be approximated by the linear equation $f_{obj}(\theta) \approx f_{obj}(\theta_0) + (\theta - \theta_0)^T \nabla f_{obj}(\theta_0)$. The gradient $\nabla f_{obj}(\theta_0)$ points in the direction of steepest ascent; taking a step $\eta$ in this direction increases the value of $f_{obj}$.

Line search: adaptively choose $\eta$ at each step. Define a line in the direction of the gradient, $g(\eta) = \theta^t + \eta \nabla f_{obj}(\theta^t)$, and find three points $\eta_1 < \eta_2 < \eta_3$ such that $f_{obj}(g(\eta_2))$ is larger than at both others; then search within the bracket, e.g. $\eta = (\eta_1 + \eta_2)/2$.
[Figure, Brent's method: solid line is $f_{obj}(g(\eta))$; three points bracket its maximum; dashed line shows the quadratic fit]
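A direct Python transcription of the procedure above, as a sketch: it assumes a differentiable objective whose gradient `grad_f` is supplied, and uses a fixed step size `eta` in place of a line search.

```python
import numpy as np

def gradient_ascent(theta0, grad_f, eta=0.1, delta=1e-6, max_iter=10000):
    """Ascend grad_f from theta0 until the step size falls below delta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = theta + eta * grad_f(theta)   # step along the gradient
        if np.linalg.norm(theta_new - theta) <= delta:
            return theta_new
        theta = theta_new
    return theta

# Example: maximize f(x, y) = -(x^2 + 10 y^2); its gradient is (-2x, -20y).
opt = gradient_ascent([3.0, 1.0],
                      lambda t: np.array([-2 * t[0], -20 * t[1]]),
                      eta=0.05)
print(opt)   # converges toward the maximizer (0, 0)
```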

Conjugate Gradient Ascent
- In simple gradient ascent, two consecutive gradients are orthogonal: $\nabla f_{obj}(\theta^{t+1})$ is orthogonal to $\nabla f_{obj}(\theta^t)$
- Progress is therefore zig-zag, and progress toward the maximum slows down
[Figure: zig-zag ascent paths on a quadratic $f_{obj}(x,y) = -(x^2 + 10y^2)$ and an exponential $f_{obj}(x,y) = \exp[-(x^2 + 10y^2)]$]
- Solution: remember past directions and bias the new direction $h_t$ as a combination of the current gradient $g_t$ and the past direction $h_{t-1}$
- This gives faster convergence than simple gradient ascent

Computing the Gradient
- The gradient of the log-likelihood is $E_D[f_i(\chi)] - E_\theta[f_i]$
- It is the difference between the feature's empirical count in the data $D$ and its expected count relative to our current parameterization
- Computing the empirical count $E_D[f_i(\chi)]$ is easy: e.g., for the feature $f_{a^0,b^0}(a,b) = I\{a = a^0\}\, I\{b = b^0\}$, it is simply the empirical frequency in $D$ of the event $a^0, b^0$

Computing the Expected Count
- We need to compute probabilities of the form $P_{\theta^t}(a,b)$, since the expectation is a probability-weighted average
- Computing these probabilities requires running inference over the network
- Thus the computational cost of parameter estimation is very high, and simple gradient ascent is not efficient
- Much faster convergence is obtained with second-order methods based on the Hessian

A Standard Iterative Solution
1. Initialize $\theta$
2. Run inference (compute $Z(\theta)$ and the model expectations)
3. Compute the gradient of $\ell$: $E_D[f_i(\chi)] - E_\theta[f_i]$
4. Update $\theta$: $\theta^{t+1} \leftarrow \theta^t + \eta \nabla f_{obj}(\theta^t)$
5. If the optimum is not reached, go to step 2; otherwise stop
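A sketch of this loop on the earlier A-B-C toy chain, with brute-force enumeration standing in for a real inference engine (step 2); the indicator features follow the table parameterization above, and the data set and step size are illustrative.

```python
import itertools
import numpy as np

assignments = list(itertools.product([0, 1], repeat=3))

def feats(x):
    """One indicator per (A,B) and per (B,C) configuration: 8 features."""
    f = np.zeros(8)
    f[x[0] * 2 + x[1]] = 1.0          # phi_1(A,B) entries
    f[4 + x[1] * 2 + x[2]] = 1.0      # phi_2(B,C) entries
    return f

F = np.array([feats(x) for x in assignments])

def model_expectations(theta):
    """Inference step: E_theta[f_i], computing Z implicitly."""
    scores = F @ theta
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ F

data = [(0, 0, 0), (0, 1, 1), (1, 1, 1), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
empirical = np.mean([feats(x) for x in data], axis=0)   # E_D[f_i]

theta = np.zeros(8)                   # step 1: initialize
for _ in range(2000):
    grad = empirical - model_expectations(theta)        # steps 2-3
    theta += 0.5 * grad                                 # step 4
    if np.linalg.norm(grad) < 1e-6:                     # step 5
        break
print(model_expectations(theta))      # approximately matches `empirical`
```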

Iterative Methods for MRF Parameters
- Gradient ascent over the parameter space
- Good news: the likelihood function is concave, so convergence to the global optimum is guaranteed
- Bad news: each step needs inference, so even simple parameter estimation is intractable
- Bayesian parameter estimation is even harder: the integration must be done using MCMC

Newton's Method
- Newton's method finds zeroes of a function using derivatives; applied to the gradient, it is more efficient than simple gradient ascent
- Since we are solving for the zero of the derivative of $\ell(\theta : D)$, we need the second derivative (the Hessian)
- Quasi-Newton methods use an approximation to the Hessian

Hessian of the Log-Likelihood
- The likelihood has the form
$\ell(\theta : D) = \sum_{i} \theta_i \left( \sum_m f_i(\xi[m]) \right) - M \ln Z(\theta)$
- Its Hessian is
$\frac{\partial^2}{\partial \theta_i \,\partial \theta_j} \ell(\theta : D) = -M\, \mathrm{Cov}_\theta[f_i, f_j]$
- This requires the joint expectation of pairs of features, which is often computationally infeasible
- L-BFGS (a quasi-Newton algorithm) avoids computing the Hessian: it builds a low-memory approximation from past gradients and combines it with line search
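A sketch of L-BFGS applied to this problem, assuming SciPy is available: we minimize the negated (convex) negative log-likelihood, supplying only the gradient, never the Hessian. The toy model reuses the two illustrative features from earlier sketches.

```python
import itertools
import numpy as np
from scipy.optimize import minimize   # assumes SciPy is installed

assignments = list(itertools.product([0, 1], repeat=3))
feats = lambda x: np.array([float(x[0] == x[1]), float(x[1] == x[2])])
F = np.array([feats(x) for x in assignments])
data = [(0, 0, 1), (0, 1, 1), (1, 1, 1)]
empirical = np.mean([feats(x) for x in data], axis=0)
M = len(data)

def neg_ll_and_grad(theta):
    """Negative log-likelihood and its gradient (no Hessian needed)."""
    scores = F @ theta
    log_Z = np.logaddexp.reduce(scores)
    probs = np.exp(scores - log_Z)
    nll = -M * (theta @ empirical - log_Z)
    grad = -M * (empirical - probs @ F)
    return nll, grad

res = minimize(neg_ll_and_grad, np.zeros(2), jac=True, method="L-BFGS-B")
print(res.x)   # maximum likelihood parameters
```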

Conditionally Trained Models
- Often we want to perform an inference task where we have a known set of variables, or features, X, and we want to query a pre-determined set of variables Y
- We then prefer discriminative training: train the network as a Conditional Random Field (CRF) that encodes a conditional distribution P(Y | X)

CRF Training
- Train the network as a CRF that encodes a conditional distribution $P(Y \mid X)$
- The training set consists of pairs $D = \{ (y[m], x[m]) \}$, $m = 1,\ldots,M$ (e.g., word category and word), drawn from the joint distribution $P^*$ of observed pairs
- The objective function is the conditional log-likelihood
$\ell_{Y \mid X}(\theta : D) = \ln P(y[1,\ldots,M] \mid x[1,\ldots,M], \theta) = \sum_{m=1}^{M} \ln P(y[m] \mid x[m], \theta)$
i.e., in expectation, $E_{(x,y) \sim P^*}\left[ \ln P(y \mid x) \right]$
- It is concave since each term is concave

Gradient of the Conditional Likelihood
- A reduced MN is itself an MN; we use a log-linear representation with features $f_i$ and parameters $\theta$
- Analogous to the gradient for the full MN,
$\frac{1}{M} \frac{\partial}{\partial \theta_i} \ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$
we can write the gradient for the reduced MN:
$\frac{\partial}{\partial \theta_i} \ell_{Y \mid X}(\theta : D) = \sum_{m=1}^{M} \left[ f_i(y[m], x[m]) - E_\theta\left[ f_i \mid x[m] \right] \right]$
- The first term is the empirical count conditioned on $x[m]$; the second is based on running inference on each data case
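A sketch of this conditional gradient on a toy two-position chain: for each training pair, inference runs on the network reduced to $x[m]$. The feature definitions are hypothetical.

```python
import itertools
import numpy as np

def feats(y, x):
    """Pairwise feature on (Y1, Y2) plus per-position observation features."""
    return np.array([float(y[0] == y[1]),
                     float(y[0] == x[0]),
                     float(y[1] == x[1])])

def cond_expect(theta, x):
    """E_theta[f | x]: inference over Y in the network reduced to x."""
    ys = list(itertools.product([0, 1], repeat=2))
    scores = np.array([theta @ feats(y, x) for y in ys])
    probs = np.exp(scores - np.logaddexp.reduce(scores))   # P(y | x, theta)
    return probs @ np.array([feats(y, x) for y in ys])

def conditional_gradient(theta, pairs):
    """Sum over data cases of empirical minus conditional expected counts."""
    return sum(feats(y, x) - cond_expect(theta, x) for y, x in pairs)

pairs = [((0, 0), (0, 1)), ((1, 1), (1, 1))]   # toy (y[m], x[m]) pairs
print(conditional_gradient(np.zeros(3), pairs))
```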

Conditional Training is Simpler
[Figure: a CRF over $X_1,\ldots,X_5$ and $Y_1,\ldots,Y_5$; conditioning on X leaves a chain over Y]
- The full MN encodes
$\tilde P(X,Y) = \prod_{i=1}^{4} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{5} \phi_i(Y_i, X_1, X_2, X_3, X_4, X_5)$
- Edges disappear in a reduced Markov network: after conditioning on $X = x$,
$\tilde P(Y \mid x) = \prod_{i=1}^{4} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{5} \phi_i(Y_i, x_1, x_2, x_3, x_4, x_5)$
each reduced factor depends only on $Y_i$, so the remaining edges form a simple chain
- The next slide compares different sequential models (CRF, HMM, MEMM) w.r.t. parameter learning complexity

Learning Models for Sequence Labeling
- Given a sequence of observations $X = \{X_1,\ldots,X_k\}$, we need a joint label $Y = \{Y_1,\ldots,Y_k\}$
- Model trade-offs: expressive power vs. learnability

CRF (discriminative model): directly obtains $P(Y \mid X)$
$P(Y \mid X) = \frac{1}{Z(X)} \tilde P(Y, X), \qquad \tilde P(Y,X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \qquad Z(X) = \sum_Y \tilde P(Y, X)$
Note: $Z(X)$ is the marginal of the un-normalized measure. CRF training requires gradient-based learning, which is expensive and difficult with large data sets.

MEMM (conditional directed model): does not model the distribution over X
$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i, Y_{i-1})$
By d-separation, $Y_1 \perp X_2$ if $Y_2$ is not given; more generally, $Y_i \perp X_j$ for $j > i$. This is the label bias problem: a later observation has no effect on the posterior probability of the current state. E.g., in activity recognition in a video sequence where frames are labeled as running/walking, earlier frames may be blurry while later ones are clearer, yet the clear frames cannot correct the earlier labels.

HMM (generative model): needs the joint distribution $P(X,Y)$
$P(X, Y) = \prod_{i=1}^{k} P(X_i \mid Y_i)\, P(Y_i \mid Y_{i-1}), \qquad P(Y \mid X) = \frac{P(X,Y)}{P(X)}$

MEMM and HMM are more easily learned: as purely directed models, their parameters can be computed in closed form using maximum likelihood.

Parameter Estimation with Missing Data
- Examples of missing data: some data fields omitted or not collected, some hidden variables
- The learning problem has difficulties: parameters may not be identifiable, there is coupling between different parameters, and the likelihood is not concave (it has local maxima)
- Available methods: 1. gradient ascent (assuming data is missing at random); 2. expectation-maximization

Gradient Ascent with Missing Data
- Assume data is missing at random in the data set $D$
- In the $m$-th instance, let $o[m]$ be the observed entries and $h[m]$ the random variables that are missing
- The log-likelihood has the form
$\frac{1}{M} \ln P(D \mid \theta) = \frac{1}{M} \sum_{m=1}^{M} \ln \sum_{h[m]} P(o[m], h[m] \mid \theta)$
- The gradient for feature $f_i$ has the form
$\frac{1}{M} \frac{\partial}{\partial \theta_i} \ell(\theta : D) = \frac{1}{M} \sum_{m=1}^{M} E_{h[m] \sim P(h[m] \mid o[m], \theta)}[f_i] - E_\theta[f_i]$
- This is a difference between two expectations of $f_i$: the expectation over the data and hidden variables, and the expectation under the current distribution $P(\chi \mid \theta)$

Comparison with the Full Data Case
- With full data, the gradient of the log-likelihood is
$\frac{1}{M} \frac{\partial}{\partial \theta_i} \ell(\theta : D) = E_D[f_i(\chi)] - E_\theta[f_i]$
The first term is an aggregate over the data $D$; only for the second term do we need inference over the current distribution $P(\chi \mid \theta)$
- With missing data we have
$\frac{1}{M} \frac{\partial}{\partial \theta_i} \ell(\theta : D) = \frac{1}{M} \sum_{m=1}^{M} E_{h[m] \sim P(h[m] \mid o[m], \theta)}[f_i] - E_\theta[f_i]$
- We now have to run inference separately for every instance $m$, conditioning on $o[m]$, so the cost is much higher than with full data

Use of Expectation-Maximization
- As for any probabilistic model, an alternative method of parameter estimation with missing data is EM
- E-step: use the current parameters to estimate the missing values
- M-step: re-estimate the parameters
- For BNs, EM has significant advantages: the M-step has a closed-form solution
- For MNs, the answer is not so clear-cut: the M-step requires running inference multiple times, as seen next

EM for MN Parameter Learning
- E-step: use the current parameters $\theta^{(t)}$ to compute the expected sufficient statistics, i.e., the expected feature counts. At iteration $t$, the expected sufficient statistic for feature $f_i$ is
$\bar{M}_{\theta^{(t)}}[f_i] = \frac{1}{M} \sum_{m=1}^{M} E_{h[m] \sim P(h[m] \mid o[m], \theta^{(t)})}[f_i]$
- M-step: done using maximum likelihood parameter estimation (using gradient ascent!). This requires running inference multiple times, once for each iteration of the gradient ascent procedure. At step $k$ of this inner-loop optimization, we have a gradient of the form
$\bar{M}_{\theta^{(t)}}[f_i] - E_{\theta^{(t,k)}}[f_i]$
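A sketch of this EM scheme on a tiny log-linear model over $(A, B, C)$ in which $C$ is hidden in every instance: the E-step fills in expected feature counts, and the M-step runs the inner gradient ascent against those counts. Brute-force inference throughout; features and data are the illustrative ones used earlier.

```python
import itertools
import numpy as np

assignments = list(itertools.product([0, 1], repeat=3))
feats = lambda x: np.array([float(x[0] == x[1]), float(x[1] == x[2])])
F = np.array([feats(x) for x in assignments])

def joint_probs(theta):
    scores = F @ theta
    return np.exp(scores - np.logaddexp.reduce(scores))

def e_step(theta, observed):
    """Expected counts, with hidden C ~ P(C | A=a, B=b, theta) per instance."""
    counts = np.zeros(2)
    p = joint_probs(theta)
    for (a, b) in observed:
        idx = [i for i, x in enumerate(assignments) if x[0] == a and x[1] == b]
        w = p[idx] / p[idx].sum()          # posterior over the hidden C
        counts += w @ F[idx]
    return counts / len(observed)

def m_step(theta, target, eta=0.5, iters=500):
    """Inner gradient ascent: match E_theta[f] to the expected counts."""
    for _ in range(iters):
        theta = theta + eta * (target - joint_probs(theta) @ F)
    return theta

observed = [(0, 0), (0, 1), (1, 1)]        # only A and B are observed
theta = np.zeros(2)
for _ in range(20):                        # outer EM iterations
    theta = m_step(theta, e_step(theta, observed))
print(theta)
```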

Maximum Entropy and Maximum Likelihood
- Return to basic maximum likelihood estimation; an alternative formulation provides insight
- Consider the log-linear model
$P(X_1,\ldots,X_n;\theta) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(D_i) \right\}$
- We relate it to finding the distribution of maximum entropy subject to certain constraints

Motivation for Maximum Entropy
- Suppose we have some summary statistics of an empirical distribution, but no access to the full data set
- Ex 1: marginal distributions of single variables or certain pairs, or other events of interest, published in a census report
- Ex 2: the average final grade of students in a class, and the correlation of final grades with homework scores
- Goal: find a "typical" distribution that satisfies these constraints

Maximum Entropy Formulation
The goal is to select a distribution that (1) satisfies the given constraints and (2) has no additional structure or information. We formulate the following problem:
  Find $Q(\chi)$
  Maximizing $H_Q(\chi)$
  Subject to $E_Q[f_i] = E_D[f_i]$, $i = 1,\ldots,k$
These are called expectation constraints.

Solution to Maximum Entropy Estimation
Somewhat surprisingly, the maximum entropy distribution $Q^*$ satisfying the expectation constraints $E_Q[f_i] = E_D[f_i]$ is a Gibbs distribution:
$Q^* = P_{\hat\theta}, \qquad P_{\hat\theta}(\chi) = \frac{1}{Z(\hat\theta)} \exp\left\{ \sum_i \hat\theta_i f_i(\chi) \right\}$
where $\hat\theta$ is the maximum likelihood parameterization relative to $D$.

Duality of Maximum Entropy and Maximum Likelihood
The two problems
1. maximizing entropy subject to expectation constraints, and
2. maximizing likelihood given structural constraints on the distribution
are convex duals of each other. For the maximum likelihood parameters $\hat\theta$,
$H_{P_{\hat\theta}}(\chi) = -\frac{1}{M}\, \ell(\hat\theta : D)$

Parameter Priors and Regularization
- Maximum likelihood is prone to over-fitting
- We can introduce a prior distribution $P(\theta)$; but due to non-decomposability, a fully Bayesian approach is infeasible
- Instead, we perform MAP estimation: find the parameters that maximize $P(\theta) P(D \mid \theta)$, where $\ln P(D \mid \theta)$ expressed in log-space is
$\ell(\theta : D) = \sum_i \theta_i \left( \sum_m f_i(\xi[m]) \right) - M \ln Z(\theta)$
which in turn is derived from the joint distribution
$P(X_1,\ldots,X_n;\theta) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(D_i) \right\}$

Choice of Prior P(θ)
Only a few priors are used in practice:
- Gaussian (L2 regularization): a quadratic penalty on the weights
- Laplacian (L1 regularization)
Both priors penalize parameters whose magnitude (positive or negative) is large.

Gaussian Prior and L2-Regularization
- Most common is a Gaussian prior on the log-linear parameters $\theta$:
$P(\theta \mid \sigma^2) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{\theta_i^2}{2\sigma^2} \right\}$
for some choice of hyper-parameter $\sigma^2$
- Converting the MAP objective $P(\theta) P(D \mid \theta)$ to log-space gives $\ln P(\theta) + \ell(\theta : D)$, whose first term
$\ln P(\theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{k} \theta_i^2 + \text{const}$
is called an L2-regularization term
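In gradient form, the MAP objective simply adds the gradient of the Gaussian log-prior, $-\theta_i / \sigma^2$, to the likelihood gradient. A minimal sketch, where `likelihood_grad` is a stand-in for the gradient computed earlier and `sigma2` is the hyper-parameter discussed below:

```python
import numpy as np

def map_gradient(theta, likelihood_grad, sigma2=1.0):
    """Gradient of ln P(theta) + l(theta : D) under a Gaussian prior."""
    return likelihood_grad(theta) - theta / sigma2

# Example with a zero stand-in likelihood gradient: larger weights are
# pulled toward zero more strongly.
g = map_gradient(np.array([3.0, 0.1]),
                 lambda t: np.zeros_like(t), sigma2=1.0)
print(g)   # [-3.0, -0.1]
```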

Laplacian Prior and L1-Regularization
- A different prior used is the Laplacian:
$P_{\text{Laplacian}}(\theta) = \prod_{i=1}^{k} \frac{1}{2\beta} \exp\left\{ -\frac{|\theta_i|}{\beta} \right\}$
- Taking the log, we obtain the penalty term
$-\ln P(\theta) = \frac{1}{\beta} \sum_{i=1}^{k} |\theta_i| + \text{const}$
which we wish to minimize; this is generally called L1-regularization
- Both forms of regularization penalize parameters whose magnitude is large
[Figure: Laplacian distribution (β = 1) and Gaussian distribution (σ² = 1)]

Why Prefer Low-Magnitude Parameters?
- Properties of the prior: it pulls the distribution towards an uninformed one and smooths fluctuations in the data
- A distribution is smooth if the probabilities assigned to different assignments are not radically different
- Consider two assignments $\xi$ and $\xi'$ (an assignment is an instance of the variables $X_1,\ldots,X_n$); we consider the ratio of their probabilities next

Smoothness Resulting from Small θ
- Given two assignments $\xi$ and $\xi'$, their relative probability is
$\frac{P(\xi)}{P(\xi')} = \frac{\tilde P(\xi)/Z(\theta)}{\tilde P(\xi')/Z(\theta)} = \frac{\tilde P(\xi)}{\tilde P(\xi')}$
where the un-normalized probabilities are $\tilde P(\xi) = \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(\xi) \right\}$
- In log-space, the log-probability ratio is
$\ln \frac{P(\xi)}{P(\xi')} = \sum_{i=1}^{k} \theta_i f_i(\xi) - \sum_{i=1}^{k} \theta_i f_i(\xi') = \sum_{i=1}^{k} \theta_i \left( f_i(\xi) - f_i(\xi') \right)$
- When the $\theta_i$ have small magnitude, this log-ratio is also bounded, i.e., the probabilities are similar; this results in a smooth distribution

Comparison of L1 and L2 Regularization
- Both L1 and L2 penalize the magnitude of the parameters
- In the Gaussian case (L2), the penalty grows quadratically with the parameters: an increase in $\theta_i$ from 3 to 3.1 is penalized more than an increase from 0 to 0.1. This leads to many small, but nonzero, parameters
- In the Laplacian case (L1), the penalty grows linearly, which drives many parameters exactly to zero. This results in fewer edges in the learned model, which is therefore more tractable

Efficiency of Optimization
- Both the L1 and L2 regularization terms (the log-priors) are concave
- Because the log-likelihood is also concave, the resulting log-posterior is concave and can be optimized using gradient ascent methods
- The introduction of the penalty terms also eliminates the multiple equivalent maxima

Choice of Hyper-parameters
- The regularization parameters, $\sigma^2$ in the L2 case and $\beta$ in the L1 case, encode our beliefs about how close to zero the model weights should be
- The larger these parameters are, the broader our parameter prior
- The choice of prior has an effect on the learned model; the standard method of selecting these parameters is cross-validation
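A sketch of hyper-parameter selection by K-fold cross-validation. Here `train` and `heldout_ll` are hypothetical placeholders for MAP training at a given $\sigma^2$ and for held-out log-likelihood under the learned model; only the selection loop itself is shown.

```python
import numpy as np

def choose_sigma2(data, candidates, train, heldout_ll, k=5):
    """Return the sigma^2 candidate with the best cross-validated score."""
    folds = np.array_split(np.random.permutation(len(data)), k)
    best, best_score = None, -np.inf
    for sigma2 in candidates:
        score = 0.0
        for i in range(k):
            held = [data[j] for j in folds[i]]                      # validation fold
            rest = [data[j] for f in folds[:i] + folds[i+1:] for j in f]
            score += heldout_ll(train(rest, sigma2), held)          # placeholder calls
        if score > best_score:
            best, best_score = sigma2, score
    return best
```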