MIA - Master on Artificial Intelligence
Outline
- Introduction
- Unsupervised & semi-supervised approaches
- Supervised Algorithms
- Maximum Likelihood Estimation
- Maximum Entropy Modeling
Introduction
Paradigms
- Supervised: n-gram models. Parameter estimation: MLE & smoothing. Algorithms: Naive Bayes, Decision Trees, SVMs, AdaBoost, Perceptron, log-linear, ...
- Unsupervised and semi-supervised: Similarity models (clustering, EBL). Prediction models (Expectation Maximization, EM). Bootstrapping, co-training, active learning, ...
Other relevant considerations
- Batch vs. on-line ML algorithms
- Parameter tuning: train/development data
- Evaluation: test data, N-fold cross-validation, Precision/Recall/F1
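Since Precision/Recall/F1 recur throughout the course, a minimal Python sketch for a single class; the toy label lists are illustrative:

def precision_recall_f1(gold, pred, positive=1):
    """Compute precision, recall and F1 for one class from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if p == positive and g == positive)
    fp = sum(1 for g, p in zip(gold, pred) if p == positive and g != positive)
    fn = sum(1 for g, p in zip(gold, pred) if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: gold vs. predicted labels for a small test set
print(precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))   # approx. (0.667, 0.667, 0.667)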
Unsupervised & semi-supervised approaches
Clustering
[Figure: single-link clustering of 22 frequent English words (be, not, he, I, it, this, the, his, a, and, but, in, on, with, for, at, from, of, to, as, is, was), represented as a dendrogram.]
The EM algorithm
- Start with a guess for the values of your model parameters.
- Step E: Compute the distribution of the missing/latent data given the observed data and your current guess of the model parameters. Use this distribution to compute the expectation of the likelihood function with respect to the unobserved variables.
- Step M: Use the expected likelihood function, which has no unobserved variables, and maximize it as you would in the fully observed case, to get a new estimate of your model parameters.
- Repeat steps E-M until convergence (no further changes).
The EM algorithm - Example
Three coins with probabilities of heads $(\lambda, p_1, p_2)$.
- Hidden variable: coin 0 ($\lambda$), $Y = \{H, T\}$
- If $Y = H$: flip coin 1 ($p_1$) three times
- If $Y = T$: flip coin 2 ($p_2$) three times
- Observed sequences: $X = \{HHT, HTT, TTT, HHH\}$
The EM algorithm - Example
Start with a guessed model $\mu = (\lambda, p_1, p_2)$.
Step E - Expectation
Use the current model parameters $\mu$ to compute the probability distribution of the hidden data given the observations:
$P_\mu(H \mid x_i) = \frac{P_\mu(x_i, H)}{P_\mu(x_i)} \qquad P_\mu(T \mid x_i) = \frac{P_\mu(x_i, T)}{P_\mu(x_i)} \qquad \forall x_i \in X$
where $P_\mu(x_i, H)$, $P_\mu(x_i, T)$, and $P_\mu(x_i)$ are computed from the current model:
$P_\mu(HHT, H) = \lambda\, p_1^2 (1 - p_1)$
$P_\mu(HTT, T) = (1 - \lambda)\, p_2 (1 - p_2)^2$
... etc ...
$P_\mu(x_i) = P_\mu(x_i, H) + P_\mu(x_i, T) \qquad \forall x_i \in X$
Compute the expected number of occurrences of each hidden variable value:
$E[Y = H] = \sum_i P(H \mid x_i) \qquad E[Y = T] = \sum_i P(T \mid x_i)$
The EM algorithm - Example
Step M - Maximization
Use the expectations computed above to obtain new MLE estimates of the model parameters given the observations $X = \{HHT, HTT, TTT, HHH\}$:
$\lambda = \frac{E[Y = H]}{N}$
$p_1 = \frac{2\,P(H \mid HHT) + 1\,P(H \mid HTT) + 0\,P(H \mid TTT) + 3\,P(H \mid HHH)}{3\, E[Y = H]}$
$p_2 = \frac{2\,P(T \mid HHT) + 1\,P(T \mid HTT) + 0\,P(T \mid TTT) + 3\,P(T \mid HHH)}{3\, E[Y = T]}$
Each numerator is the expected number of heads attributed to that coin; the denominator is the expected number of flips of that coin (three per sequence).
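A minimal Python sketch of this E/M loop for the three-coin example; the initial guess for µ and the fixed iteration count are illustrative choices:

# EM for the three-coin model: coin 0 (lambda) decides whether coin 1 (p1)
# or coin 2 (p2) is flipped three times to produce each observed sequence.
X = ["HHT", "HTT", "TTT", "HHH"]

def seq_prob(x, p):
    """Probability of a 3-flip sequence under a coin with heads probability p."""
    h = x.count("H")
    return p ** h * (1 - p) ** (3 - h)

lam, p1, p2 = 0.3, 0.3, 0.6            # initial guess for mu = (lambda, p1, p2)
for _ in range(100):                   # or iterate until the parameters stop changing
    # E step: posterior P(Y=H | x_i) for each observed sequence
    post_h = []
    for x in X:
        joint_h = lam * seq_prob(x, p1)          # P(x_i, H)
        joint_t = (1 - lam) * seq_prob(x, p2)    # P(x_i, T)
        post_h.append(joint_h / (joint_h + joint_t))
    exp_h = sum(post_h)                          # E[Y = H]
    exp_t = len(X) - exp_h                       # E[Y = T]
    # M step: re-estimate parameters from the expected counts
    lam = exp_h / len(X)
    p1 = sum(q * x.count("H") for q, x in zip(post_h, X)) / (3 * exp_h)
    p2 = sum((1 - q) * x.count("H") for q, x in zip(post_h, X)) / (3 * exp_t)

print(lam, p1, p2)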
Bootstrapping: Self-training
Input:  L0, a (small) set of labeled examples
        U, a (large) set of unlabeled examples
Output: m, a learned model

T = L0                        // Start with a reduced set of labeled examples
while not convergence_achieved() do
    m = learn(T)              // Learn a model from the available labeled examples
    N = label(U, m)           // Use the learned model to label new examples
    N = filter(N, γ)          // Filter the newly labeled examples by a confidence threshold
    T = T ∪ N                 // Add the examples passing the filter to the training set
endwhile

Convergence may be defined as a fixed number of iterations, or as the point where performance on a development set no longer improves.
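A minimal Python sketch of this self-training loop, assuming a scikit-learn-style classifier with fit/predict_proba and numpy arrays; the threshold γ and the fixed iteration count are illustrative choices:

import numpy as np

def self_train(clf, X_lab, y_lab, X_unlab, gamma=0.95, max_iter=10):
    """Iteratively add confidently self-labeled examples to the training set."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(max_iter):                       # convergence: fixed number of iterations
        clf.fit(X_train, y_train)                   # learn a model from the labeled examples
        if len(pool) == 0:
            break
        probs = clf.predict_proba(pool)             # label the unlabeled examples
        keep = probs.max(axis=1) >= gamma           # filter by confidence threshold gamma
        if not keep.any():
            break
        X_train = np.vstack([X_train, pool[keep]])  # add confident examples to the training set
        y_train = np.concatenate([y_train, clf.classes_[probs[keep].argmax(axis=1)]])
        pool = pool[~keep]
    clf.fit(X_train, y_train)                       # final fit on all labeled data
    return clf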
Bootstrapping: Co-training
Input:  L0, a (small) set of labeled examples
        U, a (large) set of unlabeled examples
Output: m, a learned model

T = L0                        // Start with a reduced set of labeled examples
while not convergence_achieved() do
    m1 = learn(T, view1)      // Learn one model from each feature view
    m2 = learn(T, view2)
    n1 = label(U, m1)         // Use each learned model to label new examples
    n2 = label(U, m2)
    N = filter(n1, n2, γ)     // Filter the labeled examples by a confidence threshold
    T = T ∪ N                 // Add the new examples to the training set
endwhile
m = best(m1, m2)

Both views must be conditionally independent given the class, and each must be sufficient to learn the task.
Active learning
Input:  L0, a (small) set of labeled examples
        U, a (large) set of unlabeled examples
        oracle, a way to obtain the correct label for a given example
Output: m, a learned model

T = L0                        // Start with a reduced set of labeled examples
while not convergence_achieved() do
    m = learn(T)              // Learn a model from the available labeled examples
    N = label(U, m)           // Use the learned model to label new examples
    N = select(N)             // Select the best examples to be labeled
    N = oracle(N)             // Get the supervised label for the selected examples
    T = T ∪ N                 // Add the new examples to the training set
endwhile

Different measures are used for example selection: confidence of the model, error reduction, expected model change, ...
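A minimal Python sketch of pool-based active learning with least-confidence selection, again assuming a scikit-learn-style classifier; the oracle is any callable returning gold labels (e.g. a human annotator), and the batch size and number of rounds are illustrative:

import numpy as np

def active_learn(clf, X_lab, y_lab, X_pool, oracle, batch=5, rounds=10):
    """Repeatedly query the oracle for the examples the model is least sure about."""
    for _ in range(rounds):                        # convergence: fixed number of rounds
        clf.fit(X_lab, y_lab)                      # learn from the current labeled set
        if len(X_pool) == 0:
            break
        conf = clf.predict_proba(X_pool).max(axis=1)
        query = np.argsort(conf)[:batch]           # select the least-confident examples
        y_new = oracle(X_pool[query])              # ask the oracle for their labels
        X_lab = np.vstack([X_lab, X_pool[query]])
        y_lab = np.concatenate([y_lab, y_new])
        X_pool = np.delete(X_pool, query, axis=0)  # remove queried examples from the pool
    clf.fit(X_lab, y_lab)                          # final fit on all labeled data
    return clf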
Supervised Algorithms
Naive Bayes
- The simplest probabilistic classifier.
- NB generative model: the class $y$ is the parent of the feature nodes $x_1, x_2, x_3, \ldots, x_n$, where $x_i$ is the $i$-th feature of example $x$.
- Features are conditionally independent given the class $y$.
Naive Bayes (II)
Applying Bayes' rule:
$P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}$
posterior = (prior × likelihood) / evidence
$NB(x) = \arg\max_y \text{posterior} = \arg\max_y \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}$
$P(x_1, \ldots, x_n)$ is a constant and the features are conditionally independent given $y$, thus:
$NB(x) = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$
Naive Bayes (III)
Training a NB classifier consists of estimating two probability distributions, $P(y)$ and $P(x_i \mid y)$, from training data.
Maximum likelihood estimates:
$P(y) = \frac{\text{count}(y)}{\text{num. examples}} \qquad P(x_i \mid y) = \frac{\text{count}(x_i, y)}{\text{count}(y)}$
In practice, smoothing is needed.
NB is simple and can be trained from small datasets (robustness)... but the independence assumptions are not realistic.
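A minimal Python sketch of NB training and prediction over categorical features, with add-one (Laplace) smoothing of P(x_i | y); the toy data and function names are illustrative:

from collections import Counter, defaultdict
import math

def train_nb(examples):
    """examples: list of (features, label) pairs with categorical feature values."""
    class_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)          # (position, label) -> counts of values
    values = defaultdict(set)                   # position -> set of observed values
    for feats, label in examples:
        for i, v in enumerate(feats):
            feat_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, feat_counts, values, len(examples)

def predict_nb(model, feats):
    class_counts, feat_counts, values, n = model
    best, best_score = None, float("-inf")
    for y, cy in class_counts.items():
        score = math.log(cy / n)                # log P(y)
        for i, v in enumerate(feats):           # add log P(x_i | y), Laplace-smoothed
            score += math.log((feat_counts[(i, y)][v] + 1) / (cy + len(values[i])))
        if score > best_score:
            best, best_score = y, score
    return best

data = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes"), (("sunny", "mild"), "yes")]
model = train_nb(data)
print(predict_nb(model, ("sunny", "mild")))     # "yes"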
Decision Trees
- Feature selection (information gain, Gini diversity index, χ², ...)
- Stopping criterion
- Feature binarization, pruning, incremental learning, ...
Linear Classifiers
- Vector space in $\mathbb{R}^n$.
- Define a hyperplane with a weight vector $w$ and an offset (or threshold) $b$.
- Used as a classification rule:
$h(x) = \text{sign}(w \cdot x + b) = \begin{cases} +1 & \text{if } \sum_{i=1}^{n} x_i w_i + b > 0 \\ -1 & \text{otherwise} \end{cases}$
Linear Classifiers: Perceptron
Input:  training set {(x_i, y_i)}
Output: weight vector w (and offset b)

w = 0; b = 0
repeat
    for i = 1 to n do
        if y_i (w · x_i + b) ≤ 0 then     // example x_i is misclassified
            w = w + y_i x_i
            b = b + y_i
        endif
    endfor
until the fraction of misclassified examples (those with y_i (w · x_i + b) ≤ 0) falls below ε

- On-line learning algorithm.
- Additive, error-driven updating.
- Convergence is guaranteed if the training set is linearly separable.
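A minimal numpy sketch of the perceptron above; the toy dataset and the epoch-based stopping rule are illustrative:

import numpy as np

def perceptron(X, y, epochs=100):
    """Train a perceptron; X is (n, d), y contains +1/-1 labels."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi                    # additive, error-driven update
                b += yi
                errors += 1
        if errors == 0:                         # all examples correctly classified
            break
    return w, b

# Linearly separable toy data (AND-like problem with +1/-1 labels)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))   # [-1. -1. -1.  1.]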
Linear Classifiers: SVM
- Batch learning algorithm.
- Margin maximization: minimize $\|w\|$, subject to the constraints $y_i (w \cdot x_i + b) \ge 1$ for all $i$.
Linear Classifiers: Kernels
What if the training set is not linearly separable?
Linear Classifiers: Kernels
- Use a mapping function to project the data into a space where it is linearly separable.
- Computing all f(x) may be too costly, but we actually only need the dot products f(x) · f(y).
- Kernel functions compute K(x, y) = f(x) · f(y) efficiently.
Linear Classifiers: Kernels
- Identity (linear kernel): $K(x, y) = x \cdot y$
- Polynomial kernel: $K(x, y) = (x \cdot y + c)^d$
- Gaussian kernel (RBF): $K(x, y) = \exp(-\gamma \|x - y\|^2)$
- Sigmoid kernel: $K(x, y) = \tanh(\alpha (x \cdot y) + \beta)$
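The four kernels above as numpy functions; a minimal sketch, with illustrative default parameter values:

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, c=1.0, d=2):
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

def sigmoid_kernel(x, y, alpha=0.01, beta=0.0):
    return np.tanh(alpha * np.dot(x, y) + beta)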
Linear Classifiers: Kernels and the dual problem
To use a kernel, we need to formulate the classifier in dual form, i.e. in terms of dot products between examples.
Example: Perceptron. Classification rule: $\hat{y} = \text{sgn}(w \cdot x + b)$
Because of the update steps $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$, we get:
$w = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad b = \sum_{i=1}^{n} \alpha_i y_i$
where $\alpha_i$ is the number of times $x_i$ has been misclassified.
Linear Classifiers: Kernels and the dual problem
Then, we can compute the perceptron prediction as:
$\hat{y} = \text{sgn}\Big(\big(\sum_{i=1}^{n} \alpha_i y_i x_i\big) \cdot x + \sum_{i=1}^{n} \alpha_i y_i\Big) = \text{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) + \sum_{i=1}^{n} \alpha_i y_i\Big) = \text{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x + 1)\Big)$
Once the problem is formulated in terms of similarities (dot products) between examples, we can introduce the kernel:
$\hat{y} = \text{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (K(x_i, x) + 1)\Big)$
Note that for $K(x, y) = x \cdot y$, this formulation is equivalent to the original perceptron.
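A minimal Python sketch of the dual (kernel) perceptron derived above: α_i counts how often x_i has been misclassified, and both training and prediction use only kernel evaluations. The RBF kernel and the XOR toy data are illustrative:

import numpy as np

def kernel_perceptron(X, y, kernel, epochs=100):
    """Dual perceptron: alpha[i] counts how often x_i has been misclassified."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        errors = 0
        for j in range(n):
            # dual-form prediction for x_j: sum_i alpha_i y_i (K(x_i, x_j) + 1)
            s = sum(alpha[i] * y[i] * (kernel(X[i], X[j]) + 1) for i in range(n))
            if y[j] * s <= 0:
                alpha[j] += 1          # equivalent to w += y_j x_j in the primal
                errors += 1
        if errors == 0:
            break
    return alpha

def predict(X_train, y_train, alpha, kernel, x):
    s = sum(alpha[i] * y_train[i] * (kernel(X_train[i], x) + 1) for i in range(len(X_train)))
    return 1 if s > 0 else -1

# XOR is not linearly separable, but an RBF kernel handles it
rbf = lambda a, b: np.exp(-2.0 * np.linalg.norm(a - b) ** 2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
alpha = kernel_perceptron(X, y, rbf)
print([predict(X, y, alpha, rbf, x) for x in X])   # [-1, 1, 1, -1]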
Maximum Likelihood Estimation
MLE & Smoothing
- Estimate the probability of the target feature based on observed data.
- The prediction task can be reduced to obtaining good estimates of the conditional distribution:
$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$
- MLE (Maximum Likelihood Estimation):
$P_{MLE}(x) = \frac{\text{count}(x)}{N} \qquad P_{MLE}(y \mid x) = \frac{\text{count}(x, y)}{\text{count}(x)}$
- No probability mass is left for unseen events: unsuitable for NLP (data sparseness, Zipf's law).
Smoothing 1 - Adding Counts
Laplace's law (adding one):
$P_{LAP}(x) = \frac{\text{count}(x) + 1}{N + B}$
For large values of $B$, too much probability mass is assigned to unseen events.
Lidstone's law:
$P_{LID}(x) = \frac{\text{count}(x) + \lambda}{N + B\lambda}$
Usually $\lambda = 0.5$ (Expected Likelihood Estimation). Equivalent to a linear interpolation between MLE and a uniform prior, with $\mu = N/(N + B\lambda)$:
$P_{LID}(x) = \mu \frac{\text{count}(x)}{N} + (1 - \mu) \frac{1}{B}$
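A minimal Python sketch of these estimators over a table of observed counts; B (the number of possible events) and the toy counts are illustrative:

from collections import Counter

def p_mle(counts, x, N):
    return counts[x] / N

def p_laplace(counts, x, N, B):
    """Laplace's law: add one to every count; B = number of possible events."""
    return (counts[x] + 1) / (N + B)

def p_lidstone(counts, x, N, B, lam=0.5):
    """Lidstone's law; lam = 0.5 gives Expected Likelihood Estimation."""
    return (counts[x] + lam) / (N + B * lam)

counts = Counter({"the": 5, "dog": 2, "barks": 1})   # toy unigram counts
N, B = sum(counts.values()), 10000                   # N observations, B-word vocabulary
print(p_mle(counts, "cat", N), p_laplace(counts, "cat", N, B), p_lidstone(counts, "cat", N, B))
# unseen word: MLE gives 0, the smoothed estimates give a small non-zero probability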
Smoothing 2 - Discounting Counts
Absolute discounting:
$P_{ABS}(x) = \begin{cases} \frac{\text{count}(x) - \delta}{N} & \text{if count}(x) > 0 \\ \frac{(B - N_0)\,\delta}{N_0\, N} & \text{otherwise} \end{cases}$
Linear discounting:
$P_{LIN}(x) = \begin{cases} \frac{(1 - \alpha)\,\text{count}(x)}{N} & \text{if count}(x) > 0 \\ \frac{\alpha}{N_0} & \text{otherwise} \end{cases}$
where $B$ is the number of possible events and $N_0$ the number of unseen events.
Maximum Entropy Modeling
Maximum Entropy / Log-linear Models
- Maximum Entropy: an alternative estimation technique, able to deal with different kinds of evidence.
- ME principle: do not assume anything about non-observed events. Find the most uniform (maximum entropy, least informed) probability distribution that matches the observations.
Example (translating English prepositions to French):

Observations:
p(a, b)   dans   en    à
in          ?     ?    0.3
on          ?     ?     ?
total      0.4               (sum = 1.0)

One possible p(a, b):
p(a, b)   dans   en    à
in         0.1   0.0   0.3
on         0.3   0.2   0.1
total      0.4               (sum = 1.0)

Maximum entropy p(a, b):
p(a, b)   dans   en    à
in         0.2   0.1   0.3
on         0.2   0.1   0.1
total      0.4               (sum = 1.0)
ME Modeling
- Observed facts are constraints on the desired model $p$.
- Constraints take the form of feature functions: $f_i : \varepsilon \rightarrow \{0, 1\}$
- The desired model must satisfy the constraints:
$\sum_{x \in \varepsilon} p(x) f_i(x) = \sum_{x \in \varepsilon} \tilde{p}(x) f_i(x) \qquad \forall i$
that is, the expectation of each $f_i$ according to the model matches the observed (empirical, $\tilde{p}$) expectation of $f_i$.
ME Modeling Example
Example: $\varepsilon = \{in, on\} \times \{dans, en, à\}$

p(a, b)   dans   en    à
in          ?     ?     ?
on          ?     ?     ?
total      0.4               (sum = 1.0)

Observed fact: $p(in, dans) + p(on, dans) = 0.4$
Encoded as a constraint: $E_p(f_1) = 0.4$, where:
$f_1(a, b) = \begin{cases} 1 & \text{if } b = dans \\ 0 & \text{otherwise} \end{cases}$
$E_p(f_1) = \sum_{(a,b) \in \varepsilon} p(a, b)\, f_1(a, b)$
ME Probability Model
There is an infinite set $P$ of probability models consistent with the observations. We want to compute the maximum entropy model:
$p^* = \arg\max_{p \in P} H(p) \qquad H(p) = -\sum_{x \in \varepsilon} p(x) \log p(x)$
Parameter Estimation
Example: maximum entropy model for translating prepositions from English to French.

No constraints:
P(a, b)   dans    en      à
in        0.167   0.167   0.167
on        0.167   0.167   0.167
total                             1.0

With constraint p(dans) + p(en) = 0.4:
P(a, b)   dans   en    à
in         0.1   0.1   0.3
on         0.1   0.1   0.3
total      0.4               (sum = 1.0)

With constraints p(dans) + p(en) = 0.4 and p(in) = 0.6... Not so easy!
Parameter estimation
Exponential models (Lagrange multipliers optimization):
$p(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{k} \alpha_j^{f_j(a,b)}, \quad \alpha_j > 0 \qquad Z(b) = \sum_{a} \prod_{i=1}^{k} \alpha_i^{f_i(a,b)}$
also formulated as:
$p(a \mid b) = \frac{1}{Z(b)} \exp\Big(\sum_{j=1}^{k} \lambda_j f_j(a, b)\Big), \quad \lambda_i = \ln \alpha_i$
Each model parameter weights the influence of a feature.
Several algorithms exist to compute the optimal parameters: GIS, IIS, L-BFGS, ...
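A minimal Python sketch of evaluating the exponential form p(a | b) = (1/Z(b)) exp(Σ_j λ_j f_j(a, b)); the feature functions and weights below are illustrative placeholders, not parameters estimated with GIS/IIS:

import math

def p_cond(a, b, classes, features, lambdas):
    """Log-linear p(a | b): features is a list of functions f_j(a, b) -> {0, 1}."""
    def score(a_):
        return math.exp(sum(l * f(a_, b) for l, f in zip(lambdas, features)))
    z = sum(score(a_) for a_ in classes)     # partition function Z(b)
    return score(a) / z

# Toy example: translating a preposition given some context b
classes = ["dans", "en", "à"]
features = [lambda a, b: 1 if a == "dans" else 0,
            lambda a, b: 1 if a == "à" and b.endswith("city") else 0]
lambdas = [0.5, 1.2]
print({a: round(p_cond(a, "to the city", classes, features, lambdas), 3) for a in classes})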
Improved Iterative Scaling (IIS)
Input:  feature functions $f_1 \ldots f_n$, empirical distribution $\tilde{p}(f_i)$
Output: $\lambda_i$, the parameters of the optimal model $p^*$

Start with $\lambda_i = 0$ for all $i \in \{1 \ldots n\}$
Repeat
    For each $i \in \{1 \ldots n\}$ do
        let $\Delta\lambda_i$ be the solution to
        $\sum_{a,b} \tilde{p}(b)\, p(a \mid b)\, f_i(a, b)\, \exp\Big(\Delta\lambda_i \sum_{j=1}^{n} f_j(a, b)\Big) = \tilde{p}(f_i)$
        $\lambda_i \leftarrow \lambda_i + \Delta\lambda_i$
    end for
Until all the $\lambda_i$ have converged
Application to NLP Tasks
- Speech processing (Rosenfeld 94)
- Translation (Brown et al 90)
- Morphology (Della Pietra et al. 95)
- Sentence boundary detection (Reynar & Ratnaparkhi 97)
- PP-attachment (Ratnaparkhi et al 94)
- PoS Tagging (Ratnaparkhi 96, Black et al 99)
- Partial Parsing (Skut & Brants 98)
- Full Parsing (Ratnaparkhi 97, Ratnaparkhi 99)
- Text Categorization (Nigam et al 99)
PoS Tagging (Ratnaparkhi 96)
Probabilistic model over $H \times T$, with histories
$h_i = (w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2})$
Example feature:
$f_j(h_i, t) = \begin{cases} 1 & \text{if suffix}(w_i) = \text{"ing" and } t = \text{VBG} \\ 0 & \text{otherwise} \end{cases}$
Compute $p^*(h, t)$ using GIS:
$p(t \mid h) = \frac{\exp\big(\sum_j \lambda_j f_j(h, t)\big)}{Z(h)}$
Disambiguation algorithm: beam search over
$\arg\max_{t_1 \ldots t_n} p(t_1 \ldots t_n \mid w_1 \ldots w_n) = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} p(t_i \mid h_i)$
Text Categorization (Nigam et al 99)
Probabilistic model over $W \times C$, where a document is $d = (w_1, w_2, \ldots, w_N)$.
Features:
$f_{w,c'}(d, c) = \begin{cases} \frac{N(d, w)}{N(d)} & \text{if } c = c' \\ 0 & \text{otherwise} \end{cases}$
Compute $p^*(c \mid d)$ using IIS.
Disambiguation algorithm: select the class with the highest probability:
$\arg\max_{c} P(c \mid d) = \arg\max_{c} \frac{\exp\big(\sum_i \lambda_i f_i(d, c)\big)}{Z(d)} = \arg\max_{c} \sum_i \lambda_i f_i(d, c)$
Sentence Boundaries (Reynar and Ratnaparkhi 97)
Feature templates:
1. The prefix
2. The suffix
3. The previous word
4. The next word
5. Whether the prefix or suffix is in the Abbreviations list
6. Whether the previous or next word is in the Abbreviations list
Example: < b=no punc=. pref=mr suff= prev=2010. next=Wayne >
Two classes: y and n.
Disambiguation algorithm: select the class with the highest probability:
$\arg\max_{c} P(c \mid d) = \arg\max_{c} \frac{\exp\big(\sum_i \lambda_i f_i(d, c)\big)}{Z(d)} = \arg\max_{c} \sum_i \lambda_i f_i(d, c)$