MIA - Master on Artificial Intelligence

Outline: Introduction; Unsupervised & semi-supervised approaches; Supervised Algorithms; Maximum Likelihood Estimation; Maximum Entropy Modeling

Introduction

Paradigms
Supervised: n-gram models; parameter estimation (MLE & smoothing); algorithms: Naive Bayes, Decision Trees, SVMs, AdaBoost, Perceptron, log-linear, ...
Unsupervised and semi-supervised: similarity models (clustering, EBL); prediction models (Expectation Maximization, EM); bootstrapping; co-training; active learning; ...

Other relevant considerations
Batch vs. on-line ML algorithms. Parameter tuning: train/development data. Evaluation: test data, N-fold cross-validation, Precision/Recall/F1.
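A small reference implementation of the evaluation measures mentioned above (illustrative only; tp, fp, fn are the true positive, false positive, and false negative counts):

    def precision_recall_f1(tp, fp, fn):
        # Precision: fraction of predicted positives that are correct.
        precision = tp / (tp + fp) if tp + fp else 0.0
        # Recall: fraction of actual positives that were retrieved.
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1: harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1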

Unsupervised & semi-supervised approaches

Clustering
Figure: single-link clustering of 22 frequent English words, represented as a dendrogram (leaf order: be, not, he, I, it, this, the, his, a, and, but, in, on, with, for, at, from, of, to, as, is, was).

The EM algorithm
Start with a guess for the values of your model parameters.
E step: compute the distribution of the missing/latent data given the observed data and the current guess of the model parameters. Use this distribution to compute the expectation of the likelihood function with respect to the unobserved variables.
M step: maximize the expected likelihood function (which no longer contains unobserved variables), as you would in the fully observed case, to get a new estimate of your model parameters.
Repeat the E and M steps until convergence (no further changes).

The EM algorithm - Example
Three coins with probabilities of heads $(\lambda, p_1, p_2)$. The hidden variable is the outcome of coin 0 ($\lambda$): $Y = \{H, T\}$. If $Y = H$, flip coin 1 ($p_1$) three times; if $Y = T$, flip coin 2 ($p_2$) three times. Observed sequences: $X = \{HHT, HTT, TTT, HHH\}$.

The EM algorithm - Example
Start with a guess model $\mu = (\lambda, p_1, p_2)$.
Step E (Expectation): use the current model parameters $\mu$ to compute the probability distribution of the hidden data given the observations:
$$P_\mu(H \mid x_i) = \frac{P_\mu(x_i, H)}{P_\mu(x_i)}, \qquad P_\mu(T \mid x_i) = \frac{P_\mu(x_i, T)}{P_\mu(x_i)} \qquad \forall x_i \in X$$
where $P_\mu(x_i, H)$, $P_\mu(x_i, T)$, and $P_\mu(x_i)$ are computed from the current model:
$$P_\mu(HHT, H) = \lambda\, p_1^2 (1 - p_1), \qquad P_\mu(HTT, T) = (1 - \lambda)\, p_2 (1 - p_2)^2, \quad \text{etc.}$$
$$P_\mu(x_i) = P_\mu(x_i, H) + P_\mu(x_i, T) \qquad \forall x_i \in X$$
Compute the expected number of occurrences of each hidden variable value:
$$E[Y = H] = \sum_i P_\mu(H \mid x_i), \qquad E[Y = T] = \sum_i P_\mu(T \mid x_i)$$

The EM algorithm - Example
Step M (Maximization): use the expectations computed above to obtain new MLE estimates of the model parameters given the observations $X = \{HHT, HTT, TTT, HHH\}$:
$$\lambda = \frac{E[Y = H]}{N}$$
$$p_1 = \frac{2\, P_\mu(H \mid HHT) + 1\, P_\mu(H \mid HTT) + 0\, P_\mu(H \mid TTT) + 3\, P_\mu(H \mid HHH)}{3\, E[Y = H]}$$
$$p_2 = \frac{2\, P_\mu(T \mid HHT) + 1\, P_\mu(T \mid HTT) + 0\, P_\mu(T \mid TTT) + 3\, P_\mu(T \mid HHH)}{3\, E[Y = T]}$$
(each numerator is the expected number of heads attributed to the corresponding coin; the factor 3 is the number of flips per sequence).
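A minimal Python sketch of this E/M loop for the three-coin example (illustrative: the initial guess and the fixed iteration count are arbitrary choices, and each observed sequence is assumed to contain exactly three flips, as above):

    X = ["HHT", "HTT", "TTT", "HHH"]        # observed sequences
    lam, p1, p2 = 0.4, 0.6, 0.5             # initial guess mu = (lambda, p1, p2)

    def seq_prob(p, s):
        """Probability of a heads/tails sequence under a coin with P(H) = p."""
        h = s.count("H")
        return p ** h * (1 - p) ** (len(s) - h)

    for _ in range(50):                     # repeat E and M steps
        # E step: posterior of the hidden coin for each observed sequence
        post_H = []
        for s in X:
            joint_H = lam * seq_prob(p1, s)         # P_mu(x_i, H)
            joint_T = (1 - lam) * seq_prob(p2, s)   # P_mu(x_i, T)
            post_H.append(joint_H / (joint_H + joint_T))
        E_H = sum(post_H)                   # expected occurrences of Y = H
        E_T = len(X) - E_H                  # expected occurrences of Y = T
        # M step: re-estimate parameters from the expected counts
        lam = E_H / len(X)
        p1 = sum(ph * s.count("H") for ph, s in zip(post_H, X)) / (3 * E_H)
        p2 = sum((1 - ph) * s.count("H") for ph, s in zip(post_H, X)) / (3 * E_T)

    print(lam, p1, p2)                      # converged parameter estimates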

Bootstrapping: Self-training
Input: L_0, a (small) set of labeled examples; U, a (large) set of unlabeled examples
Output: m, a learned model

T = L_0                              // start with a reduced set of labeled examples
while not convergence_achieved() do
    m = learn(T)                     // learn a model from the available labeled examples
    n = label(U, m)                  // use the learned model to label new examples
    n = filter(n, γ)                 // filter the labeled examples by a confidence threshold
    T = T ∪ n                        // add the examples passing the filter to the training set
endwhile

Convergence may be defined as a fixed number of iterations, or as the point where performance on a development set stops improving.

Bootstrapping: Co-training
Input: L_0, a (small) set of labeled examples; U, a (large) set of unlabeled examples
Output: m, a learned model

T = L_0                              // start with a reduced set of labeled examples
while not convergence_achieved() do
    m_1 = learn(T, view_1)           // learn one model from each view of the labeled examples
    m_2 = learn(T, view_2)
    n_1 = label(U, m_1)              // use each learned model to label new examples
    n_2 = label(U, m_2)
    n = filter(n_1, n_2, γ)          // filter the labeled examples by a confidence threshold
    T = T ∪ n                        // add the new examples to the training set
endwhile
m = best(m_1, m_2)

Both views must be conditionally independent and sufficient.

Active learning
Input: L_0, a (small) set of labeled examples; U, a (large) set of unlabeled examples; oracle, a way to obtain the expected label for a given example
Output: m, a learned model

T = L_0                              // start with a reduced set of labeled examples
while not convergence_achieved() do
    m = learn(T)                     // learn a model from the available labeled examples
    n = label(U, m)                  // use the learned model to label new examples
    n = select(n)                    // select the best examples to be labeled
    n = oracle(n)                    // get the supervised label for the selected examples
    T = T ∪ n                        // add the new examples to the training set
endwhile

Different measures are used for example selection: confidence of the model, error reduction, expected model change, ...
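A common choice for the select step is uncertainty sampling based on the model's confidence. The sketch below assumes a scikit-learn-style model exposing predict_proba; the function and parameter names are illustrative:

    import numpy as np

    def select(unlabeled_X, model, k=10):
        """Pick the k unlabeled examples the current model is least confident about."""
        probs = model.predict_proba(unlabeled_X)   # shape (n_examples, n_classes)
        confidence = probs.max(axis=1)             # confidence = highest class probability
        return np.argsort(confidence)[:k]          # indices of the k least confident examples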

Supervised Algorithms

Naive Bayes
The simplest probabilistic classifier. NB generative model: the class $y$ generates the features $x_1, x_2, \ldots, x_n$, where $x_i$ is the $i$-th feature of example $x$. Features are conditionally independent given the class $y$.

Naive Bayes (II)
Applying Bayes' rule:
$$P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}$$
i.e. posterior = (prior × likelihood) / evidence.
$$NB(x) = \operatorname*{argmax}_y \text{posterior} = \operatorname*{argmax}_y \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}$$
$P(x_1, \ldots, x_n)$ is a constant and the features are conditionally independent given $y$, thus:
$$NB(x) = \operatorname*{argmax}_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

Naive Bayes (III)
Training an NB classifier consists of estimating two probability distributions, $P(y)$ and $P(x_i \mid y)$, from training data. Maximum likelihood estimates:
$$P(y) = \frac{\operatorname{count}(y)}{\text{num. examples}} \qquad P(x_i \mid y) = \frac{\operatorname{count}(x_i, y)}{\operatorname{count}(y)}$$
In practice, smoothing is needed.
NB is simple and can be trained from small datasets (robustness)... but the independence assumptions are not realistic.
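A minimal count-based sketch of these estimates, with add-one smoothing for $P(x_i \mid y)$ (illustrative: it assumes categorical feature values, and the single vocab_size smoothing denominator is a simplifying assumption rather than a per-feature value count):

    from collections import Counter, defaultdict
    import math

    def train_nb(examples):
        """examples: iterable of (features, label) pairs with categorical feature values."""
        class_counts = Counter()
        feat_counts = defaultdict(Counter)      # feat_counts[y][(i, value)]
        for x, y in examples:
            class_counts[y] += 1
            for i, v in enumerate(x):
                feat_counts[y][(i, v)] += 1
        return class_counts, feat_counts

    def predict_nb(x, class_counts, feat_counts, vocab_size):
        n = sum(class_counts.values())
        best, best_score = None, float("-inf")
        for y, cy in class_counts.items():
            score = math.log(cy / n)            # log P(y)
            for i, v in enumerate(x):
                # add-one smoothed log P(x_i | y)
                score += math.log((feat_counts[y][(i, v)] + 1) / (cy + vocab_size))
            if score > best_score:
                best, best_score = y, score
        return best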

Decision Trees
Feature selection (information gain, Gini diversity index, χ², ...). Stopping criterion. Feature binarization, pruning, incremental learning, ...

Linear Classifiers
Vector space in $\mathbb{R}^n$. Define a hyperplane with a weight vector $w$ and an offset (or threshold) $b$, used as a classification rule:
$$h(x) = \operatorname{sign}(w \cdot x + b) = \begin{cases} +1 & \text{if } \sum_{i=1}^{n} x_i w_i + b > 0 \\ -1 & \text{otherwise} \end{cases}$$

Linear Classifiers: Perceptron
Input: training set $\{(x_i, y_i)\}$
Output: weight vector $w$ (and bias $b$)

w = 0; b = 0
repeat
    for i = 1 to n do
        if y_i (w · x_i + b) ≤ 0 then
            w = w + y_i x_i
            b = b + y_i
        endif
    endfor
until the average training error (fraction of examples with y_i (w · x_i + b) ≤ 0) is below ε

On-line learning algorithm with additive, error-driven updates. Convergence is guaranteed if the training set is linearly separable.
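A compact numpy version of this loop (illustrative: labels are assumed to be in {-1, +1}, and a maximum number of epochs replaces the ε-based stopping test):

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        """X: (n, d) array of examples, y: array of labels in {-1, +1}."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for x_i, y_i in zip(X, y):
                if y_i * (np.dot(w, x_i) + b) <= 0:   # misclassified (or on the boundary)
                    w += y_i * x_i                    # additive, error-driven update
                    b += y_i
                    mistakes += 1
            if mistakes == 0:                         # converged: training set separated
                break
        return w, b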

Linear Classifiers: SVM
Batch learning algorithm. Margin maximization: minimize $\|w\|$, subject to the constraints $y_i (w \cdot x_i + b) \geq 1 \;\; \forall i$.

Linear Classifiers: Kernels
What if the training set is not linearly separable?

Linear Classifiers: Kernels
A mapping function $f$ is used to make the data linearly separable in the new feature space. Computing all $f(x)$ is too costly, but we actually only need the dot products $f(x) \cdot f(y)$. Kernel functions compute $K(x, y) = f(x) \cdot f(y)$ efficiently.

Linear Classifiers: Kernels
Identity (linear kernel): $K(x, y) = x \cdot y$
Polynomial kernel: $K(x, y) = (x \cdot y + c)^d$
Gaussian kernel (RBF): $K(x, y) = \exp(-\gamma \|x - y\|^2)$
Sigmoid kernel: $K(x, y) = \tanh(\alpha (x \cdot y) + \beta)$
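A small numpy check of the kernel idea for the degree-2 polynomial kernel: the kernel value equals the dot product of an explicit feature map, so the map never has to be built in practice (illustrative code; phi is the explicit map for 2-dimensional inputs):

    import numpy as np

    def phi(x, c=1.0):
        """Explicit feature map whose dot products reproduce K(x, y) = (x . y + c)^2."""
        x1, x2 = x
        s = np.sqrt(2.0)
        return np.array([x1 * x1, x2 * x2, s * x1 * x2,
                         s * np.sqrt(c) * x1, s * np.sqrt(c) * x2, c])

    def poly_kernel(x, y, c=1.0):
        return (np.dot(x, y) + c) ** 2

    x = np.array([1.0, 2.0])
    y = np.array([3.0, -1.0])
    print(poly_kernel(x, y))           # 4.0
    print(np.dot(phi(x), phi(y)))      # 4.0, same value, computed via the feature map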

Linear Classifiers: Kernels and the dual problem
To use a kernel, we need to formulate the classifier in dual form, i.e. in terms of dot products between examples.
Example: Perceptron. Classification rule: $\hat{y} = \operatorname{sgn}(w \cdot x + b)$. Due to the update steps $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$, we get:
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad b = \sum_{i=1}^{n} \alpha_i y_i$$
where $\alpha_i$ is the number of misclassifications of $x_i$ (i.e. the number of updates it triggered).

Linear Classifiers: Kernels and the dual problem
Then we can compute the perceptron prediction as:
$$\hat{y} = \operatorname{sgn}\Big(\big(\sum_{i=1}^{n} \alpha_i y_i x_i\big) \cdot x + \sum_{i=1}^{n} \alpha_i y_i\Big) = \operatorname{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) + \sum_{i=1}^{n} \alpha_i y_i\Big) = \operatorname{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x + 1)\Big)$$
Once the problem is formulated in terms of similarities (dot products) between examples, we can introduce the kernel:
$$\hat{y} = \operatorname{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (K(x_i, x) + 1)\Big)$$
Note that for $K(x, y) = x \cdot y$ this formulation is equivalent to the original perceptron.
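A dual (kernel) perceptron sketch matching this formulation, where alpha[i] counts the mistakes made on x_i (illustrative: K can be any kernel function, e.g. the RBF above, and the epoch count is arbitrary):

    import numpy as np

    def train_kernel_perceptron(X, y, K, epochs=20):
        n = len(X)
        alpha = np.zeros(n)
        for _ in range(epochs):
            for j in range(n):
                score = sum(alpha[i] * y[i] * (K(X[i], X[j]) + 1) for i in range(n))
                if y[j] * score <= 0:      # mistake: increment the dual counter for x_j
                    alpha[j] += 1
        return alpha

    def predict_kernel_perceptron(x, X, y, alpha, K):
        score = sum(alpha[i] * y[i] * (K(X[i], x) + 1) for i in range(len(X)))
        return np.sign(score)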

Maximum Likelihood Estimation

MLE & Smoothing
Estimate the probability of the target feature from observed data. The prediction task can be reduced to having good estimates of the conditional distribution:
$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$
MLE (Maximum Likelihood Estimation):
$$P_{MLE}(X) = \frac{\operatorname{count}(X)}{N} \qquad P_{MLE}(Y \mid X) = \frac{\operatorname{count}(X, Y)}{\operatorname{count}(X)}$$
No probability mass is left for unseen events, so MLE is unsuitable for NLP (data sparseness, Zipf's law).

Smoothing 1 - Adding Counts
Laplace's law (adding one):
$$P_{LAP}(X) = \frac{\operatorname{count}(X) + 1}{N + B}$$
For large values of $B$, too much probability mass is assigned to unseen events.
Lidstone's law:
$$P_{LID}(X) = \frac{\operatorname{count}(X) + \lambda}{N + B\lambda}$$
Usually $\lambda = 0.5$ (Expected Likelihood Estimation). Equivalent to a linear interpolation between MLE and a uniform prior, with $\mu = N / (N + B\lambda)$:
$$P_{LID}(X) = \mu\, \frac{\operatorname{count}(X)}{N} + (1 - \mu)\, \frac{1}{B}$$
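A direct transcription of $P_{LID}$ into Python (illustrative: counts maps each seen event to its frequency, B is the total number of possible event types, and λ = 1 recovers Laplace's law):

    def lidstone(counts, B, lam=0.5):
        """Return smoothed probabilities for seen events and the probability of one unseen event."""
        N = sum(counts.values())
        probs = {x: (c + lam) / (N + B * lam) for x, c in counts.items()}
        p_unseen = lam / (N + B * lam)     # assigned to each of the B - len(counts) unseen events
        return probs, p_unseen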

Smoothing 2 - Discounting Counts
Absolute discounting:
$$P_{ABS}(X) = \begin{cases} \frac{\operatorname{count}(X) - \delta}{N} & \text{if } \operatorname{count}(X) > 0 \\ \frac{(B - N_0)\,\delta}{N_0\, N} & \text{otherwise} \end{cases}$$
Linear discounting:
$$P_{LIN}(X) = \begin{cases} \frac{(1 - \alpha)\operatorname{count}(X)}{N} & \text{if } \operatorname{count}(X) > 0 \\ \frac{\alpha}{N_0} & \text{otherwise} \end{cases}$$
where $N_0$ is the number of unseen events (so $B - N_0$ event types have been observed).

Maximum Entropy Modeling

Maximum Entropy / Log-linear Models
Maximum Entropy: an alternative estimation technique, able to deal with different kinds of evidence.
ME principle: do not assume anything about non-observed events. Find the most uniform (maximum entropy, least informed) probability distribution that matches the observations.
Example (translating English prepositions to French):

Observations:
    p(a, b)   dans   en    à
    in          ?     ?   0.3
    on          ?     ?     ?
    total      0.4              (all cells sum to 1.0)

One possible p(a, b):
    p(a, b)   dans   en    à
    in         0.1   0.0  0.3
    on         0.3   0.2  0.1
    total      0.4              (all cells sum to 1.0)

Maximum entropy p(a, b):
    p(a, b)   dans   en    à
    in         0.2   0.1  0.3
    on         0.2   0.1  0.1
    total      0.4              (all cells sum to 1.0)

ME Modeling
Observed facts are constraints on the desired model $p$. Constraints take the form of feature functions $f_i: \varepsilon \rightarrow \{0, 1\}$. The desired model must satisfy the constraints:
$$\sum_{x \in \varepsilon} p(x) f_i(x) = \sum_{x \in \varepsilon} \tilde{p}(x) f_i(x) \qquad \forall i$$
that is, the expectation of each $f_i$ according to the model must match the observed expectation of $f_i$, computed from the empirical distribution $\tilde{p}$.

ME Modeling Example
Example: $\varepsilon = \{in, on\} \times \{dans, en, à\}$

    p(a, b)   dans   en    à
    in          ?     ?     ?
    on          ?     ?     ?
    total      0.4              (all cells sum to 1.0)

Observed fact: $p(in, dans) + p(on, dans) = 0.4$. Encoded as a constraint: $E_p(f_1) = 0.4$, where
$$f_1(a, b) = \begin{cases} 1 & \text{if } b = dans \\ 0 & \text{otherwise} \end{cases} \qquad E_p(f_1) = \sum_{(a,b) \in \varepsilon} p(a, b)\, f_1(a, b)$$

ME Probability Model
There is an infinite set $P$ of probability models consistent with the observations. We want to compute the maximum entropy model:
$$p^* = \operatorname*{argmax}_{p \in P} H(p), \qquad H(p) = -\sum_{x \in \varepsilon} p(x) \log p(x)$$

Parameter Estimation
Example: maximum entropy model for translating prepositions from English to French.

No constraints:
    P(a, b)   dans    en      à
    in        0.167  0.167  0.167
    on        0.167  0.167  0.167
    total                          1.0

With the constraint p(dans) + p(en) = 0.4:
    P(a, b)   dans   en    à
    in         0.1   0.1  0.3
    on         0.1   0.1  0.3
    total      0.4             1.0

With the constraints p(dans) + p(en) = 0.4 and p(in) = 0.6: ... not so easy!
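The middle table can be checked numerically by maximizing $H(p)$ over the six cells subject to normalization and the single constraint. A small scipy sketch (illustrative; the cell ordering and solver settings are arbitrary choices):

    import numpy as np
    from scipy.optimize import minimize

    # cell order: (in,dans), (in,en), (in,a), (on,dans), (on,en), (on,a)
    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return np.sum(p * np.log(p))        # minimizing this maximizes H(p)

    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},                     # normalization
        {"type": "eq", "fun": lambda p: p[0] + p[1] + p[3] + p[4] - 0.4},   # p(dans) + p(en) = 0.4
    ]
    res = minimize(neg_entropy, x0=np.full(6, 1 / 6), bounds=[(1e-9, 1.0)] * 6,
                   constraints=constraints, method="SLSQP")
    print(res.x.round(3))   # ~ [0.1 0.1 0.3 0.1 0.1 0.3], matching the table above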

Parameter estimation
Exponential models (obtained via Lagrange multiplier optimization):
$$p(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{k} \alpha_j^{f_j(a, b)}, \quad \alpha_j > 0, \qquad Z(b) = \sum_{a} \prod_{j=1}^{k} \alpha_j^{f_j(a, b)}$$
also formulated as
$$p(a \mid b) = \frac{1}{Z(b)} \exp\Big(\sum_{j=1}^{k} \lambda_j f_j(a, b)\Big), \qquad \lambda_j = \ln \alpha_j$$
Each model parameter weights the influence of one feature. Several algorithms compute the optimal parameters: GIS, IIS, L-BFGS, ...
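A minimal sketch of evaluating this log-linear form (illustrative: the feature functions, weights, and class set are placeholders, not tied to any particular task):

    import math

    def loglinear_prob(a, b, classes, features, lambdas):
        """p(a | b) = exp(sum_j lambda_j * f_j(a, b)) / Z(b).
        features: list of functions f_j(a, b); lambdas: matching list of weights."""
        def score(a_):
            return math.exp(sum(l * f(a_, b) for l, f in zip(lambdas, features)))
        Z = sum(score(a_) for a_ in classes)    # normalization constant Z(b)
        return score(a) / Z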

Improved Iterative Scaling (IIS)
Input: feature functions $f_1 \ldots f_n$, empirical expectations $\tilde{p}(f_i)$
Output: $\lambda_i$, the parameters of the optimal model $p^*$

Start with $\lambda_i = 0$ for all $i \in \{1 \ldots n\}$
Repeat
    For each $i \in \{1 \ldots n\}$:
        let $\Delta\lambda_i$ be the solution to
        $$\sum_{a,b} \tilde{p}(b)\, p(a \mid b)\, f_i(a, b)\, \exp\Big(\Delta\lambda_i \sum_{j=1}^{n} f_j(a, b)\Big) = \tilde{p}(f_i)$$
        $\lambda_i \leftarrow \lambda_i + \Delta\lambda_i$
    End for
Until all $\lambda_i$ have converged

Application to NLP Tasks
Speech processing (Rosenfeld 94); Translation (Brown et al. 90); Morphology (Della Pietra et al. 95); Clause boundary detection (Reynar & Ratnaparkhi 97); PP-attachment (Ratnaparkhi et al. 94); PoS Tagging (Ratnaparkhi 96, Black et al. 99); Partial Parsing (Skut & Brants 98); Full Parsing (Ratnaparkhi 97, Ratnaparkhi 99); Text Categorization (Nigam et al. 99)

PoS Tagging (Ratnaparkhi 96)
Probabilistic model over $H \times T$, with histories $h_i = (w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2})$.
Example feature:
$$f_j(h_i, t) = \begin{cases} 1 & \text{if } \operatorname{suffix}(w_i) = \text{"ing" and } t = \text{VBG} \\ 0 & \text{otherwise} \end{cases}$$
Compute $p^*(h, t)$ using GIS, with
$$p(t \mid h) = \frac{\exp\big(\sum_j \lambda_j f_j(h, t)\big)}{Z(h)}$$
Disambiguation algorithm: beam search for
$$\operatorname*{argmax}_{t_1 \ldots t_n} p(t_1 \ldots t_n \mid w_1 \ldots w_n) = \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} p(t_i \mid h_i)$$
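The example feature template translates almost literally into code. The sketch below assumes the history tuple layout $h_i = (w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2})$ given above; names are illustrative:

    def f_ing_vbg(history, tag):
        """Fires when the current word ends in 'ing' and the proposed tag is VBG."""
        current_word = history[0]     # w_i is the first element of the history tuple
        return 1 if current_word.endswith("ing") and tag == "VBG" else 0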

Text Categorization (Nigam et al 99)
Probabilistic model over $W \times C$, with documents $d = (w_1, w_2, \ldots, w_N)$.
$$f_{w,c'}(d, c) = \begin{cases} \frac{N(d, w)}{N(d)} & \text{if } c = c' \\ 0 & \text{otherwise} \end{cases}$$
Compute $p^*(c \mid d)$ using IIS.
Disambiguation algorithm: select the class with the highest probability:
$$\operatorname*{argmax}_{c} P(c \mid d) = \operatorname*{argmax}_{c} \frac{\exp\big(\sum_i \lambda_i f_i(d, c)\big)}{Z(d)} = \operatorname*{argmax}_{c} \sum_i \lambda_i f_i(d, c)$$

Sentence Boundaries (Reynar and Ratnaparkhi 97)
Feature templates:
1. The prefix
2. The suffix
3. The previous word
4. The next word
5. Whether the prefix or the suffix is in the Abbreviations list
6. Whether the previous or the next word is in the Abbreviations list
Example context: < b=no punc=. pref=mr suff= prev=2010. next=wayne >
Two classes: y and n.
Disambiguation algorithm: select the class with the highest probability:
$$\operatorname*{argmax}_{c} P(c \mid d) = \operatorname*{argmax}_{c} \frac{\exp\big(\sum_i \lambda_i f_i(d, c)\big)}{Z(d)} = \operatorname*{argmax}_{c} \sum_i \lambda_i f_i(d, c)$$