Intelligent Systems (AI-2)

Intelligent Systems (AI-2), Computer Science CPSC 422, Lecture 19, Oct 23, 2015
Slide sources: Raymond J. Mooney (University of Texas at Austin); D. Koller, Stanford CS, Probabilistic Graphical Models; D. Page, Whitehead Institute, MIT. Several figures from Probabilistic Graphical Models: Principles and Techniques, D. Koller and N. Friedman, 2009.

Lecture Overview
Recap: Naïve Markov = logistic regression (a simple CRF)
CRFs: high-level definition
CRFs applied to sequence labeling
NLP examples: Named Entity Recognition, joint POS tagging and NP segmentation

Let's derive the probabilities we need. The Naïve Markov model has a single target Y_1 connected to the observed variables X_1, X_2, ..., X_n, with one factor per feature plus a bias factor:

\phi_i(X_i, Y_1) = \exp\{ w_i \,\mathbf{1}\{X_i = 1, Y_1 = 1\} \}
\phi_0(Y_1) = \exp\{ w_0 \,\mathbf{1}\{Y_1 = 1\} \}

Naïve Markov Parameters and Inference

\phi_i(X_i, Y_1) = \exp\{ w_i \,\mathbf{1}\{X_i = 1, Y_1 = 1\} \}
\phi_0(Y_1) = \exp\{ w_0 \,\mathbf{1}\{Y_1 = 1\} \}

\tilde{P}(Y_1 = 1, x_1, ..., x_n) = \exp( w_0 + \sum_{i=1}^{n} w_i x_i )
\tilde{P}(Y_1 = 0, x_1, ..., x_n) = 1

Normalizing over the two values of Y_1 gives

P(Y_1 = 1 \mid x_1, ..., x_n) = \frac{\exp( w_0 + \sum_{i=1}^{n} w_i x_i )}{1 + \exp( w_0 + \sum_{i=1}^{n} w_i x_i )}

i.e. exactly logistic regression.
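A minimal sketch of this inference in Python; the weights and feature values below are made up purely for illustration.

```python
import math

def naive_markov_posterior(w0, w, x):
    """P(Y1 = 1 | x1..xn) under the Naive Markov model above.

    The unnormalized measure is exp(w0 + sum_i w_i * x_i) for Y1 = 1 and 1 for Y1 = 0,
    so normalizing gives the logistic sigmoid of the linear score.
    """
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-score))

# Made-up weights, two binary features: score = -1.0 + 2.0*1 + 0.5*0 = 1.0
print(naive_markov_posterior(w0=-1.0, w=[2.0, 0.5], x=[1, 0]))  # ~0.731
```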

Let's generalize. Assume that you always observe a set of variables X = {X_1, ..., X_n} and you want to predict one or more variables Y = {Y_1, ..., Y_k}. A CRF is an undirected graphical model whose nodes correspond to X ∪ Y; \phi_1(D_1), ..., \phi_m(D_m) are the factors that annotate the network (but we disallow factors involving only variables in X. Why?)

P(Y \mid X) = \frac{1}{Z(X)} \prod_{i=1}^{m} \phi_i(D_i), \qquad Z(X) = \sum_{Y} \prod_{i=1}^{m} \phi_i(D_i)
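A brute-force sketch of this definition, assuming discrete Y domains small enough that Z(X) can be computed by enumeration; the factor and variable representations are illustrative choices, not from the slides.

```python
from itertools import product

def crf_conditional(factors, y_vars, y_query, x):
    """P(Y = y_query | X = x) = (1/Z(X)) * prod_i phi_i(D_i), with Z(X) by enumeration.

    factors: list of functions phi(y_assignment, x) -> nonnegative real,
             each looking only at the variables in its own scope D_i.
    y_vars:  list of (name, domain) pairs for the target variables Y.
    y_query: dict mapping each Y variable name to a value.
    """
    def unnormalized(y):
        p = 1.0
        for phi in factors:
            p *= phi(y, x)
        return p

    names = [name for name, _ in y_vars]
    domains = [dom for _, dom in y_vars]
    # Z(X) sums over all joint assignments to Y, for this particular observation x
    Z = sum(unnormalized(dict(zip(names, vals))) for vals in product(*domains))
    return unnormalized(y_query) / Z
```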

Lecture Overview
Recap: Naïve Markov = logistic regression (a simple CRF)
CRFs: high-level definition
CRFs applied to sequence labeling
NLP examples: Named Entity Recognition, joint POS tagging and NP segmentation

Sequence Labeling: a linear-chain CRF with target variables Y_1, Y_2, ..., Y_T, one per position, each connected to its observation X_1, X_2, ..., X_T and to the neighbouring targets.

Increasing representational complexity: adding features to a CRF. Instead of a single observed variable X_i per position, we can model multiple features X_{i,1}, ..., X_{i,m} of that observation, each connected to the target Y_i.

CRFs in Natural Language Processing: one target variable Y for each word X, encoding the possible labels for X. Each target variable is connected to a set of feature variables that capture properties relevant to the target distinction, e.g. "Is the word capitalized?", "Does the word end in -ing?"

Named Entity Recognition Task
Entities often span multiple words: "British Columbia".
The type of an entity may not be apparent from its individual words: "University of British Columbia".
Let's assume three categories: Person, Location, Organization.
BIO notation (for sequence labeling): each token is tagged as Beginning, Inside, or Outside an entity of a given type.
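For example, BIO tags for a hypothetical sentence containing the multi-word entity from the slide ("University of British Columbia"), using standard B-/I-/O conventions for the Organization and Location categories:

```python
tokens = ["The", "University", "of", "British", "Columbia", "is", "in", "Vancouver"]
tags   = ["O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC"]
```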

Linear-chain CRF parameters, with two factor types for each word:

\phi^1_t(Y_t, Y_{t+1}) — dependency between neighbouring target variables
\phi^2_t(Y_t, X_1, ..., X_T) — dependency between a target variable and its context in the word sequence, which can also include features of the words (capitalized, appears in an atlas of location names, etc.)

The factors are similar to the ones for the Naïve Markov model (logistic regression):

\phi_t(Y_t, X_{t,k}) = \exp\{ w_{t,k} \,\mathbf{1}\{Y_t, X_{t,k}\} \}
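A rough sketch of how the two factor types combine into the unnormalized score of a candidate label sequence, working in log space; the weight dictionaries trans_w and emit_w are illustrative names of my own, not from the lecture.

```python
def sequence_score(labels, feature_seq, trans_w, emit_w):
    """Unnormalized log-score of a label sequence y_1..y_T under a linear-chain CRF.

    feature_seq: one dict of active (nonzero) features per position t
    trans_w:     weights for phi^1_t(Y_t, Y_{t+1}), keyed by (tag, next_tag)
    emit_w:      weights for phi^2_t(Y_t, features), keyed by (tag, feature_name)
    """
    score = 0.0
    for t, (y, feats) in enumerate(zip(labels, feature_seq)):
        score += sum(emit_w.get((y, f), 0.0) * v for f, v in feats.items())
        if t + 1 < len(labels):
            score += trans_w.get((y, labels[t + 1]), 0.0)
    # exp(score) is the product of the factors; dividing by Z(X) would normalize it
    return score
```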

Features can also be: the word itself, the following word, the previous word.
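A small illustrative feature extractor covering the kinds of features mentioned on the last two slides; the feature names and the sentence-boundary markers are my own choices.

```python
def word_features(words, t):
    """Active binary features X_{t,1}..X_{t,m} for the word at position t."""
    w = words[t]
    return {
        "word=" + w.lower(): 1,
        "is_capitalized": int(w[0].isupper()),
        "ends_in_ing": int(w.lower().endswith("ing")),
        "prev_word=" + (words[t - 1].lower() if t > 0 else "<s>"): 1,
        "next_word=" + (words[t + 1].lower() if t + 1 < len(words) else "</s>"): 1,
    }

print(word_features(["She", "is", "running", "fast"], 2))
# {'word=running': 1, 'is_capitalized': 0, 'ends_in_ing': 1, 'prev_word=is': 1, 'next_word=fast': 1}
```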

More on features. Including features that are conjunctions of simple features increases accuracy. The total number of features can be 10^5 to 10^6. However, features are sparse, i.e. most features are 0 for most words.
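Because of that sparsity, a natural representation stores only the features that fired, as in the dictionary above; scoring then touches only those entries. A minimal sketch:

```python
def sparse_score(weights, active_features):
    """Dot product w . f, touching only the features that are nonzero for this word."""
    return sum(weights.get(name, 0.0) * value for name, value in active_features.items())
```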

Linear-Chain CRF Performance
Per-token/word accuracy is in the high 90% range for many natural datasets.
Per-field precision and recall are more often around 80-95%, depending on the dataset: the entire named-entity phrase must be correct.
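To see why per-field scores are lower than per-token accuracy, here is a simplified sketch of entity-level evaluation: a predicted entity counts only if both its type and its entire span match a gold entity. The BIO span extractor below handles the common cases but is not a complete evaluation script.

```python
def entity_spans(tags):
    """Extract (type, start, end) entity spans from a BIO tag sequence (simplified)."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes any open span
        inside_same = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside_same:
            spans.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return set(spans)

gold = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
pred = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]
hits = entity_spans(gold) & entity_spans(pred)     # only the LOC span matches exactly
precision = len(hits) / len(entity_spans(pred))    # 0.5
recall = len(hits) / len(entity_spans(gold))       # 0.5
```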

Skip-Chain CRFs include additional factors that connect non-adjacent target variables, e.g. when a word occurs multiple times in the same document. Note that the graphical structure over Y can depend on the values of the Xs!

Coupled linear-chain CRFs: linear-chain CRFs can be combined to perform multiple tasks simultaneously, e.g. performing part-of-speech labeling and noun-phrase segmentation jointly.

Inference in CRFs (just intuition). Forward, Backward / Smoothing, and Viterbi can be rewritten (not trivially!) in terms of generic factors. Then you plug in the factors of the CRF and all the algorithms work fine with CRFs.
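As a concrete illustration, here is a toy Viterbi decoder written directly in terms of the linear-chain factor scores (log space), reusing the trans_w / emit_w conventions from the earlier sketch; this is only for intuition, not how a package such as MALLET implements it.

```python
def viterbi(feature_seq, tag_set, trans_w, emit_w):
    """Most likely tag sequence argmax_y score(y, x) for a linear-chain CRF (log space)."""
    def emit(y, feats):
        return sum(emit_w.get((y, f), 0.0) * v for f, v in feats.items())

    # delta[t][y] = best log-score of any tag prefix ending in tag y at position t
    delta = [{y: emit(y, feature_seq[0]) for y in tag_set}]
    back = []
    for feats in feature_seq[1:]:
        scores, pointers = {}, {}
        for y in tag_set:
            best_prev = max(tag_set, key=lambda yp: delta[-1][yp] + trans_w.get((yp, y), 0.0))
            scores[y] = delta[-1][best_prev] + trans_w.get((best_prev, y), 0.0) + emit(y, feats)
            pointers[y] = best_prev
        delta.append(scores)
        back.append(pointers)

    # Trace back from the best final tag
    y = max(tag_set, key=lambda tag: delta[-1][tag])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    return list(reversed(path))
```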

CRFs Summary
Ability to relax strong independence assumptions
Ability to incorporate arbitrary overlapping local and global features
Graphical structure over Y can depend on the values of the Xs
Can perform multiple tasks simultaneously
Standard inference algorithms for HMMs can be applied
Practical learning algorithms exist (see the MALLET package)
State-of-the-art on many labeling tasks (deep learning recently shown to be often better; ensemble them?)

Probabilistic Graphical Models. From Probabilistic Graphical Models: Principles and Techniques, D. Koller and N. Friedman, 2009.

422 big picture: Where are we? (Course map relating Representation and Reasoning Technique, with Query and Planning across Deterministic, Stochastic, and Hybrid representations.)
Deterministic: Logics (First-Order Logics, Ontologies, Temporal rep.) with Full Resolution and SAT.
Stochastic: Belief Nets (Approx.: Gibbs); Markov Chains and HMMs (Forward, Viterbi; Approx.: Particle Filtering); Undirected Graphical Models (Markov Networks, Conditional Random Fields); Markov Decision Processes and Partially Observable MDPs (Value Iteration, Approx. Inference); Reinforcement Learning.
Hybrid (Det + Sto): Prob CFG, Prob Relational Models, Markov Logics.
Applications of AI.

Learning Goals for today's class. You can:
Provide a general definition of a CRF
Apply CRFs to sequence labeling
Describe and justify features for CRFs applied to natural language processing tasks
Explain the benefits of CRFs

How to prepare: Midterm, Mon, Oct 26; we will start at 9am sharp.
Work on the practice material posted on Connect.
Learning Goals (see the end of the slides for each lecture, or the complete list on Connect).
Revise all the clicker questions and practice exercises.
Extra office hours TODAY 11:00am-12:30pm in the DLC.
Next class (Wed): start Logics. Revise Logics from 322!

Midterm Announcements: Avg 73.5, Max 105, Min 30. If your score is below 70 you need to very seriously revise all the material covered so far. You can pick up a printout of the solutions along with your midterm.

Generative vs. Discriminative Models
Generative models (like Naïve Bayes) are not directly designed to maximize performance on classification. They model the joint distribution P(X, Y); classification is then done using Bayesian inference. But a generative model can also be used to perform any other inference task, e.g. P(X_1 | X_2, ..., X_n). Jack of all trades, master of none.
Discriminative models (like CRFs) are specifically designed and trained to maximize performance on classification. They only model the conditional distribution P(Y | X). By focusing on modeling the conditional distribution, they generally perform better on classification than generative models when given a reasonable amount of training data.
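As a toy contrast: the discriminative side (modeling P(Y | X) directly) was already sketched after the Naïve Markov inference slide. Here is the generative side, a Naïve Bayes classifier with binary features that models P(Y) and P(X_i | Y) and classifies via Bayes' rule; all probabilities below are made up for illustration.

```python
def naive_bayes_posterior(prior_y1, p_x_given_y1, p_x_given_y0, x):
    """P(Y = 1 | x) obtained from the joint P(x, Y) = P(Y) * prod_i P(x_i | Y)."""
    joint1, joint0 = prior_y1, 1.0 - prior_y1
    for p1, p0, xi in zip(p_x_given_y1, p_x_given_y0, x):
        joint1 *= p1 if xi else (1.0 - p1)
        joint0 *= p0 if xi else (1.0 - p0)
    return joint1 / (joint1 + joint0)   # Bayesian inference: normalize the joint

# The same joint could also answer other queries, e.g. P(X_1 | X_2, ..., X_n),
# which a purely conditional model of P(Y | X) cannot.
print(naive_bayes_posterior(0.3, [0.8, 0.4], [0.2, 0.5], x=[1, 0]))
```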

Naïve Bayes vs. Logistic Regression. (Figure: both models connect a single target Y_1 to features X_1, X_2, ..., X_n; Naïve Bayes is the generative model, Logistic Regression (Naïve Markov) the conditional, discriminative one.)

Sequence Labeling. (Figure: an HMM, the generative model, and a linear-chain CRF, the conditional, discriminative model, both over targets Y_1, Y_2, ..., Y_T and observations X_1, X_2, ..., X_T.)