Graph-Based Lexicon Expansion with Sparsity-Inducing Penalties


Graph-Based Lexicon Expansion with Sparsity-Inducing Penalties. Dipanjan Das, LTI, CMU, and Noah Smith, LTI, CMU. Thanks: André Martins, Amar Subramanya, and Partha Talukdar. This research was supported by Qatar National Research Foundation grant NPRP 08-485-1-083, Google, and TeraGrid resources provided by the Pittsburgh Supercomputing Center under NSF grant number TG-DBS110003.

Motivation. The FrameNet lexicon (Fillmore et al., 2003) gives, for many words, a set of abstract semantic frames; e.g., contribute/v can evoke GIVING or SYMPTOM. SEMAFOR (Das et al., 2010) finds the frames evoked plus their semantic roles. But what about the words not in the lexicon or the annotated data?

Das and Smith (2011): graph-based semi-supervised learning with quadratic penalties (Bengio et al., 2006; Subramanya et al., 2010). Frame identification F1 on unknown predicates: 47% → 62% → 65% (today). Frame parsing F1 on unknown predicates: 30% → 44% → 47% (today). Today: we consider alternatives that target sparsity, i.e., each word associating with relatively few frames.

Graph-Based Learning. [Figure: a similarity graph connecting predicates with observed frame distributions (vertices 1–4) to unknown predicates (vertices 9264–9270); edges are weighted by similarity.]

The Case for Sparsity. Lexical ambiguity is pervasive, but each word's ambiguity is fairly limited. Ruling out possibilities yields better runtime and memory properties.

Outline. 1. A general family of graph-based SSL techniques for learning distributions: defining the graph, constructing the graph and carrying out inference, and (new) sparse and unnormalized distributions. 2. Experiments with frame analysis: favorable comparison to state-of-the-art graph-based learning algorithms.

Notation. T = the set of types (words); L = the set of labels (frames). Let q_t(l) denote the estimated probability that type t will take label l.

Vertices, Part 1. Think of this as a graphical model whose random variables take vector values. [Figure: four variable vertices q_1, q_2, q_3, q_4.]

Factor Graphs (Kschischang et al., 2001). A bipartite graph with random-variable vertices V and factor vertices F; the distribution over all variables' values is proportional to the product of the factors. Today: finding the collectively highest-scoring values (MAP inference) = estimating q; log-factors = negated penalties. (See the sketch below.)
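To make the slide's shorthand concrete, here is the standard factor-graph distribution and the resulting view of MAP inference as penalty minimization; the neighborhood notation N(φ) is my shorthand, not from the talk:

```latex
% Standard factor-graph distribution over an assignment v:
\[
p(\mathbf{v}) \;\propto\; \prod_{\phi \in F} \phi\big(\mathbf{v}_{N(\phi)}\big)
\]
% Taking logs, MAP inference over the q variables becomes
% minimization of a sum of penalties (negated log-factors):
\[
\hat{\mathbf{q}}
\;=\; \arg\max_{\mathbf{q}} \sum_{\phi \in F} \log \phi\big(\mathbf{q}_{N(\phi)}\big)
\;=\; \arg\min_{\mathbf{q}} \sum_{\phi \in F} \mathrm{penalty}_{\phi}\big(\mathbf{q}_{N(\phi)}\big),
\]
% where N(\phi) denotes the variables adjacent to factor \phi.
```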

Notation. T = the set of types (words); L = the set of labels (frames). Let q_t(l) denote the estimated probability that type t will take label l. Let r_t(l) denote the observed relative frequency of type t with label l.

Penalties (1 of 3). Each type t_i's value q_i should be close to its empirical distribution r_i. [Figure: factor graph with an empirical factor connecting each q_i to its observed r_i.]

Empirical Penalties. Gaussian (Zhu et al., 2003): the penalty is the squared L2 norm, ‖q_t − r_t‖₂². Entropic: the penalty is the JS divergence, JS(q_t ‖ r_t) (cf. Subramanya and Bilmes, 2008, who used KL). A sketch of both follows.
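As a concrete illustration (a minimal NumPy sketch, not the authors' code), here are the two empirical penalties on normalized distributions; the generalized divergences for unnormalized measures mentioned later would modify the KL term:

```python
import numpy as np

def sq_l2_penalty(q_t, r_t):
    """Gaussian empirical penalty: squared L2 distance between the
    estimated distribution q_t and the empirical distribution r_t."""
    return np.sum((q_t - r_t) ** 2)

def kl(p, q, eps=1e-12):
    """KL divergence between (normalized) distributions, smoothed."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

def js_penalty(q_t, r_t):
    """Entropic empirical penalty: Jensen-Shannon divergence, the
    symmetrized and bounded relative of KL."""
    m = 0.5 * (q_t + r_t)
    return 0.5 * kl(q_t, m) + 0.5 * kl(r_t, m)
```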

Let's Get Semi-Supervised

Vertices, Part 2. There is no empirical distribution for these new vertices! [Figure: the graph grows to include unlabeled vertices q_9264 through q_9270 alongside the labeled vertices q_1–q_4 with their r_1–r_4.]

Penalties (2 of 3). [Figure: the same factor graph, now with pairwise factors connecting similar vertices, labeled and unlabeled alike.]

Similarity Factors. Gaussian: log φ_{t,t′}(q_t, q_{t′}) = −μ · sim(t, t′) · ‖q_t − q_{t′}‖₂². Entropic: log φ_{t,t′}(q_t, q_{t′}) = −μ · sim(t, t′) · JS(q_t ‖ q_{t′}).

Constructing the Graph, in one slide. Conjecture: contextual distributional similarity correlates with lexical distributional similarity (Subramanya et al., 2010; Das and Petrov, 2011; Das and Smith, 2011). 1. Calculate distributional similarity for each pair of types (details in past work; nothing new here). 2. Choose each vertex's K closest neighbors. 3. Weight each log-factor by the similarity score. (A sketch of steps 2–3 follows.)
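A minimal sketch of steps 2 and 3, assuming the pairwise similarities from step 1 are already in a symmetric matrix; the function and variable names are mine, not from the talk:

```python
import numpy as np

def build_knn_graph(sim, K):
    """Keep each vertex's K most similar neighbors; the returned edge
    weights become the sim(t, t') coefficients on the log-factors.

    sim: (n, n) symmetric similarity matrix from step 1.
    Returns a list of (t, t_prime, weight) edges.
    """
    n = sim.shape[0]
    edges = set()
    for t in range(n):
        scores = sim[t].astype(float).copy()
        scores[t] = -np.inf                      # exclude self-loops
        for tp in np.argsort(scores)[-K:]:       # K closest neighbors
            edges.add((min(t, tp), max(t, tp)))  # de-duplicate edges
    return [(t, tp, sim[t, tp]) for (t, tp) in sorted(edges)]
```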

[Figure: the constructed K-nearest-neighbor graph over the labeled vertices (q_1–q_4, with r_1–r_4) and the unlabeled vertices (q_9264–q_9270).]

Penalties (3 of 3). [Figure: the factor graph once more, now with a unary factor attached to each q vertex.]

What Might Unary Penalties/Factors Do? Hard factors can enforce nonnegativity and normalization. They can encourage near-uniformity, via squared distance to uniform (Zhu et al., 2003; Subramanya et al., 2010; Das and Smith, 2011) or entropy (Subramanya and Bilmes, 2008). Or they can encourage sparsity: the main goal of this paper!

Unary Log-Factors. Squared distance to uniform: −λ ‖q_t − u‖₂², with u the uniform distribution. Entropy: λ H(q_t). Lasso / L1 (Tibshirani, 1996): −λ ‖q_t‖₁. Elitist lasso / squared L1,2 (Kowalski and Torrésani, 2009): −λ ‖q_t‖₁², i.e., the squared L1 norm within each type, summed across types. (A sketch follows.)
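To make the soft unary penalties concrete, a small NumPy sketch (my reading of the standard lasso and of Kowalski and Torrésani's elitist lasso with each type's label vector as one group; q_t is nonnegative here, so the absolute values are for generality only):

```python
import numpy as np

def lasso_penalty(q_t, lam):
    """L1 penalty: drives individual entries q_t(l) to exactly zero."""
    return lam * np.sum(np.abs(q_t))

def elitist_lasso_penalty(q_t, lam):
    """Squared L_{1,2} penalty, one group per type: lam * (sum_l |q_t(l)|)^2.
    Labels of the same type compete, concentrating mass on a few of
    them (hence "elitist")."""
    return lam * np.sum(np.abs(q_t)) ** 2

def sq_dist_to_uniform_penalty(q_t, lam):
    """Near-uniformity penalty: squared L2 distance to the uniform vector."""
    u = np.full_like(q_t, 1.0 / q_t.size)
    return lam * np.sum((q_t - u) ** 2)
```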

Models to Compare.
Model | Empirical and pairwise factors | Unary factor
normalized Gaussian field (Das and Smith, 2011; generalizes Zhu et al., 2003) | Gaussian | squared L2 to uniform, normalization
measure propagation (Subramanya and Bilmes, 2008) | Kullback-Leibler | entropy, normalization
UGF-L2 | Gaussian | squared L2 to uniform
UGF-L1 | Gaussian | lasso (L1)
UGF-L1,2 | Gaussian | elitist lasso (squared L1,2)
UJSF-L2 | Jensen-Shannon | squared L2 to uniform
UJSF-L1 | Jensen-Shannon | lasso (L1)
UJSF-L1,2 | Jensen-Shannon | elitist lasso (squared L1,2)
(The UGF/UJSF models use unnormalized distributions; the L1 and L1,2 unary factors are sparsity-inducing.)

Where We Are So Far. A factor-graph view of semi-supervised graph-based learning, encompassing the familiar Gaussian and entropic approaches; estimating all q_t equates to MAP inference. Yet to come: an inference algorithm for all q_t, and experiments.

Inference in One Slide. All of these problems are convex. Past work relied on specialized iterative methods; the lack of normalization constraints makes things simpler here, so an easy off-the-shelf quasi-Newton gradient-based method, L-BFGS-B (with nonnegativity box constraints), suffices. Non-differentiability at 0 causes no problems (assume right-continuity), and the KL and JS divergences can be generalized to unnormalized measures. (A sketch follows.)
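A minimal end-to-end sketch of this style of inference for one member of the family (a Gaussian objective with lasso unary factors, roughly UGF-L1); the function names, toy hyperparameters, and graph encoding are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def fit_q(n_types, n_labels, r, edges, mu=0.5, lam=0.1):
    """MAP inference by L-BFGS-B with nonnegativity box constraints.

    r     : dict mapping labeled type index -> empirical distribution r_t
    edges : (t, t_prime, sim) triples from the KNN graph
    Objective:  sum_{t labeled} ||q_t - r_t||^2
              + mu  * sum_edges sim(t,t') ||q_t - q_t'||^2
              + lam * sum_t ||q_t||_1
    On the feasible region q >= 0 the L1 term is linear, hence smooth."""
    def objective(x):
        q = x.reshape(n_types, n_labels)
        obj = lam * q.sum()                        # lasso term (q >= 0)
        grad = np.full_like(q, lam)
        for t, r_t in r.items():                   # empirical factors
            obj += np.sum((q[t] - r_t) ** 2)
            grad[t] += 2.0 * (q[t] - r_t)
        for t, tp, sim in edges:                   # similarity factors
            diff = q[t] - q[tp]
            obj += mu * sim * np.sum(diff ** 2)
            grad[t] += 2.0 * mu * sim * diff
            grad[tp] -= 2.0 * mu * sim * diff
        return obj, grad.ravel()

    x0 = np.full(n_types * n_labels, 1.0 / n_labels)
    bounds = [(0.0, None)] * (n_types * n_labels)  # q >= 0; no normalization
    res = minimize(objective, x0, jac=True, method="L-BFGS-B", bounds=bounds)
    return res.x.reshape(n_types, n_labels)
```

Note that nothing constrains each q_t to sum to one: dropping normalization is exactly what lets a generic box-constrained solver replace the specialized iterative methods of past work.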

Experiment 1 (see the paper).

Experiment 2: Semantic Frames. Types: word plus POS. Labels: 877 frames from FrameNet. Empirical distributions: from 3,256 sentences in the FrameNet 1.5 release. Graph: 64,480 vertices (see Das & Smith, 2011). Evaluation: use the induced lexicon to constrain frame analysis of unknown predicates on a 2,420-sentence test set: 1. label words with frames; 2. then find arguments (semantic roles).

Frame Identification.
Model | Unknown predicates, partial-match F1 | Lexicon size
supervised (Das et al., 2010) | 46.62 | —
normalized Gaussian (Das & Smith, 2011) | 62.35 | 129K
measure propagation | 60.07 | 129K
UGF-L2 | 60.81 | 129K
UGF-L1 | 62.85 | 123K
UGF-L1,2 | 62.85 | 129K
UJSF-L2 | 62.81 | 128K
UJSF-L1 | 62.43 | 129K
UJSF-L1,2 | 65.29 | 46K

Learned Frames (UJSF-L1,2).
discrepancy/n: SIMILARITY, NON-COMMUTATIVE-STATEMENT, NATURAL-FEATURES
contribution/n: GIVING, COMMERCE-PAY, COMMITMENT, ASSISTANCE, EARNINGS-AND-LOSSES
print/v: TEXT-CREATION, STATE-OF-ENTITY, DISPERSAL, CONTACTING, READING
mislead/v: PREVARICATION, EXPERIENCER-OBJ, MANIPULATE-INTO-DOING, REASSURING, EVIDENCE
abused/a: (Our models can assign q_t = 0.)
maker/n: MANUFACTURING, BUSINESSES, COMMERCE-SCENARIO, SUPPLY, BEING-ACTIVE
inspire/v: CAUSE-TO-START, SUBJECTIVE-INFLUENCE, OBJECTIVE-INFLUENCE, EXPERIENCER-OBJ, SETTING-FIRE
failed/a: SUCCESSFUL-ACTION, SUCCESSFULLY-COMMUNICATE-MESSAGE
(Frames shown in blue on the original slide were correct.)

Frame Parsing (Das, 2012).
Model | Unknown predicates, partial-match F1
supervised (Das et al., 2010) | 29.20
normalized Gaussian (Das & Smith, 2011) | 42.71
measure propagation | 41.41
UGF-L2 | 41.97
UGF-L1 | 42.58
UGF-L1,2 | 42.58
UJSF-L2 | 43.91
UJSF-L1 | 42.29
UJSF-L1,2 | 46.75

Example. REASON (role: Action): Discrepancies between North Korean declarations and IAEA inspection findings indicate that North Korea might have reprocessed enough plutonium for one or two nuclear weapons.

Example. SIMILARITY (role: Entities): Discrepancies between North Korean declarations and IAEA inspection findings indicate that North Korea might have reprocessed enough plutonium for one or two nuclear weapons.

SEMAFOR: http://www.ark.cs.cmu.edu/semafor. The current version (2.1) incorporates the expanded lexicon. To hear about algorithmic advances in SEMAFOR, see our *SEM talk, 2pm Friday.

Conclusions. A general family of graph-based semi-supervised learning objectives. Key technical ideas: don't require normalized measures; encourage (local) sparsity; use general optimization methods.

Thanks!