smart reply and implicit semantics Matthew Henderson and Brian Strope Google AI


collaborators include: Rami Al-Rfou, Yun-hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, Ray Kurzweil, and many others

Machine learning works when it generalizes to things unseen in training.

training:
  How do you commute to work? -> I ride my bike.
  What's your favorite color? -> I like red.

testing:
  Do you like red bikes? ->

generalization strategies

explicit semantics: discrete frames, slots and values
implicit semantics: continuous vectors

explicit semantics
  specified by humans (often for a task)
  debuggable
  fundamental to understanding (?)

implicit semantics
  not specified; derived during training; emergent
  natural efficiency for compression and generalization (?)

explicit and implicit semantics: analog / digital

(explicit) discrete digital text input -> discrete logical computation -> discrete digital text output
(implicit) discrete digital text input -> continuous high-dim analog computation -> discrete digital text output

signal processing (the opposite)

real world -> continuous analog input -> discrete computation -> continuous analog output -> real world

implicit semantics -- why go back to continuous?

digital channel -> implicit semantics -> digital channel
discrete digital text input -> continuous high-dim analog computation -> discrete digital text output

channels of communication are digital

ideas (continuous high-dim analog thinking) -> digital channel (discrete digital text input) -> implicit semantics (continuous high-dim analog computation) -> digital channel (discrete digital text output) -> ideas (continuous high-dim analog thinking)

ideas, and even reasoning, can be continuous

training task with semantic pressure: next sentence prediction, reply prediction

input: I saw a really good band last night.

candidate next sentences:
  They played upbeat dance music.   (the observed continuation)
  It often rains in the winter.
  On Thursdays we like to go out.
  The tree looks good to me.
  Did you get a new car?
  My son likes to windsurf.
  Looking forward to lunch.

an initial application: smart reply

Smart Reply for Inbox & Gmail

a feature that suggests short responses to emails
the initial system used an LSTM to read the input email and did a beam search over the whitelist
we measure 'suggest conversion': the percentage of times shown suggestions are clicked

The direct smart reply system

trained to give a high score to the response found in the data and a low score to random responses
the final score of an input email and a response is the dot product of two vectors
an N-best list is selected from the whitelist of responses

Training a dot-product model

the network encodes a batch of input emails to vectors x_1, x_2, ..., x_N and the corresponding responses to vectors y_1, y_2, ..., y_N

[figure: the N x N matrix of batch scores x_i . y_j]

Training a dot-product model (continued)

the N x N matrix of all scores is a fast matrix product
10% absolute improvement in 1-of-100 ranking accuracy over binary classification
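The 1-of-100 ranking accuracy mentioned above can be computed directly from the batch score matrix. A minimal NumPy sketch, with random vectors standing in for the real encoders (names and shapes are illustrative assumptions, not the production setup):

# Sketch: "1 of 100" ranking accuracy for a batch of 100 (email, response)
# pairs, given encoded vectors X and Y. Random encodings are stand-ins.
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 64                 # batch size, embedding dimension
X = rng.normal(size=(N, d))    # email encodings, one row per email
Y = rng.normal(size=(N, d))    # response encodings; row i is the true reply to email i

S = X @ Y.T                    # N x N score matrix in one matrix product: S[i, j] = x_i . y_j

# An email counts as correct if its true response (the diagonal entry)
# outranks the other 99 responses in the batch.
accuracy_1_of_100 = np.mean(S.argmax(axis=1) == np.arange(N))
print(accuracy_1_of_100)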

x_i = DNN(n-grams of email i)
y_i = DNN(n-grams of response i)
S_ij = x_i . y_j

P(response_j | email_i) = e^{S_ij} / Σ_j e^{S_ij}

-log P(example_i) = -S_ii + log Σ_j e^{S_ij}    ("dot product loss")
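The loss above is just a softmax over the batch scores. A small NumPy sketch of the formula, again with random vectors stubbing out the DNN encoders (this illustrates the equation, not the production training code):

# Sketch of the batch softmax ("dot product") loss written out above.
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8
X = rng.normal(size=(N, d))    # x_i = DNN(n-grams of email i), stubbed with random vectors
Y = rng.normal(size=(N, d))    # y_i = DNN(n-grams of response i), stubbed

S = X @ Y.T                    # S[i, j] = x_i . y_j

# -log P(example_i) = -S_ii + log sum_j exp(S_ij)
log_norm = np.log(np.exp(S).sum(axis=1))     # naive log-sum-exp; use a numerically stable version in practice
loss_per_example = -np.diag(S) + log_norm
loss = loss_per_example.mean()
print(loss)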

Precomputation for the dot product model

the representations Y of the whitelist of responses can be precomputed
for an input email encoded as x, the scores are x . Y
approximate nearest neighbor search can speed up the top-N search
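A sketch of what serving looks like under these assumptions: the whitelist encodings Y are computed offline, the incoming email is encoded once, and one matrix-vector product plus a top-N selection produces the candidates. The brute-force argpartition step below is the exact-search baseline that an approximate nearest neighbor library would replace:

# Sketch of serving with a precomputed whitelist (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
d, W = 64, 20000
Y = rng.normal(size=(W, d))    # precomputed encodings of the whitelist responses
x = rng.normal(size=(d,))      # encoding of the incoming email, computed at request time

scores = Y @ x                 # dot product of x with every whitelist response
top_n = 10
candidates = np.argpartition(-scores, top_n)[:top_n]        # indices of the 10 best responses (unsorted)
candidates = candidates[np.argsort(-scores[candidates])]    # sort the shortlist by score
print(candidates, scores[candidates])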

Multi-loss dot product model

each input feature (email body, subject line) predicts the response on its own, then the predictions are combined into a final score
originally used to inspect the importance of each feature, but it also gives extra depth and hierarchy
10% absolute improvement in 1-of-100 ranking accuracy over concatenating input features and using a single loss
the whitelist representations Y are still precomputed
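A rough sketch of the multi-loss structure under stated assumptions: each feature gets its own encoder and its own dot-product loss against the response, and a combined representation adds a final loss on top. The random linear maps below stand in for the real DNN encoders, and the exact combination used in the production system is not specified in the talk:

# Sketch of a multi-loss dot-product model (stubbed encoders, illustrative shapes).
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d = 5, 32, 16
body = rng.normal(size=(N, d_in))       # bag-of-n-grams features for email bodies (stub)
subject = rng.normal(size=(N, d_in))    # features for subject lines (stub)
resp = rng.normal(size=(N, d_in))       # features for the true responses (stub)

W_body, W_subj, W_resp = (rng.normal(size=(d_in, d)) for _ in range(3))
W_comb = rng.normal(size=(2 * d, d))    # combines the per-feature encodings

def dot_product_loss(X, Y):
    S = X @ Y.T
    return np.mean(-np.diag(S) + np.log(np.exp(S).sum(axis=1)))

x_body, x_subj, y = body @ W_body, subject @ W_subj, resp @ W_resp
x_final = np.concatenate([x_body, x_subj], axis=1) @ W_comb    # extra depth on top of the feature encoders

# one loss per feature, plus a loss on the combined representation
total_loss = (dot_product_loss(x_body, y)
              + dot_product_loss(x_subj, y)
              + dot_product_loss(x_final, y))
print(total_loss)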

Latency

LSTM: 5x latency. Beam search over a prefix trie of the whitelist.
DNN: 0.1x. Score everything on the whitelist with a fully-connected DNN.
Dot product + DNN: 0.02x. Use the dot product model as a first pass to select 100, then score with the DNN.
Dot product only: Use the improved multi-loss dot product model in one pass of scoring.
Approximate search: 0.01x. Speed up the top-N search in dot product space using an efficient nearest neighbor search.

(non-LSTM systems can achieve suggest conversion around 4% higher than the LSTM)
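The "Dot product + DNN" row is a two-stage cascade. A small sketch of that idea, with a random-weight function standing in for the fully-connected DNN scorer (the candidate count of 100 comes from the slide; everything else here is an illustrative assumption):

# Sketch of a two-stage cascade: cheap dot-product first pass, heavier rescoring of the shortlist.
import numpy as np

rng = np.random.default_rng(0)
d, W = 64, 20000
Y = rng.normal(size=(W, d))    # precomputed whitelist encodings
x = rng.normal(size=(d,))      # encoded input email

# first pass: dot products against the whole whitelist, keep the best 100
first_pass = Y @ x
shortlist = np.argpartition(-first_pass, 100)[:100]

# second pass: rescore only the shortlist with a (stubbed) DNN scorer
W1 = rng.normal(size=(2 * d, 32))
w2 = rng.normal(size=(32,))
def dnn_score(x_vec, y_vec):
    h = np.tanh(np.concatenate([x_vec, y_vec]) @ W1)
    return h @ w2

second_pass = np.array([dnn_score(x, Y[j]) for j in shortlist])
best = shortlist[np.argsort(-second_pass)[:3]]    # final top-3 suggestions
print(best)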

Response biases

example input: "Thank you so much for the wonderful gifts."
example suggestions: "Glad you liked the gifts." / "Our pleasure!" / "You are very welcome!" / "You're welcome!" / "Thank you!"

the initial direct system got about half the number of clicks of the LSTM baseline
a language model bias improves clicks
a probability-of-click model on actual smart reply emails helps more
the combinations improve click rate above the LSTM baseline
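The talk does not give the exact form of these biases, so the following is only a generic sketch of combining the match score with response-level prior terms; both bias terms and the mixing weights alpha and beta are assumptions that would be tuned on held-out click data:

# Sketch only: rerank whitelist responses with assumed bias terms.
import numpy as np

rng = np.random.default_rng(0)
W = 20000
match_scores = rng.normal(size=(W,))        # dot-product scores S(x, y) for the whitelist (stub)
log_p_lm = rng.normal(size=(W,)) - 5.0      # language-model log-probability of each response (stub)
log_p_click = rng.normal(size=(W,)) - 2.0   # log probability-of-click of each response (stub)

alpha, beta = 0.5, 0.5                      # assumed mixing weights
final_scores = match_scores + alpha * log_p_lm + beta * log_p_click
print(np.argsort(-final_scores)[:3])        # top-3 biased suggestions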

Quality and latency progress

[chart over 8 weeks: latency relative to LSTM (%) vs. conversion rate relative to LSTM (%); milestones: dot product & exhaustive search, tokenization bug, response bias from LM, multiloss DNN, p(click), improved LSTM, efficient search]

conclusions

implicitly semantic representations are useful
beam search isn't always necessary (simple works too)
having user quality signals (like clicks) can be very helpful

Thank you!