Statistical Ranking Problem

Similar documents
Introduction to Logistic Regression

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

10-701/ Machine Learning - Midterm Exam, Fall 2010

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Log-Linear Models, MEMMs, and CRFs

Gradient Descent. Sargur Srihari

Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Logistic Regression Trained with Different Loss Functions. Discussion

Tuning as Linear Regression

Empirical Risk Minimization

Logistic Regression Logistic

Semestrial Project - Expedia Hotel Ranking

Information Retrieval and Organisation

Linear Models in Machine Learning

Modeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop

Ad Placement Strategies

Kernel Logistic Regression and the Import Vector Machine

Machine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

Undirected Graphical Models

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

CS60021: Scalable Data Mining. Large Scale Machine Learning

Training the linear classifier

Linear discriminant functions

Expectation Maximization (EM)

Large-scale Collaborative Ranking in Near-Linear Time

Graphical Models for Collaborative Filtering

Ad Placement Strategies

Topic 3: Neural Networks

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010

Machine Learning, Fall 2012 Homework 2

Linear Regression. Volker Tresp 2018

Relative Loss Bounds for Multidimensional Regression Problems

Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm

Lecture 5 Neural models for NLP

Linear Regression (9/11/13)

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Big Data Analytics. Lucas Rego Drumond

Name: Student number:

Lecture 5: Logistic Regression. Neural Networks

Variable Selection in Data Mining Project

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

Midterm exam CS 189/289, Fall 2015

Linear classifiers: Overfitting and regularization

Computational Oracle Inequalities for Large Scale Model Selection Problems

Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms

Homework #3 RELEASE DATE: 10/28/2013 DUE DATE: extended to 11/18/2013, BEFORE NOON QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FORUM.

Variational Decoding for Statistical Machine Translation

Web Search and Text Mining. Learning from Preference Data

Linear Classifiers IV

Gradient Boosting (Continued)

Linear & nonlinear classifiers

Linear Regression. Volker Tresp 2014

Machine Learning. Linear Models. Fabio Vandin October 10, 2017

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Neural Network Training

Decoupled Collaborative Ranking

Linear Discrimination Functions

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression

Multiclass Classification-1

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Collaborative topic models: motivations cont

Homework 4. Convex Optimization /36-725

Outline: Ensemble Learning. Ensemble Learning. The Wisdom of Crowds. The Wisdom of Crowds - Really? Crowd wiser than any individual

CSC 411 Lecture 6: Linear Regression

Large Scale Machine Learning with Stochastic Gradient Descent

Classification Logistic Regression

Stochastic Gradient Descent. CS 584: Big Data Analytics

Discriminative Training. March 4, 2014

Statistical Machine Learning from Data

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

An Introduction to Machine Learning

Variations of Logistic Regression with Stochastic Gradient Descent

Online Social Networks and Media. Link Analysis and Web Search

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

Fast Logistic Regression for Text Categorization with Variable-Length N-grams

Presentation in Convex Optimization

SVRG++ with Non-uniform Sampling

Machine Learning 2017

Classification. Classification is similar to regression in that the goal is to use covariates to predict on outcome.

Today. Calculus. Linear Regression. Lagrange Multipliers

Lecture 7 Introduction to Statistical Decision Theory

CSE546 Machine Learning, Autumn 2016: Homework 1

CS 6375 Machine Learning

Coordinate Descent and Ascent Methods

RegML 2018 Class 2 Tikhonov regularization and kernels

Warm up: risk prediction with logistic regression

ECS289: Scalable Machine Learning

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Linear Models for Regression CS534

Lecture 11. Linear Soft Margin Support Vector Machines

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

Machine Learning for NLP

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Maximum Margin Matrix Factorization for Collaborative Ranking

Low-Dimensional Discriminative Reranking. Jagadeesh Jagarlamudi and Hal Daume III University of Maryland, College Park

FORMULATION OF THE LEARNING PROBLEM

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Transcription:

Statistical Ranking Problem Tong Zhang Statistics Department, Rutgers University

Ranking Problems Rank a set of items and display to users in corresponding order. Two issues: performance on top and dealing with large search space. web-page ranking rank pages for a query theoretical analysis with error criterion focusing on top machine translation rank possible (English) translations for a given input (Chinese) sentence algorithm handling large search space 1

Web-Search Problem User types a query, search engine returns a result page: selects from billions of pages. assign a score for each page, and return pages ranked by the scores. Quality of search engine: relevance (whether returned pages are on topic and authoritative) other issues presentation (diversity, perceived relevance, etc) personalization (predict user specific intention) coverage (size and quality of index). freshness (whether contents are timely). responsiveness (how quickly search engine responds to the query). 2

Relevance Ranking: Statistical Learning Formulation Training: randomly select queries q, and web-pages p for each query. use editorial judgment to assign relevance grade y(p, q). construct a feature x(p, q) for each query/page pair. learn scoring function ˆf(x(p, q)) to preserve the order of y(p, q) for each q. Deployment: query q comes in. return pages p 1,..., p m in descending order of ˆf(x(p, q)). 3

Measuring Ranking Quality Given scoring function ˆf, return ordered page-list p 1,..., p m for a query q. only the order information is important. should focus on the relevance of returned pages near the top. DCG (discounted cumulative gain) with decreasing weight c i DCG( ˆf, q) = m c i r(p i, q). j=1 c i : reflects effort (or likelihood) of user clicking on the i-th position. 4

Subset Ranking Model x X : feature (x(p, q) X ) S S: subset of X ({x 1,..., x m } = {x(p, q) : p} S) each subset corresponds to a fixed query q. assume each subset of size m for convenience: m is large. y: quality grade of each x X (y(p, q)). scoring function f : X S R. ranking function r f (S) = {j i }: ordering of S S based on scoring function f. quality: DCG(f, S) = m i=1 c ie yji (x ji,s) y ji. 5

Some Theoretical Questions Learnability: subset size m is huge: do we need many samples (rows) to learn. focusing quality on top. Learning method: regression. pair-wise learning? other methods? Limited goal to address here: can we learn ranking by using regression when m is large. massive data size (more than 20 billion) want to derive: error bounds independent of m. what are some feasible algorithms and statistical implications. 6

Bayes Optimal Scoring Given a set S S, for each x j S, we define the Bayes-scoring function as f B (x j, S) = E yj (x j,s) y j The optimal Bayes ranking function r fb that maximizes DCG induced by f B returns a rank list J = [j 1,..., j m ] in descending order of f B (x ji, S). not necessarily unique (depending on c j ) The function is subset dependent: require appropriate result set features. 7

Simple Regression Given subsets S i {y i,1,..., y i,m }. = {x i,1,..., x i,m } and corresponding relevance score We can estimate f B (x j, S) using regression in a family F: ˆf = arg min f F n i=1 m (f(x i,j, S i ) y i,j ) 2 j=1 Problem: m is massive (> 20 billion) computationally inefficient statistically slow convergence ranking error bounded by O( m) root-mean-squared-error. Solution: should emphasize estimation quality on top. 8

Importance Weighted Regression Some samples are more important than other samples (focus on top). A revised formulation: ˆf = arg minf F 1 n n i=1 L(f, S i, {y i,j } j ), with L(f, S, {y j } j ) = m j=1 w(x j, S)(f(x j, S) y j ) 2 + u sup j w (x j, S)(f(x j, S) δ(x j, S)) 2 + weight w: importance weighting focusing regression error on top zero for irrelevant pages weight w : large for irrelevant pages for which f(x j, S) should be less than threshold δ. importance weighting can be implemented through importance sampling. 9

Relationship of Regression and Ranking Let Q(f) = E S L(f, S), where L(f, S) = E {yj } j SL(f, S, {y j } j ) m = w(x j, S)E yj (x j,s) (f(x j, S) y j ) 2 + u sup w (x j, S)(f(x j, S) δ(x j, S)) 2 +. j j=1 Theorem 1. Assume that c i = 0 for all i > k. Under appropriate parameter choices with some constants u and γ, for all f: DCG(r B ) DCG(r f ) C(γ, u)(q(f) inf f Q(f )) 1/2. Key point: focus on relevant documents on top; j w(x j, S) is much smaller than m. 10

Generalization Performance with Square Regularization Consider scoring f ˆβ(x, S) = ˆβ T ψ(x, S), with feature vector ψ(x, S): L(β, S, {y j } j ) = ˆβ = arg min β H m j=1 [ 1 n ] n L(β, S i, {y i,j } j ) + λβ T β, (1) i=1 w(x j, S)(f β (x j, S) y j ) 2 + u sup j w (x j, S)(f β (x j, S) δ(x j, S)) 2 +. Theorem 2. Let M = sup x,s φ(x, S) 2 and W = sup S [ x j S w(x j, S) + u sup xj S w (x j, S)]. Let f ˆβ be the estimator defined in (1). Then we have DCG(r B ) E {Si,{y i,j } j } n i=1 DCG(r f ˆβ) C(γ, u) [ ( 1 + W M ) 2 inf (Q(f β) + λβ T β) inf 2λn β H f Q(f) ] 1/2. 11

Interpretation of Results Result does not depend on m, but the much smaller quantity quantity W = sup S [ x j S w(x j, S) + u sup xj S w (x j, S)] emphasize relevant samples on top: w is small for irrelevant documents. a refined analysis can replace sup over S by some p-norm over S. Can control generalization for the top portion of the rank-list even with large m. learning complexity does not depend on the majority of items near the bottom of the rank-list. the bottom items are usually easy to estimate. 12

Key Points Ranking quality near the top is most important statistical analysis to deal with the scenario Regression based algorithm to handle large search space importance weighting of regression terms error bounds independent of the massive web-size. 13

Statistical Translation and Algorithm Challenge Problem: conditioned on a source sentence in one language. generate a target sentence in another language. General approach: scoring function: measuring the quality of translated sentence based on the source sentence (similar to web-search) search strategy: effectively generate target sentence candidates. search for the optimal score. structure used in the scoring function (through block model). Main challenge: exponential growth of search space 14

Graphical illustration b 4 airspace Lebanese violate warplanes Israeli b 2 b 1 b 3 A l T A } r A t A l H r b y P A l A s r A } t n t h k A l m j A l A l j w y A l l b n A n y y l y P 15

Block Sequence Decoder Database: a set of possible translation blocks e.g. block a b translates into potential block z x y. Scoring function: candidate translation: a block sequence (b 1,..., b n ) map each sequence of blocks into a non-negative score s w (b n 1) = n i=1 wt F (b i 1, b i ; o). Input: source sentence. Translation: generate block sequence consistent with the source sentence. find the sequence with largest score. 16

Decoder Training Given source/target sentence pairs {(S i, T i )} i=1,...,n. Given a decoding scheme implementing: ẑ(w, S) = arg max z V (S) s w (z). Find parameter w such that on the training data: ẑ(w, S i ) achieves high BLEU score on average. Key difficulty: large search space (similar issues in MLR) Traditional tuning. Our methods: local training. global training. 17

Traditional scoring function A bunch of local statistics gathered from the training data, and beyond. different language models of the target P (T i T i 1, T i 2...). can depend on statistics beyond training data. Google benefited significantly from huge language models. orientation (swap) frequency. block frequency. block quality score: e.g. normalized in-block unigram translation probability: (S, T ) = ({s 1,..., s p }, {t 1,..., t q }); P (S T ) = j n j i p(s j t i ). Linear combination of log-frequencies (five or six features): s w ({b n 1}) = i j w j log f j (b i ; b i 1, ) Tuning: hand-tuning; gradient descent adjustment to optimize BLEU score 18

Large scale training Motivation: want: the ability of incorporating large number of features. require: automated training method to optimize millions of parameters. similar to the transition from inktomi rank function tuning to MLR. Challenges: search space V (S) is too large. Need to break it down. direct global training using relevant set model: treats decoder as a blackbox. 19

Global training of decoder parameter Treat decoder as a black box, implementing: ẑ(w, S) = arg max z V (S) s w (z). No need to know V (S). Generate truth set as block sequences with K-largest blue scores: V K (S). Generate alternatives as a subset of V (S) that are most relevant. if w does well on the relevant alternatives, then it does well on the whole set V (S). related to active sampling procedure in MLR. Some related works in parsing exist. 20

Learning method Try to minimize the following regularized empirical risk minimization: [ ] 1 N ŵ = arg min Φ(w, V K (S i ), V (S i )) + λw 2 w m i=1 Φ(w, V K, V ) = 1 K z V K max z V V K ψ(w, z, z ), ψ(w, z, z ) = φ(s w (z), Bl(z); s w (z ), Bl(z )), Relevant set: let ξ i (w, z) = max z V (S i ) V K (S i ) ψ(w, z, z ) V (r) (ŵ, S i ) = {z V (S i ) : z V K (S i ), ξ i (ŵ, z) 0 & ψ(ŵ, z, z ) = ξ i (ŵ, z)}. Key observation: can replace V by V (r) without changing solution. 21

Example loss functions Truth z V K, alternative z V V K (or in V (r) ). Bl(z) > Bl(z ): want to penalize if score s w (z) s w (z ). Estimate s w (z) using least squares (consistent): φ(s w (z), Bl(z); s w (z ), Bl(z )) = α(s w (z) Bl(z)) 2 + α (s w (z ) Bl(z )) 2. does not do well in our experiments. Possible reasons: we did implement correctly (re-weighting); s w (z) cannot approximate Bl(z) very well. Estimate s w (z) using pair-wise loss (inconsistent): φ(s w (z), Bl(z); s w (z ), Bl(z )) = (Bl(z) Bl(z )) max(1 (s w (z) s w (z )), 0) 2. 22

Approximate Relevant Set Method Observation: relevant set depends on w; w can be calculated based on approximate relevant set. Iterate to find both (similar to the active learning procedure in MLR). fix V (r), update w. fix w, update V (r). 23

Table 1: Generic Relevant Set Algorithm divide training points into m blocks J 1,..., J m initialize weight vector w w 0 for each data point S initialize truth V K (S) and alternative V (r) (S) for l = 1,, L for j = 1,, m for each S J j select relevant points { z k } S (*) update V (r) (S) V (r) (S) { z k } update w by solving the following approximately (**) N i=1 Φ(w, V K(S i ), V (r) (S i )) + λw 2 min w 1 m 24

Convergence analysis Approximate relevant set size: relevant set update rule (*): z k V (S) V k (S) for each z k V K (S) (k = 1,..., K) such that ψ(w, z k, z k ) = max ψ(w, z k, z ) z V (S) V K (S) convergence bound independent of the size of V (S). generalization bound: independent of size of V (S). real implementation using decoder output: z = max z V (S) s w (z ) Weight update rule in (**): can be implemented using stochastic gradient descent. 25

Experiments (MT03 Arabic-English DARPA evaluation) 0.5 0.45 dj363.bleutest dl363.bleutest dj363.bleutraining dl363.bleutraining 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 5 10 15 20 25 Figure 1: BLEU score on test and training data as a function of the number of training iterations. 26

Translation results (25 training iterations) Model training test MON-PHR 0.3661 0.261 MON 0.4773 0.359 SWAP 0.4741 0.362 Using only simple binary-features without language models, etc (never done before in SMT) MON-PHR: phrase id based features; MON/SWAP: plus internal word features with and without swapping. Context: many published grammar based methods, in the.20s. state of the art (with many additional engineering tuning): Google with huge language model (around.50), IBM (around.45). 27

Remarks Method to handle large search space for arbitrary decoding procedure key: restrict search space dependency through a loss function of the form max z V (S) A φ(z, ). z achieved at a small relevant set (can ignore the non-relevant points) using decoder output to approximate relevant set. automatic construction of most relevant alternatives. 28