Machine Learning for Computational Advertising

L1: Basics and Probability Theory
Alexander J. Smola, Yahoo! Labs, Santa Clara, CA 95051, alex@smola.org
UC Santa Cruz, April 2009

Overview

L1: Machine learning and probability theory
Introduction to pattern recognition, classification, regression, novelty detection, probability theory, Bayes rule, density estimation

L2: Instance based learning
Nearest neighbor, kernel density estimation, Nadaraya-Watson estimator, cross-validation

L3: Linear models
Hebb's rule, perceptron algorithm, regression, classification, feature maps

L1: Introduction to Machine Learning

Data: texts, images, vectors, graphs
What to do with data: unsupervised learning (clustering, embedding, etc.), classification, sequence annotation, regression, novelty detection
Statistics and probability theory: probability of an event; dependence, independence, conditional probability; Bayes rule, hypothesis testing
Density estimation: empirical frequency, bin counting, priors and the Laplace rule

Outline

1 Data
2 Data Analysis: Unsupervised Learning, Supervised Learning

Data

Vectors: collections of features, e.g. income, age, household size, IP numbers, ...; categorical variables can be mapped into vectors
Matrices: images, movies, preferences (see collaborative filtering)
Strings: documents (web pages), headers, URLs, call graphs
Structured objects: XML documents, graphs (instant messaging, link structure, tags, ...)

Example data sets (figures): optical character recognition, the Reuters database, faces, and graphs.

Missing Variables

Incomplete data:
Measurement devices may fail, e.g. dead pixels on a camera, failed servers, incomplete forms, ...
Measuring things may be expensive: no need to compute all features for spam if we already know the answer
Data may be censored: some users are willing to share more data than others ...

How to fix it:
Clever algorithms (not this course ...)
Simple mean imputation: substitute the average from the other observations; works reasonably well (see the sketch below)
Alternatively, split features ...
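
As a concrete illustration of the mean-imputation fix above, here is a minimal Python sketch; the toy feature matrix and the use of NaN to mark missing entries are assumptions made for this example.

```python
import numpy as np

# Toy feature matrix: rows are users, columns are features
# (income, age, household size); NaN marks a missing measurement.
X = np.array([[52000.0, 34.0, 2.0],
              [61000.0, np.nan, 3.0],
              [np.nan, 45.0, np.nan],
              [48000.0, 29.0, 1.0]])

# Mean imputation: replace each missing entry by the average
# of the observed values in the same column.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

print(X_imputed)
```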

Mini Summary

Data types:
Vectors (feature sets, bag of words)
Matrices (photos, dynamical systems, controllers)
Strings (texts, tags)
Structured documents (XML, HTML, collections)
Graphs (web, IM networks)

Problems and opportunities:
Data may be incomplete (use mean imputation)
Data may come from different sources (adapt the model)
Data may be biased (e.g. it is much easier to get blood samples cheaply from university students)
The problem may be ill defined, e.g. "find information" (we need to find out what the user really needs)
The environment may react to intervention (butterfly portfolios in stock markets)

Outline

1 Data
2 Data Analysis: Unsupervised Learning, Supervised Learning

What to do with data

Unsupervised learning:
Find clusters in the data
Find a low-dimensional representation of the data (e.g. unroll a swiss roll, find structure)
Find interesting directions in the data
Find interesting coordinates and correlations
Find novel observations / clean the database

Supervised learning:
Classification (distinguish apples from oranges)
Speech recognition
Regression (advertising price)
Predict time series (sales forecast)
Annotate strings (named entities)

Examples of unsupervised learning (figures): clustering, principal components, linear subspaces.

Classification

Data: pairs of observations (x_i, y_i) drawn from a distribution, e.g. (query, ad), (credit card transaction, fraud), (page, spam).
Goal: estimate y ∈ {±1} given x at a new location, or find a function f(x) that does the trick.


Regression

Data: pairs of observations (x_i, y_i) generated from some joint distribution Pr(x, y), e.g.
(query, ad) → bid price
demographics → expected number of viewers
web page → quality score (PageRank-ish)

Task: estimate y given x such that some loss c(x, y, f(x)) is minimized.

Examples:
Quadratic error between y and f(x), i.e. c(x, y, f(x)) = (1/2)(y - f(x))^2.
Absolute value, i.e. c(x, y, f(x)) = |y - f(x)|.
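
Both loss functions are easy to write down directly; a short sketch comparing them on a single toy prediction (the numbers are assumed for illustration):

```python
def quadratic_loss(y, f_x):
    """c(x, y, f(x)) = 1/2 * (y - f(x))**2: penalizes large errors heavily."""
    return 0.5 * (y - f_x) ** 2

def absolute_loss(y, f_x):
    """c(x, y, f(x)) = |y - f(x)|: grows linearly, more robust to outliers."""
    return abs(y - f_x)

# Example: true bid price 2.50, predicted 3.10.
y, f_x = 2.50, 3.10
print(quadratic_loss(y, f_x))  # ~0.18
print(absolute_loss(y, f_x))   # ~0.60
```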

Novelty Detection

Data: observations x_i from some distribution P(x), e.g. network usage patterns, advertiser load in a server center, a query / news stream.
Task: find unusual events, clean the database, identify typical examples, keep the network safe.

Eliza

Rule based AI dialog system, vintage 1970.

Eliza: I am the psychotherapist. Please, describe your problems.
User: I am here to learn about giving good talks.
Eliza: Why do you say that?
User: Because I'm interested in learning that.
Eliza: Is it because you are interested in learning that that you came to me?
User: Yes, indeed.
Eliza: Earlier you said you are interested in learning that?
User: Sure, you stupid computer, this is what I want!
Eliza: Maybe your plans have something to do with this.
User: Dammit, sure, that's what I'm trying to say.
Eliza: I don't understand.
User: Because you're stupid.
Eliza: Does it bother you that I am stupid?
User: Indeed, it does.
Eliza: You mentioned that you are interested in learning that?
User: That's true. And now I want to relax a bit before giving a talk.

Try it out in Emacs using M-x doctor.

How the brain doesn't work (figure).

Mini Summary

Structure extraction:
Clustering
Low-dimensional subspaces
Low-dimensional representations of data

Novelty detection:
Find typical observations (Joe Sixpack)
Find highly unusual ones (oddballs)
Database cleaning

Supervised learning:
Regression
Classification
Preference relationships (recommender systems)

Statistics and Probability Theory

Why do we need it?
We deal with uncertain events
We need a mathematical formulation for probabilities
We need to estimate probabilities from data (e.g. for coin tosses we only observe the numbers of heads and tails, not whether the coin is really fair)

How do we use it?
Statements about the probability that an object is an apple (rather than an orange)
The probability that two things happen at the same time
Finding unusual events (= low-density events)
Conditional events (e.g. what happens if A, B, and C are true)

Probability

Basic idea: events X are subsets of a space 𝒳 of possible outcomes; Pr(X) tells us how likely it is that an outcome x ∈ X will occur.

Basic axioms:
Pr(X) ∈ [0, 1] for all X ⊆ 𝒳
Pr(𝒳) = 1
Pr(∪_i X_i) = Σ_i Pr(X_i) if X_i ∩ X_j = ∅ for all i ≠ j

Simple corollary: Pr(X ∪ Y) = Pr(X) + Pr(Y) - Pr(X ∩ Y).
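
A quick numeric sanity check of the corollary, using one roll of a fair die as the outcome space; the specific events below are assumptions chosen for illustration.

```python
from fractions import Fraction

# Outcome space for one roll of a fair die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}

def pr(event):
    return Fraction(len(event), len(omega))

X = {1, 2, 3}   # "roll at most 3"
Y = {2, 4, 6}   # "roll an even number"

# Inclusion-exclusion: Pr(X ∪ Y) = Pr(X) + Pr(Y) - Pr(X ∩ Y)
lhs = pr(X | Y)
rhs = pr(X) + pr(Y) - pr(X & Y)
print(lhs, rhs, lhs == rhs)   # 5/6 5/6 True
```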


Multiple Variables

Two sets: assume that x and y are drawn from a probability measure on the product space of 𝒳 and 𝒴; consider the space of events (x, y) ∈ 𝒳 × 𝒴.

Independence: if x and y are independent, then for all X ⊆ 𝒳 and Y ⊆ 𝒴,
Pr(X, Y) = Pr(X) · Pr(Y).
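
Computationally, independence means the joint table factorizes into the outer product of its marginals; a minimal sketch with assumed toy numbers:

```python
import numpy as np

# Joint probability tables Pr(x, y) for two binary variables
# (rows index x, columns index y).
joint_indep = np.array([[0.12, 0.28],   # equals the outer product of its marginals
                        [0.18, 0.42]])
joint_dep = np.array([[0.30, 0.10],     # does not factorize
                      [0.10, 0.50]])

def is_independent(joint, tol=1e-9):
    px = joint.sum(axis=1)              # marginal Pr(x)
    py = joint.sum(axis=0)              # marginal Pr(y)
    return np.allclose(joint, np.outer(px, py), atol=tol)

print(is_independent(joint_indep))  # True
print(is_independent(joint_dep))    # False
```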

Figures: samples from independent vs. dependent random variables.

Bayes Rule

Dependence and conditional probability: typically, knowing x will tell us something about y (think regression or classification). We have
Pr(Y | X) Pr(X) = Pr(Y, X) = Pr(X | Y) Pr(Y),
and hence Pr(Y, X) ≤ min(Pr(X), Pr(Y)).

Bayes rule:
Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y).

Proof using conditional probabilities:
Pr(X, Y) = Pr(X | Y) Pr(Y) = Pr(Y | X) Pr(X).
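
The identity can be checked numerically on the same kind of joint table; the numbers below are assumptions for illustration, not data from the lecture.

```python
import numpy as np

# Joint distribution Pr(X, Y) over two binary variables.
joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])
pr_x = joint.sum(axis=1)                 # Pr(X)
pr_y = joint.sum(axis=0)                 # Pr(Y)

# Conditionals from the definition of conditional probability.
pr_x_given_y = joint / pr_y              # Pr(X | Y), columns sum to 1
pr_y_given_x = joint / pr_x[:, None]     # Pr(Y | X), rows sum to 1

# Bayes rule: Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y).
via_bayes = pr_y_given_x * pr_x[:, None] / pr_y
print(np.allclose(pr_x_given_y, via_bayes))  # True
```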

Example (figure): Pr(X ∩ X') = Pr(X | X') Pr(X') = Pr(X' | X) Pr(X).

AIDS Test

How likely is it that you have AIDS if the test says so?

Assume that roughly 0.1% of the population is infected: Pr(X = AIDS) = 0.001.
The AIDS test reports positive for all infections: Pr(Y = test positive | X = AIDS) = 1.
The AIDS test reports positive for 1% of healthy people: Pr(Y = test positive | X = healthy) = 0.01.

We use Bayes rule to infer Pr(AIDS | test positive):

Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y)
          = Pr(Y | X) Pr(X) / [Pr(Y | X) Pr(X) + Pr(Y | 𝒳\X) Pr(𝒳\X)]
          = (1 · 0.001) / (1 · 0.001 + 0.01 · 0.999)
          ≈ 0.091
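
The posterior above is a one-liner to reproduce; in this sketch the function and argument names are mine, not part of the slides.

```python
def posterior_infected(prior, p_pos_given_infected, p_pos_given_healthy):
    """Pr(infected | test positive) via Bayes rule and the law of total probability."""
    evidence = (p_pos_given_infected * prior
                + p_pos_given_healthy * (1.0 - prior))
    return p_pos_given_infected * prior / evidence

# Base rate 0.1%, a perfectly sensitive test, 1% false positive rate.
print(posterior_infected(0.001, 1.0, 0.01))   # ~0.091
```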

Improving Inference

Naive Bayes, following up on the AIDS test: the doctor performs a second, conditionally independent test with the following properties:
The second test reports positive for 90% of infections.
The second test reports positive for 5% of healthy people.
Conditional independence means Pr(T1, T2 | Health) = Pr(T1 | Health) Pr(T2 | Health).

A bit more algebra reveals (assuming both tests are conditionally independent) that the probability of being healthy given two positive tests is
(0.01 · 0.05 · 0.999) / (0.01 · 0.05 · 0.999 + 1 · 0.9 · 0.001) ≈ 0.357,
i.e. Pr(AIDS | both tests positive) ≈ 0.643, up from 0.091 after a single test.

Conclusion: adding extra observations can improve the confidence of the test considerably.

Important assumption: we assume that T1 and T2 are independent conditioned on health status. This is the Naive Bayes classifier.
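
Under the conditional-independence (Naive Bayes) assumption the per-test likelihoods simply multiply; a sketch reproducing the numbers above (function and variable names are mine):

```python
def posterior_after_tests(prior, sensitivities, false_positive_rates):
    """Pr(infected | all tests positive), assuming the tests are
    conditionally independent given the true health status (Naive Bayes).
    sensitivities[i]        = Pr(test i positive | infected)
    false_positive_rates[i] = Pr(test i positive | healthy)"""
    score_infected = prior
    score_healthy = 1.0 - prior
    for sens, fpr in zip(sensitivities, false_positive_rates):
        score_infected *= sens   # likelihoods multiply under conditional independence
        score_healthy *= fpr
    return score_infected / (score_infected + score_healthy)

p = posterior_after_tests(0.001, [1.0, 0.9], [0.01, 0.05])
print(p)         # ~0.643: probability of infection after two positive tests
print(1.0 - p)   # ~0.357: probability of being healthy, the fraction on the slide
```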

Different Contexts

Hypothesis testing:
Is algorithm A or B better at solving the problem?
Can we trust a user (spammer / not a spammer)?
Which parameter setting should we use?

Sensor fusion:
Evidence from sensors A and B (AIDS tests 1 and 2)
Different types of data (text, images, tags)

More data:
If we obtain two sets of data, we become more confident
Each observation can be seen as an additional test

Mini Summary

Probability theory:
Basic tools of the trade
Use it to model uncertain events

Dependence and independence:
Independent events don't convey any information about each other
Dependence is what we exploit for estimation
Leads to Bayes rule

Testing:
The prior probability matters
Combining tests improves the outcome
Common sense can be misleading

Estimating Probabilities from Data

Rolling a die: roll the die many times and count how many times each side comes up, then assign empirical probability estimates according to the frequency of occurrence:

Pr^(i) = (# occurrences of i) / (# trials)

Maximum likelihood estimation: find the parameters such that the observations are most likely given the current set of parameters. This does not check whether the parameters are plausible!
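
A minimal sketch of the counting estimate; the simulated rolls are an assumption for illustration.

```python
import random
from collections import Counter

random.seed(0)
rolls = [random.randint(1, 6) for _ in range(1000)]   # 1000 simulated rolls of a fair die

counts = Counter(rolls)
m = len(rolls)

# Maximum likelihood estimate: the empirical frequency of each outcome.
mle = {i: counts[i] / m for i in range(1, 7)}
print(mle)
```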


Properties of MLE

Hoeffding's bound: the probability estimates converge exponentially fast,
Pr{|π_i - p_i| > ε} ≤ 2 exp(-2mε²).

Problem: for small ε this can still take a very long time. In particular, for a fixed confidence level δ we have
δ = 2 exp(-2mε²)  ⟹  ε = sqrt((log 2 - log δ) / (2m)).
Moreover, the above bound holds only for a single π_i, not uniformly over all i.

Improved approach: if we know something about π_i, we should use this extra information: use priors.
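
Solving the bound for m gives the number of trials needed before a single probability estimate is within ε with confidence 1 - δ; a sketch:

```python
import math

def trials_needed(eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps**2) <= delta,
    i.e. m >= log(2 / delta) / (2 * eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Estimating one probability to within eps = 0.01 at 95% confidence:
print(trials_needed(0.01, 0.05))   # 18445 trials
```

Shrinking ε by a factor of 10 multiplies the required number of trials by 100, which is the "very long time" warned about above.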

Priors to the Rescue

Big problem: only sampling many times gets the parameters right. Rule of thumb: we need at least 10-20 times as many observations as parameters.

Conjugate priors: often we know what we should expect. Using a conjugate prior helps: we insert fake additional data which we assume comes from the prior.

Conjugate prior for discrete distributions: assume we see u_i additional observations of class i. Then
π_i = (# occurrences of i + u_i) / (# trials + Σ_j u_j).
Assuming that the die is fair, set u_i = m_0 / 6 for all 1 ≤ i ≤ 6. For u_i = 1 this is the Laplace rule.

Example: Dice

20 tosses of a die:

Outcome          1     2     3     4     5     6
Counts           3     6     2     1     4     4
MLE              0.15  0.30  0.10  0.05  0.20  0.20
MAP (m_0 = 6)    0.15  0.27  0.12  0.08  0.19  0.19
MAP (m_0 = 100)  0.16  0.19  0.16  0.15  0.17  0.17

Consequences:
A stronger prior brings the estimate closer to the uniform distribution and makes it more robust against outliers.
But: we need more data to detect deviations from the prior.
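
The MLE and MAP rows of the table follow directly from the smoothing formula on the previous slide (u_i = m_0/6 fake counts per side); a sketch that reproduces them:

```python
counts = {1: 3, 2: 6, 3: 2, 4: 1, 5: 4, 6: 4}   # the 20 observed tosses

def smoothed_estimate(counts, m0):
    """Conjugate-prior (additive) smoothing: pretend we saw m0 extra tosses
    of a fair die, i.e. u_i = m0 / 6 fake observations of each side."""
    u = m0 / 6.0
    m = sum(counts.values())
    return {i: (c + u) / (m + m0) for i, c in counts.items()}

mle = smoothed_estimate(counts, m0=0)        # m0 = 0 recovers the plain MLE
map6 = smoothed_estimate(counts, m0=6)       # Laplace rule: one fake count per side
map100 = smoothed_estimate(counts, m0=100)   # strong prior, close to uniform

for name, est in [("MLE", mle), ("MAP m0=6", map6), ("MAP m0=100", map100)]:
    print(name, {i: round(p, 2) for i, p in est.items()})
```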

Figures: estimated probabilities for an honest die and a tainted die.

Mini Summary

Maximum likelihood solution:
Count the number of observations per event
Set the probability to the empirical frequency of occurrence

Maximum a posteriori solution:
We have a good guess about the solution
Use a conjugate prior; this corresponds to inventing extra data
Recalibrate the probabilities for the additional observations

Extensions:
Works for other estimates, too (means, covariance matrices)

Big guns: Hoeffding and friends
Use uniform convergence and tail bounds
Exponential convergence for a fixed scale ε, but only sublinear convergence (ε ~ 1/sqrt(m)) for a fixed confidence

Summary

Data: vectors, matrices, strings, graphs, ...
What to do with data: unsupervised learning (clustering, embedding, etc.), classification, sequence annotation, regression, ...
Random variables: dependence, Bayes rule, hypothesis testing
Estimating probabilities: maximum likelihood, convergence, ...