How can a Machine Learn: Passive, Active, and Teaching

Similar documents
Introduction to Machine Learning

PMR Learning as Inference

Introduction to Bayesian Learning. Machine Learning Fall 2018

Machine Teaching. for Personalized Education, Security, Interactive Machine Learning. Jerry Zhu

Comments. x > w = w > x. Clarification: this course is about getting you to be able to think as a machine learning expert

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Lecture 2: Priors and Conjugacy

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

Naïve Bayes classification

Announcements. Proposals graded

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li.

Probabilistic Graphical Models

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables?

Lecture : Probabilistic Machine Learning

Density Estimation. Seungjin Choi

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Expectation Propagation for Approximate Bayesian Inference

Accouncements. You should turn in a PDF and a python file(s) Figure for problem 9 should be in the PDF

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Statistical Data Mining and Machine Learning Hilary Term 2016

Expectation Propagation Algorithm

Graphical Models for Collaborative Filtering

Introduction to Machine Learning

A brief introduction to Conditional Random Fields

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

COMP90051 Statistical Machine Learning

COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017

Bayesian Models in Machine Learning

Learning Bayesian network : Given structure and completely observed data

Bayesian Methods: Naïve Bayes

Notes on Machine Learning for and

Introduction to Machine Learning

Bayesian Methods for Machine Learning

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Logistic Regression. Machine Learning Fall 2018

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Gaussians Linear Regression Bias-Variance Tradeoff

Gaussian discriminant analysis Naive Bayes

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Linear Models in Machine Learning

Active Learning: Disagreement Coefficient

Machine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang

Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Machine Learning

Machine Learning

Introduction to Probabilistic Machine Learning

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Parametric Techniques Lecture 3

Probabilistic Graphical Models

Unsupervised Learning

Generalized Linear Models

Introduction to Machine Learning

Empirical Risk Minimization

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?

More Spectral Clustering and an Introduction to Conjugacy

Exponential Families

Parametric Techniques

Lecture 2: Conjugate priors

Lecture 4. Generative Models for Discrete Data - Part 3. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza.

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

CS446: Machine Learning Fall Final Exam. December 6 th, 2016

Click Prediction and Preference Ranking of RSS Feeds

Computational Biology Lecture #3: Probability and Statistics. Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept

Bayesian Inference and MCMC

Warm up: risk prediction with logistic regression

Machine Learning

Machine Learning 4771

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

NPFL108 Bayesian inference. Introduction. Filip Jurčíček. Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic

Lecture 7: Passive Learning

Bayesian RL Seminar. Chris Mansley September 9, 2008

Introduction to Statistical Learning Theory

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003

Optimal Teaching for Online Perceptrons

Machine Learning for Signal Processing Bayes Classification

Statistical Methods for SVM

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Bias-Variance Tradeoff

Active Learning and Optimized Information Gathering

Clustering using Mixture Models

Midterm, Fall 2003

USEFUL PROPERTIES OF THE MULTIVARIATE NORMAL*

Introduction to Bayesian methods (continued) - Lecture 16

Logistic regression and linear classifiers COMS 4771

Lecture 4: Probabilistic Learning

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

Lecture 11: Probability Distributions and Parameter Estimation

Lecture 2 Machine Learning Review

Computational Cognitive Science

Lecture 5. Gaussian Models - Part 1. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. November 29, 2016

Graphical Models and Kernel Methods

Machine Learning Gaussian Naïve Bayes Big Picture

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI

Transcription:

How can a Machine Learn: Passive, Active, and Teaching
Xiaojin Zhu (jerryzhu@cs.wisc.edu)
Department of Computer Sciences, University of Wisconsin-Madison
NLP&CC 2013

Item: $x \in \mathbb{R}^d$

Class label: $y \in \{-1, 1\}$

Learning
You see a training set $(x_1, y_1), \ldots, (x_n, y_n)$.
You must learn a good function $f : \mathbb{R}^d \to \{-1, 1\}$.
$f$ must predict the label $y$ on a test item $x$, which may not be in the training set.
You need some assumptions.

The world: $p(x, y)$

Independent and identically distributed:
$(x_1, y_1), \ldots, (x_n, y_n), (x, y) \overset{\text{iid}}{\sim} p(x, y)$

Example: Noiseless 1D Threshold Classifier
$p(x) = \text{uniform}[0, 1]$
$p(y = 1 \mid x) = \begin{cases} 0, & x < \theta \\ 1, & x \ge \theta \end{cases}$

Example: Noisy 1D Threshold Classifier
$p(x) = \text{uniform}[0, 1]$
$p(y = 1 \mid x) = \begin{cases} \epsilon, & x < \theta \\ 1 - \epsilon, & x \ge \theta \end{cases}$

Generalization Error
$R(f) = \mathbb{E}_{(x, y) \sim p(x, y)} \mathbb{1}[f(x) \ne y]$
Approximated by the test-set error $\frac{1}{m} \sum_{i=n+1}^{n+m} \mathbb{1}[f(x_i) \ne y_i]$ on a test set $(x_{n+1}, y_{n+1}), \ldots, (x_{n+m}, y_{n+m}) \overset{\text{iid}}{\sim} p(x, y)$.

Zero Generalization Error is a Dream
Speed limit #1: the Bayes error
$R(\text{Bayes}) = \mathbb{E}_{x \sim p(x)} \left[ \tfrac{1}{2} - \left| p(y = 1 \mid x) - \tfrac{1}{2} \right| \right]$
achieved by the Bayes classifier $\text{sign}\left( p(y = 1 \mid x) - \tfrac{1}{2} \right)$.
No learner can do better than the Bayes classifier.
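As a concrete illustration of the last few slides, here is a minimal Python sketch (not part of the talk) that simulates the noisy 1D threshold world and approximates the generalization error of the Bayes classifier with a test set; the values of $\theta$ and $\epsilon$ are arbitrary choices.

```python
# Sketch (not from the talk): estimate the generalization error of the Bayes
# classifier sign(p(y=1|x) - 1/2) = sign(x - theta) in the noisy 1D threshold
# world, using a held-out test set. The true Bayes error here is epsilon.
import numpy as np

rng = np.random.default_rng(0)
theta, eps = 0.3, 0.1                              # assumed world parameters

def sample_world(m):
    x = rng.uniform(0.0, 1.0, size=m)              # p(x) = uniform[0, 1]
    p_pos = np.where(x >= theta, 1 - eps, eps)     # p(y = 1 | x)
    y = np.where(rng.random(m) < p_pos, 1, -1)
    return x, y

bayes = lambda x: np.where(x >= theta, 1, -1)      # Bayes classifier

x_test, y_test = sample_world(100_000)
test_error = np.mean(bayes(x_test) != y_test)      # approximates R(f)
print(f"test-set error ~ {test_error:.3f} (Bayes error = eps = {eps})")
```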

Hypothesis Space
$f \in \mathcal{F} \subseteq \{ g : \mathbb{R}^d \to \{-1, 1\} \text{ measurable} \}$

Approximation Error
$\mathcal{F}$ may include the Bayes classifier... or not.
e.g. $\mathcal{F} = \{ g(x) = \text{sign}(x - \theta) : \theta \in [0, 1] \}$
e.g. $\mathcal{F} = \{ g(x) = \text{sign}(\sin(\alpha x)) : \alpha > 0 \}$
Speed limit #2: the approximation error
$\inf_{g \in \mathcal{F}} R(g) - R(\text{Bayes})$

Estimation Error
Let $f^* = \arg\inf_{g \in \mathcal{F}} R(g)$. Can we at least learn $f^*$?
No. You see a training set $(x_1, y_1), \ldots, (x_n, y_n)$, not $p(x, y)$.
You learn $\hat{f}_n$.
Speed limit #3: the estimation error
$R(\hat{f}_n) - R(f^*)$

Estimation Error
As the training set size $n$ increases, the estimation error goes down.
But how quickly?

Paradigm 1: Passive Learning
$(x_1, y_1), \ldots, (x_n, y_n) \overset{\text{iid}}{\sim} p(x, y)$
1D example: error $O(1/n)$.

Paradigm 2: Active Learning
In iteration $t$:
1. the learner picks a query $x_t$
2. the world (oracle) answers with a label $y_t \sim p(y \mid x_t)$
Pick $x_t$ to maximally reduce the hypothesis space.
1D example: error $O(1/2^n)$.
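The $O(1/n)$ versus $O(1/2^n)$ contrast can be simulated directly. The following sketch (not from the talk; the threshold value, number of labels, and number of trials are arbitrary choices) tracks how fast the interval of thresholds consistent with the observed labels shrinks under passive sampling versus binary-search active queries in the noiseless 1D world.

```python
# Sketch (assumed setup): noiseless 1D threshold world. Track the width of the
# version space -- the interval of thresholds still consistent with the labels.
# Passive learning shrinks it at roughly O(1/n); active learning (binary
# search on the query point) shrinks it at O(1/2^n).
import numpy as np

rng = np.random.default_rng(1)
theta = 0.6180                      # assumed true threshold

def label(x):                       # noiseless oracle
    return 1 if x >= theta else -1

def passive(n):
    lo, hi = 0.0, 1.0               # version space [lo, hi]
    for x in rng.uniform(0.0, 1.0, size=n):
        if label(x) == 1: hi = min(hi, x)
        else:             lo = max(lo, x)
    return hi - lo

def active(n):
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = (lo + hi) / 2           # query the midpoint
        if label(x) == 1: hi = x
        else:             lo = x
    return hi - lo

n = 20
print(f"passive width after {n} labels:  {np.mean([passive(n) for _ in range(200)]):.4f}")
print(f"active  width after {n} queries: {active(n):.2e}   (= 1/2^{n})")
```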

Paradigm 3: Teaching
A teacher designs the training set.
1D example: $x_1 = \theta - \epsilon/2, y_1 = -1$; $x_2 = \theta + \epsilon/2, y_2 = 1$.
$n = 2$ suffices to drive the estimation error to $\epsilon$ (teaching dimension [Goldman & Kearns 95]).

Teaching as an Optimization Problem
$\min_D \; \text{loss}(\hat{f}_D, \theta^*) + \text{effort}(D)$

Teaching Bayesian Learners
If we choose
$\text{loss}(\hat{f}_D, \theta^*) = \mathrm{KL}\left( \delta_{\theta^*} \,\|\, p(\theta \mid D) \right)$, $\quad \text{effort}(D) = cn$,
the problem becomes
$\min_{n, x_1, \ldots, x_n} \; -\log p(\theta^* \mid x_1, \ldots, x_n) + cn$

Example 1: Teaching a 1D Threshold Classifier
[Figure: passive learning "waits" ($O(1/n)$), active learning "explores" ($O(1/2^n)$), teaching "guides".]
$p_0(\theta) = 1$ (uniform prior on $[0, 1]$)
$p(y = 1 \mid x, \theta) = 1$ if $x \ge \theta$ and $0$ otherwise
$\text{effort}(D) = c|D|$
For any $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, $p(\theta \mid D) = \text{uniform}\left[ \max_{i: y_i = -1}(x_i), \; \min_{i: y_i = 1}(x_i) \right]$.
The optimal teaching problem becomes
$\min_{n, x_1, y_1, \ldots, x_n, y_n} \; -\log \frac{1}{\min_{i: y_i = 1}(x_i) - \max_{i: y_i = -1}(x_i)} + cn.$
One solution: $D = \{(\theta^* - \epsilon/2, -1), (\theta^* + \epsilon/2, 1)\}$ as $\epsilon \to 0$, with teaching objective value $TI = \log(\epsilon) + 2c$.

Example 2: A Learner with Poor Perception
Same as Example 1, but the learner cannot distinguish similar items.
Encode this in effort():
$\text{effort}(D) = \frac{c}{\min_{x_i, x_j \in D} |x_i - x_j|}$
With $D = \{(\theta^* - \epsilon/2, -1), (\theta^* + \epsilon/2, 1)\}$, $TI = \log(\epsilon) + c/\epsilon$, minimized at $\epsilon = c$:
$D^* = \{(\theta^* - c/2, -1), (\theta^* + c/2, 1)\}.$
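A quick numerical check of Example 2 (not from the talk; the constant $c$ is an arbitrary choice): minimizing $TI(\epsilon) = \log\epsilon + c/\epsilon$ recovers the closed-form optimum $\epsilon = c$.

```python
# Sketch: the teaching objective for the 1D threshold learner with the
# "poor perception" effort term, TI(eps) = log(eps) + c/eps. Minimizing
# numerically should recover the closed-form optimum eps* = c.
import numpy as np
from scipy.optimize import minimize_scalar

c = 0.05                                     # assumed per-gap effort constant
TI = lambda eps: np.log(eps) + c / eps       # loss + effort for D = {theta*-eps/2, theta*+eps/2}

res = minimize_scalar(TI, bounds=(1e-6, 1.0), method="bounded")
print(f"numerical eps* = {res.x:.4f}, closed form eps* = c = {c}")
# Without the perception term (Example 1, effort = cn), TI(eps) = log(eps) + 2c
# decreases without bound as eps -> 0, i.e. the teacher would place the two
# points arbitrarily close to theta*.
```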

Example 3: Teaching to Pick One Model out of Two
$\Theta = \{\theta_A = N(-\frac{1}{4}, \frac{1}{2}),\ \theta_B = N(\frac{1}{4}, \frac{1}{2})\}$, $p_0(\theta_A) = p_0(\theta_B) = \frac{1}{2}$, $\theta^* = \theta_A$.
Let $D = \{x_1, \ldots, x_n\}$. Then
$\text{loss}(D) = \log\left(1 + \exp\left(\textstyle\sum_{i=1}^n x_i\right)\right),$
minimized by $x_i \to -\infty$. But suppose box constraints $x_i \in [-d, d]$:
$\min_{n, x_1, \ldots, x_n} \; \log\left(1 + \exp\left(\textstyle\sum_{i=1}^n x_i\right)\right) + cn + \sum_{i=1}^n I(|x_i| \le d),$
where $I(\cdot)$ is $0$ when the constraint holds and $\infty$ otherwise.
Solution: all $x_i = -d$, and $n = \max\left(0, \left[\frac{1}{d}\log\left(\frac{d}{c} - 1\right)\right]\right)$.
Note $n = 0$ for certain combinations of $c, d$ (e.g., when $c \ge d$): the effort of teaching outweighs the benefit. The teacher may choose not to teach at all and maintain the status quo (the prior $p_0$) of the learner!
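The closed-form $n$ in Example 3 can be checked against a brute-force scan over $n$. A small sketch, with arbitrary values for $c$ and $d$:

```python
# Sketch: Example 3's closed-form teaching-set size versus a brute-force scan
# over n, for the objective log(1 + exp(-n*d)) + c*n with all x_i = -d.
import numpy as np

c, d = 0.05, 1.0                                        # assumed constants
obj = lambda n: np.log1p(np.exp(-n * d)) + c * n        # teaching objective at x_i = -d

n_closed = max(0, round(float(np.log(d / c - 1) / d)))  # slide's closed form
n_scan = min(range(0, 200), key=obj)                    # brute force
print(f"closed form n* = {n_closed}, brute force n* = {n_scan}")
# For large enough c the optimum is n* = 0: the effort of teaching outweighs
# the benefit, and the teacher leaves the learner at its prior p0.
```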

Teaching Dimension is a Special Case
Given a concept class $C = \{c\}$, define $P(y = 1 \mid x, \theta_c) = [c(x) = +]$ and $P(x)$ uniform.
The world has $\theta^* = \theta_{c^*}$.
The learner has $\Theta = \{\theta_c \mid c \in C\}$, $p_0(\theta) = \frac{1}{|C|}$.
$P(\theta_c \mid D) = \frac{1}{|\{c \in C \text{ consistent with } D\}|}$ or $0$.
The teaching dimension [Goldman & Kearns 95] $TD(c^*)$ is the minimum cardinality of a $D$ that uniquely identifies the target concept:
$\min_D \; -\log P(\theta_{c^*} \mid D) + \gamma |D|, \quad \text{where } \gamma \le \frac{1}{|C|}.$
The solution $D$ is a minimum teaching set for $c^*$, and $|D| = TD(c^*)$.

Teaching Bayesian Learners in the Exponential Family
So far, we solved the examples by inspection.
Exponential family:
$p(x \mid \theta) = h(x) \exp\left( \theta^\top T(x) - A(\theta) \right)$
$T(x) \in \mathbb{R}^D$: sufficient statistics of $x$
$\theta \in \mathbb{R}^D$: natural parameter
$A(\theta)$: log partition function
$h(x)$: base measure
For $D = \{x_1, \ldots, x_n\}$ the likelihood is
$p(D \mid \theta) = \left( \prod_{i=1}^n h(x_i) \right) \exp\left( \theta^\top s - n A(\theta) \right)$
with aggregate sufficient statistics $s \equiv \sum_{i=1}^n T(x_i)$.
Two-step algorithm: find the aggregate sufficient statistics, then unpack them into teaching examples.

Step 1: Aggregate Sufficient Statistics from Conjugacy
The conjugate prior has natural parameters $(\lambda_1, \lambda_2) \in \mathbb{R}^D \times \mathbb{R}$:
$p(\theta \mid \lambda_1, \lambda_2) = h_0(\theta) \exp\left( \lambda_1^\top \theta - \lambda_2 A(\theta) - A_0(\lambda_1, \lambda_2) \right)$
The posterior is
$p(\theta \mid D, \lambda_1, \lambda_2) = h_0(\theta) \exp\left( (\lambda_1 + s)^\top \theta - (\lambda_2 + n) A(\theta) - A_0(\lambda_1 + s, \lambda_2 + n) \right)$
$D$ enters the posterior only via $s$ and $n$.
Optimal teaching problem:
$\min_{n, s} \; -\theta^{*\top} (\lambda_1 + s) + A(\theta^*)(\lambda_2 + n) + A_0(\lambda_1 + s, \lambda_2 + n) + \text{effort}(n, s)$
Convex relaxation: $n \in \mathbb{R}$ and $s \in \mathbb{R}^D$ (assuming $\text{effort}(n, s)$ is convex).

Step 2: Unpacking
We cannot teach with the aggregate sufficient statistics themselves; we must find $n$ teaching examples whose aggregate sufficient statistics equal $s$.
Round $n \leftarrow \max(0, [n])$.
Exponential distribution: $T(x) = x$, so set $x_1 = \ldots = x_n = s/n$.
Poisson distribution: $T(x) = x$ (integers), so round $x_1, \ldots, x_n$.
Gaussian distribution: $T(x) = (x, x^2)$, harder. For $n = 3$, $s = (3, 5)$: $\{x_1 = 0, x_2 = 1, x_3 = 2\}$ or $\{x_1 = \frac{1}{2}, x_2 = \frac{5 + \sqrt{13}}{4}, x_3 = \frac{5 - \sqrt{13}}{4}\}$.
An approximate unpacking algorithm:
1. initialize $x_i \overset{\text{iid}}{\sim} p(x \mid \theta^*)$, $i = 1, \ldots, n$.
2. solve $\min_{x_1, \ldots, x_n} \left\| s - \sum_{i=1}^n T(x_i) \right\|^2$ (nonconvex).
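Here is a sketch of the approximate unpacking algorithm for the Gaussian case $T(x) = (x, x^2)$, applied to the $n = 3$, $s = (3, 5)$ example above. It is not the talk's implementation; scipy's general-purpose local optimizer stands in for step 2, and the initialization distribution is an assumption.

```python
# Sketch of the approximate unpacking step for a Gaussian, where
# T(x) = (x, x^2): given target aggregate statistics s, find n examples whose
# sufficient statistics sum to s by local optimization (nonconvex in general).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, s_target = 3, np.array([3.0, 5.0])            # the n = 3, s = (3, 5) case from the slide

def residual(x):
    T = np.stack([x, x**2], axis=1)              # per-example sufficient statistics
    return np.sum((s_target - T.sum(axis=0))**2)

x0 = rng.normal(loc=1.0, scale=1.0, size=n)      # init i.i.d. from (an assumed) p(x | theta*)
res = minimize(residual, x0)
x = np.sort(res.x)
print(f"unpacked examples: {np.round(x, 4)}, residual = {res.fun:.2e}")
# One exact solution is {0, 1, 2}; another is {1/2, (5 - sqrt(13))/4, (5 + sqrt(13))/4}.
# The local optimizer lands on some solution depending on the initialization.
```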

Example 4: Teaching the Mean of a Univariate Gaussian
The world is $N(x; \mu^*, \sigma^2)$, with $\sigma^2$ known to the learner. $T(x) = x$.
The learner's prior in standard form: $\mu \sim N(\mu_0, \sigma_0^2)$.
Optimal aggregate sufficient statistic:
$s = \frac{\sigma^2}{\sigma_0^2}(\mu^* - \mu_0) + n\mu^*$
Note $\frac{s}{n} \ne \mu^*$: the teacher compensates for the learner's initial belief $\mu_0$.
$n$ is the solution to
$n - \frac{1}{2\,\text{effort}'(n)} + \frac{\sigma^2}{\sigma_0^2} = 0,$
e.g. when $\text{effort}(n) = cn$, $n = \frac{1}{2c} - \frac{\sigma^2}{\sigma_0^2}$.
Do not teach if the learner initially had a "narrow mind": $\sigma_0^2 < 2c\sigma^2$.
Unpacking $s$ is trivial, e.g. $x_1 = \ldots = x_n = s/n$.
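A minimal sketch of Example 4 (not from the talk; the world and prior parameters are arbitrary choices): plug the closed forms for $n$ and $s$ into the standard Gaussian posterior update and confirm the learner's posterior mean lands exactly on $\mu^*$.

```python
# Sketch: teaching the mean of a univariate Gaussian (known variance) with
# effort(n) = c*n. Uses the slide's closed forms for n* and the aggregate
# statistic s, then checks that the learner's posterior mean lands on mu*.
import numpy as np

mu_star, sigma2 = 2.0, 1.0          # world (assumed values)
mu0, sigma02 = 0.0, 4.0             # learner's prior N(mu0, sigma0^2)
c = 0.05                            # assumed effort constant

n = max(0, round(1 / (2 * c) - sigma2 / sigma02))        # n* from the slide
s = (sigma2 / sigma02) * (mu_star - mu0) + n * mu_star   # aggregate statistic
# Standard Gaussian posterior update given n observations summing to s:
post_mean = (mu0 / sigma02 + s / sigma2) / (1 / sigma02 + n / sigma2)
print(f"n* = {n}, teach x_1 = ... = x_n = {s / n:.4f}, posterior mean = {post_mean:.4f}")
# If sigma0^2 < 2*c*sigma^2 (a "narrow-minded" learner), n* = 0 and the teacher does not teach.
```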

Example 5: Teaching a Multinomial Distribution
The world: a multinomial $\pi^* = (\pi_1^*, \ldots, \pi_K^*)$.
The learner: a Dirichlet prior $p(\pi \mid \beta) = \frac{\Gamma(\sum_k \beta_k)}{\prod_k \Gamma(\beta_k)} \prod_{k=1}^K \pi_k^{\beta_k - 1}$.
Step 1: find the aggregate sufficient statistics $s = (s_1, \ldots, s_K)$:
$\min_s \; -\log\Gamma\left( \sum_{k=1}^K (\beta_k + s_k) \right) + \sum_{k=1}^K \log\Gamma(\beta_k + s_k) - \sum_{k=1}^K (\beta_k + s_k - 1)\log\pi_k^* + \text{effort}(s)$
Relax to $s \in \mathbb{R}_{\ge 0}^K$.
Step 2: unpack by rounding, $s_k \leftarrow [s_k]$ for $k = 1, \ldots, K$.
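The Step 1 optimization for this example can be handed to a generic constrained optimizer. A sketch, assuming the same world, prior, and per-item effort cost that appear on the next slide; using scipy rather than the slide's fmincon is my substitution.

```python
# Sketch: Step 1 for the multinomial/Dirichlet case -- minimize the negative
# log posterior density at pi* plus an effort term over relaxed counts
# s in R^K_{>=0}, then round (Step 2).
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

pi_star = np.array([0.1, 0.3, 0.6])      # world multinomial
beta = np.array([6.0, 3.0, 1.0])         # learner's (wrong) Dirichlet prior
cost = 0.3                               # effort(s) = 0.3 * sum_k s_k

def objective(s):
    a = beta + s                         # Dirichlet posterior parameters
    neg_log_post = -(gammaln(a.sum()) - gammaln(a).sum() + ((a - 1) * np.log(pi_star)).sum())
    return neg_log_post + cost * s.sum()

res = minimize(objective, x0=np.ones(3), bounds=[(0, None)] * 3)
s_relaxed = res.x
s_rounded = np.round(s_relaxed).astype(int)
print("relaxed s =", np.round(s_relaxed, 2), " rounded s =", s_rounded)
# The rounded result should be close to the s = (0, 2, 8) reported on the next slide.
```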

Examples of Example 5
World: $\pi^* = (\frac{1}{10}, \frac{3}{10}, \frac{6}{10})$; the learner's wrong Dirichlet prior: $\beta = (6, 3, 1)$.
If teaching is effortless, $\text{effort}(s) = 0$: $s = (317, 965, 1933)$ (found with fmincon). The MLE from $D$ is $(0.099, 0.300, 0.601)$, very close to $\pi^*$. This is "brute-force teaching": using big data to overwhelm the learner's prior.
If teaching is costly, $\text{effort}(s) = 0.3 \sum_{k=1}^K s_k$: $s = (0, 2, 8)$, $TI = 2.65$.
Not $s = (1, 3, 6)$, which ignores the learner's wrong prior: $TI = 4.51$.
Not $s = (317, 965, 1933)$: $TI = 956.25$.
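For reference, a small sketch (not from the talk) that evaluates $TI(s) = -\log p(\pi^* \mid \beta + s) + 0.3\sum_k s_k$ directly for the three candidate teaching counts; it should reproduce the values quoted above.

```python
# Sketch: evaluate the teaching impedance TI(s) = -log p(pi* | beta + s) + 0.3 * sum(s)
# for the candidate teaching counts discussed above.
import numpy as np
from scipy.special import gammaln

pi_star = np.array([0.1, 0.3, 0.6])
beta = np.array([6.0, 3.0, 1.0])

def TI(s):
    a = beta + np.asarray(s, dtype=float)   # Dirichlet posterior parameters
    log_post = gammaln(a.sum()) - gammaln(a).sum() + ((a - 1) * np.log(pi_star)).sum()
    return -log_post + 0.3 * np.sum(s)

for s in [(0, 2, 8), (1, 3, 6), (317, 965, 1933)]:
    print(s, f"TI = {TI(s):.2f}")   # slide reports 2.65, 4.51, 956.25 respectively
```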

Example 6: Teaching a Multivariate Gaussian
World: $\mu^* \in \mathbb{R}^D$ and $\Sigma^* \in \mathbb{R}^{D \times D}$.
Learner: likelihood $N(x \mid \mu, \Sigma)$ with a Normal-Inverse-Wishart (NIW) prior.
Given $x_1, \ldots, x_n \in \mathbb{R}^D$, the aggregate sufficient statistics are $s = \sum_{i=1}^n x_i$, $S = \sum_{i=1}^n x_i x_i^\top$.
Step 1: find the optimal aggregate sufficient statistics via an SDP,
$\min_{n, s, S} \; \frac{\nu_n D}{2}\log 2 + \sum_{i=1}^D \log\Gamma\left( \frac{\nu_n + 1 - i}{2} \right) - \frac{\nu_n}{2}\log|\Lambda_n| - \frac{D}{2}\log\kappa_n + \frac{\nu_n}{2}\log|\Sigma^*| + \frac{1}{2}\mathrm{tr}(\Sigma^{*-1}\Lambda_n) + \frac{\kappa_n}{2}(\mu^* - \mu_n)^\top \Sigma^{*-1}(\mu^* - \mu_n) + \text{effort}(n, s, S)$
$\text{s.t. } S \succeq 0; \; S_{ii} \ge s_i^2/2, \; \forall i,$
where $\kappa_n, \nu_n, \mu_n, \Lambda_n$ are the learner's NIW posterior parameters (functions of $n, s, S$).
Step 2: unpack $s, S$:
initialize $x_1, \ldots, x_n \overset{\text{iid}}{\sim} N(\mu^*, \Sigma^*)$
solve $\min_{x_1, \ldots, x_n} \left\| \mathrm{vec}(s, S) - \sum_{i=1}^n \mathrm{vec}(T(x_i)) \right\|^2$

Examples of Example 6
The target Gaussian has $\mu^* = (0, 0, 0)^\top$ and $\Sigma^* = I$.
The learner's NIW prior: $\mu_0 = (1, 1, 1)^\top$, $\kappa_0 = 1$, $\nu_0 = 2 + 10^{-5}$, $\Lambda_0 = 10^{-5} I$.
Expensive effort: $\text{effort}(n, s, S) = n$.
The optimal $D$ has $n = 4$, unpacked into a tetrahedron, with $TI(D) = 1.69$.
[Figure: the four optimal teaching points form a tetrahedron; for comparison, four points drawn i.i.d. from $N(\mu^*, \Sigma^*)$.]
Four points sampled from $N(\mu^*, \Sigma^*)$ have $\text{mean}(TI) = 9.06 \pm 3.34$, $\min(TI) = 1.99$, $\max(TI) = 35.51$ (100,000 trials).