Active Learning & Experimental Design
1 Active Learning & Experimental Design
Daniel Ting
Heavily modified, of course, by Lyle Ungar
Original slides by Barbara Engelhardt and Alex Shyr
Lyle Ungar, University of Pennsylvania
2 Motivation
- Data collection may be expensive
  Cost of time and materials for an experiment
  Cheap vs. expensive data: raw images vs. annotated images
- Not all labels are equally useful
- Want to collect the best data at minimal cost
  Where do labels come from? What observations should one label?
3 Toy examples
Assume you are learning y = f(x) for x a scalar. You are learning linear regression on [-1, 1]. You can pick two x's to get y's for. What two values would you pick?
A) -1/3, 1/3
B) -1, 1
C) 0, 1
D) Something else
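The regression quiz can be checked numerically: for simple linear regression, the variance of the slope estimate is σ²/Σᵢ(xᵢ - x̄)², so spreading the two design points as far apart as possible is best. A minimal sketch (the helper name is illustrative, not from the slides):

```python
import numpy as np

def slope_variance(x1, x2, sigma2=1.0):
    """Variance of the OLS slope estimate when sampling y at x1 and x2.

    For y = a + b*x + noise, Var(b_hat) = sigma^2 / sum((x_i - x_bar)^2).
    """
    x = np.array([x1, x2], dtype=float)
    return sigma2 / np.sum((x - x.mean()) ** 2)

# The endpoints of [-1, 1] spread the design as far as possible,
# giving the smallest slope variance (answer B in the quiz).
assert slope_variance(-1, 1) < slope_variance(-1/3, 1/3)
assert slope_variance(-1, 1) < slope_variance(0, 1)
```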
4 Toy examples
Assume you are learning y = f(x) for x a scalar. You are learning an SVM classifier on [-1, 1]. You can pick 4 x's to get y's for. What strategy would you use to pick the x's?
A) Pick -1, -1/3, 1/3, 1
B) Pick -1, 1, see what the answer is, then pick the next x
C) Pick -1/3, 1/3, see what the answer is, then pick the next x
D) Something else
5 Toy Example: 1D classifier
Unlabeled data: labels are all 0, then all 1 (left to right)
Classifier (threshold function): h_w(x) = 1 if x > w (0 otherwise)
Goal: find the transition between 0 and 1 labels in a minimum number of steps
Naïve method: choose points to label at random on the line; requires O(n) training data to find the underlying classifier
Better method: binary search for the transition between 0 and 1; requires O(log n) training data to find the underlying classifier
Exponential reduction in training data size!
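The O(log n) claim is just a binary search over oracle labels. A small sketch, assuming the first label is 0 and the last is 1 (the helper name is illustrative):

```python
def find_threshold(labels):
    """Binary-search the 0 -> 1 transition in a sorted 0/1 label sequence.

    Each probe plays the role of querying the oracle for one label, so only
    O(log n) labels are needed instead of the O(n) a random strategy uses.
    Assumes labels[0] == 0 and labels[-1] == 1.
    """
    lo, hi = 0, len(labels) - 1  # invariant: labels[lo] == 0, labels[hi] == 1
    queries = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        queries += 1  # one label requested from the oracle
        if labels[mid] == 0:
            lo = mid
        else:
            hi = mid
    return hi, queries  # index of the first 1, and number of labels used

labels = [0] * 1000 + [1] * 24
idx, used = find_threshold(labels)
assert idx == 1000 and used <= 11  # roughly log2(1024) queries
```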
6 Example: collaborative filtering
Users usually rate only a few movies; ratings are expensive. Which movies do you show users to best extrapolate movie preferences? [Yu et al. 2006]
7 Example: collaborative filtering
Users usually rate only a few movies; ratings are expensive. Which movies do you show users to best extrapolate movie preferences?
- Baseline algorithms:
  Random: m movies chosen randomly
  Most Popular Movies: the m most frequently rated movies
- Most popular movies is not better than random design!
- Popular movies are rated highly by all users; they do not discriminate tastes [Yu et al. 2006]
8 What criteria might you use to select which items to label?
9 Active Learning Methods
- Active learning
  Uncertainty sampling
  Query by committee
  Information-based loss functions
- Optimal experimental design
- Non-linear optimal experimental design
- Summary
10 Active learning
- Setup: given existing knowledge (X, y), choose where to collect more data
  Assume access to cheap unlabeled points
  Make a query to obtain an expensive label
  Want to find labels that are informative
- Output: a classifier/predictor trained on less labeled data
- Similar to active learning in classrooms
  Students ask questions, receive a response, and ask further questions
  Contrast: passive learning, where the student just listens to the lecturer
- This lecture covers:
  how to measure the value of data
  algorithms to choose the data
11 Example: Gene expression and Cancer classification
- Active learning takes 31 points to achieve the same accuracy as passive learning with 174 [Liu 2004]
12 Active Learning Setup
- Active learner picks which data point x to query
- Receive response y from an oracle
- Update parameters θ of the model
- Repeat
- Query is selected to minimize some loss function ("risk")
13 Active Learning
- Heuristic methods for reducing risk:
  Select the most uncertain data point
  Select the most informative data point (for distinguishing different hypotheses, or to optimize expected information gain)
- Some computational considerations:
  There may be many queries to calculate the loss for
    Subsample points: the probability of ending up far from the true minimum decreases exponentially
  The loss may not be easy to calculate
14 Uncertainty Sampling
- Query the item x that the current classifier is most uncertain about
- Needs a measure of uncertainty
- Examples:
  Entropy
  Least confident predicted label
  Euclidean distance (e.g. the point closest to the margin in an SVM)
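The entropy and least-confidence criteria can both be computed from a matrix of predicted class probabilities. A minimal illustration (function names are not from the slides):

```python
import numpy as np

def entropy_uncertainty(probs):
    """Predictive entropy per point; probs has shape (n_points, n_classes)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log2(p), axis=1)

def least_confident(probs):
    """1 minus the maximum class probability, per point."""
    return 1.0 - probs.max(axis=1)

probs = np.array([[0.5, 0.5],     # maximally uncertain
                  [0.9, 0.1],
                  [0.99, 0.01]])
# Both criteria pick the first point as the next query.
assert entropy_uncertainty(probs).argmax() == 0
assert least_confident(probs).argmax() == 0
```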
15 Example: Gene expression and Cancer classification
- Data: cancerous lung tissue samples
  Cheap unlabeled data: gene expression profiles from an Affymetrix microarray
  Labeled data: a 0-1 label for adenocarcinoma or malignant pleural mesothelioma
- Method: linear SVM
  Measure of uncertainty: distance to the SVM hyperplane [Liu 2004]
16 Example: Gene expression and Cancer classification
- Active learning takes 31 points to achieve the same accuracy as passive learning with 174 [Liu 2004]
17 Query by Committee
- Which unlabeled point should you choose?
  a) Red arrow point
  b) Green arrow point
18 Query by Committee
- Yellow = valid hypotheses
19 Query by Committee
- A point on the max-margin hyperplane does not reduce the number of valid hypotheses by much
20 Query by Committee
- Queries an example based on the degree of disagreement between a committee of classifiers
21 Query by Committee
- Put a prior distribution over classifiers/hypotheses
- Sample a set of classifiers from the distribution
- Natural for ensemble methods, which are already samples: random forests, bagged classifiers, etc.
- Measures of disagreement:
  Entropy of the predicted responses
  KL-divergence of the predictive distributions
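One common disagreement measure, the vote entropy of the committee's predicted labels for a point, can be sketched as follows (illustrative helper, not from the slides):

```python
import numpy as np
from collections import Counter

def vote_entropy(committee_votes):
    """Vote entropy of a committee's predicted labels for one point.

    committee_votes: a list of labels, one per committee member.
    Higher entropy means more disagreement, hence a more informative query.
    """
    counts = np.array(list(Counter(committee_votes).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A point the committee splits on is more informative to query
# than one the committee agrees on.
assert vote_entropy([0, 1, 0, 1]) > vote_entropy([1, 1, 1, 1])
```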
22 Query by Committee Application
- Use a Naïve Bayes model for text classification (20 Newsgroups dataset) [McCallum & Nigam, 1998]
23 Information-based Loss Function
- Previous methods looked at uncertainty at a single point
  They do not consider whether you can actually reduce uncertainty, or whether adding the point makes a difference to the model
- Want to model notions of information gained
  Maximize the KL divergence between posterior and prior
  Maximize the reduction in model entropy between posterior and prior (reduce the number of bits required to describe the distribution)
- All of these can be extended to optimal design algorithms
- Must decide how to handle uncertainty about the query response and the model parameters [MacKay, 1992]
24 Entropy
- A measure of the information in a random event X with possible outcomes {x_1, ..., x_n}:
  H(X) = -Σ_i p(x_i) log_2 p(x_i)
- The average minimum number of yes/no questions needed to answer some question
  Related to binary search [Shannon, 1948]
25 Kullback-Leibler divergence
- P = true distribution
- Q = alternative distribution that is used to encode data
- KL divergence is the expected extra message length per datum that must be transmitted using Q:
  D_KL(P || Q) = Σ_i P(x_i) log (P(x_i)/Q(x_i))
              = -Σ_i P(x_i) log Q(x_i) + Σ_i P(x_i) log P(x_i)
              = H(P, Q) - H(P)
              = cross-entropy - entropy
- Measures how different the two distributions are
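The identity D_KL = cross-entropy - entropy, and the asymmetry discussed on the next slide, can be checked numerically. A small sketch in bits (base-2 logs):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(x_i) log2(P(x_i)/Q(x_i)), in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

def cross_entropy(p, q):
    return -np.sum(np.asarray(p, float) * np.log2(np.asarray(q, float)))

def entropy(p):
    return -np.sum(np.asarray(p, float) * np.log2(np.asarray(p, float)))

p = [0.5, 0.5]
q = [0.75, 0.25]
# The identity on the slide: D_KL(P || Q) = H(P, Q) - H(P)
assert abs(kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12
# Non-symmetric: D(P || Q) != D(Q || P)
assert abs(kl_divergence(p, q) - kl_divergence(q, p)) > 1e-3
```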
26 KL divergence properties
- Non-negative: D(P || Q) ≥ 0
- Divergence is 0 if and only if P and Q are equal: D(P || Q) = 0 iff P = Q
- Non-symmetric: D(P || Q) ≠ D(Q || P)
- Does not satisfy the triangle inequality: D(P || Q) can exceed D(P || R) + D(R || Q)
27 KL divergence properties
- Non-negative: D(P || Q) ≥ 0
- Divergence is 0 if and only if P and Q are equal: D(P || Q) = 0 iff P = Q
- Non-symmetric: D(P || Q) ≠ D(Q || P)
- Does not satisfy the triangle inequality: D(P || Q) can exceed D(P || R) + D(R || Q)
Therefore KL divergence is not a distance metric
28 KL divergence as info gain
- The KL divergence of the posteriors measures the information gain expected from a query x':
  D( p(θ | x, x') || p(θ | x) )
- Goal: choose the query that maximizes the KL divergence between the updated posterior and the current posterior
  This represents the largest expected information gain
29 Active learning warning
- The choice of data is only as good as the model itself
- Assume a linear model; then two data points are sufficient
- What happens when the data are not linear?
30 Active Learning = Sequential Experimental Design
- Active learning
  Uncertainty sampling
  Query by committee
  Information-based loss functions
- Optimal experimental design
  A-optimal design
  D-optimal design
  E-optimal design
  Non-linear optimal experimental design
- Summary
31 Experimental Design
- Many considerations in designing an experiment
  Dealing with confounders
  Feasibility
  Choice of variables to measure
  Size of the experiment (# of data points)
  Conduct of the experiment
  Choice of interventions/queries to make
  Etc.
32 Experimental Design
- Many considerations in designing an experiment
  Dealing with confounders
  Feasibility
  Choice of variables to measure
  Size of the experiment (# of data points)
  Conduct of the experiment
  Choice of interventions/queries to make
  Etc.
- We will only look at one of them
33 Optimal Experimental Design?
- Heuristics give empirically good performance, but there is not much theory on how good the heuristics are
- Optimal experimental design gives theoretical criteria for choosing a set of points to label, for a specific set of assumptions and objectives
- Theory is good when you only get to run (a series of) experiments once
34 Optimal Experimental Design
- Given a model M with parameters β, what queries are maximally informative, i.e. will yield the best estimate of β?
- "Best" minimizes the variance of the estimate
  Equivalently, maximizes the Fisher information
  Different methods use different ways of computing a scalar from the variance matrix
- Linear models: the optimal design does not depend on β!
- Non-linear models: the design depends on β, but one can use a Taylor expansion to obtain a linear model
35 Goal: Minimize the variance of w
If y = x^T β + ε, then w = (X^T X)^{-1} X^T y and w ~ N(β, σ² (X^T X)^{-1}).
We want to minimize the variance of our parameter estimate w, so pick training data X to make (X^T X)^{-1} small.
But that is a matrix, so we need to reduce it to a scalar:
A-optimal (average) design minimizes trace((X^T X)^{-1})
D-optimal (determinant) design minimizes log det((X^T X)^{-1})
E-optimal (extreme) design minimizes the maximum eigenvalue of (X^T X)^{-1}
There is an alphabet soup of other criteria (C-, G-, L-, V-, etc.)
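All three criteria can be computed directly from a candidate design matrix. Applied to the earlier toy regression on [-1, 1] (with an intercept column), every criterion prefers the endpoint design; a sketch, with an illustrative helper name:

```python
import numpy as np

def design_criteria(X):
    """A-, D-, and E-optimality criteria for a design matrix X (smaller is better)."""
    cov = np.linalg.inv(X.T @ X)  # shape of Var(w_hat), up to sigma^2
    return {"A": np.trace(cov),
            "D": np.log(np.linalg.det(cov)),
            "E": np.linalg.eigvalsh(cov).max()}

# Two candidate two-point designs on [-1, 1], each row = [intercept, x]:
endpoints = np.array([[1.0, -1.0], [1.0, 1.0]])
interior  = np.array([[1.0, -1/3], [1.0, 1/3]])
crit_end, crit_int = design_criteria(endpoints), design_criteria(interior)
# The endpoint design wins under all three criteria.
assert all(crit_end[k] < crit_int[k] for k in "ADE")
```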
36 Fisher Information Matrix
The negative of the expected value of the Hessian of the log probability of an observation given the parameters.
If y = x^T w + ε, then p(y_t) ∝ exp(-(y_t - x_t^T w)² / 2σ²), so
m_ij = -E[∂²/∂w_i ∂w_j log exp(-(y_t - x_t^T w)² / 2σ²)]
     = E[∂²/∂w_i ∂w_j (y_t - x_t^T w)² / 2σ²]
     ∝ E[x_{ti} x_{tj}]
So M ∝ X^T X
37 A-Optimal Design
- A-optimal design minimizes the trace of (X^T X)^{-1}
  Minimizing the trace (the sum of the diagonal elements) essentially chooses maximally independent columns (small correlations between interventions)
- Tends to choose points on the border of the dataset
  Example: mixture of four Gaussians [Yu et al., 2006]
38 A-Optimal Design
A-optimal design minimizes the trace of (X^T X)^{-1}
Example: 20 candidate data points; the minimal ellipsoid that contains all points [Boyd & Vandenberghe, 2004]
39 D-Optimal design
- D-optimal design minimizes the log determinant of (X^T X)^{-1}
- Equivalent to choosing the confidence ellipsoid with minimum volume (the "most powerful" hypothesis test in some sense)
  Minimizes the entropy of the estimated parameters
- The most commonly used optimal design [Boyd & Vandenberghe, 2004]
40 E-Optimal design
- E-optimal design minimizes the largest eigenvalue of (X^T X)^{-1}
- Minimizes the diameter of the confidence ellipsoid [Boyd & Vandenberghe, 2004]
41 Summary of Optimal Design [Boyd & Vandenberghe, 2004]
42 Practicalities
- Sometimes you can generate an x arbitrarily
- More often you need to select from a set of given x's
  This can be an expensive search!
43 Experimental Design
- Introduction: information theory
- Active learning
  Uncertainty sampling
  Query by committee
  Information-based loss functions
- Optimal experimental design
  A-optimal design
  D-optimal design
  E-optimal design
  Non-linear optimal experimental design
- Summary
44 Optimal design in non-linear models
- Given a non-linear model y = g(x, θ)
- The model is described by a Taylor expansion around a point θ⁰:
  a_j(x, θ⁰) = ∂g(x, θ)/∂θ_j, evaluated at θ⁰
- Maximization of the Fisher information matrix is now the same as in the linear model
- Yields a locally optimal design, optimal for the particular value of θ
- Yields no information on the (lack of) fit of the model [Atkinson, 1996]
45 Optimal design in non-linear models
- Problem: the parameter value θ, used to choose the experiments, is unknown
- Three general techniques address this problem, useful for many possible notions of gain:
  1) Sequential experimental design: iterate between choosing an experiment x and updating the parameter estimate θ
  2) Bayesian experimental design: put a prior distribution on the parameter θ, then choose the best data x
  3) Maximin experimental design: assume the worst-case scenario for the parameter θ, then choose the best data x
46 Response Surface Methods
- Estimate the effects of local changes to the interventions (queries)
  In particular, estimate how to maximize the response
- Applications:
  Find optimal conditions for growing cell cultures
  Develop a robust process for chemical manufacturing
- Procedure for maximizing the response:
  Given a set of data points, interpolate a local surface (this local surface is called the "response surface")
    Typically use a quadratic polynomial to obtain a Hessian
  Hill-climb or take a Newton step on the response surface to find the next x
  Use the next x to interpolate the subsequent response surface
47 Response Surface Modeling
Goal: approximate the function f(c) = score(minimize(c))
1. Fit a smoothed response surface to the data points
2. Minimize the response surface to find a new candidate
3. Use the method to find a nearby local minimum of the score function
4. Add the candidate to the data points
5. Re-fit the surface and repeat [Blum, unpublished]
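In one dimension the fit-then-step idea reduces to fitting a quadratic and jumping to its stationary point. A hedged sketch with made-up data (a real response surface method would refit after each new observation):

```python
import numpy as np

def newton_step_on_surface(xs, ys):
    """Fit a 1-D quadratic response surface y ~ a + b*x + c*x^2 to the data,
    then return its stationary point (the Newton step / next x to try)."""
    a, b, c = np.polyfit(xs, ys, deg=2)[::-1]  # coefficients, low degree first
    # Stationary point of a + b*x + c*x^2 is x* = -b / (2c);
    # with c < 0 (concave fit) this is the surface's maximizer.
    return -b / (2 * c)

# Samples of a response that peaks at x = 2 (hypothetical data):
xs = np.array([0.0, 1.0, 3.0, 4.0])
ys = -(xs - 2.0) ** 2 + 5.0
next_x = newton_step_on_surface(xs, ys)
assert abs(next_x - 2.0) < 1e-8
```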
48 Summary
Active learning (sequential):
  Query by committee: multiple models
  Uncertainty sampling: predictive distribution on points
  Information-based loss functions: maximize info gain
Optimal experimental design:
  A-optimal design: minimize the trace of the inverse information matrix
  D-optimal design: minimize the log det of the inverse information matrix
  E-optimal design: minimize the largest eigenvalue of the inverse information matrix
Non-linear optimal experimental design:
  Sequential experimental design: multiple-shot experiments; little known of the parameters
  Bayesian experimental design: single-shot experiment; some idea of the parameter distribution
  Response surface methods: sequential experiments for optimization
PREDICTION OF DEFORMED AND ANNEALED MICROSTRUCTURES USING BAYESIAN NEURAL NETWORKS AND GAUSSIAN PROCESSES C.A.L. Baier-Jones, T.J. Sabin, D.J.C. MacKay, P.J. Withers Department of Materias Science and
More informationAnalysis of Emerson s Multiple Model Interpolation Estimation Algorithms: The MIMO Case
Technica Report PC-04-00 Anaysis of Emerson s Mutipe Mode Interpoation Estimation Agorithms: The MIMO Case João P. Hespanha Dae E. Seborg University of Caifornia, Santa Barbara February 0, 004 Anaysis
More informationhttps://doi.org/ /epjconf/
HOW TO APPLY THE OPTIMAL ESTIMATION METHOD TO YOUR LIDAR MEASUREMENTS FOR IMPROVED RETRIEVALS OF TEMPERATURE AND COMPOSITION R. J. Sica 1,2,*, A. Haefee 2,1, A. Jaai 1, S. Gamage 1 and G. Farhani 1 1 Department
More informationConsistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems
Consistent inguistic fuzzy preference reation with muti-granuar uncertain inguistic information for soving decision making probems Siti mnah Binti Mohd Ridzuan, and Daud Mohamad Citation: IP Conference
More informationTwo view learning: SVM-2K, Theory and Practice
Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk
More informationData Discovery and Anomaly Detection Using Atypicality: Theory
Data Discovery and Anomay Detection Using Atypicaity: Theory Anders Høst-Madsen, Feow, IEEE, Eyas Sabeti, Member, IEEE, Chad Waton Abstract A centra question in the era of big data is what to do with the
More informationBDD-Based Analysis of Gapped q-gram Filters
BDD-Based Anaysis of Gapped q-gram Fiters Marc Fontaine, Stefan Burkhardt 2 and Juha Kärkkäinen 2 Max-Panck-Institut für Informatik Stuhsatzenhausweg 85, 6623 Saarbrücken, Germany e-mai: stburk@mpi-sb.mpg.de
More informationarxiv: v1 [cs.db] 1 Aug 2012
Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering
More informationSTA 216 Project: Spline Approach to Discrete Survival Analysis
: Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing
More informationChapter 1 Decomposition methods for Support Vector Machines
Chapter 1 Decomposition methods for Support Vector Machines Support Vector Machines (SVM) are widey used as a simpe and efficient too for inear and noninear cassification as we as for regression probems.
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationChemical Kinetics Part 2
Integrated Rate Laws Chemica Kinetics Part 2 The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates the rate
More informationDIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM
DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing
More informationLecture 1: Introduction, Entropy and ML estimation
0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More information$, (2.1) n="# #. (2.2)
Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier
More informationEfficient Generation of Random Bits from Finite State Markov Chains
Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown
More informationLearning Fully Observed Undirected Graphical Models
Learning Fuy Observed Undirected Graphica Modes Sides Credit: Matt Gormey (2016) Kayhan Batmangheich 1 Machine Learning The data inspires the structures we want to predict Inference finds {best structure,
More information( ) is just a function of x, with
II. MULTIVARIATE CALCULUS The first ecture covered functions where a singe input goes in, and a singe output comes out. Most economic appications aren t so simpe. In most cases, a number of variabes infuence
More informationCONGRUENCES. 1. History
CONGRUENCES HAO BILLY LEE Abstract. These are notes I created for a seminar tak, foowing the papers of On the -adic Representations and Congruences for Coefficients of Moduar Forms by Swinnerton-Dyer and
More informationTrainable fusion rules. I. Large sample size case
Neura Networks 19 (2006) 1506 1516 www.esevier.com/ocate/neunet Trainabe fusion rues. I. Large sampe size case Šarūnas Raudys Institute of Mathematics and Informatics, Akademijos 4, Vinius 08633, Lithuania
More informationKernel pea and De-Noising in Feature Spaces
Kerne pea and De-Noising in Feature Spaces Sebastian Mika, Bernhard Schokopf, Aex Smoa Kaus-Robert Muer, Matthias Schoz, Gunnar Riitsch GMD FIRST, Rudower Chaussee 5, 12489 Berin, Germany {mika, bs, smoa,
More informationThe Group Structure on a Smooth Tropical Cubic
The Group Structure on a Smooth Tropica Cubic Ethan Lake Apri 20, 2015 Abstract Just as in in cassica agebraic geometry, it is possibe to define a group aw on a smooth tropica cubic curve. In this note,
More informationDiscrete Techniques. Chapter Introduction
Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, as we as various
More informationc 2016 Georgios Rovatsos
c 2016 Georgios Rovatsos QUICKEST CHANGE DETECTION WITH APPLICATIONS TO LINE OUTAGE DETECTION BY GEORGIOS ROVATSOS THESIS Submitted in partia fufiment of the requirements for the degree of Master of Science
More informationA Sparse Covariance Function for Exact Gaussian Process Inference in Large Datasets
A Covariance Function for Exact Gaussian Process Inference in Large Datasets Arman ekumyan Austraian Centre for Fied Robotics The University of Sydney NSW 26, Austraia a.mekumyan@acfr.usyd.edu.au Fabio
More informationFast Blind Recognition of Channel Codes
Fast Bind Recognition of Channe Codes Reza Moosavi and Erik G. Larsson Linköping University Post Print N.B.: When citing this work, cite the origina artice. 213 IEEE. Persona use of this materia is permitted.
More information(Refer Slide Time: 2:34) L C V
Microwave Integrated Circuits Professor Jayanta Mukherjee Department of Eectrica Engineering Indian Intitute of Technoogy Bombay Modue 1 Lecture No 2 Refection Coefficient, SWR, Smith Chart. Heo wecome
More information