Active Learning and Optimized Information Gathering

1 Active Learning and Optimized Information Gathering Lecture 7 Learning Theory CS Andreas Krause

2 Announcements
Project proposal: due tomorrow, 1/27.
Homework 1: due Thursday, 1/29 (any time is OK).
Office hours: come to office hours before your presentation!
  Andreas: Monday 3:00-4:30pm, 260 Jorgensen
  Ryan: Wednesday 4:00-6:00pm, 109 Moore

3 Recap: Bandit problems
Bandit problems: online optimization under limited feedback.
Exploration-exploitation dilemma.
Algorithms with low regret: ε-greedy, UCB1.
Payoffs can be probabilistic or adversarial (oblivious/adaptive).

4 More complex bandits
Bandits with many arms:
  Online linear optimization (online shortest paths)
  X-armed bandits (Lipschitz mean payoff function)
  Gaussian process optimization (Bayesian assumptions about mean payoffs)
Bandits with state:
  Contextual bandits
  Reinforcement learning
Key tool: optimism in the face of uncertainty.

5 Course outline
1. Online decision making
2. Statistical active learning
3. Combinatorial approaches

6 Spam or Ham?
[Figure: scatter plot of examples marked + and o over axes $x_1$ and $x_2$; legend: Spam, Ham]
$\mathrm{label} = \mathrm{sign}(w_0 + w_1 x_1 + w_2 x_2)$ (linear separator)
Labels are expensive (need to ask an expert).
Which labels should we obtain to maximize classification accuracy?

7 Outline
Background in learning theory
  Sample complexity
Key challenges
Heuristics for active learning
Principled algorithms for active learning

8 Credit scoring
[Table with columns Credit score and Defaulted?; one entry shown as ???]
Want a decision rule that performs well for unseen examples (generalization).

9 More general: Concept learning
Set $X$ of instances
True concept $c: X \to \{0,1\}$
Hypothesis $h: X \to \{0,1\}$
Hypothesis space $H = \{h_1, \dots, h_n, \dots\}$
Want to pick a good hypothesis (one that agrees with the true concept on most instances).

10 Example: Binary thresholds
Input domain: $X = \{1, 2, \dots, 100\}$
True concept $c$ with threshold $t$:
$c(x) = +1$ if $x \ge t$
$c(x) = -1$ if $x < t$

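11 How good is a hypothesis?
Set $X$ of instances, concept $c: X \to \{0,1\}$
Hypothesis $h: X \to \{0,1\}$, $H = \{h_1, \dots, h_n, \dots\}$
Distribution $P_X$ over $X$
$\mathrm{error}_{\mathrm{true}}(h) = \mathbb{E}_{x \sim P_X}\,|h(x) - c(x)|$
Want $h^* = \arg\min_{h \in H} \mathrm{error}_{\mathrm{true}}(h)$
Can't compute $\mathrm{error}_{\mathrm{true}}(h)$!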

12 Concept learning
Data set $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$, $x_i \in X$, $y_i \in \{0,1\}$
Assume the $x_i$ are drawn independently from $P_X$; $y_i = c(x_i)$.
Also assume $c \in H$ $\Rightarrow$ $\exists\, h$ consistent with $D$.
More data $\Rightarrow$ fewer consistent hypotheses.
Learning strategy:
  Collect "enough" data.
  Output a consistent hypothesis $h$.
  Hope that $\mathrm{error}_{\mathrm{true}}(h)$ is small.
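The learning strategy above can be sketched for the binary-threshold example. This is a minimal illustration, not code from the lecture: the true threshold `true_t = 37`, the uniform sampling of inputs, and the helper names are all assumptions, and labels are written as +1/-1 as in the threshold example rather than {0,1}.

```python
import random

def threshold_hypothesis(t):
    """Binary threshold: +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1

def consistent_thresholds(data, candidates):
    """Return every candidate threshold that agrees with all labeled examples."""
    return [t for t in candidates
            if all(threshold_hypothesis(t)(x) == y for x, y in data)]

random.seed(0)
domain = range(1, 101)           # X = {1, ..., 100}
candidates = range(1, 102)       # t = 101 labels every point -1
true_t = 37                      # hypothetical true concept (assumption)
c = threshold_hypothesis(true_t)

data, sizes = [], []
for n in range(1, 31):
    x = random.choice(domain)    # x_i drawn i.i.d. (uniform P_X assumed)
    data.append((x, c(x)))       # noise-free label y_i = c(x_i)
    sizes.append(len(consistent_thresholds(data, candidates)))

survivors = consistent_thresholds(data, candidates)
print(sizes)                     # number of consistent hypotheses never grows
print(min(survivors), max(survivors))
```

Because the labels are noise-free and the true concept is in the hypothesis space, the true threshold always survives, and any output from `survivors` is a consistent hypothesis.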

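13 Sample complexity
Let $\varepsilon > 0$. How many samples do we need such that all consistent hypotheses have error $< \varepsilon$?
Definition: $h \in H$ is bad $\Leftrightarrow$ $\mathrm{error}_{\mathrm{true}}(h) > \varepsilon$.
Suppose $h \in H$ is bad. Let $x \sim P_X$, $y = c(x)$. Then: $P(h(x) = y) < 1 - \varepsilon$.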

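14 Sample complexity
$P(h \text{ bad and survives 1 data point}) \le 1 - \varepsilon$
$P(h \text{ bad and survives } n \text{ data points}) \le (1 - \varepsilon)^n$
$P(\text{remains} \ge 1 \text{ bad } h \text{ after } n \text{ data points}) \le |H|\,(1 - \varepsilon)^n \le |H| \exp(-\varepsilon n)$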

15 Probability of bad hypothesis

16 Sample complexity for finite hypothesis spaces [Haussler 88]
Theorem: Suppose $|H| < \infty$, the data set $D$ with $|D| = n$ is drawn i.i.d. from $P_X$ (no noise), and $0 < \varepsilon < 1$. Then for any $h \in H$ consistent with $D$:
$P(\mathrm{error}_{\mathrm{true}}(h) > \varepsilon) \le |H| \exp(-\varepsilon n)$
"PAC bound" (probably approximately correct)

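17 How can we use this result?
$P(\mathrm{error}_{\mathrm{true}}(h) > \varepsilon) \le |H| \exp(-\varepsilon n) = \delta$
Possibilities:
  Given $\delta$ and $n$, solve for $\varepsilon$.
  Given $\varepsilon$ and $\delta$, solve for $n$.
  (Given $\varepsilon$ and $n$, solve for $\delta$.)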

18 Example: Credit scoring
$X = \{1, 2, \dots, 1000\}$
$H$ = binary thresholds on $X$, $|H| = 1000$
Want error $\le 0.01$ with probability $0.999$.
Need $n \ge 1382$ samples.
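The three ways of reading the bound $|H| \exp(-\varepsilon n) = \delta$ can be sketched as small helper functions. The function names are illustrative assumptions, and $|H| = 1000$ is taken as the size of the threshold class because it is the value that reproduces the stated $n \ge 1382$.

```python
import math

def samples_needed(h_size, eps, delta):
    """Smallest integer n with |H| * exp(-eps * n) <= delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

def error_guarantee(h_size, n, delta):
    """Smallest eps with |H| * exp(-eps * n) <= delta."""
    return (math.log(h_size) + math.log(1.0 / delta)) / n

def failure_probability(h_size, n, eps):
    """delta = |H| * exp(-eps * n), capped at 1 since it is a probability."""
    return min(1.0, h_size * math.exp(-eps * n))

# Credit scoring example: error <= 0.01 with probability 0.999 (delta = 0.001).
n = samples_needed(1000, 0.01, 0.001)
print(n)
print(error_guarantee(1000, n, 0.001))
print(failure_probability(1000, n, 0.01))
```

Natural logarithms are used throughout, matching the exponential in the bound.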

19 Limitations
How do we find a consistent hypothesis?
What if $|H| = \infty$?
What if there's noise in the data (or $c \notin H$)?

20 Credit scoring
[Table with columns Credit score and Defaulted?; one entry shown as ???]
No binary threshold function explains this data with 0 error.

21 Noisy data
Sets of instances $X$ and labels $Y = \{0,1\}$
Suppose $(X, Y) \sim P_{XY}$
Hypothesis space $H$
$\mathrm{error}_{\mathrm{true}}(h) = \mathbb{E}_{x,y}\,|h(x) - y|$
Want to find $\arg\min_{h \in H} \mathrm{error}_{\mathrm{true}}(h)$

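22 Learning from noisy data
Suppose $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ where $(x_i, y_i) \sim P_{X,Y}$
$\mathrm{error}_{\mathrm{train}}(h) = \frac{1}{n} \sum_i |h(x_i) - y_i|$
Learning strategy with noisy data:
  Collect "enough" data.
  Output $\hat{h} = \arg\min_{h \in H} \mathrm{error}_{\mathrm{train}}(h)$.
  Hope that $\mathrm{error}_{\mathrm{true}}(\hat{h}) \approx \min_{h \in H} \mathrm{error}_{\mathrm{true}}(h)$.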
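A minimal sketch of this strategy (empirical risk minimization) for binary thresholds on noisy labels. The data-generating process is an assumption made for illustration: inputs uniform on {1, ..., 100}, a hypothetical threshold of 60, and each label flipped independently with probability 0.1.

```python
import random

def train_error(t, data):
    """Fraction of examples misclassified by the threshold h_t(x) = 1 if x >= t else 0."""
    return sum((1 if x >= t else 0) != y for x, y in data) / len(data)

def erm_threshold(data, candidates):
    """Return the candidate threshold with the smallest training error."""
    return min(candidates, key=lambda t: train_error(t, data))

random.seed(1)
true_t, flip_prob = 60, 0.1      # hypothetical values (assumptions)
data = []
for _ in range(500):
    x = random.randint(1, 100)
    y = 1 if x >= true_t else 0
    if random.random() < flip_prob:
        y = 1 - y                # label noise: no threshold fits perfectly
    data.append((x, y))

t_hat = erm_threshold(data, range(1, 102))
print(t_hat, train_error(t_hat, data), train_error(true_t, data))
```

With noise, no hypothesis is guaranteed to be consistent, so the learner settles for the hypothesis with the fewest training mistakes.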

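23 Estimating error
How many samples do we need to accurately estimate the true error?
Data set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ where $(x_i, y_i) \sim P_{X,Y}$
$z_i = |h(x_i) - y_i| \in \{0,1\}$; the $z_i$ are i.i.d. samples from the Bernoulli random variable $Z = |h(X) - Y|$.
$\mathrm{error}_{\mathrm{train}}(h) = \frac{1}{n} \sum_i z_i$
$\mathrm{error}_{\mathrm{true}}(h) = \mathbb{E}[Z]$
How many samples such that $|\mathrm{error}_{\mathrm{train}}(h) - \mathrm{error}_{\mathrm{true}}(h)|$ is small?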

24 Estimating error
How many samples do we need to accurately estimate the true error?
Applying the Chernoff-Hoeffding bound:
$P(\mathrm{error}_{\mathrm{true}}(h) - \mathrm{error}_{\mathrm{train}}(h) \ge \varepsilon) \le \exp(-2 n \varepsilon^2)$
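The bound can be checked empirically with a small simulation. This is only a sketch under assumed values: a fixed hypothesis with true error 0.3, n = 100 samples, ε = 0.1, and 2000 repetitions; the function names are illustrative.

```python
import math
import random

def hoeffding_bound(n, eps):
    """Right-hand side of the one-sided Chernoff-Hoeffding bound."""
    return math.exp(-2 * n * eps ** 2)

def empirical_deviation_rate(true_err, n, eps, trials, rng):
    """Fraction of trials in which error_true - error_train >= eps."""
    hits = 0
    for _ in range(trials):
        # Each z_i is Bernoulli(true_err): 1 when h misclassifies example i.
        train_err = sum(rng.random() < true_err for _ in range(n)) / n
        if true_err - train_err >= eps:
            hits += 1
    return hits / trials

rng = random.Random(0)
rate = empirical_deviation_rate(0.3, 100, 0.1, 2000, rng)
bound = hoeffding_bound(100, 0.1)
print(rate, bound)
```

The observed frequency should sit well below the bound, which holds for every distribution and is therefore usually loose for any particular one.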

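25 Sample complexity with noise
Call $h \in H$ bad if $\mathrm{error}_{\mathrm{true}}(h) > \mathrm{error}_{\mathrm{train}}(h) + \varepsilon$.
$P(h \text{ bad "survives" } n \text{ training examples}) \le \exp(-2 n \varepsilon^2)$
$P(\text{remains} \ge 1 \text{ bad } h \text{ after } n \text{ examples}) \le |H| \exp(-2 n \varepsilon^2)$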

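26 PAC bound for noisy data
Theorem: Suppose $|H| < \infty$, the data set $D$ with $|D| = n$ is drawn i.i.d. from $P_{XY}$, and $0 < \delta < 1$. Then with probability at least $1 - \delta$, for any $h \in H$ it holds that
$\mathrm{error}_{\mathrm{true}}(h) \le \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{\log |H| + \log 1/\delta}{2n}}$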

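27 PAC bounds: Noise vs. no noise
Want error $\le \varepsilon$ with probability $1 - \delta$.
No noise: $n \ge \frac{1}{\varepsilon} \left( \log |H| + \log 1/\delta \right)$
Noise: $n \ge \frac{1}{\varepsilon^2} \left( \log |H| + \log 1/\delta \right)$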
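To see how much the dependence on ε changes between the two cases, the two expressions can be tabulated side by side. This sketch takes the formulas exactly as stated on the slide; the values $|H| = 1000$ and $\delta = 0.001$ and the function names are assumptions chosen for illustration.

```python
import math

def n_noiseless(h_size, eps, delta):
    """n >= (1/eps) * (log|H| + log(1/delta)), rounded up."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

def n_noisy(h_size, eps, delta):
    """n >= (1/eps^2) * (log|H| + log(1/delta)), rounded up."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps ** 2)

h_size, delta = 1000, 0.001      # illustrative values (assumptions)
table = [(eps, n_noiseless(h_size, eps, delta), n_noisy(h_size, eps, delta))
         for eps in (0.1, 0.05, 0.01)]
for eps, clean, noisy in table:
    print(f"eps={eps}: no noise n >= {clean}, noise n >= {noisy}")
```

Halving ε doubles the sample requirement without noise but quadruples it with noise.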

28 Limitations
How do we find a consistent hypothesis?
What if $|H| = \infty$?
What if there's noise in the data (or $c \notin H$)?

29 Credit scoring
[Table with columns Credit score and Defaulted?; one entry shown as ???]
Want to classify a continuous instance space: $|H| = \infty$

30 Large hypothesis spaces
Idea: labels of a few data points imply labels of many unlabeled data points.

31 How many points can be arbitrarily classified using binary thresholds?

32 How many points can be arbitrarily classified using linear separators? (1D)

33 How many points can be arbitrarily classified using linear separators? (2D)

34 VC dimension
Let $S \subseteq X$ be a set of instances.
A dichotomy is a nontrivial partition of $S = S_1 \cup S_0$.
$S$ is shattered by hypothesis space $H$ if for any dichotomy there exists a consistent hypothesis $h$ (i.e., $h(x) = 1$ if $x \in S_1$ and $h(x) = 0$ if $x \in S_0$).
The VC (Vapnik-Chervonenkis) dimension $VC(H)$ of $H$ is the size of the largest set $S$ shattered by $H$ (possibly $\infty$).
VC(H)
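For small finite cases, shattering can be checked by brute force, which answers the questions about binary thresholds and 1D linear separators posed earlier. The sketch below is an illustration under assumptions: the domain is the grid {1, ..., 10}, every labeling of a set (including the all-0 and all-1 labelings) must be realized, and 1D linear separators are modeled as thresholds in either orientation.

```python
from itertools import combinations

def labelings(hypotheses, points):
    """Set of distinct label vectors the hypotheses induce on the points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered(hypotheses, points):
    """True if every one of the 2^|S| labelings of the points is realized."""
    return len(labelings(hypotheses, points)) == 2 ** len(points)

def vc_dimension(hypotheses, domain, max_size):
    """Largest k <= max_size such that some k-subset of the domain is shattered."""
    best = 0
    for k in range(1, max_size + 1):
        if any(is_shattered(hypotheses, s) for s in combinations(domain, k)):
            best = k
        else:
            break  # if no k-set is shattered, no larger set can be
    return best

domain = list(range(1, 11))
# Binary thresholds: h_t(x) = 1 if x >= t (t = 11 labels everything 0).
thresholds = [lambda x, t=t: int(x >= t) for t in range(1, 12)]
# 1D linear separators also allow the flipped orientation.
flipped = [lambda x, t=t: int(x < t) for t in range(1, 12)]

print(vc_dimension(thresholds, domain, 3))
print(vc_dimension(thresholds + flipped, domain, 3))
```

Thresholds cannot label a smaller point 1 and a larger point 0, so no pair is shattered; allowing both orientations fixes pairs but still cannot realize the labeling (1, 0, 1) on three points.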

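35 VC generalization bound
Bound for finite hypothesis spaces:
$\mathrm{error}_{\mathrm{true}}(h) \le \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{\log |H| + \log 1/\delta}{2n}}$
VC-dimension based bound:
$\mathrm{error}_{\mathrm{true}}(h) \le \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{VC(H) \left( 1 + \log \frac{2n}{VC(H)} \right) + \log \frac{4}{\delta}}{n}}$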
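The two gap terms can be evaluated numerically to get a feel for their size. This is a sketch only: it assumes the standard forms of the two bounds (finite-class gap $\sqrt{(\log|H| + \log 1/\delta)/(2n)}$ and VC gap $\sqrt{(VC(H)(1 + \log(2n/VC(H))) + \log(4/\delta))/n}$), natural logarithms, and illustrative values for $|H|$, $VC(H)$, and $\delta$.

```python
import math

def finite_gap(h_size, n, delta):
    """Gap term of the finite-hypothesis-space bound."""
    return math.sqrt((math.log(h_size) + math.log(1.0 / delta)) / (2 * n))

def vc_gap(vc, n, delta):
    """Gap term of the VC-dimension based bound."""
    return math.sqrt((vc * (1 + math.log(2 * n / vc)) + math.log(4.0 / delta)) / n)

delta = 0.05                     # illustrative confidence level (assumption)
for n in (100, 1000, 10000):
    # |H| = 1000 and VC(H) = 3 are assumed values for illustration only.
    print(n, round(finite_gap(1000, n, delta), 4), round(vc_gap(3, n, delta), 4))
```

Both gaps shrink roughly like $1/\sqrt{n}$, but the VC-based term remains usable when $|H|$ is infinite.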

36 Applications
Allows proving generalization bounds for large hypothesis spaces with structure.
For many popular hypothesis classes, the VC dimension is known:
  Binary thresholds
  Linear classifiers
  Decision trees
  Neural networks

37 Passive learning protocol
Data source $P_{X,Y}$ (produces inputs $x_i$ and labels $y_i$)
Data set $D_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$
Learner outputs hypothesis $h$
$\mathrm{error}_{\mathrm{true}}(h) = \mathbb{E}_{x,y}\,|h(x) - y|$

38 From passive to active learning
[Figure: scatter plot of examples marked + and o; legend: Spam, Ham]
Some labels are "more informative" than others.

39 Statistical passive/active learning protocol
Data source $P_X$ (produces inputs $x_i$)
Active learner assembles data set $D_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$ by selectively obtaining labels
Learner outputs hypothesis $h$
$\mathrm{error}_{\mathrm{true}}(h) = \mathbb{E}_{x \sim P}\,[h(x) \ne c(x)]$

40 Passive learning
Input domain: $D = [0,1]$
True concept $c$ with threshold $t \in [0,1]$:
$c(x) = +1$ if $x \ge t$
$c(x) = -1$ if $x < t$
Passive learning: acquire all labels $y_i \in \{+,-\}$.

41 Active learning
Input domain: $D = [0,1]$
True concept $c$ with threshold $t \in [0,1]$:
$c(x) = +1$ if $x \ge t$
$c(x) = -1$ if $x < t$
Passive learning: acquire all labels $y_i \in \{+,-\}$.
Active learning: decide which labels to obtain.
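One natural way to decide which labels to obtain in this threshold setting is binary search over a sorted pool of unlabeled points. The sketch below is an assumption-laden illustration rather than the lecture's algorithm: the pool is 1000 uniform draws from [0,1), the true threshold 0.62 is hypothetical, labels are noise-free, and `active_learn_threshold` is an invented helper name.

```python
import random

def active_learn_threshold(pool, oracle):
    """Binary search for the first positive point in a pool of unlabeled inputs.

    oracle(x) returns +1 or -1 and is called once per requested label.
    Returns (estimated threshold, number of label queries).
    """
    xs = sorted(pool)
    lo, hi = 0, len(xs)          # invariant: xs[:lo] are -1, xs[hi:] are +1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == 1:
            hi = mid             # first positive is at mid or to its left
        else:
            lo = mid + 1         # first positive is to the right of mid
    t_hat = xs[lo] if lo < len(xs) else float("inf")
    return t_hat, queries

random.seed(0)
true_t = 0.62                    # hypothetical true threshold (assumption)
oracle = lambda x: 1 if x >= true_t else -1
pool = [random.random() for _ in range(1000)]

t_hat, queries = active_learn_threshold(pool, oracle)
print(t_hat, queries, len(pool))
```

A passive learner would request all 1000 labels to obtain the same consistent hypothesis, while the binary search needs only about $\log_2 1000 \approx 10$ of them.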

42 Comparison
Labels needed to learn with classification error ε:
  Passive learning: Ω(1/ε)
  Active learning: O(log 1/ε)
Active learning can exponentially reduce the number of required labels!

43 Key challenges
The PAC bounds we've seen so far crucially depend on i.i.d. data!
Actively assembling a data set causes bias!
If we're not careful, active learning can do worse!

44 What you need to know
Concepts, hypotheses
PAC bounds (probably approximately correct)
  For the noiseless ("realizable") case
  For the noisy ("unrealizable") case
VC dimension
Active learning protocol
