Computational Learning


1 Online Learning

2 Computational Learning Computer programs learn a rule as they go rather than being told it. They can get information by being given labeled examples, by asking queries, or by experimentation.

3 Automatic Translation

4 Answering Questions magazine/20computer-t.html

5 Prediction Predict a user's rating based on their previous rankings and the rankings of other users. Recently, Netflix gave out $1,000,000 in a contest to improve their ranking system by 10%. The winning algorithm was a blend of many methods. Surprisingly relevant information: ranking a movie just after watching it vs. years later...

6 Online Learning Bob is given examples on the fly, and has to classify them.

7 Online Learning Bob is given examples on the fly, and has to classify them.?

8 Online Learning Bob is given examples on the fly, and has to classify them.? SPAM!

9 Online Learning Bob is given examples on the fly, and has to classify them.? SPAM! After each example, Bob is given the correct answer.

10 Online Learning Bob is given examples on the fly, and has to classify them.? SPAM! No, that's baloney! After each example, Bob is given the correct answer.

11 Online Learning Focus on the binary case: spam vs. not spam? SPAM

12 Online Learning Focus on the binary case: spam vs. not spam? SPAM

13 Online Learning Focus on the binary case: spam vs. not spam SPAM? SPAM!

14 Online Learning Focus on the binary case: spam vs. not spam SPAM? SPAM! Correct!

15 Online Learning Focus on the binary case: spam vs. not spam SPAM? SPAM! Correct!

16 Online Learning Focus on the binary case: spam vs. not spam SPAM SPAM? SPAM! Correct!

17 Online Learning Focus on the binary case: spam vs. not spam SPAM SPAM?

18 Online Learning Focus on the binary case: spam vs. not spam SPAM SPAM?

19 Online Learning Focus on the binary case: spam vs. not spam SPAM SPAM? Not Spam!

20 Online Learning Focus on the binary case: spam vs. not spam SPAM SPAM? Not Spam! Correct!

21 Online Learning Focus on the binary case: spam vs. not spam SPAM SPAM? Not Spam! Correct!

22 Online Learning Focus on the binary case: spam vs. not spam Not SPAM? SPAM SPAM Not Spam! Correct!

23 Online Learning Focus on the binary case: spam vs. not spam Not SPAM? SPAM SPAM

24 Online Learning Focus on the binary case: spam vs. not spam Not SPAM? SPAM SPAM

25 Online Learning Focus on the binary case: spam vs. not spam Not SPAM? SPAM SPAM Not Spam!

26 Online Learning Focus on the binary case: spam vs. not spam Not SPAM? SPAM SPAM Not Spam! Wrong!

27 Online Learning Focus on the binary case: spam vs. not spam Not SPAM? SPAM SPAM Not Spam! Wrong!

28 Online Learning Focus on the binary case: spam vs. not spam Not SPAM Not SPAM SPAM SPAM? Not Spam! Wrong!

29 Online Learning Focus on the binary case: spam vs. not spam Not SPAM Not SPAM SPAM SPAM?

30 Online Learning Focus on the binary case: spam vs. not spam Not SPAM Not SPAM SPAM SPAM?

31 Online Learning Focus on the binary case: spam vs. not spam Not SPAM Not SPAM SPAM SPAM? SPAM!

32 Online Learning Focus on the binary case: spam vs. not spam Not SPAM Not SPAM SPAM SPAM? SPAM! Wrong!

33 Online Learning Focus on the binary case: spam vs. not spam Not SPAM Not SPAM SPAM SPAM? SPAM! Wrong! We are interested in the total number of mistakes made.

34 More abstractly... Bob is learning an unknown Boolean function u(x)

35 More abstractly... Bob is learning an unknown Boolean function u(x). Here, u is a function on lunch meats (pictures labeled 0 or 1).

36 More abstractly... Bob is learning an unknown Boolean function u(x). Here, u is a function on lunch meats (pictures labeled 0 or 1). Usually, we assume that u comes from a known family of functions---called a concept class.

37 How many mistakes? In general, we cannot guarantee a finite mistake bound. If we could, we would make a lot of money on the stock market... Today the S&P 500 will finish up!

38 Learning Disjunctions We will assume the unknown function u comes from a simple family: monotone disjunctions. u(x) = x_{i_1} ∨ x_{i_2} ∨ ... ∨ x_{i_k}. Monotone means each variable appears positively, that is as x_i rather than NOT x_i. What can such functions express?

39 Disjunctions A simple bag-of-words model of spam--- an email is spam if it contains one of a handful of keywords: viagra, lottery, xanax...

40 Disjunctions x_1 = [Aardvark], x_2 = [Aarrghh], ..., x_{601,384} = [Zyzzyva]. A simple bag-of-words model of spam--- an email is spam if it contains one of a handful of keywords: viagra, lottery, xanax...

41 Disjunctions x_1 = [Aardvark], x_2 = [Aarrghh], ..., x_{601,384} = [Zyzzyva]. A simple bag-of-words model of spam--- an email is spam if it contains one of a handful of keywords: viagra, lottery, xanax... SPAM(x) = viagra(x) OR lottery(x) OR ... OR xanax(x)

42 Disjunctions x_1 = [Aardvark], x_2 = [Aarrghh], ..., x_{601,384} = [Zyzzyva]. A simple bag-of-words model of spam--- an email is spam if it contains one of a handful of keywords: viagra, lottery, xanax... SPAM(x) = viagra(x) OR lottery(x) OR ... OR xanax(x) We have one variable indicating whether each word is present in x.

43 Disjunctions x_1 = [Aardvark], x_2 = [Aarrghh], ..., x_{601,384} = [Zyzzyva]. A simple bag-of-words model of spam--- an email is spam if it contains one of a handful of keywords: viagra, lottery, xanax... SPAM(x) = viagra(x) OR lottery(x) OR ... OR xanax(x) We have one variable indicating whether each word is present in x. n = total number of variables, k = number of variables present in the disjunction.
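To make the bag-of-words encoding concrete, here is a minimal sketch in Python (the tiny vocabulary and the helper name features are illustrative, not from the slides) that turns an email's text into the binary feature vector x described above.

```python
import re

# Toy vocabulary; in the slides the vocabulary is every English word.
VOCAB = ["aardvark", "lottery", "meeting", "viagra", "xanax"]

def features(text):
    """Return the binary bag-of-words vector: x[word] = 1 iff the word appears."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {w: int(w in words) for w in VOCAB}

x = features("Congratulations, you have won the lottery!")
# SPAM(x) = viagra(x) OR lottery(x) OR xanax(x)
print(x, bool(x["viagra"] or x["lottery"] or x["xanax"]))
```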

44 Basic Algorithm Maintain a hypothesis h(x) for what the disjunction is. Initially, h(x) = x_1 ∨ x_2 ∨ ... ∨ x_n. That is, we assume the disjunction includes all variables. This will label everything as spam!

45 Update Rule Every time we get an answer wrong, we will modify our hypothesis, as follows:

46 Update Rule Every time we get an answer wrong, we will modify our hypothesis, as follows: On false positive, i.e. SPAM(x)=0 and h(x)=1, remove all variables x_i from h where x_i = 1.

47 Update Rule Every time we get an answer wrong, we will modify our hypothesis, as follows: On false positive, i.e. SPAM(x)=0 and h(x)=1, remove all variables x_i from h where x_i = 1. On false negative, i.e. SPAM(x)=1 and h(x)=0, output FAIL.

48 Update Rule Every time we get an answer wrong, we will modify our hypothesis, as follows: On false positive, i.e. SPAM(x)=0 and h(x)=1, remove all variables x_i from h where x_i = 1. On false negative, i.e. SPAM(x)=1 and h(x)=0, output FAIL. If the function we are learning is actually a monotone disjunction, false negatives never happen.
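A minimal sketch of this elimination algorithm, assuming examples arrive as 0/1 vectors with their labels (the function and variable names are mine, not from the slides).

```python
def learn_monotone_disjunction(examples, n):
    """Online elimination algorithm for monotone disjunctions.

    examples: iterable of (x, label), x a 0/1 list of length n,
              label the value of the unknown disjunction u(x).
    Returns the surviving variable indices and the number of mistakes.
    """
    hypothesis = set(range(n))            # h(x) = x_1 OR x_2 OR ... OR x_n
    mistakes = 0
    for x, label in examples:
        guess = int(any(x[i] for i in hypothesis))
        if guess == label:
            continue
        mistakes += 1
        if guess == 1 and label == 0:     # false positive: drop all x_i = 1
            hypothesis -= {i for i in range(n) if x[i] == 1}
        else:                             # false negative: impossible if u
            raise RuntimeError("FAIL")    # really is a monotone disjunction
    return hypothesis, mistakes

# Example run with u(x) = x_0 OR x_2 over n = 4 variables.
data = [([0, 1, 0, 0], 0), ([1, 0, 0, 1], 1), ([0, 0, 0, 1], 0), ([0, 0, 1, 0], 1)]
print(learn_monotone_disjunction(data, 4))
```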

49 Try It!

50 Number of Mistakes On false positive, i.e. SPAM(x)=0 and h(x)=1, remove all variables x_i from h where x_i = 1. On false negative, i.e. SPAM(x)=1 and h(x)=0, output FAIL.

51 Number of Mistakes On false positive, i.e. SPAM(x)=0 and h(x)=1, remove all variables x_i from h where x_i = 1. On false negative, i.e. SPAM(x)=1 and h(x)=0, output FAIL. The variables present in the hypothesis are always a superset of those in the SPAM function.

52 Number of Mistakes On false positive, i.e. SPAM(x)=0 and h(x)=1, remove all variables x_i from h where x_i = 1. On false negative, i.e. SPAM(x)=1 and h(x)=0, output FAIL. The variables present in the hypothesis are always a superset of those in the SPAM function. Claim: This algorithm makes at most n mistakes if the unknown function is indeed a disjunction. (Each false positive removes at least one variable from h, and false negatives never occur.)

53 Importance of Good Teachers... What happens if some example is mislabeled? Say that actually u(x)=1 but the Teacher mistakenly says u(x)=0. Then some variable relevant to u will be removed from our hypothesis, and it can never come back! We can then make an unbounded number of mistakes.

54 Sparse Disjunctions This algorithm has the disadvantage that the number of mistakes can depend on the total number of variables. In the spam example, even though there are only a few keywords, we could make mistakes on the order of the number of words in English! Ideally, we would like an algorithm whose mistake bound depends on the number of relevant variables rather than the total number of variables.

55 Winnow Algorithm Winnow: to separate the chaff from the grain. The winnow algorithm quickly separates the irrelevant variables from the relevant ones.

56 Winnow Algorithm Say the unknown function only depends on k variables out of a total of n. u(x) = x_{i_1} ∨ x_{i_2} ∨ ... ∨ x_{i_k}. The winnow algorithm has a mistake bound of about k log n. A lot better than the previous bound when k ≪ n.

57 Linear Threshold Function In the winnow algorithm, Bob maintains as hypothesis a linear threshold function. h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ).

58 Linear Threshold Function In the winnow algorithm, Bob maintains as hypothesis a linear threshold function. h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ). Here the weights w_1, ..., w_n and the threshold θ are real numbers.

59 Linear Threshold Function In the winnow algorithm, Bob maintains as hypothesis a linear threshold function. h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ). Here the weights w_1, ..., w_n and the threshold θ are real numbers. [Figure: plot of the step function sign(z), which is 1 for z ≥ 0 and 0 for z < 0.]

60 Linear Threshold Function Linear threshold functions are used as a simple model of a neuron.

61 Linear Threshold Function Linear threshold functions are used as a simple model of a neuron. [Figure: inputs x_1, ..., x_5 with weights w_1, ..., w_5 feed a unit that asks whether Σ_i w_i x_i ≥ θ.]

62 Linear Threshold Function h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ).

63 Linear Threshold Function h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ). What can linear threshold functions do?

64 Linear Threshold Function h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ). What can linear threshold functions do? OR(x) = sign(x_1 + x_2 + ... + x_n − 1/2).

65 Linear Threshold Function h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ). What can linear threshold functions do? OR(x) = sign(x_1 + x_2 + ... + x_n − 1/2). AND(x) = sign(x_1 + x_2 + ... + x_n − (n − 1/2)).

66 Linear Threshold Function h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ). What can linear threshold functions do? OR(x) = sign(x_1 + x_2 + ... + x_n − 1/2). AND(x) = sign(x_1 + x_2 + ... + x_n − (n − 1/2)). MAJ(x) = sign(x_1 + x_2 + ... + x_n − n/2).
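As a quick sanity check, the sketch below (helper names are mine) evaluates these three threshold functions on every 0/1 input and compares them with the Boolean definitions; following the slides, sign(z) is read as 1 when z ≥ 0 and 0 otherwise.

```python
from itertools import product

def sign(z):
    """Step function as used on the slides: output 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

n = 3  # odd, so majority has no ties
for x in product([0, 1], repeat=n):
    s = sum(x)
    assert sign(s - 0.5) == int(any(x))                # OR
    assert sign(s - (n - 0.5)) == int(all(x))          # AND
    assert sign(s - n / 2) == int(s >= (n + 1) // 2)   # MAJ
print("OR, AND and MAJ are all linear threshold functions")
```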

67 Linear Threshold Function Another sanity check: the unknown function u(x) = x_{i_1} ∨ x_{i_2} ∨ ... ∨ x_{i_k} can also be expressed as an LTF: sign(x_{i_1} + x_{i_2} + ... + x_{i_k} − 1/2).

68 Linear Threshold Function A linear threshold function defines a plane in space and evaluates which side of the plane a point lies on. [Figure: the line x + y = 3 in the plane.]

69 Linear Threshold Function A linear threshold function defines a plane in space and evaluates which side of the plane a point lies on. [Figure: the plane x + y + z = 1 in three dimensions.]

70 Linear Threshold Function Can every function be expressed as a linear threshold function?

71 Linear Threshold Function Can every function be expressed as a linear threshold function? No, for example PARITY(x) = x_1 + x_2 + ... + x_n mod 2.

72 Linear Threshold Function Can every function be expressed as a linear threshold function? No, for example PARITY(x) = x_1 + x_2 + ... + x_n mod 2. Note that linear threshold functions are monotone (either non-decreasing or non-increasing) in each coordinate, while PARITY is not: flipping any coordinate flips its output.

73 Winnow Algorithm In the winnow algorithm, Bob maintains as hypothesis a linear threshold function. h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ).

74 Winnow Algorithm In the winnow algorithm, Bob maintains as hypothesis a linear threshold function. h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ). Initially, all weights are set to 1 and θ = n. On instance x, Bob guesses h(x) for the current hypothesis h.

75 Winnow Algorithm In the winnow algorithm, Bob maintains as hypothesis a linear threshold function. h(x) = sign(w_1 x_1 + w_2 x_2 + ... + w_n x_n − θ). Initially, all weights are set to 1 and θ = n. On instance x, Bob guesses h(x) for the current hypothesis h. Now, we need to define how to update the hypothesis after a mistake is made.

76 Recap so far... We want to learn a function which is a disjunction of k variables out of n possible. u(x) = x_{i_1} ∨ x_{i_2} ∨ ... ∨ x_{i_k}. Initially, we take as hypothesis the function sign(x_1 + x_2 + ... + x_n − n). After each mistake we will update the weight of each variable. Threshold will stay the same.

77 Update Rules In the winnow algorithm, Bob maintains as hypothesis a linear threshold function. h(x) = sign(w_1 x_1 + ... + w_n x_n − n)

78 Update Rules In the winnow algorithm, Bob maintains as hypothesis a linear threshold function. h(x) = sign(w_1 x_1 + ... + w_n x_n − n) Update: On false positive, i.e. u(x)=0 and h(x)=1, Bob sets w_i = 0 for all i where x_i = 1.

79 Update Rules In the winnow algorithm, Bob maintains as hypothesis a linear threshold function. h(x) = sign(w_1 x_1 + ... + w_n x_n − n) Update: On false positive, i.e. u(x)=0 and h(x)=1, Bob sets w_i = 0 for all i where x_i = 1. On false negatives, i.e. u(x)=1 and h(x)=0, Bob sets w_i ← 2w_i for all i where x_i = 1.
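A minimal sketch of this version of Winnow in Python, assuming the same example format as before (names are mine): weights start at 1, the threshold is n, and the weights of active variables are zeroed on false positives and doubled on false negatives.

```python
def winnow(examples, n):
    """Winnow with elimination for learning a monotone k-disjunction online.

    examples: iterable of (x, label), x a 0/1 list of length n,
              label the value of the unknown disjunction u(x).
    Returns the final weight vector and the number of mistakes.
    """
    w = [1.0] * n                                    # all weights start at 1
    mistakes = 0
    for x, label in examples:
        total = sum(wi for wi, xi in zip(w, x) if xi == 1)
        guess = int(total >= n)                      # threshold theta = n
        if guess == label:
            continue
        mistakes += 1
        factor = 0.0 if guess == 1 else 2.0          # demote or promote
        for i, xi in enumerate(x):
            if xi == 1:
                w[i] *= factor
    return w, mistakes
```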

80 Reasonable Update? Update: On false positive, i.e. u(x)=0 and h(x)=1, Bob sets w_i = 0 for all i where x_i = 1. On false negatives, i.e. u(x)=1 and h(x)=0, Bob sets w_i ← 2w_i for all i where x_i = 1.

81 Reasonable Update? Update: On false positive, i.e. u(x)=0 and h(x)=1, Bob sets w_i = 0 for all i where x_i = 1. On false negatives, i.e. u(x)=1 and h(x)=0, Bob sets w_i ← 2w_i for all i where x_i = 1. If the unknown disjunction contains x_i, then w_i will never be set to 0 (demotions only happen when u(x)=0, i.e. when every relevant variable is 0). On false negatives, give more weight to all variables which could be making the disjunction true.

82 Simple Observations h(x) = sign(w_1 x_1 + ... + w_n x_n − n) Update: On false positive, i.e. u(x)=0 and h(x)=1, Bob sets w_i = 0 for all i where x_i = 1. On false negatives, i.e. u(x)=1 and h(x)=0, Bob sets w_i ← 2w_i for all i where x_i = 1. 1) Weights always remain non-negative. 2) No weight will become larger than 2n.

83 Mistake Bound

84 Mistake Bound Call a weight doubling a promotion, and a weight being set to 0 a demotion.

85 Mistake Bound Call a weight doubling a promotion, and a weight being set to 0 a demotion. By the second observation, a weight can be promoted at most log n times. Otherwise, it would become larger than 2n.

86 Mistake Bound Call a weight doubling a promotion, and a weight being set to 0 a demotion. By the second observation, a weight can be promoted at most log n times. Otherwise, it would become larger than 2n. As there are only k relevant variables in the unknown disjunction, the total number of promotions is at most k log n. (Every false negative promotes at least one relevant variable, since u(x)=1 forces some relevant x_i = 1, and relevant weights are never demoted.)

87 Mistake Bound Consider when u(x)=0 and h(x)=1. Then Σ_{i : x_i = 1} w_i ≥ n. After the update, all of these weights will be set to 0. Thus the sum of all the weights will decrease by at least n after the update.

88 Mistake Bound Consider when u(x)=1 and h(x)=0. Then Σ_{i : x_i = 1} w_i < n. After the update, all of these weights will be doubled. Thus the sum of all the weights will increase by at most n after the update.

89 Mistake Bound For every demotion, sum of weights decreases by at least n. For every promotion, sum of weights increases by at most n. Total number of promotions is at most k log n.

90 Mistake Bound For every demotion, sum of weights decreases by at least n. For every promotion, sum of weights increases by at most n. Total number of promotions is at most k log n. We know that the sum of the weights must remain non-negative!

91 Mistake Bound For every demotion, the sum of weights decreases by at least n. For every promotion, the sum of weights increases by at most n. Total number of promotions is at most k log n. We know that the sum of the weights must remain non-negative! Since the sum of weights starts at n, the number of demotions is thus at most 1 + k log n.

92 The Final Grain If the unknown function is a k-disjunction, the total number of mistakes of the winnow algorithm is bounded by about 2k log n.
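For reference, the counting argument can be written out in two lines (a restatement of the slides' argument; here W is the sum of the weights, P the number of promotion updates and D the number of demotion updates).

```latex
\[
0 \;\le\; W_{\text{final}} \;\le\; \underbrace{n}_{\text{initial sum}}
  \;+\; \underbrace{P \cdot n}_{\text{each promotion adds} \le n}
  \;-\; \underbrace{D \cdot n}_{\text{each demotion removes} \ge n}
\qquad\Longrightarrow\qquad D \;\le\; 1 + P.
\]
\[
P \;\le\; k \log n
\qquad\Longrightarrow\qquad
\text{mistakes} \;=\; P + D \;\le\; 1 + 2k\log n \;\approx\; 2k\log n.
\]
```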

93 A Softer Version This algorithm is harsh---once a weight is set to zero it can never come back. We can instead consider a softer variation: Update: On false positive, i.e. u(x)=0 and h(x)=1, Bob sets w_i ← w_i / 2 for all i where x_i = 1. On false negatives, i.e. u(x)=1 and h(x)=0, Bob sets w_i ← 2w_i for all i where x_i = 1.
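Only the mistake branch changes relative to the earlier Winnow sketch: a small helper (name mine) showing the softer multiplicative update, where weights are halved instead of zeroed and can therefore recover.

```python
def soft_update(w, x, guess, label):
    """Softer Winnow update: halve active weights on a false positive,
    double them on a false negative, leave everything alone otherwise."""
    if guess == label:
        return
    factor = 0.5 if guess == 1 else 2.0   # false positive -> 1/2, false negative -> 2
    for i, xi in enumerate(x):
        if xi == 1:
            w[i] *= factor
```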

94 Learning from Experts Who will win the world cup?

95 Learning from Experts Who will win the world cup?

96 Learning from Experts Who will win the world cup? Everyone has a prediction!

97 Learning from Experts Who will win the world cup? Everyone has a prediction! Brazil

98 Learning from Experts Who will win the world cup? Everyone has a prediction! Brazil Holland

99 Learning from Experts Who should we listen to?

100 Learning from Experts Who should we listen to? Can follow an amazing algorithm!

101 Learning from Experts Who should we listen to? Can follow an amazing algorithm! If there are n experts, the algorithm will make at most 2.1k + 20 log n mistakes, where k is the number of mistakes made by the best expert.

102 Learning from Experts Who should we listen to? Can follow an amazing algorithm! If there are n experts, the algorithm will make at most 2.1k + 20 log n mistakes, where k is the number of mistakes made by the best expert. This is without knowing the best expert in advance!

103 Weighted Majority

104 Weighted Majority The algorithm is very similar to Winnow.

105 Weighted Majority The algorithm is very similar to Winnow. Give a weight to each expert. Initially, set all weights equal to 1.

106 Weighted Majority The algorithm is very similar to Winnow. Give a weight to each expert. Initially, set all weights equal to 1. Predict according to the side with more weight.

107 Weighted Majority The algorithm is very similar to Winnow. Give a weight to each expert. Initially, set all weights equal to 1. Predict according to the side with more weight. After each example, update the weights: If expert i is wrong: w_i ← 0.99 w_i. If expert i is right: w_i stays the same.
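A minimal sketch of this weighted majority rule for binary 0/1 predictions (the names and example data are mine; the 0.99 penalty matches the slide).

```python
def weighted_majority(expert_preds, outcomes, beta=0.99):
    """Weighted majority over binary predictions.

    expert_preds: one list of 0/1 expert predictions per round.
    outcomes:     the correct 0/1 answer for each round.
    Returns (mistakes made by the algorithm, final expert weights).
    """
    w = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, truth in zip(expert_preds, outcomes):
        weight_for_one = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = int(weight_for_one >= sum(w) / 2)    # follow the heavier side
        mistakes += int(guess != truth)
        for i, p in enumerate(preds):
            if p != truth:                           # penalize wrong experts
                w[i] *= beta
    return mistakes, w

# Three experts, two rounds.
print(weighted_majority([[1, 0, 1], [0, 0, 1]], [1, 1]))
```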

108 Big Picture Iterative update is a powerful technique. Generalizations of the weighted majority algorithm have been applied to find optimal strategies in games, to solve optimization problems with constraints, and to give approximations for hard problems. For these applications, the constraints play the role of experts!

109 Further Reading Challenge question: prove that the weighted majority algorithm works as described. Survey paper: The multiplicative weights update method, by Arora, Hazan, and Kale. For much more on learning, see the homepage of Rocco Servedio.

Is our computers learning?

Is our computers learning? Attribution: These slides were prepared for the New Jersey Governor's School course The Math Behind the Machine, taught in the summer of 2012 by Grant Schoenebeck. Large parts of these slides were copied
