Computational Learning Theory: PAC Model

Subhash Suri
May 19, 2015

1 A Rectangle Learning Game

These notes are based on the paper "A Theory of the Learnable" by Valiant, the book by Kearns and Vazirani, and notes by Avrim Blum. PAC learning is a good example of a simple, elegant theory built to formally study a messy and complex problem: learning.

Quoting from the book "Learning from Data" by Abu-Mostafa, Magdon-Ismail, and Lin: show a picture to a 3-year-old and ask if there is a tree in it, and you will likely get the correct answer. Ask a 30-year-old what the definition of a tree is, and you will likely get an inconclusive answer. We didn't learn what a tree is by studying the mathematical definition of trees; we learned by looking at trees. In other words, we learned from data.

Data-driven learning has shown great promise in a number of practical applications, ranging from financial forecasting to medical diagnosis, computer vision, search engines, recommendation systems, etc. It is particularly effective where concepts are somewhat fuzzy and difficult to model precisely and rigorously. For instance, how does Netflix recommend movies for you to watch? Specifying our own tastes in a rigorous form is likely an impossible task, but our past preferences and ratings are a good indicator. Data-driven (machine) learning builds on this idea. This lecture is a very brief attempt to introduce a theoretical framework for understanding both the complexity and the power of data-driven learning.

Consider a simple 1-player learning game in which the object is to learn an unknown axis-parallel rectangle R. (This is easy to extend to d-dimensional boxes.) The player receives information only through the following process (see Fig. 1): a random point p is chosen (according to some fixed probability distribution D) and the player is told p's label: positive (inside R) or negative (outside).

The goal is to use as few examples (and as little computation) as possible to construct a hypothesis rectangle R' that is a good approximation of R. Informally, the player's knowledge is tested by picking a new point q at random, using the same distribution D, and checking whether he can correctly decide the label of this new point. Formally, the quality of learning is measured by the error: the probability, under D, of the symmetric difference (R − R') ∪ (R' − R). Throughout, the focus will be on standard CS metrics: (1) the number of examples/queries needed; (2) the amount of computation to form/update the hypothesis; and (3) the amount of error and the confidence.

Motivation. Imagine a medical learning process, in which the 2-dimensional plane represents the attribute space (weight, cholesterol). We may hypothesize that a healthy person's weight and cholesterol levels lie in some nice range, forming a rectangle, but we don't know what these ranges are. Or, suppose we wish to teach a program to recognize medium-built males, where the x-axis is weight and the y-axis is height. The learner is shown random examples, each labeled with a + (medium built) or a - (not medium built). How effectively can one teach the concept of "medium-built male" through this process?

The Distribution D. The program goes through a learning/training phase, in which random examples are used to construct a hypothesis rectangle R'. After the training, the rectangle R' is our model, so we want to know how likely R' is to be wrong in its future classifications. What assumptions are needed on the probability distribution D? It need not be uniform; we only require that learning and testing use the same distribution D. Indeed, suppose that in the learning phase each man in the city is chosen with equal probability. Even under this assumption, the corresponding points in the plane are not uniformly distributed: not all heights and weights are equally likely, and in fact height and weight may be highly dependent. The sampling will follow some fixed distribution D, which may be quite difficult to characterize, but as long as both the samples and the test points are chosen with respect to the same D, we are fine.

Learning. Our strategy is simple: request a sufficiently large number m of sample points. Then choose as R' the smallest axis-parallel rectangle that includes all the + examples and excludes all the - examples. (If no + examples are drawn, then R' = ∅.)

Error Analysis and Predictive Power. We show that for any target concept R, any distribution D, and any values ε, δ, we can request m samples (how many?) so that, with probability at least 1 − δ, R' misclassifies (with respect to R) with error at most ε.

First, observe that R' ⊆ R: the hypothesis is contained entirely in R, so the error region is R − R'. We can cover this difference by the union of 4 rectangular strips of R (with overlap near the corners). We show that, with high probability, a random test point under D falls in any one of these strips with probability at most ε/4, so the probability of error over all four strips (by the union bound) is at most ε. The analysis will reveal the number of samples m needed to achieve this.

Consider the top strip T (the part of R above R'), and suppose it has weight more than ε/4. This can happen only if none of the m samples fell into the topmost sub-strip of R of weight exactly ε/4, an event of probability (1 − ε/4)^m. Taking the union over all four strips, the probability of failure is at most 4(1 − ε/4)^m. If we want this to be at most δ, we need 4(1 − ε/4)^m ≤ δ. Using the fact that (1 − x) ≤ e^{−x}, we have 4e^{−εm/4} ≤ δ, which gives

    m ≥ (4/ε) ln(4/δ).

Thus, the Tightest-Fitting Rectangle algorithm takes a sample of O((1/ε) ln(1/δ)) examples to form a hypothesis that classifies nearly as well as R, with confidence at least 1 − δ.
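To make the rectangle game concrete, here is a minimal simulation sketch in Python. The particular target rectangle, the choice of a uniform distribution on the unit square, and the function names are illustrative assumptions, not part of the original notes.

    import math, random

    def sample_bound(eps, delta):
        # m >= (4/eps) * ln(4/delta), from the four-strip argument above
        return math.ceil(4.0 / eps * math.log(4.0 / delta))

    def tightest_fitting_rectangle(samples):
        # samples: list of ((x, y), label); label 1 = inside the target R, 0 = outside
        pos = [p for p, label in samples if label == 1]
        if not pos:
            return None                      # R' is empty: classify everything as negative
        xs, ys = zip(*pos)
        return (min(xs), max(xs), min(ys), max(ys))

    # Illustrative target R = [0.2, 0.7] x [0.3, 0.9], with D uniform on the unit square.
    R = (0.2, 0.7, 0.3, 0.9)
    inside = lambda p, r: r[0] <= p[0] <= r[1] and r[2] <= p[1] <= r[3]
    m = sample_bound(eps=0.1, delta=0.05)    # 176 samples suffice for eps=0.1, delta=0.05
    data = [((x, y), int(inside((x, y), R)))
            for x, y in ((random.random(), random.random()) for _ in range(m))]
    R_hat = tightest_fitting_rectangle(data)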

1.1 PAC (Probably Approximately Correct) Learning

This is an example of PAC learning, which has the following important features:

1. We learn an unknown target set, but the target class is not arbitrary: we have an idea of its general form (e.g., a rectangle).

2. Learning is probabilistic. Examples are drawn at random, using an arbitrary, unknown, and unconstrained distribution.

3. The hypothesis of the learner is evaluated relative to the same probability distribution D, and we allow an approximation of the target concept.

4. We are interested in computational efficiency: how few examples (and how little computation) suffice to achieve high confidence.

1.2 The General Model

We have an instance space X, such as the points in the plane in the rectangle game, or the set of all 2-dimensional arrays of binary pixels in character recognition. A concept c over X is a subset of the instance space (e.g., a rectangle classifying medium-built males, or the set of arrays whose pixels correspond to a valid character, say "A", assuming every array either exemplifies the character A or fails to exemplify it). A concept can thus be thought of as a boolean mapping c : X → {0, 1}, where c(x) = 1 indicates that x is a positive example, and c(x) = 0 indicates that x is a negative example.

A concept class C over X is a collection of concepts. In the rectangle game, the target rectangle was chosen from the class of all axis-parallel rectangles. (As another example, the concept class can be the pixel maps of letters, and a concept c a specific letter, such as A.) Ideally, we want concept classes that are sufficiently expressive, but still learnable. As yet another example, c can be a boolean formula over n variables, with its satisfying assignments over {0, 1}^n as the positive examples.

In the PAC model, the algorithm is faced with an unknown target concept c from the class C. The learning algorithm is shown (random) positive or negative examples for c, and it is judged by its ability to identify a hypothesis concept h that accurately classifies instances as positive or negative for c. Note that the learner is assumed to know the target concept class; it just doesn't know which concept c in the class is the target.

D is an arbitrary, fixed probability distribution over the instance space X, unknown to the algorithm. The learning algorithm's error is measured as

    err(h) = Pr_{x ~ D}[ c(x) ≠ h(x) ],

where c and h are regarded as boolean functions. Geometrically, we can think of err(h) as the weight under D of the symmetric difference between c and h (picture the Venn diagram). EX(c, D) is a procedure (oracle) that returns a labeled example (x, c(x)), where x is drawn randomly and independently from D. The goal is to achieve small err(h) while making as few calls to EX as possible.
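The oracle and the error measure are easy to phrase in code. The following is a minimal sketch, assuming concepts and distributions are represented as plain Python callables; the function names are my own, and estimate_err is only a Monte Carlo estimate of the true error.

    import random

    def make_EX(c, D):
        # EX(c, D): each call draws x ~ D and returns the labeled example (x, c(x))
        def EX():
            x = D()
            return x, c(x)
        return EX

    def estimate_err(h, c, D, trials=100_000):
        # Monte Carlo estimate of err(h) = Pr_{x~D}[ c(x) != h(x) ]
        return sum(c(x) != h(x) for x in (D() for _ in range(trials))) / trials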

PAC Learning Model. Suppose C is a concept class over X. We say that C is PAC-learnable if there exists an algorithm L with the following property: for every concept c in C, for every distribution D on X, and for all 0 < ε, δ < 1/2, L outputs a hypothesis h ∈ C such that err(h) ≤ ε with probability at least 1 − δ. The probability is over the random examples drawn by calls to EX and any internal randomization of L. If L runs in time polynomial in 1/ε and 1/δ (and, for parameterized classes such as those below, in n and size(c)), we say C is efficiently PAC-learnable.

So, for example, the concept class of axis-aligned rectangles in 2 dimensions is efficiently PAC-learnable.

2 Learning Boolean Conjunctions

A concept here is a conjunction of boolean literals, e.g., x1 ∧ ¬x3 ∧ x4. C_n is the class of all conjunctions of literals over x1, ..., xn. The instance (example) space is X_n = {0, 1}^n, where each a ∈ X_n is interpreted as an assignment to the n boolean variables, with a_i the ith bit of a. We set c(a) = 1 if the assignment is satisfying (a positive example) and c(a) = 0 if it is non-satisfying (a negative example). The conjunction x1 ∧ ¬x3 ∧ x4 represents the set {a ∈ {0, 1}^n : a1 = 1, a3 = 0, a4 = 1}. (Here x2 is a "don't care" variable.) The size of c, size(c), equals the number of literals in c; clearly size(c) ≤ 2n.

Algorithm for Boolean Conjunctions. We prove that the class of conjunctions of boolean literals is efficiently PAC-learnable. The algorithm begins with the hypothesis

    h = x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ ... ∧ xn ∧ ¬xn.

Thus, initially, h has no satisfying assignments. The algorithm simply ignores any negative examples returned by EX(c, D). Let (a, 1) be a positive example returned by EX. In response, the algorithm updates h as follows: for each i, if a_i = 0, delete x_i from h; if a_i = 1, delete ¬x_i from h. That is, the algorithm simply deletes any literal that contradicts the positive data.
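Here is a minimal sketch of this elimination algorithm in Python, assuming examples arrive as (bit tuple, 0/1 label) pairs; the literal encoding (i, True) for x_i and (i, False) for ¬x_i is my own choice.

    def learn_conjunction(examples, n):
        # Start with all 2n literals; (i, True) means x_i, (i, False) means not-x_i.
        h = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
        for a, label in examples:
            if label != 1:
                continue                          # negative examples are ignored
            for i in range(n):
                # delete the literal contradicted by this positive example
                h.discard((i, True) if a[i] == 0 else (i, False))
        return h

    def predict(h, a):
        # h(a) = 1 iff every surviving literal is satisfied by a
        return int(all((a[i] == 1) == positive for i, positive in h))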

Analysis. Now we analyze the performance of the learned hypothesis h. First, note that at any time the set of literals in h contains the set of literals in c: initially h contains them all, and a literal is deleted only when it is set to 0 in a positive example (meaning it cannot be in c). This also means that h is more specific than c, so every assignment satisfying h also satisfies c; in other words, h never errs on a negative example for c.

So, consider a literal z that occurs in h but not in c. Note that z causes h to err only on positive examples; that is, z falsely causes h to output h(a) = 0 while c(a) = 1. For this to happen, the literal z must evaluate to 0 on a. This is also precisely the kind of positive example that would have caused the algorithm to delete z from h, but such an example was not seen. Define

    p(z) = Pr{ c(a) = 1 and z = 0 in a },

where the probability is over instances a drawn from D. Since every error of h can be blamed on at least one such literal z of h, by the union bound we have

    err(h) ≤ Σ_{z ∈ h} p(z).

Now, say that a literal z is bad if p(z) ≥ ε/2n. If h contains no bad literal, then clearly err(h) ≤ 2n · (ε/2n) = ε. So let us assume that at least one literal is bad, and bound the probability of that event. For any fixed bad literal z, the probability that it was not deleted from h after m calls to EX is at most (1 − ε/2n)^m, because each call to EX has probability at least ε/2n of returning an example that causes z to be deleted, and z survived m such calls. By the union bound (over the 2n literals), the probability that some bad literal survives in h is at most 2n(1 − ε/2n)^m. How big should m be to make this probability less than δ? It suffices that

    2n e^{−εm/2n} ≤ δ,  i.e.,  m ≥ (2n/ε) { ln(2n) + ln(1/δ) }.

Theorem. The class of boolean conjunctions (1-CNF) is efficiently PAC-learnable.

2.1 Intractability of Learning 3-Term DNF Formulas

We now show that a slight generalization of the representation class of boolean conjunctions becomes intractable to learn. The class we consider is called 3-term DNF: the set of all disjunctions of the form T1 ∨ T2 ∨ T3, where each T_i is a conjunction of literals over the boolean variables x1, ..., xn. (The size of each such concept is at most 6n, because each T_i contains each of the 2n possible literals x_i, ¬x_i at most once.) To learn this class efficiently, we need an algorithm polynomial in n, 1/ε, and 1/δ. We will see that, while concepts expressed as 1-CNF are efficiently learnable, concepts expressed as an OR of 3 conjunctions are not (under a standard complexity assumption).
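To get a feel for this sample bound, a quick calculation sketch (the particular numbers are illustrative, not from the notes):

    import math

    def conjunction_sample_bound(n, eps, delta):
        # m >= (2n/eps) * (ln(2n) + ln(1/delta))
        return math.ceil(2 * n / eps * (math.log(2 * n) + math.log(1 / delta)))

    # e.g., n = 10 variables, eps = 0.1, delta = 0.05:
    # (20/0.1) * (ln 20 + ln 20) ~ 200 * 5.99 ~ 1199 examples
    print(conjunction_sample_bound(10, 0.1, 0.05))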

Theorem. Unless RP = NP (randomized polynomial time equals NP), the class of 3-term DNF formulas is not efficiently PAC-learnable.

The idea is to reduce an NP-complete problem (language) A to PAC learning of 3-term DNF. We take an instance α of A and construct a set S_α of labeled examples so that α ∈ A iff S_α is consistent with some concept c ∈ C. In our case, we will use Graph 3-Coloring as A, and C = 3-term DNF.

How does a PAC learning algorithm L for the concept class C allow us to determine, with high probability, whether there is a concept in C consistent with S_α?

General Method. Fix the error parameter ε = 1/(2|S_α|), and answer each request of L for a random example by choosing a pair (x_i, b_i) (an example and its label) uniformly at random from S_α. If there is a concept c consistent with S_α, then this simulation emulates the oracle EX(c, D), where D is the uniform distribution on S_α. In this case, by our choice of ε, any hypothesis h with error less than ε must in fact be completely consistent with S_α: if h errs on even a single example in S_α, its error with respect to c and D is at least 1/|S_α| = 2ε, which is strictly larger than ε. On the other hand, if there is no concept consistent with S_α, then L cannot possibly find one. Thus, we can simply check the output of L for consistency with S_α to determine, with confidence 1 − δ, whether there is a concept consistent with the examples.

Discussion. The hardness arises from our insistence that the error be made very small, namely smaller than 1/(2|S_α|), which forces the hypothesis to be consistent with every example; it essentially forces a large sample complexity.

Graph 3-Coloring. Given an undirected graph G = (V, E) on n nodes, determine whether its vertices can be colored with 3 colors so that no adjacent pair receives the same color.

We now describe a mapping from an instance G to a set S_G of labeled examples. S_G will have positive examples S_G^+ and negative examples S_G^-. In particular, for each node i, S_G^+ contains the labeled example (v(i), 1), where v(i) is the n-bit binary vector with a 0 in the ith position and 1's everywhere else. (Intuitively, these n examples encode the vertices of G.) For each edge (i, j) ∈ E, we create in S_G^- a negative example (e(i, j), 0), where e(i, j) is the n-bit vector with 0's in the ith and jth bits and 1's everywhere else. (See the example in Figure 1.5 of the Kearns-Vazirani book.)

Let R, B, Y denote the 3 colors (red, blue, and yellow). We now argue that G is 3-colorable if and only if the examples S_G are consistent with some 3-term DNF formula.
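The mapping from a graph to labeled examples is mechanical; here is a minimal sketch in Python (the 0-based vertex indices, the tuple encoding, and the function name are my own choices):

    def reduction_examples(n, edges):
        # Map a graph G = ({0,...,n-1}, edges) to the labeled sample S_G described above.
        def vec(zeros):
            return tuple(0 if i in zeros else 1 for i in range(n))
        positives = [(vec({i}), 1) for i in range(n)]           # v(i): 0 in position i only
        negatives = [(vec({i, j}), 0) for (i, j) in edges]      # e(i,j): 0 in positions i and j
        return positives + negatives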

First, suppose that G is 3-colorable, and fix one such coloring. Let R be the set of all vertices colored red, and let T_R be the conjunction of all variables among x1, x2, ..., xn whose index does not appear in R. Then, for each i ∈ R, v(i) satisfies T_R, because the variable x_i does not appear in T_R. Furthermore, no e(i, j) ∈ S_G^- can satisfy T_R: since i and j cannot both be colored red, one of x_i or x_j must appear in T_R. Similarly, we define terms T_B and T_Y from the blue and yellow color classes; they are satisfied by the corresponding positive examples in the same way, with no negative examples satisfying them.

In the converse direction, suppose that the formula T_R ∨ T_B ∨ T_Y is consistent with all the examples of S_G. Then we can define a coloring as follows: the color of vertex i is red if v(i) satisfies T_R, blue if v(i) satisfies T_B, and yellow if v(i) satisfies T_Y (breaking ties arbitrarily). Since the formula is consistent with S_G, every v(i) must satisfy some term, and so every vertex is assigned a color. We now argue that this is a legal 3-coloring. Suppose i and j, with i ≠ j, are assigned the same color (say red); then both v(i) and v(j) satisfy T_R. Since the ith bit of v(i) is 0 while the ith bit of v(j) is 1, neither x_i nor ¬x_i can appear in T_R. Since v(j) and e(i, j) differ only in the ith bit, if v(j) satisfies T_R then so does e(i, j), which means that e(i, j) is not in S_G^-, and so (i, j) is not in E. Hence no two adjacent vertices receive the same color.
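The forward direction of this argument can also be checked mechanically. A small sketch, building on reduction_examples above (the encoding of a term as a set of variable indices is an assumption of mine):

    def terms_from_coloring(n, coloring):
        # coloring: dict vertex -> 'R', 'B', or 'Y'; T_c = conjunction of x_i for i not colored c
        return {c: {i for i in range(n) if coloring[i] != c} for c in ('R', 'B', 'Y')}

    def dnf_value(terms, a):
        # Value of the 3-term DNF T_R v T_B v T_Y on assignment a (terms contain only positive variables)
        return int(any(all(a[i] == 1 for i in term) for term in terms.values()))

    def consistent(terms, examples):
        return all(dnf_value(terms, a) == label for a, label in examples)

For any legal 3-coloring of G, consistent(terms_from_coloring(n, coloring), reduction_examples(n, edges)) should return True.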

2.2 Learning Decision Lists

Imagine a bank or mortgage company trying to assess the default risk of its loans. It may base this decision on many variables: %downpayment, recent delinquency, income, other accounts in good standing, high debt, mortgage/income ratio, etc. It can then come up with reasonable rules, such as: predict GOOD if (no recent delinquency) AND (%down > 5). The bank may arrive at such rules by studying a database of many individuals. This falls into a general type of concept learning: the decision list.

A k-decision list over boolean variables x1, x2, ..., xn is an ordered sequence (list) L = (c1, b1), (c2, b2), ..., (cl, bl) together with a default bit b, where (a) each c_i is a conjunction of at most k literals over x1, ..., xn, and (b) each b_i is a bit in {0, 1}. For any input a ∈ {0, 1}^n, the value L(a) is set to b_j, where j is the smallest index for which c_j(a) evaluates to 1. If no such j exists, then L(a) is set to the default bit b.

An example of a 2-decision list:

    (x1 ∧ x3, 1), (x4, 1), (¬x2 ∧ x3, 1), (x1 ∧ x5, 0), (¬x4 ∧ x6, 1), (x1 ∧ x6, 0), with default b = 1.

On input a = 011011, this evaluates to L(a) = 1, because the 5th clause (j = 5) is the first one satisfied.

Suppose we posit that a target concept (delinquency likelihood) can be learned as a decision list over a set of variables. Then the PAC learner draws some random samples along with their outcome labels, e.g.:

    x1 x2 x3 x4 x5 | Label
     1  0  0  1  1 |   +
     0  1  1  0  0 |   -
     1  1  1  0  0 |   +
     0  0  0  1  0 |   -
     1  1  0  1  1 |   +
     1  0  0  0  1 |   -

A decision list consistent with this data is:

    if (x1 = 0) then -
    elseif (x2 = 1) then +
    elseif (x4 = 1) then +
    else -
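Evaluating a decision list is a straightforward first-match scan. A minimal sketch in Python, using 0-based indices and encoding a literal as (index, required_bit); the list encoded below is the 2-decision-list example above.

    def evaluate_decision_list(rules, default, a):
        # rules: list of (clause, bit); a clause is a list of (index, required_bit) literals
        for clause, bit in rules:
            if all(a[i] == required for i, required in clause):
                return bit                      # first satisfied clause decides the output
        return default

    # The 2-decision list from the text (x1..x6 become indices 0..5):
    L = [([(0, 1), (2, 1)], 1),                 # (x1 ^ x3, 1)
         ([(3, 1)], 1),                         # (x4, 1)
         ([(1, 0), (2, 1)], 1),                 # (~x2 ^ x3, 1)
         ([(0, 1), (4, 1)], 0),                 # (x1 ^ x5, 0)
         ([(3, 0), (5, 1)], 1),                 # (~x4 ^ x6, 1)
         ([(0, 1), (5, 1)], 0)]                 # (x1 ^ x6, 0)
    print(evaluate_decision_list(L, default=1, a=(0, 1, 1, 0, 1, 1)))   # -> 1 (5th clause fires)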

Learning a 1-Decision List (k-decision lists are similar). A simple greedy procedure, sketched in code below:

1. Start with an empty list.

2. Find any if-then rule consistent with the remaining data (and satisfied by at least one remaining example).

3. Put the rule at the bottom of the list so far, and cross off the examples covered by this rule.

4. Repeat until no examples remain.

If the algorithm gets stuck, then no decision list is consistent with the remaining (and hence the original) data. (Work through the delinquency example over x1, ..., x5 above.)

Analysis. Why should we expect to do well on future data? Suppose our hypothesis decision list h has error err(h) > ε. Then the probability that this h survives m examples (makes no mistake on any of them) is at most (1 − ε)^m. Now, the entire family of 1-decision lists over n boolean variables has at most n! · 4^n members: there are at most n! orderings of the variables, and 4 possibilities for the (value, label) pair of each variable, since each of the value and the label is binary. Then, by the union bound, the probability that at least one bad member of this decision list family survives all m examples is at most

    n! · 4^n · (1 − ε)^m.

We can make this probability smaller than δ by choosing

    m ≥ (1/ε) { 2n + n ln n + ln(1/δ) }.

Thus, 1-decision lists are efficiently PAC-learnable.

Confidence vs. Sample Complexity. There is nothing special about the decision list in this proof. The essence is just this: if there are not too many rules to choose from, then it is unlikely that a bad rule will fool us by chance. The proof generalizes to any hypothesis space H, and we get: if m ≥ (1/ε){ ln |H| + ln(1/δ) }, then after m examples, with probability > 1 − δ, every h ∈ H with err(h) ≥ ε has err_S(h) > 0 (i.e., it makes at least one mistake on the sample), where err_S is the error on the samples and err is the error under the distribution D.
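Here is a minimal sketch of the greedy 1-decision-list learner, assuming examples are (bit-tuple, label) pairs with labels in {0, 1}; the representation of a rule as (index, value, label) is my own.

    def learn_1dl(examples, n):
        # Greedy learner: repeatedly pick a rule "if x_i = v then b" that is consistent
        # with, and covers at least one of, the remaining examples.
        remaining = list(examples)
        rules = []
        while remaining:
            labels = {label for _, label in remaining}
            if len(labels) == 1:                  # everything left agrees: use it as the default
                return rules, labels.pop()
            found = False
            for i in range(n):
                for v in (0, 1):
                    covered = [label for a, label in remaining if a[i] == v]
                    if covered and len(set(covered)) == 1:
                        rules.append((i, v, covered[0]))
                        remaining = [(a, label) for a, label in remaining if a[i] != v]
                        found = True
                        break
                if found:
                    break
            if not found:
                return None                       # no 1-decision list is consistent with the data
        return rules, 0                           # default bit is irrelevant: every example was covered

On the delinquency table above, this sketch produces the rules (if x1 = 0 then -), (if x2 = 1 then +), (if x4 = 0 then -) with default +, which classifies all six examples correctly and is equivalent on this sample to the list given in the text.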

PAC and Occam's Razor. PAC learning has a nice philosophical basis in Occam's Razor. William of Occam (c. 1320 AD) contended that, all other things being equal, simpler explanations are preferable. In a computer-science-ish way, simple = short description, and at most 2^s explanations are less than s bits long. If the number of examples seen satisfies

    m ≥ (1/ε) { s ln 2 + ln(1/δ) },

then it is unlikely that a bad simple explanation will fool us by chance. Of course, there is no guarantee that a short explanation always exists for the data.

2.3 The Problem of Experts

PAC learning posits that there is a perfect but unknown function (target) that we are trying to learn. What can we do when there is no such function? Let us formulate a game, which we call Expert Advice. Consider a simplified setting: we want to learn how to predict the stock market (or whether it will rain): will it go up, or down, tomorrow? There cannot be a perfect function for this problem. Instead, suppose there are n experts offering us their advice. Each day, each expert makes his prediction, YES or NO. The learning algorithm must use this information to make its own prediction (the algorithm has no other input). After making its prediction, the algorithm is told the outcome (yes/no, up/down).

In the absence of any supernatural quality of the experts, we cannot make any absolute guarantees. The best we can hope for is that the algorithm doesn't do much worse than the best expert in hindsight. Unlike PAC, we are not going to evaluate the algorithm on future predictions; instead, we only compare it to the best expert's performance so far. This is similar to the competitive-ratio measure for online algorithms. Equivalently, we want a strategy that minimizes our regret. We can view each expert as a different hypothesis h ∈ C, or we can think of this as a special case of learning single-variable concepts.

The Weighted Majority Algorithm (WMA). The algorithm is extremely simple: always side with the weighted majority of the experts and, after each observation, downgrade the experts whose predictions were wrong. Formally, the algorithm maintains a list of weights w1, ..., wn (representing our trust or confidence in the experts), one per expert, and operates as follows.

1. Initially, set all the weights to 1.

2. Given a set of predictions x1, x2, ..., xn by the experts, output the prediction with the highest total weight. That is, output 1 if Σ_{i∈I} w_i ≥ Σ_{i∈Z} w_i and 0 otherwise, where I is the set of experts predicting 1 and Z is the set predicting 0.

3. When the correct answer is revealed, penalize each wrong expert by halving his weight.

4. Go to 2.

Theorem. The number of mistakes M made by the WMA is no more than 2.41(m + log n), where m is the number of mistakes made by the best expert so far (logs are base 2).

Proof. Let W denote the total weight of all the experts; initially W = n. If the algorithm makes a mistake, then at least half of the total weight belonged to experts who also predicted wrong, and in Step 3 their weight is halved. Thus, a mistake by the algorithm causes the weight W to drop by at least W/4. So, after the algorithm has made M mistakes, we have

    W ≤ n (3/4)^M.

On the other hand, if the best expert makes m mistakes, then its weight is 1/2^m. Thus we have 1/2^m ≤ W ≤ n(3/4)^M, which gives

    M ≤ (m + log n) / log(4/3) ≤ 2.41 (m + log n).

Improvements to WMA. One can improve the WMA in two ways: (a) instead of predicting with the deterministic weighted majority, we can use the weights as probabilities and choose the outcome randomly in proportion to them; (b) instead of penalizing the wrong experts by the factor 1/2, we can penalize each by a factor β. The analysis then gives the following mistake bound:

    M ≤ { m ln(1/β) + ln n } / (1 − β).

E.g., when β = 3/4, we get M ≤ 1.15 m + 4 ln n. By further adjusting β dynamically, one can also get M ≤ m + ln n + O(√(m ln n)).
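A minimal sketch of the deterministic WMA in Python; the input format (a list of prediction vectors plus the sequence of true outcomes) and the function name are illustrative assumptions.

    def weighted_majority(expert_predictions, outcomes, beta=0.5):
        # expert_predictions: one list of n predictions (0/1) per round; outcomes: truth per round
        n = len(expert_predictions[0])
        w = [1.0] * n
        mistakes = 0
        for preds, truth in zip(expert_predictions, outcomes):
            weight_for_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
            weight_for_0 = sum(wi for wi, p in zip(w, preds) if p == 0)
            guess = 1 if weight_for_1 >= weight_for_0 else 0
            mistakes += int(guess != truth)
            # penalize every expert that was wrong this round (factor 1/2 in the text)
            w = [wi * beta if p != truth else wi for wi, p in zip(w, preds)]
        return mistakes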

3 The Perceptron Algorithm

Good reading material for this section includes the lecture notes of Avrim Blum, the large-margin classification paper of Freund and Schapire, and the book of Abu-Mostafa.

Learning Linear Separators. A fundamental idea for learning complex binary classifiers is the following: use some mapping to transform the instances into a high-dimensional space in which the two classes are linearly separable, and then find the vector that classifies all the data correctly and maximizes the margin, the minimum distance between the hyperplane and the instances. The Perceptron algorithm is a particularly simple, and the oldest, method for finding a linear separator; we will show that if the instances are linearly separable with a large margin, then the Perceptron algorithm converges quickly.

Assumptions.

1. The feature space is R^n, and we are looking for a linear classifier of the form

    w1 x1 + w2 x2 + ... + wn xn > 0.

In general, we may want a classifier of the form w·x > b, but that can be simulated by introducing an extra coordinate x0: the condition becomes w1 x1 + w2 x2 + ... + wn xn − b x0 > 0, where we always set x0 = 1 and w0 = −b. So, from now on, assume that R^n already accounts for this transformation.

2. Each instance has a binary label in {−1, +1}, and the labels are error-free.

3. We assume that a linear separator w* exists. That is, for each positive example we have w*·x ≥ 0, and for each negative example w*·x < 0. Moreover, the classification has margin γ > 0; that is, |w*·x| / ‖x‖ ≥ γ for all examples x.

4. We normalize the instances so that ‖x‖ = 1. Since we are looking for a hyperplane through the origin, we can project all the points x onto the unit sphere centered at the origin without changing their classification.

Quick vector refresher. The equation w·x = 0 defines a hyperplane whose normal vector is w. Given a point x, the quantity |w·x| / ‖w‖ is the Euclidean distance between x and this hyperplane. For any two unit vectors x1, x2 ∈ R^n, we have x1·x2 = ‖x1‖ ‖x2‖ cos(x1, x2) ≤ 1.

The Perceptron Algorithm.

1. The algorithm processes the instances in an online manner, one at a time.

2. Start with the all-zero vector w1 = 0, and initialize the iteration count t = 1.

3. Given the next example x, predict positive iff w_t·x > 0.

4. On a mistake, update as follows. Mistake on a positive example: w_{t+1} ← w_t + x. Mistake on a negative example: w_{t+1} ← w_t − x.

5. t ← t + 1.

Remark 1. A mistake on a positive instance x means that w_t·x ≤ 0, while it should be positive. Our update changes the weight vector so that the revision reduces the error (by 1): w_{t+1}·x = (w_t + x)·x = w_t·x + 1, where the last step uses the fact that x is a unit vector, so x·x = 1. Similarly, a mistake on a negative example gives w_{t+1}·x = (w_t − x)·x = w_t·x − 1.

Remark 2. An important point is that this adjustment can potentially misclassify other points that were previously correct; that is, w_{t+1} may now be wrong on examples where w_t was right! In fact, this is the reason why the Perceptron can take a very long time to converge when γ is very small.

Remark 3. Observe that the weight vector w_t is not normalized: its magnitude can be arbitrarily large, and so w_t·x can grow unboundedly.

Remark 4. The vector w_t we learn is a weight vector that encodes the relative importance (weight) of the different features (dimensions of the data).

Remark 5. Intuition for the Perceptron. The vector w_t defines a hyperplane H_t (with normal w_t), with points above it classified as + and those below as -. Now suppose the label induced by H_t is incorrect for the next example x: e.g., the correct label is - but x lies above H_t. The Perceptron changes the weights (the coordinates of w_t) so that x is less above the new hyperplane, which it does by adding the vector −x to w_t. See figure below. Note that the label of x may still be incorrect even after this update; the idea is to react slowly, not dramatically. Of course, if there are many other points near x of the same type, then each will slowly tilt w_t further and further until we get the right classification. However, even with this mild update rule, the hyperplane can jump around quite significantly at each update. See the video at https://www.youtube.com/watch?v=dxunakhsos4.

Perceptron Error Theorem. Let S be a sequence of labeled examples that are consistent with a linear threshold function w*·x > 0, where w* is a unit-length vector. (That is, the hyperplane w*·x = 0 correctly classifies all points x ∈ S.) Further, let γ be the margin of w*, namely

    γ = min_{x ∈ S} |w*·x| / ‖x‖.

Then, in processing S, the Perceptron algorithm makes at most 1/γ² mistakes.
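A minimal sketch of the online update in Python/NumPy, assuming examples are (unit-vector, ±1 label) pairs; the function name and the pass counter are my own additions.

    import numpy as np

    def perceptron(examples, passes=1):
        # examples: list of (x, y) with x a unit-length numpy array and y in {-1, +1}
        w = np.zeros(len(examples[0][0]))
        mistakes = 0
        for _ in range(passes):
            for x, y in examples:
                predicted_positive = np.dot(w, x) > 0
                if predicted_positive != (y == +1):          # mistake
                    w = w + y * x        # add x on a positive mistake, subtract it on a negative one
                    mistakes += 1
        return w, mistakes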

Proof of Theorem. The proof tracks how the two quantities w_t·w* and ‖w_t‖ change during the algorithm. Recall that w* is a unit vector, but w_t is not. The proof works by quantifying how much each quantity changes after each mistake (a lower bound for the former, an upper bound for the latter).

Claim 1. w_{t+1}·w* ≥ w_t·w* + γ. That is, on each mistake the dot product of our weight vector with the target increases by at least γ. This follows because, if the mistake occurs on a positive example x, then w_{t+1}·w* = (w_t + x)·w* = w_t·w* + x·w* ≥ w_t·w* + γ (by the definition of γ, since ‖x‖ = 1). The argument for negative examples is similar.

Claim 2. ‖w_{t+1}‖² ≤ ‖w_t‖² + 1. That is, on every mistake, the squared length of our weight vector increases by at most 1. If the mistake was on a positive example, then ‖w_{t+1}‖² = ‖w_t + x‖² = ‖w_t‖² + 2 w_t·x + ‖x‖². Since w_t misclassifies x, we have w_t·x ≤ 0, and since ‖x‖² = 1, we get ‖w_{t+1}‖² ≤ ‖w_t‖² + 1. The same analysis works for negative examples.

Suppose we run the algorithm on S and encounter M mistakes. Then, by Claim 1, w_{M+1}·w* ≥ γM, and by Claim 2, ‖w_{M+1}‖² ≤ M. Finally, because w* is a unit vector, we also have w_t·w* ≤ ‖w_t‖. Therefore γM ≤ ‖w_{M+1}‖ ≤ √M, which gives M ≤ 1/γ². This completes the proof. QED.

How to Find a Correct Linear Separator? The analysis of the Perceptron algorithm gives us an upper bound on the number of misclassification errors on S under the assumption that the data is linearly classifiable with margin γ, but it still doesn't tell us how to find a valid classifier! Fortunately, finding a valid classifier is easy, based on the analysis. The important point of the theorem is that the mistake bound does not depend on the size of the input S; it only depends on the margin γ. So we create an artificial input that loops through the original input S endlessly, feed it to the Perceptron algorithm, and watch the weight vector w_t. What happens after we have looped through 1/γ² + 1 times? We claim that at least one of these rounds must have been mistake-free! If not, the algorithm made at least one mistake in each of the 1/γ² + 1 rounds, which contradicts the guarantee of the theorem, namely that M ≤ 1/γ². Therefore we must encounter a mistake-free round, and at that point we can output the linear classifier w_t.
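In code, the cycle-until-clean idea is a small loop around the same update rule; a minimal sketch (termination relies on the data actually being separable with positive margin, as the theorem assumes):

    import numpy as np

    def find_separator(examples):
        # Cycle through S until a complete pass makes no mistakes; by the mistake bound,
        # at most 1/gamma^2 + 1 passes are needed when the data has margin gamma.
        w = np.zeros(len(examples[0][0]))
        while True:
            clean_pass = True
            for x, y in examples:
                if (np.dot(w, x) > 0) != (y == +1):
                    w = w + y * x
                    clean_pass = False
            if clean_pass:
                return w                 # w now classifies every example in S correctly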

3.1 Perceptron (Approximately) Maximizing the Margin

The basic Perceptron algorithm finds a valid separator but doesn't offer any guarantee about how close it comes to the optimal margin γ. A modification of the algorithm can achieve margin γ/2, as follows. We declare an example positive if w_t·x / ‖w_t‖ ≥ γ/2 and negative if w_t·x / ‖w_t‖ ≤ −γ/2; if the value lies in the range (−γ/2, γ/2), we call the example a mistake (as, of course, is any genuinely wrong prediction). The update rule for a mistake is the same as before. A similar analysis then shows that this variant of the Perceptron makes at most 12/γ² mistakes. As a result, by cycling through the data 12/γ² + 1 times, we can get a classifier that has margin at least γ/2.
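A minimal sketch of this margin variant, in the same style as the code above; the single test y·(w_t·x)/‖w_t‖ < γ/2 is an assumption of mine that combines the two mistake cases (a wrong prediction, or a score inside the margin band):

    import numpy as np

    def margin_perceptron(examples, gamma, passes):
        # Update whenever the normalized score falls inside (-gamma/2, gamma/2) or is simply wrong.
        w = np.zeros(len(examples[0][0]))
        for _ in range(passes):
            for x, y in examples:
                norm = np.linalg.norm(w)
                score = np.dot(w, x) / norm if norm > 0 else 0.0
                if y * score < gamma / 2:        # mistake: misclassified or within the margin band
                    w = w + y * x
        return w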