Computational Learning Theory: PAC Model


Subhash Suri, May 19

1 A Rectangle Learning Game

These notes are based on the paper "A Theory of the Learnable" by Valiant, the book by Kearns and Vazirani, and notes by Avrim Blum. PAC learning is a good example of a simple, elegant theory for formally studying a messy and complex problem: learning. Quoting from the book "Learning from Data" by Abu-Mostafa, Magdon-Ismail, and Lin: "Show a picture to a 3-year-old and ask if there is a tree in it, and you will likely get the correct answer. Ask a 30-year-old what the definition of a tree is, and you will likely get an inconclusive answer. We didn't learn what a tree is by studying the mathematical definition of trees; we learned by looking at trees. In other words, we learned from data."

Data-driven learning has shown great promise in a number of practical applications, ranging from financial forecasting to medical diagnosis, computer vision, search engines, and recommendation systems. It is particularly effective where concepts are fuzzy and difficult to model precisely and rigorously. For instance, how does Netflix recommend movies for you to watch? Prescribing our own tastes in a rigorous form is likely an impossible task, but our past preferences and ratings are a good indicator. Data-driven (machine) learning builds on this idea. This lecture is a brief attempt to introduce a theoretical framework for understanding both the complexity and the power of data-driven learning.

Consider a simple 1-player learning game in which the object is to learn an unknown axis-parallel rectangle R. (It is easy to extend the argument to d-dimensional boxes.) The player receives information only through the following process (see Fig. 1): a random point p is chosen (according to some fixed probability distribution D) and the player is told p's label: positive (inside R) or negative (outside R).

The goal is to use as few examples (and as little computation) as possible to construct a hypothesis rectangle R' that is a good approximation of R. Informally, the player's knowledge is tested by picking a new point q at random, using the same distribution D, and checking whether he can correctly decide the label of this new point. Formally, the quality of learning is measured by the error: the probability, under D, of the symmetric difference (R − R') ∪ (R' − R).

Throughout, the focus will be on standard CS metrics: (1) the number of examples/queries needed; (2) the amount of computation to form/update the hypothesis; and (3) the amount of error and the confidence.

Motivation. Imagine a medical learning process, in which the 2-dimensional plane represents the attribute space (weight, cholesterol). We may hypothesize that a healthy person's weight and cholesterol levels are in some nice range, forming a rectangle, but we don't know what the values of these ranges are. Or, suppose we wish to teach a program to recognize medium-built males, where the x-axis is weight and the y-axis is height. The learner is shown random examples, each labeled with a + (medium built) or a − (not medium built). How effectively can one teach the concept of "medium-built male" through this process?

The Distribution D. The program goes through a learning/training phase, in which random examples are used to construct a hypothesis rectangle R'. After the training, the rectangle R' is our model, so we want to know how likely R' is to be wrong in its future classifications. What assumptions are needed on the probability distribution D? It need not be uniform; we only require that learning and testing use the same distribution D. In fact, suppose that in the learning phase each man in the city is chosen with equal probability. Even under this assumption, the corresponding points in the plane are not uniformly distributed: not all heights and weights are equally likely, and height and weight may be highly dependent. This sampling will follow some fixed distribution D, which may be quite difficult to characterize, but as long as both samples and test points are chosen with respect to the same D, we are fine.

Learning. Our strategy is simple: request a sufficiently large number m of sample points, and then choose as R' the smallest axis-parallel rectangle that includes all the + examples and excludes all the − examples. (If no + examples are drawn, then R' is taken to be empty. A code sketch of this strategy appears at the end of the analysis below.)

Error Analysis and Predictive Power. We show that for any target concept R, any distribution D, and any values ε, δ, we can request m samples (how many?) so that, with probability at least 1 − δ, R' misclassifies (with respect to R) with error at most ε.

First, observe that R' ⊆ R: the hypothesis is contained entirely in the target, so the error region is R − R'. We can express this difference as the union of 4 rectangular strips (with overlap near the corners). We will show that, with high probability, each strip has weight at most ε/4 under D, so that the total error (by the union bound) is at most ε. The analysis also reveals the number of samples m needed to achieve this.

Consider the top strip T. If T has weight at least ε/4, this can only be because none of the m samples fell into the topmost portion of R of weight ε/4, an event of probability at most (1 − ε/4)^m. Taking the union bound over all four strips, the probability that the error exceeds ε is at most 4(1 − ε/4)^m. If we want this failure probability to be less than δ, we need

    4 (1 − ε/4)^m ≤ δ.

Using the fact that (1 − x) ≤ e^{−x}, it suffices that 4 e^{−εm/4} ≤ δ, which gives

    m ≥ (4/ε) ln(4/δ).

Thus, the Tightest-Fitting Rectangle algorithm takes a sample of O((1/ε) ln(1/δ)) examples to form a hypothesis that classifies nearly as well as R, with confidence at least 1 − δ.
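To make the algorithm concrete, here is a minimal Python sketch (not part of the original notes) of the Tightest-Fitting Rectangle learner, using the sample size m = ceil((4/ε) ln(4/δ)) derived above. The target rectangle, the uniform sampling distribution, and all function names are illustrative assumptions.

```python
import math
import random

def learn_rectangle(draw, eps, delta):
    """Tightest-Fitting Rectangle learner: request m = ceil((4/eps) * ln(4/delta))
    labeled points from the oracle `draw`, and return the smallest axis-parallel
    rectangle containing all the positive points (None if no positives are seen)."""
    m = math.ceil((4.0 / eps) * math.log(4.0 / delta))
    positives = [p for p, label in (draw() for _ in range(m)) if label]
    if not positives:
        return None                     # empty hypothesis: everything classified negative
    xs, ys = zip(*positives)
    return (min(xs), max(xs), min(ys), max(ys))

def inside(rect, p):
    """Label of point p under a hypothesis rectangle (None means always negative)."""
    if rect is None:
        return False
    x0, x1, y0, y1 = rect
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

if __name__ == "__main__":
    # Hypothetical target rectangle R and distribution D (uniform on the unit square).
    R = (0.2, 0.7, 0.3, 0.9)
    def draw_point():
        p = (random.random(), random.random())
        return p, inside(R, p)
    h = learn_rectangle(draw_point, eps=0.05, delta=0.05)
    test = [(random.random(), random.random()) for _ in range(10000)]
    err = sum(inside(h, q) != inside(R, q) for q in test) / len(test)
    print("hypothesis:", h, "estimated error:", err)
```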

1.1 PAC (Probably Approximately Correct) Learning

This is an example of PAC learning. It has the following important features:

1. We learn an unknown target set, but the target class is not arbitrary: we have an idea of its general form (e.g., a rectangle).

2. Learning is probabilistic. Examples are drawn at random, using an arbitrary, unknown, and unconstrained distribution.

3. The hypothesis of the learner is evaluated relative to the same probability distribution D, and we allow an approximation of the target concept.

4. We are interested in computational efficiency: how few examples suffice to achieve high confidence.

1.2 The General Model

We have an instance space X, such as the points in the plane for the rectangle game, or the set of all 2-dimensional arrays of binary pixels in character recognition. A concept c over X is a subset of the instance space (e.g., the rectangles classifying medium-built males, or the arrays whose pixels correspond to a valid character, say A, assuming every array either exemplifies the character A or fails to exemplify it). A concept can thus be thought of as a boolean mapping c : X → {0, 1}, where c(x) = 1 indicates that x is a positive example, and c(x) = 0 indicates that x is a negative example.

A concept class C over X is a collection of concepts. In the rectangle game, the target rectangle was chosen from the class of all axis-parallel rectangles. (As another example, the concept class can be the pixel maps of letters, and a concept c a specific letter, such as A.) Ideally, we want concept classes that are sufficiently expressive, but still learnable. As yet another example, c can be a boolean formula over n variables, and the positive examples are its satisfying assignments over {0, 1}^n.

In the PAC model, the algorithm is faced with an unknown target concept c from the class C. The learning algorithm is shown (random) positive and negative examples for c, and it is judged by its ability to produce a hypothesis concept h that accurately classifies instances as positive or negative for c. Note that the learner is assumed to know the target concept class; it just doesn't know which concept c in that class is the target. D is a fixed, arbitrary probability distribution over the instance space X, unknown to the algorithm. The learning algorithm's error is measured as

    err(h) = Prob_{x ~ D}[ c(x) ≠ h(x) ],

where c and h are regarded as boolean functions. Geometrically, we can think of err(h) as the (weight of the) symmetric difference between c and h in a Venn diagram. EX(c, D) is a procedure (oracle) that returns a labeled example (x, c(x)), where x is drawn randomly and independently according to D. The goal is to achieve small err(h) while making as few calls to EX as possible.
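The following small Python sketch (an illustration, not from the notes) spells out the EX(c, D) oracle and a Monte Carlo estimate of err(h); the particular concept, hypothesis, and distribution are made-up examples.

```python
import random

def make_EX(c, D):
    """EX(c, D): each call draws x ~ D independently and returns (x, c(x))."""
    def EX():
        x = D()
        return x, c(x)
    return EX

def empirical_err(h, c, D, trials=100000):
    """Monte Carlo estimate of err(h) = Pr_{x ~ D}[ c(x) != h(x) ]."""
    return sum(h(x) != c(x) for x in (D() for _ in range(trials))) / trials

if __name__ == "__main__":
    # Hypothetical instance space X = {0,1}^5, concept c = "x1 AND x3", hypothesis h = "x1".
    D = lambda: tuple(random.randint(0, 1) for _ in range(5))
    c = lambda a: a[0] == 1 and a[2] == 1
    h = lambda a: a[0] == 1
    EX = make_EX(c, D)
    print("one labeled example:", EX())
    print("estimated err(h):", empirical_err(h, c, D))
```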

PAC Learning Model. Suppose C is a concept class over X. We say that C is PAC-learnable if there exists an algorithm L with the following property: for every concept c ∈ C, for every distribution D on X, and for all 0 < ε, δ < 1/2, L outputs a hypothesis h ∈ C such that err(h) ≤ ε with probability at least 1 − δ. The probability is over the random examples drawn by calls to EX and over any internal randomization of L. If L runs in time polynomial in 1/ε and 1/δ (and, for parameterized classes, in n and size(c)), we say C is efficiently PAC-learnable. So, for example, the concept class of axis-aligned rectangles in 2 dimensions is efficiently PAC-learnable.

2 Learning Boolean Conjunctions

A concept here is a conjunction of boolean literals, e.g., x_1 ∧ ¬x_3 ∧ x_4. The class C_n is the class of all conjunctions of literals over x_1, ..., x_n. The instance (example) space is X_n = {0, 1}^n, where each a ∈ X_n is interpreted as an assignment to the n boolean variables: a_i is the ith bit of a. We set c(a) = 1 if the assignment is satisfying (a + example) and c(a) = 0 if it is non-satisfying (a − example). The conjunction x_1 ∧ ¬x_3 ∧ x_4 represents the set {a ∈ {0, 1}^n : a_1 = 1, a_3 = 0, a_4 = 1}; here x_2 is a "don't care" variable. The size(c) equals the number of literals in c; clearly, size(c) ≤ 2n.

Algorithm for Boolean Conjunctions. We prove that the class of conjunctions of boolean literals is efficiently PAC-learnable. The algorithm begins with the hypothesis

    h = (x_1 ∧ ¬x_1) ∧ (x_2 ∧ ¬x_2) ∧ ... ∧ (x_n ∧ ¬x_n).

Thus, initially, h has no satisfying assignments. The algorithm simply ignores any negative examples returned by EX(c, D). Let (a, 1) be a positive example returned by EX. In response, the algorithm updates h as follows: for each i, if a_i = 0, delete the literal x_i from h; if a_i = 1, delete the literal ¬x_i from h. That is, the algorithm simply deletes any literal that contradicts the positive data.
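Here is a short, hedged Python sketch of the elimination algorithm just described, assuming literals are encoded as signed indices (+i for x_i, −i for ¬x_i); the encoding and the demo target are illustrative choices, not part of the notes.

```python
def learn_conjunction(n, positive_examples):
    """Elimination algorithm for boolean conjunctions.
    Literals are encoded as signed indices: +i means x_i, -i means (not x_i),
    for i = 1..n. Start with all 2n literals and delete every literal that
    contradicts a positive example; negative examples are ignored."""
    h = set(range(1, n + 1)) | set(-i for i in range(1, n + 1))
    for a in positive_examples:           # a is a tuple of n bits; a[i-1] is the ith bit
        for i in range(1, n + 1):
            if a[i - 1] == 0:
                h.discard(i)              # x_i contradicts this positive example
            else:
                h.discard(-i)             # not-x_i contradicts this positive example
    return h

def evaluate(h, a):
    """h(a) = 1 iff every surviving literal is satisfied by assignment a."""
    return all((a[abs(z) - 1] == 1) if z > 0 else (a[abs(z) - 1] == 0) for z in h)

if __name__ == "__main__":
    # Hypothetical target c = x1 AND (not x3) AND x4 over n = 5 variables.
    positives = [(1, 0, 0, 1, 0), (1, 1, 0, 1, 1), (1, 0, 0, 1, 1)]
    h = learn_conjunction(5, positives)
    print(sorted(h), evaluate(h, (1, 1, 0, 1, 0)))
```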

Analysis. Now we analyze the performance of the learned hypothesis h. First, note that at any time the set of literals in h contains the set of literals in c: initially h contains all literals, and a literal is deleted only when it is set to 0 in a positive example (meaning it cannot be in c). This also means that h is more specific than c, so every assignment satisfying h also satisfies c; in other words, h will never err on a negative example for c.

So, consider a literal z that occurs in h but not in c. Such a z causes h to err only on positive examples: z falsely causes h to output h(a) = 0 while c(a) = 1, and for this to happen the literal z must be 0 in a. This is also precisely the kind of positive example that would have caused the algorithm to delete z from h, but such an example was not seen. Define

    p(z) = Prob{ c(a) = 1 and z = 0 in a },

where the probability is over instances a drawn from D. Since every error of h can be blamed on at least one such literal z of h, by the union bound we have

    err(h) ≤ Σ_{z ∈ h} p(z).

Now, say that a literal z is bad if p(z) ≥ ε/2n. If h contains no bad literal, then clearly err(h) ≤ 2n · (ε/2n) = ε. So let us bound the probability that some bad literal survives in h. For any fixed bad literal z, the probability that it was not deleted from h after m calls to EX is at most (1 − ε/2n)^m, because each call to EX has probability at least ε/2n of returning an example that causes z to be deleted, and z survived m such calls. By the union bound (over the 2n literals), the probability that some bad literal survives in h is at most 2n(1 − ε/2n)^m. How big should m be to make this probability less than δ? It suffices that

    2n · e^{−εm/2n} ≤ δ,   i.e.,   m ≥ (2n/ε) { ln(2n) + ln(1/δ) }.

Theorem. The class of conjunctions of boolean literals is efficiently PAC-learnable.
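For a sense of scale, here is a worked numeric example (not in the original notes) with hypothetical values n = 100, ε = 0.1, and δ = 0.01:

    m ≥ (2n/ε)(ln(2n) + ln(1/δ)) = 2000 · (ln 200 + ln 100) ≈ 2000 · (5.30 + 4.61) ≈ 19,800 examples.

So the sample size grows only as roughly (n/ε)(log n + log(1/δ)), i.e., polynomially in all the parameters, which is what efficient PAC learnability requires.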

2.1 Intractability of Learning 3-Term DNF Formulas

We now show that a slight generalization of the representation class of boolean conjunctions becomes intractable to learn. The class we consider is called 3-Term DNF: the set of all disjunctions of the form T_1 ∨ T_2 ∨ T_3, where each T_i is a conjunction of literals over the boolean variables x_1, ..., x_n. (The size of each such concept is at most 6n, because each T_i contains at most one occurrence of either x_i or ¬x_i, hence at most 2n literals.) To learn this class efficiently, we need an algorithm polynomial in n, 1/ε and 1/δ. We will show that, while concepts expressed as conjunctions (1-CNF) are efficiently learnable, concepts expressed as an OR of 3 conjunctions are not (unless RP = NP).

Theorem. Unless RP = NP (randomized polynomial time equals NP), the class 3-Term DNF is not efficiently PAC-learnable.

The plan is to reduce an NP-complete language A to PAC learning of 3-term DNF. Given an instance α of A, we construct a set S_α of labeled examples so that α ∈ A if and only if S_α is consistent with some concept c ∈ C. In our case, A will be Graph 3-Coloring and C the class of 3-term DNF formulas. How does a PAC learning algorithm L for the concept class C allow us to determine, with high probability, whether there is a concept in C consistent with S_α?

General Method. Fix the error parameter ε = 1/(2|S_α|), and answer each request of L for a random example by choosing a pair (x_i, b_i) (an example and its label) uniformly at random from S_α. If there is a concept c consistent with S_α, then this simulation emulates the oracle EX(c, D), where D is the uniform distribution over S_α. In this case, by our choice of ε, any hypothesis h with error less than ε must in fact be completely consistent with S_α: if h errs on even a single example in S_α, its error with respect to c and D is at least 1/|S_α| = 2ε, which is strictly larger than ε. On the other hand, if there is no concept consistent with S_α, then L cannot possibly find one. Thus, we can simply check the output of L for consistency with S_α to determine, with confidence 1 − δ, whether there is a concept consistent with the examples.

Discussion. The hardness arises from our insistence that the error be made very small, namely less than 1/(2|S_α|): at that accuracy any acceptable hypothesis is forced to be perfectly consistent with the sample, and finding a consistent 3-term DNF is where the combinatorial difficulty lies.

Graph 3-Coloring. Given an undirected graph G = (V, E) on n nodes, determine whether its vertices can be colored with 3 colors so that no adjacent pair receives the same color.

The mapping from an instance G to a set S_G of labeled examples is as follows. S_G consists of positive examples S_G^+ and negative examples S_G^−. For each node i, S_G^+ contains the labeled example (v(i), 1), where v(i) is an n-bit binary vector with a 0 in the ith position and 1's everywhere else. (Intuitively, these n examples encode the vertices of G.) For each edge (i, j) ∈ E, we create in S_G^− a negative example (e(i, j), 0), where e(i, j) is an n-bit vector with 0's in the ith and jth bits and 1's everywhere else. (See the example in Figure 1.5 of the Kearns-Vazirani book.)
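The mapping from a graph to labeled examples is purely mechanical; the following Python sketch (illustrative only, with a made-up demo graph) builds the vectors v(i) and e(i, j) described above.

```python
def graph_to_examples(n, edges):
    """Reduction from Graph 3-Coloring: for each vertex i build the positive
    example v(i) (0 in position i, 1 elsewhere), and for each edge (i, j)
    the negative example e(i, j) (0 in positions i and j, 1 elsewhere)."""
    def vec(zero_positions):
        return tuple(0 if k + 1 in zero_positions else 1 for k in range(n))
    positives = [(vec({i}), 1) for i in range(1, n + 1)]
    negatives = [(vec({i, j}), 0) for (i, j) in edges]
    return positives + negatives

if __name__ == "__main__":
    # Hypothetical 4-vertex graph: a triangle 1-2-3 plus the edge 3-4.
    S_G = graph_to_examples(4, [(1, 2), (2, 3), (1, 3), (3, 4)])
    for x, b in S_G:
        print(x, b)
```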

Let R, B, Y denote the 3 colors (red, blue, and yellow). We now argue that G is 3-colorable if and only if the examples S_G are consistent with some 3-term DNF formula.

First, suppose that G is 3-colorable, and fix one such coloring. Let R be the set of all vertices colored red, and let T_R be the conjunction of all variables among x_1, x_2, ..., x_n whose index does not appear in R. Then, for each i ∈ R, v(i) satisfies T_R, because the variable x_i does not appear in T_R. Furthermore, no e(i, j) ∈ S_G^− can satisfy T_R: since i and j cannot both be colored red, at least one of x_i or x_j must appear in T_R, and that variable is 0 in e(i, j). Similarly, the terms T_B and T_Y are defined for the blue and yellow color classes and are satisfied by their vertices in the same way, with no negative examples satisfying them. Thus T_R ∨ T_B ∨ T_Y is consistent with S_G.

In the converse direction, suppose that the formula T_R ∨ T_B ∨ T_Y is consistent with all the examples of S_G. Then we can define a coloring as follows: the color of vertex i is RED if v(i) satisfies T_R, BLUE if v(i) satisfies T_B, and YELLOW if v(i) satisfies T_Y (break ties arbitrarily). Since the formula is consistent with S_G, every v(i) must satisfy some term, and so every vertex is assigned a color. We now argue that this is a legal 3-coloring. Note that if i and j, with i ≠ j, are assigned the same color (say RED), then both v(i) and v(j) satisfy T_R. Since the ith bit of v(i) is 0 while the ith bit of v(j) is 1, neither x_i nor ¬x_i can appear in T_R. Since v(j) and e(i, j) differ only in the ith bit, if v(j) satisfies T_R, then so does e(i, j), which means that e(i, j) is not in S_G^−, and so (i, j) is not an edge of E.

2.2 Learning Decision Lists

Imagine a bank or mortgage company trying to decide the risk of default for its loans. It may base this decision on many variables: %downpayment, recent delinquency, income, other accounts in good standing, high debt, mortgage/income ratio, etc. It can then come up with reasonable rules, such as: predict Good if (no recent delinquency) AND (%down > 5). The bank may arrive at such rules by studying a database of many individuals. This falls into a general type of concept learning: the Decision List.

A k-decision list over the boolean variables x_1, x_2, ..., x_n is an ordered sequence (list) L = (c_1, b_1), (c_2, b_2), ..., (c_l, b_l) together with a default bit b, where (a) each c_i is a conjunction of at most k literals over x_1, ..., x_n, and (b) each b_i is a bit in {0, 1}. For any input a ∈ {0, 1}^n, the value L(a) is set to b_j, where j is the smallest index for which c_j(a) evaluates to 1; if no such j exists, then L(a) is set to the default bit b.

As an example, consider a 2-decision list over x_1, ..., x_6 of the form L = (c_1, 1), (c_2, 1), (c_3, 1), (c_4, 0), (c_5, 1), (c_6, 0) with default bit b = 1, where each c_j is a conjunction of at most two literals (for instance, c_1 a conjunction of literals over x_1 and x_3, and c_2 a single literal on x_4). On an input a that falsifies c_1 through c_4 but satisfies c_5, the list evaluates to L(a) = 1, because the 5th pair (j = 5) is the first one whose condition is satisfied.

Suppose we posit that a target concept (delinquency likelihood) can be represented as a decision list over a set of variables. Then the PAC learner draws some random samples along with their outcome labels: a small table of assignments to x_1, ..., x_5 together with a Label column. A decision list consistent with such data may then look like:

    if (x_1 = 0) then −
    elseif (x_2 = 1) then +
    elseif (x_4 = 1) then +
    else −
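As an illustration (not from the notes), here is a small Python evaluator for decision lists, with conjunctions encoded as tuples of signed indices (+i for x_i, −i for ¬x_i); the sample list mirrors the delinquency-style rule above and is hypothetical.

```python
def eval_decision_list(rules, default, a):
    """A decision list is a sequence of (conjunction, bit) pairs plus a default bit.
    Each conjunction is a tuple of signed indices (+i for x_i, -i for not x_i).
    Return the bit of the first satisfied conjunction, else the default."""
    def satisfies(conj):
        return all((a[i - 1] == 1) if i > 0 else (a[-i - 1] == 0) for i in conj)
    for conj, bit in rules:
        if satisfies(conj):
            return bit
    return default

if __name__ == "__main__":
    # Hypothetical 1-decision list in the spirit of the delinquency example:
    # if x1 = 0 then 0, elif x2 = 1 then 1, elif x4 = 1 then 1, else 0.
    rules = [((-1,), 0), ((2,), 1), ((4,), 1)]
    print(eval_decision_list(rules, 0, (1, 0, 1, 1, 0)))   # -> 1 (third rule fires)
    print(eval_decision_list(rules, 0, (0, 1, 0, 0, 0)))   # -> 0 (first rule fires)
```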

Learning a 1-Decision List (k-decision lists are similar). Start with an empty list. Find any if-then rule that is consistent with the remaining data and is satisfied by at least one remaining example. Put this rule at the bottom of the list built so far, and cross off the examples covered by it. Repeat until no examples remain. If at some point no such rule exists, then NO decision list is consistent with the remaining (and hence with the original) data. (Work through the delinquency example over x_1, ..., x_5; a code sketch appears at the end of this subsection.)

Analysis. Why should we expect to do well on future data? Suppose our hypothesis decision list h has error err(h) > ε. Then the probability that h is consistent with m random examples is at most (1 − ε)^m. Now, the entire family of 1-decision lists over n boolean variables has at most n! · 4^n members, since there are at most n! variable orderings and 4 possibilities for each (value, label) pair, each value and label being binary. Then, by the union bound, the probability that some bad member of this decision-list family remains consistent with all m examples is at most

    n! · 4^n · (1 − ε)^m.

We can make this probability smaller than δ by choosing

    m ≥ (1/ε) { 2n + n ln n + ln(1/δ) }.

Thus, 1-decision lists are efficiently PAC-learnable.

Confidence vs. Sample Complexity. There is nothing special about the decision-list proof. The essence is just this: if there are not too many rules to choose from, then it is unlikely that a bad rule will fool us by chance. The proof generalizes to any finite hypothesis space H, and we get: if m ≥ (1/ε){ ln |H| + ln(1/δ) }, then after m examples, with probability > 1 − δ, every h ∈ H with true error err(h) ≥ ε makes at least one mistake on the sample; in other words, no hypothesis that is consistent with the sample has true error above ε.
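Below is a hedged Python sketch of the greedy 1-decision-list learner described at the start of this subsection; the data encoding and the sample data are illustrative assumptions. Any consistent rule works at each step, since the analysis only needs consistency, so the naive scan over (variable, value) pairs used here is enough.

```python
def learn_1dl(n, examples):
    """Greedy 1-decision-list learner: repeatedly find a rule (x_i = v) -> b that is
    consistent with the remaining examples and covers at least one of them, append it,
    cross off the covered examples, and repeat.
    Returns (rules, default_bit), or None if no 1-decision list is consistent."""
    remaining = list(examples)                    # (assignment, label) pairs
    rules = []
    while remaining:
        labels = set(y for _, y in remaining)
        if len(labels) == 1:                      # all remaining agree: use that default bit
            return rules, labels.pop()
        progress = False
        for i in range(1, n + 1):
            for v in (0, 1):
                covered = [y for a, y in remaining if a[i - 1] == v]
                if covered and len(set(covered)) == 1:
                    rules.append(((i, v), covered[0]))     # rule "if x_i == v then b"
                    remaining = [(a, y) for a, y in remaining if a[i - 1] != v]
                    progress = True
                    break
            if progress:
                break
        if not progress:
            return None                           # no 1-decision list is consistent
    return rules, 0                               # every example covered; default unused

if __name__ == "__main__":
    # Hypothetical labeled sample over x1..x5 in the spirit of the delinquency example.
    data = [((0, 1, 0, 1, 0), 0), ((1, 1, 1, 0, 0), 1), ((1, 0, 0, 1, 1), 1),
            ((1, 0, 1, 0, 0), 0), ((0, 0, 1, 1, 1), 0)]
    print(learn_1dl(5, data))
```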

PAC and Occam's Razor. PAC learning actually has a nice philosophical basis in Occam's Razor. William of Occam (c. 1320 AD) contended that, all other things being equal, simpler explanations are preferable. In computer-science terms, simple = short description, and there are at most 2^s explanations that are less than s bits long. If the number of examples seen satisfies

    m ≥ (1/ε) { s ln 2 + ln(1/δ) },

then it is unlikely that a bad simple explanation will fool us by chance. Of course, there is no guarantee that a short explanation always exists for the data.

2.3 The Problem of Experts

PAC learning posits that there is a perfect but unknown target function that we are trying to learn. What can we do when there is no such function? Let us formulate a game, which we call Expert Advice. Consider a simplified setting: we want to predict whether the stock market will go up or down tomorrow (or whether it will rain). There cannot be a perfect function for this problem. Instead, suppose there are n experts offering us their advice. Each day, each expert makes a prediction, YES or NO, and the learning algorithm must use this information to make its own prediction (the algorithm has no other input). After making its prediction, the algorithm is told the outcome (yes/no, up/down).

In the absence of any supernatural quality of the experts, we cannot make any absolute guarantees. The best we can hope for is that the algorithm does not do much worse than the best expert in hindsight. Unlike PAC, we are not going to evaluate the algorithm on future predictions; instead, we only compare it to the best expert's performance so far. This is similar to the competitive-ratio measure in online algorithms. Equivalently, we want a strategy that minimizes our regret. We can view each expert as a different hypothesis h ∈ C, or we can think of this as a special case of single-variable concepts.

The Weighted Majority Algorithm (WMA). The algorithm is extremely simple: always side with the weighted majority of the experts and, after each observation, downgrade the experts whose predictions were wrong. Formally, the algorithm maintains a list of weights w_1, ..., w_n (representing our trust or confidence in the experts), one per expert, and operates as follows (a code sketch appears after the analysis below).

1. Initially, set all the weights to 1.
2. Given a set of predictions x_1, x_2, ..., x_n by the experts, output the prediction with the highest total weight. That is, output 1 if Σ_{i ∈ I} w_i ≥ Σ_{i ∈ Z} w_i, and 0 otherwise, where I is the set of experts predicting 1 and Z the set predicting 0.
3. When the correct answer is revealed, penalize each wrong expert by halving his weight.
4. Go to 2.

Theorem. The number of mistakes M made by the WMA is no more than 2.41(m + log_2 n), where m is the number of mistakes made by the best expert so far.

Proof. Let W denote the total weight of all the experts. Initially, W = n. If the algorithm makes a mistake, then at least half of the total weight belonged to experts that also predicted wrong, and in Step 3 their weight is halved. Thus, a mistake by the algorithm causes the total weight W to drop by at least W/4. So, after the algorithm has made M mistakes, we have

    W ≤ n (3/4)^M.

On the other hand, if the best expert has made m mistakes, then his weight is (1/2)^m. Thus we have (1/2)^m ≤ W ≤ n (3/4)^M, which gives

    M ≤ (m + log_2 n) / log_2(4/3) ≤ 2.41 (m + log_2 n).

Improvements to WMA. One can improve the WMA in two ways: (a) instead of predicting with the majority of the total weight, we can use the weights as probabilities and choose the outcome probabilistically; (b) instead of penalizing the wrong experts by a factor 1/2, we can penalize each by a factor β. This analysis gives the following mistake bound:

    M ≤ { m ln(1/β) + ln n } / (1 − β).

For example, when β = 3/4, we get M ≤ 1.15 m + 4 ln n. By further adjusting β dynamically, one can also get M ≤ m + ln n + O(√(m ln n)).
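A minimal Python sketch of the deterministic Weighted Majority Algorithm as stated above; the expert predictions and outcomes in the demo are fabricated for illustration.

```python
def weighted_majority(predictions, outcomes, beta=0.5):
    """Deterministic Weighted Majority: predict with the side holding more total
    weight, then multiply the weight of every wrong expert by beta (beta = 1/2
    gives the halving rule analyzed above). Returns the algorithm's mistake count."""
    n = len(predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, truth in zip(predictions, outcomes):
        weight_one = sum(wi for wi, p in zip(w, preds) if p == 1)
        weight_zero = sum(wi for wi, p in zip(w, preds) if p == 0)
        guess = 1 if weight_one >= weight_zero else 0
        if guess != truth:
            mistakes += 1
        w = [wi * beta if p != truth else wi for wi, p in zip(w, preds)]
    return mistakes

if __name__ == "__main__":
    # Hypothetical run: 3 experts over 5 days; the first expert happens to be right every day.
    predictions = [(1, 0, 1), (0, 0, 1), (1, 1, 1), (1, 0, 0), (0, 1, 0)]
    outcomes = [1, 0, 1, 1, 0]
    print("WMA mistakes:", weighted_majority(predictions, outcomes))
```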

3 The Perceptron Algorithm

Good reading material for this section includes the lecture notes of Avrim Blum, the large-margin classification paper of Freund and Schapire, and the book of Abu-Mostafa et al.

Learning Linear Separators. A fundamental idea for learning complex binary classifiers is the following: use some mapping to transform the instances into a high-dimensional space in which the two classes are linearly separable, and then find the vector that classifies all the data correctly and maximizes the margin, the minimum distance between the hyperplane and the instances. The Perceptron algorithm is a particularly simple, and the oldest, method for finding a linear separator; we will show that if the instances are linearly separable with a large margin, then the Perceptron algorithm converges quickly.

Assumptions.

1. The feature space is R^n, and we are looking for a linear classifier of the form

    w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0.

In general, we want a classifier of the form w·x > b, but that can be simulated by introducing an extra variable x_0: the problem becomes w_1 x_1 + w_2 x_2 + ... + w_n x_n − b x_0 > 0, where we always set x_0 = 1 and w_0 = −b. So, from now on, assume that R^n already accounts for this transformation.

2. Each instance has a binary label in {−1, +1}, and the labels are error-free.

3. We assume that a linear separator w* (a unit vector) exists. That is, for each positive example we have w*·x > 0, and for each negative example w*·x < 0. Moreover, the classification has margin γ > 0: |w*·x| / ‖x‖ ≥ γ for all examples x.

4. We normalize the instances so that ‖x‖ = 1. Since we are looking for a hyperplane through the origin, we can project all the points x onto the unit sphere centered at the origin without changing their classification.

Quick vector refresher. The equation w·x = 0 defines a hyperplane whose normal vector is w. Given a point x, the quantity |w·x| / ‖w‖ is the Euclidean distance between x and this hyperplane. For any two unit vectors x_1, x_2 ∈ R^n, we have x_1·x_2 = ‖x_1‖ ‖x_2‖ cos(x_1, x_2) ≤ 1.

The Perceptron Algorithm.

1. The algorithm processes the instances in an online manner, one at a time.
2. Start with the all-zero vector w_1 = 0, and initialize the iteration count t = 1.
3. Given the next example x, predict positive iff w_t·x > 0.
4. On a mistake, update as follows. Mistake on a positive example: w_{t+1} ← w_t + x. Mistake on a negative example: w_{t+1} ← w_t − x.
5. Set t ← t + 1 and continue with the next example. (A code sketch of this update rule appears after the remarks below.)

Remark 1. A mistake on a positive instance x means that w_t·x ≤ 0, while it should be positive. Our update moves the weight vector so as to reduce this error: w_{t+1}·x = (w_t + x)·x = w_t·x + 1, where the last step uses the fact that x is a unit vector. Similarly, a mistake on a negative example gives w_{t+1}·x = (w_t − x)·x = w_t·x − 1.

Remark 2. An important point is that this adjustment can misclassify other points that were previously correct: w_{t+1} may now be wrong on examples where w_t was right. In fact, this is the reason why the Perceptron can take a very long time to converge when γ is small.

Remark 3. Observe that the weight vector w_t is not normalized: its magnitude can be arbitrarily large, and so w_t·x can grow unboundedly.

Remark 4. The vector w_t we learn is the weight vector that encodes the relative importance (weight) of the different features (the dimensions of the data).

Remark 5. Intuition for the Perceptron. The vector w_t defines a hyperplane H_t (with normal w_t), with points above it classified as + and those below as −. Now suppose the label induced by H_t is incorrect for the next example x: say, the correct label is − but x lies above H_t. The Perceptron changes the weights (the coordinates of w_t) so that x is less far above the new hyperplane, which it does by adding the vector −x to w_t. See the figure below. Note that the label of x may still be incorrect even after this update; the idea is to react slowly, not dramatically. Of course, if there are many other points near x of the same type, then each will slowly tilt w_t further and further until we get the right classification. However, even with this mild update rule, the hyperplane can jump around quite significantly at each update. (See video.)
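Here is the promised Python sketch of the Perceptron update (an illustration under the stated assumptions: unit-length instances and labels in {−1, +1}); the synthetic data generator is a made-up choice, not part of the notes.

```python
import random

def perceptron_pass(w, examples):
    """One pass of the Perceptron over a list of (x, y) pairs with y in {-1, +1}.
    Each x is assumed to be a unit vector. Predict positive iff w . x > 0; on a
    mistake, add y*x to w. Returns the updated weights and the mistake count."""
    mistakes = 0
    for x, y in examples:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        if (dot > 0) != (y > 0):          # prediction disagrees with the label
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return w, mistakes

def normalize(x):
    norm = sum(xi * xi for xi in x) ** 0.5
    return tuple(xi / norm for xi in x)

if __name__ == "__main__":
    # Synthetic linearly separable data: label = sign of (x1 + 2*x2), points on the unit circle.
    random.seed(0)
    data = []
    for _ in range(200):
        p = normalize((random.uniform(-1, 1), random.uniform(-1, 1)))
        data.append((p, 1 if p[0] + 2 * p[1] > 0 else -1))
    w, m = perceptron_pass([0.0, 0.0], data)
    print("mistakes in first pass:", m)
```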

Perceptron Error Theorem. Let S be a sequence of labeled examples that are consistent with a linear threshold function w*·x > 0, where w* is a unit-length vector. (That is, the hyperplane w*·x = 0 correctly classifies all points x ∈ S.) Further, let γ be the margin of w*, namely

    γ = min_{x ∈ S} |w*·x| / ‖x‖.

Then, in processing S, the Perceptron algorithm makes at most 1/γ² misclassifications.

Proof of Theorem. The proof tracks how the quantities w_t·w* and ‖w_t‖ change during the algorithm. Recall that w* is a unit vector, but w_t is not. The proof quantifies how much these two quantities change after each mistake: a lower bound for the former, and an upper bound for the latter.

1. Claim 1: w_{t+1}·w* ≥ w_t·w* + γ. That is, on each mistake the dot product of our weight vector with the target increases by at least γ. This follows because, if the mistake occurs on a positive example x, then w_{t+1}·w* = (w_t + x)·w* = w_t·w* + x·w* ≥ w_t·w* + γ (by the definition of γ, since ‖x‖ = 1). The argument for a negative example is symmetric.

2. Claim 2: ‖w_{t+1}‖² ≤ ‖w_t‖² + 1. That is, on every mistake, the squared length of our weight vector increases by at most 1. If the mistake was on a positive example, then ‖w_{t+1}‖² = ‖w_t + x‖² = ‖w_t‖² + 2 w_t·x + ‖x‖². Since w_t misclassified x, we have w_t·x ≤ 0, and therefore ‖w_{t+1}‖² ≤ ‖w_t‖² + 1. The same analysis works for negative examples.

Suppose we run the algorithm on S and make M mistakes. Then, by Claim 1, w_{M+1}·w* ≥ γM, and by Claim 2, ‖w_{M+1}‖² ≤ M. Finally, because w* is a unit vector, we also have w_t·w* ≤ ‖w_t‖. Therefore γM ≤ √M, which gives M ≤ 1/γ². This completes the proof. QED.

How to Find a Correct Linear Separator? The analysis of the Perceptron algorithm gives an upper bound on the number of misclassification errors on S, under the assumption that the data is linearly separable with margin γ, but it still doesn't tell us how to find a valid classifier! Fortunately, finding one is easy, based on the analysis. The important point of the theorem is that the mistake bound does not depend on the size of the input |S|; it depends only on the margin γ. So, we create an artificial input that loops through the original input S endlessly, feed it to the Perceptron algorithm, and watch the weight vector w_t. What happens after we have looped through S more than 1/γ² times? We claim that at least one of these passes must have been mistake-free! If not, the algorithm made at least one mistake in each of the more than 1/γ² passes, which contradicts the guarantee of the theorem, namely that M ≤ 1/γ². Therefore, we must encounter a mistake-free pass, and at that point we can output the linear classifier w_t.
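Following the looping argument above, a mistake-free pass certifies a valid separator. A self-contained sketch (with a tiny hand-made data set, purely illustrative) might look like this.

```python
def find_separator(examples, max_passes=100000):
    """Loop through the data until some pass makes no mistakes; by the mistake bound,
    this happens within roughly 1/gamma^2 passes when the data has margin gamma.
    examples: list of (x, y) with y in {-1, +1} and x a unit-length tuple."""
    w = [0.0] * len(examples[0][0])
    for _ in range(max_passes):
        mistakes = 0
        for x, y in examples:
            dot = sum(wi * xi for wi, xi in zip(w, x))
            if (dot > 0) != (y > 0):                       # Perceptron mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                                  # mistake-free pass: w separates S
            return w
    return None                                            # gave up (margin may be tiny or zero)

if __name__ == "__main__":
    # Tiny hand-made separable set on the unit circle (labels = sign of x2).
    S = [((0.0, 1.0), 1), ((0.6, 0.8), 1), ((0.0, -1.0), -1), ((-0.8, -0.6), -1)]
    print("separator:", find_separator(S))
```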

3.1 Perceptron (Approximately) Maximizing the Margin

The basic Perceptron algorithm finds a valid separator, but it offers no guarantee about how close to the optimal margin γ it comes. A modification of the algorithm can achieve margin γ/2, as follows. We declare an example positive if w_t·x / ‖w_t‖ ≥ γ/2 and negative if w_t·x / ‖w_t‖ ≤ −γ/2; if the normalized value lies in the range (−γ/2, γ/2), or the declared label disagrees with the true label, we call it a mistake. The update rule on a mistake is the same as before. A similar analysis then shows that this version of the Perceptron algorithm makes at most 12/γ² mistakes. As a result, by looping through the data for at most 12/γ² + 1 passes (as in the previous section), we can get a classifier that has margin at least γ/2.
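Finally, a sketch of the modified mistake condition for this margin variant, assuming the target margin γ is known or guessed; everything in the demo is illustrative.

```python
def margin_perceptron_pass(w, examples, gamma):
    """One pass of the margin variant described above: an example counts as a mistake
    (and triggers the usual Perceptron update) unless its normalized signed margin
    y * (w . x) / ||w|| is at least gamma / 2."""
    mistakes = 0
    for x, y in examples:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        norm = sum(wi * wi for wi in w) ** 0.5
        margin = (y * dot / norm) if norm > 0 else 0.0
        if margin < gamma / 2:                     # wrong side, or right side but too close
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return w, mistakes

if __name__ == "__main__":
    S = [((0.0, 1.0), 1), ((0.6, 0.8), 1), ((0.0, -1.0), -1), ((-0.8, -0.6), -1)]
    w, gamma = [0.0, 0.0], 0.5                     # gamma = 0.5 is a guess for this demo
    for _ in range(50):                            # ~12/gamma^2 passes suffice per the analysis
        w, m = margin_perceptron_pass(w, S, gamma)
        if m == 0:
            break
    print("weights:", w)
```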


More information

1 The Probably Approximately Correct (PAC) Model

1 The Probably Approximately Correct (PAC) Model COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #3 Scribe: Yuhui Luo February 11, 2008 1 The Probably Approximately Correct (PAC) Model A target concept class C is PAC-learnable by

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

Learning Theory: Basic Guarantees

Learning Theory: Basic Guarantees Learning Theory: Basic Guarantees Daniel Khashabi Fall 2014 Last Update: April 28, 2016 1 Introduction The Learning Theory 1, is somewhat least practical part of Machine Learning, which is most about the

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 2013 COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture 24 Scribe: Sachin Ravi May 2, 203 Review of Zero-Sum Games At the end of last lecture, we discussed a model for two player games (call

More information

1 Computational Problems

1 Computational Problems Stanford University CS254: Computational Complexity Handout 2 Luca Trevisan March 31, 2010 Last revised 4/29/2010 In this lecture we define NP, we state the P versus NP problem, we prove that its formulation

More information

Computational Learning Theory. CS 486/686: Introduction to Artificial Intelligence Fall 2013

Computational Learning Theory. CS 486/686: Introduction to Artificial Intelligence Fall 2013 Computational Learning Theory CS 486/686: Introduction to Artificial Intelligence Fall 2013 1 Overview Introduction to Computational Learning Theory PAC Learning Theory Thanks to T Mitchell 2 Introduction

More information

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms Computer Science 385 Analysis of Algorithms Siena College Spring 2011 Topic Notes: Limitations of Algorithms We conclude with a discussion of the limitations of the power of algorithms. That is, what kinds

More information

Part of the slides are adapted from Ziko Kolter

Part of the slides are adapted from Ziko Kolter Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,

More information