Lecture 3: Empirical Risk Minimization

Introduction to Learning and Analysis of Big Data

Kontorovich and Sabato (BGU)
A more general approach

We saw the learning algorithms Memorize and k-Nearest Neighbor. We will now discuss a more general approach to the design of learning algorithms.

We use the same assumptions:
- Examples X, labels Y
- Distribution D over X × Y
- A learning algorithm gets S ∼ D^m and outputs ĥ_S : X → Y.
- D is unknown to the learning algorithm.
Choosing a prediction rule

If the algorithm knew D, it could find the optimal prediction rule: the Bayes-optimal predictor. Since S is a random sample from D, it should be similar to D.

Idea: find a prediction rule that works well on S.

The error of prediction rule h on a sample S of size m (also called the empirical risk):

    err(h, S) := (1/m) ∑_{i=1}^{m} I[h(x_i) ≠ y_i].

Empirical Risk Minimization (ERM): choose a prediction rule that minimizes err(h, S).

Both Memorize and Nearest Neighbor are ERM algorithms. What about k-Nearest Neighbors?
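To make the definition concrete, here is a minimal Python sketch of the empirical risk; the sample values and the candidate rule h below are invented for illustration:

```python
import numpy as np

def empirical_risk(h, S):
    """err(h, S): the fraction of sample points that h mislabels."""
    return np.mean([h(x) != y for x, y in S])

# A toy 1-d sample with binary labels (illustrative values only).
S = [(0.2, 0), (0.5, 0), (1.4, 1), (2.0, 1)]
h = lambda x: int(x >= 1.0)     # a candidate prediction rule
print(empirical_risk(h, S))     # 0.0: h labels every sample point correctly
```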
Overfitting

Problem: empirical risk minimization can fail miserably.

Example: the Memorize algorithm. If the training sample is of size m, there are N examples (customers) distributed uniformly with N ≫ m, and there are two labels (drinks), then err(ĥ_S, S) = 0, but err(ĥ_S, D) will be very large.

Overfitting: when the error on the training sample is low, but the error on the distribution is large.

Can another learning algorithm avoid this issue?
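A quick simulation makes the failure concrete. This is a sketch with invented values of N and m, showing Memorize getting zero training error while erring on roughly half of the distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N equally likely customers, each with a fixed favorite drink.
N, m = 10_000, 100
true_label = rng.integers(0, 2, size=N)      # the "correct drink" of each customer
train_x = rng.integers(0, N, size=m)         # a training sample of m customers

# Memorize: answer from the lookup table if seen in training, otherwise guess 0.
table = {int(x): int(true_label[x]) for x in train_x}
memorize = lambda x: table.get(int(x), 0)

train_err = np.mean([memorize(x) != true_label[x] for x in train_x])
true_err = np.mean([memorize(x) != true_label[x] for x in range(N)])
print(train_err, true_err)   # 0.0 on the sample, roughly 0.5 on the distribution
```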
The No Free Lunch theorem

Recall: X - examples, Y = {0, 1} - binary labels.

Theorem. For any learning algorithm, if m ≤ |X|/2, there exists a distribution D over X × {0, 1} such that:
- there exists a prediction rule f : X → {0, 1} with err(f, D) = 0, but
- with probability at least 1/7 over random samples S ∼ D^m, err(ĥ_S, D) ≥ 1/8.

Proof idea: assume some algorithm A; choose a uniform distribution over 2m examples; set the true labels to be the opposite of what A would guess on examples it didn't observe.
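The sketch below is not the proof (which must handle an arbitrary algorithm A); it only illustrates the adversarial construction for one specific learner, a memorize-then-guess-0 rule, for which labeling everything 1 plays the adversary's role:

```python
import numpy as np

rng = np.random.default_rng(1)

m = 50
X = np.arange(2 * m)        # uniform distribution over 2m examples
f = lambda x: 1             # adversarial true labels: the opposite of A's default guess

def A(S):
    """The learner being fooled here: memorize the sample, predict 0 on unseen points."""
    table = {int(x): int(y) for x, y in S}
    return lambda x: table.get(int(x), 0)

sample_x = rng.choice(X, size=m)              # S ~ D^m, labeled by f
h = A([(x, f(x)) for x in sample_x])

true_err = np.mean([h(x) != f(x) for x in X])
print(true_err)   # >= 1/2: h errs on every example outside the sample, while err(f, D) = 0
```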
Introducing inductive bias

By the No Free Lunch theorem, no learning algorithm gets a low error on all distributions, unless it observes almost all possible examples.

A common solution: assume something about the learning problem. Examples:
- The coffee shop: the waiter got a hint that all customers with the same hairstyle like the same drink.
- Identifying documents about economics: assume that a small number of words determines whether a document is about economics or not.
- Identifying people in photos: assume that photos of the same person are similar in a specific feature representation.

Inductive bias: restricting/directing the learning algorithm using external knowledge/assumptions about the learning problem.
Example: Learning dosage safety

Learning problem: which medicine dosages are safe?
X = [0, 100] (dosage), Y = {0, 1} (causes side effects?)

A possible training sample (blue: label is 0, red: label is 1). [figure omitted]

ERM without inductive bias might return the following rule: [figure omitted]
Here err(ĥ_S, S) = 0. What do you think is err(ĥ_S, D)?

Inductive bias: limit the ERM algorithm to return only functions that describe thresholds on the line:

    ∀x ∈ X, f_a(x) := I[x ≥ a].

Now the algorithm will return something like this: [figure omitted]
Again, err(ĥ_S, S) = 0. What do you think is err(ĥ_S, D) this time?
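A minimal sketch of ERM restricted to the threshold class; the dosage sample is invented, standing in for the figure on the slide:

```python
import numpy as np

def erm_threshold(S):
    """ERM over H = {f_a : f_a(x) = I[x >= a]}: the threshold minimizing err(f_a, S)."""
    xs = sorted(x for x, _ in S)
    candidates = xs + [xs[-1] + 1.0]    # one candidate per data point, plus "always 0"
    errs = [np.mean([(x >= a) != y for x, y in S]) for a in candidates]
    return candidates[int(np.argmin(errs))]

# Hypothetical dosage sample; label 1 = causes side effects (values are illustrative).
S = [(5.0, 0), (20.0, 0), (35.0, 0), (60.0, 1), (80.0, 1), (95.0, 1)]
a = erm_threshold(S)
print(a)   # 60.0: every dosage below the threshold is predicted safe, err on S is 0
```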
Inductive bias

Recall Empirical Risk Minimization (ERM): choose a prediction rule that minimizes err(h, S).

Inductive bias: restrict the ERM algorithm. A popular type of inductive bias: choose the prediction rule from a restricted set of functions called a hypothesis class, H ⊆ Y^X.

ERM with a hypothesis class H: given a training sample S ∼ D^m, output ĥ_S such that

    ĥ_S ∈ argmin_{h ∈ H} err(h, S).

We will show (later in the course) that restricting the ERM to a simple H can prevent overfitting. A small finite class is always simple, but there are also simple infinite classes.
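When H is finite, the argmin can be computed by brute force over H. A minimal sketch, using a hypothetical grid of thresholds as the class:

```python
import numpy as np

def erm(H, S):
    """ERM with a finite hypothesis class: return an h in H minimizing err(h, S).
    Ties are broken arbitrarily (first minimizer), matching h_S being *some* argmin."""
    errs = [np.mean([h(x) != y for x, y in S]) for h in H]
    return H[int(np.argmin(errs))]

# A small finite class: thresholds on an invented grid over [0, 100], step 10.
H = [lambda x, a=a: int(x >= a) for a in range(0, 101, 10)]
S = [(5.0, 0), (35.0, 0), (60.0, 1), (95.0, 1)]
h_S = erm(H, S)
print([h_S(x) for x, _ in S])   # [0, 0, 1, 1]: err(h_S, S) = 0
```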
Hypothesis classes

Popular hypothesis classes:
- Thresholds (for 1-dimensional examples)
- Linear functions (for examples in R^d)
- Combinations of circles (for examples in R^d)
- Small logical formulas (for examples with binary features)
- Neural networks

Any class of prediction rules is a valid hypothesis class. Why are some classes more popular?
- Easy to work with (efficient algorithms, easy implementation)
- Suitable for many different types of problems
- Good error guarantees
- Work well in practice
- Fashionable
The Bias-Complexity tradeoff

Suppose we are using a specific hypothesis class H. Sources of prediction error in an ERM algorithm:
- Perhaps the rules in H are not very good for D.
  Approximation error: err_app := inf_{h ∈ H} err(h, D).
- Perhaps the error of the rule the ERM selected is far from the best in H.
  Estimation error: err_est := err(ĥ_S, D) − inf_{h ∈ H} err(h, D).

Total error: err(ĥ_S, D) = err_app + err_est.

If we select a richer (larger) H:
- the approximation error gets smaller (lower bias),
- the estimation error gets larger (higher statistical complexity).

There is a trade-off between the two kinds of error.
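The total-error equation is an identity rather than a bound: it follows by adding and subtracting inf_{h ∈ H} err(h, D). Written out:

```latex
\operatorname{err}(\hat{h}_S, D)
  = \underbrace{\inf_{h \in H} \operatorname{err}(h, D)}_{\mathrm{err}_{\mathrm{app}}}
  + \underbrace{\Bigl( \operatorname{err}(\hat{h}_S, D) - \inf_{h \in H} \operatorname{err}(h, D) \Bigr)}_{\mathrm{err}_{\mathrm{est}}}
```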
The Bias-Complexity tradeoff

Recall overfitting: when err(ĥ_S, D) − err(ĥ_S, S) is large.
- Symptoms: training error (the error on S) is low, but the true error is high.
- This usually means the estimation error err_est := err(ĥ_S, D) − inf_{h ∈ H} err(h, D) is also large.
- Can happen if H is too rich (large).

Underfitting: when the approximation error is large.
- Symptoms: training error is high.
- Can happen if H is not suitable for our problem, or too simple.

Best of both worlds: a simple H which is suitable for our problem. E.g., when looking for safe medicine dosages, choose H to be the set of threshold functions x ↦ I[x ≥ a]. Selecting H can represent world-knowledge that helps learning.
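A small simulation can exhibit all three regimes side by side. The ground truth, noise level, and sample sizes below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ground truth: side effects above dosage 50, with 10% label noise.
def draw(n):
    x = rng.uniform(0, 100, n)
    y = (x >= 50).astype(int) ^ (rng.random(n) < 0.1).astype(int)
    return list(zip(x, y))

train, test = draw(30), draw(10_000)
err = lambda h, S: np.mean([h(x) != y for x, y in S])

h_under = lambda x: 0                       # underfit: a constant rule, too simple
table = {x: y for x, y in train}            # overfit: Memorize, far too rich
h_over = lambda x: table.get(x, 0)
best_a = min(range(0, 101, 5),              # a suitable simple H: ERM over thresholds
             key=lambda a: err(lambda x, a=a: int(x >= a), train))
h_good = lambda x: int(x >= best_a)

for name, h in [("underfit", h_under), ("overfit", h_over), ("threshold", h_good)]:
    print(name, err(h, train), err(h, test))
# underfit: both errors high; overfit: 0 on train, ~0.5 on test;
# threshold: both errors near the 10% noise level.
```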