An Algorithms-based Intro to Machine Learning

CMU 15-451 lecture, 12/08/11
An Algorithms-based Intro to Machine Learning
Avrim Blum
[Based on a talk given at the National Academy of Sciences Frontiers of Science symposium]

Plan for today:
Machine Learning intro: models and basic issues.
An interesting algorithm for combining expert advice.

Machine learning can be used to recognize speech, identify patterns in data, steer a car, play games, adapt programs to users, improve web search, ... From a scientific perspective: can we develop models to understand learning as a computational problem, and what types of guarantees might we hope to achieve?

A typical setting
Imagine you want a computer program to help filter which email messages are spam and which are important. We might represent each message by n features (e.g., return address, keywords, spelling, etc.). Take a sample S of data, labeled according to whether the messages were or weren't spam. The goal of the algorithm is to use the data seen so far to produce a good prediction rule (a "hypothesis") h(x) for future data.

The concept learning setting
[The slides show a table of labeled feature vectors, with a positive example and a negative example highlighted.] Given the data, some reasonable rules might be:
Predict SPAM if NOT known AND (money OR pills).
Predict SPAM if money + pills - known > 0.
...
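As a small illustration of this setup, here is a minimal sketch in Python. The feature names ("known", "money", "pills"), the whitelist, and the messages are made up for this sketch; only the candidate rule itself comes from the slide.

    # A tiny sketch of the concept-learning setup above (hypothetical features and data).

    def featurize(msg):
        """Represent a message by a few Boolean features."""
        text = msg["text"].lower()
        return {
            "known": msg["sender"] in {"alice", "bob"},   # hypothetical list of known senders
            "money": "money" in text,
            "pills": "pills" in text,
        }

    def rule(x):
        """One candidate rule from the slide: predict SPAM if (not known) and (money or pills)."""
        return (not x["known"]) and (x["money"] or x["pills"])

    # A small labeled sample S = {(x, y)}; y = 1 means spam.
    S = [
        ({"sender": "eve",   "text": "Cheap pills!!! Send money"}, 1),   # a positive example
        ({"sender": "alice", "text": "Lunch tomorrow?"},           0),   # a negative example
    ]
    print([rule(featurize(msg)) == bool(y) for msg, y in S])   # does the rule fit the sample?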

Big questions
(A) How might we automatically generate rules that do well on observed data? [algorithm design]
(B) What kind of confidence do we have that they will do well in the future? [confidence bound / sample complexity: for a given learning algorithm, how much data do we need?]

Power of the basic paradigm
Many problems are solved by converting them to the basic "concept learning from structured data" setting. E.g., document classification: convert to bag-of-words; linear separators do well. E.g., driving a car: convert the image into features and use a neural net with several outputs.

Natural formalization (PAC)
(Email message -> spam or not?) We are given a sample S = {(x,y)}. View the labels y as being produced by some target function f. The algorithm does optimization over S to produce some hypothesis (prediction rule) h. Assume S is a random sample from some probability distribution D. The goal is for h to do well on new examples also drawn from D, i.e., Pr_D[h(x) ≠ f(x)] < ε.

Example of analysis: Decision Lists
Say we suspect there might be a good prediction rule of this form.
1. Design an efficient algorithm A that will find a consistent DL if one exists.
2. Show that if S is of reasonable size, then Pr[there exists a consistent DL h with err(h) > ε] < δ.
3. This means that A is a good algorithm to use if f is, in fact, a DL. (A bit of a toy example, since we would want to extend to "mostly consistent" DLs.)

How can we find a consistent DL?
Example form: if (x1=0) then ..., else if (x2=1) then ..., else if (x4=1) then ..., else ...

Decision List algorithm
Start with an empty list. Find an if-then rule consistent with the data (and satisfied by at least one example). Put the rule at the bottom of the list so far, and cross off the examples it covers. Repeat until no examples remain. If this fails, then no rule is consistent with the remaining data, so no DL is consistent with the remaining data, so no DL is consistent with the original data. (A code sketch follows below.)

OK, fine. Now why should we expect it to do well on future data?
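Before turning to that question, here is a minimal sketch of the greedy Decision List algorithm just described (my own rendering of the pseudocode, not code from the lecture):

    # Greedy learner for a consistent decision list over Boolean features.
    # Examples: list of (x, y) with x a tuple of 0/1 features and y in {0, 1}.

    def find_rule(remaining, n):
        """Find an if-then rule (feature i == v -> prediction) consistent with the remaining
        data and satisfied by at least one remaining example; return None if none exists."""
        for i in range(n):
            for v in (0, 1):
                covered = [y for (x, y) in remaining if x[i] == v]
                if covered and len(set(covered)) == 1:
                    return (i, v, covered[0])
        return None

    def learn_decision_list(examples, n):
        remaining, rules = list(examples), []
        while remaining:
            rule = find_rule(remaining, n)
            if rule is None:
                return None            # no consistent rule left => no consistent DL exists
            i, v, _ = rule
            rules.append(rule)
            remaining = [(x, y) for (x, y) in remaining if x[i] != v]   # cross off covered examples
        return rules

    def predict(rules, x, default=0):
        for i, v, prediction in rules:
            if x[i] == v:
                return prediction
        return default

    # Toy data consistent with the DL "if x0=0 then 1, else if x2=1 then 1, else 0".
    S = [((0, 1, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 0), ((0, 0, 1), 1)]
    dl = learn_decision_list(S, n=3)
    print(dl, [predict(dl, x) == y for x, y in S])

If find_rule ever fails while examples remain, no consistent DL exists, exactly as argued above.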

Confidence / sample complexity
Consider some DL h with err(h) > ε that we're worried might fool us. The chance that h survives |S| examples is at most (1-ε)^|S|. Let |H| = the number of DLs over n Boolean features; |H| < (4n+2)! (a really crude bound). So,
Pr[some DL h with err(h) > ε is consistent] < |H| · (1-ε)^|S|.
This is < 0.01 for |S| > (1/ε)[ln(|H|) + ln(100)], or about (1/ε)[n ln n + ln(100)].

Example of analysis: Decision Lists (recap, as a PAC-model statement)
1. Design an efficient algorithm A that will find a consistent DL if one exists.
2. Show that if S is of reasonable size, then Pr[there exists a consistent DL h with err(h) > ε] < δ.
3. So, if f is in fact a DL, then with high probability A's hypothesis will be approximately correct.

Confidence / sample complexity (cont'd)
What's great is that there was nothing special about DLs in our argument. All we said was: if there are not too many rules to choose from, then it's unlikely one will have fooled us just by chance.

Occam's razor
William of Occam (~1320 AD): "entities should not be multiplied unnecessarily" (in Latin). We interpret this as: in general, prefer simpler explanations. In particular, the number of examples only needs to be proportional to log(|H|). (Big difference between 100 and e^100.) Why? Is this a good policy? What if we have different notions of what's "simpler"?

Occam's razor (cont'd)
A computer-science-ish way of looking at it: say "simple" = "short description". At most 2^s explanations can be < s bits long. So, if the number of examples satisfies
m > (1/ε)[s ln(2) + ln(100)]
(think of this as roughly 10× the number of bits needed to write down h), then it's unlikely a bad simple explanation will fool you just by chance.

Occam's razor (cont'd, 2)
Nice interpretation: even if we have different notions of what's simpler (e.g., different representation languages), we can both use Occam's razor. Of course, there's no guarantee there will be a short explanation for the data; that depends on your representation.
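To get a feel for these numbers, here is a small sketch that just evaluates the two bounds above. The 0.01 confidence level and the crude (4n+2)! count of decision lists are from the slides; the function names and the particular values of n, ε, and s are mine.

    import math

    def sample_size_finite_class(eps, ln_H, delta=0.01):
        """|S| large enough that |H| * (1 - eps)^|S| < delta,
        i.e. |S| > (1/eps) * (ln|H| + ln(1/delta))."""
        return math.ceil((ln_H + math.log(1.0 / delta)) / eps)

    def sample_size_occam(eps, s_bits, delta=0.01):
        """Occam bound: at most 2^s explanations are under s bits long,
        so m > (1/eps) * (s*ln(2) + ln(1/delta)) examples suffice."""
        return math.ceil((s_bits * math.log(2.0) + math.log(1.0 / delta)) / eps)

    n = 100                                    # number of Boolean features
    ln_H_dl = math.lgamma(4 * n + 3)           # ln((4n+2)!), since lgamma(k+1) = ln(k!)
    print(sample_size_finite_class(eps=0.1, ln_H=ln_H_dl))    # roughly (1/eps) * n ln n
    print(sample_size_occam(eps=0.1, s_bits=1000))            # rules describable in < 1000 bits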

Further work
Replace log(|H|) with an "effective number of degrees of freedom": there are infinitely many linear separators, but not that many really different ones. Kernels, margins, more refined analyses.

Online learning
What if we don't want to assume that the data comes from some fixed distribution? Or make any assumptions on the data at all? Then we can no longer talk about past performance predicting future results. Can we hope to say anything interesting at all? Idea: regret bounds. Show that our algorithm does nearly as well as the best predictor in some large class.

Using expert advice
Say we want to predict the stock market. We solicit n "experts" for their advice (will the market go up or down?), and then want to use their advice somehow to make our prediction. Basic question: is there a strategy that allows us to do nearly as well as the best of these in hindsight? ["Expert" = someone with an opinion, not necessarily someone who knows anything.]

Simpler question
We have n experts. One of these is perfect (never makes a mistake); we just don't know which one. Can we find a strategy that makes no more than lg(n) mistakes? Answer: sure. Just take a majority vote over all experts that have been correct so far. Each mistake we make cuts the number of available experts by a factor of 2. Note: this means it's OK for n to be very large.

What if no expert is perfect?
Intuition: making a mistake doesn't completely disqualify an expert. So, instead of crossing experts off, just lower their weight.

Weighted Majority Algorithm
Start with all experts having weight 1. Predict based on a weighted majority vote. Penalize mistakes by cutting the weight in half.
Analysis (do nearly as well as the best expert in hindsight): Let M = # mistakes we've made so far, m = # mistakes the best expert has made so far, and W = total weight (starts at n). After each mistake we make, W drops by at least 25%, so after M mistakes W is at most n·(3/4)^M. The weight of the best expert is (1/2)^m. So (1/2)^m ≤ n·(3/4)^M, which gives M ≤ 2.4(m + lg n). So, if m is small, then M is pretty small too.
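Here is a short sketch of the Weighted Majority algorithm just described. The toy experts and outcomes are made up for illustration; the halving penalty and the weighted vote follow the slide.

    # Deterministic Weighted Majority: predict with the weighted majority vote,
    # then halve the weight of every expert that was wrong.  Predictions are 0/1.

    def weighted_majority(expert_predictions, outcomes):
        n = len(expert_predictions[0])
        w = [1.0] * n                       # all experts start with weight 1
        mistakes = 0
        for preds, y in zip(expert_predictions, outcomes):
            vote_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
            vote_0 = sum(wi for wi, p in zip(w, preds) if p == 0)
            guess = 1 if vote_1 >= vote_0 else 0
            mistakes += (guess != y)
            w = [wi / 2 if p != y else wi for wi, p in zip(w, preds)]
        return mistakes

    # Expert 0 happens to be always right here, so m = 0 and the bound is M <= 2.4 * lg(3).
    preds = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [1, 0, 0]]
    truth = [1, 0, 1, 1]
    print(weighted_majority(preds, truth))

Whatever the sequence of predictions and outcomes, the argument above guarantees M ≤ 2.4(m + lg n).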

Randomized Weighted Majority
M ≤ 2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes: instead of taking a majority vote, use the weights as probabilities (e.g., if 70% of the weight is on "up" and 30% on "down", then predict up with probability 0.7 and down with probability 0.3). Idea: smooth out the worst case. Also, generalize the penalty factor from 1/2 to (1-ε). Now let M = expected # of mistakes. (A code sketch appears below.)

Analysis
Say at time t we have a fraction F_t of the weight on experts that made a mistake. So we have probability F_t of making a mistake, and we remove an ε·F_t fraction of the total weight:
W_final = n·(1-εF_1)·(1-εF_2)·...
ln(W_final) = ln(n) + Σ_t ln(1-εF_t) ≤ ln(n) - ε·Σ_t F_t   (using ln(1-x) ≤ -x)
            = ln(n) - εM                                    (since Σ_t F_t = E[# mistakes] = M).
If the best expert makes m mistakes, then ln(W_final) ≥ ln((1-ε)^m). Now solve ln(n) - εM ≥ m·ln(1-ε), giving roughly M ≤ (1+ε)m + (1/ε)·ln(n).

What can we use this for?
We can use this for repeated play of a matrix game. Consider a matrix where all entries are 0 or 1, and the rows are the different "experts". Start each row with weight 1, pick a row with probability proportional to its weight, and update as in RWM. The analysis shows we do nearly as well as the best row in hindsight! In fact, the analysis applies for entries in [0,1], not just {0,1}. In fact, this gives a proof of the minimax theorem.

Nice proof of the minimax theorem (sketch)
Suppose for contradiction it was false. This means some game G has V_C > V_R:
If the Column player commits first, there exists a row that gets the Row player at least V_C.
But if the Row player has to commit first, the Column player can hold him to only V_R.
Scale the matrix so payoffs to the row player are in [-1,0]. Say V_R = V_C - δ.

Proof sketch, cont'd
Now, consider running the Randomized Weighted Majority algorithm as the Row player, against a Column player who plays optimally against Row's distribution. In T steps:
Alg gets ≥ (1+ε/2)·[best row in hindsight] - log(n)/ε,
BRiH ≥ T·V_C   [best row against the opponent's empirical distribution],
Alg ≤ T·V_R    [each round, the opponent knows your randomized strategy].
The gap is δT. This contradicts the assumption, if we use ε = δ, once T > 2·log(n)/ε².

Other models
Some scenarios allow more options for the algorithm. "Active learning": the algorithm has a large unlabeled sample and may choose which of these to have labeled (e.g., web pages, image databases). Or, allow the algorithm to construct its own examples ("membership queries"): e.g., the features represent variable settings in some experiment, and the label represents the outcome. This gives the algorithm more power.
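Returning to Randomized Weighted Majority: here is a sketch of the algorithm described above (weights as probabilities, (1-ε) penalty). The toy data is made up; the update is the one from the slides, and it is the same update used in the game-playing application, with the rows of the game matrix playing the role of the experts.

    import random

    # Randomized Weighted Majority: follow expert i with probability w[i] / sum(w),
    # then multiply the weight of every expert that was wrong by (1 - eps).

    def rwm(expert_predictions, outcomes, eps=0.1, seed=0):
        rng = random.Random(seed)
        n = len(expert_predictions[0])
        w = [1.0] * n
        expected_mistakes, actual_mistakes = 0.0, 0
        for preds, y in zip(expert_predictions, outcomes):
            total = sum(w)
            F_t = sum(wi for wi, p in zip(w, preds) if p != y) / total  # prob. of a mistake now
            expected_mistakes += F_t
            i = rng.choices(range(n), weights=w)[0]      # follow one expert at random
            actual_mistakes += (preds[i] != y)
            w = [wi * (1 - eps) if p != y else wi for wi, p in zip(w, preds)]
        return expected_mistakes, actual_mistakes        # E[mistakes] <= ~(1+eps)*m + ln(n)/eps

    preds = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [1, 0, 0]]
    truth = [1, 0, 1, 1]
    print(rwm(preds, truth))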

Other models (cont'd)
There is a lot of ongoing research into better algorithms, into models that capture additional issues, and into incorporating machine learning into broader classes of applications.

Additional notes
Some courses at CMU on machine learning: 10-601 Machine Learning; 15-859(B) Machine Learning Theory; see http://www.machinelearning.com; any 10-xxx course.

And finally
The final exam is Thursday 1pm in DH 2210; one sheet of notes allowed. Review session next Wednesday 1-3pm in Wean 7500. Good luck everyone!