Learning theory Lecture 4
1 Learning theory Lecture 4 David Sontag New York University Slides adapted from Carlos Guestrin & Luke Zettlemoyer
2 What's next? We gave several machine learning algorithms: the perceptron, the linear support vector machine (SVM), and the SVM with kernels (e.g., polynomial or Gaussian). How do we guarantee that the learned classifier will perform well on test data? How much training data do we need?
3 Example: Perceptron applied to spam classification. In your homework 1, you trained a spam classifier using the perceptron. The training error was always zero. With few data points, there was a big gap between training error and test error! [Figure: test error (percentage misclassified) vs. number of training examples, showing the gap test error − training error shrinking below 0.02 as the number of examples grows.]
4 How much training data do you need? Depends on what hypothesis class the learning algorithm considers. For example, consider a memorization-based learning algorithm. Input: training data S = { (x_i, y_i) }. Output: a function f(x) which, if there exists (x_i, y_i) in S such that x = x_i, predicts y_i, and otherwise predicts the majority label. This learning algorithm will always obtain zero training error, but it will take a huge amount of training data to obtain small test error (i.e., its generalization performance is horrible). Linear classifiers are powerful precisely because of their simplicity: generalization is easy to guarantee.
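To make the memorization-based learner concrete, here is a minimal Python sketch (the class name and interface are invented for illustration; they are not from the slides):

```python
from collections import Counter

class MemorizationLearner:
    """Predicts the stored label for previously seen inputs,
    and the majority training label for everything else."""

    def fit(self, X, y):
        # Memorize every training example exactly.
        self.memory = {tuple(x_i): y_i for x_i, y_i in zip(X, y)}
        # Fall back to the majority label for unseen inputs.
        self.majority = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.memory.get(tuple(x), self.majority)

# Zero training error by construction, but terrible generalization:
learner = MemorizationLearner().fit([(0, 1), (1, 0), (1, 1)], [1, 0, 1])
print(learner.predict((0, 1)))  # memorized -> 1
print(learner.predict((5, 5)))  # unseen    -> majority label 1
```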
5 Roadmap of lecture: 1. Generalization of finite hypothesis spaces. 2. VC-dimension. Will show that linear classifiers need to see approximately d training points, where d is the dimension of the feature vectors. Explains the good performance we obtained using the perceptron (we had a few thousand features)! 3. Margin-based generalization. Applies to infinite-dimensional feature vectors (e.g., Gaussian kernel). [Figure: test error (percentage misclassified) vs. number of training examples for the perceptron algorithm on spam classification. Figure from Cynthia Rudin]
6 How big should your validation set be? In PS1, you tried many configurations of your algorithms (averaged vs. regular perceptron, max # of iterations) and chose the one that had the smallest validation error. Suppose in total you tested |H| = 40 different classifiers on the validation set of m held-out e-mails. The best classifier obtains 98% accuracy on these m e-mails! But what is the true classification accuracy? How large does m need to be so that we can guarantee that the best configuration (measured on the validation set) is truly good?
7 A simple setting. Classification with m data points. Finite number of possible hypotheses (e.g., 40 spam classifiers); H_c ⊆ H is the set of hypotheses consistent with the data. A learner finds a hypothesis h that is consistent with the training data, i.e., gets zero error in training: error_train(h) = 0. (Assume for now that one of the classifiers gets 100% accuracy on the m e-mails; we'll handle the 98% case afterward.) What is the probability that h has more than ε true error, error_true(h) ≥ ε?
8 Refresher on probability: outcomes. An outcome space Ω specifies the possible outcomes that we would like to reason about, e.g. Ω = {heads, tails} for a coin toss, or Ω = {1, 2, 3, 4, 5, 6} for a die toss. We specify a probability p(x) for each outcome x such that p(x) ≥ 0 and Σ_{x∈Ω} p(x) = 1. E.g., p(heads) = 0.6, p(tails) = 0.4.
9 Refresher on probability: events. An event is a subset of the outcome space, e.g. E = {2, 4, 6} (even die tosses) or O = {1, 3, 5} (odd die tosses). The probability of an event is given by the sum of the probabilities of the outcomes it contains: p(E) = Σ_{x∈E} p(x). E.g., p(E) = p(2) + p(4) + p(6) = 1/2 if the die is fair.
10 Refresher on probability: union bound. P(A or B or C or D or …) ≤ P(A) + P(B) + P(C) + P(D) + … More precisely, p(A ∪ B) = p(A) + p(B) − p(A ∩ B) ≤ p(A) + p(B). Q: When is this a tight bound? A: For disjoint events (i.e., non-overlapping circles).
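A quick numerical illustration of the union bound (a sketch; the two events chosen here are illustrative, not from the slides):

```python
from fractions import Fraction

# Fair six-sided die: each outcome has probability 1/6.
p = {x: Fraction(1, 6) for x in range(1, 7)}

A = {2, 4, 6}   # even tosses
B = {4, 5, 6}   # tosses >= 4 (overlaps with A)

def prob(event):
    return sum(p[x] for x in event)

print(prob(A | B))         # exact: P(A or B) = 2/3
print(prob(A) + prob(B))   # union bound:      1
# The bound is tight only when the events are disjoint.
```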
11 Refresher on probability: independence. Two events A and B are independent if p(A ∩ B) = p(A)p(B). Are the two events in the figure independent? No! p(A ∩ B) = 0, but p(A)p(B) > 0, so p(A ∩ B) ≠ p(A)p(B).
12 Refresher on probability: independence. Two events A and B are independent if p(A ∩ B) = p(A)p(B). Suppose our outcome space consisted of two different dice: Ω = all pairs of faces from 2 die tosses, 6² = 36 outcomes. (Analogy: the outcome space defines all possible sequences of e-mails in the training set.) The probability of each outcome is defined as a product, e.g. p(1,1) = a_1 b_1 and p(1,2) = a_1 b_2, where the parameters a_1, …, a_6 and b_1, …, b_6 satisfy Σ_{i=1}^{6} a_i = 1 and Σ_{j=1}^{6} b_j = 1.
13 Refresher on probability: independence. Two events A and B are independent if p(A ∩ B) = p(A)p(B). Are these events independent? Yes! Let A be an event about the first die only (analogy: asking about the first e-mail in the training set) and B an event about the second die only (analogy: asking about the second e-mail in the training set). For example, with A = "first die shows 1" and B = "second die shows 2": p(A) = Σ_{j=1}^{6} a_1 b_j = a_1, p(B) = Σ_{i=1}^{6} a_i b_2 = b_2, and p(A ∩ B) = a_1 b_2 = p(A)p(B).
14 Refresher on probability: discrete random variables. A random variable X is a mapping X : Ω → D, where D is some set (e.g., the integers); it induces a partition of all outcomes. For some x ∈ D, we say p(X = x) = p({ω ∈ Ω : X(ω) = x}), the probability that variable X assumes state x. Notation: Val(X) = the set D of all values assumed by X (we will interchangeably call these the values or states of variable X). Example: Ω = all pairs of faces from 2 die tosses.
15 Refresher on probability: discrete random variables. p(X) is a distribution: Σ_{x∈Val(X)} p(X = x) = 1. E.g., X_1 may refer to the value of the first die, and X_2 to the value of the second die. We call two random variables X and Y identically distributed if Val(X) = Val(Y) and p(X = s) = p(Y = s) for all s in Val(X). With p(1,1) = a_1 b_1, p(1,2) = a_1 b_2, etc. (two die tosses, Σ_i a_i = 1, Σ_j b_j = 1), X_1 and X_2 are NOT identically distributed.
16 Refresher on probability: discrete random variables. p(X) is a distribution: Σ_{x∈Val(X)} p(X = x) = 1. E.g., X_1 may refer to the value of the first die, and X_2 to the value of the second die. We call two random variables X and Y identically distributed if Val(X) = Val(Y) and p(X = s) = p(Y = s) for all s in Val(X). If instead both dice share the same parameters, i.e. p(1,1) = a_1 a_1, p(1,2) = a_1 a_2, etc. (with Σ_i a_i = 1), then X_1 and X_2 ARE identically distributed.
17 Refresher on probability: discrete random variables. X = x is simply an event, so we can apply the union bound, etc. Two random variables X and Y are independent if p(X = x, Y = y) = p(X = x) p(Y = y) for all x ∈ Val(X), y ∈ Val(Y). (The joint probability p(X = x, Y = y) is formally given by the event X = x ∩ Y = y.) The expectation of X is defined as E[X] = Σ_{x∈Val(X)} p(X = x) · x. If X is binary valued, i.e. x is either 0 or 1, then E[X] = p(X = 0)·0 + p(X = 1)·1 = p(X = 1).
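A tiny worked check of the binary-expectation identity E[X] = p(X = 1) (the distribution below is an illustrative choice, not from the slides):

```python
from fractions import Fraction

# Binary random variable: E[X] = p(X=0)*0 + p(X=1)*1 = p(X=1).
p = {0: Fraction(2, 5), 1: Fraction(3, 5)}
expectation = sum(prob * x for x, prob in p.items())
print(expectation)          # 3/5
print(expectation == p[1])  # True
```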
18 A simple setting (recap). Classification with m data points. Finite number of possible hypotheses (e.g., 40 spam classifiers); H_c ⊆ H is the set of hypotheses consistent with the data. A learner finds a hypothesis h that is consistent with the training data, i.e., gets zero error in training: error_train(h) = 0. (Assume for now that one of the classifiers gets 100% accuracy on the m e-mails; we'll handle the 98% case afterward.) What is the probability that h has more than ε true error, error_true(h) ≥ ε?
19 How likely is a single hypothesis to get m data points right? The probability of a hypothesis h incorrectly classifying is ε_h = Σ_{(x,y)} p̂(x, y) 1[h(x) ≠ y]. Let Z_i^h be a random variable that takes two values: 1 if h correctly classifies the i-th data point, and 0 otherwise. The Z_i^h variables are independent and identically distributed (i.i.d.) with Pr(Z_i^h = 0) = Σ_{(x,y)} p̂(x, y) 1[h(x) ≠ y] = ε_h. What is the probability that h classifies m data points correctly? Pr(h gets m i.i.d. data points right) = (1 − ε_h)^m ≤ e^{−ε_h m}
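The last inequality uses 1 − ε ≤ e^{−ε}. A quick numerical sanity check (ε_h and m below are arbitrary illustrative values):

```python
import math

eps_h, m = 0.05, 200
exact = (1 - eps_h) ** m      # P(h with true error eps_h gets all m points right)
bound = math.exp(-eps_h * m)  # exponential upper bound from the slide
print(exact, bound, exact <= bound)  # ~3.5e-05  ~4.5e-05  True
```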
20 Are we done? Pr(h gets m i.i.d. data points right | error_true(h) ≥ ε) ≤ e^{−εm}. Says: if h gets m data points correct, then with very high probability (i.e., 1 − e^{−εm}) it is close to perfect (i.e., will have error ≤ ε). But this only considers one hypothesis! Suppose 1 billion classifiers were tried, and each was a random function. For m small enough, one of the functions will classify all points correctly, yet all have very large true error.
21 How likely is the learner to pick a bad hypothesis? Pr(h gets m i.i.d. data points right | error_true(h) ≥ ε) ≤ e^{−εm}. Suppose there are |H_c| hypotheses consistent with the training data. How likely is the learner to pick a bad one, i.e., one with true error ≥ ε? We need a bound that holds for all of them: P(error_true(h_1) ≥ ε OR error_true(h_2) ≥ ε OR … OR error_true(h_{|H_c|}) ≥ ε) ≤ Σ_k P(error_true(h_k) ≥ ε) [union bound] ≤ Σ_k (1−ε)^m [bound on individual h_k] ≤ |H| (1−ε)^m [since |H_c| ≤ |H|] ≤ |H| e^{−mε} [since (1−ε) ≤ e^{−ε} for 0 ≤ ε ≤ 1].
22 Generalization error of finite hypothesis spaces [Haussler 88]. We just proved the following result. Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent on the training data, P(error_true(h) > ε) ≤ |H| e^{−mε}.
23 Using a PAC bound. Typically, 2 use cases: (1) pick ε and δ, compute m; (2) pick m and δ, compute ε. Argument: we are willing to tolerate a δ probability of having ≥ ε error. Since P(error_true(h) ≥ ε) ≤ |H| e^{−mε}, requiring |H| e^{−mε} ≤ δ and taking logs gives ln|H| − mε ≤ ln δ, so for all h, with probability 1 − δ the following holds (either case 1 or case 2). Case 1: m ≥ (ln|H| + ln(1/δ)) / ε. Case 2: ε ≥ (ln|H| + ln(1/δ)) / m. E.g., for ε = δ = 0.01 and |H| = 40, we need m ≥ 830. Log dependence on |H|: OK if H is of exponential size (but not doubly exponential). ε has a stronger influence than δ; ε shrinks at rate O(1/m).
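A small sketch reproducing the m ≥ 830 calculation (case 1); the function name is made up, and we round up to the next integer:

```python
import math

def consistent_learner_sample_size(H_size, eps, delta):
    """Smallest m with |H| * exp(-m * eps) <= delta (case 1 of the PAC bound)."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

print(consistent_learner_sample_size(H_size=40, eps=0.01, delta=0.01))  # 830
```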
24 Limitations of the Haussler 88 bound. There may be no consistent hypothesis h (i.e., no h with error_train(h) = 0). Size of the hypothesis space: what if |H| is really big? What if it is continuous? First goal: can we get a bound for a learner with nonzero training error error_train(h) on the data set?
25 Question: what's the expected error of a hypothesis? The probability of a hypothesis incorrectly classifying is θ = Σ_{(x,y)} p̂(x, y) 1[h(x) ≠ y]. Let's now let Z_i^h be a random variable that takes two values: 1 if h correctly classifies the i-th data point, and 0 otherwise. The Z_i^h variables are independent and identically distributed (i.i.d.) with Pr(Z_i^h = 0) = Σ_{(x,y)} p̂(x, y) 1[h(x) ≠ y]. Estimating the true error probability is like estimating the parameter of a coin! Let X_i = 1 − Z_i^h indicate a mistake on point i, so p(X_i = 1) = θ is the true error probability and (1/m) Σ_i X_i is the observed fraction of points incorrectly classified, with E[(1/m) Σ_{i=1}^{m} X_i] = (1/m) Σ_{i=1}^{m} E[X_i] = θ (by linearity of expectation). Chernoff bound: for m i.i.d. coin flips X_1, …, X_m, where X_i ∈ {0,1} and 0 < ε < 1: P(θ − (1/m) Σ_i X_i > ε) ≤ e^{−2mε²}.
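A small simulation of the coin-flip analogy (a sketch; θ, m, and ε are made-up values): the observed error fraction concentrates around the true error, and the empirical deviation probability stays below the (two-sided) Chernoff/Hoeffding bound.

```python
import math
import random

theta, m, eps, trials = 0.3, 200, 0.05, 20000   # illustrative values
random.seed(0)

bad = 0
for _ in range(trials):
    # X_i = 1 if the hypothesis misclassifies the i-th point (probability theta).
    mistakes = sum(random.random() < theta for _ in range(m))
    if abs(mistakes / m - theta) > eps:
        bad += 1

print(bad / trials)                     # empirical P(|estimate - theta| > eps)
print(2 * math.exp(-2 * m * eps ** 2))  # two-sided Hoeffding bound, ~0.74
```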
26 Generalization bound for |H| hypotheses. Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h, Pr(|error_true(h) − error_D(h)| > ε) ≤ |H| e^{−2mε²}. Why? Same reasoning as before: use the union bound over individual Chernoff bounds.
27 PAC bound and bias-variance tradeoff. For all h, with probability at least 1 − δ: error_true(h) ≤ error_D(h) + sqrt( (ln|H| + ln(1/δ)) / (2m) ), where the first term plays the role of the bias and the second the variance. For large |H|: low bias (assuming we can find a good h), high variance (because the bound is looser). For small |H|: high bias (is there a good h?), low variance (tighter bound).
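A sketch evaluating the square-root ("variance") term of this bound for a few illustrative values of m (the numbers are made up):

```python
import math

def generalization_gap(H_size, m, delta):
    """With probability >= 1 - delta: error_true(h) <= error_D(h) + this gap."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2.0 * m))

for m in (100, 1000, 10000):
    print(m, round(generalization_gap(H_size=40, m=m, delta=0.05), 3))
# Growing |H| loosens the bound (more variance); more data m tightens it.
```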
28 What about continuous hypothesis spaces? A continuous hypothesis space has |H| = ∞. Infinite variance??? No: we only care about the maximum number of points that can be classified exactly!
29 How many points can a linear boundary classify exactly? (1-D) 2 points: yes! 3 points: no. (The figure shows the possible labelings, 8 in total for 3 points.)
30 Shattering and Vapnik-Chervonenkis dimension. A set of points is shattered by a hypothesis space H iff: for all ways of splitting the examples into positive and negative subsets, there exists some consistent hypothesis h. The VC dimension of H over input space X is the size of the largest finite subset of X shattered by H.
31 How many points can a linear boundary classify exactly? (2-D) 3 points: yes! 4 points: no. [Figure from Chris Burges]
32 How many points can a linear boundary classify exactly? (d-D) A linear classifier sign(Σ_{j=1..d} w_j x_j + b) can represent all assignments of possible labels to d+1 points, but not to d+2! Thus, the VC-dimension of d-dimensional linear classifiers is d+1. (The bias term b is required.) Rule of thumb: the number of parameters in the model often matches the max number of points. Question: can we get a bound for error as a function of the number of points that can be completely labeled?
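A quick numerical check (a sketch, not from the slides) that d+1 points in general position can be given any labeling by a linear classifier with a bias term: solving the (d+1)×(d+1) system [X 1][w; b] = y fits each labeling exactly, so the predicted signs match the labels.

```python
import numpy as np

d = 3
rng = np.random.default_rng(0)
X = rng.standard_normal((d + 1, d))        # d+1 random points in R^d
A = np.hstack([X, np.ones((d + 1, 1))])    # augment with the bias column

label_rng = np.random.default_rng(1)
for _ in range(5):                         # spot-check a few labelings
    y = label_rng.choice([-1.0, 1.0], size=d + 1)
    wb = np.linalg.solve(A, y)             # w = wb[:d], b = wb[d]
    assert np.all(np.sign(A @ wb) == y)    # this labeling is realized
print("all sampled labelings realized by some (w, b)")
```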
33 PAC bound using VC dimension. VC dimension: the number of training points that can be classified exactly (shattered) by hypothesis space H! It measures the relevant size of the hypothesis space. Same bias/variance tradeoff as always; now it is just a function of VC(H). Note: all of this theory is for binary classification; it can be generalized to multi-class and also regression.
34 What is the VC-dimension of rectangle classifiers? First, show that there are 4 points that can be shattered. Then, show that no set of 5 points can be shattered. [Figures from Anand Bhaskar, Ilya Sukhar]
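A brute-force shattering check for axis-aligned rectangles (a sketch, not from the slides). It relies on the fact that if any rectangle contains exactly the chosen positive points, then the bounding box of those positives does too, so checking the bounding box suffices.

```python
from itertools import product

def rectangle_shatters(points):
    """True if axis-aligned rectangles (label + inside, - outside)
    realize every labeling of the given points."""
    for labels in product([False, True], repeat=len(points)):
        pos = [p for p, lab in zip(points, labels) if lab]
        if not pos:
            continue  # an empty rectangle handles the all-negative labeling
        # Smallest rectangle containing all positives: their bounding box.
        lo_x, hi_x = min(x for x, _ in pos), max(x for x, _ in pos)
        lo_y, hi_y = min(y for _, y in pos), max(y for _, y in pos)
        # If any negative point falls inside, no rectangle is consistent.
        for (x, y), lab in zip(points, labels):
            if not lab and lo_x <= x <= hi_x and lo_y <= y <= hi_y:
                return False
    return True

# Four points in a "diamond" configuration can be shattered ...
print(rectangle_shatters([(0, 1), (0, -1), (1, 0), (-1, 0)]))  # True
# ... showing VC-dimension >= 4 (the slides argue no 5 points work).
```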
35 Generalization bounds using VC dimension. Linear classifiers: VC(H) = d+1, for d features plus constant term b. Classifiers using the Gaussian kernel: VC(H) = ∞. [Figure from Chris Burges; figure from mblondel.org showing the Gaussian kernel as a function of the squared Euclidean distance]
36 Gap tolerant classifiers. Suppose the data lies in R^d in a ball of diameter D. Consider a hypothesis class H of linear classifiers that can only classify point sets with margin at least M. What is the largest set of points that H can shatter? [Figure from Chris Burges: a point set with D = 2 and M = 3/2 that cannot be shattered; points are annotated with Φ ∈ {0, +1, −1} and Y ∈ {0, +1, −1}.] VC dimension = min(d, D²/M²). Since the margin is M = 2γ = 2/||w||, the SVM attempts to minimize ||w||², which minimizes the VC-dimension!
37 Gap tolerant classifiers (continued). Suppose the data lies in R^d in a ball of diameter D. Consider a hypothesis class H of linear classifiers that can only classify point sets with margin at least M. What is the largest set of points that H can shatter? [Figure from Chris Burges, as on the previous slide: D = 2, M = 3/2.] VC dimension = min(d, D²/M²). What is R = D/2 for the Gaussian kernel? R = max_x ||Φ(x)|| = max_x sqrt(K(x, x)) = 1. What is ||w||²? ||w||² = ||Σ_i α_i y_i Φ(x_i)||² = Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), and the margin is M = 2/||w||.
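A numerical sketch of the ||w||² computation for a Gaussian-kernel classifier; the data points and dual coefficients α below are arbitrary illustrative values, not the output of an actual SVM solver.

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2); note K(x, x) = 1, so R = 1.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.5], [2.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.3, 0.4, 0.4])      # illustrative dual coefficients

K = gaussian_kernel(X)
w_norm_sq = (alpha * y) @ K @ (alpha * y)   # ||w||^2 computed via the kernel only
print(w_norm_sq)                            # margin M = 2 / sqrt(w_norm_sq)
```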
38 What you need to know. Finite hypothesis space: derive the results by counting the number of hypotheses. The complexity of the classifier depends on the number of points that can be classified exactly: finite case, the number of hypotheses considered; infinite case, the VC dimension. VC dimension of gap tolerant classifiers, used to justify the SVM. Bias-variance tradeoff in learning theory.