Learning a Class from Examples. Training set X. Class C of a family car. Input representation: x1: price, x2: engine power

Alpaydın Chapter 2, Mitchell Chapter 7. Alpaydın slides are in turquoise. Ethem Alpaydın, copyright: The MIT Press, 2010. alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml2e. All other slides are based on Mitchell.

Learning a Class from Examples
Class C of a family car
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output: Positive (+) and negative (−) examples
Input representation: x1: price, x2: engine power

Training set X
X = {x^t, r^t}, t = 1,...,N, where
r = 1 if x is positive, 0 if x is negative
x = [x1, x2]^T

Class C
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
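The class C above is an axis-aligned rectangle in the (price, engine power) plane. Below is a minimal Python sketch of such a rectangle hypothesis; the threshold values p1, p2, e1, e2 and the test points are invented for illustration, not taken from the slides.

```python
# Sketch of the rectangle hypothesis (p1 <= price <= p2) AND (e1 <= engine power <= e2).
# The thresholds and the example cars are illustrative assumptions, not slide values.

def h(x, p1=15_000, p2=30_000, e1=100, e2=200):
    """Return 1 if x = (price, engine_power) lies inside the rectangle, else 0."""
    price, power = x
    return int(p1 <= price <= p2 and e1 <= power <= e2)

print(h((22_000, 150)))  # 1: inside the rectangle, predicted to be a family car
print(h((80_000, 400)))  # 0: outside the rectangle
```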

Hypothesis class H
h(x) = 1 if h says x is positive, 0 if h says x is negative
Error of h on X: E(h | X) = Σ_t 1(h(x^t) ≠ r^t)

S, G, and the Version Space
Most specific hypothesis, S; most general hypothesis, G.
Any h ∈ H between S and G is consistent, and together they make up the version space (Mitchell, 1997).

Computational Learning Theory (from Mitchell Chapter 7)
Theoretical characterization of the difficulties and capabilities of learning algorithms.
Questions:
What general laws constrain inductive learning?
Conditions for successful/unsuccessful learning
Conditions of success for particular algorithms
Two frameworks:
Probably Approximately Correct (PAC) framework: classes of hypotheses that can be learned; complexity of hypothesis space and bound on training set size.
Mistake bound framework: number of training errors made before the correct hypothesis is determined.

Computational Learning Theory
We seek theory to relate:
Probability of successful learning
Number of training examples
Complexity of hypothesis space
Accuracy to which target concept is approximated
Manner in which training examples are presented
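For the rectangle class, the most specific hypothesis S is simply the tightest rectangle around the positive examples, and E(h | X) counts disagreements with the labels. A small sketch under those assumptions; the toy training set is invented, and the error here is normalized by N for readability.

```python
# Most specific hypothesis S for the rectangle class: the tightest axis-aligned
# rectangle enclosing the positive examples. Training data is invented for illustration.

def most_specific_rectangle(X, r):
    pos = [x for x, label in zip(X, r) if label == 1]
    prices = [p for p, _ in pos]
    powers = [e for _, e in pos]
    return min(prices), max(prices), min(powers), max(powers)

def empirical_error(h, X, r):
    # E(h | X): fraction of training examples where h(x^t) != r^t (normalized by N here).
    return sum(h(x) != label for x, label in zip(X, r)) / len(X)

X = [(22_000, 150), (80_000, 400), (25_000, 180), (10_000, 60)]
r = [1, 0, 1, 0]
p1, p2, e1, e2 = most_specific_rectangle(X, r)
S = lambda x: int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)
print((p1, p2, e1, e2), empirical_error(S, X, r))  # S is consistent with X: error 0.0
```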

Specific Questions
Sample complexity: How many training examples are needed for a learner to converge?
Computational complexity: How much computational effort is needed for a learner to converge?
Mistake bound: How many training examples will the learner misclassify before converging?
Issues: When to say it was successful? How are inputs acquired?

Sample Complexity
How many training examples are sufficient to learn the target concept?
1. If the learner proposes instances, as queries to the teacher: the learner proposes instance x, the teacher provides c(x).
2. If the teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form <x, c(x)>.
3. If some random process (e.g., nature) proposes instances: instance x is generated randomly, the teacher provides c(x).

True Error of a Hypothesis
(Figure: instance space X with target concept c and hypothesis h; the shaded regions are where c and h disagree.)
Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:
error_D(h) ≡ Pr_{x∼D}[c(x) ≠ h(x)]

Two Notions of Error
Training error of hypothesis h with respect to target concept c: how often h(x) ≠ c(x) over training instances.
True error of hypothesis h with respect to c: how often h(x) ≠ c(x) over future random instances.
Our concern: can we bound the true error of h given the training error of h?
First consider the case when the training error of h is zero (i.e., h ∈ VS_{H,D}).
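As a concrete illustration of the two notions of error, the sketch below uses an invented one-dimensional target concept c, a hypothesis h, and D = Uniform(0, 1); the true error is approximated by Monte Carlo sampling rather than computed in closed form.

```python
# Training error vs a Monte Carlo estimate of the true error error_D(h).
# Target c, hypothesis h, and D = Uniform(0, 1) are invented for illustration.
import random

random.seed(0)
c = lambda x: x > 0.5   # target concept
h = lambda x: x > 0.6   # hypothesis; disagrees with c exactly on (0.5, 0.6]

train = [random.random() for _ in range(20)]
train_error = sum(h(x) != c(x) for x in train) / len(train)

sample = [random.random() for _ in range(100_000)]
true_error_estimate = sum(h(x) != c(x) for x in sample) / len(sample)

print(train_error, true_error_estimate)  # the true error is about 0.1 here
```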

Exhausting the Version Space
(Figure: hypothesis space H containing the version space VS_{H,D}; each hypothesis is annotated with its training error r and true error, e.g. error = .1, r = 0 or error = .3, r = .4. Here r = training error, error = true error.)
Definition: The version space VS_{H,D} is said to be ɛ-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has error less than ɛ with respect to c and D:
(∀h ∈ VS_{H,D}) error_D(h) < ɛ

How many examples will ɛ-exhaust the VS?
Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ɛ ≤ 1, the probability that the version space with respect to H and D is not ɛ-exhausted (with respect to c) is less than
|H| e^(−ɛm)
This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ɛ. If we want this probability to be below δ, then
|H| e^(−ɛm) ≤ δ, i.e. m ≥ (1/ɛ)(ln|H| + ln(1/δ))

Proof of ɛ-exhausting Theorem
Theorem: Prob. of VS_{H,D} not being ɛ-exhausted is at most |H| e^(−ɛm).
Proof: Let h_i ∈ H (i = 1..k) be those hypotheses that have true error greater than ɛ w.r.t. c (k ≤ |H|). We fail to ɛ-exhaust the VS iff at least one h_i is consistent with all m training instances (note: they have true error greater than ɛ).
The probability that a single hypothesis with error > ɛ is consistent with one random sample is at most (1 − ɛ).
The probability of that hypothesis being consistent with m samples is (1 − ɛ)^m.
The probability that at least one of the k hypotheses with error > ɛ is consistent with m samples is at most k(1 − ɛ)^m.
Since k ≤ |H|, and for 0 ≤ ɛ ≤ 1, (1 − ɛ) ≤ e^(−ɛ):
k(1 − ɛ)^m ≤ |H|(1 − ɛ)^m ≤ |H| e^(−ɛm)

PAC Learning
Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.
Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ɛ such that 0 < ɛ < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that error_D(h) ≤ ɛ, in time that is polynomial in 1/ɛ, 1/δ, n, and size(c).
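The finite-|H| bound above is easy to evaluate numerically. A small sketch; the values of |H|, ɛ, and δ are arbitrary, chosen only to show the formula in use.

```python
# m >= (1/eps) * (ln|H| + ln(1/delta)), the sample-complexity bound derived above.
# |H|, eps, delta below are arbitrary illustrative values.
import math

def sample_complexity(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

print(sample_complexity(H_size=1000, eps=0.1, delta=0.05))  # 100 examples suffice
```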

Agnostic Learning
So far, we assumed that c ∈ H. What if that is not the case?
Agnostic learning setting: don't assume c ∈ H.
What do we want then? The hypothesis h that makes the fewest errors on the training data.
What is the sample complexity in this case? Derived from Hoeffding bounds:
Pr[error_D(h) > error_train(h) + ɛ] ≤ e^(−2mɛ²)
which gives m ≥ (1/(2ɛ²))(ln|H| + ln(1/δ))

Shattering a Set of Instances
Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.
Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

Three Instances Shattered
(Figure: instance space X with three instances; each closed contour indicates one dichotomy.)
What kind of hypothesis space H can shatter the instances?

The Vapnik-Chervonenkis Dimension
Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.
Note that H can be infinite, while VC(H) is finite!
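Shattering can be checked by brute force for small sets: enumerate all 2^|S| dichotomies and test whether some hypothesis realizes each one. The sketch below does this for the interval class a < x < b, discretized to a small grid of endpoints (an illustrative simplification), matching the interval example on the next slide.

```python
# Brute-force shattering check: S is shattered by H iff every dichotomy of S
# is realized by some hypothesis in H. H here is the class of intervals a < x < b
# with endpoints restricted to a small grid (an illustrative simplification).
from itertools import product

def shatters(hypotheses, S):
    realized = {tuple(h(x) for x in S) for h in hypotheses}
    return all(tuple(d) in realized for d in product([0, 1], repeat=len(S)))

grid = [i / 10 for i in range(101)]
intervals = [lambda x, a=a, b=b: int(a < x < b) for a in grid for b in grid if a < b]

print(shatters(intervals, [3.1, 5.7]))       # True: intervals shatter two points
print(shatters(intervals, [1.0, 3.1, 5.7]))  # False: no interval labels 1.0 and 5.7 positive but 3.1 negative
```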

VC Dim. of Linear Decision Surfaces
(Figure: (a) three points in general position; (b) three collinear points.)
When H is the set of lines and S a set of points, VC(H) = 3. (a) can be shattered, but (b) cannot be. However, if at least one subset of size 3 can be shattered, that's fine. A set of size 4 cannot be shattered, for any combination of points (think about an XOR-like situation).

VC Dimension: Another Example
S = {3.1, 5.7}, and the hypothesis space consists of intervals a < x < b. Dichotomies: both, none, only 3.1, or only 5.7.
Are there intervals that cover all the above dichotomies? What about S = {x_0, x_1, x_2} for arbitrary x_i? (cf. collinear points).

Sample Complexity from VC Dimension
How many randomly drawn examples suffice to ɛ-exhaust VS_{H,D} with probability at least (1 − δ)?
m ≥ (1/ɛ)(4 log₂(2/δ) + 8 VC(H) log₂(13/ɛ))
VC(H) is directly related to the sample complexity: a more expressive H needs more samples; more samples are needed for an H with more tunable parameters.

Mistake Bounds
So far: how many examples are needed to learn? What about: how many mistakes before convergence? This is an interesting question because some learning systems may need to start operating while still learning.
Let's consider a setting similar to PAC learning:
Instances are drawn at random from X according to distribution D.
The learner must classify each instance before receiving the correct classification from the teacher.
Can we bound the number of mistakes the learner makes before converging?
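The VC-based bound above can also be evaluated directly; the sketch below plugs in VC(H) = 3 (lines in the plane, as on the earlier slide) with illustrative ɛ and δ.

```python
# m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps)), as quoted above.
# VC(H) = 3 corresponds to lines in the plane; eps and delta are illustrative.
import math

def vc_sample_complexity(vc_dim, eps, delta):
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps)

print(vc_sample_complexity(vc_dim=3, eps=0.1, delta=0.05))  # about 1900 examples
```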

Optimal Mistake Bounds
Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (maximum over all possible c ∈ C, and all possible training sequences):
M_A(C) ≡ max_{c ∈ C} M_A(c)
Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C):
Opt(C) ≡ min_{A ∈ learning algorithms} M_A(C)

Mistake Bounds and VC Dimension
Littlestone (1987) showed:
VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log₂(|C|)

Noise and Model Complexity
Use the simpler one because it is:
Simpler to use (lower computational complexity)
Easier to train (lower space complexity)
Easier to explain (more interpretable)
Generalizes better (lower variance; Occam's razor)

Multiple Classes, C_i, i = 1,...,K
Training set X = {x^t, r^t}, t = 1,...,N, where
r_i^t = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
Train hypotheses h_i(x), i = 1,...,K:
h_i(x^t) = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
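M_Halving above refers to the Halving algorithm, which predicts by majority vote of the current version space and removes inconsistent hypotheses after each label is revealed; each mistake at least halves the version space, which yields the log₂(|C|) bound. A minimal sketch; the finite concept class (1-D thresholds on a grid) and the instance stream are invented for illustration.

```python
# Halving algorithm: predict by majority vote of the version space, then keep only
# the hypotheses consistent with the revealed label. Each mistake at least halves
# the version space, so mistakes <= log2(|C|). Concept class and stream are illustrative.

def halving(hypotheses, stream):
    vs = list(hypotheses)                          # current version space
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in vs)
        prediction = int(2 * votes >= len(vs))     # majority vote (ties -> 1)
        if prediction != label:
            mistakes += 1
        vs = [h for h in vs if h(x) == label]      # discard inconsistent hypotheses
    return mistakes

grid = [i / 20 for i in range(21)]
C = [lambda x, t=t: int(x >= t) for t in grid]     # thresholds h_t(x) = 1 iff x >= t
target = lambda x: int(x >= 0.35)                  # true concept, threshold 0.35
stream = [(x, target(x)) for x in [0.9, 0.1, 0.5, 0.3, 0.4, 0.35, 0.32]]
print(halving(C, stream))  # at most log2(21) ≈ 4.4 mistakes
```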

Regression
X = {x^t, r^t}, t = 1,...,N, with r^t ∈ ℝ
r^t = f(x^t) + ε
E(g | X) = (1/N) Σ_t [r^t − g(x^t)]²
Linear model: g(x) = w1·x + w0
E(w1, w0 | X) = (1/N) Σ_t [r^t − (w1·x^t + w0)]²
Quadratic model: g(x) = w2·x² + w1·x + w0

Model Selection & Generalization
Learning is an ill-posed problem; data is not sufficient to find a unique solution.
The need for inductive bias: assumptions about H.
Generalization: how well a model performs on new data.
Overfitting: H more complex than C or f.
Underfitting: H less complex than C or f.

Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c(H),
2. Training set size, N,
3. Generalization error, E, on new data.
As N increases, E decreases. As c(H) increases, E first decreases and then increases.

Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data as:
Training set (50%)
Validation set (25%)
Test (publication) set (25%)
Resampling when there is little data.
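The linear and quadratic models above can be compared exactly as the cross-validation slide suggests: fit on a training split, then evaluate E(g | X) on a held-out validation split. A sketch on invented synthetic data; NumPy's polyfit/polyval perform the least-squares fit.

```python
# Fit g(x) = w1*x + w0 and g(x) = w2*x^2 + w1*x + w0 by least squares on a training
# split and compare validation error E(g | X). The synthetic data and the 50/50 split
# are invented for illustration.
import random
import numpy as np

random.seed(0)
f = lambda x: 2.0 * x ** 2 - x + 1.0                    # unknown target function
data = [(x, f(x) + random.gauss(0, 0.5)) for x in [i / 10 for i in range(40)]]
random.shuffle(data)
train, valid = data[:20], data[20:]

def mse(w, points):
    # E(g | X) = (1/N) * sum_t [r^t - g(x^t)]^2
    return sum((r - np.polyval(w, x)) ** 2 for x, r in points) / len(points)

for degree in (1, 2):
    w = np.polyfit([x for x, _ in train], [r for _, r in train], degree)
    print(degree, round(mse(w, valid), 3))
# The quadratic model should generalize better here, since f itself is quadratic.
```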

Dimensions of a Supervised Learner
1. Model: g(x | θ)
2. Loss function: E(θ | X) = Σ_t L(r^t, g(x^t | θ))
3. Optimization procedure: θ* = arg min_θ E(θ | X)
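To make the three dimensions concrete, the sketch below instantiates them with a linear model g(x | w) = w1·x + w0, squared-error loss, and plain gradient descent as the optimization procedure; the data, learning rate, and step count are illustrative assumptions.

```python
# 1. Model, 2. Loss, 3. Optimization, instantiated for a 1-D linear regression.
# Data, learning rate, and number of steps are illustrative assumptions.

def g(x, w):                                    # 1. Model: g(x | w) = w1*x + w0
    w1, w0 = w
    return w1 * x + w0

def loss(w, X, r):                              # 2. Loss: E(w | X) = sum_t [r^t - g(x^t | w)]^2
    return sum((rt - g(xt, w)) ** 2 for xt, rt in zip(X, r))

def fit(X, r, lr=0.01, steps=2000):             # 3. Optimization: w* = argmin_w E(w | X)
    w1, w0 = 0.0, 0.0
    for _ in range(steps):                      # gradient descent on the squared error
        dw1 = sum(-2 * xt * (rt - g(xt, (w1, w0))) for xt, rt in zip(X, r))
        dw0 = sum(-2 * (rt - g(xt, (w1, w0))) for xt, rt in zip(X, r))
        w1, w0 = w1 - lr * dw1, w0 - lr * dw0
    return w1, w0

X, r = [0.0, 1.0, 2.0, 3.0], [1.0, 3.1, 4.9, 7.2]   # roughly r = 2x + 1
w = fit(X, r)
print(w, loss(w, X, r))                          # w close to (2, 1), small residual loss
```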