Introduction to Machine Learning (67577) Lecture 3


Introduction to Machine Learning (67577), Lecture 3
Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem
General Learning Model and the Bias-Complexity Tradeoff

Outline
1 The general PAC model: relaxing the realizability assumption; beyond binary classification; the general PAC learning model
2 Learning via Uniform Convergence
3 Linear Regression and Least Squares; Polynomial Fitting
4 The Bias-Complexity Tradeoff; Error Decomposition
5 Validation and Model Selection

Relaxing the realizability assumption (agnostic PAC learning)
- So far we assumed that labels are generated by some f ∈ H. This assumption may be too strong.
- We relax the realizability assumption by replacing the target labeling function with a more flexible notion: a data-labels generating distribution.

Relaxing the realizability assumption (agnostic PAC learning)
- Recall: in the PAC model, D is a distribution over X.
- From now on, let D be a distribution over X × Y.
- We redefine the risk as: L_D(h) := P_{(x,y)~D}[h(x) ≠ y] = D({(x, y) : h(x) ≠ y}).
- We redefine the "approximately correct" notion to: L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε.

PAC vs. agnostic PAC learning

                 PAC                                        Agnostic PAC
Distribution     D over X                                   D over X × Y
Truth            f ∈ H                                      not in the class, or doesn't exist
Risk             L_{D,f}(h) = D({x : h(x) ≠ f(x)})          L_D(h) = D({(x, y) : h(x) ≠ y})
Training set     (x_1, ..., x_m) ~ D^m, with y_i = f(x_i)   ((x_1, y_1), ..., (x_m, y_m)) ~ D^m
Goal             L_{D,f}(A(S)) ≤ ε                          L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε

Beyond binary classification
Scope of learning problems:
- Multiclass categorization: Y is a finite set representing |Y| different classes. E.g., X is a set of documents and Y = {News, Sports, Biology, Medicine}.
- Regression: Y = R. E.g., predicting a baby's birth weight based on ultrasound measures of its head circumference, abdominal circumference, and femur length.

Loss functions
- Let Z = X × Y.
- Given a hypothesis h ∈ H and an example (x, y) ∈ Z, how good is h on (x, y)?
- Loss function: ℓ : H × Z → R_+
- Examples:
  - 0-1 loss: ℓ(h, (x, y)) = 1 if h(x) ≠ y, and 0 if h(x) = y
  - Squared loss: ℓ(h, (x, y)) = (h(x) − y)²
  - Absolute-value loss: ℓ(h, (x, y)) = |h(x) − y|
  - Cost-sensitive loss: ℓ(h, (x, y)) = C_{h(x),y}, where C is some |Y| × |Y| matrix
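To make the definitions concrete, here is a minimal Python sketch of these four losses; the function names and the example cost matrix are my own illustrations, not part of the lecture.

    import numpy as np

    def zero_one_loss(y_hat, y):
        # 1 if the prediction disagrees with the label, 0 otherwise
        return float(y_hat != y)

    def squared_loss(y_hat, y):
        return (y_hat - y) ** 2

    def absolute_loss(y_hat, y):
        return abs(y_hat - y)

    def cost_sensitive_loss(y_hat, y, C):
        # C is a |Y| x |Y| cost matrix; labels are assumed to be 0, ..., |Y|-1
        return C[y_hat, y]

    # Usage: a 2-class cost matrix that penalizes one kind of mistake more than the other
    C = np.array([[0.0, 5.0],
                  [1.0, 0.0]])
    print(zero_one_loss(1, 0), squared_loss(2.5, 3.0), cost_sensitive_loss(0, 1, C))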

The general PAC learning problem
We wish to Probably Approximately Solve: min_{h∈H} L_D(h), where L_D(h) := E_{z~D}[ℓ(h, z)].
- The learner knows H, Z, and ℓ.
- The learner receives an accuracy parameter ε and a confidence parameter δ.
- The learner can decide on the training set size m based on ε, δ.
- The learner doesn't know D, but can sample S ~ D^m.
- Using S, the learner outputs some hypothesis A(S).
- We want that, with probability at least 1 − δ over the choice of S, the following holds: L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε.

Formal definition
A hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function ℓ : H × Z → R_+ if there exist a function m_H : (0, 1)² → N and a learning algorithm A with the following property: for every ε, δ ∈ (0, 1), every m ≥ m_H(ε, δ), and every distribution D over Z,
D^m({S ∈ Z^m : L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε}) ≥ 1 − δ.

Learning via Uniform Convergence

Representative sample
Definition (ε-representative sample): A training set S is called ε-representative if for every h ∈ H, |L_S(h) − L_D(h)| ≤ ε.

Representative sample
Lemma: Assume that a training set S is ε/2-representative. Then any output of ERM_H(S), namely any h_S ∈ argmin_{h∈H} L_S(h), satisfies L_D(h_S) ≤ min_{h∈H} L_D(h) + ε.
Proof: For every h ∈ H,
L_D(h_S) ≤ L_S(h_S) + ε/2 ≤ L_S(h) + ε/2 ≤ L_D(h) + ε/2 + ε/2 = L_D(h) + ε.

Uniform convergence is sufficient for learnability
Definition (uniform convergence): H has the uniform convergence property if there exists a function m_H^UC : (0, 1)² → N such that for every ε, δ ∈ (0, 1), every m ≥ m_H^UC(ε, δ), and every distribution D,
D^m({S ∈ Z^m : S is ε-representative}) ≥ 1 − δ.
Corollary: If H has the uniform convergence property with a function m_H^UC, then H is agnostically PAC learnable with sample complexity m_H(ε, δ) ≤ m_H^UC(ε/2, δ). Furthermore, in that case the ERM_H paradigm is a successful agnostic PAC learner for H.

Finite classes are agnostic PAC learnable
We will prove the following:
Theorem: Assume H is finite and the range of the loss function is [0, 1]. Then H is agnostically PAC learnable using the ERM_H algorithm with sample complexity
m_H(ε, δ) ≤ 2 log(2|H|/δ) / ε².
Proof: It suffices to show that H has the uniform convergence property with
m_H^UC(ε, δ) ≤ log(2|H|/δ) / (2ε²).

Proof (cont.)
To show uniform convergence, we need: D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) < δ.
Using the union bound:
D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) = D^m(∪_{h∈H} {S : |L_S(h) − L_D(h)| > ε}) ≤ Σ_{h∈H} D^m({S : |L_S(h) − L_D(h)| > ε}).

Proof (cont.)
Recall: L_D(h) = E_{z~D}[ℓ(h, z)] and L_S(h) = (1/m) Σ_{i=1}^m ℓ(h, z_i).
Denote θ_i = ℓ(h, z_i). Then, for all i, E[θ_i] = L_D(h).
Lemma (Hoeffding's inequality): Let θ_1, ..., θ_m be a sequence of i.i.d. random variables and assume that for all i, E[θ_i] = μ and P[a ≤ θ_i ≤ b] = 1. Then, for any ε > 0,
P[ |(1/m) Σ_{i=1}^m θ_i − μ| > ε ] ≤ 2 exp(−2mε² / (b − a)²).
Since the loss takes values in [0, 1], this implies: D^m({S : |L_S(h) − L_D(h)| > ε}) ≤ 2 exp(−2mε²).
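As a sanity check (not part of the lecture), the following Python sketch estimates the left-hand side of Hoeffding's inequality by simulation for a [0, 1]-valued loss and compares it with the 2·exp(−2mε²) bound; all names and constants here are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    m, eps, trials = 100, 0.1, 20000
    mu = 0.3  # true mean of the bounded loss

    # Draw `trials` independent samples of size m from a Bernoulli(mu) loss
    losses = rng.binomial(1, mu, size=(trials, m))
    deviations = np.abs(losses.mean(axis=1) - mu)

    empirical = (deviations > eps).mean()          # estimate of P[|L_S - L_D| > eps]
    hoeffding = 2 * np.exp(-2 * m * eps ** 2)      # the bound (b - a = 1 here)
    print(empirical, "<=", hoeffding)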

Proof (cont.)
We have shown: D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) ≤ 2|H| exp(−2mε²).
So, if m ≥ log(2|H|/δ) / (2ε²), then the right-hand side is at most δ, as required.

The discretization trick
- Suppose H is parameterized by d numbers.
- Suppose we are happy with a representation of each number using b bits (say, b = 32).
- Then |H| ≤ 2^{db}, and so m_H(ε, δ) ≤ (2db + 2 log(2/δ)) / ε².
- While not very elegant, it's a great tool for upper bounding sample complexity.
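A tiny Python helper (my own illustration, not from the slides) that plugs numbers into these bounds shows how the sample complexity scales with |H|, ε, and δ:

    import math

    def finite_class_sample_complexity(size_H, eps, delta):
        # m_H(eps, delta) <= 2 log(2|H|/delta) / eps^2  (agnostic case, loss in [0, 1])
        return math.ceil(2 * math.log(2 * size_H / delta) / eps ** 2)

    def discretized_sample_complexity(d, b, eps, delta):
        # Discretization trick: |H| <= 2^(d*b)
        return finite_class_sample_complexity(2 ** (d * b), eps, delta)

    print(finite_class_sample_complexity(1000, 0.1, 0.01))   # a small explicit class
    print(discretized_sample_complexity(10, 32, 0.1, 0.01))  # d = 10 parameters, 32 bits each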

Linear Regression, Least Squares, and Polynomial Fitting

Linear regression
X = R^d, Y = R, H = {x ↦ ⟨w, x⟩ : w ∈ R^d}.
Example: d = 1, predicting the weight of a child based on its age.
[Figure: scatter plot of child weight (kg) versus age (years)]

The squared loss
- The zero-one loss doesn't make sense in regression.
- Squared loss: ℓ(h, (x, y)) = (h(x) − y)².
- The ERM problem: min_{w∈R^d} (1/m) Σ_{i=1}^m (⟨w, x_i⟩ − y_i)².
- Equivalently, if X is the matrix whose i-th column is x_i and y is the vector whose i-th entry is y_i, then the problem is min_{w∈R^d} ‖X^⊤ w − y‖².

Background: gradient and optimization
- Given a function f : R → R, its derivative is f′(x) = lim_{Δ→0} (f(x + Δ) − f(x)) / Δ.
- If x minimizes f(x), then f′(x) = 0.
- Now take f : R^d → R. Its gradient is a d-dimensional vector, ∇f(x), where the i-th coordinate of ∇f(x) is the derivative of the scalar function g(a) = f((x_1, ..., x_{i−1}, x_i + a, x_{i+1}, ..., x_d)).
- The derivative of g is called a partial derivative of f.
- If x minimizes f(x), then ∇f(x) = (0, ..., 0).

Background: Jacobian and the chain rule
- The Jacobian of f : R^n → R^m at x ∈ R^n, denoted J_x(f), is the m × n matrix whose (i, j) element is the partial derivative of f_i : R^n → R with respect to its j-th variable at x.
- Note: if m = 1, then J_x(f) = ∇f(x) (as a row vector).
- Example: if f(w) = Aw for A ∈ R^{m×n}, then J_w(f) = A.
- Chain rule: given f : R^n → R^m and g : R^k → R^n, the Jacobian of the composition (f ∘ g) : R^k → R^m at x is J_x(f ∘ g) = J_{g(x)}(f) J_x(g).
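The chain rule is easy to sanity-check numerically. The sketch below is my own illustration (arbitrary small matrices, not from the slides): it compares the analytic gradient of f(g(w)) = ½‖Aw − b‖², obtained via the chain rule, with a finite-difference estimate.

    import numpy as np

    rng = np.random.default_rng(1)
    A, b = rng.normal(size=(5, 3)), rng.normal(size=5)
    w = rng.normal(size=3)

    def loss(w):
        return 0.5 * np.sum((A @ w - b) ** 2)

    # Chain rule: J_w(f∘g) = J_{g(w)}(f) J_w(g) = (A w - b)^T A
    analytic_grad = (A @ w - b) @ A

    # Central finite-difference gradient, coordinate by coordinate
    h = 1e-6
    numeric_grad = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h)
                             for e in np.eye(3)])
    print(np.allclose(analytic_grad, numeric_grad, atol=1e-5))  # True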

Least squares
Recall that we'd like to solve the ERM problem: min_{w∈R^d} ½‖X^⊤ w − y‖².
- Let g(w) = X^⊤ w − y and f(v) = ½‖v‖² = ½ Σ_{i=1}^m v_i².
- Then we need to solve min_w f(g(w)).
- Note that J_w(g) = X^⊤ and J_v(f) = (v_1, ..., v_m).
- Using the chain rule: J_w(f ∘ g) = J_{g(w)}(f) J_w(g) = g(w)^⊤ X^⊤ = (X^⊤ w − y)^⊤ X^⊤.
- Requiring that J_w(f ∘ g) = (0, ..., 0) yields (X^⊤ w − y)^⊤ X^⊤ = 0, i.e. X X^⊤ w = X y.
- This is a linear system of equations. If X X^⊤ is invertible, the solution is w = (X X^⊤)^{-1} X y.

Least squares
- What if X X^⊤ is not invertible? In the exercise you'll see that there is always a solution to this linear system, obtained using the pseudo-inverse.
- Non-rigorous trick to help remember the formula: we want X^⊤ w ≈ y; multiply both sides by X to obtain X X^⊤ w ≈ X y; multiply both sides by (X X^⊤)^{-1} to obtain the formula w = (X X^⊤)^{-1} X y.
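A minimal NumPy sketch of the closed form, under the slide's convention that the i-th column of X is x_i (the data here is synthetic and illustrative), including the pseudo-inverse fallback:

    import numpy as np

    rng = np.random.default_rng(0)
    d, m = 3, 50
    X = rng.normal(size=(d, m))              # i-th column is x_i
    w_true = np.array([1.0, -2.0, 0.5])
    y = X.T @ w_true + 0.1 * rng.normal(size=m)

    # Normal equations: X X^T w = X y
    w_hat = np.linalg.solve(X @ X.T, X @ y)

    # Pseudo-inverse handles the case where X X^T is singular
    w_pinv = np.linalg.pinv(X @ X.T) @ (X @ y)

    print(w_hat, w_pinv)  # both close to w_true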

Least squares: interpretation as projection
- Recall, we try to minimize ‖X^⊤ w − y‖.
- The set C = {X^⊤ w : w ∈ R^d} ⊆ R^m is a linear subspace: the range of X^⊤.
- Therefore, if w is the least squares solution, then the vector ŷ = X^⊤ w is the vector in C which is closest to y. This is called the projection of y onto C.
- We can find ŷ by taking V to be an m × d matrix whose columns are an orthonormal basis of the range of X^⊤, and then setting ŷ = V V^⊤ y.
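The projection interpretation can be verified numerically; this is a small sketch of my own (random data, reduced QR used only to get an orthonormal basis) checking that V V^⊤ y coincides with the least-squares predictions X^⊤ w.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 50))                     # columns are the examples x_i
    y = rng.normal(size=50)

    w_ls = np.linalg.pinv(X @ X.T) @ (X @ y)         # least-squares solution
    V, _ = np.linalg.qr(X.T)                         # orthonormal basis of {X^T w : w in R^d}
    print(np.allclose(V @ (V.T @ y), X.T @ w_ls))    # projection equals X^T w_ls -> True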

Polynomial fitting
- Sometimes linear predictors are not expressive enough for our data.
- We will show how to fit a polynomial to the data using linear regression.

Polynomial fitting
- A one-dimensional polynomial function of degree n: p(x) = a_0 + a_1 x + a_2 x² + ... + a_n x^n.
- Goal: given data S = ((x_1, y_1), ..., (x_m, y_m)), find the ERM with respect to the class of polynomials of degree n.
- Reduction to linear regression:
  - Define ψ : R → R^{n+1} by ψ(x) = (1, x, x², ..., x^n).
  - Define a = (a_0, a_1, ..., a_n) and observe: p(x) = Σ_{i=0}^n a_i x^i = ⟨a, ψ(x)⟩.
  - To find a, we can solve least squares with respect to ((ψ(x_1), y_1), ..., (ψ(x_m), y_m)).
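A short Python sketch of this reduction (synthetic data and illustrative names of my own): build the feature map ψ and reuse a least-squares solver.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=40)
    y = 1.0 - 2.0 * x + 0.5 * x ** 3 + 0.05 * rng.normal(size=40)

    n = 3
    Psi = np.vander(x, n + 1, increasing=True)   # row i is psi(x_i) = (1, x_i, ..., x_i^n)
    a, *_ = np.linalg.lstsq(Psi, y, rcond=None)  # ERM over degree-n polynomials
    print(a)                                     # approximately (1, -2, 0, 0.5)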

The Bias-Complexity Tradeoff and Error Decomposition

Error decomposition
Let h_S = ERM_H(S). We can decompose the risk of h_S as: L_D(h_S) = ε_app + ε_est.
- The approximation error, ε_app = min_{h∈H} L_D(h): how much risk we have due to restricting to H. It doesn't depend on S, and it decreases with the complexity (size, or VC dimension) of H.
- The estimation error, ε_est = L_D(h_S) − ε_app: the result of L_S being only an estimate of L_D. It decreases with the size of S and increases with the complexity of H.

Bias-complexity tradeoff
How to choose H?
[Figure: polynomial fits of degree 2, degree 3, and degree 10 to the same data]

Validation and Model Selection

Validation
- We have already learned some hypothesis h.
- Now we want to estimate how good h is.
- Simple solution: take a fresh i.i.d. sample V = (x_1, y_1), ..., (x_{m_v}, y_{m_v}) and output L_V(h) as an estimator of L_D(h).
- Using Hoeffding's inequality, if the range of ℓ is [0, 1], then with probability at least 1 − δ, |L_V(h) − L_D(h)| ≤ sqrt(log(2/δ) / (2 m_v)).
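For a feel of the numbers, here is a one-function Python illustration of my own that evaluates this validation error bar for a few validation-set sizes:

    import math

    def validation_error_bar(m_v, delta=0.05):
        # |L_V(h) - L_D(h)| <= sqrt(log(2/delta) / (2 m_v)) w.p. >= 1 - delta (loss in [0, 1])
        return math.sqrt(math.log(2 / delta) / (2 * m_v))

    for m_v in (100, 1000, 10000):
        print(m_v, round(validation_error_bar(m_v), 4))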

Validation for model selection
- Fit polynomials of degrees 2, 3, and 10 based on the black (training) points.
- The red points are validation examples.
- Choose the degree-3 polynomial, as it has minimal validation error.

Validation for model selection: analysis
- Let H = {h_1, ..., h_r} be the output predictors of applying ERM with respect to the different classes on S.
- Let V be a fresh validation set.
- Choose h* ∈ ERM_H(V).
- By our analysis of finite classes, L_D(h*) ≤ min_{h∈H} L_D(h) + sqrt(2 log(2|H|/δ) / |V|).

The model-selection curve
[Figure: training and validation error plotted against polynomial degree (2 to 10)]
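Such a curve can be reproduced with a few lines of Python. This is a sketch under my own synthetic-data assumptions, not the lecture's actual data: fit polynomials of increasing degree on a training set and evaluate the squared loss on a held-out validation set.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(m):
        x = rng.uniform(-1, 1, size=m)
        y = np.sin(3 * x) + 0.2 * rng.normal(size=m)   # unknown "true" regression function
        return x, y

    x_train, y_train = sample(30)
    x_val, y_val = sample(30)

    for degree in range(1, 11):
        Psi_train = np.vander(x_train, degree + 1, increasing=True)
        Psi_val = np.vander(x_val, degree + 1, increasing=True)
        a, *_ = np.linalg.lstsq(Psi_train, y_train, rcond=None)
        train_err = np.mean((Psi_train @ a - y_train) ** 2)
        val_err = np.mean((Psi_val @ a - y_val) ** 2)
        print(degree, round(train_err, 3), round(val_err, 3))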

Train-validation-test split
In practice, we usually have one pool of examples and split it into three sets:
- Training set: apply the learning algorithm with different parameters on the training set to produce H = {h_1, ..., h_r}.
- Validation set: choose h from H based on the validation set.
- Test set: estimate the error of h using the test set.

k-fold cross validation
The train-validation-test split is the best approach when data is plentiful. If data is scarce:

k-fold Cross Validation for Model Selection
input: training set S = (x_1, y_1), ..., (x_m, y_m),
       learning algorithm A, and a set of parameter values Θ
partition S into S_1, S_2, ..., S_k
foreach θ ∈ Θ
    for i = 1 ... k
        h_{i,θ} = A(S \ S_i ; θ)
    error(θ) = (1/k) Σ_{i=1}^k L_{S_i}(h_{i,θ})
output: θ* = argmin_θ [error(θ)],  h_{θ*} = A(S; θ*)
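As a runnable counterpart to the pseudocode, here is a minimal Python sketch; the learning algorithm, parameter grid, loss, and data are placeholders I chose for illustration, and any A(S; θ) and loss could be plugged in.

    import numpy as np

    def k_fold_cv(x, y, train, loss, thetas, k=5, seed=0):
        """Return the parameter with the smallest average validation loss over k folds."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(x))
        folds = np.array_split(idx, k)                     # partition S into S_1, ..., S_k
        errors = {}
        for theta in thetas:
            fold_errs = []
            for i in range(k):
                val = folds[i]
                tr = np.concatenate([folds[j] for j in range(k) if j != i])
                h = train(x[tr], y[tr], theta)             # h_{i,theta} = A(S \ S_i; theta)
                fold_errs.append(loss(h, x[val], y[val]))  # L_{S_i}(h_{i,theta})
            errors[theta] = np.mean(fold_errs)
        theta_star = min(errors, key=errors.get)
        return theta_star, train(x, y, theta_star)         # retrain on all of S with theta*

    # Example: choosing a polynomial degree with squared loss
    def train_poly(x, y, degree):
        Psi = np.vander(x, degree + 1, increasing=True)
        a, *_ = np.linalg.lstsq(Psi, y, rcond=None)
        return a

    def sq_loss(a, x, y):
        Psi = np.vander(x, len(a), increasing=True)
        return np.mean((Psi @ a - y) ** 2)

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, 60)
    y = np.sin(3 * x) + 0.2 * rng.normal(size=60)
    best_degree, h = k_fold_cv(x, y, train_poly, sq_loss, thetas=range(1, 9))
    print(best_degree)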

Summary
- The general PAC model: agnostic learning, general loss functions.
- Uniform convergence is sufficient for learnability; it holds for finite classes with a bounded loss.
- Least squares: linear regression and polynomial fitting.
- The bias-complexity tradeoff: approximation error vs. estimation error.
- Validation and model selection.