Understanding Machine Learning: A Theory Perspective

Understanding Machine Learning: A Theory Perspective. Shai Ben-David, University of Waterloo. MLSS at MPI Tübingen, 2017.

Disclaimer / warning. This talk is NOT about how cool machine learning is. I am sure you are already convinced of that. I am NOT going to show any videos of amazing applications of ML. You will hear a lot about the great applications of ML throughout this MLSS. I wish to focus on understanding the principles underlying Machine Learning.

A high-level view of (statistical) machine learning: "The purpose of science is to find meaningful simplicity in the midst of disorderly complexity." (Herbert Simon)

More concretely:
- Statistical learning is concerned with algorithms that detect meaningful regularities in large, complex data sets.
- We focus on data that is too complex for humans to figure out its meaningful regularities.
- We consider the task of finding such regularities from random samples of the data population.

A typical setting:
- Imagine you want a computer program to help decide which email messages are spam and which are important.
- We might represent each message by features (e.g., return address, keywords, spelling, etc.).
- Train on a sample S of emails, labeled according to whether or not they are spam.
- The goal of the algorithm is to produce a good prediction rule (program) to classify future emails.

The concept learning setting: the learner is given some training sample, e.g., a table of example emails, each described by its features and labeled spam / not spam.

The learner's output: a classification rule. Given the data, some reasonable rules might be:
- Predict SPAM if [unknown AND (sex OR sales)].
- Predict SPAM if [sales + sex − known > 0].
- ...
These kinds of tasks are called classification/prediction.

Some typical classification/prediction tasks:
- Medical diagnosis (patient info → high/low risk).
- Sequence-based classification of proteins.
- Detection of fraudulent use of credit cards.
- Stock market prediction (today's news → tomorrow's market trend).

The formal setup (for label prediction tasks):
- Domain set X.
- Label set Y (often {0,1}).
- Training data S = ((x_1, y_1), ..., (x_m, y_m)).
- Learner's output: a hypothesis h : X → Y.

Data generation and measures of success:
- An unknown distribution D generates instances (x_1, x_2, ...) independently.
- An unknown function f : X → Y labels them.
- The error of a classifier h is the probability (over D) that it fails: L_D(h) = Pr_{x~D}[h(x) ≠ f(x)].

Empirical Risk Minimization (ERM). Given a labeled sample S = ((x_1, y_1), ..., (x_m, y_m)) and some candidate classifier h, define the empirical error of h as L_S(h) = |{i : h(x_i) ≠ f(x_i)}| / m (the proportion of sample points on which h errs). ERM: output an h that minimizes L_S(h).
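To make the ERM rule concrete, here is a minimal sketch (my own illustration, not part of the talk) of ERM over a small, hypothetical finite class of threshold classifiers on the real line; the class, data, and labeling rule are made up for the example.

```python
import numpy as np

def empirical_error(h, X, y):
    """L_S(h): the fraction of sample points on which h disagrees with the labels."""
    return np.mean([h(x) != yi for x, yi in zip(X, y)])

def erm(hypothesis_class, X, y):
    """ERM_H(S): return a hypothesis in H that minimizes the empirical error on S."""
    return min(hypothesis_class, key=lambda h: empirical_error(h, X, y))

# Hypothetical finite class H: threshold classifiers x -> 1[x >= t] over a grid of thresholds.
H = [lambda x, t=t: int(x >= t) for t in np.linspace(0.0, 1.0, 21)]

# Toy sample labeled by a rule the learner does not know: f(x) = 1[x >= 0.35].
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=50)
y = (X >= 0.35).astype(int)

h_hat = erm(H, X, y)
print("empirical error of the ERM output:", empirical_error(h_hat, X, y))
```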

Not so simple: the risk of overfitting.
- Given any training sample S = ((x_1, y_1), ..., (x_m, y_m)), let h(x) = y_i if x = x_i for some i ≤ m, and h(x) = 0 for any other x.
- Clearly L_S(h) = 0.
- It is also pretty obvious that in many cases this h has a high error probability.
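A tiny simulation (again my own, with made-up data) of this memorizing predictor: it achieves zero empirical error while performing at roughly chance level on fresh instances from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setting: uniform instances on [0, 1], labeled by an unknown rule f.
f = lambda x: int(x >= 0.5)
X_train = rng.uniform(0.0, 1.0, size=30)
y_train = np.array([f(x) for x in X_train])

# The memorizing classifier: return the stored label if x was seen, otherwise predict 0.
memory = dict(zip(X_train, y_train))
h = lambda x: memory.get(x, 0)

train_error = np.mean([h(x) != yi for x, yi in zip(X_train, y_train)])

X_fresh = rng.uniform(0.0, 1.0, size=10_000)  # fresh points are almost surely unseen
true_error = np.mean([h(x) != f(x) for x in X_fresh])

print("empirical error L_S(h):", train_error)  # 0.0
print("estimated true error:", true_error)     # roughly 0.5
```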

The missing component: learners need some prior knowledge.

How is learning handled in nature (1)? Bait Shyness in rats

Successful animal learning. The bait-shyness phenomenon in rats: when rats encounter poisoned food, they very quickly learn the causal relationship between the taste and smell of the food and the sickness that follows a few hours later.

How is learning handled in nature (2)? Pigeon Superstition (Skinner 1948)

What is the source of the difference? In what way are the rats smarter than the pigeons?

Bait shyness and inductive bias. Garcia et al. (1989): replace the stimulus associated with the poisonous baits by a sound played when the rats taste the food (rather than a slightly different taste or smell). How well do the rats pick up the relation between the bait and the sickness in this experiment?

Surprisingly (?) The rats fail to detect the association! They do not refrain from eating when the same warning sound occurs again.

What about "improved" rats?
- Why aren't there rats that also pay attention to the noise when they are eating?
- And to light, temperature, time-of-day, and so on?
- Wouldn't such improved rats survive better?

Second thoughts about our improved rats:
- But then every tasting of food will be an outlier in some respect.
- How will they know which tasting should be blamed for the sickness?

The basic No-Free-Lunch principle: no learning is possible without applying prior knowledge. (We will phrase and prove a precise statement later.)

First type of prior knowledge: hypothesis classes.
- A hypothesis class H is a set of hypotheses.
- We re-define the ERM rule by searching only inside such a prescribed H.
- ERM_H(S) picks a classifier h in H that minimizes the empirical error over members of H.

Our first theorem. Theorem (guaranteed success for ERM_H): Let H be a finite class, and assume further that the unknown labeling rule f is a member of H. Then for every ε, δ > 0, if m > (log(|H|) + log(1/δ)) / ε, then with probability > 1 − δ (over the choice of S) any ERM_H(S) hypothesis has error below ε.
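As a quick worked example (added for illustration; the class size and parameters are hypothetical), the bound can be evaluated directly:

```python
import math

def realizable_sample_complexity(class_size, eps, delta):
    """Sample size from the theorem above: m > (log|H| + log(1/delta)) / eps."""
    return math.ceil((math.log(class_size) + math.log(1.0 / delta)) / eps)

# E.g., for |H| = 1000 hypotheses, accuracy eps = 0.05, and confidence delta = 0.01:
print(realizable_sample_complexity(1000, eps=0.05, delta=0.01))  # 231
```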

Proof. All we need are two basic probability rules:
1) The probability of the AND of independent events is the product of their probabilities.
2) The union bound: the probability of the OR of any events is at most the sum of their probabilities.

Not only finite classes. The same holds, for example, for the class H of all intervals over the real line. (We will see a proof of that in the afternoon.)

A formal definition of learnability. H is PAC learnable if there is a function m_H : (0,1)^2 → N and a learning algorithm A such that for every distribution D over X, every ε, δ > 0, and every f in H, for samples S of size m > m_H(ε, δ) generated by D and labeled by f, Pr[L_D(A(S)) > ε] < δ.

More realistic setups: relaxing the realizability assumption. We wish to model scenarios in which the learner does not have a priori knowledge of a class to which the true classifier belongs. Furthermore, often the labels are not fully determined by the instance attributes.

General loss functions. Our learning formalism applies well beyond counting classification errors. Let Z be any domain set, and let ℓ : H × Z → R quantify the loss of a model h on an instance z. Given a probability distribution P over Z, let L_P(h) = E_{z~P}[ℓ(h, z)].

Examples of such losses:
- The 0-1 classification loss: ℓ(h, (x, y)) = 0 if h(x) = y and 1 otherwise.
- The regression squared loss: ℓ(h, (x, y)) = (y − h(x))^2.
- The k-means clustering loss: ℓ((c_1, ..., c_k), z) = min_i ||c_i − z||^2.
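For concreteness, these three losses can be written as short functions (my own sketch; the hypotheses, centers, and points below are placeholders):

```python
import numpy as np

def zero_one_loss(h, x, y):
    """0-1 classification loss: 1 if h misclassifies (x, y), else 0."""
    return 0 if h(x) == y else 1

def squared_loss(h, x, y):
    """Regression squared loss: (y - h(x))^2."""
    return (y - h(x)) ** 2

def kmeans_loss(centers, z):
    """k-means clustering loss: squared distance from z to its nearest center."""
    centers = np.asarray(centers, dtype=float)
    z = np.asarray(z, dtype=float)
    return float(np.min(np.sum((centers - z) ** 2, axis=1)))

# Toy usage with made-up hypotheses and points.
print(zero_one_loss(lambda x: int(x > 0), x=-1.0, y=1))   # 1
print(squared_loss(lambda x: 2 * x, x=1.5, y=2.0))        # 1.0
print(kmeans_loss(centers=[[0, 0], [3, 3]], z=[1, 0]))    # 1.0
```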

Agnostic PAC learnability. H is agnostic PAC learnable if there is a function m_H : (0,1)^2 → N and a learning algorithm A such that for every distribution P over X × Y and every ε, δ > 0, for samples S of size m > m_H(ε, δ) generated by P, Pr[L_P(A(S)) > inf_{h in H} L_P(h) + ε] < δ.

Note the different philosophy: rather than making an absolute statement that is guaranteed to hold only when certain assumptions are met (like realizability), we provide a weaker, relative guarantee that is guaranteed to always hold.

General empirical loss. For any loss ℓ : H × Z → R as above and a finite sample S = (z_1, ..., z_m), define the empirical loss w.r.t. S as L_S(h) = (1/m) Σ_{i=1}^{m} ℓ(h, z_i).

Representative samples. We say that a sample S is ε-representative of a class H w.r.t. a distribution P if for every h in H, |L_S(h) − L_P(h)| < ε.

Representative samples and ERM. Why care about representative samples? If S is ε-representative of a class H w.r.t. a distribution P, then for every ERM_H(S) classifier h, L_P(h) < inf_{h' in H} L_P(h') + 2ε.
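The chain of inequalities behind this claim (the standard argument, written out here for completeness): for the ERM output h and any h' in H,

```latex
\begin{align*}
L_P(h) &\le L_S(h) + \epsilon   && \text{($S$ is $\epsilon$-representative)} \\
       &\le L_S(h') + \epsilon  && \text{($h$ minimizes the empirical loss over $H$)} \\
       &\le L_P(h') + 2\epsilon && \text{($S$ is $\epsilon$-representative again),}
\end{align*}
```

and taking the infimum over h' in H gives the stated bound.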

Uniform convergence property. We say that a class H has the uniform convergence property if there is a function m_H : (0,1)^2 → N such that for every distribution P over Z and every ε, δ > 0, with probability > 1 − δ, samples S of size m > m_H(ε, δ) generated by P are ε-representative of H w.r.t. P.

Learning via Uniform Convergence Corollary: If a class H has the uniform convergence property then it is agnostic PAC learnable, and any ERM algorithm is a successful PAC learner for it.

Finite classes enjoy uniform convergence.
- Theorem: If H is finite, then it has the uniform convergence property.
- Proof idea: Hoeffding's inequality implies uniform convergence for each single h, and then the union bound handles the full class.
- Corollary: Finite classes are PAC learnable, and ERM suffices for that.
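As a numerical illustration of this proof idea (entirely my own; the class, distribution, and parameters are made up), one can estimate by simulation how often the maximal deviation over a small finite class of threshold classifiers exceeds ε, and check that it stays below the union-bound guarantee 2|H|exp(−2mε²):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical class: 21 threshold classifiers on [0, 1]; true labels from f(x) = 1[x >= 0.35].
thresholds = np.linspace(0.0, 1.0, 21)
f_threshold = 0.35
m, eps, trials = 200, 0.1, 2000

def true_error(t):
    # For x ~ Uniform[0, 1], h_t and f disagree exactly on the interval between t and 0.35.
    return abs(t - f_threshold)

violations = 0
for _ in range(trials):
    X = rng.uniform(0.0, 1.0, size=m)
    y = (X >= f_threshold).astype(int)
    # Empirical errors of all hypotheses at once (shape: |H| x m, averaged over the sample).
    emp = np.mean((X[None, :] >= thresholds[:, None]).astype(int) != y[None, :], axis=1)
    max_dev = np.max(np.abs(emp - np.array([true_error(t) for t in thresholds])))
    violations += (max_dev > eps)

print("empirical Pr[some h deviates by > eps]:", violations / trials)
print("union-bound guarantee 2|H|exp(-2 m eps^2):", 2 * len(thresholds) * np.exp(-2 * m * eps**2))
```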

Do we need the restriction to an H?
- Note that agnostic PAC learning requires only relative-to-H accuracy: L_D(A(S)) < inf_{h in H} L_D(h) + ε.
- Why restrict to some H? Can we have a universal learner, capable of competing with every function?

For the proof we need Hoeffding's inequality: let θ_1, ..., θ_m be independent random variables over [0,1] with a common expectation μ; then Pr[|(1/m) Σ_{i=1}^{m} θ_i − μ| > ε] < 2 exp(−2mε^2). We will apply it with θ_i = ℓ(h, z_i).
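Applying it with θ_i = ℓ(h, z_i) for each fixed h, and then taking the union bound over the finitely many members of H (the standard calculation, written out here for completeness), gives

```latex
\Pr\Big[\exists h \in H:\ \big|L_S(h) - L_P(h)\big| > \epsilon\Big]
\;\le\; 2\,|H|\,\exp(-2m\epsilon^2) \;\le\; \delta
\qquad \text{whenever} \qquad
m \;\ge\; \frac{\ln\!\big(2|H|/\delta\big)}{2\epsilon^{2}},
```

so m_H(ε, δ) = ⌈ln(2|H|/δ) / (2ε²)⌉ witnesses the uniform convergence property of finite classes claimed above.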

The No-Free-Lunch theorem. Let A be any learning algorithm over some domain set X, and let m < |X|/2. Then there is a distribution P over X × {0,1} and an f : X → {0,1} such that 1) L_P(f) = 0, and 2) for P-samples S of size m, with probability > 1/7, L_P(A(S)) > 1/8.

The bias-complexity tradeoff. Corollary: any class of infinite VC dimension is not PAC learnable, so we cannot learn w.r.t. the universal H. For every learner, L_D(A(S)) can be viewed as the sum of the approximation error inf_{h in H} L_D(h) and the estimation (generalization) error, the ε.
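Written out explicitly (a standard way to present the decomposition, added here for clarity):

```latex
L_D(A(S)) \;=\;
\underbrace{\inf_{h \in H} L_D(h)}_{\text{approximation error}}
\;+\;
\underbrace{\Big(L_D(A(S)) - \inf_{h \in H} L_D(h)\Big)}_{\text{estimation (generalization) error}} .
```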

Want to know more? Understanding Machine Learning: From Theory to Algorithms, by Shai Shalev-Shwartz and Shai Ben-David (Cambridge University Press, May 2014; ISBN-13: 978-1107057135). Book description: Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and non-expert readers in statistics, computer science, mathematics, and engineering.