Machine Learning Theory (CS 6783)


Machine Learning Theory (CS 6783)
Tu-Th 1:25 to 2:40 PM, Kimball B-11
Instructor: Karthik Sridharan

ABOUT THE COURSE
No exams!
5 assignments that count towards your grade (55%)
One term project (40%)
5% for class participation

PRE-REQUISITES
Basic probability theory
Basics of algorithms and analysis
An introductory-level machine learning course
Mathematical maturity; comfortable reading and writing formal mathematical proofs

TERM PROJECT
One of the following three options:
1. Pick your own research problem, get it approved by me, and write a report on your work
2. Pick two papers on learning theory, get them approved by me, and write a report with your own views/opinions
3. I will provide a list of problems; work out problems worth a total of 10 stars from this list
Oct 16th: submit your proposal / get your project approved by me
Finals week: projects are due

Let's get started...

WHAT IS MACHINE LEARNING? Use past observations to automatically learn to make better predictions/decisions in the future.

WHERE IS IT USED? Recommendation Systems

WHERE IS IT USED? Pedestrian Detection

WHERE IS IT USED? Market Predictions

WHERE IS IT USED? Spam Classification

WHERE IS IT USED?
Online advertising (improving click-through rates)
Climate/weather prediction
Text categorization
Unsupervised clustering (of articles, ...)
...

WHAT IS LEARNING THEORY?

WHAT IS LEARNING THEORY? Oops...

WHAT IS MACHINE LEARNING THEORY?
How do we formalize machine learning problems?
The right framework for the right problem (e.g., online vs. statistical)
What does it mean for a problem to be learnable?
How many instances do we need to see to learn to a given accuracy?
How do we build sound learning algorithms based on theory?
Computational learning theory: which problems are efficiently learnable?

OUTLINE OF TOPICS
Learning problems and frameworks, settings, minimax rates
Statistical learning theory:
  Probably Approximately Correct (PAC) and agnostic PAC frameworks
  Empirical risk minimization, uniform convergence, empirical process theory
  Finite model classes, MDL bounds, PAC-Bayes theorem
  Infinite model classes, Rademacher complexity
  Binary classification: growth function, VC dimension
  Real-valued function classes: covering numbers, chaining, fat-shattering dimension
  Supervised learning: necessary and sufficient conditions for learnability
Online learning theory:
  Sequential minimax and the value of the online learning game
  Martingale uniform convergence, sequential empirical process theory
  Sequential Rademacher complexity
  Binary classification: Littlestone dimension
  Real-valued function classes: sequential covering numbers, chaining bounds, sequential fat-shattering dimension
  Online supervised learning: necessary and sufficient conditions for learnability
  Designing learning algorithms: relaxations, random play-outs
Computational learning theory, and more if time permits...

LEARNING PROBLEM: BASIC NOTATION
Input space/feature space X (e.g., bag-of-words, n-grams, vector of grey-scale values, user-movie pair to rate). Feature extraction is an art, ... an art we won't cover in this course.
Output space/label space Y (e.g., {±1}, [K], real-valued output, structured output)
Loss function ℓ : Y × Y → R (e.g., 0-1 loss ℓ(ŷ, y) = 1{ŷ ≠ y}, squared loss ℓ(ŷ, y) = (ŷ − y)², absolute loss ℓ(ŷ, y) = |ŷ − y|). Measures performance/cost per instance (inaccuracy of prediction / cost of decision).
Model class/hypothesis class F ⊆ Y^X (e.g., F = {x ↦ ⟨f, x⟩ : ‖f‖₂ ≤ 1}, F = {x ↦ sign(⟨f, x⟩)})
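
As a concrete illustration of this notation (a sketch of my own, not from the slides), the example losses and the linear class F = {x ↦ ⟨f, x⟩ : ‖f‖₂ ≤ 1} can be written in a few lines of Python:

import numpy as np

# The three example losses from the slide
zero_one_loss = lambda y_hat, y: float(y_hat != y)    # 0-1 loss
squared_loss  = lambda y_hat, y: (y_hat - y) ** 2     # squared loss
absolute_loss = lambda y_hat, y: abs(y_hat - y)       # absolute loss

def linear_predictor(f):
    """One member of F = {x -> <f, x> : ||f||_2 <= 1} (illustrative)."""
    f = np.asarray(f, dtype=float)
    assert np.linalg.norm(f) <= 1 + 1e-9, "f must lie in the unit ball"
    return lambda x: float(np.dot(f, x))

# x -> sign(<f, x>), a member of the corresponding classification class
sign_predictor = lambda f: (lambda x: 1 if np.dot(f, x) >= 0 else -1)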

FORMALIZING LEARNING PROBLEMS
How is the data generated?
How do we measure performance or success?
Where do we place our prior/model assumptions?
What do we observe?

PROBABLY APPROXIMATELY CORRECT LEARNING
Y = {±1}, ℓ(ŷ, y) = 1{ŷ ≠ y}, F ⊆ Y^X
Learner only observes the training sample S = {(x_1, y_1), ..., (x_n, y_n)}, where x_1, ..., x_n ~ D_X and, for every t ∈ [n], y_t = f(x_t) for some f ∈ F
Goal: find ŷ ∈ Y^X that minimizes P_{x ~ D_X}(ŷ(x) ≠ f(x)) (either in expectation or with high probability)

PROBABLY APPROXIMATELY CORRECT LEARNING
Definition: Given δ > 0, ɛ > 0, the sample complexity n(ɛ, δ) is the smallest n such that we can always find a forecaster ŷ such that, with probability at least 1 − δ, P_{x ~ D_X}(ŷ(x) ≠ f(x)) ≤ ɛ.
(Efficiently PAC learnable if we can learn efficiently in 1/δ and 1/ɛ.)
E.g.: learning outputs of deterministic systems
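
To make the definition concrete, here is a small Python simulation of my own (the threshold class, the grid, and D_X = Uniform[0, 1] are illustrative assumptions, not from the lecture): in this realizable setting, any hypothesis consistent with the sample has an error that shrinks as n grows.

import numpy as np

rng = np.random.default_rng(0)

# Finite class of thresholds on [0, 1]: f_theta(x) = +1 if x >= theta else -1.
thresholds = np.linspace(0.0, 1.0, 101)
theta_star = thresholds[37]                        # the unknown f in F
f_star = lambda x: np.where(x >= theta_star, 1, -1)

def consistent_erm(xs, ys):
    """Return any threshold consistent with the sample (realizable-case ERM)."""
    for theta in thresholds:
        if np.all(np.where(xs >= theta, 1, -1) == ys):
            return theta
    raise RuntimeError("no consistent hypothesis (cannot happen when f is in F)")

x_test = rng.uniform(0, 1, size=100_000)           # fresh draws to estimate the error
for n in (10, 100, 1000):
    xs = rng.uniform(0, 1, size=n)                 # x_1, ..., x_n ~ D_X
    theta_hat = consistent_erm(xs, f_star(xs))     # labels y_t = f(x_t)
    err = np.mean(np.where(x_test >= theta_hat, 1, -1) != f_star(x_test))
    print(f"n = {n:4d}   estimated P(yhat(x) != f(x)) = {err:.4f}")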

NON-PARAMETRIC REGRESSION
Y ⊆ R, ℓ(ŷ, y) = (ŷ − y)², F ⊆ Y^X
Learner only observes the training sample S = {(x_1, y_1), ..., (x_n, y_n)}, where x_1, ..., x_n ~ D_X and, for every t ∈ [n], y_t = f(x_t) + ε_t with f ∈ F and ε_t ~ N(0, σ)
Goal: find ŷ ∈ R^X that minimizes
‖ŷ − f‖²_{L₂(D_X)} = E_{x ~ D_X}[(ŷ(x) − f(x))²] = E_{(x,y)}[(ŷ(x) − y)²] − inf_{f' ∈ F} E_{(x,y)}[(f'(x) − y)²]
(either in expectation or with high probability)
E.g.: clinical trials (inference problems); model class known.
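
A tiny simulation (the target f, the polynomial forecaster, and all constants are illustrative assumptions, not from the lecture) makes the L₂(D_X) objective concrete:

import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the setting above: f(x) = sin(2*pi*x) on [0, 1], Gaussian noise,
# and a degree-3 polynomial fit as one simple choice of forecaster yhat.
n, sigma = 200, 0.3
f_true = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, size=n)                          # x_t ~ D_X
y_train = f_true(x_train) + sigma * rng.standard_normal(n)   # y_t = f(x_t) + eps_t

coeffs = np.polyfit(x_train, y_train, deg=3)
y_hat = lambda x: np.polyval(coeffs, x)

# Monte Carlo estimate of ||yhat - f||^2_{L2(D_X)} = E_{x ~ D_X}[(yhat(x) - f(x))^2]
x_test = rng.uniform(0, 1, size=100_000)
print(np.mean((y_hat(x_test) - f_true(x_test)) ** 2))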

STATISTICAL LEARNING (AGNOSTIC PAC)
Learner only observes the training sample S = {(x_1, y_1), ..., (x_n, y_n)}, drawn i.i.d. from a joint distribution D on X × Y
Goal: find ŷ ∈ Y^X that makes the expected loss over future instances small:
E_{(x,y)~D}[ℓ(ŷ(x), y)] − inf_{f ∈ F} E_{(x,y)~D}[ℓ(f(x), y)] ≤ ɛ,   i.e.,   L_D(ŷ) − inf_{f ∈ F} L_D(f) ≤ ɛ

STATISTICAL LEARNING (AGNOSTIC PAC)
Definition: Given δ > 0, ɛ > 0, the sample complexity n(ɛ, δ) is the smallest n such that we can always find a forecaster ŷ such that, with probability at least 1 − δ, L_D(ŷ) − inf_{f ∈ F} L_D(f) ≤ ɛ.
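
A minimal sketch of empirical risk minimization over a finite class, the first algorithm this framework suggests; the function names are my own, and the bound quoted in the comment is the standard Hoeffding-plus-union-bound result rather than anything proved on this slide.

import numpy as np

def empirical_risk(f, loss, sample):
    """L_S(f) = (1/n) * sum_t loss(f(x_t), y_t)."""
    return float(np.mean([loss(f(x), y) for x, y in sample]))

def erm(model_class, loss, sample):
    """Empirical risk minimization over a finite model class F (a sketch)."""
    return min(model_class, key=lambda f: empirical_risk(f, loss, sample))

# For a finite F and a loss bounded in [0, 1], Hoeffding's inequality plus a union
# bound give, with probability at least 1 - delta,
#   L_D(erm) <= inf_{f in F} L_D(f) + 2 * sqrt(log(2|F|/delta) / (2n)),
# so the agnostic PAC sample complexity of ERM is O(log(|F|/delta) / eps^2).
# (Standard fact, included only as a pointer to what comes later in the course.)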

LEARNING PROBLEMS
Pedestrian Detection (Batch/Statistical setting)
Spam Classification (Online/adversarial setting)

ONLINE LEARNING (SEQUENTIAL PREDICTION)
For t = 1 to n:
  Learner receives x_t ∈ X
  Learner predicts output ŷ_t ∈ Y
  True output y_t ∈ Y is revealed
End for
Goal: minimize the regret
Reg_n(F) = (1/n) Σ_{t=1}^{n} ℓ(ŷ_t, y_t) − inf_{f ∈ F} (1/n) Σ_{t=1}^{n} ℓ(f(x_t), y_t)
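
The protocol translates almost line-for-line into code. The sketch below (the names and the follow-the-leader baseline are my own choices, not course algorithms) runs the loop on a fixed sequence and reports the average regret against the best fixed f ∈ F.

import numpy as np

def average_regret(learner, model_class, rounds, loss):
    """Run the online protocol on rounds = [(x_1, y_1), ..., (x_n, y_n)] and
    return Reg_n(F). `learner(history, x)` must predict before y_t is revealed."""
    history, learner_loss = [], 0.0
    comparator_loss = np.zeros(len(model_class))
    for x, y in rounds:
        y_hat = learner(history, x)                        # learner predicts
        learner_loss += loss(y_hat, y)                     # true y_t revealed, loss paid
        comparator_loss += [loss(f(x), y) for f in model_class]
        history.append((x, y))
    n = len(rounds)
    return learner_loss / n - comparator_loss.min() / n    # Reg_n(F)

def follow_the_leader(model_class, loss):
    """A simple baseline learner: play the empirically best f on the history so far."""
    def learner(history, x):
        if not history:
            return model_class[0](x)
        cumulative = [sum(loss(f(xp), yp) for xp, yp in history) for f in model_class]
        return model_class[int(np.argmin(cumulative))](x)
    return learner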

OTHER PROBLEMS/FRAMEWORKS
Unsupervised learning, clustering
Semi-supervised learning
Active learning and selective sampling
Online convex optimization
Bandit problems, partial monitoring, ...

SNEAK PEEK
No free lunch theorems
Statistical learning theory
Empirical risk minimization
Uniform convergence and learning
Finite model classes, MDL, PAC-Bayes theorem, ...

HOMEWORK 0: WARMUP
Brush up on Markov's inequality, Chebyshev's inequality, and the central limit theorem
Read up on or brush up on concentration inequalities (specifically the Hoeffding bound, the Bernstein bound, the Hoeffding-Azuma inequality, and McDiarmid's inequality, also referred to as the bounded difference inequality)
Brush up on the union bound
Watch out for Homework 0; no need to submit it, it is just a warmup
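
In the spirit of this warmup, a short (purely illustrative) Python check compares the empirical tail probability of a bounded sample mean against the Hoeffding bound; all constants below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

# Hoeffding's inequality for i.i.d. X_i in [0, 1]:
#   P(|mean_n - E[X]| >= eps) <= 2 * exp(-2 * n * eps^2)
n, eps, trials = 200, 0.1, 100_000
samples = rng.uniform(0, 1, size=(trials, n))              # E[X] = 1/2
deviations = np.abs(samples.mean(axis=1) - 0.5)
empirical_tail = np.mean(deviations >= eps)
hoeffding_bound = 2 * np.exp(-2 * n * eps ** 2)
print(f"empirical: {empirical_tail:.4f}   Hoeffding bound: {hoeffding_bound:.4f}")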