CS281B/Stat241B. Statistical Learning Theory. Lecture 1.


CS281B/Stat241B. Statistical Learning Theory. Lecture 1. Peter Bartlett

1. Organizational issues.
2. Overview.
3. Probabilistic formulation of prediction problems.
4. Game theoretic formulation of prediction problems.

Organizational Issues

Lectures: Tue/Thu 12:30-2:00, 334 Evans.
Peter Bartlett. bartlett@cs. Office hours: Mon 11-12 (Sutardja Dai Hall), Thu 2-3 (Evans 399).
GSI: Alan Malek. malek@berkeley. Office hours: TBA.
Web site: see http://www.stat.berkeley.edu/~bartlett/courses. Check it for details of office hours, the syllabus, assignments, readings, lecture notes, and announcements.
No text. See the website for readings.

Organizational Issues

Assessment:
Homework Assignments (50%): posted on the website (approximately one every two weeks).
Final Project (50%): proposals due March 13; report due May 2.

Required background: CS281A/Stat241A/Stat205A/Stat210A.

Overview

Theoretical analysis of prediction methods:
1. Probabilistic formulation of prediction problems
2. Risk bounds
3. Game theoretic formulation of prediction problems
4. Regret bounds
5. Algorithms: (a) kernel methods, (b) boosting algorithms
6. Model selection

Probabilistic Formulations of Prediction Problems

Aim: Predict an outcome y from some set Y of possible outcomes, on the basis of some observation x from a feature space X.

Some examples:
x: words in a document → y: topic (sports, music, tech, ...)
x: image of a digit in a zipcode → y: the digit
x: email message → y: spam or ham
x: sentence → y: correct parse tree
x: patient medical test results → y: patient disease state
x: gene expression levels of a tissue sample → y: presence of cancer

Probabilistic Formulations of Prediction Problems

x: phylogenetic profile of a gene (i.e., its relationship to the genomes of other species) → y: gene function
x: image of a signature on a check → y: identity of the writer
x: web search query → y: ranked list of pages

Use a data set of n pairs, (x_1, y_1), ..., (x_n, y_n), to choose a function f : X → Y so that, for subsequent (x, y) pairs, f(x) is a good prediction of y.

Probabilistic Formulations of Prediction Problems

To define the notion of a good prediction, we can define a loss function l : Y × Y → R. So l(ŷ, y) quantifies the cost of predicting ŷ when the true outcome is y. Then the aim is to ensure that l(f(x), y) is small.

Probabilistic Formulations of Prediction Problems

Example: In pattern classification problems, the aim is to classify a pattern x into one of a finite number of classes (that is, the label space Y is finite). If all mistakes are equally bad, we could define

l(ŷ, y) = 1[ŷ ≠ y] = 1 if ŷ ≠ y, 0 otherwise.

Example: In a regression problem, with Y = R, we might choose the quadratic loss function, l(ŷ, y) = (ŷ − y)².
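Both losses are one-liners; a minimal sketch (the spam/ham labels and numeric values are just illustrative):

```python
def zero_one_loss(y_hat, y):
    """l(y_hat, y) = 1[y_hat != y]: cost 1 for any mistake, 0 otherwise."""
    return 1.0 if y_hat != y else 0.0

def quadratic_loss(y_hat, y):
    """l(y_hat, y) = (y_hat - y)^2, for regression with Y = R."""
    return (y_hat - y) ** 2

print(zero_one_loss("spam", "ham"))   # a mistake costs 1.0
print(quadratic_loss(1.5, 2.0))       # (1.5 - 2.0)^2 = 0.25
```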

Probabilistic Assumptions

Assume:
There is a probability distribution P on X × Y.
The pairs (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are chosen independently according to P.

The aim is to choose f with small risk: R(f) = E l(f(X), Y). For instance, in the pattern classification example, this is the misclassification probability: R(f) = E 1[f(X) ≠ Y] = Pr(f(X) ≠ Y).
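Under the i.i.d. assumption, the risk can be approximated by an empirical average over fresh draws. A small sketch with an invented distribution P and an invented rule f; by construction Pr(f(X) ≠ Y) = 0.1 here:

```python
import random

random.seed(0)

def f(x):
    # A hypothetical classification rule: predict 1 when the feature exceeds 1/2.
    return 1 if x > 0.5 else 0

def sample_pair():
    # Invented P on X x Y: X uniform on [0, 1]; Y agrees with 1[X > 1/2]
    # with probability 0.9 and is flipped otherwise.
    x = random.random()
    label = 1 if x > 0.5 else 0
    y = label if random.random() < 0.9 else 1 - label
    return x, y

# Empirical risk under 0-1 loss over n i.i.d. draws; by the law of large
# numbers this approximates R(f) = Pr(f(X) != Y), which is 0.1 here.
n = 100_000
risk = sum(f(x) != y for x, y in (sample_pair() for _ in range(n))) / n
print(round(risk, 2))
```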

Probabilistic Assumptions

Some things to notice:
1. Capital letters denote random variables.
2. The distribution P can be viewed as modelling both the relative frequency of the different features or covariates X and the conditional distribution of the outcome Y given X.
3. The assumption that the data are i.i.d. is a strong one, but we need to assume something about what the information in the data (x_1, y_1), ..., (x_n, y_n) tells us about (X, Y).

Probabilistic Assumptions

4. The function x ↦ f_n(x) = f_n(x; X_1, Y_1, ..., X_n, Y_n) is random, since it depends on the random data D_n = (X_1, Y_1, ..., X_n, Y_n). Thus, the risk

R(f_n) = E[l(f_n(X), Y) | D_n] = E[l(f_n(X; X_1, Y_1, ..., X_n, Y_n), Y) | D_n]

is a random variable. We might aim for E R(f_n) small, or for R(f_n) small with high probability (over the training data).

Key Questions

We might choose f_n from some class F of functions (for instance, linear functions, sparse linear functions, decision trees, neural networks, kernel machines). There are several questions that we are interested in:
1. Can we design algorithms for which f_n is close to the best that we could hope for, given that it was chosen from F? (That is, is R(f_n) − inf_{f ∈ F} R(f) small?)
2. How does the performance of f_n depend on n? On the complexity of F? On P?
3. Can we ensure that R(f_n) approaches the best possible performance (that is, the infimum over all f of R(f))?

Statistical Learning Theory vs Classical Statistics

In this course, we are concerned with results that apply to large classes of distributions P, such as the set of all joint distributions on X × Y. In contrast to parametric problems, we will not (often) assume that P comes from a small (e.g., finite-dimensional) space, P ∈ {P_θ : θ ∈ Θ}. Since we make few assumptions on P, and we are concerned with high-dimensional data, the goal is typically to ensure that the performance is close to the best we can achieve using prediction rules from some fixed class F.

Key Issues

Several key issues arise in designing a prediction method for these problems:

Approximation: How good is the best f in the class F that we are using? That is, how close to inf_f R(f) is inf_{f ∈ F} R(f)?
Estimation: How close is our performance to that of the best f in F? (Recall that we only have access to the distribution P through observing a finite data set.)
Computation: We need to use the data to choose f_n, typically by solving some kind of optimization problem. How can we do that efficiently?

Key Issues

We will not spend much time on approximation properties, beyond observing some universality results (that particular classes can achieve zero approximation error). (But for complex problems and simple, hence statistically feasible, function classes, this is not a very interesting property.) We will focus on the estimation issue. We will take the approach that efficiency of computation is a constraint. Indeed, the methods that we spend most of our time studying involve convex optimization problems (e.g., kernel methods involve solving a quadratic program, and boosting algorithms involve minimizing a convex criterion over a convex set).

More General Probabilistic Formulation

We can consider a decision-theoretic formulation. We have:
1. An outcome space Z.
2. A prediction strategy S : Z^n → A.
3. A loss function l : A × Z → R.

Protocol: See outcomes Z_1, ..., Z_n, i.i.d. from an unknown P on Z. Choose action a = S(Z_1, ..., Z_n) ∈ A. Incur risk E l(a, Z).

The aim is to minimize the excess risk, compared to the best decision:

E[l(S(Z_1, ..., Z_n), Z) | Z_1^n] − inf_{a ∈ A} E l(a, Z).

More General Probabilistic Formulation

Example: In pattern classification problems, Z = X × Y, A ⊆ Y^X, and l(f, (x, y)) = 1[f(x) ≠ y].

Example: In density estimation problems, Z = R^d (or some measurable space), A is a set of measurable functions on Z (densities with respect to a reference measure on Z), and l(a, z) = −log a(z). In this case, if the distribution P has a density in A, the excess risk of a density a is the KL-divergence between P and a.
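The density estimation identity can be checked numerically on a small discrete outcome space; the two distributions below are invented for illustration:

```python
import math

# Invented distributions on a 3-point outcome space: p is the truth,
# a is a candidate density from the action set A.
p = [0.5, 0.3, 0.2]
a = [0.4, 0.4, 0.2]

def log_loss_risk(q):
    """Risk E[-log q(Z)] under Z ~ P, for a density q."""
    return sum(pz * -math.log(qz) for pz, qz in zip(p, q))

# Excess risk of a over the best density (which is p itself) ...
excess = log_loss_risk(a) - log_loss_risk(p)

# ... equals the KL divergence KL(P || a) = sum_z p(z) log(p(z)/a(z)).
kl = sum(pz * math.log(pz / az) for pz, az in zip(p, a))
print(abs(excess - kl) < 1e-12)
```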

Game Theoretic Formulation

At each round t, the decision method plays a_t ∈ A; then the world reveals z_t ∈ Z, and we incur the loss l(a_t, z_t). Cumulative loss:

L̂_n = ∑_{t=1}^n l(a_t, z_t).

Aim to minimize regret, that is, perform well compared to the best (in retrospect) fixed action from some class:

regret = ∑_{t=1}^n l(a_t, z_t) − inf_{a ∈ A} ∑_{t=1}^n l(a, z_t) = L̂_n − L*_n.

The data can be adversarially chosen.
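Computing the regret after the fact only requires the played losses and a minimization over the comparator class. A minimal sketch with an invented two-action game under 0-1 loss:

```python
# Played actions and revealed outcomes over n = 5 rounds (invented data).
actions = ["a", "b", "a", "a", "b"]
outcomes = ["b", "b", "a", "b", "b"]
action_set = ["a", "b"]

def loss(a, z):
    """0-1 loss: l(a, z) = 1[a != z]."""
    return 1 if a != z else 0

# Cumulative loss of the played sequence: L_hat = sum_t l(a_t, z_t).
L_hat = sum(loss(a, z) for a, z in zip(actions, outcomes))

# Best fixed action in hindsight: L_star = min_a sum_t l(a, z_t).
L_star = min(sum(loss(a, z) for z in outcomes) for a in action_set)

print(L_hat, L_star, L_hat - L_star)  # regret = L_hat - L_star
```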

Game Theoretic Formulation: Motivation

1. It is the appropriate formulation for online/sequential prediction problems.
2. The adversarial model is often appropriate (e.g., in computer security, computational finance).
3. The adversarial model assumes little: it is often straightforward to convert a strategy for an adversarial environment into a method for a probabilistic environment.
4. Studying the adversarial model can reveal the deterministic core of a statistical problem: there are strong similarities between the performance guarantees in the two cases.
5. There are significant overlaps in the design of methods for the two problems: regularization plays a central role, and methods often have a natural interpretation as Bayesian methods.

Examples

Example: In an online pattern classification problem (like spam classification), Z = X × Y, A ⊆ Y^X, and l(f, (x, y)) = 1[f(x) ≠ y]. The action is a classification rule, and the regret indicates how close the spam misclassification rate is to the best performance possible in retrospect on the particular email sequence.

Example: Portfolio Optimization

Aim to choose a portfolio (a distribution over financial instruments) to maximize utility. Other market players can profit from making our decisions bad ones; for example, if our trades have a market impact, someone can front-run (trade ahead of us). The decision method's action a_t is a distribution on the m instruments, a_t ∈ Δ_m = {a ∈ [0,1]^m : ∑_i a_i = 1}. The outcome z_t is the vector of relative price increases, z_t ∈ R_+^m; the ith component is the ratio of the price of instrument i at time t to its price at the previous time. The loss l might be the negative logarithm of the portfolio's increase, l(a_t, z_t) = −log(a_t · z_t).

Example: Portfolio Optimization

We might compare our performance to the best stock (the distribution is a delta function), or to a set of indices (distributions corresponding to the Dow Jones Industrial Average, etc.), or to the set of all distributions. The regret is then the log of the ratio of the maximum value the portfolio would have at the end (for the best mixture choice) to the final portfolio value:

∑_{t=1}^n l(a_t, z_t) − min_{a ∈ A} ∑_{t=1}^n l(a, z_t) = max_{a ∈ A} ∑_{t=1}^n log(a · z_t) − ∑_{t=1}^n log(a_t · z_t),

since a · z_t is the relative increase in capital under action a.
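A small sketch of this regret computation against the delta-function comparators (the best single stock); the price relatives and the uniform rebalancing strategy are invented for illustration:

```python
import math

# Invented relative price increases z_t for m = 2 instruments over n = 3
# rounds: each entry is price(t) / price(t-1) for that instrument.
z = [(1.10, 0.95), (1.05, 1.20), (0.90, 1.10)]

# Played portfolios a_t: here the uniform portfolio, rebalanced every round.
a_played = [(0.5, 0.5)] * len(z)

def loss(a, zt):
    """l(a, z_t) = -log(a . z_t), the negative log of the growth factor."""
    return -math.log(sum(ai * zi for ai, zi in zip(a, zt)))

L_hat = sum(loss(a, zt) for a, zt in zip(a_played, z))

# Comparators: delta functions on each instrument, i.e., the best single stock.
deltas = [(1.0, 0.0), (0.0, 1.0)]
L_star = min(sum(loss(a, zt) for zt in z) for a in deltas)

# Regret = log(best final wealth / our final wealth).
print(round(L_hat - L_star, 4))
```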

Key Questions

We are often interested in the minimax regret, which is the value of the game:

min_{a_1} max_{z_1} ... min_{a_n} max_{z_n} ( ∑_{t=1}^n l(a_t, z_t) − min_{a ∈ A} ∑_{t=1}^n l(a, z_t) ).

1. How does the performance (minimax regret) depend on n? On the complexity of A (and Z)?
2. Can we design computationally efficient strategies that (almost) achieve the minimax regret?
3. What if the strategy has limited information? (e.g., auctions, bandits)
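For tiny horizons the value of the game can be computed exactly by backward induction over the game tree. A brute-force sketch with an invented game: two actions, two outcomes, 0-1 loss, horizon n = 3, and deterministic play:

```python
# Brute-force minimax regret of a tiny online game: the player moves first
# each round, then the adversary picks the outcome having seen the action.
A = ("a", "b")   # action set (invented)
Z = ("a", "b")   # outcome set (invented)
N = 3            # horizon

def loss(a, z):
    """0-1 loss: l(a, z) = 1[a != z]."""
    return 1 if a != z else 0

def value(history):
    """Minimax value of the remaining game; history holds (a_t, z_t) pairs."""
    if len(history) == N:
        incurred = sum(loss(a, z) for a, z in history)
        best_fixed = min(sum(loss(a, z) for _, z in history) for a in A)
        return incurred - best_fixed
    # Player minimizes over actions; the adversary then maximizes over outcomes.
    return min(max(value(history + ((a, z),)) for z in Z) for a in A)

print(value(()))  # the minimax regret of this particular game
```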

Overview: probabilistic and game-theoretic formulations

Decision-theoretic formulation: for an outcome z and an action a, incur the loss l(a, z).

Probabilistic:
Data Z_1, ..., Z_n, Z are i.i.d.
Use the data to choose a ∈ A.
Aim to minimize the excess risk, E l(a, Z) − inf_{a ∈ A} E l(a, Z).

Overview: probabilistic and game-theoretic formulations

Online: Arbitrary (even adversarial) choice of data. A sequential game: at round t,
Choose a_t.
See z_t.
Incur loss l(a_t, z_t).

Aim to minimize regret (excess cumulative loss):

∑_t l(a_t, z_t) − inf_{a ∈ A} ∑_t l(a, z_t).