Lecture 3: Statistical Decision Theory (Part II)
Hao Helen Zhang
Outline of This Note

Part I: Statistical Decision Theory (Classical Statistical Perspective - Estimation):
- loss and risk
- MSE and bias-variance tradeoff
- Bayes risk and minimax risk

Part II (Machine Learning Perspective - Prediction):
- optimal learner
- empirical risk minimization
- restricted estimators
Supervised Learning

Any supervised learning problem has three components:
- Input vector $X \in \mathbb{R}^d$
- Output $Y$, either discrete or continuous valued
- Probability framework $(X, Y) \sim P(X, Y)$

The goal is to estimate the relationship between $X$ and $Y$, described by $f(X)$, for future prediction and decision-making.
- For regression, $f: \mathbb{R}^d \to \mathbb{R}$
- For $K$-class classification, $f: \mathbb{R}^d \to \{1, \dots, K\}$
Learning Loss Function

As in learning theory, we use a loss function $L$ to measure the discrepancy between $Y$ and $f(X)$ and to penalize errors in predicting $Y$. Examples:
- squared error loss (used in mean regression): $L(Y, f(X)) = (Y - f(X))^2$
- absolute error loss (used in median regression): $L(Y, f(X)) = |Y - f(X)|$
- 0-1 loss (used in binary classification): $L(Y, f(X)) = I(Y \neq f(X))$
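For concreteness, here is a minimal sketch of these three losses as vectorized functions; the function names are our own, not from the lecture:

```python
import numpy as np

def squared_error(y, fx):
    """Squared error loss, used in mean regression."""
    return (y - fx) ** 2

def absolute_error(y, fx):
    """Absolute error loss, used in median regression."""
    return np.abs(y - fx)

def zero_one(y, fx):
    """0-1 loss, used in classification: 1 for a mistake, 0 otherwise."""
    return np.not_equal(y, fx).astype(float)
```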
Risk, Expected Prediction Error (EPE)

The risk of $f$ is
$$R(f) = E_{X,Y}\, L(Y, f(X)) = \int L(y, f(x))\, dP(x, y).$$
In machine learning, $R(f)$ is called the Expected Prediction Error (EPE).
- For squared error loss, $R(f) = \mathrm{EPE}(f) = E_{X,Y}[Y - f(X)]^2$.
- For 0-1 loss, $R(f) = \mathrm{EPE}(f) = E_{X,Y}\, I(Y \neq f(X)) = \Pr(Y \neq f(X))$, which is the probability of making a mistake.
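When $P(X, Y)$ is known (or can be simulated from), the risk can be approximated by Monte Carlo. A minimal sketch, assuming a toy distribution that is not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy distribution: X ~ Uniform(-3, 3), Y = sin(X) + N(0, 0.3^2) noise,
# so f*(x) = sin(x) and the Bayes risk is 0.3^2 = 0.09.
def sample(n):
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0, 0.3, n)
    return x, y

def epe(f, n=100_000):
    """Monte Carlo estimate of R(f) = E[(Y - f(X))^2] under squared error loss."""
    x, y = sample(n)
    return np.mean((y - f(x)) ** 2)

print(epe(np.sin))            # ~ 0.09, the Bayes risk
print(epe(lambda x: 0 * x))   # larger: the constant-zero predictor
```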
Optimal Learner

We seek the $f$ that minimizes the average (long-run) loss over the entire population.
- Optimal Learner: the minimizer of $R(f)$ is
$$f^* = \arg\min_f R(f).$$
- The solution $f^*$ depends on the loss $L$.
- $f^*$ is not achievable without knowing $P(x, y)$, i.e., the entire population.
- Bayes Risk: the optimal risk, or the risk of the optimal learner,
$$R(f^*) = E_{X,Y}\, L(Y, f^*(X)),$$
which under squared error loss equals $E_{X,Y}[Y - f^*(X)]^2$.
How to Find the Optimal Learner

Note that
$$R(f) = E_{X,Y}\, L(Y, f(X)) = \int L(y, f(x))\, dP(x, y) = E_X\big[ E_{Y|X}\, L(Y, f(X)) \big].$$
One can therefore minimize the inner conditional expectation pointwise: for each fixed $x$, find $f^*(x)$ by solving
$$f^*(x) = \arg\min_c E_{Y|X=x}\, L(Y, c).$$
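The pointwise recipe can be carried out numerically when the conditional distribution is known. A sketch, assuming a hypothetical conditional model and a simple grid search over $c$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical conditional model: Y | X = x ~ N(sin(x), 0.3^2).
def sample_y_given_x(x, n=50_000):
    return np.sin(x) + rng.normal(0, 0.3, n)

def pointwise_minimizer(x, loss, grid=np.linspace(-2, 2, 401)):
    """Approximate f*(x) = argmin_c E[L(Y, c) | X = x] by a grid search."""
    y = sample_y_given_x(x)
    risks = [np.mean(loss(y, c)) for c in grid]
    return grid[int(np.argmin(risks))]

sq = lambda y, c: (y - c) ** 2
ab = lambda y, c: np.abs(y - c)

x0 = 1.0
print(pointwise_minimizer(x0, sq))  # ~ sin(1) = 0.84: the conditional mean
print(pointwise_minimizer(x0, ab))  # ~ the conditional median, also sin(1) here
```

Note how the recovered minimizer matches the loss: squared error yields the conditional mean, absolute error the conditional median, as the next slides state.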
Example 1: Regression with Squared Error Loss

For regression, consider $L(Y, f(X)) = (Y - f(X))^2$. The risk is
$$R(f) = E_{X,Y}(Y - f(X))^2 = E_X\, E_{Y|X}\big((Y - f(X))^2 \mid X\big).$$
For each fixed $x$, the best learner solves the problem
$$\min_c E_{Y|X}\big[(Y - c)^2 \mid X = x\big].$$
The solution is $E(Y \mid X = x)$, so
$$f^*(x) = E(Y \mid X = x) = \int y\, dP(y \mid x).$$
So the optimal learner is $f^*(X) = E(Y \mid X).$
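The slide states the solution; the standard one-step justification (a completion of the square, with the cross term vanishing because $E[Y - E(Y \mid X = x) \mid X = x] = 0$) is
$$E\big[(Y - c)^2 \mid X = x\big] = E\big[(Y - E(Y \mid X = x))^2 \mid X = x\big] + \big(E(Y \mid X = x) - c\big)^2,$$
which is minimized exactly at $c = E(Y \mid X = x)$.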
Decomposition of EPE Under Squared Error Loss

For any learner $f$, its EPE under squared error loss decomposes as
$$\begin{aligned}
\mathrm{EPE}(f) &= E_{X,Y}[Y - f(X)]^2 \\
&= E_X\, E_{Y|X}\big([Y - f(X)]^2 \mid X\big) \\
&= E_X\, E_{Y|X}\big([Y - E(Y|X) + E(Y|X) - f(X)]^2 \mid X\big) \\
&= E_X\, E_{Y|X}\big([Y - E(Y|X)]^2 \mid X\big) + E_X[f(X) - E(Y|X)]^2 \\
&= E_{X,Y}[Y - f^*(X)]^2 + E_X[f(X) - f^*(X)]^2 \\
&= \mathrm{EPE}(f^*) + \mathrm{MSE} \\
&= \text{Bayes Risk} + \mathrm{MSE}.
\end{aligned}$$
The Bayes risk (the gold standard) gives a lower bound for the risk of any learner.
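The decomposition can be checked numerically. A sketch under the same assumed sin-plus-noise model used above (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Assumed model: f*(x) = E[Y | X = x] = sin(x), noise sd 0.3, Bayes risk 0.09.
x = rng.uniform(-3, 3, n)
y = np.sin(x) + rng.normal(0, 0.3, n)

f = lambda t: 0.8 * t                        # an arbitrary competing learner
epe_f = np.mean((y - f(x)) ** 2)             # EPE(f)
bayes = np.mean((y - np.sin(x)) ** 2)        # EPE(f*), ~ 0.09
mse   = np.mean((f(x) - np.sin(x)) ** 2)     # E_X [f(X) - f*(X)]^2

print(epe_f, bayes + mse)  # the two numbers agree up to Monte Carlo error
```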
Example 2: Classification with 0-1 Loss

In binary classification, assume $Y \in \{0, 1\}$, so the learning rule is $f: \mathbb{R}^d \to \{0, 1\}$. Using the 0-1 loss $L(Y, f(X)) = I(Y \neq f(X))$, the expected prediction error is
$$\mathrm{EPE}(f) = E\, I(Y \neq f(X)) = \Pr(Y \neq f(X)) = E_X\big[P(Y = 1 \mid X)\, I(f(X) = 0) + P(Y = 0 \mid X)\, I(f(X) = 1)\big].$$
Bayes Classification Rule for Binary Problems

For each fixed $x$, the minimizer of $\mathrm{EPE}(f)$ is given by
$$f^*(x) = \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) > P(Y = 0 \mid X = x), \\ 0 & \text{if } P(Y = 1 \mid X = x) < P(Y = 0 \mid X = x). \end{cases}$$
It is called the Bayes classifier, denoted by $\phi_B$. The Bayes rule can be written as
$$\phi_B(x) = f^*(x) = I\big[P(Y = 1 \mid X = x) > 0.5\big].$$
To obtain the Bayes rule, we need to know $P(Y = 1 \mid X = x)$, or at least whether $P(Y = 1 \mid X = x)$ is larger than 0.5 or not.
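A minimal sketch of the Bayes rule in a setting where $P(Y = 1 \mid X = x)$ is available in closed form; the two-Gaussian model below is an assumption for illustration, not from the slides:

```python
import numpy as np
from scipy.stats import norm

# Assumed two-class model: equal priors, X | Y=1 ~ N(1, 1), X | Y=0 ~ N(-1, 1);
# P(Y = 1 | X = x) then follows from Bayes' theorem.
def p1_given_x(x):
    num = 0.5 * norm.pdf(x, loc=1.0)
    return num / (num + 0.5 * norm.pdf(x, loc=-1.0))

def bayes_classifier(x):
    """phi_B(x) = I[P(Y = 1 | X = x) > 0.5]."""
    return (p1_given_x(x) > 0.5).astype(int)

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 200_000)
x = rng.normal(np.where(y == 1, 1.0, -1.0), 1.0)
print(np.mean(bayes_classifier(x) != y))  # ~ Phi(-1) = 0.159, the Bayes error
```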
Probability Framework

Assume the training set $(X_1, Y_1), \dots, (X_n, Y_n) \overset{\text{i.i.d.}}{\sim} P(X, Y)$. We aim to find the best function (the optimal solution) from the model class $\mathcal{F}$ for decision making.
- However, the optimal learner $f^*$ is not attainable without knowledge of $P(x, y)$.
- The training samples $(x_1, y_1), \dots, (x_n, y_n)$ generated from $P(x, y)$ carry information about $P(x, y)$.
- We approximate the integral over $P(x, y)$ by the average over the training samples.
Empirical Risk Minimization

Key Idea: we cannot compute $R(f) = E_{X,Y}\, L(Y, f(X))$, but we can compute its empirical version from the data!
- Empirical Risk (also known as the training error):
$$R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$
- A reasonable estimator $\hat f$ should be close to $f^*$, or converge to $f^*$, as more data are collected.
- By the law of large numbers, $R_{\mathrm{emp}}(f) \to \mathrm{EPE}(f)$ as $n \to \infty$.
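A sketch of the empirical risk and its convergence to the true risk, again on the assumed sin-plus-noise model:

```python
import numpy as np

def empirical_risk(f, x, y, loss):
    """R_emp(f) = (1/n) * sum_i L(y_i, f(x_i))."""
    return np.mean(loss(y, f(x)))

# Check the law-of-large-numbers claim: at f = f* = sin, R_emp(f) approaches
# the Bayes risk 0.09 as n grows (model is an assumption, as before).
rng = np.random.default_rng(4)
sq = lambda y, fx: (y - fx) ** 2
for n in (100, 10_000, 1_000_000):
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0, 0.3, n)
    print(n, empirical_risk(np.sin, x, y, sq))
```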
Learning Based on ERM

We construct the estimator of $f^*$ by
$$\hat f = \arg\min_f R_{\mathrm{emp}}(f).$$
This principle is called Empirical Risk Minimization (ERM).
For regression, the empirical risk is (up to the factor $1/n$, which does not affect the minimizer) the Residual Sum of Squares (RSS):
$$R_{\mathrm{emp}}(f) \propto \mathrm{RSS}(f) = \sum_{i=1}^n [y_i - f(x_i)]^2.$$
ERM then gives the least squares estimator:
$$\hat f_{\mathrm{ols}} = \arg\min_f \mathrm{RSS}(f) = \arg\min_f \sum_{i=1}^n [y_i - f(x_i)]^2.$$
Issues with ERM

Minimizing $R_{\mathrm{emp}}(f)$ over all functions is not proper: any function passing through all $(x_i, y_i)$ has $\mathrm{RSS} = 0$. Example:
$$f_1(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i = 1, \dots, n, \\ 1 & \text{otherwise.} \end{cases}$$
It is easy to check that $\mathrm{RSS}(f_1) = 0$. However, $f_1$ does not amount to any form of learning: for future points not equal to any $x_i$, it always predicts $y = 1$. This phenomenon is called overfitting. Overfitting is undesirable, since a small $R_{\mathrm{emp}}(f)$ does not imply a small $R(f) = \int L(y, f(x))\, dP(x, y)$.
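The memorizer $f_1$ is easy to simulate; on the assumed toy model it has zero training RSS but a test risk far above the Bayes risk:

```python
import numpy as np

rng = np.random.default_rng(5)
x_tr = rng.uniform(-3, 3, 50)
y_tr = np.sin(x_tr) + rng.normal(0, 0.3, 50)

def f1(x):
    """The memorizer from the slide: y_i when x equals some training x_i, else 1."""
    x = np.atleast_1d(x)
    out = np.ones_like(x)
    for xi, yi in zip(x_tr, y_tr):
        out[x == xi] = yi
    return out

print(np.sum((y_tr - f1(x_tr)) ** 2))      # training RSS = 0 exactly

x_te = rng.uniform(-3, 3, 100_000)         # fresh test points
y_te = np.sin(x_te) + rng.normal(0, 0.3, 100_000)
print(np.mean((y_te - f1(x_te)) ** 2))     # far above the Bayes risk 0.09
```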
Class of Restricted Estimators

We need to restrict the estimation process to a set of functions $\mathcal{F}$, to control the complexity of the fitted functions and hence avoid overfitting.
- Choose a simple model class $\mathcal{F}$:
$$\hat f_{\mathrm{ols}} = \arg\min_{f \in \mathcal{F}} \mathrm{RSS}(f) = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^n [y_i - f(x_i)]^2.$$
- Or choose a large model class $\mathcal{F}$, but use a roughness penalty to control model complexity: find $f \in \mathcal{F}$ to minimize the penalized RSS,
$$\hat f = \arg\min_{f \in \mathcal{F}} \mathrm{RSS}(f) + \lambda J(f).$$
The solution is called a regularized empirical risk minimizer.
How should we choose the model class $\mathcal{F}$: a simple one or a complex one?
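One concrete regularized empirical risk minimizer is ridge regression, where $J(f) = \lVert \beta \rVert^2$ for linear $f$. A minimal sketch on made-up data (the setup is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 100, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(0, 1.0, n)

def ridge(X, y, lam):
    """Minimize RSS(beta) + lam * ||beta||^2, i.e. the penalty J(f) = ||beta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge(X, y, lam)[:4], 2))  # larger lam -> more shrinkage
```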
Parametric Learners

Some model classes can be parametrized as
$$\mathcal{F} = \{f(X; \alpha),\ \alpha \in \Lambda\},$$
where $\alpha$ is the index parameter for the class.
Parametric Model Class: the form of $f$ is known except for finitely many unknown parameters.
- Linear models (linear regression, linear SVM, LDA):
$$f(X; \beta) = \beta_0 + X^\top \beta_1, \quad \beta_1 \in \mathbb{R}^d,$$
$$f(X; \beta) = \beta_0 + \beta_1 \sin(X_1) + \beta_2 X_2^2, \quad X \in \mathbb{R}^2.$$
- Nonlinear models:
$$f(x; \beta) = \beta_1 x^{\beta_2},$$
$$f(X; \beta) = \beta_1 + \beta_2 \exp(\beta_3 X^{\beta_4}).$$
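A sketch of fitting a nonlinear parametric model by least squares, using the power model $f(x; \beta) = \beta_1 x^{\beta_2}$ from the slide on simulated data (the data-generating values are assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)

# Power model f(x; beta) = beta1 * x**beta2, with made-up true parameters.
f = lambda x, b1, b2: b1 * x ** b2
x = rng.uniform(0.5, 5.0, 200)
y = f(x, 2.0, 1.5) + rng.normal(0, 0.2, 200)

beta_hat, _ = curve_fit(f, x, y, p0=[1.0, 1.0])  # nonlinear least squares
print(beta_hat)  # ~ [2.0, 1.5]: only finitely many parameters to estimate
```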
Advantages of Linear Models

Why are linear models in such wide use?
- If the linear assumptions are correct, a linear model is more efficient than a nonparametric fit.
- Convenient and easy to fit.
- Easy to interpret.
- Provides the first-order Taylor approximation to $f(X)$.

Example: in linear regression, $\mathcal{F} = \{\text{all linear functions of } x\}$. ERM is the same as solving the ordinary least squares problem,
$$(\hat\beta_0^{\mathrm{ols}}, \hat\beta^{\mathrm{ols}}) = \arg\min_{(\beta_0, \beta)} \sum_{i=1}^n [y_i - \beta_0 - x_i^\top \beta]^2.$$
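A minimal OLS sketch on simulated data (assumed coefficients, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, n)

# OLS with an intercept column; lstsq solves the least squares problem directly.
A = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(beta_hat)  # ~ [1.0, 2.0, -1.0, 0.5]
```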
Drawbacks of Parametric Models

- We never know whether the linear (or other parametric) assumptions are proper or not.
- Not flexible in practice: the fit has all orders of derivatives and the same functional form everywhere.
- Individual observations can have a large influence on remote parts of the curve.
- The polynomial degree cannot be controlled continuously.
More Flexible Learners

Nonparametric Model Class $\mathcal{F}$: the functional form of $f$ is unspecified, and the parameter set could be infinite dimensional. Examples:
- $\mathcal{F} = \{\text{all continuous functions of } x\}$.
- $\mathcal{F}$ = the second-order Sobolev space over $[0, 1]$.
- $\mathcal{F}$ is the space spanned by a set of basis functions (dictionary functions):
$$\mathcal{F} = \Big\{f : f_\theta(x) = \sum_{m=1}^\infty \theta_m h_m(x)\Big\}.$$
- Kernel methods and local regression: $\mathcal{F}$ contains all piecewise constant functions, or all piecewise linear functions.
Examples of Nonparametric Methods

- For regression problems: splines, wavelets, kernel estimators, locally weighted polynomial regression, GAM, projection pursuit, regression trees, k-nearest neighbors (a minimal k-NN sketch follows this list).
- For classification problems: kernel support vector machines, k-nearest neighbors, classification trees, ...
- For density estimation: histograms, kernel density estimation.
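To make one entry on the list concrete, here is a minimal k-nearest-neighbors regression sketch on the assumed sin-plus-noise toy data; it is a local averaging method that assumes no functional form:

```python
import numpy as np

rng = np.random.default_rng(9)
x_tr = rng.uniform(-3, 3, 200)
y_tr = np.sin(x_tr) + rng.normal(0, 0.3, 200)

def knn_predict(x, k=10):
    """k-NN regression: average the y's of the k nearest training points."""
    x = np.atleast_1d(x)
    preds = np.empty(len(x))
    for j, xj in enumerate(x):
        idx = np.argsort(np.abs(x_tr - xj))[:k]
        preds[j] = y_tr[idx].mean()
    return preds

grid = np.linspace(-3, 3, 7)
print(np.round(knn_predict(grid), 2))   # tracks sin(x) with no assumed form
print(np.round(np.sin(grid), 2))
```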
Semiparametric Models

A compromise between linear models and nonparametric models:
- partially linear models, e.g., $Y = X^\top \beta + g(T) + \epsilon$ with the function $g$ left unspecified.
Nonparametric Models

Motivation: the underlying regression function may be so complicated that no reasonable parametric model would be adequate.
Advantages:
- Does NOT assume any specific functional form; assume less, thus discover more.
- "Infinite parameters" does not mean "no parameters".
- Local: relies more on the neighboring observations.
- Outliers and influential points do not unduly affect the results.
- Lets the data speak for themselves and show us the appropriate functional form!
Price we pay: more computational effort, lower efficiency, and a slower convergence rate (if a linear model happens to be reasonable).
Nonparametric vs Parametric Models

- They are not mutually exclusive competitors.
- A nonparametric regression estimate may suggest a simple parametric model.
- In practice, parametric models are preferred for simplicity.
- A linear regression line is infinitely smooth.

fit method    | # parameters | estimate | outliers
--------------|--------------|----------|----------
parametric    | finite       | global   | sensitive
nonparametric | infinite     | local    | robust
[Figure: four panels fitting the same data with a linear fit, a 2nd-order polynomial fit, a 3rd-order polynomial fit, and a smoothing spline (lambda = 0.85).]
Interpretation of Risk

Let $\mathcal{F}$ denote the restricted model space.
- Denote by $f^* = \arg\min_f \mathrm{EPE}(f)$ the optimal learner, where the minimization is taken over all measurable functions.
- Denote by $\tilde f = \arg\min_{f \in \mathcal{F}} \mathrm{EPE}(f)$ the best learner in the restricted space $\mathcal{F}$.
- Denote by $\hat f = \arg\min_{f \in \mathcal{F}} R_{\mathrm{emp}}(f)$ the learner based on limited data in the restricted space $\mathcal{F}$.
Then
$$\mathrm{EPE}(\hat f) - \mathrm{EPE}(f^*) = \big[\mathrm{EPE}(\hat f) - \mathrm{EPE}(\tilde f)\big] + \big[\mathrm{EPE}(\tilde f) - \mathrm{EPE}(f^*)\big] = \text{estimation error} + \text{approximation error}.$$
Estimation Error and Approximation Error

- The gap $\mathrm{EPE}(\hat f) - \mathrm{EPE}(\tilde f)$ is called the estimation error; it is due to the limited training data.
- The gap $\mathrm{EPE}(\tilde f) - \mathrm{EPE}(f^*)$ is called the approximation error. It is incurred by approximating $f^*$ with a restricted model space or via regularization, since it is possible that $f^* \notin \mathcal{F}$.
This decomposition is similar to the bias-variance trade-off, with
- estimation error playing the role of variance,
- approximation error playing the role of bias.
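Both gaps can be estimated by simulation. A sketch, under the assumed setup $f^*(x) = \sin(x)$ with noise sd 0.3 and the restricted class $\mathcal{F} = \{b_0 + b_1 x\}$ of lines (all choices here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(10)

def epe(f, n=200_000):
    """Monte Carlo EPE under the assumed sin + N(0, 0.3^2) model."""
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0, 0.3, n)
    return np.mean((y - f(x)) ** 2)

# f~ : best line in F, approximated by fitting on a very large sample.
xb = rng.uniform(-3, 3, 500_000)
yb = np.sin(xb) + rng.normal(0, 0.3, 500_000)
b1, b0 = np.polyfit(xb, yb, 1)
f_tilde = lambda x: b0 + b1 * x

# f^ : ERM over F on a small training set of 30 points.
x_tr = rng.uniform(-3, 3, 30)
y_tr = np.sin(x_tr) + rng.normal(0, 0.3, 30)
c1, c0 = np.polyfit(x_tr, y_tr, 1)
f_hat = lambda x: c0 + c1 * x

bayes = 0.3 ** 2
print("approximation error:", epe(f_tilde) - bayes)      # EPE(f~) - EPE(f*)
print("estimation error:   ", epe(f_hat) - epe(f_tilde)) # EPE(f^) - EPE(f~)
```

Enlarging $\mathcal{F}$ (say, to higher-degree polynomials) shrinks the approximation error but inflates the estimation error, mirroring the bias-variance trade-off above.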