Statistical Machine Learning Notes 9
Instructor: Justin Domke

Boosting

1 Additive Models

The basic idea of boosting is to greedily build additive models. Let b_m(x) be some predictor (a tree, a neural network, whatever), which we will call a base learner. In boosting, we will build a model that is the sum of base learners,

    f(x) = Σ_{m=1}^{M} b_m(x).

The obvious way to fit an additive model is greedily. Namely, we start with the simple function f_0(x) = 0, then iteratively add base learners to minimize the risk of f_{m-1}(x) + b_m(x).

Forward stagewise additive modeling

    f_0(x) ← 0
    For m = 1, ..., M
        b_m ← argmin_b Σ_{(x̂,ŷ)} L( f_{m-1}(x̂) + b(x̂), ŷ )
        f_m(x) ← f_{m-1}(x) + v b_m(x)

Notice here that once we have fit a particular base learner, it is frozen in, and is not further changed. One can also create procedures for additive modeling that re-fit the base learners several times.

2 Additive Regression Trees

It is quite easy to do forward stagewise additive modeling when we are interested in minimizing the least-squares regression loss:
    argmin_b Σ_{(x̂,ŷ)} L( f_{m-1}(x̂) + b(x̂), ŷ )
      = argmin_b Σ_{(x̂,ŷ)} ( f_{m-1}(x̂) + b(x̂) − ŷ )²
      = argmin_b Σ_{(x̂,ŷ)} ( b(x̂) − r(x̂, ŷ) )²                       (2.1)

where r(x̂, ŷ) = ŷ − f_{m-1}(x̂). However, minimizing a problem like Eq. 2.1 is exactly what all our previous regression methods do. Thus, we can apply any method for least-squares regression essentially unchanged. The only difference is that the target values are given by ŷ − f_{m-1}(x̂), the difference between ŷ and the additive model so far.
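To make this concrete, here is a minimal Python sketch of forward stagewise least-squares boosting, assuming scikit-learn's DecisionTreeRegressor as the two-level regression tree base learner; the function names and default step size are illustrative, not taken from these notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_additive_regression_trees(X, y, M=100, v=1.0, max_depth=2):
    """Forward stagewise additive modeling with the least-squares loss.

    Each round fits a small regression tree to the residuals
    r = y - f_{m-1}(x) and adds v times that tree to the model.
    """
    f = np.zeros(len(y))              # predictions f_{m-1}(x) on the training set
    base_learners = []
    for m in range(M):
        r = y - f                     # residual targets
        b = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        f += v * b.predict(X)         # f_m(x) = f_{m-1}(x) + v * b_m(x)
        base_learners.append(b)

    def predict(X_new):
        """Evaluate f(x) = sum_m v * b_m(x)."""
        return v * sum(b.predict(X_new) for b in base_learners)

    return predict
```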
Here are some examples applying this to two datasets. Here (as in all the examples in these notes) our base learner is a regression tree with two levels of splits.

[Figure: forward stagewise regression trees on the first dataset. Panels show the data and the fitted model after increasing numbers of base learners.]
[Figure: forward stagewise regression trees on the second dataset. Panels show the data and the fitted model after increasing numbers of base learners.]

3 Boosting

Suppose we want to minimize some other loss function. Unfortunately, if we aren't using the least-squares loss, the optimization
    argmin_b Σ_{(x̂,ŷ)} L( f_{m-1}(x̂) + b(x̂), ŷ )                      (3.1)

will not reduce to a least-squares problem. One alternative would be to use a different base learner, one that fits a different loss. With boosting, we don't want to change our base learner. This could be because it is simply inconvenient to create a new algorithm for a different loss function. The general idea of boosting is to fit a new base learner that, if it does not minimize the risk for f_m, at least decreases it. As we will see, we can do this for a large range of loss functions, all based on a least-squares base learner.

4 Gradient Boosting

Gradient boosting is a very clever algorithm. In each iteration, we set as target values the negative gradient of the loss with respect to f. So, if the base learner closely matches the target values, then when we add some multiple v of the base learner to our additive model, it should decrease the loss.

Gradient Boosting

    f_0(x) ← 0
    For m = 1, ..., M
        For all (x̂, ŷ): r(x̂, ŷ) ← −(d/df) L(f_{m-1}(x̂), ŷ)
        b_m ← argmin_b Σ_{(x̂,ŷ)} ( b(x̂) − r(x̂, ŷ) )²
        f_m(x) ← f_{m-1}(x) + v b_m(x)

Consider changing f(x) to f(x) + v b(x) for some small v. We want to minimize the loss, i.e.

    L( f_{m-1}(x̂) + b(x̂), ŷ ) ≈ L( f_{m-1}(x̂), ŷ ) + (dL(f_{m-1}(x̂), ŷ)/df) b(x̂).      (4.1)

However, it doesn't make much sense to simply minimize the right-hand side of Eq. 4.1, since we can always make the decrease bigger by multiplying b by a larger constant. Instead, we choose to minimize
    Σ_{(x̂,ŷ)} [ (dL(f_{m-1}(x̂), ŷ)/df) b(x̂) + ½ b(x̂)² ],

which prevents b from becoming too large. It is easy to see that

    argmin_b Σ_{(x̂,ŷ)} [ (dL(f_{m-1}(x̂), ŷ)/df) b(x̂) + ½ b(x̂)² ]
      = argmin_b Σ_{(x̂,ŷ)} ( b(x̂) + dL(f_{m-1}(x̂), ŷ)/df )²
      = argmin_b Σ_{(x̂,ŷ)} ( b(x̂) − r(x̂, ŷ) )².

Some example loss functions, and their derivatives:

    least-squares:             L = (f − ŷ)²,           dL/df = 2(f − ŷ)
    least-absolute-deviation:  L = |f − ŷ|,            dL/df = sign(f − ŷ)
    logistic regression:       L = log(1 + exp(−ŷf)),  dL/df = −ŷ / (1 + exp(fŷ))
    exponential loss:          L = exp(−ŷf),           dL/df = −ŷ exp(−ŷf)

Fun fact: gradient boosting with a step size of v = 1/2 on the least-squares loss gives exactly the forward stagewise modeling algorithm we devised above.

Here are some examples of gradient boosting on classification losses (logistic and exponential). The figures show the function f(x). As usual, we would get a predicted class by taking the sign of f(x).
[Figure: Gradient boosting, logistic loss, step size v = 1. Panels show the data and f(x) after increasing numbers of base learners.]
[Figure: Gradient boosting, exponential loss, step size v = 1. Panels show the data and f(x) after increasing numbers of base learners.]
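A minimal sketch of the gradient boosting loop, together with the loss derivatives listed above, assuming scikit-learn's DecisionTreeRegressor as the two-level tree base learner and labels ŷ ∈ {−1, +1} for the classification losses; the function names and defaults are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# dL/df for each loss listed above; y holds the targets
# (in {-1,+1} for the classification losses), f the current predictions.
def grad_least_squares(f, y):
    return 2.0 * (f - y)

def grad_least_absolute_deviation(f, y):
    return np.sign(f - y)

def grad_logistic(f, y):
    return -y / (1.0 + np.exp(f * y))

def grad_exponential(f, y):
    return -y * np.exp(-y * f)

def gradient_boost(X, y, grad_loss, M=100, v=1.0, max_depth=2):
    """Gradient boosting: each round fits a small regression tree to the
    negative gradient of the loss at the current predictions."""
    f = np.zeros(len(y))
    base_learners = []
    for m in range(M):
        r = -grad_loss(f, y)                  # pseudo-residual targets
        b = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        f += v * b.predict(X)                 # f_m = f_{m-1} + v * b_m
        base_learners.append(b)

    def predict(X_new):
        return v * sum(b.predict(X_new) for b in base_learners)

    return predict
```

For example, gradient_boost(X, y, grad_logistic, v=1.0) would fit the sort of logistic-loss model shown in the figures above, with predicted classes given by the sign of the returned predictions.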
5 Newton Boosting

The basic idea of Newton boosting is, instead of just taking the gradient, to build a local quadratic model, and then to minimize it. We can again do this using any least-squares regressor as a base learner. (Note that "Newton boosting" is not generally accepted terminology. The general idea of fitting to a quadratic approximation is used in several algorithms for specific loss functions (e.g. LogitBoost), but doesn't seem to have been given a name. The class voted to use "Newton boosting" in preference to "quadratic boosting".)

Newton Boosting

    f_0(x) ← 0
    For m = 1, ..., M
        For all (x̂, ŷ):
            h(x̂, ŷ) ← (d²/df²) L(f_{m-1}(x̂), ŷ)
            g(x̂, ŷ) ← (d/df) L(f_{m-1}(x̂), ŷ)
            r(x̂, ŷ) ← −(1 / h(x̂, ŷ)) (d/df) L(f_{m-1}(x̂), ŷ)
        b_m ← argmin_b Σ_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )²
        f_m(x) ← f_{m-1}(x) + v b_m(x)

To understand this, consider searching for the b that minimizes the empirical risk for f_m. To start with, we make a local second-order approximation. (Here, g(x̂, ŷ) = dL(f_{m-1}(x̂), ŷ)/df and h(x̂, ŷ) = d²L(f_{m-1}(x̂), ŷ)/df².)

    argmin_b Σ_{(x̂,ŷ)} L( f_{m-1}(x̂) + b(x̂), ŷ )
      ≈ argmin_b Σ_{(x̂,ŷ)} ( L(f_{m-1}(x̂), ŷ) + b(x̂) g(x̂, ŷ) + ½ b(x̂)² h(x̂, ŷ) )
      = argmin_b ½ Σ_{(x̂,ŷ)} h(x̂, ŷ) ( 2 b(x̂) g(x̂, ŷ) / h(x̂, ŷ) + b(x̂)² )
      = argmin_b Σ_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) + g(x̂, ŷ) / h(x̂, ŷ) )²
      = argmin_b Σ_{(x̂,ŷ)} h(x̂, ŷ) ( b(x̂) − r(x̂, ŷ) )²
This is a weighted least-squares regression, with weights h(x̂, ŷ) and targets r = −g/h. Here, we collect in a table all of the loss functions and derivatives we might use.

Loss Functions

    Name                       L(f, ŷ)                dL/df                     d²L/df²
    least-squares              (f − ŷ)²               2(f − ŷ)                  2
    least-absolute-deviation   |f − ŷ|                sign(f − ŷ)
    logistic                   log(1 + exp(−ŷf))      −ŷ / (1 + exp(fŷ))        ŷ² / (2 + exp(fŷ) + exp(−fŷ))
    exponential                exp(−ŷf)               −ŷ exp(−ŷf)               ŷ² exp(−fŷ)

Here are some experiments using Newton boosting on the same dataset as above. Generally, we see that Newton boosting is more effective than gradient boosting in quickly reducing the training error. However, this also means it begins to overfit after a smaller number of iterations. Thus, if running time is an issue (either at training or test time), Newton boosting may be preferable, since it more quickly reduces training error.
[Figure: Newton boosting, logistic loss, step size v = 1. Panels show the data and f(x) after increasing numbers of base learners.]
[Figure: Newton boosting, logistic loss, step size v = 1, on the second dataset. Panels show the data and f(x) after increasing numbers of base learners.]
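A minimal sketch of Newton boosting with the logistic loss, assuming scikit-learn's DecisionTreeRegressor (whose fit method accepts per-example weights via sample_weight) and labels ŷ ∈ {−1, +1}; the derivatives follow the table above, and the function name and defaults are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def newton_boost_logistic(X, y, M=100, v=1.0, max_depth=2):
    """Newton boosting with the logistic loss, y in {-1,+1}.

    Each round fits a weighted least-squares tree with weights
    h = d^2L/df^2 and targets r = -g/h, where g = dL/df.
    """
    f = np.zeros(len(y))
    base_learners = []
    for m in range(M):
        g = -y / (1.0 + np.exp(f * y))                    # dL/df
        h = 1.0 / (2.0 + np.exp(f * y) + np.exp(-f * y))  # d^2L/df^2 (using y^2 = 1)
        b = DecisionTreeRegressor(max_depth=max_depth).fit(X, -g / h, sample_weight=h)
        f += v * b.predict(X)
        base_learners.append(b)

    def predict(X_new):
        return v * sum(b.predict(X_new) for b in base_learners)

    return predict
```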
6 Shrinkage and Early Stopping

So far, we have not talked about regularization. What is suggested in practice is to regularize by early stopping. Namely, repeatedly boost, but hold out a validation set. Select the number of base learners M in the model as the number that leads to the minimum error on the held-out set. (It would be best to select this M by cross-validation, then re-fit with all the data, though this is often not done in practice to reduce running times.)

Notice that we have another parameter controlling complexity, however: the step size v. For a small step size, we will need more models M to achieve the same complexity. Experimentally, better results are almost always given by choosing a small step size v and a larger number of base learners M. The following experiments suggest that this is a product of the fact that it yields smoother models. Note also from the following experiments that Newton boosting with a given step size performs similarly to gradient boosting with a smaller step size.
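Before turning to those experiments, here is a minimal sketch of early stopping on a held-out set, in the same style as the earlier gradient boosting sketch; grad_loss and loss are assumed to be per-example derivative and loss functions like those listed in Section 4, and all names and defaults are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_early_stopping(X_tr, y_tr, X_val, y_val, grad_loss, loss,
                                  M_max=500, v=0.1, max_depth=2):
    """Boost for up to M_max rounds, tracking held-out loss, and keep only
    the first M base learners, where M minimizes the held-out loss."""
    f_tr, f_val = np.zeros(len(y_tr)), np.zeros(len(y_val))
    base_learners, val_losses = [], []
    for m in range(M_max):
        r = -grad_loss(f_tr, y_tr)
        b = DecisionTreeRegressor(max_depth=max_depth).fit(X_tr, r)
        f_tr += v * b.predict(X_tr)
        f_val += v * b.predict(X_val)
        base_learners.append(b)
        val_losses.append(np.mean(loss(f_val, y_val)))   # held-out risk after m+1 rounds
    best_M = int(np.argmin(val_losses)) + 1
    chosen = base_learners[:best_M]

    def predict(X_new):
        return v * sum(b.predict(X_new) for b in chosen)

    return predict, best_M
```

With the logistic loss, for example, one could pass grad_loss=grad_logistic from the earlier sketch and loss=lambda f, y: np.log(1 + np.exp(-y * f)).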
[Figures: gradient boosting, least-squares loss, fit with three different step sizes v.]
In the following figures, the left column shows f(x) after a fixed number of iterations, while the middle column shows sign(f(x)).

[Figures: Newton boosting, logistic loss, with three different step sizes v.]
[Figures: gradient boosting, logistic loss, with three different step sizes v.]

7 Discussion

Note that while our experiments here have used trees, any least-squares regressor could be used as the base learner. If so inclined, we could even use a boosted regression tree as our base learner.

Most of these loss functions can be extended to multi-class classification. This works by, in each iteration, fitting not a single tree but C trees, where there are C classes. For Newton boosting, this is made tractable by taking a diagonal approximation of the Hessian.
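As an illustration of the multi-class scheme just described, here is a minimal sketch of the gradient boosting case with the softmax (cross-entropy) loss; the softmax parameterization and all names are assumptions for illustration, not taken from these notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_multiclass(X, y, C, M=100, v=0.1, max_depth=2):
    """Multi-class gradient boosting with the softmax (cross-entropy) loss.

    y holds class indices in {0, ..., C-1}. Each round fits one tree per class
    to the negative gradient with respect to that class's score, which under
    the softmax model is 1[y = c] - p_c.
    """
    n = len(y)
    F = np.zeros((n, C))                     # per-class scores f_c(x)
    Y = np.eye(C)[y]                         # one-hot targets, shape (n, C)
    learners = []                            # M rounds, each a list of C trees
    for m in range(M):
        P = np.exp(F - F.max(axis=1, keepdims=True))   # numerically stable softmax
        P /= P.sum(axis=1, keepdims=True)
        round_trees = []
        for c in range(C):
            r = Y[:, c] - P[:, c]            # negative gradient for class c
            b = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
            F[:, c] += v * b.predict(X)
            round_trees.append(b)
        learners.append(round_trees)

    def predict(X_new):
        F_new = np.zeros((len(X_new), C))
        for round_trees in learners:
            for c, b in enumerate(round_trees):
                F_new[:, c] += v * b.predict(X_new)
        return F_new.argmax(axis=1)          # predicted class labels

    return predict
```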
Our presentation here has been quite ahistorical. The first boosting algorithms were presented from a very different, and more theoretical, perspective.