COMP 540, 12th April 2007. Clement Pang.

Boosting: combining weak classifiers. It fits an additive model; it is essentially Forward Stagewise Additive Modeling (FSAM) with exponential loss. Loss functions for classification: misclassification, exponential, binomial deviance, squared error, support vector. For regression: squared error, absolute error, Huber.
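
As a quick illustration of how these criteria compare, here is a minimal NumPy sketch (not from the slides) that evaluates each classification loss as a function of the margin $m = y f(x)$, following the usual ESL conventions (e.g. binomial deviance written as $\log(1 + e^{-2m})$):

```python
import numpy as np

# Margin m = y * f(x), with y in {-1, +1}.
m = np.linspace(-2, 2, 9)

losses = {
    "misclassification": (m < 0).astype(float),    # I(sign(f) != y)
    "exponential":       np.exp(-m),                # AdaBoost criterion
    "binomial deviance": np.log1p(np.exp(-2 * m)),  # log(1 + exp(-2m))
    "squared error":     (1 - m) ** 2,              # (y - f)^2 = (1 - m)^2 for y in {-1,+1}
    "support vector":    np.maximum(0, 1 - m),      # hinge loss
}

for name, values in losses.items():
    print(f"{name:>18}: {np.round(values, 2)}")
```

All of these decrease monotonically in the margin except squared error, which also penalizes margins greater than 1; they differ mainly in how heavily they punish large negative margins.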

MART: a generalization of tree boosting. It tries to mitigate the problem of decision trees being less accurate than the best classifier for a particular problem.

Let's see MART in action first before going into details, on the spam dataset from Chapter 9. Error rates: MART 4.0%; additive logistic regression 5.3%; CART (fully grown and pruned by CV) 8.7%; MARS 5.5% (standard error of the estimates: 0.6%).
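
To make this concrete, here is a hedged scikit-learn sketch of a MART-style fit. GradientBoostingClassifier is used as an assumed stand-in for MART, and the synthetic 57-feature data below only mimics the shape of the spam problem, so it will not reproduce the error rates quoted above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 57-predictor spam data of ESL Chapter 9.
X, y = make_classification(n_samples=4601, n_features=57, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# MART-style boosted trees: small trees, shrinkage, many iterations.
mart = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1,
                                  max_leaf_nodes=6, random_state=0)
mart.fit(X_tr, y_tr)
print("test error rate: %.3f" % (1.0 - mart.score(X_te, y_te)))
```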

More on this in Section 10.13. There are 57 predictor variables. Most relevant: !, $, hp, remove. Least relevant: 857, 415, table, 3d.

The fitted function is the log-odds: $f(x) = \log \frac{\Pr(\text{spam} \mid x)}{\Pr(\text{email} \mid x)}$

One-variable plots show the dependence of the log-odds on a predictor; two-variable plots show interactions among the predictor variables. When to run? Running MART with J = 2 (a main-effects model) yields a higher error rate than running with larger J.

MART demonstration with TreeNet: classification and regression.

Decision tree, formal expression:
$T(x; \Theta) = \sum_{j=1}^{J} \gamma_j \, I(x \in R_j)$
Parameters: $\Theta = \{R_j, \gamma_j\}_{j=1}^{J}$
Optimization process: $\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j)$
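
A minimal sketch (assumed, not from the slides) of what this expression means operationally: evaluating the tree is just looking up the constant $\gamma_j$ of whichever region $R_j$ contains $x$.

```python
import numpy as np

def tree_predict(x, regions, gammas):
    """Evaluate T(x; Theta) = sum_j gamma_j * I(x in R_j).

    regions: list of (lower, upper) bound arrays defining axis-aligned boxes R_j
    gammas : the constant prediction gamma_j for each region
    """
    for (lo, hi), gamma in zip(regions, gammas):
        if np.all(x >= lo) and np.all(x < hi):
            return gamma
    raise ValueError("x falls in no region; the R_j should partition the input space")

# A toy J = 2 stump on one feature, split at x[0] = 0.5.
regions = [(np.array([-np.inf]), np.array([0.5])),
           (np.array([0.5]), np.array([np.inf]))]
gammas = [-1.3, 0.8]   # e.g. the mean of the y_i in each region
print(tree_predict(np.array([0.2]), regions, gammas))   # -1.3
print(tree_predict(np.array([0.9]), regions, gammas))   #  0.8
```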

Approximation: finding $\gamma_j$ given $R_j$ is trivial; the estimate $\hat{\gamma}_j$ is often the mean/mode of the $y_i$ in region $R_j$. Finding the $R_j$ is difficult; the typical way is a greedy, top-down recursive partitioning algorithm. One can also approximate by a smoother and more convenient criterion (10.26).

Sum of trees: solve using FSAM. The difficult part is finding the $R_j$.
$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$
$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i,\, f_{m-1}(x_i) + T(x_i; \Theta_m)\big)$
$\hat{\gamma}_{jm} = \arg\min_{\gamma_{jm}} \sum_{x_i \in R_{jm}} L\big(y_i,\, f_{m-1}(x_i) + \gamma_{jm}\big)$

Some special cases are easier. Squared-error loss: find the tree that best predicts the current residual. Two-class classification with exponential loss: AdaBoost.M1; find the tree that minimizes the weighted error rate, with $y_i \in \{-1, +1\}$:
$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} w_i^{(m)} \exp\!\big[-y_i\, T(x_i; \Theta_m)\big]$
The $\gamma_{jm}$ can then be found by (10.31): the weighted log-odds in each region.
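
For the squared-error case the FSAM loop is particularly simple: each new tree is fit to the current residuals. A minimal sketch, assuming scikit-learn's DecisionTreeRegressor as the base learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

M, trees = 100, []
f = np.zeros_like(y)               # f_0 = 0
for m in range(M):
    r = y - f                      # current residual (= negative gradient for squared error)
    tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, r)
    trees.append(tree)
    f += tree.predict(X)           # f_m = f_{m-1} + T(x; Theta_m)

print("training MSE:", round(float(np.mean((y - f) ** 2)), 4))
```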

For regression: absolute error, Huber loss. For classification: deviance. These will robustify boosting trees; however, they do not give rise to simple fast boosting algorithms.

Solving each step in FSAM by numerical optimization requires a differentiable loss criterion. Total loss: $L(f) = \sum_{i=1}^{N} L\big(y_i, f(x_i)\big)$. Goal: $\hat{f} = \arg\min_{f} L(f)$.

Here $f$ is treated as a vector: its parameters are its values at each data point, $\mathbf{f} = \{f(x_1), f(x_2), \ldots, f(x_N)\}$. Numerical optimization solves the problem with a sum of component vectors, $\mathbf{f}_M = \sum_{m=0}^{M} \mathbf{h}_m$, $\mathbf{h}_m \in \mathbb{R}^N$, where $\mathbf{h}_0 = \mathbf{f}_0$ is an initial guess.

Greedy strategy (steepest descent); gradients for the common losses are in Table 10.2.
$g_{im} = \left[ \frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}$
$\rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho\, \mathbf{g}_m)$, and $\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m\, \mathbf{g}_m$.

Simplifying $T_m$; the rationale is minimizing loss vs. generalization. Instead of
$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i,\, f_{m-1}(x_i) + T(x_i; \Theta_m)\big)$
fit the tree to the negative gradient by least squares:
$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big(-g_{im} - T(x_i; \Theta)\big)^2$
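
Putting the two displays together gives generic gradient tree boosting: compute the negative gradient (pseudo-residuals), fit a small regression tree to it by least squares, then re-estimate the leaf constants under the original loss. A sketch for absolute-error loss, assuming scikit-learn trees; the leaf medians play the role of the $\gamma_{jm}$, and the factor nu anticipates the shrinkage discussed later:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_t(df=2, size=300)   # heavy-tailed noise

f = np.full_like(y, np.median(y))          # f_0: the optimal constant for absolute error
nu, M = 0.1, 200
for m in range(M):
    neg_grad = np.sign(y - f)              # -dL/df for L(y, f) = |y - f|
    tree = DecisionTreeRegressor(max_leaf_nodes=6).fit(X, neg_grad)  # least-squares fit to -g
    leaf = tree.apply(X)                   # which terminal node each x_i lands in
    gamma = {l: np.median((y - f)[leaf == l]) for l in np.unique(leaf)}  # optimal gamma_jm
    f += nu * np.array([gamma[l] for l in leaf])

print("mean absolute training error:", round(float(np.mean(np.abs(y - f))), 4))
```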

Size of tree ($J$, the number of terminal nodes) for each iteration of boosting. Simple strategy: constant $J$. How to find $J$? Minimize the prediction risk on future data.

Analysis of variance of the predictor variables.

Most problems have low-order interaction effects dominating the problem space; thus, models with high-order interactions will suffer in accuracy. Interaction effects are limited by $J$: no interaction effects of level greater than $J - 1$ are possible. $J = 2$: decision stump (only main effects, no interactions). $J = 3$: two-variable interaction effects are allowed.

Typically $J = 2$ will be insufficient, and $J > 10$ is highly unlikely to be needed. By experience, $4 \le J \le 8$ works well in boosting; $J = 6$ is a good initial guess.
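
In practice $J$ is just a knob on the base learner. For instance, with scikit-learn's gradient boosting (used here as an assumed stand-in for MART), $J$ corresponds to max_leaf_nodes:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X[:, 0] * X[:, 1] + X[:, 2]      # one two-variable interaction plus a main effect

# J terminal nodes per tree <-> max_leaf_nodes; J = 6 matches the rule of thumb above.
model = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=300).fit(X, y)
print(round(model.score(X, y), 3))
```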

Regularization: prevention of overfitting of the data by the model. Example: the parameter $M$. Increasing $M$ reduces the training risk but could lead to overfitting. Use a hold-out set, similar to the early-stopping strategy in neural networks.

Shrinkage: scale the contribution of each tree by a factor $0 < \nu < 1$:
$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm}\, I(x \in R_{jm})$
This controls the learning rate of the boosting procedure; there is a trade-off between $\nu$ and $M$. Empirically, smaller $\nu$ favors better test error but requires longer training time. The best strategy is to choose a small $\nu$ ($\nu < 0.1$) and find $M$ by early stopping.
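
A sketch of that strategy with scikit-learn (again an assumed stand-in for MART): set a small learning_rate (the $\nu$ above), fit with a generous $M$, and pick the best $M$ from staged predictions on a hold-out set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Small shrinkage nu (learning_rate) and a generous number of trees.
gbm = GradientBoostingRegressor(learning_rate=0.05, n_estimators=2000,
                                max_leaf_nodes=6, random_state=0).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., M trees: early stopping by validation.
val_err = [np.mean((y_val - pred) ** 2) for pred in gbm.staged_predict(X_val)]
best_M = int(np.argmin(val_err)) + 1
print("best M by early stopping:", best_M)
```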

Consider the set of all possible $J$-terminal-node regression trees as basis functions. The linear model is then
$f(x) = \sum_{k=1}^{K} \alpha_k\, T_k(x)$
where $K = \operatorname{card}(\mathcal{T})$ is likely to be much larger than any possible training set. Thus, penalized least squares is required to find the $\alpha_k$.

Penalty function:
$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \left\{ \sum_{i=1}^{N} \Big( y_i - \sum_{k=1}^{K} \alpha_k\, T_k(x_i) \Big)^{2} + \lambda\, J(\alpha) \right\}$
Ridge regression: $J(\alpha) = \sum_{k=1}^{K} \alpha_k^{2}$. Lasso: $J(\alpha) = \sum_{k=1}^{K} |\alpha_k|$.

Many of the $\alpha_k$ will be zero with a large $\lambda$: only a fraction of all possible trees are relevant. Problem: we still can't solve over all possible trees. Solution: a forward stagewise strategy. Initialize all $\alpha_k = 0$ first; more iterations correspond to a smaller effective $\lambda$ (less shrinkage).
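
A minimal sketch (assumed, not from the slides) of this incremental forward stagewise strategy on a generic basis matrix T, where each column stands in for one candidate tree evaluated on the training data. The coefficients start at zero and are nudged by a small step toward the basis function most correlated with the current residual:

```python
import numpy as np

def forward_stagewise(T, y, eps=0.01, n_iter=5000):
    """Incremental forward stagewise linear regression (approximates the lasso path)."""
    N, K = T.shape
    alpha = np.zeros(K)                        # initialize all alpha_k = 0
    r = y.astype(float).copy()                 # current residual
    for _ in range(n_iter):
        corr = T.T @ r                         # correlation of each basis with the residual
        k = int(np.argmax(np.abs(corr)))       # most correlated basis function
        step = eps * np.sign(corr[k])
        alpha[k] += step                       # nudge its coefficient a little
        r -= step * T[:, k]                    # update the residual
    return alpha

# Toy check on a random basis matrix (standing in for tree predictions).
rng = np.random.default_rng(0)
T = rng.normal(size=(100, 10))
y = 2.0 * T[:, 0] - T[:, 3] + 0.1 * rng.normal(size=100)
print(np.round(forward_stagewise(T, y), 2))
```

Stopping after fewer iterations leaves more coefficients at exactly zero, which is the sense in which the iteration count acts like the lasso's $\lambda$.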

The approximation works (it approximates the lasso): tree boosting with shrinkage resembles penalized (lasso) regression, while no shrinkage is analogous to subset selection (penalizing the number of non-zero coefficients).

The superior performance of boosting over procedures such as the SVM may be largely due to the implicit use of an L1 rather than an L2 penalty. The L1 penalty is better suited to sparse situations (Donoho et al., 1995), though minimization of the L1-penalized problem is much more difficult than for L2. The forward stagewise approach provides an approximate, practical way to tackle the problem.

Single decision trees are highly interpretable; linear combinations of trees lose this feature. How do we interpret the model then?

Breiman et al. (1984) proposed a measure of relevance for each predictor variable in a single decision tree. Intuition: the most relevant variable is the one that gives the maximum estimated improvement in squared-error risk. For additive models, simply average over the trees. This also works for K-class classifiers (pg. 332).
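
scikit-learn exposes this averaged squared-error-improvement measure as feature_importances_ on its boosted-tree models (used here as an assumed stand-in for MART); a short sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
gbm = GradientBoostingClassifier(max_leaf_nodes=6, n_estimators=200,
                                 random_state=0).fit(X, y)

# Breiman-style importances, averaged over all trees and normalized to sum to 1.
for j in np.argsort(gbm.feature_importances_)[::-1][:5]:
    print(f"x{j}: {gbm.feature_importances_[j]:.3f}")
```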

Visualization is a great tool but is limited to low-dimensional views. Partial dependence: the marginal average of the model over a subset of the input variables, averaging out the complement of that subset within all input variables. Works for K-class problems as well.
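
A brute-force sketch of that marginal average (assumed, not from the slides): fix the variable of interest at each grid value, leave the complement variables at their observed values, and average the model's predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=0)
gbm = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=200,
                                random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """For each grid value v, set X[:, feature] = v for every training point,
    then average the predictions over the complement variables."""
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        pd.append(model.predict(Xv).mean())
    return np.array(pd)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
print(np.round(partial_dependence(gbm, X, 0, grid), 2))
```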