Chapter 14 Combining Models

Chapter 14 Combining Models. T-61.62 Special Course II: Pattern Recognition and Machine Learning, Spring 2007. Laboratory of Computer and Information Science, TKK. April 30th, 2007

Outline: Independent Mixing Coefficients, Mixture of Experts, Summary

Chapter 14 is the shortest chapter in the book, with examples in both regression and classification. Bishop style: exponential error functions are introduced, with which boosting can be expressed in a flexible way, etc. Bishop-Bingo: three crosses in a row, column or diagonal erase the counters. BS-Bingo starts... now!

Ensembles of statistical classifiers are often more accurate than a single classifier. A weak learner (weak classifier) is one that performs only slightly better than chance. The final result is obtained by voting (classification) or averaging (regression). Some techniques: bagging, boosting.
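
As a minimal illustration of the voting idea (not from the slides), the following Python sketch combines the {−1, +1} outputs of several weak classifiers by majority vote; the prediction matrix is a made-up example.

```python
# Minimal voting sketch: combine M weak classifiers' labels by majority vote.
# Assumes each weak classifier outputs labels in {-1, +1}.
import numpy as np

def majority_vote(predictions):
    """predictions: array of shape (M, N), entries in {-1, +1}.
    Returns the majority-vote label for each of the N samples."""
    return np.sign(predictions.sum(axis=0))

# Made-up predictions from three weak classifiers on three samples.
preds = np.array([[+1, -1, +1],
                  [+1, +1, -1],
                  [-1, +1, +1]])
print(majority_vote(preds))   # each sample gets the label chosen by 2 of 3
```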

Bagging: a committee technique based on bootstrapping the data set and model averaging. Bootstrapping: given a data set of size N, create M data sets of size N by sampling with replacement. Averaging low-bias models produces accurate predictions; cf. the bias-variance decomposition (Section 3.5).
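
A minimal bagging sketch in Python, under the assumption that a decision-tree regressor from scikit-learn serves as the base learner (the slide does not prescribe one); the bootstrap sampling and averaging follow the description above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # assumed base learner

def bagged_regressors(X, t, M, rng):
    """Train M regressors, each on a bootstrap sample of size N drawn
    with replacement from the original data set (X, t)."""
    N = len(X)
    models = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)          # N indices, with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], t[idx]))
    return models

def bagged_predict(models, X):
    """Committee prediction: average of the individual regressors."""
    return np.mean([m.predict(X) for m in models], axis=0)

# Usage (with existing arrays X, t):
# models = bagged_regressors(X, t, M=10, rng=np.random.default_rng(0))
# y_com = bagged_predict(models, X)
```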

Bagging in Regression: example on regression with y(x) = h(x) + ε(x). From a single data set D, generate M bootstrap data sets D_m, and from these fit regressors y_m(x) with errors ε_m(x). The sum-of-squares error of an individual model is E_x{(y_m(x) − h(x))^2} = E_x{ε_m(x)^2}, and the average of the individual errors is E_AV = (1/M) Σ_{m=1}^{M} E_x{ε_m(x)^2}.

Averaging Gives Better Performance: the committee prediction is the average of the y_m, y_COM(x) = (1/M) Σ_{m=1}^{M} y_m(x). The expected error of the committee is E_COM = E_x{(y_COM(x) − h(x))^2} = E_x{((1/M) Σ_{m=1}^{M} ε_m(x))^2}. Under the assumptions that the errors ε_m(x) are zero-mean and uncorrelated, we obtain E_COM = (1/M) E_AV.
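
A quick numerical check of this result (illustrative, not from the slides): simulate M models whose errors are independent, zero-mean Gaussians and compare E_COM with E_AV / M.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 10, 100_000
eps = rng.normal(0.0, 1.0, size=(M, N))       # zero-mean, uncorrelated errors

E_AV  = np.mean(eps**2)                        # average individual squared error
E_COM = np.mean(eps.mean(axis=0)**2)           # squared error of the committee
print(E_AV, E_COM, E_AV / M)                   # E_COM comes out close to E_AV / M
```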

Not As Good As in Theory: the assumptions do not hold in general; however, it can still be shown that E_COM ≤ E_AV. E.g., averaging the squared error over the M models at a fixed x gives (1/M) Σ_m (h(x) − y_m(x))^2 = h(x)^2 − 2 h(x) (1/M) Σ_m y_m(x) + (1/M) Σ_m y_m(x)^2. Using the inequality (average of X^2) ≥ (average of X)^2 together with (1/M) Σ_m y_m(x) = y_COM(x), we get (h(x) − y_COM(x))^2 ≤ (1/M) Σ_m (h(x) − y_m(x))^2, and taking the expectation over x gives E_COM ≤ E_AV.

Boosting: base classifiers are trained in sequence, and a misclassified data point gets more weight in the following classifier. The final prediction is given by a weighted majority voting scheme. Example: a two-class classification problem with the most widely used algorithm, AdaBoost.

AdaBoost (adaptive boosting): a weight w_n for each training sample and M weak classifiers trained in sequence. The indicator function I(y_m(x_n) ≠ t_n) equals 1 if the argument is true, i.e., in case of misclassification; misclassified data points get more weight in the following classifier. Each classifier also gets a weight α_m. (Figure: the weight sets {w_n^(1)}, {w_n^(2)}, ..., {w_n^(M)} feed successive classifiers y_1(x), y_2(x), ..., y_M(x), which are combined into Y_M(x) = sign(Σ_{m=1}^{M} α_m y_m(x)).)

AdaBoost: Algorithm
1. Initialize the data weights w_n^(1) = 1/N.
2. For m = 1, ..., M:
2.1 Fit a classifier y_m(x) by minimizing J_m = Σ_n w_n^(m) I(y_m(x_n) ≠ t_n).
2.2 Evaluate the quantity ε_m (the weighted ratio of misclassified points), ε_m = Σ_n w_n^(m) I(y_m(x_n) ≠ t_n) / Σ_n w_n^(m), and the quantity α_m (the weight for classifier m), α_m = log((1 − ε_m) / ε_m).
2.3 Update the data weighting coefficients: w_n^(m+1) = w_n^(m) exp(α_m I(y_m(x_n) ≠ t_n)).
3. Make predictions with Y_M(x) = sign(Σ_{m=1}^{M} α_m y_m(x)).
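
A compact Python sketch of this algorithm, using decision stumps (a threshold on one input variable) as the weak learners; the exhaustive stump search and the labels t ∈ {−1, +1} are illustrative choices, not requirements of the algorithm.

```python
import numpy as np

def fit_stump(X, t, w):
    """Return (feature, threshold, sign) of the stump minimizing the weighted error."""
    best = (None, None, None, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.sign(X[:, j] - thr + 1e-12)
                err = np.sum(w * (pred != t))
                if err < best[3]:
                    best = (j, thr, s, err)
    return best[:3]

def adaboost(X, t, M):
    N = len(t)
    w = np.full(N, 1.0 / N)                        # step 1: uniform weights
    stumps, alphas = [], []
    for m in range(M):
        j, thr, s = fit_stump(X, t, w)             # step 2.1
        miss = (s * np.sign(X[:, j] - thr + 1e-12)) != t
        eps = np.clip(np.sum(w * miss) / np.sum(w), 1e-12, 1 - 1e-12)  # step 2.2
        alpha = np.log((1 - eps) / eps)
        w = w * np.exp(alpha * miss)               # step 2.3
        stumps.append((j, thr, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    F = sum(a * s * np.sign(X[:, j] - thr + 1e-12)
            for (j, thr, s), a in zip(stumps, alphas))
    return np.sign(F)                              # step 3: weighted majority vote
```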

AdaBoost: Example. The base learners consist of a threshold on one of the input variables (decision stumps). Samples misclassified by the classifier at m = 1 get greater weight for m = 2, and so on. Final classification: Y_M(x) = sign(Σ_m α_m y_m(x)). (Figure: data points and decision boundaries over successive boosting iterations.)

Boosting as Sequential Minimization: boosting was originally motivated by statistical learning theory; here it is derived as sequential minimization of an exponential error function ("in a Bishop style"). The error function is E = Σ_{n=1}^{N} exp(−t_n f_m(x_n)), with the combined classifier f_m(x) = (1/2) Σ_l α_l y_l(x). Keeping the base classifiers y_1(x), ..., y_{m−1}(x) and their coefficients α_l fixed, and minimizing only with respect to the last α_m and y_m(x), leads to the same equations as in AdaBoost.
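
The intermediate step can be made explicit; a sketch of the derivation in the notation above (with w_n^(m) = exp(−t_n f_{m−1}(x_n)) treated as constants at step m):

```latex
\begin{align}
E &= \sum_{n=1}^{N} \exp\{-t_n f_m(x_n)\}
   = \sum_{n=1}^{N} w_n^{(m)} \exp\{-\tfrac{1}{2}\alpha_m t_n y_m(x_n)\} \\
  &= \bigl(e^{\alpha_m/2} - e^{-\alpha_m/2}\bigr)
     \sum_{n=1}^{N} w_n^{(m)} I\bigl(y_m(x_n) \neq t_n\bigr)
   + e^{-\alpha_m/2} \sum_{n=1}^{N} w_n^{(m)}.
\end{align}
% Minimizing over y_m(x) gives the weighted error J_m of step 2.1, and setting
% dE/d\alpha_m = 0 gives \alpha_m = \log\{(1 - \epsilon_m)/\epsilon_m\}.
```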

Error Functions: lots of boosting-like algorithms can be obtained by altering the error function. The exponential error function, whose sequential minimization leads to the simple AdaBoost scheme, penalizes large negative values of t y(x) heavily. The cross-entropy error function for t ∈ {−1, 1}, log(1 + e^{−y t}), is more robust to outliers. Log likelihoods exist for any distribution, and multi-class problems become possible to solve.
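
A small numerical illustration (not from the slides) of this robustness point: evaluate both error functions as a function of the margin z = t y(x); the exponential loss blows up for large negative margins, while the cross-entropy grows only roughly linearly there.

```python
import numpy as np

z = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])      # margins t * y(x)
exp_loss = np.exp(-z)                           # exponential error, e^{-t y}
ce_loss  = np.log1p(np.exp(-z))                 # cross-entropy, log(1 + e^{-y t})
for zi, e, c in zip(z, exp_loss, ce_loss):
    print(f"margin {zi:+.1f}:  exponential {e:8.2f}   cross-entropy {c:6.2f}")
```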

Classification and Regression Trees (CART): the input space is split into cuboid regions with axis-aligned boundaries, and only one simple model, e.g., a constant, is used within each region. Human interpretation is easy. (Figure: a partition of the (x_1, x_2) plane into regions A, B, C, D, E by thresholds θ_1, ..., θ_4, together with the corresponding binary tree of splits.)

CART: Learning from Data. From the data we must determine the structure of the tree, the input variable to split on at each node, the threshold value θ_i for each split, and the value of the prediction in each leaf. Exhaustive search is combinatorially infeasible, so a greedy algorithm is used: start from a single node, keep growing the tree, and apply a stopping criterion and a pruning criterion. A sketch of the greedy split search is given below.
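
A minimal Python sketch of the greedy split search for a regression tree, assuming a sum-of-squares criterion and constant predictions per region (pruning is omitted; this is not the full CART procedure):

```python
import numpy as np

def best_split(X, t):
    """Scan every input variable and candidate threshold; score each
    axis-aligned split by the resulting sum-of-squares error."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j])[:-1]:      # thresholds between data values
            left, right = t[X[:, j] <= theta], t[X[:, j] > theta]
            sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if sse < best[2]:
                best = (j, theta, sse)
    return best[:2]   # (variable index, threshold); recurse on the two child
                      # nodes until a stopping criterion (e.g. node size) is met
```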

Drawbacks of CART: learning of a tree is sensitive to the data, splits are aligned with the axes of the feature space, hard splitting means that each region of input space belongs to one and only one node, and the piecewise-constant predictions of a tree are not smooth. A remedy: the hierarchical mixture of experts.

Mixture of Linear Regression Models: simple probabilistic cases for regression and classification are mixtures of linear regression models and mixtures of logistic models. Here, Gaussians with mixing coefficients independent of the input variables: p(t | θ) = Σ_{k=1}^{K} π_k N(t | w_k^T φ, β^{-1}).

EM for Maximizing the Log Likelihood: given a data set {φ_n, t_n}, the log likelihood function is log p(t | θ) = Σ_{n=1}^{N} log( Σ_{k=1}^{K} π_k N(t_n | w_k^T φ_n, β^{-1}) ). With binary latent variables z_nk, the complete-data log likelihood is log p(t, Z | θ) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_nk log( π_k N(t_n | w_k^T φ_n, β^{-1}) ). EM then alternates between evaluating the responsibilities γ_nk and Q(θ, θ_old), and updating π_k, w_k, and β.
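
A compact EM sketch for this model (illustrative; Phi is the N x D design matrix, t the targets, pi the mixing coefficients, W the K x D weight matrix and beta the shared noise precision, matching the notation above):

```python
import numpy as np

def em_mixture_linreg(Phi, t, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = Phi.shape
    pi, W, beta = np.full(K, 1.0 / K), rng.normal(size=(K, D)), 1.0
    for _ in range(iters):
        # E-step: responsibilities gamma_nk (posterior over the latent z_nk)
        mu = Phi @ W.T                                         # N x K component means
        logp = (-0.5 * beta * (t[:, None] - mu)**2
                + 0.5 * np.log(beta / (2 * np.pi)) + np.log(pi))
        logp -= logp.max(axis=1, keepdims=True)                # numerical stability
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update pi, then each w_k by weighted least squares, then beta
        pi = gamma.mean(axis=0)
        for k in range(K):
            R = gamma[:, k]
            W[k] = np.linalg.solve(Phi.T @ (R[:, None] * Phi), Phi.T @ (R * t))
        beta = N / np.sum(gamma * (t[:, None] - Phi @ W.T)**2)
    return pi, W, beta
```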

Example: a mixture of two linear regressors fitted to a data set. Drawback: a lot of probability mass is placed where there is no data. Solution: input-dependent mixing coefficients. (Figure: the fitted component means and the resulting predictive densities.)

Mixture of Experts: the mixture of linear regression models has p(t | θ) = Σ_{k=1}^{K} π_k p_k(t | θ), whereas the mixture of experts model has p(t | x, θ) = Σ_{k=1}^{K} π_k(x) p_k(t | x, θ). The mixing coefficients π_k(x), called gating functions, are functions of the input, and the individual component densities are called experts.
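
The only structural change relative to the previous mixture is that the mixing coefficients depend on the input. A common (illustrative) parameterization, assumed here rather than taken from the slides, is a softmax over linear gating scores with parameters V:

```python
import numpy as np

def gating(Phi, V):
    """Phi: N x D inputs (design matrix), V: K x D gating parameters.
    Returns an N x K matrix whose n-th row contains pi_k(x_n), summing to one."""
    scores = Phi @ V.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)
```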

Hierarchical Mixture of Experts: a probabilistic version of decision trees in which each component of the mixture is itself a mixture distribution. Nodes perform probabilistic (soft) splits on all input variables, and leaves are probabilistic models. Compare the mixture density network (Section 5.6).

Summary: multiple models are combined to increase the capabilities of the regressor or classifier. The basic methods, bagging and boosting, improve results compared to a single learner. Decision trees are easy to interpret, and probabilistic versions extend these models further.

Course Feedback: http://www.cs.hut.fi/opinnot/palaute/kurssipalaute.html (Spring 2007 course questionnaires, T-61.62)