Combining Estimators to Improve Performance


1 Combining Estimators to Improve Performance. A survey of model bundling techniques -- from boosting and bagging to Bayesian model averaging -- that are creating a breakthrough in the practice of data mining. John F. Elder IV, Ph.D., Elder Research, Charlottesville, Virginia. Greg Ridgeway, Ph.D., University of Washington, Dept. of Statistics.

2 Outline
- Why combine? A motivating example
- Hidden dangers of model selection
- Reducing modeling uncertainty through Bayesian Model Averaging
- Stabilizing predictors through bagging
- Improving performance through boosting
- Emerging theory illuminates empirical success
- Bundling, in general
- Latest algorithms
- Closing examples & summary

3 Reasons to combine estimators
- Decreases variability in the predictions.
- Accounts for uncertainty in the model class.
=> Improved accuracy on new data.

4 A Motivating Example: Classifying a bat's species from its chirp. Goal: Use time-frequency features of echolocation signals to classify bats by species in the field (avoiding capture and physical inspection). U. Illinois biologists gathered data: 98 signals from 19 bats representing 6 species: Southeastern, Grey, Little Brown, Indiana, Pipistrelle, Big-Eared. ~35 data features (dimensions) were calculated from the signals, such as the low frequency at the 3 dB level, the time position of the signal peak, and the amplitude ratio of the 1st and 2nd harmonics. The problem turned out to have a nice level of difficulty for comparing methods: overlap in classes, but some separability.

5 Sample Projection [scatterplot of the bat data projected onto the features Var4, t10, and bw]

6 Percent Classified Correctly [bar chart comparing accuracy and training time on the bat data for decision trees (TX2step, CART), polynomial networks (ASPN), neural networks with 8, 17, and 35 inputs, nearest-neighbor variants, and bundled combinations; training times range from seconds to about a day; bundling four of the older models via advisor perceptrons is highlighted]

7 What is model uncertainty? Suppose we wish to predict y from predictors x. Given a dataset of observations, D, for a new observation with predictors x* we want to derive the predictive distribution of y* given x* and D: P(y* | x*, D).

8 In practice. We want to use all the information in D to make the best estimate of y* for an individual with covariates x*, that is, P(y* | x*, D). In practice, however, we always use P(y* | x*, M), where M is a single model constructed from D.

9 Selecting M. The process of selecting a model usually involves:
- Model class selection (linear regression, tree regression, neural network)
- Variable selection (variable exclusion, transformation, smoothing)
- Parameter estimation
We then tend to treat the one model that fits the data or performs best as "the" model.

10 What's wrong with that? Two models may fit a dataset equally well (with respect to some loss) yet make different predictions. Competing interpretable models with equivalent performance offer ambiguous conclusions. Model search dilutes the evidence: part of the evidence is spent specifying the model.

11 Bayesian Model Averaging
Goal: Account for model uncertainty.
Method: Use Bayes' theorem and average the models, weighted by their posterior probabilities.
Properties: Improves predictive performance; theoretically elegant; computationally costly.

12 Averaging the models. Consider a set containing the K candidate models M_1, ..., M_K. With a few probability manipulations we can make predictions using all of them: P(y* | x*, D) = Σ_k P(y* | x*, M_k) P(M_k | D). The probability mass for a particular prediction value of y is a weighted average of the probability mass that each model places on that value of y, where the weight is the posterior probability of that model given the data.

13 Bayes' Theorem. With M_k a model, D the data, P(D | M_k) the integrated likelihood of M_k, and P(M_k) the prior model probability: P(M_k | D) = P(D | M_k) P(M_k) / Σ_{l=1..K} P(D | M_l) P(M_l).

14 Challenges. The size of the model set may make exhaustive summation impossible. The integrated likelihood of each model is usually complex. Specifying a prior distribution (even a noninformative one) across the space of models is non-trivial. Proposed solutions to these challenges often involve MCMC, the BIC approximation, the MLE approximation, Occam's window, and Occam's razor.
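
To make the BIC approximation concrete, here is a minimal sketch (not from the tutorial; plain NumPy, with hypothetical function names) that averages linear-regression models over small variable subsets, using exp(-BIC/2) as a stand-in for the posterior model probability P(M_k | D):

```python
import numpy as np
from itertools import combinations

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return beta, rss

def bma_predict(X, y, X_new, max_vars=2):
    """Approximate BMA over all variable subsets of size <= max_vars, weighted by BIC."""
    n, p = X.shape
    subsets = [list(s) for k in range(1, max_vars + 1) for s in combinations(range(p), k)]
    bics, preds = [], []
    for s in subsets:
        Xs = np.column_stack([np.ones(n), X[:, s]])
        beta, rss = fit_ols(Xs, y)
        k = Xs.shape[1]
        bics.append(n * np.log(rss / n) + k * np.log(n))   # Gaussian-likelihood BIC
        Xn = np.column_stack([np.ones(len(X_new)), X_new[:, s]])
        preds.append(Xn @ beta)
    bics = np.asarray(bics)
    w = np.exp(-0.5 * (bics - bics.min()))   # P(M_k | D) roughly proportional to exp(-BIC_k / 2)
    w /= w.sum()
    return np.sum(w[:, None] * np.asarray(preds), axis=0)
```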

15 Performance
- Survival model (primary biliary cirrhosis): BMA vs. stepwise regression, 2% improvement; BMA vs. expert-selected model, 10% improvement.
- Linear regression (body fat prediction): BMA provides the best 90% predictive coverage.
- Graphical models: BMA yields an improvement.

16 BMA References
- Chris Volinsky's BMA homepage.
- Hoeting, J., D. Madigan, A. Raftery, and C. Volinsky (1999). Bayesian Model Averaging: A Tutorial. Statistical Science.

17 Unstable predictors. We can always assume y = f(x) + ε, where E(ε | x) = 0. Assume that we have a way of constructing a predictor, f̂_D(x), from a dataset D. We want to choose the estimator of f that minimizes a loss J, for example squared loss: J(f̂, D) = E_{y,x} (y - f̂_D(x))^2.

18 Bias-variance decomposition. If we could average over all possible datasets, let the average prediction be f̄(x) = E_D f̂_D(x). The average prediction error over all datasets that we might see decomposes as E_D J(f̂, D) = E ε^2 + E_x (f̄(x) - f(x))^2 + E_{x,D} (f̂_D(x) - f̄(x))^2 = noise + bias^2 + variance.

19 Bias-variance decomposition (cont.) E_D J(f̂, D) = E ε^2 + E_x (f̄(x) - f(x))^2 + E_{x,D} (f̂_D(x) - f̄(x))^2 = noise + bias^2 + variance. The noise cannot be reduced. The squared-bias term might be reducible. The variance term is 0 if we use f̂_D(x) = f̄(x), but this requires having an infinite number of datasets.
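
A small Monte Carlo experiment (a hypothetical illustration, not part of the tutorial) makes the decomposition tangible: simulate many datasets from y = f(x) + ε, fit a model to each, and estimate the three terms on a grid of x values.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # true regression function
sigma = 0.3                              # noise standard deviation
x_grid = np.linspace(0, 1, 50)

def fit_one_dataset(degree=5, n=30):
    """Simulate one dataset D, fit a polynomial, return its predictions on the grid."""
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coef = np.polyfit(x, y, degree)
    return np.polyval(coef, x_grid)

preds = np.array([fit_one_dataset() for _ in range(500)])  # 500 simulated datasets
f_bar = preds.mean(axis=0)                   # average prediction f_bar(x) = E_D f_hat_D(x)
bias2 = np.mean((f_bar - f(x_grid)) ** 2)    # E_x (f_bar(x) - f(x))^2
variance = np.mean(preds.var(axis=0))        # E_{x,D} (f_hat_D(x) - f_bar(x))^2
print("noise:", sigma ** 2, "bias^2:", bias2, "variance:", variance)
```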

20 Bagging (Bootstrap Aggregating)
Goal: Variance reduction.
Method: Create bootstrap replicates of the dataset and fit a model to each; average the predictions of the models.
Properties: Stabilizes unstable methods; easy to implement and parallelizable; theory is not fully explained.

21 Bagging algorithm
1. Create K bootstrap replicates of the dataset.
2. Fit a model to each of the replicates.
3. Average (or vote) the predictions of the K models.
Bootstrapping simulates the stream of datasets averaged over in the bias-variance decomposition.
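
A minimal sketch of these three steps (assuming scikit-learn regression trees as the unstable base learner; any fitting routine could be swapped in):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, K=100, seed=0):
    """Fit K trees, each to a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(K):
        idx = rng.integers(0, n, n)                    # sample n rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X_new):
    """Average the predictions of the bagged trees."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```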

22 Bagging Example [scatterplot of a two-class dataset in features x1 and x2]

23 CART decision boundary [figure]

24 100 bagged trees [figure]

25 Bagged tree decision boundary [figure]

26 Regression results [table of squared-error loss for CART vs. bagged CART on the Boston Housing, Ozone, and Friedman #1-#3 benchmarks; the numeric entries were lost in transcription]

27 Classification results [table of misclassification rates for CART vs. bagged CART on the Diabetes, Breast, Ionosphere, Heart, Soybean, Glass, and Waveform datasets; the numeric entries were lost in transcription]

28 Bagging References
- Leo Breiman's homepage.
- Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123-140.
- Friedman, J. and P. Hall (1999). On Bagging and Nonlinear Estimation.

29 Boosting
Goal: Improve misclassification rates.
Method: Sequentially fit models, each weighting more heavily those observations poorly predicted by the previous model.
Properties: Bias and variance reduction; easy to implement; theory is not fully (but almost) explained.

30 Origin of Boosting. Classification problems: {y, x}_i, i = 1, ..., n, with y ∈ {0, 1}. The task: construct a function F(x): x → {0, 1} so that F minimizes misclassification error.

31 Generic boosting algorithm
Equally weight the observations (y, x)_i.
For t in 1, ..., T:
- Using the weights, fit a classifier f_t(x) to y.
- Upweight the poorly predicted observations; downweight the well-predicted observations.
Merge f_1, ..., f_T to form the boosted classifier.

32 Real AdaBoost (Schapire & Singer 1998). Start with y_i ∈ {-1, 1} and w_i = 1/N.
For t in 1, ..., T do:
1. Estimate P_w(y = 1 | x) using the weights.
2. Set f_t(x) = (1/2) log [ P̂_w(y = 1 | x) / P̂_w(y = -1 | x) ].
3. Set w_i ← w_i exp(-y_i f_t(x_i)) and renormalize.
Output the classifier F(x) = sign( Σ_t f_t(x) ).
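
A sketch of these steps (assuming a shallow scikit-learn tree as the weighted class-probability estimator; the probability clipping is a practical guard not shown on the slide):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, T=50, eps=1e-6):
    """Real AdaBoost sketch: y must be coded in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # w_i = 1/N
    stages = []
    for _ in range(T):
        base = DecisionTreeClassifier(max_depth=2)
        base.fit(X, y, sample_weight=w)            # estimate P_w(y = 1 | x)
        pos = list(base.classes_).index(1)
        p = np.clip(base.predict_proba(X)[:, pos], eps, 1 - eps)
        f = 0.5 * np.log(p / (1 - p))              # f_t(x) = 1/2 log[P_w(y=1|x) / P_w(y=-1|x)]
        w = w * np.exp(-y * f)                     # reweight ...
        w /= w.sum()                               # ... and renormalize
        stages.append(base)
    return stages

def adaboost_predict(stages, X_new, eps=1e-6):
    """F(x) = sign( sum_t f_t(x) )."""
    F = np.zeros(len(X_new))
    for base in stages:
        pos = list(base.classes_).index(1)
        p = np.clip(base.predict_proba(X_new)[:, pos], eps, 1 - eps)
        F += 0.5 * np.log(p / (1 - p))
    return np.sign(F)
```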

33 AdaBoost's Performance (Freund & Schapire [1996]). Leo Breiman: "AdaBoost with trees is the best off-the-shelf classifier in the world." It performs well with many base classifiers and in a variety of problem domains. AdaBoost is generally slow to overfit. Boosted naïve Bayes tied for first place in the 1997 KDD Cup (Elkan [1997]). Boosted naïve Bayes is a scalable, interpretable classifier (Ridgeway et al. [1998]).

34 Boosting Example [scatterplot of a two-class dataset in features x1 and x2]

35 After one iteration [plot of the first CART splits; larger points have greater weight]

36 After 3 iterations [plot]

37 After 20 iterations [plot]

38 Decision boundary after 100 iterations [figure]

39 Boosting as optimization. Friedman, Hastie, and Tibshirani [1998]: AdaBoost is an optimization method for finding a classifier. Let y ∈ {-1, 1} and F(x) ∈ (-∞, ∞), and consider the criterion J(F) = E( e^{-yF(x)} | x ).

40 Criterion. E(e^{-yF(x)}) bounds the misclassification rate, since I(yF(x) < 0) ≤ e^{-yF(x)}. The minimizer of E(e^{-yF(x)}) coincides with the maximizer of the expected Bernoulli log-likelihood, E ℓ(p(x), y) = -E log(1 + e^{-2yF(x)}).

41 Optimization step. Select f to minimize J(F + f) = E( e^{-y(F(x) + f(x))} | x ). This yields the update F^{(t+1)}(x) = F^{(t)}(x) + (1/2) log [ E_w(I(y = 1) | x) / E_w(I(y = -1) | x) ], where the weights are w(x, y) = e^{-yF^{(t)}(x)}.

42 LogitBoost (Friedman, Hastie, Tibshirani [1998]). Logistic regression: y = 1 with probability p(x) and 0 with probability 1 - p(x), where p(x) = 1 / (1 + e^{-F(x)}). The expected log-likelihood of a regressor F(x) is E ℓ(F) = E( yF(x) - log(1 + e^{F(x)}) | x ).

43 Newton steps. Iterate to optimize the expected log-likelihood J(F + f) = E( y(F(x) + f(x)) - log(1 + e^{F(x) + f(x)}) | x ). Each Newton step updates F^{(t+1)}(x) = F^{(t)}(x) - [ ∂J(F^{(t)} + f)/∂f ] / [ ∂²J(F^{(t)} + f)/∂f² ], with the derivatives evaluated at f = 0.

44 LogitBoost, continued. The Newton step for the Bernoulli likelihood is F(x) ← F(x) + E_w( (y - p(x)) / (p(x)(1 - p(x))) | x ), with weights w(x) = p(x)(1 - p(x)). In practice the weighted regressor E_w( · | x) can be any regressor: trees, smoothers, etc. Trees are adaptive and work well for high-dimensional data.
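
A compact sketch of that Newton step (using a small scikit-learn regression tree for the weighted regression E_w; the depth, number of iterations, and weight clipping are hypothetical choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y01, T=50, eps=1e-5):
    """LogitBoost sketch; y01 coded in {0, 1}."""
    F = np.zeros(len(y01))
    stages = []
    for _ in range(T):
        p = 1.0 / (1.0 + np.exp(-F))               # p(x) = 1 / (1 + exp(-F(x)))
        w = np.clip(p * (1 - p), eps, None)        # Newton weights w(x) = p(x)(1 - p(x))
        z = (y01 - p) / w                          # working response (y - p) / (p(1 - p))
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, z, sample_weight=w)            # weighted regression approximates E_w(z | x)
        F = F + tree.predict(X)                    # Newton update of F
        stages.append(tree)
    return stages

def logitboost_predict_proba(stages, X_new):
    F = np.sum([t.predict(X_new) for t in stages], axis=0)
    return 1.0 / (1.0 + np.exp(-F))
```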

45 Misclassification rates (Friedman, Hastie, Tibshirani [1998]) [table comparing CART, AdaBoost with CART, and LogitBoost with CART on the Breast, Ionosphere, Glass, Sonar, and Waveform datasets; the numeric entries were lost in transcription]

46 Boosting References
- Rob Schapire's homepage.
- Freund, Y. and R. Schapire (1996). Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the 13th International Conference.
- Jerry Friedman's homepage.
- Friedman, J., T. Hastie, and R. Tibshirani (1998). Additive Logistic Regression: A Statistical View of Boosting. Technical report, Statistics Department, Stanford University.

47 In general, combining ("bundling") estimators consists of two steps: 1) constructing varied models, and 2) combining their estimates.
Generate component models by varying: case weights, data values, guiding parameters, variable subsets.
Combine estimates using: estimator weights, voting, advisor perceptrons, partitions of the design space X.

48 Other Bundling Techniques
We've examined:
- Bayesian Model Averaging: sum estimates of the possible models, weighted by posterior evidence.
- Bagging (Breiman 96) (bootstrap aggregating): bootstrap the data (mostly to build trees); take a majority vote or average.
- Boosting (Freund & Schapire 96): weight error cases by β_t = (1 - e(t)) / e(t), iteratively re-model; average, weighting model t by ln(β_t).
Additional example techniques:
- GMDH (Ivakhnenko 68): multiple layers of quadratic polynomials, using two inputs each, fit by linear regression.
- Stacking (Wolpert 92): train a 2nd-level (linear regression) model using leave-one-out estimates of the 1st-level (neural net) models; see the sketch below.
- ARCing (Breiman 96) (Adaptive Resampling and Combining): bagging with reweighting of error cases; a superset of boosting.
- Bumping (Tibshirani 97): bootstrap, then select the single best model.
- Crumpling (Anderson & Elder 98): average cross-validations.
- Born-Again (Breiman 98): invent new X data.
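
A rough sketch of stacking (substituting cross-validated predictions for Wolpert's leave-one-out estimates, and simple scikit-learn learners for the neural nets mentioned above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def stacked_predict(X, y, X_new):
    """Stacking sketch: a 2nd-level linear model combines out-of-fold 1st-level predictions."""
    level1 = [DecisionTreeRegressor(max_depth=4), KNeighborsRegressor(n_neighbors=5)]
    # out-of-fold predictions stand in for the leave-one-out estimates
    Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in level1])
    combiner = LinearRegression().fit(Z, y)        # 2nd-level model
    for m in level1:
        m.fit(X, y)                                # refit 1st-level models on all the data
    Z_new = np.column_stack([m.predict(X_new) for m in level1])
    return combiner.predict(Z_new)
```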

49 Group Method of Data Handling (GMDH) [network diagram: each node takes two inputs, e.g. a and b, and computes a quadratic polynomial A_0 + A_1 a + A_2 b + A_3 a^2 + A_4 ab + A_5 b^2; node outputs z_i feed later layers until the estimate y_est emerges]. Try all pairs of variables (K choose 2) in quadratic polynomial nodes. Fit the coefficients using regression. Keep the best M nodes. Train the model on one training data set and score it on a test data set. (A third data set is needed for independent confirmation of the model.)

50 Polynomial Networks (ASPN) [network diagram: Layer 0 normalizer nodes feed layers of Single, Double, and Triple polynomial nodes and a Multi-Linear node, ending in unitizer nodes that produce the outputs Y1 and Y2; an example triple node, z17, is a cubic polynomial in its three inputs a, b, and c]

51 When does Bundling work? Hypotheses:
- Breiman (1996): when the prediction method is unstable (significantly different models are constructed).
- Ali & Pazzani (1996): when there is low noise, many irrelevant variables, and good individual predictors that make different errors.
- When models are slightly overfit.
- When models are from different families.

52 Advanced techniques: stochastic gradient boosting; adaptive bagging; example regression and classification results.

53 Stochastic Gradient Boosting
Goal: Non-parametric function estimation.
Method: Cast the problem as optimization and use gradient ascent to obtain the predictor.
Properties: Bias and variance reduction; widely applicable; can make use of existing algorithms; many tuning parameters.

54 Improving boosting. Boosting usually has the form F^{(t+1)}(x) = F^{(t)}(x) + λ E_w( z(y, x) | x ). Improve by: sub-sampling a fraction of the data at each step when computing the expectation; robustifying the expectation; trimming observations with small weights.
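
A rough sketch of this update for squared-error loss, showing both the shrinkage λ and the per-step sub-sampling (hypothetical parameter values; trees via scikit-learn):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sgb_fit(X, y, T=200, lam=0.05, subsample=0.5, seed=0):
    """Stochastic gradient boosting sketch: F <- F + lam * E(z | x), where z is the
    negative gradient of squared-error loss (the residual), fit on a random subsample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    init = y.mean()
    F = np.full(n, init)
    trees = []
    for _ in range(T):
        z = y - F                                   # negative gradient of squared error
        idx = rng.choice(n, size=int(subsample * n), replace=False)  # subsample each step
        tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], z[idx])
        F += lam * tree.predict(X)                  # shrunken (lambda-scaled) update
        trees.append(tree)
    return init, trees, lam

def sgb_predict(init, trees, lam, X_new):
    return init + lam * np.sum([t.predict(X_new) for t in trees], axis=0)
```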

55 Stochastic gradient boosting offers...
- Application to likelihood-based models (GLMs, Cox models)
- Bias reduction (non-linear fitting)
- Massive datasets (bagging, trimming)
- Variance reduction (bagging)
- Interpretability (additive models)
- High-dimensional regression (trees)
- Robust regression

56 SGB References
- Friedman, J. (1999). Greedy Function Approximation: A Gradient Boosting Machine. Technical report, Dept. of Statistics, Stanford University.
- Friedman, J. (1999). Stochastic Gradient Boosting. Technical report, Dept. of Statistics, Stanford University.

57 Adaptive Bagging
Goal: Bias and variance reduction.
Method: Sequentially fit bagged models, where each fits the current residuals.
Properties: Bias and variance reduction; no tuning parameters.

58 Adaptive bagging algorithm
1. Fit a bagged regressor to the dataset D.
2. Predict the out-of-bag observations.
3. Fit a new bagged regressor to the bias (the out-of-bag errors) and repeat.
For a new observation, sum the predictions from each stage.
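
A sketch of those stages (bagged regression trees via scikit-learn; the handling of cases that never fall out-of-bag is a pragmatic assumption, not from the slide):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def adaptive_bagging(X, y, stages=3, K=50, seed=0):
    """Adaptive bagging sketch: each stage is a bagged regressor fit to the
    out-of-bag residuals left by the previous stage."""
    rng = np.random.default_rng(seed)
    n = len(y)
    target = y.astype(float).copy()
    all_stages = []
    for _ in range(stages):
        trees, oob_sum, oob_cnt = [], np.zeros(n), np.zeros(n)
        for _ in range(K):
            idx = rng.integers(0, n, n)
            oob = np.setdiff1d(np.arange(n), idx)     # cases not in this bootstrap sample
            t = DecisionTreeRegressor().fit(X[idx], target[idx])
            trees.append(t)
            oob_sum[oob] += t.predict(X[oob])
            oob_cnt[oob] += 1
        oob_pred = np.where(oob_cnt > 0, oob_sum / np.maximum(oob_cnt, 1), 0.0)
        all_stages.append(trees)
        target = target - oob_pred                    # next stage fits the remaining bias
    return all_stages

def adaptive_bagging_predict(all_stages, X_new):
    # sum the bagged predictions from each stage
    return sum(np.mean([t.predict(X_new) for t in trees], axis=0) for trees in all_stages)
```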

59 Regression results [table of squared-error loss for bagging vs. adaptive bagging on the Abalone, Robot Arm, Peak20, Boston Housing, Ozone, Servo, and Friedman #1-#3 benchmarks; the numeric entries were lost in transcription]

60 Classification results [table of misclassification rates for AdaBoost vs. adaptive bagging on the Breast, Ionosphere, Sonar, Heart, German Credit, Votes, Liver, and Diabetes datasets; the numeric entries were lost in transcription]

61 Relative Performance Examples: 5 Algorithms on 6 Datasets (John Elder, Elder Research & Stephen Lee, U. Idaho, 1997) [chart comparing neural networks, logistic regression, linear vector quantization, projection pursuit regression, and decision trees on the Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment datasets]

62 Essentially every bundling method improves performance [chart of relative error for the advisor perceptron, AP weighted average, vote, and average combinations on the Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment datasets]

63 Application Example: Direct Marketing (Elder Research). Model respondents to direct marketing as a binary variable: 0 (no response), 1 (response). Create models using several (here, 5) different algorithms, all employing the same candidate model inputs. Rank-order the model responses: give the highest-probability response a rank of 1, the second highest a rank of 2, etc. For bundling, combine the model ranks (not the estimates) into a new consensus estimate (which is again ranked). Report the number of response cases missed (in the top portion of the ranked list).
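
The rank-combination step might look like this (a hypothetical helper in plain NumPy; the five model types above would supply the score lists):

```python
import numpy as np

def to_rank(scores):
    """Rank cases so the highest score (most likely responder) gets rank 1."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def consensus_rank(score_lists):
    """Average each model's rank for every case, then rank the averages again."""
    avg_rank = np.mean([to_rank(s) for s in score_lists], axis=0)
    return to_rank(-avg_rank)      # smallest average rank becomes consensus rank 1
```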

64 Marketing Application Performance [chart: number of response cases missed (roughly 55-80) vs. number of models combined by averaging output rank; the single models are Stepwise regression (S), Polynomial network (P), Neural network (N), MARS (M), and Trees (T), and letter codes such as NT and SMPT denote their combinations]

65 Median (and mean) error is reduced with each stage of combination [chart: cases missed vs. number of models in the combination; fewer misses is better]

66 Why Bundling works
- (Semi-)independent estimators
- Bayes' rule: weighing evidence
- Shrinking (e.g., stepwise linear regression)
- Smoothing (e.g., decision trees)
- "...and in a multitude of counselors there is safety." (Proverbs 24:6b)
- Additive modeling and maximum likelihood (Friedman, Hastie, & Tibshirani, 8/20/98)
This is an open research area. Meanwhile, we recommend bundling competing candidate models both within, and between, model families.
