Boosting
Jiahui Shen
October 27th
2. Target of Boosting
[Figure: weak learners] [Figure: combined learner]
3. Boosting: introduction and notation
Boosting combines weak learners into a strong one iteratively.
Weak learner (base learner, dictionary element): a learner whose error rate is only slightly better than random guessing.
Notation: samples $x$ and responses $y$; weak learners $f$ (or $T$ for trees); combined (strong) learner $F$; iteration index $k$; loss function $L$; weights for the learners $\alpha, \beta$; $\langle a, b \rangle$ denotes the inner product between $a$ and $b$.
4. Example: Boosting based on trees
Each tree is a weak learner $T$. Boosting combines different trees $T_k$ into a strong learner $F = \sum_k \alpha_k T_k$ through linear combination.
[Figure: the combined learner can be regarded as a linear combination of trees]
5. Discrete AdaBoost
Initialize each sample with the same weight.
Select the base learner with the lowest weighted error.
Add this base learner with a coefficient based on its accuracy.
After adding the selected base learner, increase the weights of the misclassified samples.
Repeat the above three steps (a code sketch follows below).
[Figure: Discrete AdaBoost; each stump is a simple weak learner]
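To make these steps concrete, here is a minimal Python sketch of discrete AdaBoost with decision stumps as weak learners. The brute-force stump search and the helper names are illustrative assumptions, not from the slides.

```python
# Minimal discrete AdaBoost sketch with decision stumps; labels y in {-1, +1}.
import numpy as np

def stump_predict(X, j, t, s):
    """Predict +/-1 by thresholding feature j at t with orientation s."""
    pred = s * np.sign(X[:, j] - t)
    pred[pred == 0] = s
    return pred

def fit_stump(X, y, w):
    """Brute-force search for the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)                  # (error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                err = w[stump_predict(X, j, t, s) != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, K=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                     # equal initial sample weights
    learners = []
    for _ in range(K):
        err, j, t, s = fit_stump(X, y, w)       # lowest weighted error
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # coefficient based on accuracy
        w *= np.exp(-alpha * y * stump_predict(X, j, t, s))  # up-weight mistakes
        w /= w.sum()
        learners.append((alpha, j, t, s))
    return learners

def predict(learners, X):
    F = sum(a * stump_predict(X, j, t, s) for a, j, t, s in learners)
    return np.sign(F)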
6. Discrete AdaBoost: another interpretation
AdaBoost fits an additive model
$F(x) = \sum_{k=1}^{K} \alpha_k f(x; w_k)$
by minimizing a loss function $L(y, F(x))$, where $\alpha_k$, $k = 1, 2, \ldots, K$, are the coefficients and $f(x; w_k)$ is the $k$-th weak learner characterized by the parameter $w_k$.
In each iteration, solve
$(\alpha_k, w_k) = \arg\min_{\alpha, w} \sum_{i=1}^{N} L(y_i, F_{k-1}(x_i) + \alpha f(x_i; w))$
and update the model as $F_k(x) = F_{k-1}(x) + \alpha_k f_k(x; w_k)$.
By fitting an additive model of different simple functions, boosting expands the class of functions that can be approximated.
7. Characteristics of Boosting
Start with an empty model and gradually add new weak learners.
Select (or build) base learners in a greedy way.
Add the selected base learner with a coefficient (or coefficients).
8. Greedy algorithms
Boosting algorithms are all greedy algorithms.
Greedy: find a local optimum in each iteration; the result may not be optimal from a global view.
[Figure: achieving the maximum number along a path: greedy result vs. globally optimal result]
9. AdaBoost variants
[Figure: a (possibly incomplete) timeline of AdaBoost variants]
10. Key ingredients in Boosting
Choice of the loss function: exponential loss, logistic loss, savage loss.
Base learner selection: neural networks, wavelets, trees.
Selection criterion: how to choose a new classifier.
Iterative format: how to define a new estimator from the selected classifiers.
Termination rule: when to terminate the learning process.
11. Choice of the loss function
Convex loss functions (unbounded, hence sensitive to noise):
Exponential loss: $L(y, F(x)) = e^{-yF(x)}$
Quadratic loss: $L(y, F(x)) = (y - F(x))^2$
Logistic loss: $L(y, F(x)) = \log(1 + e^{-yF(x)})$
Non-convex loss function (bounded, to deal with noise):
Savage loss: $L(y, F(x)) = \frac{1}{(1 + e^{2yF(x)})^2}$
(A code sketch of these losses follows below.)
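A small Python sketch of the four losses above, for labels $y \in \{-1, +1\}$ and a real-valued score $F(x)$; a minimal vectorized illustration.

```python
import numpy as np

def exponential_loss(y, F):
    return np.exp(-y * F)

def quadratic_loss(y, F):
    return (y - F) ** 2

def logistic_loss(y, F):
    return np.log1p(np.exp(-y * F))            # log(1 + e^{-yF})

def savage_loss(y, F):
    return 1.0 / (1.0 + np.exp(2 * y * F)) ** 2
```

Note that the savage loss is bounded above by 1, while the three convex losses grow without bound as the margin $yF(x)$ becomes negative; this boundedness is what limits the influence of mislabeled points.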
12. Base learner selection
Single-hidden-layer neural network: $f(x; w) = 1/(1 + e^{-w^T x})$, where $w$ parameterizes a linear combination of the input variables.
Tree: $f_k(x) = \sum_{j=1}^{J_k} b_j I(x \in R_j)$, where $J_k$ is the number of leaves of decision tree $f_k$ at iteration $k$ and $b_j$ is the value predicted in region $R_j$.
[Figure: a decision tree]
Other base learners: wavelets, radial basis functions, etc.
13. Selection criterion
In the $k$-th iteration, select the weak learner that has the largest inner product with the residual $r = y - F_{k-1}$ (a code sketch follows below).
Generalize the idea of the residual to the negative gradient: select the weak learner with the largest inner product with $r = -\left[\frac{\partial L(y, F(x))}{\partial F(x)}\right]_{F(x) = F_{k-1}(x)}$. (With quadratic loss, this criterion coincides with the first one.)
Similar to the above two criteria, but choose multiple weak learners at a time.
Select an arbitrary base learner whose inner product with the residual is larger than a predefined threshold $\sigma$ (being less greedy; proposed by Xu, Lin, et al., 2017).
Other criteria in a greedy sense.
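A sketch of the first criterion, assuming the weak learners are collected as the columns of a dictionary matrix `D` evaluated on the $n$ samples (an illustrative setup, not the slides' notation):

```python
import numpy as np

def select_weak_learner(D, r):
    """Return the index of the column of D most correlated with the residual r."""
    scores = np.abs(D.T @ r)   # inner products <r, f_k> for all learners at once
    return int(np.argmax(scores))
```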
14. Iterative format
Add the weak learner with a fixed step-length factor $\nu$.
Add the weak learner with a fixed infinitesimal coefficient $\epsilon$.
Commonly used iterative schemes in the high-dimensional regression setting (many more weak learners than samples): pure greedy algorithm (PGA), orthogonal greedy algorithm (OGA), relaxed greedy algorithm (RGA).
Design backward steps: delete useless weak learners added in previous iterations.
15. Iterative format (PGA, OGA, RGA)
All three algorithms find the weak learner $f_k$ that has the largest inner product with the residual $r = y - F_{k-1}$.
Pure greedy algorithm: add the weak learner with weight $\langle r, f_k \rangle / \|f_k\|^2$.
Orthogonal greedy algorithm: do a fully corrective step each time; add the weak learner with the calculated weight and adjust the weights of all previous weak learners (project $y$ onto the span of $\{f_1, \ldots, f_k\}$).
Relaxed greedy algorithm: add the weak learner with some weight and shrink the previous weights by a common factor, i.e. $F_k = \alpha_k F_{k-1} + \beta_k f_k$, where $(\alpha_k, \beta_k, f_k) = \arg\min_{\alpha, \beta, f} \|y - \alpha F_{k-1} - \beta f\|$.
PGA does not work well (it even has a lower bound on its approximation error); OGA and RGA achieve similar performance, but OGA is computationally more expensive.
(A code sketch of the three updates follows below.)
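A minimal sketch of one iteration of each scheme, again with a dictionary matrix `D` whose columns are unit-norm weak learners (a simplifying assumption); the RGA step here solves the two-variable least squares exactly rather than following a relaxation schedule.

```python
import numpy as np

def pga_step(D, y, F):
    """Pure greedy: add the most correlated learner with weight <r, f_k>."""
    r = y - F
    k = int(np.argmax(np.abs(D.T @ r)))
    return F + (D[:, k] @ r) * D[:, k]

def oga_step(D, y, F, support):
    """Orthogonal greedy: fully corrective projection onto the enlarged support."""
    r = y - F
    support = support | {int(np.argmax(np.abs(D.T @ r)))}
    S = D[:, sorted(support)]
    coef = np.linalg.lstsq(S, y, rcond=None)[0]   # project y onto span{f_1..f_k}
    return S @ coef, support

def rga_step(D, y, F):
    """Relaxed greedy: best (alpha, beta, f) by two-variable least squares."""
    best_err, best_F = np.inf, F
    for k in range(D.shape[1]):
        A = np.column_stack([F, D[:, k]])
        coef = np.linalg.lstsq(A, y, rcond=None)[0]
        cand = A @ coef
        err = float(np.linalg.norm(y - cand))
        if err < best_err:
            best_err, best_F = err, cand
    return best_F
```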
16. Representative Boosting algorithms in more detail
Boosting for classification: AdaBoost, LogitBoost, SavageBoost.
Boosting for regression: L2 Boosting, incremental forward stagewise regression (FS$_\epsilon$).
A more general Boosting algorithm: Gradient (Tree) Boosting.
17. LogitBoost
Less sensitive to outliers than AdaBoost.
Choice of the loss function: logistic loss.
Selection criterion: regard the strong learner as the critical point $F(x)$ that minimizes the loss $\ln(1 + \exp(-yF(x)))$; based on Newton's method, fit a weak learner $f_k$ to $-H^{-1} s$, where
$s(x) = \left[\frac{\partial L(y, F(x) + f(x))}{\partial f(x)}\right]_{f(x) = 0}$ and $H(x) = \left[\frac{\partial^2 L(y, F(x) + f(x))}{\partial f(x)^2}\right]_{f(x) = 0}$,
so that $F_k = F_{k-1} + f_k$ approximates Newton's update.
Iterative format: add the weak learner directly ($F_k = F_{k-1} + f_k$). (A code sketch follows below.)
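A sketch of LogitBoost in the style of Friedman, Hastie & Tibshirani (2000), with labels $y^* \in \{0, 1\}$. The weighted regression fitter `fit_reg(X, z, w)` is a placeholder for any weak regression learner returning a callable; this interface is an assumption for illustration, not the slides' API.

```python
import numpy as np

def logitboost(X, y01, fit_reg, K=50):
    n = len(y01)
    F = np.zeros(n)                           # additive model values F(x_i)
    p = np.full(n, 0.5)                       # current estimates P(y = 1 | x)
    learners = []
    for _ in range(K):
        w = np.clip(p * (1 - p), 1e-6, None)  # Newton weights (the Hessian H)
        z = (y01 - p) / w                     # working response, i.e. -H^{-1} s
        f = fit_reg(X, z, w)                  # weighted least-squares weak fit
        learners.append(f)
        F += 0.5 * f(X)                       # approximate Newton update
        p = 1.0 / (1.0 + np.exp(-2.0 * F))
    return learners, F
```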
18. SavageBoost
Handles outliers by using a non-convex loss function.
Boosting with the savage loss, proposed by Masnadi-Shirazi, H., & Vasconcelos, N. (2009).
[Figure: experimental results on the Liver-Disorders set (a binary UCI data set)]
19. L2 Boosting
Under the regression setting, $y$ can take continuous values.
Choice of the loss function: quadratic loss.
Selection criterion: fit a weak learner $f_k$ to the residual $r = y - F_{k-1}(x)$ in each iteration.
Iterative format: add the weak learner with a fixed step-length factor $\nu$ ($F_k = F_{k-1} + \nu f_k$).
A common choice for $\nu$ is 1 or a smaller value (e.g. 0.3). (A code sketch follows below.)
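A minimal sketch of L2 Boosting on a fixed dictionary `D` with unit-norm columns (a simplifying assumption); `nu` is the step-length factor from the slide.

```python
import numpy as np

def l2_boosting(D, y, K=100, nu=0.3):
    F = np.zeros_like(y, dtype=float)
    coefs = np.zeros(D.shape[1])
    for _ in range(K):
        r = y - F                               # fit the residual each iteration
        k = int(np.argmax(np.abs(D.T @ r)))     # best-fitting weak learner
        step = D[:, k] @ r                      # its least-squares coefficient
        coefs[k] += nu * step
        F += nu * step * D[:, k]                # F_k = F_{k-1} + nu * f_k
    return coefs, F
```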
20. Incremental forward stagewise regression (FS$_\epsilon$)
Recall: Boosting fits an additive model $F(x) = \sum_{k=1}^{K} \alpha_k T_k(x)$ by minimizing the loss function $L(y, F(x))$.
When the weak learners greatly outnumber the samples, finding an optimal model is a high-dimensional problem.
Incremental forward stagewise regression approximates the effect of the LASSO with a greedy algorithm.
Choice of the loss function: quadratic loss.
Selection criterion: select the base learner that best fits the residuals.
Iterative format: change the coefficient by an infinitesimal amount.
21. Incremental forward stagewise regression (FS$_\epsilon$)
Initialize all coefficients $\alpha_k$ to zero.
Calculate the residual $r = y - \sum_{k=1}^{K} \alpha_k T_k(x)$.
Select the base learner that best fits the current residual: $(\hat{\alpha}_k, k) = \arg\min_{\alpha, k} \|r - \alpha T_k(x)\|^2$.
Update $\alpha_k \leftarrow \alpha_k + \epsilon \, \mathrm{sign}(\hat{\alpha}_k)$ and repeat. (A code sketch follows below.)
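A minimal sketch of FS$_\epsilon$ in the same dictionary notation (unit-norm columns assumed):

```python
import numpy as np

def forward_stagewise(D, y, eps=0.01, n_iter=5000):
    alpha = np.zeros(D.shape[1])
    F = np.zeros_like(y, dtype=float)
    for _ in range(n_iter):
        r = y - F                               # residual under current coefficients
        corr = D.T @ r
        k = int(np.argmax(np.abs(corr)))        # learner that best fits the residual
        delta = eps * np.sign(corr[k])          # move its coefficient by epsilon
        alpha[k] += delta
        F += delta * D[:, k]
    return alpha
```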
22. Comparison between FS$_\epsilon$ and LASSO
LASSO solves the problem from an optimization view:
$\min_{\alpha \in \mathbb{R}^p} \frac{1}{N} \|y - T\alpha\|_2^2 + \lambda \|\alpha\|_1$
but with a very large number of base learners $T_k$, directly solving a LASSO problem is not feasible.
FS$_\epsilon$ solves the problem from a greedy view.
With $K < \infty$ iterations, many of the coefficients remain zero, while the others tend to have absolute values smaller than their corresponding least-squares solution values.
Therefore the $K$-iteration solution qualitatively resembles the LASSO, with $K$ inversely related to $\lambda$.
23. Comparison between FS$_\epsilon$ and LASSO
[Figure: solution paths of FS$_\epsilon$ and the LASSO]
24. Comparison between PGA, L2 Boosting, and FS$_\epsilon$
All three algorithms find the weak learner $f_k$ that has the largest inner product with the residual $r = y - F_{k-1}$.
The pure greedy algorithm (PGA) adds $f_k$ with weight $\langle r, f_k \rangle / \|f_k\|^2$.
L2 Boosting adds $f_k$ with weight $\nu$, normally set to 1 or a smaller value (e.g. 0.3).
FS$_\epsilon$ adds $f_k$ with an infinitesimal weight $\epsilon$.
PGA can be too greedy, and L2 Boosting also cannot achieve good performance; for both algorithms, the weights set in previous iterations are never modified in later iterations.
FS$_\epsilon$ changes weights by an extremely small amount ($\epsilon$) each time, which has an $L_1$ regularization effect and yields a result similar to the LASSO.
25. Gradient (Tree) Boosting
Choice of the loss function: any differentiable loss function in general.
Selection criterion: fit a base learner $f_k(x)$ to the negative gradient $r = -\left[\frac{\partial L(y, F(x))}{\partial F(x)}\right]_{F(x) = F_{k-1}(x)}$.
Calculate the weight of the base learner by line search:
$\gamma_k = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{k-1}(x_i) + \gamma f_k(x_i))$
Iterative format: $F_k = F_{k-1} + \gamma_k f_k$.
When decision trees are the weak learners, choose a separate optimal value $\gamma_{jk}$ for each of the tree's regions $R_j$, instead of a single $\gamma_k$ for the whole tree. (A code sketch follows below.)
[Demonstration]
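A minimal sketch of gradient boosting for an arbitrary differentiable loss, supplied through its negative gradient. It uses scikit-learn regression trees as base learners and a fixed shrinkage factor in place of the per-region line search (a simplification of the scheme above).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_grad, K=100, lr=0.1, max_depth=3):
    F = np.zeros(len(y))
    trees = []
    for _ in range(K):
        r = neg_grad(y, F)                       # pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        F += lr * tree.predict(X)                # F_k = F_{k-1} + lr * f_k
        trees.append(tree)
    return trees

# Example: quadratic loss L = (y - F)^2 / 2 has negative gradient y - F,
# recovering L2 Boosting on trees:
# trees = gradient_boost(X, y, neg_grad=lambda y, F: y - F)
```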
26. Applications of Boosting algorithms
Regression and classification problems: gradient boosting decision trees (GBDT), XGBoost, LightGBM.
Compressive sensing: compressive sampling matching pursuit (CoSaMP).
Efficient sparse learning and feature selection: forward-backward greedy algorithm (FoBa) and its gradient variant (FoBa-gdt).
27. Gradient Boosting decision tree (GBDT)
Gradient Boosting based on decision trees, for classification and regression problems.
If the weak learners are not given in advance, build them during the learning process.
Given a tree structure, determine the quality of the tree with a score function.
Exact greedy algorithm: find the best split by computing the scores of all possible splits (based on the current structure).
28. Gradient Tree Boosting (XGBoost)
A very popular package for gradient tree boosting.
Built on GBDT; finds splits by searching a well-chosen subset of all possible splits.
Handles sparsity patterns in a unified way.
[Figure: tree structure with default directions]
(A usage sketch follows below.)
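A minimal usage sketch of the xgboost Python package on toy data; the hyperparameter values are illustrative only.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(
    n_estimators=200,     # boosting rounds
    learning_rate=0.1,    # shrinkage (step-length factor)
    max_depth=4,          # depth of each tree
    reg_lambda=1.0,       # L2 penalty on leaf weights (complexity regularizer)
)
model.fit(X, y)
pred = model.predict(X)
```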
29. Gradient Tree Boosting (LightGBM)
A newer package that runs even faster than XGBoost in some situations.
Similar to XGBoost, but uses histograms to find the splits used to build trees.
Grows trees leaf-wise, unlike XGBoost (level-wise).
Comparison between XGBoost and LightGBM. (A usage sketch follows below.)
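A matching usage sketch with the lightgbm package; note that leaf-wise growth is controlled by the number of leaves rather than the tree depth (hyperparameters again illustrative).

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = lgb.LGBMRegressor(
    n_estimators=200,
    learning_rate=0.1,
    num_leaves=31,        # leaf count bounds leaf-wise tree growth
)
model.fit(X, y)
pred = model.predict(X)
```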
30. Some remarks on Gradient Boosting
Randomization: re-sample features and samples in each iteration to build more effective weak learners.
A regularization term is often added to penalize the complexity of the trees.
31. Compressive sampling matching pursuit (CoSaMP)
A commonly used greedy algorithm for compressive sensing.
A more advanced algorithm compared to PGA, OGA, and RGA.
Choice of the loss function: quadratic loss.
Selection criterion: choose multiple weak learners at a time, those with the largest inner products with the residual.
Iterative format: the same as OGA (projection).
32. Compressive sampling matching pursuit (CoSaMP)
Initialize the support set $\Omega$ to be empty.
Calculate the residual $r = y - F(x)$ and $g_k = \langle r, f_k(x) \rangle$ for each weak learner $f_k$.
Find the largest $2s$ components of $g = (g_1, \ldots, g_K)$ and add the corresponding $2s$ learners $f_k$ to $\Omega$.
Let $F_k$ be the orthogonal projection of $y$ onto the span of $f_\Omega$.
Keep the $s$ base learners with the largest entries in $F_k$ to compute the new residual; repeat.
Output the $s$ base learners with the largest entries from the last iteration. (A code sketch follows below.)
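A minimal sketch of CoSaMP for $y \approx T\alpha$ with an $s$-sparse $\alpha$; unit-norm columns of `T` are assumed for simplicity.

```python
import numpy as np

def cosamp(T, y, s, n_iter=20):
    n, p = T.shape
    alpha = np.zeros(p)
    for _ in range(n_iter):
        r = y - T @ alpha                               # current residual
        g = T.T @ r                                     # g_k = <r, f_k>
        omega = set(np.argsort(np.abs(g))[-2 * s:])     # 2s best new candidates
        omega |= set(np.flatnonzero(alpha))             # merge with current support
        idx = sorted(omega)
        b = np.zeros(p)
        b[idx] = np.linalg.lstsq(T[:, idx], y, rcond=None)[0]  # orthogonal projection
        keep = np.argsort(np.abs(b))[-s:]               # prune back to s entries
        alpha = np.zeros(p)
        alpha[keep] = b[keep]
    return alpha
```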
33. Restricted Isometry Property
The restricted isometry property (RIP) characterizes matrices that are nearly orthonormal, at least when operating on sparse vectors.
Let $T$ be an $n \times p$ matrix and let $1 \le s \le p$ be an integer. Suppose there exists a constant $\delta_s \in (0, 1)$ such that, for every $n \times s$ submatrix $T_s$ of $T$ and every $s$-sparse vector $\alpha$,
$(1 - \delta_s) \|\alpha\|_2^2 \le \|T_s \alpha\|_2^2 \le (1 + \delta_s) \|\alpha\|_2^2.$
Then the matrix $T$ is said to satisfy the $s$-restricted isometry property with restricted isometry constant $\delta_s$.
Intuition: no small set of features (on the order of the desired sparsity level) is highly correlated.
Suitable random matrices (e.g. with independent Gaussian entries) satisfy RIP with high probability.
34. Some theoretical analysis of CoSaMP
Suppose the sampling matrix $T$ has restricted isometry constant $\delta_{2s} \le c$. Let $y = T\alpha + \epsilon$ be measurements of an arbitrary signal with noise $\epsilon$. CoSaMP produces a $2s$-sparse approximation $\beta$ that satisfies
$\|\alpha - \beta\|_2 \le C \max\left\{\eta, \frac{1}{\sqrt{s}} \|\alpha - \alpha_s\|_1 + \|\epsilon\|_2\right\}$
where $\alpha_s$ is the best $s$-sparse approximation to $\alpha$.
CoSaMP provides rigorous bounds on the runtime and can deal with contaminated samples.
35. Forward algorithm
The forward algorithm is not prone to overfitting.
However, the forward algorithm can never correct mistakes made in earlier steps.
[Figure: failure of the forward greedy algorithm]
36. Forward-backward greedy algorithm
The backward algorithm starts with a full model and greedily removes features.
However, the backward algorithm needs to start from a sparse/non-overfitted model.
Forward-backward greedy algorithm (FoBa): an efficient sparse learning and feature selection method that combines forward and backward steps, proposed by Zhang, T. (2011).
Choice of the loss function: quadratic loss.
Selection criterion: the weak learner satisfies $f_k = \arg\min_f \min_\alpha L(y, F_k + \alpha f)$.
Iterative format: delete useless weak learners from the combined learner during the learning process.
37. Forward-backward greedy algorithm

Algorithm 1: FoBa
Initialize $\Omega_0 = \emptyset$, $F_0 = 0$, $k = 0$; choose $\nu \in (0, 1)$ and a threshold $\eta > 0$.
While TRUE (forward step):
    $i = \arg\min_i \min_\alpha L(F_k + \alpha f_i)$
    $\Omega_{k+1} = \{i\} \cup \Omega_k$
    $F_{k+1} = \arg\min_F L(F \mid \Omega_{k+1})$
    $\sigma_k = L(F_k) - L(F_{k+1})$
    if $\sigma_k < \eta$ then BREAK
    $k = k + 1$
    While TRUE (backward step):
        $j = \arg\min_{j \in \Omega_k} L(F_k - f_j)$
        $d^- = L(F_k - f_j) - L(F_k)$; $d^+ = \sigma_k$
        if $d^- > \nu d^+$ then BREAK
        $k = k - 1$
        $\Omega_k = \Omega_{k+1} \setminus \{j\}$
        $F_k = \arg\min_F L(F \mid \Omega_k)$
    end
end
(A simplified Python sketch follows below.)
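A simplified Python sketch of FoBa with quadratic loss on a dictionary matrix `D`. Unlike Algorithm 1, the forward step here scores candidates by the loss after a full refit, a common and slightly stronger variant chosen for brevity.

```python
import numpy as np

def refit_loss(D, y, support):
    """Quadratic loss after refitting least squares on the given support."""
    if not support:
        return float(y @ y)
    S = D[:, sorted(support)]
    coef = np.linalg.lstsq(S, y, rcond=None)[0]
    resid = y - S @ coef
    return float(resid @ resid)

def foba(D, y, eta=1e-3, nu=0.5, max_size=50):
    support = set()
    L_cur = refit_loss(D, y, support)
    while len(support) < max_size:
        # forward step: add the feature with the largest loss decrease
        cands = [(refit_loss(D, y, support | {i}), i)
                 for i in range(D.shape[1]) if i not in support]
        best_L, best_i = min(cands)
        if L_cur - best_L < eta:                 # forward gain too small: stop
            break
        support.add(best_i)
        sigma, L_cur = L_cur - best_L, best_L
        # backward steps: drop features whose removal costs little
        while len(support) > 1:
            L_j, j = min((refit_loss(D, y, support - {j}), j) for j in support)
            if L_j - L_cur > nu * sigma:         # deletion too costly: stop
                break
            support.remove(j)
            L_cur = L_j
    return sorted(support)
```

Each add-remove cycle decreases the loss by at least $(1 - \nu)\eta$, which is why the procedure terminates.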
38. Some theoretical analysis
A sufficient condition for feature selection consistency of OGA (and also of the LASSO): the irrepresentable condition (an $L_\infty$-type condition).
For FoBa, feature selection consistency is guaranteed under the restricted isometry property (an $L_2$-type condition).
For a Gaussian random data matrix, the $L_\infty$-type condition requires a sample size $n$ of order $O(s^2 \log n)$, while the $L_2$-type condition requires only $O(s \log n)$, to achieve a consistent solution.
FoBa terminates after at most $2L(0)/\eta$ forward iterations.
39. Simulation results on FoBa
Artificial data experiment: $p = 500$, $n = 100$, noise $\epsilon = 0.1$, moderately correlated design matrix.
Exactly sparse true weights with $s = 5$, weights drawn uniformly at random; results over repeated random runs for the top five features.
[Table: error ranges (mean ± standard deviation) for parameter estimation and feature selection with FoBa, forward-greedy, and LASSO, with least-squares refit]
40. Gradient forward-backward greedy algorithm (FoBa-gdt)
FoBa-gdt is a generalization of FoBa, proposed by Liu, J., Ye, J., & Fujimaki, R. (2014).
FoBa-gdt changes the measure of the goodness of a feature from the objective decrease $L(F_k) - \min_\alpha L(F_k + \alpha f_j)$ to a criterion based on the gradient $\nabla L(F_k)$.
Choice of the loss function: any differentiable loss function.
FoBa evaluates a feature directly by the decrease in the objective function, which is computationally expensive (it solves a large number of one-dimensional optimizations); FoBa-gdt does not suffer from this problem.
41. Some remarks and further research directions
Boosting is not just about selecting learners; it is about how to build an appropriate model in a greedy way.
The loss function can even change over iterations; RobustBoost, proposed by Freund, Y. (2009), implements this idea.
Boosting induces bias when repeatedly using the same training data for sampling; Dorogush, Anna Veronika, et al. (2017) discussed this issue and proposed dynamic Boosting.
Boosting can be applied in deep learning by combining shallow networks into a complex one.
Boosting can also help with functional matrix factorization.
Selecting an appropriate momentum term (learning rate) is another research area.
42. References
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2).
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). New York: Springer Series in Statistics.
Schapire, R. E. (2003). The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification. Springer New York.
Mannor, S., Meir, R., & Zhang, T. (2003). Greedy algorithms for classification: consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4(Oct).
Buehlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics.
Barron, A. R., Cohen, A., Dahmen, W., & DeVore, R. A. (2008). Approximation and learning by greedy algorithms. The Annals of Statistics.
Needell, D., & Tropp, J. A. (2009). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3).
Zhang, T. (2009). Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems.
43. References (continued)
Freund, Y. (2009). A more robust boosting algorithm. arXiv preprint.
Ferreira, A. J., & Figueiredo, M. A. (2012). Boosting algorithms: A review of methods, theory, and applications. In Ensemble Machine Learning. Springer US.
Freund, R. M., Grigas, P., & Mazumder, R. (2013). AdaBoost and forward stagewise regression are first-order convex optimization methods. arXiv preprint.
Liu, J., Ye, J., & Fujimaki, R. (2014). Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. In International Conference on Machine Learning.
Cortes, C., Mohri, M., & Syed, U. (2014). Deep boosting. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
Miao, Q., Cao, Y., Xia, G., Gong, M., Liu, J., & Song, J. (2016). RBoost: label noise-robust boosting algorithm based on a nonconvex loss function and the numerically stable base learners. IEEE Transactions on Neural Networks and Learning Systems, 27(11).
Yuan, X., Li, P., & Zhang, T. (2016). Exact recovery of hard thresholding pursuit. In Advances in Neural Information Processing Systems.
44. References (continued)
Sancetta, A. (2016). Greedy algorithms for prediction. Bernoulli, 22(2).
Locatello, F., Khanna, R., Tschannen, M., & Jaggi, M. (2017). A unified optimization view on generalized matching pursuit and Frank-Wolfe. arXiv preprint.
Huang, F., Ash, J., Langford, J., & Schapire, R. (2017). Learning deep ResNet blocks sequentially using boosting theory. arXiv preprint.
Xu, L., Lin, S., Zeng, J., Liu, X., Fang, Y., & Xu, Z. (2017). Greedy criterion in orthogonal greedy learning. IEEE Transactions on Cybernetics.
Dorogush, A. V., Gulin, A., Gusev, G., Kazeev, N., Prokhorenkova, L. O., & Vorobev, A. (2017). Fighting biases with dynamic boosting. arXiv preprint.