Applied Statistics and Machine Learning

1 Applied Statistics and Machine Learning: MDL, e-L2Boosting, Sparse Graphical Model Estimation, and Ridge Estimation. Bin Yu, IMA, June 24, 2013

2 Minimum Description Length Principle MDL is a modern articulation of Occam's Razor through Shannon's information theory. Jorma Rissanen (1978): one should choose the model that gives the shortest description of the data. Conceptually, the MDL framework removes the need to postulate a true model. Rissanen, J. (1978). "Modeling by shortest data description". Automatica 14 (5): 465-471. Hansen, M. and Yu, B. (2001). Model selection and the principle of Minimum Description Length, JASA, vol. 96, pp. 746-774.

3 Information Theoretic Foundation of MDL Example of a prefix code on the alphabet {a, b, c}: a maps into 0, b maps into 10, and c maps into 11.

4 Information Theoretic Foundation of MDL 4 6/24/13

5 Any distribution can be viewed as a codelength For any probability distribution Q on the alphabet, -log Q(a) can be viewed as a code length. Q does not have to be the data-generating probability distribution; when it is, the code is most efficient in the Shannon sense.
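As an illustration not on the original slide, a minimal Python sketch of the codelength -log2 Q(a), assuming a made-up coding distribution Q on {a, b, c}; the lengths it prints match the prefix code of slide 3 (a -> 0, b -> 10, c -> 11).

import math

# Hypothetical coding distribution Q on the alphabet {a, b, c};
# it need not equal the data-generating distribution.
Q = {"a": 0.5, "b": 0.25, "c": 0.25}

def codelength_bits(symbol, Q):
    """Shannon codelength -log2 Q(symbol), in bits."""
    return -math.log2(Q[symbol])

for s in Q:
    print(s, codelength_bits(s, Q))   # a -> 1.0, b -> 2.0, c -> 2.0 bits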

6 A valid code to be used in MDL Rissanen (1978) generalized Shannon's source coding theorem to a universal source coding theorem. That is, he showed that if a data-generating distribution is known to belong to a nice parametric family of dimension k, then the redundancy of any universal code is lower bounded by (k/2) log n for n data points, for all members of the parametric family except for a set of measure zero. This means that one has to pay a price of (k/2) log n for not knowing the particular member of the parametric family. If a code achieves this lower bound, we call it valid to be used in MDL to compare model classes.

7 Different valid forms of code lengths to use with MDL in parametric models: the two-stage code (detailed on the next slide); predictive and mixture forms appear on later slides.

8 Two-stage code length BIC balances fit to data against model complexity, and so does AIC. The penalty in AIC is motivated by a complexity penalty on prediction for a model, while that in BIC is the code length of the model; both go up with the complexity of the model. Using two-stage-coding MDL with a sparse regression model leads to something quite close to least squares with a log-form penalty (Mairal's lecture).
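The equations for this slide did not survive transcription. In standard form (not copied from the slide), the two-stage code length for a model class of dimension k and its relation to BIC are

L_{\text{two-stage}}(x^n) \;=\; -\log p_{\hat{\theta}_k}(x^n) \;+\; \frac{k}{2}\log n,
\qquad
\mathrm{BIC}_k \;=\; -2\log p_{\hat{\theta}_k}(x^n) \;+\; k\log n \;=\; 2\,L_{\text{two-stage}}(x^n),

so choosing the model class with the shortest two-stage description is equivalent to minimizing BIC.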

9 Predictive MDL Two other forms of valid codes (or best to first order after the entropy term): Predictive MDL is closely related to on-line learning.

10 Predictive MDL: example Predictive MDL for regression model selection with known noise variance:
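The formula for this example did not survive transcription; the generic predictive (prequential) code length, written in standard form rather than copied from the slide, is

L_{\text{pred}}(x^n) \;=\; \sum_{t=1}^{n} -\log p_{\hat{\theta}(x^{t-1})}(x_t \mid x^{t-1}),

where \hat{\theta}(x^{t-1}) is fitted only on the data seen so far. For Gaussian regression with known noise variance \sigma^2, the t-th term is (y_t - x_t^\top \hat{\beta}_{t-1})^2 / (2\sigma^2) plus a constant, so the predictive code length accumulates squared one-step-ahead prediction errors.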

11 Predictive MDL for combining predictors in audio compression Question: how to compress, in a perceptually lossless manner, mixed audio signals (music and speech, songs) with low delay? Application: internet radio, musicians playing together long-distance.

12 Predictive MDL for combining predictors in audio compression Key steps in Schuller et al (2002): 1. Perceptual pre-filter banks (related to wavelets) 2. Quantization on the filtered signals 3. Lossless predictive coding for quantized signals 12 6/24/13

13 Predictive MDL for combining predictors in audio compression WCLMS predictor (weighted cascading LMS) in Step 3: Basic predictor P1: normalized LMS (least mean square minimization with one data point at a time, a version of stochastic gradient; LMS is an old algorithm from control theory) with lag L1. Fitting another N-LMS (with lag L2) to the residual time series gives a new predictor P2; repeating this with lag L3 gives P3. This cascading is really Tukey's twicing, or L2 boosting in some sense. For each of the 3 predictors, we get a weight at time i (c=2, mu=0.9) to linearly combine the three predictors.
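A rough sketch of the cascading idea (my own illustration, not Schuller et al.'s code; the test signal, lags, and step size mu are placeholders):

import numpy as np

def nlms_predictions(x, lag, mu=0.5, eps=1e-8):
    """One-step-ahead normalized LMS predictions for a 1-d signal x
    (mu is a placeholder step size, not taken from the paper)."""
    w = np.zeros(lag)
    preds = np.zeros(len(x))
    for t in range(lag, len(x)):
        past = x[t - lag:t][::-1]                    # most recent sample first
        preds[t] = w @ past                          # predict x[t]
        err = x[t] - preds[t]
        w += mu * err * past / (eps + past @ past)   # normalized LMS update
    return preds

# Cascade fits to residuals, in the spirit of Tukey's twicing / L2 boosting:
rng = np.random.default_rng(0)
x = np.sin(0.05 * np.arange(2000)) + 0.1 * rng.standard_normal(2000)
p1 = nlms_predictions(x, lag=200)                    # basic predictor P1
p2 = p1 + nlms_predictions(x - p1, lag=80)           # P2: P1 plus a fit to its residuals
p3 = p2 + nlms_predictions(x - p2, lag=40)           # P3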

14 Compression results (bit rates) of LMS and WCLMS with different lags 14 6/24/13

15 Comments on compression results Single LMS does not work well compared to WCLMS. For WCLMS, it is better to have long lags (200, 80, 40) to remove more harmonic signals first. 15 6/24/13

16 Mixture form of MDL 16 6/24/13
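The body of this slide is missing from the transcription; the standard mixture (Bayes) form of a valid code length for a parametric family with weight (prior) w(\theta) is

L_{\text{mix}}(x^n) \;=\; -\log \int p_{\theta}(x^n)\, w(\theta)\, d\theta .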

17 Boosting: computation and estimation in one Boosting originated from Valiant's Probably Approximately Correct (PAC) learning framework (Schapire, 1991), but was made practical by Freund and Schapire (1996) (AdaBoost). It has enjoyed impressive empirical successes. Boosting idea: choose a base procedure, apply it sequentially to modified data, and then linearly combine the resulting estimators. Cross-validation or a test set is commonly used to stop the iterations early.

18 Boosting: a gradient view An important step forward in understanding boosting was the gradient view of boosting developed by Breiman, Mason et al., Friedman et al., and Friedman (2001). L2 boosting, or boosting with squared loss in the regression case, is refitting of residuals (twicing, Tukey, 1972). The number of iteration steps is the regularization parameter. Minimax optimality of L2 boosting in the 1-dim case for Sobolev classes (Buhlmann & Yu, 2003) (see Yu's last lecture).

19 Stepwise regression: selected refs Efroymson, MA (1960) "Multiple regression analysis." In Ralston, A. and Wilf, HS, editors, Mathematical Methods for Digital Computers. Wiley. Hocking, R. R. (1976) "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32. Roughly, forward selection methods use tests (e.g. F-test) to find the most significant predictor to add and then the new model is refitted via OLS. Backward-selection or backward elimination is done similarly. June 24, 2013 page 19

20 Epsilon-L2Boosting (with fixed small steps) Regression set-up with standardized data: X_i is a p-dimensional predictor and Y_i is the response variable. We consider the squared-error loss, and let j index the jth predictor for j = 1, ..., p. L2 boosting with a fixed step size (forward stagewise fitting, Efron et al., 2004):
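A minimal Python sketch of this procedure (the step size eps and the number of steps are placeholder choices; the step count is the regularization parameter):

import numpy as np

def epsilon_l2_boost(X, y, eps=0.01, n_steps=500):
    """Forward stagewise fitting: repeatedly nudge the coefficient of the
    predictor most correlated with the current residual by a small step eps."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    for _ in range(n_steps):
        corr = X.T @ resid                  # inner products with the residual
        j = int(np.argmax(np.abs(corr)))    # best-fitting predictor
        beta[j] += eps * np.sign(corr[j])   # small move in its direction
        resid = y - X @ beta                # refit the residuals
    return beta

# Toy example with standardized predictors and a sparse truth.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(100)
beta_hat = epsilon_l2_boost(X, y)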

21 News media text data June 24, 2013 page 21

22 Comparative summarization (CCS) of text June 24, 2013 page 22

23 The need for automatic text summarization June 24, 2013 page 23

24 Concise and Comparative Summarization (CCS) on a given topic (Miratrix et al, 2012) Make the summarization problem predictive: automatically label documents as positive or negative by thresholding the count of relevant words on the topic (e.g., if the topic is China, count the words china, chinese, ...). Predictors are counts of word phrases (uni-grams, bi-grams, tri-grams, ...), which are very high dimensional (about 1 million) and very sparse. Epsilon-L2 boosting is used for picking a fixed number of word phrases. Our algorithm is a modification of BBR (Genkin et al, 2007) from L1-penalized logistic regression to Lasso for sparse, short and fat text data matrices.

25 Human comparison experiment Human experiment findings: Lasso with L2 column rescaling works better overall for different document sizes (articles vs. paragraphs) than tf-idf normalization, stop-word removal, and co-occurrence. tf-idf works well for articles.

26 Some CCS results from NYT international section in 2009 June 24, 2013 page 26

27 How does e-L2Boosting compare with Lasso? Lasso (Least Absolute Shrinkage and Selection Operator), where λ is a regularization parameter. e-L2Boosting also gives sparse models. Are they the same?
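The Lasso criterion itself was dropped in transcription; in standard notation (up to the scaling convention on the quadratic term) it is

\hat{\beta}_{\text{lasso}} \;=\; \arg\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(Y_i - X_i^\top \beta\bigr)^2 \;+\; \lambda \sum_{j=1}^{p}|\beta_j| .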

28 Striking similarity (Efron et al, 2004): coefficient paths of the Lasso and of e-L2 boosting (with small steps). The paths are not always the same. (These plots are the reverse of Mairal's path plots, which start with lambda = 0.)

29 e-l2boosting could be very different from Lasso L 2 Boosting Lasso June 24, 2013 page 29

30 Lasso trade-off between empirical loss and penalty If we run coordinate gradient descent with a fixed step size on the Lasso loss, at each iteration t there is a trade-off between reducing the empirical loss and reducing the L1 penalty.

31 Forward and Backward Two ways one can reduce the Lasso loss: Forward step -- reduce the empirical loss (L2 boosting). Backward step -- reduce the penalty. Let's start with a forward step to get the initial estimate and an initial λ.

32 BLasso or Stagewise Lasso (Zhao & Yu, JMLR, 2004) Find the backward direction that leads to the minimal empirical loss. Take that direction if it leads to a decrease in the Lasso loss; otherwise force a forward step. Relax λ whenever necessary. T. Zhang's work on backward-forward algorithms: see Mairal's lecture.
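A schematic sketch of the iteration just described (my paraphrase, not the authors' code; the step size eps, number of steps, and tolerance xi are placeholders):

import numpy as np

def l2_loss(X, y, beta):
    r = y - X @ beta
    return 0.5 * float(r @ r)

def blasso(X, y, eps=0.01, n_steps=200, xi=1e-10):
    """Schematic BLasso / stagewise Lasso: forward steps reduce the empirical
    loss; a backward step (shrinking one active coefficient by eps) is taken
    whenever it lowers the current Lasso loss; otherwise lambda is relaxed."""
    n, p = X.shape
    beta = np.zeros(p)

    def unit(j):
        e = np.zeros(p)
        e[j] = 1.0
        return e

    def forward_step(beta):
        # best coordinate and sign for reducing the empirical loss by one step
        cands = [(l2_loss(X, y, beta + s * eps * unit(j)), j, s)
                 for j in range(p) for s in (1.0, -1.0)]
        return min(cands)

    # initial forward step and initial lambda
    loss0 = l2_loss(X, y, beta)
    f_loss, j0, s0 = forward_step(beta)
    beta[j0] += s0 * eps
    lam = (loss0 - f_loss) / eps

    for _ in range(n_steps):
        active = np.flatnonzero(beta)
        took_backward = False
        if active.size:
            back = [(l2_loss(X, y, beta - np.sign(beta[j]) * eps * unit(j)), j)
                    for j in active]
            b_loss, b_j = min(back)
            cur = l2_loss(X, y, beta) + lam * np.abs(beta).sum()
            new = b_loss + lam * (np.abs(beta).sum() - eps)
            if new < cur - xi:              # backward step lowers the Lasso loss
                beta[b_j] -= np.sign(beta[b_j]) * eps
                took_backward = True
        if not took_backward:               # forced forward step; relax lambda
            old_loss = l2_loss(X, y, beta)
            f_loss, f_j, f_s = forward_step(beta)
            beta[f_j] += f_s * eps
            lam = min(lam, (old_loss - f_loss) / eps)
        if lam <= 0:
            break
    return beta, lam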

33 BLasso converges to the Lasso path as ε goes to zero (Zhao and Yu, 2004): BLasso vs. Lasso coefficient paths.

34 Gaussian graphical model (cf. Rocha et al, 08, ) 34 6/24/13

35 Gaussian graphical model 35 6/24/13

36 Gaussian graphical model 36 6/24/13

37 Relating regression coefficients with inverse covariance matrix It follows that the precision matrix has the same off-diagonal sparsity patterns as B. 37 6/24/13
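The derivation on this slide did not survive transcription; the standard identity behind the claim, assuming B collects the coefficients of each coordinate regressed on the remaining ones, is

X_j \;=\; \sum_{k \neq j} B_{jk} X_k + \varepsilon_j,
\qquad
B_{jk} \;=\; -\,\frac{\Omega_{jk}}{\Omega_{jj}}, \quad \Omega = \Sigma^{-1},

so an off-diagonal entry of the precision matrix Ω is zero exactly when the corresponding regression coefficient in B is zero.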

38 Gaussian Graphical Model Gaussian inverse covariance characterizes conditional independencies (and hence the graphical model): zeros in the inverse covariance matrix correspond to missing edges in the graph.

39 L1-penalized -log Gaussian likelihood Given n iid observations of X: Yuan and Lin (2006); Banerjee et al (2008) and Friedman et al (2007) give fast algorithms based on block descent. An earlier work is Meinshausen and Buhlmann (2006), who estimated the sparsity pattern of the precision matrix via Lasso by regressing one variable at a time against the rest. Rocha et al (2008) integrated the many Lasso problems into one via pseudo-likelihood (SPLICE).
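The penalized criterion itself is missing from the transcription; in the standard form (conventions differ on whether the diagonal is penalized) it is

\hat{\Theta} \;=\; \arg\min_{\Theta \succ 0}\; \operatorname{tr}(S\Theta) \;-\; \log\det\Theta \;+\; \lambda \sum_{j \neq k} |\Theta_{jk}|,

where S is the sample covariance of the n observations.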

40 Example: voting clusters via sparse graphical model estimation (Banerjee, El Ghaoui, d'Aspremont, 2008).

41 Nuclear norm penalty to induce low rank in matrix estimation The nuclear norm of a matrix is the sum of its singular values; for a symmetric positive semi-definite matrix it equals the trace. It has been used as a surrogate for rank in the rank-minimization problem of completing a matrix (e.g. Netflix) with missing entries and noise. See lectures by Smola and Mairal and Yu's V4 modeling lecture, where the nuclear norm is used with data.

42 AdaBoost L2Boosting is an extension of AdaBoost for binary classification to regression with the L2 loss function. AdaBoost was invented by Freund and Schapire (1996), "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting". It often improved the classification accuracy of a simple classifier such as stumps by 30% or more via repeated application of the same classifier to modified data and then a combination of the resulting classifiers. The iteration is stopped early by CV or test data. It was observed that for many data sets, even after the training error hit zero, the generalization error kept improving. Is AdaBoost resistant to overfitting?

43 AdaBoost Recall that L2Boosting is an extension of AdaBoost for binary classification to regression with the L2 loss function. AdaBoost was invented by Freund and Schapire (1995), "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting". It often improved the classification accuracy of a simple classifier such as stumps by 30% or more via repeated application of the same classifier to modified data and then a combination of the resulting classifiers. The iteration is stopped early by CV or test data. It was observed that for many data sets, even after the training error hit zero, the generalization error kept improving. Is AdaBoost resistant to overfitting?

44 AdaBoost for a given base (weak) learner with early stopping via test or CV For data (x_i, y_i), i = 1, ..., n, to obtain a classification function F given a base learner that is better than random guessing, update as follows. Initialize w_0(i) = 1/n and F_0(x) = 0, and let f_0(x) be the classifier learned from the original data using the base learner. For m = 0, 1, ..., let err_m be the weighted error rate of f_m, and set
F_{m+1}(x) = F_m(x) + (1/2) log((1 - err_m)/err_m) f_m(x),
w_{m+1}(i) ∝ w_m(i) exp( log((1 - err_m)/err_m) I[y_i ≠ f_m(x_i)] ).
Sample the data with weights w_{m+1} and apply the base learner to the sampled data to get f_{m+1}(x).
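A compact runnable sketch of this update with a decision-stump base learner (my own illustration; it reweights the data directly rather than resampling as on the slide):

import numpy as np

def stump_fit(X, y, w):
    """Fit a weighted decision stump: threshold one feature; labels are +/-1."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = sgn * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, j, thr, sgn)
    return best[1:]                      # (feature, threshold, sign)

def stump_predict(X, stump):
    j, thr, sgn = stump
    return sgn * np.where(X[:, j] > thr, 1, -1)

def adaboost(X, y, n_rounds=50):
    """AdaBoost by reweighting; y must contain +/-1 labels."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []                        # list of (alpha_m, stump_m)
    for _ in range(n_rounds):
        stump = stump_fit(X, y, w)
        pred = stump_predict(X, stump)
        err = max(np.sum(w * (pred != y)), 1e-12)
        if err >= 0.5:                   # base learner no better than random
            break
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w *= np.exp(-alpha * y * pred)   # up-weight misclassified points
        w /= w.sum()
    return ensemble

def adaboost_predict(X, ensemble):
    F = sum(alpha * stump_predict(X, stump) for alpha, stump in ensemble)
    return np.sign(F)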

45 AdaBoost: a gradient descent view AdaBoost was much better understood statistically after the gradient descent point of view was developed by Breiman, Mason et al., and Friedman et al. It can be viewed as gradient descent on an exponential loss function of the margin yf(x), exp(-yf(x)), which is an upper bound on the 0-1 loss function of classification. The logistic (negative log-likelihood) loss of the logistic model can also be seen as a different upper bound.

46 Different upper bounds of the 0-1 loss function (Friedman et al, 2000, "Additive logistic regression: a statistical view of boosting", Annals of Statistics 28, 337-407).

47 AdaBoost is an ensemble method Other ensemble methods include bagging, random forests (RF), model averaging, etc. Such methods average different models/classifiers to come up with the final model/classifier. See Hansen's lecture on CART (decision trees) and RF.

48 Ridge regression method: L2 penalized LS Ridge regression is an old method, but still very useful. Tikhonov (1943) used it for the first time and this method is also called Tikhonov regularization. June 24, 2013 page 48
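The formulas on slides 48-50 did not survive transcription; the standard L2-penalized least squares criterion and its closed form are

\hat{\beta}_{\text{ridge}} \;=\; \arg\min_{\beta}\; \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \;=\; (X^\top X + \lambda I)^{-1} X^\top Y, \qquad \lambda > 0,

so the ridge estimator shrinks the least-squares solution and exists even when X^\top X is singular.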

49 Ridge regression method: L2 penalized LS June 24, 2013 page 49

50 Ridge regression method: L2-penalized LS ... which can always be made true if we choose α in (0,1) appropriately.

51 Ridge covariance estimation X'X + λI can also be viewed as a regularized covariance estimate. This is the method we used to reconstruct the movies based on encoding models and fMRI signals. We also tried the sparse graphical method for precision matrix estimation, and the ridge method worked better.

52 Back to reconstructing movies Using e-L2Boosting or Lasso, for each voxel we have a good linear prediction rule for fMRI responses corresponding to any movie clip. What else do we have? A large movie clip database from movie trailers, YouTube, etc. An external memory to approximately replace our internal memory?

53 Given an observed fMRI vector Y over selected informative voxels: clip 1 -> Lasso model -> predicted fMRI vector Y1; clip 2 -> predicted fMRI vector Y2; clip 3 -> predicted fMRI vector Y3; ... Rank clip 1, clip 2, clip 3, ... by how close Y1, Y2, Y3, ... are to Y in terms of a weighted LS metric.

54 Decoding details Task A: identify the stimulus the subject has seen. Main strategy: 1. Fit encoding models. 2. Choose the r = 2700 best voxels (based on CV). 3. For each candidate stimulus, generate a prediction of brain activity and score it by how well it fits the observed activity. Notation: Im, Im_i, Im_k represent images; Im_i, i = 1, ..., n_t are training images; Im_k, k = 1, ..., N are candidates; Y_i in R^r are the (averaged) observed responses to Im_i; f(Im_i) in R^p is the vector of features; β (β̂) in R^{p x r} are the (fitted) linear coefficients.

55 Simplest solution: square distance In other words, for a fixed response, we want to find an input whose prediction through the encoding model is closest to the response. It is natural to use the squared distance d²(Ŷ, Y) = ||Ŷ - Y||². Problem: the best voxels tend to estimate the same thing, so the same signal (part of the image) dominates the reconstruction. We need to down-weight voxel groups.

56 Gaussian model (second order) Given an observed fMRI response vector, we postulate a regression model with its linear response function estimated on training data via e-L2Boosting or Lasso: Y = f(Im)β + w, w ~ N(0, Σ_w). Then the likelihood for the kth image in the validation set, given the observed fMRI vector Y, is proportional to exp[-(Y - f(Im_k)β)' Σ_w^{-1} (Y - f(Im_k)β)]. We'd like to pick the image k in the validation set that is closest to the observed fMRI signal in a metric defined by the inverse covariance (precision or concentration) matrix: k̂ = argmin_{k ≤ N} (Y - f(Im_k)β)' Σ_w^{-1} (Y - f(Im_k)β).

57 Estimating Σ_ω^{-1} Take β = β̂, so ω represents the observed residuals; estimate from the residuals (of data set aside): Σ̂_ω = Cov(Y_i - f(Im_i)β̂). Since r is large relative to the sample size, we need some regularization to invert: sparse precision estimation (Banerjee et al 2009) or shrinkage (Ledoit and Wolf 2003).
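A minimal sketch of two regularized covariance estimates of the kind cited on this slide (the ridge constant lam and shrinkage weight alpha are placeholders; this is not the Ledoit-Wolf formula for the optimal shrinkage intensity):

import numpy as np

def ridge_covariance(residuals, lam):
    """Ridge-regularized residual covariance, Sigma_hat + lam * I,
    which is always invertible; residuals has shape (n_samples, r)."""
    S = np.cov(residuals, rowvar=False)
    return S + lam * np.eye(S.shape[0])

def shrinkage_covariance(residuals, alpha):
    """Convex shrinkage toward a scaled identity, Ledoit-Wolf style:
    (1 - alpha) * Sigma_hat + alpha * (trace/r) * I, with alpha in (0, 1)."""
    S = np.cov(residuals, rowvar=False)
    target = np.trace(S) / S.shape[0] * np.eye(S.shape[0])
    return (1 - alpha) * S + alpha * target

# Precision matrix used in the weighted metric of slide 56, where E holds
# the residuals Y_i - f(Im_i) beta_hat:
# Omega = np.linalg.inv(ridge_covariance(E, lam))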

58 Results on a simpler image identification data set from the Gallant Lab (2 repeats vs. 13 repeats).

59 Reconstruction of movies Nishimoto, Vu, Naselaris, Benjamini, Yu, and Gallant (2011) 1. Download N videos/images from the internet to construct an empirical prior (e.g. movie trailers, YouTube videos, ...). 2. Using residuals from the fitted encoding models, take into account correlation between voxels through a ridge-regularized covariance estimate Σ̂_ω. 3. Given an fMRI vector Y on the voxels, for some image feature g(Im), calculate its posterior mean, proportional to Σ_{k ≤ N} g(Im_k) exp[-(Y - f(Im_k)β)' Σ̂_ω^{-1} (Y - f(Im_k)β)]. The g's represent the red, green, blue values at each pixel and frame. Color is recovered based on the prior, not as part of the model.

60 Some issues with reconstruction Weighting images by their posterior weights does not work well: the posterior weights are very similar, or not differentiating, possibly because in a 26K-dimensional space everything is quite far apart. Instead, we used equal weights for the top 100 images -- the ordering of the posterior weights is informative.

61 Reconstruction algorithm on one slide A. Record data. B. Fit encoding models (p = 26220, n = 7200; 20-35K voxels). C. Sample the prior and rank fits based on the posterior (using the chosen voxels). D. Pick the top 100 ranks. E. Average in color space. Averaging in color space is quite rough.

62 Analyzing static image data: comparing Lasso and ridge, a nonlinear sparse model, and power transforms for prediction Episode 1: linear. Fitting details: pre-selection by correlation localization; regularization parameter selected by 5-fold cross-validation. e-L2 boosting applied to all 10,000+ features -- L2 boosting is the method of choice in the Gallant Lab. Other methods applied to 500 features pre-selected by correlation.

63 Other methods β = argmin_b (1/n) ||Y - Xb||² + α ||Σ^{k/2} b||², which has the closed form β = (Σ̂ + αΣ^k)^{-1} X^T Y / n. k = 1: semi OLS (theoretically better than OLS); k = 0: ridge; k = -1: semi OLS (inverse). k = 1, -1 are semi-supervised because of the use of the population covariance, which is estimated from the whole image database.

64 Validation correlation by voxel: OLS, L2 boosting, semi OLS (k = 1), ridge (k = 0), and semi OLS inverse (k = -1).

65 Comparison of the selected feature locations: semi methods vs. e-L2Boosting.

66 Our Story (cont) Episode 2: nonlinear via SpAM (machine learning stage) Additive models (Hastie and Tibshirani, 1990): Y_i = Σ_{j=1}^{p} f_j(X_ij) + ε_i, i = 1, ..., n. Sparse: f_j ≡ 0 for most j. High dimensional: p >>> n. SpAM (Sparse Additive Models) by Ravikumar, Lafferty, Liu, Wasserman (2007). Related work: COSSO, Lin and Zhang (2006).

67 SpAM V1 encoding model (1331 voxels from V1) (Ravikumar, Vu, Gallant Lab, BY, NIPS08, AoAS11) For each voxel: start with complex cells (features); pre-select 500 complex cells via correlation; fit SpAM to the 500 complex cells with AIC stopping; pool the SpAM-fitted complex cells according to location and frequency to form pooled complex cells; fit SpAM to the 500 complex cells and the pooled complex cells with AIC stopping.

68 Prediction performance (R²): SpAM V1 model vs. sparse linear pooling model. Histogram of relative improvement (%): mean = 14, SD = 17, median = 12, IQR = 17; median improvement 12%.

69 Nonlinearities Compressive effect (finite energy supply) Common nonlinearity for each voxel? June 24, 2013 page 69 page 69

70 Power transformation of features Episode 3: nonlinearity via power transformations (classical stage). Plotting a predictor (feature) vs. the response shows that the feature is skewed while the response is Gaussian. After transformation (4th root of the feature), we get better correlation.

71 Make it Gaussian Histograms of Gabor wavelet features, by spatial frequency, under different transformations: none, square root, 4th root, log.

72 Gaussianization of features improves prediction Gaussianized features are used to train (using L2 epsilon-boosting): F_T = F^{1/2} if F is in levels 2, 3 (4 x 4 and 8 x 8); F_T = F^{1/4} if F is in levels 4, 5 (16 x 16 and 32 x 32).
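A small sketch of the transformation rule on this slide (my own illustration; feature values are assumed nonnegative and `levels` marks each column's spatial-frequency level):

import numpy as np

def gaussianize_features(F, levels):
    """Power-transform Gabor-wavelet features by spatial-frequency level:
    square root for levels 2-3, fourth root for levels 4-5, identity otherwise."""
    F_t = F.astype(float).copy()
    for j, lvl in enumerate(levels):
        if lvl in (2, 3):
            F_t[:, j] = np.sqrt(F_t[:, j])
        elif lvl in (4, 5):
            F_t[:, j] = F_t[:, j] ** 0.25
    return F_t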
