Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso
1 Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso
2 Recap: Complete Pipeline. Training Data: $S = \{(x_i, y_i)\}$. Model Class(es): $f(x \mid w, b) = w^T x - b$. Loss Function: $L(a, b) = (a - b)^2$. Training: $\operatorname{argmin}_{w,b} \sum_i L(y_i, f(x_i \mid w, b))$ — SGD! Cross Validation & Model Selection. Profit!
3 Different Model Classes? Option 1: SVMs vs ANNs vs LR vs LS. Option 2: Regularization. $\operatorname{argmin}_{w,b} \sum_i L(y_i, f(x_i \mid w, b))$ — SGD! Cross Validation & Model Selection.
4 Notation Part 1. L0 norm (not actually a norm): number of non-zero entries, $\|w\|_0 = \lim_{p \to 0} \sum_d |w_d|^p = \sum_d 1[w_d \neq 0]$. L1 norm: sum of absolute values, $\|w\|_1 = \sum_d |w_d|$. Squared L2 norm: sum of squares, $\|w\|_2^2 = \sum_d w_d^2 = w^T w$. L2 norm: sqrt(sum of squares), $\|w\|_2 = \sqrt{w^T w}$. L-infinity norm: max absolute value, $\|w\|_\infty = \lim_{p \to \infty} \left( \sum_d |w_d|^p \right)^{1/p} = \max_d |w_d|$.
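To make the notation concrete, here is a small NumPy sketch (an added illustration, not from the lecture) that evaluates each of these norms on an example weight vector; the vector `w` is arbitrary.

```python
import numpy as np

w = np.array([0.0, 3.0, -4.0, 0.0])

l0 = np.count_nonzero(w)    # |w|_0: number of non-zero entries -> 2
l1 = np.sum(np.abs(w))      # |w|_1: sum of absolute values     -> 7.0
l2_sq = np.dot(w, w)        # |w|_2^2: sum of squares           -> 25.0
l2 = np.sqrt(l2_sq)         # |w|_2: sqrt of sum of squares     -> 5.0
linf = np.max(np.abs(w))    # |w|_inf: max absolute value       -> 4.0
```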
5 Notation Part 2. "Minimizing Squared Loss" = Regression = Least-Squares: $\operatorname{argmin}_{w,b} \sum_i (y_i - w^T x_i + b)^2$ (unless otherwise stated). E.g., Logistic Regression = Log Loss.
6 Ridge Regression: $\operatorname{argmin}_{w,b}\; \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$ — a regularization term plus the training loss — aka L2-Regularized Regression. Trades off model complexity vs training loss. Each choice of λ defines a model class. Will discuss this further later.
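Since the ridge objective is quadratic in w, setting its gradient to zero gives a closed-form solution. The sketch below is a minimal, assumed implementation (omitting b, as the later slides do); `ridge_fit` is an illustrative name, not lecture code.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve argmin_w  lam * w^T w + sum_i (y_i - w^T x_i)^2."""
    D = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam I) w = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```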
7 Larger Lambda! $\operatorname{argmin}_{w,b}\; \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$, with features $x = \left( 1[\text{age} > 10],\; 1[\text{gender} = \text{male}] \right)$ and label $y = 1$ if height > 55", $y = 0$ if height ≤ 55". [Figure: training loss vs test loss as lambda increases. Table: 12 people (Alice through Larry) split into train and test, with columns Age>10, Male?, Height>55.]
8 Updated Pipeline. Training Data: $S = \{(x_i, y_i)\}$. Model Class: $f(x \mid w, b) = w^T x - b$. Loss Function: $L(a, b) = (a - b)^2$. Training: $\operatorname{argmin}_{w,b}\; \lambda w^T w + \sum_i L(y_i, f(x_i \mid w, b))$. Cross Validation & Model Selection. Profit!
9 [Table: model scores for each person (Alice through Larry, split into train and test) as lambda increases; the lambda with the best test error is selected.]
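A hedged sketch of the selection loop this slide implies: sweep λ, fit on the training split, and keep the λ with the lowest held-out loss. It reuses the `ridge_fit` sketch above; the train/validation split and the λ grid are assumptions.

```python
import numpy as np

def select_lambda(X_tr, y_tr, X_val, y_val, lambdas):
    best_lam, best_loss = None, np.inf
    for lam in lambdas:
        w = ridge_fit(X_tr, y_tr, lam)                 # fit on the training split
        val_loss = np.sum((y_val - X_val @ w) ** 2)    # squared loss on held-out data
        if val_loss < best_loss:
            best_lam, best_loss = lam, val_loss
    return best_lam

# e.g. select_lambda(X_tr, y_tr, X_val, y_val, np.logspace(-3, 3, 13))
```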
10 Choice of Lambda Depends on Training Size. [Figure: four plots of test loss vs lambda for increasing numbers of training points; 25-dimensional space, randomly generated linear response function plus noise.]
11 Recap: Ridge Regularization. Ridge Regression = L2-Regularized Least-Squares: $\operatorname{argmin}_{w,b}\; \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2$. Large λ ⇒ more stable predictions, less likely to overfit to training data. Too large λ ⇒ underfit. Works with other losses: Hinge Loss, Log Loss, etc.
12 Aside: Stochastic Gradient Descent. $\operatorname{argmin}_{w,b}\; \lambda \|w\|_2^2 + \sum_i L(y_i, f(x_i \mid w, b)) = \operatorname{argmin}_{w,b} \sum_i \left[ \tfrac{\lambda}{N} \|w\|_2^2 + L_i(w, b) \right]$, so $\nabla L(w, b) = N\, \mathbb{E}_i \left[ \nabla \left( \tfrac{\lambda}{N} \|w\|_2^2 + L_i(w, b) \right) \right]$. Do SGD on this!
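A minimal sketch of this scheme for squared loss: each stochastic step sees one example's loss plus its 1/N share of the regularizer. The function name, step size, and epoch count are illustrative assumptions.

```python
import numpy as np

def sgd_ridge(X, y, lam, eta=0.01, epochs=100):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        for i in np.random.permutation(N):
            residual = y[i] - X[i] @ w
            # Gradient of (lam/N)*||w||^2 + (y_i - w^T x_i)^2 at example i.
            grad_i = 2.0 * (lam / N) * w - 2.0 * residual * X[i]
            w -= eta * grad_i
    return w
```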
13 Model Class Interpretation. $\operatorname{argmin}_{w,b}\; \lambda w^T w + \sum_i L(y_i, f(x_i \mid w, b))$ — this is not a model class! At least not what we've discussed so far... it is an optimization procedure. Is there a connection?
14 Norm-Constrained Model Class: $f(x \mid w, b) = w^T x - b$ s.t. $w^T w \leq c$. [Visualization: L2-norm balls for c = 1, 2, 3.] c seems to correspond to lambda in $\operatorname{argmin}_{w,b}\; \lambda w^T w + \sum_i L(y_i, f(x_i \mid w, b))$.
15 Lagrange Multipliers. $\operatorname{argmin}_w L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \leq c$. Optimality condition: the loss gradient and the constraint gradient are aligned at the constraint boundary! $\exists \lambda \geq 0: \left( -\nabla_w L(y, w) = \lambda \nabla_w (w^T w) \right) \wedge \left( w^T w \leq c \right)$. (Omitting b and using a single training point for simplicity.)
16 Norm-Constrained Model Class. Training: $\operatorname{argmin}_w L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \leq c$. Two conditions must be satisfied at optimality: $\exists \lambda \geq 0: \left( -\nabla_w L(y, w) = \lambda \nabla_w (w^T w) \right) \wedge \left( w^T w \leq c \right)$. Lagrangian: $\operatorname{argmin}_w \operatorname{argmax}_{\lambda \geq 0}\; \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$. Claim: solving the Lagrangian solves the norm-constrained training problem. Optimality implication of the Lagrangian satisfies the first condition: $\partial_w \Lambda(w, \lambda) = -2x(y - w^T x) + 2\lambda w \equiv 0 \Rightarrow 2x(y - w^T x) = 2\lambda w$. (Omitting b and using a single training point for simplicity.)
17 Norm-Constrained Model Class (continued). Training: $\operatorname{argmin}_w L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \leq c$. Lagrangian: $\operatorname{argmin}_w \operatorname{argmax}_{\lambda \geq 0}\; \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$. Optimality implication of the Lagrangian satisfies the 2nd condition: maximizing over $\lambda \geq 0$ gives $\partial_\lambda \Lambda(w, \lambda) = \begin{cases} 0 \;(\lambda = 0) & \text{if } w^T w < c \\ w^T w - c & \text{if } w^T w \geq c \end{cases} \Rightarrow w^T w \leq c$. (Omitting b and using a single training point for simplicity.)
18 Norm-Constrained Model Class. Training: $\operatorname{argmin}_w L(y, w) \equiv (y - w^T x)^2$ s.t. $w^T w \leq c$. L2-Regularized Training: $\operatorname{argmin}_w\; \lambda w^T w + (y - w^T x)^2$. Lagrangian: $\operatorname{argmin}_w \operatorname{argmax}_{\lambda \geq 0}\; \Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)$. Lagrangian = norm-constrained training: $\exists \lambda \geq 0: \left( -\nabla_w L(y, w) = \lambda \nabla_w (w^T w) \right) \wedge \left( w^T w \leq c \right)$. Lagrangian = L2-regularized training: hold λ fixed. So L2-regularized training is equivalent to solving the norm-constrained problem, for some c. (Omitting b and using a single training point for simplicity.)
19 Recap 2: Ridge Regularization. Ridge Regression (L2-Regularized Least-Squares), $\operatorname{argmin}_{w,b}\; \lambda w^T w + L(w)$, = Norm-Constrained Model, $\operatorname{argmin}_{w,b} L(w)$ s.t. $w^T w \leq c$. Large λ ⇒ more stable predictions, less likely to overfit to training data. Too large λ ⇒ underfit. Works with other losses: Hinge Loss, Log Loss, etc.
20 Hallucinating Data Points. $\nabla_w \left[ \lambda w^T w + \sum_i (y_i - w^T x_i)^2 \right] = 2\lambda w - 2 \sum_i x_i (y_i - w^T x_i)$. Instead, hallucinate D data points $\{ (\sqrt{\lambda}\, e_d,\; 0) \}_{d=1}^D$, where $e_d = (0, \ldots, 0, 1, 0, \ldots, 0)^T$ is the unit vector along the d-th dimension. Their loss is $\sum_{d=1}^D (0 - w^T \sqrt{\lambda}\, e_d)^2 = \lambda \sum_{d=1}^D w_d^2 = \lambda w^T w$, with gradient $\sum_{d=1}^D 2\sqrt{\lambda}\, e_d (\sqrt{\lambda}\, e_d^T w) = 2\lambda w$ — identical to regularization! (Omitting b for simplicity.)
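The claim is easy to check numerically: appending the D hallucinated points to the data and running ordinary least squares reproduces the ridge solution. A minimal sketch under those assumptions, matching `ridge_fit` above:

```python
import numpy as np

def ridge_by_hallucination(X, y, lam):
    N, D = X.shape
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])  # the D fake points sqrt(lam)*e_d
    y_aug = np.concatenate([y, np.zeros(D)])          # each with target 0
    # Plain least squares on the augmented data minimizes
    # ||y - Xw||^2 + lam * ||w||^2, i.e. the ridge objective.
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w
```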
21 Extension: Multi-task Learning. Two prediction tasks: a spam filter for Alice and a spam filter for Bob. Limited training data for both, but Alice is similar to Bob.
22 Extension: Multi-task Learning. Two training sets, $S^{(1)} = \{ (x_i^{(1)}, y_i^{(1)}) \}$ and $S^{(2)} = \{ (x_i^{(2)}, y_i^{(2)}) \}$, both relatively small. Option 1: Train Separately: $\operatorname{argmin}_w\; \lambda w^T w + \sum_i (y_i^{(1)} - w^T x_i^{(1)})^2$ and $\operatorname{argmin}_v\; \lambda v^T v + \sum_i (y_i^{(2)} - v^T x_i^{(2)})^2$. Both models have high error. (Omitting b for simplicity.)
23 Extension: Multi-task Learning. Two training sets, both relatively small. Option 2: Train Jointly: $\operatorname{argmin}_{w,v}\; \lambda w^T w + \sum_i (y_i^{(1)} - w^T x_i^{(1)})^2 + \lambda v^T v + \sum_i (y_i^{(2)} - v^T x_i^{(2)})^2$. Doesn't accomplish anything! (w and v don't depend on each other.) (Omitting b for simplicity.)
24 Multi-task Regularization: $\operatorname{argmin}_{w,v}\; \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_i (y_i^{(1)} - w^T x_i^{(1)})^2 + \sum_i (y_i^{(2)} - v^T x_i^{(2)})^2$ — standard regularization, multi-task regularization, and training loss. Prefers w and v to be close, controlled by γ. Tasks similar ⇒ larger γ helps! Tasks not identical ⇒ γ not too large. [Figure: test loss on task 2 vs gamma.]
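A hedged gradient-descent sketch of this joint objective; plain batch gradient descent is one reasonable choice, since the slides don't prescribe an optimizer. The data for the two tasks and the hyperparameters are assumed inputs.

```python
import numpy as np

def multitask_fit(X1, y1, X2, y2, lam, gamma, eta=0.001, iters=5000):
    D = X1.shape[1]
    w, v = np.zeros(D), np.zeros(D)
    for _ in range(iters):
        # Gradients of lam*w'w + lam*v'v + gamma*(w-v)'(w-v) + both squared losses.
        grad_w = 2 * lam * w + 2 * gamma * (w - v) - 2 * X1.T @ (y1 - X1 @ w)
        grad_v = 2 * lam * v - 2 * gamma * (w - v) - 2 * X2.T @ (y2 - X2 @ v)
        w, v = w - eta * grad_w, v - eta * grad_v
    return w, v
```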
25 Lasso: L1-Regularized Least-Squares.
26 L1 Regularized Least Squares. L2: $\operatorname{argmin}_w\; \lambda w^T w + \sum_i (y_i - w^T x_i)^2 = \operatorname{argmin}_w\; \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$. L1: $\operatorname{argmin}_w\; \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$. [Side-by-side comparison of example weight vectors under the L1 and L2 norms.] (Omitting b for simplicity.)
27 Aside: Subgradient (sub-differential). $\partial_a R(a) = \{ c \mid \forall a': R(a') - R(a) \geq c(a' - a) \}$. If R is differentiable: $\partial_a R(a) = \{ \nabla_a R(a) \}$. For L1: $\partial_{w_d} |w_d| = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$ — a continuous range at $w_d = 0$! (Omitting b for simplicity.)
28 L1 Regularized Least Squares. L2: $\operatorname{argmin}_w\; \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$, with $\partial_{w_d} \|w\|_2^2 = 2 w_d$. L1: $\operatorname{argmin}_w\; \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$, with $\partial_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}$. (Omitting b for simplicity.)
29 Lagrange Multipliers. $\operatorname{argmin}_w L(y, w) \equiv (y - w^T x)^2$ s.t. $\|w\|_1 \leq c$, with the L1 subgradient as above. Solutions tend to be at corners! $\exists \lambda \geq 0: \left( -\nabla_w L(y, w) \in \lambda\, \partial_w \|w\|_1 \right) \wedge \left( \|w\|_1 \leq c \right)$. (Omitting b and using a single training point for simplicity.)
30 Sparsity. w is sparse if it is mostly 0's: small L0 norm, $\|w\|_0 = \sum_d 1[w_d \neq 0]$. Why not L0 regularization, $\operatorname{argmin}_w\; \lambda \|w\|_0 + \sum_i (y_i - w^T x_i)^2$? Not continuous! L1 induces sparsity, and is continuous: $\operatorname{argmin}_w\; \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$. (Omitting b for simplicity.)
31 Why is Sparsity Important? Computational / memory efficiency: instead of storing 1M numbers in an array, store 2 numbers per non-zero as (index, value) pairs, e.g., [(50, 1), (51, 1)]. The dot product $w^T x$ is more efficient. Also, sometimes the true w is sparse, and we want to recover its non-zero dimensions.
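A small sketch of the (index, value) representation and the cheaper dot product it enables; the helper name and the example values are illustrative.

```python
def sparse_dot(w_sparse, x):
    """w_sparse: list of (index, value) pairs for the non-zeros of w."""
    return sum(value * x[index] for index, value in w_sparse)

w_sparse = [(50, 1.0), (51, 1.0)]  # the slide's example: 1M dims, 2 non-zeros
# sparse_dot(w_sparse, x) touches 2 entries of x instead of 1,000,000.
```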
32 Lasso Guarantee. $\operatorname{argmin}_{w,b}\; \lambda \|w\|_1 + \sum_i (y_i - w^T x_i + b)^2$. Suppose the data are generated as $y_i \sim \text{Normal}(w^{*T} x_i, \sigma^2)$. Then if $\lambda > \frac{2}{\kappa} \sqrt{2 \sigma^2 \log D}$, with high probability (increasing with N): $\text{Supp}(\hat{w}) \subseteq \text{Supp}(w^*)$ — high precision; and if $\forall d \in \text{Supp}(w^*): |w_d^*| \geq \lambda c$, then $\text{Supp}(\hat{w}) = \text{Supp}(w^*)$ — parameter recovery, sometimes high recall. Here $\text{Supp}(w^*) = \{ d \mid w_d^* \neq 0 \}$. See also:
33 [Figure: magnitude of the two weights (age>10? and male?) under L1 vs L2, as regularization shrinks them. Table: the 12-person train/test data with columns Age>10, Male?, Height>55.]
34 Aside: Optimizing Lasso. Solving Lasso gives a sparse model. Will stochastic gradient descent find it? No! It is hard to hit exactly 0 with gradient descent. Solution: Iterative Soft Thresholding. Intuition: if the gradient update passes 0, clamp at 0.
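A minimal sketch of iterative soft thresholding for the Lasso objective above: take a gradient step on the squared loss, then apply the soft-threshold operator, which sets any coordinate that would cross 0 to exactly 0. The step size is an assumption; it must be small enough, e.g. on the order of 1 over the largest eigenvalue of $X^T X$.

```python
import numpy as np

def soft_threshold(w, t):
    # Shrink each coordinate toward 0 by t; coordinates within t of 0 become exactly 0.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ista_lasso(X, y, lam, eta, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w_half = w + 2 * eta * X.T @ (y - X @ w)  # gradient step on the squared loss
        w = soft_threshold(w_half, eta * lam)     # shrink, producing exact zeros
    return w
```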
35 Recap: Lasso vs Ridge. Model assumptions: Lasso learns a sparse weight vector. Predictive accuracy: Lasso is often not as accurate; re-run least squares on the dimensions selected by Lasso. Ease of inspection: sparse w's are easier to inspect. Ease of optimization: Lasso is somewhat trickier to optimize.
36 Recap: Regularization. L2: $\operatorname{argmin}_w\; \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2$. L1 (Lasso): $\operatorname{argmin}_w\; \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2$. Multi-task: $\operatorname{argmin}_{w,v}\; \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_i (y_i^{(1)} - w^T x_i^{(1)})^2 + \sum_i (y_i^{(2)} - v^T x_i^{(2)})^2$. [Insert Yours Here!] (Omitting b for simplicity.)
37 Recap: Updated Pipeline. Training Data: $S = \{(x_i, y_i)\}$. Model Class: $f(x \mid w, b) = w^T x - b$. Loss Function: $L(a, b) = (a - b)^2$. Training: $\operatorname{argmin}_{w,b}\; \lambda w^T w + \sum_i L(y_i, f(x_i \mid w, b))$. Cross Validation & Model Selection. Profit!
38 Next Lectures: Decision Trees, Bagging, Random Forests, Boosting, Ensemble Selection. Recitation tonight: Linear Algebra.