Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
1 Linear Regression
2 Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Examples: Height, Gender, Weight → Shoe Size Audio features → Song year Processes, memory → Power consumption Historical financials → Future stock price Many more
3 Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight For each observation we have a feature vector, x, and a label, y: x = [x_1 x_2 x_3]^T We assume a linear mapping between features and label: y ≈ w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3
4 Linear Least Squares Regression Example: Predicting shoe size from height, gender, and weight We can augment the feature vector to incorporate the offset: x = [1 x_1 x_2 x_3]^T We can then rewrite this linear mapping as a scalar product: y ≈ ŷ = Σ_{i=0}^{3} w_i x_i = w^T x
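As a quick sketch of the augmented dot product with made-up numbers (the feature values and weights below are illustrative, not fitted):

```python
import numpy as np

# Hypothetical observation: height (cm), gender (0/1), weight (kg)
x = np.array([170.0, 1.0, 68.0])
x_aug = np.concatenate(([1.0], x))    # prepend a 1 to absorb the offset w_0

# Illustrative weights [w_0, w_1, w_2, w_3] (not learned from data)
w = np.array([2.0, 0.15, 0.5, 0.1])

# The linear mapping is now a single dot product w^T x
y_hat = w @ x_aug
```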
5 Why a Linear Mapping? Simple Often works well in practice Can introduce complexity via feature extraction
6 1D Example Goal: find the line of best fit x coordinate: features y coordinate: labels y ≈ ŷ = w_0 + w_1 x, where w_0 is the intercept/offset and w_1 is the slope
7 Evaluating Predictions Can measure closeness between label and prediction Shoe size: better to be off by one size than 5 sizes Song year prediction: better to be off by a year than by 20 years What is an appropriate evaluation metric or loss function? Absolute loss: |y − ŷ| Squared loss: (y − ŷ)², which has nice mathematical properties
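The two losses can be compared on hypothetical song-year predictions; the squared loss penalizes large misses far more heavily:

```python
# Comparing absolute and squared loss on hypothetical predictions
y = 2005                     # true release year
for y_hat in (2004, 1985):   # off by 1 year vs. off by 20 years
    abs_loss = abs(y - y_hat)
    sq_loss = (y - y_hat) ** 2
    # squared loss penalizes the 20-year miss 400x as much as the 1-year miss
    print(y_hat, abs_loss, sq_loss)
```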
8 How Can We Learn the Model (w)? Assume we have n training points, where (x^(i), y^(i)) denotes the ith point Recall two earlier points: Linear assumption: ŷ = w^T x We use squared loss: (y − ŷ)² Idea: Find the w that minimizes the squared loss over the training points: min_w Σ_{i=1}^{n} (w^T x^(i) − y^(i))²
9 Given n training points with d features, we define: X ∈ R^(n×d): matrix storing the points y ∈ R^n: real-valued labels ŷ ∈ R^n: predicted labels, where ŷ = Xw w ∈ R^d: regression parameters / model to learn Least Squares Regression: Learn the mapping (w) from features to labels that minimizes the residual sum of squares: min_w ‖Xw − y‖₂² Equivalent to min_w Σ_{i=1}^{n} (w^T x^(i) − y^(i))² by definition of the Euclidean norm
10 Find the solution by setting the derivative to zero 1D (scalar features x^(i)): f(w) = ‖wx − y‖₂² = Σ_{i=1}^{n} (w x^(i) − y^(i))² df/dw = 2 Σ_{i=1}^{n} (w x^(i) − y^(i)) x^(i) = 2(w x^T x − x^T y) = 0 ⟹ w = (x^T x)^{-1} x^T y Least Squares Regression: Learn the mapping (w) from features to labels that minimizes the residual sum of squares: min_w ‖Xw − y‖₂² Closed-form solution: w = (X^T X)^{-1} X^T y (if the inverse exists)
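As a sanity check, the closed-form solution can be evaluated on synthetic data; this is a sketch, and `np.linalg.solve` is used instead of an explicit inverse for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n = 100 points, d = 3 features plus an offset column of ones
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=n)   # labels with a little noise

# Closed form w = (X^T X)^{-1} X^T y, computed by solving the normal equations
w = np.linalg.solve(X.T @ X, X.T @ y)
```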
11 Overfitting and Generalization We want good predictions on new data, i.e., generalization Least squares regression minimizes training error, and so could overfit Simpler models are more likely to generalize (Occam's razor) Can we change the problem to penalize model complexity? Intuitively, models with smaller weights are simpler
12 Given n training points with d features, we define: X ∈ R^(n×d): matrix storing the points y ∈ R^n: real-valued labels ŷ ∈ R^n: predicted labels, where ŷ = Xw w ∈ R^d: regression parameters / model to learn Ridge Regression: Learn the mapping (w) that minimizes the residual sum of squares (training error) along with a regularization term (model complexity): min_w ‖Xw − y‖₂² + λ‖w‖₂² The free parameter λ trades off between training error and model complexity Closed-form solution: w = (X^T X + λ I_d)^{-1} X^T y
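A minimal sketch of the ridge closed form on synthetic data; the data, seed, and λ value are all illustrative. The regularizer shrinks the learned weights relative to plain least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 0.1   # free regularization parameter (lambda), chosen arbitrarily here

# Ridge closed form: w = (X^T X + lambda * I_d)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Plain least squares for comparison; the penalty shrinks the weights
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
```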
13 Millionsong Regression Pipeline
14–20 Supervised Learning Pipeline: Obtain Raw Data (full dataset) → Split Data (training set / test set) → Feature Extraction → Supervised Learning (model) → Evaluation (accuracy) → Predict (new entity → prediction)
21 Obtain Raw Data Goal: Predict a song's release year from audio features Raw Data: Millionsong Dataset from the UCI ML Repository: Western, commercial tracks; timbre averages (features) and release year (label)
22 Split Data: Train on the training set, evaluate with the test set The test set simulates unobserved data Test error tells us whether we've generalized well
23 Feature Extraction: Quadratic features Compute pairwise feature interactions Captures covariance of the initial timbre features Leads to a non-linear model relative to the raw features
24 Given 2-dimensional data, the quadratic features are: x = [x_1 x_2]^T ⟹ Φ(x) = [x_1², x_1 x_2, x_2 x_1, x_2²]^T z = [z_1 z_2]^T ⟹ Φ(z) = [z_1², z_1 z_2, z_2 z_1, z_2²]^T More succinctly: Φ'(x) = [x_1², √2 x_1 x_2, x_2²]^T Φ'(z) = [z_1², √2 z_1 z_2, z_2²]^T Equivalent inner products: Φ(x)^T Φ(z) = x_1² z_1² + 2 x_1 x_2 z_1 z_2 + x_2² z_2² = Φ'(x)^T Φ'(z)
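The inner-product equivalence can be checked numerically; the vectors below are arbitrary, and `phi` / `phi_compact` are illustrative helper names:

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

def phi(v):
    """Full quadratic expansion: [v1^2, v1*v2, v2*v1, v2^2]."""
    return np.array([v[0]**2, v[0]*v[1], v[1]*v[0], v[1]**2])

def phi_compact(v):
    """Succinct form: [v1^2, sqrt(2)*v1*v2, v2^2]."""
    return np.array([v[0]**2, np.sqrt(2)*v[0]*v[1], v[1]**2])

# Both expansions yield x1^2 z1^2 + 2 x1 x2 z1 z2 + x2^2 z2^2
lhs = phi(x) @ phi(z)
rhs = phi_compact(x) @ phi_compact(z)
```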
25 Supervised Learning: Least Squares Regression Learn a mapping from entities to continuous labels given a training set: Audio features → Song year
26 Given n training points with d features, we define: X ∈ R^(n×d): matrix storing the points y ∈ R^n: real-valued labels ŷ ∈ R^n: predicted labels, where ŷ = Xw w ∈ R^d: regression parameters / model to learn Ridge Regression: Learn the mapping (w) that minimizes the residual sum of squares (training error) along with a regularization term (model complexity): min_w ‖Xw − y‖₂² + λ‖w‖₂² Closed-form solution: w = (X^T X + λ I_d)^{-1} X^T y
27 Ridge Regression: Learn the mapping (w) that minimizes the residual sum of squares (training error) along with a regularization term (model complexity): min_w ‖Xw − y‖₂² + λ‖w‖₂² The free parameter λ trades off between training error and model complexity How do we choose a good value for this free parameter? Most methods have free parameters / hyperparameters to tune First thought: Search over multiple values, evaluating each on the test set But the goal of the test set is to simulate unobserved data; we may overfit if we use it to choose hyperparameters Second thought: Create another hold-out dataset for this search
28 Evaluation (Part 1): Hyperparameter tuning Training set: train various models Validation set: evaluate various models (e.g., via grid search) Test set: evaluate the final model's accuracy
29 Grid Search: Exhaustively search through the hyperparameter space Define and discretize the search space (linear or log scale) Evaluate points via validation error [Figure: a grid of candidate points over hyperparameter 1 and the regularization parameter λ]
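A grid search over λ can be sketched as follows; `ridge_fit` and `rmse` are illustrative helpers, and the train/validation arrays here are synthetic stand-ins for a real split:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge closed form: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

rng = np.random.default_rng(2)
X_train, y_train = rng.normal(size=(80, 4)), rng.normal(size=80)
X_val, y_val = rng.normal(size=(20, 4)), rng.normal(size=20)

# Discretize the search space on a log scale; keep the lowest validation error
grid = [10.0 ** k for k in range(-6, 3)]
errors = {lam: rmse(y_val, X_val @ ridge_fit(X_train, y_train, lam))
          for lam in grid}
best_lam = min(errors, key=errors.get)
```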
30 Evaluating Predictions How can we compare labels and predictions for n validation points? Least squares optimization involves the squared loss, (y − ŷ)², so it seems reasonable to use mean squared error (MSE): MSE = (1/n) Σ_{i=1}^{n} (ŷ^(i) − y^(i))² But MSE's unit of measurement is the square of the quantity being measured, e.g., squared years for song year prediction More natural to use root-mean-square error (RMSE), i.e., √MSE
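A small worked example with hypothetical song-year predictions shows why RMSE is easier to interpret than MSE:

```python
import numpy as np

# Hypothetical labels and predictions for n = 4 validation points
y = np.array([2000, 1995, 2010, 1980])
y_hat = np.array([2002, 1990, 2010, 1985])

mse = np.mean((y_hat - y) ** 2)   # units: squared years
rmse = np.sqrt(mse)               # back in years, directly interpretable
```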
31 Evaluation (Part 2): Evaluate the final model Training set: train various models Validation set: evaluate various models Test set: evaluate the final model's accuracy
32 Predict: The final model can then be used to make predictions on future observations, e.g., new songs
33 Distributed ML: Computation and Storage
34 Challenge: Scalability Classic ML techniques are not always suitable for modern datasets [Chart: growth of overall data, particle accelerator data, and DNA sequencer data, 2010–2015: data grows faster than Moore's Law; IDC report, Kathy Yelick, LBNL] Data + Machine Learning + Distributed Computing
35 Least Squares Regression: Learn the mapping (w) from features to labels that minimizes the residual sum of squares: min_w ‖Xw − y‖₂² Closed-form solution: w = (X^T X)^{-1} X^T y (if the inverse exists) How do we solve this computationally? The computational profile is similar for Ridge Regression
36 Computing the Closed-Form Solution w = (X^T X)^{-1} X^T y Computation: O(nd² + d³) operations Consider the number of arithmetic operations (+, −, ×, /) Computational bottlenecks: Matrix multiply X^T X: O(nd²) operations Matrix inverse: O(d³) operations Other methods (Cholesky, QR, SVD) have the same complexity
37 Storage Requirements w = (X^T X)^{-1} X^T y Computation: O(nd² + d³) operations Storage: O(nd + d²) floats Consider storing values as floats (8 bytes) Storage bottlenecks: X^T X and its inverse: O(d²) floats X: O(nd) floats
38 Big n and Small d w = (X^T X)^{-1} X^T y Computation: O(nd² + d³) operations Storage: O(nd + d²) floats Assume O(d³) computation and O(d²) storage are feasible on a single machine Storing X and computing X^T X are the bottlenecks Can distribute storage and computation! Store data points (rows of X) across machines Compute X^T X as a sum of outer products
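The identity behind the distributed approach can be verified directly: the Gram matrix X^T X equals the sum over rows of the d×d outer products x^(i) (x^(i))^T. A small numerical check, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 6, 3
X = rng.normal(size=(n, d))

# X^T X computed as one matrix multiply...
gram = X.T @ X

# ...equals the sum of outer products of the rows, which is what each
# worker can accumulate locally on its own partition of the rows
outer_sum = sum(np.outer(x_i, x_i) for x_i in X)
```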
39–41 Matrix Multiplication via Inner Products Each entry of the output matrix is the result of an inner product between a row and a column of the input matrices
42–45 Matrix Multiplication via Outer Products The output matrix is the sum of outer products between corresponding rows and columns of the input matrices
46–47 Computing X^T X as a sum of outer products: X^T X = Σ_{i=1}^{n} x^(i) (x^(i))^T Example: n = 6; 3 workers, each storing two points O(nd) distributed storage for X map: each worker sums the outer products of its local points; O(nd²) distributed computation, O(d²) local storage reduce: sum the workers' d×d matrices and invert; O(d³) local computation, O(d²) local storage
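The map/reduce pattern above can be simulated on one machine; the partitioning into three "workers" is a stand-in for a real cluster, and `np.linalg.solve` plays the role of the local inversion step:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 6, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Simulate 3 workers, each holding 2 of the 6 rows of X (and of y)
partitions = [(X[i:i + 2], y[i:i + 2]) for i in range(0, n, 2)]

# map: each worker emits its local d x d piece of X^T X (and of X^T y)
local = [(Xp.T @ Xp, Xp.T @ yp) for Xp, yp in partitions]

# reduce: sum the small contributions, then solve locally (cheap for small d)
XtX = sum(A for A, _ in local)
Xty = sum(b for _, b in local)
w = np.linalg.solve(XtX, Xty)
```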
48 Distributed ML: Computation and Storage, Part II
49–50 Big n and Small d w = (X^T X)^{-1} X^T y Computation: O(nd² + d³) operations Storage: O(nd + d²) floats Assume O(d³) computation and O(d²) storage are feasible on a single machine Can distribute storage and computation! Store data points (rows of X) across machines Compute X^T X as a sum of outer products
51 Big n and Big d w = (X^T X)^{-1} X^T y Computation: O(nd² + d³) operations Storage: O(nd + d²) floats As before, storing X and computing X^T X are bottlenecks Now, storing and operating on X^T X is also a bottleneck; we can't easily distribute it!
52–53 Recall the outer-product approach: X^T X = Σ_{i=1}^{n} x^(i) (x^(i))^T Example: n = 6; 3 workers O(nd) distributed storage map: O(nd²) distributed computation, O(d²) local storage reduce: invert; O(d³) local computation, O(d²) local storage
54 Big n and Big d w = (X^T X)^{-1} X^T y Computation: O(nd² + d³) operations Storage: O(nd + d²) floats As before, storing X and computing X^T X are bottlenecks Now, storing and operating on X^T X is also a bottleneck; we can't easily distribute it! 1st rule of thumb: Computation and storage should be linear (in n, d)
55 Big n and Big d We need methods that are linear in time and space One idea: Exploit sparsity Explicit sparsity can provide orders-of-magnitude storage and computational gains Sparse data is prevalent: Text processing: bag-of-words, n-grams Collaborative filtering: ratings matrix Graphs: adjacency matrix Categorical features: one-hot encoding Genomics: SNPs, variant calling Example: a dense vector of size 7 vs. a sparse representation storing only indices [0, 6] and values [1., 3.]
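The dense vs. sparse example can be sketched in code; `sparse_dot` is a hypothetical helper for illustration, not a library routine, but it shows why operations can scale with the number of nonzeros rather than the dimension:

```python
# Store only the nonzero entries instead of all d values
dense = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0]          # size 7, two nonzeros
sparse = {"size": 7, "indices": [0, 6], "values": [1.0, 3.0]}

def sparse_dot(sv, dense_vec):
    """Dot product touching only the nonzero entries: O(nnz), not O(d)."""
    return sum(v * dense_vec[i] for i, v in zip(sv["indices"], sv["values"]))

result = sparse_dot(sparse, [2.0] * 7)
```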
56 Big n and Big d We need methods that are linear in time and space One idea: Exploit sparsity Explicit sparsity can provide orders-of-magnitude storage and computational gains A latent sparsity assumption can be used to reduce dimension, e.g., PCA, low-rank approximation (unsupervised learning) [Figure: an n×d matrix approximated by the product of an n×r and an r×d matrix, with r ≪ d]
57 Big n and Big d We need methods that are linear in time and space One idea: Exploit sparsity Explicit sparsity can provide orders-of-magnitude storage and computational gains A latent sparsity assumption can be used to reduce dimension, e.g., PCA, low-rank approximation (unsupervised learning) Another idea: Use different algorithms Gradient descent is an iterative algorithm that requires O(nd) computation and O(d) local storage per iteration
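A sketch of gradient descent for least squares on synthetic, noiseless data; the step size and iteration count below are illustrative choices, not tuned recommendations:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                 # noiseless labels, so w_true is recoverable

w = np.zeros(d)
alpha = 0.01 / n               # small fixed step size (illustrative)
for _ in range(5000):
    # Gradient of ||Xw - y||_2^2 is 2 X^T (Xw - y): O(nd) per iteration,
    # and only the d-dimensional w is carried between iterations
    grad = 2 * X.T @ (X @ w - y)
    w -= alpha * grad
```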
58 Closed-Form Solution for Big n and Big d Example: n = 6; 3 workers O(nd) distributed storage map: O(nd²) distributed computation, O(d²) local storage reduce: invert; O(d³) local computation, O(d²) local storage
59–62 Gradient Descent for Big n and Big d Example: n = 6; 3 workers O(nd) distributed storage map: each worker computes its local gradient contribution; O(nd) distributed computation, O(d) local storage reduce: sum the d-dimensional contributions and update w; O(d) local computation, O(d) local storage
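One distributed gradient step can be simulated the same way as the closed-form case; the three-way partition is a stand-in for real workers, and the 0.01 step size is arbitrary. The key point is that only d-dimensional vectors cross the map/reduce boundary:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 6, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)

# Simulate 3 workers, each holding 2 of the 6 rows
partitions = [(X[i:i + 2], y[i:i + 2]) for i in range(0, n, 2)]

# map: each worker computes its local d-dimensional gradient piece,
# 2 * X_p^T (X_p w - y_p); O(nd) work overall, only O(d) values to communicate
pieces = [2 * Xp.T @ (Xp @ w - yp) for Xp, yp in partitions]

# reduce: sum the d-vectors and take a small step; O(d) computation and storage
grad = sum(pieces)
w = w - 0.01 * grad
```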
More informationSemantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing
Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationMachine Learning for Biomedical Engineering. Enrico Grisan
Machine Learning for Biomedical Engineering Enrico Grisan enrico.grisan@dei.unipd.it Curse of dimensionality Why are more features bad? Redundant features (useless or confounding) Hard to interpret and
More informationGI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil
GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection
More informationUsing SVD to Recommend Movies
Michael Percy University of California, Santa Cruz Last update: December 12, 2009 Last update: December 12, 2009 1 / Outline 1 Introduction 2 Singular Value Decomposition 3 Experiments 4 Conclusion Last
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationMachine Learning Linear Regression. Prof. Matteo Matteucci
Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationOutline. Motivation. Mapping the input space to the feature space Calculating the dot product in the feature space
to The The A s s in to Fabio A. González Ph.D. Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 2, 2009 to The The A s s in 1 Motivation Outline 2 The Mapping the
More informationCOMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017
COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We
More informationLecture 2: Linear Algebra Review
EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1
More informationA short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie
A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab
More informationNumerical Methods I Non-Square and Sparse Linear Systems
Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationMethods for sparse analysis of high-dimensional data, II
Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional
More informationFundamentals of Machine Learning. Mohammad Emtiyaz Khan EPFL Aug 25, 2015
Fundamentals of Machine Learning Mohammad Emtiyaz Khan EPFL Aug 25, 25 Mohammad Emtiyaz Khan 24 Contents List of concepts 2 Course Goals 3 2 Regression 4 3 Model: Linear Regression 7 4 Cost Function: MSE
More informationMatrix Factorization Techniques for Recommender Systems
Matrix Factorization Techniques for Recommender Systems Patrick Seemann, December 16 th, 2014 16.12.2014 Fachbereich Informatik Recommender Systems Seminar Patrick Seemann Topics Intro New-User / New-Item
More informationDimension Reduction Methods
Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.
More informationKernel Methods. Barnabás Póczos
Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More informationCollaborative Filtering. Radek Pelánek
Collaborative Filtering Radek Pelánek 2017 Notes on Lecture the most technical lecture of the course includes some scary looking math, but typically with intuitive interpretation use of standard machine
More informationCME323 Distributed Algorithms and Optimization. GloVe on Spark. Alex Adamson SUNet ID: aadamson. June 6, 2016
GloVe on Spark Alex Adamson SUNet ID: aadamson June 6, 2016 Introduction Pennington et al. proposes a novel word representation algorithm called GloVe (Global Vectors for Word Representation) that synthesizes
More informationMachine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso
Machine Learning Data Mining CS/CS/EE 155 Lecture 4: Regularization, Sparsity Lasso 1 Recap: Complete Pipeline S = {(x i, y i )} Training Data f (x, b) = T x b Model Class(es) L(a, b) = (a b) 2 Loss Function,b
More informationSVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning
SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are
More informationMultivariate Normal Models
Case Study 3: fmri Prediction Graphical LASSO Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox February 26 th, 2013 Emily Fox 2013 1 Multivariate Normal Models
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationLearning From Data Lecture 15 Reflecting on Our Path - Epilogue to Part I
Learning From Data Lecture 15 Reflecting on Our Path - Epilogue to Part I What We Did The Machine Learning Zoo Moving Forward M Magdon-Ismail CSCI 4100/6100 recap: Three Learning Principles Scientist 2
More informationExercise Sheet 1. 1 Probability revision 1: Student-t as an infinite mixture of Gaussians
Exercise Sheet 1 1 Probability revision 1: Student-t as an infinite mixture of Gaussians Show that an infinite mixture of Gaussian distributions, with Gamma distributions as mixing weights in the following
More informationIntroduction to Logistic Regression
Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the
More informationModern Optimization Techniques
Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University
More informationWe choose parameter values that will minimize the difference between the model outputs & the true function values.
CSE 4502/5717 Big Data Analytics Lecture #16, 4/2/2018 with Dr Sanguthevar Rajasekaran Notes from Yenhsiang Lai Machine learning is the task of inferring a function, eg, f : R " R This inference has to
More informationNotes on Regularized Least Squares Ryan M. Rifkin and Ross A. Lippert
Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2007-025 CBCL-268 May 1, 2007 Notes on Regularized Least Squares Ryan M. Rifkin and Ross A. Lippert massachusetts institute
More informationAdvances in Extreme Learning Machines
Advances in Extreme Learning Machines Mark van Heeswijk April 17, 2015 Outline Context Extreme Learning Machines Part I: Ensemble Models of ELM Part II: Variable Selection and ELM Part III: Trade-offs
More information