DS-GA 1002 Lecture notes 12, Fall 2016: Linear regression
Carlos Fernandez-Granda
1 Linear models

In statistics, regression consists of learning a function relating a certain quantity of interest $y$, the response or dependent variable, to several observed variables $x_1, x_2, \ldots, x_p$, known as covariates, features or independent variables. To ease notation, we store the features in a vector $x \in \mathbb{R}^p$. For example, we might be interested in estimating the price of a house based on its extension, the number of rooms, the year it was built, etc.

The assumption in regression is that the response is generated by applying a function $h$ to the features and then perturbing the result with some unknown noise $z$,

$$y = h(x) + z. \qquad (1)$$

The aim is to learn $h$ from $n$ examples of responses and their corresponding features,

$$(y^{(1)}, x^{(1)}),\ (y^{(2)}, x^{(2)}),\ \ldots,\ (y^{(n)}, x^{(n)}). \qquad (2)$$

In linear regression, we assume that $h$ is a linear function, so that the response is well modeled as a linear combination of the features:

$$y^{(i)} = x^{(i)\,T} \beta + z^{(i)}, \qquad 1 \le i \le n, \qquad (3)$$

where $z^{(i)}$ is an unknown perturbation. The function is parametrized by a vector of weights $\beta \in \mathbb{R}^p$, which we would like to learn from the data. Combining the equations in (3), we obtain a matrix representation of the linear-regression problem,

$$\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} = \begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_p \\ x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_p \\ \vdots & \vdots & & \vdots \\ x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_p \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix}. \qquad (4)$$

Equivalently,

$$y = X\beta + z, \qquad (5)$$

where $X$ is an $n \times p$ matrix containing the features, $y \in \mathbb{R}^n$ contains the responses and $z \in \mathbb{R}^n$ represents the noise.
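To make the model concrete, here is a minimal sketch in Python with NumPy that generates data according to (5); the dimensions, the weight vector `beta_true` and the noise level are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                            # number of examples and of features (assumed)
beta_true = np.array([2.0, -1.0, 0.5])   # hypothetical weight vector

X = rng.normal(size=(n, p))              # n x p feature matrix, one example per row
z = 0.1 * rng.normal(size=n)             # unknown perturbation
y = X @ beta_true + z                    # responses generated by the linear model (5)
```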
Example 1.1 (Linear model for GDP). We consider the problem of building a linear model to predict the gross domestic product (GDP) of a state in the US from its population and unemployment rate. We have available the population, the unemployment rate (in %) and the GDP (in USD millions) of California, Minnesota, Oregon, Nevada, Idaho and Alaska, as well as the population and unemployment rate of South Carolina, whose GDP is unknown. The GDP is the response, whereas the population and the unemployment rate are the features. Our goal is to fit a linear model to the data so that we can predict the GDP of South Carolina.

Since the features are on different scales, we normalize them so that the columns of the regression matrix have unit $\ell_2$ norm. We also normalize the response, although this is not strictly necessary. We could also apply some additional preprocessing, such as centering the data. This yields a normalized response vector $y \in \mathbb{R}^6$ and a normalized $6 \times 2$ regression matrix $X$. If we find a $\beta \in \mathbb{R}^2$ such that $y \approx X\beta$, then we can estimate the GDP of South Carolina by computing $x_{SC}^{\,T}\beta$, where $x_{SC} \in \mathbb{R}^2$ contains the population and unemployment rate of South Carolina divided by the constants used to normalize the columns of $X$.
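A sketch of this preprocessing step in Python; the arrays are made-up stand-ins for the state data in Example 1.1, and `x_new_raw` plays the role of South Carolina's features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for the data in Example 1.1: six states with two features
# each (population, unemployment rate), plus the GDP response.
X_raw = rng.uniform(1.0, 40.0, size=(6, 2))
y_raw = rng.uniform(50.0, 2500.0, size=6)
x_new_raw = rng.uniform(1.0, 40.0, size=2)   # features of the state to predict

# Normalize so that each column of X, and the response, has unit l2 norm.
col_norms = np.linalg.norm(X_raw, axis=0)
X = X_raw / col_norms
y = y_raw / np.linalg.norm(y_raw)

# The new state's features must be divided by the same normalization constants
# before computing the prediction x_new @ beta.
x_new = x_new_raw / col_norms
```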
[Figure 1: Linear model learnt via least-squares fitting for a simple example where there is just one feature (p = 1) and 4 examples (n = 4).]

2 Least-squares estimation

To calibrate the linear-regression model, we need to estimate the weight vector so that it yields a good fit to the data. We can evaluate the fit for a specific choice of $\beta \in \mathbb{R}^p$ using the sum of the squares of the errors,

$$\sum_{i=1}^{n} \left( y^{(i)} - x^{(i)\,T}\beta \right)^2 = \| y - X\beta \|_2^2. \qquad (8)$$

The least-squares estimate $\beta_{LS}$ is the vector of weights that minimizes this cost function,

$$\beta_{LS} := \arg\min_{\beta} \| y - X\beta \|_2. \qquad (9)$$

The least-squares cost function is convenient from a computational point of view, since it is convex and can be minimized efficiently (in fact, as we will see in a moment, it has a closed-form solution). In addition, it has intuitive geometric and probabilistic interpretations. Figure 1 shows the linear model learnt using least squares in a simple example where there is just one feature (p = 1) and 4 examples (n = 4).
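The following sketch mirrors the single-feature setting of Figure 1, solving (9) with NumPy on made-up data:

```python
import numpy as np

# Four examples with a single feature (p = 1, n = 4); the values are made up.
x = np.array([0.2, 0.4, 0.6, 0.8])
y = np.array([0.25, 0.35, 0.65, 0.75])

X = x.reshape(-1, 1)                            # n x p regression matrix
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]  # minimizes ||y - X beta||_2
print(beta_ls, np.linalg.norm(y - X @ beta_ls))
```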
Example 2.1 (Linear model for GDP, continued). The least-squares estimate for the regression coefficients in the linear GDP model assigns almost all of the weight to the population feature and a weight close to zero to the unemployment rate. The model essentially learns that the GDP is roughly proportional to the population and that the unemployment rate does not help when fitting the data linearly. We can then compare the fit provided by the linear model to the original data, as well as its prediction of the GDP of South Carolina.

Geometric interpretation

The following theorem, proved in Section A of the appendix, shows that the least-squares problem has a closed-form solution.

Theorem 2.2 (Least-squares solution). For $p \le n$, if $X$ is full rank then the solution to the least-squares problem (9) is

$$\beta_{LS} := \left( X^T X \right)^{-1} X^T y. \qquad (11)$$

A corollary to this result provides a geometric interpretation for the least-squares estimate of $y$: it is obtained by projecting the response onto the column space of the matrix formed by the predictors.

Corollary 2.3. For $p \le n$, if $X$ is full rank then $X\beta_{LS}$ is the projection of $y$ onto the column space of $X$.

We provide a formal proof in Section B of the appendix, but the result is very intuitive. Any vector of the form $X\beta$ is in the span of the columns of $X$. By definition, the least-squares estimate is the closest vector to $y$ that can be represented in this way, so it is the projection of $y$ onto the column space of $X$. This is illustrated in Figure 2.
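A quick numerical check of Theorem 2.2 and of the geometric picture, on random made-up data; since $X\beta_{LS}$ is a projection, the residual $y - X\beta_{LS}$ should be orthogonal to every column of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 3
X = rng.normal(size=(n, p))     # full rank with probability one
y = rng.normal(size=n)

beta_ls = np.linalg.inv(X.T @ X) @ X.T @ y   # closed-form solution (11)

# The residual is orthogonal to the column space of X, as Corollary 2.3 implies.
residual = y - X @ beta_ls
print(np.round(X.T @ residual, decimals=10))  # numerically the zero vector
```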
[Figure 2: Illustration of Corollary 2.3. The least-squares solution is a projection of the data onto the subspace spanned by the columns of $X$, denoted by $X_1$ and $X_2$.]

Probabilistic interpretation

If we model the noise in (5) as a realization of a random vector $Z$ whose entries are independent Gaussian random variables with mean zero and a certain variance $\sigma^2$, then we can interpret the least-squares estimate as a maximum-likelihood estimate. Under that assumption, the data are a realization of the random vector

$$Y := X\beta + Z, \qquad (12)$$

which is a Gaussian random vector with mean $X\beta$ and covariance matrix $\sigma^2 I$. The joint pdf of $Y$ is equal to

$$f_Y(a) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \left( a_i - (X\beta)_i \right)^2 \right) \qquad (13)$$

$$= \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| a - X\beta \right\|_2^2 \right). \qquad (14)$$

The likelihood is the probability density function of $Y$ evaluated at the observed data $y$ and interpreted as a function of the weight vector $\beta$,

$$\mathcal{L}_y(\beta) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| y - X\beta \right\|_2^2 \right). \qquad (15)$$
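As the derivation below confirms, maximizing this likelihood reproduces the least-squares estimate. Here is a numerical sanity check with made-up data, using SciPy's generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 0.5                 # made-up problem size and noise level
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + sigma * rng.normal(size=n)

# Negative log-likelihood of (15), dropping additive terms that do not depend on beta.
def neg_log_likelihood(beta):
    return np.sum((y - X @ beta) ** 2) / (2 * sigma ** 2)

beta_ml = minimize(neg_log_likelihood, x0=np.zeros(p)).x
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_ml, beta_ls, atol=1e-5))  # True: ML and LS estimates agree
```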
To find the ML estimate, we maximize the log-likelihood. We conclude that it is given by the solution to the least-squares problem, since

$$\beta_{ML} = \arg\max_{\beta} \mathcal{L}_y(\beta) \qquad (16)$$

$$= \arg\max_{\beta} \log \mathcal{L}_y(\beta) \qquad (17)$$

$$= \arg\min_{\beta} \left\| y - X\beta \right\|_2 \qquad (18)$$

$$= \beta_{LS}. \qquad (19)$$

3 Overfitting

Imagine that a friend tells you: "I found a cool way to predict the temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half, and it's perfect!" Your friend is not lying, but the problem is that she is using roughly as many data points to fit the linear model as there are parameters. If $n \le p$ we can find a $\beta$ such that $y = X\beta$ exactly, even if $y$ and $X$ have nothing to do with each other! This is called overfitting, and it is usually caused by using a model that is too flexible with respect to the amount of data available.

To evaluate whether a model suffers from overfitting, we separate the data into a training set and a test set. The training set is used to fit the model and the test set is used to evaluate the error. A model that overfits the training set will have a very low error when evaluated on the training examples, but will not generalize well to the test examples.

Figure 3 shows the result of evaluating the training error and the test error of a linear model with $p = 50$ parameters fitted from $n$ training examples. The training and test data are generated by fixing a vector of weights $\beta$ and then computing

$$y_{\text{train}} = X_{\text{train}}\,\beta + z_{\text{train}}, \qquad (20)$$

$$y_{\text{test}} = X_{\text{test}}\,\beta, \qquad (21)$$

where the entries of $X_{\text{train}}$, $X_{\text{test}}$, $z_{\text{train}}$ and $\beta$ are sampled independently at random from a Gaussian distribution with zero mean and unit variance. The training and test errors are defined as
[Figure 3: Relative $\ell_2$-norm error in estimating the response achieved using least-squares regression for different values of $n$ (the number of training data). The training error is plotted in blue, the test error in red. The green line indicates the training error of the true model used to generate the data.]

$$\text{error}_{\text{train}} = \frac{\left\| X_{\text{train}}\,\beta_{LS} - y_{\text{train}} \right\|_2}{\left\| y_{\text{train}} \right\|_2}, \qquad (22)$$

$$\text{error}_{\text{test}} = \frac{\left\| X_{\text{test}}\,\beta_{LS} - y_{\text{test}} \right\|_2}{\left\| y_{\text{test}} \right\|_2}. \qquad (23)$$

Note that even the true $\beta$ does not achieve zero training error because of the presence of the noise, but the test error is exactly zero if we manage to estimate $\beta$ exactly. The training error of the linear model grows with $n$. This makes sense, as the model has to fit more data using the same number of parameters. When $n$ is close to $p = 50$, the fitted model is much better than the true model at replicating the training data (the error of the true model is shown in green). This is a sign of overfitting: the model is adapting to the noise and not learning the true linear structure. Indeed, in that regime the test error is extremely high. At larger $n$, the training error rises to the level achieved by the true linear model and the test error decreases, indicating that we are learning the underlying model.
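A sketch of the experiment behind Figure 3; the value $p = 50$ and the grid of training-set sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50                        # number of parameters (assumed, matching the text)
beta = rng.normal(size=p)     # fixed true weight vector

for n in [50, 55, 60, 80, 100, 200, 500]:   # assumed grid of training-set sizes
    X_train = rng.normal(size=(n, p))
    X_test = rng.normal(size=(n, p))
    y_train = X_train @ beta + rng.normal(size=n)   # (20): noisy training responses
    y_test = X_test @ beta                          # (21): noiseless test responses

    beta_ls = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)  # (22)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)      # (23)
    print(f"n = {n:4d}   train error = {err_train:.3f}   test error = {err_test:.3f}")
```

For $n$ close to $p$ the training error falls far below the noise level while the test error blows up; as $n$ grows, both approach the behavior of the true model.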
4 Global warming

In this section we describe the application of linear regression to climate data. In particular, we analyze temperature data taken at a weather station in Oxford over many years.¹ Our objective is not to perform prediction, but rather to determine whether temperatures in Oxford have risen or fallen over this period. In order to separate the temperature into different components that account for seasonal effects, we use a simple linear model with three predictors and an intercept,

$$y_t \approx \beta_0 + \beta_1 \cos\left( \frac{2\pi t}{12} \right) + \beta_2 \sin\left( \frac{2\pi t}{12} \right) + \beta_3\, t, \qquad (24)$$

where $1 \le t \le n$ denotes the time in months ($n$ equals 12 times the number of years of data). The corresponding matrix of predictors is

$$X := \begin{bmatrix} 1 & \cos\left( 2\pi t_1 / 12 \right) & \sin\left( 2\pi t_1 / 12 \right) & t_1 \\ 1 & \cos\left( 2\pi t_2 / 12 \right) & \sin\left( 2\pi t_2 / 12 \right) & t_2 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \cos\left( 2\pi t_n / 12 \right) & \sin\left( 2\pi t_n / 12 \right) & t_n \end{bmatrix}. \qquad (25)$$

The intercept $\beta_0$ represents the mean temperature, $\beta_1$ and $\beta_2$ account for periodic yearly fluctuations, and $\beta_3$ is the overall trend. If $\beta_3$ is positive then the model indicates that temperatures are increasing; if it is negative then it indicates that temperatures are decreasing. The results of fitting the linear model are shown in Figures 4 and 5. The fitted model indicates that both the maximum and minimum temperatures have an increasing trend of about 0.8 degrees Celsius (around 1.4 degrees Fahrenheit).

¹The data is available at stationdata/oxforddata.txt.
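A sketch of this fit; the temperature series `temps` is a made-up stand-in for the Oxford data, and the number of years is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12 * 150            # monthly data over 150 years (assumed span)
t = np.arange(1, n + 1)

# Made-up monthly temperatures: mean level + yearly oscillation + slight trend + noise.
temps = 10.0 + 6.0 * np.cos(2 * np.pi * t / 12) + 0.0007 * t + rng.normal(size=n)

# Matrix of predictors (25): intercept, yearly oscillation, linear trend.
X = np.column_stack([np.ones(n),
                     np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12),
                     t])

beta = np.linalg.lstsq(X, temps, rcond=None)[0]
print("mean temperature:", beta[0])
print("trend (degrees per century):", beta[3] * 12 * 100)  # beta[3] is per month
```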
[Figure 4: Temperature data together with the linear model described by (24), for both maximum and minimum temperatures.]
[Figure 5: Temperature trend obtained by fitting the model described by (24), for both maximum and minimum temperatures.]

A Proof of Theorem 2.2

Let $X = U \Sigma V^T$ be the singular-value decomposition (SVD) of $X$. Under the conditions of the theorem,

$$\left( X^T X \right)^{-1} X^T y = V \Sigma^{-1} U^T y.$$

We begin by separating $y$ into two components,

$$y = U U^T y + \left( I - U U^T \right) y, \qquad (26)$$

where $U U^T y$ is the projection of $y$ onto the column space of $X$. Note that $\left( I - U U^T \right) y$ is orthogonal to the column space of $X$, and consequently to both $U U^T y$ and $X\beta$ for any $\beta$. By Pythagoras's theorem,

$$\left\| y - X\beta \right\|_2^2 = \left\| \left( I - U U^T \right) y \right\|_2^2 + \left\| U U^T y - X\beta \right\|_2^2. \qquad (27)$$

Since the first term does not depend on $\beta$, the minimum value of this cost function that can be achieved by optimizing over $\beta$ is $\left\| \left( I - U U^T \right) y \right\|_2^2$. This is achieved by solving the system of equations

$$U U^T y = X \beta = U \Sigma V^T \beta. \qquad (28)$$

Since $U^T U = I$ because $p \le n$, multiplying both sides of the equality by $U^T$ yields the equivalent system

$$U^T y = \Sigma V^T \beta. \qquad (29)$$
Since $X$ is full rank, $\Sigma$ and $V$ are square and invertible (and by definition of the SVD, $V^{-1} = V^T$), so

$$\beta_{LS} = V \Sigma^{-1} U^T y \qquad (30)$$

is the unique solution to the system, and consequently also of the least-squares problem.

B Proof of Corollary 2.3

Let $X = U \Sigma V^T$ be the singular-value decomposition of $X$. Since $X$ is full rank and $p \le n$, we have $U^T U = I$, $V^T V = I$ and $\Sigma$ is a square invertible matrix, which implies

$$X \beta_{LS} = X \left( X^T X \right)^{-1} X^T y \qquad (31)$$

$$= U \Sigma V^T \left( V \Sigma U^T U \Sigma V^T \right)^{-1} V \Sigma U^T y \qquad (32)$$

$$= U U^T y. \qquad (33)$$
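A numerical check of both appendix arguments, on made-up data; the thin SVD guarantees $U^T U = I$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.normal(size=(n, p))    # full rank with probability one
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U @ np.diag(s) @ Vt

beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y     # closed form (11)
beta_svd = Vt.T @ ((U.T @ y) / s)                  # V Sigma^{-1} U^T y, as in (30)
print(np.allclose(beta_closed, beta_svd))          # True

# Corollary 2.3: X beta_LS is the projection of y onto the column space of X.
print(np.allclose(X @ beta_closed, U @ (U.T @ y))) # True
```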