Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

1 Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis Carlos Fernandez-Granda

2 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

3 Regression The aim is to learn a function h that relates a response or dependent variable y to several observed variables x_1, x_2, ..., x_p, known as covariates, features or independent variables. The response is assumed to be of the form y = h(x) + z, where x ∈ R^p contains the features and z is noise

4 Linear regression The regression function h is assumed to be linear: y^(i) = (x^(i))^T β + z^(i), 1 ≤ i ≤ n. Our aim is to estimate β ∈ R^p from the data

5 Linear regression In matrix form, [y^(1); y^(2); ...; y^(n)] = [x^(1)_1 x^(1)_2 ... x^(1)_p; x^(2)_1 x^(2)_2 ... x^(2)_p; ...; x^(n)_1 x^(n)_2 ... x^(n)_p] [β_1; β_2; ...; β_p] + [z^(1); z^(2); ...; z^(n)]. Equivalently, y = Xβ + z

6 Linear model for GDP [Table: GDP (millions), population and unemployment rate for North Dakota, Alabama, Mississippi, Arkansas, Kansas, Georgia, Iowa, West Virginia, Kentucky and Tennessee; the GDP of Tennessee is the unknown (???)]

7 Centering Subtract the average av(y) from the entries of y and the column averages av(X) from the columns of X to obtain y_cent and X_cent [numerical values omitted]

8 Normalizing Divide the centered y by its standard deviation std(y) and the centered columns of X by their standard deviations std(X) to obtain y_norm and X_norm [numerical values omitted]

9 Linear model for GDP Aim: find β ∈ R^2 such that y_norm ≈ X_norm β. The estimate for the GDP of Tennessee will be y_Ten = av(y) + std(y) ⟨x_norm,Ten, β⟩, where x_norm,Ten is centered using av(X) and normalized using std(X)

10 Temperature predictor A friend tells you: "I found a cool way to predict the average daily temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"

11 System of equations A is n × p and full rank; consider the system Ax = b. If n < p the system is underdetermined: infinite solutions for any b! (overfitting) If n = p the system is determined: unique solution for any b (overfitting) If n > p the system is overdetermined: a unique solution exists only if b ∈ col(A) (if there is noise, no solutions)

12 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

13 Least squares For fixed β we can evaluate the error using Σ_{i=1}^n (y^(i) − (x^(i))^T β)² = ||y − Xβ||_2². The least-squares estimate β_LS minimizes this cost function: β_LS := arg min_β ||y − Xβ||_2 = (X^T X)^{-1} X^T y if X is full rank and n ≥ p
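
A minimal NumPy sketch of this estimate on synthetic data (the dimensions, coefficients and noise level below are made up for illustration); np.linalg.lstsq solves the same minimization without forming X^T X explicitly:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 3                                     # synthetic problem size
    X = rng.standard_normal((n, p))                   # feature matrix
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.standard_normal(n)  # y = X beta + z

    # Closed form (X^T X)^{-1} X^T y, valid when X is full rank and n >= p
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

    # np.linalg.lstsq solves the same least-squares problem more stably
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_ls, beta_lstsq)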

14 Least-squares fit [Figure: data points (x, y) and the least-squares fit]

15-20 Least-squares solution Let X = USV^T and write y = UU^T y + (I − UU^T) y. By the Pythagorean theorem, ||y − Xβ||_2² = ||(I − UU^T) y||_2² + ||UU^T y − Xβ||_2², so arg min_β ||y − Xβ||_2² = arg min_β ||UU^T y − Xβ||_2² = arg min_β ||UU^T y − USV^T β||_2² = arg min_β ||U^T y − SV^T β||_2² = VS^{-1} U^T y = (X^T X)^{-1} X^T y
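
A short NumPy check of this derivation on random (full-rank) data: the SVD formula and the pseudoinverse give the same coefficients, and the fitted values equal the projection UU^T y:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 4))
    y = rng.standard_normal(50)

    # Thin SVD: X = U S V^T with U (n x p), s (p,), Vt (p x p)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # beta_LS = V S^{-1} U^T y, equal to (X^T X)^{-1} X^T y when X is full rank
    beta_svd = Vt.T @ ((U.T @ y) / s)
    beta_pinv = np.linalg.pinv(X) @ y
    print(np.allclose(beta_svd, beta_pinv))          # True

    # The fit X beta_LS is the orthogonal projection of y onto col(X)
    print(np.allclose(X @ beta_svd, U @ (U.T @ y)))  # True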

21 Linear model for GDP The least-squares estimate is β_LS = [coefficients omitted]. GDP is roughly proportional to the population; unemployment has a negative (linear) effect

22 Linear model for GDP [Table: true GDP and least-squares estimate for North Dakota, Alabama, Mississippi, Arkansas, Kansas, Georgia, Iowa, West Virginia, Kentucky and Tennessee; values omitted]

23-24 Maximum temperatures in Oxford, UK [Figures: monthly maximum temperature (Celsius) over time]

25 Linear model y_t ≈ β_0 + β_1 cos(2πt/12) + β_2 sin(2πt/12) + β_3 t, where 1 ≤ t ≤ n is the time in months (n = [value omitted])
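
A sketch of fitting this type of model by least squares; the monthly series below is simulated (the length, seasonal amplitude and trend are invented), not the Oxford record:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 240                                   # 20 years of monthly data (assumed)
    t = np.arange(1, n + 1)

    # Design matrix for y_t ~ b0 + b1 cos(2 pi t / 12) + b2 sin(2 pi t / 12) + b3 t
    X = np.column_stack([np.ones(n),
                         np.cos(2 * np.pi * t / 12),
                         np.sin(2 * np.pi * t / 12),
                         t])

    # Simulated temperatures: seasonal cycle plus a small linear trend plus noise
    y = 10 + 6 * np.cos(2 * np.pi * t / 12) + 0.001 * t + rng.standard_normal(n)

    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    trend_per_100_years = coef[3] * 12 * 100  # slope is per month
    print(coef, trend_per_100_years)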

26-28 Model fitted by least squares [Figures: temperature data (Celsius) and the fitted model]

29 Trend: increase of 0.75 °C / 100 years (1.35 °F) [Figure: temperature data (Celsius) and the fitted trend]

30-32 Model for minimum temperatures [Figures: temperature data (Celsius) and the fitted model]

33 Trend: increase of 0.88 °C / 100 years (1.58 °F) [Figure: temperature data (Celsius) and the fitted trend]

34 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

35 Geometric interpretation Any vector Xβ is in the span of the columns of X. The least-squares estimate is the closest vector to y that can be represented in this way. This is the projection of y onto the column space of X: Xβ_LS = USV^T VS^{-1} U^T y = UU^T y

36 Geometric interpretation

37 Face denoising We denoise by projecting onto: S_1, the span of the 9 images from the same subject; S_2, the span of the 360 images in the training set. Test error: ||x − P_{S_1} y||_2 / ||x||_2 = [value omitted], ||x − P_{S_2} y||_2 / ||x||_2 = 0.078
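
A toy sketch of denoising by projecting onto a subspace spanned by training vectors; random Gaussian vectors stand in for the face images, and the dimensions are arbitrary:

    import numpy as np

    rng = np.random.default_rng(3)
    d, k = 1000, 9                        # signal dimension and subspace dimension
    T = rng.standard_normal((d, k))       # columns span the "training" subspace S
    x = T @ rng.standard_normal(k)        # clean signal lying in S
    y = x + rng.standard_normal(d)        # noisy observation

    # Orthonormal basis for S via the SVD, then project: P_S y = U U^T y
    U, _, _ = np.linalg.svd(T, full_matrices=False)
    estimate = U @ (U.T @ y)

    print(np.linalg.norm(x - y) / np.linalg.norm(x))         # error before denoising
    print(np.linalg.norm(x - estimate) / np.linalg.norm(x))  # error after projection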

38 S_1 := span(the 9 images from the same subject) [images omitted]

39 Denoising via projection onto S_1 [Images: signal x, noise z, data y = x + z, and the estimate P_{S_1} y]

40 S_2 := span(the 360 images in the training set) [images omitted]

41 Denoising via projection onto S_2 [Images: signal x, noise z, data y = x + z, and the estimate P_{S_2} y]

42 P_{S_1} y and P_{S_2} y [Images: the signal x, P_{S_1} y and P_{S_2} y]

43-44 Lessons of Face Denoising What does our intuition learned from face denoising tell us about linear regression? More features = larger column space. Larger column space = captures more of the true image. Larger column space = captures more of the noise. Balance between underfitting and overfitting

45 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

46 Motivation Model the data y_1, ..., y_n as realizations of a set of random variables y_1, ..., y_n. The joint pdf depends on a vector of parameters β: f_β(y_1, ..., y_n) := f_{y_1,...,y_n}(y_1, ..., y_n) is the probability density of y_1, ..., y_n at the observed data. Idea: choose β such that the density is as high as possible

47 Likelihood The likelihood is equal to the joint pdf, L_{y_1,...,y_n}(β) := f_β(y_1, ..., y_n), interpreted as a function of the parameters. The log-likelihood function is the log of the likelihood, log L_{y_1,...,y_n}(β)

48 Maximum-likelihood estimator The likelihood quantifies how likely the data are according to the model. Maximum-likelihood (ML) estimator: β_ML(y_1, ..., y_n) := arg max_β L_{y_1,...,y_n}(β) = arg max_β log L_{y_1,...,y_n}(β). Maximizing the log-likelihood is equivalent, and often more convenient

49 Probabilistic interpretation We model the noise as an iid Gaussian random vector z. Entries have zero mean and variance σ². The data are a realization of the random vector y := Xβ + z. y is Gaussian with mean Xβ and covariance matrix σ²I

50 Likelihood The joint pdf of y is f_y(a) := Π_{i=1}^n (1/√(2πσ²)) exp(−(a[i] − (Xβ)[i])² / (2σ²)) = (1/(√((2π)^n) σ^n)) exp(−||a − Xβ||_2² / (2σ²)). The likelihood is L_y(β) = (1/(√((2π)^n) σ^n)) exp(−||y − Xβ||_2² / (2σ²))

51 Maximum-likelihood estimate The maximum-likelihood estimate is β_ML = arg max_β L_y(β) = arg max_β log L_y(β) = arg min_β ||y − Xβ||_2² = β_LS
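
A quick numerical check of this equivalence on synthetic Gaussian data (the dimensions and σ are invented): minimizing the negative log-likelihood with a generic optimizer recovers the least-squares solution:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    n, p, sigma = 80, 3, 0.5
    X = rng.standard_normal((n, p))
    beta_true = rng.standard_normal(p)
    y = X @ beta_true + sigma * rng.standard_normal(n)

    # Negative log-likelihood under y ~ N(X beta, sigma^2 I), up to additive constants
    def neg_log_likelihood(beta):
        return np.sum((y - X @ beta) ** 2) / (2 * sigma ** 2)

    beta_ml = minimize(neg_log_likelihood, np.zeros(p)).x
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
    print(np.allclose(beta_ml, beta_ls, atol=1e-4))  # ML and LS coincide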

52 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

53-55 Estimation error If the data are generated according to the linear model y := Xβ + z, then β_LS − β = (X^T X)^{-1} X^T (Xβ + z) − β = (X^T X)^{-1} X^T z, as long as X is full rank

56-58 LS estimator is unbiased Assume the noise z is random and has zero mean; then E(β_LS − β) = (X^T X)^{-1} X^T E(z) = 0. The estimate is unbiased: its mean equals β

59 Least-squares error If the data are generated according to the linear model y := Xβ + z, then ||z||_2 / σ_1 ≤ ||β_LS − β||_2 ≤ ||z||_2 / σ_p, where σ_1 and σ_p are the largest and smallest singular values of X

60 Least-squares error: Proof The error is given by β_LS − β = (X^T X)^{-1} X^T z. How can we bound ||(X^T X)^{-1} X^T z||_2?

61 Singular values The singular values of a matrix A ∈ R^{n×p} of rank p satisfy σ_1 = max_{||x||_2 = 1, x ∈ R^p} ||Ax||_2 and σ_p = min_{||x||_2 = 1, x ∈ R^p} ||Ax||_2

62 Least-squares error β_LS − β = VS^{-1} U^T z. The smallest and largest singular values of VS^{-1} U^T are 1/σ_1 and 1/σ_p, so ||z||_2 / σ_1 ≤ ||VS^{-1} U^T z||_2 ≤ ||z||_2 / σ_p

63 Experiment X_train, X_test, z_train and β are sampled iid from a standard Gaussian. The data have 50 features. y_train = X_train β + z_train, y_test = X_test β (no test noise). We use y_train and X_train to compute β_LS. error_train = ||X_train β_LS − y_train||_2 / ||y_train||_2, error_test = ||X_test β_LS − y_test||_2 / ||y_test||_2
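
A sketch reproducing this experiment with NumPy (the values of n are arbitrary); the training error starts at zero, the relative training error approaches the noise level, and the test error shrinks:

    import numpy as np

    rng = np.random.default_rng(5)
    p = 50
    for n in [50, 100, 200, 1000, 5000]:
        beta = rng.standard_normal(p)
        X_train = rng.standard_normal((n, p))
        z_train = rng.standard_normal(n)
        y_train = X_train @ beta + z_train
        X_test = rng.standard_normal((n, p))
        y_test = X_test @ beta                  # no test noise, as in the slide

        beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
        err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
        noise = np.linalg.norm(z_train) / np.linalg.norm(y_train)
        print(n, round(err_train, 3), round(err_test, 3), round(noise, 3))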

64 Experiment [Figure: relative l2-norm training and test errors and the training noise level as a function of n]

65-69 Experiment Questions 1. Can we approximate the relative noise level ||z||_2 / ||y||_2? ||β||_2 ≈ √50, ||X_train β||_2 ≈ √(50 n), ||z_train||_2 ≈ √n. 2. Why does the training error start at 0? X is square and invertible. 3. Why does the relative training error converge to the noise level? ||X_train β_LS − y_train||_2 = ||X_train (β_LS − β) − z_train||_2 and β_LS → β. 4. Why does the relative test error converge to zero? We assumed no test noise, and β_LS → β

70 Non-asymptotic bound Let y := Xβ + z, where the entries of X and z are iid standard Gaussians. The least-squares estimate satisfies √((1 − ε) p) / (√n (1 + ε)) ≤ ||β_LS − β||_2 ≤ √((1 + ε) p) / (√n (1 − ε)) with probability at least 1 − 1/p − 2 exp(−pε²/8), as long as n ≥ 64 p log(12/ε)/ε²

71 Proof ||U^T z||_2 / σ_1 ≤ ||VS^{-1} U^T z||_2 ≤ ||U^T z||_2 / σ_p

72-73 Projection onto a fixed subspace Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0, P(k(1 − ε) < ||P_S z||_2² < k(1 + ε)) ≥ 1 − 2 exp(−kε²/8). Consequence: with probability 1 − 2 exp(−pε²/8), (1 − ε) p ≤ ||U^T z||_2² ≤ (1 + ε) p

74 Singular values of a Gaussian matrix Let A be an n × k matrix with iid standard Gaussian entries such that n > k. For any fixed ε > 0, the singular values of A satisfy √n (1 − ε) ≤ σ_k ≤ σ_1 ≤ √n (1 + ε) with probability at least 1 − 1/k, as long as n ≥ 64 k log(12/ε)/ε²

75 Proof With probability 1 − 1/p, √n (1 − ε) ≤ σ_p ≤ σ_1 ≤ √n (1 + ε), as long as n ≥ 64 p log(12/ε)/ε²
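
A small simulation illustrating the concentration used in the proof (p and the values of n are arbitrary): the extreme singular values of a standard Gaussian matrix approach √n as n grows:

    import numpy as np

    rng = np.random.default_rng(6)
    p = 50
    for n in [100, 1000, 10000]:
        A = rng.standard_normal((n, p))
        s = np.linalg.svd(A, compute_uv=False)
        # Largest and smallest singular values, rescaled by sqrt(n)
        print(n, round(s[0] / np.sqrt(n), 3), round(s[-1] / np.sqrt(n), 3))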

76 Experiment: ||β||_2 ≈ √p [Figure: plot of ||β − β_LS||_2 / ||β||_2, the relative coefficient error (l2 norm), as a function of n for p = 50, 100, 200, together with a 1/√n reference curve]

77 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

78 Condition number The condition number of A ∈ R^{n×p}, n ≥ p, is the ratio σ_1/σ_p of its largest and smallest singular values. A matrix is ill conditioned if its condition number is large (it is almost rank deficient)

79 Noise amplification Let y := Xβ + z, where z is iid standard Gaussian. With probability at least 1 − 2 exp(−ε²/8), ||β_LS − β||_2 ≥ √(1 − ε) / σ_p, where σ_p is the smallest singular value of X

80-84 Proof ||β_LS − β||_2² = ||VS^{-1} U^T z||_2² = ||S^{-1} U^T z||_2² (V is orthogonal) = Σ_{i=1}^p (u_i^T z)² / σ_i² ≥ (u_p^T z)² / σ_p²

85-86 Projection onto a fixed subspace Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0, P(k(1 − ε) < ||P_S z||_2² < k(1 + ε)) ≥ 1 − 2 exp(−kε²/8). Consequence: with probability 1 − 2 exp(−ε²/8), (u_p^T z)² ≥ 1 − ε

87 Example Let y := Xβ + z, where X, β and z are given numerically and ||z||_2 = 0.11 [values omitted]

88 Example Condition number = [value omitted]. X = USV^T [numerical SVD factors omitted]

89-94 Example β_LS − β = VS^{-1} U^T z = [numerical evaluation omitted], so that ||β_LS − β||_2 = [factor omitted] ||z||_2

95-96 Multicollinearity The feature matrix is ill conditioned if any subset of columns is close to being linearly dependent (there is a vector almost in the null space). This occurs if features are highly correlated. For any X ∈ R^{n×p} with normalized columns, if X_i and X_j, i ≠ j, satisfy ⟨X_i, X_j⟩ ≥ 1 − ε², then the smallest singular value satisfies σ_p ≤ ε. Proof idea: consider ||X(e_i − e_j)||_2
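
A minimal sketch of this effect: two nearly identical (hence highly correlated) columns give a tiny smallest singular value and a large condition number (the correlation level is arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 200
    x1 = rng.standard_normal(n)
    x2 = x1 + 0.01 * rng.standard_normal(n)      # second feature almost equal to the first

    X = np.column_stack([x1, x2])
    X = X / np.linalg.norm(X, axis=0)            # normalize the columns

    s = np.linalg.svd(X, compute_uv=False)
    print("inner product:", X[:, 0] @ X[:, 1])   # close to 1
    print("smallest singular value:", s[-1])     # close to 0
    print("condition number:", s[0] / s[-1])     # large: ill conditioned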

97 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

98 Motivation Avoid noise amplification due to multicollinearity Problem: Noise amplification blows up coefficients Solution: Penalize large-norm solutions when fitting the model Adding a penalty term promoting a particular structure is called regularization

99-103 Ridge regression For a fixed regularization parameter λ > 0, β_ridge := arg min_β ||y − Xβ||_2² + λ ||β||_2² = (X^T X + λI)^{-1} X^T y. λI increases the singular values of X^T X. When λ → 0, β_ridge → β_LS. When λ → ∞, β_ridge → 0

104-106 Proof β_ridge is the solution to a modified least-squares problem: β_ridge = arg min_β || [y; 0] − [X; √λ I] β ||_2² = ([X; √λ I]^T [X; √λ I])^{-1} [X; √λ I]^T [y; 0] = (X^T X + λI)^{-1} X^T y
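
A quick NumPy check of this equivalence on random data (the dimensions and λ are arbitrary):

    import numpy as np

    rng = np.random.default_rng(8)
    n, p, lam = 30, 5, 0.5
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    # Closed form: (X^T X + lambda I)^{-1} X^T y
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # Same solution from the augmented least-squares problem [X; sqrt(lambda) I], [y; 0]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    print(np.allclose(beta_ridge, beta_aug))   # True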

107-112 Modified projection y_ridge := Xβ_ridge = X(X^T X + λI)^{-1} X^T y = USV^T (VS²V^T + λVV^T)^{-1} VSU^T y = USV^T V(S² + λI)^{-1} V^T VSU^T y = US(S² + λI)^{-1} SU^T y = Σ_{i=1}^p (σ_i²/(σ_i² + λ)) ⟨y, u_i⟩ u_i. The component of the data in the direction of u_i is shrunk by σ_i²/(σ_i² + λ)

113-115 Modified projection: Relation to PCA The component of the data in the direction of u_i is shrunk by σ_i²/(σ_i² + λ). Instead of orthogonally projecting onto the column space of X as in standard regression, we shrink and project. Which directions are shrunk the most? The directions in the data with smallest variance. In PCA, we delete the directions with smallest variance (i.e., shrink them to zero). We can think of ridge regression as a continuous variant of performing regression on principal components

116 Ridge-regression estimate If y := Xβ + z, then β_ridge = V diag(σ_1²/(σ_1² + λ), ..., σ_p²/(σ_p² + λ)) V^T β + V diag(σ_1/(σ_1² + λ), ..., σ_p/(σ_p² + λ)) U^T z, where X = USV^T and σ_1, ..., σ_p are the singular values. For comparison, β_LS = β + VS^{-1} U^T z

117 Bias-variance tradeoff The error β_ridge − β can be divided into two terms: bias (depends on β) and variance (depends on z). The bias equals E(β_ridge − β) = −V diag(λ/(σ_1² + λ), ..., λ/(σ_p² + λ)) V^T β. Larger λ increases the bias, but dampens the noise (decreases the variance)

118 Example Let y := Xβ + z, where X, β and z are as in the earlier least-squares example and ||z||_2 = 0.11 [numerical values omitted]

119-120 Example Setting λ = 0.01, β_ridge − β = −V diag(λ/(σ_1² + λ), λ/(σ_2² + λ)) V^T β + V diag(σ_1/(σ_1² + λ), σ_2/(σ_2² + λ)) U^T z = [numerical evaluation omitted]

121 Example Least-squares relative error = [value omitted]; ||β_ridge − β||_2 = 7.96 ||z||_2

122 Example [Figure: ridge-regression coefficients and coefficient error as a function of the regularization parameter, with the least-squares fit marked]

123 Maximum-a-posteriori estimator Is there a probabilistic interpretation of ridge regression? Bayesian viewpoint: β is modeled as random, not deterministic. The maximum-a-posteriori (MAP) estimator of β given y is β_MAP(y) := arg max_β f_{β|y}(β | y), where f_{β|y} is the conditional pdf of β given y

124 Maximum-a-posteriori estimator Let y ∈ R^n be a realization of y := Xβ + z, where the entries of β and z are iid Gaussian with mean zero and variances σ_1² and σ_2². If X ∈ R^{n×m} is known, then β_MAP = arg min_β ||y − Xβ||_2² + λ ||β||_2², where λ := σ_2²/σ_1². What does it mean if σ_1² is tiny or large? How about σ_2²?

125 Problem How do we calibrate the regularization parameter? We cannot use the coefficient error (we don't know the true value!). We cannot minimize over the training data (why?). Solution: check the fit on new data

126 Cross validation Given a set of examples (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)): 1. Partition the data into a training set X_train ∈ R^{n_train × p}, y_train ∈ R^{n_train} and a validation set X_val ∈ R^{n_val × p}, y_val ∈ R^{n_val}. 2. Fit the model using the training set for every λ in a set Λ, β_ridge(λ) := arg min_β ||y_train − X_train β||_2² + λ ||β||_2², and evaluate the fitting error on the validation set, err(λ) := ||y_val − X_val β_ridge(λ)||_2². 3. Choose the value of λ that minimizes the validation-set error, λ_cv := arg min_{λ ∈ Λ} err(λ)
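
A compact sketch of this procedure on synthetic data (the split sizes and the grid Λ are invented):

    import numpy as np

    def fit_ridge(X, y, lam):
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(9)
    n, p = 60, 10
    X = rng.standard_normal((n, p))
    y = X @ rng.standard_normal(p) + rng.standard_normal(n)

    # Split into a training set and a validation set
    X_train, y_train = X[:40], y[:40]
    X_val, y_val = X[40:], y[40:]

    lambdas = np.logspace(-3, 2, 20)          # candidate set Lambda
    errs = [np.linalg.norm(y_val - X_val @ fit_ridge(X_train, y_train, lam))
            for lam in lambdas]
    lam_cv = lambdas[int(np.argmin(errs))]
    print("selected lambda:", lam_cv)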

127 Prediction of house prices Aim: Predicting the price of a house from 1. Area of the living room 2. Condition (integer between 1 and 5) 3. Grade (integer between 7 and 12) 4. Area of the house without the basement 5. Area of the basement 6. The year it was built 7. Latitude 8. Longitude 9. Average area of the living room of houses within 15 blocks

128 Prediction of house prices Training data: 15 houses; validation data: 15 houses; test data: 15 houses. Condition number of the training-data feature matrix: 9.94. We evaluate the relative fit ||y − Xβ_ridge||_2 / ||y||_2

129 Prediction of house prices [Figure: coefficients and l2-norm cost on the training and validation sets as a function of the regularization parameter]

130 Prediction of house prices Best λ: 0.27. Validation set error: [value omitted] (least-squares: 0.906). Test set error: [value omitted] (least-squares: 1.186)

131-133 Training / Validation / Test [Figures: estimated price (dollars) vs. true price (dollars) for least squares and ridge regression on the training, validation and test sets]

134 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

135 The classification problem Goal: assign examples to one of several predefined categories. We have n examples of labels and corresponding features, (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)). Here, we consider only two categories: labels are 0 or 1

136 Logistic function Smoothed version of the step function: g(t) := 1 / (1 + exp(−t))

137 Logistic function [Figure: g(t) = 1 / (1 + exp(−t)) as a function of t]

138 Logistic regression Generalized linear model: linear model + entrywise link function, y^(i) ≈ g(β_0 + ⟨x^(i), β⟩)

139 Maximum likelihood If y^(1), ..., y^(n) are independent samples from Bernoulli random variables with parameter p_{y^(i)}(1) := g(⟨x^(i), β⟩), where x^(1), ..., x^(n) ∈ R^p are known, the ML estimate of β given y^(1), ..., y^(n) is β_ML := arg max_β Σ_{i=1}^n y^(i) log g(⟨x^(i), β⟩) + (1 − y^(i)) log(1 − g(⟨x^(i), β⟩))

140-141 Maximum likelihood L(β) := p_{y^(1),...,y^(n)}(y^(1), ..., y^(n)) = Π_{i=1}^n g(⟨x^(i), β⟩)^{y^(i)} (1 − g(⟨x^(i), β⟩))^{1 − y^(i)}

142 Logistic-regression estimator β_LR := arg max_β Σ_{i=1}^n y^(i) log g(⟨x^(i), β⟩) + (1 − y^(i)) log(1 − g(⟨x^(i), β⟩)). For a new x the logistic-regression prediction is y_LR := 1 if g(⟨x, β_LR⟩) ≥ 1/2, and 0 otherwise. g(⟨x, β_LR⟩) can be interpreted as the probability that the label is 1
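
A bare-bones sketch of fitting this estimator by gradient ascent on the log-likelihood (synthetic data, no intercept; the step size and iteration count are arbitrary choices, not from the slides):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def fit_logistic(X, y, steps=5000, lr=0.1):
        # Maximize the Bernoulli log-likelihood by gradient ascent
        beta = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = X.T @ (y - sigmoid(X @ beta))   # gradient of the log-likelihood
            beta += lr * grad / len(y)
        return beta

    rng = np.random.default_rng(10)
    n, p = 200, 2
    X = rng.standard_normal((n, p))
    beta_true = np.array([2.0, -1.0])
    y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)   # Bernoulli labels

    beta_lr = fit_logistic(X, y)
    y_pred = (sigmoid(X @ beta_lr) >= 0.5).astype(float)
    print(beta_lr, np.mean(y_pred == y))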

143 Iris data set Aim: Classify flowers using sepal width and length Two species, 5 examples each: Iris setosa (label 0): sepal lengths 5.4, 4.3, 4.8, 5.1 and 5.7, and sepal widths 3.7, 3, 3.1, 3.8 and 3.8 Iris versicolor (label 1): sepal lengths 6.5, 5.7, 7, 6.3 and 6.1, and sepal widths 2.8, 2.8, 3.2, 2.3 and 2.8 Two new examples: (5.1, 3.5), (5,2)

144 Iris data set After centering and normalizing, β_LR = [32.1, 29.6]^T and β_0 = 2.06. [Table: x^(i)[1], x^(i)[2], ⟨x^(i), β_LR⟩ + β_0 and g(⟨x^(i), β_LR⟩ + β_0) for each example; values omitted]

145 Iris data set [Figure: sepal width vs. sepal length for the setosa and versicolor examples, with the new examples marked ???]

146 Iris data set [Figure: petal length vs. sepal width for the virginica and versicolor examples]

147 Digit classification MNIST data. Aim: distinguish one digit from another. x_i is an image of a 6 or a 9; y_i = 1 or y_i = 0 if image i is a 6 or a 9, respectively. 2000 training examples and 2000 test examples, each half 6s and half 9s. Training error rate: 0.0, test error rate: 0.006

148 Digit classification: β [Image: the fitted coefficient vector β displayed as an image]

149-152 Digit classification: true positives, true negatives, false positives and false negatives [Images: example digits with β^T x and the estimated probability of being a 6 for each]

153 Digit Classification This is a toy problem: distinguishing one digit from another is very easy. Classifying any given digit is harder. We used it to give insight into how logistic regression works. It turns out that, on this simplified problem, a very easy choice of β gives good results. Can you guess it?

154 Digit Classification Average of the 6s minus the average of the 9s. Training error: 0.005, test error: [value omitted]
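
A hedged sketch of this mean-difference classifier on synthetic stand-in data: random Gaussian blobs replace the MNIST images, and the decision threshold at the midpoint between the class means is an added assumption (the slide does not specify one):

    import numpy as np

    rng = np.random.default_rng(11)
    d = 784                                   # e.g. 28 x 28 images, flattened (assumed)
    # Synthetic stand-ins for the two classes: Gaussian blobs around different means
    mean6, mean9 = rng.standard_normal(d), rng.standard_normal(d)
    X6 = mean6 + 10.0 * rng.standard_normal((1000, d))
    X9 = mean9 + 10.0 * rng.standard_normal((1000, d))

    # "Average of the 6s minus the average of the 9s" as the coefficient vector
    beta = X6.mean(axis=0) - X9.mean(axis=0)
    threshold = beta @ (X6.mean(axis=0) + X9.mean(axis=0)) / 2

    X = np.vstack([X6, X9])
    y = np.concatenate([np.ones(1000), np.zeros(1000)])   # 1 = six, 0 = nine
    y_hat = (X @ beta >= threshold).astype(float)
    print("error rate:", np.mean(y_hat != y))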
