Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

1 Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis Carlos Fernandez-Granda

2 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

3 Regression The aim is to learn a function h that relates a response or dependent variable y to several observed variables x_1, x_2, ..., x_p, known as covariates, features or independent variables. The response is assumed to be of the form y = h(x) + z, where x ∈ R^p contains the features and z is noise

4 Linear regression The regression function h is assumed to be linear: y^(i) = (x^(i))^T β + z^(i), 1 ≤ i ≤ n. Our aim is to estimate β ∈ R^p from the data

5 Linear regression In matrix form, [y^(1); y^(2); ...; y^(n)] = [x^(1)_1 x^(1)_2 ... x^(1)_p; x^(2)_1 x^(2)_2 ... x^(2)_p; ...; x^(n)_1 x^(n)_2 ... x^(n)_p] [β_1; β_2; ...; β_p] + [z^(1); z^(2); ...; z^(n)]. Equivalently, y = Xβ + z

6 Linear model for GDP [Table: GDP (millions), population and unemployment rate for North Dakota, Alabama, Mississippi, Arkansas, Kansas, Georgia, Iowa, West Virginia, Kentucky and Tennessee; the GDP of Tennessee is the unknown (???)]

7 Centering Subtract the average av(y) from the entries of y and the column averages av(X) from the columns of X to obtain y_cent and X_cent [numerical values omitted]

8 Normalizing Divide the centered y by its standard deviation std(y) and the centered columns of X by their standard deviations std(X) to obtain y_norm and X_norm [numerical values omitted]

9 Linear model for GDP Aim: find β ∈ R^2 such that y_norm ≈ X_norm β. The estimate for the GDP of Tennessee will be y_Ten = av(y) + std(y) ⟨x_norm,Ten, β⟩, where x_norm,Ten is centered using av(X) and normalized using std(X)

10 Temperature predictor A friend tells you: "I found a cool way to predict the average daily temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"

11 System of equations A is n × p and full rank; consider the system Ax = b. If n < p the system is underdetermined: infinite solutions for any b! (overfitting) If n = p the system is determined: unique solution for any b (overfitting) If n > p the system is overdetermined: a unique solution exists only if b ∈ col(A) (if there is noise, no solutions)

12 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

13 Least squares For fixed β we can evaluate the error using Σ_{i=1}^n (y^(i) − (x^(i))^T β)² = ||y − Xβ||_2². The least-squares estimate β_LS minimizes this cost function: β_LS := arg min_β ||y − Xβ||_2 = (X^T X)^{-1} X^T y if X is full rank and n ≥ p
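
A minimal NumPy sketch of this estimate on synthetic data (the dimensions, coefficients and noise level below are made up for illustration); np.linalg.lstsq solves the same minimization without forming X^T X explicitly:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 3                                     # synthetic problem size
    X = rng.standard_normal((n, p))                   # feature matrix
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.standard_normal(n)  # y = X beta + z

    # Closed form (X^T X)^{-1} X^T y, valid when X is full rank and n >= p
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

    # np.linalg.lstsq solves the same least-squares problem more stably
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_ls, beta_lstsq)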

14 Least-squares fit [Figure: data points (x, y) and the least-squares fit]

15-20 Least-squares solution Let X = USV^T and write y = UU^T y + (I − UU^T) y. By the Pythagorean theorem, ||y − Xβ||_2² = ||(I − UU^T) y||_2² + ||UU^T y − Xβ||_2², so arg min_β ||y − Xβ||_2² = arg min_β ||UU^T y − Xβ||_2² = arg min_β ||UU^T y − USV^T β||_2² = arg min_β ||U^T y − SV^T β||_2² = VS^{-1} U^T y = (X^T X)^{-1} X^T y
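
A short NumPy check of this derivation on random (full-rank) data: the SVD formula and the pseudoinverse give the same coefficients, and the fitted values equal the projection UU^T y:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 4))
    y = rng.standard_normal(50)

    # Thin SVD: X = U S V^T with U (n x p), s (p,), Vt (p x p)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # beta_LS = V S^{-1} U^T y, equal to (X^T X)^{-1} X^T y when X is full rank
    beta_svd = Vt.T @ ((U.T @ y) / s)
    beta_pinv = np.linalg.pinv(X) @ y
    print(np.allclose(beta_svd, beta_pinv))          # True

    # The fit X beta_LS is the orthogonal projection of y onto col(X)
    print(np.allclose(X @ beta_svd, U @ (U.T @ y)))  # True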

21 Linear model for GDP The least-squares estimate is β_LS = [coefficients omitted]. GDP is roughly proportional to the population; unemployment has a negative (linear) effect

22 Linear model for GDP [Table: true GDP and least-squares estimate for North Dakota, Alabama, Mississippi, Arkansas, Kansas, Georgia, Iowa, West Virginia, Kentucky and Tennessee; values omitted]

23-24 Maximum temperatures in Oxford, UK [Figures: monthly maximum temperature (Celsius) over time]

25 Linear model y_t ≈ β_0 + β_1 cos(2πt/12) + β_2 sin(2πt/12) + β_3 t, where 1 ≤ t ≤ n is the time in months (n = [value omitted])
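
A sketch of fitting this type of model by least squares; the monthly series below is simulated (the length, seasonal amplitude and trend are invented), not the Oxford record:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 240                                   # 20 years of monthly data (assumed)
    t = np.arange(1, n + 1)

    # Design matrix for y_t ~ b0 + b1 cos(2 pi t / 12) + b2 sin(2 pi t / 12) + b3 t
    X = np.column_stack([np.ones(n),
                         np.cos(2 * np.pi * t / 12),
                         np.sin(2 * np.pi * t / 12),
                         t])

    # Simulated temperatures: seasonal cycle plus a small linear trend plus noise
    y = 10 + 6 * np.cos(2 * np.pi * t / 12) + 0.001 * t + rng.standard_normal(n)

    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    trend_per_100_years = coef[3] * 12 * 100  # slope is per month
    print(coef, trend_per_100_years)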

26-28 Model fitted by least squares [Figures: temperature data (Celsius) and the fitted model]

29 Trend: increase of 0.75 °C / 100 years (1.35 °F) [Figure: temperature data (Celsius) and the fitted trend]

30-32 Model for minimum temperatures [Figures: temperature data (Celsius) and the fitted model]

33 Trend: increase of 0.88 °C / 100 years (1.58 °F) [Figure: temperature data (Celsius) and the fitted trend]

34 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

35 Geometric interpretation Any vector Xβ is in the span of the columns of X. The least-squares estimate is the closest vector to y that can be represented in this way. This is the projection of y onto the column space of X: Xβ_LS = USV^T VS^{-1} U^T y = UU^T y

36 Geometric interpretation

37 Face denoising We denoise by projecting onto: S_1, the span of the 9 images from the same subject; S_2, the span of the 360 images in the training set. Test error: ||x − P_{S_1} y||_2 / ||x||_2 = [value omitted], ||x − P_{S_2} y||_2 / ||x||_2 = 0.078
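
A toy sketch of denoising by projecting onto a subspace spanned by training vectors; random Gaussian vectors stand in for the face images, and the dimensions are arbitrary:

    import numpy as np

    rng = np.random.default_rng(3)
    d, k = 1000, 9                        # signal dimension and subspace dimension
    T = rng.standard_normal((d, k))       # columns span the "training" subspace S
    x = T @ rng.standard_normal(k)        # clean signal lying in S
    y = x + rng.standard_normal(d)        # noisy observation

    # Orthonormal basis for S via the SVD, then project: P_S y = U U^T y
    U, _, _ = np.linalg.svd(T, full_matrices=False)
    estimate = U @ (U.T @ y)

    print(np.linalg.norm(x - y) / np.linalg.norm(x))         # error before denoising
    print(np.linalg.norm(x - estimate) / np.linalg.norm(x))  # error after projection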

38 S_1 := span(the 9 images from the same subject) [images omitted]

39 Denoising via projection onto S_1 [Images: signal x, noise z, data y = x + z, and the estimate P_{S_1} y]

40 S_2 := span(the 360 images in the training set) [images omitted]

41 Denoising via projection onto S_2 [Images: signal x, noise z, data y = x + z, and the estimate P_{S_2} y]

42 P_{S_1} y and P_{S_2} y [Images: the signal x, P_{S_1} y and P_{S_2} y]

43-44 Lessons of Face Denoising What does our intuition learned from face denoising tell us about linear regression? More features = larger column space. Larger column space = captures more of the true image. Larger column space = captures more of the noise. Balance between underfitting and overfitting

45 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

46 Motivation Model the data y_1, ..., y_n as realizations of a set of random variables y_1, ..., y_n. The joint pdf depends on a vector of parameters β: f_β(y_1, ..., y_n) := f_{y_1,...,y_n}(y_1, ..., y_n) is the probability density of y_1, ..., y_n at the observed data. Idea: choose β such that the density is as high as possible

47 Likelihood The likelihood is equal to the joint pdf, L_{y_1,...,y_n}(β) := f_β(y_1, ..., y_n), interpreted as a function of the parameters. The log-likelihood function is the log of the likelihood, log L_{y_1,...,y_n}(β)

48 Maximum-likelihood estimator The likelihood quantifies how likely the data are according to the model. Maximum-likelihood (ML) estimator: β_ML(y_1, ..., y_n) := arg max_β L_{y_1,...,y_n}(β) = arg max_β log L_{y_1,...,y_n}(β). Maximizing the log-likelihood is equivalent, and often more convenient

49 Probabilistic interpretation We model the noise as an iid Gaussian random vector z. Entries have zero mean and variance σ². The data are a realization of the random vector y := Xβ + z. y is Gaussian with mean Xβ and covariance matrix σ²I

50 Likelihood The joint pdf of y is f_y(a) := Π_{i=1}^n (1/√(2πσ²)) exp(−(a[i] − (Xβ)[i])² / (2σ²)) = (1/(√((2π)^n) σ^n)) exp(−||a − Xβ||_2² / (2σ²)). The likelihood is L_y(β) = (1/(√((2π)^n) σ^n)) exp(−||y − Xβ||_2² / (2σ²))

51 Maximum-likelihood estimate The maximum-likelihood estimate is β_ML = arg max_β L_y(β) = arg max_β log L_y(β) = arg min_β ||y − Xβ||_2² = β_LS
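
A quick numerical check of this equivalence on synthetic Gaussian data (the dimensions and σ are invented): minimizing the negative log-likelihood with a generic optimizer recovers the least-squares solution:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    n, p, sigma = 80, 3, 0.5
    X = rng.standard_normal((n, p))
    beta_true = rng.standard_normal(p)
    y = X @ beta_true + sigma * rng.standard_normal(n)

    # Negative log-likelihood under y ~ N(X beta, sigma^2 I), up to additive constants
    def neg_log_likelihood(beta):
        return np.sum((y - X @ beta) ** 2) / (2 * sigma ** 2)

    beta_ml = minimize(neg_log_likelihood, np.zeros(p)).x
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
    print(np.allclose(beta_ml, beta_ls, atol=1e-4))  # ML and LS coincide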

52 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

53-55 Estimation error If the data are generated according to the linear model y := Xβ + z, then β_LS − β = (X^T X)^{-1} X^T (Xβ + z) − β = (X^T X)^{-1} X^T z, as long as X is full rank

56-58 LS estimator is unbiased Assume the noise z is random and has zero mean; then E(β_LS − β) = (X^T X)^{-1} X^T E(z) = 0. The estimate is unbiased: its mean equals β

59 Least-squares error If the data are generated according to the linear model y := Xβ + z, then ||z||_2 / σ_1 ≤ ||β_LS − β||_2 ≤ ||z||_2 / σ_p, where σ_1 and σ_p are the largest and smallest singular values of X

60 Least-squares error: Proof The error is given by β_LS − β = (X^T X)^{-1} X^T z. How can we bound ||(X^T X)^{-1} X^T z||_2?

61 Singular values The singular values of a matrix A ∈ R^{n×p} of rank p satisfy σ_1 = max_{||x||_2 = 1, x ∈ R^p} ||Ax||_2 and σ_p = min_{||x||_2 = 1, x ∈ R^p} ||Ax||_2

62 Least-squares error β_LS − β = VS^{-1} U^T z. The smallest and largest singular values of VS^{-1} U^T are 1/σ_1 and 1/σ_p, so ||z||_2 / σ_1 ≤ ||VS^{-1} U^T z||_2 ≤ ||z||_2 / σ_p

63 Experiment X_train, X_test, z_train and β are sampled iid from a standard Gaussian. The data have 50 features. y_train = X_train β + z_train, y_test = X_test β (no test noise). We use y_train and X_train to compute β_LS. error_train = ||X_train β_LS − y_train||_2 / ||y_train||_2, error_test = ||X_test β_LS − y_test||_2 / ||y_test||_2
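
A sketch reproducing this experiment with NumPy (the values of n are arbitrary); the training error starts at zero, the relative training error approaches the noise level, and the test error shrinks:

    import numpy as np

    rng = np.random.default_rng(5)
    p = 50
    for n in [50, 100, 200, 1000, 5000]:
        beta = rng.standard_normal(p)
        X_train = rng.standard_normal((n, p))
        z_train = rng.standard_normal(n)
        y_train = X_train @ beta + z_train
        X_test = rng.standard_normal((n, p))
        y_test = X_test @ beta                  # no test noise, as in the slide

        beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)
        err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)
        noise = np.linalg.norm(z_train) / np.linalg.norm(y_train)
        print(n, round(err_train, 3), round(err_test, 3), round(noise, 3))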

64 Experiment [Figure: relative l2-norm training and test errors and the training noise level as a function of n]

65-69 Experiment Questions 1. Can we approximate the relative noise level ||z||_2 / ||y||_2? ||β||_2 ≈ √50, ||X_train β||_2 ≈ √(50 n), ||z_train||_2 ≈ √n. 2. Why does the training error start at 0? X is square and invertible. 3. Why does the relative training error converge to the noise level? ||X_train β_LS − y_train||_2 = ||X_train (β_LS − β) − z_train||_2 and β_LS → β. 4. Why does the relative test error converge to zero? We assumed no test noise, and β_LS → β

70 Non-asymptotic bound Let y := Xβ + z, where the entries of X and z are iid standard Gaussians. The least-squares estimate satisfies √((1 − ε) p) / (√n (1 + ε)) ≤ ||β_LS − β||_2 ≤ √((1 + ε) p) / (√n (1 − ε)) with probability at least 1 − 1/p − 2 exp(−pε²/8), as long as n ≥ 64 p log(12/ε)/ε²

71 Proof ||U^T z||_2 / σ_1 ≤ ||VS^{-1} U^T z||_2 ≤ ||U^T z||_2 / σ_p

72-73 Projection onto a fixed subspace Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0, P(k(1 − ε) < ||P_S z||_2² < k(1 + ε)) ≥ 1 − 2 exp(−kε²/8). Consequence: with probability 1 − 2 exp(−pε²/8), (1 − ε) p ≤ ||U^T z||_2² ≤ (1 + ε) p

74 Singular values of a Gaussian matrix Let A be an n × k matrix with iid standard Gaussian entries such that n > k. For any fixed ε > 0, the singular values of A satisfy √n (1 − ε) ≤ σ_k ≤ σ_1 ≤ √n (1 + ε) with probability at least 1 − 1/k, as long as n ≥ 64 k log(12/ε)/ε²

75 Proof With probability 1 − 1/p, √n (1 − ε) ≤ σ_p ≤ σ_1 ≤ √n (1 + ε), as long as n ≥ 64 p log(12/ε)/ε²
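
A small simulation illustrating the concentration used in the proof (p and the values of n are arbitrary): the extreme singular values of a standard Gaussian matrix approach √n as n grows:

    import numpy as np

    rng = np.random.default_rng(6)
    p = 50
    for n in [100, 1000, 10000]:
        A = rng.standard_normal((n, p))
        s = np.linalg.svd(A, compute_uv=False)
        # Largest and smallest singular values, rescaled by sqrt(n)
        print(n, round(s[0] / np.sqrt(n), 3), round(s[-1] / np.sqrt(n), 3))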

76 Experiment: ||β||_2 ≈ √p [Figure: plot of ||β − β_LS||_2 / ||β||_2, the relative coefficient error (l2 norm), as a function of n for p = 50, 100, 200, together with a 1/√n reference curve]

77 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

78 Condition number The condition number of A ∈ R^{n×p}, n ≥ p, is the ratio σ_1/σ_p of its largest and smallest singular values. A matrix is ill conditioned if its condition number is large (it is almost rank deficient)

79 Noise amplification Let y := Xβ + z, where z is iid standard Gaussian. With probability at least 1 − 2 exp(−ε²/8), ||β_LS − β||_2 ≥ √(1 − ε) / σ_p, where σ_p is the smallest singular value of X

80-84 Proof ||β_LS − β||_2² = ||VS^{-1} U^T z||_2² = ||S^{-1} U^T z||_2² (V is orthogonal) = Σ_{i=1}^p (u_i^T z)² / σ_i² ≥ (u_p^T z)² / σ_p²

85-86 Projection onto a fixed subspace Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0, P(k(1 − ε) < ||P_S z||_2² < k(1 + ε)) ≥ 1 − 2 exp(−kε²/8). Consequence: with probability 1 − 2 exp(−ε²/8), (u_p^T z)² ≥ 1 − ε

87 Example Let y := Xβ + z, where X, β and z are given numerically and ||z||_2 = 0.11 [values omitted]

88 Example Condition number = [value omitted]. X = USV^T [numerical SVD factors omitted]

89-94 Example β_LS − β = VS^{-1} U^T z = [numerical evaluation omitted], so that ||β_LS − β||_2 = [factor omitted] ||z||_2

95-96 Multicollinearity The feature matrix is ill conditioned if any subset of columns is close to being linearly dependent (there is a vector almost in the null space). This occurs if features are highly correlated. For any X ∈ R^{n×p} with normalized columns, if X_i and X_j, i ≠ j, satisfy ⟨X_i, X_j⟩ ≥ 1 − ε², then the smallest singular value satisfies σ_p ≤ ε. Proof idea: consider ||X(e_i − e_j)||_2
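
A minimal sketch of this effect: two nearly identical (hence highly correlated) columns give a tiny smallest singular value and a large condition number (the correlation level is arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 200
    x1 = rng.standard_normal(n)
    x2 = x1 + 0.01 * rng.standard_normal(n)      # second feature almost equal to the first

    X = np.column_stack([x1, x2])
    X = X / np.linalg.norm(X, axis=0)            # normalize the columns

    s = np.linalg.svd(X, compute_uv=False)
    print("inner product:", X[:, 0] @ X[:, 1])   # close to 1
    print("smallest singular value:", s[-1])     # close to 0
    print("condition number:", s[0] / s[-1])     # large: ill conditioned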

97 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

98 Motivation Avoid noise amplification due to multicollinearity Problem: Noise amplification blows up coefficients Solution: Penalize large-norm solutions when fitting the model Adding a penalty term promoting a particular structure is called regularization

99-103 Ridge regression For a fixed regularization parameter λ > 0, β_ridge := arg min_β ||y − Xβ||_2² + λ ||β||_2² = (X^T X + λI)^{-1} X^T y. λI increases the singular values of X^T X. When λ → 0, β_ridge → β_LS. When λ → ∞, β_ridge → 0

104-106 Proof β_ridge is the solution to a modified least-squares problem: β_ridge = arg min_β || [y; 0] − [X; √λ I] β ||_2² = ([X; √λ I]^T [X; √λ I])^{-1} [X; √λ I]^T [y; 0] = (X^T X + λI)^{-1} X^T y
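
A quick NumPy check of this equivalence on random data (the dimensions and λ are arbitrary):

    import numpy as np

    rng = np.random.default_rng(8)
    n, p, lam = 30, 5, 0.5
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    # Closed form: (X^T X + lambda I)^{-1} X^T y
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # Same solution from the augmented least-squares problem [X; sqrt(lambda) I], [y; 0]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    print(np.allclose(beta_ridge, beta_aug))   # True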

107-112 Modified projection y_ridge := Xβ_ridge = X(X^T X + λI)^{-1} X^T y = USV^T (VS²V^T + λVV^T)^{-1} VSU^T y = USV^T V(S² + λI)^{-1} V^T VSU^T y = US(S² + λI)^{-1} SU^T y = Σ_{i=1}^p (σ_i²/(σ_i² + λ)) ⟨y, u_i⟩ u_i. The component of the data in the direction of u_i is shrunk by σ_i²/(σ_i² + λ)

113-115 Modified projection: Relation to PCA The component of the data in the direction of u_i is shrunk by σ_i²/(σ_i² + λ). Instead of orthogonally projecting onto the column space of X as in standard regression, we shrink and project. Which directions are shrunk the most? The directions in the data with smallest variance. In PCA, we delete the directions with smallest variance (i.e., shrink them to zero). We can think of ridge regression as a continuous variant of performing regression on principal components

116 Ridge-regression estimate If y := Xβ + z, then β_ridge = V diag(σ_1²/(σ_1² + λ), ..., σ_p²/(σ_p² + λ)) V^T β + V diag(σ_1/(σ_1² + λ), ..., σ_p/(σ_p² + λ)) U^T z, where X = USV^T and σ_1, ..., σ_p are the singular values. For comparison, β_LS = β + VS^{-1} U^T z

117 Bias-variance tradeoff The error β_ridge − β can be divided into two terms: bias (depends on β) and variance (depends on z). The bias equals E(β_ridge − β) = −V diag(λ/(σ_1² + λ), ..., λ/(σ_p² + λ)) V^T β. Larger λ increases the bias, but dampens the noise (decreases the variance)

118 Example Let y := Xβ + z, where X, β and z are as in the earlier least-squares example and ||z||_2 = 0.11 [numerical values omitted]

119-120 Example Setting λ = 0.01, β_ridge − β = −V diag(λ/(σ_1² + λ), λ/(σ_2² + λ)) V^T β + V diag(σ_1/(σ_1² + λ), σ_2/(σ_2² + λ)) U^T z = [numerical evaluation omitted]

121 Example Least-squares relative error = [value omitted]; ||β_ridge − β||_2 = 7.96 ||z||_2

122 Example [Figure: ridge-regression coefficients and coefficient error as a function of the regularization parameter, with the least-squares fit marked]

123 Maximum-a-posteriori estimator Is there a probabilistic interpretation of ridge regression? Bayesian viewpoint: β is modeled as random, not deterministic. The maximum-a-posteriori (MAP) estimator of β given y is β_MAP(y) := arg max_β f_{β|y}(β | y), where f_{β|y} is the conditional pdf of β given y

124 Maximum-a-posteriori estimator Let y ∈ R^n be a realization of y := Xβ + z, where the entries of β and z are iid Gaussian with mean zero and variances σ_1² and σ_2². If X ∈ R^{n×m} is known, then β_MAP = arg min_β ||y − Xβ||_2² + λ ||β||_2², where λ := σ_2²/σ_1². What does it mean if σ_1² is tiny or large? How about σ_2²?

125 Problem How do we calibrate the regularization parameter? We cannot use the coefficient error (we don't know the true value!). We cannot minimize over the training data (why?). Solution: check the fit on new data

126 Cross validation Given a set of examples (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)): 1. Partition the data into a training set X_train ∈ R^{n_train × p}, y_train ∈ R^{n_train} and a validation set X_val ∈ R^{n_val × p}, y_val ∈ R^{n_val}. 2. Fit the model using the training set for every λ in a set Λ, β_ridge(λ) := arg min_β ||y_train − X_train β||_2² + λ ||β||_2², and evaluate the fitting error on the validation set, err(λ) := ||y_val − X_val β_ridge(λ)||_2². 3. Choose the value of λ that minimizes the validation-set error, λ_cv := arg min_{λ ∈ Λ} err(λ)
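
A compact sketch of this procedure on synthetic data (the split sizes and the grid Λ are invented):

    import numpy as np

    def fit_ridge(X, y, lam):
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(9)
    n, p = 60, 10
    X = rng.standard_normal((n, p))
    y = X @ rng.standard_normal(p) + rng.standard_normal(n)

    # Split into a training set and a validation set
    X_train, y_train = X[:40], y[:40]
    X_val, y_val = X[40:], y[40:]

    lambdas = np.logspace(-3, 2, 20)          # candidate set Lambda
    errs = [np.linalg.norm(y_val - X_val @ fit_ridge(X_train, y_train, lam))
            for lam in lambdas]
    lam_cv = lambdas[int(np.argmin(errs))]
    print("selected lambda:", lam_cv)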

127 Prediction of house prices Aim: Predicting the price of a house from 1. Area of the living room 2. Condition (integer between 1 and 5) 3. Grade (integer between 7 and 12) 4. Area of the house without the basement 5. Area of the basement 6. The year it was built 7. Latitude 8. Longitude 9. Average area of the living room of houses within 15 blocks

128 Prediction of house prices Training data: 15 houses; validation data: 15 houses; test data: 15 houses. Condition number of the training-data feature matrix: 9.94. We evaluate the relative fit ||y − Xβ_ridge||_2 / ||y||_2

129 Prediction of house prices [Figure: coefficients and l2-norm cost on the training and validation sets as a function of the regularization parameter]

130 Prediction of house prices Best λ: 0.27. Validation set error: [value omitted] (least-squares: 0.906). Test set error: [value omitted] (least-squares: 1.186)

131-133 Training / Validation / Test [Figures: estimated price (dollars) vs. true price (dollars) for least squares and ridge regression on the training, validation and test sets]

134 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification

135 The classification problem Goal: assign examples to one of several predefined categories. We have n examples of labels and corresponding features, (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)). Here, we consider only two categories: labels are 0 or 1

136 Logistic function Smoothed version of the step function: g(t) := 1 / (1 + exp(−t))

137 Logistic function [Figure: g(t) = 1 / (1 + exp(−t)) as a function of t]

138 Logistic regression Generalized linear model: linear model + entrywise link function, y^(i) ≈ g(β_0 + ⟨x^(i), β⟩)

139 Maximum likelihood If y^(1), ..., y^(n) are independent samples from Bernoulli random variables with parameter p_{y^(i)}(1) := g(⟨x^(i), β⟩), where x^(1), ..., x^(n) ∈ R^p are known, the ML estimate of β given y^(1), ..., y^(n) is β_ML := arg max_β Σ_{i=1}^n y^(i) log g(⟨x^(i), β⟩) + (1 − y^(i)) log(1 − g(⟨x^(i), β⟩))

140-141 Maximum likelihood L(β) := p_{y^(1),...,y^(n)}(y^(1), ..., y^(n)) = Π_{i=1}^n g(⟨x^(i), β⟩)^{y^(i)} (1 − g(⟨x^(i), β⟩))^{1 − y^(i)}

142 Logistic-regression estimator β_LR := arg max_β Σ_{i=1}^n y^(i) log g(⟨x^(i), β⟩) + (1 − y^(i)) log(1 − g(⟨x^(i), β⟩)). For a new x the logistic-regression prediction is y_LR := 1 if g(⟨x, β_LR⟩) ≥ 1/2, and 0 otherwise. g(⟨x, β_LR⟩) can be interpreted as the probability that the label is 1
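
A bare-bones sketch of fitting this estimator by gradient ascent on the log-likelihood (synthetic data, no intercept; the step size and iteration count are arbitrary choices, not from the slides):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def fit_logistic(X, y, steps=5000, lr=0.1):
        # Maximize the Bernoulli log-likelihood by gradient ascent
        beta = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = X.T @ (y - sigmoid(X @ beta))   # gradient of the log-likelihood
            beta += lr * grad / len(y)
        return beta

    rng = np.random.default_rng(10)
    n, p = 200, 2
    X = rng.standard_normal((n, p))
    beta_true = np.array([2.0, -1.0])
    y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)   # Bernoulli labels

    beta_lr = fit_logistic(X, y)
    y_pred = (sigmoid(X @ beta_lr) >= 0.5).astype(float)
    print(beta_lr, np.mean(y_pred == y))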

143 Iris data set Aim: Classify flowers using sepal width and length Two species, 5 examples each: Iris setosa (label 0): sepal lengths 5.4, 4.3, 4.8, 5.1 and 5.7, and sepal widths 3.7, 3, 3.1, 3.8 and 3.8 Iris versicolor (label 1): sepal lengths 6.5, 5.7, 7, 6.3 and 6.1, and sepal widths 2.8, 2.8, 3.2, 2.3 and 2.8 Two new examples: (5.1, 3.5), (5,2)

144 Iris data set After centering and normalizing, β_LR = [32.1, 29.6]^T and β_0 = 2.06. [Table: x^(i)[1], x^(i)[2], ⟨x^(i), β_LR⟩ + β_0 and g(⟨x^(i), β_LR⟩ + β_0) for each example; values omitted]

145 Iris data set [Figure: sepal width vs. sepal length for the setosa and versicolor examples, with the new examples marked ???]

146 Iris data set [Figure: petal length vs. sepal width for the virginica and versicolor examples]

147 Digit classification MNIST data. Aim: distinguish one digit from another. x_i is an image of a 6 or a 9; y_i = 1 or y_i = 0 if image i is a 6 or a 9, respectively. 2000 training examples and 2000 test examples, each half 6s and half 9s. Training error rate: 0.0, test error rate: 0.006

148 Digit classification: β [Image: the fitted coefficient vector β displayed as an image]

149-152 Digit classification: true positives, true negatives, false positives and false negatives [Images: example digits with β^T x and the estimated probability of being a 6 for each]

153 Digit Classification This is a toy problem: distinguishing one digit from another is very easy. Classifying any given digit is harder. We used it to give insight into how logistic regression works. It turns out that, on this simplified problem, a very easy choice of β gives good results. Can you guess it?

154 Digit Classification Average of the 6s minus the average of the 9s. Training error: 0.005, test error: [value omitted]
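
A hedged sketch of this mean-difference classifier on synthetic stand-in data: random Gaussian blobs replace the MNIST images, and the decision threshold at the midpoint between the class means is an added assumption (the slide does not specify one):

    import numpy as np

    rng = np.random.default_rng(11)
    d = 784                                   # e.g. 28 x 28 images, flattened (assumed)
    # Synthetic stand-ins for the two classes: Gaussian blobs around different means
    mean6, mean9 = rng.standard_normal(d), rng.standard_normal(d)
    X6 = mean6 + 10.0 * rng.standard_normal((1000, d))
    X9 = mean9 + 10.0 * rng.standard_normal((1000, d))

    # "Average of the 6s minus the average of the 9s" as the coefficient vector
    beta = X6.mean(axis=0) - X9.mean(axis=0)
    threshold = beta @ (X6.mean(axis=0) + X9.mean(axis=0)) / 2

    X = np.vstack([X6, X9])
    y = np.concatenate([np.ones(1000), np.zeros(1000)])   # 1 = six, 0 = nine
    y_hat = (X @ beta >= threshold).astype(float)
    print("error rate:", np.mean(y_hat != y))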
