Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010


1 Linear Regression Aarti Singh Machine Learning / Sept 27, 2010

2 Discrete to Continuous Labels
Classification (discrete labels):
- X = Document, Y = Topic (Sports, Science, News)
- X = Cell Image, Y = Diagnosis (Anemic cell, Healthy cell)
Regression (continuous labels):
- Stock Market Prediction: X = Feb01, Y = ?

3 Regression Tasks
- Weather Prediction: X = 7 pm, Y = Temp
- Estimating Contamination: X = new location, Y = sensor reading

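4 Supervised Learning
Goal: construct a predictor $f: X \to Y$ that minimizes a risk $R(f)$ (e.g. predict Y = ? for X = Feb01).
Classification (Y in {Sports, Science, News}): $R(f) = P(f(X) \neq Y)$, the Probability of Error.
Regression: $R(f) = \mathbb{E}[(f(X) - Y)^2]$, the Mean Squared Error.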

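5 Regression
Optimal predictor: $f^* = \arg\min_f \mathbb{E}[(f(X) - Y)^2]$, which gives $f^*(X) = \mathbb{E}[Y \mid X]$ (Conditional Mean).
Intuition: Signal plus (zero-mean) Noise model, $Y = f^*(X) + \epsilon$.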

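6 Regression
Optimal predictor: $f^*(X) = \mathbb{E}[Y \mid X]$
Proof Strategy: show that for any predictor $f$, $\mathbb{E}[(f(X) - Y)^2] \ge \mathbb{E}[(f^*(X) - Y)^2]$. (Dropping subscripts on the expectations for notational convenience.) Expanding the square around $f^*(X)$ leaves a cross term that equals 0.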
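The proof strategy can be written out step by step. This is the standard argument, reconstructed here rather than copied from the slide; it assumes $f^*(X) = \mathbb{E}[Y \mid X]$ and that all expectations are taken over the joint distribution of $(X, Y)$.

```latex
\begin{align*}
\mathbb{E}\big[(f(X) - Y)^2\big]
  &= \mathbb{E}\big[(f(X) - f^*(X) + f^*(X) - Y)^2\big] \\
  &= \mathbb{E}\big[(f(X) - f^*(X))^2\big]
     + 2\,\mathbb{E}\big[(f(X) - f^*(X))(f^*(X) - Y)\big]
     + \mathbb{E}\big[(f^*(X) - Y)^2\big]
\end{align*}

% Cross term: condition on X, so that f(X) - f^*(X) is a constant inside the inner expectation.
\begin{align*}
\mathbb{E}\big[(f(X) - f^*(X))(f^*(X) - Y)\big]
  &= \mathbb{E}_X\Big[(f(X) - f^*(X))\,\big(f^*(X) - \mathbb{E}[Y \mid X]\big)\Big] = 0
\end{align*}

% Therefore
\begin{align*}
\mathbb{E}\big[(f(X) - Y)^2\big]
  &= \mathbb{E}\big[(f(X) - f^*(X))^2\big] + \mathbb{E}\big[(f^*(X) - Y)^2\big]
  \;\ge\; \mathbb{E}\big[(f^*(X) - Y)^2\big]
\end{align*}
```

Equality holds exactly when $f(X) = f^*(X)$ with probability one, which is why the conditional mean is the optimal predictor under squared error.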

7 Regression
Optimal predictor: $f^*(X) = \mathbb{E}[Y \mid X]$ (Conditional Mean)
Intuition: Signal plus (zero-mean) Noise model, $Y = f^*(X) + \epsilon$.
The optimal predictor depends on the unknown distribution of $(X, Y)$.


8 Regression algorithms
Learning algorithm:
- Linear Regression
- Lasso, Ridge regression (Regularized Linear Regression)
- Nonlinear Regression
- Kernel Regression
- Regression Trees, Splines, Wavelet estimators, ...

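9 Empirical Risk Minimization (ERM)
Optimal predictor: $f^* = \arg\min_f \mathbb{E}[(f(X) - Y)^2]$
Empirical Risk Minimizer: $\hat f_n = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2$
$F$ is the class of predictors. The empirical mean stands in for the expectation, justified by the Law of Large Numbers: $\frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2 \to \mathbb{E}[(f(X) - Y)^2]$ as $n \to \infty$.
More later.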

10 ERM: you saw it before!
Learning Distributions: Max likelihood = Min negative log likelihood, where the negative log likelihood plays the role of the empirical risk.
What is the class F? The class of parametric distributions, e.g. Bernoulli($\theta$), Gaussian($\mu, \sigma^2$).

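11 Linear Regression
Least Squares Estimator: $\hat f_n = \arg\min_{f \in F_L} \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2$, where $F_L$ is the class of linear functions.
Uni-variate case: $f(X) = \beta_1 + \beta_2 X$, with $\beta_1$ = intercept and $\beta_2$ = slope.
Multi-variate case: $f(X) = \beta_1 X^{(1)} + \beta_2 X^{(2)} + \dots + \beta_p X^{(p)} = X\beta$, where $X = [X^{(1)} \dots X^{(p)}]$ and $\beta = [\beta_1 \dots \beta_p]^T$. Taking $X^{(1)} = 1$ recovers the intercept.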

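12 Least Squares Estimator
$\hat\beta = \arg\min_\beta \frac{1}{n} \sum_{i=1}^n (X_i\beta - Y_i)^2 = \arg\min_\beta \frac{1}{n} (A\beta - Y)^T (A\beta - Y)$, where $A$ is the $n \times p$ matrix whose rows are $X_1, \dots, X_n$ and $Y = [Y_1 \dots Y_n]^T$.

13 Least Squares Estimator
$J(\beta) = (A\beta - Y)^T (A\beta - Y)$. Setting the gradient to zero, $\frac{\partial J(\beta)}{\partial \beta} = 2A^T(A\beta - Y) = 0$, characterizes the minimizer $\hat\beta$.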

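14 Normal Equations
$(A^TA)\,\hat\beta = A^TY$, where $A$ is the $n \times p$ data matrix; $A^TA$ is $p \times p$, $\hat\beta$ is $p \times 1$, and $A^TY$ is $p \times 1$.
If $A^TA$ is invertible, $\hat\beta = (A^TA)^{-1} A^TY$.
When is $A^TA$ invertible? Recall: full rank matrices are invertible. What is the rank of $A^TA$?
What if $A^TA$ is not invertible? Regularization (later).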
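A minimal NumPy sketch of solving the normal equations. The synthetic data, the sizes `n` and `p`, and the names `A`, `Y`, `beta_true` are assumptions made for illustration; they do not come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: n = 50 points, p = 3 features.
# The first column is the constant 1, so beta[0] acts as the intercept.
n, p = 50, 3
A = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -3.0])
Y = A @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations: (A^T A) beta = A^T Y.
gram = A.T @ A
rank = np.linalg.matrix_rank(gram)  # A^T A is invertible only if rank == p

# Solve the p x p linear system rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(gram, A.T @ Y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(A, Y, rcond=None)

print("rank of A^T A:", rank)
print("normal equations:", beta_hat)
print("lstsq:           ", beta_lstsq)
```

Solving the linear system is usually preferred over computing $(A^TA)^{-1}$ directly, and `np.linalg.lstsq` still returns a (minimum-norm) solution when $A^TA$ is rank deficient.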

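15 Geometric Interpretation
Difference in prediction on training set: $A\hat\beta - Y$, where $A$ is the data matrix.
By the normal equations, $A^T(A\hat\beta - Y) = 0$.
$A\hat\beta$ is the orthogonal projection of $Y$ onto the linear subspace spanned by the columns of $A$.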
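The orthogonality claim can be checked numerically. The sketch below uses random data (an assumption for illustration) and assumes `A` has full column rank, so that the projection matrix `P` is well defined.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 20 points, 4 features.
A = rng.normal(size=(20, 4))
Y = rng.normal(size=20)

beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
residual = A @ beta_hat - Y

# The residual is orthogonal to every column of A: A^T (A beta_hat - Y) = 0.
orthogonality = A.T @ residual

# Projection matrix onto the column space of A: P = A (A^T A)^{-1} A^T.
P = A @ np.linalg.solve(A.T @ A, A.T)
projection = P @ Y

print("max |A^T residual|:", np.abs(orthogonality).max())
print("max |P Y - A beta_hat|:", np.abs(projection - A @ beta_hat).max())
```

Both printed quantities should be zero up to floating-point error, and `P @ P` equals `P`, as expected for a projection.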

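16 Revisiting Gradient Descent
Even when $A^TA$ is invertible, solving the normal equations might be computationally expensive if $A$ is huge.
Gradient Descent applies since $J(\beta) = (A\beta - Y)^T(A\beta - Y)$ is convex.
Initialize: $\beta^0$
Update: $\beta^{t+1} = \beta^t - \frac{\alpha}{2} \frac{\partial J(\beta)}{\partial \beta}\Big|_{\beta^t} = \beta^t - \alpha A^T(A\beta^t - Y)$. The gradient term is 0 if $\beta^t = \hat\beta$.
Stop: when some criterion is met, e.g. a fixed number of iterations, or $\left\| \frac{\partial J(\beta)}{\partial \beta}\Big|_{\beta^t} \right\| < \epsilon$.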
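A minimal sketch of the update rule, assuming $J(\beta) = (A\beta - Y)^T(A\beta - Y)$ so that each step is $\beta \leftarrow \beta - \alpha A^T(A\beta - Y)$. The function name `gradient_descent`, the synthetic data, and the step-size choice $\alpha = 1/\lambda_{\max}(A^TA)$ are assumptions for illustration, not prescriptions from the lecture.

```python
import numpy as np

def gradient_descent(A, Y, alpha, num_iters=5000, eps=1e-8):
    """Minimize J(beta) = (A beta - Y)^T (A beta - Y) by gradient descent."""
    beta = np.zeros(A.shape[1])               # Initialize at beta^0 = 0
    for _ in range(num_iters):                # Stop after a fixed number of iterations ...
        half_grad = A.T @ (A @ beta - Y)      # half of the gradient of J at beta
        if np.linalg.norm(half_grad) < eps:   # ... or once the gradient is small
            break
        beta = beta - alpha * half_grad       # Update
    return beta

rng = np.random.default_rng(0)

# Hypothetical synthetic data: intercept column plus two Gaussian features.
A = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
Y = A @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)

# Assumed step size: the reciprocal of the largest eigenvalue of A^T A.
alpha = 1.0 / np.linalg.eigvalsh(A.T @ A).max()

beta_gd = gradient_descent(A, Y, alpha)
beta_closed = np.linalg.solve(A.T @ A, A.T @ Y)

print("gradient descent:", beta_gd)
print("closed form:     ", beta_closed)
```

On this small problem the two estimates agree closely; the iterative version only needs matrix-vector products with `A`, which is the point when `A` is huge.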

17 Effect of step-size α
Large α => Fast convergence but larger residual error. Also possible oscillations.
Small α => Slow convergence but small residual error.
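The sketch below illustrates two of these regimes on a least-squares objective: a very small step converges slowly, and a step that is too large oscillates and blows up. The data, the three step sizes, and the helper `objective_after` are assumptions chosen for illustration; the sketch does not attempt to reproduce the "larger residual error" remark.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data.
A = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
Y = A @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)

def objective_after(alpha, num_iters=30):
    """Run num_iters steps of beta <- beta - alpha * A^T (A beta - Y)
    starting from beta = 0 and return J(beta) = ||A beta - Y||^2."""
    beta = np.zeros(A.shape[1])
    for _ in range(num_iters):
        beta = beta - alpha * (A.T @ (A @ beta - Y))
    r = A @ beta - Y
    return float(r @ r)

lam_max = np.linalg.eigvalsh(A.T @ A).max()

J_start = float(Y @ Y)                          # objective at beta = 0
J_small = objective_after(0.01 / lam_max)       # small step: slow progress
J_moderate = objective_after(1.0 / lam_max)     # moderate step: near the minimum
J_large = objective_after(3.0 / lam_max)        # too large: iterates overshoot and diverge

print("start:   ", J_start)
print("small:   ", J_small)
print("moderate:", J_moderate)
print("large:   ", J_large)
```

For this update, steps above $2/\lambda_{\max}(A^TA)$ make the iterate flip sign and grow along the top eigen-direction, which is the oscillation the slide warns about.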

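18 Least Squares and MLE
Intuition: Signal plus (zero-mean) Noise model, $Y_i = X_i\beta + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$.
$\hat\beta_{MLE} = \arg\max_\beta \log p(\{Y_i\} \mid \beta, \sigma^2, \{X_i\}) = \arg\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2$, where the maximized quantity is the log likelihood.
The Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian model!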
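The equivalence follows from writing out the Gaussian log likelihood. The derivation below is a standard one, assuming i.i.d. noise $\epsilon_i \sim N(0, \sigma^2)$ with $\sigma^2$ fixed; the slide itself only states the conclusion.

```latex
\begin{align*}
\log p(\{Y_i\} \mid \beta, \sigma^2, \{X_i\})
  &= \sum_{i=1}^n \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(X_i\beta - Y_i)^2}{2\sigma^2} \right) \right] \\
  &= -\frac{n}{2} \log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i\beta - Y_i)^2
\end{align*}

% The first term does not depend on beta, and 1/(2 sigma^2) > 0, so
\begin{align*}
\hat\beta_{MLE}
  = \arg\max_\beta \log p(\{Y_i\} \mid \beta, \sigma^2, \{X_i\})
  = \arg\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2
\end{align*}
```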

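19 Regularized Least Squares and MAP
What if $A^TA$ is not invertible?
$\hat\beta_{MAP} = \arg\max_\beta \log p(\{Y_i\} \mid \beta, \sigma^2, \{X_i\}) + \log p(\beta)$, i.e. log likelihood plus log prior.
I) Gaussian Prior: $\beta \sim N(0, \tau^2 I)$, so $p(\beta) \propto \exp(-\beta^T\beta / 2\tau^2)$.
$\hat\beta_{MAP} = \arg\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2 + \lambda \|\beta\|_2^2$, with $\lambda \ge 0$. This is Ridge Regression.
Closed form: HW
The prior belief that β is Gaussian with zero mean biases the solution toward "small" β.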
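A minimal sketch of ridge regression on a design where $A^TA$ is singular. It uses the standard closed form $\hat\beta = (A^TA + \lambda I)^{-1} A^TY$, which is stated here as an assumption because the slide leaves the closed form as homework; the function name `ridge_fit` and the data are hypothetical.

```python
import numpy as np

def ridge_fit(A, Y, lam):
    """Minimize ||A beta - Y||^2 + lam * ||beta||_2^2
    using the (assumed) closed form (A^T A + lam I)^{-1} A^T Y."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)

rng = np.random.default_rng(0)

# Hypothetical data with two identical columns, so A^T A has rank 1
# and the unregularized normal equations have no unique solution.
x = rng.normal(size=30)
A = np.column_stack([x, x])
Y = 2.0 * x + 0.1 * rng.normal(size=30)

rank = np.linalg.matrix_rank(A.T @ A)

beta_ridge = ridge_fit(A, Y, lam=1.0)      # well defined for any lam > 0
beta_heavier = ridge_fit(A, Y, lam=10.0)   # stronger penalty shrinks beta further

print("rank of A^T A:", rank)
print("ridge, lam = 1: ", beta_ridge)
print("ridge, lam = 10:", beta_heavier)
```

With duplicated columns the penalty splits the weight evenly between them, and increasing `lam` pulls the solution toward zero, matching the "small β" bias of the Gaussian prior.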

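20 Regularized Least Squares and MAP
What if $A^TA$ is not invertible?
$\hat\beta_{MAP} = \arg\max_\beta \log p(\{Y_i\} \mid \beta, \sigma^2, \{X_i\}) + \log p(\beta)$, i.e. log likelihood plus log prior.
II) Laplace Prior: $\beta_j \sim \text{Laplace}(0, t)$ independently, so $p(\beta_j) \propto \exp(-|\beta_j| / t)$.
$\hat\beta_{MAP} = \arg\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2 + \lambda \|\beta\|_1$, with $\lambda \ge 0$. This is the Lasso.
The prior belief that β is Laplace with zero mean biases the solution toward "small" β.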

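21 Ridge Regression vs Lasso
Ridge Regression: $\hat\beta = \arg\min_\beta J(\beta) + \lambda \|\beta\|_2^2$
Lasso: $\hat\beta = \arg\min_\beta J(\beta) + \lambda \|\beta\|_1$ (HOT!)
Ideally we would use an l0 penalty, but the optimization becomes non-convex.
[Figure: in the (β1, β2) plane, the level sets of J(β) (βs with constant J(β)) drawn against the sets of βs with constant l2 norm, constant l1 norm, and constant l0 norm.]
Lasso (l1 penalty) results in sparse solutions: vectors with more zero coordinates. Good for high-dimensional problems, since you don't have to store all coordinates!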
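The sparsity contrast can be seen on a small synthetic problem. The lecture does not say how to solve the Lasso; the sketch below swaps in proximal gradient descent (ISTA) with soft-thresholding as one simple option. It minimizes $\frac{1}{2}\|A\beta - Y\|^2 + \lambda\|\beta\|_1$, which matches the objective above up to a rescaling of $\lambda$. All names, data, and the value of `lam` are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, Y, lam, num_iters=2000):
    """Minimize 0.5 * ||A beta - Y||^2 + lam * ||beta||_1 by ISTA."""
    L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of the smooth part's gradient
    beta = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ beta - Y)         # gradient of 0.5 * ||A beta - Y||^2
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

def ridge_fit(A, Y, lam):
    """Minimize 0.5 * ||A beta - Y||^2 + 0.5 * lam * ||beta||_2^2 in closed form."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)

rng = np.random.default_rng(0)

# Hypothetical sparse ground truth: only coordinates 0 and 3 are nonzero.
A = rng.normal(size=(100, 5))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
Y = A @ beta_true + 0.1 * rng.normal(size=100)

beta_lasso = lasso_ista(A, Y, lam=10.0)
beta_ridge = ridge_fit(A, Y, lam=10.0)

print("lasso:", beta_lasso)
print("ridge:", beta_ridge)
```

The soft-thresholding step sets small coordinates to exactly zero, whereas the ridge solution only shrinks them, so its irrelevant coordinates stay small but nonzero.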

22 Beyond Linear Regression
- Polynomial regression
- Regression with nonlinear features/basis functions
- Kernel regression - Local/Weighted regression
- Regression trees
- Spatially adaptive regression

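23 Polynomial Regression
Univariate (1-d) case: $f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_m X^m = X\beta$, where $X = [1 \; X \; X^2 \dots X^m]$ and $\beta = [\beta_0 \dots \beta_m]^T$.
The entries of $\beta$ are the weights of each feature; the powers $X^j$ are the nonlinear features. The model is still linear in $\beta$, so the least squares estimator applies unchanged.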

24 Polynomial Regression
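A minimal sketch of polynomial regression as linear regression on powers of the input. The helper name `polynomial_features`, the cubic target, and the degree `m = 3` are assumptions for illustration.

```python
import numpy as np

def polynomial_features(x, m):
    """Map 1-d inputs x to rows [1, x, x^2, ..., x^m]."""
    return np.vander(x, N=m + 1, increasing=True)

# Hypothetical noiseless cubic: y = 1 - 2x + 0.5 x^3.
x = np.linspace(-1.0, 1.0, 25)
y = 1.0 - 2.0 * x + 0.5 * x**3

A = polynomial_features(x, m=3)                     # 25 x 4 design matrix
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary least squares

print("design matrix shape:", A.shape)
print("estimated coefficients:", beta_hat)
```

Because the data are noiseless and the model class contains the true function, the fit recovers the generating coefficients up to floating-point error. With noisy data and a large `m`, the same code would overfit, which is where regularization comes in.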

25 Nonlinear Regression
$f(X) = \sum_j \beta_j \phi_j(X)$, where the $\beta_j$ are the basis coefficients and the $\phi_j(X)$ are nonlinear features/basis functions.
- Fourier Basis: good representation for oscillatory functions
- Wavelet Basis: good representation for functions localized at multiple scales
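A minimal sketch of regression with a Fourier basis, fitted by least squares. The helper `fourier_features`, the choice of sine and cosine features on $[0, 1)$, and the target function are assumptions for illustration.

```python
import numpy as np

def fourier_features(x, num_freqs):
    """Columns [1, cos(2*pi*x), sin(2*pi*x), ..., cos(2*pi*K*x), sin(2*pi*K*x)]
    with K = num_freqs, for inputs x in [0, 1)."""
    cols = [np.ones_like(x)]
    for k in range(1, num_freqs + 1):
        cols.append(np.cos(2 * np.pi * k * x))
        cols.append(np.sin(2 * np.pi * k * x))
    return np.column_stack(cols)

# Hypothetical oscillatory target built from two Fourier components.
x = np.linspace(0.0, 1.0, 200, endpoint=False)
y = np.sin(2 * np.pi * 3 * x) + 0.5 * np.cos(2 * np.pi * x)

Phi = fourier_features(x, num_freqs=4)                # 200 x 9 feature matrix
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # basis coefficients
y_fit = Phi @ beta_hat

print("feature matrix shape:", Phi.shape)
print("max fit error:", np.abs(y_fit - y).max())
```

Only the coefficients of $\cos(2\pi x)$ and $\sin(2\pi \cdot 3x)$ come out non-negligible, which shows how compactly a Fourier basis represents an oscillatory function.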

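26 Local Regression
$f(X) = \sum_j \beta_j \phi_j(X)$, where the $\beta_j$ are the basis coefficients and the $\phi_j(X)$ are nonlinear features/basis functions.
Globally supported basis functions (polynomial, Fourier) will not yield a good representation of functions with localized structure.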


28 What you should know
Linear Regression
- Least Squares Estimator
- Normal Equations
- Gradient Descent
- Geometric and Probabilistic Interpretation (connection to MLE)
Regularized Linear Regression (connection to MAP)
- Ridge Regression, Lasso
Polynomial Regression, Basis (Fourier, Wavelet) Estimators
Next time
- Kernel Regression (Localized)
- Regression Trees
