Machine Learning 1: Linear Regression

Size: px

Start display at page:

Download "Machine Learning 1: Linear Regression"

Egbert Morrison
5 years ago
Views:

1 Machine Learning 1: Linear Regression J. Zico Kolter September 18,

2 Motivation How much energy will we consume tommorrow? Difficult to estimate from a priori models But, we have lots of data from which to build a model 2

3 Energy 101 Energy: ability to do work (apply force through a distance) Unit of energy: joule (J), (also btu, kilowatt hour) 1 joule = 1 newton 1 meter = 1 kilogram 1 meter2 1 second 2 Power is a rate of energy use Unit of power: watt (W) (commonly kilo/mega/gigawatt) 1 watt = 1 joule 1 second 3

4 Forms of energy Mechanical kinetic energy: E = 1 2 mv2 Gravitational potential energy: E = mgh Thermal energy: E = 3 2 NkT Electrical energy: E = V Q Electromagnetic energy: E = hf Chemical energy Nuclear energy: E = mc 2 4

5 Energy conversion 5

6 Duquesne Light electricity consumption Hourly Demand (GW) Feb 9 Jul 13 Oct Hour of Day Data: PJM 6

7 Duquesne Light electricity consumption Source: Duquesne Light 7

8 Predict peak demand from high temperature What will peak demand be tomorrow? If we know something else about tomorrow (like the high temperature), we can use this to predict peak demand Peak Hourly Demand (GW) High Temperature (F) Data: PJM, Weather Underground (summer months, June-August) 8

9 A simple model A linear model that predicts demand: predicted peak demand = θ 1 (high temperature) + θ 2 Peak Hourly Demand (GW) Observed data Linear regression prediction High Temperature (F) Parameters of model: θ 1, θ 2 R (θ 1 = 0.046, θ 2 = 1.46) 9

10 A simple model We can use a model like this to make predictions What will be the peak demand for Duquense Light tomorrow? I know from weather report that high temperature will be 80 F (ignore, for the moment, that this too is a prediction) Then predicted peak demand is: θ θ 2 = = 2.19 GW 10

11 Formal problem setting Input: x i R n, i = 1,..., m E.g.: x i R 1 = {high temperature for day i} Output: y i R (regression task) E.g.: y i R = {peak demand for day i} Model Parameters: θ R k Predicted Output: ŷ i R E.g.: ŷ i = θ 1 x i + θ 2 11

12 For convenience, we define a function that maps inputs to feature vectors φ : R n R k For example, in our task above, if we define [ ] xi φ(x i ) = (here n = 1, k = 2) 1 then we can write ŷ i = k θ j φ j (x i ) θ T φ(x i ) j=1 12

13 Loss functions Want a model that performs well on the data we have I.e., ŷ i y i, i We measure closeness of ŷ i and y i using loss function l : R R R + Example: squared loss l(ŷ i, y i ) = (ŷ i y i ) 2 13

14 Finding model parameters, and optimization Want to find model parameters such that minimize sum of costs over all input/output pairs J(θ) = m l(ŷ i, y i ) = i=1 m (θ T φ(x i ) y i ) 2 i=1 Write our objective formally as minimize θ J(θ) simple example of an optimization problem; these will dominate our development of algorithms throughout the course 14

15 Let s write J(θ) a little more compactly using matrix notation; define φ(x 1 ) T y 1 φ(x 2 ) T y 2 Φ R m k =. φ(x m ) T, y Rm =. y m then J(θ) = m (θ T φ(x i ) y i ) 2 = Φθ y 2 2 i=1 ( z 2 is l 2 norm of a vector: z 2 n i=1 z2 i = z T z) Called least-squares objective function 15

16 How do we optimize a function? 1-D case (θ R): J(θ) = θ 2 2θ dj dθ = 2θ 2 θ minimum dj dθ = 0 θ 2θ 2 = 0 θ = 1 16

17 Multi-variate case: θ R k, J : R k R Generalized condition: θ J(θ) θ = 0 θ J(θ) denotes gradient of J with respect to θ θ J(θ) R n J θ 1 J θ 2. J θ k Some important rules and common gradient θ (af(θ) + bg(θ)) = a θ f(θ) + b θ g(θ), (a, b R) θ (θ T Aθ) = (A + A T )θ, (A R k k ) θ (b T θ) = b, (b R k ) 17

18 Optimizing least-squares objective J(θ) = Φθ y 2 2 = (Φθ y) T (Φθ y) = θ T Φ T Φθ 2y T Φθ + y T y Using the previous gradient rules θ J(θ) = θ (θ T Φ T Φθ 2y T Φθ + y T y) = θ (θ T Φ T Φθ) 2 θ (y T Φθ) + θ (y T y) = 2Φ T Φθ 2Φ T y Setting gradient equal to zero 2Φ T Φθ 2Φ T y = 0 θ = (Φ T Φ) 1 Φ T y known as the normal equations 18

19 Let s see how this looks in MATLAB code X = load('high_temperature.txt'); y = load('peak_demand.txt'); n = size(x,2); m = size(x,1); Phi = [X ones(m,1)]; theta = inv(phi' * Phi) * Phi' * y; theta = The normal equations are so common that MATLAB has a special operation for them % same as inv(phi' * Phi) * Phi' * y theta = Phi \ y; 19

20 Aside: general optimization problems In this class we ll consider general optimization problems minimize θ J(θ) subject to g i (θ) 0, i = 1,..., N i h i (θ) = 0, i = 1,..., N e A constrained optimization problem; g i terms are the inequality constraints; h i terms are the equality constraints. Many different classifications of optimization problems (linear programming, quadratic programming, semidefinite programming, integer programming), depending on the form of J, the g i s and the h i s. 20

21 Important distinctions in optimization is between convex (where J, g i are convex and h i linear) and non-convex problems f convex f(aθ + (1 a)θ ) af(θ) + (1 a)f(θ ) for 0 a 1 Informally speaking, we can usually find global solutions of convex problems efficiently, while for non-convex problems we must settle for local solutions or time-consuming optimization 21

22 Solving optimization problems Many generic optimization libraries We will be using YALMIP (Yet Another Linear Matrix Inequality Parser): YALMIP code for least squares optimization: theta = sdpvar(n,1); solvesdp([], sum((phi*theta - y).^2)); double(theta) ans =

23 Alternative loss functions Nothing special about least-squares loss function l(ŷ, y) = (ŷ y) 2. Some alternatives: Absolute loss: l(ŷ, y) = ŷ y Deadband loss: l(ŷ, y) = max{0, ŷ y ɛ}, ɛ R + Loss Squared Loss Absolute Loss Deadband Loss Error: y pred y 23

24 How do we find parameters that minimize absolute loss? minimize θ m θ T φ(x i ) y i i=1 Non-differentiable, can t take gradient Solution: frame as a constrained optimization problem Introduce new variables ν R m, (ν i θ T φ(x i ) y i ) m minimize ν i θ,ν subject to i=1 ν i θ T φ(x i ) y i ν i Linear program (LP): linear object and linear constraints 24

25 To solve LPs, typically need to put them in standard form: minimize c T z z subject to Az b z R n, A R Ni n, b R Ni For absolute loss LP [ ] [ θ 0 z =, c = ν 1 ] [ Φ I, A = Φ I ] [ y, b = y ] 25

26 MATLAB code c = [zeros(n,1); ones(m,1)]; A = [Phi -eye(m); -Phi -eye(m)]; b = [y; -y]; z = linprog(c,a,b); theta = z(1:n) theta = The same solution in YALMIP: theta = sdpvar(n,1); solvesdp([], sum(abs((phi*theta - y)))); double(theta) theta =

27 Which loss function should we use? Peak Hourly Demand (GW) Observed data Squared loss Absolute loss Deadband loss, eps = High Temperature (F) 27

Machine Learning 2: Nonlinear Regression

Machine Learning 2: Nonlinear Regression 15-884 Machine Learning : Nonlinear Regression J. Zico Kolter September 17, 01 1 Non-linear regression Peak Hourly Demand (GW).5 0 0 40 60 80 100 High temperature / peak demand observations for all days