IEOR 165: Spring 2019 Problem Set 2


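Instructor: Professor Anil Aswani
Issued: 2/8/9 Due: 3//9

Problem 1:

Part a: We may first calculate the sample means: $\bar{y} = 11.6$, $\bar{x} = 71$. Then:

$$\hat{\beta}_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} y_i x_i - \bar{y}\,\bar{x}}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \frac{3}{5}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = -31.$$

Figure 1 plots our data points against our model.

Figure 1: The data vs. the model

Part b:

$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \dots$$

Part c: First, note that $\overline{w} \approx 0.233$, $\overline{wx} \approx \dots$, $\overline{wy} \approx .9893$, $\overline{wxy} \approx 3.824$, $\overline{wx^2} \approx \dots$. Then, from the formulas given in the lecture notes:

$$\hat{\beta}_1 = \frac{\overline{w}\cdot\overline{wxy} - \overline{wy}\cdot\overline{wx}}{\overline{w}\cdot\overline{wx^2} - \overline{wx}^2}, \qquad \hat{\beta}_0 = \frac{\overline{wy} - \hat{\beta}_1\,\overline{wx}}{\overline{w}} = -31.$$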
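The closed-form expressions above can be sketched in a few lines of Python. This is only an illustration: the function names `fit_line` and `r_squared`, the data, and the weights are all hypothetical and are not the problem set's data. The data are chosen so that the average y-value at each distinct x-value lies on a straight line and the weights are constant within each x-value.

```python
def fit_line(x, y, w=None):
    """Closed-form fit of y ~ beta0 + beta1 * x; w=None means unweighted (OLS)."""
    n = len(x)
    if w is None:
        w = [1.0] * n
    # Sample averages of w, w*x, w*y, w*x*y and w*x^2
    w_bar = sum(w) / n
    wx_bar = sum(wi * xi for wi, xi in zip(w, x)) / n
    wy_bar = sum(wi * yi for wi, yi in zip(w, y)) / n
    wxy_bar = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y)) / n
    wx2_bar = sum(wi * xi * xi for wi, xi in zip(w, x)) / n
    beta1 = (w_bar * wxy_bar - wy_bar * wx_bar) / (w_bar * wx2_bar - wx_bar ** 2)
    beta0 = (wy_bar - beta1 * wx_bar) / w_bar
    return beta0, beta1


def r_squared(x, y, beta0, beta1):
    """Explained sum of squares divided by total sum of squares."""
    y_bar = sum(y) / len(y)
    explained = sum((beta0 + beta1 * xi - y_bar) ** 2 for xi in x)
    total = sum((yi - y_bar) ** 2 for yi in y)
    return explained / total


# Hypothetical data: the y-averages at x = 65, 70, 75 are 4, 6, 8 (collinear)
x = [65, 65, 70, 70, 75, 75]
y = [3, 5, 5, 7, 7, 9]
# Hypothetical weights, identical for observations sharing an x-value
w = [0.5, 0.5, 1.0, 1.0, 2.0, 2.0]

b0_ols, b1_ols = fit_line(x, y)
b0_wls, b1_wls = fit_line(x, y, w)
r2_ols = r_squared(x, y, b0_ols, b1_ols)

print("OLS:", b0_ols, b1_ols, "r^2:", r2_ols)
print("WLS:", b0_wls, b1_wls)
```

On data of this shape the weighted and unweighted fits coincide up to floating-point error, because the fitted line already passes through the per-group averages.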

Adding these weights does not seem to alter our model. We could have predicted this by noting that the predictions of our model at the unique x-values of 65, 70, and 75 all happen to precisely hit the average of the y-values associated with those points. This is also observed in Figure 1. The fact that our weights are also the same for any unique x-value (recall that $w_i = 1/\mathbb{E}[\varepsilon^2 \mid x_i]$ in the ideal case) implies that such a model cannot be improved by weighting. Thus, the $r^2$ is the same as above.

Problem 2:

Part a: Again, note that $\bar{x} = 71$, $\bar{y} = 11.2$, $\overline{xy} = 806$, $\overline{x^2} = \dots$. Then,

$$\hat{\beta}_1 = \frac{\overline{xy} - \bar{y}\,\bar{x}}{\overline{x^2} - \bar{x}^2} = \frac{10.8}{\overline{x^2} - \bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

This is plotted in Figure 2.

Figure 2: The Albany data vs. the Albany model

Part b: Using the same formula as above, we find that $r^2 \approx \dots$

Part c: Using the same procedure as above, we find that $\hat{\beta}_1 \approx \dots$, $\hat{\beta}_0 \approx \dots$ Figure 3 plots the difference between OLS and WLS in this case.

Problem 3:

This is not true, because $r^2$ only relies on explained variance within the training sample. As we observed in discussion, adding a column of random data to an existing model can actually increase $r^2$, even though it clearly should not add any predictive power. To see why this occurs, note that the numerator of $r^2$ is equivalent to

$$n\hat{\sigma}_y^2 - \sum_{i=1}^{n} (y_i - \beta^T x_i)^2.$$

Since the first term is constant and the second term is the negative of the objective function in OLS, minimizing squared error (as OLS does) is equivalent to maximizing $r^2$. Now consider two models, $\hat{y} = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$

and $\hat{y} = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \beta_{n+1} x_{n+1}$, trained with the same $y$ and the same first $n$ exogenous variables $x_1, \dots, x_n$. Let $\hat{\beta}_0, \dots, \hat{\beta}_n$ be the optimal parameters given to us by OLS for the first model. Note that by choosing $\hat{\beta}_{n+1} = 0$ and setting all of the other parameters to be the same as in the first model, we can make the second model achieve the same squared loss, and so the same $r^2$, as the first model. However, OLS minimizes squared loss, so the existence of such a solution means that the squared loss for the second model cannot be greater than that of the first model, and the $r^2$ of the second model cannot be less than (and, in fact, will most likely be strictly higher than) the $r^2$ of the first model, regardless of what $x_{n+1}$ is. Thus, a higher $r^2$ does not necessarily imply a better model. Instead, we should judge a model's strength by its performance on appropriately chosen test data, meaning data from the same source as the training data that the model has not seen before.

Figure 3: Plotting the difference between OLS and WLS for Albany

Problem 4:

Part a: You are given samples 1, 5, 13, 15, and 28 from a Laplace distribution with unknown mean $\mu$ and scale 1. What is a maximum-likelihood estimate of the mean? Hint 1: The pdf of a Laplace distribution with unknown mean $\mu$ and scale 1 is $f(u) = \frac{1}{2}\exp(-|u - \mu|)$. Hint 2: To compute the MLE, plot the objective function of the optimization problem and locate a minimizer on the plot.

Recall that the maximum likelihood estimator is:

$$\hat{\mu} = \arg\max_{\mu} \prod_{i=1}^{n} \tfrac{1}{2}\exp(-|x_i - \mu|) = \arg\max_{\mu} \sum_{i=1}^{n} \log\left(\tfrac{1}{2}\exp(-|x_i - \mu|)\right) = \arg\min_{\mu} \sum_{i=1}^{n} |x_i - \mu|.$$

In our case, this amounts to minimizing $f(\mu) = \sum_{i=1}^{5} |x_i - \mu|$. $f$ is shown in Figure 4, with red lines demarcating $f$ when $\mu$ is equal to one of the data point values above. From here, we can see that the minimum occurs at $\mu = 13$, which makes this our maximum-likelihood estimate of the mean. However, we can make a more general statement here. We note that we can describe the derivative of $f$ as the following:

$$\frac{df}{d\mu} = \begin{cases} -5, & \mu < x_1 \\ -3, & x_1 < \mu < x_2 \\ -1, & x_2 < \mu < x_3 \\ 1, & x_3 < \mu < x_4 \\ 3, & x_4 < \mu < x_5 \\ 5, & x_5 < \mu \end{cases}$$

where $x_1 < \dots < x_5$ are the data points in increasing order. This follows because each of the terms in $f$ is an absolute value. So, whenever the number of points strictly greater than $\mu$ is bigger than the number of points strictly less than $\mu$, $f$ can be decreased by increasing $\mu$, and vice versa. Thus, the point at which $f$ cannot be decreased (i.e., the minimum) must be such that the number of data points strictly greater than $\mu$ is equal to the number of data points strictly less than $\mu$. This means that the maximum-likelihood estimate of the mean of a Laplace distribution will always be the median of the data, as we have observed in this case.

Figure 4: $f$, with red, dotted lines marking the points where $\mu$ is equal to one of the given data points.

Part b: Using the same trick as in part a, we can see that the maximum-likelihood estimate of the mean of the Laplace distribution will always be the median of the dataset. In this case, that is 996, regardless of the scale. However, this seems odd, particularly when the scale is 1, because the data has two pretty clear and approximately equal-sized clusters very far from each other, yet the estimate of the mean is very heavily skewed towards one side. In other words, if we replaced 996 in our data set with 996, it would intuitively seem like it shouldn't radically alter our estimate of the distribution parameters, and yet it would shift our maximum-likelihood estimate of the mean by a huge amount! This is not as big of a problem if we assume that the scale is 1000, but that could then force us to accept an overly conservative estimate of the distribution of the data. Here, it seems that the issue arises from the choice of a Laplace distribution in the context of MLE.
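The median argument can be checked numerically with a small Python sketch. The sample values, the cluster values, and the helper names `abs_loss` and `grid_minimizer` below are hypothetical choices for illustration, not the data from the problem set. The sketch minimizes the sum of absolute deviations on a grid, compares the result with the sample median, and then shows how moving a single point between two distant clusters moves the median.

```python
from statistics import median


def abs_loss(mu, data):
    """Negative log-likelihood of a unit-scale Laplace model, up to a constant."""
    return sum(abs(x - mu) for x in data)


def grid_minimizer(data, step=0.5):
    """Brute-force minimizer of abs_loss over an evenly spaced grid."""
    lo, hi = min(data), max(data)
    n_steps = int((hi - lo) / step)
    grid = [lo + k * step for k in range(n_steps + 1)]
    return min(grid, key=lambda mu: abs_loss(mu, data))


# Hypothetical samples (odd count, so the median is unique)
samples = [2, 4, 9, 10, 30]
mu_hat = grid_minimizer(samples)
print("grid minimizer:", mu_hat, "median:", median(samples))

# Hypothetical two-cluster data: moving one point across flips the estimate
clustered = [1, 2, 3, 1000, 1001, 1002, 1003]
shifted = [1, 2, 3, 4, 1001, 1002, 1003]  # 1000 replaced by 4
print("before:", median(clustered), "after:", median(shifted))
```

A grid search is used here only because it mirrors the "plot the objective and locate a minimizer" hint; in practice one would simply take the median.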

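Problem 5:

Using the definition of MLE, we can say:

$$\begin{aligned}
(\hat{\mu}, \hat{\sigma}) &= \arg\max_{\mu,\sigma} \prod_{i=1}^{n} \frac{1}{2\sigma}\exp\left(-|x_i - \mu|/\sigma\right) \\
&= \arg\max_{\mu,\sigma} \sum_{i=1}^{n} \log\left(\frac{1}{2\sigma}\exp\left(-|x_i - \mu|/\sigma\right)\right) \\
&= \arg\max_{\mu,\sigma} \sum_{i=1}^{n} \log\exp\left(-|x_i - \mu|/\sigma\right) - n\log\sigma \\
&= \arg\min_{\mu,\sigma} \; n\log\sigma + \frac{1}{\sigma}\sum_{i=1}^{n} |x_i - \mu|. \qquad (4)
\end{aligned}$$

Here, we know that $\mu = 0$, so the problem is simplified. (Note that, in the general case, the value of $\mu$ that minimizes $\sum_{i=1}^{n} |x_i - \mu|$ will also be the one that minimizes the objective in Equation 4, regardless of the $\sigma$ that is chosen, so long as $\sigma > 0$.) Adding this into our formulation, we may solve for $\hat{\sigma}$ separately:

$$\hat{\sigma} = \arg\min_{\sigma} \; n\log\sigma + \frac{1}{\sigma}\sum_{i=1}^{n} |x_i - \mu|. \qquad (5)$$

This is clearly not convex (consider what the objective looks like when $\sigma$ is very large). However, we can transform the problem to be in terms of $\nu = \log\sigma$:

$$\hat{\nu} = \arg\min_{\nu} \; n\nu + e^{-\nu}\sum_{i=1}^{n} |x_i - \mu|. \qquad (6)$$

It turns out that this new problem is convex (note that linear functions and exponentials are both convex, and that the sum of convex functions is also convex). So, we can set the derivative equal to zero and solve:

$$\frac{d}{d\nu}\left(n\nu + e^{-\nu}\sum_{i=1}^{n} |x_i - \mu|\right) = n - e^{-\nu}\sum_{i=1}^{n} |x_i - \mu| = 0$$

$$\hat{\nu} = \log\left(\frac{1}{n}\sum_{i=1}^{n} |x_i - \mu|\right) \qquad (7)$$

$$\hat{\sigma} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \mu|.$$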
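As a sanity check on the closed form for $\hat{\sigma}$, the following Python sketch compares it with a brute-force minimization of the objective in Equation 5. The data values, the grid range, and the function name `scale_objective` are hypothetical and chosen only for illustration; the mean is taken to be $\mu = 0$ as in the problem.

```python
import math


def scale_objective(sigma, data, mu=0.0):
    """n * log(sigma) + (1 / sigma) * sum |x_i - mu|, the objective in Equation 5."""
    n = len(data)
    return n * math.log(sigma) + sum(abs(x - mu) for x in data) / sigma


# Hypothetical samples, assumed to come from a Laplace distribution with mean 0
data = [-3.0, -1.0, 0.5, 2.0, 4.5]

# Closed form: the mean absolute deviation from mu = 0
sigma_hat = sum(abs(x) for x in data) / len(data)

# Brute-force check over sigma = 0.10, 0.11, ..., 10.00
grid = [0.01 * k for k in range(10, 1001)]
sigma_grid = min(grid, key=lambda s: scale_objective(s, data))

print("closed form:", sigma_hat, "grid search:", sigma_grid)
```

The grid search is deliberately crude; it only confirms that the objective bottoms out at the mean absolute deviation, to within the grid spacing.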
