CSE 546 Midterm Exam, Fall 2014


1. Personal info:
   Name:
   UW NetID:
   Student ID:
2. There should be 14 numbered pages in this exam (including this cover sheet).
3. You can use any material you brought: any book, class notes, your print-outs of class materials that are on the class website, including my annotated slides and relevant readings. You cannot use materials brought by other students.
4. If you need more room to work out your answer to a question, use the back of the page and clearly mark on the front of the page if we are to look at what's on the back.
5. Work efficiently. Some questions are easier, some more difficult. Be sure to give yourself time to answer all of the easy ones, and avoid getting bogged down in the more difficult ones before you have answered the easier ones.
6. You have 80 minutes.
7. Good luck!

Question  Topic                Max score  Score
1         Short Answer         24
2         Decision Trees       16
3         Logistic Regression  20
4         Boosting             16
5         Elastic Net          24
Total                          100

1 [24 points] Short Answer

1. [6 points] On the 2D dataset below, draw the decision boundaries learned by the following algorithms (using the features x/y). Be sure to mark which regions are labeled positive or negative, and assume that ties are broken arbitrarily.
   (a) Logistic regression (λ = 0)
   (b) 1-NN
   (c) 3-NN

2. [6 points] A random variable follows an exponential distribution with parameter λ (λ > 0) if it has the following density:
   $$p(t) = \lambda e^{-\lambda t}, \quad t \in [0, \infty)$$
   This distribution is often used to model waiting times between events. Imagine you are given i.i.d. data $T = (t_1, \ldots, t_n)$ where each $t_i$ is modeled as being drawn from an exponential distribution with parameter λ.
   (a) [3 points] Compute the log-probability of T given λ. (Turn products into sums when possible.)
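For intuition only (this is not the exam's answer key): the sketch below evaluates the exponential log-likelihood on made-up waiting times and checks the standard closed-form MLE, $n / \sum_i t_i$, against a grid search. The data values are hypothetical.

```python
import numpy as np

# Hypothetical waiting times (any positive values would do).
t = np.array([0.5, 1.2, 0.3, 2.0, 0.9])
n = len(t)

def log_likelihood(lam, t):
    # log p(T | lambda) = sum_i [ log(lambda) - lambda * t_i ]
    return len(t) * np.log(lam) - lam * t.sum()

# Standard closed-form MLE for the exponential: n / sum(t_i).
lam_closed_form = n / t.sum()

# Numerical check against a grid search over lambda.
grid = np.linspace(0.01, 10.0, 100000)
lam_grid = grid[np.argmax([log_likelihood(l, t) for l in grid])]
print(lam_closed_form, lam_grid)  # should agree up to grid resolution
```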

(b) [3 points] Solve for $\hat{\lambda}_{MLE}$.

3. [6 points]

   A  B  C  Y
   F  F  F  F
   T  F  T  T
   T  T  F  T
   T  T  T  F

   (a) [3 points] Using the dataset above, we want to build a decision tree which classifies Y as T/F given the binary variables A, B, C. Draw the tree that would be learned by the greedy algorithm with zero training error. You do not need to show any computation.
   (b) [3 points] Is this tree optimal (i.e. does it get zero training error with minimal depth)? Explain in less than two sentences. If it is not optimal, draw the optimal tree as well.
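One way to sanity-check a candidate tree is to fit one in code. The sketch below uses scikit-learn with entropy splits on the table above (T/F encoded as 1/0); the library's greedy learner may break ties differently than the one from class, so treat its output as a reference rather than the expected drawing.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# T/F encoded as 1/0; rows follow the table above.
X = [[0, 0, 0],   # A=F, B=F, C=F -> Y=F
     [1, 0, 1],   # A=T, B=F, C=T -> Y=T
     [1, 1, 0],   # A=T, B=T, C=F -> Y=T
     [1, 1, 1]]   # A=T, B=T, C=T -> Y=F
y = [0, 1, 1, 0]

# Greedy, entropy-based splits, grown until the training data is pure.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["A", "B", "C"]))
print("training accuracy:", tree.score(X, y))  # expect 1.0
```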

4. [6 points] In class we used decision trees and ensemble methods for classification, but we can use them for regression as well (i.e. learning a function from features to real values). Let's imagine that our data has 3 binary features A, B, C, which take values 0/1, and we want to learn a function which counts the number of features which have value 1.
   (a) [2 points] Draw the decision tree which represents this function. How many leaf nodes does it have?
   (b) [2 points] Now represent this function as a sum of decision stumps (e.g. sgn(A)). How many terms do we need?
   (c) [2 points] In the general case, imagine that we have d binary features, and we want to count the number of features with value 1. How many leaf nodes would a decision tree need to represent this function? If we used a sum of decision stumps, how many terms would be needed? No explanation is necessary.
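For illustration (not part of the exam): the counting target can be reproduced by summing one stump per feature. The sketch below verifies this on all $2^3$ inputs, reading the sgn(A) example above as an indicator of A = 1 (which, for 0/1 features, is what sgn evaluates to).

```python
from itertools import product

def stump(feature_value):
    # A depth-1 stump on a single 0/1 feature: sgn(feature) = the feature itself.
    return 1 if feature_value == 1 else 0

def count_by_stumps(a, b, c):
    # Sum of three stumps, one per feature.
    return stump(a) + stump(b) + stump(c)

# Verify against the counting target on all 2^3 inputs.
for a, b, c in product([0, 1], repeat=3):
    assert count_by_stumps(a, b, c) == a + b + c
print("a sum of 3 stumps reproduces the count on all 8 inputs")
```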

2 [16 points] Decision Trees

We will use the dataset below to learn a decision tree which predicts if people pass machine learning (Yes or No), based on their previous GPA (High, Medium, or Low) and whether or not they studied.

GPA  Studied  Passed
L    F        F
L    T        T
M    F        F
M    T        T
H    F        T
H    T        T

For this problem, you can write your answers using $\log_2$, but it may be helpful to note that $\log_2 3 \approx 1.6$.

1. [4 points] What is the entropy H(Passed)?
2. [4 points] What is the entropy H(Passed | GPA)?
3. [4 points] What is the entropy H(Passed | Studied)?
4. [4 points] Draw the full decision tree that would be learned for this dataset. You do not need to show any calculations.
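A small numerical cross-check of the entropy quantities above, computed directly from the six rows of the table (in bits); it is meant as a way to verify by-hand answers, not to replace them.

```python
import math
from collections import Counter

# Rows of the table above: (GPA, Studied, Passed)
data = [("L", "F", "F"), ("L", "T", "T"), ("M", "F", "F"),
        ("M", "T", "T"), ("H", "F", "T"), ("H", "T", "T")]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(rows, cond_idx, label_idx=2):
    n = len(rows)
    total = 0.0
    for v in set(r[cond_idx] for r in rows):
        subset = [r[label_idx] for r in rows if r[cond_idx] == v]
        total += (len(subset) / n) * entropy(subset)
    return total

print("H(Passed)           =", entropy([r[2] for r in data]))
print("H(Passed | GPA)     =", conditional_entropy(data, 0))
print("H(Passed | Studied) =", conditional_entropy(data, 1))
```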

3 [20 points] Logistic Regression for Sparse Data

In many real-world scenarios our data has millions of dimensions, but a given example has only hundreds of non-zero features. For example, in document analysis with word counts for features, our dictionary may have millions of words, but a given document has only hundreds of unique words. In this question we will make $\ell_2$ regularized SGD efficient when our input data is sparse.

Recall that in $\ell_2$ regularized logistic regression, we want to maximize the following objective (in this problem we have excluded $w_0$ for simplicity):

$$F(w) = \frac{1}{N} \sum_{j=1}^{N} l(x^{(j)}, y^{(j)}, w) - \frac{\lambda}{2} \sum_{i=1}^{d} w_i^2$$

where $l(x^{(j)}, y^{(j)}, w)$ is the logistic objective function

$$l(x^{(j)}, y^{(j)}, w) = y^{(j)} \left( \sum_{i=1}^{d} w_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left( \sum_{i=1}^{d} w_i x_i^{(j)} \right) \right)$$

and the remaining sum is our regularization penalty. When we do stochastic gradient descent on point $(x^{(j)}, y^{(j)})$, we are approximating the objective function as

$$F(w) \approx l(x^{(j)}, y^{(j)}, w) - \frac{\lambda}{2} \sum_{i=1}^{d} w_i^2$$

Definition of sparsity: Assume that our input data has d features, i.e. $x^{(j)} \in \mathbb{R}^d$. In this problem, we will consider the scenario where $x^{(j)}$ is sparse. Formally, let s be the average number of nonzero elements in each example. We say the data is sparse when s << d. In the following questions, your answer should take the sparsity of $x^{(j)}$ into consideration when possible.

Note: When we use a sparse data structure, we can iterate over the non-zero elements in O(s) time, whereas a dense data structure requires O(d) time.

1. [2 points] Let us first consider the case when λ = 0. Write down the SGD update rule for $w_i$ when λ = 0, using step size η, given the example $(x^{(j)}, y^{(j)})$.

2. [4 points] If we use a dense data structure, what is the average time complexity to update $w_i$ when λ = 0? What if we use a sparse data structure? Justify your answer in one or two sentences.
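A possible shape for the λ = 0 update, with a sparse example stored as a dict of nonzero features. The gradient form follows from differentiating the logistic objective above; the function and variable names are illustrative, not taken from any course code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update_unregularized(w, x_sparse, y, eta):
    """One SGD step for the lambda = 0 case (gradient ascent on l).

    w        : dict feature index -> weight (missing keys are 0)
    x_sparse : dict feature index -> nonzero feature value
    y        : label in {0, 1}
    eta      : step size
    Only the nonzero coordinates of x are touched.
    """
    margin = sum(w.get(i, 0.0) * v for i, v in x_sparse.items())
    residual = y - sigmoid(margin)          # y - P(Y = 1 | x, w)
    for i, v in x_sparse.items():           # gradient is zero wherever x_i = 0
        w[i] = w.get(i, 0.0) + eta * v * residual
    return w

# Tiny usage example with made-up numbers.
w = sgd_update_unregularized({}, {3: 1.0, 7: 2.0}, y=1, eta=0.1)
print(w)
```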

3. [2 points] Now let us consider the general case when λ > 0. Write down the SGD update rule for $w_i$ when λ > 0, using step size η, given the example $(x^{(j)}, y^{(j)})$.

4. [2 points] If we use a dense data structure, what is the average time complexity to update $w_i$ when λ > 0?

5. [4 points] Let $w_i^{(t)}$ be the weight after the t-th update. Now imagine that we perform k SGD updates on w using examples $(x^{(t+1)}, y^{(t+1)}), \ldots, (x^{(t+k)}, y^{(t+k)})$, where $x_i^{(j)} = 0$ for every example in the sequence (i.e. the i-th feature is zero for all of the examples in the sequence). Express the new weight $w_i^{(t+k)}$ in terms of $w_i^{(t)}$, k, η, and λ.

6. [6 points] Using your answer in the previous part, come up with an efficient algorithm for regularized SGD when we use a sparse data structure. What is the average time complexity per example? (Hint: when do you need to update $w_i$?)
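A common way to exploit part 5 is a "lazy" update: remember when each coordinate was last regularized and apply the accumulated shrinkage only when its feature next appears. The sketch below assumes the λ > 0 step shrinks $w_i$ multiplicatively by $(1 - \eta\lambda)$, so k skipped steps compose to $(1 - \eta\lambda)^k$; it is an illustration of the idea, not the graded answer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lazy_regularized_sgd_step(w, last_touched, t, x_sparse, y, eta, lam):
    """Regularized SGD step (lambda > 0) that only touches nonzero coordinates.

    w            : dict feature index -> weight
    last_touched : dict feature index -> index of the last step that updated it
    t            : current step index (1-based)
    Assumes each regularized step shrinks w_i by a factor (1 - eta*lam), so k
    skipped steps can be applied at once as (1 - eta*lam)**k.
    """
    # Catch up the deferred shrinkage for the coordinates we are about to use.
    for i in x_sparse:
        if i in w:
            k = t - 1 - last_touched.get(i, 0)
            w[i] *= (1.0 - eta * lam) ** k
    margin = sum(w.get(i, 0.0) * v for i, v in x_sparse.items())
    residual = y - sigmoid(margin)
    for i, v in x_sparse.items():
        w[i] = (1.0 - eta * lam) * w.get(i, 0.0) + eta * v * residual
        last_touched[i] = t
    # (After the last example, a final catch-up pass over all of w would
    #  apply any shrinkage that is still pending.)
    return w, last_touched
```

With this bookkeeping, each example only does work proportional to its number of nonzero features, which is the kind of efficiency part 6 asks you to argue for.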

4 [16 points] Boosting

Recall that AdaBoost learns a classifier H using a weighted sum of weak learners $h_t$ as follows:

$$H(x) = \operatorname{sgn}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)$$

In this question we will use decision trees as our weak learners, which classify a point as {1, −1} based on a sequence of threshold splits on its features (here, x and y). In the questions below, be sure to mark which regions are marked positive/negative, and assume that ties are broken arbitrarily.

1. [2 points] Assume that our weak learners are decision trees of depth 1 (i.e. decision stumps), which minimize the weighted training error. Using the dataset below, draw the decision boundary learned by $h_1$.
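As a compact reference for the mechanics these questions exercise, the sketch below runs two rounds of AdaBoost with a threshold stump on a made-up 1-D dataset (not the exam's plot), showing how misclassified points gain weight before $h_2$ is fit.

```python
import numpy as np

# Made-up 1-D dataset (not the exam's plot); labels in {-1, +1}.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([+1, +1, -1, -1, +1, -1])
D = np.ones(len(x)) / len(x)           # initial uniform example weights

def best_stump(x, y, D):
    """Threshold stump minimizing weighted error (both orientations tried)."""
    best = None
    for thr in np.unique(x):
        for sign in (+1, -1):
            pred = np.where(x <= thr, sign, -sign)
            err = D[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, thr, sign, pred)
    return best

for t in range(2):                     # two rounds, mirroring h1 and h2
    err, thr, sign, pred = best_stump(x, y, D)
    alpha = 0.5 * np.log((1 - err) / err)
    D = D * np.exp(-alpha * y * pred)  # misclassified points gain weight
    D = D / D.sum()
    heavy = np.flatnonzero(np.isclose(D, D.max()))
    print(f"round {t + 1}: threshold={thr}, alpha={alpha:.3f}, "
          f"highest-weight point indices={heavy}")
```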

2. [3 points] On the dataset below, circle the point(s) with the highest weights on the second iteration, and draw the decision boundary learned by $h_2$.

3. [3 points] On the dataset below, draw the decision boundary of $H = \operatorname{sgn}(\alpha_1 h_1 + \alpha_2 h_2)$. (Hint: you do not need to explicitly compute the α's.)

4. [2 points] Now assume that our weak learners are decision trees of maximum depth 2, which minimize the weighted training error. Using the dataset below, draw the decision boundary learned by $h_1$.

5. [3 points] On the dataset below, circle the point(s) with the highest weights on the second iteration, and draw the decision boundary learned by $h_2$.

6. [3 points] On the dataset below, draw the decision boundary of $H = \operatorname{sgn}(\alpha_1 h_1 + \alpha_2 h_2)$. (Hint: you do not need to explicitly compute the α's.)

5 [24 points] Elastic Net

In class we discussed two types of regularization, $\ell_1$ and $\ell_2$. Both are useful, and sometimes it is helpful to combine them, giving the objective function below (in this problem we have excluded $w_0$ for simplicity):

$$F(w) = \frac{1}{2} \sum_{j=1}^{n} \left( y^{(j)} - \sum_{i=1}^{d} w_i x_i^{(j)} \right)^2 + \alpha \sum_{i=1}^{d} |w_i| + \frac{\lambda}{2} \sum_{i=1}^{d} w_i^2 \qquad (1)$$

Here $(x^{(j)}, y^{(j)})$ is the j-th example in the training data, w is a d-dimensional weight vector, λ is a regularization parameter for the $\ell_2$ norm of w, and α is a regularization parameter for the $\ell_1$ norm of w. This approach is called the Elastic Net, and you can see that it is a generalization of Ridge and Lasso regression: it reverts to Lasso when λ = 0, and it reverts to Ridge when α = 0.

In this question, we are going to derive the coordinate descent (CD) update rule for this objective. Let g, h, c be real constants, and consider the following function of x:

$$f_1(x) = c + gx + \frac{1}{2} h x^2 \qquad (h > 0) \qquad (2)$$

1. [2 points] What is the x that minimizes $f_1(x)$? (i.e. calculate $x^* = \arg\min_x f_1(x)$)

Let α be an additional real constant, and consider another function of x:

$$f_2(x) = c + gx + \frac{1}{2} h x^2 + \alpha |x| \qquad (h > 0, \alpha > 0) \qquad (3)$$

This is a piecewise function, composed of two quadratic functions:

$$f_2^-(x) = c + gx + \frac{1}{2} h x^2 - \alpha x$$

and

$$f_2^+(x) = c + gx + \frac{1}{2} h x^2 + \alpha x$$

Let $x^- = \arg\min_{x \in \mathbb{R}} f_2^-(x)$ and $x^+ = \arg\min_{x \in \mathbb{R}} f_2^+(x)$. (Note: the argmin here is taken over all of $(-\infty, +\infty)$.)
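As a numerical illustration of the piecewise structure (with made-up constants c, h, α and a few values of g, chosen only for the demo): each quadratic branch has an unconstrained minimizer obtained by setting its derivative to zero, and a brute-force grid search over $f_2$ shows which of 0, $x^-$, $x^+$ ends up being the true minimum in each regime.

```python
import numpy as np

def f2(x, c, g, h, alpha):
    return c + g * x + 0.5 * h * x**2 + alpha * np.abs(x)

# Made-up constants satisfying h > 0 and alpha > 0.
c, h, alpha = 0.0, 2.0, 1.0
xs = np.linspace(-5.0, 5.0, 200001)

for g in (-3.0, -0.5, 0.5, 3.0):
    # Brute-force minimizer of f2 over a fine grid.
    x_star = xs[np.argmin(f2(xs, c, g, h, alpha))]
    # Unconstrained minimizers of the two quadratic branches (derivative = 0).
    x_minus = -(g - alpha) / h   # f2^-: coincides with f2 on x <= 0
    x_plus = -(g + alpha) / h    # f2^+: coincides with f2 on x >= 0
    print(f"g={g:+.1f}: grid argmin={x_star:+.3f}, "
          f"x^-={x_minus:+.3f}, x^+={x_plus:+.3f}")
```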

2. [3 points] What are $x^+$ and $x^-$? Show that $x^+ \le x^-$.

3. [6 points] Draw a picture of $f_2(x)$ in each of the three cases below.
   (a) $x^- > 0$, $x^+ > 0$
   (b) $x^- < 0$, $x^+ < 0$
   (c) $x^- > 0$, $x^+ < 0$
   For each case, mark the minimum as either 0, $x^-$, or $x^+$. (You do not need to draw perfect curves, just get the rough shape and the relative locations of the minima to the y-axis.)

4. [4 points] Write $x^*$, the minimum of $f_2(x)$, as a piecewise function of g. (Hint: what is g in the cases above?)

5. [4 points] Now let's derive the update rule for $w_k$. Fixing all parameters but $w_k$, express the objective function in the form of Eq. (3) (here $w_k$ is our x). You do not need to explicitly write constant factors; you can put them all together as c. What are g and h?

6. [5 points] Putting it all together, write down the coordinate descent algorithm for the Elastic Net (again excluding $w_0$). You can write the update for $w_k$ in terms of the g and h you calculated in the previous question.
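A sketch of what the finished algorithm can look like in Python. The closed-form coordinate update below is the standard soft-thresholding solution of an objective in the form of Eq. (3); the expressions for g and h in the docstring are one parameterization consistent with Eq. (1), stated here as an assumption since the exam asks you to derive them yourself.

```python
import numpy as np

def soft_threshold(z, a):
    # sign(z) * max(|z| - a, 0)
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def elastic_net_cd(X, y, alpha, lam, n_iters=100):
    """Coordinate descent for (1/2)||y - Xw||^2 + alpha*||w||_1 + (lam/2)*||w||_2^2.

    Fixing all weights but w_k, the objective has the form of Eq. (3),
    c + g*w_k + (1/2)*h*w_k^2 + alpha*|w_k|, with (one parameterization)
        g = -sum_j r_j * x_k^(j),   h = sum_j (x_k^(j))^2 + lam,
    where r_j is the residual with feature k's contribution removed.
    The minimizer is the soft-thresholded value used below.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for k in range(d):
            r = y - X @ w + X[:, k] * w[k]        # residual excluding feature k
            g = -X[:, k] @ r
            h = X[:, k] @ X[:, k] + lam
            w[k] = soft_threshold(-g, alpha) / h
    return w

# Usage on small synthetic data (made-up sizes and regularization strengths).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(elastic_net_cd(X, y, alpha=0.5, lam=0.1))
```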