Supervised Machine Learning (Spring 2014) Homework 2, sample solutions

Credit for the solutions goes mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen.

Problem 1

The logarithmic loss is by definition
$$L_{\log}(y, \hat y) = \begin{cases} -\ln(1 - \hat y) & \text{if } y = 0 \\ -\ln \hat y & \text{if } y = 1. \end{cases}$$
Throughout, $W_t = \sum_{i=1}^n w_{t,i}$, $v_{t,i} = w_{t,i}/W_t$, and the algorithm predicts $\hat y_t = \sum_{i=1}^n v_{t,i} h_i(x_t)$.

Let $y_t = 0$ first. Now
$$L_{\log}(y_t, \hat y_t) = -\ln(1 - \hat y_t) = -\ln\Bigl(1 - \sum_{i=1}^n v_{t,i} h_i(x_t)\Bigr).$$
From the update rule we get
$$w_{t+1,i} = w_{t,i} \exp\bigl(-\eta L_{\log}(0, h_i(x_t))\bigr) = w_{t,i} \exp\bigl(-\eta(-\ln(1 - h_i(x_t)))\bigr) = w_{t,i}(1 - h_i(x_t)),$$
where we used the assumption $\eta = 1$. Because $c = 1$, the potential $P_t = c \ln W_t$ satisfies
$$P_t - P_{t+1} = c \ln W_t - c \ln W_{t+1} = -\ln\frac{W_{t+1}}{W_t} = -\ln\frac{\sum_{i=1}^n w_{t,i}(1 - h_i(x_t))}{W_t} = -\ln\Bigl(1 - \sum_{i=1}^n v_{t,i} h_i(x_t)\Bigr) = L_{\log}(y_t, \hat y_t).$$

Let next $y_t = 1$. In this case,
$$L_{\log}(y_t, \hat y_t) = -\ln \hat y_t = -\ln\Bigl(\sum_{i=1}^n v_{t,i} h_i(x_t)\Bigr),$$
and from the update rule
$$w_{t+1,i} = w_{t,i} \exp\bigl(-\eta L_{\log}(1, h_i(x_t))\bigr) = w_{t,i} \exp\bigl(-\eta(-\ln h_i(x_t))\bigr) = w_{t,i} h_i(x_t).$$
Therefore,
$$P_t - P_{t+1} = -\ln\frac{W_{t+1}}{W_t} = -\ln\frac{\sum_{i=1}^n w_{t,i} h_i(x_t)}{W_t} = -\ln\Bigl(\sum_{i=1}^n v_{t,i} h_i(x_t)\Bigr) = L_{\log}(y_t, \hat y_t),$$
which completes the proof.

Loss bound. Let $\eta = c = 1$ as before and let $H = \{h_1, h_2, \ldots, h_n\}$. We get the bound directly by using the previous result $L_{\log}(y_t, \hat y_t) = P_t - P_{t+1}$. For every $h \in H$ we have
$$L_{\log}(S, \mathrm{WA}) = \sum_{t=1}^T L_{\log}(y_t, \hat y_t) = \sum_{t=1}^T (P_t - P_{t+1}) = P_1 - P_{T+1} = c \ln W_1 - c \ln W_{T+1} \le \ln n - \ln \exp\Bigl(-\eta \sum_{t=1}^T L_{\log}(y_t, h(x_t))\Bigr) = \ln n + L_{\log}(S, h),$$
where the inequality uses $W_1 = n$ and the fact that $W_{T+1}$ is at least the final weight $\exp\bigl(-\eta \sum_{t=1}^T L_{\log}(y_t, h(x_t))\bigr)$ of the single expert $h$. Taking the minimum over $H$ gives the bound
$$L_{\log}(S, \mathrm{WA}) \le \min_{h \in H} L_{\log}(S, h) + \ln n.$$
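As an illustration (not part of the required solution), the loss bound can be checked numerically. The Python sketch below runs the weighted average algorithm with $\eta = 1$ on made-up expert predictions and labels and verifies that its cumulative logarithmic loss is at most $\min_{h \in H} L_{\log}(S, h) + \ln n$.

```python
import math
import random

def log_loss(y, p):
    # Logarithmic loss; p is a prediction in (0, 1), y is the true label in {0, 1}.
    return -math.log(1 - p) if y == 0 else -math.log(p)

def run_wa(expert_preds, labels):
    """Weighted average algorithm with eta = c = 1 for log loss.

    expert_preds[t][i] is expert i's prediction on trial t; returns the
    cumulative log loss of the algorithm and of each expert.
    """
    n = len(expert_preds[0])
    w = [1.0] * n                      # initial weights, so W_1 = n
    alg_loss, expert_loss = 0.0, [0.0] * n
    for preds, y in zip(expert_preds, labels):
        W = sum(w)
        y_hat = sum(wi * p for wi, p in zip(w, preds)) / W   # weighted average prediction
        alg_loss += log_loss(y, y_hat)
        for i, p in enumerate(preds):
            expert_loss[i] += log_loss(y, p)
            w[i] *= math.exp(-log_loss(y, p))                # multiplicative update, eta = 1
    return alg_loss, expert_loss

random.seed(0)
T, n = 200, 5
expert_preds = [[random.uniform(0.05, 0.95) for _ in range(n)] for _ in range(T)]
labels = [random.randint(0, 1) for _ in range(T)]
alg_loss, expert_loss = run_wa(expert_preds, labels)
assert alg_loss <= min(expert_loss) + math.log(n) + 1e-9     # bound from Problem 1
print(alg_loss, min(expert_loss) + math.log(n))
```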

Problem 2

(a) Let $A_i$ be the event "heads comes up in toss number $i$", $i \in \{1, 2, \ldots, 10\}$. Then
$$P(A_1, A_2, \ldots, A_{10}) = \prod_{i=1}^{10} P(A_i) = \Bigl(\frac{1}{2}\Bigr)^{10}$$
because the tosses are independent and identically distributed. This is an example of a binomial random variable: the number of heads in a series of 10 fair coin tosses can be modelled as a random variable $X \sim \mathrm{Bin}(10, 1/2)$, and $P(X = 10) = \binom{10}{10}(1/2)^{10}(1/2)^0 = (1/2)^{10}$.

(b) From (a) we know that the probability that a given coin does not come up heads 10 times is $q = 1 - (1/2)^{10}$. Because the coin tosses are independent, the probability that no coin comes up heads 10 times is $q^{1000}$. Finally, the probability that at least one coin comes up heads 10 times is
$$1 - q^{1000} = 1 - \Bigl(1 - \Bigl(\frac{1}{2}\Bigr)^{10}\Bigr)^{1000} = 1 - \Bigl(\frac{1023}{1024}\Bigr)^{1000} \approx 0.6236.$$

We can also formulate the situation as follows. Tossing each of the 1000 coins 10 times is again a sequence of repeated experiments. Let the length of the sequence be $n = 1000$, and let the probability of success (10 heads with a single coin) in an experiment be $p = 2^{-10}$. The number of successes $X$ in the sequence is a binomially distributed random variable $X \sim \mathrm{Bin}(n, p)$. What is asked is the probability
$$P(X \ge 1) = 1 - P(X = 0) = 1 - \binom{n}{0} p^0 (1 - p)^n = 1 - (1 - p)^{1000}.$$

(c) Let $B_i$ be the event "coin number $i$ comes up heads 10 times" and let $C$ be the event "at least one coin comes up heads 10 times". The union bound gives us the estimate
$$P(C) = P\Bigl(\bigcup_{i=1}^{1000} B_i\Bigr) \le \sum_{i=1}^{1000} P(B_i) = 1000 \cdot \Bigl(\frac{1}{2}\Bigr)^{10} \approx 0.97656.$$
Our estimate turns out to be very inaccurate. Indeed, had there been 1024 coins instead of 1000, the union bound would have given only the trivial bound $P(C) \le 1$.
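As an illustration (not part of the required solution), the exact probability from (b) and the union bound from (c) can be computed directly:

```python
from fractions import Fraction

# Probability that a single coin gives 10 heads in 10 fair tosses.
p = Fraction(1, 2) ** 10

# (b) Exact probability that at least one of 1000 coins gives 10 heads.
exact = 1 - (1 - p) ** 1000
print(float(exact))          # ~0.6236

# (c) Union bound: sum of the 1000 individual probabilities.
union_bound = 1000 * p
print(float(union_bound))    # ~0.97656
```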

Exercise 3

(a) We calculate the risk using the expression from page 4 of the lecture notes:
$$R(h) = \nu + (1 - 2\nu) \sum_{x \,:\, h(x) \ne f(x)} P_X(x). \tag{1}$$
Using this for $h = f$ we directly get $R(f) = \nu$. On the other hand, for any $h$ such that $P_X(h(X) \ne f(X)) > \varepsilon$, using (1) we get
$$R(h) = \nu + (1 - 2\nu) P_X(h(X) \ne f(X)) > \nu + \varepsilon(1 - 2\nu),$$
since we assume $\nu < 1/2$.

(b) Following the proof of Theorem 1.9, we are going to show that with probability at least $1 - \delta$ we have
$$\hat R(f) \le R(f) + \frac{\varepsilon(1 - 2\nu)}{2} \tag{2}$$
for the target classifier, and
$$\hat R(h) \ge R(h) - \frac{\varepsilon(1 - 2\nu)}{2} \tag{3}$$
for all $h \ne f$. Combining (2) and (3) with the previous estimates shows that, for any $h$ such that $P_X(h(X) \ne f(X)) > \varepsilon$ holds, we have
$$\hat R(h) \ge R(h) - \frac{\varepsilon(1 - 2\nu)}{2} > \nu + \varepsilon(1 - 2\nu) - \frac{\varepsilon(1 - 2\nu)}{2} = R(f) + \frac{\varepsilon(1 - 2\nu)}{2} \ge \hat R(f).$$
Hence, ERM cannot pick any such $h$ as its hypothesis. Therefore, if we prove that (2) and (3) hold with probability at least $1 - \delta$, we have the desired result.

We use Hoeffding's inequality
$$\Pr\bigl[S \ge \mathrm{E}[S] + t\bigr] \le \exp\Bigl(-\frac{2t^2}{\sum_{i=1}^m (b_i - a_i)^2}\Bigr).$$
The probability that (2) fails is estimated as
$$\Pr\Bigl[\hat R(f) > R(f) + \frac{\varepsilon(1 - 2\nu)}{2}\Bigr] = \Pr\Bigl[m \hat R(f) > m R(f) + \frac{m\varepsilon(1 - 2\nu)}{2}\Bigr] \le \exp\Bigl(-\frac{2 m^2 \varepsilon^2 (1 - 2\nu)^2 / 4}{m}\Bigr) = \exp\bigl(-m \varepsilon^2 (1 - 2\nu)^2 / 2\bigr).$$
Since we assume
$$m \ge \frac{2}{\varepsilon^2 (1 - 2\nu)^2} \ln\frac{2|H|}{\delta},$$
this implies
$$\Pr\Bigl[\hat R(f) > R(f) + \frac{\varepsilon(1 - 2\nu)}{2}\Bigr] \le \exp\Bigl(-\ln\frac{2|H|}{\delta}\Bigr) = \frac{\delta}{2|H|} \le \frac{\delta}{2}.$$
Similarly, for each fixed $h \ne f$ we get
$$\Pr\Bigl[\hat R(h) < R(h) - \frac{\varepsilon(1 - 2\nu)}{2}\Bigr] \le \frac{\delta}{2|H|}.$$
The union bound then gives the final result.
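As an illustration (not part of the required solution), the sample-size condition assumed above can be evaluated numerically; the values of $\varepsilon$, $\nu$, $\delta$ and $|H|$ below are made up.

```python
import math

def noisy_erm_sample_size(eps, nu, delta, h_size):
    # Sample size from the assumed bound m >= 2 / (eps^2 (1 - 2 nu)^2) * ln(2|H| / delta),
    # as used in Exercise 3(b); requires nu < 1/2.
    return math.ceil(2.0 / (eps**2 * (1 - 2 * nu)**2) * math.log(2 * h_size / delta))

# Illustrative values only: eps = 0.1, nu = 0.2, delta = 0.05, |H| = 1000.
print(noisy_erm_sample_size(0.1, 0.2, 0.05, 1000))
```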

Problem 4

(a) Positive examples. Each positive example $(x_j, 1)$ in the sample is classified by $\hat h$ as positive iff $\hat h$ contains no literal that is false on $x_j$, that is, no literal whose negation is satisfied by $x_j$. All such literals are removed from $\hat h$ when the algorithm processes $(x_j, 1)$, so all positive examples are classified correctly by $\hat h$.

Negative examples. All negative examples $(x_j, 0)$ are classified as negative if all the literals of the target conjunction $h_*$ are also in $\hat h$. The algorithm removes from $L$ only literals that contradict some positive example $x_j$. Such literals cannot be in $h_*$, because $h_*$ classifies all positive examples correctly. Thus, no literal of $h_*$ has been removed from $\hat h$, and all negative examples are classified correctly as well.

(b) As we argued in part (a), the hypothesis $\hat h$ includes all the literals of the target $h_*$. Therefore,
$$\Pr_x\bigl[\hat h(x) = 1 \text{ and } h_*(x) = 0\bigr] = 0.$$
In addition to this, we need to show that
$$\Pr_x\bigl[\hat h(x) = 0 \text{ and } h_*(x) = 1\bigr] \le \varepsilon$$
holds with probability at least $1 - \delta$.

Fix $\varepsilon > 0$, and call a literal dangerous if it has probability greater than $\varepsilon/n$ of being false on a positive example. More precisely, denote by $\tilde v(x)$ the value of literal $\tilde v$ on instance $x$: if $\tilde v = v_i$, then $\tilde v(x) = x_i$, and if $\tilde v = \bar v_i$, then $\tilde v(x) = 1 - x_i$. Then a literal $\tilde v$ is dangerous if
$$\Pr_x\bigl[\tilde v(x) = 0 \text{ and } h_*(x) = 1\bigr] > \frac{\varepsilon}{n}.$$

There are at most $n$ literals $\tilde v$ in $\hat h$. If $\hat h(x) = 0$ and $h_*(x) = 1$, then at least one of the literals in $\hat h$ satisfies $\tilde v(x) = 0$ and $h_*(x) = 1$. If none of the literals in $\hat h$ is dangerous, then by the union bound the probability of drawing an $x$ such that $\tilde v(x) = 0$ and $h_*(x) = 1$ holds for at least one $\tilde v$ in $\hat h$ is at most $n \cdot (\varepsilon/n) = \varepsilon$. Thus, we are done if we can show that with probability at least $1 - \delta$, no dangerous literals remain.

There are $2n$ literals to consider, so by the union bound it is sufficient to show that for a fixed dangerous literal, the probability that it remains is at most $\delta/(2n)$. If a literal $\tilde v$ is dangerous, then each example $x$ has probability at least $\varepsilon/n$ of satisfying both $h_*(x) = 1$ and $\tilde v(x) = 0$, causing $\tilde v$ to be removed from $\hat h$. The probability that $\tilde v$ remains after $m$ examples is at most
$$\Bigl(1 - \frac{\varepsilon}{n}\Bigr)^m \le \exp\Bigl(-\frac{m\varepsilon}{n}\Bigr).$$
For any
$$m \ge \frac{n}{\varepsilon}\ln\frac{2n}{\delta}$$
this is at most $\delta/(2n)$, as desired.
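As an illustration (not part of the required solution), here is a Python sketch of the elimination algorithm analysed above; the encoding of literals and the toy target are our own choices.

```python
import itertools

def learn_conjunction(sample, n):
    """Elimination algorithm for conjunctions over n Boolean variables.

    A literal is encoded as (i, b): it is satisfied by x iff x[i] == b.
    Start with all 2n literals and drop those contradicted by a positive example.
    """
    h_hat = {(i, b) for i in range(n) for b in (0, 1)}
    for x, y in sample:
        if y == 1:
            h_hat -= {(i, 1 - x[i]) for i in range(n)}   # remove literals false on x
    return h_hat

def predict(h_hat, x):
    return int(all(x[i] == b for i, b in h_hat))

# Toy target: v0 AND (NOT v2) over n = 3 variables (our own example).
n = 3
target = {(0, 1), (2, 0)}
sample = [(x, int(all(x[i] == b for i, b in target)))
          for x in itertools.product((0, 1), repeat=n)]
h_hat = learn_conjunction(sample, n)
# As argued in (a): consistent with the whole sample, and the target literals survive.
assert all(predict(h_hat, x) == y for x, y in sample)
assert target <= h_hat
print(sorted(h_hat))
```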

(c) First, there is one conjunction that has no satisfying assignments, which we can represent, for example, as $v_1 \wedge \bar v_1$. Consider now conjunctions that have at least one satisfying assignment. For each variable $i$ there are three alternatives: literal $v_i$ is included but $\bar v_i$ is not; literal $\bar v_i$ is included but $v_i$ is not; or neither $v_i$ nor $\bar v_i$ is included. Since we can choose from these three alternatives independently for each of the $n$ variables, we get $3^n$ such conjunctions. This count includes the conjunction with no literals at all, which by convention represents the hypothesis that always outputs 1. We therefore get $|C_n| = 3^n + 1$.

With $n = 100$, $\varepsilon = 0.1$ and $\delta = 0.001$, the required number of examples can be calculated directly from the theorem:
$$m \ge \frac{1}{0.1}\ln\frac{3^{100} + 1}{0.001} = 10\bigl(\ln(3^{100} + 1) - \ln 0.001\bigr) \approx 1167.7,$$
and $m = 1168$ is enough. Calculating $m$ with the formula from the previous part gives
$$m \ge \frac{n}{\varepsilon}\ln\frac{2n}{\delta} = \frac{100}{0.1}\ln\frac{200}{0.001} \approx 12206.07.$$
Hence, $m = 12207$ is enough.
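Both sample sizes are easy to reproduce numerically (again, not part of the required solution):

```python
import math

eps, delta, n = 0.1, 0.001, 100

# Generic bound for a finite class: m >= (1/eps) * (ln|C_n| + ln(1/delta)), |C_n| = 3^n + 1.
m_generic = (1 / eps) * (math.log(3**n + 1) - math.log(delta))
print(m_generic)          # ~1167.7   -> 1168 examples suffice

# Algorithm-specific bound from part (b): m >= (n/eps) * ln(2n/delta).
m_specific = (n / eps) * math.log(2 * n / delta)
print(m_specific)         # ~12206.07 -> 12207 examples suffice
```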

Exercise 5

The Set Cover problem can be formulated as follows. Let $A = \{a_1, a_2, \ldots, a_n\}$ be a set, and let $U = \{B_1, B_2, \ldots, B_k\} \subseteq \mathcal{P}(A)$ be a collection of subsets of $A$ such that $\bigcup_{C \in U} C = A$. We are asked to find the smallest $V \subseteq U$ such that $\bigcup_{C \in V} C = A$. In other words, we should cover the set $A$ with the smallest possible number of sets from $U$.

We show how the Set Cover problem can be solved with an algorithm that takes as input a sample $((x_1, y_1), \ldots, (x_m, y_m))$ and outputs a monotone conjunction $f$ such that $\hat R(f)$ is minimized. We generate input vectors with label 0 for every $a \in A$ and vectors with label 1 for every $B \in U$. The coordinates of an input vector correspond to the sets in $U$, and in the monotone conjunction $f$ the variables indicate which sets belong to the set cover.

For every $i \in \{1, 2, \ldots, n\}$, make two identical sample pairs
$$(x_i, 0) = (x_{i+n}, 0) = \bigl((x_{i1}, x_{i2}, \ldots, x_{ik}), 0\bigr)$$
such that $x_{ij} = 0$ if $a_i \in B_j$ and $x_{ij} = 1$ otherwise. We call the vectors of these pairs 0-vectors. If the output conjunction corresponds to a set cover, all the labels of these vectors are predicted correctly, and if some element $a \in A$ belongs to none of the sets indicated by $f$, that causes two prediction errors.

Also, make for every $i \in \{1, 2, \ldots, k\}$ a pair
$$(x_{2n+i}, 1) = \bigl((x_{2n+i,1}, x_{2n+i,2}, \ldots, x_{2n+i,k}), 1\bigr),$$
where $x_{2n+i,j} = 0$ if $i = j$ and $x_{2n+i,j} = 1$ otherwise. We call these 1-vectors. The purpose of the 1-vectors is to incur one prediction error for each set included in the set cover.

Using this input, assume that the output $f$ of the algorithm corresponds to a collection $V \subseteq U$, and that there is an $a \in A$ such that $a \notin \bigcup_{C \in V} C$. But then we can improve $f$, that is, decrease $\hat R(f)$, by adding some $B \in U$ such that $a \in B$: including $B$ increases the number of prediction errors on the 1-vectors by 1, while covering $a$ reduces the number of errors among the 0-vectors by 2. So $f$ always corresponds to a set cover. For every set cover, all prediction errors happen on the 1-vectors, and the number of prediction errors equals the size of the set cover. Thus, the output of the algorithm is a smallest set cover. The sample can be generated in linear time, so the given problem is NP-hard.
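As an illustration (not part of the required solution), the following Python sketch builds the sample of the reduction for a small made-up Set Cover instance and counts the prediction errors of a monotone conjunction, confirming that a cover of size $s$ makes exactly $s$ errors and that leaving an element uncovered costs two additional errors.

```python
def build_sample(n, sets):
    """Sample for the reduction; `sets` is a list of k subsets of {0, ..., n-1}."""
    k = len(sets)
    sample = []
    for a in range(n):                       # two identical 0-vectors per element a
        vec = tuple(0 if a in B else 1 for B in sets)
        sample += [(vec, 0), (vec, 0)]
    for i in range(k):                       # one 1-vector per set B_i
        vec = tuple(0 if j == i else 1 for j in range(k))
        sample.append((vec, 1))
    return sample

def errors(chosen, sample):
    # `chosen` is the set of variable indices appearing in the monotone conjunction f.
    predict = lambda x: int(all(x[j] == 1 for j in chosen))
    return sum(predict(x) != y for x, y in sample)

# Toy instance (ours): A = {0, 1, 2, 3}, U = {B_0, B_1, B_2}.
sets = [{0, 1}, {1, 2}, {2, 3}]
sample = build_sample(4, sets)
print(errors({0, 2}, sample))   # {B_0, B_2} covers A: 2 errors = size of the cover
print(errors({0, 1}, sample))   # {B_0, B_1} misses element 3: 2 + 2 = 4 errors
```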