10-806 Foundations of Machine Learning and Data Science
Lecturer: Avrim Blum
Lecture 9: October 7, 2015

1 Computational Hardness of Learning

Today we will talk about some computational hardness results for learning. Generally speaking, there are two different types of such results. The first type focuses on the concept class (or representation) used by the learning algorithm. For example, given a labeled sample S, how hard is it to find a consistent h ∈ C whenever one exists? Or, given a labeled sample S, how hard is it to find the h ∈ C of minimum empirical error (especially if there is no h ∈ C whose empirical error is 0)? The second type of result focuses on the intrinsic difficulty of prediction. For example, given access to data from some distribution D labeled by some target c ∈ C, how hard is it to find a prediction rule h of error at most, say, 1/4 when we do not require that h ∈ C and only require that it be some polynomial-time procedure?

The first type of hardness is often called representation-dependent hardness, or hardness of proper learning. The second type is often called representation-independent hardness, or hardness of improper learning.

2 Representation-dependent hardness

We've seen that the consistency problem can be efficiently solved for linear threshold functions (using linear programming) and for disjunctions (using a simple list-and-cross-off algorithm). How about for the intersection of 2 halfspaces, i.e., the AND of two linear threshold functions, or for a 2-clause CNF, i.e., the AND of two disjunctions? These consistency problems turn out to be NP-hard.

2.1 Hardness for learning an intersection of two halfspaces

Let's consider the intersection of two halfspaces. We will prove that solving the consistency problem is NP-hard by a reduction from the NP-hard problem hypergraph 2-coloring. First of all, what is the hypergraph 2-coloring problem?
Here you are given a hypergraph, which is just a set V of n nodes (call them 1, 2, ..., n) and m subsets s_1, s_2, ..., s_m ⊆ V, which are sometimes called hyperedges. Your goal is to give each node a color, Red or Blue, such that no subset s_j is monochromatic. That is, every s_j has at least one Red node and at least one Blue node. This problem is NP-hard.

To perform our reduction, we want to create a set S of positive and negative examples in R^n (in fact, they will be in {0,1}^n) such that S is consistent with an intersection of two halfspaces (i.e., an AND of two linear threshold functions) if and only if the given instance of hypergraph 2-coloring has a solution. We can do this as follows.

1. First, we label the origin as positive.

2. Next, for each coordinate i, we label the unit vector e_i (having a 1 in coordinate i and a 0 in all other coordinates) as negative.

3. Finally, for each set s_j, we label the indicator vector for that set (having a 1 in coordinate i for each i ∈ s_j and a 0 in the rest) as positive.

Claim 1 The labeled examples produced by this reduction are consistent with an intersection of two halfspaces (i.e., an AND of two linear threshold functions) if and only if the given instance of hypergraph 2-coloring has a solution.

Proof: First, suppose the labeled examples are consistent with the AND of two linear threshold functions. For each negative example: if threshold 1 says negative, color its associated node Red; if threshold 2 says negative, color its associated node Blue. (For points on the negative side of both, their associated nodes can be colored anything.) We claim this is a legal solution to the given hypergraph 2-coloring instance. To see why, suppose some set s_j had all its nodes the same color, say Red. Look at the Red LTF, which classifies x as positive iff w_1 x_1 + w_2 x_2 + ... + w_n x_n ≤ w_0, where w_0 ≥ 0 since the origin is positive. For each i ∈ s_j, we know w_i > w_0 since e_i is negative. But this means that Σ_{i ∈ s_j} w_i > |s_j| · w_0, and this in turn is ≥ w_0 since w_0 ≥ 0. So the indicator vector for s_j would be mistakenly labeled as negative, a contradiction.

In the other direction, suppose you are given a solution to the hypergraph 2-coloring instance. Then just create the Red LTF "x is positive iff w_1 x_1 + ... + w_n x_n ≤ 1/2", where w_i = 1 if i is Red and w_i = -n if i is Blue, and create the Blue LTF "x is positive iff w'_1 x_1 + ... + w'_n x_n ≤ 1/2", where w'_i = 1 if i is Blue and w'_i = -n if i is Red. Each negative example e_i is correctly separated from the origin by one of the two LTFs (Red points are rejected by the Red LTF and Blue points by the Blue LTF).

But each of the positives will be correctly classified: every set s_j has at least one node of each color, so under each LTF the single weight of -n counteracts the at most n-1 weights of +1, making the sum at most -1 ≤ 1/2.

2.2 Hardness for learning an AND of two OR-functions (2-clause CNF)

We saw we could easily learn a single OR function, like x_1 ∨ x_3 ∨ x_5. How about an AND of two OR functions, like (x_1 ∨ x_3 ∨ x_5) ∧ (x_2 ∨ x_4 ∨ x_6)? This turns out to be NP-hard. These are called 2-clause CNF formulas. To make things easier to think about, let's define our class of interest C to be the class of monotone 2-clause CNFs, i.e., no variables are negated.

Claim 2 The consistency problem for monotone 2-clause CNF is NP-complete.

Proof: We can do a reduction from hypergraph 2-coloring just like before. For each i = 1, 2, ..., n, create the negative example e_i (with x_i = 1 and the rest of the features set to 0). Notice that this implies that no variable can be in both clauses. Next, for each set s_j, we label the indicator vector for s_j as positive. Let's think about what this means. Suppose we have a consistent 2-clause CNF. Create a coloring where variables in clause 1 are the red nodes, variables in clause 2 are the blue nodes, and nodes whose variables are not in either clause can be either color (recall that no variable is in both clauses).

This will be a legal coloring: since the indicator vector for s_j is labeled positive, s_j must set at least one feature to 1 inside each clause (to make their AND be positive), so s_j contains at least one red node and at least one blue node. In the other direction, given a legal coloring, we just create one clause for each color. This will correctly output negative on the negatives and correctly output positive on the positives.

What about the non-monotone case? Actually, the above reduction still works (assuming n ≥ 3). In particular, if there exists a consistent 2-clause CNF at all, it will have to be monotone. The reason is that if some negated literal ¬x_i is in clause 1, then clause 2 can't have any negated literal ¬x_j (else e_k would be positive for any k ∉ {i, j}), and clause 2 can't have any positive literal x_j for j ≠ i (else e_j would be positive). So the only option for clause 2 is to equal x_i. But ANDing clause 1 with x_i is the same as removing ¬x_i from clause 1 and ANDing with x_i.

2.3 Learning 2-clause CNF using 2-DNF

Interestingly, you can learn 2-clause CNF formulas using the class of 2-DNF formulas (an OR of ANDs where each AND is of size at most 2). The reason is that (a) any 2-clause CNF can be written as a 2-DNF by just distributing out the AND, and (b) you can learn 2-DNF in polynomial time and O(n^2/ε) examples by reduction to learning a single OR function. Interestingly, no similar trick is known for an intersection of 2 halfspaces.

2.4 Hardness of finding the LTF of minimum empirical error

We know that we can find a single LTF consistent with the data when one exists. However, when there is no consistent LTF, the problem of finding the LTF of minimum empirical error is NP-hard. We can show this using a very similar reduction, based on the Maximum Independent Set problem in graphs. Given a graph G = (V, E), an independent set is a set of nodes having no edges between them. Finding the maximum independent set in a graph is NP-hard. We can do a reduction by using the same idea as before. For the sake of intuition, let's flip signs this time.

Specifically, the origin will be negative, each coordinate unit vector e_i will be positive, and then for each edge (i, j) in the graph, we will make the indicator vector for that edge negative. Notice that the maximum independent set in G is exactly the largest set of positive examples that can be separated from all the negatives. In other words, imagine you are not allowed to make mistakes on any negative examples and your goal is to get as many positive examples correct as possible. Specifically, if S is an independent set, then the associated LTF is "x is positive iff w_1 x_1 + ... + w_n x_n ≥ 1/2", where we set w_i = 1 if i ∈ S and w_i = -1 if i ∉ S. Notice that this guarantees that for each edge (i, j), the example x corresponding to that edge has w · x ≤ 0, since at least one endpoint of the edge lies outside S. In the other direction, if we have an LTF "x is positive iff w_1 x_1 + ... + w_n x_n ≥ w_0" with w_0 > 0, and we let S = {i : w_i ≥ w_0}, then S has to be an independent set: if w_i ≥ w_0 and w_j ≥ w_0, then w_i + w_j ≥ 2w_0 > w_0, so the indicator vector for the pair (i, j) cannot be a correctly classified negative example, i.e., (i, j) cannot be an edge.

Now, what about minimizing empirical error, i.e., when you are allowed to make mistakes of both kinds? To address this, we just replicate each negative example enough times (more than the total number of positive examples) so that even making a mistake on a single negative example would be suboptimal.
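To make the Section 2.4 reduction concrete, here is a small Python sketch (the function names are illustrative, not from the notes). It builds the labeled sample from a graph and checks, on a 4-cycle, that the LTF derived from an independent set makes no mistakes on negatives and gets exactly the positives e_i with i ∈ S correct.

```python
# Illustrative sketch (not from the notes): the reduction from Maximum
# Independent Set to minimizing positive-side errors of an LTF.

def reduction_examples(n, edges):
    """Origin negative, each unit vector e_i positive, each edge indicator negative."""
    examples = [([0] * n, -1)]
    for i in range(n):
        examples.append(([1 if j == i else 0 for j in range(n)], +1))
    for i, j in edges:
        examples.append(([1 if k in (i, j) else 0 for k in range(n)], -1))
    return examples

def ltf_from_independent_set(n, S):
    """The LTF from the notes: w_i = 1 if i in S else -1; positive iff w.x >= 1/2."""
    w = [1 if i in S else -1 for i in range(n)]
    return lambda x: +1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0.5 else -1

# A 4-cycle 0-1-2-3-0; {0, 2} is a maximum independent set.
n, edges = 4, [(0, 1), (1, 2), (2, 3), (3, 0)]
h = ltf_from_independent_set(n, {0, 2})
sample = reduction_examples(n, edges)

# No mistakes on negatives (origin and all edge indicators)...
assert all(h(x) == -1 for x, y in sample if y == -1)
# ...and exactly the positives e_i with i in the independent set are correct.
assert [x.index(1) for x, y in sample if y == +1 and h(x) == +1] == [0, 2]
```

Each edge indicator sums one +1 and one -1 weight (or two -1s), giving w · x ≤ 0 < 1/2, which is exactly the guarantee argued above.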

3 Representation-independent hardness

We now consider hardness results that do not depend on the form of the hypotheses used by the learning algorithm. Specifically, our goal is to say that the class C contains functions that are so complicated that for some distributions D, no polynomial-time algorithm can predict much better than just random guessing.

3.1 Cryptographic assumptions

A classic cryptographic result is that if factoring is hard, then one can create functions f called pseudorandom generators (PRGs) with the following properties. First of all, f is a function from {0,1}^n to {0,1}^m where m > n. An algorithm A is said to break f if

    | Pr_{v ∈ {0,1}^m}[A(v) = 1] - Pr_{x ∈ {0,1}^n}[A(f(x)) = 1] | ≥ 1/poly(n).

In other words, A can (at least slightly) distinguish a random m-bit string from the result of running f on a random n-bit string. Now the claim is that given an n-bit number N, for any m = poly(n), one can create such a function f such that if f can be broken by a polynomial-time algorithm A, then this gives a polynomial-time algorithm for factoring N. (Formally, there is a general efficient procedure for creating such functions f_N from any given N.)

It turns out one can use such constructions to show that learning DFAs (finite automata) is hard, under the assumption that factoring is hard. (You can view a DFA as a Boolean function: given an input string x, we feed it into the DFA and output positive if the DFA accepts the string and negative if it rejects.) Specifically, if you could learn an n-state DFA to error less than 1/4 (say), with at most n^k samples and polynomial time, then you could get a polynomial-time factoring algorithm. Here is the idea. Let m = 10n^k, and suppose we are given a string of m bits b_1 b_2 ... b_m and need to decide whether it is truly random or is the output of a PRG f on a random input of size n. What we'll do is create a probability distribution that is uniform over m = 10n^k examples x_1, x_2, ..., x_m (chosen in a specific way that I won't get into), where the label of x_i is b_i.

If b_1 b_2 ... b_m is truly random, then there is no way an algorithm that sees only n^k = m/10 samples can get to error rate 1/4: on the 9/10 of the examples it has not seen, its error rate is 1/2, so the best it could hope for is error (9/10) · (1/2) = 9/20 > 1/4. However, if b_1 b_2 ... b_m is the result of running a PRG from some appropriate construction, the claim is that there exists an n-state DFA that can compute the b_i from the x_i, and so our algorithm has to achieve error at most 1/4. So our algorithm can break the PRG. (Technically, this will be a PRG on an input of size n' such that n = poly(n'), but that's fine since we still have m = poly(n').) To make this all work, one has to show there are PRGs that are simple enough that the i-th bit can be computed from x_i using a DFA. This is discussed in the Kearns-Vazirani book.

3.2 Modern results using hardness of refuting random CSPs

Recently there has been a development of strong representation-independent hardness results for classes we encounter much more commonly in machine learning, based on the presumed difficulty of refuting certain kinds of random constraint satisfaction problems (CSPs). Here is a brief summary of a result of [2] on hardness for agnostic learning of linear separators.

First, the claim is that under an appropriate hardness assumption (described below), for any constant c, there is no polynomial-time algorithm that, given a sample S of n^c points in {-1, 1}^n, can in general whp distinguish (a) the case that the labels are just uniform random coin flips from (b) the case that there exists a linear separator of error at most 10%. (You can replace 10% with any constant greater than 0.) So this is a pretty strong hardness claim.

The assumption used to achieve this result is that refuting random k-XOR formulas is hard. A k-XOR formula is an AND of XORs of k variables each, where the XORs are allowed to be negated too. For instance, a 2-XOR formula might look like: (x_1 ⊕ x_3) ∧ (x_2 ⊕ ¬x_3) ∧ (¬x_1 ⊕ x_3) ∧ .... It is easy to find a satisfying assignment to such a formula if one exists, using Gaussian elimination. However, if there is no satisfying assignment, finding the assignment that satisfies as many of the XORs as possible is NP-hard. What the assumption states is that for m up to n^{√k · log k}, it is in fact hard to distinguish case (a), a random k-XOR formula of m constraints, from case (b), a formula of m k-XOR constraints that has an assignment satisfying at least 90% of the constraints. A few notes about this:

1. A random formula of m ≥ n/ε^2 k-XOR constraints will with high probability have the property that no assignment satisfies more than a 1/2 + ε fraction of the constraints. We can see this as follows. Fix an assignment x and then draw the m constraints at random (each selected by picking k variables at random, along with randomly choosing whether to negate the XOR). Each constraint is satisfied by x with probability 1/2, independently, so by Hoeffding bounds, the probability that this specific assignment x satisfies more than a 1/2 + ε fraction of the constraints is at most e^{-2mε^2}. By the union bound, the chance that any assignment satisfies more than this many is at most 2^n · e^{-2mε^2} ≤ 2^n · e^{-2n} < e^{-n} = o(1).

2. There were earlier results based on the assumption that this task was hard for other kinds of constraint satisfaction problems, some of which were recently broken (i.e., shown not to be hard) by [1].

3. For the k-XOR problem, the best known polynomial-time algorithm for the distinguishing problem requires m = Ω(n^{k/2}).

References

[1] Sarah R. Allen, Ryan O'Donnell, and David Witmer. How to refute a random CSP. In FOCS, 2015.

[2] Amit Daniely. Complexity theoretic limitations on learning halfspaces. CoRR, abs/1505.05800, 2015.
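As a small empirical companion to note 1 of Section 3.2 (an illustrative sketch, not part of the original notes): for a random 2-XOR formula with m much larger than n, even the best of all 2^n assignments satisfies only slightly more than half of the constraints, just as the Hoeffding-plus-union-bound argument predicts.

```python
import itertools
import random

# Illustrative sketch (not from the notes): draw a random 2-XOR formula with
# m >> n constraints and brute-force the best achievable satisfied fraction.
random.seed(0)
n, m = 12, 1000

# Each constraint (i, j, b) demands x_i XOR x_j == b; b = 1 plays the role
# of a negated XOR. We pick i != j so every constraint involves 2 variables.
constraints = [tuple(random.sample(range(n), 2)) + (random.randrange(2),)
               for _ in range(m)]

def frac_satisfied(x):
    """Fraction of constraints satisfied by the 0/1 assignment x."""
    return sum((x[i] ^ x[j]) == b for i, j, b in constraints) / m

# Brute force over all 2^n assignments.
best = max(frac_satisfied(x) for x in itertools.product((0, 1), repeat=n))

# Since i != j, each constraint is satisfied by exactly half of all 2^n
# assignments, so the average (hence the max) is at least 1/2; the union
# bound makes exceeding 1/2 + eps very unlikely for eps ~ sqrt(n/m).
assert 0.5 <= best < 0.6
```

Contrast this with case (b) of the assumption, where some assignment satisfies at least 90% of the constraints; the assumption is that no polynomial-time algorithm can tell the two cases apart.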