Learning and Fourier Analysis


Learning and Fourier Analysis. Grigory Yaroslavtsev, http://grigory.us. Slides at http://grigory.us/cis625/lecture2.pdf. CIS 625: Computational Learning Theory.

Fourier Analysis and Learning. A powerful tool for PAC-style learning under the uniform distribution over $\{0,1\}^n$; sometimes it requires queries of the form $f(x)$. It works for learning many classes of functions, e.g.: monotone functions, DNF, decision trees, low-degree polynomials; small circuits, halfspaces, $k$-linear functions, juntas (functions that depend on a small number of variables); submodular functions (an analog of convex/concave functions). It can be extended to product distributions over $\{0,1\}^n$, i.e. $D = D_1 \times D_2 \times \cdots \times D_n$, where $\times$ means that the draws are independent.

Recap: Fourier Analysis. Functions as vectors form a vector space: $f : \{-1,1\}^n \to \mathbb{R}$ corresponds to a vector $f \in \mathbb{R}^{2^n}$. Inner product on functions = correlation: $\langle f, g \rangle = 2^{-n} \sum_{x \in \{-1,1\}^n} f(x) g(x) = \mathbb{E}_{x \sim \{-1,1\}^n}[f(x) g(x)]$. Thm: Every function $f : \{-1,1\}^n \to \mathbb{R}$ can be uniquely represented as a multilinear polynomial $f(x_1, \dots, x_n) = \sum_{S \subseteq [n]} \hat f(S)\, \chi_S(x)$, where $\chi_S(x) = \prod_{i \in S} x_i$ and $\hat f(S) = \langle f, \chi_S \rangle$ is the Fourier coefficient of $f$ on $S$.
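To make the recap concrete, here is a minimal brute-force sketch (Python, added for illustration and not part of the lecture) that computes every Fourier coefficient $\hat f(S) = \langle f, \chi_S \rangle$ by summing over all $2^n$ points, and checks it on $\min(x_1, x_2)$, the function used later in the restriction example.

import itertools

def chi(S, x):
    """Parity character chi_S(x) = prod_{i in S} x_i, for x indexed by coordinates."""
    p = 1
    for i in S:
        p *= x[i]
    return p

def fourier_coefficients(f, n):
    """Brute-force Fourier transform: hat{f}(S) = E_x[f(x) chi_S(x)] over all 2^n points."""
    points = list(itertools.product([-1, 1], repeat=n))
    return {
        S: sum(f(x) * chi(S, x) for x in points) / 2 ** n
        for r in range(n + 1)
        for S in itertools.combinations(range(n), r)
    }

# min(x1, x2) = -1/2 + 1/2*x1 + 1/2*x2 + 1/2*x1*x2
print(fourier_coefficients(lambda x: min(x), 2))
# {(): -0.5, (0,): 0.5, (1,): 0.5, (0, 1): 0.5}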

Recap: Convolution. Def.: For $x, y \in \{-1,1\}^n$ define $x \circ y \in \{-1,1\}^n$ by $(x \circ y)_i = x_i y_i$. Def.: For $f, g : \{-1,1\}^n \to \mathbb{R}$ their convolution $f * g : \{-1,1\}^n \to \mathbb{R}$ is $(f * g)(x) = \mathbb{E}_{y \sim \{-1,1\}^n}[f(y)\, g(x \circ y)]$. Properties: 1. $f * g = g * f$; 2. $(f * g) * h = f * (g * h)$; 3. for all $S \subseteq [n]$: $\widehat{f * g}(S) = \hat f(S)\, \hat g(S)$.
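A quick numerical check of Property 3, as a sketch that reuses chi and fourier_coefficients from the block above (the two test functions are an arbitrary choice): the Fourier coefficients of $f * g$ are the products of the coefficients of $f$ and $g$.

def convolve(f, g, n):
    """(f * g)(x) = E_y[f(y) g(x o y)], where (x o y)_i = x_i * y_i."""
    points = list(itertools.product([-1, 1], repeat=n))
    return lambda x: sum(
        f(y) * g(tuple(a * b for a, b in zip(x, y))) for y in points
    ) / 2 ** n

n = 2
f = lambda x: min(x)   # min(x1, x2)
g = lambda x: x[0]     # the parity chi_{1}
fg_hat = fourier_coefficients(convolve(f, g, n), n)
f_hat, g_hat = fourier_coefficients(f, n), fourier_coefficients(g, n)
assert all(abs(fg_hat[S] - f_hat[S] * g_hat[S]) < 1e-9 for S in fg_hat)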

Linearity Testing. $f : \{0,1\}^n \to \{0,1\}$; $P$ = the class of linear functions; $\mathrm{dist}(f, P) = \min_{g \in P} \mathrm{dist}(f, g)$; $f$ is $\varepsilon$-close if $\mathrm{dist}(f, P) \le \varepsilon$. A linearity tester must: accept linear functions with probability 1; for functions that are nonlinear but $\varepsilon$-close, either answer is fine ("don't care"); reject functions that are not $\varepsilon$-close with probability $\ge 2/3$.
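The slide states only the guarantees a tester must satisfy; the standard tester for this task is the 3-query BLR test, which is presumably what the lecture has in mind. The sketch below (Python, with an illustrative repetition count) implements it: pick uniform $x, y$ and check $f(x) \oplus f(y) = f(x \oplus y)$.

import random

def blr_test(f, n, repetitions=10):
    """BLR linearity test with query access to f: {0,1}^n -> {0,1}.

    Accepts every linear (parity) function with probability 1; each repetition
    rejects a function that is far from linear with constant probability.
    """
    for _ in range(repetitions):
        x = tuple(random.randint(0, 1) for _ in range(n))
        y = tuple(random.randint(0, 1) for _ in range(n))
        x_xor_y = tuple(a ^ b for a, b in zip(x, y))
        if f(x) ^ f(y) != f(x_xor_y):
            return False  # reject
    return True  # accept

# A parity always passes; a far-from-linear function is rejected w.h.p.
print(blr_test(lambda x: x[0] ^ x[2], n=4))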

Local Correction. Learning linear functions takes $n$ queries. Lem: If $f$ is $\varepsilon$-close to a linear function $\chi_S$, then for every $x$ one can compute $\chi_S(x)$ with probability $\ge 1 - 2\varepsilon$ as follows: pick $y \sim \{0,1\}^n$ uniformly at random and output $f(y) \oplus f(x \oplus y)$. Proof: Since both $y$ and $x \oplus y$ are individually uniform, $\Pr[f(y) \ne \chi_S(y)] \le \varepsilon$ and $\Pr[f(x \oplus y) \ne \chi_S(x \oplus y)] \le \varepsilon$. By a union bound, $\Pr[f(y) = \chi_S(y) \text{ and } f(x \oplus y) = \chi_S(x \oplus y)] \ge 1 - 2\varepsilon$, and in that case $f(y) \oplus f(x \oplus y) = \chi_S(y) \oplus \chi_S(x \oplus y) = \chi_S(x)$.
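A sketch of the local corrector from the lemma (Python, over $\{0,1\}$ with XOR, added for illustration): one correction costs two queries; repeating and taking a majority vote drives the error down further when $\varepsilon < 1/4$.

import random

def locally_correct(f, x, n):
    """Output f(y) XOR f(x XOR y) for a uniformly random y.

    If f is eps-close to a parity chi_S, this equals chi_S(x) with probability
    >= 1 - 2*eps, since both y and x XOR y are individually uniform.
    """
    y = tuple(random.randint(0, 1) for _ in range(n))
    x_xor_y = tuple(a ^ b for a, b in zip(x, y))
    return f(y) ^ f(x_xor_y)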

Recap: PAC-style learning. PAC-learning under the uniform distribution: for a class of functions $C$, given access to $f \in C$ and $\varepsilon$, find a hypothesis $h$ such that $\mathrm{dist}(f, h) \le \varepsilon$. Two query access models: random samples $(x, f(x))$ with $x$ uniform in $\{-1,1\}^n$; queries $(x, f(x))$ for any $x \in \{-1,1\}^n$ of the learner's choice.

Fourier Analysis and Learning. Def (Fourier Concentration): The Fourier spectrum of $f : \{-1,1\}^n \to \mathbb{R}$ is $\varepsilon$-concentrated on a collection of subsets $\mathcal{F}$ if $\sum_{S \subseteq [n],\, S \in \mathcal{F}} \hat f(S)^2 \ge 1 - \varepsilon$. Thm (Sparse Fourier Algorithm): Given $\mathcal{F}$ on which $f : \{-1,1\}^n \to \{-1,1\}$ is $\varepsilon/2$-concentrated, there is an algorithm that PAC-learns $f$ with $O(|\mathcal{F}| \log |\mathcal{F}| / \varepsilon)$ random samples.

Estimating Fourier Coefficients. Lemma: Given $S$ and $O(\log(1/\delta)/\varepsilon^2)$ random samples from $f : \{-1,1\}^n \to \{-1,1\}$, there is an algorithm that outputs $\tilde f(S)$ such that with probability $\ge 1 - \delta$: $|\tilde f(S) - \hat f(S)| \le \varepsilon$. Proof: $\hat f(S) = \mathbb{E}_x[f(x) \chi_S(x)]$. Given $k = O(\log(1/\delta)/\varepsilon^2)$ random samples $(x_i, f(x_i))$, the empirical average $\frac{1}{k} \sum_{i=1}^{k} f(x_i) \chi_S(x_i)$ is $\varepsilon$-close to $\hat f(S)$ with probability $\ge 1 - \delta$ by a Chernoff bound.
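A sketch of the estimator in the lemma (Python, reusing chi from the first block; the constant in the sample bound is an arbitrary illustrative choice): average $f(x_i)\chi_S(x_i)$ over the labeled samples.

import math

def estimate_coefficient(samples, S):
    """Empirical estimate of hat{f}(S) from labeled samples [(x, f(x)), ...]."""
    return sum(fx * chi(S, x) for x, fx in samples) / len(samples)

def num_samples(eps, delta):
    """Sample count of order log(1/delta)/eps^2; the constant 2 is illustrative."""
    return math.ceil(2 * math.log(2 / delta) / eps ** 2)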

Rounding real-valued approximations. Lem: Let $f : \{-1,1\}^n \to \{-1,1\}$ and $g : \{-1,1\}^n \to \mathbb{R}$ be such that $\mathbb{E}_x[(f(x) - g(x))^2] \le \varepsilon$. Then for $h : \{-1,1\}^n \to \{-1,1\}$ defined as $h(x) = \mathrm{sign}(g(x))$: $\mathrm{dist}(f, h) \le \varepsilon$. Proof: Whenever $f(x) \ne \mathrm{sign}(g(x))$ we have $|f(x) - g(x)| \ge 1$, hence $(f(x) - g(x))^2 \ge 1$. Therefore $\mathrm{dist}(f, h) = \Pr_x[f(x) \ne h(x)] = \mathbb{E}_x[\mathbf{1}\{f(x) \ne \mathrm{sign}(g(x))\}] \le \mathbb{E}_x[(f - g)^2] \le \varepsilon$.

Sparse Fourier Algorithm. Thm (Sparse Fourier Algorithm): Given $\mathcal{F}$ such that $f : \{-1,1\}^n \to \{-1,1\}$ is $\varepsilon/2$-concentrated on $\mathcal{F}$, the Sparse Fourier Algorithm PAC-learns $f$ with $O(|\mathcal{F}| \log |\mathcal{F}| / \varepsilon)$ random samples. For each $S \in \mathcal{F}$ get an estimate $\tilde f(S)$ such that with probability $\ge 1 - 1/(10|\mathcal{F}|)$: $|\tilde f(S) - \hat f(S)| \le \sqrt{\varepsilon / (2|\mathcal{F}|)}$. Output $h = \mathrm{sign}(g)$, where $g = \sum_{S \in \mathcal{F}} \tilde f(S) \chi_S$. Analysis:
$\mathbb{E}_x[(f - g)^2] = \sum_{S} (\hat f(S) - \hat g(S))^2 = \sum_{S \in \mathcal{F}} (\hat f(S) - \tilde f(S))^2 + \sum_{S \notin \mathcal{F}} \hat f(S)^2 \le |\mathcal{F}| \cdot \frac{\varepsilon}{2|\mathcal{F}|} + \frac{\varepsilon}{2} \le \varepsilon$.
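Putting the pieces together, a sketch of the Sparse Fourier Algorithm (Python, reusing chi, estimate_coefficient and num_samples from the blocks above; sample_oracle is an assumed interface returning labeled uniform samples): estimate every coefficient in $\mathcal F$ from one batch of samples and output the sign of the resulting sparse polynomial.

import math

def sparse_fourier_learn(sample_oracle, family, eps):
    """Learn h = sign(sum_{S in F} tilde{f}(S) chi_S(x)) from random samples.

    sample_oracle(k): returns k labeled samples [(x, f(x)), ...], x uniform in {-1,1}^n.
    family: the collection F on which f is assumed eps/2-concentrated.
    Per-coefficient accuracy sqrt(eps/(2|F|)) and failure prob. 1/(10|F|), as in the analysis.
    """
    m = len(family)
    accuracy = math.sqrt(eps / (2 * m))
    delta = 1 / (10 * m)
    samples = sample_oracle(num_samples(accuracy, delta))  # one batch reused for every S
    coeffs = {S: estimate_coefficient(samples, S) for S in family}

    def h(x):
        g = sum(coeffs[S] * chi(S, x) for S in family)
        return 1 if g >= 0 else -1
    return h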

Low-Degree Algorithm. Some classes are $\varepsilon$-concentrated on low-degree Fourier coefficients: $\mathcal{F} = \{S : |S| \le k\}$ with $k \ll n$, so $|\mathcal{F}| \le n^k$. Monotone functions: $k = O(\sqrt{n}/\varepsilon)$, learning complexity $n^{O(\sqrt{n}/\varepsilon)}$. Size-$s$ decision trees: $k = O((\log s)/\varepsilon)$, learning complexity $n^{O((\log s)/\varepsilon)}$. Depth-$d$ decision trees: $k = O(d/\varepsilon)$, learning complexity $n^{O(d/\varepsilon)}$.
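The Low-Degree Algorithm is simply the Sparse Fourier Algorithm instantiated with $\mathcal{F} = \{S : |S| \le k\}$; a short sketch on top of sparse_fourier_learn from the previous block.

import itertools

def low_degree_learn(sample_oracle, n, k, eps):
    """Run the sparse learner on all subsets of [n] of size at most k."""
    family = [S for r in range(k + 1) for S in itertools.combinations(range(n), r)]
    return sparse_fourier_learn(sample_oracle, family, eps)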

Restrictions. Def: For a partition $(J, \bar J)$ of $[n]$ and $z \in \{-1,1\}^{\bar J}$, let the restriction $f_{J \mid z} : \{-1,1\}^{J} \to \mathbb{R}$ be $f_{J \mid z}(y) = f(y, z)$, where $(y, z)$ denotes the string with $y$ on the coordinates in $J$ and $z$ on the coordinates in $\bar J$. Example: $\min(x_1, x_2) = -\frac{1}{2} + \frac{1}{2} x_1 + \frac{1}{2} x_2 + \frac{1}{2} x_1 x_2$; for $J = \{1\}$, $\bar J = \{2\}$, $z = 1$, the restriction $f_{J \mid z} : \{-1,1\} \to \{-1,1\}$ is the function $x_1$.

Fourier coefficients of restrictions. The Fourier coefficients of $f_{J \mid z}$ can be obtained from the multilinear polynomial by substitution: for $S \subseteq J$,
$\widehat{f_{J \mid z}}(S) = \sum_{T \subseteq \bar J} \hat f(S \cup T)\, \chi_T(z)$.
$\mathbb{E}_z[\widehat{f_{J \mid z}}(S)] = \hat f(S)$: the term $T = \emptyset$ survives, and otherwise $\mathbb{E}_z[\chi_T(z)] = 0$.
$\mathbb{E}_z[\widehat{f_{J \mid z}}(S)^2] = \sum_{T \subseteq \bar J} \hat f(S \cup T)^2$: indeed
$\mathbb{E}_z[\widehat{f_{J \mid z}}(S)^2] = \mathbb{E}_z\big[\big(\sum_{T \subseteq \bar J} \hat f(S \cup T) \chi_T(z)\big)^2\big] = \sum_{T \subseteq \bar J} \hat f(S \cup T)^2$, since $\mathbb{E}_z[\chi_T(z) \chi_{T'}(z)] = 0$ for $T \ne T'$.

Goldreich-Levin/Kushilevitz-Mansour. Thm (GL/KM): Given query access to $f : \{-1,1\}^n \to \{-1,1\}$ and $0 < \tau \le 1$, the GL/KM algorithm w.h.p. outputs a list $L = \{U_1, \dots, U_\ell\}$ such that: if $|\hat f(U)| \ge \tau$ then $U \in L$, and every $U \in L$ satisfies $|\hat f(U)| \ge \tau/2$. Exercise (GL/KM + Sparse Fourier Algorithm): a class $C$ that is $\varepsilon$-concentrated on at most $M$ sets can be learned using $\mathrm{poly}(M, n, 1/\varepsilon)$ queries (every large coefficient: $|\hat f(U)| \ge 1/\sqrt{M}$). Corollary: Size-$s$ decision trees are learnable with $\mathrm{poly}(n, s, 1/\varepsilon)$ queries.

Estimating Fourier Weight via Restrictions. Recall: $\mathbb{E}_z[\widehat{f_{J \mid z}}(S)^2] = \sum_{T \subseteq \bar J} \hat f(S \cup T)^2$. Lemma: $\sum_{T \subseteq \bar J} \hat f(S \cup T)^2$ can be estimated from $O(\log(1/\delta)/\varepsilon^2)$ random samples with probability $\ge 1 - \delta$:
$\sum_{T \subseteq \bar J} \hat f(S \cup T)^2 = \mathbb{E}_z[\widehat{f_{J \mid z}}(S)^2] = \mathbb{E}_{z \sim \{-1,1\}^{\bar J}}\big[\big(\mathbb{E}_{y \sim \{-1,1\}^{J}}[f(y, z) \chi_S(y)]\big)^2\big] = \mathbb{E}_{z \sim \{-1,1\}^{\bar J}}\, \mathbb{E}_{y, y' \sim \{-1,1\}^{J}}[f(y, z) \chi_S(y)\, f(y', z) \chi_S(y')]$.
$f(y, z) \chi_S(y)\, f(y', z) \chi_S(y')$ is a $\pm 1$ random variable, so $O(\log(1/\delta)/\varepsilon^2)$ samples suffice to estimate its expectation.
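A sketch of this estimator (Python, reusing chi from the first block; note it needs query access to $f$, since each sample evaluates $f$ at two points sharing the same $z$; representing points as dicts from coordinate to $\pm 1$ is just a convenience of the sketch).

import random

def estimate_restricted_weight(f_query, S, J, Jbar, num_draws):
    """Estimate sum_{T subset of Jbar} hat{f}(S u T)^2 = E_z[ hat{f_{J|z}}(S)^2 ].

    f_query(point): query access to f, where point maps each coordinate to +-1.
    Each draw picks z on Jbar and two independent y, y' on J and averages the
    +-1 product f(y,z) chi_S(y) * f(y',z) chi_S(y').
    """
    total = 0.0
    for _ in range(num_draws):
        z = {i: random.choice([-1, 1]) for i in Jbar}
        y1 = {i: random.choice([-1, 1]) for i in J}
        y2 = {i: random.choice([-1, 1]) for i in J}
        total += (f_query({**y1, **z}) * chi(S, y1)) * (f_query({**y2, **z}) * chi(S, y2))
    return total / num_draws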

GL/KM-Algorithm. Put all $2^n$ subsets of $[n]$ into a single bucket. At each step: select any bucket $B$ containing $2^m$ sets, $m \ge 1$; split $B$ into $B_1, B_2$ of $2^{m-1}$ sets each; estimate the Fourier weight $\sum_{U \in B_i} \hat f(U)^2$ of each $B_i$ up to $\pm \tau^2/4$; discard $B_1$ or $B_2$ if its estimated weight is $\le \tau^2/2$. Output all buckets that contain a single set.

GL/KM-Algorithm: Correctness. Assume all weight estimates are accurate up to $\pm \tau^2/4$ (which holds w.h.p.). If $|\hat f(U)| \ge \tau$ then $U \in L$: every bucket containing $U$ has true weight $\ge \tau^2$, so its estimate is $\ge 3\tau^2/4 > \tau^2/2$ and it is never discarded. If $U \in L$ then $|\hat f(U)| \ge \tau/2$: every bucket with true weight $\le \tau^2/4$ has estimate $\le \tau^2/2$ and is discarded, so a surviving singleton bucket $\{U\}$ must have $\hat f(U)^2 \ge \tau^2/4$.

GL/KM-Algorithm: Complexity. By Parseval, at most $4/\tau^2$ buckets are active at any time: every surviving bucket has true weight $\ge \tau^2/4$ (its estimate exceeds $\tau^2/2$), the buckets are disjoint, and the total Fourier weight is $\sum_U \hat f(U)^2 = 1$. Each bucket can be split at most $n$ times, so the algorithm finishes after at most $4n/\tau^2$ steps.

GL/KM-Algorithm: Bucketing. $B_{k,S} = \{S \cup T : T \subseteq \{k+1, \dots, n\}\}$, so $|B_{k,S}| = 2^{n-k}$. The initial bucket is $B_{0,\emptyset}$. Splitting: $B_{k,S}$ is split into $B_{k+1,S}$ and $B_{k+1, S \cup \{k+1\}}$. The Fourier weight of $B_{k,S}$ is $\sum_{T \subseteq \{k+1,\dots,n\}} \hat f(S \cup T)^2$, which is estimated via restrictions (with $J = \{1,\dots,k\}$ and $\bar J = \{k+1,\dots,n\}$). Estimate each weight up to $\pm \tau^2/4$ with probability $\ge 1 - \tau^2/(80n)$, at a cost of $O\big(\frac{1}{\tau^4} \log \frac{n}{\tau}\big)$ queries per estimate; then all estimates are within $\pm \tau^2/4$ with probability $\ge 9/10$.
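A sketch of the complete GL/KM bucketing loop (Python, reusing estimate_restricted_weight from above; coordinates are 0-indexed and the per-estimate sample count is an illustrative constant, not the exact one from the slide).

import math

def goldreich_levin(f_query, n, tau):
    """Return a list L of subsets such that (w.h.p.) every U with |hat{f}(U)| >= tau
    is in L and every U in L has |hat{f}(U)| >= tau/2.

    A pair (k, S) represents the bucket B_{k,S} = {S u T : T subset of {k,...,n-1}}
    with S a subset of {0,...,k-1}.  A bucket is discarded if its estimated weight
    is at most tau^2/2, as on the slide.
    """
    draws = math.ceil(32 * math.log(80 * n / tau ** 2) / tau ** 4)  # illustrative constant
    buckets = [(0, frozenset())]           # the single initial bucket B_{0, emptyset}
    output = []
    while buckets:
        k, S = buckets.pop()
        if k == n:                         # bucket contains the single set S
            output.append(set(S))
            continue
        for child in (S, S | {k}):         # split into B_{k+1,S} and B_{k+1, S u {k}}
            J, Jbar = range(k + 1), range(k + 1, n)
            weight = estimate_restricted_weight(f_query, child, J, Jbar, draws)
            if weight > tau ** 2 / 2:      # keep only buckets that look heavy
                buckets.append((k + 1, child))
    return output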

Thanks!