Chebyshev Polynomials and Approximation Theory in Theoretical Computer Science and Algorithm Design


(Talk for MIT's Danny Lewin Theory Student Retreat, 2015)

Cameron Musco

October 8, 2015

Abstract

I will talk about low degree polynomials that are small on the interval [0,1] but jump up very rapidly outside of that interval. First I'll try to convince you why you should care about such polynomials, which have a ton of applications in complexity, optimization, learning theory, and numerical linear algebra. In particular, I will introduce the Chebyshev polynomials and then try to explain the magic behind them which makes them so useful in the above applications. A very nice reference if you're interested in this area is Faster Algorithms via Approximation Theory by Sachdeva and Vishnoi. Some of my proofs are taken from that book. I also pulled my quantum proof from a talk by Scott Aaronson on the polynomial method in TCS: www.scottaaronson.com/talks/polymeth.ppt

1 Step Polynomials and Their Applications

We will refer to a jump polynomial as a degree $d$ polynomial $p(x)$ such that $|p(x)| \le \epsilon$ for $x \in [0,1]$ and $p(x) \ge 1$ for $x \ge 1 + \gamma$. It is not hard to see that such polynomials exist for $d = O(\log(1/\epsilon)/\gamma)$: just set $p(x) = \frac{1}{(1+\gamma)^d} x^d$. Clearly $p(x) \ge 1$ for $x \ge 1 + \gamma$. Additionally, if $x \in [0,1]$, then taking $d = \log(1/\epsilon)/\gamma$ and using that $(1+\gamma)^{1/\gamma} \approx e$ for small $\gamma$, we have:

$$p(x) \le \frac{1}{(1+\gamma)^d} = \frac{1}{\left((1+\gamma)^{1/\gamma}\right)^{\log(1/\epsilon)}} \approx \frac{1}{e^{\log(1/\epsilon)}} = \epsilon.$$

However, we can do much better. In fact, using Chebyshev polynomials, degree $d = O(\log(1/\epsilon)/\sqrt{\gamma})$ suffices. The fact that you can achieve this improvement should be surprising to you. Additionally, it is optimal.
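As a quick sanity check on these two degree bounds, here is a small numerical sketch (Python with numpy; the values $\epsilon = 10^{-6}$ and $\gamma = 0.01$ are just illustrative) that searches for the smallest degree achieving the jump property under each construction.

```python
# Compare the smallest degree achieving the jump property for the naive
# construction p(x) = (x/(1+gamma))^d versus a shifted/scaled Chebyshev
# polynomial.  Illustrative constants only.
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

eps, gamma = 1e-6, 0.01
grid = np.linspace(0.0, 1.0, 2001)          # where the polynomial must be <= eps

def smallest_degree(make_poly):
    """Smallest d with max |p| <= eps on [0, 1] and p(1 + gamma) >= 1."""
    for d in range(1, 5000):
        p = make_poly(d)
        if np.max(np.abs(p(grid))) <= eps and p(1.0 + gamma) >= 1.0:
            return d
    return None

# Naive jump polynomial: p(x) = (x / (1 + gamma))^d.
naive = lambda d: (lambda x: (np.asarray(x) / (1.0 + gamma)) ** d)

# Chebyshev jump polynomial: map [0, 1] onto [-1, 1] via x -> 2x - 1, then
# normalize so that the value at 1 + gamma is exactly 1.
def cheb(d):
    Td = Chebyshev.basis(d)
    return lambda x: Td(2.0 * np.asarray(x) - 1.0) / Td(2.0 * (1.0 + gamma) - 1.0)

print("naive degree:    ", smallest_degree(naive))   # roughly log(1/eps)/gamma
print("Chebyshev degree:", smallest_degree(cheb))    # roughly log(1/eps)/sqrt(gamma)
```

With these parameters the naive construction needs degree around 1400, while the Chebyshev construction gets away with degree under 100.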

Before we talk at all about Chebyshev polynomials, let me give some applications of this fact. The two most important applications (from my unbiased viewpoint) are eigenvector computation/PCA and linear system solving/convex optimization. However, I am talking to the theory group, so let me try two other options first:

Learning Theory

A polynomial threshold function of degree $d$ is just some function $h(x) = \mathrm{sign}(p(x))$ where $p(x)$ is a degree $d$ multivariate polynomial. The most basic version, when $d = 1$, is a linear threshold function, aka a halfspace, aka a linear classifier. Learning these functions is one of the most (or the most) studied problems in learning theory. It's what the perceptron does, what SVMs do, what boosting and neural networks were originally used for, etc. You can provably learn a halfspace in polynomial time using linear programming. By building on these techniques, you can learn a degree $d$ polynomial threshold function over $n$ variables in $n^{O(d)}$ time. Hence a number of results in learning theory attempt to show that certain functions are well approximated by low degree polynomial threshold functions and so can be learned efficiently. One good example is Learning Intersections and Thresholds of Halfspaces by Klivans, O'Donnell, and Servedio [FOCS '02].

Here I will talk about a STOC '01 result of Klivans and Servedio showing how to learn DNF formulas over $n$ variables with $s$ clauses in time $2^{O(n^{1/3} \log s)}$. This solved a longstanding open question, matching a 1968 lower bound of Minsky and Papert and beating the previous algorithmic result of $2^{O(n^{1/2} \log n \log s)}$. Really they show that a DNF with $n$ variables and $s$ clauses can be computed by a polynomial threshold function of degree $O(n^{1/3} \log s)$, which they did by first showing that a DNF with $s$ terms, where each conjunction has only $k$ variables in it, can be computed by a polynomial threshold function of degree $O(\sqrt{k} \log s)$. Which they did using Chebyshev polynomials. How? It's ridiculously simple.

Consider evaluating a single term $x_1 \wedge x_2 \wedge \dots \wedge x_k$. We want a function that is high if all of the term's variables are true and low otherwise. Set our threshold function to be:

$$f(x_1, x_2, \dots, x_k) = T_{\sqrt{k} \log s}\left(\frac{x_1 + x_2 + \dots + x_k}{k}\right)$$

(really, a suitably shifted and scaled version of $T_{\sqrt{k} \log s}$, i.e. a jump polynomial with gap about $1/k$ and error $1/(2s)$). Note that since $T_{\sqrt{k} \log s}$ is a degree $\sqrt{k} \log s$ univariate polynomial, this is a degree $\sqrt{k} \log s$ multivariate polynomial. If the clause is true then we have $\frac{x_1 + x_2 + \dots + x_k}{k} = 1$. If it is not true, we have $\frac{x_1 + x_2 + \dots + x_k}{k} \le 1 - \frac{1}{k}$. Therefore, by the jump property, we know that $|f(x_1, x_2, \dots, x_k)| \le \frac{1}{2s}$ if the clause isn't true and $f(x_1, x_2, \dots, x_k) = 1$ otherwise. Now if we just add all of these functions together, we can evaluate the DNF by returning true if the sum is greater than $1/2$ and false otherwise. If at least one term is true, the sum will be at least $1 - \frac{s-1}{2s} > \frac{1}{2}$. If all terms are false, the sum will be at most $s \cdot \frac{1}{2s} = \frac{1}{2}$ in absolute value. Note this is NOT a probabilistic thing. This polynomial threshold function is just a way to evaluate the DNF exactly. Hence, if we learn the polynomial, we learn the DNF. I won't go into further detail, but note that without the Chebyshev acceleration you would just get $2^{O(n^{1/2} \log s)}$, which is far from the lower bound and doesn't beat previous results based on other methods.
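Here is a tiny numerical check of the single-term gadget just described. The shifting, scaling, and degree constant are ad hoc choices, so this is an illustration of the jump property doing the work rather than the exact Klivans-Servedio construction; it assumes numpy.

```python
# One DNF term on k variables, evaluated through a degree ~ sqrt(k) log s
# jump polynomial: exactly 1 when the term is satisfied, magnitude <= 1/(2s) otherwise.
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def term_polynomial(k, s):
    """Return q(m), where m = (number of satisfied literals in the term) / k."""
    d = int(np.ceil(np.sqrt(k) * np.log2(2 * s)))      # degree ~ sqrt(k) * log s
    Td = Chebyshev.basis(d)
    # Map [0, 1 - 1/k] (term not satisfied) onto [-1, 1]; m = 1 (term satisfied)
    # lands outside, at 1 + 2/(k-1).  Normalize so the satisfied case equals 1.
    scale = lambda m: (2.0 * m - (1.0 - 1.0 / k)) / (1.0 - 1.0 / k)
    return lambda m: Td(scale(m)) / Td(scale(1.0))

k, s = 25, 40
q = term_polynomial(k, s)
print(q(1.0))                                  # term satisfied: exactly 1
m_false = np.arange(k) / k                     # 0/k, 1/k, ..., (k-1)/k literals satisfied
print(np.max(np.abs(q(m_false))), "<=", 1 / (2 * s))
```

Summing one such gadget per term and thresholding the sum at $1/2$ gives the polynomial threshold function described above.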

Quantum Complexity Theory

Above we saw an algorithm based on Chebyshev polynomials. However, the optimality of these polynomials is also often used for lower bounds. This is especially popular in quantum complexity theory, where the general technique is referred to as the polynomial method. There is a famous quantum algorithm called Grover's search algorithm that allows you to find a nonzero value in a vector $x \in \{0,1\}^n$ using $O(\sqrt{n})$ quantum queries. I know, quantum is pretty messed up. You can use the optimality of Chebyshev polynomials to show that this is tight.

First off, note that Grover's search solves $\mathrm{OR}(x)$ in $O(\sqrt{n})$ queries to $x$, since if there is at least one nonzero value in $x$, it returns its index, indicating that $\mathrm{OR}(x) = 1$. The number of quantum queries required to compute a Boolean function like $\mathrm{OR}(x)$ (denote it $Q(\mathrm{OR})$) is lower bounded by what quantum people refer to as the approximate degree of that function, $\widetilde{\deg}(\mathrm{OR})$. This is just the minimum degree of a (multivariate) polynomial $p(x_1, \dots, x_n)$ that approximates the function up to uniform error, say, $1/3$ (i.e. differs on any input by at most $1/3$ from the OR of that input). Note that $\mathrm{OR}(x)$ only outputs 0 or 1, while $p(x)$ outputs general real numbers. The bound $Q(\mathrm{OR}) \ge \widetilde{\deg}(\mathrm{OR})$ holds essentially because the probability that a quantum algorithm outputs 1 on a given input can be written as a polynomial whose degree equals the number of queries made to the input. So to be right with probability $2/3$ on any input, the algorithm's acceptance probability had better be a polynomial of degree $Q(\mathrm{OR})$ that is a $1/3$ uniform approximation to $\mathrm{OR}(x)$.

Now, we don't want to work with multivariate polynomials. But we can use symmetrization, defining the univariate polynomial

$$h(k) = \mathbb{E}_{\|x\|_1 = k}\,[p(x)],$$

the average of $p$ over inputs with exactly $k$ ones. You can show that the degree of $h(\cdot)$ is upper bounded by the degree of $p(\cdot)$, although I will not discuss that here. So lower bounding $Q(\mathrm{OR})$ just requires lower bounding $\deg(h)$. Now consider $p(x)$ approximating $\mathrm{OR}(x)$, as computed by Grover's search. If $p(x)$ approximates OR, then we have $h(0) \approx 0$ and $h(k) \approx 1$ for $k = 1, \dots, n$. Flipping this polynomial horizontally and vertically and scaling the $x$-axis by $1/n$, we have a polynomial that is small on $[0,1]$ and jumps to near 1 at $1 + 1/n$. By the optimality of the Chebyshev polynomials, this must have degree $\Omega(\sqrt{n})$, giving us the lower bound for Grover search.
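Since the symmetrized question is just about a univariate polynomial constrained at the integer points $0, 1, \dots, n$, you can even compute the $1/3$-approximate degree of $\mathrm{OR}_n$ numerically: for each candidate degree it is a linear feasibility problem in the coefficients of $h$. Here is a small sketch doing exactly that (assuming numpy and scipy; $h$ is written in the Chebyshev basis purely for numerical stability).

```python
# Least degree d such that some univariate h has h(0) in [-1/3, 1/3] and
# h(k) in [2/3, 4/3] for k = 1..n.  By symmetrization this is the 1/3-approximate
# degree of OR_n.  Each degree is a linear feasibility problem, solved with HiGHS.
import numpy as np
from numpy.polynomial.chebyshev import chebvander
from scipy.optimize import linprog

def approx_degree_of_OR(n, err=1/3):
    ks = np.arange(n + 1)
    lo = np.where(ks == 0, -err, 1 - err)        # lower bounds on h(k)
    hi = np.where(ks == 0, err, 1 + err)         # upper bounds on h(k)
    xs = 2 * ks / n - 1                          # map {0, ..., n} into [-1, 1]
    for d in range(1, n + 1):
        V = chebvander(xs, d)                    # V[i, j] = T_j(xs[i])
        # Feasibility LP: find coefficients c with lo <= V c <= hi.
        A_ub = np.vstack([V, -V])
        b_ub = np.concatenate([hi, -lo])
        res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (d + 1), method="highs")
        if res.success:
            return d
    return n

for n in (16, 64, 256):
    print(f"n = {n:4d}   approximate degree = {approx_degree_of_OR(n):3d}   sqrt(n) = {np.sqrt(n):5.1f}")
```

The reported degrees grow proportionally to $\sqrt{n}$ (up to a constant), matching both Grover's algorithm and the lower bound.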

Linear System Solvers

Chebyshev polynomials play a very important role in accelerated methods for convex optimization (with linear system solving as a special case). These methods are widely studied by numerical computation people, theorists, machine learning people, etc. Simple iterative linear system solvers or convex optimization routines like gradient descent or Richardson iteration run in time $O(\mathrm{nnz}(A) \cdot \kappa \cdot \log(n/\epsilon))$, where $\kappa$ is the condition number of the matrix (or the ratio of smoothness to strong convexity in convex optimization). This can be improved to $O(\mathrm{nnz}(A) \cdot \sqrt{\kappa} \cdot \log(n/\epsilon))$ using accelerated solvers like conjugate gradient, Nesterov's accelerated gradient descent, and Chebyshev iteration.

To solve $Ax = b$ we want to return $x = A^{-1}b$, but computing the inverse exactly is too slow. Any iterative method is just applying a polynomial of $A$ to $b$, whose degree is equal to the number of iterations in the method. That is, we are computing $\tilde{x} = p(A)b$ as an approximation to $x = A^{-1}b$. Say we write $\tilde{x}$ as $p(A)Ax$. Decomposing $A = V\Lambda V^\top$ into its eigenbasis, we get

$$p(A)A = V p(\Lambda) V^\top \cdot V \Lambda V^\top = V h(\Lambda) V^\top, \quad \text{where } h(\lambda) = \lambda\, p(\lambda).$$

Now $h$ acts on the eigenvalues of $A$, which fall in the range $[\lambda_n, \lambda_1]$. We want $h$ to be near one on all these eigenvalues so that $V h(\Lambda) V^\top$ is close to the identity. However, clearly $h(0) = 0 \cdot p(0) = 0$. So... it's just like the function we were looking at for Grover's search: it's zero at 0 and jumps to one on $[\lambda_n, \lambda_1]$. Flipping and scaling, we see that it is a jump polynomial with gap $\gamma = \lambda_n/\lambda_1 = 1/\kappa$.

Eigenvector Computation

Let $\mathrm{gap} = (\lambda_1 - \lambda_2)/\lambda_2$ be the eigengap separating the top eigenvalue of a matrix from its second eigenvalue. The power method converges in $O(\log(1/\epsilon)/\mathrm{gap})$ iterations. Lanczos or Chebyshev iteration use Chebyshev polynomials to get $O(\log(1/\epsilon)/\sqrt{\mathrm{gap}})$. I'm not going to explain this one in detail; it is a direct application of jump polynomials, where we scale and shift so that $\lambda_2$ goes to 1 and $\lambda_1$ goes to $1 + \mathrm{gap}$.
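To see the $\kappa$ versus $\sqrt{\kappa}$ difference in action for linear systems, here is a minimal sketch comparing Richardson iteration against Chebyshev iteration on a random positive definite system. It assumes numpy, assumes $\lambda_{\min}$ and $\lambda_{\max}$ are known (Chebyshev iteration needs them), and implements the iteration directly from the three-term recurrence for $T_d$.

```python
# Richardson iteration vs. Chebyshev iteration on a random symmetric positive
# definite system with condition number kappa = 400.  Both run for the same
# number of iterations; lambda_min and lambda_max are assumed known.
import numpy as np

rng = np.random.default_rng(0)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lams = np.linspace(1.0, 400.0, n)
A = Q @ np.diag(lams) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

lmin, lmax = lams[0], lams[-1]
theta, delta = (lmax + lmin) / 2, (lmax - lmin) / 2
iters = 150

# Richardson iteration with the optimal fixed step size 2 / (lmin + lmax):
# the error contracts by roughly (1 - 2/kappa) per step.
x = np.zeros(n)
for _ in range(iters):
    x = x + (2.0 / (lmin + lmax)) * (b - A @ x)
print("Richardson error:", np.linalg.norm(x - x_star) / np.linalg.norm(x_star))

# Chebyshev iteration: enforce error e_k = T_k((theta*I - A)/delta) e_0 / T_k(theta/delta)
# via the three-term recurrence for T_k, tracking rho_k = T_k(theta/delta).
x_prev = np.zeros(n)                          # x_0
x_cur = x_prev + (b - A @ x_prev) / theta     # x_1
rho_prev, rho_cur = 1.0, theta / delta
for _ in range(iters - 1):
    r = b - A @ x_cur
    rho_next = 2.0 * (theta / delta) * rho_cur - rho_prev
    x_next = (2.0 * rho_cur / rho_next) * ((theta / delta) * x_cur + r / delta) \
             - (rho_prev / rho_next) * x_prev
    x_prev, x_cur = x_cur, x_next
    rho_prev, rho_cur = rho_cur, rho_next
print("Chebyshev error: ", np.linalg.norm(x_cur - x_star) / np.linalg.norm(x_star))
```

After the same number of iterations the Chebyshev iterate is accurate to several digits, while plain Richardson has only reduced the error by a small constant factor, which is exactly the $\kappa$ versus $\sqrt{\kappa}$ behavior described above.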

Other Applications

- Approximating other matrix functions such as $\exp(M)$, random walks, etc.
- Streaming entropy approximation [Harvey, Nelson, Onak FOCS '08].
- Quadrature rules.
- Lower bounds in communication complexity and learning theory via analytic methods (Alexander Sherstov's PhD thesis, 2009).
- I even found a paper in cryptography using them to replace monomials in the Diffie-Hellman and RSA algorithms [Fee and Monagan, Cryptography using Chebyshev polynomials, 2004].

2 Optimality of Chebyshev Polynomials

"There's only one bullet in the gun. It's called the Chebyshev polynomial." (Rocco Servedio, via Moritz Hardt's Zen of Gradient Descent blog post)

It turns out that the optimal jump polynomials are given by the Chebyshev polynomials (of the first kind). These are a family of polynomials $T_d(\cdot)$, one for each degree $d \ge 0$. The extremal properties of Chebyshev polynomials were first proven by the Markov brothers, both of whom were Chebyshev's students. Specifically, they showed that the Chebyshev polynomials achieve the maximum possible derivative (and second derivative, third derivative, etc.) on $[-1,1]$ relative to their maximum value on $[-1,1]$. One brother died of tuberculosis soon after, at the age of 25. The other is the Markov we all know about. We're going to show something a bit different that is more directly related to the jump property that we care about.

Theorem 1 (Optimality of Chebyshev Polynomials). For any polynomial $g(\cdot)$ of degree $d$ with $-1 \le g(x) \le 1$ for $x \in [-1,1]$, letting $T_d(x)$ be the degree $d$ Chebyshev polynomial, we have for any $x \notin [-1,1]$:

$$|g(x)| \le |T_d(x)|.$$

In words, this theorem says that $T_d$ grows faster outside of $[-1,1]$ than any other polynomial that is bounded in the range $[-1,1]$.

Definition 2 (Recursive Definition of Chebyshev Polynomials). The Chebyshev polynomials can be defined by the recurrence:

$$T_0(x) = 1, \quad T_1(x) = x, \quad T_d(x) = 2x\,T_{d-1}(x) - T_{d-2}(x).$$

This recurrence gives us the following lemma.

Lemma 3 (Cosine Interpretation of Chebyshev Polynomials). $T_d(\cos\theta) = \cos(d\theta)$.

Proof. How? By induction. How else? $T_0(\cos\theta) = 1 = \cos(0 \cdot \theta)$ and $T_1(\cos\theta) = \cos\theta = \cos(1 \cdot \theta)$. And by induction:

$$T_d(\cos\theta) = 2\cos\theta\, T_{d-1}(\cos\theta) - T_{d-2}(\cos\theta) = 2\cos\theta\cos((d-1)\theta) - \cos((d-2)\theta) = \cos(d\theta).$$

The last equality follows from doing a Google image search for "cosine addition formula" and finding:

$$\cos a \cos b = \frac{\cos(a+b) + \cos(a-b)}{2}.$$

Plugging in $a = (d-1)\theta$ and $b = \theta$ gives:

$$\cos(a+b) = 2\cos a\cos b - \cos(a-b), \qquad \cos(d\theta) = 2\cos\theta\cos((d-1)\theta) - \cos((d-2)\theta).$$

Or you can just apply the normal cosine addition formula to $\cos((d-1)\theta + \theta)$ and $\cos((d-1)\theta - \theta)$.

Well, why does Lemma 3 matter? First, it immediately gives:

Corollary 4 (Boundedness Property of Chebyshev Polynomials). For any $d$ and $x \in [-1,1]$: $-1 \le T_d(x) \le 1$.

Proof. Let $\theta = \cos^{-1}(x)$. By Lemma 3, $T_d(x) = T_d(\cos\theta) = \cos(d\theta) \in [-1,1]$, since cosine is always in $[-1,1]$.
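Lemma 3 and Corollary 4 are also easy to check numerically; here is a two-assert sketch (assuming numpy) that builds $T_d$ from the recurrence in Definition 2 and compares it against $\cos(d\theta)$.

```python
# Check Lemma 3 and Corollary 4: the recurrence T_d = 2x T_{d-1} - T_{d-2}
# satisfies T_d(cos t) = cos(d t), and hence |T_d| <= 1 on [-1, 1].
import numpy as np

def T(d, x):
    """Evaluate T_d(x) via the three-term recurrence."""
    t_prev, t_cur = np.ones_like(x), x
    if d == 0:
        return t_prev
    for _ in range(d - 1):
        t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
    return t_cur

t = np.linspace(0.0, np.pi, 1000)
for d in (1, 2, 5, 20):
    assert np.allclose(T(d, np.cos(t)), np.cos(d * t))       # cosine interpretation
    assert np.max(np.abs(T(d, np.cos(t)))) <= 1 + 1e-12      # boundedness on [-1, 1]
print("recurrence matches cos(d * theta), and |T_d| <= 1 on [-1, 1]")
```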

Second, it gives:

Corollary 5 (Alternation Property of Chebyshev Polynomials). $T_d(x)$ alternates between $1$ and $-1$ $d$ times on $[-1,1]$.

Proof. For $x \in \{\cos(0), \cos(2\pi/d), \cos(4\pi/d), \dots\}$ we have $T_d(x) = \cos(2i\pi) = 1$. For $x \in \{\cos(\pi/d), \cos(3\pi/d), \cos(5\pi/d), \dots\}$ we have $T_d(x) = \cos((2i+1)\pi) = -1$. There are $d+1$ such values in total (namely $\cos(j\pi/d)$ for $j = 0, \dots, d$), so the function alternates between $1$ and $-1$ $d$ times.

This alternation property is critical: it makes proving the optimality theorem super easy.

Proof of Theorem 1. Assume for contradiction that there exists some degree $d$ polynomial $g(\cdot)$ with $-1 \le g(x) \le 1$ for $x \in [-1,1]$ and some $y \notin [-1,1]$ with $|g(y)| > |T_d(y)|$. Let

$$h(x) = \frac{T_d(y)}{g(y)}\, g(x).$$

Then $T_d(x) - h(x)$ is a degree $d$ polynomial with a zero at $y$. However, it also has $d$ zeros in $[-1,1]$, since every time $T_d$ alternates between $1$ and $-1$ it crosses $h(x)$, which is bounded strictly between $-1$ and $1$ on $[-1,1]$ because $\left|\frac{T_d(y)}{g(y)}\right| < 1$. Hence, $T_d(x) - h(x)$ has $d+1$ zeros but only has degree $d$, a contradiction. That's it.

3 Understanding the Optimal Growth Rate

It is not very hard to compute explicitly that the Chebyshev polynomials satisfy $T_d(1+\gamma) \ge 2^{d\sqrt{\gamma} - 1}$. Thus if we set $d = O\!\left(\frac{\log(1/\epsilon)}{\sqrt{\gamma}}\right)$ and scale appropriately, we have a polynomial such that $p(1+\gamma) = 1$ and $|p(x)| \le \epsilon$ for $x \in [-1,1]$. We can also prove the bound inductively, starting with $T_1(\cdot)$ and using the recurrence relation. However, this doesn't give us any intuition for why we can achieve this type of growth with a $1/\sqrt{\gamma}$ instead of a $1/\gamma$ dependence. In their monograph, Sachdeva and Vishnoi discuss a proof based on the fact that a random walk of length $d$ will deviate from the origin by at most $O(\sqrt{d})$ with good probability. This sort of gives a better intuition. There are also some intuitive explanations of how accelerated gradient descent methods (which achieve similar bounds to Chebyshev polynomials) work. However, to me, a large part of the power of Chebyshev polynomials can still be attributed to magic.
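Finally, here is a tiny script (a sketch assuming numpy; the parameter values are just illustrative) that evaluates $T_d(1+\gamma)$ via the recurrence and compares it to the claimed lower bound $2^{d\sqrt{\gamma}-1}$, just to see how quickly the growth kicks in.

```python
# Evaluate T_d(1 + gamma) via the recurrence and compare it against the
# claimed lower bound 2^(d sqrt(gamma) - 1).
import numpy as np

def T(d, x):
    t_prev, t_cur = 1.0, x
    for _ in range(d - 1):
        t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
    return t_cur if d >= 1 else 1.0

for gamma in (0.1, 0.01, 0.001):
    for d in (10, 50, 200):
        lhs = T(d, 1.0 + gamma)
        rhs = 2.0 ** (d * np.sqrt(gamma) - 1)
        print(f"gamma = {gamma:<6}  d = {d:<4}  T_d(1+gamma) = {lhs:10.3e}  bound = {rhs:10.3e}")
```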