2015 International Conference on Computational Science and Computational Intelligence

Bloom Features

Ashok Cutkosky
Computer Science Department
Stanford University
Stanford, CA, USA

Kwabena Boahen
Bioengineering Department
Stanford University
Stanford, CA, USA

Abstract—We introduce a method for function-fitting that achieves high accuracy with a low memory footprint. For d-dimensional data and any user-specified m, we define a feature map from d- to m-dimensional Euclidean space with memory footprint O(m) that scales as follows: As m increases, the space of linear functions on our m-dimensional features approximates any MAX (or boolean OR) function on the d-dimensional inputs with expected error inversely proportional to m. Our method is the only one in existence with this scaling that can simultaneously run in O(m) time, process real-valued inputs, and approximate non-linear functions, properties respectively not achieved by random Fourier features, b-bit Minwise Hashing, and Vowpal Wabbit, three competing methods. We achieve all three properties by using hashing (O(m) space) to implement a sparse-matrix multiply (O(m) time) with addition replaced by MAX (non-linear approximation). As these techniques are inspired by the Bloom filter, we call the vectors produced by our mapping Bloom features. We demonstrate that the scaling pre-factors are reasonable by testing our method on simulated (Dirichlet distributions) and real (MNIST and webspam) datasets.

Keywords—large-scale learning, learning theory

I. LEARNING WITH MEMORY CONSTRAINTS

The use of randomness to save some resource such as memory (or computation time) at the potential expense of accuracy is an established strategy throughout computer science. A primary application of these low-memory-footprint methods is in large-scale settings in which one must process an extremely large amount of possibly high-dimensional data. In this situation the original dataset may not fit into memory, and so one turns to methods that reduce the memory footprint of the dataset while still allowing the user to run some type of analysis on the data.

In machine learning, these memory savings are achieved by methods that use random projections [1] to approximate some function f : R^d → R using a small number of parameters, but state-of-the-art methods suffer from particular drawbacks. Random Fourier features [2] have poor asymptotic time complexity; Vowpal Wabbit (VW) [3] can only approximate linear functions; and b-bit Minwise hashing [4] can only operate on binary inputs.

We present an algorithm that achieves a low memory footprint while avoiding these drawbacks. Similar to the previous methods, for any m we produce a mapping φ : R^d → R^m such that, as m increases, linear functions on R^m approximate non-linear functions on R^d with error inversely proportional to m. Our memory footprint increases linearly with m, and so m represents a trade-off between memory and accuracy in function fitting. We call the vectors in R^m produced by our mapping Bloom features (Definition A.1), as our method was inspired by the Bloom filter data structure [5].

To compute Bloom features, we choose k hash functions to assign each component of a vector x in R^d to k components of φ(x) in R^m. We compute the i-th component of φ(x) by taking the MAX of all the components of x that are mapped to this i-th component (Figure 1). This is analogous to multiplication by a sparse m × d matrix (specified by the k hash functions) with addition replaced by MAX. Using hash functions and sparsity gives us a low memory footprint and small time complexity, while the substitution of MAX for addition allows us to compute non-linear functions on real-valued inputs.
Fig. 1. Each coordinate of x is mapped to k random coordinates in φ(x) by hash functions. The final value of each coordinate of φ(x) is given by taking the maximum of the values hashed to it. In this case k = 2.
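To make the mapping concrete, here is a minimal Python sketch of the construction just described. This is our own illustration: the function name, the seeded pseudo-random assignment standing in for the k hash functions, and all parameter values are ours, not the paper's.

```python
import numpy as np

def bloom_features(x, m, k, seed=0):
    """Map a non-negative vector x in R^d to phi(x) in R^m.

    Each input coordinate j is assigned to k pseudo-random output
    coordinates (a stand-in for the k hash functions); each output
    coordinate takes the MAX of the inputs hashed to it (0 if none are).
    """
    rng = np.random.RandomState(seed)            # fixed seed = reusable "hashes"
    assignments = rng.randint(0, m, size=(len(x), k))
    phi = np.zeros(m)
    for j in np.flatnonzero(x):                  # only non-zero inputs matter
        for i in assignments[j]:
            phi[i] = max(phi[i], x[j])           # addition replaced by MAX
    return phi

x = np.zeros(100)
x[[3, 17, 42]] = [0.5, 1.0, 0.25]
print(bloom_features(x, m=20, k=2))
```

Because φ(x) starts at zero and only takes maxima, the sketch (like the paper's map) assumes non-negative inputs.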

In Section II, we introduce Bloom features and describe their properties. In Section III, we empirically validate Bloom features on the freely available MNIST [6] and webspam [7] datasets. We also provide results on simulated data. We state formal results and proofs in an appendix.

II. BLOOM FEATURES

Approximating functions with Bloom features offers a good tradeoff between space, time and accuracy. Given input/output pairs (x_i, y_i) ∈ R^d × R, this problem may be formulated as follows: find f_w : R^d → R of the form f_w(x) = ⟨φ(x), w⟩, where ⟨a, b⟩ indicates inner product, such that f_w(x_i) ≈ y_i, where φ : R^d → R^m is a fixed map and w ∈ R^m is chosen to achieve the best approximation. The degree of approximation is measured by some loss function L. For example, L(f_w, x, y) = (f_w(x) − y)² is a common choice. Linear regression can be formulated in this manner with φ(x) = x, and quadratic-kernel support vector regression corresponds to φ(x) equal to a vector of all 2nd-order interactions in x.

Formulating function approximation as a linear map on a set of m-dimensional features φ(x), provided by a fixed mapping φ, introduces a tradeoff between resources and accuracy. As m increases, φ maps into higher-dimensional spaces, resulting in more accurate function-fitting through inner products with φ(x), but requiring more time and space to be computed. A desirable φ will obtain a good tradeoff between space, time and accuracy.

The Bloom feature map achieves a good tradeoff by using hash functions to specify a sparse matrix and replacing the addition in matrix multiplication by MAX. This map φ operates on d-dimensional non-negative inputs, that is, elements of R^d_{≥0}. It can approximate any linear combination of binary OR functions or real-valued MAX functions with error that decreases inversely proportional to m (Theorems A.4, A.8) and can be computed in O(m) space and O(m) time (Proposition A.2). While its space scaling is the same as three state-of-the-art methods, its time scaling is n to n/2^b times better (Table I).

TABLE I
COMPARISON BETWEEN METHODS

Feature          Time           Non-Binary   Non-linear
Bloom            O(m)           Yes          Yes
Fourier          O(nm)          Yes          Yes
b-bit Minwise    O(nm/2^b)      No           Yes
VW               O(nm)          Yes          No

III. EMPIRICAL RESULTS

To verify that Bloom features' resource scaling factors are reasonable and to confirm that their theoretical performance generalizes to the real world, we tested them on simulated data (binary and real-valued) and real data (MNIST and webspam).

A. Simulated Data

We first tested Bloom features on a simulated classification task with binary vectors. The binary vectors were drawn from {0,1}^100 with 10 randomly selected non-zero entries. The two classes were defined by an OR function computed on either 3, 5 or 10 randomly chosen bits. We chose a subset of the data that had an equal fraction belonging to each class. We used Bloom features with m = 100, 1000, 10,000 and k = 10 log(2), the optimal value for a Bloom filter. We found that Bloom features achieve low MSE using values of m significantly smaller than \binom{100}{F}, the number of possible OR functions of fan-in F (Figure 2).

Fig. 2. Mean squared error vs. m for a boolean OR of fan-in 3, 5, and 10, with d = 100. Horizontal lines indicate best linear regression error for comparison. Error bars represent one standard deviation.

Next, we tested Bloom features on a simulated real-valued classification task consisting of two classes defined by two distinct 100-dimensional Dirichlet distributions with parameters α and (3α + β)/4 respectively, where α and β were drawn uniformly at random from (0,1)^100. This particular choice results in a Bayes risk of 0.3% (the error of the optimal classifier).
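The binary simulation above is easy to reproduce in outline. The sketch below is our own reconstruction under simplifying assumptions (ridge regression as the linear fitter, a fixed k = 7 ≈ 10 log 2, and made-up sample counts): it generates sparse binary vectors, labels them with a random OR clause, and fits a linear model on Bloom features.

```python
import numpy as np
from sklearn.linear_model import Ridge

def bloom_features(X, m, k, seed=0):
    """Row-wise Bloom features for a data matrix X with non-negative entries."""
    rng = np.random.RandomState(seed)
    assignments = rng.randint(0, m, size=(X.shape[1], k))
    phi = np.zeros((X.shape[0], m))
    for j in range(X.shape[1]):
        for i in assignments[j]:
            phi[:, i] = np.maximum(phi[:, i], X[:, j])  # MAX in place of addition
    return phi

rng = np.random.RandomState(1)
d, n_samples, fan_in = 100, 5000, 3
X = np.zeros((n_samples, d))
for row in X:                                   # 10 random non-zero bits per vector
    row[rng.choice(d, 10, replace=False)] = 1.0
clause = rng.choice(d, fan_in, replace=False)
y = X[:, clause].max(axis=1)                    # boolean OR of the chosen bits

phi = bloom_features(X, m=1000, k=7)
model = Ridge(alpha=1.0).fit(phi[:4000], y[:4000])
print("held-out MSE:", np.mean((model.predict(phi[4000:]) - y[4000:]) ** 2))
```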
We compared Bloom features to random Fourier features with m ranging from 50 to 10,000 for both. k was set to 10 log(2) for Bloom features; b-bit Minwise hashing was not included because this method only applies to binary inputs.
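For context, the random Fourier features baseline [2] approximates a shift-invariant kernel by cosines at random frequencies. A minimal sketch, with the bandwidth and sizes being our own choices rather than the paper's:

```python
import numpy as np

def fourier_features(X, m, sigma=1.0, seed=0):
    """Random Fourier features for the RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 * sigma**2))  (Rahimi & Recht [2])."""
    rng = np.random.RandomState(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(X.shape[1], m))  # spectral samples
    b = rng.uniform(0.0, 2 * np.pi, size=m)                 # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

X = np.random.RandomState(2).dirichlet(np.ones(100), size=5)
print(fourier_features(X, m=8).shape)   # (5, 8)
```

Note that each output feature requires a dense dot product with the input, which is the source of the O(nm) time entry for Fourier features in Table I.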

We found that Fourier features were more accurate for small m whereas Bloom features were more accurate for large m (Figure 3).

Fig. 3. Bloom features vs. random Fourier features on the simulated dataset. Error bars represent one standard deviation.

We then proceeded to test Bloom features on more realistic data by classifying the MNIST [6] and webspam [7] datasets. For all tasks, we applied one-vs-all linear classification on a Bloom feature representation of the data.

B. MNIST

The MNIST dataset contains grayscale images of handwritten digits 0 through 9, split into a training set and a test set of size 60,000 and 10,000 respectively. The task is to determine which digit a given image represents. We consider the permutation-invariant form of the task, in which one cannot take advantage of any previously known structure in the data (such as the fact that the inputs are 2D images). We chose k via cross-validation on m = 400, which gave k = 4. We therefore used k = m/100 (rounded to the nearest integer) for all other values of m.

We compared Bloom features' performance on MNIST with that of random Fourier features and b-bit Minwise hashing, which have the same memory footprint. We did not include VW because it can only classify linearly separable data, and it is well known that MNIST is not linearly separable. In the case of Minwise hashing, we converted the images to binary vectors by thresholding values to 1, an easy and natural process for MNIST. For other datasets (e.g. our simulation in Figure 3), this may not be the case.

We found that the performance gap between Bloom and Fourier features narrowed as m increased (Table II), as expected from our results with simulated data (Figure 3). And both out-performed Minwise hashing, demonstrating the advantage of using real-valued vectors.

We also compared Bloom features' performance on MNIST with several state-of-the-art algorithms that have a much larger memory footprint as measured by parameter count (taken from Table 1 of [8]). One of these algorithms (Maxout MLP, described in [8]) is similar in spirit to Bloom features: one trains a multi-layer perceptron whose activation function takes the maximum of its inputs. We found that Bloom features use 95% fewer parameters than any of these algorithms while achieving an error rate within a factor of two of the lowest result (1.6% versus 0.79%).

TABLE II
MNIST PERFORMANCE (AVERAGE ERROR)

m        Bloom    Fourier    b-bit Minwise
—        —        5.3%       11.1%
—        —        3.1%       7.4%
—        1.6%     1.6%       2.8%

TABLE III
MNIST PERFORMANCE (OTHER METHODS)

Method                               Error    Parameter Count
ReLU MLP [9]                         1.05%    —
Maxout MLP [8]                       0.94%    —
Manifold Tangent Classifier [10]     0.81%    —
DBM (with dropout) [11]              0.79%    —
Bloom Features                       1.6%     —
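Both real-data experiments (MNIST above and webspam below) reduce to the same pipeline: featurize, then train a one-vs-all linear classifier. A sketch of that pipeline follows; scikit-learn's OneVsRestClassifier with logistic regression is our stand-in for the unspecified linear classifier, and the data here are random placeholders with MNIST's shape.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def bloom_features(X, m, k, seed=0):
    rng = np.random.RandomState(seed)
    assignments = rng.randint(0, m, size=(X.shape[1], k))
    phi = np.zeros((X.shape[0], m))
    for j in range(X.shape[1]):
        for i in assignments[j]:
            phi[:, i] = np.maximum(phi[:, i], X[:, j])
    return phi

# Placeholder data with MNIST's dimensions; substitute the real dataset
# (e.g. sklearn.datasets.fetch_openml("mnist_784")) to run the actual task.
rng = np.random.RandomState(3)
X = rng.rand(1000, 784)
y = rng.randint(0, 10, size=1000)

phi = bloom_features(X, m=400, k=4)             # k = 4 at m = 400, as in the text
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(phi[:800], y[:800])
print("held-out accuracy:", clf.score(phi[800:], y[800:]))
```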

C. webspam

The webspam [7] dataset consists of sparse vectors of trigram counts representing either spam or non-spam documents. The data have a dimensionality of 16,609,143, of which only a small fraction of components are ever nonzero. webspam is nearly linearly separable, with linear classification accuracies in excess of 99%. Thus, we consider classifying webspam a demonstration that Bloom features are still able to capture linear relationships. To classify webspam we selected 80% of the data as a training set and 20% as a testing set.

We compared Bloom features' performance on webspam with that of random Fourier features, b-bit Minwise hashing, and VW for m = 100, 1000, and 10,000. All three choices of m use much less memory than is needed to represent the original features (at most m = 10,000 versus a dimensionality of over 16 million). webspam's sparse binary features and near linear separability are exactly the conditions required by b-bit Minwise hashing and VW, respectively; hence they performed the best (Table IV). Nevertheless, across the range of m values tested, Bloom features' error was no worse than a few times that of these methods.

TABLE IV
WEBSPAM PERFORMANCE (AVERAGE ERROR)

m         Bloom    Fourier    b-bit Minwise    VW
100       —        26.4%      10.8%            8.1%
1000      —        14.7%      2.1%             1.9%
10,000    —        7.9%       0.8%             1.1%

IV. CONCLUSIONS

We have introduced Bloom features and analyzed their performance. Bloom feature representations are memory-efficient (O(m)) and can be computed very quickly (O(m) time). As their dimensionality m increases, they approximate boolean functions and non-linear real-valued functions with error decreasing inversely proportionally to m. We prove these results in the Appendix and show that Bloom features operate by approximating a high-dimensional RKHS consisting of all interactions among inputs.

We demonstrated the practical viability of Bloom features using simulated data and the MNIST and webspam datasets. On the simulated and MNIST datasets, Bloom features represented non-linear functions on real-valued inputs accurately and efficiently. On the webspam dataset, Bloom features were competitive with more restrictive methods such as b-bit Minwise hashing and VW when representing linear functions on binary inputs. Thus Bloom features not only possess theoretical advantages over other methods in terms of handling real-valued inputs, computation time, or non-linear approximation; they also compare favorably on practical tasks in terms of test error for a given memory footprint.

REFERENCES

[1] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[2] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.
[3] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In International Conference on Machine Learning (ICML), 2009.
[4] Ping Li and Christian König. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
[5] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[6] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[7] Steve Webb, James Caverlee, and Calton Pu. Introducing the webb spam corpus: Using email spam to identify web spam automatically. In CEAS, 2006.
[8] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. Journal of Machine Learning Research (JMLR), 28(3):1319–1327, 2013.
[9] Nitish Srivastava. Improving neural networks with dropout. PhD thesis, University of Toronto, 2013.
[10] Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pages 2294–2302, 2011.
[11] Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[12] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

APPENDIX

Definition A.1 (Bloom Features). Suppose h_1, ..., h_k are functions mapping N to {0, ..., m−1} drawn at random from a pairwise-independent family of hash functions. For x ∈ R^d_{≥0}, define φ_h(x) ∈ R^m_{≥0} by

φ_h(x)_i = max_{j,l : h_l(j) = i} x_j

where the subscript h in φ_h is intended to emphasize that φ_h(x) depends on the choice of hash functions.
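A tiny worked instance of Definition A.1 may help in parsing the max-over-preimage notation; the numbers and the hand-fixed hash tables below are ours. With d = 4, m = 3, k = 2, every output component φ_h(x)_i is the max of the x_j whose index j some h_l sends to i:

```python
import numpy as np

# h_1 and h_2 as explicit lookup tables: h[j] is the bucket of coordinate j.
h1 = {0: 0, 1: 2, 2: 1, 3: 0}
h2 = {0: 1, 1: 0, 2: 2, 3: 2}

x = np.array([0.2, 0.9, 0.0, 0.5])    # x in R^4_{>=0}
m = 3
phi = np.zeros(m)
for h in (h1, h2):
    for j, i in h.items():
        phi[i] = max(phi[i], x[j])    # phi_i = max{ x_j : h_l(j) = i }

print(phi)  # bucket 0 <- {x_0, x_3, x_1}, 1 <- {x_2, x_0}, 2 <- {x_1, x_2, x_3}
            # = [0.9, 0.2, 0.9]
```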
Proposition A.2 (Computation). Suppose x ∈ R^d has n non-zero components, and let φ_h(x) be a Bloom feature map using k hash functions. Then φ_h(x) can be computed in O(kn) time when vectors are represented in sparse format (only non-zero quantities stored), and in O(d + kn + m) time when vectors are represented in dense format.

Proposition A.3 (Memory). Standard Bloom filter analysis [12] suggests a value for k: if our inputs x have n non-zero components, then we should set k = log(2) m/n. Using this value of k and Proposition A.2, we get a complexity of O(kn) = O(m).

Theorem A.4 (Binary Input Approximation). Suppose B(x) is a boolean function of fan-in F and φ_h : R^d → R^m is a Bloom feature map with k hash functions (see Definition A.1). Let n be the number of set bits of some vector x ∈ {0,1}^d, and let p = 1 − (1 − 1/m)^k be the probability that a bit of x is hashed to a given component of φ_h(x). Then for sufficiently large m there exists w_h ∈ R^m such that

E_h[φ_h(x)·w_h] = B(x)
E_h[(φ_h(x)·w_h − B(x))²] = O( 2^F / (m p^F (1 − 1/m)^{kn}) )
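Proposition A.2's O(kn) sparse-format bound corresponds to iterating only over the stored non-zeros. A sketch with a dict-based sparse representation; the salted built-in hash is our stand-in for a pairwise-independent family:

```python
def bloom_features_sparse(x_nonzeros, m, k):
    """x_nonzeros: dict {index: value} holding the n non-zero entries of x.
    Runs in O(k * n) time, independent of the ambient dimension d."""
    phi = {}                              # sparse output: at most k*n entries
    for j, v in x_nonzeros.items():
        for l in range(k):
            i = hash((l, j)) % m          # stand-in for h_l(j)
            if phi.get(i, 0.0) < v:
                phi[i] = v
    return phi

x = {3: 0.5, 17: 1.0, 42: 0.25}           # n = 3 non-zeros; d can be huge
print(bloom_features_sparse(x, m=20, k=7))
```

With Proposition A.3's choice k = log(2) m/n, this O(kn) cost is O(m), matching the map's memory footprint.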

Proof: Our proof has two steps. First we observe that we can write B(x) as a linear combination of ORs involving no negations: B(x) = Σ_{i=1}^{s} a_i (x_{i_1} ∨ ⋯ ∨ x_{i_z}) with a_i ∈ R, z ≤ F and s ≤ 2^F. Then we show that x_{i_1} ∨ ⋯ ∨ x_{i_z} ≈ φ_h(x)·w_{h,i} for an appropriate w_{h,i} (see Lemma A.5). Substituting this approximation into the expression for B(x) completes the proof with w_h = Σ_i a_i w_{h,i}.

To prove the first step, we consider OR functions as vectors in R^{2^F} (defined by their truth tables) and show that they are linearly independent in this space. To see this, first set y_j = 1 − x_j. Then by De Morgan's rules we have x_{i_1} ∨ ⋯ ∨ x_{i_z} = 1 − y_{i_1} ⋯ y_{i_z}. Thus there is a 1-1 linear map from OR functions to monomials y_{i_1} ⋯ y_{i_z}. Since the set of distinct monomials is linearly independent, so are the OR functions.

Lemma A.5. Suppose C_1(x) and C_2(x) are OR functions of fan-in F_1 and F_2. Further, define F_∩ as the number of inputs shared by C_1 and C_2, and let F̄ = (F_1 + F_2) − F_∩. Suppose φ_h(x) is a Bloom feature map with k hash functions. Then for sufficiently large m there exist w_h^1 and w_h^2 such that:

E_h[φ_h(x)·w_h^i] = C_i(x)
Var(φ_h(x)·w_h^i) ≤ 4 / (m p^{F_i} (1 − p)^n)
Cov(φ_h(x)·w_h^1, φ_h(x)·w_h^2) ≤ 2 / (m p^{F̄} (1 − p)^n)

where p, m, n, x were defined previously (see Theorem A.4).

Proof: If φ_h(x)'s components were both zero-mean and independent, then we could prove this result with bounds improved by a factor of 4 (Lemma A.7). To prove this result for the non-zero-mean, non-independent φ_h, we subtract φ_h(x)'s first m/2 components from its second m/2 components to form a new vector that is half the size but is now zero-mean, while still having non-independent components. Since this vector is half the size, we lose a factor of 2 in the bounds. We complete the proof by showing that the lack of independence between components introduces another factor of 2 in the bounds.

The factor-of-2 loosening due to the lack of independence comes from distributing expectations over products. Specifically, we need to show that distributing the expectation E[φ_h(x)_i φ_h(x)_j] over the product accrues an error that goes down as m increases. To do this we show that φ_h(x)_i concentrates about its mean. Let e be the number of zero bits in φ_h(x) and q = e/m. Mitzenmacher and Upfal show [12] for Bloom filters (which applies to Bloom features in the binary case) that P(|q − E[q]| > λ/m) < 2 exp(−2λ²/m). Thus

E[q²] ≥ (1 − 2 exp(−2λ²/m)) (E[q] − λ/m)²
E[q²] ≤ (1 − 2 exp(−2λ²/m)) (E[q] + λ/m)² + 2 exp(−2λ²/m)

so that for any δ, for sufficiently large m, we must have E[q]²/(1 + δ) ≤ E[q²] ≤ (1 + δ) E[q]². Now

E[φ_h(x)_i φ_h(x)_j] = Σ_t P(q = t)(1 − t)(1 − t − 1/m) = 1 − 2E[q] + E[q²] − (1 − E[q])/m

Thus for any δ, for sufficiently large m, E[φ_h(x)_i φ_h(x)_j] is within a factor of 1 + δ of (1 − E[q])² = E[φ_h(x)_i] E[φ_h(x)_j]. Choosing δ = 1, for sufficiently large m the error from the independence assumption is bounded by a factor of 2.

Definition A.6 (Independent Features). Let p ∈ (0, 1). Suppose z⁺_{i,j}, z⁻_{i,j} are independent Bernoulli random variables with the same mean p for i ∈ {0, ..., m−1}, j ∈ {0, ..., d−1}. Then define

T⁺(x)_i = max_{j : z⁺_{i,j} = 1} x_j
T⁻(x)_i = max_{j : z⁻_{i,j} = 1} x_j
φ_z(x)_i = T⁺(x)_i − T⁻(x)_i

where here the subscript z indicates that φ_z depends on the values of the variables z⁺_{i,j} and z⁻_{i,j}.

Lemma A.7. Let C_1, C_2, F_1, F_2, F_∩, F̄, x and n be as defined previously (Theorem A.4). Let φ_z(x) be an

independent feature with parameter p (see Definition A.6). Then there exist w_z^1, w_z^2 ∈ R^m such that:

E_z[φ_z(x)·w_z^i] = C_i(x)
Var(φ_z(x)·w_z^i) ≤ 1 / (m p^{F_i} (1 − p)^n)
Cov(φ_z(x)·w_z^1, φ_z(x)·w_z^2) ≤ 1 / (2m p^{F̄} (1 − p)^n)

Proof: We will prove the bias and variance results for C = C_1; the results for C_2 are symmetric. Let C = x_{c_1} ∨ ⋯ ∨ x_{c_F}. We start by defining Q_n = 1 − (1 − p)^n, Z⁺_j = Π_{i=1}^{F} z⁺_{c_i, j}, and Z⁻_j = Π_{i=1}^{F} z⁻_{c_i, j}. The following facts will be useful:

E[Z⁺_j] = E[Z⁻_j] = p^F    (1)
E[Z⁺_j T⁺(x)_j] = p^F C(x) + p^F (1 − C(x)) Q_n    (2)
E[Z⁺_j T⁻(x)_j] = p^F Q_n    (3)

With (r_z)_j = Z⁺_j − Z⁻_j, some algebra gives:

E_z[φ_z(x)·r_z] = m E_z[φ_z(x)_1 (r_z)_1]
= 2m E_z[T⁺_1 Z⁺_1] − 2m E_z[T⁻_1 Z⁺_1]
= 2m p^F (1 − Q_n) C(x)

For the variance, we compute

Var(φ_z(x)·r_z) = m Var(φ_z(x)_1 (r_z)_1) ≤ m E[(φ_z(x)_1 (r_z)_1)²] = m E[(Z⁺ − Z⁻)² (T⁺_1 − T⁻_1)²]

from which we obtain (from equations 1, 2, 3):

Var(φ_z(x)·r_z) ≤ 4m p^F (1 − Q_n)

Now if we set (w_z)_j = (1/2) p^{−F} (1 − Q_n)^{−1} (1/m) (r_z)_j, then we recover the expectation and variance statements. The covariance statement is computed similarly.

Theorem A.8 (Approximating a Space of MAX Functions). Suppose φ_h(x) is a Bloom feature constructed with k hash functions, and suppose K(x, y) is a scaled inner product of MAX functions as defined below. Then

E_h[(1/m) φ_h(x)·φ_h(y)] = K(x, y)

and

P( sup_{x,y ∈ D_n} |(1/m) φ_h(x)·φ_h(y) − K(x, y)| > ε ) < δ

as long as

m > (2/ε²) ( log(1/δ) + 2 log(2^n \binom{d}{n}) )

where D_n ⊆ [0,1]^d is the set of vectors with at most n non-zero components. K(x, y) is given by the following construction. For x ∈ [0,1]^d, let MAX_q(x) ∈ R^{\binom{d}{q}} be the vector obtained by applying all q-ary MAX functions to the entries of x. Define K(x, y) = ψ(x)·ψ(y) with

ψ(x) = ⊕_{q=1}^{d} sqrt((1 − p)^{d−q} p^q) MAX_q(x)

where p = 1 − (1 − 1/m)^k. Thus K(x, y) is the kernel of the RKHS H defined by

H = ⊕_{q=1}^{d} R^{\binom{d}{q}} = R^{2^d}

with v(x) = v·ψ(x) for v ∈ H.

Proof: First we compute the expectation:

E_h[(1/m) φ_h(x)·φ_h(y)] = E_h[φ_h(x)_1 φ_h(y)_1]

Let A_q be the event that there are exactly q values of t such that h_l(t) = 1 for some l. Then P(A_q) = (1 − p)^{d−q} p^q \binom{d}{q}. Further,

E[φ_h(x)_1 φ_h(y)_1 | A_q] = \binom{d}{q}^{−1} MAX_q(x)·MAX_q(y)

so that

E_h[(1/m) φ_h(x)·φ_h(y)] = Σ_q P(A_q) \binom{d}{q}^{−1} MAX_q(x)·MAX_q(y) = K(x, y)

For the concentration inequality, we apply the Azuma-Hoeffding inequality to the Doob martingale given by

B_i = (1/m) E[φ_h(x)·φ_h(y) | φ_h(x)_1, ..., φ_h(y)_i]

Since each component of φ_h(x) is in [0,1], |B_{i+1} − B_i| ≤ 1/m, and so

P( |(1/m) φ_h(x)·φ_h(y) − K(x, y)| > ε ) ≤ 2 exp(−ε² m / 2)

The maximum error occurs at corners of D_n, so by applying a union bound over all (2^n \binom{d}{n})² pairs x, y we bound the probability of error more than ε by

2 (2^n \binom{d}{n})² exp(−ε² m / 2) = δ

Solving for m proves the theorem.
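As a numerical sanity check of Lemma A.7 (our own experiment, not from the paper), one can sample the independent features of Definition A.6 and confirm that φ_z(x)·w_z is an unbiased estimate of an OR clause, with variance within the stated bound:

```python
import numpy as np

rng = np.random.RandomState(0)
d, n, F, p, m, trials = 30, 6, 3, 0.1, 2000, 400

x = np.zeros(d)
x[rng.choice(d, n, replace=False)] = 1.0       # binary x with n set bits
clause = np.flatnonzero(x)[:F]                 # OR over F set bits, so C(x) = 1
Qn = 1.0 - (1.0 - p) ** n

estimates = []
for _ in range(trials):
    zp = rng.rand(m, d) < p                    # z+_{i,j} ~ Bernoulli(p)
    zm = rng.rand(m, d) < p                    # z-_{i,j}
    Tp = np.where(zp, x, 0.0).max(axis=1)      # T+(x)_i: max over selected x_j
    Tm = np.where(zm, x, 0.0).max(axis=1)
    Zp = zp[:, clause].all(axis=1)             # Z+_i: product of z+ over the clause
    Zm = zm[:, clause].all(axis=1)
    r = Zp.astype(float) - Zm.astype(float)    # (r_z)_i = Z+_i - Z-_i
    w = r / (2.0 * p**F * (1.0 - Qn) * m)      # (w_z)_i from the proof above
    estimates.append(np.dot(Tp - Tm, w))       # phi_z(x) . w_z

print("mean estimate:", np.mean(estimates), "(target C(x) = 1.0)")
print("variance:", np.var(estimates),
      "<= bound:", 1.0 / (m * p**F * (1.0 - Qn)))
```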
