Optimal Estimation and Completion of Matrices with Biclustering Structures


Journal of Machine Learning Research 17 (2016) 1-29. Submitted 12/15; Revised 5/16; Published 9/16.
(c) 2016 Chao Gao, Yu Lu, Zongming Ma and Harrison H. Zhou.

Optimal Estimation and Completion of Matrices with Biclustering Structures

Chao Gao, Yu Lu (Yale University), Zongming Ma (University of Pennsylvania), Harrison H. Zhou (Yale University)

Editor: Edo Airoldi

Abstract

Biclustering structures in data matrices were first formalized in a seminal paper by John Hartigan (Hartigan, 1972), where one seeks to cluster cases and variables simultaneously. Such structures are also prevalent in block modeling of networks. In this paper, we develop a theory for the estimation and completion of matrices with biclustering structures, where the data is a partially observed and noise-contaminated matrix with a certain underlying biclustering structure. In particular, we show that a constrained least squares estimator achieves minimax rate-optimal performance in several of the most important scenarios. To this end, we derive unified high-probability upper bounds for all sub-Gaussian data and also provide matching minimax lower bounds in both Gaussian and binary cases. Due to the close connection of graphons to stochastic block models, an immediate consequence of our general results is a minimax rate-optimal estimator for sparse graphons.

Keywords: biclustering, graphon, matrix completion, missing data, stochastic block models, sparse network

1. Introduction

In a range of important data analytic scenarios, we encounter matrices with biclustering structures. For instance, in gene expression studies, one can organize the rows of a data matrix to correspond to individual cancer patients and the columns to transcripts. The patients are then expected to form groups according to different cancer subtypes, and the genes are expected to exhibit a clustering effect according to the different pathways they belong to. Therefore, after appropriate reordering of the rows and the columns, the data matrix is expected to have a biclustering structure contaminated by noise (Lee et al., 2010). Here, the observed gene expression levels are real numbers. In a different context, such a biclustering structure can also be present in network data. For example, the stochastic block model (SBM for short) (Holland et al., 1983) is a popular model for exchangeable networks. In SBMs, the graph nodes are partitioned into $k$ disjoint communities and the probability that any pair of nodes are connected is determined entirely by the community memberships of the nodes. Consequently, if one rearranges the nodes from the same communities together in

the graph adjacency matrix, then the mean adjacency matrix, where each off-diagonal entry equals the probability of an edge connecting the nodes represented by the corresponding row and column, also has a biclustering structure.

The goal of the present paper is to develop a theory for the estimation (and completion when there are missing entries) of matrices with biclustering structures. To this end, we consider the following general model

$X_{ij} = \theta_{ij} + \epsilon_{ij},\quad i\in[n_1],\ j\in[n_2],$  (1)

where for any positive integer $m$ we let $[m] = \{1,\dots,m\}$. Here, for each $(i,j)$, $\theta_{ij} = E[X_{ij}]$ and $\epsilon_{ij}$ is an independent piece of mean-zero sub-Gaussian noise. Moreover, we allow entries to be missing completely at random (Rubin, 1976). Thus, let $E_{ij}$ be i.i.d. Bernoulli random variables with success probability $p\in(0,1]$ indicating whether the $(i,j)$th entry is observed, and define the set of observed entries

$\Omega = \{(i,j): E_{ij} = 1\}.$  (2)

Our final observations are

$X_{ij},\quad (i,j)\in\Omega.$  (3)

To model the biclustering structure, we focus on the case where there are $k_1$ row clusters and $k_2$ column clusters, and the values of $\{\theta_{ij}\}$ are constant whenever the rows and the columns belong to the same clusters. The goal is then to recover the signal matrix $\theta\in R^{n_1\times n_2}$ from the observations (3). To accommodate the most interesting cases, especially the case of undirected networks, we shall also consider the case where the data matrix $X$ is symmetric with zero diagonal. In such cases, we also require $X_{ij} = X_{ji}$ and $E_{ij} = E_{ji}$ for all $i\ne j$.

Main contributions. In this paper, we propose a unified estimation procedure for a partially observed data matrix generated from the model (1)-(3). We establish high-probability upper bounds for the mean squared errors of the resulting estimators. In addition, we show that these upper bounds are minimax rate-optimal in both the continuous case and the binary case by providing matching minimax lower bounds. Furthermore, the SBM can be viewed as a special case of the symmetric version of (1). Thus, an immediate application of our results is the network completion problem for SBMs. With partially observed network edges, our method gives a rate-optimal estimator for the probability matrix of the whole network in both the dense and the sparse regimes, which further leads to rate-optimal graphon estimation in both regimes.
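To make the observation model concrete, the following minimal NumPy sketch (our own illustration; the function and variable names are not from the paper) draws a partially observed matrix from (1)-(3) with a biclustering mean structure and Gaussian, hence sub-Gaussian, noise.

```python
import numpy as np

def simulate_biclustered_data(n1, n2, k1, k2, sigma=1.0, p=0.5, M=1.0, seed=0):
    """Draw from model (1)-(3): X_ij = theta_ij + eps_ij, observed iff E_ij = 1."""
    rng = np.random.default_rng(seed)
    # Biclustering structure: theta_ij = Q[z1[i], z2[j]] with Q in [-M, M]^{k1 x k2}.
    Q = rng.uniform(-M, M, size=(k1, k2))
    z1 = rng.integers(0, k1, size=n1)      # row cluster labels
    z2 = rng.integers(0, k2, size=n2)      # column cluster labels
    theta = Q[np.ix_(z1, z2)]
    X = theta + sigma * rng.standard_normal((n1, n2))   # sub-Gaussian noise
    E = rng.binomial(1, p, size=(n1, n2))                # missing completely at random
    return X * E, E, theta, z1, z2
```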

Connection to the literature. If only a low rank constraint is imposed on the mean matrix $\theta$, then (1)-(3) becomes what is known in the literature as the matrix completion problem (Recht et al., 2010). An impressive list of algorithms and theories has been developed for this problem, including but not limited to Candès and Recht (2009); Keshavan et al. (2009); Candès and Tao (2010); Candès and Plan (2010); Cai et al. (2010); Keshavan et al. (2010); Recht (2011); Koltchinskii et al. (2011). In this paper, we investigate an alternative biclustering structural assumption for the matrix completion problem, which was first proposed by John Hartigan (Hartigan, 1972). Note that a biclustering structure automatically implies low-rankness. However, if one applies a low rank matrix completion algorithm directly in the current setting, the resulting estimator suffers an error bound inferior to the minimax rate-optimal one. Thus, a full exploitation of the biclustering structure is necessary, which is the focus of the current paper.

The results of our paper also imply rate-optimal estimation for sparse graphons. Previous results on graphon estimation include Airoldi et al. (2013); Wolfe and Olhede (2013); Olhede and Wolfe (2014); Borgs et al. (2015); Choi (2015) and the references therein. The minimax rates for dense graphon estimation were derived by Gao et al. (2015a). While this paper was being written, we became aware of an independent result on optimal sparse graphon estimation by Klopp et al. (2015). There is also an interesting line of work on biclustering (Flynn and Perry, 2012; Rohe et al., 2012; Choi and Wolfe, 2014). While those papers aim to recover the clustering structures of rows and columns, the goal of the current paper is to estimate the underlying mean matrix with optimal rates.

Organization. After a brief introduction to notation, the rest of the paper is organized as follows. In Section 2, we introduce the precise formulation of the problem and propose a constrained least squares estimator for the mean matrix $\theta$. In Section 3, we show that the proposed estimator leads to minimax optimal performance for both Gaussian and binary data. Section 4 presents extensions of our results to sparse graphon estimation and adaptation. Implementation and simulation results are given in Section 5. In Section 6, we discuss the key points of the paper and propose some open problems for future research. The proofs of the main results are laid out in Section 7, with some auxiliary results deferred to the appendix.

Notation. For a vector $z\in[k]^n$, define the set $z^{-1}(a) = \{i\in[n]: z(i)=a\}$ for $a\in[k]$. For a set $S$, $|S|$ denotes its cardinality and $1_S$ denotes the indicator function. For a matrix $A = (A_{ij})\in R^{n_1\times n_2}$, the $\ell_2$ norm and the $\ell_\infty$ norm are defined by $\|A\| = \sqrt{\sum_{ij}A_{ij}^2}$ and $\|A\|_\infty = \max_{ij}|A_{ij}|$, respectively. The inner product of two matrices $A$ and $B$ is $\langle A,B\rangle = \sum_{ij}A_{ij}B_{ij}$. Given a subset $\Omega\subset[n_1]\times[n_2]$, we use the notation $\langle A,B\rangle_\Omega = \sum_{(i,j)\in\Omega}A_{ij}B_{ij}$ and $\|A\|_\Omega = \sqrt{\sum_{(i,j)\in\Omega}A_{ij}^2}$. Given two numbers $a,b\in R$, we use $a\vee b = \max(a,b)$ and $a\wedge b = \min(a,b)$. The floor function $\lfloor a\rfloor$ is the largest integer no greater than $a$, and the ceiling function $\lceil a\rceil$ is the smallest integer no less than $a$. For two positive sequences $\{a_n\},\{b_n\}$, $a_n\lesssim b_n$ means $a_n\le Cb_n$ for some constant $C>0$ independent of $n$, and $a_n\asymp b_n$ means $a_n\lesssim b_n$ and $b_n\lesssim a_n$. The symbols $P$ and $E$ denote generic probability and expectation operators whose distribution is determined by the context.

2. Constrained least squares estimation

Recall the generative model defined in (1) and the definition of the set $\Omega$ of observed entries in (2). As mentioned above, throughout the paper we assume that the $\epsilon_{ij}$'s are independent sub-Gaussian noises with sub-Gaussianity parameter uniformly bounded from above by $\sigma>0$. More precisely, we assume

$E e^{\lambda\epsilon_{ij}} \le e^{\lambda^2\sigma^2/2},\quad\text{for all } i\in[n_1],\ j\in[n_2]\ \text{and}\ \lambda\in R.$  (4)

We consider two types of biclustering structures. The first is rectangular and asymmetric, where we assume that the mean matrix belongs to the parameter space

$\Theta_{k_1k_2}(M) = \big\{\theta = (\theta_{ij})\in R^{n_1\times n_2}:\ \theta_{ij} = Q_{z_1(i)z_2(j)},\ z_1\in[k_1]^{n_1},\ z_2\in[k_2]^{n_2},\ Q\in[-M,M]^{k_1\times k_2}\big\}.$  (5)

In other words, the mean values within each bicluster are homogeneous, i.e., $\theta_{ij} = Q_{ab}$ if the $i$th row belongs to the $a$th row cluster and the $j$th column belongs to the $b$th column cluster. The second type of structure is the square and symmetric case. Here we impose a symmetry requirement on the data generating process, i.e., $n_1 = n_2 = n$ and

$X_{ij} = X_{ji},\quad E_{ij} = E_{ji},\quad\text{for all } i\ne j.$  (6)

Since this case is mainly motivated by undirected network data where there is no edge linking any node to itself, we also assume $X_{ii} = 0$ for all $i\in[n]$. Finally, the mean matrix is assumed to belong to the parameter space

$\Theta^s_k(M) = \big\{\theta = (\theta_{ij})\in R^{n\times n}:\ \theta_{ii}=0,\ \theta_{ij}=\theta_{ji}=Q_{z(i)z(j)}\ \text{for}\ i>j,\ z\in[k]^n,\ Q=Q^T\in[-M,M]^{k\times k}\big\}.$  (7)

We proceed by assuming that we know the parameter space $\Theta$, which can be either $\Theta_{k_1k_2}(M)$ or $\Theta^s_k(M)$, and the rate $p$ at which an independent entry is observed. Adaptation to unknown numbers of clusters and to an unknown observation rate is addressed later in Section 4.1 and Section 4.2. Given $\Theta$ and $p$, we propose to estimate $\theta$ by the program

$\min_{\theta\in\Theta}\Big\{\|\theta\|^2 - \frac{2}{p}\langle X,\theta\rangle_\Omega\Big\}.$  (8)

If we define

$Y_{ij} = X_{ij}E_{ij}/p,$  (9)

then (8) is equivalent to the constrained least squares problem

$\min_{\theta\in\Theta}\|Y-\theta\|^2,$  (10)

and hence the name of our estimator. When the data is binary, $\Theta = \Theta^s_k(1)$ and $p = 1$, the problem specializes to estimating the mean adjacency matrix in stochastic block models, and the estimator defined as the solution to (10) reduces to the least squares estimator of Gao et al. (2015a).
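As a small illustration (our own code, not the authors'), the de-biased matrix $Y$ in (9) and the least squares objective in (10) evaluated at a candidate biclustering $(Q, z_1, z_2)$ can be computed as follows; optimizing over $\Theta$ is the subject of Section 5.

```python
import numpy as np

def debiased_matrix(X_obs, E, p):
    """Y_ij = X_ij * E_ij / p as in (9); unobserved entries contribute 0."""
    return X_obs * E / p

def ls_objective(Y, Q, z1, z2):
    """Value of ||Y - theta||^2 in (10) for theta_ij = Q[z1[i], z2[j]]."""
    theta = Q[np.ix_(z1, z2)]
    return np.sum((Y - theta) ** 2)
```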

3. Main results

In this section, we provide theoretical justification of the constrained least squares estimator defined as the solution to (10). Our first result is the following universal high-probability upper bound.

Theorem 1. For any global optimizer $\hat\theta$ of (10) and any constant $C'>0$, there exists a constant $C>0$ only depending on $C'$ such that

$\|\hat\theta-\theta\|^2 \le C\,\frac{M^2\vee\sigma^2}{p}\,(k_1k_2 + n_1\log k_1 + n_2\log k_2),$

with probability at least $1-\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))$ uniformly over $\theta\in\Theta_{k_1k_2}(M)$ and all error distributions satisfying (4). For the symmetric parameter space $\Theta^s_k(M)$, the bound simplifies to

$\|\hat\theta-\theta\|^2 \le C\,\frac{M^2\vee\sigma^2}{p}\,(k^2 + n\log k),$

with probability at least $1-\exp(-C'(k^2+n\log k))$ uniformly over $\theta\in\Theta^s_k(M)$ and all error distributions satisfying (4).

When $M^2\vee\sigma^2$ is bounded, the rate in Theorem 1 is $(k_1k_2+n_1\log k_1+n_2\log k_2)/p$, which can be decomposed into two parts. The part involving $k_1k_2$ reflects the number of parameters in the biclustering structure, while the part involving $n_1\log k_1+n_2\log k_2$ results from the complexity of estimating the clustering structures of rows and columns. It is the price one needs to pay for not knowing the clustering information. In contrast, the minimax rate for matrix completion under the low rank assumption would be $(n_1\vee n_2)(k_1\wedge k_2)/p$ (Koltchinskii et al., 2011; Ma and Wu, 2015), since without any other constraint the biclustering assumption only implies that the rank of the mean matrix $\theta$ is at most $k_1\wedge k_2$. Therefore, we have $(k_1k_2+n_1\log k_1+n_2\log k_2)/p \ll (n_1\vee n_2)(k_1\wedge k_2)/p$ as long as both $n_1\wedge n_2$ and $k_1\wedge k_2$ tend to infinity. Thus, by fully exploiting the biclustering structure, we obtain a better convergence rate than by using the low rank assumption alone.

In the rest of this section, we discuss the two most representative cases, namely the Gaussian case and the symmetric Bernoulli case. The latter is also known in the literature as the stochastic block model.

The Gaussian case. Specializing Theorem 1 to Gaussian random variables, we obtain the following result.

Corollary 2. Assume $\epsilon_{ij}\sim_{iid}N(0,\sigma^2)$ and $M\le C_1\sigma$ for some constant $C_1>0$. For any constant $C'>0$, there exists some constant $C$ only depending on $C_1$ and $C'$ such that

$\|\hat\theta-\theta\|^2 \le C\,\frac{\sigma^2}{p}\,(k_1k_2 + n_1\log k_1 + n_2\log k_2),$

with probability at least $1-\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))$ uniformly over $\theta\in\Theta_{k_1k_2}(M)$. For the symmetric parameter space $\Theta^s_k(M)$, the bound simplifies to

$\|\hat\theta-\theta\|^2 \le C\,\frac{\sigma^2}{p}\,(k^2 + n\log k),$

with probability at least $1-\exp(-C'(k^2+n\log k))$ uniformly over $\theta\in\Theta^s_k(M)$.

We now present a rate-matching lower bound in the Gaussian model to show that the result of Corollary 2 is minimax optimal. To this end, we use $P_{(\theta,\sigma^2,p)}$ to denote the probability distribution of the model in which $X_{ij}\sim_{ind}N(\theta_{ij},\sigma^2)$ and each entry is observed independently with rate $p$.

Theorem 3. Assume $\frac{\sigma^2}{p}\Big(\frac{k_1k_2}{n_1n_2}+\frac{\log k_1}{n_2}+\frac{\log k_2}{n_1}\Big)\le M^2$. There exist some constants $C,c>0$ such that

$\inf_{\hat\theta}\sup_{\theta\in\Theta_{k_1k_2}(M)}P_{(\theta,\sigma^2,p)}\Big(\|\hat\theta-\theta\|^2 > C\,\frac{\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2)\Big) > c$

when $\log k_1\asymp\log k_2$, and

$\inf_{\hat\theta}\sup_{\theta\in\Theta^s_k(M)}P_{(\theta,\sigma^2,p)}\Big(\|\hat\theta-\theta\|^2 > C\,\frac{\sigma^2}{p}(k^2+n\log k)\Big) > c.$

The symmetric Bernoulli case. When the observed matrix is symmetric with zero diagonal and Bernoulli random variables as its super-diagonal entries, it can be viewed as the adjacency matrix of an undirected network, and the problem of estimating its mean matrix with missing data can be viewed as a network completion problem. Given a partially observed Bernoulli adjacency matrix $\{X_{ij}\}_{(i,j)\in\Omega}$, one can predict the unobserved edges by estimating the whole mean matrix $\theta$. As before, we assume that each edge is observed independently with probability $p$. Given a symmetric adjacency matrix $X = X^T\in\{0,1\}^{n\times n}$ with zero diagonal, the stochastic block model (Holland et al., 1983) assumes that $\{X_{ij}\}_{i>j}$ are independent Bernoulli random variables with means $\theta_{ij} = Q_{z(i)z(j)}\in[0,1]$ for some matrix $Q\in[0,1]^{k\times k}$ and some label vector $z\in[k]^n$. In other words, the probability that there is an edge between the $i$th and the $j$th nodes depends only on their community labels $z(i)$ and $z(j)$. The following class then includes all possible mean matrices of stochastic block models with $n$ nodes, $k$ clusters, and edge probabilities uniformly bounded by $\rho$:

$\Theta^+_k(\rho) = \big\{\theta\in[0,1]^{n\times n}:\ \theta_{ii}=0,\ \theta_{ij}=\theta_{ji}=Q_{z(i)z(j)},\ Q=Q^T\in[0,\rho]^{k\times k},\ z\in[k]^n\big\}.$  (11)

By the definition in (7), $\Theta^+_k(\rho)\subset\Theta^s_k(\rho)$.

Although the tail probability of Bernoulli random variables does not satisfy the sub-Gaussian assumption (4), a slight modification of the proof of Theorem 1 leads to the following result. The proof of Corollary 4 is given in Section A in the appendix.

Corollary 4. Consider the optimization problem (10) with $\Theta = \Theta^s_k(\rho)$. For any global optimizer $\hat\theta$ and any constant $C'>0$, there exists a constant $C>0$ only depending on $C'$ such that

$\|\hat\theta-\theta\|^2 \le C\,\frac{\rho}{p}\,(k^2 + n\log k),$

with probability at least $1-\exp(-C'(k^2+n\log k))$ uniformly over $\theta\in\Theta^+_k(\rho)$.

When $\rho = p = 1$, Corollary 4 implies Theorem 2.1 in Gao et al. (2015a). A rate-matching lower bound is given by the following theorem. We denote the probability distribution of a stochastic block model with mean matrix $\theta\in\Theta^+_k(\rho)$ and observation rate $p$ by $P_{(\theta,p)}$.

Theorem 5. For stochastic block models, we have

$\inf_{\hat\theta}\sup_{\theta\in\Theta^+_k(\rho)}P_{(\theta,p)}\Big(\|\hat\theta-\theta\|^2 > C\Big(\frac{\rho(k^2+n\log k)}{p}\wedge\rho^2n^2\Big)\Big) > c,$

for some constants $C,c>0$.
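For intuition about the network completion setting of this subsection, here is a hedged NumPy sketch (our own helper, not the authors' code) that generates a symmetric SBM adjacency matrix with edge probabilities bounded by $\rho$ and an independent symmetric observation pattern with rate $p$.

```python
import numpy as np

def simulate_partial_sbm(n, k, rho, p, seed=0):
    """Symmetric SBM with zero diagonal, edge probabilities in [0, rho],
    each edge observed independently with probability p."""
    rng = np.random.default_rng(seed)
    Q = rng.uniform(0, rho, size=(k, k))
    Q = np.triu(Q) + np.triu(Q, 1).T            # symmetric connectivity matrix
    z = rng.integers(0, k, size=n)               # community labels
    theta = Q[np.ix_(z, z)]
    np.fill_diagonal(theta, 0.0)
    upper = np.triu(rng.binomial(1, theta), 1)   # X_ij ~ Bernoulli(theta_ij), i < j
    X = upper + upper.T
    E = np.triu(rng.binomial(1, p, size=(n, n)), 1)
    E = E + E.T                                  # E_ij = E_ji: symmetric observations
    return X, E, theta, z
```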

The lower bound is the minimum of two terms. When $\rho\ge\frac{k^2+n\log k}{pn^2}$, the rate becomes $\frac{\rho(k^2+n\log k)}{p}\wedge\rho^2n^2\asymp\frac{\rho(k^2+n\log k)}{p}$, which is achieved by the constrained least squares estimator according to Corollary 4. When $\rho<\frac{k^2+n\log k}{pn^2}$, the rate is dominated by $\rho^2n^2$; in this case, the trivial zero estimator achieves the minimax rate. In the case $p=1$, a comparable result has been obtained independently by Klopp et al. (2015). However, our result here is more general as it accommodates missing observations. Moreover, the general upper bounds in Theorem 1 even hold for networks with weighted edges.

4. Extensions

In this section, we extend the estimation procedure and the theory of Sections 2 and 3 in three directions: adaptation to an unknown observation rate, adaptation to unknown model parameters, and sparse graphon estimation.

4.1 Adaptation to unknown observation rate

The estimator (10) depends on knowledge of the observation rate $p$. When $p$ is not too small, such knowledge is not necessary for achieving the desired rates. Define

$\hat p = \frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}E_{ij}}{n_1n_2}$  (12)

for the asymmetric case and

$\hat p = \frac{\sum_{1\le i<j\le n}E_{ij}}{\tfrac{1}{2}n(n-1)}$  (13)

for the symmetric case, and redefine

$Y_{ij} = X_{ij}E_{ij}/\hat p,$  (14)

where the definition of $\hat p$ is chosen between (12) and (13) depending on whether one is dealing with the asymmetric or the symmetric parameter space. Then we have the following result for the solution to (10) with $Y$ redefined by (14).

Theorem 6. For $\Theta = \Theta_{k_1k_2}(M)$, suppose that for some absolute constant $C_1>0$,

$p \ge C_1\,\frac{[\log(n_1+n_2)]^2}{k_1k_2+n_1\log k_1+n_2\log k_2}.$

Let $\hat\theta$ be the solution to (10) with $Y$ defined as in (14). Then for any constant $C'>0$, there exists a constant $C>0$ only depending on $C'$ and $C_1$ such that

$\|\hat\theta-\theta\|^2 \le C\,\frac{M^2\vee\sigma^2}{p}\,(k_1k_2+n_1\log k_1+n_2\log k_2),$

with probability at least $1-(n_1n_2)^{-C'}$ uniformly over $\theta\in\Theta$ and all error distributions satisfying (4). For $\Theta = \Theta^s_k(M)$, the same result holds if we replace $n_1$ and $n_2$ with $n$ and $k_1$ and $k_2$ with $k$ in the foregoing statement.
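A minimal sketch (our own, with illustrative names) of the plug-in observation rate in (12)-(13) and the redefined $Y$ in (14); the symmetric branch averages only over off-diagonal pairs.

```python
import numpy as np

def estimate_rate_and_debias(X_obs, E, symmetric=False):
    """Plug-in observation rate as in (12)/(13) and the redefined Y of (14)."""
    if symmetric:
        n = E.shape[0]
        p_hat = np.triu(E, 1).sum() / (n * (n - 1) / 2)   # (13): upper triangle only
    else:
        p_hat = E.mean()                                   # (12): overall fraction observed
    return X_obs * E / p_hat, p_hat
```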

4.2 Adaptation to unknown model parameters

We now provide an adaptive procedure for estimating $\theta$ without assuming knowledge of the model parameters $k_1$, $k_2$ and $M$. The procedure can be regarded as a variation of 2-fold cross validation (Wold, 1978). We give details of the procedure for the asymmetric parameter spaces $\Theta_{k_1k_2}(M)$; the procedure for the symmetric parameter spaces $\Theta^s_k(M)$ can be obtained similarly. To adapt to $k_1$, $k_2$ and $M$, we split the data into two halves. Namely, sample i.i.d. $T_{ij}$ from Bernoulli$(\tfrac12)$ and define $S = \{(i,j)\in[n_1]\times[n_2]: T_{ij}=1\}$. Define $Y_{ij} = 2X_{ij}E_{ij}T_{ij}/p$ and $Y^c_{ij} = 2X_{ij}E_{ij}(1-T_{ij})/p$ for all $(i,j)\in[n_1]\times[n_2]$. Then, for given $(k_1,k_2,M)$, the least squares estimators using $Y$ and $Y^c$ are

$\hat\theta_{k_1k_2M} = \mathrm{argmin}_{\theta\in\Theta_{k_1k_2}(M)}\|Y-\theta\|^2,\qquad \hat\theta^c_{k_1k_2M} = \mathrm{argmin}_{\theta\in\Theta_{k_1k_2}(M)}\|Y^c-\theta\|^2.$

Select the parameters by

$(\hat k_1,\hat k_2,\hat M) = \mathrm{argmin}_{(k_1,k_2,M)\in[n_1]\times[n_2]\times\mathcal M}\ \|\hat\theta_{k_1k_2M}-Y^c\|_{S^c}^2,$

where $\mathcal M = \{h/(n_1+n_2): h\in[(n_1+n_2)^6]\}$, and define $\hat\theta = \hat\theta_{\hat k_1\hat k_2\hat M}$. Similarly, we can define $\hat\theta^c$ by validating the parameters using $Y$. The final estimator is given by

$\hat\theta_{ij} = \begin{cases}\hat\theta^c_{ij}, & (i,j)\in S;\\ \hat\theta_{ij}, & (i,j)\in S^c.\end{cases}$

Theorem 7. Assume $(n_1+n_2)^{-1}\le M\le(n_1+n_2)^5$ and $p\ge(n_1+n_2)^{-1}$. For any constant $C'>0$, there exists a constant $C>0$ only depending on $C'$ such that

$\|\hat\theta-\theta\|^2 \le C\,\frac{M^2\vee\sigma^2}{p}\,\big(k_1k_2+n_1\log k_1+n_2\log k_2+\log(n_1+n_2)\big),$

with probability at least $1-\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))-(n_1n_2)^{-C'}$ uniformly over $\theta\in\Theta_{k_1k_2}(M)$ and all error distributions satisfying (4).

Compared with Theorem 1, the rate given by Theorem 7 has an extra $p^{-1}\log(n_1+n_2)$ term. A sufficient condition for this extra term to be inconsequential is $\log(n_1+n_2)\lesssim n_1\wedge n_2$. Theorem 7 is adaptive for all $(k_1,k_2)\in[n_1]\times[n_2]$ and for $(n_1+n_2)^{-1}\le M\le(n_1+n_2)^5$ and $p\ge(n_1+n_2)^{-1}$. In fact, by choosing a larger grid $\mathcal M$, we can extend the adaptive region for $M$ to $(n_1+n_2)^{-a}\le M\le(n_1+n_2)^b$ for arbitrary constants $a,b>0$.
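The selection step of Section 4.2 can be sketched as follows (our own code; `solver` stands for any constrained least squares routine such as Algorithm 1 in Section 5, and the grids are illustrative, not the paper's).

```python
import itertools
import numpy as np

def adaptive_select(X_obs, E, p, solver, k_grid, M_grid, seed=0):
    """2-fold scheme: split observed entries with T_ij ~ Bernoulli(1/2),
    fit on one half, validate on the other.
    solver(Y, k1, k2, M) is assumed to return a constrained LS fit."""
    rng = np.random.default_rng(seed)
    n1, n2 = X_obs.shape
    T = rng.binomial(1, 0.5, size=(n1, n2))
    Y = 2 * X_obs * E * T / p           # fitting half
    Yc = 2 * X_obs * E * (1 - T) / p    # validation half
    best, best_err = None, np.inf
    for k1, k2, M in itertools.product(k_grid, k_grid, M_grid):
        theta_hat = solver(Y, k1, k2, M)
        err = np.sum(((theta_hat - Yc) * (1 - T)) ** 2)   # validation loss on S^c
        if err < best_err:
            best_err, best = err, (k1, k2, M)
    return best
```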

4.3 Sparse graphon estimation

Consider a random graph with adjacency matrix $\{X_{ij}\}\in\{0,1\}^{n\times n}$, whose sampling procedure is determined by

$(\xi_1,\dots,\xi_n)\sim P_\xi,\qquad X_{ij}\mid(\xi_i,\xi_j)\sim\mathrm{Bernoulli}(\theta_{ij}),\qquad\text{where } \theta_{ij} = f(\xi_i,\xi_j).$  (15)

For $i\in[n]$, $X_{ii} = \theta_{ii} = 0$. Conditioning on $(\xi_1,\dots,\xi_n)$, the variables $X_{ij} = X_{ji}$ are independent across $i>j$. The function $f$ on $[0,1]^2$, which is assumed to be symmetric, is called a graphon. The concept of a graphon originates from graph limit theory (Hoover, 1979; Lovász and Szegedy, 2006; Diaconis and Janson, 2007; Lovász, 2012) and the study of exchangeable arrays (Aldous, 1981; Kallenberg, 1989). It is the underlying nonparametric object that generates the random graph. Statistical estimation of graphons has been considered by Wolfe and Olhede (2013); Olhede and Wolfe (2014); Gao et al. (2015a,b); Lu and Zhou (2015) for dense networks. Using Corollary 4, we present a result for sparse graphon estimation.

Let us start by specifying the function class of graphons. Define the derivative operator by

$\nabla_{jk}f(x,y) = \frac{\partial^{j+k}}{(\partial x)^j(\partial y)^k}f(x,y),$

and adopt the convention $\nabla_{00}f(x,y) = f(x,y)$. The Hölder norm is defined as

$\|f\|_{\mathcal H_\alpha} = \max_{j+k\le\lfloor\alpha\rfloor}\ \sup_{(x,y)\in\mathcal D}|\nabla_{jk}f(x,y)| + \max_{j+k=\lfloor\alpha\rfloor}\ \sup_{(x,y)\ne(x',y')\in\mathcal D}\frac{|\nabla_{jk}f(x,y)-\nabla_{jk}f(x',y')|}{\|(x-x',y-y')\|^{\alpha-\lfloor\alpha\rfloor}},$

where $\mathcal D = \{(x,y)\in[0,1]^2: x\ge y\}$. Then the sparse graphon class with Hölder smoothness $\alpha$ is defined by

$\mathcal F_\alpha(\rho,L) = \{0\le f\le\rho:\ \|f\|_{\mathcal H_\alpha}\le L\rho,\ f(x,y)=f(y,x)\ \text{for all}\ (x,y)\in\mathcal D\},$

where $L>0$ is the radius of the class, which is assumed to be a constant. As argued in Gao et al. (2015a), it is sufficient to approximate a graphon with Hölder smoothness by a piecewise constant function. In the random graph setting, a piecewise constant function corresponds to a stochastic block model. Therefore, we can use the estimator defined by (10). Using Corollary 4, a direct bias-variance tradeoff argument leads to the following result. The same result was found independently by Klopp et al. (2015).

Corollary 8. Consider the optimization problem (10) with $Y_{ij} = X_{ij}$ and $\Theta = \Theta^s_k(M)$, where $k = \lceil n^{\frac{1}{(\alpha\wedge1)+1}}\rceil$ and $M = \rho$. Given any global optimizer $\hat\theta$ of (10), we estimate $f$ by $\hat f(\xi_i,\xi_j) = \hat\theta_{ij}$. Then, for any constant $C'>0$, there exists a constant $C>0$ only depending on $C'$ and $L$ such that

$\frac{1}{n^2}\sum_{i,j\in[n]}\big(\hat f(\xi_i,\xi_j)-f(\xi_i,\xi_j)\big)^2 \le C\rho\Big(n^{-\frac{2\alpha}{\alpha+1}}+\frac{\log n}{n}\Big),$

with probability at least $1-\exp(-C'(n^{\frac{1}{\alpha+1}}+n\log n))$ uniformly over $f\in\mathcal F_\alpha(\rho,L)$ and $P_\xi$.

Corollary 8 implies an interesting phase transition phenomenon. When $\alpha\in(0,1)$, the rate becomes $\rho(n^{-\frac{2\alpha}{\alpha+1}}+\frac{\log n}{n})\asymp\rho n^{-\frac{2\alpha}{\alpha+1}}$, which is the typical nonparametric rate times a sparsity index of the network. When $\alpha\ge1$, the rate becomes $\rho(n^{-\frac{2\alpha}{\alpha+1}}+\frac{\log n}{n})\asymp\rho\frac{\log n}{n}$, which does not depend on the smoothness $\alpha$. Corollary 8 extends Theorem 2.3 of Gao et al. (2015a) to the case $\rho<1$.
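As a small worked example of the choice of $k$ in Corollary 8 (our own sketch): one uses $k = \lceil n^{1/((\alpha\wedge1)+1)}\rceil$ blocks and then fits the constrained least squares SBM estimator with $M = \rho$, reading off $\hat f(\xi_i,\xi_j) = \hat\theta_{ij}$.

```python
import math

def graphon_block_count(n, alpha):
    """k = ceil(n^{1/((alpha ^ 1) + 1)}) blocks, as used in Corollary 8."""
    return math.ceil(n ** (1.0 / (min(alpha, 1.0) + 1.0)))

# For example, n = 2000 and alpha = 0.5 give k = ceil(2000^(2/3)) = 159 blocks.
```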

In Wolfe and Olhede (2013), the graphon $f$ is defined in a different way. Namely, they considered the setting where $(\xi_1,\dots,\xi_n)$ are i.i.d. Unif$[0,1]$ random variables under $P_\xi$, and the adjacency matrix is generated with Bernoulli random variables having means $\theta_{ij} = \rho f(\xi_i,\xi_j)$ for a nonparametric graphon $f$ satisfying $\int\!\!\int f(x,y)\,dx\,dy = 1$. For this setting, with an appropriate smoothness assumption, we can estimate $f$ by $\hat f(\xi_i,\xi_j) = \hat\theta_{ij}/\rho$. The rate of convergence would be $\rho^{-1}(n^{-\frac{2\alpha}{\alpha+1}}+\frac{\log n}{n})$.

Using the result of Theorem 7, we present an adaptive version of Corollary 8. The estimator we consider is a symmetric version of the one introduced in Section 4.2. The only difference is that we choose the grid $\mathcal M$ as $\{m/n: m\in[n+1]\}$. The estimator is fully data driven in the sense that it does not depend on $\alpha$ or $\rho$.

Corollary 9. Assume $\rho\ge n^{-1}$. Consider the adaptive estimator $\hat\theta$ introduced in Theorem 7, and set $\hat f(\xi_i,\xi_j) = \hat\theta_{ij}$. Then, for any constant $C'>0$, there exists a constant $C>0$ only depending on $C'$ and $L$ such that

$\frac{1}{n^2}\sum_{i,j\in[n]}\big(\hat f(\xi_i,\xi_j)-f(\xi_i,\xi_j)\big)^2 \le C\rho\Big(n^{-\frac{2\alpha}{\alpha+1}}+\frac{\log n}{n}\Big),$

with probability at least $1-n^{-C'}$ uniformly over $f\in\mathcal F_\alpha(\rho,L)$ and $P_\xi$.

5. Numerical Studies

To introduce an algorithm solving (8) or (10), we write (10) in an alternative way,

$\min_{z_1\in[k_1]^{n_1},\,z_2\in[k_2]^{n_2},\,Q\in[l,u]^{k_1\times k_2}} L(Q,z_1,z_2),$

where $l$ and $u$ are the lower and upper constraints on the parameters and

$L(Q,z_1,z_2) = \sum_{(i,j)\in[n_1]\times[n_2]}\big(Y_{ij}-Q_{z_1(i)z_2(j)}\big)^2.$

For biclustering, we set $l = -M$ and $u = M$. For SBMs, we set $l = 0$ and $u = \rho$. We do not impose symmetry for SBMs, gaining computational convenience without losing much statistical accuracy. The simple form of $L(Q,z_1,z_2)$ indicates that we can iteratively optimize over $(Q,z_1,z_2)$ with explicit formulas. This motivates the following algorithm. The iteration steps (16), (17) and (18) of Algorithm 1 can be equivalently expressed as

$Q = \mathrm{argmin}_{Q\in[l,u]^{k_1\times k_2}}L(Q,z_1,z_2);\quad z_1 = \mathrm{argmin}_{z_1\in[k_1]^{n_1}}L(Q,z_1,z_2);\quad z_2 = \mathrm{argmin}_{z_2\in[k_2]^{n_2}}L(Q,z_1,z_2).$

Thus, each iteration reduces the value of the objective function. It is worth noting that Algorithm 1 can be viewed as a two-way extension of the ordinary k-means algorithm. Since the objective function is non-convex, one cannot guarantee convergence to the global optimum. We initialize the algorithm via a spectral clustering step using multiple random starting points, which worked well on the simulated datasets presented below. We now present some numerical results to demonstrate the accuracy of the error rate behavior suggested by Theorem 1 on simulated data.

Algorithm 1: A Biclustering Algorithm

Input: $\{X_{ij}\}_{(i,j)\in\Omega}$, the numbers of row and column clusters $(k_1,k_2)$, lower and upper constraints $(l,u)$, and the number of random starting points $m$.
Output: An $n_1\times n_2$ matrix $\hat\theta$ with $\hat\theta_{ij} = Q_{z_1(i)z_2(j)}$.
Preprocessing: Let $X_{ij} = 0$ for $(i,j)\notin\Omega$, $\hat p = |\Omega|/(n_1n_2)$ and $Y = X/\hat p$.

1. Initialization step: Apply the singular value decomposition to obtain $Y = UDV^T$. Run the k-means algorithm on the first $k_1$ columns of $U$ with $m$ random starting points to get $z_1$. Run the k-means algorithm on the first $k_2$ columns of $V$ with $m$ random starting points to get $z_2$.

While not converged:

2. Update $Q$: for each $(a,b)\in[k_1]\times[k_2]$,
$Q_{ab} = \frac{1}{|z_1^{-1}(a)||z_2^{-1}(b)|}\sum_{i\in z_1^{-1}(a)}\sum_{j\in z_2^{-1}(b)}Y_{ij}.$  (16)
If $Q_{ab}>u$, let $Q_{ab}=u$. If $Q_{ab}<l$, let $Q_{ab}=l$.

3. Update $z_1$: for each $i\in[n_1]$,
$z_1(i) = \mathrm{argmin}_{a\in[k_1]}\sum_{j=1}^{n_2}\big(Q_{az_2(j)}-Y_{ij}\big)^2.$  (17)

4. Update $z_2$: for each $j\in[n_2]$,
$z_2(j) = \mathrm{argmin}_{b\in[k_2]}\sum_{i=1}^{n_1}\big(Q_{z_1(i)b}-Y_{ij}\big)^2.$  (18)
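The updates (16)-(18) translate directly into code. Below is a minimal NumPy/SciPy sketch of Algorithm 1 (the names are ours; it uses a single spectral initialization and a fixed number of sweeps instead of the $m$ random restarts and convergence test of the boxed algorithm).

```python
import numpy as np
from scipy.cluster.vq import kmeans2   # k-means for the spectral initialization

def bicluster_ls(X_obs, E, k1, k2, l, u, n_iter=50, seed=0):
    """Sketch of Algorithm 1: constrained least squares biclustering on Y = X / p_hat."""
    rng = np.random.default_rng(seed)
    n1, n2 = X_obs.shape
    p_hat = E.mean()
    Y = np.where(E == 1, X_obs, 0.0) / p_hat

    # Initialization: SVD of Y, then k-means on the leading singular vectors.
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    _, z1 = kmeans2(U[:, :k1], k1, minit='++', seed=rng)
    _, z2 = kmeans2(Vt[:k2, :].T, k2, minit='++', seed=rng)

    Q = np.zeros((k1, k2))
    for _ in range(n_iter):
        # Step (16): block averages of Y, clipped to [l, u].
        for a in range(k1):
            for b in range(k2):
                block = Y[np.ix_(z1 == a, z2 == b)]
                Q[a, b] = np.clip(block.mean(), l, u) if block.size else 0.0
        # Step (17): assign each row to its best row cluster.
        row_cost = ((Y[:, None, :] - Q[None, :, z2]) ** 2).sum(axis=2)      # n1 x k1
        z1 = row_cost.argmin(axis=1)
        # Step (18): assign each column to its best column cluster.
        col_cost = ((Y.T[:, None, :] - Q.T[None, :, z1]) ** 2).sum(axis=2)  # n2 x k2
        z2 = col_cost.argmin(axis=1)
    return Q[np.ix_(z1, z2)], Q, z1, z2
```

Each sweep can only decrease the objective $L(Q,z_1,z_2)$, mirroring the argument in the text; in practice one would add the $m$ random restarts and a convergence test on the objective.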

Bernoulli case. Our theoretical result indicates that the rate of recovery is

$\sqrt{\frac{\rho}{p}\Big(\frac{k^2}{n^2}+\frac{\log k}{n}\Big)}$

for the root mean squared error (RMSE) $\frac{1}{n}\|\hat\theta-\theta\|$. When $k$ is not too large, the dominating term is $\sqrt{\frac{\rho\log k}{pn}}$. We confirm this rate by simulation. We first generate data from SBMs with the number of blocks $k\in\{2,4,8,16\}$ and observation rate $p = 0.5$. For every fixed $k$, we use four different matrices $Q = 0.5\cdot 1_k1_k^T + 0.1t\,I_k$ with $t = 1,2,3,4$ and generate the community labels $z$ uniformly on $[k]$. Then we calculate the error $\frac{1}{n}\|\hat\theta-\theta\|$. Panel (a) of Figure 1 shows the error versus the sample size $n$. In Panel (b), we rescale the x-axis to $N = n/\log k$. The curves for different $k$ align well with each other and the error decreases at the rate $1/\sqrt{N}$. This confirms our theoretical results in Theorem 1.

[Figure 1: Plots of $\frac{1}{n}\|\hat\theta-\theta\|$ when using our algorithm on SBMs. Each curve corresponds to a fixed $k$. (a) Error against the raw sample size $n$. (b) The same error against the rescaled sample size $N = n/\log k$.]

Gaussian case. We simulate data with Gaussian noise under four different settings of $k_1$ and $k_2$. For each $(k_1,k_2)\in\{(4,4),(4,8),(8,8),(8,12)\}$, the entries of the matrix $Q$ are independently and uniformly generated from $\{1,2,3,4,5\}$. The cluster labels $z_1$ and $z_2$ are uniform on $[k_1]$ and $[k_2]$, respectively. After generating $Q$, $z_1$ and $z_2$, we add $N(0,1)$ noise to the data and observe each $X_{ij}$ with probability $p = 0.1$. For each number of rows $n_1$, we set the number of columns to $n_2 = n_1\log k_1/\log k_2$. Panel (a) of Figure 2 shows the error versus $n_1$. In Panel (b), we rescale the x-axis to $N = n_1/\log k_2$. Again, the plots for different $(k_1,k_2)$ align fairly well and the error decreases roughly at the rate $1/\sqrt{N}$.

[Figure 2: Plots of the error $\frac{1}{\sqrt{n_1n_2}}\|\hat\theta-\theta\|$ when using our algorithm on biclustering data with Gaussian noise. Each curve corresponds to a fixed $(k_1,k_2)$. (a) Error against $n_1$. (b) The same error against the rescaled sample size $N = n_1/\log k_2$.]

Sparse Bernoulli case. We also study recovery of sparse SBMs. We run the same simulation as in the Bernoulli case except that the entries of $Q$ are scaled down so that the network is sparse, again with four settings indexed by $t = 1,2,3,4$. The results are shown in Figure 3.

[Figure 3: Plots of the error $\frac{1}{n}\|\hat\theta-\theta\|$ when using our algorithm on sparse SBMs. Each curve corresponds to a fixed $k$. (a) Error against the raw sample size $n$. (b) The same error against the rescaled sample size $N = n/(\rho\log k)$.]

Adaptation to unknown parameters. We use the 2-fold cross validation procedure proposed in Section 4.2 to adaptively choose the unknown number of clusters $k$ and the sparsity level $\rho$. We use the setting of sparse SBMs with the number of blocks $k\in\{4,6\}$ and $Q$ of the same form as above, with diagonal increments $0.1t$ for $t = 1,2,3,4$. When running our algorithms, we search over all $(k,\rho)$ pairs with $k\in\{2,3,\dots,8\}$ and $\rho\in\{0.2,0.3,0.4,0.5\}$. In Table 1, we report the errors for different configurations of $Q$. The first row is the error obtained by our adaptive procedure and the second row is the error using the true $k$ and $\rho$. Consistent with Theorem 7, the error from the adaptive procedure is almost the same as the oracle error.

[Table 1: Errors of the adaptive procedure versus the oracle: adaptive MSE and oracle MSE reported against the rescaled sample size, for $k = 4$ and $k = 6$.]

The effect of constraints. The optimization (10) and Algorithm 1 involve the constraint $Q\in[l,u]^{k_1\times k_2}$. It is natural to ask whether this constraint really helps reduce the error or is merely an artifact of the proof.

We investigate the effect of this constraint on simulated data by comparing Algorithm 1 with its variant without the constraint, for both the Gaussian case and the sparse Bernoulli case. Panel (a) of Figure 4 shows the plots for sparse SBMs with 8 communities. Panel (b) shows the Gaussian case with $(k_1,k_2) = (4,8)$. For both panels, the effect of the constraint is significant when the rescaled sample size is small, while as the rescaled sample size increases the performance of the two estimators becomes similar.

[Figure 4: Constrained versus unconstrained least squares. (a) Sparse SBM with $k = 8$. (b) Gaussian biclustering with $(k_1,k_2) = (4,8)$.]

6. Discussion

This paper studies the optimal rates of recovering a matrix with a biclustering structure. While recent progress in high-dimensional estimation has mainly focused on sparse and low rank structures, the biclustering structure has received much less attention. This paper fills in the gap. In what follows, we discuss some key points of the paper and some possible future directions of research.

Difference from low-rankness. A biclustering structure is implicitly low-rank. We show that by exploiting the stronger biclustering assumption, one can achieve better rates of convergence in estimation and completion. The minimax rates derived in this paper precisely characterize how much one can gain by taking advantage of this structure.

Relation to other structures. A natural question is whether there is a similarity between the biclustering structure and the well-studied sparsity structure. The paper Gao et al. (2015b) gives a general theory of structured estimation in linear models that puts both sparse and biclustering structures in a unified theoretical framework. According to this general theory, the $k_1k_2$ part in the minimax rate is the complexity of parameter estimation and the $n_1\log k_1 + n_2\log k_2$ part is the complexity of structure estimation.

Open problems. The optimization problem (10) is not convex, which causes difficulty in devising a provably optimal polynomial-time algorithm. An open question is whether there is a convex relaxation of (10) that can be solved efficiently without losing much statistical accuracy. Another interesting problem for future research is whether the objective function in (10) can be extended beyond the least squares framework.

7. Proofs

7.1 Proof of Theorem 1

Below, we focus on the proof for the asymmetric parameter space $\Theta_{k_1k_2}(M)$. The result for the symmetric parameter space $\Theta^s_k(M)$ can be obtained by letting $k_1 = k_2$ and by taking care of the diagonal entries. Since $\hat\theta\in\Theta_{k_1k_2}(M)$, there exist $\hat z_1\in[k_1]^{n_1}$, $\hat z_2\in[k_2]^{n_2}$ and $\hat Q\in[-M,M]^{k_1\times k_2}$ such that $\hat\theta_{ij} = \hat Q_{\hat z_1(i)\hat z_2(j)}$. For this $(\hat z_1,\hat z_2)$, we define a matrix $\bar\theta$ by

$\bar\theta_{ij} = \frac{1}{|\hat z_1^{-1}(a)||\hat z_2^{-1}(b)|}\sum_{(i',j')\in\hat z_1^{-1}(a)\times\hat z_2^{-1}(b)}\theta_{i'j'},$

for any $(i,j)\in\hat z_1^{-1}(a)\times\hat z_2^{-1}(b)$ and any $(a,b)\in[k_1]\times[k_2]$. To facilitate the proof, we need the following three lemmas, whose proofs are given in the supplementary material.

Lemma 10. For any constant $C'>0$, there exists a constant $C_1>0$ only depending on $C'$ such that

$\|\hat\theta-\bar\theta\|^2 \le C_1\,\frac{M^2\vee\sigma^2}{p}\,(k_1k_2+n_1\log k_1+n_2\log k_2),$

with probability at least $1-\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))$.

Lemma 11. For any constant $C'>0$, there exists a constant $C_2>0$ only depending on $C'$ such that the inequality $\|\bar\theta-\theta\|^2\ge C_2(M^2\vee\sigma^2)(n_1\log k_1+n_2\log k_2)/p$ implies

$\Big\langle\frac{\bar\theta-\theta}{\|\bar\theta-\theta\|},\,Y-\theta\Big\rangle \le \sqrt{C_2\,\frac{M^2\vee\sigma^2}{p}\,(k_1k_2+n_1\log k_1+n_2\log k_2)},$

with probability at least $1-\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))$.

Lemma 12. For any constant $C'>0$, there exists a constant $C_3>0$ only depending on $C'$ such that

$\langle\hat\theta-\bar\theta,\,Y-\theta\rangle \le C_3\,\frac{M^2\vee\sigma^2}{p}\,(k_1k_2+n_1\log k_1+n_2\log k_2),$

with probability at least $1-\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))$.

Proof of Theorem 1. Applying a union bound, the results of Lemmas 10-12 hold simultaneously with probability at least $1-3\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))$. We consider the following two cases.

Case 1: $\|\bar\theta-\theta\|^2 \le C_2\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2)$. Then we have

$\|\hat\theta-\theta\|^2 \le 2\|\hat\theta-\bar\theta\|^2 + 2\|\bar\theta-\theta\|^2 \le 2(C_1+C_2)\,\frac{M^2\vee\sigma^2}{p}\,(k_1k_2+n_1\log k_1+n_2\log k_2)$

by Lemma 10.

Case 2: $\|\bar\theta-\theta\|^2 > C_2\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2)$. By the definition of the estimator, we have $\|\hat\theta-Y\|^2 \le \|\theta-Y\|^2$. After rearrangement, we have

$\|\hat\theta-\theta\|^2 \le 2\langle\hat\theta-\theta,\,Y-\theta\rangle = 2\langle\hat\theta-\bar\theta,\,Y-\theta\rangle + 2\langle\bar\theta-\theta,\,Y-\theta\rangle$
$= 2\langle\hat\theta-\bar\theta,\,Y-\theta\rangle + 2\|\bar\theta-\theta\|\Big\langle\frac{\bar\theta-\theta}{\|\bar\theta-\theta\|},\,Y-\theta\Big\rangle$
$\le 2\langle\hat\theta-\bar\theta,\,Y-\theta\rangle + 2\big(\|\bar\theta-\hat\theta\|+\|\hat\theta-\theta\|\big)\Big\langle\frac{\bar\theta-\theta}{\|\bar\theta-\theta\|},\,Y-\theta\Big\rangle$
$\le 2\big(C_3+\sqrt{C_1C_2}\big)\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2) + 2\sqrt{C_2\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2)}\,\|\hat\theta-\theta\|,$

which leads to the bound

$\|\hat\theta-\theta\|^2 \le 4\big(C_2+C_3+\sqrt{C_1C_2}\big)\,\frac{M^2\vee\sigma^2}{p}\,(k_1k_2+n_1\log k_1+n_2\log k_2).$

Combining the two cases, we have

$\|\hat\theta-\theta\|^2 \le C\,\frac{M^2\vee\sigma^2}{p}\,(k_1k_2+n_1\log k_1+n_2\log k_2),$

with probability at least $1-3\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))$, for $C = 4(C_2+C_3+\sqrt{C_1C_2})\vee 2(C_1+C_2)$.

7.2 Proof of Theorem 7

We first present a lemma on the tail behavior of sums of independent products of sub-Gaussian and Bernoulli random variables. Its proof is given in the supplementary material.

Lemma 13. Let $\{X_i\}$ be independent sub-Gaussian random variables with means $\theta_i\in[-M,M]$ and $Ee^{\lambda(X_i-\theta_i)}\le e^{\lambda^2\sigma^2/2}$. Let $\{E_i\}$ be independent Bernoulli random variables with mean $p$. Assume $\{X_i\}$ and $\{E_i\}$ are all independent. Then for $|\lambda|\le p/(M\vee\sigma)$ and $Y_i = X_iE_i/p$, we have

$Ee^{\lambda(Y_i-\theta_i)} \le 2e^{(M^2+2\sigma^2)\lambda^2/p}.$

Moreover, for $\sum_{i=1}^n c_i^2 = 1$,

$P\Big\{\Big|\sum_{i=1}^n c_i(Y_i-\theta_i)\Big|\ge t\Big\} \le 4\exp\Big\{-\min\Big(\frac{t^2p}{4(M^2+2\sigma^2)},\,\frac{tp}{2(M\vee\sigma)\|c\|_\infty}\Big)\Big\}$  (19)

for any $t>0$.

Proof of Theorem 7. Consider a mean matrix $\theta$ that belongs to the space $\Theta_{k_1k_2}(M)$. By the definition of $(\hat k_1,\hat k_2,\hat M)$, we have

$\|\hat\theta_{\hat k_1\hat k_2\hat M}-Y^c\|_{S^c}^2 \le \|\hat\theta_{k_1k_2m}-Y^c\|_{S^c}^2,$

where $k_1$ and $k_2$ are the true numbers of row and column clusters and $m$ is chosen to be the smallest element of $\mathcal M$ that is no smaller than $M$. After rearrangement, we have

$\|\hat\theta_{\hat k_1\hat k_2\hat M}-\theta\|_{S^c}^2 \le \|\hat\theta_{k_1k_2m}-\theta\|_{S^c}^2 + 2\langle\hat\theta_{\hat k_1\hat k_2\hat M}-\hat\theta_{k_1k_2m},\,Y^c-\theta\rangle_{S^c}$
$\le \|\hat\theta_{k_1k_2m}-\theta\|_{S^c}^2 + 2\|\hat\theta_{\hat k_1\hat k_2\hat M}-\hat\theta_{k_1k_2m}\|_{S^c}\,\max_{(l_1,l_2,h)\in[n_1]\times[n_2]\times\mathcal M}\Big\langle\frac{\hat\theta_{l_1l_2h}-\hat\theta_{k_1k_2m}}{\|\hat\theta_{l_1l_2h}-\hat\theta_{k_1k_2m}\|_{S^c}},\,Y^c-\theta\Big\rangle_{S^c}.$

By Lemma 13 and the independence structure, we have

$\max_{(l_1,l_2,h)\in[n_1]\times[n_2]\times\mathcal M}\Big\langle\frac{\hat\theta_{l_1l_2h}-\hat\theta_{k_1k_2m}}{\|\hat\theta_{l_1l_2h}-\hat\theta_{k_1k_2m}\|_{S^c}},\,Y^c-\theta\Big\rangle_{S^c} \le C(M\vee\sigma)\sqrt{\frac{\log(n_1+n_2)}{p}},$

with probability at least $1-(n_1n_2)^{-C'}$. Using the triangle inequality and the Cauchy-Schwarz inequality, we have

$\|\hat\theta_{\hat k_1\hat k_2\hat M}-\theta\|_{S^c}^2 \le \frac{3}{2}\|\hat\theta_{k_1k_2m}-\theta\|_{S^c}^2 + \frac{1}{2}\|\hat\theta_{\hat k_1\hat k_2\hat M}-\theta\|_{S^c}^2 + 4C^2(M^2\vee\sigma^2)\frac{\log(n_1+n_2)}{p}.$

Rearranging the above inequality, we have

$\|\hat\theta_{\hat k_1\hat k_2\hat M}-\theta\|_{S^c}^2 \le 3\|\hat\theta_{k_1k_2m}-\theta\|_{S^c}^2 + 8C^2(M^2\vee\sigma^2)\frac{\log(n_1+n_2)}{p}.$

A symmetric argument leads to

$\|\hat\theta^c_{\hat k_1\hat k_2\hat M}-\theta\|_{S}^2 \le 3\|\hat\theta^c_{k_1k_2m}-\theta\|_{S}^2 + 8C^2(M^2\vee\sigma^2)\frac{\log(n_1+n_2)}{p}.$

Summing up the above two inequalities, we have

$\|\hat\theta-\theta\|^2 \le 3\big(\|\hat\theta_{k_1k_2m}-\theta\|^2 + \|\hat\theta^c_{k_1k_2m}-\theta\|^2\big) + 16C^2(M^2\vee\sigma^2)\frac{\log(n_1+n_2)}{p}.$  (20)

Using Theorem 1, both $\|\hat\theta_{k_1k_2m}-\theta\|^2$ and $\|\hat\theta^c_{k_1k_2m}-\theta\|^2$ can be bounded by $C\frac{m^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2)$. Given that $m = M\big(1+\frac{m-M}{M}\big)\le 2M$ by the choice of $m$, the proof is complete.

7.3 Proof of Theorem 6

Recall the augmented data $Y_{ij} = X_{ij}E_{ij}/p$ and define $\tilde Y_{ij} = X_{ij}E_{ij}/\hat p$. We give two lemmas to facilitate the proof.

Lemma 14. Assume $\log(n_1+n_2)\le pn_1n_2$. For any $C'>0$, there is some constant $C>0$ such that

$\|Y-\tilde Y\|^2 \le C\big[M^2+\sigma^2\log(n_1+n_2)\big]\frac{\log(n_1+n_2)}{p^2},$

with probability at least $1-(n_1n_2)^{-C'}$.

Lemma 15. With $Y$ replaced by $\tilde Y$ in (10), the inequalities in Lemmas 10, 11 and 12 continue to hold with the bounds

$C_1\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2) + 2\|Y-\tilde Y\|^2,$

$\sqrt{C_2\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2)} + \|Y-\tilde Y\|,$ and

$C_3\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2) + \|\hat\theta-\bar\theta\|\,\|Y-\tilde Y\|,$

respectively.

Proof of Theorem 6. The proof is similar to that of Theorem 1. We only need to replace Lemmas 10, 11 and 12 by Lemma 14 and Lemma 15 to get the desired result.

7.4 Proofs of Theorem 3 and Theorem 5

This section gives the proofs of the minimax lower bounds. We first introduce some notation. For any probability measures $P$ and $Q$, define the Kullback-Leibler divergence by $D(P\|Q) = \int\log\big(\frac{dP}{dQ}\big)dP$ and the chi-squared divergence by $\chi^2(P\|Q) = \int\big(\frac{dP}{dQ}\big)dP - 1$. The main tool we use is the following proposition.

Proposition 16. Let $(\Xi,\ell)$ be a metric space and $\{P_\xi:\xi\in\Xi\}$ be a collection of probability measures. For any totally bounded $T\subset\Xi$, define the Kullback-Leibler diameter and the chi-squared diameter of $T$ by

$d_{KL}(T) = \sup_{\xi,\xi'\in T}D(P_\xi\|P_{\xi'}),\qquad d_{\chi^2}(T) = \sup_{\xi,\xi'\in T}\chi^2(P_\xi\|P_{\xi'}).$

Then

$\inf_{\hat\xi}\sup_{\xi\in\Xi}P_\xi\Big\{\ell^2(\hat\xi(X),\xi)\ge\frac{\epsilon^2}{4}\Big\} \ge 1-\frac{d_{KL}(T)+\log 2}{\log M(\epsilon,T,\ell)},$  (21)

$\inf_{\hat\xi}\sup_{\xi\in\Xi}P_\xi\Big\{\ell^2(\hat\xi(X),\xi)\ge\frac{\epsilon^2}{4}\Big\} \ge 1-\frac{1}{M(\epsilon,T,\ell)}-\sqrt{\frac{d_{\chi^2}(T)}{M(\epsilon,T,\ell)}},$  (22)

for any $\epsilon>0$, where the packing number $M(\epsilon,T,\ell)$ is the largest number of points in $T$ that are at least $\epsilon$ away from each other.

The inequality (21) is the classical Fano inequality; the version we present here is from Yu (1997). The inequality (22) is a generalization of the classical Fano inequality that uses the chi-squared divergence instead of the KL divergence; it is due to Guntuboyina (2011). The following proposition bounds the KL divergence and the chi-squared divergence for both the Gaussian and Bernoulli models.

Proposition 17. For the Gaussian model, we have

$D\big(P_{(\theta,\sigma^2,p)}\|P_{(\theta',\sigma^2,p)}\big) \le \frac{p}{2\sigma^2}\|\theta-\theta'\|^2,\qquad \chi^2\big(P_{(\theta,\sigma^2,p)}\|P_{(\theta',\sigma^2,p)}\big) \le \exp\Big(\frac{p}{\sigma^2}\|\theta-\theta'\|^2\Big)-1.$

For the Bernoulli model with any $\theta,\theta'\in[\rho/2,3\rho/4]^{n_1\times n_2}$, we have

$D\big(P_{(\theta,p)}\|P_{(\theta',p)}\big) \le \frac{8p}{\rho}\|\theta-\theta'\|^2,\qquad \chi^2\big(P_{(\theta,p)}\|P_{(\theta',p)}\big) \le \exp\Big(\frac{8p}{\rho}\|\theta-\theta'\|^2\Big)-1.$

Finally, we need the following Varshamov-Gilbert bound. The version we present here is due to Massart (2007, Lemma 4.7).

Lemma 18. There exists a subset $\{\omega_1,\dots,\omega_N\}\subset\{0,1\}^d$ such that

$H(\omega_i,\omega_j) \triangleq \|\omega_i-\omega_j\|^2 \ge \frac{d}{4},\quad\text{for any } i\ne j\in[N],$  (23)

for some $N\ge\exp(d/8)$.

Proof of Theorem 3. We focus on the proof for the asymmetric parameter space $\Theta_{k_1k_2}(M)$. The result for the symmetric parameter space $\Theta^s_k(M)$ can be obtained by letting $k_1=k_2$ and by taking care of the diagonal entries. We assume $n_1/k_1$ and $n_2/k_2$ are integers without loss of generality. We first derive the lower bound for the nonparametric rate $\sigma^2k_1k_2/p$. Fix the labels $z_1(i) = \lceil ik_1/n_1\rceil$ and $z_2(j) = \lceil jk_2/n_2\rceil$. For any $\omega\in\{0,1\}^{k_1\times k_2}$, define

$Q^\omega_{ab} = c\sqrt{\frac{\sigma^2k_1k_2}{p\,n_1n_2}}\,\omega_{ab}.$  (24)

By Lemma 18, there exists some $T\subset\{0,1\}^{k_1k_2}$ such that $|T|\ge\exp(k_1k_2/8)$ and $H(\omega,\omega')\ge k_1k_2/4$ for any $\omega\ne\omega'\in T$. We construct the subspace

$\Theta(z_1,z_2,T) = \big\{\theta\in R^{n_1\times n_2}:\ \theta_{ij} = Q^\omega_{z_1(i)z_2(j)},\ \omega\in T\big\}.$

By Proposition 17, we have

$\sup_{\theta,\theta'\in\Theta(z_1,z_2,T)}\chi^2\big(P_{(\theta,\sigma^2,p)}\|P_{(\theta',\sigma^2,p)}\big) \le \exp(c^2k_1k_2).$

For any two different $\theta$ and $\theta'$ in $\Theta(z_1,z_2,T)$ associated with $\omega,\omega'\in T$, we have

$\|\theta-\theta'\|^2 = \frac{c^2\sigma^2}{p}H(\omega,\omega') \ge \frac{c^2\sigma^2}{4p}k_1k_2.$

Therefore, $M\big(\frac{c}{2}\sqrt{\sigma^2k_1k_2/p},\,\Theta(z_1,z_2,T),\,\|\cdot\|\big)\ge\exp(k_1k_2/8)$. Using (22) with an appropriately chosen $c$, we obtain the rate $\frac{\sigma^2k_1k_2}{p}$ in the lower bound.

Now let us derive the clustering rate $\frac{\sigma^2n_2\log k_2}{p}$. Pick $\omega_1,\dots,\omega_{k_2}\in\{0,1\}^{k_1}$ such that $H(\omega_a,\omega_b)\ge k_1/4$ for all $a\ne b$. By Lemma 18, this is possible when $\exp(k_1/8)\ge k_2$. Then, define the columns of $Q$ by

$Q_{\cdot a} = c\sqrt{\frac{\sigma^2n_2\log k_2}{p\,n_1n_2}}\,\omega_a.$  (25)

Define $z_1$ by $z_1(i) = \lceil ik_1/n_1\rceil$. Fix $Q$ and $z_1$ and let $z_2$ vary. Select a set $Z_2\subset[k_2]^{n_2}$ such that $|Z_2|\ge\exp(cn_2\log k_2)$ and $H(z_2,z_2')\ge n_2/6$ for any $z_2\ne z_2'\in Z_2$. The existence of such a $Z_2$ is proved in Gao et al. (2015a). Then, the subspace we consider is

$\Theta(z_1,Z_2,Q) = \big\{\theta\in R^{n_1\times n_2}:\ \theta_{ij} = Q_{z_1(i)z_2(j)},\ z_2\in Z_2\big\}.$

By Proposition 17, we have

$\sup_{\theta,\theta'\in\Theta(z_1,Z_2,Q)}D\big(P_{(\theta,\sigma^2,p)}\|P_{(\theta',\sigma^2,p)}\big) \le c^2n_2\log k_2.$

For any two different $\theta$ and $\theta'$ in $\Theta(z_1,Z_2,Q)$ associated with $z_2,z_2'\in Z_2$, we have

$\|\theta-\theta'\|^2 = \frac{n_1}{k_1}\sum_{j=1}^{n_2}\|Q_{\cdot z_2(j)}-Q_{\cdot z_2'(j)}\|^2 \ge \frac{n_1}{k_1}\,H(z_2,z_2')\,\frac{c^2\sigma^2n_2\log k_2}{p\,n_1n_2}\cdot\frac{k_1}{4} \ge \frac{c^2\sigma^2n_2\log k_2}{24\,p}.$

Therefore, $M\big(c\sqrt{\sigma^2n_2\log k_2/(24p)},\,\Theta(z_1,Z_2,Q),\,\|\cdot\|\big)\ge\exp(cn_2\log k_2)$. Using (21) with an appropriate $c$, we obtain the lower bound $\frac{\sigma^2n_2\log k_2}{p}$. A symmetric argument gives the rate $\frac{\sigma^2n_1\log k_1}{p}$. Combining the three parts using the same argument as in Gao et al. (2015a), the proof is complete.

Proof of Theorem 5. The proof is similar to that of Theorem 3. The only differences are that (24) is replaced by

$Q^\omega_{ab} = \frac{1}{2}\rho + c\Big(\sqrt{\frac{\rho k^2}{p\,n^2}}\wedge\rho\Big)\omega_{ab}$

and (25) is replaced by

$Q_{\cdot a} = \frac{1}{2}\rho + c\Big(\sqrt{\frac{\rho\log k}{p\,n}}\wedge\frac{1}{2}\rho\Big)\omega_a.$

It is easy to check that the constructed subspaces are subsets of $\Theta^+_k(\rho)$. Then, a symmetric modification of the proof of Theorem 3 leads to the desired conclusion.

7.5 Proofs of Corollary 8 and Corollary 9

The result of Corollary 8 can be derived through a standard bias-variance tradeoff argument by combining Corollary 4 and Lemma 2.1 in Gao et al. (2015a). The result of Corollary 9 follows from Theorem 7. By inspecting the proof of Theorem 7, (20) holds for all $k$. Choosing the best $k$ to trade off bias and variance gives the result of Corollary 9. We omit the details here.

Acknowledgments

The research of CG, YL and HHZ is supported in part by NSF grant DMS. The research of ZM is supported in part by NSF CAREER grant DMS.

Appendix A. Proofs of auxiliary results

In this section, we give the proofs of the auxiliary lemmas. We first introduce some notation. Define the set $Z_{k_1k_2} = \{z = (z_1,z_2): z_1\in[k_1]^{n_1}, z_2\in[k_2]^{n_2}\}$. For a matrix $G\in R^{n_1\times n_2}$ and some $z = (z_1,z_2)\in Z_{k_1k_2}$, define

$\bar G_{ab}(z) = \frac{1}{|z_1^{-1}(a)||z_2^{-1}(b)|}\sum_{(i,j)\in z_1^{-1}(a)\times z_2^{-1}(b)}G_{ij},$

for all $a\in[k_1]$, $b\in[k_2]$. To facilitate the proofs, we need the following two results.

Proposition 19. For the estimator $\hat\theta_{ij} = \hat Q_{\hat z_1(i)\hat z_2(j)}$, we have

$\hat Q_{ab} = \mathrm{sign}\big(\bar Y_{ab}(\hat z)\big)\big(|\bar Y_{ab}(\hat z)|\wedge M\big),$

for all $a\in[k_1]$, $b\in[k_2]$.

Lemma 20. Under the setting of Lemma 13, define $S = \frac{1}{\sqrt n}\sum_{i=1}^n(Y_i-\theta_i)$ and $\tau = 2(M^2+2\sigma^2)/(M\vee\sigma)$. Then we have the following results:

a. Let $T = S\,1\{|S|\le\tau\sqrt n\}$; then $Ee^{pT^2/(8(M^2+2\sigma^2))}\le 5$.
b. Let $R = \tau\sqrt n\,|S|\,1\{|S|>\tau\sqrt n\}$; then $Ee^{pR/(8(M^2+2\sigma^2))}\le 9$.

Proof. By (19),

$P(|S|>t) \le 4\exp\Big\{-\min\Big(\frac{t^2p}{4(M^2+2\sigma^2)},\,\frac{\sqrt n\,t\,p}{2(M\vee\sigma)}\Big)\Big\}.$

Then

$Ee^{\lambda T^2} = 1+\int_1^{\infty}P(e^{\lambda T^2}>u)\,du \le 1+\int_1^{e^{\lambda\tau^2n}}P\Big(|S|>\sqrt{\tfrac{\log u}{\lambda}}\Big)du \le 1+4\int_1^{e^{\lambda\tau^2n}}u^{-p/(4\lambda(M^2+2\sigma^2))}\,du.$

Choosing $\lambda = p/(8(M^2+2\sigma^2))$, we get $Ee^{pT^2/(8(M^2+2\sigma^2))}\le 5$. We proceed to prove the second claim:

$Ee^{\lambda R} = P(R=0) + P(R>0)\,E[e^{\lambda R}\mid R>0]$
$= P(R=0) + P(R>0)\int_0^{\infty}P(e^{\lambda R}>u\mid R>0)\,du$
$\le P(R=0) + P(R>0)\,e^{\lambda\tau^2n} + \int_{e^{\lambda\tau^2n}}^{\infty}P(e^{\lambda R}>u)\,du$
$\le 1 + 4e^{-p\tau^2n/(4(M^2+2\sigma^2))+\lambda\tau^2n} + 4\int_{e^{\lambda\tau^2n}}^{\infty}u^{-p/(2\lambda\tau(M\vee\sigma))}\,du.$

Choosing $\lambda = p/(8(M^2+2\sigma^2))$, we get $Ee^{pR/(8(M^2+2\sigma^2))}\le 9$.

Proof of Lemma 10. By the definitions of $\hat\theta_{ij}$ and $\bar\theta_{ij}$ and Proposition 19, for any $(i,j)\in\hat z_1^{-1}(a)\times\hat z_2^{-1}(b)$ we have

$\hat\theta_{ij}-\bar\theta_{ij} = \begin{cases} M-\bar\theta_{ab}(\hat z), & \text{if } \bar Y_{ab}(\hat z)\ge M;\\ \bar Y_{ab}(\hat z)-\bar\theta_{ab}(\hat z), & \text{if } -M\le\bar Y_{ab}(\hat z)<M;\\ -M-\bar\theta_{ab}(\hat z), & \text{if } \bar Y_{ab}(\hat z)<-M. \end{cases}$

Define $W = Y-\theta$; it is easy to check that

$|\hat\theta_{ij}-\bar\theta_{ij}| \le |\bar W_{ab}(\hat z)|\wedge 2M \le |\bar W_{ab}(\hat z)|\wedge\tau,$

where $\hat z = (\hat z_1,\hat z_2)$ and $\tau$ is defined in Lemma 20. Then

$\|\hat\theta-\bar\theta\|^2 \le \sum_{a,b}|\hat z_1^{-1}(a)||\hat z_2^{-1}(b)|\big(|\bar W_{ab}(\hat z)|\wedge\tau\big)^2 \le \max_{z\in Z_{k_1k_2}}\sum_{a,b}|z_1^{-1}(a)||z_2^{-1}(b)|\big(|\bar W_{ab}(z)|\wedge\tau\big)^2.$  (26)

For any $a\in[k_1]$, $b\in[k_2]$ and $z_1\in[k_1]^{n_1}$, $z_2\in[k_2]^{n_2}$, define $n_1(a) = |z_1^{-1}(a)|$, $n_2(b) = |z_2^{-1}(b)|$ and

$V_{ab}(z) = \sqrt{n_1(a)n_2(b)}\,|\bar W_{ab}(z)|\,1\{|\bar W_{ab}(z)|\le\tau\},\qquad R_{ab}(z) = n_1(a)n_2(b)\,\tau\,|\bar W_{ab}(z)|\,1\{|\bar W_{ab}(z)|>\tau\}.$

Then,

$\|\hat\theta-\bar\theta\|^2 \le \max_{z\in Z_{k_1k_2}}\sum_{a,b}\big(V_{ab}^2(z)+R_{ab}(z)\big).$  (27)

By Markov's inequality and Lemma 20, we have

$P\Big(\sum_{a,b}V_{ab}^2(z)>t\Big) \le e^{-\frac{pt}{8(M^2+2\sigma^2)}}\prod_{a,b}Ee^{\frac{pV_{ab}^2(z)}{8(M^2+2\sigma^2)}} \le \exp\Big\{-\frac{pt}{8(M^2+2\sigma^2)}+k_1k_2\log 5\Big\},$

and

$P\Big(\sum_{a,b}R_{ab}(z)>t\Big) \le e^{-\frac{pt}{8(M^2+2\sigma^2)}}\prod_{a,b}Ee^{\frac{pR_{ab}(z)}{8(M^2+2\sigma^2)}} \le \exp\Big\{-\frac{pt}{8(M^2+2\sigma^2)}+k_1k_2\log 9\Big\}.$

Applying a union bound and using the fact that $\log|[k_1]^{n_1}|+\log|[k_2]^{n_2}| = n_1\log k_1+n_2\log k_2$,

$P\Big(\max_{z\in Z_{k_1k_2}}\sum_{a,b}V_{ab}^2(z)>t\Big) \le \exp\Big\{-\frac{pt}{8(M^2+2\sigma^2)}+k_1k_2\log 5+n_1\log k_1+n_2\log k_2\Big\}.$

For any given constant $C'>0$, we choose $t = C_1\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2)$ for some sufficiently large $C_1>0$ to obtain

$\max_{z\in Z_{k_1k_2}}\sum_{a,b}V_{ab}^2(z) \le C_1\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2)$  (28)

with probability at least $1-\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))$. Similarly, for some sufficiently large $C_2>0$, we have

$\max_{z\in Z_{k_1k_2}}\sum_{a,b}R_{ab}(z) \le C_2\frac{M^2\vee\sigma^2}{p}(k_1k_2+n_1\log k_1+n_2\log k_2)$  (29)

with probability at least $1-\exp(-C'(k_1k_2+n_1\log k_1+n_2\log k_2))$. Plugging (28) and (29) into (27), we complete the proof.

Proof of Lemma 11. Note that

$\bar\theta_{ij}-\theta_{ij} = \sum_{a,b}\bar\theta_{ab}(\hat z)1\{(i,j)\in\hat z_1^{-1}(a)\times\hat z_2^{-1}(b)\}-\theta_{ij}$

is a function of $\hat z_1$ and $\hat z_2$. Then we have

$\Big\langle\frac{\bar\theta-\theta}{\|\bar\theta-\theta\|},\,Y-\theta\Big\rangle = \frac{\sum_{ij}(\bar\theta_{ij}-\theta_{ij})(Y_{ij}-\theta_{ij})}{\sqrt{\sum_{ij}(\bar\theta_{ij}-\theta_{ij})^2}} \le \max_{z\in Z_{k_1k_2}}\sum_{ij}\gamma_{ij}(z)(Y_{ij}-\theta_{ij}),$

where

$\gamma_{ij}(z) \propto \sum_{a,b}\bar\theta_{ab}(z)1\{(i,j)\in z_1^{-1}(a)\times z_2^{-1}(b)\}-\theta_{ij}$

is normalized so that $\sum_{ij}\gamma_{ij}(z)^2 = 1$. On the event $\|\bar\theta-\theta\|^2\ge C_2(M^2\vee\sigma^2)(k_1k_2+n_1\log k_1+n_2\log k_2)/p$, for some $C_2$ to be specified later, we have

$|\gamma_{ij}(z)| \le \frac{2M}{\|\bar\theta-\theta\|} \le \sqrt{\frac{4M^2p}{C_2(M^2\vee\sigma^2)(k_1k_2+n_1\log k_1+n_2\log k_2)}}.$

By Lemma 13 and a union bound, we have

$P\Big(\max_{z\in Z_{k_1k_2}}\sum_{ij}\gamma_{ij}(z)(Y_{ij}-\theta_{ij})>t\Big) \le \sum_{z_1\in[k_1]^{n_1},\,z_2\in[k_2]^{n_2}}P\Big(\sum_{ij}\gamma_{ij}(z)(Y_{ij}-\theta_{ij})>t\Big) \le \exp\big(-C'(k_1k_2+n_1\log k_1+n_2\log k_2)\big),$

by setting $t = \sqrt{C_2(M^2\vee\sigma^2)(k_1k_2+n_1\log k_1+n_2\log k_2)/p}$ for some sufficiently large $C_2$ depending on $C'$. Thus, the lemma is proved.

Proof of Lemma 12. By definition,

$\langle\hat\theta-\bar\theta,\,Y-\theta\rangle = \sum_{a,b}\Big(\mathrm{sign}\big(\bar Y_{ab}(\hat z)\big)\big(|\bar Y_{ab}(\hat z)|\wedge M\big)-\bar\theta_{ab}(\hat z)\Big)\bar W_{ab}(\hat z)\,|\hat z_1^{-1}(a)||\hat z_2^{-1}(b)|$
$\le \max_{z\in Z_{k_1k_2}}\sum_{a,b}\Big(\mathrm{sign}\big(\bar Y_{ab}(z)\big)\big(|\bar Y_{ab}(z)|\wedge M\big)-\bar\theta_{ab}(z)\Big)\bar W_{ab}(z)\,|z_1^{-1}(a)||z_2^{-1}(b)|.$

By definition, we have

$\Big(\mathrm{sign}\big(\bar Y_{ab}(z)\big)\big(|\bar Y_{ab}(z)|\wedge M\big)-\bar\theta_{ab}(z)\Big)\bar W_{ab}(z) \le \bar W_{ab}(z)^2\wedge\tau|\bar W_{ab}(z)|.$

For any fixed $z_1\in[k_1]^{n_1}$, $z_2\in[k_2]^{n_2}$, define $n_1(a) = |z_1^{-1}(a)|$ for $a\in[k_1]$, $n_2(b) = |z_2^{-1}(b)|$ for $b\in[k_2]$ and

$V_{ab}(z) = \sqrt{n_1(a)n_2(b)}\,|\bar W_{ab}(z)|\,1\{|\bar W_{ab}(z)|\le\tau\},\qquad R_{ab}(z) = \tau\,n_1(a)n_2(b)\,|\bar W_{ab}(z)|\,1\{|\bar W_{ab}(z)|>\tau\}.$

Then

$\langle\hat\theta-\bar\theta,\,Y-\theta\rangle \le \max_{z\in Z_{k_1k_2}}\sum_{a,b}\big(V_{ab}^2(z)+R_{ab}(z)\big).$

Following the same argument as in the proof of Lemma 10, a choice of $t = C_3(M^2\vee\sigma^2)(k_1k_2+n_1\log k_1+n_2\log k_2)/p$ for some sufficiently large $C_3>0$ completes the proof.

Proof of Lemma 13. When $|\lambda|\le p/(M\vee\sigma)$, we have $|\lambda\theta_i|/p\le 1$ and $\lambda^2\sigma^2/p^2\le 1$. Then

$Ee^{\lambda(Y_i-\theta_i)} = p\,Ee^{\lambda(X_i/p-\theta_i)} + (1-p)e^{-\lambda\theta_i}$
$\le p\,e^{\frac{\lambda^2\sigma^2}{2p^2}}e^{\frac{(1-p)\lambda\theta_i}{p}} + (1-p)e^{-\lambda\theta_i}$
$\le p\Big(1+\frac{\lambda^2\sigma^2}{p^2}\Big)\Big(1+\frac{(1-p)\lambda\theta_i}{p}+\frac{2(1-p)^2\lambda^2\theta_i^2}{p^2}\Big) + (1-p)\big(1-\lambda\theta_i+2\lambda^2\theta_i^2\big)$
$= 1+\frac{2(1-p)\lambda^2\theta_i^2}{p}+\frac{\sigma^2\lambda^2}{p}+\frac{(1-p)\lambda^3\theta_i\sigma^2}{p^2}+\frac{2(1-p)^2\lambda^4\sigma^2\theta_i^2}{p^3}$
$\le 1+\frac{(2M^2+\sigma^2)\lambda^2}{p}+\frac{|\lambda|^3|\theta_i|\sigma^2}{p^2}+\frac{2\lambda^4\sigma^2\theta_i^2}{p^3}$
$\le 1+\frac{(2M^2+\sigma^2)\lambda^2}{p}+\frac{\lambda^2\sigma^2}{p}+\frac{2\lambda^2\sigma^2}{p} = 1+\frac{(2M^2+4\sigma^2)\lambda^2}{p} \le 2e^{(M^2+2\sigma^2)\lambda^2/p}.$

The second inequality is due to the facts that $e^x\le 1+2x$ for all $0\le x\le 1$ and $e^x\le 1+x+2x^2$ for all $|x|\le 1$. Then for $|\lambda|\le p/((M\vee\sigma)\|c\|_\infty)$, Markov's inequality implies

$P\Big(\sum_{i=1}^n c_i(Y_i-\theta_i)\ge t\Big) \le 2\exp\Big\{-\lambda t+\frac{\lambda^2(M^2+2\sigma^2)}{p}\Big\}.$

By choosing $\lambda = \min\big\{\frac{tp}{2(M^2+2\sigma^2)},\,\frac{p}{(M\vee\sigma)\|c\|_\infty}\big\}$, we get (19).

Proof of Corollary 4. Consider independent Bernoulli random variables $X_i\sim\mathrm{Ber}(\theta_i)$ with $\theta_i\in[0,\rho]$ for $i\in[n]$. Let $Y_i = X_iE_i/p$, where $\{E_i\}$ are independent Bernoulli$(p)$ random variables, independent of $\{X_i\}$. Note that $EY_i = \theta_i$, $EY_i^2\le\rho/p$ and $|Y_i|\le 1/p$. Then Bernstein's inequality (Massart, 2007, Corollary 2.10) implies

$P\Big\{\Big|\frac{1}{\sqrt n}\sum_{i=1}^n(Y_i-\theta_i)\Big|\ge t\Big\} \le 2\exp\Big\{-\min\Big(\frac{t^2p}{4\rho},\,\frac{3\sqrt n\,t\,p}{4}\Big)\Big\}$  (30)

for any $t>0$. Let $S = \frac{1}{\sqrt n}\sum_{i=1}^n(Y_i-\theta_i)$, $T = S\,1\{|S|\le 3\rho\sqrt n\}$ and $R = 3\rho\sqrt n\,|S|\,1\{|S|>3\rho\sqrt n\}$. Following the same arguments as in the proof of Lemma 20, we have $Ee^{pT^2/(8\rho)}\le 5$ and $Ee^{pR/(8\rho)}\le 9$. Consequently, Lemma 10, Lemma 11 and Lemma 12 hold for the Bernoulli case. The rest of the proof then follows from the proof of Theorem 1.

Proof of Lemma 14. By the definitions of $Y$ and $\tilde Y$, we have

$\|Y-\tilde Y\|^2 \le \Big(\frac{1}{\hat p}-\frac{1}{p}\Big)^2\,\max_{i,j}X_{ij}^2\,\sum_{ij}E_{ij}.$

Therefore, it is sufficient to bound the three terms. For the first term, we have

$\Big|\frac{1}{\hat p}-\frac{1}{p}\Big| = \frac{|\hat p-p|}{\hat p\,p}.$

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Elementary Analysis in Q p

Elementary Analysis in Q p Elementary Analysis in Q Hannah Hutter, May Szedlák, Phili Wirth November 17, 2011 This reort follows very closely the book of Svetlana Katok 1. 1 Sequences and Series In this section we will see some

More information

Approximating min-max k-clustering

Approximating min-max k-clustering Aroximating min-max k-clustering Asaf Levin July 24, 2007 Abstract We consider the roblems of set artitioning into k clusters with minimum total cost and minimum of the maximum cost of a cluster. The cost

More information

Estimation of the large covariance matrix with two-step monotone missing data

Estimation of the large covariance matrix with two-step monotone missing data Estimation of the large covariance matrix with two-ste monotone missing data Masashi Hyodo, Nobumichi Shutoh 2, Takashi Seo, and Tatjana Pavlenko 3 Deartment of Mathematical Information Science, Tokyo

More information

where x i is the ith coordinate of x R N. 1. Show that the following upper bound holds for the growth function of H:

where x i is the ith coordinate of x R N. 1. Show that the following upper bound holds for the growth function of H: Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 2 October 25, 2017 Due: November 08, 2017 A. Growth function Growth function of stum functions.

More information

Best approximation by linear combinations of characteristic functions of half-spaces

Best approximation by linear combinations of characteristic functions of half-spaces Best aroximation by linear combinations of characteristic functions of half-saces Paul C. Kainen Deartment of Mathematics Georgetown University Washington, D.C. 20057-1233, USA Věra Kůrková Institute of

More information

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces Abstract and Alied Analysis Volume 2012, Article ID 264103, 11 ages doi:10.1155/2012/264103 Research Article An iterative Algorithm for Hemicontractive Maings in Banach Saces Youli Yu, 1 Zhitao Wu, 2 and

More information

Asymptotically Optimal Simulation Allocation under Dependent Sampling

Asymptotically Optimal Simulation Allocation under Dependent Sampling Asymtotically Otimal Simulation Allocation under Deendent Samling Xiaoing Xiong The Robert H. Smith School of Business, University of Maryland, College Park, MD 20742-1815, USA, xiaoingx@yahoo.com Sandee

More information

4. Score normalization technical details We now discuss the technical details of the score normalization method.

4. Score normalization technical details We now discuss the technical details of the score normalization method. SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament We begin by giving an overview of the changes to scoring and a non-technical descrition of the scoring rules

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

A Note on Guaranteed Sparse Recovery via l 1 -Minimization

A Note on Guaranteed Sparse Recovery via l 1 -Minimization A Note on Guaranteed Sarse Recovery via l -Minimization Simon Foucart, Université Pierre et Marie Curie Abstract It is roved that every s-sarse vector x C N can be recovered from the measurement vector

More information

Applications to stochastic PDE

Applications to stochastic PDE 15 Alications to stochastic PE In this final lecture we resent some alications of the theory develoed in this course to stochastic artial differential equations. We concentrate on two secific examles:

More information

Convex Optimization methods for Computing Channel Capacity

Convex Optimization methods for Computing Channel Capacity Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem

More information

Sums of independent random variables

Sums of independent random variables 3 Sums of indeendent random variables This lecture collects a number of estimates for sums of indeendent random variables with values in a Banach sace E. We concentrate on sums of the form N γ nx n, where

More information

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)]

LECTURE 7 NOTES. x n. d x if. E [g(x n )] E [g(x)] LECTURE 7 NOTES 1. Convergence of random variables. Before delving into the large samle roerties of the MLE, we review some concets from large samle theory. 1. Convergence in robability: x n x if, for

More information

Stochastic integration II: the Itô integral

Stochastic integration II: the Itô integral 13 Stochastic integration II: the Itô integral We have seen in Lecture 6 how to integrate functions Φ : (, ) L (H, E) with resect to an H-cylindrical Brownian motion W H. In this lecture we address the

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Numerous alications in statistics, articularly in the fitting of linear models. Notation and conventions: Elements of a matrix A are denoted by a ij, where i indexes the rows and

More information

Improved Capacity Bounds for the Binary Energy Harvesting Channel

Improved Capacity Bounds for the Binary Energy Harvesting Channel Imroved Caacity Bounds for the Binary Energy Harvesting Channel Kaya Tutuncuoglu 1, Omur Ozel 2, Aylin Yener 1, and Sennur Ulukus 2 1 Deartment of Electrical Engineering, The Pennsylvania State University,

More information

MATH 2710: NOTES FOR ANALYSIS

MATH 2710: NOTES FOR ANALYSIS MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

Analysis of some entrance probabilities for killed birth-death processes

Analysis of some entrance probabilities for killed birth-death processes Analysis of some entrance robabilities for killed birth-death rocesses Master s Thesis O.J.G. van der Velde Suervisor: Dr. F.M. Sieksma July 5, 207 Mathematical Institute, Leiden University Contents Introduction

More information

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables Imroved Bounds on Bell Numbers and on Moments of Sums of Random Variables Daniel Berend Tamir Tassa Abstract We rovide bounds for moments of sums of sequences of indeendent random variables. Concentrating

More information

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales Lecture 6 Classification of states We have shown that all states of an irreducible countable state Markov chain must of the same tye. This gives rise to the following classification. Definition. [Classification

More information

Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various Families of R n Norms and Some Open Problems

Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various Families of R n Norms and Some Open Problems Int. J. Oen Problems Comt. Math., Vol. 3, No. 2, June 2010 ISSN 1998-6262; Coyright c ICSRS Publication, 2010 www.i-csrs.org Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various

More information

An Inverse Problem for Two Spectra of Complex Finite Jacobi Matrices

An Inverse Problem for Two Spectra of Complex Finite Jacobi Matrices Coyright 202 Tech Science Press CMES, vol.86, no.4,.30-39, 202 An Inverse Problem for Two Sectra of Comlex Finite Jacobi Matrices Gusein Sh. Guseinov Abstract: This aer deals with the inverse sectral roblem

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.

More information

Cryptanalysis of Pseudorandom Generators

Cryptanalysis of Pseudorandom Generators CSE 206A: Lattice Algorithms and Alications Fall 2017 Crytanalysis of Pseudorandom Generators Instructor: Daniele Micciancio UCSD CSE As a motivating alication for the study of lattice in crytograhy we

More information

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM JOHN BINDER Abstract. In this aer, we rove Dirichlet s theorem that, given any air h, k with h, k) =, there are infinitely many rime numbers congruent to

More information

Distributed Rule-Based Inference in the Presence of Redundant Information

Distributed Rule-Based Inference in the Presence of Redundant Information istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced

More information

HENSEL S LEMMA KEITH CONRAD

HENSEL S LEMMA KEITH CONRAD HENSEL S LEMMA KEITH CONRAD 1. Introduction In the -adic integers, congruences are aroximations: for a and b in Z, a b mod n is the same as a b 1/ n. Turning information modulo one ower of into similar

More information

Brownian Motion and Random Prime Factorization

Brownian Motion and Random Prime Factorization Brownian Motion and Random Prime Factorization Kendrick Tang June 4, 202 Contents Introduction 2 2 Brownian Motion 2 2. Develoing Brownian Motion.................... 2 2.. Measure Saces and Borel Sigma-Algebras.........

More information

General Linear Model Introduction, Classes of Linear models and Estimation

General Linear Model Introduction, Classes of Linear models and Estimation Stat 740 General Linear Model Introduction, Classes of Linear models and Estimation An aim of scientific enquiry: To describe or to discover relationshis among events (variables) in the controlled (laboratory)

More information

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK Towards understanding the Lorenz curve using the Uniform distribution Chris J. Stehens Newcastle City Council, Newcastle uon Tyne, UK (For the Gini-Lorenz Conference, University of Siena, Italy, May 2005)

More information

On the capacity of the general trapdoor channel with feedback

On the capacity of the general trapdoor channel with feedback On the caacity of the general tradoor channel with feedback Jui Wu and Achilleas Anastasooulos Electrical Engineering and Comuter Science Deartment University of Michigan Ann Arbor, MI, 48109-1 email:

More information

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material Robustness of classifiers to uniform l and Gaussian noise Sulementary material Jean-Yves Franceschi Ecole Normale Suérieure de Lyon LIP UMR 5668 Omar Fawzi Ecole Normale Suérieure de Lyon LIP UMR 5668

More information

THE 2D CASE OF THE BOURGAIN-DEMETER-GUTH ARGUMENT

THE 2D CASE OF THE BOURGAIN-DEMETER-GUTH ARGUMENT THE 2D CASE OF THE BOURGAIN-DEMETER-GUTH ARGUMENT ZANE LI Let e(z) := e 2πiz and for g : [0, ] C and J [0, ], define the extension oerator E J g(x) := g(t)e(tx + t 2 x 2 ) dt. J For a ositive weight ν

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

ON THE NORM OF AN IDEMPOTENT SCHUR MULTIPLIER ON THE SCHATTEN CLASS

ON THE NORM OF AN IDEMPOTENT SCHUR MULTIPLIER ON THE SCHATTEN CLASS PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 00, Number 0, Pages 000 000 S 000-9939XX)0000-0 ON THE NORM OF AN IDEMPOTENT SCHUR MULTIPLIER ON THE SCHATTEN CLASS WILLIAM D. BANKS AND ASMA HARCHARRAS

More information

B8.1 Martingales Through Measure Theory. Concept of independence

B8.1 Martingales Through Measure Theory. Concept of independence B8.1 Martingales Through Measure Theory Concet of indeendence Motivated by the notion of indeendent events in relims robability, we have generalized the concet of indeendence to families of σ-algebras.

More information

ECE 534 Information Theory - Midterm 2

ECE 534 Information Theory - Midterm 2 ECE 534 Information Theory - Midterm Nov.4, 009. 3:30-4:45 in LH03. You will be given the full class time: 75 minutes. Use it wisely! Many of the roblems have short answers; try to find shortcuts. You

More information

COMMUNICATION BETWEEN SHAREHOLDERS 1

COMMUNICATION BETWEEN SHAREHOLDERS 1 COMMUNICATION BTWN SHARHOLDRS 1 A B. O A : A D Lemma B.1. U to µ Z r 2 σ2 Z + σ2 X 2r ω 2 an additive constant that does not deend on a or θ, the agents ayoffs can be written as: 2r rθa ω2 + θ µ Y rcov

More information

DISCRIMINANTS IN TOWERS

DISCRIMINANTS IN TOWERS DISCRIMINANTS IN TOWERS JOSEPH RABINOFF Let A be a Dedekind domain with fraction field F, let K/F be a finite searable extension field, and let B be the integral closure of A in K. In this note, we will

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell October 25, 2009 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS #A13 INTEGERS 14 (014) ON THE LEAST SIGNIFICANT ADIC DIGITS OF CERTAIN LUCAS NUMBERS Tamás Lengyel Deartment of Mathematics, Occidental College, Los Angeles, California lengyel@oxy.edu Received: 6/13/13,

More information

Inequalities for the L 1 Deviation of the Empirical Distribution

Inequalities for the L 1 Deviation of the Empirical Distribution Inequalities for the L 1 Deviation of the Emirical Distribution Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, Marcelo J. Weinberger June 13, 2003 Abstract We derive bounds on the robability

More information

A New Perspective on Learning Linear Separators with Large L q L p Margins

A New Perspective on Learning Linear Separators with Large L q L p Margins A New Persective on Learning Linear Searators with Large L q L Margins Maria-Florina Balcan Georgia Institute of Technology Christoher Berlind Georgia Institute of Technology Abstract We give theoretical

More information

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003 SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES Gord Sinnamon The University of Western Ontario December 27, 23 Abstract. Strong forms of Schur s Lemma and its converse are roved for mas

More information

STRONG TYPE INEQUALITIES AND AN ALMOST-ORTHOGONALITY PRINCIPLE FOR FAMILIES OF MAXIMAL OPERATORS ALONG DIRECTIONS IN R 2

STRONG TYPE INEQUALITIES AND AN ALMOST-ORTHOGONALITY PRINCIPLE FOR FAMILIES OF MAXIMAL OPERATORS ALONG DIRECTIONS IN R 2 STRONG TYPE INEQUALITIES AND AN ALMOST-ORTHOGONALITY PRINCIPLE FOR FAMILIES OF MAXIMAL OPERATORS ALONG DIRECTIONS IN R 2 ANGELES ALFONSECA Abstract In this aer we rove an almost-orthogonality rincile for

More information

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning TNN-2009-P-1186.R2 1 Uncorrelated Multilinear Princial Comonent Analysis for Unsuervised Multilinear Subsace Learning Haiing Lu, K. N. Plataniotis and A. N. Venetsanooulos The Edward S. Rogers Sr. Deartment

More information

Introduction to Banach Spaces

Introduction to Banach Spaces CHAPTER 8 Introduction to Banach Saces 1. Uniform and Absolute Convergence As a rearation we begin by reviewing some familiar roerties of Cauchy sequences and uniform limits in the setting of metric saces.

More information

arxiv: v2 [stat.me] 3 Nov 2014

arxiv: v2 [stat.me] 3 Nov 2014 onarametric Stein-tye Shrinkage Covariance Matrix Estimators in High-Dimensional Settings Anestis Touloumis Cancer Research UK Cambridge Institute University of Cambridge Cambridge CB2 0RE, U.K. Anestis.Touloumis@cruk.cam.ac.uk

More information

arxiv: v2 [math.na] 6 Apr 2016

arxiv: v2 [math.na] 6 Apr 2016 Existence and otimality of strong stability reserving linear multiste methods: a duality-based aroach arxiv:504.03930v [math.na] 6 Ar 06 Adrián Németh January 9, 08 Abstract David I. Ketcheson We rove

More information

Linear diophantine equations for discrete tomography

Linear diophantine equations for discrete tomography Journal of X-Ray Science and Technology 10 001 59 66 59 IOS Press Linear diohantine euations for discrete tomograhy Yangbo Ye a,gewang b and Jiehua Zhu a a Deartment of Mathematics, The University of Iowa,

More information

IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES

IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES OHAD GILADI AND ASSAF NAOR Abstract. It is shown that if (, ) is a Banach sace with Rademacher tye 1 then for every n N there exists

More information

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation Uniformly best wavenumber aroximations by satial central difference oerators: An initial investigation Vitor Linders and Jan Nordström Abstract A characterisation theorem for best uniform wavenumber aroximations

More information

Topic: Lower Bounds on Randomized Algorithms Date: September 22, 2004 Scribe: Srinath Sridhar

Topic: Lower Bounds on Randomized Algorithms Date: September 22, 2004 Scribe: Srinath Sridhar 15-859(M): Randomized Algorithms Lecturer: Anuam Guta Toic: Lower Bounds on Randomized Algorithms Date: Setember 22, 2004 Scribe: Srinath Sridhar 4.1 Introduction In this lecture, we will first consider

More information

LORENZO BRANDOLESE AND MARIA E. SCHONBEK

LORENZO BRANDOLESE AND MARIA E. SCHONBEK LARGE TIME DECAY AND GROWTH FOR SOLUTIONS OF A VISCOUS BOUSSINESQ SYSTEM LORENZO BRANDOLESE AND MARIA E. SCHONBEK Abstract. In this aer we analyze the decay and the growth for large time of weak and strong

More information

Sampling and Distortion Tradeoffs for Bandlimited Periodic Signals

Sampling and Distortion Tradeoffs for Bandlimited Periodic Signals Samling and Distortion radeoffs for Bandlimited Periodic Signals Elaheh ohammadi and Farokh arvasti Advanced Communications Research Institute ACRI Deartment of Electrical Engineering Sharif University

More information

Sobolev Spaces with Weights in Domains and Boundary Value Problems for Degenerate Elliptic Equations

Sobolev Spaces with Weights in Domains and Boundary Value Problems for Degenerate Elliptic Equations Sobolev Saces with Weights in Domains and Boundary Value Problems for Degenerate Ellitic Equations S. V. Lototsky Deartment of Mathematics, M.I.T., Room 2-267, 77 Massachusetts Avenue, Cambridge, MA 02139-4307,

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

Extension of Minimax to Infinite Matrices

Extension of Minimax to Infinite Matrices Extension of Minimax to Infinite Matrices Chris Calabro June 21, 2004 Abstract Von Neumann s minimax theorem is tyically alied to a finite ayoff matrix A R m n. Here we show that (i) if m, n are both inite,

More information

Haar type and Carleson Constants

Haar type and Carleson Constants ariv:0902.955v [math.fa] Feb 2009 Haar tye and Carleson Constants Stefan Geiss October 30, 208 Abstract Paul F.. Müller For a collection E of dyadic intervals, a Banach sace, and,2] we assume the uer l

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

Supplementary Materials for Robust Estimation of the False Discovery Rate

Supplementary Materials for Robust Estimation of the False Discovery Rate Sulementary Materials for Robust Estimation of the False Discovery Rate Stan Pounds and Cheng Cheng This sulemental contains roofs regarding theoretical roerties of the roosed method (Section S1), rovides

More information

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS CASEY BRUCK 1. Abstract The goal of this aer is to rovide a concise way for undergraduate mathematics students to learn about how rime numbers behave

More information

On Isoperimetric Functions of Probability Measures Having Log-Concave Densities with Respect to the Standard Normal Law

On Isoperimetric Functions of Probability Measures Having Log-Concave Densities with Respect to the Standard Normal Law On Isoerimetric Functions of Probability Measures Having Log-Concave Densities with Resect to the Standard Normal Law Sergey G. Bobkov Abstract Isoerimetric inequalities are discussed for one-dimensional

More information

Maxisets for μ-thresholding rules

Maxisets for μ-thresholding rules Test 008 7: 33 349 DOI 0.007/s749-006-0035-5 ORIGINAL PAPER Maxisets for μ-thresholding rules Florent Autin Received: 3 January 005 / Acceted: 8 June 006 / Published online: March 007 Sociedad de Estadística

More information

On Character Sums of Binary Quadratic Forms 1 2. Mei-Chu Chang 3. Abstract. We establish character sum bounds of the form.

On Character Sums of Binary Quadratic Forms 1 2. Mei-Chu Chang 3. Abstract. We establish character sum bounds of the form. On Character Sums of Binary Quadratic Forms 2 Mei-Chu Chang 3 Abstract. We establish character sum bounds of the form χ(x 2 + ky 2 ) < τ H 2, a x a+h b y b+h where χ is a nontrivial character (mod ), 4

More information

On a Markov Game with Incomplete Information

On a Markov Game with Incomplete Information On a Markov Game with Incomlete Information Johannes Hörner, Dinah Rosenberg y, Eilon Solan z and Nicolas Vieille x{ January 24, 26 Abstract We consider an examle of a Markov game with lack of information

More information

The Hasse Minkowski Theorem Lee Dicker University of Minnesota, REU Summer 2001

The Hasse Minkowski Theorem Lee Dicker University of Minnesota, REU Summer 2001 The Hasse Minkowski Theorem Lee Dicker University of Minnesota, REU Summer 2001 The Hasse-Minkowski Theorem rovides a characterization of the rational quadratic forms. What follows is a roof of the Hasse-Minkowski

More information

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS #A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS Ramy F. Taki ElDin Physics and Engineering Mathematics Deartment, Faculty of Engineering, Ain Shams University, Cairo, Egyt

More information

2 Asymptotic density and Dirichlet density

2 Asymptotic density and Dirichlet density 8.785: Analytic Number Theory, MIT, sring 2007 (K.S. Kedlaya) Primes in arithmetic rogressions In this unit, we first rove Dirichlet s theorem on rimes in arithmetic rogressions. We then rove the rime

More information

Adaptive Estimation of the Regression Discontinuity Model

Adaptive Estimation of the Regression Discontinuity Model Adative Estimation of the Regression Discontinuity Model Yixiao Sun Deartment of Economics Univeristy of California, San Diego La Jolla, CA 9293-58 Feburary 25 Email: yisun@ucsd.edu; Tel: 858-534-4692

More information

Recursive Estimation of the Preisach Density function for a Smart Actuator

Recursive Estimation of the Preisach Density function for a Smart Actuator Recursive Estimation of the Preisach Density function for a Smart Actuator Ram V. Iyer Deartment of Mathematics and Statistics, Texas Tech University, Lubbock, TX 7949-142. ABSTRACT The Preisach oerator

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

ON UNIFORM BOUNDEDNESS OF DYADIC AVERAGING OPERATORS IN SPACES OF HARDY-SOBOLEV TYPE. 1. Introduction

ON UNIFORM BOUNDEDNESS OF DYADIC AVERAGING OPERATORS IN SPACES OF HARDY-SOBOLEV TYPE. 1. Introduction ON UNIFORM BOUNDEDNESS OF DYADIC AVERAGING OPERATORS IN SPACES OF HARDY-SOBOLEV TYPE GUSTAVO GARRIGÓS ANDREAS SEEGER TINO ULLRICH Abstract We give an alternative roof and a wavelet analog of recent results

More information

Finding a sparse vector in a subspace: linear sparsity using alternating directions

Finding a sparse vector in a subspace: linear sparsity using alternating directions IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 Finding a sarse vector in a subsace: linear sarsity using alternating directions Qing Qu Student Member IEEE Ju Sun Student Member IEEE and John Wright

More information

A construction of bent functions from plateaued functions

A construction of bent functions from plateaued functions A construction of bent functions from lateaued functions Ayça Çeşmelioğlu, Wilfried Meidl Sabancı University, MDBF, Orhanlı, 34956 Tuzla, İstanbul, Turkey. Abstract In this resentation, a technique for

More information

On Wald-Type Optimal Stopping for Brownian Motion

On Wald-Type Optimal Stopping for Brownian Motion J Al Probab Vol 34, No 1, 1997, (66-73) Prerint Ser No 1, 1994, Math Inst Aarhus On Wald-Tye Otimal Stoing for Brownian Motion S RAVRSN and PSKIR The solution is resented to all otimal stoing roblems of

More information

Sharp gradient estimate and spectral rigidity for p-laplacian

Sharp gradient estimate and spectral rigidity for p-laplacian Shar gradient estimate and sectral rigidity for -Lalacian Chiung-Jue Anna Sung and Jiaing Wang To aear in ath. Research Letters. Abstract We derive a shar gradient estimate for ositive eigenfunctions of

More information

Lecture 3 Consistency of Extremum Estimators 1

Lecture 3 Consistency of Extremum Estimators 1 Lecture 3 Consistency of Extremum Estimators 1 This lecture shows how one can obtain consistency of extremum estimators. It also shows how one can find the robability limit of extremum estimators in cases

More information

Elementary theory of L p spaces

Elementary theory of L p spaces CHAPTER 3 Elementary theory of L saces 3.1 Convexity. Jensen, Hölder, Minkowski inequality. We begin with two definitions. A set A R d is said to be convex if, for any x 0, x 1 2 A x = x 0 + (x 1 x 0 )

More information

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm Gabriel Noriega, José Restreo, Víctor Guzmán, Maribel Giménez and José Aller Universidad Simón Bolívar Valle de Sartenejas,

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

Diverse Routing in Networks with Probabilistic Failures

Diverse Routing in Networks with Probabilistic Failures Diverse Routing in Networks with Probabilistic Failures Hyang-Won Lee, Member, IEEE, Eytan Modiano, Senior Member, IEEE, Kayi Lee, Member, IEEE Abstract We develo diverse routing schemes for dealing with

More information

SECTION 5: FIBRATIONS AND HOMOTOPY FIBERS

SECTION 5: FIBRATIONS AND HOMOTOPY FIBERS SECTION 5: FIBRATIONS AND HOMOTOPY FIBERS In this section we will introduce two imortant classes of mas of saces, namely the Hurewicz fibrations and the more general Serre fibrations, which are both obtained

More information

Finding Shortest Hamiltonian Path is in P. Abstract

Finding Shortest Hamiltonian Path is in P. Abstract Finding Shortest Hamiltonian Path is in P Dhananay P. Mehendale Sir Parashurambhau College, Tilak Road, Pune, India bstract The roblem of finding shortest Hamiltonian ath in a eighted comlete grah belongs

More information

A structure theorem for product sets in extra special groups

A structure theorem for product sets in extra special groups A structure theorem for roduct sets in extra secial grous Thang Pham Michael Tait Le Anh Vinh Robert Won arxiv:1704.07849v1 [math.nt] 25 Ar 2017 Abstract HegyváriandHennecartshowedthatifB isasufficientlylargebrickofaheisenberg

More information

arxiv: v2 [quant-ph] 2 Aug 2012

arxiv: v2 [quant-ph] 2 Aug 2012 Qcomiler: quantum comilation with CSD method Y. G. Chen a, J. B. Wang a, a School of Physics, The University of Western Australia, Crawley WA 6009 arxiv:208.094v2 [quant-h] 2 Aug 202 Abstract In this aer,

More information

On the Toppling of a Sand Pile

On the Toppling of a Sand Pile Discrete Mathematics and Theoretical Comuter Science Proceedings AA (DM-CCG), 2001, 275 286 On the Toling of a Sand Pile Jean-Christohe Novelli 1 and Dominique Rossin 2 1 CNRS, LIFL, Bâtiment M3, Université

More information

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule The Grah Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule STEFAN D. BRUDA Deartment of Comuter Science Bisho s University Lennoxville, Quebec J1M 1Z7 CANADA bruda@cs.ubishos.ca

More information

8 STOCHASTIC PROCESSES

8 STOCHASTIC PROCESSES 8 STOCHASTIC PROCESSES The word stochastic is derived from the Greek στoχαστικoς, meaning to aim at a target. Stochastic rocesses involve state which changes in a random way. A Markov rocess is a articular

More information

2 Asymptotic density and Dirichlet density

2 Asymptotic density and Dirichlet density 8.785: Analytic Number Theory, MIT, sring 2007 (K.S. Kedlaya) Primes in arithmetic rogressions In this unit, we first rove Dirichlet s theorem on rimes in arithmetic rogressions. We then rove the rime

More information

Chater Matrix Norms and Singular Value Decomosition Introduction In this lecture, we introduce the notion of a norm for matrices The singular value de

Chater Matrix Norms and Singular Value Decomosition Introduction In this lecture, we introduce the notion of a norm for matrices The singular value de Lectures on Dynamic Systems and Control Mohammed Dahleh Munther A Dahleh George Verghese Deartment of Electrical Engineering and Comuter Science Massachuasetts Institute of Technology c Chater Matrix Norms

More information