Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression


Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Sham M. Kakade, Microsoft Research and Wharton, U Penn, skakade@microsoft.com
Varun Kanade, SEAS, Harvard University, vkanade@fas.harvard.edu
Adam Tauman Kalai, Microsoft Research, adum@microsoft.com
Ohad Shamir, Microsoft Research, ohadsh@microsoft.com

Abstract

Generalized Linear Models (GLMs) and Single Index Models (SIMs) provide powerful generalizations of linear regression, where the target variable is assumed to be a (possibly unknown) 1-dimensional function of a linear predictor. In general, these problems entail non-convex estimation procedures, and, in practice, iterative local search heuristics are often used. Kalai and Sastry (2009) provided the first provably efficient method, the Isotron algorithm, for learning SIMs and GLMs, under the assumption that the data is in fact generated under a GLM and under certain monotonicity and Lipschitz (bounded slope) constraints. The Isotron algorithm interleaves steps of perceptron-like updates with isotonic regression (fitting a one-dimensional non-decreasing function). However, to obtain provable performance, the method requires a fresh sample every iteration. In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient. We modify the isotonic regression step in Isotron to fit a Lipschitz monotonic function, and also provide an efficient O(n log(n)) algorithm for this step, improving upon the previous O(n²) algorithm. We provide a brief empirical study, demonstrating the feasibility of our algorithms in practice.

1 Introduction

The oft-used linear regression paradigm models a dependent variable Y as a linear function of a vector-valued independent variable X. Namely, for some vector w, we assume that E[Y|X] = w·X. Generalized linear models (GLMs) provide a flexible extension of linear regression, by assuming that the dependent variable Y is of the form E[Y|X] = u(w·X); u is referred to as the inverse link function or transfer function (see [1] for a review). Generalized linear models include commonly used regression techniques such as logistic regression, where u(z) = 1/(1 + e^(−z)) is the logistic function. The class of perceptrons also falls in this category, where u is a simple piecewise linear ramp function (flat, then linearly increasing, then flat), with the slope of the middle piece being the inverse of the margin.

In the case of linear regression, the least-squares method is a highly efficient procedure for parameter estimation. Unfortunately, in the case of GLMs, even in the setting when u is known, the problem of fitting a model that minimizes squared error is typically not convex. We are not aware of any classical estimation procedure for GLMs which is both computationally and statistically efficient, and with provable guarantees. The standard procedure is iteratively reweighted least squares, based on Newton-Raphson (see [1]).

The case when both u and w are unknown (sometimes referred to as Single Index Models (SIMs)) involves the more challenging (and practically relevant) question of jointly estimating u and w,

where u may come from a large non-parametric family such as all monotonic functions. There are two questions here: (1) What statistical rate is achievable for simultaneous estimation of u and w? (2) Is there a computationally efficient algorithm for this joint estimation?

With regards to the former, under mild Lipschitz-continuity restrictions on u, it is possible to characterize the effectiveness of an (appropriately constrained) joint empirical risk minimization procedure. This suggests that, from a purely statistical viewpoint, it may be worthwhile to attempt jointly optimizing u and w on empirical data. However, the issue of computationally efficiently estimating both u and w (and still achieving a good statistical rate) is more delicate, and is the focus of this work. We note that this is not a trivial problem: in general, the joint estimation problem is highly non-convex, and despite a significant body of literature on the problem, existing methods are usually based on heuristics, which are not guaranteed to converge to a global optimum (see for instance [2, 3, 4, 5, 6]).

The Isotron algorithm of Kalai and Sastry [7] provides the first provably efficient method for learning GLMs and SIMs, under the common assumption that u is monotonic and Lipschitz, and assuming that the data corresponds to the model. The sample and computational complexity of this algorithm is polynomial, and the sample complexity does not explicitly depend on the dimension. The algorithm is a variant of the gradient-like perceptron algorithm, where apart from the perceptron-like updates, an isotonic regression procedure is performed on the linear predictions using the Pool Adjacent Violators (PAV) algorithm, on every iteration. While the Isotron algorithm is appealing due to its ease of implementation (it has no parameters other than the number of iterations to run) and theoretical guarantees (it works for any u, w), there is one principal drawback. It is a batch algorithm, but the analysis given requires the algorithm to be run on fresh samples each batch. In fact, as we show in experiments, this is not just an artifact of the analysis: if the algorithm loops over the same data in each update step, it really does overfit in very high dimensions (such as when the number of dimensions exceeds the number of examples).

Our Contributions: We show that the overfitting problem in Isotron stems from the fact that although it uses a slope (Lipschitz) condition as an assumption in the analysis, it does not constrain the output hypothesis to be of this form. To address this issue, we introduce the SLISOTRON algorithm (pronounced slice-o-tron, combining slope and Isotron). The algorithm replaces the isotonic regression step of the Isotron by finding the best non-decreasing function with a bounded Lipschitz parameter; this constraint plays here a similar role as the margin in classification algorithms. We also note that SLISOTRON (like Isotron) has a significant advantage over standard regression techniques, since it does not require knowing the transfer function. Our two main contributions are:

1. We show that the new algorithm, like Isotron, has theoretical guarantees, and significant new analysis is required for this step.
2. We provide an efficient O(n log(n)) time algorithm for finding the best non-decreasing function with a bounded Lipschitz parameter, improving on the previous O(n²) algorithm [10]. This makes SLISOTRON practical even on large datasets.

We begin with a simple perceptron-like algorithm for fitting GLMs, with a known transfer function u which is monotone and Lipschitz. Somewhat surprisingly, prior to this work (and Isotron [7]), a computationally
efficient procedure that guarantees to learn GLMs was not known. Section 4 contains the more challenging SLISOTRON algorithm and also the efficient O(n log(n)) algorithm for Lipschitz isotonic regression. We conclude with a brief empirical analysis.

2 Setting

We assume the data (x, y) are sampled i.i.d. from a distribution supported on B_d × [0, 1], where B_d = {x ∈ R^d : ‖x‖ ≤ 1} is the unit ball in d-dimensional Euclidean space.

Footnote: In the more challenging agnostic setting, the data is not required to be distributed according to a true u and w*; rather, the goal is to find the best u, w which minimize the empirical squared error. Similar to observations of Kalai et al. [8], it is straightforward to show that this problem is likely to be computationally intractable in the agnostic setting. In particular, it is at least as hard as the problem of learning parity with noise, whose hardness has been used as the basis for designing multiple cryptographic systems. Shalev-Shwartz et al. [9] present a kernel-based algorithm for learning certain types of GLMs and SIMs in the agnostic setting. However, their worst-case guarantees are exponential in the norm of w (or equivalently the Lipschitz parameter).

Our algorithms and analysis also apply to the case where B_d is the unit ball in some high (or infinite) dimensional kernel feature space. We assume there is a fixed vector w*, such that ‖w*‖ ≤ W, and a non-decreasing 1-Lipschitz function u : R → [0, 1], such that E[y|x] = u(w*·x) for all x. The restriction that u is 1-Lipschitz is without loss of generality, since the norm of w* is arbitrary (an equivalent restriction is that ‖w*‖ = 1 and that u is W-Lipschitz for an arbitrary W).

Our focus is on approximating the regression function well, as measured by the squared loss. For a real valued function h : B_d → [0, 1], define
err(h) = E_(x,y)[(h(x) − y)²],
ε(h) = err(h) − err(E[y|x]) = E_(x,y)[(h(x) − u(w*·x))²].
err(h) measures the error of h, and ε(h) measures the excess error of h compared to the Bayes-optimal predictor x ↦ u(w*·x). Our goal is to find h such that ε(h) (equivalently, err(h)) is as small as possible. In addition, we define the empirical counterparts êrr(h), ε̂(h), based on a sample (x_1, y_1), ..., (x_m, y_m), to be
êrr(h) = (1/m) Σ_{i=1}^m (h(x_i) − y_i)²;   ε̂(h) = (1/m) Σ_{i=1}^m (h(x_i) − u(w*·x_i))².
Note that ε̂ is the standard fixed design error (as this error conditions on the observed x's). Our algorithms work by iteratively constructing hypotheses h_t of the form h_t(x) = u_t(w^t·x), where u_t is a non-decreasing, 1-Lipschitz function, and w^t is a linear predictor. The algorithmic analysis provides conditions under which ε̂(h_t) is small, and using statistical arguments, one can guarantee that ε(h_t) would be small as well.

3 The GLM-TRON algorithm

We begin with the simpler case, where the transfer function u is assumed to be known (e.g., a sigmoid), and the problem is estimating w* properly. We present a simple, parameter-free, perceptron-like algorithm, GLM-TRON (Alg. 1), which efficiently finds a close-to-optimal predictor. We note that the algorithm works for arbitrary non-decreasing, Lipschitz functions u, and thus covers most generalized linear models. We refer the reader to the pseudo-code in Algorithm 1 for some of the notation used in this section.

Algorithm 1 GLM-TRON
Input: data ⟨(x_i, y_i)⟩_{i=1}^m ∈ R^d × [0, 1], u : R → [0, 1], held-out data ⟨(x_{m+j}, y_{m+j})⟩_{j=1}^s
  w^1 := 0
  for t = 1, 2, ... do
    h_t(x) := u(w^t·x)
    w^{t+1} := w^t + (1/m) Σ_{i=1}^m (y_i − u(w^t·x_i)) x_i
  end for
Output: arg min_{h_t} (1/s) Σ_{j=1}^s (h_t(x_{m+j}) − y_{m+j})²

To analyze the performance of the algorithm, we show that if we run the algorithm for sufficiently many iterations, one of the predictors h_t obtained must be nearly-optimal, compared to the Bayes-optimal predictor.

Theorem 1. Suppose (x_1, y_1), ..., (x_m, y_m) are drawn independently from a distribution supported on B_d × [0, 1], such that E[y|x] = u(w*·x), where ‖w*‖ ≤ W, and u : R → [0, 1] is a known non-decreasing 1-Lipschitz function. Then for any δ ∈ (0, 1), the following holds with probability at least 1 − δ: there exists some iteration t < O(W√(m/log(1/δ))) of GLM-TRON such that the hypothesis h_t(x) = u(w^t·x) satisfies
max{ε̂(h_t), ε(h_t)} ≤ O( W √(log(1/δ)/m) ).
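To make the update rule concrete, the following is a minimal sketch of the GLM-TRON loop in Python/NumPy. It is a reader's illustration under the assumptions above, not the authors' implementation: u is any known non-decreasing 1-Lipschitz map into [0, 1], and model selection over the returned iterates should be done on held-out data exactly as in Algorithm 1.

```python
import numpy as np

def glm_tron(X, y, u, T=100):
    """Sketch of the GLM-TRON update loop (Algorithm 1).

    X : (m, d) array with rows in the unit ball, y : (m,) array in [0, 1],
    u : known non-decreasing 1-Lipschitz transfer function (vectorized),
    T : number of iterations. Returns the iterates w^1, ..., w^T; in practice
    one keeps the iterate with the smallest squared error on held-out data.
    """
    m, d = X.shape
    w = np.zeros(d)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())
        # perceptron-like update: w^{t+1} = w^t + (1/m) sum_i (y_i - u(w^t . x_i)) x_i
        residual = y - u(X @ w)
        w = w + (X.T @ residual) / m
    return iterates

def logistic(z):
    """Example transfer function: the logistic function is non-decreasing
    and (1/4)-Lipschitz, hence in particular 1-Lipschitz."""
    return 1.0 / (1.0 + np.exp(-z))
```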

In particular, the theorem implies that some h_t has small enough ε(h_t). Since ε(h_t) equals err(h_t) up to a constant, we can easily find an appropriate h_t by picking the one that has least êrr(h_t) on a held-out set. The main idea of the proof is showing that at each iteration, if ε̂(h_t) is not small, then the squared distance ‖w^{t+1} − w*‖² is substantially smaller than ‖w^t − w*‖². Since the squared distance is bounded below by 0, and ‖w^1 − w*‖² ≤ W², there is an iteration (arrived at within reasonable time) such that the hypothesis h_t at that iteration is highly accurate. Although the algorithm minimizes empirical squared error, we can bound the true error using a uniform convergence argument. The complete proofs are provided in Appendix A.

4 The SLISOTRON algorithm

Algorithm 2 SLISOTRON
Input: data ⟨(x_i, y_i)⟩_{i=1}^m ∈ R^d × [0, 1], held-out data ⟨(x_{m+j}, y_{m+j})⟩_{j=1}^s
  w^1 := 0
  for t = 1, 2, ... do
    u_t := LIR((w^t·x_1, y_1), ..., (w^t·x_m, y_m))   // Fit 1-d function along w^t
    w^{t+1} := w^t + (1/m) Σ_{i=1}^m (y_i − u_t(w^t·x_i)) x_i
  end for
Output: arg min_{h_t} (1/s) Σ_{j=1}^s (h_t(x_{m+j}) − y_{m+j})²

In this section, we present SLISOTRON (Alg. 2), which is applicable to the harder setting where the transfer function u is unknown, except for it being non-decreasing and 1-Lipschitz. SLISOTRON does have one parameter, the Lipschitz constant; however, in theory we show that this can simply be set to 1. The main difference between SLISOTRON and GLM-TRON is that now the transfer function must also be learned, and the algorithm keeps track of a transfer function u_t which changes from iteration to iteration. The algorithm is inspired by the Isotron algorithm [7], with the main difference being that at each iteration, instead of applying the PAV procedure to fit an arbitrary monotonic function along the direction w^t, we use a different procedure, Lipschitz Isotonic Regression (LIR), to fit a Lipschitz monotonic function u_t along w^t. This key difference allows for an analysis that does not require a fresh sample each iteration. We also provide an efficient O(m log(m)) time algorithm for LIR (see Section 4.1), making SLISOTRON an extremely efficient algorithm.

We now turn to the formal theorem about our algorithm. The formal guarantees parallel those of the GLM-TRON algorithm. However, the rates achieved are somewhat worse, due to the additional difficulty of simultaneously estimating both u and w.

Theorem 2. Suppose (x_1, y_1), ..., (x_m, y_m) are drawn independently from a distribution supported on B_d × [0, 1], such that E[y|x] = u(w*·x), where ‖w*‖ ≤ W, and u : R → [0, 1] is an unknown non-decreasing 1-Lipschitz function. Then the following two bounds hold:

(Dimension-dependent) With probability at least 1 − δ, there exists some iteration t < O( (Wm / (d log(Wm/δ)))^(1/3) ) of SLISOTRON such that
max{ε̂(h_t), ε(h_t)} ≤ O( ( dW² log(Wm/δ) / m )^(1/3) ).

(Dimension-independent) With probability at least 1 − δ, there exists some iteration t < O( (Wm / log(1/δ))^(1/4) ) of SLISOTRON such that
max{ε̂(h_t), ε(h_t)} ≤ O( ( W² log(1/δ) / m )^(1/4) ).
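In the same spirit as the earlier sketch, here is a minimal illustration of the SLISOTRON loop. The routine fit_lipschitz_isotonic is a placeholder for any solver of the LIR problem defined in Section 4.1 (a simple reference solver is sketched there); the learned one-dimensional function is evaluated on new points by linear interpolation, as in the algorithm description. This is an illustrative sketch, not the authors' code.

```python
import numpy as np

def slisotron(X, y, fit_lipschitz_isotonic, T=100):
    """Sketch of the SLISOTRON update loop (Algorithm 2).

    fit_lipschitz_isotonic(z, y) must return the best non-decreasing
    1-Lipschitz fit (fitted values at the sorted z's), i.e. a solver for
    the LIR quadratic program of Section 4.1. Returns the list of
    (w^t, u_t) models; model selection on held-out data picks the best.
    """
    m, d = X.shape
    w = np.zeros(d)
    models = []
    for _ in range(T):
        z = X @ w
        order = np.argsort(z)
        yhat_sorted = fit_lipschitz_isotonic(z[order], y[order])
        # u_t evaluated back at the training projections (undo the sort)
        u_t_vals = np.empty(m)
        u_t_vals[order] = yhat_sorted
        models.append((w.copy(), (z[order], yhat_sorted)))
        # perceptron-like update with the learned transfer function
        w = w + (X.T @ (y - u_t_vals)) / m
    return models

def predict(model, X):
    """Evaluate a stored (w, u_t) pair on new points by linear interpolation."""
    w, (z_sorted, u_vals) = model
    return np.interp(X @ w, z_sorted, u_vals)
```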

As in the case of Theorem 1, one can easily find an h_t which satisfies the theorem's conditions, by running the SLISOTRON algorithm for sufficiently many iterations, and choosing the hypothesis h_t which minimizes êrr(h_t) on a held-out set. The algorithm minimizes empirical error, and generalization bounds are obtained using a uniform convergence argument. The proofs are somewhat involved and appear in Appendix B.

4.1 Lipschitz isotonic regression

The SLISOTRON algorithm (Alg. 2) performs Lipschitz Isotonic Regression (LIR) at each iteration. The goal is to find the best fit (least squared error) non-decreasing 1-Lipschitz function that fits the data in one dimension. Let (z_1, y_1), ..., (z_m, y_m) be such that z_i ∈ R, y_i ∈ [0, 1] and z_1 ≤ z_2 ≤ ... ≤ z_m. The Lipschitz Isotonic Regression (LIR) problem is defined as the following quadratic program:

Minimize w.r.t. ŷ_i:   (1/2) Σ_{i=1}^m (y_i − ŷ_i)²   (1)
subject to:  ŷ_i ≤ ŷ_{i+1}   for 1 ≤ i ≤ m−1   (Monotonicity)   (2)
             ŷ_{i+1} − ŷ_i ≤ z_{i+1} − z_i   for 1 ≤ i ≤ m−1   (Lipschitz)   (3)

Once the values ŷ_i are obtained at the data points, the actual function can be constructed by interpolating linearly between the data points. Prior to this work, the best known algorithm for this problem was due to Yeganova and Wilbur [10] and required O(m²) time for m points. In this work, we present an algorithm that performs the task in O(m log(m)) time. The actual algorithm is fairly complex and relies on designing a clever data structure. We provide a high-level view here; the details are provided in Appendix D.

Algorithm Sketch: We define functions G_i(·), where G_i(s) is the minimum squared loss that can be attained if ŷ_i is fixed to be s, and ŷ_{i+1}, ..., ŷ_m are then chosen to be the best fit 1-Lipschitz non-decreasing function to the points (z_{i+1}, y_{i+1}), ..., (z_m, y_m). Formally, for i = 1, ..., m, define the functions

G_i(s) = min_{ŷ_{i+1}, ..., ŷ_m} (1/2)(s − y_i)² + (1/2) Σ_{j=i+1}^m (ŷ_j − y_j)²   (4)

subject to the constraints (where s = ŷ_i)
  ŷ_j ≤ ŷ_{j+1}   for i ≤ j ≤ m−1   (Monotonicity)
  ŷ_{j+1} − ŷ_j ≤ z_{j+1} − z_j   for i ≤ j ≤ m−1   (Lipschitz)

Furthermore, define s*_i = arg min_s G_i(s). The functions G_i are piecewise quadratic, differentiable everywhere and strictly convex, a fact we prove in the full paper [11]. Thus, G_i is minimized at s*_i and it is strictly increasing on both sides of s*_i. Note that G_m(s) = (1/2)(s − y_m)² and hence is piecewise quadratic, differentiable everywhere and strictly convex. Let δ_i = z_{i+1} − z_i. The remaining G_i obey the following recursive relation:

G_i(s) = (1/2)(s − y_i)² + { G_{i+1}(s + δ_i)    if s ≤ s*_{i+1} − δ_i
                             G_{i+1}(s*_{i+1})    if s*_{i+1} − δ_i < s ≤ s*_{i+1}
                             G_{i+1}(s)           if s*_{i+1} < s }   (5)

As intuition for the above relation, note that G_i(s) is obtained by fixing ŷ_i = s and then choosing ŷ_{i+1} as close to s*_{i+1} as possible (since G_{i+1} is strictly increasing on both sides of s*_{i+1}) without violating either the monotonicity or Lipschitz constraints. The above argument can be immediately translated into an algorithm, if the values s*_i are known. Since s*_1 minimizes G_1(s), which is the same as the objective of (1), start with ŷ_1 = s*_1, and then successively choose values for ŷ_i to be as close to s*_i as possible without violating the Lipschitz or monotonicity constraints. This will produce an assignment for the ŷ_i which achieves loss equal to G_1(s*_1) and hence is optimal.
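The O(m log m) construction sketched above takes some care to implement. As a simple baseline (useful, for instance, to test a faster implementation against), the quadratic program (1)-(3) can also be handed directly to a generic constrained solver. The sketch below does exactly that with SciPy's SLSQP routine; it is a reference implementation only, polynomial-time but far from the near-linear algorithm of this section.

```python
import numpy as np
from scipy.optimize import minimize

def lir_reference(z, y):
    """Reference solver for Lipschitz Isotonic Regression, eqs. (1)-(3).

    z : increasing 1-d array of projections, y : targets in [0, 1].
    Returns fitted values yhat with yhat[i] <= yhat[i+1] and
    yhat[i+1] - yhat[i] <= z[i+1] - z[i]. Generic-QP baseline only.
    """
    z, y = np.asarray(z, float), np.asarray(y, float)
    gaps = np.diff(z)

    def objective(yhat):
        return 0.5 * np.sum((yhat - y) ** 2)

    def grad(yhat):
        return yhat - y

    constraints = [
        # monotonicity: yhat[i+1] - yhat[i] >= 0
        {"type": "ineq", "fun": lambda yh: np.diff(yh)},
        # Lipschitz: (z[i+1] - z[i]) - (yhat[i+1] - yhat[i]) >= 0
        {"type": "ineq", "fun": lambda yh: gaps - np.diff(yh)},
    ]
    res = minimize(objective, x0=y.copy(), jac=grad,
                   constraints=constraints, method="SLSQP")
    return res.x
```

Checking that a fast implementation matches this baseline on random inputs is a convenient sanity test.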

Figure 1: (a) Finding the zero of G'_i. (b) Update step to transform the representation of G'_{i+1} to G'_i.

The harder part of the algorithm is finding the values s*_i. Notice that the derivatives G'_i are all piecewise linear, continuous and strictly increasing, and obey a similar recursive relation (with G'_m(s) = s − y_m):

G'_i(s) = (s − y_i) + { G'_{i+1}(s + δ_i)   if s ≤ s*_{i+1} − δ_i
                        0                   if s*_{i+1} − δ_i < s ≤ s*_{i+1}
                        G'_{i+1}(s)         if s*_{i+1} < s }   (6)

The algorithm then finds s*_i by finding zeros of G'_i, starting from G'_m(s) = s − y_m and s*_m = y_m. We design a special data structure, called notable red-black trees, for representing piecewise linear, continuous, strictly increasing functions. We initialize such a tree T to represent G'_m(s) = s − y_m. Assuming that at some time it represents G'_{i+1}, we need to support two operations:

1. Find the zero of G'_{i+1} to get s*_{i+1}. Such an operation can be done efficiently in O(log(m)) time using a tree-like structure (Fig. 1(a)).
2. Update T to represent G'_i. This operation is more complicated, but using the relation (6), we do the following: Split the interval containing s*_{i+1}. Move the left half of the piecewise linear function G'_{i+1} to the left by δ_i (Fig. 1(b)), adding the constant zero function in between. Finally, we add the linear function s − y_i to every interval, to get G'_i, which is again piecewise linear, continuous and strictly increasing.

To perform the operations in step (2) above, we cannot naïvely apply the transformations shift-by(δ_i) and add(s − y_i) to every node in the tree, as it may take O(m) operations. Instead, we simply leave a note (hence the name notable red-black trees) that such a transformation should be applied before the function is evaluated at that node or at any of its descendants. To prevent a large number of such notes accumulating at any given node, we show that these notes satisfy certain commutative and additive relations, thus requiring us to keep track of no more than 2 notes at any given node. This lazy evaluation of notes allows us to perform all of the above operations in O(log(m)) time. The details of the construction are provided in Appendix D.

5 Experiments

In this section, we present an empirical study of the SLISOTRON and GLM-TRON algorithms. We perform two evaluations using synthetic data. The first one compares SLISOTRON and Isotron [7] and illustrates the importance of imposing a Lipschitz constraint. The second one demonstrates the advantage of using SLISOTRON over standard regression techniques, in the sense that SLISOTRON can learn any monotonic Lipschitz function. We also report results of an evaluation of SLISOTRON, GLM-TRON and several competing approaches on 5 UCI [12] datasets. All errors are reported in terms of average root mean squared error (RMSE) using 10-fold cross validation, along with the standard deviation.

5.1 Synthetic Experiments

Although the theoretical guarantees for Isotron are under the assumption that we get a fresh sample each round, one may still attempt to run Isotron on the same sample each iteration and evaluate the

empirical performance. Then, the main difference between SLISOTRON and Isotron is that while SLISOTRON fits the best Lipschitz monotonic function using LIR each iteration, Isotron merely finds the best monotonic fit using PAV. This difference is analogous to finding a large margin classifier vs. just a consistent one. We believe this difference will be particularly relevant when the data is sparse and lies in a high dimensional space.

Figure 2: (a) The figure shows the transfer functions as predicted by SLISOTRON and Isotron on Synthetic Experiment 1. The accompanying table shows the average RMSE using 10-fold cross validation, and the ∆ column shows the average difference between the RMSE values of the two algorithms across the folds. (b) The figure shows the transfer function as predicted by SLISOTRON on Synthetic Experiment 2. The accompanying table shows the average RMSE using 10-fold cross validation for SLISOTRON and Logistic Regression, and the ∆ column shows the average difference between the RMSE values of the two algorithms across the folds. (Numeric table entries omitted.)

Our first synthetic dataset is the following: The dataset is of size m = 500 in d = 500 dimensions. The first co-ordinate of each point is chosen uniformly at random from {−1, 0, 1}. The remaining co-ordinates are all 0, except that for each data point one of the remaining co-ordinates is randomly set to 1. The true direction is w* = (1, 0, ..., 0) and the transfer function is u(z) = (1 + z)/2. Both SLISOTRON and Isotron put weight on the first co-ordinate (the true direction). However, Isotron overfits the data using the remaining (irrelevant) co-ordinates, which SLISOTRON is prevented from doing because of the Lipschitz constraint. Figure 2(a) shows the transfer functions as predicted by the two algorithms, and the table below the plot shows the average RMSE using 10-fold cross validation. The ∆ column shows the average difference between the RMSE values of the two algorithms across the folds.

A principal advantage of SLISOTRON over standard regression techniques is that it is not necessary to know the transfer function in advance. The second synthetic experiment is designed as a sanity check to verify this claim. The dataset is of size m = 1000 in d = 4 dimensions. We chose a random direction as the true w* and used a piecewise linear function as the true u. We then added random noise (σ = 0.1) to the y values. We compared SLISOTRON to Logistic Regression on this dataset. SLISOTRON correctly recovers the true function (up to some scaling). Fig. 2(b) shows the actual transfer function as predicted by SLISOTRON, which is essentially the function we used. The table below the figure shows the performance comparison between SLISOTRON and logistic regression.

5.2 Real World Datasets

We now turn to describe the results of experiments performed on the following 5 UCI datasets: communities, concrete, housing, parkinsons, and wine-quality. We compared the performance of SLISOTRON (Sl-Iso) and GLM-TRON with logistic transfer function (GLM-t) against Isotron (Iso), as well as standard logistic regression (Log-R), linear regression (Lin-R) and a simple heuristic algorithm (SIM) for single index models, along the lines of standard iterative maximum-likelihood procedures for these types of problems (e.g., [13]). The SIM algorithm works by iteratively fixing the direction w and finding the best transfer function u, and then fixing u and

optimizing w via gradient descent. For each of the algorithms we performed 10-fold cross validation, using 1 fold each time as the test set, and we report averaged results across the folds. Table 1 shows average RMSE values of all the algorithms across 10 folds. The first column shows the mean Y value (with standard deviation) of the dataset for comparison. Table 2 shows the average difference between RMSE values of SLISOTRON and the other algorithms across the folds. Negative values indicate that the algorithm performed better than SLISOTRON.

The results suggest that the performance of SLISOTRON (and even Isotron) is comparable to other regression techniques, and in many cases also slightly better. The performance of GLM-TRON is similar to standard implementations of logistic regression on these datasets. This suggests that these algorithms should work well in practice, while providing non-trivial theoretical guarantees. It is also illustrative to see how the transfer functions found by SLISOTRON and Isotron compare. In Figure 3, we plot the transfer functions for concrete and communities. We see that the fits found by SLISOTRON tend to be smoother because of the Lipschitz constraint. We also observe that concrete is the only dataset where SLISOTRON performs noticeably better than logistic regression, and the transfer function is indeed somewhat far from the logistic function.

Table 1: Average RMSE values using 10-fold cross validation. The Ȳ column shows the mean Y value and standard deviation. Columns: dataset, Ȳ, Sl-Iso, GLM-t, Iso, Lin-R, Log-R, SIM; rows: communities, concrete, housing, parkinsons, winequality. (Numeric entries omitted.)

Table 2: Performance comparison of SLISOTRON with the other algorithms. The values reported are the average difference between RMSE values of the algorithm and SLISOTRON across the folds. Negative values indicate better performance than SLISOTRON. Columns: dataset, GLM-t, Iso, Lin-R, Log-R, SIM; rows: communities, concrete, housing, parkinsons, winequality. (Numeric entries omitted.)

Figure 3: The transfer function u as predicted by SLISOTRON (blue) and Isotron (red) for the (a) concrete and (b) communities datasets. The domain of both functions was normalized to [−1, 1].
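For concreteness, here is a minimal sketch of a data generator in the spirit of the first synthetic experiment of Section 5.1. The label model (Bernoulli draws with mean u(w*·x)) is an assumption of this sketch, since the experiment only specifies the conditional mean E[y|x] = u(w*·x).

```python
import numpy as np

def make_sparse_synthetic(m=500, d=500, rng=None):
    """Sketch of a generator in the spirit of synthetic experiment 1.

    First coordinate uniform over {-1, 0, 1}; all other coordinates are 0
    except one randomly chosen coordinate per point, set to 1 (so points may
    have norm sqrt(2); rescale if the unit-ball convention is needed).
    True direction w* = (1, 0, ..., 0), transfer u(z) = (1 + z) / 2.
    Labels drawn as Bernoulli(u(w*.x)), an assumption of this sketch.
    """
    rng = np.random.default_rng(rng)
    X = np.zeros((m, d))
    X[:, 0] = rng.choice([-1.0, 0.0, 1.0], size=m)
    X[np.arange(m), rng.integers(1, d, size=m)] = 1.0
    w_star = np.zeros(d)
    w_star[0] = 1.0
    p = (1.0 + X @ w_star) / 2.0           # u(w*.x) in [0, 1]
    y = rng.binomial(1, p).astype(float)    # E[y|x] = u(w*.x)
    return X, y, w_star
```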

References

[1] P. McCullagh and J. A. Nelder. Generalized Linear Models (2nd ed.). Chapman and Hall, 1989.
[2] P. Hall, W. Härdle and H. Ichimura. Optimal smoothing in single-index models. Annals of Statistics, 21(1):157–178, 1993.
[3] J. Horowitz and W. Härdle. Direct semiparametric estimation of single-index models with discrete covariates, 1994.
[4] A. Juditsky, M. Hristache and V. Spokoiny. Direct estimation of the index coefficients in a single-index model. Technical Report 3433, INRIA, May 1998.
[5] P. Naik and C. Tsai. Isotonic single-index model for high-dimensional database marketing. Computational Statistics and Data Analysis, 47, 2004.
[6] P. Ravikumar, M. Wainwright, and B. Yu. Single index convex experts: Efficient estimation via adapted Bregman losses. Snowbird Workshop, 2008.
[7] A. T. Kalai and R. Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT 2009.
[8] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, FOCS '05, pages 11–20, Washington, DC, USA, 2005. IEEE Computer Society.
[9] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the zero-one loss. In COLT, 2010.
[10] L. Yeganova and W. J. Wilbur. Isotonic regression under Lipschitz constraint. Journal of Optimization Theory and Applications, 141(2):429–443, 2009.
[11] S. M. Kakade, A. T. Kalai, V. Kanade, and O. Shamir. Efficient learning of generalized linear and single index models with isotonic regression. arxiv.org/abs/1104.2018.
[12] UCI. University of California, Irvine.
[13] S. Cosslett. Distribution-free maximum-likelihood estimator of the binary choice model. Econometrica, 51(3), May 1983.
[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[15] G. Pisier. The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press, 1999.
[16] T. Zhang. Covering number bounds for certain regularized function classes. Journal of Machine Learning Research, 2:527–550, 2002.
[17] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[18] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise and fast rates. In NIPS, 2010. (Full version on arXiv.)
[19] S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977–1991, 2002.

A Proof of Theorem 1

The reader is referred to GLM-TRON (Alg. 1) for notation used in this section. The main lemma shows that as long as the error of the current hypothesis is large, the distance of our predicted direction vector w^t from the ideal direction w* decreases by a substantial amount at each step.

Lemma 1. At iteration t in GLM-TRON, suppose ‖w^t − w*‖ ≤ W. Then if ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ ≤ η, we have
‖w^t − w*‖² − ‖w^{t+1} − w*‖² ≥ ε̂(h_t) − 5Wη.

Proof. We have
‖w^t − w*‖² − ‖w^{t+1} − w*‖² = (2/m) Σ_i (y_i − u(w^t·x_i))(w*·x_i − w^t·x_i) − ‖(1/m) Σ_i (y_i − u(w^t·x_i)) x_i‖²   (7)
Consider the first term above:
(2/m) Σ_i (y_i − u(w^t·x_i))(w*·x_i − w^t·x_i)
 = (2/m) Σ_i (u(w*·x_i) − u(w^t·x_i))(w*·x_i − w^t·x_i) + (2/m) Σ_i (y_i − u(w*·x_i)) x_i · (w* − w^t).
Using the fact that u is non-decreasing and 1-Lipschitz (for the first term), and ‖w* − w^t‖ ≤ W and ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ ≤ η (for the second term), we can lower bound this by
(2/m) Σ_i (u(w*·x_i) − u(w^t·x_i))² − 2Wη = 2ε̂(h_t) − 2Wη.   (8)
For the second term in (7), we have
‖(1/m) Σ_i (y_i − u(w^t·x_i)) x_i‖² = ‖(1/m) Σ_i ((y_i − u(w*·x_i)) + (u(w*·x_i) − u(w^t·x_i))) x_i‖²
 ≤ ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖² + 2 ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ · ‖(1/m) Σ_i (u(w*·x_i) − u(w^t·x_i)) x_i‖ + ‖(1/m) Σ_i (u(w*·x_i) − u(w^t·x_i)) x_i‖².   (9)
Using the fact that ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ ≤ η, and using Jensen's inequality to show that
‖(1/m) Σ_i (u(w*·x_i) − u(w^t·x_i)) x_i‖² ≤ (1/m) Σ_i (u(w*·x_i) − u(w^t·x_i))² = ε̂(h_t),
and assuming W ≥ 1, we get
‖(1/m) Σ_i (y_i − u(w^t·x_i)) x_i‖² ≤ ε̂(h_t) + 3Wη.   (10)
Combining (8) and (10) in (7), we get ‖w^t − w*‖² − ‖w^{t+1} − w*‖² ≥ ε̂(h_t) − 5Wη.

To complete the proof of Theorem 1, the bound on ε̂(h_t) for some t now follows from Lemma 1. Let η = (1 + √(2 log(1/δ)))/√m. Notice that the (y_i − u(w*·x_i)) x_i, for all i, are i.i.d. 0-mean random variables with norm bounded by 1, so using Lemma 3 (Appendix B), ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ ≤ η with high probability. Now using Lemma 1, at each iteration of algorithm GLM-TRON, either ‖w^{t+1} − w*‖² ≤ ‖w^t − w*‖² − Wη, or ε̂(h_t) ≤ 6Wη. If the

latter is the case, we are done. If not, since ‖w^{t+1} − w*‖² ≥ 0, and ‖w^1 − w*‖² = ‖w*‖² ≤ W², there can be at most W²/(Wη) = W/η iterations before ε̂(h_t) ≤ 6Wη. Overall, there is some h_t such that
ε̂(h_t) ≤ O( W √(log(1/δ)/m) ).
In addition, we can reduce this to a high-probability bound on ε(h_t) using a uniform convergence argument (Lemma 5, Appendix B), which is applicable since ‖w^t‖ ≤ 2W. Using a union bound, we get a bound which holds simultaneously for ε̂(h_t) and ε(h_t).

B Proof of Theorem 2

This section is devoted to the proof of Theorem 2. The reader is referred to Section 4 for notation. First we need a property of the LIR algorithm that is used to find the best one-dimensional non-decreasing 1-Lipschitz function. Formally, this problem can be defined as follows: Given as input {(z_i, y_i)}_{i=1}^m ⊆ [−W, W] × [0, 1], the goal is to find ŷ_1, ..., ŷ_m such that
(1/m) Σ_{i=1}^m (ŷ_i − y_i)²   (11)
is minimal, under the constraint that ŷ_i = u(z_i) for some non-decreasing 1-Lipschitz function u : [−W, W] → [0, 1]. After finding such values, LIR obtains an entire function u by interpolating linearly between the points. Assuming that the z_i are in sorted order, this can be formulated as a quadratic problem with the following constraints:
ŷ_i − ŷ_{i+1} ≤ 0   for 1 ≤ i < m   (12)
ŷ_{i+1} − ŷ_i − (z_{i+1} − z_i) ≤ 0   for 1 ≤ i < m   (13)

Lemma 2. Let (z_1, y_1), ..., (z_m, y_m) be input to LIR, where the z_i are increasing and y_i ∈ [0, 1]. Let ŷ_1, ..., ŷ_m be the output of LIR. Let f be any function such that f(β) − f(α) ≥ β − α for β ≥ α. Then
Σ_{i=1}^m (y_i − ŷ_i)(f(ŷ_i) − z_i) ≥ 0.

Proof. We first note that Σ_{j=1}^m (y_j − ŷ_j) = 0, since otherwise we could have found other values for ŷ_1, ..., ŷ_m which make (11) even smaller. For notational convenience, let ŷ_0 = 0 and z_0 = 0, and we may assume w.l.o.g. that f(ŷ_0) = 0. Define σ_i = Σ_{j=i}^m (y_j − ŷ_j). Then, summing by parts,
Σ_{i=1}^m (y_i − ŷ_i)(f(ŷ_i) − z_i) = Σ_{i=1}^m σ_i ((f(ŷ_i) − z_i) − (f(ŷ_{i−1}) − z_{i−1})).   (14)
Suppose that σ_i < 0. Intuitively, this means that if we could have decreased all values ŷ_i, ..., ŷ_m by an infinitesimal constant, then the objective function (11) would have been reduced, contradicting the optimality of the values. This means that the constraint ŷ_{i−1} − ŷ_i ≤ 0 must be tight, so we have (f(ŷ_i) − z_i) − (f(ŷ_{i−1}) − z_{i−1}) = −z_i + z_{i−1} ≤ 0 (this argument is informal, but can be easily formalized using KKT conditions). Similarly, when σ_i > 0, the constraint ŷ_i − ŷ_{i−1} − (z_i − z_{i−1}) ≤ 0 must be tight, hence f(ŷ_i) − f(ŷ_{i−1}) ≥ ŷ_i − ŷ_{i−1} = z_i − z_{i−1} ≥ 0. So in either case, each summand in (14) must be non-negative, leading to the required result.

We also use another result, for which we require a bit of additional notation. At each iteration of the SLISOTRON algorithm, we run the LIR procedure based on the training sample (x_1, y_1), ..., (x_m, y_m) and the current direction w^t, and get a non-decreasing Lipschitz function u_t. Define, for all i,
ŷ^t_i = u_t(w^t·x_i).
Recall that w*, u are such that E[y|x] = u(w*·x), and the input to the SLISOTRON algorithm is (x_1, y_1), ..., (x_m, y_m). Define, for all i,
ȳ_i = u(w*·x_i)

to be the expected value of each y_i. Clearly, we do not have access to the ȳ_i. However, consider a hypothetical call to LIR with inputs (w^t·x_i, ȳ_i), and suppose LIR returns the function ũ_t. In that case, define, for all i,
ỹ^t_i = ũ_t(w^t·x_i).
Our proof uses the following proposition, which relates the values ŷ^t_i (the values we can actually compute) and ỹ^t_i (the values we could compute if we had the conditional means of each y_i). The proof of Proposition 1 is somewhat lengthy and requires additional technical machinery, and is therefore relegated to Appendix C.

Proposition 1. With probability at least 1 − δ over the sample {(x_i, y_i)}_{i=1}^m, it holds for any t that (1/m) Σ_i |ŷ^t_i − ỹ^t_i| is at most the minimum of
O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) + W √(log(1/δ)/m) )   and   O( ( W² log(Wm/δ) / m )^(1/4) + W √(log(1/δ)/m) ).

The third auxiliary result we need is the following, which is well-known (see for example [14], Section 4).

Lemma 3. Suppose z_1, ..., z_m are i.i.d. 0-mean random variables in a Hilbert space, such that Pr(‖z_i‖ ≤ 1) = 1. Then with probability at least 1 − δ,
‖(1/m) Σ_i z_i‖ ≤ (1 + √(2 log(1/δ))) / √m.

With these auxiliary results in hand, we can now turn to prove Theorem 2 itself. The heart of the proof is the following lemma, which shows that the squared distance ‖w^t − w*‖² between w^t and the true direction w* decreases at each iteration at a rate which depends on the error of the hypothesis ε̂(h_t):

Lemma 4. Suppose that ‖w^t − w*‖ ≤ W, ‖(1/m) Σ_i (y_i − ȳ_i) x_i‖ ≤ η₁ and (1/m) Σ_i |ŷ^t_i − ỹ^t_i| ≤ η₂. Then
‖w^t − w*‖² − ‖w^{t+1} − w*‖² ≥ ε̂(h_t) − 5W(η₁ + 2η₂).

Proof. We have
‖w^{t+1} − w*‖² = ‖(w^{t+1} − w^t) + (w^t − w*)‖² = ‖w^{t+1} − w^t‖² + ‖w^t − w*‖² + 2(w^{t+1} − w^t)·(w^t − w*).
Since w^{t+1} − w^t = (1/m) Σ_i (y_i − ŷ^t_i) x_i, substituting this above and rearranging the terms we get
‖w^t − w*‖² − ‖w^{t+1} − w*‖² = (2/m) Σ_i (y_i − ŷ^t_i)(w*·x_i − w^t·x_i) − ‖(1/m) Σ_i (y_i − ŷ^t_i) x_i‖².   (15)
Consider the first term above:
(2/m) Σ_i (y_i − ŷ^t_i)(w*·x_i − w^t·x_i)
 = (2/m) Σ_i (y_i − ȳ_i) x_i · (w* − w^t)   (16)
 + (2/m) Σ_i (ȳ_i − ỹ^t_i)(w*·x_i − w^t·x_i)   (17)
 + (2/m) Σ_i (ỹ^t_i − ŷ^t_i)(w*·x_i − w^t·x_i).   (18)
The term (16) is at least −2Wη₁, and the term (18) is at least −2Wη₂ (since |(w* − w^t)·x_i| ≤ W). We thus consider the remaining term (17). Letting u be the true transfer function, suppose for a minute that it is strictly increasing, so its inverse u^{−1} is well defined. Then we have
(2/m) Σ_i (ȳ_i − ỹ^t_i)(w*·x_i − w^t·x_i) = (2/m) Σ_i (ȳ_i − ỹ^t_i)(w*·x_i − u^{−1}(ỹ^t_i)) + (2/m) Σ_i (ȳ_i − ỹ^t_i)(u^{−1}(ỹ^t_i) − w^t·x_i).

The second term in the expression above is non-negative by Lemma 2 (applied with f = u^{−1}, which satisfies f(β) − f(α) ≥ β − α since u is 1-Lipschitz). As to the first term, it is equal to (2/m) Σ_i (ȳ_i − ỹ^t_i)(u^{−1}(ȳ_i) − u^{−1}(ỹ^t_i)), which by the Lipschitz property of u is at least (2/m) Σ_i (ȳ_i − ỹ^t_i)² = 2ε̂(h̃_t), where h̃_t(x) = ũ_t(w^t·x). Plugging this in the above, we get
(2/m) Σ_i (y_i − ŷ^t_i)(w*·x_i − w^t·x_i) ≥ 2ε̂(h̃_t) − 2W(η₁ + η₂).   (19)
This inequality was obtained under the assumption that u is strictly increasing, but it is not hard to verify that the same would hold even if u is only non-decreasing. The second term in (15) can be bounded, using some tedious technical manipulations (see (9) and (10) in Appendix A), by
‖(1/m) Σ_i (y_i − ŷ^t_i) x_i‖² ≤ ε̂(h_t) + 3Wη₁.   (20)
Combining (19) and (20) in (15), we get
‖w^t − w*‖² − ‖w^{t+1} − w*‖² ≥ 2ε̂(h̃_t) − ε̂(h_t) − W(5η₁ + 2η₂).   (21)
Now, we claim that ε̂(h̃_t) ≥ ε̂(h_t) − 2η₂, since (1/m) Σ_i |ŷ^t_i − ỹ^t_i| ≤ η₂, and
ε̂(h̃_t) = (1/m) Σ_i (ỹ^t_i − ȳ_i)² = (1/m) Σ_i (ỹ^t_i − ŷ^t_i + ŷ^t_i − ȳ_i)²
 = (1/m) Σ_i ( (ŷ^t_i − ȳ_i)² + (ỹ^t_i − ŷ^t_i)(ỹ^t_i + ŷ^t_i − 2ȳ_i) )
 ≥ ε̂(h_t) − (2/m) Σ_i |ỹ^t_i − ŷ^t_i|,
where we have used the fact that |ỹ^t_i + ŷ^t_i − 2ȳ_i| ≤ 2. Plugging this into (21) leads to the desired result.

The bound on ε̂(h_t) in Theorem 2 now follows from Lemma 4. Using the notation from Lemma 4, η₁ can be set to the bound in Lemma 3, since the {(y_i − ȳ_i) x_i} are i.i.d. 0-mean random variables with norm bounded by 1. Also, η₂ can be set to either of the bounds in Proposition 1; η₂ is clearly the dominant term. Thus, we get that Lemma 4 holds, so at each iteration either ‖w^{t+1} − w*‖² ≤ ‖w^t − w*‖² − W(η₁ + 2η₂), or ε̂(h_t) ≤ 6W(η₁ + 2η₂). If the latter is the case, we are done. If not, since ‖w^{t+1} − w*‖² ≥ 0, and ‖w^1 − w*‖² = ‖w*‖² ≤ W², there can be at most W²/(W(η₁ + 2η₂)) = W/(η₁ + 2η₂) iterations before ε̂(h_t) ≤ 6W(η₁ + 2η₂). Plugging in the values for η₁, η₂ results in the bound on ε̂(h_t). Finally, to get a bound on ε(h_t), we utilize the following uniform convergence lemma:

Lemma 5. Suppose that E[y|x] = u(⟨w*, x⟩) for some non-decreasing 1-Lipschitz u and w* such that ‖w*‖ ≤ W. Then with probability at least 1 − δ over a sample (x_1, y_1), ..., (x_m, y_m), the following holds simultaneously for any function h(x) = û(ŵ·x) such that ‖ŵ‖ ≤ 2W and û is a non-decreasing 1-Lipschitz function:
|ε(h) − ε̂(h)| ≤ O( W √(log(1/δ)/m) ).

The proof of the lemma uses a covering number argument, and is shown as part of the more general Lemma 7 in Appendix C. This lemma applies in particular to h_t. Combining this with the bound on ε̂(h_t), and using a union bound, we get the result on ε(h_t) as well.

C Proof of Proposition 1

To prove the proposition, we actually prove a more general result. Define the function classes
U = {u : [−W, W] → [0, 1] : u 1-Lipschitz}
and
W = {x ↦ ⟨x, w⟩ : w ∈ R^d, ‖w‖ ≤ W},
where d is possibly infinite (for instance, if we are using kernels). It is easy to see that the proposition follows from the following uniform convergence guarantee:

Theorem 3. With probability at least 1 − δ, for any fixed w ∈ W, if we let
û = arg min_{u∈U} (1/m) Σ_i (u(w·x_i) − y_i)²
and define
ũ = arg min_{u∈U} (1/m) Σ_i (u(w·x_i) − E[y|x_i])²,
then
(1/m) Σ_i |û(w·x_i) − ũ(w·x_i)| ≤ O( min{ ( dW^{3/2} log(Wm/δ) / m )^(1/3) + W √(log(1/δ)/m),  ( W² log(1/δ) / m )^(1/4) } ).

To prove the theorem, we use the concept of (∞-norm) covering numbers. Given a function class F on some domain and some ε > 0, we define N_∞(ε, F) to be the smallest size of a covering set F' ⊆ F, such that for any f ∈ F, there exists some f' ∈ F' for which sup_x |f(x) − f'(x)| ≤ ε. In addition, we use a more refined notion of an ∞-norm covering number, which deals with an empirical sample of size m. Formally, define N_∞(ε, F, m) to be the smallest integer n, such that for any x_1, ..., x_m, one can construct a covering set F' ⊆ F of size at most n, such that for any f ∈ F, there exists some f' ∈ F' with max_{i=1,...,m} |f(x_i) − f'(x_i)| ≤ ε.

Lemma 6. Assuming m, 1/ε, W ≥ 1, we have the following covering number bounds:
1. N_∞(ε, U) ≤ (2/ε) · 2^{2W/ε}
2. N_∞(ε, W) ≤ (1 + 2W/ε)^d
3. N_∞(ε, U∘W) ≤ (4/ε) · 2^{4W/ε} · (1 + 4W/ε)^d
4. N_∞(ε, U∘W, m) ≤ (4/ε) · (2m + 1)^{1 + 8W²/ε²}

Proof. We start with the first bound. Discretize [−W, W] × [0, 1] to a two-dimensional grid {(−W + εa, εb)}, a = 0, ..., 2W/ε, b = 0, ..., 1/ε. It is easily verified that for any function u ∈ U, we can define a piecewise linear function u' which passes through points in the grid, and in between the points is either constant or linear with slope 1, such that sup_x |u(x) − u'(x)| ≤ ε. Moreover, all such functions are parameterized by their value at −W, and by whether they are sloping up or constant at each grid interval afterward. Thus, their number can be coarsely upper bounded as (2/ε) · 2^{2W/ε}. The second bound in the lemma is a well known fact; see for instance pg. 63 in [15]. The third bound in the lemma follows from combining the first two bounds, and using the Lipschitz property of u (we simply combine the two covers at an ε/2 scale, which leads to a cover at scale ε for U∘W). To get the fourth bound, we note that by Corollary 3 in [16],
N_∞(ε, W, m) ≤ (2m + 1)^{1 + W²/ε²}.
Note that unlike the second bound in the lemma, this bound is dimension-free, but has a worse dependence on W and ε. Also, we have N_∞(ε, U, m) ≤ N_∞(ε, U) ≤ (2/ε) · 2^{2W/ε} by definition of covering

numbers and the first bound in the lemma. Combining these two bounds, and using the Lipschitz property of u, we get
N_∞(ε, U∘W, m) ≤ (4/ε) · 2^{4W/ε} · (2m + 1)^{1 + 4W²/ε²}.
Upper bounding 2^{4W/ε} by (2m + 1)^{4W²/ε²}, the fourth bound in the lemma follows.

Lemma 7. With probability at least 1 − δ over a sample (x_1, y_1), ..., (x_m, y_m), the following bounds hold simultaneously for any w ∈ W and u, u' ∈ U:
| (1/m) Σ_i (u(w·x_i) − y_i)² − E[(u(w·x) − y)²] | ≤ O( W √(log(1/δ)/m) ),
| (1/m) Σ_i (u(w·x_i) − E[y|x_i])² − E[(u(w·x) − E[y|x])²] | ≤ O( W √(log(1/δ)/m) ),
| (1/m) Σ_i |u(w·x_i) − u'(w·x_i)| − E[|u(w·x) − u'(w·x)|] | ≤ O( W √(log(1/δ)/m) ).

Proof. Lemma 6 tells us that N_∞(ε, U∘W, m) ≤ (4/ε)(2m + 1)^{1 + 8W²/ε²}. It is easy to verify that the same covering number bound holds for the function classes {(x, y) ↦ (u(w·x) − y)² : u ∈ U, w ∈ W} and {x ↦ (u(w·x) − E[y|x])² : u ∈ U, w ∈ W}, by definition of the covering number and since the loss function is 2-Lipschitz. In a similar manner, one can show that the covering number of the function class {x ↦ |u(w·x) − u'(w·x)| : u, u' ∈ U, w ∈ W} is at most (4/ε)² (2m + 1)^{2 + 32W²/ε²}. Now, one just needs to use results from the literature which provide uniform convergence bounds given a covering number on the function class. In particular, combining a uniform convergence bound in terms of the Rademacher complexity of the function class (e.g., Theorem 8 in [17]), and a bound on the Rademacher complexity in terms of the covering number, using an entropy integral (e.g., Lemma A.3 in [18]), gives the desired result.

Lemma 8. With probability at least 1 − δ over a sample (x_1, y_1), ..., (x_m, y_m), the following holds simultaneously for any w ∈ W: if we let
û_w = arg min_{u∈U} (1/m) Σ_i (u(w·x_i) − y_i)²
denote the empirical risk minimizer with that fixed w, then
E[(û_w(w·x) − y)²] − inf_{u∈U} E[(u(w·x) − y)²] ≤ O( W ( d log(Wm/δ) / m )^(2/3) ).

Proof. For generic losses and function classes, standard bounds on the excess error typically scale as O(√(1/m)). However, we can utilize the fact that we are dealing with the squared loss to get better rates. In particular, using Theorem 4 in [19], as well as the bound on N_∞(ε, U) from Lemma 6, we get that for any fixed w, with probability at least 1 − δ,
E[(û_w(w·x) − y)²] − inf_{u∈U} E[(u(w·x) − y)²] ≤ O( W ( log(1/δ) / m )^(2/3) ).
To get a statement which holds simultaneously for any w ∈ W, we apply a union bound over a covering set of W. In particular, by Lemma 6, we know that we can cover W by a set W' of size at most (1 + 2W/ε)^d, such that any element in W is at most ε-far (in an ∞-norm sense) from some element of W'. So applying a union bound over W', we get that with probability at least 1 − δ, it holds simultaneously for any w' ∈ W' that
E[(û_{w'}(⟨w', x⟩) − y)²] − inf_u E[(u(⟨w', x⟩) − y)²] ≤ O( W ( (log(1/δ) + d log(1 + 2W/ε)) / m )^(2/3) ).   (22)

Now, for any w ∈ W, if we let w' denote the closest element in W', then u(w·x) and u(⟨w', x⟩) are ε-close uniformly for any u ∈ U and any x. From this, it is easy to see that we can extend (22) to hold for any w ∈ W, with an additional O(ε) element on the right hand side. In other words, with probability at least 1 − δ, it holds simultaneously for any w ∈ W that
E[(û_w(w·x) − y)²] − inf_u E[(u(w·x) − y)²] ≤ O( W ( (log(1/δ) + d log(1 + 2W/ε)) / m )^(2/3) + ε ).
Picking (say) ε = 1/m provides the required result.

Lemma 9. Let F be a convex class of functions, and let f* = arg min_{f∈F} E[(f(x) − y)²]. Suppose that E[y|x] ∈ U∘W. Then for any f ∈ F, it holds that
E[(f(x) − y)²] − E[(f*(x) − y)²] ≥ E[(f(x) − f*(x))²] ≥ (E[|f(x) − f*(x)|])².

Proof. It is easily verified that
E[(f(x) − y)²] − E[(f*(x) − y)²] = E_x[(f(x) − E[y|x])²] − E_x[(f*(x) − E[y|x])²].   (23)
This implies that f* = arg min_{f∈F} E[(f(x) − E[y|x])²]. Consider the L₂ Hilbert space of square-integrable functions, with respect to the measure induced by the distribution on x (i.e., the inner product is defined as ⟨f, f'⟩ = E_x[f(x) f'(x)]). Note that E[y|x] ∈ U∘W is a member of that space. Viewing E[y|x] as a function y(x), what we need to show is that ‖f − y‖² − ‖f* − y‖² ≥ ‖f − f*‖². By expanding, it can be verified that this is equivalent to showing ⟨f* − y, f − f*⟩ ≥ 0. To prove this, we start by noticing that according to (23), f* minimizes ‖f − y‖² over F. Therefore, for any f ∈ F and any ε ∈ (0, 1),
‖(1 − ε) f* + ε f − y‖² − ‖f* − y‖² ≥ 0,   (24)
as (1 − ε) f* + ε f ∈ F by convexity of F. However, the left hand side of (24) equals ε² ‖f − f*‖² + 2ε ⟨f* − y, f − f*⟩, so to ensure that (24) is non-negative for arbitrarily small ε, we must have ⟨f* − y, f − f*⟩ ≥ 0. This gives us the required result, and establishes the first inequality in the lemma statement. The second inequality is just by convexity of the squared function.

Proof of Theorem 3. We bound (1/m) Σ_i |û(w·x_i) − ũ(w·x_i)| in two different ways, one which is dimension-dependent and one which is dimension-independent.

We begin with the dimension-dependent bound. For any fixed w, let u* be arg min_{u∈U} E[(u(w·x) − y)²]. We have from Lemma 8 that with probability at least 1 − δ, simultaneously for all w ∈ W,
E[(û(w·x) − y)²] − E[(u*(w·x) − y)²] ≤ O( W ( d log(Wm/δ) / m )^(2/3) ),
and by Lemma 9, this implies
E[|û(w·x) − u*(w·x)|] ≤ O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) ).   (25)
Now, we note that since u* = arg min_{u∈U} E[(u(w·x) − y)²], then u* = arg min_{u∈U} E[(u(w·x) − E[y|x])²] as well. Again applying Lemma 8 and Lemma 9 in a similar manner, but now with respect to ũ, we get that with probability at least 1 − δ, simultaneously for all w ∈ W,
E[|ũ(w·x) − u*(w·x)|] ≤ O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) ).   (26)

Combining (25) and (26) with a union bound, we have
E[|û(w·x) − ũ(w·x)|] ≤ O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) ).
Finally, we invoke the last inequality in Lemma 7, using a union bound, to get
(1/m) Σ_i |û(w·x_i) − ũ(w·x_i)| ≤ O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) + W √(log(1/δ)/m) ).

We now turn to the dimension-independent bound. In this case, the covering number bounds are different, and we do not know how to prove an analogue of Lemma 8 (with a rate faster than O(1/√m)). This leads to a somewhat worse bound in terms of the dependence on m. As before, for any fixed w, we let u* be arg min_{u∈U} E[(u(w·x) − y)²]. Lemma 7 tells us that the empirical risk (1/m) Σ_i (u(w·x_i) − y_i)² is concentrated around its expectation uniformly for any u, w. In particular,
| (1/m) Σ_i (û(w·x_i) − y_i)² − E[(û(w·x) − y)²] | ≤ O( W √(log(1/δ)/m) )
as well as
| (1/m) Σ_i (u*(w·x_i) − y_i)² − E[(u*(w·x) − y)²] | ≤ O( W √(log(1/δ)/m) ),
but since û was chosen to be the empirical risk minimizer, it follows that
E[(û(w·x) − y)²] − E[(u*(w·x) − y)²] ≤ O( W √(log(1/δ)/m) ),
so by Lemma 9,
E[|u*(w·x) − û(w·x)|] ≤ O( ( W² log(1/δ) / m )^(1/4) ).   (27)
Now, it is not hard to see that if u* = arg min_{u∈U} E[(u(w·x) − y)²], then u* = arg min_{u∈U} E[(u(w·x) − E[y|x])²] as well. Again invoking Lemma 7, and making similar arguments, it follows that
E[|u*(w·x) − ũ(w·x)|] ≤ O( ( W² log(1/δ) / m )^(1/4) ).   (28)
Combining (27) and (28), we get
E[|û(w·x) − ũ(w·x)|] ≤ O( ( W² log(1/δ) / m )^(1/4) ).
We now invoke Lemma 7 to get
(1/m) Σ_i |û(w·x_i) − ũ(w·x_i)| ≤ O( ( W² log(1/δ) / m )^(1/4) ).   (29)

D Lipschitz Isotonic Regression

The reader is referred to Section 4 for notation used in this section. We first prove that the functions G_i defined in Section 4.1, eq. (4), are piecewise quadratic, differentiable everywhere and strictly convex.

Lemma 10. For i = 1, ..., m, the functions G_i (Section 4.1, eq. (4)) are piecewise quadratic, differentiable everywhere and strictly convex.

Proof. We prove this assertion by induction on j = m − i. The statement is true for j = 0 (i = m), since G_m(s) = (1/2)(s − y_m)². Recall that δ_i = z_{i+1} − z_i and s*_i = arg min_s G_i(s). Suppose that the assertion is true for i + 1 (j = m − i − 1); we shall prove the same for i (j = m − i). Suppose ŷ_i is fixed to s. Recall that G_i(s) is the minimum possible error attained by fitting the points (z_i, y_i), (z_{i+1}, y_{i+1}), ..., (z_m, y_m), satisfying the Lipschitz and monotonicity constraints and setting ŷ_i = s (the prediction at the point z_i). For any assignment ŷ_{i+1} = s', the least error obtained by fitting the rest of the points is G_{i+1}(s'). Thus,
G_i(s) = (1/2)(s − y_i)² + min_{s'} G_{i+1}(s')
subject to the conditions that s' ≥ s and s' − s ≤ δ_i. Since G_{i+1} is piecewise quadratic, differentiable everywhere and strictly convex, G_{i+1} is minimal at s*_{i+1} and is increasing on both sides of s*_{i+1}. Thus, the quantity above is minimized by picking s' to be as close to s*_{i+1} as possible without violating any constraint. Hence, we get the recurrence relation:
G_i(s) = (1/2)(s − y_i)² + { G_{i+1}(s + δ_i)    if s ≤ s*_{i+1} − δ_i
                             G_{i+1}(s*_{i+1})    if s*_{i+1} − δ_i < s ≤ s*_{i+1}
                             G_{i+1}(s)           if s*_{i+1} < s }   (30)
By the assumption for i + 1, G_{i+1} is a piecewise quadratic function, say defined on intervals (a_0 = −∞, a_1], ..., (a_{k−1}, a_k], (a_k, a_{k+1}], ..., (a_{l−1}, a_l = ∞). Let s*_{i+1} ∈ (a_{k−1}, a_k] and note that G'_{i+1}(s*_{i+1}) = 0. Define an intermediate function F(s) which is defined on l + 1 intervals,
(−∞, a_1 − δ_i], (a_1 − δ_i, a_2 − δ_i], ..., (a_{k−1} − δ_i, s*_{i+1} − δ_i], (s*_{i+1} − δ_i, s*_{i+1}], (s*_{i+1}, a_k], (a_k, a_{k+1}], ..., (a_{l−1}, ∞),
where
1. F(s) = G_{i+1}(s + δ_i) on the intervals (−∞, a_1 − δ_i], ..., (a_{k−1} − δ_i, s*_{i+1} − δ_i],
2. F(s) = G_{i+1}(s*_{i+1}) on the interval (s*_{i+1} − δ_i, s*_{i+1}],
3. F(s) = G_{i+1}(s) on the intervals (s*_{i+1}, a_k], ..., (a_{l−1}, ∞).
Since G_{i+1} is piecewise quadratic, so is F on the intervals defined above (constant on the interval (s*_{i+1} − δ_i, s*_{i+1}]). F is differentiable everywhere, since G'_{i+1}(s*_{i+1}) = 0. Observe that G_i(s) = (1/2)(s − y_i)² + F(s); this makes G_i(s) strictly convex, while retaining the fact that it is piecewise quadratic and differentiable everywhere.

D.1 Algorithm

We now give an efficient algorithm for Lipschitz Isotonic Regression. As discussed already in Section 4.1, it suffices to find s*_i for i = 1, ..., m. Once these values are known, it is easy to find values ŷ_1, ..., ŷ_m that are optimal. The pseudocode for the LIR algorithm is given in Alg. 3. As discussed in Section 4.1, the algorithm maintains G'_i, and finds s*_i by finding the zero of G'_i. G'_i is piecewise linear, continuous and strictly increasing. We design a new data structure, notable red-black trees, to store these piecewise linear, continuous, strictly increasing functions.

A notable red-black tree (henceforth NRB-Tree) stores a piecewise linear function. The disjoint intervals on which the function is defined have a natural order; thus the intervals serve as the key and the linear function is stored as the value at a node. Apart from storing these key-value pairs, the nodes of the tree also store some notes that are applied to the entire subtree rooted at that node, hence the name notable red-black trees. We assume that standard tree operations, i.e., finding and inserting nodes, can be performed in O(log(m)) time. There is a slight caveat that there may be notes at some of the nodes, but we show below how to flush all notes at a given node in constant time.

Footnote: Any tree structure that uses only local rotations and guarantees O(log(m)) worst-case or amortized find and insert operations may be used.
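To illustrate the lazy-evaluation idea on a single node, here is a small sketch of how a shift-by(δ) note and an add-linear note can be accumulated and flushed in constant time. The field names and the exact composition rule written out in add_note_shift and flush are this sketch's own choices, and the full balanced-tree bookkeeping is omitted; it is meant only to show why two accumulated notes per node suffice.

```python
class NRBNode:
    """One node of a notable tree: an interval (lo, hi] carrying a linear piece
    f(s) = slope * s + intercept, plus pending notes that apply lazily to this
    node and its whole subtree. Notes are read as: 'shift left by note_shift,
    then add note_add_slope * s + note_add_intercept'."""

    def __init__(self, lo, hi, slope, intercept):
        self.lo, self.hi = lo, hi
        self.slope, self.intercept = slope, intercept
        self.left = self.right = None
        self.note_shift = 0.0
        self.note_add_slope = 0.0
        self.note_add_intercept = 0.0

    def add_note_shift(self, delta):
        # Composing "existing note, then shift by delta": the pending added
        # linear part a*s + b becomes a*(s + delta) + b, so b picks up a*delta.
        self.note_add_intercept += self.note_add_slope * delta
        self.note_shift += delta

    def add_note_add_linear(self, a, b):
        # Adding a linear function after the existing note is purely additive.
        self.note_add_slope += a
        self.note_add_intercept += b

    def flush(self):
        """Apply the pending notes to this node and push them to the children."""
        d, a, b = self.note_shift, self.note_add_slope, self.note_add_intercept
        # Shift left by d: (lo, hi] -> (lo - d, hi - d], f(s) -> f(s + d).
        self.lo -= d
        self.hi -= d
        self.intercept += self.slope * d
        # Then add the linear function a*s + b.
        self.slope += a
        self.intercept += b
        # Children inherit the combined note; their own (older) notes apply first.
        for child in (self.left, self.right):
            if child is not None:
                child.add_note_shift(d)
                child.add_note_add_linear(a, b)
        self.note_shift = self.note_add_slope = self.note_add_intercept = 0.0
```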

Thus, whenever a node is accessed for find or insert operations, the notes are first applied to the node appropriately, and then the standard tree operation can continue.

Algorithm 3 LIR: Lipschitz Isotonic Regression
Input: (z_1, y_1), ..., (z_m, y_m)
  s*_i := 0 for i = 1, ..., m;  ŷ_i := 0 for i = 1, ..., m
  NRB-Tree T(s − y_m)   // new T represents s − y_m
  δ_i := z_{i+1} − z_i for i = 1, ..., m−1
  for i = m, m−1, ..., 1 do
    s*_i := T.zero()
    if i = 1 then break end if
    T.update(s*_i, δ_{i−1}, y_{i−1})
  end for
  ŷ_1 := s*_1
  for i = 2, ..., m do
    if s*_i > ŷ_{i−1} + δ_{i−1} then ŷ_i := ŷ_{i−1} + δ_{i−1}
    else if s*_i ≥ ŷ_{i−1} then ŷ_i := s*_i
    else ŷ_i := ŷ_{i−1}
    end if
  end for
Output: ŷ_1, ..., ŷ_m

The algorithm initializes T to represent the linear function s − y_m. This can be performed very easily by just creating a root node, setting the interval (key) to be (−∞, ∞) and the linear function (value) to s − y_m, represented by coefficients (1, −y_m). There are two function calls that algorithm LIR makes on T.

Tzero(): This finds the zero (the point where the function is 0) of the piecewise linear function G'_i represented by the NRB-Tree T. Since the function is piecewise linear, continuous and strictly increasing, a unique zero exists, and the tree structure is well-suited to finding the zero in O(log m) time.

Tupdate(s*_i, δ_{i−1}, y_{i−1}): Having obtained the zero of G'_i, the algorithm needs to update T to now represent G'_{i−1}. This involves the following:
(a) Suppose (a_{k−1}, a_k] is the interval containing s*_i. Split the interval (a_{k−1}, a_k] into two intervals (a_{k−1}, s*_i] and (s*_i, a_k] (see Figure 1(a) in Section 4.1). This operation is easy to perform on a binary tree in O(log m) time: it involves finding the node representing the interval (a_{k−1}, a_k] that contains s*_i, then modifying it to represent the interval (a_{k−1}, s*_i]. Then a new node representing the interval (s*_i, a_k] (currently none of the intervals in the tree overlaps with this interval) is inserted, and the linear function on this interval is set to be the same as the one on (a_{k−1}, s*_i].
(b) All the intervals to the left of s*_i, i.e., up to (a_{k−1}, s*_i], are to be moved to the left by δ_{i−1}. Formally, if the interval is (a_1, a_2] and the linear function is (l_1, l_2), representing l_1 s + l_2, then after moving to the left by δ_{i−1} the interval becomes (a_1 − δ_{i−1}, a_2 − δ_{i−1}] and the linear function becomes (l_1, l_2 + l_1 δ_{i−1}) (see Figure 1(b) in Section 4.1).
The above operation may involve modifying up to O(m) nodes in the tree. However, rather than actually making the modifications, we only leave a note at appropriate nodes that such a modification is to be performed. Whenever a note shift-by(δ_{i−1})


Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials Fast Montgoery-like Square Root Coputation over GF( ) for All Trinoials Yin Li a, Yu Zhang a, a Departent of Coputer Science and Technology, Xinyang Noral University, Henan, P.R.China Abstract This letter

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it

More information

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes Radeacher Coplexity Margin Bounds for Learning with a Large Nuber of Classes Vitaly Kuznetsov Courant Institute of Matheatical Sciences, 25 Mercer street, New York, NY, 002 Mehryar Mohri Courant Institute

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

arxiv: v1 [cs.ds] 29 Jan 2012

arxiv: v1 [cs.ds] 29 Jan 2012 A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

Least Squares Fitting of Data

Least Squares Fitting of Data Least Squares Fitting of Data David Eberly, Geoetric Tools, Redond WA 98052 https://www.geoetrictools.co/ This work is licensed under the Creative Coons Attribution 4.0 International License. To view a

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x)

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x) 7Applying Nelder Mead s Optiization Algorith APPLYING NELDER MEAD S OPTIMIZATION ALGORITHM FOR MULTIPLE GLOBAL MINIMA Abstract Ştefan ŞTEFĂNESCU * The iterative deterinistic optiization ethod could not

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Mixed Robust/Average Submodular Partitioning

Mixed Robust/Average Submodular Partitioning Mixed Robust/Average Subodular Partitioning Kai Wei 1 Rishabh Iyer 1 Shengjie Wang 2 Wenruo Bai 1 Jeff Biles 1 1 Departent of Electrical Engineering, University of Washington 2 Departent of Coputer Science,

More information

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40 On Poset Merging Peter Chen Guoli Ding Steve Seiden Abstract We consider the follow poset erging proble: Let X and Y be two subsets of a partially ordered set S. Given coplete inforation about the ordering

More information

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors New Slack-Monotonic Schedulability Analysis of Real-Tie Tasks on Multiprocessors Risat Mahud Pathan and Jan Jonsson Chalers University of Technology SE-41 96, Göteborg, Sweden {risat, janjo}@chalers.se

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

The Methods of Solution for Constrained Nonlinear Programming

The Methods of Solution for Constrained Nonlinear Programming Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 3(March 2014), PP 01-06 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.co The Methods of Solution for Constrained

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information