Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression


Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Sham M. Kakade, Microsoft Research and Wharton, U Penn, skakade@microsoft.com
Varun Kanade, SEAS, Harvard University, vkanade@fas.harvard.edu
Adam Tauman Kalai, Microsoft Research, adum@microsoft.com
Ohad Shamir, Microsoft Research, ohadsh@microsoft.com

Abstract

Generalized Linear Models (GLMs) and Single Index Models (SIMs) provide powerful generalizations of linear regression, where the target variable is assumed to be a (possibly unknown) 1-dimensional function of a linear predictor. In general, these problems entail non-convex estimation procedures, and, in practice, iterative local search heuristics are often used. Kalai and Sastry (2009) provided the first provably efficient method, the Isotron algorithm, for learning SIMs and GLMs, under the assumption that the data is in fact generated under a GLM and under certain monotonicity and Lipschitz (bounded slope) constraints. The Isotron algorithm interleaves steps of perceptron-like updates with isotonic regression (fitting a one-dimensional non-decreasing function). However, to obtain provable performance, the method requires a fresh sample every iteration. In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient. We modify the isotonic regression step in Isotron to fit a Lipschitz monotonic function, and also provide an efficient O(n log(n)) algorithm for this step, improving upon the previous O(n²) algorithm. We provide a brief empirical study, demonstrating the feasibility of our algorithms in practice.

1 Introduction

The oft-used linear regression paradigm models a dependent variable Y as a linear function of a vector-valued independent variable X. Namely, for some vector w, we assume that E[Y|X] = w·X. Generalized linear models (GLMs) provide a flexible extension of linear regression, by assuming that the dependent variable Y is of the form E[Y|X] = u(w·X); u is referred to as the inverse link function or transfer function (see [1] for a review). Generalized linear models include commonly used regression techniques such as logistic regression, where u(z) = 1/(1 + e^(−z)) is the logistic function. The class of perceptrons also falls in this category, where u is a simple piecewise linear ramp function (flat, then linearly increasing, then flat), with the slope of the middle piece being the inverse of the margin.

In the case of linear regression, the least-squares method is a highly efficient procedure for parameter estimation. Unfortunately, in the case of GLMs, even in the setting when u is known, the problem of fitting a model that minimizes squared error is typically not convex. We are not aware of any classical estimation procedure for GLMs which is both computationally and statistically efficient, and with provable guarantees. The standard procedure is iteratively reweighted least squares, based on Newton-Raphson (see [1]).

The case when both u and w are unknown (sometimes referred to as Single Index Models (SIMs)) involves the more challenging (and practically relevant) question of jointly estimating u and w,

where u may come from a large non-parametric family such as all monotonic functions. There are two questions here: (1) What statistical rate is achievable for simultaneous estimation of u and w? (2) Is there a computationally efficient algorithm for this joint estimation?

With regards to the former, under mild Lipschitz-continuity restrictions on u, it is possible to characterize the effectiveness of an (appropriately constrained) joint empirical risk minimization procedure. This suggests that, from a purely statistical viewpoint, it may be worthwhile to attempt jointly optimizing u and w on empirical data. However, the issue of computationally efficiently estimating both u and w (and still achieving a good statistical rate) is more delicate, and is the focus of this work. We note that this is not a trivial problem: in general, the joint estimation problem is highly non-convex, and despite a significant body of literature on the problem, existing methods are usually based on heuristics, which are not guaranteed to converge to a global optimum (see for instance [2, 3, 4, 5, 6]).

The Isotron algorithm of Kalai and Sastry [7] provides the first provably efficient method for learning GLMs and SIMs, under the common assumption that u is monotonic and Lipschitz, and assuming that the data corresponds to the model. The sample and computational complexity of this algorithm is polynomial, and the sample complexity does not explicitly depend on the dimension. The algorithm is a variant of the gradient-like perceptron algorithm, where apart from the perceptron-like updates, an isotonic regression procedure is performed on the linear predictions using the Pool Adjacent Violators (PAV) algorithm, on every iteration. While the Isotron algorithm is appealing due to its ease of implementation (it has no parameters other than the number of iterations to run) and theoretical guarantees (it works for any u, w), there is one principal drawback. It is a batch algorithm, but the analysis given requires the algorithm to be run on fresh samples each batch. In fact, as we show in experiments, this is not just an artifact of the analysis: if the algorithm loops over the same data in each update step, it really does overfit in very high dimensions (such as when the number of dimensions exceeds the number of examples).

Our Contributions: We show that the overfitting problem in Isotron stems from the fact that although it uses a slope (Lipschitz) condition as an assumption in the analysis, it does not constrain the output hypothesis to be of this form. To address this issue, we introduce the SLISOTRON algorithm (pronounced slice-o-tron, combining slope and Isotron). The algorithm replaces the isotonic regression step of the Isotron by finding the best non-decreasing function with a bounded Lipschitz parameter; this constraint plays here a similar role as the margin in classification algorithms. We also note that SLISOTRON (like Isotron) has a significant advantage over standard regression techniques, since it does not require knowing the transfer function. Our two main contributions are:

1. We show that the new algorithm, like Isotron, has theoretical guarantees, and significant new analysis is required for this step.
2. We provide an efficient O(n log(n)) time algorithm for finding the best non-decreasing function with a bounded Lipschitz parameter, improving on the previous O(n²) algorithm [10]. This makes SLISOTRON practical even on large datasets.

We begin with a simple perceptron-like algorithm for fitting GLMs, with a known transfer function u which is monotone and Lipschitz. Somewhat surprisingly, prior to this work (and Isotron [7]), a computationally
efficient procedure that guarantees to learn GLMs was not known. Section 4 contains the more challenging SLISOTRON algorithm and also the efficient O(n log(n)) algorithm for Lipschitz isotonic regression. We conclude with a brief empirical analysis.

2 Setting

We assume the data (x, y) are sampled i.i.d. from a distribution supported on B_d × [0, 1], where B_d = {x ∈ R^d : ‖x‖ ≤ 1} is the unit ball in d-dimensional Euclidean space.

Footnote: In the more challenging agnostic setting, the data is not required to be distributed according to a true u and w*; rather, the goal is to find the best u, w which minimize the empirical squared error. Similar to observations of Kalai et al. [8], it is straightforward to show that this problem is likely to be computationally intractable in the agnostic setting. In particular, it is at least as hard as the problem of learning parity with noise, whose hardness has been used as the basis for designing multiple cryptographic systems. Shalev-Shwartz et al. [9] present a kernel-based algorithm for learning certain types of GLMs and SIMs in the agnostic setting. However, their worst-case guarantees are exponential in the norm of w (or equivalently the Lipschitz parameter).

Our algorithms and analysis also apply to the case where B_d is the unit ball in some high (or infinite) dimensional kernel feature space. We assume there is a fixed vector w*, such that ‖w*‖ ≤ W, and a non-decreasing 1-Lipschitz function u : R → [0, 1], such that E[y|x] = u(w*·x) for all x. The restriction that u is 1-Lipschitz is without loss of generality, since the norm of w* is arbitrary (an equivalent restriction is that ‖w*‖ = 1 and that u is W-Lipschitz for an arbitrary W).

Our focus is on approximating the regression function well, as measured by the squared loss. For a real valued function h : B_d → [0, 1], define
err(h) = E_(x,y)[(h(x) − y)²],
ε(h) = err(h) − err(E[y|x]) = E_(x,y)[(h(x) − u(w*·x))²].
err(h) measures the error of h, and ε(h) measures the excess error of h compared to the Bayes-optimal predictor x ↦ u(w*·x). Our goal is to find h such that ε(h) (equivalently, err(h)) is as small as possible. In addition, we define the empirical counterparts êrr(h), ε̂(h), based on a sample (x_1, y_1), ..., (x_m, y_m), to be
êrr(h) = (1/m) Σ_{i=1}^m (h(x_i) − y_i)²;   ε̂(h) = (1/m) Σ_{i=1}^m (h(x_i) − u(w*·x_i))².
Note that ε̂ is the standard fixed design error (as this error conditions on the observed x's). Our algorithms work by iteratively constructing hypotheses h_t of the form h_t(x) = u_t(w^t·x), where u_t is a non-decreasing, 1-Lipschitz function, and w^t is a linear predictor. The algorithmic analysis provides conditions under which ε̂(h_t) is small, and using statistical arguments, one can guarantee that ε(h_t) would be small as well.

3 The GLM-TRON algorithm

We begin with the simpler case, where the transfer function u is assumed to be known (e.g., a sigmoid), and the problem is estimating w* properly. We present a simple, parameter-free, perceptron-like algorithm, GLM-TRON (Alg. 1), which efficiently finds a close-to-optimal predictor. We note that the algorithm works for arbitrary non-decreasing, Lipschitz functions u, and thus covers most generalized linear models. We refer the reader to the pseudo-code in Algorithm 1 for some of the notation used in this section.

Algorithm 1 GLM-TRON
Input: data ⟨(x_i, y_i)⟩_{i=1}^m ∈ R^d × [0, 1], u : R → [0, 1], held-out data ⟨(x_{m+j}, y_{m+j})⟩_{j=1}^s
  w^1 := 0
  for t = 1, 2, ... do
    h_t(x) := u(w^t·x)
    w^{t+1} := w^t + (1/m) Σ_{i=1}^m (y_i − u(w^t·x_i)) x_i
  end for
Output: arg min_{h_t} (1/s) Σ_{j=1}^s (h_t(x_{m+j}) − y_{m+j})²

To analyze the performance of the algorithm, we show that if we run the algorithm for sufficiently many iterations, one of the predictors h_t obtained must be nearly-optimal, compared to the Bayes-optimal predictor.

Theorem 1. Suppose (x_1, y_1), ..., (x_m, y_m) are drawn independently from a distribution supported on B_d × [0, 1], such that E[y|x] = u(w*·x), where ‖w*‖ ≤ W, and u : R → [0, 1] is a known non-decreasing 1-Lipschitz function. Then for any δ ∈ (0, 1), the following holds with probability at least 1 − δ: there exists some iteration t < O(W√(m/log(1/δ))) of GLM-TRON such that the hypothesis h_t(x) = u(w^t·x) satisfies
max{ε̂(h_t), ε(h_t)} ≤ O( W √(log(1/δ)/m) ).
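To make the update rule concrete, the following is a minimal sketch of the GLM-TRON loop in Python/NumPy. It is a reader's illustration under the assumptions above, not the authors' implementation: u is any known non-decreasing 1-Lipschitz map into [0, 1], and model selection over the returned iterates should be done on held-out data exactly as in Algorithm 1.

```python
import numpy as np

def glm_tron(X, y, u, T=100):
    """Sketch of the GLM-TRON update loop (Algorithm 1).

    X : (m, d) array with rows in the unit ball, y : (m,) array in [0, 1],
    u : known non-decreasing 1-Lipschitz transfer function (vectorized),
    T : number of iterations. Returns the iterates w^1, ..., w^T; in practice
    one keeps the iterate with the smallest squared error on held-out data.
    """
    m, d = X.shape
    w = np.zeros(d)
    iterates = []
    for _ in range(T):
        iterates.append(w.copy())
        # perceptron-like update: w^{t+1} = w^t + (1/m) sum_i (y_i - u(w^t . x_i)) x_i
        residual = y - u(X @ w)
        w = w + (X.T @ residual) / m
    return iterates

def logistic(z):
    """Example transfer function: the logistic function is non-decreasing
    and (1/4)-Lipschitz, hence in particular 1-Lipschitz."""
    return 1.0 / (1.0 + np.exp(-z))
```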

In particular, the theorem implies that some h_t has small enough ε(h_t). Since ε(h_t) equals err(h_t) up to a constant, we can easily find an appropriate h_t by picking the one that has least êrr(h_t) on a held-out set. The main idea of the proof is showing that at each iteration, if ε̂(h_t) is not small, then the squared distance ‖w^{t+1} − w*‖² is substantially smaller than ‖w^t − w*‖². Since the squared distance is bounded below by 0, and ‖w^1 − w*‖² ≤ W², there is an iteration (arrived at within reasonable time) such that the hypothesis h_t at that iteration is highly accurate. Although the algorithm minimizes empirical squared error, we can bound the true error using a uniform convergence argument. The complete proofs are provided in Appendix A.

4 The SLISOTRON algorithm

Algorithm 2 SLISOTRON
Input: data ⟨(x_i, y_i)⟩_{i=1}^m ∈ R^d × [0, 1], held-out data ⟨(x_{m+j}, y_{m+j})⟩_{j=1}^s
  w^1 := 0
  for t = 1, 2, ... do
    u_t := LIR((w^t·x_1, y_1), ..., (w^t·x_m, y_m))   // Fit 1-d function along w^t
    w^{t+1} := w^t + (1/m) Σ_{i=1}^m (y_i − u_t(w^t·x_i)) x_i
  end for
Output: arg min_{h_t} (1/s) Σ_{j=1}^s (h_t(x_{m+j}) − y_{m+j})²

In this section, we present SLISOTRON (Alg. 2), which is applicable to the harder setting where the transfer function u is unknown, except for it being non-decreasing and 1-Lipschitz. SLISOTRON does have one parameter, the Lipschitz constant; however, in theory we show that this can simply be set to 1. The main difference between SLISOTRON and GLM-TRON is that now the transfer function must also be learned, and the algorithm keeps track of a transfer function u_t which changes from iteration to iteration. The algorithm is inspired by the Isotron algorithm [7], with the main difference being that at each iteration, instead of applying the PAV procedure to fit an arbitrary monotonic function along the direction w^t, we use a different procedure, Lipschitz Isotonic Regression (LIR), to fit a Lipschitz monotonic function u_t along w^t. This key difference allows for an analysis that does not require a fresh sample each iteration. We also provide an efficient O(m log(m)) time algorithm for LIR (see Section 4.1), making SLISOTRON an extremely efficient algorithm.

We now turn to the formal theorem about our algorithm. The formal guarantees parallel those of the GLM-TRON algorithm. However, the rates achieved are somewhat worse, due to the additional difficulty of simultaneously estimating both u and w.

Theorem 2. Suppose (x_1, y_1), ..., (x_m, y_m) are drawn independently from a distribution supported on B_d × [0, 1], such that E[y|x] = u(w*·x), where ‖w*‖ ≤ W, and u : R → [0, 1] is an unknown non-decreasing 1-Lipschitz function. Then the following two bounds hold:

(Dimension-dependent) With probability at least 1 − δ, there exists some iteration t < O( (Wm / (d log(Wm/δ)))^(1/3) ) of SLISOTRON such that
max{ε̂(h_t), ε(h_t)} ≤ O( ( dW² log(Wm/δ) / m )^(1/3) ).

(Dimension-independent) With probability at least 1 − δ, there exists some iteration t < O( (Wm / log(1/δ))^(1/4) ) of SLISOTRON such that
max{ε̂(h_t), ε(h_t)} ≤ O( ( W² log(1/δ) / m )^(1/4) ).
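In the same spirit as the earlier sketch, here is a minimal illustration of the SLISOTRON loop. The routine fit_lipschitz_isotonic is a placeholder for any solver of the LIR problem defined in Section 4.1 (a simple reference solver is sketched there); the learned one-dimensional function is evaluated on new points by linear interpolation, as in the algorithm description. This is an illustrative sketch, not the authors' code.

```python
import numpy as np

def slisotron(X, y, fit_lipschitz_isotonic, T=100):
    """Sketch of the SLISOTRON update loop (Algorithm 2).

    fit_lipschitz_isotonic(z, y) must return the best non-decreasing
    1-Lipschitz fit (fitted values at the sorted z's), i.e. a solver for
    the LIR quadratic program of Section 4.1. Returns the list of
    (w^t, u_t) models; model selection on held-out data picks the best.
    """
    m, d = X.shape
    w = np.zeros(d)
    models = []
    for _ in range(T):
        z = X @ w
        order = np.argsort(z)
        yhat_sorted = fit_lipschitz_isotonic(z[order], y[order])
        # u_t evaluated back at the training projections (undo the sort)
        u_t_vals = np.empty(m)
        u_t_vals[order] = yhat_sorted
        models.append((w.copy(), (z[order], yhat_sorted)))
        # perceptron-like update with the learned transfer function
        w = w + (X.T @ (y - u_t_vals)) / m
    return models

def predict(model, X):
    """Evaluate a stored (w, u_t) pair on new points by linear interpolation."""
    w, (z_sorted, u_vals) = model
    return np.interp(X @ w, z_sorted, u_vals)
```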

As in the case of Theorem 1, one can easily find an h_t which satisfies the theorem's conditions, by running the SLISOTRON algorithm for sufficiently many iterations, and choosing the hypothesis h_t which minimizes êrr(h_t) on a held-out set. The algorithm minimizes empirical error, and generalization bounds are obtained using a uniform convergence argument. The proofs are somewhat involved and appear in Appendix B.

4.1 Lipschitz isotonic regression

The SLISOTRON algorithm (Alg. 2) performs Lipschitz Isotonic Regression (LIR) at each iteration. The goal is to find the best fit (least squared error) non-decreasing 1-Lipschitz function that fits the data in one dimension. Let (z_1, y_1), ..., (z_m, y_m) be such that z_i ∈ R, y_i ∈ [0, 1] and z_1 ≤ z_2 ≤ ... ≤ z_m. The Lipschitz Isotonic Regression (LIR) problem is defined as the following quadratic program:

Minimize w.r.t. ŷ_i:   (1/2) Σ_{i=1}^m (y_i − ŷ_i)²   (1)
subject to:  ŷ_i ≤ ŷ_{i+1}   for 1 ≤ i ≤ m−1   (Monotonicity)   (2)
             ŷ_{i+1} − ŷ_i ≤ z_{i+1} − z_i   for 1 ≤ i ≤ m−1   (Lipschitz)   (3)

Once the values ŷ_i are obtained at the data points, the actual function can be constructed by interpolating linearly between the data points. Prior to this work, the best known algorithm for this problem was due to Yeganova and Wilbur [10] and required O(m²) time for m points. In this work, we present an algorithm that performs the task in O(m log(m)) time. The actual algorithm is fairly complex and relies on designing a clever data structure. We provide a high-level view here; the details are provided in Appendix D.

Algorithm Sketch: We define functions G_i(·), where G_i(s) is the minimum squared loss that can be attained if ŷ_i is fixed to be s, and ŷ_{i+1}, ..., ŷ_m are then chosen to be the best fit 1-Lipschitz non-decreasing function to the points (z_{i+1}, y_{i+1}), ..., (z_m, y_m). Formally, for i = 1, ..., m, define the functions

G_i(s) = min_{ŷ_{i+1}, ..., ŷ_m} (1/2)(s − y_i)² + (1/2) Σ_{j=i+1}^m (ŷ_j − y_j)²   (4)

subject to the constraints (where s = ŷ_i)
  ŷ_j ≤ ŷ_{j+1}   for i ≤ j ≤ m−1   (Monotonicity)
  ŷ_{j+1} − ŷ_j ≤ z_{j+1} − z_j   for i ≤ j ≤ m−1   (Lipschitz)

Furthermore, define s*_i = arg min_s G_i(s). The functions G_i are piecewise quadratic, differentiable everywhere and strictly convex, a fact we prove in the full paper [11]. Thus, G_i is minimized at s*_i and it is strictly increasing on both sides of s*_i. Note that G_m(s) = (1/2)(s − y_m)² and hence is piecewise quadratic, differentiable everywhere and strictly convex. Let δ_i = z_{i+1} − z_i. The remaining G_i obey the following recursive relation:

G_i(s) = (1/2)(s − y_i)² + { G_{i+1}(s + δ_i)    if s ≤ s*_{i+1} − δ_i
                             G_{i+1}(s*_{i+1})    if s*_{i+1} − δ_i < s ≤ s*_{i+1}
                             G_{i+1}(s)           if s*_{i+1} < s }   (5)

As intuition for the above relation, note that G_i(s) is obtained by fixing ŷ_i = s and then choosing ŷ_{i+1} as close to s*_{i+1} as possible (since G_{i+1} is strictly increasing on both sides of s*_{i+1}) without violating either the monotonicity or Lipschitz constraints. The above argument can be immediately translated into an algorithm, if the values s*_i are known. Since s*_1 minimizes G_1(s), which is the same as the objective of (1), start with ŷ_1 = s*_1, and then successively choose values for ŷ_i to be as close to s*_i as possible without violating the Lipschitz or monotonicity constraints. This will produce an assignment for the ŷ_i which achieves loss equal to G_1(s*_1) and hence is optimal.
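The O(m log m) construction sketched above takes some care to implement. As a simple baseline (useful, for instance, to test a faster implementation against), the quadratic program (1)-(3) can also be handed directly to a generic constrained solver. The sketch below does exactly that with SciPy's SLSQP routine; it is a reference implementation only, polynomial-time but far from the near-linear algorithm of this section.

```python
import numpy as np
from scipy.optimize import minimize

def lir_reference(z, y):
    """Reference solver for Lipschitz Isotonic Regression, eqs. (1)-(3).

    z : increasing 1-d array of projections, y : targets in [0, 1].
    Returns fitted values yhat with yhat[i] <= yhat[i+1] and
    yhat[i+1] - yhat[i] <= z[i+1] - z[i]. Generic-QP baseline only.
    """
    z, y = np.asarray(z, float), np.asarray(y, float)
    gaps = np.diff(z)

    def objective(yhat):
        return 0.5 * np.sum((yhat - y) ** 2)

    def grad(yhat):
        return yhat - y

    constraints = [
        # monotonicity: yhat[i+1] - yhat[i] >= 0
        {"type": "ineq", "fun": lambda yh: np.diff(yh)},
        # Lipschitz: (z[i+1] - z[i]) - (yhat[i+1] - yhat[i]) >= 0
        {"type": "ineq", "fun": lambda yh: gaps - np.diff(yh)},
    ]
    res = minimize(objective, x0=y.copy(), jac=grad,
                   constraints=constraints, method="SLSQP")
    return res.x
```

Checking that a fast implementation matches this baseline on random inputs is a convenient sanity test.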

Figure 1: (a) Finding the zero of G'_i. (b) Update step to transform the representation of G'_{i+1} to G'_i.

The harder part of the algorithm is finding the values s*_i. Notice that the derivatives G'_i are all piecewise linear, continuous and strictly increasing, and obey a similar recursive relation (with G'_m(s) = s − y_m):

G'_i(s) = (s − y_i) + { G'_{i+1}(s + δ_i)   if s ≤ s*_{i+1} − δ_i
                        0                   if s*_{i+1} − δ_i < s ≤ s*_{i+1}
                        G'_{i+1}(s)         if s*_{i+1} < s }   (6)

The algorithm then finds s*_i by finding zeros of G'_i, starting from G'_m(s) = s − y_m and s*_m = y_m. We design a special data structure, called notable red-black trees, for representing piecewise linear, continuous, strictly increasing functions. We initialize such a tree T to represent G'_m(s) = s − y_m. Assuming that at some time it represents G'_{i+1}, we need to support two operations:

1. Find the zero of G'_{i+1} to get s*_{i+1}. Such an operation can be done efficiently in O(log(m)) time using a tree-like structure (Fig. 1(a)).
2. Update T to represent G'_i. This operation is more complicated, but using the relation (6), we do the following: Split the interval containing s*_{i+1}. Move the left half of the piecewise linear function G'_{i+1} to the left by δ_i (Fig. 1(b)), adding the constant zero function in between. Finally, we add the linear function s − y_i to every interval, to get G'_i, which is again piecewise linear, continuous and strictly increasing.

To perform the operations in step (2) above, we cannot naïvely apply the transformations shift-by(δ_i) and add(s − y_i) to every node in the tree, as it may take O(m) operations. Instead, we simply leave a note (hence the name notable red-black trees) that such a transformation should be applied before the function is evaluated at that node or at any of its descendants. To prevent a large number of such notes accumulating at any given node, we show that these notes satisfy certain commutative and additive relations, thus requiring us to keep track of no more than 2 notes at any given node. This lazy evaluation of notes allows us to perform all of the above operations in O(log(m)) time. The details of the construction are provided in Appendix D.

5 Experiments

In this section, we present an empirical study of the SLISOTRON and GLM-TRON algorithms. We perform two evaluations using synthetic data. The first one compares SLISOTRON and Isotron [7] and illustrates the importance of imposing a Lipschitz constraint. The second one demonstrates the advantage of using SLISOTRON over standard regression techniques, in the sense that SLISOTRON can learn any monotonic Lipschitz function. We also report results of an evaluation of SLISOTRON, GLM-TRON and several competing approaches on 5 UCI [12] datasets. All errors are reported in terms of average root mean squared error (RMSE) using 10-fold cross validation, along with the standard deviation.

5.1 Synthetic Experiments

Although the theoretical guarantees for Isotron are under the assumption that we get a fresh sample each round, one may still attempt to run Isotron on the same sample each iteration and evaluate the

empirical performance. Then, the main difference between SLISOTRON and Isotron is that while SLISOTRON fits the best Lipschitz monotonic function using LIR each iteration, Isotron merely finds the best monotonic fit using PAV. This difference is analogous to finding a large margin classifier vs. just a consistent one. We believe this difference will be particularly relevant when the data is sparse and lies in a high dimensional space.

Figure 2: (a) The figure shows the transfer functions as predicted by SLISOTRON and Isotron on Synthetic Experiment 1. The accompanying table shows the average RMSE using 10-fold cross validation, and the ∆ column shows the average difference between the RMSE values of the two algorithms across the folds. (b) The figure shows the transfer function as predicted by SLISOTRON on Synthetic Experiment 2. The accompanying table shows the average RMSE using 10-fold cross validation for SLISOTRON and Logistic Regression, and the ∆ column shows the average difference between the RMSE values of the two algorithms across the folds. (Numeric table entries omitted.)

Our first synthetic dataset is the following: The dataset is of size m = 500 in d = 500 dimensions. The first co-ordinate of each point is chosen uniformly at random from {−1, 0, 1}. The remaining co-ordinates are all 0, except that for each data point one of the remaining co-ordinates is randomly set to 1. The true direction is w* = (1, 0, ..., 0) and the transfer function is u(z) = (1 + z)/2. Both SLISOTRON and Isotron put weight on the first co-ordinate (the true direction). However, Isotron overfits the data using the remaining (irrelevant) co-ordinates, which SLISOTRON is prevented from doing because of the Lipschitz constraint. Figure 2(a) shows the transfer functions as predicted by the two algorithms, and the table below the plot shows the average RMSE using 10-fold cross validation. The ∆ column shows the average difference between the RMSE values of the two algorithms across the folds.

A principal advantage of SLISOTRON over standard regression techniques is that it is not necessary to know the transfer function in advance. The second synthetic experiment is designed as a sanity check to verify this claim. The dataset is of size m = 1000 in d = 4 dimensions. We chose a random direction as the true w* and used a piecewise linear function as the true u. We then added random noise (σ = 0.1) to the y values. We compared SLISOTRON to Logistic Regression on this dataset. SLISOTRON correctly recovers the true function (up to some scaling). Fig. 2(b) shows the actual transfer function as predicted by SLISOTRON, which is essentially the function we used. The table below the figure shows the performance comparison between SLISOTRON and logistic regression.

5.2 Real World Datasets

We now turn to describe the results of experiments performed on the following 5 UCI datasets: communities, concrete, housing, parkinsons, and wine-quality. We compared the performance of SLISOTRON (Sl-Iso) and GLM-TRON with logistic transfer function (GLM-t) against Isotron (Iso), as well as standard logistic regression (Log-R), linear regression (Lin-R) and a simple heuristic algorithm (SIM) for single index models, along the lines of standard iterative maximum-likelihood procedures for these types of problems (e.g., [13]). The SIM algorithm works by iteratively fixing the direction w and finding the best transfer function u, and then fixing u and

optimizing w via gradient descent. For each of the algorithms we performed 10-fold cross validation, using 1 fold each time as the test set, and we report averaged results across the folds. Table 1 shows average RMSE values of all the algorithms across 10 folds. The first column shows the mean Y value (with standard deviation) of the dataset for comparison. Table 2 shows the average difference between RMSE values of SLISOTRON and the other algorithms across the folds. Negative values indicate that the algorithm performed better than SLISOTRON.

The results suggest that the performance of SLISOTRON (and even Isotron) is comparable to other regression techniques, and in many cases also slightly better. The performance of GLM-TRON is similar to standard implementations of logistic regression on these datasets. This suggests that these algorithms should work well in practice, while providing non-trivial theoretical guarantees. It is also illustrative to see how the transfer functions found by SLISOTRON and Isotron compare. In Figure 3, we plot the transfer functions for concrete and communities. We see that the fits found by SLISOTRON tend to be smoother because of the Lipschitz constraint. We also observe that concrete is the only dataset where SLISOTRON performs noticeably better than logistic regression, and the transfer function is indeed somewhat far from the logistic function.

Table 1: Average RMSE values using 10-fold cross validation. The Ȳ column shows the mean Y value and standard deviation. Columns: dataset, Ȳ, Sl-Iso, GLM-t, Iso, Lin-R, Log-R, SIM; rows: communities, concrete, housing, parkinsons, winequality. (Numeric entries omitted.)

Table 2: Performance comparison of SLISOTRON with the other algorithms. The values reported are the average difference between RMSE values of the algorithm and SLISOTRON across the folds. Negative values indicate better performance than SLISOTRON. Columns: dataset, GLM-t, Iso, Lin-R, Log-R, SIM; rows: communities, concrete, housing, parkinsons, winequality. (Numeric entries omitted.)

Figure 3: The transfer function u as predicted by SLISOTRON (blue) and Isotron (red) for the (a) concrete and (b) communities datasets. The domain of both functions was normalized to [−1, 1].
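For concreteness, here is a minimal sketch of a data generator in the spirit of the first synthetic experiment of Section 5.1. The label model (Bernoulli draws with mean u(w*·x)) is an assumption of this sketch, since the experiment only specifies the conditional mean E[y|x] = u(w*·x).

```python
import numpy as np

def make_sparse_synthetic(m=500, d=500, rng=None):
    """Sketch of a generator in the spirit of synthetic experiment 1.

    First coordinate uniform over {-1, 0, 1}; all other coordinates are 0
    except one randomly chosen coordinate per point, set to 1 (so points may
    have norm sqrt(2); rescale if the unit-ball convention is needed).
    True direction w* = (1, 0, ..., 0), transfer u(z) = (1 + z) / 2.
    Labels drawn as Bernoulli(u(w*.x)), an assumption of this sketch.
    """
    rng = np.random.default_rng(rng)
    X = np.zeros((m, d))
    X[:, 0] = rng.choice([-1.0, 0.0, 1.0], size=m)
    X[np.arange(m), rng.integers(1, d, size=m)] = 1.0
    w_star = np.zeros(d)
    w_star[0] = 1.0
    p = (1.0 + X @ w_star) / 2.0           # u(w*.x) in [0, 1]
    y = rng.binomial(1, p).astype(float)    # E[y|x] = u(w*.x)
    return X, y, w_star
```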

References

[1] P. McCullagh and J. A. Nelder. Generalized Linear Models (2nd ed.). Chapman and Hall, 1989.
[2] P. Hall, W. Härdle and H. Ichimura. Optimal smoothing in single-index models. Annals of Statistics, 21(1):157–178, 1993.
[3] J. Horowitz and W. Härdle. Direct semiparametric estimation of single-index models with discrete covariates, 1994.
[4] A. Juditsky, M. Hristache and V. Spokoiny. Direct estimation of the index coefficients in a single-index model. Technical Report 3433, INRIA, May 1998.
[5] P. Naik and C. Tsai. Isotonic single-index model for high-dimensional database marketing. Computational Statistics and Data Analysis, 47, 2004.
[6] P. Ravikumar, M. Wainwright, and B. Yu. Single index convex experts: Efficient estimation via adapted Bregman losses. Snowbird Workshop, 2008.
[7] A. T. Kalai and R. Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT 2009.
[8] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio. Agnostically learning halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, FOCS '05, pages 11–20, Washington, DC, USA, 2005. IEEE Computer Society.
[9] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the zero-one loss. In COLT, 2010.
[10] L. Yeganova and W. J. Wilbur. Isotonic regression under Lipschitz constraint. Journal of Optimization Theory and Applications, 141(2):429–443, 2009.
[11] S. M. Kakade, A. T. Kalai, V. Kanade, and O. Shamir. Efficient learning of generalized linear and single index models with isotonic regression. arxiv.org/abs/1104.2018.
[12] UCI. University of California, Irvine.
[13] S. Cosslett. Distribution-free maximum-likelihood estimator of the binary choice model. Econometrica, 51(3), May 1983.
[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[15] G. Pisier. The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press, 1999.
[16] T. Zhang. Covering number bounds for certain regularized function classes. Journal of Machine Learning Research, 2:527–550, 2002.
[17] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[18] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise and fast rates. In NIPS, 2010. (Full version on arXiv.)
[19] S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977–1991, 2002.

A Proof of Theorem 1

The reader is referred to GLM-TRON (Alg. 1) for notation used in this section. The main lemma shows that as long as the error of the current hypothesis is large, the distance of our predicted direction vector w^t from the ideal direction w* decreases by a substantial amount at each step.

Lemma 1. At iteration t in GLM-TRON, suppose ‖w^t − w*‖ ≤ W. Then if ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ ≤ η, we have
‖w^t − w*‖² − ‖w^{t+1} − w*‖² ≥ ε̂(h_t) − 5Wη.

Proof. We have
‖w^t − w*‖² − ‖w^{t+1} − w*‖² = (2/m) Σ_i (y_i − u(w^t·x_i))(w*·x_i − w^t·x_i) − ‖(1/m) Σ_i (y_i − u(w^t·x_i)) x_i‖²   (7)
Consider the first term above:
(2/m) Σ_i (y_i − u(w^t·x_i))(w*·x_i − w^t·x_i)
 = (2/m) Σ_i (u(w*·x_i) − u(w^t·x_i))(w*·x_i − w^t·x_i) + (2/m) Σ_i (y_i − u(w*·x_i)) x_i · (w* − w^t).
Using the fact that u is non-decreasing and 1-Lipschitz (for the first term), and ‖w* − w^t‖ ≤ W and ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ ≤ η (for the second term), we can lower bound this by
(2/m) Σ_i (u(w*·x_i) − u(w^t·x_i))² − 2Wη = 2ε̂(h_t) − 2Wη.   (8)
For the second term in (7), we have
‖(1/m) Σ_i (y_i − u(w^t·x_i)) x_i‖² = ‖(1/m) Σ_i ((y_i − u(w*·x_i)) + (u(w*·x_i) − u(w^t·x_i))) x_i‖²
 ≤ ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖² + 2 ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ · ‖(1/m) Σ_i (u(w*·x_i) − u(w^t·x_i)) x_i‖ + ‖(1/m) Σ_i (u(w*·x_i) − u(w^t·x_i)) x_i‖².   (9)
Using the fact that ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ ≤ η, and using Jensen's inequality to show that
‖(1/m) Σ_i (u(w*·x_i) − u(w^t·x_i)) x_i‖² ≤ (1/m) Σ_i (u(w*·x_i) − u(w^t·x_i))² = ε̂(h_t),
and assuming W ≥ 1, we get
‖(1/m) Σ_i (y_i − u(w^t·x_i)) x_i‖² ≤ ε̂(h_t) + 3Wη.   (10)
Combining (8) and (10) in (7), we get ‖w^t − w*‖² − ‖w^{t+1} − w*‖² ≥ ε̂(h_t) − 5Wη.

To complete the proof of Theorem 1, the bound on ε̂(h_t) for some t now follows from Lemma 1. Let η = (1 + √(2 log(1/δ)))/√m. Notice that the (y_i − u(w*·x_i)) x_i, for all i, are i.i.d. 0-mean random variables with norm bounded by 1, so using Lemma 3 (Appendix B), ‖(1/m) Σ_i (y_i − u(w*·x_i)) x_i‖ ≤ η with high probability. Now using Lemma 1, at each iteration of algorithm GLM-TRON, either ‖w^{t+1} − w*‖² ≤ ‖w^t − w*‖² − Wη, or ε̂(h_t) ≤ 6Wη. If the

latter is the case, we are done. If not, since ‖w^{t+1} − w*‖² ≥ 0, and ‖w^1 − w*‖² = ‖w*‖² ≤ W², there can be at most W²/(Wη) = W/η iterations before ε̂(h_t) ≤ 6Wη. Overall, there is some h_t such that
ε̂(h_t) ≤ O( W √(log(1/δ)/m) ).
In addition, we can reduce this to a high-probability bound on ε(h_t) using a uniform convergence argument (Lemma 5, Appendix B), which is applicable since ‖w^t‖ ≤ 2W. Using a union bound, we get a bound which holds simultaneously for ε̂(h_t) and ε(h_t).

B Proof of Theorem 2

This section is devoted to the proof of Theorem 2. The reader is referred to Section 4 for notation. First we need a property of the LIR algorithm that is used to find the best one-dimensional non-decreasing 1-Lipschitz function. Formally, this problem can be defined as follows: Given as input {(z_i, y_i)}_{i=1}^m ⊆ [−W, W] × [0, 1], the goal is to find ŷ_1, ..., ŷ_m such that
(1/m) Σ_{i=1}^m (ŷ_i − y_i)²   (11)
is minimal, under the constraint that ŷ_i = u(z_i) for some non-decreasing 1-Lipschitz function u : [−W, W] → [0, 1]. After finding such values, LIR obtains an entire function u by interpolating linearly between the points. Assuming that the z_i are in sorted order, this can be formulated as a quadratic problem with the following constraints:
ŷ_i − ŷ_{i+1} ≤ 0   for 1 ≤ i < m   (12)
ŷ_{i+1} − ŷ_i − (z_{i+1} − z_i) ≤ 0   for 1 ≤ i < m   (13)

Lemma 2. Let (z_1, y_1), ..., (z_m, y_m) be input to LIR, where the z_i are increasing and y_i ∈ [0, 1]. Let ŷ_1, ..., ŷ_m be the output of LIR. Let f be any function such that f(β) − f(α) ≥ β − α for β ≥ α. Then
Σ_{i=1}^m (y_i − ŷ_i)(f(ŷ_i) − z_i) ≥ 0.

Proof. We first note that Σ_{j=1}^m (y_j − ŷ_j) = 0, since otherwise we could have found other values for ŷ_1, ..., ŷ_m which make (11) even smaller. For notational convenience, let ŷ_0 = 0 and z_0 = 0, and we may assume w.l.o.g. that f(ŷ_0) = 0. Define σ_i = Σ_{j=i}^m (y_j − ŷ_j). Then, summing by parts,
Σ_{i=1}^m (y_i − ŷ_i)(f(ŷ_i) − z_i) = Σ_{i=1}^m σ_i ((f(ŷ_i) − z_i) − (f(ŷ_{i−1}) − z_{i−1})).   (14)
Suppose that σ_i < 0. Intuitively, this means that if we could have decreased all values ŷ_i, ..., ŷ_m by an infinitesimal constant, then the objective function (11) would have been reduced, contradicting the optimality of the values. This means that the constraint ŷ_{i−1} − ŷ_i ≤ 0 must be tight, so we have (f(ŷ_i) − z_i) − (f(ŷ_{i−1}) − z_{i−1}) = −z_i + z_{i−1} ≤ 0 (this argument is informal, but can be easily formalized using KKT conditions). Similarly, when σ_i > 0, the constraint ŷ_i − ŷ_{i−1} − (z_i − z_{i−1}) ≤ 0 must be tight, hence f(ŷ_i) − f(ŷ_{i−1}) ≥ ŷ_i − ŷ_{i−1} = z_i − z_{i−1} ≥ 0. So in either case, each summand in (14) must be non-negative, leading to the required result.

We also use another result, for which we require a bit of additional notation. At each iteration of the SLISOTRON algorithm, we run the LIR procedure based on the training sample (x_1, y_1), ..., (x_m, y_m) and the current direction w^t, and get a non-decreasing Lipschitz function u_t. Define, for all i,
ŷ^t_i = u_t(w^t·x_i).
Recall that w*, u are such that E[y|x] = u(w*·x), and the input to the SLISOTRON algorithm is (x_1, y_1), ..., (x_m, y_m). Define, for all i,
ȳ_i = u(w*·x_i)

to be the expected value of each y_i. Clearly, we do not have access to the ȳ_i. However, consider a hypothetical call to LIR with inputs (w^t·x_i, ȳ_i), and suppose LIR returns the function ũ_t. In that case, define, for all i,
ỹ^t_i = ũ_t(w^t·x_i).
Our proof uses the following proposition, which relates the values ŷ^t_i (the values we can actually compute) and ỹ^t_i (the values we could compute if we had the conditional means of each y_i). The proof of Proposition 1 is somewhat lengthy and requires additional technical machinery, and is therefore relegated to Appendix C.

Proposition 1. With probability at least 1 − δ over the sample {(x_i, y_i)}_{i=1}^m, it holds for any t that (1/m) Σ_i |ŷ^t_i − ỹ^t_i| is at most the minimum of
O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) + W √(log(1/δ)/m) )   and   O( ( W² log(Wm/δ) / m )^(1/4) + W √(log(1/δ)/m) ).

The third auxiliary result we need is the following, which is well-known (see for example [14], Section 4).

Lemma 3. Suppose z_1, ..., z_m are i.i.d. 0-mean random variables in a Hilbert space, such that Pr(‖z_i‖ ≤ 1) = 1. Then with probability at least 1 − δ,
‖(1/m) Σ_i z_i‖ ≤ (1 + √(2 log(1/δ))) / √m.

With these auxiliary results in hand, we can now turn to prove Theorem 2 itself. The heart of the proof is the following lemma, which shows that the squared distance ‖w^t − w*‖² between w^t and the true direction w* decreases at each iteration at a rate which depends on the error of the hypothesis ε̂(h_t):

Lemma 4. Suppose that ‖w^t − w*‖ ≤ W, ‖(1/m) Σ_i (y_i − ȳ_i) x_i‖ ≤ η₁ and (1/m) Σ_i |ŷ^t_i − ỹ^t_i| ≤ η₂. Then
‖w^t − w*‖² − ‖w^{t+1} − w*‖² ≥ ε̂(h_t) − 5W(η₁ + 2η₂).

Proof. We have
‖w^{t+1} − w*‖² = ‖(w^{t+1} − w^t) + (w^t − w*)‖² = ‖w^{t+1} − w^t‖² + ‖w^t − w*‖² + 2(w^{t+1} − w^t)·(w^t − w*).
Since w^{t+1} − w^t = (1/m) Σ_i (y_i − ŷ^t_i) x_i, substituting this above and rearranging the terms we get
‖w^t − w*‖² − ‖w^{t+1} − w*‖² = (2/m) Σ_i (y_i − ŷ^t_i)(w*·x_i − w^t·x_i) − ‖(1/m) Σ_i (y_i − ŷ^t_i) x_i‖².   (15)
Consider the first term above:
(2/m) Σ_i (y_i − ŷ^t_i)(w*·x_i − w^t·x_i)
 = (2/m) Σ_i (y_i − ȳ_i) x_i · (w* − w^t)   (16)
 + (2/m) Σ_i (ȳ_i − ỹ^t_i)(w*·x_i − w^t·x_i)   (17)
 + (2/m) Σ_i (ỹ^t_i − ŷ^t_i)(w*·x_i − w^t·x_i).   (18)
The term (16) is at least −2Wη₁, and the term (18) is at least −2Wη₂ (since |(w* − w^t)·x_i| ≤ W). We thus consider the remaining term (17). Letting u be the true transfer function, suppose for a minute that it is strictly increasing, so its inverse u^{−1} is well defined. Then we have
(2/m) Σ_i (ȳ_i − ỹ^t_i)(w*·x_i − w^t·x_i) = (2/m) Σ_i (ȳ_i − ỹ^t_i)(w*·x_i − u^{−1}(ỹ^t_i)) + (2/m) Σ_i (ȳ_i − ỹ^t_i)(u^{−1}(ỹ^t_i) − w^t·x_i).

The second term in the expression above is non-negative by Lemma 2 (applied with f = u^{−1}, which satisfies f(β) − f(α) ≥ β − α since u is 1-Lipschitz). As to the first term, it is equal to (2/m) Σ_i (ȳ_i − ỹ^t_i)(u^{−1}(ȳ_i) − u^{−1}(ỹ^t_i)), which by the Lipschitz property of u is at least (2/m) Σ_i (ȳ_i − ỹ^t_i)² = 2ε̂(h̃_t), where h̃_t(x) = ũ_t(w^t·x). Plugging this in the above, we get
(2/m) Σ_i (y_i − ŷ^t_i)(w*·x_i − w^t·x_i) ≥ 2ε̂(h̃_t) − 2W(η₁ + η₂).   (19)
This inequality was obtained under the assumption that u is strictly increasing, but it is not hard to verify that the same would hold even if u is only non-decreasing. The second term in (15) can be bounded, using some tedious technical manipulations (see (9) and (10) in Appendix A), by
‖(1/m) Σ_i (y_i − ŷ^t_i) x_i‖² ≤ ε̂(h_t) + 3Wη₁.   (20)
Combining (19) and (20) in (15), we get
‖w^t − w*‖² − ‖w^{t+1} − w*‖² ≥ 2ε̂(h̃_t) − ε̂(h_t) − W(5η₁ + 2η₂).   (21)
Now, we claim that ε̂(h̃_t) ≥ ε̂(h_t) − 2η₂, since (1/m) Σ_i |ŷ^t_i − ỹ^t_i| ≤ η₂, and
ε̂(h̃_t) = (1/m) Σ_i (ỹ^t_i − ȳ_i)² = (1/m) Σ_i (ỹ^t_i − ŷ^t_i + ŷ^t_i − ȳ_i)²
 = (1/m) Σ_i ( (ŷ^t_i − ȳ_i)² + (ỹ^t_i − ŷ^t_i)(ỹ^t_i + ŷ^t_i − 2ȳ_i) )
 ≥ ε̂(h_t) − (2/m) Σ_i |ỹ^t_i − ŷ^t_i|,
where we have used the fact that |ỹ^t_i + ŷ^t_i − 2ȳ_i| ≤ 2. Plugging this into (21) leads to the desired result.

The bound on ε̂(h_t) in Theorem 2 now follows from Lemma 4. Using the notation from Lemma 4, η₁ can be set to the bound in Lemma 3, since the {(y_i − ȳ_i) x_i} are i.i.d. 0-mean random variables with norm bounded by 1. Also, η₂ can be set to either of the bounds in Proposition 1; η₂ is clearly the dominant term. Thus, we get that Lemma 4 holds, so at each iteration either ‖w^{t+1} − w*‖² ≤ ‖w^t − w*‖² − W(η₁ + 2η₂), or ε̂(h_t) ≤ 6W(η₁ + 2η₂). If the latter is the case, we are done. If not, since ‖w^{t+1} − w*‖² ≥ 0, and ‖w^1 − w*‖² = ‖w*‖² ≤ W², there can be at most W²/(W(η₁ + 2η₂)) = W/(η₁ + 2η₂) iterations before ε̂(h_t) ≤ 6W(η₁ + 2η₂). Plugging in the values for η₁, η₂ results in the bound on ε̂(h_t). Finally, to get a bound on ε(h_t), we utilize the following uniform convergence lemma:

Lemma 5. Suppose that E[y|x] = u(⟨w*, x⟩) for some non-decreasing 1-Lipschitz u and w* such that ‖w*‖ ≤ W. Then with probability at least 1 − δ over a sample (x_1, y_1), ..., (x_m, y_m), the following holds simultaneously for any function h(x) = û(ŵ·x) such that ‖ŵ‖ ≤ 2W and û is a non-decreasing 1-Lipschitz function:
|ε(h) − ε̂(h)| ≤ O( W √(log(1/δ)/m) ).

The proof of the lemma uses a covering number argument, and is shown as part of the more general Lemma 7 in Appendix C. This lemma applies in particular to h_t. Combining this with the bound on ε̂(h_t), and using a union bound, we get the result on ε(h_t) as well.

C Proof of Proposition 1

To prove the proposition, we actually prove a more general result. Define the function classes
U = {u : [−W, W] → [0, 1] : u 1-Lipschitz}
and
W = {x ↦ ⟨x, w⟩ : w ∈ R^d, ‖w‖ ≤ W},
where d is possibly infinite (for instance, if we are using kernels). It is easy to see that the proposition follows from the following uniform convergence guarantee:

Theorem 3. With probability at least 1 − δ, for any fixed w ∈ W, if we let
û = arg min_{u∈U} (1/m) Σ_i (u(w·x_i) − y_i)²
and define
ũ = arg min_{u∈U} (1/m) Σ_i (u(w·x_i) − E[y|x_i])²,
then
(1/m) Σ_i |û(w·x_i) − ũ(w·x_i)| ≤ O( min{ ( dW^{3/2} log(Wm/δ) / m )^(1/3) + W √(log(1/δ)/m),  ( W² log(1/δ) / m )^(1/4) } ).

To prove the theorem, we use the concept of (∞-norm) covering numbers. Given a function class F on some domain and some ε > 0, we define N_∞(ε, F) to be the smallest size of a covering set F' ⊆ F, such that for any f ∈ F, there exists some f' ∈ F' for which sup_x |f(x) − f'(x)| ≤ ε. In addition, we use a more refined notion of an ∞-norm covering number, which deals with an empirical sample of size m. Formally, define N_∞(ε, F, m) to be the smallest integer n, such that for any x_1, ..., x_m, one can construct a covering set F' ⊆ F of size at most n, such that for any f ∈ F, there exists some f' ∈ F' with max_{i=1,...,m} |f(x_i) − f'(x_i)| ≤ ε.

Lemma 6. Assuming m, 1/ε, W ≥ 1, we have the following covering number bounds:
1. N_∞(ε, U) ≤ (2/ε) · 2^{2W/ε}
2. N_∞(ε, W) ≤ (1 + 2W/ε)^d
3. N_∞(ε, U∘W) ≤ (4/ε) · 2^{4W/ε} · (1 + 4W/ε)^d
4. N_∞(ε, U∘W, m) ≤ (4/ε) · (2m + 1)^{1 + 8W²/ε²}

Proof. We start with the first bound. Discretize [−W, W] × [0, 1] to a two-dimensional grid {(−W + εa, εb)}, a = 0, ..., 2W/ε, b = 0, ..., 1/ε. It is easily verified that for any function u ∈ U, we can define a piecewise linear function u' which passes through points in the grid, and in between the points is either constant or linear with slope 1, such that sup_x |u(x) − u'(x)| ≤ ε. Moreover, all such functions are parameterized by their value at −W, and by whether they are sloping up or constant at each grid interval afterward. Thus, their number can be coarsely upper bounded as (2/ε) · 2^{2W/ε}. The second bound in the lemma is a well known fact; see for instance pg. 63 in [15]. The third bound in the lemma follows from combining the first two bounds, and using the Lipschitz property of u (we simply combine the two covers at an ε/2 scale, which leads to a cover at scale ε for U∘W). To get the fourth bound, we note that by Corollary 3 in [16],
N_∞(ε, W, m) ≤ (2m + 1)^{1 + W²/ε²}.
Note that unlike the second bound in the lemma, this bound is dimension-free, but has a worse dependence on W and ε. Also, we have N_∞(ε, U, m) ≤ N_∞(ε, U) ≤ (2/ε) · 2^{2W/ε} by definition of covering

numbers and the first bound in the lemma. Combining these two bounds, and using the Lipschitz property of u, we get
N_∞(ε, U∘W, m) ≤ (4/ε) · 2^{4W/ε} · (2m + 1)^{1 + 4W²/ε²}.
Upper bounding 2^{4W/ε} by (2m + 1)^{4W²/ε²}, the fourth bound in the lemma follows.

Lemma 7. With probability at least 1 − δ over a sample (x_1, y_1), ..., (x_m, y_m), the following bounds hold simultaneously for any w ∈ W and u, u' ∈ U:
| (1/m) Σ_i (u(w·x_i) − y_i)² − E[(u(w·x) − y)²] | ≤ O( W √(log(1/δ)/m) ),
| (1/m) Σ_i (u(w·x_i) − E[y|x_i])² − E[(u(w·x) − E[y|x])²] | ≤ O( W √(log(1/δ)/m) ),
| (1/m) Σ_i |u(w·x_i) − u'(w·x_i)| − E[|u(w·x) − u'(w·x)|] | ≤ O( W √(log(1/δ)/m) ).

Proof. Lemma 6 tells us that N_∞(ε, U∘W, m) ≤ (4/ε)(2m + 1)^{1 + 8W²/ε²}. It is easy to verify that the same covering number bound holds for the function classes {(x, y) ↦ (u(w·x) − y)² : u ∈ U, w ∈ W} and {x ↦ (u(w·x) − E[y|x])² : u ∈ U, w ∈ W}, by definition of the covering number and since the loss function is 2-Lipschitz. In a similar manner, one can show that the covering number of the function class {x ↦ |u(w·x) − u'(w·x)| : u, u' ∈ U, w ∈ W} is at most (4/ε)² (2m + 1)^{2 + 32W²/ε²}. Now, one just needs to use results from the literature which provide uniform convergence bounds given a covering number on the function class. In particular, combining a uniform convergence bound in terms of the Rademacher complexity of the function class (e.g., Theorem 8 in [17]), and a bound on the Rademacher complexity in terms of the covering number, using an entropy integral (e.g., Lemma A.3 in [18]), gives the desired result.

Lemma 8. With probability at least 1 − δ over a sample (x_1, y_1), ..., (x_m, y_m), the following holds simultaneously for any w ∈ W: if we let
û_w = arg min_{u∈U} (1/m) Σ_i (u(w·x_i) − y_i)²
denote the empirical risk minimizer with that fixed w, then
E[(û_w(w·x) − y)²] − inf_{u∈U} E[(u(w·x) − y)²] ≤ O( W ( d log(Wm/δ) / m )^(2/3) ).

Proof. For generic losses and function classes, standard bounds on the excess error typically scale as O(√(1/m)). However, we can utilize the fact that we are dealing with the squared loss to get better rates. In particular, using Theorem 4 in [19], as well as the bound on N_∞(ε, U) from Lemma 6, we get that for any fixed w, with probability at least 1 − δ,
E[(û_w(w·x) − y)²] − inf_{u∈U} E[(u(w·x) − y)²] ≤ O( W ( log(1/δ) / m )^(2/3) ).
To get a statement which holds simultaneously for any w ∈ W, we apply a union bound over a covering set of W. In particular, by Lemma 6, we know that we can cover W by a set W' of size at most (1 + 2W/ε)^d, such that any element in W is at most ε-far (in an ∞-norm sense) from some element of W'. So applying a union bound over W', we get that with probability at least 1 − δ, it holds simultaneously for any w' ∈ W' that
E[(û_{w'}(⟨w', x⟩) − y)²] − inf_u E[(u(⟨w', x⟩) − y)²] ≤ O( W ( (log(1/δ) + d log(1 + 2W/ε)) / m )^(2/3) ).   (22)

Now, for any w ∈ W, if we let w' denote the closest element in W', then u(w·x) and u(⟨w', x⟩) are ε-close uniformly for any u ∈ U and any x. From this, it is easy to see that we can extend (22) to hold for any w ∈ W, with an additional O(ε) element on the right hand side. In other words, with probability at least 1 − δ, it holds simultaneously for any w ∈ W that
E[(û_w(w·x) − y)²] − inf_u E[(u(w·x) − y)²] ≤ O( W ( (log(1/δ) + d log(1 + 2W/ε)) / m )^(2/3) + ε ).
Picking (say) ε = 1/m provides the required result.

Lemma 9. Let F be a convex class of functions, and let f* = arg min_{f∈F} E[(f(x) − y)²]. Suppose that E[y|x] ∈ U∘W. Then for any f ∈ F, it holds that
E[(f(x) − y)²] − E[(f*(x) − y)²] ≥ E[(f(x) − f*(x))²] ≥ (E[|f(x) − f*(x)|])².

Proof. It is easily verified that
E[(f(x) − y)²] − E[(f*(x) − y)²] = E_x[(f(x) − E[y|x])²] − E_x[(f*(x) − E[y|x])²].   (23)
This implies that f* = arg min_{f∈F} E[(f(x) − E[y|x])²]. Consider the L₂ Hilbert space of square-integrable functions, with respect to the measure induced by the distribution on x (i.e., the inner product is defined as ⟨f, f'⟩ = E_x[f(x) f'(x)]). Note that E[y|x] ∈ U∘W is a member of that space. Viewing E[y|x] as a function y(x), what we need to show is that ‖f − y‖² − ‖f* − y‖² ≥ ‖f − f*‖². By expanding, it can be verified that this is equivalent to showing ⟨f* − y, f − f*⟩ ≥ 0. To prove this, we start by noticing that according to (23), f* minimizes ‖f − y‖² over F. Therefore, for any f ∈ F and any ε ∈ (0, 1),
‖(1 − ε) f* + ε f − y‖² − ‖f* − y‖² ≥ 0,   (24)
as (1 − ε) f* + ε f ∈ F by convexity of F. However, the left hand side of (24) equals ε² ‖f − f*‖² + 2ε ⟨f* − y, f − f*⟩, so to ensure that (24) is non-negative for arbitrarily small ε, we must have ⟨f* − y, f − f*⟩ ≥ 0. This gives us the required result, and establishes the first inequality in the lemma statement. The second inequality is just by convexity of the squared function.

Proof of Theorem 3. We bound (1/m) Σ_i |û(w·x_i) − ũ(w·x_i)| in two different ways, one which is dimension-dependent and one which is dimension-independent.

We begin with the dimension-dependent bound. For any fixed w, let u* be arg min_{u∈U} E[(u(w·x) − y)²]. We have from Lemma 8 that with probability at least 1 − δ, simultaneously for all w ∈ W,
E[(û(w·x) − y)²] − E[(u*(w·x) − y)²] ≤ O( W ( d log(Wm/δ) / m )^(2/3) ),
and by Lemma 9, this implies
E[|û(w·x) − u*(w·x)|] ≤ O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) ).   (25)
Now, we note that since u* = arg min_{u∈U} E[(u(w·x) − y)²], then u* = arg min_{u∈U} E[(u(w·x) − E[y|x])²] as well. Again applying Lemma 8 and Lemma 9 in a similar manner, but now with respect to ũ, we get that with probability at least 1 − δ, simultaneously for all w ∈ W,
E[|ũ(w·x) − u*(w·x)|] ≤ O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) ).   (26)

Combining (25) and (26) with a union bound, we have
E[|û(w·x) − ũ(w·x)|] ≤ O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) ).
Finally, we invoke the last inequality in Lemma 7, using a union bound, to get
(1/m) Σ_i |û(w·x_i) − ũ(w·x_i)| ≤ O( ( dW^{3/2} log(Wm/δ) / m )^(1/3) + W √(log(1/δ)/m) ).

We now turn to the dimension-independent bound. In this case, the covering number bounds are different, and we do not know how to prove an analogue of Lemma 8 (with a rate faster than O(1/√m)). This leads to a somewhat worse bound in terms of the dependence on m. As before, for any fixed w, we let u* be arg min_{u∈U} E[(u(w·x) − y)²]. Lemma 7 tells us that the empirical risk (1/m) Σ_i (u(w·x_i) − y_i)² is concentrated around its expectation uniformly for any u, w. In particular,
| (1/m) Σ_i (û(w·x_i) − y_i)² − E[(û(w·x) − y)²] | ≤ O( W √(log(1/δ)/m) )
as well as
| (1/m) Σ_i (u*(w·x_i) − y_i)² − E[(u*(w·x) − y)²] | ≤ O( W √(log(1/δ)/m) ),
but since û was chosen to be the empirical risk minimizer, it follows that
E[(û(w·x) − y)²] − E[(u*(w·x) − y)²] ≤ O( W √(log(1/δ)/m) ),
so by Lemma 9,
E[|u*(w·x) − û(w·x)|] ≤ O( ( W² log(1/δ) / m )^(1/4) ).   (27)
Now, it is not hard to see that if u* = arg min_{u∈U} E[(u(w·x) − y)²], then u* = arg min_{u∈U} E[(u(w·x) − E[y|x])²] as well. Again invoking Lemma 7, and making similar arguments, it follows that
E[|u*(w·x) − ũ(w·x)|] ≤ O( ( W² log(1/δ) / m )^(1/4) ).   (28)
Combining (27) and (28), we get
E[|û(w·x) − ũ(w·x)|] ≤ O( ( W² log(1/δ) / m )^(1/4) ).
We now invoke Lemma 7 to get
(1/m) Σ_i |û(w·x_i) − ũ(w·x_i)| ≤ O( ( W² log(1/δ) / m )^(1/4) ).   (29)

D Lipschitz Isotonic Regression

The reader is referred to Section 4 for notation used in this section. We first prove that the functions G_i defined in Section 4.1, eq. (4), are piecewise quadratic, differentiable everywhere and strictly convex.

Lemma 10. For i = 1, ..., m, the functions G_i (Section 4.1, eq. (4)) are piecewise quadratic, differentiable everywhere and strictly convex.

Proof. We prove this assertion by induction on j = m − i. The statement is true for j = 0 (i = m), since G_m(s) = (1/2)(s − y_m)². Recall that δ_i = z_{i+1} − z_i and s*_i = arg min_s G_i(s). Suppose that the assertion is true for i + 1 (j = m − i − 1); we shall prove the same for i (j = m − i). Suppose ŷ_i is fixed to s. Recall that G_i(s) is the minimum possible error attained by fitting the points (z_i, y_i), (z_{i+1}, y_{i+1}), ..., (z_m, y_m), satisfying the Lipschitz and monotonicity constraints and setting ŷ_i = s (the prediction at the point z_i). For any assignment ŷ_{i+1} = s', the least error obtained by fitting the rest of the points is G_{i+1}(s'). Thus,
G_i(s) = (1/2)(s − y_i)² + min_{s'} G_{i+1}(s')
subject to the conditions that s' ≥ s and s' − s ≤ δ_i. Since G_{i+1} is piecewise quadratic, differentiable everywhere and strictly convex, G_{i+1} is minimal at s*_{i+1} and is increasing on both sides of s*_{i+1}. Thus, the quantity above is minimized by picking s' to be as close to s*_{i+1} as possible without violating any constraint. Hence, we get the recurrence relation:
G_i(s) = (1/2)(s − y_i)² + { G_{i+1}(s + δ_i)    if s ≤ s*_{i+1} − δ_i
                             G_{i+1}(s*_{i+1})    if s*_{i+1} − δ_i < s ≤ s*_{i+1}
                             G_{i+1}(s)           if s*_{i+1} < s }   (30)
By the assumption for i + 1, G_{i+1} is a piecewise quadratic function, say defined on intervals (a_0 = −∞, a_1], ..., (a_{k−1}, a_k], (a_k, a_{k+1}], ..., (a_{l−1}, a_l = ∞). Let s*_{i+1} ∈ (a_{k−1}, a_k] and note that G'_{i+1}(s*_{i+1}) = 0. Define an intermediate function F(s) which is defined on l + 1 intervals,
(−∞, a_1 − δ_i], (a_1 − δ_i, a_2 − δ_i], ..., (a_{k−1} − δ_i, s*_{i+1} − δ_i], (s*_{i+1} − δ_i, s*_{i+1}], (s*_{i+1}, a_k], (a_k, a_{k+1}], ..., (a_{l−1}, ∞),
where
1. F(s) = G_{i+1}(s + δ_i) on the intervals (−∞, a_1 − δ_i], ..., (a_{k−1} − δ_i, s*_{i+1} − δ_i],
2. F(s) = G_{i+1}(s*_{i+1}) on the interval (s*_{i+1} − δ_i, s*_{i+1}],
3. F(s) = G_{i+1}(s) on the intervals (s*_{i+1}, a_k], ..., (a_{l−1}, ∞).
Since G_{i+1} is piecewise quadratic, so is F on the intervals defined above (constant on the interval (s*_{i+1} − δ_i, s*_{i+1}]). F is differentiable everywhere, since G'_{i+1}(s*_{i+1}) = 0. Observe that G_i(s) = (1/2)(s − y_i)² + F(s); this makes G_i(s) strictly convex, while retaining the fact that it is piecewise quadratic and differentiable everywhere.

D.1 Algorithm

We now give an efficient algorithm for Lipschitz Isotonic Regression. As discussed already in Section 4.1, it suffices to find s*_i for i = 1, ..., m. Once these values are known, it is easy to find values ŷ_1, ..., ŷ_m that are optimal. The pseudocode for the LIR algorithm is given in Alg. 3. As discussed in Section 4.1, the algorithm maintains G'_i, and finds s*_i by finding the zero of G'_i. G'_i is piecewise linear, continuous and strictly increasing. We design a new data structure, notable red-black trees, to store these piecewise linear, continuous, strictly increasing functions.

A notable red-black tree (henceforth NRB-Tree) stores a piecewise linear function. The disjoint intervals on which the function is defined have a natural order; thus the intervals serve as the key and the linear function is stored as the value at a node. Apart from storing these key-value pairs, the nodes of the tree also store some notes that are applied to the entire subtree rooted at that node, hence the name notable red-black trees. We assume that standard tree operations, i.e., finding and inserting nodes, can be performed in O(log(m)) time. There is a slight caveat that there may be notes at some of the nodes, but we show below how to flush all notes at a given node in constant time.

Footnote: Any tree structure that uses only local rotations and guarantees O(log(m)) worst-case or amortized find and insert operations may be used.
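To illustrate the lazy-evaluation idea on a single node, here is a small sketch of how a shift-by(δ) note and an add-linear note can be accumulated and flushed in constant time. The field names and the exact composition rule written out in add_note_shift and flush are this sketch's own choices, and the full balanced-tree bookkeeping is omitted; it is meant only to show why two accumulated notes per node suffice.

```python
class NRBNode:
    """One node of a notable tree: an interval (lo, hi] carrying a linear piece
    f(s) = slope * s + intercept, plus pending notes that apply lazily to this
    node and its whole subtree. Notes are read as: 'shift left by note_shift,
    then add note_add_slope * s + note_add_intercept'."""

    def __init__(self, lo, hi, slope, intercept):
        self.lo, self.hi = lo, hi
        self.slope, self.intercept = slope, intercept
        self.left = self.right = None
        self.note_shift = 0.0
        self.note_add_slope = 0.0
        self.note_add_intercept = 0.0

    def add_note_shift(self, delta):
        # Composing "existing note, then shift by delta": the pending added
        # linear part a*s + b becomes a*(s + delta) + b, so b picks up a*delta.
        self.note_add_intercept += self.note_add_slope * delta
        self.note_shift += delta

    def add_note_add_linear(self, a, b):
        # Adding a linear function after the existing note is purely additive.
        self.note_add_slope += a
        self.note_add_intercept += b

    def flush(self):
        """Apply the pending notes to this node and push them to the children."""
        d, a, b = self.note_shift, self.note_add_slope, self.note_add_intercept
        # Shift left by d: (lo, hi] -> (lo - d, hi - d], f(s) -> f(s + d).
        self.lo -= d
        self.hi -= d
        self.intercept += self.slope * d
        # Then add the linear function a*s + b.
        self.slope += a
        self.intercept += b
        # Children inherit the combined note; their own (older) notes apply first.
        for child in (self.left, self.right):
            if child is not None:
                child.add_note_shift(d)
                child.add_note_add_linear(a, b)
        self.note_shift = self.note_add_slope = self.note_add_intercept = 0.0
```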

Thus, whenever a node is accessed for find or insert operations, the notes are first applied to the node appropriately, and then the standard tree operation can continue.

Algorithm 3 LIR: Lipschitz Isotonic Regression
Input: (z_1, y_1), ..., (z_m, y_m)
  s*_i := 0 for i = 1, ..., m;  ŷ_i := 0 for i = 1, ..., m
  NRB-Tree T(s − y_m)   // new T represents s − y_m
  δ_i := z_{i+1} − z_i for i = 1, ..., m−1
  for i = m, m−1, ..., 1 do
    s*_i := T.zero()
    if i = 1 then break end if
    T.update(s*_i, δ_{i−1}, y_{i−1})
  end for
  ŷ_1 := s*_1
  for i = 2, ..., m do
    if s*_i > ŷ_{i−1} + δ_{i−1} then ŷ_i := ŷ_{i−1} + δ_{i−1}
    else if s*_i ≥ ŷ_{i−1} then ŷ_i := s*_i
    else ŷ_i := ŷ_{i−1}
    end if
  end for
Output: ŷ_1, ..., ŷ_m

The algorithm initializes T to represent the linear function s − y_m. This can be performed very easily by just creating a root node, setting the interval (key) to be (−∞, ∞) and the linear function (value) to s − y_m, represented by coefficients (1, −y_m). There are two function calls that algorithm LIR makes on T.

Tzero(): This finds the zero (the point where the function is 0) of the piecewise linear function G'_i represented by the NRB-Tree T. Since the function is piecewise linear, continuous and strictly increasing, a unique zero exists, and the tree structure is well-suited to finding the zero in O(log m) time.

Tupdate(s*_i, δ_{i−1}, y_{i−1}): Having obtained the zero of G'_i, the algorithm needs to update T to now represent G'_{i−1}. This involves the following:
(a) Suppose (a_{k−1}, a_k] is the interval containing s*_i. Split the interval (a_{k−1}, a_k] into two intervals (a_{k−1}, s*_i] and (s*_i, a_k] (see Figure 1(a) in Section 4.1). This operation is easy to perform on a binary tree in O(log m) time: it involves finding the node representing the interval (a_{k−1}, a_k] that contains s*_i, then modifying it to represent the interval (a_{k−1}, s*_i]. Then a new node representing the interval (s*_i, a_k] (currently none of the intervals in the tree overlaps with this interval) is inserted, and the linear function on this interval is set to be the same as the one on (a_{k−1}, s*_i].
(b) All the intervals to the left of s*_i, i.e., up to (a_{k−1}, s*_i], are to be moved to the left by δ_{i−1}. Formally, if the interval is (a_1, a_2] and the linear function is (l_1, l_2), representing l_1 s + l_2, then after moving to the left by δ_{i−1} the interval becomes (a_1 − δ_{i−1}, a_2 − δ_{i−1}] and the linear function becomes (l_1, l_2 + l_1 δ_{i−1}) (see Figure 1(b) in Section 4.1).
The above operation may involve modifying up to O(m) nodes in the tree. However, rather than actually making the modifications, we only leave a note at appropriate nodes that such a modification is to be performed. Whenever a note shift-by(δ_{i−1})


Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials Fast Montgoery-like Square Root Coputation over GF( ) for All Trinoials Yin Li a, Yu Zhang a, a Departent of Coputer Science and Technology, Xinyang Noral University, Henan, P.R.China Abstract This letter

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it

More information

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes Radeacher Coplexity Margin Bounds for Learning with a Large Nuber of Classes Vitaly Kuznetsov Courant Institute of Matheatical Sciences, 25 Mercer street, New York, NY, 002 Mehryar Mohri Courant Institute

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

arxiv: v1 [cs.ds] 29 Jan 2012

arxiv: v1 [cs.ds] 29 Jan 2012 A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

Least Squares Fitting of Data

Least Squares Fitting of Data Least Squares Fitting of Data David Eberly, Geoetric Tools, Redond WA 98052 https://www.geoetrictools.co/ This work is licensed under the Creative Coons Attribution 4.0 International License. To view a

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x)

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x) 7Applying Nelder Mead s Optiization Algorith APPLYING NELDER MEAD S OPTIMIZATION ALGORITHM FOR MULTIPLE GLOBAL MINIMA Abstract Ştefan ŞTEFĂNESCU * The iterative deterinistic optiization ethod could not

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Mixed Robust/Average Submodular Partitioning

Mixed Robust/Average Submodular Partitioning Mixed Robust/Average Subodular Partitioning Kai Wei 1 Rishabh Iyer 1 Shengjie Wang 2 Wenruo Bai 1 Jeff Biles 1 1 Departent of Electrical Engineering, University of Washington 2 Departent of Coputer Science,

More information

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40 On Poset Merging Peter Chen Guoli Ding Steve Seiden Abstract We consider the follow poset erging proble: Let X and Y be two subsets of a partially ordered set S. Given coplete inforation about the ordering

More information

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors New Slack-Monotonic Schedulability Analysis of Real-Tie Tasks on Multiprocessors Risat Mahud Pathan and Jan Jonsson Chalers University of Technology SE-41 96, Göteborg, Sweden {risat, janjo}@chalers.se

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

The Methods of Solution for Constrained Nonlinear Programming

The Methods of Solution for Constrained Nonlinear Programming Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 3(March 2014), PP 01-06 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.co The Methods of Solution for Constrained

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information