Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4



Jean Honorio, jhonorio@purdue.edu

1 Deterministic Optimization

For brevity, everywhere differentiable functions will be called smooth. Similarly, functions that are not everywhere differentiable will be called nonsmooth. First, we define Lipschitz continuity.

Definition 4.1 (Lipschitz continuity). A function $\phi : \mathbb{R}^p \to \mathbb{R}$ is called K-Lipschitz continuous with respect to the norm $\|\cdot\|$ if and only if there is a constant $K < +\infty$ such that

$$(\forall w, u \in \mathbb{R}^p) \quad |\phi(w) - \phi(u)| \le K \|w - u\|$$

A smooth function $\phi : \mathbb{R}^p \to \mathbb{R}$ is called K-Lipschitz continuous with respect to the norm $\|\cdot\|$ if and only if there is a constant $K < +\infty$ such that

$$(\forall w \in \mathbb{R}^p) \quad \|\nabla\phi(w)\|_* \le K$$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$ (the $\ell_2$-norm is its own dual).

Recall that the gradient of a smooth convex function $\phi : \mathbb{R}^p \to \mathbb{R}$ at $w$ fulfills:

$$(\forall u \in \mathbb{R}^p) \quad \phi(u) - \phi(w) \ge \langle \nabla\phi(w), u - w \rangle$$

Definition 4.2 (Subgradient). For a possibly nonsmooth convex function $\phi : \mathbb{R}^p \to \mathbb{R}$, we can define a subdifferential set as follows:

$$\partial\phi(w) = \{ g \mid (\forall u \in \mathbb{R}^p) \ \phi(u) - \phi(w) \ge \langle g, u - w \rangle \}$$

Each element $g \in \partial\phi(w)$ is called a subdifferential or subgradient of $\phi$ at $w$. Clearly, in the above definition, if $\phi$ is smooth, then $\partial\phi(w)$ has a single element for every $w \in \mathbb{R}^p$. If $\phi$ is nonsmooth, there exists some $w \in \mathbb{R}^p$ for which $\partial\phi(w)$ has more than one element.

Consider for instance the nonsmooth function $\phi(w) = |w|$ where $w \in \mathbb{R}$. By Definition 4.2, we have:

$$\partial\phi(0) = \{ g \mid (\forall u \in \mathbb{R}) \ |u| \ge g u \}$$

Thus, clearly $\partial\phi(0) = [-1, +1]$.
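The subgradient inequality can be probed numerically for the absolute-value example. The sketch below is illustrative only: the helper `is_subgradient`, the test grid, and the candidate values of g are assumptions introduced here, not part of the lecture. A finite grid can refute membership in the subdifferential but cannot prove it.

```python
def is_subgradient(phi, g, w, test_points, tol=1e-12):
    """Check phi(u) - phi(w) >= g * (u - w) for every u in a finite test set."""
    return all(phi(u) - phi(w) >= g * (u - w) - tol for u in test_points)


# Hypothetical test grid on [-3, 3] with spacing 0.1.
grid = [k / 10.0 for k in range(-30, 31)]

# Candidates inside [-1, +1] satisfy the inequality at w = 0 on the whole grid.
inside = [g for g in (-1.0, -0.5, 0.0, 0.5, 1.0) if is_subgradient(abs, g, 0.0, grid)]

# Candidates outside [-1, +1] are refuted by at least one grid point.
outside = [g for g in (-1.5, 1.5) if is_subgradient(abs, g, 0.0, grid)]

print(inside)
print(outside)
```

At a point where the function is differentiable, such as w = 2, the same check accepts only the slope of the function there: `is_subgradient(abs, 1.0, 2.0, grid)` holds, while `is_subgradient(abs, 0.5, 2.0, grid)` fails.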

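Now, consider the following optimization problem where $f : \mathbb{R}^p \to \mathbb{R}$ is convex and K-Lipschitz with respect to the $\ell_2$-norm:

$$\hat{w} = \arg\min_{w \in \mathbb{R}^p} f(w) \qquad (1)$$

Let $\eta_t$ be the step size at iteration $t$. Specifically, let $\beta$ be a constant factor and define:

$$\eta_t = \frac{\beta}{K \sqrt{t}}$$

Consider the next subgradient descent algorithm for solving the above problem:

Algorithm 4.1 Subgradient descent algorithm
  Input: Number of iterations $T$, factor $\beta > 0$, initial point $w^{(1)} \in \mathbb{R}^p$
         (The setting of $w^{(1)}$ can be uninformed, e.g., $w^{(1)} = 0$)
  for $t = 1 \ldots T$ do
    $w^{(t+1)} \leftarrow w^{(t)} - \eta_t g^{(t)}$ where $g^{(t)} \in \partial f(w^{(t)})$
  end for
  Output: $\bar{w}^{(T)} \leftarrow \sum_{t=1}^{T} \eta_t w^{(t)} \big/ \sum_{t=1}^{T} \eta_t$

In what follows, we state our main result regarding convergence rates for Algorithm 4.1.

Theorem 4.1 (Adapted from [1, 2]). Assume that $f : \mathbb{R}^p \to \mathbb{R}$ is convex and K-Lipschitz with respect to the $\ell_2$-norm in the problem of eq.(1). Recall that $\hat{w}$ is the optimal solution of the problem of eq.(1). Assume that Algorithm 4.1 runs for a number of iterations $T \ge 2$, factor $\beta$ and initial point $w^{(1)}$, and that the algorithm outputs $\bar{w}^{(T)}$. We have:

$$f(\bar{w}^{(T)}) - f(\hat{w}) \le K \left( \frac{\|w^{(1)} - \hat{w}\|_2^2}{4\beta(\sqrt{T} - 1)} + \frac{\beta (1 + \log T)}{4(\sqrt{T} - 1)} \right)$$

Proof. Let $a_t \equiv \frac{1}{2}\|w^{(t)} - \hat{w}\|_2^2$. Note that since $g^{(t)} \in \partial f(w^{(t)})$, by Definition 4.2 we have:

$$f(\hat{w}) - f(w^{(t)}) \ge \langle g^{(t)}, \hat{w} - w^{(t)} \rangle$$

In fact, the above inequality holds with any point of $\mathbb{R}^p$ in place of $\hat{w}$, but in our problem we only care about the specific $\hat{w}$ of eq.(1). By the Lipschitz continuity of $f$, we know that $(\forall t) \ \|g^{(t)}\|_2 \le K$. Therefore:

$$a_{t+1} = \tfrac{1}{2}\|w^{(t+1)} - \hat{w}\|_2^2 = \tfrac{1}{2}\|w^{(t)} - \hat{w} - \eta_t g^{(t)}\|_2^2$$
$$= \tfrac{1}{2}\|w^{(t)} - \hat{w}\|_2^2 - \eta_t \langle g^{(t)}, w^{(t)} - \hat{w} \rangle + \tfrac{1}{2}\eta_t^2 \|g^{(t)}\|_2^2$$
$$\le a_t + \eta_t \left( f(\hat{w}) - f(w^{(t)}) \right) + \tfrac{1}{2}\eta_t^2 K^2$$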

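Reorganizing the above, we obtain:

$$\eta_t \left( f(w^{(t)}) - f(\hat{w}) \right) \le a_t - a_{t+1} + \tfrac{1}{2}\eta_t^2 K^2$$

Summing over $t$, we get:

$$\sum_{t=1}^{T} \eta_t \left( f(w^{(t)}) - f(\hat{w}) \right) \le \sum_{t=1}^{T} (a_t - a_{t+1}) + \frac{K^2}{2} \sum_{t=1}^{T} \eta_t^2 = a_1 - a_{T+1} + \frac{K^2}{2} \sum_{t=1}^{T} \eta_t^2$$
$$\le \tfrac{1}{2}\|w^{(1)} - \hat{w}\|_2^2 + \frac{\beta^2}{2} \sum_{t=1}^{T} \frac{1}{t} \le \tfrac{1}{2}\|w^{(1)} - \hat{w}\|_2^2 + \frac{\beta^2}{2} (1 + \log T)$$

where we used $a_{T+1} \ge 0$ and the fact that $\sum_{t=1}^{T} 1/t \le 1 + \log T$. By Jensen's inequality and convexity of $f$, we have:

$$f(\bar{w}^{(T)}) - f(\hat{w}) = f\!\left( \frac{\sum_{t=1}^{T} \eta_t w^{(t)}}{\sum_{t=1}^{T} \eta_t} \right) - f(\hat{w}) \le \frac{\sum_{t=1}^{T} \eta_t f(w^{(t)})}{\sum_{t=1}^{T} \eta_t} - f(\hat{w}) = \frac{\sum_{t=1}^{T} \eta_t \left( f(w^{(t)}) - f(\hat{w}) \right)}{\sum_{t=1}^{T} \eta_t}$$
$$\le \frac{\tfrac{1}{2}\|w^{(1)} - \hat{w}\|_2^2 + \tfrac{1}{2}\beta^2 (1 + \log T)}{\frac{\beta}{K} \sum_{t=1}^{T} \frac{1}{\sqrt{t}}} \le \frac{\tfrac{1}{2}\|w^{(1)} - \hat{w}\|_2^2 + \tfrac{1}{2}\beta^2 (1 + \log T)}{\frac{2\beta}{K} (\sqrt{T} - 1)}$$

where we used the fact that $\sum_{t=1}^{T} 1/\sqrt{t} \ge \int_{1}^{T+1} t^{-1/2} \, dt = 2(\sqrt{T+1} - 1) \ge 2(\sqrt{T} - 1)$. This proves our claim.

2 A more general setting?

Now, consider the following optimization problem where $f, r : \mathbb{R}^p \to \mathbb{R}$ are convex and K-Lipschitz with respect to the $\ell_2$-norm:

$$\hat{w} = \arg\min_{w \in \mathbb{R}^p} f(w) + r(w) \qquad (2)$$

Consider the next subgradient descent algorithm for solving the above problem: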

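Algorithm 4.2 Subgradient descent algorithm
  Input: Number of iterations $T$, factor $\beta > 0$, initial point $w^{(1)} \in \mathbb{R}^p$
         (The setting of $w^{(1)}$ can be uninformed, e.g., $w^{(1)} = 0$)
  for $t = 1 \ldots T$ do
    $w^{(t+1/2)} \leftarrow w^{(t)} - \eta_t g^{(t)}$ where $g^{(t)} \in \partial f(w^{(t)})$
    $w^{(t+1)} \leftarrow \arg\min_{w \in \mathbb{R}^p} \left( \frac{1}{2}\|w - w^{(t+1/2)}\|_2^2 + \eta_{t+1} r(w) \right)$
  end for
  Output: $\bar{w}^{(T)} \leftarrow \sum_{t=1}^{T} \eta_t w^{(t)} \big/ \sum_{t=1}^{T} \eta_t$

Before going into more general observations, let's first consider an example in order to show the usefulness of the above.

Sparse optimization example. Let $r(w) = \lambda \|w\|_1$. In this case, the second assignment reduces to $p$ independent assignments of the form:

$$(\forall j = 1, \ldots, p) \quad w_j^{(t+1)} \leftarrow \arg\min_{w_j \in \mathbb{R}} \left( \tfrac{1}{2}\left( w_j - w_j^{(t+1/2)} \right)^2 + \eta_{t+1} \lambda |w_j| \right) = \mathrm{sgn}\!\left( w_j^{(t+1/2)} \right) \max\!\left( 0, \left| w_j^{(t+1/2)} \right| - \eta_{t+1} \lambda \right)$$

While Algorithm 4.2 might look like a drastic generalization of Algorithm 4.1, in fact it is not. Note that:

$$w^{(t+1)} = \arg\min_{w \in \mathbb{R}^p} \left( \tfrac{1}{2}\|w - w^{(t+1/2)}\|_2^2 + \eta_{t+1} r(w) \right) \qquad (3)$$

Recall that $g^{(t)} \in \partial f(w^{(t)})$ and let $s^{(t)} \in \partial r(w^{(t)})$. Note that $w^{(t+1)}$ is optimal if and only if 0 belongs to the subdifferential set of the objective of eq.(3) evaluated at $w^{(t+1)}$:

$$0 \in \partial \left( \tfrac{1}{2}\|w - w^{(t+1/2)}\|_2^2 + \eta_{t+1} r(w) \right) \Big|_{w = w^{(t+1)}}$$
$$0 \in \left( w - w^{(t+1/2)} + \eta_{t+1} \partial r(w) \right) \Big|_{w = w^{(t+1)}}$$

That is, for some $s^{(t+1)} \in \partial r(w^{(t+1)})$:

$$0 = w^{(t+1)} - w^{(t+1/2)} + \eta_{t+1} s^{(t+1)} = w^{(t+1)} - w^{(t)} + \eta_t g^{(t)} + \eta_{t+1} s^{(t+1)}$$

Reorganizing the above, we obtain:

$$w^{(t+1)} = w^{(t)} - \eta_t g^{(t)} - \eta_{t+1} s^{(t+1)}$$

which looks a lot like a regular subgradient descent step.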
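The soft-thresholding formula from the sparse optimization example can be sketched in a few lines. The following is a minimal illustration, not the lecture's implementation: the function names (`soft_threshold`, `prox_objective`, `forward_backward_step`), the brute-force grid, and all numeric values are assumptions chosen for the example. It checks that no grid point achieves a lower value of the one-dimensional objective than the closed form, then performs one two-step update (subgradient step followed by the minimization step).

```python
def soft_threshold(v, tau):
    """Closed-form minimizer of 0.5 * (x - v)**2 + tau * abs(x) over real x."""
    sign = (v > 0) - (v < 0)
    return sign * max(0.0, abs(v) - tau)


def prox_objective(x, v, tau):
    """The one-dimensional objective that soft_threshold is meant to minimize."""
    return 0.5 * (x - v) ** 2 + tau * abs(x)


def forward_backward_step(w, g, eta_t, eta_next, lam):
    """One iteration of the two-step update with r(w) = lam * ||w||_1."""
    half = [wj - eta_t * gj for wj, gj in zip(w, g)]            # w^(t+1/2)
    return [soft_threshold(hj, eta_next * lam) for hj in half]  # w^(t+1)


# Brute-force check on a hypothetical grid: no grid point beats the closed form.
grid = [k / 1000.0 for k in range(-3000, 3001)]
for v, tau in [(1.5, 0.4), (-0.9, 0.25), (0.1, 0.3)]:
    best_on_grid = min(prox_objective(x, v, tau) for x in grid)
    assert prox_objective(soft_threshold(v, tau), v, tau) <= best_on_grid + 1e-12

# Illustrative values only (not from the lecture).
w_next = forward_backward_step([1.0, -0.2, 0.05], [0.5, -0.5, 0.0],
                               eta_t=0.2, eta_next=0.1, lam=2.0)
print(w_next)
```

Coordinates whose magnitude after the subgradient step falls below the threshold are set exactly to zero, which is why this choice of r produces sparse iterates.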

3 Stochastic Optimization

In several optimization problems, it is possible to compute a stochastic version of the subgradient a lot faster than the true subgradient.

Application 1: Stochastic Sample. For instance, in the problem of eq.(1), consider functions that depend on $n$ data samples $z_1 \ldots z_n \in \mathcal{Z}$ and a collection of functions $(\forall i) \ f_i : \mathbb{R}^p \times \mathcal{Z} \to \mathbb{R}$. Let:

$$f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w, z_i)$$

Given two sets $A$, $B$ and a scalar $c$, define $A + B = \{ a + b \mid a \in A \text{ and } b \in B \}$ and $cA = \{ ca \mid a \in A \}$. Writing $\partial_w f_i(w, z_i)$ for the subdifferential set of $f_i$ with respect to $w$, clearly:

$$\partial f(w) = \frac{1}{n} \sum_{i=1}^{n} \partial_w f_i(w, z_i)$$

Assume that we pick a data sample $j$ uniformly at random from $\{1 \ldots n\}$, and use the following stochastic subgradient:

$$g_j(w) \in \partial_w f_j(w, z_j) \qquad (4)$$

Let $j$ be a uniformly distributed random variable with support on $\{1 \ldots n\}$. We can see that:

$$E_j[g_j(w)] = \sum_{i=1}^{n} P_j[j = i] \, g_i(w) = \frac{1}{n} \sum_{i=1}^{n} g_i(w) \in \frac{1}{n} \sum_{i=1}^{n} \partial_w f_i(w, z_i) = \partial f(w)$$

Application 2: Stochastic Minibatch of Samples. Similarly, we can consider a minibatch instead of a single sample. Assume that we pick $B$ data samples $j_1 \ldots j_B$ independently with replacement, uniformly at random from $\{1 \ldots n\}$, and use the following stochastic subgradient:

$$g_{j_1 \ldots j_B}(w) \equiv \frac{1}{B} \sum_{b=1}^{B} g_{j_b}(w)$$

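where $g_{j_b}(w)$ was defined in eq.(4). Let $j_1 \ldots j_B$ be i.i.d. uniformly distributed random variables with support on $\{1 \ldots n\}$. We can see that:

$$E_{j_1 \ldots j_B}[g_{j_1 \ldots j_B}(w)] = \frac{1}{B} \sum_{b=1}^{B} E_{j_b}[g_{j_b}(w)] = \frac{1}{B} \sum_{b=1}^{B} \sum_{i=1}^{n} P_{j_b}[j_b = i] \, g_i(w)$$
$$\in \frac{1}{Bn} \sum_{b=1}^{B} \sum_{i=1}^{n} \partial_w f_i(w, z_i) = \frac{1}{n} \sum_{i=1}^{n} \partial_w f_i(w, z_i) = \partial f(w)$$

Application 3: Stochastic Coordinate. Consider a general function $f : \mathbb{R}^p \to \mathbb{R}$. Let $e_k$ be the vector defined by $(e_k)_i = 1[i = k]$. The gradient of $f$ is:

$$\nabla f(w) = \left( \frac{\partial f}{\partial w_1}(w), \ldots, \frac{\partial f}{\partial w_p}(w) \right)^{\top} = \sum_{k=1}^{p} \frac{\partial f}{\partial w_k}(w) \, e_k$$

Assume that we pick a coordinate $j$ uniformly at random from $\{1 \ldots p\}$, and use the following stochastic subgradient:

$$g_j(w) \equiv p \, \frac{\partial f}{\partial w_j}(w) \, e_j$$

Let $j$ be a uniformly distributed random variable with support on $\{1 \ldots p\}$. We can see that:

$$E_j[g_j(w)] = \sum_{k=1}^{p} P_j[j = k] \, g_k(w) = \frac{1}{p} \sum_{k=1}^{p} p \, \frac{\partial f}{\partial w_k}(w) \, e_k = \nabla f(w)$$

Clearly, we can also use the minibatch trick.

Back to the General Problem. Note that above, randomness does not come from uncertainty in the data but from generating a fast approximate version of the gradient. We will in general only assume that at every iteration $t$:

$$E[g^{(t)}] \in \partial f(w^{(t)})$$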
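The unbiasedness of the three constructions above (single sample, minibatch, coordinate) can be checked exactly on a tiny problem by enumerating every outcome of the random index instead of sampling. This is only a sketch under stated assumptions: the toy data `X`, `Y`, the squared-loss choice for each f_i, the point `w`, and all helper names are hypothetical and not taken from the lecture.

```python
from itertools import product

# Hypothetical toy data: n = 3 samples z_i = (x_i, y_i) with p = 2 features.
X = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
Y = [1.0, -2.0, 0.5]
n, p = len(X), len(X[0])


def grad_i(w, i):
    """Gradient of f_i(w, z_i) = 0.5 * (<x_i, w> - y_i)**2 with respect to w."""
    residual = sum(xk * wk for xk, wk in zip(X[i], w)) - Y[i]
    return [residual * xk for xk in X[i]]


def mean(vectors):
    """Coordinate-wise average; with equally likely outcomes this is the expectation."""
    vectors = list(vectors)
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(p)]


def full_grad(w):
    """Gradient of f(w) = (1/n) * sum_i f_i(w, z_i)."""
    return mean(grad_i(w, i) for i in range(n))


def minibatch_grad(w, batch):
    """Average of the single-sample gradients over the indices in the batch."""
    return mean(grad_i(w, j) for j in batch)


def coordinate_grad(w, j):
    """p * (df/dw_j)(w) * e_j. The full gradient is used here only for checking;
    in practice one would compute the single partial derivative directly."""
    g = full_grad(w)
    return [p * g[k] if k == j else 0.0 for k in range(p)]


w = [0.3, -0.7]
expect_sample = mean(grad_i(w, j) for j in range(n))
# All n**2 ordered pairs: minibatches of size B = 2 drawn with replacement.
expect_minibatch = mean(minibatch_grad(w, b) for b in product(range(n), repeat=2))
expect_coordinate = mean(coordinate_grad(w, j) for j in range(p))

print(full_grad(w), expect_sample, expect_minibatch, expect_coordinate)
```

All four printed vectors should agree up to floating-point rounding, since each expectation reduces to the same average of per-sample gradients.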

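Consider the next subgradient descent algorithm for solving eq.(1):

Algorithm 4.3 Stochastic subgradient descent algorithm
  Input: Number of iterations $T$, factor $\beta > 0$, initial point $w^{(1)} \in \mathbb{R}^p$
         (The setting of $w^{(1)}$ can be uninformed, e.g., $w^{(1)} = 0$. The setting of $w^{(1)}$ should be deterministic, i.e., non-stochastic.)
  for $t = 1 \ldots T$ do
    $w^{(t+1)} \leftarrow w^{(t)} - \eta_t g^{(t)}$ where $E[g^{(t)}] \in \partial f(w^{(t)})$
  end for
  Output: $\bar{w}^{(T)} \leftarrow \sum_{t=1}^{T} \eta_t w^{(t)} \big/ \sum_{t=1}^{T} \eta_t$

In what follows, we state our main result regarding convergence rates for Algorithm 4.3.

Theorem 4.2. Assume that $f : \mathbb{R}^p \to \mathbb{R}$ is convex and K-Lipschitz with respect to the $\ell_2$-norm in the problem of eq.(1). Recall that $\hat{w}$ is the optimal solution of the problem of eq.(1). Assume that Algorithm 4.3 runs for a number of iterations $T \ge 2$, factor $\beta$ and initial point $w^{(1)}$, and that the algorithm outputs $\bar{w}^{(T)}$. Assume that the stochastic subgradients $g^{(1)} \ldots g^{(T)}$ are independent and that they fulfill the condition $(\forall t) \ E\left[ \|g^{(t)}\|_2^2 \right] \le K^2$. We have:

$$E_{g^{(1)} \ldots g^{(T)}}\left[ f(\bar{w}^{(T)}) \right] - f(\hat{w}) \le K \left( \frac{\|w^{(1)} - \hat{w}\|_2^2}{4\beta(\sqrt{T} - 1)} + \frac{\beta (1 + \log T)}{4(\sqrt{T} - 1)} \right)$$

Fix $\delta \in (0, 1)$. With probability at least $1 - \delta$ with respect to the choice of the stochastic subgradients $g^{(1)} \ldots g^{(T)}$, we have:

$$f(\bar{w}^{(T)}) - f(\hat{w}) \le \frac{K}{\delta} \left( \frac{\|w^{(1)} - \hat{w}\|_2^2}{4\beta(\sqrt{T} - 1)} + \frac{\beta (1 + \log T)}{4(\sqrt{T} - 1)} \right)$$

Proof. For proving the upper bound $E_{g^{(1)} \ldots g^{(T)}}\left[ f(\bar{w}^{(T)}) \right] - f(\hat{w}) \le Z$ we follow a similar approach as in Theorem 4.1.

Define the random variable $z = f(\bar{w}^{(T)}) - f(\hat{w})$. Note that $z \ge 0$ and that $E[z] = E[f(\bar{w}^{(T)})] - f(\hat{w})$. Recall that in the above we provided an upper bound $Z$ for $E[z]$. That is, we showed that $E[z] \le Z$. By Markov's inequality (stated in an earlier lecture), we have:

$$P[z < Z/\delta] \ge P[z < E[z]/\delta] = 1 - P[z \ge E[z]/\delta] \ge 1 - \delta$$

which proves our second claim.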
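A minimal sketch of Algorithm 4.3 follows, using the step size β/(K√t) and the step-size-weighted average as output. Everything specific here is an assumption made for illustration: the one-dimensional objective f(w) = (1/n) Σ |w − z_i| (so p = 1 and K = 1, with any median of the z_i as a minimizer), the data `Z`, the seed, and the names `stochastic_subgradient_descent` and `sample_oracle`.

```python
import math
import random


def stochastic_subgradient_descent(oracle, w1, T, beta, K, rng):
    """Returns the eta-weighted average of the iterates w^(1), ..., w^(T)."""
    w = list(w1)
    weighted_sum = [0.0] * len(w)
    eta_total = 0.0
    for t in range(1, T + 1):
        eta = beta / (K * math.sqrt(t))
        # Accumulate eta_t * w^(t) before moving to w^(t+1).
        weighted_sum = [s + eta * wk for s, wk in zip(weighted_sum, w)]
        eta_total += eta
        g = oracle(w, rng)  # stochastic subgradient at w^(t)
        w = [wk - eta * gk for wk, gk in zip(w, g)]
    return [s / eta_total for s in weighted_sum]


# Hypothetical toy data; the median of Z is 1.0.
Z = [-2.0, 0.0, 1.0, 4.0, 10.0]


def f(w):
    """f(w) = (1/n) * sum_i |w - z_i| for a one-dimensional w."""
    return sum(abs(w[0] - z) for z in Z) / len(Z)


def sample_oracle(w, rng):
    """A subgradient of |w - z_j| for an index j drawn uniformly at random."""
    z = rng.choice(Z)
    return [float((w[0] > z) - (w[0] < z))]


rng = random.Random(0)  # fixed seed so the run is reproducible
w_bar = stochastic_subgradient_descent(sample_oracle, [0.0], T=20000,
                                       beta=1.0, K=1.0, rng=rng)
gap = f(w_bar) - f([1.0])
print(w_bar, gap)
```

The initial point is deterministic, as the algorithm requires, and each oracle call touches a single data sample. The printed gap is the suboptimality of the averaged output for this particular seed; the theorem above bounds its expectation and gives a high-probability bound, not the value of any single run.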

References

[1] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec), 2009.

[2] J. Honorio. Convergence rates of biased stochastic optimization for learning sparse Ising models. International Conference on Machine Learning, 2012.
