
A Course in Machine Learning
Hal Daumé III

12 EFFICIENT LEARNING

So far, our focus has been on models of learning and basic algorithms for those models. We have not placed much emphasis on how to learn quickly. The basic techniques you have learned about so far are enough to get learning algorithms running on tens or hundreds of thousands of examples. But if you want to build an algorithm for web page ranking, you will need to deal with millions or billions of examples, in hundreds of thousands of dimensions. The basic approaches you have seen so far are insufficient to achieve such a massive scale. In this chapter, you will learn some techniques for scaling learning algorithms. These are useful even when you do not have billions of training examples, because it's always nice to have a program that runs quickly. You will see techniques for speeding up both model training and model prediction. The focus in this chapter is on linear models (for simplicity), but most of what you will learn applies more generally.

Learning Objectives:
- Understand and be able to implement stochastic gradient descent algorithms.
- Compare and contrast small versus large batch sizes in stochastic optimization.
- Derive subgradients for sparse regularizers.
- Implement feature hashing.

Dependencies:

12.1 What Does it Mean to be Fast?

Everyone always wants fast algorithms. In the context of machine learning, this can mean many things. You might want fast training algorithms, or perhaps training algorithms that scale to very large data sets (for instance, ones that will not fit in main memory). You might want training algorithms that can be easily parallelized. Or, you might not care about training efficiency, since it is an offline process, and only care about how quickly your learned functions can make classification decisions. It is important to separate out these desires. If you care about efficiency at training time, then what you are really asking for are more efficient learning algorithms. On the other hand, if you care about efficiency at test time, then you are asking for models that can be quickly evaluated. One issue that is not covered in this chapter is parallel learning.

This is largely because parallel learning is currently not a well-understood area in machine learning. There are many aspects of parallelism that come into play, such as the speed of communication across the network, whether you have shared memory, and so on. Right now, the general, poor-man's approach to parallelization is to employ ensembles.

12.2 Stochastic Optimization

During training of most learning algorithms, you consider the entire data set simultaneously. This is certainly true of gradient descent algorithms for regularized linear classifiers (recall Algorithm 6.4), in which you first compute a gradient over the entire training data (for simplicity, consider the unbiased case):

$$g = \nabla_w \sum_n \ell(y_n, w \cdot x_n) + \lambda w \qquad (12.1)$$

where $\ell(y, \hat{y})$ is some loss function. Then you update the weights by $w \leftarrow w - \eta g$. In this algorithm, in order to make a single update, you have to look at every training example. When there are billions of training examples, it is a bit silly to look at every one before doing anything. Perhaps just on the basis of the first few examples, you can already start learning something!

Stochastic optimization involves thinking of your training data as a big distribution over examples. A draw from this distribution corresponds to picking some example (uniformly at random) from your data set. Viewed this way, the optimization problem becomes a stochastic optimization problem, because you are trying to optimize some function (say, a regularized linear classifier) over a probability distribution. You can derive this interpretation directly as follows:

$$
\begin{aligned}
w &= \arg\min_w \sum_n \ell(y_n, w \cdot x_n) + R(w) && \text{definition} & (12.2) \\
  &= \arg\min_w \sum_n \left[ \ell(y_n, w \cdot x_n) + \frac{1}{N} R(w) \right] && \text{move } R \text{ inside sum} & (12.3) \\
  &= \arg\min_w \sum_n \left[ \frac{1}{N} \ell(y_n, w \cdot x_n) + \frac{1}{N^2} R(w) \right] && \text{divide through by } N & (12.4) \\
  &= \arg\min_w \mathbb{E}_{(y,x) \sim D} \left[ \ell(y, w \cdot x) + \frac{1}{N} R(w) \right] && \text{write as expectation} & (12.5)
\end{aligned}
$$

where D is the training data distribution (12.6).
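This reformulation is easy to sanity-check numerically. The sketch below is not from the text: it uses made-up data and a squared loss standing in for $\ell$, both chosen purely for illustration, and verifies that N times the expected single-example gradient equals the full-batch gradient of Eq (12.1):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 100, 5, 0.1
X = rng.normal(size=(N, D))            # hypothetical training inputs
y = rng.choice([-1.0, 1.0], size=N)    # hypothetical labels
w = rng.normal(size=D)

def grad_full(w):
    # Gradient of sum_n l(y_n, w.x_n) + (lam/2)||w||^2, with squared loss
    # l(y, yhat) = 0.5 * (yhat - y)^2, so the per-example term is (w.x - y) x.
    return X.T @ (X @ w - y) + lam * w

def grad_example(w, n):
    # Per-example gradient: grad of l(y_n, w.x_n) plus 1/N of the regularizer.
    return (X[n] @ w - y[n]) * X[n] + (lam / N) * w

# A uniform draw of n gives an unbiased estimate of (1/N) times the full gradient.
expected = np.mean([grad_example(w, n) for n in range(N)], axis=0)
assert np.allclose(N * expected, grad_full(w))
```

This is exactly why the stochastic view is safe: the per-example gradient is an unbiased (if noisy) estimate of the full one.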

Given this framework, you have the following general form of an optimization problem:

$$\min_z \; \mathbb{E}_{\zeta} \left[ F(z, \zeta) \right] \qquad (12.7)$$

In the example, $\zeta$ denotes the random choice of examples over the data set, $z$ denotes the weight vector and $F(w, \zeta)$ denotes the loss on that example plus a fraction of the regularizer. Stochastic optimization problems are formally harder than regular (deterministic) optimization problems because you do not even get access to exact function values and gradients. The only access you have to the function F that you wish to optimize are noisy measurements, governed by the distribution over $\zeta$. Despite this lack of information, you can still run a gradient-based algorithm, where you simply compute local gradients on a current sample of data.

More precisely, you can draw a data point at random from your data set. This is analogous to drawing a single value $\zeta$ from its distribution. You can compute the gradient of F just at that point. In the case of a 2-norm regularized linear model, this is simply $g = \nabla_w \ell(y, w \cdot x) + \frac{\lambda}{N} w$, where (y, x) is the random point you selected. Given this estimate of the gradient (it's an estimate because it's based on a single random draw), you can take a small gradient step $w \leftarrow w - \eta g$. This is the stochastic gradient descent algorithm (SGD).

In practice, taking gradients with respect to a single data point might be too myopic. In such cases, it is useful to use a small batch of data. Here, you can draw 10 random examples from the training data and compute a small gradient (estimate) based on those examples: $g = \sum_{m=1}^{10} \nabla_w \ell(y_m, w \cdot x_m) + \frac{10 \lambda}{N} w$, where you need to include 10 counts of the regularizer. Popular batch sizes are 1 (single points) and 10. The generic SGD algorithm is depicted in Algorithm 33, which takes K-many steps over batches of S-many examples.

Algorithm 33 StochasticGradientDescent(F, D, S, K, η^(1), ...)
1: z^(0) ← ⟨0, 0, ..., 0⟩                        // initialize variable we are optimizing
2: for k = 1 ... K do
3:   D^(k) ← S-many random data points from D
4:   g^(k) ← ∇_z F(D^(k)), evaluated at z^(k−1)   // compute gradient on sample
5:   z^(k) ← z^(k−1) − η^(k) g^(k)                // take a step down the gradient
6: end for
7: return z^(K)

In stochastic gradient descent, it is imperative to choose good step sizes. It is also very important that the steps get smaller over time at a reasonably slow rate. In particular, convergence can be guaranteed for learning rates of the form $\eta^{(k)} = \frac{\eta_0}{\sqrt{k}}$, where $\eta_0$ is a fixed, initial step size, typically 0.01, 0.1 or 1 depending on how quickly you expect the algorithm to converge.
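As a concrete illustration, here is a minimal Python sketch of Algorithm 33 for a 2-norm regularized linear classifier with hinge loss; the loss, default parameters, and function name are illustrative choices, not part of the original text:

```python
import numpy as np

def sgd(X, y, S=10, K=1000, eta0=0.1, lam=0.01, seed=0):
    """Algorithm 33 on hinge loss + (lam/2)||w||^2, with batches of size S."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)                              # z^(0) <- <0, 0, ..., 0>
    for k in range(1, K + 1):
        idx = rng.integers(0, N, size=S)         # S-many random data points
        Xb, yb = X[idx], y[idx]
        viol = yb * (Xb @ w) < 1                 # batch examples inside the margin
        # Hinge-loss subgradient on the batch, plus the batch's S/N share
        # of the regularizer's gradient lam * w.
        g = -(yb[viol][:, None] * Xb[viol]).sum(axis=0) + (S / N) * lam * w
        w -= (eta0 / np.sqrt(k)) * g             # eta^(k) = eta0 / sqrt(k)
    return w
```

Setting S=1 recovers single-point SGD; making S larger reduces gradient noise at the cost of more work per step.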

Unfortunately, in comparison to gradient descent, stochastic gradient descent is quite sensitive to the selection of a good learning rate.

There is one more practical issue related to the use of SGD as a learning algorithm: do you really select a random point (or subset of random points) at each step, or do you stream through the data in order? The answer is akin to the answer to the same question for the perceptron algorithm (Chapter 3). If you do not permute your data at all, very bad things can happen. If you do permute your data once and then do multiple passes over that same permutation, it will converge, but more slowly. In theory, you really should permute every iteration. If your data is small enough to fit in memory, this is not a big deal: you will only pay for cache misses. However, if your data is too large for memory and resides on a magnetic disk that has a slow seek time, randomly seeking to new data points for each example is prohibitively slow, and you will likely need to forgo permuting the data. The hit in convergence speed will almost certainly be recovered by the speed gain from not having to seek on disk routinely. (Note that the story is very different for solid state disks, on which random accesses really are quite efficient.)

12.3 Sparse Regularization

For many learning algorithms, the test-time efficiency is governed by how many features are used for prediction. This is one reason decision trees tend to be among the fastest predictors: they only use a small number of features. Especially in cases where the actual computation of these features is expensive, cutting down on the number that are used at test time can yield huge gains in efficiency. Moreover, the amount of memory used to make predictions is also typically governed by the number of features. (Note: this is not true of kernel methods like support vector machines, in which the dominant cost is the number of support vectors.) Furthermore, you may simply believe that your learning problem can be solved with a very small number of features: this is a very reasonable form of inductive bias. This is the idea behind sparse models, and in particular, sparse regularizers.

One of the disadvantages of a 2-norm regularizer for linear models is that it tends to never produce weights that are exactly zero. They get close to zero, but never hit it. To understand why, note that as a weight w approaches zero, its gradient also approaches zero (the sketch below makes this concrete). Thus, even if the weight should be zero, it will essentially never get there, because of the constantly shrinking gradient. This suggests that an alternative regularizer is required to yield a sparse inductive bias.
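The shrinking-gradient effect is easy to see numerically: gradient descent on the 2-norm penalty alone multiplies the weight by a constant factor (1 − ηλ) at every step, so it decays geometrically toward zero but never reaches it. A tiny sketch, with arbitrary illustrative values of η and λ:

```python
w, eta, lam = 1.0, 0.1, 1.0
for _ in range(50):
    w -= eta * lam * w      # gradient of (lam/2) * w**2 is lam * w
print(w)                    # ~0.005 after 50 steps: tiny, but never exactly zero
```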

An ideal case would be the zero-norm regularizer, which simply counts the number of non-zero values in a vector:

$$\|w\|_0 = \sum_d \mathbb{1}[w_d \neq 0]$$

If you could minimize this regularizer, you would be explicitly minimizing the number of non-zero features. Unfortunately, not only is the zero-norm non-convex, it's also discrete. Optimizing it is NP-hard. A reasonable middle-ground is the one-norm: $\|w\|_1 = \sum_d |w_d|$. It is indeed convex: in fact, it is the tightest $\ell_p$ norm that is convex. Moreover, its gradients do not go to zero as in the two-norm. Just as hinge loss is the tightest convex upper bound on zero-one error, the one-norm is the tightest convex upper bound on the zero-norm.

At this point, you should be content. You can take your subgradient optimizer for arbitrary functions and plug in the one-norm as a regularizer. The one-norm is surely non-differentiable at $w_d = 0$, but you can simply choose any value in the range [−1, +1] as a subgradient at that point. (You should choose zero.)

Unfortunately, this does not quite work the way you might expect. The issue is that the gradient might overstep zero and you will never end up with a solution that is particularly sparse. For example, at the end of one gradient step, you might have $w_3 = 0.6$. Your gradient might have $g_3 = 0.8$ and your gradient step (assuming η = 1) will update so that the new $w_3 = -0.2$. In the subsequent iteration, you might have $g_3 = -0.3$ and step to $w_3 = 0.1$.

This observation leads to the idea of truncated gradients. The idea is simple: if you have a gradient that would step you over $w_d = 0$, then just set $w_d = 0$. In the easy case when the learning rate is 1, this means that if the sign of $w_d - g_d$ differs from the sign of $w_d$, then you truncate the gradient step and simply set $w_d = 0$. In other words, $g_d$ should never be larger than $w_d$. Once you incorporate learning rates, you can express this as:

$$w_d \leftarrow \begin{cases} w_d - \eta^{(k)} g_d & \text{if } w_d > 0 \text{ and } g_d \le \frac{1}{\eta^{(k)}} w_d \\ w_d - \eta^{(k)} g_d & \text{if } w_d < 0 \text{ and } g_d \ge \frac{1}{\eta^{(k)}} w_d \\ 0 & \text{otherwise} \end{cases} \qquad (12.8)$$

This works quite well in the case of subgradient descent. It works somewhat less well in the case of stochastic subgradient descent. The problem that arises in the stochastic case is that wherever you choose to stop optimizing, you will have just touched a single example (or a small batch of examples), which will have increased the weights for a lot of features, before the regularizer has had time to shrink them back down to zero. You will still end up with somewhat sparse solutions, but not as sparse as they could be. There are algorithms for dealing with this situation, but they all have a heuristic flavor to them and are beyond the scope of this book.
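A minimal numpy sketch of the update in Eq (12.8), written as a single truncated step (the function name is an illustrative choice); the running $w_3$ example from the text is replayed at the bottom:

```python
import numpy as np

def truncated_step(w, g, eta):
    """One truncated gradient step (Eq 12.8): move each coordinate by
    -eta * g_d, but clip to exactly zero any coordinate that crosses zero."""
    stepped = w - eta * g
    crossed = np.sign(stepped) != np.sign(w)   # this coordinate overshot zero
    return np.where(crossed, 0.0, stepped)

# The example from the text: w_3 = 0.6 with g_3 = 0.8 and eta = 1 would step
# to -0.2; truncation lands it at exactly 0 instead.
print(truncated_step(np.array([0.6]), np.array([0.8]), 1.0))   # [0.]
```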

12.4 Feature Hashing

As much as speed is a bottleneck in prediction, so often is memory usage. If you have a very large number of features, the amount of memory it takes to store weights for all of them can become prohibitive, especially if you wish to run your algorithm on small devices. Feature hashing is an incredibly simple technique for reducing the memory footprint of linear models, with very small sacrifices in accuracy. The basic idea is to replace all of your features with hashed versions of those features, thus reducing your space from D-many feature weights to P-many feature weights, where P is the range of the hash function. You can actually think of hashing as a (randomized) feature mapping $\phi : \mathbb{R}^D \to \mathbb{R}^P$, for some $P \ll D$.

The idea is as follows. First, you choose a hash function h whose domain is [D] = {1, 2, ..., D} and whose range is [P]. Then, when you receive a feature vector $x \in \mathbb{R}^D$, you map it to a shorter feature vector $\hat{x} \in \mathbb{R}^P$. Algorithmically, you can think of this mapping as follows:

1. Initialize $\hat{x} = \langle 0, 0, \dots, 0 \rangle$
2. For each d = 1 ... D:
   (a) Hash d to position p = h(d)
   (b) Update the pth position by adding $x_d$: $\hat{x}_p \leftarrow \hat{x}_p + x_d$
3. Return $\hat{x}$

Mathematically, the mapping looks like:

$$\phi(x)_p = \sum_d \mathbb{1}[h(d) = p] \, x_d = \sum_{d \in h^{-1}(p)} x_d \qquad (12.9)$$

where $h^{-1}(p) = \{ d : h(d) = p \}$.

In the (unrealistic) case where P = D and h simply encodes a permutation, this mapping does not change the learning problem at all. All it does is rename all of the features. In practice, $P \ll D$ and there will be collisions. In this context, a collision means that two features, which are really different, end up looking the same to the learning algorithm. For instance, "is it sunny today?" and "did my favorite sports team win last night?" might get mapped to the same location after hashing. The hope is that the learning algorithm is sufficiently robust to noise that it can handle this case well.
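A minimal sketch of the hashing mapping above. The particular hash function here (a Knuth-style multiplicative hash) is an arbitrary illustrative choice; production systems typically use a fast, fixed hash such as MurmurHash:

```python
import numpy as np

def hash_features(x, P):
    """Map x in R^D down to xhat in R^P by summing colliding coordinates (Eq 12.9)."""
    xhat = np.zeros(P)
    for d, x_d in enumerate(x):
        p = (d * 2654435761) % P   # h : [D] -> [P]; illustrative multiplicative hash
        xhat[p] += x_d             # xhat_p <- xhat_p + x_d
    return xhat

# phi(x).phi(z) equals the true inner product plus collision cross-terms:
rng = np.random.default_rng(0)
x, z = rng.normal(size=1000), rng.normal(size=1000)
print(x @ z)                                          # exact inner product
print(hash_features(x, 128) @ hash_features(z, 128))  # hashed approximation
```

Note that hashing is applied identically at training and test time, so the learned weight vector only ever lives in the P-dimensional space.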

Consider the kernel defined by this hash mapping. Namely:

$$
\begin{aligned}
K^{\text{(hash)}}(x, z) &= \phi(x) \cdot \phi(z) & (12.10) \\
&= \sum_p \left( \sum_d \mathbb{1}[h(d) = p] \, x_d \right) \left( \sum_e \mathbb{1}[h(e) = p] \, z_e \right) & (12.11) \\
&= \sum_p \sum_{d,e} \mathbb{1}[h(d) = p] \, \mathbb{1}[h(e) = p] \, x_d z_e & (12.12) \\
&= \sum_d \sum_{e \in h^{-1}(h(d))} x_d z_e & (12.13) \\
&= x \cdot z + \sum_d \sum_{e \neq d, \; e \in h^{-1}(h(d))} x_d z_e & (12.14)
\end{aligned}
$$

This hash kernel has the form of a linear kernel plus a small number of quadratic terms. The particular quadratic terms are exactly those given by collisions of the hash function. There are two things to notice about this. The first is that collisions might not actually be bad things! In a sense, they're giving you a little extra representational power. In particular, if the hash function happens to select out feature pairs that benefit from being paired, then you now have a better representation. The second is that even if this doesn't happen, the quadratic term in the kernel has only a small effect on the overall prediction. In particular, if you assume that your hash function is pairwise independent (a common assumption of hash functions), then the expected value of this quadratic term is zero, and its variance decreases at a rate of $O(P^{-2})$. In other words, if you choose P ≈ 100, then the variance is on the order of 0.0001.

12.5 Exercises

Exercise 12.1. TODO...