A Course in Machine Learning


Hal Daumé III

12 EFFICIENT LEARNING

Learning Objectives: understand and be able to implement stochastic gradient descent algorithms; compare and contrast small versus large batch sizes in stochastic optimization; derive subgradients for sparse regularizers; implement feature hashing.

So far, our focus has been on models of learning and basic algorithms for those models. We have not placed much emphasis on how to learn quickly. The basic techniques you learned about so far are enough to get learning algorithms running on tens or hundreds of thousands of examples. But if you want to build an algorithm for web page ranking, you will need to deal with millions or billions of examples, in hundreds of thousands of dimensions. The basic approaches you have seen so far are insufficient to achieve such a massive scale.

In this chapter, you will learn some techniques for scaling learning algorithms. These are useful even when you do not have billions of training examples, because it is always nice to have a program that runs quickly. You will see techniques for speeding up both model training and model prediction. The focus in this chapter is on linear models (for simplicity), but most of what you will learn applies more generally.

12.1 What Does it Mean to be Fast?

Everyone always wants fast algorithms. In the context of machine learning, this can mean many things. You might want fast training algorithms, or perhaps training algorithms that scale to very large data sets (for instance, ones that will not fit in main memory). You might want training algorithms that can be easily parallelized. Or, you might not care about training efficiency, since it is an offline process, and only care about how quickly your learned functions can make classification decisions.

It is important to separate out these desires. If you care about efficiency at training time, then what you are really asking for are more efficient learning algorithms. On the other hand, if you care about efficiency at test time, then you are asking for models that can be quickly evaluated. One issue that is not covered in this chapter is parallel learning.

This is largely because it is currently not a well-understood area in machine learning. There are many aspects of parallelism that come into play, such as the speed of communication across the network, whether you have shared memory, etc. Right now, the general, poor man's approach to parallelization is to employ ensembles.

12.2 Stochastic Optimization

During training of most learning algorithms, you consider the entire data set simultaneously. This is certainly true of gradient descent algorithms for regularized linear classifiers (recall Algorithm 6.4), in which you first compute a gradient over the entire training data (for simplicity, consider the unbiased case):

    g = ∇_w ∑_n ℓ(y_n, w · x_n) + λw    (12.1)

where ℓ(y, ŷ) is some loss function. Then you update the weights by w ← w − ηg. In this algorithm, in order to make a single update, you have to look at every training example. When there are billions of training examples, it is a bit silly to look at every one before doing anything. Perhaps just on the basis of the first few examples, you can already start learning something!

Stochastic optimization involves thinking of your training data as a big distribution over examples. A draw from this distribution corresponds to picking some example (uniformly at random) from your data set. Viewed this way, the optimization problem becomes a stochastic optimization problem, because you are trying to optimize some function (say, a regularized linear classifier) over a probability distribution. You can derive this interpretation directly as follows:

    w = arg min_w  ∑_n ℓ(y_n, w · x_n) + R(w)                       definition            (12.2)
      = arg min_w  ∑_n [ ℓ(y_n, w · x_n) + (1/N) R(w) ]             move R inside sum     (12.3)
      = arg min_w  ∑_n [ (1/N) ℓ(y_n, w · x_n) + (1/N²) R(w) ]      divide through by N   (12.4)
      = arg min_w  E_{(y,x)∼D} [ ℓ(y, w · x) + (1/N) R(w) ]         write as expectation  (12.5)

    where D is the training data distribution                                             (12.6)
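One immediate consequence of this rewriting is that the gradient of a single, uniformly chosen example's loss plus a 1/N share of the regularizer is an unbiased estimate of 1/N times the full-batch gradient in Equation (12.1). The short numpy sketch below checks this numerically for squared loss and a 2-norm regularizer; the data, the weight vector and the helper name are invented purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 5, 0.1
X = rng.normal(size=(N, D))              # toy inputs, purely illustrative
y = rng.normal(size=N)                   # toy targets
w = rng.normal(size=D)                   # some current weight vector

def example_grad(n):
    # gradient of squared loss on example n, plus a 1/N share of the regularizer gradient
    return (X[n] @ w - y[n]) * X[n] + (lam / N) * w

full_grad = X.T @ (X @ w - y) + lam * w                          # full-batch gradient as in (12.1)
avg_grad = np.mean([example_grad(n) for n in range(N)], axis=0)
print(np.allclose(avg_grad, full_grad / N))                      # True: the per-example gradient is unbiased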

Algorithm 33 StochasticGradientDescent(F, D, S, K, η^(1), ...)
1: z^(0) ← <0, 0, ..., 0>                          // initialize variable we are optimizing
2: for k = 1 ... K do
3:    D^(k) ← S-many random data points from D
4:    g^(k) ← ∇_z F(D^(k)), evaluated at z^(k-1)   // compute gradient on sample
5:    z^(k) ← z^(k-1) − η^(k) g^(k)                // take a step down the gradient
6: end for
7: return z^(K)

Given this framework, you have the following general form of an optimization problem:

    min_z E_ζ [ F(z, ζ) ]    (12.7)

In the example, ζ denotes the random choice of examples over the data set, z denotes the weight vector and F(w, ζ) denotes the loss on that example plus a fraction of the regularizer.

Stochastic optimization problems are formally harder than regular (deterministic) optimization problems because you do not even get access to exact function values and gradients. The only access you have to the function F that you wish to optimize are noisy measurements, governed by the distribution over ζ. Despite this lack of information, you can still run a gradient-based algorithm, where you simply compute local gradients on a current sample of data.

More precisely, you can draw a data point at random from your data set. This is analogous to drawing a single value ζ from its distribution. You can compute the gradient of F just at that point. In the case of a 2-norm regularized linear model, this is simply g = ∇_w ℓ(y, w · x) + (λ/N) w, where (y, x) is the random point you selected. Given this estimate of the gradient (it is an estimate because it is based on a single random draw), you can take a small gradient step w ← w − ηg.

This is the stochastic gradient descent algorithm (SGD). In practice, taking gradients with respect to a single data point might be too myopic. In such cases, it is useful to use a small batch of data. Here, you can draw 10 random examples from the training data and compute a small gradient (estimate) based on those examples: g = ∑_{m=1}^{10} ∇_w ℓ(y_m, w · x_m) + (10λ/N) w, where you need to include 10 counts of the regularizer. Popular batch sizes are 1 (single points) and 10. The generic SGD algorithm is depicted in Algorithm 33, which takes K-many steps over batches of S-many examples.

In stochastic gradient descent, it is imperative to choose good step sizes. It is also very important that the steps get smaller over time at a reasonably slow rate. In particular, convergence can be guaranteed for learning rates of the form η^(k) = η_0 / √k, where η_0 is a fixed, initial step size, typically 0.01, 0.1 or 1 depending on how quickly you expect the algorithm to converge.
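To make Algorithm 33 concrete, here is a minimal numpy sketch of minibatch SGD for a squared-loss, 2-norm-regularized linear model, using batches of S examples, S counts of the regularizer as described above, and the decaying step size η^(k) = η_0/√k. All names and parameter values are placeholders for illustration, not the book's reference implementation.

import numpy as np

def sgd(X, y, S=10, K=1000, eta0=0.01, lam=0.01, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)                                     # z^(0) <- <0, 0, ..., 0>
    for k in range(1, K + 1):
        idx = rng.integers(0, N, size=S)                # S-many random data points from D
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ w - yb) + (S * lam / N) * w    # batch gradient plus S counts of the regularizer
        w -= (eta0 / np.sqrt(k)) * g                    # step down the gradient with eta_0 / sqrt(k)
    return w

# usage on synthetic data: the recovered weights should land near the generating ones
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=5000)
print(np.linalg.norm(sgd(X, y) - w_true))               # should be small

Drawing indices afresh at every step matches the idealized random-draw view of the data; the alternative of streaming through a fixed permutation is taken up below.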

Unfortunately, in comparison to gradient descent, stochastic gradient is quite sensitive to the selection of a good learning rate.

There is one more practical issue related to the use of SGD as a learning algorithm: do you really select a random point (or subset of random points) at each step, or do you stream through the data in order? The answer is akin to the answer to the same question for the perceptron algorithm (Chapter 3). If you do not permute your data at all, very bad things can happen. If you do permute your data once and then do multiple passes over that same permutation, it will converge, but more slowly. In theory, you really should permute every iteration. If your data is small enough to fit in memory, this is not a big deal: you will only pay for cache misses. However, if your data is too large for memory and resides on a magnetic disk that has a slow seek time, randomly seeking to new data points for each example is prohibitively slow, and you will likely need to forgo permuting the data. The hit in convergence speed will almost certainly be recovered by the speed gained in not having to seek on disk routinely. (Note that the story is very different for solid state disks, on which random accesses really are quite efficient.)

12.3 Sparse Regularization

For many learning algorithms, the test-time efficiency is governed by how many features are used for prediction. This is one reason decision trees tend to be among the fastest predictors: they only use a small number of features. Especially in cases where the actual computation of these features is expensive, cutting down on the number that are used at test time can yield huge gains in efficiency. Moreover, the amount of memory used to make predictions is also typically governed by the number of features. (Note: this is not true of kernel methods like support vector machines, in which the dominant cost is the number of support vectors.) Furthermore, you may simply believe that your learning problem can be solved with a very small number of features: this is a very reasonable form of inductive bias. This is the idea behind sparse models, and in particular, sparse regularizers.

One of the disadvantages of a 2-norm regularizer for linear models is that it tends to never produce weights that are exactly zero. They get close to zero, but never hit it. To understand why, note that as a weight w approaches zero, its gradient also approaches zero. Thus, even if the weight should be zero, it will essentially never get there because of the constantly shrinking gradient.

This suggests that an alternative regularizer is required to yield a sparse inductive bias.
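To see the shrinking-gradient effect concretely, the toy loop below runs plain gradient descent on just the 2-norm penalty (λ/2)w² for a single weight: every step multiplies the weight by (1 − ηλ), so it decays geometrically toward zero but never actually reaches it. The constants are arbitrary and purely illustrative.

eta, lam = 0.1, 1.0
w = 1.0
for step in range(100):
    w -= eta * lam * w      # gradient of (lam/2) * w**2 is lam * w, which shrinks along with w
print(w)                    # about 2.7e-5: tiny, but never exactly zero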

An ideal case would be the zero-norm regularizer, which simply counts the number of non-zero values in a vector: ||w||_0 = ∑_d 1[w_d ≠ 0]. If you could minimize this regularizer, you would be explicitly minimizing the number of non-zero features. Unfortunately, not only is the zero-norm non-convex, it is also discrete. Optimizing it is NP-hard.

A reasonable middle ground is the one-norm: ||w||_1 = ∑_d |w_d|. It is indeed convex: in fact, it is the tightest ℓ_p norm that is convex. Moreover, its gradients do not go to zero as in the two-norm. Just as hinge loss is the tightest convex upper bound on zero/one error, the one-norm is the tightest convex upper bound on the zero-norm.

At this point, you should be content. You can take your subgradient optimizer for arbitrary functions and plug in the one-norm as a regularizer. The one-norm is surely non-differentiable at w_d = 0, but you can simply choose any value in the range [−1, +1] as a subgradient at that point. (You should choose zero.)

Unfortunately, this does not quite work the way you might expect. The issue is that the gradient might overstep zero and you will never end up with a solution that is particularly sparse. For example, at the end of one gradient step, you might have w_d = 0.6. Your gradient might have g_d = 0.8, and your gradient step (assuming η = 1) will update so that the new w_d = −0.2. In the subsequent iteration, you might have g_d = −0.3 and step to w_d = 0.1.

This observation leads to the idea of truncated gradients. The idea is simple: if you have a gradient that would step you over w_d = 0, then just set w_d = 0. In the easy case when the learning rate is 1, this means that if the sign of w_d − g_d is different from the sign of w_d, then you truncate the gradient step and simply set w_d = 0. In other words, g_d should never be larger than w_d. Once you incorporate learning rates, you can express this as:

    w_d ← w_d − η^(k) g_d    if w_d > 0 and g_d ≤ (1/η^(k)) w_d
    w_d ← w_d − η^(k) g_d    if w_d < 0 and g_d ≥ (1/η^(k)) w_d
    w_d ← 0                  otherwise                             (12.8)

This works quite well in the case of subgradient descent. It works somewhat less well in the case of stochastic subgradient descent. The problem that arises in the stochastic case is that wherever you choose to stop optimizing, you will have just touched a single example (or small batch of examples), which will increase the weights for a lot of features, before the regularizer has time to shrink them back down to zero. You will still end up with somewhat sparse solutions, but not as sparse as they could be. There are algorithms for dealing with this situation, but they all have a heuristic flavor to them and are beyond the scope of this book.
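A minimal numpy sketch of the truncated update in Equation (12.8), applied coordinate-wise: take the ordinary gradient step, then clip to exactly zero any coordinate whose step would have changed its sign (or moved it off zero). The function name and example values are mine, not the book's.

import numpy as np

def truncated_step(w, g, eta):
    stepped = w - eta * g                         # the ordinary gradient step
    crossed = np.sign(stepped) != np.sign(w)      # coordinates that would cross (or leave) zero...
    return np.where(crossed, 0.0, stepped)        # ...get truncated to exactly zero

w = np.array([0.6, -0.4, 0.0, 2.0])
g = np.array([0.8, -0.9, 0.1, 0.5])
print(truncated_step(w, g, eta=1.0))              # [0. 0. 0. 1.5]: the first three coordinates are clipped

In a full learner, this truncation would be applied after each (sub)gradient step on the one-norm-regularized objective.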

12.4 Feature Hashing

As much as speed is a bottleneck in prediction, so often is memory usage. If you have a very large number of features, the amount of memory that it takes to store weights for all of them can become prohibitive, especially if you wish to run your algorithm on small devices. Feature hashing is an incredibly simple technique for reducing the memory footprint of linear models, with very small sacrifices in accuracy. The basic idea is to replace all of your features with hashed versions of those features, thus reducing your space from D-many feature weights to P-many feature weights, where P is the range of the hash function. You can actually think of hashing as a (randomized) feature mapping φ : R^D → R^P, for some P ≪ D.

The idea is as follows. First, you choose a hash function h whose domain is [D] = {1, 2, ..., D} and whose range is [P]. Then, when you receive a feature vector x ∈ R^D, you map it to a shorter feature vector x̂ ∈ R^P. Algorithmically, you can think of this mapping as follows (see the code sketch below):

1. Initialize x̂ = <0, 0, ..., 0>.
2. For each d = 1 ... D:
   (a) Hash d to position p = h(d).
   (b) Update the pth position by adding x_d: x̂_p ← x̂_p + x_d.
3. Return x̂.

Mathematically, the mapping looks like:

    φ(x)_p = ∑_d 1[h(d) = p] x_d = ∑_{d ∈ h⁻¹(p)} x_d    (12.9)

where h⁻¹(p) = {d : h(d) = p}.

In the (unrealistic) case where P = D and h simply encodes a permutation, then this mapping does not change the learning problem at all. All it does is rename all of the features. In practice, P ≪ D and there will be collisions. In this context, a collision means that two features, which are really different, end up looking the same to the learning algorithm. For instance, "is it sunny today?" and "did my favorite sports team win last night?" might get mapped to the same location after hashing. The hope is that the learning algorithm is sufficiently robust to noise that it can handle this case well.
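Here is the promised code sketch of the hash mapping (not the book's reference code): it folds a dense vector x ∈ R^D down to P buckets, using a CRC32 hash of the feature index as a stand-in for h; any deterministic hash with range [P] would work the same way.

import numpy as np
import zlib

def hash_features(x, P):
    # phi(x)_p = sum of x_d over all features d with h(d) = p, as in Equation (12.9)
    xhat = np.zeros(P)
    for d, x_d in enumerate(x):
        p = zlib.crc32(str(d).encode()) % P    # h(d): a deterministic hash into [P]
        xhat[p] += x_d                         # colliding features simply add together
    return xhat

x = np.array([1.0, 0.0, 2.0, 0.0, 3.0, 5.0])
print(hash_features(x, P=4))                   # a length-4 vector; its entries still sum to 11.0

In practice you would hash feature names (strings) rather than positions, and a common refinement also hashes a random sign per feature so that collisions tend to cancel; the analysis below considers the simple additive version.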

Consider the kernel defined by this hash mapping. Namely:

    K^(hash)(x, z) = φ(x) · φ(z)                                              (12.10)
                   = ∑_p ( ∑_d 1[h(d) = p] x_d ) ( ∑_e 1[h(e) = p] z_e )      (12.11)
                   = ∑_p ∑_{d,e} 1[h(d) = p] 1[h(e) = p] x_d z_e              (12.12)
                   = ∑_d ∑_{e ∈ h⁻¹(h(d))} x_d z_e                            (12.13)
                   = x · z + ∑_d ∑_{e ≠ d, e ∈ h⁻¹(h(d))} x_d z_e             (12.14)

This hash kernel has the form of a linear kernel plus a small number of quadratic terms. The particular quadratic terms are exactly those given by collisions of the hash function. There are two things to notice about this. The first is that collisions might not actually be bad things! In a sense, they are giving you a little extra representational power. In particular, if the hash function happens to select out feature pairs that benefit from being paired, then you now have a better representation. The second is that even if this does not happen, the quadratic term in the kernel has only a small effect on the overall prediction. In particular, if you assume that your hash function is pairwise independent (a common assumption of hash functions), then the expected value of this quadratic term is zero, and its variance decreases at a rate of O(P⁻²). In other words, if you choose P ≈ 100, then the variance is on the order of 0.0001.

12.5 Exercises

Exercise: TODO...
