Training Convolutional Neural Networks
Carlo Tomasi

November 26, 2018

1 The Soft-Max Simplex

Neural networks are typically designed to compute real-valued functions y = h(x) : R^d → R^e of their input x. When a classifier is needed, a soft-max function is used as the last layer, with e entries in its output vector p if there are e classes in the label space Y. The class corresponding to input x is then found as the arg max of p. Thus, the network can be viewed as a function p = f(x, w) : X → P that transforms the data space X into the soft-max simplex P, the set of all nonnegative real-valued vectors p ∈ R^e whose entries add up to 1:

    P = { p ∈ R^e : p ≥ 0 and Σ_{i=1}^e p_i = 1 }.

This set has dimension e − 1, and is the convex hull of the e columns of the identity matrix in R^e. Figure 1 shows the 1-simplex and the 2-simplex. (In geometry, the simplices are named by their dimension, which is one less than the number of classes.)

The vector w in the expression above collects all the parameters of the neural network, that is, the gains and biases of all the neurons. More specifically, for a deep neural network with K layers indexed by k = 1, ..., K, we can write

    w = [ w^(1) ; ... ; w^(K) ],

where w^(k) is a vector collecting both gains and biases for layer k.

If the arg max rule is used to compute the class,

    ŷ = h(x) = arg max_c p_c,

then the network has a low training risk if the transformed data points p fall in the decision regions

    P_c = { p : p_c ≥ p_j for j ≠ c }   for c = 1, ..., e.

These regions are convex, because their boundaries are defined by linear inequalities in the entries of p. Thus, when used for classification, the neural network can be viewed as learning a transformation of the original decision regions in X into the convex decision regions in the soft-max simplex.
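As a concrete illustration, here is a minimal NumPy sketch of the soft-max map into P and the arg max decision rule; the score vector z and the function names are illustrative only:

    import numpy as np

    def softmax(z):
        """Map a vector of e real scores to a point in the soft-max simplex P."""
        z = z - np.max(z)        # subtract the max for numerical stability
        q = np.exp(z)
        return q / np.sum(q)

    z = np.array([2.0, -1.0, 0.5])   # scores from the last linear layer (e = 3 classes)
    p = softmax(z)

    assert np.all(p >= 0) and np.isclose(np.sum(p), 1.0)   # p lies in the 2-simplex
    y_hat = np.argmax(p)             # arg max rule: index of the decision region P_c
    print(p, y_hat)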
[Figure 1: The 1-simplex for two classes (dark segment in the diagram on the left) and the 2-simplex for three classes (light triangle in the diagram on the right). The blue dot on the left and the blue line segments on the right are the boundaries of the decision regions. The boundaries meet at the unit point 1/e in e dimensions.]

2 Loss

The risk L_T to be minimized to train a neural network is the average loss on a training set of input-output pairs T = {(x_1, y_1), ..., (x_N, y_N)}. The outputs y_n are categorical in a classification problem, and real-valued vectors in a regression problem. For a regression problem, the loss function is typically the quadratic loss, ℓ(y, y') = ||y − y'||².

For classification, on the other hand, we would like the risk L_T(h) to be differentiable, in order to be able to use gradient descent methods. However, the arg max is a piecewise-constant function, and its derivatives are either zero or undefined (where the arg max changes value). The zero-one loss function has similar properties. To address these issues, a differentiable loss defined on f is used as a proxy for the zero-one loss defined on h. Specifically, the multi-class cross-entropy loss is used, which we studied in the context of logistic-regression classifiers. Its definition is repeated here for convenience:

    ℓ(y, p) = −log p_y.

Equivalently, if q(y) is the one-hot encoding of the true label y, the cross-entropy loss can also be written as follows:

    ℓ(y, p) = −Σ_{k=1}^e q_k(y) log p_k.

With these definitions, L_T is a piecewise-differentiable function, and one can use gradient or sub-gradient methods to compute the gradient of L_T with respect to the parameter vector w. Exceptions to differentiability are due to the use of the ReLU, which has a cusp at the origin, as the nonlinearity in neurons, as well as to the possible use of max-pooling. These exceptions are pointwise, and are typically ignored in both the literature and the software packages used to minimize L_T. If desired, they could be addressed by either computing sub-gradients rather than gradients [3], or rounding out the cusps with differentiable joints.
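A minimal sketch that checks the equivalence of the two forms of the cross-entropy loss numerically (the probabilities and label are illustrative):

    import numpy as np

    def cross_entropy(y, p):
        """First form: ℓ(y, p) = −log p_y for true label y and soft-max output p."""
        return -np.log(p[y])

    def cross_entropy_onehot(q, p):
        """Equivalent form: −Σ_k q_k log p_k, with q the one-hot encoding of y."""
        return -np.sum(q * np.log(p))

    p = np.array([0.7, 0.2, 0.1])   # soft-max output, e = 3 classes
    y = 0                           # true label
    q = np.eye(3)[y]                # one-hot encoding q(y)

    assert np.isclose(cross_entropy(y, p), cross_entropy_onehot(q, p))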
As usual, once the loss has been settled on, the training risk is defined as the average loss over the training set, and expressed as a function of the parameters w of f:

    L_T(w) = (1/N) Σ_{n=1}^N ℓ_n(w)   where   ℓ_n(w) = ℓ(y_n, f(x_n, w)).    (1)

3 Back-Propagation

A local minimum for the risk L_T(w) is found by an iterative procedure that starts with some initial values w_0 for w, and then at step t performs the following operations:

- Compute the gradient of the training risk, ∂L_T/∂w evaluated at w = w_t.
- Take a step that reduces the value of L_T by moving in the direction of the negative gradient, by a variant of the steepest descent method called Stochastic Gradient Descent (SGD), discussed in Section 4.

The gradient computation is called back-propagation and is described next. The computation of the n-th loss term ℓ_n(w) can be rewritten as follows:

    x^(0) = x_n
    x^(k) = f^(k)(W^(k) x^(k−1))   for k = 1, ..., K
    p = x^(K)
    ℓ_n = ℓ(y_n, p),

where (x_n, y_n) is the n-th training sample and f^(k) describes the function implemented by layer k. A code sketch of this forward recursion is given below.

Computation of the derivatives of the loss term ℓ_n(w) can be understood with reference to Figure 2. The term ℓ_n depends on the parameter vector w^(k) for layer k through the output x^(k) from that layer and nothing else, so that we can write

    ∂ℓ_n/∂w^(k) = ∂ℓ_n/∂x^(k) · ∂x^(k)/∂w^(k)   for k = K, ..., 1,    (2)

and the first gradient on the right-hand side satisfies the backward recursion

    ∂ℓ_n/∂x^(k−1) = ∂ℓ_n/∂x^(k) · ∂x^(k)/∂x^(k−1)   for k = K, ..., 2,    (3)

because ℓ_n depends on the output x^(k−1) from layer k − 1 only through the output x^(k) from layer k. The recursion (3) starts with

    ∂ℓ_n/∂x^(K) = ∂ℓ/∂p,    (4)

where p is the second argument to the loss function ℓ.
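A minimal sketch of the forward recursion, assuming for concreteness that each layer is an affine map followed by a ReLU (the last layer of an actual classifier would instead end in a soft-max):

    import numpy as np

    def forward(x, Ws, bs):
        """Forward propagation: x^(0) = x_n; x^(k) = f^(k)(W^(k) x^(k-1) + b^(k))."""
        xs = [x]                          # store every x^(k); backprop needs them all
        for W, b in zip(Ws, bs):
            a = W @ xs[-1] + b            # activations of layer k
            xs.append(np.maximum(a, 0))   # ReLU nonlinearity (an assumption of this sketch)
        return xs

    # Tiny example: K = 2 layers mapping R^3 -> R^4 -> R^2
    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
    bs = [np.zeros(4), np.zeros(2)]
    xs = forward(rng.normal(size=3), Ws, bs)
    print([x.shape for x in xs])   # [(3,), (4,), (2,)]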
[Figure 2: Example data flow for the computation of the loss term ℓ_n for a neural network with K = 3 layers. When viewed from the loss term ℓ_n, the output x^(k) from layer k (pick for instance k = 2) is a bottleneck of information for both the parameter vector w^(k) for that layer and the output x^(k−1) from the previous layer (k − 1 = 1 in the example). This observation justifies the use of the chain rule for differentiation to obtain equations (2) and (3).]

In the equations above, the derivative of a function with respect to a vector is to be interpreted as the row vector of all derivatives. Let d_k be the dimensionality (number of entries) of x^(k), and j_k be the dimensionality of w^(k). The two matrices

    ∂x^(k)/∂w^(k) = [ ∂x_i^(k)/∂w_j^(k) ]  (of size d_k × j_k)   and   ∂x^(k)/∂x^(k−1) = [ ∂x_i^(k)/∂x_j^(k−1) ]  (of size d_k × d_{k−1})    (5)

are the Jacobian matrices of the layer output x^(k) with respect to the layer parameters and inputs. Computation of the entries of these Jacobians is a simple exercise in differentiation, and is left to the Appendix.

The equations (2)-(5) are the basis for the back-propagation algorithm for the computation of the gradient of the training risk L_T(w) with respect to the parameter vector w of the neural network (Algorithm 1). The algorithm loops over the training samples. For each sample, it feeds the input x_n to the network to compute the layer outputs x^(k) for that sample and for all k = 1, ..., K, in this order. The algorithm temporarily stores all the values x^(k), because they are needed to compute the required derivatives. This initial volley of computation is called forward propagation (of the inputs).

The algorithm then revisits the layers in reverse order while computing the derivatives in equation (4) first and then in equations (2) and (3), and concatenates the resulting K layer gradients into a single gradient ∂ℓ_n/∂w. This computation is called back-propagation (of the derivatives). The gradient of L_T(w) is the average (from equation (1)) of the gradients computed for each of the samples:

    ∂L_T/∂w = (1/N) Σ_{n=1}^N ∂ℓ_n/∂w = (1/N) Σ_{n=1}^N [ ∂ℓ_n/∂w^(1) ; ... ; ∂ℓ_n/∂w^(K) ]

(here, the derivative with respect to w is read as a column vector of derivatives). This average vector can be accumulated (see the last assignment in Algorithm 1) as back-propagation progresses.

For succinctness, operations are expressed as matrix-vector computations in Algorithm 1. In practice, the matrices would be very sparse, and correlations and explicit loops over appropriate indices are used instead.
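As a preview of the Appendix, here is a minimal sketch of the two Jacobians in (5) for a single affine-plus-ReLU layer, checked against finite differences; the helper names are illustrative:

    import numpy as np

    def layer(W, x):
        return np.maximum(W @ x, 0)              # x^(k) = f(W^(k) x^(k-1)), f = ReLU

    def jacobians(W, x):
        """Jacobians of the layer output w.r.t. the weights and w.r.t. the input."""
        a = W @ x
        df = (a >= 0).astype(float)              # df/da for the ReLU, elementwise
        d_out, d_in = W.shape
        # d x_i / d W_qj = delta_iq * df_i * x_j  (nonzero only in row i of W)
        JW = np.zeros((d_out, d_out * d_in))
        for i in range(d_out):
            JW[i, i * d_in:(i + 1) * d_in] = df[i] * x
        Jx = df[:, None] * W                     # d x_i / d x_j = df_i * W_ij
        return JW, Jx

    rng = np.random.default_rng(0)
    W, x = rng.normal(size=(3, 4)), rng.normal(size=4)
    JW, Jx = jacobians(W, x)

    eps = 1e-6                                   # finite-difference check of Jx
    for j in range(4):
        dx = np.zeros(4)
        dx[j] = eps
        num = (layer(W, x + dx) - layer(W, x)) / eps
        assert np.allclose(num, Jx[:, j], atol=1e-4)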
Algorithm 1 Backpropagation

    function ∇L_T = backprop(T, w = [w^(1), ..., w^(K)], ℓ)
        ∇L_T = zeros(size(w))
        for n = 1, ..., N do
            x^(0) = x_n
            for k = 1, ..., K do                        ▷ Forward propagation
                x^(k) ← f^(k)(x^(k−1), w^(k))           ▷ Compute and store layer outputs, to be used in back-propagation
            end for
            ∇ℓ_n = [ ]                                  ▷ Initially empty contribution of the n-th sample to the loss gradient
            g ← ∂ℓ(y_n, x^(K))/∂p                       ▷ g is ∂ℓ_n/∂x^(k)
            for k = K, ..., 1 do                        ▷ Back-propagation
                ∇ℓ_n ← [ g · ∂x^(k)/∂w^(k) ; ∇ℓ_n ]     ▷ Derivatives are evaluated at w^(k) and x^(k−1)
                g ← g · ∂x^(k)/∂x^(k−1)                 ▷ Ditto (the value computed for k = 1 is not used)
            end for
            ∇L_T ← ((n − 1) ∇L_T + ∇ℓ_n) / n            ▷ Accumulate the average
        end for
    end function

4 Stochastic Gradient Descent

In principle, a neural network can be trained by minimizing the training risk L_T(w) defined in equation (1) by any of a vast variety of numerical optimization methods [5, 2]. At one end of the spectrum, methods that make no use of gradient information take too many steps to converge. At the other end, methods that use second-order derivatives (the Hessian) to determine high-quality steps tend to be too expensive in terms of both space and time at each iteration, although some researchers advocate these types of methods [4]. By far the most widely used methods employ gradient information, computed by back-propagation [1]. Line search is too expensive, and the step size is therefore chosen according to some heuristic instead.

The momentum method [6, 8] starts from an initial value w_0 chosen at random and iterates as follows:

    v_{t+1} = μ_t v_t − α ∇L_T(w_t)
    w_{t+1} = w_t + v_{t+1}.

The vector v_{t+1} is the step or velocity that is added to the old value w_t to compute the new value w_{t+1}. The scalar α > 0 is the learning rate that determines how fast to move in the direction opposite to the risk gradient ∇L_T(w), and the time-dependent scalar μ_t ∈ [0, 1] is the momentum coefficient. Gradient descent is obtained when μ_t = 0. Greater values of μ_t encourage steps in a consistent direction (since the new velocity v_{t+1} has a greater component in the direction of the old velocity v_t than if no momentum were present), and these steps accelerate descent when the gradient of L_T(w) is small, as is the case around shallow minima. The value of μ_t is often varied according to some schedule like the one in Figure 3. The rationale for the increasing values over time is that momentum is more useful in later stages, in which the gradient magnitude is very small as w_t approaches the minimum.

The learning rate α is often fixed, and is a parameter of critical importance [9]. A rate that is too large leads to large steps that often overshoot, and a rate that is too small leads to very slow progress. In practice, an initial value of α is chosen by cross-validation to be some value much smaller than 1. Convergence can take between hours and weeks for typical applications, and the value of L_T is typically monitored through some user interface. When progress starts to saturate, the value of α is decreased (say, divided by 10).
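A minimal sketch of the momentum iteration, assuming a routine grad_LT that returns ∇L_T(w) (computed, for instance, by Algorithm 1); the toy quadratic risk and hyperparameter values are illustrative only:

    import numpy as np

    def sgd_momentum(w0, grad_LT, alpha=0.01, mu=0.9, steps=1000):
        """Momentum method: v_{t+1} = mu_t v_t - alpha grad L_T(w_t); w_{t+1} = w_t + v_{t+1}."""
        w = w0.copy()
        v = np.zeros_like(w)
        for t in range(steps):
            v = mu * v - alpha * grad_LT(w)   # mu could follow a schedule mu_t instead
            w = w + v
        return w

    # Example on the toy risk L_T(w) = ||w||^2 / 2, whose gradient is w:
    w_star = sgd_momentum(np.array([5.0, -3.0]), grad_LT=lambda w: w)
    print(w_star)   # approaches the minimum at the origin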
[Figure 3: A possible schedule [8] for the momentum coefficient μ_t: the coefficient increases with t, on a scale of a few hundred iterations, until it saturates at a maximum value μ_max.]

Mini-Batches

The gradient of the risk L_T(w) is expensive to compute, and one tends to use as large a learning rate as possible so as to minimize the number of steps taken. One way to prevent the resulting overshooting would be to do online learning, in which each step

    μ_t v_t − α ∇ℓ_n(w_t)

(there is one such step for each training sample) is taken right away, rather than accumulated into the step

    μ_t v_t − α ∇L_T(w_t)

(no subscript n here). In contrast, using the latter step is called batch learning. Computing ∇ℓ_n is much less expensive (by a factor of N) than computing ∇L_T. In addition, and most importantly for convergence behavior, online learning breaks a single batch step into N small steps, after each of which the value of the risk is re-evaluated. As a result, the online steps can follow very curved paths, whereas a single batch step can only move in a fixed direction in parameter space. Because of this greater flexibility, online learning converges faster than batch learning for the same overall computational effort.

The small online steps, however, have high variance, because each of them is taken based on minimal amounts of data. One can improve convergence further by processing mini-batches of training data: accumulate B gradients ∇ℓ_n from the data in one mini-batch into a single gradient ∇L_T, take the step, and move on to the next mini-batch. It turns out that small values of B achieve the best compromise between reducing variance and keeping steps flexible. Values of B around a few dozen are common.

Termination

When used outside learning, gradient descent is typically stopped when steps make little progress, as measured by the step size ||w_t − w_{t−1}|| and/or the decrease in function value L_T(w_{t−1}) − L_T(w_t). When training a deep network, on the other hand, descent is often stopped earlier to improve generalization. Specifically, one monitors the zero-one risk (error) of the classifier on a validation set, rather than the cross-entropy risk of the soft-max output on the training set, and stops when the validation-set error bottoms out, even if the training-set risk would continue to decrease; a sketch combining mini-batches with this criterion is given below. A different way to improve generalization, sometimes used in combination with early termination, is discussed in Section 5.
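A minimal sketch combining mini-batches of size B with momentum and validation-based termination; grad_loss (returning ∇ℓ_n) and predict (returning class labels) are assumed helpers, and the hyperparameter values are illustrative:

    import numpy as np

    def train(w, T_x, T_y, V_x, V_y, grad_loss, predict,
              alpha=0.01, mu=0.9, B=32, epochs=100):
        """Mini-batch SGD with momentum, stopped by the validation zero-one error."""
        N = len(T_x)
        v = np.zeros_like(w)
        best_w, best_err = w.copy(), np.inf
        for epoch in range(epochs):
            order = np.random.permutation(N)
            for start in range(0, N, B):
                batch = order[start:start + B]      # one mini-batch of B samples
                g = np.mean([grad_loss(w, T_x[n], T_y[n]) for n in batch], axis=0)
                v = mu * v - alpha * g              # momentum step on the mini-batch gradient
                w = w + v
            err = np.mean(predict(w, V_x) != V_y)   # zero-one error on the validation set
            if err < best_err:
                best_w, best_err = w.copy(), err    # keep the best iterate seen so far
        return best_w                               # early termination: return it, not the last w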
5 Dropout

Since deep nets have a large number of parameters, they would need impractically large training sets to avoid overfitting if no special measures are taken during training. Early termination, described at the end of the previous section, is one such measure. In general, the best way to avoid overfitting in the presence of limited data would be to build one network for every possible setting of the parameters, compute the posterior probability of each setting given the training set, and then aggregate the nets into a single predictor that computes the average output weighted by the posterior probabilities. This approach, which is reminiscent of building a forest of trees, is obviously infeasible to implement for nontrivial nets.

One way to approximate this scheme in a computationally efficient way is called the dropout method [7]. Given a deep network to be trained, a dropout network is obtained by flipping a biased coin for each node of the original network and dropping that node if the flip turns out heads. Dropping the node means that all the weights and biases for that node are set to zero, so that the node becomes effectively inactive. One then trains the network by using mini-batches of training data, and performs one iteration of training on each mini-batch after turning off neurons independently with probability 1 − p. When training is done, all the weights in the network are multiplied by p, the probability that a neuron was active during training, and this effectively averages the outputs of the nets, with weights that depend on how often a unit participated in training. The value of p is typically set to 1/2. Each dropout network can be viewed as a different network, and the dropout method effectively samples a large number of nets efficiently.
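A minimal sketch of one dropout forward pass and the test-time rescaling just described, assuming (as above) that p is the probability that a hidden unit stays active, so units are turned off with probability 1 − p:

    import numpy as np

    def dropout_forward(x, Ws, p=0.5, rng=np.random.default_rng()):
        """Forward pass of one dropout network: each hidden unit is dropped with prob. 1 - p."""
        for W in Ws[:-1]:
            h = np.maximum(W @ x, 0)
            mask = rng.random(h.shape) < p   # biased coin flip per node; kept with probability p
            x = h * mask                     # dropped nodes output zero for this mini-batch
        return Ws[-1] @ x

    def test_time_weights(Ws, p=0.5):
        """After training, scale the weights fed by dropped-out units by p (averaging effect)."""
        return [Ws[0]] + [p * W for W in Ws[1:]]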
Appendix: The Jacobians for Back-Propagation

If f^(k) is a point function, that is, if it is R → R, the individual entries of the Jacobian matrices (5) are easily found to be (reverting to matrix subscripts for the weights)

    ∂x_i^(k)/∂W_qj^(k) = δ_iq df_i^(k) x_j^(k−1)   and   ∂x_i^(k)/∂x_j^(k−1) = df_i^(k) W_ij^(k).

The Kronecker delta

    δ_iq = 1 if i = q, and 0 otherwise

in the first of the two expressions above reflects the fact that x_i^(k) depends only on the i-th activation, which is in turn the inner product of row i of W^(k) with x^(k−1). Because of this, the derivative of x_i^(k) with respect to entry W_qj^(k) is zero if this entry is not in that row, that is, when q ≠ i. The expression df_i^(k) is shorthand for

    df^(k)/da evaluated at a = a_i^(k),

the derivative of the activation function f^(k) with respect to its only argument a, evaluated for a = a_i^(k). For the ReLU activation function, h_k = h,

    df^(k)/da = 1 for a ≥ 0, and 0 otherwise.

For the ReLU activation function followed by max-pooling, h_k(·) = π(h(·)), on the other hand, the value of the output at index i is computed from a window P(i) of activations, and only one of the activations (the one with the highest value) in the window is relevant to the output. (In case of a tie, we attribute the highest value in P(i) to one of the highest inputs, say, chosen at random.) Let then

    p_i^(k) = max_{q ∈ P(i)} h(a_q^(k))

be the value resulting from max-pooling over the window P(i) associated with output i of layer k. Furthermore, let

    q̂ = arg max_{q ∈ P(i)} h(a_q^(k))

be the index of the activation where that maximum is achieved, where for brevity we leave the dependence of q̂ on activation index i and layer k implicit. Then,

    ∂x_i^(k)/∂W_qj^(k) = δ_{qq̂} df_q̂^(k) x_j^(k−1)   and   ∂x_i^(k)/∂x_j^(k−1) = df_q̂^(k) W_q̂j^(k).
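A minimal sketch of how these formulas route derivatives through ReLU and max-pooling: only the arg max index q̂ of each window P(i) receives a nonzero derivative (the window index lists are illustrative):

    import numpy as np

    def relu_backward(a, g_out):
        """Multiply incoming derivatives by df/da = 1[a >= 0] at each unit."""
        return g_out * (a >= 0)

    def maxpool_backward(h, windows, g_out):
        """Route each output derivative to the arg max entry q_hat of its window P(i)."""
        g_in = np.zeros_like(h)
        for i, P in enumerate(windows):      # windows: list of index arrays P(i)
            q_hat = P[np.argmax(h[P])]       # ties: argmax picks the first of the maxima
            g_in[q_hat] += g_out[i]          # all other entries of the window get zero
        return g_in

    h = np.array([0.2, 1.5, 0.7, 0.7])
    windows = [np.array([0, 1]), np.array([2, 3])]
    print(maxpool_backward(h, windows, g_out=np.array([1.0, 1.0])))   # [0. 1. 1. 0.]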
References

[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[3] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I: Fundamentals, volume 305. Springer Science & Business Media, 2013.

[4] J. Martens. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, 2011.

[5] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, NY, 1999.

[6] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

[7] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.

[8] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139-1147, 2013.

[9] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16:1429-1451, 2003.