Max-margin structured output learning in L1 norm space

Sandor Szedmak, ISIS Group, Electronics and Computer Science, University of Southampton, Southampton, United Kingdom. ss03v@ecs.soton.ac.uk
Yizhao Ni, ISIS Group, Electronics and Computer Science, University of Southampton, Southampton, United Kingdom. yn05r@ecs.soton.ac.uk
Craig J. Saunders, ISIS Group, Electronics and Computer Science, University of Southampton, Southampton, United Kingdom. cjs@ecs.soton.ac.uk
Juho Rousu, Department of Computer Science, University of Helsinki, Finland. juho.rousu@cs.helsinki.fi

All authors would like to acknowledge the support of the EU framework six project SMART.

Abstract

We study a structured output learning setting where both the sample size and the dimensions of the feature vectors of both the input and the output are very large (possibly infinite in the latter case), but the input and output feature representations are nonnegative and very sparse (i.e. the number of nonzero components is finite and their proportion to the dimension is close to zero). Such situations are encountered in real-world problems such as statistical machine translation. We show that in this setting structured output learning can be implemented efficiently. The solution relies on maximum margin learning of the linear relations between the inputs and the outputs in an $L_1$ norm space. This learning problem can be formulated by imposing $L_\infty$ norm regularisation on the linear transformation expressing the relations.

1 Introduction

Machine learning researchers have devoted relatively little effort to discovering how margin based learning methods behave in $L_1$ space. Several papers investigate the case when the $L_1$ norm is applied in the regularisation, e.g. Linear Programming Boosting [DBST02] or, under the name lasso, [Tib96], or (and) in measuring the loss, e.g. the SVM with soft margin; see for example [CST00]. This paper focuses on the applications where, instead of $L_1$ regularisation, the margin is measured in an $L_1$ sense. It will be shown that this kind of learning displays characteristic properties allowing very large scale problems to be solved with moderate computational effort.

As in machine learning, surprisingly little attention has been given to the $L_1$ norm space in approximation theory. The book [Pin89] summarises some results that are good starting points for further research activities. Another valuable source is [DG85], which deals with density estimation in $L_1$ space. The authors of the latter book emphasise an important fact: measuring the distance between two density functions using the $L_1$ norm is invariant under monotone transformations of the coordinate axes. In other words, if only the order of the coordinates is preserved but the scales are changing, then the $L_1$ norm based distances of vectors normalised to 1 in the same norm remain the same. This allows us to use a nonparametric approach when the underlying distributions might be irregular, e.g. they have no expected value. Human language is an example application field showing the symptoms of irregularity, which has motivated us in formulating the presented approach.

In the following we first formulate the supervised learning problem for structured outputs, then present the optimisation framework. We then show that the base problem can be solved in a very simple way, which leads us to an online algorithm. Using a perceptron type interpretation we are able to state Novikoff-style bounds for the new algorithm.

2 General Setting

We are given a sample of pairs of input and output objects $\{x_i, y_i\}$, $i = 1, \dots, m$, taken from the sets $\mathcal{X}$ and $\mathcal{Y}$ independently with respect to an unknown distribution defined on $\mathcal{X} \times \mathcal{Y}$. Furthermore, there exist two functions $\phi$ and $\psi$ which map the input and output objects into linear vector spaces, namely $\phi : \mathcal{X} \to \mathcal{L}_\phi$ and $\psi : \mathcal{Y} \to \mathcal{L}_\psi$, where $\mathcal{L}_\phi$ and $\mathcal{L}_\psi$ are linear vector spaces whose elements represent the input and output objects. The task is to find a linear transformation $W$ which gives a good predictor of the outputs, represented in the corresponding linear vector space, from the feature vector of the inputs,
$$\psi(y) \approx W\phi(x).$$

One concrete example of this style of problem is a machine translation task where both the input and output objects are sentences taken from natural languages, e.g. English and French, and are represented by occurrences of phrases: n-grams, substrings of words with a special structure, etc. Thus, the input and output feature vectors $\phi(x_i)$ and $\psi(y_i)$ have very high dimensions but they are very sparse. Many other applications studied in the structured prediction literature [TGK03, TLJJ06, TJHA05, RSSST06] also fit naturally into this framework.
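
To make the sparse representation concrete, here is a minimal sketch of bag-of-n-gram feature maps for the machine translation example. The dictionary-based indexing and the `ngram_features` helper are illustrative assumptions of this sketch, not code from the paper.

```python
# A hypothetical sketch of sparse n-gram feature maps phi and psi for
# sentence pairs; nonzero components are stored as {feature: count}.
from collections import Counter

def ngram_features(sentence, n_max=3):
    """Map a sentence to a sparse, nonnegative feature vector of word n-grams."""
    words = sentence.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            feats[" ".join(words[i:i + n])] += 1
    return feats

phi_x = ngram_features("the cat sat on the mat")            # phi(x): English side
psi_y = ngram_features("le chat est assis sur le tapis")    # psi(y): French side
print(len(phi_x), len(psi_y))   # only the nonzero components are ever stored
```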

3 Optimisation problem

We are going to express the relations between the input and the output via a linear transformation projecting the input feature vectors into the space of the output feature vectors, which could be an optimum solution of the following maximum margin problem:
$$\begin{aligned} \min\ & r(W) \\ \text{w.r.t.}\ & W : \mathcal{L}_\phi \to \mathcal{L}_\psi, \ \text{a linear operator}, \\ \text{s.t.}\ & \langle \psi(y_i), W\phi(x_i) \rangle_{\mathcal{L}_\psi} \ge b, \quad i = 1, \dots, m, \\ & b > 0 \ \text{a given constant.} \end{aligned}$$
The objective function $r(\cdot)$ is assumed to be a regularisation function and its concrete definition is derived later. The constraints force the inner products between the output feature vectors and the images of the input feature vectors with respect to the linear operator $W$ to be sufficiently and uniformly large. We use the inner product in a rather algebraic sense, $\langle u, v \rangle = \sum_{k=1}^{n} u_k v_k$, $u, v \in \mathbb{R}^n$, instead of the geometric one which assumes a Hilbert space in the background.

The constraints can be rewritten expressing, instead of a regression task, a one-class classification task in the joint feature space of inputs and outputs, namely
$$\langle \psi(y_i), W\phi(x_i) \rangle_{\mathcal{L}_\psi} = \operatorname{tr}\bigl(\psi(y_i)^\top W\phi(x_i)\bigr) = \operatorname{tr}\bigl(W\phi(x_i)\psi(y_i)^\top\bigr) = \bigl\langle W, \ \psi(y_i) \otimes \phi(x_i) \bigr\rangle_{\mathcal{L}_\psi \otimes \mathcal{L}_\phi}, \qquad (1)$$
where $\operatorname{tr}(\cdot)$ denotes the trace of the matrix in the argument, and the operator $\otimes$ marks the tensor product of its operands. Since the tensor product of two linear vector spaces is a linear vector space too, we can interpret $z_i = \psi(y_i) \otimes \phi(x_i)$ as vectors, and the linear operator $W$ becomes a linear functional, thus a vector in the dual space of the space spanned by the vectors $\{z_i\}$, $i = 1, \dots, m$; hence we can use the vector notation $w$ as well. Based on (1) we arrive at a problem coinciding in form with the standard one-class SVM classification problem:
$$\begin{aligned} \min\ & r(w) \\ \text{w.r.t.}\ & w, \\ \text{s.t.}\ & \langle z_i, w \rangle \ge b, \quad i = 1, \dots, m, \\ & b > 0 \ \text{a given constant}, \ z_i \in \mathbb{R}^{n_z}. \end{aligned} \qquad (2)$$

The well-known approaches to the regularisation apply the $L_2$ norm [Vap98], e.g. the Support Vector Machine, or the $L_1$ norm, e.g. Linear Programming Boosting [DBST02]. The use of structured outputs, or of feature spaces in the output space, has recently been studied using the standard $L_2$ norm based regularisation, see e.g. [SSTPH05] and [AHP+08]. However, the case when the maximum margin is measured in the $L_1$ norm has rarely been investigated.

We are going to focus on the regularisation function $r(\cdot)$ which maximises the $L_1$ norm based distance between the separating hyperplane and the origin in the one-class problem. To this end a subproblem can be formulated computing this distance, measured between the origin and the closest point of the hyperplane, for which we have
$$\min \|u\|_1 \quad \text{w.r.t. } u \quad \text{s.t. } \langle w, u \rangle = b, \qquad (3)$$
saying that a vector $u$ sitting on the hyperplane is looked for with the minimum $L_1$ norm. The entire problem, which maximises the minimum distance, takes the following form:
$$\begin{aligned} \max\ & f(w, b) = \Bigl[\, \min \|u\|_1 \ \ \text{w.r.t. } u \ \ \text{s.t. } \langle w, u \rangle = b \,\Bigr] \\ \text{w.r.t.}\ & w, \\ \text{s.t.}\ & \langle z_i, w \rangle \ge b, \quad i = 1, \dots, m, \\ & b > 0 \ \text{a given constant.} \end{aligned} \qquad (4)$$

In the sequel we are going to deal with the subcase of (4) where the non-negativity conditions $z_i \ge 0$, $i = 1, \dots, m$, hold.
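
As a quick numerical illustration of the identity (1) behind the one-class reformulation (2), the following hedged sketch checks that $\langle \psi(y), W\phi(x) \rangle$ coincides with the inner product of $W$ and the tensor product feature $z = \psi(y) \otimes \phi(x)$. Dense toy vectors are used purely for illustration.

```python
# Hedged sketch of identity (1): <psi(y), W phi(x)> equals the Frobenius
# inner product <W, psi(y) phi(x)^T>, i.e. <w, z> in the joint feature space.
import numpy as np

rng = np.random.default_rng(0)
phi_x = rng.random(5)            # nonnegative input features phi(x_i)
psi_y = rng.random(3)            # nonnegative output features psi(y_i)
W = rng.random((3, 5))           # linear operator L_phi -> L_psi

lhs = psi_y @ W @ phi_x          # <psi(y_i), W phi(x_i)>
Z_i = np.outer(psi_y, phi_x)     # z_i = psi(y_i) (x) phi(x_i) as a matrix
rhs = np.sum(W * Z_i)            # <W, z_i> in the tensor product space
print(np.allclose(lhs, rhs))     # True: the two constraint forms agree
```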

4 Optimum solution

Let us first solve the subproblem given by (3) for a fixed $w$. Via a simple argument one can show that the optimum value of (3) can be given in a closed form.

Proposition 1. The optimum value of problem (3) is equal to
$$\|u^*\|_1 = \frac{b}{\max_j |w_j|} = \frac{b}{\|w\|_\infty}. \qquad (5)$$

Proof: First, for the sake of simplicity, we divide both sides of the equality constraint by $b$; since $b$ is strictly positive, this has no effect on the problem. Let us denote $w/b$ by $w'$. We can assume that at least one component of $w'$ differs from 0, otherwise no feasible solution exists. Now we unfold the norm in the objective by applying the substitution $u = u^+ - u^-$, $u^+ \ge 0$, $u^- \ge 0$, and write up the dual problem as well:
$$\begin{aligned} \min\ & \mathbf{1}^\top (u^+ + u^-) & \qquad\qquad \max\ & \gamma \\ \text{w.r.t.}\ & u^+, u^-, & \text{w.r.t.}\ & \gamma, \\ \text{s.t.}\ & (w')^\top (u^+ - u^-) = 1, \ u^+ \ge 0, \ u^- \ge 0, & \text{s.t.}\ & \gamma w' \le \mathbf{1}, \ -\gamma w' \le \mathbf{1}. \end{aligned} \qquad (6)$$
For any strictly positive component of $w'$ we have
$$\gamma (w')_j \le 1 \ \Longrightarrow \ \gamma \le \min_j \Bigl\{ \tfrac{1}{(w')_j} : (w')_j > 0 \Bigr\}, \qquad (7)$$
and for any strictly negative component of the same vector the following holds:
$$-\gamma (w')_j \le 1 \ \Longrightarrow \ \gamma \le \min_j \Bigl\{ \tfrac{1}{|(w')_j|} : (w')_j < 0 \Bigr\}, \qquad (8)$$
therefore
$$\gamma \le \min_j \frac{1}{|(w')_j|} = \frac{1}{\max_j |(w')_j|} = \frac{1}{\|w'\|_\infty}. \qquad (9)$$
Since the primal objective has a lower bound, namely 0, the dual has a feasible, bounded optimal solution and the optimum dual value is equal to the optimum primal value; therefore $\|u^*\|_1 = \gamma^* = 1/\|w'\|_\infty = b/\|w\|_\infty$, which is the statement of the proposition.

Based on Proposition 1 we have
$$\begin{aligned} \max\ & f(w, b) = \frac{b}{\|w\|_\infty} \\ \text{w.r.t.}\ & w, \\ \text{s.t.}\ & \langle z_i, w \rangle \ge b, \quad i = 1, \dots, m, \quad b > 0 \ \text{a given constant}, \end{aligned}$$
and after reformulating it as a minimisation problem we obtain
$$\begin{aligned} \min\ & \|w\|_\infty \\ \text{w.r.t.}\ & w, \\ \text{s.t.}\ & \langle z_i, w \rangle \ge b, \quad i = 1, \dots, m, \quad b > 0 \ \text{a given constant.} \end{aligned} \qquad (10)$$

Proposition 2. If $z_i \ge 0$ for all $i = 1, \dots, m$, then an optimum solution of the linear programming problem given by (10) is equal to
$$w^* = \frac{b}{\|z_{i^*}\|_1} \mathbf{1}, \quad \text{where } i^* = \arg\min_i \|z_i\|_1. \qquad (11)$$

Proof: First, both sides of the inequality constraints are divided by $b$. Let us use the notation $w'$ for $w/b$, so we obtain
$$\begin{aligned} \min\ & \|w'\|_\infty \\ \text{w.r.t.}\ & w', \\ \text{s.t.}\ & \langle z_i, w' \rangle \ge 1, \quad z_i \ge 0, \quad i = 1, \dots, m. \end{aligned} \qquad (12)$$
We can recognise that $w'^* = \frac{1}{\|z_{i^*}\|_1}\mathbf{1}$ is a feasible solution, since
$$\langle z_i, w'^* \rangle = \frac{\|z_i\|_1}{\|z_{i^*}\|_1} = \frac{\|z_i\|_1}{\min_{i'} \|z_{i'}\|_1} \ge 1. \qquad (13)$$
Now we need to prove that $w'^*$ is also an optimum solution. If this is not true, then we can find a $\hat{w}$ which is feasible and $\|\hat{w}\|_\infty < \|w'^*\|_\infty$. This means that there is a constant $\beta$ such that $\hat{w}_j \le \beta < \frac{1}{\|z_{i^*}\|_1}$ for any $j = 1, \dots, n_z$. From $z_i \ge 0$, $i = 1, \dots, m$, it follows that $\beta > 0$, otherwise the feasibility assumption is immediately violated. Let us check the feasibility of $\hat{w}$:
$$\langle z_{i^*}, \hat{w} \rangle \le \beta \|z_{i^*}\|_1 < \frac{\|z_{i^*}\|_1}{\|z_{i^*}\|_1} = 1, \qquad (14)$$
hence $\hat{w}$ violates the constraint belonging to $z_{i^*}$, the sample item with the smallest $L_1$ norm. Thus $w^*$ is an optimum solution of (10).

With a constant, completely flat optimum solution, the predictor for a new $\phi(x) \ge 0$ can be written as
$$(\tilde\psi(y))_j = (W\phi(x))_j = \frac{b}{\|z_{i^*}\|_1} \langle \phi(x), \mathbf{1} \rangle,$$
which might not seem interesting at first sight; however, let us now consider the sparse case in the next subsection.

4.1 Sparse case

First we define what we understand by sparseness in the problem given by (10). Sparseness means that there is at least one index $j \in \{1, \dots, n_z\}$ such that $(z_i)_j = 0$ holds for all $i$. A consequence of this kind of sparseness is that the corresponding components of $w$ have no influence on the feasibility; thus these components are not determined, except for the upper and lower bounds imposed by the objective function, namely $\min \|w\|_\infty$. Hence, the optimum solution of (10) becomes a set containing the elements obeying the form
$$(w^*)_j = \begin{cases} d & \text{if } (z_i)_j > 0 \ \text{for some } i \in \{1, \dots, m\}, \\ \text{any value in } [-d, d] & \text{otherwise}, \end{cases} \quad \text{where } d = \frac{b}{\min_i \|z_i\|_1}. \qquad (15)$$
Because $z_i = \psi(y_i) \otimes \phi(x_i)$, the component $(z_i)_j$ is equal to 0 wherever the corresponding components of either $\psi(y_i)$ or $\phi(x_i)$, or both, are equal to 0, which has high probability if both terms in the tensor product have lots of zero elements.

In the sparse case we can impose a further optimisation on our base problem (10), which minimises the number of non-zero components of $w$:
$$\begin{aligned} \min\ & \|w\|_0 \\ \text{w.r.t.}\ & w, \\ \text{s.t.}\ & w \in W^*, \quad \text{where } W^* = \arg\min \bigl\{ \|w\|_\infty : \langle z_i, w \rangle \ge b, \ i = 1, \dots, m \bigr\}, \quad b > 0 \ \text{a given constant}, \end{aligned} \qquad (16)$$
where the norm $\|w\|_0$ means the number of non-zero components of the vector in the argument. This extension gives us the following optimum solution:
$$(w^*)_j = \begin{cases} d & \text{if } (z_i)_j > 0 \ \text{for some } i \in \{1, \dots, m\}, \\ 0 & \text{otherwise}, \end{cases} \qquad (17)$$
where $d$ is defined as before.

4.2 Scale independency

If the prediction $\tilde\psi(y) = W\phi(x)$ depends on the shape and not on the scale, we can employ the particular normalised solution
$$(w^*)_j = \begin{cases} 1 & \text{if } (z_i)_j > 0 \ \text{for some } i \in \{1, \dots, m\}, \\ 0 & \text{otherwise}, \end{cases} \qquad (18)$$
instead of the original one given by (15).
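
The closed-form solutions (11), (15) and (17) are simple enough to compute directly from the data. Below is a minimal sketch, assuming the nonnegative sparse vectors $z_i$ are stored as index-to-value dictionaries; the representation and the helper name are illustrative choices, not the paper's code.

```python
# Minimal sketch of the sparse closed-form optimum (17): the constant
# d = b / min_i ||z_i||_1 on the union of the supports of the z_i, 0 elsewhere.

def sparse_l1_max_margin(Z, b=1.0):
    """Z: list of nonnegative sparse vectors z_i given as {index: value} dicts.
    Returns the sparse optimum w* of (10) restricted as in (16)-(17)."""
    d = b / min(sum(z.values()) for z in Z)          # b / min_i ||z_i||_1 (z_i >= 0)
    support = set().union(*(z.keys() for z in Z))    # indices j with some (z_i)_j > 0
    return {j: d for j in support}

Z = [{0: 0.5, 2: 0.5}, {1: 1.0, 2: 1.0}]             # toy nonnegative sparse sample
w = sparse_l1_max_margin(Z, b=1.0)
# every margin constraint <z_i, w> >= b of problem (10) is satisfied:
print(all(sum(v * w[j] for j, v in z.items()) >= 1.0 - 1e-12 for z in Z))
```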

5 Kernelization

In the $L_1$ norm case we cannot apply the same kernelization that is straightforwardly implemented in the $L_2$ norm case via the dual form of the optimisation problem. A simple approach to this problem is presented here. Let us consider the following embedding $\Phi : \mathcal{Z} \to \mathbb{R}^m$, where the components of $\Phi(z_i)$ are defined by
$$(\Phi(z_i))_k = \langle z_k, z_i \rangle = \bigl\langle \psi(y_k) \otimes \phi(x_k), \ \psi(y_i) \otimes \phi(x_i) \bigr\rangle = \underbrace{\langle \phi(x_k), \phi(x_i) \rangle}_{\kappa_\phi(x_k, x_i)} \ \underbrace{\langle \psi(y_k), \psi(y_i) \rangle}_{\kappa_\psi(y_k, y_i)}, \qquad (19)$$
where $\kappa_\phi$ and $\kappa_\psi$ are the kernel functions. Changing the margin constraints in (4) from
$$\text{s.t. } \langle z_i, w \rangle \ge b, \quad i = 1, \dots, m, \qquad (20)$$
into
$$\text{s.t. } \langle \Phi(z_i), w_\Phi \rangle \ge b, \quad i = 1, \dots, m, \qquad (21)$$
we receive the same structure; thus the solution of the modified problem follows the same pattern. One can recognise that all we did is nothing more than expressing the linear operator projecting the input into the output space by
$$W = \sum_{k=1}^{m} (w_\Phi)_k \, \psi(y_k) \otimes \phi(x_k). \qquad (22)$$
A similar kernelization is applied, for example in [Man00], to the binary Support Vector Machine.

6 Prediction

The prediction for an arbitrary $x$ can be given by $y = \psi^{-1}(W\phi(x))$, but to give substance and some interpretation to this formula we need to make some additional remarks. $W\phi(x)$ gives a score vector for every input vector, but this score vector lives in the linear vector space representing the output, and is not the output object itself. For the complete solution we need to find an inverse image, which is not always straightforward. The inversion can be carried out if we know the structure of the output space and how it is represented. The first step, the computation of the scores $\tilde\psi(y)$, is demonstrated in Figure 1. The prediction can be derived by taking the optimum solution of (18).

To find the optimal output $y$ for an input $x$ we need to establish a model. First, to invert the function $\psi$ the next optimisation schema can be applied:
$$\hat{y} = \arg\max_{y} \ \bigl\langle W, \ \psi(y) \otimes \phi(x) \bigr\rangle_{\mathcal{L}_\psi \otimes \mathcal{L}_\phi} \quad \text{s.t. } y \in \hat{\mathcal{Y}}, \qquad (23)$$
where $\hat{\mathcal{Y}}$ is the set of possible outputs. The next element of our inversion model is the assumption that the vectors $\psi(y)$, $y \in \hat{\mathcal{Y}}$, are indicator vectors of the possible patterns that should appear in the output, so that we are looking for a vector with a relatively small number of non-zeros pointing to the patterns, and all other components ought to be set to 0. (This is explicit in the machine translation example mentioned earlier; vectors are indexed by the entire set of possible phrases/features in a language, but elements are only non-zero for those that actually occur in any given sentence.)

Our solution approach can exploit the fact that the objective function of (23) is linear in $\psi(y)$ and can be written as $\langle d, \psi(y) \rangle$, where $d = W\phi(x)$. However, the maximum of a linear function can be finite only if the components of the vector $\psi(y)$ are bounded component-wise or (and) in a norm. To this end let us consider the following problem
$$\begin{aligned} \max\ & \langle d, u \rangle \\ \text{w.r.t.}\ & u, \\ \text{s.t.}\ & \|u\|_1 = 1, \quad 0 \le u \le C\,\mathbf{1}, \end{aligned} \qquad (24)$$
which has a simple optimum solution. To derive it, first we need to sort the components of the vector $d$ into decreasing order,
$$(d_1, \dots, d_{n_d}) \to (d_{i_1}, d_{i_2}, \dots, d_{i_{n_d}}), \qquad (25)$$
and let $K$ be the smallest integer greater than $1/C$; then an optimum is given by
$$u_{i_k} = \begin{cases} C & \text{if } k < K, \\ 1 - C(K-1) & \text{if } k = K, \\ 0 & \text{otherwise.} \end{cases} \qquad (26)$$
Now the task can be stated as finding an optimal bound $C$ which can approximate the number of nonzero items in the indicator vector which we look for. Let us consider the following general two-level problem, where the inner part follows (24):
$$\max_{t = 1, 2, \dots} \ \Bigl[\ \max_{u_t} \ \langle d, u_t \rangle \quad \text{s.t. } \|u_t\|_1 = 1, \ 0 \le u_t \le \tfrac{1}{g(t)}\,\mathbf{1} \ \Bigr], \qquad (27)$$
where $t$ runs over the positive integers and $g$ is a real valued, monotone increasing function. What is expressed here is that for $t = 1$ we choose the component of $d$ with the highest value, then the two highest ones, and so on; we might stop at the first local minimum of the outer problem. The function $g$ relates to the speed of the decay of the decreasingly ordered components of the vector $d$, and is thus problem dependent. Possible choices are $g(t) = \log(t)$ or $g(t) = \sqrt{t}$; the purely linear case $g(t) = t$ is obviously wrong, since in the case of a decreasing sequence it surely gives the optimum at $t = 1$. Further research activity is needed to estimate a good candidate.

The prediction is then derived from the optimum solution $u_t^*$ of (27), interpreting it as an indicator vector. We need to mention that the number of non-zero components in this optimum solution will be proportional to the number of non-zero components of the input, but this connection is not a direct one.
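
The following is a hedged sketch of this decoding step: it sorts the score vector $d$, builds the capped solution (26) for $C = 1/g(t)$, and scans $t$ as in (27), stopping when the objective first drops. The stopping rule and the choice $g(t) = \log(t+1)$ are illustrative assumptions rather than prescriptions from the text.

```python
# Hedged sketch of the decoding defined by (24)-(27). Assumes a score vector d
# and an increasing g; the early-stopping heuristic is an interpretation only.
import numpy as np

def capped_top_k(d, C):
    """Optimum of (24): put mass C on the largest components and the remainder
    on the K-th one, where K is the smallest integer greater than 1/C."""
    order = np.argsort(d)[::-1]            # indices of d in decreasing order, (25)
    K = int(np.floor(1.0 / C)) + 1         # smallest integer > 1/C
    u = np.zeros_like(d, dtype=float)
    u[order[:K - 1]] = C
    u[order[K - 1]] = 1.0 - C * (K - 1)    # remaining mass, see (26)
    return u

def decode(d, g=lambda t: np.log(t + 1.0), t_max=50):
    """Scan t = 1, 2, ... with cap C = 1/g(t) and keep the best u_t, stopping
    once the outer objective <d, u_t> decreases for the first time."""
    best_u, best_val = None, -np.inf
    for t in range(1, t_max + 1):
        if int(np.floor(g(t))) + 1 > d.size:   # cap would need more components
            break
        u = capped_top_k(d, 1.0 / g(t))
        val = float(d @ u)
        if val < best_val:
            break
        best_u, best_val = u, val
    return best_u

d = np.array([0.9, 0.1, 0.7, 0.05, 0.6])   # toy score vector d = W phi(x)
print(np.round(decode(d), 3))              # sparse, indicator-like output weights
```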

7 An online framework, a set based perceptron learner

The motivation of the online approach stems from the structure of the solution to the sparse case, which, in turn, can be interpreted as a method for learning the possible co-existences of the parts of the input and output vectors via an indicator vector.

[Figure 1: The prediction schema. Column (row) vectors are indicators of candidates; the figure shows the input features phi(x), the operator W and the output scores psi(y) = W phi(x).]

First we outline the problem that we are going to solve. Let us assume that the input and output objects are represented in the following way. Given two, supposed to be finite, sets $\Omega_x$ and $\Omega_y$, the collections of the possible patterns characterising the objects $x$ and $y$, every observed $x$ and $y$ is described by functions $\mu_x : \Omega_x \to \mathbb{R}_+$ and $\mu_y : \Omega_y \to \mathbb{R}_+$, where $\mathbb{R}_+$ denotes the nonnegative real numbers. They may be measures of the importance of the patterns. Assume that $\sum_{\omega_x \in \Omega_x} \mu_x(\omega_x) = 1$ and $\sum_{\omega_y \in \Omega_y} \mu_y(\omega_y) = 1$, i.e. the measures are normalised in the $L_1$ norm; thus $\mu_x$ and $\mu_y$ might be interpreted as probabilities. Let $\phi(x) = (\mu_x(\omega_x))$ and $\psi(y) = (\mu_y(\omega_y))$ be the vectors of the weights of the patterns. From these conditions we can derive $\Omega_z = \Omega_x \times \Omega_y$, and $z = \psi(y) \otimes \phi(x)$, a tensor product which, as a consequence, is normalised to 1 as well.

Based on the model stated, we can formulate the next learning task, which is to find $w$, a linear functional of the space $\{0,1\}^{\Omega_z}$ corresponding to a subset $\mathcal{W}$ of $\Omega_z$, such that $\langle w, z_i \rangle \ge \lambda$, where $\lambda > 0$ is a given constant and $i = 1, \dots, m$ are the indices of the sample items. By the definition of the inner product this means
$$\langle w, z_i \rangle = \sum_{(\omega_x, \omega_y) \in \mathcal{W}} \mu_{x_i}(\omega_x)\, \mu_{y_i}(\omega_y),$$
thus what we are looking for is a set $\mathcal{W}$ which comprises a sufficiently large number of the most important common patterns of all sample items. We can associate sets, represented by indicators, to all the sample items $z_i = \psi(y_i) \otimes \phi(x_i)$ and to the linear functional $w$ as well, by
$$z_i \to [z_i > 0] = \{ j : z_{ij} > 0 \} = \mathcal{Z}_i, \qquad w \to [w > 0] = \{ j : w_j > 0 \} = \mathcal{W}. \qquad (28)$$
Since all the nonzero components of $w$ are equal to a constant, we can restore $w$ from its set based representation via a bijective mapping between the vector and the set representations. We can write up a perceptron type algorithm, see details in [CST00], for solving problems following the schema of the $L_1$ norm based maximum margin learner. The algorithm is given by Algorithm 1.

Algorithm 1: Primal perceptron for sets
  Input of the learner: the sample S
  Output of the learner: W (equivalently w, a vector of dimension dim(H_y) dim(H_x))
  Initialisation: W_0 = empty set; t = 0
  repeat
    noupdate = true
    for i = 1, 2, ..., m do
      read input: z_i = psi(y_i) (x) phi(x_i)
      if <w_t, z_i> < lambda then
        w_t -> W_t, z_i -> Z_i        {switch to the set representation}
        W_{t+1} = W_t union Z_i       {set update}
        W_{t+1} -> w_{t+1}            {switch back to the vector representation}
        t = t + 1
        noupdate = false
      end if
    end for
  until noupdate is true
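
A minimal Python sketch of Algorithm 1, representing $w$ by the set $\mathcal{W}$ of its nonzero (constant, here 1-valued) components; the data structures, the margin value and the epoch cap are illustrative assumptions of this sketch.

```python
# Minimal sketch of Algorithm 1 (primal perceptron for sets). The weight
# vector w is stored as the set W of indices where it equals 1, so the score
# <w_t, z_i> reduces to summing z_i over W.

def set_perceptron(sample, lam, max_epochs=100):
    """sample: sparse joint features z_i as {(omega_y, omega_x): weight} dicts,
    each normalised to 1 in the L1 norm; lam: the margin threshold lambda > 0."""
    W = set()
    for _ in range(max_epochs):
        noupdate = True
        for z in sample:
            score = sum(v for idx, v in z.items() if idx in W)   # <w_t, z_i>
            if score < lam:
                W |= set(z.keys())          # set update: W_{t+1} = W_t union Z_i
                noupdate = False
        if noupdate:
            break
    return W

# toy sample: two input-output pairs sharing the pattern ("le chat", "the cat")
sample = [{("le chat", "the cat"): 0.6, ("assis", "sat"): 0.4},
          {("le chat", "the cat"): 0.5, ("tapis", "mat"): 0.5}]
print(sorted(set_perceptron(sample, lam=0.5)))
```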

A Novikoff-style bound [Nov62] on this algorithm can be stated. Based on the definition of $w$, for any two realisations $w_k$ and $w_l$ we have $\langle w_k, w_l \rangle = |\mathcal{W}_k \cap \mathcal{W}_l|$, and let $\|w\|_2^2 = \langle w, w \rangle = |\mathcal{W}|$. We can define a margin for the perceptron learner by
$$\gamma(w, S) = \min_{(y_i, x_i) \in S} \frac{\langle w, z_i \rangle}{\|w\|_2}, \qquad z_i = \psi(y_i) \otimes \phi(x_i). \qquad (29)$$
Let $\delta_i(\lambda, w)$ be such that if $\langle w, z_i \rangle \ge \lambda$ then $|\mathcal{W} \cap \mathcal{Z}_i| \ge \delta_i(\lambda, w)$. Such a $\delta_i(\lambda, w) > 0$ has to exist as a consequence of the definition of the inner product, the nonnegativity and the normalisation of the sample items. Now we consider the minimum of them, i.e.
$$\delta(\lambda, w) = \min_{i = 1, \dots, m} \delta_i(\lambda, w).$$
The non-negativity guarantees that $\delta(\lambda, w)$ is a monotone, increasing function of its two variables, so a greater $\lambda$ or a larger $\mathcal{W}$ implies a greater $\delta$.

Theorem 3. Let $S = \{(x_i, y_i)\} \subseteq \mathcal{Y} \times \mathcal{X}$, $i = 1, \dots, m$, be a sample independently and identically drawn from an unknown distribution, and let $\phi : \mathcal{X} \to \mathcal{L}_\phi$ and $\psi : \mathcal{Y} \to \mathcal{L}_\psi$ be mappings into spaces of tuples of indicators. Let $z_i = \psi(y_i) \otimes \phi(x_i)$ be the tensor product and $\mathcal{Z}_i$ the indicator set of the nonzero items in $z_i$. Assume that $0 < \lambda < 1$ and $0 < l_{\min} \le |\mathcal{Z}_i| \le l_{\max}$, $i = 1, \dots, m$, i.e. the support, the number of patterns, of every sample item falls within a given range. Furthermore, assume there is a $w^*$ for which all margin constraints hold, $\langle w^*, z_i \rangle \ge \lambda$, $i = 1, \dots, m$. Then Algorithm 1 stops with no more updates, and

1. the number of updates in Algorithm 1 is bounded by
$$t \le \frac{l_{\max}}{\Delta^2}, \qquad (30)$$

2. the margin at the solution $w_t$ has a lower bound
$$\gamma(w_t, S) \ge \frac{\lambda \Delta}{l_{\max}}, \qquad (31)$$
where $\Delta = \delta(\lambda, w^*) / \|w^*\|_2$.

Proof: We are going to follow the main thread of Novikoff's reasoning, but with some extensions. Because $w^*$ satisfies all the margin constraints, for any $t$ we have $\mathcal{W}^* \supseteq \mathcal{W}_t$, thus $\delta(\lambda, \mathcal{W}^*) \ge \delta(\lambda, \mathcal{W}_t)$. Let us use the short notation $\delta(\lambda) = \delta(\lambda, \mathcal{W}^*)$. After $t$ update steps the squared $L_2$ norm of $w_t$ can be bounded by
$$\|w_t\|_2^2 = |\mathcal{W}_t| \le l_{\max}\, t. \qquad (32)$$
Now consider the following inner product:
$$\langle w^*, w_t \rangle = |\mathcal{W}^* \cap \mathcal{W}_t| = |\mathcal{W}^* \cap (\mathcal{W}_{t-1} \cup \mathcal{Z}_{i_t})| = |(\mathcal{W}^* \cap \mathcal{W}_{t-1}) \cup (\mathcal{W}^* \cap \mathcal{Z}_{i_t})| \qquad (33)$$
$$= \Bigl| \mathcal{W}^* \cap \Bigl( \bigcup\nolimits_{s=1}^{t} \mathcal{Z}_{i_s} \Bigr) \Bigr| \ \ge \ \delta(\lambda)\, t \quad \text{(by induction on } t\text{)}, \qquad (34)$$
since $\mathcal{W}_t \subseteq \mathcal{W}^*$ holds for any $t$. Merging the inequalities above we obtain
$$\sqrt{l_{\max}\, t}\ \|w^*\|_2 \ \ge\ \|w^*\|_2\, \|w_t\|_2 \ \ge\ \langle w^*, w_t \rangle \ \ge\ \delta(\lambda)\, t. \qquad (35)$$
This gives us an upper bound on the number of updates as a function of the functional margin,
$$t \le \frac{l_{\max} \|w^*\|_2^2}{\delta(\lambda)^2} = \frac{l_{\max}}{\Delta^2}. \qquad (36)$$
After substituting this inequality into (32) we have
$$\|w_t\|_2 \le \frac{l_{\max}}{\Delta}, \qquad (37)$$
and at the end, for the functional margin, the lower bound can be obtained:
$$\gamma(w_t, S) \ge \frac{\lambda \Delta}{l_{\max}}, \qquad (38)$$
which completes the proof.

Remark 4. One can recognise behind this scenario a special variant of the weighted set covering problem. The sample items found to be errors by the perceptron algorithm give a cover of all of the patterns occurring in the sample. It is obviously not an optimum cover, the smallest in cardinality of all possible ones, but a sufficiently good one. It is an open question how to extend the range of applications of machine learning algorithms of this kind to produce approximations for hard combinatorial problems. This relationship allows us to find some connections between the $L_1$ norm based learning and the Set Covering Machine introduced in [MST02].

8 Discussion

We have shown in this paper that measuring the margin by the $L_1$ norm in a maximum margin learning problem gives a simple solution to an otherwise hardly tractable class of structural learning tasks.

References

[AHP+08] K. Astikainen, L. Holm, E. Pitkänen, S. Szedmak, and J. Rousu. Towards structured output prediction of enzyme function. BMC Proceedings, 2(Suppl 4):S2, 2008.

[CST00] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[DBST02] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225-254, 2002.

[DG85] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. John Wiley, New York, 1985.

[Man00] O. L. Mangasarian. Generalized support vector machines. In Advances in Large Margin Classifiers, pages 135-146. MIT Press, 2000.

[MST02] M. Marchand and J. Shawe-Taylor. The set covering machine. Journal of Machine Learning Research, 3, 2002.

[Nov62] A. Novikoff. On convergence proofs for perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, pages 615-622. Polytechnic Institute of Brooklyn, 1962.

[Pin89] A. M. Pinkus. On L1-Approximation. Cambridge University Press, 1989.

[RSSST06] J. Rousu, C. J. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, Special issue on Machine Learning and Large Scale Optimization, 2006.

[SSTPH05] S. Szedmak, J. Shawe-Taylor, and E. Parrado-Hernandez. Learning via linear operators: Maximum margin regression. PASCAL Research Reports, 2005.

[TGK03] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems (NIPS), 2003.

[Tib96] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

[TJHA05] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(Sep):1453-1484, 2005.

[TLJJ06] B. Taskar, S. Lacoste-Julien, and M. I. Jordan. Structured prediction, dual extragradient and Bregman projections. Journal of Machine Learning Research, Special Topic on Machine Learning and Optimization, 2006.

[Vap98] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
