Learning From Data: Modelling as an Optimisation Problem
1 Learning From Data: Modelling as an Optimisation Problem. Iman Shames. April.
2 You should be able to...
- Identify and formulate a regression problem;
- Appreciate the utility of regularisation;
- Identify and formulate a binary classification problem;
- Recognise a generic supervised learning problem as an optimisation problem;
- Formulate different instances of an unsupervised learning problem;
- Observe that many learning problems are special instances of optimisation theory.
3 Some Optimisation History
"If the facts don't fit the theory, change the facts." - A. Einstein
A family of problems arising in machine learning, viewed through the looking glass of optimisation theory, will be discussed.
- Supervised learning: fitting a model to given response data.
- Unsupervised learning: building a model for the data without a particular response in mind.
Denote by [t₁, ..., t_m] ∈ ℝⁿˣᵐ a generic matrix of data points; tᵢ ∈ ℝⁿ, i = 1, ..., m, is the i-th data point, aka example. A particular entry of a data point is known as a feature, e.g. temperature, price, blood pressure, signal strength, ...
4 Supervised Learning
In supervised learning the goal is to build a model of an unknown function t ↦ y(t). We are given a set of observations (examples), that is, a number of input-output pairs (tᵢ, yᵢ), i = 1, ..., m; y = [y₁, ..., y_m] is called the response vector. These examples are used to learn a model of the function t ↦ ŷ(t; x), where x is a vector of model parameters. The goal is to use the data and the information in y to form a model. In turn the model can be used to predict a value ŷ for a (yet unseen) new test point t ∈ ℝⁿ.
5 Supervised Learning
In the most general form: ŷ(t; x) = xᵀφ(t), where φ(t) is a given nonlinear function. If φ(t) = (1, t) we retrieve the affine relationship.
Example: Demand Prediction for Inventory Management. A store needs to predict demand for a specific item from customers in an area. It assumes that the logarithm of the demand is an arbitrary real number and depends linearly on a number of features: time of year, type of item, number of items sold the day before, etc. The problem is to predict the demand the next day based on the observations of feature-demand pairs in the past.
6 Supervised Learning
If the response is binary, e.g. y(t) ∈ {−1, 1} for every t, it is then referred to as a label. In this case a sign-(non)linear model can be used: ŷ(t; x) = sign(xᵀφ(t)). Letting φ(t) = [1, t] yields the sign-linear model.
Example: Binary Classification of Credit Card Applicants. A credit card company receives thousands of applications. Each application contains information about an applicant: age, marital status, annual salary, outstanding debts, etc. How can we set up a system that classifies the applicants into two categories: approved and not approved?
7 Supervised Learning
The learning (or training) problem is to find the best model coefficient vector x such that ŷ(tᵢ; x) ≈ yᵢ, i = 1, ..., m. A certain measure of mismatch between yᵢ and ŷ(tᵢ; x) needs to be minimised. An important point is to assign a measure of reliability to the predicted values. This is the subject of statistical learning; see the following:
- The one that tells you what happens: Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Berlin: Springer Series in Statistics.
- The one that asks you to believe: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 6). New York: Springer.
8 Least Squares Via Polynomials
Learning has its roots in this theorem:
Theorem (Weierstrass Approximation Theorem): Suppose f is a continuous real-valued function defined on the real interval [a, b]. For every ε > 0, there exists a polynomial p such that for all x in [a, b], we have |f(x) − p(x)| < ε.
Let's assume a basic linear model for the data: y(t) = x₁ + t x₂ + δ = xᵀφ(t) + δ, where φ(t) = [1, t] and δ is the error term.
9 Least Squares Via Polynomials
To find the coefficient (weight) vector x, we use a least-squares approach to minimise the training error:
min_x (1/m) Σ_{i=1}^m (yᵢ − φᵢᵀx)²
where φᵢ = φ(tᵢ), i = 1, ..., m. For y = [y₁, ..., y_m] and Φ a matrix with columns φᵢ, this is
min_x ‖Φᵀx − y‖₂²
This is the learning (training) problem. Once it is solved, one can use the coefficients to predict values of y for unseen t.
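To make the training step concrete, here is a minimal sketch (not from the slides; the synthetic data, the noise level, and the feature map φ(t) = [1, t] are all assumptions) of solving min_x ‖Φᵀx − y‖₂² with numpy:

```python
# A minimal sketch of the least-squares training problem. Data are made up.
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=50)                # scalar inputs t_i
y = 2.0 - 3.0 * t + 0.1 * rng.standard_normal(50)  # noisy affine responses

Phi = np.vstack([np.ones_like(t), t])              # columns are phi_i = [1, t_i]
x, *_ = np.linalg.lstsq(Phi.T, y, rcond=None)      # solves min ||Phi^T x - y||^2
print(x)                                           # should be close to [2, -3]
```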
11 Least Squares Via Polynomials
If the linear models don't explain the data well, one can obtain a higher-order model:
y(t) = x₁ + t x₂ + t²x₃ + ... + tᵏx_{k+1} + δ
The fitting problem is still a least-squares problem:
min_x ‖Φᵀx − y‖₂²
with each φᵢ = [1, tᵢ, tᵢ², ..., tᵢᵏ], i = 1, ..., m. The choice of k is magical and requires cross-validation: leave a bunch of data points out and evaluate the performance of the predictor. Typically as k increases the accuracy on the training data gets better, but the cross-validation error initially decreases (or remains constant) and then increases (over-fitting); a sketch of this procedure follows.
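The following hedged sketch illustrates the cross-validation idea on synthetic data (the data-generating function, the even/odd split, and the candidate degrees are all assumptions): the held-out error typically stops improving, then worsens, as k grows.

```python
# Choosing the degree k by holdout validation on made-up data.
import numpy as np

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(-1, 1, 40))
y = np.sin(2.5 * t) + 0.1 * rng.standard_normal(t.size)
train, val = np.arange(0, 40, 2), np.arange(1, 40, 2)  # simple split

for k in [1, 3, 5, 9, 15]:
    Phi = np.vander(t, k + 1, increasing=True)         # rows [1, t_i, ..., t_i^k]
    x, *_ = np.linalg.lstsq(Phi[train], y[train], rcond=None)
    err = np.mean((Phi[val] @ x - y[val]) ** 2)
    print(f"k={k:2d}  validation MSE={err:.4f}")       # decreases, then rises
```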
12 Least Squares Via Polynomials
Example: Polynomial Model for Wage Versus Age Data. Income and demographic information for males in the central Atlantic region of the United States. The model is assumed to be a degree-4 polynomial:
ŷ(t; x) = x₁ + x₂t + x₃t² + x₄t³ + x₅t⁴
and the fit minimises f(x) = Σ_{i=1}^{62} (φᵢᵀx − y(tᵢ))².
[Figure: wage versus age, with the fitted degree-4 polynomial and the training error f(x).]
Note that m = 62 and tᵢ is known perfectly.
13 Regularisation
Assume that the degree is chosen to be some k. Not all polynomials of degree k are equal; some might vary wildly with large derivatives. Such large variations might result in unreliability. The size of the coefficients is the other factor (other than the degree) that determines the behaviour of the model. If there is some bound on the input, then:
Theorem: For a polynomial p_x(t) = x₁ + x₂t + ... + x_{k+1}tᵏ, we have
∀t ∈ [−1, 1]: |dp_x(t)/dt| ≤ k^{3/2} ‖x‖₂, and
∀t ∈ [−1, 1]: |dp_x(t)/dt| ≤ k ‖x‖₁.
14 Regularisation
Thus, one might want to have some control over the size of the derivatives through minimising ‖x‖₂ or ‖x‖₁. Two different objective functions need to be minimised... multi-objective optimisation. Let's have a quick excursion.
16 Multi-objective or Vector Optimisation
It seems that there is a dominant perception that multi-objective optimisation is inherently different from other types of optimisation:
min_x F(x) = (f₁(x), ..., f_m(x)), s.t. x ∈ X
Well, that perception is wrong! Multi-objective problems are not particularly different from any other type of optimisation problem. It is all about defining what solution you are after. A trivial (and impossible in practice) solution is the case where a single point x ∈ X minimises all fᵢ(x), i = 1, ..., m. Two fundamental approaches:
- Scalarisation: min_x U(f₁(x), ..., f_m(x)), s.t. x ∈ X
- Pareto approaches: x* is (Pareto) optimal if there is no x ∈ X with fᵢ(x) ≤ fᵢ(x*) for all i and fⱼ(x) < fⱼ(x*) for some j.
17 Multi-objective Optimisation
In scalarisation, U(f₁(x), ..., f_m(x)) = Σ_{i=1}^m λᵢ fᵢ(x). The choices of λᵢ matter because of relative magnitudes, physical dimensions, etc. Scalarisation has a close relative: for each choice of λᵢ, i = 1, ..., m, there exist some γᵢ, i = 1, ..., m, and an index j such that the solution of the scalarised problem is the same as that of
min_x fⱼ(x), s.t. fᵢ(x) ≤ γᵢ, i ∈ {1, ..., m} \ {j}, x ∈ X
This latter formulation is easier to interpret from an application point of view. In the end we solve a bunch of optimisation problems, as sketched below.
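As an illustration, here is a small sketch (the two scalar quadratic objectives are made up for the example, not from the slides) of tracing the trade-off by sweeping the scalarisation weight λ; for these objectives the minimiser has a closed form, so no solver is needed.

```python
# Sweeping lambda in min_x f1(x) + lambda * f2(x) traces the trade-off curve.
import numpy as np

# f1(x) = (x - 2)^2, f2(x) = (x + 1)^2  ->  minimiser x* = (2 - lam) / (1 + lam)
for lam in [0.0, 0.25, 1.0, 4.0, 100.0]:
    x = (2.0 - lam) / (1.0 + lam)
    f1, f2 = (x - 2.0) ** 2, (x + 1.0) ** 2
    print(f"lambda={lam:6.2f}  x*={x:+.3f}  f1={f1:.3f}  f2={f2:.3f}")
```

As λ grows, the solution moves from the minimiser of f₁ toward the minimiser of f₂; each λ yields one point on the trade-off curve.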
18 Regularisation
Thus, one might want to have some control over the size of the derivatives through minimising ‖x‖₂ or ‖x‖₁. The ℓ₂-norm case:
min_x ‖Φᵀx − y‖₂² + λ‖x‖₂²
This is known as Tikhonov or ridge regularisation and is easy to solve: the objective is smooth.
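A minimal sketch of the ridge problem in the slides' notation, using the normal equations (ΦΦᵀ + λI)x = Φy; the synthetic data and the choice λ = 0.1 are assumptions:

```python
# Ridge regression via its normal equations; data are made up.
import numpy as np

rng = np.random.default_rng(2)
t = rng.uniform(-1, 1, 30)
y = 1.0 + t - 0.5 * t**3 + 0.05 * rng.standard_normal(30)
Phi = np.vander(t, 8, increasing=True).T       # columns phi_i = [1, t_i, ..., t_i^7]

lam = 0.1
x = np.linalg.solve(Phi @ Phi.T + lam * np.eye(Phi.shape[0]), Phi @ y)
print(np.round(x, 3))                          # coefficients shrink as lam grows
```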
19 Regularisation and Sparsity
The one with the ℓ₁-norm:
min_x ‖Φᵀx − y‖₂² + λ‖x‖₁
This one is the LASSO; solving it requires a bit of attention due to nonsmoothness. The LASSO with a sufficiently large λ results in a sparse solution.
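One standard way to deal with the nonsmooth ℓ₁ term is proximal gradient descent (ISTA), sketched below; the slides do not prescribe an algorithm, and the step size, iteration count, and synthetic data are all assumptions.

```python
# LASSO via proximal gradient descent (ISTA); a sketch on made-up data.
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(Phi, y, lam, iters=500):
    A = Phi.T                                      # least-squares matrix
    step = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # 1/L for the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = 2 * A.T @ (A @ x - y)               # gradient of ||Ax - y||^2
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(3)
Phi = rng.standard_normal((10, 100))               # 10 features, 100 points
x_true = np.zeros(10); x_true[[1, 6]] = [3.0, -2.0]  # sparse ground truth
y = Phi.T @ x_true + 0.01 * rng.standard_normal(100)
print(np.round(lasso_ista(Phi, y, lam=1.0), 2))    # nonzeros near indices 1, 6
```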
20 Binary Classification
When predicting a label in {−1, 1}, we can use the following prediction rule: sign(tᵀx + b). Let's find x and b such that the average number of errors on a training set is minimised. The classification rule sign(tᵢᵀx + b) is wrong for some tᵢ if yᵢ(tᵢᵀx + b) < 0 (i.e. the two do not have the same sign). Define this average error using the following function:
E(w) = 1 if w < 0, and E(w) = 0 otherwise.
21 Binary Classification
min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − yᵢ(tᵢᵀx + b))
It has a smooth reformulation:
min_{e,x,b} (1/m) Σ_{i=1}^m eᵢ
s.t. e ≥ 0, eᵢ ≥ 1 − yᵢ(tᵢᵀx + b), i = 1, ..., m
This formulation is the building block of Support Vector Machines (SVM). There might be a need to control the complexity/sensitivity of the model: regularisation. A sketch of the reformulation follows.
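A hedged sketch of this reformulation as a linear program, solved with scipy.optimize.linprog; the stacking order z = [x, b, e] and the tiny two-dimensional data set are assumptions made for illustration.

```python
# The hinge-loss reformulation as a linear program; toy data are made up.
import numpy as np
from scipy.optimize import linprog

T = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])  # rows t_i
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = T.shape

c = np.concatenate([np.zeros(n + 1), np.ones(m) / m])   # minimise mean(e)
# e_i >= 1 - y_i (t_i^T x + b)  <=>  -y_i t_i^T x - y_i b - e_i <= -1
A_ub = np.hstack([-y[:, None] * T, -y[:, None], -np.eye(m)])
b_ub = -np.ones(m)
bounds = [(None, None)] * (n + 1) + [(0, None)] * m     # x, b free; e >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
x, b, e = res.x[:n], res.x[n], res.x[n + 1:]
print("x =", np.round(x, 3), " b =", round(float(b), 3), " mean slack =", e.mean())
```

Since this toy data set is separable, the optimal mean slack is zero.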
22 Binary Classification
An upper bound on the magnitude of the (sub)gradient of the decision function t ↦ tᵀx + b is ‖x‖. If all data points are in a (Euclidean) sphere of radius R:
max_{t,t' : ‖t‖₂ ≤ R, ‖t'‖₂ ≤ R} xᵀ(t − t') = 2R‖x‖₂
The learning problem becomes:
min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − yᵢ(tᵢᵀx + b)) + λ‖x‖₂²
where λ ≥ 0 is a regularisation parameter to be chosen via cross-validation.
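As an illustration (the slides do not specify a solver), here is a sketch of minimising this regularised hinge objective by subgradient descent; the synthetic Gaussian data, the diminishing step-size rule 1/(λk), and the iteration count are assumptions.

```python
# Subgradient descent on (1/m) sum max(0, 1 - y_i (t_i^T x + b)) + lam ||x||^2.
import numpy as np

rng = np.random.default_rng(4)
T = np.vstack([rng.normal(+1.5, 1.0, (50, 2)), rng.normal(-1.5, 1.0, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
m, lam = T.shape[0], 0.1

x, b = np.zeros(2), 0.0
for k in range(1, 2001):
    margin = y * (T @ x + b)
    active = margin < 1                            # points with nonzero hinge loss
    gx = -(y[active, None] * T[active]).sum(0) / m + 2 * lam * x  # subgrad wrt x
    gb = -y[active].sum() / m                      # subgrad wrt b
    step = 1.0 / (lam * k)                         # diminishing step size
    x, b = x - step * gx, b - step * gb

print("training errors:", int((np.sign(T @ x + b) != y).sum()), " x =", np.round(x, 3))
```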
23 Binary Classification
An upper bound on the magnitude of the (sub)gradient is ‖x‖. If all data points are in a box of size R:
max_{t,t' : ‖t‖_∞ ≤ R, ‖t'‖_∞ ≤ R} xᵀ(t − t') = 2R‖x‖₁
The learning problem becomes:
min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − yᵢ(tᵢᵀx + b)) + λ‖x‖₁
where λ ≥ 0 is a regularisation parameter to be chosen via cross-validation. The ℓ₁-norm above encourages sparsity: only a few elements of t will be involved in classification. This allows identifying key features.
24 Geometric Interpretations of SVM
min_{x,b} (1/m) Σ_{i=1}^m max(0, 1 − yᵢ(tᵢᵀx + b))
Consider the case where the error in training is zero. It is equivalent to the existence of x and b such that
yᵢ(xᵀtᵢ + b) ≥ 0, i = 1, ..., m
The data points are then linearly separable by the hyperplane {t : xᵀt + b = 0}.
25 Geometric Interpretations of SVM
Consider the case where the data is strictly separable:
yᵢ(xᵀtᵢ + b) ≥ β > 0, i = 1, ..., m
Normalising x and b yields
yᵢ(xᵀtᵢ + b) ≥ 1, i = 1, ..., m
Two important hyperplanes, {t : xᵀt + b = ±1}, form a separating slab.
26 Geometric Interpretations of (Maximum Margin) SVM
The choice of separating hyperplane (and consequently the slab) is not unique. One can maximise the width of the slab, i.e. the distance between the hyperplanes, known as the separation margin. The distance between the two hyperplanes is given by 2/‖x‖₂. Thus, to maximise the margin:
min_{x,b} ‖x‖₂, s.t. yᵢ(xᵀtᵢ + b) ≥ 1, i = 1, ..., m
The points that lie on the hyperplanes are called support vectors. A sketch of this problem in code follows.
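A sketch of the maximum-margin problem using the cvxpy modelling package (an assumed dependency, not mentioned in the slides), on the same made-up separable data set as before:

```python
# Maximum-margin SVM: min ||x||_2 s.t. y_i (x^T t_i + b) >= 1.
import numpy as np
import cvxpy as cp

T = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])  # rows t_i
y = np.array([1.0, 1.0, -1.0, -1.0])

x = cp.Variable(2)
b = cp.Variable()
margin_constraints = [cp.multiply(y, T @ x + b) >= 1]   # y_i (x^T t_i + b) >= 1
prob = cp.Problem(cp.Minimize(cp.norm(x, 2)), margin_constraints)
prob.solve()

print("x =", np.round(x.value, 3), " b =", round(float(b.value), 3))
print("margin width:", 2 / np.linalg.norm(x.value))
```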
27 SVM with Non-separable Data
We introduce slack variables to capture constraint violations:
e ≥ 0, yᵢ(xᵀtᵢ + b) ≥ 1 − eᵢ, i = 1, ..., m
Ideally, we would like to minimise the number of nonzero entries of e, i.e. ‖e‖₀. However, this is a hard (nonconvex) problem. Instead, we can minimise ‖e‖₁. This is the same as the basic SVM problem we saw before!
28 A Generic Supervised Learning Problem
min_x L(Φᵀx, y) + λ p(x)
The loss function is usually assumed to be decomposable as a sum:
L(z, y) = Σ_{i=1}^m l(zᵢ, yᵢ)
- Euclidean squared: L(z, y) = ‖z − y‖₂²
- ℓ₁: L(z, y) = ‖z − y‖₁
- ℓ_∞: L(z, y) = ‖z − y‖_∞
- Hinge: l(z, y) = max(0, 1 − yz)
- Logistic: l(z, y) = log(1 + e^{−zy})
The choice of the loss function depends on the task, the data, and practical implementation considerations.
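Written out directly, the listed losses are one-liners; in this sketch z stands for the prediction vector Φᵀx, and the test vectors are arbitrary.

```python
# The decomposable losses from the slide, written out in numpy.
import numpy as np

def loss_sq(z, y):   return np.sum((z - y) ** 2)            # Euclidean squared
def loss_l1(z, y):   return np.sum(np.abs(z - y))           # l1
def loss_linf(z, y): return np.max(np.abs(z - y))           # l-infinity
def loss_hinge(z, y):    return np.sum(np.maximum(0.0, 1.0 - y * z))
def loss_logistic(z, y): return np.sum(np.log1p(np.exp(-y * z)))

z = np.array([0.8, -0.3, 2.0]); y = np.array([1.0, -1.0, -1.0])
for f in (loss_sq, loss_l1, loss_linf, loss_hinge, loss_logistic):
    print(f.__name__, round(float(f(z, y)), 3))
```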
29 A Generic Supervised Learning Problem
The penalty function can be the ℓ₁-norm, the ℓ₂-norm, or an approximation of an indicator function to capture constraints:
min_x L(Φᵀx, y), s.t. x ∈ X
is the same as
min_x L(Φᵀx, y) + λ p(x)
where p(x) = 0 if x ∈ X and p(x) = ∞ if x ∉ X. Sometimes, we add an ℓ₂-norm penalty to ensure uniqueness of the solution: strong convexity.
30 Unsupervised Learning
In unsupervised learning, the data points tᵢ ∈ ℝⁿ, i = 1, ..., m, do not come with assigned labels or responses. The task is to learn the structure of the data.
Principal component analysis (PCA): the idea is to discover the most important directions in a data set, those along which the data vary the most. In the example figure there are big variations along 45° and nearly none along 135°; PCA is a way to generalise this intuition.
31 PCA
Let tᵢ ∈ ℝⁿ, i = 1, ..., m, be the data points with average t̄ = (1/m) Σ_{i=1}^m tᵢ, and let Θ = [t̃₁ ... t̃_m] be the centred data matrix, where t̃ᵢ = tᵢ − t̄. The goal is to find normalised directions z such that the variance of the projections of the centred data points along z is maximised. The component of the centred data along z is αᵢ = t̃ᵢᵀz, and the mean-square variation of the data along z is
(1/m) Σ_{i=1}^m αᵢ² = (1/m) Σ_{i=1}^m zᵀt̃ᵢt̃ᵢᵀz = (1/m) zᵀΘΘᵀz
The direction z along which the data has the largest variation is obtained from:
max_z zᵀΘΘᵀz, s.t. ‖z‖₂ = 1
z is the normalised eigenvector of ΘΘᵀ corresponding to its largest eigenvalue.
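A minimal sketch of this recipe on synthetic two-dimensional data whose variance is concentrated along the 45° direction (the data generation is an assumption):

```python
# PCA as in the slide: centre the data, form Theta Theta^T, take the top
# eigenvector. Synthetic data vary mostly along the 45-degree direction.
import numpy as np

rng = np.random.default_rng(5)
s = rng.standard_normal(200)
Tpts = np.vstack([s + 0.1 * rng.standard_normal(200),
                  s + 0.1 * rng.standard_normal(200)])   # columns t_i in R^2
Theta = Tpts - Tpts.mean(axis=1, keepdims=True)          # centred data matrix

w, V = np.linalg.eigh(Theta @ Theta.T)      # eigh returns ascending eigenvalues
z = V[:, -1]                                # top eigenvector = first PC
print("principal direction:", np.round(z, 3))   # close to +/-[0.707, 0.707]
```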
32 Sparse PCA
In sparse PCA a constraint is added to limit the number of nonzero elements in the decision variable:
max_z zᵀΘΘᵀz, s.t. ‖z‖₂ = 1, ‖z‖₀ ≤ k
For small problems one can enumerate the solutions and consider all possible combinations for the entries of z.
Let X be a matrix to be approximated by a rank-1 matrix:
min_{p ∈ ℝⁿ, q ∈ ℝᵐ} ‖X − pqᵀ‖_F
If the objective can be made small, X ≈ pqᵀ, and we may interpret p as a typical data point and q as a typical feature. However, if X corresponds to positive data points, we cannot draw the same conclusion if p and q are not positive.
33 Non-negative Matrix Factorisation (NNMF)
Thus, sign constraints need to be enforced:
min_{p ∈ ℝⁿ, q ∈ ℝᵐ} ‖X − pqᵀ‖_F, s.t. p ≥ 0, q ≥ 0
The interpretation is that each column of X is proportional to a single vector q with weights in p. Hence, each data point follows the same profile q. More generally, NNMF:
min_{P ∈ ℝⁿˣᵏ, Q ∈ ℝᵐˣᵏ} ‖X − PQᵀ‖_F, s.t. P ≥ 0, Q ≥ 0
Here the data points follow a linear combination of k profiles given by the columns of Q.
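A sketch of NNMF using the classic multiplicative updates of Lee and Seung, which the slides do not mention; the dimensions, the rank k, the iteration count, and the random nonnegative data are all assumptions.

```python
# NNMF via multiplicative updates for min ||X - P Q^T||_F with P, Q >= 0.
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((20, 30))                      # nonnegative data, n=20, m=30
k = 4
P, Q = rng.random((20, k)), rng.random((30, k))

eps = 1e-9                                    # avoids division by zero
for _ in range(500):
    P *= (X @ Q) / (P @ (Q.T @ Q) + eps)      # update keeps P >= 0
    Q *= (X.T @ P) / (Q @ (P.T @ P) + eps)    # update keeps Q >= 0

print("relative error:", np.linalg.norm(X - P @ Q.T) / np.linalg.norm(X))
```

Each update is guaranteed not to increase the objective, which is why this simple scheme is a common baseline for NNMF.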