Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017


1 Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning, Fall 2017 1

2 Support vector machines: Training by maximizing margin; The SVM objective; Solving the SVM optimization problem; Support vectors, duals and kernels. 2

3 The SVM objective function. Regularization term: maximizes the margin; imposes a preference over the hypothesis space and pushes for better generalization; can be replaced with other regularization terms which impose other preferences. Empirical loss: the hinge loss; penalizes weight vectors that make mistakes; can be replaced with other loss functions which impose other preferences. A hyper-parameter C controls the tradeoff between a large margin and a small hinge loss. 3
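The objective itself appears only as an image in the original slides. For reference, a standard unconstrained form of the regularized hinge-loss objective that matches the terms described above (the 1/2 scaling of the regularizer is an assumption) is

    J(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \max(0,\; 1 - y_i w^T x_i)

where the first term is the regularizer that maximizes the margin, the sum is the empirical hinge loss over the N training examples, and C is the tradeoff hyper-parameter.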


5 Outline: Training SVM by optimization 1. Review of convex functions and gradient descent 2. Stochastic gradient descent 3. Gradient descent vs stochastic gradient descent 4. Sub-derivatives of the hinge loss 5. Stochastic sub-gradient descent for SVM 6. Comparison to perceptron 5


7 Solving the SVM optimization problem This function is convex in w 7

8 Recall: Convex functions. A function f is convex if for every u, v in the domain, and for every λ ∈ [0,1], we have f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v). From a geometric perspective: every tangent plane lies below the function. 8


10 Convex functions. Examples: linear functions are convex; the max of convex functions is convex. Some ways to show that a function is convex: 1. Using the definition of convexity. 2. Showing that the second derivative is non-negative (for one-dimensional functions). 3. Showing that the Hessian, the matrix of second derivatives, is positive semi-definite (for vector functions). 10
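A brief worked illustration of these criteria (not part of the original slides): for the one-dimensional function f(x) = x^2, the second derivative f''(x) = 2 ≥ 0, so f is convex. For the regularizer f(w) = \frac{1}{2} w^T w, the Hessian is the identity matrix, which is positive semi-definite, so f is convex. The hinge loss \max(0, 1 - y w^T x) is the maximum of two linear (hence convex) functions of w and is therefore convex, and a sum of convex functions is convex; this is why the SVM objective is convex in w.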

11 Not all functions are convex: some are concave, and some are neither convex nor concave. For these, the inequality f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v) does not hold for all u, v, λ. 11

12 Convex functions are convenient. A function f is convex if for every u, v in the domain, and for every λ ∈ [0,1], we have f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v). In general, a necessary condition for x to be a minimum of the function f is ∇f(x) = 0. For convex functions, this is both necessary and sufficient. 12

13 Solving the SVM optimization problem. This function is convex in w. This is a quadratic optimization problem because the objective is quadratic. Older methods used techniques from Quadratic Programming: very slow. Since there are no constraints, we can use gradient descent, but that is still very slow! 13

14 Gradient descent. We are trying to minimize J(w). General strategy for minimizing a function J(w): start with an initial guess for w, say w_0; iterate till convergence: compute the gradient of J at w_t, and update w_t to get w_{t+1} by taking a step in the opposite direction of the gradient. Intuition: the gradient is the direction of steepest increase in the function; to get to the minimum, go in the opposite direction. 14


18 Gradient descent for SVM. We are trying to minimize J(w). 1. Initialize w_0. 2. For t = 0, 1, 2, ...: (a) Compute the gradient of J(w) at w_t; call it ∇J(w_t). (b) Update: w_{t+1} ← w_t − γ_t ∇J(w_t). γ_t: called the learning rate. 18
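A minimal Python sketch of this loop (not from the original slides): it assumes the C-parameterized hinge-loss objective sketched earlier and uses a standard subgradient for the non-differentiable hinge term, anticipating the sub-gradient discussion later in the lecture; the function names and default values are illustrative.

import numpy as np

def svm_objective(w, X, y, C):
    # J(w) = 1/2 w^T w + C * sum_i max(0, 1 - y_i w^T x_i)
    margins = 1.0 - y * (X @ w)
    return 0.5 * (w @ w) + C * np.sum(np.maximum(0.0, margins))

def svm_subgradient(w, X, y, C):
    # One valid (sub)gradient of J at w: the regularizer contributes w, and every
    # example with y_i w^T x_i < 1 contributes -C * y_i * x_i.
    violated = y * (X @ w) < 1.0
    return w - C * (y[violated] @ X[violated])

def gradient_descent(X, y, C=1.0, gamma=0.01, num_steps=1000):
    w = np.zeros(X.shape[1])             # initial guess w_0
    for t in range(num_steps):           # iterate (until convergence in practice)
        g = svm_subgradient(w, X, y, C)  # (sub)gradient of J at w_t
        w = w - gamma * g                # step in the opposite direction
    return w

Each call to svm_subgradient touches every training example, which is exactly the cost the next slides point to as the reason for moving to stochastic gradient descent; svm_objective can be used to monitor J(w) across iterations.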

19 Outline: Training SVM by optimization ✓ Review of convex functions and gradient descent 2. Stochastic gradient descent 3. Gradient descent vs stochastic gradient descent 4. Sub-derivatives of the hinge loss 5. Stochastic sub-gradient descent for SVM 6. Comparison to perceptron 19

20 Gradient descent for SVM. We are trying to minimize J(w). 1. Initialize w_0. 2. For t = 0, 1, 2, ...: (a) Compute the gradient of J(w) at w_t; call it ∇J(w_t). (b) Update: w_{t+1} ← w_t − γ_t ∇J(w_t). The gradient of the SVM objective requires summing over the entire training set: slow, does not really scale. (γ_t: the learning rate.) 20

21 Stochastic gradient descent for SVM. Given a training set S = {(x_i, y_i)}, x ∈ ℝ^n, y ∈ {−1, 1}: 1. Initialize w_0 = 0 ∈ ℝ^n. 2. For epoch = 1 ... T: (a) Pick a random example (x_i, y_i) from the training set S. (b) Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w_{t−1} to be ∇J_t(w_{t−1}). (c) Update: w_t ← w_{t−1} − γ_t ∇J_t(w_{t−1}). 3. Return final w. 21


26 Stochastic gradient descent for SVM. Given a training set S = {(x_i, y_i)}, x ∈ ℝ^n, y ∈ {−1, 1}: 1. Initialize w_0 = 0 ∈ ℝ^n. 2. For epoch = 1 ... T: (a) Pick a random example (x_i, y_i) from the training set S. (b) Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w_{t−1} to be ∇J_t(w_{t−1}). (c) Update: w_t ← w_{t−1} − γ_t ∇J_t(w_{t−1}). 3. Return final w. What is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!) This algorithm is guaranteed to converge to the minimum of J if γ_t is small enough. Why? Because the objective J(w) is a convex function. 26
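To make step (b) concrete (a sketch consistent with the objective given earlier, not reproduced from the slides): treating (x_i, y_i) as a full dataset means the per-example objective whose derivative is taken at step t is

    J_t(w) = \frac{1}{2} w^T w + C \max(0,\; 1 - y_i w^T x_i)

i.e., the regularizer plus the hinge loss on that single example. How to differentiate the max term is exactly the question taken up in the sub-gradient slides below.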

27 Outline: Training SVM by optimization ✓ Review of convex functions and gradient descent ✓ Stochastic gradient descent 3. Gradient descent vs stochastic gradient descent 4. Sub-derivatives of the hinge loss 5. Stochastic sub-gradient descent for SVM 6. Comparison to perceptron 27

28 Gradient Descent vs SGD Gradient descent 28


46 Gradient Descent vs SGD. Stochastic gradient descent makes many more updates than gradient descent, but each individual update is less computationally expensive. 46

47 Outline: Training SVM by optimization ✓ Review of convex functions and gradient descent ✓ Stochastic gradient descent ✓ Gradient descent vs stochastic gradient descent 4. Sub-derivatives of the hinge loss 5. Stochastic sub-gradient descent for SVM 6. Comparison to perceptron 47

48 Stochastic gradient descent for SVM. Given a training set S = {(x_i, y_i)}, x ∈ ℝ^n, y ∈ {−1, 1}: 1. Initialize w_0 = 0 ∈ ℝ^n. 2. For epoch = 1 ... T: (a) Pick a random example (x_i, y_i) from the training set S. (b) Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w_{t−1} to be ∇J_t(w_{t−1}). (c) Update: w_t ← w_{t−1} − γ_t ∇J_t(w_{t−1}). 3. Return final w. What is the derivative of the hinge loss with respect to w? (The hinge loss is not a differentiable function!) 48

49 Hinge loss is not differentiable! What is the derivative of the hinge loss with respect to w? 49
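For reference (the loss itself is shown only as a figure on this slide), the hinge loss on a single example (x, y) is

    \ell(w; x, y) = \max(0,\; 1 - y\, w^T x)

It is the maximum of two differentiable pieces, the constant 0 and the linear function 1 − y w^T x, and so it has a kink, and no gradient, wherever y w^T x = 1.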

50 Detour: Sub-gradients. A generalization of gradients to non-differentiable functions. (Remember that every tangent lies below the function for convex functions.) Informally, a sub-tangent at a point is any line that lies below the function at that point. A sub-gradient is the slope of that line. 50

51 Sub-gradients. Formally, g is a subgradient of f at x if f(y) ≥ f(x) + g^T (y − x) for all y. Where f is differentiable, at x_1, the tangent at that point is the only such line, and g_1 is the gradient at x_1. g_2 and g_3 are both subgradients at x_2, where f is not differentiable. [Example from Boyd] 51


54 Sub-gradient of the SVM objective General strategy: First solve the max and compute the gradient for each case 54

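A sketch of the case analysis that the slides carry out in figures, assuming the per-example objective J_t(w) = \frac{1}{2} w^T w + C \max(0, 1 - y_i w^T x_i) from earlier:

    If y_i w^T x_i > 1: the max is 0, so ∇J_t(w) = w.
    If y_i w^T x_i < 1: the max is 1 − y_i w^T x_i, so ∇J_t(w) = w − C y_i x_i.
    If y_i w^T x_i = 1: any convex combination of the two is a valid subgradient; the algorithm below uses w − C y_i x_i.

This is exactly the two-case update used in the stochastic sub-gradient descent slides that follow.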

56 Outline: Training SVM by optimization ✓ Review of convex functions and gradient descent ✓ Stochastic gradient descent ✓ Gradient descent vs stochastic gradient descent ✓ Sub-derivatives of the hinge loss 5. Stochastic sub-gradient descent for SVM 6. Comparison to perceptron 56

57 Stochastic sub-gradient descent for SVM. Given a training set S = {(x_i, y_i)}, x ∈ ℝ^n, y ∈ {−1, 1}: 1. Initialize w = 0 ∈ ℝ^n. 2. For epoch = 1 ... T: For each training example (x_i, y_i) ∈ S: if y_i w^T x_i ≤ 1, then w ← (1 − γ_t) w + γ_t C y_i x_i; else w ← (1 − γ_t) w. 3. Return w. 57



62 Stochastic sub-gradient descent for SVM. Given a training set S = {(x_i, y_i)}, x ∈ ℝ^n, y ∈ {−1, 1}: 1. Initialize w = 0 ∈ ℝ^n. 2. For epoch = 1 ... T: (a) Shuffle the training set. (b) For each training example (x_i, y_i) ∈ S: if y_i w^T x_i ≤ 1, then w ← (1 − γ_t) w + γ_t C y_i x_i; else w ← (1 − γ_t) w. 3. Return w. (γ_t: the learning rate; many tweaks possible. It is important to shuffle the examples at the start of each epoch.) 62
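A minimal, runnable Python sketch of this procedure (not from the original slides; the learning-rate schedule gamma_t = gamma_0 / (1 + t), the hyper-parameter defaults, and the function name are assumptions):

import numpy as np

def svm_sgd(X, y, C=1.0, gamma0=0.1, epochs=10, seed=0):
    # Stochastic sub-gradient descent for J(w) = 1/2 w^T w + C sum_i max(0, 1 - y_i w^T x_i)
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)                   # 1. Initialize w = 0
    t = 0
    for epoch in range(epochs):                # 2. For epoch = 1 ... T
        for i in rng.permutation(n_examples):  #    (a) shuffle the training set
            gamma_t = gamma0 / (1 + t)         #    assumed learning-rate schedule
            if y[i] * (w @ X[i]) <= 1:         #    (b) margin violated or on the margin
                w = (1 - gamma_t) * w + gamma_t * C * y[i] * X[i]
            else:
                w = (1 - gamma_t) * w
            t += 1
    return w                                   # 3. Return w

For example, on a toy dataset X of shape (n_examples, n_features) with labels y in {-1, 1}, w = svm_sgd(X, y) learns a weight vector, and np.sign(X @ w) gives the predictions.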

63 Convergence and learning rates. With enough iterations, it will converge in expectation, provided the step sizes are square summable but not summable: the step sizes γ_t are positive, the sum of the squares of the step sizes over t = 1 to ∞ is finite, and the sum of the step sizes over t = 1 to ∞ is infinite. A common example: γ_t = γ_0 / (1 + t). 63
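Stated formally (a standard form of these conditions, not reproduced verbatim from the slides):

    \gamma_t > 0, \qquad \sum_{t=1}^{\infty} \gamma_t^2 < \infty, \qquad \sum_{t=1}^{\infty} \gamma_t = \infty

The schedule \gamma_t = \gamma_0 / (1 + t) satisfies all three: the terms are positive, their squares are summable (bounded by a convergent p-series with p = 2), and their sum diverges like the harmonic series.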


65 Convergence and learning rates. Number of iterations to get to accuracy within ε, for strongly convex functions, N examples, d dimensional: Gradient descent: O(Nd ln(1/ε)). Stochastic gradient descent: O(d/ε). More subtleties are involved, but SGD is generally preferable when the data size is huge. Recently, many variants have been developed that are based on this general strategy, targeting multilayer neural networks. Examples: Adagrad, momentum, Nesterov's accelerated gradient, Adam, RMSProp, etc. 65


67 Outline: Training SVM by optimization ✓ Review of convex functions and gradient descent ✓ Stochastic gradient descent ✓ Gradient descent vs stochastic gradient descent ✓ Sub-derivatives of the hinge loss ✓ Stochastic sub-gradient descent for SVM 6. Comparison to perceptron 67

68 Stochastic sub-gradient descent for SVM. Given a training set S = {(x_i, y_i)}, x ∈ ℝ^n, y ∈ {−1, 1}: 1. Initialize w = 0 ∈ ℝ^n. 2. For epoch = 1 ... T: (a) Shuffle the training set. (b) For each training example (x_i, y_i) ∈ S: if y_i w^T x_i ≤ 1, then w ← (1 − γ_t) w + γ_t C y_i x_i; else w ← (1 − γ_t) w. 3. Return w. Compare with the Perceptron update: if y w^T x ≤ 0, update w ← w + r y x. 68

69 Perceptron vs. SVM. Perceptron: stochastic sub-gradient descent for a different loss, with no regularization. SVM: optimizes the hinge loss, with regularization. 69
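In symbols (assuming the usual perceptron loss; the slides show the two losses only graphically):

    Perceptron: minimize \sum_i \max(0,\; -y_i w^T x_i), with no regularization.
    SVM: minimize \frac{1}{2} w^T w + C \sum_i \max(0,\; 1 - y_i w^T x_i), i.e., regularizer plus hinge loss.

Running stochastic sub-gradient descent on the perceptron loss reproduces the perceptron update (add r y x only on mistakes), which is the sense in which the perceptron is stochastic sub-gradient descent for a different loss.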

70 SVM summary from the optimization perspective. Minimize the regularized hinge loss. Solve using stochastic gradient descent: very fast, and the run time does not depend on the number of examples. Compare with the Perceptron algorithm: the Perceptron does not maximize the margin width, though Perceptron variants can force a margin. The convergence criterion is an issue: SGD can be too aggressive in the beginning and get to a reasonably good solution fast, but convergence is slow if a very accurate weight vector is needed. Other successful optimization algorithms exist, e.g., dual coordinate descent, implemented in liblinear. Questions? 70
