Machine Learning & Data Mining CMS/CS/CNS/EE 155. Lecture 2: Perceptron & Gradient Descent


1 Machine Learning & Data Mining CMS/CS/CNS/EE 155 Lecture 2: Perceptron & Gradient Descent

2 Announcements: Homework 1 is out, due Tuesday, Jan 12th at 2pm, via Moodle. Sign up for Moodle & Piazza if you haven't yet; announcements are made via Piazza. Recitation on Python programming tonight, 7:30pm in Annenberg 105.

3 Recap: Basic Recipe. Training Data: S = {(x_i, y_i)}_{i=1}^N, x_i ∈ R^D, y_i ∈ {−1, +1}. Model Class (Linear Models): f(x | w, b) = w^T x − b. Loss Function (Squared Loss): L(a, b) = (a − b)^2. Learning Objective (Optimization Problem): argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)).

4 Recap: Bias-Variance Trade-off. [Figure: panels plotting bias vs. variance across model complexities.]

5 Recap: Complete Pipeline. Training Data: S = {(x_i, y_i)}_{i=1}^N → Model Class(es): f(x | w, b) = w^T x − b → Loss Function: L(a, b) = (a − b)^2 → argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) → Cross Validation & Model Selection → Profit!

6 Today: two basic learning approaches, the Perceptron algorithm and Gradient Descent. Aka, actually solving the optimization problem.

7 The Perceptron. One of the earliest learning algorithms (1957, by Frank Rosenblatt). Still a great algorithm: fast, clean analysis, precursor to Neural Networks. [Photo: Frank Rosenblatt with the Mark 1 Perceptron Machine.]

8 Perceptron Learning Algorithm (Linear Classification Model). Model: f(x | w) = sign(w^T x − b). Training Set: S = {(x_i, y_i)}_{i=1}^N, y ∈ {+1, −1}. Initialize w_1 = 0, b_1 = 0. For t = 1, 2, …: receive example (x, y); if f(x | w_t) = y, then [w_{t+1}, b_{t+1}] = [w_t, b_t]; else w_{t+1} = w_t + y x and b_{t+1} = b_t − y. Go through the training set in arbitrary order (e.g., randomly).
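A minimal NumPy sketch of this algorithm, under the slide's convention f(x | w) = sign(w^T x − b); the epoch cap, the mistake-free stopping test, and the random ordering are illustrative assumptions:

```python
import numpy as np

def perceptron(X, y, epochs=100, seed=0):
    """Perceptron learning. X: (N, D) array; y: (N,) array with entries in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in rng.permutation(N):           # arbitrary (random) order
            if np.sign(X[i] @ w - b) != y[i]:  # misclassified -> update
                w += y[i] * X[i]
                b -= y[i]                      # moves w^T x - b toward y
                mistakes += 1
        if mistakes == 0:                      # mistake-free pass: done
            break
    return w, b
```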

9 Aside: Hyperplane Distance. A line is 1D, a plane is 2D; a hyperplane generalizes to many dimensions (and includes lines and planes). A hyperplane is defined by (w, b); its distance from the origin is b / ‖w‖. Distance from a point x to the hyperplane: |w^T x − b| / ‖w‖. Signed distance: (w^T x − b) / ‖w‖. Linear model = un-normalized signed distance!
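A quick numeric check of these formulas, using a hypothetical hyperplane and point:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), 2.0           # hypothetical hyperplane w^T x = b
x = np.array([1.0, 1.0])

signed = (w @ x - b) / np.linalg.norm(w)   # signed distance: (3 + 4 - 2) / 5 = 1.0
distance = abs(signed)                     # unsigned distance
origin_gap = b / np.linalg.norm(w)         # hyperplane's distance from origin: 0.4
```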

10–22 Perceptron Learning. [Figure sequence: a walkthrough on a 2D training set. Examples are visited in turn; correctly classified points ("Correct!") leave the model unchanged, while each misclassified point ("Misclassified!") triggers an update to w and b ("Update!"). After several updates, all training examples are correctly classified.]

23–45 Perceptron Learning: Start Again. [Figure sequence: a second run on the same data from a different ordering. Again, misclassified examples trigger updates until all training examples are correctly classified; the resulting model differs from the first run.]

46 Recap: Perceptron Learning Algorithm (Linear Classification Model). Model: f(x | w) = sign(w^T x − b). Training Set: S = {(x_i, y_i)}_{i=1}^N, y ∈ {+1, −1}. Initialize w_1 = 0, b_1 = 0. For t = 1, 2, …: receive example (x, y); if f(x | w_t) = y, then [w_{t+1}, b_{t+1}] = [w_t, b_t]; else w_{t+1} = w_t + y x and b_{t+1} = b_t − y. Go through the training set in arbitrary order (e.g., randomly).

47 Comparing the Two Models. [Figure: the separating hyperplanes learned in the two runs.]

48 Convergence to Mistake-Free = Linearly Separable!

49 Margin: γ = max_w min_{(x,y)} y (w^T x) / ‖w‖

50 Linear Separability. A classification problem is Linearly Separable if there exists a w with perfect classification accuracy. Separable with Margin γ: γ = max_w min_{(x,y)} y (w^T x) / ‖w‖. Linearly Separable: γ > 0.
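For a fixed w, the inner minimum in this definition can be evaluated directly; a sketch (the outer maximization over w is a separate optimization, omitted here):

```python
import numpy as np

def margin_of(w, X, y):
    """min over (x_i, y_i) of y_i * (w^T x_i) / ||w||, per the slide's definition."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)
```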

51 Perceptron Mistake Bound. Holds for any ordering of the training examples! Radius of feature space: R = max_x ‖x‖. If linearly separable with margin γ, the number of mistakes is bounded by R^2 / γ^2.

52 In the Real World. Most problems are NOT linearly separable! The perceptron may never converge. So what to do? Use a validation set!

53 Early Stopping via Validation. Run perceptron learning on the training set; evaluate the current model on the validation set; terminate when validation accuracy stops improving. https://en.wikipedia.org/wiki/Early_stopping
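A sketch of this early-stopping loop wrapped around the perceptron; the accuracy helper and the strict "stop at the first non-improvement" rule are simplifying assumptions:

```python
import numpy as np

def accuracy(w, b, X, y):
    return np.mean(np.sign(X @ w - b) == y)

def perceptron_early_stop(X_tr, y_tr, X_val, y_val, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X_tr.shape[1]), 0.0
    best_w, best_b, best_acc = w.copy(), b, accuracy(w, b, X_val, y_val)
    for _ in range(epochs):
        for i in rng.permutation(len(y_tr)):        # one perceptron pass
            if np.sign(X_tr[i] @ w - b) != y_tr[i]:
                w += y_tr[i] * X_tr[i]
                b -= y_tr[i]
        acc = accuracy(w, b, X_val, y_val)
        if acc <= best_acc:                         # validation stopped improving
            break                                   # keep the best model seen so far
        best_w, best_b, best_acc = w.copy(), b, acc
    return best_w, best_b
```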

54 Online Learning vs Batch Learning. Online learning: receive a stream of data (x, y) and make incremental updates; perceptron learning is an instance of online learning. Batch learning: train over all data simultaneously. Online learning algorithms can be used for batch learning, e.g., by streaming the data to the learning algorithm. https://en.wikipedia.org/wiki/Online_machine_learning

55 Recap: Perceptron. One of the first machine learning algorithms. Benefits: simple and fast, clean analysis. Drawbacks: might not converge to a very good model. What is the objective function?

56 (Stochastic) Gradient Descent

57 Back to Optimizing Objective Functions. Training Data: S = {(x_i, y_i)}_{i=1}^N, x_i ∈ R^D, y_i ∈ {−1, +1}. Model Class (Linear Models): f(x | w, b) = w^T x − b. Loss Function (Squared Loss): L(a, b) = (a − b)^2. Learning Objective (Optimization Problem): argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)).

58 Back to Optimizing Objective Functions. argmin_{w,b} L(w, b | S), where L(w, b | S) = Σ_{i=1}^N L(y_i, f(x_i | w, b)). Typically requires an optimization algorithm; the simplest is Gradient Descent. This lecture: stick with squared loss. We'll talk about various loss functions next lecture.

59 Gradient Review for Squared Loss.
∇_w L(w, b | S) = ∇_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
= Σ_{i=1}^N ∇_w L(y_i, f(x_i | w, b))   [linearity of differentiation]
= Σ_{i=1}^N −2 (y_i − f(x_i | w, b)) ∇_w f(x_i | w, b)   [chain rule, L(a, b) = (a − b)^2]
= Σ_{i=1}^N −2 (y_i − f(x_i | w, b)) x_i   [since f(x | w, b) = w^T x − b]
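The derived gradient as NumPy functions; grad_b follows the same chain-rule steps with ∂f/∂b = −1 (an added detail, not spelled out on the slide):

```python
import numpy as np

def grad_w(w, b, X, y):
    """Sum over i of -2 (y_i - f(x_i | w, b)) x_i, from the derivation above."""
    residual = y - (X @ w - b)     # y_i - f(x_i | w, b)
    return -2.0 * (X.T @ residual)

def grad_b(w, b, X, y):
    """Since df/db = -1, the chain rule gives +2 (y_i - f(x_i | w, b)), summed over i."""
    return 2.0 * np.sum(y - (X @ w - b))
```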

60 Gradient Descent. Initialize w_1 = 0, b_1 = 0. For t = 1, 2, …:
w_{t+1} = w_t − η_{t+1} ∇_w L(w_t, b_t | S)
b_{t+1} = b_t − η_{t+1} ∇_b L(w_t, b_t | S)
where η is the step size.
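A direct transcription of this update rule, reusing grad_w and grad_b from the sketch above; the fixed step size is a placeholder (choosing it is discussed next):

```python
import numpy as np

def gradient_descent(X, y, eta=0.001, iters=1000):
    """Plain gradient descent on the squared loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        gw, gb = grad_w(w, b, X, y), grad_b(w, b, X, y)
        w, b = w - eta * gw, b - eta * gb   # simultaneous update from (w_t, b_t)
    return w, b
```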

61–68 How to Choose Step Size? [Figure sequence on the 1D example with ∇_w L(w) = −2 (1 − w): with η = 1 the iterates oscillate infinitely; with a very small η the iterates converge but take a really long time.]

69 How to Choose Step Size? As large as possible, without diverging! [Plot: training loss vs. iterations for several step sizes.] Note that the absolute scale is not meaningful; focus on the relative magnitude differences.

70 Being Scale Invariant. Consider the following two gradient updates: w_{t+1} = w_t − η_{t+1} ∇_w L(w_t, b_t | S) and w_{t+1} = w_t − η̂_{t+1} ∇_w L̂(w_t, b_t | S). Suppose L̂ = 1000 L. How are the two step sizes related? Since ∇_w L̂ = 1000 ∇_w L, the updates match when η̂_{t+1} = η_{t+1} / 1000.

71 Practical Rules of Thumb. Divide the loss function by the number of examples: w_{t+1} = w_t − η_{t+1} ∇_w L(w_t, b_t | S) / N. Start with a large step size; if the loss plateaus, divide the step size by 2. (Can also use advanced optimization methods.) (The step size must decrease over time to guarantee convergence to the global optimum.)
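One way these rules of thumb might be implemented; the plateau test and its tolerance are assumptions:

```python
import numpy as np

def gd_with_halving(X, y, eta=1.0, iters=10000, tol=1e-4):
    """Gradient descent on the mean squared loss; halve eta when the loss plateaus."""
    N = len(y)
    w, b = np.zeros(X.shape[1]), 0.0
    prev_loss = np.inf
    for _ in range(iters):
        residual = y - (X @ w - b)
        loss = np.mean(residual ** 2)           # loss divided by N, per the slide
        if prev_loss - loss < tol * prev_loss:  # relative progress stalled...
            eta /= 2.0                          # ...so halve the step size
        prev_loss = loss
        gw = -2.0 / N * (X.T @ residual)
        gb =  2.0 / N * np.sum(residual)
        w, b = w - eta * gw, b - eta * gb
    return w, b
```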

72 Aside: Convexity. [Figure: a convex function vs. a non-convex function.] Convex: easy to find global optima! Strictly convex if the difference in the convexity inequality is always > 0. Image source: http://en.wikipedia.org/wiki/Convex_function

73 Aside: Convexity. L(x_2) ≥ L(x_1) + ∇L(x_1)^T (x_2 − x_1): the function is always above the locally linear extrapolation.
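A quick numeric check of this inequality for the 1D squared loss from the step-size example:

```python
L  = lambda w: (1.0 - w) ** 2        # 1D squared loss L(w) = (1 - w)^2
dL = lambda w: -2.0 * (1.0 - w)      # its gradient

w1, w2 = 0.5, 3.0
assert L(w2) >= L(w1) + dL(w1) * (w2 - w1)   # 4.0 >= 0.25 + (-1.0)(2.5) = -2.25
```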

74 Aside: Convexity. All local optima are global optima, so Gradient Descent will find an optimum (assuming the step size is chosen safely). Strictly convex: unique global optimum. Almost all standard objectives are (strictly) convex: squared loss, SVMs, logistic regression, Ridge, Lasso. We will see non-convex objectives in the 2nd half of the course.

75 Convergence. Assume L is convex. How many iterations to achieve L(w) − L(w*) ≤ ε?
If |L(a) − L(b)| ≤ ρ ‖a − b‖ (L is ρ-Lipschitz), then O(1/ε^2) iterations.
If ‖∇L(a) − ∇L(b)‖ ≤ ρ ‖a − b‖ (L is ρ-smooth), then O(1/ε) iterations.
If L(a) ≥ L(b) + ∇L(b)^T (a − b) + (ρ/2) ‖a − b‖^2 (L is ρ-strongly convex), then O(log(1/ε)) iterations.
More details: Bubeck textbook, Chapter 3.

76 Convergence. In general, it takes infinite time to reach the global optimum. But in general, we don't care, as long as we're close enough to the global optimum! [Plot: loss vs. iterations, annotated "How do we know if we're here, and not here?"]

77 When to Stop? Convergence analyses give worst-case upper bounds. What to do in practice? Stop when progress is sufficiently small (e.g., relative reduction below some small threshold); stop after a pre-specified number of iterations (e.g., Yisong prefers this option); or stop when validation error stops going down.

78 Limitation of Gradient Descent. Requires a full pass over the training set per iteration: ∇_w L(w, b | S) = Σ_{i=1}^N ∇_w L(y_i, f(x_i | w, b)). Very expensive if the training set is huge. Do we need to do a full pass over the data?

79 Stochastic Gradient Descent. Suppose the loss function decomposes additively: L(w, b) = (1/N) Σ_{i=1}^N L_i(w, b) = E_i[L_i(w, b)], where each L_i(w, b) = (y_i − f(x_i | w, b))^2 corresponds to a single data point. Then the gradient is the expected gradient of the sub-functions: ∇_w L(w, b) = ∇_w E_i[L_i(w, b)].

80 Stochastic Gradient Descent. It suffices to take a random gradient update, so long as it matches the true gradient in expectation. Each iteration t: choose i at random, then
w_{t+1} = w_t − η_{t+1} ∇_w L_i(w, b)
b_{t+1} = b_t − η_{t+1} ∇_b L_i(w, b)
The expected value of the update direction is ∇_w L(w, b). SGD is an online learning algorithm!
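A sketch of this loop for the squared loss; the fixed step size (rather than a decaying schedule η_t) is a simplification:

```python
import numpy as np

def sgd(X, y, eta=0.001, iters=100000, seed=0):
    """SGD on the squared loss: each update uses one randomly chosen example."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        i = rng.integers(N)                      # choose i at random
        residual = y[i] - (X[i] @ w - b)         # y_i - f(x_i | w, b)
        w = w - eta * (-2.0 * residual * X[i])   # gradient of L_i w.r.t. w
        b = b - eta * ( 2.0 * residual)          # gradient of L_i w.r.t. b
    return w, b
```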

81 Mini-Batch SGD. Each L_i is computed over a small batch of training examples. Can leverage vector operations; decreases the volatility of gradient updates. Industry state-of-the-art: everyone uses mini-batch SGD, often parallelized (e.g., different cores work on different mini-batches).
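The mini-batch variant of the same loop; the batch size here is an arbitrary choice. The batched matrix products are where the vector-operations benefit shows up:

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, iters=10000, batch=64, seed=0):
    """Mini-batch SGD: one vectorized gradient step per random batch."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(iters):
        idx = rng.integers(N, size=batch)          # random mini-batch of indices
        residual = y[idx] - (X[idx] @ w - b)
        gw = -2.0 / batch * (X[idx].T @ residual)  # averaged batch gradient
        gb =  2.0 / batch * np.sum(residual)
        w, b = w - eta * gw, b - eta * gb
    return w, b
```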

82 Checking for Convergence. How to check for convergence? Evaluating the loss on the entire training set seems expensive. [Plot: loss vs. iterations.]

83 Checking for Convergence. How to check for convergence? Evaluating the loss on the entire training set seems expensive. Don't check after every iteration (e.g., check every 1000 iterations), and evaluate the loss on a subset of the training data (e.g., the previous 5000 examples).
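A sketch combining both tricks inside an SGD loop, tracking a running window of per-example losses; the window size and check interval follow the slide's examples, while the stopping tolerance is an assumption:

```python
from collections import deque
import numpy as np

def sgd_with_check(X, y, eta=0.001, max_iters=10**6, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    w, b = np.zeros(X.shape[1]), 0.0
    recent = deque(maxlen=5000)        # losses on the previous 5000 examples
    prev_avg = np.inf
    for t in range(1, max_iters + 1):
        i = rng.integers(N)
        residual = y[i] - (X[i] @ w - b)
        recent.append(residual ** 2)
        w = w - eta * (-2.0 * residual * X[i])
        b = b - eta * ( 2.0 * residual)
        if t % 1000 == 0:              # check every 1000 iterations, not every one
            avg = np.mean(recent)
            if prev_avg - avg < tol:   # progress sufficiently small -> stop
                break
            prev_avg = avg
    return w, b
```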

84 Recap: Stochastic Gradient Descent. Conceptually: decompose the loss function additively, choose a component randomly, take a gradient update. Benefits: avoids iterating over the entire dataset for every update, and the gradient update is consistent (in expectation). Industry standard.

85 Perceptron Revisited (What is the Objective Function?). Model: f(x | w) = sign(w^T x − b). Training Set: S = {(x_i, y_i)}_{i=1}^N, y ∈ {+1, −1}. Initialize w_1 = 0, b_1 = 0. For t = 1, 2, …: receive example (x, y); if f(x | w_t) = y, then [w_{t+1}, b_{t+1}] = [w_t, b_t]; else w_{t+1} = w_t + y x and b_{t+1} = b_t − y. Go through the training set in arbitrary order (e.g., randomly).

86 Perceptron (Implicit) Objective: L_i(w, b) = max{0, −y_i f(x_i | w, b)}. [Plot: L_i as a function of y f(x), zero when y f(x) ≥ 0 and increasing linearly as y f(x) goes negative.]
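This objective explains the update rule: for a misclassified example, the subgradient of L_i is −y_i x_i with respect to w and +y_i with respect to b (since ∂f/∂b = −1), so an SGD step with η = 1 is exactly the perceptron update. A sketch:

```python
import numpy as np

def perceptron_subgrad_step(w, b, x, y):
    """One SGD step (eta = 1) on L_i(w, b) = max(0, -y * (w^T x - b))."""
    if -y * (x @ w - b) > 0:    # hinge active: example misclassified
        w = w + y * x           # w - 1 * (-y x)
        b = b - y               # b - 1 * (+y)
    return w, b                 # otherwise subgradient is 0: no change
```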

87 Recap: Complete Pipeline. Training Data: S = {(x_i, y_i)}_{i=1}^N → Model Class(es): f(x | w, b) = w^T x − b → Loss Function: L(a, b) = (a − b)^2 → argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) — use SGD! → Cross Validation & Model Selection → Profit!

88 Next Week. Different loss functions: hinge loss (SVM) and log loss (logistic regression). Non-linear model classes: neural nets. Regularization. Recitation on Python programming tonight!
