Machine Learning & Data Mining CMS/CS/CNS/EE 155. Lecture 2: Perceptron & Gradient Descent
1 Machine Learning & Data Mining CMS/CS/CNS/EE 155 Lecture 2: Perceptron & Gradient Descent
2 Announcements: Homework 1 is out, due Tuesday Jan 12th at 2pm, via Moodle. Sign up for Moodle & Piazza if you haven't yet; announcements are made via Piazza. Recitation on Python programming tonight, 7:30pm in Annenberg 105.
3 Recap: Basic Recipe. Training Data: S = {(x_i, y_i)}_{i=1}^N, x_i ∈ R^D, y_i ∈ {-1, +1}. Model Class (linear models): f(x | w, b) = w^T x - b. Loss Function (squared loss): L(a, b) = (a - b)^2. Learning Objective (an optimization problem): argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)).
4 Recap: Bias-Variance Trade-off [plots illustrating the bias-variance trade-off]
5 Recap: Complete Pipeline. Training Data S = {(x_i, y_i)}_{i=1}^N → Model Class(es) f(x | w, b) = w^T x - b → Loss Function L(a, b) = (a - b)^2 → argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) → Cross Validation & Model Selection → Profit!
6 Today: two basic learning approaches, the Perceptron algorithm and Gradient Descent. Aka, actually solving the optimization problem.
7 The Perceptron. One of the earliest learning algorithms (1957, by Frank Rosenblatt). Still a great algorithm: fast, clean analysis, precursor to Neural Networks. [photo: Frank Rosenblatt with the Mark I Perceptron machine]
8 Perceptron Learning Algorithm (linear classification model). Training set: S = {(x_i, y_i)}_{i=1}^N with y ∈ {+1, -1}; model: f(x | w) = sign(w^T x - b). Initialize w_1 = 0, b_1 = 0. For t = 1, 2, ...: receive an example (x, y), going through the training set in arbitrary order (e.g., randomly). If f(x | w_t) = y: [w_{t+1}, b_{t+1}] = [w_t, b_t]. Else: w_{t+1} = w_t + y·x and b_{t+1} = b_t - y.
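A minimal NumPy sketch of this update rule (illustrative code, not the course-provided implementation; it assumes the sign(w^T x - b) convention above and labels in {-1, +1}):

```python
import numpy as np

def perceptron_train(X, y, num_passes=10, seed=0):
    """Perceptron updates for f(x | w, b) = sign(w^T x - b) (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(num_passes):
        for i in rng.permutation(N):              # arbitrary (random) order
            if np.sign(w @ X[i] - b) != y[i]:     # mistake: update
                w = w + y[i] * X[i]
                b = b - y[i]
    return w, b
```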
9 Aside: Hyperplane Distance. A line is 1D and a plane is 2D; a hyperplane is many-D (it includes lines and planes as special cases) and is defined by (w, b). Distance from a point x to the hyperplane: |w^T x - b| / ||w|| (the hyperplane itself lies at distance b/||w|| from the origin). Signed distance: (w^T x - b) / ||w||. A linear model is an un-normalized signed distance!
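For concreteness, the signed distance is a one-liner; a small sketch with made-up numbers (the names here are illustrative):

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance from x to the hyperplane w^T x - b = 0 (illustrative sketch)."""
    return (w @ x - b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), 1.0
print(signed_distance(np.array([1.0, 1.0]), w, b))   # (3 + 4 - 1) / 5 = 1.2
```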
10-22 Perceptron Learning (animation): the algorithm steps through the training examples, alternating Misclassified! / Update! / Correct! frames, until all training examples are correctly classified.
23-45 Perceptron Learning (animation, starting again): a second sequence of Misclassified! / Update! / Correct! frames, again ending with all training examples correctly classified.
46 Recap: Perceptron Learning Algorithm (linear classification model). Training set: S = {(x_i, y_i)}_{i=1}^N with y ∈ {+1, -1}; model: f(x | w) = sign(w^T x - b). Initialize w_1 = 0, b_1 = 0. For t = 1, 2, ...: receive an example (x, y), going through the training set in arbitrary order (e.g., randomly). If f(x | w_t) = y: [w_{t+1}, b_{t+1}] = [w_t, b_t]. Else: w_{t+1} = w_t + y·x and b_{t+1} = b_t - y.
47 Comparing the Two Models [figure comparing the two learned models]
48 Convergence to Mistake-Free = Linearly Separable!
49 Margin: γ = max_w min_{(x,y)} y(w^T x) / ||w||
50 Linear Separability. A classification problem is Linearly Separable if there exists a w with perfect classification accuracy. Separable with Margin γ: γ = max_w min_{(x,y)} y(w^T x) / ||w||. Linearly Separable: γ > 0.
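As an illustration, the margin of a fixed w on a dataset (ignoring the bias, as in the formula above) can be computed directly; the data and names in this sketch are made up:

```python
import numpy as np

def margin(w, X, y):
    """min_i y_i (w^T x_i) / ||w||: the margin of w on the dataset (ignoring the bias)."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

X = np.array([[2.0, 1.0], [-1.0, -2.0], [1.5, 2.0]])   # made-up data
y = np.array([1.0, -1.0, 1.0])
print(margin(np.array([1.0, 1.0]), X, y))               # > 0, so this w separates the data
```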
51 Perceptron Mistake Bound (holds for any ordering of the training examples!). Radius of the feature space: R = max_x ||x||. If the data is linearly separable with margin γ, the number of mistakes is bounded by R^2 / γ^2.
52 In the Real World, most problems are NOT linearly separable! The Perceptron may never converge. So what to do? Use a validation set!
53 Early Stopping via Validation. Run Perceptron Learning on the training set; evaluate the current model on the validation set; terminate when validation accuracy stops improving. https://en.wikipedia.org/wiki/Early_stopping
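A sketch of what this validation loop might look like, reusing the perceptron update from earlier (the helper names, patience rule, and pass counts are illustrative assumptions, not from the lecture):

```python
import numpy as np

def accuracy(w, b, X, y):
    return np.mean(np.sign(X @ w - b) == y)

def perceptron_early_stopping(X_tr, y_tr, X_val, y_val, max_passes=100, patience=5, seed=0):
    """Perceptron with early stopping: keep the weights that did best on the validation set."""
    rng = np.random.default_rng(seed)
    N, D = X_tr.shape
    w, b = np.zeros(D), 0.0
    best_w, best_b, best_acc = w.copy(), b, -1.0
    stale_passes = 0
    for _ in range(max_passes):
        for i in rng.permutation(N):
            if np.sign(w @ X_tr[i] - b) != y_tr[i]:   # perceptron update on a mistake
                w = w + y_tr[i] * X_tr[i]
                b = b - y_tr[i]
        val_acc = accuracy(w, b, X_val, y_val)
        if val_acc > best_acc:
            best_w, best_b, best_acc = w.copy(), b, val_acc
            stale_passes = 0
        else:
            stale_passes += 1
            if stale_passes >= patience:              # validation accuracy stopped improving
                break
    return best_w, best_b
```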
54 Online Learning vs Batch Learning. Online Learning: receive a stream of data (x, y) and make incremental updates; Perceptron Learning is an instance of online learning. Batch Learning: train over all data simultaneously. You can use online learning algorithms for batch learning, e.g., by streaming the data to the learning algorithm. https://en.wikipedia.org/wiki/Online_machine_learning
55 Recap: Perceptron. One of the first machine learning algorithms. Benefits: simple and fast, clean analysis. Drawbacks: might not converge to a very good model; what is the objective function?
56 (Stochastic) Gradient Descent
57 Back to Optimizing Objective Functions. Training Data: S = {(x_i, y_i)}_{i=1}^N, x_i ∈ R^D, y_i ∈ {-1, +1}. Model Class (linear models): f(x | w, b) = w^T x - b. Loss Function (squared loss): L(a, b) = (a - b)^2. Learning Objective (an optimization problem): argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)).
58 Back to Optimizing Objective Functions: argmin_{w,b} L(w, b | S), where L(w, b | S) = Σ_{i=1}^N L(y_i, f(x_i | w, b)). Typically this requires an optimization algorithm; the simplest is Gradient Descent. This lecture: stick with the squared loss. We will talk about various loss functions next lecture.
59 Gradient Review for the Squared Loss.
∇_w L(w, b | S) = ∇_w Σ_{i=1}^N L(y_i, f(x_i | w, b))
= Σ_{i=1}^N ∇_w L(y_i, f(x_i | w, b))   (linearity of differentiation)
= Σ_{i=1}^N -2 (y_i - f(x_i | w, b)) ∇_w f(x_i | w, b)   (chain rule, with L(a, b) = (a - b)^2)
= Σ_{i=1}^N -2 (y_i - f(x_i | w, b)) x_i   (since f(x | w, b) = w^T x - b)
60 Gradient Descent. Initialize w_1 = 0, b_1 = 0. For t = 1, 2, ...:
w_{t+1} = w_t - η_{t+1} ∇_w L(w_t, b_t | S)
b_{t+1} = b_t - η_{t+1} ∇_b L(w_t, b_t | S)
where η is the step size.
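Plugging the gradient from the previous slide into this update rule gives a full-batch gradient descent sketch (the step size and iteration count below are arbitrary illustrative choices):

```python
import numpy as np

def gradients(w, b, X, y):
    """Gradients of the summed squared loss with f(x | w, b) = w^T x - b (sketch)."""
    residual = y - (X @ w - b)                # y_i - f(x_i | w, b)
    grad_w = -2.0 * (X.T @ residual)          # sum_i -2 (y_i - f_i) x_i
    grad_b = 2.0 * np.sum(residual)           # sum_i -2 (y_i - f_i) * d f_i / d b, with d f_i / d b = -1
    return grad_w, grad_b

def gradient_descent(X, y, step_size=0.01, num_iters=1000):
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(num_iters):
        grad_w, grad_b = gradients(w, b, X, y)
        w = w - step_size * grad_w
        b = b - step_size * grad_b
    return w, b
```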
61-64 How to Choose Step Size? One-dimensional example with ∇_w L(w) = -2(1 - w), i.e., L(w) = (1 - w)^2: with step size η = 1, the iterates oscillate infinitely!
65-68 How to Choose Step Size? Same example with a very small step size: it takes a really long time!
69 How to Choose Step Size? As large as possible, without diverging! [plot: loss vs. iterations for different step sizes] Note that the absolute scale is not meaningful; focus on the relative magnitude differences.
70 Being Scale Invariant. Consider the following two gradient updates:
w_{t+1} = w_t - η_{t+1} ∇_w L(w_t, b_t | S)
w_{t+1} = w_t - η̂_{t+1} ∇_w L̂(w_t, b_t | S)
Suppose L̂ = 1000·L. How are the two step sizes related? Since ∇_w L̂ = 1000·∇_w L, the updates match when η̂_{t+1} = η_{t+1} / 1000.
71 Practical Rules of Thumb. Divide the loss function by the number of examples: w_{t+1} = w_t - η_{t+1} ∇_w L(w_t, b_t | S) / N. Start with a large step size; if the loss plateaus, divide the step size by 2. (Can also use advanced optimization methods.) (The step size must decrease over time to guarantee convergence to the global optimum.)
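One way the halve-on-plateau rule of thumb might look in code (a sketch; the plateau test and its tolerance are illustrative assumptions, not prescribed by the lecture):

```python
def halve_on_plateau(step_size, loss_history, tolerance=1e-3):
    """Divide the step size by 2 when the loss has stopped decreasing appreciably (sketch)."""
    if len(loss_history) >= 2:
        prev_loss, curr_loss = loss_history[-2], loss_history[-1]
        relative_reduction = (prev_loss - curr_loss) / max(abs(prev_loss), 1e-12)
        if relative_reduction < tolerance:    # loss has plateaued
            return step_size / 2.0
    return step_size
```

Inside a training loop, one would append the per-example-averaged loss after each pass and call this helper before the next update.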
72 Aside: Convexity. [figures: a convex function vs. a non-convex function] Convex: easy to find global optima! Strictly convex if the inequality below is strict (the difference is always > 0). Image source: http://en.wikipedia.org/wiki/Convex_function
73 Aside: Convexity. L(x_2) ≥ L(x_1) + ∇L(x_1)^T (x_2 - x_1): the function always lies above the locally linear extrapolation.
74 Aside: Convexity. All local optima are global optima, so Gradient Descent will find an optimum (assuming the step size is chosen safely). Strictly convex: unique global optimum. Almost all standard objectives are (strictly) convex: squared loss, SVMs, logistic regression, ridge, lasso. We will see non-convex objectives in the 2nd half of the course.
75 Convergence. Assume L is convex. How many iterations are needed to achieve L(w) - L(w*) ≤ ε?
If |L(a) - L(b)| ≤ ρ·||a - b|| (L is ρ-Lipschitz): O(1/ε^2) iterations.
If ||∇L(a) - ∇L(b)|| ≤ ρ·||a - b|| (L is ρ-smooth): O(1/ε) iterations.
If L(a) ≥ L(b) + ∇L(b)^T (a - b) + (ρ/2)·||a - b||^2 (L is ρ-strongly convex): O(log(1/ε)) iterations.
More details: Bubeck's textbook, Chapter 3.
76 Convergence. In general, it takes infinite time to reach the global optimum exactly. But in general we don't care, as long as we're close enough to the global optimum. [plot: loss vs. iterations; how do we know whether we are near the bottom of the curve rather than still far from it?]
77 When to Stop? Convergence analyses give worst-case upper bounds; what to do in practice? Stop when progress is sufficiently small (e.g., the relative reduction in loss falls below a threshold). Stop after a pre-specified number of iterations (e.g., Yisong prefers this option). Stop when the validation error stops going down.
78 Limitation of Gradient Descent: it requires a full pass over the training set per iteration, ∇_w L(w, b | S) = Σ_{i=1}^N ∇_w L(y_i, f(x_i | w, b)), which is very expensive if the training set is huge. Do we need to do a full pass over the data?
79 Stochastic Gradient Descent. Suppose the loss function decomposes additively: L(w, b) = (1/N) Σ_{i=1}^N L_i(w, b) = E_i[L_i(w, b)], where each L_i(w, b) = (y_i - f(x_i | w, b))^2 corresponds to a single data point. The gradient is then the expected gradient of the sub-functions: ∇_w L(w, b) = ∇_w E_i[L_i(w, b)].
80 Stochastic Gradient Descent. It suffices to take a random gradient update, so long as it matches the true gradient in expectation. Each iteration t: choose i at random, then
w_{t+1} = w_t - η_{t+1} ∇_w L_i(w, b)
b_{t+1} = b_t - η_{t+1} ∇_b L_i(w, b)
The expected value of the update direction is ∇_w L(w, b). SGD is an online learning algorithm!
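A minimal SGD sketch for the squared loss under the same linear model (the hyperparameters and names are illustrative):

```python
import numpy as np

def sgd(X, y, step_size=0.001, num_iters=10000, seed=0):
    """SGD on the squared loss: one randomly chosen example per update (sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for t in range(num_iters):
        i = rng.integers(N)                         # choose i at random
        residual = y[i] - (X[i] @ w - b)            # y_i - f(x_i | w, b)
        w = w + step_size * 2.0 * residual * X[i]   # w - eta * grad_w L_i
        b = b - step_size * 2.0 * residual          # b - eta * d L_i / d b
    return w, b
```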
81 Mini-Batch SGD. Each L_i is a small batch of training examples. This lets you leverage vector operations and decreases the volatility of the gradient updates. Industry state of the art: everyone uses mini-batch SGD, often parallelized (e.g., different cores work on different mini-batches).
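The mini-batch variant only changes how each step's gradient is formed, averaging it over a small random batch; a sketch (the batch size here is an arbitrary illustrative value):

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, step_size=0.001, num_iters=10000, seed=0):
    """Mini-batch SGD on the squared loss: average the gradient over a small random batch (sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for t in range(num_iters):
        idx = rng.choice(N, size=batch_size, replace=False)   # random mini-batch
        residual = y[idx] - (X[idx] @ w - b)                  # vectorized over the batch
        w = w + step_size * 2.0 * (X[idx].T @ residual) / batch_size
        b = b - step_size * 2.0 * np.sum(residual) / batch_size
    return w, b
```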
82 Checking for Convergence. How to check for convergence? Evaluating the loss on the entire training set seems expensive. [plot: loss vs. iterations]
83 Checking for Convergence. How to check for convergence? Evaluating the loss on the entire training set seems expensive. Don't check after every iteration: e.g., check every 1000 iterations. Evaluate the loss on a subset of the training data: e.g., the previous 5000 examples.
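Inside an SGD loop this might amount to keeping a window of recent per-example losses and summarizing it only occasionally; a self-contained sketch with an artificial loss stream (the window size and check interval are illustrative):

```python
from collections import deque

def monitor(loss_stream, window=5000, check_every=1000):
    """Average the loss over the most recent `window` examples, reporting every `check_every` steps."""
    recent = deque(maxlen=window)
    for t, loss in enumerate(loss_stream, start=1):
        recent.append(loss)
        if t % check_every == 0:
            yield t, sum(recent) / len(recent)

# Example with an artificial, slowly decreasing loss stream:
for step, avg_loss in monitor(1.0 / (1 + 0.001 * t) for t in range(10000)):
    print(step, round(avg_loss, 4))
```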
84 Recap: Stochastic Gradient Descent. Conceptually: decompose the loss function additively, choose a component at random, and take a gradient update on it. Benefits: avoids iterating over the entire dataset for every update, and the gradient update is consistent (in expectation). Industry standard.
85 Perceptron Revisited (what is the objective function?). Training set: S = {(x_i, y_i)}_{i=1}^N with y ∈ {+1, -1}; model: f(x | w) = sign(w^T x - b). Initialize w_1 = 0, b_1 = 0. For t = 1, 2, ...: receive an example (x, y), going through the training set in arbitrary order (e.g., randomly). If f(x | w_t) = y: [w_{t+1}, b_{t+1}] = [w_t, b_t]. Else: w_{t+1} = w_t + y·x and b_{t+1} = b_t - y.
86 Perceptron (Implicit) Objective: L_i(w, b) = max(0, -y_i f(x_i | w, b)). [plot: loss vs. y·f(x); zero when the prediction has the correct sign, growing linearly as y·f(x) becomes more negative]
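A quick check (not from the slides) of why this is the implicit objective: with f(x | w, b) = w^T x - b, SGD with step size 1 on L_i reproduces the perceptron update. If y_i f(x_i | w, b) > 0 (correct), L_i = 0 and its gradient is 0, so nothing changes. If y_i f(x_i | w, b) < 0 (misclassified), L_i = -y_i (w^T x_i - b), so ∇_w L_i = -y_i x_i and ∂L_i/∂b = y_i; the step-size-1 updates w ← w - ∇_w L_i = w + y_i x_i and b ← b - ∂L_i/∂b = b - y_i are exactly the perceptron updates above.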
87 Recap: Complete Pipeline. Training Data S = {(x_i, y_i)}_{i=1}^N → Model Class(es) f(x | w, b) = w^T x - b → Loss Function L(a, b) = (a - b)^2 → argmin_{w,b} Σ_{i=1}^N L(y_i, f(x_i | w, b)) (use SGD!) → Cross Validation & Model Selection → Profit!
88 Next Week. Different loss functions: hinge loss (SVM), log loss (logistic regression). Non-linear model classes: neural nets. Regularization. Recitation on Python programming tonight!