Machine Learning (CS 567) Lecture 3

Machine Learning (CS 567) Lecture 3
Time: T-Th 5:00pm - 6:20pm
Location: GFS 118
Instructor: Sofus A. Macskassy (macskass@usc.edu), Office: SAL 216, Office hours: by appointment
Teaching assistant: Cheol Han (cheolhan@usc.edu), Office: TBA, Office hours: TBA
Class web page: http://www-scf.usc.edu/~csci567/index.html 1

Administrative
Email me to get on the mailing list (To: macskass@usc.edu, Subject: csci567 student).
Please stop me if I talk too fast, and please stop me if you have any questions.
Any feedback is better than no feedback, and sooner is better: the end of the semester is too late to make it easier on you. 2

Administrative
Weekly quizzes will start next lecture, in the last 10 minutes of class. There is an example quiz at the end of today's class. 3

Questions from Last Class
Single-person projects
Need for a primary book
Supervised learning example: P(y | x) = 0.1 4

Example: Making the tumor decision
Suppose our tumor recurrence detector predicts P(y = recurrence | x) = 0.1. What is the optimal classification decision ŷ?
Expected loss of ŷ = recurrence is: 0 * 0.1 + 1 * 0.9 = 0.9
Expected loss of ŷ = non-recurrence is: 100 * 0.1 + 0 * 0.9 = 10
Therefore, the optimal prediction is recurrence.
Loss function L(ŷ, y):
  predicted \ true        y = recurrence   y = non-recurrence
  ŷ = recurrence                0                  1
  ŷ = non-recurrence          100                  0
  P(y | x)                    0.1                0.9
5
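
A small Python sketch of this calculation (the probabilities and loss values come from the slide; the variable names are my own):

```python
# Expected loss of each possible decision, using the table above.
p = {"recurrence": 0.1, "non-recurrence": 0.9}            # P(y | x)
loss = {("recurrence", "recurrence"): 0,                   # L(y_hat, y)
        ("recurrence", "non-recurrence"): 1,
        ("non-recurrence", "recurrence"): 100,
        ("non-recurrence", "non-recurrence"): 0}

def expected_loss(y_hat):
    return sum(loss[(y_hat, y)] * p[y] for y in p)

print({y_hat: expected_loss(y_hat) for y_hat in p})        # {'recurrence': 0.9, 'non-recurrence': 10.0}
print(min(p, key=expected_loss))                           # recurrence
```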

Lecture 3 Outline
A framework for hypothesis spaces
A framework for algorithms
Linear threshold units
Perceptron 6

Fundamental Problem of Machine Learning: It is ill-posed
Example  x1  x2  x3  x4  y
   1      0   0   1   0  0
   2      0   1   0   0  0
   3      0   0   1   1  1
   4      1   0   0   1  1
   5      0   1   1   0  0
   6      1   1   0   0  0
   7      0   1   0   1  0
7

Learning Appears Impossible
There are 2^16 = 65536 possible boolean functions over four input features. We can't figure out which one is correct until we've seen every possible input-output pair. After 7 examples, we still have 2^9 = 512 possibilities.
x1  x2  x3  x4  y
 0   0   0   0  ?
 0   0   0   1  ?
 0   0   1   0  0
 0   0   1   1  1
 0   1   0   0  0
 0   1   0   1  0
 0   1   1   0  0
 0   1   1   1  ?
 1   0   0   0  ?
 1   0   0   1  1
 1   0   1   0  ?
 1   0   1   1  ?
 1   1   0   0  0
 1   1   0   1  ?
 1   1   1   0  ?
 1   1   1   1  ?
8
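
As a sanity check on the counting argument, the following sketch (my own illustration, not from the slides) enumerates all 2^16 boolean functions over four inputs and counts how many agree with the seven training examples:

```python
from itertools import product

# The seven labeled examples from the slide: ((x1, x2, x3, x4), y).
examples = [((0,0,1,0),0), ((0,1,0,0),0), ((0,0,1,1),1), ((1,0,0,1),1),
            ((0,1,1,0),0), ((1,1,0,0),0), ((0,1,0,1),0)]

inputs = list(product([0, 1], repeat=4))        # the 16 possible input vectors

# A boolean function is any assignment of 0/1 outputs to the 16 inputs: 2^16 candidates.
consistent = 0
for outputs in product([0, 1], repeat=16):
    f = dict(zip(inputs, outputs))
    if all(f[x] == y for x, y in examples):
        consistent += 1

print(consistent)                               # 512 == 2^9, one per unseen input row
```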

Illustration: Simple Conjunctive Rules
There are only 16 simple conjunctions (no negation). However, no simple rule explains the data (the training examples from slide 7). The same is true for simple clauses.
Rule                        Counterexample
true → y                         1
x1 → y                           3
x2 → y                           2
x3 → y                           1
x4 → y                           7
x1 ∧ x2 → y                      3
x1 ∧ x3 → y                      3
x1 ∧ x4 → y                      3
x2 ∧ x3 → y                      3
x2 ∧ x4 → y                      3
x3 ∧ x4 → y                      4
x1 ∧ x2 ∧ x3 → y                 3
x1 ∧ x2 ∧ x4 → y                 3
x1 ∧ x3 ∧ x4 → y                 3
x2 ∧ x3 ∧ x4 → y                 3
x1 ∧ x2 ∧ x3 ∧ x4 → y            3
9

A larger hypothesis space: m-of-n rules
At least m of the n variables must be true. There are 32 possible rules. Only one rule is consistent! (Counterexamples refer to the training examples from slide 7; *** marks the consistent rule: at least 2 of {x1, x3, x4}.)
variables              1-of   2-of   3-of   4-of
{x1}                    3      -      -      -
{x2}                    2      -      -      -
{x3}                    1      -      -      -
{x4}                    7      -      -      -
{x1, x2}                3      3      -      -
{x1, x3}                4      3      -      -
{x1, x4}                6      3      -      -
{x2, x3}                2      3      -      -
{x2, x4}                2      3      -      -
{x3, x4}                4      4      -      -
{x1, x2, x3}            1      3      3      -
{x1, x2, x4}            2      3      3      -
{x1, x3, x4}            1     ***     3      -
{x2, x3, x4}            1      5      3      -
{x1, x2, x3, x4}        1      5      3      3
10
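
A short Python sketch (my own, not from the slides) that enumerates all 32 m-of-n rules over these four variables and reports which are consistent with the seven examples:

```python
from itertools import combinations

examples = [((0,0,1,0),0), ((0,1,0,0),0), ((0,0,1,1),1), ((1,0,0,1),1),
            ((0,1,1,0),0), ((1,1,0,0),0), ((0,1,0,1),0)]

def m_of_n(m, variables):
    """Rule that predicts 1 iff at least m of the chosen variables are 1."""
    return lambda x: int(sum(x[i] for i in variables) >= m)

consistent = []
for n in range(1, 5):
    for variables in combinations(range(4), n):       # which variables the rule uses
        for m in range(1, n + 1):                      # the threshold m
            rule = m_of_n(m, variables)
            if all(rule(x) == y for x, y in examples):
                consistent.append((m, ["x%d" % (i + 1) for i in variables]))

print(consistent)   # [(2, ['x1', 'x3', 'x4'])]: the single consistent rule
```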

Two Strategies for Machine Learning
Develop languages for expressing prior knowledge: rule grammars, stochastic models, Bayesian networks. (Corresponds to the Prior Knowledge view.)
Develop flexible hypothesis spaces: nested collections of hypotheses such as decision trees, neural networks, cases, SVMs. (Corresponds to the Guessing view.)
In either case we must develop algorithms for finding an hypothesis that fits the data. 11

S, G, and the Version Space
The most specific hypothesis is S and the most general hypothesis is G. Every h ∈ H between S and G is consistent with the data, and together they make up the version space (Mitchell, 1997). 12

Key Issues in Machine Learning
What are good hypothesis spaces? Which spaces have been useful in practical applications?
What algorithms can work with these spaces? Are there general design principles for learning algorithms?
How can we optimize accuracy on future data points? This is related to the problem of overfitting.
How can we have confidence in the results? How much training data is required to find an accurate hypothesis? (the statistical question)
Are some learning problems computationally intractable? (the computational question)
How can we formulate application problems as machine learning problems? (the engineering question) 13

A framework for hypothesis spaces
Size: Does the hypothesis space have a fixed size or a variable size? Fixed-sized spaces are easier to understand, but variable-sized spaces are generally more useful. Variable-sized spaces introduce the problem of overfitting.
Stochasticity: Is the hypothesis a classifier, a conditional distribution, or a joint distribution? This affects how we evaluate hypotheses. For a deterministic hypothesis, a training example is either consistent (correctly predicted) or inconsistent (incorrectly predicted). For a stochastic hypothesis, a training example is more likely or less likely.
Parameterization: Is each hypothesis described by a set of symbolic (discrete) choices or is it described by a set of continuous parameters? If both are required, we say the space has a mixed parameterization. Discrete parameters must be found by combinatorial search methods; continuous parameters can be found by numerical search methods. 14

A Framework for Hypothesis Spaces (2) 15

A Framework for Learning Algorithms
Search procedure
Direct computation: solve for the hypothesis directly.
Local search: start with an initial hypothesis and make small improvements until a local maximum.
Constructive search: start with an empty hypothesis and gradually add structure to it until a local optimum.
Timing
Eager: analyze training data and construct an explicit hypothesis.
Lazy: store the training data and wait until a test data point is presented, then construct an ad hoc hypothesis to classify that one data point.
Online vs. batch (for eager algorithms)
Online: analyze each training example as it is presented.
Batch: collect examples, analyze them in a batch, output an hypothesis. 16

A Framework for Learning Algorithms (2) Learners described in this course 17

Choosing a target function 18

Choosing a target function A linear function will do the trick 19

Linear Threshold Units
A classifier can be viewed as partitioning the input space or feature space X into decision regions. A linear threshold unit always produces a linear decision boundary. A set of points that can be separated by a linear decision boundary is said to be linearly separable. 20

Linear Threshold Units
h(x) = +1 if w_1 x_1 + ... + w_n x_n >= w_0, and -1 otherwise.
We assume that each feature x_j and each weight w_j is a real number. We will study three different algorithms for learning linear threshold units:
Perceptron: classifier
Logistic Regression: conditional distribution
Linear Discriminant Analysis: joint distribution 21

What can be represented by an LTU: Conjunctions
x1 ∧ x2 ∧ x4 → y corresponds to 1*x1 + 1*x2 + 0*x3 + 1*x4 >= 3.
At least m-of-n: at-least-2-of {x1, x3, x4} → y corresponds to 1*x1 + 0*x2 + 1*x3 + 1*x4 >= 2. 22
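
As a quick check (my own sketch, not from the slides), both weight settings above can be run directly as linear threshold units:

```python
def ltu(weights, threshold):
    """Linear threshold unit: predict 1 iff the weighted sum reaches the threshold."""
    return lambda x: int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)

conjunction = ltu([1, 1, 0, 1], 3)     # x1 AND x2 AND x4
two_of      = ltu([1, 0, 1, 1], 2)     # at-least-2-of {x1, x3, x4}

print(conjunction((1, 1, 0, 1)))       # 1: weighted sum 3 >= 3
print(conjunction((1, 1, 0, 0)))       # 0: weighted sum 2 < 3
print(two_of((0, 0, 1, 1)))            # 1: x3 and x4 give sum 2 >= 2
print(two_of((1, 0, 0, 0)))            # 0: only x1 is on, sum 1 < 2
```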

Exclusive-OR is Not Linearly Separable 23

Things that cannot be represented
Non-trivial disjunctions: (x1 ∧ x2) ∨ (x3 ∧ x4) → y. For example, the candidate 1*x1 + 1*x2 + 1*x3 + 1*x4 >= 2 fails because it predicts f(⟨0110⟩) = 1.
Exclusive-OR: (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2) → y. 24

A canonical representation
Given a training example of the form (⟨x1, x2, x3, x4⟩, y), transform it to (⟨1, x1, x2, x3, x4⟩, y). The parameter vector will then be w = ⟨w0, w1, w2, w3, w4⟩.
h(x) = +1 if w_1 x_1 + ... + w_n x_n >= w_0, and -1 otherwise.
We will call the unthresholded hypothesis u(x, w), where u(x, w) = w · x. Each hypothesis can be written h(x) = sgn(u(x, w)). Our goal is to find w. 26
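
A minimal sketch of this transformation (my own variable names); prepending the constant feature 1 lets the threshold be absorbed into the weight w0:

```python
import numpy as np

def augment(x):
    """Prepend the constant feature 1 so the threshold becomes the weight w0."""
    return np.concatenate(([1.0], x))

def h(x, w):
    """Thresholded hypothesis: the sign of the unthresholded u(x, w) = w . x."""
    u = np.dot(w, augment(x))
    return 1 if u >= 0 else -1

# With w0 = -2.5 this encodes the conjunction x1 AND x2 AND x4 (threshold 3) from the earlier slide.
w = np.array([-2.5, 1.0, 1.0, 0.0, 1.0])
print(h(np.array([1, 1, 0, 1]), w))    # +1
print(h(np.array([1, 1, 0, 0]), w))    # -1
```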

The LTU Hypothesis Space
Fixed size: there are on the order of 2^(n^2) distinct linear threshold units over n boolean features.
Deterministic
Continuous parameters 27

Geometrical View
Consider three training examples: (⟨1.0, 1.0⟩, +1), (⟨0.5, 3.0⟩, +1), (⟨2.0, 2.0⟩, -1). We want a classifier that looks like the following figure. 28

The Unthresholded Discriminant Function is a Hyperplane
The equation g(x) = w · x defines a plane.
ŷ = +1 if g(x) >= 0, and -1 otherwise. 29

Alternatively
g(x) = g_1(x) - g_2(x) = (w_1^T x + w_{10}) - (w_2^T x + w_{20}) = (w_1 - w_2)^T x + (w_{10} - w_{20}) = w^T x + w_0
Choose C_1 if g(x) > 0, and C_2 otherwise. 30

Geometry 31

Multiple Classes
g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}
Choose C_i if g_i(x) = max_{j=1..K} g_j(x).
Classes are linearly separable. 32
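
A small sketch of this decision rule (my own illustration; it assumes the K weight vectors are stacked as rows of a matrix W, with a bias vector w0):

```python
import numpy as np

def predict_class(x, W, w0):
    """One linear discriminant per class: choose the class with the largest g_i(x)."""
    g = W @ x + w0                  # g_i(x) = w_i . x + w_i0 for i = 1..K
    return int(np.argmax(g))

W = np.array([[ 1.0,  0.0],         # three classes over 2-d inputs (made-up numbers)
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])

print(predict_class(np.array([2.0, 0.5]), W, w0))   # 0, since g = [2.0, 0.5, -2.0]
```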

Pairwise Separation
g_{ij}(x | w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}
g_{ij}(x) > 0 if x ∈ C_i, g_{ij}(x) < 0 if x ∈ C_j, and don't care otherwise.
Choose C_i if g_{ij}(x) > 0 for all j ≠ i. 33

Machine Learning and Optimization
When learning a classifier, the natural way to formulate the learning problem is the following:
Given: a set of N training examples {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} and a loss function L.
Find: the weight vector w that minimizes the expected loss on the training data,
J(w) = (1/N) Σ_{i=1}^{N} L(sgn(w · x_i), y_i).
In general, machine learning algorithms apply some optimization algorithm to find a good hypothesis. In this case, J is piecewise constant, which makes this a difficult problem. 34
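
A direct sketch of this objective with the 0/1 loss plugged in for L (my own names; it assumes the canonical representation, so each row of X already starts with the constant feature 1 and y is in {-1, +1}):

```python
import numpy as np

def J(w, X, y):
    """Average 0/1 training loss of the thresholded classifier sgn(w . x)."""
    preds = np.where(X @ w >= 0, 1, -1)
    return float(np.mean(preds != y))       # piecewise constant in w
```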

Approximating the expected loss by a smooth function
Simplify the optimization problem by replacing the original objective function with a smooth, differentiable function. For example, consider the hinge loss:
~J(w) = (1/N) Σ_{i=1}^{N} max(0, 1 - y_i w · x_i)
(The slide's figure plots this loss for the case y = 1.) 35
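
A matching sketch of the surrogate objective (same conventions as the previous snippet; strictly speaking the hinge loss is convex and piecewise linear rather than everywhere differentiable):

```python
import numpy as np

def J_tilde(w, X, y):
    """Average hinge loss: a convex surrogate for the 0/1 training loss."""
    margins = y * (X @ w)                    # y_i * (w . x_i) for every example
    return float(np.mean(np.maximum(0.0, 1.0 - margins)))
```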

Optimizing what? (Figures on slides 36-37: plots of the objective as a function of the weight w_0.)

Minimizing ~J by Gradient Descent Search
Start with weight vector w^0.
Compute the gradient ∇~J(w^0) = (∂~J(w^0)/∂w_0, ∂~J(w^0)/∂w_1, ..., ∂~J(w^0)/∂w_n).
Compute w^1 = w^0 - η ∇~J(w^0), where η is a step size parameter.
Repeat until convergence. 38
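
A compact sketch of this procedure applied to the hinge-loss surrogate above (my own implementation choices: a fixed step size and a subgradient, since the hinge loss is not differentiable at the hinge point):

```python
import numpy as np

def grad_J_tilde(w, X, y):
    """Subgradient of the average hinge loss at w."""
    margins = y * (X @ w)
    active = margins < 1.0                       # examples inside the margin contribute
    return -(X[active] * y[active, None]).sum(axis=0) / len(y)

def gradient_descent(X, y, eta=0.1, steps=5000):
    w = np.zeros(X.shape[1])                     # start with the zero weight vector
    for _ in range(steps):
        w = w - eta * grad_J_tilde(w, X, y)      # w <- w - eta * gradient
    return w

# Toy usage: the three points from the geometrical-view slide, with the constant feature prepended.
X = np.array([[1.0, 1.0, 1.0], [1.0, 0.5, 3.0], [1.0, 2.0, 2.0]])
y = np.array([+1, +1, -1])
w = gradient_descent(X, y)
print(np.sign(X @ w))                            # [ 1.  1. -1.]: all three points classified correctly
```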

EXAMPLE QUIZ 39