Nonlinear Classification

Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul

Linear Classification Most classifiers we've seen use linear functions to separate classes: the perceptron, logistic regression, and support vector machines (unless kernelized).

Linear Classification If the data are not linearly separable, a linear classifier cannot perfectly distinguish the two classes. In many datasets that are not linearly separable, a linear classifier will still be good enough and will classify most instances correctly.

Linear Classification If the data are not linearly separable, a linear classifier cannot perfectly distinguish the two classes. In some datasets, there is no way to learn a linear classifier that works well.

Linear Classification Aside: In datasets like this, it might still be possible to find a boundary that isolates one class, even if the classes are mixed on the other side of the boundary. This would yield a classifier with decent precision on that class, despite having poor overall accuracy.

Nonlinear Classification Nonlinear functions can be used to separate instances that are not linearly separable. We've seen two nonlinear classifiers: k-nearest-neighbors (knn) and the kernel SVM. Kernel SVMs are still implicitly learning a linear separator in a higher-dimensional space, but the separator is nonlinear in the original feature space.

Nonlinear Classification knn would probably work well for classifying these instances.

Nonlinear Classification knn would probably work well for classifying these instances. A Gaussian/RBF kernel SVM could also learn a boundary that looks something like this. (not exact; just an illustration)

Nonlinear Classification Both knn and kernel methods use the concept of distance/similarity to training instances. Next, we'll see two nonlinear classifiers that make predictions based on features instead of distance/similarity: the decision tree and the multilayer perceptron (a basic type of neural network).

Decision Trees What color is the cat in this photo? Calico Orange Tabby Tuxedo

Decision Trees [Example tree, reconstructed from the slide diagram:]
Pattern?
  Stripes → Contains Color?
    Gray → Gray Tabby
    Orange → Orange Tabby
  Patches → Contains Color?
    Black → Tuxedo
    Orange → Calico

Decision Trees [Alternative tree, reconstructed from the slide diagram:]
# of Colors?
  1 → Contains Color?
    Gray → Gray Tabby
    Orange → Orange Tabby
  2 → Tuxedo
  3 → Calico

Decision Trees Decision tree classifiers are structured as a tree, where: nodes are features; edges are feature values (if the values are numeric, an edge usually corresponds to a range of values, e.g., x < 2.5); leaves are classes. To classify an instance: start at the root of the tree and follow the branches based on the feature values in that instance. The final node is the final prediction.
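
To make the classification procedure concrete, here is a minimal sketch (not from the slides) that walks a small tree like the cat-pattern example above; the nested-dict representation and the feature names are illustrative assumptions.

    # Hypothetical sketch: a decision tree as nested dicts. An internal node maps a
    # feature name to {feature value: subtree}; a leaf is just a class label.
    tree = {
        "Pattern": {
            "stripes": {"Contains Color": {"gray": "Gray Tabby", "orange": "Orange Tabby"}},
            "patches": {"Contains Color": {"black": "Tuxedo", "orange": "Calico"}},
        }
    }

    def classify(tree, instance):
        """Start at the root and follow the branch matching the instance's feature value."""
        node = tree
        while isinstance(node, dict):      # internal node
            feature = next(iter(node))     # the feature this node tests
            node = node[feature][instance[feature]]
        return node                        # a leaf is the final prediction

    print(classify(tree, {"Pattern": "patches", "Contains Color": "orange"}))  # Calico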

Decision Trees We won't cover how to learn a decision tree in detail in this class (see the book for more detail). The general idea:
1. Pick the feature that best distinguishes the classes. If you group the instances based on their value for that feature, some classes should become more likely; the distribution of classes in each group should have low entropy (high entropy means the classes are evenly distributed).
2. Recursively repeat for each group of instances.
3. When all instances in a group have the same label, set that class as the final node.
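
A rough sketch of step 1 (not the course's reference implementation): compute the entropy of each group's class distribution and pick the feature whose split gives the lowest weighted entropy. The function names are made up for illustration.

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a class distribution: low when one class dominates, high when even."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def split_entropy(instances, labels, feature):
        """Weighted average entropy of the groups formed by splitting on `feature`."""
        groups = {}
        for inst, label in zip(instances, labels):
            groups.setdefault(inst[feature], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    def best_feature(instances, labels):
        """Pick the feature that best distinguishes the classes (lowest post-split entropy)."""
        return min(instances[0], key=lambda f: split_entropy(instances, labels, f))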

Decision Trees Decision trees can easily overfit. Without doing anything extra, they will literally memorize training data! [Diagram: a full tree that splits on x_1 at the root, on x_2 at the next level, and on x_3 below that, with one leaf for each of the 8 possible combinations of the three binary feature values.] A tree can encode all possible combinations of feature values.

Decision Trees Two common techniques to avoid overfitting: restrict the maximum depth of the tree, or restrict the minimum number of instances that must remain at a node before splitting it further. If you stop growing the tree before all instances at a node belong to the same class, use the majority class within that group as the final class for the leaf node.
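
For example, if you use scikit-learn (an assumption; the slides do not name a library), these two restrictions correspond to the max_depth and min_samples_split parameters of DecisionTreeClassifier.

    from sklearn.tree import DecisionTreeClassifier

    # Tiny made-up dataset, only to show where the restrictions go.
    X = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 2], [2, 1]]
    y = [0, 0, 1, 0, 1, 1]

    # Restrict the maximum depth, and require a minimum number of instances
    # at a node before it can be split further; both limit overfitting.
    clf = DecisionTreeClassifier(max_depth=3, min_samples_split=4)
    clf.fit(X, y)
    print(clf.predict([[2, 2]]))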

Decision Trees One reason decision trees are popular is that the algorithm is relatively easy to understand and the classifiers are relatively interpretable. That is, it is possible to see how a classifier makes a decision. This stops being true if your decision tree becomes large; trees of depth 3 or less are easiest to visualize.

Decision Trees Decision trees are powerful because they automatically use conjunctions of features (e.g., Pattern="striped" AND Color="Orange"). This gives context to the feature values. In a decision tree, Color="Orange" can lead to a different prediction depending on the value of the Pattern feature. In a linear classifier, only one weight can be given to Color="Orange"; it can't be associated with different classes in different contexts.

Decision Trees Decision trees naturally handle multiclass classification without any modifications. Decision trees can also be used for regression instead of classification; a common implementation is for the final prediction to be the average label value of the training instances at the leaf.

Random Forests Random forests are a type of ensemble learning with decision trees (a forest is a set of trees). We'll revisit this later in the semester, but it's useful to know that random forests: are one of the most successful types of ensemble learning, and avoid overfitting better than individual decision trees.
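
A usage sketch, again assuming scikit-learn: RandomForestClassifier trains an ensemble of decision trees, and n_estimators controls how many trees are in the forest.

    from sklearn.ensemble import RandomForestClassifier

    # Made-up toy data, only to illustrate the API.
    X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]]
    y = [0, 0, 0, 1, 1, 1]

    # Each tree is trained on a random sample of the data (with random feature
    # subsets); their votes are combined, which typically overfits less than
    # any single tree.
    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(X, y)
    print(forest.predict([[1, 2]]))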

Neural Networks Recall from the book that the perceptron was inspired by the way neurons work: a perceptron fires only if its inputs sum above a threshold (that is, a perceptron outputs a positive label if the score is above the threshold, and a negative label otherwise). Also recall that a perceptron is also called an artificial neuron.

Neural Networks An artificial neural network is a collection of artificial neurons that interact with each other: the outputs of some are used as inputs for others. A multilayer perceptron is one type of neural network that combines multiple perceptrons: multiple layers of perceptrons, where each layer's output feeds into the next layer as input. This is called a feed-forward network.

Multilayer Perceptron Let's start with a cartoon about how a multilayer perceptron (MLP) works conceptually.

Multilayer Perceptron [Diagram: the features Contains Gray?, Contains Orange?, Contains Black?, # of Colors, and Color Diffusion feed into a unit labeled Tabby?] Train a perceptron to predict if the cat is a tabby.

Multilayer Perceptron [Diagram: the same features feed into a second unit labeled Multi-color?] Train another perceptron to predict if the cat is bi-color or tri-color.

Multilayer Perceptron [Diagram: the same features feed into a third unit labeled Ginger colors?] Train another perceptron to predict if the cat contains orange/red/brown colors.

Multilayer Perceptron [Diagram: the outputs Tabby?, Multi-color?, and Ginger colors? feed into a unit labeled Color Prediction.] Treat the outputs of your perceptrons as new features. Train another perceptron on these new features.

Multilayer Perceptron [Diagram: the full network; the features feed into the Tabby?, Multi-color?, and Ginger colors? units, whose outputs feed into a final Prediction unit.] Usually you don't/can't specify what the perceptrons should output to use as new features in the next layer.

Multilayer Perceptron [Diagram: the same network, but with the intermediate units unlabeled (marked ???).] Usually you don't/can't specify what the perceptrons should output to use as new features in the next layer.

Multilayer Perceptron [Diagram: the same network with unlabeled intermediate units.] Instead, train a network to learn something that will be useful for prediction.

Multilayer Perceptron [Diagram: the features feed into an input layer of unlabeled units; their outputs feed into a hidden layer, which produces the prediction.]

Multilayer Perceptron The input layer is the first set of perceptrons which output positive/negative based on the observed features in your data. A hidden layer is a set of perceptrons that uses the outputs of the previous layer as inputs, instead of using the original data. There can be multiple hidden layers! The final hidden layer is also called the output layer. Each perceptron in the network is called a unit.

Multilayer Perceptron [Diagram: the same network, with 3 input units and 1 hidden unit producing the prediction.]

Activation Functions Remember, the perceptron defines a score w^T x, and the score is then passed through an activation function that converts it into an output: ϕ(w^T x) = 1 if the score is above 0, and -1 otherwise. Logistic regression used the logistic function to convert the score into an output between 0 and 1: ϕ(w^T x) = 1 / (1 + exp(-w^T x)).
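
A small sketch of these two activation functions (illustrative code, not from the course):

    import math

    def perceptron_activation(score):
        """Perceptron: output 1 if the score w^T x is above 0, otherwise -1."""
        return 1 if score > 0 else -1

    def logistic_activation(score):
        """Logistic (sigmoid): squash the score into an output between 0 and 1."""
        return 1.0 / (1.0 + math.exp(-score))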

Activation Functions Neural networks usually also use the logistic function (or another sigmoid function) as the activation function. This is true even in a multilayer perceptron. Potentially confusing terminology: the units in a multilayer perceptron aren't technically perceptrons! The reason is that the perceptron's threshold function is not differentiable, so we can't calculate the gradient needed for learning.

Multilayer Perceptron The final classification can be written out in full: ϕ(w_211 ϕ(w_11^T x) + w_212 ϕ(w_12^T x) + w_213 ϕ(w_13^T x)). This is equivalent to writing ϕ(w_21^T y), where y = <ϕ(w_11^T x), ϕ(w_12^T x), ϕ(w_13^T x)> (above, the dot product has been expanded out).

Multilayer Perceptron The final classification can be written out in full: ϕ(w_211 ϕ(w_11^T x) + w_212 ϕ(w_12^T x) + w_213 ϕ(w_13^T x)). The terms w_11^T x, w_12^T x, and w_13^T x are the scores of the three perceptron units in the first layer.

Multilayer Perceptron The final classification can be written out in full: ϕ(w_211 ϕ(w_11^T x) + w_212 ϕ(w_12^T x) + w_213 ϕ(w_13^T x)). The terms ϕ(w_11^T x), ϕ(w_12^T x), and ϕ(w_13^T x) are the outputs of the three perceptron units in the first layer (the three scores passed through the activation function).

Multilayer Perceptron The final classification can be written out in full: ϕ(w_211 ϕ(w_11^T x) + w_212 ϕ(w_12^T x) + w_213 ϕ(w_13^T x)). The weighted sum w_211 ϕ(w_11^T x) + w_212 ϕ(w_12^T x) + w_213 ϕ(w_13^T x) is the score of the one perceptron unit in the second layer, which uses the three outputs of the first-layer units (their scores passed through the activation function) as its "features".

Multilayer Perceptron The final classification can be written out in full: ϕ(w_211 ϕ(w_11^T x) + w_212 ϕ(w_12^T x) + w_213 ϕ(w_13^T x)). The outermost ϕ gives the final output: the score of the one second-layer unit, built from the outputs of the three first-layer units (their scores passed through the activation function), is itself passed through the activation function.

Multilayer Perceptron The final classification can be written out in full: ϕ(w_211 ϕ(w_11^T x) + w_212 ϕ(w_12^T x) + w_213 ϕ(w_13^T x)). We can then define a loss function L(w) based on the difference between the classifications and the true labels, and minimize it with gradient descent (or related methods). An algorithm called backpropagation makes the gradient calculations more efficient. The loss function is non-convex: lots of local minima make it hard to find a good solution.
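
A minimal sketch of the forward computation in the formula above, with made-up placeholder weights; training (defining L(w) and minimizing it with gradient descent, using backpropagation for the gradients) is not shown here.

    import math

    def sigmoid(score):
        return 1.0 / (1.0 + math.exp(-score))

    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def mlp_forward(x, first_layer_weights, second_layer_weights):
        """Three first-layer units y_j = phi(w_1j^T x), then one output unit phi(w_21^T y)."""
        y = [sigmoid(dot(w, x)) for w in first_layer_weights]
        return sigmoid(dot(second_layer_weights, y))

    # Placeholder weights, only to show the shapes: 3 units over 2 features, then 1 output unit.
    first_layer = [[0.6, 0.6], [1.1, 1.1], [-0.4, 0.9]]
    second_layer = [1.0, -2.0, 0.5]
    print(mlp_forward([1.0, 0.0], first_layer, second_layer))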

Multilayer Perceptron How many hidden layers should there be? How many units should there be in each layer? You have to specify these when you run the algorithm. You can think of these as yet more hyperparameters to tune.
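
For instance, assuming scikit-learn's MLPClassifier (not mentioned on the slides), the number of hidden layers and the units per layer are passed in as a tuple and tuned like any other hyperparameter.

    from sklearn.neural_network import MLPClassifier

    # Made-up toy data; the point is only to show where the hyperparameters go.
    X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [3, 3]]
    y = [0, 0, 0, 1, 1, 1]

    # hidden_layer_sizes=(10, 5) means two hidden layers with 10 and 5 units.
    mlp = MLPClassifier(hidden_layer_sizes=(10, 5), activation="logistic", max_iter=2000)
    mlp.fit(X, y)
    print(mlp.predict([[2, 3]]))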

Nonlinear Example: XOR Consider two binary features x_1 and x_2, where you want to learn to output the function x_1 XOR x_2.
x_1  x_2  y
 0    0   0
 0    1   1
 1    0   1
 1    1   0
A linear classifier would not be able to learn this, but decision trees and neural networks can.

Nonlinear Example: XOR A decision tree can simply learn that: if x_1=0, output 0 if x_2=0 and output 1 if x_2=1; if x_1=1, output 1 if x_2=0 and output 0 if x_2=1. [Diagram: the root splits on x_1, each branch splits on x_2, and the four leaves are 0, 1, 1, 0.]
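
A quick check of this claim, assuming scikit-learn: an unrestricted decision tree fits the four XOR examples exactly.

    from sklearn.tree import DecisionTreeClassifier

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 1, 1, 0]

    # With no depth restriction, the tree keeps splitting until every leaf is pure,
    # so it can represent XOR exactly: split on x_1, then on x_2 within each branch.
    tree = DecisionTreeClassifier()
    tree.fit(X, y)
    print(tree.predict(X))  # [0 1 1 0]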

Nonlinear Example: XOR We can learn this with an MLP with 2 input units and 1 hidden unit. Input units: w_11 = <0.6, 0.6>, b_11 = -1.0; w_12 = <1.1, 1.1>, b_12 = -1.0. Unit 1 will only output 1 if both x_1=1 and x_2=1. Unit 2 will only output 1 if either x_1=1 or x_2=1.

Nonlinear Example: XOR Input units: Unit 1 will only output 1 if both x_1=1 and x_2=1; Unit 2 will only output 1 if either x_1=1 or x_2=1. Hidden unit: w_2 = <-2.0, 1.1>, b_2 = -1.0. If feature 1 (Unit 1's output) is 1, this will output 0. If both features (the outputs of Units 1 and 2) are 0, this will output 0. If only feature 2 (Unit 2's output) is 1, this will output 1.
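
The following sketch verifies these weights by hand, using a step activation (output 1 if the score is above 0, else 0) as the unit descriptions on the slide imply; it is illustrative code, not from the course.

    def unit(weights, bias, inputs):
        """Perceptron-style unit: output 1 if w^T x + b > 0, otherwise 0."""
        score = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1 if score > 0 else 0

    def xor_mlp(x1, x2):
        u1 = unit([0.6, 0.6], -1.0, [x1, x2])     # fires only when both inputs are 1 (AND)
        u2 = unit([1.1, 1.1], -1.0, [x1, x2])     # fires when either input is 1 (OR)
        return unit([-2.0, 1.1], -1.0, [u1, u2])  # OR but not AND, i.e. XOR

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor_mlp(x1, x2))  # prints 0 0 0, 0 1 1, 1 0 1, 1 1 0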

What Classifier to Use?

Data-driven Advice for Applying Machine Learning to Bioinformatics Problems. Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik, Jason H. Moore.