Supervised Learning. George Konidaris

Supervised Learning George Konidaris gdk@cs.brown.edu Fall 2017

Machine Learning. Subfield of AI concerned with learning from data. Broadly: using experience to improve performance on some task (Tom Mitchell, 1997).

A breakthrough in machine learning would be worth ten Microsofts. - Bill Gates

Supervised Learning. Input: X = {x1, ..., xn} (inputs) and Y = {y1, ..., yn} (labels), the training data. Learn to predict new labels. Given x: y?

Classification vs. Regression. If the set of labels Y is discrete: classification (minimize the number of errors). If Y is real-valued: regression (minimize the sum of squared errors). Today we focus on classification.

Supervised Learning. Formal definition: given training data X = {x1, ..., xn} (inputs) and Y = {y1, ..., yn} (labels), produce a decision function $f : X \to Y$ that minimizes the error $\sum_i \mathrm{err}(f(x_i), y_i)$.
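To make the error sum concrete, here is a minimal sketch (not from the slides), assuming the 0/1 error function typically used for classification; all names are illustrative.

```python
# A minimal sketch of the error sum above, assuming err is 0/1 loss for
# classification. Names are illustrative, not from the lecture.
def empirical_error(f, X, Y):
    """Return sum_i err(f(x_i), y_i) with err = 0/1 loss."""
    return sum(1 for x, y in zip(X, Y) if f(x) != y)

# Usage: a trivial decision function that always predicts label 1.
X = [[0, 1], [1, 0], [1, 1]]
Y = [1, 0, 1]
print(empirical_error(lambda x: 1, X, Y))  # 1 mistake (the second point)
```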

Test/Train Split. Minimize error measured on what? We don't get to see future data. We could measure error on the data we train with, but that may not generalize. General principle: do not measure error on the data you train on! Methodology: split the data into a training set and a test set; fit f using the training set; measure error on the test set. Always do this.

Test/Train Split. What if you choose unluckily? And aren't we wasting data? A common alternative is k-fold cross-validation. Repeat k times: partition the data into a training set (n - n/k points) and a test set (n/k points); train on the training set and test on the test set. Average the results across the k choices of test set.
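A sketch of that k-fold procedure, assuming a generic `train` function and an `evaluate` error measure (neither is specified on the slides):

```python
import random

def k_fold_cv(X, Y, k, train, evaluate):
    """Average test error across k train/test partitions (a sketch)."""
    indices = list(range(len(X)))
    random.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]    # k disjoint test sets of ~n/k points
    errors = []
    for fold in folds:
        test_set = set(fold)
        train_idx = [i for i in indices if i not in test_set]
        f = train([X[i] for i in train_idx], [Y[i] for i in train_idx])
        errors.append(evaluate(f, [X[i] for i in fold], [Y[i] for i in fold]))
    return sum(errors) / k                       # average over the k test sets
```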

Key Idea: Hypothesis Space. Typically a fixed representation of the classifier, with the learning algorithm constructed to match. The representation induces a class of functions F, from which to find f. F is known as the hypothesis space. Tradeoff: power vs. expressibility vs. data efficiency. Not every F can represent every function. F = {f1, f2, ..., fn}: the set of possible functions that can be returned, typically (but not always) an infinite set. Learning is finding the $f_i \in F$ that minimizes error.

Key Idea: Decision Boundary. [Figure: a scatter of + and - labeled points with a boundary between them.] The boundary at which the label changes.

Decision Trees. Let's assume two classes (true and false) and inputs that are vectors of discrete values. What's the simplest thing we could do? How about some if-then rules? A relatively simple classifier: a tree of tests. Evaluate the test for each xi and follow the branch; leaves are class labels.

Decision Trees. xi = [a, b, c], each boolean. [Figure: an example tree whose internal nodes test a, b, and c (true/false branches) and whose leaves give the labels y=1 or y=2.]

Decision Trees. How to make one? Given X = {x1, ..., xn} and Y = {y1, ..., yn}, repeat: if all the labels are the same, we have a leaf node; otherwise pick an attribute, split the data based on its value, and recurse on each half. If we run out of splits and the data are not perfectly in one class, take the majority (max) label.
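A rough sketch of that recursive procedure for boolean attributes (the attribute-choosing function is left as a parameter; the information-gain criterion for it appears on the next slides):

```python
from collections import Counter

def build_tree(X, Y, attributes, pick_attribute):
    """Grow a decision tree over boolean attributes (a sketch, not lecture code)."""
    if len(set(Y)) == 1:                       # all labels the same: leaf node
        return Y[0]
    if not attributes:                         # out of splits: take the majority label
        return Counter(Y).most_common(1)[0][0]
    a = pick_attribute(X, Y, attributes)       # e.g. the attribute with max information gain
    rest = [b for b in attributes if b != a]
    children = {}
    for value in (True, False):                # split the data on a's value and recurse
        idx = [i for i, x in enumerate(X) if x[a] == value]
        if idx:
            children[value] = build_tree([X[i] for i in idx], [Y[i] for i in idx],
                                         rest, pick_attribute)
        else:                                  # empty branch: fall back to majority label
            children[value] = Counter(Y).most_common(1)[0][0]
    return (a, children)
```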

Decision Trees (worked example). Given the data below, grow the tree step by step: split on a first; the a = true branch contains only label 1, so it becomes a leaf y=1. On the a = false branch, split on b: b = true gives the leaf y=2, b = false gives the leaf y=1.

A B C | L
T F T | 1
T T F | 1
T F F | 1
F T F | 2
F T T | 2
F T F | 2
F F T | 1
F F F | 1

Attribute Picking. Key question: which attribute to split over? Information contained in a data set: $I(A) = -f_1 \log_2 f_1 - f_2 \log_2 f_2$ (how many bits of information do we need to determine the label in a dataset?). Pick the attribute with the maximum information gain: $\mathrm{Gain}(B) = I(A) - \sum_i f_i I(B_i)$.

Example. Apply $\mathrm{Gain}(B) = I(A) - \sum_i f_i I(B_i)$, with $I(A) = -f_1 \log_2 f_1 - f_2 \log_2 f_2$, to the data table above.
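A sketch of both formulas applied to that example table (the eight rows of A, B, C and the label L):

```python
import math

# The example table: (A, B, C, label).
rows = [
    (True, False, True, 1), (True, True, False, 1), (True, False, False, 1),
    (False, True, False, 2), (False, True, True, 2), (False, True, False, 2),
    (False, False, True, 1), (False, False, False, 1),
]

def information(labels):
    """I = -sum over classes of f_c log2 f_c, with f_c the fraction of each label."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain(rows, attr):
    """Gain(attr) = I(all labels) - sum over attr values of f_i * I(branch labels)."""
    labels = [r[-1] for r in rows]
    g = information(labels)
    for value in (True, False):
        branch = [r[-1] for r in rows if r[attr] == value]
        if branch:
            g -= (len(branch) / len(rows)) * information(branch)
    return g

for name, attr in [("A", 0), ("B", 1), ("C", 2)]:
    print(name, round(gain(rows, attr), 3))   # information gain of splitting on each attribute
```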

Decision Trees. What if the inputs are real-valued? Have inequalities rather than equalities, and variables can repeat. [Figure: an example tree testing a > 3.1 (true: y=1; false: test b < 0.6, true: y=2, false: y=1).]

Hypothesis Class What is the hypothesis class for a decision tree? Discrete inputs? Real-valued inputs?

The Perceptron. If your input (xi) is real-valued, why not learn an explicit decision boundary? [Figure: + and - points separated by a line.]

The Perceptron. If x = [x(1), ..., x(n)]: create an n-d line, with a slope for each x(i) and a constant offset (compare y = wx + c: gradient and offset). $f(x) = \mathrm{sign}(w \cdot x + c)$.
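A minimal sketch of that decision rule (illustrative names, with labels in {-1, +1}):

```python
# f(x) = sign(w . x + c), returning +1 or -1 (a sketch; names are illustrative).
def predict(w, c, x):
    activation = sum(wj * xj for wj, xj in zip(w, x)) + c
    return 1 if activation >= 0 else -1
```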

The Perceptron. Which side of the line are you on? $w \cdot x = \|w\|\,\|x\| \cos(\theta)$. [Figure: vectors w and x with the angle θ between them.]

The Perceptron. How do you reduce error? $e = (y_i - (w \cdot x_i + c))^2$, so $\frac{\partial e}{\partial w_j} = -2(y_i - (w \cdot x_i + c))\, x_i(j)$; descend this gradient to reduce the error.

The Perceptron Algorithm. Assume you have a batch of data: X = {x1, ..., xn}, Y = {y1, ..., yn}. Set w, c to 0. For each xi: predict zi = sign(w · xi + c); if zi != yi: w = w + a(yi - zi)xi, where a is the learning rate. Converges if the data are linearly separable. https://www.youtube.com/watch?v=kcmiq3zwyro
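A rough sketch of that batch algorithm, reusing the `predict` function from the earlier sketch. The slide only shows the weight update; updating the offset c the same way is a common convention and is an assumption here.

```python
def train_perceptron(X, Y, epochs=100, a=0.1):
    """Mistake-driven perceptron updates over the batch (a sketch)."""
    w = [0.0] * len(X[0])                       # set w, c to 0
    c = 0.0
    for _ in range(epochs):                     # converges if the data are linearly separable
        for x, y in zip(X, Y):
            z = predict(w, c, x)                # z_i = sign(w . x_i + c)
            if z != y:                          # update only on mistakes
                w = [wj + a * (y - z) * xj for wj, xj in zip(w, x)]
                c = c + a * (y - z)             # offset update: an assumption, not on the slide
    return w, c
```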

Probabilities. What if you want a probabilistic classifier? Instead of sign, squash the output of the linear sum down to [0, 1]: $\sigma(w \cdot x + c)$. The resulting algorithm: logistic regression.
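A sketch of the squashed version, using the logistic function σ (names illustrative):

```python
import math

def sigmoid(t):
    """The squashing function: sigma(t) = 1 / (1 + e^-t), mapping to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_proba(w, c, x):
    """Probability-style output sigma(w . x + c) instead of a hard sign."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + c)
```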

Frank Rosenblatt Built the Mark I in 1960.

Perceptrons. What can't you do? [Figure: + and - points that no single line can separate.]

Perceptrons (Minsky and Papert, 1969).

Neural Networks. $\sigma(w \cdot x + c)$: logistic regression.

Neurons

Neural Networks. [Figure: a network with an input layer (x1, x2), a hidden layer (h1, h2, h3), and an output layer (o1, o2).]

Neural Networks (feed forward). Each value is computed from the layer below, with input $x_1, x_2 \in [0, 1]$:
$o_1 = \sigma(w^{o_1}_1 h_1 + w^{o_1}_2 h_2 + w^{o_1}_3 h_3 + w^{o_1}_4)$
$o_2 = \sigma(w^{o_2}_1 h_1 + w^{o_2}_2 h_2 + w^{o_2}_3 h_3 + w^{o_2}_4)$
$h_1 = \sigma(w^{h_1}_1 x_1 + w^{h_1}_2 x_2 + w^{h_1}_3)$
$h_2 = \sigma(w^{h_2}_1 x_1 + w^{h_2}_2 x_2 + w^{h_2}_3)$
$h_3 = \sigma(w^{h_3}_1 x_1 + w^{h_3}_2 x_2 + w^{h_3}_3)$

Neural Networks. $o_1 = \sigma(w_1 h_1 + w_2 h_2 + w_3 h_3 + w_4)$ is the probability of class 1, and $o_2 = \sigma(w_1 h_1 + w_2 h_2 + w_3 h_3 + w_4)$ (with its own weights) is the probability of class 2. Each hidden unit is $h_j = \sigma(w_1 x_1 + w_2 x_2 + w_3)$, again with its own weights. The w's are the weights (parameters); $x_1, x_2 \in [0, 1]$ are the input data.
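A sketch of the feed-forward computation for this 2-3-2 network. The weight layout is an assumption: each unit's list holds its input weights followed by its bias.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def feed_forward(W_h, W_o, x1, x2):
    """Forward pass: W_h has three rows [w1, w2, bias]; W_o has two rows [w1, w2, w3, bias]."""
    h = [sigmoid(w[0] * x1 + w[1] * x2 + w[2]) for w in W_h]                     # hidden layer
    o = [sigmoid(w[0] * h[0] + w[1] * h[1] + w[2] * h[2] + w[3]) for w in W_o]   # output layer
    return o   # o[0], o[1]: the two class probabilities
```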

Neural Classification. A neural network is just a parametrized function: y = f(x, w). How to train it? Write down an error function, $\sum_i (y_i - f(x_i, w))^2$, and minimize it (with respect to w)!

Neural Classification. Recall that the squashing function is defined as $\sigma(t) = \frac{1}{1 + e^{-t}}$, and $\frac{\partial \sigma(t)}{\partial t} = \sigma(t)(1 - \sigma(t))$.
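For reference, the quoted derivative follows in one line from the definition:

```latex
\frac{\partial \sigma(t)}{\partial t}
  = \frac{\partial}{\partial t}\left(\frac{1}{1 + e^{-t}}\right)
  = \frac{e^{-t}}{(1 + e^{-t})^{2}}
  = \frac{1}{1 + e^{-t}} \cdot \frac{e^{-t}}{1 + e^{-t}}
  = \sigma(t)\,\bigl(1 - \sigma(t)\bigr)
```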

Neural Classification. OK, so we can minimize error using gradient descent. To do so, we must calculate $\frac{\partial e}{\partial w_i}$ for each $w_i$. How to do so? Easy for output layers (chain rule): $\frac{\partial e}{\partial w_i} = \frac{\partial (y_i - o_i)^2}{\partial w_i} = 2(y_i - o_i)\,\frac{\partial (y_i - o_i)}{\partial w_i} = 2(o_i - y_i)\, o_i (1 - o_i)$. Interior weights: repeat the chain rule application.
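A rough sketch of one gradient-descent step for the 2-3-2 network, applying the chain rule as above (squared error, sigmoid units). It reuses the weight layout from the feed-forward sketch and is an illustration, not the lecture's code.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def backprop_step(W_h, W_o, x, y, alpha=0.1):
    """One forward/backward pass, updating W_h and W_o in place (a sketch)."""
    # forward pass (same layout as the feed-forward sketch)
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in W_h]
    o = [sigmoid(w[0] * h[0] + w[1] * h[1] + w[2] * h[2] + w[3]) for w in W_o]
    # output-layer deltas: 2(o_i - y_i) * o_i * (1 - o_i), as derived above
    d_o = [2 * (oi - yi) * oi * (1 - oi) for oi, yi in zip(o, y)]
    # hidden-layer deltas: chain rule back through the output weights
    d_h = [h[j] * (1 - h[j]) * sum(d_o[i] * W_o[i][j] for i in range(len(W_o)))
           for j in range(len(W_h))]
    # gradient-descent updates (alpha is the step size)
    for i, w in enumerate(W_o):
        for j in range(3):
            w[j] -= alpha * d_o[i] * h[j]
        w[3] -= alpha * d_o[i]                   # output bias
    for j, w in enumerate(W_h):
        w[0] -= alpha * d_h[j] * x[0]
        w[1] -= alpha * d_h[j] * x[1]
        w[2] -= alpha * d_h[j]                   # hidden bias
    return o
```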

Backpropagation This algorithm is called backpropagation. Bryson and Ho, 1969 Rumelhart, Hinton, and Williams, 1986.

Deep Neural Networks. [Figure: a network with input layer (x1, x2), many hidden layers (h11, h12, h13, ..., hn1, hn2, hn3), and output layer (o1, o2).]

Applications: fraud detection, internet advertising, friend or link prediction, sentiment analysis, face recognition, spam filtering.

Applications: the MNIST data set. Training set: 60k digits; test set: 10k digits.