CSC242: Intro to AI. Lecture 21


Administrivia: Project 4 (homeworks 18 & 19) due Mon Apr 16, 11:59 PM. Posters Apr 24 and 26: you need an idea, and you need to present it nicely on 2-wide by 4-high landscape pages (approx. 22" x 34"). Look around CSB for poster examples.

Learning (Regression, Linear Classifiers, Neural Nets)

[Figure (a): data points in the (x, f(x)) plane with the fitted line y = 0.4x + 3]

[Figure (c): another plot of f(x) vs. x]

Outline
Regression: learning a function from data
Linear regression: a linear function
Linear classifiers: use a line to separate the data; hard and soft thresholds
Neural nets & support vector machines: powerful applications of linear classifiers

A line y = mx + b has slope m and intercept b. With weight vector w = [w0, w1] and input vector x = [1, x]: y = w0 + w1·x = w · x

Linear Regression: Given a set of N data points (x, y), find the linear function (line) h_w(x) = w0 + w1·x that best fits the data.

A data point (x, y) comes from the true function y = f(x); the hypothesis h_w(x) = w1·x + w0 gives the prediction ŷ = h_w(x) = w1·x + w0.

Loss functions comparing a data point (x, y) from y = f(x) with the prediction ŷ = h_w(x) = w1·x + w0:
L1(y, ŷ) = |y − ŷ|
L2(y, ŷ) = (y − ŷ)²
L0/1(y, ŷ) = 0 if y = ŷ, else 1

For h_w(x) = w1·x + w0, the squared-error loss on one point is L(y, h_w(x)) = L2(y, h_w(x)) = (y − h_w(x))² = (y − (w1·x + w0))²

For h_w(x) = w1·x + w0, the total loss over the data is
L(h_w) = Σ_{j=1..N} L2(y_j, h_w(x_j)) = Σ_{j=1..N} (y_j − h_w(x_j))² = Σ_{j=1..N} (y_j − (w1·x_j + w0))²

Linear Regression: Find w = [w0, w1] that minimizes L(h_w):
w* = argmin_w L(h_w) = argmin_w Σ_{j=1..N} (y_j − (w1·x_j + w0))²

Closed-form solution:
w1 = [N(Σ x_j y_j) − (Σ x_j)(Σ y_j)] / [N(Σ x_j²) − (Σ x_j)²]
w0 = [Σ y_j − w1(Σ x_j)] / N
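A minimal sketch of these closed-form formulas in Python (the function name and toy data are my own, not from the slides):

```python
# Closed-form least-squares fit of h_w(x) = w0 + w1*x to N data points.
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    w0 = (sy - w1 * sx) / n
    return w0, w1

# Toy data lying roughly on y = 2x + 1
print(fit_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8]))
```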

[Figure: house price in $1000 vs. house size in square feet, with the fitted line y = 0.232x + 246]

General Regression: Find w = [w0, w1] that minimizes L(h_w): w* = argmin_w L(h_w)

Weight Space: [Figure: the loss plotted as a surface over the weights (w0, w1)]

Gradient Descent
w ← any point in parameter space
loop until convergence do
    for each w_i in w do
        w_i ← w_i − α ∂L(w)/∂w_i   (update rule)
Here α is the learning rate and ∂L(w)/∂w_i is the gradient of the loss function along the w_i axis.

Gradient Descent for Linear Regression:
w0 ← w0 + α Σ_j (y_j − h_w(x_j))
w1 ← w1 + α Σ_j (y_j − h_w(x_j)) x_j
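A short sketch of this batch update loop, assuming a fixed learning rate and a fixed number of iterations (both choices are mine, not from the slides):

```python
# Batch gradient descent for h_w(x) = w0 + w1*x using the update rules above.
def gd_line(xs, ys, alpha=0.01, steps=5000):
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        errs = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        w0 += alpha * sum(errs)                             # w0 update
        w1 += alpha * sum(e * x for e, x in zip(errs, xs))  # w1 update
    return w0, w1

print(gd_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8]))
```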

Gradient Descent in Weight Space: [Figure: the loss surface over (w0, w1), with gradient descent converging to the minimizing weights w* = [w0, w1]]

Batch Gradient Descent Use update rule over all training data every step Guaranteed to converge; but slow

Stochastic Gradient Descent: Pick data points at random and apply the single-point update. Convergence is not guaranteed, but it is much faster. Requires a decreasing learning-rate schedule (as in simulated annealing).
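A sketch of the single-point update with one possible decreasing learning-rate schedule (the particular schedule and constants are my own illustration, not from the slides):

```python
import random

# Stochastic gradient descent: update on one randomly chosen point per step.
def sgd_line(xs, ys, alpha0=0.1, steps=20000):
    w0, w1 = 0.0, 0.0
    for t in range(1, steps + 1):
        alpha = alpha0 / (1 + t / 1000)   # one choice of decaying schedule
        j = random.randrange(len(xs))     # pick a data point at random
        err = ys[j] - (w0 + w1 * xs[j])
        w0 += alpha * err
        w1 += alpha * err * xs[j]
    return w0, w1

print(sgd_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8]))
```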

Multivariate Linear Regression: The input x is not just a single value but a vector of n values, x = [x1, x2, ..., xn]. Example: x1 = weight of car, x2 = size of engine, x3 = color of car; f(x) = fuel economy of the car.

Multivariate Linear Regression hypothesis space:
h_w(x) = w0 + w1·x1 + w2·x2 + ... + wn·xn = w0 + Σ_i w_i·x_i
With w = [w0, w1, w2, ..., wn] and x = [1, x1, x2, ..., xn]:
h_w(x) = w · x = Σ_i w_i·x_i

Multivariate Linear Regression: w* = argmin_w Σ_j L2(y_j, w · x_j)

Gradient Descent for Multivariate Linear Regression: w_i ← w_i + α Σ_j x_{j,i} (y_j − h_w(x_j))

Linear Regression using Gradient Descent. Goal: w* = argmin_w L(h_w). Update rule: w_i ← w_i + α Σ_j x_{j,i} (y_j − h_w(x_j))
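A vectorized sketch of this update rule using NumPy, assuming the examples are stacked into a matrix X with a leading column of 1s (variable names are mine, not from the slides):

```python
import numpy as np

# Batch gradient descent for multivariate linear regression, h_w(x) = w . x.
def gd_multivariate(X, y, alpha=0.01, steps=10000):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        errs = y - X @ w           # (y_j - h_w(x_j)) for every example j
        w += alpha * X.T @ errs    # w_i += alpha * sum_j x_{j,i} * err_j
    return w

X = [[1, 0], [1, 1], [1, 2], [1, 3]]   # bias column of 1s plus one feature
print(gd_multivariate(X, [1.1, 2.9, 5.2, 6.8]))
```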

Linear Regression using Gradient Descent Learning (estimating) a function from examples: supervised learning Minimizing loss between hypothesis and actual values (typically L2 loss) Solve using gradient descent (iterative, local search method)

[Figure: scatter plot of two classes of seismic data points in the (x1, x2) plane, shown on three consecutive slides]

Classification: Given training data x = (x1, x2), learn a hypothesis h such that h(x) = 0 if x is from an earthquake and h(x) = 1 if x is from an explosion.

Decision Boundary: A path (or surface, in higher dimensions) that separates the two classes: h(x) > 0 if x is from an earthquake, h(x) < 0 if x is from an explosion.

[Figure: seismic data in the (x1, x2) plane with the separating line x2 = 1.7·x1 − 4.9]

Linear Separator: The decision boundary is a line in 2D, a plane in 3D, a hyperplane in n dimensions. Data that admit a linear separator are said to be linearly separable.

Linear Classifier: The boundary is w0 + w1·x1 + w2·x2 = 0, i.e. w · x = 0. All instances of one class are above the line (w · x > 0) and all instances of the other class are below it (w · x < 0). h_w(x) = Threshold(w · x)

Hard Threshold: [Figure: step function plot] Threshold(z) = 1 if z ≥ 0, 0 otherwise

Linear Classifier: h_w(x) = Threshold(w · x), with w* = argmin_w L(h_w)

Perceptron Learning Rule: w_i ← w_i + α (y − h_w(x)) x_i
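A sketch of this rule applied repeatedly to a tiny, linearly separable dataset (the data and names below are hypothetical, for illustration only):

```python
# Perceptron learning rule with a hard threshold; x includes a leading 1
# so that w[0] acts as the bias weight.
def threshold(z):
    return 1 if z >= 0 else 0

def perceptron_update(w, x, y, alpha=0.1):
    pred = threshold(sum(wi * xi for wi, xi in zip(w, x)))
    return [wi + alpha * (y - pred) * xi for wi, xi in zip(w, x)]

# Toy separable data: class 1 iff x1 > x2
data = [([1, 2.0, 1.0], 1), ([1, 1.0, 3.0], 0), ([1, 4.0, 0.5], 1)]
w = [0.0, 0.0, 0.0]
for _ in range(20):            # several passes over the data
    for x, y in data:
        w = perceptron_update(w, x, y)
print(w)
```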

[Figure: proportion correct vs. number of weight updates for the perceptron learning rule, N = 63 examples]

Linear Classifier with Hard Threshold: Learn with gradient descent using the perceptron learning rule, just like linear regression. Convergence isn't pretty, but it works.

[Figure: seismic data in the (x1, x2) plane]

[Figure: two learning curves of proportion correct vs. number of weight updates: fixed α (left) and α decaying as O(1/t) (right)]

[Figure: seismic data in the (x1, x2) plane]

Hard Threshold: [Figure: step function plot] Threshold(z) = 1 if z ≥ 0, 0 otherwise

Soft Threshold: [Figure: logistic curve] Logistic(z) = 1 / (1 + e^(−z))

Logistic Regression: h_w(x) = Logistic(w · x) = 1 / (1 + e^(−w·x)), with w* = argmin_w L(h_w)
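A sketch of one way to fit this with gradient descent on the L2 loss; the slide does not spell out the update rule, so treat this as an illustration in which the chain rule adds a factor h·(1 − h) compared with the perceptron rule (data and names are hypothetical):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# One gradient step on a single example for h_w(x) = Logistic(w . x)
# with squared-error loss; x includes a leading 1 for the bias weight.
def logistic_update(w, x, y, alpha=0.5):
    h = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    grad = (y - h) * h * (1 - h)           # chain rule through the logistic
    return [wi + alpha * grad * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]
data = [([1, 2.0, 1.0], 1), ([1, 1.0, 3.0], 0), ([1, 4.0, 0.5], 1)]
for _ in range(1000):
    for x, y in data:
        w = logistic_update(w, x, y)
print(w)
```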

[Figure: three learning curves of squared error per example vs. number of weight updates]

Learning Linear Classifiers: We can learn linear classifiers fairly effectively from data, building on linear regression. Hard threshold: unpredictable, not robust to noise. Soft threshold (logistic regression): slower, but more predictable and robust to noisy data.

[Figure: a biological neuron, with dendrites and axon labeled]

Linear Classifier (revisited): The boundary is w0 + w1·x1 + w2·x2 = 0, i.e. w · x = 0; one class has w · x > 0, the other w · x < 0. Now h_w(x) = Logistic(w · x), with the soft threshold in place of Threshold(w · x).

Neuron (unit): [Figure: input links a_i with weights w_{i,j}, a bias input a_0 = 1 with bias weight w_{0,j}, input function in_j = Σ_i w_{i,j} a_i, activation function g, and output a_j on the output links]
a_j = g(in_j) = g(Σ_i w_{i,j} a_i)

Neural Network: A set of linear classifiers with logistic thresholds ("units"), with outputs connected to inputs. Feed-forward, or recurrent (with loops). Typically layered: input, hidden, output.

Single-Layer Feed-Forward NNs: [Figure: input units connected directly to output units]

Single-Layer Feed-Forward NN: [Figure (a): input units 1 and 2 connected to output units 3 and 4 by weights w_{1,3}, w_{1,4}, w_{2,3}, w_{2,4}]
output_j = g(Σ_i w_{i,j} a_i) = g(w_{1,j} a_1 + w_{2,j} a_2)

Multi-Layer Feed-Forward NNs: [Figure: input layer, hidden layer, output layer]

Multi-Layer Feed-Forward NNs: [Figure (b): input units 1 and 2 feed hidden units 3 and 4 (weights w_{1,3}, w_{1,4}, w_{2,3}, w_{2,4}), which feed output units 5 and 6 (weights w_{3,5}, w_{3,6}, w_{4,5}, w_{4,6})]
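A sketch of the forward pass for this 2-2-2 network using logistic units; bias weights are omitted for brevity (an assumption of mine), and the dictionary keys mirror the w_{i,j} labels in the figure:

```python
import math

def g(z):
    # logistic activation function
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass: inputs a1, a2 -> hidden units 3, 4 -> output units 5, 6.
def forward(a1, a2, w):
    a3 = g(w[(1, 3)] * a1 + w[(2, 3)] * a2)
    a4 = g(w[(1, 4)] * a1 + w[(2, 4)] * a2)
    a5 = g(w[(3, 5)] * a3 + w[(4, 5)] * a4)
    a6 = g(w[(3, 6)] * a3 + w[(4, 6)] * a4)
    return a5, a6

w = {(1, 3): 0.5, (2, 3): -0.4, (1, 4): 0.3, (2, 4): 0.8,
     (3, 5): 1.0, (4, 5): -1.0, (3, 6): 0.2, (4, 6): 0.7}
print(forward(0.6, 0.1, w))
```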

Recurrent NNs Allow connections from outputs to inputs => Feedback loops! Mathematics is daunting...

Learning in Neural Networks Need to learn weights for each connection between nodes Multiple logistic regression problems Changes to weights of one node change its output => change inputs to other nodes Convergence is complicated

Neural Nets Linear classifiers with sigmoid threshold functions, connected in a graph topology Learns classification function from inputs to outputs

Support Vector Machines Learn a maximum margin separator Decision boundary with largest possible distance to example points Learns linear separator (but possibly in higher-dimensional space) Generalize well Can represent complex functions (boundaries) without overfitting

Corinna Cortes 2008 ACM Paris Kanellakis Theory and Practice Award (with Vladimir Vapnik)

Summary
Linear regression: fitting a line to data
Linear classifiers: use a line to separate the data; gradient descent for finding the weights; hard threshold (perceptron learning rule) or logistic (sigmoid) threshold
Neural networks: networks of linear classifiers; can learn nonlinear functions
Support vector machines: state of the art for supervised learning of linear classifiers

For Next Time: AIMA 20.0-20.2.2; 20.2.5 fyi; 20.3-20.3.4