CSC242: Intro to AI Lecture 21
Administrivia
Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM
Posters Apr 24 and 26: you need an idea, and you need to present it nicely on 2-wide by 4-high landscape pages (approx. 22" x 34")
Look around CSB for poster examples
Learning (Regression, Linear Classifiers, Neural Nets)
[Figure: plots of f(x) vs. x; panel (a) shows data fit by the line y = 0.4x + 3; panel (c) shows a second data set]
Outline
Regression: learning a function from data
Linear regression: linear function
Linear classifiers: using a line to separate the data
Hard and soft thresholds
Neural nets & support vector machines: powerful applications of linear classifiers
y = mx + b (slope m, y-intercept b)
Let w = [w0, w1] and x = [1, x]; then
y = w · x = w0 + w1x = w1x + w0
Linear Regression
Given a set of N data points (x, y), find the linear function (line) h_w(x) = w0 + w1x that best fits the data.
[Figure: a data point plotted against the hypothesis line]
Data point (x, y) from y = f(x)
Hypothesis: ŷ = h_w(x) = w1x + w0
Loss of predicting ŷ = h_w(x) when the true value is y:
L1(y, ŷ) = |y − ŷ| (absolute-value loss)
L2(y, ŷ) = (y − ŷ)² (squared-error loss)
L0/1(y, ŷ) = 0 if y = ŷ, else 1 (0/1 loss)
h_w(x) = w1x + w0
L(y, h_w(x)) = L2(y, h_w(x)) = (y − h_w(x))² = (y − (w1x + w0))²
h_w(x) = w1x + w0
L(h_w) = Σ_{j=1}^{N} L2(y_j, h_w(x_j)) = Σ_{j=1}^{N} (y_j − h_w(x_j))² = Σ_{j=1}^{N} (y_j − (w1x_j + w0))²
Linear Regression
Find w = [w0, w1] that minimizes L(h_w):
w* = argmin_w L(h_w) = argmin_w Σ_{j=1}^{N} (y_j − (w1x_j + w0))²
The minimum has a closed-form solution:
w1 = (N Σ x_j y_j − (Σ x_j)(Σ y_j)) / (N Σ x_j² − (Σ x_j)²)
w0 = (Σ y_j − w1 Σ x_j) / N
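The closed-form solution is easy to compute directly. A minimal Python sketch (the example data here is made up):

```python
# Closed-form simple linear regression, implementing the formulas above.
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    w0 = (sy - w1 * sx) / n
    return w0, w1

# Made-up data lying near y = 2x + 1:
w0, w1 = fit_line([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
print(w0, w1)   # roughly 1.15 and 1.94
```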
[Figure: house price (in $1000s) vs. house size (in square feet), fit by the line y = 0.232x + 246]
General Regression
Find w = [w0, w1] that minimizes L(h_w): w* = argmin_w L(h_w)
Weight Space
[Figure: loss surface plotted over weight space, with axes w0 and w1]
Gradient Descent
w ← any point in parameter space
loop until convergence do
    for each w_i in w do
        w_i ← w_i − α ∂L(w)/∂w_i    (update rule)
Here α is the learning rate, and ∂L(w)/∂w_i is the gradient of the loss function along the w_i axis.
Gradient Descent for Linear Regression
w0 ← w0 + α Σ_j (y_j − h_w(x_j))
w1 ← w1 + α Σ_j (y_j − h_w(x_j)) x_j
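In code, this batch update might look like the following sketch (Python; the choices of α and the step count are arbitrary):

```python
# Batch gradient descent for h_w(x) = w1*x + w0; every step uses all N points.
def gd_fit(xs, ys, alpha=0.01, steps=5000):
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        errs = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        w0 += alpha * sum(errs)                              # update for w0
        w1 += alpha * sum(e * x for e, x in zip(errs, xs))   # update for w1
    return w0, w1

print(gd_fit([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8]))  # close to the closed form
```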
Gradient Descent in Weight Space
[Figure: descent on the loss surface toward the minimum w* = [w0, w1], with axes w0 and w1]
Batch Gradient Descent
Use the update rule over all the training data at every step
Guaranteed to converge, but slow
Stochastic Gradient Descent
Pick data points at random and apply the single-point update
Not guaranteed to converge, but much faster
Requires a decreasing learning-rate schedule (like simulated annealing); see the sketch below
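A sketch of the stochastic variant in Python; the particular decay schedule here is illustrative, not prescribed by the slides:

```python
import random

# Stochastic gradient descent: one randomly chosen point per update,
# with a learning rate that decays over time (cf. simulated annealing).
def sgd_fit(xs, ys, steps=100000):
    w0, w1 = 0.0, 0.0
    for t in range(1, steps + 1):
        alpha = 1.0 / (10 + t)          # one possible decreasing schedule
        j = random.randrange(len(xs))   # pick a data point at random
        e = ys[j] - (w1 * xs[j] + w0)
        w0 += alpha * e
        w1 += alpha * e * xs[j]
    return w0, w1
```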
Multivariate Linear Regression
Input x is not a single value but a vector of n values: x = [x1, x2, ..., xn]
Example: x1 = weight of car, x2 = size of engine, x3 = color of car; f(x) = fuel economy of car
Multivariate Linear Regression
Hypothesis space:
h_w(x) = w0 + w1x1 + w2x2 + ... + wnxn = w0 + Σ_i w_i x_i
With w = [w0, w1, w2, ..., wn] and x = [1, x1, x2, ..., xn]:
h_w(x) = w · x = Σ_i w_i x_i
Multivariate Linear Regression
w* = argmin_w Σ_j L2(y_j, w · x_j)
Gradient Descent for Multivariate Linear Regression
w_i ← w_i + α Σ_j x_{j,i} (y_j − h_w(x_j))
Linear Regression using Gradient Descent
Goal: w* = argmin_w L(h_w)
Update rule: w_i ← w_i + α Σ_j x_{j,i} (y_j − h_w(x_j))
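Vectorized, the multivariate update is one line. A sketch using NumPy (assumed here for convenience), where each row of X is the augmented input [1, x1, ..., xn]:

```python
import numpy as np

# Multivariate linear regression by gradient descent.
# X has one row per example, [1, x_1, ..., x_n]; y holds the targets.
def gd_multivariate(X, y, alpha=0.01, steps=10000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        errs = y - X @ w            # y_j - h_w(x_j) for every example j
        w += alpha * (X.T @ errs)   # w_i += alpha * sum_j x_{j,i} * err_j
    return w

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(gd_multivariate(X, y))        # same fit as the univariate examples
```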
Linear Regression using Gradient Descent
Learning (estimating) a function from examples: supervised learning
Minimizing loss between hypothesis and actual values (typically L2 loss)
Solve using gradient descent (an iterative, local search method)
[Figure: scatter plots of data points in the (x1, x2) plane, shown over three slides]
Classification
Given training data x = (x1, x2), learn a hypothesis h such that:
h(x) = 0 if x is from an earthquake
h(x) = 1 if x is from an explosion
Decision Boundary
Path (or surface, in higher dimensions) that separates the two classes:
h(x) > 0 if x is from an earthquake
h(x) < 0 if x is from an explosion
[Figure: the same (x1, x2) data with the separating line x2 = 1.7x1 − 4.9]
Linear Separator
Decision boundary is a line: a line in 2D, a plane in 3D, a hyperplane in nD
Data that admit a linear separator are said to be linearly separable
Linear Classifier
Decision boundary: w0 + w1x1 + w2x2 = 0, i.e., w · x = 0
All instances of one class are above the line: w · x > 0
All instances of the other class are below the line: w · x < 0
h_w(x) = Threshold(w · x)
Hard Threshold
[Figure: step function, 0 for z < 0 and 1 for z ≥ 0]
Threshold(z) = 1 if z ≥ 0, 0 otherwise
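A minimal sketch of a hard-threshold classifier in Python, reusing the separator x2 = 1.7x1 − 4.9 from the earlier slide (the test point is made up):

```python
# Hard-threshold linear classifier: h_w(x) = Threshold(w . x).
def threshold(z):
    return 1 if z >= 0 else 0

def classify(w, x):   # x = [1, x1, x2] so that w0 acts as the bias
    return threshold(sum(wi * xi for wi, xi in zip(w, x)))

# The boundary x2 = 1.7*x1 - 4.9 rewritten as w . x = 0:
w = [4.9, -1.7, 1.0]
print(classify(w, [1.0, 5.0, 6.0]))   # (5, 6) lies above the line -> 1
```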
Linear Classifier
h_w(x) = Threshold(w · x)
w* = argmin_w L(h_w)
Perceptron Learning Rule
w_i ← w_i + α (y − h_w(x)) x_i
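As a sketch, one stochastic pass of the rule (the choices of α and the number of updates are arbitrary):

```python
import random

# One application of the perceptron learning rule to an example (x, y):
#   w_i <- w_i + alpha * (y - h_w(x)) * x_i
def perceptron_update(w, x, y, alpha=0.1):
    pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
    return [wi + alpha * (y - pred) * xi for wi, xi in zip(w, x)]

# Training loop: repeatedly pick an example and apply the rule.
def perceptron_train(data, alpha=0.1, updates=1000):
    w = [0.0] * len(data[0][0])
    for _ in range(updates):
        x, y = random.choice(data)   # data is a list of (x, y) pairs
        w = perceptron_update(w, x, y, alpha)
    return w
```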
[Figure: learning curve of proportion correct vs. number of weight updates, N = 63 examples]
Linear Classifier with Hard Threshold
Learn with gradient descent: the perceptron learning rule, just like linear regression
Convergence isn't pretty, but it works
[Figure: another (x1, x2) scatter plot of the data]
[Figure: two learning curves of proportion correct vs. number of weight updates: fixed α (left) vs. α decaying as O(1/t) (right)]
[Figure: (x1, x2) scatter plot of the data]
Hard Threshold
[Figure: step function, 0 for z < 0 and 1 for z ≥ 0]
Threshold(z) = 1 if z ≥ 0, 0 otherwise
Soft Threshold
[Figure: logistic (sigmoid) curve rising smoothly from 0 to 1]
Logistic(z) = 1 / (1 + e^{−z})
Logistic Regression
h_w(x) = Logistic(w · x) = 1 / (1 + e^{−w·x})
w* = argmin_w L(h_w)
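A sketch of the corresponding gradient-descent update, applying the chain rule through the logistic function with L2 loss (this derivative form follows the textbook's treatment; α is arbitrary):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Gradient-descent update for logistic regression with L2 loss:
#   h = Logistic(w . x)
#   w_i <- w_i + alpha * (y - h) * h * (1 - h) * x_i
def logistic_update(w, x, y, alpha=0.1):
    h = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    g = (y - h) * h * (1 - h)   # chain rule through the logistic, up to a constant
    return [wi + alpha * g * xi for wi, xi in zip(w, x)]
```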
[Figure: three learning curves of squared error per example vs. number of weight updates]
Learning Linear Classifiers
Can learn linear classifiers fairly effectively from data, based on linear regression
Hard threshold: unpredictable, not robust to noise
Soft threshold (logistic regression): slower, but more predictable, and robust to noisy data
[Figure: biological neuron, with dendrites and axon labeled]
Linear Classifier
Decision boundary: w0 + w1x1 + w2x2 = 0, i.e., w · x = 0
All instances of one class are above the line: w · x > 0
All instances of the other class are below the line: w · x < 0
h_w(x) = Logistic(w · x) (in place of Threshold(w · x))
Neuron
[Figure: unit j with input links carrying activations a_i through weights w_{i,j}, a bias weight w_{0,j} on the fixed input a_0 = 1, an input function Σ computing in_j, an activation function g, and output a_j on the output links]
a_j = g(in_j) = g(Σ_i w_{i,j} a_i)
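A single unit is a few lines of Python. A sketch, assuming a logistic activation g (the weights and inputs below are made up):

```python
import math

# One unit: a_j = g(in_j), where in_j = sum_i w_{i,j} * a_i
# and the fixed input a_0 = 1 carries the bias weight w_{0,j}.
def unit_output(weights, inputs, g=lambda z: 1.0 / (1.0 + math.exp(-z))):
    activations = [1.0] + list(inputs)   # prepend the bias input a_0 = 1
    in_j = sum(w * a for w, a in zip(weights, activations))
    return g(in_j)

print(unit_output([0.5, -1.0, 2.0], [0.3, 0.8]))   # about 0.86
```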
Neural Network
A set of linear classifiers with logistic thresholds ("units")
Outputs connected to inputs
Feed-forward vs. recurrent networks (loops)
Typically layered: input, hidden, output
Single-Layer Feed-Forward NNs
[Figure: input units connected directly to output units]
Single-Layer Feed-Forward NN
[Figure (a): inputs 1, 2 connected to outputs 3, 4 via weights w_{1,3}, w_{1,4}, w_{2,3}, w_{2,4}]
output_j = g(Σ_i w_{i,j} a_i) = g(w_{1,j} a_1 + w_{2,j} a_2)
Multi-Layer Feed-Forward NNs
[Figure: input layer, hidden layer, output layer]
Multi-Layer Feed-Forward NNs
[Figure (b): inputs 1, 2; hidden units 3, 4; outputs 5, 6; with weights w_{1,3}, w_{1,4}, w_{2,3}, w_{2,4}, w_{3,5}, w_{3,6}, w_{4,5}, w_{4,6}]
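A sketch of a forward pass through this 2-2-2 network, with made-up weights and inputs (bias weights omitted to keep the sketch short):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# weights[(i, j)] is the weight on the link from unit i to unit j (made up).
weights = {(1, 3): 0.5, (1, 4): -0.2, (2, 3): 0.3, (2, 4): 0.8,
           (3, 5): 1.0, (3, 6): -1.0, (4, 5): 0.7, (4, 6): 0.4}

a = {1: 0.9, 2: 0.1}   # input activations
for j in (3, 4):       # hidden layer
    a[j] = logistic(sum(weights[(i, j)] * a[i] for i in (1, 2)))
for j in (5, 6):       # output layer
    a[j] = logistic(sum(weights[(i, j)] * a[i] for i in (3, 4)))
print(a[5], a[6])      # the two network outputs
```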
Recurrent NNs Allow connections from outputs to inputs => Feedback loops! Mathematics is daunting...
Learning in Neural Networks
Need to learn weights for each connection between nodes: multiple logistic regression problems
Changes to the weights of one node change its output => changing the inputs to other nodes
Convergence is complicated
Neural Nets
Linear classifiers with sigmoid threshold functions, connected in a graph topology
Learn a classification function from inputs to outputs
Support Vector Machines
Learn a maximum margin separator: the decision boundary with the largest possible distance to the example points
Learn a linear separator (but possibly in a higher-dimensional space)
Generalize well: can represent complex functions (boundaries) without overfitting
Corinna Cortes 2008 ACM Paris Kanellakis Theory and Practice Award (with Vladimir Vapnik)
Summary
Linear regression: fitting a line to data
Linear classifiers: use a line to separate the data
Gradient descent for finding the weights
Hard threshold (perceptron learning rule) vs. logistic (sigmoid) threshold
Neural networks: networks of linear classifiers; can learn nonlinear functions
Support vector machines: state of the art for supervised learning of linear classifiers
For Next Time: AIMA 20.0-20.2.2; 20.2.5 fyi; 20.3-20.3.4