CS325 Artificial Intelligence Chs. 18 & 4 Supervised Machine Learning (cont)

1 CS325 Artificial Intelligence Cengiz Spring 2013

3 Model Complexity in Learning
[Figure: plot of f(x) against x]
Let's start with the linear case...

6 Linear Regression
price = f(size)
? = f(3000)

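7 Regression: Finding the Parameters from Data
[Table of (x, y) data points not recovered]
$y = f(x) = w_1 x + w_0$, with $w_0 = ?$ and $w_1 = ?$

8 Regression: Finding the Parameters from Data
[Table of (x, y) data points not recovered]
$y = f(x) = w_1 x + w_0$, with $w_0 = 1$ and $w_1 = 2$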

11 Linear Regression: Defining a Loss Function
$y = f(x) = w_1 x + w_0$
$\mathrm{Loss}(f) = \sum_j (y_j - f(x_j))^2 = \sum_j (y_j - (w_1 x_j + w_0))^2$
Minimum is where the derivative is zero:
$\frac{\partial}{\partial w_0} \sum_j (y_j - (w_1 x_j + w_0))^2 = 0, \quad \frac{\partial}{\partial w_1} \sum_j (y_j - (w_1 x_j + w_0))^2 = 0$
Solution is:
$w_1 = \frac{N \sum_j x_j y_j - (\sum_j x_j)(\sum_j y_j)}{N \sum_j x_j^2 - (\sum_j x_j)^2}, \quad w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}$
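
The closed-form solution for $w_1$ and $w_0$ can be sketched in a few lines of Python. The data points below are hypothetical (the slide's original table was not recovered); they were chosen to lie on $y = 2x + 1$ so the result matches the $w_0 = 1$, $w_1 = 2$ quoted on the earlier slide. The function name `fit_line` is an illustration, not something from the lecture.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w1 * x + w0 using the closed-form sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    denominator = n * sum_xx - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    w1 = (n * sum_xy - sum_x * sum_y) / denominator
    w0 = (sum_y - w1 * sum_x) / n
    return w0, w1


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_line(xs, ys)
```

Because the sample points are noise-free, the fit recovers the generating line; with noisy data the same sums give the line that minimizes the squared loss.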

12 Remember Bayes Nets? We can learn them from data, too. Everybody loves spam!

17 Maximum Likelihood: Guessing Spam Probability
Let's guess: $P(S) = \pi$
$P(y_i) = \begin{cases} \pi & \text{if } y_i = S \\ 1 - \pi & \text{if } y_i = H \end{cases}$
Joint probability:
$P(\text{data}) = \pi^{\text{count}(S)} (1 - \pi)^{\text{count}(H)} = \pi^3 (1 - \pi)^5$
Take log of both sides:
$\log P(\text{data}) = 3 \log \pi + 5 \log(1 - \pi)$
Find max by zero derivative:
$\frac{\partial \log P(\text{data})}{\partial \pi} = 0 = \frac{3}{\pi} - \frac{5}{1 - \pi} \implies \pi = 3/8$
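
As a rough numerical check of the derivation, the sketch below compares the closed-form estimate count(S) / (count(S) + count(H)) with a brute-force search over a grid of candidate values of π. The counts 3 and 5 come from the slide; the function names and the grid resolution are assumptions made for illustration.

```python
import math


def log_likelihood(pi, n_spam, n_ham):
    """log P(data) = n_spam * log(pi) + n_ham * log(1 - pi)."""
    return n_spam * math.log(pi) + n_ham * math.log(1 - pi)


def mle_spam_probability(n_spam, n_ham):
    """Closed-form maximizer obtained by setting the derivative to zero."""
    return n_spam / (n_spam + n_ham)


pi_hat = mle_spam_probability(3, 5)

# Brute-force check: evaluate the log-likelihood on a grid in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
best_on_grid = max(grid, key=lambda p: log_likelihood(p, 3, 5))
```

Both approaches agree on 3/8, which is simply the fraction of spam messages in the data.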

18 Bag of Words Representation

20 Maximum Likelihood: Guessing Word Probability

25 Finally, Back to Bayes Nets
$P(S \mid M) = \alpha P(M \mid S) P(S) = \alpha P(M_1, M_2, M_3 \mid S) P(S)$

28 Problems?
$P(S \mid M) = 0$ whenever a word in the message never appeared in the spam training data
Need Laplace smoothing (check the videos)
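
A minimal sketch of Laplace (add-k) smoothing for word probabilities, assuming the usual form (count + k) / (N + k · |V|), where N is the number of words seen in the class and |V| is the vocabulary size. The word lists and the function name are hypothetical and only serve to show how a zero probability disappears.

```python
from collections import Counter


def word_probability(word, class_words, vocabulary, k=1):
    """Estimate P(word | class) with add-k smoothing; k=0 is plain maximum likelihood."""
    counts = Counter(class_words)
    return (counts[word] + k) / (len(class_words) + k * len(vocabulary))


# Hypothetical training words for each class.
spam_words = ["offer", "secret", "offer", "click"]
ham_words = ["lunch", "today", "secret", "meeting", "today"]
vocabulary = set(spam_words) | set(ham_words)

# "lunch" never occurs in spam, so the unsmoothed estimate is zero.
p_unsmoothed = word_probability("lunch", spam_words, vocabulary, k=0)
p_smoothed = word_probability("lunch", spam_words, vocabulary, k=1)
```

With k = 1 every vocabulary word gets a small non-zero probability, and the smoothed probabilities still sum to one over the vocabulary.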

31 What If We Cannot Learn So Easily?
So far we calculated directly from data:
  Linear regression coefficients through an explicit solution
  Bayes net parameters through maximum likelihood
But we cannot solve every problem with these.
The classification solution is not unique:
[Figure: two plots with horizontal axis $x_1$]

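32 Perceptron Also Calculates Linear Boundary
[Figure: plot with horizontal axis $x_1$]
$\text{sum} = \sum_{i=1}^{N} I_i W_i$
$y = \begin{cases} 1, & \text{if sum} \ge T \\ 0, & \text{if sum} < T \end{cases}$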
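
The thresholded weighted sum can be written directly as a small function. This is only a sketch: the weights and threshold below are hand-picked (not from the lecture) so that the unit behaves like a logical AND of two binary inputs.

```python
def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


# Hand-picked parameters: only the input (1, 1) reaches the threshold of 1.5.
and_weights = [1.0, 1.0]
and_threshold = 1.5
and_outputs = [
    perceptron_output(pair, and_weights, and_threshold)
    for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]
]
```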

37 Perceptron, 2D Case
Line equation in 2D: $x_2 = a x_1 + b$, i.e. $-b = a x_1 - x_2$
The perceptron boundary: $T = w_1 x_1 + w_2 x_2$
Matching terms, the line is a perceptron boundary with $w_1 = a$, $w_2 = -1$, $T = -b$.
$y = \begin{cases} 1, & \text{if sum} \ge T \\ 0, & \text{if sum} < T \end{cases}$
[Figure: plot with horizontal axis $x_1$]
How to learn it?

41 Use the Loss Function, Perceptron
Perceptron: $y = f_w(x)$
Over all samples: $\mathrm{Loss}(w) = \sum_i (y_i - f_w(x_i))^2$
Trying to find: $\arg\min_w \mathrm{Loss}(w)$
An incremental rule:
$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} \mathrm{Loss}(w)$
$w_j \leftarrow w_j + \alpha (y - f_w(x)) x_j$
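
One way to see how the second form of the rule follows from the first is to differentiate the squared error for a single sample. The sketch below assumes $f_w$ is treated as the linear sum $\sum_k w_k x_k$ while differentiating (the hard threshold itself is not differentiable), which is an assumption not spelled out on the slide.

```latex
\begin{align*}
\frac{\partial}{\partial w_j}\bigl(y - f_w(x)\bigr)^2
  &= -2\,\bigl(y - f_w(x)\bigr)\,\frac{\partial f_w(x)}{\partial w_j} \\
  &= -2\,\bigl(y - f_w(x)\bigr)\,x_j
  \qquad \text{assuming } f_w(x) = \textstyle\sum_k w_k x_k \\
w_j &\leftarrow w_j - \alpha\,\frac{\partial}{\partial w_j}\bigl(y - f_w(x)\bigr)^2
   = w_j + 2\alpha\,\bigl(y - f_w(x)\bigr)\,x_j
\end{align*}
```

The constant factor 2 can be absorbed into the learning rate $\alpha$, which gives the update as written on the slide.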

45 It's Called the Perceptron Learning Rule
$w_j \leftarrow w_j + \alpha (y - f_w(x)) x_j$
x          y = 0 (Tom)   y = 1 (Jerry)
Trucks     1             0
Sedans     0             1
Hybrids    0             1
SUVs       1             0
Start: $w = 0$, $\alpha = 1$, $T = 1$.
For $y_{\text{Tom}}$: $w_{\text{Trucks}} \leftarrow w_{\text{Trucks}} + (0 - 0) \cdot 1$
For $y_{\text{Jerry}}$: $w_{\text{Sedans}} \leftarrow w_{\text{Sedans}} + (1 - 0) \cdot 1$
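
The worked example can be run end to end with a short script. This is a sketch under one assumed reading of the table: each column is a single training sample (Tom with label 0, Jerry with label 1) whose features are the four vehicle types. The starting values w = 0, α = 1, T = 1 are taken from the slide; the function names and the epoch limit are illustrative.

```python
FEATURES = ["Trucks", "Sedans", "Hybrids", "SUVs"]

# Assumed reading of the table: one feature vector per person, plus the label.
SAMPLES = [
    ([1, 0, 0, 1], 0),  # Tom,   y = 0
    ([0, 1, 1, 0], 1),  # Jerry, y = 1
]


def predict(x, w, threshold):
    """Thresholded weighted sum, as in the perceptron definition."""
    total = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if total >= threshold else 0


def train(samples, alpha=1.0, threshold=1.0, max_epochs=10):
    """Apply w_j <- w_j + alpha * (y - f_w(x)) * x_j until no sample is misclassified."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        changed = False
        for x, y in samples:
            error = y - predict(x, w, threshold)
            if error != 0:
                w = [wj + alpha * error * xj for wj, xj in zip(w, x)]
                changed = True
        if not changed:
            break
    return w


weights = train(SAMPLES)
learned = dict(zip(FEATURES, weights))
```

Tom's sample is already classified correctly at the start, so only Jerry's sample changes the weights, raising those of the features that are active for Jerry.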

48 Gradient Descent on the Loss Function
In general: $w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} \mathrm{Loss}(w)$
[Figure: Loss curve with a local minimum and the global minimum marked, annotated "???"]
Major problem: local minima
Adaptive $\alpha$
Simulated annealing
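
The local-minimum problem is easy to reproduce on a one-dimensional toy loss. The function below is an arbitrary non-convex polynomial chosen for illustration (it is not from the lecture); it has a shallow minimum on the right and a deeper one on the left, and plain gradient descent ends up in whichever basin it starts in.

```python
def loss(w):
    """Toy non-convex loss with two minima (hypothetical example)."""
    return w ** 4 - 3 * w ** 2 + w


def loss_gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 4 * w ** 3 - 6 * w + 1


def gradient_descent(w, alpha=0.01, steps=2000):
    """Repeatedly step against the gradient: w <- w - alpha * dLoss/dw."""
    for _ in range(steps):
        w = w - alpha * loss_gradient(w)
    return w


# Same algorithm, two starting points, two different answers.
w_from_right = gradient_descent(2.0)   # settles in the shallower minimum (w > 0)
w_from_left = gradient_descent(-2.0)   # settles in the deeper minimum (w < 0)
```

Comparing `loss(w_from_left)` with `loss(w_from_right)` shows that the run started on the right never finds the lower value.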

52 What If the Boundary is Non-linear?
Multi-Layer Perceptrons
[Figure: two networks. In the first, inputs 1 and 2 feed units 3 and 4 through weights $w_{1,3}$, $w_{1,4}$, $w_{2,3}$, $w_{2,4}$. In the second, units 3 and 4 additionally feed units 5 and 6 through weights $w_{3,5}$, $w_{3,6}$, $w_{4,5}$, $w_{4,6}$.]
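
To illustrate why stacking threshold units helps, the sketch below builds XOR, a function that no single linear boundary can represent, from two hidden units and one output unit. The weights and thresholds are set by hand for illustration; they are not learned and do not come from the lecture.

```python
def step(total, threshold):
    """Threshold unit: 1 if the weighted sum reaches the threshold, else 0."""
    return 1 if total >= threshold else 0


def xor_network(x1, x2):
    """Two-layer network with hand-set weights that computes XOR."""
    h_or = step(x1 + x2, 0.5)    # hidden unit acting as OR
    h_and = step(x1 + x2, 1.5)   # hidden unit acting as AND
    # Output fires when OR is on but AND is off.
    return step(h_or - h_and, 0.5)


xor_outputs = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```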

54 Another Solution: Non-linear Kernels
Convert the feature (input) space using a non-linear kernel (e.g., radial distance)
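
The radial-distance idea can be sketched with an explicit feature transform: points inside and outside a circle cannot be split by a straight line in the original coordinates, but a single threshold separates them once each point is replaced by its distance from the origin. The sample points and the threshold of 1.0 are hypothetical, and this shows only the transform, not a full kernel method.

```python
import math


def radial_feature(point):
    """Map a 2D point to its distance from the origin."""
    x1, x2 = point
    return math.hypot(x1, x2)


# Hypothetical data: class 0 near the origin, class 1 further out.
inner_points = [(0.1, 0.2), (-0.3, 0.1), (0.0, -0.4)]
outer_points = [(1.5, 0.0), (-1.2, 1.0), (0.0, -2.0)]

RADIUS_THRESHOLD = 1.0  # assumed boundary in the transformed space


def classify(point):
    """A linear (threshold) rule in the transformed one-dimensional space."""
    return 1 if radial_feature(point) >= RADIUS_THRESHOLD else 0
```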

58 Optimal Boundary? Enter Support Vector Machines
SVMs are guaranteed to find the optimal (maximum-margin) boundary
Statistical Learning Theory
Kernel SVMs are especially powerful because they can search in a multi-dimensional kernel space

61 So Many Methods, So Little Time... How to Choose?
Problem: choosing model complexity
  Kernel type
  Structural complexity of MLP or SVM
Solution, ask the data:
1. Cross-validation: divide the data into three sets: training, validation, test
2. Regularization: add a complexity minimization term to the loss function
   $\mathrm{Loss} = \sum_i (y_i - f(x_i))^2 + \beta \cdot (\text{num params})$
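
Both ideas are simple to express in code. The sketch below shows a three-way split and a loss with a parameter-count penalty; the 60/20/20 proportions, the fixed shuffle seed, and the function names are assumptions chosen for illustration rather than values from the lecture.

```python
import random


def split_data(samples, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle the samples and divide them into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]
    return train_set, validation_set, test_set


def regularized_loss(ys, predictions, num_params, beta):
    """Squared error plus a penalty proportional to the number of parameters."""
    squared_error = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return squared_error + beta * num_params


train_set, validation_set, test_set = split_data(range(10))
penalized = regularized_loss([1.0, 2.0], [1.5, 2.0], num_params=3, beta=0.1)
```

The validation set is the one used to compare model complexities; the test set stays untouched until a final model has been chosen.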

64 Or, Get Rid of 'em Altogether: Non-parametric Models
k-nearest neighbors algorithm:
  Keep all data points as a lookup table
  Smoothing parameter, k
[Figure: two panels, (k = 1) and (k = 5), with axis $x_2$]
Problems?
  Number of data points
  Number of features
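
A minimal k-nearest-neighbors classifier makes the role of k as a smoothing parameter concrete. The data set is hypothetical: it contains one "B" point sitting among the "A" points, so that k = 1 follows that single point while k = 5 smooths it away. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(query, data, k):
    """Label the query by majority vote among its k nearest stored points."""
    neighbors = sorted(data, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical lookup table of (point, label) pairs.
data = [
    ((0.0, 0.0), "A"),
    ((0.1, 0.2), "A"),
    ((0.2, 0.1), "A"),
    ((1.0, 1.0), "B"),
    ((0.9, 1.1), "B"),
    ((0.45, 0.45), "B"),  # isolated "B" point close to the "A" cluster
]

query = (0.4, 0.4)
label_k1 = knn_classify(query, data, k=1)
label_k5 = knn_classify(query, data, k=5)
```

Every prediction scans the whole table, which is why the number of data points and the number of features are listed as the problems.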

66 Finally, a Totally Different One: Genetic Algorithms Problems: No local minima, takes longer, must design problem well

70 Summary of Supervised Machine Learning
Can solve problems too complex for man-made algorithms
Gets better with data (good for the information age)
Supervised learning with labels: regression and classification
Linear regression and Bayes nets are calculated from data directly
Classification by minimizing the loss function iteratively
Local minima are a problem with gradient descent
Non-linear problems can be solved with multiple boundaries or kernels
Support vector machines find the optimal solution faster
Parameter complexity can be reduced with cross-validation and regularization
Non-parametric models are good for low-dimensional problems
Genetic algorithms have no local minima
