CS 343: Artificial Intelligence

1 CS 343: Artificial Intelligence. Deep Learning. Prof. Scott Niekum, The University of Texas at Austin [These slides are based on those of Dan Klein, Pieter Abbeel, Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at

2 Review: Linear Classifiers

3 Feature Vectors Email: "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just" → features: # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ... → SPAM or +. Image → features: PIXEL-7,12 : 1, PIXEL-7,13 : 0, ..., NUM_LOOPS :

4 Some (Simplified) Biology Very loose inspiration: human neurons

5 Linear Classifiers Inputs are feature values. Each feature has a weight. Sum is the activation. If the activation is: positive, output +1; negative, output -1. [Figure: inputs f_1, f_2, f_3 with weights w_1, w_2, w_3 feed a sum Σ followed by a >0? threshold.]
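The decision rule on this slide can be sketched in a few lines of Python. The weight and feature values below are hypothetical, chosen only to make the arithmetic easy to follow. The slide does not say what happens at an activation of exactly zero, so this sketch assumes it maps to -1.

```python
def activation(weights, features):
    """Weighted sum of the feature values: sum_i w_i * f_i."""
    return sum(w * f for w, f in zip(weights, features))


def classify(weights, features):
    """Return +1 for a positive activation, otherwise -1 (zero is an assumption)."""
    return 1 if activation(weights, features) > 0 else -1


# Hypothetical weights and feature values for three features f_1, f_2, f_3.
weights = [2.0, -1.0, 0.5]
features = [1.0, 3.0, 4.0]

print(activation(weights, features))  # 2 - 3 + 2 = 1.0
print(classify(weights, features))    # 1
```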

6 Non-Linearity

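7 Non-Linear Separators Data that is linearly separable works out great for linear decision rules. [Figure: points on a 1-D axis x.] But what are we going to do if the dataset is just too hard? [Figure: points on a 1-D axis x.] How about mapping data to a higher-dimensional space? [Figure: the same points plotted with x on the horizontal axis and x^2 on the vertical axis.] This and next slide adapted from Ray Mooney, UT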

8 Non-Linear Separators General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
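As a minimal sketch of this idea, assume the mapping φ(x) = (x, x^2) and a small hypothetical 1-D data set that no single threshold on x can separate. In the mapped space a linear rule separates it.

```python
def phi(x):
    """Map a 1-D input to the 2-D feature vector (x, x^2)."""
    return (x, x * x)


# Hypothetical 1-D data: label +1 when |x| > 1, otherwise -1.
data = [(-3.0, 1), (-2.0, 1), (-0.5, -1), (0.0, -1), (0.5, -1), (2.0, 1), (3.0, 1)]


def linear_rule(features, weights, bias):
    """Linear decision rule in the mapped feature space."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else -1


# The rule "x^2 - 1 > 0" is linear in the mapped features (x, x^2).
weights, bias = (0.0, 1.0), -1.0
separable = all(linear_rule(phi(x), weights, bias) == y for x, y in data)
print(separable)  # True
```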

9 Feature Set Selection To choose between two feature sets: for feature set 1, train a perceptron on the training data -> Classifier 1; for feature set 2, train a perceptron on the training data -> Classifier 2. Evaluate the performance of Classifier 1 and Classifier 2 on hold-out data. Select the one performing best on the hold-out data.
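The procedure can be sketched as follows. The data, the two feature sets, and every function name here are hypothetical. The perceptron uses the standard mistake-driven update, which the slide names but does not spell out.

```python
def train_perceptron(examples, epochs=1000):
    """Standard perceptron updates; examples are (feature_list, label) pairs."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        mistakes = 0
        for f, y in examples:
            if y * sum(wi * fi for wi, fi in zip(w, f)) <= 0:
                w = [wi + y * fi for wi, fi in zip(w, f)]
                mistakes += 1
        if mistakes == 0:
            break  # every training example is classified correctly
    return w


def accuracy(w, examples):
    """Fraction of examples whose predicted sign matches the label."""
    correct = 0
    for f, y in examples:
        prediction = 1 if sum(wi * fi for wi, fi in zip(w, f)) > 0 else -1
        correct += prediction == y
    return correct / len(examples)


def label(x):
    """Hypothetical ground truth: +1 when |x| > 1, otherwise -1."""
    return 1 if abs(x) > 1 else -1


def feature_set_1(x):
    return [1.0, x]          # bias and x


def feature_set_2(x):
    return [1.0, x, x * x]   # bias, x and x^2


train_x = [-3.0, -2.5, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 2.5, 3.0]
held_out_x = [-2.8, -1.8, -0.3, 0.2, 0.4, 1.7, 2.2]

results = {}
for name, fs in [("set 1", feature_set_1), ("set 2", feature_set_2)]:
    train = [(fs(x), label(x)) for x in train_x]
    held_out = [(fs(x), label(x)) for x in held_out_x]
    results[name] = accuracy(train_perceptron(train), held_out)

best = max(results, key=results.get)
print(results, "-> select", best)
```

Only hold-out accuracy drives the choice here; training accuracy is never compared.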

10 Computer Vision

11 Object Detection

12 Manual Feature Design

13 Features and Generalization [Dalal and Triggs, 2005]

14 Features and Generalization Image HoG

15 Manual Feature Design → Deep Learning Manual feature design requires: domain-specific expertise, domain-specific effort. What if we could learn the features, too? Deep Learning

16 Perceptron [Figure: inputs f_1, f_2, f_3 with weights w_1, w_2, w_3 feed a sum Σ followed by a >0? threshold.]

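17 Two-Layer Perceptron Network [Figure: inputs f_1, f_2, f_3 feed three hidden units through weights w_11 ... w_33, each hidden unit a sum Σ followed by a >0? threshold; the hidden outputs are combined through weights w_1, w_2, w_3 into a final sum Σ and >0? threshold.]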

18 N-Layer Perceptron Network [Figure: inputs f_1, f_2, f_3 feed several layers of units, each a sum Σ followed by a >0? threshold, ending in a single output unit.]

19 Performance [graph credit: Matt Zeiler, Clarifai]

21 Performance: AlexNet [graph credit: Matt Zeiler, Clarifai]

24 Speech Recognition [graph credit: Matt Zeiler, Clarifai]

25 N-Layer Perceptron Network [Figure: inputs f_1, f_2, f_3 feed several layers of units, each a sum Σ followed by a >0? threshold, ending in a single output unit.]

26 Local Search Simple, general idea: Start wherever. Repeat: move to the best neighboring state. If no neighbors better than current, quit. Neighbors = small perturbations of w. Properties: plateaus and local optima → how to escape plateaus and find a good local optimum? → how to deal with very large parameter vectors? E.g.,
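A minimal sketch of this loop, assuming that "small perturbations of w" means moving one coordinate by a fixed step. The objective, step size, and starting point are hypothetical.

```python
def local_search(objective, w, step=0.1, max_iters=1000):
    """Hill climbing: move to the best neighbor until none improves."""
    for _ in range(max_iters):
        neighbors = []
        for i in range(len(w)):
            for delta in (-step, step):
                candidate = list(w)
                candidate[i] += delta  # small perturbation of one weight
                neighbors.append(candidate)
        best = max(neighbors, key=objective)
        if objective(best) <= objective(w):
            break  # no neighbor is better than the current state
        w = best
    return w


def objective(w):
    """Hypothetical smooth objective with a single optimum at (1, -2)."""
    return -((w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2)


w_final = local_search(objective, [0.0, 0.0])
print(w_final)  # ends within one step of (1, -2)
```

On an objective with flat regions, such as classification accuracy, every neighbor can tie with the current state and the loop quits immediately, which is the plateau problem the slide raises.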

27 Perceptron [Figure: inputs f_1, f_2, f_3 with weights w_1, w_2, w_3 feed a sum Σ followed by a >0? threshold.] Objective: classification accuracy. Issue: many plateaus → how to measure incremental progress?

28 Soft-Max Score for y=1: Score for y=-1: Probability of label: Objective: Log:
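The formulas on this slide did not survive extraction. As a hedged reconstruction, the standard two-class soft-max formulation is written out below. The symbols $w$, $f(x)$, $\ell$ and $\ell\ell$ are assumptions and may differ from the notation on the original slide.

```latex
\begin{align*}
\text{Score for } y = +1 &: \quad w \cdot f(x) \\
\text{Score for } y = -1 &: \quad -\,w \cdot f(x) \\
P(y = +1 \mid f(x); w) &= \frac{e^{w \cdot f(x)}}{e^{w \cdot f(x)} + e^{-w \cdot f(x)}} \\
P(y = -1 \mid f(x); w) &= \frac{e^{-w \cdot f(x)}}{e^{w \cdot f(x)} + e^{-w \cdot f(x)}} \\
\text{Objective: } \ell(w) &= \prod_i P\big(y = y^{(i)} \mid f(x^{(i)}); w\big) \\
\text{Log: } \ell\ell(w) &= \sum_i \log P\big(y = y^{(i)} \mid f(x^{(i)}); w\big)
\end{align*}
```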

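29 Two-Layer Neural Network [Figure: inputs f_1, f_2, f_3 feed three hidden units through weights w_11 ... w_33, each hidden unit a sum Σ followed by a >0? threshold; the hidden outputs are combined through weights w_1, w_2, w_3 into a final sum Σ with no threshold.]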

30 N-Layer Neural Network [Figure: inputs f_1, f_2, f_3 feed several layers of units, each a sum Σ followed by a >0? threshold; the final output unit is a sum Σ with no threshold.]

31 Our Status Our objective: changes smoothly with changes in w; doesn't suffer from the same plateaus as the perceptron network. Challenge: how to find a good w? Equivalently:

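32 1-d optimization Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$, then step in the best direction. Or, evaluate the derivative: $\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$. Tells which direction to step into.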
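Both options can be sketched in Python. The function `g`, the point `w0`, and the step `h` are hypothetical, and the sketch assumes `g` is a loss to be minimized, so the "best" direction is the one that lowers it.

```python
def g(w):
    """Hypothetical 1-D loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def best_direction_by_evaluation(g, w0, h=1e-3):
    """Compare g(w0 + h) and g(w0 - h); return +1, -1 or 0 (tie)."""
    up, down = g(w0 + h), g(w0 - h)
    if up < down:
        return 1
    if down < up:
        return -1
    return 0


def derivative(g, w0, h=1e-5):
    """Central-difference estimate of dg/dw at w0."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)


print(best_direction_by_evaluation(g, 0.0))  # 1: increasing w lowers the loss
print(derivative(g, 0.0))                    # close to -6.0, so step in +w
```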

33 2-D Optimization Source: Thomas Jungblut's Blog

34 Steepest Descent Idea: Start somewhere. Repeat: take a step in the steepest descent direction. Figure source: Mathworks

35 What is the Steepest Descent Direction?

36 What is the Steepest Descent Direction? Steepest direction = direction of the gradient; to descend, step along the negative gradient.

37 Optimization Procedure 1: Gradient Descent Init: For i = 1, 2, ...: learning rate: tweaking parameter that needs to be chosen carefully. How? Try multiple choices. Crude rule of thumb: update changes about %
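The update formula on this slide was lost in extraction. The sketch below assumes the standard rule for minimizing a loss, w ← w − α ∇g(w). The loss, the starting point, and the candidate learning rates are all hypothetical.

```python
def gradient_descent(grad, w, learning_rate=0.05, num_steps=100):
    """Repeat the assumed update w <- w - learning_rate * grad(w)."""
    for _ in range(num_steps):
        g = grad(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


def grad(w):
    """Gradient of the hypothetical loss (w0 - 1)^2 + 10 * (w1 + 2)^2."""
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]


# "Try multiple choices": the same problem with three learning rates.
for lr in (0.001, 0.01, 0.09):
    print(lr, gradient_descent(grad, [0.0, 0.0], learning_rate=lr))

w_final = gradient_descent(grad, [0.0, 0.0], learning_rate=0.09)
```

With the smallest rate the weights barely move in 100 steps, while the largest one reaches the minimum at (1, -2). For this particular loss, a rate of 0.1 or more would make the second coordinate oscillate without converging.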

38 Suppose loss function is steep vertically but shallow horizontally: Q: What is the trajectory along which we converge towards the minimum with Gradient Descent? Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6-25 Jan

40 Suppose loss function is steep vertically but shallow horizontally: Q: What is the trajectory along which we converge towards the minimum with Gradient Descent? A: Very slow progress along the flat direction, jitter along the steep one.

41 Optimization Procedure 2: Momentum Gradient Descent: Init: For i = 1, 2, ... Momentum: Init: For i = 1, 2, ... - Physical interpretation as ball rolling down the loss function + friction (mu coefficient). - mu = usually ~0.5, 0.9, or 0.99 (sometimes annealed over time, e.g. from 0.5 -> 0.99)
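The two update rules on this slide were lost in extraction. The sketch below assumes the common formulation: plain gradient descent uses w ← w − α ∇L(w), and momentum keeps a velocity v with v ← mu · v − α ∇L(w), then w ← w + v. The loss is hypothetical, chosen to be steep in one direction and shallow in the other.

```python
def plain_descent(grad, w, learning_rate=0.01, num_steps=200):
    """Plain gradient descent: w <- w - learning_rate * grad(w)."""
    for _ in range(num_steps):
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad(w))]
    return w


def momentum_descent(grad, w, learning_rate=0.01, mu=0.9, num_steps=200):
    """Momentum: v <- mu * v - learning_rate * grad(w); w <- w + v."""
    v = [0.0] * len(w)
    for _ in range(num_steps):
        g = grad(w)
        v = [mu * vi - learning_rate * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


def grad(w):
    """Gradient of the hypothetical loss 0.1 * w0^2 + 10 * w1^2.

    The loss is shallow along w0 and steep along w1.
    """
    return [0.2 * w[0], 20.0 * w[1]]


start = [10.0, 1.0]
w_gd = plain_descent(grad, start)
w_mom = momentum_descent(grad, start)
print("plain:   ", w_gd)
print("momentum:", w_mom)
```

With the same learning rate and step count, the momentum run ends much closer to the minimum along the shallow w0 direction, because the velocity accumulates there.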

42 Suppose loss function is steep vertically but shallow horizontally: Q: What is the trajectory along which we converge towards the minimum with Momentum?

43 How do we actually compute gradient w.r.t. weights? Backpropagation!

44 Backpropagation Learning /782: Artificial Neural Networks David S. Touretzky Fall 2006

45 With Linear Units, Multiple Layers Don't Add Anything. $U$: 2 × 3 matrix, $V$: 3 × 4 matrix. $y = U (V x) = (U V) x$, where $U V$ is a 2 × 4 matrix. Linear operators are closed under composition. Equivalent to a single layer of weights $W = U V$. But with non-linear units, extra layers add computational power.
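A quick numerical check of this claim, using hypothetical integer matrices of the shapes given on the slide and plain Python lists in place of a matrix library:

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


U = [[1, 0, 2], [0, 1, -1]]                      # 2 x 3
V = [[1, 2, 0, 1], [0, 1, 1, 0], [3, 0, 1, 2]]   # 3 x 4
x = [1, 2, 3, 4]

two_layers = matvec(U, matvec(V, x))  # y = U (V x)
W = matmul(U, V)                      # single 2 x 4 layer of weights
one_layer = matvec(W, x)              # y = W x

print(two_layers, one_layer)  # [37, -9] [37, -9]
```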

46 What Can be Done with Non-Linear (e.g., Threshold) Units? 1 layer of trainable weights: separating hyperplane

47 2 layers of trainable weights: convex polygon region

48 3 layers of trainable weights: composition of polygons (non-convex regions)

49 How Do We Train A Multi-Layer Network? [Figure: at the output unit y, Error = d - y; at the hidden units, Error = ???] Can't use perceptron training algorithm because we don't know the 'correct' outputs for hidden units.

50 How Do We Train A Multi-Layer Network? Define sum-squared error: $E = \frac{1}{2} \sum_p (d^p - y^p)^2$. Use gradient descent error minimization: $\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}$. Works if the nonlinear transfer function is differentiable.

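51 Deriving the LMS or Delta Rule As Gradient Descent Learning
$y = \sum_i w_i x_i$
$E = \frac{1}{2} \sum_p (d^p - y^p)^2$, so $\frac{dE}{dy} = y - d$
$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{\partial y}{\partial w_i} = (y - d)\, x_i$
$\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i} = -\eta \, (y - d)\, x_i$
How do we extend this to two layers?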
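As a minimal sketch, the delta rule can be applied one training pattern at a time to a linear unit. The training targets, the learning rate `eta`, and the number of epochs are hypothetical. A constant input of 1 is appended to stand in for a bias weight, which is an assumption not shown on the slide.

```python
def lms_train(samples, eta=0.05, epochs=200):
    """Online delta rule for a linear unit y = sum_i w_i x_i."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, d in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            # Delta w_i = -eta * (y - d) * x_i
            w = [wi - eta * (y - d) * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical targets d = 2*x1 - 1*x2 + 0.5; the last input is a constant 1.
samples = [([x1, x2, 1.0], 2.0 * x1 - 1.0 * x2 + 0.5)
           for x1 in (-1.0, 0.0, 1.0)
           for x2 in (-1.0, 0.0, 1.0)]

w = lms_train(samples)
print(w)  # approaches [2.0, -1.0, 0.5]
```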

52 Switch to Smooth Nonlinear Units
$net_j = \sum_i w_{ij} y_i$, $y_j = g(net_j)$; $g$ must be differentiable.
Common choices for $g$:
$g(x) = \frac{1}{1 + e^{-x}}$, with $g'(x) = g(x)\,(1 - g(x))$
$g(x) = \tanh(x)$, with $g'(x) = 1/\cosh^2(x)$
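The two derivative formulas can be checked numerically against a central-difference estimate. The helper names and test points below are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """g'(x) = g(x) * (1 - g(x))."""
    g = sigmoid(x)
    return g * (1.0 - g)


def tanh_prime(x):
    """Derivative of tanh: 1 / cosh^2(x)."""
    return 1.0 / math.cosh(x) ** 2


def numeric_derivative(g, x, h=1e-6):
    """Central-difference estimate of g'(x)."""
    return (g(x + h) - g(x - h)) / (2 * h)


for x in (-2.0, 0.0, 1.5):
    print(x,
          sigmoid_prime(x), numeric_derivative(sigmoid, x),
          tanh_prime(x), numeric_derivative(math.tanh, x))
```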

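53 Gradient Descent with Nonlinear Units [Figure: inputs $x_i$ with weights $w_i$ feed a unit computing $\tanh(\sum_i w_i x_i)$, which outputs $y$.]
$y = g(net) = \tanh\left(\sum_i w_i x_i\right)$
$\frac{dE}{dy} = (y - d)$, $\frac{dy}{d\,net} = 1/\cosh^2(net)$, $\frac{\partial net}{\partial w_i} = x_i$
$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{dy}{d\,net} \cdot \frac{\partial net}{\partial w_i} = (y - d) / \cosh^2\left(\sum_i w_i x_i\right) \cdot x_i$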

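54 Now We Can Use The Chain Rule [Figure: unit $y_i$ connects through weight $w_{ij}$ to unit $y_j$, which connects through weight $w_{jk}$ to unit $y_k$.]
$\frac{\partial E}{\partial y_k} = (y_k - d_k)$
$\delta_k = \frac{\partial E}{\partial net_k} = (y_k - d_k) \cdot g'(net_k)$
$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{jk}} = \frac{\partial E}{\partial net_k} \cdot y_j$
$\frac{\partial E}{\partial y_j} = \sum_k \left( \frac{\partial E}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j} \right)$
$\delta_j = \frac{\partial E}{\partial net_j} = \frac{\partial E}{\partial y_j} \cdot g'(net_j)$
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial net_j} \cdot y_i$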

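55 Weight Updates
$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{jk}} = \delta_k \, y_j$
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ij}} = \delta_j \, y_i$
$\Delta w_{jk} = -\eta \, \frac{\partial E}{\partial w_{jk}}$, $\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}$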
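The chain-rule quantities and weight updates can be put together in a small two-layer network. This is a sketch under stated assumptions: tanh units in both layers, no bias weights, a single training pattern, and hypothetical weights, inputs, target, and learning rate `eta`.

```python
import math


def forward(x, W_ij, W_jk):
    """W_ij[i][j]: input i -> hidden j; W_jk[j][k]: hidden j -> output k."""
    n_hidden, n_out = len(W_jk), len(W_jk[0])
    y_j = [math.tanh(sum(x[i] * W_ij[i][j] for i in range(len(x))))
           for j in range(n_hidden)]
    y_k = [math.tanh(sum(y_j[j] * W_jk[j][k] for j in range(n_hidden)))
           for k in range(n_out)]
    return y_j, y_k


def error(x, d, W_ij, W_jk):
    """Sum-squared error E = 1/2 * sum_k (d_k - y_k)^2 for one pattern."""
    _, y_k = forward(x, W_ij, W_jk)
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y_k))


def backprop(x, d, W_ij, W_jk):
    """Return dE/dW_ij and dE/dW_jk using the delta terms."""
    y_j, y_k = forward(x, W_ij, W_jk)
    # For tanh, g'(net) = 1 / cosh^2(net) = 1 - y^2.
    delta_k = [(yk - dk) * (1.0 - yk ** 2) for yk, dk in zip(y_k, d)]
    delta_j = [sum(delta_k[k] * W_jk[j][k] for k in range(len(delta_k)))
               * (1.0 - y_j[j] ** 2) for j in range(len(y_j))]
    grad_jk = [[delta_k[k] * y_j[j] for k in range(len(delta_k))]
               for j in range(len(y_j))]
    grad_ij = [[delta_j[j] * x[i] for j in range(len(delta_j))]
               for i in range(len(x))]
    return grad_ij, grad_jk


# Hypothetical network: 3 inputs, 2 hidden units, 1 output.
x, d = [0.5, -1.0, 1.0], [0.3]
W_ij = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
W_jk = [[0.7], [-0.6]]

grad_ij, grad_jk = backprop(x, d, W_ij, W_jk)
error_before = error(x, d, W_ij, W_jk)

eta = 0.1  # learning rate
new_W_ij = [[w - eta * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W_ij, grad_ij)]
new_W_jk = [[w - eta * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W_jk, grad_jk)]
error_after = error(x, d, new_W_ij, new_W_jk)
print(error_before, error_after)  # the error drops after one update
```

Comparing an entry of `grad_ij` with a finite-difference estimate of `error` is a simple way to catch index or sign mistakes in the delta terms.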

56 Deep learning is everywhere: Classification, Retrieval [Krizhevsky 2012]

57 Deep learning is everywhere: Detection, Segmentation [Faster R-CNN: Ren, He, Girshick, Sun 2015] [Farabet et al., 2012]

58 Deep learning is everywhere: NVIDIA Tegra X1, self-driving cars

59 Deep learning is everywhere [Taigman et al. 2014] [Simonyan et al. 2014] [Goodfellow 2014]

60 Deep learning is everywhere [Toshev, Szegedy 2014] [Mnih 2013]

61 Deep learning is everywhere [Ciresan et al. 2013] [Sermanet et al. 2011] [Ciresan et al.]

62 Image Captioning [Vinyals et al., 2015]
