Neural Networks Teaser


Neural Networks Teaser (1/11)
February 27, 2017

Deep Learning in the News (2/11)
Go falls to computers.

Learning (3/11)
How do we teach a robot to recognize images as either a cat or a non-cat? This sounds like a biology problem. How can we formulate it as a mathematics problem?
$\mathcal{R}$ is the space of 1000 by 1000 RGB images.
$C \subset \mathcal{R}$ is the cat subset.
Try to learn the classifier function $f_C : \mathcal{R} \to \{-1, 1\}$ so that $f_C(x) = 1 \iff x \in C$.
Usually we try to find the best function $f$ in a class of functions $\mathcal{F}$ that approximates $f_C$.
Let us play in a playground: playground.tensorflow.org/
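
A minimal sketch of this formulation in Python, with made-up toy dimensions (tiny 4x4 RGB arrays instead of 1000 by 1000 images) and a stand-in rule for the unknown cat subset $C$:

```python
import numpy as np

# Toy version of the setup: images live in the space R (here tiny 4x4 RGB
# arrays), and the "cat" subset C is unknown to the learner.
rng = np.random.default_rng(0)
images = rng.random((6, 4, 4, 3))          # six random points of the space R

def f_C(x):
    # Stand-in for the target classifier f_C : R -> {-1, +1}.
    # Membership in "C" is faked here by a simple brightness threshold.
    return 1 if x.mean() > 0.5 else -1

labels = np.array([f_C(x) for x in images])
print(labels)   # the +/-1 supervision a learning algorithm would train on
```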

Optimization and Learning (4/11)
For example, support vector machines (SVM).
Data points $(x_i, y_i)_{i=1}^n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$.
Classifier $f(x) = w \cdot x + b$.
Optimization problem: $\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \|w\|^2 + C \sum_{i=1}^n \max(0, 1 - y_i f(x_i))$.
The regularization term is convex and so is the loss function.
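
To make the objective concrete, here is a small NumPy sketch that evaluates it for a given $(w, b)$; the data points and the candidate weights below are made up for illustration:

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """Primal SVM objective ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return np.dot(w, w) + C * hinge.sum()

# Two toy points per class in R^2.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y))
```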

Scikit-learn for Quick and Dirty Machine Learning (5/11)
Scikit-learn is a convenient Python library built on NumPy, SciPy, and matplotlib for many standard machine learning algorithms.
Helpful examples at
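
As a quick illustration (not from the slides), a linear SVM from scikit-learn fit on a built-in toy dataset; note that LinearSVC's default loss is the squared hinge, a close relative of the objective on the previous slide:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC(C=1.0, max_iter=10000)   # linear classifier f(x) = w.x + b
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```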

Obligatory Slide on "Big Data" (6/11)
How many images do you think we have?
7 billion people, 3 billion people with smartphones, 1 picture a day: approximately 1 trillion pictures a year.
Some claim that more data was generated in the last 2 years than in the rest of the history of mankind.
In comparison: there are around 3 billion seconds in a 100-year lifetime.
We want algorithms that can continuously improve with such large data sets.
If error = bias + variance, then we want a large and flexible class of functions so that the bias is small, since large enough data can control the variance.

Deep Learning: Learning Representation and Classifier (7/11)
Imagine two inputs $x_1^{(0)}, x_2^{(0)}$; we are trying to learn a classifier $y = f(x_1^{(0)}, x_2^{(0)})$.
Linear classifiers $f(x_1^{(0)}, x_2^{(0)}) = w_1 x_1^{(0)} + w_2 x_2^{(0)} + b$ are not always expressive enough.
Idea: introduce a non-linearity such as $g(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.
Now we can stack many layers to get a composite classifier: $f(X^{(0)}) = W_3\, g(W_2\, g(W_1 X^{(0)} + b_1) + b_2) + b_3$.
The output of the first hidden layer $X^{(1)} = g(W_1 X^{(0)} + b_1)$ is a feature representation of the input $X^{(0)}$.
The output of the second hidden layer $X^{(2)} = g(W_2 X^{(1)} + b_2)$ is a feature transformation of the first-layer representation $X^{(1)}$.
Finally, $f(X^{(0)})$ is a classifier trained on a representation learned from the data. The representation is not fixed!
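
A minimal NumPy sketch of this forward pass, with made-up layer sizes and random weights just to show the composition:

```python
import numpy as np

def g(x):
    # tanh non-linearity from the slide: (e^x - e^-x) / (e^x + e^-x)
    return np.tanh(x)

rng = np.random.default_rng(1)
# Hypothetical sizes: 2 inputs -> 4 hidden units -> 3 hidden units -> 1 output.
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)
W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)

X0 = np.array([0.5, -1.0])      # the two inputs x1^(0), x2^(0)
X1 = g(W1 @ X0 + b1)            # first hidden layer: feature representation
X2 = g(W2 @ X1 + b2)            # second hidden layer: transformed features
f = W3 @ X2 + b3                # composite classifier f(X^(0))
print(X1, X2, f)
```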

Backpropagation (8/11)
In order to train, we need the gradient of the loss $\frac{1}{n} \sum_{i=1}^n L(y_i f(x_i))$ with respect to the weight matrices $W$.
Exercise: take the derivative of $f(x) = w_3\, g(w_2\, g(w_1 x + b_1) + b_2) + b_3$, with $g(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, with respect to the $w_i$:
$\frac{df}{dw_3} = g(w_2\, g(w_1 x + b_1) + b_2)$
$\frac{df}{dw_2} = w_3\, g'(w_2\, g(w_1 x + b_1) + b_2)\, g(w_1 x + b_1)$
$\frac{df}{dw_1} = w_3\, g'(w_2\, g(w_1 x + b_1) + b_2)\, w_2\, g'(w_1 x + b_1)\, x$
We can exploit this shared structure to save on computation in a recursive way.
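
The hand-derived formulas can be sanity-checked numerically; this is a small sketch with tanh for $g$ and arbitrary test values for $x$, the $w_i$, and the $b_i$:

```python
import numpy as np

g = np.tanh
dg = lambda x: 1.0 - np.tanh(x) ** 2        # g'(x) for g = tanh

def f(x, w1, w2, w3, b1=0.1, b2=0.2, b3=0.3):
    return w3 * g(w2 * g(w1 * x + b1) + b2) + b3

def grads(x, w1, w2, w3, b1=0.1, b2=0.2, b3=0.3):
    # The three derivatives from the slide (scalar case).
    h1 = g(w1 * x + b1)
    z2 = w2 * h1 + b2
    df_dw3 = g(z2)
    df_dw2 = w3 * dg(z2) * h1
    df_dw1 = w3 * dg(z2) * w2 * dg(w1 * x + b1) * x
    return df_dw1, df_dw2, df_dw3

# Finite-difference check of df/dw1 at an arbitrary point.
x, w1, w2, w3, eps = 0.7, 0.4, -0.9, 1.3, 1e-6
numeric = (f(x, w1 + eps, w2, w3) - f(x, w1 - eps, w2, w3)) / (2 * eps)
print(grads(x, w1, w2, w3)[0], numeric)     # analytic vs numeric: should agree
```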

Making Descent Work Better (9/11)
To save time when descending on expressions such as $E(W) = \frac{1}{n} \sum_{i=1}^n L(y_i f(x_i))$, people often work with one example at a time.
They may also descend with small batches (say 100 examples) at a time.
With momentum: $\Delta W = -\eta\, \nabla E(W) + \alpha\, \Delta W$, where $\alpha$ is the momentum parameter and $\eta$ is the learning rate.
The overall emphasis is on faster first-order methods that try to avoid getting stuck. Finding a global optimum may even give worse generalization performance.
There are many issues in practice: try running a small example in scikit-learn.
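
A minimal sketch of the momentum update $\Delta W = -\eta\, \nabla E(W) + \alpha\, \Delta W$, run on a made-up quadratic objective just to exercise the rule:

```python
import numpy as np

def descend_with_momentum(grad_E, W, eta=0.1, alpha=0.9, steps=100):
    # Repeatedly apply: dW <- -eta * grad_E(W) + alpha * dW ;  W <- W + dW
    dW = np.zeros_like(W)
    for _ in range(steps):
        dW = -eta * grad_E(W) + alpha * dW
        W = W + dW
    return W

# Toy objective E(W) = ||W||^2 / 2, so grad_E(W) = W; the minimizer is 0.
print(descend_with_momentum(lambda W: W, np.array([5.0, -3.0])))
```

In practice grad_E would be estimated from a single example or a small batch, as described above, rather than computed exactly.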

Open Challenges (10/11)
Solve harder challenges with better network architectures, optimization methods, and datasets.
Give a more satisfactory theoretical explanation for why these methods work.

Getting Started (11/11)
Nice tutorials and free books: lecun-ranzato-icml2013.pdf
Popular packages:
Popular framework:
Destined to be the future standard high level library:
Popular lower level library:
New lower level library:
Course to learn keras without need for your own machine:
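
For a flavour of the high-level libraries mentioned above, a minimal Keras sketch (assuming TensorFlow's bundled tf.keras); the layer sizes and the toy data are made up for illustration:

```python
import numpy as np
import tensorflow as tf

# Two-layer network with the tanh non-linearity used earlier in the slides.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="tanh", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.rand(200, 2)
y = (X.sum(axis=1) > 1.0).astype(int)    # toy labels
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy] on the toy data
```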
