Rapid Introduction to Machine Learning / Deep Learning
1 Rapid Introduction to Machine Learning / Deep Learning. Hyeong In Choi, Seoul National University
2 Lecture 1b: Logistic regression & neural network. October 2, 2015
3 Table of contents
1. Bird's-eye view of Lecture 1b
   1.1 Objectives
   1.2 Quick Summary
2. GLM: Generalized linear model
   2.1 Exponential family of distributions
   2.2 Generalized linear model (GLM)
   2.3 Parameter estimation
3. XOR problem and neural network with hidden layer
4. Universal approximation
   4.1 Further construction
   4.2 Universal approximation theorem
   4.3 Deep vs Shallow learning
4 1.1 Objectives
1 Bird's-eye view of Lecture 1b
Objective 1: Understand logistic regression (binary classification) and its multiclass generalization (softmax regression)
Objective 2: Recast logistic and softmax regression in a neural network (perceptron) formalism
5 1.1 Objectives
Objective 3: Learn the limitations of the perceptron by looking at the XOR problem, and learn how to fix it by adding a hidden layer
Objective 4: Introduce the Universal Approximation Theorem, and learn about the clash of the Deep vs Shallow paradigms in machine learning
6 1.2 Quick Summary
Logistic regression
Data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^{N}$, input $x^{(t)} \in \mathbb{R}^d$, label $y^{(t)} \in \{0, 1\}$.
Given $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, logistic regression outputs the probability of the output label $y$ being equal to 1 by
$$P[y = 1 \mid x] = \mathrm{sigm}(b + w_1 x_1 + \cdots + w_d x_d), \quad \text{where } \mathrm{sigm}(t) = \frac{e^t}{1 + e^t}.$$
7 1.2 Quick Summary
Thus
$$P[y = 1 \mid x] = \frac{e^{b + \sum_j w_j x_j}}{1 + e^{b + \sum_j w_j x_j}}, \qquad P[y = 0 \mid x] = \frac{1}{1 + e^{b + \sum_j w_j x_j}}.$$
Decision: given $x$, decide the output label is $\hat{y}$, where
$$\hat{y} = 1 \text{ if } b + \textstyle\sum_j w_j x_j \ge 0, \qquad \hat{y} = 0 \text{ if } b + \textstyle\sum_j w_j x_j < 0.$$
[Thus the decision boundary is the hyperplane $b + \sum_j w_j x_j = 0$ in $\mathbb{R}^d$.]
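As a concrete illustration of the model and decision rule above, here is a minimal NumPy sketch; the weights `w` and bias `b` are made-up example values, not parameters from the lecture:

```python
import numpy as np

def sigm(t):
    """Sigmoid: e^t / (1 + e^t)."""
    return 1.0 / (1.0 + np.exp(-t))

def logistic_predict(x, w, b):
    """Return (P[y=1|x], decision yhat) for one input vector x."""
    z = b + w @ x                  # b + sum_j w_j x_j
    p1 = sigm(z)                   # probability of label 1
    yhat = 1 if z >= 0 else 0      # decision boundary: z = 0
    return p1, yhat

# Toy example with made-up parameters.
w = np.array([2.0, -1.0])
b = 0.5
print(logistic_predict(np.array([1.0, 0.0]), w, b))  # high P[y=1|x], yhat = 1
```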
8 1.2 Quick Summary
Neural network formulation
Figure: Neural network
9 1.2 Quick Summary
$z$: input to the output neuron, $z = b + w_1 x_1 + \cdots + w_d x_d$
$h$: output of the output neuron, $h = \mathrm{sigm}(z) = \mathrm{sigm}(b + w_1 x_1 + \cdots + w_d x_d)$
10 1.2 Quick Summary
Symmetric (redundant) form of logistic regression
The probabilities $P[y = 1 \mid x]$ and $P[y = 0 \mid x]$ have different forms in logistic regression. We can put them in symmetric form by rewriting them in the following (redundant) form:
$$P[y = 1 \mid x] = \frac{\exp\!\big(b_1 + \sum_j w_{1j} x_j\big)}{\exp\!\big(b_1 + \sum_j w_{1j} x_j\big) + \exp\!\big(b_2 + \sum_j w_{2j} x_j\big)}$$
$$P[y = 0 \mid x] = \frac{\exp\!\big(b_2 + \sum_j w_{2j} x_j\big)}{\exp\!\big(b_1 + \sum_j w_{1j} x_j\big) + \exp\!\big(b_2 + \sum_j w_{2j} x_j\big)}$$
11 1.2 Quick Summary
Decision: given $x$, decide the output label is $\hat{y}$, where
$$\hat{y} = 1 \text{ if } b_1 + \textstyle\sum_j w_{1j} x_j \ge b_2 + \textstyle\sum_j w_{2j} x_j, \qquad \hat{y} = 0 \text{ if } b_1 + \textstyle\sum_j w_{1j} x_j < b_2 + \textstyle\sum_j w_{2j} x_j.$$
The decision boundary is the hyperplane $b_1 + \sum_j w_{1j} x_j = b_2 + \sum_j w_{2j} x_j$ in $\mathbb{R}^d$.
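That the redundant form is the same model can be checked numerically: the two-class softmax probability collapses to the sigmoid form with $b = b_1 - b_2$ and $w_j = w_{1j} - w_{2j}$. A quick sketch, with all parameters randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
b1, b2 = rng.normal(size=2)
w1, w2 = rng.normal(size=d), rng.normal(size=d)
x = rng.normal(size=d)

# Redundant (two-class softmax) form.
z1, z2 = b1 + w1 @ x, b2 + w2 @ x
p1_softmax = np.exp(z1) / (np.exp(z1) + np.exp(z2))

# Collapsed (sigmoid) form with b = b1 - b2, w = w1 - w2.
p1_sigmoid = 1.0 / (1.0 + np.exp(-((b1 - b2) + (w1 - w2) @ x)))

print(np.isclose(p1_softmax, p1_sigmoid))  # True
```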
12 1.2 Quick Summary
Neural network formulation
Figure: Neural network
13 1.2 Quick Summary
$z_i$: input to the $i$th neuron in the output layer, $z_i = b_i + \sum_j w_{ij} x_j$, $i = 1, 2$
$h_i$: output of the $i$th neuron in the output layer, $h_i = \dfrac{e^{z_i}}{e^{z_1} + e^{z_2}}$, $i = 1, 2$
14 1.2 Quick Summary
Softmax regression: multiclass classification
There are $K$ output labels, i.e., $y \in \{1, \dots, K\}$.
Probability:
$$P[y = i \mid x] = \frac{\exp\!\big(b_i + \sum_j w_{ij} x_j\big)}{\exp\!\big(b_1 + \sum_j w_{1j} x_j\big) + \cdots + \exp\!\big(b_K + \sum_j w_{Kj} x_j\big)}, \quad i = 1, \dots, K.$$
Decision: given $x$, decide the output label is $\hat{y} = \operatorname*{argmax}_i P[y = i \mid x]$.
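A minimal sketch of softmax-regression prediction, assuming a $K \times d$ weight matrix `W` and a length-$K$ bias `b` (the example values below are made up); the max-subtraction is a standard numerical-stability trick, not part of the slide:

```python
import numpy as np

def softmax_predict(x, W, b):
    """Return (class probabilities, argmax decision) for input x."""
    z = b + W @ x                      # z_i = b_i + sum_j W_ij x_j
    z = z - z.max()                    # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()    # P[y=i|x]
    return p, int(np.argmax(p))        # decision: most probable class

# Toy example: K = 3 classes, d = 2 features, made-up parameters.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
p, yhat = softmax_predict(np.array([2.0, -1.0]), W, b)
print(p, yhat)
```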
15 1.2 Quick Summary
Decision boundary
Figure: example of decision boundary
The decision regions are separated by hyperplanes in $\mathbb{R}^d$.
16 1.2 Quick Summary
Neural network formalism
Figure: Neural network
17 1.2 Quick Summary
$z_i$: input to the $i$th neuron in the output layer, $z_i = b_i + \sum_j w_{ij} x_j$, $i = 1, \dots, K$
$h_i$: output of the $i$th neuron in the output layer, $h_i = \dfrac{e^{z_i}}{\sum_k e^{z_k}} = P[y = i \mid x]$, $i = 1, \dots, K$
In vector notation, we write $(h_1, \dots, h_K) = \operatorname{softmax}(z_1, \dots, z_K)$, or $h = \operatorname{softmax}(z)$.
18 1.2 Quick Summary
XOR problem
Given a data set $D$ consisting of 4 points in $\mathbb{R}^2$ in 2 classes as shown in the following:
Figure: XOR
Note that there is no line that separates these two classes.
19 1.2 Quick Summary
But if we add one more (hidden) layer to the neural network, then this network can separate the two classes.
Figure: hidden layer
20 1.2 Quick Summary
Cybenko-Hornik-Funahashi Theorem
Let $\Sigma = [0, 1]^d$ be the $d$-dimensional hypercube. Then sums of the form
$$f(x) = \sum_i c_i \,\mathrm{sigm}\Big(b_i + \sum_{j=1}^{d} w_{ij} x_j\Big)$$
can approximate any continuous function on $\Sigma$ to any degree of accuracy.
21 1.2 Quick Summary
Universal Approximation
This theorem implies that a neural network with one hidden layer is good enough to do any classification job with small error, at least in principle. In fact, Lecture 2 should be viewed in this spirit.
Then, why deep learning?
22 2.1 Exponential family of distributions
2. GLM: Generalized linear model
2.1 Exponential family of distributions
An exponential family of distributions in canonical form is a probability distribution of the form
$$P_\theta(y) = \frac{1}{Z(\theta)}\, h(y) \exp\Big(\sum_i \theta_i T_i(y)\Big),$$
where $y = (y_1, \dots, y_K) \in \mathbb{R}^K$, $\theta = (\theta_1, \dots, \theta_m) \in \mathbb{R}^m$, and $T : \mathbb{R}^K \to \mathbb{R}^m$.
23 2.1 Exponential family of distributions
Rewrite it in the form
$$P_\theta(y) = \exp\Big[\sum_i \theta_i T_i(y) - A(\theta) + C(y)\Big],$$
where $A(\theta) = \log Z(\theta)$ is the log-partition (cumulant) function and $C(y) = \log h(y)$.
[Remark: here, we assume the dispersion parameter is 1.]
24 2.1 Exponential family of distributions
Bernoulli distribution
Random variable $Y$ with value $y \in \{0, 1\}$. Let $p = P[y = 1]$. Then
$$P(y) = p^y (1-p)^{1-y} = \exp\Big[\, y \log\frac{p}{1-p} + \log(1-p) \Big].$$
In exponential family form:
$$T(y) = y, \qquad \theta = \log\frac{p}{1-p} = \operatorname{logit}(p), \qquad p = \mathrm{sigm}(\theta) = \frac{e^\theta}{1+e^\theta}.$$
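The logit and the sigmoid are mutually inverse, which a short numerical check confirms (the probabilities below are arbitrary example values):

```python
import numpy as np

def logit(p):
    """Natural parameter of the Bernoulli: theta = log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def sigm(theta):
    """Mean parameter: p = e^theta / (1 + e^theta)."""
    return 1.0 / (1.0 + np.exp(-theta))

p = np.array([0.1, 0.5, 0.9])          # arbitrary probabilities
print(np.allclose(sigm(logit(p)), p))  # True: sigm inverts logit
```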
25 2.1 Exponential family of distributions
Multivariate Bernoulli (multinoulli) distribution
Random variable $Y$ with value $y \in \{1, \dots, K\}$. Let $p_i = P[y = i]$ and define $y_i = I(y = i) \in \{0, 1\}$. Thus $y_1 + \cdots + y_K = 1$, and we have
$$P(y) = p_1^{y_1} \cdots p_K^{y_K} = p_1^{y_1} \cdots p_{K-1}^{y_{K-1}} \, p_K^{\,1 - \sum_{i=1}^{K-1} y_i} = \exp\Big[\sum_{i=1}^{K-1} y_i \log\frac{p_i}{p_K} + \log p_K\Big].$$
[Note: when $K = 2$, this is exactly the Bernoulli distribution.]
26 2.1 Exponential family of distributions
In exponential family form:
$$T_i(y) = y_i, \qquad \theta_i = \log\frac{p_i}{p_K}, \qquad i = 1, \dots, K-1.$$
27 2.1 Exponential family of distributions
Solving for $p_i$, we get the generalized sigmoid (softmax) function
$$p_i = \frac{e^{\theta_i}}{1 + \sum_{k=1}^{K-1} e^{\theta_k}} = P[y = i], \quad i = 1, \dots, K-1, \qquad p_K = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\theta_k}} = P[y = K].$$
The generalized logit function:
$$\theta_i = \log\frac{p_i}{1 - \sum_{k=1}^{K-1} p_k}, \quad i = 1, \dots, K-1.$$
The above expressions show how $p_1, \dots, p_{K-1}$ and $\theta_1, \dots, \theta_{K-1}$ are related; $p_K$ is obtained by setting $p_K = 1 - (p_1 + \cdots + p_{K-1})$.
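A sketch of the two maps on this slide, converting the $K-1$ natural parameters to the full probability vector and back (the example values are made up):

```python
import numpy as np

def theta_to_p(theta):
    """Generalized sigmoid: (K-1) natural params -> K probabilities."""
    e = np.exp(theta)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)   # p_1..p_{K-1}, then p_K

def p_to_theta(p):
    """Generalized logit: K probabilities -> (K-1) natural params."""
    return np.log(p[:-1] / p[-1])              # theta_i = log(p_i / p_K)

theta = np.array([0.3, -1.2])                  # arbitrary example, K = 3
p = theta_to_p(theta)
print(p.sum(), np.allclose(p_to_theta(p), theta))  # 1.0 True
```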
28 2.2 Generalized linear model (GLM)
2.2 Generalized linear model (GLM)
GLM
The GLM mechanism is a way to relate the input vector $x = (x_1, \dots, x_d)$ to the parameters $\theta_i$ by setting
$$\theta_i = b_i + \sum_{j=1}^{d} w_{ij} x_j,$$
where $b_i$ and $w_{ij}$ are the GLM parameters to be determined by the data. Thus we get
$$p_i = \frac{\exp\!\big(b_i + \sum_{j=1}^{d} w_{ij} x_j\big)}{1 + \sum_{k=1}^{K-1} \exp\!\big(b_k + \sum_{j=1}^{d} w_{kj} x_j\big)},$$
29 2.2 Generalized linear model (GLM)
i.e., $p_i = P[y = i \mid x]$ for $i = 1, \dots, K-1$, and
$$p_K = P[y = K \mid x] = \frac{1}{1 + \sum_{k=1}^{K-1} \exp\!\big(b_k + \sum_{j=1}^{d} w_{kj} x_j\big)}.$$
30 2.2 Generalized linear model (GLM)
Note: when $K = 2$, it is the logistic regression such that
$$P[y = 1 \mid x] = p_1 = \frac{\exp\!\big(b + \sum_j w_j x_j\big)}{1 + \exp\!\big(b + \sum_j w_j x_j\big)}, \qquad P[y = 0 \mid x] = p_2 = \frac{1}{1 + \exp\!\big(b + \sum_j w_j x_j\big)}.$$
Here, we set $b = b_1$, $w_j = w_{1j}$.
31 2.2 Generalized linear model (GLM)
Symmetric (redundant) form
The expression for $p_K$ is different from those for the $p_i$. To put $p_1, \dots, p_K$ in symmetric form, multiply the numerator and the denominator of each $p_i$ and of $p_K$ by
$$\exp\Big(a + \sum_{j=1}^{d} \alpha_j x_j\Big).$$
Then
$$p_i = \frac{\exp\!\big(a + b_i + \sum_{j=1}^{d} (w_{ij} + \alpha_j) x_j\big)}{\exp\!\big(a + \sum_{j=1}^{d} \alpha_j x_j\big) + \sum_{k=1}^{K-1} \exp\!\big(a + b_k + \sum_{j=1}^{d} (w_{kj} + \alpha_j) x_j\big)}, \quad i = 1, \dots, K-1,$$
32 2.2 Generalized linear model (GLM)
and
$$p_K = \frac{\exp\!\big(a + \sum_{j=1}^{d} \alpha_j x_j\big)}{\exp\!\big(a + \sum_{j=1}^{d} \alpha_j x_j\big) + \sum_{k=1}^{K-1} \exp\!\big(a + b_k + \sum_{j=1}^{d} (w_{kj} + \alpha_j) x_j\big)}.$$
Set
$$b_i \leftarrow b_i + a, \qquad w_{ij} \leftarrow w_{ij} + \alpha_j, \quad j = 1, \dots, d, \text{ for } i = 1, \dots, K-1,$$
and set
$$b_K = a, \qquad w_{Kj} = \alpha_j, \quad j = 1, \dots, d.$$
33 2.2 Generalized linear model (GLM)
Then we have
$$p_i = \frac{\exp\!\big(b_i + \sum_{j=1}^{d} w_{ij} x_j\big)}{\sum_{k=1}^{K} \exp\!\big(b_k + \sum_{j=1}^{d} w_{kj} x_j\big)} = P[y = i \mid x], \quad i = 1, \dots, K.$$
In vector notation,
$$p = (p_1, \dots, p_K) = \operatorname{softmax}(z_1, \dots, z_K) = \operatorname{softmax}(z),$$
where $z_i = b_i + \sum_{j=1}^{d} w_{ij} x_j$, $i = 1, \dots, K$.
34 2.2 Generalized linear model (GLM)
Neural network formalism
Figure: Neural network
35 2.2 Generalized linear model (GLM)
$z_i$: input to the $i$th neuron in the output layer, $z_i = b_i + \sum_j w_{ij} x_j$, $i = 1, \dots, K$
$h_i$: output of the $i$th neuron in the output layer, $h_i = \dfrac{e^{z_i}}{\sum_k e^{z_k}} = P[y = i \mid x]$, $i = 1, \dots, K$
In vector notation, we write $(h_1, \dots, h_K) = \operatorname{softmax}(z_1, \dots, z_K)$, or $h = \operatorname{softmax}(z)$.
36 2.3 Parameter estimation
2.3 Parameter estimation
Determining $W$ and $b$ by MLE
So far the parameters, the $K \times 1$ vector $b = [b_1, \dots, b_K]^T$ and the $K \times d$ matrix $W = [w_{ij}]$, have been regarded as given. But we need to determine $b$ and $W$ from the given data; use MLE (maximum likelihood estimation).
Data $D = \{(x^{(t)}, y^{(t)})\}_{t=1}^{N}$
Probability $P(y \mid x) = p_1^{y_1} \cdots p_K^{y_K}$,
37 2.3 Parameter estimation
where
$$p_i = \frac{\exp\!\big(b_i + \sum_{j=1}^{d} w_{ij} x_j\big)}{\sum_{k=1}^{K} \exp\!\big(b_k + \sum_{j=1}^{d} w_{kj} x_j\big)}.$$
Likelihood function:
$$L(W, b) = \prod_{t=1}^{N} P[y^{(t)} \mid x^{(t)}]$$
Log-likelihood function:
$$\ell(W, b) = \log L(W, b) = \sum_{t=1}^{N} \log P[y^{(t)} \mid x^{(t)}]$$
38 2.3 Parameter estimation
Recall $P(y \mid x) = p_1^{y_1} \cdots p_K^{y_K}$. Thus
$$\log P[y \mid x] = y_1 \log p_1 + \cdots + y_K \log p_K = \sum_{k=1}^{K} I(y = k) \log p_k = \sum_{k=1}^{K} I(y = k) \log P[y = k \mid x] = \sum_{k=1}^{K} I(y = k) \log \frac{e^{z_k}}{\sum_{i=1}^{K} e^{z_i}},$$
39 2.3 Parameter estimation
where $z_i = b_i + \sum_{j=1}^{d} w_{ij} x_j$.
Rewrite the log-likelihood function:
$$\ell(W, b) = \sum_{t=1}^{N} \log P[y^{(t)} \mid x^{(t)}] = \sum_{t=1}^{N} \sum_{k=1}^{K} I(y^{(t)} = k) \log P[y^{(t)} = k \mid x^{(t)}] = \sum_{t=1}^{N} \sum_{k=1}^{K} I(y^{(t)} = k) \log \frac{e^{z_k^{(t)}}}{\sum_{i=1}^{K} e^{z_i^{(t)}}},$$
40 2.3 Parameter estimation
where $z_i^{(t)} = b_i + \sum_{j=1}^{d} w_{ij} x_j^{(t)}$.
MLE is to find $W$ and $b$ that maximize $\ell(W, b)$.
[Note: for softmax regression it turns out that $\ell(W, b)$ is a concave (for generic data sets, strictly concave) function of $W$ and $b$.]
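Note that $-\ell(W, b)$ is exactly the cross-entropy loss familiar from neural-network libraries. A sketch computing it on a toy data set (all parameter and data values are made up; labels are 0-based here, unlike the $1, \dots, K$ convention of the slides):

```python
import numpy as np

def neg_log_likelihood(W, b, X, y):
    """-l(W, b) for softmax regression.

    X: (N, d) inputs; y: (N,) integer labels in {0, ..., K-1}.
    """
    Z = X @ W.T + b                           # (N, K): z_i^(t)
    Z = Z - Z.max(axis=1, keepdims=True)      # stabilize the exponentials
    log_p = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].sum() # pick log P[y^(t)|x^(t)]

# Toy data: N = 4, d = 2, K = 3 (made-up values).
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
y = np.array([0, 1, 1, 2])
W, b = np.zeros((3, 2)), np.zeros(3)
print(neg_log_likelihood(W, b, X, y))  # 4 * log(3) for uniform predictions
```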
41 2.3 Parameter estimation
Neural network formalism
Recall
Figure: Neural network
42 2.3 Parameter estimation
For each input $x^{(t)}$:
$z_i^{(t)}$: input to the $i$th neuron in the output layer, $z_i^{(t)} = b_i + \sum_j w_{ij} x_j^{(t)}$, $i = 1, \dots, K$
$h_i^{(t)}$: output of the $i$th neuron in the output layer, $h_i^{(t)} = \dfrac{e^{z_i^{(t)}}}{\sum_k e^{z_k^{(t)}}}$, $i = 1, \dots, K$
43 2.3 Parameter estimation
For neural networks, the error function is set to be $-\ell(W, b)$, and the training is to minimize this error.
[Note: this neural network training is exactly the same as the MLE estimation in softmax regression.]
Training (learning) of a neural network in the case of a single-layer (no hidden layer) network: training is a convex optimization problem, so it is a relatively easy problem.
Three kinds of training (learning) strategies (sketched in code after this list):
- Full-batch learning: train using all the data in $D$ at once
- Mini-batch learning: train using a small portion of $D$ at a time, cycling through the portions
- On-line learning: train using one data point at a time, cycling through the data
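A sketch of mini-batch learning for softmax regression: the per-example gradient of $-\ell$ with respect to $z$ is the standard $h - \mathrm{onehot}(y)$, while the data, learning rate, and batch size below are made-up choices:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def minibatch_train(X, y, K, lr=0.5, batch=2, epochs=200, seed=0):
    """Minimize -l(W, b) by mini-batch gradient descent (a sketch)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W, b = np.zeros((K, d)), np.zeros(K)
    for _ in range(epochs):
        order = rng.permutation(N)              # cycle through D in random order
        for s in range(0, N, batch):
            idx = order[s:s + batch]
            H = softmax(X[idx] @ W.T + b)       # h^(t) for the mini-batch
            G = H.copy()
            G[np.arange(len(idx)), y[idx]] -= 1.0  # dNLL/dz = h - onehot(y)
            W -= lr * G.T @ X[idx] / len(idx)
            b -= lr * G.mean(axis=0)
    return W, b

# Toy run on linearly separable points (made-up data).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 1, 1])
W, b = minibatch_train(X, y, K=2)
print(np.argmax(X @ W.T + b, axis=1))  # [0 0 1 1]
```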
44 3. XOR problem and neural network with hidden layer
XOR Problem
Separate the X's from the O's
45
$$\mathrm{XOR}(x_1, x_2) = x_1 \bar{x}_2 + \bar{x}_1 x_2$$
$x_1 \bar{x}_2$:
46
$$z_1 = a\big(x_1 - x_2 - \tfrac{1}{2}\big), \quad a \text{ large}; \qquad h_1 = \mathrm{sigm}(z_1)$$
48
$\bar{x}_1 x_2$:
49
$$z_2 = a\big(-x_1 + x_2 - \tfrac{1}{2}\big), \quad a \text{ large}; \qquad h_2 = \mathrm{sigm}(z_2)$$
51
$$z_3 = b\big(h_1 + h_2 - \tfrac{1}{2}\big), \quad b \text{ large}; \qquad h_3 = \mathrm{sigm}(z_3)$$
52 This neural network achieves the separation, as the sketch below verifies numerically.
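A sketch of the two-layer network just constructed. The slide's exact constants are illegible in this transcription, so the offsets $-\tfrac{1}{2}$ and the gains $a = b = 10$ are assumptions; any sufficiently large gains give the same picture:

```python
import numpy as np

def sigm(t):
    return 1.0 / (1.0 + np.exp(-t))

def xor_net(x1, x2, a=10.0, b=10.0):
    """Hidden layer computes x1 AND NOT x2, and NOT x1 AND x2; output ORs them."""
    h1 = sigm(a * (x1 - x2 - 0.5))    # fires only on (1, 0)
    h2 = sigm(a * (-x1 + x2 - 0.5))   # fires only on (0, 1)
    h3 = sigm(b * (h1 + h2 - 0.5))    # OR of h1, h2
    return h3

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(xor_net(x1, x2), 3))
# Output is near 1 exactly on (0,1) and (1,0): the XOR pattern.
```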
53 4.1 Further construction
4. Universal approximation
4.1 Further construction
Further construction
The NN constructed above has the values shown in the figure.
54 4.1 Further construction
One can also construct another NN.
56 4.1 Further construction
The region where $h_1 = h_2 = h_3 = h_4 = 0$ is shown in the figure.
The neural network: Figure
57 4.1 Further construction
One can easily find a hyperplane in $\mathbb{R}^4$ that separates $(0, 0, 0, 0)$ from the rest; this hyperplane defines $h_5$, which defines a function with value 0 in the center and 1 in the rest.
58 4.1 Further construction
Continuing this way, one can construct any approximate bump function as an output of a neural network with one hidden layer (see the sketch below).
Combining these bump functions, one can approximate any continuous function.
Namely, a neural network with one hidden layer can do any task, at least in principle.
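A 1-D version of this construction: the difference of two shifted steep sigmoids is an approximate bump, and a weighted sum of such bumps approximates a continuous function. The target function and all constants here are made-up illustrations:

```python
import numpy as np

def sigm(t):
    return 1.0 / (1.0 + np.exp(-t))

def bump(x, left, right, a=200.0):
    """Approximate indicator of [left, right]: difference of two steep sigmoids."""
    return sigm(a * (x - left)) - sigm(a * (x - right))

target = lambda x: np.sin(2 * np.pi * x) + x        # made-up continuous target
x = np.linspace(0.02, 0.98, 1000)                   # stay away from the edges
edges = np.linspace(0, 1, 51)                       # 50 bumps over [0, 1]
centers = (edges[:-1] + edges[1:]) / 2
f = sum(target(c) * bump(x, l, r)
        for c, l, r in zip(centers, edges[:-1], edges[1:]))
print(np.max(np.abs(f - target(x))))  # uniform error shrinks as bumps narrow
```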
59 4.2 Universal approximation theorem
4.2 Universal approximation theorem
This heuristic argument can be made rigorous using a Stone-Weierstrass-type argument to get the
Cybenko-Hornik-Funahashi Theorem: Let $\Sigma = [0, 1]^d$ be the $d$-dimensional hypercube. Then sums of the form
$$f(x) = \sum_i c_i \,\mathrm{sigm}\Big(b_i + \sum_{j=1}^{d} w_{ij} x_j\Big)$$
can approximate any continuous function on $\Sigma$ to any degree of accuracy.
60 4.2 Universal approximation theorem
There are many similar results to this effect.
61 4.3 Deep vs Shallow learning
4.3 Deep vs Shallow learning
This theorem says that, at least in principle, one can do any classification with a neural network with one hidden layer.
Deep learning utilizes neural networks with many hidden layers, typically up to 40 or more.
Question: if the Universal Approximation Theorem says one can do the job with only one hidden layer, why does one use so many hidden layers? What is the advantage of doing so?
This is one big question we would like to address in the rest of this lecture series.
62 4.3 Deep vs Shallow learning
To achieve high accuracy, the number of terms has to be huge, and the training (learning) becomes a big problem: this is the typical problem of shallow networks (shallow learning).
In contrast, a deep NN arranges neurons in depth for more efficiency and better training; but training deep networks is a very subtle issue [which will be dealt with later in this lecture series].