Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 6



1 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 6 Slides adapted from Jordan Boyd-Graber, Chris Ketelsen Machine Learning: Chenhao Tan Boulder 1 of 39

2 HW1 turned in; HW2 released; office hour; group formation signup

3 Overview: Feature engineering; Revisiting Logistic Regression; Layers for Structured Data

4 Outline: Feature engineering (this section); Revisiting Logistic Regression; Layers for Structured Data

5 Feature Engineering: Republican nominee George Bush said he felt nervous as he voted today in his adopted home state of Texas, where he ended... (From Chris Harrison's WikiViz)

6 Brainstorming: What features are useful for sentiment analysis?

7 What features are useful for sentiment analysis? Unigrams; bigrams; normalizing options; part-of-speech tagging; parse-tree related features; negation related features; additional resources
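A minimal sketch of three of the feature types listed above (unigrams, bigrams, and a simple negation feature), using lowercasing as the normalizing option. The function names, the `uni=`/`bi=`/`neg=` prefixes, and the short negation word list are illustrative assumptions, not part of the lecture.

```python
import re
from collections import Counter

# Assumed, deliberately tiny list of negation cues (illustrative only).
NEGATIONS = {"not", "no", "never"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def extract_features(text):
    """Count unigram, bigram, and negated-word features for one document."""
    tokens = tokenize(text)
    features = Counter()
    for i, tok in enumerate(tokens):
        features["uni=" + tok] += 1
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            features["bi=" + tok + "_" + nxt] += 1
            # Mark the word that directly follows a negation cue.
            if tok in NEGATIONS:
                features["neg=" + nxt] += 1
    return features
```

For "This movie is not good", this yields the bigram `bi=not_good` and the negation feature `neg=good` in addition to the plain unigrams, so a linear model can weight "good" differently when it is negated.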

8-9 Sarcasm detection: "Trees died for this book?" (book)
Find high-frequency words and content words; replace content words with CW; extract patterns, e.g., "does not CW much about CW" [Tsur et al., 2010]

10 More examples: Which one will be retweeted more? [Tan et al., 2014]

11 Outline: Feature engineering; Revisiting Logistic Regression (this section); Layers for Structured Data
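The pattern-extraction steps can be sketched roughly as follows. This is a simplified illustration, not the procedure of Tsur et al.: the frequency threshold `min_count`, the toy corpus, and both function names are assumptions made for the example.

```python
from collections import Counter

def build_high_frequency_words(corpus, min_count):
    """Treat every word seen at least min_count times as high-frequency."""
    counts = Counter(w for sentence in corpus for w in sentence.lower().split())
    return {w for w, c in counts.items() if c >= min_count}

def to_pattern(sentence, high_frequency_words):
    """Keep high-frequency words and replace every other word with CW."""
    return " ".join(
        w if w in high_frequency_words else "CW"
        for w in sentence.lower().split()
    )

# Toy corpus (hypothetical) in which "does not ... much about ..." recurs.
corpus = [
    "this book does not say much about cooking",
    "the film does not reveal much about history",
    "he does not know much about trees",
]
hfw = build_high_frequency_words(corpus, min_count=3)
pattern = to_pattern("She does not care much about plot", hfw)
```

Here `pattern` is `"CW does not CW much about CW"`, which contains the example pattern from the slide; each such pattern can then serve as a binary feature.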

12 Revisiting Logistic Regression
$$P(Y = 0 \mid x, \beta) = \frac{1}{1 + \exp\left[\beta_0 + \sum_i \beta_i x_i\right]}$$
$$P(Y = 1 \mid x, \beta) = \frac{\exp\left[\beta_0 + \sum_i \beta_i x_i\right]}{1 + \exp\left[\beta_0 + \sum_i \beta_i x_i\right]}$$
Log-likelihood:
$$\mathcal{L} = \sum_j \log P(y^{(j)} \mid x^{(j)}, \beta)$$
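As a rough illustration, the two class probabilities and the summed log-probability of a labeled data set can be computed directly from these formulas. The function names are assumptions, and this naive version can overflow in `math.exp` for very large scores.

```python
import math

def predict_proba(x, beta0, beta):
    """Return (P(Y=0 | x), P(Y=1 | x)) for weights beta and intercept beta0."""
    score = beta0 + sum(b * xi for b, xi in zip(beta, x))
    e = math.exp(score)
    return 1.0 / (1.0 + e), e / (1.0 + e)

def log_likelihood(X, y, beta0, beta):
    """Sum over examples j of log P(y_j | x_j, beta)."""
    total = 0.0
    for xj, yj in zip(X, y):
        p0, p1 = predict_proba(xj, beta0, beta)
        total += math.log(p1 if yj == 1 else p0)
    return total
```

With all weights at zero, every example gets probability 0.5 for each class, so the log-likelihood of n examples is n times log 0.5. Negating this quantity gives a loss to minimize.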

13 Revisiting Logistic Regression
Transformation on $x$ (we map class labels from {0, 1} to {1, 2}):
$$l_i = \beta_i^T x, \quad i = 1, 2$$
$$o_i = \frac{\exp l_i}{\sum_{c \in \{1, 2\}} \exp l_c}, \quad i = 1, 2$$
Objective function (using cross entropy $-\sum_i p_i \log q_i$):
$$\mathcal{L}(Y, \hat{Y}) = -\sum_j \left[ P(y^{(j)} = 1) \log \hat{P}(y^{(j)} = 1 \mid x^{(j)}, \beta) + P(y^{(j)} = 0) \log \hat{P}(y^{(j)} = 0 \mid x^{(j)}, \beta) \right]$$
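A small sketch of why this two-output softmax view matches ordinary logistic regression: with two logits, the softmax probability of the second class equals the sigmoid of the difference between the logits. The function names and the max-subtraction trick for numerical stability are additions for the example.

```python
import math

def softmax(logits):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """-sum_i p_i log q_i, skipping terms where p_i is zero."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

l1, l2 = 0.3, 1.5                       # two arbitrary logits
o = softmax([l1, l2])
sigmoid_of_difference = 1.0 / (1.0 + math.exp(-(l2 - l1)))
```

Here `o[1]` and `sigmoid_of_difference` agree up to rounding, and for a one-hot true distribution such as `[0, 1]` the cross entropy reduces to `-log(o[1])`.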

14 Logistic Regression as a Single-layer Neural Network
[Diagram: input layer $x_1, x_2, \ldots, x_d$; linear layer $l_1, l_2$; softmax outputs $o_1, o_2$]

15 Logistic Regression as a Single-layer Neural Network
[Diagram: input layer $x_1, x_2, \ldots, x_d$; single layer with outputs $o_1, o_2$]

16 Outline: Feature engineering; Revisiting Logistic Regression; Layers for Structured Data

17 Deep Neural networks: A two-layer example (one hidden layer)
[Diagram: input $x_1, x_2, \ldots, x_d$; one hidden layer; output $o_1, o_2$]

18 Deep Neural networks: More layers
[Diagram: input $x_1, x_2, \ldots, x_d$; hidden layers 1, 2, and 3; output $o_1, o_2$]

19 Forward propagation algorithm: How do we make predictions based on a multi-layer neural network? Store the biases for layer $l$ in $b^l$ and the weight matrix in $W^l$.
[Diagram: input $x_1, x_2, \ldots, x_d$; output $o_1, o_2$; successive layers are connected by $(W^1, b^1)$, $(W^2, b^2)$, $(W^3, b^3)$, $(W^4, b^4)$]

20 Forward propagation algorithm: Suppose your network has $L$ layers. Make a prediction based on test point $x$:
1: Initialize $a^0 = x$
2: for $l = 1$ to $L$ do
3:   $z^l = W^l a^{l-1} + b^l$
4:   $a^l = g(z^l)$
5: end for
6: The prediction $\hat{y}$ is simply $a^L$
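The algorithm above translates almost line for line into code. The sketch below uses plain Python lists for vectors and matrices and applies the same activation `g` at every layer; the function names and the example weights are assumptions.

```python
import math

def matvec(W, a):
    """Multiply matrix W (a list of rows) by vector a."""
    return [sum(w * x for w, x in zip(row, a)) for row in W]

def forward(x, weights, biases, g):
    """Forward propagation: a^0 = x, then a^l = g(W^l a^(l-1) + b^l)."""
    a = x
    for W, b in zip(weights, biases):
        z = [zi + bi for zi, bi in zip(matvec(W, a), b)]
        a = [g(zi) for zi in z]
    return a  # the prediction y_hat is the last activation a^L

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output.
weights = [[[1.0, -1.0], [0.5, 0.5]], [[1.0, 1.0]]]
biases = [[0.0, 0.0], [0.0]]
y_hat = forward([1.0, 1.0], weights, biases, sigmoid)
```

For this input the hidden pre-activations are 0 and 1, so `y_hat[0]` equals `sigmoid(0.5 + sigmoid(1.0))`.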

21-22 Nonlinearity: What happens if there is no nonlinearity? Linear combinations of linear combinations are still linear combinations.

23 Neural networks in a nutshell: Training data $S_{\text{train}} = \{(x, y)\}$; network architecture (model) $\hat{y} = f_w(x)$; loss function (objective function) $\mathcal{L}(y, \hat{y})$; learning (next lecture)
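To make the point concrete, the sketch below checks on one small example that two stacked linear layers, $W^2(W^1 x + b^1) + b^2$, give the same output as a single linear layer with weights $W^2 W^1$ and bias $W^2 b^1 + b^2$. The matrices are arbitrary integers chosen for the example, and the helper names are assumptions.

```python
def matvec(W, a):
    """Multiply matrix W (a list of rows) by vector a."""
    return [sum(w * x for w, x in zip(row, a)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

W1, b1 = [[1, 2], [3, 4], [5, 6]], [1, 0, -1]   # layer 1: 2 inputs -> 3 units
W2, b2 = [[1, 0, 2], [0, 1, -1]], [2, 3]        # layer 2: 3 units -> 2 outputs
x = [1, -1]

# Two linear layers with no nonlinearity in between.
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single linear layer.
W, b = matmul(W2, W1), add(matvec(W2, b1), b2)
collapsed = add(matvec(W, x), b)
```

Both `stacked` and `collapsed` evaluate to `[-2, 4]`, so without a nonlinearity the extra layer adds no expressive power.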

24 Nonlinearity Options
Sigmoid: $f(x) = \frac{\exp(x)}{1 + \exp(x)}$
tanh: $f(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
ReLU (rectified linear unit): $f(x) = \max(0, x)$
softmax: $\mathrm{softmax}(x) = \frac{\exp(x)}{\sum_{x_i} \exp(x_i)}$
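For reference, the four options can be written out as below. The sigmoid is split into two branches and the softmax subtracts the maximum, both only to avoid overflow; these numerical details and the function names are additions for the sketch.

```python
import math

def sigmoid(x):
    """exp(x) / (1 + exp(x)), evaluated in an overflow-safe way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    """(exp(x) - exp(-x)) / (exp(x) + exp(-x)); math.tanh is the robust choice."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def softmax(xs):
    """Map a vector to positive values that sum to one."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Sigmoid, tanh, and ReLU act on one number at a time, while softmax acts on a whole vector, which is why it is typically used at the output layer.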


26 Loss Function Options
$\ell_2$ loss: $\sum_i (y_i - \hat{y}_i)^2$
$\ell_1$ loss: $\sum_i |y_i - \hat{y}_i|$
Cross entropy: $-\sum_i y_i \log \hat{y}_i$
Hinge loss (more on this during SVM): $\max(0, 1 - y\hat{y})$
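The four losses are short enough to write out directly. In this sketch the function names are assumptions, and the hinge loss assumes a label y in {-1, +1} and a real-valued score, as is usual for SVMs.

```python
import math

def l2_loss(y, y_hat):
    """Sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))

def l1_loss(y, y_hat):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(y, y_hat))

def cross_entropy(y, y_hat):
    """-sum_i y_i log y_hat_i, skipping terms where y_i is zero."""
    return -sum(a * math.log(b) for a, b in zip(y, y_hat) if a > 0)

def hinge_loss(y, score):
    """max(0, 1 - y * score) for a label y in {-1, +1} (assumed encoding)."""
    return max(0.0, 1.0 - y * score)
```

For example, with a one-hot target `[0, 1]` and prediction `[0.25, 0.75]`, the cross entropy is `-log(0.75)`, while the hinge loss is zero whenever the score has the correct sign and magnitude at least 1.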

27-28 A Perceptron Example: $x = (x_1, x_2)$, $y = f(x_1, x_2)$
[Diagram: inputs $x_1$, $x_2$ and bias $b$ feed a single output unit $o_1$]
We consider a simple activation function:
$$f(z) = \begin{cases} 1 & z \ge 0 \\ 0 & z < 0 \end{cases}$$

29-30 A Perceptron Example: Simple Example: Can we learn OR?
x_1  x_2  y = x_1 OR x_2
0    0    0
0    1    1
1    0    1
1    1    1
$w = (1, 1)$, $b = -0.5$
[Diagram: inputs $x_1$, $x_2$ and bias $b$ feed the output unit $o_1$]

31-32 A Perceptron Example: Simple Example: Can we learn AND?
x_1  x_2  y = x_1 AND x_2
0    0    0
0    1    0
1    0    0
1    1    1
$w = (1, 1)$, $b = -1.5$
[Diagram: inputs $x_1$, $x_2$ and bias $b$ feed the output unit $o_1$]

33-34 A Perceptron Example: Simple Example: Can we learn NAND?
x_1  x_2  y = NOT (x_1 AND x_2)
0    0    1
0    1    1
1    0    1
1    1    0
$w = (-1, -1)$, $b = 1.5$
[Diagram: inputs $x_1$, $x_2$ and bias $b$ feed the output unit $o_1$]
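The three gates can be checked mechanically. This sketch assumes the pre-activation is $z = w \cdot x + b$ with the step activation $f(z) = 1$ for $z \ge 0$ and 0 otherwise; under that convention the biases must be -0.5 for OR, -1.5 for AND, and, with $w = (-1, -1)$, 1.5 for NAND (a bias of 0.5 would compute NOR instead). The names below are illustrative.

```python
def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    """Single-layer perceptron with pre-activation z = w . x + b (assumed)."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

# (weights, bias) for each gate under the convention above.
GATES = {
    "OR": ((1, 1), -0.5),
    "AND": ((1, 1), -1.5),
    "NAND": ((-1, -1), 1.5),
}

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def truth_table(name):
    """Evaluate the named gate on all four binary inputs."""
    w, b = GATES[name]
    return [perceptron(x, w, b) for x in INPUTS]
```

Calling `truth_table("OR")`, `truth_table("AND")`, and `truth_table("NAND")` gives `[0, 1, 1, 1]`, `[0, 0, 0, 1]`, and `[1, 1, 1, 0]` respectively.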

35-39 A Perceptron Example: Simple Example: Can we learn XOR?
x_1  x_2  x_1 XOR x_2
0    0    0
0    1    1
1    0    1
1    1    0
NOPE! But why? The single-layer perceptron is just a linear classifier, and can only learn things that are linearly separable. How can we fix this?
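A short derivation of why no single-layer perceptron computes XOR, assuming the pre-activation $z = w_1 x_1 + w_2 x_2 + b$ and the step activation that outputs 1 exactly when $z \ge 0$. Each row of the XOR truth table imposes one constraint on $(w_1, w_2, b)$, and the four constraints contradict each other.

```latex
\begin{align*}
(0,0) \mapsto 0 &: \quad b < 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b \ge 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b \ge 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b < 0 \\[4pt]
\text{Adding the two middle rows:} &\quad w_1 + w_2 + 2b \ge 0
  \;\Longrightarrow\; w_1 + w_2 + b \ge -b > 0,
\end{align*}
```

which contradicts the last row. Geometrically, no single line separates the points (0,1) and (1,0) from (0,0) and (1,1).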

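40 A Perceptron Example: Increase the number of layers.
[Diagram: inputs $x_1$, $x_2$ and a bias feed hidden units $h_1$, $h_2$; the hidden units and a second bias feed the output]
One choice of weights that computes XOR, with $h_1$ acting as OR, $h_2$ as NAND, and the output as AND:
$$W^1 = \begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}, \quad b^1 = \begin{bmatrix} -0.5 \\ 1.5 \end{bmatrix}, \quad W^2 = \begin{bmatrix} 1 & 1 \end{bmatrix}, \quad b^2 = -1.5$$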
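A sketch of a two-layer network for XOR, built by combining the earlier gates: one hidden unit computes OR, the other NAND, and the output unit takes their AND. The specific weights below are one assumed choice under the convention $z = Wx + b$ with a step activation that fires when $z \ge 0$; other choices work as well.

```python
def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def layer(x, W, b):
    """One layer: apply the step function to W x + b, unit by unit."""
    return [step(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

# Hidden layer: first unit is OR, second unit is NAND (assumed weights).
W1, b1 = [[1, 1], [-1, -1]], [-0.5, 1.5]
# Output layer: AND of the two hidden units.
W2, b2 = [[1, 1]], [-1.5]

def xor_net(x1, x2):
    """XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2))."""
    h = layer([x1, x2], W1, b1)
    return layer(h, W2, b2)[0]
```

Evaluating `xor_net` on (0,0), (0,1), (1,0), (1,1) gives 0, 1, 1, 0. The hidden layer maps the four inputs to points that are linearly separable, which is what the single-layer perceptron lacked.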

41 General Expressiveness of Neural Networks: Neural networks with a single hidden layer can approximate any measurable function arbitrarily well, given enough hidden units [Hornik et al., 1989; Cybenko, 1989].

42 Outline: Feature engineering; Revisiting Logistic Regression; Layers for Structured Data (this section)

43 Structured data: Spatial information
[Image; source: https://www.reddit. ... after_she_was_told_she_was_a_good_girl/]

44 Convolutional Layers: Sharing parameters across patches
[Figure: an input image or input feature map is mapped to output feature maps; source: tikz-convolutional-layer/convolutional-layer.tex]
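Parameter sharing across patches can be illustrated with a bare-bones 2D convolution: the same small kernel is applied to every patch of the input, so the number of parameters depends on the kernel size rather than the image size. This is a sketch with no padding, stride 1, a single channel, and no bias; like most deep learning libraries, it computes a cross-correlation (the kernel is not flipped). The function name is an assumption.

```python
def conv2d(image, kernel):
    """Slide one shared kernel over every patch of a 2D input (valid mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]          # 4 shared parameters, reused at every position
feature_map = conv2d(image, kernel)
```

Each output entry is `image[i][j] - image[i+1][j+1]`, so `feature_map` is `[[-4, -4], [-4, -4]]` for this input.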

45-47 Structured data: Sequential information
"My words fly up, my thoughts remain below: Words without thoughts never to heaven go." (Hamlet)
Examples: language; activity history
$x = (x_1, \ldots, x_T)$

48-49 Recurrent Layers: Sharing parameters along a sequence
$h_t = f(x_t, h_{t-1})$
Example: long short-term memory (LSTM)

50 What is missing? How to find good weights? How to make the model work (regularization, architecture, etc.)?
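The recurrence $h_t = f(x_t, h_{t-1})$ leaves $f$ unspecified. The sketch below picks one simple, commonly used form as an assumption, $f(x_t, h_{t-1}) = \tanh(W_x x_t + W_h h_{t-1} + b)$; it is not an LSTM, and all names are illustrative. The point is that the same `W_x`, `W_h`, and `b` are reused at every time step.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of h_t = tanh(W_x x_t + W_h h_prev + b) (assumed form of f)."""
    h_t = []
    for row_x, row_h, b_i in zip(W_x, W_h, b):
        z = (sum(w * x for w, x in zip(row_x, x_t))
             + sum(w * h for w, h in zip(row_h, h_prev))
             + b_i)
        h_t.append(math.tanh(z))
    return h_t

def run_rnn(xs, h0, W_x, W_h, b):
    """Apply the same parameters along the whole sequence x_1, ..., x_T."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states
```

A sequence of any length T can be processed with the same three parameter arrays, and `run_rnn` returns one hidden state per time step.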

51 References
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4), 1989.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 1989.
Chenhao Tan, Lillian Lee, and Bo Pang. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In Proceedings of ACL, 2014.
Oren Tsur, Dmitry Davidov, and Ari Rappoport. ICWSM-A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews. In Proceedings of ICWSM, 2010.
