Natural Language Processing
1 Natural Language Processing. Pushpak Bhattacharyya, CSE Dept, IIT Patna and Bombay. Recurrent Neural Network
2 NLP-ML marriage
3 An example SMS complaint
- "I have purchased a 80 litre Videocon fridge about 4 months ago when the freeze go to sleep that time compressor give a sound (khat khat khat khat...) what is possible fault over it is normal I can't understand please help me give me a suitable answer."
4 Significant words (in red): after stop word removal
- "I have purchased a 80 litre Videocon fridge about 4 months ago when the freeze go to sleep that time compressor give a sound (khat khat khat khat...) what is possible fault over it is normal I can't understand please help me give me a suitable answer."
5 SMS classification
[Figure: feedforward classifier; SMS feature vector as input neurons, hidden neurons, and output classes Action / Complaint / Emergency]
6 Feedforward Network and Backpropagation
7 Gradient Descent Technique
- Let E be the error at the output layer: E = (1/2) Σ_{j=1..p} Σ_{i=1..n} (t_i - o_i)²_j
- t_i = target output; o_i = observed output
- i is the index going over the n neurons in the outermost layer
- j is the index going over the p patterns (1 to p)
- E.g., XOR: p = 4 and n = 1
8 Backpropagation algorithm
[Figure: layered network with weight w_ji into neuron j from neuron i; output layer (m o/p neurons), hidden layers, input layer (n i/p neurons)]
- Fully connected feedforward network
- Pure FF network (no jumping of connections over layers)
9 General Backpropagation Rule
General weight updating rule: Δw_ji = η δ_j o_i
where δ_j = (t_j - o_j) o_j (1 - o_j) for the outermost layer
and δ_j = (Σ_k w_kj δ_k) o_j (1 - o_j) for hidden layers, with k running over the next layer
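As a concrete illustration of this rule, here is a minimal numpy sketch for a tiny sigmoid network; the 2-2-1 shape, the learning rate, and all variable names are illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One backprop step on a 2-2-1 sigmoid network (biases omitted for brevity).
# x: input pattern, t: target, W1/W2: weight matrices.
def backprop_step(x, t, W1, W2, eta=0.5):
    # Forward pass
    h = sigmoid(W1 @ x)          # hidden activations o_i
    o = sigmoid(W2 @ h)          # output activations o_j
    # delta_j = (t_j - o_j) o_j (1 - o_j) for the outermost layer
    delta_out = (t - o) * o * (1 - o)
    # delta_j = (sum_k w_kj delta_k) o_j (1 - o_j) for the hidden layer
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # Delta w_ji = eta * delta_j * o_i
    W2 += eta * np.outer(delta_out, h)
    W1 += eta * np.outer(delta_hid, x)
    return o

# E.g., one step on an XOR pattern (p = 4 patterns, n = 1 output neuron)
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)
```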
10 How does it work?
- Input propagation forward and error propagation backward (e.g. XOR)
[Figure: a threshold unit with w_1 = 1, θ = 0.5, w_2 = ]
11 Recurrent Neural Network (1 min recap)
12 Sequence processing m/c
13 E.g. POS Tagging
[Figure: an RNN tags the sequence "Purchased Videocon machine" as VBD NNP NN]
14 E.g. Sentiment Analysis: decision on a piece of text
[Figures, slides 14-18: the RNN reads "I like the camera <EOS>" one word at a time; hidden states h_0 ... h_5, with attention weights a_t1 ... a_t4 combining outputs o_1 ... o_4 into a context c_t at each step; after <EOS> the final state gives the decision: positive sentiment]
19 Most famous ancient RNN: the Hopfield net
20 Hopfield net
- Inspired by associative memory, which means memory retrieval is not by address, but by part of the data; the same idea can be used for sentence completion
- Consists of N neurons fully connected with symmetric weight strength w_ij = w_ji
- No self connection, so the weight matrix is 0-diagonal and symmetric
- Each computing element or neuron is a linear threshold element with threshold = 0
21 Connection matrix of the network: 0-diagonal and symmetric
[Figure: the N×N weight matrix over neurons n_1 ... n_k, entry w_ij at row i and column j, zeros on the diagonal]
22 Example
- w_12 = w_21 = 5, w_13 = w_31 = 3, w_23 = w_32 = 2
- At time t = 0: s_1(t) = 1, s_2(t) = -1, s_3(t) = 1
- Unstable state: neuron 1 will flip, since its net input w_12 s_2 + w_13 s_3 = 5·(-1) + 3·1 = -2 is negative while its state is 1. A stable pattern is called an attractor for the net.
23 State Vector
- Binary valued vector: each value is either 1 or -1; X = <x_n, x_{n-1}, ..., x_1>
- E.g., various attributes of a student (height, address, roll number, hair color) can be represented by a state vector
[Figure: attribute vector for Ram and its complement ~(Ram)]
24 Concept of Energy
- Energy at state s is given by: E(s) = -[w_12 x_1 x_2 + w_13 x_1 x_3 + ... + w_1n x_1 x_n + w_23 x_2 x_3 + ... + w_(n-1)n x_(n-1) x_n]
25 Energy Consideration
- At time t = 0, the state of the neural network is s(0) = <1, -1, 1>
- E(0) = -[(5·1·-1) + (3·1·1) + (2·-1·1)] = -(-5 + 3 - 2) = 4
- The state of the neural network under stability is <-1, -1, -1>
- E(stable state) = -[(5·-1·-1) + (3·-1·-1) + (2·-1·-1)] = -(5 + 3 + 2) = -10
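A small numpy sketch that replays this example end to end; the update order and the tie-break (net input 0 maps to +1) are our assumptions.

```python
import numpy as np

# The example weights: w12 = 5, w13 = 3, w23 = 2 (0-diagonal, symmetric).
W = np.array([[0, 5, 3],
              [5, 0, 2],
              [3, 2, 0]], dtype=float)

def energy(W, s):
    # E(s) = -sum_{i<j} w_ij x_i x_j; each pair is counted twice in s@W@s
    return -0.5 * s @ W @ s

def async_update(W, s):
    # Asynchronous mode: update one neuron at a time (threshold = 0)
    s = s.copy()
    changed = True
    while changed:
        changed = False
        for i in range(len(s)):
            new = 1 if W[i] @ s >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
    return s

s0 = np.array([1, -1, 1], dtype=float)
print(energy(W, s0))                  # 4.0, as computed on the slide
s_stable = async_update(W, s0)
print(s_stable, energy(W, s_stable))  # [-1. -1. -1.] and -10.0
```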
26 The Hopfield net has to converge in the asynchronous mode of operation
- As the energy E goes on decreasing, it has to hit the bottom, since the weights and the state vector have finite values
- That is, the Hopfield net has to converge to an energy minimum
- Hence the Hopfield net reaches stability
27 Hopfield Net: an O(n²) algorithm
- Consider the energy expression E = -[w_12 x_1 x_2 + ... + w_1n x_1 x_n + w_23 x_2 x_3 + ... + w_(n-1)n x_(n-1) x_n]
- E has n(n-1)/2 terms
- Nature of each term: w_ij is a real number; x_i and x_j are each +1 or -1
28 No. of steps taken to reach stability
- E_gap = E_high - E_low
[Figure: energy scale with E_high above, E_low below, and E_gap between them]
29 The energy gap
- In general, E_gap ≤ max(|w_max|, |w_min|) · n · (n-1)
30 #steps to stable state: O(n²)
- Each flip decreases E by at least some constant amount, independent of n
- Hence in the worst case, the number of steps taken to cover the energy gap is at most [max(|w_max|, |w_min|) · n · (n-1)] / constant
- Thus stability has to be attained in O(n²) steps
31 Training of Hopfield Net
- Early training rule proposed by Hopfield
- Rule inspired by the concept of electron spin
- Hebb's rule of learning: if two neurons i and j have activations x_i and x_j respectively, then the weight w_ij between the two neurons is directly proportional to the product x_i x_j, i.e., w_ij ∝ x_i x_j
32 Hopfield Rule
- Training by the Hopfield rule: train the Hopfield net for a specific memory behavior
- Store memory elements
- How to store patterns?
33 Hopfield Rule
- To store a pattern <x_n, x_{n-1}, ..., x_3, x_2, x_1>, make w_ij = (1/(n-1)) · x_i · x_j
- Storing a pattern is equivalent to making that pattern the stable state
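A minimal sketch of this storage rule, checking that the stored pattern is indeed a stable state; the 5-bit pattern is an arbitrary illustration.

```python
import numpy as np

# Storage rule: w_ij = (1/(n-1)) * x_i * x_j, with zero diagonal.
def store(pattern):
    n = len(pattern)
    W = np.outer(pattern, pattern) / (n - 1)
    np.fill_diagonal(W, 0.0)
    return W

x = np.array([1, -1, 1, 1, -1], dtype=float)
W = store(x)

# The stored pattern is stable: every neuron's net input agrees with its state,
# since W @ x recovers x exactly (x_j * x_j = 1 for all j).
assert np.all(np.sign(W @ x) == x)
```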
34 Hopfield Net for Optimization
- An optimization problem maximizes or minimizes a quantity
- The Hopfield net can be used for optimization
- Hopfield net and the Traveling Salesman Problem
- Hopfield net and the Job Scheduling Problem
35 The essential idea of the correspondence
- In optimization problems, we have to minimize a quantity
- The Hopfield net minimizes its energy
- THIS IS THE CORRESPONDENCE
36 Hopfield net and the Traveling Salesman Problem
- We consider the problem for n = 4 cities
- In the given figure, nodes represent cities and edges represent the paths between the cities with their associated distances
[Figure: complete graph on cities A, B, C, D with edge distances d_AB, d_AC, d_AD, d_BC, d_BD, d_CD]
37 Traveling Salesman Problem
- Goal: come back to city A, visiting cities j = 2 to n (n is the number of cities) exactly once, and minimize the total distance
- To solve by Hopfield net we need to decide the architecture:
- How many neurons?
- What are the weights?
38 Constraints decide the parameters
1. For n cities and n positions, establish city-to-position correspondence, i.e. number of neurons = n cities × n positions
2. Each position can take one and only one city
3. Each city can be in exactly one position
4. Total distance should be minimum
39 Architecture
- An n × n matrix where rows denote cities and columns denote positions; cell(i, α) = 1 if and only if the i-th city is in the α-th position
- Each cell is a neuron
- n² neurons, O(n⁴) connections
[Figure: grid with rows city(i) and columns pos(α)]
40 Expressions corresponding to constraints
- We equate constraint energy: E_problem = E_network (*), where E_problem = E_1 + E_2 + E_3 + E_4 and E_network is the well-known energy expression for the Hopfield net
- Find the weights from (*)
41 Finding weights for the Hopfield net applied to TSP
- E_problem = E_1 + E_2, where E_1 is the equation for n cities, each city in one position and each position with one city, and E_2 is the equation for distance
42 Expressions for E_1 and E_2
E_1 = (A/2) [ Σ_{α=1..n} ( Σ_{i=1..n} x_iα - 1 )² + Σ_{i=1..n} ( Σ_{α=1..n} x_iα - 1 )² ]
E_2 = (1/2) Σ_{i=1..n} Σ_{j=1..n} Σ_{α=1..n} d_ij x_iα ( x_{j,α+1} + x_{j,α-1} )
43 Explanatory example
- Fig. 1 shows two possible directions in which the tour can take place
[Fig. 1: 3×3 city × position matrix; x_iα = 1 if and only if the i-th city is in position α]
44 Kinds of weights
- Row weights: w_11,12, w_11,13, w_12,13; w_21,22, w_21,23, w_22,23; w_31,32, w_31,33, w_32,33
- Column weights: w_11,21, w_11,31, w_21,31; w_12,22, w_12,32, w_22,32; w_13,23, w_13,33, w_23,33
45 Cross weights
w_11,22, w_11,23, w_11,32, w_11,33; w_12,21, w_12,23, w_12,31, w_12,33; w_13,21, w_13,22, w_13,31, w_13,32; w_21,32, w_21,33, w_22,31, w_22,33, w_23,31, w_23,32
46 Expressions
E_problem = E_1 + E_2; for the 3-city example,
E_1 = (A/2)[(x_11 + x_12 + x_13 - 1)² + (x_21 + x_22 + x_23 - 1)² + (x_31 + x_32 + x_33 - 1)² + (x_11 + x_21 + x_31 - 1)² + (x_12 + x_22 + x_32 - 1)² + (x_13 + x_23 + x_33 - 1)²]
47 Expressions (contd.)
E_2 = (1/2)[d_12 (...) + d_13 (...) + d_21 (...) + d_23 (...) + d_31 (...) + d_32 (...)], each distance d_ij multiplying terms of the form x_iα (x_{j,α+1} + x_{j,α-1})
48 E_network
E_network = -[w_11,12 x_11 x_12 + w_11,13 x_11 x_13 + w_12,13 x_12 x_13 + ...], one term per connection listed on slides 44-45
49 Find row weight
- To find w_11,12 = -(coefficient of x_11 x_12) in E_network
- Search in E_problem: w_11,12 = -A; the contribution comes from E_1, and E_2 cannot contribute
50 Find column weight
- To find w_11,21 = -(coefficient of x_11 x_21) in E_network
- Search for the coefficient of x_11 x_21 in E_problem: w_11,21 = -A; the contribution comes from E_1, and E_2 cannot contribute
51 Find cross weights
- To find w_11,22 = -(coefficient of x_11 x_22)
- Search in E_problem; E_1 cannot contribute
- The coefficient of x_11 x_22 in E_2 is (d_12 + d_21) / 2; therefore w_11,22 = -((d_12 + d_21) / 2)
52 Find cross weights
- To find w_11,33 = -(coefficient of x_11 x_33)
- Search for x_11 x_33 in E_problem: w_11,33 = -((d_13 + d_31) / 2)
53 Summary
- Row weights = -A
- Column weights = -A
- Cross weights = -((d_ij + d_ji) / 2), for cells in adjacent positions (β = α ± 1)
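A sketch assembling the weight matrix for a 3-city instance from these summary rules; the value of A and the distance matrix are illustrative assumptions.

```python
import numpy as np

n, A = 3, 10.0
d = np.array([[0, 2, 5],
              [2, 0, 3],
              [5, 3, 0]], dtype=float)   # d[i][j] = distance from city i to city j

idx = lambda i, a: i * n + a              # neuron for city i in position a
W = np.zeros((n * n, n * n))
for i in range(n):
    for a in range(n):
        for j in range(n):
            for b in range(n):
                if (i, a) == (j, b):
                    continue                          # no self connection
                if i == j:                            # row weight (same city)
                    W[idx(i, a), idx(j, b)] = -A
                elif a == b:                          # column weight (same position)
                    W[idx(i, a), idx(j, b)] = -A
                elif (b - a) % n in (1, n - 1):       # adjacent positions (cyclic)
                    W[idx(i, a), idx(j, b)] = -(d[i, j] + d[j, i]) / 2

print(W[idx(0, 0), idx(1, 1)])  # cross weight w_11,22 = -(d_12 + d_21)/2 = -2.0
```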
54 Recurrent Neural Network
Acknowledgement: 1. By Denny Britz; 2. Introduction to RNN by Geoffrey Hinton (lectures.html)
55 Sequence processing m/c
56 E.g. POS Tagging (as on slide 13): "Purchased Videocon machine" tagged VBD NNP NN
57-61 E.g. Sentiment Analysis (as on slides 14-18): the RNN reads "I like the camera <EOS>" word by word; after <EOS> the decision is positive sentiment
62 Back to RNN model
63 Notation: input and state
- x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to the second word of a sentence.
- s_t is the hidden state at time step t. It is the memory of the network.
- s_t = f(U·x_t + W·s_{t-1}); the matrices U and W are learnt
- f is a function of the input and the previous state
- f is usually tanh or ReLU (approximated by softplus)
64 Tanh, ReLU (rectified linear unit) and Softplus
tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})
f(x) = max(0, x)
g(x) = ln(1 + e^x)
65 Notation: output
- o_t is the output at step t
- For example, if we wanted to predict the next word in a sentence, it would be a vector of probabilities across our vocabulary
- o_t = softmax(V·s_t)
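Putting slides 63-65 together, a minimal numpy sketch of one forward pass; the vocabulary size, hidden size, and initialization scale are illustrative assumptions.

```python
import numpy as np

vocab, hidden = 8, 4
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden, vocab))
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=(vocab, hidden))

def softmax(z):
    z = z - z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(word_ids):
    s = np.zeros(hidden)                  # initial state s_{-1} = 0
    outputs = []
    for w in word_ids:
        x = np.zeros(vocab); x[w] = 1.0   # one-hot input x_t
        s = np.tanh(U @ x + W @ s)        # s_t = f(U x_t + W s_{t-1}): the memory
        outputs.append(softmax(V @ s))    # o_t: probabilities over the vocabulary
    return outputs, s

probs, final_state = forward([3, 1, 4])   # the same U, V, W are reused at every step
```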
66 Operation of RNN
- The RNN shares the same parameters (U, V, W) across all steps
- Only the input changes
- Sometimes the output at each time step is not needed: e.g., in sentiment analysis
- Main point: the hidden states!
67 The equivalence between feedforward nets and recurrent nets
- Assume that there is a time delay of 1 in using each connection. The recurrent net is just a layered net that keeps reusing the same weights.
[Figure: a recurrent net with weights w1, w2, w3, w4 unrolled into a layered net over time = 0, 1, 2, 3, with the same weights at every layer]
68 Reminder: Backpropagation with weight constraints
- Linear constraints between the weights: compute the gradients as usual, then modify the gradients so that they satisfy the constraints
- So if the weights started off satisfying the constraints, they will continue to satisfy them
- Example: to constrain w_1 = w_2, we need Δw_1 = Δw_2
- Compute ∂E/∂w_1 and ∂E/∂w_2, then use ∂E/∂w_1 + ∂E/∂w_2 for both w_1 and w_2
69 Backpropagation through time (BPTT algorithm)
- The forward pass builds up the activities at each time step
- The backward pass computes the error derivatives at each time step
- After the backward pass we add together the derivatives at all the different times for each weight
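A minimal BPTT sketch for the vanilla RNN above, assuming a cross-entropy loss on the output at every time step; note how the per-step derivatives for the shared weights U, W, V are summed over time.

```python
import numpy as np

def bptt(word_ids, targets, U, W, V):
    vocab, hidden = V.shape[0], W.shape[0]
    xs, ss, ps = [], [np.zeros(hidden)], []
    for w in word_ids:                            # forward pass: store all states
        x = np.zeros(vocab); x[w] = 1.0
        xs.append(x)
        ss.append(np.tanh(U @ x + W @ ss[-1]))
        z = V @ ss[-1]
        e = np.exp(z - z.max())
        ps.append(e / e.sum())
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros(hidden)
    for t in reversed(range(len(word_ids))):      # backward pass over time
        do = ps[t].copy(); do[targets[t]] -= 1.0  # dL/d(logits) for cross-entropy
        dV += np.outer(do, ss[t + 1])
        ds = V.T @ do + ds_next                   # error flowing into hidden state
        dz = ds * (1.0 - ss[t + 1] ** 2)          # back through tanh
        dU += np.outer(dz, xs[t])                 # add derivatives over all times
        dW += np.outer(dz, ss[t])
        ds_next = W.T @ dz                        # pass error one step further back
    return dU, dW, dV
```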
70 Binary addition using a recurrent network (Geoffrey Hinton's lecture)
- A feedforward n/w could be used, but there is the problem of variable-length input
[Figure: feedforward network with hidden units mapping the two input numbers to their sum]
71 The algorithm for binary addition
[Figure: finite state automaton over four states: "no carry / print 0", "no carry / print 1", "carry / print 0", "carry / print 1"]
- This is a finite state automaton. It decides what transition to make by looking at the next column. It prints after making the transition. It moves from right to left over the two input numbers.
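A sketch of the same automaton in code, with the carry as the state; it assumes the two numbers are padded to equal length.

```python
# Finite state automaton for binary addition: the state is the carry bit.
# It moves right to left over the columns, printing one sum bit per column.
def fsa_add(a_bits, b_bits):
    carry, out = 0, []
    for a, b in zip(reversed(a_bits), reversed(b_bits)):
        s = a + b + carry
        out.append(s % 2)     # print after making the transition
        carry = s // 2        # the transition: carry / no carry
    out.append(carry)
    return list(reversed(out))

print(fsa_add([1, 0, 1], [0, 1, 1]))  # 101 + 011 = 1000
```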
72 A recurrent net for binary addition
- Two input units and one output unit
- Given two input digits at each time step
- The desired output at each time step is the output for the column that was provided as input two time steps ago
- It takes one time step to update the hidden units based on the two input digits, and another time step for the hidden units to cause the output
73 The connectivity of the network
- The input units have feedforward connections that allow them to vote for the next hidden activity pattern
- 3 fully interconnected hidden units
74 What the network learns
- It learns four distinct patterns of activity for the 3 hidden units
- The patterns correspond to the nodes in the finite state automaton
- Nodes in the FSA are like activity vectors
- The automaton is restricted to be in exactly one state at each time; the hidden units are restricted to have exactly one vector of activity at each time
75 Recall: Backpropagation Rule
General weight updating rule: Δw_ji = η δ_j o_i
where δ_j = (t_j - o_j) o_j (1 - o_j) for the outermost layer
and δ_j = (Σ_k w_kj δ_k) o_j (1 - o_j) for hidden layers, with k running over the next layer
76 The problem of exploding or vanishing gradients (1/2)
- If the weights are small, the gradients shrink exponentially; if the weights are big, the gradients grow exponentially
- Typical feedforward neural nets can cope with these exponential effects because they only have a few hidden layers
77 The problem of exploding or vanishing gradients (2/2)
- In an RNN trained on long sequences (e.g. a sentence with 20 words) the gradients can easily explode or vanish
- We can avoid this by initializing the weights very carefully
- Even with good initial weights, it is very hard to detect that the current target output depends on an input from many time steps ago, so RNNs have difficulty dealing with long-range dependencies
78 Vanishing/exploding gradient: solution
- LSTM
- Error becomes trapped in the memory portion of the block
- This is referred to as an "error carousel"
- It continuously feeds error back to each of the gates until they become trained to cut off the value
- (to be expanded)
79 Application of RNN: Language Modeling
80 Definitions etc.
- What is language modeling: the conditional probability of a word given the previous words in the sequence
- Its mathematical expression: P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
- How do we compute P(w_n | w_{n-N+1}^{n-1}): Count(w_{n-N+1} ... w_{n-1} w_n) / Count(w_{n-N+1} ... w_{n-1})
81 An Example
- <s> I am Sam </s>
- <s> Sam I am </s>
- <s> I do not like green eggs </s>
- Suppose we add another sentence: <s> I do like salad </s>
- Now, what is the probability P(I | <s>)? P(do | I)?
82 Estimating Sentence Probabilities
P(<s> I want english food </s>) = P(I | <s>) · P(want | I) · P(english | want) · P(food | english) · P(</s> | food)
In general, P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
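A small sketch that estimates these bigram probabilities by counting over the toy corpus of slide 81.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs </s>",
    "<s> I do like salad </s>",
]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(w, prev):
    # P(w | prev) = Count(prev w) / Count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"))   # 3/4 = 0.75
print(p("do", "I"))    # 2/4 = 0.5
```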
83 Application of Sentence Probability
- Evaluate various possible translations in statistical translation systems:
  he briefed to reporters on the chief contents of the statement
  he briefed reporters on the chief contents of the statement
  he briefed to reporters on the main contents of the statement
  he briefed reporters on the main contents of the statement
- Similarly in speech recognition systems:
  A friend in deed is a friend indeed
  A fred in need is a friend indeed
  Afraid in need is a friend indeed
  A friend in need is a friend indeed
84 Another Application: Transliteration
- Mapping a written word from one language-script pair to another language-script pair: युधिष्ठिर → Yudhisthir, Car → कार
- This looks like a straightforward problem: define a simple mapping
- Or is it Yudhisthira? On web search: 15,900 hits for Yudhisthir vs. 50,200 for Yudhisthira; 184,000 hits for Qatil vs. 7,300,000 for Katil (BTW, Qatil is the correct spelling)
- Generate all possible candidates and then somehow rank them
85 LM with RNN
86 The prediction task
87 Goal: minimize the cross entropy J. Perplexity = 2^J
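A minimal sketch of the objective, assuming base-2 logarithms so that perplexity is exactly 2^J; the toy distributions are illustrative.

```python
import numpy as np

def cross_entropy_and_perplexity(pred_probs, targets):
    # pred_probs[t] is the model's distribution at step t; targets[t] the true word
    J = -np.mean([np.log2(pred_probs[t][targets[t]])
                  for t in range(len(targets))])
    return J, 2.0 ** J    # perplexity = 2^J

probs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
print(cross_entropy_and_perplexity(probs, [0, 1]))
```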
88 Schematic. Ack: Socher lecture on RNN
89 Vanishing gradient problem hits again!
General weight updating rule: Δw_ji = η δ_j o_i
where δ_j = (t_j - o_j) o_j (1 - o_j) for the outermost layer
and δ_j = (Σ_k w_kj δ_k) o_j (1 - o_j) for hidden layers, with k running over the next layer
90 Trick for solving exploding/vanishing gradient
- Clipping! (Pascanu et al., 2013)
- Do not let the gradient rise above (case of exploding) or fall below (case of vanishing) a threshold
- Facilitated by using ReLUs
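The published rule in Pascanu et al. (2013) rescales the gradient whenever its norm exceeds a threshold (the exploding case); a minimal sketch, with an arbitrary threshold:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    # If the gradient norm exceeds the threshold, rescale it to the threshold.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])   # norm 50: an exploding gradient
print(clip_gradient(g))      # rescaled to norm 5, direction preserved
```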
91 Opinion Mining with Deep Recurrent Nets (Irsoy and Cardie, 2014)
92 Goal
- Classify each word as a direct subjective expression (DSE) or an expressive subjective expression (ESE)
- DSE: explicit mentions of private states, or speech events expressing private states
- ESE: expressions that indicate sentiment, emotion, etc. without explicitly conveying them
93 Annotation (1/2)
- In BIO notation (tags are either begin-of-entity (B_X) or continuation-of-entity (I_X)):
  The committee, [as usual]_ESE, [has refused to make any statements]_DSE
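A small sketch of this BIO encoding for the slide's sentence; the exact tokenization is our assumption.

```python
# BIO encoding: B_X starts an entity of type X, I_X continues it, O is outside.
tokens = ["The", "committee", ",", "as", "usual", ",",
          "has", "refused", "to", "make", "any", "statements"]
tags   = ["O",   "O",        "O", "B_ESE", "I_ESE", "O",
          "B_DSE", "I_DSE", "I_DSE", "I_DSE", "I_DSE", "I_DSE"]
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```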
94 Annotation (2/2)
95 Unidirectional RNN
96 Notation from Irsoy and Cardie
- x represents a token (word) as a vector
- y represents the output label (B, I or O)
- h is the memory, computed from the past memory and the current word; it summarizes the sentence up to that time
97 Bidirectional RNN
98 Deep Bidirectional RNN
99 Data
- MPQA 1.2 corpus (Wiebe et al., 2005)
- consists of 535 news articles (11,111 sentences)
- manually labeled with DSEs and ESEs at the phrase level
100 F1 score