Part-of-Speech Tagging + Neural Networks 2 CS 287

Size: px

Start display at page:

Download "Part-of-Speech Tagging + Neural Networks 2 CS 287"

Alexina Casey
5 years ago
Views:

1 Part-of-Speech Tagging + Neural Networks 2 CS 287

2 Review: Bilinear Model Bilinear model, ŷ = f ((x 0 W 0 )W 1 + b) x 0 R 1 d 0 start with one-hot. W 0 R d 0 d in, d 0 = F W 1 R d in d out, b R 1 d out ; model parameters Notes: Bilinear parameter interaction. d 0 >> d in, e.g. d 0 = 10000, d in = 50

3 Review: Bilinear Model: Intuition (x 0 W 0 )W 1 + b [ ] w 0 1,1... w 0 0,d in... w 0 k,1... w 0 k,d in... w 0 d 0,1... w 0 d 0,d in w 1 1, w 1 0,d out w 1 d in, w 1 d in,d out

4 Review: Window Model Goal: predict t 5. Windowed word model. w 1 w 2 [w 3 w 4 w 5 w 6 w 7 ] w 8 w 3, w 4 ; left context w 5 ; Word of interest w 6, w 7 ; right context d win ; size of window (d win = 5)

5 Review: Dense Windowed BoW Features f 1,..., f dwin are words in window Input representation is the concatenation of embeddings Example: Tagging x = [v(f 1 ) v(f 2 )... v(f dwin )] w 1 w 2 [w 3 w 4 w 5 w 6 w 7 ] w 8 x = [v(w 3 ) v(w 4 ) v(w 5 ) v(w 6 ) v(w 7 )] x d in /5 d in /5 d in /5 d in /5 d in /5 Rows of W 1 encode position specific weights.

6 Quiz We are doing tagging with a windowed bilinear model with hinge-loss and no capitalization features. The model has d win = 5, d in = 50, d out = 40, and vocabulary size We are given the input window: The dog walked to the Unfortunately we incorrectly classify walked as NN as opposed to VP, in a bilinear model with a hinge-loss. What is the maximum number of parameters that receive a non-zero gradient?

7 Answer: [ ] w 0 1,1... w 0 0,d in w 0 the,1... w 0 the,d in. w 0 dog,1... w 0 dog,d in. w 0 walked,1... w 0 walked,d in. w 0 to,1... w 0 to,d in. w 0 the,1... w 0 the,d in. w 1 1,1... w 1 1,NN... w 1 1,VP w 1 0,d out w 1 d in,0... w 1 d in,nn... w 1 d in,vp w 1 d in,d out w 0 d 0,1... w 0 d 0,d in W 0 = 5 d in W 1 = d in 2

8 Part-of-Speech Tagging 1 Consider the following windowed model, and assume for now a linear model. [w 1 the w 3 w 4 w 5 ] What information do we have about the tag of w 3? What weight should the features values associated with the in position w 2 take?

9 Part-of-Speech Tagging 2 Next Consider the following windowed model, and assume for now a linear model. [w 1 w 2 w 3 dog w 5 ] What information do we have about the tag of w 3? What weight should the features values associated with dog in position w 4 take?

10 Part-of-Speech Tagging 3 Now finally consider the following windowed model, and assume for now a linear model. [w 1 the w 3 dog w 5 ] What information do we have about the tag of w 3? What weight would we want if we combined both the features values?

11 Contents Neural Networks Backpropagation

13 Neural Network One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the dense representation in R 1 d in W 1 R d in d hid, b 1 R 1 d hid; first affine transformation W 2 R d hid d out, b 2 R 1 d out ; second affine transformation g : R d hid d hid is an activation non-linearity (often pointwise) g(xw 1 + b 1 ) is the hidden layer

14 Schematic x 1 x 2 x 3. x din h 1 h 2. h dhid z 1 z 2 z 3. z dout

15 Non-Linearities: 0/1 0/1 function: 0/1(t) = 1(t > 0) 01((xW 1 + b 1 ) i ) Intuition: On, if above a threshold

16 Exercise Input layer to NN MLP1 is the sparse indicator features of the word at each position. Design a network to recognize [w 1 the w 3 dog w 5 ] Design a network to recognize where w 2 is not the [w 1 w 2 w 3 dog w 5 ]

17 Feature Conjunctions Many NLP tasks require conjunctive features, examples Sequence-based taggers look at last two-part of speech tags. Chinese part-of-speech taggers look at first character and last tag. Higher-level models (parses) look at tags of words and distances apart (example) For some natural language tasks, conjunctions are painstakingly hard. NNs: Capacity to learn conjunctions and feature combinations. Also possible with other convex models such as SVMs

18 Feature Conjunctions Many NLP tasks require conjunctive features, examples Sequence-based taggers look at last two-part of speech tags. Chinese part-of-speech taggers look at first character and last tag. Higher-level models (parses) look at tags of words and distances apart (example) For some natural language tasks, conjunctions are painstakingly hard. NNs: Capacity to learn conjunctions and feature combinations. Also possible with other convex models such as SVMs

19 Simple Antecedent/Pairwise Features Not Discriminative E.g., is [Lexus sales] the antecedent of [their sales]? Common pairwise features: String/Head Match, Sentences Between, Mention-Antecedent Numbers/Heads/Genders, etc. string-match=false head-match=true φ p ([their sales],[lexus sales]) = sentences-between=0 ment-ant-numbers=plur.,plur..

20 Dealing with the Feature Problem Finding discriminative features is a major challenge for coreference systems [Fernandes et al. 2012; Durrett and Klein 2013] Typical to define (or search for) feature conjunction-schemes to improve predictive performance [Fernandes et al. 2012; Durrett and Klein 2013; Björkelund and Kuhn 2014]. For instance: string-match(x, y) type(x) type(y) [Durrett and Klein 2013], where Nom. type(x) = Prop. citation-form(x) if x is nominal if x is proper if x is pronominal substring-match(head(x), y) substring-match(x, head(y)) coarse-type(y) coarse-type(x) [Björkelund and Kuhn 2014] Not just a problem for Mention Ranking systems!

21 Non-Linearities: 0/1 0/1 function: 0/1(t) = 1(t > 0) Issue: No gradient anywhere

22 Non-Linear Functions: Sigmoid Logistic sigmoid function: σ(t) = exp( t) σ((xw 1 + b 1 ) i ) Intuition: Each hidden dimension ( neuron ) is result of logistic regression.

23 Other Non-Linearities: ReLU Rectified Linear Unit: ReLU(t) = max{0, t} Intuition: Each hidden-unit gives activation margin No gradient (saturation) when below 0.

24 Saturation: Intuition x 1 x 2 x 3. x din h 1 h 2. h dhid z 1 z 2 z 3. z dout

25 Other Non-Linearities: Tanh Hyperbolic Tangeant: tanh(t) = exp(t) exp( t) exp(t) + exp( t) Intuition: Similar to sigmoid, but range between 0 and -1.

26 Other Non-Linearities: Hard Tanh Hyperbolic Tangeant: 1 t < 1 hardtanh(t) = t 1 t 1 1 t > 1 Intuition: Similar to sigmoid, but range between 0 and -1.

27 Other Non-Linearities: Cube Cube non-linearity (directly encourage parameter interaction): cube(t) = t 3 Intuition: Directly encourage higher-order interactions.

28 Tagging from Scratch

29 Function Approximator MLP1 is a universal approximator Can approximate with any desired non-zero amount of error a family of functions that include all continuous functions on a closed and bounded subset of R n, and any function mapping from any finite dimensional discrete space to another (YG) Caveats: Does not give size of hidden layer. Does not specify how hard this is to learn.

30 Deep Neural Networks (DNNs) Can stack MLPs, create deep fully connected networks, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 NN MLP2 (x) = g(nn MLP1 (x)w 1 + b 1 )W 2 + b 2 Can have multiple hidden layers, etc. Benefit: may be able to find better function Known to be harder to train (although other approaches)

31 Other Layers We will discuss many other neural network layers, convolutional attention-based gated layers...

32 Highway Network y : output from CharCNN Multilayer Perceptron z = g(wy + b) Highway Network (Srivastava, Greff, and Schmidhuber 2015) z = t g(w H y + b H ) + (1 t) y W H, b H : Affine transformation t = σ(w T y + b T ) : transform gate 1 t : carry gate Hierarchical, adaptive composition of character n-grams. Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 37 / 75

33 Highway Network Input to LSTM Input from CharCNN Kim, Jernite, Sontag, Rush Character-Aware Neural Language Models 38 / 75

34 Contents Neural Networks Backpropagation

35 Sequential Neural Network Sequential neural networks consist of a series of composed functions, Consider a vector-valued parameterized functions f 1,..., f k where f i (x; θ i ) : R n i 1 R n i ; function θ R d i ; function parameters Consider a scalar-valued loss function L(y, ŷ) where L(y, ) : R n k R; loss for input

36 Backpropagation Forward Step (f-prop): Compute L(f k (... f 1 (x 0 ))) Saving intermediary values f i (... f 1 (x 0 ))) Backward Step (b-prop): ni L f i (... f 1 (x 0 )) = f i+1 (... f 1 (x 0 )) j j=1 f i (... f 1 (x 0 )) L f i+1 (... f 1 (x 0 )) j n L i f i+1 (... f 1 (x = θ i 0 )) j L j=1 θ i f i+1 (... f 1 (x 0 )) j

37 Backpropagation Forward Step (f-prop): Compute L(f k (... f 1 (x 0 ))) Saving intermediary values f i (... f 1 (x 0 ))) Backward Step (b-prop): ni L f i (... f 1 (x 0 )) = f i+1 (... f 1 (x 0 )) j j=1 f i (... f 1 (x 0 )) L f i+1 (... f 1 (x 0 )) j n L i f i+1 (... f 1 (x = θ i 0 )) j L j=1 θ i f i+1 (... f 1 (x 0 )) j

38 Backpropagation Forward Step (f-prop): Compute L(f k (... f 1 (x 0 ))) Saving intermediary values f i (... f 1 (x 0 ))) Backward Step (b-prop): ni L f i (... f 1 (x 0 )) = f i+1 (... f 1 (x 0 )) j j=1 f i (... f 1 (x 0 )) L f i+1 (... f 1 (x 0 )) j n L i f i+1 (... f 1 (x = θ i 0 )) j L j=1 θ i f i+1 (... f 1 (x 0 )) j

39 Backpropagation: Data flow f i (... f 1 (x 0 )) f i+1 (f i (... f 1 (x 0 ))) L f i (... f 1 (x 0 )) f i+1 ( ; θ i+1 ) L θ i+1 L f i+1 (... f 1 (x 0 ))

40 Torch Implementation Torch uses declarative unit-based specification of NN Every function is a represented as a unit. Responsibilities: 1. Expose any parameters θ i+1 as tensors 2. Compute f i+1 (x, θ i+1 ) (fprop) 3. Compute any necessary state needed for bprop 4. Compute chain-rule given L f i+1 (...f 1 (x 0 )) and f i (... f 1 (x 0 )) 5. Compute parameter gradient L θ i+1 Contract: forward will always be called before backward.

41 Torch Units input f i (... f 1 (x 0 )) self.output f i+1 (f i (... f 1 (x 0 ))) weights f i+1 ( ; θ i+1 ) self.gradinput L f i (... f 1 (x 0 )) self.gradweight L θ i+1 gradoutput L f i+1 (... f 1 (x 0 ))

42 Loss Criterions prediction ŷ loss L(y, ŷ) weights L(, ) self.gradinput L ŷ target y

43 Torch Internals [Web]

44 Torch Units: Batch input f i (... f 1 (X 0 )) self.output f i+1 (f i (... f 1 (X 0 ))) f i+1 ( ; θ i+1 ) self.gradinput L f i (... f 1 (X 0 )) self.gradweight L θ i+1 gradoutput L f i+1 (... f 1 (X 0 ))

45 Loss Criterions prediction Ŷ loss L(Y, Ŷ) L(, ) self.gradinput L Ŷ target Y

46 Today Benefits of neural networks Training neural networks Next time: Pretraining and word embeddings

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the