This lecture. Miscellaneous classification methods: Neural networks, Support vector machines, Transformation-based learning, K nearest neighbours.
Transcription
2 This lecture. Miscellaneous classification methods: Neural networks, Support vector machines, Transformation-based learning, K nearest neighbours. Neural word-vector representations.
3 Classification. Define classes/categories (e.g., phonemes; happy <-> sad), preprocess (filter acoustic noise; tokenize and normalize), extract features (MFCCs; word statistics), train a classifier (HMM, SVM, decision tree), then use the trained classifier.
4 Types of classifiers. Generative classifiers model the world. Parameters are set to maximize the likelihood of the training data, and we can generate new observations from these models; e.g., hidden Markov models. Discriminative classifiers emphasize class boundaries. Parameters are set to minimize error on the training data; e.g., ID3 decision trees. What do class boundaries look like in the data?
5 Binary and linearly separable. Perhaps the easiest case. Extends to dimensions d ≥ 3, where the line becomes a (hyper-)plane.
6 N-ary and linearly separable. A bit harder: random guessing gives only 1/N accuracy (given equally likely classes). We can logically combine N − 1 binary classifiers. (Figure: decision regions and decision boundaries.)
7 Class holes Sometimes it can be impossible to draw any lines through the data to separate the classes. Are those troublesome points noise or real phenomena?
8 The kernel trick. We can sometimes linearize a non-linear case by moving the data into a higher dimension with a kernel function, e.g., S = sin(x² + y²)/(x² + y²). Now we have a linear decision boundary, S = 0!
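A minimal sketch of this lifting idea (not from the slides), in Python with NumPy: ring-shaped 2-D data that no line can separate becomes separable by a flat threshold once we add a third coordinate derived from the squared radius. The data and the threshold value are invented purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Two classes that are not linearly separable in 2-D:
# an inner cluster and a surrounding ring.
inner = rng.normal(0.0, 0.5, size=(100, 2))
angles = rng.uniform(0, 2 * np.pi, size=100)
ring = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])
ring = ring + rng.normal(0.0, 0.2, size=ring.shape)

def lift(points):
    # Add a third coordinate derived from the squared radius.
    r2 = points[:, 0] ** 2 + points[:, 1] ** 2
    return np.column_stack([points, r2])

# After lifting, a horizontal plane (a threshold on the new coordinate)
# separates the two classes.
threshold = 4.0
print("inner points below the plane:", np.mean(lift(inner)[:, 2] < threshold))
print("ring points above the plane: ", np.mean(lift(ring)[:, 2] > threshold))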
9 Support Vector Machines (SVMs)
10 Support vector machines (SVMs) In binary linear classification, two classes are assumed to be separable by a line (or plane). However, many possible separating planes might exist. Each of these blue lines separates the training data. Which line is the best?
11 Support vector machines (SVMs) The margin is the width by which the boundary could be increased before it hits a training datum. The maximum margin linear classifier is the linear classifier with the maximum margin. The support vectors (indicated) are those data points against which the margin is pressed. The bigger the margin the less sensitive the boundary is to error.
12 Support vector machines (SVMs). The width of the margin, M, can be computed from the angle and displacement of the planar boundary, x, as well as the planes that touch the data points. Given an initial guess of the angle and displacement of x, we can compute whether all data are correctly classified, and the width of the margin, M. We update our guess by quadratic programming, which is semi-analytic.
13 Support vector machines (SVMs). The maximum margin helps SVMs generalize to situations when it's impossible to linearly separate the data. We introduce a parameter that allows us to measure the distance of all data not in their correct zones. We simultaneously maximize the margin while minimizing the misclassification error. There is a straightforward approach to solving this system based on quadratic programming.
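As a rough illustration of this soft-margin trade-off, the sketch below uses scikit-learn's SVC, an off-the-shelf implementation that is not part of the lecture; the toy data and the choice C = 1.0 are assumptions made only for the demonstration.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Two roughly linearly separable classes with a little overlap.
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),
               rng.normal(+2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# C trades off margin width against misclassifying training points:
# a small C prefers a wider margin and tolerates points on the wrong side.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("weight vector:", clf.coef_, "bias:", clf.intercept_)

The support vectors reported here are exactly the data points pressed against the margin in the slides' picture.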
14 Support vector machines (SVMs). SVMs generalize to higher-dimensional data and to systems in which the data is non-linearly separable (e.g., by a circular decision boundary). Using the kernel trick (slide 8) is common. Many binary SVM classifiers can also be combined to simulate a multi-category classifier (slide 6). (Still) one of the most popular off-the-shelf classifiers.
15 Support vector machines (SVMs). SVMs are empirically very accurate classifiers. They perform well in situations where data are static, i.e., don't change over time, e.g., genre classification given fixed statistics of documents, or phoneme recognition given only a single frame of speech. SVMs do not generalize as well to time-variant systems. Kernel functions tend not to allow for observations of different lengths (i.e., all data points have to be of the same dimensionality).
16 Artificial Neural Networks (ANNs)
17 Artificial neural networks. Artificial neural networks (ANNs) were (kind of) inspired by neurobiology (Widrow and Hoff, 1960). Each unit has many inputs (dendrites), one output (axon). The nucleus fires (sends an electric signal along the axon) given input from other neurons. Learning occurs at the synapses that connect neurons, either by amplifying or attenuating signals. (Figure: a biological neuron, with dendrites, nucleus, and axon labelled.)
18 Perceptron: an artificial neuron. Each neuron calculates a weighted sum of its inputs and compares this to a threshold, τ. If the sum exceeds the threshold, the neuron fires. Inputs a_i are activations from adjacent neurons, each weighted by a parameter w_i: x = Σ_{i=1}^{n} w_i a_i, passed through an activation g(). If x > τ, S = 1; else, S = 0. (McCulloch-Pitts model.)
19 Perceptron output. Perceptron output is determined by an activation function, g(), which can be a non-linear function of the weighted input. A popular activation function is the sigmoid: S = g(x) = 1 / (1 + e^(−x)). Its derivative is the easily computable g′ = g(1 − g).
20 Perceptron learning. Weights are adjusted in proportion to the error (i.e., the difference between the desired output, y, and the actual output, S). The derivative g′ allows us to assign blame proportionally. Given a small learning rate, α (e.g., 0.05), we repeatedly adjust each of the weighting parameters by w_j ← w_j + α Σ_{i=1}^{R} Err_i g′(x_i) x_i, where Err_i = (y − S) and we have R training examples.
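A toy sketch of this update rule in Python, assuming a sigmoid activation and using the OR function as the training task; both choices are illustrative and not taken from the slides.

import numpy as np

def g(x):
    # Sigmoid activation.
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    # Derivative of the sigmoid: g' = g(1 - g).
    s = g(x)
    return s * (1.0 - s)

rng = np.random.default_rng(2)

# Toy task: learn the OR function (linearly separable).
A = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
A = np.hstack([A, np.ones((4, 1))])           # bias input fixed at 1

w = rng.normal(0, 0.1, size=3)                # initial weights
alpha = 0.5                                   # learning rate

for epoch in range(2000):
    for a_i, y_i in zip(A, y):
        x = w @ a_i                             # weighted sum of the inputs
        S = g(x)                                # perceptron output
        err = y_i - S                           # blame is proportional to error
        w = w + alpha * err * g_prime(x) * a_i  # the update rule from the slide

print("outputs after training:", np.round(g(A @ w), 2))

After training, the outputs approach (0, 1, 1, 1); a threshold perceptron could not do the same for XOR, as the next slide notes.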
21 Threshold perceptra and XOR. Some relatively simple logical functions cannot be learned by threshold perceptra (since they are not linearly separable). (Figure: three logical functions plotted over the inputs a_1 and a_2, of which XOR cannot be separated by a single line.)
22 Artificial neural networks. Complex functions can be represented by layers of perceptra (Multi-Layer Perceptra, MLPs). Inputs are passed to the input layer. Activations are propagated through hidden layers to the output layer (which usually represents the class).
23 Artificial neural networks. MLPs are quite robust to noise, and are trained specifically to reduce error. However, they can be sensitive to initial parameterization, relatively slow to train, and incapable of capturing long-term dependencies. What can they learn about words?
24 Words. Given a corpus with D (e.g., 100K) unique words, the classical binary approach is to uniquely assign each word an index in a D-dimensional vector (a 'one-hot' representation), e.g., for the word 'soccer'. A classic word-feature representation instead assigns features to each of the D indices, e.g., VBG, positive, age-of-acquisition. Is there a way to learn the nature of these abstract features?
25 Singular value decomposition. Decompose the word co-occurrence matrix M (rows and columns indexed by words such as 'a', 'as', 'chuck', 'could') as M = U Σ Vᵀ, and keep only the top k singular dimensions as the embedding: Emb = U_{:,1:k} Σ_{1:k,1:k} (here, k = 2).
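A small sketch of these SVD-based embeddings with NumPy; the vocabulary and the co-occurrence counts below are invented purely for illustration.

import numpy as np

# Tiny word-by-word co-occurrence matrix M (rows and columns are words).
words = ["a", "as", "chuck", "could", "wood"]
M = np.array([
    [0, 2, 1, 1, 3],
    [2, 0, 1, 1, 2],
    [1, 1, 0, 1, 4],
    [1, 1, 1, 0, 2],
    [3, 2, 4, 2, 0],
], dtype=float)

# Full SVD: M = U diag(s) Vt.
U, s, Vt = np.linalg.svd(M)

# Keep the top-k singular dimensions as the dense embedding.
k = 2
Emb = U[:, :k] * s[:k]          # same as U[:, :k] @ np.diag(s[:k])

for w, vec in zip(words, Emb):
    print(f"{w:>6}: {np.round(vec, 2)}")

Words with similar co-occurrence patterns end up with nearby rows in Emb, which is what the dendrogram on the next slide visualizes.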
26 Singular value decomposition dendrogram Rohde et al. (2006) An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. Communications of the ACM 8:
27 Singular value decomposition Problems with SVD: 1. Computational costs scale quadratically with size of M. 2. Hard to incorporate new words. Rohde et al. (2006) An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. Communications of the ACM 8:
28 Word2vec to the rescue. Solution: don't capture co-occurrence directly; just try to predict surrounding words, e.g., P(w_{t+1} = yourself | w_t = kiss) from contexts like 'hey go kiss yourself', 'hey go hug yourself'. Here, we're predicting the center word given the context. A popular alternative: GloVe.
29 Learning word representations: continuous bag of words (CBOW). A one-hot input x (D = 100K) is multiplied by W_in (D × H) to give the hidden activation a, which is multiplied by W_out (H × D) to give the output y (D = 100K). Note that we have two vector representations of each word w: v_w = x W_in (the w-th row of W_in) and V_w (the w-th column of W_out). In the figure, the one-hot input is 'kiss', the inside word of the contexts 'go kiss yourself' and 'go hug yourself', and the softmax output assigns a probability to each word in the vocabulary, e.g., 'go': P(w | w_i) = exp(V_wᵀ v_{w_i}) / Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_i}), where v_w is the input vector for word w and V_w is the output vector for word w.
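A sketch of these two weight matrices and the softmax prediction in Python, with an invented five-word vocabulary and random, untrained weights (so the printed probabilities are meaningless until training).

import numpy as np

rng = np.random.default_rng(3)

vocab = ["go", "kiss", "hug", "yourself", "outside"]
D, H = len(vocab), 4                      # vocabulary size, hidden size

W_in = rng.normal(0, 0.1, size=(D, H))    # rows are the input vectors v_w
W_out = rng.normal(0, 0.1, size=(H, D))   # columns are the output vectors V_w

def predict(word):
    # P(w | input word) via a softmax over the output vectors.
    x = np.zeros(D)
    x[vocab.index(word)] = 1.0            # one-hot input
    v = x @ W_in                          # hidden layer = input vector v_w
    scores = v @ W_out                    # V_w' . v_w for every word w'
    e = np.exp(scores - scores.max())     # numerically stable softmax
    return e / e.sum()

probs = predict("kiss")
for w, p in zip(vocab, probs):
    print(f"P({w:>9} | kiss) = {p:.3f}")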
30 Using word representations. Without a latent space, kiss = [0,0,0,…,0,1,0,…,0] and hug = [0,0,0,…,0,0,1,…,0], so Similarity = cos(x, y) = 0.0. Transforming with v_w = x W_in (from D = 100K down to H = 300), in the latent space kiss = [0.8, 0.69, 0.4, …, 0.05] and hug = [0.9, 0.7, 0.43, …, 0.05], so Similarity = cos(x, y) = 0.9.
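The comparison can be checked directly, as in the hedged sketch below; the one-hot indices are arbitrary placeholders, and the dense vectors are just the leading values quoted on the slide, truncated for illustration.

import numpy as np

def cosine(x, y):
    # Cosine similarity between two vectors.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# One-hot vectors for two different words are always orthogonal.
kiss_onehot = np.zeros(100_000); kiss_onehot[17] = 1.0   # index 17 is arbitrary
hug_onehot = np.zeros(100_000); hug_onehot[42] = 1.0     # index 42 is arbitrary
print("one-hot similarity:", cosine(kiss_onehot, hug_onehot))   # exactly 0.0

# Truncated dense vectors (values from the slide) are close for related words.
kiss = np.array([0.8, 0.69, 0.4, 0.05])
hug = np.array([0.9, 0.7, 0.43, 0.05])
print("dense similarity:  ", round(cosine(kiss, hug), 2))       # close to 1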
31 Linguistic regularities in vector space Trained on the Google news corpus with over 300 billion words.
32 Linguistic regularities in vector space (from GloVe; same idea).
33 Linguistic regularities in vector space.
Expression                      Nearest token
Paris − France + Italy          Rome
Bigger − big + cold             Colder
Sushi − Japan + Germany         bratwurst
Cu − copper + gold              Au
Windows − Microsoft + Google    Android
Analogies: apple:apples :: octopus:octopodes. Hypernymy: shirt:clothing :: chair:furniture.
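A sketch of answering such an analogy by vector arithmetic and nearest-neighbour search; the tiny embedding table here is synthetic and deliberately nudged so the analogy holds, whereas real systems use vectors trained on large corpora (e.g., word2vec or GloVe).

import numpy as np

rng = np.random.default_rng(4)

# Hypothetical tiny embedding table.
emb = {w: rng.normal(size=50) for w in
       ["paris", "france", "italy", "rome", "berlin", "germany"]}
# Nudge the toy vectors so the analogy structure roughly holds.
emb["rome"] = emb["paris"] - emb["france"] + emb["italy"] + rng.normal(0, 0.05, 50)

def nearest(query, exclude):
    # Return the word whose vector has the highest cosine similarity to query.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = [(cos(query, v), w) for w, v in emb.items() if w not in exclude]
    return max(candidates)[1]

# paris - france + italy should land near "rome".
query = emb["paris"] - emb["france"] + emb["italy"]
print(nearest(query, exclude={"paris", "france", "italy"}))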
34 Actually doing the learning. First, let's define what our parameters are. Given H-dimensional vectors and V words, θ is the concatenation of the input vector v_w and the output vector V_w for every word in the vocabulary, so θ ∈ R^{2VH}.
35 Aside: actually doing the learning. We have many options; gradient descent is popular. We want to optimize, given T words of training data, J(θ) = −(1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t), and we want to update the vectors V_w and then v_w within θ: θ^{new} = θ^{old} − η ∇_θ J(θ). So we'll need to take the derivative of the (log of the) softmax function, P(w | w_i) = exp(V_wᵀ v_{w_i}) / Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_i}), where v_w is the input vector for word w and V_w is the output vector for word w.
36 Aside: actually doing the learning. We need to take the derivative of the (log of the) softmax function:
δ/δv_{w_t} log P(w_{t+j} | w_t) = δ/δv_{w_t} log [ exp(V_{w_{t+j}}ᵀ v_{w_t}) / Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_t}) ]
= δ/δv_{w_t} [ log exp(V_{w_{t+j}}ᵀ v_{w_t}) − log Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_t}) ]
= V_{w_{t+j}} − δ/δv_{w_t} log Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_t})
[apply the chain rule, δ log z / δx = (1/z) · δz/δx]
= V_{w_{t+j}} − Σ_{w'=1}^{D} P(w' | w_t) V_{w'}
More details:
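The analytic gradient V_{w_{t+j}} − Σ_{w'} P(w' | w_t) V_{w'} can be sanity-checked against finite differences; in this sketch the vocabulary size, vector size, and random vectors are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
D, H = 8, 5                                  # vocabulary size, vector size

V = rng.normal(size=(D, H))                  # output vectors V_w (one per row)
v = rng.normal(size=H)                       # input vector for the centre word
o = 3                                        # index of the observed context word

def log_prob(v_in):
    # log P(word o | centre word) under the softmax.
    scores = V @ v_in
    scores = scores - scores.max()           # numerical stability
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient from the derivation: V_o - sum_w P(w) V_w.
scores = V @ v
p = np.exp(scores - scores.max())
p = p / p.sum()
analytic = V[o] - p @ V

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = np.array([
    (log_prob(v + eps * np.eye(H)[i]) - log_prob(v - eps * np.eye(H)[i])) / (2 * eps)
    for i in range(H)
])
print("max abs difference:", float(np.max(np.abs(analytic - numeric))))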
37 Results (note: all extrinsic). Bengio et al. 2001, 2003: beating N-grams on small datasets (Brown & APNews), but much slower. Schwenk et al. 2002, 2004, 2006: beating state-of-the-art large-vocabulary ASR using a deep & distributed NLP model, with real-time speech recognition. Morin & Bengio 2005, Blitzer et al. 2005, Mnih & Hinton 2007, 2009: better & faster models through hierarchical representations. Collobert & Weston 2008: reaching or beating state-of-the-art in multiple NLP tasks (SRL, PoS, NER, chunking) thanks to unsupervised pre-training and multi-task learning. Bai et al. 2009: ranking & semantic indexing (IR).
38 Sentiment analysis. The traditional bag-of-words approach to sentiment analysis used dictionaries of happy and sad words, simple counts, and either regression or binary classification. But consider these: 'Best movie of the year'; 'Slick and entertaining, despite a weak script'; 'Fun and sweet but ultimately unsatisfying'.
39 Tree-based sentiment analysis. We can combine pairs of words into phrase structures. Similarly, we can combine phrase and word structures hierarchically for classification. (Figure: word vectors x_1 and x_2, each of dimension D = 300, are combined through a weight matrix into a phrase vector x_{1,2}, with H = 300.)
40 Tree-based sentiment analysis (currently broken) demo:
41 Recurrent neural networks (RNNs). An RNN has feedback connections in its structure so that it remembers n previous inputs when reading a sequence (e.g., it can use the current word's input together with hidden units from the previous word). An Elman network feeds the hidden units back; a Jordan network (not shown) feeds the output units back.
42 RNNs do PoS. You can unroll RNNs over time for various dynamic models, e.g., PoS tagging. (Figure: the network unrolled over t = 1…4, tagging 'She had a …' as Pronoun, Aux, Det, ….)
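A forward-pass sketch of an Elman-style RNN unrolled over a toy sentence; the embeddings and weights here are random and untrained, so the printed tags are meaningless, but the recurrence (the hidden state carried from word to word) is the point.

import numpy as np

rng = np.random.default_rng(6)

words = ["She", "had", "a", "sandwich"]       # hypothetical input sentence
tags = ["Pronoun", "Aux", "Det", "Noun"]      # toy tag inventory

D, H, T = 20, 8, len(tags)                    # word dim, hidden dim, number of tags
embed = {w: rng.normal(size=D) for w in words}

W_xh = rng.normal(0, 0.3, size=(D, H))        # input -> hidden
W_hh = rng.normal(0, 0.3, size=(H, H))        # hidden -> hidden (the recurrence)
W_hy = rng.normal(0, 0.3, size=(H, T))        # hidden -> output tag scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)                               # initial hidden state
for t, w in enumerate(words, start=1):
    # Elman network: the new hidden state mixes the current word's input
    # with the hidden state carried over from the previous word.
    h = np.tanh(embed[w] @ W_xh + h @ W_hh)
    y = softmax(h @ W_hy)
    print(f"t={t} {w:>9}: predicted tag = {tags[int(np.argmax(y))]}")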
43 SMT with RNNs. SMT is hard and involves long-term dependencies. Solution: encode the entire sentence into a single vector representation, then decode. (Figure: ENCODE over t = 1…5, reading 'The ocarina of time <eos>' into a sentence representation.)
44 SMT with RNNs. Try it: ~30K vocabulary, 500M-word training corpus (taking 5 days on GPUs). All that good morphological/syntactic/semantic stuff gets embedded into sentence vectors. (Figure: DECODE over t = 5…9, emitting "L'ocarina de temps <eos>" from the sentence representation.)
45 Transformation-Based Learning (TBL)
46 Transformation-based learning. Developed by Eric Brill for his part-of-speech tagger. It is also used for text chunking, prepositional phrase attachment (*), syntactic parsing, dialog tagging, etc. Transformation-based learning (TBL) modifies the output of one method (e.g., an HMM) according to a set of learned rules. These rules are determined automatically by a discriminative training process. (*) Prepositional phrase attachment is the problem of determining, e.g., who has the telescope in 'I saw the man on the hill with the telescope.'
47 Transformation-based learning. Initial imperfect tagging of the data (many errors) → transformation rules of the form [Condition, Action] → new tagging with fewer errors. Components: allowable transformations, and a learning algorithm.
48 TBL: allowable transformations. TBL requires transformation rule templates. Each template is of the form [CONDITION, ACTION]. Actions include, e.g., changing the i-th tag to τ: t_i → τ. Conditions include conjunctions, negations, and disjunctions of, e.g.: the m-th preceding/following tag is t_j (e.g., the preceding tag is an NNS); the m-th preceding/following word is w_j (e.g., the preceding/following word is 'ocelot'); the m-th word is w_j and the n-th tag is t_k; …
49 TBL: example transformation. An instantiated rule might be, e.g.: if the preceding word is 'to' and the current word is 'strike' and the current tag is NN, then change the current tag to VB. Condition (triggering environment): preceding word = 'to' & current word = 'strike' & current tag = NN. Action (transformation/rewrite rule): change the current tag from NN to VB.
50 TBL: learning algorithm In training, we generate one new rule per iteration and apply it to the training set, thereby modifying it. The initial training set includes: the output of another tagger (possibly riddled with errors), the correct gold standard tags.
51 TBL: learning algorithm. Learning TBL rules is an iterative process: 1. Generate all rules, R, that correct at least one error. 2. For each rule r ∈ R: (a) apply the rule r to a copy of the current state of the training set; (b) score the result (compute the overall error). 3. Select the rule r that minimizes error. 4. Update the training set by applying r. 5. If the error is below some threshold, halt. Otherwise, repeat from step 1.
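A minimal sketch of this greedy loop in Python, assuming rules are plain (condition, action) pairs applied to a single tagged sentence; the real Brill tagger instantiates rules from templates over a full training corpus.

# Hypothetical rule representation: a (condition, action) pair of functions.
def apply_rule(rule, words, tags):
    # Return a new tag sequence with the rule applied at every position.
    condition, action = rule
    return [action(words, tags, i) if condition(words, tags, i) else tags[i]
            for i in range(len(tags))]

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def tbl_learn(words, tags, gold, candidate_rules, threshold=0):
    # Greedily pick the rule that most reduces error, until no rule helps.
    learned = []
    while True:
        scored = [(errors(apply_rule(r, words, tags), gold), r)
                  for r in candidate_rules]
        best_err, best_rule = min(scored, key=lambda s: s[0])
        if errors(tags, gold) - best_err <= threshold:
            break                                   # no rule improves enough: halt
        tags = apply_rule(best_rule, words, tags)   # update the training set
        learned.append(best_rule)
    return learned, tags

# The example rule from slide 49: preceding word 'to', current word 'strike',
# current tag NN -> change the current tag to VB.
rule = (lambda w, t, i: i > 0 and w[i - 1] == "to"
        and w[i] == "strike" and t[i] == "NN",
        lambda w, t, i: "VB")

words = ["to", "strike", "a", "match"]
initial = ["TO", "NN", "DT", "NN"]                  # imperfect initial tagging
gold = ["TO", "VB", "DT", "NN"]                     # gold-standard tags

rules, final = tbl_learn(words, initial, gold, [rule])
print(final)                                        # ['TO', 'VB', 'DT', 'NN']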
52 Transformation-based learning Advantages of transformation-based learning include: TBL rules can capture more context than Markov models, The entire training set is used for training, The evaluation criterion (error rate) is direct, as opposed to indirect methods like the reduction of entropy (e.g., decision trees), Resulting rules can be easy to review and to understand. Disadvantages include: The rules that TBL generates are not probabilistic, The rule sequences may not be optimal, since only one is considered at a time.
53 Reading/Announcements. ANN: Russell & Norvig, Artificial Intelligence: A Modern Approach, 2nd ed., section 20.5 (optional). SVM: ibid., section 20.6 (optional). TBL: Manning & Schütze, section 10.4. Friday: review session. 19 or 20 April: second review session.