Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
1 Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
2 Review: Neural Networks One-layer multi-layer perceptron architecture:

NN_MLP1(x) = g(xW^1 + b^1)W^2 + b^2

(compare the perceptron: xW + b)

x is the dense input representation in R^{1×d_in}
W^1 ∈ R^{d_in×d_hid}, b^1 ∈ R^{1×d_hid}; first affine transformation
W^2 ∈ R^{d_hid×d_out}, b^2 ∈ R^{1×d_out}; second affine transformation
g : R^{d_hid} → R^{d_hid} is an activation non-linearity (often pointwise)
g(xW^1 + b^1) is the hidden layer
3 Review: Non-Linearities Tanh Hyperbolic Tangent:

tanh(t) = (exp(t) − exp(−t)) / (exp(t) + exp(−t))

Intuition: similar to the sigmoid, but with range between −1 and 1.
4 Review: Backpropagation The network is a composition f_{i+1}(f_i(... f_1(x_0))). Each unit f_{i+1}(·; θ_{i+1}) receives ∂L/∂f_{i+1}(... f_1(x_0)) from the layer above and computes

∂L/∂θ_{i+1} = ∂L/∂f_{i+1}(... f_1(x_0)) · ∂f_{i+1}(·; θ_{i+1})/∂θ_{i+1}

∂L/∂f_i(... f_1(x_0)) = ∂L/∂f_{i+1}(... f_1(x_0)) · ∂f_{i+1}(·; θ_{i+1})/∂f_i(... f_1(x_0))
5 Quiz One common class of operations in neural network models is known as pooling. Informally, a pooling layer consists of an aggregation unit, typically unparameterized, that reduces the input to a smaller size. Consider three pooling functions of the form f : R^n → R:

1. f(x) = max_i x_i
2. f(x) = min_i x_i
3. f(x) = Σ_i x_i / n

What effect does each of these functions have? What are their gradients? How would you implement backpropagation for these units?
6 Quiz Max pooling: f(x) = max_i x_i. Keeps only the most activated input. Fprop is simple; however, we must store the arg max (the "switch"). Bprop: the gradient is zero except at the switch, which receives gradOutput.
Min pooling: f(x) = min_i x_i. Keeps only the least activated input. Fprop is simple; however, we must store the arg min (the "switch"). Bprop: the gradient is zero except at the switch, which receives gradOutput.
Avg pooling: f(x) = Σ_i x_i / n. Keeps the average input activation. Fprop is simply the mean. Bprop: gradOutput is averaged (divided by n) and passed to all inputs.
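These answers translate directly into code. Below is a minimal NumPy sketch of forward/backward for max and average pooling (min pooling is max with arg min); the class and method names are illustrative, not the course's Torch API.

```python
import numpy as np

class MaxPool:
    """f(x) = max_i x_i: keep only the most activated input."""
    def forward(self, x):
        self.switch = np.argmax(x)             # must store the "switch"
        return x[self.switch]
    def backward(self, x, grad_output):
        grad_input = np.zeros_like(x)          # gradient is zero everywhere...
        grad_input[self.switch] = grad_output  # ...except at the switch
        return grad_input

class AvgPool:
    """f(x) = sum_i x_i / n: keep the average activation."""
    def forward(self, x):
        return x.mean()
    def backward(self, x, grad_output):
        # grad_output is averaged and passed to every input
        return np.full_like(x, grad_output / x.size)

pool = MaxPool()
x = np.array([0.1, 2.0, -0.5])
y = pool.forward(x)                     # 2.0, switch = 1
g = pool.backward(x, grad_output=1.0)   # [0., 1., 0.]
```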
9 Contents Embedding Motivation C&W Embeddings word2vec Evaluating Embeddings
10 1. Use dense representations instead of sparse 2. Use windowed area instead of sequence models 3. Use neural networks to model windowed interactions
11 What about rare words?
12 Word Embeddings Embedding layer: x^0 W^0, where

x^0 ∈ R^{1×d_0} is a one-hot word vector
W^0 ∈ R^{d_0×d_in}, with d_0 = |V|

Notes: d_0 >> d_in, e.g. d_0 = 10000, d_in = 50.
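Since x^0 is one-hot, the product x^0 W^0 simply selects one row of W^0, so real implementations index rather than multiply. A small NumPy sketch with the sizes from the slide (the word id is an arbitrary illustration):

```python
import numpy as np

V, d_in = 10_000, 50                   # d_0 = |V| >> d_in, as on the slide
W0 = np.random.randn(V, d_in) * 0.01   # embedding matrix W^0

w = 42                                 # id of some word in the vocabulary
x0 = np.zeros(V)
x0[w] = 1.0                            # one-hot vector x^0 in R^{1 x d_0}

# the dense multiply and the row lookup give the same embedding
assert np.allclose(x0 @ W0, W0[w])
```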
13 Pretraining Representations We would like strong shared representations of words. However, the PTB has only 1M labeled words, which is relatively small. Collobert et al. (2008, 2011) use a semi-supervised method. (Close connection to Bengio et al. (2003), the next topic.)
14 Semi-Supervised Training Idea: train representations separately on more data. 1. Pretrain the word embeddings W^0 first. 2. Substitute them in as the first NN layer. 3. Fine-tune the embeddings for the final task: modify the first layer based on supervised gradients. (Optional; some work skips this step.) A schematic sketch follows.
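The pipeline, schematically; the file name and update rule below are illustrative assumptions, not the actual C&W code:

```python
import numpy as np

# Step 1's output: pretrained embeddings (hypothetical file name).
W0 = np.load("pretrained_embeddings.npy")   # shape (|V|, d_in)

# Step 2: W0 becomes the first layer of the supervised tagger.
# Step 3 (optional): fine-tune W0 with the task's gradients.
FINE_TUNE = True

def update_embeddings(W0, grad_W0, lr=0.01):
    if FINE_TUNE:                # some work keeps the embeddings frozen
        W0 -= lr * grad_W0
    return W0
```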
15 Large Corpora To learn rare word embeddings, we need many more tokens. C&W: English Wikipedia (631 million word tokens), Reuters Corpus (221 million word tokens); total vocabulary size: 130,000 word types. word2vec: Google News (6 billion word tokens); total vocabulary size: 1M word types. But this data has no labels...
16 Contents Embedding Motivation C&W Embeddings word2vec Evaluating Embeddings
17 C&W Embeddings Assumption: Text in Wikipedia is coherent (in some sense). Most randomly corrupted text is incoherent. Embeddings should distinguish coherence. Common idea in unsupervised learning (distributional hypothesis).
18 C&W Setup Let V be the vocabulary of English and let s score any window of size d_win = 5. If we see the phrase [ the dog walks to the ], it should score higher by s than [ the dog house to the ], [ the dog cats to the ], [ the dog skips to the ], ...
19 C&W Setup Can estimate the score s as a windowed neural network:

s(w_1, ..., w_{d_win}) = hardtanh(xW^1 + b^1)W^2 + b^2

with x = [v(w_1) v(w_2) ... v(w_{d_win})] (concatenated embeddings)

d_in = d_win × 50, d_hid = 100, d_win = 11, d_out = 1!

Example: for the window (w_3, ..., w_7), x = [v(w_3) v(w_4) v(w_5) v(w_6) v(w_7)], each block of width d_in/d_win.
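A NumPy sketch of the scoring function s, shown with the d_win = 5 window from the example; the initialization choices and vocabulary size are illustrative:

```python
import numpy as np

d_win, d_emb, d_hid = 5, 50, 100          # the slide uses d_win = 11 in practice
d_in = d_win * d_emb                      # concatenated window width
rng = np.random.default_rng(0)

W0 = rng.normal(0, 0.01, (100_000, d_emb))   # word embeddings v(w)
W1 = rng.normal(0, 0.01, (d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.01, (d_hid, 1));    b2 = np.zeros(1)

def hardtanh(t):
    return np.clip(t, -1.0, 1.0)          # hard tanh: clamp to [-1, 1]

def score(window_ids):
    """s(w_1, ..., w_dwin): concatenate embeddings, one hidden layer, scalar out."""
    x = np.concatenate([W0[w] for w in window_ids])  # x = [v(w_1) ... v(w_dwin)]
    return float(hardtanh(x @ W1 + b1) @ W2 + b2)
```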
20 Training? A different setup than the previous experiments: there is no direct supervision y. Train to rank good examples better.
21 Ranking Loss Given only examples {x_1, ..., x_n}, and for each example a set D(x) of alternatives:

L(θ) = Σ_i Σ_{x′ ∈ D(x_i)} L_ranking(s(x_i; θ), s(x′; θ))

L_ranking(y, ŷ) = max{0, 1 − (y − ŷ)}

Example: C&W ranking. x = [the dog walks to the], D(x) = { [the dog skips to the], [the dog in to the], ... } (Torch nn.rankingcriterion) Note: slightly different setup.
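A direct transcription of the loss into Python, reusing the score function from the sketch above; alternatives stands for the map x ↦ D(x):

```python
def ranking_loss(y, y_hat):
    """L_ranking(y, y_hat) = max{0, 1 - (y - y_hat)}."""
    return max(0.0, 1.0 - (y - y_hat))

def total_loss(examples, alternatives):
    """L(theta) = sum_i sum_{x' in D(x_i)} L_ranking(s(x_i), s(x'))."""
    return sum(ranking_loss(score(x), score(x_alt))
               for x in examples
               for x_alt in alternatives(x))
```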
22 C&W Embeddings in Practice Vocabulary size: |D(x)| > 100,000. Training time: 4 weeks. (Collobert is a main author of Torch.)
23 Sampling (Sketch of Wsabie (Weston, 2011)) Observation: in many contexts

L_ranking(y, ŷ) = max{0, 1 − (y − ŷ)} = 0

This is particularly true later in training. Even so, for difficult contexts it may be easy to find some x′ ∈ D(x) with L_ranking(y, ŷ) > 0. We can therefore sample from D(x) to find an update.
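A sketch of this sampling trick, reusing score and ranking_loss from the sketches above and corrupting the middle word as in D(x); the retry budget is an illustrative choice:

```python
import random

def sample_update(x, vocab, max_tries=50):
    """Sample corruptions of x until one violates the margin (Wsabie-style)."""
    s_x = score(x)                         # score of the true window
    for _ in range(max_tries):
        x_alt = list(x)
        x_alt[len(x_alt) // 2] = random.choice(vocab)  # corrupt the middle word
        if ranking_loss(s_x, score(x_alt)) > 0:        # margin is violated
            return x_alt                   # use this sample for the update
    return None                            # all sampled losses were 0; skip
```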
24 C&W Results 1. Use dense representations instead of sparse 2. Use windowed area instead of sequence models 3. Use neural networks to model windowed interactions 4. Use semi-supervised learning to pretrain representations.
25 Contents Embedding Motivation C&W Embeddings word2vec Evaluating Embeddings
26 word2vec Contributions: Scale embedding process to massive sizes Experiments with several architectures Empirical evaluations of embeddings Influential release of software/data. Differences with C&W Instead of MLP uses (bi)linear model (linear in paper) Instead of ranking model, directly predict word (cross-entropy) Various other extensions. Two different models 1. Continuous Bag-of-Words (CBOW) 2. Continuous Skip-gram
28 word2vec (Bilinear Model) Back to a pure bilinear model, but with a much bigger output space:

ŷ = softmax(((Σ_i x_i^0 W^0) / (d_win − 1)) W^1)

x_i^0 ∈ R^{1×d_0}: input words as one-hot vectors
W^0 ∈ R^{d_0×d_in}; d_0 = |V|, word embeddings
W^1 ∈ R^{d_in×d_out}; d_out = |V|, output embeddings

Notes: bilinear parameter interaction; d_0 >> d_in, e.g. 50 ≤ d_in ≤ 1000, |V| ≈ 1M or more.
29 word2vec (Mikolov, 2013)
30 Continuous Bag-of-Words (CBOW)

ŷ = softmax(((Σ_i x_i^0 W^0) / (d_win − 1)) W^1)

Attempt to predict the middle word. Example: CBOW on [ the dog walks to the ]:

x = (v(w_3) + v(w_4) + v(w_6) + v(w_7)) / (d_win − 1),  y = δ(w_5)

W^1 is no longer partitioned by row (order is lost).
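A NumPy sketch of the CBOW forward pass, with W0 and W1 the input and output embedding matrices from the previous slide:

```python
import numpy as np

def cbow_forward(context_ids, W0, W1):
    """CBOW: average context embeddings, then a linear map + softmax."""
    x = W0[context_ids].mean(axis=0)   # (sum_i x_i^0 W^0) / (d_win - 1)
    z = x @ W1                         # scores over all |V| output words
    z -= z.max()                       # for numerical stability
    p = np.exp(z)
    return p / p.sum()                 # y_hat

# e.g. predict w_5 of [the dog walks to the] from w_3, w_4, w_6, w_7:
# y_hat = cbow_forward([w3, w4, w6, w7], W0, W1)
```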
31 Continuous Skip-gram

ŷ = softmax((x^0 W^0) W^1)

Also a bilinear model. Attempt to predict each context word from the middle word, e.g. of [ the dog walks to the ]. Example: Skip-gram: x = v(w_5), y = δ(w_3). Done for each word in the window.
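The corresponding skip-gram forward pass, again an illustrative sketch; in training it is run once per context word in the window:

```python
import numpy as np

def skipgram_forward(center_id, W0, W1):
    """Skip-gram: predict one context word from the middle word."""
    x = W0[center_id]        # x^0 W^0 reduces to one row of W^0
    z = x @ W1
    z -= z.max()             # for numerical stability
    p = np.exp(z)
    return p / p.sum()       # y_hat over all |V| context candidates
```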
32 Additional aspects The window size d_win is sampled for each SGD step. SGD updates are made less often for frequent words. We have slightly simplified the training objective.
33 Softmax Issues Use a softmax to force a distribution:

softmax(z) = exp(z) / Σ_{c∈C} exp(z_c)

log softmax(z) = z − log Σ_{c∈C} exp(z_c)

Issue: the class set C is huge. For C&W, |C| ≈ 100,000; for word2vec, 1,000,000 types. Note: the largest dataset is 6 billion words.
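The log-sum-exp term is exactly what costs O(|C|) per example. A stable NumPy version of log softmax makes the bottleneck explicit:

```python
import numpy as np

def log_softmax(z):
    """log softmax(z) = z - log sum_{c in C} exp(z_c), computed stably."""
    m = z.max()                                 # subtract max to avoid overflow
    return z - (m + np.log(np.exp(z - m).sum()))

# The sum over C touches every class: O(|C|) work per example,
# with |C| around 100,000 (C&W) or 1,000,000 (word2vec).
```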
34 Two-Layer Softmax First, cluster words into hard classes (for instance, Brown clusters), which group words into classes based on word-context: {dog, cat, horse, ...} {car, truck, motorcycle, ...}
35 Two-Layer Softmax Assume that we first generate a class C and then a word:

p(Y | X) = P(Y | C, X; θ) P(C | X; θ)

Estimate the distributions with a shared embedding layer:

P(C | X; θ): ŷ^(1) = softmax((x^0 W^0)W^1 + b^1)
P(Y | C = class, X; θ): ŷ^(2) = softmax((x^0 W^0)W^class + b^class)
36 Softmax as Tree {dog, cat, horse, ...} {car, truck, motorcycle, ...}

ŷ^(1) = softmax((x^0 W^0)W^1 + b^1)
ŷ^(2) = softmax((x^0 W^0)W^class + b^class)

L_2SM(y^(1), y^(2), ŷ^(1), ŷ^(2)) = −log p(y | x, class(y)) − log p(class(y) | x) = −log ŷ^(1)_{c_1} − log ŷ^(2)_{c_2}
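A sketch of the two-layer softmax loss. Here class_of, members, and the per-class parameters W_cls, b_cls are assumed data structures for the hard clustering, not an API from the lecture; h is the shared embedding x^0 W^0:

```python
import numpy as np

def two_layer_softmax_loss(h, y, W1, b1, W_cls, b_cls, class_of, members):
    """L_2SM = -log y_hat^(1)_{c_1} - log y_hat^(2)_{c_2} for word y."""
    def log_softmax(z):
        m = z.max()
        return z - (m + np.log(np.exp(z - m).sum()))

    c = class_of[y]                                      # hard class of word y
    log_p_class = log_softmax(h @ W1 + b1)[c]            # log y_hat^(1)
    log_p_within = log_softmax(h @ W_cls[c] + b_cls[c])  # within class c
    j = members[c].index(y)                              # y's slot inside class c
    return -(log_p_class + log_p_within[j])
```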
37 Speed Computing the loss only requires walking the path from the root to the word: first the class, then the word within the class. For a two-layer balanced tree, computing the loss requires O(√|V|). (Note: computing the full distribution still requires O(|V|).)
38 Hierarchical Softmax (HSM) Build a tree with multiple layers:

L_HSM(y^(1), ..., y^(C), ŷ^(1), ..., ŷ^(C)) = −Σ_i log ŷ^(i)_{c_i}

A balanced tree only requires O(log_2 |V|). Experiments on website (Mnih and Hinton, 2008).
39 HSM with Huffman Encoding Requires O(log_2 perp(unigram)). Reduces training time to only 1 day for 1.6 billion tokens.
41 Contents Embedding Motivation C&W Embeddings word2vec Evaluating Embeddings
42 How good are embeddings? Qualitative Analysis/Visualization Analogy task Extrinsic Metrics
43 Metrics Dot product: x_cat · x_dog. Cosine similarity: (x_cat · x_dog) / (‖x_cat‖ ‖x_dog‖).
44 k-nearest neighbors (cosine sim) Nearest neighbors of dog: cat, dogs, horse, puppy, pet, rabbit, pig, snake. Intuition: trained to match words that act the same.
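A NumPy sketch of cosine-similarity nearest neighbors over an embedding matrix E whose rows are word vectors:

```python
import numpy as np

def nearest_neighbors(word_id, E, k=8):
    """k nearest words to word_id under cosine similarity."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = En @ En[word_id]                            # cosine sim to all words
    order = np.argsort(-sims)                          # most similar first
    return [int(i) for i in order if i != word_id][:k]
```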
45 Empirical Measures: Analogy task Analogy questions: A:B::C:? 5 types of semantic questions, 9 types of syntactic questions.
46 Embedding Tasks
47 Analogy Prediction A:B::C:? Project to the closest word:

x = x_B − x_A + x_C

arg max_{x_D ∈ V} (x_D · x) / (‖x_D‖ ‖x‖)

Code example below.
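A NumPy sketch of the analogy prediction; excluding the three query words from the arg max is common practice, though not stated on the slide:

```python
import numpy as np

def analogy(a, b, c, E):
    """A:B::C:? -- return the word closest (cosine) to x_B - x_A + x_C."""
    x = E[b] - E[a] + E[c]                       # x = x_B - x_A + x_C
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ (x / np.linalg.norm(x))          # cosine to every x_D
    sims[[a, b, c]] = -np.inf                    # exclude the query words
    return int(np.argmax(sims))                  # arg max over x_D in V
```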
48 Extrinsic Tasks Text classification, part-of-speech tagging, and many, many others over the last couple of years.
49 Conclusion Word Embeddings Scaling issues and tricks Next Class: Language Modeling