This lecture. Miscellaneous classification methods: Neural networks, Support vector machines, Transformation-based learning, K nearest neighbours.
Transcription
2 This lecture. Miscellaneous classification methods: Neural networks, Support vector machines, Transformation-based learning, K nearest neighbours. Neural word-vector representations.
3 Classification. Define classes/categories (e.g., phonemes; happy <-> sad), preprocess (filter acoustic noise; tokenize and normalize), extract features (MFCCs; word statistics), train a classifier (HMM, SVM, decision tree), then use the trained classifier.
4 Types of classifiers. Generative classifiers model the world. Parameters are set to maximize the likelihood of the training data, and we can generate new observations from these models; e.g., hidden Markov models. Discriminative classifiers emphasize class boundaries. Parameters are set to minimize error on the training data; e.g., ID3 decision trees. What do class boundaries look like in the data?
5 Binary and linearly separable. Perhaps the easiest case. Extends to dimensions d ≥ 3, where the line becomes a (hyper-)plane.
6 N-ary and linearly separable. A bit harder: random guessing gives only 1/N accuracy (given equally likely classes). We can logically combine N − 1 binary classifiers. (Figure: decision regions and decision boundaries.)
7 Class holes Sometimes it can be impossible to draw any lines through the data to separate the classes. Are those troublesome points noise or real phenomena?
8 The kernel trick. We can sometimes linearize a non-linear case by moving the data into a higher dimension with a kernel function, e.g., S = sin(x² + y²)/(x² + y²). Now we have a linear decision boundary, S = 0!
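A minimal sketch of this lifting idea (not from the slides), in Python with NumPy: ring-shaped 2-D data that no line can separate becomes separable by a flat threshold once we add a third coordinate derived from the squared radius. The data and the threshold value are invented purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Two classes that are not linearly separable in 2-D:
# an inner cluster and a surrounding ring.
inner = rng.normal(0.0, 0.5, size=(100, 2))
angles = rng.uniform(0, 2 * np.pi, size=100)
ring = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])
ring = ring + rng.normal(0.0, 0.2, size=ring.shape)

def lift(points):
    # Add a third coordinate derived from the squared radius.
    r2 = points[:, 0] ** 2 + points[:, 1] ** 2
    return np.column_stack([points, r2])

# After lifting, a horizontal plane (a threshold on the new coordinate)
# separates the two classes.
threshold = 4.0
print("inner points below the plane:", np.mean(lift(inner)[:, 2] < threshold))
print("ring points above the plane: ", np.mean(lift(ring)[:, 2] > threshold))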
9 Support Vector Machines (SVMs)
10 Support vector machines (SVMs) In binary linear classification, two classes are assumed to be separable by a line (or plane). However, many possible separating planes might exist. Each of these blue lines separates the training data. Which line is the best?
11 Support vector machines (SVMs) The margin is the width by which the boundary could be increased before it hits a training datum. The maximum margin linear classifier is the linear classifier with the maximum margin. The support vectors (indicated) are those data points against which the margin is pressed. The bigger the margin the less sensitive the boundary is to error.
12 Support vector machines (SVMs). The width of the margin, M, can be computed from the angle and displacement of the planar boundary, x, as well as the planes that touch the data points. Given an initial guess of the angle and displacement of x, we can compute whether all data are correctly classified, and the width of the margin, M. We update our guess by quadratic programming, which is semi-analytic.
13 Support vector machines (SVMs). The maximum margin helps SVMs generalize to situations when it's impossible to linearly separate the data. We introduce a parameter that allows us to measure the distance of all data not in their correct zones. We simultaneously maximize the margin while minimizing the misclassification error. There is a straightforward approach to solving this system based on quadratic programming.
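As a rough illustration of this soft-margin trade-off, the sketch below uses scikit-learn's SVC, an off-the-shelf implementation that is not part of the lecture; the toy data and the choice C = 1.0 are assumptions made only for the demonstration.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Two roughly linearly separable classes with a little overlap.
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),
               rng.normal(+2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# C trades off margin width against misclassifying training points:
# a small C prefers a wider margin and tolerates points on the wrong side.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("weight vector:", clf.coef_, "bias:", clf.intercept_)

The support vectors reported here are exactly the data points pressed against the margin in the slides' picture.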
14 Support vector machines (SVMs). SVMs generalize to higher-dimensional data and to systems in which the data is non-linearly separable (e.g., by a circular decision boundary). Using the kernel trick (slide 8) is common. Many binary SVM classifiers can also be combined to simulate a multi-category classifier (slide 6). (Still) one of the most popular off-the-shelf classifiers.
15 Support vector machines (SVMs). SVMs are empirically very accurate classifiers. They perform well in situations where data are static, i.e., don't change over time, e.g., genre classification given fixed statistics of documents, or phoneme recognition given only a single frame of speech. SVMs do not generalize as well to time-variant systems. Kernel functions tend not to allow for observations of different lengths (i.e., all data points have to be of the same dimensionality).
16 Artificial Neural Networks (ANNs)
17 Artificial neural networks. Artificial neural networks (ANNs) were (kind of) inspired by neurobiology (Widrow and Hoff, 1960). Each unit has many inputs (dendrites), one output (axon). The nucleus fires (sends an electric signal along the axon) given input from other neurons. Learning occurs at the synapses that connect neurons, either by amplifying or attenuating signals. (Figure: a biological neuron, with dendrites, nucleus, and axon labelled.)
18 Perceptron: an artificial neuron. Each neuron calculates a weighted sum of its inputs and compares this to a threshold, τ. If the sum exceeds the threshold, the neuron fires. Inputs a_i are activations from adjacent neurons, each weighted by a parameter w_i: x = Σ_{i=1}^{n} w_i a_i, passed through an activation g(). If x > τ, S = 1; else, S = 0. (McCulloch-Pitts model.)
19 Perceptron output. Perceptron output is determined by an activation function, g(), which can be a non-linear function of the weighted input. A popular activation function is the sigmoid: S = g(x) = 1 / (1 + e^(−x)). Its derivative is the easily computable g′ = g(1 − g).
20 Perceptron learning. Weights are adjusted in proportion to the error (i.e., the difference between the desired output, y, and the actual output, S). The derivative g′ allows us to assign blame proportionally. Given a small learning rate, α (e.g., 0.05), we repeatedly adjust each of the weighting parameters by w_j ← w_j + α Σ_{i=1}^{R} Err_i g′(x_i) x_i, where Err_i = (y − S) and we have R training examples.
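A toy sketch of this update rule in Python, assuming a sigmoid activation and using the OR function as the training task; both choices are illustrative and not taken from the slides.

import numpy as np

def g(x):
    # Sigmoid activation.
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    # Derivative of the sigmoid: g' = g(1 - g).
    s = g(x)
    return s * (1.0 - s)

rng = np.random.default_rng(2)

# Toy task: learn the OR function (linearly separable).
A = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
A = np.hstack([A, np.ones((4, 1))])           # bias input fixed at 1

w = rng.normal(0, 0.1, size=3)                # initial weights
alpha = 0.5                                   # learning rate

for epoch in range(2000):
    for a_i, y_i in zip(A, y):
        x = w @ a_i                             # weighted sum of the inputs
        S = g(x)                                # perceptron output
        err = y_i - S                           # blame is proportional to error
        w = w + alpha * err * g_prime(x) * a_i  # the update rule from the slide

print("outputs after training:", np.round(g(A @ w), 2))

After training, the outputs approach (0, 1, 1, 1); a threshold perceptron could not do the same for XOR, as the next slide notes.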
21 Threshold perceptra and XOR. Some relatively simple logical functions cannot be learned by threshold perceptra (since they are not linearly separable). (Figure: three logical functions plotted over the inputs a_1 and a_2, of which XOR cannot be separated by a single line.)
22 Artificial neural networks. Complex functions can be represented by layers of perceptra (Multi-Layer Perceptra, MLPs). Inputs are passed to the input layer. Activations are propagated through hidden layers to the output layer (which usually represents the class).
23 Artificial neural networks. MLPs are quite robust to noise, and are trained specifically to reduce error. However, they can be sensitive to initial parameterization, relatively slow to train, and incapable of capturing long-term dependencies. What can they learn about words?
24 Words. Given a corpus with D (e.g., 100K) unique words, the classical binary approach is to uniquely assign each word an index in a D-dimensional vector (a 'one-hot' representation), e.g., for the word 'soccer'. A classic word-feature representation instead assigns features to each of the D indices, e.g., VBG, positive, age-of-acquisition. Is there a way to learn the nature of these abstract features?
25 Singular value decomposition. Decompose the word co-occurrence matrix M (rows and columns indexed by words such as 'a', 'as', 'chuck', 'could') as M = U Σ Vᵀ, and keep only the top k singular dimensions as the embedding: Emb = U_{:,1:k} Σ_{1:k,1:k} (here, k = 2).
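A small sketch of these SVD-based embeddings with NumPy; the vocabulary and the co-occurrence counts below are invented purely for illustration.

import numpy as np

# Tiny word-by-word co-occurrence matrix M (rows and columns are words).
words = ["a", "as", "chuck", "could", "wood"]
M = np.array([
    [0, 2, 1, 1, 3],
    [2, 0, 1, 1, 2],
    [1, 1, 0, 1, 4],
    [1, 1, 1, 0, 2],
    [3, 2, 4, 2, 0],
], dtype=float)

# Full SVD: M = U diag(s) Vt.
U, s, Vt = np.linalg.svd(M)

# Keep the top-k singular dimensions as the dense embedding.
k = 2
Emb = U[:, :k] * s[:k]          # same as U[:, :k] @ np.diag(s[:k])

for w, vec in zip(words, Emb):
    print(f"{w:>6}: {np.round(vec, 2)}")

Words with similar co-occurrence patterns end up with nearby rows in Emb, which is what the dendrogram on the next slide visualizes.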
26 Singular value decomposition dendrogram Rohde et al. (2006) An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. Communications of the ACM 8:
27 Singular value decomposition Problems with SVD: 1. Computational costs scale quadratically with size of M. 2. Hard to incorporate new words. Rohde et al. (2006) An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. Communications of the ACM 8:
28 Word2vec to the rescue. Solution: don't capture co-occurrence directly; just try to predict surrounding words, e.g., P(w_{t+1} = yourself | w_t = kiss) from contexts like 'hey go kiss yourself', 'hey go hug yourself'. Here, we're predicting the center word given the context. A popular alternative: GloVe.
29 Learning word representations: continuous bag of words (CBOW). A one-hot input x (D = 100K) is multiplied by W_in (D × H) to give the hidden activation a, which is multiplied by W_out (H × D) to give the output y (D = 100K). Note that we have two vector representations of each word w: v_w = x W_in (the w-th row of W_in) and V_w (the w-th column of W_out). In the figure, the one-hot input is 'kiss', the inside word of the contexts 'go kiss yourself' and 'go hug yourself', and the softmax output assigns a probability to each word in the vocabulary, e.g., 'go': P(w | w_i) = exp(V_wᵀ v_{w_i}) / Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_i}), where v_w is the input vector for word w and V_w is the output vector for word w.
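A sketch of these two weight matrices and the softmax prediction in Python, with an invented five-word vocabulary and random, untrained weights (so the printed probabilities are meaningless until training).

import numpy as np

rng = np.random.default_rng(3)

vocab = ["go", "kiss", "hug", "yourself", "outside"]
D, H = len(vocab), 4                      # vocabulary size, hidden size

W_in = rng.normal(0, 0.1, size=(D, H))    # rows are the input vectors v_w
W_out = rng.normal(0, 0.1, size=(H, D))   # columns are the output vectors V_w

def predict(word):
    # P(w | input word) via a softmax over the output vectors.
    x = np.zeros(D)
    x[vocab.index(word)] = 1.0            # one-hot input
    v = x @ W_in                          # hidden layer = input vector v_w
    scores = v @ W_out                    # V_w' . v_w for every word w'
    e = np.exp(scores - scores.max())     # numerically stable softmax
    return e / e.sum()

probs = predict("kiss")
for w, p in zip(vocab, probs):
    print(f"P({w:>9} | kiss) = {p:.3f}")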
30 Using word representations. Without a latent space, kiss = [0,0,0,…,0,1,0,…,0] and hug = [0,0,0,…,0,0,1,…,0], so Similarity = cos(x, y) = 0.0. Transforming with v_w = x W_in (from D = 100K down to H = 300), in the latent space kiss = [0.8, 0.69, 0.4, …, 0.05] and hug = [0.9, 0.7, 0.43, …, 0.05], so Similarity = cos(x, y) = 0.9.
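The comparison can be checked directly, as in the hedged sketch below; the one-hot indices are arbitrary placeholders, and the dense vectors are just the leading values quoted on the slide, truncated for illustration.

import numpy as np

def cosine(x, y):
    # Cosine similarity between two vectors.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# One-hot vectors for two different words are always orthogonal.
kiss_onehot = np.zeros(100_000); kiss_onehot[17] = 1.0   # index 17 is arbitrary
hug_onehot = np.zeros(100_000); hug_onehot[42] = 1.0     # index 42 is arbitrary
print("one-hot similarity:", cosine(kiss_onehot, hug_onehot))   # exactly 0.0

# Truncated dense vectors (values from the slide) are close for related words.
kiss = np.array([0.8, 0.69, 0.4, 0.05])
hug = np.array([0.9, 0.7, 0.43, 0.05])
print("dense similarity:  ", round(cosine(kiss, hug), 2))       # close to 1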
31 Linguistic regularities in vector space Trained on the Google news corpus with over 300 billion words.
32 Linguistic regularities in vector space (from GloVe; same idea).
33 Linguistic regularities in vector space.
Expression                      Nearest token
Paris − France + Italy          Rome
Bigger − big + cold             Colder
Sushi − Japan + Germany         bratwurst
Cu − copper + gold              Au
Windows − Microsoft + Google    Android
Analogies: apple:apples :: octopus:octopodes. Hypernymy: shirt:clothing :: chair:furniture.
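A sketch of answering such an analogy by vector arithmetic and nearest-neighbour search; the tiny embedding table here is synthetic and deliberately nudged so the analogy holds, whereas real systems use vectors trained on large corpora (e.g., word2vec or GloVe).

import numpy as np

rng = np.random.default_rng(4)

# Hypothetical tiny embedding table.
emb = {w: rng.normal(size=50) for w in
       ["paris", "france", "italy", "rome", "berlin", "germany"]}
# Nudge the toy vectors so the analogy structure roughly holds.
emb["rome"] = emb["paris"] - emb["france"] + emb["italy"] + rng.normal(0, 0.05, 50)

def nearest(query, exclude):
    # Return the word whose vector has the highest cosine similarity to query.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = [(cos(query, v), w) for w, v in emb.items() if w not in exclude]
    return max(candidates)[1]

# paris - france + italy should land near "rome".
query = emb["paris"] - emb["france"] + emb["italy"]
print(nearest(query, exclude={"paris", "france", "italy"}))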
34 Actually doing the learning. First, let's define what our parameters are. Given H-dimensional vectors and V words, θ is the concatenation of the input vector v_w and the output vector V_w for every word in the vocabulary, so θ ∈ R^{2VH}.
35 Aside: actually doing the learning. We have many options; gradient descent is popular. We want to optimize, given T words of training data, J(θ) = −(1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t), and we want to update the vectors V_w and then v_w within θ: θ^{new} = θ^{old} − η ∇_θ J(θ). So we'll need to take the derivative of the (log of the) softmax function, P(w | w_i) = exp(V_wᵀ v_{w_i}) / Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_i}), where v_w is the input vector for word w and V_w is the output vector for word w.
36 Aside: actually doing the learning. We need to take the derivative of the (log of the) softmax function:
δ/δv_{w_t} log P(w_{t+j} | w_t) = δ/δv_{w_t} log [ exp(V_{w_{t+j}}ᵀ v_{w_t}) / Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_t}) ]
= δ/δv_{w_t} [ log exp(V_{w_{t+j}}ᵀ v_{w_t}) − log Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_t}) ]
= V_{w_{t+j}} − δ/δv_{w_t} log Σ_{w'=1}^{D} exp(V_{w'}ᵀ v_{w_t})
[apply the chain rule, δ log z / δx = (1/z) · δz/δx]
= V_{w_{t+j}} − Σ_{w'=1}^{D} P(w' | w_t) V_{w'}
More details:
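The analytic gradient V_{w_{t+j}} − Σ_{w'} P(w' | w_t) V_{w'} can be sanity-checked against finite differences; in this sketch the vocabulary size, vector size, and random vectors are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
D, H = 8, 5                                  # vocabulary size, vector size

V = rng.normal(size=(D, H))                  # output vectors V_w (one per row)
v = rng.normal(size=H)                       # input vector for the centre word
o = 3                                        # index of the observed context word

def log_prob(v_in):
    # log P(word o | centre word) under the softmax.
    scores = V @ v_in
    scores = scores - scores.max()           # numerical stability
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient from the derivation: V_o - sum_w P(w) V_w.
scores = V @ v
p = np.exp(scores - scores.max())
p = p / p.sum()
analytic = V[o] - p @ V

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = np.array([
    (log_prob(v + eps * np.eye(H)[i]) - log_prob(v - eps * np.eye(H)[i])) / (2 * eps)
    for i in range(H)
])
print("max abs difference:", float(np.max(np.abs(analytic - numeric))))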
37 Results (note: all extrinsic). Bengio et al. 2001, 2003: beating N-grams on small datasets (Brown & APNews), but much slower. Schwenk et al. 2002, 2004, 2006: beating state-of-the-art large-vocabulary ASR using a deep & distributed NLP model, with real-time speech recognition. Morin & Bengio 2005, Blitzer et al. 2005, Mnih & Hinton 2007, 2009: better & faster models through hierarchical representations. Collobert & Weston 2008: reaching or beating state-of-the-art in multiple NLP tasks (SRL, PoS, NER, chunking) thanks to unsupervised pre-training and multi-task learning. Bai et al. 2009: ranking & semantic indexing (IR).
38 Sentiment analysis. The traditional bag-of-words approach to sentiment analysis used dictionaries of happy and sad words, simple counts, and either regression or binary classification. But consider these: 'Best movie of the year'; 'Slick and entertaining, despite a weak script'; 'Fun and sweet but ultimately unsatisfying'.
39 Tree-based sentiment analysis. We can combine pairs of words into phrase structures. Similarly, we can combine phrase and word structures hierarchically for classification. (Figure: word vectors x_1 and x_2, each of dimension D = 300, are combined through a weight matrix into a phrase vector x_{1,2}, with H = 300.)
40 Tree-based sentiment analysis (currently broken) demo:
41 Recurrent neural networks (RNNs). An RNN has feedback connections in its structure so that it remembers n previous inputs when reading a sequence (e.g., it can use the current word's input together with hidden units from the previous word). An Elman network feeds the hidden units back; a Jordan network (not shown) feeds the output units back.
42 RNNs do PoS. You can unroll RNNs over time for various dynamic models, e.g., PoS tagging. (Figure: the network unrolled over t = 1…4, tagging 'She had a …' as Pronoun, Aux, Det, ….)
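A forward-pass sketch of an Elman-style RNN unrolled over a toy sentence; the embeddings and weights here are random and untrained, so the printed tags are meaningless, but the recurrence (the hidden state carried from word to word) is the point.

import numpy as np

rng = np.random.default_rng(6)

words = ["She", "had", "a", "sandwich"]       # hypothetical input sentence
tags = ["Pronoun", "Aux", "Det", "Noun"]      # toy tag inventory

D, H, T = 20, 8, len(tags)                    # word dim, hidden dim, number of tags
embed = {w: rng.normal(size=D) for w in words}

W_xh = rng.normal(0, 0.3, size=(D, H))        # input -> hidden
W_hh = rng.normal(0, 0.3, size=(H, H))        # hidden -> hidden (the recurrence)
W_hy = rng.normal(0, 0.3, size=(H, T))        # hidden -> output tag scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H)                               # initial hidden state
for t, w in enumerate(words, start=1):
    # Elman network: the new hidden state mixes the current word's input
    # with the hidden state carried over from the previous word.
    h = np.tanh(embed[w] @ W_xh + h @ W_hh)
    y = softmax(h @ W_hy)
    print(f"t={t} {w:>9}: predicted tag = {tags[int(np.argmax(y))]}")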
43 SMT with RNNs. SMT is hard and involves long-term dependencies. Solution: encode the entire sentence into a single vector representation, then decode. (Figure: ENCODE over t = 1…5, reading 'The ocarina of time <eos>' into a sentence representation.)
44 SMT with RNNs. Try it: ~30K vocabulary, 500M-word training corpus (taking 5 days on GPUs). All that good morphological/syntactic/semantic stuff gets embedded into sentence vectors. (Figure: DECODE over t = 5…9, emitting "L'ocarina de temps <eos>" from the sentence representation.)
45 Transformation-Based Learning (TBL)
46 Transformation-based learning. Developed by Eric Brill for his part-of-speech tagger. It is also used for text chunking, prepositional phrase attachment (*), syntactic parsing, dialog tagging, etc. Transformation-based learning (TBL) modifies the output of one method (e.g., an HMM) according to a set of learned rules. These rules are determined automatically by a discriminative training process. (*) Prepositional phrase attachment is the problem of determining, e.g., who has the telescope in 'I saw the man on the hill with the telescope.'
47 Transformation-based learning. Initial imperfect tagging of the data (many errors) → transformation rules of the form [Condition, Action] → new tagging with fewer errors. Components: allowable transformations, and a learning algorithm.
48 TBL: allowable transformations. TBL requires transformation rule templates. Each template is of the form [CONDITION, ACTION]. Actions include, e.g., changing the i-th tag to τ: t_i → τ. Conditions include conjunctions, negations, and disjunctions of, e.g.: the m-th preceding/following tag is t_j (e.g., the preceding tag is an NNS); the m-th preceding/following word is w_j (e.g., the preceding/following word is 'ocelot'); the m-th word is w_j and the n-th tag is t_k; …
49 TBL: example transformation. An instantiated rule might be, e.g.: if the preceding word is 'to' and the current word is 'strike' and the current tag is NN, then change the current tag to VB. Condition (triggering environment): preceding word = 'to' & current word = 'strike' & current tag = NN. Action (transformation/rewrite rule): change the current tag from NN to VB.
50 TBL: learning algorithm In training, we generate one new rule per iteration and apply it to the training set, thereby modifying it. The initial training set includes: the output of another tagger (possibly riddled with errors), the correct gold standard tags.
51 TBL: learning algorithm. Learning TBL rules is an iterative process: 1. Generate all rules, R, that correct at least one error. 2. For each rule r ∈ R: (a) apply the rule r to a copy of the current state of the training set; (b) score the result (compute the overall error). 3. Select the rule r that minimizes error. 4. Update the training set by applying r. 5. If the error is below some threshold, halt. Otherwise, repeat from step 1.
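A minimal sketch of this greedy loop in Python, assuming rules are plain (condition, action) pairs applied to a single tagged sentence; the real Brill tagger instantiates rules from templates over a full training corpus.

# Hypothetical rule representation: a (condition, action) pair of functions.
def apply_rule(rule, words, tags):
    # Return a new tag sequence with the rule applied at every position.
    condition, action = rule
    return [action(words, tags, i) if condition(words, tags, i) else tags[i]
            for i in range(len(tags))]

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def tbl_learn(words, tags, gold, candidate_rules, threshold=0):
    # Greedily pick the rule that most reduces error, until no rule helps.
    learned = []
    while True:
        scored = [(errors(apply_rule(r, words, tags), gold), r)
                  for r in candidate_rules]
        best_err, best_rule = min(scored, key=lambda s: s[0])
        if errors(tags, gold) - best_err <= threshold:
            break                                   # no rule improves enough: halt
        tags = apply_rule(best_rule, words, tags)   # update the training set
        learned.append(best_rule)
    return learned, tags

# The example rule from slide 49: preceding word 'to', current word 'strike',
# current tag NN -> change the current tag to VB.
rule = (lambda w, t, i: i > 0 and w[i - 1] == "to"
        and w[i] == "strike" and t[i] == "NN",
        lambda w, t, i: "VB")

words = ["to", "strike", "a", "match"]
initial = ["TO", "NN", "DT", "NN"]                  # imperfect initial tagging
gold = ["TO", "VB", "DT", "NN"]                     # gold-standard tags

rules, final = tbl_learn(words, initial, gold, [rule])
print(final)                                        # ['TO', 'VB', 'DT', 'NN']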
52 Transformation-based learning Advantages of transformation-based learning include: TBL rules can capture more context than Markov models, The entire training set is used for training, The evaluation criterion (error rate) is direct, as opposed to indirect methods like the reduction of entropy (e.g., decision trees), Resulting rules can be easy to review and to understand. Disadvantages include: The rules that TBL generates are not probabilistic, The rule sequences may not be optimal, since only one is considered at a time.
53 Reading/Announcements. ANN: Russell & Norvig, Artificial Intelligence: A Modern Approach, 2nd ed., section 20.5 (optional). SVM: ibid., section 20.6 (optional). TBL: Manning & Schütze, section 10.4. Friday: review session. 19 or 20 April: second review session.