Feature Noising
Sida Wang, joint work with
Part 1: Stefan Wager, Percy Liang
Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager
2 Outline
Part 0: Some background
Part 1: Dropout as adaptive regularization, with applications to semi-supervised learning (joint work with Stefan Wager and Percy Liang)
Part 2: Applications to structured prediction using CRFs, where the log-partition function cannot be easily computed (joint work with Mengqiu Wang, Chris Manning, Percy Liang, and Stefan Wager)
3 The basics of dropout training
Introduced by Hinton et al. in "Improving neural networks by preventing co-adaptation of feature detectors".
For each example:
- randomly select features and zero them out
- compute the gradient and make an update
- repeat
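The per-example loop above can be sketched concretely. This is a minimal illustrative sketch for binary logistic regression with y in {-1, +1}, not code from the talk; the function name and hyperparameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_sgd_step(theta, x, y, lr=0.1, p_drop=0.5):
    """One SGD step with dropout noise: zero each feature with
    probability p_drop and rescale survivors so E[x_tilde] = x."""
    mask = rng.random(x.shape) >= p_drop
    x_tilde = x * mask / (1.0 - p_drop)
    # gradient of the negative log-likelihood -log sigma(y * theta^T x_tilde)
    grad = -y * x_tilde / (1.0 + np.exp(y * (x_tilde @ theta)))
    return theta - lr * grad
```

Repeating this step over the training examples, each with a fresh random mask, is exactly the "randomly select features, zero them, update" recipe.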
4 Empirically successful
Dropout is important in some recent successes:
- won the ImageNet challenge [Krizhevsky et al., 2012]
- won the Merck challenge [Dahl et al., 2012]
Improved performance on standard datasets:
- images: MNIST, CIFAR, ImageNet, etc.
- document classification: Reuters, IMDB, Rotten Tomatoes, etc.
- speech: TIMIT, GlobalPhone, etc.
5 Lots of related work already
Variants:
- DropConnect [Wan et al., 2013]
- Maxout networks [Goodfellow et al., 2013]
Analytical integration:
- Fast dropout [Wang and Manning, 2013]
- Marginalized corrupted features [van der Maaten et al., 2013]
Many other works report empirical gains.
6 Outline
Part 0: Some background
Part 1: Dropout as adaptive regularization, with applications to semi-supervised learning
Part 2: Applications to structured prediction using CRFs, where the log-partition function cannot be easily computed
7 Theoretical understanding?
Dropout as adaptive regularization:
- feature noising -> interpretable penalty term
- Loss(Dropout(data)) = Loss(data) + Regularizer(data)
Semi-supervised learning:
- the regularizer is feature-dependent but label-independent: Regularizer(Unlabeled data)
8 Dropout for Log-linear Models
Log likelihood (e.g., softmax classification):
  log p(y | x; θ) = x^T θ_y − A(x^T θ),   θ = [θ_1, θ_2, ..., θ_K]
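As a concrete check of the softmax form above, a small sketch (the helper name and the d x K layout of θ are assumptions, not from the talk):

```python
import numpy as np

def log_likelihood(theta, x, y):
    """log p(y | x; theta) = x^T theta_y - A(x^T theta), where
    A(s) = log sum_k exp(s_k) is the log-partition function."""
    s = x @ theta                      # class scores x^T theta_k
    return s[y] - np.logaddexp.reduce(s)
```

Exponentiating the log-likelihood over all classes recovers a distribution that sums to one, which is all the normalizer A does.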
9 Dropout for Log-linear Models
Log likelihood (e.g., softmax classification):
  log p(y | x; θ) = x^T θ_y − A(x^T θ),   θ = [θ_1, θ_2, ..., θ_K]
Dropout:
  x̃_j = 2 x_j with p = 0.5, and 0 otherwise, so that E[x̃] = x

10 Dropout for Log-linear Models
Dropout objective (the expected log-likelihood under feature noising):
  E[log p(y | x̃; θ)] = E[x̃^T θ_y] − E[A(x̃^T θ)]
i.e., −Loss(Dropout(data)) decomposes as −Loss(data) − Regularizer(data).

11 Dropout for Log-linear Models
We can rewrite the dropout log-likelihood using E[x̃] = x:
  E[log p(y | x̃; θ)] = E[x̃^T θ_y] − E[A(x̃^T θ)]
  log p(y | x; θ) = x^T θ_y − A(x^T θ)
so
  E[log p(y | x̃; θ)] = log p(y | x; θ) − (E[A(x̃^T θ)] − A(x^T θ))
Dropout reduces to a regularizer:
  R(θ, x) = E[A(x̃^T θ)] − A(x^T θ)
12 Second-order delta method
Take the Taylor expansion:
  A(s) ≈ A(s_0) + (s − s_0)^T ∇A(s_0) + (1/2) (s − s_0)^T ∇²A(s_0) (s − s_0)

13 Second-order delta method
Take the Taylor expansion:
  A(s) ≈ A(s_0) + (s − s_0)^T ∇A(s_0) + (1/2) (s − s_0)^T ∇²A(s_0) (s − s_0)
Substitute s = s̃ := θ^T x̃ and s_0 = E[s̃].
Take expectations to get the quadratic approximation:
  R_q(θ, x) = (1/2) E[(s̃ − s)^T ∇²A(s) (s̃ − s)] = (1/2) tr(∇²A(s) Cov(s̃))
14 Example: logistic regression
The quadratic approximation:
  R_q(θ, x) = (1/2) A''(x^T θ) Var[x̃^T θ]

15 Example: logistic regression
The quadratic approximation:
  R_q(θ, x) = (1/2) A''(x^T θ) Var[x̃^T θ]
A''(x^T θ) = p(1 − p) represents uncertainty, where
  p = p(y | x; θ) = (1 + exp(−y x^T θ))^{−1}

16 Example: logistic regression
The quadratic approximation:
  R_q(θ, x) = (1/2) A''(x^T θ) Var[x̃^T θ]
A''(x^T θ) = p(1 − p) represents uncertainty, and
  Var[x̃^T θ] = Σ_j θ_j² x_j²
is L_2 regularization after normalizing the data.
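For a low-dimensional example, both the exact dropout penalty R(θ, x) = E[A(x̃^T θ)] − A(x^T θ) and its quadratic approximation R_q can be computed directly. A sketch under the slides' 0/2x noising scheme; the function names are hypothetical, not the talk's code:

```python
import itertools
import numpy as np

def A(s):
    """Binary logistic log-partition: log(1 + exp(s))."""
    return np.logaddexp(0.0, s)

def exact_dropout_penalty(theta, x):
    """R(theta, x) = E[A(x_tilde^T theta)] - A(x^T theta),
    exact by enumerating all 2^d masks (x_tilde_j is 2 x_j or 0)."""
    masks = itertools.product([0.0, 2.0], repeat=len(x))
    vals = [A(np.array(m) * x @ theta) for m in masks]
    return np.mean(vals) - A(x @ theta)

def quadratic_penalty(theta, x):
    """R_q(theta, x) = 1/2 A''(x^T theta) Var[x_tilde^T theta], with
    A''(s) = p(1 - p) and Var[x_tilde^T theta] = sum_j theta_j^2 x_j^2."""
    p = 1.0 / (1.0 + np.exp(-(x @ theta)))
    return 0.5 * p * (1.0 - p) * np.sum(theta**2 * x**2)
```

Jensen's inequality (A is convex and E[x̃] = x) guarantees the exact penalty is nonnegative, and for small θ the two quantities agree closely.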
17 The regularizers
Dropout on linear regression:
  R_q(θ) = (1/2) Σ_j θ_j² Σ_i x_j^(i)²
Dropout on logistic regression:
  R_q(θ) = (1/2) Σ_j θ_j² Σ_i p_i (1 − p_i) x_j^(i)²
Multiclass, CRFs: [Wang et al., 2013]
18 Dropout intuition
  R_q(θ) = (1/2) Σ_j θ_j² Σ_i p_i (1 − p_i) x_j^(i)²
- Regularizes rare features less, like AdaGrad; there is actually a more precise connection [Wager et al., 2013]
- Big weights are okay if they contribute only to confident predictions
- Normalizing by the diagonal Fisher information
19 Semi-supervised Learning
These regularizers are label-independent but can be data-adaptive in interesting ways.
Labeled dataset D = {x_1, x_2, ..., x_n}; unlabeled data D_unlabeled = {u_1, u_2, ..., u_m}.
We can better estimate the regularizer:
  R*(θ, D, D_unlabeled) := n/(n + αm) ( Σ_{i=1}^n R(θ, x_i) + α Σ_{i=1}^m R(θ, u_i) )
for some tunable α.
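The combined labeled-plus-unlabeled regularizer can be sketched directly; a toy illustration where `semisup_regularizer` and the callable `R` are hypothetical names, and the weight on the unlabeled term is the slide's tunable constant (written α here):

```python
def semisup_regularizer(theta, labeled_X, unlabeled_X, alpha, R):
    """R*(theta) = n/(n + alpha*m) * (sum_i R(theta, x_i)
                                      + alpha * sum_i R(theta, u_i)),
    where n labeled and m unlabeled examples share one noising penalty R,
    which is label-independent and so applies to both."""
    n, m = len(labeled_X), len(unlabeled_X)
    total = (sum(R(theta, x) for x in labeled_X)
             + alpha * sum(R(theta, u) for u in unlabeled_X))
    return n / (n + alpha * m) * total
```

With alpha = 0 this reduces to the purely supervised penalty; larger alpha leans more on the unlabeled data.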
20 Semi-supervised intuition
  R_q(θ) = (1/2) Σ_j θ_j² Σ_i p_i (1 − p_i) x_j^(i)²
Like other semi-supervised methods:
- transductive SVMs [Joachims, 1999]
- entropy regularization [Grandvalet and Bengio, 2005]
- EM: guess a label [Nigam et al., 2000]
we want to make confident predictions on the unlabeled data.
Get a better estimate of the Fisher information.
21 IMDB dataset [Maas et al., 2011]
- 25k positive reviews and 25k negative reviews; half for training and half for testing
- 50k unlabeled reviews, also containing neutral reviews
- 300k sparse unigram features; ~5 million sparse bigram features
22 Experiments: semi-supervised
Adding more unlabeled data (10k labeled examples) improves performance.
[Plot: accuracy vs. size of unlabeled data, for dropout+unlabeled, dropout, and L_2]

23 Experiments: semi-supervised
Adding more labeled data (40k unlabeled examples) improves performance.
[Plot: accuracy vs. size of labeled data, for dropout+unlabeled, dropout, and L_2]
24 Quantitative results on IMDB
Methods compared (supervised and semi-supervised settings):
- MNB with unigrams and SFE [Su et al., 2011]
- Vectors for sentiment analysis [Maas et al., 2011]
- This work: dropout + unigrams
- This work: dropout + bigrams
[Accuracy table; values not captured in the transcription]
25 Experiments: other datasets
Settings compared: L_2, Drop, +Unlbl. Datasets:
- Subjectivity [Pang and Lee, 2004]
- Rotten Tomatoes [Pang and Lee, 2005]
- newsgroups
- CoNLL
[Accuracy table; values not captured in the transcription]
26 Outline
Part 0: Some background
Part 1: Dropout as adaptive regularization, with applications to semi-supervised learning (with Stefan and Percy)
Part 2: Applications to structured prediction using CRFs, where the log-partition function cannot be easily computed (with Mengqiu, Chris, Percy, and Stefan)
27 Log-linear structured prediction
A vector of scores s = (s_1, ..., s_|Y|), with s_y = θ^T f(y, x).
The likelihood is:
  p(y | x; θ) = exp{s_y − A(s)},   A(s) = log Σ_y exp{s_y}
Y might be really huge!
28 What about structured prediction?
Recall that in logistic regression:
  R_q(θ) = (1/2) Σ_j θ_j² Σ_i p_i (1 − p_i) x_j^(i)²
What if we cannot easily compute the log-partition function A, or its second derivatives?
29 The original setup
Take the Taylor expansion:
  A(s) ≈ A(s_0) + (s − s_0)^T ∇A(s_0) + (1/2) (s − s_0)^T ∇²A(s_0) (s − s_0)
Substitute s = s̃ := θ^T x̃ and s_0 = E[s̃].
Take expectations to get the quadratic approximation:
  R_q(θ, x) = (1/2) E[(s̃ − s)^T ∇²A(s) (s̃ − s)] = (1/2) tr(∇²A(s) Cov(s̃))

30 The structured prediction setup
Take the Taylor expansion:
  A(s) ≈ A(s_0) + (s − s_0)^T ∇A(s_0) + (1/2) (s − s_0)^T ∇²A(s_0) (s − s_0)
Substitute s = s̃ := θ^T f̃(y, x) and s_0 = E[s̃].
Take expectations to get the quadratic approximation:
  R_q(θ, x) = (1/2) E[(s̃ − s)^T ∇²A(s) (s̃ − s)] = (1/2) tr(∇²A(s) Cov(s̃))
31 Use the independence structure
Depends on the underlying graphical model; we assume we can do exact inference via message passing (e.g., a clique tree).
E.g., a linear-chain CRF:
  f(y, x) = Σ_{t=1}^T g_t(y_{t−1}, y_t, x)
  A(s) = log Σ_{y∈Y} exp{ Σ_{t=1}^T s_{y_{t−1}, y_t, t} }
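The point of the chain factorization is that A(s) stays computable by message passing even though Y is exponentially large. A minimal sketch of the forward recursion, checked against brute-force enumeration; the array layout and function names are assumptions, not the talk's code:

```python
import itertools
import numpy as np

def chain_log_partition(s):
    """A(s) = log sum over state sequences (y_0, ..., y_T) of
    exp(sum_t s[t, y_{t-1}, y_t]), via the forward recursion in O(T K^2).
    s has shape (T, K, K); s[t, a, b] scores the transition y_{t-1}=a, y_t=b."""
    T, K, _ = s.shape
    alpha = np.zeros(K)  # log-sum of scores over prefixes ending in each state
    for t in range(T):
        # alpha_t(b) = logsumexp_a( alpha_{t-1}(a) + s[t, a, b] )
        alpha = np.logaddexp.reduce(alpha[:, None] + s[t], axis=0)
    return np.logaddexp.reduce(alpha)

def brute_force_log_partition(s):
    """Direct enumeration of all K^(T+1) state sequences, for checking."""
    T, K, _ = s.shape
    scores = [sum(s[t, y[t], y[t + 1]] for t in range(T))
              for y in itertools.product(range(K), repeat=T + 1)]
    return np.logaddexp.reduce(np.array(scores))
```

The recursion visits each of the T transition tables once and does K^2 work per table, versus K^(T+1) terms in the naive sum.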
32 Local noising
Global noising: s̃_y := θ^T f̃(y, x)
Local noising: s̃_y := Σ_{t=1}^T θ^T g̃_t(y_{t−1}, y_t, x)
Can try to justify this in retrospect.
33 The regularizer
The regularizer is:
  R_q(θ, x) = (1/2) Σ_{a,b,t} μ_{a,b,t} (1 − μ_{a,b,t}) Var[s̃_{a,b,t}]
with marginals (the μ_{a,b,t}(1 − μ_{a,b,t}) terms are the diagonal elements of the Hessian ∇²A):
  μ_{a,b,t} = p_θ(y_{t−1} = a, y_t = b | x)
and derivatives:
  ∇μ_{a,b,t} = μ_{a,b,t} ( E_{p_θ(y | x, y_{t−1}=a, y_t=b)}[f(y, x)] − E_{p_θ(y | x)}[f(y, x)] )
34 Efficient computation
For every a, b, t we need:
  ∇μ_{a,b,t} = μ_{a,b,t} ( E_{p_θ(y | x, y_{t−1}=a, y_t=b)}[f(y, x)] − E_{p_θ(y | x)}[f(y, x)] )
- Naïve computation is O(K⁴ T²)
- Can be reduced to O(K³ T²)
- We provide a dynamic program that computes this in O(K T²), like normal forward-backward, except it must be run for every feature
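The pairwise marginals μ_{a,b,t} that appear in the regularizer come out of a single forward-backward pass; a self-contained sketch (the array layout is an assumption), using the same (T, K, K) score convention as the chain model above:

```python
import numpy as np

def pairwise_marginals(s):
    """mu[t, a, b] = p(y_{t-1} = a, y_t = b | x) for a linear-chain model
    with transition scores s of shape (T, K, K), via forward-backward
    messages in O(T K^2)."""
    T, K, _ = s.shape
    alpha = np.zeros((T + 1, K))       # forward log-messages
    for t in range(T):
        alpha[t + 1] = np.logaddexp.reduce(alpha[t][:, None] + s[t], axis=0)
    beta = np.zeros((T + 1, K))        # backward log-messages
    for t in range(T - 1, -1, -1):
        beta[t] = np.logaddexp.reduce(s[t] + beta[t + 1][None, :], axis=1)
    log_z = np.logaddexp.reduce(alpha[T])
    # p(y_{t-1}=a, y_t=b) = exp(alpha_t(a) + s[t,a,b] + beta_{t+1}(b) - log Z)
    return np.exp(alpha[:-1, :, None] + s + beta[1:, None, :] - log_z)
```

Each mu[t] sums to one over (a, b); the expensive part in the slides is not these marginals but the per-feature conditional expectations behind ∇μ.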
35 Feature group trick (Mengqiu)
  ∇μ_{a,b,t} = μ_{a,b,t} ( E_{p_θ(y | x, y_{t−1}=a, y_t=b)}[f(y, x)] − E_{p_θ(y | x)}[f(y, x)] )
Features that always appear in the same location all have the same conditional expectations.
Gives a 4x speedup; applicable to general CRFs.
36 CRF sequence tagging
CoNLL 2003 Named Entity Recognition:
  Stanford[ORG] is[O] near[O] Palo[LOC] Alto[LOC]
Settings compared: None, L_2, Drop, on CoNLL 2003 Dev and Test.
[Accuracy table; values not captured in the transcription]
37 CRF sequence tagging
Dropout helps more on precision than recall.
Per-tag results on the CoNLL test set (the slide showed Precision, Recall, and F_{β=1}; only one value per row survived the transcription):
  Tag      (e) L_2 regularization   (f) dropout regularization
  LOC      86.13%                   87.74%
  MISC     79.30%                   77.34%
  ORG      80.49%                   81.89%
  PER      93.33%                   92.68%
  Overall  86.08%                   86.45%
38 SANCL POS tagging
Test-set differences are statistically significant for newsgroups and reviews.
Settings compared: None, L_2, Drop, on the newsgroups, reviews, and answers domains (Dev and Test).
[Results table; values not captured in the transcription]
39 Summary
Part 0: Some background
Part 1: Dropout as adaptive regularization, with applications to semi-supervised learning (joint work with Stefan Wager and Percy)
Part 2: Applications to structured prediction using CRFs, where the log-partition function cannot be easily computed (joint work with Mengqiu, Chris, Percy, and Stefan Wager)
40 References
- Our arXiv paper [Wager et al., 2013] has more details, including the relation to AdaGrad
- Our EMNLP paper [Wang et al., 2013] extends this framework to structured prediction
- Our ICML paper [Wang and Manning, 2013] applies a related technique to neural networks and provides some negative examples
41 Dropout vs. L_2
Dropout can be much better than all settings of L_2; part of the gain comes from normalization.
[Plot: accuracy vs. L_2 regularization strength (λ), for L2 only, L2+Gaussian dropout, and L2+Quadratic dropout]
42 Example: linear least squares
The loss function is f(θ^T x̃) = (1/2)(θ^T x̃ − y)².
Let X = θ^T x̃, where x̃_j = 2 z_j x_j and z_j ~ Bernoulli(0.5), so E[X] = θ^T x.
  E[f(X)] = f(E[X]) + (1/2) f''(E[X]) Var[X] = (1/2)(θ^T x − y)² + (1/2) Σ_j θ_j² x_j²
The total regularizer is:
  R_q(θ) = (1/2) Σ_j θ_j² Σ_i x_j^(i)²
This is just L_2 applied after data normalization.
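Because the loss is quadratic here, the second-order expansion is exact, which can be checked by enumerating the dropout masks (an illustrative check with hypothetical names, not the talk's code):

```python
import itertools
import numpy as np

def expected_dropout_loss(theta, x, y):
    """E over dropout masks of 1/2 (theta^T x_tilde - y)^2, computed
    exactly by enumerating x_tilde_j in {2 x_j, 0}."""
    masks = itertools.product([0.0, 2.0], repeat=len(x))
    vals = [0.5 * (np.array(m) * x @ theta - y) ** 2 for m in masks]
    return np.mean(vals)

theta = np.array([0.7, -1.3])
x = np.array([2.0, 0.5])
y = 1.0
# closed form: the original loss plus the L2-after-normalization penalty
closed_form = 0.5 * (x @ theta - y) ** 2 + 0.5 * np.sum(theta**2 * x**2)
```

For any f with zero third and higher derivatives, the delta-method penalty is not an approximation but an identity, which is why the linear-regression row of the regularizer table has no p(1 − p) factor.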
43 Quantitative results on IMDB
Methods compared (supervised and semi-supervised settings):
- MNB with unigrams and SFE [Su et al., 2011]
- MNB with bigrams
- Vectors for sentiment analysis [Maas et al., 2011]
- NBSVM with bigrams [Wang and Manning, 2012]
- This work: dropout + unigrams
- This work: dropout + bigrams
[Accuracy table; values not captured in the transcription]
More informationMark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.
University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework
More informationSegmental Recurrent Neural Networks for End-to-end Speech Recognition
Segmental Recurrent Neural Networks for End-to-end Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, Noah Smith and Steve Renals TTI-Chicago, UoE, CMU and UW 9 September 2016 Background A new wave
More informationBasis Expansion and Nonlinear SVM. Kai Yu
Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion
More informationNatural Language Processing
Natural Language Processing Global linear models Based on slides from Michael Collins Globally-normalized models Why do we decompose to a sequence of decisions? Can we directly estimate the probability
More informationLecture 13: Structured Prediction
Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page
More informationInternal Covariate Shift Batch Normalization Implementation Experiments. Batch Normalization. Devin Willmott. University of Kentucky.
Batch Normalization Devin Willmott University of Kentucky October 23, 2017 Overview 1 Internal Covariate Shift 2 Batch Normalization 3 Implementation 4 Experiments Covariate Shift Suppose we have two distributions,
More informationMore on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013
More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative
More informationLecture 3: Multiclass Classification
Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in
More informationApplied Machine Learning Lecture 5: Linear classifiers, continued. Richard Johansson
Applied Machine Learning Lecture 5: Linear classifiers, continued Richard Johansson overview preliminaries logistic regression training a logistic regression classifier side note: multiclass linear classifiers
More informationGaussian Processes (10/16/13)
STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs
More informationLearning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014
Learning with Noisy Labels Kate Niehaus Reading group 11-Feb-2014 Outline Motivations Generative model approach: Lawrence, N. & Scho lkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of
More informationAdvanced computational methods X Selected Topics: SGD
Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1 Reminders Homework
More informationRandom Field Models for Applications in Computer Vision
Random Field Models for Applications in Computer Vision Nazre Batool Post-doctorate Fellow, Team AYIN, INRIA Sophia Antipolis Outline Graphical Models Generative vs. Discriminative Classifiers Markov Random
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII
More information