Feature Noising. Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang; Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager


1 Feature Noising. Sida Wang, joint work with Part 1: Stefan Wager, Percy Liang; Part 2: Mengqiu Wang, Chris Manning, Percy Liang, Stefan Wager

2 Outline
Part 0: Some background
Part 1: Dropout as adaptive regularization, with applications to semi-supervised learning (joint work with Stefan Wager and Percy)
Part 2: Applications to structured prediction using CRFs, when the log-partition function cannot be easily computed (joint work with Mengqiu, Chris, Percy, and Stefan Wager)

3 The basics of dropout training
Introduced by Hinton et al. in "Improving neural networks by preventing co-adaptation of feature detectors".
For each example:
- randomly select features and zero them
- compute the gradient and make an update
- repeat
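To make the procedure concrete, here is a minimal sketch of dropout SGD for a softmax classifier (the function name, learning rate, and dropout rate are illustrative choices, not the exact setup from the talk):

```python
import numpy as np

def dropout_sgd(X, y, num_classes, epochs=10, lr=0.1, p_drop=0.5, seed=0):
    """Softmax regression trained with dropout noise on the input features.
    X is an (n, d) feature matrix, y an integer label vector."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, num_classes))
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Randomly zero features; rescale survivors so that E[x_tilde] = x.
            mask = rng.random(d) >= p_drop
            x_tilde = X[i] * mask / (1.0 - p_drop)
            scores = x_tilde @ W
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()
            # Gradient of the negative log-likelihood on the noised example.
            grad = np.outer(x_tilde, probs)
            grad[:, y[i]] -= x_tilde
            W -= lr * grad
    return W
```

Scaling the surviving features by 1/(1 - p_drop) keeps E[x_tilde] = x, matching the 2x convention used on the later slides for p = 0.5.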

4 Empirically successful
Dropout is important in some recent successes:
- won the ImageNet challenge [Krizhevsky et al., 2012]
- won the Merck challenge [Dahl et al., 2012]
Improved performance on standard datasets:
- images: MNIST, CIFAR, ImageNet, etc.
- document classification: Reuters, IMDB, Rotten Tomatoes, etc.
- speech: TIMIT, GlobalPhone, etc.

5 Lots of related work already
Variants:
- DropConnect [Wan et al., 2013]
- Maxout networks [Goodfellow et al., 2013]
Analytical integration:
- Fast Dropout [Wang and Manning, 2013]
- Marginalized Corrupted Features [van der Maaten et al., 2013]
Many other works report empirical gains.

6 Outline
Part 0: Some background
Part 1: Dropout as adaptive regularization with applications to semi-supervised learning
Part 2: Applications to structured prediction using CRFs when the log-partition function cannot be easily computed

7 Theoretical understanding?
Dropout as adaptive regularization: feature noising -> an interpretable penalty term
Loss(Dropout(data)) = Loss(data) + Regularizer(data)
Semi-supervised learning: a feature-dependent, label-independent regularizer, Regularizer(unlabeled data)

8 Dropout for Log-linear Models
Log likelihood (e.g., softmax classification):
$\log p(y \mid x; \theta) = x^\top \theta_y - A(x^\top \theta)$, with $\theta = [\theta_1, \theta_2, \ldots, \theta_K]$

9 Dropout for Log-linear Models
Log likelihood (e.g., softmax classification):
$\log p(y \mid x; \theta) = x^\top \theta_y - A(x^\top \theta)$, with $\theta = [\theta_1, \theta_2, \ldots, \theta_K]$
Dropout: $\tilde{x}_j = 2 x_j$ with probability 0.5 and $\tilde{x}_j = 0$ otherwise, so $E[\tilde{x}] = x$
Dropout objective:
$E[\log p(y \mid \tilde{x}; \theta)] = E[\tilde{x}^\top \theta_y] - E[A(\tilde{x}^\top \theta)]$

10 Dropout for Log-linear Models
Log likelihood (e.g., softmax classification):
$\log p(y \mid x; \theta) = x^\top \theta_y - A(x^\top \theta)$
Dropout: $\tilde{x}_j = 2 x_j$ with probability 0.5 and $\tilde{x}_j = 0$ otherwise, so $E[\tilde{x}] = x$
Dropout objective:
$E[\log p(y \mid \tilde{x}; \theta)] = E[\tilde{x}^\top \theta_y] - E[A(\tilde{x}^\top \theta)]$
The left-hand side is $-$Loss(Dropout(data)), and Loss(Dropout(data)) = Loss(data) + Regularizer(data).

11 Dropout for Log-linear Models
We can rewrite the dropout log-likelihood. Since $E[\tilde{x}] = x$,
$E[\log p(y \mid \tilde{x}; \theta)] = E[\tilde{x}^\top \theta_y] - E[A(\tilde{x}^\top \theta)]$
$\log p(y \mid x; \theta) = x^\top \theta_y - A(x^\top \theta)$
so
$E[\log p(y \mid \tilde{x}; \theta)] = \log p(y \mid x; \theta) - \left( E[A(\tilde{x}^\top \theta)] - A(x^\top \theta) \right)$
where the left-hand side is $-$Loss(Dropout(data)), the first term on the right is $-$Loss(data), and the bracketed term is Regularizer(data).
Dropout reduces to a regularizer:
$R(\theta, x) = E[A(\tilde{x}^\top \theta)] - A(x^\top \theta)$
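The induced regularizer $R(\theta, x) = E[A(\tilde{x}^\top \theta)] - A(x^\top \theta)$ can be estimated numerically by averaging over dropout masks. A small sketch for the softmax case (the function name and use of scipy's logsumexp are my own choices):

```python
import numpy as np
from scipy.special import logsumexp

def dropout_regularizer_mc(x, Theta, num_samples=10000, seed=0):
    """Monte Carlo estimate of R(theta, x) = E[A(x_tilde^T Theta)] - A(x^T Theta),
    where A is the log-partition (log-sum-exp) of the K class scores and
    x_tilde_j = 2 * x_j with probability 0.5, else 0 (so E[x_tilde] = x)."""
    rng = np.random.default_rng(seed)
    A_clean = logsumexp(x @ Theta)
    masks = rng.integers(0, 2, size=(num_samples, x.shape[0]))  # fair coin per feature
    A_noised = logsumexp((2.0 * masks * x) @ Theta, axis=1)
    return A_noised.mean() - A_clean
```

Here x is a length-d feature vector and Theta a (d, K) weight matrix; by convexity of $A$ the returned value is always nonnegative.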

12 Second-order delta method
Take the Taylor expansion:
$A(s) \approx A(s_0) + (s - s_0)^\top \nabla A(s_0) + \tfrac{1}{2} (s - s_0)^\top \nabla^2 A(s_0) (s - s_0)$

13 Second-order delta method
Take the Taylor expansion:
$A(s) \approx A(s_0) + (s - s_0)^\top \nabla A(s_0) + \tfrac{1}{2} (s - s_0)^\top \nabla^2 A(s_0) (s - s_0)$
Substitute $s = \tilde{s} := \theta^\top \tilde{x}$ and $s_0 = E[\tilde{s}]$.
Take expectations to get the quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} E[(\tilde{s} - s)^\top \nabla^2 A(s) (\tilde{s} - s)] = \tfrac{1}{2} \operatorname{tr}(\nabla^2 A(s) \operatorname{Cov}(\tilde{s}))$
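For softmax scores, $\nabla^2 A(s) = \operatorname{diag}(\pi) - \pi \pi^\top$ with $\pi = \operatorname{softmax}(s)$, and for independent feature noise $\operatorname{Cov}(\tilde{s}) = \sum_j \operatorname{Var}[\tilde{x}_j]\, \theta_j \theta_j^\top$. A sketch of the resulting quadratic approximation, which should roughly agree with the Monte Carlo estimate above (names are again illustrative):

```python
import numpy as np

def dropout_regularizer_quadratic(x, Theta):
    """R_q(theta, x) = 0.5 * tr(Hessian(A)(s) @ Cov(s_tilde)) for softmax scores
    s = Theta^T x and dropout noise with Var[x_tilde_j] = x_j^2."""
    scores = x @ Theta
    p = np.exp(scores - scores.max())
    p /= p.sum()
    hessian = np.diag(p) - np.outer(p, p)          # Hessian of log-sum-exp at s
    # Cov(s_tilde) = sum_j Var[x_tilde_j] * theta_j theta_j^T, with theta_j = Theta[j].
    cov = sum((x[j] ** 2) * np.outer(Theta[j], Theta[j]) for j in range(x.shape[0]))
    return 0.5 * np.trace(hessian @ cov)
```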

14 Example: logistic regression
The quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} A''(x^\top \theta) \operatorname{Var}[\tilde{x}^\top \theta]$

15 Example: logistic regression
The quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} A''(x^\top \theta) \operatorname{Var}[\tilde{x}^\top \theta]$
$A''(x^\top \theta) = p(1 - p)$ represents uncertainty, where $p = p(y \mid x; \theta) = (1 + \exp(-y\, x^\top \theta))^{-1}$

16 Example: logistic regression
The quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} A''(x^\top \theta) \operatorname{Var}[\tilde{x}^\top \theta]$
$A''(x^\top \theta) = p(1 - p)$ represents uncertainty, where $p = p(y \mid x; \theta) = (1 + \exp(-y\, x^\top \theta))^{-1}$
$\operatorname{Var}[\tilde{x}^\top \theta] = \sum_j \theta_j^2 x_j^2$, which is $L_2$-regularization after normalizing the data

17 The regularizers
Dropout on linear regression: $R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i x_j^{(i)2}$
Dropout on logistic regression: $R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i p_i (1 - p_i)\, x_j^{(i)2}$
Multiclass, CRFs: [Wang et al., 2013]
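A direct implementation of the logistic-regression regularizer above, as a sketch (the function name is mine; X is an (n, d) data matrix):

```python
import numpy as np

def rq_logistic(theta, X):
    """R_q(theta) = 1/2 * sum_j theta_j^2 * sum_i p_i (1 - p_i) x_ij^2 for binary
    logistic regression, with p_i = sigmoid(x_i . theta). For linear regression the
    p_i (1 - p_i) factor is simply dropped."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    weights = p * (1.0 - p)                     # per-example confidence term
    return 0.5 * np.sum(theta ** 2 * (weights @ X ** 2))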

18 Dropout intuition
$R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i p_i (1 - p_i)\, x_j^{(i)2}$
- Regularizes rare features less, like AdaGrad; there is actually a more precise connection [Wager et al., 2013]
- Big weights are okay if they contribute only to confident predictions
- Normalizing by the diagonal Fisher information

19 Semi-supervised Learning
These regularizers are label-independent but can be data-adaptive in interesting ways.
Labeled dataset $D = \{x_1, x_2, \ldots, x_n\}$; unlabeled data $D_{\text{unlabeled}} = \{u_1, u_2, \ldots, u_m\}$
We can better estimate the regularizer:
$R^*(\theta, D, D_{\text{unlabeled}}) := \frac{n}{n + \alpha m} \left( \sum_{i=1}^{n} R(\theta, x_i) + \alpha \sum_{i=1}^{m} R(\theta, u_i) \right)$
for some tunable $\alpha$.
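Combining the labeled and unlabeled terms is then straightforward; a sketch assuming some per-example regularizer (for example a per-example version of the quadratic one above):

```python
def rq_semisupervised(theta, X_labeled, X_unlabeled, alpha, per_example_reg):
    """R*(theta) = n / (n + alpha*m) * (sum_i R(theta, x_i) + alpha * sum_i R(theta, u_i)),
    where per_example_reg(theta, x) is any label-independent per-example regularizer."""
    n, m = len(X_labeled), len(X_unlabeled)
    r_labeled = sum(per_example_reg(theta, x) for x in X_labeled)
    r_unlabeled = sum(per_example_reg(theta, u) for u in X_unlabeled)
    return n / (n + alpha * m) * (r_labeled + alpha * r_unlabeled)
```

Because the regularizer never looks at labels, the unlabeled examples contribute exactly the same kind of term as the labeled ones.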

20 Semi-supervised intuition
$R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i p_i (1 - p_i)\, x_j^{(i)2}$
Like other semi-supervised methods:
- transductive SVMs [Joachims, 1999]
- entropy regularization [Grandvalet and Bengio, 2005]
- EM: guess a label [Nigam et al., 2000]
we want to make confident predictions on the unlabeled data.
Get a better estimate of the Fisher information.

21 IMDB dataset [Maas et al., 2011]
- 25k positive reviews and 25k negative reviews; half for training and half for testing
- 50k unlabeled reviews, also containing neutral reviews
- 300k sparse unigram features, ~5 million sparse bigram features

22 Experiments: semi-supervised
Adding more unlabeled data (10k labeled) improves performance.
[Plot: accuracy vs. size of unlabeled data, comparing dropout+unlabeled, dropout, and L2]

23 Experiments: semi-supervised
Adding more labeled data (40k unlabeled) improves performance.
[Plot: accuracy vs. size of labeled data, comparing dropout+unlabeled, dropout, and L2]

24 Quantitative results on IMDB
Method \ Settings: Supervised | Semi-sup.
- MNB - unigrams with SFE [Su et al., 2011]
- Vectors for sentiment analysis [Maas et al., 2011]
- This work: dropout + unigrams
- This work: dropout + bigrams

25 Experiments: other datasets
Dataset \ Settings: L2 | Drop | +Unlbl
- Subjectivity [Pang and Lee, 2004]
- Rotten Tomatoes [Pang and Lee, 2005]
- newsgroups
- CoNLL

26 Outline
Part 0: Some background
Part 1: Dropout as adaptive regularization with applications to semi-supervised learning (with Stefan and Percy)
Part 2: Applications to structured prediction using CRFs when the log-partition function cannot be easily computed (with Mengqiu, Chris, Percy, and Stefan)

27 Log-linear structured prediction
A vector of scores $s = (s_1, \ldots, s_{|\mathcal{Y}|})$ with $s_y = \theta^\top f(y, x)$
The likelihood is:
$p(y \mid x; \theta) = \exp\{s_y - A(s)\}$, where $A(s) = \log \sum_y \exp\{s_y\}$
$\mathcal{Y}$ might be really huge!

28 What about structured prediction?
Recall that in logistic regression:
$R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i p_i (1 - p_i)\, x_j^{(i)2}$
What if we cannot easily compute the log-partition function $A$, or its second derivatives?

29 The original setup
Take the Taylor expansion:
$A(s) \approx A(s_0) + (s - s_0)^\top \nabla A(s_0) + \tfrac{1}{2} (s - s_0)^\top \nabla^2 A(s_0) (s - s_0)$
Substitute $s = \tilde{s} := \theta^\top \tilde{x}$ and $s_0 = E[\tilde{s}]$.
Take expectations to get the quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} E[(\tilde{s} - s)^\top \nabla^2 A(s) (\tilde{s} - s)] = \tfrac{1}{2} \operatorname{tr}(\nabla^2 A(s) \operatorname{Cov}(\tilde{s}))$

30 The structured prediction setup
Take the Taylor expansion:
$A(s) \approx A(s_0) + (s - s_0)^\top \nabla A(s_0) + \tfrac{1}{2} (s - s_0)^\top \nabla^2 A(s_0) (s - s_0)$
Substitute $s = \tilde{s}$ with $\tilde{s}_y := \theta^\top f(y, \tilde{x})$, and $s_0 = E[\tilde{s}]$.
Take expectations to get the quadratic approximation:
$R_q(\theta, x) = \tfrac{1}{2} E[(\tilde{s} - s)^\top \nabla^2 A(s) (\tilde{s} - s)] = \tfrac{1}{2} \operatorname{tr}(\nabla^2 A(s) \operatorname{Cov}(\tilde{s}))$

31 Use the independence structure
Depends on the underlying graphical model; we assume we can do exact inference via message passing (e.g. a clique tree).
E.g. a linear-chain CRF:
$f(y, x) = \sum_{t=1}^{T} g_t(y_{t-1}, y_t, x)$
$A(s) = \log \sum_{y \in \mathcal{Y}} \exp\left\{ \sum_{t=1}^{T} s_{y_{t-1}, y_t, t} \right\}$
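For a linear chain, $A(s)$ itself is computed with the standard forward recursion. A sketch under my own shape conventions (scores[t, a, b] stands for $s_{a,b,t}$, and position 0 is taken to condition on a dummy START state at index 0):

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(scores):
    """A(s) = log sum_{y in Y} exp(sum_t s_{y_{t-1}, y_t, t}) for a linear-chain CRF,
    computed by the forward recursion in O(T K^2).
    scores: array of shape (T, K, K) with scores[t, a, b] = s_{a,b,t}."""
    T, K, _ = scores.shape
    log_alpha = scores[0, 0, :]                  # log-scores out of the START state
    for t in range(1, T):
        # log_alpha_t[b] = logsumexp_a(log_alpha_{t-1}[a] + s_{a,b,t})
        log_alpha = logsumexp(log_alpha[:, None] + scores[t], axis=0)
    return logsumexp(log_alpha)
```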

32 Local Noising
Global noising: $\tilde{s} := \theta^\top f(y, \tilde{x})$ (one noise draw for the whole sequence)
Local noising: $\tilde{s} := \sum_{t=1}^{T} \theta^\top g_t(y_{t-1}, y_t, \tilde{x})$ (noise drawn independently for each position)
Can try to justify this in retrospect.

33 The regularizer
The regularizer is:
$R_q(\theta, x) = \tfrac{1}{2} \sum_{a,b,t} \mu_{a,b,t} (1 - \mu_{a,b,t}) \operatorname{Var}[\tilde{s}_{a,b,t}]$
(the factors $\mu_{a,b,t}(1 - \mu_{a,b,t})$ are the diagonal elements of the Hessian $\nabla^2 A$)
For marginals: $\mu_{a,b,t} = p_\theta(y_{t-1} = a, y_t = b \mid x)$
And derivatives: $\nabla \mu_{a,b,t} = E_{p_\theta(y \mid x,\, y_{t-1}=a,\, y_t=b)}[f(y,x)] - E_{p_\theta(y \mid x)}[f(y,x)]$

34 Efficient computation
For every $a, b, t$ we need
$\nabla \mu_{a,b,t} = E_{p_\theta(y \mid x,\, y_{t-1}=a,\, y_t=b)}[f(y,x)] - E_{p_\theta(y \mid x)}[f(y,x)]$
Naïve computation is $O(K^4 T^2)$; this can be reduced to $O(K^3 T^2)$.
We provide a dynamic program that computes it in $O(K T^2)$, like normal forward-backward, except that it needs to be run for every feature.
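The pairwise marginals $\mu_{a,b,t}$ that enter the regularizer come from ordinary forward-backward; the per-feature conditional expectations then require the additional dynamic program described on the slide, which is not shown here. A sketch of just the marginal computation, with the same shape conventions as the log_partition sketch above:

```python
import numpy as np
from scipy.special import logsumexp

def pairwise_marginals(scores):
    """mu[t, a, b] = p(y_{t-1} = a, y_t = b | x) for a linear-chain CRF, computed by
    forward-backward; scores has the same (T, K, K) layout as in log_partition.
    mu[0] is left at zero because the t = 0 clique involves the dummy START state."""
    T, K, _ = scores.shape
    log_alpha = np.full((T, K), -np.inf)
    log_alpha[0] = scores[0, 0, :]
    for t in range(1, T):
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + scores[t], axis=0)
    log_beta = np.zeros((T, K))                  # log_beta[T-1] = 0
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(scores[t + 1] + log_beta[t + 1][None, :], axis=1)
    log_Z = logsumexp(log_alpha[-1])
    mu = np.zeros((T, K, K))
    for t in range(1, T):
        mu[t] = np.exp(log_alpha[t - 1][:, None] + scores[t]
                       + log_beta[t][None, :] - log_Z)
    return mu
```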

35 Feature group trick (Mengqiu)
$\nabla \mu_{a,b,t} = E_{p_\theta(y \mid x,\, y_{t-1}=a,\, y_t=b)}[f(y,x)] - E_{p_\theta(y \mid x)}[f(y,x)]$
Features that always appear in the same location all have the same conditional expectations.
Gives a 4x speedup; applicable to general CRFs.

36 CRF sequence tagging
CoNLL 2003 Named Entity Recognition: Stanford[ORG] is[O] near[O] Palo[LOC] Alto[LOC]
Dataset \ Settings: None | L2 | Drop
- CoNLL 2003 Dev
- CoNLL 2003 Test

37 CRF sequence tagging
Dropout helps more on precision than recall.
Per-tag results (precision, recall, $F_{\beta=1}$) on the CoNLL test set:
(e) with $L_2$ regularization: LOC 86.13%, MISC 79.30%, ORG 80.49%, PER 93.33%, Overall 86.08%
(f) with dropout regularization: LOC 87.74%, MISC 77.34%, ORG 81.89%, PER 92.68%, Overall 86.45%

38 SANCL POS Tagging
Test set differences are statistically significant for newsgroups and reviews.
$F_{\beta=1}$, Dataset \ Settings: None | L2 | Drop
- newsgroups: Dev, Test
- reviews: Dev, Test
- answers: Dev, Test

39 Summary
Part 0: Some background
Part 1: Dropout as adaptive regularization with applications to semi-supervised learning (joint work with Stefan Wager and Percy)
Part 2: Applications to structured prediction using CRFs when the log-partition function cannot be easily computed (joint work with Mengqiu, Chris, Percy, and Stefan Wager)

40 References
- Our arXiv paper [Wager et al., 2013] has more details, including the relation to AdaGrad.
- Our EMNLP paper [Wang et al., 2013] extends this framework to structured prediction.
- Our ICML paper [Wang and Manning, 2013] applies a related technique to neural networks and provides some negative examples.

41 Dropout vs. $L_2$
- Can be much better than all settings of $L_2$
- Part of the gain comes from normalization
[Plot: accuracy vs. $L_2$ regularization strength (λ), comparing L2 only, L2 + Gaussian dropout, and L2 + quadratic dropout]

42 Example: linear least squares
The loss function is $f(\tilde{x}) = \tfrac{1}{2}(\theta^\top \tilde{x} - y)^2$
Let $X = \theta^\top \tilde{x}$, where $\tilde{x}_j = 2 z_j x_j$ and $z_j \sim \text{Bernoulli}(0.5)$.
$E[f(X)] = f(E[X]) + \tfrac{1}{2} f''(E[X]) \operatorname{Var}[X] = \tfrac{1}{2}(\theta^\top x - y)^2 + \tfrac{1}{2} \sum_j \theta_j^2 x_j^2$
The total regularizer is $R_q(\theta) = \tfrac{1}{2} \sum_j \theta_j^2 \sum_i x_j^{(i)2}$
This is just $L_2$ applied after data normalization.
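Because the squared loss is quadratic, the expansion above is exact rather than approximate, which is easy to verify numerically (the feature values, weights, and target below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])        # hypothetical example
theta = np.array([0.3, 0.1, -0.7])
y = 0.25

# Average the squared loss over many dropout draws (x_j -> 2*x_j or 0, each w.p. 1/2).
masks = rng.integers(0, 2, size=(200000, 3))
mc = np.mean(0.5 * ((2.0 * masks * x) @ theta - y) ** 2)

# Closed form from the slide: the original loss plus the L2-after-normalization penalty.
closed = 0.5 * (x @ theta - y) ** 2 + 0.5 * np.sum(theta ** 2 * x ** 2)
print(mc, closed)                      # should agree up to Monte Carlo error
```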

43 Quantitative results on IMDB
Method \ Settings: Supervised | Semi-sup.
- MNB - unigrams with SFE [Su et al., 2011]
- MNB - bigrams
- Vectors for sentiment analysis [Maas et al., 2011]
- NBSVM - bigrams [Wang and Manning, 2012]
- This work: dropout + unigrams
- This work: dropout + bigrams
