
Introduction to Machine Learning Eran Halperin, Yishay Mansour, Lior Wolf 2013-2014 Lecture 13: Applications

Today: similarity learning, ROC curves, probabilistic graphical models, neural networks.

Cairo Genizah: a collection containing ~250,000 fragments of mainly Jewish texts, discovered in the late 19th century; discarded codices, scrolls, and documents, written mainly in the 10th to 15th centuries; spread out in tens of libraries and private collections worldwide; enormous impact on 20th-century scholarship in a multitude of fields.

Basic notion: join. A join is a set of manuscript fragments that are known to originate from the same original work. Known joins are documented in catalogs.

Catalogs (very partial list), with the number of catalogued entries in parentheses:
Adler, Elkan Nathan, Catalogue of Hebrew Manuscripts in the Collection of Elkan Nathan Adler, Cambridge, 1921 (1026)
Cowley, Arthur Ernest, Photocopy of Unpublished Typescript Catalogue of the Hebrew Manuscripts in the Bodleian Library (1318)
Gottstein, M.H., "Hebrew Fragments in the Mingana Collection," Journal of Jewish Studies V (1954) (40)
Halper, Benzion, Descriptive Catalogue of Genizah Fragments in Philadelphia, Philadelphia, 1924 (487)
Lutzki, Morris, Catalogue of Biblical Manuscripts in the Library of the Jewish Theological Seminary, Photocopy of Unpublished Typescript (New York: JTS) (927)
Neubauer, Adolf, and Cowley, Arthur Ernest, Catalogue of the Hebrew Manuscripts in the Bodleian Library, Vol. II, Oxford, 1886-1906 (2199)
Reif, Stefan C., Hebrew Manuscripts at Cambridge University Library, Cambridge, 1997 (126)
Schwab, Moise, "Les Manuscrits du Consistoire Israelite de Paris Provenant de la Gueniza du Caire," REJ LXII (1911), pp. 107-119, 267-277; LXIII (1911), pp. 100-120, 276-296; LXIV (1912), pp. 118-141, 1911-1912 (1896)
Schwarz, A.Z., Loewinger, D.S., and Roth, E., Die hebräischen Handschriften in Oesterreich (ausserhalb der Nationalbibliothek in Wien), New York (185)
Wickersheimer, Ernest, Catalogue général des manuscrits des bibliothèques publiques de France. Départements, Tome XLVII: Strasbourg, Paris, 1923 (3)
Worman, E.J., Hand-list of Pieces in Glass of the Taylor-Schechter Collection, Photocopy of Unpublished Hand-list (2291)
Worrell, William Hoyt, and Gottheil, Richard James Horatio (50)

Preprocessing the images: foreground segmentation, ruler removal, line detection, image alignment, character detection, character description, vectorization.

Similarity computation. Take two vectors (v_1, v_2) and return a similarity value s_12. Ideally, there is a threshold q such that s_12 > q indicates a join and s_12 < q indicates not a join. Methods used: L2 distance of the vectors; L1 distance of the vectors; the Hellinger norm; an SVM applied to the vector of absolute differences, scoring sum_i w_i |a_i - b_i|; One-Shot Similarity (ECCV-LFW 2008, ICCV 2009).
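
A minimal sketch of the vector-similarity measures listed above (L2, L1, Hellinger, and the absolute-difference features scored by a linear SVM). The function names and the toy vectors are illustrative, not part of the original system.

```python
import numpy as np

def l2_similarity(v1, v2):
    # Negated Euclidean distance: larger value means more similar.
    return -np.linalg.norm(v1 - v2)

def l1_similarity(v1, v2):
    # Negated city-block (L1) distance.
    return -np.abs(v1 - v2).sum()

def hellinger_similarity(v1, v2):
    # Hellinger distance between non-negative descriptors, after L1 normalization.
    p, q = v1 / v1.sum(), v2 / v2.sum()
    return -np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

def abs_diff_features(v1, v2):
    # Feature vector |a_i - b_i|; a trained linear SVM then scores sum_i w_i |a_i - b_i|.
    return np.abs(v1 - v2)

a, b = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(l2_similarity(a, b), l1_similarity(a, b), hellinger_similarity(a, b))
```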

Computing the One-Shot Similarity, given a set A of negative examples: Step a: Model1 = train(p, A). Step b: Score1 = classify(q, Model1). Step c: Model2 = train(q, A). Step d: Score2 = classify(p, Model2). One-Shot-Sim = (Score1 + Score2) / 2.
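
A sketch of the two symmetric train/classify steps above, using scikit-learn's LDA as the underlying classifier (as in the next slide). The descriptors p, q and the negative set A are assumed to be NumPy arrays; any linear classifier could be substituted.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def one_shot_similarity(p, q, A):
    """One-Shot Similarity of descriptors p and q against a negative set A."""
    labels = np.array([1] + [0] * len(A))   # one positive example vs. the negatives

    # Step a: train a classifier separating p (positive) from A (negatives).
    model1 = LinearDiscriminantAnalysis().fit(np.vstack([p, A]), labels)
    # Step b: score q by its signed distance to that decision boundary.
    score1 = model1.decision_function(q.reshape(1, -1))[0]

    # Step c: train a classifier separating q from A.
    model2 = LinearDiscriminantAnalysis().fit(np.vstack([q, A]), labels)
    # Step d: score p with the second classifier.
    score2 = model2.decision_function(p.reshape(1, -1))[0]

    return (score1 + score2) / 2.0          # One-Shot-Sim = (Score1 + Score2) / 2

# Random descriptors stand in for real fragment features.
rng = np.random.default_rng(0)
p, q, A = rng.normal(size=30), rng.normal(size=30), rng.normal(size=(50, 30))
print(one_shot_similarity(p, q, A))
```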

Properties of the One-Shot Similarity, using LDA as the underlying classifier: 1. It is a PD kernel (actually conditionally PD). 2. It can be efficiently computed. 3. It has a half-explicit embedding: k(x_1, x_2) = ψ(x_1)^T φ(x_2).

Adding metric learning. Using LDA as the underlying classifier, the One-Shot Similarity has the closed form

OSS(x_i, x_j, A) = (x_i - μ_A)^T S_W^{-1} (x_j - (x_i + μ_A)/2) + (x_j - μ_A)^T S_W^{-1} (x_i - (x_j + μ_A)/2),

where μ_A is the mean of the negative set A and S_W is the within-class scatter matrix. Instead of the examples x_i we use Tx_i, obtaining T by a gradient descent procedure that optimizes a score of the form

f(T) = Σ_{(i,j) same} OSS(Tx_i, Tx_j, TA) - Σ_{(i,j) not same} OSS(Tx_i, Tx_j, TA).

From similarity to decision: (a) compute a global descriptor for each image; (b) measure descriptor similarity; (c) train a classifier (e.g., SVM) to threshold join from not-join. (Figure: pairs of fragment images with similarity values Sim(.,.) = s_1, s_2, ..., s_i, s_{i+1}, labeled join / not join.)

Multiple similarities/vectors/descriptors. Each training pair, join or not join, is described by a vector of n similarity values (s_{k,1}, s_{k,2}, ..., s_{k,n}); train on these vectors and combine them with an SVM.
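
A sketch of the combination step: each candidate pair is described by its vector of n similarity values and a standard SVM separates join from not-join pairs. The numbers below are toy placeholders (illustrative scikit-learn code, not the original pipeline).

```python
import numpy as np
from sklearn.svm import SVC

# One row per fragment pair, columns = the n similarity values (s_{k,1}, ..., s_{k,n});
# labels: 1 for join, 0 for not-join.
X_train = np.array([[0.91, 0.74, 0.88],
                    [0.12, 0.30, 0.05],
                    [0.85, 0.69, 0.93],
                    [0.20, 0.18, 0.11]])
y_train = np.array([1, 0, 1, 0])

clf = SVC(kernel="rbf").fit(X_train, y_train)

# Score a new candidate pair by its vector of similarities.
candidate = np.array([[0.80, 0.71, 0.79]])
print(clf.predict(candidate), clf.decision_function(candidate))
```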

Combining similarities

True label vs. classifier result (original slide credit: Darlene Goldstein):

                        Prediction = -1          Prediction = +1
No disease (D = -1)     True negative            False positive
Disease    (D = +1)     Miss (false negative)    True positive

Specific example: the distributions of test results for patients without the disease and for patients with the disease overlap. (Figure: two overlapping histograms over the test result. Original slide credit: Darlene Goldstein.)

Threshold: a cutoff on the test result; patients below it are called negative, patients above it are called positive. (Original slide credit: Darlene Goldstein.)

Some definitions (original slide credit: Darlene Goldstein). With the threshold in place: results above the threshold from patients with the disease are true positives; results above the threshold from patients without the disease are false positives; results below the threshold from patients without the disease are true negatives; results below the threshold from patients with the disease are false negatives.

Moving the threshold to the right calls fewer patients positive (fewer false positives, more false negatives); moving it to the left calls more patients positive (more false positives, fewer false negatives).

ROC curve: the true positive rate (0% to 100%) plotted against the false positive rate (0% to 100%), traced out as the threshold is varied. (Original slide credit: Darlene Goldstein.)

ROC curve comparison. A good classifier: the curve rises steeply toward the top-left corner (high true positive rate at a low false positive rate). A poor classifier: the curve stays close to the diagonal.

ROC curve extremes. Best classifier: the two distributions do not overlap at all, and the curve passes through the top-left corner. Worst classifier: the distributions overlap completely, and the curve is the diagonal.
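
A minimal sketch of how an ROC curve is traced out by sweeping a threshold over classifier scores; the toy scores and labels are placeholders (sklearn.metrics.roc_curve would produce the same points).

```python
import numpy as np

def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs over all thresholds."""
    order = np.argsort(-scores)        # sort examples by decreasing score
    labels = labels[order]
    P, N = labels.sum(), (1 - labels).sum()
    tps = np.cumsum(labels)            # true positives above each threshold
    fps = np.cumsum(1 - labels)        # false positives above each threshold
    return fps / N, tps / P

# Toy example: higher score should mean "disease" (positive class).
scores = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0, 0])
fpr, tpr = roc_points(scores, labels)
print(list(zip(fpr.round(2), tpr.round(2))))
```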

Combining similarities

Example of our discoveries: a lost halakhic monograph of Rav Saadya Gaon (10th century) in Judeo-Arabic, with matching fragments held in Geneva and New York.

Sorting face albums. Recall the Genizah: a collection containing ~250,000 fragments of mainly Jewish texts, discovered in the late 19th century; discarded codices, scrolls, and documents, written mainly in the 10th to 15th centuries; by now mostly fragmented pages, spread out in tens of libraries and private collections worldwide.

Sorting face albums: large scale; unknown number of clusters; some clusters are huge; many small clusters and singletons; pairwise similarities are noisy and misleading. Conventional algorithms would fail. Commercial solutions?

Clustering via graphical models. The variables to be inferred are: h_i, the face attributes, and l_ij, the grouping variables (0/1). (Figure: a graph whose nodes h_i, h_j, h_k are connected through the pairwise variables l_ij, l_ik, l_jk.)

Clustering via probabilistic graphical models. The variables to be inferred are: h_i, the face attributes, and l_ij, the grouping variables (0/1). The factors: ξ_i(h_i), attribute distributions; ψ_ij(h_i, h_j, l_ij), pairwise compatibility in attributes; γ_ij(l_ij), visual similarity; φ_ij(l_ij), non-visual similarity; χ_ijk(l_ij, l_ik, l_jk), transitivity constraints. (Figure: the factor graph connecting the ξ_i, h_i, ψ_ij factors with the l_ij nodes and their γ, φ, χ factors.)
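
As a small illustration, one possible form of the transitivity factor χ_ijk is a hard constraint that forbids exactly two of the three grouping variables being "same" (e.g., i grouped with j and with k, but j not grouped with k). This is a sketch of that choice, not the exact potential used in the system.

```python
def transitivity_factor(l_ij, l_ik, l_jk):
    """chi_ijk(l_ij, l_ik, l_jk): 1.0 for consistent label triples, 0.0 otherwise."""
    same_pairs = l_ij + l_ik + l_jk
    # Exactly two "same" decisions among the three pairs violate transitivity;
    # zero, one, or three "same" decisions are consistent.
    return 0.0 if same_pairs == 2 else 1.0
```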

word2vec (Natural Language Processing, NLP): learned from a large corpus; employs a neural network for learning vector representations of words.

Logistic regression. Logistic regression learns a parameter vector θ. On input x, it outputs h_θ(x) = 1 / (1 + exp(-θ^T x)). (Figure: a single unit with inputs x_1, x_2, x_3 and a +1 bias term. Slide credit: Andrew Ng.)
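
The output above in one line of NumPy; θ and x are toy values, with the bias handled by appending a +1 input as in the figure.

```python
import numpy as np

def logistic_output(theta, x):
    # h_theta(x) = 1 / (1 + exp(-theta^T x)): probability of the positive class.
    return 1.0 / (1.0 + np.exp(-theta @ x))

theta = np.array([0.5, -1.2, 2.0, 0.1])   # last weight multiplies the +1 bias input
x = np.array([1.0, 0.3, 0.8, 1.0])        # inputs x_1, x_2, x_3 and the bias term +1
print(logistic_output(theta, x))
```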

Neural network example: a 4-layer network with 2 output units (inputs x_1, x_2, x_3 plus +1 bias units; Layers 1 through 4). Trained via gradient descent using the backpropagation algorithm (susceptible to local optima). (Slide credit: Andrew Ng.)
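
A compact sketch of such a feed-forward network and a single gradient-descent/backpropagation step (NumPy, sigmoid units, squared-error loss). The hidden-layer sizes are illustrative; only the 3 inputs, 4 layers, and 2 outputs follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 3, 2]                                  # 4 layers: 3 inputs, two hidden, 2 outputs
W = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    activations = [x]
    for Wl, bl in zip(W, b):
        activations.append(sigmoid(Wl @ activations[-1] + bl))
    return activations                                # one activation vector per layer

def backprop_step(x, y, lr=0.1):
    a = forward(x)
    delta = (a[-1] - y) * a[-1] * (1 - a[-1])         # output error: squared loss, sigmoid units
    for l in reversed(range(len(W))):
        grad_W, grad_b = np.outer(delta, a[l]), delta
        if l > 0:
            # Propagate the error backwards using the pre-update weights.
            delta = (W[l].T @ delta) * a[l] * (1 - a[l])
        W[l] -= lr * grad_W                           # gradient-descent update
        b[l] -= lr * grad_b
    return a[-1]                                      # network output before the update

print(backprop_step(np.array([0.2, 0.7, 0.1]), np.array([1.0, 0.0])))
```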

CBOW architecture. Example corpus: "In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. And God said, Let there be light: and there was light. And God saw the light, that it was good: and God divided the light from the darkness." (Figure: an input layer of context words, a hidden layer, and an output layer predicting the Huffman code of "Spirit".)

word2vec variants. Continuous Bag-of-Words (CBOW) architecture: predicts the current word given the context. Skip-gram architecture: predicts the surrounding words given the current word.
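
For concreteness, a sketch of training both variants with the gensim library (assuming gensim >= 4.0 is installed; the toy corpus and hyperparameters are placeholders).

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
corpus = [["in", "the", "beginning", "god", "created", "the", "heaven", "and", "the", "earth"],
          ["and", "god", "said", "let", "there", "be", "light", "and", "there", "was", "light"]]

# CBOW (sg=0): predict the current word from its context; hs=1 uses the
# hierarchical softmax (Huffman tree) mentioned on the CBOW slide.
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0, hs=1)

# Skip-gram (sg=1): predict the surrounding words from the current word.
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["light"].shape, skipgram.wv.most_similar("light", topn=3))
```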

word2vec vectors are semantic and additive, and surprising regularities emerge: King - Man + Woman ≈ Queen; France + Capital ≈ Paris.
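
These regularities can be checked directly with vector arithmetic; a sketch assuming the pretrained Google News vectors available through gensim's downloader (a large download on first use), since a toy corpus is far too small to show them.

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # any large pretrained word2vec model works

# King - Man + Woman ~ Queen, via the built-in analogy query (cosine similarity).
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# France + Capital ~ Paris, by explicit vector addition and a nearest-neighbour search.
v = wv["France"] + wv["capital"]
print(wv.similar_by_vector(v, topn=3))
```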

Sentiment Analysis. Example reviews: "unbelievably disappointing"; "Full of zany characters and richly applied satire, and some great plot twists"; "This is the greatest screwball comedy ever filmed"; "It was pathetic. The worst part about it was the boxing scenes." (Original slides by Yackov Lubarsky.)

Why hard? Subtlety. A perfume review in Perfumes: The Guide: "If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut." Dorothy Parker on Katharine Hepburn: "She runs the gamut of emotions from A to B."

Why hard? Thwarted expectations and ordering effects: "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up." "Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised."

Standard approaches: bag-of-words features and semantic vector spaces. Example reviews: "While enjoyable in spots, 'The Dark World' is haphazard and ultimately unsatisfying." "He feels for his subject matter, no doubt, but every once in a while it feels like he falls short on imagination and takes a short cut to move his story forward." Standard approaches ignore phrase composition.
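
A sketch of the standard bag-of-words baseline in scikit-learn; the four reviews are taken from these slides, and the labels are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["While enjoyable in spots, 'The Dark World' is haphazard and ultimately unsatisfying.",
           "Full of zany characters and richly applied satire, and some great plot twists",
           "This is the greatest screwball comedy ever filmed",
           "It was pathetic. The worst part about it was the boxing scenes."]
labels = [0, 1, 1, 0]                     # 1 = positive, 0 = negative (illustrative)

# Bag-of-words counts ignore word order and phrase composition entirely.
clf = make_pipeline(CountVectorizer(), LogisticRegression()).fit(reviews, labels)
print(clf.predict(["unbelievably disappointing"]))
```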

Compositionality Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)

Stanford Sentiment Treebank: 10,662 sentences from rottentomatoes.com movie reviews; parsed using the Stanford Parser to create parse trees; 215,154 unique phrases; sentiment classified using Amazon Mechanical Turk, with 3 judges per phrase.

Treebank sentiment values: short n-grams are mostly neutral; extreme values or the mid-tick are rarely used; 5-class classification is enough.

Recursive neural models: compositional vector representations for phrases of variable length. Each word/phrase is represented by a vector; these vectors are used as features for calculating the combined sentiment.

Recursive Neural Network. Word/phrase vectors a, b, c, p_i ∈ R^d. Word embedding matrix L ∈ R^{d×|V|}. Classification: y^a = softmax(W_s a), with W_s ∈ R^{C×d}. L and W_s are trained parameters. What about the p_i?

Recursive Neural Network composition: p_1 = f(W [b; c]), p_2 = f(W [a; p_1]). W ∈ R^{d×2d} is a trained parameter; f is usually tanh.
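
A minimal NumPy sketch of this recursive composition together with the per-node sentiment softmax from the previous slide; the dimensions, random initialization, and the tiny parse tree (a (b c)) are toy placeholders, not the trained model.

```python
import numpy as np

d, C, V = 8, 5, 100                          # vector size, sentiment classes, vocabulary size
rng = np.random.default_rng(0)
L  = rng.normal(0, 0.1, (d, V))              # word-embedding matrix, one column per word
W  = rng.normal(0, 0.1, (d, 2 * d))          # composition matrix
Ws = rng.normal(0, 0.1, (C, d))              # sentiment classifier

def compose(left, right):
    # p = f(W [left; right]) with f = tanh
    return np.tanh(W @ np.concatenate([left, right]))

def sentiment(node_vec):
    # y = softmax(W_s node_vec): a distribution over the C sentiment classes
    z = Ws @ node_vec
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy parse tree (a (b c)); the indices stand in for real vocabulary entries.
a, b, c = L[:, 3], L[:, 17], L[:, 42]
p1 = compose(b, c)
p2 = compose(a, p1)
print(sentiment(p1).round(3), sentiment(p2).round(3))
```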

MV-RNN: Matrix-Vector RNN. The composition function depends on the words/phrases being combined: p_1 = f(W [Cb; Bc]), P_1 = f(W_M [B; C]), with W ∈ R^{d×2d} and W_M ∈ R^{d×2d}. W, W_M, and the per-word vectors and matrices (a, A), (b, B), (c, C) are trained, so the number of parameters depends on |V|.
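
A sketch of the MV-RNN composition step, where each constituent carries both a vector and a matrix; dimensions are toy placeholders, and the nonlinearity on the matrix part follows the slide (the original paper leaves that part linear).

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
W  = rng.normal(0, 0.1, (d, 2 * d))          # composes the matrix-transformed vectors
WM = rng.normal(0, 0.1, (d, 2 * d))          # composes the two matrices into a new one

def mv_compose(b, B, c, C):
    """Combine the (vector, matrix) pairs (b, B) and (c, C) into a parent (p, P)."""
    p = np.tanh(W @ np.concatenate([C @ b, B @ c]))   # p_1 = f(W [Cb; Bc])
    P = np.tanh(WM @ np.vstack([B, C]))               # P_1 = f(W_M [B; C])
    return p, P

b, c = rng.normal(size=d), rng.normal(size=d)
B, C = rng.normal(size=(d, d)), rng.normal(size=(d, d))
p, P = mv_compose(b, B, c, C)
print(p.shape, P.shape)                               # (d,), (d, d)
```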

Sentiment prediction: accuracy for fine-grained (5-class) and binary predictions, at the sentence level (root) and over all nodes.

Most Positive/Negative