
Introduction to Machine Learning Eran Halperin, Yishay Mansour, Lior Wolf 2013-2014 Lecture 13: Applications

Today: similarity learning, ROC curves, probabilistic graphical models, neural networks.

Cairo Genizah: a collection containing ~250,000 fragments of mainly Jewish texts, discovered in the late 19th century; discarded codices, scrolls, and documents, written mainly in the 10th to 15th centuries; spread out in tens of libraries and private collections worldwide; enormous impact on 20th-century scholarship in a multitude of fields.

Basic notion: join. A join is a set of manuscript fragments that are known to originate from the same original work. Known joins are documented in catalogs.

Catalogs (very partial list), with the number of catalogued entries in parentheses:
Adler, Elkan Nathan, Catalogue of Hebrew Manuscripts in the Collection of Elkan Nathan Adler, Cambridge, 1921 (1026)
Cowley, Arthur Ernest, Photocopy of Unpublished Typescript Catalogue of the Hebrew Manuscripts in the Bodleian Library (1318)
Gottstein, M.H., "Hebrew Fragments in the Mingana Collection," Journal of Jewish Studies V (1954) (40)
Halper, Benzion, Descriptive Catalogue of Genizah Fragments in Philadelphia, Philadelphia, 1924 (487)
Lutzki, Morris, Catalogue of Biblical Manuscripts in the Library of the Jewish Theological Seminary, Photocopy of Unpublished Typescript (New York: JTS) (927)
Neubauer, Adolf, and Cowley, Arthur Ernest, Catalogue of the Hebrew Manuscripts in the Bodleian Library, Vol. II, Oxford, 1886-1906 (2199)
Reif, Stefan C., Hebrew Manuscripts at Cambridge University Library, Cambridge, 1997 (126)
Schwab, Moise, "Les Manuscrits du Consistoire Israelite de Paris Provenant de la Gueniza du Caire," REJ LXII (1911), pp. 107-119, 267-277; LXIII (1911), pp. 100-120, 276-296; LXIV (1912), pp. 118-141, 1911-1912 (1896)
Schwarz, A.Z., Loewinger, D.S., and Roth, E., Die hebräischen Handschriften in Oesterreich (ausserhalb der Nationalbibliothek in Wien), New York (185)
Wickersheimer, Ernest, Catalogue général des manuscrits des bibliothèques publiques de France. Départements, Tome XLVII: Strasbourg, Paris, 1923 (3)
Worman, E.J., Hand-list of Pieces in Glass of the Taylor-Schechter Collection, Photocopy of Unpublished Hand-list (2291)
Worrell, William Hoyt, and Gottheil, Richard James Horatio (50)

Preprocessing the images: foreground segmentation, ruler removal, line detection, image alignment, character detection, character description, vectorization.

Similarity computation. Take two vectors (v_1, v_2) and return a similarity value s_12. Ideally, there is a threshold q such that s_12 > q indicates a join and s_12 < q indicates not a join. Methods used: L2 distance of the vectors; L1 distance of the vectors; the Hellinger norm; an SVM applied to the vector of absolute differences, scoring sum_i w_i |a_i - b_i|; One-Shot Similarity (ECCV-LFW 2008, ICCV 2009).
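
A minimal sketch of the vector-similarity measures listed above (L2, L1, Hellinger, and the absolute-difference features scored by a linear SVM). The function names and the toy vectors are illustrative, not part of the original system.

```python
import numpy as np

def l2_similarity(v1, v2):
    # Negated Euclidean distance: larger value means more similar.
    return -np.linalg.norm(v1 - v2)

def l1_similarity(v1, v2):
    # Negated city-block (L1) distance.
    return -np.abs(v1 - v2).sum()

def hellinger_similarity(v1, v2):
    # Hellinger distance between non-negative descriptors, after L1 normalization.
    p, q = v1 / v1.sum(), v2 / v2.sum()
    return -np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

def abs_diff_features(v1, v2):
    # Feature vector |a_i - b_i|; a trained linear SVM then scores sum_i w_i |a_i - b_i|.
    return np.abs(v1 - v2)

a, b = np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.6, 0.3])
print(l2_similarity(a, b), l1_similarity(a, b), hellinger_similarity(a, b))
```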

Computing the One-Shot Similarity, given a set A of negative examples: Step a: Model1 = train(p, A). Step b: Score1 = classify(q, Model1). Step c: Model2 = train(q, A). Step d: Score2 = classify(p, Model2). One-Shot-Sim = (Score1 + Score2) / 2.
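
A sketch of the two symmetric train/classify steps above, using scikit-learn's LDA as the underlying classifier (as in the next slide). The descriptors p, q and the negative set A are assumed to be NumPy arrays; any linear classifier could be substituted.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def one_shot_similarity(p, q, A):
    """One-Shot Similarity of descriptors p and q against a negative set A."""
    labels = np.array([1] + [0] * len(A))   # one positive example vs. the negatives

    # Step a: train a classifier separating p (positive) from A (negatives).
    model1 = LinearDiscriminantAnalysis().fit(np.vstack([p, A]), labels)
    # Step b: score q by its signed distance to that decision boundary.
    score1 = model1.decision_function(q.reshape(1, -1))[0]

    # Step c: train a classifier separating q from A.
    model2 = LinearDiscriminantAnalysis().fit(np.vstack([q, A]), labels)
    # Step d: score p with the second classifier.
    score2 = model2.decision_function(p.reshape(1, -1))[0]

    return (score1 + score2) / 2.0          # One-Shot-Sim = (Score1 + Score2) / 2

# Random descriptors stand in for real fragment features.
rng = np.random.default_rng(0)
p, q, A = rng.normal(size=30), rng.normal(size=30), rng.normal(size=(50, 30))
print(one_shot_similarity(p, q, A))
```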

Properties of the One-Shot Similarity, using LDA as the underlying classifier: 1. It is a PD kernel (actually conditionally PD). 2. It can be efficiently computed. 3. It has a half-explicit embedding: k(x_1, x_2) = ψ(x_1)^T φ(x_2).

Adding metric learning. Using LDA as the underlying classifier, the One-Shot Similarity has the closed form

OSS(x_i, x_j, A) = (x_i - μ_A)^T S_W^{-1} (x_j - (x_i + μ_A)/2) + (x_j - μ_A)^T S_W^{-1} (x_i - (x_j + μ_A)/2),

where μ_A is the mean of the negative set A and S_W is the within-class scatter matrix. Instead of the examples x_i we use Tx_i, obtaining T by a gradient descent procedure that optimizes a score of the form

f(T) = Σ_{(i,j) same} OSS(Tx_i, Tx_j, TA) - Σ_{(i,j) not same} OSS(Tx_i, Tx_j, TA).

From similarity to decision: (a) compute a global descriptor for each image; (b) measure descriptor similarity; (c) train a classifier (e.g., SVM) to threshold join from not-join. (Figure: pairs of fragment images with similarity values Sim(.,.) = s_1, s_2, ..., s_i, s_{i+1}, labeled join / not join.)

Multiple similarities/vectors/descriptors. Each training pair, join or not join, is described by a vector of n similarity values (s_{k,1}, s_{k,2}, ..., s_{k,n}); train on these vectors and combine them with an SVM.
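
A sketch of the combination step: each candidate pair is described by its vector of n similarity values and a standard SVM separates join from not-join pairs. The numbers below are toy placeholders (illustrative scikit-learn code, not the original pipeline).

```python
import numpy as np
from sklearn.svm import SVC

# One row per fragment pair, columns = the n similarity values (s_{k,1}, ..., s_{k,n});
# labels: 1 for join, 0 for not-join.
X_train = np.array([[0.91, 0.74, 0.88],
                    [0.12, 0.30, 0.05],
                    [0.85, 0.69, 0.93],
                    [0.20, 0.18, 0.11]])
y_train = np.array([1, 0, 1, 0])

clf = SVC(kernel="rbf").fit(X_train, y_train)

# Score a new candidate pair by its vector of similarities.
candidate = np.array([[0.80, 0.71, 0.79]])
print(clf.predict(candidate), clf.decision_function(candidate))
```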

Combining similarities

True label vs. classifier result (original slide credit: Darlene Goldstein):

                        Prediction = -1          Prediction = +1
No disease (D = -1)     True negative            False positive
Disease    (D = +1)     Miss (false negative)    True positive

Specific example: the distributions of test results for patients without the disease and for patients with the disease overlap. (Figure: two overlapping histograms over the test result. Original slide credit: Darlene Goldstein.)

Threshold: a cutoff on the test result; patients below it are called negative, patients above it are called positive. (Original slide credit: Darlene Goldstein.)

Some definitions (original slide credit: Darlene Goldstein). With the threshold in place: results above the threshold from patients with the disease are true positives; results above the threshold from patients without the disease are false positives; results below the threshold from patients without the disease are true negatives; results below the threshold from patients with the disease are false negatives.

Moving the threshold to the right calls fewer patients positive (fewer false positives, more false negatives); moving it to the left calls more patients positive (more false positives, fewer false negatives).

ROC curve: the true positive rate (0% to 100%) plotted against the false positive rate (0% to 100%), traced out as the threshold is varied. (Original slide credit: Darlene Goldstein.)

ROC curve comparison. A good classifier: the curve rises steeply toward the top-left corner (high true positive rate at a low false positive rate). A poor classifier: the curve stays close to the diagonal.

ROC curve extremes. Best classifier: the two distributions do not overlap at all, and the curve passes through the top-left corner. Worst classifier: the distributions overlap completely, and the curve is the diagonal.
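
A minimal sketch of how an ROC curve is traced out by sweeping a threshold over classifier scores; the toy scores and labels are placeholders (sklearn.metrics.roc_curve would produce the same points).

```python
import numpy as np

def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs over all thresholds."""
    order = np.argsort(-scores)        # sort examples by decreasing score
    labels = labels[order]
    P, N = labels.sum(), (1 - labels).sum()
    tps = np.cumsum(labels)            # true positives above each threshold
    fps = np.cumsum(1 - labels)        # false positives above each threshold
    return fps / N, tps / P

# Toy example: higher score should mean "disease" (positive class).
scores = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0, 0])
fpr, tpr = roc_points(scores, labels)
print(list(zip(fpr.round(2), tpr.round(2))))
```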

Combining similarities

Example of our discoveries: a lost halakhic monograph of Rav Saadya Gaon (10th century) in Judeo-Arabic, with matching fragments held in Geneva and New York.

Sorting face albums. Recall the Genizah: a collection containing ~250,000 fragments of mainly Jewish texts, discovered in the late 19th century; discarded codices, scrolls, and documents, written mainly in the 10th to 15th centuries; by now mostly fragmented pages, spread out in tens of libraries and private collections worldwide.

Sorting face albums: large scale; unknown number of clusters; some clusters are huge; many small clusters and singletons; pairwise similarities are noisy and misleading. Conventional algorithms would fail. Commercial solutions?

Clustering via graphical models. The variables to be inferred are: h_i, the face attributes, and l_ij, the grouping variables (0/1). (Figure: a graph whose nodes h_i, h_j, h_k are connected through the pairwise variables l_ij, l_ik, l_jk.)

Clustering via probabilistic graphical models. The variables to be inferred are: h_i, the face attributes, and l_ij, the grouping variables (0/1). The factors: ξ_i(h_i), attribute distributions; ψ_ij(h_i, h_j, l_ij), pairwise compatibility in attributes; γ_ij(l_ij), visual similarity; φ_ij(l_ij), non-visual similarity; χ_ijk(l_ij, l_ik, l_jk), transitivity constraints. (Figure: the factor graph connecting the ξ_i, h_i, ψ_ij factors with the l_ij nodes and their γ, φ, χ factors.)
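
As a small illustration, one possible form of the transitivity factor χ_ijk is a hard constraint that forbids exactly two of the three grouping variables being "same" (e.g., i grouped with j and with k, but j not grouped with k). This is a sketch of that choice, not the exact potential used in the system.

```python
def transitivity_factor(l_ij, l_ik, l_jk):
    """chi_ijk(l_ij, l_ik, l_jk): 1.0 for consistent label triples, 0.0 otherwise."""
    same_pairs = l_ij + l_ik + l_jk
    # Exactly two "same" decisions among the three pairs violate transitivity;
    # zero, one, or three "same" decisions are consistent.
    return 0.0 if same_pairs == 2 else 1.0
```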

word2vec (Natural Language Processing, NLP): learned from a large corpus; employs a neural network for learning vector representations of words.

Logistic regression. Logistic regression learns a parameter vector θ. On input x, it outputs h_θ(x) = 1 / (1 + exp(-θ^T x)). (Figure: a single unit with inputs x_1, x_2, x_3 and a +1 bias term. Slide credit: Andrew Ng.)
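
The output above in one line of NumPy; θ and x are toy values, with the bias handled by appending a +1 input as in the figure.

```python
import numpy as np

def logistic_output(theta, x):
    # h_theta(x) = 1 / (1 + exp(-theta^T x)): probability of the positive class.
    return 1.0 / (1.0 + np.exp(-theta @ x))

theta = np.array([0.5, -1.2, 2.0, 0.1])   # last weight multiplies the +1 bias input
x = np.array([1.0, 0.3, 0.8, 1.0])        # inputs x_1, x_2, x_3 and the bias term +1
print(logistic_output(theta, x))
```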

Neural network example: a 4-layer network with 2 output units (inputs x_1, x_2, x_3 plus +1 bias units; Layers 1 through 4). Trained via gradient descent using the backpropagation algorithm (susceptible to local optima). (Slide credit: Andrew Ng.)
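
A compact sketch of such a feed-forward network and a single gradient-descent/backpropagation step (NumPy, sigmoid units, squared-error loss). The hidden-layer sizes are illustrative; only the 3 inputs, 4 layers, and 2 outputs follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 3, 2]                                  # 4 layers: 3 inputs, two hidden, 2 outputs
W = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    activations = [x]
    for Wl, bl in zip(W, b):
        activations.append(sigmoid(Wl @ activations[-1] + bl))
    return activations                                # one activation vector per layer

def backprop_step(x, y, lr=0.1):
    a = forward(x)
    delta = (a[-1] - y) * a[-1] * (1 - a[-1])         # output error: squared loss, sigmoid units
    for l in reversed(range(len(W))):
        grad_W, grad_b = np.outer(delta, a[l]), delta
        if l > 0:
            # Propagate the error backwards using the pre-update weights.
            delta = (W[l].T @ delta) * a[l] * (1 - a[l])
        W[l] -= lr * grad_W                           # gradient-descent update
        b[l] -= lr * grad_b
    return a[-1]                                      # network output before the update

print(backprop_step(np.array([0.2, 0.7, 0.1]), np.array([1.0, 0.0])))
```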

CBOW architecture. Example corpus: "In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. And God said, Let there be light: and there was light. And God saw the light, that it was good: and God divided the light from the darkness." (Figure: an input layer of context words, a hidden layer, and an output layer predicting the Huffman code of "Spirit".)

word2vec variants. Continuous Bag-of-Words (CBOW) architecture: predicts the current word given the context. Skip-gram architecture: predicts the surrounding words given the current word.
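
For concreteness, a sketch of training both variants with the gensim library (assuming gensim >= 4.0 is installed; the toy corpus and hyperparameters are placeholders).

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
corpus = [["in", "the", "beginning", "god", "created", "the", "heaven", "and", "the", "earth"],
          ["and", "god", "said", "let", "there", "be", "light", "and", "there", "was", "light"]]

# CBOW (sg=0): predict the current word from its context; hs=1 uses the
# hierarchical softmax (Huffman tree) mentioned on the CBOW slide.
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0, hs=1)

# Skip-gram (sg=1): predict the surrounding words from the current word.
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["light"].shape, skipgram.wv.most_similar("light", topn=3))
```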

word2vec vectors are semantic and additive, and surprising regularities emerge: King - Man + Woman ≈ Queen; France + Capital ≈ Paris.
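
These regularities can be checked directly with vector arithmetic; a sketch assuming the pretrained Google News vectors available through gensim's downloader (a large download on first use), since a toy corpus is far too small to show them.

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # any large pretrained word2vec model works

# King - Man + Woman ~ Queen, via the built-in analogy query (cosine similarity).
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# France + Capital ~ Paris, by explicit vector addition and a nearest-neighbour search.
v = wv["France"] + wv["capital"]
print(wv.similar_by_vector(v, topn=3))
```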

Sentiment Analysis. Example reviews: "unbelievably disappointing"; "Full of zany characters and richly applied satire, and some great plot twists"; "This is the greatest screwball comedy ever filmed"; "It was pathetic. The worst part about it was the boxing scenes." (Original slides by Yackov Lubarsky.)

Why hard? Subtlety. A perfume review in Perfumes: The Guide: "If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut." Dorothy Parker on Katharine Hepburn: "She runs the gamut of emotions from A to B."

Why hard? Thwarted expectations and ordering effects: "This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up." "Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is not so good either, I was surprised."

Standard approaches: bag-of-words features and semantic vector spaces. Example reviews: "While enjoyable in spots, 'The Dark World' is haphazard and ultimately unsatisfying." "He feels for his subject matter, no doubt, but every once in a while it feels like he falls short on imagination and takes a short cut to move his story forward." Standard approaches ignore phrase composition.
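
A sketch of the standard bag-of-words baseline in scikit-learn; the four reviews are taken from these slides, and the labels are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["While enjoyable in spots, 'The Dark World' is haphazard and ultimately unsatisfying.",
           "Full of zany characters and richly applied satire, and some great plot twists",
           "This is the greatest screwball comedy ever filmed",
           "It was pathetic. The worst part about it was the boxing scenes."]
labels = [0, 1, 1, 0]                     # 1 = positive, 0 = negative (illustrative)

# Bag-of-words counts ignore word order and phrase composition entirely.
clf = make_pipeline(CountVectorizer(), LogisticRegression()).fit(reviews, labels)
print(clf.predict(["unbelievably disappointing"]))
```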

Compositionality Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)

Stanford Sentiment Treebank: 10,662 sentences from rottentomatoes.com movie reviews; parsed using the Stanford Parser to create parse trees; 215,154 unique phrases; sentiment classified using Amazon Mechanical Turk, with 3 judges per phrase.

Treebank sentiment values: short n-grams are mostly neutral; extreme values or the mid-tick are rarely used; 5-class classification is enough.

Recursive neural models: compositional vector representations for phrases of variable length. Each word/phrase is represented by a vector; these vectors are used as features for calculating the combined sentiment.

Recursive Neural Network. Word/phrase vectors a, b, c, p_i ∈ R^d. Word embedding matrix L ∈ R^{d×|V|}. Classification: y^a = softmax(W_s a), with W_s ∈ R^{C×d}. L and W_s are trained parameters. What about the p_i?

Recursive Neural Network composition: p_1 = f(W [b; c]), p_2 = f(W [a; p_1]). W ∈ R^{d×2d} is a trained parameter; f is usually tanh.
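
A minimal NumPy sketch of this recursive composition together with the per-node sentiment softmax from the previous slide; the dimensions, random initialization, and the tiny parse tree (a (b c)) are toy placeholders, not the trained model.

```python
import numpy as np

d, C, V = 8, 5, 100                          # vector size, sentiment classes, vocabulary size
rng = np.random.default_rng(0)
L  = rng.normal(0, 0.1, (d, V))              # word-embedding matrix, one column per word
W  = rng.normal(0, 0.1, (d, 2 * d))          # composition matrix
Ws = rng.normal(0, 0.1, (C, d))              # sentiment classifier

def compose(left, right):
    # p = f(W [left; right]) with f = tanh
    return np.tanh(W @ np.concatenate([left, right]))

def sentiment(node_vec):
    # y = softmax(W_s node_vec): a distribution over the C sentiment classes
    z = Ws @ node_vec
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy parse tree (a (b c)); the indices stand in for real vocabulary entries.
a, b, c = L[:, 3], L[:, 17], L[:, 42]
p1 = compose(b, c)
p2 = compose(a, p1)
print(sentiment(p1).round(3), sentiment(p2).round(3))
```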

MV-RNN: Matrix-Vector RNN. The composition function depends on the words/phrases being combined: p_1 = f(W [Cb; Bc]), P_1 = f(W_M [B; C]), with W ∈ R^{d×2d} and W_M ∈ R^{d×2d}. W, W_M, and the per-word vectors and matrices (a, A), (b, B), (c, C) are trained, so the number of parameters depends on |V|.
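
A sketch of the MV-RNN composition step, where each constituent carries both a vector and a matrix; dimensions are toy placeholders, and the nonlinearity on the matrix part follows the slide (the original paper leaves that part linear).

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
W  = rng.normal(0, 0.1, (d, 2 * d))          # composes the matrix-transformed vectors
WM = rng.normal(0, 0.1, (d, 2 * d))          # composes the two matrices into a new one

def mv_compose(b, B, c, C):
    """Combine the (vector, matrix) pairs (b, B) and (c, C) into a parent (p, P)."""
    p = np.tanh(W @ np.concatenate([C @ b, B @ c]))   # p_1 = f(W [Cb; Bc])
    P = np.tanh(WM @ np.vstack([B, C]))               # P_1 = f(W_M [B; C])
    return p, P

b, c = rng.normal(size=d), rng.normal(size=d)
B, C = rng.normal(size=(d, d)), rng.normal(size=(d, d))
p, P = mv_compose(b, B, c, C)
print(p.shape, P.shape)                               # (d,), (d, d)
```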

Sentiment prediction: accuracy for fine-grained (5-class) and binary predictions, at the sentence level (root) and over all nodes.

Most Positive/Negative