Latent Semantic Analysis Using Google N-Grams

Statistics 991: Computational Linguistics, Final Project
Adam Kapelner
December 9, 2010

Abstract

We replicate the analysis found in Landauer and Dumais (1997), where Grolier encyclopedia articles are used to train a program to answer closest-synonym questions using their Latent Semantic Analysis (LSA) algorithm. Our analysis instead uses 4-gram data generated from an Internet corpus. In short, we validate LSA's dimension-reduction optimality property, but a full comparison was not possible due to computational constraints.

1 Introduction

Plato noted 2,400 years ago that people seemingly have more knowledge than appears to be present in the information they are exposed to. His solution was that we are born with the knowledge innately and just need a hint or intimation to unlock it from within.

A canonical problem in learning theory is how we acquire vocabulary: how exactly do we learn the thousands of words we know? According to recent research, the estimated rate of acquisition is astonishing: nearly 7-15 words per day for children. More shockingly, Chomsky (1991) and others showed that adult language is insufficient to learn both grammar and core competency vocabulary. There must be some mechanism inside the human mind that makes efficient use of the incoming information, which is paltry, and that explains our observed language learning.

2 Latent Semantic Analysis

2.1 Background and Theory

To answer Plato's problem, Landauer and Dumais (1997) posit a less numinous theory of learning called Latent Semantic Analysis (LSA). The learning is latent because it lies unobserved and mostly unconscious; it is semantic because it is based on learning words; and there is an analysis because an internal computation is made.

The idea is simple. People encounter knowledge in the form of written contexts, conversations, movies, etc. Each of these experiences we'll denote as a context. After experiencing many contexts, over time, the brain begins to build knowledge intuitively. The idea of

LSA is that contexts are interrelated and knowledge is built by the brain exploiting these interrelations. Furthermore, there exists a mechanism that shrinks the dimensionality of the space of all knowledge. Obviously, two articles about computer hardware may be shrunk into one "computer hardware" space. However, you can imagine a section of a psychology textbook concerning artificial intelligence also getting sucked into the computer hardware space due to some degree of interrelation. The brain's ability to shrink the space of experiences (dimension optimization) can greatly amplify learning ability. If artificial intelligence and computer hardware were treated as independent facts, they would never be associated.

It is worth noting that this idea is quite iconoclastic. There exists a great body of literature in psychology, linguistics, etc. that assumes human knowledge rests upon special foundational knowledge (knowing grammar structures, for instance) rather than a simple, general principle.

How do we simulate knowledge-learning from experiencing contexts? To adduce evidence for the validity of their theory, the following experiment was run.

2.2 The LSA Experiment and Results

Landauer and Dumais (1997) used 30,473 articles from Grolier (1980), a popular contemporary academic encyclopedia in the United States. To simulate contexts, they took the first 2,000 characters of each article (or the whole text, whichever was shorter), which averaged approximately 150 words, about the size of a large paragraph. They then removed all punctuation and converted the words to lowercase.

To simulate knowledge induction, they had to define what "knowledge" meant exactly in this example. They chose the canonical problem of learning vocabulary. The words that appeared more than once in the encyclopedia, of which there were 60,768, became the granules of knowledge. The test of their method would be an assessment of LSA's ability to learn words.

For each context, the word totals were tabulated (words that appeared in only one context were discarded). Therefore, for each context j we can generate a vector x_j ∈ N^{60,768}. For ease of analysis, the counts in each entry were transformed via:

$$x_{ij} = \frac{\ln(1 + x_{ij})}{\mathrm{entropy}(\boldsymbol{x}_i)} = \frac{\ln(1 + x_{ij})}{-\sum_{l=1}^{p} P(x_{il}) \log_2 P(x_{il})} = \frac{\ln(1 + x_{ij})}{-\sum_{l=1}^{p} \frac{x_{il}}{\sum_{k=1}^{p} x_{ik}} \log_2 \frac{x_{il}}{\sum_{k=1}^{p} x_{ik}}} \qquad (1)$$

To represent all the data, we concatenate the columns together and define the data matrix as X = [x_1 ... x_p], where the n = 60,768 rows represent unique words and the p = 30,473 columns are the contexts. To find the salient dimensions of the data, the authors used singular value decomposition (SVD); see appendix A.1 for a review of the mathematics. They then shrink the space of experience following the method of least squares approximation (see appendix A.2), which entails picking the first d ≪ p most salient dimensions. Denote the approximated matrix X̂^(d).
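To make the transform in equation (1) concrete, the following is a minimal pure-Ruby sketch of the log-entropy weighting applied row by row to a toy word-by-context count matrix; the variable names and counts are illustrative and are not part of the project's code.

# Toy word-by-context count matrix: one row per word, one column per context.
counts = [
  [2, 0, 1, 0],   # a word concentrated in a few contexts (low entropy)
  [1, 1, 1, 1]    # a word spread evenly across contexts (high entropy)
]

weighted = counts.map do |row|
  total = row.inject(0.0) { |s, c| s + c }
  # entropy of the word's distribution over contexts: -sum P * log2(P)
  entropy = -row.inject(0.0) do |s, c|
    c.zero? ? s : s + (c / total) * Math.log(c / total) / Math.log(2)
  end
  # divide ln(1 + count) by the word's entropy; zero-entropy rows are zeroed out
  row.map { |c| entropy.zero? ? 0.0 : Math.log(1 + c) / entropy }
end

weighted.each { |row| p row }

Note that a word appearing in only one context has zero entropy, so its row is zeroed out rather than divided by zero.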

Landauer and Dumais (1997) then gauged the theory's ability to actually learn words. They used 80 synonym questions from a retired TOEFL examination. The format is a problem word and four alternative choices. In the example below, "zenith" is the problem word and the correct answer is (b) "pinnacle".

5. zenith
a. completion
b. pinnacle
c. outset
d. decline

How would the d-dimensional model evaluate the first alternative? It would somehow have to compute the similarity between the row vector for "zenith", denoted x̂^(d)_T (T for target), and the row vector for "completion", denoted x̂^(d)_a because it is the first choice. There are many ways to compare vectors, but they wanted to ensure that there were no effects due to the raw frequencies of the words (i.e., their magnitude). This is akin to representing the topic rather than how much is said about the topic. In information retrieval applications (as we have seen in class), the cosine measure works well, and that was their choice. The best guess (g*) for a synonym question is the alternative with the smallest angle to the target vector, i.e., the largest cosine, which can be denoted formally:

$$g^* = \operatorname*{arg\,min}_{g \in \{a,b,c,d\}} \; \theta_{\hat{\boldsymbol{x}}^{(d)}_T,\, \hat{\boldsymbol{x}}^{(d)}_g} = \operatorname*{arg\,max}_{g \in \{a,b,c,d\}} \; \cos\!\left(\theta_{\hat{\boldsymbol{x}}^{(d)}_T,\, \hat{\boldsymbol{x}}^{(d)}_g}\right) = \operatorname*{arg\,max}_{g \in \{a,b,c,d\}} \; \frac{\left\langle \hat{\boldsymbol{x}}^{(d)}_T, \hat{\boldsymbol{x}}^{(d)}_g \right\rangle}{\left\| \hat{\boldsymbol{x}}^{(d)}_T \right\| \left\| \hat{\boldsymbol{x}}^{(d)}_g \right\|} \qquad (2)$$
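A minimal Ruby sketch of the guess rule in equation (2); the three-dimensional vectors and labels below are made up for illustration, whereas in the real computation the vectors are the reduced-dimension rows of X̂^(d).

# Cosine between two vectors given as plain arrays.
def cosine(u, v)
  dot    = u.zip(v).inject(0.0) { |s, (a, b)| s + a * b }
  norm_u = Math.sqrt(u.inject(0.0) { |s, a| s + a * a })
  norm_v = Math.sqrt(v.inject(0.0) { |s, a| s + a * a })
  dot / (norm_u * norm_v)
end

# Hypothetical reduced-dimension row vectors for the target and the alternatives.
target = [0.9, 0.1, 0.3]
alternatives = {
  'a. completion' => [0.2, 0.8, 0.1],
  'b. pinnacle'   => [0.8, 0.2, 0.3],
  'c. outset'     => [0.1, 0.9, 0.2],
  'd. decline'    => [0.3, 0.1, 0.9]
}

# Equation (2): pick the alternative with the largest cosine to the target.
best = alternatives.max_by { |_, vec| cosine(target, vec) }
p best.first   # => "b. pinnacle" for these made-up vectors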

How well did their algorithm do? See Figure 1.

Figure 1: Performance on the 80-question synonym test as a function of d, the number of dimensions used to approximate X, log scale (excerpted from page 220 of Landauer and Dumais (1997)).

There are two extraordinary results: (a) their model achieved a stellar 51.5% correct after correcting for chance guessing (see appendix A.3), which would garner the computer running the LSA algorithm admission to a United States university; (b) the model's performance features a maximum at approximately d = 300 dimensions.

The second result has implications for learning theory. The authors demonstrated that as the space of contexts is squished, the performance on the synonym test increases until a certain threshold and then begins to decrease. This suggests that too many dimensions can confuse the algorithm, disallowing fruitful interrelationships from gelling; however, too few dimensions are too squished, and knowledge begins to blend together. What is amazing is that the maximum at approximately 300 dimensions allowed for triple the performance compared with the full model with p dimensions.

2.3 Other LSA Analyses and Thoughts

To further validate that LSA models human knowledge induction, the authors measured how quickly the algorithm could learn words. In short, they found that the model's improvement, per paragraph of new encyclopedia contexts, is comparable to estimates of the rate at which children acquire vocabulary.

Almost as astounding as the model's overall success is how parsimonious its structure is. The model treats the contextual examples as mere unordered bags-of-words without structure or punctuation. The model treats each distinct character string as a separate word (so obviously related words such as "color" and "colors" are considered different). The model cannot take advantage of morphology, syntax, similarity in spelling, or any of the myriad other properties humans are privy to.

The authors conclude that LSA partially answers Plato's ancient question of knowledge induction. Beyond vocabulary acquisition, it may be extendable to all forms of knowledge representation. [1]

[1] This study has been cited about 2,500 times since its publication in 1997. A further literature search was not done to see whether the model has been improved upon; however, the paper does appear to be very popular in the machine learning literature.

3 LSA Duplication Using Google 4-Grams

In short, my study attempts to repeat the LSA experiment found in section 2.2, replacing the contexts from the truncated Grolier encyclopedia entries with 4-grams from the Google web corpus (Google, 2006).

3.1 Experimental Setup

We proceed with minor alterations from the procedures found in Landauer and Dumais (1997). The code can be found in appendix B. We first load the first 100,000 unigrams and the 80 TOEFL questions. We refer to "TOEFL words" for both the targets and the alternatives. Immediately, 12 questions were disqualified since some of their words were not even present in the unigram listing.

Also, due to computational restrictions, we could not load all of the 4-grams (not even close). Therefore, we picked a maximum number p = 2000, chosen to allow expedient computation. However, since the TOEFL words are rare, we could not naively grab the first p 4-grams; we had to choose them wisely. A happy medium was to ensure that 10% of the 4-grams contained at least one TOEFL word. This still posed problems: some TOEFL words are much more common than others, so the 10% would fill up with 4-grams containing words like "large" and "make". Therefore, the number of 4-grams allowed in per TOEFL word was capped at 5.

Each 4-gram formed a column. The rows were then taken as the union of the words in all p 4-grams (n = 2173). The column entries were either zero, if the word did not appear in the 4-gram, or the count of how often the 4-gram appeared in the Google web corpus. We then employ the same preprocessing as found in equation 1 to arrive at the data matrix X. [2] We then calculate the SVD and do a dimension reduction to obtain X̂^(d), the least squares estimate.

Then, to assess the effectiveness of the model, we use the same TOEFL questions as the original LSA study. Once again, not all the TOEFL words were present. Therefore, we skipped questions that were missing the target word (the question prompt) and questions that were missing all of the alternative choices. However, for expediency, we left in questions where only a subset of the alternative choices and the target word was represented in X. [3] Unfortunately, this left us with only 39 of the original 80 questions, and not all of them complete with all four answer choices.

We compute our best guesses for the TOEFL synonym test using the cosine measure found in equation 2. We repeat this procedure for all dimensions d ∈ {2, 3, ..., MAX}. The MAX dimension was chosen to be 200 for computational reasons. For each chosen dimension, we calculate the adjusted score using the correction for chance guessing found in appendix A.3. The entire computation took about 45 minutes on a 3 GHz machine. The adjusted score versus the number of dimensions is shown in Figure 2.

[2] Note that rank[X] = 983 < min{n, p}, since most of the 2,173 words were singletons and were therefore ignored in the computation by zeroing out the row completely.
[3] Note that it is possible the correct answer did not appear, but this would not bias any given dimension's performance, which was the goal of this duplication effort.
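The dimension-reduction step described above can be sketched with the Ruby standard library's Matrix class, assuming the SVD factors are already in hand; the toy factors and names below are illustrative, whereas the actual run zeroes trailing singular values with the linalg gem via the reduce_dimensions routine in appendix B.

require 'matrix'

# Given SVD factors of X -- u (n x r), singular values s (length r), vt (r x p) --
# build the rank-d least squares approximation X_hat^(d) = u * W_d * vt,
# where W_d keeps only the first d singular values and zeroes the rest.
def rank_d_approximation(u, s, vt, d)
  truncated = s.each_with_index.map { |sigma, i| i < d ? sigma : 0.0 }
  u * Matrix.diagonal(*truncated) * vt
end

# Toy example with identity factor matrices, so the effect of truncation is visible.
u  = Matrix[[1.0, 0.0], [0.0, 1.0]]
s  = [3.0, 1.0]
vt = Matrix[[1.0, 0.0], [0.0, 1.0]]
p rank_d_approximation(u, s, vt, 1)   # => Matrix[[3.0, 0.0], [0.0, 0.0]]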

Figure 2: Our results. (Plot titled "% correct on TOEFL synonym test by dimension"; the horizontal axis is the number of dimensions d, from 0 to 200, and the vertical axis is the proportion correct, from 0.0 to 0.6.)

3.2 Conclusions and Comparison to Previous Study

Since our implementation suffered from computational difficulties, we were unable to make use of all the synonym questions during assessment, so a direct comparison with the results from Landauer and Dumais (1997) was not possible.

The crowning achievement of this crude duplication is that we see a very clear effect of optimum dimensionality. In Figure 2, we observe a local maximum at around 50-70 dimensions. This agrees with the result of the original LSA study, where a dimension reduction by a large factor yields optimal results. Note the precipitous drop in performance after d = 110. Most likely, if the knowledge space isn't sufficiently condensed and the interrelations are not exploited, four-word contexts offer very, very little information about the meaning of words.

I still do not have any intuition on whether many 4-grams could compete with encyclopedia articles. The more that are included, the better represented the knowledge space becomes. Since we were unable to increase p without computational issues, this could not be investigated.

3.3 Future Directions

This study was limited by computational constraints. It would be nice to use 5-grams, increase the number of vectors p, and be able to assess using all the synonym questions. This would require a more efficient SVD algorithm and upgraded hardware.

Using the 4-grams from the Internet is a convenient way to test a variety of very short contexts. A problem I find to be of interest is optimizing over both the length of the context and the dimensionality of the context. I propose a study where we duplicate Landauer and Dumais (1997) and vary their 2,000-character limit on the encyclopedia articles. Instead of using the Grolier (1980) encyclopedia, we can use the Internet. This would require a more updated Google corpus with n-grams beyond n = 5 or, even better, a raw text dump.

Acknowledgments

Thanks to Professor Dean Foster for mentorship and for giving me this interesting project. Thanks to Bill Oliver from Professor Thomas Landauer's lab at the University of Colorado at Boulder for providing the TOEFL questions from the original study.

References

N. Chomsky. Linguistics and Cognitive Science: Problems and Mysteries. In A. Kasher (ed.), The Chomskyan Turn. Cambridge, MA: Basil Blackwell, 1991.

Google, Inc. Web 1T 5-gram Corpus, Version 1.1, 2006.

Grolier. Grolier's Academic American Encyclopedia. Arte Publishing, 1980.

T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211-240, 1997.

A Technical Appendix

A.1 Singular Value Decomposition

Any real matrix X ∈ R^{n×m} with rank[X] = p ≤ m can be decomposed into three matrices in a process known as singular value decomposition (SVD):

$$X = F_1 W F_2^T, \quad \text{where } F_1 \in \mathbb{R}^{n \times p},\; W \in \mathbb{R}^{p \times p},\; F_2^T \in \mathbb{R}^{p \times m}.$$

The matrices F_1, F_2 are traditionally orthonormal and can be thought of as the dimensions or factors of the matrix X; they can be thought of as principal components, capturing the internal structure of the data. The matrix W is a positive diagonal matrix whose non-zero entries W_jj are called singular values and represent the weight of the corresponding column vectors in the factor matrices F_1, F_2.

Why are there two factor matrices, and why are they not related? In short, because we have a rectangular matrix. If X were square, we would have a standard eigendecomposition X = F W F^{-1}, where all the dimensionality is really captured in one matrix (since the second factor matrix is a computable function of the first). As a side note, the two factor matrices in SVD correspond to eigenvectors of two special spaces of X: F_1 contains the eigenvectors of the matrix X X^T, also called the output basis vectors for X, and F_2 contains the eigenvectors of the matrix X^T X, also called the input basis vectors for X. The entries of W are the square roots of the eigenvalues of both X X^T and X^T X (which explains why all the diagonal entries are positive).

It is also traditional to order the singular values in W, putting the largest in the W_11 position, the second-largest in the W_22 position, and so forth. The corresponding eigenvectors in F_1, F_2 are then analogously reordered. This is really convenient because, upon inspection, it allows the investigator to see the vectors in order of importance and how important they are relative to each other (by examining the weights) without the need to dart all over the matrix.
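The connection to the two eigendecompositions can be made explicit in one line of algebra, using the orthonormality of the factors (F_1^T F_1 = F_2^T F_2 = I) and the fact that W is diagonal:

$$X X^T = (F_1 W F_2^T)(F_2 W F_1^T) = F_1 W^2 F_1^T, \qquad X^T X = (F_2 W F_1^T)(F_1 W F_2^T) = F_2 W^2 F_2^T,$$

so the columns of F_1 are eigenvectors of X X^T, the columns of F_2 are eigenvectors of X^T X, and both share the eigenvalues W_jj^2, i.e., W_jj is the square root of the corresponding eigenvalue.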

A.2 Matrix Approximation via SVD

We can approximate the original matrix X by using a subset of the factors found in the previous section. This is akin to using a subspace of lower dimension to represent a higher-dimensional object. Since this approximation is really a projection of a higher space onto a lower space, i.e., the same mechanism used in least squares regression, we call this approximation X̂, the least squares estimate of X.

How is this done in practice? A subset of the factors of the original space is selected. A convenient computational mechanism is to vanish the weights on the factors you no longer desire. For instance, say we want to estimate X by just the most important vector; we can set W_22 = W_33 = ... = 0 and compute:

$$X = F_1 \begin{bmatrix} W_{11} & 0 & 0 & \cdots \\ 0 & W_{22} & 0 & \cdots \\ \vdots & & \ddots & \end{bmatrix} F_2^T, \qquad \hat{X}^{(1)} = F_1 \begin{bmatrix} W_{11} & 0 & 0 & \cdots \\ 0 & 0 & 0 & \cdots \\ \vdots & & \ddots & \end{bmatrix} F_2^T$$

The least squares estimate is denoted X̂^(1), where the 1 denotes that we are approximating X using only one dimension. This estimate can be accurate if most of the variation in X is captured in that one hyperline; if the variation is more evenly distributed among the other dimensions, the approximation will be very poor.

A.3 Chance Guessing on Tests

Consider a test with 100 questions and 5 answer choices. The taker knew 50 answers, but on the other 50 he guessed. In expectation, he will get 10 of the 50 guesses correct, for a total score of 60. In order to adjust for guessing, we could subtract off 1/4 of the incorrect answers (algebra not shown). Generally speaking, a test with T questions and Q answer choices and a raw score of p percent will have an adjusted score p_adj given by:

$$p_{\mathrm{adj}} = \left(\frac{\#\,\text{correct} - \frac{\#\,\text{incorrect}}{Q-1}}{\#\,\text{total}}\right)_{+} = \left(\frac{pT - \frac{(1-p)T}{Q-1}}{T}\right)_{+} = \left(p - \frac{1-p}{Q-1}\right)_{+}$$

We need the positive part in the above formula since, if the taker guessed all the questions at random, he deserves a zero (not a negative score, which is impossible).
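As a check on the formula, here is a minimal Ruby sketch of the correction (the function name is ours), reproducing the worked example above: 100 questions with 5 choices and a raw score of 60% adjust down to 50%.

# Adjusted score (p - (1 - p)/(Q - 1)), clipped at zero,
# for raw proportion p on a test with Q answer choices.
def adjusted_score(p_raw, num_choices)
  adj = p_raw - (1.0 - p_raw) / (num_choices - 1)
  adj > 0 ? adj : 0.0
end

p adjusted_score(0.60, 5)   # => 0.5    (the 100-question example above)
p adjusted_score(0.25, 5)   # => 0.0625 (barely above the 20% chance rate)

The raw proportions dumped by the script in appendix B (results.txt) can be post-processed with this correction.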

B Ruby Code

We now provide the code used to run the experiment found in section 3. Ruby 1.8.6 (http://www.ruby-lang.org/en/) was used, along with the ruby linear algebra library linalg, which is a wrapper for a native Fortran implementation (http://linalg.rubyforge.org/).

lsa_duplication_experiment.rb

require 'linalg'
include Linalg
require 't_o_e_f_l'
include TOEFL

MAX_FIVEGRAMS = 2000
MAX_TOEFL_INDEX_COUNT = 5
MAX_DIMENSIONS = 200

#######################
### Unigrams
#
#MAX_UNIGRAMS = 10

#We first should load the unigrams
#we want to have the unigrams in the order that they appear since
#this is how they are referenced in the n-grams file
unigrams_ordered = []
#we also would like to have a hash of the words so we can have a
#dictionary to check if a mystery word actually exists; we might
#as well store the index as a value since we get it for free
unigrams_indices = {}

f = File.open("google_top_100000_unigrams", "r")
#read each line, split along the whitespace in the middle and take the left part
l = 0
f.each_line do |line|
  unless l.zero? #we don't care about the header
    unigram = line.match(/"(.*)",/).captures.first
    #print unigram
    #print "\r\n"
    unigrams_ordered << unigram
    unigrams_indices[unigram] = l
    #break if l >= MAX_UNIGRAMS
  end
  l += 1
end
f.close

p "total unigrams: #{unigrams_ordered.length}"


#######################
### TOEFL synonym test related
#

#let's get all the words from the synonym test
toefl_words = []
SynonymQuestions.each_with_index do |q, i|
  next if QuestionsToBeKilled.include?(i + 1)
  toefl_words << q[:target] << q[:alternatives] << q[:correct_answer]
end
toefl_words = toefl_words.flatten.uniq #kill duplicates
p "num toefl questions: #{NumSynonymQuestions}"
p "num toefl words #{toefl_words.length}"

toefl_words_not_found_in_unigrams = []
toefl_words.each do |toefl_word|
  toefl_words_not_found_in_unigrams << toefl_word if unigrams_indices[toefl_word].nil?
end
p "num toefl words not in unigrams: #{toefl_words_not_found_in_unigrams.length}: #{toefl_words_not_found_in_unigrams.join(', ')}"

#now get the indices of the toefl words inside the unigrams
toefl_unigram_indices = toefl_words.map{|word| unigrams_indices[word]}
#p "toefl indices:"
#toefl_unigram_indices.each_with_index do |index, i|
#  print "#{toefl_words[i]} #{index}"
#end
toefl_unigram_indices_count = {}

#######################
### N-Grams
#

five_grams = []
unique_words_in_five_grams = {} #global tabulation of unique words found in the five-grams

#now we should load some 5-grams and keep track of the unique words
f = File.open("C:\\Users\\kapelner\\Desktop\\google_data\\google_10000_4grms", "r")
k = 0
f.each_line do |line|
  k += 1
  #p line
  #keep the entire five-gram; the format is:
  #[word 1 index, word 2 index, word 3 index, word 4 index, word 5 index, frequency]
  five_gram_raw = line.split(/\s/).map{|unigram_ind| unigram_ind.to_i} #ensure integers

  #the first five positions are the indices of the unigrams, so that's the real 5-gram
  five_gram = five_gram_raw[0...5]
  #now if the gram does not contain any of the words in the toefl test, ditch it
  #p "array intersection: #{(toefl_unigram_indices & five_gram)}"
  intersection = toefl_unigram_indices & five_gram
  skip_this_context = intersection.empty?
  intersection.each do |toefl_index|
    toefl_unigram_indices_count[toefl_index] = (toefl_unigram_indices_count[toefl_index].nil? ? 1 : toefl_unigram_indices_count[toefl_index] + 1)
    skip_this_context = true if toefl_unigram_indices_count[toefl_index] > MAX_TOEFL_INDEX_COUNT
  end
  skip_this_context = false if rand < 0.001
  next if skip_this_context

  #now we need to keep track of the words (and their counts) used in all five-grams
  #by tabulating the words in this particular five-gram
  five_gram.each {|unigram_ind| unique_words_in_five_grams[unigram_ind] = unique_words_in_five_grams[unigram_ind].nil? ? 1 : (unique_words_in_five_grams[unigram_ind] + 1)}
  #n = five_grams[l].last.to_i #the last is the frequency
  #p n + " times: " + five_gram.map{|ind| unigrams_ordered[ind.to_i]}.join(' ')
  #finally add it to the array
  five_grams << five_gram_raw
  #p "five grams: #{five_grams.length} k = #{k}" if five_grams.length % 50 == 0
  break if five_grams.length >= MAX_FIVEGRAMS
end
f.close

#kill the words that only appear once
unique_words_in_five_grams.each {|unigram_ind, freq| unique_words_in_five_grams.delete(unigram_ind) if freq.zero?}

#now we need to go back to see how many toefl words were left out of the five-grams
toefl_words_not_found_in_ngrams = {}
toefl_unigram_indices.each do |toefl_unigram_index|
  toefl_words_not_found_in_ngrams[toefl_unigram_index] = true if unique_words_in_five_grams[toefl_unigram_index].nil?
end

p "num toefl words not found in n-grams: #{toefl_words_not_found_in_ngrams.length}"

#now we have to go through and nuke the questions that we cannot attempt
questions_we_can_answer = []
SynonymQuestions.each do |q|
  toefl_words = ([] << q[:target] << q[:alternatives] << q[:correct_answer]).flatten
  keep_question = true
  toefl_words.each do |word|
    index = unigrams_indices[word]
    if index.nil? #or toefl_words_not_found_in_ngrams[index]
      keep_question = false
      break
    end
  end
  questions_we_can_answer << q if keep_question
end
p "num toefl synonym questions we can answer: #{questions_we_can_answer.length}"

exit if questions_we_can_answer.length.zero?

#unique_words.each do |unigram_index, freq|
#  p "#{unigrams_ordered[unigram_index]} #{freq}"
#end

#now we need the unique words in an ordering
unique_words_in_five_grams_ordering = {}
unique_words_in_five_grams.keys.each_with_index {|unigram_ind, i| unique_words_in_five_grams_ordering[unigram_ind] = i}

p "num unique words in five-grams: #{unique_words_in_five_grams.length}"

#######################
### Matrix creation and SVD
#

#now create X, the data matrix
#the number of rows should be n = # of unique words
n = unique_words_in_five_grams.length
#the number of cols should be the number of five-grams
p = five_grams.length

#now create X using the linalg package
xmat = DMatrix.new(n, p)
#now we have to set up the X matrix, and we have to do it manually
#use the convention that i iterates over rows, and j iterates over columns
five_grams.each_with_index do |five_gram_raw, j|
  #p "five-gram to be inserted: #{five_gram_raw.join(', ')} (#{five_gram_raw.map{|ind| unigrams_ordered[ind.to_i]}.join(' ')})"
  freq = five_gram_raw.last #pull out frequency
  five_gram_raw[0...5].each do |unigram_ind|
    i = unique_words_in_five_grams_ordering[unigram_ind]
    #p "insert i=#{i} j=#{j} of dims #{n} x #{p} #{five_gram_raw.join(', ')} (#{five_gram_raw.map{|ind| unigrams_ordered[ind.to_i]}.join(' ')})"
    xmat[i, j] = freq
  end
end
p "matrix created with dims #{n} x #{p}, now preprocessing"
#p xmat

#now we have to preprocess the X
0.upto(xmat.num_rows - 1) do |i|
  #first find the sum of the row (number of words throughout all n-grams)
  sum = 0.to_f
  0.upto(xmat.num_columns - 1) do |j|
    sum += xmat[i, j]
  end
  #now find the probabilities for each n-gram
  probs = Array.new(xmat.num_columns)
  0.upto(xmat.num_columns - 1) do |j|
    probs[j] = xmat[i, j] / sum
  end
  #now calculate entropy (remember if prob = 0, entropy is defined to be 0, consistent with the limit)
  entropy = 0.to_f
  0.upto(xmat.num_columns - 1) do |j|
    entropy += probs[j] * Math.log(probs[j]) unless probs[j].zero?
  end
  #now assign a new value (only if it's non-zero to begin with)
  0.upto(xmat.num_columns - 1) do |j|
    unless xmat[i, j].zero?
      xmat[i, j] = entropy.zero? ? 0 : (1 + xmat[i, j]) / entropy
    end
  end
end

#p xmat

#keep the first d singular values
def reduce_dimensions(d, x)
  0.upto([x.num_rows - 1, x.num_columns - 1].min) do |i|
    x[i, i] = 0 if i >= d
  end
  x
end

#svd analysis
p "done preprocessing, now running matrix SVD"
f_1_mat, wmat, f_2_mat = xmat.singular_value_decomposition
p "done matrix SVD, now beginning dimensional tests"
p "dims: f_1_mat #{f_1_mat.dimensions.join(' x ')} w_mat #{wmat.dimensions.join(' x ')} f_2_mat #{f_2_mat.dimensions.join(' x ')} rank: #{wmat.rank}"

#p wmat
#now check questions

def calculate_cosine(x, i1, i2)
  row1 = x.row(i1)
  row2 = x.row(i2)
  cosine = ((row1 * row2.t) / (row1.norm * row2.norm))[0, 0] #a scalar, but we still need to return the float
  (cosine.infinite? or cosine.nan?) ? -10000 : cosine ####HACK
end

results = Array.new([wmat.rank, MAX_DIMENSIONS].min + 1, 0)
([wmat.rank, MAX_DIMENSIONS].min).downto(2) do |d| #we go down so we don't have to waste time copying the matrix wmat

  wmat_d = reduce_dimensions(d, wmat)
  xmat_hat_d = f_1_mat * wmat_d * f_2_mat

  #now calculate some answers
  questions_answered = 0
  questions_correct = 0
  questions_we_can_answer.each do |q|
    i_target = unique_words_in_five_grams_ordering[unigrams_indices[q[:target]]]
    next if i_target.nil?
    cosines = {}
    q[:alternatives].each do |alt|
      i_alt = unique_words_in_five_grams_ordering[unigrams_indices[alt]]
      next if i_alt.nil?
      cosines[alt] = calculate_cosine(xmat_hat_d, i_target, i_alt)
    end
    next if cosines.empty?
    #p cosines
    guess = cosines.invert[cosines.values.max]
    questions_answered += 1
    questions_correct += 1 if guess == q[:correct_answer]
    #p "guess #{guess} for question with target word #{q[:target]}"
  end
  results[d] = questions_correct / questions_answered.to_f
  p "results for dimension #{d} -- #{questions_correct} / #{questions_answered} = #{'%.2f' % (results[d] * 100)}%"
end

#now dump results... note that they are not yet adjusted for guessing
File.open('results.txt', 'w') do |f|
  results.each {|res| f.puts(res)}
end

In addition, below is an example of what the module containing the TOEFL synonym listings looks like:

247 # p c o s i n e s 248 guess = c o s i n e s. i n v e r t [ c o s i n e s. v a l u e s. max ] 249 questions a n s w ered += 1 250 q u e s t i o n s c o r r e c t +=1 i f guess == q [ : c o r r e c t a n s w e r ] 251 # p guess #{ guess } f o r q u e s t i o n with t a r g e t word #{q [ : t a r g e t ] } } 252 253 end 254 r e s u l t s [ d ] = q u e s t i o n s c o r r e c t / q u e s t i o n s a n s w e r e d. t o f 255 p " results for dimension #{d} -- #{ questions_correct } / #{ questions_answered } = #{ %.2 f % ( results [d] * 100) }%" 256 end 257 258 #now dump r e s u l t s... note that they are not yet adjusted f o r g u e s s i n g 259 F i l e. open ( results. txt, w ) do f 260 r e s u l t s. each { r e s f. puts ( r e s ) } 261 end In addition, below is an example of what the module which contains the TOEFL synonym listings looks like: 1 SynonymQuestions = [ 2 # 1. enormously 3 # a. a p p r o p r i a t e l y 4 # b. uniquely 5 # c. tremendously 6 # d. d e c i d e d l y 7 { 8 : t a r g e t => enormously, 9 : a l t e r n a t i v e s => %w( a p p r o p r i a t e l y uniquely tremendously d e c i d e d l y ), 10 : c o r r e c t a n s w e r => tremendously 11 }, 12 # 2. p r o v i s i o n s 13 # a. s t i p u l a t i o n s 14 # b. i n t e r r e l a t i o n s 15 # c. j u r i s d i c t i o n s 16 # d. i n t e r p r e t a t i o n s 17 { 18 : t a r g e t => provisions, 19 : a l t e r n a t i v e s => %w( s t i p u l a t i o n s i n t e r r e l a t i o n s j u r i s d i c t i o n s i n t e r p r e t a t i o n s ), 20 : c o r r e c t a n s w e r => stipulations 21 }, 22. 23. 24. 15