AUTHORSHIP RECOGNITION USING THE DYNAMICS OF CO-OCCURRENCE NETWORKS. Camilo Akimushkin Valencia


1 AUTHORSHIP RECOGNITION USING THE DYNAMICS OF CO-OCCURRENCE NETWORKS
Camilo Akimushkin Valencia
Instituto de Física de São Carlos, Universidade de São Paulo
11 November 2016
C. Akimushkin (IFSC-USP), SoFiA, 11 November 2016

2 Authorship recognition
Lexical features (frequency): words, n-grams, function words, types of words, discourse-connecting expressions, slang, contractions, dialects, spelling mistakes, proper names, semantic features (polysemy).
Character-level features: character n-grams, frequent suffixes, punctuation.
Text format: lengths of lines, words and phrases; formatting (white space); capitalization; non-alphanumeric characters; beginnings and ends of texts.
Other: syntactic features (n-grams of syntactic functions, kinds of phrases), perplexity, morphological complexity.

3 Complex networks
[Figure: examples of complex networks: natural, artificial, social, other.]

4 Complexity of language
Zipf's law (1939): $f_j = A\,r_j^{-q}$, with $q = 1$
$P(k) \sim k^{-\gamma}$, with $\gamma = 1 + q^{-1}$
References:
Biemann, Quasthoff. Networks generated from natural language text. In: Dynamics on and of Complex Networks.
Dorogovtsev, Mendes. Language as an Evolving Word Web. P. Roy. Soc. Lond. B. Bio
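
The Zipf exponent q on this slide can be estimated from a rank-frequency table. The sketch below is only an illustration and not the author's procedure: it assumes a plain least-squares fit on log-log axes, and the synthetic corpus and the function name `zipf_exponent` are made up for the example.

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Estimate q in f_j = A * r_j**(-q) by least squares on log-log axes."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the fitted slope is -q

# Synthetic corpus (an assumption): the word of rank r occurs about 1000 / r times.
tokens = []
for rank in range(1, 51):
    tokens += [f"w{rank}"] * round(1000 / rank)

q = zipf_exponent(tokens)
gamma = 1 + 1 / q  # exponent relation quoted on the slide
print(f"q = {q:.3f}, gamma = {gamma:.3f}")
```

On real text the fit is usually restricted to a range of ranks, since the head and tail of the distribution deviate from a pure power law.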

5 Word co-occurrence networks: construction
"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness..."
A Tale of Two Cities, Charles Dickens

6 Word co-occurrence networks: construction
[Figure: words kept from the passage: best, times, worst, times, age, wisdom, age, foolishness.]

7 Word co-occurrence networks: construction
[Figure: the words of the passage (best, times, worst, times, age, wisdom, age, foolishness) and the network built from them, with nodes good, bad, time, age, wisdom, foolishness.]
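
The construction shown on the three slides above can be sketched as follows. This is a minimal illustration under stated assumptions: the stopword list is a tiny made-up one, adjacent remaining words are linked by a directed edge, and the lemmatization step suggested by the figure (best to good, worst to bad, times to time) is left out. The function name `cooccurrence_edges` is hypothetical.

```python
import re

# Deliberately tiny stopword list, chosen only for this example (an assumption).
STOPWORDS = {"it", "was", "the", "of", "a", "an", "and", "in", "to"}

def cooccurrence_edges(text, stopwords=STOPWORDS):
    """Return the filtered word sequence and a dict of weighted directed edges."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
    edges = {}
    for a, b in zip(words, words[1:]):
        if a != b:  # skip self-loops
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return words, edges

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness")
words, edges = cooccurrence_edges(text)
print(words)
print(sorted(edges))
```

The edge weight counts how often a pair of words appears in sequence, which keeps the information needed for weighted metrics.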

8 Network metrics

9 Network metrics
1. Clustering: $c_i = \frac{e_i}{k_i (k_i - 1)}$
2. Diameter: $D = \max\{d_{ij}\}$
3. Radius: $r = \min_i \max_j d_{ij}$
4. Cliques: number of complete subgraphs
5. Load centrality: betweenness centrality with loads on edges
6. Transitivity: $T = 3 \times \text{triangles} / \text{triads}$
7. Betweenness centrality: $B_i = \sum_{s \neq i \neq t} g^i_{st} / g_{st}$
8. Shortest path: $l_{ij} = \min\{n : [A^n]_{ij} \neq 0\}$
9. Connectivity: $k_i = [A^2]_{ii}$
10. Intermittency: $I_i = \mathrm{var}(\Delta_i) / \langle \Delta_i \rangle$
11. Number of nodes: $N$
12. Number of edges: $E$
$A$: adjacency matrix; $g_{st}$: number of shortest paths between $s$ and $t$ ($g^i_{st}$: those passing through $i$); $\Delta_i$: distance between two appearances of a word.
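
Several of the metrics above follow directly from powers of the adjacency matrix. The sketch below assumes an undirected, unweighted toy graph (a path of four nodes) and takes the radius as the smallest eccentricity, the usual graph-theoretic definition. It is meant as an illustration of the matrix identities, not as the code used in the talk.

```python
import numpy as np

def shortest_path_lengths(A):
    """l_ij = smallest n with [A^n]_ij > 0 (0 on the diagonal, inf if unreachable)."""
    n = A.shape[0]
    L = np.full((n, n), np.inf)
    np.fill_diagonal(L, 0)
    power = np.eye(n, dtype=int)
    for step in range(1, n):
        power = power @ A                      # power now holds A**step
        newly = (power > 0) & np.isinf(L)      # pairs reached for the first time
        L[newly] = step
    return L

# Toy graph (an assumption): the path 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

degree = np.diag(A @ A)            # connectivity k_i = [A^2]_ii
L = shortest_path_lengths(A)
eccentricity = L.max(axis=1)       # longest shortest path from each node
diameter = eccentricity.max()
radius = eccentricity.min()
print(degree, diameter, radius)
```

Matrix powers are convenient for small graphs; for book-sized networks a breadth-first search per node scales better.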

10 Dynamics of networks for authorship recognition
Authorship of books:
- Few books per author
- Depends on style
- Small networks
- Uneven networks

11 Dynamics of networks for authorship recognition
Authorship of books:
- Few books per author
- Depends on style
- Small networks
- Uneven networks
Dynamics OF the network
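
One way to obtain the dynamics of the network is to cut a book into consecutive pieces, build one network per piece, and record a metric for each. The slides do not state the windowing scheme, so the sketch below assumes non-overlapping windows with a fixed number of tokens and tracks only the number of nodes and edges. The function name `metric_series` is hypothetical.

```python
def metric_series(words, window):
    """Return a list of (nodes, edges) counts, one pair per consecutive window."""
    series = []
    for start in range(0, len(words) - window + 1, window):
        chunk = words[start:start + window]
        nodes = set(chunk)
        # Directed edges between adjacent words, without self-loops.
        edges = {(a, b) for a, b in zip(chunk, chunk[1:]) if a != b}
        series.append((len(nodes), len(edges)))
    return series

words = "best times worst times age wisdom age foolishness".split()
series = metric_series(words, window=4)
print(series)
```

With equal-sized windows, every network in the series has a comparable size, which sidesteps the problem of small and uneven networks listed on the slide.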

12 Time series
[Figure: time series of clustering, radius, load centrality, betweenness centrality, degree, nodes, diameter, cliques, transitivity, shortest path, intermittency and edges for Moby Dick by H. Melville.]

13 Time series
[Figure: frequency distribution of connectivity, one curve per author (Arthur Conan Doyle, Bernard Shaw, Fyodor Dostoyevsky, Herman Melville, Jack London, Jonathan Swift, Leo Tolstoy, Nathaniel Hawthorne), shown next to the time series of clustering, radius, load centrality, betweenness centrality, degree, nodes, diameter, cliques, transitivity, shortest path, intermittency and edges for Moby Dick by H. Melville.]

14 Time series
[Figure: frequency distribution of connectivity, one curve per author (Arthur Conan Doyle, Bernard Shaw, Fyodor Dostoyevsky, Herman Melville, Jack London, Jonathan Swift, Leo Tolstoy, Nathaniel Hawthorne), shown next to the time series of clustering, radius, load centrality, betweenness centrality, degree, nodes, diameter, cliques, transitivity, shortest path, intermittency and edges for Moby Dick by H. Melville.]
Moments of a time series: $\mu_i = \left[ \frac{1}{T} \sum_{j=1}^{T} (x_j - \mu_1)^i \right]^{1/i}$
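
The moments of a metric's time series can be computed as sketched below. Two points are assumptions rather than statements from the slide: µ_1 is taken to be the ordinary mean (the bracketed expression vanishes for i = 1), and for odd i a negative central moment is given a signed real root. The function name `moments` is hypothetical.

```python
import math

def moments(series, orders=(1, 2, 3, 4)):
    """Return {i: mu_i} with mu_1 the mean and mu_i the i-th root of the
    i-th central moment for i > 1."""
    T = len(series)
    mu1 = sum(series) / T
    out = {}
    for i in orders:
        if i == 1:
            out[i] = mu1
            continue
        central = sum((x - mu1) ** i for x in series) / T
        # Signed root so that odd, negative central moments stay real.
        out[i] = math.copysign(abs(central) ** (1 / i), central)
    return out

m = moments([1, 2, 3, 4, 5])
print(m)
```

Four moments for each of the twelve metrics would give 48 numbers per book, which matches the attribute count quoted later in the talk.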

15 Time series: autocorrelation
$r(x, y) = \frac{\sum_{i=1}^{T} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{T} (x_i - \bar{x})^2 \, \sum_{i=1}^{T} (y_i - \bar{y})^2}}$
$\mathrm{ACF}(\tau) = r(x, x_\tau)$ (±5%)
Wiener-Khinchin theorem: $C(\tau) = \int x^*(t)\, x(\tau + t)\, dt = \int |x_\nu|^2 e^{2\pi i \nu \tau}\, d\nu = \mathcal{F}[|x_\nu|^2](\tau)$
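
The Wiener-Khinchin theorem gives a fast route to the autocorrelation through the power spectrum. The sketch below compares a direct sum with an FFT-based computation. It uses the common ACF estimator (global mean, normalized by the lag-0 value), which differs slightly from the Pearson form r(x, x_τ) on the slide, and the test signal is made up for the example.

```python
import numpy as np

def acf_direct(x, max_lag):
    """Autocorrelation by explicit lagged products, normalized to 1 at lag 0."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

def acf_fft(x, max_lag):
    """Same quantity via the inverse transform of the power spectrum."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    spectrum = np.fft.rfft(x, n=2 * n)   # zero-padding avoids circular wrap-around
    power = np.abs(spectrum) ** 2
    corr = np.fft.irfft(power, n=2 * n)[:max_lag + 1]
    return corr / corr[0]

x = np.sin(0.3 * np.arange(200)) + 0.01 * np.arange(200)  # toy series (assumption)
direct = acf_direct(x, 10)
via_fft = acf_fft(x, 10)
print(np.max(np.abs(direct - via_fft)))
```

The zero-padding matters: without it the FFT computes a circular correlation, which mixes the start and end of the series.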

16 Time series: stationarity tests and ARIMA fittings
Auto-regressive model AR(p): $x_t = a_1 x_{t-1} + a_2 x_{t-2} + \dots + a_p x_{t-p} + \varepsilon_t$, $t > p$
Characteristic equation: $1 - a_1 z - a_2 z^2 - \dots - a_p z^p = 0$
Unit root tests: $z = 1$?
Auto-Regressive Integrated Moving Average model ARIMA(p,d,q): $\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d x_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t$
Lag operator: $L x_t = x_{t-1}$
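
For known AR coefficients, the characteristic equation can be solved numerically to see whether a unit root is present. The sketch below is an illustration with made-up coefficients and hypothetical function names. It checks the roots of a given model, whereas the unit root tests named in the talk work on observed data.

```python
import numpy as np

def ar_characteristic_roots(a):
    """Roots of 1 - a_1 z - a_2 z^2 - ... - a_p z^p = 0 for a = [a_1, ..., a_p]."""
    # numpy.roots expects coefficients ordered from the highest power down.
    coeffs = [-c for c in reversed(a)] + [1.0]
    return np.roots(coeffs)

def is_stationary(a):
    """An AR(p) process is stationary when all roots lie outside the unit circle."""
    return bool(np.all(np.abs(ar_characteristic_roots(a)) > 1.0))

print(ar_characteristic_roots([0.5]))   # single root of 1 - 0.5 z
print(is_stationary([0.5, 0.3]))
print(is_stationary([1.0]))             # random walk: root at z = 1
```

A root at z = 1 corresponds to the factor (1 - L) in the ARIMA form, so differencing once (d = 1) removes it.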

17 Time series: stationarity tests
[Figure: results of the Phillips-Perron, KPSS, Dickey-Fuller and McKinnon tests for each network metric (clustering, betweenness centrality, cliques, diameter, intermittency, load centrality, degree, radius, shortest path, edges, nodes, transitivity), marked as p-value > 0.05 or p-value < 0.05.]

18 Time series: ARIMA fittings
[Table: value of d for each network metric (clustering, betweenness centrality, cliques, diameter, intermittency, load centrality, degree, radius, shortest path, edges, nodes, transitivity), with totals; numeric entries not preserved.]
ARIMA(p,0,q), stationary: 73%
ARIMA(p,1,q), first-order integrated: 27%

19 Time series: ARIMA fittings
Table: Series fitted with an ARIMA(p,d,q) model having the largest values of the sum p + d + q. Columns: Book, Measure, Sum, p, d, q (numeric entries not preserved).
Book | Measure
The Poems of Jonathan Swift, D.D., Volume 2 | Load centrality
The Journal to Stella | Clustering
The Iron Heel | Clustering
Typee: A Romance of the South Seas | Edges

20 Data analysis
Dimensionality reduction: feature selection, feature extraction.
Supervised learning: Zero Rule (1/8 = 12.5%), One Rule, Naive Bayes, K-Nearest Neighbors, J48 (tree), Radial Basis Function Networks.
Data: 48 attributes, 80 books, 8 authors.
Precision: $P_A = \frac{TP_A}{TP_A + FP_A}$
Recall: $R_A = \frac{TP_A}{TP_A + FN_A}$
TP: true positives; FP: false positives; FN: false negatives.
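
Per-author precision and recall follow directly from the counts defined above. The labels in this sketch are invented for illustration and are not results from the talk. The function name `precision_recall` is hypothetical.

```python
def precision_recall(true_labels, predicted_labels, author):
    """Return (P_A, R_A) for one author from paired true and predicted labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == author and p == author for t, p in pairs)
    fp = sum(t != author and p == author for t, p in pairs)
    fn = sum(t == author and p != author for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up classifier output for five books (an assumption).
true = ["Doyle", "Doyle", "Swift", "Swift", "London"]
pred = ["Doyle", "Swift", "Swift", "Swift", "Doyle"]

print(precision_recall(true, pred, "Doyle"))
print(precision_recall(true, pred, "Swift"))
print(1 / 8)  # Zero Rule baseline with 8 equally represented authors
```

Any classifier worth reporting should beat the Zero Rule baseline of 12.5%, which simply predicts one author for every book.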

21 Feature selection
[Figure: success score (%) versus variance threshold for J48, K-Nearest Neighbors, Naive Bayes and RBFNetwork. Caption: Success scores using variance threshold.]
[Figure: attributes (C, D, R, Cq, L, T, B, S, K, I, N, E) retained at each variance threshold. Caption: Features using variance threshold.]
[Figure: success score (%) versus number of attributes for J48, K-Nearest Neighbors, Naive Bayes and RBFNetwork. Caption: Success scores using score-based criteria.]
[Figure: attributes (C, D, R, Cq, L, T, B, S, K, I, N, E) selected for J48, KNN, NB and RBFN. Caption: Features using score-based criteria.]
Success scores and combinations of features using feature selection. In the upper figures, maximum values are marked with circles. In the lower figures, a cell is painted black if the corresponding attribute is present in the combination.
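
Variance-threshold selection keeps only the attributes whose values vary enough across books. The sketch below assumes population variance on raw (unnormalized) values, which the slide does not specify, and uses invented numbers and attribute names. The function name `variance_threshold` is hypothetical.

```python
from statistics import pvariance

def variance_threshold(rows, names, threshold):
    """Return the names of the columns whose variance exceeds the threshold."""
    columns = list(zip(*rows))  # transpose: one tuple per attribute
    return [name for name, col in zip(names, columns)
            if pvariance(col) > threshold]

# Three books, three attributes (all values made up for illustration).
names = ["attr_a", "attr_b", "attr_c"]
rows = [
    [0.1, 5.0, 1.0],
    [0.1, 7.0, 1.2],
    [0.1, 9.0, 0.8],
]

selected = variance_threshold(rows, names, threshold=0.1)
print(selected)
```

Sweeping the threshold and re-running each classifier on the surviving attributes would produce curves like those described in the figure.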

22 Feature extraction
[Figure: feature extraction using ISOMAP, with one marker per author: Arthur Conan Doyle, Bernard Shaw, Fyodor Dostoyevsky, Herman Melville, Jack London, Jonathan Swift, Leo Tolstoy, Nathaniel Hawthorne.]

23 Feature extraction
[Figure: scores using feature extraction: original, PCA and Isomap precision and recall for OneR, J48, KNN, NB and RBFN.]

24 Summary of classification success scores
[Table: success scores (%) of J48, KNN, NB and RBFN for each attribute set: original set, variance threshold best, score-based best, {µ_1}, {µ_2, µ_3, µ_4}, PCA, ISOMAP; numeric entries not preserved.]

25 The role of words
The Memoirs of Sherlock Holmes and The Return of Sherlock Holmes: only one different word out of the 20 highest ranked!

26 Dissimilarity matrix

27 Dissimilarity matrix

28 Projection
Success scores > 90%

29 Summary
- Time series are stationary.
- Global sample statistics can be obtained.
- Dynamic measures are author-dependent.
- Weight on edges is relevant.
- Dimensionality reduction enhances classification.
- Books are located on a curved manifold in attribute space.
- A word's role in a network is author-dependent.
- Network metrics must be jointly used for classification.
- Many hidden features of networks.

30 Thank you very much!
