Summarizing Creative Content

1 Summarizing Creative Content
Olivier Toubia, Columbia Business School
Behavioral Insights from Text Conference, January

2 Background and Motivation
- More than 40 million Americans (one third of the employed population) belong to the Creative Class (Florida 2014)
  - Science and engineering, education, arts, entertainment, etc.
  - Primary economic function is to create new ideas/content
- Output often takes the form of creative documents (e.g., academic papers, books, scripts, business models)
- Creative documents usually come with summaries (e.g., abstracts, synopses, executive summaries)
  - Indispensable, given that the average American spends approx. 12 hours a day consuming media (Statista, 2017)

3 Paper Overview
- Objectives:
  - Quantify how humans summarize creative documents
  - Computer-assisted writing of summaries of creative documents
- Natural Language Processing model
  - Inspired by creativity literature, based on Poisson Factorization
  - Capture both "inside the cone" (based on common topics) and "outside the cone" (residual) content in documents
  - Capture writing norms that govern the summarization process
- Empirical applications
  - Marketing academic papers and their abstracts
  - Movie scripts and their synopses
- Online interactive tool (publicly available)

4 Outline
- Relevant Literatures
- Model
- Empirical Applications
- Practical Application

5 Relevant literatures
- Creativity
  - Creativity lies in a balance between novelty and familiarity (Giora 2003; Uzzi et al., 2013; Toubia and Netzer, 2017)
  - Summaries should capture both the familiar and novel aspects of the creative document, possibly with different weights
  - Novelty and familiarity should be measured by combinations of words rather than individual words (Mednick 1962; Finke, Ward and Smith 1992; Toubia and Netzer 2017)
- Text Summarization (e.g., Radev et al. 2002; Nenkova and McKeown, 2012)
  - Focused primarily on automatic text summarization
  - This project: focus on how humans summarize creative documents, and use computers to assist humans

6 Relevant literatures
- Poisson Factorization (e.g., Gopalan, Hofman and Blei, 2013; Gopalan, Charlin and Blei, 2014)
  - Topic model
  - Offset variables (e.g., explain choices of articles by academics)
  - This project:
    - Leverage offset variables to capture changes in topic weights in documents vs. summaries
    - Introduce residual topics that capture "outside the cone" content

7 Traditional Content-Based Poisson Factorization (Gopalan, Charlin and Blei, 2014)
- Document $d$ (e.g., academic article, movie script, book, pitch, product description)
- Words $v = 1, \dots, V$
- Document $d$ has $w_{dv}$ occurrences of word $v$
- $K$ topics
- Each topic $k$ has weight $\beta_{kv}$ on word $v$
- Document $d$ has topic intensity $\theta_{dk}$ on topic $k$

8 Traditional Content-Based Poisson Factorization: Data Generating Process
1. For each topic $k = 1, \dots, K$:
   - For each word $v$, draw $\beta_{kv} \sim \mathrm{Gamma}(a, b)$
2. For each document $d = 1, \dots, D$:
   - For each topic $k$, draw topic intensity $\theta_{dk} \sim \mathrm{Gamma}(c, d)$
   - For each word $v$, draw word count $w_{dv} \sim \mathrm{Poisson}(\sum_k \theta_{dk} \beta_{kv})$
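
The two-step generating process above can be simulated in a few lines. This is only a sketch: it assumes the Gamma distributions are parameterized by shape and rate, and the hyperparameter values and the function name `simulate_poisson_factorization` are hypothetical, chosen for illustration.

```python
import numpy as np

def simulate_poisson_factorization(D, V, K, a=0.3, b=0.3, c=0.3, d=0.3, seed=0):
    """Draw topics, topic intensities, and word counts (hyperparameters are illustrative)."""
    rng = np.random.default_rng(seed)
    # NumPy's gamma takes (shape, scale); assuming Gamma(shape, rate), scale = 1 / rate.
    beta = rng.gamma(a, 1.0 / b, size=(K, V))    # step 1: topic-word weights beta_kv
    theta = rng.gamma(c, 1.0 / d, size=(D, K))   # step 2: document-topic intensities theta_dk
    rates = theta @ beta                         # Poisson rate sum_k theta_dk * beta_kv
    w = rng.poisson(rates)                       # word counts w_dv
    return beta, theta, w

beta, theta, w = simulate_poisson_factorization(D=5, V=8, K=3)
```

Small shape values such as 0.3 tend to produce sparse draws, which is in line with the later remark that the Gamma prior induces sparsity.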

10 Geometric Interpretation of Poisson Factorization
- Traditional Poisson Factorization approximates the frequency of words in document $d$ as a weighted average of topics:
  - $w_d \sim \mathrm{Poisson}(\sum_k \theta_{dk} \beta_k)$, so $E(w_d) = \sum_k \theta_{dk} \beta_k$
- $E(w_d)$ is a point in the cone defined by the topics $\{\beta_k\}$, in the Euclidean space defined by the words in the vocabulary
- Observed word frequency $w_d$ is: $E(w_d)$ (projection on the cone, "inside the cone") + residual ("outside the cone")
- Residual ("outside the cone") should help explain content in summary
  - May reflect some novel aspects of the document

11 Geometric Interpretation of Poisson Factorization
[Figure: three words, three topics]
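
To make the inside/outside split concrete, the sketch below projects a word-count vector onto the cone spanned by a set of topics. It is an illustration only: it uses a Euclidean projection (nonnegative least squares solved by projected gradient), whereas the model in these slides fits a Poisson likelihood with a nonnegative residual topic. The function name `project_on_cone` and the toy numbers are hypothetical, and two topics are used instead of three so that part of the vector clearly falls outside the cone.

```python
import numpy as np

def project_on_cone(topics, w, n_iter=5000):
    """Minimize ||w - theta @ topics||^2 subject to theta >= 0 (projected gradient)."""
    gram = topics @ topics.T
    step = 1.0 / np.linalg.eigvalsh(gram).max()   # 1 / Lipschitz constant of the gradient
    theta = np.zeros(topics.shape[0])
    for _ in range(n_iter):
        grad = gram @ theta - topics @ w
        theta = np.maximum(theta - step * grad, 0.0)  # gradient step, then clip to theta >= 0
    return theta

topics = np.array([[1.0, 0.0, 0.0],    # topic 1 puts all weight on word 1
                   [0.0, 1.0, 0.0]])   # topic 2 puts all weight on word 2
w = np.array([3.0, 2.0, 1.0])          # observed counts over three words

theta = project_on_cone(topics, w)
inside = theta @ topics                # "inside the cone" part: [3, 2, 0]
outside = w - inside                   # "outside the cone" residual: [0, 0, 1]
```

Here the third word cannot be explained by either topic, so its single occurrence ends up entirely in the residual.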

12 Proposed Model: Regular vs. Residual Topics
- $K$ regular topics
  - Similar to traditional topics in Content-Based Poisson Factorization and other topic models
  - Each regular topic $k$ has weight $\beta^{reg}_{kv}$ on word $v$
  - Document $d$ has topic intensity $\theta^{reg}_{dk}$ on regular topic $k$
- $D$ residual topics
  - One per document
  - Each residual topic $d$ has weight $\beta^{res}_{dv}$ on word $v$

13 Proposed Model: Offset Parameters
- Capture writing norms that govern the appearance of topics in summaries vs. full documents (e.g., abstracts of marketing academic papers rarely mention limitations)
- Each regular topic $k$ has an offset parameter $\epsilon^{reg}_k$ that captures the relation between occurrences of this topic in full documents vs. summaries
- Residual topics have their own offset parameter, common across residual topics: $\epsilon^{res}$

18 Proposed Model: Data Generating Process
1. For each regular topic $k = 1, \dots, K$:
   - For each word $v$, draw $\beta^{reg}_{kv} \sim \mathrm{Gamma}(a, b)$
   - Draw offset parameter $\epsilon^{reg}_k \sim \mathrm{Gamma}(g, h)$
2. For each residual topic $d = 1, \dots, D$:
   - For each word $v$, draw $\beta^{res}_{dv} \sim \mathrm{Gamma}(a, b)$
3. Draw (single) offset parameter for residual topics $\epsilon^{res} \sim \mathrm{Gamma}(g, h)$
4. For each document $d = 1, \dots, D$:
   - For each regular topic $k$, draw topic intensity $\theta^{reg}_{dk} \sim \mathrm{Gamma}(c, d)$
   - For each word $v$, draw word count $w_{dv} \sim \mathrm{Poisson}(\sum_k \theta^{reg}_{dk} \beta^{reg}_{kv} + \beta^{res}_{dv})$
5. For each document summary $d = 1, \dots, D$:
   - For each word $v$, draw word count $w^{summary}_{dv} \sim \mathrm{Poisson}(\sum_k \theta^{reg}_{dk} \beta^{reg}_{kv} \epsilon^{reg}_k + \beta^{res}_{dv} \epsilon^{res})$
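
The five steps above can be simulated directly, which helps to see how the offsets rescale each topic between a document and its summary. This is a sketch under assumptions: Gamma is taken to be parameterized by shape and rate, and the hyperparameter values and the name `simulate_proposed_model` are hypothetical.

```python
import numpy as np

def simulate_proposed_model(D, V, K, a=0.3, b=0.3, c=0.3, d=0.3, g=1.0, h=1.0, seed=0):
    """Simulate documents and summaries (illustrative hyperparameters, shape/rate Gamma)."""
    rng = np.random.default_rng(seed)

    def gamma(shape, rate, size=None):
        return rng.gamma(shape, 1.0 / rate, size)   # NumPy expects scale = 1 / rate

    beta_reg = gamma(a, b, (K, V))    # step 1: regular topics
    eps_reg = gamma(g, h, K)          # step 1: one offset per regular topic
    beta_res = gamma(a, b, (D, V))    # step 2: one residual topic per document
    eps_res = gamma(g, h)             # step 3: single offset for residual topics
    theta_reg = gamma(c, d, (D, K))   # step 4: topic intensities

    doc_rate = theta_reg @ beta_reg + beta_res
    sum_rate = (theta_reg * eps_reg) @ beta_reg + beta_res * eps_res
    w_doc = rng.poisson(doc_rate)     # step 4: word counts in full documents
    w_sum = rng.poisson(sum_rate)     # step 5: word counts in summaries
    return {"w_doc": w_doc, "w_sum": w_sum, "doc_rate": doc_rate, "sum_rate": sum_rate}

sim = simulate_proposed_model(D=4, V=10, K=3)
```

An offset below 1 shrinks a topic's contribution to the summary relative to the document, and an offset above 1 amplifies it.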

19 Inference: Auxiliary Variables (Gopalan, Hofman and Blei, 2013; Gopalan, Charlin and Blei, 2014)
- Assign occurrences of word $v$ in document $d$ across the various topics (latent variables):
  - $z^{reg}_{dv,k} \sim \mathrm{Poisson}(\theta^{reg}_{dk} \beta^{reg}_{kv})$; $z^{res}_{dv} \sim \mathrm{Poisson}(\beta^{res}_{dv})$, such that $w_{dv} = \sum_k z^{reg}_{dv,k} + z^{res}_{dv}$
  - $z^{sum,reg}_{dv,k} \sim \mathrm{Poisson}(\theta^{reg}_{dk} \beta^{reg}_{kv} \epsilon^{reg}_k)$; $z^{sum,res}_{dv} \sim \mathrm{Poisson}(\beta^{res}_{dv} \epsilon^{res})$, such that $w^{summary}_{dv} = \sum_k z^{sum,reg}_{dv,k} + z^{sum,res}_{dv}$
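
Independent Poisson counts, conditioned on their total, follow a multinomial distribution with probabilities proportional to their rates. The sketch below uses that fact to split one observed count $w_{dv}$ across the regular topics and the residual topic. The function name `allocate_counts` and the numbers are hypothetical, and the talk itself relies on variational inference rather than sampling these counts.

```python
import numpy as np

def allocate_counts(w_dv, theta_d, beta_v, beta_res_dv, rng):
    """Split w_dv occurrences across K regular topics and the residual topic."""
    rates = np.append(theta_d * beta_v, beta_res_dv)   # K regular rates, then the residual rate
    probs = rates / rates.sum()
    z = rng.multinomial(w_dv, probs)
    return z[:-1], z[-1]                               # (z_reg for each k, z_res)

rng = np.random.default_rng(0)
z_reg, z_res = allocate_counts(
    w_dv=7,
    theta_d=np.array([0.5, 1.5]),   # theta_dk for k = 1, 2
    beta_v=np.array([2.0, 0.2]),    # beta_kv for k = 1, 2
    beta_res_dv=0.7,
    rng=rng,
)
```

For the summary counts the same function applies once each regular rate is multiplied by $\epsilon^{reg}_k$ and the residual rate by $\epsilon^{res}$.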

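20 Inference: Posterior Distributions - ALL CONDITIONALLY CONJUGATE! (Gopalan, Hofman and Blei, 2013; Gopalan, Charlin and Blei, 2014)
- $\beta^{reg}_{kv} \sim \mathrm{Gamma}(a + \sum_d (z^{reg}_{dv,k} + z^{sum,reg}_{dv,k}),\; b + \sum_d \theta^{reg}_{dk} (1 + \epsilon^{reg}_k))$
- $\beta^{res}_{dv} \sim \mathrm{Gamma}(a + z^{res}_{dv} + z^{sum,res}_{dv},\; b + 1 + \epsilon^{res})$
- $\epsilon^{reg}_k \sim \mathrm{Gamma}(g + \sum_{d,v} z^{sum,reg}_{dv,k},\; h + \sum_{d,v} \theta^{reg}_{dk} \beta^{reg}_{kv})$
- $\epsilon^{res} \sim \mathrm{Gamma}(g + \sum_{d,v} z^{sum,res}_{dv},\; h + \sum_{d,v} \beta^{res}_{dv})$
- $\theta^{reg}_{dk} \sim \mathrm{Gamma}(c + \sum_v (z^{reg}_{dv,k} + z^{sum,reg}_{dv,k}),\; d + \sum_v \beta^{reg}_{kv} (1 + \epsilon^{reg}_k))$
- $[\{z^{reg}_{dv,k}\}_k, z^{res}_{dv}] \sim \mathrm{Mult}([\{\theta^{reg}_{dk} \beta^{reg}_{kv}\}_k, \beta^{res}_{dv}])$
- $[\{z^{sum,reg}_{dv,k}\}_k, z^{sum,res}_{dv}] \sim \mathrm{Mult}([\{\theta^{reg}_{dk} \beta^{reg}_{kv} \epsilon^{reg}_k\}_k, \beta^{res}_{dv} \epsilon^{res}])$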

21 Inference Using Variational Inference (Blei et al., 2003, 2016)
- Approximate the posterior distribution of each parameter with a member of a family of distributions
- Identify the member of the family that minimizes the distance (KL divergence) between the true and approximate distributions
- Coordinate Ascent Mean-Field Variational Inference Algorithm: iteratively minimize the distance between the posterior distribution of each model parameter and its approximate distribution
- An order of magnitude faster than MCMC

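22 Variational Inference
- $P(\theta^{reg}_{dk} \mid \dots)$ approximated by $\mathrm{Gamma}(\tilde{\theta}^{reg}_{dk})$
- $P(\beta^{reg}_{kv} \mid \dots)$ approximated by $\mathrm{Gamma}(\tilde{\beta}^{reg}_{kv})$; $P(\beta^{res}_{dv} \mid \dots)$ approximated by $\mathrm{Gamma}(\tilde{\beta}^{res}_{dv})$
- $P(\epsilon^{reg}_k \mid \dots)$ approximated by $\mathrm{Gamma}(\tilde{\epsilon}^{reg}_k)$; $P(\epsilon^{res} \mid \dots)$ approximated by $\mathrm{Gamma}(\tilde{\epsilon}^{res})$
- $P(\{z^{reg}_{dv,k}\}_k, z^{res}_{dv} \mid \dots) = \mathrm{Mult}(\{\phi^{reg}_{dv,k}\}_k, \phi^{res}_{dv})$, where
  - $\phi^{reg}_{dv,k} \propto \theta^{reg}_{dk} \beta^{reg}_{kv} = \exp(\log(\theta^{reg}_{dk}) + \log(\beta^{reg}_{kv}))$, approximated by $\tilde{\phi}^{reg}_{dv,k}$
  - $\phi^{res}_{dv} \propto \beta^{res}_{dv}$, approximated by $\tilde{\phi}^{res}_{dv}$
- $P(\{z^{sum,reg}_{dv,k}\}_k, z^{sum,res}_{dv} \mid \dots) = \mathrm{Mult}(\{\phi^{sum,reg}_{dv,k}\}_k, \phi^{sum,res}_{dv})$, where
  - $\phi^{sum,reg}_{dv,k} \propto \theta^{reg}_{dk} \beta^{reg}_{kv} \epsilon^{reg}_k = \exp(\log(\theta^{reg}_{dk}) + \log(\beta^{reg}_{kv}) + \log(\epsilon^{reg}_k))$, approximated by $\tilde{\phi}^{sum,reg}_{dv,k}$
  - $\phi^{sum,res}_{dv} \propto \beta^{res}_{dv} \epsilon^{res}$, approximated by $\tilde{\phi}^{sum,res}_{dv}$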

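23 Coordinate Ascent Mean-Field Variational Inference Algorithm (Blei et al., 2016)
- $\tilde{\theta}^{reg}_{dk} = \langle c + \sum_v (w_{dv} \tilde{\phi}^{reg}_{dv,k} + w^{summary}_{dv} \tilde{\phi}^{sum,reg}_{dv,k}),\; d + \sum_v \frac{\tilde{\beta}^{reg,SHAPE}_{kv}}{\tilde{\beta}^{reg,RATE}_{kv}} (1 + \frac{\tilde{\epsilon}^{reg,SHAPE}_k}{\tilde{\epsilon}^{reg,RATE}_k}) \rangle$
- $\tilde{\beta}^{reg}_{kv} = \langle a + \sum_d (w_{dv} \tilde{\phi}^{reg}_{dv,k} + w^{summary}_{dv} \tilde{\phi}^{sum,reg}_{dv,k}),\; b + \sum_d \frac{\tilde{\theta}^{reg,SHAPE}_{dk}}{\tilde{\theta}^{reg,RATE}_{dk}} (1 + \frac{\tilde{\epsilon}^{reg,SHAPE}_k}{\tilde{\epsilon}^{reg,RATE}_k}) \rangle$
- $\tilde{\beta}^{res}_{dv} = \langle a + w_{dv} \tilde{\phi}^{res}_{dv} + w^{summary}_{dv} \tilde{\phi}^{sum,res}_{dv},\; b + 1 + \frac{\tilde{\epsilon}^{res,SHAPE}}{\tilde{\epsilon}^{res,RATE}} \rangle$
- $\tilde{\epsilon}^{reg}_k = \langle g + \sum_{d,v} w^{summary}_{dv} \tilde{\phi}^{sum,reg}_{dv,k},\; h + \sum_{d,v} \frac{\tilde{\theta}^{reg,SHAPE}_{dk}}{\tilde{\theta}^{reg,RATE}_{dk}} \frac{\tilde{\beta}^{reg,SHAPE}_{kv}}{\tilde{\beta}^{reg,RATE}_{kv}} \rangle$
- $\tilde{\epsilon}^{res} = \langle g + \sum_{d,v} w^{summary}_{dv} \tilde{\phi}^{sum,res}_{dv},\; h + \sum_{d,v} \frac{\tilde{\beta}^{res,SHAPE}_{dv}}{\tilde{\beta}^{res,RATE}_{dv}} \rangle$
- $[\{\tilde{\phi}^{reg}_{dv,k}\}_k, \tilde{\phi}^{res}_{dv}] \propto \exp(\{\Psi(\tilde{\theta}^{reg,SHAPE}_{dk}) - \log(\tilde{\theta}^{reg,RATE}_{dk}) + \Psi(\tilde{\beta}^{reg,SHAPE}_{kv}) - \log(\tilde{\beta}^{reg,RATE}_{kv})\}_k,\; \Psi(\tilde{\beta}^{res,SHAPE}_{dv}) - \log(\tilde{\beta}^{res,RATE}_{dv}))$
- $[\{\tilde{\phi}^{sum,reg}_{dv,k}\}_k, \tilde{\phi}^{sum,res}_{dv}] \propto \exp(\{\Psi(\tilde{\theta}^{reg,SHAPE}_{dk}) - \log(\tilde{\theta}^{reg,RATE}_{dk}) + \Psi(\tilde{\beta}^{reg,SHAPE}_{kv}) - \log(\tilde{\beta}^{reg,RATE}_{kv}) + \Psi(\tilde{\epsilon}^{reg,SHAPE}_k) - \log(\tilde{\epsilon}^{reg,RATE}_k)\}_k,\; \Psi(\tilde{\beta}^{res,SHAPE}_{dv}) - \log(\tilde{\beta}^{res,RATE}_{dv}) + \Psi(\tilde{\epsilon}^{res,SHAPE}) - \log(\tilde{\epsilon}^{res,RATE}))$
- where $\Psi$ is the digamma function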
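
The multinomial updates rest on the identity $E[\log X] = \Psi(\text{shape}) - \log(\text{rate})$ for a Gamma-distributed $X$. The sketch below computes the document-side update for a single (document, word) pair in plain Python. The digamma function is approximated with a recurrence plus an asymptotic series so that the example has no dependencies; the function names and the (shape, rate) values are hypothetical.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0: shift x above 6, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x      # recurrence: psi(x) = psi(x + 1) - 1 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def expected_log(shape, rate):
    """E[log X] for X ~ Gamma(shape, rate)."""
    return digamma(shape) - math.log(rate)

def update_phi_doc(theta_d, beta_v, beta_res_dv):
    """theta_d, beta_v: lists of (shape, rate), one per regular topic; beta_res_dv: (shape, rate)."""
    logits = [expected_log(*t) + expected_log(*b) for t, b in zip(theta_d, beta_v)]
    logits.append(expected_log(*beta_res_dv))        # last entry is the residual topic
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]    # subtract the max for numerical stability
    total = sum(weights)
    return [x / total for x in weights]

phi = update_phi_doc(
    theta_d=[(2.0, 1.0), (0.5, 1.0)],
    beta_v=[(3.0, 2.0), (0.4, 1.0)],
    beta_res_dv=(0.6, 1.5),
)
```

The summary-side update would add the expected log of the corresponding offset to each logit before normalizing.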

24 Predicting Summary of an Out-of-Sample Document Based on its Full Text
- Input:
  - Parameter estimates based on in-sample documents: $\{\beta^{reg}_k\}_k$; $\{\epsilon^{reg}_k\}_k$; $\epsilon^{res}$
  - Full text of out-of-sample document $d_{out}$
- Estimate topic intensities $\{\theta^{reg}_{d_{out} k}\}_k$ and residual topic $\beta^{res}_{d_{out}}$ for the out-of-sample document (using Variational Inference)
- Predict word occurrences in the summary of the out-of-sample document:
  - $\lambda^{summary}_{d_{out} v} = \sum_k \theta^{reg}_{d_{out} k} \beta^{reg}_{kv} \epsilon^{reg}_k + \beta^{res}_{d_{out} v} \epsilon^{res}$
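
Once the topic intensities and residual topic of the new document have been estimated, the prediction step is a single matrix expression. The sketch below evaluates it on made-up point estimates; the function name `predict_summary_rates` and all numbers are hypothetical.

```python
import numpy as np

def predict_summary_rates(theta_out, beta_reg, eps_reg, beta_res_out, eps_res):
    """lambda_v = sum_k theta_k * beta_kv * eps_k + beta_res_v * eps_res."""
    return (theta_out * eps_reg) @ beta_reg + beta_res_out * eps_res

theta_out = np.array([1.0, 2.0])               # intensities on K = 2 regular topics
beta_reg = np.array([[1.0, 0.0, 2.0],
                     [0.0, 1.0, 1.0]])         # K x V topic-word weights
eps_reg = np.array([0.5, 2.0])                 # topic 1 is muted in summaries, topic 2 amplified
beta_res_out = np.array([0.1, 0.0, 0.3])       # residual topic of the new document
eps_res = 1.0

rates = predict_summary_rates(theta_out, beta_reg, eps_reg, beta_res_out, eps_res)
# rates is [0.6, 4.0, 5.3]
```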

26 Application I: Marketing Academic Papers and their Abstracts
- Abstracts and full texts of all 1,333 research papers published in Marketing Science, Journal of Marketing, Journal of Marketing Research, and Journal of Consumer Research between 2010 and 2015
- Preprocessing:
  - Spelling corrector (Python)
  - Eliminate non-English characters and words, numbers, punctuation
  - Tokenize text
  - Remove stopwords and words that contain only one character
  - No stemming or lemmatization
- Randomly split documents between calibration (75%) and validation (25%)
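
A minimal sketch of the tokenization, filtering, and random split described above. It omits the spelling-correction step, and the tiny stopword set, the function names, and the seed are hypothetical stand-ins rather than the authors' actual pipeline.

```python
import random
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, drop stopwords and one-character words."""
    tokens = re.findall(r"[a-z]+", text.lower())   # drops numbers and punctuation
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]

def split_documents(doc_ids, calibration_share=0.75, seed=0):
    """Randomly split document ids into calibration and validation sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    cut = round(calibration_share * len(ids))
    return ids[:cut], ids[cut:]

tokens = preprocess("The 2 models of a consumer's choice!")
calibration, validation = split_documents(range(8))
```

No stemming is applied, so "model" and "models" would remain distinct vocabulary entries, as in the slides.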

27 Vocabulary of words (based on calibration set of documents)
- Term frequency (tf) of word $v$: total number of occurrences of the word across all documents
  - Remove words with tf < 100
- Document frequency (df) of word $v$: number of documents with at least one occurrence of the word
- Term frequency-inverse document frequency: $\text{tf-idf}(v) = \text{tf}(v) \times \log(\#\text{documents} / \text{df}(v))$
  - Keep the 1,000 words with the highest tf-idf
  - Removes words that appear in too many documents or that appear too infrequently
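
The vocabulary rule above translates into a short function. This sketch assumes a natural logarithm (the slides do not state the base) and uses a hypothetical function name, `select_vocabulary`, with toy thresholds in the usage example.

```python
import math
from collections import Counter

def select_vocabulary(docs, min_tf=100, size=1000):
    """docs: list of token lists. Returns up to `size` words ranked by tf-idf."""
    tf = Counter()   # total occurrences of each word across documents
    df = Counter()   # number of documents containing each word
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    n_docs = len(docs)
    scores = {v: tf[v] * math.log(n_docs / df[v]) for v in tf if tf[v] >= min_tf}
    return sorted(scores, key=scores.get, reverse=True)[:size]

toy_docs = [["price", "price", "brand"], ["price", "ad"], ["ad", "ad", "ad"]]
vocab = select_vocabulary(toy_docs, min_tf=2, size=2)
# "brand" is dropped by the tf cutoff; "ad" outranks "price" on tf-idf
```

A word that appears in every document gets a tf-idf of zero, which is how the rule screens out overly common words.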

28 Descriptive Statistics
[Table: Mean, St. dev., Min, and Max for each statistic below; numeric values not preserved]
Statistic | Unit of analysis
Occurrences of words from vocabulary | Full text of paper (N=1,333)
Occurrences of words from vocabulary | Abstract of paper (N=1,333)
Number of words from vocabulary with at least one occurrence | Full text of paper (N=1,333)
Number of words from vocabulary with at least one occurrence | Abstract of paper (N=1,333)
Number of occurrences across full texts | Word in vocabulary (N=1,000)
Number of occurrences across abstracts | Word in vocabulary (N=1,000)
Number of full texts with at least one occurrence | Word in vocabulary (N=1,000)
Number of abstracts with at least one occurrence | Word in vocabulary (N=1,000)

29 Number of Topics
- Could be determined using cross-validation
- Instead, simply set the number of regular topics $K$ to 100 (Gopalan, Charlin and Blei, 2014)
  - Gamma prior induces sparsity
  - If $K = 100$ is more than enough, some topics can be flat

30 Results - Distribution of Offset Parameters
[Figure: distribution of offset parameters among the 29 non-flat topics]

31 Results - Examples of Regular Topics with Small Offset Parameters (relatively weaker representation in summaries vs. documents)
- Visualize each topic by creating a word cloud of content simulated based on the topic distribution
[Figure: word clouds]

32 Results - Regular Topic with the Highest Offset Parameter (relatively stronger representation in summaries vs. documents)
[Figure: word cloud]

33 Results - Example
Iyengar, Van den Bulte, Valente. Opinion leadership and social contagion in new product diffusion. Marketing Science 30.2 (2011)
[Figure panels: Content of actual paper; "Inside the cone" content; "Outside the cone" content]

34 Outside the Cone Content Novelty?
- DV = log(1 + #citations)
- Covariates: journal fixed effects; publication year fixed effects; intensities on non-flat regular topics; proportion of "outside the cone" content
- Number of parameters: 40; number of observations: 1,000
- Regressions estimated separately using OLS
- [Table: coefficient estimates, R-squared, and significance thresholds not preserved]
- Residual is the proportion of words in the paper assigned to the residual topic, standardized across papers for interpretability

35 Nested Benchmarks
- Full model with residual topics and offset parameters
- No residual topic
- Offset parameter constant across all topics (assume relative topic intensities are the same in summaries vs. documents)
- No residual topic and constant offset parameter (traditional Poisson Factorization)
- Residual topic only (each document is unique, no learning across documents)

36 Non-Nested Benchmarks
- Latent Dirichlet Allocation (Blei et al., 2003)
  - Models the probability of occurrence of each word conditional on the total number of words
  - Each document in the calibration sample is merged with its summary
  - No residual topic, no offset parameter
  - Also estimated using Variational Inference (Blei et al., 2003)

37 Fit criterion: Perplexity
- Model output: fitted Poisson rate for each word
  - in each document: $\lambda_{dv} = \sum_k \theta^{reg}_{dk} \beta^{reg}_{kv} + \beta^{res}_{dv}$
  - in each summary: $\lambda^{summary}_{dv} = \sum_k \theta^{reg}_{dk} \beta^{reg}_{kv} \epsilon^{reg}_k + \beta^{res}_{dv} \epsilon^{res}$
- Transform Poisson rates into probability weights for each word
  - $\phi_{dv}$ = probability that a random word in document $d$ is word $v$
- Fit measured by perplexity
  - $\text{Perplexity} = \exp(-\frac{\sum_d \sum_{obs \in d} \log(\phi_{d,obs})}{N})$
  - Inversely related to the geometric mean of the likelihood function
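
A sketch of the perplexity computation from fitted rates and observed counts. Two assumptions are made that the slides do not spell out: the probability weights are obtained by normalizing each document's rates to sum to one, and $N$ is the total number of observed word occurrences. The function name `perplexity` is hypothetical.

```python
import numpy as np

def perplexity(rates, counts):
    """rates, counts: D x V arrays of fitted Poisson rates and observed word counts."""
    probs = rates / rates.sum(axis=1, keepdims=True)   # assumed normalization per document
    observed = counts > 0                              # skip words that never occur
    log_lik = (counts[observed] * np.log(probs[observed])).sum()
    return float(np.exp(-log_lik / counts.sum()))

# Sanity check: a model that is uniform over 4 words has perplexity 4.
uniform_rates = np.ones((2, 4))
counts = np.array([[1, 0, 2, 0],
                   [0, 3, 0, 1]])
value = perplexity(uniform_rates, counts)
```

Lower values mean the model assigns higher probability to the words that were actually observed.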

38 Results - Fit Measures on Calibration Documents
- Train model on calibration documents and their summaries:
  - Estimate regular topics $\beta^{reg}_{kv}$
  - Estimate topic intensities $\theta^{reg}_{dk}$
  - Estimate offset parameters $\epsilon^{reg}_k$, $\epsilon^{res}$
  - Estimate residual topics $\beta^{res}_{dv}$
- Perplexity of calibration documents and summaries (in-sample)

39 Results - Fit Measures on Validation Documents
- Based on parameter estimates from calibration documents + full texts of validation documents:
  - Estimate topic intensities of validation documents: $\theta^{reg}_{d_{out} k}$
  - Estimate residual topics of validation documents: $\beta^{res}_{d_{out} v}$
- Perplexity of validation documents (in-sample)
- Predict content of validation summaries: $\lambda^{summary}_{d_{out} v} = \sum_k \theta^{reg}_{d_{out} k} \beta^{reg}_{kv} \epsilon^{reg}_k + \beta^{res}_{d_{out} v} \epsilon^{res}$
- Perplexity of validation summaries (out-of-sample)

40 Benchmark Comparisons (Perplexity: lower is better)
[Table: perplexity values not preserved]
Columns: Fit (Calibration documents, Calibration summaries) | Predictive perf. (Validation documents, Validation summaries)
Rows (Approach):
- Full Model
- No residual topic
- $\epsilon$ constant
- No residual topic and $\epsilon$ constant
- Residual topic only
- LDA

41 Application II: Movie Scripts and their Synopses
- Scripts (documents) and synopses (summaries) of 858 movies
  - Scripts from the Internet Movie Script Database (imsdb.com)
  - Synopses from the Internet Movie Database (imdb.com)
- Same preprocessing as with marketing papers
- Calibration sample (75%) and validation sample (25%)
- 1,000 words selected based on tf-idf (tf cutoff of 65 vs. 100 because there are fewer documents)

42 Descriptive Statistics
[Table: Mean, St. dev., Min, and Max for each statistic below; numeric values not preserved]
Statistic | Unit of analysis
Number of occurrences of words from vocabulary | Script
Number of occurrences of words from vocabulary | Synopsis
Number of words from vocabulary with at least one occurrence | Script
Number of words from vocabulary with at least one occurrence | Synopsis
Number of occurrences across scripts | Word in vocabulary
Number of occurrences across synopses | Word in vocabulary
Number of scripts with at least one occurrence | Word in vocabulary
Number of synopses with at least one occurrence | Word in vocabulary

43 Results - Distribution of Offset Parameters
[Figure: distribution of offset parameters among the 29 non-flat topics]

44 Results - Examples of Regular Topics with Small Offset Parameters (relatively weaker representation in summaries vs. documents)
[Figure: word clouds]

45 Results - Examples of Regular Topics with Large Offset Parameters (relatively stronger representation in summaries vs. documents)
[Figure: word clouds]

46 Results - Example: Forrest Gump
[Figure panels: Content of actual script; "Inside the cone" content; "Outside the cone" content]

47 Outside the Cone Content Novelty?
- Two regressions, each estimated separately using OLS: DV = Movie rating; DV = Log(ROI)
- Covariates: MPAA rating fixed effects; genre fixed effects; intensities on non-flat regular topics; Log(inflation-adjusted production budget); Movie rating and (Movie rating)^2 (N/A when DV = Movie rating); proportion of "outside the cone" content
- [Table: coefficient estimates, number of parameters, number of observations, R-squared, and significance thresholds not preserved]
- Proportion of "outside the cone" content is the proportion of words in the script assigned to the residual topic, standardized across movies for interpretability
- Movie rating is also standardized across movies for interpretability
- ROI is the ratio of box office to production budget
- Box office performance and/or production budget was not available for all movies

48 Benchmark Comparisons (Perplexity)
[Table: perplexity values not preserved]
Columns: Fit (Calibration documents, Calibration summaries) | Predictive perf. (Validation documents, Validation summaries)
Rows (Approach):
- Full Model
- No residual topic
- $\epsilon$ constant
- No residual topic and $\epsilon$ constant
- Residual topics only
- LDA

50 Practical Application: creativesummary.org
- Domain specific (marketing academic papers and movie scripts for now)
- User uploads a document, and optionally a summary
- Based on the previously calibrated model, estimate on the fly (using PHP, up to 100 iterations of Variational Inference):
  - Topic intensities of the new document
  - Residual topic of the new document
- Predict occurrences of words in the summary of the new document
- Simulate summary content
- If the user provided a summary, compare predicted and observed occurrences in the summary

52 Examples of Other Ongoing Projects
- Liu, Jia, and Olivier Toubia, "How do Consumers Form Online Search Queries? The Importance of Activation Probabilities between Queries and Results"
- Liu, Jia, and Olivier Toubia, "A Semantic Approach for Estimating Consumer Content Preferences from Online Search Queries"
- Liu, Jia, Olivier Toubia, and Shawndra Hill, "Content-Based Dynamic Model of Web Search Behavior: An Application to TV Show Search"
- Dew, Ryan, Asim Ansari, and Olivier Toubia, "Letting Logos Speak: Deep Probabilistic Models for Logo Design"

53 THANK YOU!
