Video Pandemics: Worldwide Viral Spreading of Psy s Gangnam Style Video

Similar documents
Inferring Latent Preferences from Network Data

A re examination of the Columbian exchange: Agriculture and Economic Development in the Long Run

Agriculture, Transportation and the Timing of Urbanization

Lecture 10 Optimal Growth Endogenous Growth. Noah Williams

Economic Growth: Lecture 1, Questions and Evidence

WP/18/117 Sharp Instrument: A Stab at Identifying the Causes of Economic Growth

Landlocked or Policy Locked?

Lecture 9 Endogenous Growth Consumption and Savings. Noah Williams

ECON 581. The Solow Growth Model, Continued. Instructor: Dmytro Hryshko

Lecture Note 13 The Gains from International Trade: Empirical Evidence Using the Method of Instrumental Variables

For Adam Smith, the secret to the wealth of nations was related

IDENTIFYING MULTILATERAL DEPENDENCIES IN THE WORLD TRADE NETWORK

Nursing Facilities' Life Safety Standard Survey Results Quarterly Reference Tables

INSTITUTIONS AND THE LONG-RUN IMPACT OF EARLY DEVELOPMENT

Lecture 26 Section 8.4. Mon, Oct 13, 2008

Evolution Strategies for Optimizing Rectangular Cartograms

Your Galactic Address

SUPPLEMENTARY MATERIAL FOR:

Analyzing Severe Weather Data

arxiv: v1 [physics.soc-ph] 6 Nov 2013

Growth and Comparative Development - An Overview

SAMPLE AUDIT FORMAT. Pre Audit Notification Letter Draft. Dear Registrant:

Use your text to define the following term. Use the terms to label the figure below. Define the following term.

The Diffusion of Development: Along Genetic or Geographic Lines?

Competition, Innovation and Growth with Limited Commitment

!" #$$% & ' ' () ) * ) )) ' + ( ) + ) +( ), - ). & " '" ) / ) ' ' (' + 0 ) ' " ' ) () ( ( ' ) ' 1)

International Investment Positions and Exchange Rate Dynamics: A Dynamic Panel Analysis

ENDOGENOUS GROWTH. Carl-Johan Dalgaard Department of Economics University of Copenhagen

Research Article Diagnosing and Predicting the Earth s Health via Ecological Network Analysis

Multiway Analysis of Bridge Structural Types in the National Bridge Inventory (NBI) A Tensor Decomposition Approach

Growth and Comparative Development: An Overview

Swine Enteric Coronavirus Disease (SECD) Situation Report June 30, 2016

Landlocked or Policy Locked?

Sample Statistics 5021 First Midterm Examination with solutions

Social and Economic Rights Fulfillment Index SERF Index and Substantive Rights Indices for 2015 (2017 update) Index for High Income OECD Countries

University of Groningen. Corruption and governance around the world Seldadyo, H.

Mean and covariance models for relational arrays

Roots of Autocracy. January 26, 2017

Parametric Test. Multiple Linear Regression Spatial Application I: State Homicide Rates Equations taken from Zar, 1984.

Smart Magnets for Smart Product Design: Advanced Topics

EXST 7015 Fall 2014 Lab 08: Polynomial Regression

What Lies Beneath: A Sub- National Look at Okun s Law for the United States.

Annual Performance Report: State Assessment Data

Forecasting the 2012 Presidential Election from History and the Polls

Epidemics and information spreading

econstor Make Your Publications Visible.

A. Cuñat 1 R. Zymek 2

Dany Bahar. Ricardo Hausmann. Cesar A. Hidalgo. Harvard Kennedy School. Harvard Kennedy School MIT. August 2013 RWP

2006 Supplemental Tax Information for JennisonDryden and Strategic Partners Funds

Chapter 9.D Services Trade Data

PTV Africa City Map (Standardmap)

Growth and Comparative Development

Swine Enteric Coronavirus Disease (SECD) Situation Report Sept 17, 2015

Statistical Mechanics of Money, Income, and Wealth

Cluster Analysis. Part of the Michigan Prosperity Initiative

Evolution Strategies for Optimizing Rectangular Cartograms

Value added trade: A tale of two concepts

Summary of Natural Hazard Statistics for 2008 in the United States

Drought Monitoring Capability of the Oklahoma Mesonet. Gary McManus Oklahoma Climatological Survey Oklahoma Mesonet

PTV Africa City Map 2017 (Standardmap)

C Further Concepts in Statistics

Data Visualization (DSC 530/CIS )

(1) (2) (3) Baseline (replicated)

Appendix 5 Summary of State Trademark Registration Provisions (as of July 2016)

Swine Enteric Coronavirus Disease (SECD) Situation Report Mar 5, 2015

ISLANDS AS BAD GEOGRAPHY.

The Out of Africa Hypothesis of Comparative Development Reflected by Nighttime Light Intensity

External Economies of Scale and Industrial Policy: A View from Trade

ISLANDS AS BAD GEOGRAPHY.

Using Web Maps to Measure the Development of Global Scale Cognitive Maps

Political Economy of Institutions and Development. Lecture 7: The Role of the State and Different Political Regimes

2008 Men's 20 European Championship / Qualification

Class business PS is due Wed. Lecture 20 (QPM 2016) Multivariate Regression November 14, / 44

Supplementary Information Activity driven modeling of time varying networks

Curse or Blessing? Natural Resources and Human Development

ISLANDS AS BAD GEOGRAPHY.

CS-11 Tornado/Hail: To Model or Not To Model

Testing measurement invariance in a confirmatory factor analysis framework State of the art

Data Visualization (CIS 468)

Meteorology 110. Lab 1. Geography and Map Skills

FIRST-TIME COLLEGE OTHER TOTAL FIRST GRAND GEOGRAPHIC ORIGIN UNDERGRADUATES UNDERGRADUATES GRADUATES PROFESSIONAL TOTAL

Transparency in Non-Tariff Measures : An International Comparison

Inferring the interplay of network structure and market effects in Bitcoin

Machine Learning and Modeling for Social Networks

Online Appendix of the paper " How does sovereign bond market integration relate to fundamentals and CDS spreads?"

Urban characteristics attributable to density-driven tie formation

Chapter 8.F Representative Table and Composite Regions

Modeling Social Media Memes as a Contagious Process

Resources. Amusement Park Physics With a NASA Twist EG GRC

Diffusion of information and social contagion

Modeling, Analysis, and Control of Information Propagation in Multi-layer and Multiplex Networks. Osman Yağan

Kari Lock. Department of Statistics, Harvard University Joint Work with Andrew Gelman (Columbia University)

Blueline Tilefish Assessment Approach: Stock structure, data structure, and modeling approach

PTV South America City Map 2017 (Standardmap)

Hierarchical models for multiway data

AIR FORCE RESCUE COORDINATION CENTER

ISLANDS AS BAD GEOGRAPHY.

PTV South America City Map 2018 (Standardmap)

Estimating Dynamic Games of Electoral Competition to Evaluate Term Limits in U.S. Gubernatorial Elections: Online Appendix

Web Structure Mining Nodes, Links and Influence

Transcription:

Video Pandemics: Worldwide Viral Spreading of Psy s Gangnam Style Video Zsófia Kallus 1,2, Dániel Kondor 1,3, József Stéger 1, István Csabai 1, Eszter Bokányi 1, and Gábor Vattay 1 ( ) 1 Department of Physics of Complex Systems, Eötvös University Pázmány P. s. 1/A, H-1117, Budapest, Hungary vattay@elte.hu 2 Ericsson Research, Budapest, Hungary 3 Senseable City Lab, Massachusetts Institute of Technology, Cambridge, USA arxiv:1707.04460v1 [cs.si] 14 Jul 2017 Abstract. Viral videos can reach global penetration traveling through international channels of communication similarly to real diseases starting from a well-localized source. In past centuries, disease fronts propagated in a concentric spatial fashion from the the source of the outbreak via the short range human contact network. The emergence of longdistance air-travel changed these ancient patterns. However, recently, Brockmann and Helbing have shown that concentric propagation waves can be reinstated if propagation time and distance is measured in the flight-time and travel volume weighted underlying air-travel network. Here, we adopt this method for the analysis of viral meme propagation in Twitter messages, and define a similar weighted network distance in the communication network connecting countries and states of the World. We recover a wave-like behavior on average and assess the randomizing e ect of non-locality of spreading. We show that similar result can be recovered from Google Trends data as well. Keywords: geo-social networks, meme dynamics, online news propagation, graph embedding 1 Introduction According to Wikipedia, the music video of Gangnam Style by recording artist Psy reached the unprecedented milestone of one billion YouTube views on December 21, 2012. It was directed by Cho Soo-Hyun and the video premiered on July 15, 2012. What makes this viral video unique from the point of view of social network research is that it was spreading mostly via human-to-human social network links before its first public appearance in the United States on August 20, 2012 and Katy Perry shared the music video with her 25 million followers on Twitter on August 21. Other viral videos spread typically via news media and reach worldwide audiences quickly, within 1-3 days. Our assumption is that only those online viral phenomena can show similarities to global pandemics that were originally constrained to a well localized, limited region and then, after an outbreak period,

2 reached a worldwide level of penetration. In 2012, the record breaking Gangnam Style [1] marked the appearance of a new type of online meme, reaching unprecedented level of fame despite its originally small local audience. From the sub-culture of k-pop fans it reached an increasingly wide range of users of online media - including academics - from around the World (Refs. [2 5]). We approximately reconstructed its spreading process by filtering geo-tagged messages containing the words Gangnam and style. In Fig.1 we show the location of geo-tagged posts containing the expression Gangnam Style from our collection of the public Twitter feed, as of September 2012. For this purpose we used our historical Twitter dataset and collection of follower relations connecting 5.8M active Twitter users who enabled access to their location information while posting messages to their public accounts (see Ref. [6] for details). When tracing videos on the Twitter social platform we have access only to the public posts, and only look at geo-tagged messages. Location information allows us to record the approximate arrival time of a certain news to a specific geo-political region. In the real space this process looks indeed random, but the local to global transition is also apparent as the messages cover a progressively larger territory. We collected the approximate first arrival time of the video in di erent geo-political regions of the World. In order to study the viral spreading of the appearances Fig. 1. Geo-locations of Twitter messages containing Gangnam Style. of the video in the Twitter data stream first we coarse grained the World map into large homogeneous geo-political regions. We used regions of countries and states of the World as the cells (i.e. the nodes) and aggregated the individual links connecting them. By performing an aggregation into geo-political regions we construct a weighted graph connecting 261 super nodes. Thus the ratio of

3 edge weights can be interpreted as an approximation of the relative strength of communication between pairs of the connected regions. This high-level graph is thus naturally embedded into the geographic space giving a natural length to its edges. We then connect individuals in the spatial social network and then recreate a high-level aggregated weighted graph between regions by querying a large database containing the collection of historic, freely available Twitter messages [6 8]. In Fig. 2 we show the resulting weights in graphical form for Twitter users in California. Fig. 2. Social connection weights between large geo-political regions of the World. The map shows our 261 geo-political regions and the number of friendships (mutual Twitter followers) between users in California and the rest of the World. Colour codes the number of friendships with users in California in our database. Deep red means that Californians have 10 5 friendships within California and deep blue indicates that 10 0 friendship connects them for example to certain regions of Africa. 2 Weighting the speed of spreading in the network Brockmann and Helbing in Ref. [9] worked out a method by which they were able to predict the e ective spreading time of a real disease between two nodes of air-tra c based on the network and the number of passengers travelling between them in unit time and introduced an e ective measure of distance of the two

4 nodes. Here, we repeat their derivation, except, we use the number of mutual social contacts (mutual Twitter followers called friends) between geo-located Twitter users in the geo-political regions. Each user can be in one of two possible states: susceptible for the viral video (never seen the video before) or infected (already seen the video) and a ects the state of others when contact occurs between them via sharing the video in their social network feeds. The model that we adopt is based on the meta-population model [10, 11]. After dividing the world map into geo-political regions the dynamics within the n th spatial unit is modelled by the SI equations( [12 15]) with disease specific parameters: @ t S n = I n S n /N n, @ t I n = I n + I n S n /N n. (1) This local dynamics is complemented by the weights that connect the separated cells as described by X w mn U m w mn U n. (2) m6=n Here U n represents the nth SI state variable, and w mn = F nm /N m is the per capita tra c flux from site m to site n, F nm being the weighted adjacency matrix representing the network. This means that the human contact network can be e ectively divided into interconnected layers and the spatial reach of a spreading process is determined mainly by the weighted network of regions. We only need this high-level information and the details of the large and complex temporal contact network can be neglected. Once we have a weighted graph and the arrival times of the video at each of the nodes the embedding of the graph into an abstract space can be performed. Its goal is to uncover the wave pattern of the dynamics. This means finding a source region from which the dependence of the arrival times at a region is linear on the e ective distance of the region from the origin i.e., there exists an e ective velocity. If the spreading is governed by the equations (1-2) the e ective distance (spreading time) between nodes m and n can be defined as follows: d mn =1 log(p mn ) apple 1, (3) where P mn = F mn / P m F mn is the flux fraction from m to n i.e., the probability to choose destination m if one is in the region n. This way a minimal distance means maximal probability, and additivity of the distances is ensured by the logarithmic function. 3 From geographic to network-based embedding The embedding comprises of the following steps. After calculating the connection weighted adjacency matrix, each link weight is transformed into the e ective quasi distance defined by Eq. 3. Starting from each node a shortest path tree can

5 be constructed from all nodes reachable from the selected one. These shortest paths correspond by definition to the most probable active routes that the equations of epidemics and the weights would predict. In such a tree the distance of each node from the origin is the length of the shortest path connecting them. The original graph and one of its shortest path trees is shown in Fig. 3. If the LSO ARM ALA AGO AFG BFA BEN ZWE AZE CA PE CA NB CA NL BMU BWA BOLBRB CA NS BGD BLR BIH BHS BGR BHR AU TAS AUT BEL AU WA AU NT AU SA ATG AU VIC AU ACT AU QLD AU NSW ARG ALB ARE ZMB ZAF CA NT CMR COD CA SK CHE COL CA QC CHL CHN CA MB BRN CA AB CA BC CA ON BRA CIV ETH FJI CZE CYM CRICYP DEU DJI DZA EST ECU DNK FIN ESPFRA GBR XKX VNM WSM VEN VIR US TX VCT US WV US WI US WA US VA US WY US TN US PA US NY UZB US UTUS SC US OH US VT US CA US NV US NJ US RI US OR US OK US NC US MI US FL US SD US IL US GA THA US MD US MO US NM US MS US MN US MA US LA US IN US HI TUR US NH US NE US KY US CO US AZ US AL US DC US CT US KS TWN US IA US AR US ND US ME US MT US DE US AK UKR US ID GAB MTQ GHA GTM HTI HND GRC GUM HRV HUN HKG IND IRL SLV SRB SVN SVK TCA TCD SWZ GMB GUF GEO IDN ITA JPN KOR PHL MEX TTO TUN UGA URYTZA SWE MYS QAT RUS ISR GUY KEN KHM KWTLBN NGA NLD NZL NOR IRQ PAK OMN PER PAN POL PRT PRI SUR PRY PSE JAM REU SDN SEN IRN ISL KAZ KGZ LAO LCA LKA LTU LUX LVA MAR MDG MKD MMR MNG MNE MNP MOZ MUS MRT MWI NIC NAM NPL PNG RWA Fig. 3. Embedded shortest path tree. The countries of the World are partitioned to 261 large administrative regions. An aggregated version of the weighted and directed graph has been used from the individual-level follower relations. E ective distances are represented as the radial distance from the node at the origin in arbitrary units, and the angular coordinate of the nodes is arbitrary as well. node selected as origin corresponds to the most e ective source, the video will arrive first at the closest nodes and then propagate towards the periphery. If the arrival times show a linear dependence on the e ective distance, it is likely that we found the source of the spreading. Comparing di erent trees by measuring the goodness of a linear fit allows us to find the most probable and most e ective source node on the graph. For the Gangnam Style video it is the region of the Philippines. While the obvious center should be Seoul/South Korea, where the

6 video has been created, it seems that the most intensive source of social network spreading was in the Philippines. This is probably due to the fact that we are not able to measure the very short time it took the video to spread from South Korea to the Philippines and more importantly, the Philippines is much more connected socially to the rest of the world than South Korea. This is partially due to the English language use and a well spread diaspora of the Philippines. The linear fit, on the other hand, is not perfect. Uncertainty is introduced into our analysis by multiple factors. These are the heterogeneous nature of the use of the social platforms around the world and the variability of activities over time; the partial measurements obtained from the publicly available geo-tagged sample of tweets; and the external e ect of other media propagating the same news. These circumstances all add to the uncertainty of the first arrival times in our dataset. In order to recover the average wave form we had to use smoothing in space and in time as well. The spatial averaging achieved by regional aggregation is also justified by the assumption that users within the same region are likely to be more connected to each other than to the rest of the world [8]. We used a moving window over the arrival times and used the linear fitting to the average distances and the average times of the windows. The noise can be e ectively reduced by choosing a window of two weeks to a month. As shown on Fig. 4 by the colour scale, the smaller the number of Twitter geo-users in a region, the less reliable the analysis becomes. Moreover these small regions form the peripheral ring as seen from the most probable source region s point of view. 4 Comparison of Twitter and Google Trends distances Once the site of origin is selected, the embedding is straight-forward. The e ective distances between nodes and the source node is equivalent to the shortest path distances on the embedded graph. Fig. 5 shows how the underlying order can be uncovered by this transformation. The seemingly randomized left panel representing the arrival of the video at various geographic distances becomes structured, and a linear trend emerges as a function of the e ective distance. Linearity breaks down only at large distances, where remote peripheral regions are left waiting for the video to arrive at last. We also measured arrival times by looking at the Google search engine records through Google Trends analytics service. In Figure6 we show the results. Its sparsity is coming from a lower temporal resolution of Google Trends compared to our Twitter based dataset. It is, however, very remarkable that the network embedding we found for the Twitter data works also for the Google Trends dataset and creates the proper reordering of the nodes. This result further underlines the e ectiveness of our Twitter measurements. In order to check the consistency of our results with other sources we analyzed an additional set for the regional arrival times. For this purpose we used publicly available service of the Google Trends platform, and gathered the cumulative number of historic web searches performed with the keywords Gangnam and style. We were able to create a dataset with a resolution of 2 weeks. As shown on

7 0 Max d eff from source 0 Max d eff from source 0 Max d eff from source 0 Max d eff from source Fig. 4. Progressive stages of the pandemic. The spreading of the wave is shown in four progressive stages of the propagation. Each stage is defined by separate time slice of equal length. The nodes where the news has just arrived in that slice are first shown on the shortest path tree. Second, a corresponding histogram is created based on e ective distances. Each rectangle represents one of the regional nodes and a common logarithmic color scale represents the number of users of the nodes (color scale of Fig. 5 is used).

8 #Users Arrival time of news Arrival time of news 100 1000 10.000 Tmin Tmin 0 10.000 20.000 Geographic distance from source [km] 0 5 10 15 Effective distance from source 20 Fig. 5. Geographic distance vs e ective distance on Twitter. Here each dot represents a state, colored according to the number of users, using the same logarithmic color scale as before. The horizontal axis represents the distance from the source node of the Philippines, while the vertical axis is the arrival time of the video at that regional node. The clear order is uncovered in the second panel, where geographic distance is replaced by the e ective distance. Arrival time of news [week] 20 15 10 5 Tmin Arrival time of news [week] 20 15 10 5 Tmin 5000 10.000 15.000 Geographic distance from source [km] 0 5 10 15 Effective distance from source Fig. 6. Geographic distance vs e ective distance in Google Trends. Here each dot represents a state, colored according to the number of Users (same color scale as Fig. 5), using the same logarithmic color scale as before. The horizontal axis represents the distance from the source node of the Philippines, while the vertical axis is the arrival time of the video at that regional node. The clear order is uncovered in the second panel, where geographic distance is replaced by the e ective distance.

9 the second panel of Fig. 6 the linear dependence is once again recovered, and the linear propagation of the wave can be e ectively seen. This proves that Twitter predicts correctly the flow of news and information on the global network of communication. Searches performed first in each region seem to be synchronized with the arrival times of the first tweets coming from Twitter. 5 Conclusions In this paper we investigated the online spreading of the viral video Gangnam Style in the social network Twitter. Using geo-localized tweets we determined the first appearance of the video in the 261 major geo-political areas of the World. We adapted the tools developed in Ref. [9] for the spreading of infectious diseases over the air-travel network for the present problem. We showed that the number of friendships (mutual following) of Twitter users between geo-political regions is analogous to the passenger tra c volume of air-tra c from the point of view of online information spreading. Using our historic public tweet database we could calculate the relative weights of information tra c between the regions. We then reproduced the shortest path e ective distance of geo-political regions from the epicenter of the social outbreak of the Gangnam Style video pandemic, which seems to be in the Philippines, not far from South Korea, where it has been produced. We managed to verify the spreading pattern independently, based on the search results in Google Trends in the same time period. The synchrony between first appearances in Twitter and Google suggest that a universal pattern of social information flow (information highways) exist between geo-political regions, which is inherently non-technological, determined by the strength of social ties between di erent countries, cultures and languages. Further research is necessary to understand the main features of this global network. Acknowledgement. The authors would like to thank János Szüle, László Dobos, Tamás Hanyecz, and Tamás Sebők for the maintenance of the Twitter database at Eötvös University. Authors thank the Hungarian National Research, Development and Innovation O ce under Grant No. 125280 and the grant of Ericsson Ltd. References 1. Gruger W. Psys Gangnam Style Video Hits 1 Billion Views, Unprecedented Milestone. Billboard com. 2012;. 2. Jung S, Shim D. Social distribution: K-pop fan practices in Indonesia and the Gangnam Stylephenomenon. International Journal of Cultural Studies. 2013;p. 1367877913505173. 3. Weng L, Menczer F, Ahn YY. Virality prediction and community structure in social networks. Scientific reports. 2013;3. 4. Keung Jung S. Global Audience Participation in the Production and Consumption of Gangnam Style. ScholarWorks@ Georgia State University; 2014.

10 5. Wiggins BE, Bowers GB. Memes as genre: A structurational analysis of the memescape. new media & society. 2015;17(11):1886 1906. 6. Dobos L, Szüle J, Bodnar T, Hanyecz T, Sebök T, Kondor D, et al. A multiterabyte relational database for geo-tagged social network data. In: Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference on. IEEE; 2013. p. 289 294. 7. Szüle J, Kondor D, Dobos L, Csabai I, Vattay G. Lost in the City: Revisiting Milgram s Experiment in the Age of Social Networks. PLoS ONE. 2014 11;9(11):e111973. 8. Kallus Z, Barankai N, Szüle J, Vattay G. Spatial Fingerprints of Community Structure in Human Interaction Network for an Extensive Set of Large-Scale Regions. PLoS ONE. 2015 05;10(5):e0126713. 9. Brockmann D, Helbing D. The hidden geometry of complex, network-driven contagion phenomena. Science. 2013;342(6164):1337 1342. 10. Balcan D, Gonalves B, Hu H, Ramasco JJ, Colizza V, Vespignani A. Modeling the spatial spread of infectious diseases: The GLobal Epidemic and Mobility computational model. Journal of Computational Science. 2010;1(3):132 145. Available from: http://www.sciencedirect.com/science/article/pii/s1877750310000438. 11. Vespignani A. Modelling dynamical processes in complex socio-technical systems. Nature Physics. 2012;8(1):32 39. 12. Dietz K. Epidemics and rumours: A survey. Journal of the Royal Statistical Society Series A (General). 1967;p. 505 528. 13. Newman MEJ. Spread of epidemic disease on networks. Phys Rev E. 2002 Jul;66:016128. Available from: http://link.aps.org/doi/10.1103/physreve.66.016128. 14. Barthelemy M, Barrat A, Vespignani A. Dynamical processes on complex networks. Cambridge University Press; 2008. 15. Isham V, Harden S, Nekovee M. Stochastic epidemics and rumours on finite random networks. Physica A: Statistical Mechanics and its Applications. 2010;389(3):561 576.