Latent Geographic Feature Extraction from Social Media

Size: px

Start display at page:

Download "Latent Geographic Feature Extraction from Social Media"

Moses Fowler
5 years ago
Views:

1 Latent Geographic Feature Extraction from Social Media Christian Sengstock* Michael Gertz Database Systems Research Group Heidelberg University, Germany November 8, 2012

2 Social Media is a huge and increasing source of unstructured and uncertain geographic information Eort to make data usable: (Structured) Information Extraction Place/event extraction from Flickr [Rattenburry SIGIR'07] Event trajectory extraction from Twitter [Sakaki WWW'10] Spatial Analysis Spatio-temporal forecasting using Flickr [Jin MM'12] Study ecological phenomena [Zhang WWW'12] Nov 8 2 / 33

General Motivation: Extract spatial variables from unstructured and noisy geographic information sources Flickr Twitter Wikipedia Φ(l) l 2 Φ(l) l 2 l 1

3 General Motivation: Extract spatial variables from unstructured and noisy geographic information sources Flickr Twitter Wikipedia Φ(l) l 2 Φ(l) l 2 l 1 l 1 This work: Framework for unsupervised extraction of informative spatial variables (dimensions of geographic semantics) from Social Media Nov 8 3 / 33

4 Outline 1 Denitions and Problem Statement 2 Data Characteristics and Normalization 3 Latent Geographic Feature Extraction 4 Experiments 5 Conclusions Nov 8 4 / 33

5 Outline 1 Denitions and Problem Statement Geographic Feature Signal Estimation Problem Statement 2 Data Characteristics and Normalization Distribution Characteristics Normalization Geographic Feature Types 3 Latent Geographic Feature Extraction Dimensionality Reduction Framework 4 Experiments Technique Comparison Normalization Inuence Exploration Task 5 Conclusions Nov 8 5 / 33

6 Terminology Geographic Feature f A dimension representing some semantics of a location (e.g., temperature, population, number of restaurants) Sampled (measured) at any location l in geographic space W ( spatial variable) Geographic Feature Sensor φ and Signal φ(l) of f : φ : W R + l φ(l) Nov 8 6 / 33

7 Terminology Set of geographic features f 1,..., f p Geographic Feature Sensor: denes a Multivariate Φ := (φ 1,..., φ p ) T Spatial sampling scheme (measurements) L = (l 1,..., l n ) denes a Location Sampling Matrix: φ 1 (l 1 )... φ p (l 1 ) X n p = (Φ(l 1 ),..., Φ(l n )) T =..... φ 1 (l n )... φ p (l n ) Nov 8 7 / 33

8 Terminology A Social Media Collection D consists of documents: X : u : l : t : d i = (X, u, l, t) Bag of document features (terms, tags, image features,...) User Location Timestamp Assumption: Features with geographic meaning aggregate in subsets of geographic space high signal Nov 8 8 / 33

9 Signal Estimation Every document feature f 1,..., f p is a possibly meaningful/meaningless geographic feature Intuition of geographic feature signal φ i (l): Number of users using feature f i around location l W 1 Estimation of φ i by Non-parametric 2D-histogram estimator on regular grid C of bandwidth w Small w Capture small scale variation/phenomena Large w Capture large scale variation/phenomena 1 motivated in next section Nov 8 9 / 33

10 Problem Statement Problem: Given high-dimensional geographic feature signal Φ from a Social Media collection (all terms/tags) Features might be meaningless, redundant, noisy Goal: Unsupervised extraction of small number of informative geographic features Applications: Prepare data for learning tasks that cannot handle high-dimensional data Discover hidden spatial variables in the data Nov 8 10 / 33

11 Outline 1 Denitions and Problem Statement Geographic Feature Signal Estimation Problem Statement 2 Data Characteristics and Normalization Distribution Characteristics Normalization Geographic Feature Types 3 Latent Geographic Feature Extraction Dimensionality Reduction Framework 4 Experiments Technique Comparison Normalization Inuence Exploration Task 5 Conclusions Nov 8 11 / 33

12 Dataset Two Flickr datasets covering US and LA Document features: Tags (pre-ltered by minimum user frequency) Spatial resolution: US (1.0 degree), LA (0.01 degree) Nov 8 12 / 33

13 Spatial Distribution Characteristics Figure: F (l): Num of features, D(l): Num of documents, U(l): Num of users, Fd(l): Num of distinct features. Exponential characteristics of spatial feature distribution Users distinct features / documents features Nov 8 13 / 33

14 Spatial Feature Distribution: 'beach' Figure: F (l, f ): Number of feature f = beach, U(l, f ): Number of users using f = beach Some users contribute large number of documents Estimate signal on basis of users is less biased (more robust) Nov 8 14 / 33

15 Normalization Exponential distribution characteristics Few locations dominate the signals' spatial distribution Normalization transforms the signal into a more natural domain Logging: Binarization: φ i(l) := log φ i (l) + 1 φ i(l) := 1{φ i (l) > 0} Nov 8 15 / 33

16 Geographic Feature Types Geographic Feature Types: Classes of geographic features with similar geographic semantics [Sengstock ACMGIS'11] Global: Same intensity as baseline distribution (number of users) Not interesting to discriminate between locations Regional: Widely spread in geographic space but dierent from baseline Interesting to discriminate between large subsets in geographic space Landmark: Occurring only in small subsets of geographic space Interesting to discriminate between single small subset and the rest Depends on area of interest W and spatial resolution w. Nov 8 16 / 33

17 Geographic Feature Types Entropy over locations of spatial signal X i as geographic feature type statistic for f i : large entropy Signal widely spread / smoothly distributed small entropy Signal peaky / occurs in small areas Figure: Ordered entropies H[X i ] for tag features of US Flickr dataset Nov 8 17 / 33

18 Outline 1 Denitions and Problem Statement Geographic Feature Signal Estimation Problem Statement 2 Data Characteristics and Normalization Distribution Characteristics Normalization Geographic Feature Types 3 Latent Geographic Feature Extraction Dimensionality Reduction Framework 4 Experiments Technique Comparison Normalization Inuence Exploration Task 5 Conclusions Nov 8 18 / 33

19 Dimensionality Reduction Describe high-dimensional data by k << p dimensions while preserving statistical properties of data General formulation (generative latent factor model) X n p = S n k A k p S: Latent factor (component) values for each record A: Combination of latent factors by original features Latent Geographic Feature Sensor/Signal: Φ(l) k 1 := A k p Φ(l) p 1 Nov 8 19 / 33

20 Dimensionality Reduction Spatial distributions in the data reect spatial phenomena Statistical structure in X depends on spatial co-occurrence of features Reducing the location sampling matrix X: Latent factors represent dominant spatial distributions (correlated features collapse, non-dominant features diminish) Latent factors describe distinct spatial distributions Latent geographic features describe signals of dominant and distinct geographic phenomena Nov 8 20 / 33

21 Framework Nov 8 21 / 33

22 Outline 1 Denitions and Problem Statement Geographic Feature Signal Estimation Problem Statement 2 Data Characteristics and Normalization Distribution Characteristics Normalization Geographic Feature Types 3 Latent Geographic Feature Extraction Dimensionality Reduction Framework 4 Experiments Technique Comparison Normalization Inuence Exploration Task 5 Conclusions Nov 8 22 / 33

23 Technique Comparison Study of dimensionality reduction techniques preserving dierent statistical properties Principal Component Analysis (PCA): Components are statistically uncorrelated Sparse Principal Component Analysis (SPCA): Components are statistically uncorrelated and sparse (α << p non-zero entries) Independent Component Analysis (ICA): Components are statistically independent Nov 8 23 / 33

24 Technique Comparison Qualitative Evaluation a Extraction of k = 20 latent geographic features b Manual labeling of extracted features on basis of component weights and spatial signal distribution c Identication of 'informative' latent features d Selection of similar latent features of other techniques on basis of highest component weights e Comparison of component weights and signal distribution Nov 8 24 / 33

25 Technique Comparison SPCA feat.: 'landscape' (top), 'beach' (center), 'desert' (bot.) Nov 8 25 / 33

26 Technique Comparison: Comparison 'beach': PCA (top), ICA (center), SPCA (bottom) Nov 8 26 / 33

27 Normalization Inuence Extracted latent geographic features can be of dierent types (global, landmark, regional) Calculation of entropy over k = 20 extracted features for each technique and each normalization strategy (none, logging, binarization) SPCA and ICA show response towards more regional features for stronger normalization Nov 8 27 / 33

28 Normalization Inuence Normalization: None (top), Logging (center), Binar. (bottom) Nov 8 28 / 33

29 Exploration Task Extraction of informative geographic features for Los Angeles using Flickr Exploration Setting SPCA feature extraction Normalization as exploration parameter Nov 8 29 / 33

30 Exploration Task Los Angeles Landmark Features Nov 8 30 / 33

31 Exploration Task Los Angeles Regional Features Nov 8 31 / 33

32 Outline 1 Denitions and Problem Statement Geographic Feature Signal Estimation Problem Statement 2 Data Characteristics and Normalization Distribution Characteristics Normalization Geographic Feature Types 3 Latent Geographic Feature Extraction Dimensionality Reduction Framework 4 Experiments Technique Comparison Normalization Inuence Exploration Task 5 Conclusions Nov 8 32 / 33

33 Conclusions Summary General framework for unsupervised extraction of informative spatial variables (dimensions of geographic semantics) from Social Media PCA ICA SPCA Transformation of informative geographic feature extraction into a problem of high-dimensional statistics in geographic feature space Extraction of spatial variables of dierent types (landmarks, regional) by normalization Ongoing Work Extrinsic Evaluation of parametrization and techniques (e.g. spatial classication task) (Semi-) supervised feature extraction Nov 8 33 / 33

Latent Geographic Feature Extraction from Social Media

Latent Geographic Feature Extraction from Social Media Christian Sengstock Institute of Computer Science Heidelberg University, Germany sengstock@informatik.uni-heidelberg.de Michael Gertz Institute of