How to generate large-scale data from small-scale real-world data sets?
1 How to generate large-scale data from small-scale real-world data sets?
Gang Lu
Institute of Computing Technology, Chinese Academy of Sciences
BigDataBench Tutorial, MICRO 2014, Cambridge, UK
2 Motivation
- Benchmarking big data systems
  - The first thing is to obtain BIG data
- Obtaining REAL big data sets?
  - Large companies possess a lot of data, but confidentiality is an issue (user privacy)
  - Transferring big data sets is rather expensive
- Is it possible to use large-scale synthetic data?
3 Goals
- Generating synthetic data to satisfy the 4V properties of big data
  - Volume
  - Velocity
  - Variety
  - Veracity
- Big Data Generator Suite (BDGS)
4 Architecture of BDGS
[Figure: BDGS architecture, with parts labeled Veracity, Variety, Velocity, and Volume]
5 Veracity and Variety
- From real-world data, we can get:
6 Original size of real data sets
- Use BDGS to scale up these data sets
7 What does BDGS provide?
- Text generator
- Graph generator
- Table generator
8 Text generator
- Use LDA (Latent Dirichlet Allocation) (David M. Blei, et al.) to generate a text corpus.
- LDA is a topic model
  - It models information at the semantic level
- Widely used in machine learning and natural language processing
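9 Text generator
- How to generate a new document:
  - Select a topic randomly, following the multinomial distribution over topics (topic1, topic2, topic3 in the figure).
  - Select a word randomly, following the multinomial distribution over words under the selected topic (topic2 in the figure).
  - Add the selected word (CPU in the figure) to the document.
[Figure: Progress of generating a new document. Example words shown: machine, evaluate, big, CPU, data, mining, architecture, benchmarking, memory, system, learning.]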
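The two sampling steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BDGS implementation: the function name `generate_document`, the topic names, and all probabilities below are made up for the example, and the topic distribution is held fixed rather than drawn per document.

```python
import random


def generate_document(topic_probs, word_probs_by_topic, num_words, rng):
    """Build a document word by word: pick a topic, then pick a word under it."""
    topics = list(topic_probs)
    topic_weights = [topic_probs[t] for t in topics]
    words = []
    for _ in range(num_words):
        # Step 1: select a topic following the multinomial over topics.
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        # Step 2: select a word following the multinomial over words of that topic.
        vocab = list(word_probs_by_topic[topic])
        word_weights = [word_probs_by_topic[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


# Hypothetical toy distributions (not taken from any trained model).
topic_probs = {"topic1": 0.5, "topic2": 0.3, "topic3": 0.2}
word_probs_by_topic = {
    "topic1": {"machine": 0.4, "learning": 0.4, "mining": 0.2},
    "topic2": {"CPU": 0.5, "memory": 0.3, "architecture": 0.2},
    "topic3": {"benchmarking": 0.5, "evaluate": 0.3, "system": 0.2},
}

rng = random.Random(0)
document = generate_document(topic_probs, word_probs_by_topic, 8, rng)
print(" ".join(document))
```

Passing the random generator in explicitly keeps the sketch reproducible for a given seed.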
10 Latent Dirichlet allocation
[Figure: LDA graphical model, with labels: hyperparameters, topic proportion, topic, word, document length, number of documents in corpus]
David M. Blei, et al., Latent Dirichlet allocation, Journal of Machine Learning Research, vol. 3, 2003.
11 Latent Dirichlet allocation
- Three-level hierarchical Bayesian model
[Figure: LDA graphical model, with labels: hyperparameters, Dirichlet distribution, multinomial distribution]
David M. Blei, et al., Latent Dirichlet allocation, Journal of Machine Learning Research, vol. 3, 2003.
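In LDA, the per-document topic proportions are themselves drawn from a Dirichlet distribution governed by the hyperparameter α. A minimal sketch of that draw, using the standard construction of a Dirichlet sample from normalized Gamma variates, might look as follows. The function name `sample_dirichlet` and the α values are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)
alpha = [0.5, 0.5, 0.5]  # hypothetical hyperparameters, one entry per topic
theta = sample_dirichlet(alpha, rng)  # topic proportions for one document
print(theta)
```

The resulting `theta` sums to one and can serve as the multinomial over topics for a single document.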
12 We can use the expectation-maximization (EM) algorithm to determine α and β
13 How to use it to generate texts?
- Go into the directory of BigDataGeneratorSuite and run:
  gen_text_data.sh <model name> <number of files> <number of lines> <number of words> <output dir>
- Parameters:
  - model name: the name of the model used to generate new data (lda_wiki1w or amazonmr)
  - No. of files: the number of files to be generated
  - No. of lines: the number of lines in each file
  - No. of words: the number of words in each line
  - Output dir: the output directory
- An example:
  sh gen_text_data.sh lda_wiki1w gen_data/
Note: Installation of GSL (the GNU Scientific Library) is needed.
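To make the roles of the three size parameters concrete, here is a small Python sketch of the output layout they describe: a number of files, each with a number of lines, each with a number of words. It is not the BDGS script. The function `generate_text_files`, the file names, and the uniform word choice (standing in for the trained LDA model) are all assumptions for illustration.

```python
import random


def generate_text_files(num_files, num_lines, num_words, vocabulary, rng):
    """Return a mapping of file name to text with the requested shape."""
    files = {}
    for file_index in range(num_files):
        lines = [
            " ".join(rng.choice(vocabulary) for _ in range(num_words))
            for _ in range(num_lines)
        ]
        # Hypothetical naming scheme; the real tool may name files differently.
        files[f"part-{file_index}.txt"] = "\n".join(lines)
    return files


vocabulary = ["big", "data", "benchmarking", "system", "memory"]
files = generate_text_files(2, 3, 4, vocabulary, random.Random(7))
for name, text in files.items():
    print(name)
    print(text)
```

The total volume grows as the product of the three parameters, which is how a small model can be scaled to an arbitrarily large corpus.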
14 Graph generator
- Use the Stochastic Kronecker Graph model (Jure Leskovec, et al.) to generate graphs
- Also used by Graph 500
- Unlike Graph 500, our graph is application-specific: the stochastic Kronecker initiator is obtained from a real, representative data set of the specific application.
15 Deterministic Kronecker Graph
[Figure: adjacency matrices of a deterministic Kronecker graph; 1: has edge, 0: no edge; the structure is self-similar]
Jure Leskovec, et al., Kronecker graphs: An approach to modeling networks, Journal of Machine Learning Research, vol. 11, 2010.
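A deterministic Kronecker graph is obtained by repeatedly taking the Kronecker product of a small 0/1 initiator matrix with itself. The following sketch shows this with plain Python lists. The initiator used here is a made-up example, and the helper names are hypothetical.

```python
def kronecker_product(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [
            a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
            for j in range(len(a[0]) * cols_b)
        ]
        for i in range(len(a) * rows_b)
    ]


def kronecker_power(initiator, k):
    """k-th Kronecker power of the initiator (k >= 1)."""
    result = initiator
    for _ in range(k - 1):
        result = kronecker_product(result, initiator)
    return result


initiator = [[1, 1], [0, 1]]  # hypothetical 2x2 initiator; 1: has edge, 0: no edge
graph = kronecker_power(initiator, 3)  # 8x8 adjacency matrix
for row in graph:
    print(" ".join(str(cell) for cell in row))
```

Each quadrant of the printed matrix is either all zeros or a copy of the previous power, which is the self-similarity the slide points out.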
16 Stochastic Kronecker Graph
- Each cell holds the probability with which the corresponding edge is generated
Jure Leskovec, et al., Kronecker graphs: An approach to modeling networks, Journal of Machine Learning Research, vol. 11, 2010.
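In the stochastic variant, the initiator holds probabilities instead of 0/1 entries, and the probability of an edge (u, v) in the k-th Kronecker power is the product of initiator entries selected by the base-n digits of u and v. The sketch below samples a tiny graph this way. It is a naive quadratic-time illustration, not how a large-scale generator would work, and the initiator values are made up.

```python
import random


def edge_probability(initiator, k, u, v):
    """Probability of edge (u, v) in the k-th Kronecker power of the initiator."""
    n = len(initiator)
    p = 1.0
    for _ in range(k):
        # Multiply the initiator entry selected by the current base-n digits.
        p *= initiator[u % n][v % n]
        u //= n
        v //= n
    return p


def sample_stochastic_kronecker(initiator, k, rng):
    """Sample every possible edge independently (only feasible for small graphs)."""
    size = len(initiator) ** k
    return [
        (u, v)
        for u in range(size)
        for v in range(size)
        if rng.random() < edge_probability(initiator, k, u, v)
    ]


initiator = [[0.9, 0.5], [0.5, 0.1]]  # hypothetical probabilities
edges = sample_stochastic_kronecker(initiator, 3, random.Random(0))
print(len(edges), "edges among", len(initiator) ** 3, "nodes")
```

Checking all node pairs costs time quadratic in the number of nodes, so practical generators place edges with faster recursive schemes instead.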
17 Application-specific
[Figure: pipeline from specific real data to a specific application]
- General flow: specific real data -> estimate parameters -> scale -> big graph -> input to the specific application
- Google Web Graph -> KronFit -> Stochastic Kronecker -> synthetic web graph -> input to PageRank
- Facebook Social Graph -> KronFit -> Stochastic Kronecker -> synthetic social graph -> input to Connected components
18 How to use it to generate graphs?
- Go into the directory of BigDataGeneratorSuite and run:
  gen_kronecker_graph <output file> <matrix> <iteration> <random seed>
- Parameters:
  - output file: output file name (default: 'graph.txt')
  - matrix: initiator matrix, in MATLAB notation (default: ; )
  - iteration: number of iterations of the Kronecker product (default: 5)
  - random seed: time seed of the random algorithm (default: 0)
- An example:
  sh gen_kronecker_graph -o:../data-outfile/amazon_gen.txt -m: ; -i:23
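For readers unfamiliar with MATLAB matrix notation, rows are separated by semicolons and entries within a row by spaces. The short sketch below parses such a string and shows how the graph size follows from the iteration count, assuming that "iteration" means the Kronecker power applied to the initiator. The function name and the matrix values are hypothetical and are not the tool's defaults.

```python
def parse_matlab_matrix(text):
    """Parse a matrix written as 'a b; c d' into a list of rows of floats."""
    return [
        [float(entry) for entry in row.split()]
        for row in text.split(";")
        if row.strip()
    ]


initiator = parse_matlab_matrix("0.9 0.5; 0.5 0.1")  # made-up example values
iterations = 5
num_nodes = len(initiator) ** iterations
print(initiator, num_nodes)
```

Under that assumption, a 2x2 initiator with 23 iterations, as in the example command, would correspond to 2^23 nodes.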
19 Table generator
- Related structured tables
  - Parallel Data Generation Framework (PDGF) (Tilmann Rabl, et al.)
  - PDGF is also used by BigBench and TPC-DS
  - It uses XML configuration files for data description and distribution
- Semi-structured resumes
  - Choose a mix of fields; each field follows a Bernoulli distribution
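The idea of choosing a mix of fields, with each field following a Bernoulli distribution, can be sketched as follows: every optional field is included in a record independently with its own probability. The field names, probabilities, and placeholder values below are invented for the example and do not come from BDGS.

```python
import random


def generate_resume(field_probs, rng):
    """Include each field independently with its own Bernoulli probability."""
    return {
        name: f"<{name} value>"  # placeholder content for the chosen field
        for name, prob in field_probs.items()
        if rng.random() < prob
    }


# Hypothetical fields and inclusion probabilities.
field_probs = {"name": 1.0, "education": 0.9, "publications": 0.3, "awards": 0.1}

rng = random.Random(3)
for _ in range(3):
    print(generate_resume(field_probs, rng))
```

Because the fields are drawn independently, different records end up with different subsets of fields, which is what makes the output semi-structured.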
20 How to use it to generate tables?
- Go into the directory of BigDataGeneratorSuite and run:
  pdgf.jar -l schema.xml -l generation.xml -sf 2000
- Parameters:
  - schema.xml: the schema configuration, which describes the structure of the data and the generation rules
  - generation.xml: the generation configuration, which defines the output and post-processing of generated data
  - sf: the scale factor, a multiple by which the reference database is increased
- An example:
  java -XX:NewRatio=1 -jar pdgf.jar -l demo-schema.xml -l demo-generation.xml -c -s -sf 2000
21 Any questions?