Sampling. Everything Data CompSci Spring 2014
1 Sampling Everything Data CompSci Spring 2014
2 Announcements (Thu. Mar 26) Homework #11 will be posted by noon tomorrow.
3 Outline Simple Random Sampling Means & Proportions Importance Sampling Counting Δs in a graph Graph sampling
4 Sampling Using a fraction of the available data to make inferences about the whole dataset Why? Dataset is very large Cost associated with data acquisition is high Computing the answer on the entire dataset is time consuming
5 Population Population: a dataset D with N records Population Statistics Mean: E.g., average weight of newborns in the US Count: E.g., number of votes cast for a presidential candidate Proportion: E.g., fraction of pages on the Web that are spam
6 Simple random sample (SRS) SRS(D, n) is a sample of n records where each record is in the sample with equal probability P(record in sample) = ? # possible samples = ?
7 Simple random sample (SRS) SRS(D, n) is a sample of n records where each record is in the sample with equal probability P(record in sample) = n/N # possible samples = C(N, n) = N! / {(N-n)! n!}
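A minimal sketch of drawing an SRS and empirically checking the inclusion probability n/N; the record IDs, sample size, and trial count below are illustrative, not from the slides:

```python
import random

def srs(records, n):
    """Simple random sample: every size-n subset is equally likely."""
    return random.sample(records, n)

# Toy check that P(record in sample) is about n/N (numbers are illustrative).
N, n, trials = 100, 10, 20_000
records = list(range(N))
hits = sum(1 for _ in range(trials) if 0 in srs(records, n))
print(hits / trials)  # close to n/N = 0.1
```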
8 Population vs Sample Will a statistic (mean/count/proportion) computed on the sample be the same as that computed on the population? Not necessarily
9 Population vs Sample Proportion of even numbers in the set of all natural numbers is 0.5 Sample may contain only odd numbers. But the probability of getting such a sample is tiny. P(sample of n numbers are all odd) = 2^(-n)
10 Sampling Distribution Let μ be a population statistic. Let μ_S be the same statistic on the sample. Sampling Distribution: Probability distribution of μ_S obtained by considering all possible samples of size n.
11 Expected Sample Statistic For μ = mean and proportion: E[μ_S] = μ For μ = count: E[μ_S] = nμ / N
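Since a count computed on the sample has expectation nμ/N, an unbiased estimate of a population count scales the sample count by N/n. A small illustrative check (the voter data below is made up, not from the slides):

```python
import random

# Made-up population: N = 10,000 voters, 3,800 of whom voted for candidate A.
N = 10_000
population = [1] * 3_800 + [0] * (N - 3_800)

n = 500
sample = random.sample(population, n)

sample_count = sum(sample)               # E[sample_count] = n * 3800 / N
estimated_count = sample_count * N / n   # scale back up to the population
print(sample_count, estimated_count)     # estimate should be near 3,800
```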
12 But what about your sample? We would like to say: μ_S for a large (1-δ) fraction of all samples is within ε of μ Additive (aka confidence interval): P[|μ_S - μ| < ε] > 1-δ Multiplicative: P[(1-ε)μ < μ_S < (1+ε)μ] > 1-δ
13 Confidence Intervals Let μ = population mean Let σ = population standard deviation Let σ_S = standard deviation of μ_S (aka standard error) σ_S = σ / √n
14 Confidence Intervals By central limit theorem, sampling distribution is close to normal distribution (for n > 25)
15 Confidence Intervals σ_S = σ / √n + sampling distribution ≈ Normal ⇒ For at least 95% of samples, |μ_S - μ| < 2σ / √n when n > 25
16 Confidence intervals for proportions p: proportion in the population p_S: proportion in the sample At least 95% of all samples have |p_S - p| < 2√(p(1-p)/n) when np > 5 and n(1-p) > 5
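A quick simulation of the coverage claim, assuming the filled-in bound |p_S - p| < 2√(p(1-p)/n); the population proportion, sample size, and trial count below are illustrative:

```python
import random
import math

p, N, n, trials = 0.3, 100_000, 400, 2_000
population = [1] * int(p * N) + [0] * (N - int(p * N))
margin = 2 * math.sqrt(p * (1 - p) / n)   # half-width of the 95% interval

covered = 0
for _ in range(trials):
    sample = random.sample(population, n)
    p_s = sum(sample) / n
    if abs(p_s - p) < margin:
        covered += 1
print(covered / trials)  # should be roughly 0.95 or higher
```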
17 Summary of Simple Random Sampling All of statistics involves using a sample to learn properties of a population Sampling distribution helps connect the sample statistic to the population statistic In expectation, mean/proportion of sample equals mean/proportion of population Confidence intervals help us decide the number of samples needed.
18 Outline Simple Random Sampling Means & Proportions Biased Sampling Counting Δs in a graph Graph sampling
19 Counting problems Triangle counting: Find the number of triplets of vertices (a, b, c) such that (a,b), (b,c) and (c,a) are edges. Advertising contracts: Need 1 million impressions that satisfy [Male, 15-35, CA] OR [Male, 5-25, TX] Historically, were a million such individuals seen?
20 Counting can be hard Triangle counting (N = # nodes, M = # edges, d_max = max degree): Naïve Method: O(N^3) Look at every triple (a,b,c) and check for Δ Best known methods: O(d_max^2 N) or O(M^1.5) Not efficient for large graphs Twitter 2009: N = 54,981,152, M = 1,963,263,821, d_max > 3 million
21 Sampling to the rescue Suppose S is the set whose size we want to estimate Let U be some universe such that S is a subset of U
22 Monte Carlo Method For i = 1 to n Choose y from U uniformly at random Check whether y is in S Let X_i = 1 if y is in S, and 0 otherwise Return: Ĉ = |U| · (Σ_i X_i) / n Stop when the estimated count converges
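A minimal sketch of the Monte Carlo estimator Ĉ = |U| · (Σ X_i)/n, with the uniform sampler and the membership test passed in as functions; the function names and the toy task are my own, not from the slides:

```python
import random

def monte_carlo_count(sample_from_universe, is_in_set, universe_size, n):
    """Estimate |S| as |U| times the fraction of n uniform samples that land in S."""
    hits = sum(1 for _ in range(n) if is_in_set(sample_from_universe()))
    return universe_size * hits / n

# Toy use: estimate how many integers in 1..1,000,000 are divisible by 7.
U_size = 1_000_000
estimate = monte_carlo_count(
    sample_from_universe=lambda: random.randint(1, U_size),
    is_in_set=lambda y: y % 7 == 0,
    universe_size=U_size,
    n=50_000,
)
print(estimate)  # true answer is 142,857
```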
23 When to use Monte Carlo Method Easy to uniformly sample from U Easy to check whether sample is in S Estimate Ĉ converges with a small number of samples P[(1-ε)|S| < Ĉ < (1+ε)|S|] > (1-δ)
24 Triangle Counting U = set of all triples |U| = N^3. # Samples needed for convergence: n > (|U| / |S|) · (3/ε^2) · ln(2/δ) Number of Δs can be much smaller than N^3, so the number of samples needed can be huge. No better than the naïve algorithm.
25 Triangle Counting |U| = |T0| + |T1| + |T2| + |T3| Ti: triples that have i edges amongst them S = T3 n > ((|T0| + |T1| + |T2| + |T3|) / |T3|) · (3/ε^2) · ln(2/δ) Most triples are in T0. Can we avoid sampling any triple in T0?
26 Biased sampling Sample an edge (a,b) uniformly at random Sample a node that is not a or b at random Universe does not contain triples from T0 Universe contains all triples from T1 Universe contains all triples from T2, and they are twice as likely! Universe contains all triples from T3, and they are three times as likely!
27 Triangle counting 2.0 Sample an edge (a,b) uniformly at random Sample a node that is not a or b at random |U| = |T1| + 2|T2| + 3|T3| S = T3 n > ((|T1| + 2|T2| + 3|T3|) / (3|T3|)) · (3/ε^2) · ln(2/δ)
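A sketch of the edge-based estimator described above: sample an edge, sample a third node, check whether the pair closes a triangle, scale by |U| = M(N-2), and divide by 3 because each triangle is hit through each of its three edges. The graph representation and function names are illustrative assumptions, not from the slides:

```python
import random

def estimate_triangles(adj, edges, n_samples):
    """adj: dict node -> set of neighbours (undirected); edges: list of (a, b)."""
    N, M = len(adj), len(edges)
    nodes = list(adj)
    hits = 0
    for _ in range(n_samples):
        a, b = random.choice(edges)        # uniform random edge
        c = random.choice(nodes)           # uniform random node that is not a or b
        while c == a or c == b:
            c = random.choice(nodes)
        if c in adj[a] and c in adj[b]:    # (a, b, c) has all three edges
            hits += 1
    universe_size = M * (N - 2)            # |U| = |T1| + 2|T2| + 3|T3|
    return universe_size * (hits / n_samples) / 3   # each Δ is counted 3 times

# Toy graph: a 4-clique has exactly 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
adj = {v: set() for v in range(4)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
print(estimate_triangles(adj, edges, 50_000))  # should be close to 4
```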
28 Summary of Counting Counting exactly can be time consuming Monte Carlo method: pick samples from a universe containing the set of interest, and approximate the count using samples. Simple random sample may need many samples. Biased sampling can help.
29 Outline Simple Random Sampling Means & Proportions Biased Sampling Counting Δs in a graph Graph sampling
30 Problem Given an input graph G = (V, E), construct a subgraph H = (V', E'), where V' and E' are subsets of V and E, resp. Want H to have the same properties as G
31 Properties of interest Density of the graph (# edges / # nodes) Degree distribution Distance distribution Clustering coefficient distribution
32 How to Sample? On edges? On nodes?
33 Edge Sampling For every edge in the original graph, retain it with some probability p.
34 Edge Sampling Resulting graph is extremely sparse. Diameter is very large Clustering coefficient is very small Nodes with high degree in the original graph also have high degree in the sampled graph. Not a random sample on nodes Degree distribution is not preserved
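A minimal sketch of edge sampling as described on slide 33: keep each edge independently with probability p. The edge-list representation and names are illustrative:

```python
import random

def edge_sample(edges, p):
    """Keep each edge of the original graph independently with probability p."""
    return [e for e in edges if random.random() < p]

# Toy use: keep roughly 10% of the edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(edge_sample(edges, p=0.1))
```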
35 Node Sampling Choose V' as a simple random sample of V Induced Subgraph: E' = set of edges (a,b) that appear in E, such that both a and b are in V'
36 Node Sampling Degree distribution looks similar Resulting graph is sparser compared to the original graph Clustering coefficient is much smaller When nodes are sampled uniformly at random, we are unlikely to keep triples that form triangles.
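A sketch of node sampling with the induced subgraph, as in slide 35; the graph representation and names are my own:

```python
import random

def node_sample(nodes, edges, n):
    """Simple random sample of n nodes, plus the subgraph induced on them."""
    kept = set(random.sample(nodes, n))
    induced_edges = [(a, b) for (a, b) in edges if a in kept and b in kept]
    return kept, induced_edges

# Toy use on a 5-node path graph.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(node_sample(nodes, edges, n=3))
```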
37 Random Walk Sampling Pick a starting node uniformly at random Perform a random walk With probability (1-c), choose an outgoing edge uniformly at random With probability c, jump back to the starting node Stop when enough nodes have been visited, and compute the induced subgraph
38 Random Walk Sampling Is also biased toward high-degree nodes So the degree distribution will not look very similar. Preserves the clustering coefficient
39 Random Jump Sampling Pick a starting node uniformly at random Perform a random walk with jumps With probability (1-c), choose an outgoing edge uniformly at random With probability c, jump to an arbitrary node To avoid getting stuck in dead-ends in directed graphs.
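A sketch covering both variants on slides 37 and 39: a walk that, with probability c, either restarts at the starting node (random walk sampling) or jumps to an arbitrary node (random jump sampling). The adjacency representation, parameter names, and stopping budget are illustrative assumptions:

```python
import random

def random_walk_sample(adj, budget, c=0.15, jump_anywhere=False):
    """Walk until `budget` distinct nodes are visited, then return the
    subgraph induced on the visited nodes. adj: dict node -> list of out-neighbours."""
    nodes = list(adj)
    start = random.choice(nodes)
    current, visited = start, {start}
    while len(visited) < budget:
        if random.random() < c or not adj[current]:
            # jump: back to the start (walk variant) or to any node (jump variant)
            current = random.choice(nodes) if jump_anywhere else start
        else:
            current = random.choice(adj[current])  # follow a uniform outgoing edge
        visited.add(current)
    induced = [(a, b) for a in visited for b in adj[a] if b in visited]
    return visited, induced

# Toy directed graph with a dead end at node 3.
adj = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
print(random_walk_sample(adj, budget=3, jump_anywhere=True))
```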
40 Summary of Graph Sampling Graph sampling is not like sampling from tables Hard to preserve all statistics using a single sample. Node sampling preserves the shape of the degree distribution Random walk sampling preserves the clustering in the graph.