CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets

1 CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets Jaakko Hollmén, Department of Computer Science

2 Rest of the course In the first part of the course most of the variables had real values. In the last two chapters we look at discrete patterns. Chapter 8: Discrete methods for analyzing large binary datasets (frequent itemsets and breadth-first search). Chapter 9: Searching the Web for relevant pages: random walks on the Web (the hubs-and-authorities algorithm). We are only able to take a cursory look at these important areas of research.

3 Contents of this chapter

4 Large binary datasets Many rows (variables), many columns (observations). 0/1 data: does this thing occur in this observation? The cell D(m, h) of the matrix (table) D tells us whether the phenomenon described by variable m is present in observation h: D(m, h) = 1 or D(m, h) = 0.
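A minimal Matlab sketch of such a matrix (the values and names here are illustrative assumptions, not data from the course):

    % Hypothetical 3-by-4 0/1 data matrix: rows = variables, columns = observations.
    D = logical([1 0 1 1
                 0 1 1 0
                 1 1 0 1]);
    D(2, 3)   % is the phenomenon of variable 2 present in observation 3? (1 = yes)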

5 Binary data example: Words appearing in documents Observation: document. Variable: word (in its basic form). Data: D(s, d) = 1 if the word s appeared at least once in document d. Figure: Word-document matrix from the computer exercises (Round 5). Black stripe = word s appears in document d, i.e. D(s, d) = 1. A corpus of 56 documents with 5098 different words, including two intentionally generated "copies".

6 Binary data example: Web pages Observation: Web page p. Variable: Web page s. D(s, p) = 1 if and only if page s links to page p.

7 Binary data example: Mammal populations in nature Observation: 50 km × 50 km square r. Variable: mammal species l. D(l, r) = 1 if and only if species l was observed in square r.

8 Binary data example: Molecules within larger molecules Observation: molecule m (typically large). Variable: molecule p (smaller). D(p, m) = 1 if and only if molecule p exists as part of molecule m.

9 NSF 0-1 document example: Source data Abstract database: abstracts describing scientific projects funded by NSF (nsfawards.html).

10 NSF 0-1 data representation Text documents (abstracts) transformed into a 0/1 tabular format: 1 = the word appears at least once. Columns represent documents, rows represent words. Example words with indices: ..., 73:abscissa, 74:absence, 75:absent, 76:absolute, 77:absolutely, 78:absorb, 79:absorbed, ... (The slide also shows a fragment of the table with document columns #1, #2, ... and rows such as additional, genetic, studies; its 0/1 entries were lost in transcription.)

11 NSF 0-1 document example: Why this representation? From the point of view of information search, the list of words contained in a document is often enough. Fixed-format data is easier to process than variable-length strings.
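A minimal Matlab sketch of the transformation (the three toy documents and the variable names are assumptions, not the NSF pipeline):

    docs = {'the cat sat', 'the dog sat on the cat', 'dogs bark'};
    toks = cellfun(@(s) unique(strsplit(lower(s))), docs, 'UniformOutput', false);
    vocab = unique([toks{:}]);                % rows: distinct words in their basic form
    D = false(numel(vocab), numel(docs));     % columns: documents
    for d = 1:numel(docs)
        D(:, d) = ismember(vocab, toks{d});   % 1 = the word appears at least once
    end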

12 NSF 0-1 document example: Basic statistics About 128,000 documents and 30,800 words: not a very large set of documents. The table has about 128,000 × 30,800 ≈ 3.9 × 10^9 cells. Only about 0.265% of the cells (roughly 10 million) are ones, so there is no point in representing the zeros. There are approximately 81 different words in each document (1 = the word appears at least once). Each word appears in about 340 documents. The distribution is skewed: some words appear often, others very rarely. Some very common words have been omitted (a blacklist).

13 NSF 0-1 document example: Number of different words by document 1/3 Figure: X axis: document index number (about 128,000 documents in total). Y axis: number of different words in the document.

14 Number of different words by document 2/3 Figure: X axis: document index number, with documents arranged by number of different words. Y axis: number of different words in the document.

15 Number of different words by document 3/3 Figure: Number of different words per document as a histogram. Document length varies mainly within a fairly narrow range of different words.

16 NSF 0-1 document example: How many documents contain a certain word? 1/4 Figure: X axis: word index number (1, ..., 30800). Y axis: number of occurrences (1, ..., 80,000).

17 How many documents contain a certain word? 2/4 Figure: X axis: word index number (1, ..., 30800), with words arranged by frequency of occurrence. Y axis: number of occurrences (1, ..., 80,000).

18 How many documents contain a certain word? 3/4 Figure: X axis: word index number (1, ..., 30800), with words arranged by frequency of occurrence. Y axis: base-2 logarithm of the number of occurrences. Why is the graph between word indices 4000 and 16000 almost a straight line?

19 How many documents contain a certain word? 4/4 Figure: Histogram of word occurrences.

20 Binary dataset as a network Network (or graph): a set of nodes (partially) connected by edges. A binary dataset is easy to turn into a bipartite network: variables and observations are nodes, and an edge connects a variable to an observation if and only if the corresponding data value is 1.
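A minimal Matlab sketch of this construction (the matrix is the shopping-bag example of slide 24; the variable names are assumptions):

    % Nodes 1..d are variables, nodes d+1..d+n are observations.
    X = logical([1 0 1 1; 1 1 1 0; 0 0 1 1]);
    [d, n] = size(X);
    A = [zeros(d) X; X' zeros(n)];   % symmetric block adjacency matrix
    G = graph(A);                    % undirected bipartite network (R2015b or newer)
    plot(G)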

21 What could we look for in data like this? What does the data contain? What kinds of groups do the variables and observations form? How do the variables depend on each other? Etc. Analysis of variable and observation degrees (how many edges does this variable have, i.e. how many observations involve this variable): power laws, a very active area of research (see the sketch below). How do we look for small interesting sets of variables?
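Degrees come straight from row and column sums; a small sketch (the random stand-in matrix is an assumption and will not actually show a power law):

    X = rand(1000, 5000) < 0.01;                 % stand-in 0/1 data: 1000 variables, 5000 observations
    var_degree = sum(X, 2);                      % per variable: number of observations it occurs in
    obs_degree = sum(X, 1);                      % per observation: number of variables it involves
    loglog(sort(var_degree, 'descend'), '.')     % a power law would appear as a roughly straight line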

22 Searching for commonly appearing combinations of variables: Notation The data is a d × n 0/1 matrix:

    X = [ x_11 x_12 ... x_1n
          x_21 x_22 ... x_2n
          ...
          x_d1 x_d2 ... x_dn ]

A d-dimensional observation vector x has cells x_1, x_2, ..., x_d. There are many observation vectors x(1), ..., x(n). The value of variable i in observation vector x(j) is denoted by x(i, j) or x_ij.

23 Frequent itemsets: Definition If there are e.g. tens of thousands of variables, it's not possible to search for all pairwise correlations. How about variables commonly appearing together? Search for all variable pairs (a, b) such that there are at least N observations x(i) where x(a, i) = 1 and x(b, i) = 1. Generally: search for all variable sets {a_1, a_2, ..., a_k} such that there are at least N observations x(i) where x(a_1, i) = 1 and x(a_2, i) = 1 and ... and x(a_k, i) = 1. Let's call such a variable set {a_1, a_2, ..., a_k} a frequent itemset.

24 Frequent itemsets: Example We have 4 shopping bags, each of which either contains or doesn't contain oranges (a), bananas (b), and apples (c). Bag #1: 5 oranges, 2 bananas. Bag #2: 4 bananas. Bag #3: 1 orange, 2 bananas, 1 apple. Bag #4: 8 oranges, 2 apples. 1 = bag contains item, 0 = bag does not contain item (rows a, b, c; columns bags #1-#4):

    X = [ 1 0 1 1
          1 1 1 0
          0 0 1 1 ]

With threshold value N = 2, the frequent itemsets are {a}, {b}, {c}, {a, b} and {a, c}. For example, oranges (a) and apples (c) appear together ({a, c}) in at least two bags (#3 and #4).
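Support counts can be read straight off X; a small check of the example in Matlab (variable names are assumptions):

    X = logical([1 0 1 1; 1 1 1 0; 0 0 1 1]);   % rows a, b, c; columns bags #1-#4
    sum(X(1, :) & X(3, :))                      % support of {a, c}: 2 bags, frequent for N = 2
    sum(all(X([1 2 3], :), 1))                  % support of {a, b, c}: only bag #3, not frequent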

25 How many frequent itemsets are there? If the data only contained ones, all pairs and subsets of variables would fulfill the criterion. In order for the problem to be sensible, the data has to be sparse. This is typically the case: in a market-basket setting both shopping bags and products number in the thousands, but a customer only buys a small number of products at one time. The sparse matrix has far more zeros than ones. Using machine learning terminology: we look for commonly occurring positive conjunctions (a = 1 ∧ b = 1).

26 How do we search for frequent itemsets? Trivial approach: brute-force search through all subsets of variables. d variables give 2^d subsets; if there are e.g. 100 variables, there are way too many (2^100) subsets. We have to be more clever. Basic observation: if a set of variables {a_1, a_2, ..., a_k} is frequent, all its subsets are frequent. Conversely: a set of variables {a_1, a_2, ..., a_k} can be frequent only if all of its subsets are frequent. The frequentness of the subsets is a necessary but not a sufficient condition.

27 Example of frequent itemset search 1/3 Let's assume the variable sets {a}, {b}, {c}, {e}, {f}, {g} (set size 1) occur often enough (we have at least N observations of each). Then the pair {i, j} is a candidate frequent itemset for any i, j ∈ {a, b, c, e, f, g}; there are C(6, 2) = 6!/(4! 2!) = 15 candidate pairs. Let's assume that only the pairs {a, b}, {a, c}, {a, e}, {a, f}, {b, c}, {b, e}, {c, g} are frequent (size 2), i.e. they occur together at least N times. Let's combine these pairs into sets of size 3. The next step is to remove those sets which include non-frequent sets of size 2; what remains are the candidate frequent itemsets of size 3. For instance, the frequent pairs {a, b} and {a, f} can be combined into {a, b, f}, but the subset {b, f} is not frequent, so {a, b, f} is not accepted as a candidate.

28 Example of frequent itemset search 2/3 Then {a, b, c} and {a, b, e} are the only possible candidate frequent itemsets of size 3, because all their subsets are frequent. Let's assume that {a, b, c} is a frequent itemset (of size 3). We also already know that there can't be a frequent itemset of size 4 (or larger): the subsets of {a, b, c, e} include e.g. {a, c, e}, which is not frequent. In other words, the non-frequent set {a, c, e} can't be turned into a frequent itemset by adding variables. As we now know that candidates of size 4 can't exist, the search ends.

29 Example of frequent itemset search 3/3 The final step is to list all frequent itemsets C_0 ∪ C_1 ∪ C_2 ∪ C_3: {{}, {a}, {b}, {c}, {e}, {f}, {g}, {a, b}, {a, c}, {a, e}, {a, f}, {b, c}, {b, e}, {c, g}, {a, b, c}}. Here "assume" means "has been computed from the data"; in this example we did not have the actual data (matrix X).

30 Breadth-first search: Principle Breadth-first search (a.k.a. the levelwise algorithm or the a priori algorithm) is a simple but effective algorithm for frequent itemset search. First we look for frequent itemsets of size 1, then of size 2, etc. The basic observation we made earlier becomes very useful: a set can be frequent only if all its subsets are frequent.

31 Breadth-first search: Frequent itemsets of size 1 Frequent itemsets of size 1: variables a that have at least N ones in the dataset, i.e. there are at least N observations (columns) i for which x(a, i) = 1. These are easy to find by going through the data once and summing the occurrences of each variable.

32 Breadth-first search: Frequent itemsets of size 2 When we know the frequent itemsets of size 1, we form candidate sets of size 2. The set {a, b} is a candidate set when both {a} and {b} are frequent. Let's go through the data again, noting for each candidate set how many observations have a 1 for both variables a and b. Now we have found the frequent itemsets of size 2.

33 Breadth-first search: Frequent itemsets of size k 1/2 Let's assume that we know all the frequent itemsets of size k; let's call this collection C_k. Let's form candidate sets of size k + 1: sets that could be frequent. A candidate set is a set of k + 1 variables all of whose subsets are frequent; it is enough to check that all subsets of size k are frequent. Let's take all pairs of sets A = {a_1, ..., a_k} and B = {b_1, ..., b_k} from collection C_k, compute their union A ∪ B, check that the union has k + 1 elements and that all its subsets of size k are frequent (i.e. are in the collection C_k).

34 Breadth-first search: Frequent itemsets of size k 2/2 When we have found the candidate sets of size k + 1, we go through the data and record for each candidate set the number of columns where all the candidate set variables have the value 1. Now we have found the frequent itemsets of size k + 1. The algorithm is repeated until no new candidate sets can be found. A sketch of the whole algorithm is given below.
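A minimal Matlab sketch of the breadth-first search, assuming X is a d-by-n logical 0/1 matrix (rows = variables, columns = observations) and representing each itemset as a sorted row vector of variable indices. This illustrates the principle; it is not the course's reference implementation:

    function F = apriori_sketch(X, N)
        % Size-1 candidates: variables with at least N ones (one pass over the data).
        C = num2cell(find(sum(X, 2) >= N)');
        F = {};
        while ~isempty(C)
            % Count each candidate's support and keep the frequent ones.
            keep = cellfun(@(s) sum(all(X(s, :), 1)) >= N, C);
            Ck = C(keep);
            F = [F, Ck];
            % Combine frequent k-sets pairwise into (k+1)-candidates, pruning
            % every union that has a non-frequent k-subset (monotonicity).
            C = {};
            for i = 1:numel(Ck)
                for j = i+1:numel(Ck)
                    U = union(Ck{i}, Ck{j});
                    if numel(U) ~= numel(Ck{i}) + 1
                        continue
                    end
                    subs = nchoosek(U, numel(U) - 1);   % all k-subsets, one per row
                    ok = true;
                    for r = 1:size(subs, 1)
                        if ~any(cellfun(@(s) isequal(s, subs(r, :)), Ck))
                            ok = false;
                            break
                        end
                    end
                    if ok && ~any(cellfun(@(s) isequal(s, U), C))
                        C{end+1} = U;   % new candidate of size k+1
                    end
                end
            end
        end
    end

With the shopping-bag matrix of slide 24 and N = 2, apriori_sketch(logical([1 0 1 1; 1 1 1 0; 0 0 1 1]), 2) returns the index sets 1, 2, 3, [1 2] and [1 3], i.e. {a}, {b}, {c}, {a, b} and {a, c}.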

35 Breadth-first search: Available implementations Let's keep on working with the NSF data. Many implementations of frequent itemset search are available on the Web. Computer exercise TODO uses the apriori program by Bart Goethals to look for common products in cash register receipts and common words in coursework documents. Exercise TODO explores the complexity of computing frequent itemsets.

36 Frequent itemset search with a high threshold N Using a high threshold value N we find a total of 250 frequent itemsets among the variables (words) of the corpus of documents. There are 9 different three-word sets that appear in at least N documents. (The slide's table of candidate sets and frequent itemsets by set size, and the value of N, were lost in transcription.)

37 NSF example: Frequent itemset search with threshold N = 2000 Using the smaller threshold value N = 2000 we find more frequent itemsets and a much greater number of candidate sets. There is one six-word set appearing in at least 2000 documents. (The slide's table of candidate sets and frequent itemsets by set size was lost in transcription.)

38 NSF example: Number of frequent itemsets as a function of threshold N Threshold N: there have to be at least N columns with the value 1. (The slide's table of frequent itemset counts for different thresholds N was lost in transcription.)

39 NSF example: Frequent itemset of size 6 with threshold N = 2000: program, project, research, science, students, university.

40 What does the frequentness of a set of variables mean? If a set of variables occurs often, can this be due to chance? Let a and b be variables. In the data there are n_a observations where a occurs and n_b observations where b occurs; n observations in total. p(a = 1): probability that a random observation has a = 1; p(a = 1) = n_a/n and p(b = 1) = n_b/n. If we assume a and b to be independent, in how many observations do they occur together? p(a = 1 ∧ b = 1) = p(a = 1) p(b = 1), i.e. n_ab/n ≈ (n_a/n)(n_b/n), so n_ab ≈ n_a n_b / n.

41 NSF example of independence a = "computational": n_a = 7803. b = "algorithms": n_b = 6812. Independence: computational and algorithms would occur in a random column with probability (n_a/n)(n_b/n) = (7803/128000)(6812/128000) ≈ 0.0032. The expected number of such columns is n (n_a/n)(n_b/n) = n_a n_b / n ≈ 415. There are n_ab = 1974 such columns in the data. Is this significant? (We'll get back to this shortly.)
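The same arithmetic in Matlab (n = 128000 is inferred from the expected value of 415 quoted on the slide):

    n = 128000; na = 7803; nb = 6812; nab = 1974;
    expected = na * nb / n      % ≈ 415 co-occurring columns under independence
    ratio = nab / expected      % observed / expected ≈ 4.8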

42 Looking for pairs occurring more often than random with awk Input: word and word pair frequencies in NSF data. Usage: cat bag-of-words.txt | ./this.awk

    #!/usr/bin/awk -f
    BEGIN { n = 128000 }                # total number of documents in the NSF data
    {
      if (NF == 2) { f[$1] = $2 }       # input line: word, frequency
      if (NF == 3) { p[$1, $2] = $3 }   # input line: word, word, pair frequency
    }
    END {
      for (u in f)
        for (v in f)
          if (p[u, v] > 0)
            print u, v, f[u], f[v], p[u, v], p[u, v] / (f[u] * f[v] / n)
    }

43 Rough analysis of pairwise co-occurrence Ratio between observed and random co-occurrence of word pairs: p(a = 1 ∧ b = 1) / (p(a = 1) p(b = 1)) = (n_ab/n) / ((n_a/n)(n_b/n)) = n n_ab / (n_a n_b). Columns: (1) word pair (a, b), (2) n_a, (3) n_b, (4) n_ab, (5) the ratio. Examples of pairs with a large ratio (the numeric columns were lost in transcription): fellowship postdoctoral; film thin; dissertation doctoral; business innovation; polymer polymers; microscopy scanning.

44 NSF example of independence: Lots of variable pairs Some examples of variable pairs (the ratio values were lost in transcription): Pair #100: held, workshop. Pair #500: gene, identify. Pair #1000: engineering, manufacturing.

45 Significance of frequent itemsets Let's think of data generation as a repeated experiment (binomial distribution). Let's make n observations (NSF data: n = 128000). An observation is a success when both a = 1 and b = 1; this happens with probability p = n_a n_b / n^2. The number of successes is binomially distributed, Bin(n, p), with parameters n and p = n_a n_b / n^2. What is the probability that we succeed at least n_ab times? How small is the probability of getting a result so far from the expected value? If this probability is small, then there is something interesting in the found pattern.

46 Significance of frequent itemsets Estimating the probability of a binomial distribution tail: sum over i = n_ab + 1, ..., n of B(n, i, p), where B(n, i, p) is the probability that in an experiment with n repetitions we get exactly i successes when the probability of a single success is p: B(n, i, p) = C(n, i) p^i (1 - p)^(n - i). Traditional approximation methods (normal approximation, Poisson approximation) do not necessarily work here (the probability p is small, but the number of successes is large).
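A quick check of this tail with Matlab's binocdf (Statistics and Machine Learning Toolbox; n = 128000 as elsewhere in the slides):

    n = 128000; p = 7803 * 6812 / n^2;    % p ≈ 0.0032 for the computational/algorithms pair
    tail = binocdf(1974, n, p, 'upper')   % P(X > 1974) under independence

The printed value is 0: the true tail is far below the smallest positive double-precision number, which is one more reason to use a bound such as the Chernoff bound on the next slide.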

47 Significance of frequent itemsets: Chernoff bound Let X denote the number of successes in a repeated experiment with n repeats and success probability p. Then the expected value of the number of successes is np. The probability Pr(X > (1 + δ)np) is bounded (Chernoff bound): Pr(X > (1 + δ)np) < (e^δ / (1 + δ)^(1+δ))^(np) < 2^(-δnp).

48 Frequent itemset significance: Testing by simulation NSF data: a = "computational", n_a = 7803; b = "algorithms", n_b = 6812; common observations n_ab = 1974; probability under assumed independence p(a = 1)p(b = 1) = n_a n_b / n^2 ≈ 0.0032. Randomization: let's see what happens when we toss a coin n = 128000 times with probability of success p ≈ 0.0032, and count the number of successes.

49 Frequent itemset significance: Testing by simulation with Matlab

    p = 0.0032;                        % n_a*n_b/n^2
    count = zeros(10, 1);              % initialization
    for a = 1:10                       % slow method
        for i = 1:128000
            if rand(1, 1) < p
                count(a) = count(a) + 1;
            end
        end
    end
    count = zeros(1000, 1);            % initialization
    for a = 1:1000                     % faster method in Matlab
        count(a) = sum(rand(128000, 1) < p);
    end
    hist(count, 40);

50 NSF example of frequent itemset significance: Simulation result Figure: Histogram of the number of successes in 1000 repeats. The probability of 1974 successes appears to be small, but how small?

51 Frequent itemset significance: Comparison with the Chernoff bound Repeated experiment X ~ Bin(n, p) with expected number of successes np: Pr(X > (1 + δ)np) < (e^δ / (1 + δ)^(1+δ))^(np) < 2^(-δnp). Now p ≈ 0.0032, the expected value is np = 415, and the observed value is X = 1974. X > (1 + δ)np, i.e. δ = (X - np)/np = 1559/415 ≈ 3.76. The probability is very small: Pr(X > 1974) = Pr(X > (1 + 3.76) · 415) < (e^3.76 / 4.76^4.76)^415 < 2^(-δnp) = 2^(-1559).
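The bound evaluated in log2 space, a small sketch so the tiny number never has to be represented directly:

    np = 415; X = 1974;
    delta = (X - np) / np;                                             % ≈ 3.76
    log2_bound = np * (delta*log2(exp(1)) - (1 + delta)*log2(1 + delta))
    % ≈ -2193: (e^delta / (1+delta)^(1+delta))^np is about 2^(-2193),
    % comfortably below the cruder bound 2^(-1559) quoted on the slide.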

52 Problem of multiple testing 1/2 Let's assume we find 100 interesting pairs of variables (a, b) and we wish to find out (for all pairs) whether n_ab is significantly larger (or smaller) than the estimate n_a n_b / n given by the independence assumption. Let's further assume that for a single pair (a, b) this difference occurs by chance with probability 0.01. The probability that we get such a result for at least one of the 100 pairs is (why?) 1 - (1 - 0.01)^100 ≈ 0.63. When we examine the randomness of the frequency of occurrence of several patterns, we have to take into account that we are observing several patterns.
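The same arithmetic as a one-line sanity check (M and alpha are assumed names):

    M = 100; alpha = 0.01;
    p_any = 1 - (1 - alpha)^M   % ≈ 0.634: chance that at least one of the 100 tests fires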

53 Problem of multiple testing 2/2 Bonferroni correction: if we are examining the occurrence probabilities of M patterns, we have to multiply the random occurrence probability by M. In the NSF example the probability of a pattern was so small that the correction does not change things. Most of the variable pairs appear together so often that it is extremely unlikely for this to happen by chance.

54 What about larger frequent itemsets? Let's assume the variable set {a, b, c} occurs n_abc times in the data (i.e. there are n_abc columns where the variables a, b, c all have the value 1). Is this more than one should expect in the random case? What is the random case in this situation? E.g. n = 1000 and n_a = n_b = n_c = 500. If the variables are independent, n_abc should be approximately 1000 · 0.5^3 = 125. If n_abc = 250, then the set {a, b, c} is somehow unusual. This could be e.g. because a and b are strongly correlated: if n_ab = 500, then n_abc = 250 is to be expected. What is a suitable null hypothesis? E.g. the so-called maximum entropy hypothesis.

55 How hard is it to look for frequent itemsets? We have 0/1 data and a threshold N; the task is to find all frequent itemsets. The answer can be very large (in the worst case, exponential in the number of variables). In practice the a priori algorithm is efficient enough.

56 Searching for commonly occurring patterns of other kinds The same principles can be applied to other tasks: search for commonly occurring episodes, commonly occurring subgraphs (subnets), and commonly occurring subtrees. The same basic idea of the a priori algorithm works.

57 Summary In this chapter we studied how to transform discrete datasets of various forms into binary (0/1) data. Binary data can be processed and analyzed efficiently with e.g. the a priori algorithm.
