Applying the Science of Similarity to Computer Forensics. Jesse Kornblum

1 Applying the Science of Similarity to Computer Forensics Jesse Kornblum

2 Outline Introduction Identical Data Similar Generic Data Fuzzy Hashing Image Comparisons General Approach Questions 2

3 Motivation 3

4 Identical A == B Difficult for humans (for large documents) Easy for computers Requires storing the original A and B Big files Could be illegal or private content 4

5 Identical Cryptographic Hashing shortcut MD5 and friends If MD5(A) == MD5(B) then A == B* * to within a high degree of certainty Chance of random collision is 2^-128 (roughly 3 × 10^-39) Hash signatures are small Impossible to recover input from signature 5
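The hashing shortcut above can be sketched in a few lines. This is a minimal illustration using Python's standard hashlib module; the helper names md5_hex and same_content are made up for this example and are not from the talk.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def same_content(a: bytes, b: bytes) -> bool:
    """Compare two inputs by digest instead of byte by byte."""
    return md5_hex(a) == md5_hex(b)


original = b"The quick brown fox jumped over the lazy dog."
copy = bytes(original)

# Only the short digests need to be stored and compared,
# not the (possibly large or sensitive) originals.
print(md5_hex(original))
print(same_content(original, copy))
```

The digest is always 128 bits regardless of input size, which is why the signatures stay small even for very large files.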

6 Identical Data Cryptographic hashes are spoiled by even a single byte difference in the input Very similar things have wildly different cryptographic hashes Image courtesy of Flickr user krystalchu and used under Creative Commons license. 6
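A quick way to see this effect is to hash two nearly identical inputs and count how many digest characters differ. The sketch below uses the fox/fax sentences that appear later in the deck; the variable names are illustrative only.

```python
import hashlib

a = b"The quick brown fox jumped over the lazy dog. How much wood could"
b = b"The quick brown fax jumped over the lazy dog. How much good could"

digest_a = hashlib.md5(a).hexdigest()
digest_b = hashlib.md5(b).hexdigest()

# Count the hex positions at which the two digests disagree.
differing = sum(x != y for x, y in zip(digest_a, digest_b))

print(digest_a)
print(digest_b)
print(f"{differing} of {len(digest_a)} hex characters differ")
```

Two changed bytes in the input are enough to make the digests unrelated, so a cryptographic hash says nothing about how similar two inputs are.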

7 What does it mean for two things to be similar? Similar Data 7

8 Similar Data Depends on: The kind of things being compared How they're being compared Pictures Looks the same Same subject Same location Taken by the same camera Taken by the same person 8

9 Generic Data Don't care about the structure Assume any differences are byte aligned No insertions or deletions The quick brown fox jumped over the lazy dog. How much wood could The quick brown fax jumped over the lazy dog. How much good could 9

10 Piecewise Hashing Developed by Nick Harbour Designed for errors in drive imaging Found in dcfldd, dc3dd, md5deep, etc Divide input into fixed size sections and hash separately 3b152e0baa367a f6df 40c39f174a8756a2c266849b fdb a8bc69ecc46ec 10
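As a rough sketch of the idea (not the code used in dcfldd, dc3dd, or md5deep), piecewise hashing splits the input into fixed-size sections and hashes each one separately. The function names and the 16-byte block size below are assumptions chosen so the example fits the fox/fax sentences.

```python
import hashlib


def piecewise_md5(data: bytes, block_size: int) -> list[str]:
    """Hash each fixed-size section of data separately."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        hashlib.md5(data[start:start + block_size]).hexdigest()
        for start in range(0, len(data), block_size)
    ]


def differing_blocks(a: bytes, b: bytes, block_size: int) -> list[int]:
    """Return indexes of blocks whose hashes differ (equal-length inputs assumed)."""
    hashes_a = piecewise_md5(a, block_size)
    hashes_b = piecewise_md5(b, block_size)
    return [i for i, (x, y) in enumerate(zip(hashes_a, hashes_b)) if x != y]


a = b"The quick brown fox jumped over the lazy dog. How much wood could"
b = b"The quick brown fax jumped over the lazy dog. How much good could"
print(differing_blocks(a, b, 16))
```

Only the blocks containing a changed byte get a new hash, so a single bad sector in a drive image does not invalidate the whole comparison.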

11 Bytewise Comparison The quick brown fox jumped over the lazy dog. How much wood could The quick brown fax jumped over the lazy dog. How much good could 97% of the data is identical 11
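The 97% figure can be reproduced with a direct byte-by-byte comparison. This is a minimal sketch that assumes both inputs have the same length, in line with the byte-aligned assumption; percent_identical is a name invented for the example.

```python
def percent_identical(a: bytes, b: bytes) -> float:
    """Percentage of positions at which two equal-length inputs hold the same byte."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    if not a:
        return 100.0
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


a = b"The quick brown fox jumped over the lazy dog. How much wood could"
b = b"The quick brown fax jumped over the lazy dog. How much good could"
print(f"{percent_identical(a, b):.0f}% of the data is identical")
```

Here 63 of the 65 bytes match, which rounds to 97%.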

12 Bytewise Comparison Scenario: Image computer Lose control of computer Regain control, image again 97% of the data on the drive was identical What changed? 12

13 Visual Representation Compare the data in each block Can specify block size later If identical, add a green pixel If different, add a red pixel The quick brown fox jumped over the lazy dog. How much wood could The quick brown fax jumped over the lazy dog. How much good could 13
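The block map can be sketched without any imaging library by emitting one character per block instead of one pixel: G for identical (green) and R for different (red). The function name and the 16-byte block size are assumptions for illustration.

```python
def block_map(a: bytes, b: bytes, block_size: int) -> str:
    """Return one letter per block: 'G' if the blocks match, 'R' if they differ."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    cells = []
    for start in range(0, len(a), block_size):
        same = a[start:start + block_size] == b[start:start + block_size]
        cells.append("G" if same else "R")
    return "".join(cells)


a = b"The quick brown fox jumped over the lazy dog. How much wood could"
b = b"The quick brown fax jumped over the lazy dog. How much good could"
print(block_map(a, b, 16))
```

For a drive image, each letter would become a pixel, and the block size controls how much data one pixel summarizes.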

14 No changes made 14

15 Powered on and off 97% of the data is identical 15

16 Actual Result 97% of the data is identical 16

17 Generic Data What if the data is not byte-aligned? The quick brown fox jumped over the lazy dog. How much wood could The quick brown fox jumped up and over the lazy dog. How much wood 17

18 Disclaimer I didn't invent this math Originally Dr. Andrew Tridgell Samba rsync was part of his thesis Modified slightly for spamsum Spam detector in his junk code folder

19 Fuzzy Hashing Combination of a rolling hash and traditional hash Rolling hash looks only at last few bytes F o u r s c o r e -> 83,742,221 F o u r s c o r e -> 5 F o u r s c o r e -> 90,281 When processing a file, compute block size using file size If rolling hash mod block size = 1, it's a trigger point 19

20 Fuzzy Hashing Compute traditional hash while processing file On each trigger point, record value Reset traditional hash and continue Example Excerpt from "The Raven" by Edgar Allan Poe Triggers on ood and ore 20
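The steps above (rolling hash, trigger points, traditional hash reset at each trigger) can be sketched as follows. This is a simplified illustration, not the spamsum or ssdeep implementation. Everything concrete here is an assumption made for the example: the rolling hash is just the sum of the last 7 bytes, the traditional hash is 32-bit FNV-1a, the block size starts at 3 and doubles until about 64 pieces are expected, and each piece contributes one character from a 64-character alphabet. The trigger test follows the slide's wording (rolling hash mod block size = 1).

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7          # assumed: the rolling hash only sees the last 7 bytes
TARGET_LENGTH = 64  # assumed: aim for at most roughly 64 signature characters


def choose_block_size(length: int) -> int:
    """Pick a block size from the input size (assumed doubling rule)."""
    block_size = 3
    while block_size * TARGET_LENGTH < length:
        block_size *= 2
    return block_size


def fnv1a(data: bytes) -> int:
    """32-bit FNV-1a, standing in for the 'traditional' hash."""
    h = 0x811C9DC5
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h


def fuzzy_signature(data: bytes) -> str:
    """Emit one character per piece, where pieces end at trigger points."""
    block_size = choose_block_size(len(data))
    signature = []
    piece_start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Rolling hash: sum of the bytes in the trailing window.
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        if rolling % block_size == 1:  # trigger point
            piece = data[piece_start:i + 1]
            signature.append(ALPHABET[fnv1a(piece) % 64])
            piece_start = i + 1        # reset the traditional hash
    if piece_start < len(data):        # hash whatever is left at the end
        signature.append(ALPHABET[fnv1a(data[piece_start:]) % 64])
    return "".join(signature)


text = b"Deep into the darkness peering, long I stood there, wondering, fearing"
print(fuzzy_signature(text))
```

Because the rolling hash depends only on the last few bytes, trigger points before an edit are unaffected by it, and trigger points well after it line up again. That is what lets most of the signature survive an insertion, provided the block size stays the same.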

21 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

22 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

23 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

24 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more

25 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more Signature = 32730

26 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

27 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

28 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more

29 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more Original Signature = New Signatures =

30 Fuzzy Hashing Comparing Signatures Edit Distance Number of changes to turn one into the other Edit distance = 1 If edit distance is small relative to length, fuzzy hash match 30
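A minimal sketch of the comparison step, using the standard Levenshtein edit distance (insertions, deletions, substitutions). The 0.25 threshold for "small relative to length" and the sample signatures are assumptions for illustration, not values from the talk.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: fewest single-character edits turning s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(sig_a: str, sig_b: str, max_ratio: float = 0.25) -> bool:
    """Match when the edit distance is small relative to the longer signature."""
    longest = max(len(sig_a), len(sig_b))
    if longest == 0:
        return True
    return edit_distance(sig_a, sig_b) / longest <= max_ratio


# Hypothetical signatures that differ in a single character.
print(edit_distance("ABCDEF", "ABXDEF"))
print(is_fuzzy_match("ABCDEF", "ABXDEF"))
```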

31 Demonstration WARNING: EXPLICIT IMAGERY

32 Demonstration 32

33 Corrupted File MATCH! 33

34 File Footer MATCH! 34

35 File Footer MATCH! 35

36 Where Fuzzy Hashing Fails Do not match 36

37 Visual Comparisons Easy for humans Somewhat difficult for computers Comparing Pictures Content Based Image Retrieval (CBIR) There are companies tripping over themselves to do this Nobody has it quite nailed yet A free product is ImgSeek Search Styles Search by drawing Search by example 37

38 Search by Example Query Result Image courtesy Flickr user andrewbain and licensed under the Creative Commons 38

39 Non-visual comparisons EXIF information Same camera Comparing Pictures Looks at imperfections in CCDs Requires lots of pictures and some mathy stuff 39

40 There are many ways to find similar inputs Academically, this is a solved problem There are working theoretical approaches The magic lies in the implementation General Approach 1. Feature Extraction 2. Feature Selection 3. Comparison 4. Clustering 5. Classification 40

41 Feature Extraction Anything can be a feature Strings Metadata Registry key/value Display window? For Programs What do they do "Look and feel" Authorship Compilation method Image courtesy of Flickr user doctor_keats and used under Creative Commons license. 41

42 Similar inputs should have similar features Feature Extraction Features may be represented mathematically 42

43 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more Signature = 32730

44 Feature Extraction Example: Strings Individual words don't work well Ordering issues Use phrases The quick brown fox jumped over the lazy dog "the quick" "quick brown" "brown fox" Generally referred to as n-grams The above are 2-grams 44
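Extracting word n-grams takes only a few lines. This sketch lowercases the text and splits on whitespace, which is a simplifying assumption; real tokenization would also have to deal with punctuation.

```python
def word_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every run of n consecutive words, in order."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(word_ngrams("The quick brown fox", 2))
# ['the quick', 'quick brown', 'brown fox']
```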

45 Feature Extraction Throw out Stop Words Common words Defined by linguistics for each language the, and, but, of, is In our case, throw out "the quick" Feature Presence or Feature Count Count occurrences of each feature in a document quick brown 4 brown fox 2 fox jumped 1 45
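Stop-word filtering and feature counting can be sketched together. The approach below drops any 2-gram that contains a stop word, which is one reading of the slide (it discards "the quick"); removing stop words before building the n-grams would be another. The stop-word set is just the five words listed above.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "but", "of", "is"}


def feature_counts(text: str, n: int = 2) -> Counter:
    """Count word n-grams, skipping any n-gram that contains a stop word."""
    words = text.lower().split()
    counts = Counter()
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        if any(word in STOP_WORDS for word in gram):
            continue
        counts[" ".join(gram)] += 1
    return counts


counts = feature_counts("the quick brown fox and the quick brown dog")
print(counts)
```

For feature presence rather than feature count, the keys of the counter are enough: set(counts) gives the features that occur at least once.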

46 The Curse of Dimensionality Feature Selection So many dimensions (features) that comparisons become too time consuming or too complex No problem Select the important features (Insert mathy stuff here) Example: advanced persistent threat vs. quick brown Depends on context 46

47 Comparison We've already covered one comparison method Edit distance See also: Hamming distance Manhattan distance Dice's coefficient See Wikipedia category: String similarity measures And these are just for strings! See Wikipedia category: Statistical distance measures 47
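The three measures named above are short enough to sketch directly. Hamming distance is shown on equal-length strings, Manhattan distance on feature-count dictionaries, and Dice's coefficient on sets of character 2-grams; those input representations are choices made for this example.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("inputs must be the same length")
    return sum(a != b for a, b in zip(s, t))


def manhattan(u: dict, v: dict) -> int:
    """Sum of absolute differences between two feature-count vectors."""
    return sum(abs(u.get(key, 0) - v.get(key, 0)) for key in set(u) | set(v))


def dice(s: str, t: str) -> float:
    """Dice's coefficient on character 2-grams: 2|A and B| / (|A| + |B|)."""
    a = {s[i:i + 2] for i in range(len(s) - 1)}
    b = {t[i:i + 2] for i in range(len(t) - 1)}
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


print(hamming("karolin", "kathrin"))
print(manhattan({"quick brown": 4, "brown fox": 2}, {"brown fox": 1, "fox jumped": 1}))
print(dice("night", "nacht"))
```

Hamming and Manhattan are distances (0 means identical), while Dice's coefficient is a similarity (1 means identical), so they cannot be swapped without flipping the comparison.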

48 Clustering Until now talking about comparing one document to another Could use one document as a query We can divide a set of documents into clusters of similar ones [Insert mathy stuff here] Your computer can help you of course Real challenge for us will be representation How do we display this information? Image from Hubble Space Telescope/NASA and is not eligible for Copyright protection. 48

49 Classify an input as belonging to a set or not Relevant document? Illicit imagery? Malicious program? Classification Assisted Machine Learning Requires a training set After that, can classify any new input Performance measured by precision and recall Precision is for false positives Recall is for false negatives 49
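Precision and recall follow directly from the counts of true positives, false positives, and false negatives. The sketch below represents the classifier's output and the ground truth as sets of document IDs; the IDs are made up for the example.

```python
def precision_recall(flagged: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for a set of flagged items against ground truth."""
    true_positives = len(flagged & relevant)
    false_positives = len(flagged - relevant)   # flagged but not relevant
    false_negatives = len(relevant - flagged)   # relevant but missed
    precision = (true_positives / (true_positives + false_positives)
                 if flagged else 0.0)
    recall = (true_positives / (true_positives + false_negatives)
              if relevant else 0.0)
    return precision, recall


flagged = {1, 2, 3, 4}            # hypothetical documents the classifier flagged
relevant = {3, 4, 5, 6, 7, 8}     # hypothetical documents that are truly relevant
precision, recall = precision_recall(flagged, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means many false positives (an examiner wades through irrelevant hits); low recall means many false negatives (relevant items are never surfaced).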

50 Classification Lots of algorithms Naïve Bayesian classifier K-Nearest Neighbor Locality Sensitive Hashing Decision Trees Neural Networks Hidden Markov Models See Wikipedia article on Classification (machine learning) 50

51 General Approach 1. Feature Extraction 2. Feature Selection 3. Comparison 4. Clustering 5. Classification 6. ??? 7. Profit! The ??? means: Which features to extract Which similarity measure to use Which classification algorithm 51

52 General Approach Currently being used in ediscovery Identify relevant documents 52

53 Outline Introduction Identical Data Similar Generic Data Fuzzy Hashing Image Comparisons General Approach Questions 53

54 Questions? Jesse Kornblum 54


More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT) Metric Embedding of Task-Specific Similarity Greg Shakhnarovich Brown University joint work with Trevor Darrell (MIT) August 9, 2006 Task-specific similarity A toy example: Task-specific similarity A toy

More information

LAB 2 - ONE DIMENSIONAL MOTION

LAB 2 - ONE DIMENSIONAL MOTION Name Date Partners L02-1 LAB 2 - ONE DIMENSIONAL MOTION OBJECTIVES Slow and steady wins the race. Aesop s fable: The Hare and the Tortoise To learn how to use a motion detector and gain more familiarity

More information

Machine Learning (CSE 446): Neural Networks

Machine Learning (CSE 446): Neural Networks Machine Learning (CSE 446): Neural Networks Noah Smith c 2017 University of Washington nasmith@cs.washington.edu November 6, 2017 1 / 22 Admin No Wednesday office hours for Noah; no lecture Friday. 2 /

More information

4. Probability of an event A for equally likely outcomes:

4. Probability of an event A for equally likely outcomes: University of California, Los Angeles Department of Statistics Statistics 110A Instructor: Nicolas Christou Probability Probability: A measure of the chance that something will occur. 1. Random experiment:

More information

The RSA public encryption scheme: How I learned to stop worrying and love buying stuff online

The RSA public encryption scheme: How I learned to stop worrying and love buying stuff online The RSA public encryption scheme: How I learned to stop worrying and love buying stuff online Anthony Várilly-Alvarado Rice University Mathematics Leadership Institute, June 2010 Our Goal Today I will

More information

Bayes Classifiers. CAP5610 Machine Learning Instructor: Guo-Jun QI

Bayes Classifiers. CAP5610 Machine Learning Instructor: Guo-Jun QI Bayes Classifiers CAP5610 Machine Learning Instructor: Guo-Jun QI Recap: Joint distributions Joint distribution over Input vector X = (X 1, X 2 ) X 1 =B or B (drinking beer or not) X 2 = H or H (headache

More information

MACHINE LEARNING FOR GEOLOGICAL MAPPING: ALGORITHMS AND APPLICATIONS

MACHINE LEARNING FOR GEOLOGICAL MAPPING: ALGORITHMS AND APPLICATIONS MACHINE LEARNING FOR GEOLOGICAL MAPPING: ALGORITHMS AND APPLICATIONS MATTHEW J. CRACKNELL BSc (Hons) ARC Centre of Excellence in Ore Deposits (CODES) School of Physical Sciences (Earth Sciences) Submitted

More information

AP Calculus. Derivatives.

AP Calculus. Derivatives. 1 AP Calculus Derivatives 2015 11 03 www.njctl.org 2 Table of Contents Rate of Change Slope of a Curve (Instantaneous ROC) Derivative Rules: Power, Constant, Sum/Difference Higher Order Derivatives Derivatives

More information

2.4. Conditional Probability

2.4. Conditional Probability 2.4. Conditional Probability Objectives. Definition of conditional probability and multiplication rule Total probability Bayes Theorem Example 2.4.1. (#46 p.80 textbook) Suppose an individual is randomly

More information

Text Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others)

Text Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others) Text Categorization CSE 454 (Based on slides by Dan Weld, Tom Mitchell, and others) 1 Given: Categorization A description of an instance, x X, where X is the instance language or instance space. A fixed

More information

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10 PageRank Ryan Tibshirani 36-462/36-662: Data Mining January 24 2012 Optional reading: ESL 14.10 1 Information retrieval with the web Last time we learned about information retrieval. We learned how to

More information

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding

More information

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Lecture - 17 K - Nearest Neighbor I Welcome to our discussion on the classification

More information

md5bloom: Forensic Filesystem Hashing Revisited

md5bloom: Forensic Filesystem Hashing Revisited DIGITAL FORENSIC RESEARCH CONFERENCE md5bloom: Forensic Filesystem Hashing Revisited By Vassil Roussev, Timothy Bourg, Yixin Chen, Golden Richard Presented At The Digital Forensic Research Conference DFRWS

More information

Quantum Classification of Malware. John Seymour

Quantum Classification of Malware. John Seymour Quantum Classification of Malware John Seymour (seymour1@umbc.edu) 2015-08-09 whoami Ph.D. student at the University of Maryland, Baltimore County (UMBC) Actively studying/researching infosec for about

More information

Pattern recognition. "To understand is to perceive patterns" Sir Isaiah Berlin, Russian philosopher

Pattern recognition. To understand is to perceive patterns Sir Isaiah Berlin, Russian philosopher Pattern recognition "To understand is to perceive patterns" Sir Isaiah Berlin, Russian philosopher The more relevant patterns at your disposal, the better your decisions will be. This is hopeful news to

More information

Statistical Debugging. Ben Liblit, University of Wisconsin Madison

Statistical Debugging. Ben Liblit, University of Wisconsin Madison Statistical Debugging Ben Liblit, University of Wisconsin Madison Bug Isolation Architecture Program Source Predicates Sampler Compiler Shipping Application Top bugs with likely causes Statistical Debugging

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Data Structures & Database Queries in GIS

Data Structures & Database Queries in GIS Data Structures & Database Queries in GIS Objective In this lab we will show you how to use ArcGIS for analysis of digital elevation models (DEM s), in relationship to Rocky Mountain bighorn sheep (Ovis

More information

Probability deals with modeling of random phenomena (phenomena or experiments whose outcomes may vary)

Probability deals with modeling of random phenomena (phenomena or experiments whose outcomes may vary) Chapter 14 From Randomness to Probability How to measure a likelihood of an event? How likely is it to answer correctly one out of two true-false questions on a quiz? Is it more, less, or equally likely

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

An Intuitive Introduction to Motivic Homotopy Theory Vladimir Voevodsky

An Intuitive Introduction to Motivic Homotopy Theory Vladimir Voevodsky What follows is Vladimir Voevodsky s snapshot of his Fields Medal work on motivic homotopy, plus a little philosophy and from my point of view the main fun of doing mathematics Voevodsky (2002). Voevodsky

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information

Machine Learning Midterm Exam March 4, 2015

Machine Learning Midterm Exam March 4, 2015 Name: Andrew ID: Instructions Anything on paper is OK in arbitrary shape size, and quantity. Electronic devices are not acceptable. This includes ipods, ipads, Android tablets, Blackberries, Nokias, Windows

More information

Physics E-1ax, Fall 2014 Experiment 3. Experiment 3: Force. 2. Find your center of mass by balancing yourself on two force plates.

Physics E-1ax, Fall 2014 Experiment 3. Experiment 3: Force. 2. Find your center of mass by balancing yourself on two force plates. Learning Goals Experiment 3: Force After you finish this lab, you will be able to: 1. Use Logger Pro to analyze video and calculate position, velocity, and acceleration. 2. Find your center of mass by

More information

LEC1: Instance-based classifiers

LEC1: Instance-based classifiers LEC1: Instance-based classifiers Dr. Guangliang Chen February 2, 2016 Outline Having data ready knn kmeans Summary Downloading data Save all the scripts (from course webpage) and raw files (from LeCun

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Chapter 8&9: Classification: Part 3 Instructor: Yizhou Sun yzsun@ccs.neu.edu March 12, 2013 Midterm Report Grade Distribution 90-100 10 80-89 16 70-79 8 60-69 4

More information

7.1 What is it and why should we care?

7.1 What is it and why should we care? Chapter 7 Probability In this section, we go over some simple concepts from probability theory. We integrate these with ideas from formal language theory in the next chapter. 7.1 What is it and why should

More information

SIGNATURE SCHEMES & CRYPTOGRAPHIC HASH FUNCTIONS. CIS 400/628 Spring 2005 Introduction to Cryptography

SIGNATURE SCHEMES & CRYPTOGRAPHIC HASH FUNCTIONS. CIS 400/628 Spring 2005 Introduction to Cryptography SIGNATURE SCHEMES & CRYPTOGRAPHIC HASH FUNCTIONS CIS 400/628 Spring 2005 Introduction to Cryptography This is based on Chapter 8 of Trappe and Washington DIGITAL SIGNATURES message sig 1. How do we bind

More information

Kernel Methods. Barnabás Póczos

Kernel Methods. Barnabás Póczos Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels

More information

Machine Learning for Computational Advertising

Machine Learning for Computational Advertising Machine Learning for Computational Advertising L1: Basics and Probability Theory Alexander J. Smola Yahoo! Labs Santa Clara, CA 95051 alex@smola.org UC Santa Cruz, April 2009 Alexander J. Smola: Machine

More information

1 Closest Pair of Points on the Plane

1 Closest Pair of Points on the Plane CS 31: Algorithms (Spring 2019): Lecture 5 Date: 4th April, 2019 Topic: Divide and Conquer 3: Closest Pair of Points on a Plane Disclaimer: These notes have not gone through scrutiny and in all probability

More information

CSE 446 Dimensionality Reduction, Sequences

CSE 446 Dimensionality Reduction, Sequences CSE 446 Dimensionality Reduction, Sequences Administrative Final review this week Practice exam questions will come out Wed Final exam next week Wed 8:30 am Today Dimensionality reduction examples Sequence

More information