Applying the Science of Similarity to Computer Forensics. Jesse Kornblum
1 Applying the Science of Similarity to Computer Forensics Jesse Kornblum
2 Outline Introduction Identical Data Similar Generic Data Fuzzy Hashing Image Comparisons General Approach Questions 2
3 Motivation 3
4 Identical: A == B. Difficult for humans (for large documents); easy for computers. Requires storing the original A and B: big files; could be illegal or private content.
5 Identical: Cryptographic hashing shortcut (MD5 and friends). If MD5(A) == MD5(B) then A == B* (* to within a high degree of certainty). Chance of random collision is 2^-128. Hash signatures are small. Infeasible to recover the input from the signature.
6 Identical Data: Cryptographic hashes are spoiled by even a single byte difference in the input. Very similar things have wildly different cryptographic hashes. Image courtesy of Flickr user krystalchu and used under Creative Commons license.
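A minimal sketch of the two points above, using Python's standard `hashlib`: identical inputs give identical MD5 digests, while a one-byte change gives an unrelated digest. The helper name `md5_hex` and the sample strings are illustrative assumptions, not part of the original slides.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Sample inputs (illustrative): a and b are identical, c differs by one byte.
a = b"The quick brown fox jumped over the lazy dog."
b = b"The quick brown fox jumped over the lazy dog."
c = b"The quick brown fax jumped over the lazy dog."

# Identical inputs always hash to the same digest.
print(md5_hex(a) == md5_hex(b))

# A single changed byte produces a completely different digest,
# so the digests say nothing about how similar a and c are.
print(md5_hex(a))
print(md5_hex(c))
```

Only the small digests need to be stored and compared, not the original inputs.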
7 Similar Data: What does it mean for two things to be similar?
8 Similar Data: Depends on the kind of things being compared and how they're being compared. Pictures: looks the same; same subject; same location; taken by the same camera; taken by the same person.
9 Generic Data: Don't care about the structure. Assume any differences are byte aligned: no insertions or deletions. Example: "The quick brown fox jumped over the lazy dog. How much wood could" vs. "The quick brown fax jumped over the lazy dog. How much good could"
10 Piecewise Hashing: Developed by Nick Harbour. Designed for errors in drive imaging. Found in dcfldd, dc3dd, md5deep, etc. Divide input into fixed-size sections and hash each separately. 3b152e0baa367a f6df 40c39f174a8756a2c266849b fdb a8bc69ecc46ec
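The idea of dividing the input into fixed-size sections and hashing each one can be sketched as follows. This is an illustration only, not the implementation used by dcfldd, dc3dd, or md5deep; the function names and the 16-byte block size are assumptions chosen to keep the example small.

```python
import hashlib


def piecewise_md5(data: bytes, block_size: int) -> list[str]:
    """Hash each fixed-size block of data separately (last block may be short)."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def differing_blocks(a: bytes, b: bytes, block_size: int) -> list[int]:
    """Return the indices of blocks whose hashes differ between a and b."""
    hashes_a = piecewise_md5(a, block_size)
    hashes_b = piecewise_md5(b, block_size)
    return [i for i, (x, y) in enumerate(zip(hashes_a, hashes_b)) if x != y]


first = b"The quick brown fox jumped over the lazy dog. How much wood could"
second = b"The quick brown fax jumped over the lazy dog. How much good could"

# Only the blocks containing the changed bytes produce different hashes.
print(differing_blocks(first, second, block_size=16))
```

A damaged or altered region therefore invalidates only the block hashes that cover it, rather than the single hash of the whole input.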
11 Bytewise Comparison The quick brown fox jumped over the lazy dog. How much wood could The quick brown fax jumped over the lazy dog. How much good could 97% of the data is identical 11
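The 97% figure can be checked with a short byte-by-byte comparison. The sketch below assumes both inputs have the same length, matching the byte-aligned assumption on the earlier slide; the function name is illustrative.

```python
def percent_identical(a: bytes, b: bytes) -> float:
    """Percentage of positions at which two equal-length inputs hold the same byte."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    if not a:
        return 100.0
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


first = b"The quick brown fox jumped over the lazy dog. How much wood could"
second = b"The quick brown fax jumped over the lazy dog. How much good could"

# Two of the 65 bytes differ ("fox"/"fax" and "wood"/"good").
print(f"{percent_identical(first, second):.0f}% of the data is identical")
```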
12 Bytewise Comparison: Scenario: image computer; lose control of computer; regain control, image again. 97% of the data on the drive was identical. What changed?
13 Visual Representation: Compare the data in each block (can specify block size later). If identical, add a green pixel; if different, add a red pixel. Example: "The quick brown fox jumped over the lazy dog. How much wood could" vs. "The quick brown fax jumped over the lazy dog. How much good could"
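A text-only sketch of the block map described above: one character per block, "G" standing in for a green pixel and "R" for a red one. Rendering actual pixels is left out, and the function name and the 5-byte block size are assumptions for illustration.

```python
def block_map(a: bytes, b: bytes, block_size: int) -> str:
    """Return one character per block: 'G' if the blocks match, 'R' if they differ."""
    length = max(len(a), len(b))
    cells = []
    for start in range(0, length, block_size):
        block_a = a[start:start + block_size]
        block_b = b[start:start + block_size]
        cells.append("G" if block_a == block_b else "R")
    return "".join(cells)


first = b"The quick brown fox jumped over the lazy dog. How much wood could"
second = b"The quick brown fax jumped over the lazy dog. How much good could"

# Each 5-byte block becomes one "pixel"; changed regions show up as R.
print(block_map(first, second, block_size=5))
```

The map shows where the changes are, which a single percentage cannot.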
14 No changes made 14
15 Powered on and off 97% of the data is identical 15
16 Actual Result 97% of the data is identical 16
17 Generic Data What if the data is not byte-aligned? The quick brown fox jumped over the lazy dog. How much wood could The quick brown fox jumped up and over the lazy dog. How much wood 17
18 Disclaimer: I didn't invent this math. Originally Dr. Andrew Tridgell (Samba). rsync was part of his thesis. Modified slightly for spamsum, a spam detector in his junk code folder.
19 Fuzzy Hashing: Combination of a rolling hash and a traditional hash. The rolling hash looks only at the last few bytes: F o u r s c o r e -> 83,742,221; F o u r s c o r e -> 5; F o u r s c o r e -> 90,281. When processing a file, compute the block size using the file size. If rolling hash mod block size = 1, it's a trigger point.
20 Fuzzy Hashing Example: Compute the traditional hash while processing the file. On each trigger point, record the value. Reset the traditional hash and continue. Excerpt from "The Raven" by Edgar Allan Poe. Triggers on "ood" and "ore".
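The procedure above can be sketched in a few lines. This is a deliberately simplified illustration and not the spamsum or ssdeep algorithm: the rolling hash here is just the sum of the last 7 bytes, the traditional hash is 32-bit FNV-1a, the block size is fixed at 8 instead of being computed from the file size, and each piece contributes one base64 character to the signature. All names and constants are assumptions; the trigger value of 1 follows the slide.

```python
from collections import deque

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
FNV_OFFSET = 0x811C9DC5  # 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193   # 32-bit FNV-1a prime


def fuzzy_signature(data: bytes, block_size: int = 8,
                    window: int = 7, trigger: int = 1) -> str:
    """Build a signature with one character per piece between trigger points."""
    recent = deque(maxlen=window)  # the rolling hash only sees these bytes
    piece_hash = FNV_OFFSET        # traditional hash of the current piece
    signature = []
    for byte in data:
        recent.append(byte)
        piece_hash = ((piece_hash ^ byte) * FNV_PRIME) & 0xFFFFFFFF
        if sum(recent) % block_size == trigger:
            # Trigger point: record the piece's value and reset the hash.
            signature.append(B64[piece_hash & 0x3F])
            piece_hash = FNV_OFFSET
    signature.append(B64[piece_hash & 0x3F])  # whatever follows the last trigger
    return "".join(signature)


original = (b"Deep into the darkness peering, long I stood there, wondering, "
            b"fearing. Merely this, and nothing more")
modified = original.replace(b"wondering, ",
                            b"wondering, I AM THE LIZARD KING!!!1! ")

print(fuzzy_signature(original))
print(fuzzy_signature(modified))
```

Because the trigger decision depends only on the last few bytes, trigger points after an insertion fall at the same places in the text as before, so the pieces that follow the change contribute the same signature characters.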
21 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more
22 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more
23 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more
24 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more
25 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more Signature = 32730
26 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more
27 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more
28 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more
29 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more Original Signature = New Signatures =
30 Fuzzy Hashing: Comparing Signatures. Edit distance: the number of changes needed to turn one into the other. Edit distance = 1. If the edit distance is small relative to the length, it's a fuzzy hash match.
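Edit distance here can be read as the standard Levenshtein distance (insertions, deletions, and substitutions each cost 1). The sketch below computes it with dynamic programming. The `is_fuzzy_match` helper and its 0.25 ratio are hypothetical illustrations of "small relative to length", not a threshold taken from the slides or from any tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(sig_a: str, sig_b: str, max_ratio: float = 0.25) -> bool:
    """Hypothetical rule: match if the distance is small relative to the length."""
    longest = max(len(sig_a), len(sig_b))
    if longest == 0:
        return True
    return edit_distance(sig_a, sig_b) / longest <= max_ratio


print(edit_distance("kitten", "sitting"))
print(is_fuzzy_match("ABCDEFGH", "ABCXEFGH"))
```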
31 Demonstration WARNING: EXPLICIT IMAGERY
32 Demonstration 32
33 Corrupted File MATCH! 33
34 File Footer MATCH! 34
35 File Footer MATCH! 35
36 Where Fuzzy Hashing Fails Do not match 36
37 Comparing Pictures: Visual comparisons. Easy for humans; somewhat difficult for computers. Content Based Image Retrieval (CBIR): there are companies tripping over themselves to do this; nobody has it quite nailed yet. A free product is ImgSeek. Search styles: search by drawing; search by example.
38 Search by Example Query Result Image courtesy Flickr user andrewbain and licensed under the Creative Commons 38
39 Comparing Pictures: Non-visual comparisons. EXIF information. Same camera: looks at imperfections in CCDs; requires lots of pictures and some mathy stuff.
40 General Approach: There are many ways to find similar inputs. Academically, this is a solved problem: there are working theoretical approaches. The magic lies in the implementation. 1. Feature Extraction 2. Feature Selection 3. Comparison 4. Clustering 5. Classification
41 Feature Extraction: Anything can be a feature: strings; metadata; registry key/value; display window? For programs: what do they do; "look and feel"; authorship; compilation method. Image courtesy of Flickr user doctor_keats and used under Creative Commons license.
42 Feature Extraction: Similar inputs should have similar features. Features may be represented mathematically.
43 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more Signature = 32730
44 Feature Extraction Example: Strings. Individual words don't work well (ordering issues), so use phrases. "The quick brown fox jumped over the lazy dog" gives: "the quick", "quick brown", "brown fox". Generally referred to as n-grams; the above are 2-grams.
45 Feature Extraction: Throw out stop words: common words, defined by linguistics for each language (the, and, but, of, is). In our case, throw out "the quick". Feature presence or feature count: count occurrences of each feature in a document. quick brown: 4; brown fox: 2; fox jumped: 1.
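A small sketch of n-gram extraction, stop-feature removal, and feature counting. Tokenization is a plain lowercase whitespace split, which is an assumption made for brevity; the function names and the sample sentence are illustrative.

```python
from collections import Counter


def ngrams(text: str, n: int = 2) -> list[str]:
    """Split text into lowercase words and return every run of n adjacent words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def feature_counts(text: str, n: int = 2,
                   stop_features: frozenset = frozenset()) -> Counter:
    """Count each n-gram in text, skipping any listed as a stop feature."""
    return Counter(g for g in ngrams(text, n) if g not in stop_features)


print(ngrams("The quick brown fox jumped over the lazy dog")[:3])

counts = feature_counts(
    "the quick brown fox the quick brown dog",
    stop_features=frozenset({"the quick"}),
)
print(counts.most_common())
```

Using `Counter` gives feature counts; testing `feature in counts` gives feature presence.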
46 Feature Selection: The Curse of Dimensionality: so many dimensions (features) that comparisons become too time consuming or too complex. No problem: select the important features (insert mathy stuff here). Example: "advanced persistent threat" vs. "quick brown". Depends on context.
47 Comparison: We've already covered one comparison method: edit distance. See also: Hamming distance, Manhattan distance, Dice's coefficient. See Wikipedia category: String similarity measures. And these are just for strings! See Wikipedia category: Statistical distance measures.
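For reference, the three other measures named above can each be written in a few lines. These follow the standard textbook definitions; the function names and sample inputs are illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def manhattan_distance(a, b) -> float:
    """Sum of absolute differences between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))


def dice_coefficient(a: set, b: set) -> float:
    """Dice's coefficient: 2 * |A intersect B| / (|A| + |B|), from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


print(hamming_distance("fox", "fax"))
print(manhattan_distance([4, 2, 1], [3, 2, 0]))
print(dice_coefficient({"the quick", "quick brown", "brown fox"},
                       {"quick brown", "brown fox", "fox jumped"}))
```

Hamming distance suits byte-aligned data, Manhattan distance suits feature-count vectors, and Dice's coefficient suits feature-presence sets.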
48 Clustering: Until now we have been talking about comparing one document to another (could use one document as a query). We can divide a set of documents into clusters of similar ones [Insert mathy stuff here]. Your computer can help you, of course. The real challenge for us will be representation: how do we display this information? Image from Hubble Space Telescope/NASA and is not eligible for copyright protection.
49 Classification: Classify an input as belonging to a set or not. Relevant document? Illicit imagery? Malicious program? Assisted machine learning: requires a training set; after that, can classify any new input. Performance is measured by precision and recall: precision accounts for false positives, recall for false negatives.
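Precision and recall follow the usual definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal sketch, with made-up file names standing in for classifier output and ground truth:

```python
def precision_recall(flagged: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for a set of flagged items against the truth."""
    true_positives = len(flagged & relevant)
    false_positives = len(flagged - relevant)   # flagged but not actually relevant
    false_negatives = len(relevant - flagged)   # relevant but missed
    precision = (true_positives / (true_positives + false_positives)
                 if flagged else 0.0)
    recall = (true_positives / (true_positives + false_negatives)
              if relevant else 0.0)
    return precision, recall


# Hypothetical example: the classifier flags four files, three are truly relevant.
flagged = {"a.doc", "b.doc", "c.doc", "d.doc"}
relevant = {"a.doc", "b.doc", "e.doc"}

precision, recall = precision_recall(flagged, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means many false positives to wade through; low recall means relevant items were missed.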
50 Classification: Lots of algorithms: Naïve Bayesian classifier; K-Nearest Neighbor; Locality Sensitive Hashing; Decision Trees; Neural Networks; Hidden Markov Models. See Wikipedia article on Classification (machine learning).
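As one concrete instance from the list, here is a minimal K-Nearest Neighbor sketch: a new input takes the majority label of its k closest training examples. The feature vectors, labels, and the choice of Manhattan distance are all assumptions made for illustration.

```python
from collections import Counter


def manhattan(a, b) -> float:
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_classify(query, training, k: int = 3) -> str:
    """Label query by majority vote among its k nearest training examples."""
    nearest = sorted(training, key=lambda item: manhattan(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (feature counts, label) pairs.
training = [
    ((4, 2, 1), "relevant"),
    ((3, 2, 0), "relevant"),
    ((5, 1, 1), "relevant"),
    ((0, 0, 3), "irrelevant"),
    ((0, 1, 4), "irrelevant"),
    ((1, 0, 5), "irrelevant"),
]

print(knn_classify((4, 1, 0), training))
print(knn_classify((0, 0, 4), training))
```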
51 General Approach: 1. Feature Extraction 2. Feature Selection 3. Comparison 4. Clustering 5. Classification 6. ??? 7. Profit! The ??? means: which features to extract, which similarity measure to use, which classification algorithm.
52 General Approach: Currently being used in eDiscovery to identify relevant documents.
53 Outline Introduction Identical Data Similar Generic Data Fuzzy Hashing Image Comparisons General Approach Questions 53
54 Questions? Jesse Kornblum 54
More informationQuantum Classification of Malware. John Seymour
Quantum Classification of Malware John Seymour (seymour1@umbc.edu) 2015-08-09 whoami Ph.D. student at the University of Maryland, Baltimore County (UMBC) Actively studying/researching infosec for about
More informationPattern recognition. "To understand is to perceive patterns" Sir Isaiah Berlin, Russian philosopher
Pattern recognition "To understand is to perceive patterns" Sir Isaiah Berlin, Russian philosopher The more relevant patterns at your disposal, the better your decisions will be. This is hopeful news to
More informationStatistical Debugging. Ben Liblit, University of Wisconsin Madison
Statistical Debugging Ben Liblit, University of Wisconsin Madison Bug Isolation Architecture Program Source Predicates Sampler Compiler Shipping Application Top bugs with likely causes Statistical Debugging
More informationHidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationData Structures & Database Queries in GIS
Data Structures & Database Queries in GIS Objective In this lab we will show you how to use ArcGIS for analysis of digital elevation models (DEM s), in relationship to Rocky Mountain bighorn sheep (Ovis
More informationProbability deals with modeling of random phenomena (phenomena or experiments whose outcomes may vary)
Chapter 14 From Randomness to Probability How to measure a likelihood of an event? How likely is it to answer correctly one out of two true-false questions on a quiz? Is it more, less, or equally likely
More informationMIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,
MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run
More informationAn Intuitive Introduction to Motivic Homotopy Theory Vladimir Voevodsky
What follows is Vladimir Voevodsky s snapshot of his Fields Medal work on motivic homotopy, plus a little philosophy and from my point of view the main fun of doing mathematics Voevodsky (2002). Voevodsky
More information15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018
15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science
More informationHidden Markov Models Part 1: Introduction
Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that
More informationMachine Learning Midterm Exam March 4, 2015
Name: Andrew ID: Instructions Anything on paper is OK in arbitrary shape size, and quantity. Electronic devices are not acceptable. This includes ipods, ipads, Android tablets, Blackberries, Nokias, Windows
More informationPhysics E-1ax, Fall 2014 Experiment 3. Experiment 3: Force. 2. Find your center of mass by balancing yourself on two force plates.
Learning Goals Experiment 3: Force After you finish this lab, you will be able to: 1. Use Logger Pro to analyze video and calculate position, velocity, and acceleration. 2. Find your center of mass by
More informationLEC1: Instance-based classifiers
LEC1: Instance-based classifiers Dr. Guangliang Chen February 2, 2016 Outline Having data ready knn kmeans Summary Downloading data Save all the scripts (from course webpage) and raw files (from LeCun
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Chapter 8&9: Classification: Part 3 Instructor: Yizhou Sun yzsun@ccs.neu.edu March 12, 2013 Midterm Report Grade Distribution 90-100 10 80-89 16 70-79 8 60-69 4
More information7.1 What is it and why should we care?
Chapter 7 Probability In this section, we go over some simple concepts from probability theory. We integrate these with ideas from formal language theory in the next chapter. 7.1 What is it and why should
More informationSIGNATURE SCHEMES & CRYPTOGRAPHIC HASH FUNCTIONS. CIS 400/628 Spring 2005 Introduction to Cryptography
SIGNATURE SCHEMES & CRYPTOGRAPHIC HASH FUNCTIONS CIS 400/628 Spring 2005 Introduction to Cryptography This is based on Chapter 8 of Trappe and Washington DIGITAL SIGNATURES message sig 1. How do we bind
More informationKernel Methods. Barnabás Póczos
Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels
More informationMachine Learning for Computational Advertising
Machine Learning for Computational Advertising L1: Basics and Probability Theory Alexander J. Smola Yahoo! Labs Santa Clara, CA 95051 alex@smola.org UC Santa Cruz, April 2009 Alexander J. Smola: Machine
More information1 Closest Pair of Points on the Plane
CS 31: Algorithms (Spring 2019): Lecture 5 Date: 4th April, 2019 Topic: Divide and Conquer 3: Closest Pair of Points on a Plane Disclaimer: These notes have not gone through scrutiny and in all probability
More informationCSE 446 Dimensionality Reduction, Sequences
CSE 446 Dimensionality Reduction, Sequences Administrative Final review this week Practice exam questions will come out Wed Final exam next week Wed 8:30 am Today Dimensionality reduction examples Sequence
More information