Applying the Science of Similarity to Computer Forensics. Jesse Kornblum

1 Applying the Science of Similarity to Computer Forensics Jesse Kornblum

2 Outline: Introduction; Identical Data; Similar Generic Data; Fuzzy Hashing; Image Comparisons; General Approach; Questions

3 Motivation

4 Identical: A == B. Difficult for humans (for large documents), easy for computers. Requires storing the original A and B, which may be big files or could be illegal or private content.

5 Identical: Cryptographic hashing is a shortcut (MD5 and friends). If MD5(A) == MD5(B) then A == B* (* to within a high degree of certainty). The chance of a random collision is 2^-128, or about 2.9 × 10^-39. Hash signatures are small, and the input cannot be recovered from the signature.
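
The shortcut on this slide can be sketched in a few lines of Python. This is only an illustration using the standard `hashlib` module; the helper names `md5_hex` and `probably_identical` are made up for this example and are not from the talk.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


def probably_identical(a: bytes, b: bytes) -> bool:
    """Compare two inputs by their small signatures instead of their contents."""
    return md5_hex(a) == md5_hex(b)


original = b"The quick brown fox jumped over the lazy dog."
altered = b"The quick brown fax jumped over the lazy dog."

print(probably_identical(original, bytes(original)))  # same bytes, same digest
print(md5_hex(original))
print(md5_hex(altered))  # one changed byte gives an unrelated-looking digest
```

Only the two 16-byte digests need to be stored or exchanged, which is the point of the slide: the original A and B never have to be kept side by side.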

6 Identical Data: Cryptographic hashes are spoiled by even a single byte difference in the input. Very similar things have wildly different cryptographic hashes. Image courtesy of Flickr user krystalchu and used under Creative Commons license.

7 Similar Data: What does it mean for two things to be similar?

8 Similar Data: Depends on the kind of things being compared and how they're being compared. Pictures: looks the same, same subject, same location, taken by the same camera, taken by the same person.

9 Generic Data: Don't care about the structure. Assume any differences are byte aligned: no insertions or deletions. Example: "The quick brown fox jumped over the lazy dog. How much wood could" vs. "The quick brown fax jumped over the lazy dog. How much good could"

10 Piecewise Hashing: Developed by Nick Harbour. Designed for errors in drive imaging. Found in dcfldd, dc3dd, md5deep, etc. Divide input into fixed size sections and hash separately. 3b152e0baa367a f6df 40c39f174a8756a2c266849b fdb a8bc69ecc46ec
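
A minimal sketch of piecewise hashing, assuming MD5 as the per-section hash and a caller-chosen section size. The function names `piecewise_hashes` and `differing_blocks` are hypothetical; this is not how dcfldd or md5deep are implemented internally.

```python
import hashlib
from itertools import zip_longest
from typing import List


def piecewise_hashes(data: bytes, block_size: int) -> List[str]:
    """Hash each fixed-size section of data separately."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        hashlib.md5(data[start:start + block_size]).hexdigest()
        for start in range(0, len(data), block_size)
    ]


def differing_blocks(a: bytes, b: bytes, block_size: int) -> List[int]:
    """Return the indexes of sections whose hashes do not match."""
    pairs = zip_longest(piecewise_hashes(a, block_size),
                        piecewise_hashes(b, block_size))
    return [index for index, (x, y) in enumerate(pairs) if x != y]


good_image = b"A" * 16
bad_image = b"A" * 8 + b"B" + b"A" * 7  # one corrupted byte
print(differing_blocks(good_image, bad_image, 4))  # only one section differs
```

A single bad byte spoils one section hash rather than the hash of the whole image, which is why this suits imaging errors.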

11 Bytewise Comparison: "The quick brown fox jumped over the lazy dog. How much wood could" vs. "The quick brown fax jumped over the lazy dog. How much good could": 97% of the data is identical.
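
The 97% figure can be reproduced with a short sketch. The function name `bytewise_similarity` is an assumption for illustration; it only handles equal-length inputs, matching the "no insertions or deletions" assumption above.

```python
def bytewise_similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions at which two equal-length inputs hold the same byte."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    if not a:
        return 1.0
    matching = sum(x == y for x, y in zip(a, b))
    return matching / len(a)


first = b"The quick brown fox jumped over the lazy dog. How much wood could"
second = b"The quick brown fax jumped over the lazy dog. How much good could"

# 63 of the 65 bytes match
print(f"{bytewise_similarity(first, second):.0%}")  # prints 97%
```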

12 Bytewise Comparison. Scenario: image computer; lose control of computer; regain control, image again. 97% of the data on the drive was identical. What changed?

13 Visual Representation: Compare the data in each block (can specify block size later). If identical, add a green pixel; if different, add a red pixel. Example: "The quick brown fox jumped over the lazy dog. How much wood could" vs. "The quick brown fax jumped over the lazy dog. How much good could"
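
As a rough text-only stand-in for the green and red pixels, the sketch below emits one letter per block: `G` where the blocks match and `R` where they differ. The name `block_map` and the five-byte block size are assumptions chosen for the example.

```python
def block_map(a: bytes, b: bytes, block_size: int) -> str:
    """Return one character per block: 'G' if identical, 'R' if different."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    longest = max(len(a), len(b))
    cells = []
    for start in range(0, longest, block_size):
        same = a[start:start + block_size] == b[start:start + block_size]
        cells.append("G" if same else "R")
    return "".join(cells)


first = b"The quick brown fox jumped over the lazy dog. How much wood could"
second = b"The quick brown fax jumped over the lazy dog. How much good could"
print(block_map(first, second, 5))  # prints GGGRGGGGGGGRG
```

Rendering each letter as a colored pixel gives the kind of picture shown on the following slides: where the changes are, not just how many there were.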

14 No changes made

15 Powered on and off: 97% of the data is identical

16 Actual Result: 97% of the data is identical

17 Generic Data: What if the data is not byte-aligned? Example: "The quick brown fox jumped over the lazy dog. How much wood could" vs. "The quick brown fox jumped up and over the lazy dog. How much wood"

18 Disclaimer: I didn't invent this math. Originally Dr. Andrew Tridgell (Samba, rsync); it was part of his thesis. Modified slightly for spamsum, a spam detector in his junk code folder.

19 Fuzzy Hashing: Combination of a rolling hash and a traditional hash. The rolling hash looks only at the last few bytes: F o u r s c o r e -> 83,742,221; F o u r s c o r e -> 5; F o u r s c o r e -> 90,281. When processing a file, compute block size using file size. If rolling hash mod block size = 1, it's a trigger point.
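
To make the trigger-point idea concrete, here is a minimal sketch. It assumes a deliberately simple rolling hash (the sum of the last seven bytes) in place of the real spamsum rolling hash, and it uses the slide's rule of "mod block size = 1"; actual implementations may use a different hash and a different residue. The names `WINDOW` and `trigger_points` are hypothetical.

```python
from collections import deque
from typing import List

WINDOW = 7  # assumed number of trailing bytes the rolling hash looks at


def trigger_points(data: bytes, block_size: int, window: int = WINDOW) -> List[int]:
    """Return the offsets where the rolling hash mod block_size equals 1."""
    recent = deque(maxlen=window)  # the last `window` bytes seen
    rolling = 0                    # stand-in rolling hash: sum of those bytes
    triggers = []
    for offset, byte in enumerate(data):
        if len(recent) == window:
            rolling -= recent[0]   # the oldest byte is about to fall out
        recent.append(byte)
        rolling += byte
        if rolling % block_size == 1:
            triggers.append(offset)
    return triggers


print(trigger_points(bytes(range(32)), 16))  # prints [1, 10, 26]
```

Because the rolling value depends only on the last few bytes, inserting data early in a file shifts later trigger points but does not change which byte sequences cause them.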

20 Fuzzy Hashing: Compute traditional hash while processing file. On each trigger point, record value. Reset traditional hash and continue. Example: excerpt from "The Raven" by Edgar Allan Poe; triggers on "ood" and "ore".
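
The three steps on this slide can be sketched as follows. This is not the spamsum or ssdeep code: the rolling hash is a plain sum of the last few bytes, the traditional hash is 32-bit FNV-1a, and each recorded value is reduced to one base64-style character. All of those choices, and the name `fuzzy_signature`, are assumptions for illustration.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
FNV_OFFSET = 0x811C9DC5  # 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193   # 32-bit FNV prime


def fuzzy_signature(data: bytes, block_size: int, window: int = 7) -> str:
    """Build a signature with one character per piece between trigger points."""
    signature = []
    piece_hash = FNV_OFFSET
    pending = False  # True while the current piece holds unrecorded bytes
    for offset, byte in enumerate(data):
        # Traditional hash: updated with every byte of the current piece.
        piece_hash = ((piece_hash ^ byte) * FNV_PRIME) & 0xFFFFFFFF
        pending = True
        # Stand-in rolling hash: sum of the last `window` bytes.
        rolling = sum(data[max(0, offset - window + 1):offset + 1])
        if rolling % block_size == 1:  # trigger point
            signature.append(ALPHABET[piece_hash % 64])  # record value
            piece_hash = FNV_OFFSET                      # reset and continue
            pending = False
    if pending:
        signature.append(ALPHABET[piece_hash % 64])  # bytes after the last trigger
    return "".join(signature)


text = b"Deep into the darkness peering, long I stood there, wondering, fearing"
print(fuzzy_signature(text, 8))
```

A local change alters only the characters for the pieces it touches, so the rest of the signature still lines up with the original.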

21 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

22 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

23 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

24 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more

25 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more Signature = 32730

26 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

27 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more

28 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more

29 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more Original Signature = New Signatures =

30 Fuzzy Hashing: Comparing Signatures. Edit distance: number of changes to turn one into the other. Edit distance = 1. If edit distance is small relative to length, it's a fuzzy hash match.
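
Edit distance can be computed with the standard Levenshtein dynamic-programming recurrence, sketched below. The match rule `is_fuzzy_match` and its 25% threshold are hypothetical; the slide only says the distance should be "small relative to length", and the two signatures in the usage lines are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1,     # delete char_a
                               current[j - 1] + 1,  # insert char_b
                               substitution))
        previous = current
    return previous[-1]


def is_fuzzy_match(sig_a: str, sig_b: str, max_ratio: float = 0.25) -> bool:
    """Assumed rule: match when the distance is small relative to the length."""
    longest = max(len(sig_a), len(sig_b))
    if longest == 0:
        return True
    return edit_distance(sig_a, sig_b) / longest <= max_ratio


print(edit_distance("kitten", "sitting"))    # prints 3
print(is_fuzzy_match("AbC9xQ", "AbC9zxQ"))   # one inserted character
```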

31 Demonstration WARNING: EXPLICIT IMAGERY

32 Demonstration

33 Corrupted File: MATCH!

34 File Footer: MATCH!

35 File Footer: MATCH!

36 Where Fuzzy Hashing Fails: Do not match

37 Comparing Pictures: Visual comparisons are easy for humans, somewhat difficult for computers. Content Based Image Retrieval (CBIR): there are companies tripping over themselves to do this; nobody has it quite nailed yet. A free product is ImgSeek. Search styles: search by drawing, search by example.

38 Search by Example: Query, Result. Image courtesy of Flickr user andrewbain and licensed under the Creative Commons.

39 Comparing Pictures: Non-visual comparisons. EXIF information. Same camera: looks at imperfections in CCDs; requires lots of pictures and some mathy stuff.

40 General Approach: There are many ways to find similar inputs. Academically, this is a solved problem: there are working theoretical approaches. The magic lies in the implementation. 1. Feature Extraction 2. Feature Selection 3. Comparison 4. Clustering 5. Classification

41 Feature Extraction: Anything can be a feature: strings, metadata, registry key/value, display window? For programs: what do they do, "look and feel", authorship, compilation method. Image courtesy of Flickr user doctor_keats and used under Creative Commons license.

42 Feature Extraction: Similar inputs should have similar features. Features may be represented mathematically.

43 Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore 57?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more Signature = 32730

44 Feature Extraction Example: Strings. Individual words don't work well (ordering issues). Use phrases: from "The quick brown fox jumped over the lazy dog" take "the quick", "quick brown", "brown fox". These are generally referred to as n-grams; the above are 2-grams.
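
Extracting word n-grams is short enough to show directly. The function name `word_ngrams` is made up for this sketch, and splitting on whitespace with lowercasing is an assumed tokenization; punctuation is left attached to words.

```python
from typing import List


def word_ngrams(text: str, n: int = 2) -> List[str]:
    """Return every run of n consecutive words, in order of appearance."""
    if n <= 0:
        raise ValueError("n must be positive")
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(word_ngrams("The quick brown fox", 2))
# prints ['the quick', 'quick brown', 'brown fox']
```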

45 Feature Extraction: Throw out stop words: common words, defined by linguistics for each language (the, and, but, of, is). In our case, throw out "the quick". Feature presence or feature count: count occurrences of each feature in a document, e.g. quick brown 4, brown fox 2, fox jumped 1.
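
One way to combine stop-word removal with feature counts is sketched below. The stop-word set is just the five words listed on the slide, and dropping any 2-gram that contains a stop word is an assumed reading of "throw out 'the quick'". The name `bigram_counts` is hypothetical.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "but", "of", "is"}  # the examples from the slide


def bigram_counts(text: str) -> Counter:
    """Count 2-grams, skipping any that contain a stop word."""
    words = text.lower().split()
    pairs = zip(words, words[1:])
    return Counter(
        f"{first} {second}"
        for first, second in pairs
        if first not in STOP_WORDS and second not in STOP_WORDS
    )


counts = bigram_counts("the quick brown fox and the quick brown dog")
print(counts["quick brown"])  # prints 2
print(counts["the quick"])    # prints 0, because "the" is a stop word
```

Feature presence is the same table with every nonzero count treated as 1.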

46 Feature Selection: The Curse of Dimensionality: so many dimensions (features) that comparisons become too time consuming or too complex. No problem: select the important features (insert mathy stuff here). Example: "advanced persistent threat" vs. "quick brown". Depends on context.

47 Comparison: We've already covered one comparison method: edit distance. See also: Hamming distance, Manhattan distance, Dice's coefficient. See Wikipedia category: String similarity measures. And these are just for strings! See Wikipedia category: Statistical distance measures.
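
The three measures named on this slide are each a few lines of code. The sketch below uses their textbook definitions; the function names and the choice to compute Dice's coefficient over sets of features are assumptions for this example.

```python
from typing import Sequence, Set


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def manhattan_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of absolute differences between two feature-count vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(u, v))


def dice_coefficient(a: Set[str], b: Set[str]) -> float:
    """Twice the shared features divided by the total features in both sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


print(hamming_distance("karolin", "kathrin"))                            # prints 3
print(manhattan_distance([4, 2, 1], [4, 0, 3]))                          # prints 4
print(dice_coefficient({"quick brown", "brown fox"}, {"brown fox", "fox jumped"}))  # prints 0.5
```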

48 Clustering: Until now we have been talking about comparing one document to another (could use one document as a query). We can divide a set of documents into clusters of similar ones [insert mathy stuff here]. Your computer can help you, of course. The real challenge for us will be representation: how do we display this information? Image from Hubble Space Telescope/NASA and is not eligible for copyright protection.
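
As one possible stand-in for the "mathy stuff", the sketch below groups documents with a very simple single-pass scheme: each document joins the first cluster whose first member is similar enough, or starts a new cluster. Jaccard similarity, the 0.5 threshold, and the names `jaccard` and `cluster` are all assumptions; real clustering algorithms are considerably more careful.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared features divided by all features seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(documents: Dict[str, Set[str]], threshold: float = 0.5) -> List[List[str]]:
    """Greedy clustering: compare each document to each cluster's first member."""
    clusters = []  # list of (representative features, member names)
    for name, features in documents.items():
        for representative, members in clusters:
            if jaccard(representative, features) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((features, [name]))
    return [members for _, members in clusters]


docs = {
    "report_v1": {"quick brown", "brown fox", "fox jumped"},
    "report_v2": {"quick brown", "brown fox", "fox jumped", "jumped over"},
    "invoice": {"amount due", "due date"},
}
print(cluster(docs))  # prints [['report_v1', 'report_v2'], ['invoice']]
```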

49 Classification: Classify an input as belonging to a set or not: relevant document? Illicit imagery? Malicious program? Assisted machine learning: requires a training set; after that, can classify any new input. Performance is measured by precision and recall: precision is for false positives, recall is for false negatives.
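
Precision and recall follow directly from counting true positives, false positives and false negatives. The sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN); the function name and the sample file names are invented for illustration.

```python
from typing import Set, Tuple


def precision_recall(flagged: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of flagged items."""
    true_positives = len(flagged & relevant)
    # Precision falls as false positives (flagged but not relevant) rise.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall falls as false negatives (relevant but not flagged) rise.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


flagged = {"a.doc", "b.doc", "c.doc", "d.doc"}  # what the classifier returned
relevant = {"b.doc", "c.doc", "e.doc"}          # what it should have returned
precision, recall = precision_recall(flagged, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints precision=0.50 recall=0.67
```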

50 Classification: Lots of algorithms: Naïve Bayesian classifier, K-Nearest Neighbor, Locality Sensitive Hashing, Decision Trees, Neural Networks, Hidden Markov Models. See Wikipedia article on Classification (machine learning).

51 General Approach: 1. Feature Extraction 2. Feature Selection 3. Comparison 4. Clustering 5. Classification 6. ??? 7. Profit! The ??? means: which features to extract, which similarity measure to use, which classification algorithm.

52 General Approach: Currently being used in eDiscovery to identify relevant documents.

53 Outline: Introduction; Identical Data; Similar Generic Data; Fuzzy Hashing; Image Comparisons; General Approach; Questions

54 Questions? Jesse Kornblum
