Predicting English keywords from Java Bytecodes using Machine Learning

Similar documents
Practical Introduction to Machine Learning

Applying the Science of Similarity to Computer Forensics. Jesse Kornblum

Latent Dirichlet Allocation Introduction/Overview

Using Artificial Intelligence to Train Your Chatbot.

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

Spatial Role Labeling CS365 Course Project

Naïve Bayes, Maxent and Neural Models

Gaussian Models

Midterm sample questions

Quantum Classification of Malware. John Seymour

Prenominal Modifier Ordering via MSA. Alignment

Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn

N-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

Generative Clustering, Topic Modeling, & Bayesian Inference

Naïve Bayes Classifiers

Classification & Information Theory Lecture #8

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Social Media & Text Analysis

Administration. Registration Hw3 is out. Lecture Captioning (Extra-Credit) Scribing lectures. Questions. Due on Thursday 10/6

Introduction to Deep Learning

1. Ignoring case, extract all unique words from the entire set of documents.

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

Internet Engineering Jacek Mazurkiewicz, PhD

Understanding Comments Submitted to FCC on Net Neutrality. Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong December 12, 2014

Hidden Markov Models, I. Examples. Steven R. Dunbar. Toy Models. Standard Mathematical Models. Realistic Hidden Markov Models.

From perceptrons to word embeddings. Simon Šuster University of Groningen

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

Click Prediction and Preference Ranking of RSS Feeds

Computer Science Introductory Course MSc - Introduction to Java

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert

Generative Models for Sentences

Computational Genomics

Categorization ANLP Lecture 10 Text Categorization with Naive Bayes

ANLP Lecture 10 Text Categorization with Naive Bayes

Language-based Information Security. CS252r Spring 2012

Unsupervised Anomaly Detection for High Dimensional Data

More Smoothing, Tuning, and Evaluation

CSC321 Lecture 7 Neural language models

CSC321 Lecture 9 Recurrent neural nets

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul

DEKDIV: A Linked-Data-Driven Web Portal for Learning Analytics Data Enrichment, Interactive Visualization, and Knowledge Discovery

CLASSIFICATION NAIVE BAYES. NIKOLA MILIKIĆ UROŠ KRČADINAC

Information Extraction and GATE. Valentin Tablan University of Sheffield Department of Computer Science NLP Group

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark

Collaborative Topic Modeling for Recommending Scientific Articles

CS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev

Learning Features from Co-occurrences: A Theoretical Analysis

CS 188: Artificial Intelligence. Outline

Deep Learning for Natural Language Processing. Sidharth Mudgal April 4, 2017

CLRG Biocreative V

CSCI-567: Machine Learning (Spring 2019)

Replicated Softmax: an Undirected Topic Model. Stephen Turner

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Assessing the Consequences of Text Preprocessing Decisions

Boolean and Vector Space Retrieval Models

Lecture 5 Neural models for NLP

Presented By: Omer Shmueli and Sivan Niv

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

PROBABILISTIC LATENT SEMANTIC ANALYSIS

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Naïve Bayes classification

COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017

Natural Language Processing

Driving Semantic Parsing from the World s Response

Machine Learning for natural language processing

arxiv: v2 [cs.cl] 1 Jan 2019

Machine Learning Algorithm. Heejun Kim

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

IBM Model 1 for Machine Translation

Applied Natural Language Processing

ML4NLP Multiclass Classification

ArchaeoKM: Managing Archaeological data through Archaeological Knowledge

1 Information retrieval fundamentals

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

Bayesian Networks Inference with Probabilistic Graphical Models

CSC321 Lecture 10 Training RNNs

CSC321 Lecture 16: ResNets and Attention

Machine Learning Summer 2018 Exercise Sheet 4

Essence of Machine Learning (and Deep Learning) Hoa M. Le Data Science Lab, HUST hoamle.github.io

The Noisy Channel Model and Markov Models

Representing structured relational data in Euclidean vector spaces. Tony Plate

CSC321 Lecture 15: Recurrent Neural Networks

Optical Character Recognition of Jutakshars within Devanagari Script

Introduction to Support Vector Machines

Classification, Linear Models, Naïve Bayes

Probability Review and Naïve Bayes

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Instructions for NLP Practical (Units of Assessment) SVM-based Sentiment Detection of Reviews (Part 2)

Interpreting Deep Classifiers

(COM4513/6513) Week 1. Nikolaos Aletras ( Department of Computer Science University of Sheffield

Short Note: Naive Bayes Classifiers and Permanence of Ratios

Machine Learning. CS Spring 2015 a Bayesian Learning (I) Uncertainty

Fourier Series (Dover Books On Mathematics) PDF

Lecture 14. Clustering, K-means, and EM

(S1) (S2) = =8.16

Data Aggregation with InfraWorks and ArcGIS for Visualization, Analysis, and Planning

Hidden Markov Models

Machine Learning (CS 567) Lecture 2

Transcription:

Predicting English keywords from Java Bytecodes using Machine Learning Pablo Ariel Duboue Les Laboratoires Foulab (Montreal Hackerspace) 999 du College Montreal, QC, H4C 2S3 REcon June 15th, 2012

Outline 1 2 Debian Basic Machine Learning 3 Potential Applications Advanced Machine Learning Other Issues 4

Long Term Vision private f i n a l i n t c ( i n t ) { 0 aload_0 1 g e t f i e l d org. j p c. emulator. f. v 4 i n v o k e i n t e r f a c e org. j p c. support. j. e ( ) 9 aload_0 10 g e t f i e l d org. j p c. emulator. f. i 13 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 16 aload_0 17 g e t f i e l d org. j p c. emulator. f. j 20 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 23 iconst_0 24 i s t o r e _ 2 25 iload_1 26 i f l e 128 29 aload_0 30 g e t f i e l d org. j p c. emulator. f. b 33 i n v o k e v i r t u a l org. j p c. emulator. processor. t.w( )

Long Term Vision private f i n a l i n t c ( i n t ) { 0 aload_0 1 g e t f i e l d org. j p c. emulator. f. v 4 i n v o k e i n t e r f a c e org. j p c. support. j. e ( ) 9 aload_0 10 g e t f i e l d org. j p c. emulator. f. i 13 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 16 aload_0 17 g e t f i e l d org. j p c. emulator. f. j 20 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 23 iconst_0 24 i s t o r e _ 2 25 iload_1 26 i f l e 128 29 aload_0 30 g e t f i e l d org. j p c. emulator. f. b 33 i n v o k e v i r t u a l org. j p c. emulator. processor. t.w( )

Long Term Vision private f i n a l i n t c ( i n t ) { 0 aload_0 1 g e t f i e l d org. j p c. emulator. f. v 4 i n v o k e i n t e r f a c e org. j p c. support. j. e ( ) 9 aload_0 10 g e t f i e l d org. j p c. emulator. f. i 13 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 16 aload_0 17 g e t f i e l d org. j p c. emulator. f. j 20 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 23 iconst_0 24 i s t o r e _ 2 25 iload_1 26 i f l e 128 29 aload_0 30 g e t f i e l d org. j p c. emulator. f. b 33 i n v o k e v i r t u a l org. j p c. emulator. processor. t.w( )

Long Term Vision private f i n a l i n t c ( i n t ) { 0 aload_0 1 g e t f i e l d org. j p c. emulator. f. v 4 i n v o k e i n t e r f a c e org. j p c. support. j. e ( ) 9 aload_0 10 g e t f i e l d org. j p c. emulator. f. i 13 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 16 aload_0 17 g e t f i e l d org. j p c. emulator. f. j 20 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 23 iconst_0 24 i s t o r e _ 2 25 iload_1 26 i f l e 128 29 aload_0 30 g e t f i e l d org. j p c. emulator. f. b 33 i n v o k e v i r t u a l org. j p c. emulator. processor. t.w( )

Long Term Vision private f i n a l i n t c ( i n t ) { 0 aload_0 1 g e t f i e l d org. j p c. emulator. f. v 4 i n v o k e i n t e r f a c e org. j p c. support. j. e ( ) 9 aload_0 10 g e t f i e l d org. j p c. emulator. f. i 13 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 16 aload_0 17 g e t f i e l d org. j p c. emulator. f. j 20 i n v o k e v i r t u a l org. j p c. emulator. motherboard. q. e ( ) 23 iconst_0 24 i s t o r e _ 2 25 iload_1 26 i f l e 128 29 aload_0 30 g e t f i e l d org. j p c. emulator. f. b 33 i n v o k e v i r t u a l org. j p c. emulator. processor. t.w( )

Current Status Does this work now? Not really.

Current Status Does this work now? Not really.

About the Speaker From Cordoba, Argentina. Natural Language Generation at Columbia University. PhD thesis on learning the structure of biographies. Research Scientist at IBM. DeepQA Watson Project. Independent Scientist. Member of Foulab. Interested in Crypto and RE. Attended ToorCamp in 10.

Intuitions Behind the Approach Attended REcon last year. Gap between tools and the way practitioners work. Tools: from the programming language and translators community (i.e., compilers). Practitioners: more similar to NLP practitioners, more driven by intuitions. More machine learning-driven tools?

Intuitions Behind the Approach Attended REcon last year. Gap between tools and the way practitioners work. Tools: from the programming language and translators community (i.e., compilers). Practitioners: more similar to NLP practitioners, more driven by intuitions. More machine learning-driven tools?

Objective of this Talk Share some really early results. Really share them, the data, models and scripts are available at http://keywords4bytecodes.org. Find potential users and collaborators.

Outline Debian Basic Machine Learning 1 2 Debian Basic Machine Learning 3 Potential Applications Advanced Machine Learning Other Issues 4

Using the Debian Archive Debian Basic Machine Learning apt-file search - -package-only.jar 1,400+ packages dpkg-query -p package name Look for Source field dpkg-source -x source.dsc Search for Java source files. dpkg -x binary.deb Search for jars, disassemble the methods.

Debian Basic Machine Learning Assembling the Bytecodes / Javadoc File Disassemble using jclassinfo - -disasm Dump Javadoc comments using qdox. A lightweight Java source parsing library. Heuristically match source methods to compiled methods. Normalize source code signatures to binary signatures. Final corpus: 350,000+ methods. 5.9M words. 8.4M JVM instructions.

Preprocessing Debian Basic Machine Learning What is a term is a key issue. Currently: lowercase, split on spaces and certain punctuations. Everything not alphanumeric is replaced with _ Poor choice moved to unsupervised tokenization. Discussed later. Took top 2,000 keywords after the top 100 keywords Stopword removal.

Outline Debian Basic Machine Learning 1 2 Debian Basic Machine Learning 3 Potential Applications Advanced Machine Learning Other Issues 4

Train n classifiers for large n Debian Basic Machine Learning For each of the top 2,000 keywords: Put aside the methods that contain the keyword on its Javadoc. Sample the methods that do not. Positive and negative training data. Train a classifier for each keyword. For classifier, I use dbacl. High performance, very robust tokenizer. The input to the classifier is the ascii rendering of the disassembled bytecodes.

Naïve Bayes Debian Basic Machine Learning P(A B) = P(B A)P(A) P(B) Go from observed to predicted. P(calculation fadd) = P(fadd calculation)p(fadd) P(calculation) Naïve Bayes: complete independence assumptions. Clearly false in the case of bytecode and human language. Good start. Very resilient to noise in the data. Used in spam detection.

Results Debian Basic Machine Learning Trained on 10x more negatives. Random accuracy: 1/2000 = 0.05% Because there are so many negatives, accuracy is not meaningful: For policy, we get 99.67% accuracy because it gets 70,138 true negatives right. And only 2 true positives (!) Precision and Recall true positives Precision = Recall = true positives+false positives true positives true positives+false negatives F = 2 precision recall precision+recall

Results Debian Basic Machine Learning Keyword P R F com 0.33 0.36 0.34 smartgwt 0.08 0.96 0.15 widgets 0.07 0.75 0.13 advanced 0.09 0.17 0.12 occurs 0.08 0.15 0.11 public 0.06 0.32 0.10 setting 0.12 0.08 0.09 receiver 0.08 0.11 0.09 most 0.07 0.12 0.09 show 0.04 0.12 0.06

Analysis Debian Basic Machine Learning com? smartgwt? package names Lot of signal on the invoke instruction A missing keyword does not make it wrong There s signal, more work to go...

Outline Potential Applications Advanced Machine Learning Other Issues 1 2 Debian Basic Machine Learning 3 Potential Applications Advanced Machine Learning Other Issues 4

In Reverse Engineering Potential Applications Advanced Machine Learning Other Issues Hinting Subroutines The motivating example at the beginning. Beacon identification in Software Engineering. Even if it fails, it might work in a predictable manner that can still be useful. Custom (malware) VMs Identifying which methods correspond to different VM operations (addition, jump, etc). Lack of training data? Native code.

In Security Potential Applications Advanced Machine Learning Other Issues Dalvik Word Clouds. Use dex2jar, obtain word clouds for the whole executable. Maybe the user can tell if anything looks fishy there? Flagging Suspicious Methods. Finding methods that can be described with keywords very different from the rest of the existing methods. Kullback-Leibler divergence. Lots of false positives. Can be done with dynamically generated bytecodes.

Elsewhere Potential Applications Advanced Machine Learning Other Issues Method Search Searching for a method related to certain keywords. Useful in case source code is missing. Can be evaluated against a set of queries without annotated data. Helping map similar methods. A heuristic approach to code clone detection.

Outline Potential Applications Advanced Machine Learning Other Issues 1 2 Debian Basic Machine Learning 3 Potential Applications Advanced Machine Learning Other Issues 4

Improving the Data Potential Applications Advanced Machine Learning Other Issues Clustering: reducing errors due to missing text. Predict keywords that could have been used. Aligning bytecode to Javadoc the Right Way Modifying the compiler and re-building the packages. Morfessor Unsupervised tokenization. LDA: dimensionality reduction. Finding similarities at the bytecode level.

Discriminative Learning Potential Applications Advanced Machine Learning Other Issues Naïve Bayes is generative Provides a full probabilistic model for all variables. Reversible Discriminative models only model the target variables. Many times have better performance than generative models.

Multi-task Neural Networks Potential Applications Advanced Machine Learning Other Issues Inspired from the talk by Prof. Bengio last week at Semantic Interpretation in an Actionable Context workshop. Learn about the code, then generate from there.

Outline Potential Applications Advanced Machine Learning Other Issues 1 2 Debian Basic Machine Learning 3 Potential Applications Advanced Machine Learning Other Issues 4

Other Issues Potential Applications Advanced Machine Learning Other Issues Obfuscate the Training Dalvik Beyond Keywords: Full Phrases Dynamic vs. Static features

An idea with potential. Might be too early? We will see. mailto:pablo.duboue@gmail.com http://keywords4bytecodes.org @pabloduboue DrDub on FreeNode

Acknowledgements Foulab. Danukeru. REcon organizers. Subgraph. Annie Ying. mailto:pablo.duboue@gmail.com http://keywords4bytecodes.org @pabloduboue DrDub on FreeNode