Assignment 3: Chapter 2 & 3 (2.6, 3.8)


Neha Aggarwal
Comp 578 Data Mining, Fall

2.6 Q.18 This exercise compares and contrasts some similarity and distance measures.

(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

Ans. Hamming distance: the Hamming distance is the number of bit positions at which the two vectors differ. Here x and y differ in the 4th, 7th, and 10th positions, so the Hamming distance is 3.

Jaccard measure: let f_ij denote the number of positions at which x has value i and y has value j. Then
f_00 = 5, f_01 = 1, f_10 = 2, f_11 = 2
J = f_11 / (f_01 + f_10 + f_11) = 2/5 = 0.4
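As a quick check, here is a minimal Python sketch (assuming the two vectors given above) that computes both measures:

# Hamming distance and Jaccard similarity for two binary vectors
x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

# Hamming distance: number of positions where the bits differ
hamming = sum(a != b for a, b in zip(x, y))

# Jaccard similarity: f11 / (f01 + f10 + f11); 0-0 matches are ignored
f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
jaccard = f11 / (f01 + f10 + f11)

print(hamming)   # 3
print(jaccard)   # 0.4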

(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain.

Ans. Hamming distance is a dissimilarity measure that counts the number of positions at which the two vectors differ. For binary vectors this amounts to an XOR operation: it counts the positions whose values are (0,1) or (1,0). In part (a), the Hamming distance is 3.

The Simple Matching Coefficient (SMC) is a similarity measure that is useful when the data are symmetric, i.e., when both 0 and 1 values carry information. SMC counts the bit positions that are equal. From the example in (a),
SMC = (f_11 + f_00) / (f_01 + f_10 + f_11 + f_00) = 7/10

Although SMC is a similarity measure, it can easily be turned into a dissimilarity measure as 1 - SMC; in the example above this gives 3/10. Both the Hamming distance and SMC can therefore be used to measure the dissimilarity between vectors. The only difference is that 1 - SMC expresses the number of mismatching bits as a fraction of the total number of bits, whereas the Hamming distance simply counts the mismatching bits. The Hamming distance is therefore the approach more similar to SMC.

Jaccard measure: assume that x and y describe two transactions, with binary attributes where 0 means an item was not purchased and 1 means it was purchased. When measuring the similarity between x and y, an attribute where both values are 0 carries little information, since 0 only signifies that the item was not purchased. Because the number of items a customer does not purchase far outweighs the number purchased, the Jaccard measure discards these (0,0) matches, which also reduces computation.

Cosine similarity is mostly used to measure document similarity. Since two documents are unlikely to contain many of the same words, similarity should not depend on the number of shared 0 values. Both the Jaccard measure and cosine similarity ignore 0-0 matches, so Jaccard is the approach more similar to the cosine measure; cosine similarity, however, can also handle non-binary vectors.

(c) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

Ans. The Hamming distance between two vectors is the number of positions at which the two values differ; in other words, it is the minimum number of bits that must be changed to convert one binary vector into the other. The Jaccard measure is a similarity measure that ignores 0-0 matches, i.e., it treats the absence of an attribute from both vectors as carrying no real information. Although different species share many genes, their gene structures still differ considerably, and different species may have different numbers of genes. The Jaccard measure ignores the genes that are absent from both organisms and counts only the matches between the two gene vectors, whereas the Hamming distance mainly reflects how the gene structures differ. Thus, the similarity between organisms of different species is more accurately represented by the Jaccard similarity.
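To make the relationship between SMC and the Hamming distance concrete, a small sketch using the same vectors as in part (a):

x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
n = len(x)

matches = sum(a == b for a, b in zip(x, y))    # f11 + f00 = 7
smc = matches / n                              # 7/10 = 0.7, so 1 - SMC = 3/10
hamming = sum(a != b for a, b in zip(x, y))    # 3
print(smc, hamming, hamming == n - matches)    # 0.7 3 True

In other words, the Hamming distance is just the count of mismatching bits, and 1 - SMC is the same count divided by the total number of bits.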

(d) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Ans. The Hamming distance would be a good measure for comparing the genetic makeup of two organisms of the same species. Human beings share more than 99.9% of the same genes, so the useful information about what sets two people apart lies in the differences in their gene structure. The Jaccard similarity would be dominated by the overwhelming number of matching bits and would be close to 1 for any pair of humans, whereas the Hamming distance ignores the matching bits and directly counts the differences between the two genomes.

2.6 Q.19 For the following vectors, x and y, calculate the indicated similarity or distance measures. The formulas for the different measures are:

Cosine similarity
cos(x, y) = (x . y) / (||x|| ||y||), where x . y = sum_{k=1..n} x_k y_k, ||x|| = sqrt(sum_{k=1..n} x_k^2) = sqrt(x . x), and ||y|| = sqrt(y . y)

Correlation
corr(x, y) = s_xy / (s_x s_y), where
s_xy = (1/(n-1)) sum_{k=1..n} (x_k - mean(x)) (y_k - mean(y))
s_x = sqrt( (1/(n-1)) sum_{k=1..n} (x_k - mean(x))^2 )
s_y = sqrt( (1/(n-1)) sum_{k=1..n} (y_k - mean(y))^2 )

Jaccard coefficient
J = f_11 / (f_01 + f_10 + f_11)

Euclidean distance
d(x, y) = sqrt( sum_{k=1..n} (x_k - y_k)^2 )
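Before working through the parts, here is a small set of Python helper functions implementing the formulas above (a sketch; the function names are my own):

import math

def cosine(x, y):
    # cos(x, y) = (x . y) / (||x|| ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def correlation(x, y):
    # corr(x, y) = s_xy / (s_x * s_y); undefined (returned as nan) if a std is 0
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return float('nan') if s_x == 0 or s_y == 0 else s_xy / (s_x * s_y)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard(x, y):
    # binary vectors only: f11 / (f01 + f10 + f11) = f11 / (n - f00)
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return f11 / (len(x) - f00)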

(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean

Cosine similarity: x . y = 1*2 + 1*2 + 1*2 + 1*2 = 8; ||x|| = sqrt(1+1+1+1) = 2; ||y|| = sqrt(4+4+4+4) = 4.
CS = 8 / (2 * 4) = 1

Correlation: mean(x) = 1, mean(y) = 2.
s_xy = (1/3)[(1-1)(2-2) + (1-1)(2-2) + (1-1)(2-2) + (1-1)(2-2)] = 0
Since s_x = s_y = 0 as well, the correlation is 0/0 and is therefore undefined.

Euclidean distance = sqrt((1-2)^2 + (1-2)^2 + (1-2)^2 + (1-2)^2) = 2

(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard

Cosine similarity: x . y = 0*1 + 1*0 + 0*1 + 1*0 = 0, so CS = 0.

Correlation: mean(x) = 1/2, mean(y) = 1/2.
s_xy = (1/3)[(0-1/2)(1-1/2) + (1-1/2)(0-1/2) + (0-1/2)(1-1/2) + (1-1/2)(0-1/2)] = (1/3)[-1/4 - 1/4 - 1/4 - 1/4] = -1/3
s_x = sqrt((1/3)[(0-1/2)^2 + (1-1/2)^2 + (0-1/2)^2 + (1-1/2)^2]) = sqrt((1/3)[1/4 + 1/4 + 1/4 + 1/4]) = sqrt(1/3)
s_y = sqrt((1/3)[(1-1/2)^2 + (0-1/2)^2 + (1-1/2)^2 + (0-1/2)^2]) = sqrt(1/3)
Correlation = s_xy / (s_x s_y) = (-1/3) / (sqrt(1/3) * sqrt(1/3)) = (-1/3) / (1/3) = -1

Euclidean distance = sqrt((0-1)^2 + (1-0)^2 + (0-1)^2 + (1-0)^2) = 2

Jaccard measure: f_11 = 0, therefore J = 0.

(c) x = (0, -1, 0, 1), y = (1, 0, -1, 0): cosine, correlation, Euclidean

Cosine similarity: x . y = 0*1 + (-1)*0 + 0*(-1) + 1*0 = 0, so CS = 0.

Correlation: mean(x) = 0, mean(y) = 0.
s_xy = (1/3)[(0-0)(1-0) + (-1-0)(0-0) + (0-0)(-1-0) + (1-0)(0-0)] = (1/3)[0] = 0
Therefore, the correlation is 0.

Euclidean distance = sqrt((0-1)^2 + (-1-0)^2 + (0+1)^2 + (1-0)^2) = 2

(d) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1): cosine, correlation, Jaccard

Cosine similarity: x . y = 1*1 + 1*1 + 0*1 + 1*0 + 0*0 + 1*1 = 3; ||x|| = sqrt(4) = 2; ||y|| = sqrt(4) = 2.
CS = 3 / (2 * 2) = 3/4 = 0.75

Correlation: mean(x) = 2/3, mean(y) = 2/3.
s_xy = (1/5)[(1-2/3)(1-2/3) + (1-2/3)(1-2/3) + (0-2/3)(1-2/3) + (1-2/3)(0-2/3) + (0-2/3)(0-2/3) + (1-2/3)(1-2/3)]
     = (1/5)[1/9 + 1/9 - 2/9 - 2/9 + 4/9 + 1/9] = (1/5)(3/9) = 1/15
s_x = sqrt((1/5)[1/9 + 1/9 + 4/9 + 1/9 + 4/9 + 1/9]) = sqrt((1/5)(12/9)) = sqrt(4/15)
s_y = sqrt((1/5)[1/9 + 1/9 + 1/9 + 4/9 + 4/9 + 1/9]) = sqrt((1/5)(12/9)) = sqrt(4/15)
Correlation = s_xy / (s_x s_y) = (1/15) / (4/15) = 1/4 = 0.25

Jaccard coefficient: f_11 = 3, f_01 = 1, f_10 = 1, so J = 3/5 = 0.6.

(e) x = (2, -1, 0, 2, 0, -3), y = (-1, 1, -1, 0, 0, -1): cosine, correlation

Cosine similarity: x . y = 2*(-1) + (-1)*1 + 0*(-1) + 2*0 + 0*0 + (-3)*(-1) = -2 - 1 + 3 = 0, so CS = 0.

Correlation: mean(x) = 0, mean(y) = 0.
s_xy = (1/5)[(2)(-1) + (-1)(1) + (0)(-1) + (2)(0) + (0)(0) + (-3)(-1)] = (1/5)[-2 - 1 + 3] = 0
Therefore, the correlation is 0.
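The hand calculations above can be cross-checked with the helper functions sketched earlier, for example:

pairs = {
    'a': ([1, 1, 1, 1], [2, 2, 2, 2]),
    'b': ([0, 1, 0, 1], [1, 0, 1, 0]),
    'c': ([0, -1, 0, 1], [1, 0, -1, 0]),
    'd': ([1, 1, 0, 1, 0, 1], [1, 1, 1, 0, 0, 1]),
    'e': ([2, -1, 0, 2, 0, -3], [-1, 1, -1, 0, 0, -1]),
}
for name, (x, y) in pairs.items():
    # correlation for (a) prints as nan, since both standard deviations are 0
    print(name, 'cosine =', round(cosine(x, y), 4), 'correlation =', round(correlation(x, y), 4))

# Euclidean distance is asked for in parts (a)-(c), Jaccard for the binary parts (b) and (d)
for name in ['a', 'b', 'c']:
    print(name, 'euclidean =', euclidean(*pairs[name]))       # 2.0 in each case
print('b jaccard =', jaccard(*pairs['b']), 'd jaccard =', jaccard(*pairs['d']))  # 0.0 and 0.6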

3.8 Q.1 Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible.

Ans. I used the Haberman's Survival data set. The attributes are:
1. age of the patient at the time of operation
2. year of operation (19XX)
3. number of positive axillary nodes detected

Each record belongs to one of two classes:
Class 1 if the patient survived 5 years or longer
Class 2 if the patient died within 5 years
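A minimal sketch of how the data could be loaded for plotting, assuming the haberman.data file downloaded from the UCI repository (the column names here are my own):

import pandas as pd

# haberman.data is a headerless CSV: age, year of operation, positive axillary nodes, class
cols = ['age', 'year', 'nodes', 'class']
df = pd.read_csv('haberman.data', header=None, names=cols)
print(df['class'].value_counts(normalize=True))   # roughly 74% Class 1, 26% Class 2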

Histograms
[Histogram: count vs. number of positive axillary nodes]
[Histogram: count vs. year of operation (19XX)]
[Histogram: count vs. age]

Box plots
[Box plot for Class 1: min, lower quartile, median, upper quartile, and max of age, year of operation, and axillary nodes]
[Box plot for Class 2: min, lower quartile, median, upper quartile, and max of age, year of operation, and axillary nodes]

Scatter plots
[Scatter plot of year of operation vs. age at the time of operation, colored by class]
[Scatter plot of number of axillary nodes detected vs. age at the time of operation, colored by class]
[Scatter plot of number of axillary nodes detected vs. year of operation, colored by class]

Pie chart
[Pie chart of the class distribution: Class 1 = 74%, Class 2 = 26%]

Percentile plot
[Percentile plots of age, year of operation, and number of axillary nodes detected]
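The figures above could be reproduced along these lines with pandas and matplotlib (a sketch under the same loading assumptions as before, not the exact commands used for the original plots):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('haberman.data', header=None, names=['age', 'year', 'nodes', 'class'])

# Histograms of the three attributes
for col in ['nodes', 'year', 'age']:
    df[col].plot(kind='hist', title=col)
    plt.show()

# Box plots of the three attributes, one figure per class
for cls in [1, 2]:
    df[df['class'] == cls][['age', 'year', 'nodes']].plot(kind='box', title='Class %d' % cls)
    plt.show()

# Scatter plots of each attribute pair, colored by class
for xcol, ycol in [('age', 'year'), ('age', 'nodes'), ('year', 'nodes')]:
    for cls, marker in [(1, 'o'), (2, 'x')]:
        subset = df[df['class'] == cls]
        plt.scatter(subset[xcol], subset[ycol], marker=marker, label='Class %d' % cls)
    plt.xlabel(xcol); plt.ylabel(ycol); plt.legend(); plt.show()

# Pie chart of the class distribution
df['class'].value_counts().plot(kind='pie', autopct='%.0f%%')
plt.show()

# Percentile plot: attribute value at each percentile
p = np.arange(0, 101)
for col in ['age', 'year', 'nodes']:
    plt.plot(p, np.percentile(df[col], p), label=col)
plt.xlabel('percentile'); plt.legend(); plt.show()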

Reference: Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
