Creative Data Mining


Using ML algorithms in Python
Artem Chirkin, Dr. Daniel Zünd, Danielle Griego
Lecture 7, 10.04.2017

What we will cover today

Outline:
  Getting started
  Explore dataset content
  Inspect visually
  Run clustering
  Assess clustering quality


Getting started
Setting up python environment

First, we need to set up our environment. We'll use several packages, most of which you have already installed:

    pip install numpy pandas matplotlib sklearn

    # pandas to import and use .csv files
    import pandas as pd

    # numpy is used for various numeric operations
    import numpy as np

    # matplotlib lets us draw nice plots
    import matplotlib.pyplot as plt

    # we will use several modules of the sklearn package,
    # which is a "swiss knife" ML toolset for python
    from sklearn import cluster, metrics, decomposition

Get the Iris dataset
Example data for clustering

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936. You can read more about it on Wikipedia's page:
https://en.wikipedia.org/wiki/Iris_flower_data_set

You can find the .csv file easily; I got it from:
https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/iris.csv

Let's download it and use it in our Python script.

Get the Iris dataset
Example data for clustering

Now we are ready to load the .csv file into the interpreter using the pandas package.

    # Using pandas library to load the csv file
    # into a DataFrame object
    ourdata = pd.read_csv("iris.csv")

The created object ourdata contains a funny and famous dataset of flowers. At first, we need to explore what is in this data set. For this, the pandas package provides several very useful functions.


Explore the Iris dataset
Trying useful functions from pandas

The first useful command is head(). It returns the first several rows of a dataframe. Very useful to get an idea of what kind of data you have!

    ourdata.head()

       Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    0           1           5.1          3.5           1.4          0.2  setosa
    1           2           4.9          3.0           1.4          0.2  setosa
    2           3           4.7          3.2           1.3          0.2  setosa
    3           4           4.6          3.1           1.5          0.2  setosa
    4           5           5.0          3.6           1.4          0.2  setosa
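A side note: the "Unnamed: 0" column above is just the row index that was stored in the .csv file. One optional way to keep it out of the data is to tell read_csv to use it as the index; a minimal sketch (the variable name ourdataIndexed is hypothetical, and the rest of the lecture keeps working with ourdata as loaded above):

    # Treat the first csv column as the row index instead of data
    ourdataIndexed = pd.read_csv("iris.csv", index_col=0)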

Explore the Iris dataset
Trying useful functions from pandas

The second command, info(), gives more precise information on what data types we operate on:
- The dataset has 150 entries
- Sepal and petal parameters are floating-point numbers
- Species is recognized as "object" type; in fact, this is text

    ourdata.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 150 entries, 0 to 149
    Data columns (total 6 columns):
    Unnamed: 0      150 non-null int64
    Sepal.Length    150 non-null float64
    Sepal.Width     150 non-null float64
    Petal.Length    150 non-null float64
    Petal.Width     150 non-null float64
    Species         150 non-null object
    dtypes: float64(4), int64(1), object(1)
    memory usage: 8.2+ KB

Explore the Iris dataset
Trying useful functions from pandas

The last command is more advanced: ourdata.describe(). This command gives some statistical information about the numeric values in your dataset.

    ourdata.describe()

           Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
    count  150.000000    150.000000   150.000000    150.000000   150.000000
    mean    75.500000      5.843333     3.057333      3.758000     1.199333
    std     43.445368      0.828066     0.435866      1.765298     0.762238
    min      1.000000      4.300000     2.000000      1.000000     0.100000
    25%     38.250000      5.100000     2.800000      1.600000     0.300000
    50%     75.500000      5.800000     3.000000      4.350000     1.300000
    75%    112.750000      6.400000     3.300000      5.100000     1.800000
    max    150.000000      7.900000     4.400000      6.900000     2.500000

It is useful for understanding the range of your values.
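One practical use of these ranges: columns measured on different scales can dominate distance-based methods such as the clustering we run later. A minimal sketch of standardizing the four measurement columns with sklearn's StandardScaler (an optional extra; the original slides cluster the raw values):

    from sklearn.preprocessing import StandardScaler

    # Rescale each measurement column to zero mean and unit variance
    featureCols = ['Sepal.Length', 'Sepal.Width',
                   'Petal.Length', 'Petal.Width']
    scaledData = StandardScaler().fit_transform(ourdata[featureCols])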

Explore the Iris dataset
Trying useful functions from pandas

But what about the last column, Species? We've seen these are textual values. Now, let us see what kinds of values this column has:

    # Get unique values in the column
    species = ourdata.Species.unique()
    print(species)

    ['setosa' 'versicolor' 'virginica']
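If you also want to know how often each value occurs, pandas Series have a value_counts() method; a minimal sketch (for the full Iris data each species should appear 50 times, though the print order may vary):

    # Count how many rows belong to each species
    print(ourdata.Species.value_counts())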


Draw some plots
Using matplotlib

Let's try to plot something! In this example I map the names of the species onto colors. Play with this code a little bit, changing the columns to plot; this gives you an insight into what the data look like.

    # plot different species in different colours
    colors = ['g', 'r', 'b', 'c', 'm', 'k']
    species_dict = dict(zip(species, colors))
    plt.scatter(ourdata['Sepal.Length'],
                ourdata['Sepal.Width'],
                c=ourdata['Species'].apply(lambda x: species_dict[x]))
    # use either plt.show() to draw on screen,
    # or plt.savefig("filename") to save the picture
    plt.savefig("reallabels.png")

Draw some plots
Using matplotlib

...and here is how the result looks:

[Figure: scatter plot of Sepal.Length vs. Sepal.Width, colored by species]

Prepare data
Split input and labels

Copy the data to a new variable, ourDataNoLabels. We will pretend we don't know the labels and want to estimate them.

    # Here is our data, but without species information
    ourDataNoLabels = pd.DataFrame(ourdata,
                                   columns=['Sepal.Length',
                                            'Sepal.Width',
                                            'Petal.Length',
                                            'Petal.Width'])

Principal Component Analysis
Using sklearn.decomposition

    # Run Principal Component Analysis
    # to get more understanding of how our data looks
    ourDataReduced = decomposition.PCA(n_components=3) \
                                  .fit_transform(ourDataNoLabels)
    plt.scatter(ourDataReduced[:, 0], ourDataReduced[:, 1],
                c=ourdata['Species'].apply(lambda x: species_dict[x]))
    plt.savefig("realpca12.png")
    plt.scatter(ourDataReduced[:, 0], ourDataReduced[:, 2],
                c=ourdata['Species'].apply(lambda x: species_dict[x]))
    plt.savefig("realpca13.png")
    plt.scatter(ourDataReduced[:, 1], ourDataReduced[:, 2],
                c=ourdata['Species'].apply(lambda x: species_dict[x]))
    plt.savefig("realpca23.png")

This code performs the PCA in the single fit_transform call; most of the rest just draws three plots. The principal component transform rotates the point space so that the variance of the points is largest along the first axis, second largest along the second axis, and so on. The method is simple but very powerful for data exploration: if the data have many dimensions (e.g. 50), it is very convenient to look only at the 2-5 most significant ones; otherwise visual inspection of the data would be too difficult.
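To check how much of the variance the first components actually capture, sklearn's PCA object exposes explained_variance_ratio_ after fitting. A minimal sketch (fitting a second PCA object just for inspection; this step is not in the original slides):

    # Fit PCA and inspect the fraction of variance per component
    pca = decomposition.PCA(n_components=3)
    pca.fit(ourDataNoLabels)
    print(pca.explained_variance_ratio_)
    # Each entry is the share of total variance along one principal
    # axis; for Iris the first component captures most of it.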

Draw some plots
Using matplotlib

...and here are the plots:

[Figure: three scatter plots of the PCA-projected data (components 1-2, 1-3, and 2-3), colored by species]


K-means clustering
Using sklearn.cluster

Finally, we are ready for some clustering!

    # We know that there must be 3 different clusters:
    # 3 species of flowers.
    # So let's give this hint to the algorithm
    kmeans = cluster.KMeans(3).fit(np.array(ourDataNoLabels))
    foundLabels = pd.DataFrame(kmeans.labels_,
                               columns=['K-means clusters'])
    plt.scatter(ourdata['Sepal.Length'],
                ourdata['Sepal.Width'],
                c=foundLabels['K-means clusters']
                    .apply(lambda x: colors[x]))
    plt.savefig("predictedlabels.png")
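If we did not know that there are 3 species, a common heuristic is to run K-means for several values of k and watch the inertia (the within-cluster sum of squared distances) for an "elbow". A minimal sketch, assuming the same ourDataNoLabels as above (this loop is an addition, not part of the original slides):

    # Inertia always decreases with k; look for the point where
    # adding another cluster stops helping much
    for k in range(1, 7):
        km = cluster.KMeans(k).fit(np.array(ourDataNoLabels))
        print(k, km.inertia_)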

K-means clustering
Using sklearn.cluster

Let's compare the plots...

[Figure: Sepal.Length vs. Sepal.Width colored by true species, next to the same plot colored by K-means cluster]

...would you say this is a good result?


Assess clustering quality
Using pandas

A very good way to assess the performance of an unsupervised clustering algorithm is to look at co-occurrence tables. The pandas package provides a special function, crosstab(), that calculates how many times a value from one column occurs together with a value from another column.

    mat = pd.crosstab(ourdata['Species'],
                      foundLabels['K-means clusters'])
    print(mat)

    K-means clusters   0   1   2
    Species
    setosa             0  50   0
    versicolor        48   0   2
    virginica         14   0  36

So, what would be the conclusion now?
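The metrics module we imported at the start can also compress this whole table into a single score. A minimal sketch using adjusted_rand_score, which compares two labelings while ignoring the arbitrary numbering of the clusters (1.0 is a perfect match, values near 0 mean a random assignment); this step is an addition, not part of the original slides:

    # Compare true species against K-means cluster ids;
    # the score is invariant to how clusters are numbered
    score = metrics.adjusted_rand_score(ourdata['Species'],
                                        foundLabels['K-means clusters'])
    print(score)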

Bonus: DBSCAN
Using sklearn.cluster

This code does everything the same way as the KMeans clustering example. Try it! And compare the results. Which algorithm performs better?

    dbscan = cluster.DBSCAN().fit(np.array(ourDataNoLabels))
    foundLabels['DBSCAN clusters'] = pd.DataFrame(dbscan.labels_,
                                                  columns=['DBSCAN clusters'])
    plt.scatter(ourdata['Sepal.Length'],
                ourdata['Sepal.Width'],
                c=foundLabels['DBSCAN clusters']
                    .apply(lambda x: colors[x]))
    plt.savefig("predictedlabels2.png")
    mat = pd.crosstab(ourdata['Species'],
                      foundLabels['DBSCAN clusters'])
    print(mat)
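Unlike KMeans, DBSCAN takes no cluster count at all; its result depends mainly on the neighborhood radius eps (and min_samples). A minimal sketch for exploring a few eps values, assuming the same data as above (the specific values tried here are arbitrary):

    # DBSCAN labels noise points as -1; count clusters found per eps
    for eps in (0.3, 0.5, 0.8, 1.0):
        labels = cluster.DBSCAN(eps=eps).fit(np.array(ourDataNoLabels)).labels_
        nClusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(eps, nClusters)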

What we have covered today

Getting started
Explore dataset content
Inspect visually
Run clustering
Assess clustering quality

Thanks for your attention!