Creative Data Mining

Size: px

Start display at page:

Download "Creative Data Mining"

Linette Gibbs
6 years ago
Views:

1 Creative Data Mining Using ML algorithms in python Artem Chirkin Dr. Daniel Zünd Danielle Griego Lecture /7

2 What we will cover today Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality /7

3 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality /7

4 Getting started Setting up python environment At first, we need to set up our environment. We need p an d a s t o i m p o r t and u s e. c s v f i l e s 2 i m p o r t p a n d a s a s pd 3 numpy i s u s e d f o r v a r i o u s n u m e r i c o p e r a t i o n s 5 i m p o r t numpy a s np 4 We ll use several package, most of which you have already installed 6 m a t p l o t l i b l e t s us draw n i c e p l o t s 8 import m a t p l o t l i b. pyplot as p l t 7 9 We w i l l u s e s e v e r a l mo d u le s o f s k l e a r n p a c k a g e t h a t i s a " s w i s s k n i f e " ML t o o l s e t f o r p y t h o n 2 from s k l e a r n i m p o r t c l u s t e r, m e t r i c s, d e c o m p o s i t i o n 0 pip install numpy pandas matplotlib sklearn 2/7

5 Get the Iris dataset Example data for clustering The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronal Fisher in 936. You can read more about it on Wikipedia s page set You can find the.csv file easily; I got it from Rdatasets/master/csv/datasets/iris.csv Let s download it and use in our python script 3/7

6 Get the Iris dataset Example data for clustering Now, we are ready to load the.csv file into the interpreter using pandas package. U s i n g p an d a s l i b r a r y t o l o a d c s v f i l e 2 i n t o a DataFrame o b j e c t 3 ourdata = pd. r e a d _ c s v ( " i r i s. c s v " ) Created a funny At f i r s t what i s For t h i s several o b j e c t ourdata c o n t a i n s and famous d a t a s e t o f f l o w e r s., we need t o e x p l o r e, i n t h i s data s e t., pandas package p r o v i d e s very useful functions 4/7

7 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 4/7

8 Explore the Iris dataset Trying useful functions from pandas The first useful command is head() It returns first several lines of a dataframe. Very useful to get an idea of what kind of data you have! ourdata. head ( ) Unnamed: Sepal.Length Sepal.Width Petal.Length Petal.Width Species 0.2 setosa 0.2 setosa 0.2 setosa 0.2 setosa 0.2 setosa 5/7

9 Explore the Iris dataset Trying useful functions from pandas Second command, info(), can give a more precise information on what data types we operate on. I The dataset has 50 entries I Sepal and petal parameters are floating-point numbers I Species is recognized as object type in fact, this is a text ourdata. i n f o ( ) <class pandas.core.frame.dataframe > Int64Index: 50 entries, 0 to 49 Data columns (total 6 columns): Unnamed: 0 50 non-null int64 Sepal.Length 50 non-null float64 Sepal.Width 50 non-null float64 Petal.Length 50 non-null float64 Petal.Width 50 non-null float64 Species 50 non-null object dtypes: float64(4), int64(), object() memory usage: 8.2+ KB 6/7

10 Explore the Iris dataset Trying useful functions from pandas The last command is more advanced: ourdata.describe(). This command gives some statistical information about numeric values in your dataset. ourdata. d e s c r i b e ( ) count mean std min 25% 50% 75% max Unnamed: Sepal.Length Sepal.Width Petal.Length Petal.Width It is useful to understand what is the range of your values. 7/7

11 Explore the Iris dataset Trying useful functions from pandas But what about the last column Species? Get u n i q u e v a l u e s i n t h e column 2 s p e c i e s = ourdata. S p e c i e s. u n i q u e ( ) 3 4 We ve seen these are textual values. Now, let us see what kinds of values does this column have? print ( species ) [ setosa versicolor virginica ] 8/7

12 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 8/7

13 Draw some plots Using matlplotlib Let s try to plot something! In this example I map names of the species onto colors. Play with this code a little bit changing the columns to plot. This gives you an insight how the data looks like. plot different species in different colours 2 c o l o r s = [ g, r, b, c, m, k ] 3 species_dict = dict ( zip ( species, colors ) ) 4 p l t. s c a t t e r ( ourdata [ S e p a l. L e n g t h ] 5, ourdata [ S e p a l. Width ] 6, c=ourdata [ S p e c i e s ] 7. a p p l y ( lambda x : s p e c i e s _ d i c t [ x ] ) 8 ) 9 use e i t h e r 0 p l t. show ( ) t o draw on s c r e e n o r 2 p l t. s a v e f i g (" f i l e n a m e ") to save p i c t u r e 3 p l t. s a v e f i g ( " r e a l l a b e l s. png " ) 9/7

14 Draw some plots Using matlplotlib...and here is how the result looks like 0/7

15 Prepare data Split input and labels Copy the data to a new variable ourdatanolabels. We will play like we don t know the labels and want to estimate them. Here i s o u r data, b u t w i t h o u t s p e c i e s i n f o r m a t i o n 2 ourdatanolabels 3 = pd. DataFrame ( ourdata 4, c o l u mns =[ S e p a l. L e n g t h 5, S e p a l. Width 6, P e t a l. Length 7, P e t a l. Width 8 ] 9 ) /7

16 Principal Component analysis Using sklearn.decomposition Run P r i n c i p a l Component A n a l y s i s 2 t o g e t more u n d e r s t a n d i n g how o u r d a t a l o o k s l i k e 3 ourdatareduced 4 = decomposition 5. PCA( n_components =3) 6. f i t _ t r a n s f o r m ( ourdatanolabels ) 7 p l t. s c a t t e r ( ourdatareduced [ :, 0 ] 8, ourdatareduced [ :, ] 9, c=ourdata [ S p e c i e s ] 0. a p p l y ( lambda x : s p e c i e s _ d i c t [ x ] ) ) p l t. s a v e f i g ( " r e a l pca 2.png " ) 2 p l t. s c a t t e r ( ourdatareduced [ :, 0 ] 3, ourdatareduced [ :, 2 ] 4, c=ourdata [ S p e c i e s ] 5. a p p l y ( lambda x : s p e c i e s _ d i c t [ x ] ) ) 6 p l t. s a v e f i g ( " r e a l pca 3.png " ) 7 p l t. s c a t t e r ( ourdatareduced [ :, ] 8, ourdatareduced [ :, 2 ] 9, c=ourdata [ S p e c i e s ] 20. a p p l y ( lambda x : s p e c i e s _ d i c t [ x ] ) ) 2 p l t. s a v e f i g ( " r e a l pca 2 3.png " ) This code performs PCA (lines 3-7). (Most of the code is just to draw three plots). Principal component transform is a procedure that rotates point space in such a way that points variance is the biggest along X-axis, the second is along Y-axis, and so on. The method is simple but very powerful for data exploration. If data have many dimensions (i.e. 50) it is very convenient to look only at 2-5 most significant dimensions. Otherwise visual inspection of the data would be too difficult. 2/7

17 Draw some plots Using matlplotlib...and here are the plots 3/7

18 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 3/7

19 K-means clustering Using sklearn.cluster We know t h a t t h e r e must be 3 d i f f e r e n t c l u s t e r s 2 species of flowers. 3 So l e t s g i v e t h i s h i n t t o t h e a l g o r i t h m 4 kmeans = c l u s t e r. KMeans ( 3 ) 5. f i t ( np. a r r a y ( o u r D a t a N o L a b e l s ) ) 6 foundlabels 7 = pd. DataFrame ( kmeans. l a b e l s _ 8, c o l u mns =[ K means c l u s t e r s ] ) Finally, we are ready for some clustering! p l t. s c a t t e r ( ourdata [ S e p a l. L e n g t h ], ourdata [ S e p a l. Width ], c=f o u n d L a b e l s [ K means c l u s t e r s ]. a p p l y ( lambda x : c o l o r s [ x ] ) ) p l t. s a v e f i g ( " p r e d i c t e d l a b e l s. png " ) 4/7

20 K-means clustering Using sklearn.cluster Let s compare plots......would you say this is a good result? 5/7

21 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 5/7

22 Assess clustering quality Using pandas A very good way to assess performance of an unsupervised clustering algorithm is to look at co-occurrence tables Package pandas provides a special function crosstab() that calculates how many times a value from one column occurs together with a value from another column. mat = pd. c r o s s t a b ( ourdata [ S p e c i e s ], f o u n d L a b e l s [ K means c l u s t e r s ] ) p r i n t ( mat ) K-means clusters Species setosa versicolor virginica So, what would be a conclusion now? 6/7

23 Bonus: DBSCAN Using sklearn.cluster This code does everything the same way as KMeans clustering example Try it! And compare the results. Which algorithm performs better? d b s c a n = c l u s t e r. DBSCAN( ). f i t ( np. a r r a y ( o u r D a t a N o L a b e l s ) ) f o u n d L a b e l s [ DBSCAN c l u s t e r s ] = pd. DataFrame ( d b s c a n. l a b e l s _, c o l u mns =[ DBSCAN c l u s t e r s ] ) p l t. s c a t t e r ( ourdata [ S e p a l. L e n g t h ], ourdata [ S e p a l. Width ], c=f o u n d L a b e l s [ DBSCAN c l u s t e r s ]. a p p l y ( lambda x : c o l o r s [ x ] ) ) p l t. s a v e f i g ( " p r e d i c t e d l a b e l s 2. png " ) mat = pd. c r o s s t a b ( ourdata [ S p e c i e s ], f o u n d L a b e l s [ DBSCAN c l u s t e r s ] ) p r i n t ( mat ) 7/7

24 What we have covered today Thanks for your attention! Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 7/7

The SAS System 18:28 Saturday, March 10, Plot of Canonical Variables Identified by Cluster

The SAS System 18:28 Saturday, March 10, 2018 1 The FASTCLUS Procedure Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02 Initial Seeds Cluster SepalLength SepalWidth PetalLength PetalWidth 1