Creative Data Mining
Using ML algorithms in Python
Artem Chirkin, Dr. Daniel Zünd, Danielle Griego
Lecture 7 0.04.207

What we will cover today
Outline
- Getting started
- Explore dataset content
- Inspect visually
- Run clustering
- Assess clustering quality

Getting started
Setting up the Python environment

First, we need to set up our environment. We'll use several packages, most of which you have already installed:

    pip install numpy pandas matplotlib scikit-learn

    # We need pandas to import and use .csv files
    import pandas as pd
    # numpy is used for various numeric operations
    import numpy as np
    # matplotlib lets us draw nice plots
    import matplotlib.pyplot as plt
    # We will use several modules of the sklearn package,
    # which is a "Swiss knife" ML toolset for Python
    from sklearn import cluster, metrics, decomposition

Get the Iris dataset
Example data for clustering

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936. You can read more about it on Wikipedia's page:
https://en.wikipedia.org/wiki/iris_flower_data_set

You can find the .csv file easily; I got it from
https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/iris.csv

Let's download it and use it in our Python script.

Get the Iris dataset
Example data for clustering

Now we are ready to load the .csv file into the interpreter using the pandas package.

    # Using the pandas library to load the csv file
    # into a DataFrame object
    ourData = pd.read_csv("iris.csv")

    # The created object ourData contains a funny
    # and famous dataset of flowers.
    # At first, we need to explore what is in this data set.
    # For this, the pandas package provides several
    # very useful functions.

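The file stores a row counter in its first, unnamed column, which pandas loads as an ordinary column called `Unnamed: 0`. The following is a minimal sketch, assuming a three-row excerpt held in a string instead of the real iris.csv, of how the `index_col` argument of `pd.read_csv` turns that counter into the index instead:

```python
import io
import pandas as pd

# Three-row excerpt in the same layout as iris.csv (an assumption made
# for illustration; the real file has 150 rows).
csv_text = '''"","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3.0,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
'''

# Default behaviour: the unnamed first column becomes "Unnamed: 0".
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))

# With index_col=0 the row counter is used as the index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```

Whether to keep the counter as a column is a matter of taste; the rest of the slides keep the default behaviour.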
Explore the Iris dataset
Trying useful functions from pandas

The first useful command is head(). It returns the first several rows of a DataFrame. Very useful to get an idea of what kind of data you have!

    ourData.head()

       Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    0           1           5.1          3.5           1.4          0.2  setosa
    1           2           4.9          3.0           1.4          0.2  setosa
    2           3           4.7          3.2           1.3          0.2  setosa
    3           4           4.6          3.1           1.5          0.2  setosa
    4           5           5.0          3.6           1.4          0.2  setosa

Explore the Iris dataset
Trying useful functions from pandas

The second command, info(), gives more precise information on the data types we operate on.
- The dataset has 150 entries
- Sepal and petal parameters are floating-point numbers
- Species is recognized as object type; in fact, it is text

    ourData.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 150 entries, 0 to 149
    Data columns (total 6 columns):
    Unnamed: 0      150 non-null int64
    Sepal.Length    150 non-null float64
    Sepal.Width     150 non-null float64
    Petal.Length    150 non-null float64
    Petal.Width     150 non-null float64
    Species         150 non-null object
    dtypes: float64(4), int64(1), object(1)
    memory usage: 8.2+ KB

Explore the Iris dataset
Trying useful functions from pandas

The last command is more advanced: ourData.describe(). It gives some statistical information about the numeric values in your dataset.

    ourData.describe()

           Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
    count  150.000000    150.000000   150.000000    150.000000   150.000000
    mean    75.500000      5.843333     3.057333      3.758000     1.199333
    std     43.445368      0.828066     0.435866      1.765298     0.762238
    min      1.000000      4.300000     2.000000      1.000000     0.100000
    25%     38.250000      5.100000     2.800000      1.600000     0.300000
    50%     75.500000      5.800000     3.000000      4.350000     1.300000
    75%    112.750000      6.400000     3.300000      5.100000     1.800000
    max    150.000000      7.900000     4.400000      6.900000     2.500000

It is useful for understanding the range of your values.

Explore the Iris dataset
Trying useful functions from pandas

But what about the last column, Species? We've seen that it holds textual values. Now let us see which values this column contains.

    # Get unique values in the column
    species = ourData.Species.unique()
    print(species)

    ['setosa' 'versicolor' 'virginica']

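Besides listing the distinct values, it is often worth checking how many rows carry each of them. A small sketch, assuming a toy `Species` column rather than the full dataset, using pandas' `value_counts()`:

```python
import pandas as pd

# Toy stand-in for the Species column (hypothetical data for illustration).
toy = pd.DataFrame({"Species": ["setosa", "setosa", "versicolor",
                                "virginica", "virginica", "virginica"]})

# unique() lists the distinct values in order of first appearance.
print(toy["Species"].unique())

# value_counts() counts the rows per distinct value.
counts = toy["Species"].value_counts()
print(counts)
```

On the real dataset, very unequal counts would be a hint that some species are under-represented.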
Draw some plots
Using matplotlib

Let's try to plot something! In this example I map the names of the species onto colors. Play with this code a little by changing the columns to plot. This gives you an insight into what the data looks like.

    # plot different species in different colours
    colors = ['g', 'r', 'b', 'c', 'm', 'k']
    species_dict = dict(zip(species, colors))
    plt.scatter(ourData['Sepal.Length'],
                ourData['Sepal.Width'],
                c=ourData['Species'].apply(lambda x: species_dict[x]))
    # use either plt.show() to draw on screen
    # or plt.savefig("filename") to save the picture
    plt.savefig("reallabels.png")

Draw some plots
Using matplotlib

...and here is what the result looks like.

[Figure: scatter plot of Sepal.Length against Sepal.Width, coloured by species]

Prepare data
Split input and labels

Copy the data to a new variable, ourDataNoLabels. We will pretend that we don't know the labels and want to estimate them.

    # Here is our data, but without species information
    ourDataNoLabels = pd.DataFrame(
        ourData,
        columns=['Sepal.Length', 'Sepal.Width',
                 'Petal.Length', 'Petal.Width'])

Principal Component Analysis
Using sklearn.decomposition

    # Run Principal Component Analysis
    # to better understand what our data looks like
    ourDataReduced = decomposition.PCA(n_components=3) \
                                  .fit_transform(ourDataNoLabels)

    plt.figure()  # new figure, so the plots do not overlap
    plt.scatter(ourDataReduced[:, 0], ourDataReduced[:, 1],
                c=ourData['Species'].apply(lambda x: species_dict[x]))
    plt.savefig("real-pca-1-2.png")

    plt.figure()
    plt.scatter(ourDataReduced[:, 0], ourDataReduced[:, 2],
                c=ourData['Species'].apply(lambda x: species_dict[x]))
    plt.savefig("real-pca-1-3.png")

    plt.figure()
    plt.scatter(ourDataReduced[:, 1], ourDataReduced[:, 2],
                c=ourData['Species'].apply(lambda x: species_dict[x]))
    plt.savefig("real-pca-2-3.png")

The first statement of this code performs PCA; most of the rest just draws three plots.

The principal component transform is a procedure that rotates the point space so that the variance of the points is largest along the first axis (X), second largest along the second axis (Y), and so on. The method is simple but very powerful for data exploration. If the data have many dimensions (e.g. 50), it is very convenient to look only at the 2-5 most significant ones; otherwise, visual inspection of the data would be too difficult.

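The rotation described above can be reproduced by hand. The following is a minimal sketch, assuming synthetic random data rather than the Iris measurements, that centers the points and uses NumPy's singular value decomposition to find the rotation. It illustrates the idea only and makes no claim about how `sklearn.decomposition.PCA` is implemented internally.

```python
import numpy as np

# Synthetic, correlated 3-D points (hypothetical data for illustration).
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                               [1.0, 1.0, 0.0],
                                               [0.5, 0.2, 0.1]])

# PCA works on centered data: subtract the mean of every column.
centered = points - points.mean(axis=0)

# The rows of vt are the principal directions, sorted by importance.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Rotate the points into the new coordinate system.
rotated = centered @ vt.T

# Variance along each new axis: largest first, then second largest, ...
variances = rotated.var(axis=0)
print(variances)
```

Because the transform is only a rotation, the total variance is unchanged; it is merely redistributed so that the first few axes carry most of it.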
Draw some plots
Using matplotlib

...and here are the plots.

[Figures: scatter plots of the principal components, pairs 1-2, 1-3 and 2-3, coloured by species]

K-means clustering
Using sklearn.cluster

Finally, we are ready for some clustering!

    # We know that there must be 3 different clusters
    # (species of flowers),
    # so let's give this hint to the algorithm
    kmeans = cluster.KMeans(n_clusters=3) \
                    .fit(np.array(ourDataNoLabels))
    foundLabels = pd.DataFrame(kmeans.labels_,
                               columns=['K-means clusters'])

    plt.figure()  # new figure, so the plots do not overlap
    plt.scatter(ourData['Sepal.Length'],
                ourData['Sepal.Width'],
                c=foundLabels['K-means clusters'].apply(lambda x: colors[x]))
    plt.savefig("predictedlabels.png")

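K-means starts from randomly chosen centres, so the numbers it assigns to clusters are arbitrary and can change between runs. A hedged sketch, assuming three well-separated synthetic blobs instead of the Iris data, that fixes `random_state` and `n_init` to make a run repeatable:

```python
import numpy as np
from sklearn import cluster

# Three tight, well-separated blobs of 20 points each (hypothetical data).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + 0.3 * rng.normal(size=(20, 2)) for c in centres])

# random_state fixes the random initialisation; n_init is the number of
# restarts from which the best result is kept.
kmeans = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(blobs)

# Cluster numbers are arbitrary identifiers, not blob or species indices.
labels = kmeans.labels_
print(labels)
print(kmeans.cluster_centers_)
```

Because the numbering is arbitrary, cluster 0 need not correspond to the first species; the colours in a plot of predicted labels may therefore differ from those in a plot of real labels even when the grouping is the same.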
K-means clustering
Using sklearn.cluster

Let's compare the plots...

[Figures: plot of the real labels and plot of the predicted labels]

...would you say this is a good result?

Assess clustering quality
Using pandas

A very good way to assess the performance of an unsupervised clustering algorithm is to look at co-occurrence tables. The pandas package provides a special function, crosstab(), that calculates how many times a value from one column occurs together with a value from another column.

    mat = pd.crosstab(ourData['Species'],
                      foundLabels['K-means clusters'])
    print(mat)

    K-means clusters   0   1   2
    Species
    setosa             0  50   0
    versicolor        48   0   2
    virginica         14   0  36

So, what conclusion can we draw now?

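One way to condense such a table into a single number is purity: for every cluster take the count of its most common true label, add these counts up, and divide by the total number of samples. This measure is not part of the lecture; the sketch below assumes two short, made-up label lists to show the computation.

```python
import pandas as pd

# Made-up true labels and cluster assignments (hypothetical data).
truth = pd.Series(["a", "a", "a", "b", "b", "b", "c", "c"], name="Species")
found = pd.Series([0, 0, 1, 1, 1, 1, 2, 2], name="cluster")

# Rows: true labels, columns: cluster numbers, cells: co-occurrence counts.
mat = pd.crosstab(truth, found)
print(mat)

# Purity: share of samples that carry the majority label of their cluster.
purity = mat.max(axis=0).sum() / mat.values.sum()
print(purity)

# Majority label of each cluster, useful for naming the clusters.
majority = mat.idxmax(axis=0)
print(majority.to_dict())
```

A purity of 1.0 means every cluster contains a single label only; note that it says nothing about whether one label was split over several clusters.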
Bonus: DBSCAN
Using sklearn.cluster

This code does everything the same way as the KMeans clustering example. Try it! And compare the results. Which algorithm performs better?

    dbscan = cluster.DBSCAN() \
                    .fit(np.array(ourDataNoLabels))
    foundLabels['DBSCAN clusters'] = pd.DataFrame(
        dbscan.labels_, columns=['DBSCAN clusters'])

    plt.figure()  # new figure, so the plots do not overlap
    plt.scatter(ourData['Sepal.Length'],
                ourData['Sepal.Width'],
                c=foundLabels['DBSCAN clusters'].apply(lambda x: colors[x]))
    plt.savefig("predictedlabels2.png")

    mat = pd.crosstab(ourData['Species'],
                      foundLabels['DBSCAN clusters'])
    print(mat)

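Unlike K-means, DBSCAN is not told the number of clusters; it derives them from its `eps` and `min_samples` parameters and marks points that fit no cluster as noise with the label -1. A minimal sketch, assuming two synthetic blobs plus one far-away point and hand-picked parameter values:

```python
import numpy as np
from sklearn import cluster

# Two tight blobs of 20 points and one isolated point (hypothetical data).
rng = np.random.default_rng(0)
blob_a = np.array([0.0, 0.0]) + 0.2 * rng.normal(size=(20, 2))
blob_b = np.array([5.0, 5.0]) + 0.2 * rng.normal(size=(20, 2))
outlier = np.array([[20.0, 20.0]])
data = np.vstack([blob_a, blob_b, outlier])

# eps: neighbourhood radius; min_samples: neighbours needed for a dense
# region. Both values are picked by hand for this toy data.
dbscan = cluster.DBSCAN(eps=1.0, min_samples=5).fit(data)
labels = dbscan.labels_
print(labels)

# Noise points get the label -1, so count the clusters without it.
n_clusters = len(set(labels) - {-1})
print(n_clusters)
```

In the plotting code of the slide, a label of -1 selects the last entry of the colour list, so noise points are drawn in black. When comparing with K-means, keep in mind that the default parameters may need tuning for a given dataset.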
What we have covered today
- Getting started
- Explore dataset content
- Inspect visually
- Run clustering
- Assess clustering quality

Thanks for your attention!