Creative Data Mining

Size: px
Start display at page:

Download "Creative Data Mining"

Transcription

1 Creative Data Mining Using ML algorithms in python Artem Chirkin Dr. Daniel Zünd Danielle Griego Lecture /7

2 What we will cover today Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality /7

3 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality /7

4 Getting started Setting up python environment At first, we need to set up our environment. We need p an d a s t o i m p o r t and u s e. c s v f i l e s 2 i m p o r t p a n d a s a s pd 3 numpy i s u s e d f o r v a r i o u s n u m e r i c o p e r a t i o n s 5 i m p o r t numpy a s np 4 We ll use several package, most of which you have already installed 6 m a t p l o t l i b l e t s us draw n i c e p l o t s 8 import m a t p l o t l i b. pyplot as p l t 7 9 We w i l l u s e s e v e r a l mo d u le s o f s k l e a r n p a c k a g e t h a t i s a " s w i s s k n i f e " ML t o o l s e t f o r p y t h o n 2 from s k l e a r n i m p o r t c l u s t e r, m e t r i c s, d e c o m p o s i t i o n 0 pip install numpy pandas matplotlib sklearn 2/7

5 Get the Iris dataset Example data for clustering The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronal Fisher in 936. You can read more about it on Wikipedia s page set You can find the.csv file easily; I got it from Rdatasets/master/csv/datasets/iris.csv Let s download it and use in our python script 3/7

6 Get the Iris dataset Example data for clustering Now, we are ready to load the.csv file into the interpreter using pandas package. U s i n g p an d a s l i b r a r y t o l o a d c s v f i l e 2 i n t o a DataFrame o b j e c t 3 ourdata = pd. r e a d _ c s v ( " i r i s. c s v " ) Created a funny At f i r s t what i s For t h i s several o b j e c t ourdata c o n t a i n s and famous d a t a s e t o f f l o w e r s., we need t o e x p l o r e, i n t h i s data s e t., pandas package p r o v i d e s very useful functions 4/7

7 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 4/7

8 Explore the Iris dataset Trying useful functions from pandas The first useful command is head() It returns first several lines of a dataframe. Very useful to get an idea of what kind of data you have! ourdata. head ( ) Unnamed: Sepal.Length Sepal.Width Petal.Length Petal.Width Species 0.2 setosa 0.2 setosa 0.2 setosa 0.2 setosa 0.2 setosa 5/7

9 Explore the Iris dataset Trying useful functions from pandas Second command, info(), can give a more precise information on what data types we operate on. I The dataset has 50 entries I Sepal and petal parameters are floating-point numbers I Species is recognized as object type in fact, this is a text ourdata. i n f o ( ) <class pandas.core.frame.dataframe > Int64Index: 50 entries, 0 to 49 Data columns (total 6 columns): Unnamed: 0 50 non-null int64 Sepal.Length 50 non-null float64 Sepal.Width 50 non-null float64 Petal.Length 50 non-null float64 Petal.Width 50 non-null float64 Species 50 non-null object dtypes: float64(4), int64(), object() memory usage: 8.2+ KB 6/7

10 Explore the Iris dataset Trying useful functions from pandas The last command is more advanced: ourdata.describe(). This command gives some statistical information about numeric values in your dataset. ourdata. d e s c r i b e ( ) count mean std min 25% 50% 75% max Unnamed: Sepal.Length Sepal.Width Petal.Length Petal.Width It is useful to understand what is the range of your values. 7/7

11 Explore the Iris dataset Trying useful functions from pandas But what about the last column Species? Get u n i q u e v a l u e s i n t h e column 2 s p e c i e s = ourdata. S p e c i e s. u n i q u e ( ) 3 4 We ve seen these are textual values. Now, let us see what kinds of values does this column have? print ( species ) [ setosa versicolor virginica ] 8/7

12 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 8/7

13 Draw some plots Using matlplotlib Let s try to plot something! In this example I map names of the species onto colors. Play with this code a little bit changing the columns to plot. This gives you an insight how the data looks like. plot different species in different colours 2 c o l o r s = [ g, r, b, c, m, k ] 3 species_dict = dict ( zip ( species, colors ) ) 4 p l t. s c a t t e r ( ourdata [ S e p a l. L e n g t h ] 5, ourdata [ S e p a l. Width ] 6, c=ourdata [ S p e c i e s ] 7. a p p l y ( lambda x : s p e c i e s _ d i c t [ x ] ) 8 ) 9 use e i t h e r 0 p l t. show ( ) t o draw on s c r e e n o r 2 p l t. s a v e f i g (" f i l e n a m e ") to save p i c t u r e 3 p l t. s a v e f i g ( " r e a l l a b e l s. png " ) 9/7

14 Draw some plots Using matlplotlib...and here is how the result looks like 0/7

15 Prepare data Split input and labels Copy the data to a new variable ourdatanolabels. We will play like we don t know the labels and want to estimate them. Here i s o u r data, b u t w i t h o u t s p e c i e s i n f o r m a t i o n 2 ourdatanolabels 3 = pd. DataFrame ( ourdata 4, c o l u mns =[ S e p a l. L e n g t h 5, S e p a l. Width 6, P e t a l. Length 7, P e t a l. Width 8 ] 9 ) /7

16 Principal Component analysis Using sklearn.decomposition Run P r i n c i p a l Component A n a l y s i s 2 t o g e t more u n d e r s t a n d i n g how o u r d a t a l o o k s l i k e 3 ourdatareduced 4 = decomposition 5. PCA( n_components =3) 6. f i t _ t r a n s f o r m ( ourdatanolabels ) 7 p l t. s c a t t e r ( ourdatareduced [ :, 0 ] 8, ourdatareduced [ :, ] 9, c=ourdata [ S p e c i e s ] 0. a p p l y ( lambda x : s p e c i e s _ d i c t [ x ] ) ) p l t. s a v e f i g ( " r e a l pca 2.png " ) 2 p l t. s c a t t e r ( ourdatareduced [ :, 0 ] 3, ourdatareduced [ :, 2 ] 4, c=ourdata [ S p e c i e s ] 5. a p p l y ( lambda x : s p e c i e s _ d i c t [ x ] ) ) 6 p l t. s a v e f i g ( " r e a l pca 3.png " ) 7 p l t. s c a t t e r ( ourdatareduced [ :, ] 8, ourdatareduced [ :, 2 ] 9, c=ourdata [ S p e c i e s ] 20. a p p l y ( lambda x : s p e c i e s _ d i c t [ x ] ) ) 2 p l t. s a v e f i g ( " r e a l pca 2 3.png " ) This code performs PCA (lines 3-7). (Most of the code is just to draw three plots). Principal component transform is a procedure that rotates point space in such a way that points variance is the biggest along X-axis, the second is along Y-axis, and so on. The method is simple but very powerful for data exploration. If data have many dimensions (i.e. 50) it is very convenient to look only at 2-5 most significant dimensions. Otherwise visual inspection of the data would be too difficult. 2/7

17 Draw some plots Using matlplotlib...and here are the plots 3/7

18 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 3/7

19 K-means clustering Using sklearn.cluster We know t h a t t h e r e must be 3 d i f f e r e n t c l u s t e r s 2 species of flowers. 3 So l e t s g i v e t h i s h i n t t o t h e a l g o r i t h m 4 kmeans = c l u s t e r. KMeans ( 3 ) 5. f i t ( np. a r r a y ( o u r D a t a N o L a b e l s ) ) 6 foundlabels 7 = pd. DataFrame ( kmeans. l a b e l s _ 8, c o l u mns =[ K means c l u s t e r s ] ) Finally, we are ready for some clustering! p l t. s c a t t e r ( ourdata [ S e p a l. L e n g t h ], ourdata [ S e p a l. Width ], c=f o u n d L a b e l s [ K means c l u s t e r s ]. a p p l y ( lambda x : c o l o r s [ x ] ) ) p l t. s a v e f i g ( " p r e d i c t e d l a b e l s. png " ) 4/7

20 K-means clustering Using sklearn.cluster Let s compare plots......would you say this is a good result? 5/7

21 Outline Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 5/7

22 Assess clustering quality Using pandas A very good way to assess performance of an unsupervised clustering algorithm is to look at co-occurrence tables Package pandas provides a special function crosstab() that calculates how many times a value from one column occurs together with a value from another column. mat = pd. c r o s s t a b ( ourdata [ S p e c i e s ], f o u n d L a b e l s [ K means c l u s t e r s ] ) p r i n t ( mat ) K-means clusters Species setosa versicolor virginica So, what would be a conclusion now? 6/7

23 Bonus: DBSCAN Using sklearn.cluster This code does everything the same way as KMeans clustering example Try it! And compare the results. Which algorithm performs better? d b s c a n = c l u s t e r. DBSCAN( ). f i t ( np. a r r a y ( o u r D a t a N o L a b e l s ) ) f o u n d L a b e l s [ DBSCAN c l u s t e r s ] = pd. DataFrame ( d b s c a n. l a b e l s _, c o l u mns =[ DBSCAN c l u s t e r s ] ) p l t. s c a t t e r ( ourdata [ S e p a l. L e n g t h ], ourdata [ S e p a l. Width ], c=f o u n d L a b e l s [ DBSCAN c l u s t e r s ]. a p p l y ( lambda x : c o l o r s [ x ] ) ) p l t. s a v e f i g ( " p r e d i c t e d l a b e l s 2. png " ) mat = pd. c r o s s t a b ( ourdata [ S p e c i e s ], f o u n d L a b e l s [ DBSCAN c l u s t e r s ] ) p r i n t ( mat ) 7/7

24 What we have covered today Thanks for your attention! Getting started Explore dataset content Inspect visually Run clustering Assess clustering quality 7/7

The SAS System 18:28 Saturday, March 10, Plot of Canonical Variables Identified by Cluster

The SAS System 18:28 Saturday, March 10, Plot of Canonical Variables Identified by Cluster The SAS System 18:28 Saturday, March 10, 2018 1 The FASTCLUS Procedure Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02 Initial Seeds Cluster SepalLength SepalWidth PetalLength PetalWidth 1

More information

MULTIVARIATE HOMEWORK #5

MULTIVARIATE HOMEWORK #5 MULTIVARIATE HOMEWORK #5 Fisher s dataset on differentiating species of Iris based on measurements on four morphological characters (i.e. sepal length, sepal width, petal length, and petal width) was subjected

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS PRINCIPAL COMPONENTS ANALYSIS Iris Data Let s find Principal Components using the iris dataset. This is a well known dataset, often used to demonstrate the effect of clustering algorithms. It contains

More information

PANDAS FOUNDATIONS. pandas Foundations

PANDAS FOUNDATIONS. pandas Foundations PANDAS FOUNDATIONS pandas Foundations What is pandas? Python library for data analysis High-performance containers for data analysis Data structures with a lot of functionality Meaningful labels Time series

More information

Robust scale estimation with extensions

Robust scale estimation with extensions Robust scale estimation with extensions Garth Tarr, Samuel Müller and Neville Weber School of Mathematics and Statistics THE UNIVERSITY OF SYDNEY Outline The robust scale estimator P n Robust covariance

More information

An Introduction to Multivariate Methods

An Introduction to Multivariate Methods Chapter 12 An Introduction to Multivariate Methods Multivariate statistical methods are used to display, analyze, and describe data on two or more features or variables simultaneously. I will discuss multivariate

More information

MLE/MAP + Naïve Bayes

MLE/MAP + Naïve Bayes 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes Matt Gormley Lecture 19 March 20, 2018 1 Midterm Exam Reminders

More information

1. Introduction to Multivariate Analysis

1. Introduction to Multivariate Analysis 1. Introduction to Multivariate Analysis Isabel M. Rodrigues 1 / 44 1.1 Overview of multivariate methods and main objectives. WHY MULTIVARIATE ANALYSIS? Multivariate statistical analysis is concerned with

More information

Propensity Score Matching

Propensity Score Matching Propensity Score Matching This notebook illustrates how to do propensity score matching in Python. Original dataset available at: http://biostat.mc.vanderbilt.edu/wiki/main/datasets (http://biostat.mc.vanderbilt.edu/wiki/main/datasets)

More information

LEC 4: Discriminant Analysis for Classification

LEC 4: Discriminant Analysis for Classification LEC 4: Discriminant Analysis for Classification Dr. Guangliang Chen February 25, 2016 Outline Last time: FDA (dimensionality reduction) Today: QDA/LDA (classification) Naive Bayes classifiers Matlab/Python

More information

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis Week 5 Based in part on slides from textbook, slides of Susan Holmes Part I Linear Discriminant Analysis October 29, 2012 1 / 1 2 / 1 Nearest centroid rule Suppose we break down our data matrix as by the

More information

4 Statistics of Normally Distributed Data

4 Statistics of Normally Distributed Data 4 Statistics of Normally Distributed Data 4.1 One Sample a The Three Basic Questions of Inferential Statistics. Inferential statistics form the bridge between the probability models that structure our

More information

Data Mining with R. Linear Classifiers and the Perceptron Algorithm. Hugh Murrell

Data Mining with R. Linear Classifiers and the Perceptron Algorithm. Hugh Murrell Data Mining with R Linear Classifiers and the Perceptron Algorithm Hugh Murrell references These slides are based on a notes by Cosma Shalizi and an essay by Charles Elkan, but are self contained and access

More information

Last time: PCA. Statistical Data Mining and Machine Learning Hilary Term Singular Value Decomposition (SVD) Eigendecomposition and PCA

Last time: PCA. Statistical Data Mining and Machine Learning Hilary Term Singular Value Decomposition (SVD) Eigendecomposition and PCA Last time: PCA Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml

More information

Machine Leanring Theory and Applications: third lecture

Machine Leanring Theory and Applications: third lecture Machine Leanring Theory and Applications: third lecture Arnak Dalalyan ENSAE PARISTECH 12 avril 2016 Framework and notation Framework and notation We observe (X i, Y i ) X Y, i = 1,..., n independent randomly

More information

Applied Multivariate and Longitudinal Data Analysis

Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 In this chapter we will discuss inference

More information

In most cases, a plot of d (j) = {d (j) 2 } against {χ p 2 (1-q j )} is preferable since there is less piling up

In most cases, a plot of d (j) = {d (j) 2 } against {χ p 2 (1-q j )} is preferable since there is less piling up THE UNIVERSITY OF MINNESOTA Statistics 5401 September 17, 2005 Chi-Squared Q-Q plots to Assess Multivariate Normality Suppose x 1, x 2,..., x n is a random sample from a p-dimensional multivariate distribution

More information

MLE/MAP + Naïve Bayes

MLE/MAP + Naïve Bayes 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)

More information

Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes. November 9, Statistics 202: Data Mining

Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes. November 9, Statistics 202: Data Mining Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes November 9, 2012 1 / 1 Nearest centroid rule Suppose we break down our data matrix as by the labels yielding (X

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 3 September 14, Readings: Mitchell Ch Murphy Ch.

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 3 September 14, Readings: Mitchell Ch Murphy Ch. School of Computer Science 10-701 Introduction to Machine Learning aïve Bayes Readings: Mitchell Ch. 6.1 6.10 Murphy Ch. 3 Matt Gormley Lecture 3 September 14, 2016 1 Homewor 1: due 9/26/16 Project Proposal:

More information

Principal Component Analysis (PCA) Principal Component Analysis (PCA)

Principal Component Analysis (PCA) Principal Component Analysis (PCA) Recall: Eigenvectors of the Covariance Matrix Covariance matrices are symmetric. Eigenvectors are orthogonal Eigenvectors are ordered by the magnitude of eigenvalues: λ 1 λ 2 λ p {v 1, v 2,..., v n } Recall:

More information

ISCID-CO Dunkerque/ULCO. Mathematics applied to economics and management Foundations of Descriptive and Inferential Statistics

ISCID-CO Dunkerque/ULCO. Mathematics applied to economics and management Foundations of Descriptive and Inferential Statistics IMBS 1 ISCID-CO Dunkerque/ULCO Mathematics applied to economics and management Foundations of Descriptive and Inferential Statistics December 2015 - Final assessment - Session 1 - Semester 1 Time allowed

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

2 Getting Started with Numerical Computations in Python

2 Getting Started with Numerical Computations in Python 1 Documentation and Resources * Download: o Requirements: Python, IPython, Numpy, Scipy, Matplotlib o Windows: google "windows download (Python,IPython,Numpy,Scipy,Matplotlib" o Debian based: sudo apt-get

More information

Outline. Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining

Outline. Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining Outline Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining Dimensionality Reduction Introduction Principal Components Analysis Singular Value Decomposition Multidimensional

More information

Data Mining and Analysis

Data Mining and Analysis 978--5-766- - Data Mining and Analysis: Fundamental Concepts and Algorithms CHAPTER Data Mining and Analysis Data mining is the process of discovering insightful, interesting, and novel patterns, as well

More information

Discriminant Analysis

Discriminant Analysis Chapter 16 Discriminant Analysis A researcher collected data on two external features for two (known) sub-species of an insect. She can use discriminant analysis to find linear combinations of the features

More information

Discriminant Analysis (DA)

Discriminant Analysis (DA) Discriminant Analysis (DA) Involves two main goals: 1) Separation/discrimination: Descriptive Discriminant Analysis (DDA) 2) Classification/allocation: Predictive Discriminant Analysis (PDA) In DDA Classification

More information

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive

More information

Statistics Terminology

Statistics Terminology :\the\world Machine Learning Camp Day 1: Exploring Data with Excel Statistics Terminology Variance Variance is a measure of how spread out a set of data or numbers is from its mean. For a set of numbers

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning

More information

Techniques and Applications of Multivariate Analysis

Techniques and Applications of Multivariate Analysis Techniques and Applications of Multivariate Analysis Department of Statistics Professor Yong-Seok Choi E-mail: yschoi@pusan.ac.kr Home : yschoi.pusan.ac.kr Contents Multivariate Statistics (I) in Spring

More information

Lecture 5: Classification

Lecture 5: Classification Lecture 5: Classification Advanced Applied Multivariate Analysis STAT 2221, Spring 2015 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department of Mathematical Sciences Binghamton

More information

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation

More information

Section 7: Discriminant Analysis.

Section 7: Discriminant Analysis. Section 7: Discriminant Analysis. Ensure you have completed all previous worksheets before advancing. 1 Linear Discriminant Analysis To perform Linear Discriminant Analysis in R we will make use of the

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical

More information

Supervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing

Supervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing Supervised Learning Unsupervised learning: To extract structure and postulate hypotheses about data generating process from observations x 1,...,x n. Visualize, summarize and compress data. We have seen

More information

Experiment 1: Linear Regression

Experiment 1: Linear Regression Experiment 1: Linear Regression August 27, 2018 1 Description This first exercise will give you practice with linear regression. These exercises have been extensively tested with Matlab, but they should

More information

STAT 730 Chapter 1 Background

STAT 730 Chapter 1 Background STAT 730 Chapter 1 Background Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 27 Logistics Course notes hopefully posted evening before lecture,

More information

Introduction to Python Practical 2

Introduction to Python Practical 2 Introduction to Python Practical 2 Daniel Carrera & Brian Thorsbro November 2017 1 Searching through data One of the most important skills in scientific computing is sorting through large datasets and

More information

Advanced Studies in Applied Statistics (WBL), ETHZ Applied Multivariate Statistics Spring 2018, Week 1

Advanced Studies in Applied Statistics (WBL), ETHZ Applied Multivariate Statistics Spring 2018, Week 1 Advanced Studies in Applied Statistics (WBL), ETHZ Applied Multivariate Statistics Spring 2018, Week 1 Lecturer: Beate Sick sickb@ethz.ch Remark: Much of the material have been developed together with

More information

The Rain in Spain - Tableau Public Workbook

The Rain in Spain - Tableau Public Workbook The Rain in Spain - Tableau Public Workbook This guide will take you through the steps required to visualize how the rain falls in Spain with Tableau public. (All pics from Mac version of Tableau) Workbook

More information

Applied Multivariate Analysis (Stat 206)

Applied Multivariate Analysis (Stat 206) Applied Multivariate Analysis (Stat 206) James Johndrow 2016-09-26 Outline of the course This course covers statistical methods for learning from multivariate data. A tentative list of topics, with relevant

More information

Lectures about Python, useful both for beginners and experts, can be found at (http://scipy-lectures.github.io).

Lectures about Python, useful both for beginners and experts, can be found at  (http://scipy-lectures.github.io). Random Matrix Theory (Sethna, "Entropy, Order Parameters, and Complexity", ex. 1.6, developed with Piet Brouwer) 2016, James Sethna, all rights reserved. This is an ipython notebook. This hints file is

More information

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /15/ /12

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /15/ /12 BIO5312 Biostatistics R Session 12: Principal Component Analysis Dr. Junchao Xia Center of Biophysics and Computational Biology Fall 2016 11/15/2016 1 /12 Matrix Operations I: Constructing matrix(data,

More information

ISyE 6416: Computational Statistics Spring Lecture 5: Discriminant analysis and classification

ISyE 6416: Computational Statistics Spring Lecture 5: Discriminant analysis and classification ISyE 6416: Computational Statistics Spring 2017 Lecture 5: Discriminant analysis and classification Prof. Yao Xie H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology

More information

Error, Accuracy and Convergence

Error, Accuracy and Convergence Error, Accuracy and Convergence Error in Numerical Methods i Every result we compute in Numerical Methods is inaccurate. What is our model of that error? Approximate Result = True Value + Error. x = x

More information

CS444/544: Midterm Review. Carlos Scheidegger

CS444/544: Midterm Review. Carlos Scheidegger CS444/544: Midterm Review Carlos Scheidegger D3: DATA-DRIVEN DOCUMENTS The essential idea D3 creates a two-way association between elements of your dataset and entries in the DOM D3 operates on selections

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecture Slides for Machine Learning 2nd Edition ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ The MIT Press, 2010

More information

Multivariate analysis of variance and covariance

Multivariate analysis of variance and covariance Introduction Multivariate analysis of variance and covariance Univariate ANOVA: have observations from several groups, numerical dependent variable. Ask whether dependent variable has same mean for each

More information

Supervised Learning: Linear Methods (1/2) Applied Multivariate Statistics Spring 2012

Supervised Learning: Linear Methods (1/2) Applied Multivariate Statistics Spring 2012 Supervised Learning: Linear Methods (1/2) Applied Multivariate Statistics Spring 2012 Overview Review: Conditional Probability LDA / QDA: Theory Fisher s Discriminant Analysis LDA: Example Quality control:

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a

More information

Outline Python, Numpy, and Matplotlib Making Models with Polynomials Making Models with Monte Carlo Error, Accuracy and Convergence Floating Point Mod

Outline Python, Numpy, and Matplotlib Making Models with Polynomials Making Models with Monte Carlo Error, Accuracy and Convergence Floating Point Mod Outline Python, Numpy, and Matplotlib Making Models with Polynomials Making Models with Monte Carlo Error, Accuracy and Convergence Floating Point Modeling the World with Arrays The World in a Vector What

More information

Data Intensive Computing Handout 11 Spark: Transitive Closure

Data Intensive Computing Handout 11 Spark: Transitive Closure Data Intensive Computing Handout 11 Spark: Transitive Closure from random import Random spark/transitive_closure.py num edges = 3000 num vertices = 500 rand = Random(42) def generate g raph ( ) : edges

More information

Lecture 2: Linear regression

Lecture 2: Linear regression Lecture 2: Linear regression Roger Grosse 1 Introduction Let s ump right in and look at our first machine learning algorithm, linear regression. In regression, we are interested in predicting a scalar-valued

More information

INTRODUCTION TO DATA SCIENCE

INTRODUCTION TO DATA SCIENCE INTRODUCTION TO DATA SCIENCE JOHN P DICKERSON Lecture #13 3/9/2017 CMSC320 Tuesdays & Thursdays 3:30pm 4:45pm ANNOUNCEMENTS Mini-Project #1 is due Saturday night (3/11): Seems like people are able to do

More information

DATA SCIENCE SIMPLIFIED USING ARCGIS API FOR PYTHON

DATA SCIENCE SIMPLIFIED USING ARCGIS API FOR PYTHON DATA SCIENCE SIMPLIFIED USING ARCGIS API FOR PYTHON LEAD CONSULTANT, INFOSYS LIMITED SEZ Survey No. 41 (pt) 50 (pt), Singapore Township PO, Ghatkesar Mandal, Hyderabad, Telengana 500088 Word Limit of the

More information

Latent Variable Methods Course

Latent Variable Methods Course Latent Variable Methods Course Learning from data Instructor: Kevin Dunn kevin.dunn@connectmv.com http://connectmv.com Kevin Dunn, ConnectMV, Inc. 2011 Revision: 269:35e2 compiled on 15-12-2011 ConnectMV,

More information

Tutorial Three: Loops and Conditionals

Tutorial Three: Loops and Conditionals Tutorial Three: Loops and Conditionals Imad Pasha Chris Agostino February 18, 2015 1 Introduction In lecture Monday we learned that combinations of conditionals and loops could make our code much more

More information

Statistics in Stata Introduction to Stata

Statistics in Stata Introduction to Stata 50 55 60 65 70 Statistics in Stata Introduction to Stata Thomas Scheike Statistical Methods, Used to test simple hypothesis regarding the mean in a single group. Independent samples and data approximately

More information

Lecture 08: Poisson and More. Lisa Yan July 13, 2018

Lecture 08: Poisson and More. Lisa Yan July 13, 2018 Lecture 08: Poisson and More Lisa Yan July 13, 2018 Announcements PS1: Grades out later today Solutions out after class today PS2 due today PS3 out today (due next Friday 7/20) 2 Midterm announcement Tuesday,

More information

Classification Methods II: Linear and Quadratic Discrimminant Analysis

Classification Methods II: Linear and Quadratic Discrimminant Analysis Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear

More information

VPython Class 2: Functions, Fields, and the dreaded ˆr

VPython Class 2: Functions, Fields, and the dreaded ˆr Physics 212E Classical and Modern Physics Spring 2016 1. Introduction VPython Class 2: Functions, Fields, and the dreaded ˆr In our study of electric fields, we start with a single point charge source

More information

Principal Component Analysis. Applied Multivariate Statistics Spring 2012

Principal Component Analysis. Applied Multivariate Statistics Spring 2012 Principal Component Analysis Applied Multivariate Statistics Spring 2012 Overview Intuition Four definitions Practical examples Mathematical example Case study 2 PCA: Goals Goal 1: Dimension reduction

More information

Data Exploration and Unsupervised Learning with Clustering

Data Exploration and Unsupervised Learning with Clustering Data Exploration and Unsupervised Learning with Clustering Paul F Rodriguez,PhD San Diego Supercomputer Center Predictive Analytic Center of Excellence Clustering Idea Given a set of data can we find a

More information

Introduction to SVD and Applications

Introduction to SVD and Applications Introduction to SVD and Applications Eric Kostelich and Dave Kuhl MSRI Climate Change Summer School July 18, 2008 Introduction The goal of this exercise is to familiarize you with the basics of the singular

More information

Tools of AI. Marcin Sydow. Summary. Machine Learning

Tools of AI. Marcin Sydow. Summary. Machine Learning Machine Learning Outline of this Lecture Motivation for Data Mining and Machine Learning Idea of Machine Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication

More information

Quiz3_NaiveBayesTest

Quiz3_NaiveBayesTest Quiz3_NaiveBayesTest November 8, 2018 In [1]: import numpy as np import pandas as pd data = pd.read_csv("weatherx.csv") data Out[1]: Outlook Temp Humidity Windy 0 Sunny hot high False no 1 Sunny hot high

More information

import pandas as pd d = {"A":[1,2,np.nan], "B":[5,np.nan,np.nan], "C":[1,2,3]} In [9]: Out[8]: {'A': [1, 2, nan], 'B': [5, nan, nan], 'C': [1, 2, 3]}

import pandas as pd d = {A:[1,2,np.nan], B:[5,np.nan,np.nan], C:[1,2,3]} In [9]: Out[8]: {'A': [1, 2, nan], 'B': [5, nan, nan], 'C': [1, 2, 3]} In [4]: import numpy as np In [5]: import pandas as pd In [7]: d = {"":[1,2,np.nan], "B":[5,np.nan,np.nan], "C":[1,2,3]} In [8]: d Out[8]: {'': [1, 2, nan], 'B': [5, nan, nan], 'C': [1, 2, 3]} In [9]:

More information

Assignment 3. Introduction to Machine Learning Prof. B. Ravindran

Assignment 3. Introduction to Machine Learning Prof. B. Ravindran Assignment 3 Introduction to Machine Learning Prof. B. Ravindran 1. In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively

More information

Dimension Reduction (PCA, ICA, CCA, FLD,

Dimension Reduction (PCA, ICA, CCA, FLD, Dimension Reduction (PCA, ICA, CCA, FLD, Topic Models) Yi Zhang 10-701, Machine Learning, Spring 2011 April 6 th, 2011 Parts of the PCA slides are from previous 10-701 lectures 1 Outline Dimension reduction

More information

Machine Learning (CSE 446): Unsupervised Learning: K-means and Principal Component Analysis

Machine Learning (CSE 446): Unsupervised Learning: K-means and Principal Component Analysis Machine Learning (CSE 446): Unsupervised Learning: K-means and Principal Component Analysis Sham M Kakade c 2019 University of Washington cse446-staff@cs.washington.edu 0 / 10 Announcements Please do Q1

More information

Homework : Data Mining SOLUTIONS

Homework : Data Mining SOLUTIONS Homework 4 36-350: Data Mining SOLUTIONS 1. Similarity searching for cells (a) How many features are there in the raw data? Are they discrete or continuous? (If there are some of each, which is which?)

More information

Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions

Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions Parthan Kasarapu & Lloyd Allison Monash University, Australia September 8, 25 Parthan Kasarapu

More information

Packing_Similarity_Dendrogram.py

Packing_Similarity_Dendrogram.py Packing_Similarity_Dendrogram.py Summary: This command-line script is designed to compare the packing of a set of input structures of a molecule (polymorphs, co-crystals, solvates, and hydrates). An all-to-all

More information

Star Cluster Photometry and the H-R Diagram

Star Cluster Photometry and the H-R Diagram Star Cluster Photometry and the H-R Diagram Contents Introduction Star Cluster Photometry... 1 Downloads... 1 Part 1: Measuring Star Magnitudes... 2 Part 2: Plotting the Stars on a Colour-Magnitude (H-R)

More information

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means 4.1 The Need for Analytical Comparisons...the between-groups sum of squares averages the differences

More information

HOMEWORK #4: LOGISTIC REGRESSION

HOMEWORK #4: LOGISTIC REGRESSION HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your

More information

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition

CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition CLUe Training An Introduction to Machine Learning in R with an example from handwritten digit recognition Ad Feelders Universiteit Utrecht Department of Information and Computing Sciences Algorithmic Data

More information

CPSC 340 Assignment 4 (due November 17 ATE)

CPSC 340 Assignment 4 (due November 17 ATE) CPSC 340 Assignment 4 due November 7 ATE) Multi-Class Logistic The function example multiclass loads a multi-class classification datasetwith y i {,, 3, 4, 5} and fits a one-vs-all classification model

More information

L14. 1 Lecture 14: Crash Course in Probability. July 7, Overview and Objectives. 1.2 Part 1: Probability

L14. 1 Lecture 14: Crash Course in Probability. July 7, Overview and Objectives. 1.2 Part 1: Probability L14 July 7, 2017 1 Lecture 14: Crash Course in Probability CSCI 1360E: Foundations for Informatics and Analytics 1.1 Overview and Objectives To wrap up the fundamental concepts of core data science, as

More information

Numpy. Luis Pedro Coelho. October 22, Programming for Scientists. Luis Pedro Coelho (Programming for Scientists) Numpy October 22, 2012 (1 / 26)

Numpy. Luis Pedro Coelho. October 22, Programming for Scientists. Luis Pedro Coelho (Programming for Scientists) Numpy October 22, 2012 (1 / 26) Numpy Luis Pedro Coelho Programming for Scientists October 22, 2012 Luis Pedro Coelho (Programming for Scientists) Numpy October 22, 2012 (1 / 26) Historical Numeric (1995) Numarray (for large arrays)

More information

Exercise 1. In the lecture you have used logistic regression as a binary classifier to assign a label y i { 1, 1} for a sample X i R D by

Exercise 1. In the lecture you have used logistic regression as a binary classifier to assign a label y i { 1, 1} for a sample X i R D by Exercise 1 Deadline: 04.05.2016, 2:00 pm Procedure for the exercises: You will work on the exercises in groups of 2-3 people (due to your large group size I will unfortunately not be able to correct single

More information

Introduction to ggplot2. ggplot2 is (in my opinion) one of the best documented packages in R. The full documentation for it can be found here:

Introduction to ggplot2. ggplot2 is (in my opinion) one of the best documented packages in R. The full documentation for it can be found here: Introduction to ggplot2 This practical introduces a slightly different method of creating plots in R using the ggplot2 package. The package is an implementation of Leland Wilkinson's Grammar of Graphics-

More information

z = β βσβ Statistical Analysis of MV Data Example : µ=0 (Σ known) consider Y = β X~ N 1 (β µ, β Σβ) test statistic for H 0β is

z = β βσβ Statistical Analysis of MV Data Example : µ=0 (Σ known) consider Y = β X~ N 1 (β µ, β Σβ) test statistic for H 0β is Example X~N p (µ,σ); H 0 : µ=0 (Σ known) consider Y = β X~ N 1 (β µ, β Σβ) H 0β : β µ = 0 test statistic for H 0β is y z = β βσβ /n And reject H 0β if z β > c [suitable critical value] 301 Reject H 0 if

More information

Applied Multivariate Analysis

Applied Multivariate Analysis Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017 Discriminant Analysis Background 1 Discriminant analysis Background General Setup for the Discriminant Analysis Descriptive

More information

Feature selection. Micha Elsner. January 29, 2014

Feature selection. Micha Elsner. January 29, 2014 Feature selection Micha Elsner January 29, 2014 2 Using megam as max-ent learner Hal Daume III from UMD wrote a max-ent learner Pretty typical of many classifiers out there... Step one: create a text file

More information

LAST TIME..THE COMMAND LINE

LAST TIME..THE COMMAND LINE LAST TIME..THE COMMAND LINE Maybe slightly to fast.. Finder Applications Utilities Terminal The color can be adjusted and is simply a matter of taste. Per default it is black! Dr. Robert Kofler Introduction

More information

The Mean Version One way to write the One True Regression Line is: Equation 1 - The One True Line

The Mean Version One way to write the One True Regression Line is: Equation 1 - The One True Line Chapter 27: Inferences for Regression And so, there is one more thing which might vary one more thing aout which we might want to make some inference: the slope of the least squares regression line. The

More information

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis .. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make

More information

PYTHON, NUMPY, AND SPARK. Prof. Chris Jermaine

PYTHON, NUMPY, AND SPARK. Prof. Chris Jermaine PYTHON, NUMPY, AND SPARK Prof. Chris Jermaine cmj4@cs.rice.edu 1 Next 1.5 Days Intro to Python for statistical/numerical programming (day one) Focus on basic NumPy API, using arrays efficiently Will take

More information

Homework : Data Mining SOLUTIONS

Homework : Data Mining SOLUTIONS Homework 4 36-350: Data Mining SOLUTIONS I loaded the data and transposed it thus: library(elemstatlearn) data(nci) nci.t = t(nci) This means that rownames(nci.t) contains the cell classes, or equivalently

More information

AMS 132: Discussion Section 2

AMS 132: Discussion Section 2 Prof. David Draper Department of Applied Mathematics and Statistics University of California, Santa Cruz AMS 132: Discussion Section 2 All computer operations in this course will be described for the Windows

More information

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann Machine Learning! in just a few minutes Jan Peters Gerhard Neumann 1 Purpose of this Lecture Foundations of machine learning tools for robotics We focus on regression methods and general principles Often

More information

Correlation Preserving Unsupervised Discretization. Outline

Correlation Preserving Unsupervised Discretization. Outline Correlation Preserving Unsupervised Discretization Jee Vang Outline Paper References What is discretization? Motivation Principal Component Analysis (PCA) Association Mining Correlation Preserving Discretization

More information

Data Structures & Database Queries in GIS

Data Structures & Database Queries in GIS Data Structures & Database Queries in GIS Objective In this lab we will show you how to use ArcGIS for analysis of digital elevation models (DEM s), in relationship to Rocky Mountain bighorn sheep (Ovis

More information

Chapter 7, continued: MANOVA

Chapter 7, continued: MANOVA Chapter 7, continued: MANOVA The Multivariate Analysis of Variance (MANOVA) technique extends Hotelling T 2 test that compares two mean vectors to the setting in which there are m 2 groups. We wish to

More information

Introduction to Python

Introduction to Python Introduction to Python Luis Pedro Coelho luis@luispedro.org @luispedrocoelho European Molecular Biology Laboratory Lisbon Machine Learning School 2015 Luis Pedro Coelho (@luispedrocoelho) Introduction

More information

Ratio of Vector Lengths as an Indicator of Sample Representativeness

Ratio of Vector Lengths as an Indicator of Sample Representativeness Abstract Ratio of Vector Lengths as an Indicator of Sample Representativeness Hee-Choon Shin National Center for Health Statistics, 3311 Toledo Rd., Hyattsville, MD 20782 The main objective of sampling

More information

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations Lab 2 Worksheet Problems Problem : Geometry and Linear Equations Linear algebra is, first and foremost, the study of systems of linear equations. You are going to encounter linear systems frequently in

More information