Data Mining algorithms

Data Mining algorithms, 2017-2018 spring (02.07-09.2018)
Overview: Classification vs. Regression, Evaluation I

Basics
Bálint Daróczy, daroczyb@ilab.sztaki.hu
Contact: MTA SZTAKI, Lágymányosi str. 11
Web site: http://cs.bme.hu/~daroczyb/dm_2018_spring (slides will be uploaded after class)

Requirements
Lectures: 2 x (2 x 45) min., Wednesday and Friday, 12:00-14:00. Where? IB134. Can we start at 12:15 with a 5 min. break and finish at 13:50?
Project work: challenge?
Tests: midterm (7th week?) + exam

References
1. Tan, Steinbach, Kumar (TSK): Introduction to Data Mining. Addison-Wesley, 2006, 769 pp., ISBN-10: 0321321367, ISBN-13: 9780321321367. http://www-users.cs.umn.edu/~kumar/dmbook/index.php
2. Leskovec, Rajaraman, Ullman: Mining of Massive Datasets. http://infolab.stanford.edu/~ullman/mmds.html
3. Devroye, Györfi, Lugosi: A Probabilistic Theory of Pattern Recognition, 1996
4. Rojas: Neural Networks. Springer-Verlag, Berlin, 1996
5. Hopcroft, Kannan: Computer Science Theory for the Information Age. http://www.cs.cmu.edu/~venkatg/teaching/cstheory-infoage/hopcroft-kannanfeb2012.pdf
+ papers

Main topics
o Evaluation of classifiers: cross-validation, bias-variance trade-off
o Supervised learning (classification): nearest neighbour methods, decision trees, logistic regression, non-linear classification, neural networks, support vector networks, time-series classification and dynamic time warping
o Linear and polynomial, one- and multidimensional regression and optimization: gradient descent and least squares
o Advanced classification methods: semi-supervised learning, multi-class classification, multi-task learning, ensemble methods (bagging, boosting, stacking)
o Clustering: k-means (k-medoid, FurthestFirst), hierarchical clustering, Kleinberg's impossibility theorem, internal and external evaluation, convergence speed
o Principal component analysis, low-rank approximation, collaborative filtering and applications (recommender systems, drug-target prediction)
o Density estimation and anomaly detection
o Frequent itemset mining
o Additional applications and problems: preprocessing, scaling, overfitting, hyperparameter optimization, imbalanced classification

Tools
Scikit (mainly), Chainer, TensorFlow, Keras, Weka (some), DATO (opt.)
Underlying: Python (numpy), R etc.
Server: at SZTAKI (unfortunately w/o GPU)

Projects
Some ideas:
- Text mining/classification: trust and bias, embeddings, network?
- Recommendation system: item-to-item recommendation, regular, explicit
- Image: classification/reconstruction, medical image classification
Team work would be preferable. Presentation at the end of the semester.

Example user/movie rating matrix:
User/Movie | Napoleon Dynamite | Monster RT. | Cindarella | Life on Earth
David      | 1                 | ?           | ?          | 3
Dori       | 5                 | 3           | 5          | 5
Peter      | ?                 | 4           | 3          | ?

Representation
Dataset: set of objects (records), with some known attributes.
Hypothesis: the attributes represent and differentiate the objects.
E.g. attribute types: binary, nominal, numerical, string, date

Attributes, features (example records):
Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
 10 | No     | Single         | 90K            | Yes

Representation
Attributes, features (same example table as above)
Structure: sequential, spatial
Sparse or dense
We presume that the set of attributes is previously known and fixed.
Missing values?

Machine learning
Let X = {x_1, ..., x_T} be a finite set in R^d, and let each point have a label y_i, with y = {y_1, ..., y_T}, usually y_i in {-1, +1}. The problem of binary classification is to find a particular f(x) that approximates y over X.
How to measure the performance of the approximation? How to choose the function class? How to find a particular element in the chosen function class? How to generalize? Classification vs. regression?
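One standard way to make "performance of the approximation" precise (a common choice, not spelled out on this slide) is the empirical risk under the 0-1 loss:

\hat{R}_T(f) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\left[ f(x_t) \neq y_t \right]

i.e. the fraction of points in X that f misclassifies; generalization then asks how close this empirical risk is to the risk on unseen data.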

Classification
E.g. the problem of learning a half-space or a linear separator. The task is to find a d-dimensional vector w, if one exists, and a threshold b such that
w · x_i > b for each x_i labelled +1
w · x_i < b for each x_i labelled -1
A vector-threshold pair (w, b) satisfying the inequalities is called a linear separator. -> dual problem: high-dimensional learning via kernels (inner products)
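As a minimal illustration (not from the slides), the classical perceptron update finds such a pair (w, b) whenever the data are linearly separable; the toy data below are made up for the example.

import numpy as np

def perceptron(X, y, max_epochs=100):
    # Find (w, b) with w.x > b for y = +1 and w.x < b for y = -1,
    # assuming the data are linearly separable.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) - b) <= 0:   # misclassified or on the boundary
                w += y_i * x_i                    # rotate w towards the correct side
                b -= y_i                          # shift the threshold
                errors += 1
        if errors == 0:                           # all inequalities satisfied
            break
    return w, b

# Toy, linearly separable data (made up for the example)
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(w, b, np.sign(X @ w - b))   # the signs should match y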

Classification: sample set with +/- labels (Fig.: TSK)

Clustering, is it regression? Fig.: TSK

K-Means
Presumption: our data points are in a vector space.

K-means(D, k):
  Init: let c_1, c_2, ..., c_k be the centroids of the clusters
  While the centroids change:
    assign every point in D to the cluster with the closest centroid
    update each centroid to the mean of its assigned points, e.g. (x_1 + x_2 + x_3)/3

The initial centroids are: a) random points from D, b) random vectors
When do we stop? a) the centroids are not changing, b) the approximation error is below a threshold, c) we reach the maximal number of allowed iterations
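A minimal numpy sketch of the loop above (initialization by random points from D, stopping when the centroids no longer change; the data are made up for the example):

import numpy as np

def k_means(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Init: k random points of D as the initial centroids
    centroids = D[rng.choice(len(D), size=k, replace=False)]
    labels = np.zeros(len(D), dtype=int)
    for _ in range(max_iter):
        # Assign every point to the cluster with the closest centroid
        dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of its assigned points
        new_centroids = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop: centroids not changing
            break
        centroids = new_centroids
    return centroids, labels

# Made-up 2-D points with two obvious groups
D = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])
centroids, labels = k_means(D, k=2)
print(centroids, labels)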

K-means: iterations of the algorithm on a 2-D example (Fig.: TSK)

K-means

K-nearest neighbor (K-NN)
Hypothesis: if it walks like a duck, swims like a duck, and eats like a duck, then it is a duck!
1. Find the k nearest training points
2. Majority vote
E.g.: Fig.: TSK
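A minimal sketch of the two steps with scikit-learn (the course's main tool); the toy data are made up for the example:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training points in R^2 with binary labels
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([-1, -1, -1, 1, 1, 1])

# k-NN: store the training set, then (1) find the k nearest points, (2) majority vote
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)            # "training" is just remembering the data (lazy learner)
print(clf.predict([[0.15, 0.1], [0.95, 1.0]]))   # expected: [-1, 1]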

K-nearest neighbor (K-NN)
Why is it not a good classifier?
Machine learning algorithms are either
Eager: the algorithm builds a model and predicts using only the model, or
Lazy: the algorithm uses the training set during prediction.
k-NN is lazy. Complexity? Generalization? Fig.: TSK

K-nearest neighbor (K-NN)
E.g. distance/divergence metrics:
- Minkowski
- Mahalanobis
Notes:
- Cosine, Jaccard, Kullback-Leibler, Jensen-Shannon etc.
- scale
- normalization
(Scatter-plot example with query points and their neighbours: Fig.: TSK)
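A small sketch of a few of these distances with scipy (the vectors and covariance matrix below are made up for the example):

import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 0.0, 2.0])
v = np.array([0.0, 1.0, 3.0])

# Minkowski distance with p=1 (Manhattan) and p=2 (Euclidean)
print(distance.minkowski(u, v, p=1), distance.minkowski(u, v, p=2))

# Mahalanobis distance needs the inverse covariance matrix of the data
cov = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.0], [0.0, 0.0, 2.0]])
print(distance.mahalanobis(u, v, np.linalg.inv(cov)))

# Cosine distance (1 - cosine similarity); sensitive to scale/normalization choices
print(distance.cosine(u, v))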

Johnson-Lindenstrauss lemma (1984)

Johnson-Lindenstrauss lemma
OK, we should stop, since the next step is a bit far away. Yet.
But wait: what may be the next step? Are there any other methods to approximate distances or approximate nearest neighbours?
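For reference, the standard statement of the lemma: for any 0 < eps < 1 and any n points in R^d there is a linear map into R^k with k = O(log(n)/eps^2) that distorts all pairwise Euclidean distances by at most a factor of (1 ± eps). A minimal numpy sketch of such a map via Gaussian random projection (dimensions and data made up for the example):

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 10000, 1000                  # n points in R^d, projected down to R^k

X = rng.normal(size=(n, d))                 # made-up high-dimensional points
R = rng.normal(size=(d, k)) / np.sqrt(k)    # Gaussian random projection matrix
Y = X @ R                                   # projected points in R^k

# Compare a few pairwise distances before and after the projection
for i, j in [(0, 1), (2, 3), (4, 5)]:
    orig = np.linalg.norm(X[i] - X[j])
    proj = np.linalg.norm(Y[i] - Y[j])
    print(f"pair ({i},{j}): ratio = {proj / orig:.3f}")   # close to 1 for large k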

E.g. Riemannian manifold
Given a smooth (or differentiable) n-dimensional manifold M, a Riemannian metric on M (or on TM) is a family of inner products ⟨·,·⟩_p, p ∈ M, one on each tangent space T_pM, such that the inner product depends smoothly on p. A smooth manifold M with a Riemannian metric is called a Riemannian manifold.

Riemannian metric
Let γ be a continuously differentiable curve in M from x to y. The length of the curve γ on M is defined by integrating the length of its tangent vector dγ (d is a differential operator). Example in two dimensions: ds^2 = g_11 dx_1^2 + 2 g_12 dx_1 dx_2 + g_22 dx_2^2. If g_ij is the Kronecker delta, this is the Euclidean metric. The distance d(x, y) is the shortest length among the curves between x and y.
OK, at this point we should really stop! Do not worry, we will come back.
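Spelled out (a standard formula, not shown on the slide): for a curve γ: [a, b] → M the length is

L(\gamma) = \int_a^b \sqrt{ \langle \dot{\gamma}(t), \dot{\gamma}(t) \rangle_{\gamma(t)} } \, dt = \int_a^b \sqrt{ \sum_{i,j} g_{ij}(\gamma(t)) \, \dot{\gamma}_i(t) \, \dot{\gamma}_j(t) } \, dt

and d(x, y) is the infimum of L(γ) over all such curves from x to y.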

Evaluation
Confusion matrix (binary classification):

Ground truth \ predicted | pos                 | neg                 | Total
pos                      | True Positive (TP)  | False Negative (FN) | TP+FN
neg                      | False Positive (FP) | True Negative (TN)  | FP+TN
Total                    | TP+FP               | FN+TN               |

Evaluation
Accuracy: proportion of correctly classified instances, (TP+TN)/(TP+FP+TN+FN)
Precision (p): proportion of correctly classified positive instances among the instances with positive predicted label, TP/(TP+FP)
Recall (r): proportion of correctly classified positive instances among all positive instances, TP/(TP+FN)
F-measure: harmonic mean of precision and recall, 2*p*r/(p+r)
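A small sketch computing these quantities from predictions, both by hand and with scikit-learn (the labels below are made up for the example):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])   # made-up ground truth
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])   # made-up predictions

TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
TN = np.sum((y_pred == 0) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f_measure)
# Same values via scikit-learn
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))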

Evaluation
False-Positive Rate (FPR) = FP/(FP+TN)
True-Positive Rate (TPR) = TP/(TP+FN)
ROC: Receiver Operating Characteristic
MAP: Mean Average Precision (Friday)
nDCG: normalized Discounted Cumulative Gain (later)

Evaluation, tradeoff

ROC: Receiver Operating Characteristic
o Only for binary classification
o Area Under the Curve (AUC): corresponds to the probability of correct separation, i.e. that a random positive instance is scored above a random negative one
o Threshold independent
o Presumption: scores are available (how to handle ties?)
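A short sketch tracing the ROC curve and computing the AUC with scikit-learn (the scores and labels are made up for the example; the same routine applies to the exercise scores below):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])                        # made-up binary labels
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold >= {th:.2f}: FPR = {f:.2f}, TPR = {t:.2f}")
print("AUC =", auc)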

ROC: Receiver Operating Characteristic AUC=?

Some exercise: How to compare models? Compute TP, FN, TN, FP at each threshold, then TPR, FPR, and the AUC for each model.

Model 1:
Label | +    | +    | -    | +    | -    | -    | +    | +    | -    | +
Score | 0.16 | 0.32 | 0.42 | 0.44 | 0.45 | 0.51 | 0.78 | 0.87 | 0.91 | 0.93

Model 2:
Label | +    | +    | -    | -    | -    | +    | +    | -    | +    | +
Score | 0.43 | 0.56 | 0.62 | 0.78 | 0.79 | 0.86 | 0.89 | 0.89 | 0.91 | 0.96