Data Mining. Supervised Learning. Hamid Beigy. Sharif University of Technology. Fall 1396

1 Data Mining Supervised Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall / 15

2 Table of contents
1 Introduction
2 Supervised learning
3 Classification
4 Regression
5 Model selection

3 Introduction
The data is represented as $X = \{x_1, \ldots, x_N\}$, an $N \times D$ feature matrix: $N$ examples, with $D$ features representing each example.
Each row is an example and each column is a feature.
$x_n = (x_{n1}, \ldots, x_{nD})$ denotes the $n$th example (a vector of length $D$).
\[
X = \begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1D} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{iD} \\
\vdots & & \vdots & & \vdots \\
x_{N1} & \cdots & x_{Nf} & \cdots & x_{ND}
\end{bmatrix},
\qquad
t = \begin{bmatrix} t_1 \\ \vdots \\ t_i \\ \vdots \\ t_N \end{bmatrix}
\]
$t$ denotes the labels/responses in the form of an $N \times 1$ label/response vector, and $t_n$ denotes the label/response of example $x_n$.
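As a minimal sketch of this representation, the feature matrix and label vector can be held in NumPy arrays. The numeric values below are made up for illustration and do not come from the slides.

```python
import numpy as np

# Hypothetical toy data: N = 4 examples, D = 3 features (values are made up).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])
# One label/response per example, stored as a length-N vector.
t = np.array([0, 0, 1, 1])

N, D = X.shape        # number of examples, number of features
x_n = X[2]            # one example: a row of X, a vector of length D
feature_f = X[:, 1]   # one feature: a column of X, one value per example

print(N, D)
print(x_n.shape, feature_f.shape, t.shape)
```

Rows index examples and columns index features, so `X[n]` plays the role of $x_n$ and `t[n]` the role of $t_n$ (with zero-based indices).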

4 Supervised learning
In supervised learning, the goal is to find a mapping from inputs $X$ to outputs $t$, given a labeled set of input-output pairs
$S = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}$.
$S$ is called the training set.
In the simplest setting, each training input $x$ is a $D$-dimensional vector of numbers.
Each component of $x$ is called a feature, attribute, or variable, and $x$ is called the feature vector.
In general, $x$ could be a complex structured object, such as an image, a sentence, a message, a time series, a molecular shape, or a graph.
When $t_i \in \{1, 2, \ldots, C\}$, the problem is known as classification.
In some situations, multiple classes are associated with each input $x$, and the problem is called multi-label classification.
When $t_i \in \mathbb{R}$, the problem is known as regression.

5 Classification
Classification is a form of data analysis that extracts models describing important data classes.
Such models, called classifiers, predict categorical (discrete, unordered) class labels.
Regression analysis is a statistical methodology that is most often used for numeric prediction.
How does classification work?
1 Data classification is a two-step process, consisting of a learning step and a classification step.
2 In the first step, a classifier is built that describes a predetermined set of data classes or concepts, using labeled data (the training set).
3 In the second step, the classifier is used to classify data without labels (the test set or new data).

6 Data classification: training step
Figure (a): the training data is fed to a classification algorithm, which produces classification rules.

Training data
name           age          income  loan_decision
Sandy Jones    youth        low     risky
Bill Lee       youth        low     risky
Caroline Fox   middle_aged  high    safe
Rick Field     middle_aged  low     risky
Susan Lake     senior       low     safe
Claire Phips   senior       medium  safe
Joe Smith      middle_aged  high    safe

Classification rules
IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle_aged AND income = low THEN loan_decision = risky

7 Data classification: testing step
Figure (b): the classification rules from the training step are applied to the test data and then to new data.

Test data
name           age          income  loan_decision
Juan Bello     senior       low     safe
Sylvia Crest   middle_aged  low     risky
Anne Yee       middle_aged  high    safe

New data
(John Henry, middle_aged, low): loan decision? risky
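The classification step can be sketched as a small rule-based function. This is only an illustration under two assumptions that the slides do not state: the three rules are tried in the order listed, and the function returns `None` when no rule fires (as happens for a senior applicant with low income, which none of the three listed rules covers).

```python
from typing import Optional

def classify(age: str, income: str) -> Optional[str]:
    """Apply the three classification rules from the slides, in listed order."""
    if age == "youth":
        return "risky"
    if income == "high":
        return "safe"
    if age == "middle_aged" and income == "low":
        return "risky"
    return None  # assumption: no rule fires, so no decision is returned

# Test data from the slide: (name, age, income, known loan_decision).
test_data = [
    ("Juan Bello", "senior", "low", "safe"),
    ("Sylvia Crest", "middle_aged", "low", "risky"),
    ("Anne Yee", "middle_aged", "high", "safe"),
]

for name, age, income, label in test_data:
    print(name, "predicted:", classify(age, income), "actual:", label)

# New data: (John Henry, middle_aged, low).
new_decision = classify("middle_aged", "low")
print("John Henry:", new_decision)
```

Comparing predictions with the known labels on the test data gives an estimate of how well the rules work before they are applied to new, unlabeled cases such as John Henry.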

8 Data classification algorithms
Several learning algorithms have been proposed for building classifiers, such as
1 Decision tree induction
2 Probabilistic classifiers
3 Rule-based classifiers
4 Bayesian networks
5 Neural networks
6 Support vector machines

9 Regression
In regression, the label is a real value. Hence the training set has the form
$S = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}$, $t_k \in \mathbb{R}$.
If there is no noise, the task is interpolation, and our goal is to find a function $f(x)$ that passes through these points, so that
\[ t_k = f(x_k), \qquad k = 1, 2, \ldots, N \]
In polynomial interpolation, given $N$ points, we find the $(N-1)$st-degree polynomial that passes through them and use it to predict the output for any $x$.
If $x$ is outside the range of the training set, the task is called extrapolation.
In regression, noise is added to the output of the unknown function:
\[ t_k = f(x_k) + \epsilon, \qquad k = 1, 2, \ldots, N \]
where $f(x_k) \in \mathbb{R}$ is the unknown function and $\epsilon$ is the random noise.
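A minimal sketch of polynomial interpolation, assuming noise-free samples of a hypothetical function $f(x) = x^3 - 2x$ (chosen only for illustration). With $N = 4$ points, `np.polyfit` with degree $N - 1 = 3$ returns a polynomial that passes through every training point.

```python
import numpy as np

# Hypothetical noise-free samples t_k = f(x_k); f(x) = x**3 - 2*x is an assumption.
x = np.array([-1.0, 0.0, 1.0, 2.0])
t = x**3 - 2 * x

N = len(x)
coeffs = np.polyfit(x, t, deg=N - 1)   # degree N-1 polynomial through N points
g = np.poly1d(coeffs)

print(np.allclose(g(x), t))   # the polynomial reproduces every training point
print(g(0.5))                 # interpolation: x inside the training range
print(g(3.0))                 # extrapolation: x outside the training range
```

Because the assumed $f$ is itself a cubic, the interpolating polynomial recovers it exactly here; for a general $f$, predictions outside the training range can be far from the true value.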

10 Regression (cont.)
In regression, noise is added to the output of the unknown function:
\[ t_k = f(x_k) + \epsilon, \qquad k = 1, 2, \ldots, N \]
The explanation for the noise is that there are extra hidden variables that we cannot observe:
\[ t_k = f(x_k, z_k) + \epsilon, \qquad k = 1, 2, \ldots, N \]
where $z_k$ denotes the hidden variables.

11 Regression (cont.)
Our goal is to approximate the output by a function $g(x)$.
The empirical error on the training set $S$ is
\[ E_E(g \mid S) = \frac{1}{N} \sum_{k=1}^{N} \left[ t_k - g(x_k) \right]^2 \]
The aim is to find the $g(\cdot)$ that minimizes the empirical error.
We assume a hypothesis class for $g(\cdot)$ with a small set of parameters.
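The empirical error above can be computed directly. The sketch below assumes a hypothetical training set drawn from $f(x) = 2x + 1$ with Gaussian noise, and takes lines $g(x) = w_1 x + w_0$ as the hypothesis class (two parameters); neither choice comes from the slides. The least-squares fit from `np.polyfit` minimizes the empirical error within that class.

```python
import numpy as np

def empirical_error(g, x, t):
    """Mean squared difference between labels t_k and predictions g(x_k)."""
    return np.mean((t - g(x)) ** 2)

# Hypothetical training set: noisy samples of f(x) = 2x + 1 (an assumption).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

# Hypothesis class: lines g(x) = w1 * x + w0, with parameters (w1, w0).
w1, w0 = np.polyfit(x, t, deg=1)   # least-squares estimate of the parameters

def g_fit(x):
    return w1 * x + w0

def g_other(x):
    # Another member of the same class (w1 = 0, w0 = 0), for comparison.
    return np.zeros_like(x)

print(empirical_error(g_fit, x, t))
print(empirical_error(g_other, x, t))
```

Any other line in the class, such as `g_other`, has an empirical error at least as large as that of the fitted line.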

12 Model selection
The training data alone is not sufficient to find the solution; we must make some extra assumptions for learning.
Inductive bias: the inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs for inputs that it has not encountered.
One way to introduce an inductive bias is to assume a hypothesis class.
Each hypothesis class has a certain capacity and can learn only certain functions.
How do we choose the right inductive bias (for example, the hypothesis class)? This is called model selection.
How well a model trained on the training set predicts the right output for new instances is called generalization.
For best generalization, we should choose a model whose hypothesis complexity matches the complexity of the function underlying the data.

13 Model selection (cont.)
For best generalization, we should choose a model whose hypothesis complexity matches the complexity of the function underlying the data.
If the hypothesis is less complex than the function, we have underfitting.

14 Model selection (cont.)
If the hypothesis is more complex than the function, we have overfitting.
There is a trade-off among three factors:
Complexity of the hypothesis class
Amount of training data
Generalization error

15 Model selection (cont.)
As the amount of training data increases, the generalization error decreases.
As the capacity of the model increases, the generalization error first decreases and then increases.
We measure the generalization ability of a model using a validation set.
The available data is divided into
Training set
Validation set
Test set
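The three-way split and its use for model selection can be sketched as follows. Everything specific here is an assumption made for illustration: the data is synthetic (noisy samples of $\sin(2\pi x)$), the split ratio is 60/20/20, and the candidate hypothesis classes are polynomials of degree 0 to 6.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: noisy samples of sin(2*pi*x); function and noise level are assumptions.
N = 90
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Shuffle, then split 60/20/20 into training, validation, and test indices.
idx = rng.permutation(N)
train, val, test = idx[:54], idx[54:72], idx[72:]

def mse(coeffs, x, t):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((t - np.polyval(coeffs, x)) ** 2)

degrees = range(0, 7)   # candidate hypothesis classes of increasing capacity
val_errors = {}
for d in degrees:
    coeffs = np.polyfit(x[train], t[train], deg=d)   # fit on the training set only
    val_errors[d] = mse(coeffs, x[val], t[val])      # compare on the validation set

best_degree = min(val_errors, key=val_errors.get)
best_coeffs = np.polyfit(x[train], t[train], deg=best_degree)
test_error = mse(best_coeffs, x[test], t[test])      # computed once, after selection

print(best_degree, test_error)
```

The validation set is used to choose among models, so its error is no longer an unbiased estimate for the chosen model; the test set is held back to report the final generalization error.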
