
Machine Learning (CS 567) Lecture 2
Time: T-Th 5:00pm - 6:20pm
Location: GFS118
Instructor: Sofus A. Macskassy (macskass@usc.edu)
Office: SAL 216
Office hours: by appointment
Teaching assistant: Cheol Han
Office hours: TBA
Class web page: http://www-scf.usc.edu/~csci567/index.html

Administrative
- Email me to get on the mailing list
  - To: macskass@usc.edu
  - Subject: csci567 student
- Please stop me if I talk too fast
- Please stop me if you have any questions
- Any feedback is better than no feedback
  - Sooner is better; the end of the semester is too late to make it easier on you

Administrative
- Class project info
  - Get into teams of size 2-4 as soon as you can; I would prefer size 2-3. See me if you want to do a project by yourself.
  - I'll suggest topics in a week or two. Topics can be anything as long as they are related to ML.
  - Pre-proposal due on Sep 23: abstract laying out topic and scope
  - Proposal due on Oct 9: 1-2 page write-up on what you will do
  - Final paper due on Dec 2
  - Presentations on Dec 2 & 4

Administrative
- Questions from last class?

Lecture 2 Outline
- Supervised learning defined
- Hypothesis spaces
- Linear Threshold Algorithms: Introduction

Review: Focus for this Course
- Based on information available:
  - Supervised: true labels provided
  - Reinforcement: only indirect labels provided (reward/punishment)
  - Unsupervised: no feedback & no labels
- Based on the role of the learner:
  - Passive: given a set of data, produce a model
  - Online: given one data point at a time, update the model
  - Active: ask for specific data points to improve the model
- Based on type of output:
  - Concept learning: binary output based on positive/negative examples
  - Classification: classifying into one among many classes
  - Regression: numeric, ordered output

Supervised Learning (and appropriate applications)
Given: training examples <x, f(x)> for some unknown function f
Find: a good approximation to f

Appropriate applications:
- There is no human expert
  - E.g., DNA analysis
  - x: bond graph of a new molecule
  - f(x): predicted binding strength to AIDS protease molecule
- Humans can perform the task but cannot explain how
  - E.g., character recognition
  - x: bitmap picture of a hand-written character
  - f(x): ASCII code of the character
- Desired function changes frequently
  - E.g., predicting stock prices based on recent trading data
  - x: description of stock prices and trades for the last 10 days
  - f(x): recommended stock transactions
- Each user needs a customized function f
  - E.g., email filtering
  - x: incoming email message in some (un)structured format
  - Spam filter: f(x): importance score for presenting to the user (or deleting without presenting)
  - Folder filter: f(x): predicting which email folder to put the email into

Example: A data set for supervised learning
- Wisconsin Breast Tumor data set from the UC-Irvine Machine Learning repository
- Thirty real-valued variables per tumor
- Two variables to be predicted:
  - Outcome (R = recurrence, N = non-recurrence)
  - Time (time until recurrence for R; time healthy for N)

tumor size | texture | perimeter | outcome | time
18.02      | 27.60   | 117.5     | N       | 31
17.99      | 10.38   | 122.8     | N       | 61
20.29      | 14.34   | 135.1     | R       | 27

Terminology (referring to the table above)
- Columns are called input variables or features or attributes.
- The outcome and time (which we are trying to predict) are called output variables or targets or labels.
- A row in the table is called a training example (if used to build the model) or instance.
- The whole table is called the (training, validation, test, or evaluation) data set.

Prediction problems (for the tumor data set above)
- The problem of predicting the recurrence is called (binary) classification.
- The problem of predicting the time is called regression.
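As a quick, hedged illustration of this setup in code: scikit-learn bundles a closely related Wisconsin breast cancer data set (the diagnostic variant, which predicts malignant vs. benign rather than recurrence), so a sketch like the following shows the same instances-by-features layout. The dataset choice is an assumption for illustration, not the exact data on the slide.

```python
# Sketch (not the exact data on the slide): the Wisconsin *diagnostic* breast
# cancer data shipped with scikit-learn has the same structure: rows = instances,
# columns = 30 real-valued features, plus a binary label column.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.data.shape)          # (569, 30): 569 instances, 30 features
print(data.feature_names[:3])   # 'mean radius', 'mean texture', 'mean perimeter'
print(data.target_names)        # ['malignant' 'benign']: the binary classification target
```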

More formally
- Instances have the form <x_1, ..., x_n, y>, where n is the number of attributes.
- We will use the notation x_i to denote the attributes of the i-th instance: <x_{i,1}, ..., x_{i,n}>.
- Assumptions:
  - There is an underlying unknown probability distribution P(x,y) from which instances are drawn. This distribution is fixed.
  - Data for training as well as for future evaluation are drawn independently at random from P. Instances are said to be i.i.d. (independent and identically distributed).
- Let X denote the space of input values and Y the space of output values.
  - If Y = R, the problem is called regression.
  - If Y is a finite discrete set, the problem is called classification.
  - If Y has 2 elements, the problem is called binary classification or concept learning.

Supervised Learning Problem
[Diagram: training samples drawn from P(x,y) feed a learning algorithm that outputs f; a test point <x,y> drawn from P(x,y) is given to f, which predicts ŷ, and the loss L(ŷ,y) is measured.]
- Training examples are drawn independently at random according to the unknown probability distribution P(x,y).
- The learning algorithm analyzes the examples and produces a classifier f or hypothesis h.
- Given a new data point <x,y> drawn from P (independently and at random), the classifier is given x and predicts ŷ = f(x).
- The loss L(ŷ,y) is then measured.
- Goal of the learning algorithm: find the f that minimizes the expected loss.
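The goal stated above can be written compactly. A minimal LaTeX restatement using the slide's loss L and distribution P; H here stands for the hypothesis space introduced on the following slides:

```latex
% Risk (expected loss) of a hypothesis f under the data distribution P and loss L
R(f) \;=\; \mathbb{E}_{(x,y)\sim P}\big[\, L(f(x),\, y) \,\big]

% The learner's goal: pick the hypothesis with the smallest risk
f^{*} \;=\; \operatorname*{arg\,min}_{f \in H} \; R(f)
```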

Example Problem: Family Car
- Class C of a "family car"
- Classify: is car x a family car?
- Output: positive (+) and negative (-) examples
- Input representation: x_1: price, x_2: engine power

Training set X
X = \{\, x^t, y^t \,\}_{t=1}^{N}, \qquad
y^t = \begin{cases} 1 & \text{if } x^t \text{ is positive} \\ 0 & \text{if } x^t \text{ is negative} \end{cases}, \qquad
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
[Figure: the training examples plotted in the (x_1: price, x_2: engine power) plane.]

Class C
C: (p_1 \le \text{price} \le p_2) \text{ AND } (e_1 \le \text{engine power} \le e_2)

Hypothesis class H
h(x) = \begin{cases} 1 & \text{if } h \text{ classifies } x \text{ as positive} \\ 0 & \text{if } h \text{ classifies } x \text{ as negative} \end{cases}
Error of h on the training set X:
E(h \mid X) = \sum_{t=1}^{N} \mathbf{1}\big( h(x^t) \ne y^t \big)
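To make the rectangle hypothesis class and the error count E(h|X) concrete, here is a minimal Python sketch; the function names and the toy (price, engine power) data are made up for illustration and are not from the slides.

```python
# Minimal sketch: an axis-aligned-rectangle hypothesis for the family-car
# example and its empirical error E(h|X). All numbers below are invented.

def make_rectangle_hypothesis(p1, p2, e1, e2):
    """h(x) = 1 if p1 <= price <= p2 and e1 <= engine_power <= e2, else 0."""
    def h(x):
        price, engine_power = x
        return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0
    return h

def empirical_error(h, X, y):
    """E(h|X) = number of training points with h(x^t) != y^t."""
    return sum(1 for x_t, y_t in zip(X, y) if h(x_t) != y_t)

# Toy training set: (price, engine power) pairs with 0/1 labels.
X = [(12.0, 90), (18.0, 110), (35.0, 200), (15.0, 100), (40.0, 250)]
y = [1, 1, 0, 1, 0]

h = make_rectangle_hypothesis(p1=10.0, p2=20.0, e1=80, e2=120)
print("Empirical error:", empirical_error(h, X, y))  # 0 on this toy data
```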

S, G, and the Version Space
- Most specific hypothesis, S
- Most general hypothesis, G
- Any h in H between S and G is consistent with the training set; together these make up the version space (Mitchell, 1997)

Example Problem: Tumor Recurrence
- P(x,y): distribution of patients x and their true labels y ("recurrence" or "non-recurrence")
- training sample: a set of patients that have been labeled by the user
- learning algorithm: that's what this class is about!
- f: the classifier output by the learning algorithm
- test point: a new patient x (with its true, but hidden, label y)
- loss function L(ŷ,y):

  L(ŷ, y)                true y = recurrence   true y = non-recurrence
  ŷ = recurrence                  0                       1
  ŷ = non-recurrence            100                       0

What is f(x)?
There are three main approaches:
- Learn a classifier: f: X -> Y
- Learn a conditional probability distribution: P(y|x)
- Learn a joint probability distribution: P(x,y)
We will start by studying one example of each approach:
- Classifier: Perceptron
- Conditional distribution: Logistic Regression
- Joint distribution: Linear Discriminant Analysis (LDA)
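As a hedged, concrete illustration of the three approaches, the sketch below fits one representative of each on synthetic data with scikit-learn; the toy data and settings are assumptions, but Perceptron, LogisticRegression, and LinearDiscriminantAnalysis are the standard scikit-learn estimators for the three models named above.

```python
# Sketch: one model per approach on toy 2-D data (assumes scikit-learn and numpy).
import numpy as np
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])  # two blobs
y = np.array([0] * 50 + [1] * 50)

# 1) Classifier: directly learns f: X -> Y (no probabilities).
clf = Perceptron().fit(X, y)

# 2) Conditional distribution: models P(y | x).
cond = LogisticRegression().fit(X, y)

# 3) Joint distribution: models P(x, y) = P(x | y) P(y) with Gaussian class densities.
joint = LinearDiscriminantAnalysis().fit(X, y)

x_new = np.array([[1.5, 1.5]])
print(clf.predict(x_new))          # hard label from the classifier
print(cond.predict_proba(x_new))   # P(y | x) from logistic regression
print(joint.predict_proba(x_new))  # P(y | x) derived from the fitted joint model
```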

Inferring f(x) from P(y|x)
Predict the ŷ that minimizes the expected loss:
\hat{y} = f(x) = \operatorname*{arg\,min}_{\hat{y}} \; \sum_{y} P(y \mid x)\, L(\hat{y}, y)

Example: Making the tumor decision
- Suppose our tumor recurrence detector predicts: P(y = recurrence | x) = 0.1
- What is the optimal classification decision ŷ?
- Expected loss of ŷ = recurrence is: 0*0.1 + 1*0.9 = 0.9
- Expected loss of ŷ = non-recurrence is: 100*0.1 + 0*0.9 = 10
- Therefore, the optimal prediction is recurrence

  L(ŷ, y)                true y = recurrence   true y = non-recurrence
  ŷ = recurrence                  0                       1
  ŷ = non-recurrence            100                       0
  P(y | x)                      0.1                     0.9
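A small Python sketch of the same calculation, using the loss matrix and P(y|x) = (0.1, 0.9) from this slide; the dictionary layout is just one way to encode the table.

```python
# Sketch of the decision rule: pick the label with the smallest expected loss
# under P(y | x). Numbers are taken from the slide above.

LOSS = {  # LOSS[predicted][true]
    "recurrence":     {"recurrence": 0,   "non-recurrence": 1},
    "non-recurrence": {"recurrence": 100, "non-recurrence": 0},
}
p_y_given_x = {"recurrence": 0.1, "non-recurrence": 0.9}

def expected_loss(prediction):
    return sum(p_y_given_x[y] * LOSS[prediction][y] for y in p_y_given_x)

for pred in LOSS:
    print(pred, expected_loss(pred))   # recurrence: 0.9, non-recurrence: 10.0

best = min(LOSS, key=expected_loss)
print("optimal prediction:", best)     # recurrence
```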

Inferring f(x) from P(x,y)
We can compute the conditional distribution according to the definition of conditional probability:
P(y = k \mid x) = \frac{P(x,\, y = k)}{\sum_{j} P(x,\, y = j)}
In words: compute P(x, y=k) for each value of k, then normalize these numbers. Compute ŷ using the method from the previous slide.
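A minimal sketch of the normalization step; the joint values P(x, y=k) for this particular x are invented for illustration and are chosen so the result matches the 0.1/0.9 conditional used above.

```python
# Sketch: turning joint probabilities P(x, y=k) into the conditional P(y=k | x)
# by normalizing, then reusing the expected-loss rule from the previous sketch.

joint_at_x = {"recurrence": 0.02, "non-recurrence": 0.18}    # assumed P(x, y=k) values
total = sum(joint_at_x.values())                             # P(x)
p_y_given_x = {k: v / total for k, v in joint_at_x.items()}  # P(y=k | x)
print(p_y_given_x)  # {'recurrence': 0.1, 'non-recurrence': 0.9}
```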

Fundamental Problem of Machine Learning: It is ill-posed

Example | x1 | x2 | x3 | x4 | y
   1    |  0 |  0 |  1 |  0 | 0
   2    |  0 |  1 |  0 |  0 | 0
   3    |  0 |  0 |  1 |  1 | 1
   4    |  1 |  0 |  0 |  1 | 1
   5    |  0 |  1 |  1 |  0 | 0
   6    |  1 |  1 |  0 |  0 | 0
   7    |  0 |  1 |  0 |  1 | 0

Learning Appears Impossible
- There are 2^16 = 65536 possible boolean functions over four input features.
- We can't figure out which one is correct until we've seen every possible input-output pair.
- After 7 examples, we still have 2^9 = 512 possibilities.

x1 | x2 | x3 | x4 | y
 0 |  0 |  0 |  0 | ?
 0 |  0 |  0 |  1 | ?
 0 |  0 |  1 |  0 | 0
 0 |  0 |  1 |  1 | 1
 0 |  1 |  0 |  0 | 0
 0 |  1 |  0 |  1 | 0
 0 |  1 |  1 |  0 | 0
 0 |  1 |  1 |  1 | ?
 1 |  0 |  0 |  0 | ?
 1 |  0 |  0 |  1 | 1
 1 |  0 |  1 |  0 | ?
 1 |  0 |  1 |  1 | ?
 1 |  1 |  0 |  0 | 0
 1 |  1 |  0 |  1 | ?
 1 |  1 |  1 |  0 | ?
 1 |  1 |  1 |  1 | ?
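The 2^9 = 512 count can be checked by brute force. A small Python sketch (written for this note, not from the slides) enumerates all 2^16 boolean functions and keeps those that agree with the 7 labeled examples.

```python
# Sketch: verify the count on this slide by brute force. A boolean function on
# 4 inputs is a choice of output for each of the 16 possible inputs; count the
# functions that agree with the 7 labeled examples from the data set above.
from itertools import product

examples = {  # input tuple -> label, copied from the table
    (0, 0, 1, 0): 0, (0, 1, 0, 0): 0, (0, 0, 1, 1): 1, (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0, (1, 1, 0, 0): 0, (0, 1, 0, 1): 0,
}
inputs = list(product([0, 1], repeat=4))          # all 16 possible inputs

consistent = 0
for outputs in product([0, 1], repeat=16):        # all 2^16 boolean functions
    table = dict(zip(inputs, outputs))
    if all(table[x] == y for x, y in examples.items()):
        consistent += 1

print(consistent)  # 512 == 2^9: one free bit for each of the 9 unseen inputs
```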

Solution: Work with a restricted hypothesis space
Either by applying prior knowledge or by guessing, we choose a space of hypotheses H that is smaller than the space of all possible functions:
- simple conjunctive rules
- m-of-n rules
- linear functions
- multivariate Gaussian joint probability distributions
- etc.

Illustration: Simple Conjunctive Rules
- There are only 16 simple conjunctions (no negation).
- However, no simple rule explains the data (the table from the "ill-posed" slide above). The same is true for simple clauses.

Rule                       Counterexample
true => y                        1
x1 => y                          3
x2 => y                          2
x3 => y                          1
x4 => y                          7
x1 ^ x2 => y                     3
x1 ^ x3 => y                     3
x1 ^ x4 => y                     3
x2 ^ x3 => y                     3
x2 ^ x4 => y                     3
x3 ^ x4 => y                     4
x1 ^ x2 ^ x3 => y                3
x1 ^ x2 ^ x4 => y                3
x1 ^ x3 ^ x4 => y                3
x2 ^ x3 ^ x4 => y                3
x1 ^ x2 ^ x3 ^ x4 => y           3
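The table's conclusion can also be checked programmatically; this sketch (illustrative code, not from the slides) enumerates the 16 simple conjunctions and confirms that none fits the data.

```python
# Sketch: brute-force check that no simple conjunction (no negation) over
# x1..x4 is consistent with the 7 examples. The rule for a subset S predicts 1
# exactly when every variable in S is 1; the empty subset is "true => y".
from itertools import combinations

data = [  # ((x1, x2, x3, x4), y) rows from the ill-posed example above
    ((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((0, 0, 1, 1), 1), ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0), ((1, 1, 0, 0), 0), ((0, 1, 0, 1), 0),
]

def conjunction_predicts(subset, x):
    return int(all(x[i] == 1 for i in subset))

consistent = []
for k in range(0, 5):                        # subsets of size 0..4: 16 rules total
    for subset in combinations(range(4), k):
        if all(conjunction_predicts(subset, x) == y for x, y in data):
            consistent.append(subset)

print(consistent)   # []: none of the 16 simple conjunctions fits the data
```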

A larger hypothesis space: m-of-n rules
- At least m of the n variables must be true
- There are 32 possible rules
- Only one rule is consistent!

                          Counterexample
variables            1-of   2-of   3-of   4-of
{x1}                   3      -      -      -
{x2}                   2      -      -      -
{x3}                   1      -      -      -
{x4}                   7      -      -      -
{x1, x2}               3      3      -      -
{x1, x3}               4      3      -      -
{x1, x4}               6      3      -      -
{x2, x3}               2      3      -      -
{x2, x4}               2      3      -      -
{x3, x4}               4      4      -      -
{x1, x2, x3}           1      3      3      -
{x1, x2, x4}           2      3      3      -
{x1, x3, x4}           1     ***     3      -
{x2, x3, x4}           1      5      3      -
{x1, x2, x3, x4}       1      5      3      3

*** marks the one consistent rule: at least 2 of {x1, x3, x4}.
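Likewise for m-of-n rules; the following sketch (again illustrative, not from the slides) enumerates all 32 rules and recovers the single consistent one marked *** in the table.

```python
# Sketch: enumerate all m-of-n rules over x1..x4 (predict 1 iff at least m of
# the chosen variables are 1) and report the ones consistent with the 7 examples.
from itertools import combinations

data = [  # ((x1, x2, x3, x4), y): same data set as above
    ((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((0, 0, 1, 1), 1), ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0), ((1, 1, 0, 0), 0), ((0, 1, 0, 1), 0),
]

rules = []
for n in range(1, 5):
    for subset in combinations(range(4), n):
        for m in range(1, n + 1):
            rules.append((m, subset))
print(len(rules))  # 32 possible rules, as the slide says

def predict(m, subset, x):
    return int(sum(x[i] for i in subset) >= m)

consistent = [(m, subset) for m, subset in rules
              if all(predict(m, subset, x) == y for x, y in data)]
print(consistent)  # [(2, (0, 2, 3))]: "at least 2 of {x1, x3, x4}" (0-indexed)
```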

Two Views of Learning
- View 1: Learning is the removal of our remaining uncertainty.
  - Suppose we knew that the unknown function was an m-of-n boolean function. Then we could use the training examples to deduce which function it is.
- View 2: Learning requires guessing a good, small hypothesis class.
  - We can start with a very small class and enlarge it until it contains a hypothesis that fits the data.

We could be wrong!
- Our prior knowledge might be wrong.
- Our guess of the hypothesis class could be wrong.
  - The smaller the class, the more likely we are to be wrong.

Two Strategies for Machine Learning
- Develop languages for expressing prior knowledge: rule grammars, stochastic models, Bayesian networks. (Corresponds to the prior-knowledge view.)
- Develop flexible hypothesis spaces: nested collections of hypotheses such as decision trees, neural networks, cases, SVMs. (Corresponds to the guessing view.)
- In either case, we must develop algorithms for finding a hypothesis that fits the data.

Terminology
- Training example: an example of the form <x,y>. x is usually a vector of features, and y is called the class label. We will index the features by j; hence x_j is the j-th feature of x. The number of features is n.
- Target function: the true function f, the true conditional distribution P(y|x), or the true joint distribution P(x,y).
- Hypothesis: a proposed function or distribution h believed to be similar to f or P.
- Concept: a boolean function. Examples for which f(x)=1 are called positive examples or positive instances of the concept. Examples for which f(x)=0 are called negative examples or negative instances.

Terminology
- Classifier: a discrete-valued function. The possible values f(x) ∈ {1, ..., K} are called the classes or class labels.
- Hypothesis space: the space of all hypotheses that can, in principle, be output by a particular learning algorithm.
- Version space: the space of all hypotheses in the hypothesis space that have not yet been ruled out by a training example.
- Training sample (or training set or training data): a set of N training examples drawn according to P(x,y).
- Test set: a set of examples, drawn from P(x,y) but held out from training, used to evaluate a proposed hypothesis h.
- Validation set / development set: a set of examples (typically a subset of the training set) used to guide the learning algorithm and prevent overfitting.

S, G, and the Version Space (revisited)
- Most specific hypothesis, S
- Most general hypothesis, G
- Any h in H between S and G is consistent with the training set; together these make up the version space (Mitchell, 1997)

Key Issues in Machine Learning
- What are good hypothesis spaces? Which spaces have been useful in practical applications?
- What algorithms can work with these spaces? Are there general design principles for learning algorithms?
- How can we optimize accuracy on future data points? This is related to the problem of overfitting.
- How can we have confidence in the results? (the statistical question) How much training data is required to find an accurate hypothesis?
- Are some learning problems computationally intractable? (the computational question)
- How can we formulate application problems as machine learning problems? (the engineering question)