Introduction to Machine Learning (Pattern recognition and model fitting) for Master students

Spring 2007, ÖU/RAP. Thorsteinn Rögnvaldsson, thorsteinn.rognvaldsson@tech.oru.se

Contents: machine learning algorithms, mostly artificial neural networks (ANN); problems attacked with learning systems (classification and regression); issues in learning: bias, over-learning, generalization; seminars; practical project.

What you should learn: how to approach a machine learning classification or regression problem; basic knowledge of the common linear machine learning algorithms; basic knowledge of some nonlinear machine learning algorithms; practical use of a few machine learning algorithms with MATLAB.

Form: Projects (individual or in group) with a written report & oral presentation (40%). Theory (lecture notes, books, whatever you choose). Lectures (3 hrs/occasion, approx. 7 occasions). Seminars, where you read up on the material and present it (in a pedagogic way) to your fellow master students: you're given a paper/chapter to read and then present to the others (you're more than welcome to complement with other material); evaluation (20%). Material mailed out to the students.

Why machine learning? Some tasks are easy to describe by example but difficult to write down rules for. There may be new information in the data (i.e. the expert might not know all the information available in the data). On-line tuning (knowledge increases, and it is too difficult to update by hand). Machine learning is very close to statistics.

Typical tasks for ML: build systems whose purpose is to classify observations (good/bad, healthy/sick, red/green/blue, A/B/C, ...) or to estimate some value for observations (how good/bad is it? how healthy is the patient? how many points will I gain? how likely is it that I win if I do this or that? what risk do I take if I do this or that? what is a reasonable price for this house? etc.). The latter task is called regression in statistics.

Some machine learning methods: artificial neural networks (ANN), models inspired by the structure of the neural system; support vector machines (SVM), models designed from statistical learning theory; decision trees, similar to expert systems, which produce rules; Bayesian networks, for reasoning under uncertainty.

Game playing. Chess: search plus an evaluation function; there are roughly 10^120 possible game paths in chess. Backgammon: pattern recognition. (Image from 2001: A Space Odyssey.)

IBM Deep Blue. Deep Blue relies on computational power: search and evaluation. Deep Blue evaluates 200 × 10^6 positions per second. The latest Deep Blue is a 32-node IBM RS/6000 SP with P2SC processors. Each node of the SP employs a single microchannel card containing 8 dedicated VLSI chess processors, for a total of 256 processors working in tandem. Deep Blue can calculate 60 × 10^9 moves in three minutes. Deep Blue is brute force. Humans (probably) play chess differently...

TD-Gammon. The best backgammon programs use temporal difference (TD) algorithms to train a back-propagation neural network by self-play, and the top programs are world-class in playing strength. At the 1998 American Association for Artificial Intelligence meeting, NeuroGammon won 99 of 100 games against a human grand master (the reigning World Champion). TD-Gammon is an example of machine learning: it plays itself and adapts its rules after each game depending on wins/losses. http://satirist.org/learn-game/systems/gammon/

Steps in an ML/AI problem: measure the environment, x; evaluate the environment, y = f(x); take a decision and act, α[y].

Introduction to classification

Classification: order into one out of several classes ("1 of K"). The classifier maps the input space X ⊆ R^D to the output (category) space C of K categories: the input is a vector x = (x_1, ..., x_D)^T ∈ X, and the category is coded as a vector c = (0, ..., 0, 1, 0, ..., 0)^T ∈ C with a single 1 in the position of the assigned class.

Example 1: Robot color vision (Halmstad Univ. mechatronics competition 1999) Classify the Lego pieces into red, blue, and yellow. Classify white balls, black sideboard, and green carpet.

What the camera sees (RGB space). [Figure: the yellow, red, and green regions plotted in RGB space.]

Mapping RGB (3D) to rgb (2D): r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B). Since r + g + b = 1, only two of the normalized coordinates are independent.
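In code the mapping is just a per-pixel normalization. A minimal MATLAB sketch (the pixel values and variable names are only illustrative, not from the slides):

% Normalize an RGB pixel to rgb chromaticity coordinates
RGB = [200 150 30];      % an example pixel (R, G, B)
s   = sum(RGB);          % R + G + B
rgb = RGB / s;           % r = R/s, g = G/s, b = B/s, so r + g + b = 1
disp(rgb(1:2))           % two coordinates, e.g. (r, g), are enough since r + g + b = 1 (2D)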

Lego in normalized rgb space: the input is 2D, x = (x_1, x_2)^T ∈ X (e.g. the normalized r and g coordinates), and the output is 6D (one-of-K coding with K = 6): c ∈ C = {red, blue, yellow, green, black, white}.

All together... The classifier's task is to find optimal borders between the different categories. What is a yellow Lego piece? What is a blue Lego piece? Given rgb values, how likely is it that the robot is seeing, e.g., a red Lego piece?

Example 2: ALVINN @ CMU

ALVINN: an ANN-guided vehicle. Input: a camera image, x ∈ X. Output: a steering signal, c ∈ C.

Classification means taking a decision: if I believe x ∈ c_k, then I will do α_i. Examples: I see something that looks yellow; I decide that it is a yellow Lego brick; if I see a yellow Lego brick, then I will lift it up and carry it to my home. If I see a white ball, then I will try to score a goal. The road looks like it is turning left; I decide it is turning left; if the road turns left, then I will turn the steering wheel left. The patient is bleeding heavily; I decide that the patient needs treatment. Statistical decision theory: sometimes the decision is wrong; decision theory is about making the best possible decision.

Notation:
p(x): probability density for x.
p(c_k): a priori probability for category c_k.
p(x | c_k): probability density for x belonging to category c_k.
p(c_k | x): a posteriori probability for category c_k.
p(c_k, x) = p(x, c_k): joint probability for x and c_k.
α_i: action i.
λ(α_i | c_k), also written λ_ik: cost of making decision α_i if x ∈ c_k.

Illustration from health care. Two categories: c_1 = Healthy, c_2 = Ill. p(c_i): the probability that the person is healthy/ill before the doctor meets him/her (how many of the people going to see a doctor are actually ill?). x = (x_1, x_2, ...): the results (the observation) from the doctor's examination (the doctor may have done many tests).

Illustration from health care (continued). p(x): the probability of observing x. p(x, c_i): the probability of observing a person from category c_i with the test results x; p(x, c_i) = p(x | c_i) p(c_i) = p(c_i | x) p(x). p(x | c_i): the probability of observing x when we know the person is from category c_i.

Bayes rule. Since p(c_k, x) = p(x, c_k), we get
p(c_k | x) = p(x | c_k) p(c_k) / p(x), where p(x) = Σ_{k=1}^K p(x | c_k) p(c_k).

Bayes theorem example Joe is a randomly chosen member of a large population in which 3% are heroin users. Joe tests positive for heroin in a drug test that correctly identifies users 95% of the time and correctly identifies nonusers 90% of the time. Is Joe a heroin addict? Example from http://plato.stanford.edu/entries/bayes-theorem/supplement.html

Bayes theorem example (continued):
P(H | pos) = P(pos | H) P(H) / P(pos)
P(H) = 3% = 0.03, P(¬H) = 1 − P(H) = 0.97
P(pos | H) = 95% = 0.95, P(pos | ¬H) = 1 − 0.90 = 0.10
P(pos) = P(pos | H) P(H) + P(pos | ¬H) P(¬H) = 0.1255
P(H | pos) = 0.0285 / 0.1255 ≈ 0.227 ≈ 23%
Example from http://plato.stanford.edu/entries/bayes-theorem/supplement.html
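A quick numerical check of the calculation above, written in MATLAB since that is the tool used in the practical part of the course (the variable names are mine, not from the slides):

% Bayes theorem for the drug-test example
pH       = 0.03;                             % prior P(H): fraction of heroin users
pPosH    = 0.95;                             % P(pos | H), true positive rate
pPosNotH = 1 - 0.90;                         % P(pos | not H), false positive rate
pPos     = pPosH*pH + pPosNotH*(1 - pH);     % total probability of a positive test
pHPos    = pPosH*pH / pPos;                  % posterior P(H | pos) by Bayes rule
fprintf('P(pos) = %.4f, P(H | pos) = %.3f\n', pPos, pHPos)   % prints 0.1255 and 0.227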

Bayes theorem: The Monty Hall Game show. In a TV game show, a contestant selects one of three doors; behind one of the doors there is a prize, and behind the other two there are no prizes. After the contestant selects a door, the game-show host opens one of the remaining doors and reveals that there is no prize behind it. The host then asks the contestant whether he/she wants to SWITCH to the other unopened door, or STICK to the original choice. What should the contestant do? See http://www.io.com/~kmellis/monty.html (Let's make a deal, A Joint Venture).

The Monty Hall Game Show: notation. The prize is behind door j ∈ {1, 2, 3}; open_i denotes the event that the host opens door i.

The Monty Hall Game Show. Suppose the contestant selects door 1 and the host opens door 2.
P(1) = P(2) = P(3) = 1/3: the a priori probability that the prize is behind door 1 (etc. for 2 & 3).
P(open_2 | 1) = 1/2: the probability that the host opens door 2 if the prize is behind door 1 (the contestant has chosen door 1).
P(open_2 | 2) = 0 and P(open_2 | 3) = 1: the probability that the host opens door 2 if the prize is behind door 2 or door 3, respectively.
P(open_2) = Σ_i P(open_2 | i) P(i) = (1/2)(1/3) + 0 + (1)(1/3) = 1/2.
Bayes rule then gives P(1 | open_2) = P(open_2 | 1) P(1) / P(open_2) = 1/3 and P(3 | open_2) = P(open_2 | 3) P(3) / P(open_2) = 2/3, so switching doubles the chance of winning.
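The 1/3 vs. 2/3 result is easy to verify with a small Monte Carlo simulation; this sketch is my own addition, not from the lecture:

% Monte Carlo simulation of the Monty Hall game
N = 1e5;                            % number of simulated games
winStick = 0; winSwitch = 0;
for n = 1:N
    prize  = randi(3);              % door hiding the prize
    choice = 1;                     % the contestant always picks door 1
    % the host opens a door that is neither the chosen door nor the prize door
    candidates = setdiff(1:3, [choice prize]);
    opened = candidates(randi(numel(candidates)));
    switched = setdiff(1:3, [choice opened]);   % the door taken when switching
    winStick  = winStick  + (choice   == prize);
    winSwitch = winSwitch + (switched == prize);
end
fprintf('P(win | stick)  = %.3f\n', winStick/N)    % approx. 1/3
fprintf('P(win | switch) = %.3f\n', winSwitch/N)   % approx. 2/3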

Bayes theorem: The Monty Hall Game show. In a TV game show, a contestant selects one of three doors; behind one of the doors there is a prize, and behind the other two there are no prizes. After the contestant selects a door, the game-show host opens one of the remaining doors and reveals that there is no prize behind it. The host then asks the contestant whether he/she wants to SWITCH to the other unopened door, or STICK to the original choice. What should the contestant do? The host is actually asking the contestant whether he/she wants to SWITCH the choice to both other doors, or STICK to the original choice. Phrased this way, it is obvious what the optimal thing to do is.

Decision theory: Expected conditional risk. R(α_i | x) = Σ_{k=1}^K λ(α_i | c_k) p(c_k | x). The Bayes optimal decision: choose the action α_i that minimizes R(α_i | x), i.e. the action that has the least severe consequences (averaged over all possible outcomes). This requires estimating the a posteriori probability p(c_k | x).

Decision theory: Expected conditional utility. U(α_i | x) = Σ_{k=1}^K u(α_i | c_k) p(c_k | x). The Bayes optimal decision: choose the action α_i that maximizes U(α_i | x), i.e. the action that has the most good consequences (averaged over all possible outcomes). Again this requires estimating the a posteriori probability p(c_k | x).
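As a concrete sketch of the risk calculation (the loss matrix and the posterior values below are invented for illustration, not taken from the lecture), the Bayes optimal action is simply the row of the loss matrix with the smallest expected loss:

% Expected conditional risk R(alpha_i | x) = sum_k lambda(alpha_i | c_k) p(c_k | x)
lambda = [0 10;                     % row i = action alpha_i, column k = true class c_k
          1  0];                    % e.g. action 1 = send home, action 2 = treat
pPost = [0.7; 0.3];                 % hypothetical posterior p(c_k | x) for one observation x
R = lambda * pPost;                 % expected risk of each action
[~, best] = min(R);                 % Bayes optimal decision: argmin_i R(alpha_i | x)
fprintf('R = [%.2f %.2f], choose action %d\n', R(1), R(2), best)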

Classification approaches: 1. Model discrimination functions & discrimination boundaries. 2. Model the probability densities p(x | c_k) & use Bayes rule. 3. Model the a posteriori probabilities p(c_k | x) directly. Examples on the following slides...

Example: The thermostat. We want to classify the temperature in a room into three categories, {cold, fine, hot} (hot means that we want air conditioning, cold means we want heating, fine means we're happy). Discrimination boundary approach: set thresholds, e.g. above 21 is hot, below 19 is cold, and in between is fine. Don't bother with computing probabilities... but this is bad if you want to use decision theory.

Example: Equipment health (diagnostics & predictive maintenance). Discrimination boundary approach: set thresholds and define ok and not-ok regions; does not scale well to many variables. Probability density approach: use a large sample of ok and not-ok equipment and measure relevant variables x; estimate p(x | ok), p(x | not-ok), p(ok) and p(not-ok), then use Bayes theorem. A posteriori approach: use a large sample of ok and not-ok equipment and measure relevant variables x; estimate p(ok | x) and p(not-ok | x) directly.

Parametric & non-parametric methods. Parametric: assume a parametric form. Few degrees of freedom lead to large model bias (e.g. assuming that everything is linear). Non-parametric: assume no parametric form. Many degrees of freedom lead to large model variance (i.e. everything can be any nonlinear function). The optimum often lies somewhere in between.

Linear Gaussian classifier: parametric. Assume p(x | c_k) is Gaussian with class-specific means µ_k and a common covariance matrix Σ:
p(x | c_k) = (2π)^(−D/2) det(Σ)^(−1/2) exp( −(1/2) (x − µ_k)^T Σ^(−1) (x − µ_k) ).

Linear Gaussian classifier: parametric (continued). Estimate the means and covariance matrices for the categories from the data: µ_k ≈ (1/N_k) Σ_{n ∈ c_k} x(n), Σ_k ≈ (1/(N_k − 1)) Σ_{n ∈ c_k} [x(n) − µ_k][x(n) − µ_k]^T, and pool the class estimates into the common covariance Σ ≈ Σ_{k=1}^K (N_k/N) Σ_k.
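A minimal sketch of these estimates on toy data (the data, labels and variable names are invented; the discriminant g_k(x) at the end follows from inserting the Gaussian densities into Bayes rule and taking logarithms):

% Linear Gaussian classifier: estimate class means, a pooled covariance and priors
rng(1);
N1 = 200; N2 = 150;                           % toy 2-D data with two classes
X = [randn(N1,2) + 2; randn(N2,2) - 1];       % class 1 around (2,2), class 2 around (-1,-1)
y = [ones(N1,1); 2*ones(N2,1)];
K = 2; [N, D] = size(X);
mu = zeros(K, D); Sigma = zeros(D, D); prior = zeros(K, 1);
for k = 1:K
    Xk = X(y == k, :); Nk = size(Xk, 1);
    mu(k,:)  = mean(Xk, 1);                   % estimated class mean mu_k
    Sigma    = Sigma + (Nk/N) * cov(Xk);      % pooled (common) covariance estimate
    prior(k) = Nk / N;                        % estimated a priori probability p(c_k)
end
x0 = [0.5 0.5];                               % a new point to classify
g = zeros(1, K);
for k = 1:K
    g(k) = x0 * (Sigma \ mu(k,:)') ...        % g_k(x) = x' inv(Sigma) mu_k
         - 0.5 * mu(k,:) * (Sigma \ mu(k,:)') + log(prior(k));
end
[~, label] = max(g);
fprintf('x0 is assigned to class %d\n', label)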

Linear Gaussian class boundary. [Figure: decision boundary; 11399 green samples, 14 red samples; training error 0.06%, test error 0.10%.]

The simple perceptron. With the {−1, +1} representation, y(x) = sgn[w^T x] = +1 if w^T x > 0 and −1 if w^T x < 0, where w are the parameters and w^T x = w_0 + w_1 x_1 + w_2 x_2 + ... Traditionally (early 60s) trained with Perceptron learning.

Perceptron learning. Desired output: f(n) = +1 if x(n) belongs to class A, f(n) = −1 if x(n) belongs to class B. Repeat until no errors are made anymore: 1. Pick a random example [x(n), f(n)]. 2. If the classification is correct, then do nothing. 3. If the classification is wrong, then update the parameters (η, the learning rate, is a small positive number): w_i ← w_i + η f(n) x_i(n).
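The worked AND example on the following slides can be reproduced with a short script. This is a sketch under the slides' conventions (inputs augmented with a constant 1 for the bias weight w_0); I sweep the examples in a fixed order instead of picking them at random, which here produces the same two weight updates as the slides:

% Perceptron learning for the AND function
X   = [0 0; 0 1; 1 0; 1 1];          % the four inputs
f   = [-1; -1; -1; +1];              % desired outputs, {-1,+1} representation
Xa  = [ones(4,1) X];                 % augment with a constant 1 for the bias weight w_0
w   = [-0.5; 1; 1];                  % initial weights (w_0, w_1, w_2), as in the example
eta = 0.3;                           % learning rate
madeError = true;
while madeError                      % repeat until no errors are made anymore
    madeError = false;
    for n = 1:4
        if sign(Xa(n,:) * w) ~= f(n)          % example n is misclassified
            w = w + eta * f(n) * Xa(n,:)';    % w_i <- w_i + eta * f(n) * x_i(n)
            madeError = true;
        end
    end
end
disp(w')                             % converges to w = (-1.1, 0.7, 0.7)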

Example: Perceptron learning. The AND function; the function we want the Perceptron to learn:
x_1  x_2   f
 0    0   −1
 0    1   −1
 1    0   −1
 1    1   +1

Example: Perceptron learning (continued). Initial values: η = 0.3, w = (−0.5, 1, 1)^T.

Example: Perceptron learning (continued). The initial decision boundary is given by w^T x = w_0 + w_1 x_1 + w_2 x_2 = −0.5 + x_1 + x_2 = 0, i.e. x_2 = 0.5 − x_1.

Example: Perceptron learning (continued). The weight vector (w_1, w_2) = (1, 1) is normal to the decision boundary x_2 = 0.5 − x_1.

Example: Perceptron learning (continued). The picked example is correctly classified: no action. w stays at (−0.5, 1, 1)^T.

Example: Perceptron learning (continued). The picked example (0, 1) is incorrectly classified: learning action. w_0 ← w_0 + η f = −0.5 − 0.3 = −0.8, w_1 ← w_1 + η f · 0 = 1, w_2 ← w_2 + η f · 1 = 1 − 0.3 = 0.7.

Example: Perceptron learning (continued). After the update the weights are w = (−0.8, 1, 0.7)^T and the decision boundary has moved.

Example: Perceptron learning (continued). The next picked example is correctly classified: no action. w stays at (−0.8, 1, 0.7)^T.

Example: Perceptron learning (continued). The picked example (1, 0) is incorrectly classified: learning action. w_0 ← −0.8 − 0.3 = −1.1, w_1 ← 1 − 0.3 = 0.7, w_2 ← 0.7 + 0 = 0.7.

Example: Perceptron learning (continued). After the update the weights are w = (−1.1, 0.7, 0.7)^T.

Example: Perceptron learning (continued). Final solution: w = (−1.1, 0.7, 0.7)^T, which classifies all four points of the AND function correctly.

Perceptron learning Perceptron learning is guaranteed to find a solution in finite time, if a solution exists. However, the Perceptron is only linear.

Perceptron final decision boundary after 100 epochs (1 epoch = 1 full presentation of the entire data set). Training error 0.07%, test error 0.09%.

Seminars for next week: decision theory, the simple perceptron, probability density estimation (2 students).