Introduction to Machine Learning (Pattern recognition and model fitting) for Master students

Spring 2007, ÖU/RAP. Thorsteinn Rögnvaldsson, thorsteinn.rognvaldsson@tech.oru.se

Contents: machine learning algorithms, mostly artificial neural networks (ANN); problems attacked with learning systems (classification and regression); issues in learning: bias, over-learning, generalization; seminars; practical project.

What you should learn: how to approach a machine learning classification or regression problem; basic knowledge of the common linear machine learning algorithms; basic knowledge of some nonlinear machine learning algorithms; practical use of a few machine learning algorithms with MATLAB.

Form: Projects (individual or in group) with a written report & oral presentation (40%). Theory (lecture notes, books, whatever you choose). Lectures (3 hrs/occasion, approx. 7 occasions). Seminars, where you read up on the material and present it (in a pedagogic way) to your fellow master students: you're given a paper/chapter to read and then present to the others (you're more than welcome to complement with other material); evaluation (20%). Material mailed out to the students.

Why machine learning? Some tasks are easy to describe by example but difficult to write down rules for. There may be new information in the data (i.e. the expert might not know all the information available in the data). On-line tuning (knowledge increases, and it is too difficult to update by hand). Machine learning is very close to statistics.

Typical tasks for ML: build systems whose purpose is to classify observations (good/bad, healthy/sick, red/green/blue, A/B/C, ...) or to estimate some value for observations (how good/bad is it? how healthy is the patient? how many points will I gain? how likely is it that I win if I do this or that? what risk do I take if I do this or that? what is a reasonable price for this house? etc.). The latter task is called regression in statistics.

Some machine learning methods: artificial neural networks (ANN), models inspired by the structure of the neural system; support vector machines (SVM), models designed from statistical learning theory; decision trees, similar to expert systems, which produce rules; Bayesian networks, for reasoning under uncertainty.

Game playing. Chess: search plus an evaluation function; there are roughly 10^120 possible game paths in chess. Backgammon: pattern recognition. (Image from 2001: A Space Odyssey.)

IBM Deep Blue. Deep Blue relies on computational power: search and evaluation. Deep Blue evaluates 200 × 10^6 positions per second. The latest Deep Blue is a 32-node IBM RS/6000 SP with P2SC processors. Each node of the SP employs a single microchannel card containing 8 dedicated VLSI chess processors, for a total of 256 processors working in tandem. Deep Blue can calculate 60 × 10^9 moves in three minutes. Deep Blue is brute force. Humans (probably) play chess differently...

TD-Gammon. The best backgammon programs use temporal difference (TD) algorithms to train a back-propagation neural network by self-play, and the top programs are world-class in playing strength. At the 1998 American Association for Artificial Intelligence meeting, NeuroGammon won 99 of 100 games against a human grand master (the reigning World Champion). TD-Gammon is an example of machine learning: it plays itself and adapts its rules after each game depending on wins/losses. http://satirist.org/learn-game/systems/gammon/

Steps in an ML/AI problem: measure the environment, x; evaluate the environment, y = f(x); take a decision and act, α[y].

Introduction to classification

Classification: order into one out of several classes ("1 of K"). The classifier maps the input space X ⊆ R^D to the output (category) space C of K categories: the input is a vector x = (x_1, ..., x_D)^T ∈ X, and the category is coded as a vector c = (0, ..., 0, 1, 0, ..., 0)^T ∈ C with a single 1 in the position of the assigned class.

Example 1: Robot color vision (Halmstad Univ. mechatronics competition 1999) Classify the Lego pieces into red, blue, and yellow. Classify white balls, black sideboard, and green carpet.

What the camera sees (RGB space). [Figure: the yellow, red, and green regions plotted in RGB space.]

Mapping RGB (3D) to rgb (2D): r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B). Since r + g + b = 1, only two of the normalized coordinates are independent.
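In code the mapping is just a per-pixel normalization. A minimal MATLAB sketch (the pixel values and variable names are only illustrative, not from the slides):

% Normalize an RGB pixel to rgb chromaticity coordinates
RGB = [200 150 30];      % an example pixel (R, G, B)
s   = sum(RGB);          % R + G + B
rgb = RGB / s;           % r = R/s, g = G/s, b = B/s, so r + g + b = 1
disp(rgb(1:2))           % two coordinates, e.g. (r, g), are enough since r + g + b = 1 (2D)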

Lego in normalized rgb space: the input is 2D, x = (x_1, x_2)^T ∈ X (e.g. the normalized r and g coordinates), and the output is 6D (one-of-K coding with K = 6): c ∈ C = {red, blue, yellow, green, black, white}.

All together... The classifier's task is to find optimal borders between the different categories. What is a yellow Lego piece? What is a blue Lego piece? Given rgb values, how likely is it that the robot is seeing, e.g., a red Lego piece?

Example 2: ALVINN @ CMU

ALVINN: an ANN-guided vehicle. Input: a camera image, x ∈ X. Output: a steering signal, c ∈ C.

Classification means taking a decision: if I believe x ∈ c_k, then I will do α_i. Examples: I see something that looks yellow; I decide that it is a yellow Lego brick; if I see a yellow Lego brick, then I will lift it up and carry it to my home. If I see a white ball, then I will try to score a goal. The road looks like it is turning left; I decide it is turning left; if the road turns left, then I will turn the steering wheel left. The patient is bleeding heavily; I decide that the patient needs treatment. Statistical decision theory: sometimes the decision is wrong; decision theory is about making the best possible decision.

Notation:
p(x): probability density for x.
p(c_k): a priori probability for category c_k.
p(x | c_k): probability density for x belonging to category c_k.
p(c_k | x): a posteriori probability for category c_k.
p(c_k, x) = p(x, c_k): joint probability for x and c_k.
α_i: action i.
λ(α_i | c_k), also written λ_ik: cost of making decision α_i if x ∈ c_k.

Illustration from health care. Two categories: c_1 = Healthy, c_2 = Ill. p(c_i): the probability that the person is healthy/ill before the doctor meets him/her (how many of the people going to see a doctor are actually ill?). x = (x_1, x_2, ...): the results (the observation) from the doctor's examination (the doctor may have done many tests).

Illustration from health care (continued). p(x): the probability of observing x. p(x, c_i): the probability of observing a person from category c_i with the test results x; p(x, c_i) = p(x | c_i) p(c_i) = p(c_i | x) p(x). p(x | c_i): the probability of observing x when we know the person is from category c_i.

Bayes rule. Since p(c_k, x) = p(x, c_k), we get
p(c_k | x) = p(x | c_k) p(c_k) / p(x), where p(x) = Σ_{k=1}^K p(x | c_k) p(c_k).

Bayes theorem example Joe is a randomly chosen member of a large population in which 3% are heroin users. Joe tests positive for heroin in a drug test that correctly identifies users 95% of the time and correctly identifies nonusers 90% of the time. Is Joe a heroin addict? Example from http://plato.stanford.edu/entries/bayes-theorem/supplement.html

Bayes theorem example (continued):
P(H | pos) = P(pos | H) P(H) / P(pos)
P(H) = 3% = 0.03, P(¬H) = 1 − P(H) = 0.97
P(pos | H) = 95% = 0.95, P(pos | ¬H) = 1 − 0.90 = 0.10
P(pos) = P(pos | H) P(H) + P(pos | ¬H) P(¬H) = 0.1255
P(H | pos) = 0.0285 / 0.1255 ≈ 0.227 ≈ 23%
Example from http://plato.stanford.edu/entries/bayes-theorem/supplement.html
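A quick numerical check of the calculation above, written in MATLAB since that is the tool used in the practical part of the course (the variable names are mine, not from the slides):

% Bayes theorem for the drug-test example
pH       = 0.03;                             % prior P(H): fraction of heroin users
pPosH    = 0.95;                             % P(pos | H), true positive rate
pPosNotH = 1 - 0.90;                         % P(pos | not H), false positive rate
pPos     = pPosH*pH + pPosNotH*(1 - pH);     % total probability of a positive test
pHPos    = pPosH*pH / pPos;                  % posterior P(H | pos) by Bayes rule
fprintf('P(pos) = %.4f, P(H | pos) = %.3f\n', pPos, pHPos)   % prints 0.1255 and 0.227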

Bayes theorem: The Monty Hall Game show. In a TV game show, a contestant selects one of three doors; behind one of the doors there is a prize, and behind the other two there are no prizes. After the contestant selects a door, the game-show host opens one of the remaining doors and reveals that there is no prize behind it. The host then asks the contestant whether he/she wants to SWITCH to the other unopened door, or STICK to the original choice. What should the contestant do? See http://www.io.com/~kmellis/monty.html (Let's make a deal, A Joint Venture).

The Monty Hall Game Show: notation. The prize is behind door j ∈ {1, 2, 3}; open_i denotes the event that the host opens door i.

The Monty Hall Game Show. Suppose the contestant selects door 1 and the host opens door 2.
P(1) = P(2) = P(3) = 1/3: the a priori probability that the prize is behind door 1 (etc. for 2 & 3).
P(open_2 | 1) = 1/2: the probability that the host opens door 2 if the prize is behind door 1 (the contestant has chosen door 1).
P(open_2 | 2) = 0 and P(open_2 | 3) = 1: the probability that the host opens door 2 if the prize is behind door 2 or door 3, respectively.
P(open_2) = Σ_i P(open_2 | i) P(i) = (1/2)(1/3) + 0 + (1)(1/3) = 1/2.
Bayes rule then gives P(1 | open_2) = P(open_2 | 1) P(1) / P(open_2) = 1/3 and P(3 | open_2) = P(open_2 | 3) P(3) / P(open_2) = 2/3, so switching doubles the chance of winning.
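The 1/3 vs. 2/3 result is easy to verify with a small Monte Carlo simulation; this sketch is my own addition, not from the lecture:

% Monte Carlo simulation of the Monty Hall game
N = 1e5;                            % number of simulated games
winStick = 0; winSwitch = 0;
for n = 1:N
    prize  = randi(3);              % door hiding the prize
    choice = 1;                     % the contestant always picks door 1
    % the host opens a door that is neither the chosen door nor the prize door
    candidates = setdiff(1:3, [choice prize]);
    opened = candidates(randi(numel(candidates)));
    switched = setdiff(1:3, [choice opened]);   % the door taken when switching
    winStick  = winStick  + (choice   == prize);
    winSwitch = winSwitch + (switched == prize);
end
fprintf('P(win | stick)  = %.3f\n', winStick/N)    % approx. 1/3
fprintf('P(win | switch) = %.3f\n', winSwitch/N)   % approx. 2/3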

Bayes theorem: The Monty Hall Game show. In a TV game show, a contestant selects one of three doors; behind one of the doors there is a prize, and behind the other two there are no prizes. After the contestant selects a door, the game-show host opens one of the remaining doors and reveals that there is no prize behind it. The host then asks the contestant whether he/she wants to SWITCH to the other unopened door, or STICK to the original choice. What should the contestant do? The host is actually asking the contestant whether he/she wants to SWITCH the choice to both other doors, or STICK to the original choice. Phrased this way, it is obvious what the optimal thing to do is.

Decision theory: Expected conditional risk. R(α_i | x) = Σ_{k=1}^K λ(α_i | c_k) p(c_k | x). The Bayes optimal decision: choose the action α_i that minimizes R(α_i | x), i.e. the action that has the least severe consequences (averaged over all possible outcomes). This requires estimating the a posteriori probability p(c_k | x).

Decision theory: Expected conditional utility. U(α_i | x) = Σ_{k=1}^K u(α_i | c_k) p(c_k | x). The Bayes optimal decision: choose the action α_i that maximizes U(α_i | x), i.e. the action that has the most good consequences (averaged over all possible outcomes). Again this requires estimating the a posteriori probability p(c_k | x).
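As a concrete sketch of the risk calculation (the loss matrix and the posterior values below are invented for illustration, not taken from the lecture), the Bayes optimal action is simply the row of the loss matrix with the smallest expected loss:

% Expected conditional risk R(alpha_i | x) = sum_k lambda(alpha_i | c_k) p(c_k | x)
lambda = [0 10;                     % row i = action alpha_i, column k = true class c_k
          1  0];                    % e.g. action 1 = send home, action 2 = treat
pPost = [0.7; 0.3];                 % hypothetical posterior p(c_k | x) for one observation x
R = lambda * pPost;                 % expected risk of each action
[~, best] = min(R);                 % Bayes optimal decision: argmin_i R(alpha_i | x)
fprintf('R = [%.2f %.2f], choose action %d\n', R(1), R(2), best)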

Classification approaches: 1. Model discrimination functions & discrimination boundaries. 2. Model the probability densities p(x | c_k) & use Bayes rule. 3. Model the a posteriori probabilities p(c_k | x) directly. Examples on the following slides...

Example: The thermostat. We want to classify the temperature in a room into three categories, {cold, fine, hot} (hot means that we want air conditioning, cold means we want heating, fine means we're happy). Discrimination boundary approach: set thresholds, e.g. above 21 is hot, below 19 is cold, and in between is fine. Don't bother with computing probabilities... but this is bad if you want to use decision theory.

Example: Equipment health (diagnostics & predictive maintenance). Discrimination boundary approach: set thresholds and define ok and not-ok regions; does not scale well to many variables. Probability density approach: use a large sample of ok and not-ok equipment and measure relevant variables x; estimate p(x | ok), p(x | not-ok), p(ok) and p(not-ok), then use Bayes theorem. A posteriori approach: use a large sample of ok and not-ok equipment and measure relevant variables x; estimate p(ok | x) and p(not-ok | x) directly.

Parametric & non-parametric methods. Parametric: assume a parametric form. Few degrees of freedom lead to large model bias (e.g. assuming that everything is linear). Non-parametric: assume no parametric form. Many degrees of freedom lead to large model variance (i.e. everything can be any nonlinear function). The optimum often lies somewhere in between.

Linear Gaussian classifier: parametric. Assume p(x | c_k) is Gaussian with class-specific means µ_k and a common covariance matrix Σ:
p(x | c_k) = (2π)^(−D/2) det(Σ)^(−1/2) exp( −(1/2) (x − µ_k)^T Σ^(−1) (x − µ_k) ).

Linear Gaussian classifier: parametric (continued). Estimate the means and covariance matrices for the categories from the data: µ_k ≈ (1/N_k) Σ_{n ∈ c_k} x(n), Σ_k ≈ (1/(N_k − 1)) Σ_{n ∈ c_k} [x(n) − µ_k][x(n) − µ_k]^T, and pool the class estimates into the common covariance Σ ≈ Σ_{k=1}^K (N_k/N) Σ_k.
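A minimal sketch of these estimates on toy data (the data, labels and variable names are invented; the discriminant g_k(x) at the end follows from inserting the Gaussian densities into Bayes rule and taking logarithms):

% Linear Gaussian classifier: estimate class means, a pooled covariance and priors
rng(1);
N1 = 200; N2 = 150;                           % toy 2-D data with two classes
X = [randn(N1,2) + 2; randn(N2,2) - 1];       % class 1 around (2,2), class 2 around (-1,-1)
y = [ones(N1,1); 2*ones(N2,1)];
K = 2; [N, D] = size(X);
mu = zeros(K, D); Sigma = zeros(D, D); prior = zeros(K, 1);
for k = 1:K
    Xk = X(y == k, :); Nk = size(Xk, 1);
    mu(k,:)  = mean(Xk, 1);                   % estimated class mean mu_k
    Sigma    = Sigma + (Nk/N) * cov(Xk);      % pooled (common) covariance estimate
    prior(k) = Nk / N;                        % estimated a priori probability p(c_k)
end
x0 = [0.5 0.5];                               % a new point to classify
g = zeros(1, K);
for k = 1:K
    g(k) = x0 * (Sigma \ mu(k,:)') ...        % g_k(x) = x' inv(Sigma) mu_k
         - 0.5 * mu(k,:) * (Sigma \ mu(k,:)') + log(prior(k));
end
[~, label] = max(g);
fprintf('x0 is assigned to class %d\n', label)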

Linear Gaussian class boundary. [Figure: decision boundary; 11399 green samples, 14 red samples; training error 0.06%, test error 0.10%.]

The simple perceptron. With the {−1, +1} representation, y(x) = sgn[w^T x] = +1 if w^T x > 0 and −1 if w^T x < 0, where w are the parameters and w^T x = w_0 + w_1 x_1 + w_2 x_2 + ... Traditionally (early 60s) trained with Perceptron learning.

Perceptron learning. Desired output: f(n) = +1 if x(n) belongs to class A, f(n) = −1 if x(n) belongs to class B. Repeat until no errors are made anymore: 1. Pick a random example [x(n), f(n)]. 2. If the classification is correct, then do nothing. 3. If the classification is wrong, then update the parameters (η, the learning rate, is a small positive number): w_i ← w_i + η f(n) x_i(n).
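The worked AND example on the following slides can be reproduced with a short script. This is a sketch under the slides' conventions (inputs augmented with a constant 1 for the bias weight w_0); I sweep the examples in a fixed order instead of picking them at random, which here produces the same two weight updates as the slides:

% Perceptron learning for the AND function
X   = [0 0; 0 1; 1 0; 1 1];          % the four inputs
f   = [-1; -1; -1; +1];              % desired outputs, {-1,+1} representation
Xa  = [ones(4,1) X];                 % augment with a constant 1 for the bias weight w_0
w   = [-0.5; 1; 1];                  % initial weights (w_0, w_1, w_2), as in the example
eta = 0.3;                           % learning rate
madeError = true;
while madeError                      % repeat until no errors are made anymore
    madeError = false;
    for n = 1:4
        if sign(Xa(n,:) * w) ~= f(n)          % example n is misclassified
            w = w + eta * f(n) * Xa(n,:)';    % w_i <- w_i + eta * f(n) * x_i(n)
            madeError = true;
        end
    end
end
disp(w')                             % converges to w = (-1.1, 0.7, 0.7)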

Example: Perceptron learning. The AND function; the function we want the Perceptron to learn:
x_1  x_2   f
 0    0   −1
 0    1   −1
 1    0   −1
 1    1   +1

Example: Perceptron learning (continued). Initial values: η = 0.3, w = (−0.5, 1, 1)^T.

Example: Perceptron learning (continued). The initial decision boundary is given by w^T x = w_0 + w_1 x_1 + w_2 x_2 = −0.5 + x_1 + x_2 = 0, i.e. x_2 = 0.5 − x_1.

Example: Perceptron learning (continued). The weight vector (w_1, w_2) = (1, 1) is normal to the decision boundary x_2 = 0.5 − x_1.

Example: Perceptron learning (continued). The picked example is correctly classified: no action. w stays at (−0.5, 1, 1)^T.

Example: Perceptron learning (continued). The picked example (0, 1) is incorrectly classified: learning action. w_0 ← w_0 + η f = −0.5 − 0.3 = −0.8, w_1 ← w_1 + η f · 0 = 1, w_2 ← w_2 + η f · 1 = 1 − 0.3 = 0.7.

Example: Perceptron learning (continued). After the update the weights are w = (−0.8, 1, 0.7)^T and the decision boundary has moved.

Example: Perceptron learning (continued). The next picked example is correctly classified: no action. w stays at (−0.8, 1, 0.7)^T.

Example: Perceptron learning (continued). The picked example (1, 0) is incorrectly classified: learning action. w_0 ← −0.8 − 0.3 = −1.1, w_1 ← 1 − 0.3 = 0.7, w_2 ← 0.7 + 0 = 0.7.

Example: Perceptron learning (continued). After the update the weights are w = (−1.1, 0.7, 0.7)^T.

Example: Perceptron learning (continued). Final solution: w = (−1.1, 0.7, 0.7)^T, which classifies all four points of the AND function correctly.

Perceptron learning Perceptron learning is guaranteed to find a solution in finite time, if a solution exists. However, the Perceptron is only linear.

Perceptron final decision boundary after 100 epochs (1 epoch = 1 full presentation of the entire data set). Training error 0.07%, test error 0.09%.

Seminars for next week: decision theory, the simple perceptron, probability density estimation (2 students).