Learning from Examples

Learning from Examples
- Data fitting
- Decision trees
- Cross validation
- Computational learning theory
- Linear classifiers
- Neural networks
- Nonparametric methods: nearest neighbor
- Support vector machines
- Ensemble learning and boosting

Data Fitting
[Figure: four panels (a)-(d) showing increasingly complex curves f(x) fitted to the same data points.]
- Accuracy
- Simplicity
- Hypothesis space size
- Hypothesis space expressive power
- Accuracy of the best member versus the complexity of finding it

Decision Trees
[Figure 18.2 (figures/restaurant-tree.eps): A decision tree for deciding whether to wait for a table. The root tests Patrons? (None/Some/Full); deeper nodes test WaitEstimate?, Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining?, with Yes/No leaves.]

Construction Algorithm
Input: examples, attributes.
1. If examples is empty, return the plurality label of the parent's examples.
2. If every example has the same label, return that label.
3. If attributes is empty, return the plurality label of the examples.
4. Otherwise, pick an attribute, partition the examples on its values, and recurse.
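As a concrete rendering, a minimal sketch of this recursive procedure, assuming each example is a dict of attribute values plus a 'label' key, and that pick_attribute implements some splitting heuristic (one based on impurity appears two slides below); these representation choices are not part of the slides.

import collections

def plurality(examples):
    # Most common label among the examples.
    return collections.Counter(e['label'] for e in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples):
    if not examples:                                   # step 1
        return plurality(parent_examples)
    if len({e['label'] for e in examples}) == 1:       # step 2
        return examples[0]['label']
    if not attributes:                                 # step 3
        return plurality(examples)
    a = pick_attribute(examples, attributes)           # step 4: split and recurse
    tree = {'attribute': a, 'branches': {}}
    for v in {e[a] for e in examples}:
        subset = [e for e in examples if e[a] == v]
        remaining = [x for x in attributes if x != a]
        tree['branches'][v] = learn_tree(subset, remaining, examples)
    return tree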

Picking an Attribute
[Figure 18.4 (figures/restaurant-stub.eps): Splitting the examples by testing on attributes. At each node we show the positive (light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type brings us no nearer to distinguishing between positive and negative examples. (b) Splitting on Patrons does a good job of separating positive and negative examples. After splitting on Patrons, Hungry is a fairly good second test.]

Output Decision Tree
[Figure 18.6 (figures/induced-restaurant-tree.eps): The decision tree induced from the 12-example training set, testing Patrons?, then Hungry?, Type?, and Fri/Sat?.]

Impurity
Impurity is a heuristic for decision tree construction. The impurity of p positive and n negative instances is
    (p / (p + n)) (n / (p + n)) = pn / (p + n)^2
The impurity is unimodal with minima of 0 at p = 0 and n = 0 and a maximum of 1/4 at p = n.
The average impurity after a test with k subsets, each with p_i positives and n_i negatives, is proportional to
    sum_{i=1}^{k} (p_i + n_i) p_i n_i / (p_i + n_i)^2 = sum_{i=1}^{k} p_i n_i / (p_i + n_i)
Pick the test that minimizes this value. It yields an identical tree for the restaurant example.
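A sketch of this heuristic as code, providing the pick_attribute helper assumed in the earlier construction sketch; boolean labels stored under a 'label' key are an assumption of the sketch, not of the slides.

def split_score(examples, attribute):
    # Sum over subsets of p_i * n_i / (p_i + n_i); lower is better.
    score = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == v]
        p = sum(1 for e in subset if e['label'])
        n = len(subset) - p
        if p + n:
            score += p * n / (p + n)
    return score

def pick_attribute(examples, attributes):
    # Pick the test that minimizes the summed impurity.
    return min(attributes, key=lambda a: split_score(examples, a))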

Learning Curve
[Figure 18.7: A learning curve for the decision tree learning algorithm on 100 randomly generated examples in the restaurant domain, plotting proportion correct on the test set against training set size. Each data point is the average of 20 trials.]

Cross Validation
- Split the data into k equal subsets.
- Perform k learning rounds; each round reserves one subset for testing and trains on the rest.
- Average the results.
- k = 10 is common; k = n (singleton test sets) is the ultimate, leave-one-out cross validation.
- Construct the final classifier from all the data.
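A minimal sketch of this procedure, assuming hypothetical learn(train) and accuracy(h, test) helpers supplied by the caller.

def cross_validate(data, k, learn, accuracy):
    folds = [data[i::k] for i in range(k)]           # k roughly equal subsets
    scores = []
    for i in range(k):
        test = folds[i]                              # round i reserves fold i for testing
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(accuracy(learn(train), test))
    return sum(scores) / k                           # average over the k rounds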

Model Complexity Versus Quality
[Figure: error rate versus tree size, with curves for training set error and validation set error.]

Computational Learning Theory I
- We will consider Boolean functions of Boolean attributes.
- Assumption: training and test data are independent samples from a fixed distribution.
- The error of a hypothesis is the probability that it is wrong on a random sample from this distribution.
- A hypothesis is approximately correct if its error is less than ɛ.
- A hypothesis is probably approximately correct (PAC) if it is approximately correct with probability at least 1 - δ.
- The parameters ɛ and δ must be between 0 and 1 but are otherwise arbitrary.
- Goal: compute a PAC hypothesis from a reasonable number of samples with reasonable computational complexity.
- Idea: a bad (not approximately correct) hypothesis will usually fail quickly.
- Pick a hypothesis space H with |H| members.

Computational Learning Theory II
- The probability that a bad h is right on one sample is at most 1 - ɛ.
- The probability that it is right on n samples is at most (1 - ɛ)^n.
- The probability that some bad h in H is right on all n samples is at most |H| (1 - ɛ)^n.
- We want this to be less than δ: |H| (1 - ɛ)^n ≤ δ.
- Fun fact: 1 - ɛ ≤ e^(-ɛ). Take logs and rearrange:
      n ≥ (1/ɛ) (log |H| + log (1/δ))
- This n is called the sample complexity of H.
- Any hypothesis that is consistent with n samples is PAC!
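A small worked example of this bound; the values of ɛ, δ, and |H| below are illustrative.

import math

def sample_complexity(log_h, eps, delta):
    # n >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((log_h + math.log(1 / delta)) / eps)

# e.g. all Boolean functions of m = 10 attributes: ln|H| = ln(2^(2^10)) = 2^10 ln 2
print(sample_complexity((2 ** 10) * math.log(2), eps=0.1, delta=0.05))   # about 7128 samples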

PAC Learning
- The sample complexity limits the choice of H.
- The sample complexity of decision trees is exponential in the number of attributes.
- A decision tree on m Boolean attributes is equivalent to a propositional logic formula in disjunctive normal form, and every formula is expressible in disjunctive normal form.
- The truth table of a formula has 2^m rows, and each of the 2^(2^m) subsets of rows can be the true rows, so log |H| = 2^m.
- We consider a smaller hypothesis space next.
- But something is wrong with computational learning theory, because decision trees work well in practice!
- Fishy assumptions: no prior knowledge, distribution independence, independence from the structure of H.

Decision Lists
[Figure: a decision list for the restaurant domain: if Patrons(x, Some) then Yes; else if Patrons(x, Full) ∧ Fri/Sat(x) then Yes; else No.]
- log |H| = O(m^k log m^k) for m attributes and at most k literals per test.
- There are (2m choose i) choices of i literals from m attributes (each attribute can appear positive or negated).
- A test can have i = 0, 1, ..., k literals; altogether O(m^k) possible tests.
- Each test can classify yes, classify no, or be absent: 3^(m^k) possibilities.
- The tests can appear in any order: 3^(m^k) (m^k)! lists.
- The Stirling approximation yields the bound.
- Greedy algorithms give good results. Example: pick the smallest conjunction that matches a uniformly classified subset of the instances.

Decision Lists Versus Decision Trees
[Figure: learning curves on the restaurant data, proportion correct on the test set versus training set size, for the decision tree and decision list learners.]

Least-Squares Fitting
- Fit a line to points in the plane.
- The line is h_w(x) = w_0 + w_1 x with unknown w_0, w_1.
- The training data is points (x_1, y_1), ..., (x_n, y_n).
- Minimize the distance between (y_1, ..., y_n) and (h_w(x_1), ..., h_w(x_n)).
- Square it and take partials with respect to w_0 and w_1:
      ∂/∂w_0 sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2 = 0
      ∂/∂w_1 sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2 = 0
- Obtain two linear equations in w_0 and w_1.
- General case: fit a linear combination of basis functions to the data. Example: w_0 + w_1 x + w_2 sin x + w_3 cos x.
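A sketch of least-squares fitting with basis functions, using the example basis (1, x, sin x, cos x) from the slide; the data is synthetic.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + np.sin(x) + rng.normal(scale=0.1, size=x.shape)

# Design matrix: one column per basis function.
A = np.column_stack([np.ones_like(x), x, np.sin(x), np.cos(x)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes ||A w - y||^2
print(w)                                     # approximately (2, 0.5, 1, 0)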

Linear Classifier (Perceptron)
- Instances are feature vectors x = (x_1, x_2).
- Find a line that separates the classes.
- The points are linearly separable if such a line exists.
- Approximate separation is useful for non-separable data.
- General case: x = (1, x_1, ..., x_n).
- Linear function: w · x = w_0 + w_1 x_1 + ... + w_n x_n; classes y = 0 and y = 1.
- The classifier h_w(x) returns 1 if w · x > 0 and 0 otherwise.

Linearly Separable Data
[Figure: two scatter plots, one linearly separable and one not.]
Earthquake versus explosion given body and surface waves. The larger dataset is more accurate, but it is not linearly separable.

Perceptron Learning
[Figure: training curves (proportion correct versus number of weight updates) on separable data, non-separable data, and non-separable data with decaying α.]
- Error on training data (x_j, y_j) is e_w = sum_j (y_j - h_w(x_j))^2.
- Update: w_i ← w_i + α (y_j - h_w(x_j)) x_{j,i} (like gradient descent).
- A fixed α > 0 converges on linearly separable data.
- A decreasing α = O(1/t) in iteration t usually converges.
- Convergence is uneven and can be slow.
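A minimal sketch of the perceptron update rule above; a leading 1 in each feature vector for the bias weight and 0/1 labels are assumptions of the sketch.

import numpy as np

def perceptron_train(X, y, alpha=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            h = 1 if w @ xj > 0 else 0          # hard-threshold classifier h_w(x)
            w += alpha * (yj - h) * xj          # w_i <- w_i + alpha (y - h) x_i
    return w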

Threshold Functions
[Figure: hard threshold, soft threshold (logistic), and half-wave rectifier activation functions.]
- Performance is greatly improved with the soft threshold g(z) = 1 / (1 + e^(-z)).
- Classify based on h_w(x) = g(w · x) > 0.5.
- Recent neural networks use the half-wave rectifier.

Learning with Soft Threshold
[Figure: squared error per example versus number of weight updates on separable data, non-separable data, and non-separable data with decaying α.]
- Compute w that minimizes e_w = sum_j (y_j - g(w · x_j))^2.
- Gradient descent on f(x): iterate x ← x - α f'(x).
- The multivariate version for e_w uses
      g'(z) = e^(-z) / (1 + e^(-z))^2 = g(z) (1 - g(z))
      w_i ← w_i + α (y - g(w · x)) g(w · x) (1 - g(w · x)) x_i
- Converges fast and smoothly even on non-separable data.
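A sketch of the soft-threshold update rule above, under the same 0/1-label and leading-1 conventions as the perceptron sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold_train(X, y, alpha=0.1, epochs=1000):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            h = sigmoid(w @ xj)
            w += alpha * (yj - h) * h * (1 - h) * xj   # uses g'(z) = g(z)(1 - g(z))
    return w

def classify(w, x):
    return sigmoid(w @ x) > 0.5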

Neural Networks
[Figure: a single unit with input links a_i, weights w_{i,j}, a bias weight w_{0,j} on a fixed input a_0 = 1, input function in_j = Σ_i w_{i,j} a_i, activation function g, and output a_j = g(in_j).]
- Perceptrons are of limited use because linear separation is rare.
- The natural next step is a network of perceptrons.
- Each perceptron is analogous to a neuron, so the network is called a neural network.

Feed-forward Networks
[Figure: (a) a perceptron network with inputs 1, 2 and output units 3, 4; (b) a network with inputs 1, 2, hidden units 3, 4, and output units 5, 6, connected by weights w_{i,j}.]
- A feed-forward network is a directed graph of perceptrons.
- It is organized into input, hidden, and output layers.
- It is trained by gradient descent, called back propagation.
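For concreteness, a small sketch of a feed-forward network with one hidden layer and a sigmoid output, trained by back propagation on squared error; the layer sizes, the sigmoid activation, and the omission of bias weights are illustrative choices, not part of the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    h = sigmoid(x @ W1)              # hidden-layer activations
    return h, sigmoid(h @ W2)        # network output

def backprop_step(x, y, W1, W2, alpha=0.5):
    h, out = forward(x, W1, W2)
    d_out = (out - y) * out * (1 - out)      # output delta (squared error, sigmoid)
    d_hid = (d_out @ W2.T) * h * (1 - h)     # hidden delta, propagated backward
    return W1 - alpha * np.outer(x, d_hid), W2 - alpha * np.outer(h, d_out)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(3, 1))   # 2 inputs, 3 hidden, 1 output
W1, W2 = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)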

What Feed-Forward Networks Can Learn
[Figure: (a) x_1 and x_2, (b) x_1 or x_2, (c) x_1 xor x_2; the first two are linearly separable, the third is not.]
- No hidden layers: linearly separable functions.
- One hidden layer: continuous functions.
- Two hidden layers: discontinuous functions.

Deep Learning
- The term deep learning refers primarily to neural networks with multiple hidden layers.
- The internal layers are meant to learn a hierarchy of domain features without human help.
- Deep learning is today's hottest machine learning technique.
- The basic ideas, e.g. back propagation, are 35 years old.
- Increased computing power and data storage enable larger networks and training sets.
- There are some improvements in network organization, notably convolutional networks, and in training algorithms, notably stochastic gradient descent, the half-wave rectifier threshold function, and dropout.

Nonparametric Methods
- A neural network learns a fixed set of parameters.
- Too many parameters cause overfitting; too few cause underfitting.
- The user must pick a network that avoids these problems. (Update: deep learning questions this claim.)
- Nonparametric methods pick the number of parameters based on the training data.
- They are more flexible, but use more time and space.

k Nearest Neighbors
[Figure: decision boundaries on the earthquake/explosion data for k = 1 and k = 5.]
- Store all the training data.
- Classify based on the majority vote of the k nearest neighbors.
- Metric: Euclidean, Manhattan, Hamming, normalization.
- Degrades with dimension.
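A minimal sketch of k-nearest-neighbor classification, assuming numeric features and the Euclidean metric; X holds the stored training points and y their labels.

import numpy as np
from collections import Counter

def knn_classify(X, y, query, k=5):
    dists = np.linalg.norm(X - query, axis=1)        # Euclidean distance to each stored point
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
    return Counter(y[i] for i in nearest).most_common(1)[0][0]   # majority vote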

Nonparametric Regression (Curve Fitting)
[Figure: the same dataset fit four ways: linear regression, 3-nearest-neighbor average, 3-nearest-neighbor linear regression, and locally weighted regression.]

Locally Weighted Regression
[Figure: a quadratic kernel of width 10 and the resulting locally weighted regression fit.]
- Weight the error in sample (x_i, y_i) by a function of δ = x - x_i.
- The function has a maximum of 1 at δ = 0 and decreases to zero monotonically and symmetrically.
- Quadratic kernel function with width u: k(δ) = max(0, 1 - (2δ/u)^2).
- Compute w that minimizes sum_i k(x - x_i) (y_i - w · x_i)^2.
- Predict y = w · x.
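A sketch of locally weighted regression with the quadratic kernel above, restricted to one-dimensional inputs and a straight-line local model for brevity; x is the training input, y the targets, and query the point to predict at.

import numpy as np

def quadratic_kernel(delta, width):
    return np.maximum(0.0, 1.0 - (2.0 * delta / width) ** 2)

def lwr_predict(x, y, query, width=10.0):
    k = quadratic_kernel(np.abs(x - query), width)       # per-sample weights
    A = np.column_stack([np.ones_like(x), x])            # local model w_0 + w_1 x
    W = np.diag(k)
    # Weighted least squares: minimize sum_i k_i (y_i - w . x_i)^2
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return w[0] + w[1] * query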

Support Vector Machines
- Make the training data linearly separable by defining extra features as polynomials in the given features.
- Use the optimal linear classifier.
- Use kernel functions for fast training and classification.
- All the rage 5-10 years ago; deep learning is hotter now.

Maximum Margin Separator
[Figure: the same separable data with an arbitrary linear separator and with the maximum margin separator.]
- Linear separators might misclassify nearby test data.
- Support vectors: the points closest to the linear separator.
- Maximum margin separator: the separator furthest from the support vectors.

Computing the Maximum Margin Separator
- h(x) is a function of the support vectors:
      h(x) = sign( sum_i α_i y_i (x · x_i) - b )
- There are usually (but not always) few support vectors.
- Compute the α_i and b via quadratic programming.
- The algorithm and the result use only the dot products x · x_i.

Defining Features for Linear Separability
[Figure: circularly separable 2D data and the same data mapped to 3D, where it is linearly separable.]
- Circular separator in 2D: x_1^2 + x_2^2 = 1.
- Linear separator in 3D (could have used 2D): u_1 + u_2 = 1 with u_1 = x_1^2, u_2 = x_2^2, u_3 = √2 x_1 x_2.

Kernel Trick
- Replace the feature vector x with a feature vector F(x).
- Circle example: F(x) = (x_1^2, x_2^2, √2 x_1 x_2).
- Training and classification use F(a) · F(b) instead of a · b.
- Pick F(x) such that F(a) · F(b) = K(a, b). K is called a kernel function.
- Circle example: K(a, b) = (a · b)^2.
      (a · b)^2 = (a_1 b_1 + a_2 b_2)^2 = a_1^2 b_1^2 + 2 a_1 a_2 b_1 b_2 + a_2^2 b_2^2
      F(a) · F(b) = (a_1^2, a_2^2, √2 a_1 a_2) · (b_1^2, b_2^2, √2 b_1 b_2) = a_1^2 b_1^2 + 2 a_1 a_2 b_1 b_2 + a_2^2 b_2^2
- An explicit definition of F(x) is unnecessary.
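A quick numerical check of this identity, K(a, b) = F(a) · F(b) for the circle example; the two vectors below are arbitrary.

import numpy as np

def K(a, b):
    return float(a @ b) ** 2

def F(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

a, b = np.array([1.5, -0.3]), np.array([0.2, 2.0])
print(K(a, b), F(a) @ F(b))    # the two numbers agree (0.09)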

Ensemble Learning
[Figure: three linear separators whose majority vote defines a triangular positive region.]
- Generate multiple hypotheses and use the majority vote.
- Reduces error to the extent the hypotheses are independent.
- Expands the hypothesis space, e.g. triangles versus lines.

Boosted Learning
- Use a learning algorithm for samples weighted by importance u_j.
- Neural network with u_j weights: e_w = sum_j u_j (y_j - h_w(x_j))^2.
- Decision tree: make u_j copies of (x_j, y_j).
- Construct hypothesis h_1 with all sample weights equal to 1.
- Assign h_1 a weight equal to the sum of the weights of its correct answers.
- Increase the weights of the samples that h_1 got wrong and decrease the weights of those it got right.
- Repeat to construct hypotheses h_2, ..., h_k.
- Classify based on the k answers weighted by their hypotheses' weights.
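A sketch of the boosting loop as described above, assuming a hypothetical base learner weighted_learn(examples, u) that accepts per-sample weights and returns a classifier h with h(x) in {0, 1}; the up/down reweighting factors are illustrative placeholders (AdaBoost prescribes specific values).

def boost(examples, k, weighted_learn, up=2.0, down=0.5):
    u = [1.0] * len(examples)                    # start with all sample weights equal to 1
    hypotheses = []
    for _ in range(k):
        h = weighted_learn(examples, u)
        z = sum(uj for uj, (xj, yj) in zip(u, examples) if h(xj) == yj)
        hypotheses.append((h, z))                # hypothesis weight = weight of its correct answers
        u = [uj * (down if h(xj) == yj else up)  # reweight the samples it got right/wrong
             for uj, (xj, yj) in zip(u, examples)]
    return hypotheses

def boosted_classify(hypotheses, x):
    vote = sum(z if h(x) == 1 else -z for h, z in hypotheses)
    return 1 if vote > 0 else 0                  # weighted majority vote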

Boosted Learning of Decision Trees
[Figure: boosted hypotheses h_1, h_2, h_3, h_4 and the combined hypothesis h.]

Restaurant Data
[Figure: (left) proportion correct on the test set versus training set size for boosted decision stumps and a single decision stump; (right) training and test accuracy versus the number of hypotheses K.]

Character Recognition

Learning Algorithm versus Dataset Size
[Figure: proportion correct on the test set versus training set size (millions of words).]