Learning from Examples

Learning from Examples
- Data fitting
- Decision trees
- Cross validation
- Computational learning theory
- Linear classifiers
- Neural networks
- Nonparametric methods: nearest neighbor
- Support vector machines
- Ensemble learning and boosting

Data Fitting
[Figure: four panels (a)-(d) showing increasingly complex curves f(x) fitted to the same data points.]
- Accuracy
- Simplicity
- Hypothesis space size
- Hypothesis space expressive power
- Accuracy of the best member versus the complexity of finding it

Decision Trees
[Figure 18.2 (figures/restaurant-tree.eps): A decision tree for deciding whether to wait for a table. The root tests Patrons? (None/Some/Full); deeper nodes test WaitEstimate?, Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining?, with Yes/No leaves.]

Construction Algorithm
Input: examples, attributes.
1. If examples is empty, return the plurality label of the parent's examples.
2. If every example has the same label, return that label.
3. If attributes is empty, return the plurality label of the examples.
4. Otherwise, pick an attribute, partition the examples on its values, and recurse.
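As a concrete rendering, a minimal sketch of this recursive procedure, assuming each example is a dict of attribute values plus a 'label' key, and that pick_attribute implements some splitting heuristic (one based on impurity appears two slides below); these representation choices are not part of the slides.

import collections

def plurality(examples):
    # Most common label among the examples.
    return collections.Counter(e['label'] for e in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples):
    if not examples:                                   # step 1
        return plurality(parent_examples)
    if len({e['label'] for e in examples}) == 1:       # step 2
        return examples[0]['label']
    if not attributes:                                 # step 3
        return plurality(examples)
    a = pick_attribute(examples, attributes)           # step 4: split and recurse
    tree = {'attribute': a, 'branches': {}}
    for v in {e[a] for e in examples}:
        subset = [e for e in examples if e[a] == v]
        remaining = [x for x in attributes if x != a]
        tree['branches'][v] = learn_tree(subset, remaining, examples)
    return tree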

Picking an Attribute
[Figure 18.4 (figures/restaurant-stub.eps): Splitting the examples by testing on attributes. At each node we show the positive (light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type brings us no nearer to distinguishing between positive and negative examples. (b) Splitting on Patrons does a good job of separating positive and negative examples. After splitting on Patrons, Hungry is a fairly good second test.]

Output Decision Tree
[Figure 18.6 (figures/induced-restaurant-tree.eps): The decision tree induced from the 12-example training set, testing Patrons?, then Hungry?, Type?, and Fri/Sat?.]

Impurity
Impurity is a heuristic for decision tree construction. The impurity of p positive and n negative instances is
    (p / (p + n)) (n / (p + n)) = pn / (p + n)^2
The impurity is unimodal with minima of 0 at p = 0 and n = 0 and a maximum of 1/4 at p = n.
The average impurity after a test with k subsets, each with p_i positives and n_i negatives, is proportional to
    sum_{i=1}^{k} (p_i + n_i) p_i n_i / (p_i + n_i)^2 = sum_{i=1}^{k} p_i n_i / (p_i + n_i)
Pick the test that minimizes this value. It yields an identical tree for the restaurant example.
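A sketch of this heuristic as code, providing the pick_attribute helper assumed in the earlier construction sketch; boolean labels stored under a 'label' key are an assumption of the sketch, not of the slides.

def split_score(examples, attribute):
    # Sum over subsets of p_i * n_i / (p_i + n_i); lower is better.
    score = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == v]
        p = sum(1 for e in subset if e['label'])
        n = len(subset) - p
        if p + n:
            score += p * n / (p + n)
    return score

def pick_attribute(examples, attributes):
    # Pick the test that minimizes the summed impurity.
    return min(attributes, key=lambda a: split_score(examples, a))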

Learning Curve
[Figure 18.7: A learning curve for the decision tree learning algorithm on 100 randomly generated examples in the restaurant domain, plotting proportion correct on the test set against training set size. Each data point is the average of 20 trials.]

Cross Validation
- Split the data into k equal subsets.
- Perform k learning rounds; each round reserves one subset for testing and trains on the rest.
- Average the results.
- k = 10 is common; k = n (singleton test sets) is the ultimate, leave-one-out cross validation.
- Construct the final classifier from all the data.
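A minimal sketch of this procedure, assuming hypothetical learn(train) and accuracy(h, test) helpers supplied by the caller.

def cross_validate(data, k, learn, accuracy):
    folds = [data[i::k] for i in range(k)]           # k roughly equal subsets
    scores = []
    for i in range(k):
        test = folds[i]                              # round i reserves fold i for testing
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(accuracy(learn(train), test))
    return sum(scores) / k                           # average over the k rounds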

Model Complexity Versus Quality
[Figure: error rate versus tree size, with curves for training set error and validation set error.]

Computational Learning Theory I
- We will consider Boolean functions of Boolean attributes.
- Assumption: training and test data are independent samples from a fixed distribution.
- The error of a hypothesis is the probability that it is wrong on a random sample from this distribution.
- A hypothesis is approximately correct if its error is less than ɛ.
- A hypothesis is probably approximately correct (PAC) if it is approximately correct with probability at least 1 - δ.
- The parameters ɛ and δ must be between 0 and 1 but are otherwise arbitrary.
- Goal: compute a PAC hypothesis from a reasonable number of samples with reasonable computational complexity.
- Idea: a bad (not approximately correct) hypothesis will usually fail quickly.
- Pick a hypothesis space H with |H| members.

Computational Learning Theory II
- The probability that a bad h is right on one sample is at most 1 - ɛ.
- The probability that it is right on n samples is at most (1 - ɛ)^n.
- The probability that some bad h in H is right on all n samples is at most |H| (1 - ɛ)^n.
- We want this to be less than δ: |H| (1 - ɛ)^n ≤ δ.
- Fun fact: 1 - ɛ ≤ e^(-ɛ). Take logs and rearrange:
      n ≥ (1/ɛ) (log |H| + log (1/δ))
- This n is called the sample complexity of H.
- Any hypothesis that is consistent with n samples is PAC!
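A small worked example of this bound; the values of ɛ, δ, and |H| below are illustrative.

import math

def sample_complexity(log_h, eps, delta):
    # n >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((log_h + math.log(1 / delta)) / eps)

# e.g. all Boolean functions of m = 10 attributes: ln|H| = ln(2^(2^10)) = 2^10 ln 2
print(sample_complexity((2 ** 10) * math.log(2), eps=0.1, delta=0.05))   # about 7128 samples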

PAC Learning
- The sample complexity limits the choice of H.
- The sample complexity of decision trees is exponential in the number of attributes.
- A decision tree on m Boolean attributes is equivalent to a propositional logic formula in disjunctive normal form, and every formula is expressible in disjunctive normal form.
- The truth table of a formula has 2^m rows, and each of the 2^(2^m) subsets of rows can be the true rows, so log |H| = 2^m.
- We consider a smaller hypothesis space next.
- But something is wrong with computational learning theory, because decision trees work well in practice!
- Fishy assumptions: no prior knowledge, distribution independence, independence from the structure of H.

Decision Lists
[Figure: a decision list for the restaurant domain: if Patrons(x, Some) then Yes; else if Patrons(x, Full) ∧ Fri/Sat(x) then Yes; else No.]
- log |H| = O(m^k log m^k) for m attributes and at most k literals per test.
- There are (2m choose i) choices of i literals from m attributes (each attribute can appear positive or negated).
- A test can have i = 0, 1, ..., k literals; altogether O(m^k) possible tests.
- Each test can classify yes, classify no, or be absent: 3^(m^k) possibilities.
- The tests can appear in any order: 3^(m^k) (m^k)! lists.
- The Stirling approximation yields the bound.
- Greedy algorithms give good results. Example: pick the smallest conjunction that matches a uniformly classified subset of the instances.

Decision Lists Versus Decision Trees
[Figure: learning curves on the restaurant data, proportion correct on the test set versus training set size, for the decision tree and decision list learners.]

Least-Squares Fitting
- Fit a line to points in the plane.
- The line is h_w(x) = w_0 + w_1 x with unknown w_0, w_1.
- The training data is points (x_1, y_1), ..., (x_n, y_n).
- Minimize the distance between (y_1, ..., y_n) and (h_w(x_1), ..., h_w(x_n)).
- Square it and take partials with respect to w_0 and w_1:
      ∂/∂w_0 sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2 = 0
      ∂/∂w_1 sum_{i=1}^{n} (y_i - w_0 - w_1 x_i)^2 = 0
- Obtain two linear equations in w_0 and w_1.
- General case: fit a linear combination of basis functions to the data. Example: w_0 + w_1 x + w_2 sin x + w_3 cos x.
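A sketch of least-squares fitting with basis functions, using the example basis (1, x, sin x, cos x) from the slide; the data is synthetic.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + np.sin(x) + rng.normal(scale=0.1, size=x.shape)

# Design matrix: one column per basis function.
A = np.column_stack([np.ones_like(x), x, np.sin(x), np.cos(x)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes ||A w - y||^2
print(w)                                     # approximately (2, 0.5, 1, 0)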

Linear Classifier (Perceptron)
- Instances are feature vectors x = (x_1, x_2).
- Find a line that separates the classes.
- The points are linearly separable if such a line exists.
- Approximate separation is useful for non-separable data.
- General case: x = (1, x_1, ..., x_n).
- Linear function: w · x = w_0 + w_1 x_1 + ... + w_n x_n; classes y = 0 and y = 1.
- The classifier h_w(x) returns 1 if w · x > 0 and 0 otherwise.

Linearly Separable Data
[Figure: two scatter plots, one linearly separable and one not.]
Earthquake versus explosion given body and surface waves. The larger dataset is more accurate, but it is not linearly separable.

Perceptron Learning
[Figure: training curves (proportion correct versus number of weight updates) on separable data, non-separable data, and non-separable data with decaying α.]
- Error on training data (x_j, y_j) is e_w = sum_j (y_j - h_w(x_j))^2.
- Update: w_i ← w_i + α (y_j - h_w(x_j)) x_{j,i} (like gradient descent).
- A fixed α > 0 converges on linearly separable data.
- A decreasing α = O(1/t) in iteration t usually converges.
- Convergence is uneven and can be slow.
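A minimal sketch of the perceptron update rule above; a leading 1 in each feature vector for the bias weight and 0/1 labels are assumptions of the sketch.

import numpy as np

def perceptron_train(X, y, alpha=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            h = 1 if w @ xj > 0 else 0          # hard-threshold classifier h_w(x)
            w += alpha * (yj - h) * xj          # w_i <- w_i + alpha (y - h) x_i
    return w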

Threshold Functions
[Figure: hard threshold, soft threshold (logistic), and half-wave rectifier activation functions.]
- Performance is greatly improved with the soft threshold g(z) = 1 / (1 + e^(-z)).
- Classify based on h_w(x) = g(w · x) > 0.5.
- Recent neural networks use the half-wave rectifier.

Learning with Soft Threshold
[Figure: squared error per example versus number of weight updates on separable data, non-separable data, and non-separable data with decaying α.]
- Compute w that minimizes e_w = sum_j (y_j - g(w · x_j))^2.
- Gradient descent on f(x): iterate x ← x - α f'(x).
- The multivariate version for e_w uses
      g'(z) = e^(-z) / (1 + e^(-z))^2 = g(z) (1 - g(z))
      w_i ← w_i + α (y - g(w · x)) g(w · x) (1 - g(w · x)) x_i
- Converges fast and smoothly even on non-separable data.
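A sketch of the soft-threshold update rule above, under the same 0/1-label and leading-1 conventions as the perceptron sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold_train(X, y, alpha=0.1, epochs=1000):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            h = sigmoid(w @ xj)
            w += alpha * (yj - h) * h * (1 - h) * xj   # uses g'(z) = g(z)(1 - g(z))
    return w

def classify(w, x):
    return sigmoid(w @ x) > 0.5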

Neural Networks
[Figure: a single unit with input links a_i, weights w_{i,j}, a bias weight w_{0,j} on a fixed input a_0 = 1, input function in_j = Σ_i w_{i,j} a_i, activation function g, and output a_j = g(in_j).]
- Perceptrons are of limited use because linear separation is rare.
- The natural next step is a network of perceptrons.
- Each perceptron is analogous to a neuron, so the network is called a neural network.

Feed-forward Networks
[Figure: (a) a perceptron network with inputs 1, 2 and output units 3, 4; (b) a network with inputs 1, 2, hidden units 3, 4, and output units 5, 6, connected by weights w_{i,j}.]
- A feed-forward network is a directed graph of perceptrons.
- It is organized into input, hidden, and output layers.
- It is trained by gradient descent, called back propagation.
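For concreteness, a small sketch of a feed-forward network with one hidden layer and a sigmoid output, trained by back propagation on squared error; the layer sizes, the sigmoid activation, and the omission of bias weights are illustrative choices, not part of the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    h = sigmoid(x @ W1)              # hidden-layer activations
    return h, sigmoid(h @ W2)        # network output

def backprop_step(x, y, W1, W2, alpha=0.5):
    h, out = forward(x, W1, W2)
    d_out = (out - y) * out * (1 - out)      # output delta (squared error, sigmoid)
    d_hid = (d_out @ W2.T) * h * (1 - h)     # hidden delta, propagated backward
    return W1 - alpha * np.outer(x, d_hid), W2 - alpha * np.outer(h, d_out)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(3, 1))   # 2 inputs, 3 hidden, 1 output
W1, W2 = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)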

What Feed-Forward Networks Can Learn
[Figure: (a) x_1 and x_2, (b) x_1 or x_2, (c) x_1 xor x_2; the first two are linearly separable, the third is not.]
- No hidden layers: linearly separable functions.
- One hidden layer: continuous functions.
- Two hidden layers: discontinuous functions.

Deep Learning
- The term deep learning refers primarily to neural networks with multiple hidden layers.
- The internal layers are meant to learn a hierarchy of domain features without human help.
- Deep learning is today's hottest machine learning technique.
- The basic ideas, e.g. back propagation, are 35 years old.
- Increased computing power and data storage enable larger networks and training sets.
- There are some improvements in network organization, notably convolutional networks, and in training algorithms, notably stochastic gradient descent, the half-wave rectifier threshold function, and dropout.

Nonparametric Methods
- A neural network learns a fixed set of parameters.
- Too many parameters cause overfitting; too few cause underfitting.
- The user must pick a network that avoids these problems. (Update: deep learning questions this claim.)
- Nonparametric methods pick the number of parameters based on the training data.
- They are more flexible, but use more time and space.

k Nearest Neighbors
[Figure: decision boundaries on the earthquake/explosion data for k = 1 and k = 5.]
- Store all the training data.
- Classify based on the majority vote of the k nearest neighbors.
- Metric: Euclidean, Manhattan, Hamming, normalization.
- Degrades with dimension.
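A minimal sketch of k-nearest-neighbor classification, assuming numeric features and the Euclidean metric; X holds the stored training points and y their labels.

import numpy as np
from collections import Counter

def knn_classify(X, y, query, k=5):
    dists = np.linalg.norm(X - query, axis=1)        # Euclidean distance to each stored point
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
    return Counter(y[i] for i in nearest).most_common(1)[0][0]   # majority vote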

Nonparametric Regression (Curve Fitting)
[Figure: the same dataset fit four ways: linear regression, 3-nearest-neighbor average, 3-nearest-neighbor linear regression, and locally weighted regression.]

Locally Weighted Regression
[Figure: a quadratic kernel of width 10 and the resulting locally weighted regression fit.]
- Weight the error in sample (x_i, y_i) by a function of δ = x - x_i.
- The function has a maximum of 1 at δ = 0 and decreases to zero monotonically and symmetrically.
- Quadratic kernel function with width u: k(δ) = max(0, 1 - (2δ/u)^2).
- Compute w that minimizes sum_i k(x - x_i) (y_i - w · x_i)^2.
- Predict y = w · x.
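A sketch of locally weighted regression with the quadratic kernel above, restricted to one-dimensional inputs and a straight-line local model for brevity; x is the training input, y the targets, and query the point to predict at.

import numpy as np

def quadratic_kernel(delta, width):
    return np.maximum(0.0, 1.0 - (2.0 * delta / width) ** 2)

def lwr_predict(x, y, query, width=10.0):
    k = quadratic_kernel(np.abs(x - query), width)       # per-sample weights
    A = np.column_stack([np.ones_like(x), x])            # local model w_0 + w_1 x
    W = np.diag(k)
    # Weighted least squares: minimize sum_i k_i (y_i - w . x_i)^2
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return w[0] + w[1] * query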

Support Vector Machines
- Make the training data linearly separable by defining extra features as polynomials in the given features.
- Use the optimal linear classifier.
- Use kernel functions for fast training and classification.
- All the rage 5-10 years ago; deep learning is hotter now.

Maximum Margin Separator
[Figure: the same separable data with an arbitrary linear separator and with the maximum margin separator.]
- Linear separators might misclassify nearby test data.
- Support vectors: the points closest to the linear separator.
- Maximum margin separator: the separator furthest from the support vectors.

Computing the Maximum Margin Separator
- h(x) is a function of the support vectors:
      h(x) = sign( sum_i α_i y_i (x · x_i) - b )
- There are usually (but not always) few support vectors.
- Compute the α_i and b via quadratic programming.
- The algorithm and the result use only the dot products x · x_i.

Defining Features for Linear Separability
[Figure: circularly separable 2D data and the same data mapped to 3D, where it is linearly separable.]
- Circular separator in 2D: x_1^2 + x_2^2 = 1.
- Linear separator in 3D (could have used 2D): u_1 + u_2 = 1 with u_1 = x_1^2, u_2 = x_2^2, u_3 = √2 x_1 x_2.

Kernel Trick
- Replace the feature vector x with a feature vector F(x).
- Circle example: F(x) = (x_1^2, x_2^2, √2 x_1 x_2).
- Training and classification use F(a) · F(b) instead of a · b.
- Pick F(x) such that F(a) · F(b) = K(a, b). K is called a kernel function.
- Circle example: K(a, b) = (a · b)^2.
      (a · b)^2 = (a_1 b_1 + a_2 b_2)^2 = a_1^2 b_1^2 + 2 a_1 a_2 b_1 b_2 + a_2^2 b_2^2
      F(a) · F(b) = (a_1^2, a_2^2, √2 a_1 a_2) · (b_1^2, b_2^2, √2 b_1 b_2) = a_1^2 b_1^2 + 2 a_1 a_2 b_1 b_2 + a_2^2 b_2^2
- An explicit definition of F(x) is unnecessary.
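A quick numerical check of this identity, K(a, b) = F(a) · F(b) for the circle example; the two vectors below are arbitrary.

import numpy as np

def K(a, b):
    return float(a @ b) ** 2

def F(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

a, b = np.array([1.5, -0.3]), np.array([0.2, 2.0])
print(K(a, b), F(a) @ F(b))    # the two numbers agree (0.09)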

Ensemble Learning
[Figure: three linear separators whose majority vote defines a triangular positive region.]
- Generate multiple hypotheses and use the majority vote.
- Reduces error to the extent the hypotheses are independent.
- Expands the hypothesis space, e.g. triangles versus lines.

Boosted Learning
- Use a learning algorithm for samples weighted by importance u_j.
- Neural network with u_j weights: e_w = sum_j u_j (y_j - h_w(x_j))^2.
- Decision tree: make u_j copies of (x_j, y_j).
- Construct hypothesis h_1 with all sample weights equal to 1.
- Assign h_1 a weight equal to the sum of the weights of its correct answers.
- Increase the weights of the samples that h_1 got wrong and decrease the weights of those it got right.
- Repeat to construct hypotheses h_2, ..., h_k.
- Classify based on the k answers weighted by their hypotheses' weights.
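A sketch of the boosting loop as described above, assuming a hypothetical base learner weighted_learn(examples, u) that accepts per-sample weights and returns a classifier h with h(x) in {0, 1}; the up/down reweighting factors are illustrative placeholders (AdaBoost prescribes specific values).

def boost(examples, k, weighted_learn, up=2.0, down=0.5):
    u = [1.0] * len(examples)                    # start with all sample weights equal to 1
    hypotheses = []
    for _ in range(k):
        h = weighted_learn(examples, u)
        z = sum(uj for uj, (xj, yj) in zip(u, examples) if h(xj) == yj)
        hypotheses.append((h, z))                # hypothesis weight = weight of its correct answers
        u = [uj * (down if h(xj) == yj else up)  # reweight the samples it got right/wrong
             for uj, (xj, yj) in zip(u, examples)]
    return hypotheses

def boosted_classify(hypotheses, x):
    vote = sum(z if h(x) == 1 else -z for h, z in hypotheses)
    return 1 if vote > 0 else 0                  # weighted majority vote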

Boosted Learning of Decision Trees
[Figure: boosted hypotheses h_1, h_2, h_3, h_4 and the combined hypothesis h.]

Restaurant Data
[Figure: (left) proportion correct on the test set versus training set size for boosted decision stumps and a single decision stump; (right) training and test accuracy versus the number of hypotheses K.]

Character Recognition

Learning Algorithm versus Dataset Size
[Figure: proportion correct on the test set versus training set size (millions of words).]