CS 188: Artificial Intelligence: Learning


1 CS 188: Artificial Intelligence Learning Instructor: Fabrice Popineau [These slides adapted from Stuart Russell, Dan Klein and Pieter Abbeel]

2 Lectures on learning Learning: a process for improving the performance of an agent through experience Learning I (today): The general idea: generalization from experience Supervised learning: classification and regression Learning II: neural networks and deep learning Learning III: statistical learning and Bayes nets Reinforcement learning II: learning complex V and Q functions

3 Learning: Why? The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion [William James, 1890] Learning is essential in unknown environments when the agent designer lacks omniscience

4 Learning: Why? Instead of trying to produce a program to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain. Presumably the child brain is something like a notebook as one buys it from the stationer's. Rather little mechanism, and lots of blank sheets. [Alan Turing, 1950] Learning is useful as a system construction method, i.e., expose the system to reality rather than trying to write it down

5 Learning: How?

6 Learning: How?

7 Key questions when building a learning agent What is the agent design that will implement the desired performance? Which piece of the agent system should we improve, and how is that piece represented? What data are available relevant to that piece? (In particular, do we know the right answers?) What knowledge is already available?

8 Examples
Agent design | Component | Representation | Feedback | Knowledge
Alpha-beta search | Evaluation function | Linear polynomial | Win/loss | Rules of game; coefficient signs
Logical planning agent | Transition model (observable envt) | Successor-state axioms | Action outcomes | Available actions; argument types
Utility-based patient monitor | Physiology/sensor model | Dynamic Bayesian network | Observation sequences | Gen physiology; sensor design
Satellite image pixel classifier | Classifier (policy) | Markov random field | Partial labels | Coastline; continuity scales
Supervised learning: correct answers for each training instance
Reinforcement learning: reward sequence, no correct answers
Unsupervised learning: just make sense of the data

9 Supervised learning To learn an unknown target function f Input: a training set of labeled examples (x_j, y_j) where y_j = f(x_j) E.g., x_j is an image, f(x_j) is the label giraffe E.g., x_j is a seismic signal, f(x_j) is the label explosion Output: hypothesis h that is close to f, i.e., predicts well on unseen examples (the test set) Many possible hypothesis families for h: linear models, logistic regression, neural networks, decision trees, examples (nearest-neighbor), grammars, kernelized separators, etc. Classification = learning f with discrete output value Regression = learning f with real-valued output value

10 Classification example: Object recognition (figure): training pairs x -> f(x) labeled giraffe, giraffe, giraffe, llama, llama, llama; for a new image X, what is f(X)?

11-15 Regression example: Curve fitting (figures: successive candidate hypotheses fit to the same data points)

16 Basic questions Which hypothesis space H to choose? How to measure degree of fit? How to trade off degree of fit vs. complexity? (Ockham's razor) How do we find a good h? How do we know whether a good h will predict well?

17 Classification

18 Example: Spam Filter Input: an email Output: spam/ham Setup: Get a large collection of example emails, each labeled spam or ham (by hand) Learn to predict labels of new incoming emails Classifiers reject 200 billion spam emails per day Features: The attributes used to make the ham / spam decision Words: FREE! Text Patterns: $dd, CAPS Non-text: SenderInContacts, AnchorLinkMismatch Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION ADDRESSES FOR ONLY $99 Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.

19 Example: Digit Recognition Input: images / pixel grids Output: a digit 0-9 Setup: MNIST data set of 60K hand-labeled images (note: someone has to hand label all this data!) Want to learn to predict labels of new, future digit images Features: The attributes used to make the digit decision Pixels: (6,8)=ON Shape Patterns: NumComponents, AspectRatio, NumLoops

20 Other Classification Tasks o Classification: given inputs x, predict labels (classes) y o Examples: o Spam detection (input: document, classes: spam / ham) o OCR (input: images, classes: characters) o Medical diagnosis (input: symptoms, classes: diseases) o Automatic essay grading (input: document, classes: grades) o Fraud detection (input: account activity, classes: fraud / no fraud) o Customer service routing o many more o Classification is an important commercial technology!

21 Decision tree models Tree construction Measuring learning performance Decision tree learning

22 Decision trees Popular representation for classifiers Even among humans! I've just arrived at a restaurant: should I stay (and wait for a table) or go elsewhere?

23 Decision trees It's Friday night and you're hungry You arrive at your favorite cheap but really cool happening burger place It's full up and you have no reservation but there is a bar The host estimates a 45 minute wait There are alternatives nearby but it's raining outside A decision tree partitions the input space and assigns a label to each partition

24 Expressiveness Discrete decision trees can express any function of the input E.g., for Boolean functions, build a path from root to leaf for each row of the truth table: Trivially there is a consistent decision tree that fits any training set exactly (unless true function is nondeterministic) But a tree that simply records the examples is essentially a lookup table To get generalization to new examples, need a compact tree

25 Quiz How many distinct decision trees with n Boolean attributes? = number of distinct Boolean functions with n inputs = number of truth tables with n inputs = number of ways of filling in an output column with 2^n entries = 2^(2^n) For n=6 attributes, there are 18,446,744,073,709,551,616 trees
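A quick sanity check of that count (a throwaway Python snippet, not from the slides):

```python
# Number of distinct Boolean functions of n inputs = 2^(2^n)
n = 6
print(2 ** (2 ** n))  # 18446744073709551616
```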

26 Hypothesis spaces in general Increasing the expressiveness of the hypothesis language Increases the chance that the true function can be expressed Increases the number of hypotheses that are consistent with the training set => many consistent hypotheses have large test error => may reduce prediction accuracy! With 2^(2^n) hypotheses, all but an exponentially small fraction will require O(2^n) bits to express in any representation (even brains!) I.e., any given representation can represent only an exponentially small fraction of hypotheses concisely; no universal compression E.g., decision trees are bad at k-out-of-n functions

27 Training data

28 Decision tree learning
function Decision-Tree-Learning(examples, attributes, parent_examples) returns a tree
  if examples is empty then return Plurality-Value(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return Plurality-Value(examples)
  else
    A ← argmax_{a ∈ attributes} Importance(a, examples)
    tree ← a new decision tree with root test A
    for each value v_k of A do
      exs ← the subset of examples with value v_k for attribute A
      subtree ← Decision-Tree-Learning(exs, attributes - A, examples)
      add a branch to tree with label (A = v_k) and subtree subtree
    return tree
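The pseudocode translates almost line for line into Python. The sketch below assumes a hypothetical representation in which each example is a dict with a "label" key plus attribute values, and `importance(a, examples)` is a scoring function such as information gain:

```python
from collections import Counter

def plurality_value(examples):
    """Most common label among the examples (ties broken arbitrarily)."""
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, parent_examples, importance):
    if not examples:
        return plurality_value(parent_examples)
    if len({e["label"] for e in examples}) == 1:
        return examples[0]["label"]
    if not attributes:
        return plurality_value(examples)
    A = max(attributes, key=lambda a: importance(a, examples))
    tree = {"test": A, "branches": {}}
    # For simplicity we branch only on values of A that appear in the examples.
    for v in {e[A] for e in examples}:
        exs = [e for e in examples if e[A] == v]
        tree["branches"][v] = decision_tree_learning(
            exs, [a for a in attributes if a != A], examples, importance)
    return tree
```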

29 Choosing an attribute: Information gain Idea: measure contribution of attribute to increasing purity of labels in each subset of examples Patrons is a better choice: gives information about classification (i.e., reduces entropy of the distribution of labels)

30 Information answers questions The more clueless I am about the answer initially, the more information is contained in the answer Scale: 1 bit = answer to a Boolean question with prior <0.5, 0.5> Information in an answer when the prior is <p_1, ..., p_n> is H(<p_1, ..., p_n>) = - Σ_i p_i log_2 p_i This is the entropy of the prior Convenient notation: B(p) = H(<p, 1-p>)

31 Information gain from splitting on an attribute Suppose we have p positive and n negative examples at the root => B(p/(p+n)) bits needed to classify a new example E.g., for the 12 restaurant examples, p = n = 6 so we need 1 bit An attribute splits the examples E into subsets E_k, each of which (we hope) needs less information to complete the classification For an example in E_k we expect to need B(p_k/(p_k+n_k)) more bits Probability a new example goes into E_k is (p_k+n_k)/(p+n) Expected number of bits needed after the split is Σ_k (p_k+n_k)/(p+n) B(p_k/(p_k+n_k)) Information gain = B(p/(p+n)) - Σ_k (p_k+n_k)/(p+n) B(p_k/(p_k+n_k))
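A small Python sketch of B and the gain formula (function names are ours, not the course's):

```python
import math

def B(q):
    """Entropy (in bits) of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def information_gain(p, n, subsets):
    """Gain from a split; subsets is a list of (p_k, n_k) counts."""
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in subsets)
    return B(p / (p + n)) - remainder

# Restaurant data: splitting 6+/6- on Patrons gives subsets (0,2), (4,0), (2,4)
print(round(information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))  # 0.541
```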

32 Example Gain(Patrons) = 1 - [(2/12)B(0) + (4/12)B(1) + (6/12)B(2/6)] ≈ 0.541 bits Gain(Type) = 1 - [(2/12)B(1/2) + (2/12)B(1/2) + (4/12)B(2/4) + (4/12)B(2/4)] = 0 bits

33 Results for restaurant data Decision tree learned from the 12 examples: simpler than the true tree!

34 Training and Testing

35 Basic Concepts Data: labeled instances, e.g. restaurant episodes Training set Validation set Test set Experimentation cycle Generate hypothesis h from training set (Possibly choose best h by trying out on validation set) Finally, compute accuracy of h on test set Very important: never peek at the test set! Evaluation Accuracy: fraction of instances predicted correctly Learning curve: test set accuracy as a function of training set size Training Data Validation Data Test Data

36 Results for restaurant data

37 Linear Regression

38 Linear regression = fitting a straight line/hyperplane Prediction: h_w(x) = w_0 + w_1 x (plot: Berkeley house prices, 2009)

39 Prediction error Error (or residual) on one instance: y - h_w(x), the difference between the observation y and the prediction h_w(x)

40 Least squares: Minimizing squared error L2 loss function: sum of squared errors over all examples Loss = Σ_j (y_j - h_w(x_j))^2 = Σ_j (y_j - (w_0 + w_1 x_j))^2 We want the weights w* that minimize loss At w* the derivatives of loss w.r.t. each weight are zero: ∂Loss/∂w_0 = -2 Σ_j (y_j - (w_0 + w_1 x_j)) = 0 ∂Loss/∂w_1 = -2 Σ_j (y_j - (w_0 + w_1 x_j)) x_j = 0 Exact solutions for N examples: w_1 = [N Σ_j x_j y_j - (Σ_j x_j)(Σ_j y_j)] / [N Σ_j x_j^2 - (Σ_j x_j)^2] and w_0 = (1/N)[Σ_j y_j - w_1 Σ_j x_j] For the general case where x is an n-dimensional vector: X is the data matrix (all the data, one example per row), y is the column of labels, and w* = (X^T X)^{-1} X^T y
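Both solutions are easy to check numerically; the NumPy sketch below uses made-up data (the numbers are ours, purely for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
N = len(x)

# Closed-form solution for the 1-D case
w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N

# General case: data matrix X with a column of ones for w_0, then w* = (X^T X)^{-1} X^T y
X = np.column_stack([np.ones(N), x])
w_star = np.linalg.solve(X.T @ X, X.T @ y)  # solve() avoids forming the inverse explicitly

print(w0, w1)   # intercept/slope from the 1-D formulas
print(w_star)   # the same values from the matrix solution
```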

41 Summary Learning is essential in unknown environments, useful in many others The nature of the learning process depends on agent design what part of the agent you want to improve what prior knowledge and new experiences are available Supervised learning: learning a function from labeled examples Concise hypotheses generalize better; trade off conciseness and accuracy Classification: discrete-valued function; example: decision trees Regression: real-valued function; example: linear regression Next: neural networks and deep learning

42 CS 188: Artificial Intelligence Naïve Bayes Instructor: Fabrice Popineau [These slides adapted from Stuart Russell, Dan Klein, Pieter Abbeel and Anca Dragan]

43 Classification

44 Example: Spam Filter o Input: an email o Output: spam/ham o Setup: o Get a large collection of example emails, each labeled spam or ham o Note: someone has to hand label all this data! o Want to learn to predict labels of new, future emails o Features: The attributes used to make the ham / spam decision o Words: FREE! o Text Patterns: $dd, CAPS o Non-text: SenderInContacts o Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION ADDRESSES FOR ONLY $99 Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.

45 Example: Digit Recognition o Input: images / pixel grids o Output: a digit 0-9 o Setup: o Get a large collection of example images, each labeled with a digit o Note: someone has to hand label all this data! o Want to learn to predict labels of new, future digit images o Features: The attributes used to make the digit decision o Pixels: (6,8)=ON o Shape Patterns: NumComponents, AspectRatio, NumLoops o 1??

46 Model-Based Classification

47 Model-Based Classification o Model-based approach o Build a model (e.g. Bayes net) where both the label and features are random variables o Instantiate any observed features o Query for the distribution of the label conditioned on the features o Challenges o What structure should the BN have? o How should we learn its parameters?

48 Naïve Bayes for Digits o Naïve Bayes: Assume all features are independent effects of the label o Simple digit recognition version: o One feature (variable) F_ij for each grid position <i,j> o Feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image o Each input maps to a feature vector F_1, F_2, ..., F_n, with the label Y as their common parent o Here: lots of features, each is binary valued o Naïve Bayes model: P(Y, F_1, ..., F_n) = P(Y) Π_i P(F_i | Y) o What do we need to learn?

49 General Naïve Bayes o A general Naive Bayes model: P(Y, F_1, ..., F_n) = P(Y) Π_i P(F_i | Y), with |Y| parameters for the prior and n x |F| x |Y| parameters for the feature distributions o For each feature we only have to specify how it depends on the class o Total number of parameters is linear in n o Model is very simplistic, but often works anyway

50 Inference for Naïve Bayes o Goal: compute posterior distribution over label variable Y o Step 1: get the joint probability of label and evidence for each label, P(y, f_1, ..., f_n) = P(y) Π_i P(f_i | y) o Step 2: sum over labels to get the probability of the evidence o Step 3: normalize by dividing Step 1 by Step 2
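A minimal sketch of the three steps, done in log space for numerical stability (the dictionary-based parameter format is an assumption for illustration):

```python
import math

def naive_bayes_posterior(prior, cond, observed):
    """prior: {label: P(label)}; cond: {(label, feature_value): P(f | label)};
    observed: the list of observed feature values."""
    # Step 1: joint probability of label and evidence, in log space
    log_joint = {y: math.log(prior[y]) + sum(math.log(cond[(y, f)]) for f in observed)
                 for y in prior}
    # Step 2: probability of the evidence via log-sum-exp
    m = max(log_joint.values())
    log_evidence = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    # Step 3: normalize
    return {y: math.exp(v - log_evidence) for y, v in log_joint.items()}
```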

51 Example: Conditional Probabilities

52 Naïve Bayes for Text o Bag-of-words Naïve Bayes: o Features: W_i is the word at position i o As before: predict label conditioned on feature variables (spam vs. ham) o As before: assume features are conditionally independent given label o New: each W_i is identically distributed o Generative model: P(Y, W_1, ..., W_n) = P(Y) Π_i P(W_i | Y), where W_i is the word at position i, not the i-th word in the dictionary! o Tied distributions and bag-of-words o Usually, each variable gets its own conditional probability distribution P(F | Y) o In a bag-of-words model o Each position is identically distributed o All positions share the same conditional probs P(W | Y) o Why make this assumption? o Called bag-of-words because the model is insensitive to word order or reordering

53 Example: Spam Filtering o Model: P(Y, W_1, ..., W_n) = P(Y) Π_i P(W_i | Y); for numerical stability, work with log probabilities instead of actual probabilities! o What are the parameters? Priors P(Y): ham: 0.66, spam: 0.33; and word tables P(W | ham) and P(W | spam) over common words (the, to, and, of, you, a, with, from)

54 Spam Example (worked table): for each word w of the message "Gary would you like to lose weight while you sleep", the columns give P(w | spam), P(w | ham), and running log totals for spam and ham starting from the prior; the result is P(spam | words) = 98.9%

55 Example: Spam Filtering o Model: P(Y, W_1, ..., W_n) = P(Y) Π_i P(W_i | Y) o What are the parameters? Priors P(Y): ham: 0.66, spam: 0.33; and word tables P(W | ham) and P(W | spam) over common words (the, to, and, of, you, a, with, from) o Where do these tables come from?

56 General Naïve Bayes o What do we need in order to use Naïve Bayes? o Inference method (we just saw this part) o Start with a bunch of probabilities: P(Y) and the P(F_i | Y) tables o Use standard inference to compute P(Y | F_1 ... F_n) o Nothing new here o Estimates of local conditional probability tables o P(Y), the prior over labels o P(F_i | Y) for each feature (evidence variable) o These probabilities are collectively called the parameters of the model and denoted by θ o Up until now, we assumed these appeared by magic, but o they typically come from training data counts

57 Parameter Estimation

58 Parameter Estimation o Estimating the distribution of a random variable o Elicitation: ask a human (why is this hard?) o Empirically: use training data (learning!) o E.g.: for each outcome x, look at the empirical rate of that value in the sample; for the draws r, r, b this gives P_ML(r) = 2/3 o This is the estimate that maximizes the likelihood of the data

59 Smoothing

60 Maximum Likelihood? o Relative frequencies are the maximum likelihood estimates o Another option is to consider the most likely parameter value given the data (the MAP estimate)

61 Unseen Events

62 Generalization and Overfitting

63 Overfitting Degree 15 polynomial

64 Example: Overfitting 2 wins!!

65 Example: Overfitting o Posteriors determined by relative probabilities (odds ratios): south-west : inf nation : inf morally : inf nicely : inf extent : inf seriously : inf... screens : inf minute : inf guaranteed : inf $ : inf delivery : inf signature : inf... What went wrong here?

66 Generalization and Overfitting o Relative frequency parameters will overfit the training data! o Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time o Unlikely that every occurrence of minute is 100% spam o Unlikely that every occurrence of seriously is 100% ham o What about all the words that don't occur in the training set at all? o In general, we can't go around giving unseen events zero probability o As an extreme case, imagine using the entire email as the only feature o Would get the training data perfect (if deterministic labeling) o Wouldn't generalize at all o Just making the bag-of-words assumption gives us some generalization, but isn't enough o To generalize better: we need to smooth or regularize the estimates

67 Laplace Smoothing o Laplace's estimate: o Pretend you saw every outcome once more than you actually did: P_LAP(x) = (count(x) + 1) / (N + |X|) o E.g. for the sample r, r, b: P_LAP(r) = 3/5, P_LAP(b) = 2/5 o Can derive this estimate with Dirichlet priors (see cs281a)

68 Laplace Smoothing o Laplace's estimate (extended): o Pretend you saw every outcome k extra times: P_LAP,k(x) = (count(x) + k) / (N + k|X|) o What's Laplace with k = 0? o k is the strength of the prior o Laplace for conditionals: o Smooth each condition independently: P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)
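A small sketch of the smoothed estimate (k = 0 recovers the maximum-likelihood relative frequencies):

```python
from collections import Counter

def laplace_estimate(samples, domain, k=1):
    """Pretend every outcome in `domain` was seen k extra times."""
    counts = Counter(samples)
    total = len(samples) + k * len(domain)
    return {x: (counts[x] + k) / total for x in domain}

print(laplace_estimate(["r", "r", "b"], ["r", "b"], k=1))  # {'r': 0.6, 'b': 0.4}
print(laplace_estimate(["r", "r", "b"], ["r", "b"], k=0))  # ML estimate: 2/3, 1/3
```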

69 Estimation: Linear Interpolation* o In practice, Laplace often performs poorly for P(X | Y): o When X is very large o When Y is very large o Another option: linear interpolation o Also get the empirical P(X) from the data o Make sure the estimate of P(X | Y) isn't too different from the empirical P(X): P_LIN(x | y) = α P_ML(x | y) + (1 - α) P(x) o What if α is 0? 1? o For even better ways to estimate parameters, as well as details of the math, see cs281a, cs288

70 Real NB: Smoothing o For real classification problems, smoothing is critical o New odds ratios: helvetica : 11.4, seems : 10.8, group : 10.2, ago : 8.4, areas : ... ; verdana : 28.8, Credit : 28.4, ORDER : 27.2, <FONT> : 26.9, money : ... o Do these make more sense?

71 Training and Testing

72 Important Concepts o Data: labeled instances, e.g. emails marked spam/ham o Training set o Held out set o Test set o Features: attribute-value pairs which characterize each x o Experimentation cycle o Learn parameters (e.g. model probabilities) on training set o Tune hyperparameters on held-out set o Compute accuracy on test set o Very important: never peek at the test set! o Evaluation o Accuracy: fraction of instances predicted correctly o Overfitting and generalization o Want a classifier which does well on test data o Overfitting: fitting the training data very closely, but not generalizing well Training Data Held-Out Data Test Data

73 Tuning

74 Tuning on Held-Out Data o Now we've got two kinds of unknowns o Parameters: the probabilities P(X | Y), P(Y) o Hyperparameters: e.g. the amount / type of smoothing to do, k, α o What should we learn where? o Learn parameters from training data o Tune hyperparameters on different data o Why? o For each value of the hyperparameters, train on the training data and evaluate on the held-out data o Choose the best value and do a final test on the test data

75 Features

76 Errors, and What to Do o Examples of errors Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the To receive your $30 Amazon.com promotional certificate, click through to and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future emails announcing new store launches, please click...

77 What to Do About Errors? o Need more features; words aren't enough! o Have you emailed the sender before? o Have 1K other people just gotten the same email? o Is the sending information consistent? o Is the email in ALL CAPS? o Do inline URLs point where they say they point? o Does the email address you by (your) name? o Can add these information sources as new variables in the NB model o Next class we'll talk about classifiers which let you add arbitrary features more easily

78 Baselines o First step: get a baseline o Baselines are very simple straw man procedures o Help determine how hard the task is o Help know what a good accuracy is o Weak baseline: most frequent label classifier o Gives all test instances whatever label was most common in the training set o E.g. for spam filtering, might label everything as ham o Accuracy might be very high if the problem is skewed o E.g. if calling everything ham gets 66%, then a classifier that gets 70% isn't very good o For real research, usually use previous work as a (strong) baseline

79 Confidences from a Classifier o The confidence of a probabilistic classifier: o Posterior over the top label o Represents how sure the classifier is of the classification o Any probabilistic model will have confidences o No guarantee confidence is correct o Calibration o Weak calibration: higher confidences mean higher accuracy o Strong calibration: confidence predicts accuracy rate o What's the value of calibration?

80 Summary o Bayes rule lets us do diagnostic queries with causal probabilities o The naïve Bayes assumption takes all features to be independent given the class label o We can build classifiers out of a naïve Bayes model using training data o Smoothing estimates is important in real systems o Classifier confidences are useful, when you can get them

81 CS 188: Artificial Intelligence Perceptrons Instructor: Fabrice Popineau [These slides adapted from Stuart Russell, Dan Klein, Pieter Abbeel and Anca Dragan]

82 Regression vs Classification Linear regression: h_w(x) = w_0 + w_1 x, a poor fit when the output is binary, y ∈ {0, 1} Linear classification: used with discrete output values; threshold a linear function: h_w(x) = 1 if w_0 + w_1 x ≥ 0, h_w(x) = 0 if w_0 + w_1 x < 0 Equivalently, pass the linear function through an activation function g: h_w(x) = g(w_0 + w_1 x)

83 Linear Classifiers

84 Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free : 2 YOUR_NAME : 0 MISSPELLED : 2 FROM_FRIEND : 0... SPAM or + PIXEL-7,12 : 1 PIXEL-7,13 : 0... NUM_LOOPS :

85 Linear Classifiers o Inputs are feature values o Each feature has a weight o Sum is the activation: activation_w(x) = Σ_i w_i f_i(x) = w · f(x) o If the activation is: o Positive, output +1 o Negative, output 0 (diagram: features f_1, f_2, f_3 with weights w_1, w_2, w_3 feed a sum Σ and a >0 test; we can add a bias weight w_0)

86 Geometric Explanation # free : 4 YOUR_NAME :-1 MISSPELLED : 1 FROM_FRIEND :-3... # free : 2 YOUR_NAME : 0 MISSPELLED : 2 FROM_FRIEND : 0... Dot product positive means the positive class # free : 0 YOUR_NAME : 1 MISSPELLED : 1 FROM_FRIEND : 1...

87 Geometric Explanation o In the space of feature vectors o Examples are points o Any weight vector is a hyperplane o One side corresponds to Y = +1 o Other corresponds to Y = 0 (plot: w = (BIAS: -3, free: 4, money: 2); the side where w · f > 0 is +1 = SPAM, the other side is 0 = HAM, with axes free and money)

88 Weight Updates

89 Perceptron learning rule If the true y ≠ h_w(x) (an error), adjust the weights If w·x < 0 but the output should be y = 1 This is called a false negative Should increase weights on positive inputs Should decrease weights on negative inputs If w·x > 0 but the output should be y = 0 This is called a false positive Should decrease weights on positive inputs Should increase weights on negative inputs The perceptron learning rule does this: w ← w + α (y - h_w(x)) x, where (y - h_w(x)) is +1, -1, or 0 (no error) The perceptron has a default learning rate α = 1 A learning rate α ∈ (0, 1] smaller than 1 is useful in the general case for convergence; see the convergence theorem later

90 Learning: Binary Perceptron o Start with weights = 0 o For each training instance: o Classify with current weights o If correct (i.e., y=y*), no change! o If wrong: adjust the weight vector

91 Learning: Binary Perceptron o Start with weights = 0 o Assume α = 1 o For each training instance: o Classify with current weights: y = +1 if w · f(x) ≥ 0, y = 0 if w · f(x) < 0 o If correct (i.e., y = y*), no change! o If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is 0.

92 Example y = 1 (SPAM) vs y = 0 (HAM); features free and money plus a bias: w_0 = -3, w_free = 4, w_money = 2; x_0 = 1, x_free = 1, x_money = 1 w·x = -3·1 + 4·1 + 2·1 = 3, so the perceptron predicts SPAM, but the true label is HAM (y = 0): Dear Stuart, I wanted to let you know that I have decided to leave Macrosoft and return to academia. The money is is great here but I prefer to be free to pursue more interesting research and I really love teaching undergraduates! Do I need to finish my BA first before applying? Best wishes Bill Update: w ← w + α (y - h_w(x)) x with α = 0.5 (learning rate): w ← (-3, 4, 2) + 0.5·(0 - 1)·(1, 1, 1) = (-3.5, 3.5, 1.5)
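The update in this example is easy to reproduce; the helper below is our own sketch of the rule w ← w + α (y - h_w(x)) x for outputs in {0, 1}:

```python
def perceptron_update(w, x, y, alpha=1.0):
    activation = sum(wi * xi for wi, xi in zip(w, x))
    prediction = 1 if activation >= 0 else 0       # h_w(x)
    return [wi + alpha * (y - prediction) * xi for wi, xi in zip(w, x)]

# The spam example above: w = (-3, 4, 2), x = (1, 1, 1), true label y = 0, alpha = 0.5
print(perceptron_update([-3, 4, 2], [1, 1, 1], y=0, alpha=0.5))  # [-3.5, 3.5, 1.5]
```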

93 Multiclass Decision Rule o If we have multiple classes: o A weight vector w_y for each class y o Score (activation) of a class y: w_y · f(x) o Prediction: the highest score wins, y = argmax_y w_y · f(x) Binary = multiclass where the negative class has weight zero

94 Learning: Multiclass Perceptron o Start with all weights = 0 o Assume α = 1 o Pick up training examples one by one o Predict with current weights: y = argmax_y w_y · f(x) o If correct, no change! o If wrong: lower the score of the wrong answer (w_y ← w_y - f(x)), raise the score of the right answer (w_y* ← w_y* + f(x))
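A sketch of one multiclass step, with weights stored as a dict from label to weight vector (our own representation, not the course's):

```python
def multiclass_perceptron_step(weights, f, y_true):
    score = lambda y: sum(wi * fi for wi, fi in zip(weights[y], f))
    y_pred = max(weights, key=score)               # highest score wins
    if y_pred != y_true:                           # on a mistake:
        weights[y_true] = [wi + fi for wi, fi in zip(weights[y_true], f)]  # raise right answer
        weights[y_pred] = [wi - fi for wi, fi in zip(weights[y_pred], f)]  # lower wrong answer
    return weights
```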

95 Example: Multiclass Perceptron win the vote win the election win the game BIAS : 1 win : 0 game : 0 vote : 0 the : 0... BIAS : 0 win : 0 game : 0 vote : 0 the : 0... BIAS : 0 win : 0 game : 0 vote : 0 the : 0...

96 Examples: Perceptron o Separable Case

97 Examples: Perceptron o Non-Separable Case

98 Properties of Perceptrons o Separability: true if some parameters get the training set perfectly correct (figures: separable vs. non-separable data) o Convergence: if the training set is separable, the perceptron will eventually converge (binary case) o Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability, at most k/δ², where k is the number of features and δ is the margin (Novikoff, 1962)

99 Perceptron convergence theorem A learning problem is linearly separable iff there is some hyperplane exactly separating positive from negative examples (figures: separable vs. non-separable data) Convergence: if the training data are separable, perceptron learning applied repeatedly to the training set will eventually converge to a perfect separator This is true with α = 1

100 Example: Earthquakes vs nuclear explosions 63 examples, 657 updates required

101 Perceptron convergence theorem A learning problem is linearly separable iff there is some hyperplane exactly separating positive from negative examples Convergence: if the training data are separable, perceptron learning applied repeatedly to the training set will eventually converge to a perfect separator Convergence: if the training data are non-separable, perceptron learning will converge to a minimum-error solution provided the learning rate α is decayed appropriately (e.g., α = 1/t)

102 Perceptron learning with fixed α (plot): 71 examples, 100,000 updates, fixed α = 0.2, no convergence

103 Perceptron learning with decaying α (plot): 71 examples, 100,000 updates, decaying α = 1000/(1000 + t), near-convergence

104 Improving the Perceptron

105 Problems with the Perceptron o Noise: if the data isn't separable, weights might thrash o Averaging weight vectors over time can help (averaged perceptron) o Mediocre generalization: finds a barely separating solution o Overtraining: test / held-out accuracy usually rises, then falls o Overtraining is a kind of overfitting

106 Fixing the Perceptron o Idea: adjust the weight update to mitigate these effects o MIRA*: choose an update size that fixes the current mistake o but, minimizes the change to w o The +1 helps to generalize * Margin Infused Relaxed Algorithm

107 Minimum Correcting Update Choose the smallest update that corrects the current mistake; τ is not 0 (or we would not have made an error), so the minimum is attained where the correcting constraint holds with equality

108 Maximum Step Size o In practice, it's also bad to make updates that are too large o Example may be labeled incorrectly o You may not have enough features o Solution: cap the maximum possible value of τ with some constant C o Corresponds to an optimization that assumes non-separable data o Usually converges faster than perceptron o Usually better, especially on noisy data

109 Linear Separators o Which of these linear separators is optimal?

110 Support Vector Machines o Maximizing the margin: good according to intuition, theory, practice o Only support vectors matter; other training examples are ignorable o Support vector machines (SVMs) find the separator with max margin o Basically, SVMs are MIRA where you optimize over all examples at once MIRA SVM

111 Classification: Comparison o Naïve Bayes o Builds a model of the training data o Gives prediction probabilities o Strong assumptions about feature independence o One pass through data (counting) o Perceptrons / MIRA: o Makes fewer assumptions about data o Mistake-driven learning o Multiple passes through data (prediction) o Often more accurate

112 Non-Linearity

113 Non-Linear Separators o Data that is linearly separable works out great for linear decision rules (plot: 1-D data separable by a threshold) o But what are we going to do if the dataset is just too hard? (plot: interleaved 1-D data) o How about mapping data to a higher-dimensional space, e.g. x -> (x, x^2)? This and next slide adapted from Ray Mooney, UT

114 Non-Linear Separators General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x -> φ(x)

115 Neural Networks

116 Very Loose Inspiration: Human Neurons

117 Simple Model of a Neuron (McCulloch & Pitts, 1943) Inputs a_i come from the output of node i to this node j (or from outside) Each input link has a weight w_{i,j} There is an additional fixed input a_0 with bias weight w_{0,j} The total input is in_j = Σ_i w_{i,j} a_i The output is a_j = g(in_j) = g(Σ_i w_{i,j} a_i) = g(w · a)

118 Single Neuron A single-neuron system is a perceptron (if g is a threshold) or logistic regression (if g is a sigmoid) Inputs x_1, x_2 with weights w_1, w_2 feed g, producing the computed value a_1, which is compared against the true label y: h_w(x) = a_1 = g(Σ_i w_i x_i)

119 Minimize Single Neuron Loss Data point (x_1, x_2), label y Define loss (e.g. quadratic loss): Loss(w) = (y - h_w(x))^2 = (y - g(dot(w, x)))^2, with g(z) = 1/(1 + e^{-z}) and dot(x, w) = Σ_i w_i x_i Compute the partial derivative ∂Loss(w)/∂w_i = (∂Loss/∂g)(∂g/∂dot)(∂dot/∂w_i) (chain rule) Step each weight along the (negative) derivative of the loss function Gradient descent: w_i ← w_i - α ∂Loss(w)/∂w_i Quadratic loss: Loss_quadratic(x, y) = (y - x)^2
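One gradient-descent step for this single sigmoid neuron, written out with the chain rule (a sketch; the learning rate and data are placeholders):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_gradient_step(w, x, y, alpha=0.1):
    z = sum(wi * xi for wi, xi in zip(w, x))       # dot(w, x)
    a = sigmoid(z)                                 # h_w(x)
    # dLoss/dw_i = -2 (y - a) * g'(z) * x_i, with g'(z) = g(z)(1 - g(z))
    grad = [-2.0 * (y - a) * a * (1.0 - a) * xi for xi in x]
    return [wi - alpha * gi for wi, gi in zip(w, grad)]
```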

120 Choice of Activation Function It would be really helpful to have a g(z) that was nicely differentiable Hard threshold: g(z) = 1 if z ≥ 0, 0 if z < 0; dg/dz = 0 wherever it is defined (no useful gradient) Sigmoid: g(z) = 1/(1 + e^{-z}); dg/dz = g(z)(1 - g(z)) ReLU: g(z) = max(0, z); dg/dz = 1 if z ≥ 0, 0 if z < 0

121 Multiclass Classification How do we handle labels that have more than two values? For example, classifying handwritten digits 0-9 One vs. Rest: Train N binary classifiers For each class i, train a binary classifier where the new labels are 1 if label = i, and 0 otherwise If the classification score from binary classifier i is higher than the other N - 1 classifiers, then label the point as label i

122 Multiclass Classification How do we handle labels that have more than two values? For example, classifying handwritten digits 0-9 Softmax: a generalization of the sigmoid function to multiple classes N inputs and N outputs Gives a probability that the input data produced each class The N softmax outputs sum to 1: g_softmax(x, i) = e^{x_i} / Σ_j e^{x_j} (diagram: a one-layer network whose N outputs feed the softmax function)
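The softmax itself is a one-liner; subtracting the max first is a standard trick for numerical stability (our addition, not on the slide):

```python
import math

def softmax(x):
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]          # e^{x_i}, shifted for stability
    total = sum(exps)
    return [e / total for e in exps]               # outputs are positive and sum to 1

print(softmax([2.0, 1.0, 0.1]))
```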

123 Multilayer Perceptrons A multilayer perceptron is a feedforward neural network with at least one hidden layer (nodes that are neither inputs nor outputs) MLPs with enough hidden nodes can represent any function

124 Neural Network Equations (diagram: three inputs, two hidden layers, one output) h_w(x) = a_{4,1}, with a_{1,i} = x_i and a_{l,j} = g(Σ_i w_{l-1,i,j} a_{l-1,i}) for each later layer l, e.g. a_{3,1} = g(Σ_i w_{2,i,1} a_{2,i}) and a_{4,1} = g(Σ_i w_{3,i,1} a_{3,i}) Unrolled: h_w(x) = g(Σ_k w_{3,k,1} g(Σ_j w_{2,j,k} g(Σ_i w_{1,i,j} x_i)))

125 Minimize Neural Network Loss Data point (x_1, x_2), label y Define loss (e.g. quadratic loss): Loss(w) = (y - h_w(x))^2 = (y - g(Σ_i w_{L,i,1} a_{L,i}))^2 Compute the partial derivative ∂Loss(w)/∂w_{l,m,n} by the chain rule, working backwards from the output Step each weight along the (negative) derivative of the loss function Gradient descent: w_{l,m,n} ← w_{l,m,n} - α ∂Loss(w)/∂w_{l,m,n}

126 Error Backpropagation Output units: Δ_5 = g'(in_5)(y_5 - a_5), Δ_6 = g'(in_6)(y_6 - a_6) Hidden units: Δ_3 = g'(in_3)[w_{3,5} Δ_5 + w_{3,6} Δ_6], Δ_4 = g'(in_4)[w_{4,5} Δ_5 + w_{4,6} Δ_6]
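For a network with one hidden layer of sigmoids, the delta rule above fits in a few lines of NumPy. This is a sketch under assumed shapes (W1 is hidden-by-inputs, W2 is outputs-by-hidden), not the course's reference implementation:

```python
import numpy as np

def backprop_step(x, y, W1, W2, alpha=0.1):
    g = lambda z: 1.0 / (1.0 + np.exp(-z))         # sigmoid, so g'(in) = a(1 - a)
    # Forward pass
    in_h = W1 @ x;  a_h = g(in_h)
    in_o = W2 @ a_h; a_o = g(in_o)
    # Backward pass: deltas
    delta_o = a_o * (1 - a_o) * (y - a_o)          # Delta_out = g'(in)(y - a)
    delta_h = a_h * (1 - a_h) * (W2.T @ delta_o)   # Delta_j = g'(in_j) * sum_k w_{j,k} Delta_k
    # Weight updates: w_{i,j} <- w_{i,j} + alpha * a_i * Delta_j
    W2 = W2 + alpha * np.outer(delta_o, a_h)
    W1 = W1 + alpha * np.outer(delta_h, x)
    return W1, W2
```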

127 Represent as Computational Graph (figure): inputs x and weights W feed a multiply node producing the scores s; the scores feed a hinge loss, which is added to a regularization term R(W) to give the total loss L

128-139 Worked example (figures): backpropagation on a small computational graph with inputs x = -2, y = 5, z = -4; we want the gradients of the output with respect to x, y, and z, computed node by node with the chain rule

141 activations local gradient f 155

142 activations local gradient f gradients 156

143 activations local gradient f gradients 157

144 activations local gradient f gradients 158

145 activations local gradient f gradients 159

146-161 Another example (figures): backpropagating through a sigmoid of a linear function; each gate multiplies [its local gradient] x [the gradient from above], e.g. the negation gate gives (-1) * (-0.20) = 0.20, the add gate passes [1] x [0.2] = 0.2 to both inputs, the multiply gate gives x0: [2] x [0.2] = 0.4 and w0: [-1] x [0.2] = -0.2, and the whole sigmoid gate can be handled at once with local gradient (0.73) * (1 - 0.73) ≈ 0.20, since dσ/dz = σ(z)(1 - σ(z))

162 Gradients add at branches: when a node's output feeds several downstream nodes, the gradients flowing back along each branch are summed

163 Mini-batches and Stochastic Gradient Descent o Typical objective: the average log-likelihood of the label given the input, estimated on a mini-batch of k examples - Mini-batch gradient descent: compute the gradient on a mini-batch, and cycle over mini-batches (examples 1..k, then k+1..2k, ...); make sure to randomize the permutation of the data! - Stochastic gradient descent: k = 1
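A sketch of the mini-batch loop (the `gradient(w, batch)` callback and step size are placeholders we made up; k = 1 gives plain stochastic gradient descent):

```python
import random

def minibatch_sgd(data, w, gradient, alpha=0.01, k=32, epochs=10):
    for _ in range(epochs):
        random.shuffle(data)                       # randomize the permutation of the data
        for start in range(0, len(data), k):
            batch = data[start:start + k]
            g = gradient(w, batch)                 # average gradient over the mini-batch
            w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w
```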

164 CS 188: Artificial Intelligence Deep Learning Instructor: Fabrice Popineau [These slides adapted from Stuart Russell, Dan Klein, Pieter Abbeel and Anca Dragan]

165 Computer Vision

166 Object Detection

167 Manual Feature Design

168 Features and Generalization [Dalal and Triggs, 2005]

169 Features and Generalization Image HoG

170 Manual Feature Design vs. Deep Learning o Manual feature design requires: o Domain-specific expertise o Domain-specific effort o What if we could learn the features, too? o -> Deep Learning

171-175 Performance (figures, graph credit Matt Zeiler, Clarifai): classification performance results, with AlexNet highlighted

176 Speech Recognition graph credit Matt Zeiler, Clarifai

177 Speech Synthesis: WaveNet WaveNet (2016) Approach for the Text to Speech problem (TTS). Concatenative TTS: Short recorded speech fragments concatenated. Parametric TTS: High level model based speech generation (from scratch). WaveNet: Dumb neural network based model for generating human speech at the level of individual audio samples. (Note: This is not a classifier! Not a direct application of today's ideas, but instead an example of the generality of deep neural networks)

178 Deep Learning: Convolutional Neural Networks LeNet5 (LeCun et al., 1998) Convnets for digit recognition LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE (1998).

179 Convolutional Neural Networks Alexnet Alex Krizhevsky, Geoffrey Hinton, et al, 2012 Convnets for image classification More data & more compute power Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems

180 Deep Learning: GoogLeNet Szegedy, Christian, et al. "Going deeper with convolutions." CVPR (2015).

181 Neural Nets Incredible success in the last three years Data (ImageNet) Compute power Optimization Activation functions (ReLU) Regularization Reducing overfitting (dropout) Software packages Caffe (UC Berkeley) Theano (Université de Montréal) Torch (Facebook, Yann LeCun) TensorFlow (Google)

182 Practical Issues


More information

Vote. Vote on timing for night section: Option 1 (what we have now) Option 2. Lecture, 6:10-7:50 25 minute dinner break Tutorial, 8:15-9

Vote. Vote on timing for night section: Option 1 (what we have now) Option 2. Lecture, 6:10-7:50 25 minute dinner break Tutorial, 8:15-9 Vote Vote on timing for night section: Option 1 (what we have now) Lecture, 6:10-7:50 25 minute dinner break Tutorial, 8:15-9 Option 2 Lecture, 6:10-7 10 minute break Lecture, 7:10-8 10 minute break Tutorial,

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2011 Lecture 12: Probability 3/2/2011 Pieter Abbeel UC Berkeley Many slides adapted from Dan Klein. 1 Announcements P3 due on Monday (3/7) at 4:59pm W3 going out

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

Perceptron. Subhransu Maji. CMPSCI 689: Machine Learning. 3 February February 2015

Perceptron. Subhransu Maji. CMPSCI 689: Machine Learning. 3 February February 2015 Perceptron Subhransu Maji CMPSCI 689: Machine Learning 3 February 2015 5 February 2015 So far in the class Decision trees Inductive bias: use a combination of small number of features Nearest neighbor

More information

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes. CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers

More information

Midterm, Fall 2003

Midterm, Fall 2003 5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

More information

Sections 18.6 and 18.7 Artificial Neural Networks

Sections 18.6 and 18.7 Artificial Neural Networks Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs. artifical neural

More information

Midterm: CS 6375 Spring 2015 Solutions

Midterm: CS 6375 Spring 2015 Solutions Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

Introduction to Machine Learning (67577)

Introduction to Machine Learning (67577) Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks

More information

Sections 18.6 and 18.7 Artificial Neural Networks

Sections 18.6 and 18.7 Artificial Neural Networks Sections 18.6 and 18.7 Artificial Neural Networks CS4811 - Artificial Intelligence Nilufer Onder Department of Computer Science Michigan Technological University Outline The brain vs artifical neural networks

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Classification Algorithms

Classification Algorithms Classification Algorithms UCSB 290N, 2015. T. Yang Slides based on R. Mooney UT Austin 1 Table of Content roblem Definition Rocchio K-nearest neighbor case based Bayesian algorithm Decision trees 2 Given:

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

CS540 ANSWER SHEET

CS540 ANSWER SHEET CS540 ANSWER SHEET Name Email 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 1 2 Final Examination CS540-1: Introduction to Artificial Intelligence Fall 2016 20 questions, 5 points

More information

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others) Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Fall 2018 Introduction to Artificial Intelligence Practice Final You have approximately 2 hours 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Introduction To Artificial Neural Networks

Introduction To Artificial Neural Networks Introduction To Artificial Neural Networks Machine Learning Supervised circle square circle square Unsupervised group these into two categories Supervised Machine Learning Supervised Machine Learning Supervised

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Lecture 35: Optimization and Neural Nets

Lecture 35: Optimization and Neural Nets Lecture 35: Optimization and Neural Nets CS 4670/5670 Sean Bell DeepDream [Google, Inceptionism: Going Deeper into Neural Networks, blog 2015] Aside: CNN vs ConvNet Note: There are many papers that use

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17 3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Deep Feedforward Networks. Sargur N. Srihari

Deep Feedforward Networks. Sargur N. Srihari Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

CS:4420 Artificial Intelligence

CS:4420 Artificial Intelligence CS:4420 Artificial Intelligence Spring 2018 Neural Networks Cesare Tinelli The University of Iowa Copyright 2004 18, Cesare Tinelli and Stuart Russell a a These notes were originally developed by Stuart

More information

Text Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others)

Text Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others) Text Categorization CSE 454 (Based on slides by Dan Weld, Tom Mitchell, and others) 1 Given: Categorization A description of an instance, x X, where X is the instance language or instance space. A fixed

More information