CS:4420 Artificial Intelligence


CS:4420 Artificial Intelligence, Spring 2017
Learning from Examples
Cesare Tinelli, The University of Iowa

Copyright 2004-17, Cesare Tinelli and Stuart Russell. These notes were originally developed by Stuart Russell and are used with permission. They are copyrighted material and may not be used in other course settings outside of the University of Iowa, in their current or modified form, without the express written consent of the copyright holders.

Readings: Chap. 18 of [Russell and Norvig, 2012]

Learning Agents

A distinct feature of intelligent agents in nature is their ability to learn from experience. Using its experience and its internal knowledge, a learning agent is able to produce new knowledge. That is, given its internal knowledge and a percept sequence, the agent is able to learn facts that
- are consistent with both the percepts and the previous knowledge,
- do not just follow from the percepts and the previous knowledge.

Example: Learning for Logical Agents

Learning in logical agents can be formalized as follows. Let Γ, E be sets of sentences, where
- Γ is the agent's knowledge base, the agent's current knowledge;
- E is a representation of a percept sequence, the evidential data.

A learning agent is an agent able to generate facts ϕ from Γ and E such that
- Γ ∪ E ∪ {ϕ} is satisfiable (consistency of ϕ);
- usually, Γ ∪ E ⊭ ϕ (novelty of ϕ).

Learning Agent: Conceptual Components

[Figure: block diagram of a learning agent and its environment. The agent contains a performance element (connected to sensors and effectors), a learning element, a critic, and a problem generator. The critic compares sensor input against a performance standard and sends feedback to the learning element; the learning element makes changes to the performance element's knowledge and sets learning goals for the problem generator.]

Learning Elements

Machine learning research has produced a large variety of learning elements. Major issues in the design of learning elements:
- Which components of the performance element are to be improved
- What representation is used for those components
- What kind of feedback is available: supervised learning, reinforcement learning, unsupervised learning
- What prior knowledge is available

Learning as Learning of Functions

Any component of a performance element can be described mathematically as a function:
- condition-action rules
- predicates in the knowledge base
- next-state operators
- goal-state recognizers
- search heuristic functions
- belief networks
- utility functions
- ...

All learning can be seen as learning the representation of a function.

Inductive Learning

A lot of learning is of an inductive nature: given some experimental data, the agent learns the general principles governing those data and is able to make correct predictions on future data, based on these general principles.

Examples:
- After a baby is told that certain objects in the house are chairs, the baby is able to learn the concept of chair and then recognize previously unseen chairs as such.
- Your grandfather watches a soccer match for the first time and, from the action and the commentators' report, is able to figure out the rules of the game.

Purely Inductive Learning

Given a collection {(x_1, f(x_1)), ..., (x_n, f(x_n))} of input/output pairs, or examples, for a function f, produce a hypothesis, i.e., (a compact representation of) a function h that approximates f.

[Figure: panels (a)-(d) show different curves fit to the same data points.]

In general, there are quite a lot of different hypotheses consistent with the examples.
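
As a small illustration (the data points and polynomial degrees below are made up for this transcript, not taken from the lecture), here is a Python sketch of several hypotheses that are all consistent, or nearly consistent, with the same examples:

import numpy as np

# Input/output examples (x_i, f(x_i)) for an unknown function f; made-up data.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.1, 0.9, 4.2, 8.8, 16.3])   # roughly x^2 with a little noise

# Three candidate hypotheses h: polynomials of different degrees fit to the examples.
for degree in (1, 2, 4):
    h = np.poly1d(np.polyfit(xs, ys, degree))
    mse = float(np.mean((h(xs) - ys) ** 2))
    print(f"degree {degree}: mean squared error on the examples = {mse:.4f}")

# The degree-4 polynomial matches the five examples exactly, but that alone does
# not make it the best hypothesis for predicting f on unseen inputs.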

Bias in Learning

Any kind of preference for a hypothesis h over another is called a bias.

Bias is inescapable: just the choice of formalism to describe h already introduces a bias.

Bias is necessary: learning is nearly impossible without bias. (Which of the many hypotheses do you choose?)

Learning Decision Trees

The simplest form of learning from examples occurs in learning decision trees.

A decision tree is a Boolean operator that takes as input a set of predicates describing an object or a situation, and outputs a discrete value. It is represented by a tree in which
- every non-leaf node corresponds to a test on the value of one of the predicates,
- every leaf node specifies the value to be returned if that leaf is reached.

Decision trees returning a binary value (e.g., a Boolean) act as classifiers.

A Decision Tree

This tree can be used to decide whether to wait for a table at a restaurant.

[Figure: a decision tree whose root tests Patrons? (None → F, Some → T, Full → WaitEstimate?). The WaitEstimate? node branches on >60, 30-60, 10-30, and 0-10, with further tests on Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining? before reaching the T/F leaves.]
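
One way to make the tree concrete is to encode it as nested data and walk it. The Python sketch below is illustrative only: it encodes just the upper portion of the restaurant tree (the Patrons? and WaitEstimate? tests), since the full figure is not reproduced in this transcript.

# Internal nodes are (attribute, {value: subtree}) pairs; leaves are booleans.
TREE = ("Patrons?", {
    "None": False,
    "Some": True,
    "Full": ("WaitEstimate?", {
        ">60": False,
        "0-10": True,
        # the 10-30 and 30-60 branches continue with further tests in the figure
    }),
})

def classify(tree, example):
    """Follow attribute tests down the tree until a True/False leaf is reached."""
    while not isinstance(tree, bool):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

print(classify(TREE, {"Patrons?": "Some"}))                          # True: wait
print(classify(TREE, {"Patrons?": "Full", "WaitEstimate?": ">60"}))  # False: leave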

A Decision Tree as Predicates

A decision tree with Boolean output defines a logical predicate.

[Figure: a tree testing Patrons? (None → F, Some → T, Full → Hungry?), Hungry? (No → F, Yes → Type?), Type? (French → T, Italian → F, Thai → Fri/Sat?, Burger → T), and Fri/Sat? (No → F, Yes → T).]

WillWait  ⟺  (Patrons = Some)
          ∨ (Patrons = Full ∧ Hungry ∧ Type = French)
          ∨ (Patrons = Full ∧ Hungry ∧ Type = Burger)
          ∨ (Patrons = Full ∧ Hungry ∧ Type = Thai ∧ Fri/Sat)

Building Decision Trees

How can we build a decision tree for a specific predicate? We can look at a number of examples that satisfy, or do not satisfy, the predicate and try to extrapolate the tree from them.

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait (Goal)
X1      | Yes | No  | No  | Yes | Some | $$$   | No   | Yes | French  | 0-10  | Yes
X2      | Yes | No  | No  | Yes | Full | $     | No   | No  | Thai    | 30-60 | No
X3      | No  | Yes | No  | No  | Some | $     | No   | No  | Burger  | 0-10  | Yes
X4      | Yes | No  | Yes | Yes | Full | $     | No   | No  | Thai    | 10-30 | Yes
X5      | Yes | No  | Yes | No  | Full | $$$   | No   | Yes | French  | >60   | No
X6      | No  | Yes | No  | Yes | Some | $$    | Yes  | Yes | Italian | 0-10  | Yes
X7      | No  | Yes | No  | No  | None | $     | Yes  | No  | Burger  | 0-10  | No
X8      | No  | No  | No  | Yes | Some | $$    | Yes  | Yes | Thai    | 0-10  | Yes
X9      | No  | Yes | Yes | No  | Full | $     | Yes  | No  | Burger  | >60   | No
X10     | Yes | Yes | Yes | Yes | Full | $$$   | No   | Yes | Italian | 10-30 | No
X11     | No  | No  | No  | No  | None | $     | No   | No  | Thai    | 0-10  | No
X12     | Yes | Yes | Yes | Yes | Full | $     | No   | No  | Burger  | 30-60 | Yes

Some Terminology

The goal predicate is the predicate to be implemented by a decision tree. The training set is the set of examples used to build the tree. A member of the training set is a positive example if it satisfies the goal predicate; it is a negative example if it does not.

A Boolean decision tree implements a classifier: given a potential instance of a goal predicate, it is able to say, by looking at some attributes of the instance, whether the instance is a positive example of the predicate or not.

Good Decision Trees

It is trivial to construct a decision tree that agrees with a given training set. (How?) However, the trivial tree will simply memorize the given examples.

We want a tree that extrapolates a common pattern from the examples. We want the tree to correctly classify all possible examples, not just those in the training set.
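
As an aside, the parenthetical "How?" has a one-line answer: turn the training set into a lookup table, one path per example. A minimal sketch (illustrative code written for this transcript):

# The trivial "decision tree": memorize every training example as its own path,
# i.e., a lookup table keyed by the full tuple of attribute values. It agrees
# perfectly with the training set but does not generalize at all.
def memorizing_classifier(training_set):
    table = {tuple(sorted(attrs.items())): label for attrs, label in training_set}
    return lambda attrs: table.get(tuple(sorted(attrs.items())))   # None if unseen

examples = [({"Patrons?": "Some", "Hungry?": "Yes"}, True),
            ({"Patrons?": "None", "Hungry?": "No"}, False)]
h = memorizing_classifier(examples)
print(h({"Patrons?": "Some", "Hungry?": "Yes"}))   # True  (memorized)
print(h({"Patrons?": "Full", "Hungry?": "Yes"}))   # None  (no generalization)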

Looking for Decision Trees

In general, there are several decision trees that describe the same goal predicate. Which one should we prefer?

Ockham's razor: always prefer the simplest description, that is, the smallest tree.

Problem: searching through the space of possible trees and finding the smallest one is possible but takes exponential time.

Solution: apply some simple heuristics that lead to small (if not smallest) trees.

Main Idea: start building the tree by testing at its root an attribute that best splits the training set into homogeneous classes.

Choosing an attribute

A good attribute splits the examples into subsets that are ideally all positive or all negative.

[Figure: the candidate splits of the examples by Patrons? (None, Some, Full) and by Type? (French, Italian, Thai, Burger).]

Patrons? is a better choice: it gives more information about the classification.

Choosing an attribute

Preferring more informative attributes leads to smaller trees.

[Figure: two splits of the 12 training examples (positives 1, 3, 4, 6, 8, 12; negatives 2, 5, 7, 9, 10, 11): splitting by Type? yields four mixed branches, while splitting by Patrons? yields two homogeneous branches (None, Some) and one mixed branch (Full) that can be further split by Hungry?.]

Building the Tree: General Procedure

1. Choose for the root node test the attribute that best partitions the given training set E into homogeneous sets.
2. If the chosen attribute has n possible values, it will partition E into n sets E_1, ..., E_n. Add a branch i to the root node for each set E_i.
3. For each branch i:
   (a) If E_i is empty, choose the most common yes/no classification among E's examples and add a corresponding leaf to the branch.
   (b) If E_i contains only positive examples, add a yes leaf to the branch.
   (c) If E_i contains only negative examples, add a no leaf to the branch.
   (d) Otherwise, add a non-leaf node to the branch and apply the procedure recursively to that node, with the remaining attributes and with E_i as the training set.
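
The procedure translates almost line by line into Python. The sketch below is mine, not the lecture's: the helper names (build_tree, choose_attribute, majority) are assumptions, and the attribute chooser is left as a parameter so the entropy-based heuristic from the following slides can be plugged in.

from collections import Counter

def majority(examples):
    """Most common yes/no classification among the examples (step 3a)."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, attributes, values, choose_attribute):
    """examples: list of (attrs_dict, label) pairs; values: attribute -> possible values."""
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # steps 3b/3c: all positive or all negative
        return labels.pop()
    if not attributes:                        # no tests left: fall back to the majority
        return majority(examples)
    a = choose_attribute(examples, attributes)                 # step 1
    branches = {}
    for v in values[a]:                                        # step 2: one branch per value
        subset = [(attrs, lab) for attrs, lab in examples if attrs[a] == v]
        if not subset:                                         # step 3a: empty branch
            branches[v] = majority(examples)
        else:                                                  # step 3d: recurse
            branches[v] = build_tree(subset, [b for b in attributes if b != a],
                                     values, choose_attribute)
    return (a, branches)

# Tiny demo with a placeholder chooser (always pick the first attribute);
# the remainder-based chooser described below would replace it.
examples = [({"Patrons?": "Some", "Hungry?": "Yes"}, "yes"),
            ({"Patrons?": "None", "Hungry?": "No"},  "no"),
            ({"Patrons?": "Full", "Hungry?": "Yes"}, "no")]
values = {"Patrons?": ["None", "Some", "Full"], "Hungry?": ["Yes", "No"]}
print(build_tree(examples, ["Patrons?", "Hungry?"], values, lambda ex, attrs: attrs[0]))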

Choosing the Best Attribute

What exactly do we mean by "best partitions the training set into homogeneous classes"? What if each attribute splits the training set into non-homogeneous classes? Which one is better?

Information Theory can be used to devise a measure of goodness for attributes.

Information Theory

- Studies the mathematical laws governing systems designed to communicate or manipulate information
- Defines quantitative measures of information and of the capacity of various systems to transmit, store, and process information
- In particular, it measures the information content, or entropy, of messages/events

Information is measured in bits. One bit represents the information we need to answer a yes/no question when we have no idea about the answer.

Information Content

If an event has n possible outcomes v_i, each with prior probability P(v_i), the information content H of the event's actual outcome is

H(P(v_1), ..., P(v_n)) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

i.e., the average information content of each outcome, -\log_2 P(v_i), weighted by the outcome's probability.

Information Content/Entropy Examples

H(P(v_1), ..., P(v_n)) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

1) Entropy of a fair coin toss:
H(P(h), P(t)) = H(1/2, 1/2) = -(1/2)\log_2(1/2) - (1/2)\log_2(1/2) = 1/2 + 1/2 = 1 bit

2) Entropy of a loaded coin toss where P(head) = 0.99:
H(P(h), P(t)) = H(99/100, 1/100) = -0.99\log_2 0.99 - 0.01\log_2 0.01 ≈ 0.08 bits

3) Entropy of a coin toss for a coin with heads on both sides:
H(P(h), P(t)) = H(1, 0) = -1\log_2 1 - 0\log_2 0 = 0 - 0 = 0 bits
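
These three values can be checked with a few lines of Python (using the convention that 0 · log2 0 = 0):

import math

def H(*probs):
    """Entropy, in bits, of an outcome distribution; 0*log2(0) is taken as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(H(1/2, 1/2))      # fair coin: 1.0 bit
print(H(0.99, 0.01))    # loaded coin: ~0.08 bits
print(H(1.0, 0.0))      # two-headed coin: 0.0 bits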

Entropy of a Decision Tree

For decision trees, the event in question is whether the tree will return yes or no for a given input example e.

Assume the training set E is a representative sample of the domain; that is, the relative frequency of positive examples in E closely approximates the prior probability of a positive example.

If E contains p positive examples and n negative examples, the probability distribution of answers by a correct decision tree is:

P(yes) = p/(p+n)        P(no) = n/(p+n)

Entropy of the correct decision tree:

H(p/(p+n), n/(p+n)) = -(p/(p+n))\log_2(p/(p+n)) - (n/(p+n))\log_2(n/(p+n))

Information Content of an Attribute

Checking the value of a single attribute A in the tree provides only some of the information provided by the whole tree. But we can measure how much information is still needed after A has been checked.

Information Content of an Attribute

Let E_1, ..., E_m be the sets into which A partitions the current training set E. For i = 1, ..., m, let

p   = # of positive examples in E
n   = # of negative examples in E
p_i = # of positive examples in E_i
n_i = # of negative examples in E_i

Then, after we have checked A, we will on average need

Remainder(A) = \sum_{i=1}^{m} (p_i + n_i)/(p + n) * H(p_i/(p_i + n_i), n_i/(p_i + n_i))

extra bits of information to classify the input example.
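
In code, Remainder(A) is just a weighted average of branch entropies. A sketch follows; the data layout (a list of (value-of-A, positive?) pairs) is my own convention, not the lecture's:

import math
from collections import defaultdict

def entropy(p, n):
    """H(p/(p+n), n/(p+n)) in bits, with the 0*log2(0) = 0 convention."""
    return sum(-x / (p + n) * math.log2(x / (p + n)) for x in (p, n) if x > 0)

def remainder(split):
    """split: one (value_of_A, is_positive) pair per example in the training set E."""
    counts = defaultdict(lambda: [0, 0])           # value of A -> [p_i, n_i]
    for value, positive in split:
        counts[value][0 if positive else 1] += 1
    total = len(split)
    return sum((p + n) / total * entropy(p, n) for p, n in counts.values())

# An attribute splitting 4 positives and 4 negatives into one pure branch and
# one mixed branch needs about 0.41 extra bits on average after being checked.
print(remainder([("a", True), ("a", True), ("a", True), ("a", True),
                 ("b", False), ("b", False), ("b", False), ("b", True)]))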

Choosing an Attribute

Conclusion: the smaller the value of Remainder(A), the higher the information content of attribute A for the purpose of classifying the input example.

Heuristic: when building a non-leaf node of a decision tree, choose the attribute with the smallest remainder.

Building Decision Trees: An Example

Problem: from the information below about several production runs in a given factory, construct a decision tree to determine the factors that influence production output.

Run | Supervisor | Operator | Machine | Overtime | Output
1   | Patrick    | Joe      | a       | no       | high
2   | Patrick    | Samantha | b       | yes      | low
3   | Thomas     | Jim      | b       | yes      | low
4   | Patrick    | Jim      | b       | no       | high
5   | Sally      | Joe      | c       | no       | high
6   | Thomas     | Samantha | c       | no       | low
7   | Thomas     | Joe      | c       | no       | low
8   | Patrick    | Jim      | a       | yes      | low

Building Decision Trees: An Example

First identify the attribute with the lowest information remainder, using the whole table as the training set (the positive examples are those with high output). Since for each attribute A

Remainder(A) = \sum_{i=1}^{m} (p_i + n_i)/(p + n) * H(p_i/(p_i + n_i), n_i/(p_i + n_i))
             = \sum_{i=1}^{m} (p_i + n_i)/(p + n) * ( -(p_i/(p_i + n_i))\log_2(p_i/(p_i + n_i)) - (n_i/(p_i + n_i))\log_2(n_i/(p_i + n_i)) )

we need to compute all the relative frequencies involved.

Example (1)

Here is how each attribute splits the training set, together with the entropy of each branch:

Attribute  | Branch   | Runs                   | Entropy
Supervisor | Patrick  | 1(+), 4(+), 2, 8       | 1
Supervisor | Thomas   | 3, 6, 7                | 0
Supervisor | Sally    | 5(+)                   | 0
Operator   | Joe      | 1(+), 5(+), 7          | 0.92
Operator   | Jim      | 4(+), 3, 8             | 0.92
Operator   | Samantha | 2, 6                   | 0
Machine    | a        | 1(+), 8                | 1
Machine    | b        | 4(+), 2, 3             | 0.92
Machine    | c        | 5(+), 6, 7             | 0.92
Overtime   | no       | 1(+), 4(+), 5(+), 6, 7 | 0.97
Overtime   | yes      | 2, 3, 8                | 0

Remainder(Supervisor) = 4/8 · 1 + 1/8 · 0 + 3/8 · 0 = 0.50
Remainder(Operator)   = 3/8 · 0.92 + 3/8 · 0.92 + 2/8 · 0 = 0.69
Remainder(Machine)    = 2/8 · 1 + 3/8 · 0.92 + 3/8 · 0.92 = 0.94
Remainder(Overtime)   = 5/8 · 0.97 + 3/8 · 0 = 0.61

Choose Supervisor since it has the lowest remainder.
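
These four values can be recomputed mechanically. The script below was written for this transcript (the RUNS table just restates the production-run data from the earlier slide):

import math
from collections import defaultdict

# (Supervisor, Operator, Machine, Overtime, Output); positive examples have high output.
RUNS = [("Patrick", "Joe",      "a", "no",  "high"),
        ("Patrick", "Samantha", "b", "yes", "low"),
        ("Thomas",  "Jim",      "b", "yes", "low"),
        ("Patrick", "Jim",      "b", "no",  "high"),
        ("Sally",   "Joe",      "c", "no",  "high"),
        ("Thomas",  "Samantha", "c", "no",  "low"),
        ("Thomas",  "Joe",      "c", "no",  "low"),
        ("Patrick", "Jim",      "a", "yes", "low")]
COLUMN = {"Supervisor": 0, "Operator": 1, "Machine": 2, "Overtime": 3}

def entropy(p, n):
    return sum(-x / (p + n) * math.log2(x / (p + n)) for x in (p, n) if x > 0)

def remainder(attribute, runs):
    counts = defaultdict(lambda: [0, 0])                  # branch value -> [p_i, n_i]
    for run in runs:
        counts[run[COLUMN[attribute]]][0 if run[4] == "high" else 1] += 1
    return sum((p + n) / len(runs) * entropy(p, n) for p, n in counts.values())

for attribute in COLUMN:
    print(f"Remainder({attribute}) = {remainder(attribute, RUNS):.2f}")
# Supervisor = 0.50, Operator = 0.69, Machine = 0.94, Overtime = 0.61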

Example (2)

Thomas' runs are all negative and Sally's are all positive. We need to further classify just Patrick's runs.

Example (2)

Recompute the remainders of the remaining attributes, but this time based solely on Patrick's runs:

Attribute | Branch   | Runs       | Entropy
Operator  | Jim      | 4(+), 8    | 1
Operator  | Joe      | 1(+)       | 0
Operator  | Samantha | 2          | 0
Machine   | a        | 1(+), 8    | 1
Machine   | b        | 4(+), 2    | 1
Overtime  | no       | 1(+), 4(+) | 0
Overtime  | yes      | 2, 8       | 0

Remainder(Operator) = 2/4 · 1 + 1/4 · 0 + 1/4 · 0 = 0.5
Remainder(Machine)  = 2/4 · 1 + 2/4 · 1 = 1
Remainder(Overtime) = 2/4 · 0 + 2/4 · 0 = 0

Choose Overtime to further classify Patrick's runs.

Example (3)

The final decision tree:

[Figure: root node Supervisor? with three branches: Patrick leads to an Overtime? node (no → yes, yes → no), Sally leads to a yes leaf, and Thomas leads to a no leaf.]
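
As a quick sanity check (a few lines written for this transcript, not part of the slides), the final tree classifies all eight production runs in the table correctly:

def predict(supervisor, overtime):
    """The final tree: test Supervisor?, and for Patrick's runs also test Overtime?."""
    if supervisor == "Sally":
        return "high"
    if supervisor == "Thomas":
        return "low"
    return "high" if overtime == "no" else "low"        # Patrick

RUNS = [("Patrick", "no", "high"), ("Patrick", "yes", "low"),
        ("Thomas", "yes", "low"),  ("Patrick", "no",  "high"),
        ("Sally", "no", "high"),   ("Thomas", "no",  "low"),
        ("Thomas", "no", "low"),   ("Patrick", "yes", "low")]

print(all(predict(s, o) == output for s, o, output in RUNS))   # True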

Problems in Building Decision Trees

- Noise. Two training examples may have identical values for all the attributes but be classified differently.
- Overfitting. Irrelevant attributes may make spurious distinctions among training examples.
- Missing data. The value of some attributes of some training examples may be missing.
- Multi-valued attributes. The information gain of an attribute with many different values tends to be non-zero even when the attribute is irrelevant.
- Continuous-valued attributes. They must be discretized to be used. Of all the possible discretizations, some are better than others for classification purposes.

Performance measurement

How do we know that the learned hypothesis h approximates the intended function f?
- Use theorems of computational/statistical learning theory
- Try h on a new test set of examples, using the same distribution over the example space as the training set

Learning curve = % correct on the test set as a function of training set size.

[Figure: learning curve for the restaurant domain; % correct on the test set (y-axis, roughly 0.4 to 1.0) vs. training set size (x-axis, 0 to 100). 100 randomly generated restaurant examples; graph averaged over 20 trials; for i = 1, ..., 99, each trial selects i examples randomly.]
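
The bookkeeping behind a learning curve is simple once some learner is available. The sketch below is illustrative only: it uses made-up data and a placeholder learner that just predicts the majority class of its training set, where a real experiment would build a decision tree.

import random
from collections import Counter

def learn(training_set):
    """Placeholder learner: always predict the majority label of the training set."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda x: majority

def pct_correct(h, test_set):
    return sum(h(x) == y for x, y in test_set) / len(test_set)

random.seed(0)
examples = [(i, i % 3 != 0) for i in range(100)]      # made-up labeled examples
for size in (1, 10, 50, 99):                          # one point of the learning curve each
    random.shuffle(examples)
    train, test = examples[:size], examples[size:]
    print(f"training size {size}: {pct_correct(learn(train), test):.2f} correct on the test set")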

Choosing the best hypothesis

Consider a set S = {(x, y) | y = f(x)} of N input/output examples for a target function f.

Stationarity assumption: all examples E ∈ S have the same prior probability distribution P(E), and each of them is independent of the previously observed ones.

Error rate of a hypothesis h:

|{(x, y) ∈ S : h(x) ≠ y}| / N

Holdout cross-validation: partition S randomly into a training set and a test set.

k-fold cross-validation: partition S into k subsets S_1, ..., S_k of the same size. For each i = 1, ..., k, use S_i as the test set and S \ S_i as the training set. Use the average error rate.
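
A sketch of the error rate and of k-fold cross-validation (the majority-class learner and the data are stand-ins made up for this transcript):

from collections import Counter

def error_rate(h, examples):
    """Fraction of examples (x, y) with h(x) != y."""
    return sum(h(x) != y for x, y in examples) / len(examples)

def k_fold_cv(learn, examples, k):
    """Average error rate: fold i serves as the test set, the remaining folds train."""
    folds = [examples[i::k] for i in range(k)]        # k subsets of (nearly) equal size
    errors = []
    for i in range(k):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        errors.append(error_rate(learn(train), folds[i]))
    return sum(errors) / k

def learn(training_set):                              # stand-in for a real learner
    majority = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda x: majority

data = [(i, i % 4 == 0) for i in range(40)]           # made-up labeled examples
print(k_fold_cv(learn, data, k=5))                    # average error rate over 5 folds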