k-Nearest Neighbor


Motivating Example: Memory-Based Learning / Instance-Based Learning / K-Nearest Neighbor

Inductive Assumption
- Similar inputs map to similar outputs
- If not true => learning is impossible
- If true => learning reduces to defining "similar"
- Not all similarities are created equal: predicting a person's height may depend on different attributes than predicting their IQ

1-Nearest Neighbor
- Dist(c1, c2) = sqrt( sum_i (attr_i(c1) - attr_i(c2))^2 )
- NearestNeighbor = the training case j that minimizes Dist(c_j, c_test)
- prediction_test = class_j (or value_j)
- works well if there is no attribute noise, class noise, or class overlap
- can learn complex functions (sharp class boundaries)
- as the number of training cases grows large, the error rate of 1-NN is at most 2 times the Bayes optimal rate (i.e. the rate if you knew the true probability of each class for each test case)
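A minimal sketch of 1-nearest-neighbor prediction in Python/NumPy, assuming the training cases are rows of a numeric array X_train with labels y_train (names chosen here purely for illustration):

```python
import numpy as np

def dist(c1, c2):
    # Dist(c1, c2) = sqrt( sum_i (attr_i(c1) - attr_i(c2))^2 )
    return np.sqrt(np.sum((c1 - c2) ** 2))

def predict_1nn(X_train, y_train, c_test):
    # NearestNeighbor = the training case j that minimizes Dist(c_j, c_test)
    distances = np.array([dist(c, c_test) for c in X_train])
    j = np.argmin(distances)
    return y_train[j]          # class_j (or value_j for regression)

# toy usage
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1]])
y_train = np.array([0, 0, 1])
print(predict_1nn(X_train, y_train, np.array([1.1, 1.0])))  # -> 0
```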

k-Nearest Neighbor
- Dist(c1, c2) = sqrt( sum_i (attr_i(c1) - attr_i(c2))^2 )
- k-NearestNeighbors = k-MIN_i ( Dist(c_i, c_test) ), i.e. the k cases with the smallest distances
- prediction_test = (1/k) * sum_{i=1..k} class_i (or value_i)

How to choose k
- Large k:
  - less sensitive to noise (particularly class noise)
  - better probability estimates for discrete classes
  - larger training sets allow larger values of k
- Small k:
  - captures the fine structure of the problem space better
  - may be necessary with small training sets
- The average of k points is more reliable when:
  - there is noise in the attributes
  - there is noise in the class labels
  - classes partially overlap
- A balance must be struck between large and small k
- As the training set approaches infinity, and k grows large, kNN becomes Bayes optimal

[Figures: two-class scatter plots in attribute_1 vs. attribute_2 space illustrating these cases. From Hastie, Tibshirani, Friedman 2001, p. 418.]
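A sketch extending the 1-NN code above to k neighbors, assuming the same toy X_train / y_train arrays; averaging the class labels gives a class-probability estimate (and for regression the same average is the predicted value):

```python
import numpy as np

def predict_knn(X_train, y_train, c_test, k=3):
    # distances from the test case to every training case
    distances = np.sqrt(np.sum((X_train - c_test) ** 2, axis=1))
    # indices of the k nearest neighbors (the k-MIN of the distances)
    nearest = np.argsort(distances)[:k]
    # prediction_test = (1/k) * sum of class_i (or value_i);
    # for discrete classes, round the average (majority vote)
    return y_train[nearest].mean()

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.3]])
y_train = np.array([0, 0, 1, 1])
print(predict_knn(X_train, y_train, np.array([4.9, 5.0]), k=3))  # ~0.67 -> class 1
```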

[Figure: from Hastie, Tibshirani, Friedman 2001, p. 419]

Cross-Validation
- Models usually perform better on training data than on future test cases
- 1-NN is 100% accurate on training data!
- Leave-one-out cross-validation:
  - remove each case one at a time
  - use it as a test case, with the remaining cases as the training set
  - average performance over all test cases
- LOOCV is impractical with most learning methods, but extremely efficient with MBL!

Distance-Weighted kNN
- the tradeoff between small and large k can be difficult
- use a large k, but put more emphasis on nearer neighbors?
- prediction_test = ( sum_i w_i * class_i (or value_i) ) / ( sum_i w_i ), where w_i = 1 / Dist(c_i, c_test)
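A sketch of how LOOCV can pick k for kNN, reusing the toy predict_knn and arrays above: each case is held out in turn and predicted from all the others (illustrative code, not the slides' own implementation):

```python
import numpy as np

def loocv_error(X, y, k):
    # leave each case out one at a time and predict it from the rest
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        pred = predict_knn(X[mask], y[mask], X[i], k=k)
        errors += int(round(pred) != y[i])
    return errors / len(X)

# pick the k with the lowest leave-one-out error (odd k only)
best_k = min(range(1, len(X_train), 2), key=lambda k: loocv_error(X_train, y_train, k))
print("best k:", best_k)
```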

Locally Weighted Averaging
- Let k = the number of training points
- Let the weight fall off rapidly with distance
- prediction_test = ( sum_i w_i * class_i (or value_i) ) / ( sum_i w_i ), where w_i = e^( -Dist(c_i, c_test) / KernelWidth )
- KernelWidth controls the size of the neighborhood that has a large effect on the value (analogous to k)

Locally Weighted Regression
- All the algorithms so far are strict averagers: they interpolate, but can't extrapolate
- Do a weighted regression, centered at the test point, with weights controlled by distance and KernelWidth
- The local regressor can be linear, quadratic, an n-th degree polynomial, a neural net, ...
- Yields a piecewise approximation to a surface that typically is more complex than the local regressor

Euclidean Distance
- D(c1, c2) = sqrt( sum_i (attr_i(c1) - attr_i(c2))^2 )
- gives all attributes equal weight?
  - only if the scale of the attributes and of their differences are similar
  - scale attributes to equal range or equal variance
- assumes spherical classes

Euclidean Distance?
- what if the classes are not spherical?
- what if some attributes are more/less important than other attributes?
- what if some attributes have more/less noise in them than other attributes?

[Figures: two-class scatter plots in attribute_1 vs. attribute_2 space illustrating spherical and non-spherical classes]
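A sketch of locally weighted averaging using the exponential kernel read off the slide (w_i = exp(-Dist/KernelWidth)); as suggested above, attributes are first scaled to equal variance so Euclidean distance treats them comparably (scaling choice is an assumption here):

```python
import numpy as np

def locally_weighted_average(X_train, y_train, c_test, kernel_width=1.0):
    # scale attributes to equal variance so no attribute dominates the distance
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant attributes
    Xs, cs = X_train / sigma, c_test / sigma
    dists = np.sqrt(np.sum((Xs - cs) ** 2, axis=1))
    # weight falls off rapidly with distance; KernelWidth sets the effective neighborhood
    w = np.exp(-dists / kernel_width)
    return np.sum(w * y_train) / np.sum(w)
```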

Weighted Euclidean Distance
- D(c1, c2) = sqrt( sum_i w_i * (attr_i(c1) - attr_i(c2))^2 )
- large weights => the attribute is more important
- small weights => the attribute is less important
- zero weights => the attribute doesn't matter
- Weights allow kNN to be effective with axis-parallel elliptical classes
- Where do the weights come from?

Learning Attribute Weights
- Scale attribute ranges or attribute variances to make them uniform (fast and easy)
- Prior knowledge
- Numerical optimization:
  - gradient descent, simplex methods, genetic algorithms
  - the criterion is cross-validation performance
- Information Gain or Gain Ratio of single attributes

Information Gain
- Information Gain = the reduction in entropy due to splitting on an attribute
- Entropy = the expected number of bits needed to encode the class of a randomly drawn + or - example using the optimal information-theoretic coding
- Entropy = -p_+ log2(p_+) - p_- log2(p_-)

Splitting Rules
- Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v)
- GainRatio(S, A) = [ Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) * Entropy(S_v) ] / [ - sum_{v in Values(A)} (|S_v| / |S|) * log2(|S_v| / |S|) ]
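A sketch of computing per-attribute information gain (over discretized attribute values) and using it as the weight w_i in the weighted Euclidean distance above; the discretization and two-class setup are illustrative assumptions, not part of the slides:

```python
import numpy as np

def entropy(y):
    # Entropy = -sum_c p_c * log2(p_c) over the classes present in y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attr_values, y):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
    gain = entropy(y)
    for v in np.unique(attr_values):
        subset = y[attr_values == v]
        gain -= (len(subset) / len(y)) * entropy(subset)
    return gain

def gain_weights(X_discrete, y):
    # one weight per attribute, used as w_i in the weighted Euclidean distance
    return np.array([information_gain(X_discrete[:, i], y)
                     for i in range(X_discrete.shape[1])])

def weighted_dist(c1, c2, w):
    return np.sqrt(np.sum(w * (c1 - c2) ** 2))
```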

Gain Ratio Correction Factor
[Figure: the Gain Ratio denominator (correction factor) for equal-sized n-way splits, rising from 0 to about 6 as the number of splits grows from 0 to 50]

Gain-Ratio-Weighted Euclidean Distance
- D(c1, c2) = sqrt( sum_i gain_ratio_i * (attr_i(c1) - attr_i(c2))^2 )
- weight with gain_ratio after scaling?

Booleans, Nominals, Ordinals, and Reals
- Consider attribute value differences: (attr_i(c1) - attr_i(c2))
  - Reals: easy! a full continuum of differences
  - Integers: not bad: a discrete set of differences
  - Ordinals: not bad: a discrete set of differences
  - Booleans: awkward: Hamming distances of 0 or 1
  - Nominals? not good! recode as Booleans?

Curse of Dimensionality
- as the number of dimensions increases, the distance between points becomes larger and more uniform
- if the number of relevant attributes is fixed, increasing the number of less relevant attributes may swamp the distance:
  D(c1, c2) = sqrt( sum_{i in relevant} (attr_i(c1) - attr_i(c2))^2 + sum_{j in irrelevant} (attr_j(c1) - attr_j(c2))^2 )
- when there are more irrelevant than relevant dimensions, distance becomes less reliable
- solutions: larger k or KernelWidth, feature selection, feature weights, more complex distance functions
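A small illustrative experiment (uniform random points and NumPy are assumptions here, not from the slides) showing how the ratio of nearest to farthest neighbor distance approaches 1 as dimensions are added, which is what makes plain kNN unreliable in high dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_points=500, dims=2):
    # uniform random points; compare nearest vs. farthest distance from one query point
    X = rng.random((n_points, dims))
    q = rng.random(dims)
    d = np.sqrt(np.sum((X - q) ** 2, axis=1))
    return d.min() / d.max()   # a ratio near 1.0 means distances are nearly uniform

for dims in (2, 10, 100, 1000):
    print(dims, round(distance_contrast(dims=dims), 3))
# the ratio rises toward 1 as dimensions increase: neighbors stop being "near"
```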

Advantages of Memory-Based Methods
- Lazy learning: don't do any work until you know what you want to predict (and from what variables!)
  - never need to learn a global model
  - many simple local models taken together can represent a more complex global model
  - better focused learning
  - handles missing values, time-varying distributions, ...
- Very efficient cross-validation
- An intelligible learning method to many users
- Nearest neighbors support explanation and training
- Can use any distance metric: string-edit distance, ...

Weaknesses of Memory-Based Methods
- Curse of Dimensionality: often works best with 25 or fewer dimensions
- Run-time cost scales with training set size
- Large training sets will not fit in memory
- Many MBL methods are strict averagers
- Sometimes doesn't seem to perform as well as other methods such as neural nets
- Predicted values for regression are not continuous

Combine kNN with ANN (a sketch of this idea follows below)
- Train a neural net on the problem
- Use the outputs of the neural net, or its hidden unit activations, as new feature vectors for each point
- Use kNN on the new feature vectors for prediction
- Does feature selection and feature creation
- Sometimes works better than kNN or ANN alone

Current Research in MBL
- Condensed representations to reduce memory requirements and speed up neighbor finding, to scale to 10^6 - 10^12 cases
- Learning better distance metrics
- Feature selection
- Overfitting, VC-dimension, ...
- MBL in higher dimensions
- MBL in non-numeric domains:
  - Case-Based Reasoning
  - Reasoning by Analogy
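A sketch of the "combine kNN with ANN" idea above, assuming scikit-learn is available: train a small MLP, recompute its hidden-layer activations from coefs_/intercepts_, then run kNN in that learned feature space (the dataset, layer size, and k are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1. train a neural net on the problem
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X_tr, y_tr)

# 2. hidden unit activations = relu(X W1 + b1), used as new feature vectors
def hidden(X):
    return np.maximum(0, X @ net.coefs_[0] + net.intercepts_[0])

# 3. use kNN on the new feature vectors for prediction
knn = KNeighborsClassifier(n_neighbors=5).fit(hidden(X_tr), y_tr)
print("kNN on hidden features:", knn.score(hidden(X_te), y_te))
```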

References
- "Locally Weighted Learning" by Atkeson, Moore, and Schaal
- "Tuning Locally Weighted Learning" by Schaal, Atkeson, and Moore

Closing Thought
- In many supervised learning problems, all the information you ever have about the problem is in the training set.
- Why do most learning methods discard the training data after doing learning?
- Do neural nets, decision trees, and Bayes nets capture all the information in the training set when they are trained?
- In the future, we'll see more methods that combine MBL with these other learning methods:
  - to improve accuracy
  - for better explanation
  - for increased flexibility