Learning From Data Lecture 14 Three Learning Principles


Learning From Data, Lecture 14: Three Learning Principles
Occam's Razor, Sampling Bias, Data Snooping
M. Magdon-Ismail, CSCI 4100/6100

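Recap: Validation and Cross Validation
Validation: split $\mathcal{D}$ ($N$ points) into $\mathcal{D}_{\text{train}}$ ($N-K$ points) and $\mathcal{D}_{\text{val}}$ ($K$ points); learn $g^-$ from $\mathcal{D}_{\text{train}}$ and compute $E_{\text{val}}(g^-)$ on $\mathcal{D}_{\text{val}}$.
Cross validation: from $\mathcal{D}$ form $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_N$; learn $g_1^-, g_2^-, \ldots, g_N^-$; evaluate each on its left-out point $(x_1,y_1), (x_2,y_2), \ldots, (x_N,y_N)$ to get $e_1, e_2, \ldots, e_N$; take the average to get $E_{\text{cv}}$.
Model selection: $\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3, \ldots, \mathcal{H}_M$ produce $g_1, g_2, g_3, \ldots, g_M$.
© AML Creator: Malik Magdon-Ismail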
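The cross-validation recap can be made concrete with a small sketch. It assumes a deliberately trivial learner (predict the mean of the training targets) and squared error; the names `learn_constant` and `loo_cv_error` and the three data points are illustrative assumptions, not part of the lecture.

```python
from statistics import mean

def learn_constant(targets):
    """Toy learner (an assumption for illustration): g(x) = mean of training y."""
    c = mean(targets)
    return lambda x: c

def loo_cv_error(data):
    """Leave-one-out cross validation with squared error.

    For each n, train on the data with (x_n, y_n) removed, compute e_n on
    the left-out point, and return the average of e_1, ..., e_N (E_cv).
    """
    errors = []
    for n, (x_n, y_n) in enumerate(data):
        train = data[:n] + data[n + 1:]
        g_minus = learn_constant([y for _, y in train])
        errors.append((g_minus(x_n) - y_n) ** 2)
    return mean(errors)

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
e_cv = loo_cv_error(data)
print(e_cv)
```

Each left-out point is used exactly once for evaluation and never for training the hypothesis that is evaluated on it, which is what keeps each $e_n$ an honest estimate.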

We Will Discuss...
Occam's Razor: pick a model carefully
Sampling Bias: generate the data carefully
Data Snooping: handle the data carefully

Occam's Razor
Use a razor to trim down an explanation of the data to make it as simple as possible but no simpler.
Attributed to William of Occam (14th century) and often mistakenly to Einstein.

Simpler is Better
The simplest model that fits the data is also the most plausible.
...or, beware of using complex models to fit data.

What is Simpler?
Simple hypothesis $h$, measured by $\Omega(h)$:
  low-order polynomial
  hypothesis with small weights
  easily described hypothesis
  ...
Simple hypothesis set $\mathcal{H}$, measured by $\Omega(\mathcal{H})$:
  $\mathcal{H}$ with small $d_{\text{vc}}$
  small number of hypotheses
  low-entropy set
  ...
The equivalence: a hypothesis set with simple hypotheses must be small.
We had a glimpse of this: soft order constraint (smaller $\mathcal{H}$) $\overset{\lambda}{\longleftrightarrow}$ minimize $E_{\text{aug}}$ (favors simpler $h$).

Why is Simpler Better?
Mathematically: simple curtails the ability to fit noise, the VC dimension is small, and blah and blah...
Simpler is better because you will be more surprised when you fit the data.
If something unlikely happens, it is very significant when it happens.
  Detective Gregory: Is there any other point to which you would wish to draw my attention?
  Sherlock Holmes: To the curious incident of the dog in the night-time.
  Detective Gregory: The dog did nothing in the night-time.
  Sherlock Holmes: That was the curious incident.
  (Silver Blaze, Sir Arthur Conan Doyle)

A Scientific Experiment
Axiom. If an experiment has no chance of falsifying a hypothesis, then the result of that experiment provides no evidence one way or the other for the hypothesis.
[Figure: three panels, one each for Scientist 1, Scientist 2, and Scientist 3, plotting resistivity ρ against temperature T; captioned in order "no evidence", "very convincing", and "some evidence?".]
Who provides the most evidence for the hypothesis "ρ is linear in T"?

Scientist 2 Versus Scientist 3
[Figure: the same three resistivity ρ versus temperature T panels.]
Who provides the most evidence?

Scientist 1 Versus Scientist 3
[Figure: the same three resistivity ρ versus temperature T panels.]
Who provides the most evidence?

Axiom of Non-Falsifiability
Axiom. If an experiment has no chance of falsifying a hypothesis, then the result of that experiment provides no evidence one way or the other for the hypothesis.
[Figure: resistivity ρ versus temperature T panels for Scientist 1, Scientist 2, and Scientist 3; captioned in order "no evidence", "very convincing", and "some evidence?".]
Who provides the most evidence?

Falsification and $m_{\mathcal{H}}(N)$
If $\mathcal{H}$ shatters $x_1, \ldots, x_N$, don't be surprised if you fit the data.
If $\mathcal{H}$ doesn't shatter $x_1, \ldots, x_N$, and the target values are uniformly distributed,
$$P[\text{falsification}] \ge 1 - \frac{m_{\mathcal{H}}(N)}{2^N}.$$
A good fit is surprising with simpler $\mathcal{H}$ (small $m_{\mathcal{H}}(N)$) and hence more significant.
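A small worked example may help here. It assumes the positive-ray model $h(x) = \text{sign}(x - a)$ on $N$ distinct points on a line, whose growth function is $m_{\mathcal{H}}(N) = N + 1$; the choice of model, the points, and the helper names are illustrative assumptions. The sketch enumerates all $2^N$ equally likely labelings and counts those that no positive ray can fit.

```python
from itertools import product

def positive_ray_dichotomies(points):
    """All labelings of `points` that some h(x) = sign(x - a) can produce."""
    xs = sorted(points)
    # Candidate thresholds: below all points, between neighbours, above all.
    thresholds = [xs[0] - 1.0]
    thresholds += [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]
    thresholds.append(xs[-1] + 1.0)
    return {tuple(+1 if x > a else -1 for x in points) for a in thresholds}

def falsification_probability(points):
    """Fraction of the 2^N equally likely target labelings that cannot be fit."""
    n = len(points)
    fittable = positive_ray_dichotomies(points)
    unfit = sum(1 for labels in product((-1, +1), repeat=n)
                if labels not in fittable)
    return unfit / 2 ** n

points = [0.5, 1.5, 2.5, 3.5, 4.5]                  # N = 5 distinct points
p_false = falsification_probability(points)
bound = 1 - (len(points) + 1) / 2 ** len(points)    # 1 - m_H(N) / 2^N
print(p_false, bound)
```

With only $N + 1$ achievable dichotomies out of $2^N$, most random targets would falsify the model, so a perfect fit with such a simple $\mathcal{H}$ carries real evidence.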

Learning Goes Beyond Occam's Razor
We may opt for a simpler fit than possible, namely an imperfect fit of the data using a simple model over a perfect fit using a more complex one. The reason is that the price we pay for a perfect fit in terms of the penalty for model complexity may be too much in comparison to the benefit of the better fit.
(Learning From Data, Abu-Mostafa, Magdon-Ismail, Lin)

A Puzzle: The Football Oracle
Saturday, Oct 13, 2012: Home team will win the Monday Night Football Game.
This happens for 5 weeks in a row.


A Puzzle: The Football Oracle ...on the 6th week
Saturday, Nov 17, 2012: Call 1-900-555-5555 for winner; $50 charge applied.
$E_{\text{in}} = 0$! Meaningless without knowing the complexity of the process leading to that!

What did the Oracle Really Do?
(32 columns of predictions, one of them marked "you")
day 1:  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
day 2:  1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
day 3:  1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0
day 4:  1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0
day 5:  1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
Single hypothesis that worked?

What did the Oracle Really Do?
[Table: the same 32 columns of five-day predictions, one of them marked "you".]
Every possible hypothesis, one of which worked?
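The table can be reproduced with a few lines of code. This is a sketch under the assumption that 1 and 0 encode the two possible game outcomes (for example, home team wins or loses); the function names are hypothetical. It checks that, whatever actually happens over 5 weeks, exactly one of the 32 prediction sequences is perfect.

```python
from itertools import product

WEEKS = 5

def oracle_letters(weeks=WEEKS):
    """One prediction sequence per recipient: all 2^weeks binary sequences."""
    return list(product((1, 0), repeat=weeks))

def recipients_with_perfect_record(outcomes):
    """Indices of recipients whose every prediction matched the outcomes."""
    letters = oracle_letters(len(outcomes))
    return [i for i, preds in enumerate(letters) if preds == tuple(outcomes)]

# Whatever actually happens (assumed encoding: 1 or 0 per game),
# exactly one recipient sees an oracle that was never wrong.
for outcomes in product((1, 0), repeat=WEEKS):
    assert len(recipients_with_perfect_record(outcomes)) == 1

print(len(oracle_letters()))   # number of distinct prediction sequences
```

From the point of view of the one recipient with a perfect record, $E_{\text{in}} = 0$, yet the "hypothesis set" behind it contains every possible sequence, so the perfect fit says nothing about week 6.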

A Puzzle: The Football Oracle ...on the 6th week
Saturday, Nov 17, 2012: Call 1-900-555-5555 for winner; $50 charge applied.
$E_{\text{in}} = 0$! Meaningless without knowing the complexity of the process leading to that!

We Will Discuss...
Occam's Razor: pick a model carefully
Sampling Bias: generate the data carefully
Data Snooping: handle the data carefully

November 3rd, 1948: Dewey Defeats Truman
The Tribune wanted to show off its latest technology, which let it go to press earlier.
Telephone poll on how people voted: statisticians had done their thing and were confident.

Imagine Their Surprise When...
Truman defeats Dewey.

Sampling Bias in Learning
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
...or, make sure the training and test distributions are the same.
You cannot draw a sample from one bin and make claims about another bin.

Extrapolation
[Figure: # copies sold versus Amazon ranking.]

Extrapolation is Hard
[Figure: # copies sold versus Amazon ranking.]

Dealing with the Training-Test Mismatch
Think more carefully about what f should look like:
  Need some additional help outside the data, by choosing a good H.
  In our ranking example, account for the fat tail (hyperbola).
[Figure: # copies sold versus Amazon ranking, hyperbola fit.]
Account for the training-test mismatch during learning:
  There are methods that reweight or resample the data, and these can help.
  If test data have zero representation in training, you are in trouble. Think carefully about f.
[Figure: probability versus Amazon ranking, test versus training distributions.]
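The slide does not name a specific reweighting method. One common choice is importance weighting, sketched below under strong assumptions: the input space is split into two hypothetical regions ("head" and "tail"), and the training and test probabilities of each region are known. All names and numbers are made up for illustration.

```python
def importance_weights(regions, p_train, p_test):
    """w = p_test(region) / p_train(region) for each training example."""
    return [p_test[r] / p_train[r] for r in regions]

def weighted_error(errors, weights):
    """Self-normalized weighted average of the pointwise errors."""
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)

# Assumed distributions: training over-represents the head of the ranking.
p_train = {"head": 0.9, "tail": 0.1}
p_test = {"head": 0.5, "tail": 0.5}

# Ten training examples: nine from the head (small error), one from the tail.
sample_regions = ["head"] * 9 + ["tail"]
sample_errors = [0.1] * 9 + [0.9]

plain = sum(sample_errors) / len(sample_errors)
weights = importance_weights(sample_regions, p_train, p_test)
reweighted = weighted_error(sample_errors, weights)
print(plain, reweighted)
```

The reweighted estimate counts the single tail example as heavily as the nine head examples together, matching the test distribution. If a region had zero training probability, no example from it would ever be sampled and no weight could recover it, which is the "you are in trouble" case.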

Puzzle: Credit Analysis
Determine credit given salary, debt, years in residence, ...
Banks have lots of data:
  customer information: salary, debt, etc.
  whether or not they defaulted on their credit.

  age            32 years
  gender         male
  salary         40,000
  debt           26,000
  years in job   1 year
  years at home  3 years
  ...            ...

Approve for credit?
Where is the sampling bias?

Puzzle: Credit Analysis
Determine credit given salary, debt, years in residence, ...
Banks have lots of data:
  customer information: salary, debt, etc.
  whether or not they (who?) defaulted on their credit.

  age            32 years
  gender         male
  salary         40,000
  debt           26,000
  years in job   1 year
  years at home  3 years
  ...            ...

Approve for credit?
The bank only has data on approved customers.

We Will Discuss...
Occam's Razor: pick a model carefully
Sampling Bias: generate the data carefully
Data Snooping: handle the data carefully

Data Snooping
If a data set has affected any step in the learning process, it cannot be fully trusted in assessing the outcome.
...or, estimate performance with a completely uncontaminated test set
...and, choose H before looking at the data.

Puzzle: The Buy and Hold Strategy on S&P 500 Stocks
[Figure: buy-and-hold performance, 1985 to 2010, with a 16.2% return.]
Sampling bias: didn't buy and hold a random sample of stocks.
Snooping: chose which stocks to hold by snooping into the test set (the future).

Puzzle: The Buy and Hold Strategy on S&P 500 Stocks
[Figure: 1985 to 2010; the snooping/sampling-bias curve shows a 16.2% return, the actual S&P an 8.3% return.]
Sampling bias: didn't buy and hold a random sample of stocks.
Snooping: chose which stocks to hold by snooping into the test set (the future).
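A toy simulation can illustrate the gap between the two curves. It does not use real market data: the yearly return model, the number of stocks, and the function names are all assumptions made for illustration. The "snooped" portfolio holds only the stocks that turn out largest at the end, mimicking a back-test on today's index members.

```python
import random

def simulate_final_values(n_stocks, n_years, seed=0):
    """Final value of $1 invested in each stock, under toy random returns."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_stocks):
        v = 1.0
        for _ in range(n_years):
            v *= 1.0 + rng.gauss(0.05, 0.30)   # assumed yearly return model
            v = max(v, 0.0)                     # value cannot drop below zero
        values.append(v)
    return values

def average(xs):
    return sum(xs) / len(xs)

final = simulate_final_values(n_stocks=1000, n_years=25)

# Snooped selection: keep the 500 stocks with the largest FINAL value,
# information that was not available when the investment was made.
survivors = sorted(final, reverse=True)[:500]

print(average(final), average(survivors))
```

The survivors' average can never be lower than the average over all stocks, because the selection was made using the outcome it is then evaluated on.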

Data Snooping is a Subtle Happy Hell
The data looks linear, so I will use a linear model, and it worked.
  If the data were different and didn't look linear, would you do something different?
Try linear, it fails; try circles, it works.
  If you torture the data enough, it will confess.
Try linear, it works; so I don't need to try circles.
  Would you have tried circles if the data were different?
Read papers, see what others did on the data. Modify and improve on that.
  If the data were different, would that modify what others did and hence what you did?
  The data snooping can happen all at once or sequentially by different people.
Input normalization: normalize the data, now set aside the test set.
  Since the test set was involved in the normalization, wouldn't your g change if the test set changed?
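The input-normalization case can be sketched in code. The example below assumes a one-dimensional input, a fixed 8/2 split, and z-score normalization; the helper names and the data values are hypothetical. It sets the test set aside first and computes the normalization statistics from the training data only, then contrasts them with the snooped statistics computed on all the data.

```python
from statistics import mean, pstdev

def fit_normalizer(values):
    """Return (mu, sigma) computed from `values` only."""
    return mean(values), pstdev(values)

def normalize(values, mu, sigma):
    """Apply z-score normalization with the given statistics."""
    return [(v - mu) / sigma for v in values]

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0, 120.0]

# Set aside the test set FIRST, then normalize using training data only.
train, test = data[:8], data[8:]
mu, sigma = fit_normalizer(train)
train_z = normalize(train, mu, sigma)
test_z = normalize(test, mu, sigma)      # reuse the training statistics

# Snooped version: statistics computed on all data, test set included.
mu_all, sigma_all = fit_normalizer(data)
print((mu, sigma), (mu_all, sigma_all))
```

With the training-only statistics, changing the test points leaves `mu` and `sigma` untouched, so the learned g cannot depend on the test set through the normalization step.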

Account for Data Snooping
Ask yourself: "If the data were different, could/would I have done something different?" If yes, then there is data snooping.
[Diagram: the data D, your choices (h ∈ H?), and the final hypothesis g.]
You must account for every choice influenced by D.
We know how to account for the choice of g from H.
