Inductive Bias: How to generalize on novel data. CS Inductive Bias 1
- Beverly Simmons
- 6 years ago
Transcription
1 Inductive Bias: How to generalize on novel data
2 Overfitting: Noise vs. Exceptions
3 Non-Linear Tasks
- Linear regression will not generalize well to the task below; it needs a non-linear surface.
- Could do a feature pre-process as with the quadric machine. For example, we could use an arbitrary polynomial in x. The model is then still linear in the coefficients, and can be solved with the delta rule, etc.
- Y = β0 + β1X + β2X^2 + ... + βnX^n
- What order polynomial should we use? Overfit issues can occur.
CS 478 Inductive Bias 3
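To illustrate that a polynomial feature expansion is still linear in the coefficients - and that higher orders fit the training data ever more closely - here is a minimal sketch (the target function, noise level, and helper names are our own assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

def fit_poly(x, y, order):
    """Least-squares fit of y = b0 + b1*x + ... + bn*x^n.

    The model is non-linear in x but linear in the coefficients,
    so ordinary linear least squares solves it directly."""
    X = np.vander(x, order + 1, increasing=True)  # columns 1, x, x^2, ...
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def training_error(x, y, coeffs):
    X = np.vander(x, len(coeffs), increasing=True)
    return float(np.mean((X @ coeffs - y) ** 2))

# Higher order never raises training error -- but may overfit the noise.
errs = [training_error(x, y, fit_poly(x, y, k)) for k in (1, 3, 9)]
assert errs[0] >= errs[1] >= errs[2]
```

Training error alone cannot pick the order: the order-9 fit looks best on these 20 points even if it generalizes worse.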
4 Regression Regularization
- How to avoid overfit: keep the model simple. For regression, keep the function smooth; the inductive bias is that f(x) ≈ f(x ± ε).
- Regularization approach: F(h) = Error(h) + λ·Complexity(h). Trade off accuracy vs. complexity.
- Ridge Regression minimizes F(w) = TSS(w) + λ‖w‖² = Σ(predicted_i − actual_i)² + λΣw_i²
- Gradient of F(w): Δw_i = c(t − net)x_i − λw_i (weight decay)
- Especially useful when the features are all non-linear transforms of the initial features (e.g. polynomials in x), and also when the number of initial features is greater than the number of examples.
- Lasso regression uses an L1 rather than an L2 weight penalty: TSS(w) + λΣ|w_i|
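The weight-decay update Δw_i = c(t − net)x_i − λw_i can be sketched as per-example gradient descent on a linear unit (a minimal sketch; the data, learning rate c, and λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
t = X @ true_w + rng.normal(0, 0.1, size=100)

c, lam = 0.01, 0.1        # learning rate and ridge penalty (illustrative)
w = np.zeros(3)
for _ in range(200):      # epochs of per-example, delta-rule-style updates
    for xi, ti in zip(X, t):
        net = w @ xi                              # linear unit output
        w += c * (ti - net) * xi - c * lam * w    # error term + weight decay

# The penalty shrinks the weights relative to the unregularized solution.
w_ols, *_ = np.linalg.lstsq(X, t, rcond=None)
assert np.linalg.norm(w) < np.linalg.norm(w_ols)
```

The decay term pulls every weight toward zero each step, which is why the penalty is especially useful when there are more features than examples.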
5 Hypothesis Space
- The hypothesis space H is the set of all the possible models h which can be learned by the current learning algorithm, e.g. the set of possible weight settings for a perceptron.
- Restricted hypothesis space: can be easier to search, and may avoid overfit since its hypotheses are usually simpler (e.g. linear or low-order decision surface), but often will underfit.
- Unrestricted hypothesis space: can represent any possible function and thus can fit the training set well, but mechanisms must be used to avoid overfit.
6 Avoiding Overfit - Regularization
- Regularization: any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
- Occam's Razor (William of Ockham): the simplest accurate model - an accuracy vs. complexity trade-off. Find the h ∈ H which minimizes an objective function of the form F(h) = Error(h) + λ·Complexity(h). Complexity could be number of nodes, size of tree, magnitude of weights, order of decision surface, etc. L2 and L1 penalties are common.
- More training data (vs. overtraining on the same data). Also data set augmentation: fake data (e.g. jitter) can be very effective, but take care.
- Denoising: adding random noise to inputs during training can act as a regularizer. Noise can also be added to nodes, weights, outputs, etc., e.g. Dropout (discussed with ensembles).
- Most common regularization approach: Early Stopping. Start with a simple model (small parameters/weights) and stop training as soon as we attain good generalization accuracy (before the parameters get large). Use a validation set (next slide; requires a separate test set).
- Will discuss other approaches with specific models.
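The jitter idea mentioned above - augmenting with slightly perturbed copies of the inputs - can be sketched as follows (the noise scale and function name are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))  # stand-in training inputs

def jitter(X, sigma=0.05, rng=rng):
    """Return a copy of X with small Gaussian noise added to each input.

    Training on a fresh jittered copy every epoch acts as a regularizer:
    it enforces the smoothness bias that f(x) should be close to f(x +/- eps)."""
    return X + rng.normal(0, sigma, size=X.shape)

X_aug = jitter(X)
assert X_aug.shape == X.shape
```

Each epoch sees a different perturbed copy, so the model cannot memorize exact input coordinates.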
7 Stopping/Model Selection with Validation Set
- [Figure: SSE vs. epochs (a new h at each), with curves for the validation set and the training set.]
- There is a different model h after each epoch. Select a model in the area where the validation set accuracy flattens - when no improvement occurs over m epochs.
- The validation set comes out of the training set data. A separate test set is still needed, after selecting model h, to predict future accuracy.
- Simple and unobtrusive: does not change the objective function, can be done in parallel on a separate processor, and can be used alone or in conjunction with other regularizers.
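The stopping rule above (keep the hypothesis whose validation error has not improved for m epochs) can be sketched like this (a minimal sketch on synthetic data; the model, m, and learning rate are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 0.5, size=200)

# The validation set comes out of the training data; a separate test set
# would still be needed to estimate the selected model's future accuracy.
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(5)
best_w, best_err, since_best, m = w.copy(), np.inf, 0, 10
for epoch in range(500):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.01 * grad                      # one epoch -> a new hypothesis h
    err = mse(w, X_va, y_va)
    if err < best_err:                    # track the best h seen so far
        best_w, best_err, since_best = w.copy(), err, 0
    else:
        since_best += 1
    if since_best >= m:                   # no improvement over m epochs
        break

assert best_err <= mse(np.zeros(5), X_va, y_va)
```

Note that the selected model is `best_w`, not the final `w`: training may have continued past the flattening point before the rule triggered.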
8 Inductive Bias
- The approach used to decide how to generalize novel cases.
- One common approach is Occam's Razor: the simplest hypothesis which explains/fits the data is usually the best. Many other rational biases and variations exist.
- Training examples of the form ABC → Z, with some literals negated (the negation bars were lost in transcription). When you get the new input Ā B C, what is your output?
9 One Definition for Inductive Bias
- Inductive Bias: any basis for choosing one generalization over another, other than strict consistency with the observed training instances.
- Sometimes just called the bias of the algorithm (don't confuse this with the bias weight in a neural network).
- Bias-variance trade-off: will discuss in more detail when we discuss ensembles.
10 Some Inductive Bias Approaches
- Restricted hypothesis space - can just try to minimize error, since the hypotheses are already simple: linear or low-order threshold function; k-DNF, k-CNF, etc.; low-order polynomial.
- Preference bias - prefer one hypothesis over another even though they have similar training accuracy (Occam's Razor): smallest DNF representation which matches well; shallow decision tree with high information gain; neural network with low validation error and small-magnitude weights.
11 Need for Bias
- There are 2^(2^n) Boolean functions of n inputs.
- [Table: inputs x1 x2 x3, the Class column, and the possible consistent function hypotheses; the table values were lost in transcription.]
12 Need for Bias
- There are 2^(2^n) Boolean functions of n inputs.
- [Same table, with the first training example's class (0) filled in; the remaining values were lost in transcription.]
13 Need for Bias
- There are 2^(2^n) Boolean functions of n inputs.
- [Same table, with the first two training classes (0, 1) filled in; the remaining values were lost in transcription.]
14 Need for Bias
- There are 2^(2^n) Boolean functions of n inputs. [Table as on the previous slides.]
- Without an inductive bias we have no rationale to choose one hypothesis over another, and thus a random guess would be as good as any other option.
15 Need for Bias
- There are 2^(2^n) Boolean functions of n inputs. [Table as on the previous slides.]
- Inductive bias guides which hypothesis we should prefer. What happens in this case if we use simplicity (Occam's Razor) as our inductive bias?
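The counting argument behind these slides can be checked directly: with n = 3 inputs there are 2^(2^3) = 256 Boolean functions, and fixing the labels of k training rows leaves 2^(2^n − k) consistent hypotheses. A small sketch (the enumeration and the example labels are ours):

```python
from itertools import product

n = 3
inputs = list(product([0, 1], repeat=n))       # the 2^n = 8 input rows
# Every Boolean function of n inputs is one way of filling the 8-row
# output column, so there are 2^(2^n) = 256 functions in total.
all_fns = list(product([0, 1], repeat=len(inputs)))
assert len(all_fns) == 2 ** (2 ** n) == 256

# Fix the outputs of k training rows; the other rows are unconstrained,
# so 2^(2^n - k) functions remain consistent -- and without a bias, any
# of them is as good a guess as any other on the unseen rows.
k = 5
training = dict(zip(inputs[:k], [0, 1, 1, 0, 1]))  # illustrative labels
consistent = [f for f in all_fns
              if all(f[inputs.index(x)] == c for x, c in training.items())]
assert len(consistent) == 2 ** (2 ** n - k) == 8
```

The eight surviving hypotheses disagree on every untrained row, which is exactly why strict consistency alone gives no basis for generalization.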
16 Learnable Problems
- Raster Screen Problem
- Pattern Theory: regularity in a task; compressibility; don't-care features and impossible states.
- Interesting/Learnable Problems: what we actually deal with. Can we formally characterize them?
- Learning a training set vs. generalizing: consider a function where each output is set randomly (coin-flip), so the output class is independent of all other instances in the data set.
- Computability vs. Learnability (optional)
17 Computability and Learnability: Finite Problems
- Finite problems assume a finite number of mappings (a finite table): fixed-input-size arithmetic; random memory in a RAM.
- Learnable: can do better than random on novel examples.
18 Computability and Learnability: Finite Problems
- Finite problems assume a finite number of mappings (a finite table): fixed-input-size arithmetic; random memory in a RAM.
- Learnable: can do better than random on novel examples.
- Finite problems are all computable; the learnable ones are those with regularity.
19 Computability and Learnability: Infinite Problems
- Infinite number of mappings (an infinite table): arbitrary-input-size arithmetic; the Halting Problem (no limit on input size); "do two arbitrary strings match?"
20 Computability and Learnability: Infinite Problems
- Infinite number of mappings (an infinite table): arbitrary-input-size arithmetic; the Halting Problem (no limit on input size); "do two arbitrary strings match?"
- Among infinite problems, the learnable ones are those where any reasonably queried infinite subset has regularity; the computable ones are only those where all but a finite set of mappings have regularity.
21 No Free Lunch
- Any inductive bias chosen will have equal accuracy compared to any other bias over all possible functions/tasks, assuming all functions are equally likely. If a bias is correct on some cases, it must be incorrect on equally many cases.
- Is this a problem? Random vs. regular tasks. Anti-bias? (even though regular). Are the interesting problems a subset of the learnable ones? Are all functions equally likely in the real world?
22 Interesting Problems and Biases
- [Diagram: nested sets - All Problems, Structured Problems, Interesting Problems - with several inductive-bias regions P_I covering parts of the interesting problems.]
23 More on Inductive Bias
- Inductive bias requires some set of prior assumptions about the tasks being considered and the learning approaches available.
- Tom Mitchell's definition: the inductive bias of a learner is the set of additional assumptions sufficient to justify its inductive inferences as deductive inferences.
- We consider standard ML algorithms/hypothesis spaces to be different inductive biases: C4.5 (greedy best attributes), backpropagation (simple to complex), etc.
24 Which Bias is Best?
- There is no one bias that is best on all problems.
- Our experiments: over 50 real-world problems and over 400 inductive biases - mostly variations on critical-variable biases vs. similarity biases. Different biases were a better fit for different problems.
- Given a data set, which learning model (inductive bias) should be chosen?
25 Automatic Discovery of Inductive Bias
- Defining and characterizing the set of interesting/learnable problems.
- To what extent do current biases cover the set of interesting problems?
- Automatic feature selection.
- Automatic selection of bias (before and/or during learning), including all learning parameters.
- Dynamic inductive biases (in time and space).
- Combinations of biases: ensembles, oracle learning.
26 Dynamic Inductive Bias in Time
- Can be discovered as you learn.
- May want to learn general rules first, followed by true exceptions.
- Can be based on ease of learning the problem.
- Example: SoftProp - from lazy learning to backprop.
27 Dynamic Inductive Bias in Space
28 ML Holy Grail
- We want all aspects of the learning mechanism automated, including the inductive bias.
- [Diagram: input features and just a data set (or just an explanation of the problem) feed an Automated Learner, which outputs a hypothesis.]
29 BYU Neural Network and Machine Learning Laboratory
- Work on automatic discovery of inductive bias.
- Proposing new learning algorithms (inductive biases).
- Theoretical issues: defining the set of interesting/learnable problems; analytical/empirical studies of differences between biases.
- Ensembles: wagging, mimicking, oracle learning, etc.
- Meta-learning: a priori decision regarding which learning model to use; features of the data set/application; learning from model experience; automatic selection of parameters.
- Constructive algorithms: ASOCS, DMP, etc.
- Learning parameters: windowed momentum, automatic improved distance functions (IVDM).
- Automatic bias in time: SoftProp. Automatic bias in space: overfitting and sensitivity to complex portions of the space - DMP, higher-order features.
Bayesian Learning
- A powerful and growing approach in machine learning.
- We use it in our own decision making all the time. You hear a word which could equally be "Thanks" or "Tanks" - which would you go with?
- Combine
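The "Thanks" vs. "Tanks" example can be sketched with Bayes' rule: when the acoustic likelihoods are equal, the prior decides. (The numbers below are hypothetical, chosen only to illustrate combining likelihood and prior):

```python
# Hypothetical numbers: the acoustic evidence is ambiguous (equal
# likelihoods), so the prior probability of each word decides.
likelihood = {"Thanks": 0.5, "Tanks": 0.5}   # P(sound | word), equal
prior = {"Thanks": 0.98, "Tanks": 0.02}      # P(word) in everyday speech

# Bayes' rule: P(word | sound) is proportional to P(sound | word) * P(word)
posterior = {w: likelihood[w] * prior[w] for w in prior}
Z = sum(posterior.values())                  # normalize over both words
posterior = {w: p / Z for w, p in posterior.items()}

best = max(posterior, key=posterior.get)
assert best == "Thanks"
```

With equal likelihoods the posterior simply reproduces the prior, which is why you hear "Thanks" in ordinary conversation even when the sound is ambiguous.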
Optima-Depth Threshod Circuits for Mutipication and Reated Probems Chi-Hsiang Yeh Dept. of Eectrica & Computer Engineering Queen s University Kingston, Ontario, Canada, K7K 3N6 E.A. Varvarigos, B. Parhami,
More informationPhysicsAndMathsTutor.com
. Two points A and B ie on a smooth horizonta tabe with AB = a. One end of a ight eastic spring, of natura ength a and moduus of easticity mg, is attached to A. The other end of the spring is attached
More informationSteepest Descent Adaptation of Min-Max Fuzzy If-Then Rules 1
Steepest Descent Adaptation of Min-Max Fuzzy If-Then Rues 1 R.J. Marks II, S. Oh, P. Arabshahi Λ, T.P. Caude, J.J. Choi, B.G. Song Λ Λ Dept. of Eectrica Engineering Boeing Computer Services University
More informationCopyright information to be inserted by the Publishers. Unsplitting BGK-type Schemes for the Shallow. Water Equations KUN XU
Copyright information to be inserted by the Pubishers Unspitting BGK-type Schemes for the Shaow Water Equations KUN XU Mathematics Department, Hong Kong University of Science and Technoogy, Cear Water
More informationFast Blind Recognition of Channel Codes
Fast Bind Recognition of Channe Codes Reza Moosavi and Erik G. Larsson Linköping University Post Print N.B.: When citing this work, cite the origina artice. 213 IEEE. Persona use of this materia is permitted.
More informationGauss Law. 2. Gauss s Law: connects charge and field 3. Applications of Gauss s Law
Gauss Law 1. Review on 1) Couomb s Law (charge and force) 2) Eectric Fied (fied and force) 2. Gauss s Law: connects charge and fied 3. Appications of Gauss s Law Couomb s Law and Eectric Fied Couomb s
More informationDIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM
DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing
More informationExpectation-Maximization for Estimating Parameters for a Mixture of Poissons
Expectation-Maximization for Estimating Parameters for a Mixture of Poissons Brandon Maone Department of Computer Science University of Hesini February 18, 2014 Abstract This document derives, in excrutiating
More informationNotes on Backpropagation with Cross Entropy
Notes on Backpropagation with Cross Entropy I-Ta ee, Dan Gowasser, Bruno Ribeiro Purue University October 3, 07. Overview This note introuces backpropagation for a common neura network muti-cass cassifier.
More informationRecursive Constructions of Parallel FIFO and LIFO Queues with Switched Delay Lines
Recursive Constructions of Parae FIFO and LIFO Queues with Switched Deay Lines Po-Kai Huang, Cheng-Shang Chang, Feow, IEEE, Jay Cheng, Member, IEEE, and Duan-Shin Lee, Senior Member, IEEE Abstract One
More informationCryptanalysis of PKP: A New Approach
Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in
More informationComputational Learning Theory
CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful
More informationMONTE CARLO SIMULATIONS
MONTE CARLO SIMULATIONS Current physics research 1) Theoretica 2) Experimenta 3) Computationa Monte Caro (MC) Method (1953) used to study 1) Discrete spin systems 2) Fuids 3) Poymers, membranes, soft matter
More informationIE 361 Exam 1. b) Give *&% confidence limits for the bias of this viscometer. (No need to simplify.)
October 9, 00 IE 6 Exam Prof. Vardeman. The viscosity of paint is measured with a "viscometer" in units of "Krebs." First, a standard iquid of "known" viscosity *# Krebs is tested with a company viscometer
More informationII. PROBLEM. A. Description. For the space of audio signals
CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time
More information
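The ridge regression update rule from the regularization slide, Δw_i = c(t − net)x_i − λw_i, can be sketched as a small gradient-descent loop. This is a minimal illustration on synthetic data; the learning rate, λ, and epoch count are assumed values, not ones given in the slides.

```python
import numpy as np

# Ridge regression trained by stochastic gradient descent with weight decay,
# following the slide's update rule: dw_i = c*(t - net)*x_i - lambda*w_i.
# Data, learning rate c, and penalty lambda are illustrative assumptions.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
c, lam = 0.01, 0.1  # learning rate and regularization strength (assumed)
for _ in range(2000):
    for x_i, t in zip(X, y):
        net = w @ x_i                         # linear prediction
        w += c * (t - net) * x_i - c * lam * w  # error term plus weight decay

print(w)  # weights shrunk toward zero relative to the unregularized fit
```

The weight-decay term λw_i is what trades accuracy against complexity: each update pulls every weight toward zero, so only weights that consistently reduce the squared error stay large. Lasso's L1 penalty would instead subtract a constant-magnitude step, driving small weights exactly to zero.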