Data Warehousing & Data Mining

13. Meta-Algorithms for Classification
Data Warehousing & Data Mining
Wolf-Tilo Balke, Silviu Homoceanu
Institut für Informationssysteme, Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de

13. Meta-Algorithms for Classification
13.1 Bagging (Bootstrap Aggregating)
13.3 Adaptive Boosting (AdaBoost)

13.0 Meta-Algorithms
- Upper-layer family of algorithms
  - Not problem-specific
  - May make use of domain-specific knowledge in the form of heuristics that are controlled by the upper-level strategy; for this reason, also known as meta-heuristics
  - Around since the early 1950s
- Meta-algorithms express common knowledge
  - Usually the last resort before giving up and using random or brute-force search
  - Used for problems where you don't know how to find a good solution, but where you can assign a grade to any candidate solution shown to you
- The algorithmic family includes genetic algorithms, hill-climbing, ant/bee colony optimization, simulated annealing, etc.

13.0 Meta-Algorithms
- Genetic algorithms
  - Search algorithms based on the mechanics of biological evolution
  - Developed by John Holland, University of Michigan (1970s)
    - To understand the adaptive processes of natural systems: in time, organisms adapt to their environment
    - To design artificial systems (software) that retain the robustness of natural systems
  - A way of solving problems by mimicking the processes nature uses: the same combination of selection, recombination and mutation is applied to evolve a solution to a problem

- Typical structure of genetic algorithms:

    initialize population;
    evaluate population;
    while TerminationCriteriaNotSatisfied {
      select parents for reproduction;
      perform recombination and mutation;
      evaluate population;
    }

- Components of genetic algorithms
  - Encoding technique: genes and chromosomes
  - Initialization procedure (creation)
  - Evaluation function: the environment
  - Selection of parents: reproduction
  - Genetic operators: mutation and recombination

- Problem: Given the digits 0 through 9 and the operators +, -, * and /, find a sequence that represents a given target number
  - For the target number 23, the sequence 6+5*4/2+1 would be one possible solution
  - The operators are applied sequentially from left to right
- We need to encode a possible solution as a string of bits: a chromosome
  - Each character available to the solution (0 through 9 and +, -, *, /) is represented by a gene
  - Each chromosome is made up of several genes

- Four bits are required to represent the range of characters used; 0000, for example, is a gene:

    0: 0000    1: 0001    2: 0010    3: 0011    4: 0100
    5: 0101    6: 0110    7: 0111    8: 1000    9: 1001
    +: 1010    -: 1011    *: 1100    /: 1101

  - 1110 and 1111 remain unused and are therefore ignored by the algorithm
  - A solution for 23 would then form the following chromosome:

      6    +    5    *    4    /    2    +    1
      0110 1010 0101 1100 0100 1101 0010 1010 0001

- Initialization: at the beginning of a run of a genetic algorithm, a large population of random chromosomes is created; each one, when decoded, represents a different solution to the problem at hand
- Evaluation of the population: say there are N chromosomes in the initial population; test each chromosome to see how good it is at solving the problem at hand and assign a fitness score accordingly

- Fitness score
  - The fitness score is a measure of how good a chromosome is at solving the problem at hand; it is problem dependent
  - If we assume the target number 42, the chromosome 6+5*4/2+1 (0110 1010 0101 1100 0100 1101 0010 1010 0001), which represents the number 23, has a fitness score of 1/(42-23), i.e. 1/19
- Selection: select two members from the current population
  - The chance of being selected is proportional to the chromosome's fitness (survival of the fittest)
  - Roulette wheel selection is a commonly used method
  - It does not guarantee that the fittest member goes through to the next generation, merely that it has a very good chance of doing so
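
To make the encoding and the fitness score concrete, here is a minimal Python sketch of the scheme described above. The gene table, the strictly left-to-right evaluation and the 1/(42-23) fitness example follow the slides; the helper names (decode, evaluate, fitness) and the use of the absolute difference |target - value| (so the score also works when a chromosome overshoots the target) are our own assumptions, not part of the lecture.

```python
# Minimal sketch of the chromosome encoding and fitness evaluation described
# above. Helper names (decode, evaluate, fitness) are illustrative only.
GENES = {format(i, "04b"): str(i) for i in range(10)}
GENES.update({"1010": "+", "1011": "-", "1100": "*", "1101": "/"})  # 1110/1111 unused

def decode(chromosome: str) -> list[str]:
    """Split a bit string into 4-bit genes and map them to characters,
    ignoring the unused codes 1110 and 1111 (as the slides specify)."""
    genes = [chromosome[i:i + 4] for i in range(0, len(chromosome), 4)]
    return [GENES[g] for g in genes if g in GENES]

def evaluate(tokens: list[str]) -> float:
    """Apply the operators strictly left to right, as in the slides:
    6 + 5 * 4 / 2 + 1 evaluates to 23, not to the usual precedence result."""
    value, op = 0.0, "+"
    for tok in tokens:
        if tok in "+-*/":
            op = tok
        else:
            x = float(tok)
            if op == "+":   value += x
            elif op == "-": value -= x
            elif op == "*": value *= x
            elif op == "/": value = value / x if x != 0 else value
    return value

def fitness(chromosome: str, target: float) -> float:
    """1 / |target - value|, following the 1/(42-23) = 1/19 example."""
    diff = abs(target - evaluate(decode(chromosome)))
    return float("inf") if diff == 0 else 1.0 / diff

solution_23 = "011010100101110001001101001010100001"  # 6 + 5 * 4 / 2 + 1
print(evaluate(decode(solution_23)))   # 23.0
print(fitness(solution_23, 42))        # 1/19, roughly 0.0526
```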

- Recombination
  - Crossover the bits from each chosen chromosome at a randomly chosen point
  - E.g. given two chromosomes, choose a random bit along their length, say at position 9, and swap all the bits after that point:

      10001001110010010    ->    10001001101000011
      01010001001000011    ->    01010001010010010

- Mutation
  - Step through the chosen chromosomes' bits and flip each one dependent on the mutation rate (usually a very low value for binary-encoded genes, say 0.001)
- Repeat selection, recombination and mutation until a new population of N members has been created
- Then evaluate the new population with fitness scores
- Stop when a chromosome from the population solves the problem

13.0 Application in Data Mining
- Example: classification problems can be solved using multi-classifier combination (MCC)
  - Basic classifiers may individually achieve a precision only slightly better than random classification on difficult training data
  - But if independent classifiers are used together, they strengthen each other
- The main idea originates from a technique called bootstrapping

- The idea of bootstrapping comes from the stories of Baron von Münchhausen, who tries to pull himself and his horse out of a swamp by his own hair
- But why is it called bootstrapping?

- The Baron's story inspired a metaphoric phrase in the early 19th-century United States: "to pull oneself over a fence by one's bootstraps"
- Bootstrapping in computer science
  - Software bootstrapping: development of successively more complex, faster programming environments
    - E.g. start with vim and an assembler, and iteratively build graphical IDEs and high-level programming languages
  - Bootstrapping in classification tasks: iteratively improve a classifier's performance
  - Seed AI: a strong artificial intelligence capable of improving itself; having improved itself, it would become better at improving itself, potentially leading to an exponential increase in intelligence (no such system exists)
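
The selection, recombination and mutation operators described above can be sketched as follows. This is an illustrative Python sketch, not the lecture's code: roulette-wheel selection, single-point crossover and bit-flip mutation assembled into one generation step. All function names and defaults are our own choices, except the 0.001 mutation rate, which comes from the slide.

```python
import random

# Illustrative sketch of one generation of the genetic algorithm described
# above. Fitness scores are passed in, e.g. from a fitness() helper like the
# one sketched earlier.

def roulette_select(population: list[str], scores: list[float]) -> str:
    """Pick one chromosome with probability proportional to its fitness."""
    pick = random.uniform(0.0, sum(scores))
    running = 0.0
    for chrom, score in zip(population, scores):
        running += score
        if running >= pick:
            return chrom
    return population[-1]

def crossover(parent1: str, parent2: str) -> tuple[str, str]:
    """Single-point crossover: swap all bits after a random position."""
    point = random.randint(1, len(parent1) - 1)
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(chromosome: str, rate: float = 0.001) -> str:
    """Flip each bit independently with a small probability (mutation rate)."""
    return "".join(("1" if b == "0" else "0") if random.random() < rate else b
                   for b in chromosome)

def next_generation(population, scores, mutation_rate=0.001):
    """Build a new population of the same size N from the current one."""
    new_pop = []
    while len(new_pop) < len(population):
        p1 = roulette_select(population, scores)
        p2 = roulette_select(population, scores)
        c1, c2 = crossover(p1, p2)
        new_pop.extend([mutate(c1, mutation_rate), mutate(c2, mutation_rate)])
    return new_pop[:len(population)]
```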

- Check the accuracy of sample estimates
  - Assume we are interested in the average height of all people in the world
  - We can only measure a maximum of N people and calculate their average
  - Is this average value any good? We need a sense of its variability
- Bootstrap: take the N values and build another sample, also of length N, by sampling with replacement (i.e. each element may appear multiple times in the same sample) from the N original values
  - Example: choose a person, measure them, return them to the pool, choose again, ... the same person may be measured twice
- Repeat the bootstrapping 1000 times
- Calculate the 1000 averages and their variance (see the sketch at the end of this section)

13.1 Bagging
- Bagging (bootstrap aggregating) for classification problems
  - Leo Breiman, Bagging Predictors, Machine Learning Journal, 1996
- Starting from a training set, draw n samples with replacement
  - Usually n is at least as large as the number of records in the training set
- Train a classifier on the resulting sample
- Repeat the process m times to learn m classifiers

13.1 Bagging
- Classifying a new record: perform a majority vote over all trained classifiers (sketched below)
- Advantages:
  - Increases classifier stability
  - Reduces variance

- Boosting is based on the idea of bootstrap aggregating
  - Michael Kearns, Thoughts on Hypothesis Boosting, unpublished manuscript, 1988
- Build a series of classifiers; each classifier in the series pays more attention to the examples misclassified by its predecessor
- May use any basic classifier, from decision trees to support vector machines
- Bootstrap aggregating vs. boosting:
  [Figure: Bagging trains models M_1, M_2, M_3, ... on independent bootstrap samples and combines them into a final model M; boosting trains models M_1, M_2, ..., M_t sequentially on the training sample and successively reweighted samples, then combines them into a final model M.]
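
As a quick illustration of the bootstrap estimate of variability described above, the following numpy snippet (our own illustrative code, not from the lecture) resamples N height measurements with replacement 1000 times and reports the spread of the resulting averages; the simulated data and the sample size N = 200 are assumptions made only for the example.

```python
import numpy as np

# Illustrative sketch of the bootstrap procedure described above: resample the
# N measured heights with replacement 1000 times and look at how much the
# average varies. The heights here are simulated; in practice they would be
# the N real measurements.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=10.0, size=200)   # our N = 200 measurements

n_bootstrap = 1000
boot_means = np.empty(n_bootstrap)
for b in range(n_bootstrap):
    resample = rng.choice(heights, size=heights.size, replace=True)  # with replacement
    boot_means[b] = resample.mean()

print("sample mean:", heights.mean())
print("variance of the 1000 bootstrap means:", boot_means.var())
print("bootstrap standard error:", boot_means.std())
```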

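Bagging itself adds only a few lines on top of bootstrap sampling. The sketch below is an illustrative implementation under our own assumptions (scikit-learn decision trees as base classifiers, a synthetic dataset, m = 25 rounds): it draws one bootstrap sample per round, trains one classifier per sample, and classifies new records by majority vote, as in the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch of bagging: m classifiers, each trained on a bootstrap
# sample of the training set, combined by majority vote. The dataset and the
# choice of decision trees as base classifiers are our own assumptions.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

m = 25                                                 # number of bagged classifiers
classifiers = []
for _ in range(m):
    idx = rng.integers(0, len(X), size=len(X))         # sample with replacement
    clf = DecisionTreeClassifier().fit(X[idx], y[idx])
    classifiers.append(clf)

def bagged_predict(X_new):
    """Majority vote over all trained classifiers (labels in {0, 1})."""
    votes = np.stack([clf.predict(X_new) for clf in classifiers])  # shape (m, n)
    return (votes.mean(axis=0) > 0.5).astype(int)

print(bagged_predict(X[:5]), y[:5])
```
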
- Basically, a boosting algorithm is a blueprint for combining a set of real classification algorithms to yield a single combined (and hopefully better) classifier
- Each classification algorithm comes with individual strengths and weaknesses ("there ain't no such thing as a free lunch")
- For hard classification problems, the usual classifiers tend to be weak learners
  - Weak learner = only slightly better than random guessing
  [Figure: a boosting algorithm combining base classifier 1, base classifier 2 and base classifier 3]
- Question: Can a set of weak learners create a single strong learner?
- Answer: YES!

- Naïve approach to boosting: majority vote!
  1. Train the base classifiers independently on the training set
  2. For each new object to be classified, independently ask each base classifier and return the answer given by the majority
- Problems:
  - Only works if the majority is right very often
  - The base algorithms cannot take advantage of their individual strengths
  - Should expert votes really have the same weight as any other vote?

13.3 Boosting
- Better approach: adaptive boosting
  - Yoav Freund, Robert E. Schapire: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, 1995
- Major steps:
  1. Train a first base classifier on the training set
  2. Check which training examples cannot be explained by the first base classifier's underlying model ("errors")

13.3 Adaptive Boosting
  3. Assign a weight to each training example
     - Low weight = example perfectly fits the first classifier's model
     - High weight = example does not fit the first classifier's model
  4. Train a new base classifier on the weighted training set
     - Fitting training examples with high weights is more important than fitting those with low weights
  5. Reweight as in step (3)
  6. Repeat steps (4) and (5) to create a set of base classifiers

13.3 Adaptive Boosting
- Adaptive boosting (cont.)
  - In addition, assign an importance weight to each base classifier, depending on how many training examples fit its model
    - High importance if errors occur only on training examples with low weight
    - Low importance if errors occur on training examples with high weight
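
For contrast, the naïve majority-vote combiner from the numbered list above fits in a few lines. This is an illustrative sketch under our own assumptions (depth-1 decision trees with randomized splits standing in for independently trained base classifiers, a synthetic dataset): every classifier sees the same unweighted training set and every vote counts equally, which is exactly what adaptive boosting improves on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch of the naive approach: train base classifiers
# independently on the same training set and take an unweighted majority
# vote. Randomized decision stumps stand in for the base classifiers.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

base_classifiers = [
    DecisionTreeClassifier(max_depth=1, splitter="random", random_state=seed).fit(X, y)
    for seed in range(5)
]

def naive_majority_vote(X_new):
    votes = np.stack([clf.predict(X_new) for clf in base_classifiers])
    return (votes.mean(axis=0) > 0.5).astype(int)   # every vote counts equally

print(naive_majority_vote(X[:5]), y[:5])
```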

13.3 Adaptive Boosting
- How does the combined classifier work?
  1. Classify the new example with each base classifier
  2. Use a majority vote, weighing the individual classifiers' answers by their importance weights
     - Also incorporate each classifier's confidence, whenever this information is available
- Typically, the importance weights and the weights of the individual training examples are chosen to be balanced, such that the weighted majority is now right very often

- Example: first weak classifier
  [Figure: Weak Classifier 1]

- Weights of misclassified objects are increased
- Second weak classifier
  [Figure: weights increased; Weak Classifier 2]

- Weights of newly misclassified objects are increased
- Third weak classifier
  [Figure: weights increased; Weak Classifier 3]
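
The combined classifier itself is just a weighted vote. The snippet below is an illustrative sketch (the names weighted_vote, classifiers and alphas are ours): it assumes base classifiers that predict labels in {-1, +1} and importance weights α_t produced by a training phase such as the AdaBoost loop sketched after the formal description.

```python
import numpy as np

# Illustrative sketch of the combined classifier: a majority vote in which
# each base classifier's answer (+1 or -1) is weighted by its importance
# weight alpha_t from training.
def weighted_vote(classifiers, alphas, X_new):
    weighted_sum = np.zeros(len(X_new))
    for clf, alpha in zip(classifiers, alphas):
        weighted_sum += alpha * clf.predict(X_new)   # predictions in {-1, +1}
    return np.sign(weighted_sum)                     # H(x) = sign(sum_t alpha_t * h_t(x))
```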

- ... and we could go on like this to improve the classification precision
- Classify new data: C1 and C2 classify the new data object as red
- Final classifier is a combination of weak classifiers
  [Figure: decision regions of the combined classifier, labeled R (red) and B (blue)]

13.3 Formal Description
- Let X denote the instance space and Y the set of class labels; assume binary classification, Y = {-1, +1}
- Given a weak or base learning algorithm and a training set {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where x_i ∈ X and y_i ∈ Y (i = 1, ..., m)
- Let T be the number of iterations
- D_t denotes the distribution of weights at the t-th learning round

13.3 Formal Description
- We assign equal weights to all training examples: D_1(i) = 1/m, for i = 1, ..., m
- From the training set and D_t the algorithm generates a weak learner h_t : X → Y
  - h_t = argmax_h |0.5 - ε_t|, where ε_t = Σ_{i=1}^m D_t(i) · I(y_i ≠ h_t(x_i)) and I is the indicator function
  - Put into words, h_t maximizes the absolute value of the difference between its weighted error rate and 0.5 with respect to the distribution D_t

13.3 Formal Description
- h_t is then tested on the training examples, and the weights of the incorrectly classified examples are increased
- Thus, an updated weight distribution D_{t+1} is obtained:
  - D_{t+1}(i) = D_t(i) · exp(-α_t · y_i · h_t(x_i)) / Z_t
  - where α_t = (1/2) · ln((1 - ε_t) / ε_t) represents the weight of h_t, and Z_t is a normalization factor
- The process is repeated for T rounds

13.3 Formal Description
- After T rounds we have trained the weak hypotheses h_1, ..., h_T
- The combined hypothesis H is a weighted majority vote of the T weak hypotheses, where each hypothesis h_t has weight α_t:
  - H(x) = sign(Σ_{t=1}^T α_t · h_t(x))
- If finding a suitable number T of hypotheses to train is difficult, stop training when the last trained hypothesis is good enough
  - Stop condition: 0.5 - ε_t ≥ β, where β is a quality threshold
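
The formal description translates almost line by line into code. The following sketch is our own illustrative implementation (decision stumps as weak learners, a synthetic dataset and T = 20 are assumptions, not part of the lecture): it initializes D_1(i) = 1/m, computes the weighted error ε_t, the classifier weight α_t = ½·ln((1-ε_t)/ε_t), the multiplicative weight update normalized by Z_t, and the final vote H(x) = sign(Σ_t α_t·h_t(x)).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative AdaBoost sketch following the formal description above.
# Decision stumps serve as the weak learners h_t; labels are in {-1, +1}.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)          # map labels to {-1, +1}

m = len(X)
T = 20
D = np.full(m, 1.0 / m)              # D_1(i) = 1/m
hypotheses, alphas = [], []

for t in range(T):
    # Train a weak learner on the weighted training set (sample_weight = D_t)
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = h.predict(X)

    # Weighted error eps_t = sum_i D_t(i) * I(y_i != h_t(x_i))
    eps = np.sum(D * (pred != y))
    if eps >= 0.5 or eps == 0:       # no longer a weak learner / perfect fit
        break

    # Classifier weight alpha_t = 0.5 * ln((1 - eps_t) / eps_t)
    alpha = 0.5 * np.log((1 - eps) / eps)

    # Weight update: D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
    D = D * np.exp(-alpha * y * pred)
    D = D / D.sum()                  # Z_t normalizes D_{t+1} to a distribution

    hypotheses.append(h)
    alphas.append(alpha)

def H(X_new):
    """Combined hypothesis H(x) = sign(sum_t alpha_t * h_t(x))."""
    scores = sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
    return np.sign(scores)

print("training accuracy:", np.mean(H(X) == y))
```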

13.3 AdaBoost Evaluation
- AdaBoost example (Jan Šochman, Center for Machine Perception)

13.3 Advantages
- Why is adaptive boosting better than a pure majority vote?
  - Later weak learners focus more on those training examples that previous weak learners had problems with
  - Individual weaknesses can be compensated
  - Individual strengths can be exploited

Summary
- Basic classifiers alone achieve a precision only slightly better than random classification on difficult training data
- When more classifiers are used together, they strengthen each other
- Bootstrap aggregating introduces the voting principle
- Boosting introduces weights for the falsely classified objects
- Adaptive boosting additionally introduces weights for the classifiers

Next week
- Guest: Mr. Andreas Engelke, Head of Consulting Corporate Performance Management Braunschweig at ISR Information Products AG
- An insight into data warehousing and data mining from an industry perspective

Thank You!
- I hope you enjoyed the lecture and learned some interesting stuff
- Next semester's master courses:
  - Relational Databases 2
  - Deductive Databases and Knowledge-Based Systems
  - Lab: Integrity Constraints
  - Seminar: Advances in Digital Libraries