Article from Predictive Analytics and Futurism, July 2016, Issue 13


An Introduction to Incremental Learning
By Qiang Wu and Dave Snell

Machine learning provides useful tools for predictive analytics. The typical machine learning problem can be described as follows: a system produces a specific output for each given input. The mechanism underlying the system can be described by a function that maps the input to the output. Human beings do not know the mechanism but can observe the inputs and outputs. The goal of a machine learning algorithm is to infer the mechanism from a set of observations collected for the input and output. Mathematically, we use (x_i, y_i) to denote the i-th pair of observed input and output. If the real mechanism of the system that produces the data is described by a function f*, then the true output is supposed to be f*(x_i). However, due to systematic noise or measurement error, the observed output y_i satisfies y_i = f*(x_i) + ϵ_i, where ϵ_i is an unavoidable but hopefully small error term. The goal, then, is to learn the function f* from the n pairs of observations {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.

A machine learning algorithm must first specify a loss function L(y, f(x)) to measure the error that will occur when we use f(x) to predict the output y for an unobserved x. We use the term "unobserved x" to describe new observations outside our training set. We wish to find a function such that the total loss on all unobserved data is as small as possible. Ideally, for an appropriately designed loss function, f* is the target function; in this case, if we could compute the total loss on all unobserved data, we could find f* exactly. Unfortunately, computing the total loss on unobserved data is impossible. A machine learning algorithm therefore usually searches for an approximation of f* by minimizing the loss on the observed data, called the empirical loss. The term generalization error measures how well a function having small empirical loss can predict unobserved data.

There are two machine learning paradigms. Batch learning refers to machine learning methods that use all the observed data at once. Incremental learning (also called online learning) refers to machine learning methods that apply to streaming data collected over time; these methods update the learned function accordingly as new data come in. Incremental learning mimics the human process of learning from experience. In this article, we introduce three classical incremental learning algorithms: stochastic gradient descent for linear regression, the perceptron for classification, and incremental principal component analysis.

STOCHASTIC GRADIENT DESCENT

In linear regression, f*(x) = w^T x is a linear function of the input vector. The usual choice of loss function is the squared loss L(y, w^T x) = (y − w^T x)^2. The gradient of L with respect to the weight vector w is given by

∇_w L = −2(y − w^T x) x.

Note that the gradient is the direction in which the function increases, so if we want the squared loss to decrease, we need to move the weight vector opposite to the gradient. This motivates the stochastic gradient descent algorithm for linear regression. The algorithm starts with an initial guess of w as w_0. At time t, we receive the t-th observation x_t and predict the output as ŷ_t = w_{t−1}^T x_t. After we observe the true output y_t, we update the estimate of w by

w_t = w_{t−1} + η_t (y_t − w_{t−1}^T x_t) x_t,

where the constant factor from the gradient has been absorbed into the step size. The number η_t > 0 is called the step size. Theoretical study shows that w_t becomes closer and closer to the true coefficient vector w provided the step size is properly chosen; a typical choice is a decaying schedule such as η_t = η_0 / √t for some predetermined constant η_0.
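The loop above is easy to implement. The following is a minimal Python sketch of the stochastic gradient descent update just described, assuming the squared loss and a step size of the form η_t = η_0 / √t; the function name, the simulated weights and the sample size in the usage example are illustrative assumptions, not the exact settings of the article's simulation.

    import numpy as np

    def sgd_linear_regression(stream, dim, eta0=0.1):
        # Online least squares via stochastic gradient descent.
        # stream: iterable of (x_t, y_t) pairs arriving one at a time
        # dim:    number of predictors
        # eta0:   base step size; eta_t = eta0 / sqrt(t) is one common decaying schedule
        w = np.zeros(dim)                      # initial guess w_0
        for t, (x, y) in enumerate(stream, start=1):
            eta = eta0 / np.sqrt(t)            # decaying step size
            y_hat = w @ x                      # predict before the true output is revealed
            w = w + eta * (y - y_hat) * x      # move opposite the gradient of (y - w'x)^2
        return w

    # Illustrative use on simulated data (weights and sample size are hypothetical):
    rng = np.random.default_rng(0)
    w_true = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
    X = rng.uniform(0.0, 1.0, size=(2000, 5))
    y = X @ w_true + rng.normal(0.0, 0.1, size=2000)
    w_hat = sgd_linear_regression(zip(X, y), dim=5)
    print(np.linalg.norm(w_hat - w_true))      # estimation error after one pass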

Another quantity used to measure the effectiveness is the accumulated regret after T steps, defined by

R_T = Σ_{t=1}^{T} [ L(y_t, w_{t−1}^T x_t) − L(y_t, w^T x_t) ].

If this algorithm is used in a financial decision-making process and w^T x_t is the optimal decision at step t, the regret measures the total additional loss incurred because the decisions are not optimal. In theory, the regret is bounded, implying that the average additional loss resulting from one decision is minimal when T is large.

We use a simulation to illustrate the use and the effect of this algorithm. Assume that in a certain business there are five risk factors, each of which may drive the financial losses up or down. The loss is the weighted sum of these factors plus some fluctuation due to noise, y = w^T x + ϵ, where the entries of the true weight vector w alternate in sign. We assume each risk factor takes values in a bounded range and that the noise follows a mean-zero normal distribution with a small variance, chosen empirically to keep the noise small relative to the signal. We generate the data points sequentially to mimic the data-generating process and perform the learning starting from an initial estimate w_0. In Figure 1 we plot the distance between w_t and w, showing that the estimation error decays quickly (which is desirable). In Figure 2 we plot the regret at each step. Most of the additional losses occur at the beginning because we used a deliberately poor initial guess; the regret increases very slowly after the early steps, indicating that the decisions become near optimal. In other words, even a poor guess can lead to excellent results after a sufficient number of steps.

Figure 1: Estimation Error vs. Iterations
Figure 2: Regret vs. Iterations

PERCEPTRON

In a classification problem, the target is to develop a rule that assigns a label to each instance. For example, in auto insurance a driver could be labeled as a high-risk or low-risk driver; in financial decision-making, one can determine whether an action should be taken or not. In a binary classification problem, where there are two classes, the labels for the two classes are usually taken as 0 and 1 or as −1 and +1. When −1 and +1 are used as the two labels, the classifier can be determined by the sign of a real-valued function. A linear classifier is the sign of a linear function of the predictors, f(x) = sign(w^T x). Mathematically, w^T x = 0 forms a separating hyperplane in the space of predictors.

The perceptron for binary classification is an algorithm that incrementally updates the weight vector of the hyperplane after receiving each new instance. It starts with an initial vector w_0, and when each new instance (x_t, y_t) is received, the coefficient vector is updated by

w_t = w_{t−1} + y_t x_t   if y_t (w_{t−1}^T x_t) ≤ γ,
w_t = w_{t−1}             otherwise,

where γ ≥ 0 is a user-specified parameter called the margin. The original perceptron introduced by Rosenblatt in the 1950s has a margin of zero, i.e., γ = 0. The perceptron can be explained as follows. If y_t (w_{t−1}^T x_t) < 0, the t-th observation is classified incorrectly, and the rule is updated to decrease the chance of it being misclassified again. If y_t (w_{t−1}^T x_t) > 0, the t-th observation is classified correctly and no update is necessary. The idea of using a positive margin comes from the well-known support vector machine classification algorithm: the classification is considered unstable if an observation lies too close to the decision boundary, even when it is classified correctly, so updating is still required in that case as a penalty. The classification rule is left unchanged only when an instance is both classified correctly and has a margin from the decision boundary.
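A minimal Python sketch of this update rule follows; the function and variable names are illustrative, and setting margin = 0 recovers Rosenblatt's original perceptron while a positive margin gives the variant described above.

    import numpy as np

    def perceptron_online(stream, dim, margin=0.0):
        # Online (margin) perceptron as described above.
        # stream: iterable of (x_t, y_t) pairs with labels y_t in {-1, +1}
        # margin: gamma >= 0; gamma = 0 is Rosenblatt's original perceptron
        w = np.zeros(dim)                      # initial vector w_0
        for x, y in stream:
            if y * (w @ x) <= margin:          # wrong, or right but too close to the boundary
                w = w + y * x                  # push the hyperplane toward classifying x correctly
            # otherwise: correctly classified with enough margin, so no update
        return w

    def classify(w, x):
        # Linear classifier f(x) = sign(w'x)
        return 1 if w @ x >= 0 else -1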

For the perceptron, the cumulative classification accuracy, defined as the percentage of instances classified correctly so far, can be used to measure the effectiveness of the algorithm. In Figure 3 we simulated data points for two classes, with the positive class normally distributed around one center and the negative class normally distributed around another. The optimal separating boundary is a line of the form x_1 − x_2 = constant, which achieves a classification accuracy of 92.14 percent; that is, there is a systematic error of 7.86 percent. We assume the data points come in sequentially and apply the perceptron algorithm. The cumulative classification accuracy is shown in Figure 4. As desired, the classification ability of the perceptron is near optimal after some number of updates.

Figure 3: Data for a Binary Classification Problem
Figure 4: Cumulative Classification Accuracy of the Perceptron

INCREMENTAL PCA

Principal component analysis (PCA) is probably the most famous feature extraction tool for analytics professionals. The principal components are linear combinations of the predictors that preserve the most variability in the data. Mathematically, they are defined as the directions on which the projection of the data has the largest variance, and they can be calculated as the eigenvectors associated with the largest eigenvalues of the covariance matrix. PCA can also be implemented in an incremental manner. For the first principal component v, the algorithm can be described as follows. It starts with an initial estimate v_0, and when a new instance x_t comes in, the estimate is updated using the term x_t x_t^T v_{t−1}; one standard form, the candid covariance-free update cited in the endnotes, is

v_t = ((t − 1)/t) v_{t−1} + (1/t) x_t (x_t^T v_{t−1}) / ‖v_{t−1}‖.

The accuracy can be measured by the distance between the estimated principal component and the true one. Again, we use a simulation to illustrate its use and effectiveness. We generated data points from a multivariate normal distribution with a specified mean vector μ and covariance matrix.
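The streaming update can be sketched in a few lines of Python. The sketch below follows the covariance-free style of the endnote reference by Weng, Zhang and Hwang, with a running mean added so the data are centered on the fly; the centering and the initialization from the first observation are simplifying assumptions rather than the article's exact implementation.

    import numpy as np

    def incremental_first_pc(stream, dim):
        # Streaming estimate of the leading principal component.
        mean = np.zeros(dim)
        v = None
        for t, x in enumerate(stream, start=1):
            u = x - mean                       # center x using the mean of past observations
            mean = mean + (x - mean) / t       # then fold x into the running mean
            if v is None:
                v = u.copy()                   # first observation provides the initial estimate
                continue
            proj = u @ (v / np.linalg.norm(v)) # projection of u onto the current direction
            v = ((t - 1) / t) * v + (1.0 / t) * proj * u
        return v / np.linalg.norm(v)           # unit-length estimate of the first principal component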

The true first principal component of this distribution is concentrated on the first two variables. In Figure 5, a scatter plot shows the first two variables of the data, with the red line indicating the direction of the first principal component. After applying the incremental PCA algorithm, the distance between the estimated principal component and the true principal component is plotted for each step in Figure 6. As expected, the distance shrinks toward zero as more and more data points come in.

Figure 5: Feature Abstraction via Principal Component Analysis
Figure 6: Estimation Error from Principal Component Analysis

REMARKS

We close with a few remarks. First, incremental learning has very important application domains, for example personalized handwriting recognition for smartphones and sequential decision-making for financial systems. In real applications, batch learning methods are usually applied to a body of past experience to set up the initial estimator; this helps avoid large losses at the beginning. Incremental learning can then be used to refine or personalize the estimate. Second, we have introduced these algorithms for linear models; all of them can be extended to nonlinear models by using the so-called kernel trick in machine learning. Finally, we would mention that the term online learning seems more popular in the machine learning literature; however, we prefer the term incremental learning because "online learning" is widely used to refer to learning over the Internet and can easily confuse people. In fact, a Google search for "online learning" will probably not return what you want; search for "online machine learning" instead.

Qiang Wu, PhD, ASA, is associate professor at Middle Tennessee State University in Murfreesboro, Tenn. He can be reached at qwu@mtsu.edu.

Dave Snell, ASA, MAAA, is technology evangelist at RGA Reinsurance Company in Chesterfield, Mo. He can be reached at dave@ActuariesAndTechnology.com.

ENDNOTES

1. Vladimir N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
2. Juyang Weng, Yilu Zhang, and Wey-Shiuan Hwang, "Candid Covariance-Free Incremental Principal Component Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence 25(8), 2003, 1034-1040.
3. Wikipedia, "Online Machine Learning," https://en.wikipedia.org/wiki/Online_machine_learning
4. Wikipedia, "Perceptron," https://en.wikipedia.org/wiki/Perceptron