Soft and hard models. Jiří Militky Computer assisted statistical modeling in the

1 Soft and hard models. Jiří Militky. Computer assisted statistical modeling in the applied research

2 Monsters, giants. Moore's law: processing capacity (CPU, cache, memory) doubles every 18 months. Disk storage is more aggressive: capacity doubles every 9 months. Example of extreme model dependence for out-of-sample predictions. What do the two laws combined produce? A rapidly growing gap between our ability to generate data and our ability to make use of it. [Figure: disk TB shipped per year on a logarithmic axis with an exabyte marker; disk TB growth 112%/y versus Moore's law 58.7%/y. Source: Disk Trend (Jim Porter), http://www.disktrend.]
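The relation between a doubling time and a yearly growth rate can be checked with a few lines of arithmetic. This is only a sketch; the function names are illustrative and not part of the slides. A doubling time of 18 months reproduces the 58.7%/y label, while the 112%/y label in the figure corresponds to a doubling time of roughly 11 months rather than 9.

```python
import math

def annual_growth(months: float) -> float:
    """Fractional growth per year implied by a doubling time in months."""
    return 2.0 ** (12.0 / months) - 1.0

def doubling_time_months(annual_rate: float) -> float:
    """Doubling time in months implied by a fractional growth per year."""
    return 12.0 * math.log(2.0) / math.log(1.0 + annual_rate)

moore = annual_growth(18.0)                     # Moore's law, 18-month doubling
disk = annual_growth(9.0)                       # disk capacity, 9-month doubling
disk_label_months = doubling_time_months(1.12)  # doubling time behind 112%/y

print(f"Moore: {moore:.1%}/y, disk (9 months): {disk:.1%}/y")
print(f"112%/y corresponds to doubling every {disk_label_months:.1f} months")
```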

3 Basic terms. Models represent the relationships between variables: $y = X b$, with $y$ of size $(n \times 1)$, $X$ of size $(n \times m)$ and $b$ of size $(m \times 1)$. Independent variable: x-value, predictor, input, explanatory. Dependent variable: y-value, predictand, output, response.

4 Models. High bias, low variance. Nonlinear regression. Low bias, high variance: overfitting, i.e. modeling the random component.

5 Style of analysis. Data exploration. Simplification of data structures. Interactive model selection. Interpretation of results. Depth contours: multidimensional analog of the median.

6 Dimensionality problem I. A basic characteristic of multivariate data is their dimension (number of elements). High dimensions bring about huge problems in their statistical analysis. Variables reduction: variables often have variability at the noise level and can therefore be excluded from the data (they bring no information). There are some redundancies due to near-linear dependencies between some variables or due to linkages arising from their physical essence. In both cases it is possible to replace the original set by a reduced number of uncorrelated new variables. [Figure labels: setosa, versicolor, virginica.]
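One common way to replace correlated variables with a smaller set of uncorrelated ones is principal component analysis. The sketch below is an illustration on made-up data, not the analysis from the slides: the three variables, the noise levels and the 99% variance threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = rng.normal(size=n)
# Hypothetical data: x2 is nearly a linear function of x1,
# x3 varies only at the noise level.
X = np.column_stack([
    t,
    2.0 * t + 0.05 * rng.normal(size=n),
    0.01 * rng.normal(size=n),
])

Xc = X - X.mean(axis=0)                        # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)                # variance share per component
scores = Xc @ Vt.T                             # new, uncorrelated variables

# Keep the smallest number of components covering 99% of the variance
# (the threshold is an arbitrary choice for this illustration).
k = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)
reduced = scores[:, :k]
cov_scores = np.cov(scores, rowvar=False)      # diagonal up to rounding

print("explained variance shares:", np.round(explained, 5))
print("components kept:", k)
```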

7 Dimensionality problem II. Multivariate curse: the number of data necessary for achieving a given precision of multivariate estimates is an exponential function of the number of variables. Empty space phenomenon: multivariate data are concentrated on the peripheral part of the variable space. Distance problem: the distance between objects is often weighted by the strength of the mutual links between variables.

8 Distance. Euclidean distance: $d_i^2 = (x_i - x_A)^T (x_i - x_A)$. Mahalanobis distance: $d_i^2 = (x_i - x_A)^T S^{-1} (x_i - x_A)$.
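The difference between the two distances shows up as soon as the variables are correlated. The following sketch assumes a hypothetical 2D covariance matrix S with correlation 0.9 and a reference point at the origin; two points with the same Euclidean distance get very different Mahalanobis distances.

```python
import numpy as np

def euclidean_sq(x, center):
    """Squared Euclidean distance (x - center)^T (x - center)."""
    d = np.asarray(x, float) - np.asarray(center, float)
    return float(d @ d)

def mahalanobis_sq(x, center, S):
    """Squared Mahalanobis distance (x - center)^T S^{-1} (x - center)."""
    d = np.asarray(x, float) - np.asarray(center, float)
    return float(d @ np.linalg.solve(S, d))   # solve avoids an explicit inverse

# Hypothetical covariance matrix with strong positive correlation.
S = np.array([[1.0, 0.9],
              [0.9, 1.0]])
center = np.zeros(2)
along = np.array([1.0, 1.0])     # lies along the correlation direction
across = np.array([1.0, -1.0])   # lies across it

for name, p in [("along", along), ("across", across)]:
    print(name, euclidean_sq(p, center), mahalanobis_sq(p, center, S))
```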

9 Curse of dimensionality. Consider n points scattered at random in a K-dimensional unit sphere. Let D be the distance from the centre of the sphere to the closest point. Smoothing doesn't work in high dimensions: points are too far apart. Solution: pick the estimate of f(x) from a class of functions that is flexible enough to match f(x) reasonably closely. [Table: median of the distribution of D for several values of n and K; values not recoverable.]
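The table of medians can be regenerated under one assumption: that the n points are uniformly distributed in the unit ball. Then P(D > d) = (1 - d^K)^n, and setting this to 1/2 gives the median (1 - 2^(-1/n))^(1/K). The sketch below uses that standard result; the values of n and K are illustrative choices, not the ones from the original slide.

```python
def median_closest_distance(n: int, K: int) -> float:
    """Median distance from the centre to the closest of n uniform points
    in a K-dimensional unit ball: (1 - 0.5**(1/n))**(1/K)."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / K)

# Illustrative grid of sample sizes and dimensions.
for K in (1, 2, 5, 10, 20):
    row = [median_closest_distance(n, K) for n in (100, 1000, 10000)]
    print(f"K={K:2d}", " ".join(f"{d:.3f}" for d in row))
```

Even with thousands of points, the closest point moves towards the boundary as K grows, which is why local smoothing breaks down in high dimensions.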

10 Curse of dimensionality. Multivariate normal distribution $X \sim \mathrm{MVN}_p(0, I)$. Gaussian kernel density estimation. Bandwidth chosen to minimize MSE at the mean. Suppose we want: $E[(\hat{p}(x) - p(x))^2] / p(x)^2 < 0.1$ at $x = 0$.

11 Data projection: PCA, PP. Usually, the first two principal components (PCs) are used for a 2D projection. The information from the last two PCs can be interesting as well. These projections preserve angles and distances between objects (points). On the other hand, there is no objective criterion here for revealing hidden structures in the data. The linear projections of multivariate data in projection pursuit (PP) satisfy a criterion called the projection index $IP(C_i)$. The projection vectors $C_i$ maximizing $IP(C_i)$ under the constraints $C_i^T C_i = 1$ are computed. With $f_P(x)$ the pdf of the data in the projection, $IP(C) = \int f_P^2(x)\,dx$. The projection on these vectors is then $C_i^T X$.

12 Hard and soft models. According to the actual type of task, an approach to building the model $f(x, \beta)$ is chosen. For the so-called hard models the main aim is to select an adequate function $f(x, \beta)$. This function is typically in explicit form and is used instead of the original. The so-called soft models are in fact used for approximation of an unknown function given by a table of values $\{x_i, y_i\}$, $i = 1, \ldots, n$. The function $f(x, \beta)$ is here often replaced by a linear combination of some elementary functions.

13 Soft models. Smoothing (low-dimensional problems): loess, splines. Regression models in high dimensions: PPR, additive, interaction, neural nets, trees, MARS. [Figure label: overfitting.]

14 Linear models and extensions. $y = f(x_1, \ldots, x_p) + \text{error}$, $\chi^2 = \sum_{i=1}^{n} \frac{(y_i - y_{\mathrm{fit},i})^2}{\sigma_i^2}$

15 Various models-similar results

16 Convolution kernels. The data define a stepwise function. Kernel properties: centered around zero, symmetric, has finite support, area under the curve equals 1.
Gaussian: $K_N(x) = c \exp(-x^2/2)$
Epanechnikov: $K_E(x) = c\,(1 - x^2)$ for $|x| \le 1$, 0 otherwise
Tent: $K_T(x) = c\,(1 - |x|)$ for $|x| \le 1$, 0 otherwise
Box: $K_U(x) = c$ for $|x| \le 1$, 0 otherwise
New (averaged) points are obtained by convolving the kernel with the data $P_i$: 1. Slide the kernel over all points; the new value at $x$ is $P(x) = \sum_{i=1}^{n} K(x - x_i)\, P_i$. 2. Watch for overlap (area of overlap) at the beginning and end.
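The kernel averaging step can be sketched as follows. Two choices here are assumptions rather than something stated on the slide: the Epanechnikov constant c = 3/4 (which makes the area equal to 1), and dividing by the sum of the kernel weights, which is one simple way to handle the reduced overlap at the beginning and end of the data.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel with c = 3/4, so that the area under it is 1."""
    u = np.asarray(u, float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kernel_smooth(x, p, bandwidth, kernel=epanechnikov):
    """Slide the kernel over all points and return the averaged values.

    Each new value is a weighted sum of the data p; dividing by the sum
    of weights compensates for the kernel hanging over the data ends.
    """
    x = np.asarray(x, float)
    p = np.asarray(p, float)
    w = kernel((x[:, None] - x[None, :]) / bandwidth)   # n x n weights
    return (w @ p) / w.sum(axis=1)

x = np.linspace(0.0, 1.0, 21)
rng = np.random.default_rng(0)
noisy = np.sin(2.0 * np.pi * x) + 0.2 * rng.normal(size=x.size)
smoothed = kernel_smooth(x, noisy, bandwidth=0.2)
print(np.round(smoothed[:5], 3))
```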

17 Flexible linear models. We need a flexible method for constructing a function $y = f(x)$ that can track local curvature. We pick a system of K basis functions $f_k(x)$, and call this the basis for $f(x)$. We express $f(x)$ as a weighted sum of these basis functions: $f(x) = a_1 f_1(x) + a_2 f_2(x) + \ldots + a_K f_K(x)$. The coefficients $a_1, \ldots, a_K$ determine the shape of the function. Powers: $1, x, x^2$, and so on. They are the basis functions for polynomials. These are not very flexible, and are used only for simple problems. Fourier series: $1, \sin(\omega x), \cos(\omega x), \sin(2\omega x), \cos(2\omega x)$, and so on for a fixed known frequency $\omega$. These are used for periodic functions. B-spline functions: these have now more or less replaced polynomials for non-periodic problems. [Figure: five Fourier basis functions $\varphi_k(t)$ plotted against t.]
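Because the model is linear in the coefficients, they can be estimated by ordinary least squares once the basis is evaluated at the data points. The sketch below does this for five Fourier basis functions; the frequency, the sample grid and the "true" coefficients are made up for illustration.

```python
import numpy as np

def fourier_basis(x, omega, n_harmonics):
    """Columns: 1, sin(wx), cos(wx), sin(2wx), cos(2wx), ..."""
    x = np.asarray(x, float)
    cols = [np.ones_like(x)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(k * omega * x))
        cols.append(np.cos(k * omega * x))
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 50, endpoint=False)
omega = 2.0 * np.pi                                # assumed known frequency
true_a = np.array([0.5, 1.0, -0.3, 0.0, 0.2])      # hypothetical coefficients

F = fourier_basis(x, omega, n_harmonics=2)         # K = 5 basis functions
y = F @ true_a                                     # noise-free data for the check
a_hat, *_ = np.linalg.lstsq(F, y, rcond=None)      # least-squares coefficients
print(np.round(a_hat, 6))
```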

18 Continuity: $C^0$, $C^1$, $C^2$

19 A measure of roughness. What do we mean by smooth? A function that is smooth has limited curvature. Curvature depends on the second derivative. A straight line is completely smooth. We can measure the roughness of a function $y(x)$ by integrating its squared second derivative. In derivative notation the second derivative is $D^2 f(x)$: $\mathrm{PEN}_2(f) = \int [D^2 f(x)]^2\,dx$. When we want acceleration to be smooth, we measure roughness at the level of acceleration: $\mathrm{PEN}_4(f) = \int [D^4 f(x)]^2\,dx$. Penalized least squares: $\mathrm{PLS}(f) = \sum_i (y_i - f(x_i))^2 + \lambda\,\mathrm{PEN}(f)$. The parameter $\lambda$ controls roughness. When $\lambda = 0$, only fitting the data matters. As $\lambda$ increases, more emphasis is placed on penalizing roughness. As $\lambda \to \infty$, only roughness matters, and functions having zero roughness are used.
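The effect of $\lambda$ can be seen in a discrete analog of penalized least squares, where the integral of the squared second derivative is replaced by a sum of squared second differences on an equally spaced grid. This swap is an assumption made to keep the sketch short; it is not the spline-based method of the slides. The minimizer then solves a linear system.

```python
import numpy as np

def penalized_smooth(y, lam):
    """Minimize sum (y_i - f_i)^2 + lam * sum (second differences of f)^2.

    The solution satisfies (I + lam * D^T D) f = y, where D is the
    second-difference operator.
    """
    y = np.asarray(y, float)
    n = y.size
    D = np.diff(np.eye(n), n=2, axis=0)        # (n - 2) x n matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

y = np.sin(np.arange(30.0))                    # deliberately wiggly data
for lam in (0.0, 1.0, 100.0, 1e8):
    f = penalized_smooth(y, lam)
    roughness = np.sum(np.diff(f, n=2) ** 2)
    print(f"lambda={lam:g}: roughness={roughness:.3e}")
```

With `lam = 0` the data are reproduced exactly, and for very large `lam` the result approaches a straight line, whose discrete roughness is zero.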

20 Smoothing by cubic splines. Basis functions are composed from cubic polynomial segments, and they belong to the class $C^2[a, b]$. Generally a function from the class $C^m[a, b]$ is continuous in the interval $[a, b]$ in functional values and in the first m derivatives. Finding the best smoothing function $f(x)$ leads to the minimization of the modified sum of squares $\sum_{i=1}^{n} (y_i - f(x_i))^2 + \alpha \int [f''(x)]^2\,dx$. The first term measures the degree of fit to the data, the second term measures smoothness, and $\alpha$ controls the tradeoff between smoothness and closeness to the data. [Figure: true response, noisy response and estimated response.]

21 Runge data example. The 11 points in the interval [-1, 1] were generated from the Runge function corrupted by normal noise with level c. [Figure: data, model and spline fit in four panels, c = .1 (alpha = .62), c = .5, c = .25 and c = .75; the remaining alpha values are not recoverable.]

22 Regression splines. Truncated polynomials: $S(x) = \sum_{j=0}^{m} a_j x^j + \sum_{j=1}^{n} b_j (x - \xi_j)_+^m$, where $(x - \xi_j)_+ = x - \xi_j$ for $x > \xi_j$ and $0$ for $x \le \xi_j$. The corresponding model is linear in the parameters a and b and contains in total (n + m + 1) parameters. When the number and positions of the knots are also estimated, the corresponding model is nonlinear. B-splines: $f(x) = \sum_j \beta_j h_j(x)$ with B-spline basis functions $h_j(x)$.
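With fixed knots, the truncated polynomial model is an ordinary linear regression on (n + m + 1) columns. The sketch below builds that design matrix and fits it by least squares; the knot positions and the test function are illustrative assumptions.

```python
import numpy as np

def truncated_power_basis(x, knots, degree):
    """Columns x^0..x^m followed by (x - xi_j)_+^m for each knot xi_j."""
    x = np.asarray(x, float)
    powers = [x**j for j in range(degree + 1)]
    trunc = [np.maximum(x - xi, 0.0) ** degree for xi in knots]
    return np.column_stack(powers + trunc)

x = np.linspace(-1.0, 1.0, 41)
knots = [-0.5, 0.0, 0.5]                 # hypothetical fixed knots (n = 3)
B = truncated_power_basis(x, knots, degree=3)

# |x|^3 = -x^3 + 2 (x)_+^3 lies exactly in the span of this basis,
# so the fit should reproduce it up to rounding.
y = np.abs(x) ** 3
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
fit = B @ coef
print("number of parameters:", B.shape[1])
print("max abs error:", np.max(np.abs(fit - y)))
```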

23 B splines

24 Universal 95% CI confidence bands. Bootstrap: randomly draw datasets with replacement from the training data. Each dataset has the same size as the original training set. Parametric bootstrap: simulate new data by adding noise to the predicted values. Bagging: bootstrap aggregating.
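A minimal sketch of the nonparametric bootstrap for pointwise 95% bands is given below. The data, the straight-line model, the number of resamples and the use of percentile bands are all assumptions chosen for illustration; the same loop works for any fitting procedure. Averaging the bootstrap curves gives the bagged estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=x.size)   # hypothetical data

def fit_predict(x_train, y_train, x_new, degree=1):
    """Fit a polynomial by least squares and predict at x_new."""
    coef = np.polyfit(x_train, y_train, degree)
    return np.polyval(coef, x_new)

n_boot = 500
curves = np.empty((n_boot, x.size))
for b in range(n_boot):
    # Draw a dataset of the original size, with replacement.
    idx = rng.integers(0, x.size, size=x.size)
    curves[b] = fit_predict(x[idx], y[idx], x)

lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)  # 95% band
bagged = curves.mean(axis=0)                               # bagging estimate
print(np.round(lower[:3], 3), np.round(upper[:3], 3))
```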

25 Monotonicity. Any strictly monotonic function $y(x)$ must satisfy a simple linear differential equation: $D^2 y(x) = w(x)\, D^1 y(x)$. Because of strict monotonicity, the first derivative $D^1 y(x)$ will never be 0, and the function $w(x)$ is therefore simply $D^2 y(x) / D^1 y(x)$. Any strictly monotonic function $y(x)$ must be expressible in the form $y(x) = \beta_0 + \beta_1 \int_0^x \exp\left[\int_0^u w(v)\,dv\right] du$. The unconstrained function $w(v)$ could be a B-spline.
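The construction can be tried numerically: take any unconstrained function w, integrate it, exponentiate, and integrate again. The sketch below uses the trapezoidal rule on a grid starting at 0 and an arbitrary oscillating w; both are illustrative assumptions. Because the exponential is always positive, the result is strictly increasing for positive $\beta_1$.

```python
import numpy as np

def monotone_from_w(x, w_values, beta0=0.0, beta1=1.0):
    """Evaluate beta0 + beta1 * int_0^x exp(int_0^u w(v) dv) du on a grid.

    Both integrals are approximated with the cumulative trapezoidal rule.
    """
    x = np.asarray(x, float)
    w_values = np.asarray(w_values, float)
    dx = np.diff(x)
    inner = np.concatenate(
        [[0.0], np.cumsum(0.5 * (w_values[1:] + w_values[:-1]) * dx)]
    )
    g = np.exp(inner)                          # strictly positive integrand
    outer = np.concatenate([[0.0], np.cumsum(0.5 * (g[1:] + g[:-1]) * dx)])
    return beta0 + beta1 * outer

x = np.linspace(0.0, 1.0, 201)
w = 5.0 * np.sin(12.0 * x)                     # arbitrary unconstrained w
y = monotone_from_w(x, w)
print("strictly increasing:", bool(np.all(np.diff(y) > 0)))
```

With w identically zero the construction reduces to the straight line $\beta_0 + \beta_1 x$.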

26 Neural networks. Brain: number of neurons ~$10^{11}$; connections per neuron ~$10^4$ to $10^5$; neuron switching time ~0.001 second; scene recognition time ~0.1 second. Perceptron: $\sigma(x) = \frac{1}{1 + e^{-x}}$

27 Statistics vs. neural networks. Model: network. Estimation: learning. Regression: supervised learning. Observations: training set. Independent variables: inputs. Dependent variables: outputs. Parameters: synaptic weights. $f(x) = \sum_{j=1}^{m} w_j h_j(x)$, where $h_j(x)$ are the basis functions (hidden units) and $w_j$ the weights. Logistic basis (artificial neural networks): $h(x) = \frac{1}{1 + \exp(b^T x - b_0)}$

28 Regression vs. neural networks. Neural networks are very useful when there is no idea of the functional relationship between the dependent and independent variables. If there is an idea, it will be better to use a regression model. Neural networks are not based on an assumed functional relationship between the independent variables (predictors) and the response; the data alone define the functional form.

29 Classical Neural Networks Nodes (neurons) connected in layers

30 Varying parameters I

31 Varying parameters II

32 Radial basis functions (RBF). The response decreases or increases monotonically with distance from a central point. Gaussian RBF: $h(x) = \exp\left(-\frac{(x - c)^2}{r^2}\right)$, with c the center and r the radius. [Figures: radial basis transfer function, output a versus input p; weighted sum of radial basis transfer functions.]

33 RBF network: single-layer NN. Each of the n components of the input vector feeds forward to m basis functions whose outputs are linearly combined with weights $w_j$.
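When the centers and radii are fixed, the weights of an RBF network follow from linear least squares. The sketch below is a simplified one-dimensional illustration with Gaussian basis functions; the target function, the eight equally spaced centers and the radius are assumptions, and no optimization of centers or radii is attempted.

```python
import numpy as np

def rbf_design(x, centers, r):
    """Hidden-layer outputs h_j(x) = exp(-(x - c_j)^2 / r^2), one column per center."""
    x = np.asarray(x, float)
    centers = np.asarray(centers, float)
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / r**2)

x = np.linspace(-1.0, 1.0, 50)
y = np.sin(np.pi * x)                      # hypothetical target
centers = np.linspace(-1.0, 1.0, 8)        # m = 8 hidden units, fixed in advance
r = 0.5                                    # common radius, chosen by hand

H = rbf_design(x, centers, r)
w, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights w_j
y_hat = H @ w                              # network output: sum_j w_j h_j(x)
print("max abs training error:", np.max(np.abs(y_hat - y)))
```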

34 Prediction vs. prognosis. [Figure: target and output versus input, with regions labeled "good prediction" and "bad prognosis". Surviving code fragments from the slide: P = -1:.1:1; T = [ ]; p = -3:.1:]

35 Regression RBF tree. [Figure: RBF tree fit to the data, predicted y, m = 3. The listed centers c (8.375, 7, ...), weights w and radii r (..., 6, ...) are only partially recoverable.]

36 Runge data example. Results of optimized RBF neural network regression (NETLAB) for the Runge model with noise level c. [Figure: four panels of target versus input for different numbers of hidden nodes.]

37 Neural networks drawbacks. Hard to interpret the individual effects of each predictor variable on the response. Poor extrapolation properties. The programs are filled with settings that must be input, and a small error in them will cause an error in the predictions as well. No shape preserving; not possible to add information about limiting behavior. Regression performs better when theory or experience indicates an underlying relationship. The connection weights usually do not have obvious interpretations. ANNs do not produce an explicit model even though new cases can be fed into them and new results obtained.

38 Gains from neural networks. Can reduce preliminary analysis in modeling: discovery of interactions and nonlinear relationships becomes automatic. Increases predictive power of models (tunable smoothness). Since they are data dependent, performance will improve as sample size increases. Flexibility and ease of maintenance: ANNs are very flexible in adapting their behavior to new and changing environments. They are also easier to maintain, with some having the ability to learn from experience to improve their own performance.

39 Data locations. Maximal spread of explanatory variables. Small variability often leads to non-significance of a variable. Nearly uniform data location (against DOE). [Figures: four scatter plots of y.]

40 Real data. Nature, experiment, data, facts, hypothesis, analysis. Various error sources and noise due to experiments; parasite variables; false correlations; complexity of effects. Non-additivity: $f(x + y) \ne f(x) + f(y)$. Interaction of effects. Non-linearity: $f(ax + b) \ne a f(x) + b$. [Figure: A age, W weight, M manoeuvrability.]

41 Thank you!!!
