
Chaos, Complexity, and Inference (36-462)
Lecture 4
Cosma Shalizi
22 January 2009

Reconstruction
Inferring the attractor from a time series; powerful in a weird way
Using the reconstructed attractor to make forecasts

Reconstruction
What can we learn about the dynamical system from observing a trajectory?
Assume it's reached an attractor; the attractor is a function of the parameters; invert the function
The function from parameters to attractors can be very ugly
Assumes we know the dynamics up to parameters
The second problem is bigger! Will see later an approach ("indirect inference") for parametric estimation with messy models
Do we need parameters?

A gross simplification of Takens's Embedding Theorem
Suppose X_t is a state vector
Suppose the map/flow is sufficiently smooth
Suppose X_t has gone to an attractor of dimension d
Suppose we observe S_t = g(X_t), where g is again sufficiently smooth
Pick a time-lag τ and a number of lags k
Time-delay vector R^(k)_t = (S_t, S_{t-τ}, S_{t-2τ}, ..., S_{t-(k-1)τ})
THEOREM: If k ≥ 2d + 1, then, for generic g and τ, R^(k)_t acts just like X_t, up to a smooth change of coordinates
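To make the definition concrete, here is a minimal sketch of building the delay vectors by hand (illustrative only: the function name delay.embed and its arguments are mine, not from any package; in practice the tseriesChaos function embedd, used below, does this job):

# Build the k-dimensional time-delay vectors R_t = (S_t, S_{t-tau}, ..., S_{t-(k-1)tau})
# from a scalar series s, with tau measured in sampling steps (illustrative sketch)
delay.embed <- function(s, k, tau) {
  start <- (k - 1) * tau + 1        # earliest t with a full set of lags
  t(sapply(start:length(s), function(t) s[t - (0:(k - 1)) * tau]))
}

For the Lorenz series used below, delay.embed(x.ts, 3, 40) should trace out the same reconstructed attractor as embedd(x.ts, 3, 40), up to the ordering of the coordinates.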

Determinism
State has d dimensions, so knowing d coordinates at once fixes the trajectory, OR knowing one variable at d times fixes the trajectory (think of the Hénon map)
We don't get to observe the state variables, so we may need extra observations
Turns out we never need more than d + 1 extras
Sometimes we don't need the extras [Packard et al., 1980]

Geometry
Trajectories are d-dimensional (because they are on the attractor, and the current state fixes future states)
So time series are also only d-dimensional, but they might live on a weirdly curved d-dimensional space
Geometric fact (Whitney embedding theorem): any curved d-dimensional space fits into a (2d + 1)-dimensional ordinary Euclidean space [Takens, 1981]

An example of reconstruction
Lorenz attractor, a = 10, b = 28, c = 8/3
See last lecture for equations of motion
Solved the equations in a separate (non-R) program (see website)

> lorenz.state.ts = read.csv("lorenz-t50-to-t100-by-1e-3",
    col.names=c("t","x","y","z"))
> library(scatterplot3d)   # for the 3D plots below
> scatterplot3d(lorenz.state.ts$x, lorenz.state.ts$y,
    lorenz.state.ts$z, type="l", xlab="x",
    ylab="y", zlab="z", lwd=0.2)
> x.ts = lorenz.state.ts$x
> library(tseriesChaos)
> scatterplot3d(embedd(x.ts,3,40), type="l", lwd=0.2)

[Figure: Lorenz attractor in state space, axes x_1, x_2, x_3]

Time series: first coordinate of state
[Figure: time series of the first state variable x_1(t), for t from 50 to 100]

Reconstruction, k = 3, τ = 0.04
[Figure: time-delay reconstruction (s(t), s(t−0.04), s(t−0.08)) with s = x_1]

Reconstruction
[Figure: Lorenz attractor in state space (x_1, x_2, x_3) side by side with its time-delay reconstruction from s = x_1]
Note: the reconstruction procedure knew nothing about what the state variables were or what the equations of motion were; it just worked with the original time series

Attractor Reconstruction
It's inference, Jim, but not as we know it
Gets information about the larger system generating the data from partial data (inference)
No parametric model form
No nonparametric model form either
No likelihood, no prior
Requires very complete determinism
Reconstructs the attractor up to a smooth change of coordinates

Coordinate change
old, new state = X_t, R_t
old, new map = Φ, Ψ
coordinate change = G, so R_t = G(X_t)
If the coordinate change works, then it doesn't matter whether we apply it or the map first:
X_{t+1} = Φ(X_t), R_{t+1} = Ψ(R_t)
R_{t+1} = G(X_{t+1}) = G(Φ(X_t)) and R_{t+1} = Ψ(R_t) = Ψ(G(X_t)), so G(Φ(X_t)) = Ψ(G(X_t))

Commutative diagram:

        Φ
  X_t -----> X_{t+1}
   |            |
   G            G
   v            v
  R_t -----> R_{t+1}
        Ψ

New coordinates are a perfect model of the old coordinates
Time-delay vectors give a model of states, at least on the attractor
Many quantities (like Lyapunov exponents) aren't affected by a change of coordinates

Lorenz Attractor: nonlinear observable
[Figure: time-delay reconstruction (s(t), s(t−0.04), s(t−0.08)) with s = (x_1 + 7)^3]

Lorenz Attractor: nonlinear observable
[Figure: time-delay reconstruction with s = x_1^2 x_2]

It even works with observation noise
[Figure: time-delay reconstruction with s = x_1^2 x_2 + ε, σ(ε) = 100]

[Figure: time-delay reconstruction with s = x_1^2 x_2 + ε, σ(ε) = 400]

... but not too much
[Figure: time-delay reconstruction with s = x_1^2 x_2 + ε, σ(ε) = 1000]

Choice of reconstruction settings
Need to choose k (number of delays, embedding dimension) and τ (delay between observations); typically the observable is given to us by the problem situation
These choices involve a certain amount of uncertainty
Software: used tseriesChaos from CRAN
Hegger, Kantz, & Schreiber's TISEAN has an R port, RTisean, also on CRAN; couldn't get it to work

Choice of delay τ
In principle, almost any τ will do, but if the attractor is periodic, multiples of the period are bad!
In practice, want to try to get as much new information about the state as possible from each observation
Heuristic 1: make τ the first minimum of the autocorrelation
Heuristic 2: make τ the first minimum of the mutual information
Will return to heuristic 2 after we explain mutual information

Autocorrelation function
ρ(t, s) = cov[X_t, X_s] / (σ(X_t) σ(X_s))
For weakly stationary or second-order stationary processes, ρ(t, s) = ρ(|t − s|), i.e., it depends only on the time-lag, not on absolute time
Standard R command: acf
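As a rough sketch of how both heuristics can be checked on the Lorenz series x.ts from above (lag.max = 200 is just an illustrative choice, not a recommendation):

library(tseriesChaos)
# Heuristic 1: look for the first minimum (or first zero) of the autocorrelation
acf(x.ts, lag.max = 200)
# Heuristic 2: look for the first minimum of the time-delayed mutual information
mutual(x.ts, lag.max = 200)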

Space-time separation plot
The distance between two random points from the trajectory will depend on how far apart in time the observations are
For each h > 0, calculate the empirical distribution of the distance ‖X_t − X_{t+h}‖
Plot the quantiles as a function of h
Example (x.ts, the Lorenz series): stplot(x.ts,3,40,mdt=2000)

[Figure: space-time separation plot, showing deciles; distance vs. time (in units of 1/1000), from 0 to 2000]
Note the growth (separation) plus periodicity

Correlations are reasonably decayed at around 250; before then, observations are correlated because there has not been enough time to disperse around the attractor

Choice of embedding dimension k: False Neighbor Method
Take points which really live in a high-dimensional space and project them into a low-dimensional one: many points which are really far apart will have projections which are close by
These are the false neighbors
[Figure: random dots near the unit sphere (axes x, y, z) and their projection onto the x-y plane]

Keep increasing the dimensionality until everything which was a neighbor in k dimensions is still a neighbor in k + 1 dimensions

Need to exclude points which are nearby just because the dynamics hasn't had time to separate them; calculate this Theiler window from the space-time plot
plot(false.nearest(x.ts,6,40,250))
i.e. use τ = 40 in embedding, consider up to 6 dimensions, and use a Theiler exclusion window of 250 time-steps
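For intuition, here is a rough, self-contained sketch of the false-neighbor criterion (my illustrative code, not the false.nearest implementation; the function name false.frac, the ratio threshold rt, and the window argument w are all made up for this example):

library(tseriesChaos)
# Fraction of "false" nearest neighbors when going from m to m+1 dimensions:
# a neighbor is false if the extra coordinate pushes it far away relative to
# its distance in m dimensions (ratio > rt); w is the Theiler exclusion window.
false.frac <- function(s, m, d, rt = 10, w = 0) {
  Em1 <- embedd(s, m + 1, d)            # (m+1)-dimensional delay vectors
  Em  <- Em1[, 1:m, drop = FALSE]       # their first m coordinates
  n   <- nrow(Em1)
  false <- logical(n)
  for (i in 1:n) {
    dist.m <- sqrt(rowSums((Em - matrix(Em[i, ], n, m, byrow = TRUE))^2))
    dist.m[abs(1:n - i) <= w] <- Inf    # ignore temporally close points
    j <- which.min(dist.m)              # nearest neighbor in m dimensions
    false[i] <- abs(Em1[i, m + 1] - Em1[j, m + 1]) / dist.m[j] > rt
  }
  mean(false)
}

Something like sapply(1:5, function(m) false.frac(x.ts[1:2000], m, 40, w = 250)) should then trace out roughly the kind of curve that false.nearest plots.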

[Figure: percentage of false nearest neighbours vs. embedding dimension, 1 through 6]
Conclusion: k = 3 it is

Determinism: there is a mapping from old states to new states; learn that mapping, then apply it
Work in the reconstructed state space

Nearest-neighbor methods
Also called the method of analogs in dynamics
Given: points x_1, x_2, x_3, ..., x_{n-1} and their sequels, x_2, x_3, ..., x_n
Wanted: prediction of what will happen after seeing a new point x
Nearest neighbor: find the x_i closest to x; predict x_{i+1}
k-nearest neighbors: find the k points x_{i_1}, x_{i_2}, ..., x_{i_k} closest to x; predict the average of x_{i_1+1}, x_{i_2+1}, ..., x_{i_k+1}
Notation: U_k(x) = the k nearest neighbors of x
Note: this k is not the k of the embedding dimension, but the phrase "k-nearest neighbors" is traditional
Computation: finding the nearest neighbors fast is tricky; leave it to a professional
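A minimal sketch of the idea (illustrative only; knn.forecast and its arguments are my own names, not part of knnflex, and a real implementation would use a smarter neighbor search than brute force):

# One-step k-nearest-neighbor forecast: average the successors of the k states
# closest to a new state x0 (states is a matrix of reconstructed states,
# futures[i] the observation that followed states[i, ]).
knn.forecast <- function(states, futures, x0, k = 3) {
  d <- sqrt(rowSums((states - matrix(x0, nrow(states), ncol(states), byrow = TRUE))^2))
  mean(futures[order(d)[1:k]])
}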

Assume the map Φ is smooth
If x_i is very close to x, then Φ(x_i) = x_{i+1} will be close to, but not exactly, Φ(x):
Φ(x_i) = Φ(x) + (x_i − x)Φ'(x) + small
If x_{i_1}, x_{i_2}, ..., x_{i_k} are all very close to x, then x_{i_1+1}, x_{i_2+1}, ..., x_{i_k+1} should all be close to Φ(x)

(1/k) Σ_{j ∈ U_k(x)} Φ(x_j) = Φ(x) + (1/k) Σ_{j ∈ U_k(x)} (x_j − x)Φ'(x) + small

so the prediction error is

(1/k) Σ_{j ∈ U_k(x)} Φ(x_j) − Φ(x) ≈ (1/k) Σ_{j ∈ U_k(x)} (x_j − x)Φ'(x)

k > 1: the error averages a bunch of individual terms from the neighbors, which should tend to be smaller than any one of them, if they're not too correlated themselves
One reason this works well together with mixing!
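A quick illustrative simulation of that averaging effect (purely a sketch; it assumes the neighbors' error terms are independent, which the point above warns may not hold):

# With k not-too-correlated error terms, the averaged error tends to be smaller
# than any single one (roughly a 1/sqrt(k) effect for independent errors).
set.seed(1)
errs <- replicate(1000, {
  e <- rnorm(10, sd = 0.1)              # pretend one-step errors from 10 neighbors
  c(single = abs(e[1]), averaged = abs(mean(e)))
})
rowMeans(errs)                          # the averaged error is noticeably smaller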

As n grows, we get more and more samples along the attractor
If x itself is from the attractor, it becomes more and more likely that we are sampling from a place where we have many neighbors, and hence close neighbors, so the accuracy should keep going up
Another reason this works well with mixing

knnflex from CRAN lets you do nearest-neighbor prediction (as well as classification)
R is weak and refuses to do really big distance matrices

> library(knnflex)
> lorenz.rcon = embedd(x.ts[1:5000],3,40)
> nrow(lorenz.rcon)
[1] 4920
> lorenz.dist = knn.dist(lorenz.rcon)
> lorenz.futures = x.ts[2:5001]
> train = sample(1:nrow(lorenz.rcon),0.8*nrow(lorenz.rcon))
> test = (1:nrow(lorenz.rcon))[-train]
> preds = knn.predict(train,test,lorenz.futures,lorenz.dist,
    k=3,agg.meth="mean")
> plot(test,preds,col="red",xlab="t",ylab="x",type="p",
    main="3NN prediction vs. reality")
> lines(test,lorenz.futures[test])
> plot(test,lorenz.futures[test] - preds,xlab="t",
    ylab="reality - prediction",main="Residuals")

3NN prediction vs. reality
[Figure: predicted x (red points) and actual x over the test set, plotted against t]

Not just a programming error!
[Figure: residuals (reality − prediction) vs. t, all lying within about ±0.2]

[Figures: residuals plotted against the first (lag 0), second (lag 40), and third (lag 80) components of the history vector]

Why not just use huge k? The bias/variance trade-off
small k: tracks local behavior; big change in prediction depending on which point just happens to be closest to you
big k: smooth predictions over state space, less sensitive to sampling, less informative locally
k = n: same prediction all over state space!
Can we somehow increase k with n, to take advantage of the filling-in along the attractor? See the handout on kernel prediction.
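One way to make that concrete (purely an illustrative sketch, not the handout's method; kernel.forecast and the bandwidth h are my own names): weight every stored state by a kernel of its distance to the new state, so the effective number of neighbors grows as the attractor fills in.

# Nadaraya-Watson-style kernel forecast over reconstructed states: a smooth
# version of k-NN where nearer states get exponentially more weight.
kernel.forecast <- function(states, futures, x0, h) {
  d2 <- rowSums((states - matrix(x0, nrow(states), ncol(states), byrow = TRUE))^2)
  w  <- exp(-d2 / (2 * h^2))            # Gaussian kernel weights
  sum(w * futures) / sum(w)             # weighted average of the successors
}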

Cross-Validation
Standard and useful way of selecting control settings
Principle: we don't just want to fit old data, we want to predict new data; i.e., don't just optimize in-sample error, optimize out-of-sample error
Problem: we don't know the distribution of future data! Would we be trying to learn a predictor if we did?
Observation: we do have a sample from that distribution, namely our sample (ergodicity: long trajectories are representative)
Solution: fake getting a new sample by sub-dividing our existing one at random
Choose the settings which generalize best, as estimated by cross-validation

Random division into training and testing sets needs to respect the structure of the data, e.g. divide points in the embedding space, not observations from the original time series
Note: we did this already with the Lorenz example; that was an out-of-sample prediction
Good idea to use a couple of random divisions and average the out-of-sample errors
This multi-fold cross-validation also gives you an idea of the uncertainty due to sampling noise

5-fold cross-validation of kNN prediction of Lorenz, k ∈ 1:10
Warning: slow!

fold = sample(1:5,nrow(lorenz.rcon),replace=TRUE)
cvpred = matrix(NA,nrow=nrow(lorenz.rcon),ncol=10)
cvprederror = matrix(NA,nrow=nrow(lorenz.rcon),ncol=10)
for (k in 1:10) {
  for (i in 1:5) {
    train=which(fold!=i)
    test=which(fold==i)
    cvpred[test,k] = knn.predict(train=train,test=test,
      lorenz.futures,lorenz.dist,
      k=k,agg.meth="mean")
    cvprederror[test,k] = lorenz.futures[test]-cvpred[test,k]
  }
}
mean.cv.errors = apply(abs(cvprederror),2,mean)
plot(mean.cv.errors,xlab="k",ylab="mean absolute prediction error",
  main="5-fold CV of knn prediction",type="b")

[Figure: 5-fold CV of knn prediction; mean absolute prediction error (roughly 0.015 to 0.035) vs. k from 1 to 10]

Add the individual fold values to the previous plot
Graphics hygiene: have the vertical scale run to zero

plot(mean.cv.errors,xlab="k",ylab="mean absolute prediction error",
  main="5-fold CV of knn prediction",type="l",
  ylim=c(0,0.04))
error.by.fold = matrix(NA,nrow=5,ncol=10)
for (i in 1:5) {
  for (k in 1:10) {
    test=which(fold==i)
    error.by.fold[i,k] = mean(abs(cvprederror[test,k]))
    points(k,error.by.fold[i,k])
  }
}

[Figure: 5-fold CV of knn prediction with per-fold mean absolute errors vs. k, vertical scale running from 0 to 0.04]
Conclusion: k = 2 is best, but they're all pretty good; note the bigger scatter at bigger k

Norman H. Packard, James P. Crutchfield, J. Doyne Farmer, and Robert S. Shaw. Geometry from a time series. Physical Review Letters, 45:712–716, 1980.

Floris Takens. Detecting strange attractors in fluid turbulence. In D. A. Rand and L. S. Young, editors, Symposium on Dynamical Systems and Turbulence, pages 366–381, Berlin, 1981. Springer-Verlag.