Time Series Mining and Forecasting

Size: px
Start display at page:

Download "Time Series Mining and Forecasting"

Transcription

1 CSE 6242 / CX 4242 Mar 25, 2014 Time Series Mining and Forecasting Duen Horng (Polo) Chau Georgia Tech Slides based on Prof. Christos Faloutsos s materials

2 Outline Motivation Similarity search distance functions Linear Forecasting on-linear forecasting Conclusions

3 Problem definition Given: one or more sequences x 1, x 2,, x t, (y 1, y 2,, y t, ) ( ) Find similar sequences; forecasts patterns; clusters; outliers

4 Motivation - Applications Financial, sales, economic series Medical ECGs +; blood pressure etc monitoring reactions to new drugs elderly care

5 Motivation - Applications (cont d) Smart house sensors monitor temperature, humidity, air quality video surveillance

6 Motivation - Applications (cont d) Weather, environment/anti-pollution volcano monitoring air/water pollutant monitoring

7 Motivation - Applications (cont d) Computer systems Active Disks (buffering, prefetching) web servers (ditto) network traffic monitoring...

8 Stream Data: Disk accesses #bytes time

9 Problem #1: Goal: given a signal (e.g.., #packets over time) Find: patterns, periodicities, and/or compress count lynx caught per year (packets per day; temperature per day) year

10 Problem#2: Forecast Given x t, x t-1,, forecast x t+1 umber of packets sent Time Tick??

11 Problem#2 : Similarity search E.g.., Find a 3-tick pattern, similar to the last one umber of packets sent Time Tick??

12 Problem #3: Given: A set of correlated time sequences Forecast Sent(t) 90 umber of packets sent lost repeated Time Tick

13 Important observations Patterns, rules, forecasting and similarity indexing are closely related: To do forecasting, we need to find patterns/rules to find similar settings in the past to find outliers, we need to have forecasts (outlier = too far away from our forecast)

14 Outline Motivation Similarity Search and Indexing Linear Forecasting on-linear forecasting Conclusions

15 Outline Motivation Similarity search and distance functions... Euclidean Time-warping

16 Importance of distance functions Subtle, but absolutely necessary: A must for similarity indexing (-> forecasting) A must for clustering Two major families Euclidean and Lp norms Time warping and variations

17 Euclidean and Lp = = n i x i y i y x D 1 2 ) ( ), ( x(t) y(t)... = = n i p i i p y x y x L 1 ), ( L 1 : city-block = Manhattan L 2 = Euclidean L

18 Observation #1 Time sequence -> n-d vector Day-n... Day-2 Day-1

19 Observation #2 Euclidean distance is closely related to cosine similarity dot product cross-correlation function Day-n... Day-2 Day-1

20 Time Warping allow accelerations - decelerations (with or w/o penalty) THE compute the (Euclidean) distance (+ penalty) related to the string-editing distance

21 stutters : Time Warping

22 Time warping Q: how to compute it? A: dynamic programming D( i, j ) = cost to match prefix of length i of first sequence x with prefix of length j of second sequence y

23 Time warping Thus, with no penalty for stutter, for sequences x 1, x 2,, x i,; y 1, y 2,, y j D( i, j) = x[ i] y[ j] + D( i min# D( i " D( i 1,, j 1, j 1) 1) j) no stutter x-stutter y-stutter

24 Time warping VERY SIMILAR to the string-editing distance D( i, j) = x[ i] y[ j] + D( i min# D( i " D( i 1,, j 1, j 1) 1) j) no stutter x-stutter y-stutter

25 Time warping Complexity: O(M*) - quadratic on the length of the strings Many variations (penalty for stutters; limit on the number/percentage of stutters; ) popular in voice processing [Rabiner + Juang]

26 Other Distance functions piece-wise linear/flat approx.; compare pieces [Keogh+01] [Faloutsos+97] cepstrum (for voice [Rabiner+Juang]) do DFT; take log of amplitude; do DFT again Allow for small gaps [Agrawal+95] See tutorial by [Gunopulos + Das, SIGMOD01]

27 Other Distance functions In [Keogh+, KDD 04]: parameter-free, MDL based

28 Conclusions Prevailing distances: Euclidean and time-warping

29 Outline Motivation Similarity search and distance functions Linear Forecasting on-linear forecasting Conclusions

30 Linear Forecasting

31 Forecasting Prediction is very difficult, especially about the future. - ils Bohr Danish physicist and obel Prize laureate

32 Outline Motivation... Linear Forecasting Auto-regression: Least Squares; RLS Co-evolving time sequences Examples Conclusions

33 Reference [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE (Describes MUSCLES and Recursive Least Squares)

34 Problem#2: Forecast Example: give x t-1, x t-2,, forecast x t umber of packets sent Time Tick??

35 Forecasting: Preprocessing MAUALLY: remove trends spot periodicities 7 days time time

36 Problem#2: Forecast Solution: try to express x t as a linear function of the past: x (up to a window of w) t-1, x t-2,, ?? Time Tick

37 (Problem: Back-cast; interpolate) Solution - interpolate: try to express x t as a linear function of the past AD the future: x t+1, x t+2, x t+wfuture; x t-1, x t-wpast (up to windows of w past, w future ) EXACTLY the same algo s?? Time Tick

38 Linear Regression: idea patient weight height ?? Body height Body weight express what we don t know (= dependent variable ) as a linear function of what we know (= independent variable(s) )

39 Linear Regression: idea patient weight height ?? Body height Body weight express what we don t know (= dependent variable ) as a linear function of what we know (= independent variable(s) )

40 Linear Regression: idea patient weight height ?? Body height Body weight express what we don t know (= dependent variable ) as a linear function of what we know (= independent variable(s) )

41 Linear Regression: idea patient weight height ?? Body height Body weight express what we don t know (= dependent variable ) as a linear function of what we know (= independent variable(s) )

42 Linear Auto Regression: Time Packets Sent (t-1) Packets Sent(t) ??

43 Linear Auto Regression: Time Packets Sent (t-1) Packets Sent(t) lag-plot #packets sent at time t 25?? #packets sent at time t-1 lag w = 1 Dependent variable = # of packets sent (S [t]) Independent variable = # of packets sent (S[t-1])

44 Linear Auto Regression: Time Packets Sent (t-1) Packets Sent(t) lag-plot #packets sent at time t 25?? #packets sent at time t-1 lag w = 1 Dependent variable = # of packets sent (S [t]) Independent variable = # of packets sent (S[t-1])

45 Linear Auto Regression: Time Packets Sent (t-1) Packets Sent(t) lag-plot #packets sent at time t 25?? #packets sent at time t-1 lag w = 1 Dependent variable = # of packets sent (S [t]) Independent variable = # of packets sent (S[t-1])

46 Linear Auto Regression: Time Packets Sent (t-1) Packets Sent(t) lag-plot #packets sent at time t 25?? #packets sent at time t-1 lag w = 1 Dependent variable = # of packets sent (S [t]) Independent variable = # of packets sent (S[t-1])

47 Outline Motivation... Linear Forecasting Auto-regression: Least Squares; RLS Co-evolving time sequences Examples Conclusions

48 More details: Q1: Can it work with window w > 1? A1: YES x t x t-1 x t-2

49 More details: Q1: Can it work with window w > 1? A1: YES (we ll fit a hyper-plane, then) x t x t-1 x t-2

50 More details: Q1: Can it work with window w > 1? A1: YES (we ll fit a hyper-plane, then) x t x t-1 x t-2

51 More details: Q1: Can it work with window w > 1? A1: YES The problem becomes: X [ w] a [w 1] = y [ 1] OVER-COSTRAIED a is the vector of the regression coefficients X has the values of the w indep. variables y has the values of the dependent variable

52 More details: X [ w] a [w 1] = y [ 1] " # % & = " # % & " # % & w w w w y y y a a a X X X X X X X X X " ,,,,,,,,, Ind-var1 Ind-var-w time

53 More details: X [ w] a [w 1] = y [ 1] " # % & = " # % & " # % & w w w w y y y a a a X X X X X X X X X " ,,,,,,,,, Ind-var1 Ind-var-w time

54 More details Q2: How to estimate a 1, a 2, a w = a? A2: with Least Squares fit " a = ( X T X ) -1 (X T y) (Moore-Penrose pseudo-inverse) a is the vector that minimizes the RMSE from y

55 More details Straightforward solution: w a = ( X T X ) -1 (X T y) a : Regression Coeff. Vector X : Sample Matrix X : Observations: Sample matrix X grows over time needs matrix inversion O( w 2 ) computation O( w) storage

56 Even more details Q3: Can we estimate a incrementally? A3: Yes, with the brilliant, classic method of Recursive Least Squares (RLS) (see, e.g., [Yi+00], for details). We can do the matrix inversion, WITHOUT inversion (How is that possible?)

57 Even more details Q3: Can we estimate a incrementally? A3: Yes, with the brilliant, classic method of Recursive Least Squares (RLS) (see, e.g., [Yi+00], for details). We can do the matrix inversion, WITHOUT inversion (How is that possible?) A: our matrix has special form: (X T X)

58 SKIP More details w At the +1 time tick: X +1 X : x +1

59 SKIP More details Let G = ( X T X ) -1 ( gain matrix ) G +1 can be computed recursively from G w G w

60 SKIP EVE more details: 1 T + 1 = G [ c] [ G x + 1 ] x + 1 G G 1 x w row vector c = + [ 1+ x G x T ] Let s elaborate (VERY IMPORTAT, VERY VALUABLE)

61 SKIP EVE more details: a T = 1 T [ X X ] [ X y ]

62 SKIP EVE more details: a T = 1 T [ X X ] [ X y ] [w x 1] [(+1) x w] [(+1) x 1] [w x (+1)] [w x (+1)]

63 SKIP EVE more details: a T = 1 T [ X X ] [ X y ] [(+1) x w] [w x (+1)]

64 EVE more details: T G x x G c G G = ] [ ] [ ] 1 [ 1 1 T x G x c = ] [ ] [ = T T y X X X a ] [ T X X G 1 x w row vector gain matrix SKIP

65 EVE more details: T G x x G c G G = ] [ ] [ ] 1 [ 1 1 T x G x c = SKIP

66 SKIP EVE more details: 1x1 1xw wxw wxw wxw wx1 wxw 1 T + 1 = G [ c] [ G x + 1 ] x + 1 G G SCALAR c = + [ 1+ x G x T ]

67 Altogether: T G x x G c G G = ] [ ] [ ] 1 [ 1 1 T x G x c = ] [ ] [ = T T y X X X a ] [ T X X G SKIP

68 SKIP Altogether: G 0 δ I where I: w x w identity matrix δ: a large positive number

69 Comparison: Straightforward Least Squares eeds huge matrix (growing in size) O( w) Costly matrix operation O( w 2 ) Recursive LS eed much smaller, fixed size matrix O(w w) Fast, incremental computation O(1 w 2 ) no matrix inversion = 10 6, w = 1-100

70 Pictorially: Given: Dependent Variable Independent Variable

71 Pictorially:. Dependent Variable new point Independent Variable

72 Pictorially: RLS: quickly compute new best fit Dependent Variable new point Independent Variable

73 Even more details Q4: can we forget the older samples? A4: Yes - RLS can easily handle that [Yi+00]:

74 Adaptability - forgetting Dependent Variable eg., #bytes sent Independent Variable eg., #packets sent

75 Adaptability - forgetting Trend change Dependent Variable eg., #bytes sent (R)LS with no forgetting Independent Variable eg. #packets sent

76 Adaptability - forgetting Trend change Dependent Variable Independent Variable (R)LS with no forgetting (R)LS with forgetting RLS: can *trivially* handle forgetting

Time Series. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. Mining and Forecasting

Time Series. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. Mining and Forecasting http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Time Series Mining and Forecasting Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

More information

15-826: Multimedia Databases and Data Mining. Must-Read Material. Thanks. C. Faloutsos CMU 1

15-826: Multimedia Databases and Data Mining. Must-Read Material. Thanks. C. Faloutsos CMU 1 15-826: Multimedia Databases and Data Mining Lecture #25: Time series mining and forecasting Christos Faloutsos Must-Read Material Byong-Kee Yi, Nikolaos D. Sidiropoulos, Theodore Johnson, H.V. Jagadish,

More information

Time Series. CSE 6242 / CX 4242 Mar 27, Duen Horng (Polo) Chau Georgia Tech. Nonlinear Forecasting; Visualization; Applications

Time Series. CSE 6242 / CX 4242 Mar 27, Duen Horng (Polo) Chau Georgia Tech. Nonlinear Forecasting; Visualization; Applications CSE 6242 / CX 4242 Mar 27, 2014 Time Series Nonlinear Forecasting; Visualization; Applications Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon,

More information

Data Mining. CS57300 Purdue University. Jan 11, Bruno Ribeiro

Data Mining. CS57300 Purdue University. Jan 11, Bruno Ribeiro Data Mining CS57300 Purdue University Jan, 208 Bruno Ribeiro Regression Posteriors Working with Data 2 Linear Regression: Review 3 Linear Regression (use A)? Interpolation (something is missing) (x,, x

More information

CS 6604: Data Mining Large Networks and Time-series. B. Aditya Prakash Lecture #12: Time Series Mining

CS 6604: Data Mining Large Networks and Time-series. B. Aditya Prakash Lecture #12: Time Series Mining CS 6604: Data Mining Large Networks and Time-series B. Aditya Prakash Lecture #12: Time Series Mining Pa

More information

Faloutsos, Tong ICDE, 2009

Faloutsos, Tong ICDE, 2009 Large Graph Mining: Patterns, Tools and Case Studies Christos Faloutsos Hanghang Tong CMU Copyright: Faloutsos, Tong (29) 2-1 Outline Part 1: Patterns Part 2: Matrix and Tensor Tools Part 3: Proximity

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

Probabilistic Near-Duplicate. Detection Using Simhash

Probabilistic Near-Duplicate. Detection Using Simhash Probabilistic Near-Duplicate Detection Using Simhash Sadhan Sood, Dmitri Loguinov Presented by Matt Smith Internet Research Lab Department of Computer Science and Engineering Texas A&M University 27 October

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

Outline : Multimedia Databases and Data Mining. Indexing - similarity search. Indexing - similarity search. C.

Outline : Multimedia Databases and Data Mining. Indexing - similarity search. Indexing - similarity search. C. Outline 15-826: Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos Goal: Find similar / interesting things Intro to DB Indexing - similarity search Points Text Time sequences; images

More information

4. DATA ASSIMILATION FUNDAMENTALS

4. DATA ASSIMILATION FUNDAMENTALS 4. DATA ASSIMILATION FUNDAMENTALS... [the atmosphere] "is a chaotic system in which errors introduced into the system can grow with time... As a consequence, data assimilation is a struggle between chaotic

More information

Time Series (1) Information Systems M

Time Series (1) Information Systems M Time Series () Information Systems M Prof. Paolo Ciaccia http://www-db.deis.unibo.it/courses/si-m/ Time series are everywhere Time series, that is, sequences of observations (samples) made through time,

More information

Least Squares Estimation Namrata Vaswani,

Least Squares Estimation Namrata Vaswani, Least Squares Estimation Namrata Vaswani, namrata@iastate.edu Least Squares Estimation 1 Recall: Geometric Intuition for Least Squares Minimize J(x) = y Hx 2 Solution satisfies: H T H ˆx = H T y, i.e.

More information

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1 CSCI5654 (Linear Programming, Fall 2013) Lectures 10-12 Lectures 10,11 Slide# 1 Today s Lecture 1. Introduction to norms: L 1,L 2,L. 2. Casting absolute value and max operators. 3. Norm minimization problems.

More information

COS 424: Interacting with Data

COS 424: Interacting with Data COS 424: Interacting with Data Lecturer: Rob Schapire Lecture #14 Scribe: Zia Khan April 3, 2007 Recall from previous lecture that in regression we are trying to predict a real value given our data. Specically,

More information

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems Arnab Bhattacharyya, Palash Dey, and David P. Woodruff Indian Institute of Science, Bangalore {arnabb,palash}@csa.iisc.ernet.in

More information

Fundamentals of Similarity Search

Fundamentals of Similarity Search Chapter 2 Fundamentals of Similarity Search We will now look at the fundamentals of similarity search systems, providing the background for a detailed discussion on similarity search operators in the subsequent

More information

CS545 Contents XVI. Adaptive Control. Reading Assignment for Next Class. u Model Reference Adaptive Control. u Self-Tuning Regulators

CS545 Contents XVI. Adaptive Control. Reading Assignment for Next Class. u Model Reference Adaptive Control. u Self-Tuning Regulators CS545 Contents XVI Adaptive Control u Model Reference Adaptive Control u Self-Tuning Regulators u Linear Regression u Recursive Least Squares u Gradient Descent u Feedback-Error Learning Reading Assignment

More information

Generating Uniform Random Numbers

Generating Uniform Random Numbers 1 / 43 Generating Uniform Random Numbers Christos Alexopoulos and Dave Goldsman Georgia Institute of Technology, Atlanta, GA, USA March 1, 2016 2 / 43 Outline 1 Introduction 2 Some Generators We Won t

More information

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining Chapter 6. Frequent Pattern Mining: Concepts and Apriori Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining Pattern Discovery: Definition What are patterns? Patterns: A set of

More information

Unsupervised Anomaly Detection for High Dimensional Data

Unsupervised Anomaly Detection for High Dimensional Data Unsupervised Anomaly Detection for High Dimensional Data Department of Mathematics, Rowan University. July 19th, 2013 International Workshop in Sequential Methodologies (IWSM-2013) Outline of Talk Motivation

More information

Describing and Forecasting Video Access Patterns

Describing and Forecasting Video Access Patterns Boston University OpenBU Computer Science http://open.bu.edu CAS: Computer Science: Technical Reports -- Describing and Forecasting Video Access Patterns Gursun, Gonca CS Department, Boston University

More information

Computing the Entropy of a Stream

Computing the Entropy of a Stream Computing the Entropy of a Stream To appear in SODA 2007 Graham Cormode graham@research.att.com Amit Chakrabarti Dartmouth College Andrew McGregor U. Penn / UCSD Outline Introduction Entropy Upper Bound

More information

CMU SCS. Large Graph Mining. Christos Faloutsos CMU

CMU SCS. Large Graph Mining. Christos Faloutsos CMU Large Graph Mining Christos Faloutsos CMU Thank you! Hillol Kargupta NGDM 2007 C. Faloutsos 2 Outline Problem definition / Motivation Static & dynamic laws; generators Tools: CenterPiece graphs; fraud

More information

Streaming multiscale anomaly detection

Streaming multiscale anomaly detection Streaming multiscale anomaly detection DATA-ENS Paris and ThalesAlenia Space B Ravi Kiran, Université Lille 3, CRISTaL Joint work with Mathieu Andreux beedotkiran@gmail.com June 20, 2017 (CRISTaL) Streaming

More information

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction

More information

COMP9334: Capacity Planning of Computer Systems and Networks

COMP9334: Capacity Planning of Computer Systems and Networks COMP9334: Capacity Planning of Computer Systems and Networks Week 2: Operational analysis Lecturer: Prof. Sanjay Jha NETWORKS RESEARCH GROUP, CSE, UNSW Operational analysis Operational: Collect performance

More information

Online Data Mining for Co-Evolving Time Sequences

Online Data Mining for Co-Evolving Time Sequences Online Data Mining for Co-Evolving Time Sequences Byoung-Kee Yi Univ of Maryland kee@csumdedu H V Jagadish Univ of Michigan jag@eecsumichedu N D Sidiropoulos Univ of Virginia nds5j@virginiaedu Christos

More information

Tools of AI. Marcin Sydow. Summary. Machine Learning

Tools of AI. Marcin Sydow. Summary. Machine Learning Machine Learning Outline of this Lecture Motivation for Data Mining and Machine Learning Idea of Machine Learning Decision Table: Cases and Attributes Supervised and Unsupervised Learning Classication

More information

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT

MLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT Outline Variable Selection Subset Selection Greedy Methods: (Orthogonal) Matching Pursuit Convex Relaxation: LASSO & Elastic Net

More information

HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence

HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence Eamonn Keogh *Jessica Lin Ada Fu University of California, Riverside George Mason Univ Chinese Univ of Hong Kong The 5 th IEEE International

More information

Estimating the Selectivity of tf-idf based Cosine Similarity Predicates

Estimating the Selectivity of tf-idf based Cosine Similarity Predicates Estimating the Selectivity of tf-idf based Cosine Similarity Predicates Sandeep Tata Jignesh M. Patel Department of Electrical Engineering and Computer Science University of Michigan 22 Hayward Street,

More information

Adaptive Burst Detection in a Stream Engine

Adaptive Burst Detection in a Stream Engine Adaptive Burst Detection in a Stream Engine Daniel Klan, Marcel Karnstedt, Christian Pölitz, Kai-Uwe Sattler Department of Computer Science & Automation Ilmenau University of Technology, Germany {first.last}@tu-ilmenau.de

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Matrix Notation Mark Schmidt University of British Columbia Winter 2017 Admin Auditting/registration forms: Submit them at end of class, pick them up end of next class. I need

More information

Introduction to Physics Physics 114 Eyres

Introduction to Physics Physics 114 Eyres What is Physics? Introduction to Physics Collecting and analyzing experimental data Making explanations and experimentally testing them Creating different representations of physical processes Finding

More information

Switching-state Dynamical Modeling of Daily Behavioral Data

Switching-state Dynamical Modeling of Daily Behavioral Data Switching-state Dynamical Modeling of Daily Behavioral Data Randy Ardywibowo Shuai Huang Cao Xiao Shupeng Gui Yu Cheng Ji Liu Xiaoning Qian Texas A&M University University of Washington IBM T.J. Watson

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.2 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Data Structures Interval-Valued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables

More information

Generating Uniform Random Numbers

Generating Uniform Random Numbers 1 / 41 Generating Uniform Random Numbers Christos Alexopoulos and Dave Goldsman Georgia Institute of Technology, Atlanta, GA, USA 10/13/16 2 / 41 Outline 1 Introduction 2 Some Lousy Generators We Won t

More information

Time Series Data Cleaning

Time Series Data Cleaning Time Series Data Cleaning Shaoxu Song http://ise.thss.tsinghua.edu.cn/sxsong/ Dirty Time Series Data Unreliable Readings Sensor monitoring GPS trajectory J. Freire, A. Bessa, F. Chirigati, H. T. Vo, K.

More information

Algorithms for Calculating Statistical Properties on Moving Points

Algorithms for Calculating Statistical Properties on Moving Points Algorithms for Calculating Statistical Properties on Moving Points Dissertation Proposal Sorelle Friedler Committee: David Mount (Chair), William Gasarch Samir Khuller, Amitabh Varshney January 14, 2009

More information

Generating Uniform Random Numbers

Generating Uniform Random Numbers 1 / 44 Generating Uniform Random Numbers Christos Alexopoulos and Dave Goldsman Georgia Institute of Technology, Atlanta, GA, USA 10/29/17 2 / 44 Outline 1 Introduction 2 Some Lousy Generators We Won t

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University 10/17/2017 Slides adapted from Prof. Jiawei Han @UIUC, Prof.

More information

BNAD 276 Lecture 10 Simple Linear Regression Model

BNAD 276 Lecture 10 Simple Linear Regression Model 1 / 27 BNAD 276 Lecture 10 Simple Linear Regression Model Phuong Ho May 30, 2017 2 / 27 Outline 1 Introduction 2 3 / 27 Outline 1 Introduction 2 4 / 27 Simple Linear Regression Model Managerial decisions

More information

CS545 Contents XVI. l Adaptive Control. l Reading Assignment for Next Class

CS545 Contents XVI. l Adaptive Control. l Reading Assignment for Next Class CS545 Contents XVI Adaptive Control Model Reference Adaptive Control Self-Tuning Regulators Linear Regression Recursive Least Squares Gradient Descent Feedback-Error Learning Reading Assignment for Next

More information

Measurement and Data. Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality

Measurement and Data. Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality Measurement and Data Topics: Types of Data Distance Measurement Data Transformation Forms of Data Data Quality Importance of Measurement Aim of mining structured data is to discover relationships that

More information

Time-Series Analysis Prediction Similarity between Time-series Symbolic Approximation SAX References. Time-Series Streams

Time-Series Analysis Prediction Similarity between Time-series Symbolic Approximation SAX References. Time-Series Streams Time-Series Streams João Gama LIAAD-INESC Porto, University of Porto, Portugal jgama@fep.up.pt 1 Time-Series Analysis 2 Prediction Filters Neural Nets 3 Similarity between Time-series Euclidean Distance

More information

From Non-Negative Matrix Factorization to Deep Learning

From Non-Negative Matrix Factorization to Deep Learning The Math!! From Non-Negative Matrix Factorization to Deep Learning Intuitions and some Math too! luissarmento@gmailcom https://wwwlinkedincom/in/luissarmento/ October 18, 2017 The Math!! Introduction Disclaimer

More information

Believe it Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data

Believe it Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data SDM 15 Vancouver, CAN Believe it Today or Tomorrow? Detecting Untrustworthy Information from Dynamic Multi-Source Data Houping Xiao 1, Yaliang Li 1, Jing Gao 1, Fei Wang 2, Liang Ge 3, Wei Fan 4, Long

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan

More information

LEC1: Instance-based classifiers

LEC1: Instance-based classifiers LEC1: Instance-based classifiers Dr. Guangliang Chen February 2, 2016 Outline Having data ready knn kmeans Summary Downloading data Save all the scripts (from course webpage) and raw files (from LeCun

More information

ES-2 Lecture: More Least-squares Fitting. Spring 2017

ES-2 Lecture: More Least-squares Fitting. Spring 2017 ES-2 Lecture: More Least-squares Fitting Spring 2017 Outline Quick review of least-squares line fitting (also called `linear regression ) How can we find the best-fit line? (Brute-force method is not efficient)

More information

Similarity Search. Stony Brook University CSE545, Fall 2016

Similarity Search. Stony Brook University CSE545, Fall 2016 Similarity Search Stony Brook University CSE545, Fall 20 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings

More information

Shannon-Fano-Elias coding

Shannon-Fano-Elias coding Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Types of data sets Record Tables Document Data Transaction Data Graph World Wide Web Molecular Structures

More information

Statistical Problem. . We may have an underlying evolving system. (new state) = f(old state, noise) Input data: series of observations X 1, X 2 X t

Statistical Problem. . We may have an underlying evolving system. (new state) = f(old state, noise) Input data: series of observations X 1, X 2 X t Markov Chains. Statistical Problem. We may have an underlying evolving system (new state) = f(old state, noise) Input data: series of observations X 1, X 2 X t Consecutive speech feature vectors are related

More information

Fine-grained Photovoltaic Output Prediction using a Bayesian Ensemble

Fine-grained Photovoltaic Output Prediction using a Bayesian Ensemble Fine-grained Photovoltaic Output Prediction using a Bayesian Ensemble 1,2, Manish Marwah 3,, Martin Arlitt 3, and Naren Ramakrishnan 1,2 1 Department of Computer Science, Virginia Tech, Blacksburg, VA

More information

CSE 123: Computer Networks

CSE 123: Computer Networks CSE 123: Computer Networks Total points: 40 Homework 1 - Solutions Out: 10/4, Due: 10/11 Solutions 1. Two-dimensional parity Given below is a series of 7 7-bit items of data, with an additional bit each

More information

Ordinary Least Squares Linear Regression

Ordinary Least Squares Linear Regression Ordinary Least Squares Linear Regression Ryan P. Adams COS 324 Elements of Machine Learning Princeton University Linear regression is one of the simplest and most fundamental modeling ideas in statistics

More information

Window-aware Load Shedding for Aggregation Queries over Data Streams

Window-aware Load Shedding for Aggregation Queries over Data Streams Window-aware Load Shedding for Aggregation Queries over Data Streams Nesime Tatbul Stan Zdonik Talk Outline Background Load shedding in Aurora Windowed aggregation queries Window-aware load shedding Experimental

More information

Accelerated Advanced Algebra. Chapter 1 Patterns and Recursion Homework List and Objectives

Accelerated Advanced Algebra. Chapter 1 Patterns and Recursion Homework List and Objectives Chapter 1 Patterns and Recursion Use recursive formulas for generating arithmetic, geometric, and shifted geometric sequences and be able to identify each type from their equations and graphs Write and

More information

A Tutorial on Recursive methods in Linear Least Squares Problems

A Tutorial on Recursive methods in Linear Least Squares Problems A Tutorial on Recursive methods in Linear Least Squares Problems by Arvind Yedla 1 Introduction This tutorial motivates the use of Recursive Methods in Linear Least Squares problems, specifically Recursive

More information

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. SAMs - Detailed outline. Spatial Access Methods - problem

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. SAMs - Detailed outline. Spatial Access Methods - problem Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications Lecture #26: Spatial Databases (R&G ch. 28) SAMs - Detailed outline spatial access methods problem dfn R-trees Faloutsos 2

More information

Short Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning

Short Course Robust Optimization and Machine Learning. 3. Optimization in Supervised Learning Short Course Robust Optimization and 3. Optimization in Supervised EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012 Outline Overview of Supervised models and variants

More information

An economic application of machine learning: Nowcasting Thai exports using global financial market data and time-lag lasso

An economic application of machine learning: Nowcasting Thai exports using global financial market data and time-lag lasso An economic application of machine learning: Nowcasting Thai exports using global financial market data and time-lag lasso PIER Exchange Nov. 17, 2016 Thammarak Moenjak What is machine learning? Wikipedia

More information

In the derivation of Optimal Interpolation, we found the optimal weight matrix W that minimizes the total analysis error variance.

In the derivation of Optimal Interpolation, we found the optimal weight matrix W that minimizes the total analysis error variance. hree-dimensional variational assimilation (3D-Var) In the derivation of Optimal Interpolation, we found the optimal weight matrix W that minimizes the total analysis error variance. Lorenc (1986) showed

More information

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank

More information

Notion of Distance. Metric Distance Binary Vector Distances Tangent Distance

Notion of Distance. Metric Distance Binary Vector Distances Tangent Distance Notion of Distance Metric Distance Binary Vector Distances Tangent Distance Distance Measures Many pattern recognition/data mining techniques are based on similarity measures between objects e.g., nearest-neighbor

More information

Constructing comprehensive summaries of large event sequences

Constructing comprehensive summaries of large event sequences Constructing comprehensive summaries of large event sequences JERRY KIERNAN IBM Silicon Valley Lab and EVIMARIA TERZI IBM Almaden Research Center Event sequences capture system and user activity over time.

More information

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University Association Rules CS 5331 by Rattikorn Hewett Texas Tech University 1 Acknowledgements Some parts of these slides are modified from n C. Clifton & W. Aref, Purdue University 2 1 Outline n Association Rule

More information

Data mining in large graphs

Data mining in large graphs Data mining in large graphs Christos Faloutsos University www.cs.cmu.edu/~christos ALLADIN 2003 C. Faloutsos 1 Outline Introduction - motivation Patterns & Power laws Scalability & Fast algorithms Fractals,

More information

Incremental Pattern Discovery on Streams, Graphs and Tensors

Incremental Pattern Discovery on Streams, Graphs and Tensors Incremental Pattern Discovery on Streams, Graphs and Tensors Jimeng Sun Thesis Committee: Christos Faloutsos Tom Mitchell David Steier, External member Philip S. Yu, External member Hui Zhang Abstract

More information

Introducing GIS analysis

Introducing GIS analysis 1 Introducing GIS analysis GIS analysis lets you see patterns and relationships in your geographic data. The results of your analysis will give you insight into a place, help you focus your actions, or

More information

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank

More information

Analytical Workload Model for Estimating En Route Sector Capacity in Convective Weather*

Analytical Workload Model for Estimating En Route Sector Capacity in Convective Weather* Analytical Workload Model for Estimating En Route Sector Capacity in Convective Weather* John Cho, Jerry Welch, and Ngaire Underhill 16 June 2011 Paper 33-1 *This work was sponsored by the Federal Aviation

More information

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos

Must-read Material : Multimedia Databases and Data Mining. Indexing - Detailed outline. Outline. Faloutsos Must-read Material 15-826: Multimedia Databases and Data Mining Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. Technical Report SAND2007-6702, Sandia National Laboratories,

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification

More information

Reservoir Computing and Echo State Networks

Reservoir Computing and Echo State Networks An Introduction to: Reservoir Computing and Echo State Networks Claudio Gallicchio gallicch@di.unipi.it Outline Focus: Supervised learning in domain of sequences Recurrent Neural networks for supervised

More information

Linear Models for Regression

Linear Models for Regression Linear Models for Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Cloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data

Cloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data Cloud-Based Computational Bio-surveillance Framework for Discovering Emergent Patterns From Big Data Arvind Ramanathan ramanathana@ornl.gov Computational Data Analytics Group, Computer Science & Engineering

More information

arxiv: v2 [cs.ne] 16 Aug 2017

arxiv: v2 [cs.ne] 16 Aug 2017 Time Series Compression Based on Adaptive Piecewise Recurrent Autoencoder Daniel Hsu arxiv:1707.07961v2 [cs.ne] 16 Aug 2017 School of Electrical and Computer Engineering Georgia Institute of Technology

More information

Overview of Spatial Statistics with Applications to fmri

Overview of Spatial Statistics with Applications to fmri with Applications to fmri School of Mathematics & Statistics Newcastle University April 8 th, 2016 Outline Why spatial statistics? Basic results Nonstationary models Inference for large data sets An example

More information

Number Theory. Zachary Friggstad. Programming Club Meeting

Number Theory. Zachary Friggstad. Programming Club Meeting Number Theory Zachary Friggstad Programming Club Meeting Outline Factoring Sieve Multiplicative Functions Greatest Common Divisors Applications Chinese Remainder Theorem Throughout, problems to try are

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

Uniform Random Number Generators

Uniform Random Number Generators JHU 553.633/433: Monte Carlo Methods J. C. Spall 25 September 2017 CHAPTER 2 RANDOM NUMBER GENERATION Motivation and criteria for generators Linear generators (e.g., linear congruential generators) Multiple

More information

Bézier Curves and Splines

Bézier Curves and Splines CS-C3100 Computer Graphics Bézier Curves and Splines Majority of slides from Frédo Durand vectorportal.com CS-C3100 Fall 2017 Lehtinen Before We Begin Anything on your mind concerning Assignment 1? CS-C3100

More information

Applied Time Series Topics

Applied Time Series Topics Applied Time Series Topics Ivan Medovikov Brock University April 16, 2013 Ivan Medovikov, Brock University Applied Time Series Topics 1/34 Overview 1. Non-stationary data and consequences 2. Trends and

More information

Machine Learning on temporal data

Machine Learning on temporal data Machine Learning on temporal data Learning Dissimilarities on Time Series Ahlame Douzal (Ahlame.Douzal@imag.fr) AMA, LIG, Université Joseph Fourier Master 2R - MOSIG (2011) Plan Time series structure and

More information

Embedded Zerotree Wavelet (EZW)

Embedded Zerotree Wavelet (EZW) Embedded Zerotree Wavelet (EZW) These Notes are Based on (or use material from): 1. J. M. Shapiro, Embedded Image Coding Using Zerotrees of Wavelet Coefficients, IEEE Trans. on Signal Processing, Vol.

More information

Linear Prediction 1 / 41

Linear Prediction 1 / 41 Linear Prediction 1 / 41 A map of speech signal processing Natural signals Models Artificial signals Inference Speech synthesis Hidden Markov Inference Homomorphic processing Dereverberation, Deconvolution

More information

Relationship between Singular Vectors, Bred Vectors, 4D-Var and EnKF

Relationship between Singular Vectors, Bred Vectors, 4D-Var and EnKF Relationship between Singular Vectors, Bred Vectors, 4D-Var and EnKF Eugenia Kalnay and Shu-Chih Yang with Alberto Carrasi, Matteo Corazza and Takemasa Miyoshi ECODYC10, Dresden 28 January 2010 Relationship

More information

Short-term forecasts of GDP from dynamic factor models

Short-term forecasts of GDP from dynamic factor models Short-term forecasts of GDP from dynamic factor models Gerhard Rünstler gerhard.ruenstler@wifo.ac.at Austrian Institute for Economic Research November 16, 2011 1 Introduction Forecasting GDP from large

More information

Lecture 1: OLS derivations and inference

Lecture 1: OLS derivations and inference Lecture 1: OLS derivations and inference Econometric Methods Warsaw School of Economics (1) OLS 1 / 43 Outline 1 Introduction Course information Econometrics: a reminder Preliminary data exploration 2

More information

Vectors and their uses

Vectors and their uses Vectors and their uses Sharon Goldwater Institute for Language, Cognition and Computation School of Informatics, University of Edinburgh DRAFT Version 0.95: 3 Sep 2015. Do not redistribute without permission.

More information

Ch. 12 Linear Bayesian Estimators

Ch. 12 Linear Bayesian Estimators Ch. 1 Linear Bayesian Estimators 1 In chapter 11 we saw: the MMSE estimator takes a simple form when and are jointly Gaussian it is linear and used only the 1 st and nd order moments (means and covariances).

More information

Similarity and Dissimilarity

Similarity and Dissimilarity 1//015 Similarity and Dissimilarity COMP 465 Data Mining Similarity of Data Data Preprocessing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed.

More information

Sparsity in system identification and data-driven control

Sparsity in system identification and data-driven control 1 / 40 Sparsity in system identification and data-driven control Ivan Markovsky This signal is not sparse in the "time domain" 2 / 40 But it is sparse in the "frequency domain" (it is weighted sum of six

More information

Multiple Representations: Equations to Tables and Graphs Transcript

Multiple Representations: Equations to Tables and Graphs Transcript Algebra l Teacher: It s good to see you again. Last time we talked about multiple representations. If we could, I would like to continue and discuss the subtle differences of multiple representations between

More information

Matrix-Vector Operations

Matrix-Vector Operations Week3 Matrix-Vector Operations 31 Opening Remarks 311 Timmy Two Space View at edx Homework 3111 Click on the below link to open a browser window with the Timmy Two Space exercise This exercise was suggested

More information