Outline Learning Machines and Data! Linear and nonlinear methods to compress


Sara Berry
6 years ago

1 Machine Learning Tools for the Visualisation of Big Data
We all love images!
Statutory warning: Has been produced in the same facility which produces maths and may contain traces of maths.
Amit Kumar Mishra, Kyle Harrison
University of Cape Town


6 Introduction
Human beings are lazy and commit errors!
Human-level intelligence will need human-level error! (Turing)
Machines are reliable and can handle far more dimensions than we can understand.
We need to see things to understand them (whether to extract new information or to classify known classes).
We shall discuss some basic ways to make big data small (!) enough to visualise.
We shall present an open-source tool we have developed, which anyone can use to view large datasets.

7 Taxonomy of Machine Learning


10 Squeezing the information out!-i
E.g. smart conference-hall heating (let's say we use 20 heat sensors).
What is the source of the 20-D data? Maybe the sources are pockets of attendees.
A few underlying factors determine the high-dimensional data.
Factor analysis (as old as statistics!)


13 Discrete Karhunen-Loeve expansion
Let $X$ be an $n$-dimensional random vector (of data from sensors), represented in terms of $n$ linearly independent vectors $\phi_i$ (like coordinate systems):
$$X = y_1 \phi_1 + y_2 \phi_2 + \dots + y_n \phi_n = \sum_{i=1}^{n} y_i \phi_i$$
Let's calculate only $m$ ($m < n$) of the $y_i$s, and use constants $b_i$ for the rest:
$$\hat{X}(m) = \sum_{i=1}^{m} y_i \phi_i + \sum_{i=m+1}^{n} b_i \phi_i$$
Error (prove this):
$$\Delta X(m) = X - \hat{X}(m) = \sum_{i=m+1}^{n} (y_i - b_i) \phi_i$$
Minimise the mean-squared error, i.e. $E(\|\Delta X(m)\|^2)$ (WHY?)
The optimal choice of $\phi_i$s satisfies $\Sigma_X \phi_i = \lambda_i \phi_i$, i.e. the $\phi_i$ are eigenvectors of the covariance matrix $\Sigma_X$ (KL transformation or PCA).
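The truncated expansion above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `kl_transform`, the synthetic "20 sensors driven by 2 sources" data, and the noise level are all hypothetical and not from the slides. The constants $b_i$ are taken to be the means of the discarded coefficients, which is the same as centring the data and adding the sample mean back after projection.

```python
import numpy as np

def kl_transform(X, m):
    """Keep the m leading KL/PCA coefficients of each row of X (samples x n)."""
    mean = X.mean(axis=0)
    Xc = X - mean                               # centre the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)          # sample covariance (n x n)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]       # indices of the m largest
    phi = eigvecs[:, order]                     # basis vectors phi_1..phi_m (n x m)
    Y = Xc @ phi                                # coefficients y_1..y_m per sample
    X_hat = Y @ phi.T + mean                    # reconstruction from m coefficients
    return Y, phi, eigvals[order], X_hat

# Hypothetical data: 20 sensor readings generated by 2 hidden sources plus noise.
rng = np.random.default_rng(0)
sources = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 20))
X = sources @ mixing + 0.01 * rng.normal(size=(500, 20))

Y, phi, lam, X_hat = kl_transform(X, m=2)
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))  # mean squared reconstruction error
print(Y.shape, mse)
```

With two dominant sources, two coefficients per sample should be enough to reconstruct the 20-D readings up to roughly the noise level, which is the sense in which a few underlying factors determine the data.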


16 Linear Discriminant Analysis (LDA)
Intuition: for classification, clusters of different classes should be far apart, and each cluster should itself be tightly packed (MSE alone is not enough).
Interclass (between-class) scattering, for $c$ classes with class means $m_j$ and overall mean $m$:
$$S_b = \sum_{j=1}^{c} (m_j - m)(m_j - m)^T$$
Intraclass (within-class) covariance matrix, with $N_j$ samples $x_i^j$ in class $j$:
$$S_w = \sum_{j=1}^{c} \sum_{i=1}^{N_j} (x_i^j - m_j)(x_i^j - m_j)^T, \qquad m_j = \frac{1}{N_j} \sum_{i=1}^{N_j} x_i^j$$
LDA maximises the ratio $\det[S_b] / \det[S_w]$.
Solution: eigenvectors of $S_w^{-1} S_b$.
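A minimal NumPy sketch of the recipe above, assuming that $m$ is the mean of all samples (the slides do not say how it is computed) and that $S_w$ is invertible. The function name `lda_directions` and the two-class toy data are hypothetical.

```python
import numpy as np

def lda_directions(X, labels, d):
    """Return the d leading eigenvectors of S_w^{-1} S_b (as columns)."""
    n = X.shape[1]
    m = X.mean(axis=0)                           # overall mean (assumption)
    S_b = np.zeros((n, n))
    S_w = np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[labels == c]
        m_j = Xc.mean(axis=0)                    # class mean
        diff = (m_j - m)[:, None]
        S_b += diff @ diff.T                     # between-class scatter
        S_w += (Xc - m_j).T @ (Xc - m_j)         # within-class scatter
    # Eigenvectors of S_w^{-1} S_b; solve() avoids forming the inverse explicitly.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:d]
    return eigvecs[:, order].real, eigvals[order].real

# Hypothetical data: two classes separated along axis 0, with a large
# class-independent spread along axis 1 that plain PCA would pick up first.
rng = np.random.default_rng(0)
scale = np.array([1.0, 5.0, 1.0])
X0 = rng.normal(size=(200, 3)) * scale
X1 = rng.normal(size=(200, 3)) * scale + np.array([3.0, 0.0, 0.0])
X = np.vstack([X0, X1])
labels = np.array([0] * 200 + [1] * 200)

W, vals = lda_directions(X, labels, d=1)
print(W.ravel())
```

In this toy setting the leading direction should point mostly along axis 0, the axis that separates the classes, rather than along the high-variance axis 1. With $c$ classes, $S_b$ has rank at most $c - 1$, so at most $c - 1$ directions carry discriminative information.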


19 Squeezing the information out!-ii
Look into a Twitter feed to predict flood intensity! (A crazy hypothesis)
It is not easy to know what controls what: can we reduce the dimension while keeping as much fidelity as possible?
We want to see the data the way it looked in N dimensions (how?)
Let's try to preserve the structure.


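22 Sammon's Mapping
Structure (in Sammon's sense): geometrical relationships among subsets of the data vectors.
Original data: $N$ data points of dimension $L$.
Problem: find the best $N$ points in a lower dimension $d$ (2 or 3), such that their interpoint distances approximate the corresponding interpoint distances in the original $L$-dimensional space.
Distance between two points in the original space: $d^{*}_{ij} \equiv \mathrm{dist}[X_i, X_j]$
Distance between two points in the mapped space: $d_{ij} \equiv \mathrm{dist}[Y_i, Y_j]$
Minimise the mismatch between the two sets of interpoint distances; define the error function as:
$$E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{[d^{*}_{ij} - d_{ij}]^2}{d^{*}_{ij}}$$
Good news: it's convex (strictly, only as a function of the mapped distances $d_{ij}$; as a function of the mapped coordinates $Y_i$ it can have local minima, so the result can depend on the starting configuration).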
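The error function above can be minimised numerically. The sketch below uses plain gradient descent with a simple accept/reject step-size rule, which is a simplification chosen for brevity and not necessarily the update used to produce the iteration plots in these slides. The function names, the random initialisation, and the three-cluster toy data are hypothetical.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distance matrix between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(D_star, D):
    """Sammon error E for original distances D_star and mapped distances D."""
    iu = np.triu_indices_from(D_star, k=1)       # pairs with i < j
    return ((D_star[iu] - D[iu]) ** 2 / D_star[iu]).sum() / D_star[iu].sum()

def sammon(X, d=2, n_iter=300, lr=1.0, seed=0):
    N = X.shape[0]
    D_star = pairwise_dist(X)
    c = D_star[np.triu_indices(N, k=1)].sum()    # normalising constant
    eye = np.eye(N)                              # keeps the diagonal away from 0/0
    Y = np.random.default_rng(seed).normal(size=(N, d))   # random start
    E = sammon_stress(D_star, pairwise_dist(Y))
    history = [E]
    for _ in range(n_iter):
        D = pairwise_dist(Y)
        # W_ij = (d*_ij - d_ij) / (d*_ij d_ij); the diagonal evaluates to 0.
        W = (D_star - D) / ((D_star + eye) * (D + eye + 1e-12))
        # dE/dY_i = -(2/c) * sum_j W_ij (Y_i - Y_j)
        grad = (-2.0 / c) * (W.sum(axis=1)[:, None] * Y - W @ Y)
        Y_new = Y - lr * grad
        E_new = sammon_stress(D_star, pairwise_dist(Y_new))
        if E_new < E:                            # accept the step, try a larger one next
            Y, E, lr = Y_new, E_new, lr * 1.2
        else:                                    # reject the step, shrink it
            lr *= 0.5
        history.append(E)
    return Y, history

# Hypothetical data: three clusters of 20 points each in 5 dimensions.
rng = np.random.default_rng(1)
centres = rng.normal(scale=5.0, size=(3, 5))
X = np.repeat(centres, 20, axis=0) + rng.normal(size=(60, 5))

Y, history = sammon(X)
print(history[0], history[-1])
```

Because a step is only accepted when it lowers $E$, the recorded error never increases. A different seed can still end in a different local minimum, so trying several starting configurations is a reasonable precaution.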

23 Sammon mapping: 0-iteration

24 Sammon mapping: 1-iteration

25 Sammon mapping: 10-iteration

26 Squeezing the information out!-iii
What if the original data is ill-behaved?
Boundaries with curves (curves are not pretty here!)

27 Nonlinear turning into linear in higher dimensional space


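31 Kernel + PCA
Let's non-linearise our data: $x$ becomes $\phi(x)$ and $y$ becomes $\phi(y)$.
Designer's choice of $\phi(\cdot)$ such that $\phi(x)^T \phi(y) = k(x, y)$.
How do we perform PCA? Find the eigenvalues and eigenvectors of $\Sigma_X = \frac{1}{M-1} X X^T$. With $M$ (zero-mean) data points $X_i$, each of size $N \times 1$:
$$\Sigma_X = \frac{1}{M-1} \sum_{i=1}^{M} X_i X_i^T$$
(any ideas?)
Let's pass the data through a nonlinear mapping $\phi$; then the new covariance matrix is
$$\Sigma_{X,\phi} = \frac{1}{M-1} \sum_{i=1}^{M} \phi(X_i) \phi(X_i)^T$$
Its eigenproblem can be written entirely in terms of the kernel values $k(X_i, X_j) = \phi(X_i)^T \phi(X_j)$, so $\phi$ never has to be evaluated explicitly.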
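A minimal sketch of kernel PCA working only with kernel values, as described above. The RBF kernel, the value `gamma=10.0`, the function names, and the concentric-circles toy data are assumptions made for illustration; the slides do not specify a kernel. The kernel matrix is centred so that the mapped data have zero mean in feature space.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """K_ij = exp(-gamma * ||X_i - X_j||^2), one possible choice of k(x, y)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca(X, n_components, gamma):
    M = X.shape[0]
    K = rbf_kernel(X, gamma)
    one = np.full((M, M), 1.0 / M)
    # Centre phi(X_i) in feature space without ever computing phi.
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[order], eigvecs[:, order]
    # Projection of training point i on component k is sqrt(lam_k) * alpha_ik.
    Z = alpha * np.sqrt(np.maximum(lam, 0.0))
    return Z, lam

# Hypothetical data: two concentric circles, which no straight line separates.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=100)
radius = np.where(np.arange(100) < 50, 0.3, 1.0)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

Z, lam = kernel_pca(X, n_components=2, gamma=10.0)
print(Z.shape, lam)
```

Only the $M \times M$ kernel matrix is decomposed, so the cost depends on the number of data points rather than on the (possibly very large) dimension of $\phi(x)$. How well the leading components untangle the two circles depends on the kernel and on `gamma`, so plotting `Z` for a few values is worthwhile.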


33 Take-homes
Data visualisation is essential before we design any algorithm (we only have our eyes).
There is a range of tools to do so.
We give you a free suite of tools.
