Scuola di Calcolo Scientifico con MATLAB (SCSM) 2017 Palermo 31 Luglio - 4 Agosto 2017 www.u4learn.it Ing. Giuseppe La Tona
Outline
- Machine Learning definition
- Machine Learning problems
- Artificial Neural Networks (ANN)
- Nearest Neighbour classification
- Mixture models and k-means
- Graphical models
Machine Learning
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." (Tom M. Mitchell)
Example
[Figure: histogram of counts of salmon and sea bass versus length; a threshold l* on the length axis separates the two classes.]
Example
[Figure: scatter plot of width versus lightness for salmon and sea bass; the two features together separate the classes better than either alone.]
Machine Learning Sub-Problems
- Overfitting
- Noise
- Feature Extraction
- Model Selection
- Prior Knowledge
- Missing Features
[Figure: the width-versus-lightness scatter plot with a new ambiguous sample: salmon or sea bass?]
Styles of Machine Learning Supervised Learning Unsupervised Learning Anomaly detection On-line learning Semi-supervised learning
Supervised Learning
Given a set of data D = {(x^n, y^n), n = 1, ..., N}, the task is to learn the relationship between the input x and the output y such that, when given a novel input x*, the predicted output y* is accurate. The pair (x*, y*) is not in D but is assumed to be generated by the same unknown process that generated D. To specify explicitly what accuracy means, one defines a loss function L(y_pred, y_true) or, conversely, a utility function U = -L.
Supervised Learning
Example: A father decides to teach his young son what a sports car is. Finding it difficult to explain in words, he decides to give some examples. They stand on a motorway bridge and, as each car passes underneath, the father cries out "that's a sports car!" when a sports car passes by. After ten minutes, the father asks his son if he's understood what a sports car is. The son says, "sure, it's easy." An old red VW Beetle passes by, and the son shouts "that's a sports car!". Dejected, the father asks "why do you say that?". "Because all sports cars are red!", replies the son.
Unsupervised Learning
Given a set of data D = {x^n, n = 1, ..., N}, in unsupervised learning we aim to find a plausible compact description of the data. An objective is used to quantify the accuracy of the description. In unsupervised learning there is no special prediction variable, so, from a probabilistic perspective, we are interested in modelling the distribution p(x). The likelihood that the model generates the data is a popular measure of the accuracy of the description.
Unsupervised Learning
Other Types of Learning
Anomaly Detection: detecting anomalous events in industrial processes (plant monitoring), engine monitoring, and unexpected buying behaviour patterns in customers all fall under the area of anomaly detection.
Online Learning (supervised and unsupervised): in online learning data arrives sequentially and we continually update our model as new data becomes available.
Semi-supervised learning
Machine Learning Problems Classification Regression Clustering Density Estimation Dimensionality Reduction
Exercise
A blog platform needs an automatic tagging service: from the text of a blog article, recommend a list of tags. How would you proceed? Which questions should you ask first?
Machine Learning Steps
Datasets Training set Validation set Test set
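A minimal sketch of how such a three-way split can be produced; the function name, the 70/15/15 fractions, and the fixed seed are illustrative choices, not from the slides:

```python
import random

def split_dataset(data, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a dataset and split it into training, validation and test sets."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_train = int(train_frac * len(data))
    n_val = int(val_frac * len(data))
    train = [data[i] for i in idx[:n_train]]
    val = [data[i] for i in idx[n_train:n_train + n_val]]
    test = [data[i] for i in idx[n_train + n_val:]]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
```

The training set fits the model, the validation set guides model selection (e.g. choosing hyperparameters), and the test set is touched only once, for the final performance estimate.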
Artificial Neural Networks
[Figure: a neuron (network node) with inputs x_1, ..., x_n and weights w_1, ..., w_n computes the output f(w_1 x_1 + w_2 x_2 + ... + w_n x_n). Black-box representation: a network F maps inputs x_1, ..., x_n to outputs y_1, ..., y_m.]
Artificial Neural Networks
[Figure: a general network node computes f(g(x_1, x_2, ..., x_n)), where g aggregates the inputs and f is the activation. The binary threshold function outputs 0 below the threshold θ and 1 at or above it.]
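A small sketch of such a threshold node, using a weighted sum as the aggregation g; the weight and threshold values below are illustrative:

```python
def neuron(x, w, theta):
    """Binary threshold node: fire (1) iff the weighted sum reaches theta."""
    s = sum(wi * xi for wi, xi in zip(w, x))   # aggregation g(x) = w . x
    return 1 if s >= theta else 0              # threshold activation f

# AND gate: weights (1, 1), threshold 2 -> fires only when both inputs are 1
assert neuron([1, 1], [1, 1], 2) == 1
# OR gate: same weights, threshold 1 -> fires when at least one input is 1
assert neuron([0, 1], [1, 1], 1) == 1
```

This already illustrates the next slide's point: a single threshold unit separates the input space with a hyperplane, which is why AND and OR are each computable by one unit.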
Artificial Neural Networks
[Figure: input space separation by a binary threshold unit. The OR and AND functions are linearly separable: a single line divides the inputs with output 1 from those with output 0, so each can be computed by one threshold unit.]
Feed-Forward ANN
[Figure: n input sites, k hidden units, m output units. Extra sites n+1 and k+1 are fixed at 1 (bias), contributing weights w_{n+1,k} and w_{k+1,m}. Connection matrix W_1 links input sites to hidden units; connection matrix W_2 links hidden units to output units.]
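A sketch of the forward pass through such a network, with a sigmoid activation at each unit; the weight values and the choice of sigmoid are illustrative assumptions, not from the slides:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, W1, W2):
    """Forward pass: extend the input with a bias site fixed at 1, apply W1
    row by row to get hidden activations, extend again, then apply W2."""
    x = x + [1.0]                                   # bias site n+1
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    hidden = hidden + [1.0]                         # bias site k+1
    return [sigmoid(sum(w * hi for w, hi in zip(row, hidden))) for row in W2]

# n = 2 inputs, k = 2 hidden units, m = 1 output (each row carries a bias weight)
y = forward([0.5, -0.2],
            W1=[[0.1, 0.4, 0.0], [-0.3, 0.2, 0.1]],
            W2=[[0.5, -0.5, 0.2]])
```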
Recurrent ANN
Recurrent ANN
Dealing with time series:
- Meteorological forecast
- Energy consumption
- Order request forecast
- Traffic forecast
- Financial market forecast
Nonlinear Autoregressive Exogenous model (NARX)
Exogenous inputs: temperature, hour of day.
Self Organizing Maps
Nature-inspired: autonomous units organize to adapt to an input space, and the organization preserves the topology of that space.
Kohonen's model
Multi-dimensional lattices of computing units. Each unit has an associated weight vector w, also called a prototype vector; w has the dimension of the input space. Each unit has lateral connections to several neighbours.
Kohonen's model
We have a training set D of vectors sampled from the input space. The network learns to adapt to the input space by updating the weights of its computing units.
Learning algorithm
Consider an n-dimensional input space. A one-dimensional SOM is a chain of computing units. When an input x is received, each unit m_i computes the Euclidean distance between x and its weight w_i. The unit k with the smallest distance (highest excitation) is selected (fires).
Learning algorithm
The neighbours of k are also updated. We define a neighbourhood function φ(i, k), e.g. φ(i, k) = 1 if d(i, k) < r, and φ(i, k) = 0 otherwise.
[Figure: a chain of units 1, 2, 3, ..., m with weights w_1, ..., w_m receiving an input x; the neighbourhood of unit 2 with radius 1 is highlighted.]
Learning algorithm
Init: select a learning constant η and a neighbourhood function φ; initialize the m weight vectors randomly.
1. Select an input vector ξ using the desired probability distribution over the input space.
2. Select the unit k with the maximum excitation, i.e. the one for which the distance between w_i and ξ is minimal, i = 1, ..., m.
3. Update the weight vectors using the neighbourhood function and the update rule w_i ← w_i + η φ(i, k)(ξ − w_i), for i = 1, ..., m.
4. Stop if the maximum number of iterations has been reached; otherwise modify η and φ as scheduled and continue with step 1.
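The steps above can be sketched for a one-dimensional chain of units with scalar inputs; the decay schedules for η and the radius, and the specific constants, are illustrative assumptions:

```python
import random

def train_som_1d(data, m=10, eta=0.5, radius=2, iters=2000, seed=0):
    """One-dimensional Kohonen SOM for scalar inputs in [0, 1]."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(m)]          # random weight init
    for t in range(iters):
        xi = rng.choice(data)                     # step 1: draw an input
        # step 2: winner = unit whose weight is closest to the input
        k = min(range(m), key=lambda i: abs(w[i] - xi))
        # schedule: shrink learning constant and neighbourhood over time
        eta_t = eta * (1 - t / iters)
        r_t = max(1, round(radius * (1 - t / iters)))
        # step 3: update winner and neighbours, phi(i, k) = 1 iff |i - k| <= r_t
        for i in range(m):
            if abs(i - k) <= r_t:
                w[i] += eta_t * (xi - w[i])
    return w

data = [i / 200 for i in range(201)]   # uniform samples in [0, 1]
w = train_som_1d(data)
```

With uniformly drawn inputs the prototypes spread out over [0, 1], each unit ending up as the representative of a small interval, as described on the next slides.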
Learning algorithm Each step attracts the weight of the excited unit toward the input Repeating this process, we expect to arrive at a uniform distribution of weight vectors in input space (if the inputs have also been uniformly selected).
Effect on neighbors The radius of the neighborhood is reduced according to a schedule Each time a unit is updated, neighboring units are also updated If the weight vector of a unit is attracted to a region in input space, the neighbors are also attracted, but to a lesser degree During the learning process both the size of the neighborhood and the value of φ fall gradually, so that the influence of each unit upon its neighbors is reduced.
Schedule and learning constant The learning constant controls the magnitude of the weight updates and is reduced gradually The net effect of the selected schedule is to produce larger corrections at the beginning of training than at the end
Linear SOM example
[Figure: a chain of units charting a two-dimensional region.] The weight vectors reach a distribution which transforms each unit into a representative of a small region of input space. The unit in the lower corner responds with the largest excitation to vectors in the shaded region.
Bi-dimensional networks
[Figures: two-dimensional networks unfolding over the unit square; Fig. 15.8 shows a planar network with a knot.]
Several proofs of convergence have been given for one-dimensional Kohonen networks in one-dimensional domains. There is no general proof of convergence for multidimensional networks.
Mapping high-dimensional spaces: usually, when an empirical data set is selected, we do not know its real dimension. Even if the input vectors are of dimension n, it could be that the data concentrates on a manifold of lower dimension. In general it is not obvious which network dimension should be used for a given data set. This general problem led Kohonen to consider what happens when a low-dimensional network is used to map a higher-dimensional space. In this case the network must fold in order to fill the available space. Figure 15.9 shows the result of an experiment in which a two-dimensional network was used to chart a three-dimensional box: the network extends in the x and y dimensions and folds in the z direction.
Animation: https://www.youtube.com/watch?v=qvi6l-kqst4
Mapping high-dimensional spaces
How a network of dimension n adapts to an input space of higher dimension: it must fold to fill the space.
[Fig. 15.9: two-dimensional map of a three-dimensional region.]
The units map alternately to one side or the other of input space (for the z dimension). A commonly cited example for this kind of structure in the human brain is the visual cortex. The brain actually processes not one but two visual images, one displaced with respect to the other. In this case the input domain consists of two planar regions (the two sides of the box of Figure 15.9). The planar cortex must fold in the same way in order to respond optimally to input from one or the other side of the input domain. The result is the appearance of the stripes of ocular dominance studied by neurobiologists in recent years. Figure 15.10 shows a representation of the ocular dominance columns in LeVay's reconstruction [205]. It is interesting to compare these stripes with the ones found in our simple experiment with the Kohonen network.
What dimension for the network?
In many cases we have experimental data which is coded using n real values, but whose effective dimension is much lower. Example: points on the surface of a sphere in three-dimensional space. The input vectors have three components, but a two-dimensional Kohonen network will do a better job of charting this input space.
Application: function approximation
Apply a planar grid to a surface P = {(x, y, f(x, y)) | x, y ∈ [0, 1]}. After the learning algorithm starts, the planar network moves in the direction of P and distributes itself to cover the domain.
Application: function approximation θ f n or the other. The necf(θ) =α sin θ+β dθ/dt andthevertical,and The network is a kind of look-up table of the values of f. The table can be made as sparse or as dense as needed
Nearest Neighbour Classification
Supervised method: assign to a new input the class of the nearest input in the training set.
Distances: Euclidean, Mahalanobis.
[Figure 14.1: with three classes, training points shown as circles with their class and test points as dots, the decision boundary is piecewise linear, each segment being the perpendicular bisector between two datapoints of different classes, giving rise to a Voronoi tessellation of the input space.]
Algorithm 14.1 Nearest neighbour algorithm to classify a vector x, given train data D = {(x^n, c^n), n = 1, ..., N}:
1: Calculate the dissimilarity of the test point x to each of the train points, d_n = d(x, x^n), n = 1, ..., N.
2: Find the train point x^n* nearest to x, n* = argmin_n d_n, and assign x the class label c^n*.
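A minimal sketch of the algorithm with the Euclidean distance; the two-point salmon/sea-bass training set is a toy illustration:

```python
import math

def nearest_neighbour(x, train):
    """Classify x with the label of the closest training point (Euclidean)."""
    nearest_point, nearest_label = min(
        train, key=lambda pair: math.dist(x, pair[0]))
    return nearest_label

train = [((0.0, 0.0), 'salmon'), ((1.0, 1.0), 'sea bass')]
label = nearest_neighbour((0.2, 0.1), train)   # closest to (0, 0)
```

Note that the whole training set is scanned for every query, which is exactly the storage and distance-computation cost discussed on the next slide.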
Nearest Neighbor Classification Entire dataset must be stored Distance calculation may be expensive How to deal with missing data? How to incorporate prior knowledge?
K Nearest Neighbors
A more robust classifier: consider the hypersphere centred on the test point that contains the k nearest training inputs, and assign the majority class among them. How to choose k? Cross validation.
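The majority vote over the k nearest training points can be sketched as follows; the toy training set is illustrative:

```python
import math
from collections import Counter

def knn(x, train, k=3):
    """Classify x by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda pair: math.dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), 'salmon'), ((0.0, 1.0), 'salmon'),
         ((1.0, 0.0), 'salmon'), ((5.0, 5.0), 'sea bass'),
         ((5.0, 6.0), 'sea bass')]
label = knn((0.5, 0.5), train, k=3)
```

Averaging over k neighbours smooths out single mislabelled or noisy training points, which is why the method is more robust than plain nearest neighbour.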
Mixture models
A mixture model is one in which a set of component models is combined to produce a richer model:
p(v) = Σ_{h=1}^{H} p(v|h) p(h)
[Figures: (a) the individual component densities, (b) the resulting mixture density.]
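The formula can be evaluated directly once the components are chosen; here is a sketch with Gaussian components p(v|h), where the means, standard deviations, and mixing weights p(h) are illustrative:

```python
import math

def gaussian(v, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at v."""
    z = (v - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(v, components):
    """p(v) = sum_h p(v|h) p(h), components given as (mu, sigma, p(h))."""
    return sum(ph * gaussian(v, mu, sigma) for mu, sigma, ph in components)

# two components with equal mixing weights p(h) = 0.5
components = [(-2.0, 1.0, 0.5), (3.0, 1.5, 0.5)]
p = mixture_density(0.0, components)
```

Because the mixing weights sum to 1 and each component is a density, the mixture is itself a valid density, yet it can be bimodal even though each component is unimodal.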
K-means clustering Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
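A minimal one-dimensional sketch of this alternation (Lloyd's algorithm): assign each observation to its nearest mean, then recompute each mean as its cluster's average. The toy data and fixed seed are illustrative:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm for scalar points: alternate assignment and update."""
    rng = random.Random(seed)
    means = rng.sample(points, k)                 # init means from the data
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p - means[i]) ** 2)
            clusters[j].append(p)
        # update step: each mean becomes the prototype (average) of its cluster
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return means

points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
means = kmeans(points, k=2)
```

For these two well-separated groups the means converge to the cluster centres 1 and 11, each mean serving as the prototype of its cluster.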
Graphical models