Machine Learning for Smart Learners

Brendan Baker

1 Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, Brazil. São Paulo School of Advanced Science on Smart Cities, 2017

5 What Is All This Fuss About Machine Learning?
It's Just Hype!!! (Oba-Oba)
Other areas that went through similar hype:
- Boolean Algebra and Digital Circuits ($$$)
- Automata and Formal Languages
- Relational Databases ($$$)
- Expert Systems
- Object-oriented Programming ($$$), etc.
When hype is gone, what remains is Science

6 Topics
1. Machine Learning and Smart Cities
2. Machine Learning Basics
3. Example 1: PoS Tagging
4. Example 2: Feature Learning via Word2Vec
5. Conclusion

7 Next Topic: 1. Machine Learning and Smart Cities

13 Personal Note
I have been working with machine learning (ML) since 1995, mainly in Computational Linguistics.
Also worked on 2 environmental projects with ML:
- Materials Discovery for Air and Water Cleaning
  - Fine Chemistry, (very) expensive data
  - High complexity (NP-complete), not so big data
  - Cornell University
- Materials Refinement (just started)
  - Brute Chemistry, not so expensive data
  - Lots of data (Big Data)
  - Civil engineering (USP), several companies interested

18 Description of a Real Project (Started)
Q: What is the product that is most manufactured today?
A: Concrete
- Artificial stone, mainly used in cities
- Big potential to improve living environment
- Affects the efficiency of built environment and climate change
- Per-capita production (2003): 4.2 t/inhabitant (cementitious material)

19 A Cement World

20 Evolution of CO2 Emissions from Cement
Conclusion: Concrete has an enormous impact on CO2 emissions

22 Concrete Depends on Too Many Inputs
- Cement quality
- Water quantity
- Aggregate origin (sand, natural gravel, and crushed stone)
- Chemical and mineral admixtures
- Reinforcement (steel, prestressed steel)
- Type of production, temperature, humidity, altitude
- Intended use, dust emission
- Mixing, workability, curing, time to dry out, etc.
And there are lots of data collected

26 The Problem
1. Given the inputs, predict CO2 emission and important properties (compressive strength, tensile strength, elasticity, coefficient of thermal expansion)
2. Control deviations in the production process
3. Propose new kinds of cements and cementitious material to optimize:
   - CO2 emission
   - profit
   - Cement use in concrete
   - Chemical additives in concrete, etc.
So now we can talk about machine learning

27 Similar Approaches in the Literature
Taffese & Sistonen. Machine learning for durability and..., Automation in Construction, 2017
- Machine learning methods can substitute time and resource consuming lab tests
- Machine learning techniques can play a substantial role in durability assessment
- Machine learning algorithms utilizing sensors data can discover hidden insights
- The future durability assessment and service-life prediction approach is proposed
Thanks to Prof. Vanderley M. John for input on concrete information

28 Next Topic: 2. Machine Learning Basics

32 What is Machine Learning?
- Branch of Computer Science
- Aim: Give computers the ability to learn without being explicitly programmed
- How? Algorithms that learn from data (Learning Phase) and make predictions on data (Execution Phase)

37 The Most Important Element? Data
- Annotated Data: input and desired output. Expensive. Basis for Supervised Learning
- Raw Data: input only. Less expensive. Basis for Unsupervised Learning
Other important elements:
- learning models
- learning algorithms
- computing power

38 Common Uses of Machine Learning

39 Common Application Domains
- Unstructured models
- Image Processing
- Computational Linguistics

41 Machine Learning Model Types (a more relevant classification)
Generative Models: model-based learning
- Decision Trees
- Logic-based models: association rules, inductive logic
- Markov Models: hidden, explicit, variable length
- Bayesian Models: naive, graphical models
- Probabilistic relational models
- PAC learning, etc.
Regressive Models: function-based learning
- Perceptron and its variants
- SVM
- Neural Nets (NN), etc.

44 Concrete Examples (from Language Processing)
Example 1: Part-of-speech tagging
- Generative / Model-based
- Supervised
- Classification
- Hidden Markov Model
Example 2: Word2Vec
- Regressive / function-based
- Unsupervised
- Feature Learning
- Neural Net

45 Next Topic: 3. Example 1: PoS Tagging

46 Hidden Markov Models See presentation attached

47 Next Topic: 4. Example 2: Feature Learning via Word2Vec

48 Neural Networks It all started with the linear separation problem

49 Perceptron
- Created by Rosenblatt [1957]
- Binary separation of data $X = \{x_1, \ldots, x_n\}$, $\dim(x_i) = k$
- Find $w$ and $b$ such that, for $x \in X$:
$$\mathrm{perceptron}(x) = \begin{cases} 1, & w \cdot x + b > 0 \\ 0, & w \cdot x + b < 0 \end{cases}$$
- Zero (non-separable) or infinitely many hyperplanes $(w, b)$
- Minsky & Papert [1969]: XOR functions cannot be learned by single-layer perceptrons
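
The slide states the separation problem but not how $w$ and $b$ are found. As an illustration only, the sketch below assumes the classical perceptron learning rule (nudge $w$ and $b$ toward each misclassified input); the names `predict` and `train_perceptron` are hypothetical. It finds a separating hyperplane for AND, while no pass over XOR ever ends without errors.

```python
def predict(w, b, x):
    """Return 1 if w.x + b > 0, else 0 (the rule on the slide)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(data, epochs=100, lr=1.0):
    """Classical perceptron rule (an assumption, not taken from the slides)."""
    k = len(data[0][0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            delta = y - predict(w, b, x)  # -1, 0 or +1
            if delta != 0:
                errors += 1
                w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
                b += lr * delta
        if errors == 0:  # every point is on the right side: stop
            break
    return w, b


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, b_and = train_perceptron(AND)
w_xor, b_xor = train_perceptron(XOR)
print("AND separated:", all(predict(w_and, b_and, x) == y for x, y in AND))
print("XOR separated:", all(predict(w_xor, b_xor, x) == y for x, y in XOR))
```

Since XOR is not linearly separable, at least one of its four points is misclassified whatever the number of epochs.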

51 Dealing with non-separable cases
SVM
- Employs Quadratic Programming to obtain a single answer
- Expanding the number of dimensions leads to separability; the extra dimensions are a function of the given ones
- Uses kernels to deal with a large number of dimensions
Neural Networks
- Multilayer perceptrons
- 0-1 functions are a problem for learning (not differentiable)
- Use some sigmoid ($\sigma$) function instead
- Neuron: $\sigma(w \cdot x + b)$
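
A small sketch of both ideas, under stated assumptions: the extra dimension $x_1 x_2$ and the weights are hand-picked for illustration (they are not learned and do not come from the slides), and `expand`, `linear_score` and `neuron` are hypothetical names. With the added dimension, XOR becomes linearly separable, and the sigmoid neuron gives a differentiable version of the same decision.

```python
import math


def expand(x):
    """Map (x1, x2) to (x1, x2, x1*x2): the extra dimension is a function of the given ones."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def linear_score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(w, b, x):
    """Differentiable replacement for the 0-1 step: sigma(w.x + b)."""
    return sigmoid(linear_score(w, b, x))


XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w, b = (1.0, 1.0, -2.0), -0.5  # hand-picked hyperplane in the expanded space

for x, y in XOR:
    hard = 1 if linear_score(w, b, expand(x)) > 0 else 0
    soft = neuron(w, b, expand(x))
    print(x, "target:", y, "step:", hard, "sigmoid:", round(soft, 3))
```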

52 Neural Net (Old View) Does not scale. e.g. X 15, 000

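54 Neural Net (New View)
A whole layer may be represented as: $\mathrm{layer}(x) = \sigma(Wx + b)$
$x$, $b$, $\mathrm{layer}(x)$: vectors; $W$: matrix
[Figure: computation graph in which $W$ and $x$ feed a multiplication node, its output $Wx$ and $b$ feed an addition node giving $Wx + b$, and $\sigma$ produces the output $y$]
Nodes are operations, arcs are tensors
Training occurs over such a graph using backpropagation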
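
The layer formula can be sketched in a few lines of plain Python. This is only an illustration: the weights are arbitrary, and `layer` and `network` are hypothetical helper names. A real implementation would use a tensor library rather than nested lists.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(W, b, x):
    """Compute sigma(Wx + b); W is a list of rows, b and x are lists."""
    return [
        sigmoid(sum(wij * xj for wij, xj in zip(row, x)) + bi)
        for row, bi in zip(W, b)
    ]


def network(layers, x):
    """Chain layers: the output vector of one layer is the input of the next."""
    for W, b in layers:
        x = layer(W, b, x)
    return x


# Arbitrary illustrative weights: a 2 -> 3 layer followed by a 3 -> 1 layer.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 1.0]], [0.0]),
]
y = network(layers, [1.0, 1.0])
print(y)
```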

55 The Backpropagation Algorithm
- Proposed by Kelley [1960], Bryson [1961] and Dreyfus [1962]
- Popularized by Rumelhart, Hinton & Williams [1986]
- Learning by gradient descent. Supervised learning
- Applies to a tensor graph (not only NN!)

56 The Backpropagation Algorithm (Guts)
- Initializes the weights to be learned randomly
- Minimizes a loss function, e.g. $L(x) = \sum_i (y_i - \hat{y}_i(x))^2$
- Phase 1: Propagation. Computes the output $\hat{y}$ and the loss $L$
- Phase 2: Weight update, where $\alpha$ is the learning rate. Gradient descent moves against the gradient:
$$W_{t+1} = W_t - \alpha \frac{\partial L}{\partial W}, \qquad b_{t+1} = b_t - \alpha \frac{\partial L}{\partial b}$$
- A propagation-update cycle is an epoch
- Local optimization, may get stuck at local minima
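
To make the two phases concrete, here is a minimal sketch on the smallest possible "graph": a linear model $\hat{y} = wx + b$ with the squared loss from the slide. The data, learning rate and function name `train` are assumptions chosen for illustration; with a single node the gradients can be written by hand, whereas backpropagation proper applies the chain rule through a deeper graph.

```python
import random


def train(xs, ys, alpha=0.02, epochs=2000, seed=0):
    """Gradient descent on L = sum_i (y_i - (w*x_i + b))^2."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initialization
    losses = []
    for _ in range(epochs):
        # Phase 1: propagation (compute outputs and loss)
        preds = [w * x + b for x in xs]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)))
        # Phase 2: weight update, stepping against the gradient
        grad_w = sum(-2 * (y - p) * x for x, y, p in zip(xs, ys, preds))
        grad_b = sum(-2 * (y - p) for y, p in zip(ys, preds))
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b, losses


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b, losses = train(xs, ys)
print(round(w, 3), round(b, 3), losses[0], losses[-1])
```

This particular loss is convex, so the run converges to the generating parameters; the warning about local minima on the slide concerns non-convex losses such as those of multilayer networks.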

57 Word2vec
- Word2vec is a name that covers two models:
  - Skip-gram
  - Continuous Bag-of-Words (CBOW)
- Aim: efficiently learn a vector representation of words: banana $\mapsto (v_1, \ldots, v_N)$, $v_i \in \mathbb{Q}$
- Unsupervised learning from a large unstructured corpus
- Based on word co-occurrence statistics. The idea is not new: "You shall know a word by the company it keeps" (J.R. Firth, 1957)

58 A Simplified CBOW model
Given a corpus, choose:
- A vocabulary $V$
- A vector size $N$ to represent words
Use matrices $W \in \mathbb{Q}^{|V| \times N}$ and $W' \in \mathbb{Q}^{N \times |V|}$ to create two vector representations of each word $w$:
- input vector: $v_w$ (row of $W$)
- output vector: $v'_w$ (column of $W'$)

59 A Simplified CBOW model
The model's task is to predict a focus word given a context of words:
"O primeiro rei de Portugal nasceu em..." ("The first king of Portugal was born in...")
- Observation: (rei, primeiro)
- Observation: (input word, output word)
- Observation: $(w_I, w_O)$

60 A Simplified CBOW model Deep learning, with depth 2!!!

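61 A Simplified CBOW model
[Figure: computation graph. The one-hot input $x$ of shape $(1, |V|)$ is multiplied by $W$ of shape $(|V|, N)$ to give $h$ of shape $(1, N)$; $h$ is multiplied by $W'$ of shape $(N, |V|)$ and passed through softmax to give $y$ of shape $(1, |V|)$; cross-entropy compares $y$ with the one-hot target $x_{w_O}$ of shape $(1, |V|)$.]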

62 Some Definitions
- A one-hot is a vector of bits with a single 1-bit; all other bits are 0. Given $N$ elements, we associate them with $N$ one-hot vectors of size $N$.
- The softmax function is a probability distribution over the elements of a vector:
$$P(z_j) = \frac{e^{z_j}}{\sum_{i=1}^{N} e^{z_i}}$$
- The cross-entropy of distributions $p$ and $q$:
$$CE(p, q) = -\sum_i p_i \log q_i$$
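
The three definitions translate directly into code. The sketch below is illustrative; the function names are hypothetical, and subtracting the maximum inside `softmax` is a standard numerical-stability trick that does not change the result.

```python
import math


def one_hot(index, size):
    """Vector of `size` bits with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def softmax(z):
    """P(z_j) = exp(z_j) / sum_i exp(z_i), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """CE(p, q) = -sum_i p_i log q_i (terms with p_i = 0 contribute nothing)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


y = softmax([2.0, 1.0, 0.1])
target = one_hot(0, 3)
loss = cross_entropy(target, y)
print(y, loss)
```

When $p$ is one-hot, the cross-entropy reduces to minus the log of the probability assigned to the target position, which is the fact the CBOW derivation relies on.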

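63 A Simplified CBOW model
Given $(x_{w_I}, x_{w_O})$, the one-hot vectors of $(w_I, w_O)$, and $x = x_{w_I}$, the model is:
$$h_i = \sum_{s=1}^{|V|} w_{si}\, x_s \quad \text{for } i = 1, \ldots, N \tag{1}$$
$$u_j = \sum_{s=1}^{N} w'_{sj}\, h_s \quad \text{for } j = 1, \ldots, |V| \tag{2}$$
$$y_j = P(w_j \mid w_I) = \frac{\exp(u_j)}{\sum_{j'=1}^{|V|} \exp(u_{j'})} \quad \text{for } j = 1, \ldots, |V| \tag{3}$$
$$E = CE(x_{w_O}, y) = -\sum_{s=1}^{|V|} x_{w_O, s} \log(y_s) \tag{4}$$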

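64 A Simplified CBOW model
Due to the one-hot format of $x_{w_I}$ and $x_{w_O}$, we simplify (1), (2), (3) and (4):
$$h = v_{w_I} \tag{5}$$
$$u_j = {v'_{w_j}}^{T} v_{w_I} \tag{6}$$
$$y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{|V|} \exp(u_{j'})} \tag{7}$$
$$E = -u_{j^*} + \log\Big(\sum_{j'=1}^{|V|} \exp(u_{j'})\Big) \tag{8}$$
where $j^*$ is the index of $w_O$.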
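
A quick numerical check that the simplified form agrees with the full one. The tiny matrices and the names `forward_full` and `forward_simplified` are made up for illustration; `W` is stored as $|V|$ rows of length $N$ and `W_out` (standing for $W'$) as $N$ rows of length $|V|$.

```python
import math


def forward_full(W, W_out, x, x_out):
    """Equations (1)-(4), with explicit one-hot vectors."""
    V, N = len(W), len(W[0])
    h = [sum(W[s][i] * x[s] for s in range(V)) for i in range(N)]      # (1)
    u = [sum(W_out[s][j] * h[s] for s in range(N)) for j in range(V)]  # (2)
    z = sum(math.exp(uj) for uj in u)
    y = [math.exp(uj) / z for uj in u]                                 # (3)
    E = -sum(x_out[s] * math.log(y[s]) for s in range(V))              # (4)
    return h, u, y, E


def forward_simplified(W, W_out, k_in, k_out):
    """Equations (5), (6) and (8), using word indexes instead of one-hots."""
    V, N = len(W), len(W[0])
    h = list(W[k_in])                                                  # (5)
    u = [sum(W_out[s][j] * h[s] for s in range(N)) for j in range(V)]  # (6)
    E = -u[k_out] + math.log(sum(math.exp(uj) for uj in u))            # (8)
    return h, u, E


W = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]    # |V| = 3, N = 2
W_out = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.6]]  # N = 2, |V| = 3
k_in, k_out = 1, 2
x = [1.0 if s == k_in else 0.0 for s in range(3)]
x_out = [1.0 if s == k_out else 0.0 for s in range(3)]

h1, u1, y1, E1 = forward_full(W, W_out, x, x_out)
h2, u2, E2 = forward_simplified(W, W_out, k_in, k_out)
print(E1, E2)
```

The simplified version replaces a $|V| \times N$ matrix product with a row lookup, which is why implementations index into $W$ instead of multiplying by a one-hot vector.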

65 A Simplified CBOW model: update
Applying backpropagation, the weight update of the last layer is
$$w'^{(new)}_{ij} = w'^{(old)}_{ij} - \alpha\, e_j\, h_i \tag{9}$$
in vector notation
$$v'^{(new)}_{w_j} = v'^{(old)}_{w_j} - \alpha\, e_j\, v_{w_I} \tag{10}$$
where $e = y - x_{w_O}$

66 A Simplified CBOW model: update
- $w_j \neq w_O$: $-\alpha e_j < 0$; subtract from $v'_{w_j}$ a fraction of $v_{w_I}$; increase the cosine distance between $v_{w_I}$ and $v'_{w_j}$.
- $w_j = w_O$: $-\alpha e_j > 0$; add a fraction of $v_{w_I}$ to $v'_{w_j}$; decrease the cosine distance between $v_{w_I}$ and $v'_{w_j}$.

67 A Simplified CBOW model: update
Proceed with backpropagation:
$$W^{(new)} = W^{(old)} - \alpha\, x\, EH^{T} \tag{11}$$
$$v^{(new)}_{w_I} = v^{(old)}_{w_I} - \alpha\, (x\, EH^{T})_{(k_I, \cdot)} \tag{12}$$
where $EH = e\,(W')^{T}$ and $k_I$ is the index of $w_I$.

68 A Simplified CBOW model
Repeating this process with examples from the corpus accumulates the effect; as a result, words with similar contexts end up close to each other. The model captures the co-occurrence statistics using cosine distance.
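
One way to sketch a full training step for a single observation $(w_I, w_O)$, combining the forward pass with updates (9) and (12). Everything concrete here (sizes, seed, learning rate, the name `cbow_step`) is an assumption for illustration; `W_out` stands for $W'$.

```python
import math
import random


def softmax(u):
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_step(W, W_out, k_in, k_out, alpha):
    """One update for (w_I, w_O); returns the loss measured before the update."""
    V, N = len(W), len(W[0])
    h = list(W[k_in])                                                   # (5)
    u = [sum(W_out[s][j] * h[s] for s in range(N)) for j in range(V)]   # (6)
    y = softmax(u)                                                      # (7)
    loss = -math.log(y[k_out])                                          # (8)
    e = [y[j] - (1.0 if j == k_out else 0.0) for j in range(V)]         # e = y - x_wO
    # EH = e (W')^T, computed with the old W' before it is modified
    EH = [sum(e[j] * W_out[i][j] for j in range(V)) for i in range(N)]
    for i in range(N):
        for j in range(V):
            W_out[i][j] -= alpha * e[j] * h[i]                          # (9)
    for i in range(N):
        W[k_in][i] -= alpha * EH[i]                                     # (12)
    return loss


rng = random.Random(0)
V, N = 5, 3
W = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]
losses = [cbow_step(W, W_out, k_in=1, k_out=3, alpha=0.1) for _ in range(50)]
print(losses[0], losses[-1])
```

Because $x$ is one-hot, only row $k_I$ of $W$ changes in (11), so the sketch updates that row alone.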

69 CBOW
Now, starting from an arbitrary window of size $C$, we construct observations such as $([w_{I_1}, \ldots, w_{I_C}], w_O)$. E.g., for $C = 4$:
"Nunca me acostumei com o cantor dessa banda, e nem..." ("I never got used to this band's singer, and neither...")
([com, o, dessa, banda], cantor)
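
A sketch of how such observations might be extracted. It assumes a symmetric window ($C/2$ words on each side of the focus word), whitespace tokenization with punctuation already removed, and that positions without a full window are skipped; none of these choices is stated on the slide, and `cbow_observations` is a hypothetical name.

```python
def cbow_observations(tokens, C):
    """Yield ([C context words], focus word) pairs with C/2 words on each side."""
    half = C // 2
    observations = []
    for i in range(half, len(tokens) - half):
        context = tokens[i - half:i] + tokens[i + 1:i + half + 1]
        observations.append((context, tokens[i]))
    return observations


tokens = "Nunca me acostumei com o cantor dessa banda".split()
for context, focus in cbow_observations(tokens, C=4):
    print(context, focus)
```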

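70 CBOW: model
[Figure: computation graph. The one-hot context inputs $x_{w_{I_1}}, \ldots, x_{w_{I_C}}$, each of shape $(1, |V|)$, are summed into $x$; $x$ is multiplied by $W$ of shape $(|V|, N)$ and scaled by $1/C$ to give $h$ of shape $(1, N)$; $h$ is multiplied by $W'$ of shape $(N, |V|)$ and passed through softmax to give $y$ of shape $(1, |V|)$; cross-entropy compares $y$ with the one-hot target $x_{w_O}$ of shape $(1, |V|)$.]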

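71 CBOW: model
$$x = x_{w_{I_1}} + \cdots + x_{w_{I_C}} \tag{13}$$
$$h = \frac{1}{C}\,(v_{w_{I_1}} + \cdots + v_{w_{I_C}}) \tag{14}$$
$$u_j = \sum_{s=1}^{N} w'_{sj}\, h_s \tag{15}$$
$$y_j = p(w_j \mid w_{I_1}, \ldots, w_{I_C}) = \frac{\exp({v'_{w_j}}^{T} h)}{\sum_{j'=1}^{|V|} \exp({v'_{w_{j'}}}^{T} h)} \tag{16}$$
$$E = -u_{j^*} + \log\Big(\sum_{j'=1}^{|V|} \exp(u_{j'})\Big) \tag{17}$$
where $j^*$ is the index of $w_O$.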
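
Equations (13) and (14) describe the same hidden vector in two ways: multiply the summed one-hot input by $W$ and scale by $1/C$, or average the input vectors of the context words directly. A small check, with made-up numbers and hypothetical function names:

```python
def hidden_from_sum(W, context_ids):
    """h = (1/C) x W, where x is the sum of the context one-hot vectors (13)."""
    V, N = len(W), len(W[0])
    C = len(context_ids)
    x = [0.0] * V
    for k in context_ids:
        x[k] += 1.0
    return [sum(x[s] * W[s][i] for s in range(V)) / C for i in range(N)]


def hidden_from_average(W, context_ids):
    """h = average of the rows of W selected by the context words (14)."""
    C, N = len(context_ids), len(W[0])
    return [sum(W[k][i] for k in context_ids) / C for i in range(N)]


W = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.5], [0.7, -0.3]]  # |V| = 5, N = 2
context_ids = [0, 1, 3, 4]                                            # C = 4
print(hidden_from_sum(W, context_ids), hidden_from_average(W, context_ids))
```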

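72 CBOW: update
$$v'^{(new)}_{w_j} = v'^{(old)}_{w_j} - \alpha\, e_j\, h \tag{18}$$
$$v^{(new)}_{w_{I_c}} = v^{(old)}_{w_{I_c}} - \frac{1}{C}\,\alpha\, (x\, EH^{T})_{(k_{I_c}, \cdot)} \tag{19}$$
for $c = 1, \ldots, C$, where $k_{I_1}, \ldots, k_{I_C}$ are the indexes of $w_{I_1}, \ldots, w_{I_C}$, respectively.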

73 Optimization
$$y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{|V|} \exp(u_{j'})}$$
Too costly to compute for each input training instance
- Negative Sampling
- Hierarchical Softmax

74 Negative Sampling
We keep $x$, $W$, $W'$, $h$ as before. To compute the error function, we employ a distribution $P_n(w)$ over all words in the corpus. E.g.:
$$P_n(w) = \frac{U(w)^{3/4}}{Z}$$
Using $P_n(w)$ we sample $w_{i_1}, \ldots, w_{i_K}$; avoid $w_O$ among them
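
A sketch of the sampling step, assuming $U(w)$ is the unigram count of $w$ in the corpus and $Z$ the normalizing constant (the usual reading in Mikolov et al., 2013; the slide does not define them). The toy corpus and the names `noise_distribution` and `sample_negatives` are illustrative. Here $w_O$ is avoided by removing it from the support before sampling, which is one possible choice among several.

```python
import random
from collections import Counter


def noise_distribution(tokens, power=0.75):
    """P_n(w) = U(w)^(3/4) / Z, with U(w) taken as the count of w."""
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    Z = sum(weights.values())
    return {w: v / Z for w, v in weights.items()}


def sample_negatives(P_n, w_out, K, rng):
    """Draw K words from P_n, never returning the output word w_out."""
    words = [w for w in P_n if w != w_out]
    weights = [P_n[w] for w in words]
    return rng.choices(words, weights=weights, k=K)


tokens = "o rei de portugal e o rei de castela".split()
P_n = noise_distribution(tokens)
negatives = sample_negatives(P_n, w_out="rei", K=5, rng=random.Random(0))
print(P_n)
print(negatives)
```

The exponent $3/4$ flattens the distribution: a word seen twice as often is sampled about $2^{0.75} \approx 1.68$ times as often rather than twice.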

75 Negative Sampling
- $(w_I, w_O)$: positive example
- $(w_{i_1}, w_O), \ldots, (w_{i_K}, w_O)$: negative examples

76 Negative Sampling: the model
- $p(d = 1 \mid w, w_O) = \sigma({v'_{w}}^{T} h)$: probability that $(w, w_O)$ co-occur in the corpus (in a $C$-window)
- $p(d = 0 \mid w, w_O)$: probability that $(w, w_O)$ do not co-occur in the corpus
The goal of training now is to maximize the probabilities $p(d = 1 \mid w_I, w_O)$, $p(d = 0 \mid w_{i_1}, w_O), \ldots, p(d = 0 \mid w_{i_K}, w_O)$

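77 Negative Sampling: the model
Minimize the following error function:
$$\begin{aligned}
E &= -\log\Big(p(d = 1 \mid w_I, w_O) \cdot \prod_{s=1}^{K} p(d = 0 \mid w_{i_s}, w_O)\Big) \\
  &= -\Big(\log p(d = 1 \mid w_I, w_O) + \log \prod_{s=1}^{K} p(d = 0 \mid w_{i_s}, w_O)\Big) \\
  &= -\Big(\log p(d = 1 \mid w_I, w_O) + \sum_{s=1}^{K} \log p(d = 0 \mid w_{i_s}, w_O)\Big) \\
  &= -\log \sigma({v'_{w_O}}^{T} h) - \sum_{s=1}^{K} \log \sigma(-{v'_{w_{i_s}}}^{T} h)
\end{aligned}$$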
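
The last line of the derivation uses $1 - \sigma(z) = \sigma(-z)$. The sketch below computes the error both ways to check that they agree; the vectors are made up, and `ns_loss` and `ns_loss_explicit` are hypothetical names. Only $K + 1$ dot products are needed, instead of the $|V|$ required by the full softmax.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ns_loss(h, v_out, v_negs):
    """E = -log sigma(v'_wO . h) - sum_s log sigma(-v'_wis . h)."""
    loss = -math.log(sigmoid(dot(v_out, h)))
    for v in v_negs:
        loss -= math.log(sigmoid(-dot(v, h)))
    return loss


def ns_loss_explicit(h, v_out, v_negs):
    """Same error written with p(d=0) = 1 - sigma(v' . h)."""
    loss = -math.log(sigmoid(dot(v_out, h)))
    for v in v_negs:
        loss -= math.log(1.0 - sigmoid(dot(v, h)))
    return loss


h = [0.2, -0.4, 0.1]
v_out = [0.5, 0.1, -0.3]                       # output vector of w_O
v_negs = [[-0.2, 0.3, 0.4], [0.1, 0.1, 0.1]]   # output vectors of K = 2 negatives
print(ns_loss(h, v_out, v_negs), ns_loss_explicit(h, v_out, v_negs))
```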

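78 Negative Sampling: update
$$v'^{(new)}_{w_O} = v'^{(old)}_{w_O} - \alpha\, \big(\sigma({v'_{w_O}}^{T} h) - 1\big)\, h \tag{20}$$
$$v'^{(new)}_{w_{i_s}} = v'^{(old)}_{w_{i_s}} - \alpha\, \#(i_s)\, \sigma({v'_{w_{i_s}}}^{T} h)\, h \tag{21}$$
$$W^{(new)} = W^{(old)} - \alpha\, x\, EH^{T} \tag{22}$$
where
$$EH = \big(\sigma({v'_{w_O}}^{T} h) - 1\big)\, v'_{w_O} + \sum_{s=1}^{K} \sigma({v'_{w_{i_s}}}^{T} h)\, v'_{w_{i_s}}$$
Deep learning with 1.5 layers!!!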
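
A sketch of one negative-sampling step following (20) to (22), for the simplified case $h = v_{w_I}$. It reads $\#(i_s)$ as the number of times a word was drawn among the negatives, which is an assumption; repeated draws therefore accumulate in a single coefficient. For convenience, the output vectors are stored as rows of `V_out` (the transpose of $W'$), and all sizes, indexes and names are illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ns_step(W, V_out, k_in, k_out, negatives, alpha):
    """One update; returns the loss before it. `negatives` must not contain k_out."""
    N = len(W[0])
    h = list(W[k_in])
    loss = -math.log(sigmoid(dot(V_out[k_out], h)))
    loss -= sum(math.log(sigmoid(-dot(V_out[k], h))) for k in negatives)
    # Gradient coefficient per touched output word, all from the old vectors.
    coef = {k_out: sigmoid(dot(V_out[k_out], h)) - 1.0}
    for k in negatives:
        coef[k] = coef.get(k, 0.0) + sigmoid(dot(V_out[k], h))
    EH = [sum(c * V_out[k][i] for k, c in coef.items()) for i in range(N)]
    for k, c in coef.items():
        V_out[k] = [v - alpha * c * hi for v, hi in zip(V_out[k], h)]  # (20), (21)
    W[k_in] = [v - alpha * g for v, g in zip(W[k_in], EH)]             # (22), row k_I
    return loss


rng = random.Random(0)
V, N = 6, 3
W = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
V_out = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
losses = [ns_step(W, V_out, k_in=1, k_out=4, negatives=[0, 2, 2], alpha=0.1)
          for _ in range(30)]
print(losses[0], losses[-1])
```

Only the vectors of $w_I$, $w_O$ and the sampled words are touched in each step, which is where the savings over the full softmax update come from.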

81 Intrinsic Evaluation of the Method
$w_1$ is to $w_2$ as $w_3$ is to $x$
- $w_1$ = France, $w_2$ = Paris, $w_3$ = Japan; $x$ = Tokyo
- $w_1$ = man, $w_2$ = king, $w_3$ = woman; $x$ = queen
Problematic cases:
- $w_1$ = man, $w_2$ = manager, $w_3$ = woman; $x$ = secretary
- $w_1$ = white, $w_2$ = beauty, $w_3$ = black; $x$ = ugly
The method unveils sexism and racism buried in the data!
Book: Weapons of Math Destruction
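
The slide does not say how $x$ is computed. A common approach, assumed here, is the vector-offset method: take the word whose vector is closest in cosine similarity to $v_{w_2} - v_{w_1} + v_{w_3}$, excluding the three input words. The 2-dimensional vectors below are hand-made toy values, not learned embeddings, and `analogy` is a hypothetical name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, w1, w2, w3):
    """Return x such that w1 is to w2 as w3 is to x (vector-offset method)."""
    target = [b - a + c for a, b, c in zip(vectors[w1], vectors[w2], vectors[w3])]
    candidates = [w for w in vectors if w not in (w1, w2, w3)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hypothetical toy vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}
print(analogy(toy, "man", "king", "woman"))
```

With embeddings learned from a real corpus, the same arithmetic also reproduces whatever associations the corpus contains, which is how the problematic cases above arise.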

82 Applications (extrinsic evaluation)
- Many NLP applications employ word2vec
- We have implemented word2vec for Portuguese, using Google's TensorFlow
- And used it for Named Entity Recognition (NER)

83 Next Topic: 5. Conclusion

84 Thank you!
Visit our group's Machine Learning tutorials:
- TensorFlow
- Theano+Lasagne
- Keras

85 Bibliography
- Halevy, A., Norvig, P., and Pereira, F. (2009). The unreasonable effectiveness of data. In Intelligent Systems, IEEE.
- Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages
- Morin, F., Bengio, Y. (2005). Hierarchical probabilistic neural network language model. In AISTATS, pages
- Rong, X. (2016). Word2vec Parameter Learning Explained. arXiv preprint arXiv:
