MATH 567: Mathematical Techniques in Data Science Lab 8


MATH 567: Mathematical Techniques in Data Science Lab 8
Dominique Guillot
Department of Mathematical Sciences, University of Delaware
April 11, 2017

Recall

We have:
$$a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)$$
$$a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)$$
$$a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)$$
$$h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1).$$

Recall (cont.)

Vector form:
$$z^{(2)} = W^{(1)} x + b^{(1)}, \qquad a^{(2)} = f(z^{(2)}),$$
$$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}, \qquad h_{W,b}(x) = a^{(3)} = f(z^{(3)}).$$
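
As an illustration, here is a minimal R sketch of this forward pass for the 3-3-1 network above. It is not the h2o implementation used later in the lab; f is taken to be the sigmoid, and the weights and biases are small random values chosen only for the example.

# Minimal forward-pass sketch (illustrative values, sigmoid activation)
f <- function(z) 1 / (1 + exp(-z))           # sigmoid

set.seed(1)
W1 <- matrix(rnorm(9, sd = 0.01), 3, 3)      # W^(1): layer 1 -> layer 2
b1 <- rep(0, 3)                              # b^(1)
W2 <- matrix(rnorm(3, sd = 0.01), 1, 3)      # W^(2): layer 2 -> layer 3
b2 <- 0                                      # b^(2)

forward <- function(x) {
  z2 <- W1 %*% x + b1                        # z^(2) = W^(1) x + b^(1)
  a2 <- f(z2)                                # a^(2) = f(z^(2))
  z3 <- W2 %*% a2 + b2                       # z^(3) = W^(2) a^(2) + b^(2)
  f(z3)                                      # h_{W,b}(x) = a^(3) = f(z^(3))
}

forward(c(1, 0, 1))                          # output of the network for one input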

Training neural networks

Suppose we have:
A neural network with $s_l$ neurons in layer $l$ ($l = 1, \dots, n_l$).
Observations $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}) \in \mathbb{R}^{s_1} \times \mathbb{R}^{s_{n_l}}$.

We would like to choose $W^{(l)}$ and $b^{(l)}$ in some optimal way for all $l$.

Let
$$J(W, b; x, y) := \tfrac{1}{2} \|h_{W,b}(x) - y\|_2^2 \qquad \text{(squared error for one sample)}.$$

Define
$$J(W, b) := \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \bigl(W^{(l)}_{ji}\bigr)^2$$
(average squared error with a ridge penalty).

Note: The ridge penalty prevents overfitting. We do not penalize the bias terms $b^{(l)}$.
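
To make the objective concrete, here is a minimal R sketch that evaluates $J(W, b)$ for the small network above on a handful of made-up samples; the weights, data, and $\lambda$ are illustrative only.

# Regularized objective J(W, b): mean squared error + ridge penalty on the weights
f <- function(z) 1 / (1 + exp(-z))
set.seed(1)
W1 <- matrix(rnorm(9, sd = 0.01), 3, 3); b1 <- rep(0, 3)
W2 <- matrix(rnorm(3, sd = 0.01), 1, 3); b2 <- 0
X <- matrix(runif(3 * 5), nrow = 3)          # m = 5 samples, one per column
Y <- runif(5)                                # corresponding targets
lambda <- 1e-4

h <- function(x) f(W2 %*% f(W1 %*% x + b1) + b2)                  # h_{W,b}(x)
per_sample <- sapply(1:ncol(X), function(i) 0.5 * sum((h(X[, i]) - Y[i])^2))
J <- mean(per_sample) + (lambda / 2) * (sum(W1^2) + sum(W2^2))    # biases not penalized
J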

Some remarks

Can use other loss functions (e.g., for classification).
Can use other penalties (e.g., $\ell_1$, elastic net, etc.).
In classification problems, we choose the labels $y \in \{0, 1\}$ (if working with the sigmoid) or $y \in \{-1, 1\}$ (if working with tanh). For regression problems, we scale the output so that $y \in [0, 1]$ (if working with the sigmoid) or $y \in [-1, 1]$ (if working with tanh).
We can use gradient descent to minimize $J(W, b)$. Note that since the function $J(W, b)$ is non-convex, we may only find a local minimum.
We need an initial choice for $W^{(l)}_{ij}$ and $b^{(l)}_i$. If we initialize all the parameters to 0, then the parameters remain constant over the layers because of the symmetry of the problem. As a result, we initialize the parameters to small random values (say, using $N(0, \epsilon^2)$ with $\epsilon = 0.01$).

Gradient descent and the backpropagation algorithm

We update the parameters using gradient descent as follows:
$$W^{(l)}_{ij} \leftarrow W^{(l)}_{ij} - \alpha \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b), \qquad b^{(l)}_i \leftarrow b^{(l)}_i - \alpha \frac{\partial}{\partial b^{(l)}_i} J(W, b).$$
Here $\alpha > 0$ is a parameter (the learning rate).

The partial derivatives can be cleverly computed using the chain rule to avoid repeating calculations (backpropagation algorithm).
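
For concreteness, here is a minimal R sketch of one such update for the 3-3-1 network and squared-error loss above (sigmoid $f$, one training sample). It only illustrates the chain-rule computation; it is not the h2o implementation, and all numerical values are made up.

# One gradient-descent step computed by backpropagation (illustrative values)
f <- function(z) 1 / (1 + exp(-z))           # sigmoid, with f'(z) = f(z) (1 - f(z))
set.seed(1)
W1 <- matrix(rnorm(9, sd = 0.01), 3, 3); b1 <- rep(0, 3)
W2 <- matrix(rnorm(3, sd = 0.01), 1, 3); b2 <- 0
x <- c(1, 0, 1); y <- 1                      # one training sample
alpha <- 0.5; lambda <- 1e-4                 # learning rate and ridge parameter

# Forward pass
z2 <- W1 %*% x + b1; a2 <- f(z2)
z3 <- W2 %*% a2 + b2; a3 <- f(z3)

# Backward pass: propagate the error using the chain rule
delta3 <- -(y - a3) * a3 * (1 - a3)          # error at the output layer
delta2 <- (t(W2) %*% delta3) * a2 * (1 - a2) # error at the hidden layer

# Parameter updates (ridge penalty applied to the weights only)
W2 <- W2 - alpha * (delta3 %*% t(a2) + lambda * W2)
b2 <- b2 - alpha * as.numeric(delta3)
W1 <- W1 - alpha * (delta2 %*% t(x) + lambda * W1)
b1 <- b1 - alpha * as.numeric(delta2)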

Sparse neural networks

Sparse networks can be built by:
Penalizing coefficients (e.g., using an $\ell_1$ penalty).
Dropping some of the connections at random (dropout); Srivastava et al., JMLR 15 (2014).

Useful to prevent overfitting.

Recent work: one-shot learners can be used to train models with a smaller sample size.

Autoencoders

An autoencoder learns the identity function:
Input: unlabeled data.
Output = input.

Idea: limit the number of hidden units to discover structure in the data. Learn a compressed representation of the input.

Source: UFLDL tutorial.
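
As a sketch of how this looks in the package used later in this lab, h2o.deeplearning can be fit as an autoencoder by reconstructing the inputs themselves. The unlabeled frame, the 5-unit bottleneck, and the use of h2o.deepfeatures to pull out the compressed representation are assumptions for the example, not part of the lab.

# Sketch: an autoencoder in h2o (output = input, small hidden layer)
library(h2o)
h2o.init(nthreads = -1)
unlabeled <- as.h2o(matrix(runif(100 * 10), 100, 10))   # made-up unlabeled data

ae <- h2o.deeplearning(x = names(unlabeled),
                       training_frame = unlabeled,
                       autoencoder = TRUE,              # learn the identity function
                       activation = "Tanh",
                       hidden = c(5),                   # bottleneck: compressed representation
                       epochs = 50)

codes <- h2o.deepfeatures(ae, unlabeled, layer = 1)     # the learned features a^(2)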

Example (UFLDL)

Train an autoencoder on 10 × 10 images with one hidden layer.

Each hidden unit computes:
$$a^{(2)}_i = f\left(\sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i\right).$$

Think of $a^{(2)}_i$ as some non-linear feature of the input $x$.

Problem: Find the $x$ that maximally activates $a^{(2)}_i$ over $\|x\|_2 \le 1$.

Claim:
$$x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j=1}^{100} \bigl(W^{(1)}_{ij}\bigr)^2}}.$$
(Hint: use the Cauchy–Schwarz inequality.)

We can now display the image maximizing $a^{(2)}_i$ for each $i$.
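
The hint can be unpacked as follows (this short derivation fills in the Cauchy–Schwarz step, which is only hinted at on the slide). Since $f$ is increasing, maximizing $a^{(2)}_i$ over $\|x\|_2 \le 1$ amounts to maximizing the linear form $\sum_{j=1}^{100} W^{(1)}_{ij} x_j$. By Cauchy–Schwarz,
$$\sum_{j=1}^{100} W^{(1)}_{ij} x_j \;\le\; \Bigl(\sum_{j=1}^{100} (W^{(1)}_{ij})^2\Bigr)^{1/2} \|x\|_2 \;\le\; \Bigl(\sum_{j=1}^{100} (W^{(1)}_{ij})^2\Bigr)^{1/2},$$
with equality when $x$ is proportional to the weight vector $(W^{(1)}_{i1}, \dots, W^{(1)}_{i,100})$ and $\|x\|_2 = 1$, i.e., when $x_j = W^{(1)}_{ij} / \sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}$, as claimed.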

Example (cont.)

100 hidden units on 10 × 10 pixel inputs: the different hidden units have learned to detect edges at different positions and orientations in the image.

Using convolutions

Idea: Certain signals are stationary, i.e., their statistical properties do not change in space or time. For example, images often have similar statistical properties in different regions of space. That suggests that the features we learn at one part of an image can also be applied to other parts of the image. We can convolve the learned features with the larger image.

Example: 96 × 96 image.
Learn features on small 8 × 8 patches sampled randomly (e.g., using a sparse autoencoder).
Run the trained model through all 8 × 8 patches of the image to get the feature activations.

Source: UFLDL tutorial.
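
The sketch below illustrates this convolution step in plain R (it is not the UFLDL code): one learned 8 × 8 feature, represented by hypothetical weights W and bias b, is slid over every 8 × 8 patch of a random 96 × 96 "image", producing an 89 × 89 map of feature activations.

# Convolving one learned 8 x 8 feature with a 96 x 96 image (illustrative values)
sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(1)
img <- matrix(runif(96 * 96), 96, 96)        # stand-in for a 96 x 96 image
W   <- matrix(rnorm(64, sd = 0.1), 8, 8)     # weights of one hidden unit (8 x 8 patch)
b   <- 0.1                                   # its bias

n <- 96 - 8 + 1                              # 89 valid patch positions per dimension
conv <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    patch <- img[i:(i + 7), j:(j + 7)]       # the 8 x 8 patch at position (i, j)
    conv[i, j] <- sigmoid(sum(W * patch) + b)  # feature activation on this patch
  }
}
dim(conv)                                    # 89 x 89 feature map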

Pooling features

One can also pool the features obtained via convolution. For example, to describe a large image, one natural approach is to aggregate statistics of these features at various locations, e.g., compute the mean, max, etc. over different regions.

Pooling can lead to more robust features, and to invariant features. For example, if the pooling regions are contiguous, then the pooling units will be translation invariant, i.e., they won't change much if objects in the image undergo a (small) translation.
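
Continuing the hypothetical example above, here is a minimal R sketch of mean pooling over non-overlapping regions of a convolved feature map; the map, the block size, and the helper pool_mean are made up for illustration (replace mean by max for max pooling).

# Mean pooling over non-overlapping blocks of a feature map (illustrative)
pool_mean <- function(m, size) {
  nr <- nrow(m) %/% size
  nc <- ncol(m) %/% size
  out <- matrix(0, nr, nc)
  for (i in 1:nr) {
    for (j in 1:nc) {
      block <- m[((i - 1) * size + 1):(i * size),
                 ((j - 1) * size + 1):(j * size)]
      out[i, j] <- mean(block)               # aggregate statistic over the region
    }
  }
  out
}

feat <- matrix(runif(88 * 88), 88, 88)       # stand-in for a convolved feature map
pooled <- pool_mean(feat, 22)                # 4 x 4 grid of pooled features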

R

We will use the package h2o to train neural networks with R. To get you started, we will construct a neural network with one hidden layer containing 2 neurons to learn the XOR function:

x1 x2 | XOR
 0  0 |  0
 0  1 |  1
 1  0 |  1
 1  1 |  0

# Initialize h2o
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "2g")
h2o.removeAll()   # in case the cluster was already running

# Construct the XOR function
X = t(matrix(c(0, 0, 0, 1, 1, 0, 1, 1), 2, 4))
y = matrix(c(-1, 1, 1, -1), 4)
train = as.h2o(cbind(X, y))

R (cont.)

Training the model:

# Train the model
model <- h2o.deeplearning(x = names(train)[1:2],
                          y = names(train)[3],
                          training_frame = train,
                          activation = "Tanh",
                          hidden = c(2),
                          input_dropout_ratio = 0.0,
                          l1 = 0,
                          epochs = 10000)

# Test the model
h2o.predict(model, train)

Some options you may want to use when building more complicated models for data:
activation = "RectifierWithDropout"
input_dropout_ratio = 0.2
l1 = 1e-5
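
For instance, those options might be combined as in the sketch below when fitting a larger model; the two hidden layers of 200 units, the 50 epochs, and the reuse of the frame train as a placeholder are assumptions for illustration, not part of the lab.

# Hypothetical larger model using the suggested options
x_cols <- names(train)[1:2]    # predictor columns (placeholder)
y_col  <- names(train)[3]      # response column (placeholder)
model2 <- h2o.deeplearning(x = x_cols,
                           y = y_col,
                           training_frame = train,
                           activation = "RectifierWithDropout",
                           hidden = c(200, 200),
                           input_dropout_ratio = 0.2,
                           l1 = 1e-5,
                           epochs = 50)
h2o.predict(model2, train)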